GDFSIC: A Few-Shot Image Classification Framework Integrating Global–Local Attention with Distance–Direction Similarity

Geng, Biao; Pu, Liping

doi:10.3390/mca31020038

Open AccessArticle

GDFSIC: A Few-Shot Image Classification Framework Integrating Global–Local Attention with Distance–Direction Similarity

by

Biao Geng

^1,2,3,* and

Liping Pu

²

¹

School of Computer Science and Technology/School of Artificial Intelligence, China University of Mining and Technology, Xuzhou 221116, China

²

School of Health Management, Suzhou Vocational Health College, Suzhou 215009, China

³

Digitization of Mine, Engineering Research Center of Ministry of Education, Xuzhou 221116, China

^*

Author to whom correspondence should be addressed.

Math. Comput. Appl. 2026, 31(2), 38; https://doi.org/10.3390/mca31020038

Submission received: 6 January 2026 / Revised: 26 February 2026 / Accepted: 1 March 2026 / Published: 3 March 2026

Download

Browse Figures

Versions Notes

Abstract

For few-shot image classification tasks, the recognition accuracy of existing models remains limited due to the inherent complexity of the few-shot learning setting. To address this challenge, this paper proposes a few-shot image classification approach, termed GDFSIC, which integrates a Global–Local Channel Attention Module (GLCAM) with a graph-propagation-based Distance–Direction Similarity Earth Mover’s Distance (DDS-EMD). The GLCAM module is incorporated into the feature extractor to enhance focus on discriminative regions and increase model attention to critical feature areas. Furthermore, a Distance–Direction Similarity (DDS) metric is introduced as a more effective distance criterion for capturing subtle differences in latent spatial representations. The proposed method is evaluated on four widely used few-shot image classification benchmarks: CIFAR-FS, CUB-200-2011, mini-ImageNet, and Tiered-ImageNet. Experimental results demonstrate that our approach achieves a clear competitive advantage in classification accuracy across these datasets. Ablation studies and further analyses confirm the effectiveness of each component of the proposed framework.

Keywords:

few-shot image classification; channel attention; metric learning; graph propagation; DDS-EMD distance function; deep neural networks

1. Introduction

In conventional few-shot image classification (FSIC), a widely adopted paradigm involves processing input images through a feature extractor to obtain global feature vectors representing holistic image information. A metric function is then employed to compute similarity, typically based on distances between these global vectors, to determine the image category. While capable of capturing overall image characteristics, this approach often neglects highly discriminative local feature information embedded within specific image regions. The indiscriminate use of global features risks losing critical local details or introducing interference from irrelevant areas, thereby limiting classification accuracy.

Although existing few-shot learning approaches have made significant progress in feature extraction and relation matching, most remain confined to isolated optimization of either representation or the metric process, failing to systematically integrate task-adaptive representation, structured relational reasoning, and iterative metric optimization. For instance, while DeepEMD achieves fine-grained local matching via optimal transport, its attention mechanism operates only during the feature alignment stage, lacking explicit modeling of the global topological relationships among samples. Graph-based methods (e.g., DPGN, RPMN) can mine sample correlations through graph propagation but often rely on fixed metric spaces or heuristic edge construction, limiting their adaptability to complex cross-domain relationships.

To enhance feature representation, we integrate a GLCAM into the encoder. This module assesses the saliency of features from a channel-attention perspective, enabling the model to concentrate on important features within the input image while suppressing noise and redundant information. An adaptive fusion strategy is adopted to enhance the complementarity between global and local attention features. By dynamically adjusting their respective weights, the model can flexibly allocate attention according to different input data, thereby better adapting to diverse complex scenarios and task requirements. This design improves the model’s adaptability to novel tasks, promoting higher accuracy and superior generalization in few-shot classification.

For similarity measurement, we propose a DDS-EMD metric. It operates by calculating the divergence between two distributions. By assigning adaptive weights across categories, this metric demonstrates superior discriminative capability for image similarity compared to commonly used Euclidean or Marginal distances, effectively improving model accuracy. The proposed method is validated on four standard few-shot image classification benchmarks.

The primary contributions of this work are threefold:

(1) GLCAM: To improve feature extraction and obtain more representative embeddings, we integrate the GLCAM into the network. This module learns and captures relationships between different spatial locations within the image along the channel dimension, highlighting key regions and suppressing irrelevant information. This mechanism directs the model’s focus more effectively toward the target, enhancing the accuracy and robustness of feature representation.

(2) Improved Similarity Metric (DDS-EMD): DeepEMD, a state-of-the-art FSIC method, employs the Earth Mover’s Distance (EMD) to measure feature similarity. However, it incurs high computational cost, and its accuracy is constrained by an optimal transport cost based solely on cosine similarity, which captures only directional differences. To mitigate these limitations, we propose the DDS-EMD. This more efficient metric significantly reduces computational overhead while effectively improving classification accuracy. Remarkably, when the number of patches is 16, DDS-EMD is competitive against a DeepEMD model that samples 25 patches, which translates to a reduction in computational effort ≈36%.

(3) Comprehensive Experimental Validation: Extensive experiments on four mainstream FSIC datasets (CIFAR-FS, CUB-200-2011, mini-ImageNet, and tiered-ImageNet) demonstrate that our proposed GDFSIC framework achieves highly competitive performance, surpassing current leading methods in model accuracy.

2. Related Works

Few-shot learning has been extensively researched in recent years, yielding numerous effective methods for various scenarios. Koch et al. [1] introduced a Siamese neural network that performs feature similarity comparisons for input sample pairs via weight sharing, combining it with metric learning for few-shot face recognition. Vinyals et al. [2] proposed a matching network that leverages metric learning and an attention mechanism to compute similarity between query and support samples, generating predicted labels through a weighted average, thus enabling efficient few-shot classification without fine-tuning. Lin et al. [3] presented prototypical networks (ProtoNet), a seminal few-shot learning approach that computes a prototype for each class as the mean of its support samples in the embedding space and classifies query samples based on their distances to these prototypes. Zuo et al. [4] developed a relation network that extracts features from support and query samples using convolutional neural networks and computes relation scores for classification. These conventional metric learning methods typically rely on image-level features to represent images. However, Wu et al. [5] demonstrated in their study on deep nearest-neighbor networks that compact image-level representations may lose significant discriminative information. In contrast, local descriptors can better retain and express such discriminative details, enabling nearest-neighbor classification by measuring the relationship between query samples and k local descriptors from each support class.

While existing metric learning methods emphasize feature description and relationship measurement, they face limitations when processing embedding vectors under challenging conditions, such as cluttered backgrounds, small or occluded foreground objects, or partial object visibility [6]. A primary reason is that the relatively simple CNN structures in many metric learning approaches struggle to learn effective target features when foreground saliency is low [7]. To address this, Koch et al. [1] incorporated an object localization mechanism to identify and focus on key image regions during training, thereby enhancing classification performance in real-world scenarios. Masana et al. [8] proposed a dual attention network (DAN) that integrates channel and spatial attention mechanisms to improve fine-grained classification in few-shot learning, enabling more effective capture and utilization of important features within complex backgrounds.

The effectiveness of attention mechanisms in handling complex scenes has led to their broad adoption in few-shot learning [9]. Tao et al. [10] introduced the convolutional block attention module (CBAM), which combines channel and spatial attention to enhance feature representation across multiple visual tasks. Zhang et al. [11] proposed a hybrid attention network based on prototype networks for text classification, combining instance-level and feature-level attention to highlight key samples and features, further boosting performance under complex conditions.

With the advancement of graph neural networks (GNNs), their ability to model relationships through graph structures has been exploited to capture high inter-class similarity and low intra-class dissimilarity. Xiao et al. [12] proposed an edge-labeling graph neural network (EGNN), which formulates few-shot tasks on graphs and iteratively refines edge labels by assessing similarity between samples, thereby obtaining prototypes for classification. Li et al. [13] introduced a GNN model combining transductive and inductive learning, quantifying the contribution of node-entity relationships and intrinsic attributes to mitigate over-smoothing.

Graph propagation algorithms operate iteratively on graph structures, propagating label information from annotated nodes to unlabeled ones. GNNs are particularly suitable for non-Euclidean graph data, enabling relational reasoning between instances by treating each sample as a node and modeling their interactions via edges. Leveraging the strength of GNNs in representing few-shot relations, Garcia and Bruna [14] first applied graph neural networks to few-shot image classification in 2018, proposing a framework where support and query features serve as nodes, and edges represent relational measures, enabling classification through iterative graph updates that capture deeper inter- and intra-class relationships.

Recent work has also focused on improving distance metrics between support and query features using techniques such as covariance pooling, divergence between multivariate Gaussian distributions, optimal transport over discrete distributions, bidirectional random walks, and Brownian distance covariance [15,16,17,18,19]. These methods enhance performance by improving feature expressiveness or exploiting class distribution information. Typically, they measure distance via inner product, Euclidean distance, or cosine similarity between final feature vectors. In contrast, our proposed DDS differs fundamentally by incorporating both distance and directional information.

Building on these insights, this paper presents the GDFSIC method, which integrates a GLCAM and a DDS-EMD metric to enhance discriminative capability and classification performance in complex visual environments. The GLCAM module is embedded into the feature encoder to improve feature extraction by focusing on key information and modeling interactions between support and query images. Coupled with an adaptive fusion strategy, it enables the model to concentrate on core objects effectively. Furthermore, a graph propagation module establishes node connections via graph convolutional networks, creating a well-structured embedding space through weight redistribution. This design improves adaptability to novel tasks, yielding higher accuracy and stronger generalization in few-shot classification scenarios.

As shown in Table 1, compared to the most closely related methods, our approach is the first to achieve architectural deep integration of attention, image propagation, and metric learning, rather than a modular assembly. This results in significantly stronger cohesion and robustness in cross-dataset generalization.

3. Methodology

3.1. Fundamentals of Few-Shot Learning

3.1.1. Few-Shot Learning

Few-shot learning (FSL) is a machine learning paradigm designed to enable models to recognize new categories from only a limited number of labeled examples [23]. In contrast, conventional deep learning models typically require large-scale, annotated datasets to achieve high performance, which is a process that is both data-intensive and computationally expensive. However, in many practical computer vision applications, such as rare object identification or specialized medical diagnosis, acquiring sufficient training samples is often infeasible. When applied to such data-scarce scenarios, traditional models tend to overfit and generalize poorly. FSL addresses this fundamental limitation by developing methods that can learn effectively from very few samples, aiming to match the performance achievable with abundant data.

3.1.2. The $N$ -Way $K$ -Shot Problem

The standard evaluation protocol for FSL is the

N

-way

K

-shot classification task, commonly implemented within a meta-learning framework [24]. This framework consists of two phases: meta-training and meta-testing. During meta-training, the dataset is episodically sampled to create a series of meta-tasks. Each meta-task is constructed by randomly selecting

N

distinct classes from the dataset. From each of these

N

classes,

K

labeled examples are drawn to form the support set S. Subsequently, a disjoint set of samples from the same

N

classes are selected to form the query set

Q

. This process can be formally described as follows:

S = {\{(x_{i}, y_{i})\}}_{i = 1}^{N \times K}

(1)

Q = {\{(x_{i}, y_{i})\}}_{j = 1}^{M}

(2)

where

x

denotes an image sample,

y

its corresponding label, and

M

is the number of query samples.

The core objective of the

N

-way

K

-shot problem is for the model to learn a discriminative strategy from the

N \times K

examples in the support set to accurately classify the samples in the query set. During meta-testing, the model faces entirely novel classes not seen during meta-training, testing its ability to rapidly adapt based on the learning strategy acquired.

3.1.3. Meta-Learning

Meta-learning, or “learning to learn”, provides a powerful framework for FSL [25]. It equips models with the ability to quickly adapt to new tasks by extracting transferable knowledge from a distribution of related tasks encountered during meta-training. This is typically achieved through episodic training, where each episode mimics a few-shot learning task as described above. By training over a vast number of such diverse episodes, the model learns a general-purpose initialization or a set of learning rules that can be fine-tuned with minimal adjustment on a novel task.

Meta-learning methods for FSL can be broadly categorized into three groups: (1) optimization-based methods, which learn effective model initializations for fast adaptation (e.g., MAML); (2) metric-based methods, which learn a transferable similarity metric in a shared embedding space (e.g., prototypical networks); and (3) model-based methods, which employ architectures like memory-augmented networks to rapidly assimilate new information.

3.1.4. Metric-Based Meta-Learning

Metric-based meta-learning is one of the most prevalent and effective approaches for FSL [26]. The core idea is to learn a non-linear embedding function that projects input images into a feature space where simple distance metrics (e.g., Euclidean or cosine distance) can effectively measure similarity. The learning process involves two key components: a feature embedding network that extracts discriminative representations, and a metric function that compares these representations.

During meta-training, the model learns a universal embedding space where samples from the same class are clustered closely together, while samples from different classes are well-separated. For a novel few-shot task, the distance between a query sample and the prototypes (e.g., class centroids) of the support set in this space directly determines its classification. This paradigm separates the challenge of learning a good feature representation from the fast, non-parametric inference on new tasks.

Prominent examples of metric-based methods include matching networks (which use an attention mechanism over the support set), prototypical networks (which classify based on distance to class mean prototypes), and relation networks (which learn a deep similarity metric). Despite their success, challenges remain, such as designing more robust metrics for complex data distributions and improving feature embedding to better handle high inter-class similarity and intra-class variance inherent in few-shot datasets.

3.2. Methodological Structure

The overall architecture of the proposed GDFSIC framework is illustrated in Figure 1. Our approach is built upon three core components: a GLCAM, an image propagation module, and a metric-based classification module.

The GLCAM enhances feature representation by evaluating the saliency of both global and local features from a channel-attention perspective. This enables the model to focus on semantically important regions within the input image while suppressing noise and redundant information. An adaptive fusion strategy is employed to dynamically balance the contributions of global and local attention features, allowing the model to adjust its focus according to different input characteristics.

The graph propagation module constructs relational connections between samples using a graph convolutional network, refining node representations through iterative feature aggregation and weight redistribution to form a well-structured embedding space. Finally, the metric classification module computes similarity scores in this embedding space between support and query samples, enabling accurate category prediction within a few-shot learning framework.

3.3. GLCAM

The channel attention mechanism has proven effective in enhancing the representational capacity of deep convolutional neural networks (CNNs) [27]. A seminal work in this area, the squeeze-and-excitation network (SENet), learns to recalibrate channel-wise feature responses by modeling interdependencies between channels [28]. Typically, a squeeze-and-excitation (SE) block first applies global average pooling to aggregate spatial information for each channel [29]. Subsequently, two fully connected (FC) layers with a non-linear activation are used to generate channel-wise weights via a sigmoid function. While this design, particularly the use of FC layers with dimensionality reduction to control model complexity, has been widely adopted, the reduction operation may introduce undesirable side effects and inefficiently capture dependencies across all channels.

While many subsequent methods have developed more sophisticated attention modules for improved performance, they often incur significant computational overhead. To address this trade-off, Chen et al. [30] proposed the efficient channel attention (ECA) module, which generates channel attention via a fast one-dimensional convolution with an adaptively determined kernel size. The ECA module is lightweight and has been successfully integrated as a plug-and-play component to boost the performance of various CNN architectures.

Despite their effectiveness, processing features with a single, global channel attention mechanism may lack the granularity needed to characterize fine-grained local details. To overcome this limitation and further advance model performance, we propose a GLCAM. The module employs two parallel branches to extract globally and locally enhanced features, respectively, and incorporates an adaptive fusion mechanism to dynamically combine the two feature streams. This design aims to better integrate global and local attentional information, thereby improving the model’s ability to capture fine details and enrich feature representation.

Specifically, for local feature refinement, we calculate a correlation score as the weight

w_{i}

for each local feature vector by computing its dot product with a prototype representation of its corresponding local patch group, as formulated below:

w_{i} = m a x \{{u_{i}}^{T} \cdot v_{j}, 0\}

(3)

where

u_{i}

denotes the

i

-th local feature vector to be weighted, and

v_{j}

represents the prototype feature of the

j

-th local patch group. The

m a x (\cdot)

function ensures non-negative weighting. This design assigns higher weights to local patches that contain more discriminative information and lower weights to less informative patches. All computed weights are then normalized to maintain a consistent total weighting across the structure.

GLCAM is a generic module that can be flexibly integrated into existing deep learning frameworks for end-to-end training. As illustrated in Figure 2, the module comprises three components: a global feature attention branch, a local feature attention branch, and an adaptive fusion unit.

The prototype metric module measures the similarity between nodes using Euclidean distance. Specifically, it computes the L2 norm between pairs of node representations, where a smaller L2 distance indicates higher feature similarity and a greater likelihood that the two nodes belong to the same category.

In the feature extraction process, a feature map

F \in R^{C \times H \times W}

is output after each convolutional layer, after which the feature map

F

is input into two parallel branches, respectively. In the global feature attention branch, the feature map

F

is directly processed by the ECA module to obtain the global feature enhancement feature map

F_{G} \in R^{C \times H \times W}

. In the local feature attention branch, the feature map

F

is first divided into

N

non-overlapping region blocks

\{F^{1}, F^{2} {, \dots, F}^{N}\}

.

{\{F^{p}\}}_{p = 1}^{N} = {S p l i t}^{S} (F),

(4)

where

{S p l i t}^{S} (\cdot)

denotes the spatial dimension division operation.

Each region block is processed separately using the ECA module, and finally the

N

attention-enhanced region blocks

{\{F_{L}^{P}\}}_{p = 1}^{N}

obtained after the channel attention processing in parallel are spliced in spatial dimensions to obtain the local feature-enhanced feature maps

F_{L} \in R^{C \times H \times W}

, thus completing the extraction of the local attentional features.

F_{L} = {C o n c a t}^{S} \{F_{L}^{1}, F_{L}^{2}, \dots, F_{L}^{N}\},

(5)

where

{C o n c a t}^{S} (\cdot)

denotes the spatial dimension feature splicing operation.

In the adaptive fusion part, the global feature enhancement map

F_{G}

and local feature enhancement map

F_{L}

, which have been processed by the ECA module, are first spliced in the channel dimension. Then global pooling is used, and then a tensor of length

2 C

is output after a

1 \times 1

convolution block, and then the result is mapped into the

[0, 1]

interval using Sigmoid, after which the result is divided into two copies of weights

w_{G}

and

w_{L}

of length

C

.

w_{G}, w_{L} = s i g m o i d ({C o n v}_{1 \times 1} (G A P ({C o n c a t}^{C} ({F_{G}, F}_{L})))),

(6)

where

{C o n c a t}^{C} (\cdot)

denotes the channel dimension splicing operation and GAP is the global average pooling layer.

Finally, the weights are elementwise multiplied and then summed with the corresponding

F_{G}

and

F_{L}

, respectively, to obtain the final result

F^{’} \in R^{C \times H \times W}

.

F^{’} = F_{G} \cdot w_{G} + F_{L} \cdot w_{L},

(7)

During the experiments, the GLCAM module is embedded into the feature extractor to enhance the model’s feature representation capability. Since convolutional layers at different depths capture distinct types of features, we place the attention module after several selected convolutional layers. This design enables the model to adaptively focus on semantically important features at multiple levels. Experimental results confirm the effectiveness of this configuration, showing that feature information aggregated from different attention branches contributes positively to overall network performance. A structural comparison of the ResNet-12 backbone before and after integrating GLCAM is illustrated in Figure 3.

While dropout is a common regularization technique, it is typically applied to fully connected layers and is less effective for convolutional layers. Its random dropping of individual units ignores the strong spatial correlation present in feature maps; adjacent activations can contain redundant information, meaning the network may still propagate similar information even after dropout, reducing its regularization efficacy. To address this, we employ DropBlock, a structured form of dropout specifically designed for convolutional networks [31]. Instead of dropping random individual units, DropBlock removes contiguous blocks of units from a feature map. This approach more effectively disrupts local spatial structures and forces the network to learn more robust and distributed representations. In our implementation, DropBlock is applied not only within the convolutional layers but also to the skip connection pathways, further enhancing regularization.

3.4. Image Propagation Module

The graph neural network module aggregates node features within the graph and embeds the entire graph into a new feature space, thereby enhancing the representation of neighborhood information. In this work, we employ a graph convolutional network propagation mechanism to propagate both sample features and graph-structured information from the support set into a new embedding space, facilitating subsequent metric computation. The structure of the propagation network is illustrated in Figure 4.

To enable continuous propagation of graph information, the convolution process is performed over multiple iterations. This aggregation of neighborhood node information within the same layer can be expressed as follows:

e_{u}^{(k + 1)} = A G G (e_{u}^{(k)}, \{e_{i}^{(k)} : i \in N_{u}\}),

(8)

where

e_{u}^{(k)}

and

e_{i}^{(k)}

are the embeddings of the support set

u

and the query set

i

after propagation in the

k

th layer, respectively;

N_{u}

is the set of samples of the query set associated with the support set

u

; and

A G G (\cdot)

is an aggregation function used to obtain the feature embedding representations of the target node and its neighbors in the

k

th layer. The propagation formula of the node in each layer of the graph is expressed as follows:

H^{(k + 1)} = θ ({\bar{D}}^{\frac{1}{2}} \bar{A} {\bar{D}}^{\frac{1}{2}} H^{(k)} W^{(k)}),

(9)

where,

H^{(k)}

is the feature of the node at the

k

th layer,

\bar{A}

is the adjacency matrix of the graph,

\bar{D}

is the degree matrix,

W^{(k)}

is the weight at the

k

th layer, and

θ

is the nonlinear transformation. The final output is the feature

f_{θ} (\cdot)

represented as well as possible in the feature space.

3.5. Measurement Module

Metric-based few-shot learning methods aim to learn an effective distance metric to measure the similarity between image features for accurately classifying unseen query samples. While existing efforts have largely focused on improving feature representations, the design of the distance metric itself and its impact on few-shot task performance have often been overlooked. To address this, we propose a DDS metric, which jointly considers both magnitude and orientation when measuring the similarity between two feature vectors, as illustrated in Figure 5. We define the DDS similarity metric as follows:

D D S (x, y) ≜ (1 - γ) \frac{1}{2} {‖x - y‖}_{2}^{2} + γ c s (x, y),

(10)

where

x, y

are two eigenvectors and

γ

is a hyperparameter that weighs the

L 2

distance

{‖x - y‖}_{2}^{2}

against the cosine similarity

c s (x, y)

.

In the pre-training stage, our proposed metric module first learns to embed a network on the training set and uses the features extracted by the backbone network to locate prominent object maps. At various stages of meta-training and meta-testing, we use learned image patches to select and extract multiple feature vectors from images. To introduce diversity in sampling, we use a beta distribution

β (ϵ, ϵ)

with a single random seed

ϵ

as a hyperparameter to generate clipping centers and additionally sample the height and width of patches with uniform distribution. It is worth noting that the beta distribution with

ϵ = 1

corresponds to random distribution sampling. The proposed DDS-EMD distance metric evaluates the transmission cost of EMD and combines the similarity between distance and direction vectors.

4. Experimental Results and Analysis

This section presents the experimental evaluation of the proposed GDFSIC method. Experiments are conducted on four publicly available FSIC datasets: mini-ImageNet, tiered-ImageNet, CUB-200-2011, and CIFAR-FS [32]. We first describe the experimental environment and dataset configurations, followed by the evaluation metrics, parameter settings, and implementation details. We then perform a comparative analysis with current mainstream few-shot image classification algorithms and baseline models. Subsequently, ablation studies are presented to validate the contribution of each component, along with a comparison against other metric-based classifiers. Finally, we examine the impact of the attention mechanism on model performance.

4.1. Experimental Environment and Dataset Setup

All experiments are conducted on a system running Ubuntu 20.04 with Python 3.8 and PyTorch 1.13. The hardware includes an NVIDIA GeForce RTX 3090 GPU with 24 GB of memory. The manufacturer is NVIDIA Corporation, headquartered in Santa Clara, CA, United States. Standard data augmentation techniques are applied during training, including random resizing to

128 \times 128

pixels, random cropping to

64 \times 64

pixels, and random horizontal/vertical flipping. Optimization is performed using stochastic gradient descent (SGD) [33].

To evaluate the effectiveness of the proposed algorithm, few-shot image classification experiments are carried out on four benchmark datasets. Table 2 summarizes key statistics of these datasets, including the number of images, number of classes, and image resolution, following the conventions established in prior work.

4.2. Evaluation Criteria and Parameterization

The classification accuracy of the algorithm was evaluated using 5-way 1-shot and 5-way 5-shot; 5-way 1-shot means that 5 categories were randomly selected and 1 image was chosen as a sample in each category; 5-way 5-shot means that 5 categories were randomly selected and 5 images were chosen as samples in each category. At the same time, the classification accuracy is used to evaluate the performance of the model, which is measured in %; the higher the value, the better the performance of the model and the better the classification effect. For the problem of FSIC, the specific definition of accuracy refers to the proportion of correctly classified samples in each validation batch to the total number of validation samples, as shown in Equation (11):

A c c u r a c y = \frac{T P + T N}{T P + F N + F P + T N} \times 100 %,

(11)

where

T P

represents the number of positive samples classified as positive,

F N

represents the number of positive samples classified as negative,

F P

represents the number of negative samples classified as positive, and

T N

represents the number of negative samples classified as negative.

In the scenario of FSIC tasks, uneven distribution of data samples in each test batch may lead to significant fluctuations in model performance and poor accuracy stability. Therefore, the experimental part of this article further introduces confidence intervals, aiming to examine the performance of the model from a more comprehensive perspective and explore in depth the credibility and robustness of its predictive accuracy. The confidence interval can reflect the level of confidence in the evaluation results of the model, usually evaluated at a 95% confidence level. The confidence interval represents the estimated range of the model’s true performance at a given confidence level. If the accuracy of the model is 0.80 and its 95% confidence interval is [0.75, 0.85], it indicates a 95% confidence that the model’s true accuracy is within this range. The calculation of confidence intervals is generally based on the assumptions of central limit theorem and normal distribution. Usually, the mean and standard deviation of the sample are used to estimate the interval of true performance. Assuming a sample set

x = \{x_{1}, x_{2}, \dots, x_{n}\}

composed of accuracy obtained from multiple tests, where

n

is the number of tests. The confidence interval calculation is shown in Equation (12).

c o n f = (\bar{x} - z \cdot \frac{δ}{\sqrt{n}}, \bar{x} + z \cdot \frac{δ}{\sqrt{n}}),

(12)

where

\bar{x}

represents the mean accuracy,

z

represents the standard normal distribution, and

δ

represents the overall standard deviation of accuracy.

The settings of each parameter for the training of the algorithm in this study are shown in Table 3.

4.3. Experimental Flow

The steps of the proposed method are shown in Algorithm 1. To test the effectiveness of the method, two other different image information acquisition networks, Conv-4 and ResNet-12, are used in this paper in addition to the WRN network [34].

(1) Conv-4: A 4-layer convolutional network, where each block consists of a convolution, batch normalization, ReLU activation, and a 2 × 2 max-pooling layer.

(2) ResNet-12: A 12-layer residual network comprising four residual blocks, each containing three convolutional layers with shortcut connections. Algorithm 1. Few-shot image classification algorithm based on GDFSIC.

Algorithm 1. In a Minibatch,

N_{C}

is the number of all categories including the support set and the query set,

N_{S}

is the number of samples per category in the support set,

N_{Q}

is the number of samples per category in the query set,

S_{K}

is the set of samples per category in the support set, and

Q_{K}

is the set of samples in the query set;

J

is the loss function,

f_{θ} (\cdot)

represents the feature extraction network, and

d

represents the DDS-EMD.

Input: Training set

D_{t r} = \{(x_{1}, y_{1}), (x_{2}, y_{2}) \dots, (x_{N}, y_{N}), y_{i} \in \{1,2, \dots, K\}\}

,

x_{i}

represents the i-th sample feature,

y_{i}

represents the label of the i-th sample feature, and

\bar{x}

is a sample in the query set.
Output: J.
1: Initialize network parameters
2: Preliminary feature embedding space

f_{θ} (x)

is obtained through the GLCAM processing
3: Obtain the optimized feature embedding space

{\hat{f}}_{θ} (x)

through the Image propagation module processing
4: Calculate the

C_{k} = {(N_{S})}^{- 1} \times \sum_{(x_{i}, y_{i}) \in S_{K}} {\hat{f}}_{θ} (x_{i})

;
5: Initialization:

J \leftarrow 0

;
6: For k in

\{1,2, \dots, N_{C}\}

do
7: For

(\bar{x}, y)

in

Q_{K}

do
8: Use DDS-EMD to calculate the similarity

d ({\hat{f}}_{θ} (\bar{x}), C_{k})

between the support set and the query set samples
9:

J \leftarrow J + d ({\hat{f}}_{θ} (\bar{x}), C_{k}) \times {(N_{C} N_{Q})}^{- 1} + l o g \sum_{k} e x p (- d ({\hat{f}}_{θ} (\bar{x}), C_{k}))

10: End for
11: End for

4.4. Experimental Results and Comparative Analysis

In order to fully verify the effectiveness of the algorithm in this paper, Relation Nets, Matching Nets, ProtoNet, SNAIL, TADAM, MAML, R2D2, Variational FSL, Fine-tuning, ADAResNet, MetaOptNet, CTM, LEO-trainval, RFS-simple, RFS-distill, Boosting, and DeepEMD are used as a variety of algorithms as comparative algorithms. The test results of the comparison algorithms and the algorithms of this paper that identify the classification accuracy in both settings are shown in Table 4.

As shown in Table 4, the proposed algorithm achieves the highest classification accuracy on the mini-ImageNet dataset, reaching 67.19% in the 5-way 1-shot setting and 84.94% in the 5-way 5-shot setting. Compared to the metric-based matching networks, our method improves accuracy by 23.57% and 29.63% under the respective settings. Notably, against DeepEMD, which is the baseline of our approach, the proposed model outperforms by 1.22% (1-shot) and 2.49% (5-shot). These results demonstrate that integrating a channel attention mechanism with an EMD-based metric function effectively enhances classification accuracy in the few-shot regime, yielding a substantial performance gain.

We further evaluate the proposed algorithm on the tiered-ImageNet dataset under the same 5-way K-shot protocol, where

k = 1

or 5. The classification accuracy of our method and several compared approaches is summarized in Table 5.

As shown in Table 5, the proposed algorithm achieves a classification accuracy of 73.56% in the 5-way 1-shot setting and 88.59% in the 5-way 5-shot setting on the tiered-ImageNet dataset. Compared to Relation Nets, our method improves accuracy by 19.12% and 17.22% under the respective settings. Furthermore, it exceeds the DeepEMD baseline by 2.34% (1-shot) and 2.51% (5-shot). These results on tiered-ImageNet further validate the effectiveness of the proposed approach.

We also evaluate the stability of our model on the fine-grained CUB-200-2011 dataset under the same 5-way 1-shot and 5-way 5-shot protocols. The corresponding experimental results are presented in Table 6.

As shown in Table 6, the proposed algorithm also achieves the highest classification accuracy on the CUB-200-2011 dataset, reaching 48.20% and 65.58% in the 5-way 1-shot and 5-way 5-shot settings, respectively. Compared to the meta-learning-based ProtoNet, our method improves accuracy by 12.84% (1-shot) and 16.97% (5-shot). It also outperforms the DeepEMD baseline by 1.72% (1-shot) and 2.36% (5-shot). These results further confirm the effectiveness of our network design on fine-grained recognition tasks.

It is worth noting that the absolute accuracy on CUB-200-2011 is lower than that on mini-ImageNet and tiered-ImageNet, which can be attributed to the lower image resolution and higher intra-class variance in this dataset, presenting a greater challenge for few-shot classification. Nevertheless, the performance gains over baseline methods on this dataset are more pronounced, underscoring the robustness and superiority of the proposed approach.

Finally, we evaluate our algorithm on the CIFAR-FS dataset under both 5-way 1-shot and 5-way 5-shot settings. The corresponding results are presented in Table 7.

As shown in Table 7, the proposed algorithm achieves the highest classification accuracy on the CIFAR-FS dataset, with 76.89% in the 5-way 1-shot setting and 88.81% in the 5-way 5-shot setting. Compared to the metric-based Relation Nets, our method surpasses them by 21.83% and 19.46% under the respective settings. Furthermore, it outperforms the baseline model DeepEMD by 1.20% (1-shot) and 2.12% (5-shot).

In summary, the proposed algorithm achieves the best classification performance across all four few-shot benchmark datasets. These results not only validate the effectiveness of our improvements over the baseline model but also comprehensively demonstrate the superiority of our approach compared to other mainstream methods.

4.5. Analysis of Ablation Experiments

To analyze the contribution of each component in the GDFSIC method, we conducted ablation experiments on the mini-ImageNet dataset. The results are summarized in Table 8.

Three configurations were evaluated as follows:

Baseline: the network without any attention module.
Baseline + GLCAM: adding only the global–local channel attention module.
Full model (GDFSIC): integrating both GLCAM and the graph propagation module.

Compared with the baseline, adding GLCAM alone improves accuracy by 12.79% (5-way 1-shot) and 5.73% (5-way 5-shot). The full model achieves 67.79% (1-shot) and 83.56% (5-shot), corresponding to gains of 13.23% and 6.44% over the baseline. The superior performance of the full model stems from the complementary roles of the two modules: GLCAM enhances feature discriminability by fusing global and local attention, enabling the model to focus on salient regions, while the graph propagation module aggregates neighborhood information and embeds the whole graph into a structured feature space, facilitating subsequent metric computation. When combined, these modules provide a more comprehensive and accurate representation of images, thereby boosting few-shot classification accuracy.

4.6. Comparison with Other Metric Classifiers

A key factor in the effectiveness of deep metric learning lies in the feature discrimination capability of the chosen metric. Prior inductive studies and recent empirical summaries suggest that classification performance generally improves when the metric operates on more discriminative, higher-level features. In this work, we primarily employ the DDS-EMD distance as the core metric.

To examine the functional differences among distance metrics, we compare cosine distance, Euclidean distance, and the proposed DDS-EMD under both 5-way 1-shot and 5-way 5-shot settings on two benchmark datasets. All experimental configurations remain consistent with those described earlier in the paper. The results of this comparative evaluation are presented in Table 9.

As shown in the results, the Euclidean distance significantly outperforms cosine distance in classification accuracy. Similarly, DDS-EMD yields superior accuracy compared to Euclidean distance. This improvement stems from the fact that, unlike conventional Euclidean distance, which measures point-wise feature differences, DDS-EMD performs a multidimensional comparison between feature distributions derived from the extraction network, thereby capturing richer structural relationships.

As Figure 6 illustrates, classification accuracies peak at

γ = 0.8

in the 1-shot task and when

γ = 0.6

in the 5-shot case. The importance values of both the

l 2

direction and distance between feature vectors extracted in both 1- and 5-shot tasks are thus different. Such differences can be attributed to the computation of the structured fully connected layers used in the baseline. Namely, the 5-shot case requires additional iterative steps to optimize for the features.

4.7. Comparative Experiments with Other Methods That Incorporate Attention Mechanisms

As shown in Figure 7, the proposed method also achieves superior performance on the mini-ImageNet dataset compared to other attention-based approaches. Our method outperforms RPMN—which likewise weights different image regions according to their importance—by 6.05% (5-way 1-shot) and 5.88% (5-way 5-shot). It surpasses PARN [50], which uses deformable convolution to localize target objects, by 4.18% (1-shot) and 3.68% (5-shot). Compared to DBRNFW [51] that integrates spatial and channel attention with global-to-local relational mapping, our method yields gains of 2.32% (1-shot) and 2.64% (5-shot). Against ABNet [52], which selects and re-weights local patch blocks, our approach improves accuracy by 1.28% (1-shot) and 3.21% (5-shot), and exceeds DbMRNT [53] by 0.58% (1-shot) and 1.04% (5-shot). These consistent gains can be attributed to our Global–Local attention fusion mechanism, which enhances the model’s ability to capture discriminative details and improves feature representation.

4.8. Network Complexity Analysis

To evaluate the lightweight design and practical applicability of the proposed GDFSIC, we analyze the number of parameters and computational complexity of the network using the model visualization tool TorchSummary. Computational complexity is defined as the number of floating-point operations (FLOPs) required for a single forward pass given one input image, reflecting the model’s time complexity.

GDFSIC is compared with several widely-used few-shot feature extraction backbones, including WRN-18, ResNet-12, and ResNet-18. The results are summarized in Table 10.

The results indicate that both the parameter count and computational complexity of GDFSIC are substantially lower than those of ResNet-12 and ResNet-18, leading to improved inference speed and better real-time performance in practical applications. Although GDFSIC exhibits a complexity and parameter scale comparable to WRN-18, it achieves significantly higher classification accuracy across all tested few-shot benchmarks.

4.9. Computational Effort

Table 11 shows the performance of DeepEMD and DDS-EMD on mini-ImageNet and tiered-ImageNet.

From Table 11, we can conclude that our method outperforms DeepEMD by a clear margin on two datasets. Remarkably, when the number of patches is 16, DDS-EMD is competitive against a DeepEMD model that samples 25 patches, which translates to a reduction in computational effort ≈36%.

4.10. Visualization Analysis

To visually illustrate the improvement in few-shot classification performance achieved by the proposed GDFSIC method, we employ t-Distributed Stochastic Neighbor Embedding (t-SNE) to project the extracted features onto a two-dimensional plane for visualization [54].

A test task is constructed by randomly sampling four classes from the CUB-200-2011 dataset. The features of the involved samples are reduced using t-SNE, and the resulting distributions are shown in Figure 8. The left panel illustrates the feature distribution obtained from the prototype network (baseline), while the right panel shows the distribution obtained from our GDFSIC framework. Samples belonging to the same class are depicted with the same color; the axes

x

and

y

represent the two dimensions of the t-SNE projection.

As shown in Figure 8a, the baseline model—which does not incorporate the proposed modules—struggles to separate different bird species in the CUB-200-2011 dataset. This difficulty arises because the dataset contains only bird images, all of which share general avian characteristics. The baseline’s relatively shallow feature extraction capability leads to poor generalization and confuses the characteristics of different birds, thereby hindering classification. In contrast, Figure 8b demonstrates that our model, augmented with GLCAM and adaptive feature aggregation across network layers, adjusts the sample distribution to produce clearer separation between classes. This enhanced discriminability reduces classification difficulty and confirms that the features learned by our model are more distinctive, contributing directly to improved classification accuracy.

5. Concluding Remarks

Metric-based few-shot image classification methods are primarily built upon two components: a feature extractor and a metric classifier. If the feature extractor is too simplistic, it may fail to capture deeper discriminative abstract features. Conversely, an overly complex extractor can easily lead to overfitting due to the limited number of available samples. Therefore, designing an appropriate feature extractor is crucial. When the extracted features satisfy the requirements of the metric classifier, performing classification using a metric that effectively matches these features will further enhance model accuracy.

This paper presents a novel few-shot learning framework whose core contribution is a tripartite synergistic mechanism of attention, image propagation, and metric learning. It fundamentally addresses the limitation of prior methods, which treated task-adaptive representation, structured relational reasoning, and iterative metric optimization in a decoupled or partially integrated manner. Compared to methods like DeepEMD, DPGN, and RPMN, our approach not only leverages attention to dynamically focus on critical feature regions but, more importantly, utilizes attention weights as priors for graph edge construction and propagation, thereby guiding the continuous optimization of the metric space in a task-driven manner. This deep fusion allows the model to simultaneously capture local matching details and global topological constraints among samples, significantly enhancing the model’s generalization capability and stability on unseen datasets.

Future work may explore the application of this synergistic mechanism in hierarchical few-shot or cross-modal learning scenarios and further investigate the balance between its interpretability and computational efficiency.

Author Contributions

Conceptualization, B.G. and L.P.; Data curation, B.G.; Formal analysis, B.G.; Funding acquisition, B.G. and L.P.; Methodology, B.G.; Project administration, B.G. and L.P.; Resources, B.G. and L.P.; Software, B.G.; Supervision, L.P.; Validation, B.G.; Visualization, B.G.; Writing—original draft, B.G.; Writing—review and editing, B.G. and L.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the key project of Ling-Yan plan of Suzhou Vocational Health College, grant number szwzy202311.

Data Availability Statement

The data that support the findings of this study are available from the Corresponding Author upon reasonable request.

Acknowledgments

The authors thank Guan Yuan from China University of Mining and Technology for his guidance on the design of the few-shot image classification model and the experimental design. We would like to extend our sincere appreciation to the editor and reviewers for their valuable feed-back and constructive comments, which significantly improved the quality of this paper. Their insightful suggestions have greatly contributed to the refinement of our work. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

GLCAM	Global–Local Channel Attention Module
EMD	Earth Mover’s Distance
DDS-EMD	Distance–Direction Similarity Earth Mover’s Distance
FSIC	Few-Shot Image Classification
ProtoNet	Prototypical Network
CNN	Convolutional Neural Network
DAN	Dual Attention Network
CBAM	Convolutional Block Attention Module
GNN	Graph Neural Network
EGNN	Edge-labeling Graph Neural Network
SE	Squeeze-and-Excitation
SENet	Squeeze-and-Excitation Network
FC	Fully-Connected
ECA	Efficient Channel Attention
SGD	Stochastic Gradient Descent
FLOPs	Floating-Point Operations
t-SNE	t-Distributed Stochastic Neighbor Embedding

References

Koch, G.; Zemel, R.; Salakhutdinov, R. Siamese neural networks for one-shot image recognition. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France; JMLR.org: Brookline, MA, USA, 2015; Volume 37, pp. 2041–2049. [Google Scholar]
Vinyals, O.; Blundell, C.; Lillicrap, T.; Wierstra, D. Matching networks for one shot learning. In Proceedings of the 30th International Conference on Neural Information Processing Systems, Barcelona, Spain; Curran Associates Inc.: Red Hook, NY, USA, 2016; pp. 3637–3645. [Google Scholar]
Lin, X.X.; Li, Z.; Zhang, P.; Liu, L.C.; Zhou, C.; Wang, B. Structure-aware prototypical neural process for few-shot graph classification. IEEE Trans. Neural Netw. Learn. Syst. 2022, 35, 4607–4621. [Google Scholar] [CrossRef]
Zuo, Y.; Chen, Z.; Feng, J.; Fan, Y. Federated learning and optimization for few-shot image classification. Comput. Mater. Contin. 2025, 82, 4649–4667. [Google Scholar] [CrossRef]
Wu, Z.; Peng, C. Few-shot image classification for defect detection in aviation materials. Measurement 2025, 253, 117749. [Google Scholar] [CrossRef]
Wu, Z.D.; Li, D.L.; Zou, L.; Zhao, H. Multi-granularity awareness via cross fusion for few-shot learning. Inf. Sci. 2025, 714, 122209. [Google Scholar] [CrossRef]
Miller, E.; Matsakis, N.; Viola, P. Learning from one example through shared densities on transforms. In Proceedings of 2000 IEEE Conference on Computer Vision and Pattern Recognition, Hilton Head Island, SC, USA, 15 June 2000; IEEE: Piscataway, NJ, USA, 2000; pp. 464–471. [Google Scholar]
Masana, M.; Liu, X.; Twardowski, B.; Menta, M.; Bagdanov, A.; Weijer, J. Class-incremental learning: Survey and performance evaluation on image classification. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 5513–5533. [Google Scholar] [CrossRef]
Vilalta, R.; Drissi, Y. A perspective view and survey of meta-learning. Artif. Intell. Rev. 2002, 18, 77–95. [Google Scholar] [CrossRef]
Tao, P.; Feng, L.; Du, Y.; Gong, X.; Wang, J. Meta-cosine loss for few-shot image classification. J. Image Graph. 2024, 29, 506–519. [Google Scholar] [CrossRef]
Zhang, J.; Zhang, M.; Lu, Z.; Xiang, T.; Wen, J. AdarGCN: Adaptive aggregation GCN for few-shot learning. In Proceedings of 2021 IEEE Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA; IEEE: Piscataway, NJ, USA, 2021; pp. 3482–3491. [Google Scholar]
Xiao, T.; Xia, Y.; Tang, R.; Du, W.; Wang, Z. Fusion of global and adaptive local information for few-shot image classification. Pattern Recognit. 2025, 168, 111802. [Google Scholar] [CrossRef]
Li, R.; Zheng, S.; Duan, C.; Yang, Y.; Wang, X. Classification of hyperspectral image based on double-branch dual-attention mechanism network. Remote Sens. 2020, 12, 582. [Google Scholar] [CrossRef]
Li, X.; Lu, P.; Zhu, R.; Ma, Z.; Cao, J.; Xue, J. Rise by lifting others: Interacting features to uplift few-shot fine-grained classification. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 3094–3103. [Google Scholar] [CrossRef]
Jia, Y.; Dong, L.; Jiao, Y. Medical image classification based on contour processing attention mechanism. Comput. Biol. Med. 2025, 191, 110102. [Google Scholar] [CrossRef]
Nikulins, A.; Edelmers, E.; Sudars, K.; Polaka, I. Adapting classification neural network architectures for medical image segmentation using explainable AI. J. Imaging 2025, 11, 55. [Google Scholar] [CrossRef]
Song, L.; Gao, Y.; Gui, Y.; Jiang, D.; Zhang, M.; Liu, H. LHAS: A lightweight network based on hierarchical attention for hyperspectral image segmentation. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5508012. [Google Scholar] [CrossRef]
Liang, X.; Li, X.; Wang, Q.; Qian, J.; Wang, Y. Hyperspectral image change detection method based on the balanced metric. Sensors 2025, 25, 1158. [Google Scholar] [CrossRef] [PubMed]
Snell, J.; Swersky, K.; Richard, Z. Prototypical networks for few-shot learning. In Proceedings of 31st Conference on Neural Information Processing Systems Long Beach, CA, USA; Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 4077–4087. [Google Scholar]
Sinha, A.K.; Fleuret, F. DeepEMD: A Transformer-Based Fast Estimation of the Earth Mover’s Distance. In Proceedings of the 27th International Conference on Pattern Recognition (ICPR 2024), Kolkata, India, 1–5 December 2024; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2025; Volume 15304, p. 1. [Google Scholar] [CrossRef]
Yang, L.; Li, L.L.; Zhang, Z.L.; Zhou, X.Y.; Zhou, E.J.; Liu, Y. DPGN: Distribution Propagation Graph Network for Few-shot Learning. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 14–19 June 2020; IEEE Computer Society: Los Alamitos, CA, USA, 2020; pp. 13387–13396. [Google Scholar] [CrossRef]
Kim, J.W.; Joo, H.Y.; Moon, J.H.; Lee, G.J. Development of a radiation detector for the radioactive-plume monitoring network (RPMN). Prog. Nucl. Energy 2020, 123, 103290. [Google Scholar] [CrossRef]
Huang, Z.H.; Shi, J.J.; Li, X.L. Quantum Few-Shot Image Classification. IEEE Trans. Cybern. 2025, 55, 194–206. [Google Scholar] [CrossRef] [PubMed]
Shi, F.; Wang, R.; Zhang, S.Y.; Cao, X.C. Few-Shot Classification with Multi-task Self-supervised Learning. In Proceedings of the 28th International Conference on Neural Information Processing, 8–12 December 2021; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2021; Volume 13111, pp. 224–236. [Google Scholar] [CrossRef]
Sun, Y.C.; Keung, J.W.; Yu, H.K.; Luo, W.Q. LogMeta: A few-shot model-agnostic meta-learning framework for robust and adaptive log anomaly detection. J. Syst. Softw. 2026, 235, 112781. [Google Scholar] [CrossRef]
Zeng, W.; Xiao, Z.Y. Few-shot learning based on deep learning: A survey. Math. Biosci. Eng. 2024, 21, 679–711. [Google Scholar] [CrossRef]
Lu, K.D.; Huang, J.C.; Zeng, G.Q.; Chen, M.R.; Geng, G.G.; Weng, J. Multi-Objective Discrete Extremal Optimization of Variable-Length Blocks-Based CNN by Joint NAS and HPO for Intrusion Detection in IIoT. IEEE Trans. Dependable Secur. Comput. 2025, 22, 4266–4283. [Google Scholar] [CrossRef]
Roy, A.G.; Navab, N.; Wachinger, C. Recalibrating Fully Convolutional Networks With Spatial and Channel Squeeze and Excitation Blocks. IEEE Trans. Med. Imaging 2019, 38, 540–549. [Google Scholar] [CrossRef]
Sun, H.; Li, B.H.; Dan, Z.P.; Hu, W.; Du, B.; Yang, W.; Wan, J. Multi-level Feature Interaction and Efficient Non-Local Information Enhanced Channel Attention for image dehazing. Neural Netw. 2023, 163, 10–27. [Google Scholar] [CrossRef]
Chen, C.; Wang, C.Y.; Liu, B.; He, C.; Cong, L.; Wan, S.H. Edge Intelligence Empowered Vehicle Detection and Image Segmentation for Autonomous Vehicles. IEEE Trans. Intell. Transp. Syst. 2023, 24, 13023–13034. [Google Scholar] [CrossRef]
An, H.J.; He, H.H.; Ma, S.H.; Pan, R.X.; Liu, C.B.; Guo, Y.X.; Liu, G.; Song, M.X.; Dong, Z.K.; Chen, G.X. Fault Diagnosis Method for Axial Piston Pump Slipper Wear Based on Symmetric Dot Pattern and Multi-Channel Densely Connected Convolutional Networks. Sensors 2025, 25, 7465. [Google Scholar] [CrossRef]
Ren, J.; An, Y.H.; Lei, T.; Yang, J.P.; Zhang, W.Y.; Pan, Z.C.; Liao, Y.; Gao, Y.S.; Sun, C.M.; Zhang, W.C. Adaptive feature selection-based feature reconstruction network for few-shot learning. Pattern Recognit. 2026, 171, 112289. [Google Scholar] [CrossRef]
Przybylowicz, P.; Sobieraj, M. On the randomized Euler scheme for stochastic differential equations with integral-form drift. J. Comput. Appl. Math. 2026, 483, 117367. [Google Scholar] [CrossRef]
Chen, H.K.; Luo, Z.X.; Zhang, J.H.; Zhou, L.; Bai, X.Y.; Hu, Z.Y. Learning to match features with seeded graph matching network. In Proceedings of 2021 IEEE International Conference on Computer Vision, Montreal, QC, Canada; IEEE: Piscataway, NJ, USA, 2021; pp. 6281–6290. [Google Scholar]
Sung, F.; Yang, Y.; Zhang, L.; Xiang, T.; Torr, P.; Hospedales, T. Learning to compare: Relation network for few-shot learning. In Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA; IEEE: Piscataway, NJ, USA, 2018; pp. 1199–1208. [Google Scholar]
Li, W.; Xu, J.; Huo, J.; Wang, L.; Gao, Y.; Luo, J. Distribution consistency based covariance metric networks for few-shot learning. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, Honolulu, HI, USA; AAAI Press: Washington, DC, USA, 2019; pp. 8642–8649. [Google Scholar]
Wei, T.; Hou, J.; Feng, R. Fuzzy graph neural network for few-shot learning. In Proceedings of 2020 International Joint Conference on Neural Networks, Glasgow, UK; IEEE: Piscataway, NJ, USA, 2020; pp. 1–8. [Google Scholar]
Tian, Y.; Wang, Y.; Krishnan, D.; Tenenbaum, J.; Isola, P. Rethinking few-shot image classification: A good embedding is all you need? In Proceedings of 16th European Conference on Computer Vision, Glasgow, UK; Springer: Cham, Switzerland, 2020; pp. 1–17. [Google Scholar]
Cai, Q.; Li, F. Meta-learning with class feature augmentation. J. Chin. Comput. Syst. 2022, 43, 225–230. [Google Scholar]
Smirnov, E.; Timoshenko, D.; Andrianov, S. Comparison of regularization methods for ImageNet classification with deep convolutional neural networks. AASRI Procedia 2014, 6, 89–94. [Google Scholar] [CrossRef]
Kim, J.; Kim, T.; Kim, S.; Yoo, C. Edge-Labeling graph neural network for few-shot learning. In Proceedings of 2019 IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA; IEEE: Piscataway, NJ, USA, 2019; pp. 11–20. [Google Scholar]
Zhou, X.; Zhang, Y.; Wei, Q. Few-shot fine-grained image classification via GNN. Sensors 2022, 22, 7640. [Google Scholar] [CrossRef]
Tang, S.; Chen, D.; Bai, L.; Liu, K.; Ge, Y.; Ouyang, W. Mutual CRF-GNN for few-shot learning. In Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA; IEEE: Piscataway, NJ, USA, 2021; pp. 2329–2339. [Google Scholar]
Zhang, T.; Shan, H.; Little, M. Causal GraphSAGE: A robust graph method for classification based on causal sampling. Pattern Recognit. 2022, 128, 108696. [Google Scholar] [CrossRef]
Hu, S.; Miao, D.; Pedrycz, W. Multi granularity based label propagation with active learning for semi-supervised classification. Expert Syst. Appl. 2025, 192, 116276. [Google Scholar] [CrossRef]
Raisi, E.; Bach, S. Selecting auxiliary data using knowledge graphs for image classification with limited labels. In Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA; IEEE: Piscataway, NJ, USA, 2020; pp. 4026–4031. [Google Scholar]
Ren, M.; Triantafillou, E.; Ravi, S.; Snell, J.; Swersky, K.; Tenenbaum, J.B.; Larochelle, H.; Zemel, R.S. Meta-learning for semi-supervised few-shot classification. In Proceedings of the 31st Conference on Neural Information Processing Systems, Long Beach, CA, USA; Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 1–5. [Google Scholar]
Luo, H.F.; Li, H.H.; Huang, T.Q.; Huang, L.Q. PMGAE: Self-supervised graph representation with proximity matrix reconstruction auto-encoders. Neurocomputing 2026, 671, 132617. [Google Scholar] [CrossRef]
Chen, L.; Lou, Y.; Wang, L.; Chen, G.R. D2R: A distance metric for exploring network structural robustness enhancement potential. Reliab. Eng. Syst. Saf. 2026, 270, 112173. [Google Scholar] [CrossRef]
Wertheimer, D.; Tang, L.; Hariharan, B. Few-shot classification with feature map reconstruction networks. In Proceedings of 2021 IEEE Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA; IEEE: Piscataway, NJ, USA, 2021; pp. 8008–8017. [Google Scholar]
Yang, Z.; Wang, J.; Zhu, Y. Few-shot classification with contrastive learning. In Proceedings of 17th European Conference on Computer Vision, Tel Aviv, Israel; Springer: Berlin/Heidelberg, Germany, 2022; pp. 293–309. [Google Scholar]
Xie, J.; Long, F.; Lv, J.; Wang, Q.; Li, P. Joint distribution matters: Deep brownian distance covariance for few-shot classification. In Proceedings of 2022 IEEE Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA; IEEE: Piscataway, NJ, USA, 2022; pp. 7962–7971. [Google Scholar]
Kang, D.; Kwon, H.; Min, J.; Cho, M. Relational embedding for few-shot classification. In Proceedings of 2021 IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada; IEEE: Piscataway, NJ, USA, 2021; pp. 8802–8813. [Google Scholar]
Cai, T.; Ma, R. Theoretical foundations of t-SNE for visualizing high-dimensional clustered data. J. Mach. Learn. Res. 2022, 23, 1–54. [Google Scholar]

Figure 1. Schematic diagram of the GDFSIC method structure.

Figure 2. Global–local channel attention module.

Figure 3. Network architecture diagram before and after integrating GLCAM.

Figure 4. Architecture diagram of the image propagation module.

Figure 5. An overview of our proposed DDS-EMD. For the query sample, we compute the DDS-EMD distance (the optimal transport cost) between its set of feature vectors and the feature set of each prototype (or support sample) in the task. The query is then classified to the class corresponding to the prototype with the smallest DDS-EMD distance. This winner-takes-all rule based on the learned metric is the core classification principle.

Figure 6. Sensitivity analysis of the

γ

interpolation constant in DDS-EMD for a 5-way

k

shot.

Figure 6. Sensitivity analysis of the

γ

interpolation constant in DDS-EMD for a 5-way

k

shot.

Figure 7. Comparative experiment of combining attention mechanisms with other methods on the mini-ImageNet dataset.

Figure 8. CUB-200-2011 dataset t-SNE visualization.

Table 1. Comparative table of methods.

Method	Attention Mechanism	Image Propagation	Metric Learning	Core Integration Strategy	Fundamental Limitation
DeepEMD [20]	Local feature matching (for alignment only)	None	EMD	Attention guides local alignment, separate from metric.	Lacks global relation modeling; weak cross-dataset adaptability.
DPGN [21]	None (or for initialization only)	Yes (Dual Graph Iterative Propagation)	Cosine Similarity (Fixed)	Graph propagation and metric are decoupled; metric is non-learnable.	Relies on pre-defined similarity; struggles with distribution shifts.
RPMN [22]	Yes (Relational Attention)	Yes (Message Passing)	Learnable Linear Classifier	Attention refines relations but is separate from graph structure.	Graph construction relies on fixed rules; metric and graph learning are staged/separated.
Ours	GLCAM	Yes (Attention-Weighted Propagation)	DDS-EMD (dynamically adjusted)	Tripartite Synergy: Attention drives graph building, graph output optimizes metric, metric feedback refines attention.	-

Table 2. Experimental datasets.

Dataset	Division	Number of Categories	Image Size	Number of Images
Mini-ImageNet	training set	64	84 × 84	60,000
	validation set	16	84 × 84
	test set	20	84 × 84
Tiered-ImageNet	training set	351	84 × 84	779,165
	validation set	97	84 × 84
	test set	160	84 × 84
CIFAR-FS	training set	64	84 × 84	60,000
	validation set	16	84 × 84
	test set	20	84 × 84
CUB-200-2011	training set	100	84 × 84	11,788
	validation set	50	84 × 84
	test set	50	84 × 84

Table 3. Algorithm training parameters.

Parameter Name	Parameter Values
Learning rate	0.001
Dropblock	0.1
Weight decay	0.000 01
Generation numbers	6
Loss generation numbers	6
Generation weight	0.2
Train batch size	50
Eval batch size	4

Table 4. Experimental results on the mini-ImageNet dataset.

Method	Backbone	5-Way 1-Shot	5-Way 5-Shot
Relation Nets [35]	Conv-4	50.49 ± 0.85	65.37 ± 0.70
Matching Nets [36]	Conv-4	43.62 ± 0.84	55.31 ± 0.75
ProtoNet [37]	Conv-4	49.47 ± 0.78	68.25 ± 0.64
MAML [38]	Conv-4	48.75 ± 1.84	63.16 ± 0.92
R2D2 [39]	Conv-4	51.25 ± 0.64	68.85 ± 0.12
SNAIL [40]	ResNet-12(pre)	55.71 ± 0.97	68.83 ± 0.90
TADAM [41]	ResNet-12(pre)	58.55 ± 0.32	76.75 ± 0.33
Variational FSL [42]	ResNet-12	61.23 ± 0.26	77.69 ± 0.17
ADAResNet [43]	ResNet-12	56.83 ± 0.62	71.94 ± 0.59
MetaOptNet [44]	ResNet-12	62.69 ± 0.61	78.68 ± 0.46
DeepEMD [20]	ResNet-12	65.97 ± 0.80	82.45 ± 0.56
CTM [45]	ResNet-18	64.17 ± 0.82	80.56 ± 0.15
Fine-tuning [21]	WRN-28	57.75 ± 0.67	78.17 ± 0.49
LEO-trainval [2]	WRN-28	61.76 ± 0.08	77.59 ± 0.10
Boosting [46]	WRN-28	63.72 ± 0.45	80.75 ± 0.31
Ours	ECAResNet-12	67.19 ± 0.32	84.94 ± 0.37

Table 5. Experimental results on the tiered-ImageNet dataset.

Method	Backbone	5-Way 1-Shot	5-Way 5-Shot
Relation Nets [35]	Conv-4	54.44 ± 0.93	71.37 ± 0.76
ProtoNet [37]	Conv-4	53.36 ± 0.89	72.74 ± 0.71
MAML [38]	Conv-4	51.67 ± 1.81	70.30 ± 1.75
MetaOptNet [44]	ResNet-12	65.94 ± 0.72	81.61 ± 0.53
CTM [45]	ResNet-12	68.46 ± 0.39	84.23 ± 1.73
DeepEMD [20]	ResNet-12	71.22 ± 0.87	86.08 ± 0.58
Fine-tuning [21]	WRN-28	66.53 ± 0.70	85.55 ± 0.42
LEO-trainval [2]	WRN-28	66.30 ± 0.07	81.49 ± 0.09
Boosting [46]	WRN-28	70.58 ± 0.52	84.93 ± 0.36
Ours	ECAResNet-12	73.56 ± 0.42	88.59 ± 0.49

Table 6. Experimental results on the CUB-200-2011 dataset.

Method	Backbone	5-Way 1-Shot	5-Way 5-Shot
ProtoNet [37]	Conv-4	35.36 ± 0.62	48.61 ± 0.65
TADAM [41]	ResNet-12(pre)	40.15 ± 0.42	56.17 ± 0.40
ProtoNet [37]	ResNet-12	37.52 ± 0.65	52.55 ± 0.63
RFS-simple [47]	ResNet-12	42.65 ± 0.72	59.13 ± 0.61
RFS-distill [47]	ResNet-12	44.63 ± 0.74	60.92 ± 0.62
MetaOptNet [44]	ResNet-12	41.15 ± 0.62	55.50 ± 0.60
DeepEMD [20]	ResNet-12	46.48 ± 0.71	63.22 ± 0.73
Ours	ECAResNet-12	48.20 ± 0.25	65.58 ± 0.63

Table 7. Experimental results on the CIFAS-FS dataset.

Method	Backbone	5-Way 1-Shot	5-Way 5-Shot
ProtoNet [37]	ResNet-12	72.24 ± 0.71	83.56 ± 0.53
RFS-simple [47]	ResNet-12	71.56 ± 0.83	86.04 ± 0.51
RFS-distill [47]	ResNet-12	73.96 ± 0.83	86.94 ± 0.53
MetaOptNet [44]	ResNet-12	72.65 ± 0.74	84.35 ± 0.50
DeepEMD [20]	ECAResNet-12	75.69 ± 0.81	86.69 ± 0.52
Relation Nets [35]	Conv-4	55.06 ± 1.20	69.35 ± 1.02
ProtoNet [37]	Conv-4	55.56. ± 0.72	72.04 ± 0.61
MAML [38]	Conv-4	58.96 ± 1.93	71.54 ± 1.01
R2D2 [39]	Conv-4	65.34 ± 0.21	79.46 ± 0.13
Ours	Conv-4	76.89 ± 0.38	88.81 ± 0.25

Table 8. The ablation experiment of the GDFSIC method on mini-ImageNet.

	GLCAM	Image Propagation Module	5-Way 1-Shot	5-Way 5-Shot
1	$\times$	$\times$	54.56 ± 0.84	77.12 ± 0.69
2	$\sqrt$	$\times$	67.35 ± 0.74	82.85 ± 0.72
3	$\sqrt$	$\sqrt$	67.79 ± 0.75	83.56 ± 0.67

Table 9. Comparison of experimental results with other metric classifiers.

Similarity Criteria	Mini-ImageNet		CUB-200-2011
Similarity Criteria	5-Way 1-Shot	5-Way 5-Shot	5-Way 1-Shot	5-Way 5-Hot
Cosine distance [48]	48.51	67.86	48.98	75.20
Euclidean distance [49]	49.47	68.25	50.51	76.43
DDS-EMD	55.45	71.61	65.43	80.79

Table 10. Comparison of backbone network complexity.

Backbone	FLOPs/MB	Parameters
WRN-18	15.85	117,120
ResNet-12	138.61	10,737,672
ResNet-18	151.06	11,271,432
Ours	27.07	153,440

Table 11. Comparison results on mini-ImageNet and tiered-ImageNet.

Model	Distance Metric	Patch Count	Mini-ImageNet		Tiered-ImageNet
Model	Distance Metric	Patch Count	5-Way 1-Shot	5-Way 5-Shot	5-Way 1-Shot	5-Way 5-Hot
MCL	Cosine	25	66.51	82.86	71.98	85.20
DeepEMD	Cosine	25	66.83	82.14	71.19	85.04
DDS-EMD	DDS	16	67.45	83.61	72.43	85.79

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Geng, B.; Pu, L. GDFSIC: A Few-Shot Image Classification Framework Integrating Global–Local Attention with Distance–Direction Similarity. Math. Comput. Appl. 2026, 31, 38. https://doi.org/10.3390/mca31020038

AMA Style

Geng B, Pu L. GDFSIC: A Few-Shot Image Classification Framework Integrating Global–Local Attention with Distance–Direction Similarity. Mathematical and Computational Applications. 2026; 31(2):38. https://doi.org/10.3390/mca31020038

Chicago/Turabian Style

Geng, Biao, and Liping Pu. 2026. "GDFSIC: A Few-Shot Image Classification Framework Integrating Global–Local Attention with Distance–Direction Similarity" Mathematical and Computational Applications 31, no. 2: 38. https://doi.org/10.3390/mca31020038

APA Style

Geng, B., & Pu, L. (2026). GDFSIC: A Few-Shot Image Classification Framework Integrating Global–Local Attention with Distance–Direction Similarity. Mathematical and Computational Applications, 31(2), 38. https://doi.org/10.3390/mca31020038

Article Menu

GDFSIC: A Few-Shot Image Classification Framework Integrating Global–Local Attention with Distance–Direction Similarity

Abstract

1. Introduction

2. Related Works

3. Methodology

3.1. Fundamentals of Few-Shot Learning

3.1.1. Few-Shot Learning

3.1.2. The N -Way K -Shot Problem

3.1.3. Meta-Learning

3.1.4. Metric-Based Meta-Learning

3.2. Methodological Structure

3.3. GLCAM

3.4. Image Propagation Module

3.5. Measurement Module

4. Experimental Results and Analysis

4.1. Experimental Environment and Dataset Setup

4.2. Evaluation Criteria and Parameterization

4.3. Experimental Flow

4.4. Experimental Results and Comparative Analysis

4.5. Analysis of Ablation Experiments

4.6. Comparison with Other Metric Classifiers

4.7. Comparative Experiments with Other Methods That Incorporate Attention Mechanisms

4.8. Network Complexity Analysis

4.9. Computational Effort

4.10. Visualization Analysis

5. Concluding Remarks

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

Abbreviations

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

3.1.2. The $N$ -Way $K$ -Shot Problem