GRCD-Net: Guided Global–Local Relational Learning for Few-Shot Fine-Grained and Remote Sensing Scene Classification

Liu, Jianfeng; Du, Yibo; Sun, Lifan; Li, Xiaozheng; Si, Yanna; Song, Xiaoli; Zheng, Ruijuan

doi:10.3390/rs18101632


by Jianfeng Liu ¹,*, Yibo Du ¹, Lifan Sun ²,³, Xiaozheng Li ¹, Yanna Si ², Xiaoli Song ¹ and Ruijuan Zheng ¹

¹ School of Software, Henan University of Science and Technology, Luoyang 471000, China

² School of Information Engineering, Henan University of Science and Technology, Luoyang 471000, China

³ Institute of Physics, Henan Academy of Sciences, Zhengzhou 450046, China

* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(10), 1632; https://doi.org/10.3390/rs18101632

Submission received: 28 March 2026 / Revised: 10 May 2026 / Accepted: 15 May 2026 / Published: 19 May 2026

(This article belongs to the Special Issue Advanced Applications of Artificial Intelligence in Remote Sensing Image Recognition (2nd Edition))


Highlights

What are the main findings?

The proposed GRCD-Net introduces a top-down guidance mechanism, proactively using global semantic context to filter severe background clutter before local feature extraction.
Empowered by a synergistic dual-metric learning strategy, the model consistently outperforms fine-grained baselines by 2–4% and achieves a strong 81.39% one-shot accuracy on the NWPU-RESISC45 remote sensing scene classification dataset, exceeding current state-of-the-art methods by 7.55%.

What are the implications of the main findings?

It effectively resolves a critical structural bottleneck in existing few-shot learning methods, successfully preventing models from matching irrelevant environmental noise instead of small semantic targets in complex aerial scenes.
The framework provides a promising solution for practical Earth observation applications, addressing the dual challenges of extreme data scarcity and highly cluttered remote sensing environments.

Abstract

Remote sensing scene classification (RSSC) faces severe challenges from data scarcity and complex background clutter. To overcome these limitations, this paper draws inspiration from few-shot fine-grained image classification (FSFGIC) to filter noise and capture subtle details. However, existing methods often process global context and local features separately, which limits their ability to suppress background noise in complex scenes. Consequently, the Guided Relational Cross-Attention Dual-branch Network (GRCD-Net) is proposed. Its core Guided Relational Cross-Attention (GRC) block leverages global semantics to filter local background noise prior to bidirectional feature interaction. Additionally, Iterative Global Relation (IGR) and Patch-level Dual-Metric (PDM) modules are integrated to robustly refine global relations and capture local similarities. Extensive experiments demonstrate that GRCD-Net consistently outperforms baselines by 2–4% on standard FSFGIC benchmarks. Notably, on the challenging NWPU-RESISC45 RSSC dataset, it achieves an 81.39% one-shot accuracy and exceeds current state-of-the-art methods by 7.55%, validating its efficacy for complex Earth observation.

Keywords:

remote sensing scene classification; few-shot learning; fine-grained image classification; transformer; cross-attention; metric learning

1. Introduction

Remote sensing scene classification (RSSC) is essential for practical Earth observation applications, such as environmental monitoring, urban planning, and disaster assessment [1]. Although conventional deep learning methods have advanced the field significantly, they rely heavily on large quantities of annotated data. Acquiring such datasets is labor-intensive, time-consuming, and requires specialized geographical expertise [2]. These practical constraints and the scarcity of data make deploying few-shot learning (FSL) an urgent necessity for the remote sensing community [3]. FSL is a promising paradigm that alleviates this dependency by enabling models to recognize new categories using only a minimal number of samples. However, applying generic FSL techniques directly to aerial imagery poses unique and significant challenges. Unlike standard natural images, where objects are usually centered and distinct, remote sensing scenes exhibit extreme scale variations, arbitrary orientations, and highly complex terrestrial background clutter [4]. This inherently chaotic environmental noise, such as expansive vegetation or intricate road networks, often obscures the actual semantic targets. Furthermore, recent remote sensing fusion studies [5,6] have emphasized the importance of balancing global semantic consistency and local spatial-detail enhancement. This observation is also relevant to few-shot RSSC, where discriminative targets are often embedded in complex backgrounds and require both holistic scene understanding and fine-grained local representation.

Consequently, distinguishing subtle inter-class differences while effectively suppressing complex environmental noise under extremely scarce data conditions has become a primary bottleneck in current few-shot RSSC research [7]. Moreover, as models become increasingly sophisticated for capturing intricate spatial and semantic correlations, maintaining a practical balance among classification accuracy, model complexity, and inference speed is important for potential real-time or resource-constrained Earth observation applications. To address these challenges, inspiration is drawn from fine-grained image classification (FGIC) [8,9], which plays a pivotal role in specialized applications ranging from aerospace [10] and Earth observation [11] to precision agriculture [12] and medical diagnosis [13]. Specifically, few-shot fine-grained image classification (FSFGIC) [14,15,16] requires models to tackle a similar challenge: they must discern extremely subtle inter-class differences (e.g., specific building structures or beak shapes of birds) while simultaneously handling significant intra-class variations and severe background interference. This inherent demand for highly discriminative feature learning makes FSFGIC principles highly applicable to overcoming the complex environmental noise in remote sensing tasks.

However, when adapting these fine-grained methods to remote sensing, their performance is often constrained by a fundamental structural limitation: the tendency to treat global semantic context and local discriminative features in isolation [17]. Early metric-based approaches [18,19] rely heavily on single global feature vectors, which inherently average out critical local cues. Conversely, recent spatial alignment methods [14] focus on local features but lack an explicit mechanism to model the hierarchical interaction between global context and local details. This disconnection creates a critical performance bottleneck. Without global semantic guidance, local feature extractors often fail to filter out background noise, leading to semantic misalignment. For instance, a model may misalign a bird’s white feather with a visually similar white flower or match an aircraft’s wing with a white building in a complex aerial scene. This type of ambiguous matching introduces significant noise into the metric space, which severely constrains recognition accuracy and limits generalization in scarce-data regimes. This is particularly true under the severe terrestrial background interference inherent in remote sensing imagery.

To address these limitations, this paper proposes the Guided Relational Cross-Attention Dual-branch Network (GRCD-Net) to bridge the gap between holistic understanding and local discrimination. At its core, the Guided Relational Cross-Attention (GRC) block embeds global–local interaction directly into the feature refinement process. Specifically, the global branch generates a spatial guidance map that acts as a semantic prior to dynamically modulate the local branch. This mechanism enables the model to emphasize informative regions while actively filtering complex terrestrial background noise prior to bidirectional cross-attention. Consequently, GRC produces highly focused and semantically consistent representations, which are essential for accurate target recognition in cluttered aerial scenes.

To further ensure robust metric learning across these refined features, GRCD-Net integrates two complementary modules. The Iterative Global Relation (IGR) module refines image-level representations by modeling the interactions among query and support samples through an iterative process. Through progressive aggregation, it enhances global-level semantics and suppresses noisy correlations, leading to more discriminative global embeddings. Concurrently, the Patch-level Dual-Metric (PDM) module computes local similarities by fusing cosine similarity and Euclidean distance. By jointly capturing directional correlations and spatial discrepancies, the PDM module yields feature comparisons that remain reliable under severe intra-class appearance variations.

The main contributions of this paper are summarized as follows:

1. A novel Guided Global–Local Relational Learning framework (i.e., GRCD-Net) is proposed to address the structural limitations in current FSFGIC and RSSC methods. At its core, the GRC block introduces a top-down spatial mechanism that leverages global context to proactively filter environmental clutter, ensuring the precise localization of fine-grained cues.
2. A synergistic metric learning strategy is developed by integrating the IGR and PDM modules. This strategy effectively overcomes severe intra-class variations by jointly refining global semantic consistency and computing robust local geometric similarities.
3. Extensive experiments were conducted on FSFGIC, general FSL, and RSSC benchmarks. The results show that GRCD-Net achieves competitive performance on multiple datasets while maintaining reasonable computational complexity, demonstrating its effectiveness in handling complex background clutter and improving generalization in few-shot scenarios.

The remainder of this paper is organized as follows: Section 2 reviews the related work on FSL, FSFGIC and RSSC. Section 3 details the proposed GRCD-Net framework, including the design of the GRC block, the IGR module, and the PDM module. Section 4 presents extensive experimental results, ablation studies, and visualizations on multiple benchmarks, along with a detailed analysis of computational complexity. Finally, Section 5 concludes the paper and discusses future research directions.

2. Related Work

2.1. Few-Shot Learning

Research in FSL is generally categorized into three main streams: optimization-based, augmentation-based, and metric-based methods. Optimization-based methods, exemplified by model-agnostic meta-learning (MAML) [20], aim to learn an initialization of model parameters that can be rapidly adapted to new tasks with minimal gradient updates. Augmentation-based strategies [21] focus on alleviating data scarcity by generating additional training samples or features through “hallucination” techniques. Meanwhile, metric-based methods have emerged as a dominant paradigm due to their simplicity and effectiveness. In this paradigm, a feature embedding space is learned for classification by comparing distances between query samples and class representatives. Classic frameworks include Prototypical Networks [18], which use class prototypes computed by averaging support sample embeddings, and Relation Networks [19], which employ a deep neural network to learn a nonlinear distance metric.

However, a significant limitation of these pioneering metric-based works is their reliance on holistic image features, primarily designed for coarse-grained classification. For fine-grained tasks requiring subtle local details, global pooling operations often average out such information. The proposed GRCD-Net builds upon the metric-based formulation to address this limitation by explicitly extracting and refining local details. This ensures that fine-grained information is preserved within the embedding space.

2.2. Few-Shot Fine-Grained Classification

To overcome the limitations of holistic embeddings, specialized FSFGIC methods have emerged to capture subtle inter-class differences [22]. One prominent strategy is to construct richer relational correspondences. Reconstruction-based methods, such as feature map reconstruction network (FRN) [23] and its extension, i.e., bi-directional feature reconstruction network (BiFRN) [16], capture pairwise feature correlations by reconstructing query features from the support latent space. Another significant line of work emphasizes local region alignment. Representative examples include cross-attention frameworks such as cross-attention network (CAN) [24] and CrossTransformer (CTX) [25], as well as methods that incorporate foreground–background separation [14], multi-scale representations [26], or joint channel–spatial alignment [27]. Furthermore, complex alignment is formulated as an optimal transport problem [28] or a semantic selection procedure [29].

Despite these advancements, most approaches rely on discrete “one-to-one” alignment mechanisms that overlook the hierarchical dependencies between global context and local details. Without explicit global guidance, local descriptors are susceptible to semantic misalignment. For example, they may match a foreground object part with visually similar background noise (such as a white feather versus a white flower) [22]. To address this specific gap, GRCD-Net introduces a “one-to-many” guided interaction mechanism that uses global contextual information to direct local feature extraction and filter background noise.

2.3. Transformers for Few-Shot Classification

Convolutional neural networks (CNNs) excel at local feature extraction, but the Transformer architecture [30] is increasingly being adopted in FSL because of its ability to model long-range dependencies and global context. Several adaptations have been introduced: few-shot embedding adaptation with Transformer (FEAT) [31] uses a Transformer to create set-to-set transformations for task-specific embeddings. Universal representation Transformer (URT) [32] combines multi-domain features for broader generalization. Furthermore, unified query-support Transformer (QSFormer) [33] uses an encoder–decoder structure to accomplish both feature representation and metric learning simultaneously. To address the computational intensity of standard Vision Transformers [34], few-shot classification with Transformers using reweighted embedding similarity (FewTURE) [35] has integrated hierarchical architectures such as Swin Transformers [36] to capture multi-scale features efficiently.

Nevertheless, a structural gap remains in the utilization of these hierarchical features in fine-grained tasks. Most Transformer-based methods treat global and local features as separate entities or fuse them via simple concatenation. These methods lack a mechanism to model the interaction between semantic context and discriminative details. Conversely, GRCD-Net integrates the interaction directly within the feature extraction process. This enables global semantics to refine local details progressively before metric computation.

2.4. Few-Shot Remote Sensing Scene Classification

RSSC poses unique challenges compared to natural image recognition. Aerial and satellite images are typically characterized by extremely complex terrestrial backgrounds, large variations in scale, and high intra-class diversity. Recently, FSL has been introduced to the remote sensing domain to alleviate the heavy reliance on large-scale annotated datasets. To address the inherent visual ambiguity in aerial scenes, several specialized methods have been proposed. Early explorations, such as discriminative learning of adaptive match network (DLA-MatchNet) [37], utilize attention mechanisms to match spatial and channel semantics, while attention-based contrastive learning network (ACL-Net) [7] employs contrastive learning to enhance feature discriminability. More recently, researchers have focused on highlighting target regions to suppress environmental noise. For instance, TA-MSA [38] combines a task-adaptive fine-tuning strategy with a multi-level spatial feature aggregation module, and parameter-free attention with selective region matching (PA-SRM) [39] introduces parameter-free region attention to explicitly capture discriminative local patches. Similarly, multi-stream architectures like two-stream deep nearest-neighbor neural network (TSDN4) [40] and discriminative enhanced attention-based deep nearest-neighbor neural network (DEADN4) [41] have been developed to jointly capture local and global descriptors from limited remote sensing samples.

Despite these advancements, existing few-shot RSSC methods have difficulty structurally integrating holistic scene understanding with local feature extraction. When small semantic targets are submerged in cluttered aerial environments, these methods often average out critical details or mismatch them with prominent background noise. To overcome this challenge, the proposed GRCD-Net employs the GRC block, which uses global semantics as a top-down spatial gate. This mechanism proactively filters environmental clutter before feature fusion occurs, ensuring that local representations remain highly focused on actual semantic targets.

3. Method

This section provides a detailed introduction to the proposed GRCD-Net. As illustrated in Figure 1, the framework consists of three primary components: (1) a hierarchical feature extraction backbone that generates multi-scale representations; (2) a series of GRC blocks that enable explicit global–local interaction; and (3) a synergistic metric learning strategy comprising the IGR module and the PDM module. Finally, the inference process and objective function are detailed.

3.1. Problem Definition

Under the standard FSL protocol, a labeled dataset is denoted as $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{|\mathcal{D}|}$, where $x_i$ represents an input image, and $y_i$ is its corresponding class label. The dataset $\mathcal{D}$ is partitioned into three subsets, a training set $\mathcal{D}_{\mathrm{train}}$, a validation set $\mathcal{D}_{\mathrm{val}}$, and a test set $\mathcal{D}_{\mathrm{test}}$, which contain the image-label pairs belonging to the class sets $\mathcal{C}_{\mathrm{train}}$, $\mathcal{C}_{\mathrm{val}}$, and $\mathcal{C}_{\mathrm{test}}$, respectively. These corresponding class sets are pairwise disjoint, strictly satisfying $\mathcal{C}_{\mathrm{train}} \cap \mathcal{C}_{\mathrm{val}} = \mathcal{C}_{\mathrm{val}} \cap \mathcal{C}_{\mathrm{test}} = \mathcal{C}_{\mathrm{train}} \cap \mathcal{C}_{\mathrm{test}} = \emptyset$. This disjoint assumption is the fundamental premise of FSL, forcing the network to evaluate its generalization capability on entirely novel unseen categories rather than merely memorizing training classes.

The framework is optimized and evaluated via episodic training. Each episode formulates an N-way K-shot classification task $\mathcal{T}$, meaning that the model must classify query images into N different novel classes, given only K labeled support samples per class. For each task, N classes $\mathcal{C}_{\mathcal{T}} \subset \mathcal{C}_{\mathrm{train}}$ are randomly sampled. From each class, K labeled samples are drawn to form the support set $\mathcal{S} = \{(x_i^s, y_i^s)\}_{i=1}^{N \times K}$, and Q additional samples per class are selected to form the query set $\mathcal{Q} = \{(x_j^q, y_j^q)\}_{j=1}^{N \times Q}$. The objective of few-shot classification is to predict the labels of the query samples in $\mathcal{Q}$ by relying solely on the explicit supervision provided by the support set $\mathcal{S}$.
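The episodic protocol described above can be illustrated with a minimal sketch. The function name `sample_episode`, the toy label list, and the choice of 15 query images per class are illustrative assumptions rather than details taken from the paper.

```python
import random
from collections import defaultdict


def sample_episode(labels, n_way, k_shot, n_query, rng):
    """Sample one N-way K-shot episode.

    Returns (support, query), each a list of (image_index, episode_label) pairs.
    """
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    # Only classes with enough images for K support + Q query samples are eligible.
    eligible = sorted(c for c, idxs in by_class.items() if len(idxs) >= k_shot + n_query)
    classes = rng.sample(eligible, n_way)
    support, query = [], []
    for episode_label, c in enumerate(classes):
        # Draw K + Q distinct images so support and query never overlap.
        picked = rng.sample(by_class[c], k_shot + n_query)
        support += [(i, episode_label) for i in picked[:k_shot]]
        query += [(i, episode_label) for i in picked[k_shot:]]
    return support, query


# Toy label list: 8 classes with 20 images each (hypothetical data).
labels = [c for c in range(8) for _ in range(20)]
support, query = sample_episode(labels, n_way=5, k_shot=1, n_query=15, rng=random.Random(0))
print(len(support), len(query))  # 5 75
```

The support set holds N × K pairs and the query set N × Q pairs, matching the set sizes in the definitions above.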

3.2. Guided Feature Extraction Network

3.2.1. Hierarchical Feature Construction

The Swin Transformer [36] is employed as the foundational feature extraction module. As depicted in Figure 2, this architecture constructs a hierarchical representation across four stages. The process begins with patch partitioning and embedding, followed by patch merging layers that systematically reduce spatial resolution (from $H/4 \times W/4$ to $H/32 \times W/32$) while expanding the channel dimension. A key component, the shifted window self-attention, efficiently models both local and global dependencies within each block.

For an input image $x_i$, the backbone generates a hierarchical feature set $F_i = \{F_i^{(1)}, F_i^{(2)}, F_i^{(3)}, F_i^{(4)}\}$. To facilitate dual-branch learning, the high-resolution output from Stage 2 ($F_i^{(2)}$) is designated as the input for the local detail branch, and the semantically richer low-resolution output from Stage 3 ($F_i^{(3)}$) is allocated to the global semantic branch. These selected hierarchical feature maps are subsequently flattened along the spatial dimensions and linearly projected into a sequence of D-dimensional token embeddings, a process commonly known as tokenization [30]:

$$X_{\mathrm{detail}} = [C_d; P_d^1, \dots, P_d^M] \in \mathbb{R}^{(1+M) \times D} \tag{1}$$

$$X_{\mathrm{global}} = [C_g; P_g^1, \dots, P_g^N] \in \mathbb{R}^{(1+N) \times D} \tag{2}$$

where $M = H_2 \times W_2$, $N = H_3 \times W_3$, and D represents the embedding dimension of the tokenized sequence. The terms $C_d$ and $C_g$ denote the learnable classification (CLS) tokens for the detail and global branches, respectively, while P denotes the image patch tokens. The robust features extracted by this hierarchical module provide the basis for the subsequent guided interaction and metric learning components.
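To make the sequence lengths in Equations (1) and (2) concrete, the following sketch computes M and N for a standard Swin backbone, whose Stage 2 and Stage 3 outputs are downsampled by factors of 8 and 16. The 224 × 224 input size and the helper name `swin_token_counts` are assumptions for illustration; the section above does not state the input resolution.

```python
def swin_token_counts(height, width):
    """Return (M, N): patch-token counts for the detail (Stage 2) and global (Stage 3) branches."""
    # Standard Swin downsampling: Stage 1 -> /4, Stage 2 -> /8, Stage 3 -> /16, Stage 4 -> /32.
    h2, w2 = height // 8, width // 8
    h3, w3 = height // 16, width // 16
    return h2 * w2, h3 * w3


m, n = swin_token_counts(224, 224)  # 224 x 224 is an assumed input size
print(m, n)          # 784 196
print(1 + m, 1 + n)  # sequence lengths including the CLS token: 785 197
```

Under this assumption the detail branch carries four times as many patch tokens as the global branch, which is why the global gate has to be up-sampled before it can modulate the detail tokens.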

3.2.2. Guided Relational Cross-Attention Block

The GRC block is the core component of the proposed framework, facilitating deep, semantic-driven interactions between global and local representations. As outlined in Algorithm 1, this module processes the hierarchical token sequences $X_{\mathrm{global}}$ and $X_{\mathrm{detail}}$ extracted from the backbone. Building on standard cross-attention mechanisms [42], a global–local guidance mechanism is introduced. This mechanism transforms the interaction from a naive bidirectional exchange to a guided refinement process, effectively enhancing the subtle discriminative cues crucial for FSFGIC. The architecture of the GRC block is illustrated in Figure 3.

Upon entering the GRC block, the global branch $X_{\mathrm{global}} = [C_g; P_g]$ and the detail branch $X_{\mathrm{detail}} = [C_d; P_d]$ are initially processed independently by separate intra-branch Transformer blocks to update their internal representations:

$$X_{\mathrm{global}}' = \mathrm{TransformerBlock}_{\mathrm{global}}(X_{\mathrm{global}}) \tag{3}$$

$$X_{\mathrm{detail}}' = \mathrm{TransformerBlock}_{\mathrm{detail}}(X_{\mathrm{detail}}) \tag{4}$$

where each TransformerBlock consists of Multi-Head Self-Attention (MHSA), Multi-Layer Perceptron (MLP), LayerNorm (LN), and residual connections. Following this step, the updated outputs are separated back into CLS and patch tokens, i.e., $X_{\mathrm{global}}' \to [C_g'; P_g']$ and $X_{\mathrm{detail}}' \to [C_d'; P_d']$.

As the central innovation of the GRC block, the global–local guidance mechanism mitigates background interference by leveraging global semantics to filter local details prior to cross-branch interaction. Specifically, the global patches $P_g'$ are first fed into a lightweight MLP followed by a sigmoid activation to generate a spatial attention gate $G_s$:

$$G_s = \sigma(\mathrm{MLP}(P_g')) \tag{5}$$

The resulting gate vector is reshaped into a 2D attention map $G_{\mathrm{map}} \in \mathbb{R}^{B \times 1 \times H_g \times W_g}$, where B is the batch size, and $H_g = W_g = \sqrt{N}$. It is then up-sampled by bilinear interpolation to match the spatial size $H_d \times W_d$ of the detail branch, yielding the guided spatial gate:

$$G_{\mathrm{guided}} = \mathrm{Reshape}(\mathrm{Interpolate}(G_s, \mathrm{size} = (H_d, W_d))) \tag{6}$$

This guided gate dynamically modulates the detail patch tokens through element-wise multiplication:

$$P_d'' = P_d' \odot G_{\mathrm{guided}} \tag{7}$$

This operation effectively suppresses irrelevant background noise while amplifying salient local regions highlighted by the global semantic prior. It should be noted that this operation does not perform explicit image-level denoising; instead, it suppresses semantic background clutter at the feature level through learnable spatial modulation.
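The gating steps in Equations (5)–(7) can be sketched in plain Python for a single image. This is a simplified illustration under stated assumptions: a single linear layer (`weights`, `bias`) stands in for the lightweight MLP, the bilinear resize uses the align-corners convention for brevity, and all function names are hypothetical. The paper does not specify these details.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bilinear_upsample(grid, out_h, out_w):
    """Bilinearly resize a 2D list (align-corners convention, an assumption)."""
    in_h, in_w = len(grid), len(grid[0])
    out = []
    for i in range(out_h):
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        dy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            dx = x - x0
            top = grid[y0][x0] * (1 - dx) + grid[y0][x1] * dx
            bottom = grid[y1][x0] * (1 - dx) + grid[y1][x1] * dx
            row.append(top * (1 - dy) + bottom * dy)
        out.append(row)
    return out


def guided_gate(global_patches, detail_patches, weights, bias, h_d, w_d):
    """Gate detail tokens with an up-sampled map derived from global tokens."""
    h_g = w_g = math.isqrt(len(global_patches))  # H_g = W_g = sqrt(N)
    # Eq. (5): one scalar gate per global patch (linear layer in place of the MLP).
    g_s = [sigmoid(sum(w * v for w, v in zip(weights, tok)) + bias) for tok in global_patches]
    g_map = [g_s[r * w_g:(r + 1) * w_g] for r in range(h_g)]
    # Eq. (6): up-sample to the detail resolution and flatten back to token order.
    g_guided = [g for row in bilinear_upsample(g_map, h_d, w_d) for g in row]
    # Eq. (7): broadcast each scalar gate over the channels of its detail token.
    return [[g * v for v in tok] for g, tok in zip(g_guided, detail_patches)]


# Toy input: a 2 x 2 global grid whose right column is "foreground", and a 4 x 4 detail grid.
global_patches = [[-4.0, 0.0], [4.0, 0.0], [-4.0, 0.0], [4.0, 0.0]]
detail_patches = [[1.0, 1.0] for _ in range(16)]
gated = guided_gate(global_patches, detail_patches, weights=[1.0, 0.0], bias=0.0, h_d=4, w_d=4)
print([round(tok[0], 3) for tok in gated[:4]])  # first detail row, left to right
```

In this toy case the gate rises smoothly from left to right, so detail tokens over the low-scoring region are scaled towards zero while those over the high-scoring region pass almost unchanged.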

To achieve comprehensive feature fusion, the bidirectional cross-attention mechanism operates through two complementary pathways:

(1) Global-to-Detail (G2D): The global CLS token $C_g'$ queries the purified local patches $P_d''$ to absorb fine-grained details:

$$\tilde{C}_g = C_g' + \mathrm{CrossAttention}(Q = C_g', KV = P_d'') \tag{8}$$

(2) Detail-to-Global (D2G): Simultaneously, the local CLS token $C_d'$ queries the global patches $P_g'$ to acquire broader contextual awareness:

$$\tilde{C}_d = C_d' + \mathrm{CrossAttention}(Q = C_d', KV = P_g') \tag{9}$$

Here, CrossAttention utilizes the standard scaled dot-product attention mechanism.
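As a rough illustration of Equations (8) and (9), the sketch below applies single-head scaled dot-product attention with a CLS token as the only query and adds the result back as a residual. The learned query, key, and value projections are omitted (treated as identity), and the function name and toy vectors are assumptions made for this example.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cls_cross_attention(cls_token, patches):
    """Residual single-head attention: the CLS token is the query, patches are keys and values."""
    d = len(cls_token)
    scores = [sum(q * k for q, k in zip(cls_token, p)) / math.sqrt(d) for p in patches]
    weights = softmax(scores)
    attended = [sum(w * p[c] for w, p in zip(weights, patches)) for c in range(d)]
    return [q + a for q, a in zip(cls_token, attended)], weights


c_g, c_d = [1.0, 0.0], [0.0, 1.0]
p_d_gated = [[2.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # last two patches were zeroed by the gate
p_g = [[0.5, 0.5], [0.0, 1.0]]

c_g_new, w_g2d = cls_cross_attention(c_g, p_d_gated)  # G2D, Eq. (8)
c_d_new, w_d2g = cls_cross_attention(c_d, p_g)        # D2G, Eq. (9)
print([round(w, 3) for w in w_g2d])
```

Patches that the gate has driven to zero contribute nothing to the attended value even when they still receive some attention weight, which is one way to read the claim that gating before cross-attention limits background leakage into the CLS token.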

Finally, the updated CLS tokens are concatenated with their corresponding patch token sequences to form the input states for the next block:

$$X_{\mathrm{global}} \leftarrow [\tilde{C}_g; P_g'], \qquad X_{\mathrm{detail}} \leftarrow [\tilde{C}_d; P_d''] \tag{10}$$

After passing through N stacked GRC blocks, the refined sequences $X_{\mathrm{global}}$ and $X_{\mathrm{detail}}$ are obtained. From these sequences, the global CLS tokens (denoted as $F_g$) and the enhanced detail patch tokens (denoted as $F_p$) are extracted to serve as the final representations. These representations are subsequently forwarded to the synergistic metric learning stage: $F_g$ is processed by the IGR module for global relational reasoning, while $F_p$ is delivered to the PDM module for robust local geometric computation.

Following the extraction of these refined features ($F_g$ and $F_p$), a synergistic metric learning strategy is proposed to measure the similarity between query and support samples. As formalized in Algorithm 2, this strategy decouples the similarity computation into two complementary dimensions: global semantic consistency and local geometric robustness. Specifically, the features extracted from all images in an episode are partitioned into their respective support and query sets, yielding global features $F_g = \{F_g^S, F_g^Q\}$ and patch features $F_p = \{F_p^S, F_p^Q\}$.

Algorithm 1 GRC-enhanced feature extraction.
Input: Image batch $X \in \mathbb{R}^{B \times H \times W \times 3}$.
Output: Global features $F_g$, local features $F_p$.
Hyperparameters: N (GRC blocks), D (dim).
Parameters: $\theta_{\mathrm{Swin}}$ (backbone), $\theta_{\mathrm{GRC}}$ (attention/gate weights).
1:	Step 1. Hierarchical Extraction
2:	$F \leftarrow \mathrm{SwinTransformer}(X; \theta_{\mathrm{Swin}})$
3:	$X_{\mathrm{detail}} \leftarrow \mathrm{Tokenize}(F^{(2)} \in F) + E_{\mathrm{pos}}$
4:	$X_{\mathrm{global}} \leftarrow \mathrm{Tokenize}(F^{(3)} \in F) + E_{\mathrm{pos}}$
5:	Step 2. Guided Relational Interaction
6:	for $l = 1$ to N do
7:	    $[C_g', P_g'] \leftarrow \mathrm{TransformerBlock}(X_{\mathrm{global}})$
8:	    $[C_d', P_d'] \leftarrow \mathrm{TransformerBlock}(X_{\mathrm{detail}})$
9:	    $G_s \leftarrow \sigma(\mathrm{MLP}(P_g'))$	▹ Generate spatial gate
10:	    $P_d'' \leftarrow P_d' \odot \mathrm{Upsample}(G_s)$	▹ Apply guidance
11:	    $\tilde{C}_g \leftarrow C_g' + \mathrm{CrossAttn}(Q = C_g', KV = P_d'')$
12:	    $\tilde{C}_d \leftarrow C_d' + \mathrm{CrossAttn}(Q = C_d', KV = P_g')$
13:	    $X_{\mathrm{global}} \leftarrow [\tilde{C}_g; P_g']$, $X_{\mathrm{detail}} \leftarrow [\tilde{C}_d; P_d'']$
14:	end for
15:	Extract global CLS tokens: $F_g \leftarrow X_{\mathrm{global}}[:, 0, :]$
16:	Extract detail patch tokens: $F_p \leftarrow X_{\mathrm{detail}}[:, 1:, :]$
17:	return $F_g, F_p$

3.2.3. Iterative Global Relation Module

The IGR module is introduced to explicitly model high-level semantic correlations among all samples within an episode. While the design draws inspiration from the SampleFormer in QSFormer [33], it diverges fundamentally in both purpose and mechanism. Instead of directly converting cross-attention weights into similarity scores, the proposed IGR module employs a Transformer decoder to iteratively refine query features via repeated interactions with the support set. Consequently, the similarity is computed from these deeply refined features rather than from attention weights, facilitating a richer and more robust representation of cross-sample relations. The detailed IGR process is presented in Algorithm 2 and illustrated in Figure 4.

Encoder: Support Set Contextualization. The IGR encoder constructs the support memory bank by explicitly modeling intra-support relationships. Given the support global features $F_g^S \in \mathbb{R}^{N_s \times D}$, they are projected into Query ($Q_s$), Key ($K_s$), and Value ($V_s$) matrices. Self-attention is first applied to capture intra-support contextual dependencies:

$$\mathrm{Attn}_{s \to s} = \mathrm{Softmax}\left(\frac{Q_s K_s^{\top}}{\sqrt{D}}\right) \tag{11}$$

The resulting attention maps are applied to $V_s$ and residually connected to the original features to produce the enhanced support embeddings $E_s$:

$$E_s = \mathrm{LayerNorm}(F_g^S + \mathrm{Attn}_{s \to s} V_s) \tag{12}$$

A feed-forward network (FFN) further transforms $E_s$, followed by a residual connection and normalization, to form the contextualized support memory bank $M_{\mathrm{support}} \in \mathbb{R}^{N_s \times D}$. This memory bank serves as the semantic foundation for the decoder’s iterative reasoning.

Decoder: Iterative Query-Support Refinement. The decoder refines the query features by progressively integrating contextual cues from the support memory bank. Given the query global features $F_g^Q \in \mathbb{R}^{N_q \times D}$, the decoder first performs self-attention to encode intra-query relationships:

$$D_{\mathrm{ctx}} = \mathrm{LayerNorm}(F_g^Q + \mathrm{Attn}_{q \to q} V_q) \tag{13}$$

The context-enhanced query features $D_{\mathrm{ctx}}$ subsequently query the support memory bank using cross-attention to aggregate task-relevant information:

$$\tilde{D} = \mathrm{Attn}_{q \to s} \tilde{V}_s, \qquad \mathrm{Attn}_{q \to s} = \mathrm{Softmax}\left(\frac{D_{\mathrm{ctx}} \tilde{K}_s^{\top}}{\sqrt{D}}\right) \tag{14}$$

where $\tilde{K}_s$ and $\tilde{V}_s$ are linear projections of $M_{\mathrm{support}}$. The output $\tilde{D}$ represents query embeddings updated with support-set context. The refinement is iterated T times. At each iteration t, the query features are updated with a weighted residual step controlled by $\lambda_{\mathrm{refine}}$:

$$Q_{\mathrm{current}}^{(t+1)} = Q_{\mathrm{current}}^{(t)} + \lambda_{\mathrm{refine}} \cdot \mathrm{LayerNorm}(\tilde{D} + \mathrm{FFN}(\tilde{D})) \tag{15}$$

Through this iterative decoding mechanism, the query representations progressively accumulate dense cross-sample relational cues. After the final iteration, the refined query embeddings are utilized to compute the global similarity matrix $M_g$ via a dot product.
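The decoder loop in Equations (13)–(15) can be sketched as follows. Several choices here are assumptions that this section does not fix: all learned projections are replaced by identities, the FFN is a weight-free ReLU placeholder, the current query state is fed back into the self-attention step at every iteration, the final dot product is taken against the support memory bank, and `num_iters` and `lam` are placeholder values for T and $\lambda_{\mathrm{refine}}$.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def attend(queries, keys, values):
    """Scaled dot-product attention with identity projections (assumption)."""
    d = len(queries[0])
    out = []
    for q in queries:
        w = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys])
        out.append([sum(wi * v[c] for wi, v in zip(w, values)) for c in range(d)])
    return out


def ffn(x):
    # Placeholder feed-forward network: a ReLU without learned weights (assumption).
    return [max(0.0, v) for v in x]


def refine_queries(query_feats, support_memory, num_iters=3, lam=0.5):
    q_cur = [list(q) for q in query_feats]
    for _ in range(num_iters):
        # Eq. (13): intra-query self-attention with residual connection and LayerNorm.
        self_out = attend(q_cur, q_cur, q_cur)
        ctx = [layer_norm([a + b for a, b in zip(q, s)]) for q, s in zip(q_cur, self_out)]
        # Eq. (14): cross-attention from queries into the support memory bank.
        d_tilde = attend(ctx, support_memory, support_memory)
        # Eq. (15): weighted residual update of the current query state.
        updates = [layer_norm([a + b for a, b in zip(d, ffn(d))]) for d in d_tilde]
        q_cur = [[qv + lam * u for qv, u in zip(q, upd)] for q, upd in zip(q_cur, updates)]
    return q_cur


def dot_similarity(queries, supports):
    return [[sum(a * b for a, b in zip(q, s)) for s in supports] for q in queries]


support_memory = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]  # toy memory bank, N_s = 2
queries = [[0.9, 0.1, 0.0, 0.0], [0.1, 0.8, 0.0, 0.1]]          # toy queries, N_q = 2
refined = refine_queries(queries, support_memory)
m_g = dot_similarity(refined, support_memory)  # shape N_q x N_s
print(len(m_g), len(m_g[0]))  # 2 2
```

The point of the sketch is the control flow: similarity is read off the refined query features after the loop, not from the attention weights inside it.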

3.2.4. Patch-Level Dual-Metric Module

To address the severe intra-class variations prevalent in fine-grained tasks, the PDM module is developed for the local branch. Building on the multi-metric fusion paradigm introduced in a graph neural network based on a Swin Transformer (sTransGNN) [43], the PDM module formalizes this concept at the patch level, as detailed in Algorithm 2, operating through two primary stages:

First, the module performs contextual enhancement on the input detail features. Both support patches ($F_{p}^{S}$) and query patches ($F_{p}^{Q}$) are propagated through a patch encoder consisting of stacked Transformer encoder layers. Subsequently, a local prototype is computed for each class by averaging its enhanced support patches over all shots and patch positions, which summarizes the representative characteristics of that category:

$$\mathrm{Proto}_{i} = \frac{1}{k_{shot} \cdot N_{p}} \sum_{j=1}^{k_{shot}} \sum_{p=1}^{N_{p}} \tilde{F}_{s}(i, j, p) \tag{16}$$

where $\tilde{F}_{s}$ denotes the enhanced patch features, and $\mathrm{Proto}_{i} \in \mathbb{R}^{D}$ is the resulting local prototype vector.
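Equation (16) amounts to a mean over the shot and patch axes. The sketch below shows this with NumPy, assuming the enhanced support patches are stored as an array of shape (classes, shots, patches, D); the array layout, the patch count of 49, and the function name are illustrative assumptions.

```python
import numpy as np

def local_prototypes(support_patches):
    """Average enhanced support patches over shots and patch positions.

    support_patches: array of shape (n_way, k_shot, n_patches, dim)
    returns:         array of shape (n_way, dim), one prototype per class
    """
    return support_patches.mean(axis=(1, 2))

rng = np.random.default_rng(0)
# Five-way five-shot episode with D = 256; 49 patches per image is a placeholder.
support_patches = rng.normal(size=(5, 5, 49, 256))
protos = local_prototypes(support_patches)

# The mean is identical to the explicit double sum divided by k_shot * N_p.
explicit = support_patches.sum(axis=(1, 2)) / (5 * 49)
```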

Second, to evaluate local matching comprehensively, two complementary metrics are employed. The first is cosine similarity, which measures the angular correlation between feature vectors. Because it is invariant to feature magnitude, this metric is robust to visual variations such as illumination changes:

$$\mathrm{Sim}_{\cos}(p_{q}, \mathrm{Proto}_{i}) = \frac{F_{p}^{Q}(p_{q}) \cdot \mathrm{Proto}_{i}}{\| F_{p}^{Q}(p_{q}) \| \cdot \| \mathrm{Proto}_{i} \|} \tag{17}$$

The second metric is the Euclidean distance, which calculates spatial discrepancies in the embedding space and remains highly sensitive to magnitude differences. To align this distance with a similarity formulation, the negative squared Euclidean distance is adopted:

$$\mathrm{Sim}_{euc}(p_{q}, \mathrm{Proto}_{i}) = -\| F_{p}^{Q}(p_{q}) - \mathrm{Proto}_{i} \|^{2} \tag{18}$$

where $p_{q}$ represents the spatial index of a specific patch within the query sample, and $F_{p}^{Q}(p_{q})$ denotes the local feature vector of that specific patch. The notation $\| \cdot \|$ denotes the standard L2 norm of a vector.

These two metrics capture feature relationships from directional and spatial perspectives, respectively. The cosine and Euclidean similarity maps are then fused through a weighted summation to obtain a unified patch-level score:

$$F_{q,p,i} = \alpha \cdot \mathrm{Sim}_{\cos}(p_{q}, \mathrm{Proto}_{i}) + (1 - \alpha) \cdot \mathrm{Sim}_{euc}(p_{q}, \mathrm{Proto}_{i}) \tag{19}$$

where $\alpha$ is a balancing hyperparameter. Finally, the fused scores across all patches of a query sample are averaged to construct the final patch-level similarity matrix $M_{p}$. As illustrated in Figure 5, this dual-metric fusion strategy effectively integrates directional correlation with spatial proximity, resulting in a more comprehensive and resilient similarity measure.
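The dual-metric scoring of Equations (17)–(19), followed by the patch averaging that yields the patch-level similarity matrix, can be sketched in NumPy as follows. This is a minimal sketch under assumptions: the value `alpha=0.5` is a placeholder (the balancing value is not stated in this section), the small `eps` guard is added for numerical safety, and the function name is hypothetical.

```python
import numpy as np

def dual_metric_scores(query_patches, protos, alpha=0.5, eps=1e-8):
    """query_patches: (n_query, n_patches, dim); protos: (n_way, dim).

    Returns the patch-level similarity matrix of shape (n_query, n_way).
    """
    # Eq. (17): cosine similarity between every query patch and every prototype.
    q_unit = query_patches / (np.linalg.norm(query_patches, axis=-1, keepdims=True) + eps)
    p_unit = protos / (np.linalg.norm(protos, axis=-1, keepdims=True) + eps)
    sim_cos = q_unit @ p_unit.T  # (n_query, n_patches, n_way)

    # Eq. (18): negative squared Euclidean distance.
    diff = query_patches[:, :, None, :] - protos[None, None, :, :]
    sim_euc = -np.sum(diff ** 2, axis=-1)  # (n_query, n_patches, n_way)

    # Eq. (19): weighted fusion, then the average over patches.
    fused = alpha * sim_cos + (1.0 - alpha) * sim_euc
    return fused.mean(axis=1)

# Tiny hand-checkable case: one query with two 2-D patches, two prototypes.
query_patches = np.array([[[1.0, 0.0], [0.0, 1.0]]])
protos = np.array([[1.0, 0.0], [0.0, 2.0]])
m_p = dual_metric_scores(query_patches, protos)  # approximately [[-0.25, -1.25]]
```

Note that the cosine term is bounded in [-1, 1] whereas the squared-distance term is unbounded, so the effective balance between the two depends on the feature scale as well as on the weighting.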

Algorithm 2 Synergistic metric calculation.
Input: Support features $F^{S} = \{F_{g}^{S}, F_{p}^{S}\}$, query features $F^{Q} = \{F_{g}^{Q}, F_{p}^{Q}\}$.
Output: Global score $M_{g}$, local score $M_{p}$.
Parameters: $\theta_{IGR}$ (relation module), $\theta_{PDM}$ (patch encoder).
1:	Step 1. Iterative Global Relation (IGR)
2:	$M_{supp} \leftarrow \mathrm{Encoder}(F_{g}^{S})$	▹ Contextualize support
3:	$Q_{curr} \leftarrow F_{g}^{Q}$
4:	for $t = 1$ to $T_{iter}$ do
5:	$Q_{curr} \leftarrow Q_{curr} + \lambda_{ref} \cdot \mathrm{Decoder}(Q_{curr}, M_{supp})$
6:	end for
7:	$M_{g} \leftarrow \mathrm{DotProduct}(Q_{curr}, M_{supp})$
8:	Step 2. Patch-level Dual-Metric (PDM)
9:	$\tilde{F}_{p}^{S}, \tilde{F}_{p}^{Q} \leftarrow \mathrm{PatchEncoder}(F_{p}^{S}, F_{p}^{Q}; \theta_{PDM})$
10:	$\mathrm{Proto}_{c} \leftarrow \frac{1}{K} \sum \tilde{F}_{p}^{S}$	▹ Compute class prototypes
11:	$S_{cos} \leftarrow \mathrm{CosineSim}(\tilde{F}_{p}^{Q}, \mathrm{Proto}_{c})$
12:	$S_{euc} \leftarrow -\mathrm{EuclideanDist}^{2}(\tilde{F}_{p}^{Q}, \mathrm{Proto}_{c})$
13:	$M_{p} \leftarrow \alpha \cdot S_{cos} + (1 - \alpha) \cdot S_{euc}$
14:	return $M_{g}, M_{p}$

3.3. Optimization and Training Strategy

3.3.1. Objective Function

The GRCD-Net is optimized in an end-to-end manner. To effectively synergize the global semantic consistency captured by the IGR module and the local fine-grained details refined by the PDM module, an adaptive score fusion strategy is employed prior to the loss calculation.

Let $M_{g} \in \mathbb{R}^{N_{q} Q \times N}$ denote the global similarity matrix and $M_{p} \in \mathbb{R}^{N_{q} Q \times N}$ denote the patch-level similarity matrix. Instead of optimizing them independently, the final classification logits $S$ are computed as a dynamically weighted summation:

$$\lambda_{fuse} = \sigma(\theta_{\lambda}), \quad S = \lambda_{fuse} \cdot M_{g} + (1 - \lambda_{fuse}) \cdot M_{p} \tag{20}$$

where $\sigma(\cdot)$ is the sigmoid function, and $\theta_{\lambda}$ is a learnable parameter initialized to 0 (yielding an initial $\lambda_{fuse} = 0.5$). This parameterization allows the network to automatically adjust the relative contribution of global context versus local cues during training, adapting to the task’s specific difficulty.
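A small plain-Python sketch of Equation (20) makes the initialization concrete: with the fusion parameter at 0, the sigmoid gives a weight of exactly 0.5, so both branches contribute equally at the start of training. The function name and the toy score values are illustrative assumptions.

```python
import math

def fuse_scores(m_g, m_p, theta_lambda=0.0):
    """Blend global and patch-level scores with a sigmoid-gated weight (Eq. 20)."""
    lam = 1.0 / (1.0 + math.exp(-theta_lambda))  # sigmoid keeps the weight in (0, 1)
    fused = [
        [lam * g + (1.0 - lam) * p for g, p in zip(row_g, row_p)]
        for row_g, row_p in zip(m_g, m_p)
    ]
    return lam, fused

# One query scored against two classes by each branch (toy values).
lam, fused = fuse_scores([[2.0, 0.0]], [[0.0, 1.0]])
# lam == 0.5 and fused == [[1.0, 0.5]]
```

Parameterizing the weight through a sigmoid means the optimizer can move the underlying parameter freely while the resulting weight always remains a valid convex combination coefficient.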

Based on the fused logits $S$, the probability of a query sample $x_{q}$ belonging to a support class $c$ is predicted via a temperature-scaled Softmax function:

$$P(y = c \mid x_{q}) = \frac{\exp(\tau \cdot S_{q,c})}{\sum_{k=1}^{N} \exp(\tau \cdot S_{q,k})} \tag{21}$$

where $\tau$ is a scaling factor that controls the sharpness of the probability distribution. The entire network is trained by minimizing the standard cross-entropy loss averaged over all query samples in the episode:

$$\mathcal{L}_{total} = -\frac{1}{|\mathcal{Q}|} \sum_{x_{q} \in \mathcal{Q}} \sum_{c=1}^{N} \mathbb{I}(y_{q} = c) \cdot \log P(y = c \mid x_{q}) \tag{22}$$

where $\mathbb{I}(\cdot)$ is the indicator function. Minimizing $\mathcal{L}_{total}$ backpropagates gradients through the fusion layer, jointly optimizing the backbone, the GRC interaction, the metric modules, and the fusion weight to maximize the discriminative power of the combined representation.
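Equations (21) and (22) together are a scaled softmax followed by the mean negative log-likelihood of the true class. The NumPy sketch below computes the loss in log space for numerical stability; the value of the scaling factor is a placeholder, since it is not specified in this section, and the function name is hypothetical.

```python
import numpy as np

def episode_loss(logits, labels, tau=1.0):
    """Mean cross-entropy over the queries of one episode.

    logits: (n_query, n_way) fused scores S; labels: (n_query,) integer classes.
    """
    scaled = tau * logits
    scaled = scaled - scaled.max(axis=1, keepdims=True)  # stabilize the exponentials
    log_probs = scaled - np.log(np.exp(scaled).sum(axis=1, keepdims=True))
    # The indicator in Eq. (22) simply selects the log-probability of the true class.
    return -log_probs[np.arange(len(labels)), labels].mean()

# Uninformative logits over five classes give a loss of log(5).
uniform_loss = episode_loss(np.zeros((2, 5)), np.array([0, 3]))

# For correctly ranked logits, a larger scaling factor sharpens the distribution
# and lowers the loss.
logits = np.array([[1.0, 0.0], [0.0, 1.0]])
labels = np.array([0, 1])
loss_soft = episode_loss(logits, labels, tau=1.0)
loss_sharp = episode_loss(logits, labels, tau=10.0)
```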

3.3.2. Episodic Training Procedure

The training process adheres to the standard episodic protocol for FSL, as outlined in Algorithm 3. Specifically, in each iteration, an N-way K-shot task $\mathcal{T}$ is sampled, comprising a support set $\mathcal{S}$ and a query set $\mathcal{Q}$.
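Episode sampling for an N-way K-shot task can be sketched with the Python standard library as follows. The dataset layout (a mapping from class name to a list of image identifiers), the toy data, and the function name are assumptions made for illustration only.

```python
import random

def sample_episode(labels_to_items, n_way=5, k_shot=1, n_query=15, rng=None):
    """Sample one N-way K-shot episode as (support, query) lists of (item, label)."""
    rng = rng or random.Random()
    # Pick N classes, then K + Q distinct items per class.
    classes = rng.sample(sorted(labels_to_items), n_way)
    support, query = [], []
    for episode_label, cls in enumerate(classes):
        items = rng.sample(labels_to_items[cls], k_shot + n_query)
        # Labels are re-indexed to 0..N-1 within the episode.
        support += [(item, episode_label) for item in items[:k_shot]]
        query += [(item, episode_label) for item in items[k_shot:]]
    return support, query

# Toy dataset: 8 classes with 20 image identifiers each.
toy = {f"class_{c}": [f"img_{c}_{i}" for i in range(20)] for c in range(8)}
support, query = sample_episode(toy, rng=random.Random(0))
# Five-way one-shot with 15 queries per class: 5 support and 75 query samples.
```

Drawing the support and query items of a class in a single call guarantees that no image appears in both sets of the same episode.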

The workflow proceeds sequentially: (1) extracting multi-scale features, where the GRC-enhanced backbone generates hierarchical features for all images; (2) computing similarity metrics, where the IGR and PDM modules independently calculate global and local similarity scores; and (3) fusing these scores via $\lambda_{fuse}$ to compute the final loss and update the model parameters. This end-to-end strategy ensures that the model learns robust, generalizable features capable of handling the diverse categories and complex background clutter characteristic of remote sensing scenes.

Algorithm 3 Training process of GRCD-Net.
Input: Training set $D_{train}$, hyperparameters $\tau, \eta$.
Output: Trained model parameters $\Theta$ (including $\theta_{\lambda}$).
1:	for each episode $\mathcal{T} = \{\mathcal{S}, \mathcal{Q}\}$ sampled from $D_{train}$ do
2:	Step 1. Feature Extraction (via Algorithm 1)
3:	$F_{g}^{S}, F_{p}^{S} \leftarrow \mathrm{GRCFeatureExtraction}(\mathcal{S})$
4:	$F_{g}^{Q}, F_{p}^{Q} \leftarrow \mathrm{GRCFeatureExtraction}(\mathcal{Q})$
5:	Step 2. Metric Computation (via Algorithm 2)
6:	$M_{g}, M_{p} \leftarrow \mathrm{MetricCalculation}(\{F_{g}^{S}, F_{p}^{S}\}, \{F_{g}^{Q}, F_{p}^{Q}\})$
7:	Step 3. Adaptive Score Fusion
8:	$\lambda_{fuse} \leftarrow \sigma(\theta_{\lambda})$
9:	$S \leftarrow \lambda_{fuse} \cdot M_{g} + (1 - \lambda_{fuse}) \cdot M_{p}$
10:	Step 4. Optimization
11:	Calculate probability $P$ via Softmax($\tau \cdot S$)
12:	Calculate loss $\mathcal{L}_{total}$ via Cross-Entropy($P, Y_{gt}$)
13:	Update parameters: $\Theta \leftarrow \Theta - \eta \cdot \nabla_{\Theta} \mathcal{L}_{total}$
14:	end for
15:	return $\Theta$

4. Experiment

This section describes the comprehensive experiments conducted to evaluate the effectiveness and robustness of the proposed GRCD-Net. First, Section 4.1 provides a description of the experimental setup, including the datasets and implementation details. Subsequently, the following four key research questions (RQs) are addressed through systematic evaluation:

RQ1: How does the foundational GRCD-Net architecture compare to state-of-the-art methods on standard FSFGIC and general FSL benchmarks?
RQ2: Can the proposed framework effectively tackle the severe background clutter inherent in Earth observation tasks, establishing its superiority in RSSC datasets?
RQ3: What is the individual contribution of each proposed module (i.e., GRC, IGR and PDM) to the final performance? Is the model robust to hyperparameter variations?
RQ4: Does the model effectively suppress complex background noise and focus on task-relevant regions, learning semantically meaningful embeddings as intended?

4.1. Experimental Setup

4.1.1. Datasets

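To comprehensively evaluate the performance and generalization capability of GRCD-Net, experiments were conducted on ten benchmarks categorized into three domains: FSFGIC (CUB-200-2011, Stanford Dogs, Stanford Cars, and Oxford Flowers), general FSL (mini-ImageNet, tiered-ImageNet, and FC100), and RSSC (NWPU-RESISC45, UC Merced, and WHU-RS19). The detailed statistics and splitting protocols for all datasets are summarized in Table 1.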

4.1.2. Implementation Details

The experiments were implemented using the PyTorch library, version 2.9.1 (PyTorch Foundation, Linux Foundation, San Francisco, CA, USA) and executed on an Ubuntu 22.04 LTS server (Canonical Ltd., London, UK). The hardware environment was equipped with an Intel Xeon Silver 4210 CPU (Intel Corporation, Santa Clara, CA, USA) @ 2.20 GHz, 64 GB of RAM, and an NVIDIA Tesla T4 GPU (NVIDIA Corporation, Santa Clara, CA, USA) with 16 GB of video memory. The specific configurations for the model architecture, data preprocessing, and training protocol are detailed below.

Model Configuration. The GRCD-Net adopted a Swin-Tiny (patch4, window7, 224) backbone pretrained on ImageNet-1K, which served as the foundational feature extractor. Feature maps from Stage 2 and Stage 3 were utilized to construct the local detail branch and the global semantic branch, respectively. The architecture incorporated $L = 3$ GRC blocks, and all Transformer-based components (GRC, IGR, PDM) shared an embedding dimension of $D = 256$ with four attention heads. The IGR module performed $T = 2$ iterative refinements with a query update rate of $\lambda_{refine} = 0.1$, while the PDM module employed a two-layer PatchEncoder for local patch enhancement.

Data Splitting and Preprocessing. To ensure direct comparability with existing methods, standard class splits defined in prior literature [7,49,50,54] were adopted for all benchmarks. Input images from both FSFGIC and RSSC were resized to $224 \times 224$ to preserve structural detail, whereas $84 \times 84$ inputs were used for general FSL benchmarks. Bilinear interpolation was applied where resolution adjustment was necessary. Data augmentation techniques such as random resized cropping, horizontal flipping and color jittering were employed during training to enhance robustness, while a single center-crop strategy was used for validation and testing. All images were normalized using the ImageNet mean and standard deviation. No explicit image-level denoising or preprocessing-based noise removal was applied in this work; background clutter suppression was handled by the proposed GRC block during feature representation learning.
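The evaluation-time preprocessing described above (center crop followed by ImageNet normalization) can be sketched in NumPy as shown below. The per-channel statistics are the commonly used ImageNet RGB values. The 256-pixel size of the already-resized input is an assumption, as the pre-crop resolution is not stated, and the helper names are hypothetical.

```python
import numpy as np

# Commonly used ImageNet per-channel statistics (RGB, for inputs scaled to [0, 1]).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def center_crop(image, size):
    """Crop a (H, W, C) array to (size, size, C) around its center."""
    height, width = image.shape[:2]
    top, left = (height - size) // 2, (width - size) // 2
    return image[top:top + size, left:left + size]

def normalize(image_uint8):
    """Scale uint8 pixels to [0, 1], then standardize each channel."""
    scaled = image_uint8.astype(np.float32) / 255.0
    return (scaled - IMAGENET_MEAN) / IMAGENET_STD

# Hypothetical already-resized image with a constant gray value of 128.
image = np.full((256, 256, 3), 128, dtype=np.uint8)
out = normalize(center_crop(image, 224))  # shape (224, 224, 3)
```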

Training Protocol. A two-stage episodic training strategy was executed. In the warm-up stage, the Swin-T backbone was frozen, and only lightweight modules (i.e., projection layers, GRC, IGR, PDM) were optimized for 600 episodes with a learning rate of $1 \times 10^{-4}$. In the full fine-tuning stage, the entire network underwent end-to-end optimization. Dataset-specific learning rates were applied (e.g., $5 \times 10^{-6}$ for CUB and $5 \times 10^{-5}$ for tiered-ImageNet). The model was optimized using AdamW with a weight decay of $1 \times 10^{-2}$. The maximum gradient norm was set to 1.0, while automatic mixed precision (AMP) was adopted to accelerate training.

Evaluation. Performance is reported as the mean accuracy with a 95% confidence interval over 600 randomly sampled five-way episodes. For general FSL datasets and FSFGIC datasets, 15 query samples per class were used, whereas five query samples per class were employed for RSSC benchmarks.
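For reference, a mean accuracy with a 95% confidence interval over a set of episodes is commonly computed with a normal approximation, as sketched below. The exact procedure used for the reported intervals is not stated here, so the factor 1.96 and the use of the sample standard deviation are assumptions, and the function name and toy accuracies are illustrative.

```python
import math
import statistics

def mean_with_ci95(episode_accuracies):
    """Return (mean, half_width) of a normal-approximation 95% confidence interval."""
    n = len(episode_accuracies)
    mean = statistics.fmean(episode_accuracies)
    # Standard error of the mean, scaled by the 97.5th percentile of N(0, 1).
    half_width = 1.96 * statistics.stdev(episode_accuracies) / math.sqrt(n)
    return mean, half_width

# Toy example with 600 episodes alternating between 80% and 90% accuracy.
accuracies = [0.8, 0.9] * 300
mean, half_width = mean_with_ci95(accuracies)
print(f"{100 * mean:.2f} ± {100 * half_width:.2f}%")
```

Because the half-width shrinks with the square root of the episode count, 600 episodes give an interval roughly 2.4 times narrower than 100 episodes at the same per-episode variance.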

4.2. Main Results on Foundational Benchmarks

To address RQ1, a comprehensive quantitative comparison with state-of-the-art methods was initially conducted on standard FSFGIC and general FSL benchmarks. This initial phase was strictly designed to validate the foundational feature extraction and metric learning capabilities of the GRCD-Net architecture before its application to complex aerial scenes. The evaluation covered two distinct scenarios: fine-grained classification (e.g., CUB-200-2011 and Stanford Cars), which challenged the ability to discern subtle inter-class differences, and general few-shot learning (e.g., mini-ImageNet), which assessed broad generalization capabilities. Subsequently, having established the robustness of the proposed framework, the evaluation transitioned to the primary target domain, RSSC, to demonstrate its practical applicability in Earth observation. The comparative results on FSFGIC and general FSL, reported in terms of top-one accuracy under standard five-way one-shot and five-way five-shot settings, are detailed in Table 2 and Table 3.

4.2.1. Performance on Fine-Grained Datasets

The effectiveness of the proposed framework was first validated on four challenging FSFGIC benchmarks: CUB-200-2011, Stanford Dogs, Stanford Cars, and Oxford Flowers. As shown in Table 2, GRCD-Net demonstrates a strong capability in capturing discriminative local cues despite significant intra-class variations. On the Oxford Flowers dataset, which is characterized by high intra-class variance and subtle inter-class texture differences, the proposed method achieves an accuracy of 88.05% (one-shot) and 96.39% (five-shot). This performance significantly outperforms the previous state-of-the-art SaberNet [17] by +3.72% and +2.20%, respectively. The substantial improvement indicates that the dual-branch architecture effectively captures fine-grained texture details essential for differentiating non-rigid natural objects.

Similarly, on Stanford Dogs, GRCD-Net achieves exceptionally high accuracy (92.43% one-shot/98.33% five-shot). It should be noted that this performance boost is partially attributable to the domain overlap between the ImageNet-1K pretraining dataset and the Stanford Dogs test set. Nevertheless, even allowing for this shared context, the significant margin over the graph-based STransGNN [43] (+7.22% in one-shot) points to the model’s promising capability in fine-grained domain adaptation, suggesting that the GRC and IGR modules effectively refine general features into task-specific representations without over-smoothing.

Regarding the Stanford Cars dataset, GRCD-Net achieves competitive results with 84.81% (one-shot) and 95.51% (five-shot). While the one-shot accuracy is slightly lower than that of some ResNet-based methods (e.g., BiFRN [16]), it surpasses the baseline SaberNet by a remarkable margin (+8.10%). Following the analysis in SaberNet [17], the remaining gap to ResNet-based methods is attributed to rigid objects like cars exhibiting extreme pose variations (e.g., frontal versus side views). Such geometric shifts challenge standard Vision Transformers, as they inherently prioritize semantic abstractions over structural geometry. However, the strong performance of GRCD-Net suggests that the PDM module effectively compensates for this limitation by enforcing explicit geometric alignment, which is crucial for recognizing rigid targets (e.g., vehicles, buildings) in aerial imagery.

Finally, on the highly competitive CUB-200-2011 benchmark, the model attains 90.51% accuracy in the one-shot setting and matches state-of-the-art methods (95.81%) in the five-shot setting. These consistent gains across diverse domains confirm that the proposed global–local guidance mechanism provides a robust foundation for fine-grained recognition, successfully mitigating the interference of complex backgrounds even in natural image scenarios.

4.2.2. Performance on General FSL Benchmarks

Beyond fine-grained domains, the architectural robustness of GRCD-Net was further evaluated on three classical FSL benchmarks: mini-ImageNet, tiered-ImageNet, and FC100. These datasets helped assess the model’s ability to generalize across broad semantic categories and varying image resolutions. The quantitative results are summarized in Table 3.

GRCD-Net consistently achieves state-of-the-art performance across both one-shot and five-shot settings. Specifically, on mini-ImageNet, the framework attains 78.11% (one-shot) and 89.55% (five-shot). Crucially, when compared to FewTURE [35], which also utilizes a Swin-Tiny backbone, GRCD-Net achieves substantial gains of +5.71% and +3.17%, respectively. This direct comparison isolates the contribution of the proposed GRC and metric learning modules, indicating that the performance boost stems from the architectural design rather than the backbone capability. A similar trend is observed on the larger-scale tiered-ImageNet, where the model achieves 83.10% and 90.15%, again outperforming prior methods and verifying its scalability to larger semantic hierarchies.

The evaluation on FC100 is particularly noteworthy due to its lower image resolution ($32 \times 32$), which poses a challenge for preserving discriminative features. GRCD-Net reaches 55.31% (one-shot) and 65.73% (five-shot) on this benchmark. These results demonstrate that the GRC mechanism effectively retains critical spatial information even when feature resolution is limited. This characteristic is theoretically advantageous for downstream tasks involving small target recognition, such as those frequently encountered in aerial and satellite imagery.

In summary, the competitive performance across these diverse general benchmarks confirms that the synergistic metric learning strategy of GRCD-Net is not limited to fine-grained distinctions but provides a robust foundation for broad category generalization.

4.3. Application to Remote Sensing Scene Classification

To answer RQ2, the evaluation was extended to the challenging field of RSSC. Unlike natural images focused on centered objects, aerial imagery is characterized by small targets submerged in highly complex terrestrial backgrounds and extreme scale variations. Three benchmarks—NWPU-RESISC45, UC Merced, and WHU-RS19—were employed to verify if the proposed global–local guidance mechanism could effectively filter environmental noise and generalize to Earth observation tasks.

4.3.1. Quantitative Analysis on Remote Sensing Scene Classification

As presented in Table 4, GRCD-Net exhibits remarkable robustness on the large-scale and highly challenging NWPU-RESISC45 benchmark. The method achieves 81.39% (one-shot) and 92.81% (five-shot), outperforming the recent state-of-the-art RSSC method TSDN4 [40] by a substantial margin of +7.55% and +4.95%, respectively. This decisive advantage strongly validates the core hypothesis: the GRC block’s top-down guidance is exceptionally effective at suppressing environmental noise and locating semantic targets within highly cluttered and large-scale aerial scenes.

On the much smaller-scale UC Merced and WHU-RS19 datasets, GRCD-Net remains highly competitive, surpassing the attention-based ACL-Net [7] in most settings. It is observed that highly specialized methods, such as TSDN4 and DEADN4, achieve higher accuracy on these specific small datasets. This performance dynamic is theoretically consistent with architectural differences. Transformer-based frameworks, such as GRCD-Net, rely on data-driven global relational modeling. They inherently lack the strong, localized inductive biases of lightweight CNNs, making them slightly prone to under-optimization on extremely small and saturated datasets (e.g., WHU-RS19 contains only 1000 images).

However, modern Earth observation applications demand models capable of handling large-scale data and high scene diversity. The significant performance leap on the comprehensive NWPU-RESISC45 dataset suggests that GRCD-Net possesses superior scalability and generalization capabilities for complex, real-world remote sensing challenges.

4.3.2. Visual Analysis of Background Suppression

To visually validate the effectiveness of the proposed global–local guidance mechanism in highly complex Earth observation scenarios, a background suppression analysis was conducted using the NWPU-RESISC45 dataset. As illustrated in Figure 6, attention maps were compared across four distinct stages: (1) the original input aerial image, (2) the initial global attention (representing the model’s focus without spatial guidance), (3) the final guided attention after processing by the GRC block, and (4) the final fine-grained saliency map, which precisely highlights the most discriminative details for classification.

In the global-context visualizations, the attention distributions are typically diffuse and heavily distracted by salient, yet irrelevant, background elements. For instance, in the Storage Tank scene, the initial attention incorrectly encompasses not only the target tanks but also the surrounding road networks and irrelevant industrial infrastructure. Similarly, in the Tennis Court example, the unguided model broadly attends to the entire recreational area, failing to isolate the courts from adjacent vegetation and pathways.

In this context, the term “noise” refers to semantic background interference rather than sensor-level degradation. Such interference does not need to be removed through explicit image-level denoising; instead, it should be suppressed during feature representation learning. Therefore, the background suppression discussed here is interpreted as a feature-level refinement process, where irrelevant regions are down-weighted by the global–local guidance mechanism.

Conversely, following the application of the global–local guidance mechanism, the guided-attention maps exhibit a remarkable capability to suppress severe terrestrial background noise. The feature focus tightens precisely around the critical semantic targets: the distinctive circular geometries of the storage tanks are sharply isolated from the concrete ground, and the tennis courts are clearly delineated from the surrounding natural landscape. This visual evidence confirms that GRCD-Net effectively leverages global semantic context as a spatial gate to filter out complex background clutter. Crucially, it demonstrates the exceptional suitability of the proposed architecture for real-world aerial image analysis, where mitigating background interference is paramount for accurate scene classification.

Furthermore, the final fine-grained saliency maps (Row 4) demonstrate the ultimate discriminative power of the proposed framework. While the guided attention successfully isolates the general target regions, the fine-grained saliency precisely pinpoints the most critical local geometric structures, such as the specific boundary lines of the tennis courts or the distinct structural edges of the storage tanks. This visualization confirms that after effectively suppressing the background via the GRC block, the subsequent PDM module successfully extracts the subtle, patch-level details required for accurate classification in highly ambiguous remote sensing scenarios.

4.4. Ablation Study and Model Analysis

To answer RQ3, the GRCD-Net framework underwent systematic ablation studies. To isolate the algorithmic improvements of each module from the compounding complexities of aerial imaging, these fundamental analyses were conducted on standard FSFGIC datasets (i.e., CUB-200-2011 and Stanford Cars) and a highly challenging remote sensing benchmark (i.e., NWPU-RESISC45). Specifically, the impact of each module (i.e., GRC, IGR, and PDM) was isolated to quantify its respective performance gain. Additionally, a sensitivity analysis was performed on key hyperparameters (e.g., the number of GRC blocks and fusion weights) to confirm the stability of the design.

4.4.1. Effectiveness of Key Components

To rigorously quantify the contribution of each module, a component-wise ablation study was performed on three representative datasets: CUB-200-2011 (representing non-rigid biological objects with subtle features), Stanford Cars (representing rigid man-made objects with geometric structures), and NWPU-RESISC45 (representing highly cluttered Earth observation scenes). This selection allowed for a multifaceted evaluation of the model’s capability to handle texture-based, structure-based, and noise-heavy recognition tasks.

Starting from a baseline (M1) that utilized a dual-branch backbone with simple late-fusion and a standard prototypical network, the proposed components, i.e., GRC, PDM, and IGR, were progressively integrated. The quantitative results are detailed in Table 5.

Impact of the GRC Block (M2): The integration of the GRC block (M2) yields a substantial performance gain. On Stanford Cars, accuracy rises by +4.51% (77.29% → 81.80%), and on the NWPU-RESISC45 dataset, which contains complex background clutter, it improves by +3.43% (73.32% → 76.75%). This gain supports the efficacy of the top-down spatial guidance mechanism. By proactively filtering out irrelevant global noise before feature fusion, the GRC block helps the model focus on discriminative object parts (e.g., headlights of cars or isolated semantic targets in aerial views). This capability is paramount in remote sensing imagery, where separating small semantic targets from complex terrestrial clutter dictates the success of few-shot recognition.

Impact of the PDM Module (M3): Replacing the basic Euclidean metric with the PDM module (M3) further boosts performance by +2.50% on Cars, +1.57% on CUB, and +1.43% on NWPU-RESISC45. This consistent improvement highlights the necessity of the dual-metric strategy. While the cosine similarity captures directional texture patterns that are highly beneficial for non-rigid targets like birds, the Euclidean distance imposes strict constraints on spatial magnitude, an attribute crucial for recognizing rigid objects. This result confirms that GRCD-Net effectively models rigid geometric correspondences, thereby facilitating the recognition of man-made infrastructure (e.g., storage tanks, bridges) in remote sensing tasks that exhibit diverse orientations and scales.

Impact of the IGR Module (M4): Finally, the IGR module (M4) is introduced to refine the global semantic context. This addition further improves the overall performance, particularly on NWPU-RESISC45, where the accuracy increases by +3.21% (78.18% → 81.39%). By iteratively modeling the relationships among all samples in an episode, the IGR module corrects potential local misalignments and enforces global semantic consistency.
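The NWPU-RESISC45 one-shot figures quoted in the three paragraphs above can be chained to check that the per-module gains are mutually consistent. The short Python sketch below uses only those quoted accuracies; the dictionary keys are informal labels, not names from Table 5.

```python
# One-shot accuracies on NWPU-RESISC45 as quoted in the ablation text.
nwpu_one_shot = {
    "M1 (baseline)": 73.32,
    "M2 (+GRC)": 76.75,
    "M3 (+PDM)": 78.18,
    "M4 (+IGR)": 81.39,
}

names = list(nwpu_one_shot)
# Gain of each configuration over the previous one, in percentage points.
gains = {
    names[i]: round(nwpu_one_shot[names[i]] - nwpu_one_shot[names[i - 1]], 2)
    for i in range(1, len(names))
}
total_gain = round(nwpu_one_shot[names[-1]] - nwpu_one_shot[names[0]], 2)

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f}")
print(f"total: +{total_gain:.2f}")
```

On this dataset the GRC and IGR steps each contribute more than three percentage points, while the PDM step contributes about one and a half, for a cumulative gain of 8.07 points over the baseline.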

4.4.2. Hyperparameter Sensitivity Analysis

To evaluate the stability and robustness of the GRCD-Net framework, a systematic sensitivity analysis was conducted regarding three critical architectural hyperparameters: (1) the depth of GRC blocks, (2) the number of IGR iterations, and (3) the depth of the local patch encoder.

All experiments in this section were performed on the CUB-200-2011 dataset under the standard five-way one-shot setting. To facilitate efficient comparison across multiple configurations, an accelerated training schedule of 1000 episodes (including a 300-episode warm-up) was utilized. While this shortened protocol did not reach full convergence, it provided reliable and consistent performance trends for hyperparameter selection. The results are summarized in Figure 7.

As illustrated in Figure 7a, performance improves as the number of GRC blocks increases from one to three, peaking at three blocks. This trend indicates that stacking interaction layers allows the model to progressively refine local features using global context. However, adding a fourth block leads to a slight degradation in accuracy. This suggests that while deeper interaction enhances feature expressiveness, excessive depth may induce feature over-smoothing, where discriminative local details are diluted by repeated global aggregation.

Figure 7b examines the effect of the iterative reasoning steps in the IGR module. It is observed that performance improves significantly when moving from one to two iterations, confirming the value of recursive context modeling. However, increasing iterations beyond two yields negligible gains. Consequently, $T = 2$ is identified as the optimal trade-off between semantic refinement quality and computational efficiency.

Finally, Figure 7c analyzes the depth of the local patch encoder. The accuracy peaks at a depth of two layers. This finding suggests that moderate contextual enhancement is beneficial for distinguishing local patterns. Conversely, deeper encoders risk overfitting to high-frequency local noise, thereby weakening the semantic consistency required for effective alignment with the global branch.

Based on these empirical findings, the default configuration for all experiments was established as: three GRC blocks, two IGR iterations, and a two-layer patch encoder.

4.5. Computational Complexity and Efficiency Analysis

To provide a more complete assessment of the trade-off between classification accuracy and computational cost, the proposed GRCD-Net was compared with representative methods in terms of model parameters, floating-point operations (FLOPs), inference time, and one-shot classification accuracy. The selected baselines covered different architectural designs, including CNN-based, Transformer-based, graph-based, remote-sensing-specific, and vision–language-assisted methods.

Following the standard five-way one-shot setting, each episode contained 80 images, including support and query samples. Parameters, FLOPs, and inference time were evaluated under the same episodic setting, and the results are reported together with the corresponding one-shot accuracy on CUB, mini-ImageNet, NWPU-RESISC45, and WHU-RS19, as summarized in Table 6. The inference time was averaged over 100 episodes in our implementation.

As shown in Table 6, GRCD-Net contains 35.80M parameters. Although this is higher than some lightweight Transformer-based baselines, such as CPEA (21.81M), it remains lower than DLA-MatchNet (50.91M) and substantially smaller than SemFew (>200M), which incorporates additional pretrained vision–language components. This indicates that the proposed global–local relational design introduces a moderate increase in parameter size while avoiding an excessively large model scale.

In terms of computational cost, GRCD-Net requires 413.03 GFLOPs per episode. This value is slightly lower than that of STransGNN under the evaluated setting, although GRCD-Net has more parameters. One possible reason is that graph-based methods may involve additional operations for modeling relational structures, whereas the proposed framework mainly relies on attention-based feature interaction and matrix operations that are well supported by modern deep learning libraries. Nevertheless, the difference in FLOPs should be interpreted as part of the overall computational trade-off rather than as a standalone efficiency advantage.

Regarding inference time, GRCD-Net processes one 80-image episode in 41.40 ms in our implementation. Although CPEA reports a shorter inference time, GRCD-Net achieves higher accuracy on the evaluated datasets, especially on the RSSC benchmarks. These results suggest that the proposed method provides a reasonable balance between classification performance and computational cost.
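The episode size and the amortized per-image cost follow directly from the reported numbers, as the short Python calculation below shows. It assumes 15 query samples per class, which matches the stated 80 images per five-way one-shot episode; the per-image values are simple averages, not separately measured quantities.

```python
# Five-way one-shot episode with 15 queries per class.
n_way, k_shot, n_query = 5, 1, 15
images_per_episode = n_way * (k_shot + n_query)  # 5 support + 75 query = 80

# Reported per-episode cost for GRCD-Net.
episode_ms = 41.40
episode_gflops = 413.03

ms_per_image = episode_ms / images_per_episode          # 0.5175 ms
gflops_per_image = episode_gflops / images_per_episode  # about 5.16 GFLOPs

print(images_per_episode, round(ms_per_image, 4), round(gflops_per_image, 2))
```

These averages are only indicative, since the relational modules operate on the episode as a whole and their cost does not divide evenly across images.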

Overall, the complexity and efficiency analyses show that GRCD-Net improves few-shot classification performance while maintaining manageable computational complexity. This additional evaluation provides a more complete understanding of the practical trade-off between accuracy, model size, FLOPs, and inference time.

4.6. Qualitative Visualization

To address RQ4 and provide interpretability into the network’s internal decision-making process, a series of qualitative visualizations is presented. These analyses aim to clarify how the model organizes feature representations and locates semantic targets. Specifically, the evaluation was conducted from two complementary perspectives: (1) attention saliency analysis, which visually verifies the background suppression capability of the GRC block; and (2) embedding space visualization, which employs t-SNE and PCA to demonstrate the intra-class compactness and inter-class separability achieved by the synergistic metric learning strategy.

4.6.1. Attention and Saliency Visualization

Figure 8 illustrates the progressive attention refinement process within the GRCD-Net framework. Representative examples from CUB-200-2011 (non-rigid biological objects) and Stanford Cars (rigid man-made objects) were selected to demonstrate the model’s robustness across diverse structural domains. The visualization is decomposed into four stages:

First, the top row displays the original input images, which are deeply embedded in complex natural or artificial backgrounds, establishing the baseline difficulty of the classification task.

Subsequently, as observed in the second row, the initial attention heatmaps generated by the backbone exhibit only coarse object localization. The focus is typically diffuse and heavily distracted by salient but irrelevant environmental clutter, such as tree branches in the bird images or road markings in the car scenes. This confirms that without explicit guidance, the feature extractor struggles to distinguish the semantic target from its surroundings.

The third row visualizes the attention distribution after modulation by the GRC block’s global–local guidance mechanism. A decisive improvement is evident: the attention focus tightens significantly around the object’s core structure, effectively filtering out the previously distracting background noise. This visual evidence corroborates the GRC block’s ability to leverage global semantic context as a spatial gate to purify local representations.
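The idea of using global context as a spatial gate can be illustrated with a minimal NumPy sketch. This is not the GRC block itself: the gate here is assumed to be a sigmoid over the scaled dot product between each local token and a pooled global vector, with no learned projections, and the function name `global_guided_gate` and the synthetic "foreground" and "background" tokens are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def global_guided_gate(local_tokens, global_vec):
    """Gate local tokens by their agreement with a global context vector.

    local_tokens: (N, D) array of patch features.
    global_vec:   (D,) pooled global descriptor.
    Returns (gated_tokens, gate), where gate has shape (N,) and lies in (0, 1).
    Assumption: a plain scaled dot product stands in for any learned scoring.
    """
    dim = local_tokens.shape[1]
    scores = local_tokens @ global_vec / np.sqrt(dim)
    gate = sigmoid(scores)
    return local_tokens * gate[:, None], gate


# Synthetic tokens: four aligned with the global context, four opposed to it.
rng = np.random.default_rng(0)
global_vec = rng.normal(size=16)
foreground = global_vec + 0.1 * rng.normal(size=(4, 16))
background = -global_vec + 0.1 * rng.normal(size=(4, 16))
tokens = np.vstack([foreground, background])

gated, gate = global_guided_gate(tokens, global_vec)
print(gate.round(2))
```

In this toy setting the tokens that agree with the global descriptor keep most of their magnitude, while the opposing tokens are scaled toward zero, which mirrors the tightening of attention described for the third row.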

Most importantly, the fourth row displays the final fine-grained saliency maps. For CUB-200-2011 (see Figure 8a), the model successfully zooms in on minute biological details crucial for species identification—such as the distinctive red crest of the woodpecker or specific wing patterns—rather than merely attending to the bird’s center. Similarly, for Stanford Cars (see Figure 8b), the saliency maps precisely highlight rigid, discriminative components like headlights, grilles, and wheel rims (e.g., the iconic kidney grille of the BMW or the rugged tires of the Jeep). This confirms that GRCD-Net effectively learns to capture task-relevant fine-grained cues while suppressing complex background interference, a capability that is theoretically transferable to identifying small targets in cluttered aerial imagery.

4.6.2. Embedding Space and Metric Separability Analysis

To rigorously assess the geometric properties of the learned feature manifold, the embedding space was analyzed using both uniform manifold approximation and projection (UMAP) and cosine similarity matrices.

Figure 9 projects the L_{2}-normalized embeddings of a randomly sampled five-way one-shot episode into a 2D space. Unlike linear projection methods, UMAP preserves the nonlinear local topological structure of the high-dimensional feature space. The visualization reveals that GRCD-Net constructs a highly structured metric space. The support prototypes (large stars) serve as distinct semantic anchors, well separated from one another. Importantly, the query samples (small dots) form highly dense topological clusters that tightly align with their corresponding prototypes. This indicates that the synergistic metric learning strategy effectively minimizes intra-class variance while maximizing inter-class separability. Consequently, it ensures reliable nearest-neighbor classification even with limited samples.
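The link between compact clusters and nearest-neighbor classification can be sketched on synthetic data. The example below L2-normalizes embeddings and assigns each query to its closest prototype; the 8-dimensional embeddings, the noise level, and the function names are assumptions made for illustration, and the UMAP projection itself is not reproduced here.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm (eps guards against division by zero)."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def nearest_prototype(queries, prototypes):
    """Return, for each query, the index of the closest prototype."""
    q = l2_normalize(queries)
    p = l2_normalize(prototypes)
    # Pairwise squared Euclidean distances, shape (num_queries, num_classes).
    d2 = ((q[:, None, :] - p[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)


# Synthetic five-way one-shot episode: one support embedding per class.
rng = np.random.default_rng(0)
prototypes = 3.0 * np.eye(5, 8)          # hypothetical 8-D class anchors
labels = np.repeat(np.arange(5), 3)      # three queries per class
queries = prototypes[labels] + 0.05 * rng.normal(size=(15, 8))

predictions = nearest_prototype(queries, prototypes)
accuracy = (predictions == labels).mean()
print(accuracy)
```

When queries sit close to their own prototype and prototypes are far apart, as in this toy episode, the nearest-prototype rule recovers the labels; overlapping clusters would degrade it.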

The discriminative power of the learned feature manifold is further quantified through the cosine similarity matrices visualized in Figure 10. Plotted on the same colormap scale, both matrices exhibit a distinct block-diagonal pattern: bright yellow diagonal regions indicate high similarity between samples of the same class, while the darker off-diagonal regions indicate low similarity across classes. Specifically, the matrix for the texture-dominant CUB dataset (Figure 10a) displays sharp contrast against a dark background, reflecting the model’s ability to leverage unique biological textures for clean separation. In contrast, the matrix for the structure-dominant Stanford Cars dataset (Figure 10b) exhibits higher off-diagonal values (lighter background), an expected phenomenon given that rigid objects share significant geometric similarities (e.g., wheels, windows) across categories. However, the diagonal blocks remain dominant and distinct despite this high baseline similarity. This result is relevant to remote sensing, as it suggests that the model can distinguish spectrally and structurally similar man-made targets (e.g., different industrial plants or dense residential blocks) by capturing subtle, discriminative fine-grained cues, even in scenarios with severe inter-class visual overlap.
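A block-diagonal pattern can also be summarized numerically by comparing the mean within-class and between-class cosine similarity. The sketch below does this on two synthetic feature sets: one with orthogonal class centers, and one where every class shares a common component, loosely imitating categories that have parts in common. The data, the offset of 0.3, and the function names are illustrative assumptions, not values from the paper.

```python
import numpy as np


def cosine_similarity_matrix(x):
    """Pairwise cosine similarity between the rows of x."""
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    return unit @ unit.T


def block_contrast(sim, labels):
    """Mean similarity within classes (excluding self-pairs) and between classes."""
    same_class = labels[:, None] == labels[None, :]
    not_self = ~np.eye(len(labels), dtype=bool)
    within = sim[same_class & not_self].mean()
    between = sim[~same_class].mean()
    return within, between


rng = np.random.default_rng(0)
labels = np.repeat(np.arange(5), 3)      # five classes, three samples each
centers = np.eye(5, 10)                  # orthogonal class centers
noise = 0.05 * rng.normal(size=(15, 10))

distinct = centers[labels] + noise         # classes share nothing
shared = centers[labels] + 0.3 + noise     # classes share a common component

w_distinct, b_distinct = block_contrast(cosine_similarity_matrix(distinct), labels)
w_shared, b_shared = block_contrast(cosine_similarity_matrix(shared), labels)
print(w_distinct, b_distinct)
print(w_shared, b_shared)
```

The shared-component set yields a higher between-class baseline, yet the within-class similarity stays above it, which is the same qualitative behavior described for the lighter off-diagonal background with dominant diagonal blocks.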

4.6.3. Evolution of Feature Manifolds

To visually substantiate the incremental contributions of the proposed modules, the evolution of the learned feature space was analyzed using t-distributed stochastic neighbor embedding (t-SNE). Figure 11 visualizes the feature distributions of three key model variants on CUB-200-2011 and Stanford Cars.

As observed in the left column, relying solely on the GRC-enhanced backbone yields relatively diffuse clusters. While the GRC block effectively filters background noise, the resulting feature space still exhibits significant inter-class overlap. This is particularly evident in the Stanford Cars dataset (Figure 11b, left), where the purple and orange clusters are heavily entangled. Such severe overlap indicates that without explicit geometric constraints, the model struggles to handle the large intra-class pose variations typical of rigid objects, a challenge that mirrors the difficulty of recognizing arbitrarily oriented targets in aerial imagery.

With the integration of the PDM module, a dramatic improvement in manifold structure is observed. The previously scattered points condense into tighter, distinguishable groups. Notably, on Stanford Cars, the chaotic overlap is largely resolved, confirming that the dual-metric strategy (i.e., cosine + Euclidean) successfully enforces local geometric alignment. This capability to align rigid features regardless of pose is critical for distinguishing spectrally similar man-made structures.
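To make the complementarity of the two metrics concrete, the following sketch combines cosine similarity and Euclidean distance into one score. The weighting scheme (`alpha` on cosine similarity and `1 - alpha` on negative distance) and the function name are assumptions chosen for illustration; the paper's PDM module may weight or normalize the two terms differently.

```python
import numpy as np


def dual_metric_scores(query, prototypes, alpha=0.5):
    """Blend directional and distance-based similarity (higher is better).

    alpha weights cosine similarity; (1 - alpha) weights negative
    Euclidean distance. This weighting is a hypothetical choice.
    """
    q_norm = np.linalg.norm(query)
    p_norm = np.linalg.norm(prototypes, axis=1)
    cosine = prototypes @ query / (p_norm * q_norm + 1e-12)
    euclid = np.linalg.norm(prototypes - query, axis=1)
    return alpha * cosine - (1.0 - alpha) * euclid


query = np.array([1.0, 0.0])
prototypes = np.array([
    [2.0, 0.0],   # same direction as the query, but farther away
    [1.2, 0.3],   # slightly rotated, but much closer
    [0.0, 1.0],   # orthogonal to the query
])

cosine_only = dual_metric_scores(query, prototypes, alpha=1.0).argmax()
euclid_only = dual_metric_scores(query, prototypes, alpha=0.0).argmax()
combined = dual_metric_scores(query, prototypes, alpha=0.5).argmax()
print(cosine_only, euclid_only, combined)
```

In this toy case the two metrics disagree: cosine similarity alone prefers the perfectly aligned but distant prototype, while Euclidean distance alone prefers the nearby one, so a combined score has to trade off orientation against magnitude.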

Finally, the full GRCD-Net refines the manifold to its optimal state. The IGR module introduces global semantic constraints that further maximize inter-class margins. As shown in Figure 11a (right), the clusters become more compact, with clearer inter-class margins. Similarly, on Stanford Cars, the decision boundaries become sharp and clear. This progression demonstrates that the synergistic integration of local alignment driven by the PDM module and global reasoning enabled by the IGR module transforms an initially overlapping feature space into a more discriminative one. As a result, the model shows improved generalization in complex fine-grained scenarios.

5. Conclusions

In this paper, the core challenges of FSFGIC and RSSC were addressed through the proposed GRCD-Net. The results indicate that effective fine-grained recognition in complex environments hinges not merely on extracting multi-scale features but on establishing a structured interaction between them. The key innovation, the GRC block, realizes this by introducing a global–local guidance mechanism, where global semantic context proactively refines local details before feature fusion. This design ensures that the model precisely focuses on discriminative, task-relevant targets while robustly suppressing severe environmental and terrestrial background noise. Furthermore, it was demonstrated that the synergistic integration of the IGR module and the PDM module was essential for accurately quantifying similarities across both global semantics and local spatial structures. Extensive experiments confirmed the superiority of this approach. GRCD-Net achieved new state-of-the-art performance not only on foundational FSFGIC benchmarks (e.g., CUB-200-2011, Oxford Flowers, and Stanford Cars) but also on highly challenging RSSC datasets, notably NWPU-RESISC45. These results highlight the immense potential and robust generalization capability of GRCD-Net for practical Earth observation applications, particularly in scenarios characterized by data scarcity and highly cluttered aerial backgrounds.

Future work will focus on extending the proposed feature-level background suppression mechanism to continuous, arbitrary-resolution large-scale satellite imagery, reducing the reliance on fixed-size patch-based inference. Lightweight architectures, multi-modal data integration, and Earth-observation-oriented self-supervised pretraining will also be explored to improve efficiency and generalization.

Author Contributions

Conceptualization, J.L. and Y.D.; methodology, Y.D.; software, Y.D. and L.S.; validation, Y.D., X.L., and Y.S.; formal analysis, Y.D. and X.S.; investigation, Y.D., L.S., and X.L.; resources, J.L.; data curation, Y.D. and Y.S.; writing—original draft, Y.D.; writing—review and editing, J.L., Y.D., L.S., X.L., Y.S., and X.S.; visualization, Y.D. and X.S.; supervision, J.L. and R.Z.; project administration, J.L.; funding acquisition, J.L., L.S., Y.S., X.S., and R.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported in part by the National Natural Science Foundation of China (No. 62271193), Key Research Project Plan for Basic Research Special Fund of Higher Education Institutions in Henan Province (No. 25ZX009), Key Program of Natural Science Foundation of Henan Province (No. 252300421295), Key Research and Development and Promotion of Special (Science and Technology) Project of Henan Province (Nos. 242102211031, 242102240030, 262102211129), and Key Scientific Research Project of Higher Education Institutions in Henan Province (No. 24B520010).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article. Furthermore, publicly available benchmark datasets were analyzed in this work, and their access details can be found in the corresponding references cited within the manuscript.

Acknowledgments

During the preparation of this work, the authors used Gemini 3.1 Pro in order to improve the English language, grammar, and overall readability of certain sections of the manuscript. After using this tool, the authors rigorously reviewed and edited the content as needed, and take full responsibility for the scientific accuracy and integrity of the publication.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

Cheng, G.; Xie, X.; Han, J.; Guo, L.; Xia, G.S. Remote sensing image scene classification meets deep learning: Challenges, methods, benchmarks, and opportunities. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 3735–3756.
Zhang, L.; Lan, M.; Zhang, J.; Tao, D. Stagewise unsupervised domain adaptation with adversarial self-training for road segmentation of remote-sensing images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5609413.
Qiu, C.; Zhang, X.; Tong, X.; Guan, N.; Yi, X.; Yang, K.; Zhu, J.; Yu, A. Few-shot remote sensing image scene classification: Recent advances, new baselines, and future trends. ISPRS J. Photogramm. Remote Sens. 2024, 209, 368–382.
Yao, X.; Cao, Q.; Feng, X.; Cheng, G.; Han, J. Scale-aware detailed matching for few-shot aerial image semantic segmentation. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5611711.
Gong, X.; Luo, Y.; Chen, W.; Chang, Y.; Wan, Y.; Ma, A.; Zhong, Y. BASHVS: A Multispectral and SAR Image Fusion Method Based on Bidirectional Aggregation of Saliency in Human Visual System. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5405615.
Yang, S.; Gong, X.; Zhou, X.; Luo, Y.; Wan, Y.; Ma, A.; Zhong, Y. LABLF: A multispectral and SAR image fusion method based on least squares-optimized adaptive box-guided and Laplacian-Gaussian filtering. Int. J. Remote Sens. 2026, 47, 3545–3575.
Xu, Y.; Bi, H.; Yu, H.; Lu, W.; Li, P.; Li, X.; Sun, X. Attention-based contrastive learning for few-shot remote sensing image classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5620317.
Li, M.; Lei, L.; Sun, H.; Li, X.; Kuang, G. Fine-grained visual classification via multilayer bilinear pooling with object localization. Vis. Comput. 2022, 38, 95–106.
Liu, Y.; Wan, L.; Lyu, F.; Feng, W. Fine-grained scale space learning for single image super-resolution. Vis. Comput. 2022, 38, 3377–3389.
Meoni, G.; Märtens, M.; Derksen, D.; See, K.; Lightheart, T.; Sécher, A.; Martin, A.; Rijlaarsdam, D.; Fanizza, V.; Izzo, D. The OPS-SAT case: A data-centric competition for onboard satellite image classification. Vis. Comput. 2024, 8, 507–528.
Ma, Y.; Deng, X.; Wei, J. Land use classification of high-resolution multispectral satellite images with fine-grained multiscale networks and superpixel postprocessing. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 3264–3278.
Liang, X. Few-shot cotton leaf spots disease classification based on metric learning. Plant Methods 2021, 17, 114.
Li, R.; Li, X.; Sun, H.; Yang, J.; Rahaman, M.; Grzegozek, M.; Jiang, T.; Huang, X.; Li, C. Few-shot learning based histopathological image classification of colorectal cancer. Intell. Med. 2024, 4, 256–267.
Zha, Z.; Tang, H.; Sun, Y.; Tang, J. Boosting few-shot fine-grained recognition with background suppression and foreground alignment. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 3947–3961.
Tang, H.; Yuan, C.; Li, Z.; Tang, J. Learning attention-guided pyramidal features for few-shot fine-grained recognition. Pattern Recognit. 2022, 130, 108792.
Wu, J.; Chang, D.; Sain, A.; Li, X.; Ma, Z.; Cao, J.; Guo, J.; Song, Y.-Z. Bi-directional feature reconstruction network for fine-grained few-shot image classification. Proc. AAAI Conf. Artif. Intell. 2023, 37, 2821–2829.
Li, Z.; Hu, Z.; Luo, W.; Hu, X. SaberNet: Self-attention based effective relation network for few-shot learning. Pattern Recognit. 2023, 133, 109024.
Snell, J.; Swersky, K.; Zemel, R. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems 30 (NIPS 2017); Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30.
Sung, F.; Yang, Y.; Zhang, L.; Xiang, T.; Torr, P.H.; Hospedales, T.M. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2018; pp. 1199–1208.
Finn, C.; Abbeel, P.; Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning; PMLR: New York, NY, USA, 2017; pp. 1126–1135.
Zhang, R.; Che, T.; Ghahramani, Z.; Bengio, Y.; Song, Y. MetaGAN: An adversarial approach to few-shot learning. In Advances in Neural Information Processing Systems 31 (NeurIPS 2018); Curran Associates, Inc.: Red Hook, NY, USA, 2018; Volume 31.
Lim, J.M.; Lim, K.M.; Lee, C.P.; Lim, J.Y. A review of few-shot fine-grained image classification. Expert Syst. Appl. 2025, 275, 127054.
Wertheimer, D.; Tang, L.; Hariharan, B. Few-shot classification with feature map reconstruction networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2021; pp. 8012–8021.
Hou, R.; Chang, H.; Ma, B.; Shan, S.; Chen, X. Cross-attention network for few-shot classification. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019); Curran Associates, Inc.: Red Hook, NY, USA, 2019; pp. 4004–4014.
Doersch, C.; Gupta, A.; Zisserman, A. CrossTransformers: Spatially-aware few-shot transfer. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020); Curran Associates, Inc.: Red Hook, NY, USA, 2020; pp. 21981–21993.
Chen, H.; Li, H.; Li, Y.; Chen, C. Multi-scale adaptive task attention network for few-shot learning. In 2022 26th International Conference on Pattern Recognition (ICPR); IEEE: Piscataway, NJ, USA, 2022; pp. 4765–4771.
Song, W.; Yang, K. Dual adaptive local semantic alignment for few-shot fine-grained classification. Vis. Comput. 2025, 40, 2923–2937.
Zhang, C.; Cai, Y.; Lin, G.; Shen, C. DeepEMD: Few-shot image classification with differentiable earth mover’s distance and structured classifiers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2020; pp. 12203–12213.
Hao, F.; He, F.; Cheng, J.; Wang, L.; Cao, J.; Tao, D. Collect and select: Semantic alignment metric learning for few-shot learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2019; pp. 8460–8469.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems 30 (NIPS 2017); Curran Associates, Inc.: Red Hook, NY, USA, 2017.
Ye, H.-J.; Hu, H.; Zhan, D.-C.; Sha, F. Few-shot learning via embedding adaptation with set-to-set functions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2020; pp. 8808–8817.
Liu, L.; Hamilton, W.; Long, G.; Jiang, J.; Larochelle, H. A universal representation transformer layer for few-shot image classification. In Proceedings of the 9th International Conference on Learning Representations (ICLR), Virtual Event, 3–7 May 2021.
Wang, X.; Wang, X.; Jiang, B.; Luo, B. Few-shot learning meets transformer: Unified query-support transformers for few-shot classification. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 7789–7802.
Han, K.; Wang, Y.; Chen, H.; Chen, X.; Guo, J.; Liu, Z.; Tang, Y.; Xiao, A.; Xu, C.; Xu, Y.; et al. A survey on vision transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 87–110.
Hiller, M.; Ma, R.; Harandi, M.; Drummond, T. Rethinking generalization in few-shot classification. In Advances in Neural Information Processing Systems 35 (NeurIPS 2022); Curran Associates, Inc.: Red Hook, NY, USA, 2022; pp. 3582–3595.
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2021; pp. 10012–10022.
Li, L.; Han, J.; Yao, X.; Cheng, G.; Guo, L. DLA-MatchNet for few-shot remote sensing image scene classification. IEEE Trans. Geosci. Remote Sens. 2020, 59, 7844–7853.
Li, X.; Sun, Y.; Peng, X.; Zhang, J.; Qi, G.; Liu, D. TA-MSA: A fine-tuning framework for few-shot remote sensing scene classification. Remote Sens. 2025, 17, 1395.
Jia, Y.; Sun, C.; Gao, J.; Wang, Q. Few-shot remote sensing scene classification via parameter-free attention and region matching. ISPRS J. Photogramm. Remote Sens. 2025, 227, 265–275.
Lei, Y.; Li, Y.; Mao, H. A novel two-stream network for few-shot remote sensing image scene classification. Remote Sens. 2025, 17, 1192.
Chen, Y.; Li, Y.; Mao, H.; Liu, G.; Chai, X.; Jiao, L. A novel discriminative enhancement method for few-shot remote sensing image scene classification. Remote Sens. 2023, 15, 4588.
Wang, Q.; Dong, Y.; Xu, N.; Xu, F.; Mou, C.; Chen, F. Image classification of tree species in relatives based on dual-branch vision transformer. Forests 2024, 15, 2243.
Wang, K.; Ren, J.; Zhang, W. Few-shot image classification algorithm of graph neural network based on Swin transformer. Laser Optoelectron. Prog. 2024, 61, 1237003.
Wah, C.; Branson, S.; Welinder, P.; Perona, P.; Belongie, S. The Caltech-UCSD Birds-200-2011 Dataset; Technical Report CNS-TR-2011-001; California Institute of Technology: Pasadena, CA, USA, 2011.
Khosla, A.; Jayadevaprakash, N.; Yao, B.; Li, F.-F. Novel dataset for fine-grained image categorization. In Proc. CVPR Workshop Fine-Grained Vis. Categorization; IEEE: Piscataway, NJ, USA, 2011.
Krause, J.; Stark, M.; Deng, J.; Li, F.-F. 3D object representations for fine-grained categorization. In Proceedings of the IEEE International Conference on Computer Vision Workshops; IEEE: Piscataway, NJ, USA, 2013; pp. 554–561.
Nilsback, M.-E.; Zisserman, A. Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing; IEEE: Piscataway, NJ, USA, 2008.
Vinyals, O.; Blundell, C.; Lillicrap, T.; Wierstra, D.; Kavukcuoglu, K. Matching networks for one shot learning. In Advances in Neural Information Processing Systems 29 (NIPS 2016); Curran Associates, Inc.: Red Hook, NY, USA, 2016.
Ren, M.; Ravi, S.; Triantafillou, E.; Snell, J.; Swersky, K.; Tenenbaum, J.B.; Larochelle, H.; Zemel, R.S. Meta-learning for semi-supervised few-shot classification. In Proceedings of the 6th International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018.
Oreshkin, B.; Rodríguez, P.; Lacoste, A. TADAM: Task dependent adaptive metric for improved few-shot learning. In Advances in Neural Information Processing Systems 31 (NeurIPS 2018); Curran Associates, Inc.: Red Hook, NY, USA, 2018.
Cheng, G.; Han, J.; Lu, X. Remote sensing image scene classification: Benchmark and state of the art. Proc. IEEE 2017, 105, 1865–1883.
Yang, Y.; Newsam, S. Bag-of-visual-words and spatial extensions for land-use classification. In Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems; Association for Computing Machinery: New York, NY, USA, 2010; pp. 270–279.
Sheng, G.; Yang, W.; Xu, T.; Sun, H. High-resolution satellite scene classification using a sparse coding based multiple feature combination. Int. J. Remote Sens. 2012, 33, 2395–2412.
Chen, W.-Y.; Liu, Y.-C.; Kira, Z.; Wang, Y.-C.F.; Huang, J.-B. A closer look at few-shot classification. In Proceedings of the 7th International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019.
Ma, Z.X.; Chen, Z.D.; Zheng, T.; Luo, X.; Jia, Z.; Xu, X.S. Few-shot fine-grained image classification with progressively feature refinement and continuous relationship modeling. Proc. AAAI Conf. Artif. Intell. 2025, 5439, 6036–6044.
Zhang, B.; Yuan, J.; Li, B.; Chen, T.; Fan, J.; Shi, B. Learning cross-image object semantic relation in transformer for few-shot fine-grained image classification. In Proceedings of the 30th ACM International Conference on Multimedia; Association for Computing Machinery: New York, NY, USA, 2022; pp. 2135–2144.
Ou, Q.; Zou, J. Channel-wise attention-enhanced feature mutual reconstruction for few-shot fine-grained image classification. Electronics 2025, 14, 377.
Ma, Z.; Chen, Z.; Zhao, L.; Zhang, Z.; Luo, X.; Xu, X. Cross-layer and cross-sample feature optimization network for few-shot fine-grained image classification. Proc. AAAI Conf. Artif. Intell. 2024, 38, 4136–4144.
Li, X.; Wang, L.; Zhu, R.; Ma, Z.; Cao, J.; Xue, J.H. SRML: Structure-relation mutual learning network for few-shot image classification. Pattern Recognit. 2025, 160, 111822.
Guo, Z.; Xiao, L.; Jin, Q. MPRe: Multi-scale feature guided prototype reconstruction for few-shot fine-grained image classification. In Proceedings of the 2025 7th International Conference on Frontier Technologies of Information and Computer (ICFTIC), Qingdao, China, 5–7 December 2025; pp. 417–422.
Chen, Y.; Liu, Z.; Xu, H.; Darrell, T.; Wang, X. Meta-Baseline: Exploring simple meta-learning for few-shot learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2021; pp. 9062–9071.
Xie, J.; Long, F.; Lv, J.; Wang, Q.; Li, P. Joint distribution matters: Deep Brownian distance covariance for few-shot classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2022; pp. 7962–7971.
Wang, Y.; Chao, W.L.; Weinberger, K.Q.; van der Maaten, L. SimpleShot: Revisiting nearest-neighbor classification for few-shot learning. arXiv 2019, arXiv:1911.04623.
Xu, W.; Xu, Y.; Wang, H.; Tu, Z. Attentional constellation nets for few-shot learning. In Proceedings of the 9th International Conference on Learning Representations (ICLR), Virtual Event, 3–7 May 2021.
Dong, B.; Zhou, P.; Yan, S.; Zuo, W. Self-promoted supervision for few-shot transformer. In European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2022; p. 13680.
Hao, F.; He, F.; Liu, L.; Wu, F.; Tao, D.; Cheng, J. Class-aware patch embedding adaptation for few-shot image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2023; pp. 18905–18915.
Zhang, H.; Xu, J.; Jiang, S.; He, Z. Simple semantic-aided few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2024; pp. 28588–28597.

Figure 1. An overview of the proposed GRCD-Net framework, which mainly comprises a dual-branch Swin Transformer backbone, a series of GRC blocks for multi-scale feature interaction, an IGR module for refining global semantics, and a PDM module for local feature comparison, followed by feature fusion for final classification.

Figure 2. Network structure of the Swin Transformer and its basic block.

Figure 3. Structure of the GRC block. The global branch generates a spatial gate to purify detailed features before bidirectional cross-attention, enabling focused interaction between global semantics and local details.

Figure 4. The structure of the IGR module. The red arrow shows how the query representation is progressively improved by repeatedly interacting with the encoded support set.

Figure 5. Schematic illustration of the PDM module. The module assesses similarity through two complementary geometric perspectives: (Left) Cosine similarity focuses on feature orientation by measuring the directional angles (θ_{1}–θ_{5}) between the query embedding and different class prototypes. (Middle) Euclidean distance captures the magnitude discrepancies (d_{1}–d_{5}) by calculating absolute spatial distances. (Right) The PDM module combines directional (red) and spatial (green) metrics into a unified similarity score (purple). This weighted integration ensures robust feature matching against significant intra-class appearance variations, yielding the final patch score M_{p}. The dashed colored lines indicate conceptual feature matching relationships, while the actual metric outputs are scalar scores rather than vectors.

Figure 6. Visual analysis of background suppression on (a) NWPU-RESISC45 and (b) UC Merced datasets. The heatmaps illustrate the progressive refinement of features: (Row 1) original aerial images embedded in complex terrestrial environments; (Row 2) initial global context, exhibiting diffuse focus distracted by background clutter; (Row 3) guided attention after the GRC block, demonstrating effective noise suppression and target localization; (Row 4) final fine-grained saliency maps, precisely highlighting discriminative semantic details for classification. For the heatmaps in Rows 2–4, warmer colors, such as red and yellow, indicate stronger contextual, attention, or saliency responses, whereas cooler colors, such as blue and green, indicate weaker responses.

Figure 7. Hyperparameter sensitivity analysis on CUB-200-2011 (5-way one-shot). The plots illustrate the impact of varying key architectural hyperparameters while keeping others fixed. (a) Depth of GRC blocks, peaking at L = 3. (b) Number of refinement iterations in the IGR module, showing optimal performance at T = 2. (c) Depth of the local patch encoder, where 2 layers yield the best accuracy. Red markers indicate the optimal configurations selected for the final GRCD-Net, and shaded regions represent the 95% confidence intervals around the mean accuracy.

Figure 8. Visualization of progressive attention refinement on (a) CUB-200-2011 and (b) Stanford Cars datasets. The heatmaps demonstrate the evolution of feature focus: (Row 1) original input images with complex backgrounds; (Row 2) initial global attention, showing diffuse focus distracted by environmental noise; (Row 3) guided attention after the GRC block, exhibiting suppressed background clutter; (Row 4) final fine-grained saliency maps, precisely highlighting discriminative details while ignoring non-informative regions. For the heatmaps in Rows 2–4, warmer colors, such as red and yellow, indicate stronger contextual, attention, or saliency responses, whereas cooler colors, such as blue and green, indicate weaker responses.

Figure 9. UMAP projections of the learned embedding space on two different FSFGIC datasets. Large stars represent class prototypes, while small dots denote query samples. The visualization reveals that query samples form compact clusters tightly aligned with their respective prototypes, verifying the high intra-class compactness and inter-class separability achieved by the proposed metric learning strategy.

Figure 10. Cosine similarity matrices within a 5-way 1-shot episode on (a) CUB-200-2011 and (b) Stanford Cars datasets. The clear block-diagonal patterns demonstrate that GRCD-Net successfully learns high intra-class compactness and inter-class separability. Notably, the model maintains robust discrimination on the Stanford Cars dataset despite the inherent geometric similarity between rigid car models, a characteristic critical for distinguishing man-made structures in aerial imagery.
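A matrix of the kind shown in Figure 10 can be sketched as follows. This is a minimal illustration that assumes each sample is represented by a single non-zero embedding vector; the function name and the toy vectors are hypothetical and do not reproduce the model's features.

```python
import math


def cosine_similarity_matrix(features):
    """Pairwise cosine similarity between embedding vectors.

    Assumption: every vector has a non-zero norm.
    """
    norms = [math.sqrt(sum(x * x for x in f)) for f in features]
    n = len(features)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(features[i], features[j]))
            sim[i][j] = dot / (norms[i] * norms[j])
    return sim


# Toy embeddings: two samples from one class, one from another.
toy = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0]]
for row in cosine_similarity_matrix(toy):
    print(" ".join(f"{v:.2f}" for v in row))
```

When samples are ordered by class, high within-class similarity and low between-class similarity produce the block-diagonal pattern described in the caption.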

Figure 11. t-SNE visualization of feature manifold evolution on (a) CUB-200-2011 and (b) Stanford Cars datasets. The plots illustrate the progressive improvement from M2 (baseline + GRC, left), which exhibits noticeable inter-class overlap, to M3 (incorporating PDM, center), where geometric alignment tightens the clusters, and finally to M4 (full GRCD-Net, right), where the IGR module further improves intra-class compactness and inter-class separability. This trajectory confirms that the synergistic integration of local and global modules effectively resolves structural ambiguity.

Table 1. Summary of the ten datasets used in our experiments, categorized by domain. The splits denote the number of classes used for training/validation/testing, respectively.

Category	Dataset	Images	Classes	Split (Tr/Val/Te)	Characteristics
FSFGIC	CUB-200-2011 [44]	11,788	200	100/50/50	Bird species with subtle inter-class differences.
	Stanford Dogs [45]	20,580	120	70/20/30	Dog breeds with high visual similarity.
	Stanford Cars [46]	16,185	196	130/17/49	Car models with pose and viewpoint variations.
	Oxford Flowers [47]	8189	102	51/26/25	Flower categories with scale and light variations.
General FSL	mini-ImageNet [48]	60,000	100	64/16/20	Subset of ImageNet for general object recognition.
	tiered-ImageNet [49]	779,165	608	351/97/160	Larger scale with hierarchical category structure.
	FC100 [50]	33,200	100	60/20/20	Derived from CIFAR-100 with low resolution.
RSSC	NWPU-RESISC45 [51]	31,500	45	25/10/10	Large-scale aerial scenes with complex backgrounds.
	UC Merced [52]	2100	21	10/6/5	High-res urban land-use scenes (0.3 m/pixel).
	WHU-RS19 [53]	1005	19	9/5/5	Aerial images with varying resolutions.

Table 2. Comparison with state-of-the-art methods on FSFGIC datasets (5-way accuracy %). The best results are shown in bold, and the second-best results are underlined. The dash symbol “-” denotes unavailable results that were either not reported in the original paper or could not be reproduced due to unavailable source code.

Model	Backbone	CUB-200-2011 1-Shot	CUB-200-2011 5-Shot	Stanford Dogs 1-Shot	Stanford Dogs 5-Shot	Stanford Cars 1-Shot	Stanford Cars 5-Shot	Oxford Flowers 1-Shot	Oxford Flowers 5-Shot
Methods with ConvNet/ResNet Backbones
Boosting [14]	ResNet-12	82.27 ± 0.46	90.76 ± 0.26	69.58 ± 0.50	82.59 ± 0.33	88.93 ± 0.38	95.20 ± 0.20	-	-
FRN [23]	ResNet-12	81.51 ± 0.20	91.77 ± 0.11	76.43 ± 0.21	88.23 ± 0.12	87.95 ± 0.16	95.20 ± 0.20	71.16 ± 0.22	86.01 ± 0.15
AGPF [15]	ResNet-12	84.02 ± 0.57	93.50 ± 0.13	72.34 ± 0.86	85.34 ± 0.74	85.34 ± 0.74	94.79 ± 0.35	-	-
BiFRN [16]	ResNet-12	85.44 ± 0.18	94.73 ± 0.09	76.89 ± 0.21	88.27 ± 0.12	90.44 ± 0.15	97.49 ± 0.05	-	-
DALSA [27]	ResNet-12	85.26 ± 0.67	94.40 ± 0.73	75.91 ± 0.84	89.43 ± 0.68	89.62 ± 0.52	96.88 ± 0.72	-	-
SUITED [55]	ResNet-12	86.02 ± 0.45	94.13 ± 0.30	76.55 ± 0.47	88.86 ± 0.27	89.97 ± 0.36	96.53 ± 0.16	-	-
HelixFormer [56]	ResNet-12	81.66 ± 0.30	91.83 ± 0.17	65.92 ± 0.49	80.65 ± 0.36	79.40 ± 0.43	92.26 ± 0.15	-	-
CSCAM [57]	ResNet-12	73.37 ± 0.22	89.20 ± 0.12	59.61 ± 0.22	78.56 ± 0.15	67.09 ± 0.22	87.95 ± 0.11	-	-
C2-Net [58]	ResNet-12	-	-	75.50 ± 0.49	87.65 ± 0.28	88.96 ± 0.37	95.16 ± 0.20	-	-
ProtoNet [18]	ResNet-50	62.48 ± 1.00	87.22 ± 0.58	-	-	53.65 ± 0.94	75.19 ± 0.80	73.93 ± 0.91	93.52 ± 0.37
RelationNet [19]	ResNet-50	77.02 ± 0.87	88.74 ± 0.57	-	-	63.72 ± 0.95	79.51 ± 0.75	76.66 ± 0.87	89.18 ± 0.49
SRML [59]	ConvNet-4	79.84 ± 0.45	90.68 ± 0.23	65.72 ± 0.50	80.80 ± 0.34	78.73 ± 0.42	90.89 ± 0.23	75.07 ± 0.51	88.66 ± 0.31
MPRe [60]	ConvNet-4	88.54 ± 0.17	93.02 ± 0.10	84.26 ± 0.20	90.22 ± 0.20	91.99 ± 0.15	95.48 ± 0.08	-	-
Methods with Transformer Backbones
STransGNN [43]	Swin-T	91.08 ± 0.44	94.63 ± 0.50	85.21 ± 0.45	95.68 ± 0.46	91.10 ± 0.43	94.15 ± 0.47	-	-
SaberNet [17]	Swin-T	89.75 ± 0.68	95.74 ± 0.31	-	-	76.71 ± 0.97	87.92 ± 0.62	84.33 ± 0.71	94.19 ± 0.36
GRCD-Net (Ours)	Swin-T	90.51 ± 0.64	95.81 ± 0.44	92.43 ± 0.59	98.33 ± 0.18	84.81 ± 0.74	95.51 ± 0.29	88.05 ± 0.64	96.39 ± 0.23

Table 3. Comparison with state-of-the-art methods on general FSL benchmarks (5-way accuracy %). The best results are shown in bold, and the second-best results are underlined. The dash symbol “-” denotes unavailable results that were either not reported in the original paper or could not be reproduced due to unavailable source code.

Model	Backbone	mini-ImageNet 1-Shot	mini-ImageNet 5-Shot	tiered-ImageNet 1-Shot	tiered-ImageNet 5-Shot	FC100 1-Shot	FC100 5-Shot
Methods with ResNet Backbones
ProtoNet [18]	ResNet-12	63.03 ± 0.29	78.72 ± 0.21	68.68 ± 0.34	85.09 ± 0.23	40.91 ± 0.26	56.66 ± 0.25
Meta-Baseline [61]	ResNet-12	63.17 ± 0.23	79.26 ± 0.17	68.62 ± 0.27	83.74 ± 0.18	-	-
DeepEMD [28]	ResNet-12	65.43 ± 0.28	79.28 ± 0.20	69.84 ± 0.32	84.06 ± 0.23	45.58 ± 0.26	62.08 ± 0.25
DeepBDC [62]	ResNet-12	67.83 ± 0.43	85.45 ± 0.29	73.82 ± 0.47	89.00 ± 0.30	-	-
SimpleShot [63]	ResNet-18	62.85 ± 0.20	80.02 ± 0.14	69.09 ± 0.22	84.58 ± 0.16	-	-
ConstellationNet [64]	ResNet-12	65.53 ± 0.23	80.55 ± 0.16	-	-	43.90 ± 0.20	59.70 ± 0.20
QSFormer [33]	ResNet-12	65.24 ± 0.28	79.96 ± 0.20	72.47 ± 0.31	85.43 ± 0.22	46.51 ± 0.26	61.58 ± 0.25
FEAT [31]	ResNet-12	64.75 ± 0.28	79.96 ± 0.20	71.34 ± 0.33	85.28 ± 0.23	42.28 ± 0.26	56.37 ± 0.25
Methods with Transformer Backbones
SUN-F [65]	ViT	66.60 ± 0.44	81.90 ± 0.32	72.66 ± 0.51	87.08 ± 0.33	-	-
CPEA [66]	ViT	71.97 ± 0.65	87.06 ± 0.53	76.93 ± 0.70	90.15 ± 0.45	47.24 ± 0.58	65.02 ± 0.60
FewTURE [35]	ViT	68.02 ± 0.88	84.51 ± 0.53	72.96 ± 0.92	86.43 ± 0.67	46.20 ± 0.79	63.14 ± 0.73
FewTURE [35]	Swin-T	72.40 ± 0.78	86.38 ± 0.49	82.37 ± 0.77	89.89 ± 0.52	54.27 ± 0.77	65.02 ± 0.72
SemFew-Trans [67]	Swin-T	71.94 ± 0.53	84.21 ± 0.80	74.10 ± 0.63	87.56 ± 0.48	45.91 ± 0.69	63.11 ± 0.64
GRCD-Net (Ours)	Swin-T	78.11 ± 0.82	89.55 ± 0.52	83.44 ± 0.89	90.15 ± 0.48	55.31 ± 0.82	65.73 ± 0.69

Table 4. Comparison with state-of-the-art methods on RSSC datasets (5-way accuracy %). The best results are shown in bold, and the second-best results are underlined.

Model	NWPU-RESISC45 1-Shot	NWPU-RESISC45 5-Shot	UC Merced 1-Shot	UC Merced 5-Shot	WHU-RS19 1-Shot	WHU-RS19 5-Shot
MatchingNet [48]	54.46 ± 0.77	67.87 ± 0.59	46.16 ± 0.71	66.73 ± 0.56	60.60 ± 0.68	82.99 ± 0.40
RelationNet [19]	58.61 ± 0.83	78.63 ± 0.52	48.89 ± 0.73	64.10 ± 0.54	60.54 ± 0.71	76.24 ± 0.34
DLA-MatchNet [37]	68.80 ± 0.70	81.63 ± 0.46	53.76 ± 0.62	63.01 ± 0.51	68.27 ± 1.83	79.89 ± 0.33
DEADN4 [41]	73.56 ± 0.83	87.28 ± 0.50	67.27 ± 0.74	87.69 ± 0.44	86.89 ± 0.57	97.63 ± 0.19
ACL-Net [7]	76.13 ± 0.24	86.54 ± 0.23	59.74 ± 0.46	74.89 ± 0.29	78.30 ± 0.32	90.43 ± 0.15
TSDN4 [40]	73.84 ± 0.80	87.86 ± 0.51	68.12 ± 0.81	88.57 ± 0.52	87.34 ± 0.62	98.25 ± 0.15
GRCD-Net (Ours)	81.39 ± 0.54	92.81 ± 0.24	66.12 ± 0.48	85.04 ± 0.28	86.53 ± 0.56	95.46 ± 0.14

Table 5. Component-wise ablation study of the GRCD-Net framework. Experiments were conducted on the CUB-200-2011, Stanford Cars, and NWPU-RESISC45 datasets under the 5-way 1-shot setting. The table progressively validates the incremental contributions of the GRC block, PDM module, and IGR module.

#	Model Configuration	Backbone	GRC	PDM	IGR	CUB Acc. (%)	Cars Acc. (%)	NWPU-RESISC45 Acc. (%)
M1	Baseline (Dual Branch + ProtoNet)	✔				86.94 ± 0.79	77.29 ± 0.89	73.32 ± 0.68
M2	Baseline + GRC	✔	✔			87.76 ± 0.76	81.80 ± 0.75	76.75 ± 0.66
M3	Baseline + GRC + PDM	✔	✔	✔		89.33 ± 0.73	84.30 ± 0.72	78.18 ± 0.70
M4	Full Model	✔	✔	✔	✔	90.51 ± 0.64	84.81 ± 0.74	81.39 ± 0.54

✔ denotes that the corresponding component is included in the model configuration.
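The step-by-step gains implied by Table 5 can be read off with a short calculation. The sketch below only subtracts the mean accuracies listed in the table; the variable and function names are illustrative, and the differences do not account for the reported ± intervals.

```python
# Mean 5-way 1-shot accuracies (%) for M1..M4, copied from Table 5.
ablation = {
    "CUB": [86.94, 87.76, 89.33, 90.51],
    "Cars": [77.29, 81.80, 84.30, 84.81],
    "NWPU-RESISC45": [73.32, 76.75, 78.18, 81.39],
}
steps = ["+GRC", "+PDM", "+IGR"]


def incremental_gains(accs):
    """Difference between each configuration and the previous one."""
    return [round(b - a, 2) for a, b in zip(accs, accs[1:])]


for dataset, accs in ablation.items():
    gains = incremental_gains(accs)
    total = round(accs[-1] - accs[0], 2)
    parts = ", ".join(f"{s} {g:+.2f}" for s, g in zip(steps, gains))
    print(f"{dataset}: {parts} (total {total:+.2f})")
```

Some single-step differences, such as the IGR step on Stanford Cars, are smaller than the reported ± values of the configurations being compared, so they are best read together with those intervals.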

Table 6. Trade-off analysis among computational complexity, inference time, and 1-shot accuracy across diverse domains. FLOPs and inference time are measured per episode (80 images). The best results are shown in bold, and the dash symbol “-” denotes unavailable results that were either not reported in the original paper or could not be reproduced due to unavailable source code.

Method	Backbone	Params (M)	FLOPs (G)	Time (ms)	CUB 1-Shot Acc. (%)	mini-ImageNet 1-Shot Acc. (%)	NWPU 1-Shot Acc. (%)	WHU-RS19 1-Shot Acc. (%)
DLA-MatchNet [37]	ConvNet	50.91	-	-	-	-	68.80	68.27
CPEA [66]	ViT-S	21.81	345.60	25.79	87.06	71.97	-	-
SemFew [67]	Swin-T	207.80 ^*	360.68 ^†	45.90 ^†	84.21	71.94	-	-
FewTURE [35]	Swin-T	29.00	-	-	86.38	72.40	-	-
STransGNN [43]	Swin-T	28.66	442.00	-	91.08	71.94	-	-
GRCD-Net (Ours)	Swin-T	35.80	413.03	41.40	90.51	78.11	81.39	78.27

^* SemFew incorporates an additional massive frozen language/vision model (178.71 M). ^† Assumes CLIP text features are pre-computed offline. Dynamic computation adds ∼1.5 GFLOPs.
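Since Table 6 reports FLOPs and inference time per 80-image episode, a rough per-image figure follows from dividing by 80. The sketch below assumes the cost is spread evenly over the images, which ignores any episode-level computation shared across them; the names are illustrative.

```python
IMAGES_PER_EPISODE = 80  # episode size stated in the Table 6 caption

# (FLOPs in G, time in ms) per episode, copied from Table 6.
per_episode = {
    "CPEA": (345.60, 25.79),
    "SemFew": (360.68, 45.90),
    "GRCD-Net": (413.03, 41.40),
}


def per_image(value, images=IMAGES_PER_EPISODE):
    """Average per-image cost under the even-split assumption."""
    return value / images


for method, (gflops, ms) in per_episode.items():
    print(f"{method}: {per_image(gflops):.2f} GFLOPs/image, "
          f"{per_image(ms):.3f} ms/image")
```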

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Liu, J.; Du, Y.; Sun, L.; Li, X.; Si, Y.; Song, X.; Zheng, R. GRCD-Net: Guided Global–Local Relational Learning for Few-Shot Fine-Grained and Remote Sensing Scene Classification. Remote Sens. 2026, 18, 1632. https://doi.org/10.3390/rs18101632
