GRCD-Net: Guided Global–Local Relational Learning for Few-Shot Fine-Grained and Remote Sensing Scene Classification
Highlights
- The proposed GRCD-Net introduces a top-down guidance mechanism, proactively using global semantic context to filter severe background clutter before local feature extraction.
- Empowered by a synergistic dual-metric learning strategy, the model consistently outperforms fine-grained baselines by 2–4% and achieves 81.39% one-shot accuracy on the NWPU-RESISC45 remote sensing scene classification dataset, 7.55 percentage points above TSDN4 and 5.26 points above the strongest compared baseline, ACL-Net.
- It effectively resolves a critical structural bottleneck in existing few-shot learning methods, successfully preventing models from matching irrelevant environmental noise instead of small semantic targets in complex aerial scenes.
- The framework provides a promising solution for practical Earth observation applications, addressing the dual challenges of extreme data scarcity and highly cluttered remote sensing environments.
Abstract
1. Introduction
1. A novel Guided Global–Local Relational Learning framework (i.e., GRCD-Net) is proposed to address the structural limitations of current few-shot fine-grained image classification (FSFGIC) and remote sensing scene classification (RSSC) methods. At its core, the Guided Relational Cross-Attention (GRC) block introduces a top-down spatial mechanism that leverages global context to proactively filter environmental clutter, ensuring the precise localization of fine-grained cues.
2. A synergistic metric learning strategy is developed by integrating the Iterative Global Relation (IGR) and Patch-Level Dual-Metric (PDM) modules. This strategy overcomes severe intra-class variations by jointly refining global semantic consistency and computing robust local geometric similarities.
3. Extensive experiments were conducted on FSFGIC, general few-shot learning (FSL), and RSSC benchmarks. The results show that GRCD-Net achieves competitive performance on multiple datasets while maintaining reasonable computational complexity, demonstrating its effectiveness in handling complex background clutter and improving generalization in few-shot scenarios.
2. Related Work
2.1. Few-Shot Learning
2.2. Few-Shot Fine-Grained Classification
2.3. Transformers for Few-Shot Classification
2.4. Few-Shot Remote Sensing Scene Classification
3. Method
3.1. Problem Definition
3.2. Guided Feature Extraction Network
3.2.1. Hierarchical Feature Construction
3.2.2. Guided Relational Cross-Attention Block
- (1) Global-to-Detail (G2D): The global CLS token queries the purified local patches to absorb fine-grained details.
- (2) Detail-to-Global (D2G): Simultaneously, the local CLS token queries the global patches to acquire broader contextual awareness.
| Algorithm 1 | GRC-enhanced feature extraction |
|---|---|
| Input: | Image batch |
| Output: | Global features, local features |
| Hyperparameters: | N (number of GRC blocks), D (feature dimension) |
| Parameters: | Backbone weights, attention/gate weights |
| 1: | Step 1. Hierarchical Extraction |
| 2: | |
| 3: | |
| 4: | |
| 5: | Step 2. Guided Relational Interaction |
| 6: | for each of the N GRC blocks do |
| 7: | |
| 8: | |
| 9: | ▹ Generate spatial gate |
| 10: | ▹ Apply guidance |
| 11: | |
| 12: | |
| 13: | |
| 14: | end for |
| 15: | Extract global CLS tokens |
| 16: | Extract detail patch tokens |
| 17: | return global features, local features |
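
The interaction in Algorithm 1 can be sketched in NumPy for a single image. This is a minimal illustration rather than the authors' implementation: the sigmoid gate, the projection vector `w_gate`, the residual updates, and the single-head attention without learned query/key/value projections are all assumptions, because the original expressions are not available in this text.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(query, keys, values):
    """Single-head scaled dot-product attention: query (1, D), keys/values (P, D)."""
    d = query.shape[-1]
    weights = softmax(query @ keys.T / np.sqrt(d))  # (1, P)
    return weights @ values

def grc_block(global_tokens, local_tokens, w_gate):
    """One hypothetical GRC step on a single image.

    global_tokens, local_tokens: arrays of shape (1 + P, D), CLS token first.
    w_gate: (D,) projection used by the assumed spatial gate.
    """
    g_cls, g_patches = global_tokens[:1], global_tokens[1:]
    l_cls, l_patches = local_tokens[:1], local_tokens[1:]

    # Top-down guidance (assumed form): score every local patch against the
    # global CLS context and squash to (0, 1) so clutter patches are attenuated.
    gate = 1.0 / (1.0 + np.exp(-((l_patches * g_cls) @ w_gate)))  # (P,)
    purified = l_patches * gate[:, None]

    # G2D: the global CLS token queries the purified local patches.
    g_update = cross_attend(g_cls, purified, purified)
    # D2G: the local CLS token queries the global patches.
    l_update = cross_attend(l_cls, g_patches, g_patches)

    new_global = np.vstack([g_cls + g_update, g_patches])
    new_local = np.vstack([l_cls + l_update, purified])
    return new_global, new_local, gate
```

Stacking N such calls and then reading row 0 of the global output (CLS token) and rows 1 onward of the local output (patch tokens) mirrors the last three steps of the listing.
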
3.2.3. Iterative Global Relation Module
3.2.4. Patch-Level Dual-Metric Module
| Algorithm 2 | Synergistic metric calculation |
|---|---|
| Input: | Support features, query features |
| Output: | Global score, local score |
| Parameters: | Relation module weights, patch encoder weights |
| 1: | Step 1. Iterative Global Relation (IGR) |
| 2: | ▹ Contextualize support |
| 3: | |
| 4: | for each refinement iteration do |
| 5: | |
| 6: | end for |
| 7: | |
| 8: | Step 2. Patch-level Dual-Metric (PDM) |
| 9: | |
| 10: | ▹ Compute class prototypes |
| 11: | |
| 12: | |
| 13: | |
| 14: | return global score, local score |
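
A rough NumPy sketch of the two scores in Algorithm 2 follows. It rests on several assumptions that the text here does not confirm: the iterative refinement is replaced by a simple residual self-attention among class prototypes, the global score is a cosine similarity, and the local score is the mean best-match cosine similarity between query patches and class patch prototypes. Only one patch-level metric is shown because the second metric of the PDM module cannot be recovered from this text.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale vectors along the last axis to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def global_scores(support_cls, query_cls, n_iters=2):
    """support_cls: (C, K, D) CLS features; query_cls: (Q, D). Returns (Q, C)."""
    protos = support_cls.mean(axis=1)  # class prototypes, (C, D)
    d = protos.shape[-1]
    for _ in range(n_iters):
        # Stand-in for the relation module: each prototype attends to all
        # prototypes of the episode and is updated residually.
        attn = softmax(protos @ protos.T / np.sqrt(d))  # (C, C)
        protos = protos + attn @ protos
    return l2_normalize(query_cls) @ l2_normalize(protos).T

def local_scores(support_patches, query_patches):
    """support_patches: (C, K, P, D); query_patches: (Q, P, D). Returns (Q, C)."""
    protos = l2_normalize(support_patches.mean(axis=1))  # (C, P, D)
    queries = l2_normalize(query_patches)                # (Q, P, D)
    # Cosine similarity between every query patch and every prototype patch.
    sim = np.einsum("qpd,cmd->qcpm", queries, protos)    # (Q, C, P, P)
    # Best prototype match per query patch, averaged over query patches.
    return sim.max(axis=-1).mean(axis=-1)
```

Both functions return one score per (query, class) pair, so their outputs can be fused directly in the training step.
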
3.3. Optimization and Training Strategy
3.3.1. Objective Function
3.3.2. Episodic Training Procedure
| Algorithm 3 | Training process of GRCD-Net |
|---|---|
| Input: | Training set, hyperparameters |
| Output: | Trained model parameters |
| 1: | for each episode sampled from the training set do |
| 2: | Step 1. Feature Extraction (via Algorithm 1) |
| 3: | |
| 4: | |
| 5: | Step 2. Metric Computation (via Algorithm 2) |
| 6: | |
| 7: | Step 3. Adaptive Score Fusion |
| 8: | |
| 9: | |
| 10: | Step 4. Optimization |
| 11: | Calculate probability P via Softmax |
| 12: | Calculate loss via Cross-Entropy |
| 13: | Update parameters |
| 14: | end for |
| 15: | return trained model parameters |
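
Steps 3 and 4 of Algorithm 3 can be illustrated with a small NumPy sketch. The fusion rule is an assumption: a single scalar `gate_logit` (hypothetical name) is passed through a sigmoid to weight the global against the local score, which may differ from the adaptive fusion actually used. The parameter update is omitted because it requires an automatic-differentiation framework.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax over the class axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def fuse_scores(s_global, s_local, gate_logit):
    """Convex combination of global and local scores (assumed fusion rule)."""
    alpha = 1.0 / (1.0 + np.exp(-gate_logit))  # fusion weight in (0, 1)
    return alpha * s_global + (1.0 - alpha) * s_local

def episode_loss(fused, labels):
    """Mean cross-entropy over the query samples of one episode.

    fused: (Q, C) fused scores; labels: (Q,) integer class indices.
    """
    log_p = log_softmax(fused)
    return float(-log_p[np.arange(len(labels)), labels].mean())
```

With `gate_logit = 0` the two scores are averaged equally; treating it as a trainable parameter would let the optimizer shift weight toward whichever score is more reliable.
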
4. Experiment
- RQ1: How does the foundational GRCD-Net architecture compare to state-of-the-art methods on standard FSFGIC and general FSL benchmarks?
- RQ2: Can the proposed framework effectively tackle the severe background clutter inherent in Earth observation tasks, and does it outperform existing methods on RSSC datasets?
- RQ3: What is the individual contribution of each proposed module (i.e., GRC, IGR and PDM) to the final performance? Is the model robust to hyperparameter variations?
- RQ4: Does the model effectively suppress complex background noise and focus on task-relevant regions, learning semantically meaningful embeddings as intended?
4.1. Experimental Setup
4.1.1. Datasets
4.1.2. Implementation Details
4.2. Main Results on Foundational Benchmarks
4.2.1. Performance on Fine-Grained Datasets
4.2.2. Performance on General FSL Benchmarks
4.3. Application to Remote Sensing Scene Classification
4.3.1. Quantitative Analysis on Remote Sensing Scene Classification
4.3.2. Visual Analysis of Background Suppression
4.4. Ablation Study and Model Analysis
4.4.1. Effectiveness of Key Components
4.4.2. Hyperparameter Sensitivity Analysis
4.5. Computational Complexity and Efficiency Analysis
4.6. Qualitative Visualization
4.6.1. Attention and Saliency Visualization
4.6.2. Embedding Space and Metric Separability Analysis
4.6.3. Evolution of Feature Manifolds
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
1. Cheng, G.; Xie, X.; Han, J.; Guo, L.; Xia, G.S. Remote sensing image scene classification meets deep learning: Challenges, methods, benchmarks, and opportunities. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 3735–3756.
2. Zhang, L.; Lan, M.; Zhang, J.; Tao, D. Stagewise unsupervised domain adaptation with adversarial self-training for road segmentation of remote-sensing images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5609413.
3. Qiu, C.; Zhang, X.; Tong, X.; Guan, N.; Yi, X.; Yang, K.; Zhu, J.; Yu, A. Few-shot remote sensing image scene classification: Recent advances, new baselines, and future trends. ISPRS J. Photogramm. Remote Sens. 2024, 209, 368–382.
4. Yao, X.; Cao, Q.; Feng, X.; Cheng, G.; Han, J. Scale-aware detailed matching for few-shot aerial image semantic segmentation. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5611711.
5. Gong, X.; Luo, Y.; Chen, W.; Chang, Y.; Wan, Y.; Ma, A.; Zhong, Y. BASHVS: A Multispectral and SAR Image Fusion Method Based on Bidirectional Aggregation of Saliency in Human Visual System. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5405615.
6. Yang, S.; Gong, X.; Zhou, X.; Luo, Y.; Wan, Y.; Ma, A.; Zhong, Y. LABLF: A multispectral and SAR image fusion method based on least squares-optimized adaptive box-guided and Laplacian-Gaussian filtering. Int. J. Remote Sens. 2026, 47, 3545–3575.
7. Xu, Y.; Bi, H.; Yu, H.; Lu, W.; Li, P.; Li, X.; Sun, X. Attention-based contrastive learning for few-shot remote sensing image classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5620317.
8. Li, M.; Lei, L.; Sun, H.; Li, X.; Kuang, G. Fine-grained visual classification via multilayer bilinear pooling with object localization. Vis. Comput. 2022, 38, 95–106.
9. Liu, Y.; Wan, L.; Lyu, F.; Feng, W. Fine-grained scale space learning for single image super-resolution. Vis. Comput. 2022, 38, 3377–3389.
10. Meoni, G.; Märtens, M.; Derksen, D.; See, K.; Lightheart, T.; Sécher, A.; Martin, A.; Rijlaarsdam, D.; Fanizza, V.; Izzo, D. The OPS-SAT case: A data-centric competition for onboard satellite image classification. Vis. Comput. 2024, 8, 507–528.
11. Ma, Y.; Deng, X.; Wei, J. Land use classification of high-resolution multispectral satellite images with fine-grained multiscale networks and superpixel postprocessing. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 3264–3278.
12. Liang, X. Few-shot cotton leaf spots disease classification based on metric learning. Plant Methods 2021, 17, 114.
13. Li, R.; Li, X.; Sun, H.; Yang, J.; Rahaman, M.; Grzegozek, M.; Jiang, T.; Huang, X.; Li, C. Few-shot learning based histopathological image classification of colorectal cancer. Intell. Med. 2024, 4, 256–267.
14. Zha, Z.; Tang, H.; Sun, Y.; Tang, J. Boosting few-shot fine-grained recognition with background suppression and foreground alignment. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 3947–3961.
15. Tang, H.; Yuan, C.; Li, Z.; Tang, J. Learning attention-guided pyramidal features for few-shot fine-grained recognition. Pattern Recognit. 2022, 130, 108792.
16. Wu, J.; Chang, D.; Sain, A.; Li, X.; Ma, Z.; Cao, J.; Guo, J.; Song, Y.-Z. Bi-directional feature reconstruction network for fine-grained few-shot image classification. Proc. AAAI Conf. Artif. Intell. 2023, 37, 2821–2829.
17. Li, Z.; Hu, Z.; Luo, W.; Hu, X. SaberNet: Self-attention based effective relation network for few-shot learning. Pattern Recognit. 2023, 133, 109024.
18. Snell, J.; Swersky, K.; Zemel, R. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems 30 (NIPS 2017); Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30.
19. Sung, F.; Yang, Y.; Zhang, L.; Xiang, T.; Torr, P.H.; Hospedales, T.M. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2018; pp. 1199–1208.
20. Finn, C.; Abbeel, P.; Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning; PMLR: New York, NY, USA, 2017; pp. 1126–1135.
21. Zhang, R.; Che, T.; Ghahramani, Z.; Bengio, Y.; Song, Y. MetaGAN: An adversarial approach to few-shot learning. In Advances in Neural Information Processing Systems 31 (NeurIPS 2018); Curran Associates, Inc.: Red Hook, NY, USA, 2018; Volume 31.
22. Lim, J.M.; Lim, K.M.; Lee, C.P.; Lim, J.Y. A review of few-shot fine-grained image classification. Expert Syst. Appl. 2025, 275, 127054.
23. Wertheimer, D.; Tang, L.; Hariharan, B. Few-shot classification with feature map reconstruction networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2021; pp. 8012–8021.
24. Hou, R.; Chang, H.; Ma, B.; Shan, S.; Chen, X. Cross-attention network for few-shot classification. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019); Curran Associates, Inc.: Red Hook, NY, USA, 2019; pp. 4004–4014.
25. Doersch, C.; Gupta, A.; Zisserman, A. CrossTransformers: Spatially-aware few-shot transfer. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020); Curran Associates, Inc.: Red Hook, NY, USA, 2020; pp. 21981–21993.
26. Chen, H.; Li, H.; Li, Y.; Chen, C. Multi-scale adaptive task attention network for few-shot learning. In 2022 26th International Conference on Pattern Recognition (ICPR); IEEE: Piscataway, NJ, USA, 2022; pp. 4765–4771.
27. Song, W.; Yang, K. Dual adaptive local semantic alignment for few-shot fine-grained classification. Vis. Comput. 2025, 40, 2923–2937.
28. Zhang, C.; Cai, Y.; Lin, G.; Shen, C. DeepEMD: Few-shot image classification with differentiable earth mover’s distance and structured classifiers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2020; pp. 12203–12213.
29. Hao, F.; He, F.; Cheng, J.; Wang, L.; Cao, J.; Tao, D. Collect and select: Semantic alignment metric learning for few-shot learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2019; pp. 8460–8469.
30. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems 30 (NIPS 2017); Curran Associates, Inc.: Red Hook, NY, USA, 2017.
31. Ye, H.-J.; Hu, H.; Zhan, D.-C.; Sha, F. Few-shot learning via embedding adaptation with set-to-set functions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2020; pp. 8808–8817.
32. Liu, L.; Hamilton, W.; Long, G.; Jiang, J.; Larochelle, H. A universal representation transformer layer for few-shot image classification. In Proceedings of the 9th International Conference on Learning Representations (ICLR), Virtual Event, 3–7 May 2021.
33. Wang, X.; Wang, X.; Jiang, B.; Luo, B. Few-shot learning meets transformer: Unified query-support transformers for few-shot classification. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 7789–7802.
34. Han, K.; Wang, Y.; Chen, H.; Chen, X.; Guo, J.; Liu, Z.; Tang, Y.; Xiao, A.; Xu, C.; Xu, Y.; et al. A survey on vision transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 87–110.
35. Hiller, M.; Ma, R.; Harandi, M.; Drummond, T. Rethinking generalization in few-shot classification. In Advances in Neural Information Processing Systems 35 (NeurIPS 2022); Curran Associates, Inc.: Red Hook, NY, USA, 2022; pp. 3582–3595.
36. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2021; pp. 10012–10022.
37. Li, L.; Han, J.; Yao, X.; Cheng, G.; Guo, L. DLA-MatchNet for few-shot remote sensing image scene classification. IEEE Trans. Geosci. Remote Sens. 2020, 59, 7844–7853.
38. Li, X.; Sun, Y.; Peng, X.; Zhang, J.; Qi, G.; Liu, D. TA-MSA: A fine-tuning framework for few-shot remote sensing scene classification. Remote Sens. 2025, 17, 1395.
39. Jia, Y.; Sun, C.; Gao, J.; Wang, Q. Few-shot remote sensing scene classification via parameter-free attention and region matching. ISPRS J. Photogramm. Remote Sens. 2025, 227, 265–275.
40. Lei, Y.; Li, Y.; Mao, H. A novel two-stream network for few-shot remote sensing image scene classification. Remote Sens. 2025, 17, 1192.
41. Chen, Y.; Li, Y.; Mao, H.; Liu, G.; Chai, X.; Jiao, L. A novel discriminative enhancement method for few-shot remote sensing image scene classification. Remote Sens. 2023, 15, 4588.
42. Wang, Q.; Dong, Y.; Xu, N.; Xu, F.; Mou, C.; Chen, F. Image classification of tree species in relatives based on dual-branch vision transformer. Forests 2024, 15, 2243.
43. Wang, K.; Ren, J.; Zhang, W. Few-shot image classification algorithm of graph neural network based on Swin transformer. Laser Optoelectron. Prog. 2024, 61, 1237003.
44. Wah, C.; Branson, S.; Welinder, P.; Perona, P.; Belongie, S. The Caltech-UCSD Birds-200-2011 Dataset; Technical Report CNS-TR-2011-001; California Institute of Technology: Pasadena, CA, USA, 2011.
45. Khosla, A.; Jayadevaprakash, N.; Yao, B.; Li, F.-F. Novel dataset for fine-grained image categorization. In Proc. CVPR Workshop Fine-Grained Vis. Categorization; IEEE: Piscataway, NJ, USA, 2011.
46. Krause, J.; Stark, M.; Deng, J.; Li, F.-F. 3D object representations for fine-grained categorization. In Proceedings of the IEEE International Conference on Computer Vision Workshops; IEEE: Piscataway, NJ, USA, 2013; pp. 554–561.
47. Nilsback, M.-E.; Zisserman, A. Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing; IEEE: Piscataway, NJ, USA, 2008.
48. Vinyals, O.; Blundell, C.; Lillicrap, T.; Wierstra, D.; Kavukcuoglu, K. Matching networks for one shot learning. In Advances in Neural Information Processing Systems 29 (NIPS 2016); Curran Associates, Inc.: Red Hook, NY, USA, 2016.
49. Ren, M.; Ravi, S.; Triantafillou, E.; Snell, J.; Swersky, K.; Tenenbaum, J.B.; Larochelle, H.; Zemel, R.S. Meta-learning for semi-supervised few-shot classification. In Proceedings of the 6th International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018.
50. Oreshkin, B.; Rodríguez, P.; Lacoste, A. TADAM: Task dependent adaptive metric for improved few-shot learning. In Advances in Neural Information Processing Systems 31 (NeurIPS 2018); Curran Associates, Inc.: Red Hook, NY, USA, 2018.
51. Cheng, G.; Han, J.; Lu, X. Remote sensing image scene classification: Benchmark and state of the art. Proc. IEEE 2017, 105, 1865–1883.
52. Yang, Y.; Newsam, S. Bag-of-visual-words and spatial extensions for land-use classification. In Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems; Association for Computing Machinery: New York, NY, USA, 2010; pp. 270–279.
53. Sheng, G.; Yang, W.; Xu, T.; Sun, H. High-resolution satellite scene classification using a sparse coding based multiple feature combination. Int. J. Remote Sens. 2012, 33, 2395–2412.
54. Chen, W.-Y.; Liu, Y.-C.; Kira, Z.; Wang, Y.-C.F.; Huang, J.-B. A closer look at few-shot classification. In Proceedings of the 7th International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019.
55. Ma, Z.X.; Chen, Z.D.; Zheng, T.; Luo, X.; Jia, Z.; Xu, X.S. Few-shot fine-grained image classification with progressively feature refinement and continuous relationship modeling. Proc. AAAI Conf. Artif. Intell. 2025, 5439, 6036–6044.
56. Zhang, B.; Yuan, J.; Li, B.; Chen, T.; Fan, J.; Shi, B. Learning cross-image object semantic relation in transformer for few-shot fine-grained image classification. In Proceedings of the 30th ACM International Conference on Multimedia; Association for Computing Machinery: New York, NY, USA, 2022; pp. 2135–2144.
57. Ou, Q.; Zou, J. Channel-wise attention-enhanced feature mutual reconstruction for few-shot fine-grained image classification. Electronics 2025, 14, 377.
58. Ma, Z.; Chen, Z.; Zhao, L.; Zhang, Z.; Luo, X.; Xu, X. Cross-layer and cross-sample feature optimization network for few-shot fine-grained image classification. Proc. AAAI Conf. Artif. Intell. 2024, 38, 4136–4144.
59. Li, X.; Wang, L.; Zhu, R.; Ma, Z.; Cao, J.; Xue, J.H. SRML: Structure-relation mutual learning network for few-shot image classification. Pattern Recognit. 2025, 160, 111822.
60. Guo, Z.; Xiao, L.; Jin, Q. MPRe: Multi-scale feature guided prototype reconstruction for few-shot fine-grained image classification. In Proceedings of the 2025 7th International Conference on Frontier Technologies of Information and Computer (ICFTIC), Qingdao, China, 5–7 December 2025; pp. 417–422.
61. Chen, Y.; Liu, Z.; Xu, H.; Darrell, T.; Wang, X. Meta-Baseline: Exploring simple meta-learning for few-shot learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2021; pp. 9062–9071.
62. Xie, J.; Long, F.; Lv, J.; Wang, Q.; Li, P. Joint distribution matters: Deep Brownian distance covariance for few-shot classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2022; pp. 7962–7971.
63. Wang, Y.; Chao, W.L.; Weinberger, K.Q.; van der Maaten, L. SimpleShot: Revisiting nearest-neighbor classification for few-shot learning. arXiv 2019, arXiv:1911.04623.
64. Xu, W.; Xu, Y.; Wang, H.; Tu, Z. Attentional constellation nets for few-shot learning. In Proceedings of the 9th International Conference on Learning Representations (ICLR), Virtual Event, 3–7 May 2021.
65. Dong, B.; Zhou, P.; Yan, S.; Zuo, W. Self-promoted supervision for few-shot transformer. In European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2022; p. 13680.
66. Hao, F.; He, F.; Liu, L.; Wu, F.; Tao, D.; Cheng, J. Class-aware patch embedding adaptation for few-shot image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2023; pp. 18905–18915.
67. Zhang, H.; Xu, J.; Jiang, S.; He, Z. Simple semantic-aided few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2024; pp. 28588–28597.

| Category | Dataset | Images | Classes | Split (Tr/Val/Te) | Characteristics |
|---|---|---|---|---|---|
| FSFGIC | CUB-200-2011 [44] | 11,788 | 200 | 100/50/50 | Bird species with subtle inter-class differences. |
| FSFGIC | Stanford Dogs [45] | 20,580 | 120 | 70/20/30 | Dog breeds with high visual similarity. |
| FSFGIC | Stanford Cars [46] | 16,185 | 196 | 130/17/49 | Car models with pose and viewpoint variations. |
| FSFGIC | Oxford Flowers [47] | 8189 | 102 | 51/26/25 | Flower categories with scale and light variations. |
| General FSL | mini-ImageNet [48] | 60,000 | 100 | 64/16/20 | Subset of ImageNet for general object recognition. |
| General FSL | tiered-ImageNet [49] | 779,165 | 608 | 351/97/160 | Larger scale with hierarchical category structure. |
| General FSL | FC100 [50] | 33,200 | 100 | 60/20/20 | Derived from CIFAR-100 with low resolution. |
| RSSC | NWPU-RESISC45 [51] | 31,500 | 45 | 25/10/10 | Large-scale aerial scenes with complex backgrounds. |
| RSSC | UC Merced [52] | 2100 | 21 | 10/6/5 | High-res urban land-use scenes (0.3 m/pixel). |
| RSSC | WHU-RS19 [53] | 1005 | 19 | 9/5/5 | Aerial images with varying resolutions. |
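
The class-disjoint splits in the table are what make episodic evaluation possible: each episode draws a handful of classes from one split and separates their images into support and query sets. The sketch below shows one common way to sample such an episode. The function name `sample_episode`, the dictionary layout, and the 5-way 1-shot setting with 15 queries per class in the usage lines are illustrative assumptions, not settings confirmed by the text.

```python
import random

def sample_episode(images_by_class, n_way, k_shot, n_query, rng):
    """Draw one N-way K-shot episode from a single split.

    images_by_class: dict mapping class name -> list of image identifiers.
    Returns (support, query), each a list of (image_id, episode_label) pairs.
    """
    classes = rng.sample(sorted(images_by_class), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        # Sample without replacement so support and query never share an image.
        picks = rng.sample(images_by_class[cls], k_shot + n_query)
        support += [(img, label) for img in picks[:k_shot]]
        query += [(img, label) for img in picks[k_shot:]]
    return support, query

# Toy split with 10 classes of 20 images each (placeholder identifiers).
toy_split = {f"class_{c}": [f"img_{c}_{i}" for i in range(20)] for c in range(10)}
support, query = sample_episode(toy_split, n_way=5, k_shot=1, n_query=15,
                                rng=random.Random(0))
```

Labels are re-indexed from 0 within each episode, so the model never sees the dataset-level class identity.
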
| Model | Backbone | CUB-200-2011 1-Shot | CUB-200-2011 5-Shot | Stanford Dogs 1-Shot | Stanford Dogs 5-Shot | Stanford Cars 1-Shot | Stanford Cars 5-Shot | Oxford Flowers 1-Shot | Oxford Flowers 5-Shot |
|---|---|---|---|---|---|---|---|---|---|
| Methods with ConvNet/ResNet Backbones | |||||||||
| Boosting [14] | ResNet-12 | 82.27 ± 0.46 | 90.76 ± 0.26 | 69.58 ± 0.50 | 82.59 ± 0.33 | 88.93 ± 0.38 | 95.20 ± 0.20 | - | - |
| FRN [23] | ResNet-12 | 81.51 ± 0.20 | 91.77 ± 0.11 | 76.43 ± 0.21 | 88.23 ± 0.12 | 87.95 ± 0.16 | 95.20 ± 0.20 | 71.16 ± 0.22 | 86.01 ± 0.15 |
| AGPF [15] | ResNet-12 | 84.02 ± 0.57 | 93.50 ± 0.13 | 72.34 ± 0.86 | 85.34 ± 0.74 | 85.34 ± 0.74 | 94.79 ± 0.35 | - | - |
| BiFRN [16] | ResNet-12 | 85.44 ± 0.18 | 94.73 ± 0.09 | 76.89 ± 0.21 | 88.27 ± 0.12 | 90.44 ± 0.15 | 97.49 ± 0.05 | - | - |
| DALSA [27] | ResNet-12 | 85.26 ± 0.67 | 94.40 ± 0.73 | 75.91 ± 0.84 | 89.43 ± 0.68 | 89.62 ± 0.52 | 96.88 ± 0.72 | - | - |
| SUITED [55] | ResNet-12 | 86.02 ± 0.45 | 94.13 ± 0.30 | 76.55 ± 0.47 | 88.86 ± 0.27 | 89.97 ± 0.36 | 96.53 ± 0.16 | - | - |
| HelixFormer [56] | ResNet-12 | 81.66 ± 0.30 | 91.83 ± 0.17 | 65.92 ± 0.49 | 80.65 ± 0.36 | 79.40 ± 0.43 | 92.26 ± 0.15 | - | - |
| CSCAM [57] | ResNet-12 | 73.37 ± 0.22 | 89.20 ± 0.12 | 59.61 ± 0.22 | 78.56 ± 0.15 | 67.09 ± 0.22 | 87.95 ± 0.11 | - | - |
| C2-Net [58] | ResNet-12 | - | - | 75.50 ± 0.49 | 87.65 ± 0.28 | 88.96 ± 0.37 | 95.16 ± 0.20 | - | - |
| ProtoNet [18] | ResNet-50 | 62.48 ± 1.00 | 87.22 ± 0.58 | - | - | 53.65 ± 0.94 | 75.19 ± 0.80 | 73.93 ± 0.91 | 93.52 ± 0.37 |
| RelationNet [19] | ResNet-50 | 77.02 ± 0.87 | 88.74 ± 0.57 | - | - | 63.72 ± 0.95 | 79.51 ± 0.75 | 76.66 ± 0.87 | 89.18 ± 0.49 |
| SRML [59] | ConvNet-4 | 79.84 ± 0.45 | 90.68 ± 0.23 | 65.72 ± 0.50 | 80.80 ± 0.34 | 78.73 ± 0.42 | 90.89 ± 0.23 | 75.07 ± 0.51 | 88.66 ± 0.31 |
| MPRe [60] | ConvNet-4 | 88.54 ± 0.17 | 93.02 ± 0.10 | 84.26 ± 0.20 | 90.22 ± 0.20 | 91.99 ± 0.15 | 95.48 ± 0.08 | - | - |
| Methods with Transformer Backbones | |||||||||
| STransGNN [43] | Swin-T | 91.08 ± 0.44 | 94.63 ± 0.50 | 85.21 ± 0.45 | 95.68 ± 0.46 | 91.10 ± 0.43 | 94.15 ± 0.47 | - | - |
| SaberNet [17] | Swin-T | 89.75 ± 0.68 | 95.74 ± 0.31 | - | - | 76.71 ± 0.97 | 87.92 ± 0.62 | 84.33 ± 0.71 | 94.19 ± 0.36 |
| GRCD-Net (Ours) | Swin-T | 90.51 ± 0.64 | 95.81 ± 0.44 | 92.43 ± 0.59 | 98.33 ± 0.18 | 84.81 ± 0.74 | 95.51 ± 0.29 | 88.05 ± 0.64 | 96.39 ± 0.23 |
| Model | Backbone | mini-ImageNet 1-Shot | mini-ImageNet 5-Shot | tiered-ImageNet 1-Shot | tiered-ImageNet 5-Shot | FC100 1-Shot | FC100 5-Shot |
|---|---|---|---|---|---|---|---|
| Methods with ResNet Backbones | |||||||
| ProtoNet [18] | ResNet-12 | 63.03 ± 0.29 | 78.72 ± 0.21 | 68.68 ± 0.34 | 85.09 ± 0.23 | 40.91 ± 0.26 | 56.66 ± 0.25 |
| Meta-Baseline [61] | ResNet-12 | 63.17 ± 0.23 | 79.26 ± 0.17 | 68.62 ± 0.27 | 83.74 ± 0.18 | - | - |
| DeepEMD [28] | ResNet-12 | 65.43 ± 0.28 | 79.28 ± 0.20 | 69.84 ± 0.32 | 84.06 ± 0.23 | 45.58 ± 0.26 | 62.08 ± 0.25 |
| DeepBDC [62] | ResNet-12 | 67.83 ± 0.43 | 85.45 ± 0.29 | 73.82 ± 0.47 | 89.00 ± 0.30 | - | - |
| SimpleShot [63] | ResNet-18 | 62.85 ± 0.20 | 80.02 ± 0.14 | 69.09 ± 0.22 | 84.58 ± 0.16 | - | - |
| ConstellationNet [64] | ResNet-12 | 65.53 ± 0.23 | 80.55 ± 0.16 | - | - | 43.90 ± 0.20 | 59.70 ± 0.20 |
| QSFormer [33] | ResNet-12 | 65.24 ± 0.28 | 79.96 ± 0.20 | 72.47 ± 0.31 | 85.43 ± 0.22 | 46.51 ± 0.26 | 61.58 ± 0.25 |
| FEAT [31] | ResNet-12 | 64.75 ± 0.28 | 79.96 ± 0.20 | 71.34 ± 0.33 | 85.28 ± 0.23 | 42.28 ± 0.26 | 56.37 ± 0.25 |
| Methods with Transformer Backbones | |||||||
| SUN-F [65] | ViT | 66.60 ± 0.44 | 81.90 ± 0.32 | 72.66 ± 0.51 | 87.08 ± 0.33 | - | - |
| CPEA [66] | ViT | 71.97 ± 0.65 | 87.06 ± 0.53 | 76.93 ± 0.70 | 90.15 ± 0.45 | 47.24 ± 0.58 | 65.02 ± 0.60 |
| FewTURE [35] | ViT | 68.02 ± 0.88 | 84.51 ± 0.53 | 72.96 ± 0.92 | 86.43 ± 0.67 | 46.20 ± 0.79 | 63.14 ± 0.73 |
| FewTURE [35] | Swin-T | 72.40 ± 0.78 | 86.38 ± 0.49 | 82.37 ± 0.77 | 89.89 ± 0.52 | 54.27 ± 0.77 | 65.02 ± 0.72 |
| SemFew-Trans [67] | Swin-T | 71.94 ± 0.53 | 84.21 ± 0.80 | 74.10 ± 0.63 | 87.56 ± 0.48 | 45.91 ± 0.69 | 63.11 ± 0.64 |
| GRCD-Net (Ours) | Swin-T | 78.11 ± 0.82 | 89.55 ± 0.52 | 83.44 ± 0.89 | 90.15 ± 0.48 | 55.31 ± 0.82 | 65.73 ± 0.69 |
| Model | NWPU-RESISC45 1-Shot | NWPU-RESISC45 5-Shot | UC Merced 1-Shot | UC Merced 5-Shot | WHU-RS19 1-Shot | WHU-RS19 5-Shot |
|---|---|---|---|---|---|---|
| MatchingNet [48] | 54.46 ± 0.77 | 67.87 ± 0.59 | 46.16 ± 0.71 | 66.73 ± 0.56 | 60.60 ± 0.68 | 82.99 ± 0.40 |
| RelationNet [19] | 58.61 ± 0.83 | 78.63 ± 0.52 | 48.89 ± 0.73 | 64.10 ± 0.54 | 60.54 ± 0.71 | 76.24 ± 0.34 |
| DLA-MatchNet [37] | 68.80 ± 0.70 | 81.63 ± 0.46 | 53.76 ± 0.62 | 63.01 ± 0.51 | 68.27 ± 1.83 | 79.89 ± 0.33 |
| DEADN4 [41] | 73.56 ± 0.83 | 87.28 ± 0.50 | 67.27 ± 0.74 | 87.69 ± 0.44 | 86.89 ± 0.57 | 97.63 ± 0.19 |
| ACL-Net [7] | 76.13 ± 0.24 | 86.54 ± 0.23 | 59.74 ± 0.46 | 74.89 ± 0.29 | 78.30 ± 0.32 | 90.43 ± 0.15 |
| TSDN4 [40] | 73.84 ± 0.80 | 87.86 ± 0.51 | 68.12 ± 0.81 | 88.57 ± 0.52 | 87.34 ± 0.62 | 98.25 ± 0.15 |
| GRCD-Net (Ours) | 81.39 ± 0.54 | 92.81 ± 0.24 | 66.12 ± 0.48 | 85.04 ± 0.28 | 86.53 ± 0.56 | 95.46 ± 0.14 |
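
The margins on NWPU-RESISC45 can be read off the table directly. The short script below copies the 1-shot means from the rows above and computes the gap to each baseline in percentage points; the helper name `margins_over` is only illustrative.

```python
# 1-shot accuracy (%) on NWPU-RESISC45, copied from the table above.
baselines = {
    "MatchingNet": 54.46,
    "RelationNet": 58.61,
    "DLA-MatchNet": 68.80,
    "DEADN4": 73.56,
    "ACL-Net": 76.13,
    "TSDN4": 73.84,
}
grcd_net = 81.39

def margins_over(ours, others):
    """Absolute gap in percentage points between `ours` and each baseline."""
    return {name: round(ours - acc, 2) for name, acc in others.items()}

margins = margins_over(grcd_net, baselines)
strongest = max(baselines, key=baselines.get)  # baseline with the highest mean
for name, gap in sorted(margins.items(), key=lambda item: item[1]):
    print(f"{name}: +{gap:.2f}")
```

The gap to TSDN4 is 7.55 points, while the gap to the strongest listed baseline, ACL-Net, is 5.26 points. The same table shows that on UC Merced and WHU-RS19 the best 1-shot results belong to TSDN4.
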
| # | Model Configuration | Backbone | GRC | PDM | IGR | CUB Accuracy (%) | Cars Accuracy (%) | NWPU-RESISC45 Accuracy (%) |
|---|---|---|---|---|---|---|---|---|
| M1 | Baseline (Dual Branch + ProtoNet) | ✔ | | | | 86.94 ± 0.79 | 77.29 ± 0.89 | 73.32 ± 0.68 |
| M2 | Baseline + GRC | ✔ | ✔ | | | 87.76 ± 0.76 | 81.80 ± 0.75 | 76.75 ± 0.66 |
| M3 | Baseline + GRC + PDM | ✔ | ✔ | ✔ | | 89.33 ± 0.73 | 84.30 ± 0.72 | 78.18 ± 0.70 |
| M4 | Full Model | ✔ | ✔ | ✔ | ✔ | 90.51 ± 0.64 | 84.81 ± 0.74 | 81.39 ± 0.54 |
| Method | Backbone | Params (M) | FLOPs (G) | Time (ms) | CUB 1-Shot Accuracy (%) | mini-ImageNet 1-Shot Accuracy (%) | NWPU 1-Shot Accuracy (%) | WHU-RS19 1-Shot Accuracy (%) |
|---|---|---|---|---|---|---|---|---|
| DLA-MatchNet [37] | ConvNet | 50.91 | - | - | - | - | 68.80 | 68.27 |
| CPEA [66] | ViT-S | 21.81 | 345.60 | 25.79 | 87.06 | 71.97 | - | - |
| SemFew [67] | Swin-T | 207.80 * | 360.68 † | 45.90 † | 84.21 | 71.94 | - | - |
| FewTURE [35] | Swin-T | 29.00 | - | - | 86.38 | 72.40 | - | - |
| STransGNN [43] | Swin-T | 28.66 | 442.00 | - | 91.08 | 71.94 | - | - |
| GRCD-Net (Ours) | Swin-T | 35.80 | 413.03 | 41.40 | 90.51 | 78.11 | 81.39 | 78.27 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Liu, J.; Du, Y.; Sun, L.; Li, X.; Si, Y.; Song, X.; Zheng, R. GRCD-Net: Guided Global–Local Relational Learning for Few-Shot Fine-Grained and Remote Sensing Scene Classification. Remote Sens. 2026, 18, 1632. https://doi.org/10.3390/rs18101632

