Few-Shot Semantic Segmentation in Remote Sensing: A Review on Definitions, Methods, Datasets, Advances and Future Trends
Highlights
- Provides a remote-sensing-specific review of few-shot semantic segmentation, clarifying definitions, protocols, and evaluation pitfalls.
- Systematically categorizes few-shot segmentation methods, including meta-learning, conditioning, transductive inference, and foundation-assisted approaches.
- Highlights dataset characteristics (resolution, modality, annotation regimes) that critically affect episodic evaluation and benchmarking reliability.
- Identifies open challenges and outlines future directions for scalable, reproducible, and foundation-model-assisted few-shot segmentation in Earth observation.
Abstract
1. Introduction
2. Definitions
2.1. Semantic Segmentation
2.2. Strategies
2.3. Notation
2.4. Operational Formulation of Few-Shot Semantic Segmentation
2.5. Evaluation Metrics
- $N$ is the total number of classes,
- $\mathrm{TP}_i$ represents true positives for class $i$,
- $\mathrm{FP}_i$ represents false positives for class $i$, and
- $\mathrm{FN}_i$ represents false negatives for class $i$.
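As an illustration of how these per-class counts combine into a score, the following minimal sketch computes per-class IoU as TP / (TP + FP + FN) and averages it over classes, assuming the metric meant here is the standard mean Intersection-over-Union. The function names, the toy label maps, and the choice to skip classes absent from both prediction and ground truth (via NaN) are assumptions made for this example, not part of the reviewed protocols.

```python
import numpy as np


def per_class_iou(pred, target, num_classes):
    """Return IoU_i = TP_i / (TP_i + FP_i + FN_i) for each class i.

    Classes that appear in neither map get NaN (an assumption of this sketch),
    so they can be excluded from the mean instead of counting as zero.
    """
    ious = np.full(num_classes, np.nan)
    for c in range(num_classes):
        pred_c = pred == c
        target_c = target == c
        tp = np.logical_and(pred_c, target_c).sum()
        fp = np.logical_and(pred_c, ~target_c).sum()
        fn = np.logical_and(~pred_c, target_c).sum()
        denom = tp + fp + fn
        if denom > 0:
            ious[c] = tp / denom
    return ious


def mean_iou(pred, target, num_classes):
    """Average the per-class IoU values, ignoring NaN entries."""
    return float(np.nanmean(per_class_iou(pred, target, num_classes)))


# Toy 2 x 3 label maps with three classes (hypothetical data).
pred = np.array([[0, 0, 1], [1, 1, 2]])
target = np.array([[0, 1, 1], [1, 1, 2]])

ious = per_class_iou(pred, target, num_classes=3)  # [0.5, 0.75, 1.0]
miou = mean_iou(pred, target, num_classes=3)       # 0.75
```

In episodic evaluation the same counts are often accumulated over all episodes before the ratio is taken; which of the two averaging orders a paper uses should be checked before comparing numbers.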
2.6. Relation to Neighboring Problem Settings
3. Datasets
3.1. Optical RGB Aerial Imagery
3.2. Optical RGB Satellite Imagery
3.3. Multispectral Satellite Imagery
3.4. Synthetic Aperture Radar (SAR) and Multimodal Datasets
3.5. Hyperspectral Imagery
3.6. Dataset Annotation: Practices, Pitfalls, and Quality Assessment
4. Few-Shot Methods for Semantic Segmentation in Remote Sensing
4.1. Meta-Learning Method Categories (Episodic Training)
- Proximal penalties or weight decay terms to prevent the adapted parameters from drifting too far from the base initialization.
- Early stopping in the inner loop, limiting the number of gradient steps to avoid memorizing noise from a handful of pixels.
- Meta-regularization, where the model learns during training to be robust to imperfect or sparse supervision by adding noise to support labels or dropping pixels.
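The first two regularizers can be sketched on a deliberately small model. The example below adapts a linear per-pixel classifier on a handful of support pixels, adds a proximal penalty that pulls the adapted weights back toward the base initialization, and caps the number of inner-loop steps. The linear model, the binary cross-entropy loss, the synthetic features, and all hyperparameter values (`lr`, `prox`, `max_steps`) are assumptions chosen for illustration; real methods adapt deep decoders rather than a single weight vector.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def adapt_inner_loop(theta0, feats, labels, lr=0.1, prox=1.0, max_steps=5):
    """Adapt a linear pixel classifier on support pixels.

    Objective: mean BCE(sigmoid(feats @ theta), labels)
               + (prox / 2) * ||theta - theta0||^2
    The fixed max_steps acts as early stopping in the inner loop.
    """
    theta = theta0.copy()
    for _ in range(max_steps):
        probs = sigmoid(feats @ theta)
        grad_bce = feats.T @ (probs - labels) / len(labels)
        grad_prox = prox * (theta - theta0)  # pulls theta back toward theta0
        theta = theta - lr * (grad_bce + grad_prox)
    return theta


# Hypothetical support set: 20 labeled pixels with 4-dimensional features.
rng = np.random.default_rng(0)
feats = rng.normal(size=(20, 4))
labels = (feats[:, 0] > 0).astype(float)
theta0 = np.zeros(4)

theta_free = adapt_inner_loop(theta0, feats, labels, prox=0.0)
theta_prox = adapt_inner_loop(theta0, feats, labels, prox=5.0)

# Distance travelled from the base initialization in each case.
drift_free = float(np.linalg.norm(theta_free - theta0))
drift_prox = float(np.linalg.norm(theta_prox - theta0))
```

With the proximal term active, the adapted weights stay closer to the initialization than in the unregularized run, which is the behavior the penalty is meant to enforce when the support pixels are few or noisy.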
4.2. Parameter Generation and Conditioning
- Separable kernels: generating a depthwise kernel followed by a pointwise kernel sharply reduces the parameter count while preserving shape cues (useful for roads and canals).
- Filter budgets: constrain the number of generated filters per scale and reuse them across decoder layers to keep memory stable on very-high-resolution (VHR) images.
- Stability regularizers: apply spectral- or weight-norm penalties to the generated filters; a proximal term keeps the generated filters close to a safe default when the support is noisy.
- Orthogonality/diversity: encourage the filters generated for different classes to be mutually orthogonal, which reduces cross-class interference (important when novel classes co-occur).
- Multi-level FiLM: generate the FiLM scale and shift parameters ($\gamma$, $\beta$) at several pyramid stages so that both fine detail (cars, roof edges) and global context (fields, water bodies) are influenced.
- Spatial FiLM (sFiLM): predict low-resolution modulation maps, upsample them to the feature-map resolution, and apply element-wise modulation; this localizes conditioning around likely object extents.
- Conditional norms: use conditional batch/instance normalization (BN/IN) whose affine parameters are predicted from support features (more stable than free-form modulation on noisy support).
- Adapters/LoRA: insert bottleneck adapters (or low-rank weight updates) at a few layers and tune only the adapter parameters per episode. This keeps the encoder frozen (good for multi-sensor deployments) while allowing rapid per-episode alignment.
- Bounding modulation: pass the modulation parameters through tanh and scale them by a small factor to prevent catastrophic shifts when the support mask is tiny or mislabeled.
- Query budgeting: assign a fixed number of queries per class (e.g., a small set of part queries) or a dynamic budget selected by a support scoring head; avoid query starvation for small or thin classes.
- Negative/background prompts: initialize a few queries from support complements to model distractors (e.g., shadows vs. roofs), reducing false positives in cluttered scenes.
- Prompt diversity: add a diversity regularizer to discourage duplicate masks (this prevents multiple queries from collapsing onto the same region).
- Text–visual fusion (optional): for open-vocabulary scenarios, fuse a text embedding with the visual prompt, which stabilizes the query when support is extremely sparse.
- Losses and calibration: combine pixel-wise CE/Dice with a contrastive alignment term that pulls support features toward the activated query tokens while pushing them away from negatives; temperature scaling on mask logits helps calibration when base and novel classes co-occur.
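Among the conditioning options listed above, bounded feature-wise modulation is simple enough to sketch directly. The example below applies FiLM-style scaling and shifting, y = gamma * x + beta, where the raw outputs of a conditioning head are squashed with tanh and multiplied by a small factor. The identity-centered form (gamma = 1 + scale * tanh(.)), the value `scale=0.1`, and the function name are assumptions made for this sketch; the reviewed methods may parameterize the bound differently.

```python
import numpy as np


def bounded_film(features, raw_gamma, raw_beta, scale=0.1):
    """Apply bounded FiLM modulation to a feature map.

    features:  (C, H, W) query feature map.
    raw_gamma: (C,) unbounded scale logits from a conditioning head.
    raw_beta:  (C,) unbounded shift logits from a conditioning head.
    scale:     hypothetical bound; gamma stays in [1 - scale, 1 + scale]
               and beta in [-scale, scale].
    """
    gamma = 1.0 + scale * np.tanh(raw_gamma)
    beta = scale * np.tanh(raw_beta)
    return gamma[:, None, None] * features + beta[:, None, None]


# Hypothetical feature map with 3 channels on a 2 x 2 grid.
features = np.ones((3, 2, 2))

# A well-behaved support set yields moderate conditioning outputs ...
out_zero = bounded_film(features, np.zeros(3), np.zeros(3))

# ... while a tiny or mislabeled mask may produce extreme ones.
out_extreme = bounded_film(features, np.full(3, 1e3), np.full(3, -1e3))
```

Even with extreme raw values the modulated features stay within a narrow band around the unmodulated ones, so a single bad support mask cannot push the decoder far from its default behavior.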
4.3. Inductive vs. Transductive Inference
- Advantages: Simple to scale, embarrassingly parallel, and results are easier to reproduce since each query can be evaluated in isolation. This is the most common protocol in the early FSSS literature.
- Limitations in RS: Aerial and satellite scenes often exhibit strong redundancy, e.g., many similar rooftops, repeated agricultural fields, or parallel road networks. Treating each query independently discards these correlations. Moreover, in tiling pipelines for gigapixel images, adjacent images may share the same structures, and ignoring this information reduces consistency across boundaries.
- Entropy minimization: add an entropy penalty on the predictions for unlabeled query pixels, which encourages confident predictions in regions with clear low-level support. This stabilizes homogeneous areas like fields or water.
- Pseudo-label refinement: high-confidence predictions from one query are iteratively added to the support set or used as soft supervision for other queries. In RS, this is effective when multiple images capture different parts of the same structure (e.g., a long road).
- Episode-level normalization: compute statistics (mean, variance) jointly across all queries and the support, then normalize features accordingly. This mitigates illumination and sensor differences across queries in the same episode.
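Two of the transductive ingredients above reduce to short computations. The sketch below shows (i) the mean per-pixel Shannon entropy of softmax predictions, which would be added to the loss as a penalty, and (ii) per-channel normalization with statistics pooled over the support and all queries of an episode. Array layouts, function names, the epsilon constants, and the synthetic features are assumptions for illustration only.

```python
import numpy as np


def softmax(logits, axis=0):
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def mean_pixel_entropy(logits):
    """Mean Shannon entropy over pixels; logits has shape (num_classes, H, W)."""
    probs = softmax(logits, axis=0)
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=0)
    return float(entropy.mean())


def episode_normalize(support_feats, query_feats, eps=1e-5):
    """Normalize (N, C, H, W) features with per-channel statistics
    computed jointly over the support and every query in the episode."""
    joint = np.concatenate([support_feats, query_feats], axis=0)
    mean = joint.mean(axis=(0, 2, 3), keepdims=True)
    var = joint.var(axis=(0, 2, 3), keepdims=True)

    def normalize(x):
        return (x - mean) / np.sqrt(var + eps)

    return normalize(support_feats), normalize(query_feats)


# Entropy is maximal for uniform predictions and near zero for confident ones.
uniform_logits = np.zeros((3, 2, 2))
confident_logits = np.zeros((3, 2, 2))
confident_logits[0] = 10.0
h_uniform = mean_pixel_entropy(uniform_logits)      # close to log(3)
h_confident = mean_pixel_entropy(confident_logits)  # close to 0

# Hypothetical episode: 1 support image and 4 queries with a brightness offset.
rng = np.random.default_rng(0)
support = rng.normal(loc=0.0, size=(1, 8, 4, 4))
queries = rng.normal(loc=2.0, size=(4, 8, 4, 4))
support_n, queries_n = episode_normalize(support, queries)
joint_n = np.concatenate([support_n, queries_n], axis=0)
```

Minimizing the entropy term pushes query predictions toward the confident case, while the joint normalization places support and query features on a common scale before matching.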
4.4. Data-Centric Strategies
4.5. Transfer Learning and Pretraining
4.6. Vision–Language Integration in Few-Shot Segmentation
4.7. Foundation and Prompt-Based Segmentation
4.8. Backbones, Decoders, and Conditioning: Design Choices That Matter
5. Discussion
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Lv, J.; Shen, Q.; Lv, M.; Li, Y.; Shi, L.; Zhang, P. Deep learning-based semantic segmentation of remote sensing images: A review. Front. Ecol. Evol. 2023, 11, 1201125.
- Guo, Y.; Liu, Y.; Georgiou, T.; Lew, M.S. A review of semantic segmentation using deep neural networks. Int. J. Multimed. Inf. Retr. 2018, 7, 87–93.
- Rawat, W.; Wang, Z. Deep convolutional neural networks for image classification: A comprehensive review. Neural Comput. 2017, 29, 2352–2449.
- Zhao, Z.Q.; Zheng, P.; Xu, S.T.; Wu, X. Object detection with deep learning: A review. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 3212–3232.
- Chang, Z.; Lu, Y.; Ran, X.; Gao, X.; Wang, X. Few-shot semantic segmentation: A review on recent approaches. Neural Comput. Appl. 2023, 35, 18251–18275.
- Lyu, Y.; Vosselman, G.; Xia, G.S.; Yilmaz, A.; Yang, M.Y. UAVid: A semantic segmentation dataset for UAV imagery. ISPRS J. Photogramm. Remote Sens. 2020, 165, 108–119.
- Lei, S.; Zhang, X.; He, J.; Chen, F.; Du, B.; Lu, C.T. Cross-domain few-shot semantic segmentation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 73–90.
- Xu, H.; He, H.; Zhang, Y.; Ma, L.; Li, J. A comparative study of loss functions for road segmentation in remotely sensed road datasets. Int. J. Appl. Earth Obs. Geoinf. 2023, 116, 103159.
- Hua, Y.; Marcos, D.; Mou, L.; Zhu, X.X.; Tuia, D. Semantic segmentation of remote sensing images with sparse annotations. IEEE Geosci. Remote Sens. Lett. 2021, 19, 1–5.
- Kalluri, T.; Varma, G.; Chandraker, M.; Jawahar, C. Universal semi-supervised semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5259–5270.
- Cui, B.; Chen, X.; Lu, Y. Semantic segmentation of remote sensing images using transfer learning and deep convolutional neural network with dense connection. IEEE Access 2020, 8, 116744–116755.
- Zhang, M.; Zhou, Y.; Zhao, J.; Man, Y.; Liu, B.; Yao, R. A survey of semi-and weakly supervised semantic segmentation of images. Artif. Intell. Rev. 2020, 53, 4259–4288.
- ISPRS WG III/4. ISPRS 2D Semantic Labeling Contest. Available online: https://www.isprs.org/resources/datasets/benchmarks/UrbanSemLab/semantic-labeling.aspx (accessed on 5 February 2026).
- Maggiori, E.; Tarabalka, Y.; Charpiat, G.; Alliez, P. Can semantic labeling methods generalize to any city? the inria aerial image labeling benchmark. In Proceedings of the 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Fort Worth, TX, USA, 23–28 July 2017; pp. 3226–3229.
- Boguszewski, A.; Batorski, D.; Ziemba-Jankowska, N.; Dziedzic, T.; Zambrzycka, A. LandCover.ai: Dataset for automatic mapping of buildings, woodlands, water and roads from aerial imagery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 1102–1110.
- Xia, J.; Yokoya, N.; Adriano, B.; Broni-Bediako, C. Openearthmap: A benchmark dataset for global high-resolution land cover mapping. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 6254–6264.
- Ji, S.; Wei, S.; Lu, M. Fully convolutional networks for multisource building extraction from an open aerial and satellite imagery data set. IEEE Trans. Geosci. Remote Sens. 2018, 57, 574–586.
- Mommert, M.; Kesseli, N.; Hanna, J.; Scheibenreif, L.; Borth, D.; Demir, B. Ben-ge: Extending BigEarthNet with geographical and environmental data. In Proceedings of the IGARSS 2023-2023 IEEE International Geoscience and Remote Sensing Symposium, Pasadena, CA, USA, 16–21 July 2023; pp. 1016–1019.
- Demir, I.; Koperski, K.; Lindenbaum, D.; Pang, G.; Huang, J.; Basu, S.; Hughes, F.; Tuia, D.; Raskar, R. Deepglobe 2018: A challenge to parse the earth through satellite images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 172–181.
- Van Etten, A.; Lindenbaum, D.; Bacastow, T.M. Spacenet: A remote sensing dataset and challenge series. arXiv 2018, arXiv:1807.01232.
- Wang, J.; Zheng, Z.; Ma, A.; Lu, X.; Zhong, Y. LoveDA: A remote sensing land-cover dataset for domain adaptive semantic segmentation. arXiv 2021, arXiv:2110.08733.
- Mnih, V. Machine Learning for Aerial Image Labeling; University of Toronto: Toronto, ON, Canada, 2013.
- Gupta, R.; Goodman, B.; Patel, N.; Hosfelt, R.; Sajeev, S.; Heim, E.; Doshi, J.; Lucas, K.; Choset, H.; Gaston, M. Creating xBD: A dataset for assessing building damage from satellite imagery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–20 June 2019; pp. 10–17.
- Rahnemoonfar, M.; Chowdhury, T.; Sarkar, A.; Varshney, D.; Yari, M.; Murphy, R.R. Floodnet: A high resolution aerial imagery dataset for post flood scene understanding. IEEE Access 2021, 9, 89644–89654.
- Alemohammad, H.; Booth, K. LandCoverNet: A global benchmark land cover classification training dataset. arXiv 2020, arXiv:2012.03111.
- Brown, C.F.; Brumby, S.P.; Guzder-Williams, B.; Birch, T.; Hyde, S.B.; Mazzariello, J.; Czerwinski, W.; Pasquarella, V.J.; Haertel, R.; Ilyushchenko, S.; et al. Dynamic World, Near real-time global 10 m land use land cover mapping. Sci. Data 2022, 9, 251.
- Toker, A.; Kondmann, L.; Weber, M.; Eisenberger, M.; Camero, A.; Hu, J.; Hoderlein, A.P.; Şenaras, Ç.; Davis, T.; Cremers, D.; et al. Dynamicearthnet: Daily multi-spectral satellite dataset for semantic change segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 21158–21167.
- Robinson, C.; Malkin, K.; Jojic, N.; Chen, H.; Qin, R.; Xiao, C.; Schmitt, M.; Ghamisi, P.; Hänsch, R.; Yokoya, N. Global land-cover mapping with weak supervision: Outcome of the 2020 IEEE GRSS data fusion contest. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 3185–3199.
- Bonafilia, D.; Tellman, B.; Anderson, T.; Issenberg, E. Sen1Floods11: A georeferenced dataset to train and test deep learning flood algorithms for sentinel-1. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 210–211.
- Karantzalos, K.; Karakizi, C.; Kandylakis, Z.; Antoniou, G. HyRANK hyperspectral satellite dataset I (version v001). Zenodo 2018.
- Wang, K.; Liew, J.H.; Zou, Y.; Zhou, D.; Feng, J. Panet: Few-shot image semantic segmentation with prototype alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9197–9206.
- Sherrah, J. Fully convolutional networks for dense semantic labelling of high-resolution aerial imagery. arXiv 2016, arXiv:1606.02585.
- Robinson, C.; Ortiz, A.; Malkin, K.; Elias, B.; Peng, A.; Morris, D.; Dilkina, B.; Jojic, N. Human-machine collaboration for fast land cover mapping. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 2509–2517.
- Gama, P.H.T.; Oliveira, H.; dos Santos, J.A.; Cesar, R.M., Jr. An overview on Meta-learning approaches for Few-shot Weakly-supervised Segmentation. Comput. Graph. 2023, 113, 77–88.
- Sung, F.; Yang, Y.; Zhang, L.; Xiang, T.; Torr, P.H.; Hospedales, T.M. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1199–1208.
- Min, J.; Kang, D.; Cho, M. Hypercorrelation squeeze for few-shot segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 6941–6952.
- Zhang, C.; Lin, G.; Liu, F.; Guo, J.; Wu, Q.; Yao, R. Pyramid graph networks with connection attentions for region-based one-shot semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9587–9595.
- Vinyals, O.; Blundell, C.; Lillicrap, T.; Kavukcuoglu, K.; Wierstra, D. Matching networks for one shot learning. Adv. Neural Inf. Process. Syst. 2016, 29, 1–9.
- Finn, C.; Abbeel, P.; Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the International Conference on Machine Learning, PMLR, Sydney, Australia, 6–11 August 2017; pp. 1126–1135.
- Jia, X.; De Brabandere, B.; Tuytelaars, T.; Gool, L.V. Dynamic filter networks. Adv. Neural Inf. Process. Syst. 2016, 29, 667–675.
- Liu, Y.; Lee, J.; Park, M.; Kim, S.; Yang, E.; Hwang, S.J.; Yang, Y. Learning to propagate labels: Transductive propagation network for few-shot learning. arXiv 2018, arXiv:1805.10002.
- Boudiaf, M.; Kervadec, H.; Masud, Z.I.; Piantanida, P.; Ben Ayed, I.; Dolz, J. Few-shot segmentation without meta-learning: A good transductive inference is all you need? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13979–13988.
- Ghiasi, G.; Cui, Y.; Srinivas, A.; Qian, R.; Lin, T.Y.; Cubuk, E.D.; Le, Q.V.; Zoph, B. Simple copy-paste is a strong data augmentation method for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 2918–2928.
- Ma, H.; Lin, X.; Wu, Z.; Yu, Y. Coarse-to-fine domain adaptive semantic segmentation with photometric alignment and category-center regularization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 4051–4060.
- Yang, Y.; Soatto, S. Fda: Fourier domain adaptation for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 4085–4095.
- Hariharan, B.; Girshick, R. Low-shot visual recognition by shrinking and hallucinating features. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 3018–3027.
- Triantafillou, E.; Zhu, T.; Dumoulin, V.; Lamblin, P.; Evci, U.; Xu, K.; Goroshin, R.; Gelada, C.; Swersky, K.; Manzagol, P.A.; et al. Meta-dataset: A dataset of datasets for learning to learn from few examples. arXiv 2019, arXiv:1903.03096.
- Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255.
- Hu, F.; Xia, G.S.; Hu, J.; Zhang, L. Transferring deep convolutional neural networks for the scene classification of high-resolution remote sensing imagery. Remote Sens. 2015, 7, 14680–14707.
- Manas, O.; Lacoste, A.; Giró-i Nieto, X.; Vazquez, D.; Rodriguez, P. Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 9414–9423.
- Sun, X.; Wang, P.; Lu, W.; Zhu, Z.; Lu, X.; He, Q.; Li, J.; Rong, X.; Yang, Z.; Chang, H.; et al. RingMo: A remote sensing foundation model with masked image modeling. IEEE Trans. Geosci. Remote Sens. 2022, 61, 1–22.
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 8748–8763.
- Xu, J.; Hou, J.; Zhang, Y.; Feng, R.; Wang, Y.; Qiao, Y.; Xie, W. Learning open-vocabulary semantic segmentation models from natural language supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 2935–2944.
- Zhang, J.; Zhou, Z.; Mai, G.; Hu, M.; Guan, Z.; Li, S.; Mu, L. Text2Seg: Zero-shot remote sensing image semantic segmentation via text-guided visual foundation models. In Proceedings of the 7th ACM SIGSPATIAL International Workshop on AI for Geographic Knowledge Discovery; Association for Computing Machinery: New York, NY, USA, 2024; pp. 63–66.
- Liu, F.; Chen, D.; Guan, Z.; Zhou, X.; Zhu, J.; Ye, Q.; Fu, L.; Zhou, J. Remoteclip: A vision language foundation model for remote sensing. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–16.
- Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 4015–4026.
- Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 1290–1299.
- Osco, L.P.; Wu, Q.; De Lemos, E.L.; Gonçalves, W.N.; Ramos, A.P.M.; Li, J.; Junior, J.M. The segment anything model (sam) for remote sensing applications: From zero to one shot. Int. J. Appl. Earth Obs. Geoinf. 2023, 124, 103540.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 11976–11986.
- Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X.; et al. Deep high-resolution representation learning for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 3349–3364.
- Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090.
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022.
- Dosovitskiy, A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
- Wang, L.; Li, R.; Duan, C.; Zhang, C.; Meng, X.; Fang, S. A novel transformer based semantic segmentation scheme for fine-resolution remote sensing images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5.
- Wang, L.; Li, R.; Zhang, C.; Fang, S.; Duan, C.; Meng, X.; Atkinson, P.M. UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery. ISPRS J. Photogramm. Remote Sens. 2022, 190, 196–214.
- Carion, N.; Gustafson, L.; Hu, Y.T.; Debnath, S.; Hu, R.; Suris, D.; Ryali, C.; Alwala, K.V.; Khedr, H.; Huang, A.; et al. Sam 3: Segment anything with concepts. arXiv 2025, arXiv:2511.16719.
- Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890.
- Xiao, T.; Liu, Y.; Zhou, B.; Jiang, Y.; Sun, J. Unified perceptual parsing for scene understanding. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 418–434.
- Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
- Yuan, Y.; Fang, J.; Lu, X.; Feng, Y. Spatial structure preserving feature pyramid network for semantic image segmentation. ACM Trans. Multimed. Comput. Commun. Appl. (TOMM) 2019, 15, 1–19.
- Cheng, B.; Schwing, A.; Kirillov, A. Per-pixel classification is not all you need for semantic segmentation. Adv. Neural Inf. Process. Syst. 2021, 34, 17864–17875.
- Zhang, J.; Li, Y.; Yang, X.; Jiang, R.; Zhang, L. RSAM-Seg: A SAM-Based Model with Prior Knowledge Integration for Remote Sensing Image Semantic Segmentation. Remote Sens. 2025, 17, 590.
- Li, Z.; Lu, F.; Zou, J.; Hu, L.; Zhang, H. Generalized few-shot meets remote sensing: Discovering novel classes in land cover mapping via hybrid semantic segmentation framework. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 17–21 June 2024; pp. 2744–2754.
- Tian, Z.; Zhao, H.; Shu, M.; Yang, Z.; Li, R.; Jia, J. Prior guided feature enrichment network for few-shot segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 1050–1065.
- Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; Springer: Cham, Switzerland, 2015; pp. 234–241.
- Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818.







| Symbol | Meaning |
|---|---|
| | Support set (labeled examples used to condition an episode). |
| | Query set (unlabeled examples to be segmented in an episode). |
| | Input sample and its label/mask in the support set. |
| W-way, K-shot | Few-shot episode setting: W novel classes, K support examples per class. |
| | Base classes used for meta-training and novel classes used for evaluation. |
| | Feature extractor/encoder network shared by support and query. |
| | Binary support mask for class c (1 on class pixels, 0 otherwise). |
| | Prototype (class-wise pooled feature vector) for class c. |
| | Similarity function (often cosine similarity) used for prototype matching. |
| | Predicted class label at query pixel location v. |
| TP, FP, FN | True positives, false positives, false negatives used for segmentation metrics. |
| mIoU | Mean Intersection-over-Union across classes. |
| | Dice/F1 score for class c. |
| | Trainable model parameters and their episode-adapted version. |
| | Inner-loop and outer-loop learning rates in optimization-based meta-learning. |
| | Number of mask queries in transformer-based mask decoders (given a distinct symbol to avoid a notation conflict). |
| | The k-th learned mask query token in a transformer decoder. |
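
The prototype-related entries in the notation table (binary support mask, class-wise pooled prototype, similarity function, per-pixel predicted label) fit together as in the following minimal sketch. It assumes masked average pooling for the prototype and cosine similarity for matching; the function names, array layouts, and toy features are hypothetical, and practical methods operate on deep encoder features rather than hand-written arrays.

```python
import numpy as np


def masked_average_pool(feats, mask):
    """Pool (C, H, W) support features over the pixels where mask == 1."""
    weights = mask.astype(feats.dtype)
    return (feats * weights).sum(axis=(1, 2)) / (weights.sum() + 1e-8)


def cosine_similarity_map(feats, prototype):
    """Cosine similarity between every pixel feature and one prototype."""
    feat_norm = np.linalg.norm(feats, axis=0) + 1e-8
    proto_norm = np.linalg.norm(prototype) + 1e-8
    dots = np.tensordot(prototype, feats, axes=([0], [0]))  # shape (H, W)
    return dots / (feat_norm * proto_norm)


def predict_labels(query_feats, prototypes):
    """Assign each query pixel the index of its most similar prototype."""
    sims = np.stack([cosine_similarity_map(query_feats, p) for p in prototypes])
    return sims.argmax(axis=0)


# Toy 1-way 1-shot episode with 2-channel features on a 2 x 2 grid.
# Support: left column is the target class, right column is background.
support_feats = np.array([[[1.0, 0.0], [1.0, 0.0]],
                          [[0.0, 1.0], [0.0, 1.0]]])
support_mask = np.array([[1, 0], [1, 0]])

proto_fg = masked_average_pool(support_feats, support_mask)      # [1, 0]
proto_bg = masked_average_pool(support_feats, 1 - support_mask)  # [0, 1]

# Query: top row looks like the target class, bottom row like background.
query_feats = np.array([[[1.0, 1.0], [0.0, 0.0]],
                        [[0.0, 0.0], [1.0, 1.0]]])
pred = predict_labels(query_feats, [proto_bg, proto_fg])  # index 1 = target
```

Including an explicit background prototype, as done here, is one common way to obtain a decision rule from similarities; thresholding the foreground similarity is an alternative.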
| Dataset | Spatial Resolution (GSD) | Image Resolution (px) | Sensor | #Images | #Classes | File Types (Image/Label) |
|---|---|---|---|---|---|---|
| ISPRS Potsdam | 5 cm | ∼6000 × 6000 | Aerial RGB + IR | 38 | 6 | .tif/.tif |
| ISPRS Vaihingen | 9 cm | ∼2500 × 2500 | Aerial RGB + IR | 33 | 6 | .tif/.tif |
| Inria Aerial | 0.3 m | 5000 × 5000 | Aerial RGB | 360 | 2 | .tif/.tif |
| Massachusetts Roads/Buildings | ∼1 m | 1500 × 1500 | Aerial RGB | 151/137 | 2 | .tif/.tif |
| DeepGlobe Land Cover | 0.5 m | 2448 × 2448 | Satellite RGB | 803 | 7 | .jpg/.png |
| SpaceNet (1–7) | 0.3–1 m | 650 × 650–1024 × 1024 | Satellite RGB + MS | 10k–30k | 2 | .tif/.geojson |
| LoveDA | 0.3 m | 1024 × 1024 | Satellite RGB | 5987 | 7 | .tif/.png |
| FloodNet | ≤0.5 m | 1024 × 1024 | UAV RGB | 2363 | 2 | .jpg/.png |
| xBD | 0.3–0.8 m | variable (512–1024) | Maxar/WorldView | 850k | 4 | .png/.geojson |
| Sen1Floods11 | 10 m | 512 × 512 | Sentinel-1 SAR + S2 RGB | 4k pairs | 2 | .tif/.tif |
| HyRANK | 30 m | 1200 × 700 | Hyperspectral | 1 | 13 | ENVI .hdr/.dat, .mat/.mat |
| LandCover.ai | 25 cm | ∼9000 × 9500 | Aerial RGB | 41 | 4 | .tif/.shp, .geojson |
| DFC2020 | 10 m | variable (512–1024) | S1 SAR + S2 MS | ∼20k | 10 | .tif/.tif |
| DynamicEarthNet | 10 m | 256 × 256 | Sentinel-2 (time series) | 75k | 7 | .tif/.tif |
| UAVid | ∼5 cm | 3840 × 2160 | UAV RGB | 300+ | 8 | .jpg/.png |
| OpenEarthMap | 0.25–0.5 m | 1024 × 1024 | Aerial RGB | 5k+ | 8 | .tif/.png |
| WHU Building | 0.3 m | 512 × 512 | Aerial RGB | 8k+ | 2 | .tif/.tif |
| Ben-Ge | ∼0.3 m | variable | Aerial RGB | 600+ | 2 | .tif/.png |
| DeepGlobe Road | 0.5 m | 1024 × 1024 | Satellite RGB | 6k+ | 2 | .jpg/.png |
| 95-Cloud | 30 m | 512 × 512 | Landsat | 15k+ | 2 | .jpg/.png |
| Component | Typical Choices | Primary Role | RS Pitfalls | Remedies |
|---|---|---|---|---|
| Prompt source | Edge/interior points, tight boxes, grid prompts | Coverage of candidates | Missed small objects | Denser tiling; add contour prompts |
| Proposal model | Category-agnostic segmenter (point/box prompts) | Over-segmentation of query | Over- or under-segmentation | Multi-threshold sweep; mask fusion |
| Mask scoring | Prototype similarity, text similarity, geometric priors | Select class-consistent masks | Texture confusions | Add boundary or shape priors |
| Decoder refinement | Light boundary-aware head (3–5 layers) | Sharpen edges, de-overlap | Broken linear structures (e.g., roads) | Connectivity loss; DSM channel if available |
| Adapters | LoRA or bottleneck modules on cross-attention | Episode-time calibration | Sensor or seasonal shift | Adapt only small modules; early stopping |
| Method | Dataset | Backbone | 1-Shot | 5-Shot | Zero-Shot | Metric |
|---|---|---|---|---|---|---|
| RemoteCLIP [55] | RS classification benchmarks (avg. over 10 datasets) | ViT-B/32 | – | 84.4 | 61.8 | Top-1 Acc (%) |
| Text2Seg (GDINO+CLIP+SAM) [54] | LoveDA | ViT-H | – | – | 53.8 | Overall Accuracy (%) |
| RSAM-Seg [73] | Building segmentation benchmark | ResNet backbone | – | – | 84.2 | mIoU (%) |
| SegLand [74] | OpenEarthMap Few-Shot Challenge | Swin-T + UperNetPlus | – | 54.5 | – | mIoU (%) |
| PANet [31] | DeepGlobe | ResNet-50 | 36.55 | 45.43 | – | mIoU (%) |
| PFENet [75] | DeepGlobe | ResNet-50 | 16.88 | 18.01 | – | mIoU (%) |
| PFENet [75] | ISPRS Vaihingen | ResNet backbone | 12.58 | 12.29 | – | mIoU (%) |
| U-Net [76] | Sentinel-2 | CNN Encoder–Decoder | – | – | 55.45 | mIoU (%) |
| DeepLabV3+ [77] | Sentinel-2 | ResNet backbone | – | – | 65.27 | mIoU (%) |
| SegFormer [62] | Sentinel-2 | MiT Transformer | – | – | 61.55 | mIoU (%) |
| U-Net [76] | DeepGlobe | CNN Encoder–Decoder | – | – | 75.06 | mIoU (%) |
| DeepLabV3+ [77] | DeepGlobe | ResNet backbone | – | – | 65.27 | mIoU (%) |
| SegFormer [62] | DeepGlobe | MiT Transformer | – | – | 77.44 | mIoU (%) |
| DC-Swin [65] | ISPRS Potsdam | Swin Transformer | – | – | 87.6 | mIoU (%) |
| UNetFormer [66] | ISPRS Potsdam | Transformer Encoder–Decoder | – | – | 87.5 | mIoU (%) |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Petrov, M.; Pandilova, E.; Dimitrovski, I.; Trajanov, D.; Spasev, V.; Kitanovski, I. Few-Shot Semantic Segmentation in Remote Sensing: A Review on Definitions, Methods, Datasets, Advances and Future Trends. Remote Sens. 2026, 18, 637. https://doi.org/10.3390/rs18040637

