Few-Shot Semantic Segmentation in Remote Sensing: A Review on Definitions, Methods, Datasets, Advances and Future Trends

Petrov, Marko; Pandilova, Ema; Dimitrovski, Ivica; Trajanov, Dimitar; Spasev, Vlatko; Kitanovski, Ivan

doi:10.3390/rs18040637

Open Access Review


by Marko Petrov *, Ema Pandilova, Ivica Dimitrovski, Dimitar Trajanov, Vlatko Spasev and Ivan Kitanovski

Faculty of Computer Science and Engineering, University Ss Cyril and Methodius, 1000 Skopje, North Macedonia

* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(4), 637; https://doi.org/10.3390/rs18040637

Submission received: 9 January 2026 / Revised: 1 February 2026 / Accepted: 6 February 2026 / Published: 18 February 2026


Highlights

What are the main findings?

Provides a remote-sensing-specific review of few-shot semantic segmentation, clarifying definitions, protocols, and evaluation pitfalls.
Systematically categorizes few-shot segmentation methods, including meta-learning, conditioning, transductive inference, and foundation-assisted approaches.

What are the implications of the main findings?

Highlights dataset characteristics (resolution, modality, annotation regimes) that critically affect episodic evaluation and benchmarking reliability.
Identifies open challenges and outlines future directions for scalable, reproducible, and foundation-model-assisted few-shot segmentation in Earth observation.

Abstract

Semantic segmentation of remote sensing images, the task of assigning each pixel of an image to a specific category, is widely used in areas such as disaster management, environmental monitoring, and precision agriculture. However, traditional semantic segmentation methods face a major challenge: they require large amounts of annotated data to train effectively. To tackle this challenge, few-shot semantic segmentation has been introduced, in which models learn and adapt quickly to new classes from just a few annotated samples. This paper presents a comprehensive review of recent advances in few-shot semantic segmentation (FSSS) for remote sensing, covering datasets, methods, and emerging research directions. We first outline the fundamental principles of few-shot learning and summarize commonly used remote-sensing benchmarks, emphasizing their scale, geographic diversity, and relevance to episodic evaluation. Next, we categorize FSSS methods into major families (meta-learning, conditioning-based, and foundation-assisted approaches) and analyze how architectural choices, pretraining strategies, and inference protocols influence performance. The discussion highlights empirical trends across datasets, the behavior of different conditioning mechanisms, the impact of self-supervised and multimodal pretraining, and the role of reproducibility and evaluation design. Finally, we identify key challenges and future trends, including benchmark standardization, integration with foundation and multimodal models, efficiency at scale, and uncertainty-aware adaptation. Collectively, these trends signal a shift toward unified, adaptive models capable of segmenting novel classes across sensors, regions, and temporal domains with minimal supervision.

Keywords:

few-shot learning; semantic segmentation; remote sensing; artificial intelligence; deep learning

1. Introduction

Semantic segmentation [1] has become a cornerstone task in computer vision, enabling dense pixel-level understanding of visual data. In the field of remote sensing, it plays a vital role in applications such as urban planning, disaster management, environmental monitoring, precision agriculture, and land-cover mapping. The rapid increase in the availability and resolution of aerial and satellite imagery has made automated segmentation an essential component of large-scale Earth observation pipelines. However, the effectiveness of segmentation models critically depends on the availability of labeled data. Creating accurate pixel-level annotations for remote-sensing imagery is labor-intensive, costly, and often infeasible across diverse geographic regions, sensors, and temporal domains. These challenges motivate the search for learning paradigms that can generalize from limited supervision. Few-shot semantic segmentation has emerged as a promising direction to address this issue. It extends the principles of few-shot learning to dense prediction tasks, allowing models to recognize and segment new classes from only a few annotated examples. Instead of requiring thousands of labeled samples per category, few-shot methods are trained to rapidly adapt to unseen classes through episodic training, using small labeled support sets to guide predictions on unlabeled query images. This ability to generalize under extreme data scarcity aligns naturally with the realities of remote sensing, where annotated datasets are scarce, heterogeneous, and prone to spatial and temporal domain shifts. Despite significant progress in computer vision, applying few-shot segmentation to remote sensing presents unique challenges. Aerial and satellite images differ substantially from natural images in terms of viewing geometry, scale variability, and spectral characteristics. 
Geographic and seasonal shifts can drastically change the appearance of the same object class, while class boundaries are often ill-defined due to resolution limits or sensor noise. Moreover, remote-sensing datasets typically contain large homogeneous regions, severe class imbalance, and multi-modal inputs spanning optical, multispectral, hyperspectral, and radar data. These factors demand specialized architectures, conditioning mechanisms, and evaluation protocols that go beyond standard computer-vision formulations.

This paper presents a comprehensive review of few-shot semantic segmentation methods in remote sensing, focusing on both methodological advances and empirical practices. We examine how meta-learning principles, conditioning mechanisms, and foundation-assisted strategies are being adapted to the unique characteristics of overhead imagery. This paper also emphasizes the importance of data-centric factors such as episode design, geographic partitioning, and reproducibility, which strongly influence reported performance but are often under-discussed. The remainder of this paper is organized as follows. Section 2 introduces key definitions and the conceptual framework of few-shot semantic segmentation, clarifying its distinction from related paradigms such as transfer, semi-supervised, and weakly supervised learning. Section 3 reviews commonly used datasets for remote sensing segmentation, summarizing their spatial properties, label taxonomies, and suitability for episodic evaluation. Section 4 categorizes existing few-shot segmentation methods into major families, including meta-learning, conditioning-based, and foundation-assisted approaches, and discusses how architectural and pretraining choices affect performance. Section 5 synthesizes empirical findings across datasets and models, analyzes conditioning behaviors, pretraining strategies, and evaluation protocols, and offers practical guidance for model selection and deployment. Finally, Section 6 outlines emerging research trends and challenges, including benchmark standardization, multimodal pretraining, foundation integration, and efficient, uncertainty-aware adaptation. By unifying current knowledge and identifying open gaps, this review aims to provide a structured overview of few-shot semantic segmentation in remote sensing, bridging advances in computer vision with the specific needs of large-scale Earth observation and operational geospatial analysis.

2. Definitions

2.1. Semantic Segmentation

Semantic segmentation is a significant challenge in the field of computer vision, aiming to achieve pixel-level understanding of visual data by assigning a label to each pixel in an image [2]. Figure 1 illustrates this concept using a representative example from the UAVid dataset, where complex urban scenes captured from aerial platforms are densely annotated at the pixel level. To better grasp the role of semantic segmentation, it can be viewed as a natural progression from simpler visual tasks such as image classification [3] and object detection [4]. It is widely used across various domains, including medical imaging, autonomous driving, agriculture, and urban planning. In the context of remote sensing, semantic segmentation facilitates the interpretation of aerial and satellite imagery, allowing for detailed analysis of geographic areas. This is crucial for a wide range of applications, including land cover classification, disaster management, environmental monitoring, and precision agriculture. However, traditional semantic segmentation approaches require large volumes of annotated data to achieve reliable performance. Collecting such data in remote sensing is costly and time-consuming because of the diverse geographic coverage required. To overcome these challenges, few-shot semantic segmentation [5] has been proposed as a solution that allows models to learn from a small number of annotated examples and quickly adapt to new classes.

2.2. Strategies

Few-shot learning strategies can be broadly classified into data augmentation-based methods and prior knowledge-based methods. Data augmentation techniques expand the available training set, while prior knowledge methods leverage existing experience to improve model adaptation to new tasks with limited data. These strategies rely on two distinct sets, the support set and the query set, which play a crucial role in training and evaluating the model.

The support set ($S$) consists of a small collection of labeled examples, represented as

$$S = \{(x_{1}, y_{1}), \dots, (x_{K}, y_{K})\} \tag{1}$$

where each $x_{i}$ is a data sample, and $y_{i}$ is its corresponding label. These labeled examples provide reference points that help the model identify and distinguish between different classes.

The query set ($Q$) contains a group of unlabeled data samples, represented as

$$Q = \{x_{1}', \dots, x_{m}'\} \tag{2}$$

for which the model must predict the corresponding labels. The performance of the model is evaluated based on its ability to correctly classify these query samples using the knowledge gained from the support set.

2.3. Notation

Table 1 summarizes the main symbols used throughout this paper.

2.4. Operational Formulation of Few-Shot Semantic Segmentation

Operationally, most methods extract features from both support and query images with a shared encoder and derive class-conditioned guidance from the support labels. This guidance can take the form of pixel prototypes (class-wise feature averages), attention maps, dynamic filters, or light-weight parameter modulations that specialize the decoder to the episode classes. During inference, some approaches are purely feed-forward given the support, while others perform brief test-time adaptation on the support (and sometimes the query) to better match domain-specific appearance. Inductive inference treats each query independently, while transductive inference allows limited coupling across the queries within an episode (e.g., by normalizing with episode-level statistics or refining pseudo-labels jointly).

Remote sensing introduces nuances that matter already at the definition level. High-resolution scenes are typically tiled, so train/validation/test partitions must prevent spatial or geographic leakage (adjacent images, overlapping flight lines, or the same city appearing in multiple splits). Objects vary dramatically in scale, from sub-meter road markings to multi-hectare fields, so models often require multi-scale context even when only a few labeled examples are available. Sensor and seasonal shifts (optical vs. multispectral vs. SAR, leaf-on vs. leaf-off) mean that cross-domain few-shot evaluation [7] is of practical interest: the support and query may come from different sensors or regions, emphasizing the model’s ability to transfer rather than memorize. Although pixel-perfect masks are the standard supervision in few-shot semantic segmentation, the annotation regime can be relaxed without changing the core problem statement. Scribbles, polygons, bounding boxes, or even point-level hints can serve as the support labels, with the method inferring dense masks from sparse cues. When reporting results, the labeling budget (e.g., minutes per image or number of labeled pixels) and the exact form of supervision should be stated alongside $(W, K)$.
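As an illustration of leakage-free partitioning, the following minimal sketch assigns tiles to splits by spatial block and discards a buffer ring around each block, so that no two retained tiles from different splits are adjacent. The function name `assign_split`, the block size, the buffer width, and the 60/20/20 bucket rule are all hypothetical choices for illustration; with overlapping tiles, the buffer would need to be at least as wide as the overlap.

```python
def assign_split(row, col, block=8, buffer=1):
    """Map a tile's grid position to 'train', 'val', 'test', or None (discarded)."""
    # Position of the tile inside its block of block x block tiles.
    r_in, c_in = row % block, col % block
    # Discard tiles in the buffer ring so kept tiles of different blocks never touch.
    if min(r_in, c_in) < buffer or max(r_in, c_in) >= block - buffer:
        return None
    # Deterministic bucket per block (hypothetical rule giving roughly 60/20/20).
    bucket = ((row // block) * 31 + (col // block) * 17) % 10
    if bucket < 6:
        return "train"
    if bucket < 8:
        return "val"
    return "test"


# Example: a scene cut into a 32 x 32 grid of tiles.
splits = {(r, c): assign_split(r, c) for r in range(32) for c in range(32)}
kept = {pos: s for pos, s in splits.items() if s is not None}
counts = {name: sum(1 for s in kept.values() if s == name)
          for name in ("train", "val", "test")}
print(counts)
```

Splitting by whole city or region follows the same pattern, with the block index replaced by a region identifier.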

2.5. Evaluation Metrics

Several evaluation metrics are used to assess the performance of few-shot semantic segmentation models, with the most commonly used being Mean Intersection over Union (mIoU). The mIoU measures the average overlap between the predicted segmentation and the ground truth across all classes. It is calculated as:

$$\mathrm{mIoU} = \frac{1}{N} \sum_{i = 1}^{N} \frac{T P_{i}}{T P_{i} + F P_{i} + F N_{i}} \tag{3}$$

where:

N is the total number of classes,
$T P_{i}$ represents true positives for class i,
$F P_{i}$ represents false positives, and
$F N_{i}$ represents false negatives.
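A minimal sketch of Equation (3), assuming predictions and ground truth are given as flat sequences of integer class labels. The function name `mean_iou` is illustrative, and skipping classes that are absent from both maps is an assumption of this sketch (Equation (3) itself averages over all N classes).

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes from flat sequences of predicted and true labels."""
    ious = []
    for c in range(num_classes):
        tp = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        fp = sum(1 for p, t in zip(pred, target) if p == c and t != c)
        fn = sum(1 for p, t in zip(pred, target) if p != c and t == c)
        denom = tp + fp + fn
        if denom == 0:
            continue  # assumption: a class absent from both maps is skipped
        ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else float("nan")


pred = [0, 0, 1, 1, 1, 2]
target = [0, 1, 1, 1, 2, 2]
print(mean_iou(pred, target, num_classes=3))  # each class has IoU 0.5, so mIoU is 0.5
```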

In few-shot settings, the composition of each evaluation episode is usually described as W-way K-shot, where W is the number of novel classes present in the episode and K is the number of labeled support images per class. Early work commonly used one-way setups (foreground vs. background), while more recent protocols consider multi-class episodes where several novel categories appear simultaneously. To avoid information leakage, the label set is split into disjoint base ($C_{base}$) and novel ($C_{novel}$) classes, with meta-training performed only on $C_{base}$ and evaluation performed only on $C_{novel}$. A model is considered truly few-shot if, at test time, it has access only to the small support set for each episode and does not rely on any identity or statistics of the base classes.
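The base/novel split and W-way K-shot episode construction described above can be sketched as follows. This is a simplified illustration: the helper names, the class list, and the image identifiers are hypothetical, and each image is assumed to be indexed under a single class.

```python
import random


def split_classes(all_classes, num_novel, seed=0):
    """Split the label set into disjoint base and novel class sets."""
    rng = random.Random(seed)
    shuffled = sorted(all_classes)
    rng.shuffle(shuffled)
    return set(shuffled[num_novel:]), set(shuffled[:num_novel])


def sample_episode(images_by_class, classes, n_way, k_shot, n_query, rng):
    """Draw one W-way K-shot episode with disjoint support and query images."""
    episode_classes = rng.sample(sorted(classes), n_way)
    support, query = {}, {}
    for c in episode_classes:
        picked = rng.sample(images_by_class[c], k_shot + n_query)
        support[c] = picked[:k_shot]
        query[c] = picked[k_shot:]
    return support, query


# Hypothetical toy index: ten image identifiers per class.
names = ["building", "road", "water", "forest", "crop", "bare"]
images_by_class = {c: [f"{c}_{i}" for i in range(10)] for c in names}

base, novel = split_classes(images_by_class, num_novel=2, seed=0)
support, query = sample_episode(images_by_class, novel, n_way=2, k_shot=1,
                                n_query=3, rng=random.Random(1))
print(sorted(base), sorted(novel))
print(support, query)
```

Meta-training would call `sample_episode` with `base`, and evaluation with `novel`, so that no novel class is seen before test time.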

Beyond mIoU, additional metrics are often informative in remote sensing [8]. The Dice/F1 score for a class c,

$$\mathrm{Dice}_{c} = \frac{2 T P_{c}}{2 T P_{c} + F P_{c} + F N_{c}}, \tag{4}$$

is less sensitive to small absolute area differences than IoU [9]; foreground–background IoU (FB-IoU) is useful in one-way protocols; frequency-weighted IoU accounts for class imbalance by weighting each class's IoU by its pixel frequency; and boundary-aware scores (e.g., Boundary F1) capture contour quality, which is critical for buildings, roads, and parcel boundaries. Sound reporting averages scores across many random episodes and, when applicable, across multiple base/novel folds and random seeds. It should also be made explicit whether background pixels are included in the score, whether any external data or pretraining beyond $C_{base}$ is used, and whether inference is inductive or transductive.
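As a small worked illustration of Equation (4) and of frequency-weighted IoU, the sketch below computes both from flat label sequences. The function names are illustrative, and the frequency-weighted variant shown (each class IoU weighted by its share of ground-truth pixels) is one common definition, assumed here.

```python
def class_counts(pred, target, c):
    """True positives, false positives, and false negatives for class c."""
    tp = sum(1 for p, t in zip(pred, target) if p == c and t == c)
    fp = sum(1 for p, t in zip(pred, target) if p == c and t != c)
    fn = sum(1 for p, t in zip(pred, target) if p != c and t == c)
    return tp, fp, fn


def dice(pred, target, c):
    """Dice/F1 score for class c, following Equation (4)."""
    tp, fp, fn = class_counts(pred, target, c)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else float("nan")


def frequency_weighted_iou(pred, target, num_classes):
    """Sum of per-class IoU weighted by each class's share of ground-truth pixels."""
    score = 0.0
    for c in range(num_classes):
        tp, fp, fn = class_counts(pred, target, c)
        if tp + fp + fn:
            score += ((tp + fn) / len(target)) * tp / (tp + fp + fn)
    return score


pred = [0, 0, 1, 1, 1, 2]
target = [0, 1, 1, 1, 2, 2]
print(dice(pred, target, 1))                     # IoU is 0.5, Dice is 2/3
print(frequency_weighted_iou(pred, target, 3))
```

For a single class, Dice and IoU are related by Dice = 2 IoU / (1 + IoU), which is why Dice is always at least as large as IoU.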

2.6. Relation to Neighboring Problem Settings

Finally, it is helpful to distinguish few-shot semantic segmentation from neighboring problem settings. Semi-supervised segmentation [10] assumes the same classes with many unlabeled images and a modest number of labeled ones, transfer learning [11] assumes sufficient labeled data to fine-tune on new classes without episodic structure, and weakly supervised segmentation [12] focuses on sparse labels but not necessarily few images or class transfer. In contrast, few-shot semantic segmentation targets class transfer under extreme label scarcity, evaluated episodically with strict base/novel separation and with care to the remote-sensing specifics outlined above.

3. Datasets

The datasets considered in this review are grouped primarily by sensing modality, since sensor characteristics strongly influence spatial resolution, spectral content, annotation quality, and the suitability of few-shot protocols. Within each modality, datasets may serve different application domains such as urban mapping, land-cover classification, or disaster response. Figure 2 provides representative visual examples from widely used remote sensing benchmarks, illustrating differences in spatial resolution, scene complexity, and sensing conditions across datasets.

3.1. Optical RGB Aerial Imagery

High-resolution aerial RGB datasets provide sharp object boundaries and fine spatial detail, making them particularly suitable for evaluating few-shot segmentation of buildings, roads, and urban structures. Representative datasets include ISPRS Potsdam and Vaihingen [13], Inria Aerial [14], LandCover.ai [15], UAVid [6], OpenEarthMap [16], WHU Building [17], and Ben-Ge [18]. These datasets are commonly used in few-shot experiments due to their high spatial resolution (5–30 cm GSD) and dense pixel-level annotations. However, geographic leakage is a critical concern, as images are often tiled from continuous coverage, and strict spatial separation is required to avoid inflated results.

3.2. Optical RGB Satellite Imagery

Medium-resolution satellite RGB datasets introduce greater appearance variability due to atmospheric effects, illumination changes, and sensor differences across acquisition campaigns. Datasets such as DeepGlobe Land Cover [19], SpaceNet [20], LoveDA [21], Massachusetts Buildings and Roads [22], xBD [23], and FloodNet [24] are widely used in few-shot segmentation studies. Their broader geographic diversity makes them suitable for evaluating cross-city and cross-region generalization, but also increases sensitivity to support set selection and annotation noise.

3.3. Multispectral Satellite Imagery

Multispectral datasets provide richer spectral information at the cost of coarser spatial resolution. Benchmarks such as LandCoverNet [25], Dynamic World [26], DynamicEarthNet [27], and DFC2020 [28] combine Sentinel-2 imagery with dense land-cover annotations. These datasets emphasize spectral reasoning over boundary precision and are particularly relevant for evaluating few-shot adaptation across seasons and land-cover transitions. Few-shot segmentation on multispectral data often requires modality-aware normalization and spectral augmentation to stabilize prototype estimation.

3.4. Synthetic Aperture Radar (SAR) and Multimodal Datasets

SAR imagery offers all-weather, day–night sensing but presents unique challenges due to speckle noise and different image statistics. Datasets such as Sen1Floods11 [29] and DFC2020 [28] include SAR–optical pairs that enable multimodal segmentation and cross-sensor evaluation. Few-shot segmentation on SAR data remains underexplored but is critical for operational scenarios such as flood mapping and disaster response where optical imagery may be unavailable.

3.5. Hyperspectral Imagery

Hyperspectral datasets such as HyRANK [30] provide hundreds of spectral bands, enabling fine-grained material discrimination but with limited spatial coverage and extremely small dataset sizes. These properties make hyperspectral benchmarks challenging for episodic few-shot segmentation, and most existing work relies on heavy pretraining or band-selection strategies to reduce dimensionality.

Table 2 summarizes representative remote sensing datasets commonly used for semantic segmentation, highlighting their spatial resolution, sensor modality, number of images and classes, and available data formats. These datasets collectively provide the empirical basis for evaluating few-shot semantic segmentation in remote sensing. However, challenges remain: protocols are not standardized, annotation regimes are limited, and multi-modal imagery is underexplored. Addressing these issues in future benchmark design will be essential for achieving robust and comparable progress across studies.

Overall, the reviewed datasets differ substantially in spatial resolution, sensing modality, geographic coverage, and annotation quality. High-resolution aerial benchmarks emphasize boundary precision, multispectral datasets favor spectral discrimination, while SAR and multimodal datasets introduce additional domain shifts. These differences highlight that dataset selection strongly affects both model behavior and reported performance, and that few-shot evaluation protocols must be chosen with care to ensure meaningful interpretation.

3.6. Dataset Annotation: Practices, Pitfalls, and Quality Assessment

Although Section 3 groups datasets by sensing modality, the usefulness of a benchmark for few-shot semantic segmentation is equally shaped by how annotations are produced and validated. In remote sensing, labels are often created under practical constraints related to scale, geographic coverage, and data availability, and these constraints introduce characteristic failure modes. Their impact is amplified in few-shot settings, where performance may depend on only one or two support masks per class.

Several annotation-related factors can systematically bias both training and episodic evaluation. A common issue is georegistration and temporal mismatch: misalignment between imagery and reference GIS layers, or between pre- and post-event acquisitions, can lead to systematic boundary offsets and spurious changes, particularly for small or thin objects [23]. Tiling-related effects are another critical concern. Large aerial and satellite scenes are frequently split into overlapping tiles [21], and if spatially adjacent tiles appear across splits, models may exploit near-duplicate context. Few-shot protocols therefore require strict spatial or geographic separation to avoid leakage. Class taxonomy ambiguity further complicates benchmarking, as definitions such as impervious surface versus building, or low vegetation versus crop, vary across datasets and annotation guidelines; remapping across benchmarks should explicitly document assumptions and merged categories. Annotation fidelity is also bounded by spatial resolution and label source. Polygon-to-raster conversion and coarse digitization tend to blur object boundaries, disproportionately affecting roads, narrow rivers, and small buildings. Finally, label noise and incompleteness are common in large-scale datasets [31] derived from GIS or OpenStreetMap sources, where objects may be missing, outdated, or inconsistently labeled across regions. In few-shot episodes, even a single noisy support mask can strongly bias prototype estimation or attention mechanisms.

Remote-sensing segmentation datasets rely on a mixture of annotation methods that trade accuracy for scalability. Dense manual pixel labeling provides the highest fidelity but is expensive and therefore limited to relatively small high-resolution benchmarks, particularly for urban scenes. Polygon-based annotation followed by rasterization is widely used for buildings, roads, parcels, and land-use regions; its quality depends on digitization accuracy, rasterization rules, and handling of connectivity and boundary thickness. Many large-scale datasets generate labels by aligning imagery with existing GIS or OSM vector layers, enabling global coverage but introducing misalignment, incompleteness, and regional bias. Semi-automatic pipelines combine interactive tools, classical computer-vision methods, or model-assisted pre-segmentation to reduce annotation effort, although such approaches may introduce systematic errors that require targeted auditing [32,33]. More recently, weak and sparse supervision, such as points, scribbles, bounding boxes, or coarse scene labels, has gained attention as a realistic alternative in operational settings and as a means to evaluate label efficiency beyond dense masks.

For fair comparison and reliable few-shot evaluation, annotation quality should be assessed and reported whenever possible. Inter-annotator agreement computed on a shared subset, using metrics such as class-wise IoU, Dice score, or boundary agreement, provides an estimate of human consistency. Boundary-aware diagnostics complement region-based metrics like mIoU by quantifying contour fidelity, which is particularly important for buildings and linear structures. Consistency and topology checks, such as enforcing non-self-intersecting polygons, verifying road connectivity, or removing isolated artifacts, help identify systematic labeling errors. Noise auditing through stratified spot checks across regions, seasons, sensors, or illumination conditions can reveal failure patterns that are otherwise hidden in aggregate scores. In few-shot settings, sensitivity analysis, for example by varying support selection or injecting controlled label perturbations, offers additional insight into how annotation noise propagates through episodic evaluation.

4. Few-Shot Methods for Semantic Segmentation in Remote Sensing

Few-shot semantic segmentation has emerged as a key strategy in remote sensing, aiming to produce pixel-level maps of previously unseen classes when only limited annotated data are provided for support. Methods are best organized by how the support set conditions the query prediction, by the optimization strategy used to adapt during an episode, and by the priors they inject for overhead imagery (scale variability, arbitrary rotation, structured boundaries, cross-sensor shifts). We first introduce broad method categories then detail concrete mechanisms and losses that are particularly effective for aerial/satellite scenes (gigapixel tiling, multi-scale context, geographic leakage, and multi-sensor fusion). Figure 3 provides a high-level taxonomy of few-shot semantic segmentation approaches in remote sensing, illustrating the major method families and their relationships to architectural choices, pretraining strategies, and dataset characteristics. Most few-shot segmentation methods in remote sensing build upon a common encoder–decoder segmentation template with skip connections, which we illustrate in Figure 4 to provide architectural context for the methods discussed below.

Figure 5 provides a chronological overview of representative models and methodological milestones that have shaped few-shot semantic segmentation in remote sensing, highlighting the evolution from early metric-based approaches to recent foundation- and prompt-based frameworks.

4.1. Meta-Learning Method Categories (Episodic Training)

Meta-learning [34] trains with episodes that mirror test-time usage: a small support set provides labels for a few classes, and the model must segment those classes in one or more queries. The objective is not to memorize specific categories but to learn how to adapt from very little supervision.

Metric-based and prototypical matching. The support mask for class c is summarized into a compact descriptor (a prototype), and each query pixel is assigned to the class whose prototype is most similar to its feature [31]. Let $f(\cdot)$ denote the shared encoder, $F_{s} = f(x_{\mathrm{sup}})$ and $F_{q} = f(x_{\mathrm{qry}})$ the corresponding feature maps, and $Ω_{c}$ the binary support mask for class c. One computes

$$p_{c} = \frac{\sum_{u} Ω_{c}(u) F_{s}(u)}{\sum_{u} Ω_{c}(u)}, \tag{5}$$

$$\hat{y}(v) = \arg\max_{c} \, \mathrm{sim}(F_{q}(v), p_{c}), \tag{6}$$

where $\mathrm{sim}(\cdot, \cdot)$ is usually cosine similarity. In remote sensing, simple pooling is often insufficient because appearance varies with scale, season, and viewing geometry. Practical refinements include (a) multi-scale prototypes (aggregate at several FPN levels) so sub-meter objects and hectare-scale regions are both represented; (b) mixture-of-prototypes (several centroids per class) to capture intra-class modes like metal roof vs. tile roof; (c) rotation-aware pooling (average features after discrete rotations of the support crop) to reduce heading sensitivity; and (d) boundary-weighted pooling that increases the contribution of pixels near support edges to better preserve building and road contours.
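Equations (5) and (6) can be sketched in a few lines. The example below is a minimal illustration, assuming feature maps are flattened into lists of D-dimensional vectors (one per pixel) and that background is handled as just another class with its own prototype; the feature values are made up.

```python
import math


def masked_average(features, mask):
    """Eq. (5): average the support features over pixels where the mask is 1."""
    selected = [f for f, m in zip(features, mask) if m]
    if not selected:
        raise ValueError("support mask is empty")
    return [sum(values) / len(selected) for values in zip(*selected)]


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + 1e-12)


def predict(query_features, prototypes):
    """Eq. (6): label each query pixel with its most similar prototype."""
    return [max(prototypes, key=lambda c: cosine(f, prototypes[c]))
            for f in query_features]


# Toy support image with four pixels and two-dimensional features.
support_features = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.8]]
support_masks = {"building": [1, 1, 0, 0], "background": [0, 0, 1, 1]}
prototypes = {c: masked_average(support_features, m)
              for c, m in support_masks.items()}

query_features = [[0.8, 0.2], [0.1, 0.7]]
print(predict(query_features, prototypes))  # ['building', 'background']
```

The refinements listed above fit into this skeleton: a mixture of prototypes replaces the single vector per class with several centroids, and boundary-weighted pooling replaces the binary mask with per-pixel weights.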

Learned relation heads (beyond a single prototype). Instead of a fixed similarity, a small network learns the comparison between support and query features [35]. For a query location v and a support location u, a relation score $r_{v, u} = g([F_{q}(v) \,\|\, F_{s}(u)])$ is computed (concatenation or bilinear fusion) and then aggregated over all masked support pixels of class c to produce a score map for c. This design handles subtle, part-level correspondences (e.g., roof ridges, road markings) and adapts its notion of similarity to the scene statistics of the episode.

Correlation and hyper-correlation volumes. Dense support-query correlations create a 4D tensor $C(i, j, m, n) = \langle F_{q}(i, j), F_{s}(m, n) \rangle$ that captures all pairwise feature matches; masking with the support labels yields class-specific evidence in the query [36]. To keep memory practical on large images, restrict correlations to local windows, downsample keys/values while keeping high-res queries, factorize similarity (e.g., low-rank projections), or pre-cluster support/query tokens into superpixels. Hyper-correlation fuses correlations from multiple backbone stages to recover both small vehicles and large homogeneous regions (forest, water).
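The 4D correlation tensor and its masking can be illustrated on a tiny example. In this sketch, feature maps are nested lists indexed as `[row][col]` with a D-dimensional vector per location, and averaging the correlations over masked support positions is an assumed aggregation rule (max or top-k pooling would be equally plausible choices).

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def correlation_volume(query, support):
    """C[i][j][m][n] = <F_q(i, j), F_s(m, n)> for all location pairs."""
    return [[[[dot(q, s) for s in s_row] for s_row in support]
             for q in q_row] for q_row in query]


def class_evidence(corr, support_mask):
    """Average correlation of each query location with masked support locations."""
    positions = [(m, n) for m, row in enumerate(support_mask)
                 for n, v in enumerate(row) if v]
    return [[sum(cell[m][n] for m, n in positions) / len(positions)
             for cell in row] for row in corr]


# Toy 2 x 2 feature maps with two-dimensional features.
query = [[[1.0, 0.0], [0.0, 1.0]],
         [[1.0, 1.0], [0.0, 0.0]]]
support = [[[1.0, 0.0], [1.0, 0.0]],
           [[0.0, 1.0], [0.0, 1.0]]]
support_mask = [[1, 1],
                [0, 0]]  # the class occupies the top row of the support image

corr = correlation_volume(query, support)
evidence = class_evidence(corr, support_mask)
print(evidence)  # [[1.0, 0.0], [1.0, 0.0]]
```

The tensor has one entry per query-support location pair, so its size grows with the product of the two feature-map areas, which is what motivates the windowing and downsampling strategies mentioned above.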

Graph-based matching and region reasoning. Instead of operating directly on pixels, some approaches first partition the image into superpixels or connected regions, which are then treated as nodes in a graph. Edges encode spatial adjacency, relative orientation, or appearance similarity between neighboring regions. Features extracted from the support image are used to condition node embeddings, which are then propagated to the query graph via message passing or attention-based graph networks. Final segmentation masks are obtained by assigning labels at the region level and projecting them back to pixels. This form of graph reasoning [37] is naturally robust to local texture noise and speckle, since decisions are made over aggregated regions rather than individual pixels. It also enables topology-aware constraints, such as encouraging continuity of elongated structures like roads or rivers, which are difficult to enforce with purely pixel-wise conditioning.

Memory-augmented/meta-attention. A differentiable memory stores key–value pairs for support regions, and query tokens attend to this memory to assemble class-conditioned templates [38]. Episode-wise normalization (using only support and current queries) plays a similar role by aligning distributions inside an episode and reducing biases from pretraining domains.

Optimization-based meta-learning. Instead of relying only on prototypes or attention mechanisms, these approaches explicitly fine-tune a small set of network parameters within each episode using the support data. The general idea is to simulate, during training, the process of taking a model that has been pre-trained on base classes and adapting it quickly to novel classes through one or a few gradient descent steps [39]. Formally, given parameters $θ$ and support loss $L_{\mathrm{sup}}$, the model performs an inner update

$$θ^{'} = θ - α \nabla_{θ} L_{\mathrm{sup}}(θ), \tag{7}$$

and then optimizes the outer objective on the query set,

$$θ \leftarrow θ - β \nabla_{θ} L_{\mathrm{qry}}(θ^{'}). \tag{8}$$

In practice, only lightweight components such as decoder heads, adapter layers, or normalization parameters are adapted, since updating the full encoder on a few support pixels risks overfitting. For remote sensing, optimization-based methods are particularly sensitive to the quality of the support labels. To counter noise or boundary errors, regularization strategies are often introduced:

Proximal penalties or weight decay terms to prevent the adapted parameters $θ^{'}$ from drifting too far from the base initialization.
Early stopping in the inner loop, limiting the number of gradient steps to avoid memorizing noise from a handful of pixels.
Meta-regularization, where the model learns during training to be robust to imperfect or sparse supervision by adding noise to support labels or dropping pixels.
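The inner update of Equation (7), combined with the first two regularizers above, can be sketched on a toy problem. The example assumes a linear per-pixel classifier with a binary cross-entropy support loss; the names, learning rate, penalty weight, and step budget are illustrative, and the outer update of Equation (8) is not shown.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def support_loss(theta, feats, labels):
    """Mean binary cross-entropy of a linear pixel classifier on the support set."""
    total = 0.0
    for x, y in zip(feats, labels):
        p = sigmoid(sum(w * xi for w, xi in zip(theta, x)))
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(feats)


def support_grad(theta, feats, labels):
    """Gradient of support_loss with respect to theta."""
    grad = [0.0] * len(theta)
    for x, y in zip(feats, labels):
        p = sigmoid(sum(w * xi for w, xi in zip(theta, x)))
        for d, xi in enumerate(x):
            grad[d] += (p - y) * xi / len(feats)
    return grad


def inner_adapt(theta_base, feats, labels, alpha=0.5, lam=0.1, max_steps=3):
    """A few inner steps with a proximal pull toward the base parameters."""
    theta = list(theta_base)
    for _ in range(max_steps):  # early stopping: small fixed step budget
        grad = support_grad(theta, feats, labels)
        # Gradient of L_sup plus the proximal term (lam / 2) * ||theta - theta_base||^2.
        theta = [w - alpha * (g + lam * (w - w0))
                 for w, g, w0 in zip(theta, grad, theta_base)]
    return theta


# Toy support pixels: two foreground (label 1) and two background (label 0).
feats = [[1.0, 0.2], [0.9, 0.1], [0.1, 1.0], [0.2, 0.8]]
labels = [1, 1, 0, 0]
theta_base = [0.0, 0.0]

theta_adapted = inner_adapt(theta_base, feats, labels)
print(support_loss(theta_base, feats, labels),
      support_loss(theta_adapted, feats, labels))
```

Setting `lam=0` and `max_steps=1` recovers the plain single-step update of Equation (7). Meta-regularization would additionally perturb `labels` during training episodes.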

The main difficulty for optimization-based meta-learning lies in balancing fast adaptation with stability, since noisy support masks and heterogeneous regions can easily cause either overfitting or underfitting during the inner-loop updates.

Taken together, meta-learning approaches remain among the most data-efficient strategies for few-shot segmentation in remote sensing. Prototype-based and correlation-driven methods perform well under limited supervision, but their sensitivity to support quality and appearance variability motivates the use of multi-scale reasoning, robust pooling strategies, and carefully designed episodic training protocols.

4.2. Parameter Generation and Conditioning

Here, the segmenter is specialized to the novel classes using signals distilled from the support set, rather than only comparing support and query features. This specialization can happen by generating filters, by modulating feature statistics, or by seeding decoder “mask queries” that drive the final masks [40].

Dynamic convolution and weight imprinting. Given a support prototype $p_{c} \in R^{D}$ for class $c$, a small hypernetwork $h_{θ} : R^{D} \to R^{K \times K \times C_{in} \times C_{out}}$ emits episode-specific filters

W_{c} = h_{θ} (p_{c}),

(9)

used in a light decoder to produce class logits on the query, e.g., $z_{c} = Conv (F_{q}; W_{c})$, where $F_{q}$ is a pyramid of query features. Practical variants:

Separable kernels: depthwise $3 \times 3$ followed by pointwise $1 \times 1$ sharply reduces parameters but preserves shape cues (useful for roads/canals).
Filter budgets: constrain the number of generated filters (per scale) and reuse them across decoder layers to keep memory stable on VHR images.
Stability regularizers: spectral or weight-norm penalties on $h_{θ}$ outputs; a proximal term $λ ∥ W_{c} - W_{base} ∥_{2}^{2}$ keeps generated filters close to a safe default when support is noisy.
Orthogonality/diversity: encourage $W_{c}^{⊤} W_{d} \approx 0$ for $c \neq d$ to reduce cross-class interference (important when novel classes co-occur).

Weight imprinting inserts class rows into the final classifier on the fly:

w_{c} = τ \frac{p_{c}}{∥ p_{c} ∥_{2}},

(10)

z_{c} (v) = w_{c}^{⊤} f (x_{qry}) (v) + b_{c},

(11)

with temperature $τ$ controlling logit scale. Imprinting is attractive for its constant-time update and pairs well with pyramid prototypes (one imprinted row per scale). In RS, thin objects benefit from a two-step head: an imprinted $1 \times 1$ projection followed by a small separable convolution for boundary refinement.
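
Weight imprinting as in Eqs. (10) and (11) can be sketched in a few lines of NumPy. The example below assumes that the prototype $p_{c}$ is obtained by masked average pooling of support features, which is one common choice and not the only one; the helper names `masked_average_pool`, `imprint`, and `class_logits`, the random features, and the value of $τ$ are hypothetical.

```python
import numpy as np

def masked_average_pool(feats, mask):
    """feats: (H, W, D) support features, mask: (H, W) binary. Returns a (D,) prototype."""
    m = mask.astype(feats.dtype)[..., None]
    return (feats * m).sum(axis=(0, 1)) / max(m.sum(), 1.0)

def imprint(prototype, tau=10.0):
    """Eq. (10): L2-normalize the prototype and scale it by the temperature tau."""
    return tau * prototype / (np.linalg.norm(prototype) + 1e-12)

def class_logits(query_feats, weights, bias=None):
    """Eq. (11): per-pixel logits. query_feats: (H, W, D), weights: (C, D) -> (H, W, C)."""
    logits = query_feats @ weights.T
    if bias is not None:
        logits = logits + bias
    return logits

rng = np.random.default_rng(0)
support_feats = rng.normal(size=(16, 16, 32))
support_mask = np.zeros((16, 16), dtype=bool)
support_mask[4:10, 4:10] = True          # annotated object region in the support image

p_c = masked_average_pool(support_feats, support_mask)
w_c = imprint(p_c, tau=10.0)
W = w_c[None, :]                          # classifier with a single imprinted row
z = class_logits(rng.normal(size=(16, 16, 32)), W)
print(z.shape)
```

Adding a class amounts to appending one row to `W`, which is why the update cost is constant per class.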

Feature-wise linear modulation (FiLM), conditional norms, and adapters. Support cues produce per-channel modulation of query features. For a backbone stage $ℓ$ with features $X^{(ℓ)} \in R^{H_{ℓ} \times W_{ℓ} \times C_{ℓ}}$, a conditioner $g_{ϕ}$ maps pooled support features to $(γ_{c}^{(ℓ)}, β_{c}^{(ℓ)}) \in R^{C_{ℓ}}$ and applies

Y^{(ℓ)} = γ_{c}^{(ℓ)} ⊙ X^{(ℓ)} + β_{c}^{(ℓ)},

(12)

broadcast across spatial locations. Design options:

Multi-level FiLM: generate $(γ, β)$ at several pyramid stages so both fine detail (cars, roof edges) and global context (fields, water bodies) are influenced.
Spatial FiLM (sFiLM): predict low-res modulation maps $Γ^{(ℓ)}, B^{(ℓ)} \in R^{h \times w \times C_{ℓ}}$ , upsample to $(H_{ℓ}, W_{ℓ})$ , and apply element-wise modulation; this localizes conditioning around likely object extents.
Conditional norms: use conditional BN/IN where $μ, σ$ are affine-transformed by support features (more stable than free-form $(γ, β)$ on noisy support).
Adapters/LoRA: insert bottleneck adapters (or low-rank updates $W + B A$ ) at a few layers and tune only adapter parameters per episode. This keeps the encoder frozen (good for multi-sensor deployments) while allowing rapid per-episode alignment.
Bounding modulation: pass $(γ, β)$ through tanh and scale by a small factor (e.g., $0.1$ ) to prevent catastrophic shifts when the support mask is tiny or mislabeled.

In RS practice, a FiLM-at-pyramid + tiny adapters recipe gives most of the benefit of full fine-tuning but with predictable memory and latency on large images.
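
The modulation in Eq. (12) together with the bounding trick from the list above can be sketched as follows. This is an illustrative NumPy example under stated assumptions: the conditioner is reduced to two linear maps `W_gamma` and `W_beta`, and the scale is centered at 1 so that a zero support vector yields the identity. That centering is a design choice of this sketch, not something prescribed by the surveyed methods.

```python
import numpy as np

def film_params(support_vec, W_gamma, W_beta, scale=0.1):
    """Map a pooled support vector (D,) to bounded per-channel FiLM parameters (C,).

    tanh bounds the raw outputs to [-1, 1]; `scale` limits the deviation, so
    gamma stays in [1 - scale, 1 + scale] and beta in [-scale, scale].
    """
    gamma = 1.0 + scale * np.tanh(W_gamma @ support_vec)
    beta = scale * np.tanh(W_beta @ support_vec)
    return gamma, beta

def apply_film(x, gamma, beta):
    """Eq. (12): x is (H, W, C); gamma and beta broadcast over spatial locations."""
    return gamma * x + beta

rng = np.random.default_rng(0)
C, D = 8, 16
W_gamma = rng.normal(size=(C, D))
W_beta = rng.normal(size=(C, D))
support_vec = rng.normal(size=D)
x = rng.normal(size=(4, 4, C))

gamma, beta = film_params(support_vec, W_gamma, W_beta)
y = apply_film(x, gamma, beta)

# With an uninformative (all-zero) support vector the modulation is the identity.
g0, b0 = film_params(np.zeros(D), W_gamma, W_beta)
y_identity = apply_film(x, g0, b0)
```

Because the deviation from identity is capped by `scale`, a tiny or mislabeled support mask can shift query features only slightly.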

Prompted decoders and mask queries. Mask-transformer decoders allocate $Q$ latent queries ${\{q_{k}\}}_{k = 1}^{Q}$; each query attends to the image features and decodes a soft mask. Support-derived tokens can initialize a subset of these queries:

q_{c, k} = W_{q} Pool (f (x_{\sup}) ⊙ Ω_{c}) + PE,

(13)

where $k = 1, \dots, K_{c}$ are per-class slots and PE is a positional term. Additional details:

Query budgeting: assign a fixed $K_{c}$ per class (e.g., parts) or a dynamic budget selected by a support scoring head; avoid query starvation for small/thin classes.
Negative/background prompts: initialize a few queries from support complements to model distractors (e.g., shadows vs. roofs), reducing false positives in cluttered scenes.
Prompt diversity: add a diversity regularizer $\sum_{k \neq k^{'}} {∥ q_{c, k}^{⊤} q_{c, k^{'}} ∥}_{2}$ to discourage duplicate masks (prevents multiple queries collapsing onto the same region).
Text–visual fusion (optional): for open-vocabulary scenarios, fuse a text embedding $t_{c}$ with the visual prompt, ${\tilde{q}}_{c, k} = λ q_{c, k} + (1 - λ) W_{t} t_{c}$ , which stabilizes the query when support is extremely sparse.
Losses and calibration: combine pixel-wise CE/Dice with a contrastive alignment term that pulls support features toward the activated query tokens while pushing them away from negatives; temperature scaling on mask logits helps calibration when base and novel classes co-occur.
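
The query seeding of Eq. (13), the optional text fusion, and the diversity regularizer listed above can be illustrated with a short NumPy sketch. Several assumptions are made here: `Pool` is taken to be masked average pooling, the positional term is a per-slot embedding matrix, and the names `seed_queries`, `fuse_text`, and `diversity_penalty` as well as all dimensions are hypothetical.

```python
import numpy as np

def seed_queries(support_feats, support_mask, W_q, pos_embed):
    """Eq. (13). support_feats: (H, W, D), support_mask: (H, W) binary,
    W_q: (Dq, D), pos_embed: (K, Dq). Returns (K, Dq) seeded queries."""
    m = support_mask.astype(support_feats.dtype)[..., None]
    pooled = (support_feats * m).sum(axis=(0, 1)) / max(m.sum(), 1.0)
    return (W_q @ pooled)[None, :] + pos_embed

def fuse_text(queries, text_emb, W_t, lam=0.7):
    """Convex combination of visual queries and a projected text embedding."""
    return lam * queries + (1.0 - lam) * (W_t @ text_emb)[None, :]

def diversity_penalty(queries):
    """Sum of absolute pairwise inner products over k != k'."""
    gram = queries @ queries.T
    off_diag = gram - np.diag(np.diag(gram))
    return np.abs(off_diag).sum()

rng = np.random.default_rng(0)
H, W, D, Dq, Dt, K = 8, 8, 32, 16, 12, 3
support_feats = rng.normal(size=(H, W, D))
support_mask = np.zeros((H, W), dtype=bool)
support_mask[2:6, 2:6] = True

queries = seed_queries(support_feats, support_mask,
                       rng.normal(size=(Dq, D)), 0.1 * rng.normal(size=(K, Dq)))
W_t = rng.normal(size=(Dq, Dt))
text_emb = rng.normal(size=Dt)
fused = fuse_text(queries, text_emb, W_t, lam=0.7)
penalty = diversity_penalty(queries)
```

In this sketch the slots of one class differ only through the positional term, so the diversity penalty starts high; during training it would push the slots apart.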

For RS, prompt seeding is particularly effective with tiling: cache support-initialized queries and reuse them across overlapping windows of a gigapixel scene, then merge masks with simple NMS or IoU-based fusion.

Overall, few-shot segmentation methods in remote sensing increasingly combine prototype-based conditioning with attention-driven or transformer-based architectures, trading computational efficiency for improved robustness under geographic and appearance variability. While lightweight metric-based approaches remain attractive for resource-constrained settings, recent trends favor hybrid and foundation-assisted designs that better capture large-scale spatial context and domain shifts. Conditioning-based methods improve flexibility by directly adapting segmentation heads or feature representations to novel classes. Dynamic filters, FiLM modulation, and prompt-based decoders allow stronger specialization than metric matching alone but introduce stability and calibration challenges when support annotations are sparse or noisy. Lightweight adaptation mechanisms therefore offer the most practical trade-off between adaptability and robustness.

4.3. Inductive vs. Transductive Inference

Few-shot segmentation methods differ not only in how they use the support but also in how they treat the queries during evaluation. The distinction between inductive and transductive inference is central to the design of both algorithms and benchmarks [41,42].

Inductive inference. In the inductive setting, each query image is processed independently given the support set. The model extracts features from the query, conditions them with the support prototypes or prompts, and directly predicts segmentation masks. No information is shared across queries in the same episode.

Advantages: Simple to scale, embarrassingly parallel, and results are easier to reproduce since each query can be evaluated in isolation. This is the most common protocol in the early FSSS literature.
Limitations in RS: Aerial and satellite scenes often exhibit strong redundancy, e.g., many similar rooftops, repeated agricultural fields, or parallel road networks. Treating each query independently discards these correlations. Moreover, in tiling pipelines for gigapixel images, adjacent images may share the same structures, and ignoring this information reduces consistency across boundaries.

Transductive inference. Here, all queries in the same episode are leveraged jointly, under the assumption that they are unlabeled but available together. A few strategies are common:

Entropy minimization: add a penalty $L_{ent} = \sum_{v} H (p_{θ} (\cdot | v))$ on unlabeled query pixels, which encourages confident predictions in regions with clear low-level support. This stabilizes homogeneous areas like fields or water.
Pseudo-label refinement: high-confidence predictions from one query are iteratively added to the support set or used as soft supervision for other queries. In RS, this is effective when multiple images capture different parts of the same structure (e.g., a long road).
Episode-level normalization: compute statistics (mean, variance) jointly across all queries and the support, then normalize features accordingly. This mitigates illumination and sensor differences across queries in the same episode.
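
Two of the transductive ingredients above, the entropy penalty $L_{ent}$ and episode-level normalization, are simple enough to sketch directly. The NumPy example below is illustrative only: it treats pixels as rows of a flat array, and the names `entropy_loss` and `episode_normalize` and the toy inputs are assumptions.

```python
import numpy as np

def softmax(logits, axis=-1):
    shifted = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def entropy_loss(logits):
    """Sum of per-pixel prediction entropies. logits: (N, C) for N unlabeled query pixels."""
    p = softmax(logits)
    return -(p * np.log(p + 1e-12)).sum()

def episode_normalize(support_feats, query_feats, eps=1e-5):
    """Normalize with per-channel statistics computed jointly over support and queries.

    support_feats: (Ns, D), query_feats: (Nq, D).
    """
    joint = np.concatenate([support_feats, query_feats], axis=0)
    mu, var = joint.mean(axis=0), joint.var(axis=0)
    normalize = lambda x: (x - mu) / np.sqrt(var + eps)
    return normalize(support_feats), normalize(query_feats)

# Uncertain predictions carry more entropy than confident ones.
uncertain = np.zeros((4, 2))                    # uniform over 2 classes
confident = np.array([[8.0, -8.0]] * 4)
ent_uncertain = entropy_loss(uncertain)
ent_confident = entropy_loss(confident)

rng = np.random.default_rng(0)
sup = rng.normal(loc=2.0, scale=3.0, size=(50, 6))
qry = rng.normal(loc=-1.0, scale=0.5, size=(150, 6))
sup_n, qry_n = episode_normalize(sup, qry)
joint_mean = np.concatenate([sup_n, qry_n]).mean(axis=0)
```

Note that `episode_normalize` uses statistics from the unlabeled queries, which is exactly what makes the procedure transductive and why it should be reported as such.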

Transductive methods can significantly improve few-shot performance because they exploit the self-similarity of man-made and natural structures. However, they also blur the boundary between supervised and semi-supervised learning. For reproducibility, papers must state explicitly whether inference is inductive or transductive, as this choice can substantially affect reported performance across commonly used few-shot remote-sensing benchmarks.

4.4. Data-Centric Strategies

Task-aware augmentation. Generic image transformations such as random flips and crops provide some regularization but fail to capture the variability of overhead imagery. More specialized augmentations are therefore widely used. Copy–paste operations cut labeled objects from support images, such as individual buildings, cars, or crop fields, and transplant them into queries [43]. This increases object diversity and exposes the model to more cluttered contexts, making it easier to separate foreground from background. Spectral and illumination transfer is another crucial strategy: adjusting color histograms or applying lightweight style transfer aligns imagery from different seasons (leaf-on versus leaf-off vegetation) or sensors (aerial RGB versus satellite RGB) [44,45]. These operations reduce sensitivity to seasonal or acquisition-induced shifts that otherwise cause prototypes to fail. Geometric jitter is particularly important in RS because objects can appear at arbitrary orientations; rotations by multiples of 90° and moderate scale changes (20–30%) simulate typical aerial capture geometries. Finally, context-aware cropping ensures that patches contain full objects instead of truncations, because a support patch that contains only half a building provides misleading prototypes and degrades query predictions.

Feature hallucination and class mixing. Instead of synthesizing new images, some methods enrich variability directly in the feature or mask space [46]. Prototype mixing combines support prototypes from the same class with small linear interpolations, simulating alternative appearances such as rooftops with slightly different reflectance or vegetation under different growth stages. This creates a smoother class manifold and prevents the model from overfitting to a single narrow prototype. Mask-level mixing strategies such as CutMix insert labeled regions into new contexts: for instance, pasting road masks into different urban backdrops forces the model to generalize beyond the local background statistics. Boundary-focused mixing, where objects are deliberately placed along strong edges, teaches the model to sharpen decision boundaries, which is particularly valuable for elongated or thin structures such as roads, rivers, and powerlines. Such strategies can be seen as a way to inject rare and difficult conditions into the episode distribution without requiring new annotations.
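
Prototype mixing as described above can be sketched as a convex combination of same-class prototypes with a small mixing weight. The function name `mix_prototypes`, the cap `max_lambda`, and the random prototypes are assumptions made for illustration.

```python
import numpy as np

def mix_prototypes(prototypes, rng, n_new=4, max_lambda=0.3):
    """Create n_new synthetic prototypes by small linear interpolation.

    prototypes: (S, D) support prototypes that all belong to the same class.
    Each output is (1 - lam) * p_i + lam * p_j with lam drawn from [0, max_lambda].
    """
    num = prototypes.shape[0]
    i = rng.integers(0, num, size=n_new)
    j = rng.integers(0, num, size=n_new)
    lam = rng.uniform(0.0, max_lambda, size=(n_new, 1))
    return (1.0 - lam) * prototypes[i] + lam * prototypes[j]

rng = np.random.default_rng(0)
rooftop_prototypes = rng.normal(size=(5, 16))   # e.g., five rooftop support prototypes
mixed = mix_prototypes(rooftop_prototypes, rng, n_new=8)
```

Because each output is a convex combination, the synthetic prototypes stay inside the range spanned by the observed ones, which keeps the augmentation conservative.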

Episode sampling protocols. The way episodes are formed during training and evaluation has a surprisingly strong impact on reported performance [47]. If support and query sets are sampled naively, broad area-cover categories such as impervious surfaces, vegetation, or water often dominate the episodes, overshadowing more localized object categories like cars or individual buildings. Balanced sampling across both spatial extent and object instance counts mitigates this bias and yields more stable mean IoU estimates. Another key issue is geographic non-overlap: remote sensing datasets are often created by tiling continuous aerial or satellite coverage, and allowing adjacent images to appear in different splits leads to geographic leakage, artificially inflating performance. Strong protocols explicitly disallow spatially adjacent patches across support and query folds. For temporal datasets, separation across acquisition dates is equally important. For example, in DynamicEarthNet, if support and query images come from the same location and date, models can trivially memorize unchanged structures, and separating them by season or year turns the task into a genuine test of temporal adaptation. Beyond splitting, researchers also explore “hard episode mining”, where episodes are constructed to emphasize rare classes or high intra-class variability, leading to more informative training. Empirically, unstable or leakage-prone episode sampling can shift reported mIoU by more than the gain of an entirely new architectural module, underlining the importance of rigorous protocol design.
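
The geographic non-overlap requirement can be made concrete with a small sketch that assigns tiles to folds by spatial block and then drops tiles that still touch a tile from another fold. This is an assumption-laden illustration: tiles are identified by their pixel origin on a regular grid, and the helper names `assign_block_folds` and `has_cross_fold_neighbour`, the block size, and the round-robin fold rule are all hypothetical.

```python
import numpy as np

def assign_block_folds(tile_xy, block_size, n_folds):
    """Assign every tile to a fold via the coarse spatial block that contains it.

    tile_xy: (N, 2) integer tile origins. Tiles in the same block share a fold.
    """
    blocks = np.floor_divide(tile_xy, block_size)
    block_ids, inverse = np.unique(blocks, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)                 # robust across NumPy versions
    fold_of_block = np.arange(len(block_ids)) % n_folds
    return fold_of_block[inverse]

def has_cross_fold_neighbour(tile_xy, folds, tile_size):
    """True for tiles that touch (including diagonally) a tile from another fold."""
    cheb = np.abs(tile_xy[:, None, :] - tile_xy[None, :, :]).max(axis=-1)
    adjacent = (cheb <= tile_size) & (cheb > 0)
    different = folds[:, None] != folds[None, :]
    return (adjacent & different).any(axis=1)

tile_size = 256
ys, xs = np.meshgrid(np.arange(8), np.arange(8), indexing="ij")
tile_xy = np.stack([xs.ravel(), ys.ravel()], axis=1) * tile_size   # 8 x 8 tile grid

folds = assign_block_folds(tile_xy, block_size=1024, n_folds=2)
leak = has_cross_fold_neighbour(tile_xy, folds, tile_size)

# Drop boundary tiles so that no remaining tile is adjacent to another fold.
kept_xy, kept_folds = tile_xy[~leak], folds[~leak]
still_leaking = has_cross_fold_neighbour(kept_xy, kept_folds, tile_size)
print(f"dropped {leak.sum()} of {len(tile_xy)} tiles as a buffer")
```

Block assignment alone is not sufficient, since tiles on block borders remain adjacent across folds; the buffer step trades a few tiles for a leakage-free split.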

4.5. Transfer Learning and Pretraining

Transfer initialization has an outsize impact in few-shot semantic segmentation because the model must generalize from very few labeled pixels. In remote sensing (RS), pretraining is not one-size-fits-all: success depends on the spectral bands available, revisit frequency, sensor physics (e.g., SAR speckle), and geographic diversity. Three main routes have proven influential in few-shot semantic segmentation for remote sensing: supervised natural-image pretraining, self-supervised pretraining on EO archives, and vision–language pretraining, each with practical choices that strongly affect aerial and satellite performance.

Supervised non-RS pretraining (natural images). ImageNet-style [48] pretraining is commonly used to initialize backbones for remote-sensing segmentation and often provides strong low-level filters and improved optimization stability, particularly for RGB aerial datasets. However, several studies have noted limitations arising from spectral mismatch for multispectral (MS) or SAR inputs and a texture bias that may not align with overhead imagery. To mitigate these issues, prior work has explored channel-agnostic input stems, where early convolutions are applied band-wise and subsequently mixed, enabling backbones to ingest varying numbers of spectral bands without reinitializing later layers. Band dropout during fine-tuning has also been used to encourage robustness to missing or corrupted channels [49]. For SAR data, existing approaches typically process intensity as a single-band input with tile-wise normalization using robust statistics or introduce a lightweight modality-specific stem before a shared encoder.

Self-supervised pretraining on RS corpora. Unlabeled EO archives enable pretraining on the same sensing conditions as downstream tasks, yielding the largest consistent gains in FSSS. Two families dominate: contrastive methods that bring together positive views of the same scene (spatially or temporally) [50], and masked image modeling (MIM) that reconstructs masked patches/bands [51]. A contrastive RS loss often uses geospatially or temporally anchored positives:

L_{InfoNCE} = - \sum_{i} log \frac{exp (〈 z_{i}, z_{i}^{+} 〉 / τ)}{\sum_{j} exp (〈 z_{i}, z_{j} 〉 / τ)},

(14)

where $z_{i}$ and $z_{i}^{+}$ are encodings of the same location (or same parcel) under different dates/sensors, and negatives are drawn from distant locations or different parcels. For MIM on multispectral data, mask both spatial patches and spectral bands; reconstruct in radiance/reflectance space with band-wise scaling to stabilize training. Temporal positives should be chosen within a cloud-free window (for example, within ±30 days over mid-latitudes) to avoid trivial variations. In agricultural settings, it is also useful to pair observations across different phenological stages so that the model learns features that remain stable across seasonal growth cycles.
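
A compact NumPy sketch of the contrastive objective in Eq. (14) follows. It assumes in-batch negatives, i.e., for anchor $z_{i}$ the candidate set in the denominator consists of all positives in the batch, with $z_{i}^{+}$ on the diagonal. This batching convention, the L2 normalization, the function name `info_nce`, and the toy embeddings are assumptions of the sketch.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def info_nce(z, z_pos, tau=0.1):
    """InfoNCE summed over anchors, as in Eq. (14).

    z, z_pos: (N, D). Row i of z_pos is the positive for row i of z
    (same location, different date or sensor); other rows act as negatives.
    """
    z, z_pos = l2_normalize(z), l2_normalize(z_pos)
    logits = z @ z_pos.T / tau                              # (N, N) similarities
    logits = logits - logits.max(axis=1, keepdims=True)     # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.trace(log_prob)                              # -sum_i log p(i matches i)

anchors = np.eye(4)                                         # four toy "locations"
loss_aligned = info_nce(anchors, anchors)                   # positives match anchors
loss_shuffled = info_nce(anchors, np.roll(anchors, 1, axis=0))  # positives misassigned
print(loss_aligned < loss_shuffled)
```

How positives are paired (same parcel, date window, sensor pair) is decided when `z_pos` is built, so the RS-specific choices discussed above live in the data loader rather than in the loss.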

Vision–language pretraining (VLP). Vision–language models extend pretraining by aligning overhead imagery with textual descriptions, enabling open-vocabulary priors and language-guided prompting in FSSS [52]. A CLIP-style contrastive loss,

L_{VLP} = - \sum_{i} log \frac{exp (〈 v_{i}, t_{i} 〉 / τ)}{\sum_{j} exp (〈 v_{i}, t_{j} 〉 / τ)},

(15)

trains a joint space for image embeddings $v_{i}$ and text embeddings $t_{i}$, where $t_{i}$ may come from captions, GIS-derived tags, or structured ontologies (e.g., “industrial area”, “greenhouse”, “rice paddy”). In remote sensing, textual supervision is often weak or noisy but still provides useful global semantics. Coarse scene tags (urban, rural, water, forest) can be used as an initial alignment signal, while more fine-grained land-use categories further refine the embedding space.

For FSSS, such embeddings can regularize visual prototypes by anchoring them to corresponding textual concepts (e.g., pulling a “road” prototype toward the text embedding of road) or serve as scoring functions when selecting among candidate masks in open-set episodes. This integration of language not only improves segmentation of novel classes with few examples but also moves toward generalized few-shot segmentation where semantic categories are defined at inference time rather than fixed in advance.

4.6. Vision–Language Integration in Few-Shot Segmentation

While vision–language pretraining provides a useful initialization, recent work demonstrates that vision–language models (VLMs) can also be incorporated directly into the few-shot segmentation pipeline [53]. Instead of relying exclusively on support prototypes or dense correlation maps, VLM-based methods exploit the joint text–image embedding space to guide mask selection, refine support cues, and enable open-vocabulary segmentation. One practical role of VLMs is to supply semantic anchors that complement the limited visual information available in the support set. When a novel class is represented by only a few annotated pixels, the textual embedding can stabilize prototype estimation by providing an additional semantic prior. This becomes particularly beneficial for visually ambiguous or heterogeneous categories such as “industrial buildings” or “mixed vegetation”, where support-derived features may vary substantially across instances or geographic regions.

A second role emerges in mask scoring and selection. Candidate masks produced by a proposal generator or transformer-based decoder can be evaluated jointly through visual similarity and text-guided semantic alignment, a strategy explored in recent text-guided remote-sensing segmentation frameworks such as Text2Seg [54]. This dual criterion improves discrimination in cluttered or visually similar environments, which are frequent in aerial and satellite imagery. Empirical studies show that models incorporating both forms of evidence yield more consistent predictions under domain shift and remain effective even when the support set is small or noisy.

A third contribution of VLMs is the possibility of open-vocabulary segmentation. Rather than restricting inference to a predefined taxonomy of novel classes, models can segment categories specified at test time through natural-language prompts. A user may request masks for terms such as “solar panel”, “airport runway”, or “rice paddy”, and the model retrieves corresponding regions by combining text embeddings with support cues or proposal masks. This flexibility is highly relevant in operational remote sensing scenarios, where class definitions frequently evolve and analysts often need to detect emerging or rare categories for which dedicated training data are scarce.

Integrating VLMs into few-shot segmentation remains an ongoing challenge. Most pretrained models are tuned to RGB natural imagery and may not transfer cleanly to multispectral or SAR domains. Adaptations typically involve channel-remapping stems, modality-specific branches, or training on remote-sensing-specific image–text pairs to reduce the domain gap, as demonstrated in RemoteCLIP [55]. Textual metadata extracted from GIS sources can be noisy or inconsistent, yet even coarse scene descriptions have been shown to aid alignment and improve mask selection. VLMs therefore expand the set of mechanisms available for few-shot segmentation by introducing semantic structure that is difficult to obtain from limited visual supervision alone. Their ability to combine linguistic cues with sparse support information enables richer conditioning, more stable mask predictions, and a transition toward segmentation systems that operate beyond a fixed set of labels defined at training time.

4.7. Foundation and Prompt-Based Segmentation

Category-agnostic segmenters and large decoders change how we design FSSS. Instead of learning everything from a few pixels, we can compose strong priors (generic mask proposals, text semantics) with support-driven calibration. Three integration patterns have emerged: proposal-first with mask selection, promptable decoders with support tokens, and open-vocabulary hybrids, each with RS-specific twists. Table 3 summarizes the main components of foundation-assisted few-shot semantic segmentation pipelines, their roles, and common remote-sensing-specific pitfalls together with practical remedies.

Proposal-first (category-agnostic) pipelines. A Segment-Anything-style model [56] (or similar) is used to over-segment the query into many candidate masks at multiple points/boxes. The support set supplies class evidence to select and fuse relevant candidates. A practical recipe is as follows: (i) construct prompts from support masks (edges, bounding boxes, interior seeds); (ii) generate candidate masks with a foundation segmenter; (iii) score candidates by prototype similarity, text alignment, and geometric priors; (iv) select and fuse consistent masks; and (v) refine with a lightweight boundary-aware decoder. This approach leverages foundation priors for coverage while letting the few-shot support decide which regions are relevant. RS caveats: tile at native GSD to avoid missed small objects, and re-normalize intensities per tile to handle brightness jumps.
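
Steps (iii) and (iv) of the recipe above, scoring candidate masks and fusing the selected ones, can be sketched without any foundation model by treating the proposals as given boolean masks. In the NumPy example below, the weighting of visual and text similarity, the minimum-area rule standing in for a geometric prior, the threshold, and the names `score_proposals` and `select_and_fuse` are illustrative assumptions.

```python
import numpy as np

def cosine(a, b, eps=1e-12):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def score_proposals(query_feats, proposals, prototype, text_emb=None,
                    w_vis=0.7, w_txt=0.3, min_area=16):
    """Score each candidate mask against the support prototype (and optional text).

    query_feats: (H, W, D), proposals: (M, H, W) boolean, prototype: (D,).
    text_emb, if given, is assumed to live in the same D-dimensional space.
    """
    scores = []
    for mask in proposals:
        if mask.sum() < min_area:                 # crude geometric prior
            scores.append(-np.inf)
            continue
        region_emb = query_feats[mask].mean(axis=0)
        score = cosine(region_emb, prototype)
        if text_emb is not None:
            score = w_vis * score + w_txt * cosine(region_emb, text_emb)
        scores.append(score)
    return np.array(scores)

def select_and_fuse(proposals, scores, threshold=0.5):
    """Union of all proposals whose score reaches the threshold."""
    keep = scores >= threshold
    if not keep.any():
        return np.zeros(proposals.shape[1:], dtype=bool)
    return proposals[keep].any(axis=0)

# Toy query: left half resembles the support class, right half does not.
prototype = np.array([1.0, 0.0, 0.0, 0.0])
query_feats = np.zeros((8, 8, 4))
query_feats[:, :4] = prototype
query_feats[:, 4:] = np.array([0.0, 1.0, 0.0, 0.0])

proposals = np.zeros((2, 8, 8), dtype=bool)
proposals[0, :, :4] = True
proposals[1, :, 4:] = True

scores = score_proposals(query_feats, proposals, prototype)
fused = select_and_fuse(proposals, scores)
```

In a real pipeline, `proposals` would come from a promptable segmenter and the fused mask would then be passed to the boundary refinement of step (v).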

Prompted mask transformers (support-conditioned decoders). Mask-transformer decoders allocate a fixed budget of latent queries, each decoding one mask [57]. In FSSS, a subset of queries is initialized from support tokens or prototypes, and cross-attention pulls out the corresponding regions in the query. RS-specific refinements include multi-part prompting (edges, corners, interiors) for thin structures, low-rank adapters (LoRA) or bottlenecks in cross-attention for efficient per-episode adaptation, and query budgeting (reserve generic “catch-all” queries to handle spurious regions).

Open-vocabulary few-shot (text + support). Text prompts define candidate classes while the few-shot support calibrates them to the local sensor/domain. Joint scoring combines prototype similarity, text similarity, and geometric consistency. This setup is effective for categories with clear semantics but high visual variability (e.g., “greenhouse”, “solar array”), where textual anchors provide semantic grounding and support masks refine local appearance.

Foundation models pretrained on natural images often under-segment small objects (cars, road fragments) or over-segment repetitive textures (agricultural fields) [58]. Mitigations include tiling at full GSD, adding contour prompts, sweeping proposal thresholds, and refining boundaries with DSM or connectivity-aware losses. For SAR, feeding both despeckled and raw log-intensity channels improves mask stability. Efficiency can be improved with prompt subsampling, windowed attention, and caching proposals across overlapping images. Reporting should cover the number of prompts, proposal thresholds, window sizes, and whether text priors were used, since these hyperparameters strongly affect mIoU.

4.8. Backbones, Decoders, and Conditioning: Design Choices That Matter

In few-shot segmentation for remote sensing, performance depends not only on the meta-learning objective but also on how the backbone, decoder, and conditioning pathway work together. The backbone controls the trade-off between spatial detail and semantic abstraction, the decoder fuses multi-scale context into dense masks, and conditioning determines how support cues influence the query prediction. Figure 6 summarizes practical trade-offs in compute, memory, and parameter count across common architectures.

Convolutional backbones (e.g., ResNet [59], ConvNeXt [60]) remain strong defaults for VHR RGB imagery due to stable optimization and strong edge/texture features, and they are typically paired with pyramid features to recover fine detail. HRNet [61] preserves a high-resolution stream throughout the network and is often advantageous for thin or elongated targets (roads, rivers), albeit with higher memory cost. Transformer backbones (SegFormer/MiT [62], Swin [63], ViT [64]) are increasingly used in RS because global context improves robustness under geographic and appearance variation, especially when combined with domain-relevant pretraining. Recent transformer-based segmentation architectures such as DC-Swin [65] and UNetFormer [66] further demonstrate the effectiveness of hierarchical self-attention and hybrid encoder–decoder designs for high-resolution aerial imagery. DC-Swin introduces deformable cross-window attention to better capture long-range dependencies while preserving local spatial detail. UNetFormer combines transformer encoders with UNet-style multi-scale feature fusion, providing a balanced trade-off between global context modeling and boundary precision. Although originally developed under fully supervised settings, these architectures are increasingly adopted as backbones or initialization points for few-shot and adaptation-based segmentation pipelines in remote sensing.

Foundation encoders such as the Segment Anything Model (SAM) [56,67] provide strong objectness and boundary priors but can exhibit domain gaps on overhead imagery. In practice, freezing the encoder and adapting only lightweight decoders or adapters is often more stable than full fine-tuning. For multispectral and SAR inputs, modality-aware stems (e.g., band-agnostic mixing for MS; log-intensity preprocessing and GroupNorm/LayerNorm for SAR) are commonly required; dual-branch designs can further support cross-sensor episodes.

Decoder choice largely determines how multi-scale information is converted into masks. DeepLabV3+ is a widely used baseline due to ASPP-based context aggregation and low-level feature fusion; Figure 7 outlines the typical encoder–decoder structure. Pyramid decoders such as PSPNet [68] and UPerNet [69] are efficient for large-area land-cover mapping. When efficiency or test-time adaptation is critical, lightweight heads on top of FPN features are often preferred [70,71]. Transformer decoders (MaskFormer [72], Mask2Former [57]) represent strong modern baselines by predicting masks from a set of latent queries; support information can be injected by initializing or modulating a subset of these queries.

Conditioning mechanisms are typically lightweight but high-impact. Multi-scale prototype matching remains common, injecting class prototypes at multiple feature levels. Feature-wise linear modulation (FiLM) conditions decoder activations through per-channel scale/shift parameters derived from support cues. Dynamic filter generation specializes late-stage logits by predicting class-specific kernels from support embeddings, while transformer-based pipelines often condition by seeding a portion of mask queries with support tokens.

To provide a structured overview of reported quantitative performance, Table 4 presents a representative quantitative comparison of selected few-shot, foundation-assisted, open-vocabulary, and supervised segmentation models across commonly used remote sensing benchmarks.

Architectural design choices in few-shot remote sensing segmentation are closely intertwined with practical constraints arising from image resolution, batch size, and inference strategy. Across the literature, fine-grained targets such as roads, rivers, and small buildings are typically evaluated at lower output strides, whereas coarser land-cover categories tolerate more aggressive downsampling. Large aerial and satellite inputs necessitate memory-aware design choices, including localized correlation computation, token clustering, or tiled inference with overlap. Normalization layers also play a non-trivial role under few-shot regimes, where small effective batch sizes can amplify instability. Reported performance is therefore influenced not only by the learning formulation but also by architectural configuration, normalization strategy, and inference protocol. Differences in these factors, including whether inference is inductive or transductive, are known to produce non-negligible variations in mIoU on commonly used few-shot remote-sensing benchmarks.
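
Tiled inference with overlap, mentioned above as a memory-aware design choice, can be sketched as a sliding window whose per-tile logits are averaged where tiles overlap. The tile size, the overlap, uniform averaging (rather than, e.g., center-weighted blending), and the stand-in `predict_fn` are assumptions of this example.

```python
import numpy as np

def tile_origins(length, tile, overlap):
    """Start offsets of tiles covering [0, length); the last tile is aligned to the border."""
    if length <= tile:
        return [0]
    stride = tile - overlap
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)
    return starts

def tiled_predict(image, predict_fn, tile=256, overlap=64):
    """Run predict_fn on overlapping tiles and average logits in the overlaps.

    image: (H, W, B) array; predict_fn maps an (h, w, B) patch to (h, w, C) logits.
    """
    H, W = image.shape[:2]
    acc, count = None, np.zeros((H, W, 1))
    for y in tile_origins(H, tile, overlap):
        for x in tile_origins(W, tile, overlap):
            logits = predict_fn(image[y:y + tile, x:x + tile])
            if acc is None:
                acc = np.zeros((H, W, logits.shape[-1]))
            acc[y:y + tile, x:x + tile] += logits
            count[y:y + tile, x:x + tile] += 1
    return acc / count

rng = np.random.default_rng(0)
image = rng.normal(size=(300, 500, 3))
# Stand-in "model": returns the patch itself, so the stitched output must equal the input.
stitched = tiled_predict(image, lambda patch: patch.astype(float))
origins = tile_origins(1000, 256, 64)
```

Reporting the tile size and overlap alongside results matters for the reasons given above, since both change how much context each prediction sees.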

5. Discussion

Few-shot semantic segmentation in remote sensing represents a complex balance between three often conflicting objectives: maximizing data efficiency, ensuring generalization across heterogeneous sensing conditions, and maintaining spatial fidelity for fine-grained geographic interpretation. These challenges coexist in an environment where spectral variability, geometric distortions, and annotation scarcity complicate model learning. A well-designed FSSS framework must therefore strike a careful balance between extracting transferable semantic representations and preserving structural detail that is vital in mapping applications. The following discussion integrates observations from recent advances, beginning with the influence of datasets and their properties, and then reflecting on the evolution of methods and architectures that address these constraints.

Dataset characteristics play a decisive role in shaping both learning dynamics and generalization behavior. High-resolution aerial datasets such as ISPRS Potsdam and Vaihingen prioritize precise boundary delineation and favor models that excel in fine structural detail. Their dense textures and sharp object boundaries complement methods that rely on strong local features and boundary-aware refinements. Medium-resolution satellite datasets such as LoveDA or SpaceNet introduce greater domain diversity, atmospheric variability, and urban morphological differences across cities. In these cases, models must generalize beyond local texture and geometry to capture broader semantic consistency, often requiring domain normalization and feature alignment mechanisms. At lower resolutions, multispectral benchmarks such as DFC2020 emphasize spectral reasoning rather than spatial precision, rewarding models that can leverage contextual and spectral correlations over explicit edge sharpness. These distinctions suggest that dataset resolution, modality, and geographic coverage jointly determine the optimal model family and training protocol. Urban–rural splits, seasonal differences, and cross-sensor combinations reveal the fragility of models trained on narrow distributions, highlighting the ongoing need for spectral harmonization and domain adaptation strategies. Furthermore, the quality of the support set remains a key factor in few-shot performance. A single mislabeled or unrepresentative support patch can significantly bias prototype estimation or correlation maps, and robust design practices such as trimmed-mean pooling, episodic augmentation, and adaptive query updates have proven effective in mitigating this sensitivity. Across the surveyed studies, consistent trends indicate that dataset characteristics such as spatial resolution, sensing modality, and geographic diversity influence performance as strongly as architectural choices. 
No single model family consistently dominates across benchmarks, reinforcing the importance of dataset-aware method selection and cautious cross-paper comparison.
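
To illustrate the support-set sensitivity discussed above, the following minimal sketch contrasts a plain mean with a per-channel trimmed mean when pooling support features into a prototype. It is an illustration under stated assumptions, not the procedure of any surveyed method: the function names, the `trim_fraction` parameter, and the toy feature values are all hypothetical.

```python
import numpy as np


def masked_features(features, mask):
    """Collect the feature vectors of foreground pixels.

    features: (H, W, C) support feature map; mask: (H, W) binary mask.
    Returns an (N, C) array with one row per foreground pixel.
    """
    return features[mask.astype(bool)]


def trimmed_mean_prototype(vectors, trim_fraction=0.1):
    """Per-channel trimmed mean over N foreground feature vectors.

    The lowest and highest `trim_fraction` of values in each channel are
    discarded before averaging, which limits the pull of outlier pixels.
    """
    n = vectors.shape[0]
    k = int(np.floor(trim_fraction * n))
    if n - 2 * k <= 0:
        raise ValueError("trim_fraction removes every sample")
    ordered = np.sort(vectors, axis=0)  # sort each channel independently
    return ordered[k:n - k].mean(axis=0)


# Toy support patch (assumed values): 18 clean pixels at 1.0 and
# 2 mislabeled pixels at 10.0, with C = 2 feature channels.
features = np.ones((4, 5, 2))
features[0, 0] = 10.0
features[0, 1] = 10.0
mask = np.ones((4, 5), dtype=int)

vectors = masked_features(features, mask)
plain_prototype = vectors.mean(axis=0)
robust_prototype = trimmed_mean_prototype(vectors, trim_fraction=0.1)

print("plain mean prototype:  ", plain_prototype)
print("trimmed mean prototype:", robust_prototype)
```

In this toy case the two outlier pixels shift the plain mean from 1.0 to 1.9, while the trimmed mean stays at 1.0. Trimming each channel independently is only one possible design; trimming whole pixels by their distance to a median embedding would be an alternative.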

The evolution of few-shot methods in remote sensing reflects a progression from compact, metric-based learners toward transformer and foundation-assisted architectures that balance efficiency, generalization, and scalability. Prototype-based metric learners remain the most label-efficient and interpretable category. They represent each class through compact embeddings derived from few labeled examples, regularizing learning and reducing overfitting in data-scarce regimes. Approaches such as PANet and PFENet demonstrate competitive performance on homogeneous aerial datasets like Potsdam and Vaihingen, where texture and structure are consistent, but they struggle with multisource datasets like LoveDA and DFC2020, where spectral variability distorts embeddings. Despite these limitations, prototype-based approaches remain practical for resource-constrained or rapidly deployable systems, especially when complemented by lightweight boundary refinement or conditional random field post-processing.
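
The prototype idea described above can be sketched in a few lines. The example below is in the general spirit of prototype matching (masked average pooling followed by cosine similarity) and is not a reimplementation of PANet or PFENet. The function names and the two-channel toy features are assumptions made for illustration; in practice the features would come from a pretrained encoder.

```python
import numpy as np


def masked_average_pooling(features, mask):
    """Average the (H, W, C) features over pixels where mask == 1."""
    weights = mask.astype(features.dtype)[..., None]
    total = np.maximum(weights.sum(), 1.0)  # avoid division by zero
    return (features * weights).sum(axis=(0, 1)) / total


def cosine_similarity_map(features, prototype, eps=1e-8):
    """Cosine similarity between every pixel embedding and one prototype."""
    feat_norm = np.linalg.norm(features, axis=-1) + eps
    proto_norm = np.linalg.norm(prototype) + eps
    return (features @ prototype) / (feat_norm * proto_norm)


def segment_query(query_features, support_features, support_mask):
    """Label each query pixel 1 (foreground) or 0 (background)."""
    fg = masked_average_pooling(support_features, support_mask)
    bg = masked_average_pooling(support_features, 1 - support_mask)
    scores = np.stack(
        [cosine_similarity_map(query_features, bg),
         cosine_similarity_map(query_features, fg)],
        axis=-1,
    )
    return scores.argmax(axis=-1)


# Toy 1-shot episode (assumed values): the left column of the support
# image is foreground with embedding [1, 0]; the right column is
# background with embedding [0, 1].
support_features = np.array([[[1.0, 0.0], [0.0, 1.0]],
                             [[1.0, 0.0], [0.0, 1.0]]])
support_mask = np.array([[1, 0],
                         [1, 0]])

# Query: top row resembles the foreground, bottom row the background.
query_features = np.array([[[0.9, 0.1], [0.9, 0.1]],
                           [[0.2, 0.8], [0.2, 0.8]]])

prediction = segment_query(query_features, support_features, support_mask)
print(prediction)
```

Because each class is summarized by a single vector, a spectral shift between support and query moves every pixel embedding relative to that vector at once, which is one way to read the sensitivity to multisource data noted above.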

Transformer-based methods have emerged to overcome the locality constraints of prototype networks by modeling long-range dependencies and global spatial context through self-attention mechanisms. Networks such as HSNet and CATNet show improved robustness to intra-class variability and are capable of segmenting large-scale or topologically complex objects, including roads, rivers, and vegetation boundaries. However, these advantages come with higher computational demands and a dependency on extensive pretraining. Without initialization on large natural-image or domain-specific remote-sensing datasets, transformer-based architectures often overfit to the limited structure of episodic support examples. Their scalability, therefore, depends on computational resources, memory capacity, and access to representative pretraining corpora.

The most recent wave of research focuses on foundation-assisted approaches, drawing inspiration from promptable architectures such as SAM and cross-modal vision–language models like CLIP [52]. These methods leverage large pretrained encoders capable of delineating generic object boundaries or associating visual features with semantic text embeddings. In few-shot settings, they act as class-agnostic proposal generators, where a small number of labeled samples guide the selection and refinement of relevant masks. This approach greatly reduces the need for dense supervision and enables rapid adaptation to unseen regions or classes. Nonetheless, several challenges persist, particularly regarding prompt sensitivity and the transfer gap between RGB pretraining and multispectral or SAR imagery. Parameter-efficient adaptation techniques such as LoRA or bottleneck adapters offer practical solutions, enabling domain-specific fine-tuning while preserving the general visual priors of the pretrained model. These strategies make foundation-assisted methods increasingly viable for real-world mapping pipelines, balancing adaptability and computational efficiency.
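
As a rough illustration of the parameter-efficient adaptation mentioned above, the sketch below shows the forward pass of a linear layer with a LoRA-style low-rank update, where the frozen weight matrix W is augmented by (alpha / r) · B · A. The class name, layer size, rank, and scaling value are assumptions chosen for the example. No training loop is included, and the code is not tied to any specific foundation model.

```python
import numpy as np


class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update.

    Effective weight: W + (alpha / rank) * B @ A, with W kept frozen.
    """

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        out_dim, in_dim = weight.shape
        rng = np.random.default_rng(seed)
        self.weight = weight                                   # frozen
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))    # trainable
        self.B = np.zeros((out_dim, rank))                     # trainable, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        # x: (batch, in_dim). The low-rank path adds a correction on top
        # of the frozen projection.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_parameters(self):
        return self.A.size + self.B.size


# Hypothetical 256 x 256 projection taken from a pretrained encoder.
weight = np.random.default_rng(1).normal(size=(256, 256))
layer = LoRALinear(weight, rank=4, alpha=8.0)
x = np.ones((2, 256))

frozen_output = x @ weight.T
adapted_output = layer(x)

print("trainable parameters:", layer.trainable_parameters())
print("frozen parameters:   ", weight.size)
```

With B initialized to zero, the adapted layer reproduces the pretrained output exactly before any fine-tuning, so adaptation starts from the general visual prior. In this toy configuration only 2048 of 65,536 parameters would be updated.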

When aligning datasets with model families, consistent patterns emerge. Prototype-based architectures with boundary refinements tend to excel on high-resolution aerial benchmarks such as ISPRS Potsdam and Vaihingen, where fine edge delineation and spatial precision dominate. Transformer-based and foundation-assisted approaches generally perform better on diverse and large-scale datasets such as LoveDA, SpaceNet, and xBD, where geographic variability and complex urban morphology outweigh local texture fidelity. Cross-modal benchmarks, including DFC2020 and Sen1Floods11, benefit from models that incorporate late fusion and co-attention mechanisms to reconcile complementary sensor information. In contrast, coarser-resolution multispectral segmentation datasets favor prototype-driven or lightweight conditioning paradigms, particularly when spectral augmentation or self-supervised pretraining is used to stabilize class representations. These trends confirm that architectural choices depend not only on the target application but also on the spatial resolution, sensing modality, and domain diversity of the dataset.

Evaluation remains a critical aspect where inconsistencies often obscure real progress. Mean Intersection over Union alone cannot fully capture the performance of models on fine-structured or imbalanced classes. Complementary metrics such as Boundary IoU, connectivity measures, or foreground–background IoU offer a more comprehensive understanding, particularly for network-like structures such as roads. Reliable assessment should also include cross-domain generalization tests, such as leave-one-city-out and seasonal holdout protocols, along with ablation analyses to isolate modality contributions in multispectral or SAR configurations. Transparent reporting of calibration metrics, inference efficiency, and the chosen evaluation mode (inductive or transductive) ensures comparability and reproducibility across studies.

From a deployment perspective, practical trade-offs remain between computational cost, annotation availability, and inference throughput. Prototype-based models are lightweight and efficient, typically operating within 8 GB of GPU memory, whereas transformer- and foundation-based models require higher computational capacity, often in the range of 16–24 GB or dedicated accelerators. While the added complexity of proposal generation and mask scoring increases latency in foundation-assisted frameworks, techniques such as prompt batching or region-of-interest gating can mitigate these costs during large-tile inference. Moreover, promptable and semi-supervised variants are showing promise in approaching fully supervised performance with sparse labels, making them suitable for time-sensitive or cost-constrained mapping and monitoring applications.

The implications of deploying few-shot segmentation models under high predictive uncertainty are especially pronounced in critical operational settings such as disaster response, flood mapping, or post-event damage assessment. In these scenarios, segmentation outputs may directly inform emergency planning, resource allocation, or policy decisions, amplifying the cost of false positives and false negatives. Few-shot models are highly sensitive to the composition and quality of the support set; a small number of mislabeled or ambiguous support pixels can bias prototype estimation or attention mechanisms, resulting in systematic over- or under-segmentation over large areas. This sensitivity underscores the need for uncertainty-aware deployment strategies, including probabilistic outputs, ensemble-based uncertainty estimation, or abstention mechanisms that flag low-confidence regions for human review. In practice, few-shot segmentation should be embedded within human-in-the-loop pipelines rather than treated as a fully autonomous decision-making tool, with explicit confidence thresholds and validation steps tailored to the risk profile of the application domain.

Beyond technical performance, few-shot semantic segmentation raises important ethical and societal considerations when applied to real-world remote sensing scenarios. Because few-shot models rely on extremely limited supervision, their predictions can be disproportionately influenced by biases present in the support set, including geographic bias, sensor bias, or seasonal bias. If the support examples are unrepresentative of the broader region or conditions under analysis, systematic errors may propagate across large spatial extents. This risk is particularly relevant in global monitoring contexts, where data availability and annotation quality vary significantly across regions, potentially reinforcing existing inequalities in environmental assessment or infrastructure mapping. In addition, the growing accessibility of few-shot and foundation-assisted segmentation lowers the barrier for large-scale surveillance or sensitive infrastructure analysis, raising concerns about misuse. Responsible deployment therefore requires transparency about model limitations, careful curation of support examples, and adherence to legal and ethical frameworks governing geospatial data use. Equally important is the communication of model confidence: presenting segmentation outputs without conveying uncertainty may lead downstream users to overtrust predictions that are inherently fragile under sparse supervision.
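
One simple form of the abstention mechanism discussed above is to threshold per-pixel predictive entropy and hand uncertain pixels to a human reviewer. The sketch below assumes that the model exposes per-pixel class logits. The function names, the `max_entropy` threshold of 0.5, and the `-1` abstain label are illustrative choices, not values recommended by the surveyed literature.

```python
import numpy as np


def softmax(logits, axis=-1):
    """Numerically stable softmax along the class axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def predict_with_abstention(logits, max_entropy=0.5, abstain_label=-1):
    """Return per-pixel labels, replacing uncertain pixels with abstain_label.

    logits: (H, W, K) array of class scores. Entropy is divided by log(K)
    so that it lies in [0, 1] regardless of the number of classes.
    """
    probs = softmax(logits)
    num_classes = probs.shape[-1]
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=-1) / np.log(num_classes)
    labels = probs.argmax(axis=-1)
    labels[entropy > max_entropy] = abstain_label
    return labels, entropy


# Toy 1 x 3 tile with two classes (assumed logits): two confident pixels
# and one pixel where both classes are almost equally likely.
logits = np.array([[[4.0, 0.0],
                    [0.0, 4.0],
                    [0.1, 0.0]]])

labels, entropy = predict_with_abstention(logits, max_entropy=0.5)
print("labels: ", labels)
print("entropy:", np.round(entropy, 2))
```

Pixels labeled `-1` would be routed to manual review instead of being reported as predictions. A single softmax entropy is only a crude confidence proxy; the ensemble-based estimates mentioned above would replace `probs` with an average over several models or support sets.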

Looking ahead, the field is transitioning toward unified multimodal foundations capable of integrating optical, multispectral, and radar data through shared pretraining. Such systems are expected to reduce the reliance on task-specific adapters and improve robustness across domains. Advances in prompt design will likely enable more expressive conditioning, such as curve- or graph-based prompts tailored to structured geographic entities like roads or building networks. Continual few-shot learning paradigms are also emerging, allowing models to adapt to new data streams over time while retaining previously acquired knowledge. Uncertainty-aware supervision and active learning strategies will further enhance reliability by prioritizing the most informative samples for annotation. Integrating geospatial priors from GIS datasets, digital elevation models, or topographic maps can embed physical realism into predictions, constraining model outputs to plausible spatial configurations. Few-shot semantic segmentation in remote sensing is evolving toward modular, adaptive frameworks that combine the efficiency of metric-based learning with the generalization capabilities of foundation models. The most successful approaches will remain sensitive to the particularities of each dataset, its spatial resolution, sensing modality, and geographic variability, while maintaining stable calibration and interpretable behavior. Continued progress will depend not only on methodological innovation but also on disciplined benchmarking and the establishment of standardized, multimodal datasets that can sustain reproducible evaluation and accelerate practical adoption.

Taken together, the reviewed evidence indicates that performance in remote sensing few-shot segmentation is shaped as much by dataset characteristics and evaluation protocols as by model architecture, highlighting the importance of data-centric design, standardized benchmarks, and transparent reporting alongside algorithmic innovation.

6. Conclusions

Few-shot semantic segmentation has emerged as a promising paradigm for remote sensing, addressing the fundamental challenge of learning robust pixel-level models under severe annotation scarcity. This review has surveyed recent advances in datasets, learning paradigms, and architectural designs, highlighting how few-shot methods adapt classical segmentation pipelines to the unique characteristics of aerial and satellite imagery. We discussed how factors such as spatial resolution, sensing modality, geographic diversity, and annotation quality strongly influence both method design and empirical performance, underscoring the importance of data-centric evaluation and rigorous experimental protocols. Several converging trends are likely to shape future research. A primary priority is the development of standardized benchmarks with well-defined base–novel splits, strict geographic separation, and episodic evaluation protocols spanning multiple sensors and resolutions. In parallel, foundation and multimodal models are increasingly influencing few-shot segmentation, enabling stronger priors through large-scale pretraining and more flexible conditioning via visual and textual prompts. Parameter-efficient adaptation techniques, such as adapters and low-rank updates, are expected to play a central role in bridging the gap between large pretrained models and data-scarce remote-sensing scenarios. Beyond static imagery, integrating few-shot learning with temporal analysis, cross-sensor transfer, and continual adaptation remains an open and impactful direction. Looking ahead, practical deployment will depend not only on accuracy but also on efficiency, reliability, and interpretability. Advances in lightweight architectures, uncertainty-aware learning, and human-in-the-loop refinement offer promising avenues for reducing annotation costs while maintaining operational robustness. 
Ultimately, progress in few-shot semantic segmentation for remote sensing will rely on the convergence of foundation models, multimodal pretraining, and disciplined benchmarking. By aligning methodological innovation with standardized evaluation and open data practices, few-shot segmentation can transition from a research paradigm into a scalable and adaptive tool for real-world Earth observation applications.

Author Contributions

Conceptualization, M.P. and E.P.; Methodology, M.P., E.P. and I.K.; Writing—original draft preparation, M.P. and E.P.; Writing—review and editing, I.K. and M.P.; Visualization, V.S.; Supervision, I.D.; Project administration, I.D.; Funding acquisition, D.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Ministry of Education and Science of the Republic of North Macedonia through the project “Utilizing AI and National Large Language Models to Advance Macedonian Language Capabilities”.

Data Availability Statement

This article is a review paper and does not report original experimental data. All datasets discussed in this study are publicly available and are cited in the corresponding references within the manuscript. No new data were generated or analyzed for this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Lv, J.; Shen, Q.; Lv, M.; Li, Y.; Shi, L.; Zhang, P. Deep learning-based semantic segmentation of remote sensing images: A review. Front. Ecol. Evol. 2023, 11, 1201125. [Google Scholar] [CrossRef]
Guo, Y.; Liu, Y.; Georgiou, T.; Lew, M.S. A review of semantic segmentation using deep neural networks. Int. J. Multimed. Inf. Retr. 2018, 7, 87–93. [Google Scholar] [CrossRef]
Rawat, W.; Wang, Z. Deep convolutional neural networks for image classification: A comprehensive review. Neural Comput. 2017, 29, 2352–2449. [Google Scholar] [CrossRef]
Zhao, Z.Q.; Zheng, P.; Xu, S.T.; Wu, X. Object detection with deep learning: A review. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 3212–3232. [Google Scholar] [CrossRef]
Chang, Z.; Lu, Y.; Ran, X.; Gao, X.; Wang, X. Few-shot semantic segmentation: A review on recent approaches. Neural Comput. Appl. 2023, 35, 18251–18275. [Google Scholar] [CrossRef]
Lyu, Y.; Vosselman, G.; Xia, G.S.; Yilmaz, A.; Yang, M.Y. UAVid: A semantic segmentation dataset for UAV imagery. ISPRS J. Photogramm. Remote Sens. 2020, 165, 108–119. [Google Scholar] [CrossRef]
Lei, S.; Zhang, X.; He, J.; Chen, F.; Du, B.; Lu, C.T. Cross-domain few-shot semantic segmentation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 73–90. [Google Scholar]
Xu, H.; He, H.; Zhang, Y.; Ma, L.; Li, J. A comparative study of loss functions for road segmentation in remotely sensed road datasets. Int. J. Appl. Earth Obs. Geoinf. 2023, 116, 103159. [Google Scholar] [CrossRef]
Hua, Y.; Marcos, D.; Mou, L.; Zhu, X.X.; Tuia, D. Semantic segmentation of remote sensing images with sparse annotations. IEEE Geosci. Remote Sens. Lett. 2021, 19, 1–5. [Google Scholar] [CrossRef]
Kalluri, T.; Varma, G.; Chandraker, M.; Jawahar, C. Universal semi-supervised semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5259–5270. [Google Scholar]
Cui, B.; Chen, X.; Lu, Y. Semantic segmentation of remote sensing images using transfer learning and deep convolutional neural network with dense connection. IEEE Access 2020, 8, 116744–116755. [Google Scholar] [CrossRef]
Zhang, M.; Zhou, Y.; Zhao, J.; Man, Y.; Liu, B.; Yao, R. A survey of semi-and weakly supervised semantic segmentation of images. Artif. Intell. Rev. 2020, 53, 4259–4288. [Google Scholar] [CrossRef]
ISPRS WG III/4. ISPRS 2D Semantic Labeling Contest. Available online: https://www.isprs.org/resources/datasets/benchmarks/UrbanSemLab/semantic-labeling.aspx (accessed on 5 February 2026).
Maggiori, E.; Tarabalka, Y.; Charpiat, G.; Alliez, P. Can semantic labeling methods generalize to any city? The Inria aerial image labeling benchmark. In Proceedings of the 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Fort Worth, TX, USA, 23–28 July 2017; pp. 3226–3229. [Google Scholar]
Boguszewski, A.; Batorski, D.; Ziemba-Jankowska, N.; Dziedzic, T.; Zambrzycka, A. LandCover.ai: Dataset for automatic mapping of buildings, woodlands, water and roads from aerial imagery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 1102–1110. [Google Scholar]
Xia, J.; Yokoya, N.; Adriano, B.; Broni-Bediako, C. Openearthmap: A benchmark dataset for global high-resolution land cover mapping. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 6254–6264. [Google Scholar]
Ji, S.; Wei, S.; Lu, M. Fully convolutional networks for multisource building extraction from an open aerial and satellite imagery data set. IEEE Trans. Geosci. Remote Sens. 2018, 57, 574–586. [Google Scholar] [CrossRef]
Mommert, M.; Kesseli, N.; Hanna, J.; Scheibenreif, L.; Borth, D.; Demir, B. Ben-ge: Extending BigEarthNet with geographical and environmental data. In Proceedings of the IGARSS 2023-2023 IEEE International Geoscience and Remote Sensing Symposium, Pasadena, CA, USA, 16–21 July 2023; pp. 1016–1019. [Google Scholar]
Demir, I.; Koperski, K.; Lindenbaum, D.; Pang, G.; Huang, J.; Basu, S.; Hughes, F.; Tuia, D.; Raskar, R. Deepglobe 2018: A challenge to parse the earth through satellite images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 172–181. [Google Scholar]
Van Etten, A.; Lindenbaum, D.; Bacastow, T.M. Spacenet: A remote sensing dataset and challenge series. arXiv 2018, arXiv:1807.01232. [Google Scholar]
Wang, J.; Zheng, Z.; Ma, A.; Lu, X.; Zhong, Y. LoveDA: A remote sensing land-cover dataset for domain adaptive semantic segmentation. arXiv 2021, arXiv:2110.08733. [Google Scholar]
Mnih, V. Machine Learning for Aerial Image Labeling; University of Toronto: Toronto, ON, Canada, 2013. [Google Scholar]
Gupta, R.; Goodman, B.; Patel, N.; Hosfelt, R.; Sajeev, S.; Heim, E.; Doshi, J.; Lucas, K.; Choset, H.; Gaston, M. Creating xBD: A dataset for assessing building damage from satellite imagery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–20 June 2019; pp. 10–17. [Google Scholar]
Rahnemoonfar, M.; Chowdhury, T.; Sarkar, A.; Varshney, D.; Yari, M.; Murphy, R.R. Floodnet: A high resolution aerial imagery dataset for post flood scene understanding. IEEE Access 2021, 9, 89644–89654. [Google Scholar] [CrossRef]
Alemohammad, H.; Booth, K. LandCoverNet: A global benchmark land cover classification training dataset. arXiv 2020, arXiv:2012.03111. [Google Scholar]
Brown, C.F.; Brumby, S.P.; Guzder-Williams, B.; Birch, T.; Hyde, S.B.; Mazzariello, J.; Czerwinski, W.; Pasquarella, V.J.; Haertel, R.; Ilyushchenko, S.; et al. Dynamic World, Near real-time global 10 m land use land cover mapping. Sci. Data 2022, 9, 251. [Google Scholar] [CrossRef]
Toker, A.; Kondmann, L.; Weber, M.; Eisenberger, M.; Camero, A.; Hu, J.; Hoderlein, A.P.; Şenaras, Ç.; Davis, T.; Cremers, D.; et al. Dynamicearthnet: Daily multi-spectral satellite dataset for semantic change segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 21158–21167. [Google Scholar]
Robinson, C.; Malkin, K.; Jojic, N.; Chen, H.; Qin, R.; Xiao, C.; Schmitt, M.; Ghamisi, P.; Hänsch, R.; Yokoya, N. Global land-cover mapping with weak supervision: Outcome of the 2020 IEEE GRSS data fusion contest. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 3185–3199. [Google Scholar] [CrossRef]
Bonafilia, D.; Tellman, B.; Anderson, T.; Issenberg, E. Sen1Floods11: A georeferenced dataset to train and test deep learning flood algorithms for sentinel-1. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 210–211. [Google Scholar]
Karantzalos, K.; Karakizi, C.; Kandylakis, Z.; Antoniou, G. HyRANK hyperspectral satellite dataset I (version v001). Zenodo 2018. [Google Scholar] [CrossRef]
Wang, K.; Liew, J.H.; Zou, Y.; Zhou, D.; Feng, J. Panet: Few-shot image semantic segmentation with prototype alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9197–9206. [Google Scholar]
Sherrah, J. Fully convolutional networks for dense semantic labelling of high-resolution aerial imagery. arXiv 2016, arXiv:1606.02585. [Google Scholar] [CrossRef]
Robinson, C.; Ortiz, A.; Malkin, K.; Elias, B.; Peng, A.; Morris, D.; Dilkina, B.; Jojic, N. Human-machine collaboration for fast land cover mapping. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 2509–2517. [Google Scholar]
Gama, P.H.T.; Oliveira, H.; dos Santos, J.A.; Cesar, R.M., Jr. An overview on Meta-learning approaches for Few-shot Weakly-supervised Segmentation. Comput. Graph. 2023, 113, 77–88. [Google Scholar] [CrossRef]
Sung, F.; Yang, Y.; Zhang, L.; Xiang, T.; Torr, P.H.; Hospedales, T.M. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1199–1208. [Google Scholar]
Min, J.; Kang, D.; Cho, M. Hypercorrelation squeeze for few-shot segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 6941–6952. [Google Scholar]
Zhang, C.; Lin, G.; Liu, F.; Guo, J.; Wu, Q.; Yao, R. Pyramid graph networks with connection attentions for region-based one-shot semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9587–9595. [Google Scholar]
Vinyals, O.; Blundell, C.; Lillicrap, T.; Kavukcuoglu, K.; Wierstra, D. Matching networks for one shot learning. Adv. Neural Inf. Process. Syst. 2016, 29, 1–9. [Google Scholar]
Finn, C.; Abbeel, P.; Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the International Conference on Machine Learning, PMLR, Sydney, Australia, 6–11 August 2017; pp. 1126–1135. [Google Scholar]
Jia, X.; De Brabandere, B.; Tuytelaars, T.; Gool, L.V. Dynamic filter networks. Adv. Neural Inf. Process. Syst. 2016, 29, 667–675. [Google Scholar]
Liu, Y.; Lee, J.; Park, M.; Kim, S.; Yang, E.; Hwang, S.J.; Yang, Y. Learning to propagate labels: Transductive propagation network for few-shot learning. arXiv 2018, arXiv:1805.10002. [Google Scholar]
Boudiaf, M.; Kervadec, H.; Masud, Z.I.; Piantanida, P.; Ben Ayed, I.; Dolz, J. Few-shot segmentation without meta-learning: A good transductive inference is all you need? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13979–13988. [Google Scholar]
Ghiasi, G.; Cui, Y.; Srinivas, A.; Qian, R.; Lin, T.Y.; Cubuk, E.D.; Le, Q.V.; Zoph, B. Simple copy-paste is a strong data augmentation method for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 2918–2928. [Google Scholar]
Ma, H.; Lin, X.; Wu, Z.; Yu, Y. Coarse-to-fine domain adaptive semantic segmentation with photometric alignment and category-center regularization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 4051–4060. [Google Scholar]
Yang, Y.; Soatto, S. Fda: Fourier domain adaptation for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 4085–4095. [Google Scholar]
Hariharan, B.; Girshick, R. Low-shot visual recognition by shrinking and hallucinating features. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 3018–3027. [Google Scholar]
Triantafillou, E.; Zhu, T.; Dumoulin, V.; Lamblin, P.; Evci, U.; Xu, K.; Goroshin, R.; Gelada, C.; Swersky, K.; Manzagol, P.A.; et al. Meta-dataset: A dataset of datasets for learning to learn from few examples. arXiv 2019, arXiv:1903.03096. [Google Scholar]
Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
Hu, F.; Xia, G.S.; Hu, J.; Zhang, L. Transferring deep convolutional neural networks for the scene classification of high-resolution remote sensing imagery. Remote Sens. 2015, 7, 14680–14707. [Google Scholar] [CrossRef]
Manas, O.; Lacoste, A.; Giró-i-Nieto, X.; Vazquez, D.; Rodriguez, P. Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 9414–9423. [Google Scholar]
Sun, X.; Wang, P.; Lu, W.; Zhu, Z.; Lu, X.; He, Q.; Li, J.; Rong, X.; Yang, Z.; Chang, H.; et al. RingMo: A remote sensing foundation model with masked image modeling. IEEE Trans. Geosci. Remote Sens. 2022, 61, 1–22. [Google Scholar] [CrossRef]
Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
Xu, J.; Hou, J.; Zhang, Y.; Feng, R.; Wang, Y.; Qiao, Y.; Xie, W. Learning open-vocabulary semantic segmentation models from natural language supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 2935–2944. [Google Scholar]
Zhang, J.; Zhou, Z.; Mai, G.; Hu, M.; Guan, Z.; Li, S.; Mu, L. Text2Seg: Zero-shot remote sensing image semantic segmentation via text-guided visual foundation models. In Proceedings of the 7th ACM SIGSPATIAL International Workshop on AI for Geographic Knowledge Discovery; Association for Computing Machinery: New York, NY, USA, 2024; pp. 63–66. [Google Scholar]
Liu, F.; Chen, D.; Guan, Z.; Zhou, X.; Zhu, J.; Ye, Q.; Fu, L.; Zhou, J. Remoteclip: A vision language foundation model for remote sensing. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–16. [Google Scholar] [CrossRef]
Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 4015–4026. [Google Scholar]
Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 1290–1299. [Google Scholar]
Osco, L.P.; Wu, Q.; De Lemos, E.L.; Gonçalves, W.N.; Ramos, A.P.M.; Li, J.; Junior, J.M. The segment anything model (sam) for remote sensing applications: From zero to one shot. Int. J. Appl. Earth Obs. Geoinf. 2023, 124, 103540. [Google Scholar] [CrossRef]
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 11976–11986. [Google Scholar]
Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X.; et al. Deep high-resolution representation learning for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 3349–3364. [Google Scholar] [CrossRef] [PubMed]
Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
Dosovitskiy, A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
Wang, L.; Li, R.; Duan, C.; Zhang, C.; Meng, X.; Fang, S. A novel transformer based semantic segmentation scheme for fine-resolution remote sensing images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
Wang, L.; Li, R.; Zhang, C.; Fang, S.; Duan, C.; Meng, X.; Atkinson, P.M. UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery. ISPRS J. Photogramm. Remote Sens. 2022, 190, 196–214.
Carion, N.; Gustafson, L.; Hu, Y.T.; Debnath, S.; Hu, R.; Suris, D.; Ryali, C.; Alwala, K.V.; Khedr, H.; Huang, A.; et al. SAM 3: Segment anything with concepts. arXiv 2025, arXiv:2511.16719.
Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890.
Xiao, T.; Liu, Y.; Zhou, B.; Jiang, Y.; Sun, J. Unified perceptual parsing for scene understanding. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 418–434.
Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
Yuan, Y.; Fang, J.; Lu, X.; Feng, Y. Spatial structure preserving feature pyramid network for semantic image segmentation. ACM Trans. Multimed. Comput. Commun. Appl. (TOMM) 2019, 15, 1–19.
Cheng, B.; Schwing, A.; Kirillov, A. Per-pixel classification is not all you need for semantic segmentation. Adv. Neural Inf. Process. Syst. 2021, 34, 17864–17875.
Zhang, J.; Li, Y.; Yang, X.; Jiang, R.; Zhang, L. RSAM-Seg: A SAM-based model with prior knowledge integration for remote sensing image semantic segmentation. Remote Sens. 2025, 17, 590.
Li, Z.; Lu, F.; Zou, J.; Hu, L.; Zhang, H. Generalized few-shot meets remote sensing: Discovering novel classes in land cover mapping via hybrid semantic segmentation framework. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 17–21 June 2024; pp. 2744–2754.
Tian, Z.; Zhao, H.; Shu, M.; Yang, Z.; Li, R.; Jia, J. Prior guided feature enrichment network for few-shot segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 1050–1065.
Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; Springer: Cham, Switzerland, 2015; pp. 234–241.
Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818.

Figure 1. Example of semantic segmentation in remote sensing from the UAVid dataset [6]. Left: original UAV RGB image. Right: corresponding pixel-wise semantic annotation, where each pixel is assigned to a semantic class such as road, building, vegetation, or vehicle.

Figure 2. Examples of representative remote sensing datasets used in few-shot semantic segmentation research. The top row shows high-resolution aerial imagery, while the bottom row includes medium-resolution satellite and disaster-mapping datasets, illustrating differences in spatial resolution, modality, and scene complexity.

Figure 3. Taxonomy of few-shot semantic segmentation in remote sensing.

Figure 4. Canonical encoder–decoder architecture with skip connections commonly used in semantic segmentation. The encoder progressively downsamples the input to extract semantic representations, while the decoder upsamples and fuses multi-scale features to produce dense pixel-level predictions.

Figure 5. A chronological overview of key models and methodological milestones that have influenced few-shot semantic segmentation in remote sensing.

Figure 6. Computational complexity and memory footprint of representative segmentation models considered in this review. The x-axis shows the number of GFLOPs for a $512 \times 512$ input, the y-axis indicates approximate inference memory in GB (both on a log scale), and bubble size encodes the number of parameters in millions.

Figure 7. DeepLabV3+ encoder–decoder overview with ASPP multi-rate atrous convolutions and low-level feature fusion for boundary refinement.

Table 1. Notation used in this paper.

Symbol	Meaning
$S$	Support set (labeled examples used to condition an episode).
$Q$	Query set (unlabeled examples to be segmented in an episode).
$(x_{i}, y_{i})$	Input sample $x_{i}$ and its label/mask $y_{i}$ in the support set.
W-way, K-shot	Few-shot episode setting: W novel classes, K support examples per class.
$C_{\mathrm{base}}, C_{\mathrm{novel}}$	Base classes used for meta-training and novel classes used for evaluation.
$f(\cdot)$	Feature extractor/encoder network shared by support and query.
$\Omega_{c}$	Binary support mask for class c (1 on class pixels, 0 otherwise).
$p_{c}$	Prototype (class-wise pooled feature vector) for class c.
$\mathrm{sim}(\cdot, \cdot)$	Similarity function (often cosine similarity) used for prototype matching.
$\hat{y}(v)$	Predicted class label at query pixel location v.
$TP, FP, FN$	True positives, false positives, false negatives used for segmentation metrics.
$\mathrm{mIoU}$	Mean Intersection-over-Union across classes.
$\mathrm{Dice}_{c}$	Dice/F1 score for class c.
$\theta$	Trainable model parameters; $\theta'$ denotes episode-adapted parameters.
$\alpha, \beta$	Inner-loop and outer-loop learning rates in optimization-based meta-learning.
$N_{q}$	Number of mask queries in transformer-based mask decoders (named $N_{q}$ to avoid a conflict with the query set $Q$).
$q_{k}$	The k-th learned mask query token in a transformer decoder.
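To make the notation in Table 1 concrete, the following minimal sketch shows how a prototype $p_{c}$ could be pooled from support features $f(x_{i})$ under a binary mask $\Omega_{c}$, how a query pixel label $\hat{y}(v)$ could be chosen by cosine similarity, and how IoU and Dice follow from $TP$, $FP$, and $FN$. It is an illustration only: masked average pooling is one common pooling choice rather than something the table prescribes, and the function names, the tiny hand-written feature grids, and the class names are all hypothetical.

```python
import math


def masked_average_prototype(features, mask):
    """Average the D-dim feature vectors at pixels where the binary mask is 1.

    features: H x W grid of feature vectors, standing in for f(x_i).
    mask:     H x W grid of 0/1 values, standing in for Omega_c.
    """
    dim = len(features[0][0])
    total = [0.0] * dim
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for vec, m in zip(feat_row, mask_row):
            if m:
                count += 1
                for d in range(dim):
                    total[d] += vec[d]
    if count == 0:
        raise ValueError("support mask contains no pixels for this class")
    return [t / count for t in total]


def cosine_similarity(a, b, eps=1e-8):
    """sim(a, b); eps guards against division by zero for all-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)


def predict(query_features, prototypes):
    """Label each query pixel v with the class of the most similar prototype."""
    classes = list(prototypes)
    return [
        [max(classes, key=lambda c: cosine_similarity(vec, prototypes[c]))
         for vec in row]
        for row in query_features
    ]


def iou_and_dice(pred, target, cls):
    """Per-class IoU and Dice computed from TP, FP, and FN pixel counts."""
    tp = fp = fn = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            if p == cls and t == cls:
                tp += 1
            elif p == cls:
                fp += 1
            elif t == cls:
                fn += 1
    if tp + fp + fn == 0:
        return float("nan"), float("nan")  # class absent from both maps
    return tp / (tp + fp + fn), 2 * tp / (2 * tp + fp + fn)


# Toy 2 x 2 episode with 2-dimensional features (values are made up).
support_features = [[[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]]]
support_masks = {
    "building": [[1, 1], [0, 0]],
    "background": [[0, 0], [1, 1]],
}
prototypes = {c: masked_average_prototype(support_features, m)
              for c, m in support_masks.items()}

query_features = [[[0.9, 0.1], [0.2, 0.8]], [[0.1, 0.9], [0.8, 0.3]]]
query_target = [["building", "background"], ["background", "background"]]
pred = predict(query_features, prototypes)

scores = {c: iou_and_dice(pred, query_target, c) for c in prototypes}
miou = sum(iou for iou, _ in scores.values()) / len(scores)
```

In this toy episode the bottom-right query pixel is wrongly matched to the building prototype, so the building IoU is 0.5 and the background IoU is 2/3, giving an mIoU of roughly 0.58.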

Table 2. Summary of representative remote sensing datasets used for semantic segmentation.

Dataset	Spatial Resolution (GSD)	Image Resolution (px)	Sensor	#Images	#Classes	File Types (Image/Label)
ISPRS Potsdam	5 cm	∼6000 × 6000	Aerial RGB + IR	38	6	.tif/.tif
ISPRS Vaihingen	9 cm	∼2500 × 2500	Aerial RGB + IR	33	6	.tif/.tif
Inria Aerial	0.3 m	5000 × 5000	Aerial RGB	360	2	.tif/.tif
Massachusetts Roads/Buildings	∼1 m	1500 × 1500	Aerial RGB	151/137	2	.tif/.tif
DeepGlobe Land Cover	0.5 m	2448 × 2448	Satellite RGB	803	7	.jpg/.png
SpaceNet (1–7)	0.3–1 m	650 × 650–1024 × 1024	Satellite RGB + MS	10k–30k	2	.tif/.geojson
LoveDA	0.3 m	1024 × 1024	Satellite RGB	5987	7	.tif/.png
FloodNet	≤0.5 m	1024 × 1024	UAV RGB	2363	2	.jpg/.png
xBD	0.3–0.8 m	variable (512–1024)	Maxar/WorldView	850k	4	.png/.geojson
Sen1Floods11	10 m	512 × 512	Sentinel-1 SAR + S2 RGB	4k pairs	2	.tif/.tif
HyRANK	30 m	1200 × 700	Hyperspectral	1	13	ENVI .hdr/.dat, .mat/.mat
LandCover.ai	25 cm	∼9000 × 9500	Aerial RGB	41	4	.tif/.shp, .geojson
DFC2020	10 m	variable (512–1024)	S1 SAR + S2 MS	∼20k	10	.tif/.tif
DynamicEarthNet	10 m	256 × 256	Sentinel-2 (time series)	75k	7	.tif/.tif
UAVid	∼5 cm	3840 × 2160	UAV RGB	300+	8	.jpg/.png
OpenEarthMap	0.25–0.5 m	1024 × 1024	Aerial RGB	5k+	8	.tif/.png
WHU Building	0.3 m	512 × 512	Aerial RGB	8k+	2	.tif/.tif
Ben-Ge	∼0.3 m	variable	Aerial RGB	600+	2	.tif/.png
DeepGlobe Road	0.5 m	1024 × 1024	Satellite RGB	6k+	2	.jpg/.png
95-Cloud	30 m	512 × 512	Landsat	15k+	2	.jpg/.png

Table 3. Foundation-assisted FSSS: components, purposes, and remote sensing considerations.

Component	Typical Choices	Primary Role	RS Pitfalls	Remedies
Prompt source	Edge/interior points, tight boxes, grid prompts	Coverage of candidates	Missed small objects	Denser tiling; add contour prompts
Proposal model	Category-agnostic segmenter (point/box prompts)	Over-segmentation of query	Over- or under-segmentation	Multi-threshold sweep; mask fusion
Mask scoring	Prototype similarity, text similarity, geometric priors	Select class-consistent masks	Texture confusions	Add boundary or shape priors
Decoder refinement	Light boundary-aware head (3–5 layers)	Sharpen edges, de-overlap	Broken linear structures (e.g., roads)	Connectivity loss; DSM channel if available
Adapters	LoRA or bottleneck modules on cross-attention	Episode-time calibration	Sensor or seasonal shift	Adapt only small modules; early stopping
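The "Mask scoring" row of Table 3 can be illustrated with a small sketch, assuming that class-agnostic proposal masks and per-pixel query features are already available. Each proposal is pooled into a single feature vector, compared with a support prototype by cosine similarity, and kept only if it exceeds a threshold; the surviving masks are then fused by union. The function names, the threshold of 0.7, and the toy data are hypothetical, and a real pipeline would obtain the proposals from a promptable segmenter rather than from hand-written grids.

```python
import math


def pool(features, mask):
    """Average feature vectors over the pixels of one proposal mask."""
    dim = len(features[0][0])
    total, count = [0.0] * dim, 0
    for feat_row, mask_row in zip(features, mask):
        for vec, m in zip(feat_row, mask_row):
            if m:
                count += 1
                for d in range(dim):
                    total[d] += vec[d]
    return [t / count for t in total] if count else None


def cosine(a, b, eps=1e-8):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a))
                  * math.sqrt(sum(y * y for y in b)) + eps)


def select_masks(features, proposals, prototype, threshold=0.7):
    """Return (score, mask) pairs whose pooled feature matches the prototype."""
    kept = []
    for mask in proposals:
        pooled = pool(features, mask)
        if pooled is None:  # skip empty proposals
            continue
        score = cosine(pooled, prototype)
        if score >= threshold:
            kept.append((score, mask))
    return sorted(kept, key=lambda pair: pair[0], reverse=True)


def fuse(masks, height, width):
    """Union of the selected binary masks."""
    fused = [[0] * width for _ in range(height)]
    for mask in masks:
        for r in range(height):
            for c in range(width):
                if mask[r][c]:
                    fused[r][c] = 1
    return fused


# Toy 2 x 3 query with 2-dimensional features (values are made up).
query_features = [
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]],
    [[1.0, 0.1], [0.1, 0.9], [0.0, 1.0]],
]
proposal_a = [[1, 1, 0], [1, 0, 0]]  # region resembling the target class
proposal_b = [[0, 0, 1], [0, 1, 1]]  # region resembling something else
class_prototype = [1.0, 0.0]         # hypothetical support prototype

kept = select_masks(query_features, [proposal_a, proposal_b], class_prototype)
fused = fuse([mask for _, mask in kept], height=2, width=3)
```

Here only `proposal_a` passes the threshold, so the fused prediction equals that mask. Text similarity or geometric priors, which the table also lists, would enter as additional terms in the score and are not modeled in this sketch.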

Table 4. Representative quantitative comparison of selected remote sensing few-shot, open-vocabulary, foundation-assisted, and supervised segmentation methods. The table reports a curated subset of commonly cited models and benchmarks used in this study.

Method	Dataset	Backbone	1-Shot	5-Shot	Zero-Shot	Metric
RemoteCLIP [55]	RS classification benchmarks (avg. over 10 datasets)	ViT-B/32	–	84.4	61.8	Top-1 Acc (%)
Text2Seg (GDINO+CLIP+SAM) [54]	LoveDA	ViT-H	–	–	53.8	Overall Accuracy (%)
RSAM-Seg [73]	Building segmentation benchmark	ResNet backbone	–	–	84.2	mIoU (%)
SegLand [74]	OpenEarthMap Few-Shot Challenge	Swin-T + UperNetPlus	–	54.5	–	mIoU (%)
PANet [31]	DeepGlobe	ResNet-50	36.55	45.43	–	mIoU (%)
PFENet [75]	DeepGlobe	ResNet-50	16.88	18.01	–	mIoU (%)
PFENet [75]	ISPRS Vaihingen	ResNet backbone	12.58	12.29	–	mIoU (%)
U-Net [76]	Sentinel-2	CNN Encoder–Decoder	–	–	55.45	mIoU (%)
DeepLabV3+ [77]	Sentinel-2	ResNet backbone	–	–	65.27	mIoU (%)
SegFormer [62]	Sentinel-2	MiT Transformer	–	–	61.55	mIoU (%)
U-Net [76]	DeepGlobe	CNN Encoder–Decoder	–	–	75.06	mIoU (%)
DeepLabV3+ [77]	DeepGlobe	ResNet backbone	–	–	65.27	mIoU (%)
SegFormer [62]	DeepGlobe	MiT Transformer	–	–	77.44	mIoU (%)
DC-Swin [65]	ISPRS Potsdam	Swin Transformer	–	–	87.6	mIoU (%)
UNetFormer [66]	ISPRS Potsdam	Transformer Encoder–Decoder	–	–	87.5	mIoU (%)

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Petrov, M.; Pandilova, E.; Dimitrovski, I.; Trajanov, D.; Spasev, V.; Kitanovski, I. Few-Shot Semantic Segmentation in Remote Sensing: A Review on Definitions, Methods, Datasets, Advances and Future Trends. Remote Sens. 2026, 18, 637. https://doi.org/10.3390/rs18040637

