Article

Exploiting Diffusion Priors for Generalizable Few-Shot Satellite Image Semantic Segmentation

Unmanned System Research Institute, Northwestern Polytechnical University, Xi’an 710072, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(22), 3706; https://doi.org/10.3390/rs17223706
Submission received: 19 September 2025 / Revised: 4 November 2025 / Accepted: 12 November 2025 / Published: 13 November 2025

Highlights

What are the main findings?
  • We propose DiffSatSeg, a diffusion-based framework that combines parameter-efficient fine-tuning, distributional similarity-based segmentation, and consistency learning to tackle few-shot satellite component segmentation, while also enabling fine-grained, reference-guided predictions.
  • The proposed method delivers state-of-the-art performance across one-, three-, and ten-shot settings, maintains high effectiveness under low-light and backlit conditions, robustly handles substantial morphological variations—particularly in antennas—and achieves significant gains over previous methods, reaching up to +33.6% mIoU in the one-shot scenario.
What are the implications of the main findings?
  • This work opens a new direction for leveraging the prior knowledge of diffusion models to achieve fine-grained perception of satellite targets, bridging a critical research gap. To the best of our knowledge, it is the first attempt to employ diffusion models for few-shot satellite segmentation, establishing a solid foundation for future exploration in this field.
  • DiffSatSeg addresses the scarcity of annotated data and substantial morphological differences across satellite types, enabling reliable generalization to unseen targets and fine-grained, reference-guided segmentation, which holds tremendous value for practical space applications such as structural analysis, fault detection, and on-orbit servicing.

Abstract

Satellite segmentation is vital for spacecraft perception, supporting tasks like structural analysis, fault detection, and in-orbit servicing. However, the generalization of existing methods is severely limited by the scarcity of target satellite data and substantial morphological differences between target satellites and training samples, leading to suboptimal performance in real-world scenarios. In this work, we propose a novel diffusion-based framework for few-shot satellite segmentation, named DiffSatSeg, which leverages the powerful compositional generalization capability of diffusion models to address the challenges inherent in satellite segmentation tasks. Specifically, we propose a parameter-efficient fine-tuning strategy that fully exploits the strong prior knowledge of diffusion models while effectively accommodating the unique structural characteristics of satellites as rare targets. We further propose a segmentation mechanism based on distributional similarity, designed to overcome the limited generalization capability of conventional segmentation models when encountering novel satellite targets with substantial inter-class variations. Finally, we design a consistency learning strategy to suppress redundant texture details in diffusion features, thereby mitigating their interference in segmentation. Extensive experiments demonstrate that our method achieves state-of-the-art performance, yielding a remarkable 33.6% improvement over existing approaches even when only a single target satellite image is available. Notably, our framework also enables reference-based segmentation, which holds great potential for practical deployment and real-world applications.

1. Introduction

Satellite segmentation [1,2,3,4] is a fundamental task in spacecraft perception and autonomous inspection, serving as a foundation for downstream applications such as structural analysis, fault detection, and in-orbit servicing. Accurate and fine-grained recognition of satellite structures and components is crucial for enabling intelligent space operations. The rapid development of deep learning has significantly advanced semantic segmentation [5,6,7,8,9], bringing notable improvements to satellite perception in recent years [1,2,3,4,10]. However, these successes rely heavily on access to large-scale annotated datasets, which are extremely scarce in real space missions.
To alleviate the scarcity of labeled real-world satellite imagery, existing satellite segmentation methods predominantly rely on semiphysical simulation to synthesize training data [2,4,10]. However, such synthetic imagery still exhibits a notable domain gap from real orbital observations, leading to significant performance degradation when applied in real-world scenarios. This challenge is further exacerbated when dealing with non-cooperative targets, whose structures and material properties are entirely unknown—preventing the generation of synthetic datasets through simulation. Consequently, models trained on synthetic data exhibit extremely poor generalization and often fail when deployed in real settings, especially when encountering non-cooperative satellites.
Another effective way to alleviate the scarcity of labeled satellite data is through few-shot segmentation (FSS) [11,12,13,14], which aims to segment target objects in a query image using only a few annotated support examples. While these methods have shown encouraging results, they rely heavily on support information and are prone to segmentation errors when substantial intra-class variations exist between the support examples and target objects—particularly in the satellite domain, where components of the same category can differ markedly in geometry, structure, and surface characteristics across different satellites. This paper aims to bridge this gap by developing a framework explicitly designed to generalize across morphologically diverse satellite instances under few-shot conditions.
Meanwhile, diffusion models [15,16], with their powerful pre-trained structural priors, have emerged as a promising direction for improving cross-domain generalization in perception tasks. Recent studies [17,18,19,20,21] have begun to explore their potential for semantic segmentation and report encouraging results. Unlike vision transformer backbones or pre-trained contrastive frameworks (e.g., CLIP- or SAM-based approaches), which primarily capture instance-level appearance cues, diffusion models reconstruct images through iterative denoising, allowing them to internalize geometry- and topology-aware priors and model latent dependencies among components. These properties are particularly critical in satellite perception, where semantic understanding is often governed by structural composition rather than local texture patterns. However, harnessing their potential for the few-shot satellite segmentation task remains non-trivial, as these existing methods suffer from three critical limitations: (1) Directly training diffusion models or integrating them into existing segmentation frameworks as auxiliary components incurs substantial computational overhead, making them inefficient for perception tasks. (2) Most existing approaches still follow a traditional pixel- or region-wise classification scheme, which fails to fully leverage the structural reasoning capabilities of diffusion models and tends to be unstable when dealing with drastic shape variations. (3) Since diffusion models are generative by nature, their feature space contains a large amount of fine-grained texture information optimized for image reconstruction, which is often irrelevant or even detrimental to fine-grained perception tasks such as semantic segmentation.
Therefore, this paper aims to unleash the potential of diffusion models for the challenging task of few-shot satellite segmentation. We propose a novel diffusion-based framework for few-shot satellite segmentation, dubbed DiffSatSeg, which harnesses the strong prior knowledge of diffusion models to improve segmentation generalization. Specifically, for the first challenge, we design a parameter-efficient fine-tuning strategy that preserves the rich priors of diffusion models while enabling flexible adaptation to the unique and diverse structural characteristics of satellites—treated as rare and complex targets—by leveraging proxy queries as interfaces to seamlessly integrate diffusion features with satellite-specific representations. To address the second challenge, we further propose a segmentation mechanism driven by distributional similarity, leveraging the principal components of a similarity matrix constructed from proxy queries to address the substantial morphological variations inherently present across different satellite types. Finally, to tackle the third challenge, we design a consistency learning strategy that suppresses redundant texture details in diffusion features, thereby encouraging the model to focus on semantic structures rather than superficial textures.
Extensive experiments validate the effectiveness of the proposed framework, demonstrating state-of-the-art performance across four benchmark datasets, with a substantial improvement over existing methods even in the highly constrained setting where only a single target satellite image is available. Moreover, the proposed framework supports reference-based segmentation, underscoring its practicality and strong potential for deployment in real-world satellite applications.
The main contributions of this paper are as follows:
1. We propose a novel diffusion-based framework for few-shot satellite segmentation, named DiffSatSeg, which employs a set of learnable proxy queries for parameter-efficient fine-tuning, retaining the rich priors of diffusion models while enabling flexible adaptation to diverse satellite structures.
2. We propose a segmentation mechanism guided by distributional similarity, which extracts the principal components of a metric similarity matrix constructed from proxy queries to perform discriminative analysis, thereby enabling strong generalization to unseen satellite targets with substantial intra-class variations.
3. We design a consistency learning strategy that suppresses redundant texture details in diffusion features, guiding the proxy queries to concentrate on structural semantics and enhancing the reliability of feature representations for segmentation.
4. Extensive experiments conducted on four benchmark datasets comprehensively validate the effectiveness of the proposed framework, achieving state-of-the-art performance across diverse satellite segmentation scenarios. Furthermore, our method supports reference-based fine-grained segmentation, demonstrating strong practicality and adaptability for real-world satellite applications.

2. Related Work

2.1. Semantic Segmentation

Semantic segmentation is one of the most essential tasks in computer vision, aiming to assign a semantic label to each pixel in an image. With the advent of deep learning, this task has witnessed significant progress in recent years. Fully Convolutional Networks (FCNs) [5] first introduced end-to-end training for pixel-wise prediction, which has since become a cornerstone of modern segmentation models. Building on this, the DeepLab series [6,22] advanced the field through atrous convolutions and spatial pyramid pooling to capture multi-scale context. More recently, Transformer-based architectures, such as SETR [23] and SegFormer [9], have pushed segmentation performance further by leveraging global attention mechanisms. In particular, query-based formulations like MaskFormer [24] and its successor Mask2Former [25] unify semantic, instance, and panoptic segmentation within a single framework, demonstrating strong versatility across tasks. In parallel, advances in domain adaptation [26] and domain generalization [27] have sought to improve robustness under distribution shifts, which is essential for real-world deployment.
Recent studies have explored various deep learning strategies for spacecraft component segmentation. Chen et al. [28] employed an improved Mask R-CNN for part detection, achieving good accuracy but facing challenges in terms of computational cost and training time. Qu and Wei [3] enhanced DeeplabV3+ with a convolutional block attention mechanism on the Speed dataset, enabling accurate segmentation of four common modules including solar panels, antennas, GPS, and the satellite. Xiang et al. [29] addressed shadow and low-visibility issues with a multi-illumination fusion strategy, improving the robustness of component segmentation. In addition, Liu et al. [30] proposed a multi-scale dilated convolutional network with channel attention that delivered clearer and more complete masks than standard DeeplabV3+, while Shao et al. [10] introduced the Pyramid Attention and Decoupled Attention Network along with a large-scale video dataset, advancing segmentation performance under more realistic orbital scenarios. Despite these advances, existing methods primarily rely on synthetic data for training, leading to poor generalization in real-world scenarios—particularly when encountering non-cooperative or previously unseen satellite targets. To address this limitation, we propose a novel few-shot satellite segmentation framework that effectively leverages the strong priors of diffusion models to enhance the generalization ability of segmentation under limited supervision.

2.2. Few-Shot Segmentation

Few-shot segmentation aims to segment query images of novel classes with only a few labeled support samples. Given the scarcity of annotations, effectively exploiting support information is essential, and existing approaches can be broadly categorized into prototypical learning and affinity learning methods. Prototypical learning approaches [12,13,14,31,32] condense masked support features into one or multiple prototypes for comparison with query features. While single-prototype strategies based on masked average pooling [31] are simple, they often discard spatial details. To address this, recent works generate multiple prototypes via clustering or EM algorithms [14] or introduce auxiliary prototypes to capture uncovered regions, achieving better spatial coverage. However, prototype compression still limits the ability to model fine-grained support information. Affinity learning methods [11,12,33] instead establish pixel-level correspondences between support and query features using attention or cost volume aggregation. For example, PFENet [34] leverages a class-agnostic prior mask to guide segmentation, while CyCTR [35] employs cycle-consistent attention for selective feature aggregation. Existing few-shot segmentation approaches are built on the assumption that the query image and support examples share similar appearances, which leads to significant performance degradation in spatial scenarios such as satellite imagery, where different satellites often exhibit substantial morphological variations even within the same category. In this work, we propose a segmentation mechanism driven by distributional similarity, which promotes fine-grained and robust segmentation by comparing feature distributions, effectively handling intra-class variations common in satellite imagery.

2.3. Diffusion Model

Diffusion models [15,16,36] have recently emerged as a powerful paradigm of generative models, achieving remarkable success in high-fidelity image synthesis and editing [37,38,39]. Their probabilistic formulation is based on a forward process that gradually corrupts data with Gaussian noise and a reverse process that learns to iteratively denoise, enabling the recovery of detailed structures and rich semantic information. Beyond generation, diffusion models have demonstrated strong potential for perception tasks due to their ability to capture rich semantic priors and exhibit compositional generalization capabilities. Recent studies have applied diffusion models to tasks such as semantic segmentation [40,41,42], object detection [43], and depth estimation [21,44], showing that diffusion priors can significantly enhance recognition and generalization. DiffusionSeg [20] leverages pre-trained diffusion models in conjunction with CLIP for unsupervised object discovery through a synthesis exploitation framework that generates pseudo-labeled data and extracts diffusion features for discriminative segmentation tasks. Diff-UNet [19] integrates a denoising diffusion model into a 3D U-Net architecture to perform end-to-end volumetric medical segmentation, using a label-embedding mechanism and a step-uncertainty fusion module to enhance multi-organ segmentation robustness and accuracy. MaskDiffusion [18] leverages the internal features and cross-attention maps of a pre-trained Stable Diffusion model conditioned by CLIP text embeddings to achieve open-vocabulary semantic segmentation without additional training, enabling effective segmentation of both general and fine-grained categories. These methods, which either directly train diffusion models or integrate them into existing segmentation frameworks as auxiliary modules, introduce substantial computational overhead and fail to fully exploit the structural priors inherently encoded in diffusion representations. In this work, we propose a parameter-efficient fine-tuning strategy together with a segmentation mechanism guided by distributional similarity to fully leverage the structural priors of diffusion models for efficient and generalizable satellite segmentation.

3. Methods

In this paper, we propose DiffSatSeg, a diffusion-based framework for few-shot satellite segmentation that leverages the strong priors of diffusion models to enhance semantic representation and enable accurate segmentation with limited supervision. Specifically, we first propose a parameter-efficient fine-tuning strategy that preserves the latent space of diffusion models to fully retain their prior knowledge, while effectively adapting to the unique structural characteristics of satellites as rare and complex targets by employing a set of proxy queries as interfaces (Section 3.1). Next, considering the substantial morphological variations across different satellite types, we propose a segmentation mechanism guided by distributional similarity, which leverages the principal components of a similarity matrix constructed from proxy queries to enable flexible and fine-grained segmentation of satellite components under few-shot scenarios (Section 3.2). Finally, we design a consistency learning strategy that effectively suppresses redundant texture details embedded in diffusion features, thereby mitigating their interference with the segmentation process (Section 3.3).

3.1. Parameter-Efficient Adaptation of Diffusion Models to Rare Targets

Stable Diffusion is composed of three key components: an encoder $\mathcal{V}$ and a decoder $\mathcal{R}$ derived from VQGAN [45], along with a noise prediction network $\epsilon_\theta$, typically implemented as a UNet. The encoder and decoder modules enable transformation between the pixel and latent spaces by compressing input images into compact latent representations and reconstructing them back to full resolution. The training follows a forward–reverse denoising paradigm. In the forward stage, the image $x$ is first encoded into the latent space using the encoder $\mathcal{V}$ to obtain a clean latent representation $z_0 = \mathcal{V}(x)$. This representation is then progressively perturbed by Gaussian noise. The forward diffusion process can be formulated as:

$$z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, 1) \tag{1}$$

where $\bar{\alpha}_t := \prod_{s=0}^{t} (1 - \beta_s)$ is determined by the predefined noise schedule $\beta_s$ and $t$ denotes the diffusion timestep. Generally, larger values of $t$ correspond to higher noise levels. The reverse process aims to recover the clean representation by iteratively predicting and removing the injected noise. A single step of this denoising process can be formulated as:

$$p_\theta(z_{t-1} \mid z_t) := \mathcal{N}\big(z_{t-1};\, \mu_\theta(z_t, t),\, \Sigma_\theta(z_t, t)\big)$$

where $\mu_\theta$ is estimated by the noise predictor $\epsilon_\theta$, and $\Sigma_\theta$ is commonly fixed as a predefined variance.
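As a concrete illustration, the following is a minimal sketch of the forward noising step in Equation (1), assuming a linear noise schedule; the schedule bounds, timestep count, and tensor names are illustrative assumptions rather than settings from the paper.

```python
import torch

# Assumed linear noise schedule; beta bounds and T are illustrative, not from the paper.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # predefined noise schedule beta_s
alpha_bars = torch.cumprod(1.0 - betas, dim=0)   # \bar{alpha}_t = prod_{s<=t} (1 - beta_s)

def add_noise(z0: torch.Tensor, t: int):
    """Sample z_t from q(z_t | z_0) at a fixed timestep t, returning (z_t, eps)."""
    eps = torch.randn_like(z0)                   # epsilon drawn from a standard Gaussian
    a_bar = alpha_bars[t]
    z_t = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * eps
    return z_t, eps
```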
Given that satellites are rare targets with inherently unique structural characteristics, it is essential to adapt diffusion models in a way that preserves their strong prior knowledge while effectively aligning with the distinctive semantic structure of satellites. Preserving the latent space intact is essential for fully exploiting the potential of pretrained diffusion models. To this end, we propose a parameter-efficient fine-tuning method that employs a set of learnable proxy queries to flexibly adapt the diffusion model to the structural uniqueness of satellites. As illustrated in Figure 1, our framework begins by extracting raw features from the pretrained diffusion model. To do this, we first encode the input image $x$ into the latent space via the pretrained encoder $\mathcal{V}$, yielding the latent representation $z_0 = \mathcal{V}(x)$. Next, we apply the forward diffusion process (as defined in Equation (1)) to generate the noisy latent representation $z_t$, where $t$ is fixed to a constant value during both training and inference to ensure consistent noise conditions and stable feature representations. The resulting $z_t$ is subsequently fed into the noise prediction network $\epsilon_\theta(\cdot)$, which performs denoising to extract the corresponding diffusion features:

$$f^i = \epsilon_\theta^i(z_t)$$

where $f^i$ denotes the feature representation obtained from the $i$-th layer of the diffusion model. These features are then refined through a set of learnable proxy queries $q$, yielding refined feature maps as follows:

$$f^i_{\mathrm{refine}} = \mathrm{FFN}\big(\mathrm{Softmax}(S^i) \times V^i\big) + f^i, \quad S^i = \frac{Q^i (K^i)^{\top}}{\sqrt{d^i}}, \quad Q^i = f^i W_Q^i, \; K^i = q W_K^i, \; V^i = q W_V^i$$

where $W_Q^i$, $W_K^i$, and $W_V^i$ are the learnable matrices that project the input features into the query, key, and value spaces, respectively. The proxy queries $q$ are shared across all selected diffusion layers to ensure consistent semantic alignment and stable multi-scale refinement. Each query is initialized using a uniform distribution, promoting balanced feature scaling in the early training stage and preventing premature collapse. This cross-layer sharing acts as an intrinsic regularization constraint, limiting parameter redundancy and stabilizing optimization across layers. Additionally, because the backbone weights are entirely frozen, the learning space of $q$ is implicitly confined to the pretrained backbone's representational subspace, serving as implicit regularization by parameter isolation. The FFN is composed of a fully connected transformation followed by a non-linear activation function. $f^i_{\mathrm{refine}}$ acts as the refined feature representation passed into the next layers of the diffusion model. Instead of applying refinement at all layers, we insert proxy queries at a few selected diffusion layers, typically $i \in \{3, 6, 9, 12\}$, which has been empirically shown to offer the best trade-off between performance and efficiency (as shown in our ablation results).
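To make the refinement concrete, below is a hedged sketch of a single proxy-query refinement block: queries are projected from the frozen diffusion features, keys and values from the shared learnable proxy queries, followed by an FFN and a residual connection. The module name, feature dimension, and number of queries are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ProxyQueryRefiner(nn.Module):
    """Sketch of one refinement block; feat_dim and num_queries are assumed values."""
    def __init__(self, feat_dim: int = 1024, num_queries: int = 100):
        super().__init__()
        self.q_proj = nn.Linear(feat_dim, feat_dim)   # W_Q^i (applied to diffusion features)
        self.k_proj = nn.Linear(feat_dim, feat_dim)   # W_K^i (applied to proxy queries)
        self.v_proj = nn.Linear(feat_dim, feat_dim)   # W_V^i (applied to proxy queries)
        self.ffn = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.GELU())
        self.scale = feat_dim ** -0.5

    def forward(self, f_i: torch.Tensor, proxy_q: torch.Tensor) -> torch.Tensor:
        # f_i: (B, HW, C) flattened diffusion features; proxy_q: (L, C), shared across layers
        Q = self.q_proj(f_i)                                  # (B, HW, C)
        K, V = self.k_proj(proxy_q), self.v_proj(proxy_q)     # (L, C), (L, C)
        S = torch.softmax(Q @ K.t() * self.scale, dim=-1)     # attention over proxy queries, (B, HW, L)
        return self.ffn(S @ V) + f_i                          # refined features with residual connection
```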
Ultimately, the training loss is defined as:
$$\mathcal{L}_f(\theta) = \mathbb{E}_{\epsilon, t}\left[\big\|\epsilon - \epsilon_\theta(z_t, t)\big\|^2\right]$$

where $\epsilon$ is the noise sampled during the forward process. Notably, throughout training, the diffusion model remains frozen, and only a small set of proxy queries is optimized.
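A minimal sketch of this objective is given below, assuming the noise and its prediction are produced as in the earlier snippets; gradients would reach only the proxy queries, since the UNet parameters are frozen.

```python
import torch
import torch.nn.functional as F

def noise_prediction_loss(eps_pred: torch.Tensor, eps: torch.Tensor) -> torch.Tensor:
    # eps: noise injected in the forward process; eps_pred: output of the frozen UNet
    # (back-propagation only updates the learnable proxy queries).
    return F.mse_loss(eps_pred, eps)
```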

3.2. Few-Shot Segmentation via Proxy-Driven Similarity Modeling

Given the substantial and intrinsic morphological differences across satellite types, conventional segmentation methods struggle to bridge this distribution gap effectively. This limitation is further exacerbated by the scarcity of annotated samples for target satellites, which increases the risk of overfitting and severely limits the generalization ability of traditional pixel-wise classification frameworks. To overcome this limitation, we propose a segmentation mechanism guided by distributional similarity, where the principal components extracted from a similarity matrix constructed using proxy queries are utilized to discriminate semantic regions based on their distributional patterns. Compared to traditional pixel-level classification approaches, this design enables more flexible and fine-grained segmentation of satellite components. As a result, it facilitates robust few-shot satellite segmentation and ensures precise perception of satellite components even under severe data scarcity.
As illustrated in Figure 2, we first construct prototype vectors corresponding to different component categories of the target satellite. Specifically, we utilize the feature representation obtained from the final layer of the diffusion model's noise prediction network $\epsilon_\theta$. Mask maps for different satellite components are subsequently derived from the ground-truth annotations, and an element-wise product with the extracted feature maps is applied to derive the category-specific prototypes:

$$p_n = \frac{\sum_{h=1}^{H}\sum_{w=1}^{W} f_o(h, w)\,\mathbb{I}\big[y(h, w) = n\big]}{\sum_{h=1}^{H}\sum_{w=1}^{W} \mathbb{I}\big[y(h, w) = n\big]}, \quad y = \mathrm{Downsample}(Y)$$

where $f_o$ denotes the output features of the noise predictor $\epsilon_\theta$, $Y$ denotes the label corresponding to the input image $x$, $n$ is the category index ranging from 1 to $N$, with $N$ being the total number of categories, $p_n$ represents the category-specific prototype for class $n$, $\mathrm{Downsample}$ denotes the downsampling operation, and $H$ and $W$ represent the height and width of $f_o$, respectively.
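A minimal sketch of this prototype computation, i.e., masked average pooling of the final-layer features under the downsampled label map, is shown below; the function name and tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def build_prototypes(f_o: torch.Tensor, Y: torch.Tensor, num_classes: int) -> torch.Tensor:
    """f_o: (C, H, W) final-layer diffusion features; Y: (H_img, W_img) integer label map."""
    C, H, W = f_o.shape
    # Downsample the label map to the feature resolution (nearest neighbour keeps hard labels).
    y = F.interpolate(Y[None, None].float(), size=(H, W), mode="nearest").long()[0, 0]
    prototypes = torch.zeros(num_classes, C, device=f_o.device)
    for n in range(num_classes):
        mask = (y == n).float()                           # indicator that pixel (h, w) belongs to class n
        denom = mask.sum().clamp(min=1.0)                 # guard against classes absent from this image
        prototypes[n] = (f_o * mask).flatten(1).sum(dim=1) / denom
    return prototypes                                     # (N, C), one prototype p_n per category
```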
Subsequently, $f_o$ is integrated with the proxy queries to derive representative embeddings for each segment:

$$\hat{q} = S_M \times f_o, \quad S_M = \mathrm{Softmax}\big(\mathrm{Linear}(q) \times f_o\big)$$

where $\mathrm{Linear}$ is a two-layer MLP with layer normalization. The refined query $\hat{q}$ is then used to construct a metric similarity matrix $M$, upon which singular value decomposition (SVD) is performed to extract its principal components, i.e., $M = U \Sigma V^{\top}$, where $U$, $\Sigma$, and $V^{\top}$ denote the left singular vectors, singular values, and right singular vectors (transposed) of the similarity matrix $M$, respectively. Note that, due to the symmetry of the similarity matrix $M$, its left and right singular vectors are equal, i.e., $U = V$.
We select the top-$k$ singular vectors corresponding to the largest singular values to form a set of cost vectors $v(\cdot)$, which are then compared with the prototype vectors to compute similarity distributions, as illustrated in Figure 3:

$$d(j, n) = \frac{v(j) \cdot p_n}{\|v(j)\|\,\|p_n\|}, \quad 1 \le j \le k$$

where $j$ denotes the index of the cost vectors. Similarly, we compute a similarity distribution vector between the cost vectors $v(\cdot)$ and the proxy queries:

$$\hat{d}(l, j) = \frac{\hat{q}(l) \cdot v(j)}{\|\hat{q}(l)\|\,\|v(j)\|}, \quad 1 \le l \le L$$

where $l$ denotes the index of the proxy queries and $L$ denotes the total number of proxy queries. Next, we utilize the two sets of similarity distribution vectors to perform category discrimination:

$$D(l, n) = \mathrm{Cosine}\big(\hat{d}(l, :),\, d(:, n)\big)$$

where $D(l, n)$ denotes the probability that the $l$-th proxy query corresponds to the $n$-th category, and $\mathrm{Cosine}(\cdot)$ denotes the cosine similarity function. The final segmentation prediction and the corresponding training objective are formulated as follows:

$$\mathcal{L}_d = -\sum_{g=1}^{G}\sum_{u=1}^{U} Y(g, u) \log \mathit{pred}(g, u), \quad \mathit{pred} = \mathrm{Upsample}(f_o) \times \hat{q} \times D$$

where $\mathrm{Upsample}$ denotes the upsampling operation, $G$ and $U$ represent the width and height of the image, respectively, and $\mathit{pred}$ corresponds to the final segmentation result of the entire image.
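Putting the pieces together, the sketch below traces the discrimination pipeline described above: query refinement, SVD of the similarity matrix, the two cosine-similarity distributions, and the final prediction. The layout of the similarity matrix (computed in feature space so that cost vectors, prototypes, and queries share a dimension) and all shapes are assumptions made for a dimensionally consistent illustration, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def proxy_similarity_prediction(q, f_o, prototypes, linear, k=8):
    """q: (L, C) proxy queries, f_o: (C, H, W) final features, prototypes: (N, C),
    linear: the two-layer MLP referred to as Linear in the text. Returns per-class score maps."""
    C, H, W = f_o.shape
    feats = f_o.flatten(1).t()                                   # (HW, C)
    S_M = torch.softmax(linear(q) @ feats.t(), dim=-1)           # (L, HW)
    q_hat = S_M @ feats                                          # refined proxy queries, (L, C)

    M = q_hat.t() @ q_hat                                        # symmetric similarity matrix (assumed C x C layout)
    _, _, Vh = torch.linalg.svd(M)
    v = Vh[:k]                                                   # top-k cost vectors, (k, C)

    d = F.cosine_similarity(v[:, None], prototypes[None], dim=-1)   # (k, N): cost vectors vs. prototypes
    d_hat = F.cosine_similarity(q_hat[:, None], v[None], dim=-1)    # (L, k): queries vs. cost vectors
    D = F.cosine_similarity(d_hat[:, None], d.t()[None], dim=-1)    # (L, N): class probability per query

    scores = feats @ q_hat.t() @ D                               # (HW, N): features x queries x class probabilities
    return scores.t().reshape(-1, H, W)                          # (N, H, W), to be upsampled to image size
```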

3.3. Consistency Learning for Texture Suppression

Considering that diffusion models are originally designed for generative tasks, their feature space inherently captures an abundance of fine-grained texture details. While such information is beneficial for producing high-fidelity images, it is often irrelevant—or even detrimental—to perception tasks like semantic segmentation, where excessive low-level textures may introduce noise and hinder effective learning. To mitigate this issue, we design a consistency learning strategy that explicitly suppresses redundant texture signals in the diffusion feature space, as illustrated in Figure 4. Rather than directly using low-resolution features—which would yield coarse, blurred masks lacking boundary precision—we employ them as a semantic supervision signal to regularize the high-resolution branch. This cross-resolution constraint enforces semantic-level consistency, encouraging the model to focus on structurally meaningful patterns while retaining fine spatial detail. As a result, the model achieves more robust and semantically coherent feature representations, improving both precision and generalization in few-shot satellite segmentation.
The process begins with a 4× downsampling of the target satellite image. Both the original high-resolution image and the corresponding downsampled image are then encoded into the latent space using the pretrained encoder $\mathcal{V}$, resulting in the latent representations $z_0^{\mathrm{ori}}$ and $z_0^{\mathrm{resize}}$, respectively. Subsequently, the forward diffusion process (as defined in Equation (1)) is applied to obtain the noisy latent samples $z_t^{\mathrm{ori}}$ and $z_t^{\mathrm{resize}}$. These noisy latents are then passed through the noise prediction network $\epsilon_\theta(\cdot)$ to produce two sets of diffusion features corresponding to different input scales:

$$f_{\mathrm{ori}} = \epsilon_\theta\big(z_t^{\mathrm{ori}}\big), \quad f_{\mathrm{resize}} = \epsilon_\theta\big(z_t^{\mathrm{resize}}\big)$$

The large-scale feature maps are first processed with average pooling to match the spatial resolution of the low-resolution features, and a consistency constraint is then imposed between the two sets of features as follows:

$$\mathcal{L}_c = \sum_{\hat{h}=1}^{\hat{H}}\sum_{\hat{w}=1}^{\hat{W}} f_{\mathrm{resize}} \log \frac{f_{\mathrm{resize}}}{\hat{f}_{\mathrm{ori}}}, \quad \hat{f}_{\mathrm{ori}} = \mathrm{AvgPool}\big(f_{\mathrm{ori}}\big)$$

where $\hat{H}$ and $\hat{W}$ denote the height and width of the feature map $f_{\mathrm{resize}}$, respectively. The consistency constraint serves to suppress redundant visual details: if the model emphasizes texture details, the features extracted from high-resolution and low-resolution inputs will differ significantly, since high-resolution images inherently contain richer local visual detail. By constraining this discrepancy, the consistency loss encourages the model to focus on high-level semantic structures rather than superficial textures. This guidance fosters more robust representation learning and enhances generalization in the challenging setting of few-shot satellite segmentation.
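A hedged sketch of this cross-resolution consistency term follows; treating the channel-wise softmax of each feature map as a distribution before the KL-style comparison is an assumption, since the text only specifies a ratio-based penalty between the two feature sets.

```python
import torch
import torch.nn.functional as F

def consistency_loss(f_ori: torch.Tensor, f_resize: torch.Tensor) -> torch.Tensor:
    """f_ori: (B, C, H, W) features of the full-resolution input;
    f_resize: (B, C, H/4, W/4) features of the 4x-downsampled input."""
    f_ori_pooled = F.adaptive_avg_pool2d(f_ori, f_resize.shape[-2:])   # AvgPool to the low-resolution grid
    p = F.softmax(f_resize, dim=1)                                     # low-resolution branch as the reference
    log_q = F.log_softmax(f_ori_pooled, dim=1)                         # pooled high-resolution branch
    # KL(p || q): penalizes texture detail present at high resolution but absent at low resolution.
    return F.kl_div(log_q, p, reduction="batchmean")
```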
Full Objective. The overall training objective is defined as follows:

$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_d + \omega \mathcal{L}_f + \eta \mathcal{L}_c$$

where $\omega$ and $\eta$ are hyperparameters that balance the relative contributions of each loss component.
During inference, as illustrated in Figure 5, the input image is first encoded into a compact latent representation via the VAE encoder, after which controlled Gaussian noise (corresponding to a fixed timestep t) is added. The noisy latent is then processed by the noise prediction network to extract diffusion features. These features are used to update the proxy queries, which subsequently form a metric similarity matrix M. Singular Value Decomposition (SVD) is performed on M to obtain the top-k singular vectors, yielding cost vectors that capture the dominant distributional directions. These cost vectors are then employed to compute distributional similarity with both the prototype vectors and the proxy queries, producing class probability distributions for each proxy query. Finally, segmentation predictions are generated through matrix multiplication among the upsampled diffusion features, the updated proxy queries, and their associated class probability vectors. Notably, the prototype vectors remain unchanged throughout inference.

4. Results

4.1. Experimental Setup

4.1.1. Datasets

In our experiments, we conduct comprehensive evaluations on multiple spacecraft perception datasets, covering both synthetic and semi-real scenarios, as summarized in Table 1. Specifically, SatelliteDataset [2] is employed for pre-training, leveraging its large-scale and richly annotated data to enhance feature generalization. The remaining datasets—Speed+ [46], UESD [4], SSP [47], and MIAS [29]—are utilized for few-shot training and testing. These datasets differ in image realism, illumination conditions, and structural complexity, collectively providing a comprehensive benchmark for evaluating the adaptability and robustness of spacecraft segmentation models under varying environmental and visual conditions.
SatelliteDataset [2] is a large-scale dataset curated for spacecraft perception tasks, including segmentation, classification, and pose estimation. It comprises 3117 RGB images at a resolution of 1280 × 720, encompassing both synthetic renderings and photorealistic frames extracted from videos. The dataset captures a wide range of spacecraft under diverse poses, lighting conditions, and spatial configurations. Each image is densely annotated with instance-level masks, resulting in over 10,350 labeled spacecraft components across 3667 unique instances. These detailed annotations enable fine-grained semantic segmentation of complex spacecraft structures, making the dataset particularly valuable for training models under data-scarce or few-shot conditions.
Speed+ [46] is a real-world dataset originally designed for spacecraft pose estimation, providing 3D models and corresponding pose annotations. Following the approach in [48,49], we convert the pose labels into fine-grained segmentation masks by projecting the 3D mesh models onto the image plane. This transformation enables pixel-wise semantic evaluation, even though the original Speed+ dataset does not include segmentation annotations. For our segmentation experiments, 40% of the samples are randomly selected as the test set, yielding 35,976 training images and 23,984 testing images.
UESD [4] is a photorealistic dataset constructed for satellite component recognition and spacecraft perception tasks. Built on Unreal Engine 4 (UE4), UESD simulates a realistic near-Earth orbital environment to generate lifelike satellite imagery. A total of 33 high-quality satellite models are collected and refined from public sources such as NASA 3D Resources, and imported into the environment to produce 10,000 synthetic images under diverse attitudes, lighting conditions, and viewing angles. Unlike existing datasets, UESD focuses on five distinctive components—solar panel, antenna, instrument, thruster, and optical payload—providing a reliable benchmark for fine-grained satellite structure understanding and recognition. The dataset is officially divided into 8000 training images and 2000 testing images, and we strictly follow this split in all experiments to ensure fair comparison and reproducibility.
SSP [47] is a large-scale dataset for spacecraft payload semantic segmentation. It contains 6600 semi-real images with manually annotated labels, generated by merging spacecraft-model images captured by real cameras with space backgrounds released by NASA. The use of real camera imaging preserves authentic lighting and material properties, enhancing visual realism and domain consistency for spacecraft perception research. The dataset is officially divided into 3300 training images, 1650 validation images, and 1650 testing images, and we strictly follow this split in all experiments to ensure fair comparison and reproducibility.
MIAS [29] is constructed to evaluate spacecraft perception under varying lighting conditions. The dataset contains 300 training scenes and 22 test scenes, each consisting of two spacecraft images captured under different illumination angles. This design enables the study of illumination robustness in spacecraft image analysis. The images are rendered using a physically based simulation to replicate realistic shading, reflections, and light–surface interactions. The MIAS dataset can be used not only for illumination-invariant image fusion but also for other space vision tasks such as detection, segmentation, and recognition under adverse lighting conditions.
We adopt mean Intersection over Union (mIoU) across all part categories as the primary evaluation metric, where higher scores reflect improved segmentation performance.
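For reference, a standard per-class IoU averaging is sketched below; this is the conventional definition of the metric, not code released with the paper.

```python
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """pred, gt: integer label maps of identical shape."""
    ious = []
    for n in range(num_classes):
        inter = np.logical_and(pred == n, gt == n).sum()
        union = np.logical_or(pred == n, gt == n).sum()
        if union > 0:                     # skip categories absent from both prediction and label
            ious.append(inter / union)
    return float(np.mean(ious))
```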

4.1.2. Implementation Details

We adopt Stable Diffusion v2.1 [16] in our segmentation framework. To retain the strong prior knowledge encoded in its latent space, all components of the pretrained diffusion model, including the encoder $\mathcal{V}$, decoder $\mathcal{R}$, and noise prediction network $\epsilon_\theta$, are kept entirely frozen during training. Rather than fine-tuning the diffusion model, we introduce a lightweight set of learnable proxy queries, which guide the segmentation process in a parameter-efficient manner while preserving generalization to unseen targets. During training, we first resize the shorter image side to 640 pixels, then randomly crop a 640 × 640 patch as input. During inference, we proportionally resize the longer side to the nearest multiple of 8, pad the shorter side with blank pixels to also reach a multiple of 8, and finally remove the padding and upsample the output back to the original resolution. This ensures architectural compatibility while preserving spatial fidelity.
To optimize the learnable proxy queries, we use the AdamW optimizer [50] with a learning rate of $6 \times 10^{-5}$ and default momentum coefficients ($\beta_1 = 0.9$, $\beta_2 = 0.999$). No weight decay is applied to the query parameters. For data preprocessing, we adopt standard data augmentation techniques, including random cropping, horizontal flipping, color jittering, and Gaussian blurring, following the augmentation pipeline of DACS [51], which has proven effective for robust semantic segmentation under limited supervision. For all experiments, we follow the same training and test splits provided by each dataset to ensure fair and consistent evaluation. In the few-shot setting, only a small number of samples are randomly selected from the official training set of each target dataset for model fine-tuning, while evaluation is performed on the official test set. The overall training process consists of two stages. In the first stage, we conduct parameter-efficient fine-tuning of the diffusion backbone on the SatelliteDataset for 60 epochs with a batch size of 8, enabling the proxy queries to adapt to satellite-specific semantics while preserving the pretrained diffusion priors. In the second stage, the model is trained on the few-shot satellite segmentation datasets (Speed+, UESD, SSP, and MIAS) for 30 epochs to achieve robust adaptation across diverse imaging domains. The entire training process requires approximately 4 h on a single NVIDIA GeForce RTX 4090 GPU. All experiments are conducted using a fixed random seed of 42 to ensure reproducibility.
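The snippet below sketches two of the details above: padding an inference image so both sides are multiples of 8, and optimizing only the proxy queries with AdamW. The query count, embedding width, and zero padding value are assumptions.

```python
import torch
import torch.nn.functional as F

def pad_to_multiple_of_8(img: torch.Tensor):
    """img: (B, 3, H, W). Pad right/bottom so H and W become multiples of 8; return the padding for later removal."""
    _, _, H, W = img.shape
    pad_h, pad_w = (-H) % 8, (-W) % 8
    return F.pad(img, (0, pad_w, 0, pad_h), value=0.0), (pad_h, pad_w)

# Only the proxy queries are trainable; the diffusion backbone stays frozen.
proxy_queries = torch.nn.Parameter(torch.rand(100, 1024))   # uniform initialization (assumed size)
optimizer = torch.optim.AdamW([proxy_queries], lr=6e-5, betas=(0.9, 0.999), weight_decay=0.0)
```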

4.2. Main Results

To comprehensively evaluate the generalization capability of the proposed framework, we conduct few-shot segmentation experiments across four representative satellite datasets—Speed+, UESD, SSP, and MIAS—covering diverse imaging domains from synthetic to photorealistic and low-light environments. All models are pre-trained on the full SatelliteDataset and then fine-tuned using a small number of samples randomly selected from the training set of each target dataset under the one-shot, three-shot, and ten-shot configurations. The final performance is evaluated on the corresponding test set of each target dataset. Table 2, Table 3, Table 4 and Table 5 present a comprehensive performance comparison between our method and representative approaches from three major paradigms—conventional semantic segmentation (DeepLabV3+ [52], HRNet+ [53], GroupViT [54], Segformer [9], MaskFormer [24], Mask2Former [25]), few-shot segmentation (BCM [55], PI-CLIP [56], LLaFS [57]), and diffusion-based segmentation (DiffeWS [58], DeFSS [59], DICEPTION [60]). These results collectively demonstrate the robustness and adaptability of our framework across varied satellite domains and visual conditions.

4.2.1. Comparisons with Previous Methods on the Speed+ Dataset

Table 2 reports the comparison results on the Speed+ dataset under the one-, three-, and ten-shot settings.
Table 2. Performance comparison between the proposed method and existing approaches under the few-shot setting on the Speed+ dataset. The best results are highlighted in bold.

| Method | Paradigm | Training Samples | Body | Solar Panel | Antenna | Avg. mIoU |
|---|---|---|---|---|---|---|
| DeepLabV3+ [52] | Semantic Segmentation | Speed+ (One-shot) | 50.3 | 59.6 | 0.0 | 36.6 |
| HRNet+ [53] | Semantic Segmentation | Speed+ (One-shot) | 53.6 | 60.3 | 1.9 | 38.6 |
| GroupViT [54] | Semantic Segmentation | Speed+ (One-shot) | 60.9 | 62.7 | 2.8 | 42.1 |
| Segformer [9] | Semantic Segmentation | Speed+ (One-shot) | 57.1 | 62.4 | 2.1 | 40.5 |
| Maskformer [24] | Semantic Segmentation | Speed+ (One-shot) | 63.7 | 63.8 | 3.6 | 43.7 |
| Mask2former [25] | Semantic Segmentation | Speed+ (One-shot) | 64.1 | 63.2 | 5.7 | 44.3 |
| BCM [55] | Few-shot Segmentation | Speed+ (One-shot) | 69.7 | 68.6 | 28.6 | 55.6 |
| PI-CLIP [56] | Few-shot Segmentation | Speed+ (One-shot) | 72.3 | 71.9 | 33.8 | 59.3 |
| LLaFS [57] | Few-shot Segmentation | Speed+ (One-shot) | 74.5 | 73.1 | 35.9 | 61.2 |
| DICEPTION [60] | Diffusion-based Segmentation | Speed+ (One-shot) | 76.6 | 72.3 | 35.1 | 61.3 |
| DiffewS [58] | Diffusion-based Segmentation | Speed+ (One-shot) | 78.9 | 74.8 | 40.7 | 64.8 |
| DeFSS [59] | Diffusion-based Segmentation | Speed+ (One-shot) | 81.7 | 75.2 | 42.4 | 66.4 |
| Ours | | Speed+ (One-shot) | 86.5 | 77.5 | 69.8 | 77.9 |
| DeepLabV3+ [52] | Semantic Segmentation | Speed+ (Three-shot) | 50.5 | 59.7 | 0.0 | 36.7 |
| HRNet+ [53] | Semantic Segmentation | Speed+ (Three-shot) | 53.3 | 61.2 | 1.7 | 38.7 |
| GroupViT [54] | Semantic Segmentation | Speed+ (Three-shot) | 61.4 | 63.5 | 3.0 | 42.6 |
| Segformer [9] | Semantic Segmentation | Speed+ (Three-shot) | 56.9 | 63.1 | 2.8 | 40.9 |
| Maskformer [24] | Semantic Segmentation | Speed+ (Three-shot) | 64.1 | 63.7 | 5.8 | 44.5 |
| Mask2former [25] | Semantic Segmentation | Speed+ (Three-shot) | 64.0 | 63.9 | 6.0 | 44.6 |
| BCM [55] | Few-shot Segmentation | Speed+ (Three-shot) | 71.3 | 70.7 | 38.5 | 60.2 |
| PI-CLIP [56] | Few-shot Segmentation | Speed+ (Three-shot) | 74.7 | 74.2 | 41.0 | 63.3 |
| LLaFS [57] | Few-shot Segmentation | Speed+ (Three-shot) | 77.1 | 75.0 | 42.1 | 64.7 |
| DICEPTION [60] | Diffusion-based Segmentation | Speed+ (Three-shot) | 77.9 | 73.8 | 38.9 | 63.5 |
| DiffewS [58] | Diffusion-based Segmentation | Speed+ (Three-shot) | 80.3 | 75.6 | 42.1 | 66.0 |
| DeFSS [59] | Diffusion-based Segmentation | Speed+ (Three-shot) | 82.1 | 76.4 | 44.3 | 67.6 |
| Ours | | Speed+ (Three-shot) | 89.7 | 80.6 | 71.5 | 80.6 |
| DeepLabV3+ [52] | Semantic Segmentation | Speed+ (Ten-shot) | 52.1 | 60.9 | 1.7 | 38.2 |
| HRNet+ [53] | Semantic Segmentation | Speed+ (Ten-shot) | 55.6 | 61.3 | 2.2 | 39.7 |
| GroupViT [54] | Semantic Segmentation | Speed+ (Ten-shot) | 65.6 | 66.8 | 4.5 | 45.6 |
| Segformer [9] | Semantic Segmentation | Speed+ (Ten-shot) | 60.1 | 65.7 | 3.9 | 43.2 |
| Maskformer [24] | Semantic Segmentation | Speed+ (Ten-shot) | 69.8 | 68.5 | 7.9 | 48.7 |
| Mask2former [25] | Semantic Segmentation | Speed+ (Ten-shot) | 71.6 | 70.8 | 8.7 | 50.4 |
| BCM [55] | Few-shot Segmentation | Speed+ (Ten-shot) | 81.6 | 80.0 | 53.2 | 71.6 |
| PI-CLIP [56] | Few-shot Segmentation | Speed+ (Ten-shot) | 82.3 | 81.5 | 51.7 | 71.8 |
| LLaFS [57] | Few-shot Segmentation | Speed+ (Ten-shot) | 81.9 | 80.7 | 55.8 | 72.8 |
| DICEPTION [60] | Diffusion-based Segmentation | Speed+ (Ten-shot) | 83.6 | 78.9 | 52.0 | 71.5 |
| DiffewS [58] | Diffusion-based Segmentation | Speed+ (Ten-shot) | 85.0 | 81.7 | 54.6 | 73.8 |
| DeFSS [59] | Diffusion-based Segmentation | Speed+ (Ten-shot) | 86.5 | 82.1 | 58.5 | 75.7 |
| Ours | | Speed+ (Ten-shot) | 93.6 | 86.7 | 77.6 | 86.0 |
Comparison with conventional semantic segmentation. Across all settings, our method achieves the highest mIoU, clearly outperforming representative conventional baselines. The antenna category, characterized by significant morphological differences between SatelliteDataset and Speed+, remains particularly challenging. Conventional methods nearly collapse in this case (e.g., DeepLabV3+ at 0.0, Segformer at 2.1), while our approach achieves 69.8 IoU in the one-shot and 77.6 IoU in the ten-shot setting, demonstrating strong robustness to cross-domain morphological variation.
Comparison with few-shot segmentation methods. Compared with recent few-shot frameworks, our method exhibits markedly stronger generalization. In the one-shot setting, the best-performing baseline (LLaFS) achieves 61.3 mIoU, whereas ours reaches 77.9 mIoU (+16.6). With three and ten shots, our model further improves to 80.6 and 86.0 mIoU, exceeding competitors by over 14 points on average. Moreover, our framework maintains stable results on morphologically diverse categories such as antenna, confirming that the proposed distributional-similarity mechanism and proxy-query adaptation effectively capture cross-satellite structural relationships under data scarcity.
Comparison with diffusion-based segmentation methods. Against diffusion-based approaches, our framework achieves consistent advantages: 77.9 mIoU versus 66.4 for the strongest baseline (DeFSS) in the one-shot setting, and 80.6 versus 67.6 in the three-shot setting. These gains highlight the effectiveness of our parameter-efficient fine-tuning, which preserves diffusion priors while flexibly adapting to satellite-specific semantics through a small set of proxy queries. Avoiding full fine-tuning allows our model to retain diffusion structural knowledge and deliver efficient, accurate, and stable adaptation for few-shot satellite segmentation.
Qualitative comparisons. In addition to the quantitative gains, qualitative comparisons further validate the advantages of our approach, as shown in Figure 6. Compared with Segformer and Mask2Former, our method produces more accurate and fine-grained segmentation of satellite components across a variety of challenging scenarios. The improvements are especially notable under adverse conditions such as low lighting or backlighting (rows 4–8), where existing methods often fail to preserve structural coherence or misidentify components. In contrast, our framework consistently preserves clear part boundaries and accurately localizes antenna regions. These qualitative findings are consistent with the feature-level analysis shown in Figure 7, which presents the t-SNE visualization of learned embeddings for Segformer, Mask2Former, and our method. Compared with the baselines, our framework produces more compact and clearly separated clusters for different satellite components, indicating that the learned representations are more discriminative. In particular, while Segformer and Mask2Former show significant overlap between component categories, our method effectively distinguishes structurally similar parts such as antennas, solar panels, and the main body. This demonstrates the ability of our approach to capture fine-grained semantic cues and enhance inter-class separability, which directly contributes to its superior segmentation performance. These results demonstrate the superior robustness and generalization ability of the proposed method in real-world applications.

4.2.2. Comparisons with Previous Methods on the UESD Dataset

Table 3 reports the comparison results on the UESD dataset under the one-, three-, and ten-shot settings.
Table 3. Comprehensive performance comparison between the proposed method and existing approaches under the few-shot setting on the UESD dataset. The best results are highlighted in bold.

| Method | Paradigm | Training Samples | Solar Panel | Antenna | Instrument | Thruster | Optical Payload | Avg. mIoU |
|---|---|---|---|---|---|---|---|---|
| DeepLabV3+ [52] | Semantic Segmentation | UESD (One-shot) | 67.3 | 48.4 | 22.9 | 13.4 | 17.2 | 33.8 |
| HRNet+ [53] | Semantic Segmentation | UESD (One-shot) | 68.7 | 52.7 | 25.8 | 13.3 | 18.9 | 35.9 |
| GroupViT [54] | Semantic Segmentation | UESD (One-shot) | 70.7 | 58.0 | 31.2 | 15.2 | 21.0 | 39.2 |
| Segformer [9] | Semantic Segmentation | UESD (One-shot) | 70.9 | 60.5 | 33.1 | 18.7 | 27.8 | 42.2 |
| Maskformer [24] | Semantic Segmentation | UESD (One-shot) | 71.1 | 61.5 | 36.1 | 22.1 | 39.6 | 46.1 |
| Mask2former [25] | Semantic Segmentation | UESD (One-shot) | 72.6 | 63.8 | 37.4 | 23.6 | 41.6 | 47.8 |
| BCM [55] | Few-shot Segmentation | UESD (One-shot) | 75.2 | 68.5 | 46.4 | 37.9 | 57.1 | 57.0 |
| PI-CLIP [56] | Few-shot Segmentation | UESD (One-shot) | 76.4 | 70.9 | 48.3 | 40.2 | 62.5 | 59.7 |
| LLaFS [57] | Few-shot Segmentation | UESD (One-shot) | 76.1 | 70.6 | 51.5 | 43.1 | 63.3 | 60.9 |
| DICEPTION [60] | Diffusion-based Segmentation | UESD (One-shot) | 76.3 | 73.4 | 56.8 | 48.9 | 64.2 | 63.9 |
| DiffewS [58] | Diffusion-based Segmentation | UESD (One-shot) | 77.3 | 75.3 | 59.3 | 53.0 | 67.3 | 66.4 |
| DeFSS [59] | Diffusion-based Segmentation | UESD (One-shot) | 77.5 | 76.3 | 61.2 | 56.9 | 69.5 | 68.3 |
| Ours | | UESD (One-shot) | 78.4 | 79.7 | 65.3 | 60.1 | 77.6 | 72.2 |
| DeepLabV3+ [52] | Semantic Segmentation | UESD (Three-shot) | 69.1 | 53.8 | 26.9 | 25.2 | 20.5 | 39.1 |
| HRNet+ [53] | Semantic Segmentation | UESD (Three-shot) | 70.2 | 60.0 | 31.2 | 20.1 | 28.8 | 42.1 |
| GroupViT [54] | Semantic Segmentation | UESD (Three-shot) | 71.7 | 63.7 | 38.7 | 28.3 | 39.9 | 48.5 |
| Segformer [9] | Semantic Segmentation | UESD (Three-shot) | 71.3 | 67.4 | 41.7 | 32.3 | 41.2 | 50.8 |
| Maskformer [24] | Semantic Segmentation | UESD (Three-shot) | 72.4 | 66.6 | 50.2 | 39.0 | 52.1 | 56.1 |
| Mask2former [25] | Semantic Segmentation | UESD (Three-shot) | 73.2 | 69.0 | 43.6 | 37.3 | 53.2 | 55.3 |
| BCM [55] | Few-shot Segmentation | UESD (Three-shot) | 75.6 | 70.3 | 57.0 | 49.4 | 59.4 | 62.3 |
| PI-CLIP [56] | Few-shot Segmentation | UESD (Three-shot) | 77.4 | 72.6 | 61.6 | 59.9 | 65.4 | 67.4 |
| LLaFS [57] | Few-shot Segmentation | UESD (Three-shot) | 79.6 | 74.4 | 65.5 | 58.9 | 64.6 | 68.6 |
| DICEPTION [60] | Diffusion-based Segmentation | UESD (Three-shot) | 79.4 | 76.7 | 63.3 | 51.5 | 68.4 | 67.9 |
| DiffewS [58] | Diffusion-based Segmentation | UESD (Three-shot) | 81.7 | 77.4 | 65.7 | 59.9 | 70.3 | 71.0 |
| DeFSS [59] | Diffusion-based Segmentation | UESD (Three-shot) | 81.4 | 78.6 | 64.4 | 58.1 | 72.5 | 71.0 |
| Ours | | UESD (Three-shot) | 82.3 | 81.1 | 70.7 | 64.9 | 78.5 | 75.5 |
| DeepLabV3+ [52] | Semantic Segmentation | UESD (Ten-shot) | 72.4 | 67.7 | 40.8 | 43.7 | 41.1 | 53.1 |
| HRNet+ [53] | Semantic Segmentation | UESD (Ten-shot) | 73.9 | 70.1 | 42.7 | 45.0 | 44.3 | 55.2 |
| GroupViT [54] | Semantic Segmentation | UESD (Ten-shot) | 74.6 | 72.1 | 45.1 | 47.9 | 51.6 | 58.3 |
| Segformer [9] | Semantic Segmentation | UESD (Ten-shot) | 74.9 | 71.2 | 53.8 | 54.0 | 54.5 | 61.7 |
| Maskformer [24] | Semantic Segmentation | UESD (Ten-shot) | 76.4 | 73.0 | 52.3 | 55.8 | 57.1 | 62.9 |
| Mask2former [25] | Semantic Segmentation | UESD (Ten-shot) | 77.1 | 73.3 | 54.5 | 56.2 | 60.8 | 64.4 |
| BCM [55] | Few-shot Segmentation | UESD (Ten-shot) | 80.7 | 77.9 | 63.2 | 59.9 | 68.8 | 70.1 |
| PI-CLIP [56] | Few-shot Segmentation | UESD (Ten-shot) | 83.9 | 76.9 | 65.3 | 61.7 | 72.0 | 72.0 |
| LLaFS [57] | Few-shot Segmentation | UESD (Ten-shot) | 83.1 | 79.1 | 68.5 | 63.2 | 76.0 | 74.0 |
| DICEPTION [60] | Diffusion-based Segmentation | UESD (Ten-shot) | 82.5 | 81.3 | 67.7 | 62.3 | 75.5 | 73.9 |
| DiffewS [58] | Diffusion-based Segmentation | UESD (Ten-shot) | 84.7 | 83.6 | 69.4 | 66.5 | 72.2 | 75.3 |
| DeFSS [59] | Diffusion-based Segmentation | UESD (Ten-shot) | 85.8 | 83.4 | 68.3 | 65.4 | 76.7 | 75.9 |
| Ours | | UESD (Ten-shot) | 87.1 | 86.2 | 75.6 | 70.2 | 83.4 | 80.5 |
Comparison with conventional semantic segmentation. When transferring from SatelliteDataset to the photorealistic UESD domain, conventional segmentation methods exhibit a notable performance drop. In the one-shot setting, the strongest baseline (Mask2Former [25]) achieves 47.8 mIoU, while our method reaches 72.2 mIoU (+24.4). Even with more supervision, these baselines show limited scalability (56.1 mIoU in three-shot, 64.4 mIoU in ten-shot), whereas our framework continues to improve to 80.5 mIoU. Notably, our model achieves high accuracy on Instrument (65.3 IoU) and Thruster (60.1 IoU), confirming its strong cross-domain adaptability and structural awareness under realistic conditions.
Comparison with few-shot segmentation methods. Compared with recent few-shot frameworks, our method consistently demonstrates superior generalization. In the one-shot setting, the best-performing baseline (LLaFS) achieves 60.9 mIoU, whereas ours reaches 72.2 mIoU (+11.3). With three and ten shots, our model further improves to 75.5 mIoU and 80.5 mIoU, surpassing competitors by 5–7 points on average. Moreover, our framework maintains stable performance on Antenna and Optical Payload, validating that the proposed distributional-similarity mechanism effectively aligns semantic relationships between synthetic and photorealistic domains.
Comparison with diffusion-based segmentation methods. Compared with diffusion-based approaches that also exploit generative priors, our method achieves consistent advantages. In the one-shot setting, DeFSS reaches 68.3 mIoU, while ours attains 72.2 mIoU (+3.9), and the gap widens to +4.5–5.0 in three- and ten-shot configurations. These results highlight the strength of our query-efficient fine-tuning strategy, which preserves pretrained diffusion priors while adaptively refining them for satellite-specific semantics via compact proxy queries. This design enables efficient and stable knowledge transfer, yielding superior accuracy and robustness across few-shot scenarios.

4.2.3. Comparisons with Previous Methods on the SSP Dataset

Table 4 reports the comparison results on the SSP dataset under the one-, three-, and ten-shot settings.
Table 4. Comprehensive performance comparison between the proposed method and existing approaches under the few-shot setting on the SSP dataset. The best results are highlighted in bold.

| Method | Paradigm | Training Samples | Spacecraft | Solar Panel | Radar | Thruster | Avg. mIoU |
|---|---|---|---|---|---|---|---|
| DeepLabV3+ [52] | Semantic Segmentation | SSP (One-shot) | 63.0 | 64.7 | 39.7 | 25.4 | 48.2 |
| HRNet+ [53] | Semantic Segmentation | SSP (One-shot) | 65.0 | 66.3 | 39.8 | 32.8 | 51.0 |
| GroupViT [54] | Semantic Segmentation | SSP (One-shot) | 69.2 | 68.8 | 42.3 | 39.7 | 55.0 |
| Segformer [9] | Semantic Segmentation | SSP (One-shot) | 68.0 | 68.2 | 42.1 | 46.5 | 56.2 |
| Maskformer [24] | Semantic Segmentation | SSP (One-shot) | 69.3 | 69.6 | 45.0 | 43.7 | 56.9 |
| Mask2former [25] | Semantic Segmentation | SSP (One-shot) | 70.9 | 71.8 | 44.7 | 45.1 | 58.1 |
| BCM [55] | Few-shot Segmentation | SSP (One-shot) | 71.7 | 70.6 | 53.9 | 52.6 | 62.2 |
| PI-CLIP [56] | Few-shot Segmentation | SSP (One-shot) | 73.4 | 72.5 | 53.8 | 55.0 | 63.7 |
| LLaFS [57] | Few-shot Segmentation | SSP (One-shot) | 75.9 | 75.9 | 57.4 | 59.3 | 67.1 |
| DICEPTION [60] | Diffusion-based Segmentation | SSP (One-shot) | 76.1 | 76.0 | 61.6 | 59.9 | 68.4 |
| DiffewS [58] | Diffusion-based Segmentation | SSP (One-shot) | 77.8 | 78.9 | 63.6 | 58.4 | 69.7 |
| DeFSS [59] | Diffusion-based Segmentation | SSP (One-shot) | 78.0 | 79.8 | 66.9 | 59.4 | 71.0 |
| Ours | | SSP (One-shot) | 81.2 | 81.8 | 70.9 | 63.1 | 74.3 |
| DeepLabV3+ [52] | Semantic Segmentation | SSP (Three-shot) | 68.5 | 67.5 | 42.6 | 29.4 | 52.0 |
| HRNet+ [53] | Semantic Segmentation | SSP (Three-shot) | 69.8 | 68.1 | 43.4 | 35.6 | 54.2 |
| GroupViT [54] | Semantic Segmentation | SSP (Three-shot) | 71.4 | 70.6 | 49.2 | 43.8 | 58.8 |
| Segformer [9] | Semantic Segmentation | SSP (Three-shot) | 71.1 | 71.4 | 47.0 | 47.3 | 59.2 |
| Maskformer [24] | Semantic Segmentation | SSP (Three-shot) | 73.5 | 73.4 | 50.3 | 48.8 | 61.5 |
| Mask2former [25] | Semantic Segmentation | SSP (Three-shot) | 74.4 | 75.6 | 51.7 | 50.9 | 63.2 |
| BCM [55] | Few-shot Segmentation | SSP (Three-shot) | 75.2 | 77.4 | 56.4 | 53.6 | 65.7 |
| PI-CLIP [56] | Few-shot Segmentation | SSP (Three-shot) | 78.3 | 78.1 | 61.3 | 58.3 | 69.0 |
| LLaFS [57] | Few-shot Segmentation | SSP (Three-shot) | 77.8 | 78.7 | 64.7 | 60.4 | 70.4 |
| DICEPTION [60] | Diffusion-based Segmentation | SSP (Three-shot) | 78.9 | 79.7 | 65.1 | 61.6 | 71.3 |
| DiffewS [58] | Diffusion-based Segmentation | SSP (Three-shot) | 80.3 | 81.8 | 68.0 | 63.8 | 73.5 |
| DeFSS [59] | Diffusion-based Segmentation | SSP (Three-shot) | 81.1 | 83.2 | 71.4 | 65.1 | 75.2 |
| Ours | | SSP (Three-shot) | 83.3 | 85.4 | 74.7 | 67.6 | 77.8 |
| DeepLabV3+ [52] | Semantic Segmentation | SSP (Ten-shot) | 71.0 | 71.5 | 49.2 | 43.5 | 58.8 |
| HRNet+ [53] | Semantic Segmentation | SSP (Ten-shot) | 73.3 | 72.7 | 51.8 | 47.4 | 61.3 |
| GroupViT [54] | Semantic Segmentation | SSP (Ten-shot) | 74.8 | 72.8 | 56.7 | 55.6 | 65.0 |
| Segformer [9] | Semantic Segmentation | SSP (Ten-shot) | 76.5 | 74.4 | 59.4 | 55.1 | 66.4 |
| Maskformer [24] | Semantic Segmentation | SSP (Ten-shot) | 76.6 | 77.2 | 59.0 | 57.5 | 67.6 |
| Mask2former [25] | Semantic Segmentation | SSP (Ten-shot) | 78.5 | 77.4 | 59.9 | 56.1 | 68.0 |
| BCM [55] | Few-shot Segmentation | SSP (Ten-shot) | 80.8 | 79.8 | 63.5 | 60.7 | 71.2 |
| PI-CLIP [56] | Few-shot Segmentation | SSP (Ten-shot) | 82.7 | 81.5 | 67.6 | 66.9 | 74.7 |
| LLaFS [57] | Few-shot Segmentation | SSP (Ten-shot) | 81.9 | 81.7 | 69.0 | 65.4 | 74.5 |
| DICEPTION [60] | Diffusion-based Segmentation | SSP (Ten-shot) | 83.7 | 84.7 | 71.7 | 67.3 | 76.9 |
| DiffewS [58] | Diffusion-based Segmentation | SSP (Ten-shot) | 85.7 | 87.0 | 72.5 | 68.5 | 78.4 |
| DeFSS [59] | Diffusion-based Segmentation | SSP (Ten-shot) | 86.1 | 87.9 | 74.1 | 69.0 | 79.3 |
| Ours | | SSP (Ten-shot) | 87.6 | 88.9 | 79.9 | 73.1 | 82.4 |
Comparison with conventional semantic segmentation. On the semi-realistic SSP dataset, conventional segmentation networks show weak generalization due to pronounced visual and material differences between training and test domains. In the one-shot setting, the strongest baseline (Mask2Former [25]) achieves 58.1 mIoU, while our method reaches 74.3 mIoU (+16.2). Even with more supervision, conventional models yield limited gains (63.2 mIoU in three-shot, 68.0 mIoU in ten-shot), whereas our framework scales effectively to 82.4 mIoU. Notably, our model excels on Radar and Thruster, categories with significant illumination and reflectance variations, demonstrating strong adaptability to diverse surface characteristics.
Comparison with few-shot segmentation methods. Compared with few-shot frameworks such as BCM [55], PI-CLIP [56], LLaFS [57], and DICEPTION [60], our method achieves consistently higher performance. In the one-shot case, the best-performing baseline (LLaFS) records 67.1 mIoU, while ours attains 74.3 mIoU (+7.2). With three and ten shots, our model further rises to 77.8 mIoU and 82.4 mIoU, surpassing all baselines by 5–6 points on average. Furthermore, it maintains balanced accuracy across categories—81.8 IoU on Solar Panel and 70.9 IoU on Radar—confirming that the proposed distributional-similarity mechanism effectively captures intra- and inter-class relationships under semi-realistic conditions.
Comparison with diffusion-based segmentation methods. When compared with diffusion-based approaches such as DiffeWS [58] and DeFSS [59], our framework consistently performs better. In the one-shot setting, DeFSS achieves 71.0 mIoU, whereas our method attains 74.3 mIoU (+3.3). Across three- and ten-shot configurations, it continues to outperform by 2–5 points. This improvement stems from our parameter-efficient query tuning, which preserves diffusion structural priors while adapting them to the semi-realistic domain, achieving superior segmentation accuracy and robustness with minimal computational overhead.

4.2.4. Comparisons with Previous Methods on the MIAS Dataset

Table 5 reports the comparison results on the MIAS dataset under the one-, three-, and ten-shot settings, evaluating performance under varying illumination conditions, including low-light and backlit environments.
Table 5. Comprehensive performance comparison between the proposed method and existing approaches under the few-shot setting on the MIAS dataset. The best results are highlighted in bold.

| Method | Paradigm | Training Samples | Solar Panel | Antenna | Body | Avg. mIoU |
|---|---|---|---|---|---|---|
| DeepLabV3+ [52] | Semantic Segmentation | MIAS (One-shot) | 58.0 | 33.5 | 57.5 | 49.7 |
| HRNet+ [53] | Semantic Segmentation | MIAS (One-shot) | 59.4 | 35.2 | 59.8 | 51.5 |
| GroupViT [54] | Semantic Segmentation | MIAS (One-shot) | 62.4 | 46.1 | 62.3 | 56.9 |
| Segformer [9] | Semantic Segmentation | MIAS (One-shot) | 63.6 | 47.1 | 64.5 | 58.4 |
| Maskformer [24] | Semantic Segmentation | MIAS (One-shot) | 65.3 | 50.9 | 63.7 | 60.0 |
| Mask2former [25] | Semantic Segmentation | MIAS (One-shot) | 66.0 | 52.7 | 67.0 | 61.9 |
| BCM [55] | Few-shot Segmentation | MIAS (One-shot) | 68.4 | 53.3 | 69.1 | 63.6 |
| PI-CLIP [56] | Few-shot Segmentation | MIAS (One-shot) | 71.4 | 56.8 | 71.8 | 66.7 |
| LLaFS [57] | Few-shot Segmentation | MIAS (One-shot) | 72.5 | 59.5 | 72.2 | 68.1 |
| DICEPTION [60] | Diffusion-based Segmentation | MIAS (One-shot) | 73.4 | 62.4 | 72.6 | 69.5 |
| DiffewS [58] | Diffusion-based Segmentation | MIAS (One-shot) | 75.5 | 66.7 | 74.2 | 72.1 |
| DeFSS [59] | Diffusion-based Segmentation | MIAS (One-shot) | 75.9 | 67.2 | 74.8 | 72.6 |
| Ours | | MIAS (One-shot) | 79.7 | 72.9 | 78.2 | 76.9 |
| DeepLabV3+ [52] | Semantic Segmentation | MIAS (Three-shot) | 60.4 | 42.9 | 59.6 | 54.3 |
| HRNet+ [53] | Semantic Segmentation | MIAS (Three-shot) | 62.0 | 41.4 | 61.8 | 55.1 |
| GroupViT [54] | Semantic Segmentation | MIAS (Three-shot) | 65.3 | 55.2 | 65.7 | 62.1 |
| Segformer [9] | Semantic Segmentation | MIAS (Three-shot) | 65.7 | 57.1 | 67.4 | 63.4 |
| Maskformer [24] | Semantic Segmentation | MIAS (Three-shot) | 68.9 | 55.8 | 69.4 | 64.7 |
| Mask2former [25] | Semantic Segmentation | MIAS (Three-shot) | 68.8 | 56.5 | 70.8 | 65.4 |
| BCM [55] | Few-shot Segmentation | MIAS (Three-shot) | 71.9 | 61.6 | 73.7 | 69.1 |
| PI-CLIP [56] | Few-shot Segmentation | MIAS (Three-shot) | 73.8 | 63.6 | 74.4 | 70.6 |
| LLaFS [57] | Few-shot Segmentation | MIAS (Three-shot) | 75.9 | 64.6 | 74.6 | 71.7 |
| DICEPTION [60] | Diffusion-based Segmentation | MIAS (Three-shot) | 77.2 | 68.4 | 77.4 | 74.3 |
| DiffewS [58] | Diffusion-based Segmentation | MIAS (Three-shot) | 80.8 | 72.1 | 77.7 | 76.9 |
| DeFSS [59] | Diffusion-based Segmentation | MIAS (Three-shot) | 81.2 | 71.9 | 78.6 | 77.2 |
| Ours | | MIAS (Three-shot) | 84.1 | 77.2 | 82.9 | 81.4 |
| DeepLabV3+ [52] | Semantic Segmentation | MIAS (Ten-shot) | 64.8 | 51.7 | 63.9 | 60.1 |
| HRNet+ [53] | Semantic Segmentation | MIAS (Ten-shot) | 66.7 | 52.2 | 65.7 | 61.5 |
| GroupViT [54] | Semantic Segmentation | MIAS (Ten-shot) | 68.6 | 60.1 | 68.5 | 65.7 |
| Segformer [9] | Semantic Segmentation | MIAS (Ten-shot) | 71.9 | 62.9 | 70.1 | 68.3 |
| Maskformer [24] | Semantic Segmentation | MIAS (Ten-shot) | 74.0 | 65.7 | 73.7 | 71.1 |
| Mask2former [25] | Semantic Segmentation | MIAS (Ten-shot) | 75.3 | 63.9 | 76.5 | 71.9 |
| BCM [55] | Few-shot Segmentation | MIAS (Ten-shot) | 77.4 | 67.9 | 79.5 | 74.9 |
| PI-CLIP [56] | Few-shot Segmentation | MIAS (Ten-shot) | 78.6 | 70.6 | 79.8 | 76.3 |
| LLaFS [57] | Few-shot Segmentation | MIAS (Ten-shot) | 80.7 | 71.5 | 81.1 | 77.8 |
| DICEPTION [60] | Diffusion-based Segmentation | MIAS (Ten-shot) | 80.2 | 73.1 | 82.4 | 78.6 |
| DiffewS [58] | Diffusion-based Segmentation | MIAS (Ten-shot) | 82.8 | 78.2 | 84.8 | 81.9 |
| DeFSS [59] | Diffusion-based Segmentation | MIAS (Ten-shot) | 83.9 | 78.0 | 85.7 | 82.5 |
| Ours | | MIAS (Ten-shot) | 87.6 | 82.7 | 88.1 | 86.1 |
Comparison with conventional semantic segmentation. Under low-light and backlit conditions, conventional segmentation models degrade notably as their appearance-based features fail to capture consistent structural cues. In the one-shot setting, the strongest baseline (Mask2Former [25]) achieves 61.9 mIoU, while our method reaches 76.9 mIoU (+15.0). Even with additional supervision, conventional methods scale poorly (65.4 mIoU in three-shot and 71.9 mIoU in ten-shot), whereas our framework continues to improve to 86.1 mIoU. These results demonstrate the strong illumination robustness and structural consistency of our model under challenging visual conditions.
Comparison with few-shot segmentation methods. Compared with few-shot frameworks such as BCM [55], PI-CLIP [56], and LLaFS [57], our approach achieves consistently superior performance. In the one-shot configuration, the best baseline (LLaFS) attains 68.1 mIoU, while our method achieves 76.9 mIoU (+8.8). With three and ten shots, it further improves to 81.4 mIoU and 86.1 mIoU, surpassing all competitors by 5–8 points. Our framework remains stable on illumination-sensitive components such as Antenna and Body, confirming that the distributional-similarity module effectively mitigates appearance shifts while proxy-query adaptation preserves fine-grained geometric structures.
Comparison with diffusion-based segmentation methods. When compared with diffusion-based methods such as DiffewS [58] and DeFSS [59], our framework consistently outperforms them under varying illumination. In the one-shot setting, DeFSS records 72.6 mIoU, while our model attains 76.9 mIoU (+4.3), with gains of roughly 4 points maintained across the three- and ten-shot configurations. These improvements stem from our parameter-efficient diffusion adaptation, which emphasizes structural cues over brightness, producing illumination-invariant representations. Consequently, our approach achieves robust and precise segmentation even in low-light and backlit environments.
Overall, these results highlight the effectiveness of our proposed framework. In particular, it demonstrates a remarkable ability to handle structurally diverse categories—such as antennas—that pose significant challenges to previous methods. Moreover, under adverse visual conditions like low-light or backlit scenarios, our approach consistently maintains accurate segmentation with well-preserved boundaries. These strengths collectively underscore the superior adaptability and generalization of our method across both challenging categories and real-world visual environments.

4.2.5. Fine-Grained Segmentation of Satellite Components

Previous methods are limited to independent pixel-wise classification and do not support fine-grained segmentation guided by reference images. In contrast, our framework enables reference-based segmentation, allowing the model to leverage a limited number of annotated examples to perform detailed component-level predictions. Table 6 presents fine-grained segmentation results on the Speed+ dataset under few-shot settings. Specifically, our method successfully distinguishes and segments the three individual antennas on Speed+ satellites—an ability that prior methods do not possess. As the number of reference samples increases, segmentation performance improves consistently, with average mIoU rising from 47.1% in the one-shot setting to 60.5% in the ten-shot configuration. These results highlight the effectiveness of our proposed mechanism in transferring semantic knowledge from reference exemplars to novel target instances, thereby enabling accurate and fine-grained segmentation under data-scarce conditions.
To further demonstrate the effectiveness of our approach, Figure 8 presents qualitative results of fine-grained, reference-based segmentation for satellite components. Using different reference images for Antenna1, Antenna2, and Antenna3 (left column), our method successfully transfers semantic information from the references to accurately identify and segment the corresponding parts in unseen target images (right columns). In contrast to conventional pixel-wise classification methods—which lack reference-guided segmentation capability—our framework achieves precise component-level predictions even in few-shot scenarios. This highlights its effectiveness in recognizing fine-grained satellite parts.
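To make the reference-guided protocol concrete, the sketch below shows one simple way such transfer can be realized: a component prototype is pooled from the reference features under its annotated mask and matched against the target features by cosine similarity. The function names, tensor shapes, and the thresholding step are illustrative assumptions rather than the exact DiffSatSeg pipeline, which additionally routes this matching through proxy queries and distributional similarity.

```python
# Minimal sketch of reference-guided component segmentation (assumed [C, H, W]
# feature maps and a simple cosine-similarity decision; illustrative only).
import torch
import torch.nn.functional as F

def masked_average_prototype(ref_feat: torch.Tensor, ref_mask: torch.Tensor) -> torch.Tensor:
    """Pool a component prototype from reference features under its annotated mask."""
    mask = ref_mask.float().unsqueeze(0)                                   # [1, H, W]
    return (ref_feat * mask).sum(dim=(1, 2)) / mask.sum().clamp(min=1e-6)  # [C]

def reference_guided_mask(tgt_feat: torch.Tensor, proto: torch.Tensor,
                          threshold: float = 0.5) -> torch.Tensor:
    """Segment the corresponding component (e.g., Antenna1) in an unseen target image."""
    sim = F.cosine_similarity(tgt_feat, proto[:, None, None], dim=0)       # [H, W]
    return (sim > threshold).float()
```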

4.3. Ablation Studies and Further Analysis

4.3.1. Ablation Study on Primary Components

Table 7 presents the ablation study evaluating the contributions of key components in our framework. L_d denotes the segmentation mechanism guided by distributional similarity, L_f represents the parameter-efficient fine-tuning strategy, and L_c corresponds to the consistency learning strategy. The baseline (row 1) corresponds to directly training the diffusion model using the full SatelliteDataset along with a single annotated sample from Speed+, with final performance evaluated on the Speed+ test set.
We progressively introduce the three proposed modules, each contributing to consistent performance improvements. Introducing the segmentation mechanism guided by distributional similarity (L_d) in row 2 markedly improves the average mIoU from 45.3 to 69.7 (+24.4), with an exceptional +55.1 gain on Antenna (5.6 → 60.7). This demonstrates that our mechanism effectively mitigates the severe degradation caused by large morphological discrepancies across satellites—a persistent challenge in few-shot segmentation. By modeling intrinsic feature relationships instead of relying on superficial appearance cues, it enables cross-instance structural consistency and robust recognition of geometrically diverse components.
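The assignment step behind L_d can be summarized as follows: each proxy query and each category prototype is first described by its similarity distribution over the k cost vectors, and queries are then classified by comparing these distributions. The snippet below is a minimal sketch of that idea under assumed tensor shapes; it is not the released implementation.

```python
# Sketch of the distributional-similarity assignment: queries and prototypes are
# compared through their similarity distributions over the k cost vectors rather
# than through raw appearance features (shapes are assumptions of this sketch).
import torch
import torch.nn.functional as F

def class_probabilities(queries, prototypes, cost_vecs):
    # queries: [Nq, C], prototypes: [Ncls, C], cost_vecs: [k, C]
    q_dist = F.normalize(queries, dim=-1) @ F.normalize(cost_vecs, dim=-1).T     # [Nq, k]
    p_dist = F.normalize(prototypes, dim=-1) @ F.normalize(cost_vecs, dim=-1).T  # [Ncls, k]
    # Cosine similarity between query distributions and category reference distributions
    sim = F.normalize(q_dist, dim=-1) @ F.normalize(p_dist, dim=-1).T            # [Nq, Ncls]
    return sim.softmax(dim=-1)                                                   # per-query class probabilities
```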
Adding the parameter-efficient fine-tuning strategy (L_f, row 3) further boosts performance to 72.7 (+3.0). Unlike full fine-tuning, which disrupts pretrained diffusion priors, this lightweight adaptation introduces only a few learnable proxy queries, preserving structural knowledge while aligning with satellite-specific semantics. Consistent improvements across all categories (Body 81.1, Solar Panel 72.2, Antenna 64.7) indicate that this strategy strikes an optimal balance between adaptability and stability.
Introducing the consistency learning module (L_c, row 4) raises the mIoU to 73.9, enhancing boundary coherence and robustness. By enforcing multi-scale feature consistency, this module suppresses redundant texture information inherent to diffusion features—information beneficial for generation but harmful for segmentation—thereby encouraging structure-aware and semantically stable representations.
When all three components are jointly applied (row 5), the framework achieves 77.9 mIoU, with per-category scores of 86.5 (Body), 77.5 (Solar Panel), and 69.8 (Antenna). These modules act complementarily: L_d mitigates inter-satellite discrepancies, L_f enables efficient semantic adaptation while preserving priors, and L_c enhances stability by filtering irrelevant visual textures. Their synergistic integration yields coherent, structure-aware, and semantically consistent segmentation results.

4.3.2. Effect of Diffusion Models and Feature Layers

Table 8 presents the performance of our framework when integrated with three different pretrained diffusion models, including Stable Diffusion v1.4, v1.5, and v2.1. Across all variants, our method yields consistently strong results, achieving mIoU scores of 77.8, 77.8, and 77.9 under the one-shot setting. These results demonstrate that the proposed framework is agnostic to the specific diffusion backbone and can effectively leverage the prior knowledge encoded in different diffusion models. The minimal performance gap among the models further reflects the stability and general applicability of our approach.
Table 9 compares the performance of our framework using diffusion features extracted from different layers under the one-shot setting. We find that selecting features from a representative subset of intermediate layers (3, 6, 9, 12) leads to the best overall performance, achieving 77.9 mIoU. Interestingly, incorporating all diffusion layers does not yield further gains and slightly degrades performance to 77.7 mIoU. This suggests that simply aggregating more features does not guarantee better results. Including all layers may introduce redundant or low-level noise, thereby weakening the model’s discriminative capability. In contrast, selectively utilizing a diverse set of informative intermediate layers captures complementary semantic cues without overwhelming the model with unnecessary details, resulting in improved segmentation performance.
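As a rough illustration of how such a layer subset can be harvested in practice, the sketch below registers forward hooks on a few intermediate blocks of a UNet-style noise predictor and fuses the captured maps at a common resolution. The `unet_blocks` container and the block indices are placeholders rather than the exact layer naming of our implementation.

```python
# Sketch of collecting and fusing diffusion features from a subset of intermediate
# UNet blocks (layer indices 3, 6, 9, 12 as in Table 9); module names are placeholders.
import torch
import torch.nn.functional as F

def register_feature_hooks(unet_blocks, layer_ids=(3, 6, 9, 12)):
    feats, handles = {}, []
    for i in layer_ids:
        handles.append(unet_blocks[i].register_forward_hook(
            lambda module, inputs, output, i=i: feats.__setitem__(i, output)))
    return feats, handles          # run one denoising forward pass, then read `feats`

def fuse_features(feats, size):
    # Bilinearly upsample every captured map to a shared resolution and concatenate
    return torch.cat(
        [F.interpolate(f, size=size, mode="bilinear", align_corners=False)
         for _, f in sorted(feats.items())], dim=1)
```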

4.3.3. Effect of Proxy Query Length and Dimension

Figure 9 illustrates the impact of proxy query length and feature dimension on segmentation performance. As shown on the left, increasing the query length initially improves performance, reaching the highest mIoU of 77.9% when the length is set to 100. However, further increasing the length to 125 or 150 leads to performance degradation. This suggests that although longer queries offer greater capacity to model structural variations, excessive length introduces redundancy and noise, which undermines discriminative capability and makes optimization more difficult.
The right portion of Figure 9 illustrates the effect of varying query dimensionality. As the dimension increases from 128 to 512, segmentation performance improves steadily, reaching a peak mIoU of 77.9% at 512 dimensions. However, further increasing the dimensionality to 1024 leads to a decline in performance (76.8% mIoU). This trend suggests that excessively large query dimensions may introduce redundant information and increase the difficulty of optimization, thereby undermining the model’s ability to generalize effectively.
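For reference, a minimal version of the proxy-query interface with the best-performing configuration (length 100, dimension 512) might look as follows; the single cross-attention update and the initialization scale are simplifying assumptions of this sketch.

```python
# Sketch of the learnable proxy queries updated by cross-attention over flattened
# diffusion features; only the queries (and the small attention head shown here)
# would be trainable, while the diffusion backbone stays frozen.
import torch
import torch.nn as nn

class ProxyQueryAdapter(nn.Module):
    def __init__(self, num_queries: int = 100, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, diff_feats: torch.Tensor) -> torch.Tensor:
        # diff_feats: [B, N, dim] flattened diffusion features used as keys/values
        q = self.queries.unsqueeze(0).expand(diff_feats.size(0), -1, -1)
        updated, _ = self.cross_attn(q, diff_feats, diff_feats)
        return updated                                  # [B, num_queries, dim]
```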

4.3.4. Effect of the Number of Principal Components

Figure 10 illustrates the effect of varying the number of principal components on one-shot segmentation performance, focusing on the antenna and solar panel categories. For both categories, segmentation performance gradually improves with an increasing number of components, reaching the highest mIoU when using 100 principal components—69.8% for antennas and 77.5% for solar panels. As the number of components increases further, performance begins to decline (e.g., 67.3% for antennas and 74.8% for solar panels at 150 components).
These results underscore the importance of selecting an appropriate number of principal components. Using too few components restricts the representational capacity of the model, limiting its ability to capture the structural diversity of satellite parts. Conversely, incorporating too many components introduces redundancy and noise, which can overwhelm the discriminative mechanism and impair segmentation performance. The optimal performance observed at 100 components suggests that this configuration strikes a favorable balance between compactness and expressiveness, effectively capturing the essential structural cues required for accurate satellite component segmentation.
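The principal components in question are extracted from the similarity matrix M rather than from raw pixels. One plausible reading of this step, written out under assumed shapes, is sketched below; the precise construction of M in our implementation follows Figure 2 and may differ in detail.

```python
# Illustrative sketch (an assumption, not the released code): build a similarity
# matrix between diffusion features and the updated proxy queries, then keep the
# k leading singular directions as cost vectors.
import torch
import torch.nn.functional as F

def cost_vectors(feats, queries, k=100):
    # feats: [N, C] flattened diffusion features; queries: [Nq, C] proxy queries
    M = F.normalize(feats, dim=-1) @ F.normalize(queries, dim=-1).T   # [N, Nq] similarity matrix
    U, S, Vh = torch.linalg.svd(M, full_matrices=False)               # dominant distributional directions
    return U[:, :k].T @ feats                                         # projected back to feature space: [k, C]
```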

4.3.5. Effect of Downsampling Scale in Consistency Learning

Table 10 presents an ablation study on the impact of different downsampling scales in the consistency learning strategy under the one-shot setting. Using a downsampling scale of 4 yields the best overall performance, achieving an average mIoU of 77.9%, outperforming both scale 2 (76.1%) and scale 8 (75.3%).
When the downsampling scale is too small (e.g., 2×), the resolution gap between the original and downsampled images is minimal, resulting in weaker constraints and limited suppression of redundant texture information. On the other hand, a large downsampling scale (e.g., 8×) leads to excessive loss of semantic distribution, which impairs alignment between the two feature streams and causes a noticeable decline in segmentation performance. The best results at a 4× downsampling scale indicate a well-balanced trade-off as it introduces sufficient resolution disparity to suppress low-level textures while preserving enough semantic content to enable effective consistency learning.
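A minimal sketch of this constraint is given below, assuming a generic feature extractor and an L2 penalty; the exact distance used in our loss is not restated here.

```python
# Sketch of the cross-resolution consistency constraint: features of the original
# image are average-pooled to the resolution of the 4x-downsampled stream and the
# two are pulled together (extract_features is a placeholder; MSE is an assumption).
import torch
import torch.nn.functional as F

def consistency_loss(image, extract_features, scale=4):
    low_res = F.interpolate(image, scale_factor=1.0 / scale,
                            mode="bilinear", align_corners=False)
    feat_hi = extract_features(image)        # [B, C, H, W] diffusion features
    feat_lo = extract_features(low_res)      # [B, C, h, w] features of the small image
    feat_hi = F.adaptive_avg_pool2d(feat_hi, feat_lo.shape[-2:])   # spatial alignment
    return F.mse_loss(feat_hi, feat_lo)
```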

4.3.6. Effect of Loss Weights

Table 11 and Table 12 report the ablation study on the loss weights ω and η in the one-shot setting. For the weight ω , performance peaks when ω = 1 , achieving an mIoU of 77.9%. Smaller or larger values lead to performance drops, e.g., 76.8% at ω = 0.5 and 76.2% at ω = 2 . A similar trend is observed for the weight η , where the best result (77.9%) is also obtained at η = 1 .
These results highlight the importance of selecting appropriate loss weights. Excessively small or large values for ω and η may underemphasize or over-penalize their corresponding objectives, leading to an imbalance between different components of the training process. Setting both weights to 1 achieves the best balance, allowing each module to contribute effectively to the overall model optimization.
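For clarity, the overall objective can be viewed as a weighted sum in which ω and η scale the two auxiliary terms, with both set to 1 in the main experiments. Which term each weight multiplies is stated here as an assumption for illustration.

```python
# Illustrative composition of the training objective with the weights swept in
# Tables 11 and 12 (omega = eta = 1 in the main experiments); the pairing of each
# weight with a specific auxiliary term is an assumption of this sketch.
def total_loss(loss_seg, loss_aux, loss_consistency, omega=1.0, eta=1.0):
    return loss_seg + omega * loss_aux + eta * loss_consistency
```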

4.3.7. Fine-Tuning Strategy Analysis

Table 13 presents an ablation study on different fine-tuning strategies within the proposed framework. The “Full” setting refers to updating all parameters in the vision backbone of the noise predictor (UNet) within the pretrained diffusion model, involving approximately 368M trainable parameters. While this approach allows comprehensive adaptation, it achieves only 69.5% average mIoU under the one-shot setting, with suboptimal performance across all categories (e.g., 78.3 on Body, 68.9 on Solar panel, and 61.2 on Antenna). These results suggest that fully fine-tuning a pretrained diffusion model under few-shot supervision not only incurs high computational cost, but may also disrupt the pretrained knowledge, thereby limiting generalization and degrading overall segmentation performance.
By contrast, our proposed parameter-efficient fine-tuning method updates only a small set of proxy queries, involving just 5.9M trainable parameters—approximately 62× fewer than full fine-tuning. Despite this substantial reduction in learnable parameters, our approach achieves a significantly higher average mIoU of 77.9%, outperforming the full fine-tuning baseline by a large margin. The gains are consistent across all object categories, with improvements of +8.2 points on Body, +8.6 on Solar Panel, and +8.6 on Antenna, demonstrating the effectiveness and efficiency of our design under few-shot conditions. This improvement primarily stems from preserving the pretrained diffusion priors. By freezing the backbone and updating only the proxy queries, our framework retains the semantic knowledge encoded in the diffusion model while enabling efficient task-specific adaptation. This design reduces overfitting risks in few-shot settings and ensures strong generalization and computational efficiency.
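In code, this setup amounts to freezing the diffusion backbone and handing only the proxy-query parameters to the optimizer, as in the hedged sketch below; the optimizer choice and learning rate are illustrative assumptions.

```python
# Sketch of the parameter-efficient fine-tuning setup: the UNet noise predictor is
# frozen and only the proxy-query adapter is optimized (AdamW and lr are assumptions).
import torch

def build_optimizer(unet, proxy_adapter, lr=1e-4, weight_decay=1e-2):
    for p in unet.parameters():
        p.requires_grad_(False)                          # preserve the diffusion priors
    trainable = list(proxy_adapter.parameters())
    n_params = sum(p.numel() for p in trainable) / 1e6
    print(f"trainable parameters: {n_params:.1f}M")      # ~5.9M in our configuration
    return torch.optim.AdamW(trainable, lr=lr, weight_decay=weight_decay)
```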

4.3.8. The Choice of Timestep t

We investigate the influence of different diffusion timesteps t on segmentation performance using the Speed+ dataset, as shown in Table 14. Previous studies [61,62] reveal that semantically meaningful diffusion features are mainly generated within the range of 0–200, and that neighboring timesteps tend to produce highly similar feature representations. Building upon these insights, we further explore timestep intervals of 25, 50, and 100 to ensure sufficient diversity and to empirically determine the optimal noise level for semantic preservation. The results show that performance remains stable for moderate noise levels (t ≤ 200) but deteriorates when the noise becomes excessive (t ≥ 300), as strong perturbations disrupt the underlying semantic structure. Among all configurations, t = 100 achieves the highest average mIoU (77.9), providing the most effective balance between semantic richness and feature stability. Consequently, we fix t = 100 for both training and inference throughout all experiments to ensure consistent noise conditions, stable diffusion behavior, and reproducible segmentation performance.
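Concretely, keeping t fixed means that every image is perturbed with the same closed-form forward step, x_t = sqrt(ᾱ_t) x_0 + sqrt(1 − ᾱ_t) ε, before a single UNet pass. The sketch below spells this out with placeholder VAE/UNet objects and a precomputed ᾱ schedule; it is an illustration of the standard DDPM noising step, not the full feature-extraction code.

```python
# Sketch of extracting diffusion features at a fixed timestep t = 100 using the
# standard DDPM forward process; `encode_to_latent`, `unet`, and `alphas_cumprod`
# are placeholders for the frozen Stable Diffusion components and noise schedule.
import torch

@torch.no_grad()
def features_at_fixed_t(encode_to_latent, unet, image, alphas_cumprod, t=100):
    x0 = encode_to_latent(image)                         # clean latent from the VAE encoder
    eps = torch.randn_like(x0)
    a_bar = alphas_cumprod[t]
    xt = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps  # q(x_t | x_0)
    timesteps = torch.full((x0.size(0),), t, device=x0.device, dtype=torch.long)
    return unet(xt, timesteps)                           # hooks collect intermediate features
```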

4.3.9. Analysis of Computational Efficiency

To fairly assess computational efficiency, we evaluate both training and inference performance against representative baselines—DeepLabV3+, Mask2Former, PI-CLIP, and DeFSS—under the same hardware environment (NVIDIA GeForce RTX 4090, input resolution 640 × 640), as shown in Table 15. Our framework completes training in 4.3 h, comparable to conventional models such as DeepLabV3+ (3.8 h) and Mask2Former (4.2 h), while achieving substantially higher segmentation accuracy. Compared with vision-language and diffusion-based approaches, it is far more efficient—requiring 4.3 h versus 6.1 h for PI-CLIP and 8.6 h for DeFSS, which rely on full-parameter fine-tuning or joint diffusion training. In terms of model complexity, our method contains only 21 M trainable parameters, significantly fewer than DeepLabV3+ (62 M), Mask2Former (89 M), and DeFSS (97 M). This reflects the strength of our parameter-efficient fine-tuning, which adapts diffusion priors via a small set of proxy queries rather than full model updates. During inference, our framework also achieves an excellent trade-off between speed and accuracy, processing one image in 0.12 s, nearly 2–3× faster than PI-CLIP (0.26 s) and DeFSS (0.35 s). Overall, these results confirm that our approach offers superior accuracy and generalization with minimal computational overhead, demonstrating strong practicality for real-world satellite perception tasks.
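As a rough guide to reproducing the latency column of Table 15, a simple per-image timing loop is sketched below; the warm-up length, batch size, and synchronization scheme are our assumptions rather than the exact measurement protocol used in the paper.

```python
# Hedged timing sketch: average per-image inference latency at 640x640 on a single
# GPU (the model call is a placeholder for any of the compared segmentation models).
import time
import torch

@torch.no_grad()
def time_per_image(model, runs=100, size=(1, 3, 640, 640), device="cuda"):
    x = torch.randn(size, device=device)
    for _ in range(10):                   # warm-up iterations
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(runs):
        model(x)
    torch.cuda.synchronize()
    return (time.time() - start) / runs   # seconds per image
```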

5. Conclusions

In this work, we propose a novel diffusion-based segmentation framework named DiffSatSeg, which leverages the rich prior knowledge embedded in pretrained diffusion models for few-shot satellite component segmentation. To accommodate the unique semantic structures of satellite targets, we propose a parameter-efficient fine-tuning strategy that employs a set of learnable proxy queries as interfaces to integrate satellite-specific semantics while preserving the generative priors of the diffusion model. To address the substantial morphological variations among different satellite types, we propose a segmentation mechanism based on distributional similarity, which utilizes the principal components of a similarity matrix built from proxy queries to guide fine-grained and flexible part segmentation under limited supervision. Moreover, to suppress interference from redundant texture details inherent in diffusion features, we design a consistency learning strategy that encourages the model to focus on high-level semantic structures rather than low-level visual details. Extensive experiments validate the effectiveness of our proposed method, which consistently surpasses prior approaches and achieves state-of-the-art performance in few-shot settings. It demonstrates exceptional performance under challenging conditions such as low-light and backlit scenarios, and effectively tackles the significant morphological differences among satellite components. In addition, the framework supports reference-based segmentation, enabling accurate and fine-grained component-level predictions from only a few annotated examples.
Overall, this work not only fills a critical gap in few-shot satellite segmentation but also opens a promising avenue for leveraging diffusion priors to improve segmentation performance under limited supervision. Future research could explore extending this framework to multi-modal scenarios and integrating it with satellite pose estimation techniques to further support applications such as spacecraft monitoring, anomaly detection, and autonomous on-orbit servicing.

Author Contributions

Conceptualization, F.L. and Z.Z.; methodology, F.L. and X.W. (Xuan Wang); validation, F.L. and X.W. (Xuan Wang); formal analysis, F.L. and X.W. (Xuanbin Wang); investigation, F.L.; resources, Y.X. and X.W. (Xuanbin Wang); writing—original draft preparation, F.L.; writing—review and editing, F.L. and Z.Z.; visualization, F.L.; supervision, Y.X. and X.W. (Xuanbin Wang); project administration, F.L. and Y.X.; funding acquisition, Z.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Young Scientists Fund of the National Natural Science Foundation of China (Grant No. 52302506), the Shaanxi Key Research and Development Program (Grant No. 2025GH-YBXM-022), the Fundamental Research Funds for the Central Universities (Grant No. G2024KY0603), and the Fundamental Research Funds for the National Key Laboratory of Unmanned Aerial Vehicle Technology (Grant No. WR202414).

Data Availability Statement

The data that support the findings of this study are all derived from publicly available datasets.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Yang, X.; Wu, T.; Wang, N.; Huang, Y.; Song, B.; Gao, X. HCNN-PSI: A hybrid CNN with partial semantic information for space target recognition. Pattern Recognit. 2020, 108, 107531. [Google Scholar] [CrossRef]
  2. Dung, H.A.; Chen, B.; Chin, T.J. A spacecraft dataset for detection, segmentation and parts recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 2012–2019. [Google Scholar]
  3. Qu, Z.; Wei, C. A spatial non-cooperative target image semantic segmentation algorithm with improved deeplab V3+. In Proceedings of the 2022 IEEE 22nd International Conference on Communication Technology (ICCT), Nanjing, China, 11–14 November 2022; pp. 1633–1638. [Google Scholar]
  4. Zhao, Y.; Zhong, R.; Cui, L. Intelligent recognition of spacecraft components from photorealistic images based on Unreal Engine 4. Adv. Space Res. 2023, 71, 3761–3774. [Google Scholar] [CrossRef]
  5. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  6. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef] [PubMed]
  7. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3146–3154. [Google Scholar]
  8. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 4015–4026. [Google Scholar]
  9. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
  10. Shao, Y.; Wu, A.; Li, S.; Shu, L.; Wan, X.; Shao, Y.; Huo, J. Satellite component semantic segmentation: Video dataset and real-time pyramid attention and decoupled attention network. IEEE Trans. Aerosp. Electron. Syst. 2023, 59, 7315–7333. [Google Scholar] [CrossRef]
  11. Wang, H.; Zhang, X.; Hu, Y.; Yang, Y.; Cao, X.; Zhen, X. Few-shot semantic segmentation with democratic attention networks. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XIII 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 730–746. [Google Scholar]
  12. Zhang, B.; Xiao, J.; Qin, T. Self-guided and cross-guided learning for few-shot segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8312–8321. [Google Scholar]
  13. Yang, B.; Liu, C.; Li, B.; Jiao, J.; Ye, Q. Prototype mixture models for few-shot semantic segmentation. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 763–778. [Google Scholar]
  14. Li, G.; Jampani, V.; Sevilla-Lara, L.; Sun, D.; Kim, J.; Kim, J. Adaptive prototype learning and allocation for few-shot segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8334–8343. [Google Scholar]
  15. Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851. [Google Scholar]
  16. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695. [Google Scholar]
  17. Niemeijer, J.; Schwonberg, M.; Termöhlen, J.A.; Schmidt, N.M.; Fingscheidt, T. Generalization by adaptation: Diffusion-based domain extension for domain-generalized semantic segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 1–10 January 2024; pp. 2830–2840. [Google Scholar]
  18. Kawano, Y.; Aoki, Y. Maskdiffusion: Exploiting pre-trained diffusion models for semantic segmentation. IEEE Access 2024, 12, 127283–127293. [Google Scholar] [CrossRef]
  19. Xing, Z.; Wan, L.; Fu, H.; Yang, G.; Zhu, L. Diff-unet: A diffusion embedded network for volumetric segmentation. arXiv 2023, arXiv:2303.10326. [Google Scholar] [CrossRef]
  20. Ma, C.; Yang, Y.; Ju, C.; Zhang, F.; Liu, J.; Wang, Y.; Zhang, Y.; Wang, Y. Diffusionseg: Adapting diffusion towards unsupervised object discovery. arXiv 2023, arXiv:2303.09813. [Google Scholar] [CrossRef]
  21. Ke, B.; Obukhov, A.; Huang, S.; Metzger, N.; Daudt, R.C.; Schindler, K. Repurposing diffusion-based image generators for monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Waikoloa, HI, USA, 1–10 January 2024; pp. 9492–9502. [Google Scholar]
  22. Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar] [CrossRef]
  23. Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H.; et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 6881–6890. [Google Scholar]
  24. Cheng, B.; Schwing, A.; Kirillov, A. Per-pixel classification is not all you need for semantic segmentation. Adv. Neural Inf. Process. Syst. 2021, 34, 17864–17875. [Google Scholar]
  25. Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1290–1299. [Google Scholar]
  26. Hoyer, L.; Dai, D.; Van Gool, L. Daformer: Improving network architectures and training strategies for domain-adaptive semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 9924–9935. [Google Scholar]
  27. Wei, Z.; Chen, L.; Jin, Y.; Ma, X.; Liu, T.; Ling, P.; Wang, B.; Chen, H.; Zheng, J. Stronger, Fewer, & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Waikoloa, HI, USA, 1–10 January 2024; pp. 28619–28630. [Google Scholar]
  28. Chen, Y.; Gao, J.; Zhang, Y.; Duan, Z.; Zhang, K. Satellite components detection from optical images based on instance segmentation networks. J. Aerosp. Inf. Syst. 2021, 18, 355–365. [Google Scholar] [CrossRef]
  29. Xiang, A.; Zhang, L.; Fan, L. Shadow removal of spacecraft images with multi-illumination angles image fusion. Aerosp. Sci. Technol. 2023, 140, 108453. [Google Scholar] [CrossRef]
  30. Liu, Y.; Zhu, M.; Wang, J.; Guo, X.; Yang, Y.; Wang, J. Multi-scale deep neural network based on dilated convolution for spacecraft image segmentation. Sensors 2022, 22, 4222. [Google Scholar] [CrossRef] [PubMed]
  31. Zhang, C.; Lin, G.; Liu, F.; Yao, R.; Shen, C. Canet: Class-agnostic segmentation networks with iterative refinement and attentive few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5217–5226. [Google Scholar]
  32. Liu, Y.; Zhang, X.; Zhang, S.; He, X. Part-aware prototype network for few-shot semantic segmentation. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part IX 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 142–158. [Google Scholar]
  33. Min, J.; Kang, D.; Cho, M. Hypercorrelation squeeze for few-shot segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Nashville, TN, USA, 20–25 June 2021; pp. 6941–6952. [Google Scholar]
  34. Tian, Z.; Zhao, H.; Shu, M.; Yang, Z.; Li, R.; Jia, J. Prior guided feature enrichment network for few-shot segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 1050–1065. [Google Scholar] [CrossRef]
  35. Zhang, G.; Kang, G.; Yang, Y.; Wei, Y. Few-shot segmentation via cycle-consistent transformer. Adv. Neural Inf. Process. Syst. 2021, 34, 21984–21996. [Google Scholar]
  36. Dhariwal, P.; Nichol, A. Diffusion models beat gans on image synthesis. Adv. Neural Inf. Process. Syst. 2021, 34, 8780–8794. [Google Scholar]
  37. Zhang, L.; Rao, A.; Agrawala, M. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 3836–3847. [Google Scholar]
  38. Chen, M.; Laina, I.; Vedaldi, A. Training-free layout control with cross-attention guidance. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 1–10 January 2024; pp. 5343–5353. [Google Scholar]
  39. Epstein, D.; Jabri, A.; Poole, B.; Efros, A.; Holynski, A. Diffusion self-guidance for controllable image generation. Adv. Neural Inf. Process. Syst. 2023, 36, 16222–16239. [Google Scholar]
  40. Wu, W.; Zhao, Y.; Shou, M.Z.; Zhou, H.; Shen, C. Diffumask: Synthesizing images with pixel-level annotations for semantic segmentation using diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 1206–1217. [Google Scholar]
  41. Wu, W.; Zhao, Y.; Chen, H.; Gu, Y.; Zhao, R.; He, Y.; Zhou, H.; Shou, M.Z.; Shen, C. Datasetdm: Synthesizing data with perception annotations using diffusion models. Adv. Neural Inf. Process. Syst. 2023, 36, 54683–54695. [Google Scholar]
  42. Lee, H.Y.; Tseng, H.Y.; Yang, M.H. Exploiting diffusion prior for generalizable dense prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Waikoloa, HI, USA, 1–10 January 2024; pp. 7861–7871. [Google Scholar]
  43. Chen, S.; Sun, P.; Song, Y.; Luo, P. Diffusiondet: Diffusion model for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 19830–19843. [Google Scholar]
  44. Tosi, F.; Ramirez, P.Z.; Poggi, M. Diffusion models for monocular depth estimation: Overcoming challenging conditions. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 236–257. [Google Scholar]
  45. Esser, P.; Rombach, R.; Ommer, B. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 12873–12883. [Google Scholar]
  46. Park, T.H.; Märtens, M.; Lecuyer, G.; Izzo, D.; D’Amico, S. SPEED+: Next-generation dataset for spacecraft pose estimation across domain gap. In Proceedings of the 2022 IEEE Aerospace Conference (AERO), Big Sky, MT, USA, 5–12 March 2022; pp. 1–15. [Google Scholar]
  47. Guo, Y.; Feng, Z.; Song, B.; Li, X. SSP: A large-scale semi-real dataset for semantic segmentation of spacecraft payloads. In Proceedings of the 2023 8th International Conference on Image, Vision and Computing (ICIVC), Dalian, China, 27–29 July 2023; pp. 831–836. [Google Scholar]
  48. Wang, Z.; Zhang, Z.; Sun, X.; Li, Z.; Yu, Q. Revisiting Monocular Satellite Pose Estimation With Transformer. IEEE Trans. Aerosp. Electron. Syst. 2022, 58, 4279–4294. [Google Scholar] [CrossRef]
  49. Wang, Z.; Chen, M.; Guo, Y.; Li, Z.; Yu, Q. Bridging the Domain Gap in Satellite Pose Estimation: A Self-Training Approach Based on Geometrical Constraints. IEEE Trans. Aerosp. Electron. Syst. 2023, 60, 2500–2514. [Google Scholar] [CrossRef]
  50. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
  51. Tranheden, W.; Olsson, V.; Pinto, J.; Svensson, L. Dacs: Domain adaptation via cross-domain mixed sampling. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Nashville, TN, USA, 20–25 June 2021; pp. 1379–1389. [Google Scholar]
  52. Pissas, T.; Ravasio, C.S.; Cruz, L.D.; Bergeles, C. Multi-scale and cross-scale contrastive learning for semantic segmentation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 413–429. [Google Scholar]
  53. Wang, W.; Zhou, T.; Yu, F.; Dai, J.; Konukoglu, E.; Van Gool, L. Exploring cross-image pixel contrast for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Nashville, TN, USA, 20–25 June 2021; pp. 7303–7313. [Google Scholar]
  54. Xu, J.; De Mello, S.; Liu, S.; Byeon, W.; Breuel, T.; Kautz, J.; Wang, X. Groupvit: Semantic segmentation emerges from text supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 18134–18144. [Google Scholar]
  55. Sakai, T.; Qiu, H.; Katsuki, T.; Kimura, D.; Osogami, T.; Inoue, T. A surprisingly simple approach to generalized few-shot semantic segmentation. Adv. Neural Inf. Process. Syst. 2024, 37, 27005–27023. [Google Scholar]
  56. Wang, J.; Zhang, B.; Pang, J.; Chen, H.; Liu, W. Rethinking prior information generation with clip for few-shot segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Waikoloa, HI, USA, 1–10 January 2024; pp. 3941–3951. [Google Scholar]
  57. Zhu, L.; Chen, T.; Ji, D.; Ye, J.; Liu, J. Llafs: When large language models meet few-shot segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Waikoloa, HI, USA, 1–10 January 2024; pp. 3065–3075. [Google Scholar]
  58. Zhu, M.; Liu, Y.; Luo, Z.; Jing, C.; Chen, H.; Xu, G.; Wang, X.; Shen, C. Unleashing the potential of the diffusion model in few-shot semantic segmentation. Adv. Neural Inf. Process. Syst. 2024, 37, 42672–42695. [Google Scholar]
  59. Qin, Z.; Xu, J.; Ge, W. DeFSS: Image-to-Mask Denoising Learning for Few-shot Segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Nashville, TN, USA, 11–15 June 2025; pp. 22232–22240. [Google Scholar]
  60. Zhao, C.; Sun, Y.; Liu, M.; Zheng, H.; Zhu, M.; Zhao, Z.; Chen, H.; He, T.; Shen, C. Diception: A generalist diffusion model for visual perceptual tasks. arXiv 2025, arXiv:2502.17157. [Google Scholar] [CrossRef]
  61. Xu, J.; Liu, S.; Vahdat, A.; Byeon, W.; Wang, X.; De Mello, S. Open-vocabulary panoptic segmentation with text-to-image diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Paris, France, 2–3 October 2023; pp. 2955–2966. [Google Scholar]
  62. Baranchuk, D.; Voynov, A.; Rubachev, I.; Khrulkov, V.; Babenko, A. Label-Efficient Semantic Segmentation with Diffusion Models. In Proceedings of the International Conference on Learning Representations, Virtual, 25–29 April 2022. [Google Scholar]
Figure 1. Illustration of our parameter-efficient fine-tuning strategy. We employ a few learnable proxy queries to adapt the diffusion model to the unique structural characteristics of satellites as rare and complex targets. During training, only the proxy queries are updated, while the pretrained diffusion model remains frozen, thereby preserving its latent space and prior knowledge.
Figure 2. An overview of our proposed few-shot segmentation mechanism. The process begins with the construction of prototype vectors from the diffusion features of the labeled target satellite. These features are then used to update the proxy queries, which are subsequently employed to construct a metric similarity matrix M. Singular Value Decomposition (SVD) is performed on M to extract the top-k singular vectors, forming a set of cost vectors that capture the dominant distributional directions. These cost vectors are used to compute distributional similarity with both the prototype vectors and the proxy queries, resulting in class probability distributions assigned to each proxy query. Finally, segmentation predictions are generated through matrix multiplication among the upsampled diffusion features, the updated proxy queries, and their associated class probability vectors.
Figure 3. Illustration of the discriminative mechanism. We begin by computing the similarity between each proxy query and the k cost vectors to obtain a similarity distribution for each query. Likewise, each category prototype is compared with the same cost vectors to produce a category-specific reference distribution. Finally, the cosine similarity between each query’s distribution and the reference distributions is calculated to derive a category probability distribution for each proxy query.
Figure 4. Illustration of the consistency learning strategy. The target satellite image is first downsampled to generate a low-resolution counterpart. Both the original high-resolution and the downsampled images are processed through the diffusion model to extract multi-scale features. To ensure spatial alignment, the large-scale diffusion features are processed with average pooling to match the resolution of the small-scale features. A consistency constraint is then applied between the two feature representations, guiding the model to focus on the semantic structure of the satellite rather than fine-grained texture details. This strategy suppresses redundant visual cues and improves generalization in few-shot satellite segmentation scenarios.
Figure 5. A brief illustration of the proposed framework inference pipeline. During inference, the input image is encoded into a latent representation via the VAE encoder, where Gaussian noise (with a fixed timestep t) is added and processed by the noise prediction network to extract diffusion features. These features update the proxy queries to form a similarity matrix M, from which SVD derives dominant distributional directions for discriminative analysis. The resulting cost vectors compute distributional similarity with prototype vectors and proxy queries to generate class probabilities. Final segmentation is produced through matrix multiplication of upsampled diffusion features, updated proxy queries, and their class probabilities. Notably, the prototype vectors remain fixed throughout inference.
Figure 6. Qualitative comparison with previous methods. From left to right: target image, predictions from Segformer, Mask2Former, and our method, followed by the ground-truth label. Key components are highlighted in distinct colors—green for the satellite body, red for solar panels, and blue for antennas—to facilitate visual assessment of segmentation accuracy across methods.
Figure 7. T-SNE visualization of learned feature embeddings for Segformer, Mask2Former, and our method, with the body shown in green, solar panels in red, and antennas in blue.
Figure 8. Fine-grained reference-based segmentation of satellite components. Using different reference images (left) for Antenna1, Antenna2, and Antenna3, our method effectively leverages structural cues to accurately localize and segment the corresponding components in unseen target images (right). Best viewed in zoom for detail.
Figure 9. Impact of proxy query length and dimension on segmentation performance.
Figure 10. Impact of the selected number of principal components on one-shot segmentation performance.
Table 1. Details of the pixel-level spacecraft datasets used in this study, including dataset links, number of categories, and the corresponding class definitions.

| Dataset | Link | Number of Categories | Classes |
|---|---|---|---|
| SatelliteDataset [2] | https://github.com/Yurushia1998/SatelliteDataset (accessed on 18 September 2025) | 3 | Body; Solar panel; Antenna |
| Speed+ [46] | https://github.com/willer94/lava1302 (accessed on 18 September 2025) | 3 | Body; Solar panel; Antenna |
| UESD [4] | https://github.com/zhaoyunpeng57/BUAA-UESD33 (accessed on 18 September 2025) | 5 | Solar panel; Antenna; Instrument; Thruster; Optical Payload |
| SSP [47] | https://github.com/Dr-zfeng/SPSNet (accessed on 18 September 2025) | 4 | Body (Spacecraft); Solar panels; Radar; Thruster |
| MIAS [29] | https://github.com/xiang-ao-data/Spacefuse-shadow-removal/tree/master (accessed on 18 September 2025) | 3 | Body; Solar panel; Antenna |
Table 6. Fine-grained segmentation results on the Speed+ dataset under few-shot settings. Unlike previous methods that lack reference-based capabilities, our proposed framework supports reference-guided segmentation, enabling precise component-level predictions. In particular, it can accurately distinguish and segment the three individual antennas of Speed+ satellites.

| Training Samples | Antenna1 | Antenna2 | Antenna3 | Avg. mIoU |
|---|---|---|---|---|
| SatelliteDataset & Speed+ (One-shot) | 44.9 | 42.6 | 53.7 | 47.1 |
| SatelliteDataset & Speed+ (Three-shot) | 50.8 | 49.5 | 60.1 | 53.5 |
| SatelliteDataset & Speed+ (Ten-shot) | 57.2 | 55.7 | 68.5 | 60.5 |
Table 7. Ablation study on the primary components of the proposed framework under the one-shot segmentation setting.

| Row | L_d | L_f | L_c | Body | Solar Panel | Antenna | Avg. mIoU |
|---|---|---|---|---|---|---|---|
| 1 | | | | 65.7 | 64.6 | 5.6 | 45.3 |
| 2 | ✓ | | | 77.5 | 70.9 | 60.7 | 69.7 |
| 3 | ✓ | ✓ | | 81.1 | 72.2 | 64.7 | 72.7 |
| 4 | ✓ | | ✓ | 82.9 | 73.5 | 65.3 | 73.9 |
| 5 | ✓ | ✓ | ✓ | 86.5 | 77.5 | 69.8 | 77.9 |
Table 8. Quantitative comparison of different diffusion models under the one-shot segmentation setting.

| Diffusion Model | Stable Diffusion v1.4 | Stable Diffusion v1.5 | Stable Diffusion v2.1 |
|---|---|---|---|
| mIoU | 77.8 | 77.8 | 77.9 |
Table 9. Performance comparison of different diffusion layers under the one-shot segmentation setting.

| Layer-i of Diffusion Feature | Body | Solar Panel | Antenna | Avg. mIoU |
|---|---|---|---|---|
| 3 | 82.7 | 73.9 | 65.7 | 74.1 |
| 6 | 82.4 | 73.2 | 66.3 | 74.0 |
| 12 | 82.4 | 73.9 | 66.5 | 74.3 |
| 3, 6 | 83.3 | 74.9 | 67.6 | 75.3 |
| 6, 9 | 83.5 | 74.2 | 67.4 | 75.0 |
| 9, 12 | 84.1 | 74.0 | 67.7 | 75.3 |
| 3, 6, 9 | 85.7 | 75.8 | 68.1 | 76.5 |
| 6, 9, 12 | 85.9 | 76.1 | 68.8 | 76.9 |
| 3, 6, 9, 12 | 86.5 | 77.5 | 69.8 | 77.9 |
| All | 86.4 | 77.2 | 69.5 | 77.7 |
Table 10. Ablation study on the downsampling scale in the consistency learning strategy under the one-shot setting.

| Downsampling Scale | Body | Solar Panel | Antenna | Avg. mIoU |
|---|---|---|---|---|
| 2× | 85.2 | 75.7 | 67.5 | 76.1 |
| 4× | 86.5 | 77.5 | 69.8 | 77.9 |
| 8× | 84.3 | 74.9 | 66.7 | 75.3 |
Table 11. Ablation study on the loss weight ω under the one-shot segmentation setting.

| ω | 0.5 | 1 | 1.5 | 2 |
|---|---|---|---|---|
| mIoU | 76.8 | 77.9 | 77.1 | 76.2 |
Table 12. Ablation study on the loss weight η under the one-shot segmentation setting.

| η | 0.5 | 1 | 1.5 | 2 |
|---|---|---|---|---|
| mIoU | 76.3 | 77.9 | 76.9 | 76.0 |
Table 13. Ablation study on fine-tuning strategies within the consistency learning framework under the one-shot setting. “Full” refers to updating all vision-related parameters in the noise predictor (UNet) of the diffusion model, while “Trainable Parameters” indicates the total number of trainable parameters in the diffusion model under each fine-tuning strategy.

| Fine-Tune Method | Trainable Parameters | Body | Solar Panel | Antenna | Avg. mIoU |
|---|---|---|---|---|---|
| Full | 368M | 78.3 | 68.9 | 61.2 | 69.5 |
| Ours | 5.9M | 86.5 | 77.5 | 69.8 | 77.9 |
Table 14. Ablation study on the influence of diffusion timestep t under the one-shot setting on the Speed+ dataset.

| t | Body | Solar Panel | Antenna | Avg. mIoU |
|---|---|---|---|---|
| 0 | 85.0 | 76.4 | 68.7 | 76.7 |
| 25 | 85.2 | 76.5 | 69.1 | 76.9 |
| 50 | 85.5 | 77.0 | 69.5 | 77.3 |
| 75 | 86.1 | 77.3 | 69.7 | 77.7 |
| 100 | 86.5 | 77.5 | 69.8 | 77.9 |
| 150 | 85.9 | 77.2 | 69.4 | 77.5 |
| 200 | 85.8 | 77.0 | 68.3 | 77.0 |
| 300 | 85.1 | 76.4 | 67.8 | 76.4 |
| 500 | 84.0 | 74.9 | 66.9 | 75.3 |
Table 15. Comparison of computational efficiency in terms of training time, trainable parameters, and inference speed. All models are evaluated under the same hardware environment (NVIDIA GeForce RTX 4090 GPU, input resolution 640 × 640).

| Method | Training Time (h) | Trainable Parameters (M) | Inference Time (s) |
|---|---|---|---|
| DeepLabV3+ | 3.8 | 62 | 0.10 |
| Mask2Former | 4.2 | 89 | 0.13 |
| PI-CLIP | 6.1 | 63 | 0.26 |
| DeFSS | 8.6 | 97 | 0.35 |
| Ours | 4.3 | 21 | 0.12 |