Self-Training Based Image–Text Multimodal Unsupervised Domain Adaptation Segmentation Model for Remote Sensing Images

Liu, Qianqian; Wang, Xili

doi:10.3390/rs18040651


by Qianqian Liu and Xili Wang *

School of Artificial Intelligence and Computer Science, Shaanxi Normal University, Xi’an 710119, China

* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(4), 651; https://doi.org/10.3390/rs18040651

Submission received: 17 January 2026 / Revised: 10 February 2026 / Accepted: 17 February 2026 / Published: 20 February 2026

(This article belongs to the Special Issue Advances in Deep Learning and Machine Learning for Remote Sensing Image Analysis)


Highlights

What are the main findings?

SIT-UDA integrates learnable text category hints with image data for accurate semantic segmentation, and two strategies—entropy-guided pixel-level weighting (EGPW) and contrastive text constraint (CTC)—are proposed to improve pseudo-label utilization and strengthen domain-invariant feature learning with greater discriminability.
Experiments on six representative remote sensing domain adaptation tasks demonstrate that SIT-UDA achieves superior balanced performance and exhibits stronger robustness compared with existing methods.

What are the implications of the main findings?

SIT-UDA demonstrates that incorporating vision–language models into remote sensing tasks enhances generalization and domain-invariant representation learning.
SIT-UDA shows strong potential for real-world applications such as land-cover monitoring across urban and rural domains and disaster response across regions.

Abstract

Deep self-training-based unsupervised domain adaptation (UDA) semantic segmentation methods learn from labeled source domain images and unlabeled target domain images, performing more stably than those based on adversarial training. We propose a self-training-based image–text multimodal unsupervised domain adaptation semantic segmentation model (SIT-UDA) for remote sensing images. Unlike UDA methods, which rely solely on images, SIT-UDA enhances generalization performance by integrating category hint information from textual descriptions with image data to segment images. SIT-UDA employs a teacher–student self-training framework consisting of two components: the teacher multimodal segmentation model, which predicts pseudo-labels for target domain images, and the student multimodal segmentation model, which is trained to learn feature representations from both the source and target domains with guidance from the teacher model. To enhance the adaptability of image–text pretrained models in remote sensing domains, SIT-UDA introduces text prompt tuning to optimize the text features in the student model, and two learning strategies are proposed to optimize the model’s training objectives: One is the entropy-guided pixel-level weighting (EGPW) strategy, which adaptively weights the loss obtained by self-training on target domain images, leveraging the pseudo-labels rationally according to the entropy value at the pixel level. The other is the contrastive text constraint (CTC) strategy, which maximizes the similarity of text features for the same category between teacher and student models while minimizing the similarity of text features across different categories, improving text feature discriminability to promote cross-domain image–text alignment. 
Experiments in various domain adaptation scenarios among three remote sensing datasets (Potsdam, Vaihingen and LoveDA) demonstrate that SIT-UDA outperforms the compared domain adaptation semantic segmentation methods in both qualitative and quantitative segmentation results.

Keywords:

remote sensing images; semantic segmentation; unsupervised domain adaptation; image–text multimodal; self-training

1. Introduction

Semantic segmentation of remote sensing images (RSIs) assigns semantic class labels to each pixel, enabling automatic interpretation and understanding of remote sensing scenes. This technique is crucial in urban planning [1], environmental monitoring [2], and agricultural management [3]. In recent years, deep learning methods for image semantic segmentation, such as U-Net [4], DeepLab [5] and Segformer [6], have significantly improved segmentation accuracy. However, these methods rely heavily on large amounts of labeled data, and manual annotation of RSIs is labor-intensive and time-consuming, limiting the practical application of supervised methods.

Unsupervised domain adaptation (UDA) leverages the labeled source domain and unlabeled target domain data to jointly train a semantic segmentation model and learn the target domain features, further maintaining the segmentation performance of the model in cross-domain scenarios. Early UDA methods focus on reducing feature distribution discrepancies between source and target domains by minimizing well-defined distances on feature distributions, such as Maximum Mean Discrepancy [7]. With ongoing advancements in deep learning, many UDA approaches train networks by adversarial learning, in which the generator and discriminator in a generative adversarial network [8] are trained to align the feature distribution between source and target domains. Despite achieving good results, adversarial learning usually leads to oscillation and instability during training. Self-training-based UDA methods [9,10,11] enhance the model’s generalization ability by learning from a training set extended with target domain images with pseudo-labels and perform more stably compared to adversarial training.

Most self-training-based UDA methods pre-compute pseudo-labels from the predictions of a single model, add the high-confidence pseudo-labels to the training set, retrain the model on the expanded dataset, and repeat the process iteratively. Alternatively, pseudo-labels can be computed by a second model during training, which yields two segmentation models with the same structure but different parameters: the teacher model and the student model [12]. The student model is trained with two objectives: (1) a supervised loss on labeled source domain images and (2) a consistency loss applied to augmented versions of unlabeled target domain images. Meanwhile, the teacher model generates pseudo-labels from the original (non-augmented) target images to guide the student’s learning. Recently, many UDA semantic segmentation methods [13,14] augment target domain images by mixing source and target pixels and apply a mixed segmentation loss on these images as the consistency loss. This image mixing narrows the gap between the source and target domains by generating new images while also improving the model’s predictions on unlabeled pixels, ultimately yielding a robust student segmentation model for inference. Since pseudo-labels are not ground truth, a loss weight is adopted to control the impact of the mixed loss on model updates. Existing methods assign the same weight to all target pixels in the mixed image [15], ignoring that pixels with different confidence levels contribute differently to the model. To better exploit high-confidence pixels in the mixed image and mitigate negative transfer from low-confidence pixels, we propose an entropy-guided pixel-level weighting (EGPW) strategy that adaptively adjusts the loss weights of pseudo-labeled pixels in the mixed segmentation loss, thus leveraging pseudo-labels more rationally.

Recent advances in vision–language pre-training have shown that language descriptions of images can provide additional semantic information for computer vision tasks [16,17]. Large-scale vision–language pretrained segmentation models such as CLIPSeg [18] and SAM [19] demonstrate that combining text with images in a joint network enhances feature representation and improves segmentation performance. In UDA semantic segmentation, both adversarial learning and self-training image-based UDA methods ultimately aim to learn domain-invariant feature representations, improving cross-domain segmentation performance. Vision–language segmentation models therefore have the potential to substantially improve domain adaptation. Compared with relying on image features alone, the effective combination of domain-invariant text features and image features helps the model construct a more robust category semantic space for domain adaptation, especially where complex image domain discrepancies exist [20]. To apply vision–language pretrained models to downstream tasks, text prompt tuning freezes the parameters of the text encoder and adds different prompts to the input text to generate diverse text features. While preserving the knowledge acquired during pre-training, it efficiently adapts general vision–language pretrained models to specific downstream tasks. Given the limited computing resources and the scarcity of labeled RSIs, we propose a self-training-based image–text multimodal UDA semantic segmentation model (SIT-UDA) for practical remote sensing applications. This model extracts effective and discriminative image semantic features with the help of text to obtain high-quality pseudo-labels for target domain images. Furthermore, vision–language pretrained models are integrated into the UDA semantic segmentation of RSIs, which significantly reduces the dependency on annotated data and enhances segmentation accuracy. This is particularly significant for remote sensing applications with scarce labeled data.

In the proposed SIT-UDA, we implement the image–text multimodal segmentation network in the student and teacher models. To enhance the adaptability of the model, SIT-UDA introduces learnable prompt vectors and category words as inputs to the frozen text encoder of the student model. By iteratively updating the parameters in the learnable prompts through back-propagation, the student model refines text features that contain the contextual information of remote sensing images. The parameters of the teacher model are updated from those of the student model via Exponential Moving Average (EMA) [12] and are therefore not directly optimized through gradient back-propagation. However, the randomly initialized parameters of learnable prompt vectors can alter the category representation of text features, thereby degrading the quality of pseudo-labels predicted by the teacher model. To address this, the teacher model utilizes the fixed text prompt from the vision–language pretrained model to extract domain-agnostic text features (i.e., not tied to remote sensing domain images). To enhance the domain-invariant representation learning, SIT-UDA proposes a contrastive text constraint (CTC) strategy that leverages the text features of the teacher model to constrain the learnable text features in the student model while also improving the discriminability among different categories of text features. The main contributions of this study can be summarized as follows:

A multimodal UDA semantic segmentation model: The SIT-UDA model aligns and fuses learnable class text features with image features for segmentation and provides more credible pseudo-labels through the multimodal network, improving the model’s generalization performance.
EGPW strategy: This strategy adaptively adjusts the loss weights of unlabeled pixels in the mixed images based on the entropy value of the prediction probability map, learning the high-confidence pseudo-labels and reducing the interference from low-confidence pseudo-labels.
CTC strategy: This strategy encourages intra-class text features in the teacher and student models to become closer while driving inter-class text features further apart. The resulting optimized text features can effectively adapt to remote sensing domains while preserving domain-invariant and discriminative semantic representations.

The remainder of this paper is organized as follows. Section 2 introduces the related work. Section 3 describes the proposed method in detail. Section 4 presents the experiment settings, comparative and ablation experimental results, and further analyses of the results. Section 5 provides the conclusion.

2. Related Work

2.1. Image-Based UDA Semantic Segmentation

In image-based UDA semantic segmentation, domain-alignment-based methods mitigate the domain discrepancy between the source and target domains by aligning their feature distributions (at the image level, feature level, and output level) [21,22]. However, these methods focus more on global distribution alignment and may overlook local context. Self-training-based UDA methods enhance the model’s generalization by learning target domain image features in two ways: pseudo-labeling and consistency regularization. Pseudo-labeling methods focus on improving the quality of pseudo-labels and filtering the negative effects of unreliable pseudo-labels by pseudo-label denoising [11], yet some excluded pseudo-labels often contain important class boundary information and small objects, hampering learning and performance. Consistency regularization [23] enforces prediction consistency on the unlabeled target domain image and its augmented version to learn the invariant features in the image. To this end, some methods compute the category prototype from the two predictions on the target domain and adopt the distance between category prototypes as consistency loss [11,23]. Some methods regard all pixel predictions on original target domain images as pseudo-labels, leveraging the cross-entropy loss between the pseudo-labels and the predictions on augmented target domain images as consistency loss [12,15]. The final optimization objective is formulated as a weighted combination of the supervised loss on the source domain and the consistency loss on the target domain, where the trade-off parameter dynamically balances unsupervised domain adaptation and supervised learning.

To balance the contribution of consistency loss in the total loss, Tarvainen et al. [12] adopted a sigmoid ramp-up to gradually increase the loss weight from 0 to 1, but it weighted all images to the same value without considering the discrepancies among images. French et al. [24] observed that the proportion of pixels with confidences above a threshold could roughly evaluate the model’s prediction quality on different images and adopted this proportion as the loss weight [14,15,25]. Nevertheless, assigning the same weight to all pixels in one image does not account for different confidence levels. With reference to self-training-based UDA semantic segmentation approaches [15,25], SIT-UDA employs mixed images, generated by combining the source and target domains, as augmented target domain images. The mixed segmentation loss computed on these mixed images is adopted as the consistency loss. To better utilize pixel information with varying confidence levels, we propose a pixel-level adaptive weight to control the contribution of mixed segmentation loss. This approach aims to leverage high-confidence pixels effectively and mitigate the negative impact of low-confidence pixels during training.

2.2. Image–Text Multimodal UDA Semantic Segmentation

With the development of vision–language pre-training models such as CLIP [16], image–text joint training has been widely used in computer vision tasks and has made significant progress. CLIP embeds images and their text descriptions into a shared feature space and aligns them by contrastive learning, providing richer semantic information for classification. Subsequent works, such as DenseCLIP [26] and LSeg [27], aligned pixel and text features, further extending CLIP to segmentation tasks. In UDA semantic segmentation, domain-invariant category text can align with both source and target domain pixel features, guiding the segmentation of unlabeled target domain images and enhancing generalization ability [20,28,29]. Kim et al. [20] aligned the text features with the source and target domain pixel features and then appended the alignment results to the segmentation decoder. However, their model has a large scale, and there are few image–text UDA semantic segmentation works for RSIs. In the remote sensing domain, as shown in Figure 1, though a significant domain discrepancy exists in different domains, the image semantic features of the same class from both the source and target domains should be aligned with the corresponding domain-invariant text semantic features. Therefore, we propose a self-training-based multimodal UDA semantic segmentation model for RSIs to learn domain-invariant category semantic features. To take full advantage of the correlations between visual-language representations within a shared feature space, we extract image and text embedding from the pretrained CLIP and correlate the class text in RSIs onto these images with a more lightweight network.

2.3. UDA Semantic Segmentation for Remote Sensing Images

In image-based UDA semantic segmentation, self-training-based methods have shown impressive performance, but they are primarily developed for natural images. The large intra-class variance and small inter-class variance in RSIs scatter the features of some categories. Thus, UDA semantic segmentation methods for RSIs focus more on latent category semantic information across domains. Among adversarial learning-based UDA methods, Chen et al. [30] propose a class-level discriminator to differentiate features of the same class between source and target domains. Wang et al. [23] combine adversarial learning and self-training in stages, aligning category features by minimizing the distance between category prototypes of target domain images and their augmented views. These category-level UDA methods aim to reduce data distribution gaps both between domains and within each class, advancing domain adaptation for RSIs [31,32,33]. Given that image–text domain adaptation methods for remote sensing semantic segmentation are currently scarce, we introduce image–text multimodal information into UDA semantic segmentation for RSIs, aggregating scattered image information through the alignment between category text features and image features and thereby improving segmentation performance. To enhance the discriminative category representation of text features, we propose to minimize the contrastive text loss between the text features of the student and teacher models. The domain-agnostic text features extracted from the teacher model are employed to constrain the optimized text features in the student model, enabling them to adapt to remote sensing domains while maintaining domain-invariant semantic representations.

3. Self-Training-Based Image–Text Multimodal Unsupervised Domain Adaptation Semantic Segmentation Model

As shown in Figure 2, the self-training-based image–text multimodal unsupervised domain adaptation semantic segmentation model (SIT-UDA) comprises two parts: the student model and the teacher model. Both models share the same multimodal segmentation framework. To better adapt to the remote sensing domain-adaptive semantic segmentation task, SIT-UDA introduces learnable prompt vectors as text prompts. By training the parameters of these prompt vectors, the text features extracted in the student model are optimized. The teacher model adopts the text prompts and frozen text encoder parameters from the pretrained model, ensuring semantic consistency with the pretrained representation. In addition, we propose two strategies in the student model, the entropy-guided pixel-level weighting (EGPW) strategy and the contrastive text constraint (CTC) strategy, to effectively utilize pseudo-labels and enhance the domain-invariant representation learning of the model.

3.1. The Multimodal Segmentation Network

The image-text multimodal semantic segmentation network aligns and fuses image and text features, enriching category representations and thereby improving segmentation performance. In the UDA semantic segmentation task, alignment and fusion between images from different domains and the same text features reduce discrepancies among domain-specific image features, enhance the learning of domain-invariant representations, and ultimately improve domain adaptive segmentation performance. To further enhance the adaptability of the vision–language pretrained model in remote sensing domain adaptive semantic segmentation, SIT-UDA adopts a text prompt tuning strategy. A set of learnable prompt vectors is concatenated with category words to optimize the extracted text features. These learnable prompts are randomly initialized with parameters drawn from a standard normal distribution (mean = 0, variance = 1) and are updated through model training. As shown in Figure 3, the multimodal segmentation network comprises three parts: the encoder, pixel-text feature alignment and fusion, and decoder.

In the encoder part, image features and text features are extracted into a shared feature space by the CLIP-pretrained image and text encoders. The image encoder employs the ResNet50 network to extract feature maps at four different scales, denoted as \{x_{l} \in R^{H_{l} \times W_{l} \times D_{l}}\}_{l = 1}^{4}, where x represents the feature map; l is the layer index; and H, W, and D denote the height, width, and number of channels, respectively. For text, fixed class words are obtained from the source domain category set, with each word mapped to a fixed-dimensional class embedding (e.g., 512). Beyond the class words, learnable prompts are incorporated [34], which can provide complementary information and significantly influence performance. In this setting, a certain number of learnable vectors are introduced as contextual tokens for each class word. Each vector shares the same dimensionality as the class embeddings. These learnable vectors are concatenated with the class embeddings corresponding to the class words, and the combined embeddings are then fed into the frozen ViT-B/16 text encoder to extract text features. Compared to the fixed text prompt, the parameters of the learnable prompts are continually updated during training so that the extracted text features carry more category-related contextual information from the images, facilitating the alignment of text and image features. The obtained text features are denoted as q \in R^{C \times D_{4}}, where C represents the number of text categories, which is equal to the number of land object categories in RSIs.
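The prompt construction described above can be illustrated with a minimal NumPy sketch. It only builds the input to the text encoder and does not run CLIP. The function name `build_prompt_embeddings`, the choice of four context vectors, the per-class (rather than shared) context, and placing the context tokens before the class word are all assumptions made for illustration; the text only states that learnable vectors are concatenated with the class embeddings and initialized from a standard normal distribution.

```python
import numpy as np


def build_prompt_embeddings(class_embeddings, n_ctx, rng):
    """Concatenate n_ctx learnable context vectors with each class-word embedding.

    class_embeddings: (C, D) array, one fixed embedding per class word.
    Returns (context, prompts): context is (C, n_ctx, D) and would be the only
    trainable part; prompts is (C, n_ctx + 1, D), the text-encoder input.
    """
    num_classes, dim = class_embeddings.shape
    # Learnable context tokens, initialized from a standard normal distribution.
    context = rng.standard_normal((num_classes, n_ctx, dim))
    # Assumed layout: [ctx_1, ..., ctx_n, class_word] for every class.
    prompts = np.concatenate([context, class_embeddings[:, None, :]], axis=1)
    return context, prompts


rng = np.random.default_rng(0)
class_embeddings = rng.standard_normal((6, 512))  # stand-in for 6 class-word embeddings
context, prompts = build_prompt_embeddings(class_embeddings, n_ctx=4, rng=rng)
print(prompts.shape)  # (num_classes, n_ctx + 1, embedding_dim)
```

In an actual training loop only `context` would receive gradients, while the class-word embeddings and the text encoder stay frozen.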

To achieve collaborative interaction between image and text, we align and fuse deep image features and text features. Specifically, the aligned score map is obtained by calculating the similarity between pixel features and text features. The score map and deep image feature map are then concatenated for feature fusion. The similarity calculation is implemented using the dot product:

s = {x_{4}}^{'} {q^{'}}^{T} .

(1)

where {x_{4}}^{'} and q^{'} represent the ℓ_{2}-normalized versions of x_{4} and q, respectively. The score map s \in R^{H_{4} \times W_{4} \times C} indicates the similarity between each pixel feature and the text features, with higher scores suggesting a higher likelihood of the pixel belonging to that class. We adopt this multimodal interaction for three main reasons. First, deep image features contain more semantic information, which can effectively align with text semantic features. Second, the aligned score map serves as a segmentation result of high-level semantic image features. By minimizing the cross-entropy loss between the score map and the down-sampled label,

\mathcal{L}_{seg}^{aux} (I) = \mathrm{CrossEntropy} (s, L^{H_{4} \times W_{4}}),

(2)

where L represents the one-hot encoding of the ground truth label of the image I, the model is encouraged to extract matching text and image features across the two modalities. Third, by fusing the multimodal alignment score map with image features, the model achieves a balanced integration of text and image information, reducing the risk of over-reliance on text for classification.
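As a rough illustration of Equations (1) and (2) and of the concatenation-based fusion, the following NumPy sketch computes the score map, an auxiliary cross-entropy, and the fused feature map on toy data. The function names are hypothetical. Applying a plain softmax to the score map without a temperature, averaging the loss over pixels, and using integer labels instead of one-hot labels are assumptions, since the text does not specify these details.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Normalize the last axis to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def score_map(x4, q):
    """Eq. (1): x4 is (H, W, D), q is (C, D); returns cosine scores (H, W, C)."""
    return l2_normalize(x4) @ l2_normalize(q).T


def aux_cross_entropy(s, labels):
    """Eq. (2) sketch: mean cross-entropy of softmax(s) against integer labels (H, W)."""
    logits = s - s.max(axis=-1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    h, w = labels.shape
    picked = log_prob[np.arange(h)[:, None], np.arange(w)[None, :], labels]
    return -picked.mean()


def fuse(x4, s):
    """Concatenate the score map to the deep image features along channels."""
    return np.concatenate([x4, s], axis=-1)


# Toy data: pixel features are noisy copies of the text feature of their class.
rng = np.random.default_rng(0)
q = rng.standard_normal((6, 32))
labels = rng.integers(0, 6, size=(4, 4))
x4 = q[labels] + 0.1 * rng.standard_normal((4, 4, 32))

s = score_map(x4, q)
loss_aux = aux_cross_entropy(s, labels)
fused = fuse(x4, s)
print(s.shape, fused.shape)
```

Because both inputs are normalized, every entry of the score map is a cosine similarity in [-1, 1], and the fused map has D_4 + C channels.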

In the decoder part, the fused multimodal features and multi-scale image features are input into the Semantic FPN [35] decoder to achieve segmentation. Compared to single-modal image segmentation networks, the multimodal image–text segmentation network enhances deep semantic image features by multimodal feature alignment and fusion, improving segmentation results. In UDA tasks, the text features are aligned and fused with source and target domain images, and the multimodal segmentation network improves the prediction accuracy of pseudo-labels, further learning effective features from target domain images.

3.2. Entropy-Guided Pixel-Level Weighting Strategy

The student model in SIT-UDA relies on consistency regularization between target images and their augmented views to learn robust image features. The teacher model generates pseudo-labels {\hat{L}}_{T} for the target domain images I_{T}. To obtain augmented target domain images through image mixing, a mask M of the same size as the source and target images is initialized with all values set to 1. Then, we randomly select half of the categories from the source image and keep the corresponding pixel positions in the mask as 1 while setting the remaining pixel positions to 0. After obtaining the mask M, the source and target images, together with the source ground truth labels and target pseudo-labels, are mixed as follows:

I_{M} \leftarrow M \odot I_{S} + (1 - M) \odot I_{T}

(3)

L_{M} \leftarrow M \odot L_{S} + (1 - M) \odot {\hat{L}}_{T}

(4)

where I_{M} denotes the mixed images and L_{M} denotes the mixed labels. The mixed segmentation loss enforces consistency between the student model’s predictions on the mixed images and the mixed labels. Because pseudo-labels are not ground truth, the loss terms computed on pseudo-labeled pixels are weighted to control their effect, as follows:

\mathcal{L}_{M} = - \sum_{i = 1}^{H} \sum_{j = 1}^{W} \sum_{c = 1}^{C} \left\{ \underset{(i, j) \in I_{S}}{L_{M}^{(i, j, c)} \log g_{θ} (I_{M})^{(i, j, c)}} + β \underset{(i, j) \in I_{T}}{L_{M}^{(i, j, c)} \log g_{θ} (I_{M})^{(i, j, c)}} \right\}

(5)

where (i, j) \in I_{S} denotes that the pixel at position (i, j) in the mixed image comes from the source domain, g_{θ} (\cdot) represents the prediction probability of the student model with parameters θ, and β represents the weight value.
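The mixing step of Equations (3) and (4) can be sketched in NumPy as follows. The helper names `class_mix_mask` and `mix` are hypothetical, and rounding "half of the categories" down (with at least one class kept) is an assumption, since the text does not say how an odd number of classes is handled.

```python
import numpy as np


def class_mix_mask(source_label, rng):
    """Binary mask (H, W): 1 where the source pixel belongs to a selected class."""
    classes = np.unique(source_label)
    # Assumption: floor of half the classes present, but never fewer than one.
    n_select = max(1, len(classes) // 2)
    selected = rng.choice(classes, size=n_select, replace=False)
    return np.isin(source_label, selected).astype(np.int64)


def mix(mask, source, target):
    """Eqs. (3)-(4): M * source + (1 - M) * target, broadcast over channels if any."""
    m = mask if source.ndim == mask.ndim else mask[..., None]
    return m * source + (1 - m) * target


# Toy example with random images, source labels, and target pseudo-labels.
rng = np.random.default_rng(0)
source_label = rng.integers(0, 6, size=(8, 8))
target_pseudo = rng.integers(0, 6, size=(8, 8))
source_img = rng.random((8, 8, 3))
target_img = rng.random((8, 8, 3))

mask = class_mix_mask(source_label, rng)
mixed_img = mix(mask, source_img, target_img)          # Eq. (3)
mixed_label = mix(mask, source_label, target_pseudo)   # Eq. (4)
```

The same mask is applied to images and labels, so every pixel of the mixed image keeps the label of the domain it was copied from.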

During initial training, the pseudo-labels produced by the teacher model are usually inaccurate, so the loss weight should be small. As the student model improves, the teacher model, which is updated from the student by exponential moving average, also predicts more accurately, and the loss weight can gradually increase. Many methods therefore use the proportion of pixels whose maximum predicted probability exceeds a confidence threshold as the weighting factor: as model performance improves, the number of high-confidence pixels increases and the weight grows accordingly. Under this scheme, however, the losses of all pseudo-labeled pixels receive the same weight, even though different pixels in the mixed image have different confidence. The pixel loss therefore needs to be adjusted by position and by training stage to make training as effective as possible.
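For reference, the exponential moving average update of the teacher mentioned above can be written as a short sketch. Parameters are stored in plain dictionaries of NumPy arrays here, and the decay value `alpha` is a hypothetical choice, since this section does not state the value used. Which parameters take part in the update is also an assumption: in SIT-UDA the teacher keeps the fixed text prompt, so the learnable prompt vectors would not be copied this way.

```python
import numpy as np


def ema_update(teacher, student, alpha=0.99):
    """Update the teacher dict: teacher <- alpha * teacher + (1 - alpha) * student."""
    for name, value in student.items():
        teacher[name] = alpha * teacher[name] + (1.0 - alpha) * value
    return teacher


# Toy example: one parameter tensor, two update steps with alpha = 0.9.
teacher = {"w": np.zeros(3)}
student = {"w": np.ones(3)}
ema_update(teacher, student, alpha=0.9)
first_step = teacher["w"].copy()
ema_update(teacher, student, alpha=0.9)
second_step = teacher["w"].copy()
```

With the teacher starting at 0 and the student fixed at 1, the teacher moves to 0.1 after one step and 0.19 after two, which shows how slowly it tracks the student when `alpha` is close to 1.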

Therefore, SIT-UDA proposes an entropy-guided pixel-level weighting strategy to adaptively increase the weights of high-confidence pixels and reduce the weights of low-confidence pixels. Information entropy [21], as a measure of information quantity and uncertainty, has been widely used in semantic segmentation tasks [36]. The information entropy map correlates strongly with the prediction error, with lower entropy indicating smaller prediction errors [37,38]. Therefore, we calculate the Shannon entropy to estimate the model’s prediction error for different pixels in target domain images. Specifically, for the prediction probability maps of the target domain images provided by the teacher model, g_{ϕ} (I_{T}) \in R^{H \times W \times C}, the entropy matrix E \in {[0, 1]}^{H \times W} is calculated pixel by pixel as follows:

E^{(i, j)} = - \sum_{c = 1}^{C} g_{ϕ} (I_{T})^{(i, j, c)} \log g_{ϕ} (I_{T})^{(i, j, c)}

(6)

where g_{ϕ} (\cdot) represents the prediction results of the teacher model with parameters ϕ. Under-confident predictions have a relatively flat probability distribution, resulting in high entropy. Conversely, more confident predictions lead to a sharper probability distribution and lower entropy. Therefore, the weight matrix corresponding to the entropy matrix is defined as follows:

W = 1 - E

(7)

where W represents the weight matrix corresponding to the entropy map of the prediction results. The entropy values of pixels at boundaries and in difficult-to-segment classes are generally large, corresponding to small weights, yet small weights can hinder the model’s learning and prevent it from effectively capturing information from such pixels. Therefore, we compare each pixel’s weight value with the proportion of high-confidence pixels and choose the larger value as the loss weight for that pixel. This allows the model to learn from difficult-to-segment pixels while leveraging the influence of highly confident pixels, ultimately enhancing the model’s learning ability for target domain images. The final weight is defined as follows:

W^{(i, j)} = \begin{cases} W^{(i, j)}, & \text{if } W^{(i, j)} > β \\ β, & \text{otherwise} \end{cases}

(8)

β = \frac{\sum_{i = 1}^{H} \sum_{j = 1}^{W} \left[ \max_{c \in [1, C]} g_{ϕ} (I_{T})^{(i, j, c)} > τ \right]}{H \cdot W}

(9)

where τ denotes the confidence threshold. A lower threshold increases the proportion of high-confidence pixels, thereby enlarging the corresponding loss weight, while a higher threshold reduces this proportion and consequently decreases the loss weight.
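A minimal NumPy sketch of the weighting in Equations (6) to (9) is given below, under stated assumptions. The function name `egpw_weights` and the threshold value 0.9 are hypothetical. Dividing the entropy by log C is an assumption made so that E lies in [0, 1] as stated in the text; the section does not say how this range is obtained.

```python
import numpy as np


def egpw_weights(teacher_prob, tau=0.9, eps=1e-12):
    """Per-pixel loss weights from teacher probabilities of shape (H, W, C).

    Returns (weights, beta): weights is (H, W), beta is the fraction of
    pixels whose maximum class probability exceeds tau.
    """
    num_classes = teacher_prob.shape[-1]
    # Eq. (6): Shannon entropy per pixel.
    entropy = -(teacher_prob * np.log(teacher_prob + eps)).sum(axis=-1)
    # Assumption: normalize by log(C) so that the entropy lies in [0, 1].
    entropy = entropy / np.log(num_classes)
    w = 1.0 - entropy                                  # Eq. (7)
    beta = (teacher_prob.max(axis=-1) > tau).mean()    # Eq. (9)
    return np.maximum(w, beta), beta                   # Eq. (8)


# Toy 1 x 2 "image": one confident pixel and one maximally uncertain pixel.
teacher_prob = np.array([[[0.98, 0.01, 0.01],
                          [1 / 3, 1 / 3, 1 / 3]]])
weights, beta = egpw_weights(teacher_prob, tau=0.9)
```

In this toy case half of the pixels are confident, so β is 0.5. The confident pixel keeps its own entropy-based weight, which is larger than β, while the uniform pixel, whose entropy-based weight is close to 0, is lifted to β by the maximum in Equation (8).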

As the model’s performance improves, the predictions become more accurate, resulting in a higher proportion of high-confidence pixels and larger weights. Consequently, the mixed segmentation loss with the entropy-guided pixel-level weighting strategy on the mixed images is obtained as

\mathcal{L}_{M} = - \sum_{i = 1}^{H} \sum_{j = 1}^{W} \sum_{c = 1}^{C} \left\{ \underset{(i, j) \in I_{S}}{L_{M}^{(i, j, c)} \log g_{θ} (I_{M})^{(i, j, c)}} + W^{(i, j)} \underset{(i, j) \in I_{T}}{L_{M}^{(i, j, c)} \log g_{θ} (I_{M})^{(i, j, c)}} \right\}

(10)

where W^{(i, j)} denotes the loss weight value corresponding to the pixel at position (i, j) in the mixed image. The EGPW strategy effectively leverages pixel-level pseudo-labels with varying confidence levels to learn more informative representations from the target domain, further improving the model’s generalization capability on the target domain.
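The pixel-weighted mixed segmentation loss of Equation (10) can be sketched as follows. The function name `weighted_mixed_loss` is hypothetical, and labels are given as integer class indices rather than one-hot vectors for brevity. The sketch sums over pixels exactly as the equation is written; averaging over pixels instead, as many implementations do, would be an alternative that the text does not rule out.

```python
import numpy as np


def weighted_mixed_loss(student_prob, mixed_label, mask, weights, eps=1e-12):
    """Eq. (10) sketch.

    student_prob: (H, W, C) student probabilities on the mixed image.
    mixed_label:  (H, W) integer mixed labels.
    mask:         (H, W), 1 for pixels taken from the source image.
    weights:      (H, W) per-pixel weights for pseudo-labeled target pixels.
    """
    h, w = mixed_label.shape
    picked = student_prob[np.arange(h)[:, None], np.arange(w)[None, :], mixed_label]
    nll = -np.log(picked + eps)
    # Source pixels carry ground truth and get weight 1; target pixels get W^(i, j).
    pixel_weight = np.where(mask == 1, 1.0, weights)
    return (pixel_weight * nll).sum()


# Toy 1 x 2 example: first pixel from the source, second from the target.
student_prob = np.array([[[0.5, 0.5], [0.25, 0.75]]])
mixed_label = np.array([[0, 1]])
mask = np.array([[1, 0]])
weights = np.array([[0.3, 0.4]])
loss = weighted_mixed_loss(student_prob, mixed_label, mask, weights)
```

Here the weight 0.3 at the source pixel is ignored by design, so the loss equals -log(0.5) - 0.4 log(0.75).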

3.3. Contrastive Text Constraint Strategy

In the student model’s text feature representation, learnable prompt vectors capture more category information related to remote sensing images to optimize the text features. To enhance the domain-invariant representation of these features, this paper proposes a contrastive text constraint strategy. Specifically, the text features from the teacher model (i.e., domain-agnostic text features derived from the vision–language pretrained model) are used to constrain the optimization process of text features while improving the discriminability among categories and achieving more stable cross-domain vision–language alignment and fusion. The optimized text features can thus adapt to remote sensing domain-adaptive tasks while preserving domain-invariant semantic information. InfoNCE [39] is a contrastive learning objective that maximizes the mutual information between different views of the same data, encouraging the model to learn better representations by distinguishing similar samples from dissimilar ones. Therefore, we calculate the InfoNCE loss between text features in the teacher and student models to achieve this text constraint.

Specifically, text features in the student and teacher models are denoted as $q_{S} \in \mathbb{R}^{C \times D}$ and $q_{T} \in \mathbb{R}^{C \times D}$, respectively. For a given text feature $q_{S}^{c}$, the positive sample is the corresponding same class text feature from the teacher network, denoted as $q_{T}^{c}$, and the negative samples are other classes’ text features in the teacher network, denoted as $q_{T}^{k}, c \neq k$. The contrastive text loss is defined as follows:

L_{C} = \sum_{c = 1}^{C} \left[ - \log \frac{\exp [\mathrm{Sim} (q_{S}^{c}, q_{T}^{c}) / σ]}{\sum_{k = 1}^{C} \exp [\mathrm{Sim} (q_{S}^{c}, q_{T}^{k}) / σ]} \right]

(11)

where $\mathrm{Sim}$ denotes cosine similarity, and $σ$ is the temperature coefficient that controls the smoothness of the similarity distribution. A smaller $σ$ value leads to a sharper similarity distribution, while a larger value results in a smoother distribution. Referring to [40], we set $σ = 0.1$ to pull the text features of the same class closer together.

During the training process, applying the contrastive loss between text features reduces intra-class discrepancies and enhances inter-class distinctiveness. Furthermore, aligning source domain images and mixed images with the constrained text features promotes the learning of category-level invariant features for RSIs.
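The contrastive text loss in Equation (11) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: text features are given as lists of floats, one per class, and the helper names `cosine` and `contrastive_text_loss` are hypothetical. A real implementation would operate on batched tensors in a deep learning framework.

```python
import math


def cosine(u, v):
    """Cosine similarity Sim(u, v) between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_text_loss(q_student, q_teacher, sigma=0.1):
    """InfoNCE-style loss of Eq. (11), summed over the C classes.

    q_student[c] and q_teacher[c] are the D-dimensional text features of class c.
    """
    loss = 0.0
    for c, q_s in enumerate(q_student):
        logits = [cosine(q_s, q_t) / sigma for q_t in q_teacher]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        # -log(exp(positive) / sum_k exp(logit_k)) = log_denominator - positive
        loss += log_denominator - logits[c]
    return loss


# Toy features for two classes (values are made up).
teacher = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_text_loss([[1.0, 0.0], [0.0, 1.0]], teacher)
swapped = contrastive_text_loss([[0.0, 1.0], [1.0, 0.0]], teacher)
```

When each student feature matches its own teacher class, the loss is close to zero. When the student features are swapped between classes, the loss is large, which is the behavior the constraint relies on.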

3.4. Training and Inference

The SIT-UDA model trains the student model through supervised learning on source images and unsupervised consistency regularization on target images. The cross-entropy loss between the student model’s predictions on source images and their ground truth labels is defined as the supervised loss:

L_{S} = - \sum_{i = 1}^{H} \sum_{j = 1}^{W} \sum_{c = 1}^{C} L_{S}^{(i, j, c)} \log g_{θ} {(I_{S})}^{(i, j, c)} + L_{seg}^{aux} (I_{S})

(12)

where $L_{S}$ represents the ground truth label of source domain images $I_{S}$. The consistency regularization between the student model’s predictions on augmented target images and their pseudo-labels from the teacher model is implemented through the pixel-level weighted mixed segmentation loss $L_{M}$. Due to the strong correlation between text features and the prediction results of the image–text multimodal segmentation model, the contrastive text loss $L_{C}$ facilitates the learning of domain-invariant representations in student models. The complete loss function is given by

L = L_{S} + L_{M} + λ_{1} L_{C} .

(13)

The teacher model parameters are initialized from the student model and subsequently updated by applying the exponential moving average (EMA) of the student model parameters, which is expressed as follows:

ϕ_{t + 1} \leftarrow α ϕ_{t} + (1 - α) θ_{t}

(14)

where $t$ denotes the iteration index, and $ϕ$ and $θ$ denote the parameters of the teacher and student networks, respectively. The hyperparameter $α$ controls how fast the teacher network is updated, and we set $α = 0.999$ as in [15] to keep the teacher network different from the student network. As training progresses, the teacher parameters drift away from the student parameters while still being driven by them, and the teacher model’s predictions become more accurate, yielding higher-quality pseudo-labels. The training process of the proposed model is presented in Algorithm 1.
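The EMA update in Equation (14) can be illustrated with a small sketch. It assumes, purely for illustration, that model parameters are stored as a dictionary mapping parameter names to lists of floats. The function name `ema_update` and the toy values are hypothetical.

```python
def ema_update(teacher, student, alpha=0.999):
    """Apply phi_{t+1} = alpha * phi_t + (1 - alpha) * theta_t to every parameter."""
    return {
        name: [alpha * p_t + (1.0 - alpha) * p_s
               for p_t, p_s in zip(teacher[name], student[name])]
        for name in teacher
    }


# The teacher starts as a copy of the student, as in step 1 of Algorithm 1.
student = {"w": [1.0, -2.0]}
teacher = {"w": list(student["w"])}

# Pretend one optimizer step changed the student parameters.
student["w"] = [2.0, 0.0]
teacher = ema_update(teacher, student)
```

With α = 0.999, a single update moves the teacher only 0.1% of the way toward the student, so the teacher changes much more smoothly than the student does.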

In the inference stage, the class text words concatenated with the trained learnable prompt parameters, together with the test images, are fed into the trained student multimodal segmentation model. The prediction results are obtained after feature encoding, multimodal feature alignment, and feature decoding.

Algorithm 1 The training process of the proposed model

Input: Labeled source data $X_{S} = \{I_{S}, L_{S}\}$, unlabeled target data $X_{T} = \{I_{T}\}$, student segmentation model $g_{θ}$, teacher segmentation model $g_{ϕ}$
1: Initialize the parameters of the teacher model from the student model, $ϕ \leftarrow θ$
2: for i = 1 to max_iterations do
3:     predict target domain pseudo-labels by teacher model ${\hat{L}}_{T} \leftarrow g_{ϕ} (I_{T})$
4:     obtain mixed images and labels $I_{M}, L_{M}$
5:     predict labels of source images and mixed images by student model $g_{θ} (I_{S}), g_{θ} (I_{M})$
6:     compute source segmentation loss $L_{S}$, mixed segmentation loss $L_{M}$, and contrastive text constraint loss $L_{C}$
7:     compute complete loss $L = L_{S} + L_{M} + λ_{1} L_{C}$
8:     perform back-propagation and update student model parameters $θ$
9:     update teacher model parameters $ϕ$ by EMA of $θ$
10: end for
Output: student segmentation model $g_{θ}$

4. Experiments and Result Analysis

In this section, we first introduce the RSI segmentation datasets, the six domain adaptation tasks, the implementation details of the proposed SIT-UDA, and the evaluation metrics in Section 4.1. Section 4.2 compares the SIT-UDA model with current state-of-the-art UDA semantic segmentation methods in different domain adaptation tasks. As detailed in Section 5.1, ablation experiments were conducted to verify the role of each part of the SIT-UDA model, and the effects of the weight parameters and the scale of the model were further analyzed.

4.1. Experimental Settings

4.1.1. Datasets and UDA Tasks

In the experimental section, we verify the effectiveness of the proposed SIT-UDA model on the Potsdam [41] and Vaihingen [42] datasets, which differ in geographical location, imaging mode, and spatial resolution, and on the LoveDA [43] dataset, which covers the different geographical landscapes of Urban and Rural scenes. The Potsdam and Vaihingen datasets contain six classes: ‘impervious surface’, ‘building’, ‘low vegetation’, ‘tree’, ‘car’, and ‘clutter’. The LoveDA dataset contains seven classes: ‘background’, ‘building’, ‘road’, ‘water’, ‘barren’, ‘forest’, and ‘agriculture’. The specific information and partitioning of datasets and tasks are shown in Table 1 and Table 2. In all domain adaptation tasks, the training sets of the labeled source domain and the unlabeled target domain serve as the training data. Models trained on low-resolution source data degrade dramatically on high-resolution targets. Therefore, in the VaiIRRG2PotsIRRG and VaiIRRG2PotsRGB tasks, we upsampled the 256 × 256 images in Vaihingen to 512 × 512 by bilinear interpolation before feeding them into the segmentation network.

4.1.2. Evaluation Metrics

Five common metrics are adopted to evaluate the performance of UDA semantic segmentation methods, namely, Intersection over Union (IoU), Mean Intersection over Union (mIoU), F1 score (F1), Mean F1 score (mF1), and Overall Accuracy (OA). IoU measures the segmentation capability of the model for a particular class by computing the ratio of the intersection to the union of the ground truth labels and predicted values for that class, and mIoU is the mean value of IoU metrics for all classes; they can be calculated as follows:

\mathrm{IoU}_{c} = \frac{n_{c, c}}{\sum_{k = 1}^{C} n_{c, k} + \sum_{k = 1}^{C} n_{k, c} - n_{c, c}}

(15)

\mathrm{mIoU} = \frac{1}{C} \sum_{c = 1}^{C} \mathrm{IoU}_{c}

(16)

where $n_{c, c}$ represents the number of pixels that belong to a class c and are correctly predicted as the class, and $n_{c, k}$ represents the number of pixels that belong to a class c but are incorrectly predicted as the class k. The F1 score is a comprehensive metric to evaluate the precision and recall of a certain class, and mF1 represents the average F1 score for all classes, which can assess the model’s ability to segment all classes accurately and completely. Their calculation equations are as follows:

\mathrm{precision}_{c} = \frac{n_{c, c}}{\sum_{k = 1}^{C} n_{k, c}}

(17)

\mathrm{recall}_{c} = \frac{n_{c, c}}{\sum_{k = 1}^{C} n_{c, k}}

(18)

\mathrm{F1}_{c} = 2 \times \frac{\mathrm{precision}_{c} \times \mathrm{recall}_{c}}{\mathrm{precision}_{c} + \mathrm{recall}_{c}}

(19)

\mathrm{mF1} = \frac{1}{C} \sum_{c = 1}^{C} \mathrm{F1}_{c}

(20)

OA represents the ratio of all correctly predicted pixels to all pixels. It can be calculated as follows:

\mathrm{OA} = \frac{\sum_{c = 1}^{C} n_{c, c}}{\sum_{c = 1}^{C} N_{c}}

(21)

where $N_{c}$ represents the total number of pixels belonging to the class c. The higher the values of IoU, mIoU, F1, mF1, and OA, the better the segmentation performance of the model.
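To make Equations (15)–(21) concrete, the sketch below computes all five metrics from a confusion matrix whose entry `conf[c][k]` plays the role of $n_{c, k}$. The function name and the toy matrix are hypothetical. Returning 0 for a class that never appears in either the labels or the predictions is an assumption of this sketch, not something stated in the paper.

```python
def segmentation_metrics(conf):
    """Compute IoU, mIoU, F1, mF1, and OA from a C x C confusion matrix.

    conf[c][k] is the number of pixels of true class c predicted as class k.
    """
    num_classes = len(conf)
    iou, f1 = [], []
    for c in range(num_classes):
        tp = conf[c][c]                              # n_{c,c}
        true_c = sum(conf[c])                        # sum_k n_{c,k}
        pred_c = sum(row[c] for row in conf)         # sum_k n_{k,c}
        union = true_c + pred_c - tp
        iou.append(tp / union if union else 0.0)     # Eq. (15)
        precision = tp / pred_c if pred_c else 0.0   # Eq. (17)
        recall = tp / true_c if true_c else 0.0      # Eq. (18)
        denom = precision + recall
        f1.append(2 * precision * recall / denom if denom else 0.0)  # Eq. (19)
    total = sum(sum(row) for row in conf)
    correct = sum(conf[c][c] for c in range(num_classes))
    return {
        "IoU": iou,
        "mIoU": sum(iou) / num_classes,              # Eq. (16)
        "F1": f1,
        "mF1": sum(f1) / num_classes,                # Eq. (20)
        "OA": correct / total,                       # Eq. (21)
    }


# Toy two-class confusion matrix (values are made up).
metrics = segmentation_metrics([[8, 2], [1, 9]])
```

For this matrix, class 0 has IoU 8/11 and class 1 has IoU 9/12, and 17 of the 20 pixels are correct, so OA is 0.85.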

4.1.3. Implementation Details

The experiments were run on Ubuntu 20.04.4 LTS with an Intel® Xeon® Gold 5215 CPU and a GeForce RTX 3090 GPU with 24 GB of memory. The software environment was PyTorch 1.10, Python 3.8, and CUDA 11.4. The model was trained using the AdamW [44] optimizer with an initial learning rate of $1 \times 10^{- 4}$. To better preserve the pretrained weights, the text encoder parameters were frozen, and the learning rate of the image encoder was set to $1 \times 10^{- 5}$. A poly learning rate schedule was applied with a minimum learning rate of $1 \times 10^{- 6}$. Training length was measured in iterations, and the maximum number of iterations was set to $4 \times 10^{4}$. Since the cropped image size varies across datasets, the batch size was adjusted accordingly to balance memory consumption and training stability. Specifically, for the Potsdam and Vaihingen datasets, the batch size was set to 8, while for the LoveDA dataset, the batch size was set to 2. Table 3 provides a summary of the hyperparameter settings employed in the experiments. Following common practice in contrastive learning methods [45] and teacher–student methods [15], the temperature coefficient and EMA update coefficient were set to 0.1 and 0.999, respectively. A comprehensive analysis of the confidence threshold and loss weight is provided in the parameter sensitivity analysis (Section 5.1.3).

4.2. Comparative Experiments and Results Discussion

Due to the lack of image–text multimodal UDA methods for remote sensing images, we conducted comparative experiments with five advanced image-based single-modal UDA methods and two reimplemented image–text multimodal UDA methods from recent years. The image-based comparison methods include: the domain alignment approach (CIA-UDA [46]), the self-training-based methods (ProDA [11] and DACS [15]), and multi-stage methods combining domain alignment and self-training (JDAF [31] and FGUDA [23]). The image–text multimodal UDA methods include CLIP-UDA and CLIP-ProCL [47]. CLIP-UDA with the same image segmentation framework as SIT-UDA was implemented to achieve a fair comparison. Unlike SIT-UDA, CLIP-UDA adopts the fixed text features from the pretrained CLIP model as the text representations in both teacher and student models. CLIP-ProCL incorporates the pretrained image and text features from CLIP into an image-based domain adaptation model for semantic segmentation, aligning and fusing them with the image features extracted by the model’s backbone network. All image–text UDA methods incorporate consistency regularization and teacher–student models to achieve self-learning, and the ClassMix-based image mixing augmentation [15] was applied in DACS and three image–text methods. The segmentation framework and image encoder of comparison methods are shown in Table 4 and Table 5 for detailed comparison. The DACS, CLIP-UDA, and CLIP-ProCL methods were reimplemented independently to obtain the experimental results, whereas the results of the other methods are cited directly from their original publications.

4.2.1. PotsIRRG2VaiIRRG and PotsRGB2VaiIRRG

The UDA task of PotsIRRG2VaiIRRG evaluates the performance of comparison methods in handling geographic location variation, and the experimental results are presented in Table 6. The segmentation performance of the UDA methods that use image mixing augmentation, namely, DACS, CLIP-UDA, CLIP-ProCL, and SIT-UDA, significantly surpasses that of the other single-modal comparison methods. By leveraging consistency regularization on mixed images, these methods effectively enhance the learning on target domain data and improve generalization performance. In the image segmentation network, although the ResNet-50 encoder exhibits inferior deep feature extraction capacity compared to ResNet-101, the CLIP-UDA model achieves competitive performance by leveraging CLIP pretrained image encoder parameters. Compared to the suboptimal DACS, the multimodal SIT-UDA model achieves a significant performance improvement of 4.92% in mIoU and 4.59% in mF1 through effective image–text representations, demonstrating the robustness and generalization of the multimodal model in UDA tasks.

In terms of segmentation results for each class, SIT-UDA achieves the best results on all classes except ‘impervious surface’ and ‘building’ and achieves significant gains on ‘low vegetation’, ‘car’, and ‘clutter’. This demonstrates that the text–image interaction not only compensates for architectural discrepancies in image encoder networks but also enhances discriminative feature learning, particularly for small objects and hard-to-segment classes, through multimodal alignment, advancing segmentation performance. The qualitative segmentation results of various methods on certain test images from the Vaihingen dataset are shown in Figure 4. It can be observed that the spectra of ‘tree’ and ‘low vegetation’ are easily confused and the edges of ‘clutter’ are blurry; comparison methods sometimes mistakenly segment ‘clutter’ as ‘impervious surface’. The SIT-UDA model effectively distinguishes confusing categories by aligning text features with image features. Additionally, SIT-UDA leverages the contrastive text constraint strategy to decrease intra-class variance and increase inter-class variance, further enhancing class separability.

The UDA task of PotsRGB2VaiIRRG validates the performance of comparison methods in handling geographical and spectral band composition variations, and the segmentation results for various methods are shown in Table 7. Due to the spectral discrepancy between source and target domains, cross-domain image mixing results in varying spectral band compositions in mixed images, leading to significant performance degradation in DACS. The domain-alignment-based UDA methods, such as CIA-UDA, JDAF, and FGUDA, leverage image-level, feature-level, and output-level alignments, effectively mitigating spectral discrepancies between different domains, exhibiting enhanced generalization capability on this task. The CLIP-UDA model ranks second, suggesting that the CLIP-pretrained image–text UDA model exhibits stronger generalization capabilities than the ImageNet-pretrained model. The SIT-UDA model achieves the best results with mIoU (58.82%) and mF1 (71.84%), further indicating the strong transferability and robustness of the image–text multimodal model. Although SIT-UDA does not achieve the best performance in every category, it yields results close to the optimum, which demonstrates the model’s advantage in balanced performance and strong generalization across categories.

4.2.2. VaiIRRG2PotsIRRG and VaiIRRG2PotsRGB

In the UDA task of VaiIRRG2PotsIRRG, the low spatial resolution of the Vaihingen dataset poses great challenges when adapting models to the high-resolution Potsdam dataset. Table 8 records the segmentation results of the comparison methods. Compared to the reverse direction UDA task (PotsIRRG2VaiIRRG), these methods exhibit performance degradation, particularly in the ‘tree’ and ‘clutter’ categories. This limitation stems from the insufficient spatial resolution hindering effective learning of intricate tree textures and complex background patterns. Compared to DACS, SIT-UDA achieves significant IoU improvements of 13.26% and 10.06% on ‘tree’ and ‘car’, respectively, which may be attributable to the robust feature learning of the CLIP-pretrained models. Meanwhile, the text–image alignment and fusion strengthen discriminative representations of confused classes and small objects.

The domain discrepancies of the VaiIRRG2PotsRGB task involve geographical location, spectral band composition, and spatial resolution, making the task more challenging, further limiting the generalization ability of UDA methods on the target domain. Table 9 presents the segmentation results of comparison methods; SIT-UDA not only achieves the best mIoU and mF1 values but also delivers optimal or near-optimal segmentation results across all categories. This demonstrates the model’s robustness, balanced performance, and strong generalization ability across diverse and complex scenarios. Figure 5 shows the segmentation results of the comparison methods on the test images in the Potsdam dataset. The limited spatial resolution severely impedes the extraction of discriminative visual features for ‘clutter’ and ‘tree’, which are misclassified as ‘impervious surface’ or ‘low vegetation’.

4.2.3. Urban2Rural and Rural2Urban

To further verify the robustness and generalization of the proposed model on diverse datasets, we conducted domain adaptation experiments on the LoveDA dataset and recorded the IoU, mIoU, F1, and mF1 values of the experimental results in Table 10. In both the Rural2Urban and Urban2Rural domain adaptation tasks, the multimodal method CLIP-ProCL achieves the second-best segmentation performance, while SIT-UDA obtains the highest mIoU scores, 51.84% and 38.70%, respectively. SIT-UDA also reaches the best IoU values in most categories. Compared with CLIP-ProCL, SIT-UDA fully exploits the CLIP pretrained model to extract image and text features. In addition, it integrates pixel-level weighting and the contrastive text constraint to optimize the training process, which contributes significantly to performance improvement. In the Urban2Rural task, SIT-UDA shows inferior segmentation results for the ‘water’, ‘barren’, and ‘agriculture’ categories. The confusion matrix of SIT-UDA in this task is shown in Figure 6. It reveals that the segmentation accuracy of the model for ‘barren’ is particularly low; only 11.16% of pixels truly belonging to ‘barren’ are correctly predicted, while 80.79% are misclassified as ‘background’. The complex background in the LoveDA dataset makes it difficult to capture discriminative features for the ‘background’ class, and many pixels from other categories are misclassified as ‘background’. The segmentation performance of SIT-UDA on the LoveDA dataset still requires further improvement.

Figure 7 shows the qualitative segmentation results of the comparison methods on the test images in the Rural domain. Compared to DACS, CLIP-UDA, and CLIP-ProCL, SIT-UDA segments ‘building’ and ‘water’ better, demonstrating its enhanced capability in capturing structural and hydrological features. All of these methods exhibit inferior performance on ‘barren’ and require further learning and refinement of feature representation for this class.

5. Discussion

5.1. Ablation Study

To validate the contributions of individual components in the SIT-UDA model, we conducted comprehensive ablation studies on three domain adaptation tasks: PotsIRRG2VaiIRRG, VaiIRRG2PotsIRRG, and Rural2Urban. The experimental results are shown in Table 11. CLIP-UDA, which shares the same architecture as SIT-UDA, was adopted as the baseline model and was trained by segmentation losses on both source domain images and mixed images, as defined in Equations (5) and (12). Building upon this baseline, SIT-UDA introduces learnable text prompts and proposes the entropy-guided pixel-level weighting strategy together with the contrastive text constraint strategy. The overall training objective of SIT-UDA is shown in Equation (13). As observed from Table 11, learnable text prompts, EGPW, and CTC each contribute to performance improvements. When combined, the SIT-UDA model achieves the best segmentation results.

5.1.1. Ablation of Text Prompts

To examine the impact of learnable text prompts on image features in the student model, we used t-SNE to visualize the image features of the baseline with the fixed text prompt and of SIT-UDA with learnable text prompts. For clearer comparison, the categories ‘impervious surface’ vs. ‘clutter’ and ‘tree’ vs. ‘low vegetation’, which exhibit similar appearances, were selected for visualization. The t-SNE projections of the image features are shown in Figure 8. It is evident that the SIT-UDA model not only increases inter-class differences between ‘low vegetation’ and ‘tree’, and between ‘impervious surface’ and ‘clutter’, but also reduces intra-class feature dispersion, providing more concentrated and discriminative image features for these easily confusable categories.

To further verify the impact of fixed text prompts in the teacher model on the performance of SIT-UDA, we adopted learnable text prompts in the teacher model for comparison and conducted experiments on the three domain adaptation tasks mentioned above. As shown in Table 12, using fixed text prompts from pretrained models to generate text features in the teacher model leads to greater performance improvements.

5.1.2. Ablation of Contrastive Text Constraint

Figure 9 illustrates the similarity between text features of SIT-UDA before and after introducing the contrastive text constraint. Compared with Figure 9a, Figure 9b shows that the similarity among text features of different categories is reduced, indicating that the contrastive text constraint enhances the distinctiveness between text features of different categories, thereby improving the discriminative ability of the model for different category features through image–text alignment and fusion.

5.1.3. Parameter Sensitivity Analysis

In the total loss presented in Equation (13), the segmentation loss weight for the source domain is fixed at 1, while the mixed segmentation loss is adaptively weighted through the proposed entropy-guided pixel-level weighting (EGPW) strategy. In EGPW, the loss weight of each target domain pseudo-label pixel is determined by its entropy value: pixels with lower entropy are assigned larger weights, whereas pixels with higher entropy are weighted according to the proportion of pseudo-label pixels exceeding the confidence threshold. The confidence threshold $τ$ directly influences the proportion of high-quality pixels. A lower threshold increases the proportion of high-quality pixels, thereby enlarging the corresponding loss weight, while a higher threshold reduces this proportion and consequently decreases the loss weight. To investigate the sensitivity to the confidence threshold, we set it to each value in [0.1, 0.3, 0.5, 0.7, 0.9, 1.0] and trained the model on the PotsIRRG2VaiIRRG and VaiIRRG2PotsIRRG tasks. As shown in Table 13, both overly low and overly high confidence thresholds result in inferior segmentation performance. A threshold of 0.9 yields the highest values of mIoU and OA.
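The two quantities that EGPW relies on, the per-pixel entropy and the proportion of pseudo-label pixels above the confidence threshold, can be sketched as follows. This only illustrates the building blocks and how the threshold changes the proportion. The rule that turns them into the weight $W^{(i, j)}$ is defined earlier in the paper and is not reproduced here. The function names, the strict `>` comparison, and the toy probabilities are assumptions of this sketch.

```python
import math


def pixel_entropy(class_probs):
    """Shannon entropy of one pixel's class distribution (natural logarithm)."""
    return -sum(p * math.log(p) for p in class_probs if p > 0.0)


def confident_proportion(prob_map, tau):
    """Fraction of pixels whose maximum class probability exceeds tau.

    Assumption: a strict comparison is used; the paper may use >= instead.
    """
    pixels = [class_probs for row in prob_map for class_probs in row]
    return sum(max(class_probs) > tau for class_probs in pixels) / len(pixels)


# Toy 2 x 2 teacher output with two classes (values are made up).
prob_map = [
    [[0.95, 0.05], [0.6, 0.4]],
    [[0.8, 0.2], [0.5, 0.5]],
]
proportions = {tau: confident_proportion(prob_map, tau) for tau in (0.1, 0.5, 0.7, 0.9)}
entropies = [[pixel_entropy(p) for p in row] for row in prob_map]
```

Lowering the threshold from 0.9 to 0.1 raises the proportion of confident pixels from 0.25 to 1.0 in this toy example, which matches the trend described above.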

To verify the influence of the contrastive text loss $L_{C}$ on the performance of the model, we varied its weight, the hyperparameter $λ_{1}$, within the range [0, 1] and evaluated the model’s performance on the PotsIRRG2VaiIRRG and VaiIRRG2PotsIRRG tasks. As illustrated in Figure 10 and Figure 11, assigning a weight of 1 to $L_{C}$ yields the most significant performance gain for the model.

5.2. Computational Complexity Analysis

To further analyze the resource consumption of the proposed model, we compared the number of model parameters (Params), floating-point operations (GFLOPs), training time, and the number of image frames processed per second during inference (FPS) of the baseline and SIT-UDA methods on the PotsIRRG2VaiIRRG task. The results are shown in Table 14. The SIT-UDA model utilizes the ResNet-50 image encoder network along with a frozen text encoder network, achieving accuracy improvements at a smaller model scale. Compared with CLIP-UDA, the learnable prompt vectors introduce only a small number of additional parameters. Meanwhile, the pixel-level loss weighting on mixed images and the contrastive text constraint strategy lead to only minor increases in model parameters and computational time, which remain within an acceptable range.

5.3. Limitations and Future Study

In the above experiments, the proposed SIT-UDA model achieves the best performance in terms of both mIoU and mF1 across all six UDA tasks. Nevertheless, several limitations remain. First, the current approach assumes that the source and target domains share the same label set and does not account for scenarios where the target domain may contain unseen categories. Second, the CLIP model, pretrained on image–text pairs, primarily relies on image–text alignment to improve generalization performance, and its ability to extract fine-grained visual features still requires further improvement. These limitations point to meaningful directions for future research. To boost efficiency, we will investigate more effective image feature extraction networks and further enhance generalization through better integration of image–text alignment and multimodal fusion. Furthermore, open-set domain adaptive semantic segmentation based on image–text multimodal learning will be developed in the future.

6. Conclusions

This paper proposes a self-training-based image–text multimodal unsupervised domain adaptation semantic segmentation model (SIT-UDA) for RSIs. The model adopts the teacher–student self-training and image mixing approach and leverages aligned text and image features to construct a robust domain-invariant feature space. To learn effectively from unlabeled target domain data, the entropy-guided pixel-level weighting strategy is proposed to adaptively adjust the loss weight of unlabeled pixels, making the most of high-confidence pixels and alleviating negative transfer from low-confidence pixels. In addition, the contrastive text constraint strategy is proposed to promote the learning of domain-invariant feature representation and to enhance the discrepancies among different classes’ text features in the student model for multimodal alignment. Experiments on six domain adaptation tasks reveal that SIT-UDA outperforms advanced UDA methods in handling domain discrepancies of RSIs while maintaining lower model complexity. The model is designed for domain adaptation tasks in which categories in source and target domains are the same; its performance may degrade if new categories appear in the target domain. Future research will explore extensions to other remote sensing tasks and ways to further improve the model’s performance on UDA tasks with different spatial resolutions.

Author Contributions

Conceptualization, Q.L. and X.W.; methodology, Q.L.; software, Q.L.; validation, Q.L. and X.W.; formal analysis, X.W.; writing—original draft preparation, Q.L.; writing—review and editing, Q.L. and X.W.; supervision, X.W.; funding acquisition, X.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Second Tibetan Plateau Scientific Expedition and Research under Grant 2019QZKK0405 and the National Natural Science Foundation of China under Grant no. 42361056.

Data Availability Statement

The data in the paper can be obtained through the following links. ISPRS Potsdam: https://www.isprs.org/resources/datasets/benchmarks/UrbanSemLab/2d-sem-label-potsdam.aspx, accessed on 1 June 2023. ISPRS Vaihingen: https://www.isprs.org/resources/datasets/benchmarks/UrbanSemLab/2d-sem-label-vaihingen.aspx, accessed on 1 June 2023.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

RSIs	Remote Sensing Images
UDA	Unsupervised Domain Adaptation
EGPW	Entropy-Guided Pixel-level Weighting
SIT-UDA	Self-Training-Based Image–Text Multimodal Unsupervised Domain Adaptation Semantic Segmentation
CTC	Contrastive Text Constraint
EMA	Exponential Moving Average

References

Wang, P.; Tang, Y.; Liao, Z.; Yan, Y.; Dai, L.; Liu, S.; Jiang, T. Road-side individual tree segmentation from urban MLS point clouds using metric learning. Remote Sens. 2023, 15, 1992. [Google Scholar] [CrossRef]
Tang, X.; Tu, Z.; Wang, Y.; Liu, M.; Li, D.; Fan, X. Automatic detection of coseismic landslides using a new transformer method. Remote Sens. 2022, 14, 2884. [Google Scholar] [CrossRef]
Marcos, D.; Volpi, M.; Kellenberger, B.; Tuia, D. Land cover mapping at very high resolution with rotation equivariant CNNs: Towards small yet accurate models. ISPRS J. Photogramm. Remote Sens. 2018, 145, 96–107. [Google Scholar] [CrossRef]
Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; Springer: Berlin, Germany, 2015; pp. 234–241. [Google Scholar]
Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
Gretton, A.; Borgwardt, K.M.; Rasch, M.J.; Schölkopf, B.; Smola, A. A kernel two-sample test. J. Mach. Learn. Res. 2012, 13, 723–773. [Google Scholar]
Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. In Proceedings of the Twenty-Eighth Annual Conference on Neural Information Processing Systems (NIPS), Montreal, QC, Canada, 8–13 December 2014. [Google Scholar]
Zou, Y.; Yu, Z.; Kumar, B.V.K.; Wang, J. Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 289–305. [Google Scholar]
Zheng, Z.; Yang, Y. Rectifying pseudo label learning via uncertainty estimation for domain adaptive semantic segmentation. Int. J. Comput. Vis. 2021, 129, 1106–1120. [Google Scholar] [CrossRef]
Zhang, P.; Zhang, B.; Zhang, T.; Chen, D.; Wang, Y.; Wen, F. Prototypical pseudo label denoising and target structure learning for domain adaptive semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 12414–12424. [Google Scholar]
Tarvainen, A.; Valpola, H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Proceedings of the Thirty-one Annual Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
Yun, S.; Han, D.; Oh, S.J.; Chun, S.; Choe, J.; Yoo, Y. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 17 October–3 November 2019; pp. 6023–6032. [Google Scholar]
Olsson, V.; Tranheden, W.; Pinto, J.; Svensson, L. Classmix: Segmentation-based data augmentation for semi-supervised learning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 1369–1378.
Tranheden, W.; Olsson, V.; Pinto, J.; Svensson, L. Dacs: Domain adaptation via cross-domain mixed sampling. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 1379–1389.
Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 8748–8763.
Jia, C.; Yang, Y.; Xia, Y.; Chen, Y.T.; Parekh, Z.; Pham, H.; Le, Q.; Sung, Y.H.; Li, Z.; Duerig, T. Scaling up visual and vision–language representation learning with noisy text supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 4904–4916.
Lüddecke, T.; Ecker, A. Image segmentation using text and image prompts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 21–24 June 2022; pp. 7086–7096.
Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 4–6 October 2023; pp. 4015–4026.
Kim, Y.E.; Lee, Y.W.; Lee, S.W. LC-MSM: Language-Conditioned Masked Segmentation Model for unsupervised domain adaptation. Pattern Recognit. 2024, 148, 110201.
Zheng, A.; Wang, M.; Li, C.; Tang, J.; Luo, B. Entropy guided adversarial domain adaptation for aerial image semantic segmentation. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5405614.
Toldo, M.; Michieli, U.; Agresti, G.; Zanuttigh, P. Unsupervised domain adaptation for mobile semantic segmentation based on cycle consistency and feature alignment. Image Vis. Comput. 2020, 95, 103889.
Wang, L.; Xiao, P.; Zhang, X.; Chen, X. A Fine-Grained Unsupervised Domain Adaptation Framework for Semantic Segmentation of Remote Sensing Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 4109–4121.
French, G.; Mackiewicz, M.; Fisher, M. Self-ensembling for visual domain adaptation. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
Hoyer, L.; Dai, D.; Van Gool, L. Daformer: Improving network architectures and training strategies for domain-adaptive semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 21–24 June 2022; pp. 9924–9935.
Rao, Y.; Zhao, W.; Chen, G.; Tang, Y.; Zhu, Z.; Huang, G.; Zhou, J.; Lu, J. Denseclip: Language-guided dense prediction with context-aware prompting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 21–24 June 2022; pp. 18082–18091.
Li, B.; Weinberger, K.Q.; Belongie, S.; Koltun, V.; Ranftl, R. Language-Driven Semantic Segmentation. arXiv 2022, arXiv:2201.03546.
Mata, C.; Ranasinghe, K.; Ryoo, M.S. Copt: Unsupervised domain adaptive segmentation using domain-agnostic text embeddings. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 424–440.
Wang, H.; Jiang, Z.; Xie, L.; Jiang, D.; Shen, W.; Tian, Q. Domain-Adaptive Semantic Segmentation Emerges From Vision–Language Supervised Domain-Debiased Self-Training. In Proceedings of the ICASSP 2024–2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 3930–3934.
Chen, J.; Zhu, J.; Guo, Y.; Sun, G.; Zhang, Y.; Deng, M. Unsupervised domain adaptation for semantic segmentation of high-resolution remote sensing imagery driven by category-certainty attention. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5616915.
Huang, H.; Li, B.; Zhang, Y.; Chen, T.; Wang, B. Joint distribution adaptive-alignment for cross-domain segmentation of high-resolution remote sensing images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5401214.
Ismael, S.F.; Kayabol, K.; Aptoula, E. Unsupervised domain adaptation for the semantic segmentation of remote sensing images via a class-aware Fourier transform and a fine-grained discriminator. Digit. Signal Process. 2024, 151, 104551.
Zeng, W.; Cheng, M.; Yuan, Z.; Dai, W.; Wu, Y.; Liu, W.; Wang, C. Domain adaptive remote sensing image semantic segmentation with prototype guidance. Neurocomputing 2024, 580, 127484.
Zhou, K.; Yang, J.; Loy, C.C.; Liu, Z. Learning to prompt for vision–language models. Int. J. Comput. Vis. 2022, 130, 2337–2348.
Kirillov, A.; Girshick, R.; He, K.; Dollár, P. Panoptic feature pyramid networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019; pp. 6399–6408.
Vu, T.H.; Jain, H.; Bucher, M.; Cord, M.; Pérez, P. Advent: Adversarial entropy minimization for domain adaptation in semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019; pp. 2517–2526.
Bi, X.; Zhang, X.; Wang, S.; Zhang, H. Entropy-weighted reconstruction adversary and curriculum pseudo labeling for domain adaptation in semantic segmentation. Neurocomputing 2022, 506, 277–289.
Wang, R.; Zhou, Q.; Zheng, G. EDRL: Entropy-guided disentangled representation learning for unsupervised domain adaptation in semantic segmentation. Comput. Methods Programs Biomed. 2023, 240, 107729.
van den Oord, A.; Li, Y.; Vinyals, O. Representation learning with contrastive predictive coding. arXiv 2018, arXiv:1807.03748.
Wang, W.; Zhou, T.; Yu, F.; Dai, J.; Konukoglu, E.; Van Gool, L. Exploring cross-image pixel contrast for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 7303–7313.
Potsdam. ISPRS Potsdam 2D Semantic Labeling Dataset. 2018. Available online: https://www.isprs.org/resources/datasets/benchmarks/UrbanSemLab/2d-sem-label-potsdam.aspx (accessed on 1 June 2023).
Vaihingen. ISPRS Vaihingen 2D Semantic Labeling Dataset. 2018. Available online: https://www.isprs.org/resources/datasets/benchmarks/UrbanSemLab/2d-sem-label-vaihingen.aspx (accessed on 1 June 2023).
Wang, J.; Zheng, Z.; Ma, A.; Lu, X.; Zhong, Y. LoveDA: A Remote Sensing Land-Cover Dataset for Domain Adaptive Semantic Segmentation. arXiv 2021, arXiv:2110.08733.
Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101.
Shu, Y.; Guo, X.; Wu, J.; Wang, X.; Wang, J.; Long, M. CLIPood: Generalizing CLIP to Out-of-Distributions. arXiv 2023, arXiv:2302.00864.
Ni, H.; Liu, Q.; Guan, H.; Tang, H.; Chanussot, J. Category-level Assignment for Cross-domain Semantic Segmentation in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5608416.
Liu, K.; Zhu, C. Unsupervised Domain Adaptive Semantic Segmentation Based on Clip-Guided Prototypical Contrastive Learning. In Proceedings of the International Conference on Image Processing, ICIP, Abu Dhabi, United Arab Emirates, 27–30 October 2024; pp. 291–297.

Figure 1. Examples of domain discrepancy of the same class in different domain images.

Figure 2. Overview of our SIT-UDA model framework with teacher image–text segmentation network and student image–text segmentation network.

Figure 3. The image–text multimodal semantic segmentation network.

Figure 4. Qualitative comparisons of different methods on the task of PotsIRRG2VaiIRRG. Red boxes are marked to highlight the differences.

Figure 5. Qualitative comparisons of different methods on the task of VaiIRRG2PotsRGB. Red boxes are marked to highlight the differences.

Figure 6. Confusion Matrix of SIT-UDA in the Urban2Rural task. 0—background, 1—building, 2—road, 3—water, 4—barren, 5—forest, 6—agriculture.

Figure 7. Qualitative comparisons of different methods on the task of Urban2Rural. Red boxes are marked to highlight the differences.

Figure 8. Image feature visualization converted by t-SNE in Baseline and SIT-UDA on the PotsIRRG2VaiIRRG task. 0—impervious surface, 2—low vegetation, 3—tree, 5—clutter.

Figure 9. Similarity matrix of text features in SIT-UDA w/o CTC and SIT-UDA models on the PotsIRRG2VaiIRRG task. 0—impervious surface, 1—building, 2—low vegetation, 3—tree, 4—car, 5—clutter.

Figure 10. The effect of contrastive text loss weight $λ_{1}$ for SIT-UDA on the PotsIRRG2VaiIRRG task.

Figure 11. The effect of contrastive text loss weight $λ_{1}$ for SIT-UDA on the VaiIRRG2PotsIRRG task.

Table 1. The information of the datasets.

Dataset	Spectral	Average Size	Cropping Size	Resolution	Training Set	Test Set
Potsdam	IRRG, RGB	6000 × 6000	512 × 512	5 cm	24 images	14 images
Vaihingen	IRRG	2494 × 2064	512 × 512	9 cm	25 images	5 images
Vaihingen	IRRG	2494 × 2064	256 × 256	9 cm	25 images	5 images
LoveDA	RGB	1024 × 1024	1024 × 1024	30 cm	Urban: 1156 images	677 images
LoveDA	RGB	1024 × 1024	1024 × 1024	30 cm	Rural: 1366 images	992 images

Table 2. UDA tasks with different domain shifts.

Task	Source Domain	Target Domain	Domain Shift
Task	Source Domain	Target Domain	Geographic Location	Imaging Mode	Spatial Resolution	Geographical Landscape
PotsIRRG2VaiIRRG	Potsdam IRRG	Vaihingen IRRG	✓	✗	↓	✗
PotsRGB2VaiIRRG	Potsdam RGB	Vaihingen IRRG	✓	✓	↓	✗
VaiIRRG2PotsIRRG	Vaihingen IRRG	Potsdam IRRG	✓	✗	↑	✗
VaiIRRG2PotsRGB	Vaihingen IRRG	Potsdam RGB	✓	✓	↑	✗
Urban2Rural	Urban	Rural	✗	✗	✗	✓
Rural2Urban	Rural	Urban	✗	✗	✗	✓

✓ represents different, ✗ represents identical, ↑ denotes resolution from low to high, and ↓ denotes resolution from high to low.

Table 3. Hyperparameter settings.

Parameter Descriptions	Confidence Threshold	EMA Updater Coefficient	Temperature Coefficient	Loss Weight
Values	$τ = 0.968$	$α = 0.999$	$σ = 0.1$	$λ_{1} = 1$
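
As a reading aid for the EMA updater coefficient in Table 3, the following is a minimal sketch of a mean-teacher style update, in which teacher weights are an exponential moving average of student weights. It is an illustration under assumptions rather than the authors' implementation: the function name `ema_update`, the dictionary keys, and the toy scalar "parameters" are all hypothetical, and the exact update rule used by SIT-UDA is assumed to follow the standard mean-teacher form.

```python
# Values copied from Table 3; the dictionary keys are illustrative names only.
HYPERPARAMS = {"tau": 0.968, "alpha": 0.999, "sigma": 0.1, "lambda_1": 1.0}


def ema_update(teacher, student, alpha):
    """One EMA step per parameter: alpha * teacher + (1 - alpha) * student."""
    return {k: alpha * teacher[k] + (1.0 - alpha) * student[k] for k in teacher}


# Toy scalar "weights" standing in for real network parameters (assumption).
teacher = {"w": 0.0, "b": 1.0}
student = {"w": 1.0, "b": 1.0}

# Hold the student fixed and apply 1000 EMA steps to the teacher.
for _ in range(1000):
    teacher = ema_update(teacher, student, HYPERPARAMS["alpha"])

print(teacher)
```

With $α = 0.999$ and a fixed student, the teacher closes a fraction $1 - 0.999^{1000}$ (roughly 63%) of the gap after 1000 steps, which illustrates how slowly the teacher tracks the student at this setting.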

Table 4. Details of image-based comparison methods.

Method	Segmentation Framework	Image Encoder	Pretrained	Domain Alignment			Self-Training
Method	Segmentation Framework	Image Encoder	Pretrained	Image Level	Feature Level	Output Level	Pseudo-Label Filtering	Consistency Regularization
CIA-UDA [46]	Deeplabv3	RN101	ImageNet	✓	✓	✓
ProDA [11]	Deeplabv2	RN101	ImageNet				✓	✓
JDAF [31]	Deeplabv3	RN101	ImageNet		✓	✓	✓
FGUDA [23]	Deeplabv3	RN101	ImageNet		✓	✓		✓
DACS [15]	Deeplabv3	RN101	ImageNet					✓

Table 5. Details of image–text comparison methods.

Method	Segmentation Framework	Image Encoder	Pretrained	Text Prompt
CLIP-ProCL [47]	Deeplabv2	RN101	ImageNet	Learnable
CLIP-UDA [11]	Semantic FPN	RN50	CLIP	Fixed
SIT-UDA	Semantic FPN	RN50	CLIP	Learnable

Table 6. Results (IoU, F1, mIoU, and mF1 in %) of different methods on the PotsIRRG2VaiIRRG task; the best results in each column are highlighted in bold.

Method	Impervious Surface		Building		Low Vegetation		Tree		Car		Clutter		mIoU	mF1
Method	IoU	F1	IoU	F1	IoU	F1	IoU	F1	IoU	F1	IoU	F1	mIoU	mF1
CIA-UDA	63.28	77.51	75.13	85.80	48.03	64.90	64.11	78.13	52.91	69.21	27.80	43.51	55.21	69.84
ProDA	62.51	76.85	71.61	82.95	34.49	51.65	56.26	72.09	39.20	56.52	3.99	8.21	44.68	58.05
JDAF	68.76	81.49	77.19	87.13	47.39	64.30	58.38	73.72	42.76	59.90	38.65	55.75	55.52	70.38
FGUDA	76.17	86.47	84.37	91.52	46.05	63.06	54.09	70.20	43.82	60.94	15.45	26.77	53.33	66.50
DACS	80.53	89.21	90.12	94.80	55.84	71.66	66.34	79.76	63.42	77.62	32.21	48.73	64.74	76.96
CLIP-ProCL	81.06	89.66	90.92	95.28	57.58	73.40	60.27	76.10	66.53	80.13	29.34	46.20	64.28	76.79
CLIP-UDA	75.07	85.76	82.59	90.46	54.51	70.56	56.86	72.50	61.20	75.93	43.88	60.99	62.35	76.03
SIT-UDA	80.77	89.37	86.64	92.84	61.34	76.04	68.62	81.39	69.46	81.98	51.15	67.68	69.66	81.55
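
The summary columns of Table 6 can be cross-checked from the per-class columns. The sketch below assumes that mIoU and mF1 are unweighted means over the six classes and that per-class F1 relates to per-class IoU through the standard identity F1 = 2·IoU / (1 + IoU); these definitions are assumptions here, since the metric definitions are not part of this section. The helper name `f1_from_iou` is hypothetical.

```python
# Per-class IoU and F1 values (%) copied from the SIT-UDA row of Table 6.
iou = [80.77, 86.64, 61.34, 68.62, 69.46, 51.15]
f1 = [89.37, 92.84, 76.04, 81.39, 81.98, 67.68]


def f1_from_iou(iou_percent):
    """Convert a per-class IoU (in %) to F1 (in %) via F1 = 2*IoU / (1 + IoU)."""
    j = iou_percent / 100.0
    return 100.0 * 2.0 * j / (1.0 + j)


# Unweighted class means, to compare with the mIoU and mF1 columns.
miou = sum(iou) / len(iou)
mf1 = sum(f1) / len(f1)

# Largest disagreement between reported F1 and F1 derived from reported IoU.
max_gap = max(abs(f1_from_iou(i) - f) for i, f in zip(iou, f1))

print(f"mIoU = {miou:.2f}, mF1 = {mf1:.2f}, max |F1 gap| = {max_gap:.3f}")
```

Under these assumptions the recomputed means agree with the reported 69.66 and 81.55, and the derived F1 values match the reported ones to within rounding.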

Table 7. Results (IoU, F1, mIoU, and mF1 in %) of different methods on the PotsRGB2VaiIRRG task; the best results in each column are highlighted in bold.

Method	Impervious Surface		Building		Low Vegetation		Tree		Car		Clutter		mIoU	mF1
Method	IoU	F1	IoU	F1	IoU	F1	IoU	F1	IoU	F1	IoU	F1	mIoU	mF1
CIA-UDA	62.63	77.02	79.71	88.71	33.31	49.97	63.43	77.62	52.28	68.66	13.50	23.78	50.81	64.29
ProDA	49.04	66.11	68.94	81.89	32.44	49.06	49.11	65.86	31.56	48.16	2.39	5.09	38.91	52.70
JDAF	64.33	78.29	75.53	86.06	42.16	59.31	51.99	68.41	45.87	62.90	32.71	49.30	52.10	67.38
FGUDA	73.80	84.92	83.76	91.16	43.27	60.40	44.41	61.50	43.24	60.38	12.61	22.39	50.18	63.46
DACS	56.03	71.82	73.94	85.01	40.28	57.42	47.65	64.54	47.80	64.69	21.29	35.11	47.83	63.10
CLIP-UDA	70.13	82.45	82.96	90.69	40.05	57.19	41.41	59.30	65.64	79.26	25.25	40.62	54.24	68.25
CLIP-ProCL	77.36	87.23	88.41	93.85	48.92	65.70	37.04	54.05	64.39	78.34	17.80	30.22	55.65	68.23
SIT-UDA	73.15	84.49	88.37	93.83	45.07	62.13	50.81	67.38	69.68	82.13	25.87	41.10	58.82	71.84

Table 8. Results (IoU, F1, mIoU, and mF1 in %) of different methods on the VaiIRRG2PotsIRRG task; the best results in each column are highlighted in bold.

Method	Impervious Surface		Building		Low Vegetation		Tree		Car		Clutter		mIoU	mF1
Method	IoU	F1	IoU	F1	IoU	F1	IoU	F1	IoU	F1	IoU	F1	mIoU	mF1
CIA-UDA	62.74	77.11	72.31	83.93	54.40	70.47	47.74	64.63	65.35	79.04	10.87	19.61	52.23	65.80
ProDA	44.70	61.72	56.85	72.49	40.55	57.71	31.59	48.02	46.78	63.74	10.63	19.21	38.51	53.82
JDAF	67.70	80.74	76.36	86.59	51.19	67.72	36.21	53.17	63.22	77.47	13.10	23.17	51.30	64.81
FGUDA	73.43	84.55	76.32	87.43	47.69	63.45	32.68	47.36	63.86	77.85	11.65	19.47	50.94	63.31
DACS	73.98	85.04	83.65	90.74	55.97	71.77	28.86	44.79	73.81	84.93	10.04	18.25	54.29	65.92
CLIP-ProCL	66.52	79.89	76.02	86.38	44.67	61.75	34.99	51.84	59.21	74.38	1.02	2.01	47.07	59.38
CLIP-UDA	75.16	85.82	82.35	90.32	53.86	70.01	35.46	52.36	81.98	90.10	8.57	15.01	56.23	67.27
SIT-UDA	75.87	86.28	82.71	90.54	58.44	73.77	42.12	59.27	83.87	91.22	14.31	25.04	59.55	71.02

Table 9. Results (IoU, F1, mIoU, and mF1 in %) of different methods on the VaiIRRG2PotsRGB task; the best results in each column are highlighted in bold.

Method	Impervious Surface		Building		Low Vegetation		Tree		Car		Clutter		mIoU	mF1
Method	IoU	F1	IoU	F1	IoU	F1	IoU	F1	IoU	F1	IoU	F1	mIoU	mF1
CIA-UDA	53.39	69.61	70.48	82.68	43.96	61.07	44.90	61.97	63.36	77.57	9.20	16.86	47.55	61.63
ProDA	44.77	62.03	46.37	63.06	35.84	52.75	30.56	46.91	41.21	59.27	11.13	20.51	34.98	50.76
JDAF	60.05	75.04	71.42	83.33	27.79	43.49	38.74	55.84	58.64	73.93	18.09	30.63	45.79	60.38
FGUDA	66.11	79.75	68.63	81.32	35.47	51.85	28.64	43.51	65.45	80.17	10.84	17.49	45.86	59.74
DACS	71.76	83.56	85.53	92.20	47.52	64.43	12.43	22.11	75.34	85.94	1.93	3.79	49.09	58.67
CLIP-ProCL	65.49	79.14	75.63	86.12	36.09	53.03	35.55	52.45	75.08	85.77	0.65	1.30	48.08	59.64
CLIP-UDA	66.28	79.72	65.68	79.29	44.94	62.09	40.61	57.69	84.06	91.09	2.54	4.96	50.68	62.47
SIT-UDA	73.32	84.61	83.23	90.85	59.91	74.93	40.75	57.91	84.16	91.40	1.70	3.35	57.18	67.17

Table 10. Results (IoU, F1, mIoU, and mF1 in %) of different methods on the Rural2Urban and Urban2Rural tasks; the best results in each column are highlighted in bold.

Task	Method	Background		Building		Road		Water		Barren		Forest		Agriculture		mIoU	mF1
Task	Method	IoU	F1	IoU	F1	IoU	F1	IoU	F1	IoU	F1	IoU	F1	IoU	F1	mIoU	mF1
Rural2Urban	DACS	39.57	57.23	53.74	69.70	49.65	66.35	66.02	78.97	29.97	45.71	45.60	61.72	54.71	71.38	48.47	64.44
	CLIP-ProCL	40.57	57.72	60.22	75.17	51.81	68.81	66.04	79.55	40.10	56.12	46.30	62.40	54.64	71.32	51.38	67.30
	CLIP-UDA	35.32	52.20	47.68	64.57	45.80	61.91	60.91	75.71	42.82	60.90	48.61	65.42	48.72	65.52	47.12	63.75
	SIT-UDA	38.67	55.78	56.43	72.15	53.90	70.05	63.22	77.47	43.97	61.08	50.85	67.42	55.86	71.68	51.84	67.95
Urban2Rural	DACS	41.99	59.14	45.94	62.96	34.58	51.39	59.94	74.74	6.48	12.32	26.51	39.33	31.52	47.93	35.28	49.69
	CLIP-ProCL	49.11	66.65	47.42	63.46	33.33	50.00	60.88	75.69	12.27	21.85	30.82	47.12	34.61	51.42	38.35	53.44
	CLIP-UDA	44.97	62.04	58.22	73.59	45.21	63.06	38.75	54.67	8.18	15.12	29.87	46.00	16.43	28.23	34.52	48.96
	SIT-UDA	50.65	67.25	60.05	75.04	46.82	63.78	44.42	61.52	6.68	12.52	32.51	49.07	29.77	45.88	38.70	53.58

Table 11. Ablation experimental results of SIT-UDA on different UDA tasks.

	Self-Training	Learnable Text Prompt	CTC	EGPW	PotsIRRG2VaiIRRG			VaiIRRG2PotsIRRG			Rural2Urban
	Self-Training	Learnable Text Prompt	CTC	EGPW	mIoU	mF1	OA	mIoU	mF1	OA	mIoU	mF1	OA
Baseline (CLIP-UDA)	✓				62.35	76.03	80.57	56.23	67.27	76.72	47.12	63.75	62.54
B + L	✓	✓			65.41	78.41	83.02	57.19	68.44	77.87	49.25	65.67	65.26
B + E	✓			✓	64.46	77.90	81.48	57.64	68.45	78.04	48.64	65.21	64.02
B + L + C	✓	✓	✓		67.73	80.06	84.33	58.51	69.55	78.23	50.80	66.96	66.08
B + L + E	✓	✓		✓	66.78	79.41	83.81	57.89	68.49	78.09	50.49	66.74	65.56
B + L + C + E	✓	✓	✓	✓	69.66	81.55	85.35	59.55	71.02	79.06	51.84	67.95	66.75

Table 12. The effect of text prompt in teacher model on different UDA tasks.

	PotsIRRG2VaiIRRG			VaiIRRG2PotsIRRG			Rural2Urban
	mIoU	mF1	OA	mIoU	mF1	OA	mIoU	mF1	OA
Learnable text prompts + [CLASS]	68.51	80.92	84.59	58.99	69.75	78.73	50.77	66.88	66.31
A photo of [CLASS]	69.66	81.55	85.35	59.55	71.02	79.06	51.84	67.95	66.75

Table 13. Sensitivity analysis of confidence threshold on PotsIRRG2VaiIRRG and VaiIRRG2PotsIRRG tasks.

Confidence Threshold	PotsIRRG2VaiIRRG		VaiIRRG2PotsIRRG
Confidence Threshold	mIoU	OA	mIoU	OA
0.1	65.36	83.89	54.28	76.55
0.3	66.12	84.05	55.24	77.13
0.5	67.05	84.04	56.50	77.92
0.7	69.10	85.27	58.98	78.65
0.9	69.66	85.35	59.55	79.06
1.0	67.34	84.67	57.15	78.02

Table 14. Parameters, GFLOPs, training time, and inference speed of the compared methods on the PotsIRRG2VaiIRRG task.

Method	Params (M)	GFLOPs	Training Time (h)	Inference Speed (FPS)
CLIP-UDA	46.20	61.50	4.13	81.88
SIT-UDA	46.23	65.49	4.53	62.57
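
To make the overhead in Table 14 easier to read, the relative change of SIT-UDA with respect to CLIP-UDA can be computed directly from the two rows. This is plain arithmetic on the reported values; the dictionary keys and the helper name `relative_change` are hypothetical.

```python
# Rows copied from Table 14 (parameters in M, GFLOPs, training hours, FPS).
clip_uda = {"params_m": 46.20, "gflops": 61.50, "train_h": 4.13, "fps": 81.88}
sit_uda = {"params_m": 46.23, "gflops": 65.49, "train_h": 4.53, "fps": 62.57}


def relative_change(new, old):
    """Percentage change of `new` relative to `old`."""
    return 100.0 * (new - old) / old


# Relative change of SIT-UDA over CLIP-UDA for every reported quantity.
overhead = {k: relative_change(sit_uda[k], clip_uda[k]) for k in clip_uda}

for name, pct in overhead.items():
    print(f"{name}: {pct:+.2f}%")
```

On these numbers the parameter count is nearly unchanged, GFLOPs and training time rise by roughly 6.5% and 9.7%, and throughput drops by roughly 23.6%.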

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Liu, Q.; Wang, X. Self-Training Based Image–Text Multimodal Unsupervised Domain Adaptation Segmentation Model for Remote Sensing Images. Remote Sens. 2026, 18, 651. https://doi.org/10.3390/rs18040651
