Self-Training Based Image–Text Multimodal Unsupervised Domain Adaptation Segmentation Model for Remote Sensing Images
Highlights
- SIT-UDA integrates learnable text category prompts with image data for accurate semantic segmentation. Two strategies, entropy-guided pixel-level weighting (EGPW) and contrastive text constraint (CTC), are proposed to improve pseudo-label utilization and to strengthen the learning of domain-invariant, more discriminative features.
- Experiments on six representative remote sensing domain adaptation tasks demonstrate that SIT-UDA achieves superior balanced performance and exhibits stronger robustness compared with existing methods.
- SIT-UDA demonstrates that incorporating vision–language models into remote sensing tasks enhances generalization and domain-invariant representation learning.
- SIT-UDA shows strong potential for real-world applications such as land-cover monitoring across urban and rural domains and disaster response across regions.
Abstract
1. Introduction
- A multimodal UDA semantic segmentation model: The SIT-UDA model aligns and fuses learnable class text features with image features for segmentation and provides more credible pseudo-labels through the multimodal network, improving the model’s generalization performance.
- EGPW strategy: This strategy adaptively adjusts the loss weights of unlabeled pixels in the mixed images according to the entropy of the predicted probability map, so that training emphasizes high-confidence pseudo-labels and reduces interference from low-confidence ones.
- CTC strategy: This strategy encourages intra-class text features in the teacher and student models to become closer while driving inter-class text features further apart. The resulting optimized text features can effectively adapt to remote sensing domains while preserving domain-invariant and discriminative semantic representations.
2. Related Work
2.1. Image-Based UDA Semantic Segmentation
2.2. Image–Text Multimodal UDA Semantic Segmentation
2.3. UDA Semantic Segmentation for Remote Sensing Images
3. Self-Training-Based Image–Text Multimodal Unsupervised Domain Adaptation Semantic Segmentation Model
3.1. The Multimodal Segmentation Network
3.2. Entropy-Guided Pixel-Level Weighting Strategy
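As an illustration of the idea behind EGPW (loss weights for unlabeled pixels derived from the entropy of the prediction probability map), the following is a minimal sketch rather than the authors' formulation. The mapping from entropy to weight (one minus the normalized entropy) and the function names `entropy_weights` and `weighted_pseudo_label_loss` are assumptions made for this example.

```python
import numpy as np


def entropy_weights(probs, eps=1e-8):
    """Map a (C, H, W) softmax probability map to (H, W) weights in [0, 1].

    Assumption: weight = 1 - normalized entropy, so confident pixels
    (low entropy) get weights near 1 and uncertain pixels near 0.
    """
    num_classes = probs.shape[0]
    entropy = -np.sum(probs * np.log(probs + eps), axis=0)
    normalized = entropy / np.log(num_classes)  # scale entropy to [0, 1]
    return np.clip(1.0 - normalized, 0.0, 1.0)


def weighted_pseudo_label_loss(student_probs, pseudo_labels, weights, eps=1e-8):
    """Per-pixel cross-entropy against pseudo-labels, averaged with pixel weights.

    student_probs: (C, H, W) probabilities, pseudo_labels: (H, W) integer classes.
    """
    # Probability the student assigns to the pseudo-label class at each pixel.
    picked = np.take_along_axis(student_probs, pseudo_labels[None], axis=0)[0]
    cross_entropy = -np.log(picked + eps)
    return float(np.sum(weights * cross_entropy) / (np.sum(weights) + eps))


# Example: teacher predictions for a 1 x 2 image with 3 classes.
teacher_probs = np.array([[[0.90, 0.34]], [[0.05, 0.33]], [[0.05, 0.33]]])
pseudo_labels = teacher_probs.argmax(axis=0)
pixel_weights = entropy_weights(teacher_probs)  # first pixel weighted far more
```

In this sketch a pixel whose prediction is nearly uniform contributes almost nothing to the loss, while a confidently predicted pixel contributes with nearly full weight.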
3.3. Contrastive Text Constraint Strategy
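The CTC strategy pulls same-class text features of the teacher and student models together and pushes text features of different classes apart. One common way to express such a constraint is an InfoNCE-style loss over class text embeddings. The sketch below assumes that form; the function name `contrastive_text_loss`, the use of cosine similarity, and the temperature value 0.07 are illustrative assumptions, not values taken from the paper.

```python
import numpy as np


def contrastive_text_loss(student_text, teacher_text, temperature=0.07):
    """InfoNCE-style loss between student and teacher class text features.

    student_text, teacher_text: (K, D) arrays, one row per class.
    Row k of the student is the anchor, row k of the teacher is its positive,
    and the teacher rows of all other classes act as negatives.
    temperature=0.07 is an assumed placeholder value.
    """
    s = student_text / np.linalg.norm(student_text, axis=1, keepdims=True)
    t = teacher_text / np.linalg.norm(teacher_text, axis=1, keepdims=True)
    logits = s @ t.T / temperature                        # (K, K) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))             # positives lie on the diagonal


# Example: three classes with orthogonal text features.
class_features = np.eye(3)
loss_aligned = contrastive_text_loss(class_features, class_features)
loss_mismatched = contrastive_text_loss(class_features, np.roll(class_features, 1, axis=0))
```

The loss is small when each student class feature matches the teacher feature of the same class and grows when it is closer to the teacher feature of another class.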
3.4. Training and Inference
Algorithm 1. The training process of the proposed model.
4. Experiments and Result Analysis
4.1. Experimental Settings
4.1.1. Datasets and UDA Tasks
4.1.2. Evaluation Metrics
4.1.3. Implementation Details
4.2. Comparative Experiments and Results Discussion
4.2.1. PotsIRRG2VaiIRRG and PotsRGB2VaiIRRG
4.2.2. VaiIRRG2PotsIRRG and VaiIRRG2PotsRGB
4.2.3. Urban2Rural and Rural2Urban
5. Discussion
5.1. Ablation Study
5.1.1. Ablation of Text Prompts
5.1.2. Ablation of Contrastive Text Constraint
5.1.3. Parameter Sensitivity Analysis
5.2. Computational Complexity Analysis
5.3. Limitations and Future Study
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| RSIs | Remote Sensing Images |
| UDA | Unsupervised Domain Adaptation |
| EGPW | Entropy-Guided Pixel-Level Weighting |
| SIT-UDA | Self-Training-Based Image–Text Multimodal Unsupervised Domain Adaptation Semantic Segmentation |
| CTC | Contrastive Text Constraint |
| EMA | Exponential Moving Average |
References
- Wang, P.; Tang, Y.; Liao, Z.; Yan, Y.; Dai, L.; Liu, S.; Jiang, T. Road-side individual tree segmentation from urban MLS point clouds using metric learning. Remote Sens. 2023, 15, 1992.
- Tang, X.; Tu, Z.; Wang, Y.; Liu, M.; Li, D.; Fan, X. Automatic detection of coseismic landslides using a new transformer method. Remote Sens. 2022, 14, 2884.
- Marcos, D.; Volpi, M.; Kellenberger, B.; Tuia, D. Land cover mapping at very high resolution with rotation equivariant CNNs: Towards small yet accurate models. ISPRS J. Photogramm. Remote Sens. 2018, 145, 96–107.
- Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; Springer: Berlin, Germany, 2015; pp. 234–241.
- Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818.
- Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090.
- Gretton, A.; Borgwardt, K.M.; Rasch, M.J.; Schölkopf, B.; Smola, A. A kernel two-sample test. J. Mach. Learn. Res. 2012, 13, 723–773.
- Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. In Proceedings of the Twenty-Eighth Annual Conference on Neural Information Processing Systems (NIPS), Montreal, QC, Canada, 8–13 December 2014.
- Zou, Y.; Yu, Z.; Kumar, B.V.K.; Wang, J. Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 289–305.
- Zheng, Z.; Yang, Y. Rectifying pseudo label learning via uncertainty estimation for domain adaptive semantic segmentation. Int. J. Comput. Vis. 2021, 129, 1106–1120.
- Zhang, P.; Zhang, B.; Zhang, T.; Chen, D.; Wang, Y.; Wen, F. Prototypical pseudo label denoising and target structure learning for domain adaptive semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 12414–12424.
- Tarvainen, A.; Valpola, H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Proceedings of the Thirty-First Annual Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017.
- Yun, S.; Han, D.; Oh, S.J.; Chun, S.; Choe, J.; Yoo, Y. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 17 October–3 November 2019; pp. 6023–6032.
- Olsson, V.; Tranheden, W.; Pinto, J.; Svensson, L. Classmix: Segmentation-based data augmentation for semi-supervised learning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 1369–1378.
- Tranheden, W.; Olsson, V.; Pinto, J.; Svensson, L. Dacs: Domain adaptation via cross-domain mixed sampling. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 1379–1389.
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 8748–8763.
- Jia, C.; Yang, Y.; Xia, Y.; Chen, Y.T.; Parekh, Z.; Pham, H.; Le, Q.; Sung, Y.H.; Li, Z.; Duerig, T. Scaling up visual and vision–language representation learning with noisy text supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 4904–4916.
- Lüddecke, T.; Ecker, A. Image segmentation using text and image prompts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Denver, CO, USA, 3–7 June 2022; pp. 7086–7096.
- Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 4–6 October 2023; pp. 4015–4026.
- Kim, Y.E.; Lee, Y.W.; Lee, S.W. LC-MSM: Language-Conditioned Masked Segmentation Model for unsupervised domain adaptation. Pattern Recognit. 2024, 148, 110201.
- Zheng, A.; Wang, M.; Li, C.; Tang, J.; Luo, B. Entropy guided adversarial domain adaptation for aerial image semantic segmentation. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5405614.
- Toldo, M.; Michieli, U.; Agresti, G.; Zanuttigh, P. Unsupervised domain adaptation for mobile semantic segmentation based on cycle consistency and feature alignment. Image Vis. Comput. 2020, 95, 103889.
- Wang, L.; Xiao, P.; Zhang, X.; Chen, X. A Fine-Grained Unsupervised Domain Adaptation Framework for Semantic Segmentation of Remote Sensing Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 4109–4121.
- French, G.; Mackiewicz, M.; Fisher, M. Self-ensembling for visual domain adaptation. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
- Hoyer, L.; Dai, D.; Van Gool, L. Daformer: Improving network architectures and training strategies for domain-adaptive semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 21–24 June 2022; pp. 9924–9935.
- Rao, Y.; Zhao, W.; Chen, G.; Tang, Y.; Zhu, Z.; Huang, G.; Zhou, J.; Lu, J. Denseclip: Language-guided dense prediction with context-aware prompting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 21–24 June 2022; pp. 18082–18091.
- Li, B.; Weinberger, K.Q.; Belongie, S.; Koltun, V.; Ranftl, R. Language-Driven Semantic Segmentation. arXiv 2022, arXiv:2201.03546.
- Mata, C.; Ranasinghe, K.; Ryoo, M.S. Copt: Unsupervised domain adaptive segmentation using domain-agnostic text embeddings. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 October–4 November 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 424–440.
- Wang, H.; Jiang, Z.; Xie, L.; Jiang, D.; Shen, W.; Tian, Q. Domain-Adaptive Semantic Segmentation Emerges From Vision–Language Supervised Domain-Debiased Self-Training. In Proceedings of the ICASSP 2024–2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 3930–3934.
- Chen, J.; Zhu, J.; Guo, Y.; Sun, G.; Zhang, Y.; Deng, M. Unsupervised domain adaptation for semantic segmentation of high-resolution remote sensing imagery driven by category-certainty attention. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5616915.
- Huang, H.; Li, B.; Zhang, Y.; Chen, T.; Wang, B. Joint distribution adaptive-alignment for cross-domain segmentation of high-resolution remote sensing images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5401214.
- Ismael, S.F.; Kayabol, K.; Aptoula, E. Unsupervised domain adaptation for the semantic segmentation of remote sensing images via a class-aware Fourier transform and a fine-grained discriminator. Digit. Signal Process. 2024, 151, 104551.
- Zeng, W.; Cheng, M.; Yuan, Z.; Dai, W.; Wu, Y.; Liu, W.; Wang, C. Domain adaptive remote sensing image semantic segmentation with prototype guidance. Neurocomputing 2024, 580, 127484.
- Zhou, K.; Yang, J.; Loy, C.C.; Liu, Z. Learning to prompt for vision–language models. Int. J. Comput. Vis. 2022, 130, 2337–2348.
- Kirillov, A.; Girshick, R.; He, K.; Dollár, P. Panoptic feature pyramid networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019; pp. 6399–6408.
- Vu, T.H.; Jain, H.; Bucher, M.; Cord, M.; Pérez, P. Advent: Adversarial entropy minimization for domain adaptation in semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019; pp. 2517–2526.
- Bi, X.; Zhang, X.; Wang, S.; Zhang, H. Entropy-weighted reconstruction adversary and curriculum pseudo labeling for domain adaptation in semantic segmentation. Neurocomputing 2022, 506, 277–289.
- Wang, R.; Zhou, Q.; Zheng, G. EDRL: Entropy-guided disentangled representation learning for unsupervised domain adaptation in semantic segmentation. Comput. Methods Programs Biomed. 2023, 240, 107729.
- van den Oord, A.; Li, Y.; Vinyals, O. Representation learning with contrastive predictive coding. arXiv 2018, arXiv:1807.03748.
- Wang, W.; Zhou, T.; Yu, F.; Dai, J.; Konukoglu, E.; Van Gool, L. Exploring cross-image pixel contrast for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 7303–7313.
- Potsdam. ISPRS Potsdam 2D Semantic Labeling Dataset. 2018. Available online: https://www.isprs.org/resources/datasets/benchmarks/UrbanSemLab/2d-sem-label-potsdam.aspx (accessed on 1 June 2023).
- Vaihingen. ISPRS Vaihingen 2D Semantic Labeling Dataset. 2018. Available online: https://www.isprs.org/resources/datasets/benchmarks/UrbanSemLab/2d-sem-label-vaihingen.aspx (accessed on 1 June 2023).
- Wang, J.; Zheng, Z.; Ma, A.; Lu, X.; Zhong, Y. LoveDA: A Remote Sensing Land-Cover Dataset for Domain Adaptive Semantic Segmentation. arXiv 2021, arXiv:2110.08733.
- Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101.
- Shu, Y.; Guo, X.; Wu, J.; Wang, X.; Wang, J.; Long, M. CLIPood: Generalizing CLIP to Out-of-Distributions. arXiv 2023, arXiv:2302.00864.
- Ni, H.; Liu, Q.; Guan, H.; Tang, H.; Chanussot, J. Category-level Assignment for Cross-domain Semantic Segmentation in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5608416.
- Liu, K.; Zhu, C. Unsupervised Domain Adaptive Semantic Segmentation Based on Clip-Guided Prototypical Contrastive Learning. In Proceedings of the International Conference on Image Processing, ICIP, Abu Dhabi, United Arab Emirates, 27–30 October 2024; pp. 291–297.

| Dataset | Spectral | Average Size | Cropping Size | Resolution | Training Set | Test Set |
|---|---|---|---|---|---|---|
| Potsdam | IRRG, RGB | 6000 × 6000 | 512 × 512 | 5 cm | 24 images | 14 images |
| Vaihingen | IRRG | 2494 × 2064 | 512 × 512 | 9 cm | 25 images | 5 images |
| | | | 256 × 256 | | | |
| LoveDA | RGB | 1024 × 1024 | 1024 × 1024 | 30 cm | Urban: 1156 images | 677 images |
| | | | | | Rural: 1366 images | 992 images |
| Task | Source Domain | Target Domain | Domain Shift: Geographic Location | Domain Shift: Imaging Mode | Domain Shift: Spatial Resolution | Domain Shift: Geographical Landscape |
|---|---|---|---|---|---|---|
| PotsIRRG2VaiIRRG | Potsdam IRRG | Vaihingen IRRG | ✓ | ✗ | ↓ | ✗ |
| PotsRGB2VaiIRRG | Potsdam RGB | Vaihingen IRRG | ✓ | ✓ | ↓ | ✗ |
| VaiIRRG2PotsIRRG | Vaihingen IRRG | Potsdam IRRG | ✓ | ✗ | ↑ | ✗ |
| VaiIRRG2PotsRGB | Vaihingen IRRG | Potsdam RGB | ✓ | ✓ | ↑ | ✗ |
| Urban2Rural | Urban | Rural | ✗ | ✗ | ✗ | ✓ |
| Rural2Urban | Rural | Urban | ✗ | ✗ | ✗ | ✓ |
| Parameter Descriptions | Confidence Threshold | EMA Updater Coefficient | Temperature Coefficient | Loss Weight |
|---|---|---|---|---|
| Values |
| Method | Segmentation Framework | Image Encoder | Pretrained | Domain Alignment: Image Level | Domain Alignment: Feature Level | Domain Alignment: Output Level | Self-Training: Pseudo-Label Filtering | Self-Training: Consistency Regularization |
|---|---|---|---|---|---|---|---|---|
| CIA-UDA [46] | Deeplabv3 | RN101 | ImageNet | ✓ | ✓ | ✓ | ||
| ProDA [11] | Deeplabv2 | RN101 | ImageNet | ✓ | ✓ | |||
| JDAF [31] | Deeplabv3 | RN101 | ImageNet | ✓ | ✓ | ✓ | ||
| FGUDA [23] | Deeplabv3 | RN101 | ImageNet | ✓ | ✓ | ✓ | ||
| DACS [15] | Deeplabv3 | RN101 | ImageNet | ✓ | ||||
| Method | Segmentation Framework | Image Encoder | Pretrained | Text Prompt |
|---|---|---|---|---|
| CLIP-ProCL [47] | Deeplabv2 | RN101 | ImageNet | Learnable |
| CLIP-UDA [11] | Semantic FPN | RN50 | CLIP | Fixed |
| SIT-UDA | Semantic FPN | RN50 | CLIP | Learnable |
| Method | Impervious Surface IoU | Impervious Surface F1 | Building IoU | Building F1 | Low Vegetation IoU | Low Vegetation F1 | Tree IoU | Tree F1 | Car IoU | Car F1 | Clutter IoU | Clutter F1 | mIoU | mF1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CIA-UDA | 63.28 | 77.51 | 75.13 | 85.80 | 48.03 | 64.90 | 64.11 | 78.13 | 52.91 | 69.21 | 27.80 | 43.51 | 55.21 | 69.84 |
| ProDA | 62.51 | 76.85 | 71.61 | 82.95 | 34.49 | 51.65 | 56.26 | 72.09 | 39.20 | 56.52 | 3.99 | 8.21 | 44.68 | 58.05 |
| JDAF | 68.76 | 81.49 | 77.19 | 87.13 | 47.39 | 64.30 | 58.38 | 73.72 | 42.76 | 59.90 | 38.65 | 55.75 | 55.52 | 70.38 |
| FGUDA | 76.17 | 86.47 | 84.37 | 91.52 | 46.05 | 63.06 | 54.09 | 70.20 | 43.82 | 60.94 | 15.45 | 26.77 | 53.33 | 66.50 |
| DACS | 80.53 | 89.21 | 90.12 | 94.80 | 55.84 | 71.66 | 66.34 | 79.76 | 63.42 | 77.62 | 32.21 | 48.73 | 64.74 | 76.96 |
| CLIP-ProCL | 81.06 | 89.66 | 90.92 | 95.28 | 57.58 | 73.40 | 60.27 | 76.10 | 66.53 | 80.13 | 29.34 | 46.20 | 64.28 | 76.79 |
| CLIP-UDA | 75.07 | 85.76 | 82.59 | 90.46 | 54.51 | 70.56 | 56.86 | 72.50 | 61.20 | 75.93 | 43.88 | 60.99 | 62.35 | 76.03 |
| SIT-UDA | 80.77 | 89.37 | 86.64 | 92.84 | 61.34 | 76.04 | 68.62 | 81.39 | 69.46 | 81.98 | 51.15 | 67.68 | 69.66 | 81.55 |
| Method | Impervious Surface IoU | Impervious Surface F1 | Building IoU | Building F1 | Low Vegetation IoU | Low Vegetation F1 | Tree IoU | Tree F1 | Car IoU | Car F1 | Clutter IoU | Clutter F1 | mIoU | mF1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CIA-UDA | 62.63 | 77.02 | 79.71 | 88.71 | 33.31 | 49.97 | 63.43 | 77.62 | 52.28 | 68.66 | 13.50 | 23.78 | 50.81 | 64.29 |
| ProDA | 49.04 | 66.11 | 68.94 | 81.89 | 32.44 | 49.06 | 49.11 | 65.86 | 31.56 | 48.16 | 2.39 | 5.09 | 38.91 | 52.70 |
| JDAF | 64.33 | 78.29 | 75.53 | 86.06 | 42.16 | 59.31 | 51.99 | 68.41 | 45.87 | 62.90 | 32.71 | 49.30 | 52.10 | 67.38 |
| FGUDA | 73.80 | 84.92 | 83.76 | 91.16 | 43.27 | 60.40 | 44.41 | 61.50 | 43.24 | 60.38 | 12.61 | 22.39 | 50.18 | 63.46 |
| DACS | 56.03 | 71.82 | 73.94 | 85.01 | 40.28 | 57.42 | 47.65 | 64.54 | 47.80 | 64.69 | 21.29 | 35.11 | 47.83 | 63.10 |
| CLIP-UDA | 70.13 | 82.45 | 82.96 | 90.69 | 40.05 | 57.19 | 41.41 | 59.30 | 65.64 | 79.26 | 25.25 | 40.62 | 54.24 | 68.25 |
| CLIP-ProCL | 77.36 | 87.23 | 88.41 | 93.85 | 48.92 | 65.70 | 37.04 | 54.05 | 64.39 | 78.34 | 17.80 | 30.22 | 55.65 | 68.23 |
| SIT-UDA | 73.15 | 84.49 | 88.37 | 93.83 | 45.07 | 62.13 | 50.81 | 67.38 | 69.68 | 82.13 | 25.87 | 41.10 | 58.82 | 71.84 |
| Method | Impervious Surface IoU | Impervious Surface F1 | Building IoU | Building F1 | Low Vegetation IoU | Low Vegetation F1 | Tree IoU | Tree F1 | Car IoU | Car F1 | Clutter IoU | Clutter F1 | mIoU | mF1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CIA-UDA | 62.74 | 77.11 | 72.31 | 83.93 | 54.40 | 70.47 | 47.74 | 64.63 | 65.35 | 79.04 | 10.87 | 19.61 | 52.23 | 65.80 |
| ProDA | 44.70 | 61.72 | 56.85 | 72.49 | 40.55 | 57.71 | 31.59 | 48.02 | 46.78 | 63.74 | 10.63 | 19.21 | 38.51 | 53.82 |
| JDAF | 67.70 | 80.74 | 76.36 | 86.59 | 51.19 | 67.72 | 36.21 | 53.17 | 63.22 | 77.47 | 13.10 | 23.17 | 51.30 | 64.81 |
| FGUDA | 73.43 | 84.55 | 76.32 | 87.43 | 47.69 | 63.45 | 32.68 | 47.36 | 63.86 | 77.85 | 11.65 | 19.47 | 50.94 | 63.31 |
| DACS | 73.98 | 85.04 | 83.65 | 90.74 | 55.97 | 71.77 | 28.86 | 44.79 | 73.81 | 84.93 | 10.04 | 18.25 | 54.29 | 65.92 |
| CLIP-ProCL | 66.52 | 79.89 | 76.02 | 86.38 | 44.67 | 61.75 | 34.99 | 51.84 | 59.21 | 74.38 | 1.02 | 2.01 | 47.07 | 59.38 |
| CLIP-UDA | 75.16 | 85.82 | 82.35 | 90.32 | 53.86 | 70.01 | 35.46 | 52.36 | 81.98 | 90.10 | 8.57 | 15.01 | 56.23 | 67.27 |
| SIT-UDA | 75.87 | 86.28 | 82.71 | 90.54 | 58.44 | 73.77 | 42.12 | 59.27 | 83.87 | 91.22 | 14.31 | 25.04 | 59.55 | 71.02 |
| Method | Impervious Surface IoU | Impervious Surface F1 | Building IoU | Building F1 | Low Vegetation IoU | Low Vegetation F1 | Tree IoU | Tree F1 | Car IoU | Car F1 | Clutter IoU | Clutter F1 | mIoU | mF1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CIA-UDA | 53.39 | 69.61 | 70.48 | 82.68 | 43.96 | 61.07 | 44.90 | 61.97 | 63.36 | 77.57 | 9.20 | 16.86 | 47.55 | 61.63 |
| ProDA | 44.77 | 62.03 | 46.37 | 63.06 | 35.84 | 52.75 | 30.56 | 46.91 | 41.21 | 59.27 | 11.13 | 20.51 | 34.98 | 50.76 |
| JDAF | 60.05 | 75.04 | 71.42 | 83.33 | 27.79 | 43.49 | 38.74 | 55.84 | 58.64 | 73.93 | 18.09 | 30.63 | 45.79 | 60.38 |
| FGUDA | 66.11 | 79.75 | 68.63 | 81.32 | 35.47 | 51.85 | 28.64 | 43.51 | 65.45 | 80.17 | 10.84 | 17.49 | 45.86 | 59.74 |
| DACS | 71.76 | 83.56 | 85.53 | 92.20 | 47.52 | 64.43 | 12.43 | 22.11 | 75.34 | 85.94 | 1.93 | 3.79 | 49.09 | 58.67 |
| CLIP-ProCL | 65.49 | 79.14 | 75.63 | 86.12 | 36.09 | 53.03 | 35.55 | 52.45 | 75.08 | 85.77 | 0.65 | 1.30 | 48.08 | 59.64 |
| CLIP-UDA | 66.28 | 79.72 | 65.68 | 79.29 | 44.94 | 62.09 | 40.61 | 57.69 | 84.06 | 91.09 | 2.54 | 4.96 | 50.68 | 62.47 |
| SIT-UDA | 73.32 | 84.61 | 83.23 | 90.85 | 59.91 | 74.93 | 40.75 | 57.91 | 84.16 | 91.40 | 1.70 | 3.35 | 57.18 | 67.17 |
| Task | Method | Background IoU | Background F1 | Building IoU | Building F1 | Road IoU | Road F1 | Water IoU | Water F1 | Barren IoU | Barren F1 | Forest IoU | Forest F1 | Agriculture IoU | Agriculture F1 | mIoU | mF1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Rural2Urban | DACS | 39.57 | 57.23 | 53.74 | 69.70 | 49.65 | 66.35 | 66.02 | 78.97 | 29.97 | 45.71 | 45.60 | 61.72 | 54.71 | 71.38 | 48.47 | 64.44 |
| Rural2Urban | CLIP-ProCL | 40.57 | 57.72 | 60.22 | 75.17 | 51.81 | 68.81 | 66.04 | 79.55 | 40.10 | 56.12 | 46.30 | 62.40 | 54.64 | 71.32 | 51.38 | 67.30 |
| Rural2Urban | CLIP-UDA | 35.32 | 52.20 | 47.68 | 64.57 | 45.80 | 61.91 | 60.91 | 75.71 | 42.82 | 60.90 | 48.61 | 65.42 | 48.72 | 65.52 | 47.12 | 63.75 |
| Rural2Urban | SIT-UDA | 38.67 | 55.78 | 56.43 | 72.15 | 53.90 | 70.05 | 63.22 | 77.47 | 43.97 | 61.08 | 50.85 | 67.42 | 55.86 | 71.68 | 51.84 | 67.95 |
| Urban2Rural | DACS | 41.99 | 59.14 | 45.94 | 62.96 | 34.58 | 51.39 | 59.94 | 74.74 | 6.48 | 12.32 | 26.51 | 39.33 | 31.52 | 47.93 | 35.28 | 49.69 |
| Urban2Rural | CLIP-ProCL | 49.11 | 66.65 | 47.42 | 63.46 | 33.33 | 50.00 | 60.88 | 75.69 | 12.27 | 21.85 | 30.82 | 47.12 | 34.61 | 51.42 | 38.35 | 53.44 |
| Urban2Rural | CLIP-UDA | 44.97 | 62.04 | 58.22 | 73.59 | 45.21 | 63.06 | 38.75 | 54.67 | 8.18 | 15.12 | 29.87 | 46.00 | 16.43 | 28.23 | 34.52 | 48.96 |
| Urban2Rural | SIT-UDA | 50.65 | 67.25 | 60.05 | 75.04 | 46.82 | 63.78 | 44.42 | 61.52 | 6.68 | 12.52 | 32.51 | 49.07 | 29.77 | 45.88 | 38.70 | 53.58 |
| Configuration | Self-Training | Learnable Text Prompt | CTC | EGPW | PotsIRRG2VaiIRRG mIoU | PotsIRRG2VaiIRRG mF1 | PotsIRRG2VaiIRRG OA | VaiIRRG2PotsIRRG mIoU | VaiIRRG2PotsIRRG mF1 | VaiIRRG2PotsIRRG OA | Rural2Urban mIoU | Rural2Urban mF1 | Rural2Urban OA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline (CLIP-UDA) | ✓ | | | | 62.35 | 76.03 | 80.57 | 56.23 | 67.27 | 76.72 | 47.12 | 63.75 | 62.54 |
| B + L | ✓ | ✓ | | | 65.41 | 78.41 | 83.02 | 57.19 | 68.44 | 77.87 | 49.25 | 65.67 | 65.26 |
| B + E | ✓ | | | ✓ | 64.46 | 77.90 | 81.48 | 57.64 | 68.45 | 78.04 | 48.64 | 65.21 | 64.02 |
| B + L + C | ✓ | ✓ | ✓ | | 67.73 | 80.06 | 84.33 | 58.51 | 69.55 | 78.23 | 50.80 | 66.96 | 66.08 |
| B + L + E | ✓ | ✓ | | ✓ | 66.78 | 79.41 | 83.81 | 57.89 | 68.49 | 78.09 | 50.49 | 66.74 | 65.56 |
| B + L + C + E | ✓ | ✓ | ✓ | ✓ | 69.66 | 81.55 | 85.35 | 59.55 | 71.02 | 79.06 | 51.84 | 67.95 | 66.75 |
| Text Prompt | PotsIRRG2VaiIRRG mIoU | PotsIRRG2VaiIRRG mF1 | PotsIRRG2VaiIRRG OA | VaiIRRG2PotsIRRG mIoU | VaiIRRG2PotsIRRG mF1 | VaiIRRG2PotsIRRG OA | Rural2Urban mIoU | Rural2Urban mF1 | Rural2Urban OA |
|---|---|---|---|---|---|---|---|---|---|
| Learnable text prompts + [CLASS] | 68.51 | 80.92 | 84.59 | 58.99 | 69.75 | 78.73 | 50.77 | 66.88 | 66.31 |
| A photo of [CLASS] | 69.66 | 81.55 | 85.35 | 59.55 | 71.02 | 79.06 | 51.84 | 67.95 | 66.75 |
| Confidence Threshold | PotsIRRG2VaiIRRG mIoU | PotsIRRG2VaiIRRG OA | VaiIRRG2PotsIRRG mIoU | VaiIRRG2PotsIRRG OA |
|---|---|---|---|---|
| 0.1 | 65.36 | 83.89 | 54.28 | 76.55 |
| 0.3 | 66.12 | 84.05 | 55.24 | 77.13 |
| 0.5 | 67.05 | 84.04 | 56.50 | 77.92 |
| 0.7 | 69.10 | 85.27 | 58.98 | 78.65 |
| 0.9 | 69.66 | 85.35 | 59.55 | 79.06 |
| 1.0 | 67.34 | 84.67 | 57.15 | 78.02 |
| Method | Params (M) | FLOPs (G) | Training Time (h) | Inference Speed (FPS) |
|---|---|---|---|---|
| CLIP-UDA | 46.20 | 61.50 | 4.13 | 81.88 |
| SIT-UDA | 46.23 | 65.49 | 4.53 | 62.57 |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Liu, Q.; Wang, X. Self-Training Based Image–Text Multimodal Unsupervised Domain Adaptation Segmentation Model for Remote Sensing Images. Remote Sens. 2026, 18, 651. https://doi.org/10.3390/rs18040651