Exploring Difference Semantic Prior Guidance for Remote Sensing Image Change Captioning
Highlights
- A novel RSICC framework unifies the difference context and the semantic changes of an RSI pair, enhancing cross-modal association through cross refined attention. The network comprises three core modules: a dual-branch difference extraction layer, a difference comprehending module, and a cross refined attention block.
- To align common and difference context features, a dual-branch difference extraction layer, composed of symmetric difference context encoding and shallow difference auxiliary encoding, is designed to extract multi-scale difference features.
- To learn semantic knowledge from the difference representations, a trainable difference comprehending module generates high-level linguistic embeddings of the observed changes; the detected change content and the embedded textual cues are then integrated in a novel cross refined attention block to capture richer cross-modal interaction information.
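The cross refined attention block described above can be pictured as visual difference tokens attending over linguistic change embeddings. The following numpy sketch is purely illustrative and is not the authors' implementation; all shapes, names, and the single-head formulation are assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, key_value, d_k):
    """Single-head cross attention: each query token attends over key_value tokens."""
    scores = query @ key_value.T / np.sqrt(d_k)   # (Nq, Nk) similarity scores
    weights = softmax(scores, axis=-1)            # rows sum to 1
    return weights @ key_value                    # (Nq, d) refined features

# Hypothetical toy shapes: 64 difference-feature tokens, 5 semantic-word
# embeddings, feature dimension 32.
rng = np.random.default_rng(0)
diff_feats = rng.standard_normal((64, 32))   # visual difference features
sem_embeds = rng.standard_normal((5, 32))    # linguistic change embeddings
fused = cross_attention(diff_feats, sem_embeds, d_k=32)
print(fused.shape)  # (64, 32)
```

In the actual model the refined features would then feed the caption decoder; here the point is only the direction of attention, from visual difference tokens to textual cues.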
Abstract
1. Introduction
2. Related Work
2.1. Cross-Temporal Change Location Methods
2.2. Multi-Task Change Aggregation Methods
3. Method
3.1. Standard Change Captioning
3.2. Image Pair Encoder
3.3. Context-Aware Difference Learning
3.4. Difference Comprehending
3.5. Change Caption Decoder
3.6. Training Strategy
4. Experiments and Analysis
4.1. Dataset and Setting
4.1.1. Datasets
- (1) LEVIR-CC dataset [7]: Each <imageT1,imageT2> pair focuses on changes across building instances. The dataset consists of 10,077 RSI pairs with time spans of 5 to 14 years. Each image has 256 × 256 pixels at a resolution of 0.5 m/pixel. For annotations, the longest change sentence contains 39 words. Following the official split [7], 6815, 1333, and 1929 <imageT1,imageT2,sentences> triplets are used for training, validation, and testing, respectively.
- (2) Dubai-CC dataset [6]: The original bi-temporal RSIs cover Dubai urban environments, spanning 19 May 2000 to 16 June 2010. To facilitate the RSICC task, each RSI is cropped into 50 × 50 pixel patches, yielding 500 sliced <imageT1,imageT2> pairs. For experimental conditions consistent with the LEVIR-CC dataset, each RSI is upsampled to 256 × 256 pixels. Notably, the changing scenarios are fewer than in the LEVIR-CC dataset, and the longest annotated sentence contains 27 words, shorter than that of the LEVIR-CC dataset. We also follow the split of [7] and use 300, 50, and 150 <imageT1,imageT2,sentences> triplets for training, validation, and testing, respectively.
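The 50 × 50 → 256 × 256 upsampling step can be sketched as below. The excerpt does not state which interpolation is used, so a simple nearest-neighbour resize is shown as an assumption; in practice bilinear interpolation via an image library would be equally plausible:

```python
import numpy as np

def upsample_nn(img, size):
    """Nearest-neighbour resize of an (H, W[, C]) array to (size, size[, C])."""
    h, w = img.shape[:2]
    # Map each output row/column back to its nearest source index.
    rows = (np.arange(size) * h) // size
    cols = (np.arange(size) * w) // size
    return img[rows][:, cols]

# Hypothetical 50x50 RGB slice, as produced by the Dubai-CC cropping step.
patch = np.zeros((50, 50, 3), dtype=np.uint8)
big = upsample_nn(patch, 256)
print(big.shape)  # (256, 256, 3)
```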
4.1.2. Evaluation Metrics
4.1.3. Train Details and Experimental Setup
4.1.4. Compared Models
4.2. Evaluation Results and Analysis
5. Discussion
5.1. Ablation Experiments
5.2. Parameter Analysis
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| RSICC | Remote sensing image change captioning. |
| RSI | Remote sensing image. |
| CNN | Convolutional neural networks. |
| LSTM | Long short-term memory. |
| VGG | Visual geometry group. |
| ResNet | Residual network. |
| BLEU | Bilingual evaluation understudy. |
| ROUGE-L | Recall-oriented understudy for gisting evaluation (longest common subsequence). |
| METEOR | Metric for Evaluation of translation with explicit ordering. |
| CIDEr | Consensus-based image description evaluation. |
| SOTA | State-of-the-art models. |
| MHA | Multi-head attention. |
| TACC | Text-augmented change captioning method. |
| | The shallow bi-temporal features. |
| | The deep bi-temporal features. |
| | The top difference context features of the l-th layer. |
| | The shallow difference context features of the l-th layer. |
| | The fused difference feature from the top difference encoding module. |
| | The fused difference feature from the down difference encoding module. |
| S | The predicted change semantic words. |
| | The alignment feature between the change features and semantic words. |
| T | The maximum length of a ground-truth sentence. |
| | The generated word at time t. |
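BLEU-1, reported in the results tables, reduces for a single reference to clipped unigram precision scaled by a brevity penalty. This minimal sketch is illustrative only and is not the exact evaluation code used in the paper (which evaluates against multiple references); the example sentences are hypothetical:

```python
from collections import Counter
import math

def bleu1(candidate, reference):
    """BLEU-1 against a single reference: clipped unigram precision x brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    cand_counts, ref_counts = Counter(cand), Counter(ref)
    # Clip each candidate word's count by its count in the reference.
    clipped = sum(min(n, ref_counts[w]) for w, n in cand_counts.items())
    precision = clipped / len(cand)
    # Penalise candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

ref = "a new building appears on the bare land"
cand = "a building appears on the bare land"
print(round(bleu1(cand, ref), 4))  # 0.8669: precision 1.0, penalised for brevity
```

BLEU-2 through BLEU-4 extend the same idea to bigram through 4-gram precisions, combined as a geometric mean.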
References
- Zhang, X.; He, L.; Qin, K.; Dang, Q.; Si, H.; Tang, X.; Jiao, L. SMD-Net: Siamese Multi-Scale Difference-Enhancement Network for Change Detection in Remote Sensing. Remote Sens. 2022, 14, 1580. [Google Scholar] [CrossRef]
- Wang, G.; Zhang, X.; Peng, Z.; Tian, S.; Zhang, T.; Tang, X.; Jiao, L. OraL: An Observational Learning Paradigm for Unsupervised Hyperspectral Change Detection. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 5380–5393. [Google Scholar] [CrossRef]
- Zhang, X.; Hong, W.; Li, Z.; Cheng, X.; Tang, X.; Zhou, H.; Jiao, L. Hierarchical Knowledge Graph for Multilabel Classification of Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–14. [Google Scholar] [CrossRef]
- Zhang, X.; Chen, Y.; Wang, G.; Zhang, Y.; Jiao, L. EDDA: An Efficient Divide-and-Conquer Domain Adapter for Automatics Modulation Recognition. IEEE J. Sel. Top. Signal Process. 2025, 19, 140–153. [Google Scholar] [CrossRef]
- Zhang, X.; Li, Y.; Wang, X.; Liu, F.; Wu, Z.; Cheng, X.; Jiao, L. Multi-Source Interactive Stair Attention for Remote Sensing Image Captioning. Remote Sens. 2023, 15, 579. [Google Scholar] [CrossRef]
- Hoxha, G.; Chouaf, S.; Melgani, F.; Smara, Y. Change captioning: A new paradigm for multitemporal remote sensing image analysis. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–14. [Google Scholar] [CrossRef]
- Liu, C.; Zhao, R.; Chen, H.; Zou, Z.; Shi, Z. Remote sensing image change captioning with dual-branch transformers: A new method and a large scale dataset. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–20. [Google Scholar] [CrossRef]
- Wang, Y.; Yu, W.; Ghamisi, P. Change Captioning in Remote Sensing: Evolution to SAT-Cap—A Single-Stage Transformer Approach. arXiv 2025, arXiv:2501.08114. [Google Scholar]
- Chang, S.; Ghamisi, P. Changes to captions: An attentive network for remote sensing change captioning. IEEE Trans. Image Process. 2023, 32, 6047–6060. [Google Scholar] [CrossRef]
- Sun, Q.; Wang, Y.; Song, X. Scene Graph and Dependency Grammar Enhanced Remote Sensing Change Caption Network (SGD-RSCCN). In Proceedings of the 31st International Conference on Computational Linguistics, Abu Dhabi, United Arab Emirates, 19–24 January 2025; pp. 2121–2130. [Google Scholar]
- Zhou, Q.; Gao, J.; Yuan, Y.; Wang, Q. Single-stream Extractor Network with Contrastive Pre-training for Remote Sensing Change Captioning. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–14. [Google Scholar] [CrossRef]
- Sun, D.; Bao, Y.; Liu, J.; Cao, X. A lightweight sparse focus transformer for remote sensing image change captioning. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 18727–18738. [Google Scholar] [CrossRef]
- Chen, H.; Shi, Z. A spatial-temporal attention-based method and a new dataset for remote sensing image change detection. Remote Sens. 2020, 12, 1662. [Google Scholar] [CrossRef]
- De Bem, P.P.; de Carvalho Junior, O.A.; Fontes Guimarães, R.; Trancoso Gomes, R.A. Change detection of deforestation in the Brazilian Amazon using landsat data and convolutional neural networks. Remote Sens. 2020, 12, 901. [Google Scholar] [CrossRef]
- Khan, S.H.; He, X.; Porikli, F.; Bennamoun, M. Forest change detection in incomplete satellite images with deep neural networks. IEEE Trans. Geosci. Remote Sens. 2017, 55, 5407–5423. [Google Scholar] [CrossRef]
- Liu, C.; Yang, J.; Qi, Z.; Zou, Z.; Shi, Z. Progressive scale-aware network for remote sensing image change captioning. In Proceedings of the IGARSS 2023—2023 IEEE International Geoscience and Remote Sensing Symposium, Pasadena, CA, USA, 16–21 July 2023; pp. 6668–6671. [Google Scholar]
- Cai, C.; Wang, Y.; Yap, K.H. Interactive change-aware transformer network for remote sensing image change captioning. Remote Sens. 2023, 15, 5611. [Google Scholar] [CrossRef]
- Hang, R.; Luo, J.; Lin, H.; Liu, Q. Text-Augmented Semantic Feature Extraction and Difference Information Learning for Remote Sensing Image Change Captioning. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5636112. [Google Scholar] [CrossRef]
- Yang, Y.; Liu, T.; Pu, Y.; Liu, L.; Zhao, Q.; Wan, Q. Remote sensing image change captioning using multi-attentive network with diffusion model. Remote Sens. 2024, 16, 4083. [Google Scholar] [CrossRef]
- Li, X.; Sun, B.; Wu, Z.; Li, S.; Guo, H. Cd4c: Change detection for remote sensing image change captioning. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 9181–9194. [Google Scholar] [CrossRef]
- Li, Y.; Zhang, X.; Gu, J.; Li, C.; Wang, X.; Tang, X.; Jiao, L. Recurrent attention and semantic gate for remote sensing image captioning. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–16. [Google Scholar] [CrossRef]
- Zhang, X.; Wang, X.; Tang, X.; Zhou, H.; Li, C. Description generation for remote sensing images using attribute attention mechanism. Remote Sens. 2019, 11, 612. [Google Scholar] [CrossRef]
- Zhang, Z.; Diao, W.; Zhang, W.; Yan, M.; Gao, X.; Sun, X. LAM: Remote sensing image captioning with Label-Attention Mechanism. Remote Sens. 2019, 11, 2349. [Google Scholar] [CrossRef]
- Qiu, Y.; Yamamoto, S.; Nakashima, K.; Suzuki, R.; Iwata, K.; Kataoka, H.; Satoh, Y. Describing and localizing multiple changes with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 1971–1980. [Google Scholar]
- Li, Y.; Zhang, X.; Cheng, X.; Chen, P.; Jiao, L. Inter-temporal interaction and symmetric difference learning for remote sensing image change captioning. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–13. [Google Scholar]
- Wu, R.; Ye, H.; Liu, X.; Li, Z.; Sun, C.; Wu, J. A Cross-Spatial Differential Localization Network for Remote Sensing Change Captioning. Remote Sens. 2025, 17, 2285. [Google Scholar] [CrossRef]
- Liu, C.; Chen, K.; Chen, B.; Zhang, H.; Zou, Z.; Shi, Z. Rscama: Remote sensing image change captioning with state space model. IEEE Geosci. Remote Sens. Lett. 2024, 21, 1–5. [Google Scholar] [CrossRef]
- Karimli, O.; Mustafazade, I.; Karaca, A.C.; Amasyalı, F. Data augmentation in remote sensing image change captioning. In Proceedings of the 2024 9th International Conference on Computer Science and Engineering (UBMK), Antalya, Turkiye, 26–28 October 2024; pp. 287–292. [Google Scholar]
- Yang, K.; Wei, J.; Chen, C.; Wang, Z.; Lan, J.; Li, X.; Hua, D.; Xue, D.; Wu, Y. Restricted supervised Cascade Information Network for remote sensing change captioning with serial sentences. Int. J. Appl. Earth Obs. Geoinf. 2025, 142, 104686. [Google Scholar] [CrossRef]
- Wang, Z.; Wang, M.; Xu, S.; Li, Y.; Zhang, B. Ccexpert: Advancing mllm capability in remote sensing change captioning with difference-aware integration and a foundational dataset. arXiv 2024, arXiv:2411.11360. [Google Scholar] [CrossRef]
- Yu, X.; Li, Y.; Ma, J.; Li, C.; Wu, H. Diffusion-rscc: Diffusion probabilistic model for change captioning in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5617013. [Google Scholar] [CrossRef]
- Bai, Q.; Wang, X. Cross-Temporal Remote Sensing Image Change Captioning: A Manifold Mapping and Bayesian Diffusion Approach for Land Use Monitoring. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 14406–14415. [Google Scholar] [CrossRef]
- Li, X.; Sun, B.; Li, S. Detection assisted change captioning for remote sensing image. In Proceedings of the IGARSS 2024—2024 IEEE International Geoscience and Remote Sensing Symposium, Athens, Greece, 7–12 July 2024; pp. 10454–10458. [Google Scholar]
- Liu, C.; Chen, K.; Qi, Z.; Liu, Z.; Zhang, H.; Zou, Z.; Shi, Z. Pixel-level change detection pseudo-label learning for remote sensing change captioning. In Proceedings of the IGARSS 2024—2024 IEEE International Geoscience and Remote Sensing Symposium, Athens, Greece, 7–12 July 2024; pp. 8405–8408. [Google Scholar]
- Karaca, A.C.; Ozelbas, E.; Berber, S.; Karimli, O.; Yildirim, T.; Amasyali, M.F. Robust change captioning in remote sensing: Second-cc dataset and mmodalcc framework. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 21494–21513. [Google Scholar] [CrossRef]
- Liu, C.; Chen, K.; Zhang, H.; Qi, Z.; Zou, Z.; Shi, Z. Change-Agent: Toward Interactive Comprehensive Remote Sensing Change Interpretation and Analysis. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–16. [Google Scholar] [CrossRef]
- Shi, J.; Zhang, M.; Hou, Y.; Zhi, R.; Liu, J. A Multitask Network and Two Large-Scale Datasets for Change Detection and Captioning in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–17. [Google Scholar] [CrossRef]
- Yang, C.; Li, Z.; Jiao, H.; Gao, Z.; Zhang, L. Enhancing perception of key changes in remote sensing image change captioning. IEEE Trans. Image Process. 2025, 34, 7378–7390. [Google Scholar] [CrossRef] [PubMed]
- Zhu, Y.; Li, L.; Chen, K.; Liu, C.; Zhou, F.; Shi, Z. Semantic-cc: Boosting remote sensing image change captioning via foundational knowledge and semantic guidance. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5648916. [Google Scholar] [CrossRef]
- Sun, D.; Yao, J.; Xue, W.; Zhou, C.; Ghamisi, P.; Cao, X. Mask approximation net: A novel diffusion model approach for remote sensing change captioning. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5652311. [Google Scholar] [CrossRef]
- Frome, A.; Corrado, G.S.; Shlens, J.; Bengio, S.; Dean, J.; Ranzato, M.; Mikolov, T. Devise: A deep visual-semantic embedding model. Adv. Neural Inf. Process. Syst. 2013, 26, 2121–2129. [Google Scholar]
- Yang, Y.; Newsam, S. Bag-of-visual-words and spatial extensions for land-use classification. In Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, San Jose, CA, USA, 2–5 November 2010; pp. 270–279. [Google Scholar]
- Banerjee, S.; Lavie, A. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments; Association for Computational Linguistics: Stroudsburg, PA, USA, 2005; pp. 65–72. [Google Scholar]
- Lin, C. ROUGE: A Package for Automatic Evaluation of Summaries; Association for Computational Linguistics: Stroudsburg, PA, USA, 2004. [Google Scholar]
- Vedantam, R.; Zitnick, C.; Parikh, D. Cider: Consensus-based image description evaluation. arXiv 2015, arXiv:1411.5726. [Google Scholar] [CrossRef]
- Park, D.H.; Darrell, T.; Rohrbach, A. Robust change captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4624–4633. [Google Scholar]
Quantitative comparison on the LEVIR-CC test set:

| Methods | BLEU1 | BLEU2 | BLEU3 | BLEU4 | METEOR | ROUGE_L | CIDEr |
|---|---|---|---|---|---|---|---|
| DUDA | 0.8144 | 0.7222 | 0.6424 | 0.5779 | 0.3715 | 0.7104 | 1.2432 |
| MCCFormers-S | 0.8216 | 0.7295 | 0.6542 | 0.5941 | 0.3826 | 0.7210 | 1.2834 |
| MCCFormers-D | 0.8049 | 0.7111 | 0.6352 | 0.5734 | 0.3823 | 0.7140 | 1.2685 |
| RSICCformer | 0.8411 | 0.7540 | 0.6801 | 0.6193 | 0.3879 | 0.7302 | 1.3140 |
| Chg2Cap | 0.8614 | 0.7808 | 0.7066 | 0.6439 | 0.4003 | 0.7512 | 1.3661 |
| SEN | 0.8510 | 0.7705 | 0.7001 | 0.6409 | 0.3959 | 0.7457 | 1.3602 |
| ICT-Net | 0.8606 | 0.7812 | 0.7145 | 0.6612 | 0.4051 | 0.7521 | 1.3836 |
| Diffusion | 0.8628 | 0.7750 | 0.7109 | 0.6693 | 0.4016 | 0.7537 | 1.3861 |
| SFT | 0.8456 | 0.7587 | 0.6864 | 0.6287 | 0.3993 | 0.7469 | 1.3705 |
| SAT-Cap | 0.8614 | 0.7819 | 0.7144 | 0.6582 | 0.4051 | 0.7537 | 1.4023 |
| TACC | 0.8549 | 0.7741 | 0.7052 | 0.6462 | 0.4007 | 0.7496 | 1.3717 |
| Ours | 0.8649 | 0.7881 | 0.7221 | 0.6671 | 0.4161 | 0.7681 | 1.4334 |
Quantitative comparison on the Dubai-CC test set:

| Methods | BLEU1 | BLEU2 | BLEU3 | BLEU4 | METEOR | ROUGE_L | CIDEr |
|---|---|---|---|---|---|---|---|
| DUDA | 0.5885 | 0.4359 | 0.3363 | 0.2539 | 0.2205 | 0.4834 | 0.6278 |
| MCCFormers-S | 0.5297 | 0.3702 | 0.2762 | 0.2257 | 0.1864 | 0.4329 | 0.5381 |
| MCCFormers-D | 0.6465 | 0.5045 | 0.3936 | 0.2948 | 0.2509 | 0.5127 | 0.6651 |
| RSICCformer | 0.6792 | 0.5361 | 0.4137 | 0.3128 | 0.2541 | 0.5196 | 0.6654 |
| Chg2Cap | 0.7204 | 0.6018 | 0.5084 | 0.4170 | 0.2892 | 0.5866 | 0.9249 |
| SEN | 0.7095 | 0.5728 | 0.4581 | 0.3625 | 0.2662 | 0.5595 | 0.9177 |
| ICT-Net | 0.6938 | 0.5703 | 0.4650 | 0.3617 | 0.2678 | 0.5731 | 0.9297 |
| Diffusion | 0.7318 | 0.6136 | 0.5225 | 0.4541 | 0.3085 | 0.6056 | 0.9647 |
| SFT | 0.7204 | 0.6018 | 0.5084 | 0.4170 | 0.2892 | 0.5866 | 0.9249 |
| SAT-Cap | 0.7348 | 0.6098 | 0.5051 | 0.4080 | 0.2962 | 0.5906 | 0.9774 |
| TACC | 0.7217 | 0.5865 | 0.4824 | 0.3859 | 0.2817 | 0.5893 | 0.9545 |
| Ours | 0.7234 | 0.6052 | 0.5030 | 0.4065 | 0.2940 | 0.6266 | 1.0042 |
| Methods | Parameters (M) | Inference Speed (s) |
|---|---|---|
| RSICCformer | 56.20 | 12.67 |
| Chg2Cap | 32.81 | 39.58 |
| SEN | 39.90 | 3.79 |
| ICT-Net | 96.40 | 10.44 |
| TACC | 94.14 | 1.06 |
| Ours | 110.97 | 12.75 |
Ablation results:

| Methods | BLEU1 | BLEU2 | BLEU3 | BLEU4 | METEOR | ROUGE_L | CIDEr |
|---|---|---|---|---|---|---|---|
| Baseline | 0.8606 | 0.7812 | 0.7145 | 0.6612 | 0.4051 | 0.7521 | 1.3836 |
| D-S-Diff | 0.8625 | 0.7855 | 0.7212 | 0.6699 | 0.4170 | 0.7624 | 1.4015 |
| D-S-Diff+DE | 0.8652 | 0.7909 | 0.7286 | 0.6780 | 0.4159 | 0.7612 | 1.4152 |
| D-S-Diff(D)+DC | 0.8643 | 0.7894 | 0.7255 | 0.6735 | 0.4166 | 0.7612 | 1.4087 |
| D-S-Diff(S)+DC | 0.8629 | 0.7824 | 0.7109 | 0.6513 | 0.4083 | 0.7629 | 1.4207 |
| FULL(DE) | 0.8649 | 0.7881 | 0.7221 | 0.6671 | 0.4161 | 0.7681 | 1.4334 |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Li, Y.; Zhang, X.; Wang, G.; Zhang, T. Exploring Difference Semantic Prior Guidance for Remote Sensing Image Change Captioning. Remote Sens. 2026, 18, 232. https://doi.org/10.3390/rs18020232