Towards Discriminative and Consistent Cross-Modal Alignment for Remote Sensing Image–Text Retrieval
Highlights
- A discriminative and consistent cross-modal alignment (DCCA) framework for remote sensing image–text retrieval is proposed, comprising global contrastive learning with negative pair expansion, a bidirectional intra-inter-modal distribution matching constraint, and a remote sensing information injection module, which together enhance both discriminability and modality consistency.
- By strengthening the mining of hard sample pairs, sharpening the contrast between positive and negative pairs, improving modality-aware distribution consistency, and injecting scene-discriminative information into a VLP model, DCCA achieves superior performance on both the RSITMD and RSICD benchmarks.
- The proposed global contrastive learning strategy with negative pair expansion emphasizes hard samples and improves discriminative capability, offering a generalizable solution for multi-modal alignment in settings characterized by high intra-modal similarity (see the sketch following this list).
- The proposed framework offers a data-efficient paradigm for transferring pretrained vision–language models to remote sensing scenarios, where injecting remote sensing visual knowledge substantially reduces the reliance on additional image–text corpora.
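The exact formulation of the negative pair expansion is given in Section 3.4; the following is a minimal PyTorch sketch of the general idea, assuming a CLIP-style symmetric InfoNCE loss whose candidate set is widened with intra-modal pairs so that near-duplicate scenes act as extra hard negatives. The function name, masking scheme, and temperature are illustrative assumptions, not the paper's verbatim implementation.

```python
import torch
import torch.nn.functional as F

def gcl_npe_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Sketch of global contrastive learning with negative pair expansion:
    for each image anchor, the candidates are all texts in the batch
    (inter-modal) plus all *other* images (intra-modal), and vice versa
    for text anchors, so highly similar scenes serve as hard negatives."""
    img = F.normalize(img_emb, dim=-1)   # (B, D)
    txt = F.normalize(txt_emb, dim=-1)   # (B, D)
    b = img.size(0)

    cross = img @ txt.t() / tau          # inter-modal similarity logits
    i2i = img @ img.t() / tau            # intra-modal image logits
    t2t = txt @ txt.t() / tau            # intra-modal text logits

    # A sample must never be its own negative: mask the diagonals.
    eye = torch.eye(b, dtype=torch.bool, device=img.device)
    i2i = i2i.masked_fill(eye, float("-inf"))
    t2t = t2t.masked_fill(eye, float("-inf"))

    # Positive for anchor i is cross[i, i]; columns B..2B-1 hold the
    # expanded intra-modal negatives.
    logits_i2t = torch.cat([cross, i2i], dim=1)       # (B, 2B)
    logits_t2i = torch.cat([cross.t(), t2t], dim=1)   # (B, 2B)
    targets = torch.arange(b, device=img.device)

    return 0.5 * (F.cross_entropy(logits_i2t, targets)
                  + F.cross_entropy(logits_t2i, targets))
```

Because every batch element contributes both cross-modal and intra-modal candidates, batches drawn from visually similar remote sensing scenes naturally supply the hard negatives the highlight refers to.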
Abstract
1. Introduction
- We present DCCA, a novel framework to facilitate discriminative and consistent cross-modal alignment, targeting two main challenges in RSITR: high similarity within modalities and limited adaptability of vision–language models pretrained on natural images to the remote sensing domain.
- We develop an enhanced contrastive learning strategy with negative pair expansion and a bidirectional intra-inter-modal distribution matching constraint, improving hard sample mining, cross-modal discrimination, and modality consistency. We further introduce a remote sensing information injection module that enhances visual discriminability and domain adaptability without requiring additional paired data (the constraint and the injection module are sketched after this list).
- Evaluations on multiple remote sensing benchmarks show that DCCA outperforms existing methods and performs on par with models pretrained on extensive remote sensing corpora, despite using far less training data.
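Section 3.5 defines the bidirectional intra-inter-modal distribution matching constraint precisely; as a rough illustration under our own assumptions, one can read it as encouraging the similarity distribution each anchor induces over the batch to agree across its intra-modal and inter-modal views, in both the image-anchored and text-anchored directions. The symmetric KL formulation below is a sketch, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def bidm_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Sketch of bidirectional intra-inter-modal distribution matching:
    align intra-modal and inter-modal row-wise similarity distributions
    for image anchors and text anchors alike."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)

    # Row-wise distributions each anchor induces over the batch.
    log_i2t = F.log_softmax(img @ txt.t() / tau, dim=1)  # inter-modal, image anchors
    p_i2i = F.softmax(img @ img.t() / tau, dim=1)        # intra-modal, image anchors
    log_t2i = F.log_softmax(txt @ img.t() / tau, dim=1)  # inter-modal, text anchors
    p_t2t = F.softmax(txt @ txt.t() / tau, dim=1)        # intra-modal, text anchors

    # KL(intra || inter) in both directions; batchmean gives per-anchor scaling.
    kl_img = F.kl_div(log_i2t, p_i2i, reduction="batchmean")
    kl_txt = F.kl_div(log_t2i, p_t2t, reduction="batchmean")
    return 0.5 * (kl_img + kl_txt)
```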
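Likewise, the remote sensing information injection module of Section 3.6 can be pictured as fusing scene-discriminative features from a remote-sensing-pretrained visual encoder into the VLP image representation. The gated-residual design below is speculative: the `RSInfoInjection` name, the feature dimensions, and the gating mechanism are all our assumptions for illustration.

```python
import torch
import torch.nn as nn

class RSInfoInjection(nn.Module):
    """Sketch: fuse scene-discriminative remote sensing features into a
    VLP image embedding via a gated residual (speculative design)."""

    def __init__(self, clip_dim: int = 512, rs_dim: int = 2048):
        super().__init__()
        self.proj = nn.Linear(rs_dim, clip_dim)  # map RS features to VLP space
        self.gate = nn.Sequential(nn.Linear(2 * clip_dim, clip_dim), nn.Sigmoid())

    def forward(self, clip_feat: torch.Tensor, rs_feat: torch.Tensor) -> torch.Tensor:
        rs = self.proj(rs_feat)
        # Per-dimension gate decides how much RS knowledge to inject.
        g = self.gate(torch.cat([clip_feat, rs], dim=-1))
        return clip_feat + g * rs

# Hypothetical usage: inject ResNet scene features (e.g., from an
# aerial-scene classifier) into CLIP ViT-B-16 embeddings.
inject = RSInfoInjection(clip_dim=512, rs_dim=2048)
fused = inject(torch.randn(8, 512), torch.randn(8, 2048))  # (8, 512)
```

A gated residual keeps the pretrained VLP embedding intact by default and lets training decide, dimension by dimension, how much domain knowledge to add, which is consistent with the paper's claim of adapting a VLP model without additional paired data.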
2. Related Work
2.1. Remote Sensing Image–Text Retrieval
2.2. Vision–Language Pretraining
2.3. Contrastive Learning
2.4. Comparative Analysis
3. Method
3.1. Problem Formulation
3.2. Model Overview
3.3. Feature Extraction
3.4. Global Contrastive Learning with Negative Pair Expansion
3.5. Bidirectional Intra-Inter-Modal Distribution Matching Constraint
3.6. Remote Sensing Information Injection
4. Experiments
4.1. Datasets and Evaluation Metrics
4.2. Baselines
4.3. Implementation Details
4.4. Comparison Results
4.5. Comparison of Different Contrastive Learning Variants
4.6. Ablation Studies
4.7. Comparison with Large-Scale Pretrained Remote Sensing Models
4.8. Hyperparameter Studies
4.9. Visualization of Image–Text Attention Heat Maps
4.10. Visualization of Retrieval Results
5. Discussion
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Joyce, K.E.; Belliss, S.E.; Samsonov, S.V.; McNeill, S.J.; Glassey, P.J. A review of the status of satellite remote sensing and image processing techniques for mapping natural hazards and disasters. Prog. Phys. Geogr. Earth Environ. 2009, 33, 183–207.
- Weiss, M.; Jacob, F.; Duveiller, G. Remote sensing for agricultural applications: A meta-review. Remote Sens. Environ. 2020, 236, 111402.
- Patino, J.E.; Duque, J.C. A review of regional science applications of satellite remote sensing in urban settings. Comput. Environ. Urban Syst. 2013, 37, 1–17.
- Xu, L.; Wang, L.; Zhang, J.; Ha, D.; Zhang, H. A Review of Cross-Modal Image-Text Retrieval in Remote Sensing. Remote Sens. 2025, 17, 3995.
- Wang, T.; Li, F.; Zhu, L.; Li, J.; Zhang, Z.; Shen, H.T. Cross-Modal Retrieval: A Systematic Review of Methods and Future Directions. Proc. IEEE 2024, 112, 1716–1754.
- Yuan, Z.; Zhang, W.; Fu, K.; Li, X.; Deng, C.; Wang, H.; Sun, X. Exploring a Fine-Grained Multiscale Method for Cross-Modal Remote Sensing Image Retrieval. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–19.
- Yuan, Z.; Zhang, W.; Tian, C.; Rong, X.; Zhang, Z.; Wang, H.; Fu, K.; Sun, X. Remote Sensing Cross-Modal Text-Image Retrieval Based on Global and Local Information. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–16.
- Zhang, S.; Li, Y.; Mei, S. Exploring Uni-Modal Feature Learning on Entities and Relations for Remote Sensing Cross-Modal Text-Image Retrieval. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–17.
- Ji, Z.; Meng, C.; Zhang, Y.; Pang, Y.; Li, X. Knowledge-Aided Momentum Contrastive Learning for Remote-Sensing Image Text Retrieval. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–13.
- Zhang, W.; Li, J.; Li, S.; Chen, J.; Zhang, W.; Gao, X.; Sun, X. Hypersphere-Based Remote Sensing Cross-Modal Text-Image Retrieval via Curriculum Learning. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–15.
- Pan, J.; Ma, Q.; Bai, C. A Prior Instruction Representation Framework for Remote Sensing Image-Text Retrieval. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 611–620.
- Tang, X.; Huang, D.; Ma, J.; Zhang, X.; Liu, F.; Jiao, L. Prior-Experience-Based Vision-Language Model for Remote Sensing Image-Text Retrieval. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–13.
- Chen, Y.; Huang, J.; Xiong, S.; Lu, X. Integrating Multisubspace Joint Learning with Multilevel Guidance for Cross-Modal Retrieval of Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–17.
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models from Natural Language Supervision. arXiv 2021, arXiv:2103.00020.
- Hu, G.; Wen, Z.; Lv, Y.; Zhang, J.; Wu, Q. Global-Local Information Soft-Alignment for Cross-Modal Remote-Sensing Image-Text Retrieval. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–15.
- Yang, R.; Wang, S.; Tao, J.; Han, Y.; Lin, Q.; Guo, Y.; Hou, B.; Jiao, L. Accurate and Lightweight Learning for Specific Domain Image-Text Retrieval. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, VIC, Australia, 28 October–1 November 2024; pp. 9719–9728.
- Plummer, B.A.; Wang, L.; Cervantes, C.M.; Caicedo, J.C.; Hockenmaier, J.; Lazebnik, S. Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 2641–2649.
- Lin, T.Y.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C.L.; Dollár, P. Microsoft COCO: Common Objects in Context. arXiv 2015, arXiv:1405.0312.
- Lu, X.; Wang, B.; Zheng, X.; Li, X. Exploring Models and Data for Remote Sensing Image Caption Generation. IEEE Trans. Geosci. Remote Sens. 2018, 56, 2183–2195.
- Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. arXiv 2019, arXiv:1908.10084.
- Xiao, S.; Liu, Z.; Zhang, P.; Muennighoff, N.; Lian, D.; Nie, J.Y. C-Pack: Packed Resources for General Chinese Embeddings. arXiv 2024, arXiv:2309.07597.
- Jia, C.; Yang, Y.; Xia, Y.; Chen, Y.T.; Parekh, Z.; Pham, H.; Le, Q.; Sung, Y.H.; Li, Z.; Duerig, T. Scaling Up Visual and Vision-Language Representation Learning with Noisy Text Supervision. In Proceedings of the 38th International Conference on Machine Learning, Online, 18–24 July 2021; pp. 4904–4916.
- Li, J.; Li, D.; Xiong, C.; Hoi, S. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. In Proceedings of the 39th International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; pp. 12888–12900.
- Jiang, Q.; Chen, C.; Zhao, H.; Chen, L.; Ping, Q.; Tran, S.D.; Xu, Y.; Zeng, B.; Chilimbi, T. Understanding and Constructing Latent Modality Structures in Multi-Modal Representation Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 7661–7671.
- Han, Z.; Zhang, S.; Su, Y.; Chen, X.; Mei, S. DR-AVIT: Toward Diverse and Realistic Aerial Visible-to-Infrared Image Translation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–13.
- Liu, F.; Chen, D.; Guan, Z.; Zhou, X.; Zhu, J.; Ye, Q.; Fu, L.; Zhou, J. RemoteCLIP: A Vision Language Foundation Model for Remote Sensing. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–16.
- Zhang, Z.; Zhao, T.; Guo, Y.; Yin, J. RS5M and GeoRSCLIP: A Large-Scale Vision-Language Dataset and a Large Vision-Language Model for Remote Sensing. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–23.
- Cheng, Q.; Zhou, Y.; Fu, P.; Xu, Y.; Zhang, L. A Deep Semantic Alignment Network for the Cross-Modal Image-Text Retrieval in Remote Sensing. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 4284–4297.
- Ji, Z.; Meng, C.; Zhang, Y.; Wang, H.; Pang, Y.; Han, J. Eliminate Before Align: A Remote Sensing Image-Text Retrieval Framework with Keyword Explicit Reasoning. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, VIC, Australia, 28 October–1 November 2024; pp. 1662–1671.
- Guan, J.; Shu, Y.; Li, W.; Song, Z.; Zhang, Y. PR-CLIP: Cross-Modal Positional Reconstruction for Remote Sensing Image-Text Retrieval. Remote Sens. 2025, 17, 2117.
- Zheng, C.; Li, X.; Liang, X.; Huang, L.; Du, S.; Nie, J.; Dong, J. Cross-Modal Progressive Perspective Matching Network for Remote Sensing Image-Text Retrieval. IEEE Trans. Multimed. 2025, 27, 3966–3978.
- Sun, T.; Zheng, C.; Li, X.; Gao, Y.; Nie, J.; Huang, L.; Wei, Z. Strong and Weak Prompt Engineering for Remote Sensing Image-Text Cross-Modal Retrieval. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 6968–6980.
- Zheng, C.; Nie, J.; Yin, B.; Li, X.; Qian, Y.; Wei, Z. Frequency- and Spatial-Domain Saliency Network for Remote Sensing Cross-Modal Retrieval. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–13.
- Lu, J.; Batra, D.; Parikh, D.; Lee, S. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 13–23.
- Li, L.H.; Yatskar, M.; Yin, D.; Hsieh, C.J.; Chang, K.W. What Does BERT with Vision Look At? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 5265–5275.
- Li, X.; Yin, X.; Li, C.; Zhang, P.; Hu, X.; Zhang, L.; Wang, L.; Hu, H.; Dong, L.; Wei, F.; et al. Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks. In Proceedings of the 16th European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 121–137.
- Kim, W.; Son, B.; Kim, I. ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision. In Proceedings of the 38th International Conference on Machine Learning, Online, 18–24 July 2021; pp. 5583–5594.
- Sohn, K. Improved Deep Metric Learning with Multi-Class N-Pair Loss Objective. In Proceedings of the 30th International Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 1857–1865.
- van den Oord, A.; Li, Y.; Vinyals, O. Representation Learning with Contrastive Predictive Coding. arXiv 2019, arXiv:1807.03748.
- He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum Contrast for Unsupervised Visual Representation Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 9726–9735.
- Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A Simple Framework for Contrastive Learning of Visual Representations. In Proceedings of the 37th International Conference on Machine Learning, Online, 13–18 July 2020; pp. 1597–1607.
- Chen, X.; He, K. Exploring Simple Siamese Representation Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 15745–15753.
- Sun, Y.; Cheng, C.; Zhang, Y.; Zhang, C.; Zheng, L.; Wang, Z.; Wei, Y. Circle Loss: A Unified Perspective of Pair Similarity Optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 6397–6406.
- Zhang, Y.; Jiang, H.; Miura, Y.; Manning, C.D.; Langlotz, C.P. Contrastive Learning of Medical Visual Representations from Paired Images and Text. In Proceedings of the 7th Machine Learning for Healthcare Conference, Durham, NC, USA, 5–6 August 2022; pp. 2–25.
- Jiang, D.; Ye, M. Cross-Modal Implicit Relation Reasoning and Aligning for Text-to-Image Person Retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 2787–2797.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Xia, G.S.; Hu, J.; Hu, F.; Shi, B.; Bai, X.; Zhong, Y.; Zhang, L.; Lu, X. AID: A Benchmark Data Set for Performance Evaluation of Aerial Scene Classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 3965–3981.
- Tang, X.; Wang, Y.; Ma, J.; Zhang, X.; Liu, F.; Jiao, L. Interacting-Enhancing Feature Transformer for Cross-Modal Remote-Sensing Image and Text Retrieval. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–15.
- Pan, J.; Ma, Q.; Bai, C. Reducing Semantic Confusion: Scene-Aware Aggregation Network for Remote Sensing Cross-Modal Retrieval. In Proceedings of the ACM International Conference on Multimedia Retrieval, Thessaloniki, Greece, 12–15 June 2023; pp. 398–406.
- He, Y.; Xu, X.; Chen, H.; Li, J.; Pu, F. Visual Global-Salient-Guided Network for Remote Sensing Image-Text Retrieval. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–14.
- Ma, Q.; Pan, J.; Bai, C. Direction-Oriented Visual-Semantic Embedding Model for Remote Sensing Image-Text Retrieval. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–14.
- Zhang, X.; Li, W.; Wang, X.; Wang, L.; Zheng, F.; Wang, L.; Zhang, H. A Fusion Encoder with Multi-Task Guidance for Cross-Modal Text-Image Retrieval in Remote Sensing. Remote Sens. 2023, 15, 4637.
- Yuan, Y.; Zhan, Y.; Xiong, Z. Parameter-Efficient Transfer Learning for Remote Sensing Image-Text Retrieval. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–14.
- Chen, D. ITRA. Available online: https://github.com/ChenDelong1999/ITRA (accessed on 18 June 2025).
- Chefer, H.; Gur, S.; Wolf, L. Generic Attention-Model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 387–396.
Comparison with existing methods on the RSITMD and RSICD datasets (Section 4.4):

| Dataset | Method | Image Backbone | Text Backbone | I2T R@1 | I2T R@5 | I2T R@10 | T2I R@1 | T2I R@5 | T2I R@10 | Mean Recall |
|---|---|---|---|---|---|---|---|---|---|---|
| RSITMD | AMFMN | ResNet | GRU | 11.06 | 29.20 | 38.72 | 9.96 | 34.03 | 52.96 | 29.32 |
| | GaLR | PP-YOLO + ResNet | GRU | 14.82 | 31.64 | 42.48 | 11.15 | 36.68 | 51.68 | 31.41 |
| | SWAN | ResNet | GRU | 13.35 | 32.15 | 46.90 | 11.24 | 40.40 | 60.60 | 34.11 |
| | KAMCL | ResNet | GRU | 16.51 | 36.28 | 49.12 | 13.50 | 42.15 | 59.32 | 36.14 |
| | VGSGN | ResNet | GRU | 14.16 | 34.96 | 50.66 | 13.23 | 42.57 | 63.41 | 36.50 |
| | DOVE | ResNet | GRU | 16.81 | 36.80 | 49.93 | 12.20 | 44.13 | 66.50 | 37.73 |
| | IEFT | Transformer | Transformer | 15.49 | 37.61 | 51.40 | 11.19 | 38.09 | 58.84 | 35.43 |
| | PIR | Swin-Transformer | BERT | 18.14 | 41.15 | 52.88 | 12.17 | 41.68 | 63.41 | 38.24 |
| | MTGFE | ViT-B-16 | BERT | 17.92 | 40.93 | 53.32 | 16.59 | 48.50 | 67.43 | 40.78 |
| | PE-RSITR | ViT-B-32 | Transformer | 23.67 | 44.07 | 60.36 | 20.10 | 50.63 | 67.97 | 44.47 |
| | CLIP-ZS | ViT-B-16 | Transformer | 9.29 | 25.00 | 33.41 | 7.74 | 29.03 | 45.66 | 25.02 |
| | CLIP-FT | ViT-B-16 | Transformer | 27.88 | 50.66 | 62.83 | 23.72 | 54.42 | 71.06 | 48.43 |
| | AIR | ViT-B-16 | Transformer | 29.20 | 49.78 | 65.27 | 26.06 | 57.04 | 73.98 | 50.22 |
| | GLISA | ViT * | Transformer | 32.08 | 51.99 | 63.94 | 23.36 | 58.27 | 74.47 | 50.69 |
| | DCCA (Ours) | ViT-B-16 | Transformer | 31.64 | 53.32 | 64.60 | 24.47 | 57.61 | 76.33 | 51.33 |
| RSICD | AMFMN | ResNet | GRU | 5.39 | 15.08 | 23.40 | 4.90 | 18.28 | 31.44 | 16.42 |
| | GaLR | PP-YOLO + ResNet | GRU | 6.59 | 19.85 | 31.04 | 4.69 | 19.48 | 32.13 | 18.96 |
| | SWAN | ResNet | GRU | 7.41 | 20.13 | 30.86 | 5.56 | 22.26 | 37.41 | 20.61 |
| | KAMCL | ResNet | GRU | 12.08 | 27.26 | 38.70 | 8.65 | 27.43 | 42.51 | 26.10 |
| | VGSGN | ResNet | GRU | 8.33 | 21.87 | 32.57 | 6.53 | 23.13 | 36.85 | 21.55 |
| | DOVE | ResNet | GRU | 8.66 | 22.35 | 34.95 | 6.04 | 23.95 | 40.35 | 22.72 |
| | IEFT | Transformer | Transformer | 8.78 | 28.47 | 43.88 | 8.38 | 28.17 | 44.16 | 26.97 |
| | PIR | Swin-Transformer | BERT | 9.88 | 27.26 | 39.16 | 6.97 | 24.56 | 38.92 | 24.46 |
| | MTGFE | ViT-B-16 | BERT | 15.28 | 37.05 | 51.60 | 8.67 | 27.56 | 43.92 | 30.68 |
| | PE-RSITR | ViT-B-32 | Transformer | 14.13 | 31.51 | 44.78 | 11.63 | 33.92 | 50.73 | 31.12 |
| | CLIP-ZS | ViT-B-16 | Transformer | 6.86 | 17.29 | 26.44 | 5.45 | 18.19 | 29.30 | 17.26 |
| | CLIP-FT | ViT-B-16 | Transformer | 18.94 | 37.88 | 49.59 | 14.69 | 39.58 | 54.71 | 35.90 |
| | AIR | ViT-B-16 | Transformer | 18.85 | 39.07 | 51.78 | 14.24 | 39.03 | 54.49 | 36.24 |
| | GLISA | ViT * | Transformer | 19.52 | 40.44 | 52.28 | 14.75 | 39.50 | 55.46 | 36.99 |
| | DCCA (Ours) | ViT-B-16 | Transformer | 18.85 | 40.99 | 54.07 | 15.41 | 41.68 | 57.97 | 38.16 |
Comparison of different contrastive learning variants on RSITMD and RSICD (Section 4.5):

| Dataset | Method | I2T R@1 | I2T R@5 | I2T R@10 | T2I R@1 | T2I R@5 | T2I R@10 | Mean Recall |
|---|---|---|---|---|---|---|---|---|
| RSITMD | | 27.88 | 50.66 | 62.83 | 23.72 | 54.42 | 71.06 | 48.43 |
| | | 28.98 | 49.56 | 62.17 | 23.32 | 55.62 | 72.48 | 48.69 |
| | | 31.64 | 51.11 | 62.61 | 24.60 | 56.95 | 73.01 | 49.99 |
| | (used in DCCA) | 28.54 | 52.65 | 64.60 | 24.65 | 57.43 | 73.41 | 50.21 |
| RSICD | | 18.94 | 37.88 | 49.59 | 14.69 | 39.58 | 54.71 | 35.90 |
| | | 17.75 | 39.98 | 51.78 | 14.38 | 40.93 | 57.73 | 37.09 |
| | | 19.49 | 38.88 | 53.98 | 15.28 | 41.43 | 58.06 | 37.85 |
| | (used in DCCA) | 18.57 | 38.98 | 52.52 | 15.74 | 41.37 | 58.24 | 37.57 |
Ablation studies on RSITMD and RSICD (Section 4.6); ✔ marks the components enabled in each configuration:

| Dataset | | | | | | I2T R@1 | I2T R@5 | I2T R@10 | T2I R@1 | T2I R@5 | T2I R@10 | Mean Recall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RSITMD | | | | | | 27.88 | 50.66 | 62.83 | 23.72 | 54.42 | 71.06 | 48.43 |
| | ✔ | | | | | 28.54 | 52.65 | 64.60 | 24.65 | 57.43 | 73.41 | 50.21 |
| | ✔ | | | | | 29.42 | 52.88 | 65.27 | 24.12 | 55.00 | 71.73 | 49.73 |
| | ✔ | ✔ | | | | 28.98 | 51.11 | 64.60 | 25.58 | 57.79 | 75.04 | 50.52 |
| | ✔ | ✔ | | | | 29.20 | 52.21 | 65.27 | 24.25 | 57.57 | 75.04 | 50.59 |
| | ✔ | ✔ | ✔ | | | 28.54 | 53.54 | 66.15 | 25.40 | 57.21 | 75.18 | 51.00 |
| | ✔ | ✔ | ✔ | ✔ | | 28.98 | 53.32 | 65.04 | 24.60 | 56.55 | 74.96 | 50.58 |
| | ✔ | ✔ | ✔ | ✔ | | 29.65 | 51.99 | 66.37 | 24.65 | 57.92 | 76.19 | 51.13 |
| | ✔ | ✔ | ✔ | ✔ | | 30.75 | 51.11 | 63.50 | 24.96 | 57.21 | 76.24 | 50.63 |
| | ✔ | ✔ | ✔ | ✔ | ✔ | 31.64 | 53.32 | 64.60 | 24.47 | 57.61 | 76.33 | 51.33 |
| RSICD | | | | | | 18.94 | 37.88 | 49.59 | 14.69 | 39.58 | 54.71 | 35.90 |
| | ✔ | | | | | 18.57 | 38.98 | 52.52 | 15.74 | 41.37 | 58.24 | 37.57 |
| | ✔ | | | | | 19.58 | 40.16 | 52.79 | 14.38 | 40.24 | 55.92 | 37.18 |
| | ✔ | ✔ | | | | 18.66 | 38.88 | 52.97 | 15.30 | 41.85 | 58.46 | 37.69 |
| | ✔ | ✔ | | | | 19.49 | 40.81 | 53.43 | 15.06 | 41.10 | 56.91 | 37.80 |
| | ✔ | ✔ | ✔ | | | 20.31 | 40.71 | 53.06 | 15.39 | 40.48 | 57.31 | 37.88 |
| | ✔ | ✔ | ✔ | ✔ | | 19.95 | 39.80 | 53.71 | 15.17 | 41.28 | 57.51 | 37.90 |
| | ✔ | ✔ | ✔ | ✔ | | 19.30 | 40.16 | 54.16 | 15.85 | 41.52 | 57.77 | 38.13 |
| | ✔ | ✔ | ✔ | ✔ | | 19.03 | 39.43 | 53.89 | 15.90 | 40.99 | 58.24 | 37.91 |
| | ✔ | ✔ | ✔ | ✔ | ✔ | 18.85 | 40.99 | 54.07 | 15.41 | 41.68 | 57.97 | 38.16 |
Comparison with large-scale pretrained remote sensing models (Section 4.7):

| Method | Trained On | Tested On | I2T R@1 | I2T R@5 | I2T R@10 | T2I R@1 | T2I R@5 | T2I R@10 | Mean Recall |
|---|---|---|---|---|---|---|---|---|---|
| RemoteCLIP-B-32 | RET-3 + DET-10 + SEG-4 | RSITMD | 27.88 | 50.66 | 65.71 | 22.17 | 56.46 | 73.41 | 49.38 |
| RemoteCLIP-L-14 | RET-3 + DET-10 + SEG-4 | | 28.76 | 52.43 | 63.94 | 23.76 | 59.51 | 74.73 | 50.52 |
| GeoRSCLIP | RS5M + RSITMD | | 30.09 | 51.55 | 63.27 | 23.54 | 57.52 | 74.60 | 50.10 |
| GeoRSCLIP | RS5M + RET-2 | | 32.30 | 53.32 | 67.92 | 25.04 | 57.88 | 74.38 | 51.81 |
| DCCA | RSITMD | | 31.64 | 53.32 | 64.60 | 24.47 | 57.61 | 76.33 | 51.33 |
| RemoteCLIP-B-32 | RET-3 + DET-10 + SEG-4 | RSICD | 17.02 | 37.97 | 51.51 | 13.71 | 37.11 | 54.25 | 35.26 |
| RemoteCLIP-L-14 | RET-3 + DET-10 + SEG-4 | | 18.39 | 37.42 | 51.05 | 14.73 | 39.93 | 56.58 | 36.35 |
| GeoRSCLIP | RS5M + RSICD | | 22.14 | 40.53 | 51.78 | 15.26 | 40.46 | 57.79 | 38.00 |
| GeoRSCLIP | RS5M + RET-2 | | 21.13 | 41.72 | 55.63 | 15.59 | 41.19 | 57.99 | 38.87 |
| DCCA | RSICD | | 18.85 | 40.99 | 54.07 | 15.41 | 41.68 | 57.97 | 38.16 |