Multimodal Prompt Learning for Spatial Reasoning in Remote Sensing Image Scene
Highlights
- A Unified Visual-Semantic Triple Prompt Learning (UVSTPL) framework is proposed that fuses the visual and text modalities to improve spatial relationship reasoning in remote sensing scenes.
- A new dataset, Geo-RSSG, is constructed with detailed ground object annotations, precise spatial relationships, and rich attributes; UVSTPL achieves state-of-the-art performance on it.
- The UVSTPL framework promotes the integration of vision-language multimodal learning and GeoAI, providing a new approach for intelligent interpretation of remote sensing data.
- The Geo-RSSG dataset fills a gap in high-quality multimodal benchmarks for remote sensing reasoning tasks and provides a reliable foundation for future research in this field.
Abstract
1. Introduction
- We construct Geo-RSSG, a large-scale remote sensing scene graph dataset with detailed annotations for various ground objects, precise spatial relationships, and rich attribute information.
- We propose the UVSTPL framework for spatial relationship reasoning. It combines multimodal features with prompt learning and adopts UVTransE as the relationship classifier, improving the reliability and accuracy of spatial relationship prediction for remote sensing images.
2. Related Work
2.1. Remote Sensing Image Scene Graph Relationship Reasoning
2.2. Scene Reasoning of Remote Sensing Images Based on Visual Language Models
3. Geo-RSSG Dataset
3.1. Dataset Description
3.2. Data Annotation
- (1) Data Collection: To meet the needs of practical applications, the high-resolution satellite images in the Geo-RSSG dataset were collected from Google Earth at a spatial resolution of 0.3 m per pixel. The original exported images are 20,350 × 20,225 pixels in RGB format. Valid regions without borders or watermarks were cropped into non-overlapping patches of 512 × 512 pixels, preserving the original spatial resolution. In total, 3587 remote sensing images were selected from two representative cities in China: 2249 images of the coastal city of Guangzhou and 1338 images of the plain city of Zhengzhou. These images cover three types of scenes: urban, suburban, and rural areas.
- (2) Scene Graph Annotation: We designed a standardized dataset annotation process, as shown in Figure 3. In the object annotation stage, valuable image patches are extracted from the remote sensing images with their latitude and longitude coordinates retained, and instance segmentation is performed on the geographic objects. All instance segmentation masks are manually annotated by professional annotators with reference to the original imagery. For the attribute annotation and relationship annotation stages, we developed an annotation tool, AR-annotation, to annotate the topological and directional relationships between adjacent ground objects.
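The patch arithmetic in the data collection step can be checked with a short sketch. It assumes that patches are laid out on a regular grid starting at the top-left corner and that partial patches at the right and bottom edges are discarded; the function name `patch_grid` is illustrative and not part of any tool described here. Since 20,350 and 20,225 each contain 39 full multiples of 512, the result does not depend on which value is the width.

```python
def patch_grid(width, height, patch=512):
    """Return top-left corners of all full, non-overlapping patches."""
    cols = width // patch   # full patches that fit horizontally
    rows = height // patch  # full patches that fit vertically
    return [(c * patch, r * patch) for r in range(rows) for c in range(cols)]


# Size of one exported image, as stated in the data collection step.
origins = patch_grid(20350, 20225)

# Upper bound per exported image, before invalid regions are excluded.
patches_per_export = len(origins)

# Ground extent of one patch side in metres at 0.3 m per pixel.
footprint_m = 512 * 0.3

print(patches_per_export, origins[-1], footprint_m)
```

Under these assumptions one exported image yields at most 39 × 39 = 1521 patches before regions with borders or watermarks are excluded, and each patch covers roughly 153.6 m on a side.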
3.3. Data Analysis
4. UVSTPL Framework
4.1. Subject-Object Pairing
4.2. Triplet Prompt Learning
4.3. UVTransE Classifier
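As commonly formulated in the visual relationship detection literature, UVTransE treats the predicate as a translation in embedding space: the embedding of the union region minus the embeddings of the subject and the object. The sketch below assumes that formulation. The projection matrices, the 2-D toy features, and the names `project` and `uvtranse_predicate` are illustrative placeholders, and the classification head that UVSTPL places on top is not shown.

```python
def project(vec, weight):
    """Apply a linear map (given as a list of rows) to a feature vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def uvtranse_predicate(subj, obj, union, w_s, w_o, w_u):
    """Predicate embedding p = W_u*u - W_s*s - W_o*o."""
    s = project(subj, w_s)
    o = project(obj, w_o)
    u = project(union, w_u)
    return [ui - si - oi for ui, si, oi in zip(u, s, o)]


# Toy 2-D features; identity projections keep the arithmetic easy to follow.
identity = [[1.0, 0.0], [0.0, 1.0]]
p = uvtranse_predicate(
    subj=[1.0, 0.0], obj=[0.0, 1.0], union=[2.0, 3.0],
    w_s=identity, w_o=identity, w_u=identity,
)
print(p)  # [1.0, 2.0]
```

In a trained model the three projections would be learned, and the resulting predicate embedding would be passed to a classifier over the relationship categories.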
5. Experiments and Results
5.1. Experimental Setup
5.2. Comparison Experiments
5.3. Results Analysis
5.3.1. Analysis of Model Relationship Prediction Performance
5.3.2. Analysis of the Distribution of Model Relationship Prediction Precision
5.3.3. Clustering Analysis of the Impact of Context Length
5.4. Ablation Experiments
5.4.1. Effect of Network Sub-Modules
5.4.2. Effect of Triplet Fusion Strategies
5.4.3. Effect of Negative Sampling Strategies
5.5. Discussion
5.5.1. Failure on the “Overlap” Relationship
5.5.2. Few-Shot Relationships Remain Challenging
5.5.3. Choice of CLIP Backbone
5.5.4. Geographic Transferability
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest

| Major Category | Subcategory |
|---|---|
| Water Systems | river, ocean, lake, pond, reservoir |
| Topography | mountain, beach, island |
| Vegetation and Soil | farmland, forest, meadow, greenbelt, bare land, paved surface |
| Residential Areas and Facilities | residential area, commercial area, industrial area, stadium, athletics track, basketball court, football field, tennis court, school, park, storage tank, greenhouse, container, swimming pool, shed |
| Transportation | parking lot, airport, railway station, harbor, wharf, runway, railway, highway, freeway, intersection, overpass, viaduct, bridge, roundabout, gas station, toll station |

| Major Category | Subcategory |
|---|---|
| Topological relationships | overlap, contained, surround, partially surround, parallel, touch, adjoin |
| Direction relationships | north, south, west, east, northwest, northeast, southwest, southeast, center |
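To make the nine direction labels concrete, the following sketch assigns one of them from two object centroids. It is purely illustrative: this section does not state how directions were decided during annotation. The sketch assumes image coordinates with y growing downward and north at the top of the image, describes the subject relative to the object, and uses a hypothetical pixel threshold `center_radius` for the "center" label.

```python
import math

# Eight 45-degree sectors, counter-clockwise, with sector 0 centred on east.
DIRECTIONS = ["east", "northeast", "north", "northwest",
              "west", "southwest", "south", "southeast"]


def direction_of(subject_xy, object_xy, center_radius=5.0):
    """Direction of the subject centroid as seen from the object centroid."""
    dx = subject_xy[0] - object_xy[0]
    dy = object_xy[1] - subject_xy[1]  # flip sign: image y grows downward
    if math.hypot(dx, dy) <= center_radius:
        return "center"
    angle = math.degrees(math.atan2(dy, dx)) % 360.0
    sector = int(((angle + 22.5) % 360.0) // 45.0)
    return DIRECTIONS[sector]


print(direction_of((100, 50), (100, 100)))   # north
print(direction_of((150, 150), (100, 100)))  # southeast
print(direction_of((101, 101), (100, 100)))  # center
```

A centroid rule like this ignores object extent, so elongated objects such as rivers or highways would need a more careful definition in practice.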

| Major Category | Subcategory |
|---|---|
| Color | green, blue, white, black, gray, yellow, red, brown |
| Shape | straight, curved, round, oval, rectangular, square, striped, triangular, irregular, neatly arranged, scattered, high-rise, low-rise, long, short |
| Region | wide, narrow, big, small |

| Method | Precision | MP | MR | MF1 |
|---|---|---|---|---|
| CREPE-box | 50.56 | 38.22 | 33.24 | 35.73 |
| RSSGG_CS-seg | 61.83 | 49.32 | 38.73 | 44.03 |
| CLIP-box | 88.44 | 44.01 | 40.77 | 42.39 |
| CLIP-seg | 88.44 | 54.87 | 42.01 | 48.44 |
| RemoteCLIP-box | 88.52 | 49.56 | 41.44 | 45.50 |
| RemoteCLIP-seg | 88.43 | 56.65 | 42.20 | 49.43 |
| GeoRSCLIP-box | 88.51 | 53.78 | 43.04 | 48.41 |
| GeoRSCLIP-seg | 88.25 | 56.22 | 41.48 | 48.85 |
| UVSTPL-box | 90.23 | 61.67 | 50.18 | 55.93 |
| UVSTPL-seg | 90.28 | 65.14 | 59.66 | 62.40 |
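The margin of UVSTPL-seg over the strongest baseline on each metric can be read off the comparison table with a few lines of arithmetic. The sketch below copies the rows verbatim; the helper name `margin_over_best_baseline` is illustrative.

```python
METRICS = ("Precision", "MP", "MR", "MF1")

# Rows copied from the comparison table: (Precision, MP, MR, MF1).
ROWS = {
    "CREPE-box": (50.56, 38.22, 33.24, 35.73),
    "RSSGG_CS-seg": (61.83, 49.32, 38.73, 44.03),
    "CLIP-box": (88.44, 44.01, 40.77, 42.39),
    "CLIP-seg": (88.44, 54.87, 42.01, 48.44),
    "RemoteCLIP-box": (88.52, 49.56, 41.44, 45.50),
    "RemoteCLIP-seg": (88.43, 56.65, 42.20, 49.43),
    "GeoRSCLIP-box": (88.51, 53.78, 43.04, 48.41),
    "GeoRSCLIP-seg": (88.25, 56.22, 41.48, 48.85),
    "UVSTPL-box": (90.23, 61.67, 50.18, 55.93),
    "UVSTPL-seg": (90.28, 65.14, 59.66, 62.40),
}


def margin_over_best_baseline(rows, target):
    """For each metric, return (best baseline name, target minus that baseline)."""
    baselines = {k: v for k, v in rows.items() if not k.startswith("UVSTPL")}
    margins = {}
    for i, metric in enumerate(METRICS):
        best = max(baselines, key=lambda name: baselines[name][i])
        margins[metric] = (best, round(rows[target][i] - baselines[best][i], 2))
    return margins


margins = margin_over_best_baseline(ROWS, "UVSTPL-seg")
for metric, (name, gain) in margins.items():
    print(f"{metric}: +{gain:.2f} over {name}")
```

With these values, UVSTPL-seg exceeds the best baseline by 1.76 points in Precision (RemoteCLIP-box), 8.49 in MP (RemoteCLIP-seg), 16.62 in MR (GeoRSCLIP-box), and 12.97 in MF1 (RemoteCLIP-seg).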

| Method | Overlap | Adjoin | Intersect | Contained | Partially Surround | Parallel | Touch | Cross | Cover | Surround |
|---|---|---|---|---|---|---|---|---|---|---|
| CREPE-box | 0 | 76.11 | 53.24 | 18.07 | 9.56 | 80.56 | 86.11 | 43.24 | 0 | 15.32 |
| RSSGG_CS-seg | 0 | 75.54 | 60.73 | 53.41 | 17.23 | 88.83 | 92.54 | 65.73 | 0 | 37.23 |
| CLIP-box | 0 | 82.11 | 64.89 | 48.39 | 0 | 90.99 | 95.17 | 58.59 | 0 | 0 |
| CLIP-seg | 0 | 82.45 | 64.60 | 53.92 | 27.34 | 91.70 | 95.11 | 71.67 | 0 | 51.85 |
| RemoteCLIP-box | 0 | 82.94 | 64.62 | 68.18 | 45.24 | 90.98 | 95.23 | 48.39 | 0 | 0 |
| RemoteCLIP-seg | 0 | 82.48 | 63.33 | 57.78 | 43.10 | 92.06 | 95.31 | 67.19 | 0 | 52.78 |
| GeoRSCLIP-box | 0 | 83.06 | 63.76 | 53.49 | 45.00 | 91.92 | 95.13 | 50.90 | 0 | 54.55 |
| GeoRSCLIP-seg | 0 | 83.08 | 66.05 | 36.54 | 43.86 | 89.30 | 95.38 | 67.12 | 0 | 46.42 |
| UVSTPL-box | 0 | 90.23 | 74.91 | 19.99 | 8.93 | 94.03 | 98.47 | 60.66 | 0 | 46.67 |
| UVSTPL-seg | 0 | 90.28 | 76.62 | 43.22 | 29.65 | 94.84 | 97.95 | 81.15 | 25.00 | 56.00 |

| Method | n_ctx | Precision | MP | MR | MF1 |
|---|---|---|---|---|---|
| UVSTPL-box | 2 | 90.22 | 59.07 | 51.64 | 55.36 |
| UVSTPL-box | 4 | 90.23 | 61.67 | 50.18 | 55.93 |
| UVSTPL-box | 6 | 89.94 | 61.91 | 56.42 | 59.17 |
| UVSTPL-seg | 2 | 89.44 | 63.66 | 58.08 | 60.87 |
| UVSTPL-seg | 4 | 90.28 | 65.14 | 59.66 | 62.40 |
| UVSTPL-seg | 6 | 89.94 | 51.43 | 50.19 | 50.81 |
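The n_ctx column most likely denotes the number of learnable context tokens placed in front of the fixed text tokens, as in CoOp-style prompt learning. That reading is an assumption, since the symbol is not defined in this section. Under that assumption, the sketch below shows how n_ctx changes the prompt that reaches the text encoder; `init_context`, `build_prompt`, the embedding width, and the initialization scale are all illustrative.

```python
import random


def init_context(n_ctx, dim, seed=0):
    """Create n_ctx learnable context vectors with small random values."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(n_ctx)]


def build_prompt(context, class_tokens):
    """Prepend the learnable context vectors to the fixed token embeddings."""
    return [list(v) for v in context] + [list(v) for v in class_tokens]


dim = 8
class_tokens = [[1.0] * dim for _ in range(3)]  # stand-in for a tokenized label

prompt_lengths = {}
for n_ctx in (2, 4, 6):
    prompt = build_prompt(init_context(n_ctx, dim), class_tokens)
    prompt_lengths[n_ctx] = len(prompt)

print(prompt_lengths)  # {2: 5, 4: 7, 6: 9}
```

Only the context vectors would be updated during training in such a setup, so n_ctx directly controls the number of trainable prompt parameters (n_ctx × dim).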

| Sub-Network: PL | Sub-Network: UV | Precision | MP | MR | MF1 |
|---|---|---|---|---|---|
| √ | × | 86.77 | 55.58 | 43.36 | 49.47 |
| × | √ | 88.84 | 59.42 | 41.22 | 50.32 |
| √ | √ | 90.28 | 65.14 | 59.66 | 62.40 |

| Triplet Fusion | Precision | MP | MR | MF1 |
|---|---|---|---|---|
| Additive (ours) | 90.28 | 65.14 | 59.66 | 62.40 |
| MLP-based (concatenation) | 90.45 | 65.38 | 59.82 | 62.65 |
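The two fusion strategies compared above can be contrasted with a minimal sketch. It assumes that "additive" means an element-wise sum of the subject, predicate, and object embeddings, and that the MLP variant concatenates the three embeddings and passes them through a two-layer network with a ReLU. The function names, weights, and dimensions are illustrative, not the configuration used in the experiments.

```python
def additive_fusion(subj, pred, obj):
    """Element-wise sum of the three embeddings; no extra parameters."""
    return [s + p + o for s, p, o in zip(subj, pred, obj)]


def mlp_fusion(subj, pred, obj, w1, w2):
    """Concatenate the embeddings, then apply linear -> ReLU -> linear."""
    x = list(subj) + list(pred) + list(obj)
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


subj, pred, obj = [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]

fused_add = additive_fusion(subj, pred, obj)

# Hand-picked toy weights: w1 maps 6 inputs to 2 hidden units, w2 maps 2 to 2.
w1 = [[1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
      [0.0, 0.0, 0.0, 0.0, 0.0, -1.0]]
w2 = [[1.0, 1.0],
      [2.0, 0.0]]
fused_mlp = mlp_fusion(subj, pred, obj, w1, w2)

print(fused_add)  # [9.0, 12.0]
print(fused_mlp)  # [1.0, 2.0]
```

For embedding width d and hidden width h, the MLP variant adds 3·d·h + h·d weights while the additive variant adds none, which is the trade-off to weigh against the small metric differences in the table.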

| Negative Sampling | Precision | MP | MR | MF1 |
|---|---|---|---|---|
| Random sampling | 87.62 | 60.53 | 52.47 | 56.13 |
| Hard negative sampling (cosine distance, ours) | 90.28 | 65.14 | 59.66 | 62.40 |
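Hard negative sampling by cosine distance can be sketched as picking, for each anchor embedding, the incorrect candidates that lie closest to it. Which embeddings serve as anchors and candidates, and how many negatives are drawn, are not stated in this section; the names `cosine_distance` and `hard_negatives`, the toy vectors, and the value of `k` below are assumptions for illustration.

```python
import math


def cosine_distance(a, b):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def hard_negatives(anchor, candidates, k):
    """Return the k candidate names with the smallest cosine distance to the anchor."""
    ranked = sorted(candidates, key=lambda name: cosine_distance(anchor, candidates[name]))
    return ranked[:k]


# Toy embeddings for incorrect relationship labels (illustrative values).
anchor = [1.0, 0.0]
wrong_labels = {
    "adjoin": [0.9, 0.1],
    "surround": [0.0, 1.0],
    "parallel": [-1.0, 0.0],
}

print(hard_negatives(anchor, wrong_labels, k=2))  # ['adjoin', 'surround']
```

Random sampling would instead draw from `wrong_labels` uniformly, so easy negatives such as "parallel" in this toy example would be selected as often as the confusable ones.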

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Ren, Y.; Qian, H.; Jiang, B.; Li, T.; Wang, X.; Sun, L.; Yang, L. Multimodal Prompt Learning for Spatial Reasoning in Remote Sensing Image Scene. Remote Sens. 2026, 18, 1959. https://doi.org/10.3390/rs18121959