Multimodal Prompt Learning for Spatial Reasoning in Remote Sensing Image Scene

Ren, Yan; Qian, Haizhong; Jiang, Bingchuan; Li, Tingting; Wang, Xiao; Sun, Long; Yang, Li

doi:10.3390/rs18121959

by Yan Ren ¹, Haizhong Qian ¹, Bingchuan Jiang ¹,*, Tingting Li ¹, Xiao Wang ¹, Long Sun ² and Li Yang ²

¹ Institute of Geospatial Information, Information Engineering University, Zhengzhou 450001, China

² College of Information Engineering, China Jiliang University, Hangzhou 310018, China

* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(12), 1959; https://doi.org/10.3390/rs18121959 (registering DOI)

Submission received: 15 April 2026 / Revised: 8 June 2026 / Accepted: 8 June 2026 / Published: 12 June 2026

(This article belongs to the Special Issue Vision–Language Multimodal Learning for Remote Sensing and Geospatial Artificial Intelligence)

Highlights

What are the main findings?

A Unified Visual-Semantic Triple Prompt Learning (UVSTPL) framework is proposed, which realizes effective fusion of visual and text modalities to enhance remote sensing scene spatial relationship reasoning.
A novel Geo-RSSG dataset with detailed ground object annotations, precise spatial relationships, and rich attributes is constructed, and UVSTPL achieves state-of-the-art performance on this dataset.

What are the implications of the main findings?

The UVSTPL framework promotes the integration of vision-language multimodal learning and GeoAI, providing a new approach for intelligent interpretation of remote sensing data.
The Geo-RSSG dataset bridges the gap in high-quality multimodal benchmarks for remote sensing reasoning tasks and provides a reliable foundation for future research in this field.

Abstract

A remote sensing scene graph (RSSG) enables machines to interpret interactions among ground objects in remote sensing images and supports semantic reasoning and description, thus making it a fundamental technique in the field. However, most existing scene reasoning approaches cannot fully utilize multimodal information, resulting in limited performance when inferring spatial relationships among ground objects. To this end, we propose a Unified Visual-Semantic Triple Prompt Learning (UVSTPL) framework, which integrates visual features with matched geospatial object labels, leverages a prompt learning module for multimodal feature extraction, and employs a refined UVTransE model to predict spatial relationships. The core principle of UVSTPL is to enhance semantic feature extraction and improve relationship prediction performance via the collaborative fusion of visual and linguistic modalities. To strengthen the model’s ability to reason about the spatial relationships among ground objects in images, a novel Geo-RSSG dataset is constructed, which includes precise annotations of geographic entities, spatial relationships, and attributes. Extensive experiments demonstrate that the proposed UVSTPL method outperforms benchmark models on the spatial relationship prediction task. In comparison with the best baseline method, our approach improves prediction precision by 1.85%, mean precision by 8.49%, mean recall by 17.46%, and mean F1-score by 12.97%. This study offers valuable insights for advancing the understanding and cognitive capabilities of remote sensing scenes.

Keywords:

remote sensing image; scene graph; spatial relationship reasoning; prompt learning; multimodal feature fusion

1. Introduction

Remote sensing images serve as core data products for airborne Earth observation, and also act as a vital carrier for humans to interpret terrestrial geographical phenomena and analyze human–environment interactions. Essentially, such images capture spatial scenes of specific land areas from a top-down perspective, recording the attributes and spatial relationship characteristics of ground objects (Figure 1a). Human interpretation of remote sensing images involves two complementary levels: pixel-level and semantic-level understanding. Pixel-level analysis targets object recognition, boundary localization, and instance segmentation, delivering fundamental structural information of the scene (Figure 1b). As a higher-order cognitive process, semantic-level understanding aims to explicate spatial relationships and semantic attributes of ground objects, thereby forming structured and interpretable geographic knowledge (Figure 1c).

Currently, computers can effectively adopt large model fine-tuning and Transformer architecture to perform low-level understanding tasks for remote sensing images, including object detection [1,2,3], scene classification [4,5,6], and semantic segmentation [7,8,9]. Nevertheless, remote sensing images are captured from a unique top-down perspective and exhibit strong spatial heterogeneity, making high-level tasks—which demand precise analysis of geographic objects and their intricate spatial relationships—still highly challenging.

A remote sensing scene graph adopts a graph representation, where nodes denote ground objects or attributes, and edges stand for relational connections. It explicitly models geographic entities, their attributes, and inter-object spatial relationships within remote sensing imagery. As a structured knowledge representation paradigm, it supports fine-grained semantic understanding of remote sensing scenes and provides fundamental knowledge for downstream tasks such as image interpretation [10,11,12], visual question answering [13,14], and natural language description [15,16,17].

Scene graphs for natural images [18,19,20] have been thoroughly investigated in computer vision. Relevant studies mainly build upon mainstream datasets, including Microsoft COCO [21] and Visual Genome (VG) [22], to implement visual relationship prediction [23,24]. However, remote sensing images contain diverse ground objects, complex spatiotemporal relationships, and distinct scene characteristics. Existing scene graph reasoning approaches designed for natural images cannot be directly adapted to the remote sensing domain, and research on remote sensing scene graph reasoning remains relatively limited. Most existing algorithms yield unsatisfactory performance on remote sensing data, mainly due to two major challenges. First, publicly available remote sensing scene graph datasets remain scarce, and most existing datasets are originally developed for object detection. Ground objects in remote sensing images are complex and distributed, with large ground objects often enclosing smaller ones within their bounding boxes. Traditional bounding box localization struggles to accurately capture the morphological characteristics of ground objects, which inevitably introduces noise, impedes spatial relationship extraction, and ultimately degrades the accuracy and reliability of scene graphs. Second, existing datasets generally lack attribute annotations and only provide basic relationship labels, limiting the semantic richness of the resulting scene graphs.

To address the above issues and enable large models to perform accurate, fine-grained spatial cognition and reasoning, we propose UVSTPL (Unified Visual-Semantic Triple Prompt Learning), a framework for spatial relationship reasoning over remote sensing scene graphs. Built on a finely annotated scene graph dataset, it combines joint feature encoding with a dedicated relationship prediction network to improve the accuracy of relationship prediction. Our main contributions are summarized as follows:

We constructed Geo-RSSG, a large-scale remote sensing scene graph dataset, including detailed annotations for various ground objects, precise spatial relationships, and rich attribute information.
We propose the UVSTPL framework for spatial relationship reasoning. It combines multimodal features with prompt learning and adopts UVTransE for robust relationship prediction, thereby improving the reliability and accuracy of spatial relationship prediction for remote sensing images.

2. Related Work

2.1. Remote Sensing Image Scene Graph Relationship Reasoning

To improve the capability of machines to recognize and understand remote sensing images, constructing specialized datasets is indispensable for verifying the efficacy of proposed approaches. Current datasets largely concentrate on labeling objects and their relationships within remote sensing scenes. GRTRD [25] contains 19,904 annotated geographic objects and 18,602 geospatial relationships. RSSGD [26] is a remote sensing scene graph dataset built from the descriptive sentences of RSICD [27], consisting of objects, attributes, regional coordinates, and inter-object relationships. Lin et al. [28] developed a scene graph dataset by selecting six object categories and twelve relationship types from the captions of three remote sensing image captioning datasets: UCM-CAPTIONS [29], Sydney-CAPTIONS [29], and RSICD [27]; directional relationships in this dataset are expressed as the relative positions up, down, left, and right. Derived from high-resolution remote sensing images, RSG [30] contains more than 210,000 objects and 400,000 triplets. It encompasses 11 complex geospatial scene types, 48 object categories annotated with oriented bounding boxes (OBB), and 58 relationship categories. SSGD [31] is automatically generated by a spatial relationship calculation model. It requires no manual annotation but involves complex computations. Most existing remote sensing scene graph datasets rely on bounding boxes for object annotation. However, ground objects in remote sensing images are mostly scattered and densely mixed, which hinders accurate discrimination. Additionally, directional relationships in these datasets are represented by relative positions rather than actual geographic locations, and detailed attribute information is generally missing.

In terms of remote sensing image scene graph relationship prediction, Cui et al. [32] proposed the Multi-Scale Remote Sensing Image Interpretation Network (MSRIN) to achieve end-to-end recognition of remote sensing objects and spatial relationships. Li et al. [26] utilized the Multi-Scale Semantic Fusion Network (MSFN) to fuse and refine multi-scale semantic context, thereby enhancing the ability to understand remote sensing scenes. Chen et al. [25] put forward a novel method based on an object-relationship message-passing mechanism to effectively predict geospatial relationships in high-resolution remote sensing scenes. Lin et al. [28] designed a segmentation-driven model to generate more comprehensive and accurate remote sensing image scene graphs. In a further study, Lin et al. [33] integrated contextual information from all objects and statistical knowledge based on an adjusted Transformer architecture. Rui et al. [34] proposed a ship group relationship description (SGRD) method based on scene graph generation with a global and local context fusion network.

2.2. Scene Reasoning of Remote Sensing Images Based on Visual Language Models

Visual Language Models (VLMs) bridge computer vision and natural language processing. These models process and understand visual and textual information in a unified manner and have achieved remarkable performance in various computer vision tasks. Radford et al. [35] proposed a simple yet effective visual language model that combines image and text representations with contrastive learning. The model achieved remarkable success in various downstream tasks and showed strong generalization on unseen data. Subramanyam et al. [36] proposed CREPE (CLIP Representation Enhanced Predicate Estimation), which validated the effectiveness of CLIP for relationship prediction. Qiu et al. [37] employed a pre-trained CLIP model to extract features from remote sensing images, achieving impressive scene classification results with scarce annotations. APPLeNet [38] leverages multi-scale visual content and style information from the CLIP visual encoder to learn prompt tokens, addressing few-shot recognition and generalization of optical remote sensing scenes. As the first foundation visual language model dedicated to remote sensing, RemoteCLIP [39] learns robust visual features with rich semantics and well-aligned text embeddings. It supports a set of remote sensing tasks such as scene classification, cross-modal retrieval, and object counting. GeoRSCLIP [40] is a CLIP model fine-tuned on RS5M, the first large-scale remote sensing image–text pairing dataset. This work successfully migrates pre-trained VLMs to the remote sensing domain. Wang et al. [41] developed a continuously pre-trained CLIP model, which substantially outperforms vanilla CLIP and other baselines in three remote sensing downstream tasks: zero-shot scene classification, fine-grained attribute classification, and cross-modal retrieval. Meng et al. [42] proposed a remote sensing image semantic description model that combines CLIP latent variables with a multi-scale grouped Transformer.

Despite the above progress, the use of VLMs, and of the CLIP model in particular, for spatial relationship reasoning in remote sensing scene graphs remains underexplored.

3. Geo-RSSG Dataset

3.1. Dataset Description

Most existing remote sensing scene graph datasets adopt object detection boxes and mainly focus on annotating objects and relationships. From a geographical perspective, we constructed a fine-grained geographical scene graph dataset based on semantic segmentation. It contains diverse geographical objects, authentic spatial relationships, and comprehensive attribute information. Geo-RSSG includes five major categories: water systems, residential areas and facilities, transportation, topography, and vegetation and soil, along with 45 subcategories. Detailed descriptions of these categories are presented in Table 1.

Geographic relationships include topological relationships and directional relationships, as summarized in Table 2. We define ten types of topological relationships based on the nine-intersection model, with illustrative examples visualized in Figure 2. Different from existing scene graph datasets, directional relationships in our dataset are described using geographic cardinal directions (East, South, West, and North). This is determined by the fixed acquisition posture of remote sensing imaging platforms. For attribute information, we adopt a hybrid qualitative and quantitative labeling scheme. Qualitative attributes cover object color, shape, and regional properties (see Table 3), while quantitative attributes include area, width, and other measurable geometric metrics.

3.2. Data Annotation

The entire annotation flow includes the following steps:

(1) Data Collection: To meet the needs of practical applications, high-resolution satellite images in the Geo-RSSG dataset were collected from Google Earth, with a spatial resolution of 0.3 m per pixel. The original exported images are 20,350 × 20,225 pixels in RGB format. Valid regions without borders or watermarks were cropped into non-overlapping patches of 512 × 512 pixels, preserving the original spatial resolution. A total of 3587 remote sensing images from two relatively representative cities in China were selected, including 2249 images of the coastal city of Guangzhou and 1338 images of the plain city of Zhengzhou. These images cover three types of scenes: urban areas, suburban areas, and rural areas.
(2) Scene Graph Annotation: We designed a standardized dataset annotation process, as shown in Figure 3. In the object annotation stage, valuable image patches are extracted from remote sensing images, with latitude and longitude coordinate information retained, and instance segmentation is performed on geographic objects. All instance segmentation masks are manually annotated by professional annotators with reference to the original imagery. For the attribute annotation and relationship annotation stages, we developed an annotation tool, AR-annotation, to annotate the topological and directional relationships between adjacent ground objects.
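
The patch cropping in step (1) can be sketched as follows. This is a minimal illustration, assuming the exported image is held as a NumPy array and that edge strips narrower than one patch are discarded; the function name `tile_image` and both assumptions are ours and are not specified in the annotation procedure.

```python
import numpy as np

def tile_image(image: np.ndarray, patch: int = 512) -> list[np.ndarray]:
    """Split an H x W x C image into non-overlapping patch x patch tiles.

    Assumption of this sketch: incomplete tiles at the right and bottom
    edges are dropped rather than padded.
    """
    rows = image.shape[0] // patch
    cols = image.shape[1] // patch
    tiles = []
    for r in range(rows):
        for c in range(cols):
            tiles.append(image[r * patch:(r + 1) * patch,
                               c * patch:(c + 1) * patch])
    return tiles

# Small synthetic example: a 1100 x 1600 RGB array yields 2 x 3 = 6 tiles.
demo = np.zeros((1100, 1600, 3), dtype=np.uint8)
tiles = tile_image(demo)
```

Under the same assumption, a 20,350 × 20,225 export holds at most 39 × 39 = 1521 full tiles, before regions with borders or watermarks are excluded.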

3.3. Data Analysis

The finalized dataset contains 86,000 ground object labels, 150,000 relationship labels, and 246,000 attribute labels, with annotation examples visualized in Figure 4. Figure 5 shows the distribution of the five major categories of ground objects, which follows a typical long-tail pattern consistent with real-world geographic characteristics. Common geographic objects, including residential areas, roads, and greenbelts, dominate the dataset, while rare categories such as airports and railway stations appear less frequently.

The distribution of topological relationships is shown in Figure 6. Relationships including adjoin, touch, and parallel occur more frequently, whereas contained and overlap relations are relatively scarce. In contrast, directional relationships exhibit a relatively balanced quantity distribution. Figure 7 shows the distribution of each attribute category in the Geo-RSSG dataset. The annotation of attribute information, such as color and shape, not only enriches the information of geographic objects but also facilitates accurate geographic object recognition. For example, football fields possess representative attributes, including green color and rectangular shape, which align with human common sense.

4. UVSTPL Framework

Drawing on the relationship prediction network of CREPE [36], we propose the UVSTPL framework for remote sensing spatial relationship reasoning. The overall architecture of UVSTPL is depicted in Figure 8. It consists of three steps: subject-object pairing, triplet prompt learning, and relationship classification.

Subject-object pairing: We extract the subject and object features of images through instance segmentation, instead of traditional bounding box annotations. All adjacent ground objects in the instance segmentation labels are combined into different subject-object pairs, which are fed into the CLIP text encoder. Meanwhile, the corresponding regions of adjacent ground objects are cropped from the original remote sensing images as the input for the CLIP image encoder.

Triplet prompt learning: A token embedding module is used to transform the image features of subjects, objects, and their combined regional areas into unified vector representations. The input to the network includes subject features, object features, and joint features of the subject-object combined regions.

Relationship classification network: Relationship features are obtained by subtracting the subject and object features from the joint region features. These features are then fed into a fully connected network that outputs the probability distribution over relationship categories, from which the relationship is predicted.

4.1. Subject-Object Pairing

In this step, instance segmentation is performed on the original image I to obtain the text labels of the subject and object, and the regions corresponding to the entities in the segmentation results are extracted, as shown below:

S = \{R(E_{1}), R(E_{2}), \dots, R(E_{n})\},

(1)

where E_{n} represents the n-th entity, and R(E_{n}) represents the region of the entity in the original image I.

The next step is to combine the entity regions in pairs, which can be expressed as:

U(E_{i}, E_{j}) = R(E_{i}) \cup R(E_{j}),

(2)

where U(E_{i}, E_{j}) denotes the combination of the regions of two adjacent entities E_{i} and E_{j} in the same image I.

Text labels and combined region images are input into CLIP’s text and image encoders to obtain the subject features, object features, and image features of the combined regions.
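
The region combination in Equation (2) can be sketched on boolean instance masks. This is a minimal illustration under our own assumptions: entities are given as binary masks, the combined region is cropped to its bounding box, and pixels outside the union are zeroed. The function name `union_crop` is hypothetical, and the adjacency test between entities is not shown.

```python
import numpy as np

def union_crop(image, mask_i, mask_j):
    """Return the image crop covering U(E_i, E_j) = R(E_i) ∪ R(E_j).

    Assumption of this sketch: pixels outside the union are set to zero.
    """
    union = np.logical_or(mask_i, mask_j)
    ys, xs = np.nonzero(union)
    y0, y1 = int(ys.min()), int(ys.max()) + 1
    x0, x1 = int(xs.min()), int(xs.max()) + 1
    crop = image[y0:y1, x0:x1].copy()
    crop[~union[y0:y1, x0:x1]] = 0
    return crop, (y0, x0, y1, x1)

# Toy 6 x 6 white image with two adjacent rectangular "entities".
image = np.full((6, 6, 3), 255, dtype=np.uint8)
mask_a = np.zeros((6, 6), dtype=bool)
mask_a[1:3, 1:3] = True
mask_b = np.zeros((6, 6), dtype=bool)
mask_b[3:5, 2:5] = True
crop, box = union_crop(image, mask_a, mask_b)
```

The resulting crop would then be passed to the image encoder, while the two class labels go to the text encoder.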

4.2. Triplet Prompt Learning

To better extract relationship features from combined features, we use a learnable prompt combination structure. The structure is shown in Figure 9.

First, the input subject and object are tokenized using the token module, which converts original words or phrases into a series of token identifiers. In this step, each ground object is represented as a unique token ID. For example, “River” is represented as 2473, and “Grassland” as 5922. The relationship to be predicted is represented by the same placeholder token repeated n times. All token sequences are bounded by the start-of-sequence [SOS] and end-of-sequence [EOS] tokens, with a maximum length of 77 tokens. Next, these tokens are converted into their corresponding vector representations through the token embedding module. During this process, each token is embedded into a fixed-length vector in the high-dimensional space, while preserving its semantic information.

To enhance the model’s adaptability to remote sensing scenarios, we introduce a dynamic adjustment function ∆(\cdot), which adjusts the initial token embedding M_{f} based on the combined image feature I_{f}. The adjusted token embedding M_{θ} can be expressed as:

M_{θ} = ∆(M_{f} + I_{f})

(3)

The dynamic adjustment function ∆(\cdot) is implemented with a multilayer perceptron (MLP) that takes the element-wise sum of the initial token embedding M_{f} and the combined region image feature I_{f} as input. The MLP consists of three hidden layers with ReLU activations (a linear layer mapping the 512-dimensional input to 256, a second layer mapping 256 to 128, and a third layer mapping 128 back to 256), followed by a final linear layer projecting 256 to 512. The output of the MLP is an adjustment vector of dimension 512. This adjustment vector is then added element-wise to the original token embedding, producing the final adjusted token embedding M_{θ}; that is, M_{θ} = M_{f} + MLP(M_{f} + I_{f}), with ∆(\cdot) in Equation (3) understood to include this residual addition. The MLP contains approximately 0.33 million trainable parameters and is trained jointly with the prompt learning network. This design allows the model to dynamically adapt the token embedding based on the visual content of the subject-object combined region while preserving the original semantic information through the additive residual connection.
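
The layer sizes above can be checked with a short sketch. It assumes every linear layer carries a bias term and uses random weights purely for illustration; the helper names `init_mlp`, `count_params`, and `adjust` are ours. Under that assumption the parameter count comes to 328,832, which matches the reported figure of roughly 0.33 million.

```python
import numpy as np

# Layer widths as described in the text: 512 -> 256 -> 128 -> 256 -> 512.
WIDTHS = [512, 256, 128, 256, 512]

def init_mlp(rng):
    """Create (weight, bias) pairs per linear layer; random init is illustrative only."""
    return [(rng.standard_normal((i, o)) * 0.02, np.zeros(o))
            for i, o in zip(WIDTHS[:-1], WIDTHS[1:])]

def count_params(layers):
    """Total number of weights and biases (assumes every layer has a bias)."""
    return sum(w.size + b.size for w, b in layers)

def adjust(m_f, i_f, layers):
    """Residual form described in the text: M_theta = M_f + MLP(M_f + I_f)."""
    h = m_f + i_f
    for k, (w, b) in enumerate(layers):
        h = h @ w + b
        if k < len(layers) - 1:      # ReLU after the three hidden layers only
            h = np.maximum(h, 0.0)
    return m_f + h

rng = np.random.default_rng(0)
layers = init_mlp(rng)
n_params = count_params(layers)      # 131,328 + 32,896 + 33,024 + 131,584
m_theta = adjust(rng.standard_normal(512), rng.standard_normal(512), layers)
```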

By combining the adjusted token embedding with the subject and object token embeddings, the triplet token embedding U_{emb} is formed, as described below:

U_{emb} = \{S_{emb} + M_{θ} + O_{emb}\},

(4)

where S_{emb} and O_{emb} denote the token embeddings of the subject and object, respectively, and M_{θ} is the adjusted token embedding obtained from Equation (3).

We employ simple element-wise addition in Equation (4), drawing inspiration from the translation-based paradigm of TransE, in which relationships are modeled as vector offsets. Complex interactions among subject, predicate, and object are explicitly captured by the subsequent UVTransE classifier (Section 4.3), which computes a non-linear residual P_{f} = MLP(U_{f} - S_{f} - O_{f}) based on the encoded joint text feature. Furthermore, our ablation study (Section 5.4) demonstrates that replacing additive fusion with a more sophisticated MLP-based non-linear fusion brings only marginal performance gains (<0.5% MF1). This validates that the simple additive formulation is empirically sufficient. Accordingly, we retain this simple yet effective design.

To construct hard negative samples, we compute pairwise cosine distances between predicate embeddings in the original 512-dimensional CLIP text space. For each positive predicate, we select the predicate with the smallest cosine distance (excluding itself) as the negative sample. For example, when the ground truth label is <residential area_1, adjoin, bare_land_3>, we construct <residential area_1, touch, bare_land_3> as a negative sample, ensuring that the negative sample is semantically close to the positive sample. This strategy forces the model to discriminate between semantically similar relationships, improving fine-grained reasoning. Figure 10 visualizes the predicate embeddings using t-SNE for illustration only; the actual negative sampling is based on distances in the original feature space.

The embedded triplet token U_{emb} is then input into CLIP’s text encoder to obtain a unified text feature vector U_{f}, which integrates information about the subject, object, and relationship.
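
The hard negative selection described above can be sketched as a nearest-neighbor search under cosine distance. The 2-D vectors below are made-up toy values chosen so that “adjoin” and “touch” are close; they are not actual CLIP text embeddings, and the function name `hardest_negatives` is ours.

```python
import numpy as np

def hardest_negatives(emb: np.ndarray) -> np.ndarray:
    """For each row, return the index of the other row with the smallest cosine distance."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    dist = 1.0 - unit @ unit.T          # pairwise cosine distance
    np.fill_diagonal(dist, np.inf)      # exclude the predicate itself
    return dist.argmin(axis=1)

# Toy stand-ins for predicate embeddings (real ones would be 512-D).
predicates = ["adjoin", "touch", "cross"]
emb = np.array([[1.0, 0.0],
                [0.9, 0.1],
                [0.0, 1.0]])
neg = hardest_negatives(emb)
pairs = {predicates[i]: predicates[j] for i, j in enumerate(neg)}
```

With these toy values, “adjoin” is paired with “touch” as its hard negative, mirroring the example in the text.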

In the computation process, the features of the composite region image processed by the CLIP image encoder are represented as I_{f}. Thus, the composite feature vector can be expressed as:

U_{f} = {CLIP}_{txt}(φ(S_{emb}, I_{f}, O_{emb})),

(5)

where S_{emb} and O_{emb} denote the token embeddings of the subject and object, respectively, I_{f} represents the combined region image feature, φ denotes the learnable context vector, and U_{f} stands for the joint feature vector.

Through self-supervised learning, the contrastive loss enables the model to capture relationship-specific features. It effectively distinguishes visually similar yet semantically distinct relationships and boosts the accuracy of fine-grained relationship prediction. Thus, the following contrastive loss function is used to optimize the learnable parameters:

L_{LTD} = - \log \frac{\exp (sim (I_{f}, U_{f}))}{\exp (sim (I_{f}, U_{f})) + \exp (sim (U_{f}, {\hat{U}}_{f}))},

(6)

where sim denotes cosine similarity, sim(I_{f}, U_{f}) represents the cosine similarity between the combined region image feature and the joint text feature, {\hat{U}}_{f} indicates the negative sample most similar to the ground truth label, and exp stands for the exponential function.
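
Equation (6) can be evaluated directly, term by term as written. The sketch below is a single-sample illustration on toy 2-D vectors; the function names `cosine` and `ltd_loss` are ours, and no temperature scaling is applied because none appears in the equation.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def ltd_loss(i_f, u_f, u_hat_f):
    """Equation (6): positive term sim(I_f, U_f), negative term sim(U_f, U_hat_f)."""
    pos = np.exp(cosine(i_f, u_f))
    neg = np.exp(cosine(u_f, u_hat_f))
    return float(-np.log(pos / (pos + neg)))

i_f = np.array([1.0, 0.0])
u_f = np.array([1.0, 0.0])

# Negative text feature orthogonal to U_f: loss = log(1 + e^{-1}).
loss_easy = ltd_loss(i_f, u_f, np.array([0.0, 1.0]))
# Negative text feature identical to U_f: loss = log(2).
loss_hard = ltd_loss(i_f, u_f, u_f)
```

The loss grows as the negative sample moves closer to the positive text feature, which is the pressure that hard negatives are meant to create.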

The dynamic adjustment function ∆(\cdot) is parameterized by learnable parameters φ. During training, it dynamically calibrates the context vector and prompt insertion positions by combining image features to achieve image-aware prompt generation. This mechanism strengthens the model’s representation capability for complex relationships and improves the efficacy of contrastive learning.

4.3. UVTransE Classifier

The joint text feature vector U_{f} generated during the prompt learning phase, together with the subject feature vector S_{f} and the object feature vector O_{f}, is fed into the UVMLP network for relation prediction. The network architecture of UVMLP is illustrated in Figure 11. Here, S_{f} represents the subject feature vector, O_{f} refers to the object feature vector, U_{f} denotes the joint text feature vector, P_{f} is the relation feature vector, P_{fgt} is the relation feature vector for the true label, P_{logits} represents the prediction probability of the relationship, and P_{score} indicates the probability weight.

The relational feature representation is obtained by subtracting subject and object features from the joint text embedding. The relational feature vector is then fed into a fully connected network to calculate the probability distribution of the relationship. The Softmax function converts the probabilities into normalized weights, yielding the final relational prediction.
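
The prediction step can be sketched as follows. This is a simplified illustration, assuming a single linear layer stands in for the fully connected network and using random weights; the number of relationship classes (`n_rel = 5`) and the function names are arbitrary choices for the example, not values from the method.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def predict_relation(u_f, s_f, o_f, w, b):
    """Translation-style relation feature, then a linear layer and softmax."""
    p_f = u_f - s_f - o_f               # relation feature from joint, subject, object
    logits = p_f @ w + b
    return softmax(logits)

rng = np.random.default_rng(0)
dim, n_rel = 512, 5                     # n_rel is a toy value
u_f, s_f, o_f = (rng.standard_normal(dim) for _ in range(3))
w = rng.standard_normal((dim, n_rel)) * 0.02
b = np.zeros(n_rel)
probs = predict_relation(u_f, s_f, o_f, w, b)
predicted = int(probs.argmax())
```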

The loss function of the UVTransE network is:

L_{t} = L_{cross} + L_{cosine}

(7)

L_{cross} is the cross-entropy loss function, and the formula is as follows:

L_{cross} = - \sum_{x} p(x) \log(q(x)),

(8)

where p(x) denotes the probability distribution of the ground truth relationship category, and q(x) represents the predicted distribution of the model.

L_{cosine} is the cosine similarity loss:

L_{cosine} = 1 - \cos (θ) = 1 - \frac{P_{f} \cdot P_{fgt}}{‖P_{f}‖ \times ‖P_{fgt}‖}

(9)

where \cos(θ) denotes the cosine similarity between P_{f} and P_{fgt}. Since training minimizes the loss rather than directly maximizing the similarity, the loss is defined as one minus the cosine similarity.

L_{cross} contributes to higher classification accuracy. However, when using the cross-entropy loss function alone, the UVTransE network suffers from unsatisfactory convergence, and the similarity between the predicted relationship features and the ground truth relationship features remains low. The cosine similarity loss L_{cosine} aligns the predicted relationship feature vector P_{f} with the ground truth relationship feature vector P_{fgt}, which further improves the performance of the UVTransE network.
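
Equations (7) to (9) combine as in the following sketch. It assumes a one-hot ground truth distribution and an unweighted sum of the two terms, as written in Equation (7); the small `eps` inside the logarithm is our addition for numerical safety, and the function names are illustrative.

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """Equation (8): -sum_x p(x) log q(x); eps avoids log(0)."""
    return float(-np.sum(p * np.log(q + eps)))

def cosine_loss(p_f, p_fgt):
    """Equation (9): one minus the cosine similarity of the two relation features."""
    cos = p_f @ p_fgt / (np.linalg.norm(p_f) * np.linalg.norm(p_fgt))
    return float(1.0 - cos)

def total_loss(p, q, p_f, p_fgt):
    """Equation (7): L_t = L_cross + L_cosine."""
    return cross_entropy(p, q) + cosine_loss(p_f, p_fgt)

p = np.array([0.0, 1.0, 0.0])           # one-hot ground truth
q = np.array([0.2, 0.7, 0.1])           # predicted distribution
aligned = total_loss(p, q, np.array([1.0, 0.0]), np.array([2.0, 0.0]))
orthogonal = total_loss(p, q, np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```

With identical classification output, the loss is -log(0.7) when the predicted and ground truth relation features point the same way and one unit larger when they are orthogonal.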

5. Experiments and Results

5.1. Experimental Setup

For the spatial reasoning task, we primarily focus on relationship prediction accuracy. When comparing against baseline models, we focus our analysis on the prediction of different relationship types. Therefore, we use precision as the metric for overall prediction performance. Additionally, we adopt mean Precision, Recall, and F1-score to evaluate the model’s average performance across all relationship types.

Formally, we define the metrics as follows:

Precision is the ratio of correctly predicted relationship triplets to all predicted triplets:

Precision = \frac{TP}{TP + FP}

(10)

For a single relationship category, Recall is defined as:

Recall = \frac{TP}{TP + FN}

(11)

Mean precision (MP) is the average precision over all relationship categories:

MP = \frac{1}{N} \sum_{i = 1}^{N} Precision_{i}

(12)

Mean recall (MR) is the average recall over all relationship categories:

MR = \frac{1}{N} \sum_{i = 1}^{N} Recall_{i}

(13)

Mean F1-score (MF1) is the harmonic mean of mean precision and mean recall:

MF1 = 2 \times \frac{MP \times MR}{MP + MR}

(14)

where TP, FP, and FN denote true positives, false positives, and false negatives, respectively, and N denotes the total number of relationship categories.
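
Equations (10) to (14) can be computed from per-category counts as sketched below. The labels are toy data, the function name `macro_metrics` is ours, and a category with no predictions is assigned a precision of zero, which is an assumption of this sketch rather than a convention stated in the text.

```python
from collections import Counter

def macro_metrics(y_true, y_pred, labels):
    """Return (overall precision, MP, MR, MF1) following Equations (10)-(14)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1                   # predicted category gains a false positive
            fn[t] += 1                   # true category gains a false negative

    def ratio(num, den):
        return num / den if den else 0.0  # assumed convention for empty categories

    precisions = [ratio(tp[c], tp[c] + fp[c]) for c in labels]
    recalls = [ratio(tp[c], tp[c] + fn[c]) for c in labels]
    mp = sum(precisions) / len(labels)
    mr = sum(recalls) / len(labels)
    mf1 = ratio(2 * mp * mr, mp + mr)
    total_tp = sum(tp.values())
    overall = ratio(total_tp, total_tp + sum(fp.values()))
    return overall, mp, mr, mf1

labels = ["adjoin", "touch", "cross"]
y_true = ["adjoin", "adjoin", "touch", "cross"]
y_pred = ["adjoin", "touch", "touch", "adjoin"]
overall, mp, mr, mf1 = macro_metrics(y_true, y_pred, labels)
```

In this toy case the rare “cross” category is never predicted correctly, so it pulls MP and MR well below the overall precision, which is why the mean metrics are more sensitive to long-tail relationships.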

The remote sensing scene graph dataset is divided into an 8:2 ratio, with 80% for training and 20% for testing. Model training is conducted on a computing service equipped with five NVIDIA A100 GPUs (80 GB memory each). The image and text encoders of the CLIP model are frozen during training; only the prompt learning module and the UVTransE classifier are updated. For the prompt learning network, the number of epochs is set to 50, the learning rate to 0.01, the batch size to 64, and the optimizer is SGD. For the classifier network, the number of epochs is 150, while other parameter settings are the same as those of the prompt learning network.

To evaluate the computational efficiency of UVSTPL, we report its number of trainable parameters, floating-point operations (FLOPs), and inference latency. Although multimodal prompts may introduce extra computation, UVSTPL incurs only 2.20 M additional trainable parameters since the CLIP backbone is kept frozen. The UVTransE classifier consumes 1.87 MFLOPs and incurs an inference latency of 7.67 ms per sample on an NVIDIA A100 GPU. Combined with the frozen CLIP encoder, the full model requires approximately 13.0 GFLOPs with an inference latency of 22.7 ms per sample. This modest overhead yields a 12.97% absolute improvement in MF1 score over the best baseline (RemoteCLIP-seg), demonstrating a favorable accuracy–efficiency trade-off.

5.2. Comparison Experiments

To comprehensively evaluate the proposed UVSTPL, we compare it against a set of representative baselines, spanning both general visual relationship prediction methods and approaches tailored for remote sensing imagery. These baselines are briefly described below.

CREPE-box [36]: A CLIP-based visual relationship prediction model that extracts subject-object regions via bounding boxes and adopts CLIP pre-trained priors to predict relationship predicates.

RSSGG_CS-seg [33]: A remote sensing scene graph generation method proposed in 2022. It fuses contextual information and statistical knowledge, and takes segmentation masks as input. This model serves as a competitive baseline for remote sensing scene graph tasks.

CLIP-box and CLIP-seg: We build two simple yet effective baselines based on the vanilla CLIP (ViT-B/32). For CLIP-box, bounding boxes are utilized to crop subject and object regions; for CLIP-seg, instance segmentation masks are adopted instead. A linear classifier is trained on the Geo-RSSG training set to identify relationship categories.

RemoteCLIP [39] and GeoRSCLIP [40]: These are recently released vision-language foundation models pre-trained on large-scale remote sensing image-text pairs. Compared with vanilla CLIP, the two models are more specialized for the remote sensing domain. We treat them as strong upper-bound baselines to verify whether our UVSTPL can surpass state-of-the-art domain-specific foundation models.

For a fair comparison, all baseline models are trained on the Geo-RSSG training set with their image encoders frozen. Only the linear classifier or task-specific prediction head is updated during training. Consistent with this setting, our UVSTPL also freezes the pre-trained CLIP encoder, and only optimizes the prompt learning module and UVTransE classifier.

We evaluate all methods in terms of precision, MP, MR, and MF1, and the quantitative results are presented in Table 4. In this work, the suffix “-box” indicates that subject-object combined regions are extracted using bounding boxes, while “-seg” denotes regions derived from segmentation masks. Among all baselines, RemoteCLIP-seg achieves the best overall performance, with a precision of 88.43%, MP of 56.65%, MR of 42.20%, and MF1 of 49.43%.
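
As an illustration of how class-averaged metrics of this kind behave, the following sketch computes an overall score together with macro-averaged precision, recall, and F1 from label lists. It assumes that the overall score is the share of correct predictions and that MF1 is the mean of the per-class F1 values; the exact definitions used in this paper are given in an earlier section and may differ. The function name `macro_metrics` and the toy labels are hypothetical.

```python
from collections import Counter


def macro_metrics(y_true, y_pred):
    """Return (overall, mean precision, mean recall, mean F1) as fractions.

    `overall` is the share of correct predictions. Classes are taken from
    `y_true`; a class that is never predicted gets precision 0.
    """
    classes = sorted(set(y_true))
    tp = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    pred_count = Counter(y_pred)
    true_count = Counter(y_true)

    precisions, recalls, f1s = [], [], []
    for c in classes:
        prec = tp[c] / pred_count[c] if pred_count[c] else 0.0
        rec = tp[c] / true_count[c]
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)

    n = len(classes)
    overall = sum(tp.values()) / len(y_true)
    return overall, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy example: the single tail-class sample ("overlap") is predicted as "adjoin".
y_true = ["adjoin", "adjoin", "adjoin", "cross", "overlap"]
y_pred = ["adjoin", "adjoin", "adjoin", "cross", "adjoin"]
overall, mp, mr, mf1 = macro_metrics(y_true, y_pred)
print(f"{overall:.2f} {mp:.2f} {mr:.2f} {mf1:.2f}")  # 0.80 0.58 0.67 0.62
```

In this toy case one missed tail class leaves the overall score at 0.80 but pulls the class-averaged values down sharply, which is why macro-averaged metrics are informative for long-tailed relationship distributions.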

Our proposed UVSTPL-seg outperforms RemoteCLIP-seg by a clear margin across all metrics: precision increases by 1.85 percentage points (90.28 vs. 88.43), MP by 8.49 (65.14 vs. 56.65), MR by 17.46 (59.66 vs. 42.20), and MF1 by 12.97 (62.40 vs. 49.43). With bounding box inputs, UVSTPL-box also exceeds the top-performing baseline, GeoRSCLIP-box, with gains of 1.72 percentage points in precision (90.23 vs. 88.51), 7.89 in MP (61.67 vs. 53.78), 7.14 in MR (50.18 vs. 43.04), and 7.52 in MF1 (55.93 vs. 48.41).

We also compare the per-relationship precision of the different models (see Table 5). Overall, the -seg variants achieve higher precision than the -box variants. Although UVSTPL-box performs poorly on relationships with few samples, its precision on common relationships is higher than that of the baseline models. UVSTPL-seg attains the highest precision on every relationship except “contained” and “partially surround”. For the few-sample relationships, especially “cross”, “cover”, and “surround”, the precision of our model is much higher than that of the other models.

Figure 12 presents the prediction results of representative models for different relationship categories. To balance clarity and comprehensiveness, we select CREPE-box, RSSGG_CS-seg, CLIP-seg, RemoteCLIP-seg, GeoRSCLIP-seg, UVSTPL-box, and UVSTPL-seg for visualization. Like the other models, our method fails to make correct predictions for the “overlap” category. However, it differentiates similar topological relations well, e.g., “cross” versus “intersect” and “contained” versus “surround”. It also shows clear advantages on challenging spatial relations: it reaches 25.00% precision on “cover”, whereas nearly all baselines score 0% for this category. The full precision values of all models, including CLIP-box, RemoteCLIP-box, and GeoRSCLIP-box, which are not displayed in the figure, are summarized in Table 5.

5.3. Results Analysis

5.3.1. Analysis of Model Relationship Prediction Performance

As shown in Table 6, we compare the precision, mean precision, mean recall, and mean F1-score of UVSTPL-seg and UVSTPL-box with the context length n_ctx set to 2, 4, and 6. The context length affects prediction performance, and both models perform best when n_ctx = 4, which indicates that an appropriate amount of context information improves performance. The segmentation-based model also maintains relatively high mean precision and mean recall. Since relationships in our dataset are usually expressed as one or two words, UVSTPL predicts masked relationships best when n_ctx = 4.
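
To make the role of n_ctx concrete, the sketch below prepends n_ctx learnable context vectors to fixed token embeddings, in the style of CoOp-like prompt learning. This layout is an assumption made for illustration: the actual prompt design of UVSTPL is described in an earlier section, and the names `init_context` and `build_prompt`, the Gaussian initialization, and the toy embedding width are hypothetical.

```python
import random


def init_context(n_ctx, dim, seed=0, std=0.02):
    """Create n_ctx learnable vectors of size `dim` (Gaussian init assumed)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(n_ctx)]


def build_prompt(context, triple_embeddings):
    """Prepend the learnable context vectors to the fixed token embeddings."""
    return context + triple_embeddings


DIM = 8  # toy width; real CLIP text embeddings are much wider
# Placeholder embeddings for the tokens of <subject, relation, object>.
triple = [[1.0] * DIM, [0.5] * DIM, [-1.0] * DIM]

for n_ctx in (2, 4, 6):
    prompt = build_prompt(init_context(n_ctx, DIM), triple)
    trainable = n_ctx * DIM  # only the context vectors would be updated
    print(n_ctx, len(prompt), trainable)
```

Under this assumption, a larger n_ctx lengthens the prompt and adds trainable values linearly, which gives the optimizer more freedom but also more room to fit noise.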

5.3.2. Analysis of the Distribution of Model Relationship Prediction Precision

Under different n_ctx values, the precision distribution of UVSTPL-seg and UVSTPL-box across the 10 relationship types is analyzed using confusion matrices and radar charts. In each confusion matrix, the horizontal axis represents predicted labels, the vertical axis represents true labels, and the color intensity indicates the sample count. Figure 13 is based on segmented-region features and Figure 14 on detected bounding-box features; Figure 15a and Figure 15b follow the same order, respectively.
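
The axis convention described above (rows for true labels, columns for predicted labels) can be sketched as follows. The helper names and the toy labels are hypothetical; per-class precision is read off as the diagonal entry divided by its column sum.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows index true labels, columns index predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def per_class_precision(matrix):
    """Diagonal entry divided by its column sum (0.0 if never predicted)."""
    result = []
    for j in range(len(matrix)):
        column_sum = sum(row[j] for row in matrix)
        result.append(matrix[j][j] / column_sum if column_sum else 0.0)
    return result


labels = ["adjoin", "intersect", "overlap"]
y_true = ["adjoin", "adjoin", "intersect", "overlap", "intersect"]
y_pred = ["adjoin", "adjoin", "intersect", "adjoin", "adjoin"]
cm = confusion_matrix(y_true, y_pred, labels)
print(cm)                       # [[2, 0, 0], [1, 1, 0], [1, 0, 0]]
print(per_class_precision(cm))  # [0.5, 1.0, 0.0]
```

An empty column, as for “overlap” in this toy example, means the class is never predicted, so its precision is reported as zero.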

The model predicts common relationships (e.g., “adjoin”, “intersect”, “parallel”, “touch”) consistently well, and also achieves relatively high precision on uncommon ones (e.g., “contained”, “cross”, “cover”, “surround”). In contrast, the prediction for “overlap” is notably poor: all models obtain zero precision in this category (see Table 5 and Figure 12a). Figure 16 shows a typical failure case in which <meadow1, overlap, pond1> is misclassified as <meadow1, adjoin, pond1>. The underlying reasons for this complete failure are analyzed in Section 5.5.1.

When the context size is set to 2 or 4, the two methods produce fairly similar relationship predictions. However, at a context size of 6, their predictions diverge considerably. This is because UVSTPL-seg extracts more accurate object contours from segmented regions and therefore requires less contextual information. In contrast, UVSTPL-box, which relies on bounding boxes, tends to include background noise and thus benefits from additional context. Notably, UVSTPL-seg predicts the rare relationship “cover” with 25.00% precision, whereas UVSTPL-box fails to do so (0% precision).

5.3.3. Clustering Analysis of the Impact of Context Length

We adopt the t-SNE dimensionality reduction method to visualize the results on the test set. As shown in Figure 17 and Figure 18, we feed both bounding boxes and segmented regions into UVSTPL and compare its ability to distinguish among different relationship types. The results indicate that both methods achieve the best separation among relationship types when the context length is set to 4. At a context length of 2, the clustering performance remains satisfactory. However, at a context length of 6, the clarity and separation of clusters degrade notably, reflecting reduced discriminability. In Figure 17c, at a context length of 4, the separation among relationship categories is clearest, with relationships such as “parallel”, “touch”, and “adjoin” forming distinct clusters. Across all context lengths and both input types, the segmentation-based method consistently outperforms the bounding-box-based method in distinguishing relationships, particularly when the context length is 4. This indicates that the fine-grained spatial information provided by segmentation masks enhances the model’s ability to discriminate subtle spatial relationships, especially when attending to local contexts.

Our experiments show that the discriminative performance of UVSTPL depends on the input modality and requires a suitable context length. Moreover, incorporating excessive contextual information can introduce noise or irrelevant cues that may obscure the distinctive features of individual relationships. The strong performance of the segmentation-based method, especially at a context length of 4, underscores the importance of combining fine-grained spatial information with focused local context to accurately identify relationships in complex remote sensing scenes.

5.4. Ablation Experiments

5.4.1. Effect of Network Sub-Modules

We adopt the best-performing UVSTPL-seg variant, which takes segmented regions as input with a context length of 4. The performance of different sub-networks is compared in terms of precision, MP, MR, and MF1 (see Table 7). Additionally, confusion matrices and t-SNE dimensionality reduction feature maps are used to analyze the relationship prediction capabilities of the different sub-networks.

When the model uses both the prompt learning network and the UVTransE classifier simultaneously, it achieves the best prediction performance. Without the UVTransE classifier, the model’s precision drops to 86.77%, and its mean precision and mean recall fall to 55.58% and 43.36%, respectively. When the prompt learning module is removed, precision decreases only slightly. However, both mean precision and mean recall drop, with recall falling to just 41.22%. These results indicate that the UVTransE classifier contributes more to precision improvement, while the prompt learning module plays a critical role in few-shot relationship prediction.

The confusion matrices in Figure 19 illustrate the classification performance of the UVSTPL-seg model under different sub-module configurations. As shown in Figure 19a, removing the UVTransE classifier (UV) results in a large number of misclassifications, particularly for categories such as “intersect”, “contained”, and “partially surround”. This indicates that prompt learning (PL) alone cannot deliver high precision across multiple categories. As shown in Figure 19b, introducing UV improves the prediction results for common relationships, particularly for “intersect”. However, the prediction results for “contained” and “partially surround”, which have few samples, are still not ideal. The most significant improvement is observed in Figure 19c: combining PL and UV substantially improves precision across relationship categories, especially in distinguishing fine-grained spatial relationships such as “contained” and “partially surround”. Collectively, these results highlight the complementarity of PL and UV in enhancing the UVSTPL-seg model’s ability to distinguish subtle spatial relationships in complex remote sensing scenarios.

Figure 20 illustrates the impact of different sub-modules within the UVSTPL framework on the separability of relationship features. The t-SNE visualization shows the distribution of the 10 relationship categories under each configuration (with a context length of 4). In Figure 20a, the model with only the prompt learning network achieves good clustering only for the “adjoin” and “parallel” relationships. The model with only the UVTransE classifier (Figure 20b) exhibits poor clustering, which indicates that without contextual prompts the model struggles to distinguish among relationship categories. Figure 20c shows the most significant improvement: integrating prompt learning and UVTransE leads to distinct clusters for most relationship categories. This configuration yields clear class separation, particularly for similar relationships such as “contained”, “adjoin”, and “partially surround”, and highlights the synergistic effect of combining the prompt learning module with the UVTransE classifier. Overall, UVSTPL substantially enhances the ability to discriminate subtle spatial relationships in complex geographic scenes.

5.4.2. Effect of Triplet Fusion Strategies

To investigate whether additive fusion imposes performance limitations, we substitute it with a two-layer MLP that operates on the concatenated embeddings of the subject, predicate, and object. As presented in Table 8, the MLP-based fusion yields only a negligible MF1 gain of 0.25 percentage points (62.65% vs. 62.40%). This empirical evidence demonstrates that additive fusion is not a bottleneck for our model, as complex relational interactions are sufficiently modeled by the subsequent UVTransE module. Accordingly, we adopt the additive formulation for its simplicity and efficiency.
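
The two fusion variants compared above can be contrasted in a dependency-free sketch. The layer sizes, the ReLU activation, the weight initialization, and all function names are assumptions made for illustration; only the overall structure (element-wise sum versus a two-layer MLP on the concatenated embeddings) comes from the text.

```python
import math
import random


def additive_fusion(s, p, o):
    """Element-wise sum of subject, predicate, and object embeddings."""
    return [a + b + c for a, b, c in zip(s, p, o)]


def linear(x, weights, bias):
    """Dense layer: `weights` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def mlp_fusion(s, p, o, params):
    """Two-layer MLP on the concatenation [s; p; o] with a ReLU in between."""
    w1, b1, w2, b2 = params
    hidden = [max(0.0, h) for h in linear(s + p + o, w1, b1)]  # list concat
    return linear(hidden, w2, b2)


def init_mlp(dim, hidden, seed=0):
    """Random weights for a (3*dim -> hidden -> dim) MLP; sizes are assumed."""
    rng = random.Random(seed)

    def mat(rows, cols):
        scale = 1.0 / math.sqrt(cols)
        return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

    return mat(hidden, 3 * dim), [0.0] * hidden, mat(dim, hidden), [0.0] * dim


s = [0.1, 0.2, 0.3, 0.4]
p = [0.0, 0.5, 0.0, 0.5]
o = [0.4, 0.3, 0.2, 0.1]

added = additive_fusion(s, p, o)     # no trainable parameters
params = init_mlp(dim=4, hidden=8)
mixed = mlp_fusion(s, p, o, params)  # same output width as `added`
n_params = (sum(len(r) for r in params[0]) + len(params[1])
            + sum(len(r) for r in params[2]) + len(params[3]))
print(len(added), len(mixed), n_params)  # 4 4 140
```

Even at this toy size the MLP variant carries 140 trainable values while the additive variant has none, which illustrates the simplicity argument made above.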

5.4.3. Effect of Negative Sampling Strategies

To evaluate the contribution of hard negative sampling, we compare random sampling with our cosine-distance-based hard negative sampling. As shown in Table 9, hard negative sampling consistently outperforms random sampling across all metrics, with an absolute gain of 6.27 percentage points in MF1 (62.40% vs. 56.13%). This verifies that the proposed strategy helps the model distinguish semantically similar relationships. Therefore, we adopt this strategy in our framework.
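
One common way to realize cosine-distance-based hard negative sampling is to pick, for each anchor, the differently labeled candidates that lie closest to it in embedding space. The sketch below follows that reading; it is an assumption, since the exact procedure is defined in an earlier section, and the function names and toy embeddings are hypothetical. Embeddings are assumed to be non-zero vectors.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; both vectors are assumed to be non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def hard_negatives(anchor, anchor_label, candidates, k=1):
    """Return the k differently labeled candidates closest to the anchor.

    `candidates` is a list of (embedding, label) pairs; the result is a list
    of (distance, label) pairs sorted from hardest to easiest.
    """
    negatives = [(cosine_distance(anchor, emb), label)
                 for emb, label in candidates if label != anchor_label]
    negatives.sort(key=lambda pair: pair[0])
    return negatives[:k]


anchor = [1.0, 0.0]
candidates = [
    ([0.9, 0.1], "touch"),
    ([0.0, 1.0], "cross"),
    ([1.0, 0.05], "adjoin"),   # same label as the anchor, never a negative
    ([-1.0, 0.0], "parallel"),
]
print([label for _, label in hard_negatives(anchor, "adjoin", candidates, k=2)])
# ['touch', 'cross']
```

Random sampling would return any of the three negatives with equal probability, whereas this selection concentrates training on the classes most easily confused with the anchor.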

5.5. Discussion

5.5.1. Failure on the “Overlap” Relationship

As shown in Table 5 and Figure 12a, all models (including the proposed UVSTPL and the strong remote sensing foundation baselines) achieve zero precision on the “overlap” relationship. The universal failure across all approaches indicates that this limitation is not model-specific, but originates from inherent topological ambiguity, semantic coupling, and extreme data imbalance in remote sensing scene interpretation.

First, extreme sample scarcity. The dataset contains only 28 “overlap” instances, accounting for less than 0.02% of all relational samples. Such severe class imbalance restricts the model from learning stable and discriminative features for this tail category.

Second, high semantic similarity. The cosine similarity between “overlap” and “adjoin” in the CLIP text embedding space reaches 0.92, which is substantially higher than the average pairwise similarity of 0.68. The tight semantic alignment causes inherent language-level confusion, hindering vision-language differentiation of the two relations.

Third, inherent visual topological ambiguity (core cause). The fundamental challenge lies in the inconsistent pixel-level manifestation of abstract topological rules in remote sensing imagery. Geometrically, “adjoin” describes boundary sharing without area intersection, while “overlap” denotes partial area overlap. However, narrow overlapping regions lack visible edge cues, as the upper object occludes underlying surfaces and removes boundary contours. In contrast, “adjoin” scenarios consistently present clear and continuous edges, forming a strong visual prior for the model.

Faced with edge-free overlap samples and overwhelmingly dominant “adjoin” cases, the model lacks contradictory cues to override its learned prior and, thus, statistically defaults to “adjoin” predictions. Combined with terrain undulation, variable imaging angles, fragmented land cover, and mixed-pixel interference, subtle overlap and adjacency become visually indistinguishable. This ambiguity is an intrinsic limitation of remote sensing visual interpretation rather than a simple dataset defect.

Essentially, this failure stems from the inherent mismatch between abstract topological definitions and ambiguous visual presentations of remote sensing scenes. Even with sufficient samples, pure vision-language models struggle to distinguish these fine-grained relations without explicit geometric constraints. For future improvement, we will integrate geometric prior information (e.g., IoU metrics), generate synthetic overlap samples for tail-class augmentation, and adopt reweighted loss functions and few-shot learning strategies to alleviate long-tail learning bias.
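
The geometric prior mentioned above can be illustrated with a simple IoU computation. The sketch uses axis-aligned bounding boxes for brevity, whereas a real pipeline would more likely operate on segmentation masks; the function name and the coordinates are hypothetical, and how such a cue would be fed into UVSTPL is left open.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(0.0, inter_w) * max(0.0, inter_h)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


meadow = (0, 0, 10, 10)
pond_adjoin = (10, 0, 20, 10)   # shares an edge, no common area
pond_overlap = (9, 0, 19, 10)   # a one-unit-wide strip of common area

print(box_iou(meadow, pond_adjoin))             # 0.0
print(round(box_iou(meadow, pond_overlap), 4))  # 0.0526
```

A zero versus a small positive IoU separates the two cases here even though they would look almost identical in pixels. IoU alone does not identify “overlap”, since relations such as “contained” also have positive IoU, so it would serve as an additional cue rather than a classifier.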

5.5.2. Few-Shot Relationships Remain Challenging

Although UVSTPL achieves notable improvements on rare relations such as “cross” and “surround” (e.g., 81.15% vs. 67.19% for “cross”; 56.00% vs. 52.78% for “surround”), the absolute precision for “cover” (25.00%) and “surround” (56.00%) remains moderate. As shown in Table 5, all baseline models, including RemoteCLIP and GeoRSCLIP, perform poorly on these tail categories, and most attain 0% precision on “cover”.

The long-tail distribution is alleviated but not fully resolved. Apart from extreme class imbalance, the core challenge stems from the ambiguous visual boundaries of these complex spatial relations in remote sensing imagery. The visual manifestation of the “cover” relation often overlaps with partial overlap and adjacency. Large-area coverage shares similar visual patterns with object intersection, while partial coverage is difficult to distinguish from simple adjacency due to mixed pixels and shadow interference. Likewise, the topological difference between “surround” and other adjacent relations is easily obscured by irregular layouts of ground objects in complex land cover scenes.

To tackle these challenges, we put forward several future research directions: generating augmented samples for rare and visually ambiguous relations via synthetic data creation, adopting reweighted loss functions to balance category learning, and developing few-shot learning strategies dedicated to remote sensing scene graph tasks to boost feature representation for tail classes.
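
As one possible form of the reweighted loss mentioned above, the sketch below derives inverse-frequency class weights, using the widely used N / (K * n_c) heuristic. The choice of heuristic, the function name, and the class counts are assumptions; the counts are toy values and are not taken from Geo-RSSG.

```python
def inverse_frequency_weights(class_counts):
    """Weight each class by N / (K * n_c); a balanced dataset gives 1.0."""
    total = sum(class_counts.values())
    k = len(class_counts)
    return {c: total / (k * n) for c, n in class_counts.items()}


# Toy long-tailed distribution (hypothetical counts).
counts = {"adjoin": 900, "cross": 80, "overlap": 20}
weights = inverse_frequency_weights(counts)
for name, w in weights.items():
    print(f"{name}: {w:.2f}")
# adjoin: 0.37
# cross: 4.17
# overlap: 16.67
```

Multiplying each sample's loss by the weight of its class raises the contribution of tail categories; very large weights can destabilize training, so they are often clipped or smoothed in practice.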

5.5.3. Choice of CLIP Backbone

In this study, we adopt the generic CLIP (ViT-B/32) as the vision-language backbone instead of remote-sensing-specific variants (RemoteCLIP, GeoRSCLIP). This is a deliberate design choice: a backbone that has not been pre-trained on remote sensing data provides a more challenging test for our prompt learning and UVTransE modules, and it allows us to demonstrate that the observed improvements come from the proposed architectural components rather than from a stronger pre-trained encoder. Moreover, UVSTPL is backbone-agnostic and can be applied directly to any CLIP-style encoder without modification. Comparison experiments indicate that replacing the generic CLIP with RemoteCLIP yields an additional gain of about 1–2% in MF1, which we plan to investigate systematically in future work.

5.5.4. Geographic Transferability

For remote sensing spatial relationship recognition, geographic transferability describes a model’s ability to retain stable performance on unseen regions with different geographic features, a property that defines its practical applicability.

The Geo-RSSG dataset employed in this work consists of remote sensing images from two Chinese cities: Guangzhou, a coastal city with complex terrain, and Zhengzhou, a representative plain city. While the dataset includes urban, suburban, and rural scenes and covers two typical geographic types, the model’s performance on mountainous areas and overseas cities with unique land-use patterns remains unquantified. Several critical factors limiting cross-region transferability are discussed as follows.

First, regions differ substantially in land cover and object appearance. Noticeable variations exist in architectural styles, building materials, vegetation, road networks, and spectral characteristics. For example, residential buildings in European countries show distinct roof designs and colors compared with those in China. Our vision-language backbones are pre-trained on massive global data and can learn generalizable representations to alleviate domain shift. Nevertheless, fine-tuning on data limited to the two domestic cities introduces scene bias, making the model overly adapted to local visual characteristics.

Second, spatial relationships follow region-specific statistical distributions. Divergent urban planning results in varying occurrence frequencies and visual forms of topological relations, such as “adjoin” and “contained”. Models trained on compact urban scenes often struggle with sparse suburban areas and rugged mountainous regions. Terrain undulation, shadow occlusion, and fragmented ground objects in mountainous areas further change how relations like “adjoin” and “surround” appear visually. Such domain shifts break the consistency of data distribution, leading to performance drops when the model is applied to new regions.

Third, image scale and resolution present additional challenges. The entire dataset uses a fixed ground sampling distance (GSD) of 0.3 m, matching common high-resolution satellite data. When processing images from different sensors or with varying zoom scales, resampling operations or scale-invariant feature learning are required to maintain reliable results.
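
The resampling step can be illustrated with the scale factor between a source GSD and the 0.3 m GSD of the dataset. The function name and the example image sizes are hypothetical, and the choice of interpolation method is left open.

```python
def resample_size(width, height, source_gsd, target_gsd=0.3):
    """Pixel size after resampling from source_gsd to target_gsd (metres/pixel)."""
    scale = source_gsd / target_gsd
    return round(width * scale), round(height * scale)


# A 1024 x 1024 tile at 0.6 m/pixel covers the same ground as 2048 x 2048 at 0.3 m.
print(resample_size(1024, 1024, source_gsd=0.6))   # (2048, 2048)
# A finer 0.15 m/pixel tile must be downsampled by a factor of two.
print(resample_size(1024, 1024, source_gsd=0.15))  # (512, 512)
```

Upsampling coarser imagery matches the pixel scale but cannot restore detail that the sensor did not capture, which is why scale-invariant feature learning is mentioned as an alternative.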

To overcome these limitations, we plan the following future work. We will first construct and annotate a multi-region test set for quantitative transferability assessment. We will also explore domain adaptation methods, including adversarial feature alignment and test-time adaptation to bridge cross-domain gaps. Moreover, we will exploit unlabeled cross-region data through semi-supervised learning and decouple relational semantics from scene-specific visual features to improve model robustness across geographic scenarios.

6. Conclusions

To enhance the ability of large models to understand remote sensing images, this study proposes a remote sensing image scene graph spatial relationship reasoning method based on joint visual-semantic triplet prompt learning. We construct Geo-RSSG, a fine-grained remote sensing scene graph dataset that defines comprehensive entity and relationship types and provides rich geographic semantic information. Built upon semantic segmentation results, the proposed method integrates the CLIP model, prompt learning, and the UVTransE classifier to perform spatial relationship reasoning. The experimental results demonstrate that the UVSTPL model significantly outperforms baseline methods in both precision and recall. This research advances the understanding of remote sensing scenes and is of considerable value for improving spatial cognition in remote sensing imagery.

However, several limitations remain. Constrained by natural and anthropogenic geographic factors, the ground objects and relationships in our dataset follow a heavily long-tailed distribution, which limits the model’s ability to learn complex relationships such as “contained” versus “surround” and “adjoin” versus “overlap”. Moreover, the dataset scale needs to be further expanded, which can be addressed through data augmentation and annotation assisted by pre-trained models. This study focuses on the prediction of topological relationships. Directional relationships can be derived with very high accuracy through spatial relationship calculations, but such calculations usually involve high computational complexity and time costs; based on the constructed dataset, directional relationships could instead be inferred by fine-tuning multimodal large models. Additionally, while the method proposed in this paper focuses on relationship reasoning within scene graphs, a promising extension is to convert scene graph datasets into question–answer pairs and leverage visual question answering (VQA) to jointly predict entities, relationships, and attributes, thereby enabling automatic generation of remote sensing scene graphs.

In future work, we plan to further enrich the Geo-RSSG dataset, leverage large vision-language models to infer directional relationships, and explore the application of scene graphs to scene reasoning tasks such as image captioning and VQA.

Author Contributions

Conceptualization, H.Q.; methodology, Y.R. and B.J.; software, Y.R. and L.S.; validation, Y.R. and X.W.; formal analysis, X.W. and L.Y.; investigation, B.J. and T.L.; resources, B.J. and L.S.; data curation, Y.R. and X.W.; writing—original draft preparation, Y.R. and B.J.; writing—review and editing, T.L. and H.Q.; visualization, Y.R. and T.L.; supervision, H.Q. and L.Y.; project administration, H.Q.; funding acquisition, B.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 42571549, and the Key Laboratory of Smart Earth, grant number SYS-ZX06-2024-01.

Data Availability Statement

The data that support the findings of this research are available from the corresponding author, B. Jiang, upon reasonable request.

Acknowledgments

The authors would like to thank the anonymous reviewers for their helpful and constructive comments that greatly contributed to improving this manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Huang, Z.; Feng, Y.; Liu, Z.; Yang, S.; Liu, Q.; Wang, Y. Openrsd: Towards open-prompts for object detection in remote sensing images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Honolulu, HI, USA, 19–23 October 2025; pp. 8384–8394.
2. Xiao, Z.; Li, Z.; Cao, J.; Liu, X.; Kong, Y.; Du, Z. OriMamba: Remote sensing oriented object detection with state space models. Int. J. Appl. Earth Obs. Geoinf. 2025, 143, 104731.
3. Xie, J.; Wang, G.; Zhang, T.; Sun, Y.; Chen, H.; Zhuang, Y.; Li, J. LLaMA-Unidetector: A LLaMA-Based Universal Framework for Open-Vocabulary Object Detection in Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4409318.
4. Feng, J.; Luo, H.; Gu, Z. Improving semi-supervised remote sensing scene classification via Multilevel Feature Fusion and pseudo-labeling. Int. J. Appl. Earth Obs. Geoinf. 2025, 136, 104335.
5. Wang, C.; Yang, J.; Ahmed, T.; Zhao, Y.; Zhang, T.; Sun, B.; Chen, T. Zero-Shot Remote Sensing Scene Classification Based on Automatic Knowledge Graph and Dual-Branch Semantic Correlation Supervision. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 18, 3300–3314.
6. Yi, J.; Zhong, Y.; Su, Y.; Yang, R.; Liu, Y.; Wang, J. Global urban high-resolution scene classification via uncertainty-aware domain generalization. ISPRS J. Photogramm. Remote Sens. 2025, 230, 92–108.
7. Liu, X.; Wang, T.; Jin, F.; Rui, J.; Wang, S.; Huang, Z.; Yu, X. Multimodal cross fusion Mamba network for remote sensing image semantic segmentation with complementary masked self-supervision. Int. J. Appl. Earth Obs. Geoinf. 2025, 145, 104960.
8. Luo, M.; Zan, Y.; Khoshelham, K.; Ji, S. Domain generalization for semantic segmentation of remote sensing images via vision foundation model fine-tuning. ISPRS J. Photogramm. Remote Sens. 2025, 230, 126–146.
9. Ma, X.; Zhang, X.; Pun, M.O.; Huang, B. A Unified Framework with Multimodal Fine-tuning for Remote Sensing Semantic Segmentation. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5405015.
10. Sun, S.; Dustdar, S.; Ranjan, R.; Morgan, G.; Dong, Y.; Wang, L. Remote sensing image interpretation with semantic graph-based methods: A survey. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 4544–4558.
11. Yin, S.; Wang, L.; Shafiq, M.; Teng, L.; Laghari, A.A.; Khan, M.F. G2Grad-CAMRL: An object detection and interpretation model based on gradient-weighted class activation mapping and reinforcement learning in remote sensing images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 3583–3598.
12. Zhu, Q.; Lao, J.; Ji, D.; Luo, J.; Wu, K.; Zhang, Y.; Zhao, F. Skysense-o: Towards open-world remote sensing interpretation with vision-centric visual-language modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 11–15 June 2025; pp. 14733–14744.
13. Feng, J.; Wang, H. A multi-scale contextual attention network for remote sensing visual question answering. Int. J. Appl. Earth Obs. Geoinf. 2024, 126, 103641.
14. He, J.; Liu, G.; Li, P.; Su, X.; Jiang, W.; Zhang, D.; Zhong, S. PERS: Parameter-Efficient Multi-modal Transfer Learning for Remote Sensing Visual Question Answering. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 14823–14835.
15. Gao, Z.; Sun, S.; Cheng, M.M.; Liu, Y.; Liu, L. Multi-modal large models driven SAR image captioning: A benchmark dataset and baselines. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 24011–24026.
16. Ren, J.; Liu, W.; Chen, J.; Yin, S. HI4HC and AAAAD: Exploring a hierarchical method and dataset using hybrid intelligence for remote sensing scene captioning. Int. J. Appl. Earth Obs. Geoinf. 2025, 139, 104491.
17. Wang, Q.; Yang, Z.; Ni, W.; Wu, J.; Li, Q. Semantic-spatial collaborative perception network for remote sensing image captioning. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5649912.
18. Im, J.; Nam, J.; Park, N.; Lee, H.; Park, S. Egtr: Extracting graph from transformer for scene graph generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 24229–24238.
19. Jeon, J.; Kim, K.; Yoon, K.; Park, C. Semantic diversity-aware prototype-based learning for unbiased scene graph generation. In European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2024; pp. 379–395.
20. Li, J.; Wang, Y.; Guo, X.; Yang, R.; Li, W. Leveraging predicate and triplet learning for scene graph generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 28369–28379.
21. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C.L. Microsoft coco: Common objects in context. In European Conference on Computer Vision; Springer International Publishing: Cham, Switzerland, 2014; pp. 740–755. Available online: https://link.springer.com/content/pdf/10.1007/978-3-319-10602-1_48.pdf (accessed on 11 September 2024).
22. Lu, C.; Krishna, R.; Bernstein, M.; Fei-Fei, L. Visual relationship detection with language priors. In European Conference on Computer Vision; Springer International Publishing: Cham, Switzerland, 2016; pp. 852–869.
23. Kim, J.; Park, J.; Park, J.; Kim, J.; Kim, S.; Kim, H.J. Groupwise query specialization and quality-aware multi-assignment for transformer-based visual relationship detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 28160–28169.
24. Krishna, R.; Zhu, Y.; Groth, O.; Johnson, J.; Hata, K.; Kravitz, J.; Fei-Fei, L. Visual genome: Connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vis. 2017, 123, 32–73.
25. Chen, J.; Zhou, X.; Zhang, Y.; Sun, G.; Deng, M.; Li, H. Message-passing-driven triplet representation for geo-object relational inference in HRSI. IEEE Geosci. Remote Sens. Lett. 2021, 19, 1–5.
26. Li, P.; Zhang, D.; Wulamu, A.; Liu, X.; Chen, P. Semantic relation model and dataset for remote sensing scene understanding. ISPRS Int. J. Geo-Inf. 2021, 10, 488.
27. Lu, X.; Wang, B.; Zheng, X.; Li, X. Exploring models and data for remote sensing image caption generation. IEEE Trans. Geosci. Remote Sens. 2017, 56, 2183–2195.
28. Lin, Z.; Zhu, F.; Kong, Y.; Wang, Q.; Wang, J. SRSG and S2SG: A model and a dataset for scene graph generation of remote sensing images from segmentation results. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4707411.
29. Qu, B.; Li, X.; Tao, D.; Lu, X. Deep semantic understanding of high-resolution remote sensing image. In Proceedings of the 2016 International Conference on Computer, Information and Telecommunication Systems (CITS), Kunming, China, 6–8 July 2016; pp. 1–5.
30. Li, Y.; Wang, L.; Wang, T.; Yang, X.; Luo, J.; Wang, Q.; Yan, J. STAR: A first-ever dataset and a large-scale benchmark for scene graph generation in large-size satellite imagery. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 1832–1849.
31. Tang, J.; Tong, X.; Qiu, C.; Sun, Y.; Song, H.; Lei, Y.; Guo, C. Remote sensing scene graph generation for improved retrieval based on spatial relationships. ISPRS J. Photogramm. Remote Sens. 2025, 220, 741–752.
32. Cui, W.; Wang, F.; He, X.; Zhang, D.; Xu, X.; Yao, M.; Huang, J. Multi-scale semantic segmentation and spatial relationship recognition of remote sensing images based on an attention model. Remote Sens. 2019, 11, 1044.
33. Lin, Z.; Zhu, F.; Wang, Q.; Kong, Y.; Wang, J.; Huang, L.; Hao, Y. RSSGG_CS: Remote sensing image scene graph generation by fusing contextual information and statistical knowledge. Remote Sens. 2022, 14, 3118.
34. Rui, Q.; You, Y.; Cao, J.; Zhu, K.; Qiao, Y. SGRD: A Ship Group Relationship Description Method Based on Scene Graph Generation with a Global-Local Context Fusion Network. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 14570–14581.
35. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sutskever, I. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Online, 18–24 July 2021; pp. 8748–8763.
36. Subramanyam, R.; Jayram, T.S.; Anirudh, R.; Thiagarajan, J.J. Exploring the Utility of Clip Priors for Visual Relationship Prediction. In Proceedings of the ICASSP 2024–2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 6825–6829.
37. Qiu, C.; Yu, A.; Yi, X.; Guan, N.; Shi, D.; Tong, X. Open self-supervised features for remote-sensing image scene classification using very few samples. IEEE Geosci. Remote Sens. Lett. 2022, 20, 2500505.
38. Singha, M.; Jha, A.; Solanki, B.; Bose, S.; Banerjee, B. Applenet: Visual attention parameterized prompt learning for few-shot remote sensing image generalization using clip. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 2024–2034.
39. Liu, F.; Chen, D.; Guan, Z.; Zhou, X.; Zhu, J.; Ye, Q.; Zhou, J. RemoteCLIP: A vision language foundation model for remote sensing. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5622216.
40. Zhang, Z.; Zhao, T.; Guo, Y.; Yin, J. RS5m and GeoRSCLIP: A large scale vision-language dataset and a large vision-language model for remote sensing. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5642123.
Wang, Z.; Prabha, R.; Huang, T.; Wu, J.; Rajagopal, R. Skyscript: A large and semantically diverse vision-language dataset for remote sensing. Proc. AAAI Conf. Artif. Intell. 2024, 38, 5805–5813. [Google Scholar] [CrossRef]
Meng, L.; Wang, J.; Meng, R.; Yang, Y.; Xiao, L. A multiscale grouping transformer with clip latents for remote sensing image captioning. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4703515. [Google Scholar] [CrossRef]

Figure 1. Pixel-level comprehension and semantic-level comprehension for remote sensing images: (a) remote sensing image; (b) pixel-level comprehension; (c) semantic-level comprehension.

Figure 2. Examples of topological relationships.

Figure 3. Standardized annotation process for Geo-RSSG dataset.

Figure 4. Examples of annotation results.

Figure 5. Visualization of statistical results for ground objects.

Figure 6. Visualization of statistical results for relationships.

Figure 7. Visualization of statistical results for attributes.

Figure 8. Overall architecture of UVSTPL.

Figure 9. Prompt learning architecture.

Figure 10. t-SNE visualization of predicate embedding (negative sampling uses original-space cosine distances).

Figure 11. UVTransE network.

Figure 12. Visualization of the prediction results of different models on each relationship.

Figure 13. Relationship distribution of UVSTPL-seg with different contexts.

Figure 14. Relationship distribution of UVSTPL-box with different contexts.

Figure 15. Visualization of the precision rate of a single relationship.

Figure 16. Analysis of the reasons for the failure to predict “overlap”.

Figure 17. t-SNE visualization of relationships for UVSTPL-seg.

Figure 18. t-SNE visualization of relationships for UVSTPL-box.

Figure 19. Relationship distribution of the UVSTPL-seg model under different sub-module configurations. Note: The red dashed boxes mark relationships whose number of predictions decreased, while the solid boxes mark relationships whose number of predictions increased.

Figure 20. Impact of different sub-modules on distinguishing relationship features in the UVSTPL framework.

Table 1. Geographical object types.

Major Category	Subcategory
Water Systems	river, ocean, lake, pond, reservoir
Topography	mountain, beach, island
Vegetation and Soil	farmland, forest, meadow, greenbelt, bare land, paved surface
Residential Areas and Facilities	residential area, commercial area, industrial area, stadium, athletics track, basketball court, football field, tennis court, school, park, storage tank, greenhouse, container, swimming pool, shed
Transportation	parking lot, airport, railway station, harbor, wharf, runway, railway, highway, freeway, intersection, overpass, viaduct, bridge, roundabout, gas station, toll station

Table 2. Relationship types.

Major Category	Subcategory
Topological relationships	overlap, contained, surround, partially surround, parallel, touch, adjoin
Direction relationships	north, south, west, east, northwest, northeast, southwest, southeast, center

Table 3. Attribute types.

Major Category	Subcategory
Color	green, blue, white, black, gray, yellow, red, brown
Shape	straight, curved, round, oval, rectangular, square, striped, triangular, irregular, neatly arranged, scattered, high-rise, low-rise, long, short
Region	wide, narrow, big, small

Table 4. Precision and recall of different models.

Method	Precision	MP	MR	MF1
CREPE-box	50.56	38.22	33.24	35.73
RSSGG_CS-seg	61.83	49.32	38.73	44.03
CLIP-box	88.44	44.01	40.77	42.39
CLIP-seg	88.44	54.87	42.01	48.44
RemoteCLIP-box	88.52	49.56	41.44	45.50
RemoteCLIP-seg	88.43	56.65	42.20	49.43
GeoRSCLIP-box	88.51	53.78	43.04	48.41
GeoRSCLIP-seg	88.25	56.22	41.48	48.85
UVSTPL-box	90.23	61.67	50.18	55.93
UVSTPL-seg	90.28	65.14	59.66	62.40

Note: Bold indicates the best performance for each metric.
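
The gains quoted in the abstract (1.85%, 8.49%, 17.46%, and 12.97%) coincide with the absolute differences between the UVSTPL-seg and RemoteCLIP-seg rows of Table 4, which suggests they are percentage-point differences rather than relative improvements. The short Python check below reproduces that arithmetic. It assumes RemoteCLIP-seg is the "best baseline" meant in the abstract; the function and variable names are illustrative only.

```python
# Rows copied from Table 4 as (Precision, MP, MR, MF1), in percent.
table4 = {
    "RemoteCLIP-seg": (88.43, 56.65, 42.20, 49.43),  # assumed "best baseline"
    "UVSTPL-seg": (90.28, 65.14, 59.66, 62.40),
}
metrics = ("Precision", "MP", "MR", "MF1")


def absolute_gains(ours, baseline):
    """Per-metric difference in percentage points, rounded to two decimals."""
    return {m: round(o - b, 2) for m, o, b in zip(metrics, ours, baseline)}


gains = absolute_gains(table4["UVSTPL-seg"], table4["RemoteCLIP-seg"])
print(gains)
```

Read this way, the precision gain is measured against RemoteCLIP-seg (88.43) and not against the highest baseline precision in the table (RemoteCLIP-box, 88.52).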

Table 5. Precision of different models on individual relationships.

Method	Adjoin	Intersect	Contained	Partially Surround	Parallel	Touch	Cross	Cover	Surround
CREPE-box	76.11	53.24	18.07	9.56	80.56	86.11	43.24	0	15.32
RSSGG_CS-seg	75.54	60.73	53.41	17.23	88.83	92.54	65.73	0	37.23
CLIP-box	82.11	64.89	48.39	0	90.99	95.17	58.59	0	0
CLIP-seg	82.45	64.60	53.92	27.34	91.70	95.11	71.67	0	51.85
RemoteCLIP-box	82.94	64.62	68.18	45.24	90.98	95.23	48.39	0	0
RemoteCLIP-seg	82.48	63.33	57.78	43.10	92.06	95.31	67.19	0	52.78
GeoRSCLIP-box	83.06	63.76	53.49	45.00	91.92	95.13	50.90	0	54.55
GeoRSCLIP-seg	83.08	66.05	36.54	43.86	89.30	95.38	67.12	0	46.42
UVSTPL-box	90.23	74.91	19.99	8.93	94.03	98.47	60.66	0	46.67
UVSTPL-seg	90.28	76.62	43.22	29.65	94.84	97.95	81.15	25.00	56.00

Note: Bold indicates the best performance for each metric.

Table 6. Model evaluation under different data inputs and different contexts.

Method	n_ctx	Precision	MP	MR	MF1
UVSTPL-box	2	90.22	59.07	51.64	55.36
UVSTPL-box	4	90.23	61.67	50.18	55.93
UVSTPL-box	6	89.94	61.91	56.42	59.17
UVSTPL-seg	2	89.44	63.66	58.08	60.87
UVSTPL-seg	4	90.28	65.14	59.66	62.40
UVSTPL-seg	6	89.94	51.43	50.19	50.81

Note: Bold indicates the best performance for each metric.

Table 7. Performance of the UVSTPL-seg model with different sub-networks.

PL	UV	Precision	MP	MR	MF1
√	×	86.77	55.58	43.36	49.47
×	√	88.84	59.42	41.22	50.32
√	√	90.28	65.14	59.66	62.40

Note: “PL” refers to the prompt learning network, while “UV” denotes the UVTransE-based classifier network. “√” denotes experiments with the corresponding sub-network integrated, and “×” denotes those without the sub-network. Bold indicates the best performance for each metric.

Table 8. Effect of triplet fusion strategies.

Triplet Fusion	Precision	MP	MR	MF1
Additive (ours)	90.28	65.14	59.66	62.40
MLP-based (concatenation)	90.45	65.38	59.82	62.65

Table 9. Effect of negative sampling strategies.

Negative Sampling	Precision	MP	MR	MF1
Random sampling	87.62	60.53	52.47	56.13
Hard negative sampling (cosine distance, ours)	90.28	65.14	59.66	62.40
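
Table 9 contrasts random negatives with hard negatives chosen by cosine distance. The sampling procedure itself is not spelled out in this part of the article, so the Python sketch below only illustrates the generic idea under stated assumptions: each predicate has an embedding vector, and the "hard" negatives for a true predicate are the other predicates whose embeddings lie closest to it in cosine distance. The toy two-dimensional vectors, the function names, and the parameter k are hypothetical and not taken from the paper.

```python
import math
import random


def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def hard_negatives(true_label, embeddings, k=1):
    """Return the k labels closest to the true label in cosine distance."""
    anchor = embeddings[true_label]
    candidates = [label for label in embeddings if label != true_label]
    candidates.sort(key=lambda label: cosine_distance(anchor, embeddings[label]))
    return candidates[:k]


def random_negatives(true_label, embeddings, k=1, seed=0):
    """Return k labels drawn uniformly from all labels except the true one."""
    rng = random.Random(seed)
    candidates = [label for label in embeddings if label != true_label]
    return rng.sample(candidates, k)


# Hypothetical toy embeddings; the labels are relationship names from Table 2.
toy_embeddings = {
    "touch": [1.0, 0.0],
    "adjoin": [0.9, 0.1],
    "surround": [0.0, 1.0],
    "parallel": [-1.0, 0.0],
}

hard = hard_negatives("touch", toy_embeddings, k=2)
rand = random_negatives("touch", toy_embeddings, k=1)
print(hard, rand)
```

In this toy setting the hard negatives for "touch" are the predicates that are easiest to confuse with it, whereas random sampling may return a predicate that is already far away and therefore contributes little training signal.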

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Ren, Y.; Qian, H.; Jiang, B.; Li, T.; Wang, X.; Sun, L.; Yang, L. Multimodal Prompt Learning for Spatial Reasoning in Remote Sensing Image Scene. Remote Sens. 2026, 18, 1959. https://doi.org/10.3390/rs18121959

