1. Introduction
With the rapid advancements in aerospace technology, the acquisition of remote sensing data has become increasingly convenient [1]. However, the continuous expansion of remote sensing data presents a pressing challenge: how to efficiently and accurately retrieve target data that meet query requirements from massive datasets. Against this backdrop, remote sensing image–text retrieval (RSITR) has emerged as a research hotspot in recent years [2,3,4]. Currently, RSITR technology is widely applied in various areas, such as natural disaster monitoring [5], urban planning [6], and agricultural production [7].
Traditional remote sensing image retrieval methods mainly rely on handcrafted features and shallow models [8,9,10]. However, these methods suffer from low computational efficiency and limited retrieval accuracy. In recent years, deep learning-based RSITR methods have gradually become a research hotspot, and they can be mainly divided into dual-stream models and single-stream models [11,12,13,14]. Dual-stream models [15,16,17] typically employ independent encoders to extract features from images and texts separately and train them on sample-matching relationships to learn cross-modal consistency. In contrast, single-stream models [18,19,20] feed both images and texts into the same encoder to learn shared cross-modal representations.
However, the existing RSITR methods still have certain limitations. While dual-stream models offer higher computational efficiency, they lack deep cross-modal feature interactions. On the other hand, single-stream models can more comprehensively model cross-modal relationships but come with higher computational costs and lower inference efficiency. Moreover, when calculating image–text similarity, most of these methods rely on global features, overlooking crucial positional relationships in remote sensing data. In multimodal remote sensing retrieval tasks, textual descriptions often refer to specific target regions rather than the entire image. Therefore, relying solely on global features for matching may introduce redundant information, thus leading to increased retrieval errors. Consequently, how to enable the model to capture cross-modal spatial relationships on top of global feature matching, thereby achieving finer-grained cross-modal alignment, has become a key challenge for improving the performance of remote sensing image–text retrieval.
Figure 1 presents an example of a remote sensing image–text pair. The text description primarily focuses on specific areas of the image, such as “five planes” and “red building,” which are spatially related through the connective phrase “next to.” However, the existing methods lack the ability to capture cross-modal positional relationships, resulting in background information being incorporated into similarity calculations. In other words, these methods fail to align the object words in the text with their corresponding regions in the image, and they are also unable to associate two objects with positional relationships through spatial keywords. This interference disrupts the alignment between key target regions and textual descriptions, ultimately degrading retrieval performance.
To address the insufficient utilization of positional information in existing methods, we propose a new remote sensing image–text retrieval model termed PR-CLIP. The code is publicly available at https://github.com/ADMIS-TONGJI/PR-CLIP (accessed on 18 June 2025). PR-CLIP adopts CLIP [21] as the image and text encoder, and it aligns remote sensing image–text features through cross-modal contrastive learning, thereby bringing semantically similar samples closer together and enhancing cross-modal matching capability. Based on this, we propose a cross-modal positional information reconstruction task to improve the model’s ability to capture spatial relationships between images and texts. Specifically, the encoded image and text features are first concatenated and fed into the cross-modal positional information extraction module to extract complementary cross-modal positional information. Subsequently, we employ the unimodal positional information filtering module to remove spatial information from the image and text features, obtaining unimodal representations that lack positional information. Next, with the complete information from the other modality, we reconstruct the filtered unimodal features using the cross-modal positional information reconstruction module. To ensure the quality of the reconstruction, we introduce a positional information reconstruction consistency loss to enforce similarity between the reconstructed features and the original unimodal features, thereby enhancing the model’s ability to capture positional relationships. It is worth noting that global feature matching still plays a central role in RSITR tasks. PR-CLIP aims to introduce cross-modal positional information modeling on top of global feature matching to achieve finer-grained cross-modal alignment. Although image–text retrieval in some scenarios may not rely on positional relationships, mining the latent spatial structure information does not weaken the representational ability of the global features. For scenarios involving spatial relationships, positional information plays an important role in cross-modal alignment.
PR-CLIP combines the advantages of both dual-stream and single-stream models. During training, the cross-modal positional information reconstruction task enhances spatial alignment between modalities, improving cross-modal matching accuracy. During inference, PR-CLIP omits the cross-modal positional information reconstruction modules and relies solely on the CLIP encoder for retrieval. Therefore, the model can retain the high computational efficiency of dual-stream models while leveraging learned positional information to enhance retrieval accuracy.
Our contributions are summarized as follows:
We propose a novel remote sensing image–text retrieval model, PR-CLIP, which can capture cross-modal positional information, enabling more efficient and accurate cross-modal retrieval.
We introduce a new cross-modal positional information reconstruction training task to enhance the model’s ability to understand and utilize positional relationships across modalities.
We conducted extensive experiments on two public datasets, and the results show that our model clearly outperforms the existing approaches.
The structure of this paper is as follows. In Section 2, we review the current RSITR methods and discuss their limitations. Section 3 presents a detailed description of the proposed PR-CLIP model and the cross-modal positional information reconstruction training task. In Section 4, we evaluate the performance of PR-CLIP by comparing it against multiple SOTA baselines. Finally, Section 5 summarizes this work.
2. Related Work
In this section, we review recent studies on remote sensing image–text retrieval. The existing RSITR methods can be categorized into two types, i.e., dual-stream models and single-stream models.
2.1. Dual-Stream RSITR Models
Dual-stream models typically use independent unimodal encoders to extract feature embeddings for images and texts separately. For example, they employ CNNs [22] or Transformers [23] to process image features, and they use BERT [24] or other language models to encode textual information. During training, the models are optimized in a supervised manner using positive and negative sample pairs, ensuring that the matched image–text pairs are as close as possible in the feature space while pushing apart the unmatched pairs. Since these methods extract unimodal features independently, they achieve higher computational efficiency.
For example, Liao et al. [25] employed separable convolution and text convolution to extract image and text features separately and enhanced cross-modal retrieval performance through knowledge distillation from large-scale pretrained models. Pan et al. [17] employed a progressive spatial attention encoder to encode images and a progressive temporal attention encoder to encode text, and they ultimately trained the model using a cluster-based membership loss. Yang et al. [26] encoded images by extracting texture and saliency features and used BERT to encode text. Zhang et al. [15] proposed a hypersphere-based visual–semantic alignment network, which optimizes the model through curriculum learning after separately encoding images and text. Zhou et al. [18] proposed a coarse-to-fine two-stage image–text retrieval framework. First, images and texts are separately encoded to compute similarity for coarse ranking. Then, a fine-grained ranking is performed on the candidate results using an image–text matching task.
In recent years, some researchers have introduced large-scale data into dual-stream models for training, aiming to enhance the generalization ability and robustness of cross-modal feature learning. These approaches seek to more effectively align image and text features and improve retrieval performance. For example, RemoteCLIP [27] adopts a data-augmentation approach to integrate the SEG-4, DET-10, and RET-3 datasets, resulting in a dataset 12 times larger than all the existing datasets combined. It is trained based on the CLIP framework to enhance the performance of remote sensing image–text retrieval. GeoRSCLIP [28] introduces the RS5M remote sensing image–text paired dataset, which contains 5 million remote sensing images with textual descriptions. The model is also trained based on the CLIP framework. EBAKER [29] introduces the NWPU dataset [30] into the training of the CLIP model, where the NWPU dataset contains 31,500 images and 157,500 matched texts. SkyCLIP [31] introduces the SkyScript dataset with 2.6 million image–text pairs and achieves good retrieval performance through the CLIP model.
However, dual-stream models lack deep cross-modal feature interactions and fusion, making it difficult for them to fully capture cross-modal semantic relationships, which, ultimately, limits their performance in retrieving complex remote sensing content.
2.2. Single-Stream RSITR Models
The core idea of single-stream models is to facilitate early-stage modality fusion during feature extraction. Therefore, they typically adopt a transformer structure to simultaneously process image and text data and learn a shared cross-modal representation. Subsequently, these models utilize the fused embeddings for tasks such as image–text matching and image–text masking to enhance cross-modal alignment and improve retrieval performance. For example, Yuan et al. [3] used a multi-scale visual self-attention module to extract image features and a cross-attention mechanism for text interaction, while proposing a triplet loss based on prior similarity to address the challenge of distinguishing similar images. Yu et al. [32] constructed graph structures separately for images and text to extract the corresponding modality features and aligned the different modalities through an image–text association module. Yuan et al. [33] integrated different hierarchical features through multi-level information dynamic fusion and introduced a denoising representation matrix and an enhanced adjacency matrix to optimize the local features generated by a GCN. Zhu et al. [34] proposed a multi-task joint learning framework that enhances cross-modal retrieval performance through a noise-aware background reconstruction task and a pixel-level prediction-based semantic segmentation task. Zhou et al. [18] first extracted image and text features separately and then promoted cross-modal alignment through a multi-visual-guided dynamic fusion module. Huang et al. [35] utilized textual cues to guide rich semantic reasoning within the visual context and further enhanced cross-modal interaction between textual and visual data through context region learning and consistency semantic alignment.
Single-stream methods can model inter-modal relationships more comprehensively than dual-stream models. However, since the retrieval stage requires computing the fused embedding between the target modality and all candidate features, these methods suffer from low inference efficiency. Additionally, single-stream models may weaken the independence of unimodal features, potentially leading to suboptimal cross-modal retrieval accuracy.
2.3. Discussion
Although the existing RSITR methods achieve high retrieval accuracy, they still struggle to effectively model cross-modal positional associations between images and text. Dual-stream models primarily focus on extracting precise unimodal representations but lack interaction modules for integrating image and text features, making it challenging to capture cross-modal positional relationships. Single-stream models employ a fusion encoder to align images and text. However, they predominantly emphasize global semantic matching while lacking fine-grained modeling of positional associations. To address these limitations, this study enhances the ability of the proposed retrieval model, PR-CLIP, to capture cross-modal positional associations through positional information removal and reconstruction. PR-CLIP leverages the reconstruction task for modality interaction only during training. Therefore, during retrieval, it maintains high computational efficiency similar to that of most dual-stream models.
3. Methodology
3.1. Problem Formulation
Let $\mathcal{I}$ be the set of remote sensing images and $\mathcal{T}$ be the corresponding set of textual descriptions, where each image $I \in \mathcal{I}$ is associated with one or more text descriptions $T \in \mathcal{T}$. In the remote sensing text-to-image retrieval task, given a query text $T_q$, we evaluate its similarity with all candidate images, i.e., $S(T_q, I)$ for every $I \in \mathcal{I}$, where $S$ is the similarity scoring function, which is typically calculated using cosine similarity. Finally, we select the image with the highest similarity to $T_q$ as the retrieval result. Similarly, in the image-to-text retrieval task, we compute $S(I_q, T)$, which measures the similarity between a query image $I_q$ and all candidate texts, and we select the text with the highest similarity as the retrieval result.
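As a concrete, hedged illustration of this formulation, the snippet below sketches text-to-image retrieval by cosine similarity over precomputed feature matrices; the function name, tensor shapes, and feature dimension are assumptions for illustration rather than the exact interface of PR-CLIP.

```python
import torch
import torch.nn.functional as F

def retrieve_images(text_feature: torch.Tensor, image_features: torch.Tensor, top_k: int = 10):
    """Rank candidate images for one query text by cosine similarity.

    text_feature:   (d,)   feature of the query text
    image_features: (M, d) features of all candidate images
    """
    # Cosine similarity reduces to a dot product after L2 normalization.
    t = F.normalize(text_feature, dim=-1)
    v = F.normalize(image_features, dim=-1)
    scores = v @ t                       # (M,) similarity S(T_q, I_i) for every candidate
    return torch.topk(scores, k=top_k)   # highest-scoring images form the ranked result

# Toy example with random features; the feature dimension (512) is an assumption.
values, indices = retrieve_images(torch.randn(512), torch.randn(1000, 512), top_k=5)
```

Image-to-text retrieval follows symmetrically by swapping the roles of the query and the candidate set.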
3.2. Model Overview
Figure 2 illustrates the framework of the PR-CLIP model, which consists of four main modules: a unimodal encoder, a cross-modal positional information extraction module, a unimodal positional information filtering module, and a cross-modal positional information reconstruction module. The unimodal encoder separately encodes images and text, and it utilizes a cross-modal contrastive loss to bring matching image–text pairs closer in the feature space. The cross-modal positional information extraction module inputs the encoded image and text features into a fusion encoder to learn positional associations. To enhance the model’s ability to recognize cross-modal positional relationships, we propose a positional information reconstruction task during training. Specifically, the unimodal positional information filtering module removes the positional association information learned by the fusion encoder from the original unimodal representations. Then, by leveraging information from the matched counterpart modality, the cross-modal positional information reconstruction module reconstructs the missing positional information in the incomplete unimodal representations. During this process, we introduce a positional reconstruction consistency loss to guide the model optimization.
3.3. Unimodal Encoder
PR-CLIP adopts a dual-stream architecture similar to CLIP [21] as its unimodal encoder, and it employs two independent transformer encoders [23] to encode images and text separately. On the one hand, the transformer structure in the unimodal encoder efficiently extracts key information from images and text while effectively capturing global context. On the other hand, using the same encoder structure ensures consistency in encoding across the image and text modalities, which allows features from different modalities to be mapped to the same semantic space during similarity calculation.
Given an input image–text pair $(I, T)$, its unimodal encoding process can be expressed as
$$E_I = f_I(I), \qquad E_T = f_T(T),$$
where $E_I$ and $E_T$ represent the encoded embeddings of the image and text, respectively; $E_I$ consists of a global token $e_I^{cls}$ and $m$ patch tokens; $E_T$ consists of a global token $e_T^{cls}$ and $n$ word tokens; and $f_I$ and $f_T$ denote the image and text encoders, respectively.
The image and text features used for retrieval directly adopt the global token representations from the image and text embeddings, i.e.,
$$v = \mathrm{FC}_I\left(e_I^{cls}\right) = W_I e_I^{cls} + b_I, \qquad t = \mathrm{FC}_T\left(e_T^{cls}\right) = W_T e_T^{cls} + b_T,$$
where $v$ and $t$ represent the image and text features, $\mathrm{FC}_I$ and $\mathrm{FC}_T$ are linear layers used to project the image and text features into the same feature space for similarity computation, and $W_I$ and $W_T$ denote the linear transformation matrices in the affine transformation.
To minimize the distance between matching images and texts, PR-CLIP employs cross-modal contrastive learning for model training. Specifically, given a batch of $N$ image–text pairs $\{(I_i, T_i)\}_{i=1}^{N}$, we first compute the cosine similarity between the image and text features, and we then optimize the model using the InfoNCE loss, referred to as the cross-modal contrastive loss (CMC Loss) in this work, as follows:
$$\mathcal{L}_{CMC} = -\frac{1}{2N}\sum_{i=1}^{N}\left[\log\frac{\exp\left(S(I_i, T_i)/\tau\right)}{\sum_{j=1}^{N}\exp\left(S(I_i, T_j)/\tau\right)} + \log\frac{\exp\left(S(I_i, T_i)/\tau\right)}{\sum_{j=1}^{N}\exp\left(S(I_j, T_i)/\tau\right)}\right],$$
where $S(I_i, T_j)$ represents the cosine similarity between the image $I_i$ and the text $T_j$, and where $\tau$ is the temperature coefficient used to adjust the distribution span in contrastive learning.
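For clarity, the following is a minimal sketch of this bidirectional objective, assuming a batch of matched pairs whose features have already been extracted; the variable names and the default temperature value are illustrative rather than the exact PR-CLIP implementation.

```python
import torch
import torch.nn.functional as F

def cmc_loss(v: torch.Tensor, t: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Bidirectional InfoNCE over a batch of N matched image-text pairs.

    v: (N, d) image features; t: (N, d) text features; row i of v matches row i of t.
    """
    v = F.normalize(v, dim=-1)
    t = F.normalize(t, dim=-1)
    logits = v @ t.T / tau                        # (N, N) cosine similarities scaled by temperature
    labels = torch.arange(v.size(0), device=v.device)
    loss_i2t = F.cross_entropy(logits, labels)    # image-to-text direction
    loss_t2i = F.cross_entropy(logits.T, labels)  # text-to-image direction
    return 0.5 * (loss_i2t + loss_t2i)
```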
By computing the bidirectional cross-modal contrastive loss for text-to-image and image-to-text, the model effectively aligns image and text modalities, making matched samples closer in the feature space while pushing unmatched samples apart. This training approach enhances the robustness of cross-modal retrieval, and it enables the model to accurately measure the semantic association between images and text when computing similarity.
Through unimodal encoding and optimization via cross-modal contrastive learning, PR-CLIP can acquire basic cross-modal retrieval capabilities. However, the model still lacks the ability to model cross-modal positional associations. To address this, we introduce a cross-modal positional information reconstruction task to further enhance the model’s understanding and alignment of spatial relationships between images and texts. The following sections will sequentially introduce the three key modules involved in this task.
3.4. Cross-Modal Positional Information Extraction
We perform deep interaction between image and text through the cross-modal positional information extraction module to extract cross-modal positional associations. The structure of this module is shown in Figure 3.
First, the complete embeddings of the image and text are projected into the same vector space through two separate linear projection layers to enable unified modeling and subsequent cross-modal interaction, i.e.,
$$\tilde{E}_I = W_I^{p} E_I, \qquad \tilde{E}_T = W_T^{p} E_T,$$
where $\tilde{E}_I$ and $\tilde{E}_T$ denote the projected image and text embeddings, respectively, and where $W_I^{p}$ and $W_T^{p}$ represent the linear projection matrices for the image and text, respectively.
Subsequently, we concatenate the image and text embeddings into a single long sequence to construct a unified cross-modal representation, i.e.,
$$E_{IT} = \left[\tilde{E}_I; \tilde{E}_T\right].$$
The concatenated cross-modal sequence is then fed into an encoder composed of multiple transformer blocks for fusion encoding. This encoder adopts the standard transformer architecture, which includes multi-head self-attention mechanisms and feed-forward networks. Residual connections and layer normalization are incorporated at each layer to stabilize the training process and enhance the representation capacity. This process can be formulated as
$$Z' = \mathrm{LN}\left(E_{IT} + \mathrm{MSA}(E_{IT})\right), \qquad Z = \mathrm{LN}\left(Z' + \mathrm{MLP}(Z')\right),$$
where $Z$ denotes the fused embedding of the image and text, $Z'$ represents the intermediate representation after applying self-attention, MSA refers to the multi-head self-attention, LN denotes the layer normalization, and MLP represents the feed-forward network.
At this stage, explicit cross-modal connections are established between image and text tokens through the self-attention mechanism. Image tokens can acquire semantic supplements from text tokens, while text tokens can perceive information from the corresponding regions in the image. Through this process, the model can effectively exploit the complementary information between different modalities, thereby enhancing their unimodal representations.
Then, the unimodal embeddings that contain positional information can be separated from the fused embedding, i.e.,
$$E_I^{pos} = Z_I, \qquad E_T^{pos} = Z_T, \qquad \text{with } Z = \left[Z_I; Z_T\right],$$
where $E_I^{pos}$ and $E_T^{pos}$ denote the image and text embeddings, respectively, that contain positional information, and where $Z_I$ and $Z_T$ denote the token representations of the image and text embeddings within the fused sequence.
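To make the above pipeline concrete, the sketch below mirrors its three steps: project both token sequences into a shared space, fuse them with a standard transformer encoder, and split the result back into per-modality parts. The module name, embedding dimensions, and head count are assumptions; the two-layer depth follows the setting reported in Section 4.4.

```python
import torch
import torch.nn as nn

class PositionalInfoExtractor(nn.Module):
    """Fuse image and text token embeddings and return the position-aware parts."""

    def __init__(self, img_dim=768, txt_dim=512, dim=512, layers=2, heads=8):
        super().__init__()
        self.proj_img = nn.Linear(img_dim, dim)   # project both modalities into a shared space
        self.proj_txt = nn.Linear(txt_dim, dim)
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(block, num_layers=layers)

    def forward(self, e_img: torch.Tensor, e_txt: torch.Tensor):
        # e_img: (B, m+1, img_dim) image tokens; e_txt: (B, n+1, txt_dim) text tokens
        z = torch.cat([self.proj_img(e_img), self.proj_txt(e_txt)], dim=1)  # unified sequence
        z = self.fusion(z)                        # self-attention across both modalities
        m = e_img.size(1)
        return z[:, :m], z[:, m:]                 # position-aware image / text embeddings

# Example: batch of 2, 50 image tokens (incl. global token), 32 text tokens.
extractor = PositionalInfoExtractor()
e_img_pos, e_txt_pos = extractor(torch.randn(2, 50, 768), torch.randn(2, 32, 512))
```

Note that PyTorch's `nn.TransformerEncoderLayer` already applies the residual connections and layer normalization described above.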
3.5. Unimodal Positional Information Filtering
To enable the model to reconstruct positional information, we explicitly remove such information from the original unimodal embeddings of images and text through the unimodal positional information filtering module. This module is designed to mask position-related information in unimodal embeddings, forcing the subsequent reconstruction task to rely on the complementary modality to recover the removed positional information. In this way, PR-CLIP is encouraged to better learn cross-modal positional associations. The structure of this module is shown in Figure 4.
Taking the image embedding as an example, we first perform element-wise subtraction between the original image embedding $E_I$ and the image embedding containing the positional information $E_I^{pos}$, resulting in an embedding $\hat{E}_I = E_I - E_I^{pos}$ with explicit positional information removed. On this basis, to reallocate the importance of the features and optimize their expressive capacity, $\hat{E}_I$ is further projected through a linear transformation layer. The complete positional information filtering process for the image embedding can be expressed as
$$\bar{E}_I = \mathrm{Linear}\left(E_I - E_I^{pos}\right),$$
where $\mathrm{Linear}$ denotes a linear transformation layer.
Accordingly, the text unimodal embedding with filtered positional information can be expressed as
$$\bar{E}_T = \mathrm{Linear}\left(E_T - E_T^{pos}\right).$$
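A minimal sketch of this filtering step is given below, assuming the original and position-aware embeddings have already been mapped to a common dimension; the class and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class PositionalInfoFilter(nn.Module):
    """Remove position-related information from a unimodal token embedding."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.linear = nn.Linear(dim, dim)   # reallocates feature importance after subtraction

    def forward(self, e: torch.Tensor, e_pos: torch.Tensor) -> torch.Tensor:
        # e:     original unimodal token embeddings                    (B, L, dim)
        # e_pos: position-aware embeddings from the fusion encoder     (B, L, dim)
        return self.linear(e - e_pos)       # element-wise subtraction, then linear projection
```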
3.6. Cross-Modal Positional Information Reconstruction
After completing the filtering of the positional information, the filtered unimodal embedding is fed into the cross-modal positional information reconstruction module. The model tries to reconstruct the missing positional information using the complete information from the other modality via a cross-modal interaction mechanism, without directly relying on the positional cues of its own modality. The structure of this module is shown in Figure 5.
Taking image embedding reconstruction as an example, we adopt a transformer decoder structure, where the image embedding without positional information is used as the target sequence and the corresponding complete text embedding is fed into the decoder as the context sequence. The decoder models the interaction between the two modalities through a cross-attention mechanism, extracting position-related information for the image from the text. Unlike direct cross-modal feature fusion, this process is reconstruction-oriented rather than merely integrating features. The reconstruction objective encourages the model to retrieve and reconstruct the missing information from the other modality, thereby achieving fine-grained semantic alignment. This process can be formulated as
$$\hat{Z}_I = \mathrm{LN}\left(\bar{E}_I + \mathrm{MSA}(\bar{E}_I)\right), \qquad \tilde{Z}_I = \mathrm{LN}\left(\hat{Z}_I + \text{C-MSA}(\hat{Z}_I, E_T)\right), \qquad E_I^{rec} = \mathrm{LN}\left(\tilde{Z}_I + \mathrm{MLP}(\tilde{Z}_I)\right),$$
where $\hat{Z}_I$ denotes the filtered image embedding after self-attention, $\tilde{Z}_I$ denotes the image embedding after cross-modal attention with the original text embedding $E_T$, $E_I^{rec}$ denotes the reconstructed image embedding with positional information, and C-MSA refers to the multi-head cross-attention.
Similarly, the decoding process for reconstructing the text using the complete image can be expressed as
$$\hat{Z}_T = \mathrm{LN}\left(\bar{E}_T + \mathrm{MSA}(\bar{E}_T)\right), \qquad \tilde{Z}_T = \mathrm{LN}\left(\hat{Z}_T + \text{C-MSA}(\hat{Z}_T, E_I)\right), \qquad E_T^{rec} = \mathrm{LN}\left(\tilde{Z}_T + \mathrm{MLP}(\tilde{Z}_T)\right).$$
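The standard PyTorch transformer decoder applies exactly this self-attention, cross-attention, and feed-forward sequence, so the reconstruction module can be sketched as follows; the eight-layer depth follows Section 4.4, while the remaining settings and names are assumptions.

```python
import torch
import torch.nn as nn

class PositionalInfoReconstructor(nn.Module):
    """Reconstruct a filtered unimodal embedding from the complete other modality."""

    def __init__(self, dim: int = 512, layers: int = 8, heads: int = 8):
        super().__init__()
        block = nn.TransformerDecoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(block, num_layers=layers)

    def forward(self, filtered: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # filtered: position-filtered target-modality tokens, e.g. image tokens  (B, L_t, dim)
        # context:  complete tokens of the other modality, e.g. text tokens      (B, L_c, dim)
        # Self-attention runs over `filtered`; cross-attention attends to `context`.
        return self.decoder(tgt=filtered, memory=context)

# Image reconstruction: filtered image tokens as target, complete text tokens as context.
reconstructor = PositionalInfoReconstructor()
e_img_rec = reconstructor(torch.randn(2, 50, 512), torch.randn(2, 32, 512))
```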
To ensure that the reconstructed features are as close as possible to the original features in terms of semantics and structure, we introduce the mean squared error (MSE) loss as the positional reconstruction consistency (PRC) loss, i.e.,
$$\mathcal{L}_{PRC} = \mathrm{MSE}\left(E_I^{rec}, E_I\right) + \mathrm{MSE}\left(E_T^{rec}, E_T\right).$$
This loss function measures the element-wise difference between the reconstructed features and the original ones. By minimizing the distance between the reconstructed and the true features, the model is encouraged to accurately restore the filtered positional information.
We combine the cross-modal contrastive loss and the positional information reconstruction loss as the final optimization objective of our PR-CLIP model, i.e.,
$$\mathcal{L} = \alpha \mathcal{L}_{CMC} + \beta \mathcal{L}_{PRC},$$
where $\alpha$ and $\beta$ are hyperparameters used to balance the weights of the loss terms.
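A short sketch of the combined objective is given below, using the loss weights reported later in Section 4.4; `cmc_loss` refers to the contrastive-loss sketch in Section 3.3, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

ALPHA, BETA = 0.01, 1.0  # loss weights reported in Section 4.4

def total_loss(v, t, e_img, e_txt, e_img_rec, e_txt_rec, tau: float = 0.07) -> torch.Tensor:
    """Weighted sum of the CMC loss and the PRC loss."""
    l_cmc = cmc_loss(v, t, tau)  # bidirectional contrastive loss (see the earlier sketch)
    # Positional reconstruction consistency: MSE between reconstructed and original embeddings.
    l_prc = F.mse_loss(e_img_rec, e_img) + F.mse_loss(e_txt_rec, e_txt)
    return ALPHA * l_cmc + BETA * l_prc
```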
4. Experiments
4.1. Datasets
We used two public multimodal remote sensing image–text retrieval datasets, i.e., RSICD [36] and RSITMD [3], to evaluate the performance of PR-CLIP. All the images in the RSICD dataset were captured by aircraft or satellites, covering 28 categories including airports, forests, and ports. Each image is paired with five textual descriptions that provide detailed semantic information. The RSITMD dataset extends the image categories to 32, including stadiums, farmlands, and schools. Similarly, each image is associated with five textual descriptions to support diverse cross-modal retrieval tasks. PR-CLIP was trained on the RET-2 dataset, which is a combination of the RSICD and RSITMD datasets. To prevent data leakage between the training and test datasets, the RET-2 dataset followed the strict de-duplication strategy proposed by RemoteCLIP [27], ensuring the fairness and reliability of the model evaluation.
Table 1 presents the statistics of the three datasets. During the training process, each dataset was divided into training, validation, and testing sets with a split ratio of 8:1:1.
4.2. Evaluation Metrics
To maintain consistency with the baseline models, we report the Recall at K (R@K, K = 1, 5, 10) and the mean Recall (mR). R@K indicates whether the correct match appears within the top-K retrieval results, and it is used to evaluate the model’s performance under different retrieval accuracy requirements. The formula of R@K is defined as
$$\mathrm{R@K} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\left(r_i \le K\right),$$
where $N$ is the total number of queries, $r_i$ denotes the rank of the ground-truth match for the $i$-th query, and $\mathbb{1}(\cdot)$ is the indicator function that returns 1 if the condition holds and 0 otherwise.
The mR metric represents the mean of all R@K values over the image-to-text and text-to-image retrieval tasks, providing a more comprehensive reflection of the model’s overall performance. The formula of mR is defined as
$$\mathrm{mR} = \frac{1}{6}\sum_{K \in \{1,5,10\}}\left(\mathrm{R@K}^{i2t} + \mathrm{R@K}^{t2i}\right),$$
where $\mathrm{R@K}^{i2t}$ denotes the R@K metric for the image-to-text retrieval task and $\mathrm{R@K}^{t2i}$ denotes the R@K metric for the text-to-image retrieval task.
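Both metrics can be computed directly from the rank of the ground-truth match for each query, as in the hedged sketch below (one ground-truth item per query is assumed).

```python
import numpy as np

def recall_at_k(ranks: np.ndarray, k: int) -> float:
    """Fraction of queries whose ground-truth match appears within the top-k results."""
    return float(np.mean(ranks <= k))

def mean_recall(ranks_i2t: np.ndarray, ranks_t2i: np.ndarray) -> float:
    """Average of R@1, R@5, and R@10 over both retrieval directions."""
    ks = (1, 5, 10)
    scores = [recall_at_k(ranks_i2t, k) for k in ks] + [recall_at_k(ranks_t2i, k) for k in ks]
    return float(np.mean(scores))

# Example: 1-indexed ranks of the ground truth for five queries in each direction.
print(mean_recall(np.array([1, 3, 12, 2, 1]), np.array([4, 1, 20, 6, 2])))
```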
4.3. Baselines
We compared PR-CLIP with the following baseline methods:
VSE++ [37] introduced a hard-negative-aware loss function to enhance visual–semantic embedding learning.
AMFMN [3] designed an asymmetric multimodal feature matching network with multi-scale attention and a dynamic-margin triplet loss.
GaLR [33] proposed a global–local RSITR framework with dynamic fusion and re-ranking to enhance retrieval performance.
SWAN [38] introduced a scene-aware aggregation network with multiscale fusion and fine-grained sensing to reduce semantic confusion.
FAMMI [39] proposed a fine-grained semantic alignment method that aggregates multi-scale features and enhances cross-layer consistency.
PIR [17] introduced a prior-instructed representation framework with progressive attention encoders to reduce semantic noise.
MTGFE [40] proposed a multi-task guided fusion encoder with a multi-view joint representations contrast task to enhance fine-grained alignment.
KAMCL [41] introduced a knowledge-aided contrastive learning framework with hierarchical aggregation to enhance fine-grained discrimination.
PE-RSITR [16] introduced a parameter-efficient transfer learning framework with a hybrid contrastive loss to adapt vision–language models to RSITR.
VGSGN [42] proposed a visual-global-salient-guided network with dynamic fusion to enhance cross-modal alignment between image and text.
RemoteCLIP [27] scaled CLIP pretraining with 12× enlarged RS-specific data via data augmentation, significantly improving RSITR performance.
GeoRSCLIP [28] introduced the large-scale remote sensing dataset RS5M with 5 million image–text pairs, and it trained a CLIP-based model to enhance RSITR.
4.4. Implementation Details
This section describes the implementation and evaluation details of PR-CLIP. Concretely, during the retrieval stage, PR-CLIP first employs unimodal encoders to extract global semantic features from both images and texts independently. Based on these features, the model computes similarity scores between image–text pairs and ranks them accordingly, enabling bidirectional retrieval from image to text and vice versa. We evaluated PR-CLIP on the RSICD and RSITMD datasets using R@K and mR, and we compared it with numerous state-of-the-art methods to demonstrate the superiority of PR-CLIP in retrieval tasks. To verify the effectiveness of each module, we also conducted ablation studies. We partially or entirely removed key components of the cross-modal positional information reconstruction module during training, and we adopted the same evaluation protocol as in the main experiments to assess performance differences. Furthermore, to better illustrate the effectiveness and working mechanism of the proposed method, we also conducted visualization analysis. We extracted attention weight matrices from the final layer of the transformer in the cross-modal positional information extraction module and plotted the corresponding attention maps. Additionally, we visualized image features extracted at different stages of the model to further validate the effectiveness of PR-CLIP in modeling spatial location information.
We implemented PR-CLIP with PyTorch v2.5.1, and we trained the model based on the ITRA framework [43]. For the RSICD dataset, the number of training epochs was set to 20, the learning rate was set to 5 × 10⁻⁵, and the batch size was set to 160. For the RSITMD dataset, the number of training epochs was set to 7, the learning rate was set to 5 × 10⁻⁶, and the batch size was set to 100. The training configuration included 100 warm-up steps, a weight decay of 0.5, and a maximum gradient norm of 50 for gradient clipping. We employed two transformer layers for the cross-modal positional information extraction and eight transformer layers for the cross-modal positional information reconstruction. The CMC loss weight $\alpha$ was set to 0.01, and the temperature was a learnable parameter adjusted dynamically during training. The PRC loss weight $\beta$ was set to 1. All the experiments were conducted on a Linux server equipped with two NVIDIA RTX 4090 GPUs.
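For reference, the reported settings can be collected into a single configuration sketch; the keys are illustrative and do not correspond to the actual option names of the ITRA framework.

```python
# Hyperparameters reported above, gathered into an illustrative configuration.
TRAIN_CONFIG = {
    "rsicd":  {"epochs": 20, "lr": 5e-5, "batch_size": 160},
    "rsitmd": {"epochs": 7,  "lr": 5e-6, "batch_size": 100},
    "warmup_steps": 100,
    "weight_decay": 0.5,
    "max_grad_norm": 50,
    "extraction_layers": 2,       # transformer layers for positional information extraction
    "reconstruction_layers": 8,   # transformer layers for positional information reconstruction
    "cmc_loss_weight": 0.01,      # alpha
    "prc_loss_weight": 1.0,       # beta
}
```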
4.5. Comparison Results
Table 2 shows the RSITR results of PR-CLIP and various baseline methods on the two datasets. According to the table, we have the following observations:
PR-CLIP achieved the best overall performance among the models trained on the same scale of training data. Although our model was trained on RET-2, it is worth noting that RET-2 is constructed by combining RSICD and RSITMD after removing duplicates, and, thus, the actual amount of unique training data remains comparable to that of the baseline models. Specifically, it outperformed all the existing RSITR methods in terms of the mean Recall (mR) metric. On the RSICD dataset, PR-CLIP improved mR by 26% (from 31.12 to 39.23), and it also achieved an 18% improvement (from 44.47 to 52.41) on the RSITMD dataset. These results indicate that introducing the cross-modal positional information reconstruction task enables the model to effectively learn image–text positional associations, further improving cross-modal retrieval performance.
Secondly, all the models performed better on the image-to-text retrieval task than on the text-to-image retrieval task in terms of the R@1 metric. This indicates that in tasks requiring the retrieval of a single best match, the existing methods tend to perform more accurately in image-to-text retrieval. Image features generally contain richer detail and positional information, making it easier to retrieve semantically matched texts. In contrast, textual descriptions often exhibit higher ambiguity and vagueness. Compared with the existing methods, PR-CLIP achieved more significant improvements on the text-to-image retrieval task and achieved the best performance across all the evaluation metrics. This confirms that the cross-modal positional information reconstruction task can better help the model align text and image features.
Thirdly, the models trained with more additional data generally performed better on the RSITR task, even with relatively simple architectures. This suggests that RSITR performance is highly data-dependent and that sufficient cross-modal data samples can significantly enhance the model’s representation ability. RemoteCLIP and GeoRSCLIP were trained with additional data, where RemoteCLIP used a training set approximately 12 times larger than that of PR-CLIP, and GeoRSCLIP utilized a dataset 16 times larger. Despite this, PR-CLIP achieved consistently superior performance compared to RemoteCLIP across all the evaluation metrics, and it was only slightly outperformed by GeoRSCLIP on two specific indicators. These results demonstrate the effectiveness of PR-CLIP, even under limited data conditions.
4.6. Ablation Studies
To validate the effectiveness of the proposed cross-modal positional information reconstruction task in PR-CLIP, we conducted a series of ablation studies. Specifically, we removed the PRC loss from the image side and the text side, respectively, and we summarize the results in Table 3. When the reconstruction task was entirely removed, PR-CLIP experienced a significant performance drop. Retaining only the reconstruction loss on either the image or text side led to a performance improvement compared to the complete removal of the task, but it still underperformed the full model. These results demonstrate that the proposed cross-modal positional information reconstruction task effectively captures the positional correspondence between image and text, and that it is crucial for achieving more accurate cross-modal image–text retrieval.
4.7. Hyper-Parameter Studies
We also conducted experiments to evaluate the effects of the weights of the CMC loss and the PRC loss. The results are shown in Table 4 and Table 5, where $\alpha$ and $\beta$ are the weights of the CMC loss and the PRC loss, respectively. The experimental results show that the PR-CLIP model achieved the best performance with $\alpha = 0.01$ and $\beta = 1$. These results indicate that the model is particularly sensitive to the positional reconstruction loss, which is essential for the performance of cross-modal retrieval tasks. In contrast, the cross-modal contrastive loss is assigned a lower weight to ensure that the information from different modalities is brought closer while the inter-modal difference is preserved.
4.8. Efficiency Evaluation
We assessed the time efficiency of PR-CLIP alongside multiple representative baseline RSITR methods. These models were tested on the RSICD and RSITMD datasets using identical batch sizes and embedding dimensions on an NVIDIA RTX 4090 GPU.
Table 6 shows the mR of PR-CLIP and the baseline methods. ‘IT(s)’ denotes the total inference time of the model on the testing set, measured in seconds. The results indicate that PR-CLIP not only outperforms the compared models but also requires a shorter total inference time. This is because the cross-modal positional information reconstruction task is only introduced during training, while inference relies solely on the independent encoding of images and texts, without requiring deep cross-modal interactions. PR-CLIP effectively combines the efficiency of dual-stream models with the alignment capabilities of single-stream models, thus significantly improving retrieval performance on the RSITR task while maintaining inference-time efficiency.
4.9. Visualization of Positional Alignment
To verify whether PR-CLIP achieves positional alignment between images and text, we extracted the attention weight matrix from the last transformer layer in the cross-modal positional information extraction module. We selected key semantic entities from the textual descriptions and analyzed the attention response of each image patch to the corresponding tokens. A higher attention weight indicated a stronger association between the image region and the textual object.
Figure 6 presents the attention visualization of a remote sensing image–text pair. The regions with higher attention are displayed with higher opacity to highlight their relevance, while the regions with lower attention are shown with lower opacity, indicating weaker association with the current textual entity. When the textual keyword is “planes”, the regions in the image containing airplanes were assigned higher attention weights. When the keyword is “red building”, the model focused more on the region where the red building was located. These two regions were connected through the spatial keyword “next to”, and a clear boundary is observable in the attention map. This demonstrates that, through the cross-modal positional information reconstruction task, PR-CLIP can effectively learn the positional associations between images and texts, thereby enhancing cross-modal retrieval performance.
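As a hedged sketch of how such maps can be obtained, the snippet below reads the attention weights from a single multi-head attention layer and reshapes the patch-to-keyword scores into a grid for overlay; the token ordering, keyword position, and 7 × 7 patch grid are assumptions and may differ from the actual PR-CLIP implementation.

```python
import torch
import torch.nn as nn

# Illustrative: read cross-modal attention weights from one attention layer.
attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

tokens = torch.randn(1, 50 + 32, 512)            # 50 image tokens followed by 32 text tokens
_, weights = attn(tokens, tokens, tokens, need_weights=True, average_attn_weights=True)

# Attention of every image token to a chosen text token (e.g., the token for "planes").
text_token_index = 50 + 7                        # hypothetical position of the keyword token
patch_to_word = weights[0, :50, text_token_index]  # (50,) relevance scores for image tokens

# Drop the global token and reshape to the patch grid for overlay on the image.
heatmap = patch_to_word[1:].reshape(7, 7)
```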
To intuitively observe the role of positional information in the model, we visualized the attention distribution of the image at different stages using Grad-CAM, as shown in Figure 7. ‘Global visual feature’ represents the global visual feature extracted by the model before incorporating textual information. Its attention is relatively scattered, indicating that a cross-modal positional relationship had not yet been modeled at this stage. Then, the cross-modal positional feature was obtained through the cross-modal positional information extraction module. The attention was clearly focused on the regions corresponding to “five planes” and “red building”, which aligns well with the text description. ‘Visual feature without positional information’ refers to the image feature after the positional information was removed via the unimodal positional information filtering module. The attention to the planes and surrounding regions was significantly weakened, and the red building almost completely disappeared from focus. ‘Reconstructed visual feature’ denotes the image feature reconstructed with the help of textual information. The model re-focused on key semantic regions, with notably enhanced attention to the red building, indicating that this region is a critical cue for distinguishing the query text from other remote sensing images. From raw visual features to position awareness, and through filtering and reconstruction, PR-CLIP progressively achieves alignment of key target regions via image–text interaction. This process fully validates the effectiveness of the proposed cross-modal positional information reconstruction task in remote sensing image–text retrieval.
4.10. Visualization of Retrieval Results
We also conducted a visual analysis of PR-CLIP’s retrieval results. As shown in Figure 8, PR-CLIP accurately retrieved the target corresponding to the query text or image. In the figure, the top three retrieval results returned by the model, excluding the correct match, are displayed. It can be observed that PR-CLIP successfully captured key spatial positional information, such as “park”, in both the image-to-text and text-to-image retrieval tasks. Although these mismatched images and texts are not exact matches to the target, they are semantically highly relevant. This demonstrates that PR-CLIP effectively enhances the semantic relevance between queries and results by learning cross-modal positional associations between images and text, thereby improving the robustness of the retrieval model.
In addition, as shown in Figure 9, we conducted a case study on a set of real-world remote sensing images from the Gaofen-4 satellite. Given the input text query, the figure shows the top three images retrieved by PR-CLIP. Clearly, PR-CLIP was able to retrieve the images with a high semantic similarity to the input text, demonstrating the effectiveness and generalization capability of PR-CLIP.