1. Introduction
With the rapid advancements in aerospace technology, the acquisition of remote sensing data has become increasingly convenient [1]. However, the continuous expansion of remote sensing data presents a pressing challenge: how to efficiently and accurately retrieve target data that meet query requirements from massive datasets. Against this backdrop, remote sensing image–text retrieval (RSITR) has emerged as a research hotspot in recent years [2,3,4]. Currently, RSITR technology is widely applied in various areas, such as natural disaster monitoring [5], urban planning [6], and agricultural production [7].
Traditional remote sensing image retrieval methods mainly rely on handcrafted features and shallow models [8,9,10]. However, these methods suffer from low computational efficiency and limited retrieval accuracy. In recent years, deep learning-based RSITR methods have gradually become a research hotspot, and they can be mainly divided into dual-stream models and single-stream models [11,12,13,14]. Dual-stream models [15,16,17] typically employ independent encoders to extract features from images and texts separately and train them on sample-matching relationships to learn cross-modal consistency. In contrast, single-stream models [18,19,20] feed both images and texts into the same encoder to learn shared cross-modal representations.
However, the existing RSITR methods still have certain limitations. While dual-stream models offer higher computational efficiency, they lack deep cross-modal feature interactions. On the other hand, single-stream models can more comprehensively model cross-modal relationships but come with higher computational costs and lower inference efficiency. Moreover, when calculating image–text similarity, most of these methods rely on global features, overlooking crucial positional relationships in remote sensing data. In multimodal remote sensing retrieval tasks, textual descriptions often refer to specific target regions rather than the entire image. Therefore, relying solely on global features for matching may introduce redundant information, thus leading to increased retrieval errors. Consequently, how to enable the model to capture cross-modal spatial relationships on top of global feature matching, thereby achieving finer-grained cross-modal alignment, has become a key challenge for improving the performance of remote sensing image–text retrieval.
Figure 1 presents an example of a remote sensing image–text pair. The text description primarily focuses on specific areas of the image, such as “five planes” and “red building,” which are spatially related through the connective phrase “next to.” However, the existing methods lack the ability to capture cross-modal positional relationships, resulting in background information being incorporated into similarity calculations. In other words, these methods fail to align the object words in the text with their corresponding regions in the image, and they are also unable to associate two objects with positional relationships through spatial keywords. This interference disrupts the alignment between key target regions and textual descriptions, ultimately degrading retrieval performance.
To address the insufficient utilization of positional information in existing methods, we propose a new remote sensing image–text retrieval model termed PR-CLIP. The code is publicly available at https://github.com/ADMIS-TONGJI/PR-CLIP (accessed on 18 June 2025). PR-CLIP adopts CLIP [21] as the image and text encoder, and it aligns remote sensing image–text features through cross-modal contrastive learning, thereby bringing semantically similar samples closer together and enhancing cross-modal matching capability. Based on this, we propose a cross-modal positional information reconstruction task to improve the model’s ability to capture spatial relationships between images and texts. Specifically, the encoded image and text features are first concatenated and fed into the cross-modal positional information extraction module to extract complementary cross-modal positional information. Subsequently, we employ the unimodal positional information filtering module to remove spatial information from the image and text features, obtaining unimodal representations that lack positional information. Next, with the complete information from the other modality, we reconstruct the filtered unimodal features using the cross-modal positional information reconstruction module. To ensure the quality of the reconstruction, we introduce a positional information reconstruction consistency loss to enforce similarity between the reconstructed features and the original unimodal features, thereby enhancing the model’s ability to capture positional relationships. It is worth noting that global feature matching still plays a central role in RSITR tasks. PR-CLIP aims to introduce cross-modal positional information modeling on top of global feature matching to achieve finer-grained cross-modal alignment. Although image–text retrieval in some scenarios may not rely on positional relationships, mining the latent spatial structure information does not weaken the representational ability of the global features. For scenarios involving spatial relationships, positional information plays an important role in cross-modal alignment.
PR-CLIP combines the advantages of both dual-stream and single-stream models. During training, the cross-modal positional information reconstruction task enhances spatial alignment between modalities, improving cross-modal matching accuracy. During inference, PR-CLIP omits the cross-modal positional information reconstruction modules and relies solely on the CLIP encoder for retrieval. Therefore, the model can retain the high computational efficiency of dual-stream models while leveraging learned positional information to enhance retrieval accuracy.
Our contributions are summarized as follows:
We propose a novel remote sensing image–text retrieval model, PR-CLIP, which can capture cross-modal positional information, enabling more efficient and accurate cross-modal retrieval.
We introduce a new cross-modal positional information reconstruction training task to enhance the model’s ability to understand and utilize positional relationships across modalities.
We conducted extensive experiments on two public datasets, and the results show that our model clearly outperforms the existing approaches.
The structure of this paper is as follows. In Section 2, we review the current RSITR methods and discuss their limitations. Section 3 presents a detailed description of the proposed PR-CLIP model and the cross-modal positional information reconstruction training task. In Section 4, we evaluate the performance of PR-CLIP by comparing it against multiple SOTA baselines. Finally, Section 5 summarizes this work.
2. Related Work
In this section, we review recent studies on remote sensing image–text retrieval. The existing RSITR methods can be categorized into two types, i.e., dual-stream models and single-stream models.
2.1. Dual-Stream RSITR Models
Dual-stream models typically use independent unimodal encoders to extract feature embeddings for images and texts separately. For example, they employ CNNs [22] or Transformers [23] to process image features, and they use BERT [24] or other language models to encode textual information. During training, the models are optimized in a supervised manner using positive and negative sample pairs, ensuring that the matched image–text pairs are as close as possible in the feature space while pushing apart the unmatched pairs. Since these methods extract unimodal features independently, they achieve higher computational efficiency.
For example, Liao et al. [25] employed separable convolution and text convolution to extract image and text features separately and enhanced cross-modal retrieval performance through knowledge distillation from large-scale pretrained models. Pan et al. [17] employed a progressive spatial attention encoder to encode images and a progressive temporal attention encoder to encode text, and they ultimately trained the model using a cluster-based membership loss. Yang et al. [26] encoded images by extracting texture and saliency features and used BERT to encode text. Zhang et al. [15] proposed a hypersphere-based visual–semantic alignment network, which optimizes the model through curriculum learning after separately encoding images and text. Zhou et al. [18] proposed a coarse-to-fine two-stage image–text retrieval framework. First, images and texts are separately encoded to compute similarity for coarse ranking. Then, a fine-grained ranking is performed on the candidate results using an image–text matching task.
In recent years, some researchers have introduced large-scale data into dual-stream models for training, aiming to enhance the generalization ability and robustness of cross-modal feature learning. These approaches seek to more effectively align image and text features and improve retrieval performance. For example, RemoteCLIP [27] adopts a data-augmentation approach to integrate the SEG-4, DET-10, and RET-3 datasets, resulting in a dataset 12 times larger than all the existing datasets combined. It is trained based on the CLIP framework to enhance the performance of remote sensing image–text retrieval. GeoRSCLIP [28] introduces the RS5M remote sensing image–text paired dataset, which contains 5 million remote sensing images with textual descriptions. The model is also trained based on the CLIP framework. EBAKER [29] introduces the NWPU dataset [30] into the training of the CLIP model, where the NWPU dataset contains 31,500 images and 157,500 matched texts. SkyCLIP [31] introduces the SkyScript dataset with 2.6 million image–text pairs and achieves good retrieval performance through the CLIP model.
However, dual-stream models lack deep cross-modal feature interactions and fusion, making it difficult for them to fully capture cross-modal semantic relationships, which, ultimately, limits their performance in retrieving complex remote sensing content.
2.2. Single-Stream RSITR Models
The core idea of single-stream models is to facilitate early-stage modality fusion during feature extraction. Therefore, they typically adopt a transformer structure to simultaneously process image and text data and learn a shared cross-modal representation. Subsequently, these models utilize the fused embeddings for tasks such as image–text matching and image–text masking to enhance cross-modal alignment and improve retrieval performance. For example, Yuan et al. [3] used a multi-scale visual self-attention module to extract image features and a cross-attention mechanism for text interaction, while proposing a triplet loss based on prior similarity to address the challenge of distinguishing similar images. Yu et al. [32] constructed graph structures separately for images and text to extract the corresponding modality features and aligned the different modalities through an image–text association module. Yuan et al. [33] integrated different hierarchical features through multi-level information dynamic fusion and introduced a denoising representation matrix and an enhanced adjacency matrix to optimize the local features generated by a GCN. Zhu et al. [34] proposed a multi-task joint learning framework that enhances cross-modal retrieval performance through a noise-aware background reconstruction task and a pixel-level prediction-based semantic segmentation task. Zhou et al. [18] first extracted image and text features separately and then promoted cross-modal alignment through a multi-visual-guided dynamic fusion module. Huang et al. [35] utilized textual cues to guide rich semantic reasoning within the visual context and further enhanced cross-modal interaction between textual and visual data through context region learning and consistency semantic alignment.
Single-stream methods can model inter-modal relationships more comprehensively than dual-stream models. However, since the retrieval stage requires computing the fused embedding between the target modality and all candidate features, these methods suffer from low inference efficiency. Additionally, single-stream models may weaken the independence of unimodal features, potentially leading to suboptimal cross-modal retrieval accuracy.
2.3. Discussion
Although the existing RSITR methods achieve high retrieval accuracy, they still struggle to effectively model cross-modal positional associations between images and text. Dual-stream models primarily focus on extracting precise unimodal representations but lack interaction modules for integrating image and text features, making it challenging to capture cross-modal positional relationships. Single-stream models employ a fusion encoder to align images and text. However, they predominantly emphasize global semantic matching while lacking fine-grained modeling of positional associations. To address these limitations, this study enhances the ability of the proposed retrieval model, PR-CLIP, to capture cross-modal positional associations through positional information removal and reconstruction. PR-CLIP leverages the reconstruction task for modality interaction only during training. Therefore, during retrieval, it maintains high computational efficiency similar to that of most dual-stream models.
3. Methodology
3.1. Problem Formulation
Let $\mathcal{I}$ be the set of remote sensing images and $\mathcal{T}$ be the corresponding set of textual descriptions, where each image $I \in \mathcal{I}$ is associated with one or more text descriptions $T \in \mathcal{T}$. In the remote sensing text-to-image retrieval task, given a query text $T_q$, we evaluate its similarity with all candidate images, i.e., $S(T_q, I)$ for every $I \in \mathcal{I}$, where $S$ is the similarity scoring function, which is typically calculated using cosine similarity. Finally, we select the image with the highest similarity to $T_q$ as the retrieval result. Similarly, in the image-to-text retrieval task, we compute $S(I_q, T)$, which measures the similarity between a query image $I_q$ and all candidate texts, and we select the text with the highest similarity as the retrieval result.
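As a concrete, hedged illustration of this formulation, the snippet below sketches text-to-image retrieval by cosine similarity over precomputed feature matrices; the function name, tensor shapes, and feature dimension are assumptions for illustration rather than the exact interface of PR-CLIP.

```python
import torch
import torch.nn.functional as F

def retrieve_images(text_feature: torch.Tensor, image_features: torch.Tensor, top_k: int = 10):
    """Rank candidate images for one query text by cosine similarity.

    text_feature:   (d,)   feature of the query text
    image_features: (M, d) features of all candidate images
    """
    # Cosine similarity reduces to a dot product after L2 normalization.
    t = F.normalize(text_feature, dim=-1)
    v = F.normalize(image_features, dim=-1)
    scores = v @ t                       # (M,) similarity S(T_q, I_i) for every candidate
    return torch.topk(scores, k=top_k)   # highest-scoring images form the ranked result

# Toy example with random features; the feature dimension (512) is an assumption.
values, indices = retrieve_images(torch.randn(512), torch.randn(1000, 512), top_k=5)
```

Image-to-text retrieval follows symmetrically by swapping the roles of the query and the candidate set.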
3.2. Model Overview
Figure 2 illustrates the framework of the PR-CLIP model, which consists of four main modules: a unimodal encoder, a cross-modal positional information extraction module, a unimodal positional information filtering module, and a cross-modal positional information reconstruction module. The unimodal encoder separately encodes images and text, and it utilizes a cross-modal contrastive loss to bring matching image–text pairs closer in the feature space. The cross-modal positional information extraction module inputs the encoded image and text features into a fusion encoder to learn positional associations. To enhance the model’s ability to recognize cross-modal positional relationships, we propose a positional information reconstruction task during training. Specifically, the unimodal positional information filtering module removes the positional association information learned by the fusion encoder from the original unimodal representations. Then, by leveraging information from the matched counterpart modality, the cross-modal positional information reconstruction module reconstructs the missing positional information in the incomplete unimodal representations. During this process, we introduce a positional reconstruction consistency loss to guide the model optimization.
3.3. Unimodal Encoder
PR-CLIP adopts a dual-stream architecture similar to CLIP [21] as its unimodal encoder, and it employs two independent transformer encoders [23] to encode images and text separately. On the one hand, the transformer structure in the unimodal encoder efficiently extracts key information from images and text while effectively capturing global context. On the other hand, using the same encoder structure ensures consistency in encoding across the image and text modalities, which allows features from different modalities to be mapped to the same semantic space during similarity calculation.
Given an input image–text pair $(I, T)$, its unimodal encoding process can be expressed as
$$E_I = f_I(I), \qquad E_T = f_T(T),$$
where $E_I$ and $E_T$ represent the encoded embeddings of the image and text, respectively; $E_I$ consists of a global token $e_I^{cls}$ and $m$ patch tokens; $E_T$ consists of a global token $e_T^{cls}$ and $n$ word tokens; and $f_I$ and $f_T$ denote the image and text encoders, respectively.
The image and text features used for retrieval directly adopt the global token representations from the image and text embeddings, i.e.,
$$v = \mathrm{FC}_I\left(e_I^{cls}\right) = W_I e_I^{cls} + b_I, \qquad t = \mathrm{FC}_T\left(e_T^{cls}\right) = W_T e_T^{cls} + b_T,$$
where $v$ and $t$ represent the image and text features, $\mathrm{FC}_I$ and $\mathrm{FC}_T$ are linear layers used to project the image and text features into the same feature space for similarity computation, and $W_I$ and $W_T$ denote the linear transformation matrices in the affine transformation.
To minimize the distance between matching images and texts, PR-CLIP employs cross-modal contrastive learning for model training. Specifically, given a batch of $N$ image–text pairs $\{(I_i, T_i)\}_{i=1}^{N}$, we first compute the cosine similarity between the image and text features, and we then optimize the model using the InfoNCE loss, referred to as the cross-modal contrastive loss (CMC Loss) in this work, as follows:
$$\mathcal{L}_{CMC} = -\frac{1}{2N}\sum_{i=1}^{N}\left[\log\frac{\exp\left(S(I_i, T_i)/\tau\right)}{\sum_{j=1}^{N}\exp\left(S(I_i, T_j)/\tau\right)} + \log\frac{\exp\left(S(I_i, T_i)/\tau\right)}{\sum_{j=1}^{N}\exp\left(S(I_j, T_i)/\tau\right)}\right],$$
where $S(I_i, T_j)$ represents the cosine similarity between the image $I_i$ and the text $T_j$, and where $\tau$ is the temperature coefficient used to adjust the distribution span in contrastive learning.
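For clarity, the following is a minimal sketch of this bidirectional objective, assuming a batch of matched pairs whose features have already been extracted; the variable names and the default temperature value are illustrative rather than the exact PR-CLIP implementation.

```python
import torch
import torch.nn.functional as F

def cmc_loss(v: torch.Tensor, t: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Bidirectional InfoNCE over a batch of N matched image-text pairs.

    v: (N, d) image features; t: (N, d) text features; row i of v matches row i of t.
    """
    v = F.normalize(v, dim=-1)
    t = F.normalize(t, dim=-1)
    logits = v @ t.T / tau                        # (N, N) cosine similarities scaled by temperature
    labels = torch.arange(v.size(0), device=v.device)
    loss_i2t = F.cross_entropy(logits, labels)    # image-to-text direction
    loss_t2i = F.cross_entropy(logits.T, labels)  # text-to-image direction
    return 0.5 * (loss_i2t + loss_t2i)
```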
By computing the bidirectional cross-modal contrastive loss for text-to-image and image-to-text, the model effectively aligns image and text modalities, making matched samples closer in the feature space while pushing unmatched samples apart. This training approach enhances the robustness of cross-modal retrieval, and it enables the model to accurately measure the semantic association between images and text when computing similarity.
Through unimodal encoding and optimization via cross-modal contrastive learning, PR-CLIP can acquire basic cross-modal retrieval capabilities. However, the model still lacks the ability to model cross-modal positional associations. To address this, we introduce a cross-modal positional information reconstruction task to further enhance the model’s understanding and alignment of spatial relationships between images and texts. The following sections will sequentially introduce the three key modules involved in this task.
3.4. Cross-Modal Positional Information Extraction
We perform deep interaction between image and text through the cross-modal positional information extraction module to extract cross-modal positional associations. The structure of this module is shown in Figure 3.
First, the complete embeddings of the image and text are projected into the same vector space through two separate linear projection layers to enable unified modeling and subsequent cross-modal interaction, i.e.,
$$\tilde{E}_I = W_I^{p} E_I, \qquad \tilde{E}_T = W_T^{p} E_T,$$
where $\tilde{E}_I$ and $\tilde{E}_T$ denote the projected image and text embeddings, respectively, and where $W_I^{p}$ and $W_T^{p}$ represent the linear projection matrices for the image and text, respectively.
Subsequently, we concatenate the image and text embeddings into a single long sequence to construct a unified cross-modal representation, i.e.,
$$E_{IT} = \left[\tilde{E}_I; \tilde{E}_T\right].$$
The concatenated cross-modal sequence is then fed into an encoder composed of multiple transformer blocks for fusion encoding. This encoder adopts the standard transformer architecture, which includes multi-head self-attention mechanisms and feed-forward networks. Residual connections and layer normalization are incorporated at each layer to stabilize the training process and enhance the representation capacity. This process can be formulated as
$$Z' = \mathrm{LN}\left(E_{IT} + \mathrm{MSA}(E_{IT})\right), \qquad Z = \mathrm{LN}\left(Z' + \mathrm{MLP}(Z')\right),$$
where $Z$ denotes the fused embedding of the image and text, $Z'$ represents the intermediate representation after applying self-attention, MSA refers to the multi-head self-attention, LN denotes the layer normalization, and MLP represents the feed-forward network.
At this stage, explicit cross-modal connections are established between image and text tokens through the self-attention mechanism. Image tokens can acquire semantic supplements from text tokens, while text tokens can perceive information from the corresponding regions in the image. Through this process, the model can effectively exploit the complementary information between different modalities, thereby enhancing their unimodal representations.
Then, the unimodal embeddings that contain positional information can be separated from the fused embedding, i.e.,
$$E_I^{pos} = Z_I, \qquad E_T^{pos} = Z_T, \qquad \text{with } Z = \left[Z_I; Z_T\right],$$
where $E_I^{pos}$ and $E_T^{pos}$ denote the image and text embeddings, respectively, that contain positional information, and where $Z_I$ and $Z_T$ denote the token representations of the image and text embeddings within the fused sequence.
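To make the above pipeline concrete, the sketch below mirrors its three steps: project both token sequences into a shared space, fuse them with a standard transformer encoder, and split the result back into per-modality parts. The module name, embedding dimensions, and head count are assumptions; the two-layer depth follows the setting reported in Section 4.4.

```python
import torch
import torch.nn as nn

class PositionalInfoExtractor(nn.Module):
    """Fuse image and text token embeddings and return the position-aware parts."""

    def __init__(self, img_dim=768, txt_dim=512, dim=512, layers=2, heads=8):
        super().__init__()
        self.proj_img = nn.Linear(img_dim, dim)   # project both modalities into a shared space
        self.proj_txt = nn.Linear(txt_dim, dim)
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(block, num_layers=layers)

    def forward(self, e_img: torch.Tensor, e_txt: torch.Tensor):
        # e_img: (B, m+1, img_dim) image tokens; e_txt: (B, n+1, txt_dim) text tokens
        z = torch.cat([self.proj_img(e_img), self.proj_txt(e_txt)], dim=1)  # unified sequence
        z = self.fusion(z)                        # self-attention across both modalities
        m = e_img.size(1)
        return z[:, :m], z[:, m:]                 # position-aware image / text embeddings

# Example: batch of 2, 50 image tokens (incl. global token), 32 text tokens.
extractor = PositionalInfoExtractor()
e_img_pos, e_txt_pos = extractor(torch.randn(2, 50, 768), torch.randn(2, 32, 512))
```

Note that PyTorch's `nn.TransformerEncoderLayer` already applies the residual connections and layer normalization described above.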
3.5. Unimodal Positional Information Filtering
To enable the model to reconstruct positional information, we explicitly remove such information from the original unimodal embeddings of images and text through the unimodal positional information filtering module. This module is designed to mask position-related information in unimodal embeddings, forcing the subsequent reconstruction task to rely on the complementary modality to recover the removed positional information. In this way, PR-CLIP is encouraged to better learn cross-modal positional associations. The structure of this module is shown in Figure 4.
Taking the image embedding as an example, we first perform element-wise subtraction between the original image embedding $E_I$ and the image embedding containing the positional information $E_I^{pos}$, resulting in an embedding $\hat{E}_I = E_I - E_I^{pos}$ with explicit positional information removed. On this basis, to reallocate the importance of the features and optimize their expressive capacity, $\hat{E}_I$ is further projected through a linear transformation layer. The complete positional information filtering process for the image embedding can be expressed as
$$\bar{E}_I = \mathrm{Linear}\left(E_I - E_I^{pos}\right),$$
where $\mathrm{Linear}$ denotes a linear transformation layer.
Accordingly, the text unimodal embedding with filtered positional information can be expressed as
$$\bar{E}_T = \mathrm{Linear}\left(E_T - E_T^{pos}\right).$$
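A minimal sketch of this filtering step is given below, assuming the original and position-aware embeddings have already been mapped to a common dimension; the class and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class PositionalInfoFilter(nn.Module):
    """Remove position-related information from a unimodal token embedding."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.linear = nn.Linear(dim, dim)   # reallocates feature importance after subtraction

    def forward(self, e: torch.Tensor, e_pos: torch.Tensor) -> torch.Tensor:
        # e:     original unimodal token embeddings                    (B, L, dim)
        # e_pos: position-aware embeddings from the fusion encoder     (B, L, dim)
        return self.linear(e - e_pos)       # element-wise subtraction, then linear projection
```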
3.6. Cross-Modal Positional Information Reconstruction
After completing the filtering of the positional information, the filtered unimodal embedding is fed into the cross-modal positional information reconstruction module. The model tries to reconstruct the missing positional information using the complete information from the other modality via a cross-modal interaction mechanism, without directly relying on the positional cues of its own modality. The structure of this module is shown in Figure 5.
Taking image embedding reconstruction as an example, we adopt a transformer decoder structure, where the image embedding without positional information is used as the target sequence and the corresponding complete text embedding is fed into the decoder as the context sequence. The decoder models the interaction between the two modalities through a cross-attention mechanism, extracting position-related information for the image from the text. Unlike direct cross-modal feature fusion, this process is reconstruction-oriented rather than merely integrating features. The reconstruction objective encourages the model to retrieve and reconstruct the missing information from the other modality, thereby achieving fine-grained semantic alignment. This process can be formulated as
$$\hat{Z}_I = \mathrm{LN}\left(\bar{E}_I + \mathrm{MSA}(\bar{E}_I)\right), \qquad \tilde{Z}_I = \mathrm{LN}\left(\hat{Z}_I + \text{C-MSA}(\hat{Z}_I, E_T)\right), \qquad E_I^{rec} = \mathrm{LN}\left(\tilde{Z}_I + \mathrm{MLP}(\tilde{Z}_I)\right),$$
where $\hat{Z}_I$ denotes the filtered image embedding after self-attention, $\tilde{Z}_I$ denotes the image embedding after cross-modal attention with the original text embedding $E_T$, $E_I^{rec}$ denotes the reconstructed image embedding with positional information, and C-MSA refers to the multi-head cross-attention.
Similarly, the decoding process for reconstructing the text using the complete image can be expressed as
$$\hat{Z}_T = \mathrm{LN}\left(\bar{E}_T + \mathrm{MSA}(\bar{E}_T)\right), \qquad \tilde{Z}_T = \mathrm{LN}\left(\hat{Z}_T + \text{C-MSA}(\hat{Z}_T, E_I)\right), \qquad E_T^{rec} = \mathrm{LN}\left(\tilde{Z}_T + \mathrm{MLP}(\tilde{Z}_T)\right).$$
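The standard PyTorch transformer decoder applies exactly this self-attention, cross-attention, and feed-forward sequence, so the reconstruction module can be sketched as follows; the eight-layer depth follows Section 4.4, while the remaining settings and names are assumptions.

```python
import torch
import torch.nn as nn

class PositionalInfoReconstructor(nn.Module):
    """Reconstruct a filtered unimodal embedding from the complete other modality."""

    def __init__(self, dim: int = 512, layers: int = 8, heads: int = 8):
        super().__init__()
        block = nn.TransformerDecoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(block, num_layers=layers)

    def forward(self, filtered: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # filtered: position-filtered target-modality tokens, e.g. image tokens  (B, L_t, dim)
        # context:  complete tokens of the other modality, e.g. text tokens      (B, L_c, dim)
        # Self-attention runs over `filtered`; cross-attention attends to `context`.
        return self.decoder(tgt=filtered, memory=context)

# Image reconstruction: filtered image tokens as target, complete text tokens as context.
reconstructor = PositionalInfoReconstructor()
e_img_rec = reconstructor(torch.randn(2, 50, 512), torch.randn(2, 32, 512))
```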
To ensure that the reconstructed features are as close as possible to the original features in terms of semantics and structure, we introduce the mean squared error (MSE) loss as the positional reconstruction consistency (PRC) loss, i.e.,
$$\mathcal{L}_{PRC} = \mathrm{MSE}\left(E_I^{rec}, E_I\right) + \mathrm{MSE}\left(E_T^{rec}, E_T\right).$$
This loss function measures the element-wise difference between the reconstructed features and the original ones. By minimizing the distance between the reconstructed and the true features, the model is encouraged to accurately restore the filtered positional information.
We combine the cross-modal contrastive loss and the positional information reconstruction loss as the final optimization objective of our PR-CLIP model, i.e.,
$$\mathcal{L} = \alpha \mathcal{L}_{CMC} + \beta \mathcal{L}_{PRC},$$
where $\alpha$ and $\beta$ are hyperparameters used to balance the weights of the loss terms.
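A short sketch of the combined objective is given below, using the loss weights reported later in Section 4.4; `cmc_loss` refers to the contrastive-loss sketch in Section 3.3, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

ALPHA, BETA = 0.01, 1.0  # loss weights reported in Section 4.4

def total_loss(v, t, e_img, e_txt, e_img_rec, e_txt_rec, tau: float = 0.07) -> torch.Tensor:
    """Weighted sum of the CMC loss and the PRC loss."""
    l_cmc = cmc_loss(v, t, tau)  # bidirectional contrastive loss (see the earlier sketch)
    # Positional reconstruction consistency: MSE between reconstructed and original embeddings.
    l_prc = F.mse_loss(e_img_rec, e_img) + F.mse_loss(e_txt_rec, e_txt)
    return ALPHA * l_cmc + BETA * l_prc
```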
4. Experiments
4.1. Datasets
We used two public multimodal remote sensing image–text retrieval datasets, i.e., RSICD [36] and RSITMD [3], to evaluate the performance of PR-CLIP. All the images in the RSICD dataset were captured by aircraft or satellites, covering 28 categories including airports, forests, and ports. Each image is paired with five textual descriptions that provide detailed semantic information. The RSITMD dataset extends the image categories to 32, including stadiums, farmlands, and schools. Similarly, each image is associated with five textual descriptions to support diverse cross-modal retrieval tasks. PR-CLIP was trained on the RET-2 dataset, which is a combination of the RSICD and RSITMD datasets. To prevent data leakage between the training and test datasets, the RET-2 dataset followed the strict de-duplication strategy proposed by RemoteCLIP [27], ensuring the fairness and reliability of the model evaluation.
Table 1 presents the statistics of the three datasets. During the training process, each dataset was divided into training, validation, and testing sets with a split ratio of 8:1:1.
4.2. Evaluation Metrics
To maintain consistency with the baseline models, we report the Recall at K (R@K, K = 1, 5, 10) and the mean Recall (mR). R@K indicates whether the correct match appears within the top-K retrieval results, and it is used to evaluate the model’s performance under different retrieval accuracy requirements. The formula of R@K is defined as
$$\mathrm{R@K} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\left(r_i \le K\right),$$
where $N$ is the total number of queries, $r_i$ denotes the rank of the ground-truth match for the $i$-th query, and $\mathbb{1}(\cdot)$ is the indicator function that returns 1 if the condition holds and 0 otherwise.
The mR metric represents the mean of all R@K values over the image-to-text and text-to-image retrieval tasks, providing a more comprehensive reflection of the model’s overall performance. The formula of mR is defined as
$$\mathrm{mR} = \frac{1}{6}\sum_{K \in \{1,5,10\}}\left(\mathrm{R@K}^{i2t} + \mathrm{R@K}^{t2i}\right),$$
where $\mathrm{R@K}^{i2t}$ denotes the R@K metric for the image-to-text retrieval task and $\mathrm{R@K}^{t2i}$ denotes the R@K metric for the text-to-image retrieval task.
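Both metrics can be computed directly from the rank of the ground-truth match for each query, as in the hedged sketch below (one ground-truth item per query is assumed).

```python
import numpy as np

def recall_at_k(ranks: np.ndarray, k: int) -> float:
    """Fraction of queries whose ground-truth match appears within the top-k results."""
    return float(np.mean(ranks <= k))

def mean_recall(ranks_i2t: np.ndarray, ranks_t2i: np.ndarray) -> float:
    """Average of R@1, R@5, and R@10 over both retrieval directions."""
    ks = (1, 5, 10)
    scores = [recall_at_k(ranks_i2t, k) for k in ks] + [recall_at_k(ranks_t2i, k) for k in ks]
    return float(np.mean(scores))

# Example: 1-indexed ranks of the ground truth for five queries in each direction.
print(mean_recall(np.array([1, 3, 12, 2, 1]), np.array([4, 1, 20, 6, 2])))
```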
4.3. Baselines
We compared PR-CLIP with the following baseline methods:
VSE++ [37] introduced a hard-negative-aware loss function to enhance visual–semantic embedding learning.
AMFMN [3] designed an asymmetric multimodal feature matching network with multi-scale attention and a dynamic-margin triplet loss.
GaLR [33] proposed a global–local RSITR framework with dynamic fusion and re-ranking to enhance retrieval performance.
SWAN [38] introduced a scene-aware aggregation network with multiscale fusion and fine-grained sensing to reduce semantic confusion.
FAMMI [39] proposed a fine-grained semantic alignment method that aggregates multi-scale features and enhances cross-layer consistency.
PIR [17] introduced a prior-instructed representation framework with progressive attention encoders to reduce semantic noise.
MTGFE [40] proposed a multi-task guided fusion encoder with a multi-view joint representations contrast task to enhance fine-grained alignment.
KAMCL [41] introduced a knowledge-aided contrastive learning framework with hierarchical aggregation to enhance fine-grained discrimination.
PE-RSITR [16] introduced a parameter-efficient transfer learning framework with a hybrid contrastive loss to adapt vision–language models to RSITR.
VGSGN [42] proposed a visual-global-salient-guided network with dynamic fusion to enhance cross-modal alignment between image and text.
RemoteCLIP [27] scaled CLIP pretraining with 12× enlarged RS-specific data via data augmentation, significantly improving RSITR performance.
GeoRSCLIP [28] introduced the large-scale remote sensing dataset RS5M with 5 million image–text pairs, and it trained a CLIP-based model to enhance RSITR.
4.4. Implementation Details
This section describes the implementation and evaluation details of PR-CLIP. Concretely, during the retrieval stage, PR-CLIP first employs unimodal encoders to extract global semantic features from both images and texts independently. Based on these features, the model computes similarity scores between image–text pairs and ranks them accordingly, enabling bidirectional retrieval from image to text and vice versa. We evaluated PR-CLIP on the RSICD and RSITMD datasets using R@K and mR, and we compared it with numerous state-of-the-art methods to demonstrate the superiority of PR-CLIP in retrieval tasks. To verify the effectiveness of each module, we also conducted ablation studies. We partially or entirely removed key components of the cross-modal positional information reconstruction module during training, and we adopted the same evaluation protocol as in the main experiments to assess performance differences. Furthermore, to better illustrate the effectiveness and working mechanism of the proposed method, we also conducted visualization analysis. We extracted attention weight matrices from the final layer of the transformer in the cross-modal positional information extraction module and plotted the corresponding attention maps. Additionally, we visualized image features extracted at different stages of the model to further validate the effectiveness of PR-CLIP in modeling spatial location information.
We implemented PR-CLIP with PyTorch v2.5.1, and we trained the model based on the ITRA framework [43]. For the RSICD dataset, the number of training epochs was set to 20, the learning rate was set to 5 × 10⁻⁵, and the batch size was set to 160. For the RSITMD dataset, the number of training epochs was set to 7, the learning rate was set to 5 × 10⁻⁶, and the batch size was set to 100. The training configuration included 100 warm-up steps, a weight decay of 0.5, and a maximum gradient norm of 50 for gradient clipping. We employed two transformer layers for the cross-modal positional information extraction and eight transformer layers for the cross-modal positional information reconstruction. The CMC loss weight $\alpha$ was set to 0.01, and the temperature was a learnable parameter adjusted dynamically during training. The PRC loss weight $\beta$ was set to 1. All the experiments were conducted on a Linux server equipped with two NVIDIA RTX 4090 GPUs.
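For reference, the reported settings can be collected into a single configuration sketch; the keys are illustrative and do not correspond to the actual option names of the ITRA framework.

```python
# Hyperparameters reported above, gathered into an illustrative configuration.
TRAIN_CONFIG = {
    "rsicd":  {"epochs": 20, "lr": 5e-5, "batch_size": 160},
    "rsitmd": {"epochs": 7,  "lr": 5e-6, "batch_size": 100},
    "warmup_steps": 100,
    "weight_decay": 0.5,
    "max_grad_norm": 50,
    "extraction_layers": 2,       # transformer layers for positional information extraction
    "reconstruction_layers": 8,   # transformer layers for positional information reconstruction
    "cmc_loss_weight": 0.01,      # alpha
    "prc_loss_weight": 1.0,       # beta
}
```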
4.5. Comparison Results
Table 2 shows the RSITR results of PR-CLIP and various baseline methods on the two datasets. According to the table, we have the following observations:
PR-CLIP achieved the best overall performance among the models trained on the same scale of training data. Although our model was trained on RET-2, it is worth noting that RET-2 is constructed by combining RSICD and RSITMD after removing duplicates, and, thus, the actual amount of unique training data remains comparable to that of the baseline models. Specifically, it outperformed all the existing RSITR methods in terms of the mean Recall (mR) metric. On the RSICD dataset, PR-CLIP improved mR by 26% (from 31.12 to 39.23), and it also achieved an 18% improvement (from 44.47 to 52.41) on the RSITMD dataset. These results indicate that introducing the cross-modal positional information reconstruction task enables the model to effectively learn image–text positional associations, further improving cross-modal retrieval performance.
Secondly, all the models performed better on the image-to-text retrieval task than on the text-to-image retrieval task in terms of the R@1 metric. This indicates that in tasks requiring the retrieval of a single best match, the existing methods tend to perform more accurately in image-to-text retrieval. Image features generally contain richer detail and positional information, making it easier to retrieve semantically matched texts. In contrast, textual descriptions often exhibit higher ambiguity and vagueness. Compared with the existing methods, PR-CLIP achieved more significant improvements on the text-to-image retrieval task and achieved the best performance across all the evaluation metrics. This confirms that the cross-modal positional information reconstruction task can better help the model align text and image features.
Thirdly, the models trained with more additional data generally performed better on the RSITR task, even with relatively simple architectures. This suggests that RSITR performance is highly data-dependent and that sufficient cross-modal data samples can significantly enhance the model’s representation ability. RemoteCLIP and GeoRSCLIP were trained with additional data, where RemoteCLIP used a training set approximately 12 times larger than that of PR-CLIP, and GeoRSCLIP utilized a dataset 16 times larger. Despite this, PR-CLIP achieved consistently superior performance compared to RemoteCLIP across all the evaluation metrics, and it was only slightly outperformed by GeoRSCLIP on two specific indicators. These results demonstrate the effectiveness of PR-CLIP, even under limited data conditions.
4.6. Ablation Studies
To validate the effectiveness of the proposed cross-modal positional information reconstruction task in PR-CLIP, we conducted a series of ablation studies. Specifically, we removed the PRC loss from the image side and the text side, respectively, and we summarize the results in Table 3. When the reconstruction task was entirely removed, PR-CLIP experienced a significant performance drop. Retaining only the reconstruction loss on either the image or text side led to a performance improvement compared to the complete removal of the task, but it still underperformed the full model. These results demonstrate that the proposed cross-modal positional information reconstruction task effectively captures the positional correspondence between image and text, and that it is crucial for achieving more accurate cross-modal image–text retrieval.
4.7. Hyper-Parameter Studies
We also conducted experiments to evaluate the effects of the weights of the CMC loss and the PRC loss. The results are shown in Table 4 and Table 5, where $\alpha$ and $\beta$ are the weights of the CMC loss and the PRC loss, respectively. The experimental results show that the PR-CLIP model achieved the best performance with $\alpha = 0.01$ and $\beta = 1$. These results indicate that the model is particularly sensitive to the positional reconstruction loss, which is essential for the performance of cross-modal retrieval tasks. In contrast, the cross-modal contrastive loss is assigned a lower weight to ensure that the information from different modalities is brought closer while the inter-modal difference is preserved.
4.8. Efficiency Evaluation
We assessed the time efficiency of PR-CLIP alongside multiple representative baseline RSITR methods. These models were tested on the RSICD and RSITMD datasets using identical batch sizes and embedding dimensions on an NVIDIA RTX 4090 GPU.
Table 6 shows the mR of PR-CLIP and the baseline methods. ‘IT(s)’ denotes the total inference time of the model on the testing set, measured in seconds. The results indicate that PR-CLIP not only outperforms the compared models but also requires a shorter total inference time. This is because the cross-modal positional information reconstruction task is only introduced during training, while inference relies solely on the independent encoding of images and texts, without requiring deep cross-modal interactions. PR-CLIP effectively combines the efficiency of dual-stream models with the alignment capabilities of single-stream models, thus significantly improving retrieval performance on the RSITR task while maintaining inference-time efficiency.
4.9. Visualization of Positional Alignment
To verify whether PR-CLIP achieves positional alignment between images and text, we extracted the attention weight matrix from the last transformer layer in the cross-modal positional information extraction module. We selected key semantic entities from the textual descriptions and analyzed the attention response of each image patch to the corresponding tokens. A higher attention weight indicated a stronger association between the image region and the textual object.
Figure 6 presents the attention visualization of a remote sensing image–text pair. The regions with higher attention are displayed with higher opacity to highlight their relevance, while the regions with lower attention are shown with lower opacity, indicating weaker association with the current textual entity. When the textual keyword is “planes”, the regions in the image containing airplanes were assigned higher attention weights. When the keyword is “red building”, the model focused more on the region where the red building was located. These two regions were connected through the spatial keyword “next to”, and a clear boundary is observable in the attention map. This demonstrates that, through the cross-modal positional information reconstruction task, PR-CLIP can effectively learn the positional associations between images and texts, thereby enhancing cross-modal retrieval performance.
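As a hedged sketch of how such maps can be obtained, the snippet below reads the attention weights from a single multi-head attention layer and reshapes the patch-to-keyword scores into a grid for overlay; the token ordering, keyword position, and 7 × 7 patch grid are assumptions and may differ from the actual PR-CLIP implementation.

```python
import torch
import torch.nn as nn

# Illustrative: read cross-modal attention weights from one attention layer.
attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

tokens = torch.randn(1, 50 + 32, 512)            # 50 image tokens followed by 32 text tokens
_, weights = attn(tokens, tokens, tokens, need_weights=True, average_attn_weights=True)

# Attention of every image token to a chosen text token (e.g., the token for "planes").
text_token_index = 50 + 7                        # hypothetical position of the keyword token
patch_to_word = weights[0, :50, text_token_index]  # (50,) relevance scores for image tokens

# Drop the global token and reshape to the patch grid for overlay on the image.
heatmap = patch_to_word[1:].reshape(7, 7)
```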
To intuitively observe the role of positional information in the model, we visualized the attention distribution of the image at different stages using Grad-CAM, as shown in Figure 7. ‘Global visual feature’ represents the global visual feature extracted by the model before incorporating textual information. Its attention is relatively scattered, indicating that a cross-modal positional relationship had not yet been modeled at this stage. Then, the cross-modal positional feature was obtained through the cross-modal positional information extraction module. The attention was clearly focused on the regions corresponding to “five planes” and “red building”, which aligns well with the text description. ‘Visual feature without positional information’ refers to the image feature after the positional information was removed via the unimodal positional information filtering module. The attention to the planes and surrounding regions was significantly weakened, and the red building almost completely disappeared from focus. ‘Reconstructed visual feature’ denotes the image feature reconstructed with the help of textual information. The model re-focused on key semantic regions, with notably enhanced attention to the red building, indicating that this region is a critical cue for distinguishing the query text from other remote sensing images. From raw visual features to position awareness, and through filtering and reconstruction, PR-CLIP progressively achieves alignment of key target regions via image–text interaction. This process fully validates the effectiveness of the proposed cross-modal positional information reconstruction task in remote sensing image–text retrieval.
4.10. Visualization of Retrieval Results
We also conducted a visual analysis of PR-CLIP’s retrieval results. As shown in Figure 8, PR-CLIP accurately retrieved the target corresponding to the query text or image. In the figure, the top three retrieval results returned by the model, excluding the correct match, are displayed. It can be observed that PR-CLIP successfully captured key spatial positional information, such as “park”, in both the image-to-text and text-to-image retrieval tasks. Although these mismatched images and texts are not exact matches to the target, they are semantically highly relevant. This demonstrates that PR-CLIP effectively enhances the semantic relevance between queries and results by learning cross-modal positional associations between images and text, thereby improving the robustness of the retrieval model.
In addition, as shown in Figure 9, we conducted a case study on a set of real-world remote sensing images from the Gaofen-4 satellite. Given the input text query, the figure shows the top three images retrieved by PR-CLIP. Clearly, PR-CLIP was able to retrieve the images with a high semantic similarity to the input text, demonstrating the effectiveness and generalization capability of PR-CLIP.