Article

Towards Discriminative and Consistent Cross-Modal Alignment for Remote Sensing Image–Text Retrieval

School of Computer Science and Technology, Tongji University, Shanghai 200092, China
* Author to whom correspondence should be addressed.
Remote Sens. 2026, 18(4), 662; https://doi.org/10.3390/rs18040662
Submission received: 4 January 2026 / Revised: 12 February 2026 / Accepted: 19 February 2026 / Published: 22 February 2026
(This article belongs to the Section AI Remote Sensing)

Highlights

What are the main findings?
  • A discriminative and consistent cross-modal alignment (DCCA) framework for remote sensing image–text retrieval is proposed, comprising global contrastive learning with negative pair expansion, a bidirectional intra-inter-modal distribution matching constraint, and a remote sensing information injection module, which together enhance both discriminability and modality consistency.
  • By strengthening the mining of hard sample pairs, enhancing the contrast between positive and negative pairs, improving modality-aware distribution consistency, and injecting scene-discriminative information into a VLP model, DCCA achieves superior performance on both the RSITMD and RSICD benchmarks.
What are the implications of the main finding?
  • The proposed global contrastive learning strategy with negative pair expansion significantly emphasizes hard samples and improves discriminative capability, providing a generalizable solution for addressing multi-modal alignment in contexts characterized by high intra-modal similarity.
  • The proposed framework offers a data-efficient paradigm for transferring pretrained vision–language models to remote sensing scenarios, where the injection of remote sensing visual knowledge substantially reduces the reliance on additional image–text corpora.

Abstract

As large-scale remote sensing data continue to proliferate, research on remote sensing image–text retrieval (RSITR) has become progressively more prominent. Nevertheless, RSITR still faces two primary challenges. First, remote sensing data exhibit substantially higher intra-modal similarity than typical natural image–text corpora, complicating the discrimination of positive and negative pairs. Second, vision–language models pretrained on natural images (VLP), such as CLIP, are not readily adaptable to remote sensing scenarios without undergoing large-scale remote sensing pretraining that entails substantial cost. To tackle these challenges, we introduce DCCA, a novel framework designed for discriminative and consistent cross-modal alignment. We develop a global contrastive learning strategy with negative pair expansion mechanism to boost representation discrimination when intra-modal similarity is pronounced. Additionally, we introduce a bidirectional distribution matching constraint that jointly aligns intra- and inter-modal distributions, promoting consistent cross-modal alignment beyond the instance level. To further enhance domain adaptation, we propose a remote sensing information injection module that transfers knowledge from a pretrained remote sensing image recognition model into VLP, thereby improving its visual discriminability in remote sensing scenarios. Evaluations conducted on publicly available RSITR benchmarks indicate that DCCA consistently surpasses baseline methods, while attaining performance on par with models trained using large-scale remote sensing datasets under markedly reduced data requirements. These findings verify that the framework is both effective and well-suited for practical deployment.

1. Introduction

With the ongoing evolution of spaceborne remote sensing, remote sensing images have grown considerably in both volume and utility, serving as a key resource across diverse fields such as disaster monitoring [1], agricultural management [2], and urban planning [3]. Efficiently extracting valuable information from massive remote sensing image repositories has thus emerged as a critical challenge [4]. Traditional single-modal retrieval methods often struggle to handle the growing diversity and complexity of remote sensing scenarios [5]. In contrast, cross-modal retrieval strategies, particularly those that associate remote sensing visuals with their corresponding natural language captions, have attracted growing attention. Such methods enable flexible bidirectional retrieval across visual and textual modalities, thereby facilitating more semantically expressive access to remote sensing data. Consequently, RSITR has become an important focus in research aimed at advancing intelligent analysis and large-scale utilization of remote sensing imagery.
Research on RSITR has progressed through several architectural paradigms. Early models [6,7,8,9,10] relied on convolutional networks for image representation and recurrent structures for text encoding, facilitating basic cross-modal correlation modeling. Later studies [11,12,13] incorporated transformer-based encoders, which enhanced global dependency modeling and produced more expressive cross-modal representations. Recently, CLIP-based architectures [14] have attracted substantial interest due to their robust vision–language alignment learned from large-scale image–text corpora. However, despite these advances, existing CLIP-oriented approaches [15,16] still face limitations in fully addressing the unique challenges posed by remote sensing imagery, indicating that further refinement is necessary.
First, RSITR is fundamentally constrained by the intrinsic properties of remote sensing data. Compared with natural image–text retrieval benchmarks such as Flickr30K [17] and MSCOCO [18], remote sensing datasets like RSITMD [6] and RSICD [19] exhibit significantly higher intra-modal similarity in both modalities, as depicted in Figure 1. To demonstrate this, we encode image and text modalities using multiple different encoders and compute their average similarities. Due to scene homogeneity, textual descriptions tend to follow repetitive expression patterns, while corresponding remote sensing images share highly consistent spatial structures and semantic content. As a result, the similarity gap between matched and mismatched pairs becomes substantially reduced, leading to an increased proportion of hard negatives during training. This property makes accurate cross-modal discrimination particularly challenging and highlights the necessity for retrieval models to learn highly discriminative representations that can capture subtle semantic variations across modalities.
Prior work predominantly employs contrastive learning frameworks to establish correspondence between visual and textual representations by distinguishing positive from negative pairs. While this paradigm has shown strong performance in the natural image domain [14,22,23], it is less effective in remote sensing scenarios with high intra-modal similarity. Moreover, research directly aiming to enhance the contrastive learning paradigm itself remains limited.
To better discriminate hard negative samples, we note that existing image–text contrastive learning approaches do not fully leverage the abundant negative pair resources. Typically, each positive pair is contrasted with only a subset of negative pairs within a batch. While sufficient for natural images, this strategy is suboptimal in high-similarity remote sensing scenarios. Furthermore, hard sample mining has not been fully exploited and presents room for improvement. In light of these findings, we develop a global contrastive learning framework with a negative pair expansion mechanism tailored for RSITR. This approach enables more effective cross-modal alignment while capturing implicit relationships and discriminative associations across modalities. Additionally, contrastive learning lacks explicit constraints on modality distribution consistency. Following Yang et al. [16], who demonstrated a positive correlation between performance and distribution consistency, we design a bidirectional intra-inter-modal distribution matching constraint inspired by prior geometric consistency studies [24,25] to strengthen alignment implicitly.
Second, although CLIP [14] pretrained on general-domain image–text corpora exhibits strong vision–language alignment, directly transferring it to the remote sensing domain remains challenging. RemoteCLIP [26] and GeoRSCLIP [27] demonstrate that large-scale pretraining on remote sensing data significantly improves domain adaptation. Nonetheless, such methods typically involve considerable data requirements and computational overhead. To achieve effective domain adaptation under limited data without collecting additional image–text pairs, we introduce a remote sensing information injection module. This module leverages a pretrained remote sensing image recognition model to incorporate domain-specific knowledge into CLIP, guiding it to learn more unbiased, domain-aware representations. As a result, CLIP can better capture discriminative features and semantic information of remote sensing scenes without relying on extra paired data.
In summary, the contributions of this work are threefold:
  • We present DCCA, a novel framework to facilitate discriminative and consistent cross-modal alignment, targeting two main challenges in RSITR: high similarity within modalities and limited adaptability of vision–language models pretrained on natural images to the remote sensing domain.
  • We develop an enhanced contrastive learning strategy with negative pair expansion and bidirectional intra-inter-modal distribution matching constraint, improving hard sample mining, cross-modal discrimination, and modality consistency. We further introduce a remote sensing information injection module to enhance visual discriminability and domain adaptability without requiring additional paired data.
  • Evaluations on a variety of remote sensing benchmarks demonstrate that DCCA exceeds existing methods in performance while achieving results comparable to models pretrained on extensive remote sensing corpora, even with a reduced amount of training data.

2. Related Work

2.1. Remote Sensing Image–Text Retrieval

Research on RSITR can be categorized, according to the data used during training, into three main paradigms: methods without image–text pretraining, vision–language pretraining-based methods (VLP), and VLP-based methods augmented with additional remote sensing image–text pretraining (RSVLP).
Methods without image–text pretraining generally employ separate image and text encoders, such as CNN-RNN or transformer-based architectures, and are optimized directly on downstream retrieval datasets. Cheng et al. [28] presented a framework that leverages a semantic matching component together with attention-guided processing and gating operations to refine cross-modal embeddings. Yuan et al. [6] addressed multi-modal feature matching by designing an asymmetric architecture capable of handling multi-scale inputs and multi-source retrieval, while adaptively suppressing redundant information. In a subsequent study, Yuan et al. [7] further emphasized the joint modeling of holistic and fine-grained characteristics, introducing a flexible feature integration framework that dynamically fuses representations at various semantic depths. Zhang et al. [8] designed a relational representation framework driven by masking strategies, which combines Transformer–CNN encoders with entity-level supervision, aiming to enhance relational modeling while reducing redundant cues. Ji et al. [9] introduced a contrastive learning approach built upon a momentum mechanism and supplemented by knowledge, which injects semantic priors through knowledge initialization, construction, filtering, and alignment to facilitate discriminative representation learning. To explicitly account for sample difficulty arising from large intra-class variations in remote sensing imagery and high textual similarity, Zhang et al. [10] adopted a curriculum learning strategy and formulated visual–semantic alignment within a hyperspherical embedding space. Tang et al. [12] further incorporated prior experience into joint visual–text representation learning to alleviate the imbalance between image and language information commonly observed in remote sensing datasets.
VLP-based approaches leverage knowledge acquired from large-scale image–text corpora, which facilitates more effective cross-modal alignment and leads to improved retrieval performance. Yang et al. [16] enhanced cross-modal consistency by introducing a regularization objective that aligns the intra-modal distance distributions of visual and textual samples, guiding the embedding space toward a more structured form. In addition, a self-pruning distillation strategy was adopted to remove redundant network layers, enabling model lightweighting without compromising inference capability. Hu et al. [15] proposed a transformer-based architecture that incorporates global–local soft alignment to strengthen semantic correspondence between modalities. Ji et al. [29] addressed the issue of weakly correlated sample pairs by adopting an “eliminate-before-align” mechanism, which filters unreliable pairs prior to alignment, and further introduced a keyword-aware reasoning module to explicitly model fine-grained semantic differences. Guan et al. [30] developed a framework that utilizes a position-aware learning objective to model spatially sensitive relationships across visual and textual embeddings.
RSVLP-based methods further pretrain VLP models on large-scale remote sensing image–text datasets, thereby improving their suitability for domain-specific retrieval tasks. Liu et al. [26] developed RemoteCLIP, which focuses on learning domain-relevant visual representations and aligning them with semantically informative textual embeddings to support diverse remote sensing applications. Zhang et al. [27] presented GeoRSCLIP, a specialized vision–language pretrained model that explicitly addresses the domain discrepancy between generic vision–language representations and remote sensing downstream tasks. Subsequent methods [31,32,33] built upon RemoteCLIP or GeoRSCLIP have further advanced retrieval performance.
Most of the aforementioned RSITR methods primarily rely on triplet loss or standard contrastive learning objectives. Although such approaches have demonstrated strong effectiveness in natural image–text retrieval tasks, they are less suited to scenarios with high intra-modal similarity, which are common in remote sensing data. Furthermore, they remain dependent on extensive remote sensing image–text corpora during pretraining to achieve satisfactory domain adaptation. To address these limitations, we further enhance the contrastive learning objective to more effectively handle the pronounced intra-modal similarity characteristic of remote sensing data. In addition, we incorporate a module for injecting remote sensing information to alleviate the dependence on large volumes of paired image and text data.

2.2. Vision–Language Pretraining

VLP has established itself as a dominant framework for learning unified representations. Early studies explored different architectural strategies to enable cross-modal interaction, including dual-stream designs with co-attention mechanisms [34], single-stream transformers that jointly process visual and textual tokens [35,36], and simplified formulations that reduce the computational overhead of visual inputs [37]. More recent approaches have demonstrated that dual-encoder architectures optimized via large-scale contrastive learning can achieve effective cross-modal alignment and strong transferability across downstream tasks [14,22], while other works have further extended VLP models to support both understanding and generation through task-adaptive training objectives [23]. Among these approaches, CLIP [14] has gained widespread adoption owing to its efficient dual-encoder structure and favorable scalability. In this work, our framework leverages a CLIP-derived encoder to serve as the core image–text feature extractor. To reconcile differences between natural and remote sensing image domains, we further introduce a remote sensing information injection module that compensates for the inherent domain bias of generic VLP models.

2.3. Contrastive Learning

Contrastive learning plays a central role in learning feature representations via explicitly modeling similarities among samples. Early formulations focused on batch-wise construction strategies and corresponding objectives to separate positive and negative instances, as exemplified by the multi-class N-pair loss [38] and later generalized into the widely used InfoNCE objective [39]. Subsequent studies investigated more effective mechanisms for negative sample utilization and optimization stability. For instance, momentum-based updating and queue-based sampling were introduced to enlarge and stabilize negative sets [40], while alternative designs demonstrated that competitive performance can be achieved without memory banks or even explicit negative pairs [41,42]. Beyond sampling strategies, unified loss formulations such as circle loss [43] have been proposed to provide finer control over similarity distributions by reweighting positive and negative pairs. These principles have also been extended beyond uni-modal settings. Contrastive objectives have been adapted to cross-modal representation learning in specialized domains such as medical image–text modeling [44], and further scaled to general-purpose vision–language pretraining frameworks [14]. Together, these developments establish a solid foundation for modern contrastive objectives and motivate the exploration of tailored formulations for RSITR. Building upon these prior studies, we further explore an optimized contrastive learning design that better accommodates the characteristics of remote sensing data.

2.4. Comparative Analysis

DCCA differs from existing image–text retrieval methods from both general computer vision and remote sensing perspectives. Compared with natural image–text retrieval, remote sensing data exhibit much higher intra-modal similarity in both visual and textual modalities, resulting in numerous hard negative pairs and reduced effectiveness of conventional instance-level contrastive or triplet-based losses. To address this issue, DCCA introduces a global-level contrastive learning strategy with negative pair expansion, which explicitly models batch-wise competition and is better suited to high-similarity remote sensing data. In addition, to mitigate the domain bias of VLP models developed using natural image–text data, DCCA incorporates a remote sensing information injection module that transfers domain-specific knowledge from a pretrained remote sensing image recognition model, a design fundamentally different from general computer vision approaches. Compared with existing RSITR methods, DCCA achieves more comprehensive cross-modal alignment by jointly combining global contrastive learning with negative pair expansion and a bidirectional intra-inter-modal distribution matching constraint, rather than focusing on a single alignment direction. Notably, the proposed information injection module is used only during training, does not rely on large-scale remote sensing image–text pretraining, and requires no scene category annotations.

3. Method

3.1. Problem Formulation

Consider a dataset consisting of remote sensing images $\mathcal{V}$ and their related texts $\mathcal{T}$, with each image $v_i \in \mathcal{V}$ linked to one or multiple text samples $t_j \in \mathcal{T}$. Cross-modal retrieval aims to establish semantic correspondence between these two modalities by mapping them into a shared embedding space.
Under this setting, two complementary retrieval scenarios are considered. For retrieving images from textual input, the query $t_i$ searches the collection $\mathcal{V}$ for the most semantically relevant visual instance. Given a similarity measurement function $S(\cdot,\cdot)$ defined in the shared space, the retrieval result is determined by
$$v^{*} = \arg\max_{v_j \in \mathcal{V}} S(t_i, v_j) \quad (1)$$
Conversely, image-to-text retrieval takes an image $v_i$ as the query and identifies the most relevant textual description from $\mathcal{T}$, which can be expressed as
$$t^{*} = \arg\max_{t_j \in \mathcal{T}} S(v_i, t_j) \quad (2)$$
By adopting this bidirectional formulation, both visual and textual samples are projected into a unified semantic space, enabling cross-modal relevance to be consistently evaluated through the similarity function $S(\cdot,\cdot)$.
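As a minimal illustration of this formulation, the PyTorch sketch below ranks gallery candidates for a batch of queries using cosine similarity as $S(\cdot,\cdot)$; the tensors and the function name are hypothetical stand-ins, not part of the released implementation.

```python
import torch
import torch.nn.functional as F

def retrieve(query_emb: torch.Tensor, gallery_emb: torch.Tensor) -> torch.Tensor:
    """Rank gallery items for each query by cosine similarity in the shared space."""
    q = F.normalize(query_emb, dim=-1)
    g = F.normalize(gallery_emb, dim=-1)
    sim = q @ g.t()                                  # S(query, gallery) for all pairs
    return sim.argsort(dim=-1, descending=True)      # ranked gallery indices per query

# Bidirectional retrieval with hypothetical embeddings already in the shared space
text_emb, image_emb = torch.randn(8, 512), torch.randn(100, 512)
best_image_per_text = retrieve(text_emb, image_emb)[:, 0]   # arg max over v of S(t_i, v)
best_text_per_image = retrieve(image_emb, text_emb)[:, 0]   # arg max over t of S(v_i, t)
```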

3.2. Model Overview

Figure 2 depicts the overall architecture of DCCA, comprising four core components: the feature extraction module, the global contrastive learning with negative pair expansion, the bidirectional intra-inter-modal distribution matching constraint, and the remote sensing information injection module.
The feature extraction module leverages CLIP’s encoders to transform input images and textual descriptions into high-dimensional feature representations. Serving as a unified representation interface, this module projects the raw multi-modal inputs into a shared latent space, which is subsequently consumed by the global contrastive learning with negative pair expansion module, the distribution matching constraint, and the remote sensing information injection module. Consequently, all downstream alignment and optimization processes are performed within the embedding space produced by this module.
The global contrastive learning with negative pair expansion seeks to emphasize hard cross-modal matches while ensuring that the similarity of aligned pairs consistently exceeds that of unaligned pairs within the shared embedding space. Compared with conventional contrastive learning, this enhanced formulation strengthens implicit alignment and substantially increases the model’s capacity to differentiate distinct representations.
To alleviate the inherent deficiencies of contrastive learning in cross-modal representation alignment, we introduce a bidirectional intra-inter-modal distribution matching constraint. This mechanism enforces distributional consistency both within each modality and across modalities, thereby promoting more structured and coherent cross-modal alignment.
Furthermore, to enhance the domain adaptability of CLIP-based models in remote sensing scenarios, we incorporate a remote sensing information injection module guided by a pretrained remote sensing scene recognition model. This module encodes images into feature representations enriched with domain-specific knowledge, injecting remote sensing information into the visual representation learning process and improving the model’s capacity to recognize and leverage discriminative features of remote sensing scenes.
Overall, these components operate synergistically to jointly facilitate the model in capturing distinctive and consistent relationships across remote sensing imagery and associated text.

3.3. Feature Extraction

The feature extraction module constitutes the foundational component of the proposed DCCA framework, as illustrated in Figure 3. Its primary role is to transform raw image and text inputs into unified embedding representations that underpin all subsequent alignment and optimization objectives. Specifically, this module adopts a dual-encoder architecture based on the CLIP framework, leveraging CLIP’s Transformer-based encoders for visual and language inputs to obtain corresponding feature representations.
Following the CLIP design, the encoders are instantiated using multi-layer transformer architectures, where each layer integrates attention operations with feed-forward blocks. For the visual branch, each input image is segmented into independent patches and linearly projected to obtain latent embeddings. Positional information is incorporated through learnable embeddings, and an additional trainable token is inserted into the patch sequence to capture holistic image-level representations through the self-attention mechanism. For the textual branch, the input sentence is tokenized into word tokens, mapped to word embeddings, and combined with positional encodings, with special start and end tokens inserted to capture global semantic information. Within each transformer layer, multi-head self-attention enables long-range dependency modeling among tokens, while feed-forward networks further refine token-wise representations. Through successive layers, global tokens progressively integrate contextual information from all tokens, yielding compact and semantically rich embeddings for cross-modal alignment.
For each paired input $(v_i, t_i)$, modality-specific encoders are applied, which can be written as
$$[v_i^{cls}, v_i^{1}, \ldots, v_i^{l_v}] = \mathrm{Trans}_v(\mathrm{PE}(v_i)) \quad (3)$$
$$[t_i^{bos}, t_i^{1}, \ldots, t_i^{l_t}, t_i^{eos}] = \mathrm{Trans}_t(\mathrm{WE}(t_i)) \quad (4)$$
where the image token sequence consists of a global token $v_i^{cls}$ and $l_v$ patch tokens, while the text token sequence comprises a start token $t_i^{bos}$, an end token $t_i^{eos}$ (serving as the global token), and $l_t$ word tokens. Here, $\mathrm{PE}(\cdot)$ and $\mathrm{WE}(\cdot)$ denote the patch embedding and word embedding operations, respectively, while $\mathrm{Trans}_v$ and $\mathrm{Trans}_t$ represent the vision transformer and the text transformer.
To facilitate efficient retrieval and unified cross-modal alignment, only the global tokens of the two modalities are preserved and mapped into a common semantic space through learnable linear projections:
$$V_i = \mathrm{Proj}_v(v_i^{cls}) \quad (5)$$
$$T_i = \mathrm{Proj}_t(t_i^{eos}) \quad (6)$$
where $\mathrm{Proj}_v$ and $\mathrm{Proj}_t$ correspond to the linear transformation functions applied to the image and text modalities.
The resulting embeddings $V_i$ and $T_i$ serve as the core representations for all subsequent modules within DCCA. Specifically, they are used to compute cross-modal similarities in the global contrastive learning with negative pair expansion, to construct similarity matrices for the distribution matching constraints, and to act as the student representations in the remote sensing information injection module.
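The sketch below illustrates how such global embeddings can be obtained from a CLIP-style dual encoder. It uses the open_clip package with a ViT-B-16 backbone as an assumed stand-in for the feature extraction module (the actual training is carried out within the ITRA framework), and random tensors replace preprocessed remote sensing images.

```python
import torch
import open_clip  # assumed dependency; stands in for the CLIP encoders used by DCCA

# ViT-B-16 dual encoder; the "openai" weights are an assumption for this sketch
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-16", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-16")
model.eval()

images = torch.randn(4, 3, 224, 224)  # stand-in for images normally produced by `preprocess`
texts = tokenizer(["many green trees are around a large baseball field"] * 4)

with torch.no_grad():
    V = model.encode_image(images)    # projected global visual embeddings V_i
    T = model.encode_text(texts)      # projected global textual embeddings T_i

V = V / V.norm(dim=-1, keepdim=True)
T = T / T.norm(dim=-1, keepdim=True)
sim = V @ T.t()                       # cross-modal similarities consumed by the later losses
```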
In the standard CLIP framework, image–text alignment is typically optimized using the InfoNCE loss, which enables contrastive learning within a joint embedding space. However, this paradigm exhibits inherent limitations, including restricted utilization of negative samples and the absence of explicit consistency constraints, which impede the learning of highly discriminative and semantically coherent cross-modal representations. To overcome these limitations, the following sections introduce targeted enhancements that promote more discriminative and consistent alignment, consequently strengthening the model’s retrieval capability for remote sensing image–text data.

3.4. Global Contrastive Learning with Negative Pair Expansion

We propose a two-stage refinement of the contrastive learning paradigm, consisting of a global-level contrastive strategy and a negative pair expansion mechanism, to more effectively adapt it to the characteristics of remote sensing scenarios. The conventional contrastive learning objective, i.e., InfoNCE loss, is formulated as
$$\mathcal{L}_{itc} = -\frac{1}{2M}\sum_{i=1}^{M}\left[\log\frac{\exp\big(S(V_i,T_i)/\tau\big)}{\sum_{j=1}^{M}\exp\big(S(V_i,T_j)/\tau\big)} + \log\frac{\exp\big(S(T_i,V_i)/\tau\big)}{\sum_{j=1}^{M}\exp\big(S(T_i,V_j)/\tau\big)}\right] \quad (7)$$
$$= \frac{1}{2M}\sum_{i=1}^{M}\left[\log\Big(1+\sum_{j=1, j\neq i}^{M}\exp\big((s_n^{ij}-s_p^{ii})/\tau\big)\Big) + \log\Big(1+\sum_{j=1, j\neq i}^{M}\exp\big((s_n^{ji}-s_p^{ii})/\tau\big)\Big)\right] \quad (8)$$
where $S(\cdot,\cdot)$ denotes the similarity function, $M$ stands for the batch size, $\tau$ represents the trainable temperature coefficient, $V_i$ denotes the $i$-th image embedding, and $T_i$ refers to the corresponding text embedding. Inspired by Sun et al. [43], we further rewrite Equation (7) into the equivalent form shown in Equation (8), which provides an alternative perspective for interpreting contrastive learning. In this formulation, $s_p^{ii} = S(V_i, T_i)$ and $s_n^{ij} = S(V_i, T_j)$ ($j \neq i$). Compared with Equation (7), Equation (8) places greater emphasis on pairwise comparisons, whereas Equation (7) focuses more on the distance measurement between the individual image and text samples within each pair. This reformulation facilitates subsequent methodological improvements.
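To make the reformulation concrete, the snippet below numerically checks on random embeddings that the softmax form of Equation (7) equals the pairwise $\log(1+\sum\exp)$ form of Equation (8); it is an illustrative verification under assumed cosine similarities, not the authors' training code.

```python
import torch
import torch.nn.functional as F

M, d, tau = 16, 64, 0.07
V = F.normalize(torch.randn(M, d), dim=-1)
T = F.normalize(torch.randn(M, d), dim=-1)
S = V @ T.t()                                   # S[i, j] = S(V_i, T_j)
labels = torch.arange(M)

# Equation (7): symmetric InfoNCE written with a softmax over the batch
itc = 0.5 * (F.cross_entropy(S / tau, labels) + F.cross_entropy(S.t() / tau, labels))

# Equation (8): the same value as an average of log(1 + sum exp((s_n - s_p)/tau)) terms
s_p = S.diag()
mask = ~torch.eye(M, dtype=torch.bool)
neg_i2t = (((S - s_p[:, None]) / tau).exp() * mask).sum(dim=1)      # s_n^{ij} - s_p^{ii}
neg_t2i = (((S.t() - s_p[:, None]) / tau).exp() * mask).sum(dim=1)  # s_n^{ji} - s_p^{ii}
itc_pairwise = (torch.log1p(neg_i2t) + torch.log1p(neg_t2i)).sum() / (2 * M)

print(torch.allclose(itc, itc_pairwise, atol=1e-5))                 # expected: True
```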
In Equation (8), a sample-level optimization strategy is adopted, where image–text representation alignment is achieved by averaging the contrastive objectives centered on each sample within a batch. Sun et al. [43] pointed out that the $\exp(\cdot)$ operation in Equation (8) implicitly performs soft hard negative mining among samples. However, since Equation (8) optimizes around individual samples, the mining of difficult $(s_n - s_p)$ terms is restricted to a local scope. Moreover, the subsequent unweighted averaging of these locally mined results is suboptimal for emphasizing the truly critical hard negative pairs that should be prioritized at the batch level, as illustrated in Figure 4a.
Given the limitations imposed by sample-level contrastive learning, we remove the averaging operation over samples and move the outer summation inward, thereby formulating a global-level contrastive learning strategy, as illustrated in Figure 4b. This formulation uniformly models the competition among sample pairs across the entire batch, allowing highly similar hard sample pairs to receive stronger optimization emphasis at the global level, rather than being treated equally through averaging. As a result, global-level contrastive learning enhances the model’s discriminative capability with respect to hard sample pairs. The resulting optimized formulation is presented as follows:
$$\mathcal{L}_{gitc} = \log\left[1 + \sum_{i=1}^{M}\Big(\sum_{j=1, j\neq i}^{M}\exp\big((s_n^{ij}-s_p^{ii})/\tau\big) + \sum_{j=1, j\neq i}^{M}\exp\big((s_n^{ji}-s_p^{ii})/\tau\big)\Big)\right] \quad (9)$$
where $s_n^{ij}$ refers to the similarity computed for a negative pair $(V_i, T_j)$, while $s_p^{ii}$ reflects the similarity associated with the corresponding positive pair $(V_i, T_i)$.
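A compact sketch of Equation (9) under the same notation could look as follows; embeddings are assumed to be L2-normalized so that the dot product serves as $S(\cdot,\cdot)$.

```python
import torch

def global_itc_loss(V: torch.Tensor, T: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Global-level contrastive loss (Equation (9)): one log over all (s_n - s_p) terms."""
    S = V @ T.t()                               # S[i, j] = S(V_i, T_j)
    s_p = S.diag()                              # positive-pair similarities s_p^{ii}
    mask = ~torch.eye(S.size(0), dtype=torch.bool, device=S.device)
    diff_i2t = (S - s_p[:, None]) / tau         # s_n^{ij} - s_p^{ii}
    diff_t2i = (S.t() - s_p[:, None]) / tau     # s_n^{ji} - s_p^{ii}
    neg = (diff_i2t.exp() * mask).sum() + (diff_t2i.exp() * mask).sum()
    return torch.log1p(neg)                     # a single log couples all pairs in the batch
```

Because all terms share one logarithm, pairs with large $(s_n - s_p)$ dominate the gradient, which is exactly the intended batch-level emphasis on hard negatives.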
Evidence from earlier work suggests that the outcomes of contrastive learning benefit substantially from large batch sizes, as they introduce a greater number of negative samples, as demonstrated by CLIP [14] and ALIGN [22]. Alternatively, methods like MoCo [40] maintain a memory queue to ensure a sufficient supply of negative samples. Increasing the pool of negative samples consequently raises the number of $(s_n - s_p)$ terms in Equation (8), thereby leading to performance improvements. Motivated by this observation, we hypothesize that the number of $(s_n - s_p)$ comparisons is closely related to the effectiveness of contrastive learning. Accordingly, we further introduce a negative pair expansion (NPE) mechanism, as shown in Equation (10):
$$\mathcal{L}_{gnpe} = \log\left[1 + \sum_{i=1}^{M}\sum_{j=1}^{N}\exp\big((s_n^{j}-s_p^{i})/\tau\big)\right] \quad (10)$$
$$= \log\left[1 + \Big(\sum_{j=1}^{N}\exp\big(s_n^{j}/\tau\big)\Big)\Big(\sum_{i=1}^{M}\exp\big(-s_p^{i}/\tau\big)\Big)\right] \quad (11)$$
where $s_p^{i}$ corresponds to the similarity of the $i$-th positive pair, $s_n^{j}$ refers to that of the $j$-th negative pair, and $N$ is the total count of negative pairs in the batch. Under the contrastive learning perspective reformulated in Equation (8), this mechanism increases the number of negative pairs $s_n$ compared against each positive pair $s_p$, enabling more thorough exploitation of negative sample resources within a batch and achieving more effective and discriminative contrastive optimization.
We now elaborate on the improvements brought by Equation (10) over Equation (9). During training, for each matched pair $(V_i, T_i)$ within a batch of size $M$, Equation (9) samples $(M-1)$ mismatched texts $T_j$ or images $V_j$ ($j = 1, \ldots, M$, $j \neq i$) to form negative pairs $(V_i, T_j)$ or $(T_i, V_j)$. This conventional construction yields $2(M-1)$ negative pairs for each positive pair, as shown in Figure 5a. In contrast, Equation (10) allows each positive pair to be compared against all negative pairs within the batch, implying that any positive pair should be more similar than any negative pair. Consequently, the number of negative pairs in Equation (10) increases to $M(M-1)$, as illustrated in Figure 5b, which is substantially larger than the $2(M-1)$ negative pairs used in conventional methods. This expanded comparison scheme is expected to achieve stronger performance by providing more comprehensive and discriminative contrastive supervision. The naive computation of Equation (10) has a complexity of $O(M^3)$. To reduce this computational burden, we derive an equivalent and computationally efficient form, Equation (11), which has a complexity of $O(M^2)$, matching that of the original InfoNCE loss.
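The factorized form of Equation (11) can be implemented directly, as in the hedged sketch below; softplus computes $\log(1+e^{x})$, and the two log-sum-exp terms keep the computation numerically stable at $O(M^2)$ cost.

```python
import torch
import torch.nn.functional as F

def gnpe_loss(V: torch.Tensor, T: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Global contrastive loss with negative pair expansion, Equation (11) form.

    Every positive pair is contrasted against all M(M-1) negative pairs in the
    batch, yet the factorization keeps the cost comparable to standard InfoNCE.
    """
    S = V @ T.t()                                    # assumed cosine similarities
    M = S.size(0)
    mask = ~torch.eye(M, dtype=torch.bool, device=S.device)
    s_p = S.diag()                                   # M positive-pair similarities
    s_n = S[mask]                                    # M(M-1) negative-pair similarities
    neg_term = torch.logsumexp(s_n / tau, dim=0)     # log sum_j exp(s_n^j / tau)
    pos_term = torch.logsumexp(-s_p / tau, dim=0)    # log sum_i exp(-s_p^i / tau)
    return F.softplus(neg_term + pos_term)           # log(1 + exp(neg_term + pos_term))
```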
Overall, the global-level contrastive strategy enhances the mining of hard sample pairs, while the negative pair expansion mechanism strengthens the discrimination between positive and negative pairs. These two refinements jointly improve the model's retrieval capability for image–text pairs with high intra-modal similarity.

3.5. Bidirectional Intra-Inter-Modal Distribution Matching Constraint

Inspired by the geometric consistency discussed by Jiang et al. [24] and Han et al. [25], to overcome the inherent limitations of contrastive learning and to improve the consistency of modality distributions, we integrate the ideas of AIR [16] and IRRA [45] and propose a bidirectional intra-inter-modal distribution matching constraint, which consists of two complementary components: a bidirectional intra-modal distribution constraint and a bidirectional inter-modal distribution constraint. Specifically, for the intra-modal constraint, we follow AIR to compute the image-level and text-level similarity matrices $R_v$ and $R_t$. Afterward, we match the distributions of the two modalities using the Kullback–Leibler (KL) divergence. Unlike AIR, we remove the temperature parameter to reduce hyperparameter tuning effort. The matching is formulated as
$$\mathcal{L}_{intra}^{I2T} = \frac{1}{M}\sum_{i=1}^{M} KL\Big(\mathrm{Softmax}(R_t^{i}) \,\Big\|\, \mathrm{Softmax}(R_v^{i})\Big) \quad (12)$$
where $R_t^{i}$ and $R_v^{i}$ denote the $i$-th row of the intra-modal similarity matrices. Due to the asymmetry of the KL divergence, this term can be interpreted as using the text modality as a teacher to supervise and maintain consistency in the image modality's distribution. To encourage more symmetric interaction, we further introduce the reverse alignment:
$$\mathcal{L}_{intra}^{T2I} = \frac{1}{M}\sum_{i=1}^{M} KL\Big(\mathrm{Softmax}(R_v^{i}) \,\Big\|\, \mathrm{Softmax}(R_t^{i})\Big) \quad (13)$$
The final bidirectional intra-modal distribution matching loss is then defined as a weighted sum:
$$\mathcal{L}_{intra}^{asym} = \mathcal{L}_{intra}^{I2T} + \alpha_1 \cdot \mathcal{L}_{intra}^{T2I} \quad (14)$$
where $\alpha_1$ is a balancing hyperparameter that adjusts the contributions of the two matching directions, accounting for the inherent difference in information density between image and text representations.
While the intra-modal constraint strengthens representation consistency within each individual modality, cross-modal consistency remains inadequately addressed. Although contrastive learning facilitates instance-level alignment between modalities, it does not explicitly enforce distributional consistency. To remedy this limitation, we further introduce a bidirectional inter-modal distribution matching constraint:
$$\mathcal{L}_{inter}^{sym} = \frac{1}{M}\sum_{i=1}^{M}\Big[ KL\big(\mathrm{Softmax}(R_{tv}^{i}) \,\big\|\, \mathrm{Softmax}(R_{vt}^{i})\big) + KL\big(\mathrm{Softmax}(R_{vt}^{i}) \,\big\|\, \mathrm{Softmax}(R_{tv}^{i})\big) \Big] \quad (15)$$
where $R_{tv}^{i}$ and $R_{vt}^{i}$ correspond to the $i$-th row of the cross-modal similarity matrices.
The overall bidirectional intra-inter-modal distribution matching loss integrates both objectives:
$$\mathcal{L}_{iimdm} = \mathcal{L}_{intra}^{asym} + \alpha_2 \cdot \mathcal{L}_{inter}^{sym} \quad (16)$$
where $\alpha_2$ weights the contributions of the intra-modal and inter-modal matching terms. This loss jointly enforces consistency within each modality and alignment across modalities, thereby facilitating the learning of a more unified and consistent multi-modal representation.
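A minimal sketch of Equations (12)–(16) is given below. The intra-modal matrices $R_v$ and $R_t$ and the inter-modal matrices $R_{vt}$ and $R_{tv}$ are assumed here to be dot-product similarities of L2-normalized embeddings; details such as diagonal handling follow this assumption rather than the exact AIR implementation.

```python
import torch
import torch.nn.functional as F

def kl_rows(p_logits: torch.Tensor, q_logits: torch.Tensor) -> torch.Tensor:
    """Mean row-wise KL( softmax(p_logits) || softmax(q_logits) )."""
    p = F.softmax(p_logits, dim=-1)
    log_q = F.log_softmax(q_logits, dim=-1)
    return F.kl_div(log_q, p, reduction="batchmean")

def iimdm_loss(V: torch.Tensor, T: torch.Tensor, alpha1: float = 1.0, alpha2: float = 0.5):
    """Bidirectional intra-inter-modal distribution matching, Equations (12)-(16)."""
    R_v, R_t = V @ V.t(), T @ T.t()       # intra-modal similarity matrices
    R_vt = V @ T.t()                      # inter-modal similarities (image rows)
    R_tv = R_vt.t()                       # inter-modal similarities (text rows)
    intra_i2t = kl_rows(R_t, R_v)         # text distribution supervises the image side
    intra_t2i = kl_rows(R_v, R_t)         # reverse direction
    inter_sym = kl_rows(R_tv, R_vt) + kl_rows(R_vt, R_tv)
    return (intra_i2t + alpha1 * intra_t2i) + alpha2 * inter_sym
```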

3.6. Remote Sensing Information Injection

To further strengthen the model's awareness of domain-specific characteristics in remote sensing, we introduce a remote sensing information injection module. During optimization, this module leverages a pretrained visual remote sensing model to provide guidance for training the retrieval network, without requiring additional image–text pair data. Specifically, following PIR [11], we adopt a ResNet-50 [46] pretrained on the remote sensing scene classification dataset AID [47] as the teacher model in the remote sensing domain. Unlike PIR, however, our information injection mechanism is applied only during training, thereby reducing the overhead of additional feature extraction and fusion and achieving higher computational efficiency during inference. Moreover, it does not require scene categories as supervisory signals and relies only on the representations containing scene information obtained from the remote sensing model, making it more generally applicable. We first obtain the feature representation $V_i^{rs}$ for the $i$-th image from the remote sensing model. It is then linearly projected by $\mathrm{Proj}_{rs}$ into the unified image–text representation space to facilitate direct information flow. The remote sensing information injection loss is formulated as follows:
$$\mathcal{L}_{rsii} = \frac{1}{M}\sum_{i=1}^{M}\big\| V_i - \mathrm{Proj}_{rs}(V_i^{rs}) \big\|_2^2 \quad (17)$$
where $V_i$ represents the feature representation of the $i$-th image extracted using the CLIP vision transformer. By minimizing this loss, the model is encouraged to align its visual representations with those from the pretrained remote sensing model, allowing domain-relevant information to be explicitly utilized during the retrieval process. This integration enables the model to better capture representative visual cues specific to remote sensing data, thereby improving its adaptability and performance within the remote sensing domain.
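A possible sketch of the injection loss in Equation (17) is shown below. A torchvision ResNet-50 with ImageNet weights is used as a stand-in teacher (the paper uses a ResNet-50 pretrained on AID), and `proj_rs` is a hypothetical linear layer realizing $\mathrm{Proj}_{rs}$.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

# Frozen teacher; ImageNet weights are only a stand-in for the AID-pretrained model.
teacher = resnet50(weights="IMAGENET1K_V2")
teacher.fc = nn.Identity()                  # keep the 2048-d pooled features
teacher.eval()
for p in teacher.parameters():
    p.requires_grad_(False)

proj_rs = nn.Linear(2048, 512)              # Proj_rs into the shared image-text space

def rsii_loss(images: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    """Equation (17): squared L2 distance between CLIP image embeddings V_i and
    the projected teacher features Proj_rs(V_i^rs), averaged over the batch."""
    with torch.no_grad():
        V_rs = teacher(images)              # domain-specific visual features (no grad)
    return ((V - proj_rs(V_rs)) ** 2).sum(dim=-1).mean()
```

Since the teacher is frozen and used only at training time, it adds no cost at inference, consistent with the design described above.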
The optimization of DCCA is driven by a composite objective that integrates the global contrastive loss with negative pair expansion, the bidirectional intra-inter-modal distribution matching loss, and the remote sensing information injection loss:
$$\mathcal{L}_{total} = \mathcal{L}_{gnpe} + \beta\,\mathcal{L}_{iimdm} + \gamma\,\mathcal{L}_{rsii} \quad (18)$$
where $\beta$ and $\gamma$ modulate the relative importance of the different loss components during training. This comprehensive training objective ensures that DCCA effectively learns to align cross-modal representations consistently while incorporating domain-specific knowledge from remote sensing data.

4. Experiments

4.1. Datasets and Evaluation Metrics

To validate DCCA, we carried out experiments on two representative datasets for RSITR, namely RSICD [19] and RSITMD [6]. RSICD contains 10,921 images collected from aerial and satellite sources, spanning 30 scene categories, with each image paired with five corresponding textual descriptions. RSITMD further expands the semantic coverage to 32 categories and contains 4743 images, with each image linked to five captions. Owing to its annotation strategy that explicitly accounts for the pronounced intra-modal similarity inherent in remote sensing data, RSITMD provides more fine-grained and diverse textual descriptions than RSICD. For training protocol consistency, we adopted dataset splits that are commonly used in CLIP-based remote sensing retrieval studies [15,16,26,27]. Specifically, all training samples defined in the original RSICD benchmark [19] were utilized, without incorporating validation data into the training set, unlike the strategy adopted in IEFT [48]. For RSITMD, the complete training split provided by Yuan et al. [6] was employed.
For a fair comparison with existing studies, retrieval performance was assessed using recall-based metrics, including R@K values with K set to 1, 5, and 10, along with the mean recall (mR). The R@K metric evaluates whether a ground-truth match appears among the top-K ranked candidates, reflecting retrieval quality under varying levels of ranking strictness. To provide a more comprehensive assessment, mR is computed as the average of the recall values over the different values of K in both retrieval directions, offering a holistic measure of overall retrieval effectiveness.
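For reference, the sketch below computes R@K and their average for one retrieval direction from a similarity matrix; it assumes a single ground-truth gallery index per query, whereas the benchmarks pair each image with five captions, and the reported mR additionally averages over both retrieval directions.

```python
import torch

def recall_at_k(sim: torch.Tensor, gt: torch.Tensor, ks=(1, 5, 10)) -> dict:
    """sim: (num_queries, num_gallery) similarities; gt: ground-truth index per query."""
    ranks = sim.argsort(dim=-1, descending=True)       # gallery indices, best first
    hits = ranks == gt[:, None]                        # True where the ground truth appears
    recalls = {f"R@{k}": hits[:, :k].any(dim=-1).float().mean().item() for k in ks}
    recalls["mean"] = sum(recalls[f"R@{k}"] for k in ks) / len(ks)
    return recalls
```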

4.2. Baselines

To comprehensively examine the performance of DCCA, we compare it with representative RSITR methods trained solely on either the RSITMD or RSICD dataset. Specifically, AMFMN [6] and GaLR [7] focus on asymmetric feature matching and joint modeling of global and local information, respectively, while SWAN [49] emphasizes scene-aware aggregation and multi-scale fusion to alleviate semantic ambiguity. Several methods incorporate additional guidance to enhance fine-grained discrimination, such as KAMCL [9], which introduces knowledge-aided contrastive learning, VGSGN [50], which exploits salient visual cues for cross-modal alignment, and DOVE [51], which models directional visual–semantic relationships. Other approaches explore feature interaction and prior-driven learning mechanisms. IEFT [48] treats images and texts as a unified entity through an interacting-enhancing transformer, whereas PIR [11] leverages prior knowledge to guide adaptive representation learning. MTGFE [52] proposes a fusion encoder guided by multiple tasks and optimized with multi-view contrastive learning to improve fine-grained cross-modal correspondence. In addition, PE-RSITR [53] leverages a parameter-efficient adaptation scheme combined with a hybrid contrastive objective to tailor VLP for RSITR. CLIP-ZS [14] indicates applying CLIP in a zero-shot manner, without task-specific adaptation, while CLIP-FT [14] denotes full fine-tuning of CLIP on remote sensing image–text data. Furthermore, AIR [16] and GLISA [15] extend CLIP with lightweight adaptation and global–local soft alignment mechanisms, respectively, to improve retrieval performance in remote sensing scenarios.

4.3. Implementation Details

We implemented DCCA in PyTorch v2.1.1 and carried out training following the ITRA framework [54]. For the RSITMD dataset, training proceeded for 7 epochs with batches of 100 samples and a learning rate of $1 \times 10^{-5}$. The loss weights were configured as follows: $\beta$ for $\mathcal{L}_{iimdm}$ was set to 5.0, $\gamma$ for $\mathcal{L}_{rsii}$ to 1.0, $\alpha_1$ for $\mathcal{L}_{intra}^{T2I}$ to 1.0, and $\alpha_2$ for $\mathcal{L}_{inter}^{sym}$ to 0.5. For RSICD, training was carried out for 8 epochs with identical batch size and learning rate settings, and the corresponding weights $\beta$, $\gamma$, $\alpha_1$, and $\alpha_2$ were set to 1.5, 1.0, 0.3, and 0.1, respectively. Additional training strategies included a 100-step warm-up, a weight decay of 0.5, and gradient clipping to a maximum norm of 50. The temperature parameter in $\mathcal{L}_{gnpe}$ was treated as trainable and updated dynamically during training. The experimental setup employed a Linux machine with an NVIDIA RTX 4090 GPU.
During inference, DCCA extracts image and text representations separately using the fine-tuned CLIP encoders. Cross-modal similarity is then computed between these representations to construct ranking lists in both retrieval directions. We evaluated DCCA on RSITMD and RSICD and benchmarked its results against multiple leading methods to validate its performance. Different contrastive learning variants were compared to assess the effectiveness of the global contrastive learning with NPE strategy. To investigate the effect of each module, ablation experiments were carried out by selectively deactivating one or more key modules while keeping the training protocol consistent. Comparisons with methods pretrained on large-scale remote sensing image–text corpora further highlighted DCCA’s efficiency under limited data conditions. For qualitative evaluation, attention maps and example retrieval results were visualized for both modalities, providing interpretable insights into how DCCA captures cross-modal correspondences and supporting the observed improvements during inference.

4.4. Comparison Results

The results of DCCA relative to various baseline methods on RSITMD and RSICD are presented in Table 1. To maintain consistency across methods, results corresponding to CLIP-FT were derived by fully fine-tuning the CLIP model on RSITMD or RSICD, where all parameters are updated without freezing any layers. The training settings are kept consistent with those used for DCCA, and the conventional contrastive learning objective formulated in Equation (7) is adopted for optimization. The results of the other baseline methods are consistent with those reported in their original publications. Based on these results, several observations can be made.
First, DCCA outperforms all other evaluated approaches in terms of overall performance. With respect to the mean Recall (mR) metric, DCCA improves the performance on RSITMD by 6% (from 48.43 to 51.33) and on RSICD by 6.3% (from 35.90 to 38.16). These improvements indicate that the proposed discriminative and consistent cross-modal alignment strategy effectively enhances retrieval performance in remote sensing scenarios.
Second, compared with existing approaches, DCCA exhibits more pronounced gains on the R@5 and R@10 metrics. These results imply that DCCA effectively expands the range of retrieved candidates with meaningful semantic relevance, rather than overemphasizing optimization at R@1 in highly similar remote sensing settings. The consistent improvements across multiple recall metrics further demonstrate that the discriminative and consistent alignment mechanism effectively boosts the model's capability to recognize correspondences across modalities, thereby enabling more stable and reliable retrieval performance even in the presence of elevated intra-modal similarity within remote sensing image–text datasets.
Third, DCCA yields more substantial performance gains on the RSICD dataset than on RSITMD. Compared with prior methods, the relative improvements on RSICD are noticeably larger. Since RSICD contains a higher proportion of visually and semantically similar image–text pairs, it poses a more challenging scenario for cross-modal alignment. The more significant gains on RSICD indicate that, by jointly leveraging the enhanced contrastive learning strategy and the remote sensing information injection module, DCCA is able to learn more discriminative and reliable cross-modal representations. As a result, it more effectively distinguishes highly similar image–text pairs, thereby boosting the quality of the retrieved results.

4.5. Comparison of Different Contrastive Learning Variants

To assess the contribution of each optimization stage in our contrastive learning strategies, we conducted comparative experiments. Specifically, the other components of DCCA were removed and only the contrastive learning variants were retained. We evaluated the effectiveness of $\mathcal{L}_{itc}$ (Equation (7)), $\mathcal{L}_{gitc}$ (Equation (9)), the first formulation of $\mathcal{L}_{gnpe}$ (Equation (10)), denoted as $\mathcal{L}_{gnpe}^{1}$, and the second formulation of $\mathcal{L}_{gnpe}$ (Equation (11)), denoted as $\mathcal{L}_{gnpe}^{2}$, on RSITMD and RSICD, with the corresponding results presented in Table 2. $\mathcal{L}_{itc}$ represents the standard contrastive learning loss, while $\mathcal{L}_{gitc}$ extends it by introducing the global-level contrastive strategy. $\mathcal{L}_{gnpe}^{1}$ and $\mathcal{L}_{gnpe}^{2}$, which are mathematically equivalent, further build upon $\mathcal{L}_{gitc}$ by expanding the negative pairs involved in the contrastive process.
The results reveal several important observations. First, each additional optimization yields a performance improvement, highlighting the effectiveness of each individual strategy. Second, $\mathcal{L}_{gitc}$ delivers more substantial improvements on RSICD than on RSITMD. This is consistent with our expectations: RSICD contains more semantically similar textual descriptions, so the global-level strategy of focusing on hard negatives becomes more beneficial, leading to more pronounced gains. Third, $\mathcal{L}_{gnpe}^{1}$ and $\mathcal{L}_{gnpe}^{2}$ provide additional improvements by enlarging the pool of negative pairs, suggesting that more negative comparisons improve the learning quality of contrastive frameworks. Although $\mathcal{L}_{gnpe}^{1}$ and $\mathcal{L}_{gnpe}^{2}$ are mathematically equivalent, they exhibit slight performance differences in practical implementation. For consistency across experiments and to maintain a unified training configuration, we adopt the results derived from $\mathcal{L}_{gnpe}^{2}$ in all reported evaluations.

4.6. Ablation Studies

To investigate the role of each component within DCCA, ablation experiments were systematically conducted. Specifically, we removed individual or multiple components and summarized the corresponding results in Table 3. First, the global contrastive learning with NPE mechanism delivers the largest improvement, acting as the main contributor to the enhanced cross-modal retrieval performance. Additionally, the remote sensing information injection module also contributes a significant performance gain, indicating its effectiveness in adapting CLIP to the specific characteristics of remote sensing scenarios. Moreover, incorporating the intra-inter-modal distribution matching constraint brings additional performance gains, demonstrating its effectiveness and complementary contribution to the contrastive learning framework.
When $\mathcal{L}_{intra}^{T2I}$ was removed, performance declined consistently across both datasets, confirming that this additional matching direction is essential for preserving intra-modal consistency. In contrast, removing $\mathcal{L}_{inter}^{sym}$ produced a larger performance decline on RSITMD than on RSICD. This observation suggests that improving inter-modal consistency is more influential for RSITMD, whereas intra-modal consistency is relatively more important for RSICD, likely due to differences in semantic similarity and description granularity between the two datasets.
Furthermore, removing the remote sensing information injection module led to a noticeably larger performance degradation on RSITMD compared with RSICD, indicating that remote sensing domain knowledge is particularly beneficial for improving retrieval effectiveness on RSITMD.
In sum, these ablation results confirm that every proposed component contributes uniquely to improving retrieval accuracy, and that jointly integrating these components leads to a more effective cross-modal alignment framework.

4.7. Comparison with Large-Scale Pretrained Remote Sensing Models

Previous studies [26,27] have shown that models trained with larger amounts of additional data typically achieve better performance in RSITR tasks, despite their architectural simplicity. This trend highlights the strong data dependency of RSITR and suggests that abundant cross-modal samples boost the quality of learned representations. To demonstrate DCCA’s data efficiency, we conducted a comparison with models pretrained on extensive remote sensing image–text corpora. Table 4 displays the results. RemoteCLIP and GeoRSCLIP both leverage large-scale pretraining datasets, where RemoteCLIP is developed with an image–text dataset of about 0.83 million pairs, in contrast to GeoRSCLIP, which employs nearly 5 million. Despite the significant gap in training data volume, DCCA surpasses RemoteCLIP with respect to the mean Recall (mR) and achieves performance comparable to, and in some cases slightly higher than, the fine-tuned single-dataset results reported by GeoRSCLIP. Although DCCA performs marginally below GeoRSCLIP tuned on RET-2, which is the combination of RSITMD and RSICD, the overall results demonstrate that DCCA remains competitive even under limited data conditions. These findings indicate that DCCA exhibits strong data efficiency and is capable of learning reliable cross-modal representations without requiring large-scale pretraining on remote sensing data, thereby reinforcing its practical value in remote sensing settings with limited or expensive annotated data.

4.8. Hyperparameter Studies

Experiments were also carried out to study how varying the weights of individual loss terms influences model performance. All experiments are conducted under the settings described in Section 4.3. For each hyperparameter study, only the target hyperparameter is varied, while all other hyperparameters are kept fixed at the values obtained from the optimal configuration. The variation curves are depicted in Figure 6, where $\beta$ denotes the coefficient applied to the distribution matching constraint loss $\mathcal{L}_{iimdm}$, $\alpha_1$ and $\alpha_2$ represent the weights of $\mathcal{L}_{intra}^{T2I}$ and $\mathcal{L}_{inter}^{sym}$, respectively, and $\gamma$ corresponds to the weight of the remote sensing information injection loss $\mathcal{L}_{rsii}$.
From the overall trends observed in Figure 6, the proposed DCCA framework exhibits generally stable performance across a wide range of parameter settings, indicating notable robustness to hyperparameter selection. When the hyperparameter ranges are further expanded, performance degradation can be observed under excessively large parameter values, with varying degrees of decline across parameters and datasets, suggesting that overly strong constraints can negatively affect model optimization. Specifically, the performance curves with respect to $\beta$, $\alpha_2$, and $\gamma$ remain relatively consistent on both RSITMD and RSICD, with only minor fluctuations over broad intervals. In contrast, the results are more sensitive to $\alpha_1$, where relatively larger variations are observed, indicating that properly balancing the bidirectional intra-modal distribution matching objective is more critical to retrieval performance. Overall, these results demonstrate that DCCA maintains stable performance within reasonably wide parameter ranges and is not overly sensitive to most individual hyperparameters.
Based on the best-performing parameter configurations shown in Figure 6, DCCA attains the highest performance on RSITMD when $\beta = 5.0$, $\alpha_1 = 1.0$, $\alpha_2 = 0.5$, and $\gamma = 1.0$, whereas on the RSICD dataset, optimal performance is obtained with $\beta = 1.5$, $\alpha_1 = 0.3$, $\alpha_2 = 0.1$, and $\gamma = 1.0$. These results suggest several important observations. First, the intra-inter-modal distribution matching constraint loss plays a more significant role relative to the remote sensing information injection loss, highlighting its critical contribution to cross-modal representation learning. Second, the distribution matching constraint loss carries a larger weight on RSITMD than on RSICD, indicating that improving modal consistency is particularly beneficial for the RSITMD dataset. Third, $\mathcal{L}_{intra}^{T2I}$ and $\mathcal{L}_{intra}^{I2T}$ are equally important on RSITMD, suggesting that the image–text correspondences in this dataset are generally well balanced and reliable. In contrast, on RSICD, $\mathcal{L}_{intra}^{I2T}$ carries more weight than $\mathcal{L}_{intra}^{T2I}$, implying that using text to guide image intra-modal consistency is more effective for this dataset. This observation aligns with the characteristics of the datasets: RSITMD texts are more discriminative, and the image–text associations have been specially optimized. Finally, the remote sensing information injection loss $\mathcal{L}_{rsii}$ consistently contributes to performance improvements across both datasets, indicating that incorporating remote sensing domain information provides essential scene-level knowledge that facilitates accurate cross-modal alignment. Collectively, these observations underscore the complementary roles of the proposed losses and demonstrate the importance of carefully balancing their contributions for optimal cross-modal retrieval performance.

4.9. Visualization of Image–Text Attention Heat Maps

To further assess whether DCCA achieves potential explicit alignment after realizing implicit alignment and whether it preserves interpretability while delivering high performance, we adopt a gradient-based attention rollout mechanism [55] to quantify cross-modal relevance. Specifically, we first compute the logits for image–text alignment, and back-propagate the gradients to the attention maps of both visual and textual transformers. For each layer, the element-wise product of the attention weights and their corresponding gradients is aggregated and progressively propagated through the network to derive relevance scores. This process yields token-level textual relevance and patch-level visual relevance, enabling us to identify which text tokens and image regions contribute most significantly to the final matching prediction.
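A highly simplified sketch of the gradient-weighted rollout used for these visualizations is given below; the exact formulation of [55] differs in several details (e.g., relevance initialization and cross-attention handling), so this should be read only as an outline of the layer-wise aggregation step. The attention maps and their gradients are assumed to be captured with forward and backward hooks on the matching logit.

```python
import torch

def grad_attention_rollout(attn_maps, attn_grads):
    """Propagate gradient-weighted attention through the layers of one encoder.

    attn_maps / attn_grads: lists of (heads, tokens, tokens) tensors, one per layer.
    Returns a (tokens, tokens) relevance map; the row of the global token gives the
    per-token (or per-patch) relevance scores used for the heat maps.
    """
    num_tokens = attn_maps[0].size(-1)
    relevance = torch.eye(num_tokens)
    for attn, grad in zip(attn_maps, attn_grads):
        cam = (grad * attn).clamp(min=0).mean(dim=0)   # head-averaged positive contribution
        cam = cam + torch.eye(num_tokens)              # account for the residual connection
        cam = cam / cam.sum(dim=-1, keepdim=True)      # row-normalize before chaining
        relevance = cam @ relevance                    # accumulate relevance layer by layer
    return relevance
```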
The resulting attention-based heat maps are depicted in Figure 7. In Figure 7a, DCCA accurately highlights the blue house and the white plane, demonstrating precise recognition of color-related attributes. Figure 7b shows clear localization of the three cars, indicating strong capability in understanding numerical information. In Figure 7c, the model precisely identifies the tennis court while also showing partial activation around the houses. Finally, Figure 7d correctly focuses on the two hemispherical buildings, revealing accurate perception of shape-related attributes. These results collectively confirm that DCCA maintains interpretability while achieving improved cross-modal alignment performance.

4.10. Visualization of Retrieval Results

We carried out a qualitative analysis of the results produced by DCCA. To directly illustrate the improvements introduced by our method, we compared DCCA with the feature extraction module (FEM) introduced in Section 3.3 optimized with the conventional contrastive objective, which is equivalent to the ViT-B-16-based CLIP-FT model. Figure 8 compares text-to-image retrieval results, showing the five highest-ranked images returned by each model. DCCA accurately retrieves the targets corresponding to the query texts. In the first query, the top-1 result of DCCA precisely matches the ground truth, and all top-5 results contain the queried “baseball stadium”. Although these images are highly similar, all primarily depicting a baseball field, DCCA still ranks the image containing the key description “blue and white house” (which occupies only a small area) at the top, indicating a stronger discriminative ability for highly similar images. In contrast, the baseline fails to recognize the “blue and white house” and even misidentifies a “swimming pool” as a “blue house”. DCCA thus exhibits superior retrieval and discrimination capabilities, identifying the salient information in the text and reflecting it in the retrieved images, which also demonstrates the effectiveness of injecting remote sensing information into the model. In the second query, DCCA fully captures the concept of “two rows of tanks”, which is reflected in each of its top-5 results, whereas the baseline fails to grasp this concept and ranks images with more than two rows of tanks higher. This shows that DCCA achieves fine-grained alignment in practice.
Figure 9 compares image-to-text retrieval results, showing the ten highest-ranked texts retrieved by each model. DCCA successfully ranks the ground-truth texts at the top, even though the top ten results contain negative texts highly similar to the ground truth. For instance, the text ranked fifth includes matching elements such as “tennis courts”, “buildings”, and “trees”. Nevertheless, DCCA still recognizes the mismatch with “harbor” and ranks this text lower than most of the positive texts. This indicates that, in image-to-text retrieval, DCCA exhibits stronger discriminative capability for identifying highly similar negative texts.
Overall, these visualizations demonstrate that DCCA discriminates subtle cross-modal differences more effectively than the baseline, producing more precise retrieval results.
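As a reference for how such rankings are obtained, the retrieval step itself reduces to a nearest-neighbour search over the learned joint embedding space; the snippet below is a minimal sketch of this procedure (function and variable names are ours and only illustrative, not taken from the released code).

```python
import torch
import torch.nn.functional as F

def top_k_retrieval(query_emb, gallery_embs, k=5):
    """Rank gallery items (images for Figure 8, texts for Figure 9) by cosine
    similarity to a query embedding produced by the trained encoders and
    return the indices of the k best matches."""
    query = F.normalize(query_emb, dim=-1)       # (D,)
    gallery = F.normalize(gallery_embs, dim=-1)  # (N, D)
    scores = gallery @ query                     # cosine similarities, shape (N,)
    return torch.topk(scores, k).indices
```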

5. Discussion

The proposed DCCA framework is consistently validated through experiments on two publicly available RSITR datasets. Compared with a wide range of representative baselines, DCCA yields the leading retrieval performance on RSITMD and RSICD, with particularly notable improvements in mean Recall (mR) and higher-order recall metrics (R@5 and R@10). These results indicate that DCCA substantially improves overall ranking quality, which is especially critical in remote sensing scenarios characterized by high intra-modal similarity. Further analyses of different contrastive learning variants verify that each stage of the proposed optimization, including the global-level contrastive strategy and negative pair expansion, contributes incremental performance gains. This confirms the importance of strengthening hard negative sample mining and expanding negative pairs under subtle semantic differences. Ablation studies further reveal that global contrastive learning with negative pair expansion is the primary contributor to performance improvement, while the bidirectional intra-inter-modal distribution matching constraint and the remote sensing information injection module provide complementary benefits by enhancing representation consistency and domain adaptability, respectively. Notably, DCCA achieves larger performance gains on RSICD, a dataset with higher semantic ambiguity, highlighting its strong discriminative capability in more challenging retrieval settings. Moreover, comparisons with large-scale pretraining approaches demonstrate that DCCA attains competitive performance without relying on massive remote sensing image–text corpora, underscoring its data efficiency and practical applicability. Visualization results further confirm that DCCA preserves interpretability while achieving strong cross-modal alignment. Collectively, these findings validate that the proposed discriminative and consistent alignment strategies effectively address key challenges in RSITR and provide a reliable and data-efficient solution for remote sensing image–text retrieval.
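For clarity, the mR values discussed here and reported in Tables 1–4 are the average of R@1, R@5, and R@10 over both retrieval directions. The sketch below illustrates this computation under the simplifying assumption of a single ground-truth match per query (the benchmarks actually provide several captions per image, so the real evaluation counts a hit if any ground-truth item appears in the top k).

```python
import numpy as np

def recall_at_k(sim, k):
    """Fraction of queries whose ground-truth item (assumed to share the query
    index) appears among the top-k results; sim has shape (queries, gallery)."""
    top_k = np.argsort(-sim, axis=1)[:, :k]
    hits = [i in top_k[i] for i in range(sim.shape[0])]
    return 100.0 * np.mean(hits)

def mean_recall(sim_i2t, sim_t2i):
    """mR: average of R@1, R@5, and R@10 over both retrieval directions."""
    scores = [recall_at_k(sim, k) for sim in (sim_i2t, sim_t2i) for k in (1, 5, 10)]
    return float(np.mean(scores))
```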
Despite its effectiveness, DCCA still has several limitations. The proposed framework implicitly assumes that the training data are largely free of annotation noise. However, in real-world RSITR scenarios, datasets with high intra-modal similarity inevitably contain not only hard negative pairs but also false negative and false positive pairs. Such noisy supervision can mislead contrastive optimization, disrupt the learning of correct cross-modal correspondences, and ultimately degrade retrieval performance.
To some extent, our framework can alleviate the noise introduced by false negatives. In particular, the proposed negative pair expansion (NPE) mechanism substantially increases the number of negative pairs associated with each positive pair compared to conventional contrastive learning strategies. By diluting the relative contribution of a small number of false negative pairs to the overall loss, this design improves the tolerance of the optimization process to moderate false negative noise and partially mitigates its adverse impact. Nevertheless, when the proportion of false negative or false positive pairs becomes significant, this implicit robustness is insufficient. The current framework still lacks explicit mechanisms to model annotation noise or handle ambiguous image–text associations, indicating scope for further enhancements in robustness.
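To illustrate the counting argument behind this dilution effect, the sketch below contrasts the standard InfoNCE setting, where each positive pair is contrasted against B − 1 negatives per direction, with a simplified expanded variant in which every positive is contrasted against roughly 2(B − 1) negatives drawn from both directions at once. This is only an illustration of the principle, not the exact NPE objective of Equation (11), and all names are ours.

```python
import torch
import torch.nn.functional as F

def contrastive_with_expanded_negatives(img_emb, txt_emb, tau=0.07):
    """Simplified illustration of negative pair expansion: each positive pair
    competes against the non-matching pairs of *both* retrieval directions,
    so a handful of false negatives carries less relative weight in the loss."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    sim = img @ txt.t() / tau                                  # (B, B) similarities
    b = sim.shape[0]
    pos = sim.diag().unsqueeze(1)                              # positive-pair scores
    off_diag = ~torch.eye(b, dtype=torch.bool)
    neg_i2t = sim.masked_select(off_diag).view(b, b - 1)       # image-to-text negatives
    neg_t2i = sim.t().masked_select(off_diag).view(b, b - 1)   # text-to-image negatives
    logits = torch.cat([pos, neg_i2t, neg_t2i], dim=1)         # 1 positive vs 2(B-1) negatives
    target = torch.zeros(b, dtype=torch.long)                  # positive sits at index 0
    return F.cross_entropy(logits, target)
```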
In future research, we aim to improve the robustness and scalability of the proposed framework. Specifically, we intend to explore robustness-aware RSITR methods that explicitly mitigate the influence of false negative and false positive pairs, such as noise-tolerant contrastive objectives and adaptive sample reweighting strategies. In addition, although DCCA achieves performance comparable to large-scale remote sensing image–text pretrained models, it still trails approaches further enhanced by remote sensing vision–language pretraining (RSVLP) or trained on substantially larger datasets. Expanding the scale and diversity of training data, as well as integrating the proposed framework with large-scale RSVLP paradigms, represents a promising direction for achieving more practical and generalizable retrieval performance.

6. Conclusions

This study addresses two major issues in RSITR: high similarity within modalities and limited adaptability of vision–language models pretrained on natural images. To tackle these issues, we propose DCCA (discriminative and consistent cross-modal alignment), a framework that combines a global contrastive learning strategy with negative pair expansion to enhance discrimination of subtle cross-modal differences, along with a bidirectional distribution matching constraint that enforces consistency both within and across modalities. In addition, a remote sensing information injection module is incorporated to transfer domain-specific visual knowledge into CLIP, reducing the requirement for massive remote sensing corpora and addressing the domain discrepancy between natural and remote sensing images. Evaluations on two public RSITR benchmarks show that DCCA achieves strong and generalizable retrieval performance. The framework outperforms existing RSITR approaches and reaches levels comparable to models pretrained on extensive remote sensing datasets, despite requiring significantly fewer training samples. Qualitative analyses further illustrate that the model can establish coherent links between visual content and associated text. Despite these strengths, DCCA assumes relatively clean training data, which may limit its robustness to false negative and false positive pairs. Future efforts will explore enhancing noise tolerance and scalability, potentially through robust contrastive objectives and integration with large-scale remote sensing vision–language pretraining, to further improve generalization and practical applicability.

Author Contributions

Data curation, Formal analysis, Investigation, Methodology, Validation, Visualization, Writing—original draft preparation, Z.S.; Data curation, Software, Validation, Visualization, Y.S.; Supervision, Project administration, Writing—review and editing, W.L.; Funding acquisition, Project administration, Resources, Writing—review and editing, J.G.; Supervision, Project administration, Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China (No. 62372326).

Data Availability Statement

The datasets used in this study are publicly accessible. Specifically, RSICD can be obtained from https://github.com/201528014227051/RSICD_optimal and RSITMD from https://github.com/xiaoyuan1996/AMFMN, with both accessed on 1 July 2025. The source code for DCCA is publicly available at https://github.com/ADMIS-TONGJI/DCCA (accessed on 29 December 2025).

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

References

  1. Joyce, K.E.; Belliss, S.E.; Samsonov, S.V.; McNeill, S.J.; Glassey, P.J. A review of the status of satellite remote sensing and image processing techniques for mapping natural hazards and disasters. Prog. Phys. Geogr. Earth Environ. 2009, 33, 183–207. [Google Scholar] [CrossRef]
  2. Weiss, M.; Jacob, F.; Duveiller, G. Remote sensing for agricultural applications: A meta-review. Remote Sens. Environ. 2020, 236, 111402. [Google Scholar] [CrossRef]
  3. Patino, J.E.; Duque, J.C. A review of regional science applications of satellite remote sensing in urban settings. Comput. Environ. Urban Syst. 2013, 37, 1–17. [Google Scholar] [CrossRef]
  4. Xu, L.; Wang, L.; Zhang, J.; Ha, D.; Zhang, H. A Review of Cross-Modal Image-Text Retrieval in Remote Sensing. Remote Sens. 2025, 17, 3995. [Google Scholar] [CrossRef]
  5. Wang, T.; Li, F.; Zhu, L.; Li, J.; Zhang, Z.; Shen, H.T. Cross-Modal Retrieval: A Systematic Review of Methods and Future Directions. Proc. IEEE 2024, 112, 1716–1754. [Google Scholar] [CrossRef]
  6. Yuan, Z.; Zhang, W.; Fu, K.; Li, X.; Deng, C.; Wang, H.; Sun, X. Exploring a Fine-Grained Multiscale Method for Cross-Modal Remote Sensing Image Retrieval. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–19. [Google Scholar] [CrossRef]
  7. Yuan, Z.; Zhang, W.; Tian, C.; Rong, X.; Zhang, Z.; Wang, H.; Fu, K.; Sun, X. Remote Sensing Cross-Modal Text-Image Retrieval Based on Global and Local Information. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–16. [Google Scholar] [CrossRef]
  8. Zhang, S.; Li, Y.; Mei, S. Exploring Uni-Modal Feature Learning on Entities and Relations for Remote Sensing Cross-Modal Text-Image Retrieval. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–17. [Google Scholar] [CrossRef]
  9. Ji, Z.; Meng, C.; Zhang, Y.; Pang, Y.; Li, X. Knowledge-Aided Momentum Contrastive Learning for Remote-Sensing Image Text Retrieval. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–13. [Google Scholar] [CrossRef]
  10. Zhang, W.; Li, J.; Li, S.; Chen, J.; Zhang, W.; Gao, X.; Sun, X. Hypersphere-Based Remote Sensing Cross-Modal Text-Image Retrieval via Curriculum Learning. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–15. [Google Scholar] [CrossRef]
  11. Pan, J.; Ma, Q.; Bai, C. A Prior Instruction Representation Framework for Remote Sensing Image-text Retrieval. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 611–620. [Google Scholar] [CrossRef]
  12. Tang, X.; Huang, D.; Ma, J.; Zhang, X.; Liu, F.; Jiao, L. Prior-Experience-Based Vision-Language Model for Remote Sensing Image-Text Retrieval. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–13. [Google Scholar] [CrossRef]
  13. Chen, Y.; Huang, J.; Xiong, S.; Lu, X. Integrating Multisubspace Joint Learning with Multilevel Guidance for Cross-Modal Retrieval of Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–17. [Google Scholar] [CrossRef]
  14. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models from Natural Language Supervision. arXiv 2021, arXiv:2103.00020. [Google Scholar] [CrossRef]
  15. Hu, G.; Wen, Z.; Lv, Y.; Zhang, J.; Wu, Q. Global-Local Information Soft-Alignment for Cross-Modal Remote-Sensing Image-Text Retrieval. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–15. [Google Scholar] [CrossRef]
  16. Yang, R.; Wang, S.; Tao, J.; Han, Y.; Lin, Q.; Guo, Y.; Hou, B.; Jiao, L. Accurate and Lightweight Learning for Specific Domain Image-Text Retrieval. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, VIC, Australia, 28 October–1 November 2024; pp. 9719–9728. [Google Scholar] [CrossRef]
  17. Plummer, B.A.; Wang, L.; Cervantes, C.M.; Caicedo, J.C.; Hockenmaier, J.; Lazebnik, S. Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 2641–2649. [Google Scholar] [CrossRef]
  18. Lin, T.Y.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C.L.; Dollár, P. Microsoft COCO: Common Objects in Context. arXiv 2015, arXiv:1405.0312. [Google Scholar] [CrossRef]
  19. Lu, X.; Wang, B.; Zheng, X.; Li, X. Exploring Models and Data for Remote Sensing Image Caption Generation. IEEE Trans. Geosci. Remote Sens. 2018, 56, 2183–2195. [Google Scholar] [CrossRef]
  20. Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. arXiv 2019, arXiv:1908.10084. [Google Scholar] [CrossRef]
  21. Xiao, S.; Liu, Z.; Zhang, P.; Muennighoff, N.; Lian, D.; Nie, J.Y. C-Pack: Packed Resources for General Chinese Embeddings. arXiv 2024, arXiv:2309.07597. [Google Scholar] [CrossRef]
  22. Jia, C.; Yang, Y.; Xia, Y.; Chen, Y.T.; Parekh, Z.; Pham, H.; Le, Q.; Sung, Y.H.; Li, Z.; Duerig, T. Scaling Up Visual and Vision-Language Representation Learning with Noisy Text Supervision. In Proceedings of the 38th International Conference on Machine Learning, Online, 18–24 July 2021; pp. 4904–4916. [Google Scholar] [CrossRef]
  23. Li, J.; Li, D.; Xiong, C.; Hoi, S. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. In Proceedings of the 39th International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; pp. 12888–12900. [Google Scholar] [CrossRef]
  24. Jiang, Q.; Chen, C.; Zhao, H.; Chen, L.; Ping, Q.; Tran, S.D.; Xu, Y.; Zeng, B.; Chilimbi, T. Understanding and Constructing Latent Modality Structures in Multi-Modal Representation Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 7661–7671. [Google Scholar] [CrossRef]
  25. Han, Z.; Zhang, S.; Su, Y.; Chen, X.; Mei, S. DR-AVIT: Toward Diverse and Realistic Aerial Visible-to-Infrared Image Translation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–13. [Google Scholar] [CrossRef]
  26. Liu, F.; Chen, D.; Guan, Z.; Zhou, X.; Zhu, J.; Ye, Q.; Fu, L.; Zhou, J. RemoteCLIP: A Vision Language Foundation Model for Remote Sensing. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–16. [Google Scholar] [CrossRef]
  27. Zhang, Z.; Zhao, T.; Guo, Y.; Yin, J. RS5M and GeoRSCLIP: A Large-Scale Vision-Language Dataset and a Large Vision-Language Model for Remote Sensing. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–23. [Google Scholar] [CrossRef]
  28. Cheng, Q.; Zhou, Y.; Fu, P.; Xu, Y.; Zhang, L. A Deep Semantic Alignment Network for the Cross-Modal Image-Text Retrieval in Remote Sensing. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 4284–4297. [Google Scholar] [CrossRef]
  29. Ji, Z.; Meng, C.; Zhang, Y.; Wang, H.; Pang, Y.; Han, J. Eliminate Before Align: A Remote Sensing Image-Text Retrieval Framework with Keyword Explicit Reasoning. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, VIC, Australia, 28 October–1 November 2024; pp. 1662–1671. [Google Scholar] [CrossRef]
  30. Guan, J.; Shu, Y.; Li, W.; Song, Z.; Zhang, Y. PR-CLIP: Cross-Modal Positional Reconstruction for Remote Sensing Image-Text Retrieval. Remote Sens. 2025, 17, 2117. [Google Scholar] [CrossRef]
  31. Zheng, C.; Li, X.; Liang, X.; Huang, L.; Du, S.; Nie, J.; Dong, J. Cross-Modal Progressive Perspective Matching Network for Remote Sensing Image-Text Retrieval. IEEE Trans. Multimed. 2025, 27, 3966–3978. [Google Scholar] [CrossRef]
  32. Sun, T.; Zheng, C.; Li, X.; Gao, Y.; Nie, J.; Huang, L.; Wei, Z. Strong and Weak Prompt Engineering for Remote Sensing Image-Text Cross-Modal Retrieval. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 6968–6980. [Google Scholar] [CrossRef]
  33. Zheng, C.; Nie, J.; Yin, B.; Li, X.; Qian, Y.; Wei, Z. Frequency- and Spatial-Domain Saliency Network for Remote Sensing Cross-Modal Retrieval. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–13. [Google Scholar] [CrossRef]
  34. Lu, J.; Batra, D.; Parikh, D.; Lee, S. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 13–23. [Google Scholar] [CrossRef]
  35. Li, L.H.; Yatskar, M.; Yin, D.; Hsieh, C.J.; Chang, K.W. What Does BERT with Vision Look At? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 5265–5275. [Google Scholar] [CrossRef]
  36. Li, X.; Yin, X.; Li, C.; Zhang, P.; Hu, X.; Zhang, L.; Wang, L.; Hu, H.; Dong, L.; Wei, F.; et al. Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks. In Proceedings of the 16th European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 121–137. [Google Scholar] [CrossRef]
  37. Kim, W.; Son, B.; Kim, I. ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision. In Proceedings of the 38th International Conference on Machine Learning, Online, 18–24 July 2021; pp. 5583–5594. [Google Scholar] [CrossRef]
  38. Sohn, K. Improved deep metric learning with multi-class N-pair loss objective. In Proceedings of the 30th International Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 1857–1865. [Google Scholar]
  39. van den Oord, A.; Li, Y.; Vinyals, O. Representation Learning with Contrastive Predictive Coding. arXiv 2019, arXiv:1807.03748. [Google Scholar] [CrossRef]
  40. He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum Contrast for Unsupervised Visual Representation Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 9726–9735. [Google Scholar] [CrossRef]
  41. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A Simple Framework for Contrastive Learning of Visual Representations. In Proceedings of the 37th International Conference on Machine Learning, Online, 13–18 July 2020; pp. 1597–1607. [Google Scholar] [CrossRef]
  42. Chen, X.; He, K. Exploring Simple Siamese Representation Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 15745–15753. [Google Scholar] [CrossRef]
  43. Sun, Y.; Cheng, C.; Zhang, Y.; Zhang, C.; Zheng, L.; Wang, Z.; Wei, Y. Circle Loss: A Unified Perspective of Pair Similarity Optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 6397–6406. [Google Scholar] [CrossRef]
  44. Zhang, Y.; Jiang, H.; Miura, Y.; Manning, C.D.; Langlotz, C.P. Contrastive Learning of Medical Visual Representations from Paired Images and Text. In Proceedings of the 7th Machine Learning for Healthcare Conference, Durham, NC, USA, 5–6 August 2022; pp. 2–25. [Google Scholar]
  45. Jiang, D.; Ye, M. Cross-Modal Implicit Relation Reasoning and Aligning for Text-to-Image Person Retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 2787–2797. [Google Scholar] [CrossRef]
  46. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  47. Xia, G.S.; Hu, J.; Hu, F.; Shi, B.; Bai, X.; Zhong, Y.; Zhang, L.; Lu, X. AID: A Benchmark Data Set for Performance Evaluation of Aerial Scene Classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 3965–3981. [Google Scholar] [CrossRef]
  48. Tang, X.; Wang, Y.; Ma, J.; Zhang, X.; Liu, F.; Jiao, L. Interacting-Enhancing Feature Transformer for Cross-Modal Remote-Sensing Image and Text Retrieval. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–15. [Google Scholar] [CrossRef]
  49. Pan, J.; Ma, Q.; Bai, C. Reducing Semantic Confusion: Scene-aware Aggregation Network for Remote Sensing Cross-modal Retrieval. In Proceedings of the ACM International Conference on Multimedia Retrieval, Thessaloniki, Greece, 12–15 June 2023; pp. 398–406. [Google Scholar] [CrossRef]
  50. He, Y.; Xu, X.; Chen, H.; Li, J.; Pu, F. Visual Global-Salient-Guided Network for Remote Sensing Image-Text Retrieval. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–14. [Google Scholar] [CrossRef]
  51. Ma, Q.; Pan, J.; Bai, C. Direction-Oriented Visual-Semantic Embedding Model for Remote Sensing Image-Text Retrieval. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–14. [Google Scholar] [CrossRef]
  52. Zhang, X.; Li, W.; Wang, X.; Wang, L.; Zheng, F.; Wang, L.; Zhang, H. A Fusion Encoder with Multi-Task Guidance for Cross-Modal Text-Image Retrieval in Remote Sensing. Remote Sens. 2023, 15, 4637. [Google Scholar] [CrossRef]
  53. Yuan, Y.; Zhan, Y.; Xiong, Z. Parameter-Efficient Transfer Learning for Remote Sensing Image-Text Retrieval. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–14. [Google Scholar] [CrossRef]
  54. Chen, D. ITRA. Available online: https://github.com/ChenDelong1999/ITRA (accessed on 18 June 2025).
  55. Chefer, H.; Gur, S.; Wolf, L. Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 387–396. [Google Scholar] [CrossRef]
Figure 1. Average image and text intra-modal similarity scores for natural datasets (Flickr30K and MSCOCO) and remote sensing datasets (RSITMD and RSICD) using different encoders. The ViT variants are derived from the CLIP image encoder. Paraphrase-MiniLM-L6-v2 [20], all-mpnet-base-v2 [20], and bge-large-en-v1.5 [21] are specifically trained for text similarity computation.
Figure 2. The overall framework of DCCA, which comprises four core components: the feature extraction module, the global contrastive learning with negative pair expansion, the bidirectional intra-inter-modal distribution matching constraint, and the remote sensing information injection module. Squares represent CLIP image features, triangles denote CLIP text features, and circles correspond to image features encoded by the remote sensing image encoder.
Figure 3. Feature extraction module of DCCA.
Figure 4. Comparison of sample-level and global-level contrastive learning strategy. In (a), positive and negative pairs are first constructed with each sample as the center, followed by local hard sample pair mining, and then aggregated through global averaging. In (b), hard sample pair mining is performed globally across the entire batch.
Figure 5. Comparison of the number of negative pairs associated with each positive pair in conventional contrastive learning and the negative pair expansion mechanism. Positive pairs are indicated in green, and negative pairs are indicated in red.
Figure 6. Retrieval mR of DCCA when varying the hyperparameters: (a) hyperparameter $\beta$ of $L_{iimdm}$; (b) hyperparameter $\alpha_1$ of $L_{intra}^{T2I}$; (c) hyperparameter $\alpha_2$ of $L_{inter}^{sym}$; (d) hyperparameter $\gamma$ of $L_{rsii}$.
Figure 7. Visualization of image–text attention heat maps between image patches and text tokens. (a–d) illustrate representative image–text pairs demonstrating their semantic correspondences in terms of color, quantity, object, and shape.
Figure 8. Visualization of text-to-image retrieval results.
Figure 9. Visualization of image-to-text retrieval results.
Table 1. Comparison of various methods on RSITMD and RSICD datasets.

| Dataset | Method | Image Backbone | Text Backbone | I2T R@1 | I2T R@5 | I2T R@10 | T2I R@1 | T2I R@5 | T2I R@10 | Mean Recall |
|---|---|---|---|---|---|---|---|---|---|---|
| RSITMD | AMFMN | ResNet | GRU | 11.06 | 29.20 | 38.72 | 9.96 | 34.03 | 52.96 | 29.32 |
| | GaLR | PP-YOLO + ResNet | GRU | 14.82 | 31.64 | 42.48 | 11.15 | 36.68 | 51.68 | 31.41 |
| | SWAN | ResNet | GRU | 13.35 | 32.15 | 46.90 | 11.24 | 40.40 | 60.60 | 34.11 |
| | KAMCL | ResNet | GRU | 16.51 | 36.28 | 49.12 | 13.50 | 42.15 | 59.32 | 36.14 |
| | VGSGN | ResNet | GRU | 14.16 | 34.96 | 50.66 | 13.23 | 42.57 | 63.41 | 36.50 |
| | DOVE | ResNet | GRU | 16.81 | 36.80 | 49.93 | 12.20 | 44.13 | 66.50 | 37.73 |
| | IEFT | Transformer | Transformer | 15.49 | 37.61 | 51.40 | 11.19 | 38.09 | 58.84 | 35.43 |
| | PIR | Swin-Transformer | BERT | 18.14 | 41.15 | 52.88 | 12.17 | 41.68 | 63.41 | 38.24 |
| | MTGFE | ViT-B-16 | BERT | 17.92 | 40.93 | 53.32 | 16.59 | 48.50 | 67.43 | 40.78 |
| | PE-RSITR | ViT-B-32 | Transformer | 23.67 | 44.07 | 60.36 | 20.10 | 50.63 | 67.97 | 44.47 |
| | CLIP-ZS | ViT-B-16 | Transformer | 9.29 | 25.00 | 33.41 | 7.74 | 29.03 | 45.66 | 25.02 |
| | CLIP-FT | ViT-B-16 | Transformer | 27.88 | 50.66 | 62.83 | 23.72 | 54.42 | 71.06 | 48.43 |
| | AIR | ViT-B-16 | Transformer | 29.20 | 49.78 | **65.27** | **26.06** | 57.04 | 73.98 | 50.22 |
| | GLISA | ViT * | Transformer | **32.08** | 51.99 | 63.94 | 23.36 | **58.27** | 74.47 | 50.69 |
| | DCCA (Ours) | ViT-B-16 | Transformer | 31.64 | **53.32** | 64.60 | 24.47 | 57.61 | **76.33** | **51.33** |
| RSICD | AMFMN | ResNet | GRU | 5.39 | 15.08 | 23.40 | 4.90 | 18.28 | 31.44 | 16.42 |
| | GaLR | PP-YOLO + ResNet | GRU | 6.59 | 19.85 | 31.04 | 4.69 | 19.48 | 32.13 | 18.96 |
| | SWAN | ResNet | GRU | 7.41 | 20.13 | 30.86 | 5.56 | 22.26 | 37.41 | 20.61 |
| | KAMCL | ResNet | GRU | 12.08 | 27.26 | 38.70 | 8.65 | 27.43 | 42.51 | 26.10 |
| | VGSGN | ResNet | GRU | 8.33 | 21.87 | 32.57 | 6.53 | 23.13 | 36.85 | 21.55 |
| | DOVE | ResNet | GRU | 8.66 | 22.35 | 34.95 | 6.04 | 23.95 | 40.35 | 22.72 |
| | IEFT | Transformer | Transformer | 8.78 | 28.47 | 43.88 | 8.38 | 28.17 | 44.16 | 26.97 |
| | PIR | Swin-Transformer | BERT | 9.88 | 27.26 | 39.16 | 6.97 | 24.56 | 38.92 | 24.46 |
| | MTGFE | ViT-B-16 | BERT | 15.28 | 37.05 | 51.60 | 8.67 | 27.56 | 43.92 | 30.68 |
| | PE-RSITR | ViT-B-32 | Transformer | 14.13 | 31.51 | 44.78 | 11.63 | 33.92 | 50.73 | 31.12 |
| | CLIP-ZS | ViT-B-16 | Transformer | 6.86 | 17.29 | 26.44 | 5.45 | 18.19 | 29.30 | 17.26 |
| | CLIP-FT | ViT-B-16 | Transformer | 18.94 | 37.88 | 49.59 | 14.69 | 39.58 | 54.71 | 35.90 |
| | AIR | ViT-B-16 | Transformer | 18.85 | 39.07 | 51.78 | 14.24 | 39.03 | 54.49 | 36.24 |
| | GLISA | ViT * | Transformer | **19.52** | 40.44 | 52.28 | 14.75 | 39.50 | 55.46 | 36.99 |
| | DCCA (Ours) | ViT-B-16 | Transformer | 18.85 | **40.99** | **54.07** | **15.41** | **41.68** | **57.97** | **38.16** |

* indicates that GLISA does not explicitly specify the ViT variant it uses. Top-performing values are emphasized in bold.
Table 2. Comparison of various contrastive learning variants on RSITMD and RSICD datasets.

| Dataset | Method | I2T R@1 | I2T R@5 | I2T R@10 | T2I R@1 | T2I R@5 | T2I R@10 | Mean Recall |
|---|---|---|---|---|---|---|---|---|
| RSITMD | $L_{itc}$ | 27.88 | 50.66 | 62.83 | 23.72 | 54.42 | 71.06 | 48.43 |
| | $L_{gitc}$ | 28.98 | 49.56 | 62.17 | 23.32 | 55.62 | 72.48 | 48.69 |
| | $L_{gnpe1}$ | 31.64 | 51.11 | 62.61 | 24.60 | 56.95 | 73.01 | 49.99 |
| | $L_{gnpe2}$ (used in DCCA) | 28.54 | 52.65 | 64.60 | 24.65 | 57.43 | 73.41 | 50.21 |
| RSICD | $L_{itc}$ | 18.94 | 37.88 | 49.59 | 14.69 | 39.58 | 54.71 | 35.90 |
| | $L_{gitc}$ | 17.75 | 39.98 | 51.78 | 14.38 | 40.93 | 57.73 | 37.09 |
| | $L_{gnpe1}$ | 19.49 | 38.88 | 53.98 | 15.28 | 41.43 | 58.06 | 37.85 |
| | $L_{gnpe2}$ (used in DCCA) | 18.57 | 38.98 | 52.52 | 15.74 | 41.37 | 58.24 | 37.57 |

$L_{itc}$ (Equation (7)) denotes the conventional contrastive learning loss, $L_{gitc}$ (Equation (9)) denotes the global contrastive learning loss, and $L_{gnpe1}$ (Equation (10)) and $L_{gnpe2}$ (Equation (11)) denote the global contrastive learning loss with negative pair expansion.
Table 3. The results of ablation studies on RSITMD and RSICD datasets.

| Dataset | I2T R@1 | I2T R@5 | I2T R@10 | T2I R@1 | T2I R@5 | T2I R@10 | Mean Recall |
|---|---|---|---|---|---|---|---|
| RSITMD | 27.88 | 50.66 | 62.83 | 23.72 | 54.42 | 71.06 | 48.43 |
| | 28.54 | 52.65 | 64.60 | 24.65 | 57.43 | 73.41 | 50.21 |
| | 29.42 | 52.88 | 65.27 | 24.12 | 55.00 | 71.73 | 49.73 |
| | 28.98 | 51.11 | 64.60 | 25.58 | 57.79 | 75.04 | 50.52 |
| | 29.20 | 52.21 | 65.27 | 24.25 | 57.57 | 75.04 | 50.59 |
| | 28.54 | 53.54 | 66.15 | 25.40 | 57.21 | 75.18 | 51.00 |
| | 28.98 | 53.32 | 65.04 | 24.60 | 56.55 | 74.96 | 50.58 |
| | 29.65 | 51.99 | 66.37 | 24.65 | 57.92 | 76.19 | 51.13 |
| | 30.75 | 51.11 | 63.50 | 24.96 | 57.21 | 76.24 | 50.63 |
| | 31.64 | 53.32 | 64.60 | 24.47 | 57.61 | 76.33 | 51.33 |
| RSICD | 18.94 | 37.88 | 49.59 | 14.69 | 39.58 | 54.71 | 35.90 |
| | 18.57 | 38.98 | 52.52 | 15.74 | 41.37 | 58.24 | 37.57 |
| | 19.58 | 40.16 | 52.79 | 14.38 | 40.24 | 55.92 | 37.18 |
| | 18.66 | 38.88 | 52.97 | 15.30 | 41.85 | 58.46 | 37.69 |
| | 19.49 | 40.81 | 53.43 | 15.06 | 41.10 | 56.91 | 37.80 |
| | 20.31 | 40.71 | 53.06 | 15.39 | 40.48 | 57.31 | 37.88 |
| | 19.95 | 39.80 | 53.71 | 15.17 | 41.28 | 57.51 | 37.90 |
| | 19.30 | 40.16 | 54.16 | 15.85 | 41.52 | 57.77 | 38.13 |
| | 19.03 | 39.43 | 53.89 | 15.90 | 40.99 | 58.24 | 37.91 |
| | 18.85 | 40.99 | 54.07 | 15.41 | 41.68 | 57.97 | 38.16 |

$L_{gnpe}$ (Equation (11)) denotes the global contrastive learning loss with negative pair expansion, $L_{intra}^{I2T}$ (Equation (12)) denotes the text-guided intra-modal distribution matching loss in the image-to-text direction, $L_{intra}^{T2I}$ (Equation (13)) denotes the image-guided intra-modal distribution matching loss in the text-to-image direction, $L_{inter}^{sym}$ (Equation (15)) denotes the bidirectional inter-modal distribution matching loss, and $L_{rsii}$ (Equation (17)) denotes the remote sensing information injection loss.
Table 4. Comparison with large-scale pretrained remote sensing models on RSITMD and RSICD.

| Method | Trained On | Tested On | I2T R@1 | I2T R@5 | I2T R@10 | T2I R@1 | T2I R@5 | T2I R@10 | Mean Recall |
|---|---|---|---|---|---|---|---|---|---|
| RemoteCLIP-B-32 | RET-3 + DET-10 + SEG-4 | RSITMD | 27.88 | 50.66 | 65.71 | 22.17 | 56.46 | 73.41 | 49.38 |
| RemoteCLIP-L-14 | RET-3 + DET-10 + SEG-4 | RSITMD | 28.76 | 52.43 | 63.94 | 23.76 | 59.51 | 74.73 | 50.52 |
| GeoRSCLIP | RS5M + RSITMD | RSITMD | 30.09 | 51.55 | 63.27 | 23.54 | 57.52 | 74.60 | 50.10 |
| GeoRSCLIP | RS5M + RET-2 | RSITMD | 32.30 | 53.32 | 67.92 | 25.04 | 57.88 | 74.38 | 51.81 |
| DCCA | RSITMD | RSITMD | 31.64 | 53.32 | 64.60 | 24.47 | 57.61 | 76.33 | 51.33 |
| RemoteCLIP-B-32 | RET-3 + DET-10 + SEG-4 | RSICD | 17.02 | 37.97 | 51.51 | 13.71 | 37.11 | 54.25 | 35.26 |
| RemoteCLIP-L-14 | RET-3 + DET-10 + SEG-4 | RSICD | 18.39 | 37.42 | 51.05 | 14.73 | 39.93 | 56.58 | 36.35 |
| GeoRSCLIP | RS5M + RSICD | RSICD | 22.14 | 40.53 | 51.78 | 15.26 | 40.46 | 57.79 | 38.00 |
| GeoRSCLIP | RS5M + RET-2 | RSICD | 21.13 | 41.72 | 55.63 | 15.59 | 41.19 | 57.99 | 38.87 |
| DCCA | RSICD | RSICD | 18.85 | 40.99 | 54.07 | 15.41 | 41.68 | 57.97 | 38.16 |
