CSSA: A Cross-Modal Spatial–Semantic Alignment Framework for Remote Sensing Image Captioning

Han, Xiao; Wu, Zhaoji; Li, Yunpeng; Zhang, Xiangrong; Wang, Guanchun; Hou, Biao

doi:10.3390/rs18030522

Open AccessArticle

CSSA: A Cross-Modal Spatial–Semantic Alignment Framework for Remote Sensing Image Captioning

by

Xiao Han

¹,

Zhaoji Wu

¹,

Yunpeng Li

²

,

Xiangrong Zhang

^1,*

,

Guanchun Wang

¹

and

Biao Hou

¹

Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, School of Artificial Intelligence, Xidian University, Xi’an 710071, China

²

School of Electronics Information Engineering, Wuxi University, Wuxi 214105, China

^*

Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(3), 522; https://doi.org/10.3390/rs18030522

Submission received: 18 December 2025 / Revised: 26 January 2026 / Accepted: 2 February 2026 / Published: 5 February 2026

Download

Browse Figures

Versions Notes

Highlights

What are the main findings?

A novel remote sensing image captioning framework based on cross-modal spatial–semantic alignment (CSSA) is designed, which utilizes a multi-branch cross-modal contrastive learning (MCCL) mechanism to effectively narrow the representation gap between image and text.
A dynamic geometry Transformer (DG-former) is designed to utilize spatial geometry information in remote sensing image scenes with scattered objects, realizing spatial alignment.

What are the implications of the main findings?

Compared to discrete text, an image is fine-grained and contains more noise, making it more challenging to perceive the semantic alignment. CSSA significantly improves the semantic fidelity and spatial coherence of generated captions by explicitly modeling the alignment between visual regions and textual phrases, which is particularly beneficial for complex and heterogeneous remote sensing scenes.
By incorporating geometric priors through the DG-former, the model achieves superior generalization in scenes with sparse, irregularly distributed objects, which is common in real-world remote sensing applications, thereby advancing the integration of spatial reasoning into image captioning.

Abstract

Remote sensing image captioning (RSIC) aims to generate natural language descriptions for the given remote sensing image, which requires a comprehensive and in-depth understanding of image content and summarizes it with sentences. Most RSIC methods have successful vision feature extraction, but the representation of spatial features or fusion features fails to fully consider cross-modal differences between remote sensing images and texts, resulting in unsatisfactory performance. Thus, we propose a novel cross-modal spatial–semantic alignment (CSSA) framework for an RSIC task, which consists of a multi-branch cross-modal contrastive learning (MCCL) mechanism and a dynamic geometry Transformer (DG-former) module. Specifically, compared to discrete text, remote sensing images present a noisy property, interfering with the extraction of valid vision features. Therefore, we present an MCCL mechanism to learn consistent representation between image and text, achieving cross-modal semantic alignment. In addition, most objects are scattered in remote sensing images and exhibit a sparsity property due to the overhead view. However, the Transformer structure mines the objects’ relationships without considering the geometry information of the objects, leading to suboptimal capture of the spatial structure. To address this, a DG-former is designed to realize spatial alignment by introducing geometry information. We conduct experiments on three publicly available datasets (Sydney-Captions, UCM-Captions and RSICD), and the superior results demonstrate its effectiveness.

Keywords:

remote sensing image captioning; multi-branch cross-modal contrastive learning; dynamic geometry transformer

1. Introduction

With the evolution of remote sensing, high-resolution remote sensing images (RSIs) are becoming increasingly available. As a result, various fields, including scene classification [1], object detection [2], and image segmentation [3], have benefited from remote sensing imagery. Although these computer vision tasks can classify and identify objects, they cannot fully understand an RSI’s content and translate it into an easy-to-understand modality (i.e., natural language sentences). To comprehensively express the semantic content of RSIs, remote sensing image captioning (RSIC) tasks are widely studied, which aim to generate accurate, natural, novel and flexible descriptive statements with rich vocabulary to comprehensively summarize RSIs’ content at the semantic level.

Generally, RSIC methods can be mainly divided into three categories: template-based, retrieval-based and deep learning-based approaches. The template-based approach firstly learns the objects and attributes in the RSI and then fills them into the template sentences. The retrieval-based approach searches for the most similar candidates for a given RSI and uses one sentence of the candidates as the generated prediction. In contrast to previous RSIC methods, the deep learning-based approach can produce novel and rich descriptive statements, thereby becoming popular in RSIC tasks. However, since RSIs are acquired from a top view, they exhibit special spatial and semantic characteristics. For example, objects have structural properties and appear with irregular shapes or special proportions. In contrast, the convolution kernel of a Convolutional Neural Network (CNN) is a square with a regular shape and fixed scale, which makes it challenging to capture unique spatial structures. Furthermore, objects in RSIs are scattered, which makes it difficult to thoroughly perceive the matching relationships between image and text.

With respect to spatial characteristics, Zhao et al. [4] used RSI segmentation maps based on semi-supervised semantic segmentation as a prior to guide focus towards regions, called structured attention. Li et al. [5] mimicked the mechanism of the human iris to capture spatial object information. Zhao et al. [6] constructed a new RSIC dataset, in which each RSI had a bounding box for objects that appeared. This promoted more accurate extraction of object features with Faster R-CNN, but small-scale objects still could not be taken into account. Further, Li et al. [7] adopted a Transformer encoder for extracting patch-level salient region information for scattered and irregular remote sensing scenes. When processing RSIs with scattered objects (e.g., random distribution of roads, rivers), these RSIC methods tend to ignore the adjacency and distance relationships between objects, failing to capture the true spatial layout of remote sensing scenes.

Considering semantic characteristics, Zhang et al. [8] tackled the challenge of dealing with numerous objects and complex relationships by introducing label attention [9], which employs the predicted results of a CNN to focus on relevant image regions at each time step. Li et al. [10] proposed a recurrent attention mechanism to address RSIs’ spatial and semantic characteristics. This mechanism updates the hidden state of attention and enables it to focus on different types of information within the images. In addition, the Transformer [11] model emerged as a powerful alternative to CNNs for various computer vision tasks, thanks to its ability to model long-range dependencies and capture global spatial relationships in an efficient and scalable manner. Zhuang et al. [12] modeled the semantic relationships between objects at different spatial locations by using the Transformer’s global attention, which ignored the geometric spatial relationships between objects. By extracting more knowledge with similar semantic distributions, two-stage RSIC methods are enabled to provide fine-grained semantic information for the input RSI. For example, Wang et al. [13] proposed a Word–Sentence approach, which combines the word extraction stage with a “word-to-sentence” stage to enhance the overall performance of sentence generation. Ye et al. [14] also used a two-stage approach, where a multi-label classification task was involved to guide predictions during the captioning stage. However, such methods tend to misalign local regions with irrelevant semantic representation, leading to poor cross-modal consistency. Furthermore, Cheng et al. [15] adopted contrastive learning for image–text features, which is the key reason for optimal performance in RSIC tasks.

Although the above methods proposed corresponding solutions for the spatial and semantic properties of RSIs, they still have the following problems.

How to perceive the semantic relationship between image and text? Compared to discrete text, RSIs are fine-grained and contain more noise, making it more challenging to perceive the semantic alignment. However, RSIs and texts should have similar semantics, which are closely related in the semantic space, while image–text pairs with different meanings should be pushed far away. Hence, we can use text representation to provide direct guidance for extracting vision representation, achieving semantic consistency between different modalities.
How to introduce geometry information to achieve spatial alignment? The global self-attention mechanism utilized in the Transformer structure enables the modeling of relationships between any positions within an RSI, which aligns with the properties of RSIs that contain rich objects, scattered distribution and complex relationships. However, the one-dimensional positional encoding used by Transformers is unsuitable for two-dimensional vision feature maps, leading to suboptimal capture of the spatial structure.

Considering the above problems, we design a cross-modal spatial–semantic alignment framework for an RSIC task, as shown in Figure 1. Specifically, we propose a multi-branch cross-modal contrastive learning mechanism that utilizes text representation to provide direct guidance for extracting image representation. To address the problem of lacking geometry information, a dynamic geometry Transformer is designed to consider the property of object scattering in RSIs by dynamically introducing the spatial geometry information.

Our main contributions are as follows.

Considering sparsity and noisy properties in RSIs, we propose a cross-modal spatial–semantic alignment (CSSA) framework for an RSIC task to learn consistent representation between image and text.
To realize semantic alignment, we propose a multi-branch cross-modal contrastive learning (MCCL) mechanism, which narrows the modality gap between image and text in the representation space.
A dynamic geometry Transformer (DG-former) is proposed to utilize spatial geometry information in RSI scenes with scattered objects, realizing spatial alignment.
To demonstrate the effectiveness of our method, we conduct extensive experiments on three remote sensing image captioning datasets, achieving state-of-the-art performance.

The rest of this article is organized as follows. Section 2 introduces the related work, which contains natural image captioning and remote sensing image captioning. Section 3 introduces the proposed framework in detail. Section 4 introduces the dataset and evaluation metrics. Then, we present and analyze the experimental results. Section 5 focuses on ablation experiments and parameter analyses. Finally, the conclusion is drawn in Section 6.

2. Related Work

With the introduction of large-scale annotated RSIC datasets [16,17], more related works have emerged in the RSIC field. The widely adopted RSIC algorithm is presented with RSI encoding and text decoding.

In terms of the encoding stage, the primary objective expects to discern effective object-level features and high-level semantic correlations, which has been the focus of numerous efforts by most researchers. From a semantic understanding perspective, Shi et al. [18] constructed an FCN-based model that extracts feature information in three aspects: key instances, environment semantics and panorama. In [19,20], sentences were generated by extracting and fusing features at different scales or levels, including instance-level, object-level, and scene-level features. Zhang et al. [21] proposed attribute attention approaches: FC-attention and SM-attention. The former utilizes attribute features from the fully connected layer while the latter combines softmax features. To enhance the accuracy of RSIC results, the label attention mechanism [8] was explored for class-specific feature embeddings to address unclear categories across different RSI scenes. The improvements indicate that higher-level semantic information such as attributes and classification results is used as a priori information to guide the attention mechanism, which can filter weakly relevant objects in the sentence generation stage [22]. Li et al. [10,23] turned the focus from label embedding to generated words for obtaining semantic information. Three levels of attention [23] were proposed for focusing on different regions of images, generated words in sentences and interactions between visual and textual information. Lu et al. [24] proposed sound active attention, which processes sound signals as semantic information and guides the focus regions of the image. Although these methods outperform traditional RSIC approaches, which ignore semantic feature extraction, they still lack object-level semantic distribution for discrete and sparse objects in RSI scenes, leading to insufficient capture of fine-grained semantic details.

To capture the object-rich characteristics of RSIs, the structural property of RSI objects is employed to enhance object-level feature extraction [4,7] by incorporating additional tasks or encoding positional information. For instance, Cui et al. [25] used a segmentation map as a priori information to guide the attention mechanism to select object regions more finely. Wang et al. [26] improved performance by training the model to predict critical semantic words in sentences and fusing the global or region features with semantic word features to address the cross-modal mismatch between image and text. Ren et al. [27] considered that the image feature maps obtained by a CNN are local features and lack global information. They proposed a pure Transformer-based architecture that leverages the global modeling properties of the Transformer as semantic information and a mask-guided mechanism for object representation. Wu et al. [28] thought that swin-Transformer can better estimate spatial correlations between distant and adjacent objects, and then integrated swin-Transformer as an encoder for multi-scale visual features to extract hybrid geometric features. Later, Li et al. [7] leveraged patch-level feature learning in the Transformer structure to capture salient object-level information, and the class token from the Transformer was converted into label embedding. Cheng et al. [29] captured structural information, entity embeddings, and relationship embeddings in the encoding stage, enhancing the decoder with sufficient capture of fine-grained semantic and structural details. These methods balance local or global geometric information and explore fusion between semantic and geometric features, making an effort to sufficiently capture the correlation between object semantics and spatial positions. However, they cannot dynamically adjust geometric information according to the distribution characteristics of objects in different RSI scenes, resulting in limited adaptability for RSIC tasks.

With respect to text decoding, refs. [30,31] enhanced the completeness of generated sentences by fusing multiple label sentences into a single sentence and eliminating redundancy through machine learning and deep learning methods, respectively. In [13,32], words associated with the input RSI were predicted, and the predictions were involved in sentence generation. Fu et al. [33] proposed a persistent memory mechanism to prevent the LSTM from overwriting useful vision information, retaining information deleted by the LSTM, which is helpful for RSIC tasks. In addition, Shen et al. [11] and Li et al. [34] increased the training samples or improved the loss function to mitigate overfitting, respectively. Hoxha et al. [35] addressed the problem of CNN–LSTM-based networks with significant training data demand and replaced LSTM with SVM, which works better with a small number of training samples. Text decoding is a complex process that requires the utilization and integration of cross-modal features. Its performance will not be ideal due to limitations in cross-domain alignment.

Existing RSIC works have made significant progress in semantic, geometric feature extraction and text decoding. To tackle the mentioned limitations, we propose a novel cross-modal spatial–semantic alignment framework, which integrates MCCL and a dynamic geometry-aware Transformer. Our work expects to capture the correlation between object semantics and geometric features, which enhances the quality of cross-modal feature extraction and alignment.

3. Proposed Method

This paper proposes a CSSA network for an RSIC task, as shown in Figure 2. It contains two training stages: the MCCL and RSIC stages. Considering that many important objects in RSIs are scattered in complex backgrounds, we utilize a CNN with the proposed DG-former as the encoder, which combines the strengths of local feature extraction from CNNs and global feature modeling from transformers. The text decoder in MCCL adopts a GPT-like [36] one-way transformer decoder. The text decoder in RSIC uses the pure transformer decoder.

3.1. Dynamic Geometry Transformer

To address the problem that a pure Transformer does not take full advantage of geometry information, we propose the DG-former, which can improve feature extraction by using the spatial geometry information of RSIs.

Taking an RSI

I \in R^{W \times H \times 3}

as an input, the image encoder first uses a pre-trained ResNet-101 [37] to extract the feature map

x

, which is then flattened and dimensionally reduced by a fully connected layer. The obtained features

f \in R^{W H \times d}

are used as inputs to DG-former, as shown in Figure 2. We propose a novel attention mechanism called Multi-head dynamic geometry enhancement Attention (MGA) in DG-former for exploring the spatial geometry information of RSIs:

\begin{matrix} MGA (x) & = Concat (y_{1}, y_{2}, \dots, y_{h}) W \\ y_{i} & = Softamx (\frac{φ_{i} {(x)}^{T} ψ_{i} (x)}{\sqrt{d}} + Ω_{i}^{l}) δ_{i} (x) \\ Ω^{l} & = DGP (x, l), \end{matrix}

(1)

where

y_{i}

is the result of the

i_{th}

MGA head and h is the number of attention heads.

φ_{i} (x)

,

ψ_{i} (x)

and

δ_{i} (x)

are projected with different linear layers, which are called query, key and value, respectively. The “+” denotes addition operation for fusing

Ω_{i}^{l}

with

\frac{φ_{i} {(x)}^{T} ψ_{i} (x)}{\sqrt{d}}

.

Ω^{l}

is the dynamic geometry positional encoding information of the

l_{th}

layer. DGP is the dynamic geometry positional encoding module. The remainder of the DG-former is the same as the pure Transformer, with a Positional-wise Feed Forward Network and LayerNorm layer.

3.2. Dynamic Geometry Positional Encoding Module

Existing transformer-based methods for RSIC tasks have not explored the impact of the missing geometry structure information. So, we propose a dynamic geometry positional encoding (DGP) module, as shown in Figure 3. For two elements

i, j

at any position in the image feature

f

, we perform relative positional encoding by a two-dimensional vector

r_{i j}

,

r_{i j} = (log (\frac{| x_{i} - x_{j} |}{w_{i}}), log (\frac{| y_{i} - y_{j} |}{h_{i}})),

(2)

where

(x_{i}, y_{i}), w_{i}, h_{i}

are the centroid coordinates, width and height of element i, respectively. To learn a richer representation, we map

r_{i j}

to a higher dimension and then to a lower dimension.

g_{i j} = ReLU (f c (r_{i j})),

(3)

where

f c

is a high-dimensional linear mapping layer and ReLU is a nonlinear activation function.

The designed encoder’s lower layer focuses on the local information interaction. It also needs to consider geometry information to assist local information interaction. However, as the number of layers rises, the encoder increasingly tends to interact with global information. Many objects in remote sensing scenes (e.g., aircraft, airports, residential areas, ports, buildings, playgrounds, etc.) are distributed across all image regions. The characteristic of RSIs requires the encoder to perform adequate global semantic interaction at a higher level when the dependence on geometry information should be reduced, as shown in Figure 3. Therefore, we incorporate a dynamic adjustment mechanism to change the strength of geometry information according to the number of layers. We design three specific dynamic adjustment forms.

3.2.1. Exponential Decay

d_{\exp} (l) = exp (- l \cdot s)

(4)

where s is the learnable parameter and

l = 0, 1, . . .

is the number of layers.

3.2.2. Logarithmic Decay

d_{\log} (l) = 1 - \frac{ln (l + 1)}{ln (l_{\max} + 1)}

(5)

where

l = 0, 1, . . .

is the number of layers and

l_{\max}

is the maximum number of layers.

3.2.3. Cosine Decay

d_{\cos} (l) = cos (\frac{l \cdot π}{l_{\max} + 1})

(6)

where

l = 0, 1, . . .

is the number of layers and

l_{\max}

is the maximum number of layers.

Taking exponential decay as an example, after obtaining the relative geometry feature

g_{i j}

based on the image feature

f

, the dynamic geometry positional encoding information

Ω_{l}

for the

l_{th}

layer is

\begin{matrix} Ω_{l} & = DGP (f, l) \\ = G \cdot d_{\exp} (l), \end{matrix}

(7)

where the element of

G

is

g_{i j}

.

In Figure 4, we visualize the category attention heatmap and position attention heatmap of DG-former and its ablation models, displaying them in the first and second rows, respectively. Regarding the category attention heatmap, we chose to visualize the category “bridge.” It can be observed that T-former (DG-former w/o DGP) exhibits scattered and less accurate positioning for the bridge in Figure 4a. G-former alleviates the issue of scattered positioning in Figure 4b, and DG-former achieves a more uniform and comprehensive localization of the bridge in Figure 4c.

Concerning the position attention heatmap, we visualize a set of positions with higher attention weights for each image token. It can be seen that T-former in Figure 4d, lacking geometric information, shows a relatively uniform distribution of attention weights. G-former in Figure 4e, lacking dynamic adjustment strategies, focuses only on the surrounding image tokens. DG-former in Figure 4f achieves a banded attention range, striking a better balance between T-former and G-former. This aligns with the challenges of RSIC, progressively enhancing the interactions among objects in RSIs while focusing on the objects themselves.

3.3. Text Decoder

We utilize two different text decoders for MCCL and RSIC, respectively. The difference is that the former processes only text modal input, while the latter considers both image and text modalities to generate sentences.

3.3.1. Text Decoder in MCCL

The text decoder in MCCL adopts a GPT-like one-way transformer decoder, as shown in Figure 2, which models the input text. It consists of three parts: Masked Multi-Head Attention, Positional-wise Feed Forward Network, and Add&Norm.

For the input sentence

S = {s_{t}}_{t = 1}^{T}

, we perform one-hot encoding, word embedding and positional encoding to get sequence tokens

W = {w_{t}}_{t = 1}^{T} \in R^{T \times d}

. For the

t_{th}

moment, we model the relationship of words

w_{< t}

with masked multi-head attention,

\begin{matrix} M_{t} & = MSA (w_{t}, w_{< t}, w_{< t}) \\ N_{t} & = LN (M_{t} + w_{t}) \\ R_{t} & = FFN (N_{t}) \\ S_{t} & = LN (R_{t} + N_{t}) \end{matrix}

(8)

3.3.2. Text Decoder in RSIC

The text decoder in RSIC adopts the pure transformer decoder, as shown in Figure 2, which generates natural language sentences based on the image representation from DG-former.

3.4. Training Strategy and Loss Function

We use the image encoder (CNN+DG-former) to learn RSI features for two stages. In terms of the two text decoders, just like Meta-Cap [38], they are architecturally independent and not connected. However, in terms of parameter initialization, the latter stage is initialized with pre-trained parameters from the previous stage. The same colored modules of the different stages in Figure 2 represent the relationship.

Also, the current combination of modules is only a preliminary option for CSSA. Various alternative image encoder and text decoder models can be used to implement the proposed framework, and they may also achieve a better performance.

3.4.1. MCCL Stage

The multi-branch self-supervised learning stage takes the complete sentence

S

as labels and learns the similarity of image–text pairs by projecting images and sentences into the same semantic space in a contrastive learning manner. Also, we add two additional branches to enhance the robustness of image and text modal features, as shown in Figure 5.

We construct enhanced images and build an image contrastive learning branch based on original and enhanced images:

\begin{matrix} I^{'} & = T_{img} (I) \\ f_{1} & = F_{img} (I) \\ f_{1}^{'} & = F_{img} (I^{'}), \end{matrix}

(9)

where

T_{img}

denotes the image augmentation operation, such as rotation, flipping, scaling, cropping and color jittering.

I^{'}

denotes the image augmentation result.

F_{img}

is the image encoder.

f_{1}

and

f_{1}^{'}

denote the original and enhanced image features in the MCCL stage, respectively. Similarly, the following operation is performed in the text modality:

\begin{matrix} S^{'} & = T_{txt} (S) \\ T_{1} & = H_{txt}^{1} (S) \\ T_{1}^{'} & = H_{txt}^{1} (S^{'}), \end{matrix}

(10)

where

T_{txt}

denotes the text augmentation operation, which consists of synonym replacement and back translation [39,40].

S^{'}

denotes the text augmentation result.

H_{txt}^{1}

is the text decoder in the MCCL stage.

T_{1}

and

T_{1}^{'}

denote the original and enhanced text features in the MCCL stage, respectively.

Based on the enhanced image and text features, the loss of MCCL stage is calculated as follows

L_{mccl} = L_{cl}^{i 2 s} (f_{1}^{'}, T_{1}^{'}) + γ_{i} L_{cl}^{img} (f_{1}, f_{1}^{'}) + γ_{s} L_{cl}^{txt} (T_{1}, T_{1}^{'}),

(11)

where

L_{cl}

use the original

info_nce

function in [41,42], while

γ_{i}

is set to 0.4 and

γ_{s}

is 0.1 in our experiments.

\begin{matrix} L_{cl} (x, y) = - \frac{1}{N} \sum_{i = 1}^{N} λ \cdot log (\frac{exp (x_{i}^{T} y_{i})}{\sum_{j = 1}^{N} exp (x_{i}^{T} y_{j})}) \\ + (1 - λ) \cdot log (\frac{exp (x_{i}^{T} y_{i})}{\sum_{j = 1}^{N} exp (x_{j}^{T} y_{i})}) . \end{matrix}

(12)

The sequence-level supervised signal in the MCCL stage prompts the image encoder to perceive the matching relationships between images and texts, laying the foundation for the RSIC stage.

3.4.2. RSIC Stage

The remote sensing image captioning stage generates sentences in an autoregressive manner based on the input image features,

\begin{matrix} f_{2} & = F_{img} (I) \\ T_{2} & = H_{txt}^{2} (f_{2}, S), \end{matrix}

(13)

where

f_{2}

and

T_{2}

are image features and text features, respectively.

H_{txt}^{2}

is the text decoder in the RSIC stage.

In the training stage, we optimize the model by computing and accumulating the cross-entropy loss word by word to reduce the difference between generated and annotated sentences. The loss function

L_{c a p}

is expressed as

\begin{matrix} L_{c a p} = - log p (S |I; θ) \\ = - \sum_{t = 0}^{N - 1} log (p_{θ} (s_{t} |I, s_{0 : t - 1})) . \end{matrix}

(14)

where

θ

denotes the parameters of our network, N represents the length of words in a sentence, and

p_{θ} (\cdot)

is the score function of the predicted sequence

s_{0 : t - 1}

in t time.

4. Experiments

4.1. Datasets

To demonstrate the effectiveness of our method, it is validated on three publicly available datasets, including Sydney-Captions, UCM-Captions and RSICD.

Sydney-Captions: The Sydney-Captions dataset is based on the Sydney dataset, with five descriptive sentences annotated per RSI. There are 613 images in the dataset involving seven types of ground objects, including residential, airport, meadow, rivers, ocean, industrial and runway. The size of the images is 500 × 500.
UCM-Captions: The UCM-Captions dataset is based on the UC-Merced land use dataset with five descriptive sentences per image. The dataset contains 21 categories of ground objects, each with 100 images. The size of each RSI is 256 × 256.
RSICD: The RSICD is the largest RSIC dataset. These RSIs are from Google Earth, Baidu Map, MapABC and Tianditu, and are fixed to 224 × 224 pixels with various resolutions. The total number of RSIs is 10,921, with five sentence descriptions per image.

4.2. Evaluation Metrics

Eight metrics are used in the experiments, including BLEU-n [43] (n = 1, 2, 3, 4), METEOR [44], ROUGE [45], CIDEr [46], and SPICE [47]. BLEU-n evaluates the accuracy of a generated caption by comparing the n-gram consistency between generated and annotated sentences, while METEOR is the summed average of the precision and recall of the unigram. ROUGE denotes the recall corresponding to the longest common subsequence of generated and annotated sentences. Compared with other metrics, CIDEr and SPICE are metrics designed specifically for the image captioning task, focusing more on the calculation of high-level semantic consistency.

4.3. Experimental Settings

We perform the experiments on NVIDIA GTX 1080Ti with PyTorch version 1.8.0. We use ResNet-101 pre-trained on the ImageNet dataset as the backbone. The layer of the encoder and decoder is set to 4. The dimensions in all Transformers are set to 512. We use Noam to optimize our model, and the warm_up step is 20,000. In addition, the beam search is adopted for sentence generation, and beam_size is set to 3. All models are trained for a total of 40 epochs. The first 10 epochs are the MCCL stage, while the following 30 epochs are the RSIC stage.

4.4. Comparison with Other Methods

In this section, we demonstrate the effectiveness of the proposed model by comparing it with several state-of-the-art RSIC models.

SAT [48] utilizes a CNN to extract image features. It employs a spatial attention mechanism to select relevant image regions, which are subsequently translated into natural language sentences using an LSTM model.
FC(SM)-ATT + LSTM [21] incorporates high-level attribute information from a CNN to guide the attention layer, which can choose the most informative context vectors.
SAT(LAM) [8] uses RSI explicit classification predictions to guide the attention layer rather than high-level attribute information, as in [21].
Structured-ATT [4] integrates an RSI object segmentation map into a structured attention mechanism, enabling selective attention to essential object contour.
GVFGA + LSGA [22] filters out redundant feature components and irrelevant information in the attention mechanism by exploiting GVFGA and LSGA mechanisms.
RASG [10] combines stacked LSTMs with a recurrent attention mechanism, which leverages the recurrence property of LSTM to capture the relevance between image and text.
Word–Sentence [13] is a two-stage method, which divides the image captioning task into word extraction and sentence generation.
JTTS [14] is also a two-stage approach, where the initial stage involves a multi-label classification task, followed by the fusion of predictions during the captioning stage.
VRTMM [11] employs a variational autoencoder to generate semantic vision features and utilizes a transformer to generate sentences.
The CNN + Transformer [12] framework uses a CNN as an encoder to extract image features and uses a Transformer as an decoder to generate sentences.
TrTr-CMR [28] integrates swin-Transformer as an encoder for multi-scale visual features, while a Transformer decoder generates a well-formed sentence for an RSI.
KE [29] effectively captures the intrinsic semantic information of remote sensing categories for entity embeddings and relationship embeddings. In the decoder stage, combining the visual features of RSIs with structural information on embedded knowledge improves the detail expressiveness of the generated descriptions.

4.4.1. Quantitative Comparison

The results of the different methods on the three datasets are shown in Table 1, Table 2 and Table 3. The experiments follow a division based on public datasets, and the best scores are highlighted in bold. The most important metric is CIDEr, which evaluates the semantic similarity between the generated and annotated sentences.

We conduct an analysis of the experimental results considering three categories. The first category only uses attention-based single-stage RSIC methods, including SAT [48], FC(SM)-ATT + LSTM [21], SAT(LAM) [8], GVFGA + LSGA [22], and RASG [10]. The second category includes two-stage RSIC methods, namely Word–Sentence [13] and JTTS [14]. The third category consists of Transformer-based RSIC methods, specifically VRTMM [11], CNN + Transformer [12], TrTr-CMR [28] and KE [29].

Table 1 and Table 2 report the comparative results of the previous state-of-the-art methods and our proposed approach on the Sydney-Captions and UCM-captions datasets. Among these contrasting models, our method achieves the highest performance for almost all eight metrics. Especially for the Sydney-Captions dataset, the “RASG” method has an advantage among the attention-based single-stage methods. However, our method achieves a 15.9% improvement in the CIDEr score compared to the “RASG” method. Similar improvements are seen on the UCM-captions dataset, with a 15.9% improvement in the CIDEr score compared to the “RASG” method. This phenomenon indicates that the lack of exploration of semantic knowledge will limit the performance of an RSIC task. Compared with the two-stage methods, our method brings improvements of nearly 8.9% and 3.1% in CIDEr score compared to the “JTTS” method on Sydney-Captions and UCM-captions, respectively. The main reason is that the two-stage models do well in modeling the relationship between the vision and sentence elements in the first learning stage. Compared with the transformer-based methods, our method achieves 10.1% and 5.9% improvements in the CIDEr score compared with the “KE” method on Sydney-Captions and UCM-captions, respectively. Therefore, Table 1 and Table 2 show that our model gains much better results on two public datasets. The main reason is our model combines two core innovative components: MCCL and DG-former. Specifically, MCCL provides semantic guidance to DG-former, prompting it to prioritize spatially relevant objects that match text phrases for generating the description sentence.

Table 3 shows the results of comparative experiments on the RSICD. Among these contrast models, our method achieves the highest performance on the RSICD, scoring first in all eight metrics. This shows that high-level semantic information and RSI-specific geometry information are important to transform an RSI into a sentence. The attention-based single-stage methods focus more on local consistency and alignment for an RSIC task, like the “RASG” method. As shown in Table 3, our method achieves a 5.5% improvement in the CIDEr score compared to “RASG”. The two-stage methods generally adopt semantic feature learning or directly map vision features to text space for image–text interaction. However, there is a 3.9% improvement in the CIDEr score compared to the “JTTS” method. Compared with the Transformer-based methods, our method achieves a 7.3% improvement in the CIDEr score compared to the “KE” method. This indicates that addressing the cross-modal mismatch and spatial structure can contribute to eliminating the semantic gap between vision and language.

4.4.2. Qualitative Comparison

Figure 6 compares the ground truth, CNN + Transformer, and CSSA methods for several RSIC results. We have highlighted significant objects and properties of the RSIs in blue. Any generated words that contain errors are specifically marked in red.

The top row corresponds to the RSICD dataset. In Figure 6a, the beach area is depicted, but the “CNN + Transformer” method overlooks the presence of white waves, which are accurately described in the sentences generated by our model. In Figure 6b, the RSI represents a baseball field, but the “CNN + Transformer” method incorrectly identifies the number of baseball fields as “A”, whereas ours correctly identifies it as “Four”. In Figure 6c, the trees and a residential area are the main elements, which our model can accurately describe. However, the “CNN + Transformer” method mistakenly generates a description mentioning a “railway station”.

The middle row represents visualization examples for the Sydney-Captions dataset. Our method accurately generates the scene “river”, as depicted in Figure 6d. In Figure 6e, the “CNN + Transformer” method erroneously generates the word “houses”, whereas our model successfully avoids this error. Furthermore, in Figure 6f, our model could provide a more comprehensive and detailed description of the remote sensing image compared to the “CNN + Transformer”.

The bottom row pertains to the UCM-Captions dataset. Similar to Figure 6b, our model excels in accurately capturing the number of remotely sensed objects in Figure 6g. Conversely, for both Figure 6h,i, “CNN + Transformer” fails to comprehend the scenes correctly, resulting in the generation of non-existent objects. These examples collectively highlight the superior performance of our method in RSIC tasks.

5. Discussion

5.1. Ablation Study

In the preceding section, we demonstrated the considerable advancement achieved by our proposed model in the domain of RSIC tasks. However, every specific aspect of our model requires further elucidation. To address this, we conducted ablation experiments on three datasets to assess the effectiveness of each component.

DGP: This component dynamically augments geometry information based on the number of layers.
MCCL: This component facilitates the acquisition of robust image representations, thereby reducing the semantic gap between image and text.

The standard Transformer serves as the baseline model in our experiments. To ensure fairness for ablation comparisons, the number of attention heads, hidden dimensions and decoder layers of the Transformer in the baseline model are same in our network. Initially, the image feature is extracted using a pre-trained ResNet-101 model and then applied to the baseline and our network. Subsequently, the sentence is generated using the standard transformer, utilizing the extracted image features. In contrast, CSSA incorporates two additional components, namely the DGP and MCCL, in addition to the baseline model. These components enhance the captioning process by dynamically supplementing geometry information and narrowing the semantic gap between image and text.

In addition, the effectiveness of the image augmentation branch and text augmentation branch driven by our MCCL is also worth analyzing. As shown in Figure 7, we denote the baseline as B1, and B4 denotes the full model. The CSSA incorporating the text augmentation branch is namely B2, while B3 represents our network incorporating the image augmentation branch.

5.1.1. Quantitative Comparison

Table 4 reports the ablation experiment results. Effect of DGP: We can see improvements of 8.7% (2.6533 → 2.8862), 2.0% (3.5912 → 3.6639), and 1.4% (2.7963 → 2.8367) in the three datasets, respectively. Therefore, it is essential to supply geometry information during feature extraction.

Effect of MCCL: We can see improvements of 8.4% (2.6533 → 2.8773), 2.4% (3.5912 → 3.6781), and 1.5% (2.7963 → 2.8390) in the three datasets, respectively. This suggests that it is necessary to align image and text features for remote sensing image captioning.

Effect of the combination of DGP and MCCL: We can see improvements of 15.0% (2.6533 → 3.0501), 6.5% (3.5912 → 3.8248), and 3.9% (2.7963 → 2.9056) in the three datasets, respectively. In addition, CSSA achieves an optimal performance, demonstrating that the two components can improve the performance of the baseline model from different perspectives.

Figure 7 shows the ablation results of the image augmentation branch and text augmentation branch driven by the MCCL module. On UCM-Captions, Sydney-Captions and RSICD, different ablation variants are plotted with the CIDEr score. The CIDEr score shows an upward trend from B1 to B4, while B0 has the lowest performance on the three datasets. The growth trend between B2 and B3 suggests that image augmentation plays a critical role across the three datasets, which have a more prominent impact on Sydney-Captions. This is because B3 is related to the designed DG-former module. This indicates that the image/text augmentation branch contributes to visual/semantic feature enhancement.

5.1.2. Qualitative Comparison

As previously mentioned, the ablation modules positively impact RSI understanding. To further highlight their discriminability, we present qualitative results on three datasets, as illustrated in Figure 8.

Effect of DGP: The baseline model encounters difficulties in accurately modeling the number of remote sensing objects while generating descriptions. However, incorporating the DGP module improves the computation of remote sensing objects by considering spatial alignment, as shown in Figure 8a. Moreover, it can relieve missing objects and wrong property problems in the baseline model, as shown in Figure 8b,d.

Effect of MCCL: MCCL helps the model to understand the image more comprehensively, capturing attributes, “rectangle”, and objects, “waves”, as well as ignoring less important objects, “plants”, as shown in Figure 8b,f. In addition, it also demonstrates the ability to accurately model the number of objects in Figure 8a,e, similar to DGP.

Finally, the proposed CSSA combines the advantages of DGP and MCCL, achieving a comprehensive and precise understanding of remote sensing images.

5.2. Parameter Sensitivity Analysis

5.2.1. DGP Parameter

Due to many objects and their dispersed distribution in RSIs, the modeling of bottom-level local features relies more on geometry information. On the other hand, the modeling of top-level semantic features emphasizes the interaction among different objects and relies less on geometry information.

The proposed DGP module aims to balance the input of geometry information across different layers. It achieves this by incorporating exponential decay, where the initial parameters influence the decay rate. A larger initial parameter results in a faster decay of geometry information, while an initial parameter of 0 indicates no dynamic adjustment. Through experiments conducted on the three datasets, as shown in Table 5, we observed that the optimal initial parameter is 1. Different decay rates can positively or negatively affect the model’s performance. When the initial parameter is 0, excessive geometry information negatively impacts high-level semantic interaction. Conversely, when the initial parameter is too large, such as 2, insufficient geometry information hinders the model’s ability to effectively capture bottom-level features. As shown in Table 5, the cosine decay performs moderately, while the logarithmic decay shows poor adaptability on the three datasets. Therefore, the exponential decay is proven to be the most universal strategy, which is adopted in our designed network.

5.2.2. MCCL Parameter

In our experiments, we investigate the impact of various parameters on MCCL. These parameters include learning rate, temperature, weight, and batch size. We present the experimental results on the three datasets in Figure 9.

Firstly, we adopt the same learning strategy as Convirt and CLIP, where the warmup value influences the learning rate. A larger warmup value results in a slower initial rise in the learning rate. Considering the limited computational resources and many parameters, we keep the other parameters the same as the warmup parameters. For each dataset, we conduct experiments using two sets of parameters.

The experimental results demonstrate that the learning rate significantly impacts contrastive learning. When the warmup value is small, the learning rate rises too quickly, leading to model instability and poor performance. The temperature parameter mainly affects the value of the contrastive learning loss. Smaller temperature values result in larger contrastive losses for unpaired samples. The experimental results highlight the sensitivity of contrastive learning to the temperature parameter. Smaller temperature values make it easier to optimize the model, leading to better performance. The weight parameter balances the loss from image to text and from text to image. The experimental results indicate that the optimal weight parameter varies across different datasets. Furthermore, our experiments show that batch size has a relatively small influence on contrastive learning.

In summary, the degree of influence of different parameters on contrastive learning is ranked as follows: learning rate > temperature parameter > weight parameter. The influence of batch size is relatively minor.

6. Conclusions

To better realize RSIC, we propose a novel CSSA approach to generate descriptions for RSIs. There is a hypothesis that the paired image–text will be closer in semantic space. Therefore, cross-modal knowledge is acquired to support RSIC tasks, and we propose an MCCL module that uses encoded text representations to guide the extraction of image features. It narrows the cross-modal gap between image and text. The knowledge from the first stage improves the performance of the RSIC model. At the same time, considering that the objects in RSIs are scattered, DG-former is designed to improve vision feature extraction by dynamically introducing geometry information. Finally, the experimental results on three benchmark datasets show a better performance than several other state-of-the-art methods.

Author Contributions

Conceptualization, X.H., Z.W. and X.Z.; methodology, X.H., Z.W. and Y.L.; software, X.H.; validation, Z.W. and Y.L.; writing—original draft preparation, X.H. and Z.W.; writing—review and editing, X.Z., Y.L., G.W. and B.H.; visualization, Y.L. and G.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Shaanxi Province Innovation Capability Support Plan under Grant 2023-CX-TD-09, the National Natural Science Foundation of China under Grant 62276197, the Fundamental Research Funds for the Centra Universities under Grant QTZX25070, and the National Natural Science Foundation of China under Grant 62501433 and 62506285.

Data Availability Statement

The datasets in this study are available online from https://github.com/201528014227051/RSICD_optimal (accessed on 1 June 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

RSIC	Remote sensing image captioning.
RSI	Remote sensing image.
CNN	Convolutional neural network.
LSTM	Long short-term memory.
MCA	Multi-head cross attention.
CSSA	Cross-modal spatial-semantic alignment framework.
MCCL	Multi-branch cross-modal contrastive learning mechanism.
MGA	Multi-head dynamic geometry enhancement attention.
DGP	Dynamic geometry positional encoding module.
BLEU	Biingual evaluation understudy.
ROUGE	Recall-oriented understudy for gisting evaluation—Longest.
METEOR	Metric for Evaluation of translation with explicit ordering.
CIDEr	Consensus-based image description evaluation.
$y_{i}$	The result of the i-th MGA head.
$Ω^{l}$	Dynamic geometry positional encoding information.
$r_{i j}$	The relative positional encoding.
$f_{1}$	The original RSI features in the MCCL stage.
$f_{1}^{'}$	The enhanced RSI features in the MCCL stage.
$S^{'}$	The text augmentation result.
$T_{1}$	The original text features in the MCCL stage.
$T_{1}^{'}$	The enhanced text features in the MCCL stage.
T	The max length in ground-truth sentence.
$s_{t}$	The generated word at t time.

References

Cheng, G.; Yang, C.; Yao, X.; Guo, L.; Han, J. When deep learning meets metric learning: Remote sensing image scene classification via learning discriminative CNNs. IEEE Trans. Geosci. Remote Sens. 2018, 56, 2811–2821. [Google Scholar] [CrossRef]
Wang, G.; Zhang, X.; Peng, Z.; Jia, X.; Tang, X.; Jiao, L. MOL: Towards accurate weakly supervised remote sensing object detection via Multi-view nOisy Learning. ISPRS J. Photogramm. Remote Sens. 2023, 196, 457–470. [Google Scholar] [CrossRef]
Nie, J.; Wang, C.; Yu, S.; Shi, J.; Lv, X.; Wei, Z. MIGN: Multiscale Image Generation Network for Remote Sensing Image Semantic Segmentation. IEEE Trans. Multimed. 2022, 25, 5601–5613. [Google Scholar] [CrossRef]
Zhao, R.; Shi, Z.; Zou, Z. High-resolution remote sensing image captioning based on structured attention. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5603814. [Google Scholar] [CrossRef]
Zhang, X.; Li, Y.; Wang, X.; Liu, F.; Wu, Z.; Cheng, X.; Jiao, L. Multi-Source Interactive Stair Attention for Remote Sensing Image Captioning. Remote Sens. 2023, 15, 579. [Google Scholar] [CrossRef]
Zhao, K.; Xiong, W. Exploring region features in remote sensing image captioning. Int. J. Appl. Earth Obs. Geoinf. 2024, 127, 103672. [Google Scholar] [CrossRef]
Li, Y.; Zhang, X.; Zhang, T.; Wang, G.; Wang, X.; Li, S. A patch-level region-aware module with a multi-label framework for remote sensing image captioning. Remote Sens. 2024, 16, 3987. [Google Scholar] [CrossRef]
Zhang, Z.; Diao, W.; Zhang, W.; Yan, M.; Gao, X.; Sun, X. LAM: Remote sensing image captioning with label-attention mechanism. Remote Sens. 2019, 11, 2349. [Google Scholar] [CrossRef]
Li, Y.; Zhang, X.; Cheng, X.; Tang, X.; Jiao, L. Learning consensus-aware semantic knowledge for remote sensing image captioning. Pattern Recognit. 2024, 145, 109893. [Google Scholar] [CrossRef]
Li, Y.; Zhang, X.; Gu, J.; Li, C.; Wang, X.; Tang, X.; Jiao, L. Recurrent attention and semantic gate for remote sensing image captioning. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5608816. [Google Scholar] [CrossRef]
Shen, X.; Liu, B.; Zhou, Y.; Zhao, J.; Liu, M. Remote sensing image captioning via Variational Autoencoder and Reinforcement Learning. Knowl.-Based Syst. 2020, 203, 105920. [Google Scholar] [CrossRef]
Zhuang, S.; Wang, P.; Wang, G.; Wang, D.; Chen, J.; Gao, F. Improving remote sensing image captioning by combining grid features and transformer. IEEE Geosci. Remote Sens. Lett. 2021, 19, 6504905. [Google Scholar] [CrossRef]
Wang, Q.; Huang, W.; Zhang, X.; Li, X. Word–sentence framework for remote sensing image captioning. IEEE Trans. Geosci. Remote Sens. 2020, 59, 10532–10543. [Google Scholar] [CrossRef]
Ye, X.; Wang, S.; Gu, Y.; Wang, J.; Wang, R.; Hou, B.; Giunchiglia, F.; Jiao, L. A Joint-Training Two-Stage Method For Remote Sensing Image Captioning. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4709616. [Google Scholar] [CrossRef]
Cheng, K.; Liu, J.; Mao, R.; Wu, Z.; Cambria, E. CSA-RSIC: Cross-Modal Semantic Alignment for Remote Sensing Image Captioning. IEEE Geosci. Remote Sens. Lett. 2025, 22, 6012305. [Google Scholar] [CrossRef]
Qu, B.; Li, X.; Tao, D.; Lu, X. Deep semantic understanding of high resolution remote sensing image. In Proceedings of the 2016 International Conference on Computer, Information and Telecommunication Systems (Cits), Kunming, China, 6–8 July 2016; IEEE: New York, NY, USA, 2016; pp. 1–5. [Google Scholar]
Lu, X.; Wang, B.; Zheng, X.; Li, X. Exploring models and data for remote sensing image caption generation. IEEE Trans. Geosci. Remote Sens. 2017, 56, 2183–2195. [Google Scholar] [CrossRef]
Shi, Z.; Zou, Z. Can a machine generate humanlike language descriptions for a remote sensing image? IEEE Trans. Geosci. Remote Sens. 2017, 55, 3623–3634. [Google Scholar] [CrossRef]
Huang, W.; Wang, Q.; Li, X. Denoising-based multiscale feature fusion for remote sensing image captioning. IEEE Geosci. Remote Sens. Lett. 2020, 18, 436–440. [Google Scholar] [CrossRef]
Ma, X.; Zhao, R.; Shi, Z. Multiscale methods for optical remote-sensing image captioning. IEEE Geosci. Remote Sens. Lett. 2020, 18, 2001–2005. [Google Scholar] [CrossRef]
Zhang, X.; Wang, X.; Tang, X.; Zhou, H.; Li, C. Description generation for remote sensing images using attribute attention mechanism. Remote Sens. 2019, 11, 612. [Google Scholar] [CrossRef]
Zhang, Z.; Zhang, W.; Yan, M.; Gao, X.; Fu, K.; Sun, X. Global visual feature and linguistic state guided attention for remote sensing image captioning. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5615216. [Google Scholar] [CrossRef]
Li, Y.; Fang, S.; Jiao, L.; Liu, R.; Shang, R. A multi-level attention model for remote sensing image captions. Remote Sens. 2020, 12, 939. [Google Scholar] [CrossRef]
Lu, X.; Wang, B.; Zheng, X. Sound active attention framework for remote sensing image captioning. IEEE Trans. Geosci. Remote Sens. 2019, 58, 1985–2000. [Google Scholar] [CrossRef]
Cui, W.; Wang, F.; He, X.; Zhang, D.; Xu, X.; Yao, M.; Wang, Z.; Huang, J. Multi-scale semantic segmentation and spatial relationship recognition of remote sensing images based on an attention model. Remote Sens. 2019, 11, 1044. [Google Scholar] [CrossRef]
Wang, S.; Ye, X.; Gu, Y.; Wang, J.; Meng, Y.; Tian, J.; Hou, B.; Jiao, L. Multi-label semantic feature fusion for remote sensing image captioning. ISPRS J. Photogramm. Remote Sens. 2022, 184, 1–18. [Google Scholar] [CrossRef]
Ren, Z.; Gou, S.; Guo, Z.; Mao, S.; Li, R. A mask-guided transformer network with topic token for remote sensing image captioning. Remote Sens. 2022, 14, 2939. [Google Scholar] [CrossRef]
Wu, Y.; Li, L.; Jiao, L.; Liu, F.; Liu, X.; Yang, S. TrTr-CMR: Cross-Modal Reasoning Dual Transformer for Remote Sensing Image Captioning. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5643912. [Google Scholar] [CrossRef]
Cheng, K.; Cambria, E.; Liu, J.; Chen, Y.; Wu, Z. KE-RSIC: Remote Sensing Image Captioning Based on Knowledge Embedding. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 4286–4304. [Google Scholar] [CrossRef]
Wang, B.; Lu, X.; Zheng, X.; Li, X. Semantic descriptions of high-resolution remote sensing images. IEEE Geosci. Remote Sens. Lett. 2019, 16, 1274–1278. [Google Scholar] [CrossRef]
Sumbul, G.; Nayak, S.; Demir, B. SD-RSIC: Summarization-driven deep remote sensing image captioning. IEEE Trans. Geosci. Remote Sens. 2020, 59, 6922–6934. [Google Scholar] [CrossRef]
Wang, B.; Zheng, X.; Qu, B.; Lu, X. Retrieval topic recurrent memory network for remote sensing image captioning. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 256–270. [Google Scholar] [CrossRef]
Fu, K.; Li, Y.; Zhang, W.; Yu, H.; Sun, X. Boosting memory with a persistent memory mechanism for remote sensing image captioning. Remote Sens. 2020, 12, 1874. [Google Scholar] [CrossRef]
Li, X.; Zhang, X.; Huang, W.; Wang, Q. Truncation cross entropy loss for remote sensing image captioning. IEEE Trans. Geosci. Remote Sens. 2020, 59, 5246–5257. [Google Scholar] [CrossRef]
Hoxha, G.; Melgani, F. A novel SVM-based decoder for remote sensing image captioning. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5404514. [Google Scholar] [CrossRef]
Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving Language Understanding by Generative Pre-Training; OpenAI: San Francisco, CA, USA, 2018. [Google Scholar]
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
Yang, Q.; Ni, Z.; Ren, P. Meta captioning: A meta learning based remote sensing image captioning framework. ISPRS J. Photogramm. Remote Sens. 2022, 186, 190–200. [Google Scholar] [CrossRef]
Song, R.; Zhao, B.; Yu, L. Enhanced CLIP-GPT Framework for Cross-Lingual Remote Sensing Image Captioning. IEEE Access 2024, 13, 904–915. [Google Scholar] [CrossRef]
Lin, Q.; Wang, S.; Ye, X.; Wang, R.; Yang, R.; Jiao, L. CLIP-based grid features and masking for remote sensing image captioning. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 18, 2631–2642. [Google Scholar] [CrossRef]
Zhang, Y.; Jiang, H.; Miura, Y.; Manning, C.D.; Langlotz, C.P. Contrastive learning of medical visual representations from paired images and text. In Proceedings of the Machine Learning for Healthcare Conference, PMLR, Durham, NC, USA, 5–6 August 2022; pp. 2–25. [Google Scholar]
Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual Event, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
Yang, Y.; Newsam, S. Bag-of-visual-words and spatial extensions for land-use classification. In Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, San Jose, CA, USA, 2–5 November 2010; pp. 270–279. [Google Scholar]
Banerjee, S.; Lavie, A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, MI, USA, 29 June 2005; pp. 65–72. [Google Scholar]
Lin, C. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out; Association for Computational Linguistics: Barcelona, Spain, 2004. [Google Scholar]
Vedantam, R.; Zitnick, C.; Parikh, D. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 4566–4575. [Google Scholar]
Anderson, P.; Fernando, B.; Johnson, M.; Gould, S. Spice: Semantic propositional image caption evaluation. In Proceedings of the 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; pp. 382–398. [Google Scholar]
Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhudinov, R.; Zemel, R.; Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning, PMLR, San Diego, CA, USA, 9–12 May 2015; pp. 2048–2057. [Google Scholar]

Figure 1. Traditional method (top) and our method (down). Our approach can thoroughly perceive the matching relationships between image and text. The red arrows form the MCCL stage, and the black arrows form the RSIC stage. MCCL: multi-branch cross-modal contrastive learning.

L_{mccl}

: the loss of MCCL.

Figure 1. Traditional method (top) and our method (down). Our approach can thoroughly perceive the matching relationships between image and text. The red arrows form the MCCL stage, and the black arrows form the RSIC stage. MCCL: multi-branch cross-modal contrastive learning.

L_{mccl}

: the loss of MCCL.

Figure 2. The network architecture of CSSA. It consists of three parts: DG-former, Text Decoder in MCCL and Text Decoder in RSIC. DG-former learns RSI features with ResNet-101. Text Decoder in the MCCL stage models sequence dependencies and works with DG-former to learn similarities between RSIs and sentences, helping the model to thoroughly perceive the matching relationships between image and text. Text Decoder in the RSIC stage generates sentences based on image features from DG-former. MGA: Multi-head dynamic Geometry enhancement Attention. FFN: Feed-Forward Network. DGP: Dynamic Geometry Positional encoding module. M-MSA: Masked Multi-head Self-Attention. MCA: Multi-head Cross-Attention.

Figure 3. The demonstration of DGP module. It adjusts the intensity of geometry information dynamically based on the layer index. 2D-RPE: two-dimensional relative positional encoding. MAP-H&L: mapping

r

to a higher dimension and then to a lower dimension. DA: dynamic adjustment.

Figure 3. The demonstration of DGP module. It adjusts the intensity of geometry information dynamically based on the layer index. 2D-RPE: two-dimensional relative positional encoding. MAP-H&L: mapping

r

to a higher dimension and then to a lower dimension. DA: dynamic adjustment.

Figure 4. Category and position attention heatmaps for T-former (DG-former w/o DGP), DG-former (DG-former w/o DA), and DG-former.

Figure 5. Demonstration of the MCCL stage. The image and text branches construct positive and negative sample pairs using different augmentation methods to mine the semantics of images and texts, respectively. Meanwhile, the branch between the image and text learns the correspondence between the image and text.

Figure 6. Comparison of experimental visualization results from three publicly available datasets. (a–c) From the RSICD, (d–f) from the Sydney-Captions dataset, and (g–i) from the UCM-Captions dataset. The generated sentences are from (1) ground truth (GT): one selected ground truth sentence; (2) transformer model; (3) our proposed model. Words in red indicate a mismatch, and words in blue are the exact words.

Figure 7. Ablation performance of the image augmentation branch and text augmentation branch driven by MCCL module. B0: baseline, B1: only with text augmentation, B2: only with image augmentation, B3: full model.

Figure 8. Comparison of the results of visualization ablation experiments for three datasets. (a,b) From the RSICD, (c,d) from the Sydney-Captions dataset, and (e,f) from the UCM-Captions dataset. the GT sentences are human-annotated, and the ablation model generates the other sentences. Incorrect words are in red font, and correct words are in blue font.

Figure 9. Effect of different MCCL parameters on three datasets with the generated sentences’ quality. The “ucm”, “sydney” and “rsicd” denote UCM-Captions, Sydney-Captions and RSICD, respectively.

Table 1. Comparison scores of our method and other state-of-the-art methods on Sydney-Captions dataset.

Method	BLEU-1	BLEU-2	BLEU-3	BLEU-4	METEOR	ROUGE	CIDEr	SPICE
SAT [48]	0.7905	0.7020	0.6232	0.5477	0.3925	0.7206	2.2013	-
FC-ATT + LSTM [21]	0.8076	0.7160	0.6276	0.5544	0.4099	0.7144	2.2033	-
SM-ATT + LSTM [21]	0.8143	0.7351	0.6586	0.5806	0.4111	0.7195	2.3021	-
SAT(LAM) [8]	0.7405	0.6550	0.5904	0.5304	0.3689	0.6814	2.3519	0.4308
Structured-ATT [4]	0.7795	0.7019	0.6392	0.5861	0.3954	0.7299	2.3791	-
GVFGA + LSGA [22]	0.7681	0.6846	0.6145	0.5504	0.3866	0.7030	2.4522	0.4532
RASG [10]	0.8000	0.7217	0.6531	0.5909	0.3908	0.7218	2.6311	0.4301
Word–Sentence [13]	0.7891	0.7094	0.6317	0.5625	0.4181	0.6922	2.0411	-
JTTS [14]	0.8492	0.7797	0.7137	0.6496	0.4457	0.7660	2.8010	0.4679
VRTMM [11]	0.7443	0.6723	0.6172	0.5699	0.3748	0.6698	2.5285	-
CNN + Transformer [12]	0.8100	0.7320	0.6500	0.5710	-	0.7490	2.5670	-
TrTr-CMR [28]	0.8270	0.6994	0.6002	0.5199	0.3803	0.7220	2.2728	-
KE [29]	0.8550	0.7810	0.6450	0.5800	0.4530	0.7280	2.9440	-
CSSA (Ours)	0.8406	0.7777	0.7138	0.6544	0.4555	0.7839	3.0501	0.5050

Table 2. Comparison scores of our method and other state-of-the-art methods on UCM-Captions dataset.

Method	BLEU-1	BLEU-2	BLEU-3	BLEU-4	METEOR	ROUGE	CIDEr	SPICE
SAT [48]	0.7993	0.7355	0.6790	0.6244	0.4174	0.7441	3.0038	-
FC-ATT + LSTM [21]	0.8135	0.7502	0.6849	0.6352	0.4173	0.7504	2.9958	-
SM-ATT + LSTM [21]	0.8154	0.7575	0.6936	0.6458	0.4240	0.7632	3.1864	-
SAT(LAM) [8]	0.8195	0.7764	0.7485	0.7161	0.4837	0.7908	3.6171	0.5024
Structured-ATT [4]	0.8538	0.8035	0.7572	0.7149	0.4632	0.8141	3.3489	-
GVFGA + LSGA [22]	0.8319	0.7657	0.7103	0.6596	0.4436	0.7845	3.3270	0.4853
RASG [10]	0.8518	0.7925	0.7432	0.6976	0.4571	0.8072	3.3887	0.4891
Word–Sentence [13]	0.7931	0.7237	0.6671	0.6202	0.4395	0.7132	2.7871	-
JTTS [14]	0.8696	0.8224	0.7788	0.7376	0.4906	0.8364	3.7102	0.5231
VRTMM [11]	0.8394	0.7785	0.7283	0.6828	0.4527	0.8026	3.4948	-
CNN + Transformer [12]	0.8340	0.7720	0.7180	0.6730	-	0.7700	3.3150	-
TrTr-CMR [28]	0.8156	0.7091	0.6220	0.5469	0.3978	0.7442	2.4742	-
KE [29]	0.8990	0.8290	0.7860	0.7170	0.4950	0.8430	3.7660	-
CSSA (Ours)	0.8911	0.8469	0.8037	0.7598	0.4870	0.8474	3.8248	0.5428

Table 3. Comparison scores of our method and other state-of-the-art methods on on RSICD dataset.

Method	BLEU-1	BLEU-2	BLEU-3	BLEU-4	METEOR	ROUGE	CIDEr	SPICE
SAT [48]	0.7336	0.6129	0.5190	0.4402	0.3549	0.6419	2.2486	-
FC-ATT + LSTM [21]	0.7459	0.6250	0.5338	0.4574	0.3395	0.6333	2.3664	-
SM-ATT + LSTM [21]	0.7571	0.6336	0.5385	0.4612	0.3513	0.6458	2.3563	-
SAT(LAM) [8]	0.6753	0.5537	0.4686	0.4026	0.3254	0.5823	2.585	0.4636
Structured-ATT [4]	0.7016	0.5614	0.4648	0.3934	0.3291	0.5706	1.7031	-
GVFGA + LSGA [22]	0.6779	0.5600	0.4781	0.4165	0.3285	0.5929	2.6012	0.4683
RASG [10]	0.7729	0.6651	0.5782	0.5062	0.3626	0.6691	2.7549	0.4719
Word–Sentence [13]	0.7240	0.5861	0.4933	0.4250	0.3197	0.6260	2.0629	-
JTTS [14]	0.7893	0.6795	0.5893	0.5135	0.3773	0.6823	2.7958	0.4877
VRTMM [11]	0.7813	0.6721	0.5645	0.5123	0.3737	0.6713	2.7150	-
CNN + Transformer [12]	0.7740	0.6680	0.5810	0.5100	-	0.6780	2.7730	-
TrTr-CMR [28]	0.6201	0.3937	0.2671	0.1932	0.2399	0.4859	0.7518	-
KE [29]	0.7910	0.6790	0.5910	0.5170	0.3810	0.6910	2.8320	-
CSSA (Ours)	0.8131	0.7016	0.6021	0.5275	0.3997	0.6932	2.9056	0.4988

Table 4. Ablation performance of our designed model on three remote sensing image captioning datasets.

Dataset	DGP	MCCL	BLEU-1	BLEU-2	BLEU-3	BLEU-4	METEOR	ROUGE	CIDEr	SPICE
Sydney-Captions	×	×	0.8368	0.7525	0.6683	0.5874	0.4160	0.7471	2.6533	0.4357
	✓	×	0.8128	0.7510	0.6906	0.6319	0.4315	0.7497	2.8862	0.4868
	×	✓	0.8333	0.7608	0.6879	0.6178	0.4364	0.7782	2.8773	0.4642
	✓	✓	0.8406	0.7777	0.7138	0.6544	0.4555	0.7839	3.0501	0.5050
UCM-Captions	×	×	0.8600	0.8106	0.7637	0.7188	0.4628	0.8041	3.5912	0.4848
	✓	×	0.8700	0.8252	0.7782	0.7322	0.4712	0.8192	3.6639	0.5024
	×	✓	0.8770	0.8278	0.7801	0.7338	0.4772	0.8326	3.6781	0.5208
	✓	✓	0.8911	0.8469	0.8037	0.7598	0.4870	0.8474	3.8248	0.5428
RSICD	×	×	0.7839	0.6711	0.5812	0.5074	0.3666	0.6680	2.7963	0.4800
	✓	×	0.7938	0.6818	0.5909	0.5136	0.3776	0.6853	2.8367	0.4891
	×	✓	0.7632	0.6528	0.5634	0.4902	0.3915	0.6915	2.8390	0.5005
	✓	✓	0.8131	0.7016	0.6021	0.5275	0.3997	0.6932	2.9056	0.4988

Table 5. Effect in terms of CIDEr metric with different decay strategies on three datasets.

Decay	Init. Params $s_{0}$	CIDEr
Decay	Init. Params $s_{0}$	UCM-Captions	Sydney-Captions	RSICD
Exp	0.0	3.5924	2.7801	2.8144
	0.5	3.6398	2.7912	2.8542
	1.0	3.6639	2.8862	2.8367
	2.0	3.6483	2.8176	2.8587
Log	-	3.6476	2.8735	2.8325
Cos	-	3.6539	2.8746	2.8426

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Han, X.; Wu, Z.; Li, Y.; Zhang, X.; Wang, G.; Hou, B. CSSA: A Cross-Modal Spatial–Semantic Alignment Framework for Remote Sensing Image Captioning. Remote Sens. 2026, 18, 522. https://doi.org/10.3390/rs18030522

AMA Style

Han X, Wu Z, Li Y, Zhang X, Wang G, Hou B. CSSA: A Cross-Modal Spatial–Semantic Alignment Framework for Remote Sensing Image Captioning. Remote Sensing. 2026; 18(3):522. https://doi.org/10.3390/rs18030522

Chicago/Turabian Style

Han, Xiao, Zhaoji Wu, Yunpeng Li, Xiangrong Zhang, Guanchun Wang, and Biao Hou. 2026. "CSSA: A Cross-Modal Spatial–Semantic Alignment Framework for Remote Sensing Image Captioning" Remote Sensing 18, no. 3: 522. https://doi.org/10.3390/rs18030522

APA Style

Han, X., Wu, Z., Li, Y., Zhang, X., Wang, G., & Hou, B. (2026). CSSA: A Cross-Modal Spatial–Semantic Alignment Framework for Remote Sensing Image Captioning. Remote Sensing, 18(3), 522. https://doi.org/10.3390/rs18030522

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

CSSA: A Cross-Modal Spatial–Semantic Alignment Framework for Remote Sensing Image Captioning

Highlights

Abstract

1. Introduction

2. Related Work

3. Proposed Method

3.1. Dynamic Geometry Transformer

3.2. Dynamic Geometry Positional Encoding Module

3.2.1. Exponential Decay

3.2.2. Logarithmic Decay

3.2.3. Cosine Decay

3.3. Text Decoder

3.3.1. Text Decoder in MCCL

3.3.2. Text Decoder in RSIC

3.4. Training Strategy and Loss Function

3.4.1. MCCL Stage

3.4.2. RSIC Stage

4. Experiments

4.1. Datasets

4.2. Evaluation Metrics

4.3. Experimental Settings

4.4. Comparison with Other Methods

4.4.1. Quantitative Comparison

4.4.2. Qualitative Comparison

5. Discussion

5.1. Ablation Study

5.1.1. Quantitative Comparison

5.1.2. Qualitative Comparison

5.2. Parameter Sensitivity Analysis

5.2.1. DGP Parameter

5.2.2. MCCL Parameter

6. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

Abbreviations

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI