A Cross-Spatial Differential Localization Network for Remote Sensing Change Captioning

Wu, Ruijie; Ye, Hao; Liu, Xiangying; Li, Zhenzhen; Sun, Chenhao; Wu, Jiajia

doi:10.3390/rs17132285

Open AccessArticle

A Cross-Spatial Differential Localization Network for Remote Sensing Change Captioning

by

Ruijie Wu

^1,*,

Hao Ye

²,

Xiangying Liu

²,

Zhenzhen Li

¹,

Chenhao Sun

² and

Jiajia Wu

³

¹

State Key Laboratory of Information Engineering in Surveying, Mapping, and Remote Sensing, Wuhan University, Wuhan 430079, China

²

School of Geodesy and Geomatics, Wuhan University, Wuhan 430079, China

³

School of Minerals Processing and Bioengineering, Central South University, Changsha 410083, China

^*

Author to whom correspondence should be addressed.

Remote Sens. 2025, 17(13), 2285; https://doi.org/10.3390/rs17132285

Submission received: 7 May 2025 / Revised: 30 June 2025 / Accepted: 30 June 2025 / Published: 3 July 2025

(This article belongs to the Special Issue Machine Learning and Deep Learning Applied to Remote Sensing Image Analysis)

Download

Browse Figures

Versions Notes

Abstract

Remote Sensing Image Change Captioning (RSICC) aims to generate natural language descriptions of changes in bi-temporal remote sensing images, providing more semantically interpretable results than conventional pixel-level change detection methods. However, existing approaches often rely on stacked Transformer modules, leading to suboptimal feature discrimination. Moreover, direct difference computation after feature extraction tends to retain task-irrelevant noise, limiting the model’s ability to capture meaningful changes. This study proposes a novel cross-spatial Transformer and symmetric difference localization network (CTSD-Net) for RSICC to address these limitations. The proposed Cross-Spatial Transformer adaptively enhances spatial-aware feature representations by guiding the model to focus on key regions across temporal images. Additionally, a hierarchical difference feature integration strategy is introduced to suppress noise by fusing multi-level differential features, while residual-connected high-level features serve as query vectors to facilitate bidirectional change representation learning. Finally, a causal Transformer decoder creates accurate descriptions by linking visual information with text. CTSD-Net achieved BLEU-4 scores of 66.32 and 73.84 on the LEVIR-CC and WHU-CDC datasets, respectively, outperforming existing methods in accurately locating change areas and describing them semantically. This study provides a promising solution for enhancing interpretability in remote sensing change analysis.

Keywords:

change caption; transformer encoder; symmetric difference localization; remote sensing image; natural language processing

1. Introduction

Monitoring and analyzing changes on the Earth’s surface provides scientific evidence and strategic guidance for achieving sustainable development. Consequently, remote sensing-based earth observation has become a crucial tool in the fields of environmental monitoring, land planning, disaster management, etc. Although remote sensing image change detection (RSICD) can find changed areas using mask outputs, it only shows changes at the pixel level and does not provide easy-to-understand information for real-world use, like descriptions of object types, locations, and change dynamics. The remote sensing image change captioning (RSICC) task aims to describe and explain the changes in the image scene with natural language. As shown in Figure 1, the RSICC task takes bi-temporal images as input, finds the differences between them, and provides textual descriptions that include where the changes are and what types of objects are involved. Compared with the RSICD task, it delivers more flexible results and can characterize diverse change types.

Current RSICC approaches primarily employ encoder–decoder architectures, with model designs focusing on two key aspects: image feature enhancement and differential feature refinement.

Image feature enhancement aims to reduce the scale differences of the change targets, with its core focusing on the fusion and optimization of multi-scale features [1]. The design of a scale-aware enhancement module enhances the model’s ability to perceive objects with varying sizes of changes [2]. For example, Liu et al. [3] proposed a dual-branch transformer structure and used a multi-level dual-temporal fusion module to fuse multi-scale features. Chang et al. [4] designed a hierarchical self-attention module to locate features related to changes. ICT-Net [5] uses a multi-layer adaptive fusion module to reduce the effect of unimportant visual features and a cross-attention module to highlight different-scale features when creating captions. Although these studies have made significant progress, they are all limited by the constrained receptive field of CNNs and the high complexity of Transformer architectures.

Regarding improvements in differential features, Park et al. [6] suggested an effortless dual dynamic attention method to show differences between two visual images, creating change maps by subtracting bi-temporal features. However, the visual and semantic relationships between objects received insufficient attention. To resolve these issues, Tu et al. [7] created self-semantic and cross-semantic relation blocks to determine change areas, using their strengths to build connections both within the same type of data and between different types in change captioning models. Cai et al. [5] created a special network that uses an interactive encoder to determine important changes at different scales and a transformer with a multi-layer fusion module to produce change descriptions. Nevertheless, these methods primarily rely on direct subtraction and global similarity to capture changing features between two remote sensing images, rather than attempt to learn interactions between bi-temporal features before change assessment.

Therefore, there are still some limitations in current research that require improvement:

(1) Simply stacking convolutional or Transformer encoder modules does not adequately achieve feature enhancement. Effectively improving the model’s spatial perception capability is crucial for capturing localized information, yet existing architectures often lack explicit mechanisms to strengthen positional awareness.

(2) Most studies focus excessively on learning bi-temporal features and then rely on simple subtraction or similarity computation to detect changes. This approach leads to two key issues: insufficient extraction of change-region features, resulting in weak discriminative signals, and retention of task-irrelevant interference, which hinders effective differential feature learning.

To address the issues mentioned above, this study proposes a novel cross-spatial Transformer and symmetric difference localization network (CTSD-Net) for RSICC. Specifically, this study first designs a cross-space Transformer encoder, utilizing a spatial attention mechanism to adaptively focus on the spatial positions and feature relationships of interest region features before and after enhancement. Subsequently, the model employs a hierarchical differential feature integration strategy, combining multi-level difference maps to enrich both medium- and coarse-grained contextual information while effectively suppressing noise interference. The architecture utilizes residual-connected high-level differential features as query vectors in the multi-head attention mechanism, enabling simultaneous precise learning of discriminative change representations and bidirectional attention to pre-change and post-change characteristics. The final differential representation is synthesized by concatenating the enhanced spatial features with the optimized symmetric differential features. A causal Transformer decoder then processes this integrated representation, which establishes a robust cross-modal mapping to generate linguistically accurate change descriptions.

The remainder of this study is organized as follows: Section 2 introduces related works on image captioning and image change detection in remote sensing images. Section 3 describes the structure and modules of CTSD-Net. Section 4 presents the experimental analysis and discussion of CTSD-Net. Section 5 presents our conclusion.

2. Related Works

2.1. Remote Sensing Image Caption

The remote sensing image captioning (RSIC) task takes a single image as input and aims to generate a natural language description that summarizes the semantic information of the image [8]. Based on the generation approach, we can broadly categorize existing methods into three types: retrieval-based methods, object detection-based methods, and sequence generation-based methods.

Retrieval-based methods utilize large-scale image-text databases to generate sentences by retrieving similar images and their corresponding descriptions [9]. Therefore, their performance is highly dependent on the quality of retrieval. When the target image differs significantly from those in the database, the generation quality drops considerably. Object detection-based methods, on the other hand, first detect the objects within an image, then model the relationships between the objects, and use a sentence generation model to produce the description [10,11]. The performance of this approach is critically dependent on the accuracy of object detection and relationship modeling, as detection errors directly affect the quality of the generated sentences.

Sequence generation-based methods are currently the mainstream approach for the RSIC task. These methods adopt an encoder–decoder paradigm, encoding images into feature vectors and decoding them into natural language descriptions [1,2,4,12,13,14]. These methods typically use convolutional neural networks (CNNs) to extract visual features, combined with language models such as recurrent neural networks (RNNs) or long short-term memory networks to generate descriptions. Qu et al. [8] were the first to apply the CNN-RNN architecture to the remote sensing image captioning task. Existing research mainly improves model performance in two directions: enhancing the visual features by optimizing image encoding through multi-scale feature fusion and denoising, such as spatial and channel denoising followed by feature fusion from different layers of a CNN [15]; and enhancing the information integration capability during the language decoding phase, for example, by using multi-level attention models [16] combined with context modeling to improve the accuracy and coherence of the descriptions.

2.2. Remote Sensing Image Change Caption

Similar to the RSIC task, the RSICC task further incorporates the identification of change regions and shifts the focus of the description task to the changes occurring within the scene. The implementation still primarily follows an encoder–decoder architecture. Specifically, it first employs a Siamese neural network to extract deep features from the two temporal images, learns the differential feature representations between the images, and then builds cross-modal associations from images to text to translate these features into descriptive sentences.

Jhamtani et al. [17] first introduced the “change caption” task and proposed the Siamese CNN-RNN architecture to describe the semantic differences between dual-temporal images. Subsequent research [6] elevated the semantic interpretation to a higher level by proposing a dual dynamic attention model that can distinguish semantic visual information between image pairs without being affected by lighting and viewpoint variations. Yue et al. [18] further expanded this progress by introducing an internal interaction representation network to finely learn different image representations. Hosseinzadeh et al. [19] focused on designing new training schemes, attempting to improve the learning efficiency and generalization ability of the network by incorporating auxiliary tasks and additional supervisory signals, thereby enhancing the model’s ability to perceive fine-grained image differences.

A novel image pretraining-finetuning paradigm, based on CLIP (Contrastive Language–Image Pretraining), establishes cross-modal associations between images and text [20]. This method leverages CLIP’s outstanding zero-shot capabilities by designing three contrastive self-supervised tasks, specifically optimizing the visual encoder to capture image differences and achieve text alignment [21]. To enhance adaptability to complex scenes, Qiu et al. [22] further proposed a Transformer-based network architecture, which dynamically generates attention states to strengthen recognition capabilities. This model detects multiple changes and demonstrates excellent cross-scene generalization performance, laying a foundation for research in RSICC tasks and change semantic expression.

3. Methods

As shown in Figure 2, CTSD-Net learns the symmetric relationships of semantic changes through the differential features between dual-temporal images. Specifically, the model first employs a Siamese weight-sharing SegFormer [23] backbone to extract multi-scale features from the dual-temporal images. These features are then passed through a shared positional encoding and fed into two key modules: the Cross-Spatial Transformer (CST) module, which captures spatial-semantic correlations within the regions of interest, and the Symmetric Differential Localization module, which learns symmetric differential features. The two sets of features are concatenated and input into the decoding module to realize the mapping conversion between image and text and obtain the final descriptive output.

3.1. Cross-Spatial Transformer

CTSD-Net adopts the Mix Transformer (MiT) backbone from SegFormer for feature extraction, and uses stage 1–4 outputs to obtain a more comprehensive feature representation, as shown in Figure 2a. Formally, given two input images

I_{1}

and

I_{2}

, their extracted features can be denoted as

(F_{1}, F_{2})

with multi-scale feature representations. Subsequently, learnable positional embeddings are added only to the highest-level features to reduce model capacity requirements and unnecessary computational costs. This process can be formally defined as:

{Token}_{i} = T (F_{i}) + F_{p}, (i = 1, 2)

(1)

where

T (\cdot)

represents feature reshaping operations, including flattening and transposing, which transform the original features into the appropriate shape required for embedding tokens.

F_{p}

denotes a learnable positional embedding used to retain the positional information within the model.

The feature sequences

(F_{1}, F_{2})

are then fed into the cross-spatial Transformer (CST) module. As illustrated in Figure 3, the structure of the CST module consists of two spatial attention (SA) blocks, one window-based multi-head self-attention (W-MSA) block, one Transformer block, two normalization layers, and two feed-forward networks (FFNs). The standard Transformer architecture employs a multi-layer stacked structure for progressive feature refinement. To enhance this framework, the proposed CST module incorporates W-MSA to effectively capture local texture patterns while maintaining computational efficiency. The subsequent Transformer layers then complement this process by integrating global contextual information with the locally-enhanced features from W-MSA outputs. Simultaneously, the SA mechanism operates across both components to maintain global spatial semantic modeling. This hierarchical design establishes an efficient extraction to a deep processing feature refinement pipeline. It addresses the computational complexity inherent in pure global attention approaches and ensures feature quality through carefully designed stacked modules.

The design of CST incorporates three key characteristics: (1) Dual Spatial Attention Mechanisms: Two stages of spatial attention are applied to adaptively focus on key regions before and after feature extraction, effectively filtering out background noise, shadow occlusions, and other complex artifacts in remote sensing images, thus improving the model’s robustness. (2) W-MSA: Inspired by Swin Transformer, CST partitions the encoded image features into several non-overlapping windows, within which self-attention is computed independently, thereby significantly reducing computational complexity. (3) Transformer-based Feature Interaction Modeling: The tokens obtained after the first spatial attention serve as the queries, while the tokens after W-MSA and the second spatial attention act as the keys and values. This design captures the spatial feature relationships between the original and enhanced feature sequences, further promoting effective global context modeling across the entire image.

Both W-MSA and the Transformer block utilize the MSA mechanism. The input to the MSA is a triplet (query

q

, key

k

, and value

v

) derived from the input features

F_{i} \in R^{c \times d}, (i = 1, 2)

, with

c

representing the number of channels and

d

representing the feature dimension. This process can be formulated as:

q = F_{i} W_{Q}, k = F_{i} W_{K}, v = F_{i} W_{V}

(2)

A (q, k, v) = S o f t M a x (\frac{q k^{T}}{\sqrt{d}}) v

(3)

M S A (F_{i}) = C a t [A (q_{1}, k_{1}, v_{1}), \dots, A (q_{m}, k_{m}, v_{m})] W_{o}

(4)

where

W_{Q}

,

W_{K},

and

W_{V}

are learnable parameters.

A

refers to the attention mechanism.

M S A

refers to executing multiple self-attention operations in parallel, where

m

denotes the number of attention heads.

The complete encoding process of the CST module can be summarized as:

X_{i} = S A (F_{i}) + F_{i}

(5)

X_{i}^{'} = W M S A (L N (X_{i})) + X_{i}

(6)

X_{i}^{″} = F F N (L N (X_{i}^{'}))

(7)

X_{i}^{″} = S A (X_{i}^{″}) + X_{i}^{″}

(8)

X_{i}^{'''} = T r (L N (X_{i})) + X_{i}^{″}

(9)

X_{i} = F F N (L N (X_{i}^{'''}))

(10)

where

L N

is layer normalization,

F F N

is a feed-forward network, which can be expressed as:

F F N = D r o p (S i L U (F W_{1} + b_{1}) W_{2} + b_{2})

. It consists of two fully connected layers, a

S i L U

activation function, and a dropout function

D r o p

.

T r

is a standard Transformer module.

3.2. Symmetric Differential Localization

The semantic representation of image differences is crucial for guiding the model to learn the change features between bi-temporal images. However, directly computing the difference between two feature maps or applying point-wise operations after differencing is insufficient to fully capture the complex characteristics of image changes. To address this limitation, we propose a novel symmetric differential localization (SDL) module, which operates in parallel with the CST module. The SDL module aggregates multi-scale differential features and employs a self-attention mechanism to learn the semantic relationships surrounding objects and filter out irrelevant and unchanged feature representations. This design enables enhanced differential interaction between the bi-temporal features.

Figure 4 illustrates the composition of the module, for the multi-level features

F_{i}^{l}

, where

l

denotes the feature levels from 1 to 5. The SDL module computes bidirectional differential features (forward change:

F_{1}

→

F_{2}

and backward change:

F_{2}

→

F_{1}

) as complementary inputs, uses convolution with kernels of 5 and 7 to increase the receptive field of the features, and fuses the features layer by layer through downsampling and normalization operations.

Specifically, this process can be expressed as:

D_{12}^{l} = C o n v (F_{1}^{l} - F_{2}^{l})

(11)

D_{21}^{l} = C o n v (F_{2}^{l} - F_{1}^{l})

(12)

C o n v = {C o n v}_{3} ({Conv}_{1} (C a t ({C o n v}_{5} (D_{12}^{l} [: C / 2]), {C o n v}_{7} (D_{12}^{l} [C / 2 :]))))

(13)

where

{C o n v}_{i}, i = 1, 3, 5, 7

denotes convolutional operations with corresponding kernel sizes, and

C a t

represents the tensor concatenation operation. The differential features from different levels are first downsampled through

{C o n v}_{3}

convolution, and then aggregated by summation. The model obtains a more informative and semantically enriched differential feature map by fusing multi-scale differential features. Additionally, the combined multi-scale differential features help improve the high-level differential features, allowing the model to better concentrate on the areas where real changes happen. Specifically, the SDL module first uses average pooling to the two multi-scale differential features

{(D}_{12}, D_{21})

to enhance their representations:

Φ_{m} (D_{12}, D_{21}) = f_{2} (f_{1} (D_{12}, D_{21}))

(14)

f_{1} = GELU (BN (D_{12}, D_{21}))

(15)

f_{2} = Sig (FFN (AP (f_{1})))

(16)

where

G E L U

is the Gaussian error linear unit activation function,

S i g

is the Sigmoid activation function,

A P

is the average pooling function, and

B N

is batch normalization.

Subsequently, a cross-semantic change locator is constructed using an MHA layer and an FFN, with residual connections between the two components. In the MHA layer, differential features from the two temporal images are used separately. For

F_{1}

,

D_{12}

is used as the query, while

D_{21}

serves as both the key and value for the attention computation:

D_{12}^{l} = {M H A}_{12}^{l} (D_{12}, D_{21}, D_{21})

(17)

The FFN layer is used to process the output of the MHA layer:

Z_{12}^{l} = FFN (B N (D_{12} + D r o p (D_{12}^{l})))

(18)

Similarly, for

F_{2}

, MHA uses

D_{21}

as the query and

D_{12}

as the key and value for attention calculation:

D_{21}^{l} = {M H A}_{21}^{l} (D_{21}, D_{12}, D_{12})

(19)

Z_{21}^{l} = FFN (B N (Z_{1}^{l - 1} + D r o p (D_{21}^{l})))

(20)

Finally, the obtained symmetric difference feature is added to the cross-space feature

X_{i}

obtained in the previous section to obtain the final image feature, and the dual-phase features are concatenated as the output feature

X_{o u t}

:

X_{1} = X_{1} + Z_{12}^{l}

(21)

X_{2} = X_{2} + Z_{21}^{l}

(22)

X_{o u t} = {C a t (X}_{1}, X_{2})

(23)

3.3. Text Decoder

The CTSD-Net uses the causal Transformer decoder [24] to unify the change features of bi-temporal images with text features. As shown in Figure 5, each Transformer block consists of a masked MHA, an MHA, and a feed-forward network. The masked MHA takes the word embeddings from the previous block as input and applies causal masking to ensure that the output of each sequence element depends solely on its preceding elements. As illustrated in Figure 6a, the orange cell at row

i

and column

j

indicates that, when generating the

i

-th output token, the attention mechanism is permitted to attend to the

j

-th input token, while preventing access to any subsequent tokens beyond

j

. Subsequently, the MHA integrates the extracted visual features into the word embeddings, allowing the model to attend to the visual information. The entire decoder layer employs the residual connection to retain the information of the word embeddings.

Specifically, to prepare change descriptions for the caption decoder during training, we first map the text tokens into word embeddings using an embedding layer. A mapping function

f_{e m b} : R^{n} \to R^{n \times d_{e m b}}

is employed to transform the original token

t

into corresponding word embedding

E_{t e x t}

. The initial input to the Transformer decoder can then be obtained as follows:

E_{t e x t}^{0} = f_{e m b} (t) + E_{p o s}

(24)

where

E_{p o s}

denotes the positional embeddings computed using sine and cosine functions of different frequencies and phases based on the position of each token [25]. Then, the

j

-th masked MHA sublayer can be formulated as:

S_{text}^{j} = {M H A}^{j} (E_{text}^{j - 1}, E_{text}^{j - 1}, E_{text}^{j - 1}) = C o n c a t ({head}_{1}, \dots, {head}_{h}) W^{O}

(25)

{head}_{l} = A (E_{text}^{j - 1} W_{l}^{Q}, E_{text}^{j - 1} W_{l}^{K}, E_{text}^{j - 1} W_{l}^{V})

(26)

{M H A}^{j}

represents the function of the

j

-th masked MHA sublayer,

h

is the number of heads in the masked MHA sublayer.

W_{l}^{Q}, W_{l}^{K}, W_{l}^{V} \in R^{d_{e m b} \times \frac{d_{e m b}}{h}}

are the trainable weight matrices for the

l

-th head, and

W^{O} \in R^{d_{e m b} \times d_{e m b}}

is the trainable weight matrix for linearly projecting the output feature dimension. After passing through

N

layers of the Transformer decoder, the retrieved word embeddings are fed into a linear layer, and then the word probabilities are obtained through a SoftMax activation function.

3.4. Loss Function

The model employs a cross-entropy loss function to compute the loss between the predicted sentence and ground-truth sentence. Given a target ground-truth caption sequence and the predicted sequence from model, the objective of the loss function

L_{cap}

is to minimize the sum of the negative log-likelihoods of correctly predicted words. Thus, the loss function can be formulated as:

L_{cap} = - \frac{1}{L} \sum_{l = 1}^{L} \sum_{v = 1}^{V} y_{l}^{(v)} l o g (p_{l}^{(v)})

(27)

where

L

is the total number of word tokens,

V

is the vocabulary size,

y_{l}^{(v)}

indicates the ground-truth label denoting whether the

l

-th word corresponds to the vocabulary index

V

, and

p_{l}^{(v)}

is the predicted probability of assigning the

l

-th word to the vocabulary index

V

.

4. Experiments

4.1. Experimental Setting

4.1.1. Dataset

The LEVIR-CC [3] dataset consists of 637 pairs of bi-temporal images collected from Texas, USA, with a spatial resolution of 0.5 m. After cropping, 10,077 image pairs (each sized 256 × 256 pixels) are obtained, including 5038 pairs with changes and 5039 without changes. It focuses on various changing scenes and objects, such as buildings, roads, playgrounds, and rivers. It combines each pair of images with five descriptive sentences that detail the type, location, and state of the changes, totaling 50,385 sentences. The textual descriptions have a maximum sentence length of 39 words, with an average of 7.99 words per sentence. The LEVIR-MCI [26] dataset is an extension of the change captioning dataset LEVIR-CC. For each changed image pair, it annotates the changed areas at the pixel level and includes a generated mask file, highlighting the changed roads and changed buildings.

The WHU-CDC [27] dataset is constructed from high-resolution (0.075 m/pixel) aerial imagery captured in Christchurch, New Zealand, with original image dimensions of 32,507 × 15,354 pixels covering rural, residential, cultural, and industrial areas. The dataset comprises 7434 cropped image pairs (each sized 256 × 256 pixels), where each pair is annotated with five descriptive captions, yielding a total of 37,170 textual descriptions with a vocabulary size of 327 unique words.

4.1.2. Evaluation Metrics

To evaluate the performance of the proposed model, we adopt several commonly used evaluation metrics, including BLEU-N (N = 1, 2, 3, 4) [28], METEOR [29], ROUGE-L [30], CIDEr-D [31], and

S_{m}^{*}

[32]. These metrics are primarily used to measure the similarity between the generated descriptions and the reference descriptions, thus serving as standards for assessing the quality of the descriptions.

BLEU-N: measures the quality of generated text by calculating the n-gram level matching between the candidate text and reference text. A higher BLEU score indicates stronger similarity in n-gram matching between the generated text and the reference text. We compute the metric for N = 1, 2, 3, and 4.

ROUGE-L: evaluates the similarity between the generated text and reference text by calculating the length of the Longest Common Subsequence (LCS). A higher score indicates a better match in terms of structure and word order between the generated text and the reference text.

METEOR: provides a more comprehensive measure of the similarity between machine-generated text and reference text by incorporating synonym matching, stemming, and word order matching. The higher the score, the better the match.

CIDEr-D: measures the relevance between the generated text and human reference text by calculating the term frequency–inverse document frequency (TF-IDF) weighted n-gram similarity. Compared to traditional evaluation metrics like BLEU and METEOR, CIDEr places more emphasis on semantic consistency and diversity in descriptions, offering a more accurate reflection of text quality. A higher CIDEr score indicates greater similarity and a higher quality of the generated description relative to the reference description.

S_{m}^{*}

: is a comprehensive evaluation indicator that comprehensively evaluates the accuracy of the generated text at the syntactic and semantic levels by calculating the average of the four indicators BLEU-4, METEOR, ROUGE-L, and CIDEr.

4.1.3. Implementation Details

The proposed network is implemented using PyTorch version 2.0.1 on an Nvidia GeForce GTX 4090Ti GPU. The backbone network uses the SegFormer Mit-b1 model, which is pretrained on ImageNet-1k. It uses a hierarchical Transformer encoder, achieving a global receptive field through self-attention mechanisms, without relying on deep stacking or dilated convolutions like CNNs. The model is optimized using the Adam optimizer, with learning rates set to 1 × 10⁻⁴ for both the encoder and decoder. Early stopping is used during training, where the training process stops early if the BLEU-4 score on the validation set does not improve for 10 epochs. The maximum number of epochs is set to 50, and the batch size is set to 64. The word embedding size is fixed at 512. The maximum sentence length for each paragraph is set to 41. The number of attention heads in the multi-head attention mechanism is set to eight.

4.2. Comparative Analysis

To demonstrate the advancement of the proposed method, this section compares it against classical and recent State-of-the-Art approaches in the RSICC task, including RSICCformer [3], PSNet [2], PromptCC [1], Chg2Cap [4], SparseFocus [33], RSCaMa [34], SEN [35], SFEN [36], KCFI [37], MaskApproxNet [38], SAT-Cap [39], Change3D [40], RDD + ACR [41], MCINet [26], and CTMTNet [42]. These methods all adopt the encoder–decoder model architecture, and the selected methods focus on the improvement of differential features and the enhancement of image features. Table 1, Table 2 and Table 3 show the detailed comparison results of the proposed method and the baseline methods on the LEVIR-CC, LEVIR-MCI, and WHU-CDC datasets. All the scores are expressed as percentages (%). Bold indicates the best value.

Among the models used in the comparative experiments, RSICCformer, PromptCC, Chg2Cap, SparseFocus, SEN, and SAT-Cap aim to enhance the original deep features extracted by the backbone network. As shown in Table 1, Table 2 and Table 3, the proposed method achieves the best performance in most indicators. Specifically, RSICCformer achieved the lowest scores in the LEVIR-CC dataset, indicating that simply using a dual-branch Transformer encoder fails to enhance feature representations sufficiently for this task. SparseFocus introduced some improvements to the Transformer components, but its lightweight design also limited further improvements in accuracy. Consequently, its performance on the smaller-scale WHU-CDC dataset is substantially inferior to that of RSICCformer. SEN and PromptCC use pre-trained extractors and Large Language Models (LLMs) for feature extraction and embedding, respectively, but their accuracy still needs to be improved on both datasets. Both Chg2Cap and SAT-Cap employ an attention model combined with a Transformer to locate change-relevant features. Chg2Cap achieves 1.38% and 2.44% improvement on

S_{m}^{*}

metric on the two datasets compared to the dual-branch Transformer encoder, and SAT-Cap achieves the highest METEOR and CIDEr score in the LEVIR-CC dataset. This indicates that guiding the model’s focus toward potential regions of interest using attention mechanisms before applying Transformer modules can result in better localization performance. Therefore, the CST module proposed in this study integrates both spatial feature localization and attention-guided modeling strategies, achieving superior evaluation results.

Among the models used in the comparative experiments, PSNet, RSCaMa, SFEN, KCFI, and RDD + ACR mainly focus on maximizing the utilization of differences between bi-temporal features by designing various differential computation modules to enhance feature representation capabilities. PSNet, built entirely on a Transformer architecture, progressively fuses multi-level differential features for multi-scale difference interactions, resulting in a 4.32% improvement in the BLEU-4 score. Similarly, the SFEN model achieved a 6.88% BLEU-4 increase by introducing multi-scale semantic information from feature maps to enhance bi-temporal features. This demonstrates that multi-scale semantic difference information is an effective way to help models learn key details in deep features while filtering out redundant information. Therefore, the symmetric difference localization module proposed in this paper also incorporates multi-scale difference features as a means of enhancing differential feature representations. The RDD + ACR framework employs learnable vectors to progressively extract global difference features, while RSCaMa and KCFI focus on enhancing key feature perception at higher semantic levels, offering distinct perspectives on differential feature integration. Similarly, the proposed symmetric difference localization module aims to strengthen symmetric change feature modeling and achieves approximately 1% improvement in BLEU-4 score compared to these two models, which strongly validates the effectiveness and advancement of our approach.

MaskApproxNet and Change3D adapt methodologies from diffusion models and video encoding, respectively, to address the RSICC task, offering novel research perspectives. However, their accuracy remains suboptimal. Both CTMTNet and MCINet jointly perform CC and CD tasks, as demonstrated in Table 2, where the CD task results are utilized to supplement and enhance the CC task performance. In contrast, our proposed method achieves the best performance on BLEU-N metrics without incorporating an auxiliary CD task while maintaining highly competitive results across other evaluation metrics, including METEOR, ROUGE-L, and CIDEr.

4.3. Ablation Experiment

This study designs three sets of ablation experiments to validate the innovation and effectiveness of each proposed module, focusing on the backbone network, the CST module, and the SDL module.

To evaluate the performance of different backbone networks within the proposed method, this study selects ResNet34, ResNet50, and ResNet101 as backbones. Table 4 shows the corresponding experimental results. It can be observed that as the depth of the ResNet architecture increases, the model achieves higher scores. However, compared to using the SegFormer network, these models do not achieve more competitive results. This is because SegFormer is specifically designed with a focus on image feature extraction and semantic segmentation tasks, making it better suited for fine-tuning the differential feature maps required by the proposed method.

Table 5 shows the ablation experiment results of the CST module. Removing the CST module significantly reduces the accuracy of the CTSD-Net model. Specifically, BLEU-4, METEOR, and ROUGE-L scores each decrease by approximately 4%, and the CIDEr score declines by 5.5%. We also verify the impact of different attention mechanisms on the performance of this module. Compared with window-based attention (WA), global attention (GA) results also decrease to a certain extent, indicating that WA is better at capturing local surface details and can improve the model’s ability to capture contextual features of the global image. In the design of the CST, the W-MSA mechanism is incorporated to enhance local feature extraction, working in conjunction with standard Transformer layers to form a deep feature refinement pipeline. Experimental results demonstrate that removing the Transformer module eliminates the essential iterative feature optimization process. This ablation leads to significant degradation in model performance. In contrast, the model maintains considerable accuracy without employing the W-MSA mechanism.

Table 6 shows the ablation experiment results of the SDL module, where L-1 denotes using only the top-level features for differential feature extraction, L-2 to L-4 progressively incorporate intermediate features, and L-5 utilizes all hierarchical features. Compared to the baseline without SDL, L-1 yields about a 1% improvement in BLEU-4 score, demonstrating that guiding the model’s attention toward changed objects via differential features is an effective strategy. The results of L-2 and L-3 show comparable performance with minor fluctuations, suggesting that mid-level features contribute similarly to change detection. However, when L-4 introduces shallow features, the model achieves consistent gains across all metrics, indicating that fine-grained spatial details from early layers enhance the model’s ability to perceive changing areas. The performance of L-5 aligns with SDL’s core design principle: hierarchical fusion of differential features guides the model to focus on changed regions across varying granularities.

4.4. Complexity Analysis

Table 7 shows the model’s parameter counts and floating-point operations (FLOPs) under different module conditions. As shown in the first two rows, using SegFormer as the backbone network can reduce both parameter count and computational complexity while improving performance. The SDL module has more parameters and is more computationally complex than the CST module because it uses multi-level differential features in its computation. However, it significantly enhances model performance, achieving superior accuracy over the baseline model, which we consider an acceptable trade-off.

4.5. Visualization

Figure 7 shows the visualization of the feature maps and examples of change descriptions generated by the proposed model on the experimental dataset. The last convolutional layer collects the feature maps before feeding them into the decoder. The red and yellow regions represent high-response regions, which contribute the most to the prediction results of the model, while the blue regions are low-response regions, indicating that these regions have little influence or are irrelevant to the current prediction. The distribution of the feature maps closely aligns with the distribution of the change masks. The proposed model demonstrates a strong ability to focus on and accurately locate changed regions, with minimal interference from unchanged areas. Furthermore, in the prediction results, the model generates more detailed change descriptions, and there is a clear correlation between the detected change regions and the predicted keywords.

5. Conclusions

This study proposes a novel cross-spatial Transformer and symmetric difference localization network (CTSD-Net) for the RSICC task. To enhance the model’s descriptive capability, we consider that spatial feature mapping learning is more effective than learning individual features in parallel. Therefore, a Cross-Space Transformer encoding algorithm is designed to perceive the location of image interest regions, enhancing the filtering of background noise, shadow occlusion, and other complex terrain-related interferences in remote sensing images. Traditional differential computation methods fail to produce stable change mappings, so we designed a symmetric differential module to extract symmetric change mapping relationships and explore semantic change representations. By using multi-scale aggregation and bidirectional computation, we fused the differential features before and after change at different scales, successfully focusing on the pixels that changed before and after the event. The model achieves competitive results and excellent performance on two public datasets. This research provides technical support for the natural language understanding of change images and can offer users more intuitive references for change image forensics.

Author Contributions

Data curation, Z.L.; investigation, H.Y.; methodology, R.W.; software, H.Y., X.L. and C.S.; supervision, J.W.; visualization, X.L.; writing—original draft, R.W. and C.S.; writing—review and editing, R.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Natural Science Foundation of China, grant number 42071431.

Data Availability Statement

The LEVIR-CC dataset can be obtained from https://github.com/Chen-Yang-Liu/RSICC (accessed on 28 October 2023). The WHU-CDC dataset can be obtained from https://huggingface.co/datasets/hygge10111/RS-CDC (accessed on 31 October 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

Liu, C.; Zhao, R.; Chen, J.; Qi, Z.; Zou, Z.; Shi, Z. A Decoupling Paradigm with Prompt Learning for Remote Sensing Image Change Captioning. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5622018. [Google Scholar] [CrossRef]
Liu, C.; Yang, J.; Qi, Z.; Zou, Z.; Shi, Z. Progressive Scale-Aware Network for Remote Sensing Image Change Captioning. In Proceedings of the IGARSS 2023—2023 IEEE International Geoscience and Remote Sensing Symposium, Pasadena, CA, USA, 16–21 July 2023. [Google Scholar]
Liu, C.; Zhao, R.; Chen, H.; Zou, Z.; Shi, Z. Remote Sensing Image Change Captioning With Dual-Branch Transformers: A New Method and a Large Scale Dataset. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–20. [Google Scholar] [CrossRef]
Chang, S.; Ghamisi, P. Changes to Captions: An Attentive Network for Remote Sensing Change Captioning. IEEE Trans. Image Process. 2023, 32, 6047–6060. [Google Scholar] [CrossRef] [PubMed]
Cai, C.; Wang, Y.; Yap, K.-H. Interactive Change-Aware Transformer Network for Remote Sensing Image Change Captioning. Remote Sens. 2023, 15, 5611. [Google Scholar] [CrossRef]
Park, D.H.; Darrell, T.; Rohrbach, A. Robust Change Captioning. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
Tu, Y.; Li, L.; Yan, C.; Gao, S.; Yu, Z. R³Net:Relation-Embedded Representation Reconstruction Network for Change Captioning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021. [Google Scholar]
Qu, B.; Li, X.; Tao, D.; Lu, X. Deep Semantic Understanding of High Resolution Remote Sensing Image. In Proceedings of the 2016 International Conference on Computer, Information and Telecommunication Systems (CITS), Kunming, China, 6–8 July 2016; pp. 1–5. [Google Scholar]
Ordonez, V.; Kulkarni, G.; Berg, T. Im2Text: Describing Images Using 1 Million Captioned Photographs. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; Curran Associates, Inc.: Red Hoo, NY, USA, 2011; Volume 24. [Google Scholar]
Yang, Y.; Teo, C.; Daumé, H., III; Aloimonos, Y. Corpus-Guided Sentence Generation of Natural Images. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, Edinburgh, UK, 27–29 July 2011; Barzilay, R., Johnson, M., Eds.; Association for Computational Linguistics: Edinburgh, UK, 2011; pp. 444–454. [Google Scholar]
Li, S.; Kulkarni, G.; Berg, T.L.; Berg, A.C.; Choi, Y. Composing Simple Image Descriptions Using Web-Scale N-Grams. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning, Portland, OR, USA, 23–24 June 2011; Goldwater, S., Manning, C., Eds.; Association for Computational Linguistics: Portland, OR, USA, 2011; pp. 220–228. [Google Scholar]
Chouaf, S.; Hoxha, G.; Smara, Y.; Melgani, F. Captioning Changes in Bi-Temporal Remote Sensing Images. In Proceedings of the 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS, Brussels, Belgium, 11–16 July 2021; pp. 2891–2894. [Google Scholar]
Hoxha, G.; Chouaf, S.; Melgani, F.; Smara, Y. Change Captioning: A New Paradigm for Multitemporal Remote Sensing Image Analysis. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–14. [Google Scholar] [CrossRef]
Wang, Y.; Zhang, W.; Zhang, Z.; Gao, X.; Sun, X. Multiscale Multiinteraction Network for Remote Sensing Image Captioning. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 2154–2165. [Google Scholar] [CrossRef]
Huang, W.; Wang, Q.; Li, X. Denoising-Based Multiscale Feature Fusion for Remote Sensing Image Captioning. IEEE Geosci. Remote Sens. Lett. 2021, 18, 436–440. [Google Scholar] [CrossRef]
Li, Y.; Fang, S.; Jiao, L.; Liu, R.; Shang, R. A Multi-Level Attention Model for Remote Sensing Image Captions. Remote Sens. 2020, 12, 939. [Google Scholar] [CrossRef]
Jhamtani, H.; Berg-Kirkpatrick, T. Learning to Describe Differences Between Pairs of Similar Images. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018. [Google Scholar]
Yue, S.; Tu, Y.; Li, L.; Yang, Y.; Gao, S.; Yu, Z. I3N: Intra- and Inter-Representation Interaction Network for Change Captioning. IEEE Trans. Multimed. 2023, 25, 8828–8841. [Google Scholar] [CrossRef]
Hosseinzadeh, M.; Wang, Y. Image Change Captioning by Learning from an Auxiliary Task. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 2724–2733. [Google Scholar]
Yao, L.; Wang, W.; Jin, Q. Image Difference Captioning with Pre-Training and Contrastive Learning. In Proceedings of the AAAI Conference on Artificial Intelligence 2022, Online, 22 February–1 March 2022. [Google Scholar]
Guo, Z.; Wang, T.-J.J.; Laaksonen, J. CLIP4IDC: CLIP for Image Difference Captioning. arXiv 2022, arXiv:2206.00629. [Google Scholar]
Qiu, Y.; Yamamoto, S.; Nakashima, K.; Suzuki, R.; Iwata, K.; Kataoka, H.; Satoh, Y. Describing and Localizing Multiple Changes with Transformers. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021. [Google Scholar]
Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. In Proceedings of the NIPS’21: 35th International Conference on Neural Information Processing System, New Orleans, LA, USA, 6–14 December 2021. [Google Scholar]
Voigtlaender, P.; Luiten, J.; Torr, P.H.S.; Leibe, B. Siam R-CNN: Visual Tracking by Re-Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. arXiv 2019, arXiv:1810.04805. [Google Scholar]
Liu, C.; Chen, K.; Zhang, H.; Qi, Z.; Zou, Z.; Shi, Z. Change-Agent: Toward Interactive Comprehensive Remote Sensing Change Interpretation and Analysis. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5635616. [Google Scholar] [CrossRef]
Ji, S.; Wei, S.; Lu, M. Fully Convolutional Networks for Multisource Building Extraction From an Open Aerial and Satellite Imagery Data Set. IEEE Trans. Geosci. Remote Sens. 2019, 57, 574–586. [Google Scholar] [CrossRef]
Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.-J. Bleu: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 7–12 July 2002; Isabelle, P., Charniak, E., Lin, D., Eds.; Association for Computational Linguistics: Philadelphia, PA, USA, 2002; pp. 311–318. [Google Scholar]
Lavie, A.; Agarwal, A. METEOR: An Automatic Metric for MT Evaluation with High Levels of Correlation with Human Judgments. In Proceedings of the Second Workshop on Statistical Machine Translation, Prague, Czech Republic, 23 June 2007; Callison-Burch, C., Koehn, P., Fordyce, C.S., Monz, C., Eds.; Association for Computational Linguistics: Prague, Czech Republic, 2007; pp. 228–231. [Google Scholar]
Lin, C.-Y.; Hovy, E. Manual and Automatic Evaluation of Summaries. In Proceedings of the ACL-02 Workshop on Automatic Summarization, Phildadelphia, PA, USA, 11 July 2002; Association for Computational Linguistics: Phildadelphia, PA, USA, 2002; pp. 45–51. [Google Scholar]
Vedantam, R.; Zitnick, C.L.; Parikh, D. CIDEr: Consensus-Based Image Description Evaluation. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 15 October 2015. [Google Scholar]
Liu, C.; Zhang, J.; Chen, K.; Wang, M.; Zou, Z.; Shi, Z. Remote Sensing Spatio-Temporal Vision-Language Models: A Comprehensive Survey. arXiv 2025, arXiv:2412.02573. [Google Scholar]
Sun, D.; Bao, Y.; Liu, J.; Cao, X. A Lightweight Sparse Focus Transformer for Remote Sensing Image Change Captioning. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 18727–18738. [Google Scholar] [CrossRef]
Liu, C.; Chen, K.; Chen, B.; Zhang, H.; Zou, Z.; Shi, Z. RSCaMa: Remote Sensing Image Change Captioning With State Space Model. IEEE Geosci. Remote Sens. Lett. 2024, 21, 1–5. [Google Scholar] [CrossRef]
Zhou, Q.; Gao, J.; Yuan, Y.; Wang, Q. Single-Stream Extractor Network With Contrastive Pre-Training for Remote-Sensing Change Captioning. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–14. [Google Scholar] [CrossRef]
Zhang, F.; Zhang, W.; Xia, K.; Feng, H. Scale-Wised Feature Enhancement Network for Change Captioning of Remote Sensing Images. Int. J. Remote Sens. 2024, 45, 5845–5869. [Google Scholar] [CrossRef]
Yang, C.; Li, Z.; Jiao, H.; Gao, Z.; Zhang, L. Enhancing Perception of Key Changes in Remote Sensing Image Change Captioning. arXiv 2024, arXiv:2409.12612. [Google Scholar]
Sun, D.; Yao, J.; Zhou, C.; Cao, X.; Ghamisi, P. Mask Approximation Net: A Novel Diffusion Model Approach for Remote Sensing Change Captioning. arXiv 2025, arXiv:2412.19179. [Google Scholar]
Wang, Y.; Yu, W.; Ghamisi, P. Change Captioning in Remote Sensing: Evolution to SAT-Cap—A Single-Stage Transformer Approach. arXiv 2025, arXiv:2501.08114. [Google Scholar]
Zhu, D.; Huang, X.; Huang, H.; Zhou, H.; Shao, Z. Change3D: Revisiting Change Detection and Captioning from A Video Modeling Perspective. arXiv 2025, arXiv:2503.18803. [Google Scholar]
Li, R.; Li, L.; Zhang, J.; Zhao, Q.; Wang, H.; Yan, C. Region-Aware Difference Distilling with Attribute-Guided Contrastive Regularization for Change Captioning. Proc. AAAI Conf. Artif. Intell. 2025, 39, 4887–4895. [Google Scholar] [CrossRef]
Shi, J.; Zhang, M.; Hou, Y.; Zhi, R.; Liu, J. A Multitask Network and Two Large-Scale Datasets for Change Detection and Captioning in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–17. [Google Scholar] [CrossRef]

Figure 1. Example of remote sensing image change captioning data. It takes bi-temporal images as input and outputs a description of the changes between images.

Figure 2. The illustration of the proposed CTSD-Net. The model is composed of three components: (a) a weight-sharing SegFormer model for feature extraction, (b) the CST and SDL modules that facilitate feature fusion and enhancement, and (c) the causal Transformer decoder.

⨁

is element-wise addition,

⊖

Figure 2. The illustration of the proposed CTSD-Net. The model is composed of three components: (a) a weight-sharing SegFormer model for feature extraction, (b) the CST and SDL modules that facilitate feature fusion and enhancement, and (c) the causal Transformer decoder.

⨁

is element-wise addition,

⊖

Figure 3. The architecture of the CST module. SA represents spatial attention.

⨁

is element-wise addition.

Figure 3. The architecture of the CST module. SA represents spatial attention.

⨁

is element-wise addition.

Figure 4. The architecture of the SDL module. It takes bidirectional multi-level features as input and outputs differential characteristics.

Figure 5. The architecture of the causal Transformer decoder.

Figure 6. The working principal diagram of the causal Transformer decoder. (a) is a visualization of the causal mask rule, the model can only view tokens that have appeared so far while preventing access to any subsequent tokens. (b) is the alignment logic between the input tokens and the ground truth during decoding.

Figure 7. The visualization results of the models. T1 and T2 correspond to the bi-temporal images. Mask is the pixel-level annotation of the changed area. Feature Map is the heatmap generated from the output of the model’s last convolutional layer. The red areas indicate high-response regions, meaning these regions contribute the most to the model’s prediction results.

Table 1. Evaluation results with baseline models on the LEVIR-CC dataset. Bold indicates the highest value of the corresponding indicator.

Methods	Time	BLEU-1	BLEU-2	BLEU-3	BLEU-4	METEOR	ROUGE-L	CIDEr	$S_{m}^{*}$
RSICCformer	2022.11	84.72	76.27	68.87	62.77	39.61	74.12	134.12	77.65
PSNet	2023.07	83.86	75.13	67.89	62.11	38.80	73.60	132.62	76.78
PromptCC	2023.10	83.66	75.73	69.10	63.54	38.82	73.72	136.44	78.13
Chg2Cap	2023.11	86.14	78.08	70.66	64.39	40.03	75.12	136.61	79.03
SparseFocus	2024.05	84.56	75.87	68.64	62.87	39.93	74.69	137.05	78.64
RSCaMa	2024.05	85.79	77.99	71.04	65.24	39.91	75.24	136.56	79.24
SEN	2024.05	85.10	77.05	70.01	64.09	39.59	74.57	136.02	78.68
SFEN	2024.07	85.20	78.91	70.96	64.67	40.12	75.22	136.47	79.12
KCFI	2024.09	86.34	77.31	70.89	65.30	39.42	75.47	138.25	79.61
MaskApproxNet	2024.12	85.90	77.12	70.72	64.32	39.91	75.67	137.70	79.40
SAT-Cap	2025.01	86.14	78.19	71.44	65.82	40.51	75.37	140.23	80.48
Change3D	2025.03	85.81	77.81	70.57	64.38	40.03	75.12	138.29	79.46
RDD + ACR	2025.04	-	-	-	65.60	40.30	75.50	138.30	79.93
Ours	-	86.75	78.94	72.06	66.32	40.34	75.70	140.22	80.65

Table 2. Evaluation results with baseline models on the LEVIR-MCI dataset. Bold indicates the highest value of the corresponding indicator.

Methods	Time	BLEU-1	BLEU-2	BLEU-3	BLEU-4	METEOR	ROUGE-L	CIDEr	$S_{m}^{*}$
MCINet	2024.03	86.68	78.75	71.74	65.95	40.80	75.96	140.29	79.48
CTMTNet	2024.11	85.16	77.24	70.13	64.16	39.34	74.03	134.52	78.01
Ours	-	86.75	78.94	72.06	66.32	40.34	75.70	140.22	80.65

Table 3. Evaluation results with baseline models on the WHU-CDC dataset. Bold indicates the highest value of the corresponding indicator.

Methods	Time	BLEU-1	BLEU-2	BLEU-3	BLEU-4	METEOR	ROUGE-L	CIDEr	$S_{m}^{*}$
RSICCformer	2022.11	80.05	74.24	69.61	66.54	42.65	73.91	133.44	79.14
PSNet	2023.07	81.26	73.25	65.78	60.32	36.97	71.60	130.52	74.85
PromptCC	2023.10	81.12	73.96	37.22	61.45	36.99	71.88	134.50	76.21
SparseFocus	2024.05	81.17	72.90	66.06	60.27	37.34	72.63	134.64	76.22
Chg2Cap	2024.05	78.93	72.64	67.20	62.71	41.46	77.95	144.18	81.58
SEN	2024.05	80.60	74.64	67.69	61.97	36.76	71.70	133.57	76.00
CTMTNet	2024.11	83.56	77.66	72.76	69.00	45.39	79.23	149.40	85.76
MaskApproxNet	2024.12	81.34	75.68	71.16	67.73	43.89	75.41	135.31	80.59
Ours	-	85.49	80.07	76.45	73.84	47.75	80.99	153.96	89.14

Table 4. Ablation experiment results of the backbone network. Bold indicates the highest value of the corresponding indicator.

Backbone	BLEU-1	BLEU-2	BLEU-3	BLEU-4	METEOR	ROUGE-L	CIDEr
ResNet34	76.74	64.02	57.46	53.72	33.13	67.18	113.60
ResNet50	80.38	72.11	65.08	59.28	36.93	70.76	125.37
ResNet101	84.18	76.27	68.95	62.83	39.41	74.40	134.37
CTSD-Net	86.75	78.94	72.06	66.32	40.34	75.70	140.22

Table 5. Ablation experiment results of the CST module. GA denotes global attention, WA denotes window-based attention, and Tf denotes Transformer. Bold indicates the highest value of the corresponding indicator.

Method	CST			BLEU-1	BLEU-2	BLEU-3	BLEU-4	METEOR	ROUGE-L	CIDEr
Method	GA	WA	Tf	BLEU-1	BLEU-2	BLEU-3	BLEU-4	METEOR	ROUGE-L	CIDEr
CTSD-Net	×	×	×	81.62	74.34	67.66	62.61	36.82	71.71	124.72
	√	×	√	84.90	76.24	68.90	64.00	39.82	74.65	133.62
	×	√	×	82.86	73.43	65.86	59.96	38.46	72.92	128.65
	×	×	√	83.08	74.96	67.47	62.64	38.92	73.58	131.32
	×	√	√	86.75	78.94	72.06	66.32	40.34	75.70	140.22

Table 6. Ablation experiment results of the SDL module. Bold indicates the highest value of the corresponding indicator.

Method	SDL Level	BLEU-1	BLEU-2	BLEU-3	BLEU-4	METEOR	ROUGE-L	CIDEr
CTSD-Net	×	83.10	75.39	68.63	63.33	37.61	72.29	127.91
	L-1	84.98	77.35	70.63	64.16	39.75	74.74	135.78
	L-2	84.70	76.97	70.75	64.01	39.43	74.02	135.03
	L-3	84.55	77.11	70.84	64.85	39.47	74.40	135.52
	L-4	85.75	77.30	71.97	65.11	40.40	75.45	137.20
	L-5	86.75	78.94	72.06	66.32	40.34	75.70	140.22

Table 7. Numbers of model parameters and FLOPs.

Method	ResNet101	SegFormer	CST	SDL	Params (M)	FLOPs
CTSD-Net	√	×	×	×	42.50	11.61
	×	√	×	×	13.70	4.50
	×	√	√	×	24.63	16.04
	×	√	√	√	99.73	40.83

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Wu, R.; Ye, H.; Liu, X.; Li, Z.; Sun, C.; Wu, J. A Cross-Spatial Differential Localization Network for Remote Sensing Change Captioning. Remote Sens. 2025, 17, 2285. https://doi.org/10.3390/rs17132285

AMA Style

Wu R, Ye H, Liu X, Li Z, Sun C, Wu J. A Cross-Spatial Differential Localization Network for Remote Sensing Change Captioning. Remote Sensing. 2025; 17(13):2285. https://doi.org/10.3390/rs17132285

Chicago/Turabian Style

Wu, Ruijie, Hao Ye, Xiangying Liu, Zhenzhen Li, Chenhao Sun, and Jiajia Wu. 2025. "A Cross-Spatial Differential Localization Network for Remote Sensing Change Captioning" Remote Sensing 17, no. 13: 2285. https://doi.org/10.3390/rs17132285

APA Style

Wu, R., Ye, H., Liu, X., Li, Z., Sun, C., & Wu, J. (2025). A Cross-Spatial Differential Localization Network for Remote Sensing Change Captioning. Remote Sensing, 17(13), 2285. https://doi.org/10.3390/rs17132285

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

A Cross-Spatial Differential Localization Network for Remote Sensing Change Captioning

Abstract

1. Introduction

2. Related Works

2.1. Remote Sensing Image Caption

2.2. Remote Sensing Image Change Caption

3. Methods

3.1. Cross-Spatial Transformer

3.2. Symmetric Differential Localization

3.3. Text Decoder

3.4. Loss Function

4. Experiments

4.1. Experimental Setting

4.1.1. Dataset

4.1.2. Evaluation Metrics

4.1.3. Implementation Details

4.2. Comparative Analysis

4.3. Ablation Experiment

4.4. Complexity Analysis

4.5. Visualization

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI