1. Introduction
Monitoring and analyzing changes on the Earth’s surface provides scientific evidence and strategic guidance for achieving sustainable development. Consequently, remote sensing-based Earth observation has become a crucial tool in fields such as environmental monitoring, land planning, and disaster management. Although remote sensing image change detection (RSICD) can locate changed areas through mask outputs, it only captures changes at the pixel level and does not provide readily interpretable information for real-world use, such as descriptions of object types, locations, and change dynamics. The remote sensing image change captioning (RSICC) task aims to describe and explain the changes in the image scene with natural language. As shown in Figure 1, the RSICC task takes bi-temporal images as input, identifies the differences between them, and provides textual descriptions that state where the changes occur and what types of objects are involved. Compared with the RSICD task, it delivers more flexible results and can characterize diverse change types.
Current RSICC approaches primarily employ encoder–decoder architectures, with model designs focusing on two key aspects: image feature enhancement and differential feature refinement.
Image feature enhancement aims to reduce the scale differences of change targets, with its core focusing on the fusion and optimization of multi-scale features [1]. The design of a scale-aware enhancement module strengthens the model’s ability to perceive change objects of varying sizes [2]. For example, Liu et al. [3] proposed a dual-branch Transformer structure and used a multi-level dual-temporal fusion module to fuse multi-scale features. Chang et al. [4] designed a hierarchical self-attention module to locate features related to changes. ICT-Net [5] uses a multi-layer adaptive fusion module to reduce the effect of unimportant visual features and a cross-attention module to highlight features at different scales when generating captions. Although these studies have made significant progress, they remain limited by the constrained receptive field of CNNs and the high complexity of Transformer architectures.
Regarding improvements in differential features, Park et al. [6] proposed a simple dual dynamic attention method to highlight differences between two images, creating change maps by subtracting bi-temporal features. However, the visual and semantic relationships between objects received insufficient attention. To resolve these issues, Tu et al. [7] created self-semantic and cross-semantic relation blocks to determine change areas, leveraging their strengths to build connections both within the same modality and across modalities in change captioning models. Cai et al. [5] designed a dedicated network that uses an interactive encoder to identify important changes at different scales and a Transformer with a multi-layer fusion module to produce change descriptions. Nevertheless, these methods primarily rely on direct subtraction and global similarity to capture changing features between two remote sensing images, rather than attempting to learn interactions between bi-temporal features before change assessment.
Therefore, there are still some limitations in current research that require improvement:
(1) Simply stacking convolutional or Transformer encoder modules does not adequately achieve feature enhancement. Effectively improving the model’s spatial perception capability is crucial for capturing localized information, yet existing architectures often lack explicit mechanisms to strengthen positional awareness.
(2) Most studies focus excessively on learning bi-temporal features and then rely on simple subtraction or similarity computation to detect changes. This approach leads to two key issues: insufficient extraction of change-region features, resulting in weak discriminative signals, and retention of task-irrelevant interference, which hinders effective differential feature learning.
To address the issues mentioned above, this study proposes a novel cross-spatial Transformer and symmetric difference localization network (CTSD-Net) for RSICC. Specifically, this study first designs a cross-spatial Transformer encoder, utilizing a spatial attention mechanism to adaptively focus on the spatial positions and feature relationships of region-of-interest features before and after enhancement. Subsequently, the model employs a hierarchical differential feature integration strategy, combining multi-level difference maps to enrich both medium- and coarse-grained contextual information while effectively suppressing noise interference. The architecture uses residual-connected high-level differential features as query vectors in the multi-head attention mechanism, enabling precise learning of discriminative change representations together with bidirectional attention to pre-change and post-change characteristics. The final differential representation is synthesized by concatenating the enhanced spatial features with the optimized symmetric differential features. A causal Transformer decoder then processes this integrated representation, establishing a robust cross-modal mapping to generate linguistically accurate change descriptions.
The remainder of this study is organized as follows: Section 2 introduces related work on image captioning and image change detection in remote sensing images. Section 3 describes the structure and modules of CTSD-Net. Section 4 presents the experimental analysis and discussion of CTSD-Net. Section 5 presents our conclusions.
3. Methods
As shown in Figure 2, CTSD-Net learns the symmetric relationships of semantic changes through the differential features between bi-temporal images. Specifically, the model first employs a Siamese weight-sharing SegFormer [23] backbone to extract multi-scale features from the bi-temporal images. These features are then passed through a shared positional encoding and fed into two key modules: the cross-spatial Transformer (CST) module, which captures spatial-semantic correlations within the regions of interest, and the symmetric differential localization (SDL) module, which learns symmetric differential features. The two sets of features are concatenated and fed into the decoding module to realize the mapping between image and text and obtain the final descriptive output.
3.1. Cross-Spatial Transformer
CTSD-Net adopts the Mix Transformer (MiT) backbone from SegFormer for feature extraction and uses the outputs of stages 1–4 to obtain a more comprehensive feature representation, as shown in Figure 2a. Formally, given two input images $I_1$ and $I_2$, their extracted features can be denoted as $\{X_i^1\}$ and $\{X_i^2\}$, with multi-scale feature representations across the stages. Subsequently, learnable positional embeddings are added only to the highest-level features to reduce model capacity requirements and unnecessary computational costs. This process can be formally defined as:
$$T^{k} = \mathcal{R}\!\left(X_{\mathrm{high}}^{k}\right) + E_{\mathrm{pos}}, \quad k \in \{1, 2\},$$
where $\mathcal{R}(\cdot)$ represents feature reshaping operations, including flattening and transposing, which transform the original features into the appropriate shape required for embedding tokens, and $E_{\mathrm{pos}}$ denotes a learnable positional embedding used to retain the positional information within the model.
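To make this step concrete, the following is a minimal PyTorch sketch of the reshaping $\mathcal{R}(\cdot)$ and the learnable positional embedding $E_{\mathrm{pos}}$; it is not the authors’ implementation, and the module name, feature size, and tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class HighLevelTokenizer(nn.Module):
    """Flatten the highest-level feature map into a token sequence (the reshaping
    R(.) above) and add a learnable positional embedding (E_pos)."""
    def __init__(self, channels=512, height=8, width=8):
        super().__init__()
        self.pos_embed = nn.Parameter(torch.zeros(1, height * width, channels))

    def forward(self, feat):                          # feat: (B, C, H, W)
        tokens = feat.flatten(2).transpose(1, 2)      # flatten + transpose -> (B, H*W, C)
        return tokens + self.pos_embed                # add learnable positions

x_high = torch.randn(2, 512, 8, 8)                    # dummy highest-level feature map
tokens = HighLevelTokenizer()(x_high)                 # -> (2, 64, 512)
```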
The feature sequences $T^1$ and $T^2$ are then fed into the cross-spatial Transformer (CST) module. As illustrated in Figure 3, the CST module consists of two spatial attention (SA) blocks, one window-based multi-head self-attention (W-MSA) block, one Transformer block, two normalization layers, and two feed-forward networks (FFNs). The standard Transformer architecture employs a multi-layer stacked structure for progressive feature refinement. To enhance this framework, the proposed CST module incorporates W-MSA to effectively capture local texture patterns while maintaining computational efficiency. The subsequent Transformer layers then complement this process by integrating global contextual information with the locally enhanced features output by W-MSA. Simultaneously, the SA mechanism operates across both components to maintain global spatial-semantic modeling. This hierarchical design establishes a feature refinement pipeline that proceeds from efficient local extraction to deeper global processing, addressing the computational complexity inherent in pure global attention approaches while ensuring feature quality through carefully designed stacked modules.
The design of CST incorporates three key characteristics: (1) Dual Spatial Attention Mechanisms: Two stages of spatial attention are applied to adaptively focus on key regions before and after feature extraction, effectively filtering out background noise, shadow occlusions, and other complex artifacts in remote sensing images, thus improving the model’s robustness. (2) W-MSA: Inspired by Swin Transformer, CST partitions the encoded image features into several non-overlapping windows, within which self-attention is computed independently, thereby significantly reducing computational complexity. (3) Transformer-based Feature Interaction Modeling: The tokens obtained after the first spatial attention serve as the queries, while the tokens after W-MSA and the second spatial attention act as the keys and values. This design captures the spatial feature relationships between the original and enhanced feature sequences, further promoting effective global context modeling across the entire image.
Both W-MSA and the Transformer block utilize the MSA mechanism. The input to the MSA is a triplet (query $Q$, key $K$, and value $V$) derived from the input features $X \in \mathbb{R}^{C \times D}$, with $C$ representing the number of channels and $D$ representing the feature dimension. This process can be formulated as:
$$Q = XW_Q, \quad K = XW_K, \quad V = XW_V,$$
$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{D}}\right)V,$$
$$\mathrm{MSA}(Q, K, V) = \mathrm{Concat}\!\left(\mathrm{head}_1, \dots, \mathrm{head}_h\right)W_O, \qquad \mathrm{head}_j = \mathrm{Attention}\!\left(Q_j, K_j, V_j\right),$$
where $W_Q$, $W_K$, and $W_V$ are learnable parameters, $\mathrm{Attention}(\cdot)$ refers to the attention mechanism, and $\mathrm{MSA}(\cdot)$ refers to executing multiple self-attention operations in parallel, where $h$ denotes the number of attention heads.
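For reference, the reconstructed attention formulas above correspond to the following minimal single-head computation (head splitting and the output projection are omitted); the function and variable names are ours, not the paper’s.

```python
import torch

def scaled_dot_product_attention(x, w_q, w_k, w_v):
    """Single-head version of Attention(Q, K, V) = Softmax(Q K^T / sqrt(D)) V."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v                   # project inputs to Q, K, V
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

x = torch.randn(64, 512)                                  # 64 tokens with 512-dim features
w_q, w_k, w_v = (torch.randn(512, 64) for _ in range(3))  # random stand-ins for learnable weights
out = scaled_dot_product_attention(x, w_q, w_k, w_v)      # -> (64, 64)
```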
The complete encoding process of the CST module can be summarized as:
$$Z^{k} = \mathrm{SA}\!\left(T^{k}\right),$$
$$\hat{Z}^{k} = \mathrm{SA}\!\left(\mathrm{FFN}\!\left(\mathrm{LN}\!\left(\text{W-MSA}\!\left(Z^{k}\right)\right)\right)\right),$$
$$F^{k} = \mathrm{FFN}\!\left(\mathrm{LN}\!\left(\mathrm{Transformer}\!\left(Z^{k}, \hat{Z}^{k}, \hat{Z}^{k}\right)\right)\right), \quad k \in \{1, 2\},$$
where $\mathrm{LN}(\cdot)$ is layer normalization and $\mathrm{FFN}(\cdot)$ is a feed-forward network, which can be expressed as $\mathrm{FFN}(x) = \mathrm{Dropout}\!\left(\mathrm{FC}_2\!\left(\sigma\!\left(\mathrm{FC}_1(x)\right)\right)\right)$. It consists of two fully connected layers, a nonlinear activation function $\sigma(\cdot)$, and a dropout function $\mathrm{Dropout}(\cdot)$. $\mathrm{Transformer}(\cdot)$ is a standard Transformer module.
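The sketch below is a rough, simplified illustration of how the CST components described above could be composed in PyTorch: a per-token spatial gate, window-based self-attention, a second gate, and a cross-attention step in which the first-stage tokens serve as queries and the window-enhanced, re-gated tokens serve as keys and values. It is an interpretation under assumed window size, dimensions, and normalization placement, not the authors’ code.

```python
import torch
import torch.nn as nn

class TokenSpatialAttention(nn.Module):
    """Per-position gate over a token sequence, computed from channel-wise
    mean and max statistics (a simple stand-in for spatial attention)."""
    def __init__(self):
        super().__init__()
        self.gate = nn.Linear(2, 1)

    def forward(self, x):                            # x: (B, N, C)
        stats = torch.stack([x.mean(-1), x.amax(-1)], dim=-1)   # (B, N, 2)
        return x * torch.sigmoid(self.gate(stats))              # gated tokens

class WindowSelfAttention(nn.Module):
    """Self-attention computed independently inside non-overlapping windows."""
    def __init__(self, dim, heads=8, window=4):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, grid):                      # x: (B, H*W, C), grid = (H, W)
        b, n, c = x.shape
        h, w = grid
        ws = self.window
        x = x.view(b, h // ws, ws, w // ws, ws, c).permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(-1, ws * ws, c)                # (B * windows, ws*ws, C)
        x, _ = self.attn(x, x, x)
        x = x.view(b, h // ws, w // ws, ws, ws, c).permute(0, 1, 3, 2, 4, 5)
        return x.reshape(b, n, c)

class CSTBlock(nn.Module):
    """Sketch of the CST idea: first-stage gated tokens query the
    window-enhanced, re-gated tokens through cross-attention."""
    def __init__(self, dim=512, heads=8, window=4):
        super().__init__()
        self.sa1, self.sa2 = TokenSpatialAttention(), TokenSpatialAttention()
        self.wmsa = WindowSelfAttention(dim, heads, window)
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim), nn.Dropout(0.1))

    def forward(self, tokens, grid=(8, 8)):
        q = self.sa1(tokens)                                   # queries from the first gate
        kv = self.sa2(self.wmsa(self.norm1(q), grid))          # keys/values after W-MSA + second gate
        out, _ = self.cross(q, kv, kv)                         # original vs. enhanced tokens
        out = self.norm2(out + q)
        return out + self.ffn(out)

feats = torch.randn(2, 64, 512)                                # tokens from an 8x8 feature map
enhanced = CSTBlock()(feats)                                   # -> (2, 64, 512)
```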
3.2. Symmetric Differential Localization
The semantic representation of image differences is crucial for guiding the model to learn the change features between bi-temporal images. However, directly computing the difference between two feature maps or applying point-wise operations after differencing is insufficient to fully capture the complex characteristics of image changes. To address this limitation, we propose a novel symmetric differential localization (SDL) module, which operates in parallel with the CST module. The SDL module aggregates multi-scale differential features and employs a self-attention mechanism to learn the semantic relationships surrounding objects and filter out irrelevant and unchanged feature representations. This design enables enhanced differential interaction between the bi-temporal features.
Figure 4 illustrates the composition of the module. For the multi-level features $X_i^1$ and $X_i^2$, where $i$ denotes the feature levels from 1 to 5, the SDL module computes bidirectional differential features (forward change: $X^1 \rightarrow X^2$ and backward change: $X^2 \rightarrow X^1$) as complementary inputs, uses convolutions with kernel sizes of 5 and 7 to enlarge the receptive field of the features, and fuses the features layer by layer through downsampling and normalization operations. Specifically, this process can be expressed as:
$$D_i^{1\rightarrow2} = \mathrm{Concat}\!\left(\mathrm{Conv}_{5}\!\left(X_i^{2} - X_i^{1}\right),\ \mathrm{Conv}_{7}\!\left(X_i^{2} - X_i^{1}\right)\right),$$
$$D_i^{2\rightarrow1} = \mathrm{Concat}\!\left(\mathrm{Conv}_{5}\!\left(X_i^{1} - X_i^{2}\right),\ \mathrm{Conv}_{7}\!\left(X_i^{1} - X_i^{2}\right)\right),$$
$$D^{1\rightarrow2} = \sum_{i} \mathrm{Down}\!\left(D_i^{1\rightarrow2}\right), \qquad D^{2\rightarrow1} = \sum_{i} \mathrm{Down}\!\left(D_i^{2\rightarrow1}\right),$$
where $\mathrm{Conv}_{k}(\cdot)$ denotes convolutional operations with corresponding kernel sizes, and $\mathrm{Concat}(\cdot)$ represents the tensor concatenation operation. The differential features from different levels are first downsampled through convolution, denoted $\mathrm{Down}(\cdot)$, and then aggregated by summation. The model obtains a more informative and semantically enriched differential feature map by fusing multi-scale differential features. Additionally, the combined multi-scale differential features help refine the high-level differential features, allowing the model to better concentrate on the areas where real changes occur. Specifically, the SDL module first applies average pooling to the two multi-scale differential features $D^{1\rightarrow2}$ and $D^{2\rightarrow1}$ to enhance their representations:
$$\hat{D}^{k} = D^{k} \odot \sigma\!\left(\mathrm{BN}\!\left(\mathrm{AvgPool}\!\left(\mathrm{GELU}\!\left(D^{k}\right)\right)\right)\right), \quad k \in \{1\rightarrow2,\ 2\rightarrow1\},$$
where $\mathrm{GELU}(\cdot)$ is the Gaussian error linear unit activation function, $\sigma(\cdot)$ is the Sigmoid activation function, $\mathrm{AvgPool}(\cdot)$ is the average pooling function, and $\mathrm{BN}(\cdot)$ is batch normalization.
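As a minimal sketch of the multi-scale bidirectional differencing and the enhancement gate reconstructed above: four backbone levels with MiT-like channel widths are assumed for illustration, pooling is used for the level-wise downsampling for brevity (the paper describes convolutional downsampling), and all class and variable names are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleDifference(nn.Module):
    """Bidirectional level-wise differences, enlarged receptive fields via 5x5
    and 7x7 convolutions, then resized to a common scale and summed."""
    def __init__(self, channels, out_channels=256, out_size=8):
        super().__init__()
        self.out_size = out_size
        self.conv5 = nn.ModuleList([nn.Conv2d(c, out_channels, 5, padding=2) for c in channels])
        self.conv7 = nn.ModuleList([nn.Conv2d(c, out_channels, 7, padding=3) for c in channels])
        self.bn = nn.BatchNorm2d(2 * out_channels)

    def fuse(self, diffs):
        total = 0
        for i, d in enumerate(diffs):
            d = torch.cat([self.conv5[i](d), self.conv7[i](d)], dim=1)   # two receptive fields
            d = F.adaptive_avg_pool2d(d, self.out_size)                  # bring every level to one size
            total = total + self.bn(d)                                   # normalize and accumulate
        return total

    def forward(self, feats1, feats2):
        fwd = self.fuse([f2 - f1 for f1, f2 in zip(feats1, feats2)])     # forward change
        bwd = self.fuse([f1 - f2 for f1, f2 in zip(feats1, feats2)])     # backward change
        return fwd, bwd

class DifferenceEnhance(nn.Module):
    """Gate a fused difference map with pooled statistics: GELU -> avg-pool -> BN -> sigmoid."""
    def __init__(self, channels):
        super().__init__()
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, d):
        gate = torch.sigmoid(self.bn(F.adaptive_avg_pool2d(F.gelu(d), 1)))
        return d * gate

# illustrative bi-temporal feature pyramids (channels, spatial size) per level
sizes = [(64, 64), (128, 32), (320, 16), (512, 8)]
feats1 = [torch.randn(2, c, s, s) for c, s in sizes]
feats2 = [torch.randn(2, c, s, s) for c, s in sizes]
d_fwd, d_bwd = MultiScaleDifference([c for c, _ in sizes])(feats1, feats2)
d_fwd, d_bwd = DifferenceEnhance(512)(d_fwd), DifferenceEnhance(512)(d_bwd)
```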
Subsequently, a cross-semantic change locator is constructed using an MHA layer and an FFN, with residual connections between the two components. In the MHA layer, differential features from the two temporal images are used separately. For $\hat{D}^{1\rightarrow2}$, $\hat{D}^{1\rightarrow2}$ is used as the query, while $\hat{D}^{2\rightarrow1}$ serves as both the key and value for the attention computation:
$$S^{1\rightarrow2} = \mathrm{MHA}\!\left(\hat{D}^{1\rightarrow2}, \hat{D}^{2\rightarrow1}, \hat{D}^{2\rightarrow1}\right) + \hat{D}^{1\rightarrow2}.$$
The FFN layer is used to process the output of the MHA layer:
$$\tilde{S}^{1\rightarrow2} = \mathrm{FFN}\!\left(S^{1\rightarrow2}\right) + S^{1\rightarrow2}.$$
Similarly, for $\hat{D}^{2\rightarrow1}$, MHA uses $\hat{D}^{2\rightarrow1}$ as the query and $\hat{D}^{1\rightarrow2}$ as the key and value for attention calculation:
$$S^{2\rightarrow1} = \mathrm{MHA}\!\left(\hat{D}^{2\rightarrow1}, \hat{D}^{1\rightarrow2}, \hat{D}^{1\rightarrow2}\right) + \hat{D}^{2\rightarrow1}, \qquad \tilde{S}^{2\rightarrow1} = \mathrm{FFN}\!\left(S^{2\rightarrow1}\right) + S^{2\rightarrow1}.$$
Finally, the obtained symmetric difference features are added to the cross-spatial features $F^{1}$ and $F^{2}$ obtained in the previous section to form the final image features, and the dual-phase features are concatenated as the output feature $F_{out}$:
$$F_{out} = \mathrm{Concat}\!\left(\tilde{S}^{1\rightarrow2} + F^{1},\ \tilde{S}^{2\rightarrow1} + F^{2}\right).$$
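A minimal sketch of this symmetric cross-attention and fusion, assuming the difference maps have already been flattened into token sequences; a single shared MHA/FFN pair is used for brevity, and all names and dimensions are illustrative rather than the authors’ configuration.

```python
import torch
import torch.nn as nn

class CrossSemanticChangeLocator(nn.Module):
    """Sketch of the symmetric cross-attention: each enhanced difference
    direction queries the other, followed by a residual FFN."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.mha = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def locate(self, query, context):
        attn, _ = self.mha(query, context, context)   # one direction attends to the other
        s = attn + query                              # residual around MHA
        return self.ffn(s) + s                        # residual around FFN

    def forward(self, d_fwd, d_bwd, f1, f2):
        s_fwd = self.locate(d_fwd, d_bwd)             # forward change queries backward change
        s_bwd = self.locate(d_bwd, d_fwd)             # and vice versa
        # add the cross-spatial features and concatenate both temporal branches
        return torch.cat([s_fwd + f1, s_bwd + f2], dim=1)

rand_tokens = lambda: torch.randn(2, 64, 512)         # (batch, tokens, dim), illustrative
f_out = CrossSemanticChangeLocator()(rand_tokens(), rand_tokens(),
                                     rand_tokens(), rand_tokens())   # -> (2, 128, 512)
```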
3.3. Text Decoder
CTSD-Net uses a causal Transformer decoder [24] to unify the change features of the bi-temporal images with the text features. As shown in Figure 5, each Transformer block consists of a masked MHA, an MHA, and a feed-forward network. The masked MHA takes the word embeddings from the previous block as input and applies causal masking to ensure that the output of each sequence element depends solely on its preceding elements. As illustrated in Figure 6a, an orange cell at row $i$ and column $j$ indicates that, when generating the $i$-th output token, the attention mechanism is permitted to attend to the $j$-th input token, while access to any token beyond position $i$ is prevented. Subsequently, the MHA integrates the extracted visual features into the word embeddings, allowing the model to attend to the visual information. The entire decoder layer employs residual connections to retain the information of the word embeddings.
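The causal mask described above can be expressed as an upper-triangular Boolean matrix that blocks attention to future positions. Below is a small sketch using PyTorch’s built-in decoder layers; the layer count, sequence length, and dimensions are placeholders, not the paper’s configuration.

```python
import torch
import torch.nn as nn

seq_len, dim = 20, 512
# Boolean mask: position i may attend only to positions j <= i (True = blocked).
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

decoder_layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=3)

word_embeddings = torch.randn(2, seq_len, dim)    # token embeddings plus positions
visual_features = torch.randn(2, 128, dim)        # concatenated bi-temporal image features
out = decoder(word_embeddings, visual_features, tgt_mask=causal_mask)   # -> (2, 20, 512)
```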
Specifically, to prepare change descriptions for the caption decoder during training, we first map the text tokens into word embeddings using an embedding layer. A mapping function $E(\cdot)$ is employed to transform each original token $w_t$ into the corresponding word embedding $E(w_t)$. The initial input to the Transformer decoder can then be obtained as follows:
$$H^{0} = \left[E(w_1),\ E(w_2),\ \dots,\ E(w_T)\right] + P,$$
where $P$ denotes the positional embeddings computed using sine and cosine functions of different frequencies and phases based on the position of each token [25]. Then, the $l$-th masked MHA sublayer can be formulated as:
$$\mathrm{MaskedMHA}^{l}\!\left(H^{l-1}\right) = \mathrm{Concat}\!\left(\mathrm{head}_1, \dots, \mathrm{head}_h\right)W_O^{l}, \qquad \mathrm{head}_j = \mathrm{MaskedAttention}\!\left(H^{l-1}W_{Q,j}^{l},\ H^{l-1}W_{K,j}^{l},\ H^{l-1}W_{V,j}^{l}\right),$$
where $\mathrm{MaskedMHA}^{l}(\cdot)$ represents the function of the $l$-th masked MHA sublayer, $h$ is the number of heads in the masked MHA sublayer, $W_{Q,j}^{l}$, $W_{K,j}^{l}$, and $W_{V,j}^{l}$ are the trainable weight matrices for the $j$-th head, and $W_O^{l}$ is the trainable weight matrix for linearly projecting the output feature dimension. After passing through $L$ layers of the Transformer decoder, the retrieved word embeddings are fed into a linear layer, and then the word probabilities are obtained through a SoftMax activation function.
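A small sketch of the decoder input preparation: token ids are embedded and the fixed sine/cosine positional encodings of [25] are added. The vocabulary size, sequence length, and dimensions are placeholders.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_positions(seq_len, dim):
    """Fixed sine/cosine positional encodings as in the original Transformer."""
    pos = torch.arange(seq_len).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(10000.0) / dim))
    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

vocab_size, dim, seq_len = 5000, 512, 20
embed = nn.Embedding(vocab_size, dim)                 # maps token ids to word embeddings E(w_t)
tokens = torch.randint(0, vocab_size, (2, seq_len))
decoder_input = embed(tokens) + sinusoidal_positions(seq_len, dim)   # H^0 in the text
```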
3.4. Loss Function
The model employs a cross-entropy loss function to compute the loss between the predicted sentence and the ground-truth sentence. Given a target ground-truth caption sequence and the sequence predicted by the model, the objective of the loss function $L_{CE}$ is to minimize the sum of the negative log-likelihoods of the correctly predicted words. Thus, the loss function can be formulated as:
$$L_{CE} = -\sum_{t=1}^{N}\sum_{v=1}^{V} y_{t,v}\,\log p_{t,v},$$
where $N$ is the total number of word tokens, $V$ is the vocabulary size, $y_{t,v}$ is the ground-truth label denoting whether the $t$-th word corresponds to vocabulary index $v$, and $p_{t,v}$ is the predicted probability of assigning the $t$-th word to vocabulary index $v$.
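This loss is the summed negative log-likelihood over word positions, which can be sketched with PyTorch’s built-in cross-entropy; the shapes below are illustrative.

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: N word positions, vocabulary of size V.
N, V = 20, 5000
logits = torch.randn(N, V)                  # decoder outputs before the SoftMax
targets = torch.randint(0, V, (N,))         # ground-truth vocabulary indices

# Sum of negative log-likelihoods of the correct words, matching L_CE above.
loss = F.cross_entropy(logits, targets, reduction='sum')
```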