Interactive Change-Aware Transformer Network for Remote Sensing Image Change Captioning

Abstract: Remote sensing image change captioning (RSICC) aims to automatically generate sentences describing the differences in content between bitemporal remote sensing images. Recent works extract the changes between bitemporal features and employ a hierarchical approach to fuse multiple changes of interest, yielding change captions. However, these methods directly aggregate all features, potentially incorporating non-change-focused information from each encoder layer into the change caption decoder and adversely affecting change captioning performance. To address this problem, we propose an Interactive Change-Aware Transformer Network (ICT-Net). ICT-Net extracts and incorporates the most critical changes of interest in each encoder layer to improve change description generation. It first extracts bitemporal visual features from the CNN backbone and employs an Interactive Change-Aware Encoder (ICE) to capture the crucial differences between these features. Specifically, the ICE interactively captures the most change-aware discriminative information between the paired bitemporal features through difference and content attention encoding. A Multi-Layer Adaptive Fusion (MAF) module is proposed to adaptively aggregate the relevant change-aware features in the ICE layers while minimizing the impact of irrelevant visual features. Moreover, we extend the ICE to extract multi-scale changes and introduce a novel Cross Gated-Attention (CGA) module into the change caption decoder to select essential discriminative multi-scale features and improve the change captioning performance. We evaluate our method on two RSICC datasets (i.e., LEVIR-CC and LEVIRCCD), and the experimental results demonstrate that our method achieves state-of-the-art performance.


Introduction
Recently, deep-learning-based remote sensing image change captioning (RSICC) technologies have demonstrated their effectiveness in observing and analyzing changes on the earth's surface [1,2]. They take advantage of multitemporal images acquired by sensors onboard satellites or aerial platforms for continual observation and tracking of environmental changes. RSICC is an evolving field of research that aims to understand the changes in input bitemporal remote sensing (RS) images and generate natural language sentences that accurately describe the differences between them. By analyzing and describing the differences between bitemporal scenes, it deepens our understanding of dynamic changes in the environment and landscape. RSICC has a broad range of applications, including landscape damage examination, city planning, environmental monitoring, and land planning [2-4].
RSICC involves interpreting change regions between two RS images captured at the same location but at different times (as shown in Figure 1). It requires a deep understanding of the semantic meaning of these changes in a complex environment and a detailed analysis of the evolved scene. Like recent image captioning works [5-7], RSICC adopts an encoder-decoder architecture, where a visual encoder extracts discriminative features and captures the differences between bitemporal images, while a language decoder generates descriptive sentences that articulate these differences. Chouaf et al. [1] pioneered the RSICC task; they used a CNN as a visual encoder to capture the temporal changes in scenes and adopted an RNN as a decoder to generate descriptions of the changes. Liu et al. [3] adopted a Transformer-based [8] encoder-decoder framework for the RSICC task and achieved strong performance.
Figure 1. A visualization of the existing method and our proposed method. (a) The existing method [3] uses a hierarchical approach that tends to integrate non-change-focused information from each encoder layer, disrupting the change feature learning in the decoder and generating inferior change descriptions. Our proposed method attentively aggregates the essential features for more informative caption generation. (b) Existing methods [2-4,9] overlook changes in objects at various scales, generating inferior change descriptions. Ours can extract discriminative information across various scales (e.g., a small scale) for change captioning. Blue marks the image regions to which the word "house" attends, while reddish colors indicate a lower level of attention. The bluer the color, the higher the attention value.
Recent RSICC methods [3,9] capture the changes in each encoder layer and gradually concatenate low-level and high-level change-aware semantic features from all layers to help the change caption decoder generate more accurate captions. Nevertheless, these approaches are prone to incorporating redundant features from each encoder layer into the caption decoder, thereby adversely affecting the change caption generation process. For instance, in Figure 1a, these methods tend to fuse non-change-focused features and propagate them to the decoder, which disrupts the word and feature attention in the decoder. Consequently, this interference produces less accurate change descriptions that omit the "tree" and "road" present in the ground truth caption. Moreover, most existing methods [2-4,9] overlook the distinctive characteristics that separate remote sensing images from natural images, which limits the model's ability to capture changes in objects at different scales (e.g., small-scale objects) and leads to inferior sentences describing the changes, as shown in Figure 1b. As illustrated in Figure 1, a significant challenge in remote sensing image change captioning research lies in effectively filtering out noisy feature representations [10,11]. In addition, the diversity of scales in images is a natural characteristic resulting from variations in camera-to-object distances, which causes differences in scale among the objects within an image. Hence, it is crucial to be aware of the presence and absence of objects across regions with varying scales in bitemporal images and to provide comprehensive descriptions of these changes.
In this paper, we propose an Interactive Change-Aware Transformer Network (ICT-Net) to alleviate the above-mentioned problems. ICT-Net extracts and integrates the most pivotal changes of interest within each encoder layer, thereby enhancing the generation of change descriptions. In the encoder, ICT-Net utilizes an Interactive Change-Aware Encoder (ICE) to capture change information between bitemporal features extracted from the backbone network (e.g., ResNet [12]). Specifically, the ICE leverages the Cross Multihead Attention (Cross-MHA) mechanism [8] in difference and content attention encoding modules to learn the most discriminative representations and recognize the changes of interest between paired features. Moreover, the Multi-Layer Adaptive Fusion (MAF) module is introduced to effectively integrate relevant low- and high-level semantic change-aware features from each ICE layer. MAF utilizes an attention design to filter out irrelevant change information from the integrated visual features. In addition, we expand the ICE to extract multi-scale change-aware features, aiming to overcome the challenges of recognizing changes in objects at various scales. In the change caption decoder, we propose a Cross Gated-Attention (CGA) module to generate a change description by considering the relationships between the words and each scale of the features. CGA employs a gated attention structure, enhancing the decoder's capability to utilize crucial features for more precise change caption generation.
To summarize, in the proposed ICT-Net, we utilize an ICE to capture multi-scale discriminative change-aware information between bitemporal features, followed by an MAF module to integrate the most relevant change information in each layer for the change caption decoder. A CGA module is adopted in the decoder to model the relationships between semantic and multi-scale change-aware features to enhance the change captioning performance. A comprehensive set of experiments is conducted on two remote sensing image change caption datasets. The results demonstrate that our proposed model achieves superior performance compared to state-of-the-art approaches across all evaluation metrics. Our contributions are summarized as follows:
1. We propose an Interactive Change-Aware Transformer Network (ICT-Net) to accurately capture and describe changes in objects in remote sensing bitemporal images.
2. We introduce the Interactive Change-Aware Encoder (ICE) equipped with the Multi-Layer Adaptive Fusion (MAF) module. It effectively captures change information from bitemporal features and extracts essential change-aware features from each encoder layer, contributing to improved change caption generation.
3. We present the Cross Gated-Attention (CGA) module, a novel module designed to effectively utilize multi-scale change-aware representations during the sentence-generation process. This module enables the change caption decoder to explore the relationships between words and multi-scale features, facilitating the discernment of critical representations for better change captioning.
Section 2 provides a summary of previous work in the fields of remote sensing image captioning, remote sensing change detection, and natural image captioning. In Section 3, we present our proposed ICT-Net in detail. Next, Section 4 presents the experimental results and analysis. Finally, in Section 5, we conclude this work.

Remote Sensing Image Change Captioning
The objective of remote sensing image change captioning (RSICC) is to analyze and describe the differences between bitemporal scenes using natural language. Chouaf et al. [1] pioneered the RSICC task; they used a CNN as a visual encoder to capture the temporal changes between scenes and adopted an RNN as a decoder to generate descriptions of the changes. Hoxha et al. [2] proposed early and late feature fusion strategies to fuse the bitemporal visual features and utilized an RNN and a multi-class Support Vector Machine (SVM) decoder to generate change captions. More recently, Liu et al. [3] adopted a Transformer-based [8] encoder-decoder framework for the RSICC task, in which they used a dual-branch Transformer encoder to identify the changes between the scenes and proposed a multistage fusion module to fuse multi-layer features for change description generation. Liu et al. [9] further improved the method by utilizing progressive difference perception Transformer layers to capture high-level and low-level semantic change information. Liu et al. [4] proposed a prompt-based method that uses pre-trained large language models (LLMs) for RSICC tasks, where visual features, change classes, and language representations serve as input prompts to a frozen LLM for change caption generation. Nevertheless, current methods tend to incorporate irrelevant change information into the model, resulting in inferior performance. Hence, we propose to capture more change-aware discriminative information with an attention structure to enhance the model's ability to describe the changes in scenes.

Remote Sensing Image Captioning
Remote sensing image captioning (RSIC) aims to generate natural language sentences that describe the contents of a given RS image. Recently, most RSIC works [10,13-21] have used deep learning techniques and adopted an encoder-decoder framework for caption generation. The visual encoder utilizes a pre-trained CNN [12] or Vision Transformer [22] to extract the visual features from the input image and then injects the features into an RNN-based [23] or Transformer-based [8] decoder to generate the descriptive sentences. Lu et al. [24] explored an encoder-decoder-based method for RSIC that utilizes CNN models to extract the remote sensing image features and uses a recurrent neural network (RNN) to generate the sentence. Li et al. [25] introduced a novel truncation cross entropy (TCE) loss for RSIC, which aims to solve the overfitting issue and helps the model generate more concise RS image descriptions. Sumbul et al. [14] proposed a summarization-driven RSIC method, which implements an adaptive weighting strategy to effectively integrate the summarized ground truth captions into the captioning model to improve performance. RS images may contain objects of different sizes. Some RSIC methods therefore aim to improve the visual representation modeling abilities of the captioning model and to describe objects at various scales in the RS image. Wang et al. [15] proposed a multi-scale multi-interaction method to connect multi-scale image features at different levels, allowing for more efficient visual representation interaction. Ma et al. [26] introduced scene-level and target-level feature extraction modules to capture more fine-grained visual representations for RSIC. The aforementioned RSIC methods aim to generate descriptive sentences for the objects in a single image. In contrast, RS image change captioning focuses on capturing and describing the differences between bitemporal remote sensing images.

Remote Sensing Change Detection
The objective of Remote Sensing Image Change Detection (RSICD) [27-32] is to detect the change regions between bitemporal images and generate a pixel-level change map that illustrates the changed areas. Chen et al. [28] introduced a Siamese Transformer-based [8] framework to improve the model's context modeling and identify the changes of interest between given bitemporal images. Bao et al. [33] utilized a Convolutional Neural Network (CNN)-based dual structure to extract and detect the differences between multi-scale features of bitemporal images and employed a Feature Pyramid Network (FPN) [34] fusion module to fuse information across layers to enhance the detection performance. Peng et al. [27] proposed a dense attention architecture for change detection to improve the extraction of texture and detail in the visual representations. Saha et al. [35] proposed unsupervised learning techniques for RSICD, combining deep change vector analysis with the extracted spatial contextual information to determine changed pixels. Tang et al. [36] further explored a method based on a graph convolutional network (GCN) [37] and metric learning that captures rich contextual information from the visual representations. In contrast to RSICD tasks, which aim to recognize pixel-level changes of interest, RS image change captioning concentrates on detecting and describing the changes of interest between two images at the semantic level.

Natural Image Captioning
Natural image captioning (NIC) is a fundamental multimodal task at the intersection of computer vision [38-43] and natural language processing [8,23,44,45], which aims to identify objects within images and describe the recognized objects with language. Similar to RSIC, most recent NIC methods utilize an encoder-decoder framework. Xu et al. [5] proposed to use a CNN encoder to extract the natural image features and an RNN as the language decoder to generate natural language words in sequence. Subsequently, spatial [6] and Transformer multi-head attention [8] mechanisms have been explored to enhance the performance of image captioning. Cornia et al. [7] developed a Transformer-based framework incorporating meshed memory to exploit low-level and high-level visual features for caption generation. Besides NIC, several methods [46-49] have been introduced to solve natural scene, 3D scene, and synthetic image change captioning tasks. Qiu et al. [47] proposed understanding and describing the changes in 3D scenes from different viewpoints. Tu et al. [48] introduced a method for learning semantic relation-aware difference representations, which effectively localizes semantic changes and captures the semantic relationships across two images. In contrast, our objective in this work is to describe the changes in real RS scenes, which contain many different object categories with multiple scales and complex ground details.

Methodology
The ICT-Net utilizes a CNN and a Transformer-based encoder-decoder framework. The overall structure is shown in Figure 2 and is composed of three main elements: (1) a multi-scale feature extractor that extracts pairs of RS visual features from different stages of the backbone CNN; (2) the proposed Interactive Change-Aware Encoder (ICE) with a Multi-Layer Adaptive Fusion (MAF) module, which adaptively captures the semantically discriminative information from each pair of multi-scale features; and (3) a multi-scale change caption decoder that utilizes a Cross Gated-Attention (CGA) module to select crucial information from all multi-scale change-aware features generated by the MAF module for change captioning.

Multi-Scale Feature Extraction
We extract multi-scale features from different convolutional stages of the ResNet [12] backbone to enable the model to capture objects at different scales. As illustrated in Figure 2, given a pair of input images $I_{t_0}$ and $I_{t_1}$, the backbone network extracts multi-scale features and uses a transformation function (e.g., 1 × 1 and 3 × 3 convolutional layers) to transform them to the same dimension, $D$. We use $X^{i}_{t_0}$ (e.g., $X^{i}_{t_0} \in \mathbb{R}^{H \times W \times D}$, where $H$ and $W$ denote the height and width of the feature) and $X^{i}_{t_1}$ to represent the multi-scale feature pairs, where $i \in \{3, 4, 5\}$ denotes the ResNet stage from which the features are extracted.
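As a rough illustration of the sizes involved, the sketch below computes the spatial size of each stage's feature map and flattens an $H \times W \times D$ map into the $N \times D$ token form used by the encoder. It assumes that Stage-$i$ corresponds to the standard ResNet stage with an output stride of $2^i$ and that the input is 256 × 256 pixels; the function names are hypothetical and are not taken from the original implementation.

```python
def stage_shapes(image_size=256, stages=(3, 4, 5)):
    """Map each stage index to (H, W, N), with N = H * W tokens.

    Assumption: stage i of a standard ResNet has an output stride of 2**i
    (8, 16, and 32 for stages 3, 4, and 5).
    """
    shapes = {}
    for i in stages:
        side = image_size // (2 ** i)
        shapes[i] = (side, side, side * side)
    return shapes


def flatten_feature_map(fmap):
    """Flatten an H x W x D nested list into an N x D list (row-major order)."""
    return [pixel for row in fmap for pixel in row]
```

Under these assumptions, Stage-3 yields far more tokens than Stage-5, which is one reason the finer stages are better suited to small objects.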

Interactive Change-Aware Encoder
Obtaining difference information that reflects the change regions between bitemporal RS images is essential for RSICC. In this paper, we propose an Interactive Change-Aware Encoder (ICE) that interactively extracts highly discriminative features from each pair of input bitemporal features $X^{i}_{t_0} \in \mathbb{R}^{N \times D}$ and $X^{i}_{t_1} \in \mathbb{R}^{N \times D}$, where $N = W \times H$. As shown in Figure 2, each ICE layer comprises difference attention encoding (DAE) and content attention encoding (CAE) modules. These modules work interactively to capture the changes between bitemporal features by utilizing difference features, denoted as $X^{i}_{diff} \in \mathbb{R}^{N \times D}$, and further enhance the change awareness by incorporating aggregated features, denoted as $X^{i}_{fus} \in \mathbb{R}^{N \times D}$. Specifically, DAE first extracts the difference between the paired bitemporal features and then models the discriminative representations with these features using the Cross Multihead Attention (Cross-MHA) mechanism. Then, CAE further refines the output of DAE through Cross-MHA with the aggregated bitemporal features. This process models the long-range dependency of the discriminative representations with the aggregated features, emphasizing the critical dissimilarities between the bitemporal features $X^{i}_{t_0}$ and $X^{i}_{t_1}$. Both the DAE and CAE processes are computed for each temporal feature $X^{i}_{t_j}$, $j \in \{0, 1\}$, with trainable weight matrices $W_q$, $W_k$, $W_v$, $\hat{W}_q$, $\hat{W}_k$, and $\hat{W}_v$. To simplify the notation, we assume that position encoding (PE) has been added to the bitemporal features. A Feed-Forward Network (FFN) and Layer Normalization (LN) are included in Cross-MHA, similar to the Transformer block.
Furthermore, we introduce a Multi-Layer Adaptive Fusion (MAF) module to adaptively fuse the change-aware multi-level representations obtained from each layer of the preceding ICE. Each ICE layer can encompass distinct meaningful change representations. By leveraging the MAF module, our model can acquire these distinct features from all ICE layers, allowing it to concentrate on the relevant change representations while filtering out irrelevant changes. As illustrated in Figure 3, we first concatenate all the bitemporal change-aware representations from each ICE layer along the channel dimension. Subsequently, we incorporate a gated attention mechanism that allows the model to filter out irrelevant information and determine the essential change-aware representations from the concatenated features. In the MAF formulation, $[;]$ denotes concatenation, $W_a$, $W_b$, $W_c$, and $W_d$ are learnable weights, and $\sigma$ and $\odot$ represent the sigmoid activation and element-wise multiplication, respectively. Together, the sigmoid activation and element-wise multiplication serve as a gate that filters out the redundant information from multiple ICE layers, and $l$ represents the number of layers in the ICE. Through the MAF module, we obtain the filtered change-aware features $Z^{i}$ ($i = 3, 4, 5$ with respect to the scale of the features), where the $Z^{i}$ are down-sampled to a consistent spatial size $N = H \times W$. These features are injected into the decoder for caption prediction.
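The data flow described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the difference features are taken as a subtraction and the aggregated features as a sum, the bitemporal features act as queries, the attention is single-head with identity projections (the learned weights, FFN, LN, residuals, and position encoding are omitted), and the gate uses a simple two-matrix form. All function names are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, context):
    """Single-head scaled dot-product attention; queries attend over context rows."""
    d = len(query[0])
    scores = matmul(query, [list(c) for c in zip(*context)])
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, context)


def ice_layer(x_t0, x_t1):
    """One assumed ICE layer: DAE followed by CAE, applied to both time steps."""
    x_diff = [[b - a for a, b in zip(r0, r1)] for r0, r1 in zip(x_t0, x_t1)]
    x_fus = [[a + b for a, b in zip(r0, r1)] for r0, r1 in zip(x_t0, x_t1)]
    outputs = []
    for x in (x_t0, x_t1):
        z = cross_attention(x, x_diff)      # DAE: attend over difference features
        z_hat = cross_attention(z, x_fus)   # CAE: refine with aggregated features
        outputs.append(z_hat)
    return outputs


def gated_fusion(layer_feats, w_gate, w_value):
    """Assumed MAF-style gate: concatenate l layer outputs, then sigmoid-gate them."""
    concat = [sum(rows, []) for rows in zip(*layer_feats)]  # N x (l * D)
    gate = [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in matmul(concat, w_gate)]
    value = matmul(concat, w_value)
    return [[g * v for g, v in zip(gr, vr)] for gr, vr in zip(gate, value)]
```

With identical inputs at both time steps, the difference features are zero, so the DAE output carries no change signal; this is the behavior one would want for unchanged image pairs.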

Multi-Scale Change Caption Decoder
We leverage the multi-scale change-aware representations produced by the MAF modules and construct a multi-layered decoder architecture for change caption generation. To achieve this, we introduce a novel Cross Gated-Attention (CGA) module, in contrast to the cross-attention operator used in the original Transformer decoder [8]. The CGA module allows us to effectively utilize all the multi-scale change-aware representations during the sentence-generation process. Furthermore, with the help of its gated structure, it allows the change decoder to attend to and select the essential change-aware multi-scale representations for change caption generation. The proposed change caption decoder is composed of three sub-modules: Masked-Multihead Attention (Mask-MHA), Cross Gated-Attention (CGA), and the Feed-Forward Network (FFN), as illustrated in Figure 4. A residual connection and Layer Normalization (LN) are adopted for each sub-module. In Mask-MHA, $W^{Q}_{i}$, $W^{K}_{i}$, and $W^{V}_{i}$ are the learnable projection matrices for the query, key, and value of the word embedding at the $i$-th head, $W^{o}$ is the projection matrix that aggregates the information from the $h$ heads, and $[;]$ represents the concatenation operation.
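For reference, masked multi-head attention in the standard Transformer [8] takes the form below, which is consistent with the symbols defined above. It is a sketch under assumptions rather than the exact formulation of this work: the symbol $E$ for the input word embeddings, the additive causal mask $M$ (zero for visible positions and $-\infty$ for future positions), the key dimension $d_k$, and the residual form of $E^{*}$ are introduced here for illustration.

```latex
\begin{aligned}
\mathrm{Attention}(Q, K, V) &= \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}} + M\right) V, \\
\mathrm{head}_i &= \mathrm{Attention}\!\left(E W^{Q}_{i},\; E W^{K}_{i},\; E W^{V}_{i}\right), \\
\text{Mask-MHA}(E) &= [\mathrm{head}_1; \ldots; \mathrm{head}_h]\, W^{o}, \\
E^{*} &= \mathrm{LN}\!\left(E + \text{Mask-MHA}(E)\right).
\end{aligned}
```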
Subsequently, the CGA module is introduced to connect the generated sequence of word features $E^{*}$ with all multi-scale change-aware representations $Z^{i}$. Instead of focusing on a single scale of the change-aware features, we compute the long-range dependencies across all multi-scale features, which yields a multi-scale-dependent sentence feature $S_i$ for each scale. Gated attention is then applied to each $S_i$ to focus on the relevant changes of interest for change caption generation, producing the gated sentence features $\hat{S}_i \in \mathbb{R}^{L \times d}$. Finally, these multi-scale sentence features $\hat{S}_i$ are summed together. In the gate, $W_s$ and $W_c$ denote learnable projection matrices, and $b_s$ and $b_c$ represent learnable bias vectors. $\sigma$ and $\odot$ denote the sigmoid activation and element-wise multiplication, respectively, which are used to select and balance the weights learned from each multi-scale feature-dependent word representation $S_i$. The output of the caption decoder $C \in \mathbb{R}^{L \times d}$ is then fed into a linear projection layer with learnable weights $W_p \in \mathbb{R}^{d \times \Sigma}$ and a softmax layer to predict the probabilities of the caption words in the vocabulary, where $L$ is the length of the sentence, $d$ is the embedding dimension, and $\Sigma$ denotes the vocabulary size.
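The gating, summation, and word prediction steps can be sketched as follows. This is an illustrative sketch only: the gate is assumed to take the common form $\sigma(S_i W_s + b_s) \odot (S_i W_c + b_c)$, the gate weights are assumed to be shared across scales, the cross-attention that produces each $S_i$ is taken as given, and the function names are hypothetical.

```python
import math


def affine(x, w, b):
    """Compute x @ w + b for x (L x d), w (d x d_out), and b (d_out)."""
    return [[sum(xi * wi for xi, wi in zip(row, col)) + bj
             for col, bj in zip(zip(*w), b)] for row in x]


def gated_features(s_i, w_s, b_s, w_c, b_c):
    """Assumed gate: sigmoid(S W_s + b_s) multiplied element-wise by (S W_c + b_c)."""
    gate = [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in affine(s_i, w_s, b_s)]
    cand = affine(s_i, w_c, b_c)
    return [[g * c for g, c in zip(gr, cr)] for gr, cr in zip(gate, cand)]


def fuse_scales(s_per_scale, w_s, b_s, w_c, b_c):
    """Sum the gated sentence features over all scales."""
    gated = [gated_features(s, w_s, b_s, w_c, b_c) for s in s_per_scale]
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*gated)]


def word_probabilities(c, w_p):
    """Project decoder outputs (L x d) to the vocabulary and apply a row-wise softmax."""
    logits = affine(c, w_p, [0.0] * len(w_p[0]))
    probs = []
    for row in logits:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        probs.append([e / total for e in exps])
    return probs
```

Because the gate lies in (0, 1), each scale can only attenuate its candidate features, so a scale that carries little change information contributes little to the sum.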
The overall procedure of our proposed model is summarized in Algorithm 1 (ICTNet).

Training Objective
During training, similar to existing RSICC models [2,3], we adopt the widely used cross-entropy (CE) loss to optimize the change caption model, which can be written as follows:
$$\mathcal{L}_{CE} = -\sum_{t=1}^{L} \log p\left(y^{*}_{t} \mid y^{*}_{1:t-1}, I_{t_0}, I_{t_1}\right)$$
where $L$ is the caption length. The model is trained to predict the target ground truth word $y^{*}_{t}$ given the previous words $y^{*}_{1:t-1}$ and the input images $I_{t_0}$ and $I_{t_1}$.
LEVIRCCD dataset. We further verify the performance of the proposed method on the LEVIRCCD dataset [2]. It consists of 500 bitemporal images that were originally used for building change detection (CD). The images are cropped to 256 × 256 pixels. Each image has been annotated with five remote sensing change descriptions, resulting in 2500 change descriptions in total. A split of 60%, 10%, and 30% of the image and change caption pairs is used for training, validation, and testing, respectively.
The BLEU-N evaluation metric measures the n-gram precision of the candidate sentence with respect to the reference sentences, where N denotes the n-gram order.
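The core quantity behind BLEU, the clipped (modified) n-gram precision, can be sketched as below. This is a simplified illustration with hypothetical function names; the full BLEU score additionally combines the precisions of several n-gram orders with a geometric mean and applies a brevity penalty, both of which are omitted here.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate against one or more references."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # For each n-gram, keep the largest count observed in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())
```

Clipping prevents a candidate from being rewarded for repeating a word more often than it appears in any reference.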
METEOR evaluates unigram precision and recall, and ROUGE-L measures similarity based on the longest common subsequence between two sentences. METEOR and ROUGE-L account for sentence fluency by involving a penalty factor.
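The longest-common-subsequence computation behind ROUGE-L can be sketched as follows for a single reference. The weighting factor `beta = 1.2` is an assumption borrowed from common captioning evaluation toolkits, not a value stated in this paper, and the function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.2):
    """LCS-based F-measure between a candidate and a single reference."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)
```

Unlike n-gram precision, the subsequence does not need to be contiguous, so a single substituted word in the middle of a sentence costs only one match.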
CIDEr-D calculates the cosine similarity between the Term Frequency-Inverse Document Frequency (TF-IDF)-weighted n-gram vectors of the candidate and reference sentences. It takes into account both precision and recall, and its values can exceed 100% [53].
For all these metrics, the higher the metric scores, the higher the accuracy of the generated change description.

Experimental Setup
We utilized a pre-trained ResNet101 [12] as the backbone network for bitemporal remote sensing image feature extraction. The initial learning rate was set to 0.0001 and was decayed by a factor of 0.7 every three epochs. The maximum number of training epochs was set to 40, and training was stopped when there was no improvement in the BLEU-4 score for five consecutive epochs. We utilized two Transformer encoder layers and one decoder layer with eight attention heads to achieve the best change caption performance. The model was optimized with the Adam optimizer [54]. Like existing works [3,4], the beam search size was set to 3 for inference. The model was implemented in the PyTorch framework.
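The learning rate schedule and stopping rule described above can be sketched in plain Python as follows. The sketch assumes that epochs are counted from zero, that the decay is applied at epochs 3, 6, 9, and so on, and that "improvement" means a strictly higher validation BLEU-4 score; the function names are hypothetical.

```python
def learning_rate(epoch, base_lr=1e-4, gamma=0.7, step=3):
    """Step decay: multiply the base rate by gamma once every `step` epochs."""
    return base_lr * gamma ** (epoch // step)


def stopping_epoch(bleu4_per_epoch, patience=5, max_epochs=40):
    """Return the index of the last epoch run under early stopping.

    Training stops once BLEU-4 has not improved for `patience` consecutive
    epochs, or when `max_epochs` is reached.
    """
    best = float("-inf")
    stale = 0
    scores = bleu4_per_epoch[:max_epochs]
    for epoch, score in enumerate(scores):
        if score > best:
            best, stale = score, 0
        else:
            stale += 1
        if stale >= patience:
            return epoch
    return len(scores) - 1
```

In PyTorch, a comparable step decay is available through `torch.optim.lr_scheduler.StepLR` with `step_size=3` and `gamma=0.7`.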

Comparison with State-of-the-Art Methods
In Table 1, we compare the remote sensing image change caption performance with state-of-the-art methods on the LEVIR-CC dataset, which include Capt-Dual-Att [55], DUDA [55], MCCFormers-S [49], MCCFormers-D [49], RSICCFormer [3], PSNet [9], and PromptNet [4]. Capt-Dual-Att [55] combines two convolutional layers with spatial attention to attend to important bitemporal visual features. DUDA [55] introduces a dynamic speaker, allowing the model to adaptively attend to visual representations. MCCFormers-S [49] flattens and concatenates the bitemporal feature maps and then injects the fused features into a Transformer network for captioning. MCCFormers-D [49] introduces a Siamese Transformer encoder design to model the relationships between bitemporal visual features and capture the changes. Most of the compared methods utilize the same ResNet-101 backbone, except for PSNet and PromptNet, which use a ViT [22] and a CLIP [38] backbone, respectively. BLEU-N (B-N) assesses the presence of matching n-grams in the generated sequence. The widely used CIDEr score evaluates the generation of global semantic words in the caption. The proposed model achieves superior performance in all of the metrics, which demonstrates the effectiveness of our proposed method.
We further validate our change caption performance on the LEVIRCCD dataset in Table 6. For a fair comparison, Table 6 includes a method that uses the same backbone network (ResNet50) and the same settings as our method. In addition, we selected methods that achieve state-of-the-art performance on the LEVIR-CC dataset for comparison. Our proposed method achieves better performance than the other methods, which further demonstrates its effectiveness. These results can be attributed to the ability of the proposed method to recognize multi-scale object changes and to adaptively fuse multi-layer semantic information for better change caption decoding.

Table 1. Comparison of our proposed method and other state-of-the-art image change caption methods on the LEVIR-CC dataset. The higher the score, the better the captioning performance. Bold numbers indicate the best result.

Ablation Studies
In this section, we present the numerical results of our ablation studies that validate the effectiveness of the following proposed modules: the Interactive Change-Aware Encoder (ICE), Multi-Layer Adaptive Fusion (MAF), and Cross Gated-Attention (CGA). The ablation models are based on the ResNet101 backbone and were evaluated on the LEVIR-CC dataset.
Table 3 demonstrates the effectiveness of including different components in the proposed method. Difference attention encoding (DAE) and content attention encoding (CAE) are the two sub-modules of the ICE. A tick in the table denotes that the module is included in the model. Integrating either DAE or CAE into the change-aware encoder yields better performance than the baseline model that uses the original Transformer encoder [8], and the results improve further when both DAE and CAE are used. The performance is further enhanced by adopting the MAF module, which adaptively fuses the change-aware multi-level semantic information obtained from each layer of the ICE. Moreover, the results improve again with the inclusion of the CGA module, which enables the decoder to select the critical multi-scale representations for better change caption generation.
In Table 4, we evaluate the model's ability to determine whether changes exist between bitemporal remote sensing images and to describe them with a caption. We test the performance under three settings: (1) image pairs with no changes, (2) image pairs with changes, and (3) the entire test set. The model with the ICE performs better in all three settings than the Transformer baseline, demonstrating that the ICE effectively captures the change-aware features in bitemporal remote sensing images. The model with the MAF module achieves higher scores than the model using the ICE alone. This shows the effectiveness of the MAF module in interpreting and filtering the semantic information extracted from different encoder layers to capture multiple changes of interest for better caption generation. Furthermore, the full model, which incorporates the CGA module, significantly improves the performance in all settings. The CGA module is designed to exploit the relationships between words and multi-scale features and to facilitate the selection of essential features for change captioning. Overall, Table 4 shows the significant enhancement brought by our method in terms of both change discrimination and sentence generation performance.
Furthermore, this paper investigates the effectiveness of capturing and describing multi-scale object changes between bitemporal remote sensing images. Hence, we experiment with features from different stages (e.g., Stage-3, Stage-4, and Stage-5) of the backbone ResNet to localize multi-scale object changes in images. In Table 5, we show the performance of the proposed model when adopting different scales of features for capturing change-aware features. The model achieves the best performance when using the Stage-3 and Stage-4 multi-scale features to localize the differences between the two images and describe them with captions. This observation also implies that the bitemporal remote sensing images in the dataset tend to contain small- to medium-scale objects, and that our proposed model is able to extract and make use of the captured multi-scale change features to improve caption generation. Accordingly, in the remaining experiments, we mainly report the results of employing Stage-3 and Stage-4 features as inputs to the model.
Tables 3 and 4 provide evidence of the effectiveness of the ICE modules, illustrating their ability to enhance the model's performance. In addition, it is worth paying attention to the change regions located by the ICE between the two images (taken "before" and "after"). In Figure 5, we visualize and compare the change attention obtained using the DAE module only and the DAE + CAE modules in the ICE. We captured the output attention maps at the last layer of the ICE with input features of different scales (Stage-4 and Stage-3), where $M_{large}$ and $M_{small}$ denote the attention maps for large and small changes captured between RS image features, respectively. We compare the attention maps generated using only the DAE module with those generated using the combination of the DAE and CAE modules to examine the effectiveness of these two modules. In (a), given two images with only small changes (a small house), the small-scale object change attention map ($M_{small}$) generated using DAE + CAE attends to the small house more accurately than the model using only the DAE module. Similarly, in (b), $M_{small}$ using DAE + CAE focuses on the changes in both small houses. We visualize a somewhat larger change in (c), where $M_{large}$ with DAE + CAE more accurately attends to the change in the large buildings. In (d), we capture the changes in both the large buildings and the narrow road: $M_{large}$ with DAE + CAE strongly attends to the group of buildings, while $M_{small}$ focuses more on the changes in the narrow road. From these visualizations, we can conclude that DAE and CAE enhance the discriminative feature learning ability of the ICE.
Figure 5. $M_{large}$ and $M_{small}$ denote the attention maps for large and small changes captured between bitemporal image features, respectively. $I_{t_0}$ and $I_{t_1}$ denote the input RS images. Regions that appear more blue indicate higher levels of attention. Red dotted boxes mark the small change areas to ease the visualization.

Multi-Layer Adaptive Fusion Module
As shown in Tables 3 and 4, utilizing the MAF module to integrate the multi-level semantic feature representations from each layer of the ICE allows the change caption decoder to explore the relationship between words and each change of interest, which improves change captioning performance. In Figure 6, we visualize the decoder attention between words and the integrated change-aware features from the MAF module. The top row shows the input bitemporal remote sensing images. The middle image is the word-feature attention map obtained using MBF [3], and the bottom image is the attention map computed with the proposed MAF. Both the MBF and MAF modules are designed to integrate multi-level change-aware semantic feature representations. However, MBF lacks a gating design, which may introduce irrelevant features into the decoder and result in an inferior change description. In contrast, the MAF module utilizes a gated attention mechanism to select the essential change-aware representations from the multi-layer semantic information. For instance, in (a), the attention map of the word "house" computed with the MBF module focuses on other places rather than on the "house" in the image, whereas the attention map captured using the proposed MAF module accurately attends to the "house". Furthermore, in (b), the model with the MAF module focuses on the "road" in the image and generates a change caption that is closer to the ground truth (GT) caption. This visualization demonstrates the effectiveness of the MAF module, which benefits word-visual relationship modeling and allows the model to generate better change captions.
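The contrast between ungated and gated fusion of multi-layer features can be illustrated with a toy example. The sketch below is not the MAF formulation, whose equations are not given in this section. It assumes one scalar sigmoid gate per encoder layer, and the names `gated_fusion`, `plain_sum`, `gate_weights`, and `gate_bias` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def plain_sum(layer_feats: Sequence[Sequence[float]]) -> List[float]:
    """Ungated fusion: every layer contributes with equal weight."""
    return [sum(column) for column in zip(*layer_feats)]


def gated_fusion(
    layer_feats: Sequence[Sequence[float]],
    gate_weights: Sequence[float],
    gate_bias: float,
) -> Tuple[List[float], List[float]]:
    """Gated fusion (illustrative assumption): each layer's feature vector is
    scaled by a scalar gate in (0, 1) computed from that same vector."""
    fused = [0.0] * len(layer_feats[0])
    gates = []
    for feat in layer_feats:
        score = sum(w * x for w, x in zip(gate_weights, feat)) + gate_bias
        gate = sigmoid(score)
        gates.append(gate)
        for i, x in enumerate(feat):
            fused[i] += gate * x
    return fused, gates


# Two toy "layers": the first stands in for change-relevant features,
# the second for irrelevant ones. The gate parameters are hand-picked.
layer_feats = [[1.0, 0.0], [0.0, 1.0]]
ungated = plain_sum(layer_feats)
fused, gates = gated_fusion(layer_feats, gate_weights=[4.0, -4.0], gate_bias=0.0)
print(ungated)  # both layers pass through unchanged
print(gates)    # first gate close to 1, second close to 0
print(fused)    # the second layer's contribution is largely suppressed
```

In a trained model the gate parameters would be learned rather than hand-picked; the point here is only that a gate lets the fusion step down-weight a layer instead of passing everything to the decoder.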

Cross Gated-Attention Module
Besides observing the relation between words and single-scale change representations in Figure 6, it is also worth examining the relationships between words and multi-scale change-aware features, and how the gated structure of CGA allows the change caption decoder to select and utilize the useful change-aware representations for change description generation. Figure 7 shows the captured multi-scale word-feature attention maps, where $L_{words}$ and $S_{words}$ denote the attention maps that capture large and small changes, respectively, for the object words (in red) in the generated change caption. The top three pairs of results, (1), (2), and (3), show the ability of CGA to capture small object changes between two images, whereas (4), (5), and (6) demonstrate its ability to attend to multi-scale objects. For each example, (a) is the change caption generated using only a single scale of features as input and (b) is the caption from the proposed method using multi-scale features; GT denotes the ground truth caption. In (1), the attention map $L_{house}$ is not able to attend to the change in the small "house" in the images, while $S_{house}$ captures it. Hence, by using CGA to select the information carried by the $S_{house}$ attention weights, the proposed method accurately generates a change caption (b) describing the small "house", whereas (a) fails to capture the changes in the images. CGA also locates the small changes in (2)(b) and (3)(b), allowing the model to generate captions that match the GT more accurately. In (4) and (5), CGA assists the change caption decoder in attending to the larger changes in the "road" and "house" for change caption generation. In (6), the decoder attends to both the larger changes in the "trees" that have been removed and the narrow, small changes in the new "road" that has been built. We therefore conclude that the CGA module is critical in the change captioning model for identifying and selecting the essential changes of interest in the image for better change description generation.

Qualitative Analysis
Figure 8 shows the change captioning results on the LEVIR-CC dataset. For each image pair, we provide one of the five ground truth sentences, the sentence generated by an existing method [3] in (a), and the sentence generated by our proposed method in (b). The change object words correctly predicted by our method (b) are highlighted in blue. The proposed method generates change descriptions that are more precise than those of the existing method. For instance, our method identifies and describes the change in the small-scale "house" in the woods, as shown in image pairs (1) and (2), whereas the baseline method tends to predict no change or produce inferior results. Our method can also simultaneously recognize and describe multiple changes in objects at different scales in the bitemporal images. For instance, it accurately recognizes "trees", "villas", and the "road" in image pair (3)(b), rather than just the "trees" and "road" (highlighted in green) in (3)(a). Similarly, in image pair (4)(b), our method describes the change more informatively than the caption generated by the baseline model in (4)(a). Overall, the proposed method effectively leverages information at distinct scales for more precise recognition of changes in bitemporal remote sensing images and generates more informative and accurate change descriptions.

Parametric Analysis
Multiple layers can be stacked in both the proposed ICE and the change caption decoder. The numbers of layers are essential hyperparameters that can significantly influence the performance of the model in generating change descriptions. Table 6 shows the performance of the model when adopting different numbers of layers in the encoder (E.L.) and decoder (D.L.). The model achieves the best performance when E.L. is equal to 2 and D.L. is equal to 1. Our ICE is composed of DAE and CAE sub-modules, which effectively capture the change-aware features, so the encoder does not need additional layers, which would increase the complexity of feature extraction and could harm performance. Conversely, fewer encoder layers reduce MAF's ability to integrate multi-layer semantic information for captioning, resulting in inferior performance. Similarly, CGA in the change caption decoder assists the model in capturing the essential multi-scale changes of interest in the image for better captioning results.
Beam search is a general strategy for enhancing the performance of image captioning methods, and the chosen beam size (e.g., 1, 3, 5) affects the accuracy of the generated sentence. Table 7 reports the effect of using various beam sizes for caption generation. The best performance is achieved with a beam size of 3; a beam size smaller or larger than 3 can result in lower performance. This aligns with existing methods [3,4], which also report their best results with a beam size of 3.
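For readers unfamiliar with the decoding strategy discussed above, the following is a minimal, generic sketch of beam search over a toy next-token distribution. It is not the decoder used in this paper: `beam_search`, `toy_next_log_probs`, and the probability table are hypothetical and chosen only so that greedy decoding (beam size 1) and a wider beam disagree.

```python
import math
from typing import Callable, Dict, List, Tuple

Hypothesis = Tuple[Tuple[str, ...], float]  # (tokens, total log-probability)


def beam_search(
    next_log_probs: Callable[[Tuple[str, ...]], Dict[str, float]],
    beam_size: int,
    max_len: int,
    eos: str = "<eos>",
) -> List[Hypothesis]:
    """Return hypotheses sorted by total log-probability, best first."""
    beams: List[Hypothesis] = [((), 0.0)]
    finished: List[Hypothesis] = []
    for _ in range(max_len):
        # Expand every live hypothesis by every possible next token.
        candidates: List[Hypothesis] = []
        for prefix, score in beams:
            for token, logp in next_log_probs(prefix).items():
                candidates.append((prefix + (token,), score + logp))
        # Keep only the `beam_size` highest-scoring expansions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            if seq[-1] == eos:
                finished.append((seq, score))
            else:
                beams.append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    return sorted(finished, key=lambda c: c[1], reverse=True)


def toy_next_log_probs(prefix: Tuple[str, ...]) -> Dict[str, float]:
    """Hypothetical language model: a fixed table of next-token probabilities."""
    table = {
        (): {"a": 0.6, "the": 0.4},
        ("a",): {"house": 0.55, "road": 0.45},
        ("the",): {"house": 0.9, "road": 0.1},
    }
    probs = table.get(prefix, {"<eos>": 1.0})
    return {token: math.log(p) for token, p in probs.items()}


greedy = beam_search(toy_next_log_probs, beam_size=1, max_len=5)[0]
wider = beam_search(toy_next_log_probs, beam_size=3, max_len=5)[0]
print(" ".join(greedy[0]), round(math.exp(greedy[1]), 2))  # a house <eos> 0.33
print(" ".join(wider[0]), round(math.exp(wider[1]), 2))    # the house <eos> 0.36
```

The toy example only shows the mechanism: a wider beam can recover a sequence with higher total probability than greedy decoding. It does not imply that larger beams always give better captions; as Table 7 indicates, caption metrics can drop again beyond a beam size of 3.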

Conclusions
We introduced an Interactive Change-Aware Transformer Network (ICT-Net) to recognize changes in objects at various scales (e.g., small-scale objects) in bitemporal remote sensing images and to generate a change caption that describes them accurately. We proposed the Interactive Change-Aware Encoder (ICE) to capture discriminative representations between each pair of multi-scale features and utilized a Multi-Layer Adaptive Fusion (MAF) module to aggregate the relevant multi-layer change-aware features for better change captions. We also proposed a novel Cross Gated-Attention (CGA) module to effectively select and utilize the multi-scale change-aware representations for better change captioning. Extensive experiments demonstrated the effectiveness of the proposed ICT-Net, which significantly improves the performance of remote sensing image change captioning.

Figure 1. A visualization of the existing method and our proposed method. (a) The existing method [3] uses a hierarchical approach that tends to integrate non-change-focused information from each encoder layer, disrupting the change feature learning in the decoder and generating inferior change descriptions. Our proposed method attentively aggregates the essential features for more informative caption generation. (b) Existing methods [2,3,4,9] overlook changes in objects of various scales, generating inferior change descriptions. Ours can extract discriminative information across various scales (e.g., a small scale) for change captioning. Blue regions indicate where the word "house" attends in the image, while reddish colors indicate a lower level of attention. The bluer the color, the higher the attention value.

Figure 2. Overview of the proposed ICT-Net. It consists of three components: a multi-scale feature extractor to extract visual features, an Interactive Change-Aware Encoder (ICE) with a Multi-Layer Adaptive Fusion (MAF) module to capture the semantic changes between bitemporal features, and a change caption decoder with a Cross Gated-Attention (CGA) module to generate change descriptions.

Figure 3. Structure of the Multi-Layer Adaptive Fusion module.

Figure 4. Structure of the Cross Gated-Attention module.

At the training stage, given a sequence of word embeddings $E = \{E_1, E_2, \ldots, E_L\} \in \mathbb{R}^{L \times d}$ as inputs, Mask-MHA masks the subsequent position embeddings at time step $t$ and learns to predict the word features $e^*_t$, where $L$ denotes the length of the sentence and $d$ is the word embedding dimension. The process can be written as follows:

Experiments

Dataset and Evaluation Metrics
LEVIR-CC dataset. We conduct experiments on the recently published large-scale LEVIR-CC dataset [3]. The LEVIR-CC dataset contains 10,077 bitemporal remote sensing image pairs, of which 5038 image pairs contain changed regions and the other 5039 image pairs are without changes. The dataset contains 50,385 associated ground truth sentences describing the changes between image pairs, of which 25,190 sentences describe image pairs with changes and the remaining 25,195 sentences describe image pairs without changes. Each image is 256 × 256 pixels. The dataset is split into 6815, 1333, and 1929 image pairs for training, validation, and testing, respectively.
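As a quick consistency check on the LEVIR-CC statistics quoted above, the short script below verifies that the reported counts add up and derives the number of reference sentences per image pair. All numbers are taken from the text; the dictionary layout and variable names are illustrative only.

```python
# LEVIR-CC statistics as quoted in the text.
levir_cc = {
    "pairs_total": 10077,
    "pairs_changed": 5038,
    "pairs_unchanged": 5039,
    "sentences_total": 50385,
    "sentences_changed": 25190,
    "sentences_unchanged": 25195,
    "split": {"train": 6815, "val": 1333, "test": 1929},
}

# Changed and unchanged pairs should account for every pair.
pairs_ok = (
    levir_cc["pairs_changed"] + levir_cc["pairs_unchanged"]
    == levir_cc["pairs_total"]
)

# The same should hold for the sentences.
sentences_ok = (
    levir_cc["sentences_changed"] + levir_cc["sentences_unchanged"]
    == levir_cc["sentences_total"]
)

# The train/val/test split should cover the whole dataset.
split_ok = sum(levir_cc["split"].values()) == levir_cc["pairs_total"]

# Reference sentences per image pair.
sentences_per_pair = levir_cc["sentences_total"] / levir_cc["pairs_total"]

print(pairs_ok, sentences_ok, split_ok)  # True True True
print(sentences_per_pair)                # 5.0
```

The ratio of five sentences per pair matches the five ground truth captions per image pair mentioned in the qualitative analysis.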

Figure 5. Comparison of attention maps generated using DAE and DAE + CAE. $M_{large}$ and $M_{small}$ denote the attention maps for large and small changes captured between bitemporal image features, respectively. $I_{t_0}$ and $I_{t_1}$ denote the input RS images. Note that regions appearing more blue indicate higher levels of attention. We use red dotted boxes to mark the small change areas to ease the visualization.

Figure 6. Visualization of the generated attention map of the caption decoder using the existing MBF [3] method and the proposed MAF. The word highlighted in red in the caption corresponds to the blue region in the generated attention map. Note that regions appearing more blue indicate higher levels of attention.

Figure 7. Visualization of the multi-scale word-feature attention maps captured by the CGA module in the change caption decoder, where $L_{words}$ and $S_{words}$ denote the attention maps that capture large and small object changes, respectively, for each object word (highlighted in red) in the generated change caption. We use red bounding boxes to indicate the small-scale object change regions for image pairs (1), (2), and (3). Image pairs (4), (5), and (6) include medium- to large-scale changes. The regions appearing more blue indicate higher levels of attention.

Figure 8. Qualitative results on the LEVIR-CC dataset. The $I_{t_0}$ image was captured "before", and the $I_{t_1}$ image was captured "after". GT represents the ground truth caption. We use red bounding boxes to indicate the small-scale object change regions for image pairs (1) and (2). Image pairs (3) and (4) include medium- to large-scale changes. Green and blue words highlight the correctly predicted change objects for the existing method (a) and ours (b), respectively.

Table 2. Comparisons on the LEVIRCCD dataset. Our model achieved higher scores. Bold numbers indicate the best result.

Table 3. Performance of the model with various settings on the LEVIR-CC dataset. A tick means the module was included for training, whereas a cross denotes the module was not included. Bold numbers indicate the best result.

Table 4. Ablation studies on the ICE, MAF, and CGA modules on the test subset with no changes, the test subset with changes, and the entire test set. A tick means the module was included for training, whereas a cross denotes the module was not included. Bold numbers indicate the best result.

Table 5. Performance of the model when utilizing different CNN stages on the LEVIR-CC dataset. A tick means the module was included for training, whereas a cross denotes the module was not included. Bold numbers indicate the best result.

Table 6. Performance of the model with different numbers of layers on the LEVIR-CC dataset, where E.L. and D.L. denote the encoder layers and decoder layers, respectively. Bold numbers indicate the best result.

Table 7. Performance of the model when choosing different beam sizes during the inference stage. Bold numbers indicate the best result.