1. Introduction
Change Detection (CD) through the analysis of Remote Sensing (RS) images stands as an indispensable tool across a multitude of disciplines, encompassing agriculture, urban planning, and environmental surveillance. The fundamental principle involves the analysis of two or more images captured over different time instances, all within the same geographical area, aimed at discerning temporal changes [
1]. The continuous evolution of remote sensing technology has significantly eased the detection of changes in even small-scale objects, such as buildings, by leveraging Very High-Resolution (VHR) images. Building change detection proves particularly useful in urban management, notably for Land Use Land Cover (LULC) mapping, identification of illegal construction, and disaster evaluation. Furthermore, the insights derived from change detection analysis offer valuable guidance to policymakers for the effective planning and monitoring of sustainable urban development, ensuring the preservation of ecological balance and the development of cities [
2].
Considering the importance of building CD, several traditional and Deep-Learning (DL)-based methods have been proposed to accomplish this task accurately. Traditional CD techniques rely on manual interpretation and image differencing, practices that, though established, are time-consuming and error-prone. The Remote Sensing Change Detection (RSCD) task has, however, been profoundly reshaped by the strides taken in deep learning, which has ushered in a new era of methodologies characterized by heightened efficiency and precision [
3].
Depending on how the deep features are extracted or the hidden patterns are learned from the bi-temporal data, DL techniques for detecting changes can be categorized into two main approaches: single-stream and double-stream [
4]. The single-stream approach typically involves combining the bi-temporal input images and subsequently conducting a classification task to generate a binary or multiclass Change Map (CM). However, this configuration poses two significant research challenges: determining the data fusion strategy and optimizing the DL classifier. In contrast to the single-stream model, which operates with a single network, the double-stream architecture comprises two subnetworks with identical structures. These subnetworks are concurrently treated and trained to discern the deep features inherent in the two input images. The outcomes are then concatenated to formulate the CM. This configuration, founded upon the Siamese convolutional network, finds widespread application [
5] owing to its capability to simultaneously train the two subnetworks and learn the deep features of bitemporal input images. However, the current dual-stream networks exhibit certain limitations, including elevated complexity and the need for heightened precision in generating the final CM.
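The distinction between the two configurations can be sketched in a few lines of PyTorch; the layer shapes below are illustrative stand-ins, not any specific network from the literature:

```python
import torch
import torch.nn as nn

# Toy encoders standing in for the backbones; layer choices are illustrative.
early_fusion_encoder = nn.Sequential(
    nn.Conv2d(6, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
)

t0 = torch.randn(1, 3, 64, 64)  # pre-change image
t1 = torch.randn(1, 3, 64, 64)  # post-change image

# Single-stream (early fusion): concatenate along channels, one network.
fused_input = torch.cat([t0, t1], dim=1)           # (1, 6, 64, 64)
single_stream_feats = early_fusion_encoder(fused_input)

# Double-stream (Siamese): one 3-channel encoder applied with shared
# weights to each image; features are fused afterwards.
siamese_encoder = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
)
f0, f1 = siamese_encoder(t0), siamese_encoder(t1)
diff_feats = torch.abs(f0 - f1)                    # FC-Siam-Diff-style fusion
concat_feats = torch.cat([f0, f1], dim=1)          # FC-Siam-Conc-style fusion
```

The single-stream variant pushes the fusion question to the input; the Siamese variant pushes it to the feature level, which is exactly the design axis the two families differ on.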
To address this challenge, we introduce a single-stream architecture that leverages adversarial learning. Our approach involves concurrently training two sub-networks within an adversarial learning framework: one tasked with generating a CM and the other designed to evaluate the quality of the generated CM. The model we propose, named Adversarial Change Detection Network (Adv-CDNet), adopts the adversarial learning principles of Generative Adversarial Networks (GANs) [
6]. The foundational architecture of our model draws inspiration from the Pix2Pix model, renowned for its ability to translate an input image from a source domain to a desired target domain [
7]. In the same idea of processing the CD task as an image-to-image translation problem [
8], our model operates by employing the resulting six-channel image, obtained by concatenating the two bi-temporal images, as the source input, while the CM serves as the intended target image. A schematic representation of our proposed model is depicted in
Figure 1.
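Framing CD as image-to-image translation can be illustrated with a minimal Pix2Pix-style training step. The one-layer generator and discriminator below are placeholders for the actual U-Net and PatchGAN, and the L1 weighting follows the common Pix2Pix convention (λ = 100) rather than our exact setting:

```python
import torch
import torch.nn as nn

# Stand-in networks: G maps the 6-channel bi-temporal input to a 1-channel CM;
# D scores (input, CM) pairs. Real architectures are far deeper.
G = nn.Sequential(nn.Conv2d(6, 1, 3, padding=1), nn.Sigmoid())
D = nn.Sequential(nn.Conv2d(7, 1, 3, padding=1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()
l1 = nn.L1Loss()

x = torch.cat([torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64)], dim=1)
real_cm = (torch.rand(1, 1, 64, 64) > 0.5).float()  # ground-truth change map

# Discriminator step: real (input, GT map) pairs vs generated pairs.
fake_cm = G(x).detach()
d_real = D(torch.cat([x, real_cm], dim=1))
d_fake = D(torch.cat([x, fake_cm], dim=1))
loss_d = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Generator step: fool the discriminator while staying close to the GT map.
fake_cm = G(x)
d_fake = D(torch.cat([x, fake_cm], dim=1))
loss_g = bce(d_fake, torch.ones_like(d_fake)) + 100.0 * l1(fake_cm, real_cm)
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```

Conditioning D on the 6-channel input as well as the map is what makes this conditional (Pix2Pix-style) rather than an unconditional GAN.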
The underlying framework of the pix2pix architecture relies on a supervised deep-learning methodology that depends on the availability of extensive labeled datasets. However, this approach faces challenges when applied to the RS building CD task, which suffers from the lack of large established datasets. The process of annotating large-scale CD datasets is time-intensive and labor-demanding. Additionally, the rarity and sparsity of building changes (considered the positive class) render the acquisition of compelling bitemporal images a formidable task. Compounding the issue, currently available building CD datasets, such as those highlighted in [
9,
10], often encompass only restricted geographic regions and are constrained by limited variations in image conditions.
To address the challenge posed by insufficient CD datasets, the adoption of data augmentation techniques emerges as a viable solution. Traditional approaches, predominantly rooted in image processing methodologies encompassing geometric and color transformations, image blending, and related techniques, have been explored [
11]. However, these methods primarily involve geometric transformations or simply change the pixel values within RGB channels. Consequently, they fail to enhance the semantic fidelity of RS images, particularly in tasks demanding nuanced interpretation like CD [
12]. To overcome these limitations, some works have leveraged imaging simulation systems to generate synthetic RS samples, which are subsequently combined with original data [
13]. These innovative methodologies effectively address concerns such as data diversity, blurriness, and distortions. However, it is important to note that the generated images often exhibit compromised quality [
3]. Despite these efforts, the challenge of generating high-quality synthetic images for bolstering CD datasets remains.
Recently, DL-based image generation has demonstrated strong performance across computer vision tasks, producing high-quality and diverse samples that augment the original dataset. This technique has found substantial utility in generating RS samples as well. In the field of RS, DL-based approaches for image generation rely on various techniques, including Variational Auto-Encoding (VAE) and adversarial learning, such as the application of GANs [
11]. For instance, Lv et al. [
12] introduced a modified GAN, termed Deeply supervised GAN (D-sGAN), to synthesize new RS training samples for soil-moving detection. Expanding on this, Singh and Bruzzone [
14] enhanced the generative adversarial network with class-based spectral indices, facilitating the generation of multispectral RS images. Addressing the specific task of aircraft detection within RS images, Liu et al. [
15] devised a multiscale attention Cycle GAN to create novel samples. Xu et al. [
16], on the other hand, proposed a data augmentation strategy combining a modified pix2pix model with the copy–paste operator for Solid Waste Detection. Notably, these endeavors primarily center around generating RS images intended for RS classification and object detection tasks.
Despite these advancements, generating new samples for CD tasks remains a formidable challenge for DL methods. Seo et al. [
17] tackled this by synthesizing changes through diverse mechanisms like random building cropping, inpainting for building suppression, and copy–paste instance labeling. Another work proposed by Chen et al. [
18] employed a GAN-based approach to create new building CD samples. Their methodology involved training a GAN model on a building dataset, followed by transferring generated instances of varying styles onto the synthesized images. The authors additionally introduced context-aware blending techniques for realistic building-background composites, concluding with context-aware color transfer for the final output. Similarly, Li et al. [
19] proposed a method called Image-level Sample Pair Generation (ISPG) based on Label Translation GAN (LT-GAN) to address the challenges of limited data volume and severe class-imbalance issues in building change detection datasets.
Within the scope of our research, as illustrated in Figure 2, we introduce a framework based on the GAN model, designed to generate new images at time T1 (post-change instance) that mainly contain building objects, taking building labels as input. This is achieved through the creation of a novel building label, extrapolated from an image devoid of buildings taken at time T0 (pre-change instance). Subsequently, the generated image at T1 and its respective image at T0, both accompanied by their corresponding created building masks, are integrated with the original dataset to balance it. This concerted effort amplifies the quantity and variety of building CD samples, effectively remedying the data imbalance highlighted earlier.
Furthermore, beyond addressing data imbalance issues using data augmentation techniques, our study establishes the efficacy of introducing an attention module into a deep-learning model. This technique, accomplished through the integration of an attention mechanism, serves as a powerful strategy for rectifying the imbalance between changed and unchanged pixels. The utility of attention mechanisms lies in their capacity to enhance model detection capabilities by emphasizing specific features. In the context of RSCD, this mechanism ensures a heightened focus on changes within an image. For instance, a recent work by Feng et al. [
20] introduced a dual-branch multilevel intertemporal network leveraging self and cross-attention mechanisms to effectively capture change representations, particularly in cases where foregrounds and backgrounds vary. Similarly, Li et al. [
21] proposed a progressive feature aggregation approach with supervised attention, embedded within MobileNet architecture. This technique demonstrated high accuracy while maintaining a reduced parameter count and shorter training times for CD tasks. Due to its efficiency, we integrated an attention module into our Adv-CDNet model. This amelioration significantly improves the accuracy of change detection while concurrently rectifying the imbalance between changed and unchanged areas.
The main contributions of this paper are as follows:
Firstly, we propose a data augmentation strategy designed to effectively generate new CD samples featuring diverse changes in numerous buildings. By employing building label creation and artificial image generation, we enhance the existing CD dataset, ultimately mitigating the risk of class imbalance challenges commonly encountered during the training of DL models for remote sensing building change detection.
Secondly, we present an innovative adversarial training framework called Adv-CDNet, which utilizes a modified Pix2Pix model and integrates a channel attention mechanism. This model can directly map bi-temporal images to a CM while extracting more discriminative features for the CD task, even on imbalanced datasets.
Thirdly, we assess the performance of our high-resolution image generation framework on datasets with severe class imbalance. Experimentation involves two distinct publicly available Remote Sensing (RS) building Change Detection (CD) datasets. Comprehensive comparisons between our Adv-CDNet and other state-of-the-art methods show its effectiveness over alternative approaches. Furthermore, the empirical findings from the evaluations of the incorporation of our data augmentation technique demonstrate significant improvement in the performance of our proposed model.
4. Experiments and Results
4.1. Experimental Setup
The implementation of the change detection and image generation models was carried out using Torch. The CD model was executed on a single Nvidia GeForce RTX 3090 GPU, manufactured by Yunxuan Ltd. of Shanghai, China. The image generation model, owing to its higher complexity, was trained on two Nvidia GeForce RTX 3090 GPUs for increased processing capability.
4.2. Evaluation
The evaluation of our change detection framework included the utilization of five key assessment metrics: Overall Accuracy (OA), Intersection over Union (IoU), Precision (P), Recall (R), and F1-Score (F1):

OA = (TP + TN) / N
IoU = TP / (TP + FP + FN)
P = TP / (TP + FP)
R = TP / (TP + FN)
F1 = (2 × P × R) / (P + R)

where TP, FP, TN, FN, and N correspond to the counts of true positives, false positives, true negatives, false negatives, and the total number of pixels, respectively. These metrics collectively provided a comprehensive understanding of the framework’s performance.
In the specific context of change detection, a noteworthy precision value signifies a limited occurrence of false alarms, while a substantial recall value indicates minimal instances of missed detections. Simultaneously, the F1-Score and overall accuracy serve as holistic performance indicators, with higher values indicative of superior performance. The intersection over Union metric gauges the degree of alignment between the predicted CM and the Ground Truth. By harnessing this comprehensive suite of metrics, our evaluation methodology offered a well-rounded perspective on the efficacy and capabilities of our change detection framework.
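For concreteness, the five metrics can be computed directly from the pixel-level confusion counts; the counts in the example are arbitrary:

```python
def cd_metrics(tp, fp, tn, fn):
    """Pixel-level CD metrics from confusion-matrix counts."""
    n = tp + fp + tn + fn                      # total number of pixels
    oa = (tp + tn) / n                         # Overall Accuracy
    iou = tp / (tp + fp + fn)                  # Intersection over Union
    precision = tp / (tp + fp)                 # few false alarms -> high P
    recall = tp / (tp + fn)                    # few missed detections -> high R
    f1 = 2 * precision * recall / (precision + recall)
    return {"OA": oa, "IoU": iou, "P": precision, "R": recall, "F1": f1}

# Illustrative counts only, not values from our experiments.
m = cd_metrics(tp=800, fp=80, tn=9000, fn=120)
```

Note how OA can stay high even when IoU and F1 are poor on imbalanced data, since unchanged pixels dominate N; this is why the remaining metrics matter for CD.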
4.3. Comparison with the State-of-the-Art Change Detection Methods
To comprehensively assess the performance of our proposed change detection model, we conducted both quantitative and qualitative comparisons with State-of-the-Art (SOTA) change detection methods. These methods serve as benchmarks against which the efficacy of our model can be measured. The following SOTA methods were selected for evaluation:
Fully Convolutional Early Fusion (FC-EF) [
43]: This method employs image-level fusion based on the U-Net architecture. The bi-temporal images are concatenated into a single input for the U-Net model, facilitating holistic feature extraction.
Fully Convolutional Siamese Concatenation (FC-Siam-Conc) [
43]: In contrast to FC-EF, FC-Siam-Conc adopts feature-level fusion. It leverages two encoders with shared weights to extract features from bi-temporal images, concatenating them to the decoder at the same level.
Fully Convolutional Siamese Difference (FC-Siam-Diff) [
43]: This method shares similarities with FC-Siam-Conc, differing primarily in the formation of skip connections. Instead of simple concatenation, FC-Siam-Diff transports the absolute value of the difference between bi-temporal features to the decoder.
Bitemporal Image Transformer (BIT) [
44]: This network captures contextual information within the spatial–temporal domain. By leveraging transformers, BIT effectively models contexts between different temporal images, enhancing its ability to analyze and interpret complex spatial–temporal relationships.
Spatial–Temporal Attention Neural Network (STANet) [
9]: STANet represents a metric-based Siamese FCN approach, enhanced with a spatial–temporal attention module to extract more discriminative features.
Hierarchical Attention Network (HANet) [
26]: This model is a discriminative Siamese network, featuring a hierarchical attention network (HAN) with a lightweight and efficient self-attention mechanism, which is designed to integrate multiscale features and refine detailed features.
Analysis of Table 1 reveals that our Adv-CDNet has a higher parameter count than the three basic Siamese networks and BIT, underscoring its ability to learn and represent complex patterns effectively. Interestingly, when coupled with the attention module, it exhibits fewer parameters than the intricate STA-Net, showcasing a balance between complexity and efficiency. Furthermore, our model demonstrates lower FLOPs compared to HANet and STANet, emphasizing computational efficiency.
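Parameter counts like those reported in Table 1 are typically obtained by summing the sizes of a model's trainable tensors (FLOPs usually require a profiler such as fvcore or thop); a minimal sketch:

```python
import torch.nn as nn

def count_params(model):
    """Trainable parameter count, as commonly reported in model-size tables."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Toy module for illustration: a 3->8 conv has 8*3*3*3 weights + 8 biases.
toy = nn.Conv2d(3, 8, kernel_size=3)
n_params = count_params(toy)
```

Applying the same one-liner to each compared network is how such tables are normally populated.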
Upon reviewing the experimental outcomes detailed in
Table 1 for both WHU-CD and LEVIR-CD datasets, it becomes apparent that our method delivers satisfactory performance on the LEVIR-CD dataset, even without the attention module. The OA, IoU, Precision, Recall, and F1-Score were 98.62%, 74.85%, 90.95%, 80.88%, and 85.62%, respectively. Compared to the three baseline networks (FC-EF, FC-Siam-Conc, and FC-Siam-Diff), our method provided improvements of 6.99%, 2.83%, and 8.23% in terms of the F1-Score, respectively. These results can be confirmed through the visual interpretations shown in Figure 8, where we see limited occurrences of false alarms (FP) and missed detections (FN) in our model’s results compared to the aforementioned networks. The decreased FP and FN explain the improved precision and recall, respectively, leading to the increase in F1-Score. Similarly, as shown in Table 1, the test results of our Adv-CDNet on the WHU-CD dataset also outperform state-of-the-art methods, including FC-EF, FC-Siam-Conc, and FC-Siam-Diff, across various evaluation metrics such as Overall Accuracy, Intersection over Union, Precision, and F1-Score. These results can be confirmed with the visual interpretation from Figure 8, where FC-Siam-Conc and FC-Siam-Diff show a significant occurrence of false positives, which explains the decrease in their precision compared to our model.
The main reason for these results is that most of the aforementioned SOTA methods use a Siamese network, a double-stream framework that generates change maps based on feature differences between two images. These methods are therefore highly dependent on high contrast between the two images, in contrast to our model, which is a single-stream framework that can directly map the two input images to a building change map. This approach leads to more efficient feature extraction and change detection, as it eliminates the need for separate processing and alignment of the two images used in other methods. Moreover, the generator of the Adv-CDNet uses a U-Net architecture to generate a change map from the input images, with skip connections to fuse shallow and deep feature representations. These design choices allow the proposed model to recognize complex features that are difficult to extract using the aforementioned methods, which is especially helpful in complicated scenarios. However, when we compare its performance on the two datasets (LEVIR-CD and WHU-CD), a notable discrepancy emerges. Specifically, the F1-Score exhibits a substantial 20% difference between the two datasets, underscoring the adverse impact of class imbalance in WHU-CD on model training. As illustrated in
Figure 8, our model exhibits superior performance on LEVIR-CD, with more effective detection of building changes compared to WHU-CD. Especially in cases of very subtle changes, the model struggles to detect building alterations in WHU-CD.
Including the attention module in our Adv-CDNet has shown improvements for LEVIR-CD in the OA, IoU, Recall, and F1-Score, as shown at the bottom of Table 1. We can observe that our model outperformed the BIT network in terms of OA, IoU, and F1-Score. Moreover, it also outperformed STA-Net, which uses an attention mechanism, with 2.75%, 1.09%, and 2.46% in terms of OA, IoU, and F1-Score, respectively. Meanwhile, there was a trade-off in precision and recall where our change detection model provided a 12.75% improvement in precision, while STA-Net outpaced our network by 12.64% in recall. We can also note the similar qualitative and quantitative performance between our Adv-CDNet with attention and the HANet model in terms of OA, IoU, and F1-Score. This similarity is likely because both models incorporated the same attention mechanism. Similarly, the incorporation of an attention module into our model for WHU-CD training has yielded significant performance enhancements. We observe significant improvements in IoU (9.4%), precision (6.47%), recall (6.8%), and F1-Score (6.84%). These improvements are very high compared to those achieved on the LEVIR-CD dataset, where the gains are 1.25% for IoU, 2.02% for recall, and 0.94% for F1-Score. This highlights the substantial influence of the attention module, particularly when dealing with severely imbalanced datasets such as WHU-CD in our case.
Notably, the introduction of the attention module amplifies its impact on WHU-CD more than LEVIR-CD, underscoring its efficacy in addressing dataset imbalances, as seen in this challenging dataset. The enhanced performance observed with the attention module stems from its ability to establish relationships among individual channels and recalibrate feature responses on a per-channel basis. This functionality enables the model to concentrate its training efforts on more pertinent features, fostering improved deep representations. Essentially, it facilitates the creation of channel connections and the recalibration of channel-wise feature responses. In practical terms, this empowers the network to boost performance by amplifying the response to semantic changes while constraining the impact of non-changes.
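A squeeze-and-excitation-style block is one common way to implement the channel recalibration described above; this is a generic sketch, not the exact module used in Adv-CDNet:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """SE-style channel recalibration: squeeze spatial dims, learn per-channel
    weights, then rescale the feature map channel by channel."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))            # squeeze: global average pool -> (B, C)
        w = self.fc(w).view(b, c, 1, 1)   # excitation: weights in (0, 1)
        return x * w                      # recalibrate channel responses

att = ChannelAttention(16)
x = torch.randn(2, 16, 32, 32)
y = att(x)
```

Because the weights lie in (0, 1), channels carrying change-relevant responses can be preserved while less informative channels are suppressed, which is the amplify-changes/constrain-non-changes behavior described above.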
4.4. Data Augmentation
In this section, we examine our generation model’s performance through two key aspects: visual evaluation of the generated images and their impact when integrated into the original change-detection dataset for training our Adv-CDNet. Initially, we employ visual interpretation to compare outcomes produced by our generator model with those from the copy–paste method. Subsequently, we delve into assessing the qualitative results stemming from training our model on augmented datasets utilizing both our augmentation method and the conventional approach. The detailed findings are presented in
Figure 9 and
Table 2.
As shown in
Figure 9, our findings reveal a stark contrast between the quality of the images produced by our generator and those generated through the copy–paste method [16] for both the WHU and LEVIR datasets. Notably, images generated using our model exhibit a striking resemblance to reality, complete with intricate details such as shadows on buildings. In contrast, the copy–paste method fails to account for the surrounding environment, resulting in buildings appearing detached and unnatural.
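The limitation of copy–paste is easy to see in code: the operator simply overwrites pixels inside the building mask and ignores everything around them (shadows, illumination, context). A naive NumPy sketch with toy arrays:

```python
import numpy as np

def copy_paste(image, crop, mask, top, left):
    """Paste `crop` pixels where mask==1 into `image` at (top, left).
    Deliberately naive: no blending, no shadow or illumination handling,
    which is why pasted buildings tend to look detached."""
    out = image.copy()
    h, w = mask.shape
    region = out[top:top + h, left:left + w]   # view into `out`
    region[mask.astype(bool)] = crop[mask.astype(bool)]
    return out

# Toy data: an empty 8x8 scene and a 2x2 white "building" crop.
scene = np.zeros((8, 8, 3), dtype=np.uint8)
building = np.full((2, 2, 3), 255, dtype=np.uint8)
mask = np.ones((2, 2), dtype=np.uint8)
augmented = copy_paste(scene, building, mask, top=3, left=3)
```

A GAN-based generator, by contrast, synthesizes the building together with its surroundings, which is the difference visible in Figure 9.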
Furthermore, our generator showcases its versatility by extending its capability to create realistic representations of roads and trees surrounding the buildings, enhancing the overall contextual fidelity of the generated scenes. It is worth noting that the performance of our model can be further improved by expanding the training dataset to include a larger number of images and by training for an extended number of epochs. These refinements hold the potential to elevate the model’s ability to capture even finer nuances of the urban environment in generated imagery.
Table 2 presents a comprehensive evaluation of our Adv-CDNet model’s performance across three distinct scenarios. First, we examined its performance without any data augmentation, meaning the model was trained on the original datasets. Second, we applied traditional data augmentation techniques, including copy–paste [
16], rotation, reflection, and color saturation [11] to increase the number of changed images and create balanced datasets. Finally, we assessed its performance using our own data augmentation approach for balancing the original data. Notably, both data augmentation approaches improved model performance on WHU-CD. However, our method demonstrated substantial superiority, with improvements of 21.44% in IoU, 7.81% in Precision, 22.67% in Recall, and 16.50% in F1-Score, compared to just 2.93%, 6.31%, 3.15%, and 4.32% with traditional methods, respectively. These results highlight the remarkable efficacy of our data augmentation technique, which has a more pronounced positive impact on change detection model performance compared to conventional methods. In contrast, data augmentation resulted in a lower performance improvement when the model was trained on the augmented LEVIR-CD dataset compared to WHU-CD. This discrepancy is due to the difference in the original distribution of changed and unchanged samples in the two datasets. Specifically, only 25% of samples in WHU-CD contain changes compared to 44.6% in LEVIR-CD. Consequently, the number of change samples in LEVIR-CD was not augmented as dramatically as in WHU-CD, explaining the greater impact of data augmentation on model performance when trained on WHU-CD.
To assess the impact of our data augmentation method on Deep-Learning (DL) models trained on imbalanced and balanced datasets, we conducted experiments using State-of-the-Art (SOTA) methods such as BIT, STA-Net, and HANet alongside our ADV-CDNet with attention. Each model was trained with and without data augmentation. The results from
Table 3 demonstrate performance enhancements across all models when trained on augmented datasets for both WHU-CD and LEVIR-CD. Notably, the improvements are more pronounced in WHU-CD compared to LEVIR-CD. For instance, in terms of F1-Score, the enhancements for WHU-CD are substantial, with increases of approximately 8.67%, 9.63%, 10.95%, and 11.17% for BIT, STA-Net, HANet, and ADV-CDNet with attention, respectively. In contrast, the improvements for LEVIR-CD are more modest, with gains of approximately 0.65%, 2.05%, 0.29%, and 0.15% for the same models, respectively. These findings underscore the significant impact of data augmentation on model performance, particularly when dealing with highly imbalanced original datasets.
To further elucidate the influence of dataset composition on the change detection model, we conducted an extended series of experiments, systematically varying the ratio of changed to unchanged images within our datasets. As shown in
Table 4, initially WHU-CD comprised 25% changed and 75% unchanged images, while LEVIR-CD contained 44.4% changed and 55.6% unchanged images. We incrementally augmented the proportion of changed images using our data augmentation approach to create an equivalent number to the unchanged images. Our findings revealed noticeable enhancements in model performance with both configurations, with and without the attention module, for both datasets. Continuing this trend, we progressively adjusted the datasets until reaching 75% changed images and 25% unchanged images. Interestingly, this particular configuration exhibited significant performance boosts for WHU-CD in both model variants. Notably, the model performed excellently without the attention module, demonstrating that a well-prepared dataset can render this extra module unnecessary. In contrast, this configuration did not improve model performance for LEVIR-CD compared to the previous one. However, when we explored using only changed images and deleting all unchanged ones, we observed stark declines in model performance compared to the prior configuration for both datasets. This underscores the importance of including unchanged images during training.
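The number of synthetic changed pairs needed to reach a given changed-image ratio follows from a one-line rearrangement; the counts below are illustrative, not the datasets' actual sizes:

```python
import math

def n_synthetic_needed(n_changed, n_unchanged, target_changed_ratio):
    """Generated changed pairs to add so that changed samples make up
    `target_changed_ratio` of the augmented dataset.
    Solve target = (n_changed + k) / (n_changed + n_unchanged + k) for k."""
    assert 0 < target_changed_ratio < 1
    k = (target_changed_ratio * (n_changed + n_unchanged) - n_changed) \
        / (1 - target_changed_ratio)
    return max(0, math.ceil(k))

# Hypothetical WHU-CD-like start: 25% changed (1500 of 6000 samples).
to_half = n_synthetic_needed(1500, 4500, 0.50)            # balance to 50/50
to_three_quarters = n_synthetic_needed(1500, 4500, 0.75)  # push to 75/25
```

Note how steeply k grows as the target ratio rises: moving from 25% to 75% changed requires far more synthetic pairs than merely balancing the dataset, which is why the 75/25 configuration relies heavily on the generator.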
In summary, our experiments underscore the significance of specific image ratios for optimal performance in change detection. For WHU-CD, the optimal configuration involves 75% changed and 25% unchanged images, while for LEVIR-CD a balanced distribution of 50% changed and 50% unchanged images yields the best results. These ratios are crucial in maximizing the effectiveness of our change detection model. These findings can be attributed to the characteristics of the original datasets. WHU-CD contains a relatively small number of changed buildings (3281) compared to the more extensive set in LEVIR-CD (31,000). Furthermore, the change maps in LEVIR-CD exhibit a more uniform distribution of building changes, in contrast to WHU-CD, where a majority of change maps depict smaller alterations. This discrepancy underscores the necessity for an augmented dataset when training the change detection model on WHU-CD, emphasizing the importance of introducing an excess of building change samples to capture the nuanced variations present in the data.
Figure 10 provides valuable insights into model performance on LEVIR-CD and WHU-CD across three illustrative examples. In the first two WHU-CD instances, we observe decreased false positives (FP) and false negatives (FN) when increasing the number of changed images in the dataset. This positively influences the model’s precision and recall, respectively. The inclusion of an attention mechanism demonstrates effectiveness with the original dataset composition. However, the second example indicates that the attention mechanism has minimal effect when an adequate number of changed images are available. For LEVIR-CD, the first two examples show that the best change detection result occurs when the model is trained with an attention mechanism on balanced data. Moving to the third example, for both datasets it becomes evident that the improvement in model performance due to an increased number of changed images reaches a limit. This is characterized by emerging false positive pixels, subsequently reducing precision and overall performance, as discussed in Table 4. Overall, these findings underscore the dynamic relationships between dataset composition, attention mechanisms, and model performance, and provide nuanced insights into the interpretation of the predicted maps.
5. Discussion
The Adv-CDNet has demonstrated competitive performance in detecting changes in remote sensing images. However, it faces challenges in detecting very subtle changes, particularly in building alterations, when dealing with highly imbalanced data such as the WHU-CD dataset, as discussed in the previous section. To enhance change detection in such data, the incorporation of an attention module has been proposed. The attention module is designed to selectively focus on the most informative features by assigning higher weights to the features that are most relevant to the change detection task. This is achieved through the use of channel connections and channel weight recalibration, which refine intricate features and are then used to modulate the generator’s output. By selectively focusing on the most informative features, the attention module helps the generator produce more accurate change maps, even in cases of highly imbalanced data.
In addition to the attention module, another approach to enhance the performance of Adv-CDNet on imbalanced data is the creation of extensive labeled datasets and the generation of remote sensing images to augment the change detection dataset. Labeled datasets play a pivotal role in supervised learning tasks, enabling models to learn patterns and make accurate predictions. In the context of building CD, having a diverse and extensive labeled dataset is essential for training models to detect building changes accurately across various environmental conditions. Moreover, the use of GAN-based generator models for data augmentation has significantly enhanced the performance of Adv-CDNet by producing synthetic samples that supplement the original dataset. GANs are particularly useful for addressing data scarcity issues and improving model generalization by creating additional training examples. By leveraging GANs for data augmentation, the robustness and accuracy of the CD model have been enhanced, especially in scenarios where the dataset is highly imbalanced.
The obtained findings illuminate a delicate trade-off between leveraging attention modules to optimize model performance and enhancing dataset effectiveness through data augmentation techniques. Each method carries its distinct advantages and limitations. Employing attention modules to enhance change detection model performance emerges as particularly advantageous when dealing with imbalanced datasets. In these instances, the attention mechanism can significantly bolster the model’s ability to discern changes. However, its impact becomes marginal when the dataset achieves a balanced state, primarily adding complexity and latency without yielding substantial performance improvements compared to non-attention models. This nuanced understanding underscores the significance of dataset balance and the judicious application of attention mechanisms in the context of change detection models.
On the other hand, data augmentation, particularly through GANs, proves invaluable in augmenting the dataset with effective training samples. By introducing diverse building changes to images, this technique enhances the diversity of the dataset, augmenting both its quality and quantity within the realm of change detection. These samples closely align with real-world scenarios, facilitating training on a higher number of high-quality images. Furthermore, the training loss plot of the change detection model in
Figure 11 shows that the use of our data augmentation method led to faster model stability compared to other approaches for addressing data imbalance. Specifically, Adv-CDNet with our data augmentation strategy reached stability in fewer epochs than with an attention module or traditional augmentation techniques. This more rapid stabilization also helps address overfitting issues. However, it is essential to note that GAN-based augmentation requires its own set of training samples to fine-tune the generative model, a task facilitated in our WHU-CD dataset through building labels in conjunction with before and after images. Nonetheless, datasets lacking such supplementary information may necessitate a dedicated effort to construct an appropriate training dataset for the generative model. For example, in the case of LEVIR-CD, a set of images with their building labels was meticulously extracted from the primary CD dataset. Moreover, the generator model needs further improvement to generate more realistic RS images.