Article

Remote Sensing Image Change Detection Based on Dynamic Adaptive Context Attention

1 School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541000, China
2 Department of Architecture and Civil Engineering, City University of Hong Kong, Hong Kong 999077, China
3 School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu 610000, China
4 School of Computer Engineering, Guilin University of Electronic Technology, Beihai 536000, China
5 School of Electronic Information, Guilin University of Electronic Technology, Beihai 536000, China
* Authors to whom correspondence should be addressed.
Symmetry 2025, 17(5), 793; https://doi.org/10.3390/sym17050793
Submission received: 15 April 2025 / Revised: 13 May 2025 / Accepted: 14 May 2025 / Published: 20 May 2025
(This article belongs to the Section Computer)

Abstract

Although some progress has been made in deep learning-based remote sensing image change detection, the complexity of scenes and the diversity of changes in remote sensing images lead to challenges related to background interference. For instance, remote sensing images typically contain numerous background regions, while the actual change regions constitute only a small proportion of the overall image. To address these challenges in remote sensing image change detection, this paper proposes a Dynamic Adaptive Context Attention Network (DACA-Net) based on an exchanging dual encoder–decoder (EDED) architecture. The core innovation of DACA-Net is the development of a novel Dynamic Adaptive Context Attention Module (DACAM), which learns attention weights and automatically adjusts the appropriate scale according to the features present in remote sensing images. By fusing multi-scale contextual features, DACAM effectively captures information regarding changes within these images. In addition, DACA-Net adopts an EDED architectural design, where the conventional convolutional modules in the EDED framework are replaced by DACAM modules. Unlike the original EDED architecture, DACAM modules are embedded after each encoder unit, enabling dynamic recalibration of T1/T2 features and cross-temporal information interaction. This design facilitates the capture of fine-grained change features at multiple scales. This architecture not only facilitates the extraction of discriminative features but also promotes a form of structural symmetry in the processing pipeline, contributing to more balanced and consistent feature representations. To validate the applicability of our proposed method in real-world scenarios, we constructed an Unmanned Aerial Vehicle (UAV) remote sensing dataset named the Guangxi Beihai Coast Nature Reserves (GBCNR). Extensive experiments conducted on three public datasets and our GBCNR dataset demonstrate that the proposed DACA-Net achieves strong performance across various evaluation metrics. For example, it attains an F1 score (F1) of 72.04% and a precision (P) of 66.59% on the GBCNR dataset, representing improvements of 3.94% and 4.72% over state-of-the-art methods such as the semantic guidance and spatial localization network (SGSLN) and the bi-temporal image Transformer (BIT), respectively. These results verify that the proposed network significantly enhances the ability to detect critical change regions and improves generalization performance.

1. Introduction

Remote sensing image change detection involves utilizing remote sensing images captured at two distinct time points to identify changes in land cover or ground features between these intervals. This process is crucial across various application domains, including urban landscape monitoring [1], agricultural surveys [2], land cover mapping [3], and natural resource management [4]. It represents a significant research direction within the field of computer vision. Traditional methods predominantly rely on manually designed feature extraction algorithms, which exhibit limited capacity for feature representation, strong dependence on specific scenes and datasets, as well as low precision and computational efficiency. In recent years, deep learning techniques have gained rapid traction in the realm of remote sensing [5,6] due to the availability of vast amounts of remote sensing data and advancements in deep learning methodologies. Deep learning-based approaches can automatically learn robust feature representations from extensive datasets, capturing richer and more intricate patterns of change while demonstrating strong generalization capabilities. Furthermore, these methods can be applied to remote sensing data from diverse regions and time periods, facilitating automatic detection of changes in remote sensing images through end-to-end training processes. This results in significantly enhanced accuracy and computational efficiency.
Transformers were originally proposed by Vaswani et al. in 2017 to address sequence modeling tasks in natural language processing [7]. Their core mechanism, self-attention, enables the modeling of long-range dependencies without relying on traditional recurrent neural networks (RNNs) or convolutional neural networks (CNNs) [8]. Compared with RNNs, Transformers offer higher parallel computation efficiency and more stable performance on long sequences; compared with CNNs, Transformers are capable of capturing global contextual information by attending to the entire input [9,10]. With the introduction of the Vision Transformer (ViT), Transformer architectures have been successfully adapted to the field of computer vision, achieving remarkable results in tasks such as image classification, object detection, and image segmentation [11]. Due to their powerful feature extraction and modeling capabilities, Transformers have gradually been applied to remote sensing image processing tasks [12,13]. In particular, they show great potential in change detection by effectively capturing subtle yet critical differences between images through self-attention mechanisms, thereby improving the accuracy of identifying change regions [14]. As a result, Transformers are becoming a significant direction in the research of remote sensing image change detection.
The success of Transformers in the field of natural language processing [15,16] has attracted increasing attention within the domain of computer vision, particularly in remote sensing image change detection, which represents a significant application area. In recent years, several Transformer-based models have been introduced. These models predominantly employ a two-stream architecture that integrates Transformers with various components such as the U-shaped Network (UNet) [17], other CNNs [18], and additional attention mechanisms [19]. BIT [20] is a change detection method grounded in Transformers that features an element known as the bitemporal image Transformer. This approach initially extracts features from each image using a Residual Network (ResNet) and subsequently transforms these features into a set of semantic tokens through a spatial attention mechanism. A Transformer model is then utilized to process contextual information both spatially and temporally. The feature difference images are computed by projecting the resulting tokens back into pixel space from the two context-rich feature maps. Finally, a CNN is employed for change prediction. However, this method struggles to effectively suppress interference caused by irrelevant features and pseudo changes while enhancing true change features during remote sensing image change detection. Fully Convolutional Early Fusion (FC-EF) [21] combines fully convolutional networks with the UNet architecture. Unlike UNet’s five sampling layers, FC-EF consists solely of four max pooling layers and four upsampling layers. The network input is formed by concatenating the two images of a pair along the channel dimension. Fully Convolutional Siamese Network with Concatenation (FC-Siam-conc) [21] utilizes twin networks that share parameters to process pre- and post-change images before merging them with generated features at the decoding layer. Meanwhile, FC-Siam-diff [21] computes the differences between the pre- and post-change image features and integrates them with the generated decoding layer features. However, the accuracy and generalization capabilities of these methods are insufficient for effective change detection in remote sensing images.
The dual encoder–decoder (DED) architecture [22] is designed to input images from two different time periods into a dual encoder–decoder framework. This approach segments the target object within the image and subsequently fuses it into a single decoder to generate a variation map. EDED [22] shares the same structural design as DED; however, in EDED, features are exchanged between encoders so that each branch incorporates features from both time periods. Despite this innovative feature exchange mechanism, the EDED structure exhibits suboptimal performance when faced with background interference in images, highlighting an urgent need for enhancements in both accuracy and generalization capabilities.
The attention mechanism [23] is a data processing approach that emulates human attention allocation, enabling machine learning models to identify target regions containing significant information amidst numerous irrelevant background areas [24]. This method enhances both the performance and efficiency of the model. In remote sensing images, a notable characteristic is the prevalence of extensive background regions, with change regions requiring detection typically constituting only a small fraction of the overall image. Consequently, the attention mechanism assists the model in concentrating on critical change regions by assigning varying weights to features [25]. This process improves the network’s ability to differentiate between foreground targets in remote sensing images and complex backgrounds while mitigating interference from these intricate backgrounds during change detection tasks. Ultimately, this leads to improved accuracy in detecting changes within remote sensing imagery [26]. Furthermore, different types of changes depicted in an image—such as increases in buildings, modifications to roads, or alterations in vegetation—exhibit unique characteristics. The attention mechanism empowers the model to focus on key features pertinent to each specific type of change, thereby facilitating effective detection of those changes [27]. It can be observed that this mechanism significantly enhances the adaptability and generalization capabilities of the model.
By integrating the strengths of the aforementioned methods, this paper introduces DACA-Net based on the EDED architecture. Unlike the standard EDED architecture, our network incorporates a carefully designed DACAM module that enables immediate capture of fine-grained differences between T1 and T2 at each encoding layer. It achieves dynamic and adaptive bidirectional information flow across both the channel and temporal dimensions. This multi-level, cross-temporal attention enhancement strategy allows for earlier, finer, and more comprehensive exploitation of temporal change features in the image sequence, significantly improving the model’s sensitivity to subtle changes and its segmentation accuracy. The core component of this network is DACAM, which automatically adjusts its scale in accordance with the input image and effectively fuses contextual feature information. This approach enhances the ability to capture change information in remote sensing images, thereby improving both accuracy and generalization capabilities of the network. Furthermore, DACA-Net facilitates feature exchange within the feature extraction layer, enabling each encoder to access features from both branches of two time-series images. This mechanism is employed to refine change regions more effectively.
The primary contributions of this paper are as follows:
  • We propose the DACA-Net, which fully leverages the features of two temporal images and significantly enhances the performance of change detection in remote sensing imagery.
  • The DACAM is designed to learn attention weights, enabling automatic adjustment of the appropriate scale based on the characteristics of the input image. This approach effectively fuses multi-scale contextual features to maximize the capture of change information in remote sensing imagery while mitigating interference from complex backgrounds during change detection.
  • We have constructed a UAV remote sensing image dataset, GBCNR, to address the existing gap in wetland remote sensing image datasets.

2. Specific Work

2.1. The Challenges of Change Detection in Remote Sensing Images

As illustrated in Figure 1, remote sensing images typically encompass multiple objects, including wetlands, mangroves, roads, and buildings, alongside extensive background areas. In change detection tasks, often only a limited subset of objects of interest, such as mangroves, needs to be detected. Consequently, the primary focus of remote sensing image change detection should be directed towards the target change area, specifically the mangrove regions. Furthermore, as depicted in Figure 1, other entities such as water bodies and infrastructure are present within the remote sensing images; these can lead to issues related to background interference. In remote sensing imagery, “background interference” refers to elements that appear to change visually but are unrelated to the actual targets of interest. Examples include shadow variations caused by differing illumination conditions, vegetation differences due to wind-induced tree movement, or reflections on water surfaces. These non-target changes can mislead detection models and produce false change results; therefore, they must be effectively suppressed in change detection tasks. However, existing methodologies frequently struggle to adequately mitigate the interference stemming from irrelevant features and pseudo changes. This presents significant challenges for effective change detection; for instance, while the advanced EDED structure demonstrates commendable performance overall, it does not sufficiently address background interference. To tackle this issue more effectively, this paper proposes the incorporation of an attention mechanism aimed at alleviating these challenges.

2.2. Overall Network Structure

In image change detection, the encoder–decoder architecture is a commonly used deep learning framework. The encoder is responsible for extracting high-level features from the images, effectively compressing them into informative representations. The decoder then reconstructs these features into pixel-level change detection results. This architecture enables a better understanding of image content and spatial information, making it well-suited for change detection tasks in remote sensing imagery.
Based on the EDED architecture, this paper presents a remote sensing image change detection method named DACA-Net. The overall structure of DACA-Net is illustrated in Figure 2. To overcome the limitation of the conventional EDED architecture in capturing multi-scale and fine-grained changes, this paper incorporates a DACAM module after each feature extraction stage in the encoder. This design enables dynamic recalibration and cross-temporal attention fusion of T1/T2 features. The EDED architecture is illustrated in Figure 3. DACA-Net comprises a dual encoder–decoder framework featuring two weight-shared encoders and two weight-shared decoders. Each encoder consists of a convolutional unit and a DACAM, specifically designed for effective feature extraction. The weight-sharing mechanism allows the encoders to extract spatial features from two temporal images. The T1 image and T2 image represent remote sensing images captured at the same location during different time periods, which are input into the respective weight-shared encoders. Subsequently, the features extracted by both encoders undergo semi-exchange, enabling each encoder to incorporate characteristics from both time periods. This process facilitates an initial identification of changing areas within each encoder’s output.
The two decoder branches are upsampled to generate a change mask that is half the size of the input image. Subsequently, these two decoders are fused to produce a change mask that matches the dimensions of the input image, which constitutes the output of DACA-Net. Encoders 1–5 employ a 3 × 3 convolution for processing input features and subsequently utilize DACAM to enhance these features. The encoders effectively facilitate feature extraction, endowing them with rich spatial and semantic characteristics. Decoders 1–3 perform upsampling on the features before fusing those from both branches in order to accurately identify regions of variation.
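To make the semi-exchange step concrete, the following minimal sketch swaps half of the feature channels between the two temporal branches, which is one common reading of the exchange mechanism in EDED-style networks; the half-channel split point, tensor shapes, and function name are illustrative assumptions rather than the exact released implementation.

```python
# Minimal sketch of the "semi-exchange" between the two encoder branches, read here as
# swapping half of the feature channels between the T1 and T2 feature maps. The split
# point and function name are illustrative assumptions, not the released implementation.
import torch

def semi_exchange(feat_t1: torch.Tensor, feat_t2: torch.Tensor):
    """Swap the second half of the channels between the two temporal feature maps."""
    half = feat_t1.shape[1] // 2
    out_t1 = torch.cat([feat_t1[:, :half], feat_t2[:, half:]], dim=1)
    out_t2 = torch.cat([feat_t2[:, :half], feat_t1[:, half:]], dim=1)
    return out_t1, out_t2

# Example: two 64-channel feature maps produced by the weight-shared encoders.
f1, f2 = torch.randn(2, 64, 128, 128), torch.randn(2, 64, 128, 128)
x1, x2 = semi_exchange(f1, f2)
```

After the exchange, each branch carries information from both acquisition dates, which is what allows the subsequent decoder stages to localize change regions.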

2.3. DACAM

The attention mechanism is a technique that simulates the human visual focus system, allowing the model to concentrate on more important regions when processing images. Similar to how humans tend to focus on areas with the most noticeable changes when viewing a scene, attention mechanisms assign different weights to various parts of the image, enabling the model to highlight key change regions while suppressing background interference, thereby improving detection accuracy.
DACAM serves as the core component of DACA-Net, with its structure illustrated in Figure 4. The DACAM module is divided into three stages, which perform attention modulation on the channel, spatial, and scale dimensions, respectively. DACAM effectively captures change feature information from remote sensing images by learning attention weights across multiple scales and maximizing the fusion of multi-scale contextual information while simultaneously mitigating the interference caused by complex backgrounds during change detection.
The model first determines the importance of each channel. For instance, some channels may focus more on buildings, while others are more sensitive to water bodies. We apply average pooling and max pooling to obtain preliminary statistics, followed by convolution operations to compute the channel attention weights, thereby enhancing the more informative channels. As shown in Figure 4a, in the initial stage, the channel convolution layer is designed to assess the significance of each channel within the feature map. This layer adaptively assigns weights to channels, thereby enhancing feature representation capabilities, improving model performance and computational efficiency, and facilitating adaptive feature learning. As illustrated in the first stage of Figure 4, average and maximum pooling operations are conducted on the input features to compute both mean and maximum channel attention. Subsequently, a one-dimensional convolutional layer is employed to calculate the channel attention weights by transforming the number of input channels from 2 to 1 while constraining these weights within a range of 0 to 1. The resulting channel attention weights are then applied to the input feature map in order to derive a channel attention-weighted feature map.
Next, the model evaluates the importance of each spatial location in the image. For example, changes may only occur in the bottom-right region. After spatial pooling, we use a convolution to generate a spatial attention map that highlights key regions. As shown in Figure 4b, the second stage consists of a spatial convolution layer that emphasizes key spatial positions within the feature map. This layer adaptively assigns weights to different spatial locations, thereby enhancing the extraction of features from critical areas such as targets and improving the model’s capacity to leverage spatial positional information, ultimately leading to enhanced model performance. As illustrated in Figure 4, the mean and maximum values of spatial attention are derived by averaging and maximizing the feature maps weighted by channel attention. Subsequently, the spatial attention weight is computed using a two-dimensional convolution layer that processes two-dimensional data, resulting in a reduction in channels from 2 to 1 while constraining values between 0 and 1. The calculated spatial attention weight is then applied to the feature map that has been weighted by channel attention.
Finally, the model applies convolutional kernels of varying sizes (e.g., 3 × 3, 5 × 5, and 7 × 7) to process the image, capturing changes at multiple scales. This step mimics how humans observe images from different perspectives and enhances the model’s robustness in detecting both large-scale and fine-grained changes. As shown in Figure 4c, the third stage consists of multi-scale convolution layers, which effectively capture features of various target sizes and details through adaptive attention mechanisms and the fusion of feature information across different scales. This approach enhances the model’s ability to perceive and utilize multi-scale information, thereby improving its performance in complex scenarios. As illustrated in the third stage of Figure 4, three distinct convolution layers with kernel sizes of 3, 5, and 7 are applied to the spatial attention-weighted feature maps to generate three multi-scale attention maps. Following this process, a sigmoid activation function is applied to these feature maps to derive the multi-scale attention weights. These weights are then utilized on the spatially attended feature map to produce the final output feature map.
The DACAM framework incorporates channel attention, spatial attention, and multi-scale attention as sequential structures. It processes input feature maps through their respective pooling operations and learning weights before fusing the resulting outputs. This design is predicated on the complementary nature of the three attention mechanisms, which collectively enhance feature extraction capabilities. Consequently, this approach improves overall performance and generalization ability while adapting effectively to complex data distributions.
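The following self-contained PyTorch sketch illustrates the three sequential stages described above (channel, spatial, and multi-scale attention). Where the text does not fix a detail, the sketch makes an explicit assumption: the 1D-convolution kernel size in the channel stage, the 7 × 7 kernel in the spatial stage, and the simple averaging used to fuse the three multi-scale attention maps are illustrative choices, not the authors' exact configuration.

```python
# A minimal, self-contained sketch of the three sequential DACAM stages described
# above (channel -> spatial -> multi-scale attention). The 1D kernel size, the 7 x 7
# spatial kernel, and the averaging of the three multi-scale maps are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DACAMSketch(nn.Module):
    def __init__(self, channels: int, k1d: int = 3):
        super().__init__()
        # Stage (a): 1D conv maps the stacked (avg, max) channel statistics from 2 to 1.
        self.channel_conv = nn.Conv1d(2, 1, kernel_size=k1d, padding=k1d // 2)
        # Stage (b): 2D conv maps the stacked (mean, max) spatial maps from 2 to 1.
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)
        # Stage (c): channel-preserving convolutions with kernel sizes 3, 5, and 7.
        self.multi_scale = nn.ModuleList(
            [nn.Conv2d(channels, channels, kernel_size=k, padding=k // 2) for k in (3, 5, 7)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        # Stage (a): channel attention from global average- and max-pooled statistics.
        avg = F.adaptive_avg_pool2d(x, 1).view(b, 1, c)
        mx = F.adaptive_max_pool2d(x, 1).view(b, 1, c)
        ca = torch.sigmoid(self.channel_conv(torch.cat([avg, mx], dim=1)))   # (b, 1, c)
        x = x * ca.view(b, c, 1, 1)
        # Stage (b): spatial attention from channel-wise mean and max maps.
        mean_map = x.mean(dim=1, keepdim=True)
        max_map, _ = x.max(dim=1, keepdim=True)
        sa = torch.sigmoid(self.spatial_conv(torch.cat([mean_map, max_map], dim=1)))
        x = x * sa
        # Stage (c): multi-scale attention; the three sigmoid maps are averaged here.
        ms = torch.stack([torch.sigmoid(conv(x)) for conv in self.multi_scale]).mean(dim=0)
        return x * ms

# Example: recalibrate a 64-channel encoder feature map.
out = DACAMSketch(64)(torch.randn(2, 64, 128, 128))
```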

2.4. Loss Function

The change detection task in remote sensing imagery is framed as a binary classification problem, wherein the objective is to ascertain whether a specific pixel has undergone any changes. Given that unchanged pixels typically constitute the majority of the dataset, this leads to sample imbalance issues. To mitigate the effects of such imbalances, this paper proposes a loss function that integrates the binary cross-entropy loss [28] with the similarity coefficient loss [29]. This combined approach effectively addresses the challenges posed by unbalanced binary classification and avoids excessively penalizing the model for predicting unchanged pixels.
$$\mathrm{Loss}_{bce} = -\frac{1}{n}\sum_{i=1}^{n}\left[\, y_i \log \hat{y}_i + (1 - y_i)\log\left(1 - \hat{y}_i\right)\right]$$
The binary cross-entropy loss function is presented in Formula (1).
In this context, $n$ denotes the total number of samples within the dataset, while $y_i$ represents the true label for the $i$-th sample, which can take on values of either 0 or 1. Additionally, $\hat{y}_i$ indicates the predicted probability output generated by the model for the $i$-th sample, with values ranging between 0 and 1.
The primary objective of this loss function is to minimize the discrepancy between the model’s predicted probabilities and the actual labels.
$$\mathrm{Loss}_{dice}(p, t) = 1 - \frac{2\sum_{i} p_i t_i + \epsilon}{\sum_{i} p_i + \sum_{i} t_i + \epsilon}$$
The similarity coefficient loss function is presented in Formula (2).
In this context, $p$ denotes the probability value predicted by the model, while $t$ represents the binary value of the actual label, which can take on values of either 0 or 1. Specifically, $p_i$ and $t_i$ correspond to the predicted and true values for the $i$-th pixel, respectively. To mitigate issues related to division by zero, a small smoothing term $\epsilon$ is typically incorporated. This loss function emphasizes the overall degree of overlap between the prediction results and the actual labels rather than focusing solely on individual pixel accuracy; thus, it effectively addresses challenges associated with sample imbalance.
$$\mathrm{Loss} = \mathrm{Loss}_{bce} + \mathrm{Loss}_{dice}$$
Therefore, the loss function proposed in this paper is defined as the combination of the binary cross-entropy loss and the similarity coefficient loss, as illustrated in Formula (3).
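A minimal sketch of this combined objective, written to match Formulas (1)–(3), is given below; the variable names and the value of the smoothing term are illustrative, and predictions are assumed to already be probabilities in [0, 1].

```python
# Minimal sketch of the combined objective in Formulas (1)-(3): binary cross-entropy
# plus a Dice (similarity-coefficient) term. Predictions are assumed to be
# probabilities in [0, 1]; the epsilon value is an illustrative choice.
import torch
import torch.nn.functional as F

def change_detection_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6):
    # Formula (1): binary cross-entropy averaged over all pixels.
    bce = F.binary_cross_entropy(pred, target)
    # Formula (2): Dice loss measuring overlap between prediction and ground truth.
    intersection = (pred * target).sum()
    dice = 1.0 - (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)
    # Formula (3): the combined loss.
    return bce + dice

# Example with a 256 x 256 probability map and a binary ground-truth mask.
p = torch.rand(1, 1, 256, 256)
t = (torch.rand(1, 1, 256, 256) > 0.9).float()
loss = change_detection_loss(p, t)
```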

2.5. Evaluation Metrics

In this study, F1 is adopted as a comprehensive metric to evaluate the model’s performance in change detection. Recall (R) measures the proportion of actual change regions that are correctly identified, while P quantifies the proportion of correctly detected changes among all predicted change regions. Intersection over Union (IoU) assesses the degree of overlap between the predicted change regions and the ground truth.
Precision refers to the ratio of true change regions within the predicted change regions. It is defined as shown in Formula (4).
$$P = \frac{TP}{TP + FP}$$
Here, True Positive (TP) refers to correctly detected change regions, pixels that are predicted as changes by the model and actually correspond to real changes. False Positive (FP) denotes falsely detected change regions, i.e., pixels predicted as changes by the model but in fact unchanged. A higher precision indicates a lower false alarm rate of the model.
Recall refers to the proportion of actual change regions that are correctly detected. It is defined as shown in Formula (5).
$$R = \frac{TP}{TP + FN}$$
False Negative (FN) refers to missed change regions, pixels where actual changes occurred but were not detected by the model. A higher recall indicates a lower miss rate of the model.
IoU is used to measure the overlap between the change regions detected by the model and the actual change regions. The definition of IoU is shown in Formula (6).
$$IoU = \frac{TP}{TP + FP + FN}$$
Here, TP refers to pixels that are detected as changes and actually correspond to changes. FP denotes falsely detected change regions, while FN refers to missed change regions. A higher IoU value indicates a greater overlap between the model’s detected change regions and the actual change regions.
F1 is the harmonic mean of precision and recall, used to provide a comprehensive evaluation of the model’s performance in change detection. The definition of the F1 is shown in Formula (7).
$$F1 = \frac{2 \times P \times R}{P + R}$$
A higher F1 indicates better overall performance of the model.
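The four metrics can be computed directly from the pixel-wise confusion counts, as in the short sketch below; the 0.5 binarization threshold and the small epsilon guard against empty masks are assumptions for numerical safety rather than part of the definitions above.

```python
# Short sketch of Formulas (4)-(7) computed from pixel-wise confusion counts.
# The 0.5 binarization threshold and the epsilon guard are assumptions for safety.
import torch

def change_metrics(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-9):
    pred_bin = (pred > 0.5).float()
    tp = (pred_bin * target).sum()          # correctly detected change pixels
    fp = (pred_bin * (1 - target)).sum()    # false alarms
    fn = ((1 - pred_bin) * target).sum()    # missed changes
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    iou = tp / (tp + fp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    return precision, recall, iou, f1

# Example on a random prediction and ground-truth mask.
p = torch.rand(1, 1, 256, 256)
t = (torch.rand(1, 1, 256, 256) > 0.5).float()
precision, recall, iou, f1 = change_metrics(p, t)
```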

3. Experiment and Analysis

3.1. The Constructed GBCNR Dataset

Although several public change detection datasets exist (e.g., LEVIR-CD, WHU-CD, and CDD), most of them focus on urban structural changes and are primarily based on satellite or aerial imagery with relatively low spatial resolution. These datasets are insufficient for high-precision ecological monitoring tasks. In contrast, wetland changes are typically characterized by blurred boundaries, slow evolution, and semantic ambiguity, which require more refined spatial-temporal perception from deep learning models. However, current datasets rarely include scenes related to coastal wetland ecosystems, and there is a lack of dedicated benchmarks for such ecological scenarios. Furthermore, most existing datasets are annotated by non-experts, which may compromise label quality. By contrast, our proposed GBCNR dataset is specifically designed for wetland change detection and presents several unique advantages: (1) high-resolution UAV images captured by DJI Mavic drones over the coastal wetlands of Beihai, Guangxi, at centimeter-level resolution, enabling the detection of subtle ecological changes; (2) expert-guided pixel-level annotations under the supervision of environmental ecologists, ensuring high label accuracy and domain relevance; (3) diverse and challenging real-world scenes with wide wetland coverage and ecological variability. As a result, the GBCNR dataset offers a unique and challenging benchmark for remote sensing change detection in ecological contexts. It helps fill the gap in existing datasets and can support applications in ecological monitoring, wetland conservation, and natural resource management.
We developed the GBCNR dataset, which was captured by a drone over the nature reserves along the Guangxi Beihai coast. The images were acquired at an altitude of 500 m using the DJI Mavic series UAV, with each image measuring 5280 × 3956 pixels. Under the guidance of environmental ecologists, these images were annotated using the LabelMe tool to identify wetland areas. To satisfy the input requirements of our proposed deep learning models, selected regions from the images were cropped to dimensions of 256 × 256 pixels and randomly divided into a training set (1748 pairs), a validation set (499 pairs), and a test set (249 pairs). Some samples from the GBCNR dataset are presented in Figure 5.
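For illustration, the sketch below tiles a large UAV frame into non-overlapping 256 × 256 patches in the spirit of the preprocessing described above; the non-overlapping stride, the discarding of incomplete border tiles, and the function name are assumptions, since the exact cropping procedure is not specified.

```python
# Illustrative sketch: split a large UAV image into non-overlapping 256 x 256 patches.
# The stride and the discarding of incomplete border patches are assumptions.
import numpy as np

def tile_image(img: np.ndarray, patch: int = 256) -> list:
    """Return all full patch x patch tiles of an (H, W, C) image, row by row."""
    h, w = img.shape[:2]
    return [
        img[r:r + patch, c:c + patch]
        for r in range(0, h - patch + 1, patch)
        for c in range(0, w - patch + 1, patch)
    ]

# With this stride, a 3956 x 5280 frame yields 15 x 20 = 300 full tiles.
tiles = tile_image(np.zeros((3956, 5280, 3), dtype=np.uint8))
assert len(tiles) == 300
```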

3.2. Public Datasets

Verification experiments were conducted on three commonly utilized public datasets in the field of remote sensing image change detection, specifically LEVIR-CD [30], CLCD [31], and SYSU-CD [32].
The LEVIR-CD dataset comprises 637 pairs of high-resolution remote sensing images, each with dimensions of 1024 × 1024 pixels and a resolution of 0.5 m. This dataset encompasses various types of buildings, including villa homes, high-rise apartments, small garages, and large warehouses. To alleviate the computational burden on the GPU server, we cropped each image to a size of 256 × 256 pixels. In this study, the training set consists of 7120 pairs of clipped images, while the verification set contains 1024 pairs, and the test set includes 2048 pairs. Some samples from the LEVIR-CD dataset are illustrated in Figure 6.
The CLCD dataset is an annual land cover dataset of China, developed by Wuhan University utilizing 335,709 Landsat images on Google Earth Engine. This dataset provides year-by-year land cover information for China from 1985 to 2020, with a spatial resolution of 30 m. To facilitate deep learning tasks, each image in the CLCD has been cropped into segments of 256 × 256 pixels and subjected to random rotations. The training set consists of 1440 pairs of cropped images, while the verification and test sets comprise 480 pairs each. Some samples from the CLCD dataset are illustrated in Figure 6.
The SYSU-CD dataset comprises 20,000 pairs of aerial images captured in Hong Kong between 2007 and 2014. These images depict a variety of complex change scenarios, including road expansions, new urban developments, vegetation alterations, suburban transformations, construction foundations, and more. Each image has dimensions of 256 × 256 pixels and a spatial resolution of 0.5 m. In this study, the training set, validation set, and test set were created by randomly rotating each image; they consist of 12,000 pairs for training and 4000 pairs each for validation and testing. Some samples from the SYSU-CD dataset are illustrated in Figure 6.

3.3. Data Enhancement Strategies

In the training phase, several data augmentation strategies were implemented to enhance the diversity of the training dataset and improve the model’s generalization capabilities; a minimal sketch of this augmentation pipeline is given after the list.
  • Random flip: the images are randomly flipped, including both horizontal and vertical flips, with a flipping probability of 0.5;
  • Random rotation: the images undergo random rotation within an angle range of −45° to 45°, with a rotation probability set at 0.3;
  • Random constant angle rotation: the images are rotated at fixed angles of 90°, 180°, or 270°, with a rotation probability of 0.5;
  • Gaussian noise: Gaussian noise is added randomly to the image pairs, with an addition probability of 0.3.
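The sketch below assembles these four steps into a paired transform that applies the same geometric operation to both temporal images and the change mask, following the probabilities listed above; the choice of torchvision as the backend, the Gaussian noise standard deviation, and the coupling of horizontal/vertical flips under a single probability are assumptions.

```python
# Sketch of the four augmentation steps above, applied identically to the bi-temporal
# image pair and its change mask. The torchvision backend, the noise standard deviation,
# and the coupling of horizontal/vertical flips under one probability are assumptions.
import random
import torch
import torchvision.transforms.functional as TF

def augment_pair(t1: torch.Tensor, t2: torch.Tensor, mask: torch.Tensor):
    if random.random() < 0.5:                         # random horizontal or vertical flip
        flip = TF.hflip if random.random() < 0.5 else TF.vflip
        t1, t2, mask = flip(t1), flip(t2), flip(mask)
    if random.random() < 0.3:                         # random rotation in [-45, 45] degrees
        angle = random.uniform(-45.0, 45.0)
        t1, t2, mask = [TF.rotate(x, angle) for x in (t1, t2, mask)]
    if random.random() < 0.5:                         # rotation by a fixed 90/180/270 degrees
        angle = float(random.choice([90, 180, 270]))
        t1, t2, mask = [TF.rotate(x, angle) for x in (t1, t2, mask)]
    if random.random() < 0.3:                         # additive Gaussian noise on images only
        t1 = t1 + 0.05 * torch.randn_like(t1)
        t2 = t2 + 0.05 * torch.randn_like(t2)
    return t1, t2, mask

# Example with float image tensors (3, 256, 256) and a binary mask (1, 256, 256).
a, b = torch.rand(3, 256, 256), torch.rand(3, 256, 256)
m = (torch.rand(1, 256, 256) > 0.9).float()
a, b, m = augment_pair(a, b, m)
```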

3.4. Experimental Setup and Evaluation Metrics

The proposed DACA-Net is implemented within the PyTorch 2.1.0 framework, utilizing a single NVIDIA GeForce RTX 4090 GPU for training, validation, and testing. During the training phase, the Adam optimizer is employed with an initial learning rate of 0.001 and weight decay set to 0.001; additionally, the network operates with a batch size of 64. In this study, the combined binary cross-entropy and Dice similarity loss described in Section 2.4 serves as the loss function. The experiments involved training DACA-Net for 300 epochs across the LEVIR-CD, CLCD, SYSU-CD, and GBCNR datasets while saving checkpoints corresponding to the highest F1 score on the validation set for subsequent testing.
In this paper, P, R, F1, and IoU are employed as evaluation metrics for change detection [14]. These metrics are widely recognized as standard performance indicators for assessing change detection methods. The precision metric quantifies the extent of false detections among changing pixels, with a higher value signifying a lower incidence of false positives. Conversely, the recall metric assesses the degree of missed detections, where a higher value indicates fewer missed pixels. However, achieving high levels of both precision and recall simultaneously presents significant challenges. The F1 offers a comprehensive assessment by integrating both precision and recall through its harmonic mean. Lastly, the IoU metric measures the overlap between predicted changes and actual changes; a higher value reflects more accurate model predictions.

3.5. Ablation Experiments

In this study, we conducted a series of ablation experiments on our self-constructed GBCNR dataset to evaluate the effectiveness of the proposed DACAM method. The experimental results are presented in Table 1. The baseline method (BaseLine) is the standard EDED architecture. The proposed DACAM module is decomposed into three components: the channel attention module in the first stage—channel attention (CA); the spatial attention module in the second stage—spatial attention (SA); and the multi-scale fusion module in the third stage—multi-scale attention (MS). A set of ablation experiments was performed to assess the individual contribution of each component.
As shown in Table 1, each individual component contributes to performance improvement. Specifically, introducing the CA alone yields a noticeable gain over the baseline. When CA is combined with the SA, the performance further improves, demonstrating their complementarity. Finally, integrating all three components achieves the best performance across all metrics, validating the overall effectiveness of the DACAM design. These results confirm that each part of DACAM contributes positively, and their combined effect is greater than the sum of their individual contributions.
Figure 7 shows a visual comparison of change detection results on the GBCNR dataset under different ablation settings. The ground truth (True) is presented alongside the results from the baseline model and models enhanced with CA, CA+SA, and the complete DACAM module. From the visualizations, we observe that the baseline model generates a substantial number of false positives and false negatives. The addition of channel attention (Baseline+CA) helps reduce noise and better focuses on changed regions but still misses some fine-grained changes. Incorporating spatial attention (Baseline+CA+SA) further refines the change boundaries and suppresses irrelevant areas. The full DACAM module achieves the most accurate and complete change detection results, effectively reducing false positives in red and false negatives in blue, closely aligning with the ground truth.

3.6. Comparative Experiments

To validate the advantages and effectiveness of the proposed DACA-Net, we conducted comparative experiments with several currently popular networks, including SNUNet [33], FC-EF [21], FC-Siam-Diff [21], FC-Siam-Conc [21], BiT [20], and SGSLN [22]. The results are presented in Table 2.
The comparison results on the LEVIR-CD dataset are presented in Table 2. As illustrated in Table 2, the proposed DACA-Net outperforms all other methods evaluated, achieving the highest IoU of 85.81%, F1 of 92.36%, and P of 93.16% on the LEVIR-CD dataset. The R stands at 91.58%, and notably, the F1 of the proposed DACA-Net shows an improvement of 0.31% compared with the second-best method. Although our method shows only a modest improvement over the SGSLN method on the LEVIR-CD dataset, we include it in the statistics to demonstrate the stronger cross-scene generalization ability of our method across multiple datasets.
As shown in Figure 8, the visual comparison of change detection results on the LEVIR-CD dataset reveals significant performance differences among various methods in suppressing false positives and false negatives. A horizontal comparison of the original bi-temporal images (T1, T2) and the ground truth labels shows that the FC-EF method produces dense red false detection patches along building edges, while the SNUNet algorithm exhibits systematic blue false negatives in low-contrast regions. Although the BiT model effectively suppresses missed detections, it suffers from diffuse red false positives. Notably, the SGSLN method maintains the integrity of the main building structures but introduces fine-grained mixed red–blue errors within roof regions due to excessive smoothing. In contrast, the method proposed in this study demonstrates a significant advantage in balancing false positives and false negatives. Its detection results exhibit the highest spatial consistency with the ground truth, with only sparse blue false negatives observed in shadow transition areas. This indicates that our network effectively mitigates intra-class variation caused by illumination changes and viewpoint shifts and demonstrates superior spatial-context modeling capabilities, particularly in the fine-grained change detection of complex urban structures.
The comparison results on the CLCD dataset are presented in Table 3. As illustrated in Table 3, the proposed DACA-Net outperforms all other models under comparison, achieving the highest IoU of 60.46%, an F1 of 75.36%, and a P of 77.32% on the CLCD dataset. The R is 73.5%, and notably, the F1 of the proposed DACA-Net is improved by 4.26% compared with the second-best method.
As shown in Figure 9, the visual comparison of change detection results on the CLCD dataset reveals significant differences among algorithms in terms of sensitivity to complex land cover changes and robustness to interference. A spatial comparison between reference images (T1, T2) and ground truth labels indicates that the FC-Siam-Diff method suffers from extensive red false positives and blue false negatives due to feature confusion. The FC-EF algorithm produces dispersed false positives along building edges, while the SNUNet model, despite detecting major change regions, exhibits systematic false negatives in medium- to low-contrast vegetated areas. Although the BiT method suppresses false negatives through deep feature fusion, it results in linear false positives spreading along roads. The SGSLN algorithm, affected by excessive smoothing, introduces fragmented red–blue alternating errors in small change regions. In contrast, the proposed method, equipped with an improved multi-scale contextual attention module, preserves farmland boundaries and effectively suppresses shadow interference. Its detection results demonstrate the highest spatial consistency with the ground truth, with only sparse false negatives observed in regions of overlapping land cover. These findings suggest that our network effectively mitigates intra-class variation caused by seasonal changes and lighting differences in multi-temporal remote sensing imagery and shows superior discriminative capability, particularly in the fine-grained extraction of heterogeneous land cover boundaries.
The comparison results on the SYSU-CD dataset are presented in Table 4. Given that the semantic information of change objects within the SYSU-CD dataset is ambiguous and encompasses multiple object classes, other change detection methods struggle to accurately identify these change objects. In contrast, our proposed DACA-Net effectively detects all classes of change objects and delineates the change regions for each identified object. As illustrated in Table 4, DACA-Net achieves the highest IoU at 71.07% and an F1 of 83.09% on this dataset. Notably, the F1 attained by DACA-Net represents a 1.2% improvement over that of the second-best method, thereby significantly surpassing all other compared approaches.
As shown in Figure 10, the visualized results of the change detection comparison experiments on the SYSU-CD dataset demonstrate significant performance differences among various algorithms in capturing the complex interactions between coastal building clusters and dynamic marine environments. A spatial comparison of the bi-temporal images (T1/T2) and ground truth labels reveals that the FC-EF method produces large areas of false positives (red artifacts) in the intertidal zone. The SNUNet algorithm introduces jagged false detections with alternating red and blue edges due to feature confusion near building contours. Although the BiT model effectively captures the main structural changes in buildings, it suffers from systematic false negatives (blue regions) caused by wave reflections in the marine background. The SGSLN approach, due to excessive smoothing, yields fragmented red–blue noise within detailed rooftop structures. In contrast, the proposed method not only preserves the geometric integrity of building clusters but also effectively suppresses pseudo-changes induced by tidal variations. Its detection results exhibit the highest spatial alignment with the ground truth, with only minor false negatives observed in high-reflectance transitional water areas. These findings highlight the superior robustness and spatial consistency of our network in handling sub-pixel-level changes along building boundaries in tidal inundation zones.
The comparison results on the GBCNR dataset are presented in Table 5. As indicated in Table 5, the proposed DACA-Net outperforms other methods, achieving the highest scores of 56.3% for IoU, 72.04% for F1, and 66.59% for P on the GBCNR dataset. Additionally, the R reaches 78.46%, and the F1 shows an improvement of 3.94% compared with the second-best method.
This result indicates that our method achieves substantial improvement on wetland remote sensing datasets such as GBCNR, demonstrating that it not only performs well on datasets involving buildings and roads but also exhibits superior generalization ability across diverse scene types.
As shown in Figure 11, the visualized comparison experiments on change detection using the self-constructed GBCNR dataset reveal significant performance differences among various methods in addressing intra-class variation and boundary ambiguity in natural surface scenes. Spatial comparisons between the bi-temporal images (T1 and T2) and the ground-truth labels demonstrate that the traditional FC-Siam-Diff method suffers from extensive red–blue mixed artifacts due to feature confusion. The deep learning-based FC-EF method exhibits diffuse red false alarms in vegetated areas, while SNUNet and BiT effectively suppress missed detections but still show residual blue omissions in overlapping regions. The SGSLN method, affected by excessive smoothing, generates fragmented red–blue errors along building edges. In contrast, the proposed method preserves the structural integrity of farmland-to-building transition zones while effectively suppressing shadow-induced disturbances. The blue omission regions are mainly concentrated in sub-pixel-level change areas, indicating that our network achieves superior feature discrimination, particularly for low-contrast object boundaries. This highlights its robustness and suitability for fine-grained change detection in complex natural environments.

4. Limitations of Methods

Although DACA-Net achieves promising detection performance across multiple datasets, it still has certain limitations. Its adaptability to images captured under extreme weather or lighting conditions—such as heavy fog or nighttime imaging—is limited, leading to performance degradation. Additionally, the model has a relatively complex architecture and slower inference speed, making it less suitable for real-time remote sensing applications. Moreover, there is still room for improvement in capturing boundary details, as the detection results may appear coarse in some blurred-edge scenarios. In future work, we plan to incorporate lightweight architectures and boundary enhancement strategies to further improve the model’s efficiency and accuracy.

5. Conclusions

This paper presents DACA-Net, a novel architecture for remote sensing image change detection based on the EDED framework. DACA-Net comprises two encoders, two decoders, and a DACAM module. The DACAM employs a combination of spatial attention mechanisms, channel attention mechanisms, and multi-scale attention mechanisms to learn the attention weights. This approach enables the model to automatically adjust the appropriate scale in accordance with the features of the input image, thereby assisting the encoder in accurately extracting relevant features from an abundance of background information within the image. This capability provides guidance for precisely locating changing regions in future analyses. DACAM effectively integrates multi-scale contextual features to maximize its ability to capture changes in remote sensing images while minimizing interference from complex backgrounds. Furthermore, experimental results demonstrate that DACAM significantly enhances the model’s generalization capabilities and improves both accuracy and overall performance. Additionally, this study constructs a UAV remote sensing image dataset, GBCNR, focused on Guangxi Beihai coastal nature reserves to address gaps present in existing wetland remote sensing datasets. Extensive experiments were conducted using three public datasets—LEVIR-CD, CLCD, and SYSU-CD—and our self-constructed UAV dataset, GBCNR. DACA-Net demonstrates excellent performance across the four datasets: LEVIR-CD, CLCD, SYSU-CD, and GBCNR, with F1 improving by 0.31%, 4.26%, 1.2%, and 3.94% compared with the second-best methods, respectively. Notably, the improvements are particularly pronounced in more complex scenarios with greater interference, such as CLCD and GBCNR. The proposed DACAM significantly enhances the model’s ability to perceive target change regions and effectively suppresses background interference. Furthermore, the self-constructed GBCNR wetland remote sensing change detection dataset fills a gap in this field. The findings of this study are expected to be widely applied in areas such as land use monitoring, ecological environment protection, and urban expansion identification, providing more efficient and intelligent remote sensing change detection tools for related fields.

Author Contributions

Conceptualization, Y.X.; methodology, Y.X.; software, Y.X. and Y.W.; validation, Y.X., Y.W. and X.W.; formal analysis, Y.X. and Y.T.; investigation, Y.X. and X.W.; resources, Y.T. and Q.Q.; data curation, Y.X.; writing—original draft preparation, Y.X.; writing—review and editing, Y.X. and Y.W.; visualization, Y.X. and X.W.; supervision, Y.T. and Q.Q.; project administration, Y.X.; funding acquisition, Q.Q. and Y.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was financially supported by the Guangxi Science and Technology Major Project (Grant No. AA19254016) and the Beihai Science and Technology Bureau Project (Grant No. Bei Kehe 2023158004).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding authors.

Acknowledgments

We extend our heartfelt gratitude to the anonymous reviewers for their valuable insights and constructive comments, which have significantly enhanced the quality and clarity of this manuscript. We are also immensely grateful to all the individuals and institutions that have offered their support and assistance throughout this research endeavor. We wish to express our special appreciation to the Guangxi Science and Technology Major Project and the Beihai Science and Technology Bureau for their generous funding and unwavering support of this study. Additionally, we are particularly thankful to the Institute of Marine Electronics and Information Technology at the Nanzhu Campus of Guilin University of Electronic Science and Technology for providing an excellent experimental environment and essential resource support.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Wen, D.; Huang, X.; Zhang, L.; Benediktsson, J.A. A novel automatic change detection method for urban high-resolution remotely sensed imagery based on multiindex scene representation. IEEE Trans. Geosci. Remote Sens. 2015, 54, 609–625. [Google Scholar] [CrossRef]
  2. Gim, H.J.; Ho, C.H.; Jeong, S.; Kim, J.; Feng, S.; Hayes, M.J. Improved mapping and change detection of the start of the crop growing season in the US Corn Belt from long-term AVHRR NDVI. Agric. For. Meteorol. 2020, 294, 108143. [Google Scholar] [CrossRef]
  3. Berberoglu, S.; Akin, A. Assessing different remote sensing techniques to detect land use/cover changes in the eastern Mediterranean. Int. J. Appl. Earth Obs. Geoinf. 2009, 11, 46–53. [Google Scholar] [CrossRef]
  4. Ye, S.; Rogan, J.; Zhu, Z.; Eastman, J.R. A near-real-time approach for monitoring forest disturbance using Landsat time series: Stochastic continuous change detection. Remote Sens. Environ. 2021, 252, 112167. [Google Scholar] [CrossRef]
  5. Wang, Q.; Liu, S.; Chanussot, J.; Li, X. Scene classification with recurrent attention of VHR remote sensing images. IEEE Trans. Geosci. Remote Sens. 2018, 57, 1155–1167. [Google Scholar] [CrossRef]
  6. Zhai, H.; Zhang, H.; Li, P.; Zhang, L. Hyperspectral image clustering: Current achievements and future lines. IEEE Geosci. Remote Sens. Mag. 2021, 9, 35–67. [Google Scholar] [CrossRef]
  7. Yan, J.; Cheng, Y.; Zhang, F.; Zhou, N.; Wang, H.; Jin, B.; Wang, M.; Zhang, W. Multi-modal imitation learning for arc detection in complex railway environments. IEEE Trans. Instrum. Meas. 2025, 74, 3529413. [Google Scholar] [CrossRef]
  8. Pan, J.; Bai, Y.; Shu, Q.; Zhang, Z.; Hu, J.; Wang, M. M-swin: Transformer-based multi-scale feature fusion change detection network within cropland for remote sensing images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4702716. [Google Scholar] [CrossRef]
  9. Lin, H.; Hang, R.; Wang, S.; Liu, Q. Diformer: A difference transformer network for remote sensing change detection. IEEE Geosci. Remote Sens. Lett. 2024, 21, 6003905. [Google Scholar] [CrossRef]
  10. Cheng, Y.; Yan, J.; Zhang, F.; Li, M.; Zhou, N.; Shi, C.; Jin, B.; Zhang, W. Surrogate modeling of pantograph-catenary system interactions. Mech. Syst. Signal Process. 2025, 224, 112134. [Google Scholar] [CrossRef]
  11. Yu, W.; Zhuo, L.; Li, J. GCFormer: Global context-aware transformer for remote sensing image change detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4703212. [Google Scholar] [CrossRef]
  12. Liu, B.; Chen, H.; Li, K.; Yang, M.Y. Transformer-based multimodal change detection with multitask consistency constraints. Inf. Fusion 2024, 108, 102358. [Google Scholar] [CrossRef]
  13. Wang, X.; Jiang, H.; Mu, M.; Dong, Y. A dynamic collaborative adversarial domain adaptation network for unsupervised rotating machinery fault diagnosis. Reliab. Eng. Syst. Saf. 2025, 255, 110662. [Google Scholar] [CrossRef]
  14. Lu, K.; Huang, X.; Xia, R.; Zhang, P.; Shen, J. Cross attention is all you need: Relational remote sensing change detection with transformer. GISci. Remote Sens. 2024, 61, 2380126. [Google Scholar] [CrossRef]
  15. Vaswani, A. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
  16. Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, 16–20 November 2020; pp. 38–45. [Google Scholar]
  17. Bousias Alexakis, E.; Armenakis, C. Evaluation of UNet and UNet++ architectures in high resolution image change detection applications. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2020, 43, 1507–1514. [Google Scholar] [CrossRef]
  18. Rakhlin, A. Convolutional neural networks for sentence classification. GitHub 2016, 6, 25. [Google Scholar]
  19. Ghaffarian, S.; Valente, J.; Van Der Voort, M.; Tekinerdogan, B. Effect of attention mechanism in deep learning-based remote sensing image processing: A systematic literature review. Remote Sens. 2021, 13, 2965. [Google Scholar] [CrossRef]
  20. Chen, H.; Qi, Z.; Shi, Z. Remote sensing image change detection with transformers. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5607514. [Google Scholar] [CrossRef]
  21. Daudt, R.C.; Le Saux, B.; Boulch, A. Fully convolutional siamese networks for change detection. In Proceedings of the 2018 25th IEEE International Conference on Image Processing (ICIP), Athens, Greece, 7–10 October 2018; pp. 4063–4067. [Google Scholar]
  22. Zhao, S.; Zhang, X.; Xiao, P.; He, G. Exchanging dual-encoder–decoder: A new strategy for change detection with semantic guidance and spatial localization. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4508016. [Google Scholar] [CrossRef]
  23. Niu, Z.; Zhong, G.; Yu, H. A review on the attention mechanism of deep learning. Neurocomputing 2021, 452, 48–62. [Google Scholar] [CrossRef]
  24. Li, Y.; Weng, L.; Xia, M.; Hu, K.; Lin, H. Multi-scale fusion siamese network based on three-branch attention mechanism for high-resolution remote sensing image change detection. Remote Sens. 2024, 16, 1665. [Google Scholar] [CrossRef]
  25. Lu, W.; Wei, L.; Nguyen, M. Bitemporal attention transformer for building change detection and building damage assessment. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 4917–4935. [Google Scholar] [CrossRef]
  26. Li, Z.; Cao, S.; Deng, J.; Wu, F.; Wang, R.; Luo, J.; Peng, Z. STADE-CDNet: Spatial–temporal attention with difference enhancement-based network for remote sensing image change detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5611617. [Google Scholar] [CrossRef]
  27. Han, C.; Wu, C.; Hu, M.; Li, J.; Chen, H. C2F-SemiCD: A coarse-to-fine semi-supervised change detection method based on consistency regularization in high-resolution remote-sensing images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4702621. [Google Scholar] [CrossRef]
  28. Jadon, S. A survey of loss functions for semantic segmentation. In Proceedings of the 2020 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), Vina del Mar, Chile, 27–29 October 2020; pp. 1–7. [Google Scholar]
  29. Sudre, C.H.; Li, W.; Vercauteren, T.; Ourselin, S.; Jorge Cardoso, M. Generalised dice overlap as a deep learning loss function for highly unbalanced segmentations. In Proceedings of the Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: Third International Workshop, DLMIA 2017, and 7th International Workshop, ML-CDS 2017, Held in Conjunction with MICCAI 2017, Québec City, QC, Canada, September 14, Proceedings 3; Springer: Berlin/Heidelberg, Germany, 2017; pp. 240–248. [Google Scholar]
  30. Chen, H.; Shi, Z. A spatial-temporal attention-based method and a new dataset for remote sensing image change detection. Remote Sens. 2020, 12, 1662. [Google Scholar] [CrossRef]
  31. Yang, J.; Huang, X. 30 m annual land cover and its dynamics in China from 1990 to 2019. Earth Syst. Sci. Data Discuss. 2021, 13, 3907–3925. [Google Scholar]
  32. Shi, Q.; Liu, M.; Li, S.; Liu, X.; Wang, F.; Zhang, L. A deeply supervised attention metric-based network and an open aerial image dataset for remote sensing change detection. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5604816. [Google Scholar] [CrossRef]
  33. Fang, S.; Li, K.; Shao, J.; Li, Z. SNUNet-CD: A densely connected Siamese network for change detection of VHR images. IEEE Geosci. Remote Sens. Lett. 2021, 19, 8007805. [Google Scholar] [CrossRef]
Figure 1. The remote sensing image samples illustrate the presence of different land cover types and background interference: (a) shows an urban area, (b) depicts a residential area, and (c) represents a coastal wetland area. The red bounding boxes indicate the target change regions, while other areas, such as roads, water bodies, and shadows, may cause background interference.
Figure 2. Overall architecture of the proposed DACA-Net. The network takes bi-temporal remote sensing images as input, extracts multi-scale features through parallel encoders with E-blocks and DACAM modules, and performs channel exchange to enhance temporal interaction. The decoder restores spatial resolution and fuses features to generate the final change detection map.
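To make the channel-exchange step in Figure 2 concrete, the snippet below is a minimal PyTorch sketch of swapping a subset of channels between the two temporal feature maps. The fixed exchange ratio and the every-other-channel rule are illustrative assumptions, not the exact rule used in DACA-Net.

```python
import torch

def channel_exchange(feat_t1: torch.Tensor, feat_t2: torch.Tensor, ratio: int = 2):
    """Swap every `ratio`-th channel between the T1 and T2 feature maps.

    feat_t1, feat_t2: (B, C, H, W) encoder features from the two dates.
    Returns the two feature maps with a subset of channels exchanged,
    so that each temporal branch sees information from the other.
    """
    exchange_mask = torch.arange(feat_t1.shape[1], device=feat_t1.device) % ratio == 0
    out_t1, out_t2 = feat_t1.clone(), feat_t2.clone()
    out_t1[:, exchange_mask] = feat_t2[:, exchange_mask]
    out_t2[:, exchange_mask] = feat_t1[:, exchange_mask]
    return out_t1, out_t2
```

In this sketch the exchange is applied after an encoder stage, once the two branches have produced feature maps of identical shape.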
Figure 3. Diagram of the EDED dual-stream change detection model structure. The dual encoders separately extract T1 and T2 temporal features, while the decoder reconstructs the features through multi-level skip connections and ultimately outputs a mask.
Figure 4. The structure diagram of DACAM. It is divided into three parts, (a–c), representing the three stages of channel attention, spatial attention, and multi-scale attention, respectively.
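The three stages in Figure 4 can be illustrated with a compact PyTorch sketch. The specific layer choices below (an SE-style channel gate, a 7 × 7 spatial gate, and dilated branches fused by learned softmax weights) are assumptions made only to mirror the stages named in the caption, not the authors' exact DACAM layers.

```python
import torch
import torch.nn as nn

class DACAMSketch(nn.Module):
    """Illustrative three-stage block: channel -> spatial -> multi-scale attention."""

    def __init__(self, channels: int, reduction: int = 8, dilations=(1, 2, 4)):
        super().__init__()
        # (a) channel attention: squeeze-and-excitation style gate
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        # (b) spatial attention: gate built from pooled channel statistics
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )
        # (c) multi-scale attention: dilated branches fused by learned softmax weights
        self.scale_branches = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=d, dilation=d) for d in dilations]
        )
        self.scale_weights = nn.Parameter(torch.zeros(len(dilations)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # stage (a): reweight channels
        x = x * self.channel_gate(x)
        # stage (b): reweight spatial positions from mean/max channel statistics
        pooled = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        x = x * self.spatial_gate(pooled)
        # stage (c): fuse multi-scale context with learned, normalized branch weights
        weights = torch.softmax(self.scale_weights, dim=0)
        return sum(w * branch(x) for w, branch in zip(weights, self.scale_branches))
```

A module of this kind would be placed after each encoder stage and applied to both temporal branches, which matches the role DACAM plays in Figure 2.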
Figure 5. Representative samples from the GBCNR dataset. These aerial images illustrate typical land–water boundaries captured in the dataset, featuring dense vegetation, riverbanks, and man-made structures. The left image shows a small pavilion-like structure surrounded by trees near the water’s edge. The center image highlights a sharp contrast between bright green foliage and murky water, indicating rich spectral-texture information. The right image presents a heavily vegetated area partially submerged in water, showcasing the complexity of natural boundary delineation.
Figure 6. Sample images from the three datasets illustrate the complexity of the scenes. The first row presents samples from the LEVIR-CD dataset, which primarily consists of various types of buildings. The second row shows samples from the CLCD dataset, providing annual land cover information. The third row displays samples from the SYSU-CD dataset, including scenarios such as road expansion and vegetation changes.
Figure 7. Visualization of the ablation experiment results on the GBCNR dataset. Blue indicates missed detection regions, while red indicates false detection regions.
Figure 8. Visualization of comparative experimental results on the LEVIR-CD dataset. From left to right are the two temporal images, the ground truth labels, and the change detection results of different methods. Blue indicates missed detection regions, while red indicates false detection regions.
Figure 9. Visualization of comparative experimental results on the CLCD dataset. From left to right are the two temporal images, the ground truth labels, and the change detection results of different methods. Blue indicates missed detection regions, while red indicates false detection regions.
Figure 10. Visualization of comparative experimental results on the SYSU-CD dataset. From left to right are the two temporal images, the ground truth labels, and the change detection results of different methods. Blue indicates missed detection regions, while red indicates false detection regions.
Figure 11. Visualization of comparative experimental results on the GBCNR dataset. From left to right are the two temporal images, the ground truth labels, and the change detection results of different methods. Blue indicates missed detection regions, while red indicates false detection regions.
Table 1. The ablation experiment results on the GBCNR dataset.

Dataset | Method             | F1 (%) | P (%) | R (%) | IoU (%)
GBCNR   | Baseline           | 68.10  | 64.61 | 71.98 | 51.63
GBCNR   | Baseline + CA      | 69.47  | 65.27 | 73.53 | 53.17
GBCNR   | Baseline + CA + SA | 71.21  | 65.96 | 76.28 | 55.49
GBCNR   | Baseline + DACAM   | 72.04  | 66.59 | 78.46 | 56.30
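For reference, the F1, P, R, and IoU values reported in Tables 1–5 follow their standard binary change-detection definitions. The sketch below shows one way to compute them from a predicted change mask and its ground truth; the function and variable names are illustrative, not taken from the authors' code.

```python
import numpy as np

def change_detection_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Compute precision, recall, F1, and IoU (in %) for binary change masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()      # changed pixels correctly detected
    fp = np.logical_and(pred, ~gt).sum()     # unchanged pixels flagged as changed
    fn = np.logical_and(~pred, gt).sum()     # changed pixels that were missed
    eps = 1e-10
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    iou = tp / (tp + fp + fn + eps)
    return {"P": 100 * precision, "R": 100 * recall, "F1": 100 * f1, "IoU": 100 * iou}
```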
Table 2. Comparative experimental results of different methods on the LEVIR-CD dataset.

Method       | F1 (%) | P (%) | R (%) | IoU (%)
SNUNet       | 88.16  | 89.18 | 87.17 | 78.83
FC-EF        | 83.40  | 86.91 | 80.17 | 71.53
BiT          | 89.30  | 89.24 | 89.37 | 80.68
FC-Siam-Diff | 86.31  | 89.53 | 83.31 | 75.91
FC-Siam-Conc | 83.69  | 91.99 | 76.77 | 71.96
SGSLN        | 92.05  | 92.91 | 91.21 | 85.28
Ours         | 92.36  | 93.16 | 91.58 | 85.81
Table 3. Comparative experimental results of different methods on the CLCD dataset.

Method       | F1 (%) | P (%) | R (%) | IoU (%)
SNUNet       | 66.32  | 70.82 | 62.37 | 49.62
FC-EF        | 57.22  | 71.70 | 47.60 | 40.07
BiT          | 62.08  | 61.42 | 62.75 | 45.01
FC-Siam-Diff | 57.69  | 64.26 | 52.33 | 40.54
FC-Siam-Conc | 61.45  | 73.27 | 52.91 | 44.35
SGSLN        | 71.10  | 73.84 | 68.55 | 55.15
Ours         | 75.36  | 77.32 | 73.50 | 60.46
Table 4. Comparative experimental results of different methods on the SYSU-CD dataset.

Method       | F1 (%) | P (%) | R (%) | IoU (%)
SNUNet       | 77.27  | 78.26 | 76.30 | 62.96
FC-EF        | 75.07  | 74.32 | 75.84 | 60.09
BiT          | 78.74  | 81.14 | 76.48 | 64.94
FC-Siam-Diff | 72.58  | 89.13 | 61.21 | 56.96
FC-Siam-Conc | 76.35  | 82.54 | 71.03 | 61.75
SGSLN        | 81.89  | 86.20 | 78.00 | 69.34
Ours         | 83.09  | 86.50 | 79.93 | 71.07
Table 5. Comparative experimental results of different methods on the GBCNR dataset.

Method       | F1 (%) | P (%) | R (%) | IoU (%)
SNUNet       | 63.79  | 63.10 | 64.50 | 46.83
FC-EF        | 59.89  | 56.97 | 63.12 | 42.74
BiT          | 67.32  | 60.19 | 76.36 | 50.73
FC-Siam-Diff | 59.14  | 61.40 | 57.05 | 41.99
FC-Siam-Conc | 57.29  | 59.79 | 54.99 | 40.14
SGSLN        | 68.10  | 64.61 | 71.98 | 51.63
Ours         | 72.04  | 66.59 | 78.46 | 56.30
