1. Introduction
Change detection is a critical technology for identifying surface or scene changes by analyzing observation data acquired at different times. It is widely used in urban development [1,2], disaster management [3,4], deforestation monitoring [5,6], and environmental surveillance [7,8]. High-resolution remote sensing images can provide high-precision information about change regions, but two main challenges remain. On the one hand, the high spatial resolution and rich detail of such images allow object boundaries and subtle changes to be detected clearly, yet they also introduce substantial fine-grained noise, which places higher demands on the robustness and sensitivity of detection models [9]. On the other hand, because of the variety of platforms and sensor types, as well as differing acquisition angles and times, some remote sensing data are affected by external factors such as weather and terrain [10]. As a result, the same object can display varying textures and spectral features across images, increasing the complexity of feature alignment and model training [11]. Consequently, when processing multi-source, multi-temporal, and multi-angle high-resolution remote sensing images, the demands on change detection technology continue to rise: models must provide dynamic interaction modelling and accurate feature representation.
Currently, commonly used change detection techniques are primarily based on CNNs [12] and Transformers [13]. CNN-based change detection networks typically perform initial feature extraction on dual-temporal images using a weight-sharing encoder (such as ResNet [14] or MobileNet [15]) and then apply differencing methods [16,17], spatiotemporal attention mechanisms [18,19,20], or multi-scale feature fusion [21,22,23] to identify change regions. These methods achieve high precision and efficiency in capturing details and semantic information in change areas. For example, FC-Siam-conc [24] and FC-Siam-diff [24], built on a Siamese CNN architecture, fuse dual-temporal features through feature concatenation and feature differencing, respectively, to enhance the representation of change regions. DTCDSCN [25] uses a spatiotemporal attention mechanism to strengthen the network's focus on change regions, improving detection performance in key areas. SNUNet [26] addresses the loss of localization information in deep networks through dense connections, improving detection robustness. A2Net [27] integrates multi-scale features to effectively combine information from different scales, enhancing the network's ability to extract change features. However, the local receptive field of CNNs is ineffective at expressing global context and handling long-range dependencies, making scenarios that require global information difficult to process [28]. Additionally, because the receptive field is fixed within a given feature extraction layer [29], CNNs struggle to model irregular change regions, affecting the accuracy and detail preservation of change detection. In comparison, Transformer models can capture long-distance dependencies between features and excel at modelling global context in complex scenes [30]. For instance, ChangeFormer [31] is built entirely on the Transformer architecture and uses a decoder to generate difference maps from the outputs of dedicated modules to complete the detection task. However, the Transformer's heavy dependence on large-scale labelled data and its rapidly growing computational complexity limit its use in high-resolution remote sensing image change detection [32].
To address the issues above, researchers have gradually explored hybrid architectures combining CNN and Transformer [33]. TransUNetCD, proposed by Li et al. [34], leverages the strengths of both: CNNs for local feature extraction and Transformers for global relationship modelling, yielding an end-to-end encoder-decoder architecture for change detection. BIT, proposed by Chen et al. [35], combines the CNN's initial feature extraction with the Transformer's non-local self-attention mechanism to enhance computational efficiency, significantly reducing the model's resource demands in high-resolution image processing. The EATDer network, designed by Ma et al. [36], uses a Siamese architecture in which each branch captures local and global information through three Self-Adaptive Vision Transformer (SAVT) blocks, and an edge-aware decoder ensures clearer and smoother edges. DMINet, proposed by Feng et al. [37], integrates self-attention (SelfAtt) and cross-attention (CrossAtt) and uses a joint-time attention (JointAtt) block to regulate the global feature distribution of each input, facilitating information coupling within layers while suppressing noise interference.
The hybrid architecture partially addresses the limitations of both Transformers and CNNs, combining local feature extraction with global context modelling capabilities. However, such architectures still face notable challenges. For example, in modelling deep interactions between dual-temporal features, they fail to fully capture the complex relationships and dynamic variations between dual-temporal image features, leading to inadequate representation of change regions. Additionally, during the dual-temporal feature extraction process in the backbone network, there is a lack of effective modelling of global context information. This limitation makes it difficult to fully integrate long-range dependencies and multi-scale features, thereby restricting the model’s accuracy and robustness in detecting complex change regions. Furthermore, as remote sensing image data volumes continue to grow, optimizing computational resources, reducing model complexity, and ensuring efficient operation for long-term, continuous monitoring tasks remain critical challenges. Although existing methods have improved computational efficiency to some extent, they still encounter significant computational costs when processing large-scale data.
We leverage the advantages of hybrid CNN-Transformer architectures and, building on the design concept of ChangeFormer, propose an improved change detection network, CGLCS-Net. This network reduces the number of parameters while enhancing the backbone network's ability to model local and global features. Additionally, we introduce a dynamic interaction mechanism for dual-temporal feature modelling, further improving the detailed expression of change information. The main contributions of this paper are as follows:
- (1)
We propose the Global-Local Context-Aware Selector (GLCAS) module, which combines depthwise convolutions with different receptive fields and an adaptive selection mechanism to capture both local details and global dependencies, rather than modelling dense dependencies indiscriminately. GLCAS reduces computational complexity, significantly improves the model's ability to extract multi-scale features in complex scenes, and effectively addresses the difficulty of capturing change features at different scales in multi-temporal remote sensing images.
- (2)
We design the Subspace Self-Attention Fusion (SSAF) module, which dynamically models the differences between dual-temporal features to precisely focus on meaningful change regions in remote sensing images, addressing multi-view changes in remote sensing data. Guided by feature differences, SSAF enhances the model’s focus on change regions, improving its flexibility and accuracy when handling irregular boundaries and subtle changes.
- (3)
We compared CGLCS-Net with ten advanced models and conducted ablation experiments on three widely used change detection datasets, validating the robustness and efficiency of CGLCS-Net and achieving state-of-the-art performance.
3. Experiments
3.1. Dataset
We selected three widely used and representative remote sensing change detection datasets: LEVIR-CD [40], SYSU-CD [41], and S2Looking [42]. These datasets feature long temporal spans, multi-sensor sources, and multi-view characteristics. To enable fair comparison with state-of-the-art methods, all images were uniformly processed into 256 × 256 pixel patches, and the standard dataset partitions were adopted for the training, validation, and test sets.
LEVIR-CD: Contains 637 high-resolution remote sensing image pairs with original dimensions of 1024 × 1024 pixels, a spatial resolution of 0.5 m per pixel, and a temporal span of 5 to 14 years. The dataset primarily documents the growth and decline of buildings, covering diverse architectural types, including villa residences, high-rise apartments, small garages, and large warehouses. After cropping the original images into non-overlapping 256 × 256 patches, the dataset is partitioned into 7120 training, 1024 validation, and 2048 test image pairs.
SYSU-CD: Contains 20,000 pairs of high-resolution remote sensing images captured in Hong Kong between 2007 and 2014, with original dimensions of 256 × 256 pixels and a spatial resolution of 0.5 m per pixel. The dataset records diverse change types, including new urban constructions, suburban expansion, pre-construction groundwork, vegetation variations, road extensions, and marine development. It is partitioned into 12,000 training, 4000 validation, and 4000 test image pairs.
S2Looking: Comprises 5000 dual-temporal satellite image pairs captured with off-nadir viewing angles over rural areas worldwide. Original image dimensions are 1024 × 1024 pixels, with a spatial resolution of 0.5 to 0.8 m per pixel. The dataset focuses on land-cover changes in rural regions. After cropping into 256 × 256 patches, it is partitioned into 56,000 training, 8000 validation, and 16,000 test image pairs.
3.2. Assessment Metrics
The model's accuracy is assessed using five standard evaluation metrics for change detection tasks [43]: Precision, Recall, F1 score, IoU (Intersection over Union), and OA (Overall Accuracy). These metrics are formulated as follows:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = (2 × Precision × Recall) / (Precision + Recall)
IoU = TP / (TP + FP + FN)
OA = (TP + TN) / (TP + TN + FP + FN)

where TP denotes the number of pixels in the change areas correctly extracted by the network, TN represents the number of pixels in the unchanged areas correctly extracted, FP indicates the number of unchanged-area pixels incorrectly classified as change-area pixels, and FN refers to the number of change-area pixels incorrectly classified as unchanged-area pixels.
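For reference, these metrics can be computed directly from binary prediction and ground-truth masks. The helper below is an illustrative sketch; the function name, variable names, and the small epsilon guard are ours, not from the paper.

```python
import numpy as np

def change_detection_metrics(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8):
    """Compute Precision, Recall, F1, IoU, and OA from binary masks (1 = change)."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()      # changed pixels correctly detected
    tn = np.logical_and(~pred, ~gt).sum()    # unchanged pixels correctly detected
    fp = np.logical_and(pred, ~gt).sum()     # unchanged pixels flagged as change
    fn = np.logical_and(~pred, gt).sum()     # changed pixels that were missed
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    iou = tp / (tp + fp + fn + eps)
    oa = (tp + tn) / (tp + tn + fp + fn + eps)
    return {"Precision": precision, "Recall": recall, "F1": f1, "IoU": iou, "OA": oa}
```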
3.3. Experimental Environment
CGLCS-Net is implemented in PyTorch 2.0.0 with CUDA 11.7 and trained on an NVIDIA RTX 4070 GPU. In the experiments, the hyperparameters are configured as follows: the batch size is set to 8, the optimizer is AdamW, and the initial learning rate is set to 0.0001, dynamically adjusted with a multi-step decay strategy to enhance the flexibility of parameter optimization. To improve the model's generalization capability, various data augmentation techniques are applied, including random cropping, random rotation, horizontal flipping, Gaussian blur, and random hue, saturation, and brightness adjustment.
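A minimal sketch of this configuration is shown below. The augmentation parameter values, the decay milestones, and the stand-in model are illustrative assumptions; in practice, the same random geometric transform must be applied to both temporal images and the change mask.

```python
import torch
import torch.nn as nn
import torchvision.transforms as T

# Illustrative augmentation pipeline matching the transforms listed above;
# the crop size, blur kernel, and jitter ranges are assumptions, not the paper's values.
train_transform = T.Compose([
    T.RandomCrop(256, pad_if_needed=True),
    T.RandomRotation(degrees=15),
    T.RandomHorizontalFlip(p=0.5),
    T.GaussianBlur(kernel_size=3, sigma=(0.1, 2.0)),
    T.ColorJitter(brightness=0.2, saturation=0.2, hue=0.05),
    T.ToTensor(),
])

# Optimizer and learning-rate schedule as described: AdamW, initial lr 1e-4,
# multi-step decay (milestone epochs here are illustrative).
model = nn.Conv2d(3, 2, kernel_size=3, padding=1)   # stand-in for CGLCS-Net
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[100, 150], gamma=0.1)
```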
To address the extreme class imbalance in change detection tasks, a hybrid loss function [44] is designed to mitigate the impact of sample imbalance on model training. This hybrid loss combines a weighted cross-entropy loss [45] and a Dice loss [46] and is defined as

L = α · L_wce + β · L_Dice

where the weighted cross-entropy term L_wce applies a class weight w_c to the per-pixel comparison between the true pixel value y_i and the predicted pixel value ŷ_i, averaged over all N pixels; the Dice term L_Dice measures the overlap between the predicted and true change masks; and α and β are weight control parameters set to 1 and 0.5, respectively.
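A hedged PyTorch sketch of such a hybrid loss is given below; the class weights, the binary (two-class) formulation, and the smoothing constant are our assumptions rather than the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def hybrid_loss(logits, target, class_weights=(0.5, 1.5), alpha=1.0, beta=0.5, eps=1e-6):
    """Weighted cross-entropy + Dice loss, L = alpha * L_wce + beta * L_dice.

    logits: (B, 2, H, W) raw scores for the unchanged/changed classes.
    target: (B, H, W) ground-truth labels in {0, 1}.
    The class weights are illustrative; the paper's exact values are not reproduced here.
    """
    target = target.long()
    weight = torch.tensor(class_weights, device=logits.device)
    l_wce = F.cross_entropy(logits, target, weight=weight)

    # Dice loss on the probability of the "change" class.
    prob_change = torch.softmax(logits, dim=1)[:, 1]
    target_f = target.float()
    intersection = (prob_change * target_f).sum(dim=(1, 2))
    union = prob_change.sum(dim=(1, 2)) + target_f.sum(dim=(1, 2))
    l_dice = 1.0 - ((2.0 * intersection + eps) / (union + eps)).mean()

    return alpha * l_wce + beta * l_dice
```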
3.4. Model Comparison
To thoroughly assess the performance of the CGLCS-Net model, it is compared with ten advanced change detection methods: the CNN-based FC-Siam-conc, FC-Siam-diff, DTCDSCN, SNUNet, DMINet, A2Net, and ABMFNet [47], and the Transformer-based BIT, ChangeFormer, and EATDer.
The quantitative evaluation results show that CGLCS-Net achieved top-ranking scores compared with all ten advanced methods, as shown in Table 1, Table 2 and Table 3.
Table 1 presents the quantitative results on the LEVIR-CD dataset. Because of the long time span and the irregular shapes and scales of the change regions, this dataset places higher demands on the model's ability to extract multi-scale features and model irregular regions. CGLCS-Net achieved the highest scores in Recall, F1, IoU, and OA. Among the CNN-based models, FC-Siam-conc and FC-Siam-diff performed the poorest, suggesting that direct feature concatenation and differencing are insufficient for feature capture. In contrast, DTCDSCN and ABMFNet, which incorporate spatiotemporal attention mechanisms, improved the focus on change features and achieved better IoU and F1 scores, indicating that spatiotemporal attention can effectively reduce misclassification when dealing with long time spans and irregular regions. While A2Net achieved the highest Precision (92.96%), its Recall was comparatively low (85.81%). CGLCS-Net, by contrast, attained the best Recall while maintaining a Precision of 91.27% through dynamic feature selection, confirming the advantage of its convolutional feature extraction and processing design. Compared with Transformer-based models, CGLCS-Net also leads in the comprehensive metrics, highlighting its superior ability to integrate global and local information. These results suggest that CGLCS-Net offers a clear advantage in extracting features from irregular regions over long time spans.
Table 2 presents the quantitative results of the various models on the SYSU-CD dataset, which is characterized by a wide variety of change types, a long time span, and a larger data volume. The experimental results show that CGLCS-Net achieved higher scores in Recall, F1, and IoU. Compared with convolution-based models such as FC-Siam-conc, FC-Siam-diff, DTCDSCN, DMINet, A2Net, and ABMFNet, CGLCS-Net showed a slight decrease in Precision and OA (Overall Accuracy). Specifically, FC-Siam-conc and FC-Siam-diff achieved Precision scores of 83.51% and 86.14%, respectively, demonstrating strong precision but lower Recall, which suggests that methods relying on difference calculation and feature concatenation tend to overlook certain types of changes and fail to capture them comprehensively. In contrast, CGLCS-Net reached 83.32% Recall, a substantial improvement, indicating that it extracts feature information from change regions more comprehensively and therefore identifies more changes. Compared with Transformer-based models, CGLCS-Net also achieves a better overall balance: while BIT (84.14%) and EATDer (81.02%) slightly outperformed CGLCS-Net in Precision, CGLCS-Net improved the F1 score by 7.88% and 0.97% over BIT (74.05%) and EATDer (80.96%), respectively, owing to the effective noise reduction provided by the Subspace Self-Attention Fusion (SSAF) module. This result confirms that the hybrid architecture of CGLCS-Net better coordinates global context modelling with local feature refinement, achieving a superior balance in complex scenarios.
Table 3 presents the quantitative results on the S2Looking dataset, which is characterized by off-nadir viewing angles, complex rural object change patterns, and the absence of uniform boundaries. CGLCS-Net achieved the highest scores in Recall, F1, IoU, and OA. While FC-Siam-conc achieved the highest Precision (84.23%), its Recall (34.18%) was very low, indicating that its strict difference threshold led to many missed detections. FC-Siam-diff improved Recall to 50.62% by modifying the differencing strategy, but its Precision dropped by 16.08%, reflecting the limited adaptability of simple feature operations in complex rural scenes. Other convolution-based models, such as DTCDSCN, SNUNet, DMINet, A2Net, and ABMFNet, employed different strategies; although they exhibited lower Precision, they improved Recall and IoU, suggesting that attention mechanisms or enhanced multi-scale feature fusion can effectively improve the network's ability to capture and localize features under varying viewpoints. Compared with Transformer-based models such as BIT, ChangeFormer, and EATDer, CGLCS-Net led in Recall (63.82%), F1 (64.35%), and IoU (47.43%), demonstrating more comprehensive change detection. Although CGLCS-Net's Precision (64.88%) is lower than that of ChangeFormer, its superior F1 and IoU indicate that the model not only captures detailed change features effectively but also suppresses noise caused by viewpoint differences, thereby identifying change regions more accurately.
Figure 4 presents the visualization results for the LEVIR-CD dataset. The first and second rows show scenes with small building clusters where building shadow interference is present. CGLCS-Net significantly reduces misclassification and omission when extracting edge details, leading to more precise segmentation results. The third row shows scenes with lighting changes, where all ten other models exhibit significant misclassification, while CGLCS-Net maintains accurate segmentation without any misclassification. The fourth and fifth rows represent large building scenes, where CGLCS-Net provides a more complete segmentation of building interiors.
Figure 5 presents the visualization results for the SYSU-CD dataset. In the comparison maps against the ground-truth labels, red indicates FP, green indicates FN, black represents TN, and white denotes TP. CGLCS-Net outperforms the other comparison methods across the various change types. The first and fourth rows show scenes with irregular buildings affected by tree occlusion and shadow interference, where CGLCS-Net captures the change regions more accurately and effectively reduces false positives and false negatives. The second, third, and fifth rows show scenes from port building areas, where CGLCS-Net demonstrates more stable performance against interference from lighting variations, texture differences, and color discrepancies, significantly reducing misclassification and omission.
Figure 6 presents the visualization results for the S2Looking dataset. The first, second, and third rows show small building cluster scenes, where CGLCS-Net produces more complete detection results when facing different color variations. The fourth and fifth rows show large building scenes, where CGLCS-Net achieves the complete extraction of change regions despite the influence of viewpoint differences and boundary shadows.
3.5. Ablation Study
To thoroughly assess the effectiveness of the proposed modules and their contribution to overall model performance, we conducted a series of ablation studies on the LEVIR-CD dataset, as shown in Table 4. We removed all Transformer Block modules from ChangeFormer and used the result as the baseline (Baseline); in the feature fusion stage, only a simple concatenation strategy is used. This significantly reduces the parameters and computational complexity but lowers the F1-score and IoU by 0.85% and 1.39%, respectively, relative to the original ChangeFormer.
In the Baseline + GLCAS experiment, the parameter count increased from 10.71 M to 11.51 M, with F1-Score improving from 89.55% to 89.76% and IoU increasing from 81.09% to 81.42%. This demonstrates that the GLCAS module significantly enhances multi-scale feature representation through multi-scale depthwise separable convolutions and global-local dynamic fusion.
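For intuition, the following is a minimal sketch, under our own assumptions about layer sizes and kernel choices, of how parallel depthwise convolutions with different receptive fields can be adaptively weighted by a pooled global descriptor. It illustrates the general mechanism rather than reproducing the paper's exact GLCAS design.

```python
import torch
import torch.nn as nn

class GLCASSketch(nn.Module):
    """Illustrative global-local context-aware selector (not the paper's exact design)."""
    def __init__(self, channels, kernel_sizes=(3, 5, 7), reduction=4):
        super().__init__()
        # Parallel depthwise convolutions with different receptive fields (local branches).
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)
            for k in kernel_sizes
        ])
        # Global context descriptor from combined max + mean pooling (cf. Table 5).
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.gmp = nn.AdaptiveMaxPool2d(1)
        # Selector: maps the global descriptor to per-branch weights.
        self.selector = nn.Sequential(
            nn.Conv2d(2 * channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, len(kernel_sizes), 1),
        )

    def forward(self, x):
        feats = torch.stack([b(x) for b in self.branches], dim=1)   # (B, K, C, H, W)
        ctx = torch.cat([self.gap(x), self.gmp(x)], dim=1)          # (B, 2C, 1, 1)
        weights = torch.softmax(self.selector(ctx), dim=1)          # (B, K, 1, 1)
        weights = weights.unsqueeze(2)                              # (B, K, 1, 1, 1)
        return (weights * feats).sum(dim=1) + x                     # weighted fusion + residual
```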
In the Baseline + SSAF experiment, the parameter count increased slightly to 11.28 M, with the F1-score rising to 89.91% and the IoU increasing to 81.68%. This shows that the SSAF module effectively focuses on change regions by modelling the dynamic interaction between the dual-temporal features and their difference features.
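Similarly, a minimal sketch of difference-guided fusion of dual-temporal features is shown below: the absolute difference gates the fused features before spatial self-attention. The class name, gating scheme, and layer choices are illustrative assumptions, not the exact SSAF formulation.

```python
import torch
import torch.nn as nn

class SSAFSketch(nn.Module):
    """Illustrative difference-guided fusion of dual-temporal features
    (a sketch of the idea, not the paper's exact SSAF module)."""
    def __init__(self, channels, heads=4):
        super().__init__()
        self.diff_gate = nn.Sequential(                 # turns |f1 - f2| into a spatial gate
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Sigmoid(),
        )
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.proj = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, f1, f2):
        b, c, h, w = f1.shape
        gate = self.diff_gate(torch.abs(f1 - f2))       # emphasize likely change regions
        fused = self.proj(torch.cat([f1, f2], dim=1)) * gate
        # Self-attention over spatial positions of the gated, fused feature map.
        tokens = fused.flatten(2).transpose(1, 2)       # (B, H*W, C)
        attn_out, _ = self.attn(tokens, tokens, tokens)
        out = attn_out.transpose(1, 2).reshape(b, c, h, w)
        return out + fused                              # residual connection
```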
In the Baseline + GLCAS + SSAF experiment, the parameter count increased to 12.09 M, with F1-Score improving to 90.96% and IoU increasing to 83.43%. Compared to the Baseline model, IoU improved by 2.34%, proving that GLCAS and SSAF modules work synergistically in feature extraction and dynamic interaction modelling, significantly improving overall performance. Furthermore, compared to the original ChangeFormer model, CGLCS-Net (Baseline + GLCAS + SSAF) reduced the parameter count from 41.03 M to 12.09 M (a reduction of about 70.5%) while maintaining excellent detection performance. This demonstrates that CGLCS-Net excels not only in performance but also in lightweight design.
To further assess the impact of the pooling configuration in the Global-Local Context-Aware Selector (GLCAS), we conducted comparative experiments with different pooling methods (Max Pooling, Mean Pooling, and a combined pooling strategy), with the results shown in Table 5.
When Max Pooling alone was used to extract global context information, the F1-score was 89.79% and the IoU was 81.47%, indicating that Max Pooling tends to preserve prominent features and effectively captures local extrema in the change regions. Mean Pooling alone provides a smoother representation of global information, yielding an F1-score of 89.54% and an IoU of 81.06%, slightly below Max Pooling. To combine the strengths of both, we adopted a combined pooling strategy, which increased the F1-score to 90.96% and the IoU to 83.43%. The combined approach achieves complementarity between the two methods, capturing both salient and smoothed global features and markedly improving the GLCAS module's ability to model complex change regions.
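The three configurations differ only in how the global descriptor is formed. A minimal sketch is shown below; combining the two descriptors by channel concatenation is our assumption, and the tensor shapes are illustrative.

```python
import torch
import torch.nn.functional as F

x = torch.randn(2, 64, 64, 64)                       # (B, C, H, W) feature map

max_desc  = F.adaptive_max_pool2d(x, 1)              # Max Pooling: keeps the most salient response
mean_desc = F.adaptive_avg_pool2d(x, 1)              # Mean Pooling: smoother global summary
comb_desc = torch.cat([max_desc, mean_desc], dim=1)  # Combined: both descriptors concatenated

print(max_desc.shape, mean_desc.shape, comb_desc.shape)
# torch.Size([2, 64, 1, 1]) torch.Size([2, 64, 1, 1]) torch.Size([2, 128, 1, 1])
```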
To further assess the impact of hyperparameters on model performance, we conducted multiple experiments on α and β in Equation (26). Table 6 presents the results for various combinations of these hyperparameters across several evaluation metrics. The results indicate that when α = 1 and β = 0.5, the model achieves the best accuracy, ensuring effective detection of change regions while preventing an overemphasis on precision that could sacrifice recall. Compared with this setting, the other combinations in which the β weight is equal to or lower than α exhibit lower recall and IoU but higher precision, suggesting that the model prioritizes pixel classification accuracy at the expense of its focus on change regions.
3.6. Model Efficiency Analysis
We evaluated 11 models in terms of parameter count (Params), floating-point operations (FLOPs), and inference time, as presented in Table 7. The input size used for the measurements was 3 × 256 × 256, with parameters reported in millions (M), floating-point operations in gigaflops (G), and inference time in seconds. CGLCS-Net, which combines convolution and self-attention, has 12.09 M parameters and 187.58 G FLOPs. In terms of computational efficiency, compared with the baseline model ChangeFormer, CGLCS-Net reduces the parameter count by approximately 70%, FLOPs by 7.5%, and inference time by 11.5%. Although its computational complexity remains higher than that of most lightweight models, CGLCS-Net delivers superior task performance: on the key evaluation metrics, F1 score (90.96%) and IoU (83.43%), it outperforms all other models.
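For reproducibility, parameter counts and inference time can be measured with plain PyTorch as sketched below; the tiny stand-in model and the run counts are illustrative, and FLOPs are usually obtained with an external profiler such as thop or fvcore.

```python
import time
import torch
import torch.nn as nn

model = nn.Conv2d(3, 2, kernel_size=3, padding=1).eval()   # stand-in for a change detection network
x = torch.randn(1, 3, 256, 256)                             # a real Siamese model takes an image pair

# Parameter count in millions.
params_m = sum(p.numel() for p in model.parameters()) / 1e6

# Average inference time over repeated runs (CPU example; on GPU, synchronize before timing).
with torch.no_grad():
    for _ in range(5):                                       # warm-up runs
        model(x)
    start = time.perf_counter()
    for _ in range(50):
        model(x)
    infer_s = (time.perf_counter() - start) / 50

print(f"Params: {params_m:.2f} M, inference: {infer_s * 1000:.2f} ms")
# FLOPs can be estimated with a profiler, e.g., thop.profile or fvcore's FlopCountAnalysis.
```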