Article

SCNAnet: Structure-Aware Contrastive with Noise-Augmented Network for Unsupervised Change Detection

1 School of Mechanical Engineering, Beijing Institute of Technology, Beijing 100081, China
2 School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China
* Author to whom correspondence should be addressed.
Remote Sens. 2026, 18(9), 1427; https://doi.org/10.3390/rs18091427
Submission received: 22 March 2026 / Revised: 29 April 2026 / Accepted: 29 April 2026 / Published: 4 May 2026

Highlights

What are the main findings?
  • We propose SCNAnet, an unsupervised change detection framework that combines a structure-aware style encoder, a noise-perturbation consistency branch, and a frequency-attention decoder to reduce shortcut-driven optimization and improve semantic change representation.
  • SCNAnet achieves state-of-the-art performance on GF-2 VHR, OSCD, and QuickBird datasets, with more accurate change localization, fewer false positives, and clearer boundaries than competing unsupervised methods.
What are the implications of the main findings?
  • The results show that alleviating optimization shortcuts is critical for unsupervised remote sensing change detection, because style-loss-driven training alone may misclassify unchanged but stylistically similar regions as changes.
  • The proposed framework provides a practical way to improve robustness in complex scenarios with seasonal variation, illumination differences, and multi-scale changes, which is valuable for real-world Earth observation applications.

Abstract

Unsupervised change detection (UCD) is a key technique in Earth observation, aiming to identify and quantify surface changes over time by analyzing multi-temporal remote sensing images without manual annotations. Unlike supervised approaches that rely on ground reference to directly guide discriminative semantic learning, UCD methods must construct their own reference. A mainstream strategy employs one temporal image as the reference and uses transformation models (e.g., style transfer networks) to align the other image in unchanged regions. Loss is then reduced by labeling hard-to-align pixels as “changes” and excluding them from the objective. However, this optimization process is dominated by style losses, which cause the model to learn to exclude regions that make only limited contributions to style-loss minimization, rather than to acquire discriminative representations of true geospatial changes. Such shortcut-driven optimization results in insufficient modeling of genuine change features and frequent misclassification of unchanged yet stylistically similar regions. To address these limitations, we propose SCNAnet, a novel framework that integrates three modules: a noise-perturbation consistency branch to suppress shortcut-driven learning, a structure-aware style transformation encoder to strengthen semantic representations of structural changes, and a frequency-attention decoder to refine the delineation of change regions. Extensive experiments on three benchmark datasets (GF-2, OSCD, and QuickBird) demonstrate the effectiveness of SCNAnet. Specifically, SCNAnet improves the F1 score by approximately 8% on the Montpellier dataset compared with the second-best method, demonstrating its effectiveness under challenging conditions.

1. Introduction

Change detection (CD), a pivotal technique in Earth observation, aims to identify and quantify significant changes in a given region by analyzing multi-temporal remote sensing images captured at different times [1,2]. Through this process, CD offers crucial information on the spatial and temporal changes occurring within the observed area, enabling a wide range of applications in urban development monitoring, environmental conservation, disaster response, and agricultural management [3,4,5,6].
The advancement of deep learning in computer vision has led to its increasingly widespread application in remote sensing change detection. With the sharing of public datasets, a large number of supervised change detection models have emerged. Most of these models, based on convolutional neural networks [7,8,9,10,11,12], Transformer architectures [13,14,15,16,17], and graph networks [18,19,20,21], are designed to learn highly discriminative features from changed and unchanged regions by minimizing the differences between predictions and ground truth labels, thereby producing high-accuracy change maps. Beyond these architectures, recent literature has witnessed the emergence of novel paradigms that show immense potential in CD tasks. For instance, State-Space Models (SSMs) like ChangeMamba [22] have been introduced to efficiently capture long-range spatio-temporal dependencies, which significantly improves the accuracy of identifying changes in highly complex scenes. Meanwhile, Denoising Diffusion Probabilistic Models (DDPMs) [23] have been successfully utilized as powerful feature extractors in change detection tasks. Although these methods demonstrate strong modeling capabilities, they are typically designed under supervised settings and rely on labeled data. The process of generating such annotations is both time-consuming and labor-intensive, particularly for previously unseen regions [24].
Unsupervised change detection (UCD) is an alternative approach [25,26,27] that eliminates the reliance on large annotated datasets. Early UCD methods typically analyze multi-temporal images using image algebra, transformation, and post-classification comparison. Representative techniques such as Change Vector Analysis (CVA) [28], Multivariate Alteration Detection (MAD) [29] and its variant IR-MAD [30], as well as Slow Feature Analysis (SFA) [31,32], primarily rely on shallow, manually designed features, which are highly susceptible to interference from irrelevant factors such as illumination variations. In recent years, deep learning-based UCD methods have become the dominant paradigm. They fall broadly into two groups, iterative pseudo-label refinement [33,34,35,36] and cross-temporal style transformation [37,38,39,40,41], both of which have demonstrated notable improvements in change detection accuracy compared with traditional UCD techniques.
Iterative pseudo-label refinement typically consists of a pseudo-label generator and a change detection module. This approach utilizes samples produced by the generator as pseudo-labels to train the change detection module. Since the initial pseudo-labels are often unreliable, iterative strategies such as metric learning or clustering are employed to alternately update the generator and the change detector, with the aim of gradually refining the pseudo-labels and selecting high-confidence samples to improve the effectiveness of change detection training. However, the iterative process relies on similarity measures of multi-temporal image features, making it difficult to determine appropriate thresholds that can establish reliable boundaries between changed and unchanged regions. Moreover, pseudo-change factors such as illumination and seasonal variations further complicate threshold selection, ultimately reducing the quality and reliability of the pseudo-labels [42,43,44,45,46,47,48,49].
Cross-temporal style transformation typically employs generative network architectures, such as generative adversarial networks (GANs) [37] and autoencoders [41,50,51], to align the domains of multi-temporal images, thereby reducing style discrepancies caused by non-target factors like illumination and seasonal variations. Following image alignment, change detection maps are usually generated by comparing the differences between the aligned images or their deep feature representations, often in conjunction with conventional change detection algorithms. However, such approaches tend to rely heavily on post-processing steps, which ultimately limit the precision of the detection results.
To mitigate errors introduced by post-processing, recent studies have introduced a mask-guided strategy into cross-temporal style transformation (referred to here as mask-guided style transformation). This approach integrates the style transformation module and the change detection module within a unified joint framework. The core idea is to exclude regions of true geographical change during the optimization of style transformation, since these changes arise from alterations in geographical entities rather than from non-geographical factors (e.g., illumination, seasonal variations) and cannot be effectively aligned through style transformation. Therefore, a change mask generated by the detection module is incorporated into the optimization of style transformation. By ignoring change regions in the computation of the style loss and enforcing alignment only on unchanged areas, the method achieves more efficient loss minimization and progressively guides the change detection network to accurately identify true change regions. In addition, to prevent the style loss from driving the change detector to classify all pixels as changed, which would reduce the style loss to zero, an auxiliary sparsity constraint is imposed on the number of change pixels. Compared with other unsupervised approaches, mask-guided style transformation suppresses pseudo-change factors more effectively and achieves end-to-end change detection without complex post-processing.
However, in mask-guided style transformation, the model is driven primarily by the optimization of style loss, which pushes the change detection module to learn the regions that make limited contributions to style-loss minimization. Although this strategy can successfully identify change areas that are difficult to align in terms of style, it actually fails to capture genuine semantic-level change features. As a result, this shortcut-based optimization mechanism often misclassifies some unchanged yet stylistically similar areas as changes, as illustrated in Figure 1. To quantitatively support this phenomenon, we analyze the relationship between pixel-wise similarity and prediction errors (specifically false positives, FP) using the baseline model. We observe that, in high-similarity regions, a large proportion of pixels are predicted as “changed”, while many of them are actually unchanged according to the ground truth (exceeding 83% across the datasets). This suggests that such regions tend to be excluded from the style alignment loss computation, leading the change detector to mistakenly identify them as change regions.
To address these issues, we propose SCNAnet, which enhances the semantic representation learning capability of the change detection module and alleviates the optimization shortcuts. The key ideas of the proposed method are as follows:
  • Noise-perturbation consistency branch: To mitigate the effect of optimization shortcuts, we introduce a noise-perturbation branch in the decoding stage of the change detection module, with a consistency constraint applied between its output and the original branch. The injected noise elevates the loss of low-loss regions, preventing these regions from being adversely affected by optimization shortcuts. The consistency constraint encourages the network to learn noise-invariant, robust semantic representations, leading to a semantic-aware model.
  • Structure-aware style transformation encoder: To enhance feature separability, we construct explicit positive and negative sample pairs and apply contrastive loss. The positive sample is generated by applying the style transformation network to one temporal image. The negative sample is generated by shuffling patches of the transformed image, which disrupts its geospatial structural integrity while preserving local spectral statistics. This design compels the network to focus on geospatial structural changes, improving its ability to accurately identify true change regions.
  • Frequency-attention decoder: To improve the precision of change boundary delineation, a frequency-attention decoder is introduced to the decode stage to jointly utilize the high-frequency details and low-frequency global context information. The fused features are further processed through a spatial attention map, which emphasizes change boundary regions while suppressing irrelevant background information. This design enhances the network’s sensitivity to change boundaries and improves the accuracy of their delineation.
The components above enable SCNAnet to avoid optimization shortcuts, enhance semantic representation, and improve the accuracy and robustness of unsupervised change detection.
The rest of this paper is structured as follows. Section 2 reviews related work on unsupervised change detection and contrastive learning in CD tasks. Section 3 elaborates on the principle and algorithm of the SCNAnet framework. Section 4 describes the experiments on public datasets. Section 5 presents the ablation study and discussion. Section 6 draws the concluding remarks.

2. Related Works

2.1. Mask-Guided Style Transformation in CD

With the development of deep learning, researchers have introduced more innovative change detection methods. Mask-guided style transformation is a typical approach, which achieves end-to-end unsupervised change detection without post-processing and alleviates the problem of inconsistent styles in bi-temporal images.
To the best of our knowledge, Liu et al. [52] made an early exploration and proposed the probabilistic UCD model, which employs a two-branch convolutional network to extract features from both temporal images independently. A style loss is computed between the two images based on these feature representations. To reduce interference from changed areas, a change probability map (i.e., mask) is introduced, guiding the network to focus on aligning the styles of unchanged regions. Furthermore, sparse constraints are applied to the mask to prevent the trivial solution where all pixels are excluded (i.e., labeled as changed). In comparison to the probabilistic UCD model, Bandara et al. [9] introduced an additional convolutional network, termed a change probability generator, to produce a preliminary change mask. Moreover, their method computes the style loss at both the feature level and the original image level. Essentially, both approaches perform feature extraction on the bi-temporal images separately. By minimizing feature discrepancies, they achieve domain alignment between the two images in an intermediate feature space, yet without genuinely employing style transfer networks to transform one image domain into another. Wu et al. [53] advanced this line of work by incorporating a GAN-based style transfer network. By excluding changed regions, their model transfers the image from $t_1$ to the style of $t_2$ with minimal style loss, thereby enhancing alignment accuracy. To mitigate the overfitting issue inherently caused by a single GAN, Liu et al. [54] further incorporated CycleGAN into the mask-guided generation process. Han et al. [19] further extended the framework by adopting a graph autoencoder as the style transformation network, tailored for handling multimodal data. This method progressively refines the change regions through structured feature alignment. Additionally, it employs a bidirectional style transformation strategy, utilizing two graph autoencoders to accomplish transfers both from $t_1$ to $t_2$ and vice versa, leading to more accurate and consistent change detection results.
In summary, the evolution of unsupervised change detection has seen a notable shift towards integrated frameworks that combine generative style transformation with mask-guided optimization. The commonality across these studies is the use of a change mask, either directly estimated or learned by a network, to restrict style losses to unchanged regions only, thereby enabling more precise domain adaptation.
However, despite their effectiveness in suppressing pseudo-changes, these methods exhibit a shared limitation: the model is essentially driven by the objective of style-loss minimization, rather than by learning semantically discriminative features for change. This often encourages the model to focus on regions that facilitate style alignment while ignoring those that do not. As a result, the model may rely on shortcut solutions that minimize the loss without capturing true semantic changes. In particular, regions with high stylistic similarity but no actual change are prone to being misclassified, as they contribute less to the loss reduction. This reveals an inherent limitation of mask-guided style transformation frameworks and highlights the need for methods that mitigate optimization shortcuts and promote semantically meaningful change representations.

2.2. Contrastive Learning and Consistency Regularization

Contrastive learning, a self-supervised learning approach [55], revolves around constructing positive and negative sample pairs to optimize the model’s feature learning. By pulling positive pairs together and pushing negative pairs apart in the latent space, classical frameworks (e.g., SimCLR [56], MoCo [57] and BYOL [58]) effectively learn general-purpose invariant features. Contrastive learning has been successfully applied across a wide range of domains, including computer vision, natural language processing, speech processing, recommendation systems and so on.
Recently, contrastive learning has been adapted for remote sensing change detection (CD) by constructing positive and negative pairs based on bi-temporal scenes [59], seasonal variations [60], spatial neighborhoods [61], or patch similarities [15]. However, most methods rely on contrasting local patches within a scene, which tends to cause the model to focus on local contextual patterns, hindering the learning of discriminative features at a global scene level. Our structure-aware style transformation encoder is inspired by the contrastive learning paradigm but differs fundamentally from existing approaches. Specifically, it constructs negative samples via spatial patch shuffling, which alters spatial arrangements while preserving global spectral statistics. This design enables the model to capture structural differences at a broader spatial level, facilitating the decoupling of structure and style and improving the discriminative capability for change detection.
In addition, our framework incorporates principles from consistency regularization. Methods such as Mean Teacher [62] and Noisy Student [63] enforce prediction consistency under perturbations, typically in semi-supervised settings, to leverage global data distributions via model ensembling or pseudo-labeling. In contrast, our noise-perturbation consistency (NC) branch is designed for unsupervised change detection, where only a single bi-temporal image pair is available. Instead of using noise to connect labeled and unlabeled data, we introduce feature-level perturbations within the same image pair. This improves representation robustness and reduces the tendency of the model to learn unstable features, thereby alleviating the optimization shortcut behavior observed in mask-guided style transformation without requiring additional data.

3. Methodology

3.1. Overall Framework

The overall architecture of SCNAnet is designed to address the limitations of existing unsupervised change detection networks. It aims to disrupt optimization shortcuts, enhance the discriminability of feature representations, and improve the precision of change boundary delineation.
As illustrated in Figure 2, the framework integrates three core components into a unified, end-to-end trainable model: (1) the structure-aware style transformation encoder (ST-E), which employs contrastive learning on an autoencoder backbone to learn geospatial structural features while suppressing stylistic discrepancies; (2) the frequency-attention decoder (FA-D), which extracts high-frequency detail and low-frequency contextual features and fuses them through an attention mechanism, enabling precise delineation of change boundaries; and (3) the auxiliary noise-perturbation consistency branch (NC), which injects feature-level noise into FA-D and enforces a consistency constraint, preventing shortcut optimization and encouraging noise-invariant semantic learning.
The architecture operates on a pair of bi-temporal images, denoted as $X$ and $Y$, captured at different times. The process begins with the ST-E module, which aligns the style of $X$ to that of $Y$. ST-E employs style transformation and patch shuffling to construct the positive sample $X^+$ and the negative sample $X^-$. Image $X$ is fed into the encoder $E_1$, while $X^+$ and $X^-$ are each fed into the encoder $E_2$. The two encoders produce multi-scale features, $e_1^{i,x}$ and $e_2^{i,x^+}$ or $e_2^{i,x^-}$, and a contrastive constraint makes the encoders learn geospatial structural features while suppressing cross-temporal style discrepancies. Image $Y$ is then fed into the encoder $E_2$ to obtain $e_2^{i,y}$; $e_1^{i,x}$ and $e_2^{i,y}$ are concatenated into $f_i$ and fed into the parallel decoders. One branch is FA-D, based on the frequency-attention module, which separates the encoder features into high-frequency detail and low-frequency contextual components and then fuses them through a spatial attention mechanism, enabling precise delineation of change boundaries. The other branch is NC, which injects random noise into the features $f_i$ and feeds them into FA-D while enforcing consistency with the unperturbed output, thereby preventing the network from exploiting low-loss regions and encouraging noise-invariant semantic learning. The entire network is optimized end to end through a combination of losses, including the style loss, contrastive loss, consistency loss, and sparsity constraint, which are elaborated in the subsequent sections.

3.2. Structure-Aware Style Transformation Encoder

The structure-aware style transformation encoder (ST-E) consists of two parts: an autoencoder for style transformation, as shown in Figure 3, and parallel encoders designed for structure-aware contrastive learning. The parallel encoders, which share weights, extract semantic features from the bi-temporal images. Each encoder follows the UNet encoder architecture, a convolutional network consisting of four stages. Each stage comprises convolutional and pooling layers that progressively extract and compress multi-scale features from the input image.

3.2.1. Autoencoder-Based Style Transformation

The autoencoder in our framework is built upon a convolutional network. The autoencoder processes the input pair of temporal images, with the objective of transferring one image to the other under the assumption that no significant changes have occurred. The autoencoder consists of multiple convolutional layers, batch normalization layers and residual connections. The convolutional layers progressively extract hierarchical features from the input images, capturing both low-level details and high-level semantic information. Each convolutional layer is followed by batch normalization, which stabilizes training by normalizing activations, mitigating internal covariate shift, and accelerating convergence. Residual blocks with skip connections are incorporated to preserve critical feature information throughout the network.
Specifically, the autoencoder is trained to align the style of the input image $X$ with that of the reference image $Y$. When $X$ is processed by the autoencoder (AE), it is transformed into an output image $X^+$, which preserves the semantic content of $X$ while adopting the stylistic characteristics of $Y$. This transformation can be formally represented as
$$X^+ = AE(X)$$
To ensure effective style alignment, the style transformation loss is computed exclusively within unchanged regions, using a change mask M produced by the encoders and decoders of the overall framework. The corresponding loss function is defined as follows:
$$\mathcal{L}_{style} = \min \| (1 - M) \odot (X^+ - Y) \|_{MSE}$$
where $\| \cdot \|_{MSE}$ denotes the mean squared error (MSE), which is used to measure the distance between two features in the proposed approach, and $\odot$ denotes element-wise multiplication. Because the change mask $M$ takes the value 1 for changed pixels, $(1 - M)$ selects the unchanged region. Hence, $X^+$ and $Y$ can be made approximately style-consistent by excluding the changed area.
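The masked style loss described above can be illustrated with a minimal NumPy sketch. The function name `masked_style_loss`, the array shapes, the broadcasting of the mask over channels, and the normalization (a mean over all elements) are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def masked_style_loss(x_plus, y, mask):
    """MSE between x_plus and y, restricted to pixels marked as unchanged.

    x_plus, y: arrays of shape (C, H, W).
    mask: array of shape (H, W) with values in [0, 1], where 1 means "changed".
    Assumption: the loss is averaged over all C*H*W elements.
    """
    keep = 1.0 - mask                              # (1 - M): weight of unchanged pixels
    residual = keep[None, :, :] * (x_plus - y)     # broadcast the mask over channels
    return float(np.mean(residual ** 2))

rng = np.random.default_rng(0)
y = rng.random((3, 8, 8))
x_plus = y.copy()
x_plus[:, 2:4, 2:4] += 0.5                         # a 2x2 region that cannot be aligned

no_mask = np.zeros((8, 8))
true_mask = np.zeros((8, 8))
true_mask[2:4, 2:4] = 1.0
all_changed = np.ones((8, 8))

loss_no_mask = masked_style_loss(x_plus, y, no_mask)
loss_true_mask = masked_style_loss(x_plus, y, true_mask)
loss_all_changed = masked_style_loss(x_plus, y, all_changed)
```

In this toy case, masking the misaligned region removes its contribution to the loss, but so does marking every pixel as changed. This is the trivial solution that the sparsity constraint on the mask is meant to rule out.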

3.2.2. Parallel Encoders for Structure-Aware Contrastive Learning

To improve the separability of the features that the encoders extract from bi-temporal images, structure-aware contrastive learning is introduced to map features into a latent space where the distance between unchanged and change regions is consistent.
Assume that the image $X$ at $t_1$ is divided into $m \times m$ patches, so that $X = \{x_1, x_2, \ldots, x_{m \times m}\}$, where $m$ denotes the number of grid divisions along each spatial dimension. A variant $Z$ is then generated by rearranging the patches of $X$ according to a new index vector $\sigma = (\sigma_1, \sigma_2, \ldots, \sigma_{m \times m})$, which gives $Z = \{x_{\sigma_1}, x_{\sigma_2}, \ldots, x_{\sigma_{m \times m}}\}$. Patch shuffling only permutes the spatial locations of the image patches without changing any pixel values. The spectral statistics, such as color histograms and spectral distributions, are therefore preserved after shuffling, whereas the original spatial arrangement and the structural relationships among neighboring patches are disrupted. The constructed negative samples thus keep the spectral characteristics of $X$ while altering its spatial structure, which encourages the contrastive module to focus on structural differences rather than appearance variations.
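A minimal NumPy sketch of the patch-shuffling step follows. The function name `shuffle_patches`, the channel-first layout, and the requirement that the image size be divisible by the grid size are illustrative assumptions; the paper does not specify an implementation.

```python
import numpy as np

def shuffle_patches(image, m, rng):
    """Split a (C, H, W) image into an m x m grid and permute the patches.

    Returns the shuffled image Z and the index vector sigma, so that patch k
    of Z is patch sigma[k] of the input (patches numbered row by row).
    """
    c, h, w = image.shape
    assert h % m == 0 and w % m == 0, "H and W must be divisible by m"
    ph, pw = h // m, w // m
    # (C, m, ph, m, pw) -> (m*m, C, ph, pw): one entry per patch
    patches = (image.reshape(c, m, ph, m, pw)
                    .transpose(1, 3, 0, 2, 4)
                    .reshape(m * m, c, ph, pw))
    sigma = rng.permutation(m * m)
    shuffled = patches[sigma]
    # Inverse of the split: place the permuted patches back on the grid
    z = (shuffled.reshape(m, m, c, ph, pw)
                 .transpose(2, 0, 3, 1, 4)
                 .reshape(c, h, w))
    return z, sigma

rng = np.random.default_rng(0)
x = rng.integers(0, 256, size=(3, 8, 8))
z, sigma = shuffle_patches(x, m=4, rng=rng)

# Per-channel pixel values are only relocated, so the histograms are identical.
same_histogram = all(
    np.array_equal(np.sort(x[ch].ravel()), np.sort(z[ch].ravel()))
    for ch in range(x.shape[0])
)
```

Because only patch positions change, any statistic computed from the multiset of pixel values per channel is unchanged, while adjacency between patches is not.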
Following this structural modification, $Z$ is further augmented through a series of geometric and photometric transformations, collectively denoted by the function $\Gamma(\cdot)$, which introduces controlled variations such as rotations, scaling, and brightness adjustments:
$$Z' = \Gamma(Z)$$
The function $\Gamma(\cdot)$ is implemented as a sequence of operations. The input image first undergoes random horizontal and vertical flipping, each with a probability of 0.5, followed by a random rotation selected from the discrete angle set $\{45^\circ, 135^\circ, 225^\circ, 315^\circ\}$. Subsequently, random translation and scaling are applied, with a translation within $\pm 10\%$ of the image spatial dimensions and a random scaling factor in the range $[0.9, 1.1]$. Finally, photometric distortion is introduced via color jittering, with maximum perturbation factors of 0.5 for brightness, 0.5 for contrast, 0.5 for saturation, and 0.2 for hue. These modifications further distinguish $Z'$ from the context of $X$, thereby creating a challenging negative sample that enhances the ability of the encoder to learn structural differences and semantic features.
The structure-aware contrastive mechanism leverages positive and negative sample pairs for training, which are defined as:
  • Positive pairs: $(X, X^+)$, where $X^+$ is the output of $AE(\cdot)$, which applies a style transformation to image $X$. The autoencoder simulates an unchanged sample, so the positive sample preserves the context of $X$.
  • Negative pairs: $(X, X^-)$, where $X^-$ is generated by applying the same transformation $AE(\cdot)$ to $Z'$. This image disrupts the original spatial context of $X$ and serves as the negative sample.
Mathematically, the contrastive loss is composed of contributions from both positive and negative samples, which can be formulated as
$$\mathcal{L}_{contrastive} = \mathcal{L}_{pos} + \mathcal{L}_{neg}$$
where $\mathcal{L}_{pos}$ guides the encoders to ensure that features from the same scene remain closely aligned under different styles:
$$\mathcal{L}_{pos} = \min \| e_1^{i,x} - e_2^{i,x^+} \|_{MSE}$$
and $\mathcal{L}_{neg}$ drives the encoders to learn the differences caused by true temporal changes while excluding irrelevant variations:
$$\mathcal{L}_{neg} = \max \left( 0, 1 - \| e_1^{i,x} - e_2^{i,x^-} \|_{MSE} \right)$$
Here, the margin of the loss is empirically set to 1, a value widely adopted in contrastive learning [64]. This setting enforces a sufficient minimum distance to push negative pairs apart without over-penalizing pairs that are already well separated, thereby ensuring training stability. $e_1^{i,x}$ is generated by feeding the image $X$ into encoder $E_1$; $e_2^{i,x^+}$ and $e_2^{i,x^-}$ are generated by feeding the constructed images $X^+$ and $X^-$ into the encoder $E_2$. The index $i \in \{1, 2, 3, 4\}$ indicates the $i$-th level feature.
The goal is to effectively minimize the distance between features of positive pairs while maximizing the distance between features of negative pairs. Therefore, the encoders can be contrastively trained to perceive geospatial structural changes while filtering out irrelevant interference.
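As a rough numerical illustration of the two terms, the sketch below evaluates the positive-pair MSE and the hinged negative-pair term on toy features. The function names are hypothetical, and summing the per-level terms is an assumption, since the text does not state how the feature levels are aggregated.

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two feature arrays of equal shape."""
    return float(np.mean((a - b) ** 2))

def contrastive_loss(e_x, e_pos, e_neg, margin=1.0):
    """Positive-pair MSE plus hinged negative-pair MSE.

    e_x, e_pos, e_neg: lists of per-level feature arrays.
    Assumption: the per-level terms are summed over levels.
    """
    loss_pos = sum(mse(a, p) for a, p in zip(e_x, e_pos))
    loss_neg = sum(max(0.0, margin - mse(a, n)) for a, n in zip(e_x, e_neg))
    return loss_pos + loss_neg, loss_pos, loss_neg

# Toy single-level features
e_x = [np.zeros((2, 4, 4))]
e_pos = [np.full((2, 4, 4), 0.1)]    # close to e_x: small positive term
e_near = [np.full((2, 4, 4), 0.5)]   # negative inside the margin: penalized
e_far = [np.full((2, 4, 4), 2.0)]    # negative beyond the margin: no penalty

total_near, pos_term, neg_near = contrastive_loss(e_x, e_pos, e_near)
total_far, _, neg_far = contrastive_loss(e_x, e_pos, e_far)
```

With a margin of 1, a negative pair whose MSE is 0.25 still contributes 0.75 to the loss, while a pair whose MSE is 4 contributes nothing. This is the sense in which well-separated pairs are not over-penalized.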
In addition, image $Y$ is also fed into the encoder $E_2$ to obtain the feature $e_2^{i,y}$. The fused feature $f_i$ is then obtained as
$$f_i = \mathrm{concat}(e_1^{i,x}, e_2^{i,y})$$
The fused encoder features $f_i$ are fed into the decoding stage.

3.3. Frequency-Attention Decoder

To address the challenge of accurately localizing change boundaries, the frequency-attention decoder (FA-D) is proposed. Corresponding to the architecture of the encoder, four frequency-attention modules (FA) are applied in the frequency-attention decoder, and the change mask M can be obtained. Each FA module is shown in Figure 4.
The FA module first processes the multi-scale features f i and f i 1 using a frequency feature extractor (FFE), which consists of two main convolutional branches. Given the input feature maps f i C i × H i × W i and f i 1 C i 1 × H i 1 × W i 1 , representing deeper level and shallower level features, respectively, f i is first up-sampled as the same size as f i 1 , namely f i C i × H i 1 × W i 1 . Then, f i and f i 1 are passed through convolutional layers followed by batch normalization, mapping the original input features to an intermediate channel C. During the frequency separation process, we apply the average pooling operation after Conv and BN layers, which acts as a low-pass filter, to extract the low-frequency components. Subsequently, the high frequency components F high i C × H i 1 × W i 1 are obtained through a residual calculation by subtracting the extracted low-frequency components from the original features. The extraction of high- and low-frequency components for f i (up-sampled by f i ) can be expressed as
F low i = A v e r a g e P o o l ( B N ( C o n v ( f i ) ) ) F h i g h i = f i F low i
The high- and low-frequency components of the up-sampled $f_i$ and of $f_{i-1}$ are then passed through a two-branch convolutional network for further refinement and fusion. Within this network, the low-frequency features primarily capture global contextual information, while the high-frequency features focus on fine details and local structural changes. Each branch processes the features through a series of $3 \times 3$ convolutions and batch normalization layers. The low-frequency component is then obtained via average pooling with a kernel size of $3 \times 3$, a stride of 1, and a padding of 1, and the high-frequency component is derived from the low-frequency component. After these steps, the refined features are fused in the feature space, resulting in the fused feature map $U \in \mathbb{R}^{C \times H_{i-1} \times W_{i-1}}$. The fused feature map $U$ is further processed by a frequency-guided attention mechanism, which applies a $3 \times 3$ convolution to $U$ followed by a Sigmoid function to generate an attention map $\alpha$. The attention map modulates the input feature $f_{i-1}$, allowing the network to focus on change regions while suppressing irrelevant background. Finally, the refined feature map $d_{i-1}$ is obtained by multiplying $f_{i-1}$ with the attention map $\alpha$:
$$d_{i-1} = f_{i-1} \odot \alpha$$
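As a rough illustration of the attention gating, the sketch below applies a sigmoid to a map of logits and multiplies the result element-wise with the shallower feature. The `logits` argument is a stand-in for the output of the 3 × 3 convolution applied to the fused map, which is not modeled here; the function names are hypothetical.

```python
import numpy as np


def sigmoid(z: np.ndarray) -> np.ndarray:
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))


def frequency_guided_gate(f_shallow: np.ndarray, logits: np.ndarray):
    """Gate the shallower feature with an attention map in (0, 1).

    f_shallow: (C, H, W) feature map.
    logits:    (1, H, W) or (C, H, W) pre-activation map (assumed given).
    Returns the gated feature and the attention map.
    """
    alpha = sigmoid(logits)
    return f_shallow * alpha, alpha
```

Large positive logits leave the feature almost unchanged, large negative logits suppress it, and a zero logit halves it.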
After four FA modules, the change mask $M$ is obtained layer by layer as $d_4 \rightarrow d_3 \rightarrow d_2 \rightarrow d_1$:
$$d_4 = \mathrm{FA}(f_5, f_4), \quad d_3 = \mathrm{FA}(d_4, f_3), \quad d_2 = \mathrm{FA}(d_3, f_2), \quad M = d_1 = \mathrm{FA}(d_2, f_1)$$
The FA module jointly exploits high-frequency components, which capture fine-grained structural details, and low-frequency components, which provide global contextual cues. By explicitly modeling complementary frequency information and emphasizing spatially critical regions, this design strengthens the network’s boundary awareness and facilitates more precise delineation of change regions.
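The layer-by-layer decoding order can be written as a short fold over the encoder features. This is only a structural sketch: `fa` is a placeholder for one FA module, passed in as a callable, and `decode_change_mask` is a hypothetical name.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def decode_change_mask(features: Sequence[T], fa: Callable[[T, T], T]) -> T:
    """Chain FA modules from the deepest feature to the shallowest.

    features: [f1, f2, f3, f4, f5], ordered from shallowest to deepest.
    fa:       callable taking (deeper, shallower) and returning the refined map.
    """
    if len(features) < 2:
        raise ValueError("at least two feature levels are required")
    deeper = features[-1]  # start from f5
    # Visit f4, f3, f2, f1 in turn; each FA output feeds the next module.
    for shallower in reversed(features[:-1]):
        deeper = fa(deeper, shallower)
    return deeper  # d1, used as the change mask M
```

With five feature levels this performs exactly four FA calls, matching the cascade from $d_4$ down to $d_1$.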

3.4. Noise-Perturbation Consistency Branch

Current mask-guided style transformation methods often suffer from optimization shortcuts, especially in low-loss regions. When a region $j$ has very similar values in the two bi-temporal images, the patch $x_j$ can be transformed into the patch $y_j$ by the style transformation in a nearly lossless manner.
In this case, the loss $\| \mathrm{AE}(x_j) - y_j \|_{\mathrm{MSE}}$ is already small, so further adjusting the parameters of the style transformation network yields little additional decrease. The optimizer may therefore drive the detection network to exclude this region (that is, to label it as changed), which lowers the style loss at the cost of only a comparatively small increase in the sparse loss. To discourage the model from exploiting such optimization shortcuts, a noise-perturbation branch is placed in the decoding stage, with a consistency constraint applied between its output $M^{*}$ and the output $M$ of the original branch. The design of the noise-perturbation branch is inspired by the Denoising Autoencoder paradigm [65], which enforces representation stability under input perturbations, encourages the learning of robust features, and prevents the model from relying on noise-sensitive patterns. Similar ideas have also been explored in consistency regularization methods, such as Mean Teacher, which enforce consistency under perturbations to improve model robustness. In our method, noise perturbation is introduced to suppress optimization shortcuts by discouraging the model from relying on unstable, low-loss feature patterns, thereby promoting the extraction of more robust information.
Therefore, at the connection between the encoder and decoder, a Gaussian noise perturbation $n \sim \mathcal{N}(\mu, \sigma^2)$ is added in the feature space to slightly shift the feature maps away from their original distribution. In our implementation, $\mu = 0$ and $\sigma = 1$. This setting introduces moderate perturbations that improve robustness while avoiding excessive distortion of the feature representation:
$$f^{*} = f_5 + n$$
where $f^{*}$ is the noised feature, $f_5$ is the deepest feature generated by the encoder and computed by Equation (7), and $n$ is the random noise. Both the perturbed and original feature maps are passed to the decoder, resulting in two change maps, $M$ and $M^{*}$. $M^{*}$ is computed by Equation (10), with the noised feature $f^{*}$ in place of the original feature $f_5$. The noise-perturbation consistency constraint can then be formulated as
$$\mathcal{L}_{\mathrm{N\text{-}consistency}} = \min \| M - M^{*} \|_{\mathrm{MSE}}$$
The consistency constraint guides the network to learn robust, noise-invariant semantic representations, effectively shifting the change detector from relying on loss optimization to recognizing genuine semantic changes.
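A minimal sketch of the consistency term, assuming zero-mean Gaussian noise added to the deepest feature and a mean-squared error between the two decoded maps. The decoder is passed in as a callable; `toy_decoder` below is a hypothetical stand-in used only to make the example runnable, not the FA-D.

```python
import numpy as np


def toy_decoder(f: np.ndarray) -> np.ndarray:
    """Hypothetical decoder: channel mean followed by a sigmoid, giving an (H, W) map."""
    return 1.0 / (1.0 + np.exp(-f.mean(axis=0)))


def noise_consistency_loss(f5, decoder, rng, sigma=1.0):
    """MSE between the change maps decoded from f5 and from f5 + n, with n ~ N(0, sigma^2)."""
    noise = rng.normal(loc=0.0, scale=sigma, size=f5.shape)
    m = decoder(f5)               # change map from the clean feature
    m_star = decoder(f5 + noise)  # change map from the perturbed feature
    return float(np.mean((m - m_star) ** 2))
```

For example, `noise_consistency_loss(f5, toy_decoder, np.random.default_rng(0))` returns a non-negative scalar that vanishes when `sigma` is set to zero.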

3.5. Training and Inference

3.5.1. Model Training and Iterative Optimization

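The entire SCNAnet framework is trained end-to-end by optimizing a composite loss function,
$$\mathcal{L} = \mathcal{L}_{\mathrm{style}} + \lambda_1 \mathcal{L}_{\mathrm{contrastive}} + \lambda_2 \mathcal{L}_{\mathrm{N\text{-}consistency}} + \lambda_3 \mathcal{L}_{\mathrm{sparse}}$$
where $\lambda_i$ are hyper-parameters balancing the contribution of each loss term, and $\mathcal{L}_{\mathrm{sparse}}$ is the sparsity constraint on the number of change pixels in $M$:
$$\mathcal{L}_{\mathrm{sparse}} = \min \| M \|_1$$
where $\| \cdot \|_1$ is the $\ell_1$ norm.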
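The weighted sum can be checked with a few lines of plain Python. The default weights below are the values reported in the experiment settings (0.2, 0.2, 0.75); computing the sparse term as an unnormalized sum of absolute values is an assumption, since an implementation could equally average over pixels.

```python
def sparse_loss(mask):
    """L1 norm of a 2-D change probability map given as nested lists (unnormalized sum)."""
    return sum(abs(v) for row in mask for v in row)


def total_loss(style, contrastive, n_consistency, sparse,
               lam1=0.2, lam2=0.2, lam3=0.75):
    """Composite loss: style + lam1*contrastive + lam2*N-consistency + lam3*sparse."""
    return style + lam1 * contrastive + lam2 * n_consistency + lam3 * sparse
```

With a style loss of 1.0, a contrastive loss of 0.5, a consistency loss of 0.25, and a sparse loss of 1.5, the total is 1.0 + 0.1 + 0.05 + 1.125 = 2.275.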
The optimization is iterative. The change mask M is obtained by the FA-D and used to update the unchanged region mask for the style calculation in the next forward pass. This iterative process allows the style transformation and change detection to mutually guide and improve each other progressively, as shown in Algorithm 1.
Algorithm 1 SCNAnet training and iterative optimization
  1: Input and initialization:
  2:   Input: A pair of co-registered remote sensing images, $X$ and $Y$, captured at different times.
  3:   Initialization: All network parameters are randomly initialized. The initial change mask $M^{(0)}$ can be set to zero. $k$ is the iteration count.
  4: Forward pass:
  5:   Structure-aware style transformation encoder: Image $X$ and the current change mask $M^{(k)}$ are fed into the ST-E. This module generates images $X^{+}$ and $X^{-}$ and the corresponding encoder features $e_1^{i,x}$, $e_2^{i,x^{+}}$, $e_2^{i,x^{-}}$, and $e_2^{i,y}$.
  6:   Frequency-attention decoder: Features $e_1^{i,x}$ and $e_2^{i,y}$ are fed into the FA-D to generate a new change probability map $M^{(k+1)}$ with Equations (7) and (10).
  7:   Noise-perturbation consistency branch: Noise is added to features $e_1^{i,x}$ and $e_2^{i,y}$, which are then fed into the NC branch to generate a noised change probability map $M^{*}$.
  8: Loss computation:
  9:   The total loss $\mathcal{L}$ is computed by Equation (12).
 10: Backward pass and parameter update:
 11:   The gradients of the total loss with respect to all network parameters (ST-E, FA-D, NC) are computed via backpropagation, and the parameters are updated.
 12: Mask update:
 13:   $M^{(k+1)}$ generated in line 6 is used for the next forward pass with the updated network parameters.
 14: Iterative cycle:
 15:   Lines 4 to 13 are repeated until the model converges.
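The control flow of the iterative optimization can be outlined as below. This is a schematic skeleton only: `forward` and `update` are hypothetical callables standing in for the forward pass with loss computation and for the backward pass with parameter update, and the stopping rule (change in loss below `tol`) is an assumption, since the convergence criterion is not specified.

```python
from typing import Callable, List, Tuple, TypeVar

Mask = TypeVar("Mask")


def run_iterative_optimization(
    forward: Callable[[Mask], Tuple[Mask, float]],
    update: Callable[[float], None],
    initial_mask: Mask,
    max_iters: int,
    tol: float = 1e-6,
) -> Tuple[Mask, List[float]]:
    """Alternate between network updates and change-mask updates."""
    mask = initial_mask
    history: List[float] = []
    for _ in range(max_iters):
        new_mask, loss = forward(mask)  # forward pass and loss computation
        update(loss)                    # backward pass and parameter update
        mask = new_mask                 # mask update for the next forward pass
        # Assumed stopping rule: loss changed by less than tol since last iteration.
        if history and abs(history[-1] - loss) < tol:
            history.append(loss)
            break
        history.append(loss)
    return mask, history
```

The mask returned by one iteration is the mask consumed by the next, which is the mutual guidance between style transformation and change detection described in the text.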

3.5.2. Inference

After training is complete, inference is performed to obtain the final change detection result. This process is a straightforward, one-way forward pass.
As shown in Figure 5, only the trained ST-E branch (without the autoencoder) and the FA-D are active during inference; the noise-perturbation branch is disabled. The framework then outputs the final binary change map.

4. Experiments

In this section, we evaluate the performance of SCNAnet for change detection tasks on standard change detection datasets. We compare the proposed model with state-of-the-art approaches in change detection. Furthermore, an ablation study is conducted to analyze the individual impact of the ST-E, FA-D and NC.

4.1. Datasets

4.1.1. GF-2 VHR Dataset

The GF-2 VHR Dataset [66] includes two pairs of high-resolution multi-temporal images from Wuhan and Hanyang. Captured on 4 April 2016 and 1 September 2016 by the GF-2 satellite, the images have a spatial resolution of 4 m and four spectral bands. Both scenes are 1000 × 1000 pixels in size; Wuhan (WH) shows changes in urban development and water bodies, while Hanyang (HY) focuses on changes such as factory and railway construction.

4.1.2. OSCD Dataset

The Onera Satellite Change Detection (OSCD) dataset [67] contains 24 pairs of multispectral images from the Sentinel-2 satellites (2015–2018). Covering locations worldwide, these images include 13 spectral bands with spatial resolutions ranging from 10 m to 60 m, offering a broad variety of change detection scenarios.

4.1.3. QB Dataset

The QB dataset [68] contains bi-temporal images from 2009 and 2014, captured by the QuickBird satellite. Each image has three spectral bands, with a resolution of 2.4 m and a size of 1154 × 740 pixels. It includes pixel-level ground truth maps for change detection, with prominent seasonal changes making it challenging for precise analysis.

4.2. Experiment Settings

Implementation and Parameter Selection

We implemented our unsupervised CD framework in PyTorch 2.0 on an NVIDIA RTX 4090 GPU. SCNAnet was trained and iteratively optimized for 100 epochs, with the batch size fixed at 8 for all experiments. The model is optimized using the Adam optimizer with $\beta_1 = 0.9$ and $\beta_2 = 0.99$, and the base learning rate is set to $2 \times 10^{-4}$. A warm-up strategy is adopted during the first 20 epochs, where the learning rate increases linearly from $1 \times 10^{-4}$ to $2 \times 10^{-4}$. After the warm-up stage, the learning rate is gradually reduced from $2 \times 10^{-4}$ to $1 \times 10^{-4}$ between epochs 20 and 80.
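The learning-rate schedule can be sketched as a piecewise function of the epoch index. The sketch reads the rates as 1e-4 and 2e-4, assumes the decay between epochs 20 and 80 is linear (the text only says "gradually"), and assumes the rate is held at 1e-4 afterwards; `learning_rate` is a hypothetical helper name.

```python
def learning_rate(epoch: float,
                  lr_start: float = 1e-4,
                  lr_peak: float = 2e-4,
                  lr_end: float = 1e-4,
                  warmup_epochs: int = 20,
                  decay_end_epoch: int = 80) -> float:
    """Linear warm-up to lr_peak, assumed linear decay to lr_end, then constant."""
    if epoch < warmup_epochs:
        t = epoch / warmup_epochs
        return lr_start + t * (lr_peak - lr_start)
    if epoch < decay_end_epoch:
        t = (epoch - warmup_epochs) / (decay_end_epoch - warmup_epochs)
        return lr_peak + t * (lr_end - lr_peak)
    return lr_end  # assumption: held constant after the decay window
```

Under these assumptions the rate is 1.5e-4 at epoch 10, peaks at 2e-4 at epoch 20, returns to 1.5e-4 at epoch 50, and stays at 1e-4 from epoch 80 onward.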
The hyper-parameter $m$ is set to 5, so that $m \times m = 25$ patches are obtained. In practice, $m = 5$ is a trade-off between structural disruption and spatial coherence. Smaller values provide limited perturbation and fail to effectively construct negative samples, while larger values tend to over-fragment the image and introduce noise. Additionally, $\lambda_1$, $\lambda_2$, and $\lambda_3$ are determined based on the initial magnitudes of their respective loss terms, aiming to balance their gradient contributions. Specifically, we set $\lambda_1 = 0.2$, $\lambda_2 = 0.2$, and $\lambda_3 = 0.75$ in the experiments.
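To make the role of the grid size concrete, the sketch below splits an image into a 5 × 5 grid of patches and permutes them. That negative samples are built by shuffling grid patches is an assumption made for illustration; the function name `shuffle_patches` is hypothetical, and the sketch requires the image height and width to be divisible by the grid size.

```python
import numpy as np


def shuffle_patches(image: np.ndarray, m: int, rng: np.random.Generator):
    """Split a (C, H, W) image into an m x m grid and randomly permute the patches.

    Returns the shuffled image and the permutation: output patch p is
    input patch order[p], with patches indexed row-major over the grid.
    """
    c, h, w = image.shape
    if h % m or w % m:
        raise ValueError("H and W must be divisible by m in this sketch")
    ph, pw = h // m, w // m
    # (C, m, ph, m, pw) -> (m, m, C, ph, pw) -> (m*m, C, ph, pw)
    patches = (image.reshape(c, m, ph, m, pw)
                    .transpose(1, 3, 0, 2, 4)
                    .reshape(m * m, c, ph, pw))
    order = rng.permutation(m * m)
    shuffled = patches[order]
    # Inverse layout change: back to a (C, H, W) image.
    out = (shuffled.reshape(m, m, c, ph, pw)
                   .transpose(2, 0, 3, 1, 4)
                   .reshape(c, h, w))
    return out, order
```

With `m = 5` the permutation has 25 entries. Pixel values are preserved while the spatial layout is disrupted, which matches the trade-off described above: a small grid changes little, and a very fine grid fragments the scene.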

4.3. Evaluation Metrics and Change Map Visualizations

We used overall accuracy (OA), Kappa coefficient (Kappa), F1 score (F1), mIOU, and cIOU to evaluate the model’s performance. Here, mIOU and cIOU represent the mean IOU and change-specific IOU, respectively. Bold red values indicate the best results, and bold blue values indicate the second-best results. The standard predicted change maps are binary. However, for qualitative analysis and visual comparison against the ground reference, we highlight the predictions with different colors: white represents true positives (TP), black represents true negatives (TN), red indicates false positives (FP, false alarms), and blue signifies false negatives (FN, omission errors).
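For reference, the five metrics can be computed from the binary confusion counts as sketched below. The definitions are the standard ones (Cohen's Kappa, and mIOU taken as the mean of the changed-class and unchanged-class IOU); that they coincide exactly with the paper's evaluation code is an assumption, and the sketch assumes both classes are present so that no denominator is zero.

```python
def change_detection_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """OA, Kappa, F1, mIoU and cIoU from binary confusion-matrix counts."""
    n = tp + fp + fn + tn
    oa = (tp + tn) / n
    # Expected agreement by chance, used by Cohen's Kappa.
    pe = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (oa - pe) / (1 - pe)
    f1 = 2 * tp / (2 * tp + fp + fn)
    c_iou = tp / (tp + fp + fn)   # IoU of the changed class
    u_iou = tn / (tn + fp + fn)   # IoU of the unchanged class
    return {
        "OA": oa,
        "Kappa": kappa,
        "F1": f1,
        "mIoU": (c_iou + u_iou) / 2,
        "cIoU": c_iou,
    }
```

For example, with TP = 40, FP = 10, FN = 10, and TN = 940, the OA is 0.98 while the F1 score is 0.80 and the cIOU is about 0.667, which shows why OA alone is a weak indicator when changed pixels are rare.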

4.4. Experiment Results

4.4.1. Results on GF-2 VHR Dataset

The qualitative performance of SCNAnet compared to other methods on the GF-2 VHR dataset, encompassing the Hanyang and Wuchang districts of Wuhan, China, is illustrated in Figure 6 and Figure 7. Traditional approaches, such as ISFA, capture most urban and structural changes but generate numerous scattered false positives, particularly in complex areas like water bodies and urban developments, as shown in Figure 6g and Figure 7g. These methods, reliant on shallow features, struggle with misclassifications in regions affected by stylistic variations. KPCAMNet enhances feature extraction through convolutional architectures, and FCD-GAN employs GAN-based style transfer to align bi-temporal images; however, both exhibit omission errors in intricate urban settings, as evident in Figure 6h and Figure 7h. In contrast, SCNAnet delivers precise change detection with well-defined contours and minimal noise, as demonstrated in Figure 6j and Figure 7j. Its noise-perturbation consistency branch and structure-aware style transformation encoder effectively mitigate false positives, while the frequency-attention decoder ensures accurate boundary delineation for changes in factories, roads, and buildings.
Quantitative results are presented in Table 1. ISFA underperforms, achieving an overall accuracy (OA) below 95% on both HY and WH datasets, limited by its inability to distinguish changed from unchanged regions. Deep learning methods like RNN, DSCN, and SiamCRNN variants offer improvements over ISFA but are constrained by their dependence on pseudo-labels or post-processing, resulting in moderate F1 scores. While KPCAMNet and FCD-GAN achieve higher precision, their recall is lower due to missed detections. SCNAnet outperforms all methods, achieving a new state-of-the-art with an OA of 98.6%, F1 score of 88.1%, and Kappa of 87.0% on HY, and an OA of 99.3%, F1 score of 80.4%, and Kappa of 80.0% on WH. Notably, SCNAnet improves the F1 score by approximately 3% over the next-best method, driven by its robust semantic feature learning, effective suppression of optimization shortcuts, and enhanced boundary delineation capabilities.

4.4.2. Results on OSCD Dataset

The OSCD dataset contains some low-quality images due to varying imaging conditions, making reliable change detection challenging. We focused on qualitative analysis for three widely studied subsets—Lasvegas, Montpellier, and Beirut—while also providing quantitative results for the full dataset and these subsets (Table 2 and Table 3). The Lasvegas subset, characterized by scattered building changes and similar regions, tends to amplify optimization shortcuts. Montpellier features a prominent changing road, while Beirut includes both scattered buildings and large land patches with dense construction, adding to the detection complexity.
Qualitative results on the OSCD subsets—Lasvegas, Montpellier, and Beirut—are shown in Figure 8, Figure 9, and Figure 10, respectively. Traditional methods produce noisy predictions with numerous false positives, particularly in scattered or small change regions. Other deep learning methods, while mitigating stylistic differences, struggle to balance detection between large and small change areas, often missing small scattered changes or missegmenting large homogeneous regions. In contrast, SCNAnet delivers more consistent and precise results across different spatial scales. The FA-D module plays a critical role in this improvement, enhancing boundary delineation and ensuring both small and large change regions are accurately captured with minimal noise.
Quantitative results further support these observations: compared to the best competing method, SCNAnet consistently improves the key metrics, with overall accuracy increased by around 1%, F1 score by up to 8%, and Kappa by up to 10% on average across the subsets. These results confirm that SCNAnet achieves robust segmentation performance across both large and small change regions.

4.4.3. Results on QB Dataset

The qualitative performance of SCNAnet compared to other methods on the QB dataset is shown in Figure 11. The dataset exhibits noticeable appearance variations, likely caused by differences in atmospheric conditions and illumination, with changes mainly occurring in buildings and woodlands and at diverse spatial scales, making accurate detection particularly challenging. Traditional methods such as MAD, IRMAD, ISFA, and PCAKmeans detect some changes but generate numerous false positives, especially in regions with strong appearance variations, as depicted in Figure 11d–g. These methods, which rely on shallow features, struggle to accurately distinguish changed from unchanged regions. Although SCCN, CAA, Metric-CD, and FCD-GAN better align stylistic differences, they still suffer from high misclassification rates, as shown in Figure 11h–l.
Quantitative comparisons in Table 4 confirm these observations. Traditional approaches remain limited, with overall accuracies typically below 85%. Even recent deep learning methods show restricted gains, as they often improve precision but suffer from low recall. In contrast, SCNAnet consistently outperforms competing methods, achieving the highest overall accuracy and F1 score on the QB dataset. These improvements align well with the visual results, particularly in capturing large-scale changes in road networks and woodland areas while suppressing pseudo-changes in regions with strong appearance variations. Together, the qualitative and quantitative evidence demonstrates that SCNAnet achieves more reliable change maps than existing methods.

5. Discussion

To evaluate the contributions of SCNAnet’s core components—ST-E, NC, and FA-D—we conducted ablation experiments on the GF-2 VHR dataset, with results detailed in Table 5. The qualitative outcomes, particularly for similar regions prone to optimization shortcuts, are illustrated in Figure 12, showcasing SCNAnet’s change detection performance.

5.1. Impact of Individual Modules

The baseline model, excluding ST-E, NC, and FA-D, is essentially a plain mask-guided style transformation backbone such as FCD-GAN. Specifically, it consists of weight-sharing CNN encoders, a basic convolutional decoder without frequency attention, and a standard autoencoder driven solely by the style loss and sparsity constraint, lacking both the contrastive learning mechanism and the noise-perturbation consistency. This baseline achieves an OA of 95.1%, Kappa of 73.4%, and F1 score of 75.9% on the HY dataset, and an OA of 98.3%, Kappa of 73.7%, and F1 of 74.5% on the WH dataset. Incorporating ST-E alone enhances performance, increasing OA to 96.5% and F1 to 83.9% on HY, demonstrating its ability to improve feature separability through contrastive learning. Similarly, adding NC alone boosts the F1 score to 80.2% on HY, highlighting its role in mitigating optimization shortcuts by enforcing noise-invariant representations. Including FA-D alone yields significant gains, with OA reaching 97.2% and F1 87.9% on HY, underscoring its effectiveness in precise boundary delineation. Comparable trends are observed on the WH dataset, with NC and FA-D individually improving performance, while ST-E’s impact is slightly less pronounced due to the dataset’s complexity.

5.2. Combination of Modules

Combining NC and FA-D produces robust results, achieving an OA of 97.1%, F1 of 87.5%, and mIOU of 77.8% on HY, surpassing single-module configurations. The integration of ST-E with NC further improves performance, yielding an OA of 96.9% and F1 of 85.9% on HY, indicating synergy between robust feature learning and shortcut mitigation. The combination of ST-E and FA-D achieves an OA of 97.2% and F1 of 87.9% on HY, demonstrating that structural feature extraction and boundary-aware decoding complement each other effectively. The complete SCNAnet model, integrating ST-E, NC, and FA-D, delivers the best performance, with an OA of 98.6%, F1 of 88.1%, and mIOU of 78.8% on HY, and an OA of 99.3%, F1 of 80.4%, and mIOU of 67.3% on WH, highlighting the synergistic effect of all components.

5.3. Mitigation of Optimization Shortcuts

The qualitative results in Figure 12 demonstrate SCNAnet’s ability to address optimization shortcuts, particularly in similar regions where style-invariant areas are often misclassified as changes. Without NC, models tend to exclude low-loss regions, resulting in false positives. The NC branch, by introducing noise and enforcing consistency, promotes robust semantic learning, significantly reducing misclassifications. ST-E enhances this by focusing on geospatial structural changes, while FA-D refines change boundaries, producing cleaner change maps with fewer false positives and omissions, especially in complex urban scenes.
In summary, the ablation study confirms that each module makes significant contributions to SCNAnet’s performance. The full integration of these components achieves optimal results, with the NC branch proving particularly effective in countering optimization shortcuts, as visually validated in Figure 12. ST-E enhances feature discriminability, and FA-D improves boundary precision, making SCNAnet highly effective for unsupervised change detection.

6. Conclusions

In this paper, we introduced SCNAnet, a novel UCD method for remote sensing images, addressing the dependency on labeled data and the challenge of optimization shortcuts in deep learning-based approaches. SCNAnet uniquely combines ST-E, NC, and FA-D to enhance the distinguishability of changed and unchanged regions. The ST-E employs contrastive learning to extract geospatial structural features, mitigating stylistic discrepancies, while the NC branch suppresses optimization shortcuts by enforcing noise-invariant semantic representations. The FA-D leverages multi-frequency features to achieve precise boundary delineation, ensuring accurate change maps. By integrating these components into an end-to-end framework, SCNAnet avoids the need for labeled data and effectively suppresses pseudo-changes caused by illumination or seasonal variations. Comparative experiments on the GF-2 VHR, QuickBird, and OSCD datasets confirm SCNAnet’s superior performance over state-of-the-art unsupervised CD methods, demonstrating significant improvements in OA, F1 scores, and Kappa. Notably, SCNAnet excels in challenging scenarios with similar regions, as validated in ablation studies. However, scattered change regions remain a challenge, a common issue in real-world applications. Future work will explore feature refinement strategies and the modeling of semantic change representations to further enhance robustness and precision in UCD.

Author Contributions

Conceptualization, Y.S. and N.W.; methodology, N.W.; software, N.W. and Q.W.; validation, Y.S., Q.W. and N.W.; formal analysis, N.W.; investigation, N.W.; resources, Y.S.; data curation, Q.W.; writing—original draft preparation, Y.S.; writing—review and editing, Q.W. and N.W.; visualization, N.W. and Q.W.; supervision, N.W.; project administration, N.W.; funding acquisition, N.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Beijing Natural Science Foundation (L247008), National Natural Science Foundation of China (62471041), the National Key Research and Development Program of China (2022YFB3903404), and Science for a Better Development of Inner Mongolia Program (2022EEDSKJXM003-2).

Data Availability Statement

The GF-2 VHR dataset used in this study is available from the authors upon reasonable request. The OSCD dataset is publicly available from https://ieee-dataport.org/open-access/oscd-onera-satellite-change-detection, accessed on 22 March 2026. The QuickBird dataset is available from the authors upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Hussain, M.; Chen, D.; Cheng, A.; Wei, H.; Stanley, D. Change detection from remotely sensed images: From pixel-based to object-based approaches. ISPRS J. Photogramm. Remote Sens. 2013, 80, 91–106.
  2. Singh, A. Review article digital change detection techniques using remotely-sensed data. Int. J. Remote Sens. 1989, 10, 989–1003.
  3. Du, Z.; Yang, J.; Ou, C.; Zhang, T. Agricultural land abandonment and retirement mapping in the northern china crop-pasture band using temporal consistency check and trajectory-based change detection approach. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4406712.
  4. Liu, T.; Yang, L.; Lunga, D. Change detection using deep learning approach with object-based image analysis. Remote Sens. Environ. 2021, 256, 112308.
  5. Shafique, A.; Cao, G.; Khan, Z.; Asad, M.; Aslam, M. Deep learning-based change detection in remote sensing images: A review. Remote Sens. 2022, 14, 871.
  6. Stilla, U.; Xu, Y. Change detection of urban objects using 3d point clouds: A review. ISPRS J. Photogramm. Remote Sens. 2023, 197, 228–255.
  7. Zhang, M.; Guo, C.; Zhang, Y.; Liu, H.; Li, W. Gccd: A generative cross-domain change detection network. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5628410.
  8. Lv, Z.; Huang, H.; Gao, L.; Benediktsson, J.A.; Zhao, M.; Shi, C. Simple multiscale unet for change detection with heterogeneous remote sensing images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 2504905.
  9. Bandara, W.G.C.; Patel, V.M. Deep metric learning for unsupervised remote sensing change detection. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV); IEEE: New York, NY, USA, 2025; pp. 5125–5135.
  10. Hu, M.; Wu, C.; Du, B.; Zhang, L. Binary change guided hyperspectral multiclass change detection. IEEE Trans. Image Process. 2023, 32, 791–806.
  11. Panda, M.K.; Subudhi, B.N.; Veerakumar, T.; Jakhetiya, V. Modified resnet-152 network with hybrid pyramidal pooling for local change detection. IEEE Trans. Artif. Intell. 2024, 5, 1599–1612.
  12. Varghese, A.; Gubbi, J.; Ramaswamy, A.; Balamuralidhar, P. ChangeNet: A deep learning architecture for visual change detection. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops; Springer: Berlin/Heidelberg, Germany, 2018; pp. 129–145.
  13. Zhang, Y.; Zhao, Y.; Dong, Y.; Du, B. Self-supervised pretraining via multimodality images with transformer for change detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5402711.
  14. Bandara, W.G.C.; Patel, V.M. A transformer-based siamese network for change detection. In Proceedings of the IEEE International Geoscience and Remote Sensing Symposium; IEEE: New York, NY, USA, 2022; pp. 207–210.
  15. Wang, J.; Yan, L.; Yang, J.; Xie, H.; Yuan, Q.; Wei, P.; Gao, Z.; Zhang, C.; Atkinson, P.M. MaCon: A generic self-supervised framework for unsupervised multimodal change detection. IEEE Trans. Image Process. 2025, 34, 1485–1500.
  16. Zhang, C.; Wang, L.; Cheng, S.; Li, Y. Swinsunet: Pure transformer network for remote sensing image change detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5224713.
  17. Jiang, B.; Wang, Z.; Wang, X.; Zhang, Z.; Chen, L.; Wang, X.; Luo, B. Vct: Visual change transformer for remote sensing image change detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 2005214.
  18. Wang, N.; Li, W.; Tao, R.; Du, Q. Graph-based block-level urban change detection using sentinel-2 time series. Remote Sens. Environ. 2022, 274, 112993.
  19. Han, T.; Tang, Y.; Chen, Y.; Yang, X.; Guo, Y.; Jiang, S. Sdc-gae: Structural difference compensation graph autoencoder for unsupervised multimodal change detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5622416.
  20. Qu, J.; Xu, Y.; Dong, W.; Li, Y.; Du, Q. Dual-branch difference amplification graph convolutional network for hyperspectral image change detection. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5519912.
  21. Liu, T.; Xu, J.; Lei, T.; Wang, Y.; Du, X.; Zhang, W.; Gong, M. AEKAN: Exploring superpixel-based autoencoder Kolmogorov-Arnold network for unsupervised multimodal change detection. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5601114.
  22. Chen, H.; Song, J.; Han, C.; Xia, J.; Yokoya, N. Changemamba: Remote sensing change detection with spatiotemporal state space model. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4409720.
  23. Bandara, W.G.C.; Nair, N.G.; Patel, V.M. Ddpm-cd: Denoising diffusion probabilistic models as feature extractors for change detection. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV); IEEE: New York, NY, USA, 2025; pp. 5250–5262.
  24. Gaspar, J.G.; Neider, M.B.; Simons, D.J.; McCarley, J.S.; Kramer, A.F. Change detection: Training and transfer. PLoS ONE 2013, 8, e67781.
  25. Bruzzone, L.; Prieto, D.F. Automatic analysis of the difference image for unsupervised change detection. IEEE Trans. Geosci. Remote Sens. 2000, 38, 1171–1182.
  26. Fang, H.; Du, P.; Wang, X. A novel unsupervised binary change detection method for vhr optical remote sensing imagery over urban areas. Int. J. Appl. Earth Obs. Geoinf. 2022, 108, 102749.
  27. Leichtle, T.; Geiß, C.; Wurm, M.; Lakes, T.; Taubenböck, H. Unsupervised change detection in vhr remote sensing imagery–an object-based clustering approach in a dynamic urban environment. Int. J. Appl. Earth Obs. Geoinf. 2017, 54, 15–27.
  28. Johnson, R.D.; Kasischke, E.S. Change vector analysis: A technique for the multispectral monitoring of land cover and condition. Int. J. Remote Sens. 1998, 19, 411–426.
  29. Nielsen, A.A.; Conradsen, K.; Simpson, J.J. Multivariate alteration detection (mad) and maf postprocessing in multispectral, bitemporal image data: New approaches to change detection studies. Remote Sens. Environ. 1998, 64, 1–19.
  30. Nielsen, A.A. The regularized iteratively reweighted mad method for change detection in multi-and hyperspectral data. IEEE Trans. Image Process. 2007, 16, 463–478.
  31. Wu, C.; Du, B.; Zhang, L. Slow feature analysis for change detection in multispectral imagery. IEEE Trans. Geosci. Remote Sens. 2013, 52, 2858–2874.
  32. Du, B.; Ru, L.; Wu, C.; Zhang, L. Unsupervised deep slow feature analysis for change detection in multi-temporal remote sensing images. IEEE Trans. Geosci. Remote Sens. 2019, 57, 9976–9992.
  33. Li, Q.; Mu, T.; Tuniyazi, A.; Yang, Q.; Dai, H. Progressive pseudolabel framework for unsupervised hyperspectral change detection. Int. J. Appl. Earth Obs. Geoinf. 2024, 127, 103663.
  34. Ran, L.; Wen, D.; Zhuo, T.; Zhang, S.; Zhang, X.; Zhang, Y. AdaSemiCD: An adaptive semi-supervised change detection method based on pseudo-label evaluation. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5615814.
  35. Mao, Z.; Tong, X.; Luo, Z. Semi-supervised remote sensing image change detection using mean teacher model for constructing pseudolabels. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: New York, NY, USA, 2023; pp. 1–5.
  36. Zuo, Y.; Li, L.; Liu, X.; Gao, Z.; Jiao, L.; Liu, F.; Yang, S. Robust instance-based semi-supervised learning change detection for remote sensing images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4404815.
  37. Noh, H.; Ju, J.; Seo, M.; Park, J.; Choi, D.G. Unsupervised change detection based on image reconstruction loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2022; pp. 1352–1361.
  38. Noh, H.; Ju, J.; Kim, Y.; Kim, M.; Choi, D.G. Unsupervised change detection based on image reconstruction loss with segment anything. Remote Sens. Lett. 2024, 15, 919–929.
  39. Liu, Z.G.; Zhang, Z.W.; Pan, Q.; Ning, L.B. Unsupervised change detection from heterogeneous data based on image translation. IEEE Trans. Geosci. Remote Sens. 2021, 60, 4403413.
  40. Xu, Q.; Shi, Y.; Guo, J.; Ouyang, C.; Zhu, X.X. Ucdformer: Unsupervised change detection using a transformer-driven image translation. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5619917.
  41. Luppino, L.T.; Kampffmeyer, M.; Bianchi, F.M.; Moser, G.; Serpico, S.B.; Jenssen, R.; Anfinsen, S.N. Deep image translation with an affinity-based change prior for unsupervised multimodal change detection. IEEE Trans. Geosci. Remote Sens. 2021, 60, 4700422.
  42. Wang, Q.; Chen, Z.; Yang, C.; Liu, J.; Li, Z.; Zhao, F. Psedet: Revisiting the power of pseudo label in incremental object detection. In Proceedings of the International Conference on Learning Representations (ICLR), Singapore, 24–28 April 2025.
  43. Keshk, H.M.; Yin, X.C. Change detection in sar images based on deep learning. Int. J. Aeronaut. Space Sci. 2020, 21, 549–559.
  44. Wang, H.; Li, H.; Qian, W.; Diao, W.; Zhao, L.; Zhang, J.; Zhang, D. Dynamic pseudo-label generation for weakly supervised object detection in remote sensing images. Remote Sens. 2021, 13, 1461.
  45. Ye, K.; Huang, Z.; Xiong, Y.; Gao, Y.; Xie, J.; Shen, L. Progressive pseudo labeling for multi-dataset detection over unified label space. IEEE Trans. Multimed. 2024, 27, 531–543.
  46. Li, H.; Zou, B.; Zhang, L.; Qin, J. CausalCD: A causal graph contrastive learning framework for self-supervised SAR image change detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5217016.
  47. Zhu, H.; Gao, D.; Cheng, G.; Povey, D.; Zhang, P.; Yan, Y. Alternative pseudo-labeling for semi-supervised automatic speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2023, 31, 3320–3330.
  48. Wang, G.; Zhang, X.; Peng, Z.; Tian, S.; Zhang, T.; Tang, X.; Jiao, L. OraL: An observational learning paradigm for unsupervised hyperspectral change detection. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 5380–5393.
  49. Rolih, B.; Fučka, M.; Wolf, F.; Čehovin Zajc, L. Make some noise: Unsupervised remote sensing change detection using latent space perturbations. arXiv 2026, arXiv:2602.19881.
  50. Farahani, M.; Mohammadzadeh, A. Domain adaptation for unsupervised change detection of multisensor multitemporal remote-sensing images. Int. J. Remote Sens. 2020, 41, 3902–3923.
  51. Liu, T.; Zhang, M.; Gong, M.; Zhang, Q.; Jiang, F.; Zheng, H.; Lu, D. Commonality feature representation learning for unsupervised multimodal change detection. IEEE Trans. Image Process. 2025, 34, 1219–1233.
  52. Liu, J.; Zhang, W.; Liu, F.; Xiao, L. A probabilistic model based on bipartite convolutional neural network for unsupervised change detection. IEEE Trans. Geosci. Remote Sens. 2021, 60, 4701514.
  53. Wu, C.; Du, B.; Zhang, L. Fully convolutional change detection framework with generative adversarial network for unsupervised, weakly supervised and regional supervised change detection. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 9774–9788.
  54. Liu, Y.; Lu, Y. Consistency change detection framework for unsupervised remote sensing change detection. In 2025 IEEE International Conference on Multimedia and Expo (ICME); IEEE: New York, NY, USA, 2025; pp. 1–6.
  55. Khosla, P.; Teterwak, P.; Wang, C.; Sarna, A.; Tian, Y.; Isola, P.; Maschinot, A.; Liu, C.; Krishnan, D. Supervised contrastive learning. Adv. Neural Inf. Process. Syst. 2020, 33, 18661–18673.
  56. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2020; pp. 1597–1607.
  57. He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2020; pp. 9729–9738. [Google Scholar]
  58. Grill, J.; Strub, F.; Altché, F.; Tallec, C.; Richemond, P.; Buchatskaya, E.; Doersch, C.; Pires, B.A.; Guo, Z.; Azar, M.G.; et al. Bootstrap your own latent-a new approach to self-supervised learning. Adv. Neural Inf. Process. Syst. 2020, 33, 21271–21284. [Google Scholar]
  59. Chen, Y.; Bruzzone, L. Self-supervised remote sensing images change detection at pixel-level. arXiv 2021, arXiv:2105.08501. [Google Scholar] [CrossRef]
  60. Manas, O.; Lacoste, A.; Giró-i-Nieto, X.; Vazquez, D.; Rodriguez, P. Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV); IEEE: New York, NY, USA, 2021; pp. 9414–9423. [Google Scholar]
  61. Wu, H.; Geng, J.; Jiang, W. Multidomain constrained translation network for change detection in heterogeneous remote sensing images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5616916. [Google Scholar] [CrossRef]
  62. Tarvainen, A.; Valpola, H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Adv. Neural Inf. Process. Syst. 2017, 30, 1195–1204. [Google Scholar]
  63. Xie, Q.; Luong, M.; Hovy, E.; Le, Q.V. Self-training with noisy student improves ImageNet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2020; pp. 10687–10698. [Google Scholar]
  64. Raia, H.; Sumit, C.; Yann, L. Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06); IEEE: New York, NY, USA, 2006; Volume 2, pp. 1735–1742. [Google Scholar]
  65. Vincent, P.; Larochelle, H.; Bengio, Y.; Manzagol, P.A. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning; ACM: New York, NY, USA, 2008; pp. 1096–1103. [Google Scholar]
  66. Wu, C.; Chen, H.; Du, B.; Zhang, L. Unsupervised change detection in multitemporal vhr images based on deep kernel pca convolutional mapping network. IEEE Trans. Cybern. 2021, 52, 12084–12098. [Google Scholar] [CrossRef] [PubMed]
  67. Caye Daudt, R.; Le Saux, B.; Boulch, A.; Gousseau, Y. Urban change detection for multispectral earth observation using convolutional neural networks. In Proceedings of the IEEE International Geoscience and Remote Sensing Symposium (IGARSS); IEEE: New York, NY, USA, 2018; pp. 207–210. [Google Scholar]
  68. Zhang, M.; Shi, W. A feature difference convolutional neural network-based change detection method. IEEE Trans. Geosci. Remote Sens. 2020, 58, 7232–7246. [Google Scholar] [CrossRef]
Figure 1. The current mask-guided style transformation method incorrectly classifies some unchanged yet stylistically similar areas as change areas.
Figure 2. Overview of the proposed SCNAnet framework, where X denotes the image at time T1, Y denotes the image at time T2, X⁺ and X⁻ denote positive and negative samples, M is the predicted change mask, and M* is the perturbed mask.
Figure 3. Autoencoder-based style transformation.
Figure 4. Frequency-attention module (FA).
Figure 5. Inference process.
Figure 6. Results on GF-2 HY: (a) T1; (b) ISFA; (c) KPCAMNet; (d) FCD-GAN; (e) Ours; (f) T2; (g) the qualitative analysis of ISFA; (h) the qualitative analysis of KPCAMNet; (i) the qualitative analysis of FCD-GAN; (j) the qualitative analysis of Ours.
Figure 7. Results on GF-2 WH: (a) T1; (b) ISFA; (c) KPCAMNet; (d) FCD-GAN; (e) Ours; (f) T2; (g) the qualitative analysis of ISFA; (h) the qualitative analysis of KPCAMNet; (i) the qualitative analysis of FCD-GAN; (j) the qualitative analysis of Ours.
Figure 8. Change detection results on OSCD-lasvegas: (a) T1; (b) T2; (c) reference change mask; (d) MAD; (e) IRMAD; (f) ISFA; (g) PCAKMeans; (h) SCCN; (i) CAA; (j) KPCAMNet; (k) Metric-CD; (l) FCD-GAN; (m) Ours; (n) the qualitative analysis of Ours.
Figure 9. Change detection results on OSCD-montpellier: (a) T1; (b) T2; (c) reference change mask; (d) MAD; (e) IRMAD; (f) ISFA; (g) PCAKMeans; (h) SCCN; (i) CAA; (j) KPCAMNet; (k) Metric-CD; (l) FCD-GAN; (m) Ours; (n) the qualitative analysis of Ours.
Figure 10. Change detection results on OSCD-beirut: (a) T1; (b) T2; (c) reference change mask; (d) MAD; (e) IRMAD; (f) ISFA; (g) PCAKMeans; (h) SCCN; (i) CAA; (j) KPCAMNet; (k) Metric-CD; (l) FCD-GAN; (m) Ours; (n) the qualitative analysis of Ours.
Figure 11. Change detection results on QB Dataset: (a) T1; (b) T2; (c) reference change mask; (d) MAD; (e) IRMAD; (f) ISFA; (g) PCAKMeans; (h) SCCN; (i) CAA; (j) KPCAMNet; (k) Metric-CD; (l) FCD-GAN; (m) Ours; (n) the qualitative analysis of Ours.
Figure 12. Change detection results of similar regions by SCNAnet.
Table 1. Quantitative results on the GF-2 VHR dataset (HY and WH subsets).

| Method | HY OA | HY Kappa | HY F1 | HY mIOU | HY cIOU | WH OA | WH Kappa | WH F1 | WH mIOU | WH cIOU |
|---|---|---|---|---|---|---|---|---|---|---|
| ISFA | 0.941 | 0.637 | 0.657 | 0.489 | 0.725 | 0.959 | 0.328 | 0.346 | 0.209 | 0.584 |
| RNN | 0.944 | 0.730 | 0.761 | 0.615 | 0.777 | 0.975 | 0.652 | 0.665 | 0.498 | 0.736 |
| DSCN | 0.934 | 0.671 | 0.708 | 0.548 | 0.738 | 0.973 | 0.558 | 0.571 | 0.400 | 0.686 |
| SiamCRNN_FC | 0.945 | 0.725 | 0.756 | 0.607 | 0.774 | 0.977 | 0.692 | 0.704 | 0.543 | 0.760 |
| SiamCRNN_GRU | 0.951 | 0.753 | 0.781 | 0.640 | 0.793 | 0.978 | 0.702 | 0.714 | 0.555 | 0.766 |
| SiamCRNN_LSTM | 0.954 | 0.770 | 0.796 | 0.661 | 0.805 | 0.981 | 0.729 | 0.738 | 0.585 | 0.783 |
| KPCAMNet | 0.980 | 0.788 | 0.799 | 0.665 | 0.822 | 0.990 | 0.739 | 0.744 | 0.593 | 0.792 |
| FCD-GAN | 0.983 | 0.839 | 0.848 | 0.736 | 0.859 | 0.991 | 0.770 | 0.775 | 0.632 | 0.807 |
| Ours | 0.986 | 0.870 | 0.881 | 0.788 | 0.887 | 0.993 | 0.800 | 0.804 | 0.673 | 0.832 |
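
The metric columns used in the tables (OA, Kappa, F1, mIOU, cIOU) are not defined in this part of the article. As a minimal sketch, assuming the standard confusion-matrix definitions for binary change maps, they could be computed as follows. The function name `binary_change_metrics` is hypothetical, and reading cIOU as the IoU of the change class and mIOU as the mean of the change and unchanged IoUs is an assumption, not the authors' stated convention.

```python
import numpy as np


def binary_change_metrics(pred, ref):
    """Compute common change detection metrics from two binary masks.

    Assumed conventions (not confirmed by the article):
    1 = changed, 0 = unchanged; cIOU = IoU of the change class;
    mIOU = mean of the change and unchanged IoUs.
    """
    pred = np.asarray(pred, dtype=bool)
    ref = np.asarray(ref, dtype=bool)
    if pred.shape != ref.shape:
        raise ValueError("pred and ref must have the same shape")

    # Confusion-matrix counts
    tp = int(np.sum(pred & ref))    # changed, predicted changed
    tn = int(np.sum(~pred & ~ref))  # unchanged, predicted unchanged
    fp = int(np.sum(pred & ~ref))   # unchanged, predicted changed
    fn = int(np.sum(~pred & ref))   # changed, predicted unchanged
    n = tp + tn + fp + fn

    # Overall accuracy
    oa = (tp + tn) / n

    # Cohen's kappa: observed agreement corrected for chance agreement
    pe = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (oa - pe) / (1.0 - pe) if pe < 1.0 else 0.0

    # F1 score of the change class
    denom_f1 = 2 * tp + fp + fn
    f1 = 2 * tp / denom_f1 if denom_f1 else 0.0

    # Per-class intersection over union
    denom_c = tp + fp + fn
    denom_u = tn + fp + fn
    iou_change = tp / denom_c if denom_c else 0.0
    iou_unchanged = tn / denom_u if denom_u else 0.0

    return {
        "OA": oa,
        "Kappa": kappa,
        "F1": f1,
        "cIOU": iou_change,
        "mIOU": (iou_change + iou_unchanged) / 2.0,
    }


if __name__ == "__main__":
    # Toy 2 x 4 masks, for illustration only (not data from the article)
    pred = [[1, 1, 0, 0], [0, 1, 0, 0]]
    ref = [[1, 0, 0, 0], [0, 1, 1, 0]]
    for name, value in binary_change_metrics(pred, ref).items():
        print(f"{name}: {value:.3f}")
```

Under these definitions the change-class IoU and F1 are tied by IoU = F1 / (2 − F1), which gives a quick sanity check when reading a row of results.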
Table 2. Quantitative results on Lasvegas, Montpellier, and Beirut in the OSCD dataset.

| Method | Lasvegas OA | Lasvegas Kappa | Lasvegas F1 | Lasvegas mIOU | Lasvegas cIOU | Montpellier OA | Montpellier Kappa | Montpellier F1 | Montpellier mIOU | Montpellier cIOU | Beirut OA | Beirut Kappa | Beirut F1 | Beirut mIOU | Beirut cIOU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| KPCAMNet | 0.939 | 0.607 | 0.640 | 0.703 | 0.471 | 0.921 | 0.559 | 0.598 | 0.671 | 0.426 | 0.976 | 0.575 | 0.587 | 0.695 | 0.415 |
| Metric-CD | 0.944 | 0.651 | 0.681 | 0.729 | 0.517 | 0.905 | 0.517 | 0.562 | 0.645 | 0.391 | 0.940 | 0.331 | 0.351 | 0.576 | 0.213 |
| FCD-GAN | 0.946 | 0.646 | 0.675 | 0.726 | 0.509 | 0.922 | 0.553 | 0.592 | 0.669 | 0.420 | 0.976 | 0.552 | 0.564 | 0.684 | 0.393 |
| Ours | 0.952 | 0.677 | 0.703 | 0.745 | 0.542 | 0.951 | 0.661 | 0.687 | 0.736 | 0.524 | 0.977 | 0.579 | 0.591 | 0.698 | 0.420 |
Table 3. Accuracy evaluation on the whole OSCD dataset.

| Method | OA | Kappa | F1 | mIOU | cIOU |
|---|---|---|---|---|---|
| MAD | 0.883 | 0.209 | 0.230 | 0.510 | 0.139 |
| IRMAD | 0.899 | 0.234 | 0.262 | 0.530 | 0.163 |
| ISFA | 0.865 | 0.180 | 0.213 | 0.495 | 0.128 |
| PCAKmeans | 0.862 | 0.224 | 0.252 | 0.510 | 0.160 |
| SCCN | 0.631 | 0.070 | 0.084 | 0.336 | 0.046 |
| CAA | 0.843 | 0.163 | 0.196 | 0.479 | 0.118 |
| KPCAMNet | 0.880 | 0.238 | 0.267 | 0.524 | 0.171 |
| Metric-CD | 0.904 | 0.251 | 0.276 | 0.538 | 0.173 |
| FCD-GAN | 0.912 | 0.258 | 0.283 | 0.545 | 0.180 |
| Ours | 0.920 | 0.283 | 0.306 | 0.559 | 0.199 |
Table 4. Accuracy evaluation on the QuickBird dataset.

| Method | OA | Kappa | F1 | mIOU | cIOU |
|---|---|---|---|---|---|
| MAD | 0.814 | 0.389 | 0.482 | 0.317 | 0.557 |
| IRMAD | 0.845 | 0.443 | 0.524 | 0.355 | 0.593 |
| ISFA | 0.841 | 0.439 | 0.521 | 0.352 | 0.589 |
| PCAKmeans | 0.838 | 0.435 | 0.519 | 0.350 | 0.587 |
| SCCN | 0.830 | 0.366 | 0.457 | 0.296 | 0.556 |
| CAA | 0.888 | 0.459 | 0.522 | 0.353 | 0.617 |
| KPCAMNet | 0.852 | 0.483 | 0.560 | 0.389 | 0.613 |
| Metric-CD | 0.849 | 0.451 | 0.531 | 0.361 | 0.598 |
| FCD-GAN | 0.879 | 0.462 | 0.530 | 0.361 | 0.615 |
| Ours | 0.898 | 0.545 | 0.602 | 0.431 | 0.660 |
Table 5. Ablation experiments of SCNAnet on the GF-2 VHR dataset. A × marks a disabled module; the last row is the full model.

| Modules (ST-ENC, …, FA-D) | HY OA | HY Kappa | HY F1 | HY mIOU | WH OA | WH Kappa | WH F1 | WH mIOU |
|---|---|---|---|---|---|---|---|---|
| × × × | 0.951 | 0.734 | 0.759 | 0.612 | 0.983 | 0.737 | 0.745 | 0.594 |
| × × | 0.965 | 0.820 | 0.839 | 0.723 | 0.983 | 0.749 | 0.757 | 0.609 |
| × × | 0.957 | 0.778 | 0.802 | 0.678 | 0.984 | 0.749 | 0.757 | 0.609 |
| × | 0.971 | 0.859 | 0.875 | 0.778 | 0.985 | 0.769 | 0.777 | 0.635 |
| × | 0.969 | 0.842 | 0.859 | 0.753 | 0.985 | 0.775 | 0.782 | 0.642 |
| × | 0.972 | 0.863 | 0.879 | 0.783 | 0.984 | 0.767 | 0.775 | 0.633 |
| none disabled | 0.986 | 0.870 | 0.881 | 0.788 | 0.993 | 0.800 | 0.804 | 0.673 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

MDPI and ACS Style

Sun, Y.; Wu, Q.; Wang, N. SCNAnet: Structure-Aware Contrastive with Noise-Augmented Network for Unsupervised Change Detection. Remote Sens. 2026, 18, 1427. https://doi.org/10.3390/rs18091427
