Spatial-Frequency Decoupling Alignment Encoding for Remote Sensing Change Detection

Zhang, Xu; Du, Yue; Zhou, Weiran; Zhang, Kaihua

doi:10.3390/s26061979

Open AccessArticle

Spatial-Frequency Decoupling Alignment Encoding for Remote Sensing Change Detection

¹

School of Electronics and Information Engineering, Suzhou Polytechnic University, Suzhou 215104, China

²

Information Center, Suzhou Polytechnic University, Suzhou 215104, China

³

School of Computer Science, Nanjing University of Information Science and Technology, Nanjing 210044, China

^*

Author to whom correspondence should be addressed.

Sensors 2026, 26(6), 1979; https://doi.org/10.3390/s26061979

Submission received: 6 February 2026 / Revised: 8 March 2026 / Accepted: 19 March 2026 / Published: 21 March 2026

(This article belongs to the Special Issue Image Processing and Analysis for Object Detection: 3rd Edition)

Download

Browse Figures

Versions Notes

Abstract

Existing remote sensing change detection methods often struggle to accurately capture the contours of complex change targets and subtle textural differences. This makes it difficult to effectively distinguish between the boundaries of change targets and the background. To address this challenge, we propose a novel method called spatial-frequency decoupling alignment encoding (SDA-Encoding), which is designed to fully leverage information from both the spatial and frequency domains. Specifically, we first use a Transformer encoder to extract bi-temporal features. Next, we apply wavelet transform to decouple these features into low-frequency and high-frequency components. In the multi-scale high-frequency interaction (MHI) module, we combine local spatial enhancement using spatial pyramid pooling with cross-scale dependency supplementation via the dual-domain alignment fusion (DAF) module. Meanwhile, in the position-aware low-frequency enhancement (PLE) module, spatial position sensitivity is restored using coordinate attention, and region-level contextual dependencies are captured through the selective fusion attention (SFA) module. Finally, the two frequency-domain branches are complementarily fused within the spatial domain to achieve unified detection of both fine-grained and structural changes. Experimental results on three benchmark datasets demonstrate the significant performance improvements of SDA-Encoding.

Keywords:

remote sensing change detection; multi-scale high-frequency interaction; dual-domain alignment fusion; position-aware low-frequency enhancement; selective fusion attention

1. Introduction

Remote sensing (RS) image data has experienced tremendous growth in recent years. Currently, change detection (CD) technologies are being widely applied across various domains, including urbanization monitoring [1,2,3], disaster assessment [4,5,6], and agricultural monitoring [7,8,9].

In recent years, CD methods have evolved from traditional approaches [10,11,12,13,14,15,16,17,18] to deep learning-based techniques. Traditional methods rely on hand-crafted features and lack end-to-end optimization, which limits their robustness when dealing with complex noise and diverse real-world scenarios. Nowadays, deep learning methods have increasingly become a focal point of research in remote sensing image change detection [19,20,21,22,23,24,25,26]. Deep models, such as deep belief networks (DBNs) [27], stacked autoencoders (SAEs) [28], and convolutional neural networks (CNNs) [29], have demonstrated superior capabilities in extracting abstract and discriminative features for RSCD. While CNNs effectively preserve spatial information through local connectivity, their limited receptive fields constrain their ability to model global context and handle complex scenes. To overcome this limitation, recent research has introduced methods like dilated convolutions [30,31], spatial pyramid pooling [32,33], and attention mechanisms [34,35] for global context modeling. However, these methods still aggregate information locally. Transformer-based models, by leveraging self-attention, capture long-range dependencies and holistic scene context, resulting in significant improvements for RSCD. Recent advancements, including the Vision Transformer (ViT) [36], Swin Transformer [37], and hybrid CNN-Transformer architectures [38,39], further enhance both efficiency and detection accuracy.

In RSCD, the current networks effectively integrate both global and local spatial information. However, they typically rely solely on spatial-domain segmentation without explicitly incorporating frequency-domain features. Illumination variations and shadows in RS images are often misinterpreted as genuine land cover changes in the spatial domain, while blurred regions, lacking stable texture features, are frequently misclassified as unchanged. In signal processing theory, decoupling an image into different frequency components provides a clearer understanding of its content. As shown in Figure 1, the first two rows illustrate the effects of decomposing the image into low-frequency and high-frequency components, with key features gradually becoming more apparent. The low-frequency component reveals the overall structure and layout of the image, while the high-frequency component highlights edges and finer details. The final row presents the differences between the two images across various wavelet subbands, demonstrating that frequency-domain information is more effective in addressing issues such as lighting changes and shadows. Additionally, it plays a crucial role in extracting the image’s structure and boundaries more effectively. Moreover, frequency-domain features are more sensitive to changes in complex regions [40,41,42]. Therefore, actively incorporating frequency-domain information into spatial segmentation networks could improve the model’s robustness, particularly in scenarios involving foreground–background blurring.

Inspired by the insights above, we propose spatial-frequency decoupling alignment encoding (SDA-Encoding). Specifically, in the MiT-based bi-temporal fusion stage, spatial-domain features are first extracted using a Transformer encoder. Through early fusion, a bi-temporal feature space with cross-scale consistency is then constructed, forming the foundation for spatial-frequency decoupling. In the multi-scale high-frequency interaction (MHI) module, recognizing that high-frequency features are sensitive to local geometric structures, we incorporate a spatial pyramid pooling (SPP) [43] module, which introduces multi-scale convolutional kernels to enhance sensitivity to texture variations at different scales. Additionally, high-frequency features often exhibit cross-regional spatial dependencies that convolution alone cannot effectively capture. To address this, we design a dual-domain alignment fusion (DAF) module to perform both spatial and cross-scale enhancements on high-frequency features, enabling the simultaneous capture of local texture variations and long-range edge dependencies. In the position-aware low-frequency enhancement (PLE) module, to enhance positional awareness and multi-scale structural representation for capturing region-level changes, we incorporate a coordinate attention (CA) [44] module. This explicitly encodes spatial position and directional information within the low-frequency features. Additionally, the selective fusion attention (SFA) module is used to model long-range dependencies by jointly learning spatial information in the spatial domain and low-frequency structural features. This strengthens the global structural awareness of low-frequency features, enhancing their ability to capture large-scale changes. Finally, a segmentation head is employed to aggregate the features and generate the final CD results for bi-temporal images.

The contributions of our work can be summarized as follows:

An MHI module is designed to enhance high-frequency components by applying spatial and multi-scale enhancements, thereby strengthening fine-grained texture variations. This process transforms these variations into more structured and distinguishable feature cues, improving the model’s ability to detect subtle changes.
A PLE module is designed to explicitly restore spatial positional structures, enhancing the structural integrity and consistency of low-frequency features. This enables the model to generate low-frequency change representations that are more sensitive to large-scale changes, thereby improving its ability to discriminate extensive area changes.
We introduce SDA-Encoding, a framework that extracts multi-scale features and effectively harnesses information from both the spatial and frequency domains. The framework then aligns these features using multi-scale axial convolutions and dual cross-attention, ultimately generating a CD map.

2. Related Works

2.1. CNN-Based CD Methods

CNNs have become the dominant approach in RSCD, significantly outperforming traditional methods that rely on hand-crafted features. Among these, encoder–decoder architectures are especially popular due to their effectiveness in generating pixel-wise change maps. For instance, FCN [45] pioneered the use of fully convolutional networks (FCNs) for optical image CD. UNet++ [46] improves edge accuracy by incorporating nested skip connections. Beyond end-to-end approaches, early works like FC-Siam-diff [47] generated different images using Siamese networks but struggled with spectral noise. Recent advancements leverage attention mechanisms to enhance the reliability of the generated images. For example, DMINet [48] integrates self-attention and cross-attention mechanisms to effectively capture spatiotemporal relationships across different feature levels. SEIFNet [49] incorporates CA and CBAM [50] modules to capture global contextual features while effectively emphasizing changed objects. SFCF-Net [51] introduces a novel approach for change detection in high-resolution drone images, effectively addressing noise interference and semantic inconsistency by incorporating multiple specialized modules.

2.2. Transformer-Based Methods

The remarkable success of transformer models in computer vision tasks has garnered significant attention from the RSCD community, sparking investigations into their potential applications in CD tasks. For instance, ChangFormer [52] leverages pure Transformer blocks to capture pixel-wise long-range dependencies between bi-temporal images through temporal self-attention, effectively overcoming the limitations of local receptive fields inherent in CNNs. STADE-CDNet [53] combines Transformer and LSTM architectures, with the Transformer encoder extracting global features and the LSTM specifically learning inter-phase state transitions, resulting in computationally efficient representations. BIT [38] is a hybrid architecture and utilizes ResNet18 [54] to capture local details, while the Transformer’s cross-attention mechanism is employed to align global context effectively. It is a dual-branch transformer framework [55] that tackles the challenges of ground and remote sensing image matching by incorporating panoramic sky perception, geometric projection transformation, and gradient-aware weighting modules. This approach significantly enhances geo-location accuracy across varying viewpoints.

2.3. Frequency-Based Remote Sensing Image Processing

In recent years, frequency-domain-based methods for remote sensing image processing have gained considerable attention, demonstrating significant advantages in tasks such as image fusion, scene classification, and dehazing [56,57,58,59].

In the field of CD, frequency-domain-related methods have also emerged continuously. FTAN [60] improves the model’s sensitivity to change regions by fusing multi-scale spatial and frequency features. FDTNet [61] enhances the ability to extract both global and local information in complex scenes by integrating frequency-domain processing capabilities with a Transformer architecture. CDC2F [62] achieves continuous-scale change identification through cross-domain structures. Additionally, DDLNet [63] has achieved good results in change detection tasks by combining frequency-domain enhancement modules with spatial-domain recovery modules. RACDNet [64] addresses the resolution disparity issue and improves the change detection accuracy of high-resolution remote sensing images by introducing a gradient-enhanced amplification module and a dual-domain discrepancy learning module.

Compared to existing change detection methods that rely solely on spatial domains or simple frequency fusion, SDA-Encoding introduces a decoupled and aligned frequency enhancement strategy. While methods like FTAN and FDTNet incorporate frequency features, they do not specifically optimize the distinct roles of high- and low-frequency components. SDA-Encoding addresses this gap through two key designs. The multi-scale high-frequency interaction (MHI) module enhances sensitivity to local textures via multi-scale convolutions, while also capturing long-range edge dependencies through dual-domain alignment fusion (DAF). This overcomes the limitations of traditional convolutions in modeling global edge information. Meanwhile, the position-aware low-frequency enhancement (PLE) module encodes spatial position using coordinate attention and reinforces global structural awareness via selective fusion attention (SFA). This effectively mitigates interference from illumination changes and shadows in low-frequency representations. By shifting frequency features from “passive fusion” to “active decoupling and alignment,” SDA-Encoding captures subtle texture changes while robustly identifying large-scale regions, offering enhanced robustness and discriminability in CD tasks.

3. Methods

3.1. Overview

In this paper, we propose spatial-frequency decoupling alignment encoding (SDA-Encoding) for RSCD. As shown in Figure 2, we adopted the Mix Transformer (MiT) [65] to extract bi-temporal features. At each stage, the feature maps are downsampled to half the resolution of the previous stage. Given temporally paired images

{T_{1}, T_{2} \in R^{3 \times H \times W}}

, the output bi-temporal features at the four stages can be represented as

{x_{i}^{t_{1}}, x_{i}^{t_{2}} \in R^{C_{i} \times \frac{H}{2^{i + 1}} \times \frac{W}{2^{i + 1}}}}_{i = 1}^{4}

. Subsequently, we concatenate the features and apply convolutional operations to generate the temporal difference features

{d_{i} \in R^{2 C_{i} \times \frac{H}{2^{i + 1}} \times \frac{W}{2^{i + 1}}}}_{i = 1}^{4}

. This process can be formulated as follows:

d_{i} = cat (x_{i}^{t_{1}}, x_{i}^{t_{2}}),

(1)

where

cat (\cdot)

represents the feature concatenation operation.

The feature extractor in the first stage generates four multi-scale features, denoted as

{d_{i} \in R^{2 C_{i} \times \frac{H}{2^{i + 1}} \times \frac{W}{2^{i + 1}}}}_{i = 1}^{4}

. Then,

d_{i = 2}^{4}

are resized to the same size as

d_{1}

, and all features are subsequently merged. This procedure can be formulated as:

X = δ_{1 \times 1} (cat (d_{1}, φ (d_{2}), φ (d_{3}), φ (d_{4}))),

(2)

where

δ_{k \times k}

indicates a convolution with a kernel size of

k \times k

,

φ (\cdot)

denotes the interpolation operation, and

X \in R^{C \times \frac{H}{4} \times \frac{W}{4}}

represents the merged feature of

d_{i = 1}^{4}

with a unified channel dimension

C = 64

.

Then, the initial feature tensor X undergoes dual-domain decoupling. Specifically, X is decoupled into high-frequency and low-frequency features through wavelet transform frequency decoupling (WTFD). We employ the Haar wavelet and performed a two-level wavelet decomposition. The low-frequency components were used to capture the global variations of the image, while the high-frequency detail coefficients were employed to detect local changes. To minimize boundary artifacts, we used symmetric extension during the wavelet transform. During the spatial mapping process, X generates locally enhanced features via the SPP and globally enhanced features via the CA. This procedure can be formulated as:

\{\begin{matrix} F_{l}, F_{h} = f_{w} (X), \\ F_{g} = f_{g} (X), \\ F_{m} = f_{m} (X), \end{matrix}

(3)

where

f_{w} (\cdot)

,

f_{g} (\cdot)

, and

f_{m} (\cdot)

represent the feature mapping operations for WTFD, CA, and SPP, respectively.

F_{l}, F_{h}, F_{g}, F_{m} \in R^{C \times \frac{H}{4} \times \frac{W}{4}}

represent the low-frequency features, high-frequency features, globally enhanced features, and locally enhanced features, respectively.

Spatial-domain and frequency-domain features are first categorized into two distinct groups. These features are then passed through the DAF and SFA modules, which address the semantic disparity between the two types for effective feature alignment. Finally, the aligned features, along with the output from the initial layer of the feature extraction stage, are fused. The combined features are subsequently processed by the segmentation head to generate the final CD map.

\{\begin{matrix} F_{o u t}^{1} = f_{d a f} (F_{m}, F_{h}), \\ F_{o u t}^{2} = f_{s f a} (F_{g}, F_{l}), \end{matrix}

(4)

F_{o u t} = δ_{1 \times 1} (cat (F_{o u t}^{1}, F_{o u t}^{2}, φ (d_{1}), X),

(5)

where

f_{d a f} (F_{m}, F_{h})

represents the DAF module, which aligns the local spatial features

F_{m}

and the high frequency-domain features

F_{h}

;

f_{s f a} (F_{g}, F_{l})

represents the SFA module used to fuse the global spatial features

F_{g}

and the low frequency-domain features

F_{l}

; and

F_{o u t}^{1}, F_{o u t}^{2} \in R^{C \times \frac{H}{4} \times \frac{W}{4}}

are the outputs of the DAF and SFA modules.

F_{o u t} \in R^{C \times \frac{H}{4} \times \frac{W}{4}}

denotes the final mixed feature.

3.2. MiT-Based Bi-Temporal Fusion Stage

Due to significant spatial misalignment, radiometric differences, and scale variations between bi-temporal images, simply stacking the two frames makes it challenging for the model to capture consistent change cues. To address this, we introduce MiT, a method that naturally enables cross-window interaction and convolutional local sensitivity. This hybrid attention mechanism effectively captures cross-scale dependencies while preserving the local structure. Additionally, the grouped mixing architecture facilitates the simultaneous optimization of both global and local feature representations. It effectively extracts rich semantic context while explicitly enhancing the model’s sensitivity to high-frequency spatial patterns, which are crucial for detecting changes. The process is as follows:

[x_{i}^{t_{1}}, x_{i}^{t_{2}}] = MiT ([T_{1}, T_{2}]) .

(6)

Subsequently, we fuse the multi-layer features

x_{i}^{t_{1}}, x_{i}^{t_{2}}

into a spatiotemporally aligned joint representation, providing a unified base feature space for the subsequent frequency branches. This fusion effectively mitigates the unchanged noise arising from cross-temporal differences and accentuates the saliency of true changes. The process is as follows:

d_{i} = cat (x_{i}^{t_{1}}, x_{i}^{t_{2}}),

(7)

where

i = 1, 2, 3, 4

denotes the feature level and

cat (\cdot)

represents the feature concatenation operation. The early fusion constructed by MiT provides a bi-temporal feature space with cross-scale consistency, laying the foundation for spatial-frequency decomposition.

3.3. Multi-Scale High-Frequency Interaction Module

Although the deep features extracted by the backbone network are rich in semantic information, their output represents a mix of various frequency components. The high-frequency details essential for change detection are often overshadowed by the dominant low-frequency semantic context. To effectively capture changes, it is necessary to explicitly isolate and enhance these high-frequency signals. As shown in Figure 2, we propose the multi-scale high-frequency interaction (MHI) module, which introduces an SPP unit to capture multi-scale local contexts, thereby ensuring sensitivity to subtle spatial perturbations. Meanwhile, the required high-frequency features are obtained using wavelet transform frequency decoupling (WTFD). Subsequently, a dual-domain alignment fusion (DAF) module is employed to model the long-range dependencies of high-frequency change patterns along the spatial dimension, enabling collaborative detection of continuous and structured change regions.

As shown in Figure 3a, to compensate for the limitation of axial attention in local information modeling, the input local-enhanced features and high-frequency features are first processed by the multi-scale axis convolution. This operation introduces strip-shaped convolution kernels with different kernel sizes multiple times along each axial attention path. Such a design effectively enhances the efficiency of the DAF module in encoding local information. The process is as follows:

\begin{matrix} Q, K, V = δ_{1 \times 1} (A C_{3} (L N (F_{i n})) + A C_{5} (L N (F_{i n})) + A C_{7} (L N (F_{i n}))), \end{matrix}

(8)

where

A C_{k}

represents the axis convolution with kernel sizes of

1 \times k

and

k \times 1

and

L N (\cdot)

represents layer normalization.

F_{i n}

represents the input features.

Q, K, V

are the matrices obtained after the multi-scale axis convolution operation.

As illustrated in Figure 3b, the

F_{m}

and

F_{h}

each produce three matrices, denoted as (

Q_{1}

,

K_{1}

,

V_{1}

) and (

Q_{2}

,

K_{2}

,

V_{2}

). To address the issues of insufficient positional bias learning and the limited long-range interactions in axial attention, a cross-axis attention mechanism is introduced. This mechanism computes attention by querying corresponding elements along with their respective key–value pairs. The computational steps involved in the cross-axis attention process are outlined as follows:

\{\begin{matrix} F_{1} = δ_{1 \times 1} (A t t n (Q_{2}, K_{1}, V_{1})), \\ F_{2} = δ_{1 \times 1} (A t t n (Q_{1}, K_{2}, V_{2})), \end{matrix}

(9)

the attention formula is:

A t t n (Q, K, V) = S o f t m a x (\frac{Q K^{T}}{\sqrt{d}}) V,

(10)

where

\frac{1}{\sqrt{d}}

denotes scaling factor.

F_{o u t}^{1} = F_{1} + F_{2},

(11)

where

F_{1}

and

F_{2}

are the outputs of attention and

F_{o u t}^{1}

is the output of the DAF module.

3.4. Position-Aware Low-Frequency Enhancement Module

Modern visual backbone networks, such as MiT, tend to weaken spatial location information when extracting high-level semantics through deep Transformer blocks. While their self-attention mechanisms effectively capture long-range dependencies, they rely less on absolute and relative position encoding. This leads to a blurred spatial layout for the output features, which hinders pixel-level accurate localization of change regions. To address this, we propose a position-aware low-frequency enhancement (PLE) module that explicitly compensates for and enhances the spatial and structural information of low-frequency features. As shown in Figure 2, CA decouples position information into 1D attention that is highly sensitive to the vertical direction, restoring the spatial layout expression of low-frequency features. Meanwhile, the required low-frequency features are obtained using wavelet transform frequency decoupling (WTFD). Then, the selective fusion attention (SFA) module provides a spatial-frequency selection mechanism to adaptively integrate spatial information and low-frequency structural information. Finally, a low-frequency change description that is most sensitive to large-area change regions is formed, significantly enhancing the representation and discrimination ability for low-frequency change signals.

As shown in Figure 4, the spatial branch first applies channel-wise max pooling

P_{\max} (\cdot)

and average pooling

P_{avg} (\cdot)

operations along the channel dimension to focus on the most salient and dominant features. To ensure information interaction between these two spatial descriptors, the pooled features are initially concatenated. Subsequently, a simple

1 \times 1

convolutional layer is employed to map these two pooled features into two spatial attention maps. The above process can be expressed as follows:

\bar{S} = δ_{1 \times 1} (cat (P_{\max} (F_{g}), P_{avg} (F_{g})),

(12)

where the spatial attention feature maps

\bar{S} \in R^{2 \times \frac{H}{4} \times \frac{W}{4}}

. Next, the sigmoid activation function

σ (\cdot)

is applied to obtain a spatial selection mask S:

S = σ (\bar{S}) .

(13)

In frequency domain analysis, signals are decomposed into different frequency components. Low-frequency features typically represent smooth, structured regions in an image, capturing the overall contours and background. The variations in these regions are small and often repeat across multiple local areas. As a result, low-frequency features tend to have redundant responses in different parts of the image. This redundancy does not increase the effective information in the image. Instead, it can cause overfitting, increase computational burden, and reduce the model’s performance.

The frequency-domain selection mechanism addresses this by evaluating the importance of each frequency component. It suppresses redundant low-frequency information and focuses on more discriminative high-frequency features, such as edges and fine details. These high-frequency features contain more local variations and valuable information, helping the model make more accurate decisions. Specifically, global average pooling

P_{GAP} (\cdot)

is first applied to compress the feature

F_{l}

, producing a frequency-aware feature descriptor. Subsequently, a fully connected layer

F_{FC} (\cdot)

generates a more compact frequency-domain attention feature

Z \in R^{C \times 1}

. This process can be formulated as:

Z = F_{FC} (P_{GAP} (F_{l})) .

(14)

Then, we apply a sigmoid operation to the low-frequency attention feature Z to generate frequency-domain attention weights. These weights adaptively modulate the low-frequency channels, producing a frequency selection mask

C \in R^{C \times 1}

. This process suppresses redundant low-frequency responses while emphasizing more discriminative low-frequency structural components. The process is outlined as follows:

C = σ (Z) .

(15)

Subsequently, the corresponding spatial–frequency selection masks are multiplied to obtain the spatial–frequency selection weight:

W = S \times C .

(16)

Finally, the global feature

F_{g}

and low-frequency feature

F_{l}

are weighted by the spatial-frequency selection weight w. The final output is generated after summation:

F_{o u t}^{2} = F_{g} \times W + F_{l} \times W .

(17)

3.5. Loss Function

The loss function

L

is a combination of the dice coefficient (Dice) loss

L_{Dice}

[66] and the binary cross-entropy (BCE) loss

L_{BCE}

[67]. The formula can be written as follows:

L_{BCE} (p, g) = g log (p) + (1 - g) log (1 - p),

(18)

L_{Dice} (p, g) = 1 - \frac{2 g p}{∥ g ∥ + ∥ p ∥},

(19)

L (p, g) = L_{BCE} (p, g) + L_{Dice} (p, g),

(20)

where g and p represent the ground-truth labels and the corresponding predicted change maps, respectively. The pixel values in g are either 0 or 1, while the pixel values in p range from 0 to 1, indicating the probability of being predicted as changed.

4. Experiment

4.1. Datasets

4.1.1. WHU-CD Dataset

The WHU-CD [68] dataset is a building CD dataset consisting of a pair of high-resolution aerial images, each with a resolution of 0.2 m and pixel dimensions of 32,507 × 15,354. The dataset covers a city in New Zealand and focuses on detecting changes in buildings. The complex building structures make the precise detection of building boundaries challenging. We divided the images into non-overlapping patches of size 256 × 256 and split them into 5947 training samples, 743 validation samples, and 744 testing samples.

4.1.2. LEVIR-CD Dataset

LEVIR-CD [69] provides 637 paired high-resolution images (1024 × 1024 pixels, 0.5 m resolution) collected from 20 distinct urban areas in Texas, USA. It is widely used for detecting notable changes in land use, particularly related to building activity. The dataset includes a broad spectrum of building categories, from villas and high-rise apartments to smaller structures and large storage facilities. The LEVIR-CD dataset also contains a significant amount of seasonal and lighting changes, which make CD more challenging. We divided the images into non-overlapping patches of size 256 × 256 and split them into 7120 training samples, 1024 validation samples, and 2048 testing samples.

4.1.3. SYSU-CD Dataset

The SYSU-CD [70] dataset includes land cover change data from Hong Kong spanning from 2007 to 2014. It comprises 20,000 pairs of VHR RS images. The SYSU-CD dataset covers a variety of complex change types such as buildings, urban roads, ports, and ships, as well as mountainous vegetation, with significant interference from pseudo-changes. We divided the images into non-overlapping patches of size 256 × 256 and split them into 12,000 training, 4000 validation, and 4000 testing samples.

4.2. Implementation Details

In the experiments, the network was trained with the PyTorch 1.7.1 framework on two NVIDIA RTX 2080Ti GPUs. The AdamW optimizer was used with the following settings: a batch size of 16, an initial learning rate of 0.0001, momentum of 0.99, epochs of 200, and weight decay of 0.01. A linear learning rate decay strategy was implemented. To ensure the reproducibility of the experiments, we fixed the random seed to 42 in all experiments and will set this in the code. Data augmentation was applied to all experimental data during the training phase to improve robustness and generalization. The techniques included flipping, cropping, and Gaussian blurring.

4.3. Evaluation Metrics

To evaluate the performance of CD methods quantitatively, we use five widely adopted metrics: precision (Pre), recall (Rec), F1-score (F1), intersection over union (IoU), and overall accuracy (OA). The definitions of these metrics are provided below:

Pre = TP / (TP + FP),

(21)

Rec = TP / (TP + FN),

(22)

F 1 = 2 \times Rec \times Pre / (Rec + Pre),

(23)

IoU = TP / (TP + FP + FN),

(24)

OA = (TP + TN) / (TP + TN + FP + FN),

(25)

where TP, TN, FP, and FN denote the number of true positives, true negatives, false positives, and false negatives, respectively.

4.4. Compared Methods

To validate the effectiveness of SDA-Encoding in RSCD tasks, this section selects eight state-of-the-art models as competitors. These models include three methods based on FCN: FC-EF [47], FC-Siam-Diff [47], and FC-Siam-Conc [47]; four models based on attention mechanisms: DMINet [48], SEIFNet [49], SChanger [71], and CASP [72]; and three models based on Transformers: BIT [38], ChangFormer [52], and STADE-CDNet [53]. Each method was retrained under identical experimental conditions, strictly adhering to the official implementations provided by the respective authors.

4.5. Results and Analysis

4.5.1. Results on the WHU-CD Dataset

Table 1 presents the results of generalization experiments conducted on the WHU-CD dataset. Among all compared methods, SDA-Encoding achieves the best overall performance with an IoU of 89.87% and F1-score of 94.67%, outperforming other state-of-the-art approaches. Specifically, SDA-Encoding demonstrates a balanced trade-off between a precision of 96.69% and a recall of 92.73%, indicating its superior capability in accurately identifying change regions while maintaining low false detection rates.

As shown in Figure 5a, although significant changes are visually apparent, most existing methods fail to detect them accurately. This is primarily because, from a semantic perspective, these methods do not identify such areas as building changes. While some approaches can detect the changes, they still experience considerable false positives and missed detections along object boundaries. In contrast, our method, which incorporates frequency-domain learning, effectively captures the structural and edge features of the changed objects, resulting in more precise detection. As shown in Figure 5b, although the FC-Siam-Conc method detected the change in the upper-left corner, it was accompanied by a significant number of false positives. In contrast, other methods failed to identify this change. This suggests that our method struggles to detect changes in buildings with high ambiguity, highlighting a key limitation of our approach. As shown in Figure 5c,d, when the background is complex or when the foreground and background share similar colors, only our method successfully detects the changed objects. This demonstrates that the joint learning of low-frequency structural features and high-frequency edge features enhances the model’s robustness in complex change detection scenarios.

4.5.2. Results on the LEVIR-CD Dataset

Table 2 reports the generalization results on the LEVIR-CD dataset. As shown, the proposed SDA-Encoding achieves the best results across all evaluation metrics. Compared with previous methods such as ChangeFormer and DMINet, SDA-Encoding significantly improves IoU and F1 scores, indicating its superior capability in accurately capturing both semantic consistency and fine-grained structural changes.

As shown in Figure 6a,c,d, when dealing with blurred foreground–background boundaries and small-scale change targets, only our method can fully and accurately detect the change regions. This suggests that our approach not only preserves sufficient spatial information but also integrates frequency-domain features, enhancing the detection of fine-grained changes. Furthermore, as shown in Figure 6b, when pseudo-changes such as color variations occur, our method effectively suppresses disturbances caused by color and seasonal changes in the spatial domain by leveraging frequency-domain structural information. This enables the model to capture more precise semantic representations of the changed objects.

4.5.3. Results on the SYSU-CD Dataset

Table 3 summarizes the generalization results on the SYSU-CD dataset. As shown, SDA-Encoding achieves the best overall performance among all compared methods, reaching an IoU of 71.57% and an F1-score of 83.43%. Compared to other strong baselines such as DMINet, SChanger, and CASP, SDA-Encoding achieves notable improvements in both precision (which was 87.46%) and IoU, demonstrating its robustness and superior generalization capability when transferring to a new domain with complex urban and rural scene variations. Although STADE-CDNet exhibits a slightly higher recall of 83.11%, its overall accuracy and precision are significantly lower, indicating a tendency toward over-detection. These results confirm that the multi-scale dual-attention mechanism in SDA-Encoding effectively enhances the model’s ability to distinguish subtle and diverse change patterns under challenging domain shifts.

As shown in Figure 7, the two previously discussed datasets are limited to binary change detection (CD) tasks, whereas the SYSU-CD dataset involves multi-class change scenarios and introduces additional irrelevant background variations. Experimental results reveal that existing methods tend to generate a large number of irrelevant or non-target change predictions. In contrast, our proposed SDA-Encoding significantly outperforms these methods by effectively suppressing such spurious changes. This improvement is largely attributed to the integration of spatial and frequency-domain features. However, it is important to note that our method encounters challenges when handling datasets with multiple types of changes. Despite the changes being clearly visible, the method struggles to accurately segment the changed objects. This suggests that our approach has not yet fully captured the semantic information related to different change types, leading to confusion and limitations in performance.

4.6. Ablation Studies

As shown in Table 4, ablation studies were conducted on the LEVIR-CD dataset to validate the impact of different components and configurations of SDA-Encoding. These experiments aim to assess the significance of individual components by systematically removing them and observing the resulting performance changes.

Table 4 presents the results of the ablation study conducted on the LEVIR-CD dataset, evaluating the contribution of each proposed component, including the MHI, PLE, DAF, and SFA modules. When two modules are combined, the performance improves further, highlighting their complementary roles in enhancing feature representation and change localization. Ultimately, the integration of all four modules yields the best results, with an IoU of 85.29% and an F1 score of 92.06%. This demonstrates that the MHI, PLE, DAF, and SFA modules work together to boost both global semantic understanding and fine-grained structural perception, resulting in more accurate and robust change detection.

Table 5 presents the experimental results of various backbone networks used as feature extractors on the LEVIR-CD dataset. The performance varies across different backbones. Notably, ViT [36], Swin-T [37], MobileViT [73], and FastViT [74] show slightly lower performance. This suggests that MiT better captures multi-scale features and global contextual information, leading to more accurate identification and delineation of change regions.

Figure 8 illustrates the effects of adding each individual component to the baseline model. Overall, the results demonstrate that each component proposed in our study has a positive impact on the model’s performance.

Figure 9 illustrates the changes in F1 scores over training epochs for three different SDA-Encoding methods during both training and validation. It shows that the F1 scores for all methods rise rapidly in the early training stages and then gradually converge. Overall, SDA-Encoding and SDA-Encoding w/o PLE exhibit similar, stable performance, while SDA-Encoding w/o MHI also maintains strong and steady performance in both training and validation. This suggests that the MHI and PLE modules both influence model performance, but all methods steadily improve as training progresses.

4.7. Comparison of Complexity

Table 6 compares the model complexity across different methods. Our proposed SDA-Encoding model has 45.57 M parameters, 44.80 G FLOPs, and 90.51 samples per second.

Compared to lightweight approaches, our model is more complex. This is mainly due to the dual-branch encoder, which is designed to capture multi-scale spatiotemporal variations. Despite its higher complexity, our model achieves high accuracy in change detection tasks. It performs especially well in detecting subtle texture differences and complex lighting changes. Remote sensing image change detection often requires capturing fine local textures and long-range edge information. More complex network architectures can improve the recognition accuracy of these details. Although our model has more parameters than other methods, its superior accuracy makes it highly valuable for high-precision remote sensing change detection. This is particularly true for detecting changes in complex images or extreme conditions. However, in real-world applications, such as real-time remote sensing monitoring with drones or satellites, computational resources may be limited. Reducing computational cost while maintaining high accuracy is an ongoing challenge.

Therefore, our follow-up work will focus on optimizing the model architecture. This includes exploring more efficient Transformer structures or convolutional operators to replace the existing MiT encoder. Additionally, we aim to compress the model size via pruning and knowledge distillation without significantly sacrificing performance. We will also investigate sparse activation strategies for specific tasks, enabling high-frequency feature interactions only in key regions.

5. Conclusions

In this paper, we proposed a spatial-frequency decoupling alignment encoding (SDA-Encoding) for change detection in bi-temporal remote sensing imagery. The proposed framework effectively integrates spatial-domain and frequency-domain representations to capture both fine-grained texture variations and large-scale structural changes. Specifically, a Transformer-based bi-temporal fusion stage constructs cross-scale consistent spatial features, serving as the foundation for spatial–frequency decoupling. The multi-scale high-frequency interaction (MHI) module enhances local texture sensitivity through spatial pyramid pooling (SPP) and employs a dual-domain alignment fusion (DAF) module to model spatial and cross-scale dependencies. Meanwhile, the position-aware low-frequency enhancement (PLE) module incorporates coordinate attention to explicitly encode positional information and utilizes a selective fusion attention (SFA) module to strengthen global structural awareness. Finally, a segmentation head aggregates multi-scale features to generate precise change detection maps. Extensive experiments on multiple benchmark datasets demonstrate that SDA-Encoding achieves superior performance compared with existing state-of-the-art methods. The results confirm the effectiveness of the proposed spatial–frequency decoupling strategy and the dual-branch architecture, which enhance both detail preservation and structural consistency.

In future work, we plan to explore weakly supervised learning strategies to reduce the dependence on fine-grained annotations and further investigate the potential of frequency-domain features in RSCD. Specifically, we aim to further investigate the potential of frequency-domain features in RSCD. In weakly supervised settings, pseudo-labels are often used for training. Since frequency-domain features are better at distinguishing real structural changes from low-frequency noise (such as lighting and shadows), pseudo-labels generated from these features are likely to be cleaner and more reliable than those based solely on spatial-domain features. This could improve the performance limits of weakly supervised learning. Additionally, we plan to streamline the network architecture to maintain performance while enhancing the model’s practicality and deployment efficiency.

Author Contributions

Conceptualization, X.Z.; methodology, X.Z.; validation, X.Z.; resources, X.Z. and Y.D.; writing—original draft preparation, W.Z.; writing—review and editing, X.Z.; supervision, K.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the Seventh Batch of Science and Technology Development Plan (Agriculture) Project of Suzhou (SNG2023007) and Jiangsu Higher Vocational College Teachers’ Visiting Study Program (2024GRFX053).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Bolorinos, J.; Ajami, N.K.; Rajagopal, R. Consumption Change Detection for Urban Planning: Monitoring and Segmenting Water Customers During Drought. Water Resour. Res. 2020, 56, e2019WR025812. [Google Scholar] [CrossRef]
Du, P.; Liu, S.; Gamba, P.; Tan, K.; Xia, J. Fusion of Difference Images for Change Detection Over Urban Areas. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2012, 5, 1076–1086. [Google Scholar] [CrossRef]
Liu, X.; Lathrop, R.G. Urban change detection based on an artificial neural network. Int. J. Remote Sens. 2002, 23, 2513–2518. [Google Scholar] [CrossRef]
Qin, D.; Zhou, X.; Zhou, W.; Huang, G.; Ren, Y.; Horan, B.; He, J.; Kito, N. MSIM: A change detection framework for damage assessment in natural disasters. Expert Syst. Appl. 2018, 97, 372–383. [Google Scholar] [CrossRef]
Michel, U.; Thunig, H.; Ehlers, M.; Reinartz, P. Rapid Change Detection Algorithm for Disaster Management. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2012, I-4, 107–111. [Google Scholar] [CrossRef]
Zheng, Z.; Zhong, Y.; Wang, J.; Ma, A.; Zhang, L. Building damage assessment for rapid disaster response with a deep object-based semantic change detection framework: From natural disasters to man-made disasters. Remote Sens. Environ. 2021, 265, 112636. [Google Scholar] [CrossRef]
Tarimo, B.; Mtalo, E.; Liwa, E. Land Use Change Detection and Impact Assessment on an Agricultural Area. J. Sustain. Dev. 2013, 6, 55. [Google Scholar] [CrossRef][Green Version]
Prishchepov, A.V.; Radeloff, V.C.; Dubinin, M.; Alcantara, C. The effect of Landsat ETM/ETM + image acquisition dates on the detection of agricultural land abandonment in Eastern Europe. Remote Sens. Environ. 2012, 126, 195–209. [Google Scholar] [CrossRef]
Malinverni, E.S.; Rinaldi, M.; Ruggieri, S. Agricultural crop change detection by means of hybrid classification and high resolution images. EARSeL eProc. 2012, 11, 132. [Google Scholar]
Singh, A. Change detection in the tropical forest environment of northeastern India using Landsat. Remote Sens. Trop. Land Manag. 1986, 44, 254–273. [Google Scholar]
Todd, W.J. Urban and regional land use change detected by using Landsat data. J. Res. US Geol. Surv. 1977, 5, 529–534. [Google Scholar]
Dai, X.; Khorram, S. Quantification of the impact of misregistration on the accuracy of remotely sensed change detection. In Proceedings of the 1997 IEEE International Geoscience and Remote Sensing Symposium Proceedings. Remote Sensing—A Scientific Vision for Sustainable Development, Singapore, 3–8 August 1997; Volume 4, pp. 1763–1765. [Google Scholar]
Celik, T. Unsupervised Change Detection in Satellite Images Using Principal Component Analysis and k-Means Clustering. IEEE Geosci. Remote Sens. Lett. 2009, 6, 772–776. [Google Scholar] [CrossRef]
Saha, S.; Bovolo, F.; Bruzzone, L. Unsupervised Deep Change Vector Analysis for Multiple-Change Detection in VHR Images. IEEE Trans. Geosci. Remote Sens. 2019, 57, 3677–3693. [Google Scholar] [CrossRef]
Han, T.; Wulder, M.A.; White, J.C.; Coops, N.C.; Alvarez, M.F.; Butson, C. An Efficient Protocol to Process Landsat Images for Change Detection with Tasselled Cap Transformation. IEEE Geosci. Remote Sens. Lett. 2007, 4, 147–151. [Google Scholar] [CrossRef]
Negri, R.G.; Frery, A.C.; Casaca, W.; Azevedo, S.; Dias, M.A.; Silva, E.A.; Alcantara, E.H. Spectral–Spatial-Aware Unsupervised Change Detection With Stochastic Distances and Support Vector Machines. IEEE Trans. Geosci. Remote Sens. 2021, 59, 2863–2876. [Google Scholar] [CrossRef]
Sun, Y.; Lei, L.; Guan, D.; Kuang, G. Iterative Robust Graph for Unsupervised Change Detection of Heterogeneous Remote Sensing Images. IEEE Trans. Image Process. 2021, 30, 6277–6291. [Google Scholar] [CrossRef]
Seo, D.K.; Kim, Y.H.; Eo, Y.D.; Lee, M.H.; Park, W.Y. Fusion of SAR and Multispectral Images Using Random Forest Regression for Change Detection. ISPRS Int. J. Geo-Inf. 2018, 7, 401. [Google Scholar] [CrossRef]
Hu, S.; Bian, Y.; Chen, B.; Song, H.; Zhang, K. Language-Guided Semantic Clustering for Remote Sensing Change Detection. Sensors 2024, 24, 7887. [Google Scholar] [CrossRef]
Wei, G.; Shi, B.; Wang, C.; Wang, J.; Zhu, X. CINet: A Constraint-and Interaction-Based Network for Remote Sensing Change Detection. Sensors 2024, 25, 103. [Google Scholar] [CrossRef] [PubMed]
Li, X.; Li, D.; Fang, J.; Feng, X. DynaNet: A Dynamic Feature Extraction and Multi-Path Attention Fusion Network for Change Detection. Sensors 2025, 25, 5832. [Google Scholar] [CrossRef]
Zheng, X.; Lin, X.; Qing, L.; Ou, X. Semantic-Aware Remote Sensing Change Detection with Multi-Scale Cross-Attention. Sensors 2025, 25, 2813. [Google Scholar] [CrossRef]
Liu, K.; Xue, H.; Huang, C.; Huo, J.; Chen, G. CGLCS-Net: Addressing Multi-Temporal and Multi-Angle Challenges in Remote Sensing Change Detection. Sensors 2025, 25, 2836. [Google Scholar] [CrossRef]
Xiong, J.; Liu, F.; Wang, X.; Yang, C. Siamese transformer-based building change detection in remote sensing images. Sensors 2024, 24, 1268. [Google Scholar] [CrossRef]
Wu, Z.; Ma, X.; Zheng, K.; Lian, R.; Chen, Y.; Huang, Z.; Zhang, W.; Song, S. CD-lamba: Boosting remote sensing change detection via a cross-temporal locally adaptive state space model. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2026, 19, 4028–4044. [Google Scholar] [CrossRef]
Guan, F.; Zhao, N.; Fang, Z.; Jiang, L.; Zhang, J.; Yu, Y.; Huang, H. Multi-level representation learning via ConvNeXt-based network for unaligned cross-view matching. Geo-Spat. Inf. Sci. 2025, 28, 2344–2357. [Google Scholar] [CrossRef]
Samadi, F.; Akbarizadeh, G.; Kaabi, H. Change detection in SAR images using deep belief network: A new training approach based on morphological images. IET Image Process. 2019, 13, 2255–2264. [Google Scholar] [CrossRef]
Liu, G.; Li, L.; Jiao, L.; Dong, Y.; Li, X. Stacked Fisher autoencoder for SAR change detection. Pattern Recognit. 2019, 96, 106971. [Google Scholar] [CrossRef]
Feng, Y.; Zheng, J.; Qin, M.; Bai, C.; Zhang, J. 3D Octave and 2D Vanilla Mixed Convolutional Neural Network for Hyperspectral Image Classification with Limited Samples. Remote Sens. 2021, 13, 4407. [Google Scholar] [CrossRef]
Venugopal, N. Automatic semantic segmentation with DeepLab dilated learning network for change detection in remote sensing images. Neural Process. Lett. 2020, 51, 2355–2377. [Google Scholar] [CrossRef]
Wang, X.; Du, J.; Tan, K.; Ding, J.; Liu, Z.; Pan, C.; Han, B. A high-resolution feature difference attention network for the application of building change detection. Int. J. Appl. Earth Obs. Geoinf. 2022, 112, 102950. [Google Scholar] [CrossRef]
Cheng, H.; Wu, H.; Zheng, J.; Qi, K.; Liu, W. A hierarchical self-attention augmented Laplacian pyramid expanding network for change detection in high-resolution remote sensing images. ISPRS J. Photogramm. Remote Sens. 2021, 182, 52–66. [Google Scholar] [CrossRef]
Liu, T.; Gong, M.; Lu, D.; Zhang, Q.; Zheng, H.; Jiang, F.; Zhang, M. Building change detection for VHR remote sensing images via local–global pyramid network and cross-task transfer learning strategy. IEEE Trans. Geosci. Remote Sens. 2021, 60, 4704817. [Google Scholar] [CrossRef]
Ren, H.; Xia, M.; Weng, L.; Hu, K.; Lin, H. Dual-attention-guided multiscale feature aggregation network for remote sensing image change detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 4899–4916. [Google Scholar] [CrossRef]
Song, L.; Xia, M.; Weng, L.; Lin, H.; Qian, M.; Chen, B. Axial cross attention meets CNN: Bibranch fusion network for change detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 16, 21–32. [Google Scholar] [CrossRef]
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. In Proceedings of the 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, 3–7 May 2021. [Google Scholar]
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
Chen, H.; Qi, Z.; Shi, Z. Remote sensing image change detection with transformers. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5607514. [Google Scholar] [CrossRef]
Feng, Y.; Xu, H.; Jiang, J.; Liu, H.; Zheng, J. ICIF-Net: Intra-scale cross-interaction and inter-scale feature fusion network for bitemporal remote sensing images change detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4410213. [Google Scholar] [CrossRef]
Xu, C.; Jia, W.; Wang, R.; Luo, X.; He, X. Morphtext: Deep morphology regularized accurate arbitrary-shape scene text detection. IEEE Trans. Multimed. 2022, 25, 4199–4212. [Google Scholar] [CrossRef]
Xu, C.; Fu, H.; Ma, L.; Jia, W.; Zhang, C.; Xia, F.; Ai, X.; Li, B.; Zhang, W. Seeing Text in the Dark: Algorithm and Benchmark. In Proceedings of the MM’24: 32nd ACM International Conference on Multimedia, New York, NY, USA, 28 October–1 November 2024; pp. 2870–2878. [Google Scholar]
Mao, Z.; Luo, Z.; Tang, Y. Remote sensing building change detection with global high-frequency cues guidance and result-aware alignment. IEEE Geosci. Remote Sens. Lett. 2024, 21, 6005105. [Google Scholar] [CrossRef]
He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef] [PubMed]
Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722. [Google Scholar]
Jaturapitpornchai, R.; Matsuoka, M.; Kanemoto, N.; Kuzuoka, S.; Ito, R.; Nakamura, R. Sar-Image Based Urban Change Detection in Bangkok, Thailand Using Deep Learning. In Proceedings of the IGARSS 2019—2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019; pp. 7403–7406. [Google Scholar]
Peng, D.; Zhang, Y.; Guan, H. End-to-end change detection for high resolution satellite images using improved UNet++. Remote Sens. 2019, 11, 1382. [Google Scholar] [CrossRef]
Caye Daudt, R.; Le Saux, B.; Boulch, A. Fully Convolutional Siamese Networks for Change Detection. In Proceedings of the 2018 25th IEEE International Conference on Image Processing (ICIP), Athens, Greece, 7–10 October 2018; pp. 4063–4067. [Google Scholar]
Feng, Y.; Jiang, J.; Xu, H.; Zheng, J. Change Detection on Remote Sensing Images Using Dual-Branch Multilevel Intertemporal Network. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4401015. [Google Scholar] [CrossRef]
Huang, Y.; Li, X.; Du, Z.; Shen, H. Spatiotemporal Enhancement and Interlevel Fusion Network for Remote Sensing Images Change Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5609414. [Google Scholar] [CrossRef]
Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
Wu, T.; Sun, Z.; Zhao, Z.; Wang, J.; Yu, R.; Ji, J. SFCF-Net: ACross-hierarchical Progressive Feature Integration Network for UAV Image Change Detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2026, 1–18. [Google Scholar] [CrossRef]
Bandara, W.G.C.; Patel, V.M. A Transformer-Based Siamese Network for Change Detection. In Proceedings of the 2022 IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 17–22 July 2022; pp. 207–210. [Google Scholar]
Li, Z.; Cao, S.; Deng, J.; Wu, F.; Wang, R.; Luo, J.; Peng, Z. STADE-CDNet: Spatial–Temporal Attention with Difference Enhancement-Based Network for Remote Sensing Image Change Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5611617. [Google Scholar] [CrossRef]
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
Guan, F.; Zhao, N.; Wang, H.; Fang, Z.; Zhang, J.; Yu, Y.; Jiang, L.; Huang, H. Dual-branch transformer framework with gradient-aware weighting feature alignment for robust cross-view geo-localization. Inf. Fusion 2025, 127, 103808. [Google Scholar] [CrossRef]
Sun, H.; Zhong, Q.; Du, B.; Tu, Z.; Wan, J.; Wang, W.; Ren, D. Bidirectional-modulation frequency-heterogeneous network for remote sensing image dehazing. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 10649–10664. [Google Scholar] [CrossRef]
Zhou, M.; Huang, J.; Yan, K.; Hong, D.; Jia, X.; Chanussot, J.; Li, C. A general spatial-frequency learning framework for multimodal image fusion. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 47, 5281–5298. [Google Scholar] [CrossRef] [PubMed]
Yang, Y.; Jiao, L.; Liu, F.; Liu, X.; Li, L.; Chen, P.; Yang, S. An explainable spatial–frequency multiscale transformer for remote sensing scene classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5907515. [Google Scholar] [CrossRef]
Li, J.; Zhang, S.; Sun, Y.; Han, Q.; Sun, Y.; Wang, Y. Frequency-driven edge guidance network for semantic segmentation of remote sensing images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 9677–9693. [Google Scholar] [CrossRef]
Yu, C.; Li, H.; Hu, Y.; Zhang, Q.; Song, M.; Wang, Y. Frequency-temporal attention network for remote sensing imagery change detection. IEEE Geosci. Remote Sens. Lett. 2024, 21, 5005305. [Google Scholar] [CrossRef]
Liu, Y.; Shi, A. Frequency-driven transformer network for remote sensing image change detection. J. Appl. Remote Sens. 2024, 18, 034523. [Google Scholar] [CrossRef]
Zhang, W.; Guo, W.; Li, Y.; Xie, W. Change detection meets frequency learning: A coarse-to-fine dual-domain detection network. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5618513. [Google Scholar] [CrossRef]
Ma, X.; Yang, J.; Che, R.; Zhang, H.; Zhang, W. Ddlnet: Boosting remote sensing change detection with dual-domain learning. In Proceedings of the 2024 IEEE International Conference on Multimedia and Expo (ICME), Niagara Falls, ON, Canada, 15–19 July 2024; pp. 1–6. [Google Scholar]
Liu, Y.; He, Q.; Li, J.; Liu, X.; Fiorio, P.R.; Nakai, É.S.; Yang, B. Towards resolution-arbitrary remote sensing change detection with spatial-frequency dual domain learning. ISPRS J. Photogramm. Remote Sens. 2026, 231, 137–150. [Google Scholar] [CrossRef]
Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
Milletari, F.; Navab, N.; Ahmadi, S.A. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; pp. 565–571. [Google Scholar]
Zhu, J.; Liao, S.; Yi, D.; Lei, Z.; Li, S.Z. Multi-label cnn based pedestrian attribute learning for soft biometrics. In Proceedings of the 2015 International Conference on Biometrics (ICB), Phuket, Thailand, 19–22 May 2015; pp. 535–540. [Google Scholar]
Ji, S.; Wei, S.; Lu, M. Fully Convolutional Networks for Multisource Building Extraction From an Open Aerial and Satellite Imagery Data Set. IEEE Trans. Geosci. Remote Sens. 2019, 57, 574–586. [Google Scholar] [CrossRef]
Chen, H.; Shi, Z. A spatial-temporal attention-based method and a new dataset for remote sensing image change detection. Remote Sens. 2020, 12, 1662. [Google Scholar] [CrossRef]
Shi, J.; Wu, T.; Qin, A.K.; Lei, Y.; Jeon, G. Semisupervised Adaptive Ladder Network for Remote Sensing Image Change Detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5408220. [Google Scholar] [CrossRef]
Zhou, Z.; Hu, K.; Fang, Y.; Rui, X. SChanger: Change Detection From a Semantic Change and Spatial Consistency Perspective. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 10186–10203. [Google Scholar] [CrossRef]
Wang, Q.; Zhang, M.; Ren, J.; Li, Q. Exploring Context Alignment and Structure Perception for Building Change Detection. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5609910. [Google Scholar] [CrossRef]
Mehta, S.; Rastegari, M. Mobilevit: Light-weight, general-purpose, and mobile-friendly vision transformer. arXiv 2021, arXiv:2110.02178. [Google Scholar]
Vasu, P.K.A.; Gabriel, J.; Zhu, J.; Tuzel, O.; Ranjan, A. Fastvit: A fast hybrid vision transformer using structural reparameterization. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 5785–5795. [Google Scholar]

Figure 1. Understanding images in the frequency domain.

Figure 2. Architecture of the proposed SDA-Encoding.

Figure 3. Architecture of the proposed DAF.

Figure 4. Architecture of the proposed SFA.

Figure 5. Visualization results of different methods on the WHU-CD test set. (a–d) denote prediction results of all the compared methods for different samples, respectively. (1–11) represent FC-EF, FC-Siam-Diff, FC-Siam-Conc, BIT, ChangeFormer, DMINet, SEIFNet, STADE-CDNet, SChanger, CASP, and SDA-Encoding. White denotes TP, black denotes TN, red denotes FP, and green denotes FN.

Figure 6. Visualization results of different methods on the LEVIR-CD test set. (a–d) denote prediction results of all the compared methods for different samples, respectively. (1–11) represent FC-EF, FC-Siam-Diff, FC-Siam-Conc, BIT, ChangeFormer, DMINet, SEIFNet, STADE-CDNet, SChanger, CASP, and SDA-Encoding. White denotes TP, black denotes TN, red denotes FP, and green denotes FN.

Figure 7. Visualization results of different methods on the SYSU-CD test set. (a–d) denote prediction results of all the compared methods for different samples, respectively. (1–11) represent FC-EF, FC-Siam-Diff, FC-Siam-Conc, BIT, ChangeFormer, DMINet, SEIFNet, STADE-CDNet, SChanger, CASP, and SDA-Encoding. White denotes TP, black denotes TN, red denotes FP, and green denotes FN.

Figure 8. The figure shows the locally enlarged segmentation results after removing different components from the SDA-Encoding module. (1) represents SDA-Encoding without the PLE module. (2) represents SDA-Encoding without the MHI module. (3) represents SDA-Encoding without frequency information. (4) represents SDA-Encoding. White denotes TP, black denotes TN, red denotes FP, and green denotes FN.

Figure 9. F1-score curves of SDA-Encoding during the training epochs. (a) Training F1-score on the LEVIR-CD dataset. (b) Validation F1-score on the LEVIR-CD dataset.

Table 1. Generalization experiments conducted on the WHU-CD dataset. The best results are indicated in bold.

Methods	OA (%)	IoU (%)	F1 (%)	Pre (%)	Rec (%)
FC-EF	98.19	58.30	73.66	87.25	63.73
FC-Siam-Diff	98.25	67.34	80.48	72.27	90.80
FC-Siam-Conc	95.91	45.37	62.42	49.08	85.72
BIT	99.12	79.44	88.54	91.46	85.81
ChangeFormer	99.11	79.16	88.37	92.21	84.83
DMINet	98.52	71.19	83.17	75.79	92.14
SEIFNet	98.80	75.12	85.79	81.12	91.04
STADE-CDNet	99.20	80.77	89.36	94.06	85.11
SChanger	99.16	80.86	89.42	89.68	89.16
CASP	99.54	88.71	94.02	96.94	91.27
SDA-Encoding	99.59	89.87	94.67	96.69	92.73

Table 2. Generalization experiments conducted on the LEVIR-CD dataset. The best results are indicated in bold.

Methods	OA (%)	IoU (%)	F1 (%)	Pre (%)	Rec (%)
FC-EF	98.38	72.19	83.85	85.10	82.63
FC-Siam-Diff	98.98	81.38	89.74	92.09	87.50
FC-Siam-Conc	98.62	75.72	86.18	87.83	84.59
BIT	98.99	81.39	89.74	92.66	87.01
ChangeFormer	99.00	81.60	89.87	91.88	87.95
DMINet	99.03	82.27	90.27	92.42	88.22
SEIFNet	98.91	80.16	88.99	91.79	86.36
STADE-CDNet	98.91	80.44	89.16	90.65	87.72
SChanger	99.04	82.25	90.26	92.99	87.69
CASP	99.10	83.47	90.99	92.50	89.53
SDA-Encoding	99.20	85.29	92.06	93.34	90.82

Table 3. Generalization experiments conducted on the SYSU-CD dataset. The best results are indicated in bold.

Methods	OA (%)	IoU (%)	F1 (%)	Pre (%)	Rec (%)
FC-EF	89.57	61.81	76.40	81.93	71.57
FC-Siam-Diff	91.37	67.52	80.61	85.70	76.09
FC-Siam-Conc	90.06	62.99	77.29	83.74	71.77
BIT	89.31	62.41	76.85	78.49	75.28
ChangeFormer	90.51	65.18	78.92	82.89	75.31
DMINet	91.78	69.11	81.73	85.85	78.00
SEIFNet	91.59	69.32	81.88	83.22	80.59
STADE-CDNet	87.11	60.33	75.26	68.76	83.11
SChanger	91.62	69.24	81.83	83.73	80.01
CASP	91.67	68.79	81.51	85.51	77.87
SDA-Encoding	92.53	71.57	83.43	87.46	79.75

Table 4. Ablation study of different modules. The best results are indicated in bold.

Baseline	MHI	PLE	DAF	SFA	LEVIR-CD
Baseline	MHI	PLE	DAF	SFA	OA (%)	IoU (%)	F1 (%)	Pre (%)	Rec (%)
✓	×	×	×	×	99.00	81.60	89.87	91.88	87.95
✓	×	✓	×	✓	99.15	84.42	91.55	92.17	90.95
✓	✓	×	✓	×	99.14	84.31	91.49	92.40	90.60
✓	✓	✓	×	×	99.06	83.06	90.75	91.34	90.16
✓	×	×	✓	✓	99.14	84.26	91.46	92.12	90.81
✓	✓	✓	✓	✓	99.20	85.29	92.06	93.34	90.82

Table 5. The experimental results of different backbones on the LEVIR-CD dataset. The best results are indicated in bold.

Method	LEVIR-CD
Method	OA (%)	IoU (%)	F1 (%)	Pre (%)	Rec (%)
Vit	99.08	83.35	90.91	91.28	90.55
Swin-T	99.14	84.25	91.45	92.62	90.32
MobileVit	99.14	84.45	91.57	92.74	90.43
FastVit	99.08	83.47	90.99	90.97	91.01
MiT-b0	99.15	84.52	91.61	92.47	90.76
MiT-b3	99.20	85.29	92.06	93.34	90.82

Table 6. Complexity comparison of different methods.

Methods	Para. (M)	FLOPs (G)	Throughput (Sample/s)
FC-EF	1.35	3.59	721.87
FC-Siam-conc	1.55	5.34	566.48
FC-Siam-diff	1.35	4.74	577.64
BIT	12.40	10.87	237.35
Changeformer	41.04	45.97	148.40
DMINet	6.24	14.55	215.29
SEIFNet	27.91	8.37	292.40
STADE-CDNet	11.90	10.25	63.68
SChanger	2.37	17.91	87.36
CASP	14.55	9.19	91.47
SDA-Encoding (Ours)	45.57	44.80	90.51

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Zhang, X.; Du, Y.; Zhou, W.; Zhang, K. Spatial-Frequency Decoupling Alignment Encoding for Remote Sensing Change Detection. Sensors 2026, 26, 1979. https://doi.org/10.3390/s26061979

AMA Style

Zhang X, Du Y, Zhou W, Zhang K. Spatial-Frequency Decoupling Alignment Encoding for Remote Sensing Change Detection. Sensors. 2026; 26(6):1979. https://doi.org/10.3390/s26061979

Chicago/Turabian Style

Zhang, Xu, Yue Du, Weiran Zhou, and Kaihua Zhang. 2026. "Spatial-Frequency Decoupling Alignment Encoding for Remote Sensing Change Detection" Sensors 26, no. 6: 1979. https://doi.org/10.3390/s26061979

APA Style

Zhang, X., Du, Y., Zhou, W., & Zhang, K. (2026). Spatial-Frequency Decoupling Alignment Encoding for Remote Sensing Change Detection. Sensors, 26(6), 1979. https://doi.org/10.3390/s26061979

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Spatial-Frequency Decoupling Alignment Encoding for Remote Sensing Change Detection

Abstract

1. Introduction

2. Related Works

2.1. CNN-Based CD Methods

2.2. Transformer-Based Methods

2.3. Frequency-Based Remote Sensing Image Processing

3. Methods

3.1. Overview

3.2. MiT-Based Bi-Temporal Fusion Stage

3.3. Multi-Scale High-Frequency Interaction Module

3.4. Position-Aware Low-Frequency Enhancement Module

3.5. Loss Function

4. Experiment

4.1. Datasets

4.1.1. WHU-CD Dataset

4.1.2. LEVIR-CD Dataset

4.1.3. SYSU-CD Dataset

4.2. Implementation Details

4.3. Evaluation Metrics

4.4. Compared Methods

4.5. Results and Analysis

4.5.1. Results on the WHU-CD Dataset

4.5.2. Results on the LEVIR-CD Dataset

4.5.3. Results on the SYSU-CD Dataset

4.6. Ablation Studies

4.7. Comparison of Complexity

5. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI