FACDNet: A Frequency-Aware Cross-Layer Network for Remote Sensing Change Detection

Zhao, Liangjun; Zhao, Chenzhi; Zhang, Lei; Zhong, Zimin

doi:10.3390/electronics15112416

¹ School of Computer Science and Engineering, Sichuan University of Science and Engineering, Yibin 644000, China

² School of Automation and Information Engineering, Sichuan University of Science and Engineering, Yibin 644000, China

* Author to whom correspondence should be addressed.

Electronics 2026, 15(11), 2416; https://doi.org/10.3390/electronics15112416

Submission received: 23 April 2026 / Revised: 16 May 2026 / Accepted: 27 May 2026 / Published: 2 June 2026

Abstract

Remote sensing change detection is crucial for urban expansion monitoring and ecological assessment. Recently, methods based on Convolutional Neural Networks (CNNs) and Transformers have advanced significantly. However, state-of-the-art models that rely primarily on pure spatial-domain modeling and absolute feature differences struggle to balance global semantics with high-frequency boundary details. This paradigm discards the directionality of physical changes and amplifies pseudo-change noise in complex backgrounds. To overcome this, we propose a Frequency-Aware Cross-Layer Change Detection Network (FACDNet) that leverages frequency-spatial synergy to enhance feature discriminability. Specifically, a Wavelet Interaction Block (WIB) decouples bitemporal features using Haar wavelets and applies heterogeneous attention to selectively reinforce macroscopic semantics and edge textures. Furthermore, to mitigate noise in shallow features, a Cross-Layer Frequency Context Aggregator (CLFCA) injects deep global semantics top-down, purifying multi-scale spatial gating signals. Finally, a Context-Guided Difference Fusion Module (CDFM) extracts direction-aware bidirectional difference features and uses the purified gating to suppress pseudo-changes. Extensive experiments on the LEVIR-CD and highly challenging SHCD datasets demonstrate FACDNet’s robustness: it achieves change-class F1-scores of 92.04% and 83.64% and Intersection over Union (IoU) scores of 85.26% and 71.89%, respectively, which is highly competitive with existing mainstream methods.

Keywords: change detection; remote sensing images; frequency-aware; cross-layer aggregation; attention mechanism; multi-scale features

1. Introduction

Remote Sensing Change Detection (RSCD) aims to identify and extract the dynamic evolution processes of the Earth’s surface by comparing images of the same geographical area acquired at different times [1,2]. This technology plays an irreplaceable fundamental role in Earth observation tasks such as urban expansion planning, forest resource management, disaster damage assessment, and farmland protection [3,4]. With the proliferation of high-resolution satellite and aerial platforms, bitemporal remote sensing data exhibit extremely high spatial resolution and rich texture details [5]. However, complex scenes often contain numerous appearance perturbations irrelevant to actual land-cover changes, such as building shadows, seasonal vegetation succession, and drastic variations in illumination conditions [6,7]. These factors induce severe visual ambiguity in bitemporal images, rendering models highly susceptible to false alarms [8,9]. Consequently, accurately disentangling genuine structural changes from complex background interference remains a highly challenging task [10].

In recent years, deep learning models, represented by Convolutional Neural Networks (CNNs) and Transformers, have significantly advanced change detection technologies through their powerful feature representation capabilities [11]. CNNs excel at extracting spatial details via local perception [12,13], whereas Transformers demonstrate immense potential in modeling global dependencies through self-attention mechanisms [14,15]. Additionally, innovative deep learning frameworks have significantly broadened the scope of various visual detection tasks. For instance, a confluent triple-flow network employing a divide-and-conquer strategy [16] has been developed to effectively integrate multi-modal cues for robust object detection. Nevertheless, existing mainstream methods still face two critical bottlenecks. First, most of these models are confined to feature interactions within the pure spatial domain, tending to prioritize high-level semantic features at the expense of fine-grained spatial clues [17,18]. This leads to blurred boundaries of changed regions and insufficient sensitivity to subtle targets, as they overlook the stable physical structures embedded in the frequency domain of remote sensing images (e.g., low-frequency global layouts and high-frequency edge transitions) [19,20]. Second, under complex backgrounds, although shallow features contain prominent appearance changes, they lack consistent semantic meanings. Existing multi-scale fusion strategies often lack deep synergy mechanisms between shallow and deep features, which easily results in redundant activations and disrupts the boundary integrity of fine-grained changes [21].

From the perspective of signal representation, land-cover changes in bitemporal remote sensing images are not merely drastic fluctuations of spatial pixels, but also structural shifts in frequency energy distribution [22,23]. Low-frequency components typically dominate the macroscopic semantics and consistent layout of a scene, whereas high-frequency components precisely delineate detail changes such as building edges and road contours [24]. Although early studies attempted to introduce wavelet transforms or Fourier analysis for pixel-level detection [25], these traditional methods rely heavily on hand-crafted features and struggle to meet the high-dimensional semantic demands of complex urban scenes. Furthermore, to alleviate the difficulties in recognizing multi-scale targets, current deep learning models commonly employ feature pyramids or skip connections to fuse features across different levels [26,27]. However, these fusion mechanisms are mostly fixed and loose. When processing cross-temporal information, they not only fail to effectively filter out high-frequency pseudo-change noise caused by illumination or seasonal variations in shallow features, but also tend to propagate redundant shallow activations directly to the global scale, further exacerbating the fragmentation of high-frequency boundaries and semantic confusion [28,29]. Therefore, how to explicitly extract change edges using frequency-domain decoupling while adaptively aligning high- and low-frequency semantic and structural information through cross-layer interactions within a unified end-to-end framework has become an urgent theoretical challenge [30].

Integrating frequency-domain analysis with deep learning has emerged as an effective approach to enhancing feature discriminability in remote sensing change detection. A representative work, FDINet [31], successfully leverages wavelet transforms for intra-scale frequency decoupling and interaction, emphasizing high-frequency boundaries and low-frequency semantics to improve fine-grained detection accuracy. However, FDINet’s frequency-domain processing is primarily confined within the same feature level, and its cross-scale interaction still largely relies on traditional spatial-domain concatenation or upsampling during the decoding stage. In contrast, our proposed FACDNet introduces a fundamentally different frequency-aware cross-layer guidance paradigm. Beyond utilizing the Wavelet Interaction Block (WIB) for multi-scale context extraction, FACDNet incorporates a Cross-Layer Frequency Context Aggregator (CLFCA) to establish a top-down semantic injection pathway. This mechanism leverages high-order, noise-robust global semantics from deep layers as a prior to explicitly purify and calibrate the sensitive shallow high-frequency gating signals during the feature enhancement phase. By deeply coupling frequency decoupling with cross-layer interaction, FACDNet achieves an interpretable feature purification mechanism at the frequency level before final fusion, a capability that distinguishes it from traditional spatial-domain cross-scale aggregation.

More critically, when acquiring bitemporal difference features, existing deep learning networks generally rely on simple concatenation, element-wise addition, or absolute value subtraction. Such mathematically crude feature extraction methods fail to fully exploit the high-order semantics inherent within each temporal phase. In particular, the absolute difference operation completely eliminates the directionality of physical changes (for instance, “building to bare land” and “bare land to building” yield identical responses under absolute difference) [32]. This not only generates a massive amount of redundant feature representations but also amplifies pseudo-change noise in regions with severe illumination and seasonal discrepancies, severely constraining the model’s discriminative capability for genuine changes.

Furthermore, addressing domain shifts caused by severe illumination, texture, and perspective variations remains a critical challenge in robust aerial and UAV vision. Recent self-supervised and domain-adaptive approaches have made significant strides in this area. For instance, the innovative Region-Aligned 3D Transformer (RA3T) [33] employs a dual-branch self-supervised strategy and region-level adversarial calibration to explicitly align fine-grained domain shifts. This architecture demonstrates remarkable robustness against complex environmental variations, such as local shadows and uneven illumination, in low-altitude scenarios. Inspired by the necessity of such fine-grained domain robustness, our proposed FACDNet tackles the analogous challenge of phenological and illumination disparities in bitemporal change detection. Instead of adversarial alignment, we achieve robust feature representation by explicitly decoupling high-frequency structural details from low-frequency global semantics, thereby ensuring consistent change discrimination across diverse scenes.

To overcome the aforementioned limitations of spatial-domain modeling and absolute difference fusion, this paper proposes a Frequency-Aware Cross-Layer Change Detection Network (FACDNet). This network integrates frequency-domain decoupling, cross-layer semantic purification, and direction-aware difference modeling, aiming to achieve high-precision and robust change detection.

The main contributions of this paper are as follows:

Frequency Decoupling and Heterogeneous Attention Mechanism: We design a Wavelet Interaction Block (WIB) that utilizes Haar wavelets to explicitly decouple bitemporal features into low-frequency approximation and high-frequency detail subbands. Tailored to their distinct physical properties, a heterogeneous attention mechanism is introduced (applying channel attention to low-frequency components to focus on macroscopic semantics, and spatial attention to high-frequency components to sharpen edge textures), thereby significantly enhancing feature discriminability in the frequency domain.
Cross-Layer Frequency Context Aggregation Mechanism: To address the susceptibility of shallow high-frequency features to noise interference, we construct a Cross-Layer Frequency Context Aggregator (CLFCA). This module injects deep global semantics into shallow boundary features in a top-down manner, effectively purifying multi-scale spatial gating signals and resolving the semantic–structural mismatch between deep and shallow features.
Direction-Aware Context-Guided Difference Fusion: Discarding traditional absolute difference calculations, we propose a Context-Guided Difference Fusion Module (CDFM). This module extracts bidirectional difference features that preserve physical directionality and leverages the context features purified by CLFCA as spatial constraint gates. This precisely suppresses pseudo-change noise in unchanged regions, achieving robust fusion of multi-source features.

2. Methodology

2.1. Overall Architecture of FACDNet

To address the critical challenges of severe pseudo-change interference and insufficient multi-scale feature fusion in complex remote sensing scenarios, this paper proposes an end-to-end Frequency-Aware Cross-Layer Change Detection Network (FACDNet). The data flow of this network is conceptualized as a synergistic pipeline: “multi-scale feature extraction → frequency-domain decoupled interaction → cross-layer semantic purification → direction-aware fusion → progressive decoding”. Initially, given bitemporal input images, the network utilizes a shared-weight ConvNeXt-Small backbone encoder for local-global hybrid feature extraction [34], progressively outputting a hierarchical feature pyramid encompassing four spatial resolutions. Subsequently, at each scale level, the bitemporal features are fed in parallel into the Wavelet Interaction Block (WIB). This module employs the Haar wavelet transform to explicitly decouple features into low-frequency approximation subbands and high-frequency detail subbands [35,36], while innovatively introducing a heterogeneous attention mechanism—applying channel attention to the low-frequency components containing macroscopic semantics and spatial attention to the high-frequency components encoding edge textures. This achieves adaptive feature enhancement in the frequency domain and reconstructs preliminary multi-scale frequency context features.

Considering that shallow high-frequency features are highly susceptible to local noise interference in complex backgrounds, the aforementioned preliminary context features are immediately processed by the Cross-Layer Frequency Context Aggregator (CLFCA). This module establishes a top-down semantic injection pathway, progressively upsampling high-order semantics with strong macroscopic noise-robust capabilities from the deep network and integrating them with shallow features through additive fusion [37], thereby effectively purifying the multi-scale spatial context gating signals. Following semantic purification, the multi-scale bitemporal features and their corresponding gating signals jointly enter the Context-Guided Difference Fusion Module (CDFM). To overcome the inherent flaw of traditional absolute difference operations losing the directionality of physical changes, CDFM extracts direction-aware bidirectional difference features and utilizes the CLFCA-purified context signals as spatial masks to precisely suppress pseudo-change responses in unchanged regions, outputting highly refined fused features. Finally, a progressive decoder adopts a U-Net-like skip-connection architecture to step-wise upsample and aggregate the multi-scale fused features, generating the final pixel-level change probability map through the prediction head at the network’s end to achieve high-precision change detection for bitemporal remote sensing images. The overall architecture is shown in Figure 1.

2.2. Encoder

To balance computational efficiency while ensuring feature representation capability, this paper selects ConvNeXt-Small as the shared-weight dual-stream backbone encoder. Building upon the inherent advantages of local inductive biases in standard Convolutional Neural Networks (CNNs), ConvNeXt systematically assimilates macroscopic architectural designs from Vision Transformers [38] (such as depthwise separable convolutions and layer normalization). This establishes an excellent local-global hybrid receptive field, demonstrating outstanding representational capacity and favorable transferability on large-scale datasets. In this network, the bitemporal input images $T_{1}, T_{2} \in \mathbb{R}^{B \times 3 \times H \times W}$ (where $B$ is the batch size, and $H$ and $W$ are the input spatial dimensions) are each fed into the encoder, which is initialized with ImageNet-1K pre-trained weights. To precisely align with the subsequent multi-scale feature interaction and decoding processes, this paper extracts the four native feature stages inherent to ConvNeXt-Small (containing 3, 3, 27, and 3 micro feature blocks, respectively), with its detailed network architecture illustrated in Figure 2. This layer-by-layer process generates a feature pyramid comprising four scales:

$$F1_{i},\ F2_{i} = \mathrm{Encoder}_{i}(T_{1}, T_{2}), \quad i \in \{0, 1, 2, 3\} \qquad (1)$$

Specifically, these four stages sequentially output feature maps with channel dimensions of 96, 192, 384, and 768, at spatial resolutions of 1/4, 1/8, 1/16, and 1/32 of the original input image size, respectively. This hierarchical representation, in which the channel dimension doubles and the resolution halves from each stage to the next, preserves high-frequency texture details in the shallow layers ($i = 0, 1$) for fine-grained boundary restoration and extracts high-order abstract semantics in the deep layers ($i = 2, 3$) for large-scale macroscopic change judgment. This lays the feature foundation for the subsequent frequency decoupling, cross-layer context aggregation, and difference fusion.
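The shape bookkeeping described above can be sketched in a few lines of Python. This is only an illustration of the stated channel widths and downsampling factors, not the authors' code; the helper name `pyramid_shapes` and the 256 × 256 input size are assumptions introduced for the example.

```python
# Channel widths and downsampling factors stated in the text for the four stages.
STAGE_CHANNELS = (96, 192, 384, 768)
STAGE_STRIDES = (4, 8, 16, 32)

def pyramid_shapes(batch, height, width):
    """Return the (B, C_i, H_i, W_i) shape of each stage output for one image."""
    if height % 32 or width % 32:
        raise ValueError("this sketch assumes H and W are multiples of 32")
    return [
        (batch, c, height // s, width // s)
        for c, s in zip(STAGE_CHANNELS, STAGE_STRIDES)
    ]

# Hypothetical input size, chosen only for illustration.
shapes = pyramid_shapes(batch=8, height=256, width=256)
for i, shape in enumerate(shapes):
    print(f"stage {i}: {shape}")
```

Because the encoder weights are shared, both temporal images yield pyramids with identical shapes, which is what allows the later modules to pair features level by level.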

2.3. Wavelet Interaction Block (WIB)

Most existing bitemporal feature interaction methods are confined to stacking or addition within the pure spatial domain. This paradigm easily leads to the mutual coupling of low-frequency background signals representing macroscopic semantics and high-frequency edge signals representing local details [39]. In complex remote sensing scenarios, this coupling not only blurs the boundaries of changed regions but also tends to misclassify high-frequency noise as genuine changes. To break this limitation, this paper designs a Wavelet Interaction Block (WIB). The specific network topology of this module is illustrated in Figure 3. The module utilizes the Haar wavelet transform to explicitly decouple the bitemporal features into the frequency domain [40]. Tailored to the distinct physical properties of high- and low-frequency signals, a heterogeneous attention mechanism is introduced, thereby achieving adaptive feature purification and precise interaction at the frequency level.

2.3.1. Frequency-Domain Decoupling of Bitemporal Features

For the bitemporal features $F_{1}, F_{2} \in \mathbb{R}^{C \times H \times W}$ output by the backbone network at a given scale, WIB first employs a 2D Discrete Wavelet Transform (2D-DWT, utilizing the computationally efficient Haar wavelet in this paper) to decouple them into four frequency subbands with halved spatial resolution. Taking feature $F_{1}$ as an example, the decomposition process can be expressed as:

$$LL_{1},\ LH_{1},\ HL_{1},\ HH_{1} = \mathrm{DWT}(F_{1}) \qquad (2)$$

In the physical representation of remote sensing images, these four subbands encapsulate distinctly different semantic meanings:

$LL_{1} \in \mathbb{R}^{C \times \frac{H}{2} \times \frac{W}{2}}$ (Low-Frequency Approximation Subband): Contains the main structure and global low-frequency energy of the image, representing the macroscopic semantic consistency of farmlands, water bodies, or large building clusters.

$LH_{1}, HL_{1}, HH_{1}$ (High-Frequency Detail Subbands): Capture the high-frequency gradient information in the horizontal, vertical, and diagonal directions, respectively. They are highly sensitive to land-cover boundaries (e.g., road networks, building contours), but are also prone to entangling sensor noise or fragmented pseudo-change shadows.

Similarly, applying the identical decoupling operation to $F_{2}$ yields $LL_{2}, LH_{2}, HL_{2}, HH_{2}$.
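To make the decomposition in Equation (2) concrete, the following NumPy sketch implements a single-level orthonormal 2D Haar transform on a toy feature map. It is a minimal illustration under stated assumptions, not the authors' implementation: the function name `haar_dwt2` is hypothetical, and the assignment of the row-difference and column-difference bands to the names LH and HL is one common convention (libraries differ on this naming).

```python
import numpy as np

def haar_dwt2(x):
    """Single-level orthonormal 2D Haar DWT over the last two axes of x (C, H, W).

    H and W must be even. Returns (LL, LH, HL, HH), each of shape (C, H/2, W/2).
    """
    a = x[..., 0::2, 0::2]  # top-left pixel of every 2x2 block
    b = x[..., 0::2, 1::2]  # top-right
    c = x[..., 1::2, 0::2]  # bottom-left
    d = x[..., 1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2.0  # local average (approximation)
    lh = (a + b - c - d) / 2.0  # difference between the two rows (assumed naming)
    hl = (a - b + c - d) / 2.0  # difference between the two columns (assumed naming)
    hh = (a - b - c + d) / 2.0  # diagonal difference
    return ll, lh, hl, hh

rng = np.random.default_rng(0)
f1 = rng.standard_normal((4, 8, 8))  # toy stand-in for a feature map F_1
ll1, lh1, hl1, hh1 = haar_dwt2(f1)

# The orthonormal transform redistributes energy across subbands without loss.
energy_in = np.sum(f1 ** 2)
energy_out = sum(np.sum(band ** 2) for band in (ll1, lh1, hl1, hh1))
print(ll1.shape, np.isclose(energy_in, energy_out))
```

A spatially constant feature map lands entirely in LL, while the three detail bands respond only where neighbouring pixels differ, which matches the semantic split described above.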

2.3.2. Intra-Frequency Cross-Temporal Feature Interaction

To capture the evolution patterns of bitemporal features within identical frequency bands, WIB concatenates the corresponding subbands of $T_{1}$ and $T_{2}$ along the channel dimension. Subsequently, a 1 × 1 convolution is applied to facilitate cross-channel information interaction and dimensionality reduction, followed by a ReLU activation function for non-linear mapping. This process calculates the initial change responses for each frequency band:

$$
\begin{aligned}
LL_{fused} &= \delta(\mathrm{Conv}_{1 \times 1}([LL_{1}, LL_{2}])) \\
LH_{fused} &= \delta(\mathrm{Conv}_{1 \times 1}([LH_{1}, LH_{2}])) \\
HL_{fused} &= \delta(\mathrm{Conv}_{1 \times 1}([HL_{1}, HL_{2}])) \\
HH_{fused} &= \delta(\mathrm{Conv}_{1 \times 1}([HH_{1}, HH_{2}]))
\end{aligned}
\qquad (3)
$$

where $[\cdot, \cdot]$ denotes the concatenation operation along the channel dimension, and $\delta$ represents the ReLU activation function.
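One line of Equation (3) can be sketched in NumPy as follows. The weights are random stand-ins for learned parameters, and the reduction from 2C to C output channels is an assumption made here (the text only states that the convolution reduces dimensionality); the helper names are hypothetical.

```python
import numpy as np

def conv1x1(x, weight, bias):
    """1x1 convolution: x (C_in, H, W), weight (C_out, C_in), bias (C_out,)."""
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]

def fuse_subband(band_t1, band_t2, weight, bias):
    """ReLU(Conv1x1([band_t1, band_t2])) for a single frequency subband."""
    stacked = np.concatenate([band_t1, band_t2], axis=0)  # (2C, H, W)
    return np.maximum(conv1x1(stacked, weight, bias), 0.0)

rng = np.random.default_rng(0)
C, H, W = 4, 8, 8
ll1, ll2 = rng.standard_normal((2, C, H, W))      # toy LL subbands of T_1 and T_2
weight = rng.standard_normal((C, 2 * C)) * 0.1    # random stand-in for learned weights
bias = np.zeros(C)

ll_fused = fuse_subband(ll1, ll2, weight, bias)
print(ll_fused.shape)
```

The same function would be applied to the LH, HL, and HH pairs, each with its own weights, so that responses from different frequency bands are not mixed at this stage.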

2.3.3. Heterogeneous Attention Mechanism

After obtaining the fused frequency band features, this paper designs a heterogeneous attention mechanism tailored to the physical disparity between high and low frequencies.

Low-Frequency Channel Attention (CA): $LL_{fused}$ encapsulates the change semantics of the macroscopic background. While its features are relatively smooth in spatial distribution, different channels exhibit vastly different response sensitivities to global changes. Therefore, this paper applies channel attention to the low-frequency component. To comprehensively evaluate channel importance, the CA module simultaneously employs global average pooling (AvgPool) and global max pooling (MaxPool): the former aggregates the foundational semantic distribution over the global range, while the latter captures the most salient change extrema within the channel.

$$LL_{att} = \mathrm{CA}(LL_{fused}) = \sigma(\mathrm{MLP}(\mathrm{AvgPool}(LL_{fused})) + \mathrm{MLP}(\mathrm{MaxPool}(LL_{fused}))) \otimes LL_{fused} \qquad (4)$$

where MLP is a shared-weight multi-layer perceptron (containing a bottleneck structure with a dimensionality reduction ratio of 16 to minimize parameter overhead), $\sigma$ denotes the Sigmoid activation function, and $\otimes$ represents element-wise multiplication.
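Equation (4) can be illustrated with a short NumPy sketch. It assumes a bias-free two-layer MLP with a ReLU in the bottleneck, which is a common design for this kind of channel attention but is not specified in the text; the weights are random stand-ins and the function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, w1, w2):
    """Channel attention over x (C, H, W).

    w1 (C // r, C) and w2 (C, C // r) form the shared bottleneck MLP
    (assumed here: no biases, ReLU between the two layers).
    """
    def mlp(v):
        return w2 @ np.maximum(w1 @ v, 0.0)

    avg = x.mean(axis=(1, 2))                 # global average pooling, (C,)
    mx = x.max(axis=(1, 2))                   # global max pooling, (C,)
    weights = sigmoid(mlp(avg) + mlp(mx))     # one weight per channel
    return weights[:, None, None] * x, weights

rng = np.random.default_rng(0)
C, r = 32, 16                                 # reduction ratio 16, as stated in the text
ll_fused = rng.standard_normal((C, 8, 8))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1

ll_att, ch_weights = channel_attention(ll_fused, w1, w2)
print(ll_att.shape, ch_weights.shape)
```

Each channel is rescaled by a single scalar, so the spatial layout of the low-frequency band is left unchanged and only the relative importance of channels is adjusted.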

High-Frequency Spatial Attention (SA): Unlike the low frequency, $LH_{fused}$, $HL_{fused}$, and $HH_{fused}$ encode high-frequency textures and edges. These features exhibit highly sparse activations in the spatial domain (i.e., strong responses occur exclusively at boundaries). To precisely sharpen these authentic change boundaries and filter out high-frequency background noise, this paper applies spatial attention to the high-frequency subbands. The SA module first calculates the mean ($\mathrm{Mean}_{c}$) and maximum ($\mathrm{Max}_{c}$) values along the channel dimension to compress channel redundancy and highlight spatially salient edges. Subsequently, a large-kernel spatial convolution of 7 × 7 is introduced to expand the receptive field, thereby bridging fragmented local edge structures:

$$LH_{att} = \mathrm{SA}(LH_{fused}) = \sigma(\mathrm{Conv}_{7 \times 7}([\mathrm{Mean}_{c}(LH_{fused}), \mathrm{Max}_{c}(LH_{fused})])) \otimes LH_{fused} \qquad (5)$$

Similarly, $HL_{att}$ and $HH_{att}$ can be calculated respectively.
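A minimal NumPy sketch of Equation (5) is given below. The 7 × 7 kernel is a random stand-in for learned weights, zero padding is assumed so that the mask keeps the input resolution, and the helper names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv2d_same(x, kernel):
    """Cross-correlate x (C_in, H, W) with kernel (C_in, k, k) into one (H, W) map.

    Zero padding of k // 2 keeps the spatial size (an assumption of this sketch).
    """
    k = kernel.shape[-1]
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    h, w = x.shape[1:]
    out = np.zeros((h, w))
    for dy in range(k):
        for dx in range(k):
            out += np.einsum("c,chw->hw", kernel[:, dy, dx], xp[:, dy:dy + h, dx:dx + w])
    return out

def spatial_attention(x, kernel):
    """x (C, H, W) -> (x reweighted by a single-channel spatial mask, mask)."""
    pooled = np.stack([x.mean(axis=0), x.max(axis=0)])  # (2, H, W)
    mask = sigmoid(conv2d_same(pooled, kernel))          # (H, W), values in (0, 1)
    return x * mask[None], mask

rng = np.random.default_rng(0)
lh_fused = rng.standard_normal((6, 10, 10))
kernel = rng.standard_normal((2, 7, 7)) * 0.05          # random stand-in for learned weights

lh_att, mask = spatial_attention(lh_fused, kernel)
print(lh_att.shape, mask.shape)
```

In contrast to the channel attention applied to LL, the mask here is shared by all channels and varies over positions, which suits features whose responses concentrate along edges.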

2.3.4. Frequency Domain Reconstruction

Finally, WIB performs a 2D Inverse Discrete Wavelet Transform (2D-IWT) on the four purified frequency subbands enhanced by heterogeneous attention. This maps them from the frequency domain back to the spatial domain, reconstructing a single-branch frequency context feature $C$:

$$C = \mathrm{IWT}(LL_{att}, LH_{att}, HL_{att}, HH_{att}) \qquad (6)$$

This output feature $C \in \mathbb{R}^{C \times H \times W}$ (the inverse transform restores the original spatial resolution) balances the purity of macroscopic semantics and the sharpness of high-frequency edges, providing a high-quality cross-layer gating mask for the subsequent network.

2.4. Cross-Layer Frequency Context Aggregator (CLFCA)

After processing by the Wavelet Interaction Block (WIB), the network obtains preliminary frequency context features at four different scales, denoted as $C = \{C_{0}, C_{1}, C_{2}, C_{3}\}$. Although WIB achieves adaptive enhancement of high and low frequencies within each independent scale, an inherent “Semantic–Structural Gap” persists among features across different scales. Due to the lack of a global receptive field, shallow features are highly susceptible to local illumination discrepancies or high-frequency pseudo-change noise. Conversely, deep features, having undergone multiple downsampling operations, possess a vast global receptive field and encapsulate highly abstract, noise-robust macroscopic change semantics, but lose fine-grained spatial localization capabilities.

Existing multi-scale fusion methods (e.g., direct concatenation or skip connections) often fail to effectively bridge this gap, easily leading to the propagation of shallow noise features to deep layers [41]. To break the feature isolation across scales and leverage deep semantics to “protect” and “purify” shallow boundary features, this paper proposes the Cross-Layer Frequency Context Aggregator (CLFCA). This module constructs a top-down semantic injection pathway, similar to a feature pyramid, achieving adaptive alignment and fusion of deep global priors and shallow high-frequency details through step-wise guidance. The specific microscopic topology and the dimensional evolution process of cross-layer features of the CLFCA module are illustrated in Figure 4.

Because the feature channel dimensions and spatial resolutions output by different stages of the backbone network vary, CLFCA first achieves precise matching of cross-layer signals through feature alignment operations. In the channel dimension, a linear projection layer composed of a 1 × 1 convolution, Batch Normalization (BN), and a ReLU activation function is utilized to compress the high-dimensional channels of the deep layer to match those of the adjacent shallow layer. Subsequently, in the spatial dimension, to compensate for the spatial information loss caused by downsampling, bilinear interpolation upsampling is performed on the dimensionality-reduced deep features, precisely aligning their resolution with the corresponding shallow context.

After completing the alignment, CLFCA adopts a step-wise injection strategy to feed the physically robust macroscopic semantics of the deep network back into the shallow features. The context feature of the deepest layer (Stage 4, i.e., $i = 3$) is defined as the initial global anchor, i.e., $C_{3}' = C_{3}$. For the shallower levels $i \in \{2, 1, 0\}$, the cross-layer aggregation process can be formulated as:

$$C_{i}' = C_{i} + \mathrm{Up}_{2 \times}(\mathrm{Align}_{i + 1 \to i}(C_{i + 1}')) \qquad (7)$$

where $\mathrm{Align}(\cdot)$ denotes the channel compression operation based on a 1 × 1 convolution (the projection layer with BN and ReLU described above), and $\mathrm{Up}_{2 \times}(\cdot)$ is the bilinear spatial upsampling. Unlike complex gating or multiplicative modulation, this paper adopts a residual fusion mechanism based on element-wise addition. This mechanism not only drastically reduces computational overhead but also allows deep global semantics to act as a structural prior, implicitly purifying the noise-filled high-frequency boundaries in shallow layers. This enables the network to better distinguish genuine land-cover contours from background interference.
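The recursion in Equation (7) can be sketched in NumPy as follows. This is a simplified illustration, not the authors' implementation: Batch Normalization is omitted from the alignment step, the bilinear upsampling uses a half-pixel-centre sampling convention (an assumption), the channel widths are shrunk to toy values, and all names and weights are hypothetical stand-ins.

```python
import numpy as np

def align(x, weight):
    """Channel compression: 1x1 convolution followed by ReLU (BN omitted here)."""
    return np.maximum(np.einsum("oc,chw->ohw", weight, x), 0.0)

def upsample2x_bilinear(x):
    """Bilinear 2x upsampling of x (C, H, W), half-pixel-centre convention (assumed)."""
    def interp_axis(a, axis):
        n = a.shape[axis]
        pos = np.clip((np.arange(2 * n) + 0.5) / 2.0 - 0.5, 0, n - 1)
        lo = np.floor(pos).astype(int)
        hi = np.minimum(lo + 1, n - 1)
        shape = [1] * a.ndim
        shape[axis] = -1
        w = (pos - lo).reshape(shape)
        return np.take(a, lo, axis=axis) * (1 - w) + np.take(a, hi, axis=axis) * w
    return interp_axis(interp_axis(x, 1), 2)

def clfca(contexts, align_weights):
    """contexts: [C_0, ..., C_3] from shallow to deep.

    align_weights[i] maps the channels of level i + 1 to those of level i.
    """
    refined = [None] * len(contexts)
    refined[-1] = contexts[-1]                       # deepest level is the anchor
    for i in range(len(contexts) - 2, -1, -1):       # i = 2, 1, 0
        injected = upsample2x_bilinear(align(refined[i + 1], align_weights[i]))
        refined[i] = contexts[i] + injected          # additive (residual) fusion
    return refined

rng = np.random.default_rng(0)
channels = (8, 16, 32, 64)   # toy widths; the stages in the text use 96, 192, 384, 768
sizes = (16, 8, 4, 2)
contexts = [rng.standard_normal((c, s, s)) for c, s in zip(channels, sizes)]
align_weights = [rng.standard_normal((channels[i], channels[i + 1])) * 0.1 for i in range(3)]

refined = clfca(contexts, align_weights)
print([r.shape for r in refined])
```

Because each level uses the already refined deeper level, the deepest semantics reach the shallowest feature after three injections, while the additive form leaves every original context feature intact as a residual term.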

Through three-stage cascaded aggregation, CLFCA ultimately outputs a sequence of purified multi-scale context gating signals $C' = \{C_{0}', C_{1}', C_{2}', C_{3}'\}$. These gating signals not only maintain a high degree of consistency in global semantics but also preserve the sharp boundaries of the shallow features, providing high-quality spatial constraints for the subsequent CDFM module to precisely eliminate pseudo-changes.

2.5. Context-Guided Difference Fusion Module (CDFM)

During the interactive fusion stage of bitemporal features, existing mainstream methods universally employ the calculation of absolute difference (i.e., $|F_{1} - F_{2}|$) to extract change responses. However, from the perspective of the intrinsic physical evolution in remote sensing, the absolute difference operation is mathematically symmetric. This completely eliminates the directionality of physical land-cover changes (for instance, the transition from “building to bare land” and “bare land to building” will generate completely identical feature activations under the absolute difference) [42]. This loss of directional information not only weakens the discriminative capability of the features but also causes the network to amplify pseudo-change noise responses in regions with severe illumination discrepancies or intense seasonal succession.

To overcome this fatal flaw and fully utilize the purified semantic priors provided by the CLFCA module, this paper proposes the Context-Guided Difference Fusion Module (CDFM). Discarding the traditional absolute difference, this module achieves high-precision fusion by extracting direction-aware bidirectional difference features [43] and combining them with the purified context gating. The specific internal topological structure of the CDFM is illustrated in Figure 5.

2.5.1. Direction-Aware Bidirectional Difference Feature Extraction

In CDFM, rather than relying on the symmetric absolute difference, which neglects the physical directionality of changes, we propose a bidirectional difference mechanism to explicitly capture the ‘appearance’ and ‘disappearance’ of objects. To prevent the mathematical cancellation effect during subsequent linear convolutions, we introduce an asymmetric non-linear activation (ReLU) before feature concatenation. This isolates the forward difference (representing the disappearance of features from $T_{1}$ to $T_{2}$) and the reverse difference (representing the appearance of new features in $T_{2}$):

$$D_{cat} = \mathrm{Concat}[\mathrm{ReLU}(F1_{i} - F2_{i}), \mathrm{ReLU}(F2_{i} - F1_{i})] \qquad (8)$$

$$D_{dir} = \mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}_{1 \times 1}(D_{cat}))) \qquad (9)$$

where $\mathrm{ReLU}(\cdot)$ in Equation (8) ensures that only positive intensity changes are preserved in each respective branch, effectively preventing the opposing signals from cancelling each other out during the fusion stage. As Equation (9) states, $D_{cat}$ is then fed into a 1 × 1 convolution layer, followed by BN and ReLU, to generate the initial change-aware feature map $D_{dir}$.
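The difference between the symmetric and the direction-aware formulation is easy to verify on a toy example. The NumPy sketch below covers only Equation (8); the subsequent 1 × 1 convolution, BN, and ReLU of Equation (9) are omitted, and the one-channel "building" feature is a hypothetical stand-in.

```python
import numpy as np

def abs_difference(f1, f2):
    return np.abs(f1 - f2)

def bidirectional_difference(f1, f2):
    """Concat[ReLU(f1 - f2), ReLU(f2 - f1)] along the channel axis."""
    forward = np.maximum(f1 - f2, 0.0)   # responses present at T_1 but weaker at T_2
    reverse = np.maximum(f2 - f1, 0.0)   # responses present at T_2 but weaker at T_1
    return np.concatenate([forward, reverse], axis=0)

# Toy one-channel features: the top row is "built", the bottom row is bare land.
building = np.array([[[1.0, 1.0], [0.0, 0.0]]])
bare = np.zeros_like(building)

# The absolute difference cannot tell demolition from construction.
same_abs = np.array_equal(abs_difference(building, bare), abs_difference(bare, building))

# The bidirectional form places the two transitions in different channels.
demolished = bidirectional_difference(building, bare)
constructed = bidirectional_difference(bare, building)
same_bidir = np.array_equal(demolished, constructed)

print(same_abs, same_bidir)  # True False
```

The two branches also satisfy forward − reverse = F1 − F2 and forward + reverse = |F1 − F2|, so the concatenation retains both the signed and the absolute difference in a form that a following linear layer can recombine.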

2.5.2. Context Gating Guidance and Pseudo-Change Suppression

Although $D_{dir}$ possesses direction-aware capabilities, since it directly originates from the spatial features of the encoder, it inevitably entangles pseudo-change noise caused by registration errors or illumination variations. To precisely eliminate this noise, CDFM introduces the purified gating signal $C_{i}'$ output by the Cross-Layer Frequency Context Aggregator (CLFCA).

First, a 1 × 1 convolution and a Sigmoid activation function are applied to $C_{i}'$, compressing it into a single-channel spatial gating weight map $\mathrm{Gate} \in \mathbb{R}^{1 \times H \times W}$. Because $C_{i}'$ has already fused the deep noise-robust macroscopic semantics with the shallow sharp boundaries, this $\mathrm{Gate}$ can accurately indicate the spatial probability distribution of genuine changes. Subsequently, this gating weight is utilized to perform spatial pixel-wise modulation on the initial difference feature:

$$\mathrm{Gate} = \sigma(\mathrm{Conv}_{1 \times 1}(C_{i}')) \qquad (10)$$

$$D_{refine} = D_{dir} \otimes \mathrm{Gate} \qquad (11)$$

where $\sigma$ denotes the Sigmoid function, and $\otimes$ represents element-wise multiplication in the spatial dimensions, broadcast across channels. Through this soft-mask mechanism, feature activations located in unchanged or pseudo-change regions are effectively suppressed, thereby yielding the refined difference feature $D_{refine}$.
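Equations (10) and (11) amount to a per-pixel soft mask, which the following NumPy sketch illustrates. The context feature, the 1 × 1 convolution weights, and the difference feature are hand-built toy values chosen so that the left half of the map looks "unchanged" and the right half "changed"; they are assumptions for illustration, not learned quantities.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def context_gate(context, weight, bias=0.0):
    """Sigmoid(Conv1x1(context)): context (C, H, W), weight (C,) -> gate (1, H, W)."""
    return sigmoid(np.einsum("c,chw->hw", weight, context) + bias)[None]

def refine(d_dir, gate):
    """Modulate every channel of d_dir (C, H, W) by the same spatial gate (1, H, W)."""
    return d_dir * gate

# Toy purified context: negative evidence on the left, positive on the right.
context = np.empty((2, 4, 4))
context[:, :, :2] = -5.0
context[:, :, 2:] = 5.0
weight = np.ones(2)                      # stand-in for learned 1x1 convolution weights

d_dir = np.ones((3, 4, 4))               # toy difference feature, active everywhere
gate = context_gate(context, weight)
d_refine = refine(d_dir, gate)

print(gate.shape, d_refine.shape)
print(d_refine[0, 0, 0], d_refine[0, 0, 3])   # near 0 on the left, near 1 on the right
```

Since the gate multiplies rather than thresholds, suppressed activations are attenuated smoothly and gradients can still flow to the gated regions during training.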

2.5.3. Multi-Source Feature Synergistic Fusion

Finally, to ensure the integrity of features within the changed regions and provide rich semantic context, CDFM concatenates the refined difference feature $D_{refine}$ with the context gating signal $C_{i}'$ again along the channel dimension. A fusion convolutional layer is then applied for the final feature aggregation:

$$Fused_{i} = \delta(\mathrm{BN}(\mathrm{Conv}_{3 \times 3}([D_{refine}, C_{i}']))) \qquad (12)$$

where $\mathrm{Conv}_{3 \times 3}$ denotes a 3 × 3 convolutional kernel with padding, used to further expand the local receptive field and smooth the fused features. The final output $Fused_{i} \in \mathbb{R}^{C \times H \times W}$ is sent into the progressive decoder. Through the multi-source synergistic processing of CDFM, FACDNet achieves extremely strong robustness against pseudo-changes in complex scenarios while simultaneously preserving the high-frequency details of change boundaries.

2.6. Decoder

To restore the fused multi-scale features into a high-resolution change mask and precisely recover the boundary details of land covers, this paper designs a progressive symmetric decoder. Following a U-Net-like architectural logic, this decoder deeply integrates deep high-order semantics with shallow spatial details through four consecutive Decoder Blocks [44].

2.6.1. Structural Design of the Decoder Block

Each Decoder Block undertakes the dual task of spatial resolution enhancement and feature refinement. First, a 2 × 2 transposed convolution (ConvTranspose2d) is utilized to double the spatial scale of the input feature map while halving the number of channels, achieving an initial restoration of resolution. Through a skip connection, the upsampled feature is concatenated along the channel dimension with the corresponding same-scale fused feature $\mathrm{Fused}_{i}$ from the encoding stage. If there is a slight discrepancy in their spatial dimensions, bilinear interpolation is introduced for precise alignment to ensure the strict spatial consistency of multi-scale information. The concatenated features subsequently pass through two consecutive 3 × 3 convolutional layers. Each convolutional layer is followed by Batch Normalization (BN) and a ReLU activation function. This design not only effectively mitigates the “checkerboard artifacts” introduced by transposed convolutions but also enhances local consistency constraints during the fusion of cross-layer features [45].

2.6.2. Progressive Reconstruction Pathway

The decoding process begins from the deepest fused feature $\mathrm{Fused}_{3}$ and propagates downwards layer by layer:

Stage 4 to Stage 1: Through the first three Decoder Blocks, the feature map resolution is progressively restored from 8 × 8 to 64 × 64, and the channel count decreases from 768 to 96. In each stage, the multi-scale features enhanced by WIB and CLFCA are introduced as spatial priors, ensuring the continuous guidance of macroscopic semantics over fine-grained boundaries.
Self-Skip Enhancement Unit (Block 4): In the final decoding stage, this paper designs a special self-skip connection structure. It concatenates the output feature of the third stage with itself, further strengthening the representational capacity of the features near the original resolution and accumulating sufficient feature responses for the final pixel-level discrimination.
Final Projection and Prediction: After completing the four-stage decoding, the feature map is restored to the original input size via bilinear interpolation. Finally, a 1 × 1 convolutional layer is utilized to project the feature channels into a single-channel map, which is then combined with a Sigmoid activation function to output the final pixel-level change probability map Y.
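The resolution and channel figures quoted for the first three Decoder Blocks follow directly from the "double the spatial size, halve the channels" rule of the transposed convolution. The short sketch below only reproduces that arithmetic; the function name is an illustrative assumption, and the self-skip block and the final bilinear interpolation are not modelled.

```python
def decoder_schedule(size=8, channels=768, num_blocks=3):
    """Return the (spatial size, channel count) pairs after each Decoder Block."""
    stages = [(size, channels)]
    for _ in range(num_blocks):
        size, channels = size * 2, channels // 2  # upsample x2, halve channels
        stages.append((size, channels))
    return stages

schedule = decoder_schedule()
# [(8, 768), (16, 384), (32, 192), (64, 96)]
```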

3. Experimental Results and Analysis

3.1. Experimental Settings and Dataset Construction

To ensure the strict reproducibility of the results, all experiments in this paper were conducted within a unified hardware and software environment. Specifically, the computational platform was deployed on an Ubuntu operating system, equipped with an AMD EPYC 7601 32-Core Processor (AMD, Santa Clara, CA, USA) and an NVIDIA GeForce RTX 3090 GPU (NVIDIA, Santa Clara, CA, USA) with 24 GB of memory. The software stack was built upon Python 3.9 and the PyTorch 2.5.1 deep learning framework, utilizing CUDA 12.1 for hardware acceleration.

Both training and testing were executed on this single RTX 3090 GPU. The ConvNeXt-Small backbone network was initialized with ImageNet-1K pre-trained weights, while the newly introduced modules (i.e., WIB, CLFCA, and CDFM) were randomly initialized. A global random seed was fixed across all implementations.

During the model training phase, the input images (from both the LEVIR-CD and SHCD datasets) were uniformly resized to a spatial resolution of 256 × 256. The AdamW optimizer was employed with an initial learning rate of 1 × 10⁻⁴ and a weight decay coefficient of 1 × 10⁻⁵. A Cosine Annealing strategy was utilized to dynamically adjust the learning rate over a total of 200 training epochs. To enhance training stability, gradient norm clipping was applied to the model parameters, with the maximum threshold set to 1.0.
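For reference, the learning-rate schedule and the gradient clipping described above can be sketched in closed form. The minimum learning rate of 0, the per-epoch stepping, and the small stabilizing constant in the clipping factor are assumptions; the text does not state them. In a PyTorch training loop, the corresponding utilities would typically be `torch.optim.lr_scheduler.CosineAnnealingLR` and `torch.nn.utils.clip_grad_norm_`.

```python
import math

def cosine_lr(epoch, base_lr=1e-4, total_epochs=200, min_lr=0.0):
    """Cosine annealing from base_lr at epoch 0 to min_lr at total_epochs."""
    cos_term = 1.0 + math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * cos_term

def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Rescale a flat list of gradient values so their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads]

lr_start = cosine_lr(0)      # full initial learning rate
lr_middle = cosine_lr(100)   # half of the initial learning rate
lr_end = cosine_lr(200)      # annealed to the minimum

clipped = clip_by_global_norm([3.0, 4.0])    # norm 5 is rescaled to about 1
untouched = clip_by_global_norm([0.3, 0.4])  # norm 0.5 is already below the threshold
```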

In remote sensing change detection tasks, an extreme class imbalance typically exists between changed pixels (positive samples) and unchanged pixels (negative samples), and pixels along fine-grained boundaries are exceptionally difficult to classify. To address this issue, this paper discards the conventional Binary Cross-Entropy (BCE) loss and instead employs a joint supervision of Focal Loss and Dice Loss (Focal-Dice Loss) for network optimization. Specifically, the Focal Loss introduces a weighting factor $\alpha = 0.75$ to significantly boost the model’s recall rate for genuine change categories, alongside a focusing parameter $\gamma = 2.0$ to compel the network to concentrate on hard-to-classify edge pixels. Meanwhile, the Dice Loss evaluates the structural similarity between the prediction map and the ground truth label from a global spatial perspective. By combining the two with a 1:1 weighting ratio, the proposed loss function achieves an effective unification of local boundary sharpening and global structural integrity.
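A minimal sketch of such a joint loss on flattened pixel probabilities is given below. It uses the standard binary Focal Loss and soft Dice Loss formulations with the stated α = 0.75, γ = 2.0, and 1:1 weighting. The mean reduction, the clamping constant `eps`, and the Dice smoothing term `smooth` are assumptions that the text does not specify.

```python
import math

def focal_loss(probs, targets, alpha=0.75, gamma=2.0, eps=1e-7):
    """Binary focal loss averaged over pixels; alpha weights the changed (positive) class."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)            # clamp to avoid log(0)
        p_t = p if t == 1 else 1.0 - p             # probability assigned to the true class
        alpha_t = alpha if t == 1 else 1.0 - alpha
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)

def dice_loss(probs, targets, smooth=1.0):
    """Soft Dice loss computed over the whole prediction map."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    return 1.0 - (2.0 * intersection + smooth) / (sum(probs) + sum(targets) + smooth)

def focal_dice_loss(probs, targets):
    """1:1 combination of the two terms."""
    return focal_loss(probs, targets) + dice_loss(probs, targets)

targets = [1, 0, 1, 0]
good = [0.9, 0.1, 0.8, 0.2]   # confident and correct
bad = [0.4, 0.6, 0.3, 0.7]    # mostly wrong
loss_good = focal_dice_loss(good, targets)
loss_bad = focal_dice_loss(bad, targets)
```

The focusing term `(1 - p_t) ** gamma` is what down-weights easy pixels: under these settings, a changed pixel predicted at 0.6 contributes far more to the focal term than one predicted at 0.9.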

The change detection datasets evaluated in this study are LEVIR-CD [46] and SHCD [47]. A detailed introduction to these two datasets is provided below:

(1) LEVIR-CD (Learning, Vision and Remote Sensing Change Detection Dataset) is a large-scale public benchmark dataset dedicated to building change detection in remote sensing images, constructed by the LEVIR Laboratory at Beihang University. The original images are 1024 × 1024 pixels, covering 20 cities in Texas, USA, with a temporal span of 5 to 14 years. The dataset exhibits multi-dimensional diversity in seasonal illumination, building types, and geographical environments, containing 31,333 building change instances annotated by domain experts. In our experiments, the cropped images were uniformly resized to 256 × 256 pixels before being fed into the network. The dataset was partitioned into a training set (7120 pairs), a validation set (1024 pairs), and a testing set (2048 pairs).

(2) SHCD (Southwestern Hilly Cropland Change Detection Dataset) is a dedicated public benchmark dataset for cropland change detection in the southwestern hilly regions of China, constructed by the research team at Sichuan University of Science & Engineering. Sourced from Gaofen-2 satellite remote sensing imagery and cropped to a size of 256 × 256 pixels, it covers typical hilly areas in Southwest China, encompassing complex scenes with a temporal span of 4 years (2020–2024). Core change types—including farmlands, sloping lands, buildings, and roads—were annotated by a professional team and rigorously verified to ensure utmost precision. The data comprehensively captures the typical characteristics of hilly regions, such as topographic undulations, farmland fragmentation, and the interlocking of multiple land types, presenting multi-dimensional diversity in topography, vegetation coverage, and land-use types. For this study, the dataset was partitioned into a training set (871 pairs), a validation set (150 pairs), and a testing set (255 pairs). It is specifically tailored for the training, validation, and evaluation of cropland change detection algorithms in complex terrains, thereby filling the void of dedicated benchmark datasets for cropland change detection in hilly regions.

To evaluate the performance of the proposed method, four widely adopted metrics are utilized for quantitative assessment: Precision (P), Recall (R), F1-score, and Intersection over Union (IoU). They are computed for the change class from the pixel counts of true positives (TP), false positives (FP), and false negatives (FN):

$$P = \frac{TP}{TP + FP}$$

(13)

$$R = \frac{TP}{TP + FN}$$

(14)

$$F1 = \frac{2PR}{P + R}$$

(15)

$$IoU = \frac{TP}{TP + FP + FN}$$

(16)
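The four metrics can be computed directly from binary prediction and label maps, as in the sketch below (function names and the toy data are illustrative). Because F1 and IoU are both functions of TP, FP, and FN only, they satisfy the identity IoU = F1 / (2 − F1), which offers a quick consistency check on reported numbers.

```python
def change_metrics(pred, truth):
    """Precision, Recall, F1, and IoU for the change class (label 1) on flat binary maps."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, f1, iou

def iou_from_f1(f1):
    """Identity linking the two metrics: IoU = F1 / (2 - F1)."""
    return f1 / (2.0 - f1)

pred = [1, 1, 1, 0, 0, 0, 1, 0]
truth = [1, 1, 0, 1, 0, 0, 1, 0]
precision, recall, f1, iou = change_metrics(pred, truth)  # TP = 3, FP = 1, FN = 1
```

Applied to the F1-scores reported in this paper (92.04% and 83.64%), the identity gives IoU values that agree with the reported 85.26% and 71.89% to within rounding.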

3.2. Comparative Experiments

3.2.1. Comparative Analysis on the LEVIR-CD Dataset

Quantitative Analysis

This paper evaluates the performance of the proposed FACDNet on the LEVIR-CD dataset alongside several representative and recent state-of-the-art change detection networks, including STANet [46], FC_Siam_conc [48], SNUNet [49], BIT [50], HANet [51], IFN [52], ChangeFormer [53], MADNet [47], and CLFF [54], as detailed in Table 1. Our model achieves an F1-score of 92.04% and an IoU of 85.26%. These results indicate that the proposed method reaches a highly competitive performance level compared with the listed baselines. Specifically, even when compared with the recent and effective CLFF model (F1: 91.89%, IoU: 84.99%), FACDNet maintains a leading position in both key metrics. This suggests that the frequency-aware cross-layer mechanism effectively enhances the discriminative capacity for identifying building changes.

Compared with the Transformer-based BIT model, which possesses strong global modeling capabilities, FACDNet further elevates the recall rate from 90.45% to 91.96% and achieves a higher overall F1-score of 92.04%. Its Precision (92.12%) and Recall (91.96%) are also closely balanced. The improvement in these metrics supports the rationale of the proposed architectural design: pure CNN or Transformer models tend to over-smooth high-frequency edges in deep networks when processing high-resolution buildings. In contrast, the proposed Wavelet Interaction Block (WIB) employs explicit Haar wavelet decomposition to independently enhance the high-frequency subbands (LH, HL, HH) in the frequency domain, helping to preserve the regular geometric boundaries of buildings. Furthermore, the introduction of the Focal-Dice joint loss function encourages the model to focus on hard boundary samples while optimizing global structural similarity. This helps mitigate missed detections (false negatives), achieving a robust performance balance.

Furthermore, despite a relatively larger parameter count to ensure feature representational capacity, FACDNet maintains competitive computational efficiency with only 30.57 GFLOPs. Notably, when tested on a single RTX 3090 GPU, the practical inference speed of our complete model reaches 37.83 frames per second (FPS), which satisfies the typical requirements of real-time remote sensing applications.

Visualization Analysis

To visually compare the performance of the evaluated algorithms on the LEVIR-CD dataset, five representative image pairs are selected for qualitative analysis, as illustrated in Figure 6. Red areas indicate missed detections (false negatives), while blue areas represent false alarms (false positives). From the visual results, it is evident that many baseline models encounter challenges such as boundary adhesion and structural detail loss when processing complex building layouts.

In the second group featuring large-area regular buildings, early baseline models (e.g., FC_Siam_conc and SNUNet) exhibit significant edge overflow (blue false alarm edges) due to the loss of high-frequency details during deep spatial downsampling. While models like MADNet and CLFF mitigate these artifacts to some extent, they still show minor edge omissions or localized false alarms along the building perimeters. In contrast, the contours extracted by FACDNet are notably sharper and align more precisely with the ground truth. This performance supports the effectiveness of the Wavelet Interaction Block (WIB) in preserving orthogonal geometric edges through explicit frequency-domain decoupling.

For the third group, which contains dense and parallel factory buildings, methods relying on pure spatial-domain differences (such as STANet and IFN) suffer from severe feature aliasing, resulting in massive missed detections and topological fractures. Recent architectures like ChangeFormer, MADNet, and CLFF also encounter difficulties in completely separating these closely-spaced structures, often leading to edge adhesion or incomplete detection. Conversely, FACDNet demonstrates robust structural parsing capability, successfully identifying independent buildings without obvious adhesion. This is largely attributed to the synergistic effect of the progressive decoder and multi-scale feature fusion, which allows the model to leverage shallow high-frequency structural priors during resolution restoration.

Furthermore, in the fourth and fifth groups characterized by scattered small buildings within complex backgrounds, FACDNet effectively suppresses the interference from shadows and land-cover similarities. This consistent performance across diverse scenarios further validates the utility of the Focal-Dice joint loss in mining hard small-target samples.

3.2.2. Comparative Analysis on the SHCD Dataset

Quantitative Analysis

The SHCD dataset encompasses farmland scenes in southwestern hilly regions characterized by significant topographic undulations and high land-cover fragmentation, presenting severe spectral confusion and seasonal interference. Crucially, our comparison includes MADNet, the domain-specific baseline introduced alongside the SHCD dataset, and the recent CLFF. As presented in Table 2, this challenging dataset causes a noticeable performance drop across many early baseline models. While the domain-tailored MADNet achieves a strong F1-score of 82.06%, the proposed FACDNet exhibits even more robust cross-scene adaptability, achieving an F1-score of 83.64% and an IoU of 71.89% and outperforming the comparative methods. FACDNet also maintains a solid balance between precision (85.68%) and recall (81.70%), further demonstrating its capability to suppress seasonal noise while accurately identifying genuine agricultural land-cover changes.

This robust capability to capture minute and fragmented changes in complex agricultural scenarios is primarily attributed to the Cross-Layer Frequency Context Aggregator (CLFCA). Through a top-down semantic injection utilizing channel alignment, spatial upsampling, and element-wise addition, this mechanism endows shallow features with a global macroscopic perspective. This effectively mitigates the interference of vegetation phenological variations when delineating faint farmland boundaries. Consequently, while ensuring a high precision of 85.68%, the model substantially reduces the missed detection rate against complex backgrounds, maintaining a strong recall of 81.70%.

Visualization Analysis

Figure 7 illustrates the qualitative detection results of different algorithms on the SHCD dataset. In the first and fourth groups depicting large-scale contiguous farmland change scenarios, early baseline models (such as BIT and ChangeFormer) exhibit extensive missed detections (red areas) within the changed regions. Recent methods, including MADNet and CLFF, improve the regional completeness but still exhibit noticeable internal fragmentation or boundary omissions. This phenomenon generally occurs because traditional methods overly rely on pixel-level absolute differences, making them vulnerable to local response attenuation caused by uneven texture or illumination within the same farmland. In contrast, FACDNet maintains higher internal consistency within the changed regions. This is attributed to the Context-Guided Difference Fusion Module (CDFM), which not only extracts bidirectional difference features to preserve the physical evolution logic but also utilizes the cross-layer purified macroscopic semantics to perform global modulation, thereby mitigating the impact of local texture fluctuations on the overall change discrimination.

In the third and fifth groups containing narrow and elongated roads, intense seasonal vegetation disparities pose a severe challenge. Numerous comparative methods (such as FC_Siam_conc, HANet, and ChangeFormer) generate massive blocky false alarms (blue areas) in the unchanged farmland regions flanking the roads, erroneously identifying phenological succession as structural changes. While MADNet and CLFF reduce the area of these false alarms, they still mistakenly classify some severe seasonal variations as genuine changes. In contrast, the background regions predicted by FACDNet contain significantly fewer false alarms. This visual performance supports the value of the Cross-Layer Frequency Context Aggregator (CLFCA)—by injecting the noise-robust abstract semantics from the deep network in a step-wise, top-down manner, the model constructs a reliable spatial gating mask. Serving as a “filter” within the CDFM, this mask effectively suppresses the high-frequency pseudo-change noise induced by shallow illumination or seasonal disparities, thereby achieving more precise extraction of land-cover evolution despite the complex interferences inherent to hilly farmlands.

3.3. Ablation Experiments

To systematically evaluate the specific contributions of the proposed Wavelet Interaction Block (WIB), Cross-Layer Frequency Context Aggregator (CLFCA), and Context-Guided Difference Fusion Module (CDFM) to the model’s performance, this paper conducts a rigorous controlled-variable ablation study on both the LEVIR-CD and SHCD datasets. To guarantee strict variable control, for variants where the CDFM is not enabled, a non-parametric Naive Spatial Gating strategy is adopted to isolate and validate the intrinsic effectiveness of the context features. The network variants are defined as follows: M1 serves as the pure baseline model; M2 introduces the WIB; M3 incorporates both the WIB and CDFM; M4 incorporates both the WIB and CLFCA; and M5 represents the complete FACDNet.

To explicitly define the experimental settings, the aforementioned naive spatial gating mask $G_{naive} \in \mathbb{R}^{H \times W}$ used in variants without the CDFM (i.e., M1, M2, and M4) is calculated by taking the absolute difference of the bitemporal features $F_{1}$ and $F_{2}$, averaging it across the channel dimension without any learnable parameters, and applying a Sigmoid activation ($\sigma$):

$$G_{naive} = \sigma\left(\frac{1}{C} \sum_{c = 1}^{C} | F_{1,c} - F_{2,c} |\right)$$

(17)

The resulting single-channel mask is then multiplied with the concatenated features to perform basic spatial modulation. This primitive strategy serves as a strict baseline, enabling a clear and fair validation of the robust feature extraction capabilities provided by the proposed direction-aware CDFM.
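Equation (17) is simple enough to sketch with nested lists; the function name and the toy features below are illustrative assumptions. One property worth noting follows directly from the formula: the channel-averaged absolute difference is never negative, so the Sigmoid output lies in [0.5, 1). A pixel with identical bitemporal features is therefore scaled by 0.5 rather than suppressed toward zero.

```python
import math

def naive_gate(f1, f2):
    """Eq. (17): Sigmoid of the channel-averaged absolute feature difference.

    f1, f2: bitemporal features, each a list of C channel maps (H x W nested lists).
    Returns a single H x W gate map with no learnable parameters.
    """
    c = len(f1)
    h, w = len(f1[0]), len(f1[0][0])
    gate = []
    for y in range(h):
        row = []
        for x in range(w):
            mean_abs = sum(abs(f1[k][y][x] - f2[k][y][x]) for k in range(c)) / c
            row.append(1.0 / (1.0 + math.exp(-mean_abs)))
        gate.append(row)
    return gate

# Toy features: 2 channels on a 1 x 2 grid. The left pixel is unchanged,
# the right pixel differs by 4 in both channels.
f1 = [[[1.0, 0.0]], [[2.0, 0.0]]]
f2 = [[[1.0, 4.0]], [[2.0, -4.0]]]
g_naive = naive_gate(f1, f2)
```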

3.3.1. Ablation Analysis on the LEVIR-CD Dataset

LEVIR-CD comprises a vast collection of high-resolution building change samples, on which the baseline model already demonstrates substantial recognition performance. Table 3 presents the quantitative evaluation results for each network variant on the LEVIR-CD dataset.

Capability of Frequency-Domain Decoupling in Capturing Minute Features: Compared with the baseline M1, M2 achieves a simultaneous improvement in Precision (from 91.53% to 92.15%) and Recall (from 90.55% to 90.96%) after the introduction of the WIB module. In the LEVIR-CD dataset, the primary challenges are the missed detection of tiny buildings and the edge adhesion of dense building clusters. Through frequency-domain decoupling, the spatial attention mechanism of WIB significantly enhances high-frequency boundary features, enabling the network to delineate building contours more sharply, thereby providing a high-quality prior context.
Precise Discrimination of Change Semantics via Direction Awareness: After incorporating the CDFM, the Precision of M3 surges to 92.68%, which is the highest among all variants on this dataset. This strongly substantiates that the direction-aware difference mechanism can explicitly model the physical change processes of “appearance” (from nothing to something) and “disappearance” (from something to nothing). Consequently, it effectively suppresses false alarms in unchanged building clusters caused by variations in illumination angles.
Refinement of Spatial Gating by Cross-Layer Semantics: With the introduction of CLFCA into M2 to form M4, all evaluation metrics exhibit a steady increase. This demonstrates that the top-down injection of deep global semantics can effectively smooth out intra-building feature fractures. As a result, the generated spatial gating mask remains internally complete and pristine while preserving sharp boundary edges.
Synergistic Breakthrough of the Complete FACDNet: Although the performance of the LEVIR-CD baseline is approaching saturation, the complete M5 model still breaks through the bottleneck, achieving the optimal F1-score of 92.04% and an IoU of 85.26%. It perfectly amalgamates the high recall of M4 and the high precision of M3, proving the exceptional synergy of the three modules when processing remote sensing images. The visual effects of the ablation study are illustrated in Figure 8.

Furthermore, to reduce the potential influence of random initialization and verify the stability of the performance gains, we conducted repeated experiments for both the baseline (M1) and the complete FACDNet (M5) using three different random seeds (42, 1024, and 3047). The results demonstrate that the complete FACDNet achieves an average F1-score of 92.05% ± 0.10%, which consistently outperforms the pure baseline’s average F1-score of 91.04% ± 0.15%. This small spread across seeds indicates that the improvements brought by our proposed modules are robust rather than the result of random fluctuations.

3.3.2. Ablation Analysis on the SHCD Dataset

Compared with LEVIR-CD, the scenes within the SHCD dataset are considerably more complex, featuring a more limited sample size and a greater diversity of change categories. Consequently, this imposes far more stringent demands on the noise robustness and generalization capabilities of the models. Table 4 presents the performance of each network variant on this dataset.

Recall Capability for Concealed Changes: In the complex SHCD scenarios, a highly instructive phenomenon is observed: after M2 introduces the WIB module, the model’s Recall experiences a substantial leap (from 78.53% to 81.38%), whereas the Precision slightly decreases (from 85.56% to 84.17%). This is not because WIB introduces noise, but rather it is an inevitable physical manifestation of its high-frequency enhancement mechanism. While the spatial attention of WIB greatly enhances sensitivity to genuine change edges and reduces missed detections (resulting in an increased Recall), it inevitably amplifies “pseudo-change” features possessing high-frequency texture attributes, such as seasonal alternations and shadow fluctuations. This phenomenon proves that WIB successfully excavates a richer representation of features; however, in complex scenes, the raw frequency-domain context is hyper-sensitive and requires regularization by higher-dimensional semantics.
Directional Difference Filtering of Pseudo-Changes by CDFM: After M3 incorporates the CDFM, the Precision reaches an impressive 86.52%. In complex land-cover changes, the directional difference mechanism significantly bolsters the network’s capacity to extract the directional attributes of physical changes. By utilizing the context mask as a gate, it achieves precise suppression of interfering elements.
Hierarchical Semantic Suppression and Noise Mitigation by CLFCA: The introduction of CLFCA in M4 perfectly resolves the slight Precision drop observed in M2. By progressively injecting deep, macroscopic noise-robust semantics (e.g., “this is a farmland changing color with the seasons, not a structural building”) into the sensitive shallow high-frequency features, M4 successfully pulls the Precision back from 84.17% to 84.89% while maintaining a high Recall. This intuitively and profoundly demonstrates the core value of CLFCA in complex scenarios: employing globally consistent semantics to precisely quell high-frequency false responses, thereby extracting a high-purity, noise-resistant gating mask.
FACDNet: The Complete Synergistic Model: The complete entity, M5, achieves comprehensive superiority across all evaluation metrics, attaining an F1-score of 83.64% and an IoU of 71.89%. It perfectly inherits the acute sensitivity brought by the WIB (Recall: 81.70%), and minimizes false alarms through the semantic suppression of CLFCA and the direction identification of CDFM. This not only proves the effectiveness of the individual modules but also rigorously demonstrates the indispensable cascading synergy of the three components. The visual effects of the ablation study are illustrated in Figure 9.

To further validate the stability of our model under complex environmental conditions, we repeated the same multi-seed experiment on the SHCD dataset using three random seeds (42, 1024, and 3047). The complete FACDNet achieves a stable average F1-score of 83.38% ± 0.37%. Compared with the pure baseline’s average F1-score of 81.79% ± 0.32%, our model maintains a stable performance lead of 1.59 percentage points. The consistent gains across multiple initializations confirm that the proposed frequency-aware architecture effectively suppresses seasonal and illumination interference in the SHCD dataset, ensuring highly robust change detection performance.

3.4. Impact of Wavelet Basis Selection

As indicated in Table 5, while higher-order wavelets (db2 and sym2) are traditionally favored for continuous signal approximation, the Haar wavelet exhibits better overall performance in our specific change detection task, particularly in maintaining high Recall. This observation can be discussed through the trade-offs among wavelet properties, specifically spatial support length, phase characteristics, and their subsequent impact on the network’s attention mechanisms. In high-resolution remote sensing imagery, changing artificial targets, such as building contours, typically manifest as sharp step edges. Higher-order wavelets have more vanishing moments, which are highly beneficial for representing smooth textures but inherently require a longer spatial support length. When a wavelet with longer support is convolved with a sharp step edge, the high-frequency response spreads across adjacent pixels, smearing the boundary response. The Haar wavelet, with its minimal compact support (length 2), appears to localize the high-frequency energy of structural boundaries within a tighter pixel radius, which aligns better with the abrupt nature of spatial changes.

Furthermore, the phase characteristics of these wavelets play a crucial role in spatial alignment. The Haar wavelet is symmetric, offering linear phase, whereas higher-order Daubechies wavelets are asymmetric, introducing non-linear phase shifts during frequency-domain decoupling. In the context of our Wavelet Interaction Block (WIB), this potential spatial smearing coupled with phase misalignment may blur the structural activation masks within the Spatial Attention module. Such semantic smearing could reduce the activation confidence of the attention mechanism on subtle or irregular changes, likely contributing to the observed drop in Recall from 81.70% to 80.61%. Therefore, rather than being universally optimal, the Haar wavelet is demonstrated to be more suitable for this specific boundary-sensitive, fine-grained change detection task, although higher-order wavelets may still hold advantages in scenarios requiring smooth texture modeling or strong noise robustness.
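The support-length argument can be illustrated on a one-dimensional step edge. The sketch below slides the Haar and db2 high-pass (detail) filters over an ideal step with stride 1 and counts how many output positions respond. This is a simplified, undecimated 1D illustration under our own assumptions, not the 2D decimated transform used inside the WIB; the filter coefficients are the standard Daubechies values.

```python
import math

SQRT2, SQRT3 = math.sqrt(2.0), math.sqrt(3.0)

# Standard orthonormal low-pass (scaling) filters.
HAAR_LO = [1.0 / SQRT2, 1.0 / SQRT2]
DB2_LO = [(1 + SQRT3) / (4 * SQRT2), (3 + SQRT3) / (4 * SQRT2),
          (3 - SQRT3) / (4 * SQRT2), (1 - SQRT3) / (4 * SQRT2)]

def highpass(lo):
    """Quadrature-mirror relation: g[k] = (-1)^k * h[L - 1 - k]."""
    n = len(lo)
    return [(-1) ** k * lo[n - 1 - k] for k in range(n)]

def detail_response(signal, lo):
    """Slide the high-pass filter over the signal (stride 1, fully overlapping windows only)."""
    g = highpass(lo)
    n = len(g)
    return [sum(g[j] * signal[i + j] for j in range(n))
            for i in range(len(signal) - n + 1)]

def active_positions(response, tol=1e-9):
    """Number of output positions with a non-negligible response."""
    return sum(1 for v in response if abs(v) > tol)

step = [0.0] * 8 + [1.0] * 8   # ideal step edge

haar_width = active_positions(detail_response(step, HAAR_LO))  # 1 position
db2_width = active_positions(detail_response(step, DB2_LO))    # 3 positions
```

A filter of length L responds at the L − 1 window positions that straddle the edge, so the Haar response stays on a single position while the db2 response covers three.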

3.5. Validation of the Direction-Aware Mechanism

To deeply investigate the impact of the direction-aware mechanism and address the limitations of traditional difference operations (which often lose physical change directionality and suffer from mathematical cancellation), we conducted a dedicated comparative experiment on both the LEVIR-CD and SHCD datasets. We compared our proposed ReLU-based bidirectional difference strategy within the CDFM against the traditional Absolute Difference $(| F_{1} - F_{2} |)$ and Signed Difference $(F_{1} - F_{2})$.

As presented in Table 6, replacing our mechanism with Absolute and Signed Differences yielded notable performance drops across both datasets. On the LEVIR-CD dataset, the F1-scores of traditional methods (91.92% and 91.95%) were strictly lower than our proposed bidirectional difference (92.04%). This performance gap is even more pronounced on the complex SHCD dataset, where our method (83.64%) significantly outperforms the Absolute (83.25%) and Signed (83.34%) strategies. This explicitly proves that physically isolating appearance and disappearance features using an asymmetric non-linear activation (ReLU) effectively prevents mathematical cancellation, thereby extracting more discriminative representations in both building and cropland change scenarios.

Furthermore, to rigorously verify that our network learns genuine physical directionality rather than merely overfitting to the temporal input order, we performed a swapped temporal order test ($T_{2}$, $T_{1}$). The model achieved F1-scores of 92.02% (on LEVIR-CD) and 83.78% (on SHCD), exhibiting negligible performance fluctuations (only 0.02 and 0.14 percentage points, respectively) compared with the original temporal order ($T_{1}$, $T_{2}$). This stable performance across distinct datasets demonstrates the symmetric robustness of our directional feature extraction.
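The difference between the three strategies can be seen on a two-element toy feature. The sketch below is an illustration under our own naming (`d_12`, `d_21`), not the CDFM code: the absolute difference discards the sign, the signed difference cancels under a plain sum, and the ReLU-based pair keeps both directions in separate non-negative branches. Swapping the temporal order simply exchanges the two branches.

```python
def relu(x):
    return x if x > 0 else 0.0

def bidirectional_difference(f1, f2):
    """Return the two one-sided differences ReLU(F1 - F2) and ReLU(F2 - F1)."""
    d_12 = [relu(a - b) for a, b in zip(f1, f2)]
    d_21 = [relu(b - a) for a, b in zip(f1, f2)]
    return d_12, d_21

f1 = [0.0, 1.0]   # toy feature at the first date
f2 = [1.0, 0.0]   # toy feature at the second date

absolute = [abs(a - b) for a, b in zip(f1, f2)]   # [1.0, 1.0]: direction is lost
signed = [a - b for a, b in zip(f1, f2)]          # [-1.0, 1.0]: sums to zero
d_12, d_21 = bidirectional_difference(f1, f2)     # [0.0, 1.0] and [1.0, 0.0]

swapped = bidirectional_difference(f2, f1)        # same branches, exchanged
```

Since |a − b| = ReLU(a − b) + ReLU(b − a) and a − b = ReLU(a − b) − ReLU(b − a), the bidirectional pair carries all the information of both traditional operators while keeping the two directions apart.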

3.6. Interpretability Analysis of Frequency-Domain Decoupling and Cross-Layer Interaction

To deeply investigate the underlying physical mechanisms of FACDNet, this paper extracts the frequency-domain subband responses following WIB decomposition in the shallow layer (Stage 1) of the model and computes their absolute activation energy statistics. To faithfully represent the native receptive field of the features at this specific network layer, the visualizations are rendered without smooth interpolation, directly preserving the discrete pixel-grid morphology of the feature maps (as illustrated in Figure 10).

The visualization results clearly demonstrate that the Haar wavelet transform achieves the physical decoupling of features. The response of the low-frequency subband (LL) exhibits large-scale, continuous pixel-block activations, effectively and stably filling the interior of the farmland evolution regions. This verifies its capability to firmly capture the macroscopic semantics of land-cover transitions without being disturbed by the absence of local details. Conversely, the high-frequency subbands (LH + HL + HH) display highly dispersed grid-like activations. By superimposing these upon the high-resolution ground-truth boundaries (red lines), it becomes evident that the strongly activated high-frequency pixels precisely adhere to the geometric edges of irregular farmlands and narrow roads. This substantiates the extreme sensitivity of high-frequency features to spatial gradients, thereby validating the rationale of introducing spatial attention to enhance high-frequency edge textures.

Although high frequencies can precisely anchor boundaries, Figure 2, Figure 3, Figure 4 and Figure 5 expose a severe phenomenon: a massive number of high-frequency “bright spots” are also scattered across the background regions outside the red lines. The energy statistics chart on the left reveals the underlying logic of this phenomenon: in the low-frequency band, the energy of the changed regions (orange bars) maintains an absolute dominance; however, in the high-frequency bands, the energy of the unchanged regions (blue bars) is abnormally high, even closely approximating the energy of the genuinely changed regions.

These measurements expose a key difficulty of change detection in complex remote sensing scenes: illumination fluctuations and seasonal phenological changes in hilly areas produce substantial high-frequency noise in the shallow layers. Relying directly on such raw high-frequency difference features would let background noise overwhelm the model and cause severe false alarms. This motivates the proposed design: the CLFCA injects deep, noise-robust global semantics in a “top-down” manner, and the resulting spatial mask filters the pervasive high-frequency noise within the CDFM. As a result, genuinely changed pixel blocks are preserved while the pseudo-change interference typical of complex terrain is suppressed.
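The gating idea can be illustrated with a deliberately simplified sketch: a coarse map from a deeper stage is upsampled, squashed to the range (0, 1), and multiplied with a noisy shallow difference response. This is not the CLFCA or CDFM as defined in the paper; the function `gate_shallow_difference`, the nearest-neighbour upsampling, the sigmoid gate, and the input values are all assumptions chosen to show the mechanism.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gate_shallow_difference(shallow_diff, deep_logits):
    """Suppress shallow difference responses with a mask derived from deeper features.

    shallow_diff: (H, W) difference response from a shallow stage.
    deep_logits:  (H/2, W/2) change evidence from a deeper stage (hypothetical input).
    """
    # Nearest-neighbour upsampling by a factor of 2 along both axes.
    up = np.repeat(np.repeat(deep_logits, 2, axis=0), 2, axis=1)
    gate = sigmoid(up)  # spatial mask with values in (0, 1)
    return gate * shallow_diff

# Shallow response that fires everywhere (high-frequency "bright spots").
shallow_diff = np.ones((4, 4))

# Deeper stage is confident about change only in the top-left quadrant.
deep_logits = np.array([[6.0, -6.0],
                        [-6.0, -6.0]])

gated = gate_shallow_difference(shallow_diff, deep_logits)
```

After gating, responses in the top-left quadrant stay close to their original value, while responses elsewhere are driven towards zero, which is the filtering role the text assigns to the top-down mask.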

4. Conclusions

To address two core challenges in high-resolution bitemporal remote sensing change detection, namely missed detections of small targets and pseudo-change false alarms in complex scenes, this paper proposes the Frequency-Aware Cross-Layer Change Detection Network (FACDNet). Unlike methods that rely primarily on spatial-domain feature processing, FACDNet organizes feature processing into three stages: frequency-domain extraction, semantic purification, and difference fusion. First, the Wavelet Interaction Block (WIB) captures and reinforces high-frequency edge and texture features, reducing missed detections of subtle changes. Second, the Cross-Layer Frequency Context Aggregator (CLFCA) injects deep semantics top-down to suppress the high-frequency pseudo-change noise induced by illumination or seasonal variation, refining the coarse frequency-domain context into a cleaner spatial gating mask. Finally, the Context-Guided Difference Fusion Module (CDFM) explicitly models the change direction of the bitemporal features and uses the purified mask to suppress interference. Comparative and ablation experiments on two public datasets, LEVIR-CD and SHCD, show that the three modules are complementary: the full model attains the highest F1-score in both ablation studies (Tables 3 and 4). FACDNet achieves highly competitive performance against existing methods, with F1-scores of 92.04% and 83.64%, respectively, and a good balance between Precision and Recall. The qualitative results likewise show few false alarms and high internal consistency while preserving sharp change boundaries.

Despite its strong detection performance, FACDNet's frequency-domain transforms and multi-scale cross-layer interactions incur computational overhead: in Table 1 it has the largest parameter count among the compared models (77.97 M parameters at 30.57 G FLOPs). In future work, we will explore more lightweight frequency-domain decoupling operators to improve inference efficiency while maintaining accuracy. We also intend to extend this frequency-aware cross-layer architecture to unsupervised change detection and to cross-modal change detection (e.g., optical and SAR imagery), broadening its applicability to complex real-world remote sensing scenarios.

Author Contributions

Conceptualization, L.Z. (Liangjun Zhao) and C.Z.; methodology, L.Z. (Liangjun Zhao); software, L.Z. (Liangjun Zhao); validation, L.Z. (Liangjun Zhao), L.Z. (Lei Zhang) and Z.Z.; formal analysis, L.Z. (Liangjun Zhao); investigation, L.Z. (Liangjun Zhao); resources, C.Z.; data curation, Z.Z.; writing—original draft preparation, L.Z. (Liangjun Zhao); writing—review and editing, C.Z. and L.Z. (Lei Zhang); visualization, Z.Z.; supervision, C.Z.; project administration, C.Z.; funding acquisition, C.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Sichuan Science and Technology Program, grant number 2023YFS0371; and the Sichuan Smart Tourism Research Base Project, grant number ZHZJ24-01.

Data Availability Statement

The LEVIR-CD dataset analyzed in this study is publicly available at https://justchenhao.github.io/LEVIR/ (accessed on 26 May 2026). The SHCD dataset used in this study is openly available on GitHub at https://github.com/LeeJiEunx/SHCD.git (accessed on 26 May 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

References

Parelius, E.J. A review of deep-learning methods for change detection in multispectral remote sensing images. Remote Sens. 2023, 15, 2092.
Yu, C.; Yang, H.; Ma, L.; Yang, J.; Jin, Y.; Zhang, W.; Zhao, Q. Deep learning-based change detection in remote sensing: A comprehensive review. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 1–25.
Zheng, Z.; Zhong, Y.; Wang, J.; Ma, A.; Zhang, L. Building damage assessment for rapid disaster response with a deep object-based semantic change detection framework: From natural disasters to man-made disasters. Remote Sens. Environ. 2021, 265, 112636.
Rolih, B.; Fučka, M.; Wolf, F.; Zajc, L.Č. Be the Change You Want to See: Revisiting Remote Sensing Change Detection Practices. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4704811.
Wen, D.; Huang, X.; Bovolo, F.; Li, J.; Ke, X.; Zhang, A.; Benediktsson, J.A. Change detection from very-high-spatial-resolution optical remote sensing images: Methods, applications, and future directions. IEEE Geosci. Remote Sens. Mag. 2021, 9, 68–101.
Toth, D.; Aach, T.; Metzler, V. Illumination-invariant change detection. In Proceedings of the 4th IEEE Southwest Symposium on Image Analysis and Interpretation, Austin, TX, USA, 2–4 April 2000; pp. 3–7.
Elisy, M.M.; Xiao, X.; Sharouda, M.H.; Gong, J.; Li, D. ReSCD-Net: Resolution-Aware Deep Learning for Building Change Detection in High-Resolution Satellite Imagery. IEEE Trans. Geosci. Remote Sens. 2026, 64, 1–20.
Sun, C.; Chen, H.; Du, C.; Jing, N. SemiBuildingChange: A semi-supervised high-resolution remote sensing image building change detection method with a pseudo bitemporal data generator. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5622319.
Liu, Y.; Pang, C.; Zhan, Z.; Zhang, X.; Yang, X. Building change detection for remote sensing images using a dual-task constrained deep siamese convolutional network model. IEEE Geosci. Remote Sens. Lett. 2020, 18, 811–815.
Liu, W.; He, J.; Zhong, Y.; Yu, Y.; Luo, Z.; Guan, H.; Li, J. Semi-Supervised Building Change Detection From Bi-Temporal Remote Sensing Images Leveraging Visual-Language Models and Consistency Learning. IEEE Trans. Geosci. Remote Sens. 2026, 64, 5610015.
Zheng, Z.; Ermon, S.; Kim, D.; Zhang, L.; Zhong, Y. Changen2: Multi-temporal remote sensing generative change foundation model. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 47, 725–741.
Peng, D.; Zhang, Y.; Guan, H. End-to-end change detection for high resolution satellite images using improved UNet++. Remote Sens. 2019, 11, 1382.
Chen, J.; Yuan, Z.; Peng, J.; Chen, L.; Huang, H.; Zhu, J.; Li, H. DASNet: Dual attentive fully convolutional Siamese networks for change detection in high-resolution satellite images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 14, 1194–1206.
Li, Q.; Zhong, R.; Du, X.; Du, Y. TransUNetCD: A hybrid transformer network for change detection in optical remote-sensing images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5622519.
Feng, Y.; Jiang, J.; Xu, H.; Zheng, J. Change detection on remote sensing images using dual-branch multilevel intertemporal network. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4401015.
Tang, H.; Li, Z.; Zhang, D.; He, S.; Tang, J. Divide-and-Conquer: Confluent Triple-Flow Network for RGB-T Salient Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 47, 1958–1974.
Hu, Q.; Wang, Y.; Zhang, Y.; Yang, W.; Xu, F.; Xia, G.S. STAR-CD: Style-Aligned Remote Sensing Change Detection with Appearance-Relation Modeling. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5657115.
Chen, H.; Song, J.; Han, C.; Xia, J.; Yokoya, N. ChangeMamba: Remote sensing change detection with spatiotemporal state space model. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4409720.
Ma, R.; Zhang, Y.; Zhang, B.; Fang, L.; Huang, D.; Qi, L. Learning attention in the frequency domain for flexible real photograph denoising. IEEE Trans. Image Process. 2024, 33, 3707–3721.
Xie, Z.; Miao, S.; Zhang, Z.; Li, X.; Huang, J. Frequency Domain Feature Interaction Combined with Multi-Scale Attention for Remote Sensing Change Detection. IEEE Sens. J. 2025, 25, 29284–29295.
Li, J.; Shao, F.; Liu, Q.; Meng, X. Global-Local Collaborative Learning Network for Optical Remote Sensing Image Change Detection. Remote Sens. 2024, 16, 2341.
Ma, M.; Wang, B.; Zhang, Y.; Che, K.; Xu, L. Adaptive object detection based on cross-domain information decoupling. In Proceedings of the International Conference on Remote Sensing Technology and Image Processing (RSTIP 2024), Online, 15–17 June 2025; Volume 13640, pp. 130–139.
Asali, E. Spatiotemporal Multimodal Representation Learning for Activity Understanding and Behavior Monitoring. Ph.D. Thesis, University of Georgia, Athens, GA, USA, 2025.
Chen, Y.; Feng, S.; Zhao, C.; Su, N.; Li, W.; Tao, R.; Ren, J. High-resolution remote sensing image change detection based on Fourier feature interaction and multiscale perception. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5539115.
Liu, J.; Chen, H.; Gu, H.; Pan, Y.; Chen, H.; Tian, E.; Li, Z. MewCDNet: A Wavelet-Based Multi-Scale Interaction Network for Efficient Remote Sensing Building Change Detection. Comput. Mater. Contin. 2026, 86, 1.
Wang, C.; Sun, W.; Fan, D.; Liu, X.; Zhang, Z. Adaptive feature weighted fusion nested U-Net with discrete wavelet transform for change detection of high-resolution remote sensing images. Remote Sens. 2021, 13, 4971.
Jiang, M.; Chen, Y.; Dong, Z.; Liu, X.; Zhang, X.; Zhang, H. Multiscale fusion CNN-transformer network for high-resolution remote sensing image change detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 5280–5293.
Ding, S.; Lu, X.; Liu, R.; Yang, Y.; Gu, H.; Li, H. Cross-layer hybrid feature aggregation network for change detection in very high-resolution remote sensing images. J. Appl. Remote Sens. 2025, 19, 016504.
Gao, S.; Li, D.; Pan, E.; Guo, H.; Wei, J.; Liu, J.; Sun, K. Change-prior guided cross-scale interaction network for remote sensing image change detection. Geo-Spat. Inf. Sci. 2025, 1–20.
Zhong, H.; Wu, C.; Xiao, Z. LRNet: Change detection in high-resolution remote sensing imagery via a localization-then-refinement strategy. Remote Sens. 2025, 17, 1849.
Tang, Y.; Feng, S.; Zhao, C.; Fan, Y.; Shi, Q.; Li, W.; Tao, R. An object fine-grained change detection method based on frequency decoupling interaction for high-resolution remote sensing images. IEEE Trans. Geosci. Remote Sens. 2023, 62, 5600213.
Hou, W.; Wang, Y.; Su, J.; Hou, Y.; Zhang, M.; Shang, Y. Multi-scale bilateral spatial direction-aware network for cropland extraction based on remote sensing images. IEEE Access 2023, 11, 109997–110009.
Ma, X.; Xie, J.; Shao, D.; Yao, A.; Dong, C. RA3T: An Innovative Region-Aligned 3D Transformer for Self-Supervised Sim-to-Real Adaptation in Low-Altitude UAV Vision. Electronics 2025, 14, 2797.
Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986.
Li, X.; Ding, H.; Zhu, Y. Wavelet-guided Multi-scale Edge Fusion Network for Aerial Object Detection. Digit. Signal Process. 2026, 173, 105946.
Cao, S.; Tang, B.; Liang, W.; Chen, Y. Frequency-aware dual-domain network for remote sensing optical image change detection. J. Appl. Remote Sens. 2026, 20, 018504.
Han, P.; Zhao, B.; Li, X. Progressive feature interleaved fusion network for remote-sensing image salient object detection. IEEE Trans. Geosci. Remote Sens. 2023, 62, 5500414.
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the 9th International Conference on Learning Representations (ICLR), Virtual Event, 3–7 May 2021.
Lu, C.; Wang, F.; Wang, Z.; Xu, N.; You, Z.; Huang, D.S. BWFNet: Bitemporal Wavelet Frequency Network for Change Detection in High-Resolution Remote Sensing Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 25562–25582.
Zhu, Y.; Cheng, D.; Li, J. DWFDA-CD: A Diffusion Model-Based Wavelet Frequency Domain Attention Network for Remote Sensing Change Detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 27453–27472.
Zheng, Z.; Wan, Y.; Zhang, Y.; Xiang, S.; Peng, D.; Zhang, B. CLNet: Cross-layer convolutional neural network for change detection in optical remote sensing imagery. ISPRS J. Photogramm. Remote Sens. 2021, 175, 247–267.
Ding, L.; Guo, H.; Liu, S.; Mou, L.; Zhang, J.; Bruzzone, L. Bi-temporal semantic reasoning for the semantic change detection in HR remote sensing images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5620014.
Zheng, Z.; Zhong, Y.; Tian, S.; Ma, A.; Zhang, L. ChangeMask: Deep multi-task encoder-transformer-decoder architecture for semantic change detection. ISPRS J. Photogramm. Remote Sens. 2022, 183, 228–239.
Chen, P.; Zhang, B.; Hong, D.; Chen, Z.; Yang, X.; Li, B. FCCDN: Feature constraint network for VHR image change detection. ISPRS J. Photogramm. Remote Sens. 2022, 187, 101–119.
Lu, H.; Liu, F.; Du, Q.; Duan, R. Detail-Preserving Dual-Branch Downsampling for Small Object Detection. In Proceedings of the 2025 2nd International Conference on Intelligent Communication, Sensing and Electromagnetics (ICSE), Shanghai, China, 20–22 December 2025; pp. 130–135.
Chen, H.; Shi, Z. A spatial-temporal attention-based method and a new dataset for remote sensing image change detection. Remote Sens. 2020, 12, 1662.
Zhao, L.; Xi, Y.; Wang, Y.; Ning, F.; He, Z.; Liang, G.; Zhang, Y. MADNet: Cropland change detection network for the complex terrain and dense vegetation hilly region in the Southwestern China. Vis. Comput. 2025, 41, 5835–5854.
Daudt, R.C.; Le Saux, B.; Boulch, A. Fully convolutional siamese networks for change detection. In Proceedings of the 2018 25th IEEE International Conference on Image Processing (ICIP), Athens, Greece, 7–10 October 2018; pp. 4063–4067.
Fang, S.; Li, K.; Shao, J.; Li, Z. SNUNet-CD: A densely connected Siamese network for change detection of VHR images. IEEE Geosci. Remote Sens. Lett. 2021, 19, 8007805.
Chen, H.; Qi, Z.; Shi, Z. Remote sensing image change detection with transformers. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5607514.
Han, C.; Wu, C.; Guo, H.; Hu, M.; Chen, H. HANet: A hierarchical attention network for change detection with bitemporal very-high-resolution remote sensing images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 3867–3878.
Zhang, C.; Yue, P.; Tapete, D.; Jiang, L.; Shangguan, B.; Huang, L.; Liu, G. A deeply supervised image fusion network for change detection in high resolution bi-temporal remote sensing images. ISPRS J. Photogramm. Remote Sens. 2020, 166, 183–200.
Bandara, W.G.C.; Patel, V.M. A transformer-based siamese network for change detection. In Proceedings of the IGARSS 2022-2022 IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 17–22 July 2022; pp. 207–210.
Meng, X.; Qiu, C.; Liu, C.; Xu, Y. A Cross-Layer Feature Fusion Framework with Hierarchical Interaction for Remote Sensing Change Detection. Sensors 2026, 26, 1176.

Figure 1. Overall model architecture.

Figure 2. Architecture of the shared-weight ConvNeXt-Small encoder.

Figure 3. Structure of WIB.

Figure 4. Structure of CLFCA.

Figure 5. Structure of CDFM.

Figure 6. Visualization of different algorithms on the LEVIR-CD dataset. White represents true positives, black represents true negatives, red indicates false positives, and blue indicates false negatives.

Figure 7. Visualization of different algorithms on the SHCD dataset. White represents true positives, black represents true negatives, red indicates false positives, and blue indicates false negatives.

Figure 8. Visualization of the ablation study on the LEVIR-CD dataset. White represents true positives, black represents true negatives, red indicates false positives, and blue indicates false negatives.

Figure 9. Visualization of the ablation study on the SHCD dataset. White represents true positives, black represents true negatives, red indicates false positives, and blue indicates false negatives.

Figure 10. Frequency-domain responses and energy statistics of FACDNet in complex hilly farmland (SHCD) scenarios. The response maps preserve the discrete pixel morphology of the native spatial resolution, demonstrating that the low-frequency (LL) pixel blocks capture macroscopic consistency, whereas the high-frequency pixel blocks precisely anchor the genuine boundaries (indicated by red lines).

Table 1. Comparative experiment of different algorithms on LEVIR-CD dataset.

Model	P/%	R/%	F1/%	IoU/%	Params(M)	FLOPs(G)
STANet	83.81	91.01	87.26	77.40	16.93	13.08
FC_Siam_conc	88.25	80.29	84.08	72.53	1.54	4.70
SNUNet	90.07	90.07	90.07	81.93	12.04	54.80
BIT	92.32	90.45	91.38	84.12	3.50	10.63
HANet	92.20	89.65	90.91	83.33	11.50	25.30
IFN	89.41	92.33	90.84	83.22	36.00	79.04
ChangeFormer	91.36	89.71	90.53	82.69	16.63	85.35
MADNet	90.96	88.94	89.94	81.71	3.52	8.54
CLFF	93.26	90.56	91.89	84.99	27.62	61.20
FACDNet	92.12	91.96	92.04	85.26	77.97	30.57
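The metric columns are linked by standard identities: F1 is the harmonic mean of Precision and Recall, and for a single class IoU = F1 / (2 − F1), which follows from IoU = TP / (TP + FP + FN) and F1 = 2TP / (2TP + FP + FN). The sketch below applies these identities to the FACDNet row of Table 1 as a consistency check; the helper names are illustrative, and because the inputs are already rounded to two decimals, the recomputed IoU is only expected to match the reported value approximately.

```python
def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall (both given as fractions in [0, 1])."""
    return 2.0 * precision * recall / (precision + recall)

def iou_from_f1(f1):
    """Single-class IoU implied by an F1-score (fraction in [0, 1])."""
    return f1 / (2.0 - f1)

# FACDNet row of Table 1: P = 92.12 %, R = 91.96 %.
precision, recall = 0.9212, 0.9196

f1 = f1_from_pr(precision, recall)
iou = iou_from_f1(f1)

# Convert back to percentages for comparison with the table.
f1_percent = 100.0 * f1
iou_percent = 100.0 * iou
```

The recomputed F1 rounds to the reported 92.04%, and the recomputed IoU agrees with the reported 85.26% to within the rounding error of the inputs.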

Table 2. Comparative experiment of different algorithms on SHCD dataset.

Model	P/%	R/%	F1/%	IoU/%
STANet	83.60	70.57	76.53	61.99
FC_Siam_conc	80.91	74.12	77.37	63.09
SNUNet	82.34	77.37	79.78	66.36
BIT	80.62	76.80	78.66	64.83
HANet	83.33	76.68	79.86	66.48
IFN	82.57	79.23	80.87	67.88
ChangeFormer	84.83	75.02	79.62	66.14
MADNet	83.18	80.96	82.06	69.57
CLFF	85.61	75.62	80.30	67.09
FACDNet	85.68	81.70	83.64	71.89

Table 3. Ablation experiments of each module on LEVIR-CD.

Baseline	WIB	CDFM	CLFCA	P/%	R/%	F1/%	IoU/%
√				91.53	90.55	91.04	83.55
√	√			92.15	90.96	91.55	84.42
√	√	√		92.68	90.35	91.50	84.34
√	√		√	92.29	91.12	91.70	84.67
√	√	√	√	92.12	91.96	92.04	85.26

The “√” symbol indicates that the corresponding module is included in the experiment.

Table 4. Ablation experiments of each module on SHCD.

Baseline	WIB	CDFM	CLFCA	P/%	R/%	F1/%	IoU/%
√				85.56	78.53	81.90	69.35
√	√			84.17	81.38	82.75	70.58
√	√	√		86.52	79.79	83.02	70.97
√	√		√	84.89	80.96	82.88	70.76
√	√	√	√	85.68	81.70	83.64	71.89

The “√” symbol indicates that the corresponding module is included in the experiment.

Table 5. Performance comparison of different wavelet bases in WIB on the SHCD dataset.

Wavelet Basis	F1/%	P/%	R/%	IoU/%
Daubechies 2 (db2)	83.37	85.60	81.25	71.48
Symlets 2 (sym2)	83.45	86.50	80.61	71.60
Haar (Ours)	83.64	85.68	81.70	71.89

Table 6. Effectiveness of different difference strategies in CDFM across two datasets.

Difference Strategy	LEVIR-CD F1/%	LEVIR-CD P/%	LEVIR-CD R/%	LEVIR-CD IoU/%	SHCD F1/%	SHCD P/%	SHCD R/%	SHCD IoU/%
Absolute Difference (|F1−F2|)	91.92	92.90	90.95	85.04	83.25	84.68	81.87	71.31
Signed Difference (F1−F2)	91.95	92.81	91.11	85.10	83.34	83.35	83.32	71.43
Ours (Bidirectional)	92.04	92.12	91.96	85.26	83.64	85.68	81.70	71.89
Ours (Swapped Order)	92.02	92.11	91.93	85.22	83.78	85.57	82.06	72.09
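To make the difference strategies in Table 6 concrete, the sketch below contrasts absolute and signed differences with one possible direction-aware form. The bidirectional variant shown here (two rectified one-sided differences stacked as channels) is an assumption for illustration only and is not the CDFM definition from the paper; `feat_t1` and `feat_t2` stand for the bitemporal features written F1 and F2 in the table.

```python
import numpy as np

def absolute_difference(feat_t1, feat_t2):
    """Magnitude of change only; the direction of change is discarded."""
    return np.abs(feat_t1 - feat_t2)

def signed_difference(feat_t1, feat_t2):
    """Keeps the sign, so increases and decreases can cancel in later sums."""
    return feat_t1 - feat_t2

def bidirectional_difference(feat_t1, feat_t2):
    """Hypothetical direction-aware variant (not the paper's definition).

    Channel 0 holds responses that decreased from t1 to t2,
    channel 1 holds responses that increased from t1 to t2.
    """
    decreased = np.maximum(feat_t1 - feat_t2, 0.0)
    increased = np.maximum(feat_t2 - feat_t1, 0.0)
    return np.stack([decreased, increased], axis=0)

# Toy features: one location loses a response, another gains one.
feat_t1 = np.array([[1.0, 0.0],
                    [0.5, 0.5]])
feat_t2 = np.array([[0.0, 1.0],
                    [0.5, 0.5]])

abs_d = absolute_difference(feat_t1, feat_t2)
sgn_d = signed_difference(feat_t1, feat_t2)
bi_d = bidirectional_difference(feat_t1, feat_t2)
bi_swapped = bidirectional_difference(feat_t2, feat_t1)
```

In this toy case the absolute difference gives the same response to the disappearing and the appearing structure, whereas the two-channel form keeps them apart. Under this particular formulation, swapping the temporal order only permutes the two channels, and the sum of the channels equals the absolute difference.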

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
