Article

An Improved Symmetric Network with Feature Difference and Receptive Field for Change Detection

1 Graduate School, Space Engineering University, Beijing 101416, China
2 School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China
3 Department of Architecture and Civil Engineering, City University of Hong Kong, Hong Kong 999077, China
4 School of Computer Engineering, Guilin University of Electronic Technology, Beihai 536000, China
5 Basic Teaching Department, Guilin University of Electronic Technology, Beihai 536000, China
* Authors to whom correspondence should be addressed.
Symmetry 2025, 17(7), 1095; https://doi.org/10.3390/sym17071095
Submission received: 9 June 2025 / Revised: 28 June 2025 / Accepted: 2 July 2025 / Published: 8 July 2025
(This article belongs to the Section Computer)

Abstract

Change detection (CD) is essential for Earth observation tasks, as it identifies alterations in specific geographic areas over time. The advancement of deep learning has significantly improved the accuracy of CD. However, encoder–decoder architectures often struggle to effectively capture temporal differences. Encoders may lose critical spatial details, while decoders can introduce bias due to inconsistent receptive fields across layers. To address these limitations, this paper proposes an enhanced symmetric network, termed FDRF (feature difference and receptive field), which incorporates two novel components: the multibranch feature difference extraction (MFDE) module and the adaptive ensemble decision (AED) module. MFDE independently extracts differential features from bitemporal images at each encoder layer, using multiscale fusion to retain image content and improve the quality of feature difference modeling. AED assigns confidence weights to predictions from different decoder layers based on their receptive field sizes and then combines them adaptively to reduce scale-related bias. To validate the effectiveness and robustness of FDRF, experiments were conducted on five public datasets (SYSU, LEVIR-CD, WHU, NJDS, and CLCD), as well as a UAV-based dataset collected from two national coastal nature reserves in Guangxi Beihai, China. The results demonstrate that FDRF consistently outperforms existing methods in accuracy and robustness across diverse scenarios.

1. Introduction

Change detection (CD) plays a central role in remote sensing applications such as urban development monitoring, ecosystem management, and coastal wetland assessment [1]. By comparing satellite or aerial images taken at different times, CD techniques identify alterations in land cover resulting from natural or anthropogenic factors [2]. Traditional CD approaches, including pixel-wise comparison methods like Euclidean distance or correlation coefficients, often struggle to capture complex spatial and temporal variations in high-resolution imagery [3].
To overcome these limitations, deep learning models, particularly those based on encoder–decoder architectures, have gained prominence in the CD field. These frameworks use paired inputs to learn hierarchical features and generate predictions of change regions [4,5,6,7]. Variants such as single-decoder networks [8,9,10,11] and dual encoder–decoder structures [12,13] have been developed to handle the challenges of bitemporal image comparison. While attention mechanisms and multi-scale feature extraction have further improved performance [14], several challenges persist.
Despite these advancements, dual encoder–decoder networks often suffer from feature degradation and inadequate information exchange between components. During the encoding stage, aggressive downsampling often results in the loss of fine-grained spatial information. In the decoding phase, deeper layers may introduce distortion, and the network may struggle to accurately model changes at multiple scales. Moreover, the lack of explicit mechanisms for feature interaction between encoders and decoders can hinder effective information fusion, leading to suboptimal change detection accuracy [15].
These challenges point to a gap in the existing literature: the need for a more robust mechanism to retain critical spatial features while enhancing the interpretability and effectiveness of multi-scale predictions. Prior studies have attempted to compute feature differences within encoder structures [8,16,17], but this can lead to degraded performance due to information loss and distortion through deeper layers.
To address these challenges, this paper proposes two complementary modules: MFDE and AED. Rather than fusing feature differences inside the encoder, MFDE uses dual branches, one for differential modeling and another for preserving original content, to independently compute change features at each scale. This design preserves essential spatial information and improves the localization of changes. AED, integrated into the decoder, assigns adaptive weights to features from different scales based on their receptive field coverage. It then combines these weighted outputs to produce the final CD prediction, reducing scale-related bias.
Experiments were conducted on multiple datasets encompassing buildings, land, and wetlands. Additionally, as shown in Figure 1, we introduce a new UAV-based dataset from the Guangxi Beihai Coast Nature Reserves (GBCNR), which captures tidal-driven temporal changes in coastal wetlands. This dataset also serves as a robust benchmark for evaluating model generalizability and performance in complex environments.
In summary, the main contributions of this paper are as follows:
(1)
The proposal of the MFDE module, which extracts feature differences while preserving original spatial content and enhances multiscale feature fusion.
(2)
The development of the AED module, which dynamically weights decoder outputs based on receptive field scales to improve inference accuracy.
(3)
The construction of a UAV-based dataset (GBCNR), offering a high-quality benchmark for evaluating CD models in coastal wetland environments.

2. Related Works

With the growing demand for deep learning, researchers have explored the field extensively, creating numerous artificial neural networks by analyzing and simulating the human nervous system [18,19,20,21,22]. In the field of computer vision, concepts such as the feature difference and the receptive field have been proposed by simulating the human eye’s processing of visual signals. These concepts have informed the construction of various network modules that are now effectively applied in change detection tasks.

2.1. Feature Difference

To obtain the feature difference between two images, FC-Siam-diff [8] takes the absolute value of the difference between the same-scale encoder outputs of its symmetric architecture and concatenates it with the same-scale decoder’s feature vectors as the input to the next network layer. The feature differences obtained in the encoding stage are passed to the decoding stage via skip connections, allowing the network to accurately recognize the change map between the two images. BIT [18] also applies the concept of the feature difference by subtracting the output features of the symmetric architecture’s bitemporal decoders and taking the absolute value before inputting the result into the prediction network, aiming to obtain the final change map through differential information. Unlike direct subtraction, ChangeFormer [19] refines the concept of the feature difference by concatenating the bitemporal image features in the encoding stage and then applying convolutional layers and activation functions. This architecture allows the module to learn an optimal distance metric, thereby improving the quality of the feature difference without directly using subtraction. To focus on the changing features, IDE [23] decreases the distance between invariant features and enlarges the distance between changing features. FD-Net [16] extracts deep features from remote sensing images obtained from different sensors with varying spatial resolutions. BDFR [24] makes use of the information of multilevel feature maps from the subnetwork.
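For reference, the absolute-difference-plus-skip-connection pattern used by FC-Siam-diff can be sketched in a few lines of PyTorch; the function name and tensor shapes below are illustrative and not taken from the cited implementation.

```python
import torch

# Sketch of the FC-Siam-diff pattern described above: the absolute difference of
# same-scale encoder outputs is concatenated with the decoder feature at that
# scale before the next decoding layer. Shapes and names are illustrative.
def siam_diff_skip(enc_t1: torch.Tensor, enc_t2: torch.Tensor,
                   dec_feat: torch.Tensor) -> torch.Tensor:
    diff = torch.abs(enc_t1 - enc_t2)          # element-wise |T1 - T2|
    return torch.cat([dec_feat, diff], dim=1)  # concatenate along the channel axis

enc_t1 = torch.randn(2, 64, 64, 64)   # (batch, channels, H, W)
enc_t2 = torch.randn(2, 64, 64, 64)
dec_feat = torch.randn(2, 64, 64, 64)
out = siam_diff_skip(enc_t1, enc_t2, dec_feat)  # -> (2, 128, 64, 64)
```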
However, these methods tend to be overly simplistic in constructing the feature difference, leading to a significant loss of differential information during the element-wise subtraction of bitemporal features. Consequently, they struggle to effectively distinguish true changes. For example, in tasks such as building detection, irrelevant elements like shadows can erroneously influence the model’s judgment. Achieving a good representation of the feature difference between bitemporal images is key to obtaining an accurate change map, making the modeling of high-quality feature differences a critical issue in change detection tasks.

2.2. Receptive Field

The receptive field is one of the basic concepts in deep learning [25,26,27,28], referring to the region of the input that a unit in a given layer of the network can perceive. In fully connected networks, the value of each unit depends on all inputs; in convolutional networks, the value of each unit depends only on a specific region of the previous layer. Given that convolutional structures are widely used in deep learning networks, the concept of the receptive field runs throughout the entire model framework. Dilated convolution [29] expands the receptive field of each unit in a normal convolution by inserting gaps within the kernel, which enlarges the receptive field without adding parameters or changing the feature map size, thereby enhancing network efficiency. On this basis, RFB [30] simulates the human visual receptive field using dilated convolutions with three rates (1, 3, and 5), followed by concatenation to fuse the different feature inputs. The LRF [23] strategy employs different kernel shapes in different spatial dimensions to obtain features. RC [17] dynamically generates a receptive field kernel based on the convolution kernel’s size. DRFNet [31] designs a transformer-style DRF module for object detection, which defines the correlation coefficient between two feature points by their relative distance.
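As a concrete illustration of this effect (a generic PyTorch example, not code from the cited works), a 3 × 3 kernel with dilation rate 3 covers a 7 × 7 input window while keeping exactly the same number of parameters as a standard 3 × 3 convolution.

```python
import torch
import torch.nn as nn

# A 3x3 convolution with dilation d spans a (2d + 1) x (2d + 1) input window
# while keeping the same nine weights per channel.
standard = nn.Conv2d(1, 1, kernel_size=3, padding=1, dilation=1)  # 3x3 window
dilated = nn.Conv2d(1, 1, kernel_size=3, padding=3, dilation=3)   # 7x7 window

x = torch.randn(1, 1, 32, 32)
print(standard(x).shape, dilated(x).shape)  # both keep the 32x32 spatial size
print(sum(p.numel() for p in standard.parameters()),
      sum(p.numel() for p in dilated.parameters()))  # identical parameter counts
```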
These works are based on convolution operations, using various methods to expand the receptive field of each unit during convolution calculations and allowing each unit to obtain more information. In the decoder stage of change detection tasks, units at different scales have various receptive fields and perceive distinct change information. Previous change detection methods often leveraged the shape and size of the receptive field kernels, focusing on altering the external attributes of these kernels. However, they overlooked the intrinsic attributes of receptive field kernels at different feature scales. Depending on the scale of the features, receptive fields of varying extents contribute differently to the change map. Assigning different kernel weights can yield higher-quality change features.

3. Methods

3.1. Overall Structure

The overall structure of FDRF is shown in Figure 2. The encoder block and decoder block utilize the exchanging dual encoder–decoder backbone (EDED) [4]. This backbone is designed with two weight-sharing encoders, two weight-sharing decoders, and a fusion decoder, as shown in the “encoder/decoder architecture” section of Figure 2. In the encoding stage, this paper performs differential operations on image features at different scales to extract the differential information between the two images, thereby obtaining feature differences. In the exchange stage, this paper swaps the even-numbered channels of the T1 branch features with the odd-numbered channels of the T2 branch features. In the decoding stage, this paper corrects the bias of change features at different scales and uniformly expands them to the original image size. Finally, this paper obtains the change map by combining the corrected prediction results with the predictions guided by the feature differences.
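The exchange stage can be sketched as follows. This is one possible reading of the description above (swapping even-indexed channels of the T1 features with odd-indexed channels of the T2 features); the exact indexing convention of the original EDED implementation may differ.

```python
import torch

def channel_exchange(f1: torch.Tensor, f2: torch.Tensor):
    """Swap even-indexed channels of f1 with odd-indexed channels of f2.

    One possible reading of the exchange stage described in the text; the exact
    indexing convention of the original EDED backbone may differ.
    """
    f1, f2 = f1.clone(), f2.clone()
    even = torch.arange(0, f1.size(1), 2)  # channels 0, 2, 4, ... of the T1 branch
    odd = torch.arange(1, f2.size(1), 2)   # channels 1, 3, 5, ... of the T2 branch
    n = min(len(even), len(odd))           # pair up an equal number of channels
    tmp = f1[:, even[:n]].clone()
    f1[:, even[:n]] = f2[:, odd[:n]]
    f2[:, odd[:n]] = tmp
    return f1, f2

t1_exchanged, t2_exchanged = channel_exchange(torch.randn(2, 8, 64, 64),
                                              torch.randn(2, 8, 64, 64))
```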

3.2. Multibranch Feature Difference Extraction

In encoder–decoder network architectures, the encoder does not effectively capture image differences and suffers from information loss. To address these issues, this paper proposes the MFDE module. As shown in Figure 3, MFDE intercepts the features from each layer of the encoder to model feature differences more accurately and uses multiscale fusion to obtain high-quality feature differences. This provides robust guidance for the final change detection. The MFDE module contains three strategies: difference, preservation, and fusion.
(1)
Difference: The difference branch extracts feature differences. Taking the encoded features of the encoder as the input, this branch calculates a difference matrix of the input features to initially capture change information. It then calculates a corresponding threshold matrix, highlighting the differential parts of the feature expression based on the threshold matrix to obtain feature differences. The mask module is used to filter out non-change information from the features. The process is formalized as follows:
$$F_d = \left| T_1 - T_2 \right|, \qquad F_{dm} = \mathrm{Mask}\!\left(F_d - \mathrm{Avgpool}(F_d),\, 0\right), \qquad \mathrm{Mask}(x, 0) = \begin{cases} x & \text{if } x \geq 0 \\ 0 & \text{if } x < 0 \end{cases},$$
where $T_1$ and $T_2$ are the features of the bitemporal images, and $F_d$ is the absolute value of the difference between the two features, representing the difference between them. $\mathrm{Avgpool}$ represents average pooling, and $\mathrm{Mask}(x, 0)$ sets all elements in the matrix less than 0 to 0. $\mathrm{Avgpool}$ generates the average value of the differences, and subtracting this value from the difference information yields the deviation of each element from the average difference. The $\mathrm{Mask}$ operation sets pixel values smaller than the average difference to 0, eliminating their influence during network training and highlighting pixels that truly contain differential information. $F_{dm}$ is the feature difference produced by the difference branch.
In the difference branch of the MFDE module, the combined approach of average pooling and clipping operations ($\mathrm{Mask}(x, 0)$) proves more effective, stemming from its specialized optimization mechanism for expressing feature differences in change detection tasks. Average pooling, by computing the global mean of feature differences, establishes a dynamic baseline threshold, enabling subsequent mask operations to effectively distinguish random noise from genuine change signals. Meanwhile, clipping operations, by suppressing negative fluctuations, preserve significant positive differential features while eliminating reverse interference. This unidirectional activation property aligns closely with the fundamental requirement in change detection tasks of “focusing solely on differential absolute values”.
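A minimal PyTorch sketch of the difference branch is given below. It reflects our interpretation of the formula above: Avgpool is taken as a global spatial average, which matches the “dynamic baseline threshold” description, although the actual pooling window may differ.

```python
import torch

def difference_branch(t1: torch.Tensor, t2: torch.Tensor) -> torch.Tensor:
    """Sketch of the MFDE difference branch.

    |T1 - T2| is compared against its average value, and deviations below the
    average are clipped to zero (the Mask(x, 0) operation).
    """
    f_d = torch.abs(t1 - t2)                       # element-wise difference
    baseline = f_d.mean(dim=(2, 3), keepdim=True)  # Avgpool: average difference per channel
    f_dm = torch.clamp(f_d - baseline, min=0.0)    # Mask(x, 0): suppress sub-average values
    return f_dm

f_dm = difference_branch(torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32))
```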
(2)
Preservation: This branch models the information lost in the difference branch. The subtraction and clamp operations in the difference branch can lead to the loss of pixel information. Thus, the preserve branch is added to model the lost information and integrates it into the differential information, resulting in high-quality feature differences. The process is formalized as follows:
$$F_p = \mathrm{Conv}\!\left(\mathrm{Concat}\!\left(T_1 + T_2,\ \mathrm{Concat}(T_1, T_2)\right)\right),$$
where $\mathrm{Concat}$ concatenates the input vectors along the channel dimension, and $\mathrm{Conv}$ is a 3 × 3 convolution used to fuse and reduce the dimensions of the concatenated vectors. $F_p$ represents the partial information of the features $T_1$ and $T_2$. Compared to directly adding $T_1$ and $T_2$, this paper additionally concatenates the input vectors $T_1$ and $T_2$ along the channel dimension, preserving more data information and alleviating information loss during the network encoding process.
In the selection of feature processing strategies, feature concatenation offers a multidimensional information preservation advantage over simple subtraction. Feature concatenation, by parallelizing features in the channel dimension, retains the complete topological structure of original features and achieves the interaction fusion of cross-temporal features through subsequent 3 × 3 convolutions. This processing method is particularly suitable for scenarios with temporal changes, as the cascaded feature space can simultaneously encode both “disappearing features” and “emerging features” as orthogonal dimensions, whereas subtraction operations can only reflect net differences.
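The preservation branch can be sketched as below, assuming the formula above combines the element-wise sum of $T_1$ and $T_2$ with their channel-wise concatenation before a 3 × 3 convolution; the channel counts are hypothetical.

```python
import torch
import torch.nn as nn

class PreservationBranch(nn.Module):
    """Sketch of the MFDE preservation branch (hypothetical channel counts)."""

    def __init__(self, channels: int):
        super().__init__()
        # Input: channels from (T1 + T2) plus 2 * channels from Concat(T1, T2).
        self.fuse = nn.Conv2d(3 * channels, channels, kernel_size=3, padding=1)

    def forward(self, t1: torch.Tensor, t2: torch.Tensor) -> torch.Tensor:
        combined = torch.cat([t1 + t2, torch.cat([t1, t2], dim=1)], dim=1)
        return self.fuse(combined)  # fuse and reduce channels with a 3x3 convolution

f_p = PreservationBranch(64)(torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32))
```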
(3)
Fusion: This branch fuses different scales to obtain feature differences with complete information. It uses convolution and upsampling to match the scale of deep features with shallow features and then concatenates them and uses convolution to model feature differences after multiscale fusion. The process is formalized as follows:
$$F_{up} = \mathrm{Conv}\!\left(\mathrm{Upsample}(F_p + F_{dm})\right), \qquad F_{diff} = \mathrm{Conv}\!\left(\mathrm{Concat}(F_{up}, F_{last})\right),$$
where $F_{up}$ records the information obtained from the first two branches at each layer during the encoding stage, $\mathrm{Conv}$ is a 3 × 3 convolution, $F_{last}$ is the feature difference obtained from the previous layer, and $F_{diff}$ is the feature difference obtained from the current layer. The fusion branch fuses the first layer’s features with the next layer, the fusion result is further fused with the third layer, and so on, thereby fusing feature differences at different scales in the encoding stage. In summary, within an encoder network with $T_1$ and $T_2$ as inputs to the first layer, the difference branch of MFDE generates the current layer’s difference feature $F_{dm}$, and the preservation branch generates the current layer’s preservation feature $F_p$. These are added element-wise to obtain the feature differences of that layer. The feature differences of the current layer are fused with the previous layer’s feature differences through convolution and upsampling, following a reverse fusion strategy from the last layer of the network, to produce feature differences with the same dimensions as the original images.
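One fusion step between adjacent scales can be sketched as follows; this is our interpretation of the formula above, and the channel counts, upsampling mode, and placement of the convolutions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionStep(nn.Module):
    """Sketch of one MFDE fusion step between adjacent encoder scales."""

    def __init__(self, channels: int):
        super().__init__()
        self.reduce = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, f_p, f_dm, f_last):
        # Upsample the current layer's difference (F_p + F_dm) to the shallower scale.
        f_up = self.reduce(F.interpolate(f_p + f_dm, scale_factor=2,
                                         mode="bilinear", align_corners=False))
        # Concatenate with the previous feature difference and fuse by convolution.
        return self.fuse(torch.cat([f_up, f_last], dim=1))

step = FusionStep(64)
f_diff = step(torch.randn(2, 64, 16, 16), torch.randn(2, 64, 16, 16),
              torch.randn(2, 64, 32, 32))
```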

3.3. Adaptive Ensemble Decision

In the decoding stage of the EDED backbone [4], multiple decoders decode the extracted feature vectors and use skip connections to retain earlier network information, compensating for information lost during network inference. However, EDED overlooks the fact that features at different scales tend to express distinct levels of information due to their different receptive fields. Shallow features often focus on color and texture, whereas deep features contain more pixel information and represent semantics. The EDED backbone uses the TFAM module to fuse bitemporal features, providing spatial information. The proposed channel avgpool module (CAM) averages the feature vectors along the channel dimension, and the confidence factor corrects the biases in the decoded change information at different scales, as shown in Figure 4. The AED module sets corresponding confidence factors for features at different scales based on their receptive field differences. Confidence factors measure the reliability of each prediction for the final change result, correcting the change predictions made by features at different scales and building individual learners. Finally, all individual learners collectively predict the change map. The AED module transforms the original vertical linear network inference process into a parallel ensemble decision module that determines the change guidance information.
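The channel avgpool module reduces to a per-position mean over channels; a one-line sketch (our reading of CAM) is shown below.

```python
import torch

def channel_avgpool(x: torch.Tensor) -> torch.Tensor:
    """CAM sketch: average the feature map along the channel dimension."""
    return x.mean(dim=1, keepdim=True)  # (B, C, H, W) -> (B, 1, H, W)

attention_map = channel_avgpool(torch.randn(2, 64, 32, 32))
```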
(1)
Adaptive learning: In the proposed module, the change map obtained from the last layer of the decoder stage has the same size as the original input (256 × 256). Each unit at this scale corresponds one-to-one to a unit on the change map, giving a receptive field ratio (RFR) of 1. RFRs at different scales are calculated as follows:
$$r_i = \frac{W_t \times H_t}{W_{t_i} \times H_{t_i}}.$$
Using the receptive field ratios, we can calculate the confidence factors for different scales as follows:
$$I_i = \frac{r_i}{\sum_{i=1}^{d} r_i}, \qquad d \in [1, 5],$$
where $r_i$ represents the receptive field ratio of the $i$th layer in the decoding stage; $W_t$ and $H_t$ are the dimensions of the original input image; $W_{t_i}$ and $H_{t_i}$ represent the width and height of the features of the $i$th layer, respectively; $I_i$ represents the confidence factor of the $i$th layer; and $d$ represents the number of network layers in the decoding stage. For example, in a three-layer decoder network, the spatial dimensions of each layer’s features are 16 × 16, 32 × 32, and 64 × 64, respectively, with the final prediction result also being 64 × 64. Thus, the area that each layer maps to in the final prediction image decreases progressively. A single pixel of each layer corresponds to 16, 4, and 1 pixels in the prediction image, respectively, representing each layer’s $r_i$. Using the ratio of the current layer’s $r_i$ to the sum of all layers’ $r_i$ as the confidence factor for that layer, we can measure the contribution of the potential information from the current layer to the final prediction. Through the CAM and MASK modules, regions expressing non-change information in the features can be suppressed. Subsequently, the different confidence factors and residual structures are used for regulation to obtain the output features of each layer, referred to as $F_{dec}$.
RFR serves as a dependable proxy for determining confidence levels within the AED module. It directly measures the spatial impact of features at each decoder layer on the final change prediction, aligning with their hierarchical representational significance. Shallow layers, characterized by small RFR values, excel at capturing intricate details such as textures but may lack comprehensive contextual understanding. In contrast, deep layers with larger RFR values encode semantic information with broader spatial influence. By standardizing the RFR of each layer relative to the total, confidence factors dynamically adjust the weighting of contributions. Layers with expansive RFRs (e.g., 16 × 16 features governing a 16 × 16 output block) are assigned higher confidence to maintain global consistency, while layers with smaller RFRs (e.g., 64 × 64 features with localized impact) refine specific details. This approach ensures that the collective decision effectively balances multi-scale features in proportion to their receptive fields, effectively dampening noise (via CAM/MASK) while amplifying trustworthy change signals.
The RFR-based mechanism essentially transforms the decoder into a parallelized, adaptive system where each layer’s confidence level mirrors its inherent ability to facilitate precise change detection, thereby sidestepping arbitrary weight assignments based on heuristics.
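The receptive field ratios and confidence factors reduce to a few lines of arithmetic; the sketch below reproduces the three-layer example from the text (feature sizes 16 × 16, 32 × 32, and 64 × 64 against a 64 × 64 prediction).

```python
def confidence_factors(pred_size, layer_sizes):
    """Compute RFRs r_i and confidence factors I_i for the decoder layers.

    pred_size:   (W_t, H_t) of the final prediction.
    layer_sizes: list of (W_ti, H_ti) spatial sizes, one per decoder layer.
    """
    w_t, h_t = pred_size
    rfrs = [(w_t * h_t) / (w * h) for (w, h) in layer_sizes]  # r_i
    total = sum(rfrs)
    return rfrs, [r / total for r in rfrs]                    # I_i

# Three-layer example from the text: one pixel of each layer covers 16, 4, and 1
# pixels of the 64 x 64 prediction, respectively.
rfrs, factors = confidence_factors((64, 64), [(16, 16), (32, 32), (64, 64)])
print(rfrs)     # [16.0, 4.0, 1.0]
print(factors)  # approximately [0.762, 0.190, 0.048]
```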
(2)
Ensemble Decision: By multiplying the confidence factors with the feature maps at different scales in the decoder stage, this paper constructs features at the corresponding scales. The integrated results from all learners provide accurate change information. The process is formalized as follows:
$$\mathrm{Change} = \mathrm{Conv}\!\left(F_{diff} + \sum_{i=1}^{d} \mathrm{Extension}\!\left(F_{dec}^{\,i}\right)\right), \qquad d \in [1, 5],$$
where $d$ represents the number of layers in the decoding stage of the network, $\mathrm{Conv}$ is a 7 × 7 convolution, and $\mathrm{Extension}$ refers to the expansion of features at different scales, performed by duplicating a single element simultaneously in the width and height dimensions to reach the same scale as the predicted result. $\mathrm{Change}$ refers to the change prediction result of the FDRF network. For example, the first layer of the network, with spatial dimensions of 16 × 16, can replicate a single pixel into a 16 × 16 block, expanding it to 256 × 256. Each layer is expanded to a uniform size according to its $r_i$ and then adjusted for its contribution to the prediction using its $I_i$. Finally, the feature differences obtained from MFDE and the decision features from AED are fused through convolution to obtain the prediction result.
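Because Extension replicates each element across the width and height dimensions, it is equivalent to nearest-neighbor upsampling. The sketch below is our interpretation with hypothetical channel counts; in the paper the confidence factors are already folded into $F_{dec}$, whereas here the weighting is applied explicitly for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def ensemble_decision(f_diff, decoder_feats, factors, conv7):
    """Sketch of the AED ensemble decision.

    f_diff:        feature difference from MFDE at the prediction resolution.
    decoder_feats: per-layer decoder features F_dec^i at their native scales.
    factors:       confidence factors I_i computed from the receptive field ratios.
    conv7:         a 7x7 convolution producing the change prediction.
    """
    target = f_diff.shape[-2:]
    combined = f_diff
    for f_dec, weight in zip(decoder_feats, factors):
        # Extension: replicate each element in width and height (nearest-neighbor).
        combined = combined + weight * F.interpolate(f_dec, size=target, mode="nearest")
    return conv7(combined)

conv7 = nn.Conv2d(64, 1, kernel_size=7, padding=3)
feats = [torch.randn(2, 64, s, s) for s in (64, 128, 256)]
change = ensemble_decision(torch.randn(2, 64, 256, 256), feats,
                           [16 / 21, 4 / 21, 1 / 21], conv7)
```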
While AED shares conceptual similarities with attention mechanisms, such as dynamic weighting of features, it differs in implementation and purpose. Conventional attention modules (e.g., self-attention or channel attention) learn attention maps through parameterized operations and dot-product similarities. In contrast, AED leverages a deterministic weighting strategy based on spatial receptive field ratios without learning attention maps via gradients.
Thus, AED is better described as a confidence-aware ensemble strategy rather than a traditional attention mechanism. It avoids the additional computational overhead of attention layers and provides interpretable, scale-dependent weighting grounded in the spatial structure.

4. Experiments and Discussion

This paper conducts experiments on five public datasets (LEVIR-CD, CLCD, WHU, NJDS, and SYSU) and a UAV-based image dataset of GBCNR, covering three types of objects: buildings, land, and wetlands. In the GBCNR dataset, the boundaries of the main changed targets are more blurred, making the model more prone to misjudgments and omissions; using the GBCNR dataset therefore further validates the model’s robustness.

4.1. Datasets

This paper offers a brief comparison of the experimental binary change detection datasets in Table 1.
The SYSU dataset [32] contains images that capture various types of complex change scenes, such as road expansions, new urban buildings, vegetation changes, suburban growth, and groundwork before construction. This paper divides the data into training, validation, and test sets at a ratio of 6:2:2, following the same approach in the literature [32].
The CLCD dataset [33] is an annual land cover dataset for China, created from 335,709 scenes of Landsat data on the Google Earth Engine platform. The dataset includes annual land cover information for China for 1985 and from 1990 to 2020.
The LEVIR-CD dataset [34] is a large-scale remote sensing building change detection dataset. It consists of 637 very high-resolution (VHR, 0.5 m/pixel) GE image patch pairs with a size of 1024 × 1024 pixels. Following the literature [12], this paper crops the images into patches of 256 × 256 pixels with an overlap of 128 pixels on each side (horizontal and vertical) and divides the samples into training, validation, and test sets with a ratio of 7:1:2.
The WHU dataset [35] consists of two-period aerial images acquired in 2012 and 2016, which contain various buildings with large-scale changes. Following the splitting approach used in the literature [36], this paper crops the dataset into nonoverlapping patches of 256 × 256 pixels and randomly splits them into training, validation, and test sets with a ratio of 7:1:2.
The NJDS dataset [37] addresses the building height displacement issue in change detection. It contains bitemporal images of Nanjing City in 2014 and 2018, obtained from GE. The images include different types of low-, middle-, and high-rise buildings. Following the same approach in the literature [37], this paper crops the images into nonoverlapping patches of 256 × 256 pixels and randomly splits them into training (540 pairs), validation (152 pairs), and testing sets (1827 pairs).
The GBCNR dataset was obtained through UAVs in the GBCNR, located in the Guangxi Zhuang Autonomous Region of China. It was captured using DJI Mavic series drones at a flight altitude of 500 m, with image dimensions of 5280 × 3956 pixels. Based on the tide table, two sets of images were captured at tide heights of 5.11 m and 0.23 m, respectively. After removing images obscured by clouds and fog, 13 pairs of images were selected to represent the before-and-after tidal changes in two coastal nature reserves in Beihai, Guangxi. Two professors with experience in dataset creation and an expert in mangrove ecology were invited to guide the dataset’s creation. Based on the image data, field surveys were conducted to differentiate between mangroves and regular trees. The LabelMe annotation tool was used to label the 13 pairs of images according to the survey data. To meet the input requirements of deep learning models, this paper crops the images of the selected areas into 256 × 256 pixel patches and randomly splits them into training (1748 pairs), validation (499 pairs), and testing (249 pairs) sets.
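For reproducibility, cropping a large image into 256 × 256 patches, with or without overlap, can be sketched as follows; the array layout and the helper itself are our own illustration rather than the pipeline actually used by the authors.

```python
import numpy as np

def crop_patches(image: np.ndarray, patch: int = 256, overlap: int = 0):
    """Crop an (H, W, C) image into patch x patch tiles.

    overlap=128 reproduces the LEVIR-CD setting described above, while
    overlap=0 gives the nonoverlapping crops used for WHU and NJDS.
    """
    stride = patch - overlap
    h, w = image.shape[:2]
    tiles = []
    for top in range(0, h - patch + 1, stride):
        for left in range(0, w - patch + 1, stride):
            tiles.append(image[top:top + patch, left:left + patch])
    return tiles

tiles = crop_patches(np.zeros((1024, 1024, 3), dtype=np.uint8), overlap=128)
print(len(tiles))  # 49 tiles for a 1024 x 1024 image with a 128-pixel overlap
```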

4.2. Implementation Details

This paper implements the verification experiments of FDRF using PyTorch on a single NVIDIA RTX 3090 GPU (24 GB memory). The programming language used is Python 3.10, and the deep learning framework is PyTorch 2.0.1+cu117. During the training process, data augmentation is applied through random flipping (probability = 0.5), transposing (probability = 0.5), shifting (probability = 0.3), scaling (probability = 0.3), and random rotation (probability = 0.3). The binary cross entropy loss and dice coefficient loss are combined as the loss function. This paper utilizes the AdamW [38] optimizer with an initial learning rate of 0.001 and a weight decay of 0.001. In addition, the learning rate is reduced by a factor of 0.1 if the validation F1 score does not improve for 12 consecutive epochs. The batch size is set to 32, and FDRF is trained for 300 epochs. For parameter initialization, this paper did not use pretrained weights but instead followed PyTorch’s default settings to keep the initialization consistent with the other methods. Therefore, the first 100 epochs were skipped in validation to allow the model to converge [39]. The optimization times for the SYSU, LEVIR-CD, WHU, NJDS, CLCD, and GBCNR change detection datasets were 4.7 h, 4.4 h, 3.2 h, 6.4 h, 0.8 h, and 2.3 h, respectively.
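A condensed sketch of this training configuration is shown below (PyTorch); the model placeholder, the equal weighting of the two losses, and the Dice implementation are assumptions, since the exact formulations are not given in the text.

```python
import torch
import torch.nn as nn

# Placeholder standing in for the FDRF network described in Section 3.
model = nn.Conv2d(6, 1, kernel_size=3, padding=1)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-3)
# Reduce the learning rate by a factor of 0.1 when the validation F1 score has
# not improved for 12 epochs ('max' mode, because F1 should increase).
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.1, patience=12)

bce = nn.BCEWithLogitsLoss()

def dice_loss(logits, target, eps=1.0):
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum(dim=(1, 2, 3))
    union = prob.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    return 1.0 - ((2.0 * inter + eps) / (union + eps)).mean()

def total_loss(logits, target):
    # The paper combines BCE and Dice losses; equal weighting is assumed here.
    return bce(logits, target) + dice_loss(logits, target)

# Placeholder input; FDRF itself processes the two images through siamese encoders.
logits = model(torch.randn(4, 6, 256, 256))
loss = total_loss(logits, torch.randint(0, 2, (4, 1, 256, 256)).float())
loss.backward()
scheduler.step(0.5)  # called once per epoch with the validation F1 score
```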

4.3. Ablation Study

This paper performs ablation studies on all datasets to verify the effectiveness of the MFDE and AED components. Table 2 shows the effectiveness of the MFDE and AED modules. Compared with EDED, adding MFDE improves the F1 score from 64.40 to 66.96 and the IoU from 47.50 to 50.33. Compared with EDED+MFDE, adding AED further improves the F1 score from 66.96 to 68.19 and the IoU from 50.33 to 51.73.
Figure 5 and Figure 6 display the visualization results of the ablation experiments, where red areas denote false positives (incorrectly detected changes), blue areas denote false negatives (undetected changes), $T_1$ and $T_2$ represent the bitemporal input images, and the ground truth (GT) indicates the actual difference labels between the bitemporal images used for model training and evaluation. The performance of EDED+MFDE surpasses that of the EDED backbone, while the predictions from EDED+MFDE+AED closely align with the ground truth labels. Notably, on the WHU and CLCD datasets, false positives and false negatives are virtually imperceptible to the naked eye. The EDED backbone network lacks modeling of differential features. The MFDE module addresses this by generating differential features layer by layer during the encoder phase, adding difference information to the network. Additionally, the decoder phase of EDED processes change features vertically, leading to the loss of shallow information. The AED module alleviates this issue by assigning different weights to features of various scales based on the size of the receptive field, thereby mitigating information loss. Both the quantitative data and the visualization results demonstrate the effectiveness of these two modules.
Additionally, Table 3 presents the experimental results of using ResNet-50 [40] and VGG16 [41] as backbones on GBCNR. It shows that FDRF-ResNet50 has the lowest FLOPs (18.93) and Params (5.71), suggesting that ResNet-50 is suitable for scenarios where computational resources are limited. FDRF-VGG16 has the highest FLOPs (59.22) and Params (11.49) and also performs better than ResNet-50. FDRF-EDED provides a balance between computational complexity and performance, achieving the highest F1 score (72.5) and IoU (56.86), suggesting that the EDED backbone is effective for tasks requiring high accuracy.

4.4. Comparison Results and Discussion

This paper compares FDRF with change detection networks, namely, FC-EF [35], FC-Siam-diff [35], FC-Siam-conc [35], IFN [9], STANet [34], BiT [18], SNUNet [11], SGSLN [4], UNet [42], AttU-Net [43], PSPNet [44], AMTNet [45], BiFA [46], and MTU-Net [47].
Table 4 shows the comparison of different CD methods on the remote sensing datasets. Concretely, on the SYSU dataset, FDRF surpasses the other methods in the R, F1, and IoU metrics, achieving 79.34, 82.87, and 70.75, respectively, only falling behind the FC-Siam-diff method in the P metric. Compared to the second-best performing SGSLN method, the F1 metric improves by 0.98%, and the IoU metric improves by 1.43%. On the LEVIR-CD dataset, the R, F1, and IoU metrics reach 91.72, 92.50, and 86.05, respectively, surpassing all other methods, only falling behind the IFN method in the P metric. Compared to the second-best performing SGSLN method, the F1 metric improves by 0.45%, and the IoU metric improves by 0.77%. On the WHU dataset, the metrics reach 93.16, 93.11, and 87.11, respectively, again surpassing all other methods, only falling behind the SGSLN method in the P metric. Compared to the second-best performing SGSLN method, the F1 metric improves by 1.08%, and the IoU metric improves by 1.87%. The visualization results from the comparative experiments, depicted in Figure 7, illustrate that SGSLN is prone to a higher incidence of false positives, erroneously labeling many unchanged areas as altered. This issue stems from its inadequate feature difference extraction capability. In stark contrast, FDRF capitalizes on feature differences to significantly mitigate misclassification, yielding predictions that more accurately reflect the true changes. It can be observed that the red area in FDRF is the smallest. In the third row of the visualization, due to the similarity between the roof color and the step color, both SGSLN and AMTNet mistakenly identified part of the steps as a building, with SGSLN exhibiting a particularly severe misjudged area. Neither SGSLN nor AMTNet models the differential features, resulting in a lack of differential information and making it difficult to distinguish irrelevant pseudo-changes. This further underscores the importance of extracting differential features.
Table 5 shows the comparison on the NJDS dataset. The data show that FDRF surpasses the other methods in three metrics (P, F1, and IoU), achieving 84.07, 68.19, and 51.73, respectively, only slightly underperforming in the R metric. Compared to the second-best performing SGSLN method, the F1 metric improves by 3.79%, and the IoU metric improves by 4.23%, representing the largest improvement. On NJDS, because of the different imaging angles of the multitemporal remote sensing images, the same building shows large spatial differences in the bitemporal images, leading to confusion with real changes and, thus, false positives [4]. Further analysis confirms the superiority of FDRF; improvements across various scenarios underscore its effectiveness in change detection. As Figure 8 highlights, FDRF excels in complex scenarios by adeptly filtering out irrelevant differences caused by angles and shadows, thus pinpointing actual changes.
Table 5 also shows that the method pretrained on the LEVIR-CD dataset demonstrates significant performance degradation when evaluated on the NJDS dataset, with the pretrained SGSLN-PT method exhibiting notably lower metrics (precision: 71.21%, recall: 41.45%, F1 score: 52.27%, and IoU: 35.44%) compared to its source-domain performance. This performance deterioration primarily stems from systematic disparities in imaging parameters (e.g., resolution and spectral bands) and scene characteristics (e.g., land cover types and seasonal variations) between the two datasets. In the future, we will focus on adjusting the model structure, enhancing the training methods, and extending the training data to bolster its transferability.
The comparative analysis between FDRF and the lightweight BiFA method reveals significant trade-offs in performance and efficiency. While BiFA demonstrates superior parameter efficiency, its computational cost is substantially higher. FDRF achieves markedly better precision and overall performance, suggesting its receptive field mechanism more effectively balances accuracy and computational load. This positions FDRF as preferable for accuracy-critical applications, whereas BiFA’s lower parameter count may benefit resource-constrained deployments despite its higher FLOPs.
Table 6 shows the comparison on the CLCD dataset. The data show that FDRF surpasses other methods in all metrics (P, R, F1, and IoU), achieving 73.78, 73.48, 73.63, and 58.27, respectively. Compared to the second-best performing SGSLN method, the F1 metric improves by 2.46%, and the IoU metric improves by 3.07%. The data indicate a more significant improvement in the CLCD dataset compared to the SYSU dataset, suggesting that change detection tasks on the CLCD dataset remain challenging and have great potential for further exploration. The challenges of detecting changes in cultivated land are evident from the visualization in Figure 8. Both SGSLN and FDRF encounter considerable misclassifications; however, FDRF demonstrates a reduced error footprint and precisely delineates the boundaries of change. By harnessing feature differences, FDRF ensures that the demarcations in the resultant imagery are both sharp and distinct.
Table 6 also shows the comparison on the GBCNR dataset. The data show that the proposed method surpasses the other methods in all metrics (P, R, F1, and IoU), achieving 67.96, 77.69, 72.50, and 56.86, respectively. Compared to the second-best performing SGSLN method, the F1 metric improves by 4.4%, and the IoU metric improves by 5.2%. The data indicate that the proposed method can better handle change detection involving complex surface elements, demonstrating its robustness. The table also presents the floating point operations (FLOPs) and parameters (Params) of the different methods, facilitating the analysis of their computational and spatial complexities. FC-EF, FC-Siam-diff, and FC-Siam-conc, being purely convolution-based structures, have the lowest FLOPs and Params. On the GBCNR dataset, FDRF has FLOPs similar to those of the AMTNet method but with less than half the Params, achieving a 4.27% improvement in the F1 score. SGSLN, while increasing both FLOPs and Params, only improves the F1 score by 0.78% over the BiT method, whereas FDRF achieves a 5.27% improvement. This validates the superior performance of FDRF on the GBCNR dataset.
Figure 9 showcases the detection performance on the GBCNR dataset. As shown in the second row, the dense intermingling of mangroves and water bodies makes it challenging to accurately detect the boundaries of changed areas. Moreover, as illustrated in the third row, GBCNR contains numerous water bodies of varying depths, resulting in diverse light reflections from the wetland surface beneath the water. This variation further complicates change detection tasks. FDRF results exhibit clearer region boundaries and fewer gaps in the predicted change areas. The visualization suggests that FDRF demonstrates superior performance in change detection tasks within intricate scenes featuring water and vegetation.
The GBCNR dataset inherently contains several challenging scenarios that test the proposed model’s robustness. (1) Class Imbalance: The dataset exhibits significant class imbalance between changed and unchanged regions. Specifically, the areas undergoing changes occupy a substantially smaller proportion compared to unchanged areas, resulting in an imbalanced binary classification scenario. Despite this inherent challenge in change detection tasks, our method demonstrates robust performance and outperforms baseline approaches. (2) Temporal Variations: The bitemporal images in GBCNR were captured at a one-month interval, introducing several real-world challenges: natural atmospheric and illumination variations between acquisition times, secondary changes in vegetation and building conditions unrelated to the target changes, and background noise from dynamic urban environments. These temporal factors create a more challenging and realistic testing environment. While these conditions affect the overall detection accuracy, the proposed method maintains superior performance compared to existing approaches, demonstrating its robustness in handling real-world scenarios.

4.5. Training Stability and Robustness

In machine learning-based cybersecurity models, to ensure stability and robustness, the model should typically be trained over multiple epochs. Performance metrics, such as accuracy, loss, and others, should be recorded at each epoch to observe the model’s learning process and assess its stability and robustness [48]. In this section, a comprehensive analysis of the model’s training process is presented to emphasize its stability and robustness. The model underwent training across 300 epochs, during which key performance indicators, i.e., precision, recall, F1 score, and loss, were systematically monitored and recorded at each epoch. The learning curves for these metrics are depicted in Figure 10. The results demonstrate that the system requires approximately 170 epochs to achieve stability, after which the metrics converge and exhibit consistent behavior.
The figures illustrate clear trends in the model’s performance metrics. Precision, recall, and F1 score show consistent improvements during the initial epochs and stabilize as the training progresses. Notably, both precision and recall converge to values exceeding 0.9, signifying a robust and effective learning process. The F1 score aligns closely with these trends, demonstrating the model’s capability to balance precision and recall efficiently.
The loss curve reveals a sharp decline in the early stages of training, followed by a gradual convergence to a minimal value. This pattern indicates the successful minimization of the objective function and a stable optimization process. The smooth convergence of all metrics underscores the robustness and stability of the training process, suggesting the model’s reliability for practical applications.
These observations collectively affirm that the training process is both stable and robust. The absence of indicators such as overfitting or divergence further supports the model’s capability to maintain consistent performance in deployment scenarios.
Figure 11 illustrates the average performance and standard deviation (error bars) of the SGSLN and FDRF models on four performance metrics (precision, recall, F1 score, and IoU). The results are derived from multiple experimental runs; the error bars represent the standard deviation, and a red asterisk (*) denotes a statistically significant difference between the two models (p-value < 0.05).
From the graph, it is evident that the FDRF model outperforms the SGSLN model across all performance metrics, particularly in recall and F1 score, showing significant improvements. This outcome suggests that the FDRF model exhibits stronger capabilities in capturing classification boundaries and enhancing overall model performance. Furthermore, the substantial enhancement in the IoU metric further validates the superiority of the FDRF model in spatial localization accuracy.
Statistical significance analysis conducted through t-tests reveals that the p-values for precision, recall, F1 score, and IoU are all less than 0.05, indicating statistically significant differences between the two model groups in these metrics. This further supports the superiority of the FDRF model in performance.
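The significance test corresponds to an independent two-sample t-test over repeated runs; a minimal sketch using SciPy is given below, with placeholder scores that are not the values reported in the paper.

```python
from scipy import stats

# F1 scores from repeated runs of each model (placeholder values, not the
# results reported in the paper).
sgsln_f1 = [68.0, 68.4, 67.7, 68.2, 67.9]
fdrf_f1 = [72.3, 72.6, 72.1, 72.8, 72.4]

t_stat, p_value = stats.ttest_ind(fdrf_f1, sgsln_f1)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # p < 0.05 indicates a significant difference
```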
The figure clearly delineates the performance disparities between the SGSLN and FDRF models, offering crucial insights for subsequent model optimizations and research endeavors.

5. Conclusions

This paper introduces a binary change detection network named FDRF, which comprises an EDED backbone, an MFDE module, and an AED module. The EDED backbone represents a novel approach in binary change detection. The MFDE module, using a multibranch network structure, preserves both the dual-temporal image data and the feature differences, addressing the potential loss of critical information during the encoding phase. The AED module, by assigning varied weights to change features based on their receptive fields, plays a pivotal role in determining the outcome of change detection. This design effectively harnesses information across multiple scales during the decoding phase. The verification experiments, which span multiple target types such as buildings, land, and wetlands, reveal that FDRF not only outperforms all comparative methods but also does so efficiently. Remarkably, even on the specialized GBCNR dataset, influenced by dense mangroves and varying water depths, FDRF demonstrates exceptional performance.
However, the current implementation of the proposed modules is designed and validated specifically for RGB optical imagery. Extension to other modalities, such as SAR and infrared, would require substantial modifications to account for their unique characteristics and may involve adapting feature extraction mechanisms for different imaging characteristics, developing modality-specific preprocessing steps, and investigating cross-modal learning approaches. These adaptations represent important directions for future research, though they would require significant architectural modifications beyond the current scope of this work. The current MFDE employs baseline convolution with concatenation to prioritize lightweight design, intentionally avoiding advanced techniques like transformers or gating mechanisms to minimize computational overheads. We acknowledge that manual receptive field estimation for the AED confidence factor is suboptimal. Future enhancements will integrate lightweight attention or gating mechanisms and develop an automated AED confidence factor calculation module.

Author Contributions

Conceptualization, B.Z.; methodology, B.Z.; software, B.Z.; validation, B.Z. and Q.W.; formal analysis, B.Z.; investigation, B.Z. and J.L.; resources, B.Z. and Y.W.; data curation, B.Z. and Q.W.; writing—original draft preparation, B.Z.; writing—review and editing, Y.W. and J.L.; visualization, B.Z.; supervision, Y.W. and J.L.; project administration, B.Z. and Q.W.; funding acquisition, Y.W., J.L. and Q.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Guangxi Science and Technology Major Project (Grant No. AA19254016) and the Beihai Science and Technology Bureau Project (Grant No. Bei Kehe 2023158004).

Data Availability Statement

Readers can access our data by sending an email to the corresponding author.

Acknowledgments

Special thanks are given to all contributors who provided invaluable feedback and assistance during the research and manuscript preparation process. We also thank our colleagues and friends for their encouragement and constructive suggestions, which greatly improved the quality of this work.

Conflicts of Interest

The authors declare no competing interests.

References

  1. Khelifi, L.; Mignotte, M. Deep learning for change detection in remote sensing images: Comprehensive review and meta-analysis. IEEE Access 2020, 8, 126385–126400. [Google Scholar] [CrossRef]
  2. Singh, A. Review article digital change detection techniques using remotely-sensed data. Int. J. Remote Sens. 1989, 10, 989–1003. [Google Scholar] [CrossRef]
  3. Chicco, D. Siamese neural networks: An overview. In Artificial Neural Networks; Humana: New York, NY, USA, 2021; pp. 73–94. [Google Scholar]
  4. Zhao, S.; Zhang, X.; Xiao, P.; He, G. Exchanging dual-encoder–decoder: A new strategy for change detection with semantic guidance and spatial localization. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4508016. [Google Scholar] [CrossRef]
  5. Zheng, Z.; Zhong, Y.; Tian, S.; Ma, A.; Zhang, L. ChangeMask: Deep multi-task encoder-transformer-decoder architecture for semantic change detection. ISPRS J. Photogramm. Remote Sens. 2022, 183, 228–239. [Google Scholar] [CrossRef]
  6. Panda, M.K.; Sharma, A.; Bajpai, V.; Subudhi, B.N.; Thangaraj, V.; Jakhetiya, V. Encoder and decoder network with ResNet-50 and global average feature pooling for local change detection. Comput. Vis. Image Underst. 2022, 222, 103501. [Google Scholar] [CrossRef]
  7. Yang, Z.; Wu, Y.; Li, M.; Hu, X.; Li, Z. Unsupervised change detection in PolSAR images using siamese encoder–decoder framework based on graph-context attention network. Int. J. Appl. Earth Obs. Geoinf. 2023, 124, 103511. [Google Scholar] [CrossRef]
  8. Daudt, R.C.; Le Saux, B.; Boulch, A. Fully convolutional siamese networks for change detection. In Proceedings of the 2018 25th IEEE International Conference on Image Processing (ICIP), Athens, Greece, 7–10 October 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 4063–4067. [Google Scholar]
  9. Zhang, C.; Yue, P.; Tapete, D.; Jiang, L.; Shangguan, B.; Huang, L.; Liu, G. A deeply supervised image fusion network for change detection in high resolution bi-temporal remote sensing images. ISPRS J. Photogramm. Remote Sens. 2020, 166, 183–200. [Google Scholar] [CrossRef]
  10. Hou, B.; Liu, Q.; Wang, H.; Wang, Y. From W-Net to CDGAN: Bitemporal change detection via deep learning techniques. IEEE Trans. Geosci. Remote Sens. 2019, 58, 1790–1802. [Google Scholar] [CrossRef]
  11. Fang, S.; Li, K.; Shao, J.; Li, Z. SNUNet-CD: A densely connected Siamese network for change detection of VHR images. IEEE Geosci. Remote Sens. Lett. 2021, 19, 8007805. [Google Scholar] [CrossRef]
  12. Chen, P.; Zhang, B.; Hong, D.; Chen, Z.; Yang, X.; Li, B. FCCDN: Feature constraint network for VHR image change detection. ISPRS J. Photogramm. Remote Sens. 2022, 187, 101–119. [Google Scholar] [CrossRef]
  13. Liang, Y.; Zhang, C.; Han, M. RaSRNet: An end-to-end relation-aware semantic reasoning network for change detection in optical remote sensing images. IEEE Trans. Instrum. Meas. 2023, 73, 5006711. [Google Scholar] [CrossRef]
  14. Aitken, K.; Ramasesh, V.; Cao, Y.; Maheswaranathan, N. Understanding how encoder-decoder architectures attend. Adv. Neural Inf. Process. Syst. 2021, 34, 22184–22195. [Google Scholar]
  15. Zhang, L.; Hu, X.; Zhang, M.; Shu, Z.; Zhou, H. Object-level change detection with a dual correlation attention-guided detector. ISPRS J. Photogramm. Remote Sens. 2021, 177, 147–160. [Google Scholar] [CrossRef]
  16. Zhang, M.; Shi, W. A feature difference convolutional neural network-based change detection method. IEEE Trans. Geosci. Remote Sens. 2020, 58, 7232–7246. [Google Scholar] [CrossRef]
  17. Yuan, S.; Wei, F.; Zhang, L.; Fu, H.; Gong, P. Receptive Convolution Boosts Large-Scale Multi-Class Change Detection. In Proceedings of the IGARSS 2024-2024 IEEE International Geoscience and Remote Sensing Symposium, Athens, Greece, 7–12 July 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 10459–10462. [Google Scholar]
  18. Chen, H.; Qi, Z.; Shi, Z. Remote sensing image change detection with transformers. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5607514. [Google Scholar] [CrossRef]
  19. Bandara, W.G.C.; Patel, V.M. A transformer-based siamese network for change detection. In Proceedings of the IGARSS 2022-2022 IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 17–22 July 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 207–210. [Google Scholar]
  20. Yan, J.; Cheng, Y.; Wang, Q.; Liu, L.; Zhang, W.; Jin, B. Transformer and graph convolution-based unsupervised detection of machine anomalous sound under domain shifts. IEEE Trans. Emerg. Top. Comput. Intell. 2024, 8, 2827–2842. [Google Scholar] [CrossRef]
  21. Yan, J.; Cheng, Y.; Zhang, F.; Zhou, N.; Wang, H.; Jin, B.; Wang, M.; Zhang, W. Multi-modal imitation learning for arc detection in complex railway environments. IEEE Trans. Instrum. Meas. 2025, 74, 3529413. [Google Scholar] [CrossRef]
  22. Cheng, Y.; Yan, J.; Zhang, F.; Li, M.; Zhou, N.; Shi, C.; Jin, B.; Zhang, W. Surrogate modeling of pantograph-catenary system interactions. Mech. Syst. Signal Process. 2025, 224, 112134. [Google Scholar] [CrossRef]
  23. Li, L.; Wang, L.; Du, A.; Li, Y. LRDE-Net: Large receptive field and image difference enhancement network for remote sensing images change detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 17, 162–174. [Google Scholar] [CrossRef]
  24. Luo, F.; Zhou, T.; Liu, J.; Guo, T.; Gong, X.; Ren, J. Multiscale diff-changed feature fusion network for hyperspectral image change detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5502713. [Google Scholar] [CrossRef]
  25. Yuan, J.; Deng, Z.; Wang, S.; Luo, Z. Multi receptive field network for semantic segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; pp. 1894–1903. [Google Scholar]
  26. Shen, X.; Wang, C.; Li, X.; Yu, Z.; Li, J.; Wen, C.; Cheng, M.; He, Z. RF-Net: An end-to-end image matching network based on receptive field. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 8132–8140. [Google Scholar]
  27. He, Z.; Cao, Y.; Du, L.; Xu, B.; Yang, J.; Cao, Y.; Tang, S.; Zhuang, Y. MRFN: Multi-receptive-field network for fast and accurate single image super-resolution. IEEE Trans. Multimed. 2019, 22, 1042–1054. [Google Scholar] [CrossRef]
  28. Araujo, A.; Norris, W.; Sim, J. Computing receptive fields of convolutional neural networks. Distill 2019, 4, e21. [Google Scholar] [CrossRef]
  29. Yu, F.; Koltun, V. Multi-scale context aggregation by dilated convolutions. arXiv 2015, arXiv:1511.07122. [Google Scholar]
  30. Liu, S.; Huang, D. Receptive field block net for accurate and fast object detection. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 385–400. [Google Scholar]
  31. Tan, M.; Yuan, X.; Liang, B.; Han, S. DRFnet: Dynamic receptive field network for object detection and image recognition. Front. Neurorobot. 2023, 16, 1100697. [Google Scholar] [CrossRef]
  32. Shi, Q.; Liu, M.; Li, S.; Liu, X.; Wang, F.; Zhang, L. A deeply supervised attention metric-based network and an open aerial image dataset for remote sensing change detection. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5604816. [Google Scholar] [CrossRef]
  33. Yang, J.; Huang, X. 30 m annual land cover and its dynamics in China from 1990 to 2019. Earth Syst. Sci. Data Discuss. 2021, 2021, 1–29. [Google Scholar]
  34. Chen, H.; Shi, Z. A spatial-temporal attention-based method and a new dataset for remote sensing image change detection. Remote Sens. 2020, 12, 1662. [Google Scholar] [CrossRef]
  35. Ji, S.; Wei, S.; Lu, M. Fully convolutional networks for multisource building extraction from an open aerial and satellite imagery data set. IEEE Trans. Geosci. Remote Sens. 2018, 57, 574–586. [Google Scholar] [CrossRef]
  36. Peng, X.; Zhong, R.; Li, Z.; Li, Q. Optical remote sensing image change detection based on attention mechanism and image difference. IEEE Trans. Geosci. Remote Sens. 2020, 59, 7296–7307. [Google Scholar] [CrossRef]
  37. Shen, Q.; Huang, J.; Wang, M.; Tao, S.; Yang, R.; Zhang, X. Semantic feature-constrained multitask siamese network for building change detection in high-spatial-resolution remote sensing imagery. ISPRS J. Photogramm. Remote Sens. 2022, 189, 78–94. [Google Scholar] [CrossRef]
  38. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  39. Hendrycks, D.; Lee, K.; Mazeika, M. Using pre-training can improve model robustness and uncertainty. In Proceedings of the International Conference on Machine Learning. PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 2712–2721. [Google Scholar]
  40. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  41. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  42. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18. Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
  43. Oktay, O.; Schlemper, J.; Folgoc, L.L.; Lee, M.; Heinrich, M.; Misawa, K.; Mori, K.; McDonagh, S.; Hammerla, N.Y.; Kainz, B.; et al. Attention u-net: Learning where to look for the pancreas. arXiv 2018, arXiv:1804.03999. [Google Scholar]
  44. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
  45. Liu, W.; Lin, Y.; Liu, W.; Yu, Y.; Li, J. An attention-based multiscale transformer network for remote sensing image change detection. ISPRS J. Photogramm. Remote Sens. 2023, 202, 599–609. [Google Scholar] [CrossRef]
  46. Zhang, H.; Chen, H.; Zhou, C.; Chen, K.; Liu, C.; Zou, Z.; Shi, Z. BiFA: Remote sensing image change detection with bitemporal feature alignment. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5614317. [Google Scholar] [CrossRef]
  47. Tsutsui, S.; Hirakawa, T.; Yamashita, T.; Fujiyoshi, H. Semantic segmentation and change detection by multi-task U-Net. In Proceedings of the 2021 IEEE International Conference on Image Processing (ICIP), Anchorage, AK, USA, 19–22 September 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 619–623. [Google Scholar]
  48. Ahmed, M.; Alasad, Q.; Yuan, J.S.; Alawad, M. Re-Evaluating Deep Learning Attacks and Defenses in Cybersecurity Systems. Big Data Cogn. Comput. 2024, 8, 191. [Google Scholar] [CrossRef]
Figure 1. A pair sample from GBCNR. (a) T1 image, (b) T2 image, and (c) ground truth image.
Figure 2. Overall structure of the FDRF.
Figure 3. Illustration of the MFDE module.
Figure 4. Illustration of the AED module.
Figure 5. Visualization comparison of ablation experiments on SYSU, LEVIR-CD and WHU. Red areas denote false positives and blue areas denote false negatives.
Figure 6. Visualization comparison of ablation experiments on NJDS, CLCD and GBCNR. Red areas denote false positives and blue areas denote false negatives.
Figure 7. Visualization comparison of change detection results. Red areas denote false positives and blue areas denote false negatives.
Figure 8. Visualization comparison of change detection results on NJDS. Red areas denote false positives and blue areas denote false negatives.
Figure 9. Visualization comparison of change detection results on GBCNR. Red areas denote false positives and blue areas denote false negatives.
Figure 10. Training performance metrics across 300 epochs: precision, recall, F1 score, and loss trends.
Figure 11. Comparison of SGSLN and FDRF methods on performance metrics with error bars and statistical significance. The red asterisk (*) denotes statistical significance between the two models (p-value < 0.05).
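The error bars and the p < 0.05 marker in Figure 11 imply that each model was trained several times and the resulting scores were compared with a significance test. The exact test and the per-run scores are not given in this excerpt; the snippet below is a minimal sketch assuming per-run F1 values and Welch's two-sample t-test from SciPy, with the listed numbers as placeholders rather than the authors' results.

```python
import numpy as np
from scipy import stats

# Hypothetical per-run F1 scores (%); replace with the actual repeated-run results.
f1_sgsln = np.array([68.0, 68.4, 67.7, 68.2, 68.1])
f1_fdrf = np.array([72.3, 72.6, 72.4, 72.7, 72.5])

# Mean +/- sample standard deviation correspond to the error bars in Figure 11.
print(f"SGSLN: {f1_sgsln.mean():.2f} +/- {f1_sgsln.std(ddof=1):.2f}")
print(f"FDRF : {f1_fdrf.mean():.2f} +/- {f1_fdrf.std(ddof=1):.2f}")

# Welch's t-test (unequal variances); an asterisk would be drawn when p < 0.05.
res = stats.ttest_ind(f1_sgsln, f1_fdrf, equal_var=False)
print(f"t = {res.statistic:.3f}, p = {res.pvalue:.4f}, significant = {res.pvalue < 0.05}")
```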
Table 1. Brief comparison of the six datasets.
Name                  SYSU       LEVIR-CD     WHU               NJDS              CLCD       GBCNR
Resolution (m)        0.5        0.5          0.3               0.3               0.5–2      0.13
Image pairs           20,000     637          1                 1                 2400       2496
Image size (pixels)   256 × 256  1024 × 1024  32,207 × 15,354   14,231 × 11,381   256 × 256  256 × 256
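Table 1 mixes patch-level datasets (256 × 256 pairs) with single large scenes such as WHU (32,207 × 15,354 pixels), which are normally tiled into fixed-size patches before training. The authors' exact tiling scheme is not stated in this excerpt; the sketch below shows one straightforward non-overlapping tiling, with the 256-pixel patch size taken from the table as an assumption.

```python
import numpy as np

def tile_pair(t1, t2, label, patch=256):
    """Split a co-registered bitemporal pair (H, W, C) and its label (H, W)
    into non-overlapping patch x patch tiles, discarding incomplete borders."""
    h, w = label.shape[:2]
    tiles = []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            tiles.append((t1[y:y + patch, x:x + patch],
                          t2[y:y + patch, x:x + patch],
                          label[y:y + patch, x:x + patch]))
    return tiles

# Toy example: a 1024 x 768 scene yields 4 x 3 = 12 tiles.
t1 = np.zeros((1024, 768, 3), dtype=np.uint8)
t2 = np.zeros((1024, 768, 3), dtype=np.uint8)
gt = np.zeros((1024, 768), dtype=np.uint8)
print(len(tile_pair(t1, t2, gt)))
```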
Table 2. Ablation study on six datasets (%).
Datasets   Methods          P      R      F1     IoU
SYSU       EDED             88.46  76.07  81.80  69.20
           EDED+MFDE        87.78  77.33  82.22  69.81
           EDED+MFDE+AED    86.72  79.34  82.87  70.75
LEVIR-CD   EDED             93.25  90.95  92.09  85.34
           EDED+MFDE        92.91  91.93  92.42  85.91
           EDED+MFDE+AED    93.30  91.72  92.50  86.05
WHU        EDED             92.32  91.75  92.03  85.24
           EDED+MFDE        93.00  92.09  92.54  86.12
           EDED+MFDE+AED    93.06  93.16  93.11  87.11
NJDS       EDED             80.84  53.52  64.40  47.50
           EDED+MFDE        79.25  57.96  66.96  50.33
           EDED+MFDE+AED    84.07  57.35  68.19  51.73
CLCD       EDED             70.39  71.90  71.14  55.20
           EDED+MFDE        73.43  71.91  72.66  57.06
           EDED+MFDE+AED    73.78  73.48  73.63  58.27
GBCNR      EDED             64.61  71.98  68.10  51.63
           EDED+MFDE        65.60  75.78  70.32  54.23
           EDED+MFDE+AED    67.96  77.69  72.50  56.86
Note: Bold values indicate the best performance metrics for each dataset.
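The ablation scores in Table 2 are precision (P), recall (R), F1, and intersection over union (IoU) over binary change maps. As a point of reference, the following minimal NumPy sketch shows how these four scores are conventionally computed from a predicted mask and a ground-truth mask; the function name and the toy example are illustrative and not taken from the released code.

```python
import numpy as np

def change_detection_metrics(pred, gt, eps=1e-7):
    """Return P, R, F1, and IoU (in %) for binary change masks.

    pred, gt: arrays of 0/1 values with identical shape, where 1 marks a
    changed pixel. `eps` guards against division by zero on empty masks.
    """
    pred = pred.astype(bool)
    gt = gt.astype(bool)

    tp = np.logical_and(pred, gt).sum()    # changed pixels correctly detected
    fp = np.logical_and(pred, ~gt).sum()   # false alarms
    fn = np.logical_and(~pred, gt).sum()   # missed changes

    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    iou = tp / (tp + fp + fn + eps)

    return {name: 100.0 * value for name, value in
            {"P": precision, "R": recall, "F1": f1, "IoU": iou}.items()}

# Toy 4-pixel example: one hit, one false alarm, one miss.
print(change_detection_metrics(np.array([1, 1, 0, 0]), np.array([1, 0, 1, 0])))
```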
Table 3. Comparison of different backbones on GBCNR.
Backbone        FLOPs (G)  Params (M)  P (%)  R (%)  F1 (%)  IoU (%)
FDRF-ResNet50   18.93      5.71        62.74  79.41  70.10   53.96
FDRF-VGG16      59.22      11.49       64.88  77.11  70.47   54.40
FDRF-EDED       23.47      9.16        67.96  77.69  72.50   56.86
Note: Bold values indicate the best performance metrics.
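The FLOPs (G) and Params (M) columns in Tables 3, 5, and 6 measure computational cost and model size. The profiling tool used by the authors is not stated in this excerpt; the sketch below shows one common way to obtain comparable numbers for a PyTorch model, with a tiny stand-in Siamese module (not FDRF itself) and a 256 × 256 bitemporal input assumed from the patch size in Table 1. FLOPs are estimated with the optional third-party thop profiler, which actually counts multiply–accumulate operations, as is usual in such comparison tables.

```python
import torch
import torch.nn as nn

class TinySiameseCD(nn.Module):
    """Stand-in bitemporal model; any change-detection nn.Module taking two
    3-channel images could be profiled the same way."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(16, 1, 1)

    def forward(self, t1, t2):
        # Absolute feature difference as a crude change cue.
        return self.head(torch.abs(self.encoder(t1) - self.encoder(t2)))

model = TinySiameseCD()

# Parameter count in millions, computed with plain PyTorch.
params_m = sum(p.numel() for p in model.parameters()) / 1e6
print(f"Params (M): {params_m:.2f}")

# FLOPs estimate with the optional `thop` package (pip install thop).
try:
    from thop import profile
    x1 = torch.randn(1, 3, 256, 256)
    x2 = torch.randn(1, 3, 256, 256)
    macs, _ = profile(model, inputs=(x1, x2), verbose=False)
    print(f"FLOPs (G): {macs / 1e9:.2f}")  # MAC count, commonly reported as FLOPs
except ImportError:
    print("Install `thop` to estimate FLOPs.")
```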
Table 4. Comparison on SYSU, LEVIR-CD, and WHU (%).
               SYSU                          LEVIR-CD                      WHU
Methods        P      R      F1     IoU      P      R      F1     IoU      P      R      F1     IoU
FC-EF          78.26  76.30  77.27  62.96    88.53  86.83  87.67  78.05    80.87  75.43  78.05  64.01
FC-Siam-diff   83.04  79.11  81.03  68.11    94.02  82.93  88.13  78.77    84.73  87.31  86.00  75.44
FC-Siam-conc   74.32  75.84  75.07  60.09    83.81  91.00  87.26  77.39    78.86  78.64  78.75  64.95
IFN            79.59  75.58  77.53  63.31    89.18  87.17  88.16  78.83    91.44  89.75  90.59  82.79
STANet         81.14  76.48  78.74  64.94    86.91  80.17  83.40  71.53    79.37  85.50  82.32  69.95
BiT            89.13  61.21  72.58  56.96    89.24  89.37  89.30  80.68    86.64  81.48  83.98  72.39
SNUNet         70.76  85.33  77.36  63.09    89.53  83.31  86.31  75.91    85.60  81.49  83.49  71.67
AMTNet         80.96  76.84  78.85  65.08    91.14  89.21  90.17  82.09    92.86  81.49  83.49  71.67
SGSLN          86.20  78.00  81.89  69.34    92.91  91.21  92.05  85.28    92.32  91.99  92.27  85.64
FDRF           86.72  79.34  82.87  70.75    93.30  91.72  92.50  86.05    93.06  93.16  93.11  87.11
Note: Bold values indicate the best performance metrics for each dataset.
Table 5. Comparison on NJDS.
Methods     FLOPs (G)  Params (M)  P (%)  R (%)  F1 (%)  IoU (%)
U-Net       7.83       5.60        46.45  52.64  49.35   32.76
AttU-Net    8.94       9.20        55.57  44.60  49.48   32.88
PSPNet      65.63      64.20       50.57  58.21  54.12   37.10
DTCDSCN     28.31      87.50       51.92  62.78  56.84   39.70
IFN         41.10      50.71       49.44  14.35  22.24   12.51
MTU-Net     46.22      15.80       65.29  62.82  64.03   47.09
AMTNet      21.56      24.67       77.32  55.75  64.79   47.91
SGSLN-PT    11.50      6.04        71.21  41.45  52.27   35.44
FDRF-PT     23.47      9.16        72.35  43.53  54.35   37.31
SGSLN       11.50      6.04        80.84  53.52  64.40   47.50
BiFA        53.00      5.58        78.14  56.12  65.33   48.47
FDRF        23.47      9.16        84.07  57.35  68.19   51.73
Note: Bold values indicate the best performance metrics for each dataset.
Table 6. Comparison on CLCD and GBCNR.
                                          CLCD                            GBCNR
Methods        FLOPs (G)  Params (M)      P (%)  R (%)  F1 (%)  IoU (%)   P (%)  R (%)  F1 (%)  IoU (%)
FC-EF          3.58       1.35            70.82  62.37  66.32   49.62     56.97  63.12  59.89   42.74
FC-Siam-diff   4.73       1.35            71.70  47.60  57.22   40.07     61.40  57.05  59.14   41.99
FC-Siam-conc   5.33       1.55            61.42  62.75  62.08   45.01     59.79  54.99  57.29   40.14
SNUNet         54.82      12.04           64.26  52.33  57.69   40.54     63.10  64.50  63.79   46.83
BiT            8.75       3.49            73.27  52.91  61.45   44.35     60.19  76.36  67.32   50.73
AMTNet         21.56      24.67           73.97  72.54  73.25   57.79     66.59  69.96  68.23   51.78
SGSLN          11.50      6.04            70.39  71.90  71.14   55.20     64.61  71.98  68.10   51.63
FDRF           23.47      9.16            73.78  73.48  73.63   58.27     67.96  77.69  72.50   56.86
Note: Bold values indicate the best performance metrics for each dataset.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
