Article

DFANet: A Deep Feature Attention Network for Building Change Detection in Remote Sensing Imagery

1
School of Remote Sensing and Geomatics Engineering, Nanjing University of Information Science and Technology, Nanjing 210044, China
2
School of Information Science and Engineering, Shandong Agricultural University, Taian 271018, China
*
Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(15), 2575; https://doi.org/10.3390/rs17152575
Submission received: 25 May 2025 / Revised: 20 July 2025 / Accepted: 20 July 2025 / Published: 24 July 2025

Abstract

Change detection (CD) in remote sensing (RS) is a fundamental task that seeks to identify changes in land cover by analyzing bitemporal images. In recent years, deep learning (DL)-based approaches have demonstrated remarkable success in a wide range of CD applications. However, most existing methods have difficulty detecting building edges and addressing pseudo-changes, and they lack the ability to model feature context. In this paper, we introduce DFANet, a Deep Feature Attention Network specifically designed for building CD in RS imagery. First, we devise a spatial-channel attention module to strengthen the network’s capacity to extract change cues from bitemporal feature maps and to reduce the occurrence of pseudo-changes. Second, we introduce a GatedConv module to improve the network’s capability for building edge detection. Finally, a Transformer is introduced to capture long-range dependencies across bitemporal images, enabling the network to better understand feature change patterns and the relationships between different regions and land cover categories. We carried out comprehensive experiments on two publicly available building CD datasets, LEVIR-CD and WHU-CD. The results demonstrate that DFANet achieves exceptional performance in evaluation metrics such as precision, F1 score, and IoU, consistently outperforming existing state-of-the-art approaches.

1. Introduction

Change detection (CD) refers to the process of identifying and analyzing modifications in areas or objects by comparing two temporally distinct images of the same geographic region acquired using remote sensing (RS) or related techniques. Research in CD has seen broad application in areas such as environmental monitoring [1], resource exploration [2], urban growth analysis [3], and disaster impact assessment [4]. With continuous innovations in RS, sensor, and computer technologies, CD systems are also evolving [5].
Traditional CD methods, which form the basis for modern approaches, include algebra-based, transform-based, and classification-based techniques. Algebra-based approaches involve operations such as image differencing [6], ratioing [7], and quantization [8]; they determine pixel-level changes by comparing images acquired at two distinct time points and then apply thresholding or cluster analysis to produce difference maps of changed and unchanged pixels. Transformation-based approaches include principal component analysis [9] and the tasseled cap transform [10], which project features into transformed spaces that emphasize changed pixels. Classification-based approaches use algorithms such as support vector machines [11], spatial domain analysis [12], decision trees [13], and random forests [14] to detect changed pixels by comparing classification maps generated from images captured at different times. Although machine learning (ML)-based methods have contributed significantly to the progress of CD in RS imagery [15], they still struggle to effectively capture complex feature representations.
In recent years, deep learning (DL) research in RS has surged, with many studies applying neural-network-based techniques to CD. DL is a subset of ML that leverages artificial neural networks to model and learn complex data representations. It focuses on using multi-layer neural network architectures to automatically extract features and learn patterns from large volumes of data, leveraging increasingly deep and complex architectures to process and analyze information. Key DL architectures include the autoencoder (AE), convolutional neural network (CNN), deep belief network (DBN), recurrent neural network (RNN), and generative adversarial network (GAN) [16]. In particular, deep CNNs (e.g., VGG [17], UNet [18], ResNet [19]) combined with semi-supervised learning [20] have significantly boosted change-detection performance. DL-based CD methods can be broadly categorized into three types: (1) Feature-based CD. Hand-crafted features are usually tailored for specific tasks, requiring substantial expert domain knowledge and exhibiting limited generalizability, whereas deep features are derived via hierarchical extraction from data and are thus more stable. Deep features extracted from CNN models pre-trained on natural images have been shown to be effective when applied to the RS domain. Consequently, a significant number of works have incorporated hierarchical features from pre-trained CNNs into CD methods. (2) Pixel-block-based CD. Unlike feature-based approaches, pixel-block-based methods do not overly rely on difference images; instead, they generate pixel blocks from original images or difference maps as network inputs. Because determining the locations of pixel blocks is challenging and neighboring blocks contain substantial redundant information, these methods inevitably incur high computational costs, limiting their scalability and large-scale applicability. (3) Image-based CD. This approach uses bitemporal images as network input and directly learns change categories for each pixel via end-to-end training, thereby significantly improving the accuracy and efficiency of CD. As a result, DL-based CD methods have become mainstream. Specifically, based on differences in network architecture and feature-enhancement modules, image-based CD approaches fall into distinct categories: CNN-based CD; attention-based CD; Transformer-based CD; and hybrid CNN-Transformer-based CD. Recent network designs focus on expanding the receptive field and enhancing spatiotemporal feature discrimination by stacking deeper convolutions and incorporating attention mechanisms. Unlike convolution-only models, attention modules (channel attention [21], spatial attention [22], self-attention [23]) effectively capture global context.
Although a plethora of advanced CD techniques have been put forward, they still suffer from several limitations, which can be summarized as follows: (1) Most current methodologies generate change representations by directly merging and upscaling low-resolution feature maps, which may introduce noise; (2) most current methodologies are deficient in detecting building edges and in addressing the pseudo-change problem; (3) most methods simply use attention mechanisms to re-weight features along the channel or spatial dimensions and thus fail to adequately capture global spatio-temporal correlations.
To tackle the aforementioned limitations, we present DFANet, a deep feature attention network for building CD in RS imagery. Our design is guided by two principles. First, the network must thoroughly extract features from bitemporal images so that downstream modules can accurately discern differences between those features; it must also effectively suppress the influence of background on change-region detection, enabling precise extraction of change features, refinement of region boundaries, and avoidance of pseudo-changes. Second, the network must capture long-range spatio-temporal dependencies by modeling remote contextual information within the bitemporal images. We use the 18-layer residual network ResNet-18 for feature extraction, introduce a spatial-channel attention module (SCAM) at each layer for feature enhancement, apply GatedConv for boundary refinement and pseudo-change suppression, and integrate Transformer modules to capture long-range dependencies. We hypothesize that these designs address the aforementioned problems, and we verify the effectiveness of our method through extensive experiments.
The key contributions of this work can be summarized as follows.
(1)
We propose a deep feature attention network based on SCAM, GatedConv, and Transformer, named DFANet.
(2)
DFANet provides insights for addressing issues such as noisy features extracted by CD networks, insufficient modeling of long-range spatio-temporal dependencies, blurred boundaries in CD results, and pseudo-changes.
(3)
To validate the proposed method, we conducted extensive experiments on two RS building CD datasets, LEVIR-CD and WHU-CD, and performed numerical and visual comparisons with other advanced models, confirming the superiority of the proposed method.
The remainder of this paper is organized as follows: Section 2 briefly reviews related works on DL-based CD. Section 3 details the methodology. Section 4 is dedicated to the experimental setup and the presentation of CD results. Section 5 discusses the results, and Section 6 concludes the paper.

2. Related Works

This section briefly reviews recent DL-based CD methods. CD is a key task in RS, and a variety of DL-based techniques have emerged.
Since CD requires bitemporal inputs, current DL methodologies can be broadly grouped into two types: image-level and feature-level approaches. Image-level approaches directly perform channel-wise concatenation of bitemporal images and input the fused data into a unified segmentation network for detection. Daudt et al. [24] developed FC-EF, a UNet-based fully convolutional network that implements pixel-level early fusion. Peng et al. [25] applied the same concatenation strategy within a UNet++ architecture with multi-branch fusion. Liu et al. [26] incorporated depth-separable convolutions into UNet to improve detection performance. Fang et al. [27] further enhanced this by adding dense connectivity between encoder and decoder layers and introducing channel attention to strengthen feature representation. Feature-level methods utilize a pair of weight-sharing siamese networks to independently extract features. Shi et al. [28] designed a bitemporal convolutional twin network that integrates a Convolutional Block Attention Module (CBAM) [29] to enhance feature discriminability. Guo et al. [30] developed a hybrid framework that merges a fully convolutional siamese network with contrastive loss, enabling effective learning of discriminative bitemporal features for performance enhancement. Li et al. [31] introduced a dense skip-connection module for multi-level feature aggregation between encoder and decoder. Chen et al. [32] developed a hybrid CD framework combining a feature-constrained network with a novel bitemporal decoder backbone, which suppresses background noise, and a self-supervised strategy to facilitate discriminative feature learning.
Vaswani et al. [33] pioneered the Transformer model in 2017, a neural network architecture that relies exclusively on self-attention mechanisms for sequence modeling. Unlike recurrent or convolutional networks, Transformer processes input sequences more efficiently and captures long-range dependencies effectively. Recently, many CD methods have incorporated Transformer modules to overcome the limited receptive field of CNN. Accordingly, attention-based, Transformer-based, and hybrid CNN-Transformer approaches have become popular. Zhang et al. [34] built a change-detection network using multiple Swin Transformer blocks in a twin-network setup. Feng et al. [35] proposed a cross-scale feature-fusion strategy with Transformer blocks. Chen et al. [36] developed a Transformer-based twin network to model spatio-temporal context.
In the past two to three years, the field of CD has continued to attract significant research attention, leading to the publication of numerous influential studies. These works have provided a solid theoretical foundation and practical insights for the improvement of existing methods and the development of new applications. Jian et al. [37] propose an Uncertainty-Aware Graph Self-Supervised Learning (UA-GSSL) framework for hyperspectral image CD. By constructing superpixel-based graphs and incorporating node/edge-level augmentations with uncertainty modeling, the method effectively learns robust representations. Pang et al. [38] propose SFGT-CD, a Semantic Feature-Guided Transformer-based framework for building CD using bitemporal RS images. By incorporating prior semantic segmentation knowledge to generate semantic feature maps, the model enhances change localization through cross-temporal attention mechanisms. Lu et al. [39] introduce the Bitemporal Attention Transformer (BAT), a unified framework for building CD and damage assessment. BAT extracts multi-scale features from bitemporal images using dual-branch encoders and fuses them via cross-temporal attention.
Building on these advances, we introduce DFANet, a deep feature attention network for building CD in RS imagery, which integrates SCAM, GatedConv modules, and Transformer blocks.

3. Methodology

This section provides detailed information about the network design. First, we illustrate the overall workflow of DFANet. Then, we describe the SCAM, GatedConv, and Transformer modules.

3.1. DFANet Overview

As shown in Figure 1, the architectural framework of DFANet is composed of four core components: Feature Extraction Layer, Deep Feature Attention Layer, Transformer Layer, and Classification Output Layer. The bitemporal RS images T1 and T2 first pass through a siamese feature extraction network based on ResNet-18. For each layer, the extracted bitemporal features are concatenated, and the resulting feature maps are processed by the Spatial and Channel Attention Module (SCAM), followed by the GatedConv module for further feature extraction. Each layer’s features are then upsampled and concatenated with the features from the previous layer, iteratively proceeding until the final feature representation is obtained. To reduce computational complexity, the features with 960 channels are mapped to 32 channels via a 3 × 3 convolutional layer. Given that Transformers require token sequences as input, the 32-channel feature map with a spatial dimension of 64 × 64 is converted into tokens and fed into the Transformer layer. Finally, the features pass through a classification output module consisting of two convolutional layers to generate the final CD result map.
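For illustration, the following minimal PyTorch sketch shows the tokenization step described above, i.e., flattening the 32-channel, 64 × 64 fused feature map into a token sequence before the Transformer layer and reshaping it back afterwards; the tensor names are illustrative and not taken from the released code.

```python
import torch

# Illustrative shapes from the description above: after the 3 x 3 reduction
# convolution, the fused feature map has 32 channels at 64 x 64 resolution.
feat = torch.randn(1, 32, 64, 64)                 # (B, C, H, W)

# Flatten the spatial grid into a token sequence for the Transformer:
# each of the 64 * 64 = 4096 positions becomes one 32-dimensional token.
tokens = feat.flatten(2).permute(0, 2, 1)         # (B, H*W, C) = (1, 4096, 32)

# After the Transformer layer, the sequence is reshaped back to a feature map
# before entering the classification output module.
feat_back = tokens.permute(0, 2, 1).reshape(1, 32, 64, 64)
```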

3.2. Deep Feature Attention

To enhance the features extracted by ResNet-18, we designed a Deep Feature Attention Module. After ResNet-18 extracts bitemporal image features, we obtain four feature maps of sizes (H/4, W/4), (H/8, W/8), (H/16, W/16), (H/32, W/32) (where H and W are the input image height and width). Each feature map is first processed by SCAM and then further refined by GatedConv. The enhanced features are upsampled and fused with those from the preceding stage, iteratively, until the highest-level features are obtained. Our Deep Feature Attention Module aims to focus on change-relevant features while suppressing irrelevant ones.
SCAM (Figure 2) consists of channel and spatial attention. In our approach, the features are first processed through a channel attention mechanism and then through a spatial attention mechanism. Channel attention (Figure 2b) can help the network distinguish which channels of the feature are more important, enhance the response of key semantic channels, and suppress irrelevant (or weak) channels. This mechanism enables the network to prioritize feature dimensions carrying discriminative change information. The algorithmic process of channel attention is presented as follows:
$$M_c^{avg} = \mathrm{MLP}\left(\mathrm{AvgPool}(F)\right)$$
$$M_c^{max} = \mathrm{MLP}\left(\mathrm{MaxPool}(F)\right)$$
$$M_c(F) = \sigma\left(M_c^{avg} + M_c^{max}\right)$$
where $F \in \mathbb{R}^{C \times H \times W}$ signifies the input features, and $C$, $H$, and $W$ denote the channel dimension, spatial height, and spatial width of the feature map, respectively. First, the features extracted by ResNet-18 are passed through global average and maximum pooling layers to obtain two channel descriptors of shape $C \times 1 \times 1$, which we denote by $M_c^{avg}$ and $M_c^{max}$, respectively. Subsequently, the weight-sharing multilayer perceptron (MLP) performs channel-wise weight allocation for the two pooled vectors. Following this, the channel attention map $M_c(F)$ is derived via Sigmoid activation.
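A minimal PyTorch sketch of this channel attention branch is given below. The shared MLP is realized with 1 × 1 convolutions, and the reduction ratio of 16 is an assumption for illustration, as it is not specified above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ChannelAttention(nn.Module):
    """Channel attention: sigmoid(MLP(AvgPool(x)) + MLP(MaxPool(x)))."""

    def __init__(self, channels: int, reduction: int = 16):  # reduction ratio is assumed
        super().__init__()
        # Weight-sharing MLP applied to the C x 1 x 1 pooled descriptors.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1, bias=False),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = self.mlp(F.adaptive_avg_pool2d(x, 1))   # M_c^avg
        mx = self.mlp(F.adaptive_max_pool2d(x, 1))    # M_c^max
        return torch.sigmoid(avg + mx)                # M_c(F), shape (B, C, 1, 1)
```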
Spatial attention (Figure 2c) helps the model distinguish which spatial locations are most important, highlighting regions likely to change and suppressing responses in static (background or irrelevant) areas. Traditional spatial attention employs large convolutional kernels, which incur significant computational cost. Inspired by visual attention mechanisms [40], we decompose a large kernel convolution into three convolutional layers to reduce computational complexity. These layers include a 5 × 5 depth-wise convolution (DW-Conv), a 7 × 7 depth-wise dilation convolution (DW-D-Conv), and a 1 × 1 channel convolution. This can be mathematically formulated as:
$$M_s(F) = \mathrm{Conv}_{1 \times 1}\left(\mathrm{Conv}_{7 \times 7}\left(\mathrm{Conv}_{5 \times 5}(F)\right)\right)$$
where $F \in \mathbb{R}^{C \times H \times W}$ signifies the input features, $C$, $H$, and $W$ denote the channel dimension, spatial height, and spatial width of the feature map, respectively, and $M_s(F)$ is the spatial attention map obtained after the three convolutions.
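A sketch of this decomposed large-kernel spatial attention is shown below; the dilation rate of 3 (and the corresponding padding) follows the Visual Attention Network [40] and is an assumption here, since only the kernel sizes are stated above.

```python
import torch.nn as nn


class SpatialAttention(nn.Module):
    """M_s(F) = Conv_1x1(DW-D-Conv_7x7(DW-Conv_5x5(F)))."""

    def __init__(self, channels: int):
        super().__init__()
        # 5 x 5 depth-wise convolution.
        self.dw_conv = nn.Conv2d(channels, channels, 5, padding=2, groups=channels)
        # 7 x 7 depth-wise dilated convolution (dilation rate of 3 is assumed).
        self.dw_d_conv = nn.Conv2d(channels, channels, 7, padding=9,
                                   dilation=3, groups=channels)
        # 1 x 1 convolution mixing channels.
        self.pw_conv = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        return self.pw_conv(self.dw_d_conv(self.dw_conv(x)))  # spatial attention map M_s
```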
The enhanced features can be obtained after the above two parts of processing and can be mathematically formulated as:
$$F' = M_c(F) \otimes F$$
$$F'' = F' + M_s(F') \otimes F'$$
where $\otimes$ denotes element-wise multiplication, $F'$ denotes the feature map after channel attention processing, and $F''$ is obtained by element-wise multiplying $F'$ with its spatial attention map $M_s(F')$ and then adding $F'$.
To effectively suppress interference from environmental factors, enhance the response in changed regions and refine boundaries, inspired by spectral techniques [41], we designed GatedConv (Figure 3) for further feature processing. Traditional convolution applies the same criteria across all spatial positions and channels, whereas GatedConv allows the network to conditionally control “when to pass features” and “when to suppress noise”, dynamically adjusting information flow at each position and channel. This enables effective suppression of background, illumination, and shadow interference, improves the network’s response to changed areas, and refines change-region boundaries.
Below is the specific implementation of GatedConv. First, the input features are directed to two parallel branches: the feature-extraction branch and the gating branch. The feature-extraction branch applies a standard convolutional layer to the input features to produce a feature map. The gating branch processes the input features via a gated convolution layer with subsequent Sigmoid activation, producing a gating map. Finally, the two branch outputs are multiplied element-wise to produce the final processed features.
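The following is a minimal sketch of this two-branch gated convolution; the kernel size is an assumption, since it is not stated above.

```python
import torch
import torch.nn as nn


class GatedConv(nn.Module):
    """Feature branch multiplied element-wise by a sigmoid-activated gating branch."""

    def __init__(self, in_channels: int, out_channels: int, kernel_size: int = 3):  # kernel size assumed
        super().__init__()
        pad = kernel_size // 2
        # Feature-extraction branch: a standard convolution producing a feature map.
        self.feature = nn.Conv2d(in_channels, out_channels, kernel_size, padding=pad)
        # Gating branch: a convolution whose sigmoid output acts as a per-position, per-channel gate.
        self.gate = nn.Conv2d(in_channels, out_channels, kernel_size, padding=pad)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The gate decides, per position and channel, how much of the feature passes through.
        return self.feature(x) * torch.sigmoid(self.gate(x))
```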
Given that channel attention, spatial attention, and GatedConv can all enhance CD performance by suppressing irrelevant background features and highlighting change-related features, we term the combination of these methods Deep Feature Attention. For edge refinement, after the features undergo initial processing by the Spatial and Channel Attention Module (SCAM) to suppress irrelevant information and accentuate changes, we further employ GatedConv for subsequent feature refinement. GatedConv adaptively amplifies responses along edge contours while suppressing redundant background signals based on the current feature context, thereby yielding sharper and more contiguous boundary details in the processed features.

3.3. Transformer

To alleviate the limitations of the features extracted by the network in terms of global context and long-range dependencies, and to enable the network to better model feature change patterns and inter-category relationships across different geographic regions, we introduce the Transformer module [33] (Figure 4). The Transformer consists of an encoder and a decoder, which are described in detail below.
The encoder comprises $N_E$ layers, each integrating a multi-head self-attention mechanism and an MLP in sequence. In each layer, the query $Q$, key $K$, and value $V$ fed into the multi-head self-attention mechanism are obtained through three parallel linear transformations applied to the feature map $X$ from the Deep Feature Attention Module. This can be mathematically formulated as:
$$Q = XW_Q, \quad K = XW_K, \quad V = XW_V$$
where $W_Q$, $W_K$, and $W_V$ represent the weight matrices used to generate $Q$, $K$, and $V$, respectively.
An attention head is mathematically represented as:
$$\mathrm{Attention}(Q, K, V) = \sigma\!\left(\frac{QK^{T}}{\sqrt{d}}\right)V$$
where $d$ is the channel dimension of the triplet, and $\sigma(\cdot)$ signifies the softmax function applied along the channel dimension.
Constructed with two linear projection layers, the MLP incorporates Gaussian error linear unit (GELU) [42] activation as an intervening non-linear transformation, facilitating hierarchical feature encoding. The input and output dimensions are both fixed at C, while the hidden layer has a dimensionality of 2C. This can be expressed as:
$$\mathrm{MLP}(X) = \mathrm{GELU}(XW_1)W_2$$
where $W_1$ and $W_2$ are linear projection matrices.
The composition of the decoder is similar to that of the encoder, with the key difference being the source of Q, K, and V. In DFANet, the Q, K, and V of the encoder are each derived by applying three distinct linear transformations to the feature map X from the Deep Feature Attention Module. In the decoder, however, Q is derived directly from X, while K and V originate from the encoder’s output features. These Q, K, and V are then fed into the self-attention layer, followed by the MLP, to produce the final decoded features.
We attempted various combinations of encoder and decoder layers, and found that changes in the number of encoder and decoder layers had no significant effect on improving the performance of CD. Considering the high demand of Transformer for hardware memory, we adopted a Transformer with a combination of one encoder layer and one decoder layer to reduce the load.
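For illustration, the one-encoder/one-decoder configuration can be sketched with standard PyTorch layers as below; the number of attention heads is an assumption, while the token dimension C = 32 and the 2C feed-forward width follow the description above.

```python
import torch
import torch.nn as nn

C = 32  # token dimension after the channel-reduction convolution

# One encoder layer and one decoder layer, GELU MLP with hidden width 2C.
encoder_layer = nn.TransformerEncoderLayer(d_model=C, nhead=4, dim_feedforward=2 * C,
                                           activation="gelu", batch_first=True)  # nhead assumed
decoder_layer = nn.TransformerDecoderLayer(d_model=C, nhead=4, dim_feedforward=2 * C,
                                           activation="gelu", batch_first=True)

tokens = torch.randn(1, 64 * 64, C)                  # token sequence X from the 64 x 64 feature map
memory = encoder_layer(tokens)                       # encoder output, source of K and V for the decoder
decoded = decoder_layer(tgt=tokens, memory=memory)   # Q from X, K and V from the encoder output
```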

3.4. Classifier

The classification output module consists of an upsampling layer followed by two convolutional layers (Figure 5). The Transformer-processed feature map is first upsampled to a 256 × 256 resolution with 32 channels, which then undergoes the first convolutional block. This block is structured as a 3 × 3 convolution followed by batch normalization and ReLU activation, preserving both the spatial dimensions and channel count of the feature map. Subsequently, the feature map is fed into the second convolutional layer, a single 3 × 3 convolution that maintains the spatial resolution while reducing the channel number to 2. Through this architecture, the final output is a 256 × 256 CD map with two channels, representing the binary classification results.
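A sketch of this classification head is given below; the bilinear upsampling mode is an assumption.

```python
import torch.nn as nn

classifier = nn.Sequential(
    nn.Upsample(size=(256, 256), mode="bilinear", align_corners=False),  # upsampling mode assumed
    nn.Conv2d(32, 32, kernel_size=3, padding=1),   # first block: keeps 32 channels at 256 x 256
    nn.BatchNorm2d(32),
    nn.ReLU(inplace=True),
    nn.Conv2d(32, 2, kernel_size=3, padding=1),    # second convolution: 2-channel change/no-change map
)
```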

4. Experimental Results and Analysis

This section commences with an introduction to the two publicly available building CD datasets, followed by a detailed elaboration on implementation specifics and evaluation metrics. Next, comparative experiments against state-of-the-art methods are conducted to validate the proposed approach’s superiority. The section concludes with ablation studies and complexity analysis to assess the efficacy of each proposed module.

4.1. Datasets

For a comprehensive evaluation of the proposed approach, we conducted experiments on two public CD datasets, which can be summarized as follows:
(1)
LEVIR-CD [43]: A public large-scale building CD dataset. It consists of 637 pairs of high-resolution (0.5-m) image patches, each sized at 1024 × 1024 pixels. These bitemporal images were collected from multiple cities in Texas, USA, spanning a timeframe of 5 to 14 years. We cropped the images into non-overlapping 256 × 256 patches and partitioned them according to the official training, validation, and test splits, resulting in 7120 pairs for training, 1024 pairs for validation, and 2048 pairs for testing. The dataset can be obtained from https://justchenhao.github.io/LEVIR/ (accessed on 10 October 2022).
(2)
WHU-CD [44]: A public building CD dataset, namely the high-resolution aerial imagery building CD dataset released by the GPCV Group at Wuhan University in 2019. It contains a pair of high-resolution aerial images with a resolution of 0.075 m and a size of 32,507 × 15,354 pixels. The images were cropped into non-overlapping patches of 256 × 256 pixels, which were randomly partitioned into three subsets: 6096 pairs for training, 762 pairs for validation, and 762 pairs for testing. The dataset can be obtained from https://gpcv.whu.edu.cn/data/building_dataset.html (accessed on 1 October 2023).
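Both datasets are prepared by cropping the source images into non-overlapping 256 × 256 patches, as described above; a minimal sketch of such a cropping routine is given below (the handling of edge remainders is an assumption).

```python
import numpy as np


def crop_patches(image: np.ndarray, size: int = 256) -> list[np.ndarray]:
    """Split an (H, W, C) image into non-overlapping size x size patches,
    discarding any remainder at the right/bottom edges (assumption)."""
    h, w = image.shape[:2]
    return [image[r:r + size, c:c + size]
            for r in range(0, h - size + 1, size)
            for c in range(0, w - size + 1, size)]
```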

4.2. Implementation Details and Evaluation Metrics

DFANet was implemented using the PyTorch 2.5.0 DL framework and trained on an NVIDIA GeForce RTX 4060 Ti graphics card with 16 GB of GPU memory. Before feeding the data into the network, we applied conventional data augmentation methods to the input image patches, including flipping, scaling, cropping, and Gaussian blurring. The network was optimized using stochastic gradient descent (SGD) with a momentum of 0.99 and a weight decay of 5 × 10⁻⁴, together with a cross-entropy loss function for parameter updating to ensure convergent training. With a batch size of 16, the learning rate was initialized at 0.01 and followed a linear decay schedule to zero throughout 200 training epochs. Validation was performed after each epoch to monitor model generalization, and the checkpoint with the best validation performance was used for test set inference to ensure unbiased evaluation.
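The optimizer and schedule described above can be sketched as follows; the placeholder model stands in for DFANet, and the LambdaLR-based linear decay is one possible realization of the stated schedule.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 2, kernel_size=3, padding=1)   # placeholder standing in for DFANet
criterion = nn.CrossEntropyLoss()                    # cross-entropy loss for the binary change map

optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.99, weight_decay=5e-4)

epochs = 200
# Learning rate decays linearly from 0.01 toward zero over 200 epochs.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: 1.0 - epoch / epochs)
```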
To comprehensively evaluate the method’s efficacy, we employ five representative evaluation metrics: precision (Pre), recall (Rec), F1 score, intersection over union (IoU), and overall accuracy (OA), the definitions of which are presented as follows:
$$\mathrm{Pre} = \frac{TP}{TP + FP}$$
$$\mathrm{Rec} = \frac{TP}{TP + FN}$$
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
$$\mathrm{IoU} = \frac{\left|\mathrm{DetectionResult} \cap \mathrm{GroundTruth}\right|}{\left|\mathrm{DetectionResult} \cup \mathrm{GroundTruth}\right|}$$
$$\mathrm{OA} = \frac{TP + TN}{TP + FP + FN + TN}$$
where TP, FP, TN, and FN signify the quantities of true positive, false positive, true negative, and false negative, respectively. Precision measures the proportion of samples predicted as positive by the model that are actually positive. Recall measures the proportion of all true positive samples that are correctly identified by the model. The F1 score, a harmonic mean of precision and recall, serves as a composite metric: higher values reflect more balanced and effective model performance. IoU is mainly used in image segmentation or object detection to measure the overlap degree between the predicted region and the real region. OA is the proportion of all samples (including positive and negative examples) that are correctly predicted by the model.
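For reference, the five metrics can be computed from pixel-wise counts as in the following sketch (no zero-division guards are included, assuming both classes are present in the evaluated maps).

```python
import numpy as np


def cd_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Compute Pre, Rec, F1, IoU, and OA from binary (0/1) change maps."""
    tp = np.sum((pred == 1) & (gt == 1))
    fp = np.sum((pred == 1) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    tn = np.sum((pred == 0) & (gt == 0))
    pre = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * pre * rec / (pre + rec)
    iou = tp / (tp + fp + fn)                  # change-class intersection over union
    oa = (tp + tn) / (tp + fp + fn + tn)
    return {"Pre": pre, "Rec": rec, "F1": f1, "IoU": iou, "OA": oa}
```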

4.3. Comparison Methods

To validate the superiority of DFANet, it is compared to the following seven state-of-the-art CD methods.
(1)
FC-EF [24]: An early fusion approach where bitemporal images are channel-wise concatenated to form a multi-channel feature volume, which is then processed through a UNet-based encoder-decoder architecture to generate a pixel-wise change detection map.
(2)
FC-Siam-Conc [24]: A late fusion variant of FC-EF that employs dual parallel backbone networks to extract hierarchical features from bitemporal images;
(3)
FC-Siam-Diff [24]: A late fusion variant akin to FC-Siam-Conc, which extracts hierarchical features via twin backbones and fuses diachronic information by concatenating absolute differences of corresponding-level features before feeding into a UNet decoder for change map generation;
(4)
SNUNet-CD [27]: A hierarchical feature fusion architecture that integrates a Siamese network with UNet++ for CD. It employs dense skip connections between encoders and decoders to propagate high-resolution bitemporal features, enabling multi-scale context aggregation;
(5)
BIT [36]: A Transformer-based CD approach that employs a Transformer encoder-decoder framework to extract contextual dependencies from features. It integrates an augmented semantically labeled CNN for multi-scale feature extraction, followed by element-wise feature differencing to generate change maps;
(6)
ICIFNet [35]: A CD framework integrating CNN and Transformer, which enhances complex scene performance via same-scale cross-modal interaction and hierarchical cross-scale fusion;
(7)
EATDer [45]: An Edge-Assisted Adaptive Transformer designed for remote sensing change detection. EATDer integrates edge information to guide feature extraction and employs a multi-scale adaptive attention mechanism to enhance detection accuracy.
These models were realized using the publicly available codebases of the aforementioned architectures, and their hyperparameters were kept as close as possible to those in the original literature.

4.4. Results Evaluation

Within this subsection, we present a comprehensive performance validation of DFANet and comparative methods. The quantitative evaluation results on the benchmark LEVIR-CD and WHU-CD test sets are systematically summarized in Table 1 and Table 2, respectively. As evidenced by the tabulated results, the proposed method outperforms competing approaches across both datasets, demonstrating superior CD performance.
The best results are marked in bold. The results in Table 1 reveal that DFANet attains state-of-the-art performance on the LEVIR-CD dataset, achieving the best F1 score, IoU, and OA of 90.56%, 82.75%, and 99.05% among all evaluated methods. The F1 score improved by 7.16% compared with FC-EF, 0.60% compared with ICIFNet, and 0.65% compared with EATDer. The results in Table 2 reveal that DFANet attains state-of-the-art performance on the WHU-CD dataset, achieving the best F1 score, IoU, and OA of 89.98%, 81.78%, and 99.22% among all evaluated methods. The F1 score improved by 6.48% compared with SNUNet-CD, 1.66% compared with ICIFNet, and 3.68% compared with EATDer. We attribute the improvement in these metrics to the SCAM, GatedConv, and Transformer modules we incorporated. These architectural modules facilitate enhanced feature extraction and strengthen the network’s attentional focus on change features.
The visual comparison of each method on the LEVIR-CD dataset is presented in Figure 6. Empirical evidence shows that DFANet outperforms competing methods in generating more accurate CD results. Specifically, benefiting from the SCAM and GatedConv modules, DFANet better discriminates the boundaries of building change areas compared to other methods (Figure 6a,b,e,f). It also more effectively reduces phenomena such as missed detection and false detection in predictions (Figure 6a,c,d). When detecting irregular building changes, comparative methods exhibit varying degrees of missed detection, pseudo-changes, and blurred boundaries, whereas DFANet more completely detects building change areas (Figure 6a,c,d). Thanks to the Transformer module, DFANet significantly outperforms other methods in the completeness of prediction results (Figure 6a–f).
The visual comparison of each method on the WHU-CD dataset is presented in Figure 7. Empirical evidence shows that DFANet outperforms competing methods in generating more accurate CD results. Due to the similar color features of roads, cement surfaces, and buildings, most comparative methods misclassify road and cement surface changes as building changes, leading to pseudo-changes (Figure 7a–c,f), or fail to detect cement-to-building changes, resulting in missed detection (Figure 7e). DFANet, however, effectively avoids these issues due to its deep feature attention module. Thanks to the Transformer module, DFANet significantly outperforms other methods in the completeness of prediction results (Figure 7a–f).

4.5. Complexity Analysis and Significance Test

FLOPs and Params are the two most commonly used metrics to measure model complexity. FLOPs represent the total number of floating-point multiplications and additions performed in a model’s forward inference pass; higher FLOPs theoretically require more computation and lead to higher latency and energy consumption during inference. Params represent the total number of learnable parameters (weights and biases) in a model; more parameters indicate greater model capacity, which theoretically enables fitting more complex functions and learning richer features.
The F1 scores and complexity metrics of the various methods on the LEVIR-CD and WHU-CD benchmark datasets are summarized in Table 3. DFANet has a FLOPs value of 7.88 G on both datasets, which is lower than that of most compared methods, indicating lower computational requirements, inference latency, and energy consumption. Its Params count is 30.11 M, relatively high among all methods, primarily due to the inclusion of the Transformer module: the self-attention mechanism used in the Transformer incurs higher computational and memory overhead. However, the relatively high Params count also suggests, to some extent, that DFANet learns richer features, enabling better CD performance. In summary, DFANet demonstrates a superior balance between complexity and detection performance.
We plot the numerical results of F1 score and complexity values of different methods on the LEVIR-CD and WHU-CD datasets in Figure 8, which provides a clearer visualization of the differences among different methods. We use red triangles to represent DFANet and blue dots to represent other methods.
To validate the statistical significance of DFANet’s improvements, the McNemar test [46] was employed. Specifically, a p-value less than 0.05 in this test indicates a statistically significant difference between the two compared methods. We conducted comparisons between DFANet and six alternative methods. Table 4 presents the McNemar test results of the different methods on the LEVIR-CD and WHU-CD datasets. Herein, A represents the number of pixels correctly identified by DFANet but misclassified by the counterpart method, while B denotes the number of pixels misidentified by DFANet but correctly classified by the counterpart method. The results in Table 4 demonstrate that DFANet significantly outperforms the other methods on both the LEVIR-CD and WHU-CD datasets.
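The test statistic can be reproduced from the discordant counts A and B as in the sketch below (the continuity correction is omitted, consistent with χ² = (A − B)²/(A + B) matching the values in Table 4); SciPy is assumed for the p-value.

```python
from scipy.stats import chi2


def mcnemar(a: float, b: float) -> tuple[float, float]:
    """McNemar's test from discordant pixel counts:
    a = pixels correct only for DFANet, b = pixels correct only for the other method."""
    x2 = (a - b) ** 2 / (a + b)   # chi-square statistic without continuity correction
    p = chi2.sf(x2, df=1)         # p-value with one degree of freedom
    return x2, p


# Example with the FC-Siam-Diff / LEVIR-CD counts from Table 4 (pixels):
x2, p = mcnemar(2.1847e6, 0.5207e6)   # x2 is approximately 1.02e6, p < 0.001
```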

4.6. Ablation Study

In this subsection, the ablation experiments of DFANet on the LEVIR-CD dataset are systematically detailed. Specifically designed for building CD, DFANet incorporates SCAM, GatedConv, and Transformer modules. To evaluate the efficacy of these modules, ablation experiments are systematically designed, and their visualization results are presented in Figure 9. The heatmaps generated from the prediction head are shown in Figure 10. The network architecture devoid of supplementary modules is designated as Base, which is designed based on ResNet-18. The model with the SCAM module added is denoted as M_a, the model with both SCAM and GatedConv modules added is denoted as M_b, and the model with both SCAM and Transformer modules added is denoted as M_c.
To assess the impact of the three modules, ablation experiments were performed, with their results presented in Table 5. As indicated by the values in the table, adding the SCAM module to Base significantly increases the F1 score by 2.19% compared with Base, demonstrating that the SCAM module can effectively focus on change regions and enhance feature representation. When both SCAM and GatedConv modules are incorporated into Base, the F1 score increases by 2.21%, indicating that the GatedConv module further extracts change features. Adding both SCAM and Transformer modules to Base boosts the F1 score by 2.37%, indicating that the Transformer module alleviates limitations in capturing global context and long-distance dependencies, enabling the network to better model feature change patterns. When all three modules (SCAM, GatedConv, and Transformer) are added to Base, the F1 score improves by 2.63%, representing the maximum enhancement.
We also evaluated the FLOPs and Params of each model in the ablation study to better quantify the computational and storage overhead introduced by our method. The results show that the FLOPs generally exhibit a downward trend, while the Params generally show an upward trend.
The ablation experiment visualizations are documented in Figure 9, where distinct colors are employed to represent true positives (white), true negatives (black), false positives (red), and false negatives (green). From the visualization results, we can see that the SCAM, GatedConv, and Transformer modules added by us all contribute to performance improvement. M_a, with the addition of the SCAM module, enhances the baseline model’s focus on changing features and reduces missed detection (Figure 9a,e). M_b, by adding the GatedConv module to the SCAM-integrated M_a, further extracts features and refines building boundaries (Figure 9a,c,d). M_c, by adding the Transformer module to the SCAM-integrated Base model, better captures feature change patterns and the relationships between different blocks and feature classes, thus distinguishing changed and unchanged areas (Figure 9b,e,f). DFANet achieves the optimal CD performance due to the integration of the SCAM, GatedConv, and Transformer modules (Figure 9a–f).

4.7. Parameter Analysis

In this subsection, we present a hyperparameter analysis of DFANet. We first analyze the impact of the number of Transformer encoder and decoder layers on CD performance on the LEVIR-CD and WHU-CD datasets, and then analyze the impacts of the batch size and learning rate on the LEVIR-CD dataset.
We tried various combinations of encoder and decoder layers and found that changes in the number of encoder and decoder layers did not significantly improve the performance of CD. Experimental results are presented in Table 6. Considering the high hardware memory requirements of Transformer, to ensure CD accuracy and minimize model load as much as possible, we adopted a Transformer combined with one encoder layer and one decoder layer.
The batch size refers to the number of samples input into the model in each iteration. The learning rate controls the step size of model parameter updates. Different batch sizes and learning rates will affect the convergence speed and final performance of the model. We investigate the impact of the batch size and learning rate on the F1 score of DFANet (Figure 11). For the batch size hyperparameter, while keeping all other hyperparameters constant, we trained the model using batch sizes of 4, 8, 16, and 32, respectively. The maximum F1 score of 90.56% was achieved at a batch size of 16. For the learning rate hyperparameter, while keeping all other hyperparameters constant, we trained the model with values of 0.001, 0.005, 0.01, and 0.02. The maximum F1 score of 90.56% was achieved at a learning rate of 0.01.

5. Discussion

In summary, we performed extensive experiments on two public building CD datasets, LEVIR-CD and WHU-CD. Through quantitative comparisons, DFANet outperforms other methods in evaluation metrics such as F1 score, demonstrating superior performance. Due to the incorporation of SCAM, GatedConv, and Transformer modules, DFANet exhibits excellent performance in building CD. These modules help the network learn more discriminative features and enhance its focus on change features. The SCAM module is integrated for feature enhancement, effectively mitigating the impact of environmental factors on change area recognition. This enables the network to accurately extract change features and avoid pseudo-changes. The GatedConv module is added for further feature refinement, assisting the network in precisely delineating the boundaries of building change areas. The Transformer module helps the network model the global context in bitemporal images, allowing it to better capture feature change patterns and the relationships between different image blocks and feature classes.
However, DFANet still has some limitations. First, since we used a basic ResNet-18 as the feature extractor, we will explore the impact of other advanced feature extractors on network performance in future research. Second, given the variety of existing attention mechanisms, we will investigate how other attention mechanisms affect building CD performance and compare them with the ones integrated into our model. Finally, cross-entropy loss is adopted as the loss function. In future work, we plan to explore and combine alternative loss functions while closely monitoring the resulting variations in network performance.

6. Conclusions

In this paper, we propose a Deep Feature Attention Network (DFANet) for building CD in RS images. For the purpose of exhaustive change feature extraction from bitemporal imagery, we incorporate the SCAM and GatedConv modules. SCAM is specifically designed to enhance the model’s focus on change features, while GatedConv further extracts detailed change features and refines feature boundaries. Additionally, to model the global contextual information in bitemporal imagery, a Transformer module is incorporated, which facilitates the capture of fine-grained feature change dynamics and contextual associations through self-attention mechanisms. Compared with existing state-of-the-art methods, DFANet demonstrates superior performance on LEVIR-CD and WHU-CD. It also strikes a better balance between complexity and detection performance, demonstrating the robustness of DFANet for improving CD accuracy.

Author Contributions

Conceptualization, H.D.; methodology, H.D. and P.L.; software, P.L.; validation, H.D., P.L. and X.T.; formal analysis, P.L.; investigation, P.L. and X.T.; resources, H.D. and P.L.; data curation, P.L.; writing—original draft preparation, P.L.; writing—review and editing, H.D., P.L. and X.T.; visualization, P.L.; supervision, H.D., and X.T.; project administration, H.D.; funding acquisition, H.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Major Project of High Resolution Earth Observation System under Grant No. 30-Y60B01-9003-22/23.

Data Availability Statement

The research data generated in this study are accessible upon reasonable request to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
DFANet: Deep Feature Attention Network
FC-EF: Fully Convolutional-Early Fusion
FC-Siam-Diff: Fully Convolutional-Siamese-Difference
FC-Siam-Conc: Fully Convolutional-Siamese-Concatenation
SNUNet-CD: Siamese NestedUNet for Change Detection
BIT: Bitemporal Image Transformer
ICIFNet: Intra-scale Cross-interaction and Inter-scale Feature Fusion Network
CBAM: Convolutional Block Attention Module
SCAM: Spatial-Channel Attention Module
CNN: Convolutional Neural Network
CD: Change Detection
RS: Remote Sensing
DL: Deep Learning
ML: Machine Learning
GELU: Gaussian Error Linear Unit

References

  1. De Bem, P.P.; de Carvalho Junior, O.A.; Fontes Guimarães, R.; Trancoso Gomes, R.A. Change detection of deforestation in the Brazilian Amazon using landsat data and convolutional neural networks. Remote Sens. 2020, 12, 901. [Google Scholar] [CrossRef]
  2. Kennedy, R.E.; Townsend, P.A.; Gross, J.E.; Cohen, W.B.; Bolstad, P.; Wang, Y.; Adams, P. Remote sensing change detection tools for natural resource managers: Understanding concepts and tradeoffs in the design of landscape monitoring projects. Remote Sens. Environ. 2009, 113, 1382–1396. [Google Scholar] [CrossRef]
  3. Hawash, E.; El-Hassanin, A.; Amer, W.; El-Nahry, A.; Effat, H. Change detection and urban expansion of Port Sudan, Red Sea, using remote sensing and GIS. Environ. Monit. Assess. 2021, 193, 723. [Google Scholar] [CrossRef] [PubMed]
  4. Qing, Y.; Ming, D.; Wen, Q.; Weng, Q.; Xu, L.; Chen, Y.; Zhang, Y.; Zeng, B. Operational earthquake-induced building damage assessment using CNN-based direct remote sensing change detection on superpixel level. Int. J. Appl. Earth Observ. Geoinf. 2022, 112, 102899. [Google Scholar] [CrossRef]
  5. Zhang, Z.; Jiang, H.; Pang, S.; Hu, X. Research Status and Prospects of Change Detection for Multi-Temporal Remote Sensing Images. J. Geomatics. 2022, 51, 1091–1107. [Google Scholar]
  6. Coppin, P.; Jonckheere, I.; Nackaerts, K.; Muys, B.; Lambin, E. Digital change detection methods in ecosystem monitoring: A review. Int. J. Remote Sens. 2004, 25, 1565–1596. [Google Scholar] [CrossRef]
  7. Howarth, P.J.; Wickware, G.M. Procedures for change detection using Landsat digital data. Int. J. Remote Sens. 1981, 2, 277–291. [Google Scholar] [CrossRef]
  8. Dai, X.; Khorram, S. Quantification of the impact of misregistration on the accuracy of remotely sensed change detection. In Proceedings of the IGARSS’97. 1997 IEEE International Geoscience and Remote Sensing Symposium Proceedings. Remote Sensing-A Scientific Vision for Sustainable Development, Singapore, 3–8 August 1997; pp. 1763–1765. [Google Scholar]
  9. Celik, T. Unsupervised change detection in satellite images using principal component analysis and k-means clustering. IEEE Geosci. Remote Sens. Lett. 2009, 6, 772–776. [Google Scholar] [CrossRef]
  10. Han, T.; Wulder, M.A.; White, J.C.; Coops, N.C.; Alvarez, M.; Butson, C. An efficient protocol to process Landsat images for change detection with tasselled cap transformation. IEEE Geosci. Remote Sens. Lett. 2007, 4, 147–151. [Google Scholar] [CrossRef]
  11. Habib, T.; Inglada, J.; Mercier, G.; Chanussot, J. Support vector reduction in SVM algorithm for abrupt change detection in remote sensing. IEEE Geosci. Remote Sens. Lett. 2009, 6, 606–610. [Google Scholar] [CrossRef]
  12. Zong, K.; Sowmya, A.; Trinder, J. Building change detection from remotely sensed images based on spatial domain analysis and Markov random field. J. Appl. Remote Sens. 2019, 13, 024514. [Google Scholar] [CrossRef]
  13. Im, J.; Jensen, J.R. A change detection model based on neighborhood correlation image analysis and decision tree classification. Remote Sens. Environ. 2005, 99, 326–340. [Google Scholar] [CrossRef]
  14. Seo, D.K.; Kim, Y.H.; Eo, Y.D.; Park, W.Y.; Park, H.C. Generation of radiometric, phenological normalized image based on random forest regression for change detection. Remote Sens. 2017, 9, 1163. [Google Scholar] [CrossRef]
  15. Feng, W.; Sui, H.; Tu, J.; Huang, W.; Sun, K. A novel change detection approach based on visual saliency and random forest from multi-temporal high-resolution remote-sensing images. Int. J. Remote Sens. 2018, 39, 7998–8021. [Google Scholar] [CrossRef]
  16. Ball, J.E.; Anderson, D.T.; Chan, C.S. Comprehensive survey of deep learning in remote sensing: Theories, tools, and challenges for the community. J. Appl. Remote Sens. 2017, 11, 042609. [Google Scholar] [CrossRef]
  17. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  18. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; part III 18. pp. 234–241. [Google Scholar]
  19. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  20. Peng, D.; Bruzzone, L.; Zhang, Y.; Guan, H.; Ding, H.; Huang, X. SemiCDNet: A semisupervised convolutional neural network for change detection in high resolution remote-sensing images. IEEE Trans. Geosci. Remote Sens. 2020, 59, 5891–5906. [Google Scholar] [CrossRef]
  21. Liu, Y.; Pang, C.; Zhan, Z.; Zhang, X.; Yang, X. Building change detection for remote sensing images using a dual-task constrained deep siamese convolutional network model. IEEE Geosci. Remote Sens. Lett. 2020, 18, 811–815. [Google Scholar] [CrossRef]
  22. Zhang, C.; Yue, P.; Tapete, D.; Jiang, L.; Shangguan, B.; Huang, L.; Liu, G. A deeply supervised image fusion network for change detection in high resolution bi-temporal remote sensing images. ISPRS J. Photogramm. Remote Sens. 2020, 166, 183–200. [Google Scholar] [CrossRef]
  23. Chen, J.; Yuan, Z.; Peng, J.; Chen, L.; Huang, H.; Zhu, J.; Liu, Y.; Li, H. DASNet: Dual attentive fully convolutional Siamese networks for change detection in high-resolution satellite images. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2020, 14, 1194–1206. [Google Scholar] [CrossRef]
  24. Daudt, R.C.; Le Saux, B.; Boulch, A. Fully convolutional siamese networks for change detection. In Proceedings of the 2018 25th IEEE International Conference on Image Processing (ICIP), Athens, Greece, 7–10 October 2018; pp. 4063–4067. [Google Scholar]
  25. Peng, D.; Zhang, Y.; Guan, H. End-to-end change detection for high resolution satellite images using improved UNet++. Remote Sens. 2019, 11, 1382. [Google Scholar] [CrossRef]
  26. Liu, R.; Jiang, D.; Zhang, L.; Zhang, Z. Deep depthwise separable convolutional network for change detection in optical aerial images. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2020, 13, 1109–1118. [Google Scholar] [CrossRef]
  27. Fang, S.; Li, K.; Shao, J.; Li, Z. SNUNet-CD: A densely connected Siamese network for change detection of VHR images. IEEE Geosci. Remote Sens. Lett. 2021, 19, 1–5. [Google Scholar] [CrossRef]
  28. Shi, Q.; Liu, M.; Li, S.; Liu, X.; Wang, F.; Zhang, L. A deeply supervised attention metric-based network and an open aerial image dataset for remote sensing change detection. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–16. [Google Scholar] [CrossRef]
  29. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European conference on computer vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  30. Guo, E.; Fu, X.; Zhu, J.; Deng, M.; Liu, Y.; Zhu, Q.; Li, H. Learning to measure change: Fully convolutional siamese metric networks for scene change detection. arXiv 2018, arXiv:1810.09111. [Google Scholar] [CrossRef]
  31. Li, Z.; Yan, C.; Sun, Y.; Xin, Q. A densely attentive refinement network for change detection based on very-high-resolution bitemporal remote sensing images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–18. [Google Scholar] [CrossRef]
  32. Chen, P.; Zhang, B.; Hong, D.; Chen, Z.; Yang, X.; Li, B. FCCDN: Feature constraint network for VHR image change detection. ISPRS J. Photogramm. Remote Sens. 2022, 187, 101–119. [Google Scholar] [CrossRef]
  33. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
  34. Zhang, C.; Wang, L.; Cheng, S.; Li, Y. SwinSUNet: Pure transformer network for remote sensing image change detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–13. [Google Scholar] [CrossRef]
  35. Feng, Y.; Xu, H.; Jiang, J.; Liu, H.; Zheng, J. ICIF-Net: Intra-scale cross-interaction and inter-scale feature fusion network for bitemporal remote sensing images change detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–13. [Google Scholar] [CrossRef]
  36. Chen, H.; Qi, Z.; Shi, Z. Remote sensing image change detection with transformers. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–14. [Google Scholar] [CrossRef]
  37. Jian, P.; Ou, Y.; Chen, K. Uncertainty-aware graph self-supervised learning for hyperspectral image change detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–19. [Google Scholar] [CrossRef]
  38. Pang, S.; Lan, J.; Zuo, Z.; Chen, J. SFGT-CD: Semantic Feature-Guided Building Change Detection from Bitemporal Remote-Sensing Images with Transformers. IEEE Geosci. Remote Sens. Lett. 2023, 21, 1–5. [Google Scholar] [CrossRef]
  39. Lu, W.; Wei, L.; Nguyen, M. Bitemporal Attention Transformer for Building Change Detection and Building Damage Assessment. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 4917–4935. [Google Scholar] [CrossRef]
  40. Guo, M.H.; Lu, C.Z.; Liu, Z.N.; Cheng, M.M.; Hu, S.M. Visual Attention Network. arXiv 2022, arXiv:2202.09741. [Google Scholar] [CrossRef]
  41. Patro, B.N.; Namboodiri, V.P.; Agneeswaran, V.S. Spectformer: Frequency and attention is what you need in a vision transformer. In Proceedings of the 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Tucson, AZ, USA, 26 February–6 March 2025; pp. 9543–9554. [Google Scholar]
  42. Hendrycks, D.; Gimpel, K. Gaussian error linear units (gelus). arXiv 2016, arXiv:1606.08415. [Google Scholar]
  43. Chen, H.; Shi, Z. A spatial-temporal attention-based method and a new dataset for remote sensing image change detection. Remote Sens. 2020, 12, 1662. [Google Scholar] [CrossRef]
  44. Ji, S.; Wei, S.; Lu, M. Fully convolutional networks for multisource building extraction from an open aerial and satellite imagery data set. IEEE Trans. Geosci. Remote Sens. 2018, 57, 574–586. [Google Scholar] [CrossRef]
  45. Ma, J.; Duan, J.; Tang, X.; Zhang, X.; Jiao, L. EATDer: Edge-Assisted Adaptive Transformer Detector for Remote Sensing Change Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5602015. [Google Scholar] [CrossRef]
  46. Manandhar, R.; Odeh, I.O.; Ancev, T. Improving the accuracy of land use and land cover classification of Landsat data using post-classification enhancement. Remote Sens. 2009, 1, 330–344. [Google Scholar] [CrossRef]
Figure 1. DFANet structure diagram.
Figure 2. SCAM we designed. (a) SCAM overview, (b) channel attention module, (c) spatial attention module.
Figure 3. GatedConv overview.
Figure 4. Overview of Transformer. (a) Transformer encoder; (b) Transformer decoder.
Figure 5. Classifier overview.
Figure 6. Visualization results of different methods on the LEVIR-CD dataset. (a–f) present the visualization results on the LEVIR-CD test set. Different colors are utilized to represent true positives (white), true negatives (black), false positives (red), and false negatives (green).
Figure 7. Visualization results of different methods on the WHU-CD dataset. (a–f) present the visualization results on the WHU-CD test set. Different colors are utilized to represent true positives (white), true negatives (black), false positives (red), and false negatives (green).
Figure 8. F1 score and complexity of different methods on the LEVIR-CD and WHU-CD datasets. (a,b) display the complexity and F1 score on the LEVIR-CD dataset; (c,d) display the complexity and F1 score on the WHU-CD dataset.
Figure 9. Visualization results of ablation experiments for DFANet on the LEVIR-CD dataset. (a–f) present the visualization results of ablation experiments on the LEVIR-CD test set. Different colors are utilized to represent true positives (white), true negatives (black), false positives (red), and false negatives (green).
Figure 10. The heatmaps of ablation experiment results for DFANet on the LEVIR-CD dataset. (a–f) present the heatmaps of ablation experiment results on the LEVIR-CD test set. The color gradient from blue to red indicates low to high attention to features.
Figure 11. Hyperparameter analysis of DFANet on the LEVIR-CD dataset. (a) Batch size, (b) Learning rate.
Table 1. Quantitative results of different methods on the LEVIR-CD dataset; the best results are marked in bold. All metrics are described as percentages (%).
Methods | Pre | Rec | F1 | IoU | OA
FC-EF [24] | 86.91 | 80.17 | 83.40 | 71.53 | 98.39
FC-Siam-Diff [24] | 89.53 | 83.31 | 86.31 | 75.92 | 98.67
FC-Siam-Conc [24] | 91.99 | 76.77 | 83.69 | 71.96 | 98.49
SNUNet-CD [27] | 89.18 | 87.17 | 88.16 | 78.83 | 98.82
BIT [36] | 89.24 | 89.37 | 89.31 | 80.68 | 98.92
ICIFNet [35] | 91.32 | 88.64 | 89.96 | 81.75 | 98.99
EATDer [45] | 88.13 | 91.77 | 89.91 | 81.68 | 98.95
Base | 90.64 | 85.39 | 87.93 | 78.46 | 98.81
DFANet (Ours) | 91.87 | 89.29 | 90.56 | 82.75 | 99.05
Table 2. Quantitative results of different methods on the WHU-CD dataset; the best results are marked in bold. All metrics are described as percentages (%).
Methods | Pre | Rec | F1 | IoU | OA
FC-EF [24] | 71.63 | 67.25 | 69.37 | 53.11 | 97.61
FC-Siam-Diff [24] | 47.33 | 77.66 | 58.81 | 41.66 | 95.63
FC-Siam-Conc [24] | 60.88 | 73.58 | 66.63 | 49.95 | 97.04
SNUNet-CD [27] | 85.60 | 81.49 | 83.50 | 71.67 | 98.71
BIT [36] | 86.64 | 81.48 | 83.98 | 72.39 | 98.75
ICIFNet [35] | 92.98 | 85.56 | 88.32 | 79.24 | 98.96
EATDer [45] | 86.38 | 86.82 | 86.60 | 76.36 | 98.88
Base | 86.26 | 89.09 | 87.66 | 78.02 | 99.04
DFANet (Ours) | 92.32 | 87.75 | 89.98 | 81.78 | 99.22
Table 3. F1 score and complexity of different methods on LEVIR-CD and WHU-CD datasets.
Methods | FLOPs (G) | Params (M) | F1 (%) LEVIR-CD | F1 (%) WHU-CD
FC-EF [24] | 3.57 | 1.35 | 83.40 | 69.37
FC-Siam-Diff [24] | 4.72 | 1.35 | 86.31 | 58.81
FC-Siam-Conc [24] | 5.32 | 1.55 | 83.69 | 66.63
SNUNet-CD [27] | 54.83 | 12.03 | 88.16 | 83.50
BIT [36] | 12.85 | 13.48 | 89.31 | 83.98
ICIFNet [35] | 25.36 | 23.82 | 89.96 | 88.32
EATDer [45] | 6.60 | 23.43 | 89.91 | 86.60
Base | 13.01 | 11.85 | 87.93 | 87.66
DFANet (Ours) | 7.88 | 30.11 | 90.56 | 89.98
Table 4. The results of the McNemar’s test between DFANet and the other six methods on the LEVIR-CD and WHU-CD datasets.
Methods | LEVIR-CD: A ×10⁶ | B ×10⁶ | χ² ×10⁶ | p | WHU-CD: A ×10⁶ | B ×10⁶ | χ² ×10⁶ | p
FC-Siam-Diff [24] | 2.1847 | 0.5207 | 1.0234 | <0.001 | 6.7196 | 0.1347 | 6.3260 | <0.001
SNUNet-CD [27] | 0.4223 | 0.2651 | 0.0359 | <0.001 | 0.2546 | 0.1539 | 0.0248 | <0.001
BIT [36] | 0.4765 | 0.3368 | 0.0240 | <0.001 | 0.3628 | 0.1319 | 0.1078 | <0.001
ICIFNet [35] | 0.5401 | 0.4564 | 0.0070 | <0.001 | 0.4445 | 0.1399 | 0.1588 | <0.001
EATDer [45] | 0.3276 | 0.2624 | 0.0072 | <0.001 | 0.1487 | 0.0786 | 0.0216 | <0.001
Base | 0.6786 | 0.4236 | 0.0590 | <0.001 | 0.2167 | 0.1437 | 0.0148 | <0.001
Table 5. Quantitative results of ablation experiments on DFANet for the LEVIR-CD dataset. √ indicates the addition of the corresponding module to the baseline model. All metrics are described as percentages (%).
Methods | SCAM | GatedConv | Transformer | Pre | Rec | F1 | IoU | OA | FLOPs (G) | Params (M)
Base |  |  |  | 90.64 | 85.39 | 87.93 | 78.46 | 98.81 | 13.01 | 11.85
M_a | √ |  |  | 90.39 | 89.85 | 90.12 | 82.01 | 98.99 | 5.99 | 26.21
M_b | √ | √ |  | 91.24 | 89.07 | 90.14 | 82.06 | 99.01 | 7.27 | 26.88
M_c | √ |  | √ | 91.20 | 89.43 | 90.30 | 82.33 | 99.02 | 6.60 | 28.44
DFANet (Ours) | √ | √ | √ | 91.87 | 89.29 | 90.56 | 82.75 | 99.05 | 7.88 | 30.11
Table 6. The impact of different numbers of Transformer encoder and decoder layers on model performance on the LEVIR-CD and WHU-CD datasets. We use $N_E$ to denote the number of encoder layers and $N_D$ to denote the number of decoder layers.
N_E | N_D | F1 (%) LEVIR-CD | F1 (%) WHU-CD
1 | 1 | 90.56 | 89.98
1 | 4 | 90.63 | 90.03
1 | 8 | 90.65 | 90.08
2 | 4 | 90.49 | 89.90
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
