1. Introduction
Buildings are a significant component of urban development, and building change detection is widely used in land use management [1], illegal building monitoring [2], and urban development [3]. Change detection (CD) aims to analyze multitemporal images of the same region and identify differences in objects or phenomena over time [4]. Remote sensing images, together with their accompanying data, can be used to identify and evaluate surface changes from multitemporal acquisitions covering the same area [5]; this is one of the most active research areas in the remote sensing field. With the development of remote sensing technology, a vast amount of remote sensing data is now accessible, providing rich information about land cover [2]. Advances in computing and AI technology also provide considerable theoretical guidance for CD methods. CD methods can be divided into two categories: pixel-based and object-based methods [6].
Due to the resolution limitations of early remote sensing images [7], pixel-based methods for change detection became mainstream. These methods compare pixels between multitemporal images [8,9] to produce change maps using clustering techniques [10,11] or threshold segmentation [12,13]. The main classifiers used include support vector machines [14], random forests [15], and k-nearest neighbor classification [16]. However, these methods ignore contextual information and are prone to noise [17]. Object-based methods were introduced to take advantage of the spatial contextual relationships in these images, and they are frequently applied to CD tasks on low- and medium-resolution remote sensing images [18]. These methods use local clusters of labeled pixels, called “objects”, to reduce false alarms and missed detections in change maps [19]. They effectively suppress noise and image misalignment by segmenting images into homogeneous regions based on spectral and spatial adjacency similarities and then generating change maps from the texture, shape, and spatial relationships of neighboring objects [20,21,22,23]. However, the precision of object-based CD models depends heavily on the segmentation procedure, and they often require manual intervention and lack robustness.
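The classical pixel-based pipeline (image differencing followed by threshold segmentation) can be illustrated with a minimal NumPy sketch; this assumes co-registered single-band images, and the mean-plus-one-standard-deviation threshold is a simple heuristic chosen for illustration, not a specific method from the literature:

```python
import numpy as np

def pixel_change_map(img_t1, img_t2, threshold=None):
    """Pixel-based change detection via image differencing.

    Computes the absolute difference of two co-registered single-band
    images and binarizes it with a global threshold. If no threshold is
    given, a simple heuristic (mean + one standard deviation of the
    difference image) is used as a stand-in for methods such as Otsu.
    """
    diff = np.abs(img_t1.astype(np.float64) - img_t2.astype(np.float64))
    if threshold is None:
        threshold = diff.mean() + diff.std()
    return (diff > threshold).astype(np.uint8)

# Toy example: a 4x4 scene where one 2x2 block "changes" between dates.
t1 = np.zeros((4, 4))
t2 = np.zeros((4, 4))
t2[:2, :2] = 10.0
change = pixel_change_map(t1, t2)
```

Because each pixel is classified independently of its neighbors, isolated noisy pixels in the difference image end up in the change map, which is exactly the sensitivity to noise and lack of contextual information noted above.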
Due to breakthroughs in the computer vision field, deep learning networks have been applied to remote sensing image analysis [24,25]. Deep learning networks are applied to the CD task to automatically extract relevant semantic features and obtain an abstract feature representation of the image information. Compared with traditional methods, deep-learning-based approaches can automatically learn features, which significantly reduces the need for specialized domain knowledge. The performance of CD tasks has greatly improved owing to the excellent performance of deep learning models in capturing and representing image features, ushering in a new era of CD technologies. The CD network framework consists of three main parts: feature extraction, network optimization, and accuracy evaluation. From the perspective of feature extraction, CD methods can be classified into five major types: autoencoder (AE)-based, recurrent neural network (RNN)-based, generative adversarial network (GAN)-based, convolutional neural network (CNN)-based, and transformer-based methods [7]. An AE is an artificial neural network used in semi-supervised and unsupervised learning that learns a representation of the input by using the input itself as the learning target. Liu et al. [26] proposed a stacked autoencoder for synthetic aperture radar (SAR) change detection that effectively adapts to the multiplicative noise present in SAR imagery. RNNs generate CD results by learning multitemporal image features and establishing change relationships between multi-period remote sensing image sequences [27,28]. Lyu et al. [29] built on RNNs with an improved LSTM approach to design a transferable change rule that provides reliable change information for multitemporal images. GANs can be trained with a small amount of real data and generate a large amount of pseudo-data through adversarial learning, thus enhancing the network’s generalization ability [30,31]. Among CNN-based methods, Peng et al. [18] used multitemporal images as input to UNet++ to generate multi-level feature maps of predicted change maps; multiple side-output fusion (MSOF) was then introduced to generate the final change map. Fang et al. [32] proposed SNUNet-CD, a densely connected Siamese network for CD tasks, with deep supervision of network training by aggregating and refining semantic information at different levels. Transformer-based methods are an effective attention-based approach that improves CD results by capturing global contextual information and establishing long-range dependencies [33]. Wang et al. [25] proposed TransCD, which adds a transformer structure to a feature difference network to improve robustness to noisy changes by establishing global semantic information. From the perspective of feature fusion, CD methods can be roughly divided into two categories: the early fusion (EF) strategy and the late fusion (LF) strategy [34]. The authors of [35] designed a CD network with an EF strategy based on the UNet architecture, used skip connections to improve accuracy, and explored the impact of different fusion strategies on network performance by varying the feature fusion. In [36], a Siamese network is used to extract dual-temporal image features, and the boundary integrity of the resulting map is improved with an LF strategy that fuses multi-level depth features with differential features through an attention mechanism. Studies have shown that late fusion strategies give better CD results than early fusion strategies.
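The difference between the two fusion strategies can be sketched with a toy NumPy example: early fusion stacks the bi-temporal images channel-wise before feature extraction, while late fusion extracts features from each image with shared (Siamese) weights and fuses them afterwards. The single-correlation "encoder" and the absolute-difference fusion here are deliberate simplifications for illustration, not any cited network:

```python
import numpy as np

def conv_valid(x, kernel):
    """Toy 'encoder': one valid cross-correlation over a (C, H, W) input.
    A real CD network would use a deep CNN encoder here."""
    c, h, w = x.shape
    kh, kw = kernel.shape[1], kernel.shape[2]
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[:, i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
t1 = rng.normal(size=(3, 8, 8))   # bi-temporal image pair, (C, H, W)
t2 = rng.normal(size=(3, 8, 8))

# Early fusion: stack the two images channel-wise, then extract features once.
k_ef = rng.normal(size=(6, 3, 3))
ef_feat = conv_valid(np.concatenate([t1, t2], axis=0), k_ef)

# Late fusion: Siamese extraction with a shared kernel, then fuse by
# taking the absolute feature difference.
k_lf = rng.normal(size=(3, 3, 3))
lf_feat = np.abs(conv_valid(t1, k_lf) - conv_valid(t2, k_lf))
```

Because the LF branch differences features computed with identical weights, unchanged regions cancel out in the fused map, which is one intuition for why late fusion tends to perform better in the studies cited above.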
There has been growing interest in recent years in researching multi-scale features for building change detection in remote sensing images. This is due in part to the increasing resolution of remote sensing images, which allows for more detailed and accurate detection of changes in buildings. Previous methods, such as pixel-based and object-based approaches, have been effective in detecting changes in low- to medium-resolution images but may not fully utilize the multi-scale information available in high-resolution images [37]. Thus, there is a need for methods that can effectively extract multi-scale building features from high-resolution images to improve the accuracy of building change detection. Research in this area has included the use of deep learning methods, such as convolutional neural networks, to extract and utilize multi-scale features in building change detection [34]. However, challenges remain, such as the information gap between the upper and lower layers of the network and the difficulty of fully utilizing the local and global information of the feature maps [38]. Further research in this area is necessary to improve the performance and efficiency of building change detection in high-resolution remote sensing images.
Although deep learning methods have achieved great success in CD tasks, some problems remain to be overcome. On the one hand, while current building change detection methods consider multi-scale features, they are often unable to fully utilize both the global and local information in multi-scale feature maps. Due to successive downsampling operations in the encoder and successive upsampling operations in the decoder, detailed information, such as building boundaries and buildings at multiple scales, is gradually diluted. On the other hand, existing CD networks tend to connect only sibling feature maps when passing contextual information through skip connections. This leaves an information gap between the upper and lower layers of the network, which prevents full use of the local and global information in the feature maps, reducing the efficiency of feature extraction and limiting the performance of the model.
The significant contributions of this paper are summarized as follows.
- (1)
In order to fully utilize the high-frequency information in remote sensing images to enhance building edges and extract multi-scale features through multi-level feature maps, we propose a new CD network called FERA-Net, using a Siamese network to detect building changes at the pixel level. The network can extract multi-level and multi-scale features from dual-time remote sensing images.
- (2)
We designed a hierarchical attention-guided high-frequency feature extraction module (AGFM) to effectively capture building information in remote sensing images. The AGFM uses a residual attention mechanism and a high-frequency information enhancement module to capture building features while enhancing building boundary information and mitigating the effect of noise. In terms of fusion strategy, we used a feature-enhanced skip connection module (FESCM) to fuse the local and global information of the dual-temporal differential maps, aggregating differential map features at different levels and scales to capture buildings of different sizes. To better perform change detection, we designed a hybrid loss function that focuses on building change information and a priori boundary information.
- (3)
We conducted experiments on two publicly available datasets to evaluate the performance of FERA-Net in remote sensing CD tasks in multiple aspects. Both quantitative and qualitative comparison results show the significant advantages of FERA-Net compared with other network models.
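Contribution (2) mentions a hybrid loss that combines change supervision with boundary-aware supervision; the exact formulation is given in Section 3. As a generic, hypothetical illustration of how such a combination is commonly built (a weighted sum of binary cross-entropy and Dice loss; this specific choice is an assumption, not the authors' loss), consider:

```python
import numpy as np

def bce_loss(pred, target, eps=1e-7):
    """Binary cross-entropy: per-pixel supervision of change labels."""
    pred = np.clip(pred, eps, 1 - eps)
    return -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))

def dice_loss(pred, target, eps=1e-7):
    """Dice loss: overlap-based term that counteracts the strong class
    imbalance between changed and unchanged pixels."""
    inter = np.sum(pred * target)
    return 1 - (2 * inter + eps) / (np.sum(pred) + np.sum(target) + eps)

def hybrid_loss(pred, target, alpha=0.5):
    # Weighted combination of the two terms; alpha balances per-pixel
    # accuracy against region overlap. (Illustrative only.)
    return alpha * bce_loss(pred, target) + (1 - alpha) * dice_loss(pred, target)
```

A perfect prediction drives both terms, and hence the combined loss, to (near) zero, while the relative weight `alpha` is a hyperparameter that would be tuned on a validation set.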
The remainder of this paper is structured as follows. Section 2 describes the work related to CD. Section 3 details the FERA-Net. Section 4 presents the dataset and experimental setup. Section 5 describes and discusses the comparative and ablation experiments. Section 6 summarizes the work conducted in this paper and future perspectives.