1. Introduction
With the ongoing growth of the world population and rapid urbanization, the global land surface has undergone significant changes. Consequently, the study of the interactions between urbanization and environmental change has drawn increasing attention. With breakthroughs in Earth observation techniques, massive volumes of remote sensing (RS) images, including satellite imagery (e.g., WorldView, QuickBird, and GF2) and aerial images, provide a rich data source. In recent years, the spatial, spectral, and temporal resolution of RS images has gradually improved, and the availability of high- and very-high-resolution (VHR) images now facilitates urban monitoring [1]. Remote sensing image interpretation techniques increasingly energize refined urban management, and change detection (CD) is one of the critical techniques among them. CD aims to identify and locate change footprints from multitemporal remote sensing images acquired over the same geographical region at different times [2]. Multitemporal RS images are characterized by their macroscopic coverage and periodic acquisition. CD has a wide range of applications, such as land cover and land use detection, urbanization monitoring, illegal construction identification, and damage assessment [3,4,5,6,7]. Furthermore, CD techniques can effectively reflect the speed of the urbanization process. Above-ground buildings are the most representative artificial structures; therefore, building change detection can effectively reveal the development trend of urban spatial patterns. The main concern of this paper is binary CD based on optical RS images. High-resolution optical images reflect abundant spectral and spatial information of geospatial objects, which allows more details to be retained and high-quality change maps to be obtained. CD methods usually generate a pixel-level change map in which pixels are classified as changed or unchanged. Many approaches have been explored to improve the accuracy and automation of change detection; they can be roughly divided into traditional [8] and deep learning (DL)-based methods [6].
According to the analysis unit used for change detection, traditional methods are roughly classified into pixel-based and object-based methods [7,8]. In the early stage, change detection based on RS images with low and medium spatial resolution was applied to land use and land cover change. Pixel-based methods obtain change information and reveal ground surface change by using the spectral characteristics of pixels. The most representative methods are based on the difference image (DI), which is generated by image differencing, image ratioing, or image transformation methods, such as change vector analysis (CVA) [9], principal component analysis (PCA) [4], and the regularized iteratively reweighted multivariate alteration detection (IR-MAD) method [10]. The DI indicates the magnitude of change, and a change map can be generated from the DI by thresholding [11] or clustering [12]. Image transformation methods usually transform the RS images into a specific feature space in which changed and unchanged pixels become more discriminative. For instance, CVA is widely used to generate the DI and obtain the change intensity, while PCA is a typical method for feature dimensionality reduction. Celik proposed an unsupervised method based on PCA and k-means clustering [13]: PCA is applied to the non-overlapping blocks of the DI to extract feature vectors, and the k-means algorithm determines whether each pixel has changed. Owing to the absence of context information, pixel-based methods are susceptible to noise, so the detection results suffer from many pseudo-changes. Meanwhile, it is difficult to choose a proper transformation method for a specific application scenario. As spatial resolution increases, pixel-based methods exhibit poor performance on very-high-resolution (VHR) images [14].
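To make the DI-plus-thresholding pipeline concrete, the following minimal sketch computes a CVA-style change magnitude from two co-registered images of shape (H, W, bands) and binarizes it with Otsu's threshold. The function name and the use of NumPy/scikit-image are illustrative assumptions, not implementations from the cited works.

```python
import numpy as np
from skimage.filters import threshold_otsu

def cva_change_map(img_t1, img_t2):
    """Pixel-based CD sketch: CVA magnitude + Otsu thresholding."""
    # Difference image (DI): magnitude of the per-pixel spectral change vector.
    diff = img_t1.astype(np.float64) - img_t2.astype(np.float64)
    magnitude = np.linalg.norm(diff, axis=-1)
    # Binarize the DI with an automatically selected global threshold.
    return magnitude > threshold_otsu(magnitude)
```

In the same spirit, a clustering-based alternative would replace the Otsu step with, e.g., two-class k-means on the magnitude values.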
High-resolution images reflect the spatial distribution and geometric structure of geospatial objects. Object-based methods exploit this spatial information effectively by employing the image object or superpixel as the basic processing unit [15,16,17]. Such methods have been widely studied and utilize spectral, textural, and geometrical features for change detection. They usually consist of three steps: object unit segmentation, object feature extraction, and feature classification [8]. Machine learning methods are applied as classifiers to determine the change type, such as the k-nearest neighbor (kNN) method [18], support vector machine (SVM) [19], random forest [20], and graphical models, i.e., Markov random field models [21] and conditional random field (CRF) models [14]. Besides, post-classification comparison methods [22] have been developed for specific tasks and provide a complete matrix of change directions. Object-based methods are more suitable for high-resolution image change detection because they measure the similarity of segmented units. However, they are generally sensitive to segmentation errors, and the detection accuracy depends strongly on the results obtained by different segmentation strategies. To alleviate this problem, Lv et al. [14] combined the CRF method with the object-based technique to exploit spectral–spatial information. Nevertheless, feature extraction and selection is a complex process that requires professional knowledge and experience, which limits the application range of object-based methods. More fundamentally, the hand-crafted features of traditional approaches provide a limited representation of high-level semantics, which hinders their performance.
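The three-step object-based pipeline can be sketched as follows, using SLIC superpixels as the segmentation units and mean per-band differences as a deliberately minimal object feature; the segmentation algorithm, feature design, and function name are illustrative choices (assuming scikit-image >= 0.19), not those of the cited works.

```python
import numpy as np
from skimage.segmentation import slic

def object_level_features(img_t1, img_t2, n_segments=500):
    """Object-based CD sketch: segmentation + per-object feature extraction."""
    # Segment the stacked bi-temporal image so object units align across dates.
    stacked = np.concatenate([img_t1, img_t2], axis=-1)
    segments = slic(stacked, n_segments=n_segments, channel_axis=-1)
    # Describe each object by its mean per-band spectral difference.
    diff = img_t1.astype(np.float64) - img_t2.astype(np.float64)
    feats = np.stack([diff[segments == s].mean(axis=0)
                      for s in np.unique(segments)])
    return segments, feats
```

A supervised classifier such as an SVM or a random forest then labels each object's feature vector as changed or unchanged, completing the third step.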
With the impressive breakthroughs in artificial intelligence and deep learning technology, CD methods have gradually evolved from traditional to DL-based approaches. Convolutional neural networks (CNNs) have an inherent advantage in feature representation and are therefore a better solution for feature extraction than hand-crafted features [23]. In recent years, CNN-based methods have made remarkable progress in remote sensing image change detection [6]. In particular, supervised methods based on the prior knowledge provided by manually annotated labels achieve better accuracy and robustness than traditional methods. Some attempts were inspired by image semantic segmentation models, such as UNet [24] and UNet++ [25], and the proposed change detection networks are based on a U-shaped encoder–decoder architecture [26,27,28,29]. These methods emphasize end-to-end change detection, implemented by constructing a fully convolutional network. Different from image segmentation tasks, change detection takes a pair of bi-temporal images as the model input.
Network frameworks can be roughly divided into early-fusion and late-fusion frameworks [30]. The early-fusion framework concatenates the bi-temporal images along the channel axis as the network input. The late-fusion framework extracts feature maps from the two co-registered images separately using a parallel dual-stream network, where the two branches usually share the same structure. If the two branches share weights, the network is the so-called Siamese framework; otherwise, it is a pseudo-Siamese framework [26]. Daudt et al. implemented end-to-end change detection based on the Siamese framework, i.e., the fully convolutional Siamese-difference network (FC-Siam-diff) and the fully convolutional Siamese-concatenation network (FC-Siam-conc) [26]. The difference between the two lies in how the skip connections are performed: the former concatenates the absolute value of the bi-temporal feature difference during the decoding phase, whereas the latter directly concatenates the bi-temporal features. Hou et al. [31] extended UNet and proposed a Siamese variant called W-Net for building change detection, which learns difference features by comparing the bi-temporal features in the feature domain. Though attractive for improving accuracy by fusing features through skip connections, these designs suffer from checkerboard artifacts caused by the deconvolutions used during decoding, which has become one of the main concerns. Alternatively, upsampling combined with convolutions alleviates checkerboard artifacts in the detection results. For instance, Zhang et al. [30] proposed a deeply supervised image fusion network (IFN) based on the pseudo-Siamese framework; they introduced CBAM attention modules [32] during decoding to overcome the heterogeneity problem. Similarly, Fang et al. [33] proposed SNUNet-CD based on a Siamese network and UNet++, in which an ensemble channel attention module (ECAM) aggregates and refines features at multiple semantic levels. Wang et al. [28] proposed a pseudo-Siamese network called ADS-Net that emphasizes feature fusion using a mid-layer fusion method. Zhang et al. [34] instead proposed a hierarchical network, called HDFNet, which introduces dynamic convolution modules into the decoding stages to emphasize feature fusion. The aforementioned works share the use of skip connections to concatenate deep features with low-level features during decoding for performance improvement. These studies demonstrate that both high-level semantic information and low-level detail information are important in change detection. Unfortunately, it remains unclear which feature fusion strategy is better, and dense skip connections bring high computational costs.
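The contrast between the two fusion styles can be illustrated with a minimal PyTorch sketch in the spirit of FC-Siam-diff: a weight-shared encoder processes each date, and the classifier consumes the absolute feature difference. The layer sizes and class names are illustrative, not the published architectures.

```python
import torch
import torch.nn as nn

class SiamDiffSketch(nn.Module):
    """Late-fusion (Siamese-difference) sketch; weights are shared by construction."""
    def __init__(self, in_ch=3, feat=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, feat, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.head = nn.Conv2d(feat, 2, 1)  # changed / unchanged logits

    def forward(self, t1, t2):
        f1, f2 = self.encoder(t1), self.encoder(t2)  # same weights, two dates
        return self.head(torch.abs(f1 - f2))         # compare in feature space
        # Early fusion would instead feed torch.cat([t1, t2], dim=1)
        # to a single-stream network.
```

A pseudo-Siamese variant would simply instantiate two encoders of identical structure but with independent weights.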
Alternatively, Daudt et al. [35] proposed FC-EF-Res, which adopts the early-fusion framework based on UNet and incorporates residual modules [36]. The residual modules facilitate the training of deeper networks, and FC-EF-Res achieved better performance than FC-Siam-diff and FC-Siam-conc. Zheng et al. [29] proposed a lightweight model, CLNet, based on UNet, whose encoder incorporates cross-layer blocks (CLBs). In a CLB, the input feature map is first divided into two parallel but asymmetric branches, and convolution kernels with different strides then capture multi-scale context for performance improvement. More recently, some early-fusion attempts were developed based on UNet++. Peng et al. [27] proposed an improved UNet++ with multiple side-output fusion (MSOF) for change detection in high-resolution images; the dense skip structure of UNet++ facilitates multilayer feature fusion. Peng et al. [37] proposed a simplified UNet++ called DDCNN that utilizes dense upsampling attention units to improve accuracy. Zhang et al. [38] proposed DifUnet++, which emphasizes the explicit representation of difference features using a differential pyramid of the bi-temporal images. Yu et al. [39] implemented NestNet based on UNet++, which promotes explicit difference representation using an absolute differential operation (ADO). During model training, multistage prediction and deep supervision have proven to be effective strategies for achieving better performance; for instance, the multistage prediction strategy is applied at the decoder's output side in Peng et al. [27], DifUnet++ [38], NestNet [39], IFN [30], HDFNet [34], and ADS-Net [28]. The overall loss function is calculated as the weighted sum of the multistage predictions' losses. The deep supervision strategy facilitates network convergence during the training phase, but it incurs more computation and memory cost than single-head prediction. Besides, compared with low-level features, high-level features have coarser resolution but are more accurate in semantic representation, whereas low-level features are more accurate in spatial location. ADS-Net [28] and IFN [30] employ spatial–channel attention modules for feature fusion during decoding. Peng et al. [37] proposed an upsampling attention unit to promote feature fusion during upsampling, in which high-level features guide the selection of low-level features for performance improvement.
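The deep supervision strategy shared by these works amounts to a weighted sum of per-stage losses, as in the following sketch; the equal default weights and the use of cross-entropy are assumptions, since each paper tunes its own weighting.

```python
import torch.nn.functional as F

def deep_supervision_loss(side_outputs, target, weights=None):
    """Overall loss = weighted sum of each side-output's loss.

    side_outputs: list of (N, 2, H, W) logits at the label resolution.
    target: (N, H, W) long tensor of change labels.
    """
    weights = weights or [1.0] * len(side_outputs)
    return sum(w * F.cross_entropy(logits, target)
               for w, logits in zip(weights, side_outputs))
```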
Recently, change detection methods incorporating attention mechanisms [40] have drawn considerable attention. Attention mechanisms have been widely studied in computer vision, including self-attention models (e.g., Non-local [41]), channel attention models (e.g., squeeze-and-excitation modules [42]), and spatial–channel attention models (e.g., CBAM [32] and DANet [43]). Some attempts introduce attention modules into the network to learn discriminative features and alleviate distractions caused by pseudo-changes. For example, Chen et al. [44] proposed STANet, which consists of a feature extraction network and a pyramid spatial–temporal attention module (PAM). ResNet-18 is applied for feature extraction, and a self-attention module calculates the attention weights and models the spatial–temporal relationships at various scales; STANet with PAM achieved a better F1-score than the baseline. When trained with sufficient samples, attention-based methods achieve superior accuracy and robustness. More recently, transformer-based models have achieved breakthroughs in the computer vision field, such as ViT [45] for image classification, DETR [46] for object detection, and SETR [47] for image semantic segmentation. Chen et al. [48] proposed BIT_CD, which combines a transformer with a CNN for bi-temporal image change detection. BIT_CD adopts a transformer encoder to model contexts in a compact semantic token-based space-time, and it outperforms attention-based methods such as STANet [44] and IFN [30].
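As a representative of the channel attention family cited above, a squeeze-and-excitation block can be written compactly as follows; the reduction ratio of 16 follows common practice and is an illustrative default.

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation channel attention sketch."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        n, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3)))  # squeeze: global average pool -> (N, C)
        return x * w.view(n, c, 1, 1)    # excite: rescale each channel
```

Spatial–channel designs such as CBAM extend this idea with an additional spatial attention map.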
In summary, recent advances in DL-based CD methods mainly focus on improving precision by enhancing the feature representation ability of the model. Some attempts employ deeper networks and apply multilevel feature extraction and fusion for multiscale context modeling; though attractive for improving performance, training such models with a deep supervision strategy consumes massive memory. More recent attempts introduce attention modules to promote feature discrimination, and, based on supervised techniques, these methods achieve state-of-the-art interpretation accuracy. However, the increase in network depth and width involves a large number of network parameters and requires large memory space for storage. In addition, deeper networks incorporating attention-based or multiscale context-based modules usually consume massive memory during training and require more inference time, which hinders the efficient interpretation of massive remote sensing images in practice. Recently, some lightweight change detection networks have been proposed. Chen et al. [49] proposed a lightweight multiscale spatial pooling network that exploits the spatial context of changed regions for bi-temporal SAR image change detection. Wang et al. [50] proposed a lightweight network that replaces normal convolutional layers with bottleneck layers and employs dilated convolutional kernels with few non-zero entries to reduce the running time of convolution operators. However, these works did not report the specific numbers of network parameters and computations, which makes it hard to evaluate their computational efficiency in practice. In this sense, a lightweight network should be designed to promote inference speed and achieve high computational efficiency. We attempt to design an efficient network that achieves accuracy improvements with a comparable inference speed.
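Reporting model size is straightforward in practice; a generic PyTorch helper such as the following (a utility sketch, not code from the cited works) yields the parameter counts quoted in "M" throughout this paper.

```python
import torch.nn as nn

def count_trainable_params(model: nn.Module) -> int:
    """Raw count of trainable parameters (divide by 1e6 for figures in M)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```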
The main contributions of this paper are summarized as follows. First, an effective network, called 3M-CDNet, is proposed for accuracy improvement; it requires only about 3.12 M trainable parameters. The network consists of a lightweight backbone network for feature extraction and a concise classifier that classifies the extracted features and generates a change probability map. Moreover, a lightweight variant called 1M-CDNet, which requires only about 1.26 M parameters, is proposed for computational efficiency under limited computing power. 3M-CDNet and 1M-CDNet share the same backbone architecture but differ in their classifiers. The backbone incorporates deformable convolutions (DConv) [51,52] into the residual blocks to enhance the geometric transformation modeling ability for change detection. Besides, change detection is implemented on high-resolution feature maps to promote the detection of small changed geospatial objects, and a two-level feature fusion strategy is applied to improve the feature representation. Dropout [53] is applied in the classifier to improve the generalization ability. The proposed networks achieve better accuracy than state-of-the-art methods while reducing network parameters; in particular, the inference runtime of 1M-CDNet is superior to that of most existing methods.
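To illustrate how deformable convolutions slot into a residual block, the following DCNv1-style sketch (without the modulation mask of DCNv2) uses torchvision's DeformConv2d; the layer sizes and block layout are illustrative and do not reproduce the exact 3M-CDNet/1M-CDNet blocks.

```python
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformResBlock(nn.Module):
    """Residual block with a deformable 3x3 convolution (illustrative sketch)."""
    def __init__(self, channels):
        super().__init__()
        # A plain conv predicts 2 sampling offsets (x, y) per 3x3 kernel tap.
        self.offset = nn.Conv2d(channels, 2 * 3 * 3, kernel_size=3, padding=1)
        self.dconv = DeformConv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.dconv(x, self.offset(x))))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # identity shortcut
```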