An Efficient Lightweight Neural Network for Remote Sensing Image Change Detection

Abstract: Remote sensing (RS) image change detection (CD) is a critical technique for detecting land surface changes in earth observation. Deep learning (DL)-based approaches have gained popularity and have made remarkable progress in change detection. The recent advances in DL-based methods mainly focus on enhancing the feature representation ability for performance improvement. However, deeper networks incorporated with attention-based or multiscale context-based modules involve a large number of network parameters and require more inference time. In this paper, we first propose an effective network called 3M-CDNet that requires about 3.12 M parameters for accuracy improvement. Furthermore, a lightweight variant called 1M-CDNet, which only requires about 1.26 M parameters, is proposed for computation efficiency under limited computing power. 3M-CDNet and 1M-CDNet have the same backbone network architecture but different classifiers. Specifically, the application of deformable convolutions (DConv) in the lightweight backbone enables the model to gain a good geometric transformation modeling capacity for change detection. A two-level feature fusion strategy is applied to improve the feature representation. In addition, the classifier, which has a plain design to facilitate the inference speed, applies dropout regularization to improve generalization ability. Online data augmentation (DA) is also applied to alleviate overfitting during model training. Extensive experiments have been conducted on several public datasets for performance evaluation. Ablation studies have proved the effectiveness of the core components. Experiment results demonstrate that the proposed networks achieve performance improvements compared with the state-of-the-art methods. Specifically, 3M-CDNet achieved the best F1-score on two datasets, i.e., LEVIR-CD (0.9161) and Season-Varying (0.9749).
Compared with existing methods, 1M-CDNet achieved a higher F1-score, i.e., LEVIR-CD (0.9118) and Season-Varying (0.9680). In addition, the runtime of 1M-CDNet is superior to that of most methods, exhibiting a better trade-off between accuracy and efficiency.


Introduction
With the ongoing increase in the world population and rapid urbanization, the global land surface has undergone significant changes. Therefore, the study of interactions between urbanization and environmental change has drawn increased attention. With breakthroughs in earth observation techniques, massive remote sensing (RS) images provide a rich data source, including satellite imagery (e.g., WorldView, QuickBird, GF2) and aerial images. In recent years, the spatial-spectral-temporal resolution of RS images has gradually improved. Nowadays, the availability of high- and very-high-resolution (VHR) images offers convenience for urban monitoring [1]. Traditional remote sensing image interpretation has combined the conditional random field (CRF) method with object-based techniques to exploit spectral-spatial information. However, feature extraction and selection is a complex process that requires professional knowledge and experience, which limits the application range of object-based methods. More generally, traditional approaches based on hand-crafted features are hindered by the limited representation of high-level semantics.
With the impressive breakthroughs in artificial intelligence and deep learning technology, CD methods have gradually evolved from traditional to DL-based approaches. Convolutional neural network (CNN) has an inherent advantage of feature representation. Thus, CNN becomes a better solution for feature extraction than hand-crafted features [23]. In recent years, CNN-based methods have made remarkable progress in remote sensing image change detection [6]. Specifically, the supervised methods based on prior knowledge provided from manually annotated labels achieve better performance than traditional methods in terms of accuracy and robustness. Some attempts were inspired by the image semantic segmentation models, such as UNet [24] and UNet++ [25]. The proposed change detection networks are based on a U-shape encoder-decoder architecture [26][27][28][29]. These methods emphasize end-to-end change detection, which is implemented by constructing a fully convolutional network. Different from the image segmentation tasks, change detection involves a pair of bi-temporal images as an input of the model.
The network framework can be roughly divided into early- and late-fusion frameworks [30]. The early-fusion framework concatenates the bi-temporal images along the channel axis as an input of the network. The late-fusion framework extracts feature maps from the two co-registered images separately using a parallel dual-stream network, where the two branches usually share the same structure. If the two branches share weights, it is the so-called Siamese framework; otherwise, it is the pseudo-Siamese framework [26]. Daudt et al. implemented end-to-end change detection based on the pseudo-Siamese framework, i.e., the fully convolutional Siamese-difference network (FC-Siam-diff) and the fully convolutional Siamese-concatenation network (FC-Siam-conc) [26]. The difference lies in how the skip connections are performed. The former concatenates the absolute value of the bi-temporal features' difference during the decoding phase. The latter directly concatenates the bi-temporal features instead. Hou et al. [31] extended UNet and proposed a Siamese variant called W-Net for building change detection. W-Net learns the difference features of bi-temporal features by comparison in the feature domain. Though attractive for improving accuracy by fusing features through skip connections, checkerboard artifacts caused by deconvolutions during decoding become one of the main concerns. Alternatively, upsampling combined with convolutions is a good solution to alleviate checkerboard artifacts in the detection results. For instance, Zhang et al. [30] proposed a deeply supervised image fusion network (IFN) based on the pseudo-Siamese framework. More precisely, they introduced the CBAM attention modules [32] during decoding to overcome the heterogeneity problem. Similarly, Fang et al. [33] proposed SNUNet-CD based on the Siamese network and UNet++. The ensemble channel attention module (ECAM) was applied for aggregating and refining features of multiple semantic levels. Wang et al.
[28] proposed a pseudo-Siamese network called ADS-Net that emphasizes feature fusion using a mid-layer fusion method. Instead, Zhang et al. [34] proposed a hierarchical network, called HDFNet, which introduces dynamic convolution modules into the decoding stages to emphasize feature fusion. The aforementioned works share a similarity in that skip connections are applied to concatenate deep features with low-level features during the decoding stage for performance improvement. These studies demonstrated that both high-level semantic information and low-level detail information are important in change detection. Unfortunately, it is not clear which feature fusion strategy is better, and dense skip connections bring about high computational costs.
Alternatively, Daudt et al. [35] proposed FC-EF-Res that adopts the early-fusion framework based on UNet by incorporating residual modules [36]. FC-EF-Res utilizes the residual modules to facilitate the training of the deeper network. FC-EF-Res achieved better performance than FC-Siam-diff and FC-Siam-conc. Zheng et al. [29] proposed a lightweight model CLNet based on the U-Net, which builds the encoder part by incorporating the cross-layer blocks (CLBs). An input feature map was first divided into two parallel but asymmetric branches. Then, CLBs apply convolution kernels with different strides to capture multi-scale context for performance improvement. More recently, some attempts that adopt early-fusion frameworks were developed based on the UNet++. Peng et al. [27] proposed an improved UNet++ with multiple side-outputs fusion (MSOF) for change detection in high-resolution images. The dense skip structure of UNet++ facilitates multilayer feature fusion. Peng et al. [37] proposed a simplified UNet++ called DDCNN that utilizes dense upsampling attention units for accuracy improvement. Zhang et al. [38] proposed DifUnet++, which emphasizes the explicit representation of difference features using a differential pyramid of bi-temporal images. Yu et al. [39] implemented the Nest-Net based on the UNet++. NestNet promotes the explicit difference representation using absolute differential operation (ADO). During model training, multistage prediction and deep supervision have been proven effective strategies for achieving better performance. For instance, some attempts apply the multistage prediction strategy at the decoder's output side, such as Peng et al. [27], DifUnet++ [38], NestNet [39], IFN [30], HDFNet [34], and ADS-Net [28]. The overall loss function is calculated based on the weighted sum of multistage prediction's loss. 
The deep supervision strategy facilitates the network convergence during the training phase, whereas it brings about more computation and memory cost than single-head prediction. Besides, high-level features have a coarse resolution but are accurate in semantic representation compared with low-level features. However, low-level features are more accurate in spatial location. ADS-Net [28] and IFN [30] methods employed the spatial-channel attention modules for feature fusion during decoding. Peng et al. [37] proposed the upsampling attention unit for promoting feature fusion during upsampling. High-level features are applied to guide the selection of low-level features for performance improvement.
Recently, change detection methods by incorporating attention mechanisms [40] have drawn considerable attention. Attention mechanisms have been widely studied in computer vision, such as the self-attention model (e.g., Non-local [41]), the channel attention model (e.g., squeeze and excitation modules [42]), and spatial-channel attention model (e.g., CBAM [32] and DANet [43]). Some attempts introduce attention modules in the network, which learns discriminative features and alleviates distractions caused by pseudochanges. For example, Chen et al. [44] proposed STANet that consists of a feature extraction network and a pyramid spatial-temporal attention module (PAM). ResNet-18 was applied for feature extraction, and the self-attention module was used to calculate the attention weights and model the spatial-temporal relationships at various scales. STANet with PAM achieved a better F1-score than the baseline. When training with sufficient samples, attention-based methods achieve superior performance in accuracy and robustness. More recently, transformer-based models have achieved a breakthrough in computer vision field, such as ViT [45] for image classification, DETR [46] for object detection, and SETR [47] for image semantic segmentation. Chen et al. [48] proposed BIT_CD that combines the transformer with CNN to solve the bitemporal image change detection. BIT_CD adopts a transformer encoder to model contexts in the compact semantic token-based space-time. BIT_CD outperforms some attention-based methods, such as STANet [44] and IFN [30].
We can conclude that the recent advances in DL-based CD methods mainly focus on improving precision by enhancing the feature representation ability of the model. Some attempts employed deeper networks to address the issue. These methods applied multilevel feature extraction and fusion for multiscale context modeling. Though attractive for improving performance, applying a deep supervision strategy for model training consumes massive memory. More recent attempts introduced attention modules for promoting the discrimination of features. Based on the supervised technique, these methods achieve state-of-the-art interpretation accuracy. However, the increase in network depth and width involves a large number of network parameters and requires large memory space for storage. In addition, the deeper networks incorporated with attention-based or multiscale context-based modules usually consume massive memory during training and require more inference time. This hinders the interpretation efficiency of massive remote sensing images in practice. Recently, some lightweight change detection networks have been proposed. Chen et al. [49] proposed a lightweight multiscale spatial pooling network to exploit the spatial context information of changed regions for bitemporal SAR image change detection. Wang et al. [50] proposed a lightweight network that replaces normal convolutional layers with bottleneck layers and employs dilated convolutional kernels with a few non-zero entries, which reduces the running time of convolutional operators. However, they did not give a specific number of network parameters and computations, so it is hard to evaluate the computational efficiency in practice. In this sense, a lightweight network is designed to promote the inference speed and achieve high computational efficiency. We attempt to design an efficient network that achieves accuracy improvements and comparable inference speed.
The main contributions of this paper are summarized as follows. This paper first proposes an effective network, called 3M-CDNet, for accuracy improvement. It requires about 3.12 M trainable parameters. The network consists of a lightweight backbone network and a concise classifier. The former is used for feature extraction, and the latter is used to classify the extracted features and generate a change probability map. Moreover, a lightweight variant called 1M-CDNet that only requires about 1.26 M parameters is proposed for computation efficiency under limited computing power. 3M-CDNet and 1M-CDNet have the same backbone network architecture but different classifiers. The lightweight network incorporates deformable convolutions (DConv) [51,52] into the residual blocks to enhance the geometric transformation modeling ability for change detection. Besides, change detection was implemented based on high-resolution feature maps to promote the detection of small changed geospatial objects. A two-level feature fusion strategy was applied to improve the feature representation. Dropout [53] was applied in the classifier to improve the generalization ability. The networks achieved better accuracy compared with the state-of-the-art methods while reducing network parameters. Specifically, the inference runtime of the proposed 1M-CDNet is superior to that of most existing methods.
The rest of this paper is organized as follows. Section 2 presents the method proposed in this paper, including the DConv-based backbone network in Section 2.1 and the pixelwise classifier in Section 2.2. Section 3 discusses the experimental results on public datasets, including method comparison in Sections 3.5.1 and 3.5.2. Section 4 discusses the ablation studies. Conclusions are shared in Section 5.

Proposed Method
In this section, we present the proposed effective network for urban change detection in remote sensing images. The proposed 3M-CDNet only involves about 3.12 M trainable parameters. As shown in Figure 1, 3M-CDNet mainly consists of two core components: Figure 1a shows a deformable convolution (DConv)-based backbone network, and Figure 1b shows a pixel-wise classifier. The former is used for feature extraction from the input I^(1,2) ∈ R^(6×H×W). The latter is used to classify the extracted features into two classes in a change probability map. The network adopts the early-fusion framework and takes as input a pair of bitemporal RGB images concatenated into six bands. Then, it generates a binary change map CM ∈ R^(1×H×W), where pixels are either changed or unchanged. 3M-CDNet has a modular structure with high flexibility. It allows performance improvement by incorporating plug-and-play modules, such as DConv [52] and dropout regularization [53]. In addition, a lightweight variant, called 1M-CDNet, was proposed to reduce computation costs by using a simpler classifier with fewer trainable parameters.
First, we introduced the network architecture, i.e., the DConv-based backbone network and the pixel-wise classifier. Second, the loss function definition for model training was described.

DConv-Based Backbone Network
As shown in Figure 1a, the backbone network of 3M-CDNet is composed of the Input Layer, Layer 1, and Layer 2. The main idea is to reduce the size of the input through consecutive downsampling and convolution operations, and to extract feature maps with varying degrees of semantics from shallow to deep layers. Specifically, the Input Layer consists of three stacked 3 × 3 convolutional layers followed by a MaxPool layer. The Input Layer is applied to downsample the input and transform I^(1,2) ∈ R^(6×H×W) into a 3-D tensor X_θ. X^1st and X^2nd are the feature maps extracted from Layer 1 and Layer 2, respectively.

Introducing Residual Network
When the CNN goes deeper, convergence can be hampered, leading to a degradation problem [36]. Therefore, Layer 1 and Layer 2 were designed based on the residual network [36]. Bottleneck residual blocks were designed as the basic units of Layer 1 and Layer 2, which have the advantage of alleviating the degradation problem and promoting convergence during training. As shown in Figure 2c, bottleneck blocks can be formulated as follows:

X^l_out = ReLU(ξ(X^l_in) + H(X^l_in)),

where X^l_in and X^l_out are the input and output tensors of the l-th residual block, respectively. ξ(·) indicates the residual mapping function, i.e., the right branch that consists of three stacked convolution layers. Specifically, Conv1 × 1, Conv3 × 3, and Conv1 × 1 are applied in series to model the residual mapping function ξ(·). The number of feature channels is first reduced and then increased, also known as the bottleneck structure. H(·) indicates the identity mapping function, i.e., the left branch. H(·) applies a downsampling projection shortcut through a Conv1 × 1 and a batch normalization (BN) layer only if the stride is set to 2, e.g., in the first block of Layer 2. Otherwise, an identity shortcut is applied. Besides, BN is also applied at the tail end of each convolution kernel to make the training procedure more stable. The results of ξ(·) and H(·) are further merged by element-wise summation. ReLU(·) is the rectified linear unit activation function for enhancing the non-linear fitting ability. ReLU(·) can be expressed as f(z) = max(0, z). The detailed architecture of the backbone network is described in Table 1. For instance, the size of the input images is set to 6 × 512 × 512. Layer1_x indicates that Layer 1 is composed of three residual blocks in series, i.e., Layer1_1, Layer1_2, and Layer1_3. The stride of all filters is set to 1. Layer1_1 adopts the structure shown in Figure 2a. Layer1_2 and Layer1_3 adopt the structure shown in Figure 2b.
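As a concrete illustration, the bottleneck block described above can be sketched in PyTorch as follows. The channel widths and the 1/4 reduction ratio are illustrative assumptions on our part; Table 1 gives the actual configuration used in the paper.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Sketch of a bottleneck residual block: X_out = ReLU(xi(X_in) + H(X_in))."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        mid = out_ch // 4  # reduce-then-expand bottleneck (assumed ratio)
        # xi(.): Conv1x1 -> Conv3x3 -> Conv1x1, each followed by BN
        self.residual = nn.Sequential(
            nn.Conv2d(in_ch, mid, 1, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        if stride != 1 or in_ch != out_ch:
            # H(.): downsampling projection shortcut (Conv1x1 + BN)
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )
        else:
            # H(.): identity shortcut
            self.shortcut = nn.Identity()

    def forward(self, x):
        # element-wise summation of the two branches, then ReLU
        return torch.relu(self.residual(x) + self.shortcut(x))
```

With stride 2 (as in the first block of Layer 2), the spatial size of the output is halved while the projection shortcut matches the channel count.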
Layer2_x indicates that Layer 2 is composed of four residual blocks in series, i.e., Layer2_1, Layer2_2, Layer2_3, and Layer2_4. Layer2_1 adopts the structure shown in Figure 2c. Layer2_1 applies downsampling on the output feature maps of Layer1_3 and reduces the size of the feature maps by half. Therefore, the stride of the Conv1 × 1 in H(·) and the Conv3 × 3 in ξ(·) is set to 2, while the stride of the Conv1 × 1 in ξ(·) is set to 1. The remaining three blocks adopt the structure shown in Figure 2d. The stride of their filters is set to 1. Besides, a bilinear upsample layer Layer2_Upsample_2× upsamples the extracted features to 1/4 the size of the input.
Therefore, change detection is implemented based on high-resolution feature maps. Unlike the original ResNet [36], in which too many downsampling operations cause the loss of spatial details in deep features, the backbone preserves high-resolution features that promote the detection of small changed objects. We reduced the width and depth of the backbone network so that the number of parameters decreased. However, the receptive field of the deep features is limited due to the limited depth of the backbone network. It is difficult to keep the completeness of the contextual semantics in deep features, and the limited receptive field leads to weak feature representation [54]. To alleviate this problem, deformable convolutions [51,52] provide a feasible solution. We introduced deformable convolutions in the residual blocks to capture deformable context from objects with various shapes and scales.

Introducing Deformable Convolutions
Let X_in(p) and X_out(p) denote the feature at location p of the input and output feature maps, respectively. Given a convolution kernel with K sampling locations, let w_k and p_k denote the weight value and default offset for the k-th location of the kernel, respectively, e.g., a 3 × 3 kernel with p_k ∈ {(−1, −1), (−1, 0), · · · , (1, 1)}, K = 9. A two-dimensional convolution that samples the input feature map with a fixed grid can be formulated as follows:

X_out(p) = ∑_{k=1}^{K} w_k · X_in(p + p_k),

where w_k enumerates the weights of the kernel according to the k-th location.
Convolutions that sample the input feature map using a fixed and regular grid have a fixed receptive field. We introduced deformable convolutions (DConv) [52] to promote the ability of modeling geometric transformations. DConv achieves arbitrary deformation of the receptive field by adjusting the 2-D offsets and modulation factors of the sampling locations. Deformable contexts adaptively build long-range dependencies based on the structural information of geospatial objects. DConv can be formulated as follows:

X_out(p) = ∑_{k=1}^{K} w_k · X_in(p + p_k + ∆p_k) · ∆m_k,

where ∆p_k and ∆m_k are the learnable 2-D offset and modulation factor for the k-th location, respectively. ∆p_k includes the offsets in the x and y directions. As shown in Figure 3, let X_in ∈ R^(C×H×W) denote the input feature map of DConv3 × 3, and p_i denote the center location of the kernel. Two convolution layers, OffsetConv3 × 3 and ModConv3 × 3, are separately applied over the input feature maps to obtain ∆p_k and ∆m_k. The number of filters is set to 2K for OffsetConv3 × 3 and K for ModConv3 × 3.
The former generates the feature map of the learned 2-D offsets, i.e., ∆p = {∆p^i_k}_{k=1}^{K} ∈ R^(2K×H×W). The latter, followed by a Sigmoid activation function, generates the feature map of modulation factors, i.e., ∆m = {∆m^i_k}_{k=1}^{K} ∈ R^(K×H×W), 1 ≤ i ≤ H·W, K = 9. The Sigmoid function transforms the modulation factors to the range (0, 1) and is expressed as f(z) = 1/(1 + exp(−z)). Due to the fractional coordinate p + p_k + ∆p_k, the value of X_in(p + p_k + ∆p_k) is calculated from the values of the four surrounding integer points by bilinear interpolation. In summary, DConv consists of two key steps. First, it generates deformable feature maps from the input feature maps based on the learned offsets in the x and y directions {∆p_k = (∆x_k, ∆y_k)}_{k=1}^{K} and applies the learned modulation factors {∆m_k}_{k=1}^{K} ∈ (0, 1) to modulate the activation of each location. Next, it applies a regular 2-D convolution over the deformable feature maps and generates the output feature maps. In this paper, DConvs are applied to replace all the 3 × 3 convolution layers of the bottleneck blocks, termed DConvBottleneck. In this way, the DConv-based backbone network is enforced to adaptively generate a more explicit spatial feature representation. Therefore, 3M-CDNet has the advantage of overcoming the adverse effects of scale variations of objects with various shapes.

Multilevel Feature Fusion Strategies
Previous works demonstrate that both high-level semantics and low-level detail information are important in change detection. Based on the DConv-based backbone network, a two-level strategy is applied to improve the feature representation by fusing the features X 2nd and X 1st . The channel concatenation operation was selected for its simplicity to achieve high computational efficiency using the minimal number of parameters.
It is rare for previous studies to clearly state which feature fusion strategy is effective. We compare three feature fusion strategies, which are as follows: (1) only applying the high-level feature maps X^2nd, termed the one-level strategy; (2) applying the fusion feature maps X obtained by concatenating the high-level X^2nd and low-level X^1st along the channel axis, i.e., X = [X^1st, X^2nd], termed the two-level strategy; and (3) applying the two-level strategy and then an extra fusion feature map X̃ ∈ R^(384×H/4×W/4), which is obtained by concatenating the output feature maps of the first 1 × 1 convolution layer of the classifier and X_θ extracted by the Input Layer, termed the three-level strategy.
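The two-level strategy amounts to a single channel concatenation. A minimal sketch follows; the specific channel widths (256-D low-level and 512-D high-level features, concatenated into the 768-D fusion map consumed by the classifier's first 1 × 1 convolution) are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

# Hypothetical feature maps from the two backbone stages.
x_1st = torch.randn(1, 256, 128, 128)   # low-level features X^1st (Layer 1)
x_2nd = torch.randn(1, 512, 64, 64)     # high-level features X^2nd (Layer 2)

# Layer2_Upsample_2x: bilinear upsampling to match the low-level resolution.
x_2nd_up = F.interpolate(x_2nd, scale_factor=2, mode="bilinear",
                         align_corners=False)

# Two-level strategy: X = [X^1st, X^2nd] along the channel axis.
fused = torch.cat([x_1st, x_2nd_up], dim=1)
```

Concatenation adds no trainable parameters, which is why it achieves high computational efficiency compared with attention-based fusion.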

Pixelwise Classifier
The pixel-wise classifier of 3M-CDNet adopts a plain design that only consists of four convolution layers in series. Table 2 presents the detailed architecture of the 3M-CDNet classifier. First, a 1 × 1 convolution layer Conv3_1 transforms the fusion features and reduces the feature channels from 768-D to 256-D. To obtain a change map with the same spatial resolution as the input, a 2-fold bilinear upsampling is applied after the first and last 1 × 1 convolution layers. The subsequent convolution layers classify the extracted features into two classes and predict a change probability map CM_prob ∈ R^(1×H×W) through a Sigmoid layer, of which the values lie in the range (0, 1). Finally, the binary change map CM ∈ R^(1×H×W) can be generated by thresholding CM_prob with a fixed threshold. As shown in Table 2, change detection was implemented based on a high-resolution feature map, i.e., the input feature map of Conv3_2, to promote the detection of small changed geospatial objects. Therefore, the two Conv3 × 3 layers (Conv3_2 and Conv3_3), which have a large number of input/output channels, have high computation costs and occupy about 80% of the computation of 3M-CDNet. For example, when the size of the bi-temporal input images is set to 6 × 512 × 512, the two Conv3 × 3 layers take as input a high-resolution feature map with a shape of 256 × 256 × 256, i.e., X ∈ R^(256×H/2×W/2). The resolution of this internal feature is half that of the input images. The computation costs of a convolution operator can be formulated as follows:

FLOPs = K_h × K_w × C_in × C_out × H_out × W_out,

where K_h = 3 and K_w = 3 denote the kernel size; C_in = 256 and C_out = 256 denote the input/output channels; and H_out = 256 and W_out = 256 denote the size of the output internal feature map. We can observe that the computation of one Conv3 × 3 is about 38.65 GFLOPs, and that of the two Conv3 × 3 layers is about 77.31 GFLOPs in total.
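The FLOPs formula above (counting one multiply-accumulate as one floating-point operation) can be checked with a few lines of arithmetic:

```python
def conv_flops(k_h, k_w, c_in, c_out, h_out, w_out):
    """FLOPs of a standard convolution: K_h * K_w * C_in * C_out * H_out * W_out."""
    return k_h * k_w * c_in * c_out * h_out * w_out

# Conv3_2 / Conv3_3 example from the text: 3x3 kernel, 256 channels in/out,
# 256 x 256 output feature map.
one_conv = conv_flops(3, 3, 256, 256, 256, 256)
print(one_conv / 1e9)       # about 38.65 GFLOPs for one Conv3x3
print(2 * one_conv / 1e9)   # about 77.31 GFLOPs for the pair
```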
Due to the limitation of computing power on some practical platforms, a lightweight variant called 1M-CDNet was proposed to reduce computation costs by using a simpler classifier with fewer trainable parameters. Compared with 3M-CDNet, 1M-CDNet's classifier only has three 1 × 1 convolution layers to facilitate the inference speed, as shown in Table 3. The computation costs (4.34 GFLOPs) are sharply reduced compared with those of 3M-CDNet's classifier (80.66 GFLOPs) when the size of the bi-temporal input images is set to 6 × 512 × 512. The proposed networks allow us to flexibly adjust the classifier to match different application requirements and limitations in practice. Dropout [53] is a simple yet effective way to prevent neural networks from overfitting. During training, dropout randomly drops units from the network, each unit being retained with probability p_d, which can be equivalent to training numerous different networks simultaneously, i.e., D(X) = D_m ⊙ X, where D_m indicates a binary mask of the same size as the feature map X, and ⊙ indicates the element-wise multiplication operation. D_m is randomly generated from a Bernoulli distribution with probability p_d, and the units of the feature maps corresponding to the locations of zeros are discarded during training. At test time, a neural unit is always present, and the weights are multiplied by p_d so that the output of the unit is the same as the expected output at training time, i.e., D(X) = p_d · X. In this paper, two dropout layers with probabilities of 0.5 and 0.1 are applied at the tail end of the classifier's 3 × 3 convolution layers.
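The train/test behavior of dropout described above can be illustrated numerically: the expectation of the masked train-time output matches the scaled test-time output. This is a minimal sketch, assuming p_d denotes the retention probability as in the formulas above.

```python
import numpy as np

rng = np.random.default_rng(0)
p_d = 0.5
x = np.ones((1000, 64))

# Training: a Bernoulli mask D_m (1 with probability p_d) gates the units.
mask = rng.binomial(1, p_d, size=x.shape)   # D_m
train_out = mask * x                        # D(X) = D_m (element-wise) X

# Test: every unit is present; activations are scaled by p_d instead.
test_out = p_d * x                          # D(X) = p_d * X
```

Averaged over many units, the train-time output converges to the deterministic test-time output, which is exactly why the scaling keeps the expected activations consistent between the two phases.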

Binarization
The trained model outputs the change probability map CM_prob ∈ R^(1×H×W). A fixed-threshold segmentation method is applied on CM_prob for binarization. It generates a binary change map with the same size as the input image, i.e., CM ∈ R^(1×H×W). This can be formulated as shown in Equation (5):

CM_(i,j) = 1 if (CM_prob)_(i,j) > T; otherwise, CM_(i,j) = 0. (5)
The subscripts i, j (1 ≤ i ≤ H, 1 ≤ j ≤ W) indicate the indexes of the change map's height and width, respectively. T indicates a fixed binarization threshold that determines whether a pixel has changed. A pixel is classified as changed if and only if the change probability is larger than T; otherwise, it is classified as background. In this paper, T was empirically set to 0.5 for simplicity.
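The thresholding step is a one-liner in practice; the sketch below applies T = 0.5 to a toy probability map.

```python
import numpy as np

def binarize(cm_prob, T=0.5):
    """Fixed-threshold binarization: 1 where probability > T, else 0."""
    return (cm_prob > T).astype(np.uint8)

cm_prob = np.array([[0.1, 0.7],
                    [0.5, 0.9]])
cm = binarize(cm_prob)
print(cm)  # [[0 1]
           #  [0 1]]  (note 0.5 is NOT > T, so it maps to background)
```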

Loss Function Definition
During training, the network parameters are iteratively updated by minimizing the loss between the forward output of 3M-CDNet and the reference change map with the backpropagation (BP) algorithm according to a specific loss function. The similarity between two probability distributions can be measured by the cross-entropy loss. Change detection aims to classify all the pixels into two subsets, i.e., changed and unchanged. The binary cross-entropy (BCE) loss function is therefore an intuitive candidate for model training, which can be formulated as shown in Equation (6):

L_BCE = −(1/N) ∑_{n=1}^{N} [y_n · log(ŷ_n) + (1 − y_n) · log(1 − ŷ_n)]. (6)
The parameter N is the number of samples, and each pixel is counted as a sample. The parameter y_n ∈ {0, 1} indicates an unchanged (0) or changed (1) pixel of the reference change map, and ŷ_n ∈ [0, 1] denotes the prediction of the model. However, the number of unchanged pixels is usually much larger than that of changed pixels. Due to this widespread class imbalance, the dominant unchanged pixels would make models tend to collapse and increase the difficulty of training. To alleviate this issue, the soft Jaccard index is introduced. The loss function can be formulated as shown in Equation (7).
L = λ · L_BCE + (1 − λ) · (1 − J), J = (∑_{n=1}^{N} y_n · ŷ_n) / (∑_{n=1}^{N} (y_n + ŷ_n − y_n · ŷ_n)). (7)

The parameter λ ∈ [0, 1] is the weight factor that balances the BCE loss and the Jaccard loss. In this paper, λ was empirically set to 0.7.
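A NumPy sketch of the combined loss follows. The global-sum form of the soft Jaccard index and the `eps` guard against log(0) and division by zero are our assumptions; the paper only gives the per-term ratio y_n·ŷ_n / (y_n + ŷ_n − y_n·ŷ_n).

```python
import numpy as np

def combined_loss(y, y_hat, lam=0.7, eps=1e-7):
    """Weighted BCE plus soft-Jaccard loss with lambda = 0.7 as in the paper."""
    y_hat = np.clip(y_hat, eps, 1 - eps)  # guard log(0)
    bce = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    # soft Jaccard index over all pixels (assumed global-sum form)
    jac = np.sum(y * y_hat) / (np.sum(y + y_hat - y * y_hat) + eps)
    return lam * bce + (1 - lam) * (1 - jac)

y = np.array([1.0, 0.0, 1.0, 0.0])
good = combined_loss(y, np.array([0.9, 0.1, 0.8, 0.2]))
bad = combined_loss(y, np.array([0.1, 0.9, 0.2, 0.8]))
```

As expected, predictions close to the reference yield a smaller loss than inverted ones.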

Experimental Dataset
For model training and evaluation, we adopted two representative datasets, LEVIR-CD [44] and Season-Varying [55]. We split the datasets following the criteria recommended by their creators. We trained the model for 300 epochs on LEVIR-CD and 600 epochs on Season-Varying.
(1) LEVIR-CD Dataset (https://justchenhao.github.io/LEVIR/, accessed on 6 July 2021). The dataset contains 637 pairs of co-registered very-high-resolution (VHR, 0.5 m/pixel) Google Earth images with a size of 1024 × 1024 pixels. These bitemporal images, spanning periods of 5∼14 years, were collected from 20 different regions located in several cities in Texas, USA. This dataset mainly focuses on building-related changes, including building growth (the change from soil/grass/hardened ground or a building under construction to new built-up regions) and building decline. The buildings have various types and scales. Besides, irrelevant changes caused by seasonal and illumination variations bring about challenges. The numbers of changed and unchanged pixels are 30,913,975 and 637,028,937, respectively. The creators randomly split the dataset into three parts, i.e., 70% of the samples for training, 10% for validation, and 20% for testing [44]. Due to the limitation of GPU memory, the original images were cropped into smaller image tiles with a size of 512 × 512 pixels for model training and evaluation. In our case, 4016 and 1024 tiles were cropped for training and validation, respectively, using a sliding window with a stride of 256 overlapping pixels. In addition, 512 non-overlapping tiles were cropped for testing.
(2) Season-Varying Dataset (https://drive.google.com/file/d/1GX656JqqOyBi_Ef0w6 5kDGVto-nHrNs9, accessed on 6 July 2021). The dataset contains 7 pairs of co-registered images with a size of 4725 × 2700 pixels for manual ground truth creation, which were obtained from Google Earth (DigitalGlobe). Bitemporal images with seasonal changes were introduced in this dataset, such as from summer to winter/autumn. The spatial resolution of these images ranges from 3 to 100 cm/pixel. The change types are mainly related to land changes, building changes, road changes, and car changes. Each pair of images was cropped into randomly rotated fragments (0-2π) with a size of 256 × 256 pixels containing at least a fraction of changed pixels. Finally, Season-Varying contains 16,000 pairs of image tiles with a fixed size of 256 × 256 pixels, of which 10,000 and 3000 tiles are used for training and validation, respectively, and an extra 3000 tiles are used for testing [55].
Specifically, we computed the PSNR between each pair of bitemporal images in a dataset and then averaged these values to obtain a dataset-level PSNR. The average PSNR for LEVIR-CD and Season-Varying is approximately 13 dB and 11 dB, respectively.
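This dataset-level PSNR computation can be sketched in a few lines of Python. The function names and the assumption of single-channel 8-bit images are ours, not the authors'; a real implementation would operate on RGB arrays.

```python
import math

def psnr(img_a, img_b, max_val=255.0):
    """Peak signal-to-noise ratio between two equally sized images.

    img_a, img_b: nested lists (rows of pixel intensities).
    A low PSNR between bitemporal images indicates large differences.
    """
    flat_a = [p for row in img_a for p in row]
    flat_b = [p for row in img_b for p in row]
    mse = sum((a - b) ** 2 for a, b in zip(flat_a, flat_b)) / len(flat_a)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)

def dataset_psnr(image_pairs):
    """Average the per-pair PSNR over a whole dataset."""
    values = [psnr(a, b) for a, b in image_pairs]
    return sum(values) / len(values)
```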

Evaluation Metrics
The most common metrics related to the changed category were adopted for the quantitative evaluation, including overall accuracy (OA), precision (Pr), recall (Re), F1-score (F1) (https://nndl.github.io/nndl-book.pdf, accessed on 16 November 2021), and the intersection over union (IoU) [24]. The above metrics related to the changed category can be formulated as follows.
In binary change detection, true positive (TP) indicates the number of changed pixels correctly classified as changed. False positive (FP) indicates the number of unchanged pixels misclassified as changed. False negative (FN) indicates the number of changed pixels misclassified as unchanged. True negative (TN) indicates the number of unchanged pixels correctly classified as unchanged. Pr is the ratio of the number of correctly classified positive samples to the number of samples classified as positive by the classifier. Re is the ratio of the number of correctly classified positive samples to the number of all positive samples. F1 is the harmonic mean of Pr and Re. F1 and IoU are comprehensive indicators of overall performance; the higher the value, the better the performance.
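These definitions yield the standard formulas Pr = TP/(TP + FP), Re = TP/(TP + FN), F1 = 2·Pr·Re/(Pr + Re), IoU = TP/(TP + FP + FN), and OA = (TP + TN)/(TP + FP + FN + TN). A minimal illustration (the function name is ours):

```python
def change_metrics(tp, fp, fn, tn):
    """Binary change-detection metrics from pixel counts.

    tp: changed pixels correctly classified as changed
    fp: unchanged pixels misclassified as changed
    fn: changed pixels misclassified as unchanged
    tn: unchanged pixels correctly classified as unchanged
    """
    oa = (tp + tn) / (tp + fp + fn + tn)   # overall accuracy
    pr = tp / (tp + fp)                    # precision
    re = tp / (tp + fn)                    # recall
    f1 = 2 * pr * re / (pr + re)           # harmonic mean of Pr and Re
    iou = tp / (tp + fp + fn)              # intersection over union
    return {"OA": oa, "Pr": pr, "Re": re, "F1": f1, "IoU": iou}
```

For example, 80 correctly detected changed pixels with 20 false alarms, 20 misses, and 880 correct rejections yield Pr = Re = F1 = 0.8, IoU ≈ 0.667, and OA = 0.96.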
Besides, the time complexity of the models is measured by the runtime and the computational costs. Specifically, the runtime (ms) is measured as the average time of 1000 forward predictions on random inputs during the testing phase. Computational costs are measured by counting the number of floating-point operations (FLOPs) [48] in the testing phase, i.e., 1 GFLOPs = 1 × 10⁹ FLOPs.
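The runtime measurement can be sketched as follows. This is a simplified stand-in (the names and structure are ours); note that for GPU models the device must be synchronized (e.g., `torch.cuda.synchronize()`) before reading the clock, otherwise only the kernel launch is timed.

```python
import time

def average_runtime_ms(forward_fn, make_input, n_runs=1000, warmup=10):
    """Average wall-clock time (ms) of a forward prediction over n_runs.

    forward_fn: callable performing one forward pass
    make_input: callable returning a (random) input batch
    Warm-up runs are excluded so one-time setup costs do not bias
    the average.
    """
    for _ in range(warmup):
        forward_fn(make_input())
    start = time.perf_counter()
    for _ in range(n_runs):
        forward_fn(make_input())
    elapsed = time.perf_counter() - start
    return elapsed / n_runs * 1000.0
```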

Implementation Details
The proposed 1M-CDNet and 3M-CDNet were implemented in Python using the PyTorch framework [56]. During training, the AdamW optimizer [57], i.e., Adam with decoupled weight decay, is used for updating the network parameters. AdamW has the advantage of adapting its parameter-wise learning rates and facilitating convergence. The decoupled weight decay renders the optimal settings of the learning rate and the weight decay factor more independent, thereby simplifying hyperparameter optimization. This allows us to train the model with a fixed learning rate schedule and weight decay, which reduces the difficulty of hyperparameter choice. During the training phase, the model was optimized by minimizing Equation (6) with the AdamW optimizer, with β₁ = 0.9 and β₂ = 0.99, and with the initial learning rate and weight decay empirically set to 0.000125 and 0.0005, respectively. The minibatch size was set to 16 on an NVIDIA RTX 3090 GPU with 24 GB memory. The operating system is Ubuntu 18.04 with 128 GB memory, and the CPU is an Intel(R) Xeon(R) Silver 4215R.
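For illustration, a minimal scalar sketch of the decoupled AdamW update, using the hyperparameters stated above. The single-parameter form and variable names are ours; in practice one would simply use `torch.optim.AdamW`.

```python
import math

def adamw_step(theta, grad, m, v, t,
               lr=0.000125, beta1=0.9, beta2=0.99,
               weight_decay=0.0005, eps=1e-8):
    """One AdamW update for a single scalar parameter at step t >= 1.

    Unlike L2 regularization folded into the gradient (classic Adam),
    the weight-decay term is applied directly to the parameter, so the
    decay strength does not interact with the adaptive learning rate.
    """
    m = beta1 * m + (1 - beta1) * grad        # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias corrections
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)  # Adam step
    theta = theta - lr * weight_decay * theta  # decoupled weight decay
    return theta, m, v
```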

Online Data Augmentation
Data augmentation (DA) is a simple yet effective technique for regularizing the network. DA can be used to simulate scale variations, illumination variations, and pseudo-changes, such as the spectral changes between bitemporal images. Online data augmentation means that DA is performed on the fly during training rather than by expanding the original training set in advance at the cost of expensive training time. In this paper, online DA was applied to each loaded batch with a probability of 0.8, comprising random shift-rotate-scale transforms with zero padding, rotations by 90°, 180°, and 270°, horizontal and vertical flips, and color jitter. Each kind of augmentation was applied independently with a probability of 0.5. Online DA is equivalent to an implicit expansion of the training set: it increases the randomness of training while avoiding a linear growth of the stored training data.
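The sampling logic of this online DA scheme can be sketched as follows. The augmentation callables here are placeholders, not the paper's implementation; only the two-level probability structure (pipeline fires with p = 0.8, each augmentation independently with p = 0.5) is taken from the text.

```python
import random

def augment(sample, augmentations, p_pipeline=0.8, p_each=0.5):
    """Online augmentation applied to each loaded training sample.

    augmentations: list of callables sample -> sample (placeholders
    for shift-rotate-scale, 90/180/270 rotation, flips, color jitter).
    The whole pipeline fires with probability p_pipeline; within it,
    each augmentation fires independently with probability p_each.
    """
    if random.random() >= p_pipeline:
        return sample                 # leave the sample untouched
    for aug in augmentations:
        if random.random() < p_each:
            sample = aug(sample)
    return sample
```

In a real data loader this would be invoked per sample, e.g., inside a PyTorch `Dataset.__getitem__`, taking care to apply identical geometric transforms to both bitemporal images and the label mask.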
(1) FC-Siam-Diff [26]. A feature-level late-fusion method that uses a pseudo-Siamese FCN to extract the bitemporal multilevel features and fuse them by a feature difference operation.
(2) FC-Siam-Conc [26]. Very similar to FC-Siam-Diff; the difference lies in fusing the bitemporal features by a feature concatenation operation.
(3) FC-EF-Res [35]. An image-level early-fusion method. The network takes the concatenated bitemporal images as input and introduces residual modules to facilitate network convergence.
(4) CLNet [29]. A U-Net-based early-fusion method, which builds the encoder by incorporating cross layer blocks (CLBs). An input feature map is first divided into two parallel but asymmetric branches; the CLBs then apply convolution kernels with different strides to capture multi-scale context for performance improvement.
(5) STANet [44]. A metric-based method, which adopts a Siamese FCN for feature extraction and learns the change map based on the distances between the bitemporal features. Inspired by the self-attention mechanism, a spatial-temporal attention module was proposed to learn the spatial-temporal relationships between the bitemporal images and generate more discriminative features.
(6) DDCNN [37]. An attention-based method that adopts a simplified UNet++ architecture. Combined with dense upsampling units, high-level features are applied to guide the selection of low-level features during the upsampling phase for performance improvement.
(7) FarSeg [56]. A foreground-aware relation network for geospatial object segmentation in RS images. From the perspective of relation, FarSeg enhances the discrimination of foreground features via foreground-correlated contexts associated by learning the foreground-scene relation.
(8) BIT-CD [48]. A transformer-based method, which expresses the input images as a few high-level semantic tokens. By incorporating a transformer encoder in the CNN backbone network, BIT-CD models the context in a compact token-based space-time.
(9) MSPP-Net [49]. A lightweight multi-scale spatial pooling (MSPP) network used to exploit the changed information in the noisy difference image. Multi-scale pooling kernels are equipped in a convolutional network to exploit the spatial context of changed regions.
(10) Lite-CNN [50]. A lightweight network that replaces normal convolutional layers with bottleneck layers keeping the same number of channels between input and output. It also employs dilated convolutional kernels with a few non-zero entries, which reduce the running time of convolutional operators.

Table 4 presents the number of parameters (M), the computational costs (GFLOPs), and the inference runtime (ms) of different CD networks. All results presented in Table 4 were measured on an NVIDIA RTX 2080Ti GPU with 11 GB memory. The computational cost during testing was calculated with fixed-size inputs of 6 × 256 × 256 and 6 × 512 × 512. The runtime was measured with different batch sizes ("bs") during testing, where "bs" was set to 1 and 16, respectively. Note that DDCNN and STANet, which consume massive amounts of memory, cannot run on a single GPU with 11 GB memory under the "bs = 16" setting, where a CUDA out-of-memory error occurred.
In light of using high-resolution features for change detection, 3M-CDNet costs 23.71 and 94.83 GFLOPs, and 1M-CDNet costs 4.54 and 18.14 GFLOPs, for input sizes of 6 × 256 × 256 and 6 × 512 × 512, respectively. We can observe that the inference speed of 1M-CDNet is superior to most existing methods under both the "bs = 16" and "bs = 1" settings. From Table 4, we can observe that 1M-CDNet only involves 1.26 M parameters, which is about 4% of FarSeg's 31.38 M. Specifically, 1M-CDNet requires far fewer parameters than DDCNN and reduces the computational costs. Table 4 shows that the number of parameters of DDCNN (60.21 M) is about 47 times that of 1M-CDNet, and DDCNN's computational cost (214.16 GFLOPs) is about 46 times that of 1M-CDNet. What is more, compared with lightweight models such as CLNet, BIT-CD, FC-Siam-Diff, and FC-Siam-Conc, 1M-CDNet achieves better accuracy with comparable runtime. More details of the experimental results are discussed in Section 3.5.

Comparisons on LEVIR-CD Dataset
(a) Quantitative evaluation

Table 5 presents the quantitative results on the LEVIR-CD dataset. Owing to the effectiveness of aggregating multiscale context, CLNet and DDCNN achieve remarkable progress by a significant margin compared with the pure FCN-based FC-Siam-Conc, FC-Siam-Diff, and FC-EF-Res. In contrast, our 1M-CDNet and 3M-CDNet do not apply sophisticated structures, such as UNet++ or U-Net with dense skip connections, or the deep supervision strategy for facilitating intermediate layers, which are powerful for pixel-wise prediction tasks. Nonetheless, the quantitative results show that 1M-CDNet and 3M-CDNet consistently outperform the other approaches in terms of the comprehensive metrics F1 and IoU. 3M-CDNet achieves the best F1 (0.9161) and IoU (0.8452), outperforming the baseline STANet by a significant margin in F1 (+3.53%) and IoU (+5.83%). Besides, 1M-CDNet achieves the second-best F1 (0.9118) and IoU (0.8379). Table 5 suggests that 1M-CDNet outperforms the state-of-the-art CLNet (w/ DA), with IoU and F1 increased by about 0.82% and 0.49%, respectively, at fewer computational costs. Moreover, as for the Pr and Re metrics, Table 5 suggests that DDCNN achieves the highest Pr, but its Re is low, indicating that DDCNN detects fewer changed areas. STANet achieves the highest Re, but its Pr is low, indicating that STANet produces more false alarms in changed areas. 3M-CDNet makes a better trade-off between precision (91.99%) and recall (91.24%) than the other approaches.

(b) Qualitative evaluation
For intuitive comparisons, some change detection results are presented in Figure 4. For the LEVIR-CD dataset, the main change type lies in building changes. Note that the black pixels indicate the changed buildings while the white pixels indicate the background regions. From the first two columns, we can observe that 1M-CDNet and 3M-CDNet generated more compact change masks. In contrast, other approaches, such as DDCNN and the three U-shape-based variants, exhibit poor performance; most of them generated change masks with holes and fragmented boundaries. Specifically, as shown in the following columns, our 1M-CDNet and 3M-CDNet detect small-scale changed objects better than the other methods and achieve higher recall, which is consistent with the quantitative analysis. 1M-CDNet and 3M-CDNet also succeeded in discriminating crowded building instances from each other. This advantage mainly owes to the DConv-based backbone network, which extracts high-resolution feature maps for change detection. From the last two columns, the other approaches suffer from false alarms of varying degrees because identical building roofs exhibit different colors in the bitemporal images. In contrast, our 1M-CDNet and 3M-CDNet completely overcome these distractions and even identify changed buildings missed in the reference maps, as shown in the 7th column. We can conclude that 1M-CDNet and 3M-CDNet are more robust against pseudo-changes caused by spectral changes.

Comparisons on Season-Varying Dataset
(a) Quantitative evaluation

Table 6 presents the quantitative results on the Season-Varying dataset. Despite the additional challenges caused by season variations, FarSeg achieved third place and exhibited better performance than the other FCN-based or attention-based approaches by modeling foreground-correlated contexts. We can observe that our 1M-CDNet and 3M-CDNet consistently perform better than the other benchmarks in terms of accuracy. For example, 1M-CDNet outperforms the lightweight CLNet with increases of 0.56% in IoU and 0.30% in F1, and 3M-CDNet outperforms CLNet with increases of 1.87% in IoU and 0.99% in F1. From Table 4, we can observe that 1M-CDNet only involves 1.26 M parameters, about 4% of FarSeg's 31.38 M. Nonetheless, 1M-CDNet achieved an improvement of IoU (+0.36%) and F1 (+0.20%) compared with FarSeg. Specifically, 1M-CDNet requires far fewer parameters than DDCNN and reduces the computational costs: Table 4 shows that the number of parameters of DDCNN (60.21 M) is about 47 times that of 1M-CDNet, and DDCNN's computational cost (214.16 GFLOPs) is about 46 times that of 1M-CDNet. Nevertheless, 1M-CDNet outperforms DDCNN by a significant margin, with increases of +2.34% in F1 and +4.28% in IoU. In addition, the models that require numbers of parameters similar to our 1M-CDNet, such as FC-Siam-Conc, FC-Siam-Diff, and FC-EF-Res, exhibit poor performance on the Season-Varying dataset. Their performance is limited by insufficient model capacity.

(b) Qualitative evaluation
For the Season-Varying dataset, the change types are mainly related to land changes, building changes, road changes, and car changes. Some change detection results are shown in Figure 5. The black pixels indicate unchanged regions and the white pixels indicate changed regions. Compared to the other models, the change masks generated by 1M-CDNet and 3M-CDNet preserve the actual shapes of changed objects with more complete boundaries. In contrast, the change masks generated by other methods, such as FarSeg, STANet, CLNet, BIT-CD, and MSPP-Net, show fragmented boundaries, especially for large-scale geospatial objects with various shapes. Even worse, the other three U-shape-based variants exhibit poor recall because they failed to detect small changed objects in most cases. What is more, 1M-CDNet and 3M-CDNet generated promising change maps that are more robust to the spectral changes and vegetation growth caused by seasonal variations (e.g., from summer to winter/autumn).

Discussion
Ablation studies were conducted to verify each component's contribution to 3M-CDNet. Tables 7 and 8 present the quantitative results on the two public datasets, where "w/o" and "w/" mean "without" and "with", respectively. Table 7 presents the effects of the multilevel feature fusion strategies. Note that all these experiments were carried out with online data augmentation during training. Table 7 suggests that the two-level strategy always achieves the best performance in terms of F1 and IoU, i.e., LEVIR-CD (0.9161/0.8452) and Season-Varying (0.9749/0.9510). Compared with the one-level strategy, which lacks low-level details, it increased IoU and F1 by about 2.09% and 1.24% on LEVIR-CD, as well as 1.70% and 0.90% on Season-Varying, respectively. In contrast, the benefit of the three-level strategy is negligible: compared with the one-level strategy, it only improved F1 by 0.29% and IoU by 0.48% on LEVIR-CD, and F1 by 0.23% and IoU by 0.44% on Season-Varying. We can conclude that the two-level strategy is sufficient for improvements in our case, while introducing either insufficient or excessive features can bring about an unexpected degradation problem. Therefore, we employed the two-level strategy for feature fusion in the experiments. Table 8 presents the effectiveness of the core components of 3M-CDNet. The first row of Table 8 shows 3M-CDNet's quantitative results on the LEVIR-CD and Season-Varying datasets, where 3M-CDNet adopted all of the components, i.e., online DA, dropout, and DConv.

Effects of Online DA, Dropout, and Dconv
Online DA was used to simulate scale variations, illumination variations, and pseudo-changes caused by season variations. Table 8 suggests that online DA makes impressive contributions on both datasets (3M-CDNet vs. "w/o DA"). Compared with the "w/o DA" setting, applying DA achieves considerable improvements of F1 (1.43%) and IoU (2.40%) on LEVIR-CD, as well as F1 (0.78%) and IoU (1.47%) on Season-Varying. This demonstrates that the online DA strategy described in Section 3.3 (b) is an effective trick for achieving immediate gains by improving the diversity of samples, especially when training samples are scarce, as in LEVIR-CD.
Meanwhile, dropout can be an effective regularization complementary to online DA for achieving good generalization capacity. Compared with the results shown for "w/o DA/Dropout", 3M-CDNet achieves improvements of F1 (2.05%) and IoU (3.43%) on LEVIR-CD, as well as F1 (0.95%) and IoU (1.69%) on Season-Varying. Even without online DA ("w/o DA/Dropout" vs. "w/o DA"), dropout increased F1 by 0.62% and IoU by 1.03% on LEVIR-CD, as well as F1 by 0.17% and IoU by 0.32% on Season-Varying.
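For reference, a minimal pure-Python sketch of (inverted) dropout as applied at training time. This illustrates the mechanism only, not the paper's classifier code, which would use `torch.nn.Dropout`.

```python
import random

def dropout(values, p=0.5, training=True):
    """Inverted dropout on a list of activations.

    During training each unit is zeroed with probability p and the
    survivors are scaled by 1/(1 - p), so the expected activation is
    unchanged and no rescaling is needed at inference time.
    """
    if not training or p == 0.0:
        return list(values)          # identity at inference time
    keep = 1.0 - p
    return [v / keep if random.random() < keep else 0.0 for v in values]
```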
Last but not least, DConv was incorporated into the backbone network to enlarge the receptive field of deep features. Table 8 shows that DConv is an indispensable component for achieving high accuracy. For instance, the last row shows that performance drops significantly without DConv (3M-CDNet vs. "w/o DConv"): F1 decreased by about 1% and 1.38% on the two datasets, and IoU decreased by 1.69% and 2.59%, respectively. Moreover, DConv together with online DA promoted the performance significantly, achieving improvements of F1 (+2.29%) and IoU (+3.83%) on LEVIR-CD (3M-CDNet vs. "w/o DA/DConv"). Since the Season-Varying dataset includes geospatial objects with various shapes and scales, 3M-CDNet achieved a significant margin over the "w/o DA/DConv" setting, i.e., F1 (+2.63%) and IoU (+4.89%) on Season-Varying. Thus, the DConv-based backbone promotes the geometric transformation modeling ability of our lightweight model. For intuitive comparisons, some detection results on both datasets are presented in Figure 6. We can observe that the change maps generated by 3M-CDNet are overall closer to the reference change maps than those of "w/o DConv". From the first two columns in Figure 6(1), 3M-CDNet presents a significant margin over the change maps of "w/o DConv"; the latter suffers from false alarms caused by the different colors of identical building roofs. Meanwhile, 3M-CDNet achieved a higher detection rate on small-scale objects, as shown in the last column of Figure 6(1). In addition, the changed building masks generated by 3M-CDNet have more complete and smoother boundaries, as shown in Figure 6(1,2). Moreover, 3M-CDNet can distinguish individual building instances within crowded buildings, as shown in the 3rd and 4th columns of Figure 6(1). What is more, the seasonal appearance between bitemporal images varies significantly, such as season changes of natural objects (e.g., from wide forest areas to single trees).
However, during the generation of the reference maps, only the appearance and disappearance of objects were considered as changes, while changes due to season differences, brightness, and other factors were ignored. 3M-CDNet generated promising change maps that are robust to the spectral changes and vegetation growth caused by seasonal variations (e.g., from summer to winter/autumn), which is challenging for traditional methods.

Conclusions
In this paper, an effective network termed 3M-CDNet and a lightweight variant termed 1M-CDNet were proposed for urban change detection using bitemporal remote sensing images. The lightweight model was obtained by reducing the width and depth of the backbone network. We can conclude that the proposed networks achieve performance improvements from the following perspectives. First, the high-resolution feature maps extracted by the backbone network facilitate the detection of small changed objects with acceptable computational costs. Second, the backbone network was incorporated with deformable convolutions to promote the geometric transformation modeling ability of our lightweight model. In addition, the two-level feature fusion strategy was applied to improve the feature representation. Finally, the dropout applied in the classifier and online data augmentation bring about immediate gains without extra cost. What is more, the proposed networks allow us to flexibly adjust the classifier to satisfy different trade-offs between accuracy and efficiency in practice.
Extensive experiments have verified the effectiveness of 1M-CDNet and 3M-CDNet. The experimental results show that 1M-CDNet and 3M-CDNet exhibited better performance than the state-of-the-art approaches. For example, 1M-CDNet achieved an F1 of 0.9118 and an IoU of 0.8379 on the LEVIR-CD dataset, as well as an F1 of 0.9680 and an IoU of 0.9379 on the Season-Varying dataset. Additionally, 3M-CDNet achieved the best F1 (0.9161) and IoU (0.8452) on the LEVIR-CD dataset, as well as the best F1 (0.9749) and IoU (0.9510) on the Season-Varying dataset. Specifically, 1M-CDNet makes a better trade-off between accuracy and inference speed than existing methods. Future work will focus on further improving detection accuracy and reducing computational costs by incorporating model compression techniques, such as knowledge distillation and channel pruning.