MCCRNet: A Multi-Level Change Contextual Refinement Network for Remote Sensing Image Change Detection

Abstract: Change detection based on bi-temporal remote sensing images has made significant progress in recent years; it aims to identify the changed and unchanged pixels between a registered pair of images. However, most learning-based change detection methods only utilize fused high-level features from the feature encoder and thus miss the detailed representations that low-level feature pairs contain. Here we propose a multi-level change contextual refinement network (MCCRNet) to strengthen the multi-level change representations of feature pairs. To effectively capture the dependencies of feature pairs while avoiding fusing them, our atrous spatial pyramid cross attention (ASPCA) module introduces a crossed spatial attention module and a crossed channel attention module to emphasize the position importance and channel importance of each feature while keeping the scale of the input and output the same. This module can be plugged into any feature extraction layer of a Siamese change detection network. Furthermore, we propose a change contextual representations (CCR) module from the perspective of the relationship between the change pixels and the contextual representation, named change region contextual representations. The CCR module aims to correct changed pixels mistakenly predicted as unchanged by means of a class attention mechanism. Finally, we introduce an effective sample number adaptively weighted loss to solve the class-imbalance problem of change detection datasets. On the whole, compared with other attention modules that only use fused features from the highest feature pairs, our method can capture the multi-level spatial, channel, and class context of change discrimination information. The experiments were performed on four public change detection datasets of various image resolutions. Compared to state-of-the-art methods, our MCCRNet achieved superior performance on all datasets (i.e., LEVIR, Season-Varying Change Detection Dataset, Google Data GZ, and DSIFN) with improvements of 0.47%, 0.11%, 2.62%, and 3.99%, respectively.


Introduction
Change detection aims to distinguish differences in multi-temporal remote sensing images. It plays an important role in understanding land surface change, global resource monitoring, land use change, disaster assessment, visual monitoring, and urban management, and forms a significant part of intelligent remote sensing image interpretation [1]. Common change detection methods feed the registered bi-temporal images into a corresponding model and output a predicted change intensity map with the same size as the original image pair, in which each pixel is predicted to be changed or unchanged.

Change Detection
To date, many change detection methods have been proposed, including traditional approaches and learning-based approaches.
Learning-based methods rely on the rapid development of deep learning algorithms. Many image classification and recognition algorithms based on convolutional neural networks (CNNs) give satisfactory results for remote sensing image tasks [5]. As change detection can be regarded as a pixel-level prediction task, almost all deep neural network models follow the encoder-decoder structure to predict the change map. Daudt et al. [6] first designed three Siamese convolutional network models based on a U-Net structure. Subsequently, an enhanced version of U-Net was also applied to remote sensing image change detection [7] and achieved better results. Fang et al. [8] proposed a Siamese framework based on a dual learning-based domain transfer mechanism and put forward a combined loss function for solving the class-imbalance problem. Chen and Shi [9] proposed STANet, which established the spatial-temporal relationship between multi-temporal images through a self-attention mechanism and was applied to optical remote sensing images with a Siamese network structure. However, this method was based on metric learning and thus required a long training time, so we improved on it by proposing a classification-based method. Zhang et al. [10] pointed out that current change detection methods based on deep learning have some limitations in terms of deep feature fusion and supervision, and they improved the ability to discriminate differences by inserting a spatial attention module (SAM) and a channel attention module (CAM) into feature layers at various levels; however, this largely ignored the relationship between different feature layers. Our model therefore cascaded the feature pairs refined by the cross-attention module from the high layers to the bottom layers, thereby facilitating the full utilization of multi-level difference discrimination features. Meanwhile, to fully explore the relationship between a pixel and its surrounding region, we proposed a change contextual representation (CCR) module.
Change detection methods based on deep learning are mainly divided into two categories, one based on metric learning [11][12][13][14] and the other based on classification [15][16][17][18][19][20][21][22]. The former regards the change information as the similarity of feature pairs and pulls samples belonging to the same class closer in the embedding space while pushing samples belonging to different classes further apart. DASNet [23] used a dual attention module to enhance the generalization of the extracted bi-temporal features. A metric module was then adopted to predict the change map by thresholding the L2-norm distance. In addition, a weighted double margin contrastive loss was put forward on the basis of a universal contrastive function. These metric-based methods are suitable for most regular change detection datasets, especially for street view images. However, they are not applicable to high-resolution bi-temporal remote sensing images, due to the difficulty of designing an appropriate threshold in the decision module. Recently, most change detection networks based on semantic segmentation models have performed slightly better than these methods. PGA-SiamNet [24] introduced a global co-attention mechanism to emphasize the importance of correlation between the input feature pairs, thus compensating for the displacement of buildings in orthoimages. Adopting an identical framework, it also enhanced the post-fused bi-temporal features extracted by the shared feature extraction backbone. In the decoder, multi-level features were fused as the final change discriminating information. DTCDSCN [25] divided semantic segmentation and change detection into two subnetworks to make up for the lack of boundary information in the latter task and proposed an improved focal loss to solve the problem of imbalanced samples.
Based on the UNet++ [26] structure, DifUnet++ [27] simultaneously fused the concatenated and absolute difference features of bi-temporal images. In particular, the researchers adopted a multiple side-outs fusion strategy to reset the loss weight of different scales. Although this network structure fully utilized the fused information of the original bi-temporal images, including absolute information on the difference between feature pairs and the sum of feature maps, the experimental results were not significant due to the lack of an effective correlation between feature pairs and of the context between pixels. Our attention modules greatly improved the accuracy on the corresponding datasets.

Attention Mechanism
An attention mechanism has the ability to capture long-range dependencies, so it is widely used in natural language processing, image classification, semantic segmentation, and object detection. SENet [28] first proposed a channel attention module to adaptively correct the weight ratio between channels, which simply captures channel-level long-range dependencies. By utilizing non-local theory from a global point of view, Non-local Net [29] made the receptive field no longer limited to the fixed size of a local area. That is, the model only needs to calculate the interactions between any two locations to capture long-range dependencies directly. DANet [30] proposed a position attention module and a channel attention module to learn a spatial attention map and a channel attention map. The former aggregates and updates all positions by weighting the spatial positions of a feature. The latter applies attention weights to the channel dimension. In addition, DANet showed that the sum fusion of the two attention modules can further improve feature representations, which contributes to more accurate results. Different from the above, ACFNet [31] utilizes class context information instead of spatial context information as the attention weight. Specifically, it first calculates the class center by using a coarse segmentation result and a feature map. Then, it corrects the pixels that were predicted incorrectly. ACFNet was the first to introduce a class attention mechanism.
As mentioned above, methods based on classification perform better than metric-based ones in most cases. Based on the encoder-decoder structure, we proposed a multi-level change contextual refinement net (MCCRNet), which extracts multi-level feature pairs with a shared VGG16 [32] backbone. The extracted feature pairs are subsequently modified and strengthened through four atrous spatial pyramid cross-attention (ASPCA) modules. The decoder was constructed in a coarse-to-fine manner, which means that the modified feature pairs are concatenated with the output from the previous upsampling layer, and the fused feature is then forwarded to the next layer to gradually restore the original image resolution. Different from the self-attention operations used in existing methods, the ASPCA module no longer takes the fused single feature as the input but uses the original multi-level feature pairs instead. Concretely, each of the feature pairs was fed into both the atrous spatial pyramid cross spatial attention (ASPCPA) module and the atrous spatial pyramid cross-channel attention (ASPCCA) module, thus establishing an interactive relationship that shares information between the dual features while keeping the primitive dual-branch features in a steady state. Compared to the pyramid spatial-temporal attention module proposed by [6], which partitioned the image scale in a uniform manner, our pyramid structure followed atrous spatial pyramid pooling [33] (ASPP), which proved to be more effective for semantic segmentation, image classification, and object detection. Compared to the research in [29], which only enhanced dual-branch features by dual attention modules but ignored the fusing information of multi-level features, our method solved this by gradually cascading the updated feature pairs with the upsampled feature from the previous upsampling layer.
ISPRS Int. J. Geo-Inf. 2021, 10, 591
To make full use of the multi-level contextual information that discriminates feature differences, we designed a change contextual representation (CCR) module, which utilized position attention and class attention mechanisms to capture the change region representations, the pixel-change region relation, and the change region contextual representations. CCR is the first to introduce the correlation between pixels and their context into change detection algorithms. This motivation came from the fact that the pixels around changed pixels are also most likely to change. In Figure 1, the black dots represent changed pixels, and the small white dots represent unchanged pixels; the rectangle represents the changed area, and the large circle represents the unchanged area. A pixel tends to belong to the same class as its surrounding context region, which means that the class label assigned to a pixel is the category of the region/object that the pixel belongs to. We aimed to augment the change representation of a pixel by exploiting the representation of the change region of the corresponding change class, which was realized by three self-attention operations in this work. By fusing the upsampled features and then forwarding them into this module, the change feature could be made more robust and discriminative. The experiments showed that the ASPCA and CCR modules effectively improved the change detection results. In addition, to solve the problem of imbalanced change detection samples, we used the idea of cost-sensitive learning [34][35][36] to assign adaptive weights to the losses of changed and unchanged samples. Specifically, the mathematical formula for the effective number of samples in [37] was adopted, which assigns higher weights to the loss of changed pixels. It was combined with focal loss to form the final effective sample number adaptively weighted loss (EAWLoss), which proved effective for the pixel-level binary classification task and can be extended to multiclass tasks such as semantic change detection [38][39][40].
Our contributions can be summarized as follows:
(1) We proposed a novel end-to-end framework called a multi-level change contextual refinement net (MCCRNet) for the change detection of bi-temporal remote sensing images. Compared to other methods, MCCRNet is capable of capturing more intensive change information between bi-temporal images.
(2) We proposed a change contextual representation (CCR) module to take advantage of the changed region context. CCR is the first to utilize the relationship between change pixels and their context through multiple self-attention operations.
(3) We proposed an effective sample number adaptively weighted loss for solving the sample class-imbalance problem.

Materials and Methods
In this section, we describe the details of the proposed method. First, we present our network, the multi-level change contextual refinement net (MCCRNet), in Section 2.1. Then, we introduce the experimental datasets in Section 2.2 and propose the effective sample number adaptively weighted loss in Section 2.3. Finally, we describe the experimental implementation details in Section 2.4.

Methods
In this subsection, the multi-level change contextual refinement net (MCCRNet) pipeline is presented, then the designed atrous spatial pyramid cross attention and change contextual representation (CCR) modules are described in detail.

Network Overview
Like most binary change detection methods, the network takes as input two registered bi-temporal images, expressed as $I_1$ and $I_2$, each with a size of C × H × W, where C is the number of channels. It produces a change map whose width and height are the same as those of the input images, while the channel number becomes 1. For each pixel of the change map, 1 usually means changed and 0 means unchanged.
The overall architecture of the multi-level change contextual refinement net (MCCRNet) is shown in Figure 2. It consists of an encoder (Section 2.1.2), a decoder (Section 2.1.3), and a final change contextual representation module (Section 2.1.4). Conv Blocks 1-4 indicate the convolution blocks of the VGG16 backbone, except the last one, and ConvTransposed Block indicates the usual deconvolution, batchnorm, and dropout operations. The multi-level feature pairs extracted by the encoder were separately forwarded to the ASPCA module from the top layer to the bottom layer; the dual features updated by ASPCA were then concatenated with the upsampled features from the upper layer, which served as the input of the next layer. For the first layer specifically, we forwarded the absolute difference of the feature pairs from Conv Block 4 as extra change representation information. By mapping the features four times in the decoder, we could get feature maps twice the size of the original feature pairs, which were finally forwarded into the CCR module to predict the results. The model was optimized in the training phase by minimizing the loss between the output and the ground truth.

Feature Extractor
With the rapid development of CNNs, more and more feature extraction networks have shown strong feature extraction ability. Many of them can be applied to existing computer vision tasks such as object detection [41], land-cover classification [42][43][44], and image matching [45][46][47]. For change detection tasks, which may be regarded as pixel-level classification problems, a fully convolutional layer rather than a fully connected layer can be used for prediction [48]. Considering the speed and GPU memory capacity, we chose VGG16 [32] as our feature extractor backbone.
As shown in Figure 2, we had two main aims: (1) avoiding the loss of image details and an excessive number of upsampling stages in the decoder and (2) reducing the computational complexity of the model as far as possible while extracting strong representations. The first four shared blocks of VGG16 were used to extract multi-level features of the bi-temporal images. The channels of the extracted features are 64, 128, 256, and 512, respectively, while the scales are 1/2, 1/4, 1/8, and 1/16 of the original image pairs.
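The channel counts and scales listed above can be turned into concrete feature shapes. The following is a minimal sketch, not the authors' code: the helper name `feature_shapes` and the requirement that the input size be divisible by 16 are assumptions made only for illustration.

```python
# Channel counts and downsampling factors of the four shared VGG16 blocks,
# exactly as stated in the text (64/128/256/512 channels at 1/2 ... 1/16 scale).
CHANNELS = (64, 128, 256, 512)
DOWNSAMPLING = (2, 4, 8, 16)


def feature_shapes(height, width):
    """Return the (C, H, W) shape of each feature level for one input image.

    Hypothetical helper: it assumes the input size is divisible by 16 so that
    every level has an integer spatial size.
    """
    if height % 16 != 0 or width % 16 != 0:
        raise ValueError("height and width must be divisible by 16")
    return [(c, height // d, width // d) for c, d in zip(CHANNELS, DOWNSAMPLING)]


# Shapes for a 256 x 256 image; both bi-temporal branches share these shapes.
shapes = feature_shapes(256, 256)
for level, shape in enumerate(shapes, start=1):
    print(f"Conv Block {level}: {shape}")
```

Because the backbone is shared, each level yields a pair of features with identical shapes, which is what allows the paired attention modules to operate level by level.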

Decoder
After the feature sets of the bi-temporal images were obtained, they were not fused directly as in current methods but were forwarded into an atrous spatial pyramid cross-attention (ASPCA) module to strengthen the representation ability between feature pairs at the same level. As mentioned above, the ASPCA module was also realized based on self-attention theory, as shown in Figure 3. The subgraphs in the third column represent the dual spatial features and dual channel features refined by ASPCPA and ASPCCA, respectively. Unlike most attention-based works that forward a concatenated single feature into the attention module and output a single weighted feature, our module utilized dual features without fusing or adding operations. The ASPCA module comprises an atrous spatial pyramid cross-position attention (ASPCPA) module and an atrous spatial pyramid cross-channel attention (ASPCCA) module, both of which accept dual features of the same size from the same level of the feature extraction layers. The former captured the long-range spatial-temporal interdependencies, while the latter captured the long-range channel-temporal interdependencies.
Although there are many operations to fuse spatial attention features and channel attention features, such as concatenating or cascading in parallel, the experiments indicated that they were not suitable for this task, so a summing operation was employed. It is worth noting that the updated image feature of the earlier time (expressed as $f^{(1)}$) and the updated image feature of the later time (expressed as $f^{(2)}$) each equaled the element-wise summation of a spatial attention feature and a channel attention feature in the form of a crossover. Here, $f^{(1)}$ and $f^{(2)}$ denote the feature pairs updated by the ASPCA module for a certain layer; $f_s^{(1)}$ and $f_s^{(2)}$ denote the output bi-temporal features from the ASPCPA module, and $f_c^{(1)}$ and $f_c^{(2)}$ denote the output bi-temporal features from the ASPCCA module.
The structure of the ASPCPA module is shown in Figure 4. The green boxes represent the atrous convolutions with rates of 1, 6, 12, and 18, respectively, and 1 × 1 Conv represents a convolution with kernel size 1 × 1 followed by BatchNorm and ReLU. Following the idea of atrous spatial pyramid pooling (ASPP) in [49], the dual features were forwarded into an atrous spatial pyramid module containing four atrous convolution operations with rates of 1, 6, 12, and 18, respectively, and the output features were then concatenated in the channel dimension. To increase the nonlinear capability of the model, a convolution with kernel size 1 × 1 was performed; we kept the number of channels the same in our work. Given the dual features fused by the atrous spatial pyramid block (denoted as $f_{asp}^{(1)}, f_{asp}^{(2)} \in \mathbb{R}^{C \times H \times W}$, where $C$ denotes the channel number and $H \times W$ indicates the spatial size), two parallel 1 × 1 convolutions were applied to $f_{asp}^{(1)}$ and $f_{asp}^{(2)}$, respectively, which produced the query $Q \in \mathbb{R}^{C' \times H \times W}$ and the key $K \in \mathbb{R}^{C' \times H \times W}$, where $C'$ is the channel number. Generally, $C'$ is reduced to 1/4 or 1/8 of $C$ to save memory, but here we kept them the same. Meanwhile, we fed $f_{asp}^{(1)}$ and $f_{asp}^{(2)}$ into another two convolution layers to generate the corresponding value features $V_1, V_2 \in \mathbb{R}^{C \times H \times W}$, which were reshaped into $C \times N$, where $N = H \times W$. Simultaneously, we also reshaped $Q$ and $K$ into $C' \times N$. To capture the spatial contextual relationships of the feature pairs, we calculated attention maps in the forward and backward directions. For the forward direction ("T1 to T2"), $Q$ is permuted to $N \times C'$, while $K$ keeps its original size; thus, we constructed the forward energy matrix $\Lambda \in \mathbb{R}^{N \times N}$, formulated as $\Lambda = Q^{T} K$, where the element at $(i, j)$ of $\Lambda$ is the sum product of the $i$th row of $Q^{T}$ and the $j$th column of $K$ and measures the similarity between the $i$th position in $f^{(1)}$ and the $j$th position in $f^{(2)}$.
$\Lambda$ was then normalized by a Softmax operation and multiplied with $V_1$, as mentioned above, where the normalization is calculated as follows:

$$\mathrm{Softmax}(\Lambda)_{(i,j)} = \frac{\exp\left(\Lambda_{(i,j)}\right)}{\sum_{k=1}^{N_2} \exp\left(\Lambda_{(i,k)}\right)}$$

where $\Lambda_{(i,j)}$ indicates the element at position $(i, j)$, and $N_2$ indicates the number of columns of $\Lambda$. Similarly, another energy matrix for the backward direction ("T2 to T1") is formulated as $\Omega = K^{T} Q$, which is the matrix multiplication of the transposed $K$ and $Q$. We also applied Softmax to $\Omega$ as follows:

$$\mathrm{Softmax}(\Omega)_{(j,i)} = \frac{\exp\left(\Omega_{(j,i)}\right)}{\sum_{k=1}^{N_1} \exp\left(\Omega_{(k,i)}\right)}$$

Slightly different from the above, the backward direction attention map was used to measure the similarity between the $j$th position in $f^{(2)}$ and the $i$th position in $f^{(1)}$; $N_1$ is the number of rows of the $\Omega$ matrix. The bidirectional similarity measurements contributed a more comprehensive spatial change context between the dual features. Finally, we obtained the updated spatial attention map of $f^{(1)}$, named $f_s^{(1)}$, by adding $f_{asp}^{(1)}$ to the weighted $V_1$, in which $\partial$ is a model parameter with an initial value of 1 that leverages the dissimilarity importance of $f^{(1)}$ compared to $f^{(2)}$. Conversely, an augmented spatial attention map of $f^{(2)}$, named $f_s^{(2)}$, was generated by adding $f_{asp}^{(2)}$ to the weighted $V_2$; the model parameter $\beta$ was also initialized to 1 and leverages the dissimilarity importance of $f^{(2)}$ compared to $f^{(1)}$.
In short, the long-range spatial interdependencies from the two directions both strengthen the change representation ability of the bi-temporal features.
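The bidirectional spatial attention described above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. Several points are assumptions made only for illustration: the 1 × 1 convolutions are skipped (the caller passes Q, K, V1, and V2 directly as C × N nested lists), the weighted value is taken to be the product of the value matrix and the normalized attention map, and the scalar parameters (named `alpha` and `beta` here) multiply that weighted value before the residual addition.

```python
import math


def transpose(m):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def softmax_rows(m):
    """Softmax over the columns of each row, so every row sums to 1."""
    out = []
    for row in m:
        peak = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def cross_position_attention(q, k, v1, v2, f1, f2, alpha=1.0, beta=1.0):
    """Sketch of the two-direction spatial attention (q, k: C' x N; others: C x N)."""
    energy_fwd = matmul(transpose(q), k)  # Lambda = Q^T K, shape N x N
    energy_bwd = matmul(transpose(k), q)  # Omega = K^T Q, shape N x N
    att_fwd = softmax_rows(energy_fwd)    # normalized over the columns of Lambda
    # Normalized over the rows of Omega, so every column sums to 1.
    att_bwd = transpose(softmax_rows(transpose(energy_bwd)))
    weighted_v1 = matmul(v1, att_fwd)     # assumed orientation of the weighting
    weighted_v2 = matmul(v2, att_bwd)
    # Residual update: input feature plus the scaled, attention-weighted value.
    s1 = [[a + alpha * b for a, b in zip(ra, rb)] for ra, rb in zip(f1, weighted_v1)]
    s2 = [[a + beta * b for a, b in zip(ra, rb)] for ra, rb in zip(f2, weighted_v2)]
    return s1, s2, att_fwd, att_bwd


# Toy example with C = C' = 2 channels and N = 3 positions.
q = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
k = [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
s1, s2, att_fwd, att_bwd = cross_position_attention(q, k, q, k, q, k)
```

In this sketch both outputs keep the C × N shape of their inputs, which matches the statement that the module keeps the dual-branch features separate instead of fusing them.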
The ASPCCA module was designed to capture long-range channel interdependencies between $f^{(1)}$ and $f^{(2)}$. As shown in Figure 5, the atrous spatial pyramid and bidirectional structures are identical to those of ASPCPA, except that ASPCPA has to produce $K$, $Q$, $V_1$, and $V_2$ by four 1 × 1 convolution layers before calculating the attention maps. Here, we performed matrix multiplication of the original concatenated features directly. Given the dual multi-scale features $f_{asp}^{(1)} \in \mathbb{R}^{C \times H \times W}$ and $f_{asp}^{(2)} \in \mathbb{R}^{C \times H \times W}$ mapped by four atrous convolutions at rates of 1, 6, 12, and 18, respectively, we reshaped both of them into $C \times N$, where $N = H \times W$. For the forward direction (T1 to T2), $f_{asp}^{(1)}$ was multiplied by the transposed $f_{asp}^{(2)}$ to generate the forward energy map $\Phi = f_{asp}^{(1)} \left(f_{asp}^{(2)}\right)^{T} \in \mathbb{R}^{C \times C}$. $\Phi$ was also normalized by a Softmax operation to get an attention map, $\mathrm{Softmax}(\Phi)$. Finally, the augmented channel attention map of $f^{(1)}$, named $f_c^{(1)}$, was calculated from this attention map, where $\delta$ is a model parameter like $\partial$; here, $f_c^{(1)}$ models the channel context from $f^{(1)}$ to $f^{(2)}$. In the same way, a backward energy map, $f_{asp}^{(2)} \left(f_{asp}^{(1)}\right)^{T} \in \mathbb{R}^{C \times C}$, was generated and normalized by a Softmax operation. Thereby, the augmented channel attention map of $f^{(2)}$, named $f_c^{(2)}$, was obtained, where $\rho$ is a model parameter with the same initial value as $\delta$; $f_c^{(2)}$ models the channel context from $f^{(2)}$ to $f^{(1)}$.
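As a rough size check, the spatial energy matrices $\Lambda$ and $\Omega$ have $N \times N$ entries, while the channel energy maps have $C \times C$ entries. The figures below are a sketch under two assumptions taken from elsewhere in the text: inputs are 256 × 256 patches, and the feature levels have the channels and scales listed in the Feature Extractor subsection. Only the first and last levels are shown.

```latex
\begin{aligned}
\text{Level 1:}\quad & C = 64,  & N &= 128 \times 128 = 16384, & N^2 &= 268{,}435{,}456, & C^2 &= 4096 \\
\text{Level 4:}\quad & C = 512, & N &= 16 \times 16 = 256,     & N^2 &= 65{,}536,        & C^2 &= 262{,}144
\end{aligned}
```

Under these assumptions, the spatial attention maps dominate the memory cost at the shallow levels, whereas the channel maps are the larger ones at the deepest level.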
In both the ASPCPA module and the ASPCCA module, a norm layer comprising a 1 × 1 convolution, BatchNorm, and a ReLU activation function was separately applied to $f_s^{(1)}$, $f_s^{(2)}$, $f_c^{(1)}$, and $f_c^{(2)}$, which ensured that the channel number of the input remains unchanged after being updated by ASPCA. As shown in Figure 2, the decoder gradually restored the resolution of the change feature map by forwarding the concatenated $f^{(1)}$, $f^{(2)}$, and $\mathrm{abs}(f^{(1)} - f^{(2)})$ (where abs denotes the absolute value operation) into the ConvTransposed upsampling block, from the high layers to the bottom layers. In each ConvTransposed block, the first two ConvTransposed2d layers are used for dimension reduction, while the third one doubles the spatial size. The specific parameters and feature sizes are shown in Table 1. The multi-scale features upsampled by the four ConvTransposed blocks contain abundant change discriminatory information at different levels: the high layers contain rich abstract semantic information, while the low layers represent detailed texture information.

Change Contextual Representational Module
The multi-level features updated by the ASPCA module in Section 2.1.3 only capture the long-range interdependencies between pixels of feature pairs. This subsection proposes a change contextual representational (CCR) module, which captures pixel-change region relation and change region contextual representations by exploiting representations of change regions that the pixels belong to. In Figure 6, ×8, ×4, and ×2 represent bilinear interpolation to 8, 4, and 2 times the original size, respectively, and Conv2d, in the green box, represents a general 1 × 1 convolution while Conv in the green box represents 1 × 1 conv → BatchNorm → ReLU . The features outputted from the four ConvTransposed blocks were resized to be the same as the output from ConvTransposed Block 1, then the fused feature was forwarded into a linear activation layer called Conv to get the pixel representations. Meanwhile, an auxiliary output from a fully convolution layer contributed to the coarse change detection result, which was supervised by an auxiliary loss. Given pixel representations P ∈ C × H × W and coarse change regions O ∈ 2 × H × W, where 2 indicates the change class and unchanged class, both P and O were separately reshaped to P ∈ C × N (to reduce calculation complexity, C means 512 in this work) and O ∈ 2 × N. Then the change region representations f c ∈ C × 2 could be obtained by a matrix multiplication between regularized O and P formulated as follows: Similar to the attention map in ASPCA, the pixel-change region relation att f was cal- where  and  are also transform functions like  and  , but it is worth noting that  ,  , and  are all dimension reduction transformation (512 to 256 in this work), while  denotes conv 1 1
Similar to the attention map in ASPCA, the pixel-change region relation $f_{att} \in \mathbb{R}^{2 \times N}$ was calculated using two transform functions σ and φ, both implemented as 1 × 1 conv → BatchNorm → ReLU. The context $f_{context}$ was then obtained from the matrix multiplication of the change region representations $f_c$ and the attention map $f_{att}$, using two further transform functions δ and ρ that are similar to σ and φ. It is worth noting that σ, φ, and δ are all dimension-reduction transformations (512 to 256 in this work), while ρ denotes a 1 × 1 conv from 256 channels to 512. To reuse the pixel representations, $f_{context}$ was first reshaped to $f_{context} \in \mathbb{R}^{C \times H \times W}$ and then concatenated with P to generate the change region contextual representations, which were updated by a Conv layer to restore the original channel dimension. Finally, a pixel-level convolution predicted the change intensity map.

The attributes of the datasets are shown in Table 2. The LEVIR-CD dataset contains 637 very-high-resolution Google Earth (GE) image patch pairs with a size of 1024 × 1024 pixels, covering 20 different regions in Texas, USA. Most image pairs, which capture building changes, involve man-made structures and date from 2002 to 2018. The Season-Varying Change Detection Dataset (CCD) also originated from Google Earth, but its spatial resolution ranges from 3 to 100 cm/pixel. Unlike LEVIR-CD, this dataset focuses on changes corresponding to the appearance and disappearance of objects and ignores changes due to seasonal differences, brightness, and other factors. Google Data GZ is a large-scale VHR change detection satellite image dataset acquired from 2006 to 2019, covering suburban areas of Guangzhou City, China. It contains 19 season-varying VHR image pairs with three bands, which mainly focus on building changes. The last dataset, DSIFN, was collected from Google Earth and covers many Chinese cities such as Beijing, Chengdu, Shenzhen, Chongqing, and Wuhan. In the training stage, all datasets were cropped into 256 × 256 patches.
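The cropping into 256 × 256 patches can be sketched as follows. This is a minimal illustration on nested lists rather than the authors' preprocessing code: the helper names are hypothetical, the patches are assumed to be non-overlapping, and dropping partial border tiles is an assumption.

```python
def patch_origins(height, width, patch=256):
    """Top-left corners of non-overlapping patch x patch tiles.

    Assumption: tiles that would extend past the image border are dropped.
    """
    return [(top, left)
            for top in range(0, height - patch + 1, patch)
            for left in range(0, width - patch + 1, patch)]

def crop(image, top, left, patch=256):
    """image is a list of rows; returns one patch as a list of row slices."""
    return [row[left:left + patch] for row in image[top:top + patch]]

# A 1024 x 1024 LEVIR-CD image yields a 4 x 4 grid of patches.
origins = patch_origins(1024, 1024)
```

For a bi-temporal pair, the same origins would be applied to both images and to the label so that the three crops stay aligned.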

Loss Design
To address the class-imbalance problem in change detection, we proposed an effective sample number adaptively weighted loss. The idea is to associate each sample with a small neighboring region instead of a single point. Following [37], a newly sampled pixel falls either inside a previously sampled changed region with probability p or outside it, in an unchanged region, with probability 1 − p. The effective number of samples is therefore the expected volume of samples, and the loss was designed to capture the hidden marginal benefit of using more data points of a class. Following the mathematical formulation of the effective number, the effective sample number adaptively weighted loss (EAWLoss) reweights the per-pixel loss by class, where $n_1$ and $n_0$ indicate the numbers of changed and unchanged pixels in the ground truth, respectively, and $L(y, \hat{y})$ represents the standard cross-entropy loss or focal loss [51] between the label and the predicted result. β controls the proportion of the effective sample number: β = 0 means no reweighting and β → 1 means reweighting by inverse class frequency. In this work, it was set to 0.5, as the change detection task contains only two classes. To keep the supervision consistent, we also used EAWLoss as the auxiliary supervision loss. The total loss was therefore expressed as follows:

$$L_{sum}(y, \hat{y}_{out}, \hat{y}_{aux}) = L_{EAW}(y, \hat{y}_{out}) + \lambda L_{EAW}(y, \hat{y}_{aux}) \quad (18)$$

where $y$, $\hat{y}_{out}$, and $\hat{y}_{aux}$ are the ground truth, the final prediction, and the auxiliary prediction, respectively, and λ is the weight factor, which was set to 0.4.
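The reweighting idea can be sketched in a few lines. This sketch assumes the usual effective-number form $E_n = (1 - \beta^n)/(1 - \beta)$ and weights each class by $1/E_n$; the exact EAWLoss definition, including how it adapts during training, is not reproduced here, and all function names are hypothetical. Only the total-loss combination with λ = 0.4 follows Equation (18) directly.

```python
import math

def effective_number(n, beta):
    """Effective number of samples E_n = (1 - beta**n) / (1 - beta), for 0 <= beta < 1."""
    return (1.0 - beta ** n) / (1.0 - beta)

def class_weights(n0, n1, beta=0.5):
    """Weights inversely proportional to the effective number of each class (assumed form)."""
    return 1.0 / effective_number(n0, beta), 1.0 / effective_number(n1, beta)

def weighted_cross_entropy(labels, probs, beta=0.5, eps=1e-12):
    """labels: 0/1 per pixel; probs: predicted probability of 'changed' per pixel."""
    n1 = sum(labels)
    n0 = len(labels) - n1
    # Guard against an empty class so the effective number is never zero.
    w0, w1 = class_weights(max(n0, 1), max(n1, 1), beta)
    total = 0.0
    for y, p in zip(labels, probs):
        p_true = p if y == 1 else 1.0 - p
        total += (w1 if y == 1 else w0) * -math.log(max(p_true, eps))
    return total / len(labels)

def total_loss(labels, probs_out, probs_aux, lam=0.4, beta=0.5):
    """Main loss plus lam times the auxiliary loss, as in Equation (18)."""
    return (weighted_cross_entropy(labels, probs_out, beta)
            + lam * weighted_cross_entropy(labels, probs_aux, beta))
```

Under this assumed form, β = 0 gives both classes a weight of 1, and β close to 1 makes the weight ratio approach the inverse class-frequency ratio, which matches the two limits described in the text.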

Implementation Details
To verify the effectiveness of the proposed method, five evaluation metrics, defined in Section 2.4.1, were used to quantify performance. The training details of the experiments and the model configuration are given in Section 2.4.2.

Evaluation Metrics
In this work, we utilized overall accuracy (OA), mean intersection over union (mIoU), precision, recall, and F1-score (F1) as performance metrics, the definitions of which are as follows:

$$OA = \frac{TP + TN}{TP + TN + FP + FN}$$

$$mIoU = \frac{1}{2}\left(\frac{TP}{TP + FP + FN} + \frac{TN}{TN + FN + FP}\right)$$

$$precision = \frac{TP}{TP + FP}$$

$$recall = \frac{TP}{TP + FN}$$

$$F1 = \frac{2 \times precision \times recall}{precision + recall}$$

where true positive (TP) indicates the number of pixels predicted correctly as changed; true negative (TN) represents the number of pixels predicted correctly as unchanged; false positive (FP) denotes the number of pixels predicted incorrectly as changed; and false negative (FN) means the number of pixels predicted incorrectly as unchanged. Generally, high precision alone or high recall alone is only suitable for specific applications. F1 combines the two measurements as their harmonic mean and therefore serves as a balanced benchmark.
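The five metrics can be computed from the four pixel counts as in the sketch below. It uses the standard binary definitions (with mIoU taken as the mean of the changed-class and unchanged-class IoU); returning 0.0 for an empty denominator is an assumption made here for convenience, and the function names are illustrative.

```python
def confusion_counts(pred, truth):
    """Count TP, TN, FP, FN for flat 0/1 sequences (1 = changed)."""
    tp = tn = fp = fn = 0
    for p, t in zip(pred, truth):
        if p == 1 and t == 1:
            tp += 1
        elif p == 0 and t == 0:
            tn += 1
        elif p == 1 and t == 0:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn

def safe_div(num, den):
    """Assumption: a metric with an empty denominator is reported as 0.0."""
    return num / den if den else 0.0

def change_metrics(tp, tn, fp, fn):
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    iou_changed = safe_div(tp, tp + fp + fn)
    iou_unchanged = safe_div(tn, tn + fp + fn)
    return {
        "OA": safe_div(tp + tn, tp + tn + fp + fn),
        "mIoU": (iou_changed + iou_unchanged) / 2.0,
        "precision": precision,
        "recall": recall,
        "F1": safe_div(2.0 * precision * recall, precision + recall),
    }

scores = change_metrics(tp=8, tn=80, fp=2, fn=10)
```

In this toy case OA is 0.88 even though more than half of the changed pixels are missed, which illustrates why F1 and mIoU are reported alongside OA on imbalanced change detection data.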

Experiment Details
Our work was implemented in PyTorch on two Tesla GPUs with 12 GB memory. In the training phase, we cropped the image pairs of the above datasets into 256 × 256 non-overlapping patches before forwarding them to the model. The VGG16 backbone of the model was initialized with ImageNet-pretrained [52] weights, and the initial learning rate was 0.0001. We chose cosine annealing as the learning rate decay mode: the learning rate decreased slowly during the first 50 epochs, then increased over the next 50 epochs. The Adam solver [53] was used as the model optimizer with β1 = 0.5 and β2 = 0.99. Random crop, random flip, and random rotation from −30° to 30° were utilized to improve the generalization of the model.
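The described schedule (a decrease over 50 epochs followed by an increase over the next 50) is what the closed-form cosine annealing rule produces when it runs past its half-period. The sketch below assumes a half-period of 50 epochs and a minimum learning rate of 0; neither value is stated explicitly in the text, and the function name is illustrative.

```python
import math

def cosine_annealing_lr(epoch, base_lr=1e-4, t_max=50, eta_min=0.0):
    """Closed-form cosine annealing: base_lr at epoch 0, eta_min at epoch t_max.

    Assumptions: t_max = 50 and eta_min = 0.0 are inferred, not given in the text.
    """
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * epoch / t_max))

schedule = [cosine_annealing_lr(e) for e in range(101)]
```

In PyTorch, `torch.optim.lr_scheduler.CosineAnnealingLR` with `T_max=50` provides a schedule of this form.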

Results
In this section, we present quantitative comparisons and visualization results of our method. The ablation experiment results are described in Section 3.1. The evaluation metric comparison with other related methods is given in Section 3.2.

Ablation Study
To assess the effectiveness of the proposed ASPCA and CCR modules, we experimented with different module combinations and compared them to the baseline on the CCD dataset. Specifically, the baseline was built without any attention module, using only a basic encoder-decoder structure with a VGG16 backbone and four ConvTransposed blocks. In addition, the effective sample number adaptively weighted loss was compared with the usual cross-entropy loss. All the experiments show that the ASPCA module, the CCR module, and the loss each improved the performance. The complete ablation results are shown in Table 3. Compared to the baseline, the model with only the ASPCA module improved the F1-score by 0.49 points and the mean intersection over union by 1.04 points, while the CCR module improved the F1-score by 1.95 points and the mean intersection over union by 2.16 points. The combination of both proposed modules achieved the best results, as seen in the last row of Table 3: it improved the F1-score by 3.19 points and the mean intersection over union by 2.74 points. The visual comparison results of the ablation experiment are shown in Figure 7, wherein black indicates unchanged pixels predicted correctly and white indicates changed pixels predicted correctly. Red indicates unchanged pixels wrongly predicted as changed, and green indicates changed pixels that were missed. The ASPCA module slightly improved the ability to capture the interdependencies of pixels, thus mitigating the holes in large areas. The CCR module strongly corrected mispredicted pixels compared to the baseline and tended to continuously refine regions with specific shapes or textures. The model with both modules had the best performance and effectively solved the problem of ignored change pixels.

Our designed loss is also crucial for the experimental performance. Table 4 gives the ablation study of our proposed effective sample number adaptively weighted loss (EAWLoss) on the CCD dataset. For each item, the performance with EAWLoss was better than with the cross-entropy loss. The visual ablation result is shown in Figure 8. To further verify the robustness of our method, we listed the results for different scenarios. The first four columns in Figure 8 are based on large buildings or roads, and the latter four are based on small vehicles. It can be seen that our designed EAWLoss effectively alleviated the class-imbalance problem.

To measure the computational efficiency of the proposed model, comparisons of GFLOPs and parameter size are given in Table 5. As can be seen, the proposed modules improved the performance while only modestly increasing the computational complexity.
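The four-color coding used in the visual comparisons can be reproduced from a prediction mask and a ground-truth mask, as sketched below. It assumes that red marks false positives and green marks false negatives (missed changes), which is how the figure descriptions read; the function names and RGB values are illustrative.

```python
BLACK, WHITE = (0, 0, 0), (255, 255, 255)
RED, GREEN = (255, 0, 0), (0, 255, 0)

def error_color(pred, truth):
    """Map one pixel (1 = changed, 0 = unchanged) to its display color."""
    if pred == truth:
        return WHITE if truth == 1 else BLACK   # correct changed / correct unchanged
    return RED if pred == 1 else GREEN          # false positive / missed change

def error_map(pred_mask, truth_mask):
    """Build an RGB error map from two equally sized 2-D 0/1 masks."""
    return [[error_color(p, t) for p, t in zip(pred_row, truth_row)]
            for pred_row, truth_row in zip(pred_mask, truth_mask)]

pred = [[1, 1], [0, 0]]
truth = [[1, 0], [0, 1]]
colors = error_map(pred, truth)
```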

Comparisons with Other Methods
We experimented on the four datasets described in Section 2.2 to compare our method with recent learning-based change detection methods:
• FC-EF [6]: An image-level fusing method based on FCN, which concatenates the bi-temporal images as the model input and transfers the feature information through skip-connections from the encoder to the decoder.
• FC-Siam-conc [6]: A single-level feature fusing method based on FCN, which employs a Siamese encoder-decoder structure for the bi-temporal input images. In the decoder, the upsampled feature is concatenated with the dual features extracted by the encoder to gradually restore the resolution of the change map.
• FC-Siam-diff [6]: A single-level fusing method based on FCN, which uses a Siamese structure for the bi-temporal input. The only difference from FC-Siam-conc is that the skip-connection carries the absolute difference of the two feature maps rather than their concatenation.
• U-Net++ [7]: An image-level fusing method based on U-Net++ [44], which applies deep supervision through the fusion of multiple side outputs computed from the concatenated bi-temporal images.
• DASNet [27]: A dual-branch, metric-based method built on spatial attention and channel attention mechanisms, which penalizes the L2 distance between feature pairs updated by the dual attention module, thus making changed pairs and unchanged pairs easier to discriminate.
• STANet [9]: A single-level feature fusing method based on a distance metric, which employs a spatial-temporal attention module to capture the temporal-spatial dependency between the bi-temporal images.
• SNUNet-CD [54]: A feature-level, densely connected Siamese method based on U-Net++, which mitigates the loss of localization information in the deep layers by transmitting compact information from the encoder to the decoder. Moreover, an ensemble channel attention module is proposed to aggregate and refine features of multiple semantic levels.
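The difference between the two Siamese skip-connection styles in the list above (FC-Siam-conc and FC-Siam-diff) can be illustrated on toy feature maps. This is only a schematic of the fusion step on nested lists, not code from either method; the function names are hypothetical.

```python
def fuse_concat(f1, f2):
    """Channel-wise concatenation: two C x N feature maps become one 2C x N map."""
    return [row[:] for row in f1] + [row[:] for row in f2]

def fuse_abs_diff(f1, f2):
    """Element-wise absolute difference: two C x N feature maps become one C x N map."""
    return [[abs(a - b) for a, b in zip(r1, r2)] for r1, r2 in zip(f1, f2)]

# Toy bi-temporal features with C = 2 channels and N = 3 pixels.
feat_t1 = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]
feat_t2 = [[1.0, 0.0, 5.0], [0.0, 1.0, 2.0]]
concat = fuse_concat(feat_t1, feat_t2)
diff = fuse_abs_diff(feat_t1, feat_t2)
```

Concatenation keeps both time steps and doubles the channel count, whereas the absolute difference keeps the channel count and is zero wherever the two features agree.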
We ran the above methods with the original parameters described in the corresponding papers. Table 6 reports the quantitative comparison results on the LEVIR-CD dataset. In terms of F1 score, mIoU, and OA, our model outperformed the other learning-based methods. The visualization is shown in Figure 9; because the first three models are similar, only the visual map produced by FC-Siam-conc is shown.

The quantitative comparison results on the CCD dataset are shown in Table 7. Our model also outperformed the other methods in terms of precision, F1, mIoU, and OA; in particular, the precision reached quite a high level. The visual comparison is shown in Figure 10. Again, only the results of FC-Siam-conc are given for the fully convolutional networks (the first three items in Table 7). Whether the change scene involves small vehicles or broad roads, our method greatly reduces the ignored area (the green parts in Figure 10) and further corrects the mispredicted unchanged pixels (the red parts in Figure 10).

ISPRS Int. J. Geo-Inf. 2021, 10, x FOR PEER REVIEW 20 of 26

A comparison between different methods is shown in Table 8: our model achieved state-of-the-art results on all metrics, and its F1 and OA were much higher than those of the semi-supervised methods. The visual comparison results are shown in Figure 11. For most change regions of buildings, our model effectively mitigated the gaps between discontinuous blocks and refined the building edges.
We also give the quantitative comparison results on the DSIFN dataset in Table 9. Due to the high resolution and the complex environment, most of the methods could not achieve excellent performance, but MCCRNet achieved a remarkable result for recall. The visualization of the comparison results is shown in Figure 12. Our method greatly reduced the change regions ignored by the network (the green parts in Figure 12).

Discussion
Our study aimed to obtain the change map of bi-temporal remote sensing images from two perspectives: dual long-range interdependencies between feature pairs and change region contextual representation. In addition, we achieved a coarse-to-fine change detection network by employing multi-scale feature pairs rather than a single-level fused feature. The proposed ASPCA module adopted an atrous spatial pyramid pooling structure and a dual self-attention mechanism to capture bi-directional attention maps, effectively strengthening the distinguishable change information between feature pairs. The ablation study in Section 3.1 showed that the ASPCA module improved change detection performance. The change contextual representations (CCR) module utilized the relationship between pixels and their contextual region to correct misclassified pixels, especially changed pixels that were predicted as unchanged (false negatives). The CCR module is the first to introduce the class attention mechanism into the change detection task.

The Effectiveness of ASPCA
To verify that the ASPCA module strengthens the representation of feature pairs, we created visualization heatmaps of the features updated by the ASPCA module, as shown in Figure 13. Due to the low resolution of high-level features, only the feature pair from the last ASPCA (named att1_1 and att1_2) and the fused feature from ConvTransposed Block 1 (named transconv1) are given. The heatmaps show that ASPCA enhanced the distinguishable differences between the dual features by applying interactive attention weights. Unlike other attention-based change detection methods, ASPCA receives two inputs corresponding to the bi-temporal features and produces a dual-feature output, avoiding the defects of a single-feature representation. In particular, the bottom layers of the encoder (such as layer 1 and layer 2) tend to extract more detailed texture information, while the higher layers (such as layer 3 and layer 4) extract more abstract semantic information. The former therefore emphasize detailed change appearance information such as edges, shapes, and colors, while the latter emphasize global change semantic information such as categories and regions.

The Effectiveness of CCR
To verify the capacity for correcting misclassified pixels, the feature maps output from the CCR module were visualized. As shown in Figure 14, we produced an attention map of the pixel-change region relationships and a heatmap of the change region contextual representations. As can be seen, the region around the change pixels most likely belongs to the same change category, which indicates that the change region representations are weighted by the pixel-change region relationships, thus producing finer change region contextual representations. In this work, the channel dimension of the fused feature was reduced from 960 to 512, decreasing the computational cost while increasing the nonlinear capacity. In addition, we adopted auxiliary supervision on the last upsampled feature to increase the generalization ability in the training stage; it was removed in the testing stage.

Generalization
As the experiments show, the model generalized differently on datasets with different resolutions: VHR image pairs were harder to distinguish due to elaborate object details, while low-resolution samples yielded more gratifying results because the network's capacity exceeded the demands of the task. Specifically, our model obtained finer building outlines for the LEVIR dataset and had stronger generalizability in season-varying change scenes. For the CCD dataset, the model learned highly discriminative weights from season-varying image pairs, including objects of different scales (such as cars and buildings). As the DSIFN results show, performance was greatly reduced on image pairs with complex, multi-view conditions due to the prominent depth difference. In common, results on all datasets were sensitive to object edges, which is caused by intrinsic defects of cascaded transposed convolutions.

Conclusions
In this work, we proposed an end-to-end multi-level change contextual refinement network (MCCRNet) for remote sensing image change detection. MCCRNet puts forward a dual-input, dual-output attention module, ASPCA. The CCR module is the first to introduce a corrective ability for mispredicted pixels in a coarse-to-fine manner. In addition, we designed a novel loss, based on cost-sensitive learning, to address the class-imbalance problem. Compared to earlier methods, our loss adaptively adjusted the weights of the positive and negative samples' losses as the number of training iterations increased. As to the overall structure of the proposed network, multi-level dual features were fully fused in a cascade manner, which helps to discriminate change representations. The experimental results on four change detection datasets demonstrated the validity of our method. In particular, for low-resolution samples such as the CCD dataset and fine-grained very-high-resolution images such as the LEVIR dataset, our method achieved extremely high performance. On the other hand, our model generalized poorly in scenes with view changes, so image depth will be considered in future work to improve performance. Throughout this work, a large number of image labels were used for supervised learning, which makes the approach labor-intensive and time-consuming to apply to unlabeled datasets. In the next stage, we will focus on unsupervised learning to solve change detection tasks.

Data Availability Statement:
The data presented in this study are available from the author upon reasonable request.