MAEANet: Multiscale Attention and Edge-Aware Siamese Network for Building Change Detection in High-Resolution Remote Sensing Images

Abstract: In recent years, deep learning has proven highly efficient for large-area building change detection. However, current methods for pixel-wise building change detection still have some limitations, such as a lack of robustness to false-positive changes and confusion at the boundaries of dense buildings. To address these problems, a novel deep learning method called the multiscale attention and edge-aware Siamese network (MAEANet) is proposed. The principal idea is to integrate both multiscale discriminative and edge structure information to improve the quality of prediction results. To effectively extract multiscale discriminative features, we design a contour channel attention module (CCAM) that highlights the edges of changed regions and combine it with the classical convolutional block attention module (CBAM) to construct a multiscale attention (MA) module, which mainly contains channel, spatial and contour attention mechanisms. Meanwhile, to incorporate the structure information of buildings, we introduce an edge-aware (EA) module, which combines discriminative features with edge structure features to alleviate edge confusion in dense buildings. We conducted experiments on the LEVIR-CD and BCDD datasets. The proposed MA and EA modules improve the F1-Score of the basic architecture by 1.13% on LEVIR-CD and by 1.39% on BCDD with an acceptable computational overhead. The experimental results demonstrate that the proposed MAEANet is effective and outperforms other state-of-the-art methods in terms of both quantitative metrics and visual quality.


Introduction
Building change detection (BCD) is the procedure of extracting the dynamic changes in buildings in the same area based on remote sensing images acquired at different times. This process is of great importance to many fields, such as land resource utilization [1], urban expansion [2] and illegal construction management [3,4]. As more and more high-quality sensors are launched (e.g., IKONOS, QuickBird, GeoEye-1), remote sensing images have become increasingly accessible, which has extensively driven the development of intelligent change detection technology. Concurrently, high-resolution remote sensing images, with their rich color information, clear textures and regular geometric structures [5], have become the primary data source for BCD.
As a hot topic in remote sensing image interpretation, change detection (CD) has been studied for several decades. Traditional CD methods can be categorized as follows: (1) image algebra, (2) image transformation and (3) machine learning methods [6]. Image algebra-based methods directly apply the spectral information difference or change vector analysis (CVA) [7] of bi-temporal images to obtain a feature difference map containing the magnitude of change; the final predicted map is then obtained by thresholding this difference map. However, the false-positive changes caused by shadows and complex appearance are still hard to distinguish based on temporal information alone. Therefore, it is worthwhile to explore effective techniques that extract discriminative information and mitigate false-positive changes. Meanwhile, deep semantic information often lacks detailed boundary information, resulting in poor boundary integrity of the detected changed objects and boundary confusion in dense buildings. Thus, recovering the morphology of the changed objects is a vital challenge.
To address the issues of false-positive changes and boundary confusion in dense building change detection from bi-temporal images, a novel end-to-end multiscale attention and edge-aware network (MAEANet) is proposed in this article. Specifically, we design two modules to improve the change detection performance of the network. To enhance the discriminability of features, we introduce a multiscale attention (MA) module, which mainly consists of the convolutional block attention module (CBAM) [34] and a contour channel attention module (CCAM). Meanwhile, we introduce an edge-aware (EA) module in the last two layers to reduce the confusion between dense buildings.
The main contributions of this article are as follows: (1) A novel network called MAEANet is proposed for BCD. Both discriminative and prior edge information are combined into one framework to enhance the quality of binary BCD results.
(2) We improve the performance of BCD networks from two aspects. To enhance the discriminability of features, we propose an MA module, comprising the classical CBAM and a newly designed CCAM that highlights the edges of changed objects. CBAM is applied to deep features with richer semantic information, while CCAM is more concerned with recovering boundary information. To improve the quality of building boundaries, we introduce an EA module that better maintains building boundaries while generating binary change detection results, successfully overcoming the boundary confusion problem of buildings.
(3) The results of a series of experiments show that the proposed MAEANet outperforms other state-of-the-art methods in metrics such as F1-Score and Kappa coefficient on remote sensing image BCD datasets. Moreover, it requires no post-processing to obtain high-quality pixel-level binary prediction maps.
The remainder of this article is organized as follows: we describe the detailed network composition parts in Section 2. Then, the experimental results and their visualization are presented in Section 3. Next, building edge performance comparison and some parameterization setting experiments are discussed in Section 4. Finally, Section 5 presents a summary of the proposed method.

Methods
In this section, we describe the MAEANet network in detail. First, an overview of MAEANet is given. Then, the Siamese fused network (Siam-fusedNet), the MA module and the EA module are introduced. Finally, we explain the principle of the hybrid loss function used for BCD.

Overview of MAEANet Network
The proposed MAEANet network mainly consists of three parts: the base Siam-fusedNet architecture, the MA module and the EA module. The first step of MAEANet is to crop the bi-temporal images into 256 × 256 patches and feed them into Siam-fusedNet to extract the initial features. Then, the MA module extracts discriminative information from the multiscale features after upsampling. The final binary predicted map is generated by adding a multilevel EA module that effectively introduces the edge information. A more detailed description is shown in Figure 1.


Siam-fusedNet
We adopt the Siamese network as the basic reference to design the bi-temporal BCD network. In terms of feature extraction, since the binary BCD task can be seen as a classical semantic segmentation task, UNet [35] with an encoder-decoder structure is utilized to obtain abundant multiscale features. Meanwhile, to exploit as much helpful information as possible, a bi-temporal feature fusion mechanism (FFM) is added after UNet. The detailed workflow of Siam-fusedNet is as follows.
The bi-temporal images are fed into the network independently. First, the encoder of the network has two streams with shared weights, and pre-trained ResNet34 [36] is adopted as the backbone. Then, in the decoder, we use upsampling to recover the image resolution and adopt skip connections to combine low-level location information with deep semantic information, recovering the edge details lost during the encoding process.
To improve the learning of parallel feature differences between the two-branch decoders at the same scale layer, a bi-temporal FFM [37] is used. As shown in the bottom left part of Figure 1, it consists of two branches: the difference branch and the concatenation branch. Firstly, we take the parallel features F_t1 and F_t2 from the same scale layer as the input. The difference branch generates the difference feature F_d, while the concatenation branch adaptively acquires local features critical to change and generates the fusion feature F_c from F_t1 and F_t2. Finally, the features F_d and F_c learned from the two distinct branches are integrated and then processed through batch normalization (BN) and LeakyReLU operations to obtain the final fused feature F_f. The detailed calculations are as follows:

F_d = |F_t1 − F_t2|, F_c = Conv(Cat(F_t1, F_t2)), F_f = LeakyReLU(BN(F_d + F_c)) (1)
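The fusion workflow above can be sketched as follows. Since Equation (1) is reconstructed from the text description, the concrete operators here (absolute difference, a 1 × 1 convolution modelled as a channel-mixing matrix, and additive integration of the two branches) are assumptions, not the paper's exact formulation:

```python
import numpy as np

def feature_fusion(f_t1, f_t2, w_cat, eps=1e-5):
    """Sketch of the bi-temporal feature fusion mechanism (FFM).

    f_t1, f_t2 : (C, H, W) parallel decoder features from the same scale.
    w_cat      : (C, 2C) weights standing in for the 1x1 conv on the
                 concatenation branch (hypothetical parameter).
    """
    # Difference branch: absolute feature difference.
    f_d = np.abs(f_t1 - f_t2)
    # Concatenation branch: channel-wise concat followed by a 1x1 conv
    # (modelled here as a matrix product over the channel axis).
    f_cat = np.concatenate([f_t1, f_t2], axis=0)   # (2C, H, W)
    f_c = np.einsum('oc,chw->ohw', w_cat, f_cat)   # (C, H, W)
    # Integrate both branches, then BatchNorm + LeakyReLU.
    f = f_d + f_c
    f = (f - f.mean()) / np.sqrt(f.var() + eps)    # simplified BN
    return np.where(f > 0, f, 0.01 * f)            # LeakyReLU
```

In the real network the 1 × 1 convolution and BN carry learned parameters; this sketch only illustrates the data flow of the two branches.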

Multiscale Attention (MA) Module
One of the most critical issues in the feature extraction process is compressing irrelevant feature information while better highlighting changed feature information. Therefore, the MA module is designed in this paper. The MA module covers three kinds of attention mechanisms: channel, spatial and contour attention, and its main component models are CBAM and CCAM.

Channel Attention Mechanism
The channel attention mechanism focuses on "what" is beneficial in the original image [34]. Therefore, we compute a channel attention map to acquire the inter-channel relationships of features. For example, feature channels strongly associated with change detection will be emphasized, while weakly associated channels will be suppressed. Figure 2 shows the detailed structure of the channel attention mechanism, and the channel attention map (M_channel) is calculated as follows:

M_channel(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F)))

where F denotes the original feature of size C × H × W and σ denotes the sigmoid function. Firstly, we calculate the max pooling and average pooling of the input feature separately to generate two C × 1 × 1 aggregation vectors. Then, the two aggregation vectors are fed into a weight-shared multilayer perceptron (MLP), which first reduces the number of channels by a reduction ratio r and then recovers them. Finally, an element-wise sum of the two vectors is taken and a nonlinear operation is applied to obtain the final channel attention map.
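The channel attention computation can be sketched in NumPy as follows; the MLP weights `w1`/`w2` are illustrative stand-ins for the learned reduction and recovery layers:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w1, w2):
    """CBAM-style channel attention map M_channel.

    feat : (C, H, W) input feature.
    w1   : (C//r, C) weights of the channel-reduction layer.
    w2   : (C, C//r) weights of the channel-recovery layer.
    The same MLP (w1, w2) is shared by both pooled vectors.
    """
    avg = feat.mean(axis=(1, 2))            # (C,) average-pooled vector
    mx = feat.max(axis=(1, 2))              # (C,) max-pooled vector
    def mlp(v):
        return w2 @ np.maximum(w1 @ v, 0)   # reduce -> ReLU -> recover
    return sigmoid(mlp(avg) + mlp(mx))      # (C,) attention map
```

The resulting (C,) map is broadcast over the spatial dimensions and multiplied with the feature to re-weight its channels.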



Spatial Attention Mechanism
The spatial attention mechanism concentrates on "where" is informative in the input image. Therefore, a spatial attention map is computed to capture the location relationships of features. For example, locations consistent with changed pixels will be given higher weights, while inconsistent locations will be given lower weights. The spatial attention mechanism is shown in Figure 3, and the spatial attention map (M_spatial) is calculated as follows:

M_spatial(F) = σ(f^(7×7)([AvgPool(F); MaxPool(F)]))

where F denotes the original feature of size C × H × W, f^(7×7) denotes a convolution with a 7 × 7 kernel and σ denotes the sigmoid function. Firstly, we calculate the max pooling and average pooling of the input feature along the channel axis to generate two 1 × H × W aggregation maps. Then, the two maps are concatenated and the 7 × 7 convolution is applied to generate a highlighted feature of size 1 × H × W. Finally, a nonlinear operation is used to obtain the spatial attention map.
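A minimal NumPy sketch of this step is given below; the 7 × 7 kernel is a hypothetical stand-in for the learned convolution weights, and the convolution is written naively for clarity:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(feat, kernel):
    """CBAM-style spatial attention map M_spatial.

    feat   : (C, H, W) input feature.
    kernel : (2, 7, 7) weights of the 7x7 conv applied to the stacked
             channel-wise average- and max-pooled maps.
    Returns an (H, W) attention map.
    """
    avg = feat.mean(axis=0)            # (H, W) channel-wise average pool
    mx = feat.max(axis=0)              # (H, W) channel-wise max pool
    stacked = np.stack([avg, mx])      # (2, H, W)
    padded = np.pad(stacked, ((0, 0), (3, 3), (3, 3)))  # 'same' padding
    h, w = avg.shape
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[:, i:i + 7, j:j + 7] * kernel)
    return sigmoid(out)
```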


Contour Attention Mechanism
The contour attention mechanism is more concerned with the internal consistency of the changed objects. Therefore, a contour attention map that introduces super-pixel objects is calculated to ensure that the pixels within an object share only one category, either changed or unchanged. Internal pixel consistency will be enhanced for changed objects and weakened for unchanged objects. Compared with the original pixel-wise representation, super-pixel objects are superior in presenting the local structural information of the image.
In comparison with the various super-pixel generation methods proposed so far, Simple Linear Iterative Clustering (SLIC) [38] is an effective algorithm with competitive performance in terms of metrics such as under-segmentation error, boundary recall and explained variance [39]. Meanwhile, it also has outstanding advantages in boundary preservation. SLIC builds region clusters based on the k-means algorithm with seed-pixel initialization. Each pixel of the image is described as a five-dimensional feature vector [l, a, b, x, y]^T. The parameter to be customized is K, which indicates the number of super-pixel objects expected to be generated. The optional parameter m represents the compactness of the super-pixel objects. After comparing several different sets of experiments, we find that the best segmentation results are obtained when K is set to 900 and m to 30. The specific process of super-pixel segmentation can be divided into three steps: (1) Divide the image uniformly according to the number of segmented objects and calculate the grid interval S; (2) Adopt the location with the smallest gradient in the 3 × 3 neighborhood of each cluster center as the new cluster center, to avoid placing seeds on edges or noisy pixels [38,40]; (3) Through an iterative process (typically 10 iterations), assign each pixel to the nearest super-pixel according to the distance measure D. The distance D is calculated as follows:

d_c = sqrt((l_j − l_i)^2 + (a_j − a_i)^2 + (b_j − b_i)^2)
d_s = sqrt((x_j − x_i)^2 + (y_j − y_i)^2)
D = sqrt(d_c^2 + (d_s / S)^2 · m^2)

where (x_i, y_i) represents the location of pixel i and (l_i, a_i, b_i) represents the color components of pixel i in the CIELAB color space [41]; d_c and d_s denote the color proximity distance and spatial proximity distance, respectively.
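The distance measure can be written directly from the SLIC formulation [38]; since the paper's own equation is not reproduced here, the normalization below follows the original SLIC paper:

```python
import math

def slic_distance(p_i, p_j, S, m=30.0):
    """Distance D used by SLIC to assign pixels to cluster centers.

    p_i, p_j : five-dimensional pixel descriptors [l, a, b, x, y].
    S        : grid interval, sqrt(N / K) for N pixels and K super-pixels.
    m        : compactness weight (the paper uses m = 30).
    """
    li, ai, bi, xi, yi = p_i
    lj, aj, bj, xj, yj = p_j
    d_c = math.sqrt((lj - li) ** 2 + (aj - ai) ** 2 + (bj - bi) ** 2)
    d_s = math.sqrt((xj - xi) ** 2 + (yj - yi) ** 2)
    # Spatial distance is normalized by the grid interval and re-scaled
    # by the compactness m, trading off color vs. spatial proximity.
    return math.sqrt(d_c ** 2 + (d_s / S) ** 2 * m ** 2)
```

A larger m produces more compact, regular super-pixels; a smaller m lets clusters follow color boundaries more freely.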
As shown in Figure 4, the framework of the contour attention mechanism has two inputs: the original feature F and the super-pixel segmented reference image S. The super-pixel segmentation map S, after one-hot processing, is used as the guide image for subsequent original feature extraction. In this way, we aim to describe the internal consistency information more closely, especially for complex and diverse buildings. The contour attention map (M_contour) is calculated as follows:

P = OneHot(S) ⊙ Pool(F)
w_k = Σ_j P_(k,j) / Σ_j OneHot(S)_(k,j)
F_n = w · OneHot(S)
M_contour = σ(f^(7×7)(Cat(F_n^avg, F_n^max)))

where F represents the original image feature and S represents the super-pixel segmented post-temporal image. Firstly, we compute the max pooling and average pooling of the original feature separately. Meanwhile, the one-hot processed super-pixel segmentation image is multiplied with the pooled feature to obtain P; the number of segmented objects is n and m is the total number of pixels. The calculated P contains the original feature information and the index information of the super-pixel objects. Then, we calculate w, which describes the average of the original feature within each super-pixel object. Furthermore, we multiply w with the one-hot processed super-pixel segmentation image to generate the new feature F_n. Finally, we concatenate the two new features generated from the average- and max-pooled features to obtain a feature with more channels. A Conv operation with a 7 × 7 kernel and a nonlinear operation are then adopted to obtain the final contour attention map (M_contour).
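The core of this mechanism, replacing each pixel by the mean of its super-pixel object via the one-hot matrix, can be sketched as follows (variable names follow the text; the function itself is an illustrative reconstruction):

```python
import numpy as np

def superpixel_average(pooled, seg, n_objects):
    """Replace every pixel of a pooled feature map by the mean value of
    its super-pixel object, as in the contour attention mechanism.

    pooled    : (H, W) channel-pooled feature (avg- or max-pooled).
    seg       : (H, W) integer super-pixel labels in [0, n_objects).
    Returns an (H, W) map F_n that is constant within each object.
    """
    g = pooled.ravel()                          # (m,) with m = H*W pixels
    labels = seg.ravel()
    onehot = np.eye(n_objects)[labels].T        # (n, m) one-hot matrix
    p = onehot * g                              # P: object-indexed features
    counts = onehot.sum(axis=1)                 # pixels per object
    w = p.sum(axis=1) / np.maximum(counts, 1)   # w: per-object mean
    f_n = w @ onehot                            # broadcast means back
    return f_n.reshape(pooled.shape)
```

Because every pixel of an object receives the same value, the subsequent attention map enforces the one-category-per-object consistency described above.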
The structure of the MA module mainly contains two parts: CBAM [34] and the proposed CCAM. A detailed description is shown in Figure 5. CBAM focuses on the changed information and is composed of the channel and spatial attention mechanisms. CCAM integrates information about changed objects and comprises the channel and contour attention mechanisms. In the proposed MAEANet, we introduce CBAM after deep features with more semantic information and CCAM after shallow features with more detailed location information.
The formulas of CBAM and CCAM are calculated as follows. For CBAM:

F' = M_channel(F) ⊗ F
F'' = M_spatial(F') ⊗ F'

For CCAM:

G' = M_channel(G) ⊗ G
G'' = M_contour(G') ⊗ G'

where F represents the deep feature, G represents the shallow feature and ⊗ denotes element-wise multiplication. After applying M_channel, M_spatial and M_contour in CBAM and CCAM successively, more discriminative features of F and G are acquired.
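The sequential gating above amounts to two broadcast multiplications; a minimal sketch, assuming the attention maps have already been computed:

```python
import numpy as np

def apply_cbam(feat, m_channel, m_spatial):
    """Sequential gating used by CBAM: channel first, then spatial.

    feat      : (C, H, W) deep feature F.
    m_channel : (C,) channel attention map.
    m_spatial : (H, W) spatial attention map.
    """
    f1 = m_channel[:, None, None] * feat    # F'  = M_channel(F) (x) F
    return m_spatial[None, :, :] * f1       # F'' = M_spatial(F') (x) F'

def apply_ccam(feat, m_channel, m_contour):
    """CCAM replaces the spatial map with the contour attention map."""
    g1 = m_channel[:, None, None] * feat    # G'  = M_channel(G) (x) G
    return m_contour[None, :, :] * g1       # G'' = M_contour(G') (x) G'
```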


Edge-Aware Module
Recent studies have shown that introducing the building edge information in deep supervision can effectively generate better quality BCD results [32,42]. The main reason is that the edge information of buildings has a distinct geometric structure, which is very effective for determining the location of buildings. At the same time, adequate edge information can guide the network to generate better binary prediction results, especially in dense building areas. Based on the above considerations, we introduce the EA module to MAEANet.
As shown in Figure 6, we combine the edge estimation task and the building change prediction task in our module to achieve a mutually reinforcing effect. Firstly, we use a Conv block, which includes a 3 × 3 convolution, BN and ReLU, to extract edge information; C_1 represents the channel number of the original feature and the output channel number C_2 is set to 2. After this operation, we obtain the edge information from the output. Then, we adopt BN and ReLU to further obtain nonlinear boundary information. Finally, the original features are concatenated with the edge information to estimate the binary change and edge prediction maps. The advantage of the EA module is that the edge feature can be integrated directly into the discriminative feature, thus improving the accuracy of detecting building changes and alleviating the mixing and misclassification of building edges.

Loss Function Details
To train and optimize the network, we design a deep hybrid loss function that includes both a loss focused on change detection (E_wbce) and a loss focused on edge detection (E_Dice). Since unchanged samples greatly outnumber changed samples in the actual change detection task, we introduce a weighted binary cross-entropy loss to counteract the imbalanced training samples. The specific weighted binary cross-entropy loss can be described as:

E_wbce = −(1/N) Σ_(n=1)^N [ω · y_n · log(p_n) + (1 − y_n) · log(1 − p_n)]

where N is the total pixel count of the image; ω is defined to reduce the weight of negative samples and increase the focus on positive samples during training; y_n is the label of the n-th pixel, where y_n = 0 means the pixel is unchanged and y_n = 1 denotes a changed pixel; and p_n is the predicted probability of change.
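The weighted term can be sketched as below; the value of ω is a training hyper-parameter, and the 2.0 default here is purely illustrative:

```python
import numpy as np

def weighted_bce(p, y, omega=2.0, eps=1e-7):
    """Weighted binary cross-entropy E_wbce for imbalanced CD samples.

    p     : (N,) predicted change probabilities p_n.
    y     : (N,) ground-truth labels y_n (1 = changed, 0 = unchanged).
    omega : weight raising the contribution of the scarce positive
            samples (hypothetical default; the paper's value is a
            training hyper-parameter).
    """
    p = np.clip(p, eps, 1.0 - eps)   # avoid log(0)
    loss = -(omega * y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    return loss.mean()
```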
Figure 6. The structure of the EA module.

In addition, the Dice coefficient loss can mitigate category imbalance and works well in change detection tasks. Therefore, we adopt the Dice coefficient loss for edge detection. The Dice coefficient loss is expressed as:

E_Dice = 1 − 2 · |P ∩ G| / (|P| + |G|)

where P denotes the predicted probability values of all change-class pixels in the image and G denotes the corresponding ground truth. The final hybrid loss function can be calculated as follows:

E = E_wbce + λ · E_Dice

where the weight ratio λ is used to balance the influence of change detection and edge detection; we set it to 10. Through continuous training with the hybrid loss, the obtained building change results present more accurate boundaries while maintaining higher accuracy.
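The two terms combine as sketched below. Since the original equations are reconstructed from the prose, the soft-Dice formulation and the assumption that λ scales the edge term are our reading of the description; ω is again an illustrative hyper-parameter:

```python
import numpy as np

def dice_loss(p, g, eps=1e-7):
    """Soft Dice coefficient loss E_Dice for the edge-detection branch.

    p : (N,) predicted edge probabilities; g : (N,) edge ground truth.
    """
    inter = (p * g).sum()
    return 1.0 - (2.0 * inter + eps) / (p.sum() + g.sum() + eps)

def hybrid_loss(p_cd, y_cd, p_edge, g_edge, lam=10.0, omega=2.0, eps=1e-7):
    """E = E_wbce + lambda * E_Dice with lambda = 10, as in the paper.
    (omega is a hypothetical positive-sample weight inside E_wbce.)"""
    p = np.clip(p_cd, eps, 1.0 - eps)
    e_wbce = -(omega * y_cd * np.log(p)
               + (1.0 - y_cd) * np.log(1.0 - p)).mean()
    return e_wbce + lam * dice_loss(p_edge, g_edge)
```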


Data Description
To prove the validity of MAEANet, we conduct experiments on two publicly available datasets: LEarning VIsion Remote Sensing (LEVIR CD) [25] and WHU Building Change Detection Dataset (BCDD) [30]. Figure 7 shows some images obtained from LEVIR CD and BCDD.
The LEVIR CD dataset is a public, large-scale CD dataset with a total of 637 image pairs. The dataset consists of 3-band bi-temporal images with a 0.5 m spatial resolution. The images cover building changes from 2002 to 2018 in 20 regions of Texas, USA. In addition, influencing factors such as illumination and seasonal changes are included, which helps in developing more robust methods. Considering the limited memory of our GPU, we crop the images into 256 × 256 pixel patches without overlap. Furthermore, since changed building instances account for a relatively small portion of the whole dataset, data augmentation operations such as random rotation are used. Finally, we obtain 8256/1024/2048 image pairs for the training, validation and test sets.
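The non-overlapping cropping used for both datasets can be sketched as follows (border regions smaller than a full patch are discarded; whether the paper pads them instead is not stated):

```python
import numpy as np

def crop_patches(image, size=256):
    """Crop an (H, W, C) image into non-overlapping size x size patches,
    discarding any incomplete patch at the right/bottom border."""
    h, w = image.shape[:2]
    return [image[i:i + size, j:j + size]
            for i in range(0, h - size + 1, size)
            for j in range(0, w - size + 1, size)]
```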
The WHU Building Change Detection Dataset (BCDD) is a public dataset containing two scenes of aerial data acquired in Christchurch, New Zealand, in 2012 and 2016. This bi-temporal dataset is 32,507 × 15,354 pixels in size with a total of 11,328 pairs of images and has a 1.6 m spatial resolution. In addition, we crop each scene into patches of 256 × 256 pixels without overlap, obtaining 7434 image pairs. Finally, we obtain 5204/744/1486 image pairs for training, validation and test.
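The non-overlapping 256 × 256 cropping used for both datasets can be sketched as follows. This is a generic illustration with our own function name; the paper does not publish its preprocessing code.

```python
import numpy as np

def crop_patches(image, patch=256):
    """Split an H x W x C image into non-overlapping patch x patch tiles,
    discarding any remainder at the right/bottom edges."""
    h, w = image.shape[:2]
    tiles = []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            tiles.append(image[y:y + patch, x:x + patch])
    return tiles
```

As a sanity check, cropping the 32,507 × 15,354 BCDD scene this way yields ⌊32507/256⌋ × ⌊15354/256⌋ = 126 × 59 = 7434 patches, matching the patch count reported above.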

Comparison Methods
We selected four deep learning-based CD methods to objectively evaluate the performance of the MAEANet, as described below.
MSPSNet [22] is a deeply supervised multiscale Siamese network. The network learns effective feature maps through the convolutional block. Furthermore, it integrates feature maps by parallel convolutional structure, while self-attention further enhances the CD representation of image information.
SNUNet [31] is proposed for CD based on densely connected Siamese networks. The network reduces the loss of localization information during feature extraction through dense connections between encoders and decoders. In addition, the Ensemble Channel Attention is introduced to enhance representation features at different semantic levels.
STANet [25] is a metric-based Siamese neural network that identifies bi-temporal image CD. The network feeds ResNet18 as a backbone, introduces the self-attention to build the temporal and spatial relationship, and then uses it to capture various scales of spatial-temporal dependencies at multiscale subregions.
EGRCNN [32] uses a convolutional neural network for BCD. The network feeds bi-temporal images into a weight-shared Siamese network to extract primary multilevel features, and then employs a long short-term memory (LSTM) module to produce discriminative features. In addition, the edge structure of buildings is introduced into the network, aiming to improve the quality of BCD.

Implementation Details and Evaluation Metrics
In this article, all of the experiments were implemented based on the PyTorch framework, and the network was trained on a single Nvidia Tesla V100 (16 GB of RAM). We performed data augmentation on the training data, such as rotation and flipping. During training, we used the Adam optimizer with an initial learning rate of 10^{-4}, together with a learning rate decay strategy: if the F1-score did not increase within 15 epochs, we divided the learning rate by 10. The batch size was set to 4 and the number of epochs to 100. The parameter ω in E_{wbce} was set to 0.25. The comparison methods were run using their publicly released code, and all parameter settings were the default parameters of the original papers.
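The learning-rate schedule described above (divide by 10 when the validation F1 has not improved for 15 epochs) behaves like a standard reduce-on-plateau rule. A minimal sketch, with a class name of our own:

```python
class ReduceOnPlateau:
    """Divide the learning rate by `factor` when the monitored score
    (here the validation F1) fails to improve for `patience` epochs."""
    def __init__(self, lr=1e-4, patience=15, factor=10.0):
        self.lr, self.patience, self.factor = lr, patience, factor
        self.best, self.wait = -1.0, 0

    def step(self, f1):
        if f1 > self.best:
            # new best score: reset the patience counter
            self.best, self.wait = f1, 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                self.lr /= self.factor
                self.wait = 0
        return self.lr
```

PyTorch's built-in `torch.optim.lr_scheduler.ReduceLROnPlateau` implements the same idea and would likely be the practical choice in a PyTorch training loop.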
To quantitatively verify the performance of the network, five commonly used metrics were used: Precision (P), Recall (R), F1-score (F1), Overall Accuracy (OA), and Kappa Coefficient (KC). The P is the proportion of correctly detected changed pixels among all pixels identified as changed; the closer P is to 1, the lower the false detection rate. The R is the proportion of correctly detected changed pixels among all pixels that should be recognized as changed; the closer R is to 1, the lower the missed detection rate. The F1 is the harmonic mean of P and R: when either decreases, F1 also decreases. The OA indicates the ratio of correctly classified pixels among all samples. The KC represents the similarity between the predicted map and the ground truth. These evaluation metrics are defined as follows:

$P = \frac{TP}{TP + FP}$, $R = \frac{TP}{TP + FN}$, $F1 = \frac{2PR}{P + R}$, $OA = \frac{TP + TN}{TP + TN + FP + FN}$

$KC = \frac{OA - PE}{1 - PE}$, $PE = \frac{(TP + FP)(TP + FN) + (FN + TN)(FP + TN)}{(TP + TN + FP + FN)^2}$

where TP, FP, TN and FN denote the numbers of true positive, false positive, true negative and false negative pixels, respectively.
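The five metrics above can be computed directly from the confusion-matrix counts; a small self-contained sketch (the function name is ours):

```python
def cd_metrics(tp, fp, tn, fn):
    """Precision, recall, F1, overall accuracy and kappa coefficient
    from a binary change-detection confusion matrix."""
    n = tp + fp + tn + fn
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    oa = (tp + tn) / n
    # expected agreement by chance, used by the kappa coefficient
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kc = (oa - pe) / (1 - pe)
    return p, r, f1, oa, kc
```

Because changed pixels are rare, OA can stay high even when many changes are missed; this is why F1 and KC, which discount chance agreement and class imbalance, are the more informative metrics for BCD.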

Ablation Study for the Proposed MAEANet
To prove the effectiveness of each component in MAEANet, a series of ablation experiments are conducted for the MA and EA modules using the LEVIR CD and BCDD datasets. In detail, we adopt the Siam-fusedNet as the base architecture to extract original features. The MA module is introduced to enhance the discrimination of the original features and refine misclassifications. The EA module is introduced to improve the boundary quality of dense buildings. Detailed experiments are shown in Table 1, and the best results are highlighted in bold.
Considering the data of BCDD in Table 1, the metrics of the Siam-fusedNet for P, R, F1, OA and KC are 94.02%, 86.66%, 90.19%, 99.28% and 89.81%, respectively. The value of P is satisfactory, but the value of R is poor, which means that the performance of Siam-fusedNet is unbalanced. Regarding R, we observe that adding the MA module raises it by nearly 2.7% over Siam-fusedNet, and adding the EA module raises it by almost 2.08%. When we add both the MA and EA modules to the Siam-fusedNet, the network's capability is greatly enhanced, with F1, OA and KC values of 91.58%, 99.36% and 91.25%. From the above experimental results, when we introduce the MA and EA modules into the Siam-fusedNet, the difference between P and R is the smallest while the F1 is the highest, strongly verifying the performance of the MAEANet.
To further illustrate the method's superiority in change detection, some visual comparisons of the ablation results are shown in Figure 8. As we can see, in the Siam-fusedNet model many false positives are classified as changes (rows 1 and 5 in Figure 8) and some changed buildings lack clear boundaries (rows 3, 4, 6 and 7 in Figure 8). The reason is that CD is affected by bi-temporal illumination, color and other factors, so the feature information in bi-temporal images cannot be extracted well using only the convolutional blocks of an ordinary Siam-fusedNet network. After adding the EA module to the Siam-fusedNet network, we find that the visual quality of building edges improves to some extent (rows 1, 4 and 7 in Figure 8), because the EA module can better mine the boundary information of buildings. When we introduce the MA module into the Siam-fusedNet network, the problem of false-positive changes is significantly reduced and the quality of the building boundaries is further improved (rows 2, 3, 4, 5 and 6 in Figure 8). When the MA and EA modules are simultaneously added to the Siam-fusedNet, misclassified change information is dramatically reduced, and the boundaries between adjacent buildings are clear and closer to the ground truth.

Comparison Experiments
To objectively and quantitatively compare our proposed MAEANet with existing CD methods, a series of comparison experiments are conducted on two datasets, and three evaluation metrics, P, R and F1, are calculated for analysis. Specific results are shown in Table 2.
By comparing the metric values in Table 2, it can be found that MAEANet clearly outperforms the other networks in the P, R and F1 metrics. Although MSPSNet and SNUNet achieve promising precision values, they miss more changed objects, resulting in lower values of R and F1. When we compare STANet with the proposed MAEANet, the P, R and F1 metrics of the MAEANet improve by 7.54%, 1.1% and 4.5% on the LEVIR CD dataset and by 10.9%, 4.48% and 7.78% on the BCDD dataset. The main reason is that our proposed attention module takes into account the internal consistency of objects, thus reducing misclassification due to bi-temporal differences. Compared to EGRCNN, the F1 of MAEANet is 1.79% and 1.76% higher on the LEVIR CD and BCDD datasets, respectively, which demonstrates its excellent performance in detecting changed buildings.

Comparison Experiments
To objectively and quantitatively compare our proposed MAEANet with existing CD methods, a series of comparison experiments are conducted on two datasets and three evaluation metrics are calculated for analysis, P, R and F1, respectively. Specific results are shown in Table 2.   Simultaneously, the visual results between the proposed MAEANet and other methods are shown in Figure 9. By subjective visual comparison with the above-mentioned change detection methods, it can be found that the binary CD results are superior to the other existing methods. The comparison of the visible result in Figure 9 shows that for the dense building, MSPSNet, SNUNet and STANet have significant boundary confusion problems (rows 1, 3 and 4 in Figure 9). Still, the method of EGRCNN and our proposed MAEANet overcome this problem. The main reason is that both EGRCNN and MAEANet introduce the edge prior information, thus avoiding the problem of building boundary confusion. For the changed large-size individual houses, the proposed MAEANet is most reliable to other methods in maintaining the internal consistency of the building and the integrity of its boundaries (rows 2, 5 and 6 in Figure 9). resulting in lower values of R and F1. When we compare STANet with the proposed MAEANet, the P, R and F1 metrics of the MAEANet improved by 7.54%, 1.1% and 4.5% on the LEVIR CD dataset and by 10.9%, 4.48% and 7.78% on the BCDD dataset. The main reason is that our proposed attention module takes into account the internal consistency of objects, thus reducing misclassification due to bi-temporal differences. Compared to EGRCNN, the F1 of MAEANet is 1.79% and 1.76% higher than EGRCNN on the LEVIR CD and BCDD datasets, which proved its excellent performance in detecting changed buildings.
Simultaneously, the visual results of the proposed MAEANet and the other methods are shown in Figure 9. By subjective visual comparison with the above-mentioned change detection methods, it can be found that our binary CD results are superior to those of the other existing methods. The comparison in Figure 9 shows that for dense buildings, MSPSNet, SNUNet and STANet have significant boundary confusion problems (rows 1, 3 and 4 in Figure 9), while EGRCNN and our proposed MAEANet overcome this problem. The main reason is that both EGRCNN and MAEANet introduce edge prior information, thus avoiding building boundary confusion. For changed large-size individual houses, the proposed MAEANet is more reliable than the other methods in maintaining the internal consistency of the building and the integrity of its boundaries (rows 2, 5 and 6 in Figure 9).

Quantitative Comparison of Binary Edge Prediction Results
To compare the performance of the proposed MAEANet with the above-mentioned change detection methods in boundary maintenance, we generate the binary boundary learning results by performing morphological processing within a 9 × 9 neighborhood on the binary change detection images. The corresponding evaluation metrics are shown in Table 3. We can see that MAEANet obtains the best results for P, R, F1, OA and KC. Compared with the three models that do not consider boundary factors (MSPSNet, SNUNet and STANet), all five metrics show significant improvements. We also find that MAEANet maintains an excellent performance compared to EGRCNN, which does consider boundary factors: the F1 improves by 2.01% and KC by 2.49%. Therefore, the proposed MAEANet is demonstrated to have a clear advantage in the boundary maintenance of CD. In our MAEANet method, the building edge information is mainly extracted by the contour attention mechanism with the SLIC over-segmentation operator. Beyond the edge information obtained directly from the RGB images, extra information extracted from corresponding digital surface models (DSMs) has been introduced into building extraction tasks. DSM data can provide more edge information for the flat roofs of buildings without the influence of building shadows or the spectral confusion between buildings and roads. Intuitively, introducing DSM data can yield performance gains in building extraction tasks, as well as in building change detection tasks.
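The paper states only that morphological processing in a 9 × 9 neighborhood is applied to the binary change maps; one plausible realization is the dilation-minus-erosion boundary band sketched below in plain NumPy (the function name and padding choice are ours).

```python
import numpy as np

def binary_boundary(change_map, k=9):
    """Extract a boundary band from a binary change map as the
    difference between morphological dilation and erosion with
    a k x k structuring window."""
    h, w = change_map.shape
    pad = k // 2
    padded = np.pad(change_map.astype(bool), pad, mode='edge')
    dil = np.zeros((h, w), dtype=bool)
    ero = np.ones((h, w), dtype=bool)
    # slide the k x k window: dilation is the OR over the window,
    # erosion is the AND over the window
    for dy in range(k):
        for dx in range(k):
            win = padded[dy:dy + h, dx:dx + w]
            dil |= win
            ero &= win
    return (dil & ~ero).astype(np.uint8)
```

In practice `scipy.ndimage.binary_dilation` and `binary_erosion` with a 9 × 9 structuring element would give the same result more efficiently; the explicit loops here just make the morphology transparent.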

The Experiments on the Hybrid Loss
The weighted binary cross-entropy loss focuses on image prediction, while the dice coefficient loss focuses on edge map estimation. In the hybrid loss function proposed in this paper, we set a scale factor λ to balance the effects of these two loss functions. Meanwhile, to verify whether the edge-based dice coefficient loss is effective for the whole MAEANet, we set four scale factors, 0, 0.1, 1 and 10 [32], and conducted experiments on the BCDD dataset.
As shown in Figure 10, we calculate five different metrics for comparison and validation. When the parameter λ is 0, every metric takes its lowest value. As λ increases, F1 and KC grow steadily, although P rises and then falls and R falls and then rises, which indicates that the introduced dice loss can improve the building change detection effect. In addition, we notice that when λ is 10, the difference between P and R is the smallest, which leads to the maximum F1. We can see from Figure 11 that the change detection results with λ set to 10 restore more details while reducing misclassification, especially for fine boundaries. Therefore, in this paper, we select λ = 10 as the best parameter of the loss function.

How to Combine the CBAM and CCAM in the MA Module?
The attention module aims to refine the features of an image and is now widely used in various image interpretation tasks, especially for change detection. In our proposed model, the multiscale attention module has four layers in total. To extract more discriminative features from the different layers, we generally associate them with CBAM, while CCAM focuses on the shallow layers, which contain more detailed edge information. However, it is still worth exploring how many CCAMs should be introduced in the proposed MA module. To verify the effectiveness of CCAM, we repeated each experiment MAEANet_n three times, where n is the total number of CCAMs in the MA module, n ∈ {0, 1, 2, 3, 4}. MAEANet_0 means that no CCAM is introduced and 4 CBAMs are used to construct the MA module. MAEANet_1 means that only one CCAM exists in the MA module: the first three layers are CBAMs and the last layer is CCAM. MAEANet_4 means that all four layers of the MA module are CCAMs, without any CBAM. We then compare the effect of n on BCD performance by calculating the five objective metrics P, R, F1, OA and KC. We notice that the F1, OA and KC values increase and then decrease as n grows. As shown in Figure 12, several experiments with n > 0 outperform n = 0 in F1, OA and KC, especially when n = 2, which reaches the highest values in all metrics except precision. This experimental result confirms that introducing CCAM can effectively extract more discriminative features, which benefits the identification of changed buildings. However, as the number of CCAMs increases further (n ≥ 3), the model's performance decreases, because applying CCAM to the deep features loses some semantic information peculiar to those features, leading to misclassification.
Therefore, we select the value of n as 2 when constructing the MA module, using CBAM for the deep features in the first two layers and CCAM for the shallow features in the last two layers.
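The layer assignment selected above can be summarized as a small configuration sketch (the attention types appear only as placeholder names; the actual CBAM/CCAM implementations are described earlier in the paper):

```python
def build_ma_attention(n_ccam=2, n_layers=4):
    """Assign an attention type to each layer of the MA module:
    CBAM on the first (deep) layers, CCAM on the last n_ccam
    (shallow) layers. n_ccam = 0 means CBAM everywhere, matching
    the MAEANet_0 .. MAEANet_4 ablation variants."""
    assert 0 <= n_ccam <= n_layers
    return ['CBAM'] * (n_layers - n_ccam) + ['CCAM'] * n_ccam
```

With the default `n_ccam=2` this yields the chosen configuration `['CBAM', 'CBAM', 'CCAM', 'CCAM']`, i.e. contour attention only on the two shallow layers.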

Conclusions
In this article, a novel building change detection framework called MAEANet is proposed, which aims to enhance robustness to false-positive changes and mitigate the boundary confusion of dense buildings. The MAEANet method consists of three main parts: the bi-temporal feature fusion Siam-fusedNet with an encoder-decoder structure as the backbone, multiscale attention discriminative feature extraction, and multilevel edge-aware binary change map prediction. The proposed CCAM designs a contour attention mechanism with a smoothing effect by introducing the contour information of super-pixel segmented objects, which alleviates problems such as the misclassification of small targets and poor robustness to false-positive changes. Furthermore, the EA module effectively combines multilevel edge features and multiscale discriminative features to avoid the boundary confusion of dense buildings in the prediction map. Meanwhile, we design an edge-constraint loss to learn the information about the changed buildings and their boundaries using gradient descent. The results show that the network achieves better numerical indicators and visualization results on both public building change detection datasets. In our future work, we will devote ourselves to mining more representative discriminative features, such as adding building DSM information as an object identifier and constructing datasets from different sensors, to improve the flexibility of the proposed MAEANet.