MDESNet: Multitask Difference-Enhanced Siamese Network for Building Change Detection in High-Resolution Remote Sensing Images

Abstract: Building change detection is a prominent topic in remote sensing applications. Scholars have proposed a variety of fully-convolutional-network-based change detection methods for high-resolution remote sensing images, achieving impressive results on several building datasets. However, existing methods cannot solve the problem of pseudo-changes caused by factors such as "same object with different spectra" and "different objects with the same spectrum" in high-resolution remote sensing images, because their networks are constructed using simple similarity measures. To increase the ability of the model to resist pseudo-changes and improve detection accuracy, we propose an improved fully-convolutional-network-based method, called the multitask difference-enhanced Siamese network (MDESNet), for building change detection in high-resolution remote sensing images. We improved the feature extraction ability of the model by adding semantic constraints, and utilized features more effectively to improve its recognition performance. Furthermore, we proposed a similarity measure combining concatenation and difference, called the feature difference enhancement (FDE) module, and designed comparative experiments to demonstrate its effectiveness in resisting pseudo-changes. On the building change detection dataset (BCDD), we demonstrate that our method outperforms other state-of-the-art change detection methods, achieving the highest F1-score (0.9124) and OA (0.9874), indicating its advantages for building change detection tasks in high-resolution remote sensing images.


Introduction
Numerous Earth observation studies have been conducted using remote sensing technology [1][2][3][4][5], among which building change detection is currently a research hotspot in remote sensing applications [6]. Building change detection using remote sensing images is important for urban management and supervision, urban development and planning, and city feature renewal [6,7]. With the continuous improvement of aerospace and imaging technologies, the ability to obtain high-resolution remote sensing images has gradually improved. These images are characterized by more abundant spectral features of various ground objects and clearer shape, texture, and color features [8]. However, pseudo-changes caused by background and noise information in images might result in increased false detections. Many studies have shown that an improvement in image spatial resolution is not accompanied by an improvement in interpretation accuracy or a reduction in detection difficulty [9][10][11]. Therefore, building change detection in high-resolution remote sensing images remains a great challenge, and improving detection accuracy and intelligent interpretation is still an important issue in remote sensing building change detection.
Since the 1970s, scholars have conducted various studies on remote sensing change detection [12]. After approximately half a century of development, many change detection methods have emerged. With the improvement in the spatial resolution of remote sensing which can then be efficiently used. However, the complex decoder structure in this model might lead to a large number of model parameters, which is not conducive to deployment in practical projects.
Although scholars have proposed many methods, the problems in building change detection have not been completely solved. The main reasons are as follows. First, when measuring the similarity between feature pairs, existing measurement methods cannot distinguish between real changes and pseudo-changes caused by registration errors, imaging angles, and shadows, leading to serious false detection. Second, frequent down- and upsampling in the network introduces errors in positioning information. This makes it extremely difficult to restore the regular, neat boundaries of buildings and degrades the prediction of small targets. Finally, changed samples in the real world are far fewer than unchanged samples, leading to a serious imbalance in the sample distribution and increasing the difficulty of model training.
In this study, we propose a novel deep learning model, MDESNet, which provides a more effective solution for building change detection in high-resolution remote sensing images. MDESNet has an encoder-decoder structure. In the encoding stage, a shared-weight Siamese network is used as the encoder to extract the features of the bitemporal inputs, while a feature pyramid network (FPN) [34] is introduced to fuse the multiscale feature maps top-down. The encoded features finally pass through two semantic decoders and a change detection branch to obtain the segmentation and change detection results. In contrast to single-task change detection models, MDESNet has two semantic segmentation branches, which provide semantic constraints for the Siamese encoder and increase the resistance of the model to pseudo-changes. More importantly, we propose a method, called the FDE module, to measure the similarity between multiscale bitemporal features. It focuses more on the truly changing areas during model training and improves the comprehensive performance of the model. To recover the real boundary information of buildings, we propose a multiscale feature fusion method called the MSFF module. By fusing the features of adjacent scales level by level, the network learns the importance of each scale's features. In the MSFF module, we introduce the scSE [35] dual attention mechanism to recalibrate the spatial and channel dimensions of the feature maps and obtain more discriminative features. Finally, we propose a weighted loss function combining binary cross-entropy loss and focal loss to alleviate the imbalance between changed and unchanged samples.
The main contributions of this study are as follows:
1. We propose a multitask difference-enhanced Siamese network based on a fully convolutional structure, which consists of a main branch for change detection and two auxiliary branches for extracting bitemporal buildings. The introduction of semantic constraints enables the model to learn the features of targets, facilitating the avoidance of pseudo-changes. The MSFF module was designed as the decoder of the three branches, and the scSE algorithm was introduced to improve the ability of the model to recover spatial details.
2. We propose an FDE module that combines concatenation and difference. This module enhances the differences in bitemporal features at different scales, increasing the distance between pairs of truly changing pixels and thus enlarging the interclass disparity. This improves the ability of the model to detect real changes and its robustness to pseudo-changes.
3. We verify the performance of the proposed method on BCDD and achieve the best F1-score (0.9124) compared with other baseline methods.
Our manuscript is organized as follows. Section 2 introduces our proposed method in detail. Section 3 describes the BCDD dataset and evaluation metrics, and conducts ablation and comparative experiments. Section 4 discusses and analyzes the experimental results. Finally, Section 5 summarizes our research and suggests future work.

Materials and Methods
In this section, we detail the proposed multitask change detection network and its application. The general structure of the proposed network is described after enumerating several related studies. Then, the FDE and MSFF modules are illustrated. Finally, the loss function is introduced.

• Siamese Network
The Siamese network [36] was proposed in 1993 and was originally used to verify whether the signature on a check is consistent with the signature on file with the bank. A Siamese network has two inputs that are processed by identical encoders with shared weights. This dual-stream structure is beneficial for measuring the similarity between the two inputs. It has been widely used in signature verification [36], face verification [37], change detection [25,30,31], and semantic similarity analysis [38,39].
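The shared-weight idea can be sketched in a few lines of PyTorch. This is a toy encoder for illustration only (the paper's actual backbone is ResNeSt-50, described later); the key point is that both inputs pass through the same weights.

```python
import torch
import torch.nn as nn

class SiameseEncoder(nn.Module):
    """Toy shared-weight (dual-stream) encoder: both inputs pass through
    the *same* convolutional stack, so their features live in a common
    embedding space and can be compared directly."""
    def __init__(self, in_ch=3, out_ch=16):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, t0, t1):
        # One weight set, two forward passes.
        return self.features(t0), self.features(t1)

enc = SiameseEncoder()
a, b = enc(torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64))
```

Because the weights are shared, identical inputs yield identical features, which is what makes distance-based change measures meaningful.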

• ResNeSt
ResNeSt [40] is a variant of ResNet [41] that introduces a split-attention module. It employs multibranch convolution within the same layer to extract more discriminative features, while a channel-attention mechanism highlights the importance of different feature channels and assigns them different weights. Compared with ResNet, ResNeXt [42], and SENet [43], ResNeSt performs better in image classification, semantic segmentation, instance segmentation, and object detection tasks [40].

• FPN
The FPN structure was originally designed for object detection tasks [34] and was later applied to a panoptic semantic segmentation task [44] by He et al. When objects of various sizes are present in an image, an FPN ensures that they are represented on feature maps of different scales. Similar to UNet, the FPN applies skip connections to fuse the features extracted by the encoder in a top-down process. Unlike UNet, the original FPN upsamples the low-resolution feature maps by a factor of 2 and adds them to the skip-connection features after the latter pass through a 1 × 1 convolutional layer. We used this FPN in our study because it fuses multiscale information without increasing the number of parameters.
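The top-down fusion described above can be sketched as follows. Channel widths follow the paper (256/512/1024/2048 in, 256 out); the layer details are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    """Top-down FPN fusion: each backbone map is projected to a common
    width by a 1x1 lateral conv, then the coarser map is upsampled by 2x
    and added in; a 3x3 conv smooths the sum."""
    def __init__(self, in_chs=(256, 512, 1024, 2048), width=256):
        super().__init__()
        self.laterals = nn.ModuleList(nn.Conv2d(c, width, 1) for c in in_chs)
        self.smooth = nn.ModuleList(nn.Conv2d(width, width, 3, padding=1)
                                    for _ in in_chs)

    def forward(self, feats):                    # feats ordered fine -> coarse
        lats = [l(f) for l, f in zip(self.laterals, feats)]
        for i in range(len(lats) - 2, -1, -1):   # walk from coarse to fine
            lats[i] = lats[i] + F.interpolate(lats[i + 1], scale_factor=2,
                                              mode="nearest")
        return [s(x) for s, x in zip(self.smooth, lats)]

fpn = SimpleFPN()
feats = [torch.randn(1, c, s, s) for c, s in
         zip((256, 512, 1024, 2048), (64, 32, 16, 8))]
outs = fpn(feats)
```

All four outputs share a 256-channel width, matching the encoding-stage description later in this section.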

• scSE
The concurrent spatial and channel squeeze and excitation (scSE) module [35], which synthesizes channel attention and spatial attention, works like a plugin. The scSE module compresses and activates features in the channel and spatial dimensions, respectively, and can explicitly model the interdependence between feature map channels. After the scSE module, the feature maps undergo spatial and channel recalibration, which increases the ability of the model to learn more discriminative features; more importantly, it adds only a few parameters.

Figure 1 shows the network structure of the proposed MDESNet. The input of the model was bitemporal remote sensing images of the same area. The bitemporal multiscale features were obtained after passing through the Siamese encoder composed of ResNeSt-50 and the neck layer composed of the FPN. The decoder, composed of the FDE and MSFF modules, decoded the bitemporal multiscale features. Finally, the prediction results of semantic segmentation and change detection of buildings in the bitemporal images were obtained.

Figure 1.
Overview of the proposed MDESNet. The T0 and T1 remote sensing images were inputted into the Siamese network (based on ResNeSt-50), and bitemporal multiscale feature maps were obtained (resolutions are 1/4, 1/8, 1/16, and 1/32 of the input, respectively). Subsequently, these were, respectively, inputted into FPN to fully integrate the context information. The two semantic segmentation branches were decoded using the MSFF module, whereas the change detection branch used the FDE module to obtain multiscale change features, which were then decoded using the MSFF module. Finally, they were restored to the original image resolution using a 4× bilinear upsampling.
In the encoding stage, we extracted bitemporal image features using a Siamese network that shares weight. We used ResNeSt-50 as the backbone of the model to enable it to extract more discriminative features. ResNeSt, which introduces a split-attention module, realizes the weight allocation of feature channels through group convolution and channel-wise attention. Therefore, it extracts more diversified features, achieving remarkable performance in image classification, object detection, and semantic segmentation [40]. The feature maps obtained from the Siamese network were divided into four different spatial scales with resolutions of 1/4, 1/8, 1/16, and 1/32 of the input image, and numbers of channels of 256, 512, 1024, and 2048, respectively. To fuse the context information, multiscale feature maps were inputted into the FPN structure, reducing the number of channels to 256. The top-down summation operation and convolution layer facilitated multiscale information aggregation and also reduced the aliasing effect that results from the superimposition of feature maps of different scales. In addition, reducing the number of channels further reduced the amount of computation and memory usage in the decoding stage.
In the decoding stage, we used two branches for semantic segmentation and one branch for change detection. For the two semantic segmentation branches, the bitemporal multiscale feature maps were decoded using the MSFF module, and the semantic prediction results of bitemporal buildings were obtained after 4× upsampling. For the change detection branch, the bitemporal feature maps of the corresponding scales were inputted into the FDE module to obtain the multiscale difference feature maps, which were decoded using the MSFF module and then upsampled 4× to obtain the prediction results of the changed buildings. The proposed FDE and MSFF modules are described in the following subsections.

Figure 2.
Overview of the FDE module. The proposed FDE module has difference and concatenation branches. The former applies the sigmoid activation function to obtain the feature difference attention map after differencing the bitemporal feature maps, whereas the latter element-wise multiplies the attention map with the concatenated features to enhance the difference. The feature maps of the four scales are operated on in the same way.

Feature Difference Enhancement Module
A suitable similarity measure facilitates the extraction of more discriminative features from the network, which is very meaningful for increasing the robustness of the model and reducing the impact of pseudo-changes. Daudt et al. [27] proposed difference and concatenation methods to derive changed information from bitemporal feature maps. Although the difference method indicates the change area more clearly, it might contain more erroneous information than the concatenation method when the bitemporal images suffer from problems such as inconsistent imaging perspectives and large registration errors. Conversely, despite containing the complete features of the bitemporal images, the concatenation method cannot specify the changed information as efficiently as the difference method. As shown in Figure 2, we propose an FDE module to integrate these two methods.
The bitemporal multiscale feature maps were inputted into the FDE module and passed through two branches at each scale: difference and concatenation. Initially, the distance between the bitemporal feature maps was calculated in the difference branch; we chose element-wise subtraction because its time and space complexity are lower than those of the cosine distance. Thus, the feature difference maps were obtained as follows:

FDM^i = F^i_1 − F^i_0,  i = 1, 2, . . . , N,

where FDM denotes the feature difference maps, F^i_1 is the i-th channel of the feature map with T1 as the input, F^i_0 is the i-th channel of the feature map with T0 as the input, and N is the number of channels in the input feature maps.
Subsequently, we applied a 1 × 1 convolutional layer to compress the channels of the feature difference maps from 256 dimensions to 1 dimension and obtained the feature difference attention maps after activation through the sigmoid function. The feature difference attention maps of the four scales were obtained as follows:

FDAM^j = σ(η(FDM^j)),  j = 1, 2, 3, 4,

where FDAM represents the feature difference attention map, σ is the sigmoid function, η is the 1 × 1 convolutional layer, and j indexes the four scales.
In the concatenation branch, we first passed the concatenated feature maps through a 3 × 3 convolutional layer followed by batch normalization and the ReLU function, which reduced the number of channels from 512 to 256 dimensions in preparation for feature difference enhancement. Subsequently, the concatenated feature maps were multiplied element-wise with the feature difference attention map, fusing the difference and concatenation information. Finally, 3 × 3 and 1 × 1 convolutional layers were used to obtain the changed features at this scale. The difference branch indicates the features of the changed area, while the concatenation branch calibrates the errors in the difference information. The FDE module combines the advantages of the two branches to enhance the feature differences. Applying the FDE module at all four scales allows the network to focus on regions with semantic and spatial differences.
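Putting the two branches together, a single-scale sketch of the FDE module might look like the following. Layer sizes follow the text above; the exact details of the authors' implementation are assumptions.

```python
import torch
import torch.nn as nn

class FDEModule(nn.Module):
    """Sketch of the FDE idea at one scale: a difference branch
    (element-wise subtraction -> 1x1 conv -> sigmoid attention map) and
    a concatenation branch (concat -> 3x3 conv back to 256 channels),
    fused by element-wise multiplication."""
    def __init__(self, ch=256):
        super().__init__()
        self.attn = nn.Sequential(nn.Conv2d(ch, 1, 1), nn.Sigmoid())
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * ch, ch, 3, padding=1),
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True))
        self.out = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 1))

    def forward(self, f0, f1):
        fdam = self.attn(f1 - f0)                 # feature difference attention map
        fused = self.fuse(torch.cat([f0, f1], dim=1))
        return self.out(fused * fdam)             # difference-enhanced features

fde = FDEModule()
y = fde(torch.randn(1, 256, 32, 32), torch.randn(1, 256, 32, 32))
```

The single-channel attention map broadcasts over all 256 feature channels, so regions with large bitemporal differences are amplified everywhere in the fused representation.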

Multiscale Feature Fusion Module
The MSFF module was designed to recover the spatial resolution of semantic feature maps in multiscale feature maps. The detailed architecture of the MSFF module is illustrated in Figure 3.
For the feature maps at the four scales, the channels were 256-dimensional for both the semantic segmentation and change detection branches. Feature maps of adjacent scales were inputted into a FusionBlock for fusion. A FusionBlock contains two residual blocks and one scSE module. Each residual block is composed of two 3 × 3 convolutional layers, each followed by batch normalization and ReLU, and reduces the channels of the input feature maps to 128. The scSE module is applied to learn the importance of each feature channel; hence, the feature maps undergo spatial and channel recalibration, allowing the network to enhance the more meaningful feature maps. Moreover, the addition of the scSE module does not markedly increase the number of model parameters. The feature maps of adjacent scales each pass through a residual block, and the low-resolution feature maps are concatenated with the high-resolution ones after upsampling. The output of the FusionBlock is obtained through the scSE module after the channels are restored to 256 dimensions. We applied six FusionBlocks to restore the feature maps to 1/4 of the input image size, and finally used a 1 × 1 convolutional layer and the softmax function to obtain the classification results.
The proposed MSFF module plays the role of decoding multiscale features. It does not utilize skip connections to introduce encoder features, because the multiscale feature maps have already been fused once by the FPN structure. Unlike the lightweight decoders proposed in [44,45], the MSFF module learns the importance of feature maps at different scales owing to its cascade structure and the scSE module.
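A minimal sketch of one FusionBlock follows, assuming a straightforward reading of the description above. Plain convolutional blocks stand in for the residual blocks, and the scSE module is implemented as in [35]; neither is the authors' exact code.

```python
import torch
import torch.nn as nn

class SCSE(nn.Module):
    """Concurrent spatial and channel squeeze-and-excitation [35]."""
    def __init__(self, ch, r=8):
        super().__init__()
        self.cse = nn.Sequential(                       # channel recalibration
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // r, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // r, ch, 1), nn.Sigmoid())
        self.sse = nn.Sequential(nn.Conv2d(ch, 1, 1), nn.Sigmoid())  # spatial

    def forward(self, x):
        return x * self.cse(x) + x * self.sse(x)

class FusionBlock(nn.Module):
    """Fuses a high-res map with the upsampled adjacent low-res map:
    each passes a conv block down to 128 channels, they are concatenated
    back to 256, then recalibrated by scSE."""
    def __init__(self, ch=256):
        super().__init__()
        def block():
            return nn.Sequential(
                nn.Conv2d(ch, ch // 2, 3, padding=1),
                nn.BatchNorm2d(ch // 2), nn.ReLU(inplace=True),
                nn.Conv2d(ch // 2, ch // 2, 3, padding=1),
                nn.BatchNorm2d(ch // 2), nn.ReLU(inplace=True))
        self.high, self.low = block(), block()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear",
                              align_corners=False)
        self.scse = SCSE(ch)

    def forward(self, f_high, f_low):
        x = torch.cat([self.high(f_high), self.up(self.low(f_low))], dim=1)
        return self.scse(x)

fb = FusionBlock()
out = fb(torch.randn(1, 256, 64, 64), torch.randn(1, 256, 32, 32))
```

Cascading six such blocks over the four scales, as the text describes, progressively restores the 1/4-resolution map that the final 1 × 1 convolution and softmax classify.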

Loss Function
Our network is a multitask deep learning model for extracting pixel-level semantic and change information. The cross-entropy loss function is usually used for training pixel-level tasks [46]. However, in both semantic segmentation and change detection tasks, the sample quantities of the positive and negative categories are often unbalanced [47]. Consequently, training with only the cross-entropy loss function typically results in a large number of misjudgments. To solve this problem, we used the focal loss function to alleviate the imbalance between positive and negative sample sizes and the variation in classification difficulty [48]. The binary cross-entropy loss function L_BCE was given as follows:

L_BCE = −(1/N) Σ_{i=1}^{N} [t_i log(y_i) + (1 − t_i) log(1 − y_i)],

whereas the focal loss function L_Focal was

L_Focal = −(1/N) Σ_{i=1}^{N} [α t_i (1 − y_i)^γ log(y_i) + (1 − α)(1 − t_i) y_i^γ log(1 − y_i)],

where t_i represents a binary label value: if pixel i belongs to the ground truth, then t_i = 1; otherwise, t_i = 0. y_i represents the probability of the predicted pixel i belonging to the ground truth, and N is the number of pixels. Parameter α alleviates the imbalance of positive and negative sample categories in each task, while parameter γ adjusts the importance the model assigns to samples that are difficult to classify. We combined the binary cross-entropy loss and focal loss functions with weights β and 1 − β, respectively. Finally, the total loss function of the proposed network was

L = β L_BCE + (1 − β) L_Focal.
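The combined loss can be sketched numerically as follows. This is a NumPy illustration of the formulas above; β is left as a free hyperparameter since its chosen value is not restated here.

```python
import numpy as np

def bce_loss(t, y, eps=1e-7):
    """Binary cross-entropy averaged over all pixels."""
    y = np.clip(y, eps, 1 - eps)
    return -np.mean(t * np.log(y) + (1 - t) * np.log(1 - y))

def focal_loss(t, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """Focal loss [48]: down-weights easy examples via (1 - y)^gamma
    and balances classes via alpha."""
    y = np.clip(y, eps, 1 - eps)
    pos = alpha * t * (1 - y) ** gamma * np.log(y)
    neg = (1 - alpha) * (1 - t) * y ** gamma * np.log(1 - y)
    return -np.mean(pos + neg)

def total_loss(t, y, beta=0.5):
    """Weighted combination beta * BCE + (1 - beta) * Focal."""
    return beta * bce_loss(t, y) + (1 - beta) * focal_loss(t, y)

t = np.array([1.0, 1.0, 0.0, 0.0])   # ground-truth labels
y = np.array([0.9, 0.6, 0.2, 0.1])   # predicted probabilities
loss = total_loss(t, y)
```

Note that the focal term is always smaller than the plain BCE term on well-classified pixels, which is exactly how it shifts the training signal toward hard samples.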

Experiments and Results
We used the BCDD dataset to verify the performance of our proposed method. We first describe the change detection dataset used. Next, we explain the evaluation metrics and parameter settings used in the experiments. Finally, we design ablation and comparison experiments and visualize and analyze the experimental results.

Dataset
In this study, we needed a dataset that contained both pre- and post-temporal building labels and changed-building labels; therefore, we used the BCDD dataset released by Ji et al. [49] from Wuhan University. This dataset has been used by many scholars to study various change detection methods [25,27,31,33,46,47]. It includes aerial imagery, building labels, and changed-building labels in Christchurch, New Zealand, before the 2012 earthquake and after the 2016 reconstruction. The spatial resolution of the dataset was 0.075 m/pixel, and the image size was 32,507 × 15,324 pixels. We cropped the BCDD dataset into 512 × 512 pixel tiles without overlap and obtained 2400 pairs of available images after data cleaning.
We provided statistics on the ratio of building pixels to other pixels and the ratio of changed pixels to unchanged areas before and after cleaning the data, as shown in Table 1. Finally, we randomly divided them into training and validation image pairs according to a ratio of 9:1; an example of the dataset is shown in Figure 4. When training the network, we used flips, rotations, brightness, and contrast enhancements with random probability.

Evaluation Metrics
We used evaluation metrics to assess the performance of the various change detection algorithms. We evaluated the proposed model using precision, recall, F1-score, and overall accuracy (OA). A higher precision indicates fewer false detections, whereas a higher recall indicates fewer missed detections. The F1-score measures the performance of a binary classification algorithm, considering both precision and recall. The F1-score and OA comprehensively evaluate the performance of the model, with higher values representing better performance. The evaluation metrics were defined as follows:

Precision = TP / (TP + FP),
Recall = TP / (TP + FN),
F1 = 2 × Precision × Recall / (Precision + Recall),
OA = (TP + TN) / (TP + TN + FP + FN).

In the change detection task, true positive (TP) and true negative (TN) denote the number of changed and unchanged pixels detected correctly, respectively, whereas false positive (FP) denotes the number of unchanged pixels incorrectly detected as changed, and false negative (FN) denotes the number of changed pixels incorrectly detected as unchanged. In the semantic segmentation task, TP and TN represent the number of building and background pixels detected correctly, respectively, whereas FP and FN represent the number of background pixels incorrectly detected as buildings and building pixels incorrectly detected as background, respectively.
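These metrics are straightforward to compute from the four pixel counts, for example:

```python
def change_metrics(tp, tn, fp, fn):
    """Precision, recall, F1-score, and OA from pixel counts,
    as defined above."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    oa = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f1, oa

# Illustrative counts for a 1000-pixel tile (not from the paper).
p, r, f1, oa = change_metrics(tp=90, tn=880, fp=10, fn=20)
```

Note that the F1-score reduces to 2TP / (2TP + FP + FN), so it ignores the (typically dominant) true-negative count, which is why it is preferred over OA for imbalanced change detection.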

Parameter Settings
We implemented distributed data parallel (DDP) training and synchronized cross-GPU batch normalization (SyncBN) [50] in PyTorch using the NVIDIA NCCL toolkit, which ensures high-performance computation. In addition, we chose the Adam [51] optimizer and used the cosine annealing [52] algorithm as the learning rate decay method, whose oscillation helps the network escape saddle points faster. The initial learning rate was set to 1 × 10−3, while the minimum learning rate was set to 1 × 10−7 to prevent the learning rate from becoming too small. Considering GPU memory, we set the batch size to 4. Furthermore, we set parameters α and γ in the focal loss to 0.25 and 2, respectively. Finally, we ran all experiments on two NVIDIA GeForce RTX 2080 Ti GPUs with 11 GB of memory each.
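The optimizer and scheduler settings above correspond to a standard PyTorch configuration such as the following. The model here is a stand-in, the DDP/SyncBN wrappers are omitted, and the annealing period `T_max` is an assumed value not stated in the text.

```python
import torch
import torch.nn as nn

# Stand-in model; the real network is MDESNet.
model = nn.Conv2d(3, 2, 3, padding=1)

# Adam with cosine-annealed learning rate from 1e-3 down to 1e-7,
# matching the reported hyperparameters.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=100, eta_min=1e-7)

lrs = []
for _ in range(100):
    optimizer.step()      # a real loop would compute a loss and backprop first
    scheduler.step()
    lrs.append(optimizer.param_groups[0]["lr"])
```

Over one period the learning rate follows half a cosine wave, decaying smoothly from the initial value to `eta_min`.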

Ablation Study
To investigate the effect of adding the semantic segmentation branches, the FPN structure, and the scSE module on model performance, we conducted an ablation study on the BCDD dataset; the results are shown in Table 2. Without the semantic segmentation branches, adding the FPN or scSE module alone improved change detection performance by 10.66% and 6.04%, respectively, over the baseline change detection F1-score of 0.7791. When the FPN and scSE modules were added simultaneously, the performance of the model further improved, and the F1-score reached 0.8857, an increase of 10.72% over the baseline. Interestingly, after introducing the semantic segmentation branches, the change detection F1-score of the model increased by 3.58%, and we also obtained a semantic segmentation F1-score of 0.9059. On this basis, introducing the FPN or scSE module further improved model performance: the former achieved a change detection F1-score of 0.9032 and a semantic segmentation F1-score of 0.9195, whereas the latter achieved F1-scores of 0.8774 and 0.9297 on the two tasks, respectively. Finally, when we applied all three to the baseline model, the F1-scores for change detection and semantic segmentation reached their highest values of 0.9124 and 0.9441, respectively. Figure 5 illustrates the results of the ablation study. The first, third, and fifth rows represent new factories, subtle changes, and new dense buildings, respectively; the second, fourth, and sixth rows show the corresponding magnified details. Red and blue pixels in the results represent false and missed detections, respectively, while white and black represent changed and unchanged pixels, respectively. Before adding the semantic segmentation branch (Figure 5d-g), we observed many false and missed detections in the baseline model. However, adding the FPN and scSE reduced false and missed detections, respectively.
Furthermore, adding the FPN and scSE simultaneously greatly reduced both false and missed detections. After adding the semantic segmentation branch (Figure 5h-k), we found that combining the three methods (Figure 5k) on the baseline model yielded the lowest false and missed detection rates, maximizing the ability to discriminate between building features of different scales. This configuration obtained more complete and regular boundaries and ultimately improved the comprehensive performance of the model. Through the comparison of performance and visual effects, we confirmed the applicability and effectiveness of our proposed model and method.

Comparative Study of Similarity Measures
A suitable similarity measure facilitates the identification of pseudo-changes by the network. To verify the effectiveness of the proposed FDE module, we changed only the feature similarity measure in MDESNet and chose the following four similarity measures:
(a) Concatenation [27]: We concatenated the bitemporal features and then used three consecutive 3 × 3 convolutional layers, which reduced the channels, to extract the change information from the concatenated features, following FC-Siam-conc.
(b) Difference [27]: We subtracted the bitemporal features along the corresponding channel dimension and used the absolute value of the difference as the changed feature.
(c) Normalized difference [53]: Based on the difference, we performed a further normalization operation.
(d) Local similarity attention module [31]: In this module, we extracted the similarity attention (SA) value from the input feature maps using the cosine distance and then multiplied the SA element-wise with the bitemporal feature maps. Finally, we concatenated the bitemporal feature maps and applied a 3 × 3 convolutional layer to adjust the number of channels, yielding the changed features.
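Measures (b)-(d) can be sketched at the tensor level as follows. The learned convolutional layers of (a) and (d) are omitted for brevity, and the normalization in (c) is one plausible choice, since [53] may define it differently.

```python
import torch
import torch.nn.functional as F

def diff_measure(f0, f1):
    """(b) Difference: absolute channel-wise subtraction."""
    return torch.abs(f1 - f0)

def norm_diff_measure(f0, f1, eps=1e-6):
    """(c) Normalized difference: difference rescaled to [0, 1)
    (an assumed min-max normalization for illustration)."""
    d = torch.abs(f1 - f0)
    return (d - d.amin()) / (d.amax() - d.amin() + eps)

def cosine_sa(f0, f1):
    """Similarity-attention term of (d): per-pixel cosine distance
    between the two feature vectors, in [0, 2]."""
    return 1 - F.cosine_similarity(f0, f1, dim=1, eps=1e-6)

f0, f1 = torch.randn(1, 8, 4, 4), torch.randn(1, 8, 4, 4)
```

In contrast, the FDE module uses the raw (signed) element-wise subtraction only to build an attention map, and lets the concatenation branch carry the full bitemporal information.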
We calculated the OA, precision, recall, and F1-score for the five methods, as shown in Table 3. The proposed FDE module performed best in terms of OA (0.9874), precision (0.9264), and F1-score (0.9124), exceeding the second-best method by 0.26%, 4.54%, and 1.40%, respectively. The best recall was achieved by local similarity attention, which reached 0.9571, whereas that of our method was only 0.8988. The F1-scores of the difference and normalized difference were not markedly different at 0.8984 and 0.8974, respectively, which were 1.40% and 1.50% lower than that of the FDE module. The F1-scores of the concatenation and local similarity attention modules were 0.8767 and 0.8781, which were 3.57% and 3.43% lower than that of the FDE module, respectively. This confirmed that the FDE module achieved the best F1-score among the five methods and improved the comprehensive performance of the model. The highest precision also indicated that the FDE method produced the fewest false detections and was more resistant to pseudo-changes. In the FDE module, inspired by the scSE module, we compressed the channels after subtracting the bitemporal features and used the sigmoid activation function to obtain the feature difference attention maps. We visualized the feature difference attention maps at the four scales to analyze the principle of the FDE module. Figure 6 shows differences in the emphasis of the feature difference attention maps at different scales. The attention map at a resolution of 16 × 16 pixels contained rich low-frequency semantic information, where the FDE module focused on enhancing semantic differences. As the resolution increased, the center of gravity of the enhancement gradually shifted to spatial information owing to the abundant high-frequency spatial details of the attention maps.
Finally, we found that the FDE module enhanced the feature differences at multiple scales, improving the comprehensive performance of the model.
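Based on the description above, the FDE computation can be sketched roughly as follows. This is a minimal NumPy sketch under our assumptions (a single-channel attention map standing in for the learned channel compression, and element-wise multiplication with the concatenated features); it is not the exact published implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fde_module(f1, f2, w):
    """Sketch of the FDE similarity measure.
    f1, f2 : bitemporal features, shape (C, H, W)
    w      : (C,) weights standing in for a learned 1x1 channel-compression conv
    """
    diff = np.abs(f1 - f2)                    # per-channel feature difference
    squeezed = np.tensordot(w, diff, axes=1)  # compress C channels -> (H, W)
    attn = sigmoid(squeezed)                  # feature-difference attention map
    fused = np.concatenate([f1, f2], axis=0)  # concatenation branch
    return fused * attn                       # enhance changed regions
```

With identical bitemporal inputs the difference is zero, the attention map is uniformly sigmoid(0) = 0.5, and no region is preferentially enhanced; large local differences push the attention toward 1 and amplify the changed regions.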

Comparative Study Using Other Methods
To evaluate the performance and competitiveness of MDESNet, we selected several popular change detection models for comparative experiments, as follows:
(a) FC-EF [27]: This model concatenated the two temporal images into a six-channel input and used a U-shaped structure with skip connections. The change features were extracted from the fused image, and the change result was finally obtained using the softmax function.
(b) FC-Siam-conc [27]: This model was an extension of FC-EF that used a Siamese network with the same structure and shared weights as the encoder. The extracted bitemporal features were concatenated and then input into the decoder with skip connections to obtain the change results.
(c) FC-Siam-diff [27]: This model was very similar to FC-Siam-conc; the difference was that the absolute value of the difference of the extracted bitemporal features was computed and then input into the decoder with skip connections to obtain the change results.
(d) ChangeNet [30]: This model was proposed to detect changes in street scenes. The change features at three scales were sampled to the same scale and summed, and the softmax function was used to obtain the change results, which located and identified the changes between image pairs.
(e) DASNet [25]: Its core was the use of spatial attention and channel attention to obtain more abundant discriminative features. Unlike other methods, this method output a distance map and used a threshold algorithm to obtain the final change results.
(f) SNUNet-CD [54]: This method used a densely connected Siamese network, similar to UNet++, which mitigates the loss of deep location information in neural networks. It employed the ensemble channel attention module (ECAM) to extract representative features at different semantic levels and obtain the change results.
(g) DTCDSCN [33]: This model was also a multitask Siamese network with two semantic segmentation branches and a change detection branch, similar to the proposed MDESNet. Its decoder structure was similar to that of FC-Siam-diff, except that it added the scSE module to improve the feature representation capability.
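The distinction between early fusion (FC-EF) and Siamese fusion (FC-Siam-conc/diff) can be illustrated with a toy NumPy example. The one-layer stand-in "encoder" and all shapes here are purely illustrative assumptions, not the original architectures:

```python
import numpy as np

def encoder(x, w):
    # Stand-in for a convolutional encoder: a shared linear map over
    # channels followed by a ReLU. Real models use deep U-shaped CNNs.
    return np.maximum(0.0, np.tensordot(w, x, axes=1))

rng = np.random.default_rng(42)
img_t1 = rng.standard_normal((3, 16, 16))  # RGB image at time 1
img_t2 = rng.standard_normal((3, 16, 16))  # RGB image at time 2

# FC-EF (early fusion): concatenate the images into a 6-channel input first.
w_ef = rng.standard_normal((8, 6))
feat_ef = encoder(np.concatenate([img_t1, img_t2], axis=0), w_ef)

# FC-Siam-* (Siamese): encode each image with the SAME shared weights, then fuse.
w_siam = rng.standard_normal((8, 3))
feat_t1 = encoder(img_t1, w_siam)
feat_t2 = encoder(img_t2, w_siam)
fused_conc = np.concatenate([feat_t1, feat_t2], axis=0)  # FC-Siam-conc
fused_diff = np.abs(feat_t1 - feat_t2)                   # FC-Siam-diff
```

Early fusion mixes the two dates before any features exist, whereas the Siamese variants compare features extracted under identical weights, which is what makes explicit similarity measures such as the FDE module applicable.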
To ensure fairness and effectiveness of the comparative study and consistency with MDESNet, we replaced the backbone networks used by ChangeNet, DASNet, and DTCDSCN with ResNeSt-50. Of note, FC-EF, FC-Siam-conc, FC-Siam-diff, and SNUNet-CD do not use backbone networks such as VGG or ResNet; therefore, these models remained unchanged. The other parameters were consistent with those in the respective studies. In addition, we trained all models on the divided BCDD dataset under the same experimental environment. Table 4 shows that the proposed network outperformed the other models in terms of accuracy, precision, and F1-score, reaching 0.9874, 0.9264, and 0.9124, respectively, which corresponded to improvements of 0.72%, 5.89%, and 2.96%, respectively. These results also verified the validity of the FDE and MSFF modules. DASNet achieved the highest recall of 0.9266, which was 2.78% higher than that of the proposed network. The proposed method did not achieve the best recall because it focused more on reducing false detections caused by pseudo-changes; however, the highest F1-score verified that our proposed method had the best comprehensive performance.
Figure 7 shows a part of the experimental results that more intuitively demonstrates the superior performance of our proposed network. The first and second rows demonstrate that our model recovered more regular boundaries and tighter interiors when recognizing large and small buildings, respectively. In the third and fourth rows, although the bitemporal images contained impermeable pavements and containers similar to roof materials, our proposed network recognized them as unchanged pixels, indicating its ability to resist pseudo-changes and noise. The last two rows show that our model identified and located imperceptible changes far better than the other models.
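For reference, the OA, precision, recall, and F1-score reported in Tables 3 and 4 follow the standard pixel-level definitions computed from the confusion counts (TP, FP, FN, TN):

```python
def change_detection_metrics(tp, fp, fn, tn):
    """OA, precision, recall, and F1-score from pixel-level confusion counts."""
    oa = (tp + tn) / (tp + fp + fn + tn)        # overall accuracy
    precision = tp / (tp + fp)                  # fraction of predicted changes that are real
    recall = tp / (tp + fn)                     # fraction of real changes detected
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return oa, precision, recall, f1
```

Because changed pixels are rare relative to the unchanged background, OA is dominated by true negatives; this is why the F1-score, which ignores TN, is the primary indicator of comprehensive performance throughout the comparison.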

The Effect of Semantic Segmentation Branches
Table 2 shows the effect of adding the semantic segmentation branch, the FPN structure, and the scSE module on the performance of the baseline model. To compare the performance of the model before and after adding the semantic segmentation branch more clearly, we plotted a histogram, as shown in Figure 8. We found that adding the semantic segmentation branch to different baseline models improved the F1-score of change detection.
We also selected several semantic segmentation models for dense prediction for comparison with MDESNet: UNet [55], UNet++ [56], PSPNet [57], DeepLabv3+ [58], DTCDSCN [33], and FarSeg [45]. The average experimental results for the two branches in the same environment are shown in Table 5. The accuracy, precision, recall, and F1-score of the semantic segmentation branch of the proposed model were 0.9792, 0.9485, 0.9397, and 0.9441, respectively. Compared with other models, the performance of MDESNet in the semantic segmentation task was at a moderate level. This was consistent with the failure of the semantic segmentation task in DTCDSCN to achieve a good F1-score. Although adding a semantic segmentation branch constrains the encoder toward extracting building features and achieves better change detection results, it also restricts the accuracy of the semantic segmentation model to a certain extent. This was a limitation of the present study. In the future, further research should be conducted on how to achieve satisfactory results for both the main task of change detection and the auxiliary task of semantic segmentation.

The Effect of the Value of β in Loss Function
We designed several experiments to determine the value of β in the loss function that achieves the best performance. Figure 9 shows the effect of different β values on the overall performance of the network. When β was 5/6, the performance of the network was optimal, achieving both the highest change detection F1-score (0.9124) and the highest semantic segmentation F1-score (0.9441). When only the focal loss or only the binary cross-entropy loss was applied, the performance of the network was slightly lower, indicating that the combination of the two losses was beneficial to network performance.
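A minimal sketch of such a weighted combination is given below, assuming L = β·L_focal + (1 − β)·L_bce with the standard (α-free) focal loss; the exact formulation used in the paper may differ:

```python
import numpy as np

def bce_loss(p, y, eps=1e-7):
    # Binary cross-entropy over predicted probabilities p and labels y.
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p)).mean()

def focal_loss(p, y, gamma=2.0, eps=1e-7):
    # Focal loss: down-weights easy examples by the factor (1 - pt)^gamma.
    p = np.clip(p, eps, 1 - eps)
    pt = np.where(y == 1, p, 1 - p)  # probability assigned to the true class
    return (-((1 - pt) ** gamma) * np.log(pt)).mean()

def combined_loss(p, y, beta=5/6):
    # beta = 5/6 gave the best F1-scores in the experiments; the linear
    # weighting form here is our assumption from the text.
    return beta * focal_loss(p, y) + (1 - beta) * bce_loss(p, y)
```

Setting β = 1 recovers the pure focal loss and β = 0 the pure binary cross-entropy, the two degenerate cases whose slightly lower performance motivates the combination.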

Figure 10 shows the number of parameters of the proposed model. Compared with DTCDSCN, a multitask detection model of the same type, our model reduced the number of parameters by 4.278 M. In addition, compared with other single-task change detection models and the best-performing semantic segmentation model (UNet++), our method had the highest change detection F1-score and the lowest number of parameters, indicating the advantage of the multitask model in terms of the number of parameters. A future research direction will be to further reduce the number of parameters of the multitask change detection model while retaining the network performance.

Figure 10. Comparison of the number of parameters between MDESNet, the multitask model DTCDSCN, and 2 single-task models (change detection models and UNet++).

Conclusions
In this study, we proposed MDESNet, a multitask difference-enhanced Siamese network that outputs semantic information and change information end-to-end, for building change detection in high-resolution remote sensing images. By combining the Siamese network and the FPN structure, we generated bitemporal multiscale feature maps that fully incorporate contextual information. The semantic segmentation branches add constraining information to the encoder for the extraction of ground features, which has a positive impact on the change detection branch. To enhance the changed regions in the feature maps and improve interclass separability, we proposed the FDE module as a similarity measure for bitemporal multiscale features. Our proposed network achieved the highest F1-score (0.9124) and OA (0.9874) on the BCDD dataset, with better overall performance and resistance to pseudo-changes compared with other state-of-the-art methods. Likewise, compared with a multitask model of the same type, our method achieved higher accuracy with fewer parameters. However, a disadvantage is that its performance in the semantic segmentation branch was only moderate.
In the future, we will validate the proposed method using other public datasets and conduct further research on the relationship and mutual influence between semantic segmentation and change detection.

Data Availability Statement: Publicly available datasets were analyzed in this study. The dataset can be found here: BCDD: http://study.rsgis.whu.edu.cn/pages/download/building_dataset.html, accessed on 25 July 2022.
Acknowledgments: Our team is grateful to those who have published data and classic models; it is their selflessness that provides an open research environment for the community.

Conflicts of Interest: The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: