Building Damage Detection Using U-Net with Attention Mechanism from Pre- and Post-Disaster Remote Sensing Datasets

Abstract: Building damage status is vital for planning rescue and reconstruction after a disaster, yet it is hard to detect and its level is hard to judge. Most existing studies focus on binary classification, and the attention of the model is distracted. In this study, we propose a Siamese neural network that can localize and classify damaged buildings at one time. The main parts of this network are a variety of attention U-Nets using different backbones. The attention mechanism enables the network to pay more attention to the effective features and channels, so as to reduce the impact of useless features. We train them using the xBD dataset, a large-scale dataset for the advancement of building damage assessment, and compare their balanced F (F1) scores. The scores show that SEResNeXt with an attention mechanism performs best among the single models, with an F1 score of 0.787. To improve the accuracy, we fused the results and obtained the best overall F1 score of 0.792. To verify the transferability and robustness of the model, we selected datasets of two recent disasters from the Maxar Open Data Program to investigate its performance. By visual comparison, the results show that our model is robust and transferable.


Introduction
In the past 20 years, natural disasters have claimed one million lives and caused more trauma, displacement and loss of families and livelihoods [1]. Building damage is the main type of disaster damage; it is used to estimate the spatial distribution of the affected population [2] and is essential for emergency management professionals, helping them direct rescue teams to the right locations in a short time [2]. Remote sensing data have been shown to yield accurate building damage information in a short time [2][3][4], at low cost and with a wide field of view.
Remote sensing (RS) is widely used for disaster assessment and the detection of damaged buildings [2][3][5][6][7][8][9][10][11][12]. The most frequently used remote sensing images are optical and synthetic aperture radar (SAR) data. SAR data are less affected by weather conditions and have gradually come into use for emergency and disaster assessment. Both the backscatter products and the phase data of SAR can be used to detect damaged buildings [8,13]. Compared with optical images, however, SAR data are more complicated to process. Although SAR is not susceptible to interference from shadows and clouds, it has considerable noise, blurry boundaries, no color and less information than multispectral images [14].
Optical remote sensing images can directly reflect the real surface information and are the primary data source in the early stage of remote sensing seismic damage assessments.
[...] damaged buildings, it is undeniable that the global features of the image contribute to the classification task, because the scale of a disaster is larger than that of a building. At the same time, according to the first law of geography [25], the parts closer to the target building should get more attention when judging the target type. The embodiment of this idea is the attention mechanism. The attention mechanism was proposed in 2014 [26]. That work pointed out that human beings selectively focus their attention on each part of the visual space to obtain information at the required time and place and combine the information from different gazes over time to establish an internal representation of the scene. The attention mechanism applies this human trait to neural networks. Informally, it gives a neural network the ability to focus on a subset of its inputs (or features): it selects specific inputs. This helps the model reduce the impact of useless information and increases the contribution of useful information to the results. Attention can be categorized into hard and soft attention. It is a general idea that does not depend on a specific framework, but it is currently used mainly in combination with the Encoder-Decoder framework. R. Liu et al. introduced the attention mechanism in change detection and gave weights to features of different times to enhance the changed information of the images, which significantly improved the results [19]. H. Hao et al. proposed the Siam-U-Net-Attn model, which achieved a 0.70 damage F1 score and a 0.73 localization F1 score on the xBD dataset [27]. In the task of classifying hyperspectral remote sensing images, L. Mou et al. added a spectral attention module to the CNN, which made the model selectively emphasize the useful bands and suppress the less useful ones [28].
In this paper, we added a soft attention mechanism to a Siamese U-Net to explore how different backbones perform in assessing the degree of building damage. This study had three objectives:
1. Explore the performance of multiple artificial neural networks in detecting different damage levels of buildings using both pre- and post-disaster satellite data.
2. Compare the fusion results of different networks with the result of a single network.
3. Evaluate the transferability and robustness of the overall model.
For objectives 1 and 2, the xBD dataset [29] is used. As the distribution of samples across the four damage levels is imbalanced, two balancing strategies are adopted, random under-sampling and a cost-sensitive strategy, which are introduced in Section 2.2.3. For evaluation, the F1 score was used in this study. For objective 3, datasets of the Beirut explosion and Hurricane Laura are used. The rest of this paper is organized as follows. Section 2 describes the datasets and the proposed method. Section 3 presents the experimental results, and the discussion is in Section 4. Finally, the conclusions are drawn in Section 5.

xBD Dataset
xBD is a large-scale dataset for building damage assessment, used to advance research on humanitarian assistance and disaster recovery. It contains 850,736 building annotations and covers 45,362 km² of imagery [29]. xBD provides building polygons, labels of damage levels (Figure 1) and satellite images before and after various disaster events.
However, the number of samples per class in the xBD dataset is seriously unbalanced. After counting the post-disaster data of the training set (including train and tier3), the proportion of each class in relation to the category of Destroyed Buildings is shown in Table 1. For this reason, we preprocessed the data before training; the specific operation is described in Section 2.2.3.
The four damage levels are described as follows.
0 (No Damage): Undisturbed. No sign of water, structural or shingle damage, or burn marks.
1 (Minor Damage): Building partially burnt, water surrounding structure, volcanic flow nearby, roof elements missing, or visible cracks.
2 (Major Damage): Partial wall or roof collapse, encroaching volcanic flow, or surrounded by water/mud.
3 (Destroyed): Scorched, completely collapsed, partially/completely covered with water/mud, or otherwise no longer present.
In the process of adjusting the model, the xBD train and tier3 datasets were used for training. We wanted to ensure more training samples and more validation data, but both take up more training time. Taking these factors into account, the dataset was randomly divided into 90% training data and 10% validation data, and the test dataset in xBD was used for verification.

Instance Data
To verify the transferability and robustness of the model, we selected two disasters outside xBD and applied our method to them. We chose these two disasters because they occurred relatively recently, the data were available and they are two different types of disaster. In particular, the explosion in Beirut was further evaluated by the Copernicus Emergency Management Service (CEMS). The affected buildings were divided into four categories, and the classification mapping to xBD is shown in Table 2. The details of these two disasters and the image data involved are shown in Table 3. The remote sensing image data are provided by the Maxar/DigitalGlobe Open Data Program (https://www.maxar.com/open-data (accessed on 24 November 2020)) [30].
The data are preprocessed as follows. First, the two images before and after the disaster are geo-referenced. Second, they are cut according to the area of interest; in selecting the area of interest, we try to avoid clouds. Third, because the resolutions of the two images differ, they are resampled to the resolution of the pre-disaster image (0.3 m × 0.3 m) to ensure the correspondence of pixel positions in space. Fourth, the two images are cropped to a size of 1024 × 1024.
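The third and fourth preprocessing steps can be illustrated with a minimal sketch of the pixel bookkeeping only. The function names, the 0.5 m source resolution and the choice of non-overlapping tiles that drop partial edge tiles are assumptions made for illustration; the actual resampling of pixel values would be done with a GIS library and is not shown.

```python
def resampled_size(width, height, src_res_m, dst_res_m=0.3):
    """Pixel dimensions after resampling from src_res_m to dst_res_m metres per pixel."""
    scale = src_res_m / dst_res_m
    return round(width * scale), round(height * scale)


def tile_boxes(width, height, tile=1024):
    """Return (left, top, right, bottom) boxes of full tiles; partial edge tiles are dropped."""
    boxes = []
    for top in range(0, height - tile + 1, tile):
        for left in range(0, width - tile + 1, tile):
            boxes.append((left, top, left + tile, top + tile))
    return boxes


# Hypothetical post-disaster image at 0.5 m per pixel, resampled to the 0.3 m grid.
width, height = resampled_size(1500, 1300, src_res_m=0.5)
boxes = tile_boxes(width, height)
print(width, height, len(boxes))
```

Applying the same boxes to the pre- and post-disaster images keeps each pair of 1024 × 1024 crops aligned pixel by pixel.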

Methods
Convolutional neural networks (CNNs) can process data in the form of multiple arrays [31]. To classify a pixel, the CNN-based segmentation method uses the pixel block in a fixed-size window centered on that pixel as the input of the CNN. This method has several disadvantages. Firstly, the storage space required is large. Secondly, the calculation efficiency is low. Thirdly, the window's size limits the extent of the receptive field. Usually, the window is much smaller than the whole image, so only some local features can be extracted, which limits the classification performance. To overcome these shortcomings, the fully convolutional network (FCN) came into being. The FCN is a special type of CNN that can recover the category of each pixel from the abstract features; that is, it extends classification from the image level to the pixel level. In this paper, U-Net [32], a kind of FCN, is used. Essentially, convolution is a feature fusion over a local area that fuses features across the spatial and channel dimensions. The core calculation of a convolutional neural network is the convolution operator, which learns a new feature map from the input feature map through the convolution kernel. Different backbones are used to try to strengthen our model. All backbones used are the residual network and its variants.

Proposed Framework
This section introduces the details of the overall framework. As shown in Figure 2, the proposed model is a Siamese neural network divided into two parts that share the same weights. One part takes the pre-disaster images and is used to localize buildings, and the other part is used to classify the buildings' damage levels. During training, the localization part is trained first, and its weights serve as the pre-trained weights of both parts. This step is marked as 1 in Figure 2. Then, the pre- and post-disaster image pairs are augmented and input into the network simultaneously for training. This step is marked as 2 in Figure 2. Finally, we get an end-to-end damaged building detection network. This is an example of transfer learning: a model trained on one problem can be reused as the initialization of a model for a similar problem [33]. To further improve the accuracy, we post-process the output of the end-to-end network, using the building mask generated by the building localization network to remove some non-building pixels from the result. This step is marked as 3 in Figure 2.

Attention U-Net
With the attention mechanism, when selecting information, the network calculates the weighted average of the N inputs and then passes it on to the next block. We add the attention mechanism to the decoder part of our U-Net. The specific architecture is shown in Figure 3.
In the original decoder of the U-Net in Figure 3, the output of the previous decoder layer is directly concatenated with the output of the corresponding encoder layer to form the input of the next decoder layer. After the attention block is added, the encoder output is first processed by the attention gate, which is shown in Figure 4, and then enters the next decoder layer, which expresses the spatial attention.

Remote Sens. 2021, 13, x FOR PEER REVIEW 6 of 22
Figure 2. The overall framework. The localization U-Net is used to locate buildings. After pre-training with pre-disaster images, it shares its weights with the classification U-Net, which is used to classify the damage level. The combination of the two is the Siamese neural network. We train it with three random seeds. After inputting pre- and post-disaster images into the Siamese neural network, we get three five-channel classification results. The first channel is the result of localization, and the last four channels are the probabilities of each damage level at each location. In the fusion step, the results corresponding to the three seeds are first weighted and averaged, with the weights determined by the validation accuracy during training. Then, a threshold is used to determine the value of each pixel. Finally, to improve the accuracy, the pre-trained classification U-Net localizes the buildings again, and the resulting localization is used to double-check the buildings' locations.
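The fusion step described in the Figure 2 caption can be sketched for a single pixel as follows. This is only an illustration under stated assumptions: the weights are taken to be proportional to each seed's validation score, the threshold of 0.5 is applied to the localization channel, and the damage level is chosen as the largest of the four fused damage probabilities. The function name and all numeric values are hypothetical.

```python
def fuse_pixel(seed_probs, val_scores, threshold=0.5):
    """Fuse the 5-channel outputs of several seeds for one pixel.

    seed_probs: one [localization, level1, level2, level3, level4] list per seed.
    val_scores: one validation score per seed, used as the fusion weight (assumption).
    Returns 0 for background or 1-4 for the damage level.
    """
    total = sum(val_scores)
    weights = [s / total for s in val_scores]
    # Weighted average of every channel across the seeds.
    fused = [sum(w * p[c] for w, p in zip(weights, seed_probs)) for c in range(5)]
    if fused[0] < threshold:
        return 0  # not a building
    damage = fused[1:]
    return 1 + damage.index(max(damage))


# Hypothetical outputs of three seeds for one building pixel.
seeds = [
    [0.9, 0.1, 0.2, 0.6, 0.1],
    [0.8, 0.2, 0.1, 0.5, 0.2],
    [0.7, 0.1, 0.1, 0.7, 0.1],
]
print(fuse_pixel(seeds, val_scores=[0.78, 0.79, 0.77]))
```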

Figure 3. Structure of the U-Net with the attention mechanism [34]. The encoder downsamples four times. Symmetrically, the decoder upsamples four times to restore the features to the original image resolution. The attention gate is placed at the end of the skip connection.
Figure 4. Attention gate structure [34]. F, H and W stand for channel, height and width, respectively, and D is the depth of the 3D data block. $x^l$, the feature map from the encoder layer, is scaled with the attention coefficients (α), which are computed from $x^l$ and g. The previous decoder features in g are added to $x^l$ to determine the focus regions; the value of the attention coefficients stays between 0 and 1 throughout training.
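The attention gate of Figure 4 can be sketched for a single pixel as an additive gate in the style of [34]. The sketch assumes that the gating signal g has already been brought to the resolution of the encoder feature, that the 1 × 1 convolutions reduce to per-pixel matrix products, and that biases are omitted; the names W_x, W_g and psi are illustrative.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def attention_gate(x, g, W_x, W_g, psi):
    """Scale the encoder feature x by a coefficient alpha computed from x and g.

    x: encoder feature vector at one pixel, g: decoder (gating) vector at the same pixel.
    Returns (alpha, gated_x) with 0 < alpha < 1.
    """
    # Project both inputs to a common space, add them and apply ReLU.
    q = [max(0.0, a + b) for a, b in zip(matvec(W_x, x), matvec(W_g, g))]
    # Collapse to one scalar and squash it with a sigmoid.
    score = sum(p * v for p, v in zip(psi, q))
    alpha = 1.0 / (1.0 + math.exp(-score))
    return alpha, [alpha * v for v in x]


identity = [[1.0, 0.0], [0.0, 1.0]]
alpha, gated = attention_gate([1.0, 2.0], [0.5, -1.0], identity, identity, [1.0, 1.0])
print(alpha, gated)
```

Because the sigmoid bounds α between 0 and 1, the gate can only attenuate encoder features, which is how less relevant regions are suppressed before the skip connection reaches the decoder.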

Data Augmentation
In machine learning, the ideal situation is that the number of samples in each class is roughly the same. However, in most real scenes, the category distributions are uneven [35]. This research uses a variety of data balancing strategies. The first is the under-sampling strategy: each 1024 × 1024 image is randomly cropped multiple times with a crop size of 512 × 512, and the crop with the largest sum of pixel values is selected to reduce the sampling frequency of non-buildings. The second is a cost-sensitive strategy: when constructing the loss function, the loss of each category and the total loss are combined, and they are given different weights according to the proportions of the categories.
In addition, to enhance the robustness of the model, we randomly apply flips, rotations, translations, side-view transformations and zooms to the input images; adjust their saturation, contrast and brightness; convert the color space and band order; and randomly add Gaussian noise and filtering operations.
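The two balancing strategies can be sketched as follows. Two assumptions are made for illustration: the "sum of pixel values" is taken over the label mask (so crops with more building pixels win), and the class weights are inverse class frequencies normalized to a mean of 1. The function names and the toy sizes are hypothetical.

```python
import random


def best_crop(mask, crop, tries, rng):
    """Try several random crops of a 2D mask and keep the one with the largest pixel sum."""
    height, width = len(mask), len(mask[0])
    best_sum, best_origin = -1, None
    for _ in range(tries):
        top = rng.randint(0, height - crop)
        left = rng.randint(0, width - crop)
        total = sum(sum(row[left:left + crop]) for row in mask[top:top + crop])
        if total > best_sum:
            best_sum, best_origin = total, (top, left)
    return best_origin, best_sum


def inverse_frequency_weights(counts):
    """Give rarer classes larger loss weights; weights are normalized to a mean of 1."""
    total = sum(counts)
    raw = [total / c for c in counts]
    mean = sum(raw) / len(raw)
    return [r / mean for r in raw]


# Toy 8 x 8 mask with buildings in the lower-right corner, cropped to 4 x 4.
mask = [[1 if r >= 4 and c >= 4 else 0 for c in range(8)] for r in range(8)]
origin, pixel_sum = best_crop(mask, crop=4, tries=10, rng=random.Random(0))
weights = inverse_frequency_weights([6, 3, 1])
print(origin, pixel_sum, weights)
```

In the paper's setting, the mask would be 1024 × 1024 and the crop 512 × 512; the weights would then multiply the per-class loss terms.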

Backbones
• ResNet-34 backbone
The first backbone used is ResNet-34 [36], a residual network (ResNet) pre-trained on the ImageNet [37] dataset. The ability of a CNN to retrieve relevant information from images improves as the network depth increases [38]. However, if the network is too deep, it will suffer from gradient explosion and network degradation. Residual connections [36] solved this problem by adding the input of a block to its output through a shortcut, so that each block only has to learn a residual. Figure 5 shows the structure of a residual connection.
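A residual connection can be illustrated with a minimal sketch on a plain feature vector. The function names are hypothetical, and `transform` stands in for the stacked convolution layers of a real block.

```python
def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def residual_block(x, transform):
    """Return relu(transform(x) + x); transform must preserve the length of x."""
    fx = transform(x)
    return relu([f + xi for f, xi in zip(fx, x)])


# If the learned transform outputs zeros, the block passes a non-negative input through unchanged.
print(residual_block([1.0, 2.0], lambda v: [0.0] * len(v)))
```

The shortcut means a block can fall back to the identity mapping, which is why adding more blocks does not have to degrade the network.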

• Squeeze-and-Excitation Networks (SENet) backbone
For convolution operations, a large part of the work is to improve the receptive field, that is, to fuse more features spatially or to extract multi-scale spatial information, as in the multi-branch structure of the Inception network [39]. For feature fusion along the channel dimension, the convolution operation basically defaults to fusing all channels of the input feature map. The group convolution and depth-wise separable convolution in the MobileNet network group the channels mainly to make the model more lightweight and reduce the amount of calculation. The innovation of SENet is to focus on the relationship between channels, in the hope that the model can automatically learn the importance of different channel features; SENet can thus be regarded as a channel-wise attention mechanism. To this end, SENet proposes the Squeeze-and-Excitation (SE) module, as shown in Figure 6.
Figure 6. Squeeze-and-Excitation (SE) module [39]. The module is mainly composed of three parts: squeeze, excitation and scale. $F_{sq}(\cdot)$ represents the squeeze transformation, $F_{ex}(\cdot, w)$ represents the excitation transformation and $F_{scale}(\cdot)$ represents the scale transformation.
The core module of SENet is divided into three parts: squeeze, excitation and scale. The squeeze part compresses the features to 1 × 1 × channels in the spatial dimension, which represents the global distribution of the channels. The excitation part is a gating mechanism that produces channel-wise weights $W \in \mathbb{R}^{c_2 \times c_2}$. The scale part uses the learned weights to reweight the importance of each channel, building attention over the channels.
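The three parts of the SE module can be sketched on a tiny feature map. The sketch assumes that squeeze is global average pooling and that excitation is two fully connected layers (a reduction followed by an expansion) with ReLU and sigmoid activations and no biases; the names W1 and W2 and the toy values are illustrative.

```python
import math


def se_block(feature, W1, W2):
    """Apply squeeze, excitation and scale to a list of C channels (each an H x W nested list)."""
    # Squeeze: one global average per channel.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature]
    # Excitation: reduce, ReLU, expand, sigmoid.
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in W1]
    s = [1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(row, hidden)))) for row in W2]
    # Scale: reweight every channel by its learned importance.
    scaled = [[[s_c * v for v in row] for row in ch] for s_c, ch in zip(s, feature)]
    return s, scaled


feature = [
    [[1.0, 1.0], [1.0, 1.0]],  # channel 0
    [[3.0, 3.0], [3.0, 3.0]],  # channel 1
]
W1 = [[1.0, 1.0]]    # 2 channels reduced to 1 hidden unit
W2 = [[0.0], [1.0]]  # 1 hidden unit expanded back to 2 channels
s, scaled = se_block(feature, W1, W2)
print(s)
```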

• SEResNeXt backbone
SEResNeXt is a model obtained by applying the SE module to the residual block in ResNeXt. In effect, ResNeXt turns the single residual path in a ResNet block into multiple residual paths. The success of the Visual Geometry Group Net (VGGNet) [40] and ResNet shows that stacking blocks of the same shape can reduce the number of hyperparameters and achieve state-of-the-art (SOTA) results. The practice represented by GoogleNet and Inception also shows that a fine network design through the split-transform-merge strategy can achieve very good results. ResNeXt combines these two ideas. It does not design each branch separately as the GoogleNet series does but simply repeats the same substructure, as shown in Figure 7, so that split-transform-merge is achieved without much increase in the hyperparameters.

(a) (b) (c) Figure 7. ResNeXt structure [41]. In the figure above, the structure of (a) is the original structure of ResNeXt, and (b,c) are equivalent representations of the structure of (a) in an actual implementation, the structure of (c) which is relatively simple to implement, and the basic block of ResNeXt is realized through the form of grouped convolution. ResNeXt structure [41]. In the figure above, the structure of (a) is the original structure of ResNeXt, and (b,c) are equivalent representations of the structure of (a) in an actual implementation, the structure of (c) which is relatively simple to implement, and the basic block of ResNeXt is realized through the form of grouped convolution.
SEResNeXt is obtained by adding the SE module to the residual block in ResNeXt. The structure of a single residual block combined with the SE module is shown in Figure 8.
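To make the role of grouped convolution more concrete, the following sketch counts the weights of a bottleneck block with and without groups. The channel widths and the group count of 32 are assumptions borrowed from a commonly cited ResNeXt configuration, not values reported for the backbones in this study, and the helper names are hypothetical.

```python
def conv_params(in_ch, out_ch, kernel, groups=1):
    """Number of weights of a 2D convolution without bias."""
    if in_ch % groups or out_ch % groups:
        raise ValueError("channel counts must be divisible by groups")
    # Each output channel only sees in_ch / groups input channels.
    return kernel * kernel * (in_ch // groups) * out_ch

def bottleneck_params(in_ch, width, groups):
    """1 x 1 reduce, 3 x 3 (possibly grouped) transform, 1 x 1 expand."""
    return (conv_params(in_ch, width, 1)
            + conv_params(width, width, 3, groups)
            + conv_params(width, in_ch, 1))

# Assumed example widths: a plain bottleneck versus a 32-group bottleneck.
resnet_block = bottleneck_params(256, 64, groups=1)
resnext_block = bottleneck_params(256, 128, groups=32)
```

Under these assumed widths, the grouped block has twice as many channels in its 3 × 3 layer while keeping roughly the same number of weights, which is the trade-off that grouped convolution makes possible.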

• Dual Path Net (DPN) backbone
DPN is a convolutional network structure that combines the advantages of ResNet and the Dense Convolutional Network (DenseNet). By revealing the equivalence of ResNet and DenseNet, the authors found that ResNet supports feature reuse, while DenseNet supports the exploration of new features. To integrate the benefits of these two path topologies, DPN aggregates the functions of the two.
In Figure 9, the stage outputs of the DenseNet path on the left and the ResNeXt path on the right are added together. The sum is then processed by a 3 × 3 convolution and a 1 × 1 dimension transformation; finally, its channels are divided into two parts. The left part is merged with the original input on the left, and the right part is added to the original input on the right. Together, these operations form one block, whose original input can be either the input that entered the network at the beginning or the output of the previous stage.
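The final split-and-merge step of the block described above can be sketched as follows. This is a simplified illustration, assuming that each channel is represented by a single number and that the convolutions have already produced `block_out`; the function name `dual_path_merge` is hypothetical.

```python
def dual_path_merge(dense_in, res_in, block_out):
    """Split the block output and merge it with the dense and residual paths.

    The left part of block_out is concatenated to the dense path (channels grow),
    and the right part is added element-wise to the residual path (channels stay).
    """
    split = len(block_out) - len(res_in)
    if split < 0:
        raise ValueError("block output has fewer channels than the residual path")
    dense_part, res_part = block_out[:split], block_out[split:]
    dense_out = dense_in + dense_part                     # list concatenation
    res_out = [a + b for a, b in zip(res_in, res_part)]   # element-wise addition
    return dense_out, res_out

# Toy channels: two dense channels, three residual channels, four output channels.
dense_out, res_out = dual_path_merge([1.0, 2.0], [10.0, 20.0, 30.0],
                                     [0.5, 1.0, 2.0, 3.0])
```

The dense path gains one new channel per block in this toy case, while the residual path keeps its width, which mirrors the "new feature exploration" and "feature reuse" roles of the two paths.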


Fusion
Each model is trained with three different random seeds, so each network has three sets of trained weights, each with its best overall F1 score recorded during training. Predicting the verification set with the three sets of weights gives three prediction results. These are fused by a weighted average in which the best overall F1 score of each run serves as its weight, yielding the preliminary localization and classification results. The same method is used when fusing the results of different models, except that the number of weighted results changes from 3 to 12. Because the localization network has a single target, it is more targeted than the classification network, which has two targets. We therefore used the localization result to mask the classification result, removing the non-building pixels to get the final results. The weighted average method is also used for the fusion of different networks, with or without attention.
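The fusion and masking steps can be sketched as below. This is a minimal illustration under stated assumptions: predictions are per-pixel probability maps stored as nested lists, label 0 denotes background, and a threshold of 0.5 decides which pixels count as buildings. The function names, the threshold and the toy values are hypothetical and not taken from the experiments.

```python
def weighted_fusion(predictions, f1_scores):
    """Per-pixel weighted average of probability maps, weighted by each run's F1."""
    total = sum(f1_scores)
    height, width = len(predictions[0]), len(predictions[0][0])
    return [[sum(w * p[r][c] for w, p in zip(f1_scores, predictions)) / total
             for c in range(width)]
            for r in range(height)]

def mask_classification(damage_labels, building_prob, threshold=0.5):
    """Keep a damage label only where the localization result marks a building."""
    return [[label if prob >= threshold else 0
             for label, prob in zip(label_row, prob_row)]
            for label_row, prob_row in zip(damage_labels, building_prob)]

# Three runs (seeds) of one network on a 1 x 2 image, with illustrative F1 weights.
runs = [[[0.9, 0.2]], [[0.8, 0.4]], [[0.7, 0.3]]]
run_f1 = [0.78, 0.77, 0.76]
fused = weighted_fusion(runs, run_f1)
masked = mask_classification([[3, 2]], fused)
```

Fusing 12 results from four networks works the same way: the list of maps and the list of weights simply grow from 3 to 12 entries.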


Metric
In a classification task, a confusion matrix is frequently used to evaluate the accuracy of the information and the performance of a model [43], and accuracy indicators such as precision and recall summarize the information in it. Each row of the confusion matrix represents a predicted category, and each column represents the actual category to which the pixel belongs. For a given category, TP (True Positive) is the number of pixels that are correctly predicted as this category. FP (False Positive) is the number of pixels that belong to other categories but are wrongly classified as this category. FN (False Negative) is the number of pixels that belong to this category but are mistakenly classified as another one. TN (True Negative) is the number of pixels that are correctly classified as other categories.
Accuracy, which is based on the proportion of TP and TN, does not distinguish between different categories; thus, it does not describe the overall performance of a multi-class model well when the dataset is unbalanced. By contrast, precision and recall reflect the true classification performance, and the F1 score balances the two indicators.
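As a small worked example of why accuracy can mislead on unbalanced data, the sketch below derives precision, recall and F1 for one category from a confusion matrix whose rows are predictions and whose columns are actual categories, as defined above. The matrix values are invented for illustration and are not results from this study.

```python
def per_class_metrics(confusion, k):
    """Precision, recall and F1 of category k (rows: predicted, columns: actual)."""
    tp = confusion[k][k]
    fp = sum(confusion[k]) - tp                    # predicted k, actually another
    fn = sum(row[k] for row in confusion) - tp     # actually k, predicted another
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def accuracy(confusion):
    """Share of all pixels that lie on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in confusion)
    return sum(confusion[i][i] for i in range(len(confusion))) / total

# Invented two-category example: category 1 is rare (40 of 1000 pixels).
confusion = [[955, 35],
             [5, 5]]
acc = accuracy(confusion)
precision_1, recall_1, f1_1 = per_class_metrics(confusion, 1)
```

In this invented case the accuracy is 0.96, while the F1 score of the rare category is only 0.2, which is the kind of gap that motivates reporting F1 instead of accuracy.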

The evaluation metric we used is the overall F1 score provided by the xView2 Challenge [29], calculated as a weighted average of the localization F1 score (lf1) and the damage F1 score (df1):
$$F1_{overall} = 0.3 \times lf1 + 0.7 \times df1$$
The localization F1 score is the ordinary F1 score, that is, the harmonic mean of precision and recall [44]. It is used to assess the effectiveness of building identification, which is a binary classification task.
Our model classifies pixels into four labels, so df1 is a multi-class F measure. The macro-averaged F1 score is a popular performance score that is computed by averaging the per-category scores [45] and adapts well to large-scale datasets. However, as a global arithmetic mean, it does not adequately represent the performance of the classifier in each category. Our damage F1 score is instead calculated as the harmonic mean of the four F1 scores computed for each damage level [29]. It behaves differently from the macro F1 score, as it gives a larger weight to lower values.
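The difference between the harmonic-mean damage score and a macro average can be checked with a short sketch. The per-level F1 values and the localization score below are invented for illustration; the 0.3 and 0.7 weights are the ones stated in this paper, and returning 0 when any per-level score is 0 is an assumption of this sketch rather than a documented rule of the challenge scorer.

```python
def harmonic_mean(values):
    """Harmonic mean; returns 0.0 if any value is not positive (assumed convention)."""
    if any(v <= 0 for v in values):
        return 0.0
    return len(values) / sum(1.0 / v for v in values)

def overall_f1(lf1, damage_f1s):
    """Overall score as 0.3 x localization F1 plus 0.7 x damage F1."""
    df1 = harmonic_mean(damage_f1s)
    return 0.3 * lf1 + 0.7 * df1, df1

# Invented per-level F1 scores: no damage, minor, major, destroyed.
level_f1 = [0.9, 0.5, 0.7, 0.8]
score, df1 = overall_f1(0.85, level_f1)
macro_f1 = sum(level_f1) / len(level_f1)
```

Here `df1` is lower than `macro_f1` because the weak minor-damage score pulls the harmonic mean down more strongly than it pulls the arithmetic mean.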
For reference, we also used the Mean Intersection over Union (MIoU) [46] to evaluate the performance of the model.
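For completeness, MIoU can be computed from the same kind of confusion matrix. The sketch below uses the standard definition IoU = TP / (TP + FP + FN) per category and averages over the categories; skipping categories that never occur is an assumption of this sketch, and the matrix values are invented.

```python
def mean_iou(confusion):
    """Mean Intersection over Union across categories (rows: predicted, columns: actual)."""
    ious = []
    for k in range(len(confusion)):
        tp = confusion[k][k]
        fp = sum(confusion[k]) - tp
        fn = sum(row[k] for row in confusion) - tp
        union = tp + fp + fn
        if union:                      # skip categories absent from both maps
            ious.append(tp / union)
    return sum(ious) / len(ious) if ious else 0.0

# Invented two-category confusion matrix.
miou = mean_iou([[955, 35],
                 [5, 5]])
```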

Training Implementation
Considering both resources and efficiency, Adam [47] with a learning rate of 0.0002 was chosen as the optimization algorithm because it is robust to the choice of hyperparameters. For the localization task alone, the batch size is set to 16 and the network is trained for 100 epochs. When the two tasks are trained at the same time, the batch size is set to 10 and the network is trained for 24 epochs. The framework is implemented in PyTorch [48], and two NVIDIA GTX 1080ti GPUs with 8G memory are used for training and verification. We initialized the networks with the ImageNet-pretrained weights provided by PyTorch.
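The settings above can be collected in a small configuration object, which makes the difference between the two training modes explicit. This is only an organizational sketch; the class and field names are hypothetical, and it does not reproduce the authors' training script.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters stated in the text; the names are illustrative."""
    task: str
    batch_size: int
    epochs: int
    learning_rate: float = 2e-4      # Adam learning rate of 0.0002
    optimizer: str = "Adam"
    pretrained: str = "ImageNet"     # backbone initialization

LOCALIZATION = TrainConfig(task="localization", batch_size=16, epochs=100)
JOINT = TrainConfig(task="localization+classification", batch_size=10, epochs=24)
SEEDS_PER_MODEL = 3                  # each model is trained with three random seeds
```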

Compare Models
We trained a total of two groups of U-Net models with eight different backbones, namely, the group with the attention mechanism and the group without. The backbones used were introduced in Section 2.2.4. The data were divided into a training set and a validation set using random seeds, and each model was trained with three different random seeds. Tables 4 and 5 show the results of each model on the verification set. The classification index uses the overall F1 score introduced in Section 2.3, the localization index uses the ordinary F1 score, and the overall index is 0.3 times the localization F1 score plus 0.7 times the classification F1 score.
As can be seen from Tables 4 and 5, SENet and SEresNeXt show the better overall performance both without and with attention. For the classification task, their performance with the attention mechanism is better than without it. In the task of building localization, DPN with the attention mechanism shows the best performance, reaching an F1 value of 0.870. These observations show that adding the attention mechanism has no uniform impact on the localization accuracy but improves the accuracy of the classification.
To further examine the results of each model, we selected three sample images from the verification set for comparison, as shown in Figures 10-12. Figure 10 was chosen to observe the discrimination of minor-damaged buildings. It can be seen that the models have a low detection accuracy for minor damage. The minor-damaged building in the lower left corner of Figure 10 is classified as major damage by most models; only in Figure 10j is it judged as minor damage. In Figure 10a,b, the observation directions differ, so the buildings in the two images cannot be completely overlapped, which may cause misjudgment. The error is also related to the fact that the degree of damage is continuous but is artificially divided into discrete levels. Minor damage is also subtle and varied in appearance. For example, the damaged parts may be on the sides of the buildings, which cannot be observed in remote sensing images.
Figure 11 is used to observe the discrimination of nondamaged and major-damaged buildings. It can be seen that the performance of each model in the detection of nondamaged buildings is stable, but misjudgments are easily made when distinguishing major-damaged buildings. For the buildings circled in Figure 11c, there are misjudged pixels in nearby buildings, which can be seen in every model's results. As can be seen from Figure 11b, the part of the building in the red circle has changed in texture and color compared to Figure 11a. Its appearance may have changed due to various factors, but the building itself does not reach the level of minor damage. This is related to the limitation of optical image data, which are easily affected by color information, so that false changes are detected. Compared with the result without the attention mechanism, the result with the attention mechanism is more accurate in the detection of major-damaged buildings.
Figure 12 is used to observe the judgment of destroyed buildings. It can be seen that the location and classification of the damaged area are basically the same for each model, but the details differ. For example, all models detected that the large area in the upper part of the image is destroyed, but the detection results for the house in the lower right are inconsistent. Compared with the ground truth, Figure 12g,i, the results of SENet (w/o A) and SEresNeXt (w A), show a better performance. As shown in Figure 12f,h, DPN (w/o A) and ResNet (w A) perform poorly. In the localization task, except for the poor performance in Figure 12f, the result of DPN (w/o A), the results of the other models are similar. Among them, Figure 12h, the result of ResNet (w A), has the clearest boundaries and the least adhesion between buildings.

Fusion Results
In pursuit of higher accuracy, we explored whether fusing the results of different networks would be more accurate than the result of a single network. To this end, we divided the models into two groups according to whether attention was added and, within each group, integrated the building localization and classification results of the four networks. In the fusion process, the contribution weights of the four networks are 1:1:1:1, and the fusion results are shown in Figure 13 and Table 6.

As shown in Figure 13 and Table 6, the fusion result of the models with the attention mechanism has a higher overall accuracy than any single model, exceeding the highest single-model overall accuracy by 0.005. The fusion result without the attention mechanism is lower than the highest single-model overall accuracy, but only 0.003 lower than SENet. Comparing the accuracy of the two groups, the group with the attention mechanism was higher, indicating that the attention mechanism had a certain effect on improving the overall accuracy of the model. Comparing Figure 13b,c,e,f,h,i, respectively, the fusion results showed stable classification results but differed in some details. In the circled part of Figure 13, the fusion results of the models with the attention mechanism detected the damaged areas more accurately than those without it. In terms of localization, the fusion results of the models with the attention mechanism had clearer boundaries and fewer adhesions between multiple buildings than those without it. In summary, the fusion of multiple models with the attention mechanism is beneficial to improving the accuracy.

Transferability and Robustness
To verify the robustness and transferability of the model, two disasters not in the training and verification sets were selected. One was the explosion in Beirut, Lebanon, on 4 August 2020, and the other was Hurricane Laura, which occurred on 27 August 2020; see Section 2.1.2 for a detailed introduction. We selected one region of interest for the explosion in Beirut and one for Hurricane Laura. The pre- and post-disaster images of each disaster were input into our model, and the results are shown in Figures 14 and 15.
In the case of the Beirut explosion, for the classification task, the results in Figure 14 are more accurate in detecting the destroyed area. However, in the upper left part of all results, a piece was detected as nondamaged or major-damaged. The original satellite image in Figure 14a shows that there is a shed here, and the roof still exists after the explosion in Figure 14b, but the wall under it may have collapsed completely, which makes it impossible to detect this situation from a top view. These situations generally occur when the upper layer of a building collapses directly onto the bottom floor [49]. This shows that optical satellite images have limitations in building damage detection. Compared with the ground truth in Figure 14m, almost none of the results accurately classify the damage grade of the pixels in the lower right corner, and they tend to overestimate it.
In the localization task, the dense buildings in the lower right corner of the image are not recognized. The buildings in this part are tall, so they appear as if viewed from the side, and there are large areas of shadow, which limits the model's recognition of the buildings. In the red circle of Figure 14a, there are ships docked at the port, which are identified as buildings in Figure 14k without the attention mechanism. In Figure 14l, with the attention mechanism, they are excluded from the building localization. This indicates that the attention mechanism helps the localization task distinguish between ships and buildings.
Since there is no public official ground truth for Hurricane Laura, we manually labeled the ground truth (seen in Figure 14m) following the rules of xBD. Compared to the Beirut explosion, the image of Hurricane Laura is covered by thin clouds, and the texture is more blurred. Unlike in the Beirut port, the scale of the buildings is smaller. In the results for Hurricane Laura in Figure 15, the difference between the fusion results with and without the attention mechanism is very small, and both models give similar interpretations for each house. Across all the results, there is not much difference in the building localization task. Some tiny buildings or buildings covered by trees are likely to be missed. As shown in Figures 14 and 15, when determining the level of damage, the classification performance on the two disasters is poor compared with the disasters in the training set. However, it is basically possible to distinguish between damaged and nondamaged buildings, and the error in determining the level of damage is mostly within one level.


Discussion
It can be seen from the indicators in Section 3.1 that the U-Net with the SE module maintains better performance in both groups, with or without the attention mechanism, which may be due to the characteristic of the SE module, that is, its attention mechanism over the channels. The SE module lets the network know which channels are more important for the current task. Admittedly, compared to images with dozens of channels, our data does not seem to have much need to select important channels, but this does not mean that the operation is meaningless; the results show that it can improve the accuracy. We can also expect that, when the number of channels increases, the attention mechanism on the channels will have a greater effect.
In addition to the band attention mechanism, we also added the spatial attention mechanism. The accuracy of the network with the attention mechanism on the localization task did not change much, whereas the accuracy on the classification task improved. For the task of building localization, the spatial attention mechanism was not very helpful, and the original convolutional neural network was enough to achieve good results. For classification tasks, changes in the global characteristics of the image may affect the classification results. According to Tobler's First Law [25], "Everything is related to everything else, but near things are more related than distant things". The environment around a building can therefore also serve as one basis for determining its damage level. As defined in Figure 1, if a house is surrounded by water or mud, it will be classified as damaged no matter how it looks on the outside.
The localization and classification of disaster-damaged buildings is a technology that supports post-disaster rescue. For this reason, the performance of the network on untrained disaster images is critical. Therefore, we studied the transferability and robustness of the model (see Section 3.3). It can be concluded that our model performs well on different types of disasters that it has not been trained on, but there is still room for improvement. At the same time, the model pre-trained on the xBD dataset can be used as the basis for future building damage detection research without the need to train a model from scratch. When the trained model processes a pair of 1024 × 1024 images, it takes no more than one second to get the result, which can save valuable time during disaster relief.
Tables 4 and 5 showed that the accuracy of minor damage is the lowest of the four categories, which is related to the fact that the remote sensing image does not contain information on the sides of the building, such as the surrounding walls. The misjudgment caused by this lack of information is also reflected in the case of the Beirut explosion. If related street view or oblique photographic images can be added, the accuracy of the model can be further improved. We think there are roughly three factors that limit the best accuracy. The first is the limitation of the data; that is, the viewing angle of the remote sensing image is limited, as mentioned above. The second is the limitation of the method; that is, the method used in this article is not perfectly adapted to this problem. The third is the problem of level classification. Even experts can hardly determine the damage level of some houses, because the damage itself is difficult to divide into levels, which leads to errors in data labeling.
In terms of accuracy assessment, the damage level is mostly evaluated with an entire building as the unit rather than with each pixel of the building as the unit. Therefore, the indicators and loss values used during model training can be improved, for example, by calculating the loss with a single building as the unit and training the model with this loss.
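One possible reading of building-level evaluation is to aggregate the per-pixel predictions of each building into a single label before scoring. The sketch below does this by majority vote; it assumes a building instance map with 0 as background is available, and the function name and toy data are hypothetical. It illustrates the idea only and is not the loss proposed for future training.

```python
from collections import Counter

def building_level_labels(instance_map, pixel_labels):
    """Assign each building the most frequent damage label among its pixels."""
    votes = {}
    for id_row, label_row in zip(instance_map, pixel_labels):
        for building_id, label in zip(id_row, label_row):
            if building_id == 0:          # 0 marks background pixels (assumed)
                continue
            votes.setdefault(building_id, Counter())[label] += 1
    return {b: counts.most_common(1)[0][0] for b, counts in votes.items()}

# Toy 2 x 3 image with two buildings (ids 1 and 2) and per-pixel damage labels.
instance_map = [[1, 1, 0],
                [1, 2, 2]]
pixel_labels = [[1, 1, 0],
                [3, 2, 2]]
per_building = building_level_labels(instance_map, pixel_labels)
```

In this toy case, the single pixel labeled 3 inside building 1 is outvoted, so the building as a whole receives label 1.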

Conclusions
Neural networks have been widely used in damaged building detection after disasters [4,13,16,17,32,43,47]. However, most studies focus on the binary classification of whether buildings have collapsed or not, and most models give the same attention to every feature, which makes it harder for some important features to play their full role. To make the model focus on the parts that matter more for classifying disaster-damaged buildings, in this study, we described a variety of U-Nets using different backbones with the attention mechanism. These networks can automatically detect damaged buildings in satellite images and assess their damage level. We trained the different networks using xBD and compared their F1 scores on the verification set. Among them, SEresNeXt with the attention mechanism on two dimensions performs best, with an overall F1 score of 0.787. To further improve the accuracy, we fused the results of four models; the fusion model with the attention mechanism outperformed all the other models, with an overall F1 score of 0.792. This result shows that the attention mechanism is helpful for the detection of damaged buildings. To verify the performance of the models on untrained disasters, two disasters not in the training and verification sets were selected to verify the model's portability and robustness. The results showed that our model had good robustness and portability on the localization and classification tasks, but there is still room for improvement.
Future research should consider specialized network training according to disaster type to improve the accuracy for different types of disasters. Classification at the level of building objects can also be considered, which is more in line with the actual situation. We plan to consider more types of disasters, especially large-scale and high-frequency disasters. We also plan to study techniques that let the model adapt to different data sources, such as lower-resolution remote sensing data and street view data from different perspectives.