Pyramid Pooling Module-Based Semi-Siamese Network: A Benchmark Model for Assessing Building Damage from xBD Satellite Imagery Datasets

Abstract: Most mainstream research on assessing building damage using satellite imagery is based on scattered datasets and lacks unified standards and methods to quantify and compare the performance of different models. To mitigate these problems, the present study develops a novel end-to-end benchmark model, termed the pyramid pooling module semi-Siamese network (PPM-SSNet), based on the large-scale xBD satellite imagery dataset. The high precision of the proposed model is achieved by adding residual blocks with dilated convolution and squeeze-and-excitation blocks into the network. Simultaneously, a highly automated pipeline from satellite imagery input to damage classification output is achieved by employing concurrently learned attention mechanisms in a semi-Siamese network for end-to-end processing. Our proposed method achieves F1 scores of 0.90, 0.41, 0.65, and 0.70 for the undamaged, minor-damaged, major-damaged, and destroyed building classes, respectively. From the perspective of end-to-end methods, the ablation experiments and comparative analysis confirm the effectiveness and originality of the proposed method. Finally, the consistent prediction results of our model for data from the 2011 Great East Japan Earthquake verify the high performance of our model in terms of the domain shift problem, which implies that it is effective for evaluating future disasters.


Introduction
Natural disasters, which have been occurring frequently in recent years [1], pose a huge threat to the safety of residential buildings as well as life and property. Therefore, it is of great significance to obtain accurate information on damaged buildings to carry out interventions after natural disasters [2,3]. Satellite remote sensing technology is used to obtain disaster information because it can acquire rapid and large-scale surface information [4][5][6][7][8][9]. In particular, the recent development of deep convolutional neural network algorithms has improved disaster assessment input data. Combinations of these networks were then distributed among three separate feature streams. The extracted features were summarized into a continuous value denoting the damage level. The transfer learning mechanism can thus benefit all end-to-end models; however, we do not apply the mechanism to compare the performance of model structures in this study.
In addition to those works discussed above, Valentijn et al. [22] addressed the problem of automated building damage assessment based on the xBD dataset. The authors proposed a CNN consisting of two inception-v3 blocks for extracting features from pre-/post-disaster images and a stack of fully connected layers for the classifier. To overcome the overfitting problem, they employed a batch normalization layer and a dropout layer for each fully connected layer and analyzed the generalizability and transferability of the CNN. Harirchian et al. [23] addressed the problem of risk assessment using SVM and data on the Düzce Earthquake in Turkey. They employed 22 building features such as system type, year of construction, and ground floor area as inputs to the SVM for the estimation. Compared with CNN-based methods, this method is a "white box". However, it relies more on carefully chosen parameter(s) for the SVM and may perform worse than CNN-based methods. Zhuo et al. [24] focused on evaluating the risk of the subsidence of reclaimed land at the Xiamen Xi'an New Airport in China. They showed that SAR data are a powerful information source for analyzing reclaimed land subsidence as well as estimating the risk of future subsidence, which is valuable for land use planning. Morfidis et al. [25] used an artificial neural network (ANN) to estimate seismic damage to structures. This study provided a good explanation for civil engineers unfamiliar with ANNs. Harirchian et al. [26] addressed the problem of predicting damage to reinforced concrete buildings when an earthquake occurs. They employed six human-defined features to represent a building. A shallow neural network was then used as the estimator, which was trained and tested based on the representation vectors consisting of the six features for each sample. The dataset employed for this work was obtained from the Düzce Earthquake in Turkey. Morfidis et al. [27] addressed the problem of estimating damage to reinforced concrete buildings using ANNs. The authors employed human-defined features (i.e., seismic and structural parameters) to train a shallow neural network consisting of linear production layers and activation layers and then analyzed the network's hyper-parameters and human-defined features, providing a good guide for applying ANNs experimentally.
In this study, we design a concurrent learned attention network, which is an end-to-end trainable, unified model, to localize buildings and classify damage jointly. This network is built on a semi-Siamese strategy that can learn collectively. We use a pixel-level segmentation-based approach as well as residual blocks (RBs) with dilated convolution and squeeze-and-excitation (SE) blocks to detect damage to the segmented buildings. To model the global contextual prior, we also introduce the pyramid-pooling module (PPM) that enhances the scale invariance of images, while lowering over-fitting risk.
To benchmark our method, we develop our model based on the large-scale xBD dataset, which contains satellite images from multiple disaster types worldwide such as earthquakes, hurricanes, floods, and wildfires. To verify our method's effectiveness and practicality, we compare its performance with that of the published baseline model based on the xBD dataset. To demonstrate its usefulness, we use data from the 2011 Great East Japan Earthquake.
We contribute to the body of knowledge in four main ways. First, we propose a benchmark model for assessing building damage based on the large-scale xBD satellite imagery dataset. Second, we put forward an end-to-end model for assessing building damage, termed PPM-SSNet, which adopts the semi-Siamese technique, the PPM, and an attention mechanism. To overcome the difficulty of multi-target learning, we use the weighted combined losses of dice, focal, and cross-entropy. Third, we use five efficient data augmentation methods and four class balance strategies designed for these tasks to improve the performance of all the mainstream models. Finally, we use different disaster images, including severely damaged images and rare disaster images, to test our model's robustness by comparing it with two strong baseline models.

Data
The xBD dataset [16] used in this study comes from the xView2 challenge (https://xview2.org/dataset). It contains over 850,000 building polygons from six types of disasters (earthquake, tsunami, flood, volcanic eruption, wildfire, and wind) worldwide, covering 45,000 km². The building polygons and damage scales are included. Following the joint damage scale (JDS) based on EMS-98, the building damage scales are visually interpreted from satellite imagery and categorized into undamaged, minor-damaged, major-damaged, and destroyed buildings. The training dataset contains 9168 pairs of pre-event/post-event three-band images with a spatial resolution of 1024 × 1024 pixels. Moreover, segmented ground truth masks with building polygons and building damage class labels are provided in the JSON file format. Figure 1 shows the details of the xBD dataset.

Consistent with real-world disaster scenarios, the xBD dataset presents severe class imbalance. In terms of the building/non-building ratio at the pixel level, non-building pixels occupy approximately 96.7% of the image pixels, as shown in Table 1. Regarding the proportional distribution of the damage classes at the pixel level, the number of undamaged building pixels far exceeds that of the other three classes, with a ratio of up to 76%; only 6% of pixels belong to the destroyed class, and the minor-damaged and major-damaged categories account for almost the same proportion. Figure 2 compares the class balance.

To verify our method's transferability, we test other satellite imagery with the model developed on the xBD dataset. Two areas in Higashi Matsushima severely affected by the 2011 Great East Japan Earthquake are used for testing, as shown in Figure 3a-c. These two areas are selected because the xBD dataset does not contain any disaster data from Japan and tsunami data in the xBD dataset are scarce. This design tests the ability of our model to evaluate and predict unknown disasters.
The building damage ground truth data for the testing area are based on the field investigation conducted by TTJS [28]. To retain consistency with the xBD data labels as much as possible and facilitate the comparative analysis, we recategorize the TTJS building damage data into four classes: "undamaged", "minor damage" (including "moderate damage" and "minor damage" in the TTJS standard), "major damage", and "destroyed" (including "washed away", "collapsed", and "completely damaged" in the TTJS standard), as shown in Figure 3b,c. We adopt this classification standard because standards based on field surveys are much stricter than the visual interpretation based on satellite images.
The four-band multispectral high-resolution WorldView-2 images with a spatial resolution of 0.6 m, collected before and after the 2011 Great East Japan Earthquake, were used for validation, as shown in the background of Figure 3b,c.

Methodology
The PPM-SSNet model developed in this research employed dilated convolution, the SE mechanism for attention, and the PPM, as detailed below.

Dilated Convolution for Large Receptive Fields
Collectively leveraging the global and local features of an input image is effective at solving computer vision problems [29][30][31][32]. Because of the nature of images, the different characteristics of an image are represented at different scales. A large field in an image captures global appearances such as object contours, whereas a small field captures local appearances such as local textures. This also applies to building localization and damage assessment. One way to realize this idea is image down-sampling, which reduces the size of an image; this is equivalent to enlarging the receptive field of a convolutional unit at a specific location of the image. Although down-sampling discards information because of the reduced resolution, it is still used when computing resources (e.g., GPU memory) are limited. Another way to enlarge the convolutional receptive field is to employ dilated convolution [29]. A dilated convolutional unit operates in the same way as a normal convolution on an image; the difference is that its convolutional kernels are dilated. A high dilation rate gives the unit a large convolutional receptive field, and no information is lost as the receptive field increases under dilated convolution. Figure 4 shows an example of dilated convolution with a dilation rate of 2, where g, h, and u denote the input image (or activation map), the convolutional kernel, and the output, respectively; an output value u is calculated by summing the products of each kernel value h(i, j) with its corresponding value (x, y) in g.
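For illustration, the following minimal PyTorch sketch (not part of the original implementation; channel counts and tensor sizes are arbitrary assumptions) shows how a dilation rate of 2 enlarges the receptive field of a 3 × 3 convolution while preserving the spatial resolution of the input:

```python
import torch
import torch.nn as nn

# A 3x3 convolution with dilation 2 covers a 5x5 neighborhood of the input
# (a larger receptive field) while keeping the 3x3 parameter count and the
# full spatial resolution of the activation map.
standard_conv = nn.Conv2d(64, 64, kernel_size=3, padding=1, dilation=1)
dilated_conv = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)

x = torch.randn(1, 64, 128, 128)      # dummy activation map
print(standard_conv(x).shape)         # torch.Size([1, 64, 128, 128])
print(dilated_conv(x).shape)          # same spatial size, larger receptive field
```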

SE Mechanism for Attention
The SE mechanism was originally developed to improve the performance of image classification on ImageNet [33]. It is a weighting system that produces and applies channel-wise weights on a feature map (i.e., the output of an intermediate layer in a CNN). To determine the weight of each channel, it computes the average activation value of the channel; these averages are then converted by two linear production layers with ReLU and Sigmoid activation functions to generate the channel-wise weights. The aggregation of the activation values is equivalent to global average pooling, as shown in Figure 5. A CNN equipped with such attention mechanisms can perform feature recalibration: it learns to selectively emphasize informative features and suppress less useful ones, which helps reduce ambiguity when estimating the correct damage level and thus improves the accuracy of building assessment.
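A minimal sketch of such an SE block in PyTorch is shown below; the reduction ratio of 16 follows the original SE paper and is an assumption here, not necessarily the value used in PPM-SSNet:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation block: global average pooling ("squeeze")
    followed by two linear layers with ReLU and Sigmoid ("excitation")
    that produce channel-wise weights applied back to the feature map."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # per-channel average
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.pool(x).view(b, c)                    # (B, C)
        w = self.fc(w).view(b, c, 1, 1)                # channel-wise weights in [0, 1]
        return x * w                                   # recalibrated feature map
```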

PPM
The PPM pools the activation map of each channel in a pyramidal fashion [34]. It makes N × N (N = 1, 2, 4, ...) grids on the activation map of each channel. Each cell of a grid overlaps with a square region of the activation map. Each grid for the channel perfectly covers the whole activation map. On the region covered by each cell of a grid, a user-defined pooling process such as global max pooling or global average pooling is employed to pool the region into a single value. This process quantifies each activation map into a vector with a length equal to N × N. The vectors produced with different N (e.g., 1, 2, and 4) are then concatenated into a representation vector for the channel. The above process is applied to all the channels to produce their representation vectors. The final output of this module is generated by concatenating these representation vectors, as shown in Figure 6. The PPM is a simple yet effective feature aggregation mechanism. It aggregates features from multiple scales. Global features such as the shapes of buildings are covered with a small N (e.g., N = 2), whereas local features such as the details of damaged buildings are covered with a large N (e.g., N = 4). Then, the final output of this mechanism becomes a representative vector of the input sample, which improves the accuracy of building localization and damage assessment.
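The following sketch illustrates this pooling scheme in PyTorch; the grid sizes (1, 2, 4) and the choice of average pooling are assumptions for illustration rather than the exact configuration of our module:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    """Pool each channel on N x N grids (here N = 1, 2, 4), flatten the pooled
    values, and concatenate them into one representation vector per sample."""
    def __init__(self, grid_sizes=(1, 2, 4)):
        super().__init__()
        self.grid_sizes = grid_sizes

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b = x.size(0)
        pooled = [
            F.adaptive_avg_pool2d(x, n).reshape(b, -1)  # (B, C * n * n) per grid
            for n in self.grid_sizes
        ]
        return torch.cat(pooled, dim=1)                 # final representation vector
```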

Pyramid Pooling Module-Based Semi-Siamese Network (PPM-SSNet)
The task of assessing building damage is divided into two stages. The first stage identifies the buildings in an image. This can be treated as a localization problem in which a system such as a CNN estimates a binary localization map for the input image; a value of 1 or 0 at a location indicates whether it belongs to a building. The localization map is then employed as a prior for the second stage, which assesses the damage at every location with a value of 1. Based on this idea, we design a network to jointly estimate buildings' locations and assess their damage. We use the pre-disaster image alone to estimate the localization map and then use both the pre- and post-disaster images to estimate the assessment result. To leverage the localization map, we directly multiply it by the output of the assessment estimator; this corrects the assessment result, refining it from a coarse to a fine level (see Figure 7).

Figure 7 shows the architecture. The network is built on a semi-Siamese strategy. The weights of the shallow layers are shared between the two input images (i.e., pre-/post-disaster images) so that the network produces a good "filters' bank" by collectively learning low-level features from both. As the layers go deeper, we stop sharing weights and use independent branches for the two inputs. The two branches are merged by subtracting one from the other along their channels, which encourages the network to learn the differences between the pre- and post-disaster images. For the tail of the network, we use a single branch of layers to produce the final estimation result. In the network, we employ RBs with dilated convolution and SE blocks. Our motivation for using RBs is that the network can extract features from both large and small receptive fields through the large and small dilation rates used in the RBs, which may improve its representation ability. In addition, SE blocks encourage the network to focus on important features while suppressing less useful ones. We employ a PPM at the end of the network, immediately before an SE block and a convolutional layer, to aggregate the features.

Figure 7. The architecture of the proposed network. c, b, d, and r represent the convolutional layer, batch normalization layer, dropout layer, and ReLU layer, respectively. SE, RB', RB, and PPM represent the modules illustrated at the bottom of the figure. The difference between RB' and RB is that RB' has an additional convolutional layer and batch normalization layer, designed to change the number of channels or the size of the input tensor if needed. See Table 2 for more details.

Table 2. Configuration of the network (see Figure 7). For a convolutional layer (Conv./conv.), in, out, stride, and dila denote the input dimension, output dimension, stride, and dilation rate; k × k denotes the size of the convolutional kernel. For an SE module, in, mid, and out denote the input dimension, the dimension of the middle layer's output, and the output dimension. For the PPM, out denotes the output dimension.
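To make the data flow concrete, the following highly simplified sketch illustrates the semi-Siamese idea of Figure 7 (shared shallow layers, independent branches, channel-wise subtraction, and refinement by the localization map); it is not the full PPM-SSNet, and all layer sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SemiSiameseSketch(nn.Module):
    """Simplified sketch: shared shallow layers for pre/post images, independent
    deeper branches, channel-wise subtraction, and a localization map (from the
    pre image) used to refine the coarse damage estimate."""
    def __init__(self, classes: int = 5):
        super().__init__()
        self.shared = nn.Sequential(                       # weights shared by both inputs
            nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU())
        self.branch_pre = nn.Conv2d(32, 64, 3, padding=1)  # independent branches
        self.branch_post = nn.Conv2d(32, 64, 3, padding=1)
        self.loc_head = nn.Conv2d(32, 1, 1)                # building localization head
        self.cls_head = nn.Conv2d(64, classes, 1)          # coarse damage classification head

    def forward(self, pre, post):
        f_pre, f_post = self.shared(pre), self.shared(post)
        loc = torch.sigmoid(self.loc_head(f_pre))          # localization map from pre image
        diff = self.branch_pre(f_pre) - self.branch_post(f_post)  # learn pre/post differences
        coarse = self.cls_head(diff)
        fine = coarse * loc                                 # refine coarse result with the map
        return loc, fine
```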

Experimental Analysis
Over-sampling and data augmentation are adopted in this study. The assessment metrics as well as loss and mask dilation parameter settings are detailed below.

Resampling
Building damage detection networks based on xBD generally perform poorly when detecting minor and major damage, resulting in comparatively low recalls and F1 scores for these two categories because of imbalanced training data. To overcome this problem, we devise several methods to increase the number of minor-damage and major-damage instances, one of which is over-sampling the training dataset. Since our model is designed to generate pixel-level classification results, we use a main label to decide how many times an image containing multi-label pixels should be repeated in the training dataset. A weight vector w = (w_0, w_1, w_2, w_3)^T is given based on experience, each element of which represents the relative importance of the corresponding category. For image i, let n_i be the vector recording the number of pixels of each category; its main label is then defined as the category j that maximizes the weighted pixel count w_j n_{i,j}, where category 0 denotes no damage, category 1 denotes minor damage, and so on. Table 3 shows the main label categories and the corresponding repeated times; a minimal sketch of this rule follows the table.

Table 3. Main labels and corresponding repeated times.

Main Label        No Damage    Minor Damage    Major Damage    Destroyed
Repeated Times    0            3               2               1

Since the images are cropped and randomly augmented later, there is no concern that the repeated images are identical to the original ones.
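As an illustration, the following Python sketch implements the main-label rule under the assumption stated above, namely that the main label is the category maximizing the weighted pixel count; the weight values shown are placeholders rather than those used in our experiments:

```python
import numpy as np

# Placeholder importance weights for (no damage, minor, major, destroyed).
w = np.array([0.1, 1.0, 1.0, 0.5])
# Repeated times from Table 3, keyed by main label.
repeats = {0: 0, 1: 3, 2: 2, 3: 1}

def main_label(pixel_counts: np.ndarray) -> int:
    """pixel_counts[j] = number of pixels of damage category j in the image."""
    return int(np.argmax(w * pixel_counts))

def times_repeated(pixel_counts: np.ndarray) -> int:
    """How many extra copies of the image to add to the training set."""
    return repeats[main_label(pixel_counts)]
```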
After over-sampling, we perform a discriminative cropping-and-selecting process. Similar to the above, we weight each pixel inversely proportional to the frequency of its corresponding damage level. The original image size is 1024 × 1024; we uniformly sample several 512 × 512 crops from each image and choose the one with the largest sum of pixel weights. Without increasing the volume of the training data, this process further alleviates the data imbalance of the xBD dataset.
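A possible implementation of this cropping-and-selecting step is sketched below; the number of candidate crops and the inverse-frequency weighting scheme are assumptions for illustration:

```python
import numpy as np

def pick_crop(label_mask: np.ndarray, class_freq: np.ndarray,
              crop: int = 512, n_candidates: int = 8, rng=None) -> np.ndarray:
    """Sample several random crops from a 1024x1024 label mask and keep the one
    with the largest sum of pixel weights, where each pixel is weighted inversely
    to the frequency of its damage class."""
    rng = rng or np.random.default_rng()
    weights = 1.0 / class_freq                      # inverse-frequency pixel weights
    h, w = label_mask.shape
    best_score, best = -1.0, None
    for _ in range(n_candidates):
        y = rng.integers(0, h - crop + 1)
        x = rng.integers(0, w - crop + 1)
        patch = label_mask[y:y + crop, x:x + crop]
        score = weights[patch].sum()                # sum of per-pixel weights in the crop
        if score > best_score:
            best_score, best = score, patch
    return best
```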

Data Augmentation
To enhance the generalizability of our model, we apply the following data augmentation methods sequentially to each image. As shown in Table 4, every method is assigned a value indicating its probability of occurrence. Whether each method is applied to a given image is therefore determined randomly according to its probability, and the methods are executed in the listed order: the higher a method's order, the earlier it is executed.
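The following sketch illustrates such a probability-driven, ordered augmentation pipeline; the transforms and probabilities shown are placeholders, not the exact entries of Table 4:

```python
import random
import numpy as np

# Each geometric transform must be applied identically to the pre image,
# post image, and label mask (HWC images, HW mask).
def hflip(pre, post, mask):
    return pre[:, ::-1], post[:, ::-1], mask[:, ::-1]

def vflip(pre, post, mask):
    return pre[::-1, :], post[::-1, :], mask[::-1, :]

def rot90(pre, post, mask):
    return np.rot90(pre), np.rot90(post), np.rot90(mask)

# (transform, probability of occurrence), applied in this order.
PIPELINE = [(hflip, 0.5), (vflip, 0.5), (rot90, 0.3)]

def augment(pre, post, mask):
    for transform, prob in PIPELINE:
        if random.random() < prob:
            pre, post, mask = transform(pre, post, mask)
    return pre, post, mask
```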

Assessment Metrics
End-to-end building damage assessment includes two progressive tasks: building localization and damage classification. The former can be regarded as a binary segmentation task, while the latter is a multi-class classification task. This study adopts F1 scores, precision, recall, and IoU to evaluate our network's performance. For the localization task, the F1 score is used:

F1_loc = 2 TP_loc / (2 TP_loc + FP_loc + FN_loc),

where TP_loc denotes the number of pixels correctly classified as buildings, FP_loc indicates the number of pixels misclassified as buildings, and FN_loc means the number of building pixels misclassified as background.
For the classification task, the F1 score, precision, and recall for each damage category are calculated. A macro-IoU is also used to quantify accuracy when the data are imbalanced:

IoU_j = TP_j / (TP_j + FP_j + FN_j), j ∈ {0, 1, 2, 3},

where TP_j denotes the number of pixels (or instances) correctly classified as category j, FP_j indicates the number misclassified as category j, and FN_j means the number of category-j pixels misclassified as other categories; the macro-IoU is the mean of IoU_j over the four categories.
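For clarity, the two metrics can be computed from confusion counts as in the following sketch (the function names are illustrative):

```python
import numpy as np

def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN); used for the localization task."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_iou(conf: np.ndarray) -> float:
    """Macro-averaged IoU over damage categories, given a confusion
    matrix conf[true, pred]."""
    ious = []
    for j in range(conf.shape[0]):
        tp = conf[j, j]
        fp = conf[:, j].sum() - tp
        fn = conf[j, :].sum() - tp
        denom = tp + fp + fn
        ious.append(tp / denom if denom else 0.0)
    return float(np.mean(ious))
```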

Loss and Mask Dilation
The output damage scale classification mask has five channels: the four damage levels and a no-building label. We adopt a weighted mixed loss consisting of dice loss and focal loss for the damage scale classification loss L_d, and a weighted binary cross-entropy loss for the building segmentation loss L_s; in these losses, y_p and y_s are the ground truth label and the detected building segmentation probability, respectively, while m_p and m_t are the true and predicted masks for damage scale c, respectively. As most samples contain no buildings, we use a larger weight for the building class, as indicated by w_{s,1} in the segmentation loss L_s. Additionally, minor-damaged and major-damaged buildings are uncommon in our samples; therefore, we select larger weights for them (c = 2, 3) in the damage scale classification loss. We also let the focal loss account for a larger proportion of the mixed loss to mitigate category imbalance. To achieve better classification at building boundaries, we dilate the building damage scale labels. Where the dilated pixel labels overlap, we prioritize minor-damaged and major-damaged buildings (c = 2, 3), which are relatively vulnerable in the classification.
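A hedged sketch of such a weighted mixed loss is given below; the per-class weights, the focal parameter gamma, and the dice/focal mixing ratio are placeholders rather than the values used in our experiments:

```python
import torch
import torch.nn.functional as F

def dice_loss(prob, target, eps=1e-6):
    """Soft dice loss on per-channel probabilities and binary targets."""
    inter = (prob * target).sum(dim=(2, 3))
    union = prob.sum(dim=(2, 3)) + target.sum(dim=(2, 3))
    return 1.0 - ((2 * inter + eps) / (union + eps)).mean()

def focal_loss(prob, target, gamma=2.0, eps=1e-6):
    """Focal loss that down-weights well-classified pixels."""
    pt = torch.where(target > 0.5, prob, 1.0 - prob).clamp(min=eps)
    return (-(1.0 - pt) ** gamma * pt.log()).mean()

def damage_loss(pred_logits, target_onehot,
                class_w=(1.0, 2.0, 4.0, 4.0, 2.0), focal_share=0.7):
    """L_d sketch: weighted mix of dice and focal over the five output channels,
    with larger placeholder weights on the rare damage classes (c = 2, 3)."""
    prob = torch.softmax(pred_logits, dim=1)
    total = 0.0
    for c, w in enumerate(class_w):
        p, t = prob[:, c:c + 1], target_onehot[:, c:c + 1]
        total += w * ((1 - focal_share) * dice_loss(p, t)
                      + focal_share * focal_loss(p, t))
    return total

def segmentation_loss(pred_logits, target, pos_weight=5.0):
    """L_s sketch: weighted binary cross-entropy with a larger weight on buildings."""
    return F.binary_cross_entropy_with_logits(
        pred_logits, target, pos_weight=torch.tensor(pos_weight))
```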

Experimental Setting
In this work, we use PyTorch, a Python package that provides tensor computation with strong GPU acceleration and deep neural networks built on a tape-based autograd system, as the deep learning framework. PyTorch is designed to be intuitive, linear in thought, and easy to use. Equipped with acceleration libraries such as Intel MKL and NVIDIA cuDNN as well as custom memory allocators for the GPU, PyTorch enables users to train larger deep learning models than other Python packages do.
All the experimentation and modeling tasks are implemented in the public cluster in the x64 Linux environment with the public computing cloud at the Renmin University of China. This computing cloud is equipped with the Simple Linux Utility for Resource Management (Slurm) scheduling system. Computations are performed on the node titan, which is configured with 128 GB of RAM, two Intel Gold 5218 CPUs, and two NVIDIA Titan RTX GPUs.

Ablation Study
In this study, we use ablation experiments to demonstrate the effectiveness of our proposed method. An ablation study typically refers to removing a "feature" of the model or algorithm and verifying how this affects performance. Instead of removing components, however, we gradually add modules such as the Siamese structure, attention, and pyramid pooling to our baseline network and verify the resulting performance. Nevertheless, the performance improvement is not uniform across the different sub-tasks; conducting experiments over several rounds helps confirm that the modules of interest indeed boost model performance. Tables 5 and 6 show the results of the ablation experiments, in which the shaded row represents the performance of our proposed baseline model. The second row in Table 5 indicates that deploying the Siamese network module in the baseline model leads to a significant increase in all the metrics. Adding the attention module results in a slight decline in all the metrics except the recall rate. The increase in the recall rate might be a consequence of the scale-aware semantic image segmentation that arises with an attention mechanism. We then introduce the PPM, which slightly raises all the metrics except the recall rate. This variation can be attributed to pyramid pooling, which enhances the scale invariance of images while lowering the risk of over-fitting.

Table 5. Ablation experiments of the localization methods with different modules (the shaded row represents the results of the ablated model).

(Table 5 reports IoU for the non-building and building classes, mean IoU, Precision_loc, Recall_loc, F1_loc, Dice_loc, and Score_loc, all in %.)

Table 6. Ablation experiments of the multi-classification methods with different modules (the shaded row represents the results of the ablated model).

Table 6 shows that the sequentially applied modules improve overall performance, since the total F1, the harmonic mean of the per-category F1 scores, increases gradually with only minor recessions. As for the irregular changes in individual metrics, there is often a trade-off between the precision rate and the recall rate and among the F1 scores of the different classes. For instance, deploying the Siamese network raises F1_clf,2 and F1_clf,3 and lowers F1_clf,0 and F1_clf,1. This occurs because the decision boundaries, which are mutually exclusive in the feature space and shaped by the most recently attached module, change when another module is subsequently applied, leading to fluctuations in the metrics. Finally, the introduction of pyramid pooling, which is noteworthy for its scale-adaptive feature extraction, enables the model to yield rather satisfactory metrics for all the categories. Table 7 shows the confusion matrix of our final PPM-SSNet. Our model performs well overall: the non-building category holds the highest accuracy of 96.52%, whereas the accuracy for minor-damage pixels is only 30.29%. Table 8 compares the results of the post-and-pre strategy (both pre-disaster and post-disaster images are available) and the post-only strategy (only the post-disaster images are used). According to the results, using only the post-disaster image yields lower performance for both building localization and damage-level assessment than using both pre- and post-disaster images, which demonstrates the important role of the pre-disaster image in improving building localization and damage classification.

Comparisons with Other Methods
Since the release of the xBD dataset, some studies have used a share of its data for training and achieved good results, whereas others use different evaluation metrics to assess accuracy. Moreover, some works are not strictly end-to-end studies, preventing us from directly comparing their published results with ours. To solve this problem, we reproduce previous research results and carry out comparative experiments under uniform experimental conditions. A Mask R-CNN network [19] and the Siam-U-Net-Attn network [20] are compared.
Weber et al. [19] used a Mask R-CNN with the FPN architecture and the same model architecture for both building localization and damage classification. However, instead of working with full images, they trained the architecture on both the pre- and the post-image quadrants and fused the final segmentation layers to draw building boundaries more accurately. For the class imbalance problem, they engineered their loss function to weight errors on classes inversely proportional to their frequency in the dataset. However, this is insufficient to address the problem; in practice, class imbalance is usually tackled by combining multiple approaches, such as the over-sampling and reweighting operations used together with the weighted loss functions in our experiments. Figure 8 shows the details of the Mask R-CNN network.

Hao [20] designed the Siam-U-Net-Attn model end-to-end for both damage classification and building segmentation. One element of this architecture was a U-Net model that analyzed a single input image and produced a segmentation mask showing building locations. The same U-Net model was used for both the pre-disaster and post-disaster images to produce binary masks. The features extracted from the encoder of the U-Net model also helped classify the damage scale: the two feature sets produced by the pre-image and post-image U-Net encoders were used by the middle part, a separate decoder in the Siamese network that compared the features from the two input frames to detect damage to buildings. The network achieved an appreciable IoU score on localization and performed well when classifying undamaged and destroyed buildings. However, the model could not identify minor-damaged and major-damaged buildings accurately. Figure 9 shows the Siam-U-Net-Attn network.

We train and test our network and the other methods using the same datasets described above and the same parameter settings. The results show that our proposed network clearly outperforms the other approaches, as shown in Tables 9 and 10. We also compare the classification results for earthquakes, tsunamis, floods, typhoons, and volcanic eruptions, as shown in Figure 10. The results again verify the superiority of our method over previous approaches.

Table 10. Comparison with other methods on the classification task.

Further, our model outperforms the baseline models when predicting building localization and damage classification. Post-disaster images with destroyed buildings introduce noise into building localization since the edges of destroyed buildings may be vague. FPN-R-CNN classified the majority of destroyed buildings into the no-building category, while Siam-U-Net-Attn's prediction of destroyed buildings is not robust. In these cases, our model can easily distinguish undamaged and destroyed buildings, but it remains hard to distinguish minor from major damage.

Robustness of the Method
The validation areas are characterized by a great diversity of environmental settings, building structures and spatial distributions, tsunami processes, and image acquisition conditions, as shown in Figure 11(a1,b1,a2,b2), respectively.
The predicted results show that the proposed model detects destroyed and undamaged buildings well, but separating minor damage from major damage is still challenging, as shown in Figure 11(c1,d1,c2,d2) and Table 11. This is partly because the Tohoku tsunami's annotation standard and that of the xBD dataset are not uniform: the Tohoku tsunami building labels come from a field survey, while the xBD labels come from visual interpretation, which introduces error into a like-for-like comparison. Because the small validation area shown in Figure 3c contains almost only destroyed buildings, we conducted the quantitative confusion matrix analysis (Table 11) only for the larger validation area shown in Figure 3b, which covers a variety of damage types. Still, we can visually confirm that the prediction results for the small validation area, shown in Figure 11(c1), are quite consistent with the ground truth data shown in Figure 3c. Further, satellite remote sensing is limited when detecting fine-scale building damage because of its relatively low spatial resolution; therefore, the method's inability to distinguish major from minor damage is understandable. One way to address this challenge would be to use high-resolution drone images. In general, our prediction results are consistent with the field observation data.

Conclusions
In this study, we developed an end-to-end attention-guided semi-Siamese network with a pyramid pooling module. Our proposed model yielded satisfactory results for building localization and damage classification compared with other methods. Employing dilated convolution, the method leveraged the global and local features of an input image. To improve damage classification performance, we adopted a squeeze-and-excitation mechanism, a weighting system that produces and applies channel-wise weights on a feature map. Our ablation experiments on the xBD dataset demonstrated that the proposed semi-Siamese structure, dilated convolution, and squeeze-and-excitation mechanism were both necessary and effective. Meanwhile, the demonstration with 2011 Great East Japan Earthquake data revealed results consistent with the ground truth data, confirming the effectiveness of our proposed method for evaluating future disasters. Further, the method achieved true end-to-end input and output. Thanks to the open-sourcing of the large-scale, high-precision xBD dataset, the scarcity of training data, which used to be the main challenge in training deep learning models for building damage assessment from satellite imagery, is no longer the bottleneck. We note that the contribution of this research is a damage detection algorithm based on large-scale benchmark data from multiple types of disasters; therefore, we do not provide targeted solutions for a specific type of disaster.
Our research has some limitations. It is based on the visual information of optical images, meaning that it may be unable to measure extensive flood damage under an intact roof. To address this, researchers could consider using synthetic aperture radar images to detect bottom or sidewall damage [35]. In addition, wall ruptures caused by earthquakes may not be effectively measured, which could be overcome using higher-resolution drone images to detect this type of damage [36]. These limitations suggest that, despite the contributions of the proposed approach, a highly robust and transferable deep learning model for assessing building damage with high precision is still urgently needed. Domain shift remains an important challenge in deep learning, and satellite imagery is particularly affected by it; this will be the direction of our future efforts.