On the Generalization Ability of a Global Model for Rapid Building Mapping from Heterogeneous Satellite Images of Multiple Natural Disaster Scenarios

Post-classification comparison using pre- and post-event remote-sensing images is a common way to quickly assess the impacts of a natural disaster on buildings. Both the effectiveness and efficiency of post-classification comparison heavily depend on the classifier's precision and generalization ability. In practice, practitioners have typically trained a new image classifier from scratch for each unexpected disaster in order to evaluate building damage. Recently, it has become feasible to train a deep learning model to recognize buildings in very high-resolution images from all over the world. In this paper, we first evaluate the generalization ability of a global model trained on aerial images using post-disaster satellite images. Then, we systematically analyse three kinds of methods to promote its generalization ability on post-disaster satellite images, i.e., fine-tuning the model using very few training samples randomly selected from each disaster, transferring the style of post-disaster satellite images using the CycleGAN, and performing feature transformation using domain adversarial training. The xBD satellite images used in our experiments consist of 14 events from six kinds of frequently occurring disasters around the world, i.e., hurricanes, tornadoes, earthquakes, tsunamis, floods and wildfires. The experimental results show that all three methods can significantly promote the accuracy of the global model for building mapping, and that it is promising to conduct post-classification comparison using an existing global model coupled with an advanced transfer-learning method to quickly extract building damage information.


Introduction
The frequent occurrence of various kinds of disaster around the world has caused unprecedented heavy losses to human life and property. As one of the main disaster-bearing bodies, buildings might collapse and/or be damaged due to earthquakes, typhoons and other major natural disasters. Most human casualties from disasters occur in collapsed buildings. As an important indicator of the severity of a disaster, damage assessment of buildings plays an important role in disaster relief and government decision making. With increasing numbers of remote-sensing images and higher spatial resolutions, the issue of how to use remote-sensing technology for damage assessment has become the focus of researchers. A common method for collapsed building detection after disasters is to extract building damage information from high-resolution remote-sensing images.
Many methods have been designed for building-damage detection, and they can roughly be divided into two groups: (1) methods that detect changes between pre- and post-event data and (2) methods that only interpret post-event data [1]. These two types of method mainly differ in their applicability and the accuracy of their results. Generally, methods using pre- and post-event data obtain more accurate results, since more information is available. However, it is still very challenging to quickly obtain uniform dual-temporal images in many areas, and the spectral difference between the pre- and post-event images further complicates direct comparison. As shown in Figure 1, the global model fails to identify undamaged buildings in the post-disaster images of the xBD dataset, while the identification results for the pre-disaster images are very good. A potential reason is that the model is misled by factors related to the imaging conditions, environmental changes due to disaster impacts and other causes. Even if no building in a post-disaster image is damaged, the change in imaging conditions caused by the disaster still blurs the building boundaries, whereas the boundaries are clearer and sharper in the pre-disaster images. This blurring of key information results in poor model performance. Therefore, in order to extract building damage information through post-classification comparison, it is necessary to promote the generalization ability of the existing building-identification model on the post-disaster images of the xBD dataset; damaged buildings can then be identified by comparing the pre- and post-event classification results. In this paper, we first evaluate the generalization ability of a global model trained on aerial images using post-disaster satellite images.
Then, we systematically analyse three types of methods to promote the model's generalization ability on post-disaster satellite images, i.e., fine-tuning the model using very few training samples randomly selected from each disaster, transferring the style of post-disaster satellite images using the CycleGAN [15], and performing feature transformation using domain adversarial training [16].
The remainder of this paper is organized as follows. Section 2 describes the experimental data, the global building mapping model used in this paper, and the evaluation of the model generalization ability. Section 3 provides a description of the three transfer learning methods used in this paper and the results of promoting the generalization ability of the model. We discuss the results of the experiments in Section 4. Finally, we draw some conclusions in Section 5.

xBD Dataset
In this big data era, many disaster damage assessment datasets have been produced in the past few years. However, most satellite image datasets tend to be limited to a single type of disaster and only contain post-disaster images of the disaster areas. The experimental data used in this paper are part of the xBD dataset [13]. The xBD dataset is the United States Department of Defense's open-source satellite image dataset for natural disasters, and it is the largest building damage assessment dataset to date.

Images in the xBD Dataset
All images in the dataset are from the Maxar/DigitalGlobe Open Data Program for both building damage assessment and post-disaster reconstruction. The dataset contains 22,068 optical satellite images with RGB bands (size 1024 × 1024 × 3). The spatial resolution is below 0.8 m. The xBD dataset includes pre- and post-event satellite images of 19 disasters covering a diverse set of events around the world: earthquakes, tsunamis, floods, volcanic eruptions, hurricanes, etc. In view of some quality problems in the dataset, we selected 14 of the 19 disasters for this paper. The distribution of the disaster locations is shown in Figure 2. Most of these disasters occurred in the United States, including wildfires and wind disasters. Some US disasters are named after the location, such as the Woolsey wildfire and the Joplin tornado, while some wind disasters are named after the storm, such as Hurricane Harvey and Hurricane Michael. The specific information on the disasters, covering six major kinds of natural disaster, i.e., wildfires (WF), tornadoes (TD), hurricanes (HC), floods (FD), tsunamis (TM) and earthquakes (EQ), is shown in Table 1. These disaster types have also occurred frequently around the world in recent years and have caused serious damage to people's lives and property. The pre- and post-event images of some disasters are shown in Figure 3.

Damage Scales in xBD Dataset
In the process of annotating the post-disaster images, Gupta et al. [13] proposed a joint damage scale, which is used to uniformly evaluate the degree of damage to buildings in satellite images of different types of disaster. Table 2 shows the evaluation criteria for the different levels of damage, and Figure 4 shows the typical appearance of buildings at each level. There are 850,736 building annotations across 45,362 km² of imagery. In the pre-disaster images, two kinds of labels are provided, i.e., building and non-building. The post-disaster images have labels with different degrees of damage, in which the building polygons were directly transferred from the pre-disaster images according to the projection of the geographical coordinates. There are four levels of building damage, i.e., no damage, minor damage, major damage and destroyed. Figure 5 shows the post-disaster images and labels of some disasters.

Quality of the xBD Dataset
Although the xBD dataset has made significant contributions to advancing change detection and post-disaster damage assessment for humanitarian assistance and disaster recovery, some challenges remain. First, as shown in Figure 6, the occurrence of disasters is often accompanied by changes in the weather. Therefore, cloud cover cannot be ignored, especially in the post-disaster images, and it poses a major challenge to the existing building identification model. In this paper, we manually screened out images with large-area cloud cover. Second, the post-disaster polygons are obtained directly from the pre-disaster images according to the projection of the geographical coordinates. Therefore, there is dislocation between the polygons in the pre- and post-event images due to differences in the times and angles of satellite imaging. Although we only evaluate and promote the generalization ability of U-NASNetMobile on post-disaster images, such annotation deviation still affects the accuracy assessment to a certain extent. Third, this paper aims to evaluate and promote the generalization ability of the building identification model on post-disaster images using deep-transfer-learning methods. Therefore, it is necessary to consider the number of building samples for each disaster and the imbalance between the classes. In some images in the xBD dataset, there is a serious imbalance between building samples and background samples; for instance, more than 99% of the pixels in some images are labelled as background. Consequently, considering these challenges, we selected partial data from 14 of the 19 natural disasters in the xBD dataset.

Differences between Disasters
When evaluating the generalization ability of a building identification model, the images of different disasters usually come from different regions, different times and different imaging conditions, so there can be differences in building style or image quality. As shown in Figure 7, even though most of the buildings in the images are not damaged by the disasters, there are still significant differences between disasters in the styles and sizes of the buildings, and even in the image quality. In general, the post-disaster images of Moore-TD and Mexico-EQ are relatively clear. The buildings are relatively large, with obvious features. In Moore-TD in particular, most of the undamaged buildings have a very regular form, the degree of separation between buildings is relatively high, and few buildings are closely connected to each other. In contrast, the buildings that remained intact after Mexico-EQ have more complex image features, and some buildings are closely connected to each other. The images of Nepal-FD are less clear, with smaller buildings, tighter interconnections and rougher roof textures. Such differences often affect the performance of existing building identification models on different disaster images to a large extent. Yang et al. [14] also confirmed that models trained on data from Europe have good results on data from Europe but do not perform as well on data from America. Therefore, this paper conducts experiments on 14 natural disasters to evaluate and promote the generalization ability of the U-NASNetMobile model on the images of each disaster.

The Global Model Trained on DREAM-B
In order to use very high-resolution (VHR) images for building mapping on a global scale, Yang et al. [14] constructed the so-called DREAM-B dataset, including the VHR image dataset of buildings from all over the world, and trained the U-NASNetMobile network using the DREAM-B dataset.

DREAM-B Dataset
Some commonly used datasets in image semantic segmentation were established on the basis of a few cities. This approach has difficulty meeting the needs of global building mapping. Yang et al. [14] used aerial images from more than 100 cities around the world to create the Disaster Mitigation and Emergency Management Construction Data Set (DREAM-B). DREAM-B contains 626 image tiles of 4096 × 4096 pixels, composed of three colour bands (red, green and blue) with a spatial resolution of 30 cm. The dataset contains two classes: building and non-building.

U-NASNetMobile
U-net [17] is a classical structure in the image segmentation field. It consists of a contraction path (encoder) and a symmetrical extension path (decoder) connected by a bottleneck. The encoder gradually reduces the spatial size of feature maps, which captures the context information and transmits it to the decoder. The decoder recovers the image details and spatial dimensions of the object through up-sampling and skip connections. Zoph et al. [18] proposed the NASNet-Mobile to learn the model architectures directly on the dataset of interest. In order to improve the computational efficiency, Yang et al. [14] combined the U-Net model with the NASNet-Mobile model, in which the neural cell obtained via neural architecture searching was used to replace the convolution modules in the U-Net. This model is called U-NASNetMobile in this paper. The architecture is shown in Figure 8.
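The encoder-decoder structure with skip connections described above can be sketched in a few lines of PyTorch. The toy module below is an illustration only, not the real U-NASNetMobile (which replaces the plain convolutions here with NASNet cells found via architecture search); all layer sizes are arbitrary choices for the sketch.

```python
import torch
from torch import nn

# Minimal U-Net-style sketch: the encoder halves the spatial size to
# capture context, and the decoder restores it via up-sampling plus a
# skip connection that reinjects the encoder's high-resolution features.
class MiniUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2)                        # halves H and W
        self.bottleneck = nn.Sequential(
            nn.Conv2d(8, 16, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(16, 8, 2, stride=2)   # restores H and W
        self.dec = nn.Conv2d(16, 2, 3, padding=1)          # 16 = 8 up + 8 skip

    def forward(self, x):
        e = self.enc(x)                    # encoder features at full resolution
        b = self.bottleneck(self.pool(e))  # context at half resolution
        u = self.up(b)
        # Skip connection: concatenate encoder features with upsampled ones.
        return self.dec(torch.cat([u, e], dim=1))

out = MiniUNet()(torch.rand(1, 3, 64, 64))
print(out.shape)  # torch.Size([1, 2, 64, 64]): two classes, building/non-building
```

The two output channels correspond to the building and non-building classes of the segmentation task.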
Of the 626 image tiles in the DREAM-B dataset, 250 are used for training, 63 for validation and 313 for testing. The input size of U-NASNetMobile is 512 × 512, so the original 4096 × 4096 tiles of the DREAM-B dataset are cropped into 512 × 512 patches to match the model input. Data augmentation, including random horizontal and vertical flips, random rotation, and random brightness jitter, is used during training to avoid overfitting. In the training process, the Adam optimizer [19] is used, together with a cosine-decay learning-rate schedule [20]; the maximum and minimum learning rates are 3 × 10⁻⁴ and 1 × 10⁻⁶, respectively. All the experiments are trained for 200 epochs with a minibatch size of 16. In addition, the Intersection over Union (IoU) is used to assess the accuracy of the building areas [21]. The IoU is defined as IoU = TP / (TP + FP + FN), where TP, FP and FN denote the numbers of true-positive, false-positive and false-negative building pixels, respectively.
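For binary building masks, the IoU reduces to intersection over union of the two pixel sets. A minimal NumPy sketch:

```python
import numpy as np

def iou(pred, gt):
    """Intersection over Union for a binary building mask.

    pred, gt: arrays where truthy values mark building pixels.
    Equivalent to TP / (TP + FP + FN).
    """
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return intersection / union if union > 0 else 1.0

# Toy 4x4 masks: prediction covers 4 pixels, ground truth 3 of them.
pred = np.array([[1, 1, 0, 0],
                 [1, 1, 0, 0],
                 [0, 0, 0, 0],
                 [0, 0, 0, 0]], dtype=bool)
gt = np.array([[1, 1, 0, 0],
               [1, 0, 0, 0],
               [0, 0, 0, 0],
               [0, 0, 0, 0]], dtype=bool)
print(iou(pred, gt))  # intersection 3, union 4 -> 0.75
```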

Evaluation
In this subsection, the trained model, i.e., U-NASNetMobile, is used to predict where the buildings are in the post-disaster images, and we evaluate the generalization ability of this model. In view of the differently damaged buildings in the post-disaster images, four modes are used to evaluate the model performance. Mode 1 considers only the undamaged buildings in the post-disaster images to be recognizable buildings; Mode 2 considers the undamaged and minor-damage buildings; Mode 3 considers the undamaged, minor-damage and major-damage buildings; and Mode 4 considers buildings with any degree of damage to be recognizable. We conducted experiments on the different disasters with the four evaluation modes. Taking the IoU of the buildings as an example, as shown in Figure 9, the identification accuracy of the model varies greatly from one disaster to another, while different evaluation modes for the same disaster have little influence on the evaluation results. Generally, when a building is marked as having major damage, its overall form changes significantly; if it is still considered a recognizable building, it will be quite different from the building forms learned by the existing model, resulting in an unreliable accuracy evaluation. Therefore, this paper adopts Mode 2 to evaluate the identification results. It can also be seen from Figure 9 that Woolsey-WF, Tuscaloosa-TD, Moore-TD, Joplin-TD and Midwest-FD all showed the highest accuracy under Mode 2. A disaster with relatively high accuracy indicates that the distribution of its images is closer to that of the training data and that the image quality is better; therefore, the degree of damage has a more significant impact on its accuracy.
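The four evaluation modes amount to collapsing the per-pixel damage labels into a binary building/background mask. The sketch below assumes a hypothetical dense integer encoding (0 = background, 1 = no damage, 2 = minor, 3 = major, 4 = destroyed); the real xBD labels are delivered as polygon annotations, so this encoding is for illustration only.

```python
import numpy as np

# Hypothetical per-pixel label encoding (illustration; xBD ships polygons):
# 0 = background, 1 = no damage, 2 = minor, 3 = major, 4 = destroyed.
MODES = {
    1: {1},           # Mode 1: only undamaged buildings count
    2: {1, 2},        # Mode 2: undamaged + minor damage (used in this paper)
    3: {1, 2, 3},     # Mode 3: undamaged + minor + major damage
    4: {1, 2, 3, 4},  # Mode 4: any degree of damage
}

def binary_mask(damage_labels, mode):
    """Collapse a per-pixel damage map into a building/background mask."""
    labels = np.asarray(damage_labels)
    return np.isin(labels, list(MODES[mode]))

damage = np.array([[0, 1, 2],
                   [3, 4, 0]])
print(binary_mask(damage, 2).astype(int))  # only labels 1 and 2 survive
```

Under Mode 2, major-damage and destroyed buildings fall into the background class, matching the ground-truth convention used for Table 3.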
Consequently, it is appropriate to consider the major damage and destroyed buildings as background in the evaluation. Table 3 shows the prediction results for all 14 disasters using the U-NASNetMobile model. In this paper, the recall, precision and IoU of the building class are used to evaluate the identification results. We also use the missed detection rate and false detection rate to evaluate the identification accuracy of individual building objects. In the ground truth, we selected the undamaged and minor-damage buildings in the post-disaster images as the building class, while all other damaged buildings are included in the background class. Table 3 shows that the IoU is generally low; the recall is low, but the precision is relatively high. Regarding the detection accuracy of individual buildings, the missed detection rate is high, but the false detection rate is relatively low. This result shows that the generalization ability of the existing U-NASNetMobile model on xBD post-disaster images is generally poor: the model has difficulty identifying buildings in the images, although most of the buildings it does identify are correct. In addition, the generalization performance of the existing U-NASNetMobile model varies greatly among different disasters, with the highest IoU of 0.556 in Moore-TD and the lowest IoU of 0.107 in Harvey-HC. Figure 10 shows a typical phenomenon in the identification results of the U-NASNetMobile model: some obviously undamaged buildings in the post-disaster images cannot be correctly identified. Given the low recall and high missed detection rates alongside the good precision and false detection rates in Table 3, the subsequent promotion focuses on the changes in the recall, missed detection rate and IoU. In addition, there are certain differences between different kinds of disasters.
The dataset used in this paper includes six different kinds of disasters, i.e., earthquakes, tsunamis, hurricanes, tornadoes, floods and wildfires. There were two or more cases of every disaster type except earthquakes and tsunamis. These cases can be used to assess the impact of disaster type on the model's generalization ability. Figure 11 shows the average IoU for each disaster type, indicating that the accuracies for tornadoes and wildfires are higher than those for hurricanes and floods. A possible reason is that in the four disasters of Harvey-HC, Florence-HC, Nepal-FD and Midwest-FD, the damage to buildings is mainly caused by inundation. The model never saw buildings with such characteristics during training, which results in a low generalization ability for these disasters.

Promotion
As shown in Figure 12, three transfer-learning methods, i.e., fine-tuning, the CycleGAN and domain adversarial training, were used to promote the generalization ability of the U-NASNetMobile model in this paper. Transfer learning [22] aims to improve performance on a target-domain task by discovering and transferring latent knowledge from a source-domain task. In this paper, the source domain and target domain are the DREAM-B dataset and the xBD dataset, respectively. The CycleGAN and domain adversarial training do not need the labels of xBD, while fine-tuning requires a small number of xBD labels. Domain adversarial training and fine-tuning require further training of the original network with xBD images, while the CycleGAN method does not adjust any parameters of the model.

In this subsection, U-NASNetMobile is fine-tuned with a small set of training samples. This transfer-learning method is based on a pretrained network; when the annotated target dataset is significantly smaller than the source dataset, transfer learning based on a pretrained network can be a powerful tool. A convolutional neural network extracts features in multiple layers: the lower layers capture basic common features, while the higher layers learn advanced features specific to the input. Based on this characteristic, we can freeze the lower layers and retrain only the higher-layer parameters to meet the requirements of the new task. In the experiment, we freeze the encoder of U-NASNetMobile and retrain its decoder using the post-disaster images in the xBD dataset. The models were implemented in Python 3.7.0 on a system with an Intel(R) Xeon(R) Gold 5118 CPU @ 2.70 GHz. All experiments were run on a single NVIDIA Tesla P40 GPU and took about 20 h.
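The freeze-encoder, retrain-decoder scheme can be sketched in PyTorch. `TinySegNet` below is a toy stand-in for U-NASNetMobile, and the `encoder`/`decoder` attribute names are assumptions for illustration; the point is that only parameters with `requires_grad=True` are handed to the optimizer.

```python
import torch
from torch import nn

# Toy stand-in for U-NASNetMobile (names and sizes are illustrative only).
class TinySegNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())
        self.decoder = nn.Sequential(nn.Conv2d(8, 2, 3, padding=1))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = TinySegNet()

# Freeze the encoder: it keeps the general features learned on DREAM-B.
for p in model.encoder.parameters():
    p.requires_grad = False

# Optimize only the decoder on the 10% of labelled post-disaster tiles.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=3e-4)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
print(trainable, frozen)  # decoder weights train, encoder weights stay fixed
```

A fine-tuning step then runs the usual forward/backward pass; gradients are simply never applied to the frozen encoder.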
In the actual damage assessment of buildings, the distribution of damaged buildings needs to be obtained as soon as possible for post-disaster emergency rescue. However, it is difficult to obtain large-scale, detailed building damage labels in a short time, whereas labels for a small number of undamaged buildings can often be obtained. Based on this reality, we use a small number of post-disaster images and undamaged-building labels to fine-tune U-NASNetMobile so as to promote the building identification accuracy of the network on new post-disaster data. Since it is difficult to obtain labels for post-disaster images on a large scale, we select only 10% of the data from each disaster for model training. Table 4 shows the identification results of the model on the 14 disasters. The results show that the recall of all disasters increased after fine-tuning. In addition, the precision of Santarosa-WF, Harvey-HC, Joplin-TD and Moore-TD also increased, while the precision of the other disasters decreased in exchange for the increase in recall. Overall, the IoUs of all disasters increased to different degrees. In addition, the missed detection rates of building objects decreased obviously, while the false detection rates increased slightly. Figure 13 shows the relative increase in the IoU for each disaster; Palu-TM, Portugal-WF, Harvey-HC and Socal-WF increased by nearly 100%. Figure 14 shows a post-disaster image from Harvey-HC with its ground truth and the identification results before and after fine-tuning. From the red highlights in Figure 14a, we can see that the buildings in this image are not affected by the hurricane at all.
The U-NASNetMobile model missed most of the buildings. This suggests that, at least on this image, the generalization ability of U-NASNetMobile is very poor because most buildings are confused with the background. However, after fine-tuning, the building identification results were greatly improved. Only a small number of buildings were missed.

Image Translation from xBD to DREAM-B
Traditional image enhancement methods are generally divided into spatial-domain and frequency-domain enhancement. Spatial-domain methods, such as the histogram-equalization algorithm, may blur the image and amplify image noise. Frequency-domain methods, such as directional filtering, usually suffer from reduced contrast and sharpness and partial feature loss. Generation models based on deep learning can obtain features from the data at different levels by abstracting the original data layer by layer, which overcomes the limitations of traditional methods to a certain extent. The Generative Adversarial Network (GAN) [23] consists of a generator and a discriminator: the generator attempts to generate fake images, while the discriminator seeks to distinguish fake images from real ones. In the training process of a GAN, the performance of the generator and the discriminator is alternately improved; finally, the generator can produce images that approximate real samples. Based on pix2pix [24], Zhu et al. [15] proposed the cycle-consistent generative adversarial network (CycleGAN), a variant of the conventional GAN. The CycleGAN can transform an image from a source domain X to a target domain Y without paired training data, and many studies have applied it to various fields [25-28].
After a disaster occurs, building damage must be evaluated. In many cases, building identification models have already been trained on disaster-free remote-sensing images. However, the training data and the actual post-disaster data can differ greatly because of different data sources or regions; furthermore, changes in imaging conditions caused by disasters make the two kinds of data even more different. Therefore, the generalization ability of an existing model on new post-disaster data is poor. In this subsection, image translation based on the CycleGAN is applied to give the xBD images the style of the DREAM-B images. The CycleGAN can make the distributions of the two types of data as close as possible so as to promote the generalization ability of the model on new data without affecting the performance of the existing model.
As shown in Figure 15, the CycleGAN consists of two mirrored GANs. A ring network is formed by two generators, G and F, and two discriminators, D1 and D2. In this paper, generator G transforms an image from DREAM-B into a fake xBD image, and vice versa for generator F. Discriminator D2 distinguishes real from fake xBD images, and vice versa for discriminator D1. In the training process of the CycleGAN, generators G and F seek to generate realistic xBD and DREAM-B images, respectively, while discriminators D1 and D2 seek to distinguish real from fake images. Finally, we use generator F to transform the images from the xBD style to the DREAM-B style. The models were implemented in Python 3.7.4 on a system with an Intel(R) Xeon(R) Gold 6226 CPU @ 2.70 GHz. All experiments were run on a single NVIDIA Tesla V100 GPU and took about 30 h.
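The ring structure above is enforced by CycleGAN's cycle-consistency term: an image sent through both generators should come back unchanged. The sketch below shows only that term; the 1 × 1 convolutions standing in for G and F are placeholders (real CycleGAN generators are deep residual networks), and the full objective also includes the adversarial losses from D1 and D2.

```python
import torch
from torch import nn

# Placeholder generators (real ones are deep ResNet-style networks):
# G maps DREAM-B -> xBD, F maps xBD -> DREAM-B.
G = nn.Conv2d(3, 3, 1)
F = nn.Conv2d(3, 3, 1)
l1 = nn.L1Loss()

x = torch.rand(1, 3, 64, 64)  # a DREAM-B image (random stand-in)
y = torch.rand(1, 3, 64, 64)  # an xBD post-disaster image (random stand-in)

# Forward cycle x -> G(x) -> F(G(x)) and backward cycle y -> F(y) -> G(F(y)):
# each image should be recoverable after a round trip between the domains.
cycle_loss = l1(F(G(x)), x) + l1(G(F(y)), y)
print(float(cycle_loss) >= 0)
```

Because this term needs no paired pre/post samples, the method can be trained on unlabelled xBD images, as noted above.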

Quantitative Evaluation
In order to maintain consistency with the resolution of the DREAM-B dataset, we resized the xBD images to 2048 × 2048. After training and prediction using the post-disaster images of the 14 disasters, the quantitative evaluation is shown in Table 5. The evaluation shows that the missed classification (pixel-level) and missed detection (object-level) rates are lower than those before image translation. At the pixel level, the recall of most disasters increased, while the precision decreased slightly after image translation based on the CycleGAN. In general, except for Nepal-FD, the other disasters showed different degrees of increase in their IoU; the most significant improvement was for Harvey-HC, where the IoU increased by 0.147. The evaluation results at the building-object level are consistent with those at the pixel level: the missed detection rates of most disasters decreased obviously, but the false detection rates increased slightly. Thus, without the guidance of supervised information, the CycleGAN image translation method can improve the building identification accuracy. In the training process, the CycleGAN can learn the differences between the two datasets, and in particular the DREAM-B-specific characteristics that the U-NASNetMobile model has learned. Figure 16 shows the increase in the IoU for each disaster. Except for Nepal-FD, all disasters showed increased IoUs to different extents; the increase for Harvey-HC was over 100%. Figure 17 shows a DREAM-B image and an xBD image before and after image translation. The DREAM-B image is from Shanghai, China, and the xBD image is from Harvey-HC. It can be seen that the overall style of the xBD image before translation is green and grey with low contrast. Most of the images in the DREAM-B dataset are from China, and the acquisition times and imaging conditions differ from those of the Harvey-HC post-event images.
As a result, the image tone and style of the two domains are different. The buildings in Figure 17e are closely connected, while the distance between the buildings in Figure 17c is relatively larger. There are two main changes in the translated image. First, the overall tone changes from green and grey to blue. Second, the buildings become clearer, with sharper edges and higher contrast against the background. It can also be seen from Figure 17d-f that none of the buildings in this image were affected by the hurricane, yet the U-NASNetMobile model missed most of them; after image translation by the CycleGAN, the model missed only a small portion of the buildings.

Domain Adversarial Training between xBD and DREAM-B
Domain adversarial training [16] is a classic adversarial method for domain adaptation research. An additional objective function is added to the existing model to encourage confusion between the two domains.
In this paper, the network structure of domain adversarial training is shown in Figure 18, and the structure is divided into three parts, i.e., a feature extractor, a label predictor and a domain classifier. We split U-NASNetMobile into a feature extractor and label predictor.
The domain classifier, which consists of a convolutional layer and a fully connected layer, is connected to the penultimate layer. The inputs of the network are two images from DREAM-B and xBD. The feature extractor should generate the same distributed features regardless of the input from DREAM-B or xBD. The label predictor is used to classify the buildings. The domain classifier should distinguish whether the data features come from DREAM-B or xBD.
The gradient reversal layer (GRL), which reverses the gradient during backpropagation, is inserted between the feature extractor and the domain classifier to achieve domain adversarial training. The domain classifier after the GRL minimizes the domain classification loss, while the feature extractor before the GRL maximizes it. Finally, the domain classifier is unable to distinguish the features from DREAM-B and xBD, and the feature spaces of DREAM-B and xBD are completely mixed together. The gradient reversal layer thus causes the model to maximize the domain classification error to some extent, instead of only minimizing the objective function as the original model does. This means that the model can learn features that minimize the segmentation objective while preventing the two domains from being distinguished.
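The GRL is commonly implemented as a custom autograd function: the identity in the forward pass, a sign flip (scaled by a factor lambda) in the backward pass. A minimal PyTorch sketch of this standard pattern:

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies incoming gradients by
    -lambda in the backward pass, so the feature extractor ends up
    maximizing the domain-classification loss that the domain
    classifier minimizes."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Flip the gradient for x; lam itself receives no gradient.
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Check the sign flip: d/dx of sum(grad_reverse(x)) is -lambda, not +1.
x = torch.ones(3, requires_grad=True)
grad_reverse(x, lam=1.0).sum().backward()
print(x.grad)  # tensor([-1., -1., -1.])
```

In the full network, `grad_reverse` would sit between the feature extractor's output and the domain classifier's input, leaving the label-prediction branch untouched.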
Like the CycleGAN, the domain-adversarial-training transfer-learning method does not require labelled xBD images. All the post-disaster images of the different disasters are used for domain adversarial training. The parts of the domain adversarial network shared with U-NASNetMobile are initialized with the weights pretrained on the DREAM-B dataset. The models were implemented in Python 3.7.0 on a system with an Intel(R) Xeon(R) Gold 5118 CPU @ 2.70 GHz. All experiments were run on a single NVIDIA Tesla P40 GPU and took about 30 h.

Quantitative Evaluation
The accuracy evaluation after domain adversarial training and prediction is shown in Table 6, and the increase in the IoU for each disaster is shown in Figure 19. Table 6 shows that the recall of all disasters increased significantly; the largest increase, 0.638, was for Nepal-FD. In addition, the precision of many disasters fell, with the greatest reduction, 0.338, in Midwest-FD. However, in general, as shown in Figure 19, except for Santarosa-WF, all the IoUs increased to varying degrees. The evaluation results at the building-object level are consistent with those at the pixel level: the missed detection rates of most disasters decreased obviously, while the false detection rates increased to different degrees.

Discussion
In order to improve the reliability and accuracy of building damage extraction via post-classification comparison, we aim to promote the generalization ability of a global model for building mapping using heterogeneous satellite images from multiple natural disaster scenarios. In this paper, we use the satellite images in the xBD dataset and their manual labels of the degree of building damage. Compared with Landsat and Sentinel data, the image resolution of the xBD dataset is much higher. In addition, this dataset provides building labels with different degrees of damage, which saves the time and effort of manual annotation. The xBD dataset is the largest and highest-quality building damage dataset currently available, so results based on it should be reliable. The experimental results show that when the existing building identification model is directly applied to actual post-disaster images without any transfer-learning method, the overall performance is poor, and the differences between disasters are particularly large. This may be caused by the type of disaster: our experiments show that the model generally performs poorly when a large number of water-covered damaged buildings appear in a post-disaster image. In addition, in our accuracy evaluation, the major-damage and destroyed buildings in the xBD dataset are all treated as background, which may also affect the results; however, this is not the main reason for the low generalization ability of global building-mapping models on post-disaster images. Previous research suggested that image parameters, such as the off-nadir angle, can influence performance [11,29]. From the qualitative analysis, we can see that the building damage itself and the changes in imaging conditions caused by disasters are the main factors that degrade the performance of the model.
After a disaster, some buildings are damaged and lose the building features that the global model learned. The changes in imaging conditions also blur the features of the undamaged buildings in the post-disaster images, further degrading performance. Therefore, the global model, U-NASNetMobile trained on DREAM-B, generalizes poorly to the post-disaster images of the xBD dataset. Because the post-classification comparison method depends entirely on the performance of the existing building identification model, it is not advisable to apply the existing model directly to actual post-disaster images.
In view of the low generalization ability of the existing global model, we use transfer learning to promote the generalization ability of existing models, hoping to make the post-classification comparison method more feasible and reliable. We systematically analyse three kinds of methods to promote the generalization ability on post-disaster satellite images: fine-tuning the model using very few training samples randomly selected from each disaster, transferring the style of post-disaster satellite images using the CycleGAN, and performing feature transformation using domain adversarial training. Image translation based on the CycleGAN does not need the manual annotation information of the xBD dataset, nor does it make any changes to the existing model; through image translation between the two domains, the translated xBD images acquire image features that are conducive to classification. The method based on domain adversarial training likewise needs no manual annotation of the xBD dataset, but it does require adversarial training on top of the existing model. The fine-tuning method needs both the manual annotation information of the xBD dataset and retraining of the existing model. The experimental results show that all three methods clearly promote the performance of the model. Recall tends to increase at the expense of precision, which is not necessarily a drawback for the post-classification comparison method. Since the three methods promote generalization in different ways, their performances also differ.
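The unpaired image translation above rests on the CycleGAN's cycle-consistency constraint: translating a post-disaster image into the DREAM-B style and back should reproduce the original image. A minimal sketch of that loss term (illustrative only; the generator architectures and the weighting used in the paper are assumptions, with the weight set to the common CycleGAN default):

```python
import torch


def cycle_consistency_loss(real, reconstructed, lam=10.0):
    # L1 penalty tying G_BA(G_AB(x)) back to x; lam = 10.0 follows the
    # common CycleGAN weighting, not necessarily this paper's setting.
    return lam * torch.mean(torch.abs(reconstructed - real))
```

During CycleGAN training, this term is added to the adversarial losses of both generators, which is what lets the method learn a style mapping without any paired or labelled xBD data.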
First, the fine-tuning method uses the most information and should therefore be the best way to improve the generalization ability. However, only a limited amount of annotated xBD data was used in the fine-tuning experiment, so its improvement was not significantly higher than those of the other two methods. It is worth mentioning that fine-tuning is the most stable of the three methods, showing no negative transfer: training always proceeds in the direction of improving the building identification accuracy on xBD data, and all disasters showed varying degrees of increase in their IoU. The CycleGAN and domain adversarial training use no manual annotation information, so their impact on the model's generalization ability is less stable, with a degree of negative transfer in a few disasters. Although they had positive impacts on most of the disasters, there were always one or two disasters whose building identification accuracy was reduced. Previous studies found that domain adversarial training combined with cycle-consistency constraints can improve the performance of semantic segmentation models [28]. However, the CycleGAN experiment in this paper neither changes any parameters of the model nor requires any supervision information from the target domain, so it is somewhat surprising that image translation based on the CycleGAN achieved such good results. Note, however, that in the training process of the CycleGAN, a reduction in the loss may not mean an improvement in the performance of the existing building identification model. Given the potential for negative transfer and the unclear building features in the images of Nepal, as shown in Figure 7, there is a large decrease in IoU in the Nepal experiment. Except for Nepal-FD, the identification accuracy of all disasters is still generally improved.
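One common recipe for fine-tuning with very few labelled samples is to freeze most of the pretrained network and update only a small head, which keeps the adaptation stable. The sketch below illustrates this pattern; the module names (`encoder`, `decoder`) and the choice of which layers to unfreeze are hypothetical, not taken from the paper:

```python
import torch
from torch import nn


def prepare_for_finetune(model: nn.Module, head_prefixes=("decoder",), lr=1e-4):
    """Freeze all parameters, then unfreeze only the named head(s).

    Returns an optimizer over just the trainable (unfrozen) parameters.
    """
    for p in model.parameters():
        p.requires_grad = False

    trainable = []
    for name, p in model.named_parameters():
        if any(name.startswith(prefix) for prefix in head_prefixes):
            p.requires_grad = True
            trainable.append(p)

    return torch.optim.Adam(trainable, lr=lr)


class ToySegNet(nn.Module):
    """Stand-in for a pretrained encoder-decoder such as U-NASNetMobile."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(4, 4)
        self.decoder = nn.Linear(4, 2)
```

With so few samples per disaster, restricting the trainable parameters in this way reduces the risk of overfitting, which is consistent with the stability observed for fine-tuning above.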
Overall, image translation based on the CycleGAN can improve the performance of the existing model on post-disaster images to some extent. Domain adversarial training causes the model to lose some of the characteristics unique to the DREAM-B dataset during training, which may reduce its performance on DREAM-B to a certain extent. However, after adversarial training, the recall of building identification on the post-disaster images of the xBD dataset increases considerably, the largest increase among the three transfer-learning methods. Although the precision decreases considerably, in general, domain adversarial training improves the building identification performance on the xBD dataset.
The transfer-learning experiments in this paper are conducted separately on each disaster, so part of the promotion of the generalization ability may inevitably come from having multiple specialized models rather than a single one. When we transfer the model to 14 different disasters, there are 14 different transfer directions, each adapted to the characteristics of the current disaster; it therefore seems unfair to evaluate these 14 "models" against the results of the single original model. However, due to the diversity of the disasters in the xBD dataset, such experiments are necessary. Moreover, in many cases, we only face one type of disaster at a time, so it is more realistic to conduct experiments on different types of disasters separately.

Conclusions
In conclusion, in order to evaluate building damage via post-classification comparison, we first evaluate the generalization ability of a global model trained on aerial images using post-disaster satellite images. Then, we systematically analyse three kinds of methods for promoting the generalization ability on post-disaster satellite images: fine-tuning the model using very few training samples randomly selected from each disaster, transferring the style of post-disaster satellite images using the CycleGAN, and performing feature transformation using domain adversarial training.
The research results show that the performance of the existing global building-mapping model is poor when directly applied to xBD post-disaster images. Even undamaged buildings are difficult for the model to recognize, i.e., the recall of the identification results is generally low. Furthermore, performance varies widely across disasters: when there are a large number of water-covered damaged buildings in the post-disaster images, the performance of the model is generally poor. Therefore, when the generalization ability of the model cannot be guaranteed, it is not advisable to use the existing global building-mapping model to assess damage via post-classification comparison.
The research in this paper mainly focuses on the generalization ability of the existing global building-mapping model on the post-disaster images of the xBD dataset. Overall, the promotion of the model's performance is substantial. These results show that even if the generalization ability of a global building-mapping model is not satisfactory, transfer-learning methods can be used to promote its identification performance, providing strong support for assessing damage via post-classification comparison. When the annotation information of post-disaster images is available, fine-tuning is the most reliable transfer-learning method, avoiding unnecessary negative transfer. When the annotation information is not available, image translation based on the CycleGAN and domain adversarial training are also good methods for improving the generalization ability.