Building Instance Change Detection from Large-Scale Aerial Images using Convolutional Neural Networks and Simulated Samples

We present a novel convolutional neural network (CNN)-based change detection framework for locating changed building instances as well as changed building pixels from very high resolution (VHR) aerial images. The distinctive advantage of the framework is its self-training ability, which is highly important for deep-learning-based change detection in practice, as high-quality samples of changes are always lacking for training a successful deep learning model. The framework consists of two parts: a building extraction network to produce a binary building map and a building change detection network to produce a building change map. The building extraction network is implemented with two widely used structures: a Mask R-CNN for object-based instance segmentation, and a multi-scale fully convolutional network for pixel-based semantic segmentation. The building change detection network takes the bi-temporal building maps produced by the building extraction network as input and outputs a building change map at the object and pixel levels. By simulating arbitrary building changes and various building parallaxes in the binary building maps, the building change detection network is well trained without real-life samples. This greatly lowers the requirement for labeled changed buildings, and guarantees the algorithm's robustness to registration errors caused by parallax. To evaluate the proposed method, we chose a wide range of urban areas from an open-source dataset as training and testing areas, and both pixel-based and object-based evaluation measures were used. Experiments demonstrated that our approach was vastly superior: without using any real change samples, it reached 63% average precision (AP) at the object (building instance) level. In contrast, even with adequate training samples, other methods, including the most recent CNN-based and generative adversarial network (GAN)-based ones, reached only 25% AP in their best cases.


Introduction
Change detection is the process of identifying differences in the state of an object, a scene or a phenomenon by comparing them at different times [1]. Remote sensing data has become a major data source in change detection due to its high time-frequency, wide spectrum of spatial and spectral resolutions and the broad bird's eye view [2][3][4]. It has been widely applied to detecting land cover and land-use changes [5][6][7][8], urban development [9][10][11], natural disaster evaluation [12] and forestry [13][14][15]. A typical challenge of detecting changes from remote sensing data is that the spectral behavior of remote sensing imagery (e.g., reflectance values, local textures) may lead to false alarms due to anthropogenic behavior, atmospheric conditions, illumination, viewing angles and soil moisture [1,8,16,17], which has accordingly led to the development of a variety of change detection methodologies.
Deep learning methods have been applied to classify land cover types such as forests [56], rivers and farmland [54] and landslides [57], and to detect land cover changes; these have obtained better performances than classic methods. Among the fundamental network models for deep learning, such as convolutional neural networks (CNNs) [58], deep belief networks (DBNs) [59], sparse autoencoders (AEs) [60], recurrent neural networks (RNNs) and generative adversarial networks (GANs) [61], the CNN, which consists of a series of convolutional layers, is the most widely used structure in image classification and change detection.
CNNs have been applied to building change detection. Daudt et al. [62] utilized three fully convolutional networks (FCNs) to detect changes in registered images. In that study, satellite images of only 10-60 m resolution were used, which inevitably resulted in low change detection accuracy. Nemoto et al. [63] used a CNN to extract buildings from a new image, and then used the building classification map and the two images as inputs for another CNN to detect building changes. This study was tested on large aerial images, but the change detection results were unsatisfactory. In [64], a Siamese CNN was applied to detect changes of buildings and trees between a laser point cloud and an aerial image in a very small area. Amin et al. [65] used super-pixel segmentation on registered bi-temporal images and then applied a CNN to detect pixel-based building changes. This work used only two small images for testing, and could not recognize changes in building instances.
Generally, these recent CNN-based building change detection studies have contributed to automatic building change detection, but a variety of challenges still exists. For example, the previous studies have either utilized very small images, or did not perform well on large datasets; they need enough samples of changed buildings, which are commonly sparse, to train the CNN. Furthermore, most of those studies only detected changes at the pixel level, whereas the statistics on building instances are often more important in practice.
In this paper, we present a novel framework for building change detection from VHR aerial images, which incorporates a building extraction network and a building change detection network, the performances of which are thoroughly evaluated on a large open building dataset and compared to some of the most recent methods. The framework is fully automatic, end-to-end, and can easily compute pixel- and object (building instance)-based change maps. The main ideas and contributions of this paper can be summarized in three aspects: (1) A new, end-to-end framework is proposed to detect changed buildings not only at the pixel level but also at the building instance level. The latter realizes a true "object change detection", instead of treating a group of arbitrary pixels as an object, as most object-based studies have done. The front-end building detection network is implemented with a pixel-based semantic segmentation network and an object-based instance segmentation network.
(2) The back-end change detection network we propose can not only detect building changes accurately, but also greatly mitigates one of the most prevalent problems of deep-learning-based change detection: the requirement for large amounts of high-quality training samples, which is rarely met since changes (positive samples) are usually rare. The back-end network can be well pretrained with automatically generated positive samples in building classification maps to achieve high accuracy without a single real change sample. This simulation also improves the robustness of our method in situations where accurate image registration cannot be achieved due to sensor angles and building parallaxes, which affect the performance of most change detection methods.
(3) Experiments demonstrated that our method is promising: without any real change sample, it greatly outperformed other methods trained with adequate samples by at least 38% AP (average precision) at the object level. Our algorithm was evaluated on a larger urban area compared to most of the relevant studies, which guaranteed more rigorous statistical significance and better practical reference values. Specifically, the experiments were executed on an open-source dataset, namely, the WHU building dataset [66]. The test area covers about 120,000 buildings (including 2007 changed buildings) with diverse architectural styles and usages.

Methodology
The framework of our CNN-based building change detection is shown in Figure 1. The input bi-temporal images are classified into binary maps of buildings and background using a building extraction network. The noise in the maps caused by imperfect building extraction is filtered out with a designed filter; then the maps of different times are concatenated and input into the change detection network to detect building changes and produce a building change map.

Building Extraction Network
In order to evaluate the impact of the building extraction network at the object and pixel levels on the subsequent change detection, this paper uses two network structures to extract buildings, namely, the Mask R-CNN [67] and the MS-FCN (multi-scale fully convolutional network), the latter of which is based on the U-Net [68] structure.
Although the Mask R-CNN was proposed in 2017, it remains one of the most powerful instance segmentation methods to date. The structure of the Mask R-CNN is shown in Figure 2. It consists of a backbone CNN, a region proposal network (RPN) [69], a region of interest (RoI) alignment process (RoIAlign) and three output branches: classification, box regression and mask prediction. The Mask R-CNN adopts a two-stage strategy. In the first stage, the feature map is searched through the RPN for regions that may contain foreground; rectangles with different sizes and proportions are used to cover such regions, and the proposed rectangles are used as the bounding boxes for the candidates. The second stage uses the bounding boxes to obtain RoIs from the feature maps of the CNN layers, then performs classification and bounding box regression. We kept all the parameters and hyper-parameters the same as in the original version, except for changing from multi-class object detection to single-class object detection (buildings).

We designed a U-Net structure combined with multi-scale aggregation for pixel-based segmentation, named the MS-FCN. We find this light structure efficient and especially effective in remote sensing data classification. The MS-FCN structure is shown in Figure 3. The encoder consists of a series of 3 × 3 convolutions and 2 × 2 max-poolings to extract higher-level semantic features; the decoder of the original U-Net [68] gradually enlarges the feature maps by a series of 3 × 3 convolutions and 2 × 2 up-samplings up to a feature map with the same size as the original input image. However, due to the different sizes of buildings, applying only this final feature map for building extraction will cause incomplete detection of some large-sized buildings, while some small-sized non-building objects, such as cars and containers, could be mistakenly classified as buildings. In order to improve the robustness of multi-scale building extraction, we add a convolution layer with one channel at each scale (red), up-sample these maps to the original scale and concatenate them to form a final four-channel feature map.

The binary building map produced by the Mask R-CNN and the MS-FCN contains some errors. We filtered out pixel segments smaller than a given threshold to produce a more accurate classification map.
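The multi-scale aggregation step can be sketched as follows. This is an illustrative numpy example under our own naming (`aggregate_scales` is not from the paper); the network applies a one-channel convolution at each scale and up-samples within the network, whereas here nearest-neighbour up-sampling stands in for that step:

```python
import numpy as np

def aggregate_scales(score_maps):
    """Up-sample per-scale one-channel score maps to the finest resolution
    and stack them into a multi-channel map (four channels for four scales)."""
    target_h, target_w = score_maps[0].shape  # finest scale defines the output size
    channels = []
    for m in score_maps:
        fh, fw = target_h // m.shape[0], target_w // m.shape[1]
        # nearest-neighbour up-sampling by integer factors via a Kronecker product
        channels.append(np.kron(m, np.ones((fh, fw))))
    return np.stack(channels, axis=0)

# Four score maps at scales 1, 1/2, 1/4 and 1/8 of a 16 x 16 tile
maps = [np.ones((16 // s, 16 // s)) for s in (1, 2, 4, 8)]
fused = aggregate_scales(maps)  # shape (4, 16, 16)
```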

Self-Trained Building Change Detection Network
Different from directly using image pairs as the input of the change detection network [70], we use the binary maps produced by the building extraction network as the input. This modification has tremendous advantages. First, we simulate arbitrary building changes in the binary maps, which is almost impossible to simulate directly in the original images. With the simulated samples, the rigid demand for a large number of manual samples of a supervised deep learning method is greatly reduced.
Note that images with a high proportion of changes are rare in practice, as typically only a small fraction of buildings are changed even in a large area.
Second, we simulate registration errors of buildings in the binary maps by randomly shifting a building's mask within a given threshold (e.g., 10 pixels) to train the network to be resistant to this change (i.e., to treat it as an unchanged building). Note that the parallax of buildings viewed from different angles in VHR images captured by pin-hole cameras always leads to geometric registration errors. This simple self-learning strategy is preferable to empirical, data-specific and unstable post-processing methods, which typically involve many parameters to filter out these false changes. Figure 4 shows some examples of simulated change samples and building parallaxes.

As we only need to learn changes from binary maps, a simple CNN is suitable. Our change detection network structure is a simplified U-Net with fewer feature-map channels, as shown in Figure 5. This structure has been empirically demonstrated to be better than the original version of U-Net, our MS-FCN and more recent structures such as DeepLab v3+ [71].

Data Set and Evaluation Measures
The dataset used in this paper comes from the WHU building change detection dataset [66]. The study area is in Christchurch, New Zealand, and covers about 120,000 buildings with various architectural styles and usages. According to different usages, we divided the dataset into five sub-datasets enclosed in colored boxes, as shown in Figure 6. In the training and prediction of a CNN model, all data was divided into small blocks of 512 × 512 pixels with a ground resolution of 0.2 m to adapt to an NVIDIA GTX 1080Ti 11G GPU, which was used in all the experiments. The details are listed in Table 1.
In the third step, the simulated building change detection dataset (SI-2016, blue) was created and used to train the change detection network. First, building masks were randomly shifted by 0-5 pixels in an arbitrary direction to simulate the misplacement of bi-temporal images. Then, buildings were randomly removed and added to a change map to simulate building changes. Specifically, for each tile, we randomly dropped 0-3 buildings, or added to the background 0-3 buildings that were randomly selected from building masks of the simulation area, as change labels. Finally, the change detection network was trained on the simulated change dataset and applied to predict changed buildings in the target area (red box). The model can also be fine-tuned on available real building change maps to produce a better change detection map of the target region. Figure 7 lists examples of 512 × 512 tiles of the different sub-datasets. The diversity of building styles and usages makes the study area, with more than 120,000 buildings and 2007 changed buildings in the red box, an ideal place to study building extraction and change detection. Figure 8 shows examples of change labels. New buildings were built on the bare ground of the 2011 images. Aside from changes to buildings, many land cover changes, such as roads, parking lots and gardens, are visible, which could result in many false alarms for any change detection method except those based on post-classification comparison.
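The simulation steps above can be sketched as follows. This is a minimal numpy version under our own naming (`shift_mask`, `simulate_pair` are not from the paper); only mask shifting (parallax) and building removal are shown, with building additions handled analogously. Note that `np.roll` wraps at tile edges, where a real implementation would likely pad instead:

```python
import numpy as np

rng = np.random.default_rng(0)

def shift_mask(mask, max_shift=5):
    """Randomly translate a binary mask by up to max_shift pixels in each
    axis, simulating the registration error caused by building parallax."""
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    return np.roll(np.roll(mask, dy, axis=0), dx, axis=1)

def simulate_pair(building_masks, tile_shape=(512, 512), max_drop=3):
    """Build a simulated 'before'/'after' tile pair and its change label
    from a list of per-building binary masks (full-tile uint8 arrays)."""
    before = np.zeros(tile_shape, dtype=np.uint8)
    after = np.zeros(tile_shape, dtype=np.uint8)
    change = np.zeros(tile_shape, dtype=np.uint8)
    n_drop = rng.integers(0, max_drop + 1)
    for i, m in enumerate(building_masks):
        before |= m
        if i < n_drop:      # dropped building: present before, absent after
            change |= m
        else:               # unchanged building, but slightly shifted
            after |= shift_mask(m)
    return before, after, change
```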
As we investigated both object (building instance) accuracy and pixel accuracy of the change detection algorithm, we applied two types of evaluation measures. The first one uses the intersection over union (IoU) as the main index for pixel-based evaluation, which is defined as IoU = TP / (TP + FP + FN). In building extraction, true positive (TP) indicates the number of pixels correctly classified as buildings, false positive (FP) indicates the number of pixels misclassified as buildings and false negative (FN) indicates the number of building pixels misclassified as background. In building change detection, TP indicates the number of pixels correctly classified as changed buildings, FP the number of pixels misclassified as changed buildings and FN the number of changed-building pixels misclassified as background.
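Written directly from these definitions, the pixel-based IoU is a few lines of numpy (an illustrative sketch; the function name is ours):

```python
import numpy as np

def iou(pred, truth):
    """Pixel-based IoU = TP / (TP + FP + FN) on binary maps."""
    pred = pred.astype(bool)
    truth = truth.astype(bool)
    tp = np.logical_and(pred, truth).sum()
    fp = np.logical_and(pred, ~truth).sum()
    fn = np.logical_and(~pred, truth).sum()
    denom = tp + fp + fn
    return tp / denom if denom else 1.0
```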
The second evaluation measure uses average precision (AP) as the main index for object-based evaluation, which is defined as AP = ∫₀¹ p(r) dr, where p denotes precision and r denotes recall. AP is the area under the precision-recall curve, with precision as the vertical axis and recall as the horizontal axis.
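As a sketch, AP can be approximated by step-wise integration of the precision-recall points. The paper does not specify the interpolation scheme, so this simple step variant is our own assumption:

```python
import numpy as np

def average_precision(precisions, recalls):
    """Approximate the area under the precision-recall curve by summing
    precision over each recall increment (step integration)."""
    order = np.argsort(recalls)
    r = np.asarray(recalls, dtype=float)[order]
    p = np.asarray(precisions, dtype=float)[order]
    r = np.concatenate(([0.0], r))  # prepend recall 0 so the first segment counts
    return float(np.sum(np.diff(r) * p))
```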

Building Extraction Results
Due to the low accuracy of the extraction networks at the edges of tiles, predictions from the Mask R-CNN and MS-FCN were made on overlapped tiles. That is, when cutting the original large images of TA-2016 and TA-2011, all the cropped tiles had 50% overlapping regions. After prediction, the edges of each tile were removed and the remainders were stitched into a seamless large image for evaluation. This strategy effectively avoids the edge effect that especially affects object instance detection methods such as the Mask R-CNN.
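The overlapped-tile prediction can be sketched as follows, assuming the image height and width are multiples of half the tile size (`predict_with_overlap` and the `predict` callback are our naming, not from the paper):

```python
import numpy as np

def predict_with_overlap(image, predict, tile=512):
    """Predict on tiles with 50% overlap and keep only each tile's centre,
    avoiding the low accuracy near tile edges. `predict` maps a
    (tile, tile) array to a same-sized score/label array."""
    step, margin = tile // 2, tile // 4
    h, w = image.shape[:2]
    out = np.zeros((h, w), dtype=float)
    for y in range(0, h - tile + 1, step):
        for x in range(0, w - tile + 1, step):
            pred = predict(image[y:y + tile, x:x + tile])
            # keep the centre of the tile; border tiles keep their outer edges
            y0 = 0 if y == 0 else margin
            x0 = 0 if x == 0 else margin
            y1 = tile if y + tile >= h else tile - margin
            x1 = tile if x + tile >= w else tile - margin
            out[y + y0:y + y1, x + x0:x + x1] = pred[y0:y1, x0:x1]
    return out
```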
Building extraction accuracies of the two networks at the object and pixel levels are shown in Table 2. Examples of the prediction results of the Mask R-CNN and the MS-FCN on TA-2016 and TA-2011 are shown in Figure 9. Comparing the third and fourth lines, the MS-FCN is slightly better than the Mask R-CNN in extracting building edges.

Building Change Detection Results
Firstly, the binary building map obtained through the building extraction network was preprocessed by a simple filter. Buildings smaller than 500 pixels (corresponding to about 4.8 × 4.8 m² on the ground) were considered false detections and were removed. Then, the bi-temporal classification map was divided into 1827 tiles of 512 × 512 pixels, with corresponding change labels as the ground truth.
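A minimal sketch of this filter using connected-component labeling (assuming `scipy` is available; the paper does not state its implementation, and the function name is ours):

```python
import numpy as np
from scipy import ndimage

def remove_small_segments(binary_map, min_pixels=500):
    """Remove connected components smaller than min_pixels (500 px at
    0.2 m resolution, roughly a 4.8 x 4.8 m footprint) as false detections."""
    labels, n = ndimage.label(binary_map)
    sizes = ndimage.sum(binary_map, labels, range(1, n + 1))
    keep = np.zeros(n + 1, dtype=bool)
    keep[1:] = sizes >= min_pixels   # label 0 (background) is never kept
    return keep[labels].astype(binary_map.dtype)
```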
We carried out three groups of tests. The first one did not use any change labels in the target study area (TA-2011 and TA-2016). We train our change detection network only on the automatically simulated data (SI-2016), from which 1800 tiles are used for training and 92 tiles for validation. Then, the model is applied to predict building changes in the test area (outside of the red box in Figure 10).

The second and third tests use half (green box) and full training samples (red box), respectively, to train the model, and therefore could be compared to other recent deep learning methods that require training samples in the original images. Two recent methods are compared to our method. One is the FC-EF [62], which is an end-to-end change detection method based on the CNN and predicts changes from bi-temporal images directly. The other is a generative adversarial network (GAN)-based method [70] with the same end-to-end manner.

As shown in Table 3, when training with only the simulation dataset, the AP (counted on changed building instances) reaches 0.630 and 0.609 with the Mask R-CNN and MS-FCN building extraction, respectively, and the IoU (counted on changed pixels) reaches 0.798. The FC-EF- and GAN-based change detection methods could not be executed without real labels in the remote sensing images. When half of the samples were used for further training, the AP of our model improved to 0.806 and 0.793, respectively, and the IoU changed to 0.773 and 0.843, respectively. In contrast, the FC-EF- and GAN-based methods obtained extremely poor results: only 2% AP and less than 26% IoU.

Table 3. Building change detection accuracy at the object level and pixel level under different training data: merely simulated data, half-change samples (within the green box in Figure 10) and full-change samples.
When all the training samples were used, the AP of our method slightly improved to 0.814 and 0.796, respectively, and the IoU improved to 0.830. The AP of the FC-EF improved from 0.02 to 0.25, indicating that it requires a considerable amount of training samples to train an adequate model. However, even with enough change samples (about 300 changed buildings), it performed much worse than our method. Note that this area underwent an earthquake in 2011, and plenty of buildings were changed; normally, it is even less feasible to supply enough change samples to train a network like the FC-EF. The GAN method could not converge with these samples, indicating that the GAN-based method is unstable.
It should be noted that when our change detection network was trained directly from random weights (rather than pretrained on the simulated data), its performance approached that of the pretrained models. For example, with full training samples, the AP of the direct training strategy was 0.803 and 0.732, respectively.
It can be concluded from Table 3 that, first, without any real training samples, our algorithm outperforms the other methods trained with adequate samples (by at least 40% AP and 30% IoU). Second, at the pixel level, without real change samples, our algorithm (0.798 IoU) approached the top performance (0.830 IoU); at the object level, the algorithm reached the top performance with only a small amount of training samples (i.e., the performance did not improve with more samples). These two features are highly favorable in practice, where change samples are scarce or even unavailable. Figure 11 lists four examples of building change detection results. Our change detection network with either the Mask R-CNN or the MS-FCN building extraction could detect changes with high accuracy. The results of our MS-FCN strategy were slightly better than those of the Mask R-CNN, as the latter over-smoothed building boundaries. Most of the changed buildings were missed by the FC-EF- and GAN-based methods, and the detected changes were very noisy. Figure 12 shows the results of the different methods on the whole test area and clearly demonstrates that our method is much better than the other methods even without any change samples. Right behind our method is the FC-EF with full training samples (Table 3); however, it only reaches 25% AP.

Discussion
In this section, we further discuss: (1) the advantages of our change detection network compared to the traditional methods with available building masks from the building extraction network, (2) the prerequisites of our method and (3) potential improvement of the framework.
(1) The advantages of our change detection network

Even after extracting the building mask, change detection at the object (building instance) level can still be extremely challenging for a traditional change detection method. As most object-based methods treat arbitrary groups of pixels as objects, we only compare our algorithm with our own empirical designs (Table 4) at the building instance level. Table 4 shows that, although different empirical methods were tried, the accuracy they obtained was much lower than that of our change detection network. In addition, their parameters are unstable and data-specific. This is the reason a self-training CNN is applied to the bi-temporal building classification maps for building change detection. Table 5 shows the comparison between our method and other conventional methods (i.e., an image-differencing-based method [24], an image-ratioing-based method [25] and the FLICM [73]) at the pixel level. The IoU scores of all these methods were obviously lower than ours. Note that on binary maps the differencing and ratioing methods achieve the same results.

(2) The prerequisites of our method

The change detection framework depends on the accuracy of the building extraction network, which requires sufficient training data. However, we do not treat this as a shortcoming, as there are plenty of existing building datasets. Besides open-source datasets such as the WHU [66], Inria [74] and OSM (OpenStreetMap) [75] datasets, there are building GIS maps held by central or local government branches of surveying, mapping and city planning, which can be used in practice. The critical shortage is of change samples, and this problem has been greatly reduced with our self-training strategy.
(3) The potential improvement of the framework

Although we only simulated change samples on the SI-2016 area, containing 1892 tiles, they enabled us to train a very good model at the pixel and object levels. Especially at the pixel level, the self-trained model approached the top performance achieved with adequate real change samples. Moreover, the self-training area can easily be extended, which would further improve the model's accuracy at the instance level without requiring real samples of changed buildings.

Table 4. Different methods to discover changed building instances from the bi-temporal building classification map produced by the Mask R-CNN. "Difference" indicates the direct differencing of the two classification maps. "Distance & IoU 1" indicates that a threshold of a 20-pixel shift between the center points of corresponding building masks and an IoU of 0.33 are used to determine whether two masks of different times are the same building; "Distance & IoU 2" uses the same 20-pixel shift threshold and an IoU of 0.5. "Erode & dilate" indicates that we used a morphological erosion operation to eliminate small masks and alignment errors and a dilation operation to restore buildings. "Erode & intersect" indicates that we used an erosion operation followed by an intersection operation with a threshold of 0.5 IoU. "Our network" is the change detection network trained with simulated samples.

Table 5. Different methods to discover changed buildings at the pixel level from the bi-temporal building classification map produced by the Mask R-CNN. "Our network" is the change detection network trained with simulated samples.
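The "Distance & IoU" baselines in Table 4 can be sketched as follows. Function names are ours; the thresholds follow the caption (a 20-pixel center shift, with an IoU of 0.5 corresponding to "Distance & IoU 2"):

```python
import numpy as np

def instance_iou(a, b):
    """IoU between two boolean instance masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def centroid(mask):
    ys, xs = np.nonzero(mask)
    return ys.mean(), xs.mean()

def is_same_building(mask_t1, mask_t2, max_shift=20, min_iou=0.5):
    """Empirical 'Distance & IoU' rule: masks from two dates are the same
    (unchanged) building if their centers lie within max_shift pixels
    and their IoU is at least min_iou."""
    (y1, x1), (y2, x2) = centroid(mask_t1), centroid(mask_t2)
    dist = np.hypot(y1 - y2, x1 - x2)
    return dist <= max_shift and instance_iou(mask_t1, mask_t2) >= min_iou
```

As Table 4 indicates, such hand-set thresholds are brittle compared with the learned change detection network.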

Conclusions
This paper proposes a new building change detection framework comprising a building extraction network and a self-trained building change detection network for VHR remote sensing images. The building extraction network provides highly accurate building classification maps. The building change detection network takes bi-temporal building classification maps as inputs and computes building change maps at the pixel and object levels. The network can be well trained with simulated changed buildings, and is robust to the registration errors caused by unavoidable parallax in VHR images. The experimental results proved the distinctive superiority of the proposed algorithm compared with other recent methods. Without any real change labels, our change detection network outperformed other methods trained with adequate samples. As change labels are commonly scarce, the reduced demand for training samples makes our framework effective and applicable in practice. In this study we focused on building change detection, but our framework can easily be adapted to detect changes in other land cover objects.