Automatic Building Extraction from Google Earth Images under Complex Backgrounds Based on Deep Instance Segmentation Network

Building damage accounts for a high percentage of post-natural-disaster losses. Extracting buildings from optical remote sensing images is of great significance for natural disaster reduction and assessment. Traditional methods are mainly semi-automatic, requiring human-computer interaction, or rely purely on human interpretation. In this paper, inspired by recently developed deep learning techniques, we propose an improved Mask Region-based Convolutional Neural Network (Mask R-CNN) method that simultaneously detects the rotated bounding boxes of buildings and segments them from very complex backgrounds. The proposed method has two major improvements that make it well suited to the building extraction task. First, instead of predicting horizontal rectangular bounding boxes as many other detectors do, we obtain the minimum enclosing rectangles of buildings by adding a new term: the principal direction of the rectangle, θ. Second, a new layer integrating the advantages of both atrous convolution and the inception block is designed and inserted into the segmentation branch of Mask R-CNN so that the branch learns more representative features. We test the proposed method on a newly collected, large Google Earth remote sensing dataset with diverse buildings and very complex backgrounds. Experiments demonstrate that it obtains promising results.


Introduction
During the last 10 years, many countries have suffered from natural disasters, which are increasing in frequency and intensity and have brought huge losses to exposed persons and assets [1]. Disaster loss assessment can provide technical support and a decision-making basis for disaster relief and post-disaster reconstruction. Loss from building damage always accounts for a high percentage of total losses, especially in typhoons, earthquakes, floods, and geological disasters, so loss assessment of building damage is an essential part of overall loss assessment. Remote sensing images play an important role in building damage assessment because of their wide coverage, high resolution, and high timeliness [2,3]. Building footprint vector data can provide Challenge dataset is superior to 97% [27]. In general, existing deep learning work on building extraction from high-resolution remote sensing images is mainly based on semantic segmentation, and work based on object detection and image classification is scarce. The main idea is to improve context information by adding multi-layer features to the FCN framework, and thereby improve adaptability to the complex backgrounds of remote sensing images and to small building targets.
Semantic segmentation under complex geospatial backgrounds is likely to result in edge connections among closely adjacent buildings, which is unfavorable for subsequent edge extraction and outline fitting owing to edge confusion between buildings. Mask R-CNN [28], a pioneering work in instance segmentation (a task that predicts bounding boxes and segmentation masks simultaneously), has achieved significant improvements. In that work, segmentation is carried out on top of the detection results, making it especially suitable for outline extraction of densely distributed small buildings. A Rotation Bounding Box (RBB), which involves angle regression, has been incorporated into the Faster R-CNN framework, forcing the detection network to learn the correct orientation of ship targets through an angle-related IoU and an angle-related loss function [29]. Meanwhile, the Receptive Field Block (RFB) module, which uses multi-branch pooling with varying kernels and atrous convolution layers to simulate receptive fields of different sizes in the human visual system, was developed to strengthen the deep features learned by lightweight CNN detection models [30]. In this paper, we incorporate the RBB and the RFB into the RPN stage and the segmentation branch of the Mask R-CNN framework, respectively. This improvement provides tighter bounding boxes and further promotes the accuracy of mask prediction owing to better adaptation to multi-scale building targets.
The main contributions of this paper include:
1. Different from previous FCN-based methods, an instance segmentation framework is applied to building detection and segmentation, which better handles closely adjacent small buildings and other difficult cases.
2. We adapt rotatable anchors in the RPN stage of the Mask R-CNN framework, which regress an MABR-like (minimum area bounding rectangle) rotation bounding box and eliminate redundant background pixels around buildings.
3. We use several RFB modules to boost the segmentation branch of the Mask R-CNN framework, which better accommodates multi-scale building targets by connecting multi-branch receptive fields with varying eccentricities in parallel.
Experiments based on a newly collected large building outline dataset show that our method, improved from the Mask R-CNN framework, achieves state-of-the-art performance on the joint building detection and rooftop segmentation task.
The remainder of this paper is organized as follows: Section 2 presents the details of building extraction method; Section 3 describes the experimental results in Google Earth remote sensing dataset; Section 4 is a discussion of our method and some possible plan of improvements; Section 5 presents our concluding remarks.

Methods of Building Extraction from Remote Sensing Images
Similar to Mask R-CNN, our method consists of four parts. First, rotation anchors are introduced in the RPN stage, since we intend to predict the minimum area bounding rectangle of buildings. Second, the feature maps of ROIs are rotated anticlockwise into horizontal rectangles and then processed by ROI Align. Third, the regression branch refines the bounding box coordinates, the classification branch predicts the corresponding classification scores, and the segmentation branch produces object masks through several RFB modules. Finally, the bounding box and mask are rotated clockwise by the regressed angle to give the instance segmentation results. The losses of the three branches are computed and summed to form a multi-task loss. Figure 1 illustrates the schematic architecture of the proposed method.

Rotation Region Proposal Network
A feature map from the backbone is fed into the rotation region proposal network. In the learning stage, the rotation bounding box is defined as the ground truth of each building sample for detection. Rotation proposals are formulated by adding an angle parameter and are generated by traversing every combination of ratio, scale, and angle. In the prediction stage, the feature maps of the rotation bounding boxes generated by the rotation RPN are rotated anticlockwise into horizontal rectangles by the regressed angle. Then, after ROI Align, they are transferred to the multi-branch network.

Rotation Bounding Box
The refined outline of each building is regarded as the ground truth for the segmentation task. For the detection task, however, the ground truth is the minimum area bounding rectangle (MABR) of the building. Unlike the traditional horizontal bounding rectangle, the MABR is a tight bounding box: it has the minimum area among all bounding rectangles and is normally inclined from the horizontal axis. Figure 2 illustrates the outline and the minimum area bounding rectangle of buildings. Five parameters, i.e., (x, y, w, h, θ), are used to represent the rotation bounding box, where (x, y) is the center coordinate of the bounding box, (w, h) are the lengths of the short side and the long side of the bounding box respectively, and θ is the angle between the long side of the MABR and the x-axis, measured in the counterclockwise direction. θ is constrained within the interval [−π/4, 3π/4) to ensure the uniqueness of the MABR. Figure 3 presents the angle parameter θ of the rotation bounding box.
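To make the angle constraint concrete, any angle can be wrapped into the half-open interval [−π/4, 3π/4) by working modulo π (rotating a rectangle by π yields the same rectangle). A minimal sketch; the function name is ours, not from the paper:

```python
import math

def normalize_theta(theta: float) -> float:
    """Map an arbitrary angle (radians) into [-pi/4, 3*pi/4).

    The MABR angle is only meaningful modulo pi, so wrapping into an
    interval of length pi makes the (x, y, w, h, theta) parameterization
    unique, as required in the text.
    """
    # Shift so the interval starts at 0, wrap modulo pi, shift back.
    return (theta + math.pi / 4) % math.pi - math.pi / 4
```

For example, an angle of −π/2 is equivalent to π/2 for a rectangle, and the wrap maps it accordingly.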


Rotation Anchor
To match the rotation bounding box, rotation anchors are designed by adding a rotation angle to the traditional anchor parameters. Buildings serve different functions, such as factories and residences. Factory buildings, urban and rural housing, and office buildings are likely to have distinct aspect ratios. Based on statistics over a large number of building samples, we set the aspect ratios to {1:2, 1:3, 1:5, 1:7}. Six scales, i.e., {8, 16, 32, 64, 128, 256}, are kept to fit the scale variation of buildings. In addition, we adopt six orientations {−π/6, 0, π/6, π/3, π/2, 2π/3} to match the angle changes of buildings. Thus 144 rotation anchors (4 aspect ratios × 6 scales × 6 orientations) are created for each pixel on the feature map, giving 720 outputs (5 × 144) for the reg layer and 288 score outputs (2 × 144) for the cls layer.
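As a sanity check on these counts, the anchor grid can be enumerated directly. The area-preserving scaling rule and the function name below are our assumptions (the paper does not state exactly how ratio and scale combine); the counts follow from 4 × 6 × 6 = 144 anchors, with 5 regression outputs and 2 class scores per anchor:

```python
import math
from itertools import product

ASPECT_RATIOS = [(1, 2), (1, 3), (1, 5), (1, 7)]            # short : long side
SCALES = [8, 16, 32, 64, 128, 256]
ANGLES = [-math.pi / 6, 0.0, math.pi / 6,
          math.pi / 3, math.pi / 2, 2 * math.pi / 3]

def rotation_anchors(cx, cy):
    """Enumerate every (x, y, w, h, theta) rotation anchor centred at one
    feature-map location: 4 ratios x 6 scales x 6 orientations = 144."""
    anchors = []
    for (rw, rh), s, theta in product(ASPECT_RATIOS, SCALES, ANGLES):
        # Keep the box area at s*s while respecting the aspect ratio
        # (a common convention; the paper does not spell out its rule).
        k = s / math.sqrt(rw * rh)
        anchors.append((cx, cy, rw * k, rh * k, theta))
    return anchors

anchors = rotation_anchors(0.0, 0.0)
reg_outputs = 5 * len(anchors)   # 5 box parameters per anchor
cls_outputs = 2 * len(anchors)   # object / background score per anchor
```

Note that 2 scores per anchor gives 2 × 144 = 288 cls outputs.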

Leveling ROIs
The rotated ROIs output from the RPN stage generally have a certain angle θ against the horizontal axis. The feature map of each ROI is rotated anticlockwise by the angle θ around its center into a horizontal rectangle of the same size using bilinear interpolation. The transformed coordinates are calculated as follows:

x″ = (x′ − x) cos θ − (y′ − y) sin θ + x
y″ = (x′ − x) sin θ + (y′ − y) cos θ + y

where (x, y) is the center coordinate of the bounding box, (x′, y′) is the coordinate of a pixel in the original ROI feature map, and (x″, y″) is the coordinate of the corresponding pixel in the transformed ROI feature map. Then we use ROI Align to process the horizontal feature maps of the ROIs and transfer the resulting fixed-size feature maps to the following multi-branch prediction network.
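The per-pixel coordinate transform above can be sketched as a small helper (the function name is ours; the standard anticlockwise rotation matrix about the ROI centre is assumed):

```python
import math

def rotate_about(px, py, cx, cy, theta):
    """Rotate pixel coordinate (px, py) anticlockwise by theta about the
    ROI centre (cx, cy). This is the coordinate transform applied when
    leveling a rotated ROI into a horizontal rectangle."""
    dx, dy = px - cx, py - cy
    return (cx + dx * math.cos(theta) - dy * math.sin(theta),
            cy + dx * math.sin(theta) + dy * math.cos(theta))
```

Rotating by θ and then by −θ recovers the original coordinate, which is the property the clockwise inverse rotation at the end of the pipeline relies on.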


Multi-Branch Prediction Network
The multi-branch prediction network has three branches: two perform classification and bounding-box regression, respectively, and the third performs segmentation and generates masks. The segmentation branch is reconfigured with Receptive Field Block modules to obtain finer masks by integrating the advantages of the inception block and atrous convolution. Then, the regressed bounding box and the predicted mask are simultaneously rotated back to their original angle θ obtained from the RPN stage. In this way, we obtain the final instance segmentation results for buildings.

Receptive Field Block
The scales of buildings vary significantly, ranging from a dozen pixels to thousands of pixels. To better handle this scale variability, a new architecture named the Receptive Field Block is built upon the structure of the Inception-ResNet module [31] by replacing the filter concatenation stage of the Inception V4 module with a residual connection and stacking atrous convolutions of different kernel sizes and sampling rates. Figure 4 shows the architecture of an RFB module. A 1 × 1 atrous convolution with rate 1, a 3 × 3 atrous convolution with rate 3, and a 3 × 3 atrous convolution with rate 5 are arranged in a parallel three-branch structure, and the feature maps extracted at the different sampling rates are concatenated and followed by a 1 × 1 convolution. The output of this filter is then added, via a shortcut channel, to the output of the preceding layer as a residual. Each branch of the three-branch structure consists of a 1 × 1 convolution layer, which decreases the number of channels in the feature map, followed by an n × n convolution layer.

RFB Stacked Segmentation Network Branch
We replace each convolution layer of the original segmentation branch of Mask R-CNN with an RFB module, followed by sigmoid activation, as shown in Figure 5. Two RFB modules connected in sequence enlarge the receptive field while keeping computation time acceptable. Accumulating more RFB blocks slightly improves performance; however, attaching more than three blocks leads to unstable accuracy and makes training more difficult. The output map of this branch is the mask of the building target.

Figure 5. Pipeline of Receptive Field Block (RFB) modules stacked network.

Inverse Rotation of Mask
The bounding box regression branch only refines the coordinates of horizontal rectangles, i.e., (x, y, w, h). The angle θ generated in the rotation RPN stage is adopted as the final angle parameter. The horizontal rectangle is rotated clockwise by the angle θ to give the final rotation bounding box. The m × m mask output predicted by the segmentation branch is first rotated clockwise by the angle θ, then resized to the size of the final bounding box and binarized at a threshold of 0.5.
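A sketch of this post-processing step, under stated assumptions: we use `scipy.ndimage.rotate` for the clockwise rotation and a simple nearest-neighbour resize, since the paper does not specify its interpolation scheme; the function name is ours.

```python
import numpy as np
from scipy.ndimage import rotate as nd_rotate

def finalize_mask(mask_probs, box_w, box_h, theta_deg):
    """Rotate the m x m soft mask clockwise by theta, resize it to the
    final box size, and binarize at 0.5, mirroring the text.

    mask_probs: (m, m) float array of per-pixel probabilities.
    """
    # scipy rotates anticlockwise for positive angles, so negate for clockwise.
    rotated = nd_rotate(mask_probs, -theta_deg, reshape=False, order=1)
    # Nearest-neighbour resize to (box_h, box_w).
    m = rotated.shape[0]
    rows = (np.arange(box_h) * m / box_h).astype(int)
    cols = (np.arange(box_w) * m / box_w).astype(int)
    resized = rotated[np.ix_(rows, cols)]
    return (resized >= 0.5).astype(np.uint8)
```

In a production pipeline the resize would typically use bilinear interpolation before thresholding, but the threshold-at-0.5 step is as stated in the text.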


Loss Function
Positive labels are assigned to anchors as follows: (i) the anchor(s) with the highest IoU overlap with a ground-truth box; (ii) an anchor with an IoU overlap higher than 0.8 and an angular separation of less than 10 degrees from the ground-truth box. A negative label is assigned to an anchor under either of two conditions: (i) its IoU overlap is less than 0.2; (ii) its IoU overlap is higher than 0.8 but its angular separation is greater than 10 degrees. Anchors with neither positive nor negative labels are ignored during training.
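The labeling rule translates directly into a small decision function (the function name and the None-for-ignored convention are ours):

```python
def assign_label(iou, angle_diff_deg, is_best_match=False):
    """Anchor labeling rule from the text.

    Returns 1 (positive), 0 (negative), or None (ignored in training):
      positive: best-matching anchor for a ground-truth box, or
                IoU > 0.8 with angular separation < 10 degrees;
      negative: IoU < 0.2, or IoU > 0.8 with angular separation > 10 degrees;
      ignored:  everything else.
    """
    if is_best_match:
        return 1
    if iou > 0.8:
        return 1 if angle_diff_deg < 10 else 0
    if iou < 0.2:
        return 0
    return None
```

Anchors in the middle IoU band (0.2 to 0.8) that are not the best match contribute nothing to the loss.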
We follow the multi-task loss of Mask R-CNN, defined as follows, to train our method:

L = (1/N_cls) Σ_i L_cls(p_i, p*_i) + λ (1/N_reg) Σ_i p*_i L_reg(t_i, t*_i) + γ (1/N_mask) Σ_i L_mask(s_i, s*_i)    (2)

where p*_i represents the ground-truth label of the object, p_i is the predicted probability distribution of anchor i being an object of the different classes, t*_i is the vector of coordinate offsets between the ground-truth box and the positive anchors, t_i is the predicted five-parameter coordinate offset vector relative to the ground-truth box, s*_i is the ground-truth binary mask matrix, and s_i is the predicted mask of the object. The hyper-parameters λ and γ in Equation (2) control the balance between the three task losses.
The regression targets for the 5 coordinate parameters of the rotation bounding box are defined as follows:

t_x = (x − x_a)/w_a,  t_y = (y − y_a)/h_a,  t_w = log(w/w_a),  t_h = log(h/h_a),  t_θ = θ − θ_a + kπ
t*_x = (x* − x_a)/w_a,  t*_y = (y* − y_a)/h_a,  t*_w = log(w*/w_a),  t*_h = log(h*/h_a),  t*_θ = θ* − θ_a + kπ

where x, y, w and h denote the box's center coordinates and its width and height. Variables x, x_a and x* are for the predicted box, anchor box, and ground-truth box, respectively (likewise for y, w, h and θ). The integer k ∈ ℤ is chosen to keep t_θ and t*_θ in the range [−π/4, 3π/4).
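A minimal sketch of this encoding; choosing k amounts to wrapping the angle difference into [−π/4, 3π/4) modulo π (the function name is ours):

```python
import math

def encode_box(box, anchor):
    """Regression targets (t_x, t_y, t_w, t_h, t_theta) for a rotated box
    relative to an anchor, following the standard Faster R-CNN encoding
    extended with an angle term. The k*pi term is absorbed by wrapping
    t_theta into [-pi/4, 3*pi/4)."""
    x, y, w, h, th = box
    xa, ya, wa, ha, tha = anchor
    # Wrap (theta - theta_a + k*pi) into [-pi/4, 3*pi/4).
    t_theta = (th - tha + math.pi / 4) % math.pi - math.pi / 4
    return ((x - xa) / wa, (y - ya) / ha,
            math.log(w / wa), math.log(h / ha), t_theta)
```

A box identical to its anchor encodes to all zeros, and an angle difference of exactly π also encodes to zero, since rotating a rectangle by π leaves it unchanged.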

Data and Research Area
To assess the performance of the proposed method and facilitate future research, we collected a large volume of images from Google Earth over Fujian province, China, as shown in Figure 6. Diverse regions, including cities, towns, and villages, were selected to cover different kinds of buildings. Several examples are shown in Figure 7, from which we can see that almost all types of buildings, from village to urban, from small rural housing to villas and high-rise apartments, from L-shaped to U-shaped, are included in our dataset, providing plenty of samples for training the models. 86 typical regions with a spatial resolution of 0.26 m were selected, with image sizes ranging from 1000 × 1000 to 10,000 × 10,000 pixels. After obtaining the images, five students majoring in geography and surveying science were asked to manually label the buildings with polygon vectors using ArcGIS 10.2. The polygon vectors fit the outlines of the building footprints, as shown in Figure 7.
Because deep learning models can only learn parameters from fixed-size images with numerical labels, we cropped the images into 500 × 500 tiles and mapped the vector boundaries into bounding boxes. Finally, we have 2000 images and 84,366 buildings in total. We split the dataset equally into two parts, one for training and the other for testing.
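The cropping step can be sketched as a simple tiling routine. The border-handling convention (shifting the last row/column inwards so every crop stays full size) and the function name are our assumptions; the paper only states the 500 × 500 tile size:

```python
def tile_origins(height, width, tile=500):
    """Top-left corners of tile x tile crops covering an image.

    Tiles are laid out on a regular grid; if the image size is not a
    multiple of the tile size, the final row/column is shifted inwards
    so every crop is exactly tile x tile (one simple convention)."""
    def starts(n):
        s = list(range(0, max(n - tile, 0) + 1, tile))
        if s[-1] + tile < n:
            s.append(n - tile)   # shifted final tile covers the remainder
        return s
    return [(r, c) for r in starts(height) for c in starts(width)]
```

A 1000 × 1000 image yields four non-overlapping tiles, while a 1200-pixel side produces a third, partially overlapping tile starting at 700.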

Implementation Details
The model is built upon the Mask R-CNN framework. We use PyTorch to implement the proposed method and train it with the Adam optimizer. The backbone is ResNet-101, pre-trained on the ImageNet dataset. The learning rate was initialized to 0.001 and decayed every 25 k iterations; the model converges within 80 k iterations. Other hyperparameters, such as weight decay and momentum, were set to 0.0001 and 0.9 as recommended. At inference time, 500 proposals are generated for predicting buildings and refining their locations. The top 100 predictions with the highest scores are sent to the segmentation branch to obtain their masks. All experiments, including training and testing, are conducted on a single 1080Ti GPU with 12 GB of on-board memory.
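An illustrative PyTorch setup matching these hyperparameters. The decay factor is not stated in the paper, so a factor of 0.1 is assumed here; the "momentum 0.9" is taken as Adam's beta1; and the one-layer model below is only a stand-in for the actual network:

```python
import torch

# Stand-in model; the real backbone is ResNet-101 per the text.
model = torch.nn.Conv2d(3, 8, 3)

# Adam, initial LR 0.001, weight decay 1e-4, beta1 = 0.9 ("momentum").
optimizer = torch.optim.Adam(model.parameters(), lr=0.001,
                             betas=(0.9, 0.999), weight_decay=0.0001)

# Decay the learning rate every 25k iterations (factor 0.1 assumed).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer,
                                            step_size=25_000, gamma=0.1)
```

In the training loop, `scheduler.step()` is called once per iteration after `optimizer.step()`, so after 25 k iterations the learning rate drops from 0.001 to 0.0001.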


Evaluation of Detection Task
Building detection from very complex backgrounds is an important task. Detecting objects in images has been a hot research topic in the computer vision community, and many deep learning based methods have been proposed in recent years. Most of these methods fall into two groups: two-stage methods and one-stage methods. Two-stage methods have an RPN that generates candidate regions potentially containing objects, followed by a network that classifies these regions into object categories and refines their coordinates simultaneously; the representative method is Faster R-CNN and its variants. One-stage methods directly predict the classification scores and coordinates of objects from the feature maps without an RPN stage. Thus, one-stage methods are faster than two-stage methods at inference, but perform worse at detecting and locating objects. In this work, we compare our method with Mask R-CNN and Faster R-CNN, since they obtain state-of-the-art results. Two different networks, VGG [9] and ResNet101 [10], are used as backbones for Faster R-CNN. The proposed method and Mask R-CNN are not configured with the VGG network: both are built upon Faster R-CNN, so it is unnecessary to repeat the VGG configuration. We use mean average precision (mAP) to evaluate the performance of the proposed method. The results are listed in Table 1, and a few examples are shown in Figure 8.
From Table 1 we can see that Faster R-CNN configured with ResNet101 significantly outperforms its VGG version, indicating the powerful ability of residual networks. ResNets have been used widely in various computer vision tasks and demonstrate superior performance over shallower networks such as VGG. Thus, in the following experiments we also employ ResNet101, probably the most widely used residual network, as the backbone of the proposed method. Mask R-CNN-ResNet101 obtains results similar to Faster R-CNN-ResNet101; they are effectively the same model when only the detection task is considered. The proposed method improves the results with the help of the rotation anchors. The reason may be that rotated anchors provide more information about target characteristics (i.e., rotation angle) than normal anchors, so they are better suited to capturing features of rotated objects. They also have a higher probability of filtering out pixels of distracting backgrounds than normal anchors, which leads to better results.
From Figure 8 it can be observed that Faster R-CNN configured with VGG misses the most buildings. The image in the first row is very challenging: many buildings are located close together and are hard to distinguish from each other. Faster R-CNN-VGG misses many of them, while Mask R-CNN-ResNet101 and the proposed method obtain the best results, though some buildings are still missed. The rotated bounding boxes fit the building footprints well, as can be seen in the last column of Figure 8.

Evaluation of Segmentation Task
Segmenting buildings from their surrounding backgrounds is also known as building extraction. In this subsection, we compare our method with the segmentation branch of Mask R-CNN. Three indicators, precision, recall, and F1 score, are used to evaluate performance. We report them in Table 2 and show some examples of the segmentation results in Figure 9.
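For reference, the three indicators are computed from true positive, false positive, and false negative counts in the usual way (the function name is ours):

```python
def prf1(tp, fp, fn):
    """Precision, recall and F1 score from TP/FP/FN counts, as used for
    the segmentation evaluation. Guards against division by zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

F1 is the harmonic mean of precision and recall, so it penalizes a method that trades one heavily for the other.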
From Table 2 we can see that the proposed method outperforms Mask R-CNN-ResNet101 on all three indicators. Note that the Fujian dataset is very challenging: many buildings are hard to distinguish from their surroundings due to the poor quality of the Google Earth images. The RFB block [30], inspired by the mechanisms of the human visual system, plays a central role in improving segmentation performance. One possible explanation is that atrous convolution enlarges the receptive fields, and the combinations of different kernel sizes and rates enable the extraction of more powerful features. This can be seen in Figure 9: the proposed method successfully segments some hard-to-distinguish buildings that are missed by Mask R-CNN-ResNet101, as shown in the second and fourth columns.
Mask R-CNN produces instance-level segmentation results, which means different instances of the same category are annotated with distinct pixel-level labels, as indicated by the different colors in Figure 9. Instance segmentation is extremely useful when buildings are close to each other, have adjacent boundaries, or even share the same wall. General segmentation methods such as U-Net-style networks [32] cannot distinguish different instances; for adjacent buildings they may generate one big mask covering several buildings. Mask R-CNN provides a good solution by segmenting buildings within their bounding boxes, which also helps improve segmentation accuracy and provides fine building outlines. We demonstrate that the results can be further boosted by inserting the RFB blocks.

Discussion
Our proposed method has achieved improved performance for building extraction and segmentation tasks in terms of quantitative indicators, especially on building detection. However, we believe the performance could further be improved from the following aspects.

1. More diverse building samples. Deep neural networks are data-hungry models, requiring a huge volume of training samples. Although we have labeled thousands of buildings to train our network, providing more samples will further boost the performance. In addition, buildings have diverse sizes and structures. For instance, factory buildings and residential houses possess distinctly different features; even among residential houses, buildings in cities and villages differ in size, aspect ratio, and shape. To detect them all, the samples should cover as many cases as possible. Moreover, complex backgrounds can distract the detector, especially when they contain objects with similar appearance, such as vehicles, ships, and roads. An example is shown in Figure 10. It is therefore better to label a certain amount of buildings under complex backgrounds.

2. Refine the rotated angle of the bounding box. In this work, the value of the rotated angle is regressed in the RPN stage. Since there is only one category, i.e., buildings, to be detected, the ROIs generated by the RPN should be close to those of the detection branch; thus, we use the angles from the RPN as the final rotation angles. However, we believe that, similar to bounding box regression, they could be further refined in the second stage. In the future, we will focus on two solutions. The first is designing a new branch, added after the second stage, to refine the rotated angle. The new branch would accept the rotated mask as input and predict the angle. The second is passing the vertical ROIs generated by the RPN to the second stage. The vertical ROIs contain rotation information and thus can be used to infer the angle value. Since ROI Align is applied in the RPN stage, we would obtain more accurate angles.
3. Network compression. The Mask R-CNN framework has a huge number of parameters, which consumes a large amount of computation resources and slows down inference.
In recent years, with the rapid development of mobile devices and the demand for real-time computation, researchers have attempted to compress the size of deep models while maintaining their performance. These methods address the network compression problem from three aspects: designing light networks, network pruning, and kernel sparsity. The backbones of both Mask R-CNN and the proposed method are based on residual networks, which could be pruned to produce a lighter backbone. In addition, some inherently light networks such as ShuffleNet [33] and CornerNet [34] could be used in the design of the proposed method. Building extraction is still an open problem requiring more research effort. In the future, we plan to design and train specific networks aimed at detecting closely located small buildings, large-scale buildings, and buildings with special shapes or under confusing backgrounds.
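The rotated-box representation underlying the angle-refinement discussion above can be sketched with the common (cx, cy, w, h, θ) parameterization; the axis and sign conventions here are assumptions for illustration, not necessarily those of our implementation:

```python
import math

def rotated_box_corners(cx, cy, w, h, theta):
    """Corner coordinates of a w x h box centered at (cx, cy), rotated
    by theta radians about its center (counterclockwise, assumed)."""
    c, s = math.cos(theta), math.sin(theta)
    half = [(-w / 2, -h / 2), (w / 2, -h / 2),
            (w / 2, h / 2), (-w / 2, h / 2)]
    # Rotate each corner offset, then translate by the box center.
    return [(cx + x * c - y * s, cy + x * s + y * c) for x, y in half]

# With theta = 0 the corners reduce to an axis-aligned rectangle.
print(rotated_box_corners(5.0, 5.0, 4.0, 2.0, 0.0))
# [(3.0, 4.0), (7.0, 4.0), (7.0, 6.0), (3.0, 6.0)]
```

A second-stage angle-refinement branch, as proposed above, would predict a small correction Δθ to this parameterization rather than re-estimating the box from scratch.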

Conclusions
In this paper, we propose an automatic building extraction method based on an improved Mask R-CNN framework, which detects the rotated bounding boxes of buildings and segments them from very complex backgrounds simultaneously. Rotated anchors with inclined angles are used to regress the rotated bounding boxes of buildings in the RPN stage. Then, after anticlockwise rotation and ROI Align, the feature maps are passed to the multi-branch prediction network. RFB modules are inserted into the segmentation branch to handle multi-scale variability, while the other branches output the classification scores and horizontal rectangle coordinates. Finally, the mask and rectangle bounding box are rotated clockwise by the inclined angle to form the final instance segmentation result. Experimental results on a newly collected large Google Earth remote sensing dataset with diverse buildings under complex backgrounds show that our method achieves promising results. Future work will focus on sample annotation and on the improvement and compression of the network structure to further promote the performance of our method.
Author Contributions: Q.W. and Q.L. designed the method and experiments, and wrote the paper; K.J. performed the experiments; W.W. and Q.G. analyzed the experiment results; L.L. and P.W. prepared the collected Google Earth dataset.