Over the last decade, many countries around the world have suffered from natural disasters, which have been increasing in frequency and intensity and have caused huge losses of exposed persons and assets [1]. Disaster loss assessment can provide technical support and a decision-making basis for disaster relief and post-disaster reconstruction. Losses from building damage usually account for a high percentage of total losses, especially in typhoon, earthquake, flood, and geological disasters, so the loss assessment of building damage is an essential part of overall loss assessment. Remote sensing images have played an important role in building damage assessment owing to their wide coverage, high resolution, and high timeliness [2]. Building footprint vector data provide basic information about buildings, such as outline, area, and shape; they can be used to assess the damage condition of each building and to count the total number of damaged buildings in a disaster region [4].
How to extract building footprints from pre-disaster high-resolution remote sensing images is a key problem that has attracted much attention from both academia and industry for decades. In terms of feature extraction, classical methods for building footprint extraction from remote sensing images fall into two categories: bottom-up data-driven methods and top-down model-driven methods [5]. Bottom-up data-driven methods mainly exploit low-level features such as lines, corners, regional textures, shadows, and height differences from remote sensing images, and assemble them under certain rules to identify building targets. Top-down model-driven methods start from a semantic model and prior knowledge of whole building targets, use high-level global features, and guide element extraction, image segmentation, spatial-relationship modeling, and contour-curve evolution toward the building targets. The accuracy and efficiency of both types of methods are usually insufficient for the practical requirements of building extraction from remote sensing imagery.
In recent years, deep learning methods, represented by the Convolutional Neural Network (CNN) and the Recurrent Neural Network (RNN), have gradually come to dominate the field of artificial intelligence [6]. Progress in deep learning has greatly promoted the development of image classification (AlexNet [7], the Visual Geometry Group network (VGG) [8], GoogLeNet [9], the Residual Network (ResNet) [10]), image semantic segmentation (the Fully Convolutional Network (FCN) [11], U-Net [12]), object detection (the Region-based Convolutional Neural Network (R-CNN) [13], Fast R-CNN [14], Faster R-CNN [15], the Region-based Fully Convolutional Network (R-FCN) [16], You Only Look Once (YOLO) [17], the Single Shot MultiBox Detector (SSD) [18]), and other classical computer vision problems. These achievements have inspired researchers in the remote sensing community to apply deep learning techniques to building extraction. A straightforward strategy is to adapt these algorithms to the building extraction task. For instance, Vakalopoulou et al. [19] presented one of the first deep-learning-based building extraction methods, in which small patches were fed into an ensemble classifier combining AlexNet and a Support Vector Machine (SVM) to detect buildings, and the pixel-wise classification result was then refined by a Markov Random Field (MRF); they achieved 90% average correctness on QuickBird and WorldView images. In [20], a 3-layer hierarchically fused fully convolutional network (HF-FCN) was developed, which was based on FCN, truncated the 6th and 7th fully connected layers and the 5th pooling layer, and fused multi-level convolution and upsampling layers into a cascaded segmentation network; it achieved a 91% average recall rate on the Massachusetts Buildings Dataset [21]. Zhang et al. [22] developed a CNN-based building detection method: they employed a multi-scale saliency map to locate built-up areas and combined it with a sliding window to obtain candidate patches classified by a CNN, achieving a promising 89% detection precision on Google Earth remote sensing images with a spatial resolution of 0.26 m. A network structure that adopts a SegNet [23]-like encoder-decoder style and incorporates upsampling and densification operations into the deconvolution layers achieved 95.62% building segmentation precision in the ISPRS 2D Semantic Labeling Challenge [24]. A patch-based CNN classification network, which replaces the fully connected layer with a Global Average Pooling layer and uses simple linear iterative clustering (SLIC) superpixel segmentation for post-processing, achieved a building segmentation accuracy exceeding that of previous similar work on the Massachusetts Buildings Dataset and the Abu Dhabi Dataset [25]. A multi-constraint fully convolutional network (MC-FCN), which adopts the basic structure of a fully skip-connected U-Net and adds a multi-layer constraint between each feature map and the corresponding multi-scale ground-truth annotation to increase the expressive power of the middle layers, achieved a 97% total accuracy of building segmentation on New Zealand aerial remote sensing images [26]. Based on the U-Net architecture, the Res-U-Net segmentation network uses residual units instead of plain neural units as basic blocks, with a guided filter adopted as a subsequent step to fine-tune the building extraction result; its building segmentation accuracy on the ISPRS 2D Semantic Labeling Challenge dataset exceeds 97% [27]. In general, existing deep learning work on building extraction from high-resolution remote sensing images is mainly based on semantic segmentation, while work based on object detection and image classification is scarce. The main idea is to enrich context information by adding multi-layer features to the FCN framework, thereby improving the ability to handle the complex backgrounds of remote sensing images and small building targets.
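The multi-layer feature fusion described above can be illustrated with a minimal, framework-free sketch: a coarse (deep) feature map is upsampled and concatenated with a finer (shallow) one along the channel axis, in the spirit of FCN/U-Net skip connections. The function names and toy shapes here are illustrative, not taken from any of the cited networks.

```python
import numpy as np

def upsample2x(feat):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return feat.repeat(2, axis=1).repeat(2, axis=2)

def fuse(deep, shallow):
    """Upsample the deeper (coarser) map and concatenate it with the
    shallower one along the channel axis, FCN/U-Net skip-connection style."""
    return np.concatenate([upsample2x(deep), shallow], axis=0)

# Toy feature maps: a deep 8-channel 4x4 map and a shallow 4-channel 8x8 map.
deep = np.random.rand(8, 4, 4)
shallow = np.random.rand(4, 8, 8)
fused = fuse(deep, shallow)
print(fused.shape)  # (12, 8, 8)
```

The fused map carries both the semantic context of the deep layer and the spatial detail of the shallow layer, which is why this pattern helps with small building targets.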
Semantic segmentation under complex geospatial backgrounds tends to connect the edges of closely adjacent buildings, which is unfavorable for subsequent edge extraction and outline fitting owing to the resulting edge confusion. Mask R-CNN [28], a pioneering work in instance segmentation, the task of predicting bounding boxes and segmentation masks simultaneously, has achieved significant improvements. In that framework, segmentation is carried out on the basis of detection results, making it especially suitable for outline extraction of densely distributed small buildings. A newly defined Rotation Bounding Box (RBB), which involves angle regression, has been incorporated into the Faster R-CNN framework; this method forces the detection network to learn the correct orientation angle of ship targets through an angle-related IoU and an angle-related loss function [29]. Meanwhile, a novel Receptive Field Block (RFB) module, which uses multi-branch pooling with varying kernels and atrous convolution layers to simulate receptive fields of different sizes in the human visual system, has been developed to strengthen the deep features learned by lightweight CNN detection models [30]. In this paper, we incorporate the RBB and the RFB into the RPN stage and the segmentation branch of the Mask R-CNN framework, respectively. This improvement provides tighter bounding boxes and further promotes the accuracy of mask prediction owing to better adaptation to multi-scale building targets.
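To make the angle-regression idea concrete, the sketch below shows one plausible form of a rotated-box regression loss: a smooth L1 penalty over a (cx, cy, w, h, theta) residual, with the angle difference wrapped to reflect the 180-degree symmetry of a rectangle. This parameterization and wrapping scheme are illustrative assumptions, not the exact loss defined in [29].

```python
import numpy as np

def smooth_l1(x):
    """Element-wise smooth L1 (Huber-like) loss, as used for box regression
    in Faster R-CNN style detectors."""
    x = np.abs(x)
    return np.where(x < 1.0, 0.5 * x * x, x - 0.5)

def rbb_regression_loss(pred, target):
    """Hypothetical rotated-bounding-box regression loss.

    pred/target are (cx, cy, w, h, theta) tuples. The angle residual is
    wrapped into [-pi/2, pi/2) so that, e.g., 89 deg vs -89 deg counts as a
    2 deg error, matching the rotational symmetry of a rectangle.
    """
    diff = np.asarray(pred, dtype=float) - np.asarray(target, dtype=float)
    # Wrap the angle term (last component) into [-pi/2, pi/2).
    diff[4] = (diff[4] + np.pi / 2) % np.pi - np.pi / 2
    return smooth_l1(diff).sum()
```

Without such wrapping, near-vertical boxes would incur a huge spurious angle penalty, which is the kind of issue an angle-aware loss is designed to avoid.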
Experiments on a newly collected large building outline dataset show that our method, improved from the Mask R-CNN framework, achieves state-of-the-art performance on the joint building detection and rooftop segmentation task.