Semantic Segmentation of Building Roof in Dense Urban Environment with Deep Convolutional Neural Network: A Case Study Using GF2 VHR Imagery in China

This paper presents a novel approach for semantic segmentation of building roofs in dense urban environments with a Deep Convolution Neural Network (DCNN) using Chinese Very High Resolution (VHR) satellite (i.e., GF2) imagery. To provide an operational end-to-end approach for accurately mapping build roofs with feature extraction and image segmentation, a fully convolutional DCNN with both convolutional and deconvolutional layers is designed to perform building roof segmentation. We selected typical cities with dense and diverse urban environments in different metropolitan regions of China as study areas, and sample images were collected over cities. High performance GPU-mounted workstations are employed to perform the model training and optimization. With the building roof samples collected over different cities, the predictive model with convolution layers is developed for building roof segmentation. The validation shows that the overall accuracy (OA) and the mean Intersection Over Union (mIOU) of DCNN-based semantic segmentation results are 94.67% and 0.85, respectively, and the CRF-refined segmentation results achieved OA of 94.69% and mIOU of 0.83. The results suggest that the proposed approach is a promising solution for building roof mapping with VHR images over large areas in dense urban environments with different building patterns. With the operational acquisition of GF2 VHR imagery, it is expected to develop an automated pipeline of operational built-up area monitoring, and the timely update of building roof map could be applied in urban management and assessment of human settlement-related sustainable development goals over large areas.


Introduction
Urbanization is a process whereby human beings significantly affect the natural environment of land surfaces. It not only causes changes in land cover and land use, but also has profound effects on the daily life of our society [1]. Buildings are the essential part of urban environment. They play a vital role in human life as a basic infrastructure of human settlement [2]. Accurate and timely maps of buildings are crucial for urban planning, environmental management and sustainable development studies, especially in areas undergoing rapid urbanization [3,4]. However, the conventional ways of building roof maps, e.g., field surveys or manual annotation of imagery, are time-consuming and labor-intensive to provide timely and accurate building information for developing countries, affordable or free VHR images like GF2 PMS imagery would advance the remote sensing technology to broad domains in SDGs assessment. Gaofen-2 (GF2) is one of the series satellite missions in China's High-resolution Earth Observation System (CHEOS) [30]. It was launched on August 19, 2014 with a panchromatic and multispectral sensor (PMS) which acquires panchromatic and multispectral images simultaneously. It is expected to provide a long-term VHR satellite imagery at an affordable cost with global coverage, especially of developing countries.
This study aims to examine the potential of the GF2 PMS for building roof mapping over large urban areas, especially in dense and diverse urban environments. It presents our efforts at building roof segmentation with DCNN over mega-cities in China, using a DCNN model. With the development of an automatic pipeline for operational building mapping with Chinese GF2 VHR imagery over large areas, regardless of the location of the areas and the acquisition data of the image it is expected to provide an alternative solution for supporting sustainable development assessment and urban planning with VHR imagery at low cost.

Data Sets and Study Areas
The GF2 PMS is capable of collecting VHR imagery with a Ground Sampling Distance (GSD) of of 0.8 m panchromatic and 3.2 m multispectral bands on a swath of 45 km. Table 1 shows the basic configurations of the sensor. To cover the diverse urban patterns and building styles in China, typical cities across the country were selected in different metropolitan regions as study areas, and training and test GF2 PMS images were collected with manual delineation. Table 2 lists the cities and acquisition date of the images. With the assumption that imagery acquired during the growing season would provide better visual effects in both the spectral and spatial domains, we only choose images acquired in the period between June and September in 2016 with cloud cover of less than 5%, and a total of seven images were selected in this study for further processing. Though the orientation of PMS images is well calibrated, high precision co-registration between the panchromatic and multispectral images acquired by PMS is performed to keep rigorous geometrical alignment. With the reference of corresponding panchromatic imagery, visual selection of ground control points is carried out for georectification of the multispectral images. The overall mean registration error for all images is less than 1.0 m. With the panchromatic and the corresponding rectified multispectral image, pan-sharping is conducted to obtain fused images with spatial resolution of 1.0 m using a Gram-Schmidt algorithm. Figure 1 illustrates the PMS panchromatic image, the true color composites of PMS multispectral image, and the true color composites of pan-sharped images. The images shows that the pan-sharping could enhance the characteristics of the building boundaries with much higher resolution. mean registration error for all images is less than 1.0 m. With the panchromatic and the corresponding rectified multispectral image, pan-sharping is conducted to obtain fused images with spatial resolution of 1.0 m using a Gram-Schmidt algorithm. Figure 1 illustrates the PMS panchromatic image, the true color composites of PMS multispectral image, and the true color composites of pan-sharped images. The images shows that the pan-sharping could enhance the characteristics of the building boundaries with much higher resolution.

Collection of Sample Images
With the pan-sharped high resolution imagery, sample data set was collected by manual delineation, though the basic unit of delineation is individual building, for those closely connected buildings, neighbour buildings are merged together since it is hard to separate individual buildings. And the manual delineated polygons are then converted to binary images as building roof mask, i.e., the binary image values of 1 and 0 means building roof, rest of land cover types respectively.

Methodology
In this paper, a DCNN model is developed to perform the feature extraction and per-pixel labeling for building roof upon the VHR satellite imagery. Generally, a DCNN model consists of

Collection of Sample Images
With the pan-sharped high resolution imagery, sample data set was collected by manual delineation, though the basic unit of delineation is individual building, for those closely connected buildings, neighbour buildings are merged together since it is hard to separate individual buildings. And the manual delineated polygons are then converted to binary images as building roof mask, i.e., the binary image values of 1 and 0 means building roof, rest of land cover types respectively.

Methodology
In this paper, a DCNN model is developed to perform the feature extraction and per-pixel labeling for building roof upon the VHR satellite imagery. Generally, a DCNN model consists of convolutional and pooling layers, where the convolutional layers perform feature extraction and the pooling layers summarize the features by aggregating neighbour pixels in imagery, while both convolutional and pooling layers reduce image size, the deconvolutional layer provide a way to upsample image to original resolution, the detailed description of the three type operations is well introduced in [31].

Design of DCNN
There are many well-known CNN models for vision recognition tasks, e.g., AlexNet, VGG, GoogLeNet, RestNet. To develop an accurate model with high computational efficiency, the VGG-16 is selected as the basic DCNN network for building roof segmentation with GF2 VHR satellite imagery, according to Occam's Razor. It provides promising results in different vision tasks with reasonable model size. The VGG-16 model is defined as a combination of several convolutional and pooling layers. However, the original VGG-16 model was designed to provide a summarized semantic information at the image level. Long et al., proposed a fully connected network (FCN) for dense prediction of images at pixel level, with the use of deconvolutional operation to upsample CNN layers [32], FCN provides an end-to-end framework for image segmentation at pixel level. Since the building roof mapping requires dense prediction, the deconvolutional layer is adopted to recover the image size for dense per-pixel prediction. Figure 2 illustrates the architecture of the DCNN model for the building roof segmentation with GF2 imagery, with both convolutional and deconvolutional layers. It is well addressed that the distribution drift of data, i.e., image Digital Number (DN) in this study, between layers in DCNN may significantly reduce the computational efficiency and segmentation accuracy. Batch normalization layers, as additional layers in fully convolutional DCNN, are therefore placed to perform feature normalization between layers. The fully convolutional network perform both feature extraction with spatial-spectral information of imagery and per-pixel class prediction. convolutional and pooling layers, where the convolutional layers perform feature extraction and the pooling layers summarize the features by aggregating neighbour pixels in imagery, while both convolutional and pooling layers reduce image size, the deconvolutional layer provide a way to upsample image to original resolution, the detailed description of the three type operations is well introduced in [31].

Design of DCNN
There are many well-known CNN models for vision recognition tasks, e.g., AlexNet, VGG, GoogLeNet, RestNet. To develop an accurate model with high computational efficiency, the VGG-16 is selected as the basic DCNN network for building roof segmentation with GF2 VHR satellite imagery, according to Occam's Razor. It provides promising results in different vision tasks with reasonable model size. The VGG-16 model is defined as a combination of several convolutional and pooling layers. However, the original VGG-16 model was designed to provide a summarized semantic information at the image level. Long et al.,proposed a fully connected network (FCN) for dense prediction of images at pixel level, with the use of deconvolutional operation to upsample CNN layers [32], FCN provides an end-to-end framework for image segmentation at pixel level. Since the building roof mapping requires dense prediction, the deconvolutional layer is adopted to recover the image size for dense per-pixel prediction. Figure 2 illustrates the architecture of the DCNN model for the building roof segmentation with GF2 imagery, with both convolutional and deconvolutional layers. It is well addressed that the distribution drift of data, i.e., image Digital Number (DN) in this study, between layers in DCNN may significantly reduce the computational efficiency and segmentation accuracy. Batch normalization layers, as additional layers in fully convolutional DCNN, are therefore placed to perform feature normalization between layers. The fully convolutional network perform both feature extraction with spatial-spectral information of imagery and per-pixel class prediction.

DCNN Model Training and Inference
To reduce the effect of varied illumination and atmosphere for the images acquired at different time over different areas, Equation (1) is employed to normalize the original image DN values at the scene level. This is expected to provide comparable training and test data sets. The normalized images are then clipped to sample images with a size of 512 × 512 pixels. In total 1460 sample images are selected, where 80% of the sample images are randomly selected for training, while the remaining 20% of the samples are selected for validation: where DN ori and DN T are the DN values of original and transformed images, respectively, and M,N are the image height and width. The DCNN model is implemented upon the open source deep learning package developed by Google, i.e., Tensorflow. High performance workstation with GPUs, i.e., NVIDIA TitanX, is employed to perform the DCNN model training and inference, and the batch size in the training is 8. The dropout rate of 0.5 is set at training stage to prevent the DCNN from over-fitting.

Post-Processing with Conditional Random Field (CRF)
It is well established that the convolutional and pooling operations in image segmentation can smooth the boundaries of objects [32]. To refine the segmentation results, CRF is employed to regularize the building segmentation. As a probabilistic graphical model, CRF has been widely applied in image analysis. It provides a solution for image analysis by connecting pixels with neighbors using a graph, the graph consists of nodes, which are single image pixels. The nodes, together with edges of the graph, are utilized to characterize spatial-spectral relationship. The posterior probability of the model can be characterized by the Gibbs distribution: where P(Y|X) is the normalized conditional probability of the event X and Y, which is characterized by a Gibbs distribution. P(Y,X) is the joint distribution probability, which is defined as an exponential function of weighted factors, and the factors are customized model for specific problems. Reference [31] proposed a fully connected CRF model for image segmentation upon pixel level, the joint distribution probability is parameterized with the summary of unary and pairwise potentials (Equation (3)): where E(x) is the Gibbs energy, ∑ i ψ u (x i ) is the unary term, which is computed independently for each pixel. ∑ i<j ψ p x i , x j denotes the pairwise potentials, which are designed for characterizing the relationship between pixels and their neighbors. Detailed definitions of the unary and pairwise potentials can be found in [33]. With the mean field approximation algorithm, the maximization of the posterior probability of Gibbs distribution is calculated for inference of the model. In this study, the fully connected CRF model was employed for building roof map segmentation.

Training of the DCNN Model
The batch size for the training is 8, with 3000 epochs for the optimization of the DCNN model. Figure 3 shows the change of loss in the training step, it suggest that the loss gradually decreased during the training stage. The moving average loss is calculated with moving window size of 100, while the maximum value of moving average error approximately is 1.2, the minimum value of moving average error is 0.05, and it does not have significant change in the end of training, that means the DCNN model converges to moving average error of 0.05.

Training of the DCNN Model
The batch size for the training is 8, with 3000 epochs for the optimization of the DCNN model. Figure 3 shows the change of loss in the training step, it suggest that the loss gradually decreased during the training stage. The moving average loss is calculated with moving window size of 100, while the maximum value of moving average error approximately is 1.2, the minimum value of moving average error is 0.05, and it does not have significant change in the end of training, that means the DCNN model converges to moving average error of 0.05.

Qualitative Assessment
Basically, the semantic segmentation of building roof is a binary classification, i.e., image pixels are labeled as either building roofs or non-building roofs. With the DCNN model trained by the interactive optimization, the segmentation of building roofs, i.e., the prediction of per-pixel label, is performed with the trained DCNN model. CRF is also employed to refine the segmentation results in this study. Figure 4 illustrates the building roof segmentation results with fused GF2 PMS imagery over the different study areas. Dense urban environments with business buildings, dense urban environments with apartments, low density environments with apartments, urban environments with single family houses and sparse urban environments, are selected for qualitative assessment. Colorized labels which indicate the comparison between prediction and ground truth are applied for visualizing the segmentation results in details. Visual inspection upon the segmentation results suggests that the DCNN could generate a promising building roof map over

Qualitative Assessment
Basically, the semantic segmentation of building roof is a binary classification, i.e., image pixels are labeled as either building roofs or non-building roofs. With the DCNN model trained by the interactive optimization, the segmentation of building roofs, i.e., the prediction of per-pixel label, is performed with the trained DCNN model. CRF is also employed to refine the segmentation results in this study. Figure 4 illustrates the building roof segmentation results with fused GF2 PMS imagery over the different study areas. Dense urban environments with business buildings, dense urban environments with apartments, low density environments with apartments, urban environments with single family houses and sparse urban environments, are selected for qualitative assessment. Colorized labels which indicate the comparison between prediction and ground truth are applied for visualizing the segmentation results in details. Visual inspection upon the segmentation results suggests that the DCNN could generate a promising building roof map over different urban environments. From the building roof map with colorized accuracy indicators, we can conclude that many individual buildings are segmented as connected image objects. While CRF is introduced as an efficient solution for optimizing the segmentation results, it doesn't significantly change the building roof map. It is also observed that the segmented building roofs have smooth boundaries, even after the regularization operations with CRF.

Quantitative Assessment
It is well known that quantitative assessment metrics are of paramount importance for land surface mapping. In the remote sensing community, the metrics were designed from the perspective of accuracy assessment of thematic maps at pixel level [34,35]. The DCNN model predicts building roof by labeling binary patches, unlike the pixel-wise classification, the accuracy of the semantic segmentation need to be assessed at both pixel level and patch level. Firstly, the per-pixel level accuracy is characterized by overall accuracy (OA), it is defined by Equation (4): where N TP , N FN are the numbers of pixels which are correctly labeled, while N TN , N FP are the numbers of pixels which are incorrectly labeled. The mean overall accuracy is averaged with OA values of all test images. The mean intersection-over-union (mIOU) is applied to characterize the accuracy at segment level, and mIOU is calculated using Equation (5): where N GT is the total number of pixels of the ground truth, i.e., manually delineated building roofs. N DR is the total number of pixels of the corresponding building roofs detected by the DCNN model, and k is the total number of segmentation patches. The quantitative assessment of segmentation results are given in Tables 3 and 4. The OA values of two results, i.e., DCNN segmented and CRF-refined building roofs are 94.67% and 94.69%, respectively. This suggests that the DCNN model worked very well for building roof segmentation using the Chinese GF2 imagery, and the CRF could improve the segmentation results. It is also suggests that the CRF could refine the results. However, the CRF-based refinement didn't produce any significant changes to the original segmentation.   for each row, images from left to right are true color composite GF2 PMS imagery (512 Pixel × 512 Pixel), manual delineation of building roof, DCNN segmentation results and CRF-refined building roof segmentation, respectively; for the images with segmentation results, green mask is True Positive (TP, building roof pixel was correctly classified), blue mask is False Negative (FN, building roof pixel was classified as non-building roof) and red mask is False Positive (FP, non-building roof pixel was classified as building roof).

Conclusions
This paper presents an approach for semantic segmentation of building roofs in dense and diverse urban environment with Chinese GF2 PMS images. A fully connected DCNN is introduced to carry out the feature extraction as well as pixel labeling. Experiments are conducted upon VHR images acquired by the PMS on board the Chinese GF2 satellite. The DCNN could extract features in both the spectral and spatial domains, and the features are fused together for dense prediction at a pixel level. The quantitative assessment of the semantic segmentation results suggest that the fully connected DCNN achieved high accuracy building roof maps. However, the convolutional operation performs feature extraction with moving window filters, and the pooling operation aggregates neighboring pixels with summation, so both operations would smooth images and eventually change the boundaries of buildings. Future work on the algorithm improvement would focus on the fusion of low level geometrical features, e.g., multi-scale feature fusion with dilated convolutional filter, image transformation for building edge enhancement.
It is demonstrated that the generation of DCNN models relies on a large volume of positive and negative samples, and the collection of massive numbers of building roof samples is not only costly, but it is also difficult to cover all scenarios in the practice of inference, especially in dense and diverse urban environments spread over large areas. Currently, there are some efforts being conducting to collect building roofs globally over different areas, e.g., SpaceNet, hosted by DigitalGlobe on Amazon Web Service (AWS), and the Volunteering Geographical Inventory (VGI) based approach is also being investigated for sample collection, which would be an alternative for large scale sample collection in future work.
In summary, the DCNN model performs well on semantic segmentation of building roofs in dense urban environments using GF2 VHR images. With the advances in high performance computers and the low cost GF2 imagery, it could be a promising method for operational building roof mapping over large areas with different building types and distribution patterns, and it offers an affordable data sources for both urban planning and sustainable development assessment.