Building Extraction in Very High Resolution Remote Sensing Imagery Using Deep Learning and Guided Filters

Abstract: Very high resolution (VHR) remote sensing imagery has long been used for land cover classification, and the field is moving from land-use classification towards pixel-level semantic segmentation. Inspired by the recent success of deep learning and filtering methods in computer vision, this work proposes a segmentation model that combines an image segmentation neural network based on deep residual networks with a guided filter to extract buildings from remote sensing imagery. Our method includes the following steps. First, the VHR remote sensing imagery is preprocessed and some hand-crafted features are calculated. Second, the designed deep network architecture is trained on urban district remote sensing images to extract buildings at the pixel level. Third, a guided filter is employed to optimize the classification map produced by deep learning and, at the same time, to remove some salt-and-pepper noise. Experimental results on the Vaihingen and Potsdam datasets demonstrate that our method, which benefits from both neural networks and guided filtering, achieves a higher overall accuracy than other machine learning and deep learning methods. The proposed method performs well at extracting buildings from the diverse objects found in urban districts.


Introduction
Remote sensing images with very high resolution (VHR) are widely used in many applications including land cover mapping and monitoring [1], multi-angle urban classification analysis [2], automatic road detection [3], as well as the identification of tree species in forest management [4]. Several of the practical applications are based on VHR remote sensing imagery classification at the pixel level [5][6][7][8], also defined as semantic segmentation. Semantic segmentation of remote sensing imagery aims to classify every pixel into a given category, and it is an important task for understanding and inferring objects [9,10] and the relationships between spatial objects in a scene [11].
Automatic semantic annotation of urban areas plays an important role in many photogrammetry and remote sensing applications, such as building and updating geographical databases, detecting land cover change, and extracting thematic information. In recent years, the development of computing hardware and sensor technologies has made high resolution sampling available with a ground sampling distance (GSD) of 5-30 cm [12], so that objects such as roof tiles, cars, buildings, and individual branches of trees are distinguishable. This has increased interest in performing semantic segmentation in urban areas.
In the past several years, spatial and spectral features have been used to improve the performance of VHR semantic segmentation based on pixel-wise analysis. Spatial contextual information like the grey level co-occurrence matrix (GLCM) has been employed to obtain a more accurate classification. VHR remote sensing imageries usually have the corresponding overlapping image (or combined camera + LiDAR systems) [12], and the digital surface model (DSM) is available, which can be regarded as an additional depth channel.
Previous researchers have provided useful insights into the various methods that can be used in pixel labelling. However, these methods cannot clearly detect object boundaries and lack the ability to remove salt-and-pepper class noise; some pixels with similar spectral values are usually misclassified. To resolve these problems, this work takes semantic labelling methods from computer vision and applies them to building extraction from VHR remote sensing imagery.
In this paper, we try to improve the classification accuracy with a new model based on deep residual networks (ResNet) [42]. At the same time, we introduce an object-oriented guided filter to improve the classification performance. The method involves three steps. First, imagery pre-processing prepares the dataset for deep learning. Second, a deep network is trained to segment VHR remote sensing imagery into two classes: buildings and clutter/unknown. Third, a guided filter is employed to optimize the extracted buildings, and a final spectral-spatial classification map of the urban district is obtained by fusing the object-oriented optimized results. Together, these steps improve the classification accuracy for remote sensing imagery of complex urban areas. The major contributions of this work are a new model based on ResNet, which we call Res-U-Net, and a novel framework for classifying VHR remote sensing imagery. The experimental results show that the novel framework is more effective at extracting buildings.
The remainder of this paper is organized as follows: Section 2 presents the building extraction using VHR imagery in urban areas based on deep learning and guided filters; Section 3 describes the experimental results and how to set the parameters; Section 4 is a discussion of our method and Section 5 presents our concluding remarks.

Methods for Classification in Very High Resolution Remote Sensing Imagery
In this work, a pixel classification method is proposed to extract buildings from urban districts in VHR remote sensing imagery based on deep learning and guided filters. First, the imagery is pre-processed and edge enhancement is used to emphasize the pixels at the edges of the buildings. Some hand-crafted features, including the normalized difference vegetation index (NDVI), the normalized digital surface model (NDSM), and the first component of the principal component analysis (PCA1), are extracted from the color infrared (CIR) imagery, the red green blue (RGB) satellite imagery, and the corresponding digital surface model (DSM). Then, the proposed deep neural network Res-U-Net is introduced for pixel classification, where the hand-crafted features, the original bands, and the manually labelled ground truth are treated as inputs to train the network. The output of the deep neural network is the segmentation map that represents the pixel labeling results. Finally, a guided filter is introduced to fine-tune the pixel labeling results, because convolutional networks tend to blur object boundaries and visually degrade the result when applied to remote sensing data [12]. An overview of the proposed pixel classification framework is illustrated in Figure 1.

Deep Learning for Remote Sensing Imagery Classification
Convolutional networks have been widely utilized in applications ranging from whole-image classification [43][44][45] to pixel classification, that is, semantic segmentation, in computer vision. Pixel classification includes automatically building maps of geo-localized semantic classes (for example, buildings, impervious surfaces, vegetation, and so on) from earth-observation data [46]. In recent years, deep learning has become a state-of-the-art tool for pixel classification in remote sensing, as well as in other fields. Fully convolutional networks have been adapted as effective tools for the semantic labelling of high-resolution remote sensing data. This paper uses a modified and extended ResNet architecture, named Res-U-Net (Figure 2).
Remote Sens. 2018, 10, 144 4 of 18

In this paper, we trained the Res-U-Net by adopting the approach of reference [47], which is known for working with very little training data while still obtaining precise segmentation. The Res-U-Net consists of two paths: contracting (left) and expansive (right). The contracting path is the ResNet, which extracts features from the input data; we modified its input layer to accept the seven elements of the input data. The input layer is followed by a normalization layer and a max-pooling layer. The network uses rectified linear unit (ReLU) activations and a 2 × 2 max-pooling operation for subsampling, both of which improve the robustness of the network against distortions and small translations [44]. Feature extraction comprises four stages, and every stage includes several residual blocks. The feature maps in the same block have the same size, and the feature maps in the following blocks are half the size of the previous ones, so the feature maps in different blocks capture features at different scales. The expansive path aims to extract the buildings using these feature maps. The contracting and expansive paths have the same number of stages.
Inspired by feature pyramid networks [48], and in order to obtain features at multiple scales, each stage of the expansive part is concatenated with the corresponding stage of the contracting part. Every stage in the expansive part includes the upsampling of the feature map, a concatenation block, and a convolution block, which consists of a 3 × 3 convolution layer, a normalization layer, and a rectified linear unit. At the end of the network, a 1 × 1 convolutional layer maps the feature vectors to the two classes of buildings and clutter; the outputs of this layer are the class scores for each pixel. A softmax layer, used to calculate the classification results, is added at the end of the network. In this work, the deep convolutional network uses the ResNet as a feature extractor, which addresses the degradation problem that arises as the number of layers increases and is useful for extracting features in the contracting part. The concatenation in the expansive part allows the network to learn features at multiple scales and levels, which increases the robustness of the network and improves the accuracy of the building extraction. The output of the softmax layer is a probability map with two channels, which gives the result of the classification between buildings and clutter for every pixel.
The remote sensing imagery and the corresponding normalized digital surface model, together with hand-crafted features such as NDVI and PCA1 and the labelled segmentation maps, are regarded as the inputs to train the network. The Res-U-Net builds higher-level features by grouping lower-level feature maps, and therefore the results are localized more accurately. It transmits the error from a high level to a low level and speeds up the training [47]. The output of the network has the same size as the input, and the network uses end-to-end processing. At the beginning of the network, max-pooling and convolution layers produce more abstract feature maps, which are beneficial for the up-convolution in order to calculate an accurate pixel classification result.
The building extraction problem can be regarded as a binary classification problem. During the training of the parameters, it can be solved as a logistic regression by optimizing an energy function. As with other training methods [47], we train the network using gradient descent to minimize the energy function. The energy function is calculated from the softmax together with the cross entropy loss function. The softmax is used to calculate the probability map, defined as:

$$p_k(x_i) = \frac{\exp\left(a_k(x_i)\right)}{\sum_{j=1}^{K} \exp\left(a_j(x_i)\right)}$$

where $k \in \{1, 2\}$ corresponds to the buildings and the clutter, $K = 2$ is the number of classes, $a_k(x_i)$ denotes the class score produced for class $k$ by the final 1 × 1 convolutional layer, and $p_k(x_i)$ is the probability that sample $x_i$ belongs to class $k$. The energy function is defined as follows:

$$E(w) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} 1\{y_i = k\} \log p_k(x_i)$$

where $\{(x_1, y_1), \ldots, (x_m, y_m)\}$ is assumed to be the training data, $x_i$ represents the vectored features, $y_i$ is the labeled data, $m$ represents the number of samples, and $w$ is the weight map in the network to be optimized.
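The softmax and cross entropy described above can be illustrated with a minimal sketch. It is not the authors' training code: the function names, the 0-based class indices (0 for buildings, 1 for clutter), and the example scores are assumptions made for illustration, and plain Python lists stand in for the tensors a deep learning framework would use.

```python
import math


def softmax(scores):
    """Convert per-class scores a_k(x_i) into probabilities p_k(x_i)."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(score_rows, labels):
    """Mean negative log-probability of the true class over m samples.

    labels holds 0-based class indices (an illustrative convention;
    the text numbers the classes 1 and 2).
    """
    m = len(score_rows)
    loss = 0.0
    for scores, y in zip(score_rows, labels):
        loss -= math.log(softmax(scores)[y])
    return loss / m


# Three hypothetical pixels with two class scores each.
example_scores = [[2.0, 0.0], [0.5, 1.5], [0.0, 0.0]]
example_labels = [0, 1, 0]
print(cross_entropy(example_scores, example_labels))
```

Gradient descent lowers this quantity by raising the score of the labelled class for each training pixel.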

Guided Filtering
To fine-tune the buildings extracted by deep learning, the guided filter, which was first proposed by He [49], is introduced in this work. Like the bilateral filter, it is an edge-preserving smoothing technique. Thanks to the input guidance image, the filtering result is more structured and less smoothed. The guided filter preserves detail better than the bilateral filter and is more efficient [49], which makes it widely applicable in computer vision and graphics [50]. The guided filter assumes that a local linear model exists between the guidance image and the filtering result, which makes it well suited to refining the classification of objects such as buildings.

The guided filter involves two input images: a guidance image $I_c$ and a filtering image $I_{in}$. The filtering output $O$ is assumed to be a linear transform of $I_c$ in a window $w_k$:

$$O_i = a_k I_{c,i} + b_k, \quad \forall i \in w_k$$

where $a_k$ and $b_k$ are the coefficients of the linear transform between the guidance image $I_c$ and the filtering output $O$ within window $w_k$ (the size of the window is $w \times w$). They can be calculated as follows:

$$a_k = \frac{\frac{1}{|w|} \sum_{i \in w_k} I_{c,i} I_{in,i} - u_k p_k}{\sigma_k^2 + \varepsilon}, \qquad b_k = p_k - a_k u_k$$

where $u_k$ and $\sigma_k^2$ are the mean and variance of the guidance image $I_c$ within the window $w_k$, $|w|$ is the number of pixels in the window, $p_k$ is the mean of the filtering image $I_{in}$ within the window $w_k$, and $\varepsilon$ controls the degree of blur of the guided filter. Because pixel $i$ is related to all the windows that cover it, the filtering output $O_i$ is calculated as:

$$O_i = \frac{1}{|w|} \sum_{k : i \in w_k} \left(a_k I_{c,i} + b_k\right) = \bar{a}_i I_{c,i} + \bar{b}_i$$

where $\bar{a}_i$ and $\bar{b}_i$ are the means of the coefficients of all the windows that cover pixel $i$. For simplicity, the equation can be rewritten as:

$$O = G(I_c, I_{in}, w, \varepsilon)$$

The original imagery is used as the guidance image to optimize the boundaries and to remove the salt-and-pepper class noise. Applying the guided filter directly over-smooths the extracted buildings in the output. However, the building maps should be binary, whereas the filtered values change gradually across the boundaries. Therefore, we set a threshold during filtering. If the value is larger than the threshold, it is set to 255, which represents buildings; otherwise, it is set to 0, which represents the clutter.
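The filtering and thresholding steps can be sketched as follows for a single-channel guidance image. This is an illustrative, unoptimized version written with nested Python lists; the helper names, the radius parameter `r` (a window of size (2r + 1) × (2r + 1)), the border handling, and the example value of `eps` are assumptions, and the default threshold of 90 is the value reported in the experiments.

```python
def box_mean(img, r):
    """Mean of img over a (2r+1) x (2r+1) window, clipped at the borders."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        ys = range(max(0, y - r), min(h, y + r + 1))
        for x in range(w):
            xs = range(max(0, x - r), min(w, x + r + 1))
            total = sum(img[j][i] for j in ys for i in xs)
            out[y][x] = total / (len(ys) * len(xs))
    return out


def multiply(a, b):
    """Element-wise product of two equally sized images."""
    return [[p * q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def guided_filter(guide, src, r, eps):
    """Filter src (I_in) using guide (I_c); returns the output O."""
    h, w = len(guide), len(guide[0])
    mean_i = box_mean(guide, r)                   # u_k
    mean_p = box_mean(src, r)                     # p_k
    corr_ip = box_mean(multiply(guide, src), r)
    corr_ii = box_mean(multiply(guide, guide), r)
    a = [[0.0] * w for _ in range(h)]
    b = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            var = corr_ii[y][x] - mean_i[y][x] ** 2          # variance of I_c
            cov = corr_ip[y][x] - mean_i[y][x] * mean_p[y][x]
            a[y][x] = cov / (var + eps)
            b[y][x] = mean_p[y][x] - a[y][x] * mean_i[y][x]
    mean_a = box_mean(a, r)   # average a_k over the windows covering a pixel
    mean_b = box_mean(b, r)   # average b_k over the windows covering a pixel
    return [[mean_a[y][x] * guide[y][x] + mean_b[y][x] for x in range(w)]
            for y in range(h)]


def binarize(img, t=90):
    """Set values above the threshold to 255 (building), others to 0."""
    return [[255 if v > t else 0 for v in row] for row in img]


# A flat guidance region with one isolated "salt" pixel in the prediction.
flat_guide = [[100.0] * 6 for _ in range(6)]
noisy_pred = [[0.0] * 6 for _ in range(6)]
noisy_pred[3][3] = 255.0
cleaned = binarize(guided_filter(flat_guide, noisy_pred, r=1, eps=100.0))
print(cleaned)
```

In a flat guidance region the coefficient a_k is close to zero, so the output reduces to a local average of the prediction and the isolated pixel falls below the threshold. Near strong edges in the guidance image, a_k is larger and the output follows the guidance image instead.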

Datasets
The ISPRS 2D semantic labelling VHR remote sensing imagery of urban districts is used in the experiments, including the Vaihingen (Germany) and Potsdam (Germany) datasets, as these are open datasets provided online. Both consist of near infra-red, red, and green ortho-rectified imagery (color infra-red, CIR), together with the corresponding digital surface models (DSMs) generated by dense image matching and manually annotated ground truth labels. Additionally, the Potsdam dataset has a blue channel; it contains 38 ortho-rectified aerial IRRGB images of ≈ 6000 × 6000 pixels (in total, over 1,368,000,000 pixels) at 5 cm spatial resolution, of which 24 tiles are labelled with pixel-level ground truth. The Vaihingen dataset comprises 33 large image patches of ≈ 2500 × 2500 pixels, extracted from a larger orthophoto captured over Vaihingen. Overall, there are about 168,287,871 pixels, and the imagery has a ground sample distance (GSD) of 9 cm; 16 tiles are labelled with pixel-level ground truth. Each ground truth label is either building or unknown (clutter). The DSM is an array of values with the same size as the input image and the labelled ground truth. The normalized DSMs [51], in which the height is computed using the off-ground pixels, are also available. The imagery with ground truth is divided into two parts: 80% is used to train the Res-U-Net and 20% is used to validate the trained model.

Preprocessing the Data for Deep Learning
Although the urban remote sensing imagery used in this work has a high resolution, some object edges are still fuzzy, which makes the objects hard to distinguish from the background. Therefore, this work introduces edge enhancement into the remote sensing imagery processing. Edge enhancement is an image filter that reduces the effect of noise. It can also decrease the complexity of the image computation. Edge enhancement is widely used in fields such as pattern recognition, image semantic segmentation, and so on. This work enhances the edges of the imagery using the Python Imaging Library (PIL). The filter is a kind of convolutional filter, in which an n × n matrix is applied to the digital imagery. Every pixel of the edge enhancement result is a weighted sum over the convolution region. The size of the convolution kernel used in the experiment is 5 × 5.
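The weighted-sum operation can be sketched in plain Python as below. The paper does not list the coefficients of its 5 × 5 kernel, so `SHARPEN_5X5` and `SCALE` are hypothetical values chosen only so that the normalized weights sum to one and flat regions are left unchanged; the edge-replicating border handling is likewise an assumption.

```python
def convolve2d(img, kernel, scale=1.0):
    """Weighted sum over each pixel's neighbourhood, clipped to [0, 255].

    Pixels outside the image are taken from the nearest border pixel.
    """
    h, w = len(img), len(img[0])
    k = len(kernel) // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in range(-k, k + 1):
                for dx in range(-k, k + 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    acc += kernel[dy + k][dx + k] * img[yy][xx]
            out[y][x] = min(255.0, max(0.0, acc / scale))
    return out


# Hypothetical 5 x 5 kernel: -1 everywhere, 49 at the centre, divided by 25.
SHARPEN_5X5 = [[-1.0] * 5 for _ in range(5)]
SHARPEN_5X5[2][2] = 49.0
SCALE = 25.0

# A vertical step edge: dark columns (50) on the left, bright (150) on the right.
step = [[50.0] * 3 + [150.0] * 4 for _ in range(7)]
enhanced = convolve2d(step, SHARPEN_5X5, SCALE)
print(enhanced[3])
```

With this kernel the contrast across the step grows while the flat areas away from the edge keep their original values, which is the behaviour the text relies on to emphasize building outlines.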
Each tile in the dataset is approximately 6000 × 6000 pixels. If a whole tile were used as the input of the deep network, millions of parameters would have to be learned, which would exhaust the available memory. Therefore, we processed the imagery using a 256 × 256 sliding window with a stride of 64 px to produce the samples. Every eight samples were regarded as a batch to train the network.
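As a worked example of the sampling scheme, the sketch below enumerates the window positions for one tile and groups them into batches of eight. The function names are illustrative, the tile is assumed to be exactly 6000 × 6000 pixels, and windows that would extend past the tile border are simply skipped, which is an assumption rather than something the text specifies.

```python
def window_origins(size, window=256, stride=64):
    """Offsets along one axis at which a full window still fits."""
    return list(range(0, size - window + 1, stride))


def sliding_windows(height, width, window=256, stride=64):
    """Yield (top, left, bottom, right) for every window in a tile."""
    for top in window_origins(height, window, stride):
        for left in window_origins(width, window, stride):
            yield top, left, top + window, left + window


def batches(items, batch_size=8):
    """Group consecutive items into lists of batch_size."""
    items = list(items)
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


windows = list(sliding_windows(6000, 6000))
print(len(window_origins(6000)), len(windows), len(batches(windows)))
```

Under these assumptions there are 90 window positions per axis, so the last 48 pixels along each border are not covered by any window.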

Experimental Setup and Results
To improve the classification accuracy for vegetation in this experiment, we computed the NDVI from the near-infrared and red channels and used it as an indicator of vegetation (NDVI = (NIR − R)/(NIR + R)). A PCA transformation was introduced to extract the first component, which mainly captures brightness and is beneficial for classifying some special building roofs. The R, G, and B bands (there is no blue band in the Vaihingen data) and the CIR imagery, as well as the hand-crafted features including NDVI, NDSM, and the first component of PCA, together with the corresponding ground truth (Figure 1), are used as inputs to train the Res-U-Net. The architecture, as well as the parameters used in this work, is shown in Figure 3.
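The NDVI formula can be applied per pixel as in the following sketch. The function names and the choice to return 0 where both bands are zero are assumptions for illustration; the band values in the example are made up.

```python
def ndvi(nir, red):
    """NDVI = (NIR - R) / (NIR + R) for a single pixel."""
    total = nir + red
    if total == 0:
        return 0.0  # assumption: treat pixels with no signal as neutral
    return (nir - red) / total


def ndvi_map(nir_band, red_band):
    """Apply ndvi() to every pixel of two equally sized bands."""
    return [[ndvi(n, r) for n, r in zip(nir_row, red_row)]
            for nir_row, red_row in zip(nir_band, red_band)]


# Hypothetical 1 x 3 strip: vegetation, roof, and shadow pixels.
nir_band = [[200, 90, 0]]
red_band = [[50, 110, 0]]
print(ndvi_map(nir_band, red_band))
```

Values close to 1 indicate strong near-infrared reflectance relative to red, which is typical of vegetation, while roofs and roads tend to fall near or below zero.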

For an individual network, we trained the network with an initial learning rate of 0.001. To ensure a good learning result, we divided the learning rate by ten every ten epochs. The training runs for 100 epochs, and each epoch has 2048 samples. We use Adam as the optimizer when adjusting parameters such as weights and biases. To ensure that most of the evaluation data contain targets, we set the size of the evaluation data to 2000 × 2000.
The F1 score and the global pixel-wise accuracy of each class are used to assess the quantitative performance. The F1 score is the harmonic mean of precision and recall, and it can be calculated as follows:

$$F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$

where

$$\mathrm{precision}_i = \frac{TP_i}{TP_i + FP_i}, \qquad \mathrm{recall}_i = \frac{TP_i}{TP_i + FN_i}$$

Here, $TP_i$ is the number of true positives for class $i$, and $FP_i$ and $FN_i$ represent the numbers of false positives and false negatives, respectively. These metrics are computed using the pixel-based confusion matrices per tile or by an accumulated confusion matrix. At the same time, the overall accuracy (OA) can be obtained by normalizing the trace of the confusion matrix [52].
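The metrics described above can be computed from a pair of binary label maps as in this sketch. It assumes the maps are flattened into lists and use 255 for buildings and 0 for clutter; the function names and the toy data are illustrative only.

```python
def confusion_counts(pred, truth, positive=255):
    """Count TP, FP, FN, TN for the building class over flattened maps."""
    tp = fp = fn = tn = 0
    for p, t in zip(pred, truth):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def building_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 for buildings, and overall accuracy (OA)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    oa = (tp + tn) / (tp + fp + fn + tn)  # trace of the confusion matrix, normalized
    return precision, recall, f1, oa


# Five hypothetical pixels.
pred = [255, 255, 0, 0, 255]
truth = [255, 0, 0, 255, 255]
print(building_metrics(*confusion_counts(pred, truth)))
```

Accumulating the four counts over all tiles before calling `building_metrics` corresponds to the accumulated confusion matrix mentioned in the text.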
The proposed Res-U-Net is implemented using TensorFlow and Keras on the Linux platform with a TITAN GPU (12 GB RAM). After 204,800 iterations, our best model achieves state-of-the-art results on the datasets (Table 1). The changes in accuracy and loss on the Potsdam and Vaihingen datasets with increasing epochs are shown in Figure 4.
The architecture reaches 96.91% overall accuracy on the Potsdam dataset and 97.71% overall accuracy on the Vaihingen dataset. The deep learning framework performs particularly well on impervious ground and buildings (Figure 5).
Although the accuracy of the pixel labelling improved by using edge enhancement and deep neural networks, the boundaries of the buildings were still blurry and some pixels belonging to the buildings were misclassified (Figure 6b,e). To improve the performance, a guided filter was introduced. During the optimization by the guided filter, values larger than the threshold (t = 90) are set to 255, as mentioned in Section 2.2; otherwise, the values are set to 0. The original imagery and the prediction results produced by deep learning are used as the inputs of the guided filter. The results (Figure 6) show that the performance improved on both datasets.

Some Effects to the Result of Deep Learning
Although VHR remote sensing imagery makes it easy to distinguish objects on the ground, some edges between objects with similar spectral values are not obvious, so it is difficult to classify the pixels, especially in urban districts. This work introduces edge enhancement to increase the differences among objects, which leads to better classification performance. We compared the overall accuracy for buildings and clutter classification, as well as the precision, recall, and F1 (mentioned above), with and without the preprocessing (Figure 7).

As we can see, the overall accuracy on Potsdam improved by 0.43% and the overall accuracy on Vaihingen improved by 2.94%. At the same time, the precision and recall for buildings improved compared with the results computed from inputs without edge enhancement. Edge enhancement emphasizes the indistinct pixels at the edges of the buildings so that they can be classified more precisely, as shown in Figure 8. It can be easily observed that, without the edge enhancing preprocessing, the performance is poor in some parts such as A and B.
Precision, recall and the F1 scores improved significantly thanks to the discriminative power of the digital surface model (DSM) and the NDVI. To illustrate the differences between the contributions of the DSM and the NDVI, a control-variable approach was adopted to analyze the effect of each element. We compared the performance of deep learning when excluding either the DSM or the NDVI, and when using only the RGB images as input. Table 2 compares the results on the Vaihingen and Potsdam datasets. The results support the idea that it is beneficial to use the DSM and the NDVI: they improve the overall accuracy by 1.64% and 0.39% for the Potsdam dataset and by 1.45% and 2.19% for the Vaihingen dataset. They also improve the F1 by 3.92% and 0.83% for Potsdam and by 2.42% and 3.89% for Vaihingen. When both the DSM and the NDVI were excluded, so that only the RGB images were used as input, the overall accuracy of deep learning decreased by 2.66% and 2.7%, respectively, and the F1 for buildings decreased by 5.71% and 4.13%, respectively.

The analysis shows that using the DSM as an input channel improved the performance compared to the case without the DSM. Without the DSM, the recall for buildings in the two datasets decreased by 6.96% and 2.05%, respectively; that is to say, some pixels that are buildings were misclassified as clutter. Although the pixels that belong to a roof exposed to the sun and a roof out of the sun differ spectrally, they have the same DSM value, so the DSM performs well when extracting all kinds of building roofs. Some road pixels are very similar to building roofs in terms of spectral characteristics, but they differ greatly in the DSM. As a result, the DSM improves the capability of the model to extract buildings, as well as the OA and the classification precision for buildings and clutter. The results can be observed in Figure 9.
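The input configurations compared in this ablation can be sketched as follows: the three-band image is stacked with optional DSM and NDVI channels, where NDVI = (NIR − red) / (NIR + red). The function names, the min-max scaling of the DSM, the division by 255 and the small `eps` term are assumptions for the example; the preprocessing actually used may differ.

```python
import numpy as np


def ndvi(nir, red, eps=1e-6):
    """NDVI = (NIR - red) / (NIR + red); eps (an assumption) avoids division by zero."""
    nir = np.asarray(nir, dtype=np.float64)
    red = np.asarray(red, dtype=np.float64)
    return (nir - red) / (nir + red + eps)


def min_max(band):
    """Rescale a band to [0, 1]; a constant band maps to zeros."""
    band = np.asarray(band, dtype=np.float64)
    span = band.max() - band.min()
    return (band - band.min()) / span if span > 0 else np.zeros_like(band)


def build_input(spectral, red, nir, dsm, use_dsm=True, use_ndvi=True):
    """Stack the 3-band image with optional DSM and NDVI channels (H x W x C)."""
    channels = [np.asarray(spectral, dtype=np.float64) / 255.0]
    if use_dsm:
        channels.append(min_max(dsm)[..., np.newaxis])
    if use_ndvi:
        channels.append(ndvi(nir, red)[..., np.newaxis])
    return np.concatenate(channels, axis=-1)
```

Calling `build_input` with `use_dsm=False`, `use_ndvi=False`, or both gives the reduced inputs corresponding to the three ablation settings.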

[Reported metrics: OA; precision, F1 and recall for buildings (B) and clutter (C).]
The NDVI can reflect, to some degree, the influence of the background underlying the buildings and of the vegetation canopy structure. In urban areas, some low buildings are covered by trees, which makes them difficult to classify, like parts A and B in Figure 10. When training the network without the NDVI, the overall accuracy and the F1 for both buildings and clutter decreased in both the Potsdam and Vaihingen datasets. The recall for buildings in the two datasets decreased by 1.65% and 3.86%, respectively. The results (Figure 10) show that using the NDVI as an input channel to train the model helps to solve this problem.

Table 3 reports the results compared with other methods using the same datasets (that is, the same training and validation datasets). The Res-U-Net proposed in this work shows improvements in building extraction on both datasets. The network extracts features using the ResNet, which works well in the contracting path and benefits greatly from solving the degradation problem that arises as the number of layers increases. The expansive path concatenates features at multiple scales from different blocks, which helps in classifying buildings of different sizes.

Influence of the Guided Filter
The threshold used in the optimization by the guided filter is important. Because some pixels near the building edges are spectrally similar to the buildings, a smaller threshold causes more pixels to be extracted as buildings, so that the extracted building area is larger than the real building area. On the other hand, if the threshold is larger, some unclear edges are excluded and the extracted building area is smaller than the real area of the buildings (Figure 11).

To find the optimal threshold, we compared the overall accuracy and F1 of the results using different thresholds. The guided filter with different thresholds was applied to the same predicted results of the Res-U-Net. The thresholds ranged from 40 to 175 in steps of five. The results (Figure 12) show that the accuracy increases as the threshold grows until it reaches t = 90. After that, the overall accuracy and F1 decrease as the threshold grows. Therefore, the threshold in this work was set to 90 when optimizing with the guided filter.

The size of the window in the guided filter also affects the accuracy during optimization. If the window is too small, too little surrounding information is in view to guide the optimization. Conversely, if the window is too large, the information in the window is mixed, which misleads the filter optimization. To find the optimal window size, we compared the overall accuracy and F1 of the results using window sizes from 2 to 15. The results (Figure 13) show that the overall accuracy and F1 increased with the window size until it reached size = 5, and decreased thereafter. Therefore, the window size in the guided filter was set to five when optimizing the Res-U-Net results in the experiments.
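The parameter search described above can be sketched as a simple sweep that scores each candidate threshold against the reference mask and keeps the best one. This is only an illustration under stated assumptions: the function names are hypothetical, the threshold is applied to an already filtered 0-255 score map, and F1 alone is used as the selection criterion.

```python
import numpy as np


def f1_score(pred, truth):
    """F1 for the building class, written as 2TP / (2TP + FP + FN)."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = int(np.sum(pred & truth))
    fp = int(np.sum(pred & ~truth))
    fn = int(np.sum(~pred & truth))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def sweep_threshold(filtered, truth, thresholds=range(40, 180, 5)):
    """Score every threshold from 40 to 175 (step 5) and return the best one."""
    filtered = np.asarray(filtered)
    scores = {t: f1_score(filtered > t, truth) for t in thresholds}
    best = max(scores, key=scores.get)  # on ties, the smallest threshold wins
    return best, scores
```

The same loop could be run over candidate window sizes (2 to 15) in place of thresholds, re-applying the guided filter for each size before scoring.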

Conclusions
In this paper, a novel framework for building extraction in urban districts from very high resolution (VHR) remote sensing imagery is presented. The major contribution of this work is to explore an alternative technique for labeling objects in urban districts that combines deep learning and guided filtering. The aim was to design a network that improves the accuracy of building extraction and to introduce a guided filter into the post-processing of the results. During the preprocessing of the data, we used edge enhancement, which helps to improve the performance of the segmentation process. As the deep neural network, Res-U-Net did well in labeling buildings of different scales; guided filtering was introduced after the Res-U-Net stage, which optimized the classification results and removed the salt-and-pepper noise. At the same time, it effectively preserved the boundaries of the objects within the imagery. Experiments were carried out on two VHR remote sensing imagery datasets. The desired objects were extracted successfully using the proposed method, and the results showed the effectiveness and feasibility of the proposed framework in improving the performance of urban district remote sensing imagery classification. The method was compared with some classical VHR remote sensing classification methods, such as the fully convolutional network (FCN) and a method that combines the convolutional neural network (CNN) and random forest (RF). The experimental results demonstrated that the proposed method outperforms these methods, with improvements in overall accuracy, precision and F1.
With the development of remote sensing technology, more and more VHR images can be accessed conveniently, and the classification of urban districts plays an important role in practical applications such as urban infrastructure and management. This work has provided an effective method to improve VHR image classification performance. However, the shape of some buildings that are covered by trees cannot be detected precisely, and some blurry and irregular boundaries are hard to classify. In the future, a more optimized deep neural network is required to improve efficiency and accuracy. At the same time, further improvement may be achieved by combining the deep neural network and the guided filter in an end-to-end model, which would combine the advantages of a guided filter, which preserves boundaries and decreases the salt-and-pepper noise, with the convenience of training an FCN. Instead of treating non-building areas as a background class, we will take the scene semantics into account and extract roads, trees, cars and other objects in future studies.