Article

A Postprocessing Method Based on Regions and Boundaries Using Convolutional Neural Networks and a New Dataset for Building Extraction

1 College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou 310024, China
2 State Key Laboratory of Remote Sensing Science, Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100101, China
* Author to whom correspondence should be addressed.
Remote Sens. 2022, 14(3), 647; https://doi.org/10.3390/rs14030647
Submission received: 1 January 2022 / Revised: 27 January 2022 / Accepted: 28 January 2022 / Published: 29 January 2022
(This article belongs to the Section Remote Sensing Image Processing)

Abstract
Deep convolutional neural network (DCNN)-based methods have shown great improvements in building extraction from high spatial resolution remote sensing images. In this paper, we propose a postprocessing method based on DCNNs for building extraction. Specifically, building regions and boundaries are learned simultaneously or separately by DCNNs. The predicted building regions and boundaries are then combined by the postprocessing method to produce the final building regions. In addition, we introduce a manually labeled dataset based on high spatial resolution images for building detection, the XIHU building dataset. This dataset is used in the experiments to evaluate our methods, together with the satellite dataset covering East Asia from the WHU building dataset (WHUEA). Results demonstrate that our method combining the results of DeepLab and BDCN performs best on the XIHU building dataset, achieving F1 score improvements of 0.78% and 23.30%, and intersection-over-union (IoU) improvements of 1.13% and 28.45% over DeepLab and BDCN, respectively. Additionally, our method that combines the results of Mask R-CNN and DexiNed performs best on the WHUEA dataset. Moreover, our methods outperform the state-of-the-art multitask learning network, PMNet, on both the XIHU and WHUEA datasets, which indicates that the overall performance can be improved even when the building regions and boundaries are learned simultaneously in the training stage.

1. Introduction

Building extraction from high spatial resolution satellite imagery is fundamental to many applications, such as urban planning [1], building damage mapping [2], and residential environmental quality measurement [3]. There are two main types of building extraction methods: manual delineation and automatic building extraction. Manual building boundary delineation is commonly used in real applications, yet it can hardly be used to produce building maps for large areas because it is time-consuming and labor-intensive. To save time and labor, various kinds of automatic building extraction methods, such as attention-based, R-CNN-based, and edge-based models [4,5,6], have been developed over the last few decades. However, automatic and accurate building extraction remains a challenging task due to the diversity of building rooftops and the complexity of the background in high spatial resolution satellite images.
Traditional building detection methods usually commence with a feature extraction step. The features, such as building shadows, shape, and color, are designed deliberately by the researchers [7,8,9,10]. Machine-learning algorithms, such as support vector machines and random forests, then follow to detect buildings [11]. These building extraction methods are effective at extracting specific types of buildings in one region. Nevertheless, they rarely show comparable performance in another district, because the features designed by human engineers can hardly generalize from one area to another.
Instead of manually designing features, DCNNs can automatically learn features from raw images. Since record-breaking results were achieved in the ImageNet contest [12], DCNNs have been applied to various computer vision tasks, such as semantic segmentation [13,14], instance segmentation [15], and object detection [16]. Soon, DCNN-based methods were introduced to remote sensing tasks, such as image classification [17] and building detection [18], and they quickly became the dominant approach to the building extraction task [19,20,21,22,23]. Since these DCNN-based methods are data-driven, researchers need to prepare a well-annotated building dataset to train DCNNs. However, dataset generation requires a lot of time and effort. Publicly available datasets can facilitate the comparison of different building detection methods. The pioneering work was the publication of the Massachusetts Buildings Dataset [24], which includes 151 aerial images with 1500 × 1500 pixels. The labels were obtained from the building footprints of OpenStreetMap and therefore contain noise. The Vaihingen and Potsdam datasets provided by the International Society for Photogrammetry and Remote Sensing (ISPRS) (https://www2.isprs.org/commissions/comm2/wg4/benchmark/semantic-labeling/, accessed on 31 December 2021) for the semantic labeling contest are also used in building detection [25,26]. These two datasets consist of very high-resolution aerial images and show different types of buildings. A common characteristic of the above datasets is that their training and testing sets come from the same geographical region. To evaluate generalization capabilities, Maggiori et al. [27] created the Inria Aerial Image Labeling Dataset, in which the training and testing sets are from different cities. Afterward, Ji et al. [28] provided a large-scale building dataset, which contains an aerial image dataset and two satellite datasets, thus extending the public building datasets from aerial to satellite images. However, satellite images covering the study area, Hangzhou, are not contained in these publicly available datasets.
Besides the generation of building datasets, the detection methods are of great importance. Generally, the building extraction task is considered as a classification problem or a boundary detection problem in DCNN-based methods. To solve a classification problem, patch-based models [22] and pixel-based models [5,29,30,31,32] are commonly used. The pioneering work in extracting buildings by deep neural networks adopted the patch-based framework, predicting the pixel in the patch center one label at a time [24]. Additionally, conditional random fields (CRFs) were introduced to smooth the outputs of the DCNNs. Following the work in [33], Saito et al. [34] proposed a three-channel DCNN-based method to simultaneously extract buildings, roads and background using part of the images from Massachusetts Buildings Dataset and Massachusetts Roads Dataset released in [24]. Additionally, they further proposed a model-averaging method to avoid the patchy outputs by the patch-based DCNNs. Alshehhi, Marpu, Woon and Mura [22] proposed a modified patch-based DCNN, in which fully connected layers are substituted with global average pooling. Additionally, the simple linear iterative clustering algorithm is used to refine the DCNN output, e.g., merging the misclassified building segments. Patch-based architecture is commonly used in the early DCNN-based building extraction methods. However, this type of method lacks efficiency in the training process and is not good at localizing building boundaries.
Different from the patch-based methods, pixel-based models assign a class to every pixel in the image in an end-to-end manner. Since fully convolutional networks (FCNs) [35] were proposed, building extraction methods have been gradually dominated by end-to-end networks. Early work on the dense classification of buildings was conducted by Maggiori et al. [36]. They devised a fully convolutional architecture to eliminate the discontinuities caused by patch borders. Afterward, Yuan [37] designed a DCNN-based network by integrating multistage features in an end-to-end manner. It is worth noting that in [37], the signed distance function was proposed to generate signed-distance labels, which can represent building boundaries, the interior of the rooftops and the exterior of the rooftops. In this way, the building extraction task was regarded as a multiclass classification problem rather than a binary classification one [38]. Besides the above FCN-based methods, encoder–decoder structures are commonly used to deal with the localization issue [13,39]. For example, Ji, Wei and Lu [28] introduced a Siamese network based on U-Net [39] for large buildings. Recently, attention modules have been integrated into encoder–decoder networks to extract buildings [4,31]. To extract precise building boundaries, edges are considered in the network. Yang et al. [40] proposed an edge-aware network, in which an edge perception network helps refine the building prediction. Zhu, Liang, Yan, Chen and Wang [25] proposed an edge-detail-network (E-D-Net) that simultaneously considers the building edges and regions. E-D-Net consists of two U-Net-based subnets, and a novel fusion strategy is used to combine the building edges and details. Although much progress in rooftop extraction has been achieved, the building regions predicted by these end-to-end methods are often incomplete, especially for buildings with irregular shapes or large internal color variation.
Concerning the edge-based models, building extraction is formulated as an edge detection problem. The ground truth used in this kind of model is the building boundaries (Figure 1b). In the literature relating to building boundaries, methods generally fall into the following categories: active contour based [41], polygon based [42], and edge-detection based [43]. Active contour models, or snakes [44] were proposed to find salient image contours and track contours during motion. Zhang et al. [45] proposed deep structured active contours, in which the energy function is learned by DCNNs. Compared with the active contour-based method, polygon-based methods are relatively new. Li, Wegner and Lucchi [42] proposed a novel approach called PolyMapper, which was able to directly output the vector format of buildings, given the overhead images as input. Based on PolyMapper, Zhao et al. [46] proposed an end-to-end network, the main idea of which is predicting the key points of building boundaries and connecting them to form a closed polygon directly. Concerning the methods based on edge detection, DCNN-based methods from computer vision, such as holistically nested edge detection (HED) [47], richer convolutional features (RCF) [48], D-LinkNet [49], and dense extreme inception network for edge detection (DexiNed) [50] have been introduced to extract the building boundaries. Postprocessing is usually needed in the edge-detection-based approaches. For example, Lu, Ming, Lin, Hong, Bai and Fang [43] used the richer convolutional features network to extract building edges. Additionally, they refined the edge probability map with a geomorphological concept in the postprocessing step. Nevertheless, broken lines are likely to be shown in the output of deep edge-detection networks, which would result in the omission of rooftops.
Hence, this paper aims to improve the completeness of building regions by introducing building boundaries in the postprocessing step. By combining the boundaries and incomplete regions at the same location, closed areas can be generated, which is complementary to the incomplete building regions. Specifically, the building regions are learned using region labels (Figure 1c). Additionally, the building boundaries are trained with edge labels (Figure 1b). They can be trained either simultaneously or separately by DCNNs. Then, the predicted building regions and boundaries are combined by the postprocessing method to produce the final building regions.
The main contributions of this paper are as follows:
  • We develop and provide a building dataset, which contains 742 image tiles with 512 × 512 pixels in the training set and 197 image tiles with 512 × 512 pixels in the testing set from 0.5 m satellite images covering several sites in Hangzhou, Zhejiang Province, China.
  • We propose a new postprocessing method for building extraction based on DCNNs. This method takes advantage of both the predicted building regions and boundaries and shows improvements in extracting complete rooftops.

2. Materials and Methods

2.1. Datasets

A new satellite dataset, the XIHU building dataset, is introduced in this paper (Figure 2). This dataset covers an area of about 62 km2 over the Xihu District, Hangzhou, Zhejiang Province, China. The publicly available building datasets cover areas such as San Francisco (United States), Vienna (Austria), Vaihingen (Germany), Christchurch (New Zealand), and Beijing (China) [6,27,28]. The XIHU building dataset provided in this paper is a supplement to the public datasets for building detection. To the best of our knowledge, it is the first public satellite building dataset covering regions in Hangzhou, China.
In the XIHU building dataset, most of the areas cover urban scenes, while a small portion covers rural ones. There are various types of buildings in the XIHU building dataset (Figure 3). From top to bottom, Figure 3a shows low-rise residential buildings, rural residential buildings, high-rise residential buildings and industrial buildings, respectively. These rooftops have diverse colors, sizes and shapes, which makes building detection on the XIHU building dataset challenging.
The satellite images were downloaded from Google Earth, following the terms of service and guidelines of Google Earth. All the images and labels in the XIHU building dataset are only for noncommercial or personal use. There are three visible bands (red, green, and blue) in the images, with 0.5 m spatial resolution. All the satellite images were captured from October to December 2017. The XIHU building dataset contains 939 image tiles with dimensions of 512 × 512 pixels. The buildings in these images were manually delineated, and the building footprints were cross-checked to improve the labeling quality. As shown in Figure 3b,c, both rooftop boundaries and regions are provided in this paper. Then, 742 image tiles of the XIHU building dataset are chosen as the training dataset, while the other 197 tiles are used for testing (Table 1). The selection of images for the training and testing datasets is not completely random, so as to keep the balance of building types between the two sets. The spatial distribution of the training and testing images is shown in Figure 2.
Besides the XIHU building dataset, the satellite dataset covering East Asia in the WHU building dataset [28] (WHUEA) is used in this paper. The dataset covers 860 km2 in East Asia with a spatial resolution of 0.45 m. All images in the WHUEA dataset were cropped to 512 × 512 pixels. In total, 3135 images of the dataset were used for training while 903 images were used for testing (Table 1). Similar building styles, mostly rural residential buildings, are shown in the WHUEA dataset. However, the satellite images show great differences in color (Figure 4). This dataset was chosen to test the generalization ability of our method.

2.2. Methods

Although much progress has been achieved by DCNNs, incomplete building regions are still likely to be produced for buildings with irregular shapes or large color variation. However, parts of the boundaries of these incomplete building regions may be detected by deep edge-detection models. It is therefore intuitive to combine the predicted building regions and boundaries to improve the incomplete building regions. The combination is completed by the postprocessing method in our building-detection framework (Figure 5). In detail, there are three major stages in our framework. In the first stage, building datasets, including the original remote sensing images and their corresponding building regions and boundaries, are prepared for the training stage. In the second stage, deep models are trained with the building regions or boundaries. According to the training strategies of the deep models (multitask or single-task learning), the region and boundary labels can be learned separately or simultaneously in the training stage. In the third stage, the candidates for the rooftop regions and boundaries are obtained by the models trained in the second stage. Afterward, the predicted building regions and boundaries are combined by the postprocessing method to generate the final building footprints.

2.2.1. Network Training

  • Training with region labels
In the network training stage, U-Net [39], DeepLab [51], and Mask R-CNN [15] are chosen to train with building region labels.
U-Net is extended from the architecture of FCNs. It consists of symmetric encoding and decoding parts. In the encoding part, U-Net follows the typical architecture of FCNs. In the decoding part, an expansive path is added to gradually recover the spatial features and achieve localization. Specifically, the feature maps in the encoding part are cropped and concatenated with the corresponding upsampled features in the decoding part, giving the network its u-shaped architecture. In this paper, U-Net is trained with the building region labels.
Several versions of DeepLab have been developed since it was first proposed in 2015 [52]. The key characteristics of the early versions of DeepLab are atrous convolution and atrous spatial pyramid pooling (ASPP) [51,52]. Atrous convolution can capture a larger context without increasing the number of network parameters, while ASPP helps to extract multiscale features. In the latest version of DeepLab, DeepLabv3+ [53], ASPP and an encoder–decoder structure are combined to capture multiscale context information and sharp spatial features. In this paper, we choose the latest version, DeepLabv3+, in the training stage.
Mask R-CNN is based on Faster R-CNN [54]. In the first stage, Mask R-CNN uses the Region Proposal Network to obtain candidate object bounding boxes, which is identical to Faster R-CNN. In the second stage, besides the output of the class and bounding box in Faster R-CNN, Mask R-CNN adds a new branch to predict the object mask. A multitask loss is defined as follows [15]:
L = L_{cls} + L_{box} + L_{mask}    (1)
where L_{cls} is the classification loss, L_{box} is the bounding-box loss, and L_{mask} is the object mask loss. Therefore, Mask R-CNN simultaneously outputs the object class, bounding box and pixel-wise mask. In this paper, Mask R-CNN is trained with the building region labels.
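As a minimal illustration of how these three terms might be summed in code, the PyTorch-style sketch below combines a classification, a bounding-box and a mask loss. The specific per-term loss functions and tensor shapes are illustrative assumptions, not the authors' (or any reference library's) exact implementation.

```python
import torch.nn.functional as F

def mask_rcnn_multitask_loss(cls_logits, cls_targets,
                             box_preds, box_targets,
                             mask_logits, mask_targets):
    """Illustrative sketch of L = L_cls + L_box + L_mask (Equation (1))."""
    l_cls = F.cross_entropy(cls_logits, cls_targets)            # classification loss
    l_box = F.smooth_l1_loss(box_preds, box_targets)            # bounding-box regression loss
    l_mask = F.binary_cross_entropy_with_logits(mask_logits,    # per-pixel mask loss
                                                mask_targets)
    return l_cls + l_box + l_mask
```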
  • Training with building boundaries
To extract building boundaries, we consider the building extraction task as an edge-detection problem. In this paper, three edge-detection models, HED [47], BDCN [55], and DexiNed [50], in computer vision were introduced to carry out the edge-detection task.
HED is an end-to-end deep neural network for edge detection, which is inspired by the idea of FCNs. The network architecture of HED follows that of the VGGNet [56] with some modifications. First, connections between the side output layer and the last convolutional layer are added in each stage of HED, thus obtaining multiscale side output features. Second, the last stage of VGGNet is removed to reduce the memory/time cost. An important characteristic of HED is that deep supervision [57] in each side output and weighted-fusion supervision are both used in the training process. In this way, multiscale edge predictions can be produced. In this paper, building boundaries are used as a supervision signal in the training process of HED.
BDCN adopts a cascade structure to learn multiscale features for edge detection. It is composed of five incremental detection blocks. Inspired by dilation convolutions [51], a scale enhancement module is inserted into each incremental detection block to extract multiscale features. Another contribution of BDCN is the training strategy of the deep network. Different from supervising prediction maps in different layers with one scale ground truth, each incremental detection block in BDCN is trained by two layer-specific scale ground truth maps. Additionally, the intermediate edge predictions are then fused to produce the final edge map. In this paper, BDCN is trained to predict building boundaries.
DexiNed, inspired by HED [47] and Xception [58], is a deep structured model for image edge detection. It consists of the dense extreme inception network and the upsampling blocks. Specifically, there are six main blocks in the encoder of DexiNed. In each block, the output feature is fed to the corresponding upsampling block, thus generating an edge feature map. At the end of the network, all the edge feature maps from the main blocks are combined to produce the final fused edge result. Besides the main connection, an edge connection is used inside the last four blocks to keep edge features in the deeper encoder blocks. A key feature of DexiNed is that it produces thin edges in the final prediction results. In this paper, DexiNed is used to train the building edge model and produce thin building edges.
  • Training with Building Regions and Boundaries
Building regions and boundaries can be simultaneously learned by multitask networks. In this paper, the state-of-the-art network that uses progressive and multitask learning, called PMNet [59], is selected in the training stage. PMNet adopts the encoder–decoder architecture. It is composed of one shared encoder and two task-specific decoders with identical architecture. Here, one decoder is used to predict building regions while the other is used to predict the building boundaries. The two branches of PMNet are both progressively learned in a coarse-to-fine way. The binary cross-entropy loss is used as the loss function in each path,
L(y, \hat{y}) = -\frac{1}{M} \sum_{i=1}^{M} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]    (2)
where y and ŷ represent the ground truth and the predicted value, respectively, and M is the number of pixels in the image. Considering the work done in [59] and the size of the building training datasets, we choose ResNet-50 as the encoder in PMNet in the following experiments.
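For reference, the binary cross-entropy loss above can be written in a few lines of PyTorch. This is an illustrative re-implementation of Equation (2); in practice the built-in torch.nn.BCELoss (or BCEWithLogitsLoss) would normally be used.

```python
import torch

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Pixel-wise binary cross-entropy (Equation (2)).

    y_true, y_pred: float tensors of identical shape, with y_pred in (0, 1).
    """
    y_pred = y_pred.clamp(eps, 1.0 - eps)  # avoid log(0)
    loss = -(y_true * torch.log(y_pred) + (1.0 - y_true) * torch.log(1.0 - y_pred))
    return loss.mean()  # average over the M pixels
```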

2.2.2. Building Extraction

After the network training stage, the candidates for the building regions and boundaries can be predicted by the trained models. It is common to see that parts of a building region are missing in the predicted results, while the edges of that building appear in the boundary output (Figure 6). Therefore, in the building extraction stage (Figure 7), we propose a postprocessing method that combines the candidate building regions and boundaries to improve the completeness of the building regions.
The steps of the postprocessing method are as follows (Figure 7):
  • Obtain the candidate building regions and boundaries of the test images as the input;
  • Reduce the candidate building boundaries to one-pixel-wide edge maps;
  • Combine the building regions with the one-pixel-wide edge maps;
  • Fill holes in the combination map of the building regions and boundaries;
  • Apply morphological operation to the hole filling results;
  • Remove building regions whose areas are smaller than the threshold;
  • Generate the building regions as the output.
Recent state-of-the-art edge-detection methods based on DCNNs usually generate thick boundaries. As shown in Figure 8, the edge maps predicted by HED (Figure 8b), BDCN (Figure 8c) and DexiNed (Figure 8d) are more than one pixel wide. Additionally, the closer a pixel is to a building boundary, the larger its value. To obtain one-pixel-wide boundaries, there are two steps: thresholding the edge maps and thinning the boundaries. Suppose the pixel value 255 represents the building class and the value 0 represents the background. Thresholding the predicted edge maps follows the equation:
E(i) = \begin{cases} 255, & p(i) > \alpha \\ 0, & \text{otherwise} \end{cases}    (3)
where E(i) is the ith pixel of the thresholding result E, p(i) is the ith pixel of the predicted edge map and α is the threshold. For an 8-bit image output, α is between 0 and 255. Here, we consider pixels whose values do not exceed the threshold α as the background. In this way, the candidates for building boundaries are refined. Then, a fast thinning algorithm [60] is applied to generate one-pixel-wide building edge maps, in which building boundaries are not necessarily closed.
To further take advantage of both the context and spatial information learned by DCNNs, the predicted building regions and the one-pixel-wide building boundaries are combined as follows:
C(i) = \begin{cases} 0, & E(i) = 0 \text{ and } R(i) = 0 \\ 255, & \text{otherwise} \end{cases}    (4)
where C(i) is the ith pixel of the combined result, E(i) is the ith pixel of the one-pixel-wide edge map, and R(i) is the ith pixel of the corresponding building region map. The combination of the building regions and one-pixel-wide edges helps to generate candidates for closed regions. An example is demonstrated inside the red box of Figure 7. Then, these newly generated closed regions, or holes, are filled with the building class to form the new combined building regions.
After that, a morphological operation and small-area removal are used to further refine the new combined building regions. The kernel used in the morphological operation is defined in Figure 9. We choose the erosion operation to remove the redundant one-pixel-wide building edges. An example of redundant edges is displayed inside the yellow boxes of Figure 7. In the small-area-removal step, regions smaller than the threshold β are removed, because the area of a building should reach a specified minimum size.
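The whole postprocessing pipeline (thresholding, thinning, combination, hole filling, erosion and small-area removal) can be sketched with standard image-processing libraries. The snippet below is a minimal illustration under stated assumptions, not the authors' exact implementation: the 3 × 3 erosion kernel stands in for the kernel defined in Figure 9, and scikit-image's skeletonize is used as a stand-in for the fast thinning algorithm of [60].

```python
import numpy as np
from scipy import ndimage
from skimage.morphology import skeletonize, remove_small_objects

def postprocess(region_map, edge_map, alpha=110, beta=16):
    """Combine predicted building regions and boundaries (steps of Figure 7).

    region_map: 8-bit building-region prediction (255 = building, 0 = background).
    edge_map:   8-bit edge probability map from HED/BDCN/DexiNed.
    alpha:      edge threshold in [0, 255]; beta: minimum building area in pixels.
    """
    # Step 2: threshold the edge map (Equation (3)) and thin it to one pixel wide.
    edges = edge_map > alpha
    edges_thin = skeletonize(edges)

    # Step 3: combine building regions and one-pixel-wide edges (Equation (4)).
    combined = (region_map > 0) | edges_thin

    # Step 4: fill the newly closed areas with the building class.
    filled = ndimage.binary_fill_holes(combined)

    # Step 5: erode to remove redundant one-pixel-wide edges (kernel assumed 3 x 3).
    eroded = ndimage.binary_erosion(filled, structure=np.ones((3, 3), bool))

    # Step 6: remove connected regions smaller than beta pixels.
    cleaned = remove_small_objects(eroded, min_size=beta)

    # Step 7: return the final building regions as an 8-bit mask.
    return (cleaned * 255).astype(np.uint8)
```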

2.3. Threshold Optimization

There are two thresholds in the postprocessing step. We only consider threshold α in the optimization process, because threshold β is only related to the minimum size of rooftops in the study area. Threshold α, which binarizes the edge maps before thinning, is learned on the training sets. Given a training set that contains N input–output pairs,
((X_{R1}, X_{B1}), Y_1), ((X_{R2}, X_{B2}), Y_2), …, ((X_{Ri}, X_{Bi}), Y_i), …, ((X_{RN}, X_{BN}), Y_N),
where X_{Ri} and X_{Bi} are the predicted building regions and boundaries, respectively, and Y_i represents the corresponding ground truth of the building regions. Suppose X_{Ri} and X_{Bi} are the input of the postprocessing step (Figure 7), and Ỹ_i is the corresponding output. The threshold α is optimized to maximize the F1 score (F1),
h^{*} = \arg\max_{h \in H} F1    (5)
F1 = \frac{2 \times P \times R}{P + R}    (6)
P = \frac{TP}{TP + FP}    (7)
R = \frac{TP}{TP + FN}    (8)
where H is the space of possible values for threshold α, TP (true positives) is the number of building pixels correctly predicted as buildings, FP (false positives) is the number of background pixels falsely predicted as buildings, and FN (false negatives) is the number of building pixels falsely predicted as background. F1 varies between 0 (the worst value) and 1 (the best value). The optimal threshold is obtained by a search through the space H.
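Because the search space H is small, a plain grid search is sufficient. The sketch below assumes the postprocess() function from the previous snippet and a search space of 10 to 250 in steps of 10, as used in Section 3; it is an illustration, not the authors' exact code.

```python
import numpy as np

def optimize_alpha(region_preds, edge_preds, ground_truths, beta=0):
    """Grid search for the alpha that maximizes F1 on the training set.

    region_preds, edge_preds, ground_truths: aligned lists of arrays
    (predicted regions X_Ri, predicted boundaries X_Bi, labels Y_i).
    """
    best_alpha, best_f1 = None, -1.0
    for alpha in range(10, 251, 10):                  # search space H
        tp = fp = fn = 0
        for regions, edges, truth in zip(region_preds, edge_preds, ground_truths):
            pred = postprocess(regions, edges, alpha=alpha, beta=beta) > 0
            gt = truth > 0
            tp += np.sum(pred & gt)
            fp += np.sum(pred & ~gt)
            fn += np.sum(~pred & gt)
        precision = tp / (tp + fp + 1e-12)
        recall = tp / (tp + fn + 1e-12)
        f1 = 2 * precision * recall / (precision + recall + 1e-12)
        if f1 > best_f1:                              # keep the best hypothesis h*
            best_alpha, best_f1 = alpha, f1
    return best_alpha, best_f1
```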

2.4. Implementation Details

The deep neural networks in this paper are trained using PyTorch. To make a fair comparison among different building extraction methods, we use the default data augmentation strategy from the publicly available codes. The hyperparameters and GPU used for each deep neural network are listed in Tables S1 and S2. Additionally, all the algorithms are implemented in Python 3.7, and the codes can be downloaded from the links in Appendix A.

2.5. Accuracy Assessment

To evaluate the performance of the building extraction methods, four commonly used metrics, precision (P) (Equation (7)), recall (R) (Equation (8)), F1 (Equation (6)) and IoU (Equation (9)), are computed in this paper.
IoU = \frac{P \times R}{P + R - P \times R}    (9)
These are all pixel-based metrics computed from the confusion matrix. From Equations (6)–(8), we can see that P is the proportion of true building predictions among all building predictions, R is the proportion of true building predictions among all building test samples, and F1 is the harmonic mean of P and R. The last metric, IoU (Equation (9)), is the ratio of the intersection of the predicted and ground truth building regions to their union.
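All four metrics can be computed directly from the pixel-wise confusion matrix, as in the short sketch below (illustrative only; it assumes binary masks and nonzero denominators).

```python
import numpy as np

def pixel_metrics(pred, truth):
    """Precision, recall, F1 and IoU (Equations (6)-(9)) from boolean building masks."""
    tp = np.sum(pred & truth)    # building pixels predicted as building
    fp = np.sum(pred & ~truth)   # background pixels predicted as building
    fn = np.sum(~pred & truth)   # building pixels predicted as background
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)    # equals P*R / (P + R - P*R)
    return precision, recall, f1, iou
```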

3. Results

To evaluate the performance of our proposed methods, state-of-the-art networks were used to predict building regions and boundaries, respectively. In detail, U-Net [39], DeepLab [51], and Mask R-CNN [15] were chosen to predict building regions, while HED [47], BDCN [55], and DexiNed [50] were selected to predict building boundaries. We also picked PMNet [59], a network that uses progressive and multitask learning strategies, to predict the building regions and boundaries simultaneously. Then, the predicted building regions and boundaries were combined by the postprocessing method.

3.1. XIHU Building Dataset

The threshold α was trained separately for the different combinations of predicted building regions and boundaries of the XIHU training set (Figure 10). During the training of α, β was set to 0 to make a fair comparison. The search space of α ranged from 10 to 250 in increments of 10, since the predicted images of building boundaries were 8-bit. In this way, the settings of α for the XIHU testing set could be determined by the maximum F1 during the search process (Figure 10). In detail, α was set to 250 for all types of boundaries combined with regions predicted by U-Net (Figure 10a); the values of α were 160, 110 and 180 for boundaries by HED, BDCN and DexiNed, respectively, combined with regions by DeepLab (Figure 10b); and the settings of α were 120, 20 and 10 to thin boundaries by HED, BDCN and DexiNed, respectively, combined with regions by Mask R-CNN (Figure 10c). Additionally, β was set to 16 by considering both the image spatial resolution and the building sizes in the dataset.
From the results on XIHU datasets (Table 2, Table 3 and Table 4), we can see that our method that combines the results of DeepLab and BDCN ( α = 110) achieves the highest F1 score (83.14%) and IoU (71.14%) (Table 3). Results show that in various combinations of building regions and boundaries, our method shows higher F1 and IoU than the corresponding base deep models, which indicates the effectiveness of our postprocessing method.

3.2. WHUEA Building Dataset

In the WHUEA dataset, threshold α was also learned individually for the different combinations of building regions and boundaries (Figure 11). During the learning of α, the setting of β and the search space of α were identical to those used for the XIHU dataset. From the evolution of F1 during the training stage (Figure 11), the settings of α to generate one-pixel-wide building boundaries by HED, BDCN and DexiNed were 250, 250 and 240, respectively, for the combination with the building regions by U-Net, while the settings of α were 180, 130, 220 and 160, 120, 200 for those by DeepLab and Mask R-CNN, respectively. Additionally, β was set to 4 by considering both the image spatial resolution and the building sizes in the study area.
From the results on the WHUEA dataset (Table 5, Table 6 and Table 7), we can see that the highest F1 (81.90%) and IoU (69.34%) are achieved by the combination of Mask R-CNN and DexiNed (α = 200) (Table 7). From the results of the different combinations, we can observe that our postprocessing methods show higher F1 and IoU than the deep models that provide the input building regions and boundaries, which implies that the proposed postprocessing method also works for datasets containing images with large color differences.

4. Discussion

4.1. Comparison with the Networks That Predict Building Regions

We compare the results of our methods with those of U-Net [39], DeepLab [51] and Mask R-CNN [15], both quantitatively and qualitatively.

4.1.1. Qualitative Analysis

Figure 12 illustrates examples of the results of different methods on the XIHU dataset. Rows (a,b) of Figure 12 show the original satellite images and the corresponding labels. The results of U-Net, DeepLab, Mask R-CNN and our method are displayed in Rows (c–f), respectively. Compared with the results of Mask R-CNN, sharper corners are shown in those of U-Net, DeepLab and our method. However, Mask R-CNN shows better ability in building instance detection (red boxes in Figure 12). In addition, Column 1 of Figure 12d,f shows that our method improves the integrity of building rooftops with complex shapes, and Column 3 of Figure 12d,f illustrates that our method is also able to detect the building missed by DeepLab. This is because the combinations of one-pixel-wide boundaries (red boxes in Figure 13a,c) and the corresponding building regions (red boxes in Columns 1 and 3 of Figure 12d) form closed areas. The results of our method are the same as those of DeepLab (red boxes in Columns 2 and 4 of Figure 12f) in regions where the combinations of the boundaries (red boxes in Figure 13b,d) and regions (red boxes in Columns 2 and 4 of Figure 12d) fail to form closed areas. However, a false building prediction occurs (yellow box in Column 3 of Figure 12f) due to a closed area formed by false building boundaries (yellow box in Figure 13c).
A visual comparison of the results of U-Net, DeepLab, Mask R-CNN and our method on the WHUEA dataset is shown in Rows (c–f) of Figure 14. We can see that incomplete regions appear in the results predicted by U-Net. Additionally, small rooftops are missed in the predictions by U-Net, DeepLab and Mask R-CNN. Visually, our method provides better performance on the completeness and regularity of rooftops (red boxes in Figure 14f) due to the formation of closed areas in the combination of building regions (red boxes in Figure 14e) and boundaries (red boxes in Figure 15). However, a false prediction occurs (the yellow box in Column 4 of Figure 14f) because of a closed area formed by inaccurate edges (the yellow box in Figure 15d), which is similar to the results on the XIHU dataset.

4.1.2. Quantitative Analysis

From the results of the XIHU dataset (Table 2, Table 3 and Table 4), we find that DeepLab shows a higher F1 score (82.36%) and IoU (70.01%) compared with U-Net and Mask R-CNN, which indicates that DeepLab performs better on the XIHU dataset, which contains various building types. This is probably due to the multiscale and sharp spatial feature learning capabilities of DeepLab. Moreover, we find that the results of our methods are closely related to those of the building region detection models, because our postprocessing method depends on the input regions. Table 3 compares the results of DeepLab with those of our methods that combine the building regions predicted by DeepLab and the building boundaries predicted by HED, BDCN and DexiNed. We can observe that our method combining the results of DeepLab and BDCN (α = 110) performs best in F1 and IoU. In detail, our method achieves 2.42%, 0.78% and 1.13% improvements in R, F1 and IoU compared with the results of DeepLab. However, the P of our method declines by 1.15% compared with that of DeepLab. This indicates that our method can effectively add the rooftops omitted by the building region networks, while false predictions are introduced into the final results by combining the building edges with regions (Figure 16), which is consistent with the observation in the visual comparison (Section 4.1.1).
From the results of the WHUEA dataset (Table 5, Table 6 and Table 7), we can see that the F1 score and IoU of Mask R-CNN are better than those of U-Net and DeepLab, which indicates that Mask R-CNN performs better on a dataset containing similar building types with large color differences. Similar to the results of the XIHU dataset, the performance of our methods on the WHUEA dataset is also closely connected with that of the corresponding region detection models. Compared with the results of Mask R-CNN (Table 7), 2.98%, 1.09% and 1.55% improvements in R, F1 and IoU are achieved by our method that combines the predictions of Mask R-CNN and DexiNed (α = 200), while the P of our method decreases by 1.04%. These results suggest that our method performs better despite the introduction of false predictions (Figure 17).

4.2. Comparison with Deep Edge-Detection Networks

We further compare the results of our methods with those of the deep edge-detection models, including HED [47], BDCN [55] and DexiNed [50]. Specifically, the one-pixel-wide building boundary maps used in our methods are further processed by the hole-filling algorithm, and the hole-filled results are then evaluated quantitatively. To make a fair comparison, the settings of α are the same as those in our methods.
From the results of the XIHU dataset, we can see that our methods show better performance than all the edge-detection models (Table 2, Table 3 and Table 4). Our method that combines DeepLab and BDCN (α = 110) achieves 23.30% F1 and 28.45% IoU improvements over BDCN (α = 110). In particular, a sharp decline in R is found in the results of HED, BDCN and DexiNed, because broken lines appear in the building boundaries (Figure 18), which leads to missed rooftops. Moreover, we can observe that the number of closed building regions increases as α becomes smaller (Figure 18), because the smaller the threshold α, the more edge pixels are retained.
For the WHUEA dataset, our methods also show better performance in F1 and IoU (Table 5, Table 6 and Table 7). In detail, our method that combines Mask R-CNN and DexiNed (α = 200) achieves 12.52% F1 and 16.22% IoU improvements over DexiNed (α = 200). Similar to the results of the XIHU dataset, broken lines are displayed in all building boundary maps by HED, BDCN and DexiNed in the WHUEA dataset (Figure 19). This observation indicates that the combination of building regions and boundaries in our method reduces the requirement for closed boundaries, which is of great importance in the building extraction process by deep edge-detection networks. In addition, among the three edge-detection methods, DexiNed shows higher precision quantitatively and more regular rooftops visually than HED and BDCN, which indicates that DexiNed captures the edges of similar building types better.

4.3. Comparison with PMNet

Recent studies have simultaneously trained building regions and boundaries with a multitask learning strategy, which uses both the building regions and boundaries during the training process. Different from the networks using a multitask learning strategy, our method combines the building regions and boundaries after the training stage. To compare the results of our method with those of a multitask learning network, we choose the state-of-the-art method that adopts a multitask learning scheme, PMNet, to predict the building regions and boundaries simultaneously. In detail, we use the PMNet mask with conditional random fields to generate the building regions. Threshold α was learned on the training set with the same settings of β and search space as those in Section 3.1. According to the maximum F1, the values of α were 180, 140, 200 and 100 in the XIHU dataset (Figure 20) and 190, 210, 230 and 120 in the WHUEA dataset (Figure 21). To make a fair comparison, the settings of β are identical to those of the experiments in Section 3.1 (XIHU dataset) and Section 3.2 (WHUEA dataset). Additionally, the way of evaluating the PMNet contour predictions is the same as that in Section 4.2.
From the results of the XIHU dataset (Table 8), we can observe that our method that combines PMNet mask and contour (α = 100) achieves improvements of 0.63% for F1 and 0.81% for IoU compared with the PMNet mask output. Additionally, 61.59% F1 and 52.24% IoU improvements are shown by comparing the results of our method and PMNet contour (α = 100). From the results of the WHUEA dataset (Table 9), we can find that the combination of PMNet mask and contour (α = 120) achieves 0.38% F1 and 0.49% IoU improvements over PMNet mask, and 65.64% F1 and 56.16% IoU improvements over PMNet contour (α = 120). These results indicate that the combination of building regions and boundaries by our postprocessing method can improve the overall performance, although the network simultaneously learns the building regions and boundaries in the training stage. In addition, we observe that the combinations of PMNet mask and HED, BDCN, DexiNed show higher F1 and IoU than the combination of PMNet mask and contour in both the results of XIHU and WHUEA datasets. The combination of PMNet mask and BDCN (α = 140) in the XIHU dataset and the combination of PMNet mask and DexiNed (α = 230) in the WHUEA dataset achieve the highest F1 and IoU, respectively. The results imply that the accuracy of the building boundaries affects the performance of our method, rather than the training strategy.

4.4. Perspectives

To improve the completeness of building regions, the proposed postprocessing method combines predicted regions and boundaries to generate new closed areas. For future work, regularization algorithms will be introduced into the postprocessing stage. Furthermore, our results show that the combination of DeepLab and BDCN performs best in scenes containing various types of buildings in the same city (Table 3), while the combination of Mask R-CNN and DexiNed is recommended for datasets containing similar building types but large color differences (Table 7). More experiments are required to evaluate the best combination for different datasets and locations.
The XIHU dataset introduced in this paper consists of 939 tiles of manually labeled 0.5 m satellite images. Additionally, there are various building types in the dataset, which is challenging and suitable for future comparison of building detection methods. Furthermore, this dataset can also be used to test the generalization ability of models together with the other publicly available datasets. We encourage researchers to use our dataset for further research.

5. Conclusions

In this paper, we proposed a new dataset based on high spatial resolution satellite images, the XIHU dataset, for building detection. In addition, we proposed a postprocessing method based on DCNNs for building extraction. Our method depends on the building regions and boundaries predicted by DCNNs. These predictions are further postprocessed to improve the completeness of the building regions. The results of our experiments on the XIHU and WHUEA datasets demonstrate that our methods that combine the building regions and boundaries show better performance than the networks that originally predict building regions or boundaries. Moreover, the comparison of the results of our method and PMNet indicates that the overall performance can be improved by our method, even though the building regions and boundaries are simultaneously learned in the model training process.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/rs14030647/s1, Table S1: Hyperparameters and GPUs used on the XIHU dataset; Table S2: Hyperparameters and GPUs used on the WHUEA dataset.

Author Contributions

Conceptualization, H.Y. and W.W.; methodology, H.Y.; software, H.Y., M.X. and Y.C.; validation, H.Y., W.W. and W.D.; formal analysis, H.Y.; investigation, H.Y., M.X. and Y.C.; resources, H.Y., W.W. and W.D.; data curation, H.Y., M.X. and Y.C.; writing—original draft preparation, H.Y.; writing—review and editing, H.Y.; visualization, H.Y. and M.X.; funding acquisition, H.Y. and W.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Key R&D Program of China, grant number 2018YFD1100301; the National Natural Science Foundation of China, grant number 42001276; Zhejiang Provincial Natural Science Foundation of China, grant number LQ19D010006; the Strategic Priority Research Program of Chinese Academy of Sciences, grant number XDA 20030302.

Data Availability Statement

XIHU building dataset can be found here: https://github.com/yanghplab/building-dataset (accessed on 31 December 2021). Additionally, we downloaded the WHUEA dataset from here: http://gpcv.whu.edu.cn/data/building_dataset.html (accessed on 31 December 2021).

Acknowledgments

The authors would like to thank Yuanyuan Yang, Haodong Li, Yubo Lin and Dingheng Fu from Zhejiang University of Technology, Hangzhou, China, for their help with the dataset preparation.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

The codes used in this research can be found in the following links:

References

  1. Park, Y.; Guldmann, J.-M.; Liu, D. Impacts of tree and building shades on the urban heat island: Combining remote sensing, 3D digital city and spatial regression approaches. Comput. Environ. Urban Syst. 2021, 88, 101655. [Google Scholar] [CrossRef]
  2. Adriano, B.; Yokoya, N.; Xia, J.; Miura, H.; Liu, W.; Matsuoka, M.; Koshimura, S. Learning from multimodal and multitemporal earth observation data for building damage mapping. ISPRS J. Photogramm. Remote Sens. 2021, 175, 132–143. [Google Scholar] [CrossRef]
  3. Zhang, X.; Du, S.; Du, S.; Liu, B. How do land-use patterns influence residential environment quality? A multiscale geographic survey in Beijing. Remote Sens. Environ. 2020, 249, 112014. [Google Scholar] [CrossRef]
  4. Guo, M.; Liu, H.; Xu, Y.; Huang, Y. Building Extraction Based on U-Net with an Attention Block and Multiple Losses. Remote Sens. 2020, 12, 1400. [Google Scholar] [CrossRef]
  5. Liu, Y.; Chen, D.; Ma, A.; Zhong, Y.; Fang, F.; Xu, K. Multiscale U-Shaped CNN Building Instance Extraction Framework With Edge Constraint for High-Spatial-Resolution Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2021, 59, 6106–6120. [Google Scholar] [CrossRef]
  6. Xia, L.; Zhang, X.; Zhang, J.; Yang, H.; Chen, T. Building Extraction from Very-High-Resolution Remote Sensing Images Using Semi-Supervised Semantic Edge Detection. Remote Sens. 2021, 13, 2187. [Google Scholar] [CrossRef]
  7. Liow, Y.-T.; Pavlidis, T. Use of shadows for extracting buildings in aerial images. Comput. Vis. Graph. Image Processing 1990, 49, 242–277. [Google Scholar] [CrossRef]
  8. Liasis, G.; Stavrou, S. Building extraction in satellite images using active contours and colour features. Int. J. Remote Sens. 2016, 37, 1127–1153. [Google Scholar] [CrossRef]
  9. Zhang, Q.; Huang, X.; Zhang, G. A Morphological Building Detection Framework for High-Resolution Optical Imagery over Urban Areas. IEEE Geosci. Remote Sens. Lett. 2016, 13, 1388–1392. [Google Scholar] [CrossRef]
  10. Ok, A.O. Automated detection of buildings from single VHR multispectral images using shadow information and graph cuts. ISPRS J. Photogramm. Remote Sens. 2013, 86, 21–40. [Google Scholar] [CrossRef]
  11. Turker, M.; Koc-San, D. Building extraction from high-resolution optical spaceborne images using the integration of support vector machine (SVM) classification, Hough transformation and perceptual grouping. Int. J. Appl. Earth Obs. Geoinf. 2015, 34, 58–69. [Google Scholar] [CrossRef]
  12. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; pp. 1097–1105. [Google Scholar]
  13. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef] [PubMed]
  14. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6230–6239. [Google Scholar]
  15. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 386–397. [Google Scholar] [CrossRef] [PubMed]
  16. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  17. Lv, X.; Ming, D.; Chen, Y.; Wang, M. Very high resolution remote sensing image classification with SEEDS-CNN and scale effect analysis for superpixel CNN classification. Int. J. Remote Sens. 2019, 40, 506–531. [Google Scholar] [CrossRef]
  18. Vakalopoulou, M.; Karantzalos, K.; Komodakis, N.; Paragios, N. Building detection in very high resolution multispectral data with deep learning features. In Proceedings of the 2015 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Milan, Italy, 26–31 July 2015; pp. 1873–1876. [Google Scholar]
  19. Volpi, M.; Tuia, D. Dense Semantic Labeling of Subdecimeter Resolution Images with Convolutional Neural Networks. IEEE Trans. Geosci. Remote Sens. 2017, 55, 881–893. [Google Scholar] [CrossRef] [Green Version]
  20. Yang, H.; Yu, B.; Luo, J.; Chen, F. Semantic segmentation of high spatial resolution images with deep neural networks. GIScience Remote Sens. 2019, 56, 749–768. [Google Scholar] [CrossRef]
  21. Zhao, K.; Kang, J.; Jung, J.; Sohn, G. Building Extraction from Satellite Images Using Mask R-CNN with Building Boundary Regularization. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Salt Lake City, UT, USA, 18–22 June 2018; pp. 242–2424. [Google Scholar]
  22. Alshehhi, R.; Marpu, P.R.; Woon, W.L.; Mura, M.D. Simultaneous extraction of roads and buildings in remote sensing imagery with convolutional neural networks. ISPRS J. Photogramm. Remote Sens. 2017, 130, 139–149. [Google Scholar] [CrossRef]
  23. Waldner, F.; Diakogiannis, F.I. Deep learning on edge: Extracting field boundaries from satellite images with a convolutional neural network. Remote Sens. Environ. 2020, 245, 111741. [Google Scholar] [CrossRef]
  24. Mnih, V. Machine Learning for Aerial Image Labeling. Ph.D. Thesis, University of Toronto, Toronto, ON, Canada, 2013. [Google Scholar]
  25. Zhu, Y.; Liang, Z.; Yan, J.; Chen, G.; Wang, X. E-D-Net: Automatic Building Extraction From High-Resolution Aerial Images With Boundary Information. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 4595–4606. [Google Scholar] [CrossRef]
  26. Li, Z.; Zhang, X.; Xiao, P.; Zheng, Z. On the Effectiveness of Weakly Supervised Semantic Segmentation for Building Extraction From High-Resolution Remote Sensing Imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 3266–3281. [Google Scholar] [CrossRef]
  27. Maggiori, E.; Tarabalka, Y.; Charpiat, G.; Alliez, P. Can semantic labeling methods generalize to any city? The inria aerial image labeling benchmark. In Proceedings of the 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Honolulu, HI, USA, 23–28 July 2017; pp. 3226–3229. [Google Scholar]
  28. Ji, S.; Wei, S.; Lu, M. Fully Convolutional Networks for Multisource Building Extraction From an Open Aerial and Satellite Imagery Data Set. IEEE Trans. Geosci. Remote Sens. 2019, 57, 574–586. [Google Scholar] [CrossRef]
  29. Chen, Z.; Li, D.; Fan, W.; Guan, H.; Wang, C.; Li, J. Self-Attention in Reconstruction Bias U-Net for Semantic Segmentation of Building Rooftops in Optical Remote Sensing Images. Remote Sens. 2021, 13, 2524. [Google Scholar] [CrossRef]
  30. Cai, J.; Chen, Y. MHA-Net: Multipath Hybrid Attention Network for Building Footprint Extraction From High-Resolution Remote Sensing Imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 5807–5817. [Google Scholar] [CrossRef]
  31. Zhu, Q.; Liao, C.; Hu, H.; Mei, X.; Li, H. MAP-Net: Multiple Attending Path Neural Network for Building Footprint Extraction From Remote Sensed Imagery. IEEE Trans. Geosci. Remote Sens. 2021, 59, 6169–6181. [Google Scholar] [CrossRef]
  32. He, N.; Fang, L.; Plaza, A. Hybrid first and second order attention Unet for building segmentation in remote sensing images. Sci. China Inf. Sci. 2020, 63, 140305. [Google Scholar] [CrossRef] [Green Version]
  33. Mnih, V.; Hinton, G.E. Learning to Detect Roads in High-Resolution Aerial Images. In Proceedings of the Computer Vision—ECCV 2010, Heraklion, Greece, 5–11 September 2010; pp. 210–223. [Google Scholar]
  34. Saito, S.; Yamashita, T.; Aoki, Y. Multiple Object Extraction from Aerial Imagery with Convolutional Neural Networks. J. Imaging Sci. Technol. 2016, 60, 0104021–0104029. [Google Scholar] [CrossRef]
  35. Shelhamer, E.; Long, J.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 640–651. [Google Scholar] [CrossRef]
  36. Maggiori, E.; Tarabalka, Y.; Charpiat, G.; Alliez, P. Convolutional Neural Networks for Large-Scale Remote-Sensing Image Classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 645–657.
  37. Yuan, J. Learning Building Extraction in Aerial Scenes with Convolutional Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 2793–2798.
  38. Yang, H.L.; Yuan, J.; Lunga, D.; Laverdiere, M.; Rose, A.; Bhaduri, B. Building Extraction at Scale Using Convolutional Neural Network: Mapping of the United States. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2018, 11, 2600–2614.
  39. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015, Cham, Switzerland, 5–9 October 2015; pp. 234–241.
  40. Yang, G.; Zhang, Q.; Zhang, G. EANet: Edge-Aware Network for the Extraction of Buildings from Aerial Images. Remote Sens. 2020, 12, 2161.
  41. Cheng, D.; Liao, R.; Fidler, S.; Urtasun, R. DARNet: Deep Active Ray Network for Building Segmentation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 7423–7431.
  42. Li, Z.; Wegner, J.D.; Lucchi, A. Topological Map Extraction from Overhead Images. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 1715–1724.
  43. Lu, T.; Ming, D.; Lin, X.; Hong, Z.; Bai, X.; Fang, J. Detecting Building Edges from High Spatial Resolution Remote Sensing Imagery Using Richer Convolution Features Network. Remote Sens. 2018, 10, 1496.
  44. Kass, M.; Witkin, A.; Terzopoulos, D. Snakes: Active Contour Models. Int. J. Comput. Vis. 1988, 1, 321–331.
  45. Zhang, L.; Bai, M.; Liao, R.; Urtasun, R.; Marcos, D.; Tuia, D.; Kellenberger, B. Learning Deep Structured Active Contours End-to-End. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8877–8885.
  46. Zhao, W.; Persello, C.; Stein, A. Building Outline Delineation: From Aerial Images to Polygons with an Improved End-to-End Learning Framework. ISPRS J. Photogramm. Remote Sens. 2021, 175, 119–131.
  47. Xie, S.; Tu, Z. Holistically-Nested Edge Detection. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1395–1403.
  48. Liu, Y.; Cheng, M.; Hu, X.; Bian, J.; Zhang, L.; Bai, X.; Tang, J. Richer Convolutional Features for Edge Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 1939–1946.
  49. Zhou, L.; Zhang, C.; Wu, M. D-LinkNet: LinkNet with Pretrained Encoder and Dilated Convolution for High Resolution Satellite Imagery Road Extraction. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Salt Lake City, UT, USA, 18–22 June 2018; pp. 192–1924.
  50. Soria, X.; Riba, E.; Sappa, A. Dense Extreme Inception Network: Towards a Robust CNN Model for Edge Detection. In Proceedings of the 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), Snowmass Village, CO, USA, 1–5 March 2020; pp. 1912–1921.
  51. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848.
  52. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs. In Proceedings of the ICLR 2015, San Diego, CA, USA, 7–9 May 2015.
  53. Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Cham, Switzerland, 8–14 September 2018; pp. 833–851.
  54. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems—Volume 1, Montreal, QC, Canada, 7–12 December 2015; pp. 91–99.
  55. He, J.; Zhang, S.; Yang, M.; Shan, Y.; Huang, T. Bi-Directional Cascade Network for Perceptual Edge Detection. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 3823–3832.
  56. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015.
  57. Lee, C.-Y.; Xie, S.; Gallagher, P.; Zhang, Z.; Tu, Z. Deeply-Supervised Nets. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, San Diego, CA, USA, 9–12 May 2015; pp. 562–570.
  58. Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1800–1807.
  59. Kang, D.; Park, S.; Paik, J. Coarse to Fine: Progressive and Multi-Task Learning for Salient Object Detection. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 1491–1498.
  60. Zhang, T.Y.; Suen, C.Y. A Fast Parallel Algorithm for Thinning Digital Patterns. Commun. ACM 1984, 27, 236–239.
Figure 1. Illustration of satellite images and their ground truth. (a) Images from Satellite dataset II of the WHU Building Dataset [28]; (b) Ground truth boundaries; (c) Ground truth regions.
Figure 2. Geospatial distribution of the XIHU building dataset.
Figure 3. Examples of annotated images from the training set of the XIHU building dataset. (a) From top to bottom, images show low-rise residential buildings, rural residential buildings, high-rise residential buildings, and industrial buildings, respectively; (b) Ground truth boundaries; (c) Ground truth regions.
Figure 4. Examples of annotated images from the training set of the WHU building dataset (East Asia). (a) Original images; (b) Ground truth regions.
Figure 5. Overview of the proposed building detection framework.
Figure 6. Examples of the predicted building regions and boundaries by the trained models in the network training stage. (a) An example test satellite image; (b) The predicted building regions by U-Net; (c) The predicted building boundaries by BDCN.
Figure 7. Workflow of the building extraction stage. The postprocessing method is shown in the dashed box.
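The postprocessing step combines a predicted region mask with a predicted boundary map under a threshold α. The exact operations are those shown in the dashed box of Figure 7; purely as an illustration, the sketch below shows one plausible way such a region–boundary fusion could be wired up, assuming the boundary map is thresholded at α, thinned to one pixel wide, and used to split the region mask before a morphological cleanup. The function name, the kernel shape, and the splitting rule are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
import cv2
from skimage.morphology import skeletonize

def fuse_regions_and_boundaries(region_prob, boundary_prob, alpha=110, kernel_size=3):
    """Illustrative region/boundary fusion (not the paper's exact algorithm).

    region_prob   : HxW float array in [0, 1], building-region probability
    boundary_prob : HxW uint8 array in [0, 255], building-boundary strength
    alpha         : boundary threshold on the 0-255 scale (cf. Figures 10 and 11)
    """
    # Binarize the region prediction.
    region_mask = (region_prob >= 0.5).astype(np.uint8)

    # Threshold the boundary map at alpha and thin it to one-pixel width.
    boundary_bin = boundary_prob >= alpha
    boundary_thin = skeletonize(boundary_bin).astype(np.uint8)

    # Remove boundary pixels from the region mask so that touching buildings
    # are separated, then clean up with a morphological opening
    # (the kernel shape here is an assumption for demonstration only).
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (kernel_size, kernel_size))
    split_mask = np.where(boundary_thin == 1, 0, region_mask).astype(np.uint8)
    return cv2.morphologyEx(split_mask, cv2.MORPH_OPEN, kernel)
```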
Figure 8. Examples of the outputs by deep edge-detection methods. (a) An example satellite image; (b) The results of HED; (c) The results of BDCN; (d) The results of DexiNed.
Figure 9. The kernel used for the morphological operation.
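The specific structuring element is the one shown in Figure 9 and is not reproduced in text here; as a hedged illustration only, the snippet below shows how a small kernel is built and applied with OpenCV, using a 3 × 3 cross shape as an assumed stand-in.

```python
import numpy as np
import cv2

# Illustrative 3x3 cross-shaped structuring element (an assumption; the actual
# kernel used in the paper is the one depicted in Figure 9).
kernel = cv2.getStructuringElement(cv2.MORPH_CROSS, (3, 3))

# Example: dilate a binary boundary map so that a thin edge becomes a thicker barrier.
boundary = np.zeros((64, 64), dtype=np.uint8)
boundary[20, 10:50] = 1                               # a synthetic one-pixel-wide edge
dilated = cv2.dilate(boundary, kernel, iterations=1)  # dilation grows the edge
print(kernel)
print(int(boundary.sum()), int(dilated.sum()))
```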
Figure 10. F1 versus threshold α using the XIHU training set. (a) The combinations of building regions by U-Net and building boundaries by HED, BDCN and DexiNed, respectively; (b) The combinations of building regions by DeepLab and building boundaries by HED, BDCN and DexiNed, respectively; (c) The combinations of building regions by Mask R-CNN and building boundaries by HED, BDCN and DexiNed, respectively.
Figure 11. F1 versus threshold α using the WHUEA training set. (a) The combinations of building regions by U-Net and building boundaries by HED, BDCN and DexiNed, respectively; (b) The combinations of building regions by DeepLab and building boundaries by HED, BDCN and DexiNed, respectively; (c) The combinations of building regions by Mask R-CNN and building boundaries by HED, BDCN and DexiNed, respectively.
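Figures 10 and 11 show how F1 varies with the boundary threshold α on the training sets, and the α values quoted in Tables 2–9 (e.g., α = 110 for DeepLab + BDCN on XIHU) correspond to the maxima of these curves. A minimal sketch of such a training-set sweep is given below; `compute_f1`, the `postprocess` callable, and the array names are illustrative placeholders rather than the authors' evaluation code.

```python
import numpy as np

def compute_f1(pred_mask, gt_mask):
    """Pixel-wise F1 between two binary masks."""
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def sweep_alpha(postprocess, region_prob, boundary_prob, gt_mask, alphas=range(0, 256, 10)):
    """Return (best_alpha, best_f1) for a given postprocessing function.

    `postprocess(region_prob, boundary_prob, alpha)` is assumed to return a
    binary mask; the sweep mirrors, in spirit, the curves of Figures 10 and 11.
    """
    scores = [(a, compute_f1(postprocess(region_prob, boundary_prob, a), gt_mask))
              for a in alphas]
    return max(scores, key=lambda t: t[1])
```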
Figure 12. Examples of building extraction results on the XIHU building dataset. (a) Original satellite images; (b) Ground truth; (c) Results of U-Net; (d) Results of DeepLab; (e) Results of Mask R-CNN; (f) Our results using the combination of DeepLab and BDCN (α = 110).
Figure 13. Examples of one-pixel-wide building boundaries by BDCN (α = 110). Results (a–d) are computed from images in Columns 1–4 of Figure 12a.
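Figures 13 and 15 show one-pixel-wide boundaries obtained from the thresholded edge maps, and the reference list cites Zhang and Suen's fast parallel thinning algorithm [60]. As a hedged illustration, scikit-image's `skeletonize` (whose default 2D method follows a Zhang–Suen-style thinning) can reduce a thick thresholded edge map to single-pixel width; the synthetic outline below is a stand-in for a real BDCN or DexiNed output binarized at α.

```python
import numpy as np
from skimage.morphology import skeletonize

# Synthetic stand-in for a thresholded edge map: a 3-pixel-thick rectangle outline.
edge_bin = np.zeros((64, 64), dtype=bool)
edge_bin[10:50, 10:13] = True
edge_bin[10:50, 47:50] = True
edge_bin[10:13, 10:50] = True
edge_bin[47:50, 10:50] = True

edge_thin = skeletonize(edge_bin)              # one-pixel-wide boundary
print(int(edge_bin.sum()), int(edge_thin.sum()))  # thinning removes interior edge pixels
```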
Figure 14. Examples of building extraction results on the WHUEA building dataset. (a) Original satellite images; (b) Ground truth; (c) Results of U-Net; (d) Results of DeepLab; (e) Results of Mask R-CNN; (f) Our results using the combination of Mask R-CNN and DexiNed (α = 200).
Figure 15. Examples of one-pixel-wide building boundaries by DexiNed (α = 200). Results (a–d) are computed from images in Columns 1–4 of Figure 14a.
Figure 16. Confusion matrix for the XIHU dataset. (a) Confusion matrix for DeepLab; (b) Confusion matrix for our method that combines the results of DeepLab and BDCN (α = 110).
Figure 17. Confusion matrix for the WHUEA dataset. (a) Confusion matrix for Mask R-CNN; (b) Confusion matrix for our method that combines the results of Mask R-CNN and DexiNed (α = 200).
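Figures 16 and 17 compare pixel-wise confusion matrices before and after postprocessing. A small sketch of how such a matrix, and the metrics derived from it, can be computed with scikit-learn is shown below; the random masks are placeholders for the actual ground truth and predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# gt_mask and pred_mask are HxW binary arrays (1 = building, 0 = background);
# random values here stand in for real ground truth and model output.
gt_mask = np.random.randint(0, 2, size=(256, 256))
pred_mask = np.random.randint(0, 2, size=(256, 256))

tn, fp, fn, tp = confusion_matrix(gt_mask.ravel(), pred_mask.ravel(), labels=[0, 1]).ravel()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
iou = tp / (tp + fp + fn)
print(f"P={precision:.4f} R={recall:.4f} F1={f1:.4f} IoU={iou:.4f}")
```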
Figure 18. Examples of results of the XIHU building dataset by edge-detection models. (a) Original satellite images; (b) Results of HED (α = 250); (c) Results of HED (α = 160); (d) Results of HED (α = 120); (e) Results of BDCN (α = 250); (f) Results of BDCN (α = 110); (g) Results of BDCN (α = 20); (h) Results of DexiNed (α = 250); (i) Results of DexiNed (α = 180); (j) Results of DexiNed (α = 10).
Figure 19. Examples of results of the WHUEA building dataset by edge-detection models. (a) Original satellite images; (b) Results of HED (α = 250); (c) Results of HED (α = 180); (d) Results of HED (α = 160); (e) Results of BDCN (α = 250); (f) Results of BDCN (α = 130); (g) Results of BDCN (α = 120); (h) Results of DexiNed (α = 240); (i) Results of DexiNed (α = 220); (j) Results of DexiNed (α = 200).
Figure 20. F1 versus threshold α using the XIHU training set for the combinations of building regions by PMNet mask and building boundaries by HED, BDCN, DexiNed and PMNet contour, respectively.
Figure 21. F1 versus threshold α using the WHUEA training set for the combinations of building regions by PMNet mask and building boundaries by HED, BDCN, DexiNed and PMNet contour, respectively.
Table 1. An overview of the building datasets.
Dataset | Training | Testing
XIHU Building | 742 | 197
WHU Building (East Asia) | 3135 | 903
Table 2. Results of U-Net and its combinations with different predicted building boundaries on the XIHU dataset.
Methods | P (%) | R (%) | F1 (%) | IoU (%)
U-Net [39] | 89.46 | 73.55 | 80.73 | 67.68
U-Net [39] + HED [47] (α = 250) + our method | 89.36 | 73.69 | 80.77 | 67.74
HED [47] (α = 250) | 58.71 | 1.45 | 2.83 | 1.43
U-Net [39] + BDCN [55] (α = 250) + our method | 89.39 | 73.63 | 80.75 | 67.71
BDCN [55] (α = 250) | 64.96 | 0.44 | 0.88 | 0.44
U-Net [39] + DexiNed [50] (α = 250) + our method | 89.34 | 73.77 | 80.81 | 67.80
DexiNed [50] (α = 250) | 57.85 | 2.44 | 4.69 | 2.40
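Tables 2–9 report pixel-wise precision (P), recall (R), F1 score and intersection-over-union (IoU). With TP, FP and FN denoting true-positive, false-positive and false-negative building pixels, the standard definitions, which are consistent with the reported values, are:

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F1 = \frac{2PR}{P + R}, \qquad
IoU = \frac{TP}{TP + FP + FN}
```

Equivalently, IoU = F1 / (2 − F1), which matches the reported F1–IoU pairs within rounding.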
Table 3. Results of DeepLab and its combinations with different predicted building boundaries on the XIHU dataset.
Methods | P (%) | R (%) | F1 (%) | IoU (%)
DeepLab [51] | 86.76 | 78.39 | 82.36 | 70.01
DeepLab [51] + HED [47] (α = 160) + our method | 84.38 | 80.88 | 82.59 | 70.35
HED [47] (α = 160) | 80.52 | 42.47 | 55.61 | 38.51
DeepLab [51] + BDCN [55] (α = 110) + our method | 85.61 | 80.81 | 83.14 | 71.14
BDCN [55] (α = 110) | 84.80 | 46.23 | 59.84 | 42.69
DeepLab [51] + DexiNed [50] (α = 180) + our method | 86.14 | 79.95 | 82.93 | 70.84
DexiNed [50] (α = 180) | 85.47 | 27.51 | 41.62 | 26.28
Table 4. Results of Mask R-CNN and its combinations with different predicted building boundaries on the XIHU dataset.
Methods | P (%) | R (%) | F1 (%) | IoU (%)
Mask R-CNN [15] | 89.82 | 70.38 | 78.92 | 65.18
Mask R-CNN [15] + HED [47] (α = 120) + our method | 85.00 | 76.20 | 80.36 | 67.17
HED [47] (α = 120) | 78.87 | 54.11 | 64.19 | 47.26
Mask R-CNN [15] + BDCN [55] (α = 20) + our method | 85.01 | 77.66 | 81.17 | 68.31
BDCN [55] (α = 20) | 79.47 | 68.74 | 73.72 | 58.38
Mask R-CNN [15] + DexiNed [50] (α = 10) + our method | 87.76 | 75.46 | 81.15 | 68.27
DexiNed [50] (α = 10) | 85.06 | 52.00 | 64.55 | 47.65
Table 5. Results of U-Net and its combinations with different predicted building boundaries on the WHUEA dataset.
Methods | P (%) | R (%) | F1 (%) | IoU (%)
U-Net [39] | 85.87 | 73.95 | 79.47 | 65.93
U-Net [39] + HED [47] (α = 250) + our method | 85.74 | 74.22 | 79.56 | 66.06
HED [47] (α = 250) | 59.48 | 2.95 | 5.62 | 2.89
U-Net [39] + BDCN [55] (α = 250) + our method | 85.89 | 73.97 | 79.48 | 65.95
BDCN [55] (α = 250) | 60.56 | 0.12 | 0.24 | 0.12
U-Net [39] + DexiNed [50] (α = 240) + our method | 85.36 | 76.51 | 80.69 | 67.64
DexiNed [50] (α = 240) | 86.38 | 38.52 | 53.28 | 36.32
Table 6. Results of DeepLab and its combinations with different predicted building boundaries on the WHUEA dataset.
Methods | P (%) | R (%) | F1 (%) | IoU (%)
DeepLab [51] | 81.57 | 78.54 | 80.03 | 66.70
DeepLab [51] + HED [47] (α = 180) + our method | 81.02 | 79.64 | 80.33 | 67.12
HED [47] (α = 180) | 78.65 | 19.51 | 31.26 | 18.53
DeepLab [51] + BDCN [55] (α = 130) + our method | 79.81 | 82.43 | 81.10 | 68.21
BDCN [55] (α = 130) | 82.81 | 66.03 | 73.48 | 58.07
DeepLab [51] + DexiNed [50] (α = 220) + our method | 80.86 | 81.53 | 81.19 | 68.34
DexiNed [50] (α = 220) | 86.75 | 52.97 | 65.78 | 49.01
Table 7. Results of Mask R-CNN and its combinations with different predicted building boundaries on the WHUEA dataset.
Methods | P (%) | R (%) | F1 (%) | IoU (%)
Mask R-CNN [15] | 84.43 | 77.48 | 80.81 | 67.79
Mask R-CNN [15] + HED [47] (α = 160) + our method | 83.55 | 78.62 | 81.01 | 68.08
HED [47] (α = 160) | 79.93 | 23.35 | 36.14 | 22.05
Mask R-CNN [15] + BDCN [55] (α = 120) + our method | 82.12 | 80.87 | 81.49 | 68.76
BDCN [55] (α = 120) | 82.09 | 69.13 | 75.06 | 60.07
Mask R-CNN [15] + DexiNed [50] (α = 200) + our method | 83.39 | 80.46 | 81.90 | 69.34
DexiNed [50] (α = 200) | 86.70 | 57.83 | 69.38 | 53.12
Table 8. Results of PMNet mask and its combinations with different predicted building boundaries on the XIHU dataset.
Methods | P (%) | R (%) | F1 (%) | IoU (%)
PMNet mask [59] | 88.85 | 62.71 | 73.53 | 58.13
PMNet mask [59] + HED [47] (α = 180) + our method | 86.86 | 68.63 | 76.68 | 62.18
HED [47] (α = 180) | 80.24 | 33.96 | 47.73 | 31.34
PMNet mask [59] + BDCN [55] (α = 140) + our method | 87.97 | 68.78 | 77.20 | 62.87
BDCN [55] (α = 140) | 85.25 | 37.41 | 52.00 | 35.14
PMNet mask [59] + DexiNed [50] (α = 200) + our method | 88.42 | 66.39 | 75.84 | 61.08
DexiNed [50] (α = 200) | 85.01 | 24.61 | 38.17 | 23.59
PMNet mask [59] + PMNet contour [59] (α = 100) + our method | 88.40 | 63.87 | 74.16 | 58.94
PMNet contour [59] (α = 100) | 67.58 | 6.93 | 12.57 | 6.70
Table 9. Results of PMNet mask and its combinations with different predicted building boundaries on the WHUEA dataset.
Methods | P (%) | R (%) | F1 (%) | IoU (%)
PMNet mask [59] | 89.09 | 66.37 | 76.07 | 61.38
PMNet mask [59] + HED [47] (α = 190) + our method | 88.21 | 68.48 | 77.10 | 62.74
HED [47] (α = 190) | 77.88 | 17.55 | 28.64 | 16.72
PMNet mask [59] + BDCN [55] (α = 210) + our method | 88.43 | 68.82 | 77.40 | 63.13
BDCN [55] (α = 210) | 83.00 | 23.55 | 36.69 | 22.46
PMNet mask [59] + DexiNed [50] (α = 230) + our method | 87.68 | 72.06 | 79.11 | 65.44
DexiNed [50] (α = 230) | 86.74 | 48.52 | 62.23 | 45.17
PMNet mask [59] + PMNet contour [59] (α = 120) + our method | 88.88 | 67.06 | 76.45 | 61.87
PMNet contour [59] (α = 120) | 70.88 | 5.85 | 10.81 | 5.71