Fusion of Multiscale Convolutional Neural Networks for Building Extraction in Very High-Resolution Images

Abstract: Extracting buildings from very high resolution (VHR) images has attracted much attention but remains challenging because buildings vary widely in appearance and scale. Convolutional neural networks (CNNs) have shown effective and superior performance in automatically learning high-level, discriminative features for building extraction. However, their fixed receptive fields make conventional CNNs poorly suited to large changes in scale. The multiscale CNN (MCNN) is a promising structure to meet this challenge. Unfortunately, the multiscale features extracted by an MCNN are usually stacked and fed into a single classifier, which makes it difficult to recognize objects at different scales. In addition, the repeated sub-sampling processes blur the boundaries of the extracted features. In this study, we propose a novel parallel support vector machine (SVM)-based fusion strategy to make full use of the deep features at different scales extracted by the MCNN structure. We first designed an MCNN structure with different sizes of input patches and kernels to learn multiscale deep features. Features at different scales were then individually fed into different SVM classifiers to produce rule images for pre-classification. A decision fusion strategy based on another SVM classifier was then applied to the pre-classification results. Finally, superpixels were applied to refine the boundaries of the fused results using region-based maximum voting. For performance evaluation, the well-known International Society for Photogrammetry and Remote Sensing (ISPRS) Potsdam dataset was used in comparison with several state-of-the-art algorithms. Experimental results demonstrate the superior performance of the proposed methodology in extracting complex buildings in urban districts.


Introduction
With the acceleration of urbanization, building extraction has become increasingly essential for urban planning, change monitoring, population estimation, and disaster assessment [1,2]. As remote sensing techniques have improved, high resolution and even very high resolution (VHR) images provided by satellite, spaceborne, and airborne platforms have become increasingly common [3][4][5]. The availability of these images makes it possible to distinguish buildings from background objects [6]. However, completely extracting buildings from VHR images with high accuracy is still a challenge. For one thing, the shape mainly caused by the down-sampling pooling processes in CNN, which make CNN extract more abstract features but at the cost of reduced feature resolution.
To meet these challenges, this paper proposes a support vector machine (SVM)-based fusion strategy for multiscale CNN features for building extraction in VHR images. The multiscale deep features are first produced by multiscale CNN models with inputs and kernels at three scales. These features are then separately fed into different SVMs to derive rule images at different scales. Rule images are the primary outputs of an SVM; they contain the distance of each pixel to the hyperplane of a binary classification problem [49]. After that, these rule images are fused with another SVM to derive a building classification result. Finally, a region-based max voting scheme is conducted using superpixels generated by a mean-shift (MS) algorithm. The main contributions of this study lie in the following two aspects: (1) deep features at a single scale are extended to multiple scales for building extraction; (2) a parallel SVM-based strategy is proposed to fuse multiscale CNN results at the decision level. The experimental results in the three study areas indicate that the proposed algorithm outperforms other popular algorithms.
The remainder of this paper is organized as follows. Section 2 presents the detailed structure of the proposed method for building extraction in VHR images. Section 3 describes the experimental results and the comparisons with other machine learning algorithms. Discussions and conclusions are given in Sections 4 and 5, respectively.

Methodology
In this study, an effective framework for building extraction from VHR images is proposed, which combines the discriminative object features provided by the MCNN with a decision fusion strategy based on SVMs. The overall workflow of the proposed method is illustrated in Figure 1. The proposed algorithm has three major steps: (A) multiscale deep feature extraction, in which multiscale features are learned using MCNN models with different input and kernel sizes; (B) SVM-based fusion of the MCNN, in which rule images are generated by different SVM models and fused by another SVM classifier at the decision level; (C) boundary refinement, in which superpixels provide building boundaries and region-based max voting produces the final building maps.

Basic Theory of CNN
Compared with traditional building extraction algorithms, a CNN is more effective because it can extract hierarchical, representative features of buildings [25]. A classic CNN is a feed-forward multilayer network that contains two typical kinds of layers, namely convolutional layers and pooling layers [30]. The convolutional layers generate various convolved features using different filters, and the pooling layers make the feature maps extracted by the convolutional layers more abstract and robust via a sub-sampling operation.
Generally, at the $l$th convolutional layer, the feature maps of the $(l-1)$th layer are first convolved with learnable filters $k$, and the output feature maps of the $l$th layer are then produced through a nonlinear activation function $g(\cdot)$. The activation function $g(\cdot)$ is commonly the sigmoid function, the hyperbolic tangent function, or the rectified linear unit [50]. Therefore, the $l$th convolutional layer $C_l$ can be summarized as
$$C_l = g\left(h_{l-1} * k + b_l\right) \tag{1}$$
where $h_{l-1}$ refers to the hidden layer, in which $h_0$ is the raw input, $*$ denotes convolution, and $b_l$ is the bias term of the $l$th layer feature map. When the convolutional layer works, each filter $k$ slides over the entire image and produces feature maps. One advantage of the convolutional layer in a CNN is that it can learn and choose the best filters for the entire network [43].
Pooling layers usually follow convolutional layers. They make the features produced by the convolutional layers more robust and reduce the computational complexity by using a sub-sampling operation. Pooling layers $P_l$ are defined as
$$P_l = g\left(\mathrm{down}(C_l) + b_l\right) \tag{2}$$
where $\mathrm{down}(\cdot)$ represents a sub-sampling function. Typically, it sums over each distinct n-by-n block in the input map, so that the output feature maps are n times smaller than the previous ones in each dimension. Each output map is given its own additive bias parameter $b_l$, as in the convolutional layers.
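The two layer types described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names (`conv2d_valid`, `down`, `conv_layer`, `pool_layer`) are hypothetical, the convolution is written in the cross-correlation form commonly used in CNN libraries, a sigmoid is assumed for g(·), and the pooling layer is assumed to apply the block sum, add the bias, and then apply g(·).

```python
import math

def conv2d_valid(h, k, b):
    """Slide filter k over map h (no padding, stride 1) and add the bias b.
    Written as cross-correlation, i.e. the filter is not flipped."""
    kh, kw = len(k), len(k[0])
    out_h, out_w = len(h) - kh + 1, len(h[0]) - kw + 1
    return [[sum(h[i + u][j + v] * k[u][v]
                 for u in range(kh) for v in range(kw)) + b
             for j in range(out_w)]
            for i in range(out_h)]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def activate(m, g=sigmoid):
    """Apply the activation function g element-wise."""
    return [[g(x) for x in row] for row in m]

def down(m, n):
    """Sum over each distinct n-by-n block, so the map shrinks n times per side."""
    return [[sum(m[i + u][j + v] for u in range(n) for v in range(n))
             for j in range(0, len(m[0]) - n + 1, n)]
            for i in range(0, len(m) - n + 1, n)]

def conv_layer(h_prev, k, b):
    """Convolutional layer: g(h_prev * k + b)."""
    return activate(conv2d_valid(h_prev, k, b))

def pool_layer(c, n, b):
    """Pooling layer (assumed form): g(down(c) + b)."""
    return activate([[x + b for x in row] for row in down(c, n)])
```

For a 6 × 6 input and a 3 × 3 filter, `conv_layer` returns a 4 × 4 map and `pool_layer` with n = 2 reduces it to 2 × 2, which shows how each convolution and sub-sampling stage trades spatial resolution for abstraction.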

Multiscale Deep Features Extraction
To extract deep features at different scales to describe complex buildings, we constructed a multiscale CNN structure in this paper. We used image patches at three different sizes as inputs to feed into three corresponding CNN models with three different kernel sizes, respectively. Specifically, the small scale will contain the inner spatial information, and the medium scale will contain the edges and corners, while the large scale will contain the neighboring and context information, therefore we can get more complete features to extract buildings.
The architecture of the MCNN in this paper is illustrated in Figure 2. To extract multiscale features of buildings, we used three input patches centered on one pixel, with sizes of 14 × 14, 24 × 24, and 34 × 34, respectively. The corresponding CNN models are named CNN14, CNN24, and CNN34. Each CNN model contains two convolution and sub-sampling layers. In addition, to improve the extraction of multiscale features, the kernel sizes of the CNN models also differ. The specific parameters of the CNN models at different scales are listed in Table 1. The different CNN models produce 192, 108, and 42 feature maps, respectively. In this design, the small-scale patch with a small convolutional kernel focuses on the inner information of buildings, the medium-scale patch with a medium kernel captures the corner and edge information of buildings, and the large-scale patch with a large kernel covers the neighboring objects and context information of buildings. Accordingly, this MCNN model can learn and extract multiscale spatial features of buildings.
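The multiscale input described above can be sketched as follows. This is an illustrative assumption rather than the authors' code: `extract_patch` and `multiscale_patches` are hypothetical names, the image is a single band stored as a list of rows, out-of-image pixels are filled with a padding value, and, because the patch sizes are even, the "center" pixel is taken to sit at index size // 2 of the patch.

```python
def extract_patch(image, row, col, size, pad_value=0):
    """Return a size x size patch around pixel (row, col).
    Assumption: for even sizes the center pixel sits at index size // 2."""
    half = size // 2
    n_rows, n_cols = len(image), len(image[0])
    patch = []
    for r in range(row - half, row - half + size):
        line = []
        for c in range(col - half, col - half + size):
            inside = 0 <= r < n_rows and 0 <= c < n_cols
            line.append(image[r][c] if inside else pad_value)
        patch.append(line)
    return patch

def multiscale_patches(image, row, col, sizes=(14, 24, 34)):
    """Patches at the three scales, one per CNN model, keyed by patch size."""
    return {s: extract_patch(image, row, col, s) for s in sizes}
```

Each pixel therefore yields three co-centered patches, and every patch is passed to the CNN model of the matching scale.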

SVM-Based Fusion of MCNN
In some existing studies, deep features from different sources are stacked together and fed into one classifier for further classification. However, in this way, features from each source are treated equally, and it is difficult for a single classifier to match the different features, which leads to poor performance in recognizing complex objects. Therefore, in this paper, deep features at three scales were fed into three support vector machines (SVMs) individually. The SVM is a machine learning algorithm that has been successfully introduced in the remote sensing context and reported to be effective in classification [51][52][53][54]. Considering the land covers in the experimental datasets, we set four land-cover classes: buildings, road, vegetation, and shadow. Five hundred samples of each class were selected randomly for training, and, again using a random sampling strategy, 800 samples of each class were generated as an independent validation set.
Corresponding to the input features at three different scales, we used three SVM models. The SVM was trained individually at each scale to estimate the kernel parameter γ and the regularization parameter C. Because SVMs were originally developed as binary classifiers, two main strategies have been proposed to extend them to multiclass problems: the one-against-one (OAO) strategy and the one-against-all (OAA) strategy [55]. The rule images derived from the OAO strategy have been shown to be better suited to a multiple classifier system than those from the OAA strategy [49]. Therefore, in this paper, we used the OAO strategy to produce rule images.
In the configuration of the SVM, a Gaussian kernel was selected because of its superiority in handling complex nonlinear class distributions and its comparatively low computational complexity [56]. The training of the SVM with the Gaussian kernel and the generation of the rule images were performed using imageSVM [57], which is freely available in the EnMAP-Box and uses the LIBSVM approach for training. The best combinations of the kernel parameter γ and the regularization parameter C were determined by a grid search using tenfold cross validation. As shown in Figure 1, the first three SVM classifiers produce 18 rule images in total (six rule images from each individual SVM, one per pair of the four land-cover classes).
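The count of 18 rule images follows directly from the one-against-one strategy: four classes give six class pairs, and each of the three scale-specific SVMs contributes one rule image per pair. A small sketch of this bookkeeping is given below; the names `oao_pairs` and `rule_image_names` and the label strings are hypothetical and only serve the illustration.

```python
from itertools import combinations

CLASSES = ("building", "road", "vegetation", "shadow")
SCALES = (14, 24, 34)  # patch sizes of the three CNN models

def oao_pairs(classes):
    """One binary SVM, and hence one rule image, per unordered class pair."""
    return list(combinations(classes, 2))

def rule_image_names(scales, classes):
    """Label every rule image by its scale and class pair."""
    return [f"CNN{s}:{a}-vs-{b}" for s in scales for a, b in oao_pairs(classes)]
```

With the four classes and three scales used here, `oao_pairs(CLASSES)` has six entries and `rule_image_names(SCALES, CLASSES)` has 18, matching the numbers stated in the text.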
The rule images were then used for the decision fusion to decide the final label of each pixel. In traditional SVM classification, the decision fusion is conducted using a simple majority vote based on these rule images. In this paper, we used a second SVM for the decision fusion process to make full use of the feature information. Specifically, all the rule images derived from the first SVMs are first combined into one data set. An additional SVM is then applied to this data set, which consists of rule images focused on different scales, to determine the building classification result. This approach has been demonstrated to outperform the simple majority voting strategy [49,58]. Note that the first SVMs and the second SVM share the same training and validation samples.
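The difference between the two fusion options can be sketched for a single pixel. The code below is an illustration under stated assumptions, not the authors' implementation: `majority_vote` and `fusion_vector` are hypothetical names, and a positive rule-image value is assumed to favor the first class of each pair. The second SVM itself is not shown; the sketch only builds the feature vector it would receive.

```python
from collections import Counter
from itertools import combinations

def majority_vote(decision_values, classes):
    """Traditional OAO decision: every pairwise classifier casts one vote.
    Assumption: a positive value favors the first class of the pair."""
    votes = Counter()
    for (a, b), d in zip(combinations(classes, 2), decision_values):
        votes[a if d > 0 else b] += 1
    return votes.most_common(1)[0][0]

def fusion_vector(rule_values_per_scale):
    """Concatenate the rule-image values of one pixel across all scales.
    The result is the input feature vector for the second (fusion) SVM."""
    return [d for values in rule_values_per_scale for d in values]
```

Majority voting keeps only the sign of each distance to the hyperplane, whereas the fused vector keeps all 18 distances, so the second SVM can weigh how confident each scale was.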

Boundary Refinement
By using the SVM-based fusion strategy, the proposed algorithm can predict the position of buildings at different scales. However, because the classification and decision fusion processes are based on deep features extracted by the CNN, the repeated sub-sampling operations in the CNN blur the boundaries of buildings, which tends to amplify the building mapping uncertainty. Therefore, further refinement of the extracted results is needed. Combining the results with superpixels is considered a promising way to address this issue. Superpixels are defined as patches of pixels in which the texture, color, brightness, etc. are similar [59]. The boundaries offered by superpixels are clear and the pixels within a superpixel are homogeneous, which can be utilized to optimize the MCNN classification maps by a simple voting algorithm.
Remote Sens. 2019, 11, 227
Over the past years, more than 30 superpixel algorithms have been made publicly available [60]. These algorithms can generally be divided into two categories: those based on gradient ascent and those based on graph theory [61]. Gradient ascent methods mainly include the mean-shift (MS) algorithm [62], the simple linear iterative clustering (SLIC) algorithm [63], and the watershed transform algorithm [64], while graph theory based methods mainly include efficient graph-based image segmentation (EGB) [65] and the normalized cuts algorithm [66]. Among these algorithms, the MS algorithm performs well in segmenting VHR images, with the advantages of simple parameters and no need for prior knowledge. In addition, the MS algorithm is able to maintain saliency and edge information, which has contributed to its wide application to complex images [62,67,68]. Therefore, the MS algorithm is applied in this paper to generate superpixels.
To integrate the SVM-based classification maps with superpixels into the final building maps, a simple max voting scheme is employed in this paper. The max voting scheme contains three steps. First, the MCNN classification result is mapped to each superpixel to assign classification labels to all pixels. Then, the classification unit in the post-processing is defined as a superpixel instead of an individual pixel. Finally, in the voting process, the most frequent label in a superpixel is taken as the final label of this superpixel. For a superpixel $r$, the label $SP_r$ is defined as
$$SP_r = \arg\max_{c \in \{1,\dots,N\}} \sum_{(i,j) \in r} \mathbb{1}\left(f_{r(i,j)} = c\right) \tag{3}$$
where $(i,j)$ is the coordinate of the pixel $r(i,j)$, $f_{r(i,j)}$ is the label of the pixel $r(i,j)$ in superpixel $r$ from the initial MCNN classification result, $\mathbb{1}(\cdot)$ equals 1 when its argument is true and 0 otherwise, and $N$ is the total number of expected classes. The same voting scheme is applied to all superpixels to obtain the final result.
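The three voting steps above can be sketched in a few lines. This is a minimal illustration, assuming that the classification result and the superpixel segmentation are given as two equally sized 2-D lists (class labels and superpixel identifiers); the function name `max_voting` is hypothetical, and ties are resolved in favor of the label encountered first, which the text does not specify.

```python
from collections import Counter

def max_voting(label_map, superpixel_map):
    """Assign to every superpixel the most frequent pixel label inside it."""
    votes = {}
    # Steps 1 and 2: collect the pixel labels that fall into each superpixel.
    for label_row, sp_row in zip(label_map, superpixel_map):
        for label, sp in zip(label_row, sp_row):
            votes.setdefault(sp, Counter())[label] += 1
    # Step 3: the most frequent label becomes the label of the superpixel.
    winner = {sp: counts.most_common(1)[0][0] for sp, counts in votes.items()}
    return [[winner[sp] for sp in sp_row] for sp_row in superpixel_map]
```

For example, if a superpixel covers four pixels of which three are labeled as building, the isolated fourth pixel is relabeled as building, which is how the voting removes speckles while following the superpixel boundary.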

Introduction of Datasets
The VHR images used in our experiments consist of three orthophotos from a well-known dataset, the ISPRS Potsdam 2D semantic VHR remote sensing (Germany) dataset, which is an open dataset provided online at http://www2.isprs.org/commissions/comm3/wg4/2d-sem-labelpotsdam.html. It contains two kinds of optical images: near-infrared, red, green bands (NIR-RG) and red, green, blue bands (RGB). In addition, the Potsdam dataset contains a digital surface model (DSM) map and a manually annotated ground truth image. In our experiment, we used only the Potsdam NIR-RG images. To test the effectiveness of the proposed algorithm under different building environments, we used three images from Potsdam that contain dense and complex buildings; the original images and corresponding reference images are illustrated in Figure 3. The experimental images in the three study areas are named Image 1, Image 2, and Image 3, respectively. The spatial resolution of the Potsdam images is approximately 0.09 m.

Experimental Setups
Several parameters need to be set in the experiment. First, when using the MCNN to extract multiscale features, the input patches centered on a pixel are set to 14 × 14, 24 × 24, and 34 × 34, respectively. The kernel sizes of the two convolutional layers of the corresponding CNN models are set to 3 × 3, 5 × 5, and 7 × 7, respectively. Second, when using the MS algorithm to produce superpixels, three scale parameters need to be set for each image, namely the window width of the color domain, the window width of the spatial domain, and the minimum area size. For all three images, these parameters are set to 30/12.5/150 pixels.
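As a rough check of how the patch and kernel sizes above interact, the spatial size of the feature maps after the two convolution and sub-sampling stages can be computed. The sketch below rests on assumptions that the text does not state: convolutions without padding and with stride 1, and a 2 × 2 sub-sampling block. The function name `feature_map_size` is hypothetical.

```python
def feature_map_size(patch, kernel, n_stages=2, pool=2):
    """Side length of the feature maps after n_stages of convolution + pooling.
    Assumptions: no padding, stride 1, and pool x pool sub-sampling blocks."""
    size = patch
    for _ in range(n_stages):
        size = size - kernel + 1          # unpadded convolution shrinks the map
        if size <= 0 or size % pool:
            raise ValueError("patch, kernel and pool sizes are incompatible")
        size //= pool                     # sub-sampling by the pooling factor
    return size

# Patch and kernel sizes as listed in the text.
SETUPS = {14: 3, 24: 5, 34: 7}
sizes = {patch: feature_map_size(patch, kernel) for patch, kernel in SETUPS.items()}
```

Under these assumptions the three models end with 2 × 2, 3 × 3, and 4 × 4 feature maps, so every listed patch size divides cleanly through both stages.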
To verify the superiority of the proposed algorithm, three algorithms are adopted for comparison. They are configured as follows. First, to demonstrate the superiority of deep features, we used the original spectral bands instead of deep features as inputs, while the rest of the process remains the same; this is hereafter named comparison 1 algorithm (C1). Second, to demonstrate the superiority of multiscale deep features, we used single-scale deep features instead of multiscale deep features, while the rest of the process remains the same; this is hereafter named comparison 2 algorithm (C2). Third, to verify the effectiveness of separately using the deep features at each scale, the deep features were stacked as one feature set and then fed into one SVM classifier, while the rest of the process remains the same; this is hereafter named comparison 3 algorithm (C3).

Precision Evaluation Criteria
In this paper, we used three popular criteria, namely Recall, Precision, and F-measure, to evaluate the performance of the proposed algorithm [38,69,70]. They are defined as follows.
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{4}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{5}$$
$$F\text{-measure} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{6}$$
where TP (true positive) represents the number of building pixels correctly classified with respect to the reference maps; FP (false positive) represents the number of background pixels misclassified as buildings; FN (false negative) represents the number of true building pixels misclassified as background pixels. Overall, Recall and Precision describe the building extraction accuracy, and F-measure is a synthetic measurement of Recall and Precision.
Figures 4-6 show the building extraction results of the proposed algorithm and the three compared algorithms. Note that the final sub-image of each figure is the final building result of our proposed algorithm after refinement by superpixels. Overall, the better performance of our proposed algorithm is clearly perceivable in all three images. The proposed algorithm can extract buildings at various scales, and the extracted buildings are complete and continuous under complex building environments (Figures 4-6). In particular, as illustrated in the red rectangles corresponding to Figure 4a, the proposed algorithm detects buildings completely, while the others perform poorly in detecting buildings on the dark side, which indicates the superiority of our proposed algorithm in extracting buildings with different appearances. In addition, thanks to the multiscale deep features, the proposed algorithm can recognize buildings at both small and large scales, as shown in the labeled yellow ellipses in Figure 5. Finally, as shown in Figures 4f, 5f and 6f, the use of superpixels yields final building maps with few speckles and little noise. However, buildings covered by shadows are difficult to extract using the proposed algorithm.
For example, as shown in the blue rectangles in Figure 5, parts of some buildings are covered by shadows, and none of the algorithms can detect them effectively. Shadows distort the image information, which makes the recognition of objects a challenge.
Compared with our proposed algorithm, the other three algorithms perform unsatisfactorily on different objects. First, as shown in Figures 4c, 5c and 6c, the buildings extracted by C1 are not complete. For example, the extracted building labeled in the green rectangle in Figure 5c has some small holes, while in Figure 5b,d,f this building is comparatively more complete. In addition, the overall building maps extracted by C1 contain much noise, such as the labeled objects in the green rectangles corresponding to Figure 6. This is because C1 is based only on spectral features, which indicates the superiority of the CNN model in extracting both spectral and spatial features. Comparing the proposed algorithm with C2, we find some misclassifications by the latter, such as those shown in the red squares in Figure 5d. This is mainly because features at a single scale cannot provide sufficient information for accurate recognition in a very complex urban environment. Finally, the comparison between the proposed algorithm and the C3 algorithm shows that the feature stacking strategy performs worse than the separate classification of different features.
As illustrated in Figures 4e and 5e, there are some misclassifications, which mainly indicates that treating features at different scales equally has some limitations. Table 2 shows the quantitative evaluation results of the buildings extracted by the proposed algorithm and the three other algorithms. Comparing the F-measure values, we can conclude that the proposed algorithm has the best performance in extracting complex buildings. Overall, the F-measure of the proposed algorithm is comparatively high in the three study areas. The F-measure values of the other three algorithms are much lower, which demonstrates the effectiveness of the deep features, the multiscale CNN strategy, and the separate fusion of features adopted in our proposed algorithm. Analyzing the Recall criterion, we find that the Recall values of C2 and C3 are lower than their Precision values in Image 1, mainly because more buildings were omitted, as shown in Figure 4d,e. This also demonstrates that the compared algorithms have some limitations in extracting buildings. Moreover, the Recall of C1 is generally lower in all three images, mainly because some of the buildings extracted by C1 are not complete and have fine structures. In comparison, the proposed algorithm performs well in both Recall and Precision, which means that it can better extract buildings in VHR images.
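The Recall, Precision, and F-measure values compared above can be computed from a predicted building map and a reference map as sketched below. This is a minimal illustration, assuming both maps are equally sized 2-D lists of 0/1 values with 1 marking building pixels; the function name `building_metrics` is hypothetical, and returning 0.0 for an undefined ratio is a convention chosen here.

```python
def building_metrics(predicted, reference):
    """Return (recall, precision, f_measure) for binary building maps."""
    tp = fp = fn = 0
    for p_row, r_row in zip(predicted, reference):
        for p, r in zip(p_row, r_row):
            if p and r:
                tp += 1      # building pixel correctly classified
            elif p and not r:
                fp += 1      # background pixel misclassified as building
            elif r and not p:
                fn += 1      # building pixel misclassified as background
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return recall, precision, f_measure
```

A low Recall with a high Precision, as reported for C2 and C3 in Image 1, corresponds to many false negatives, that is, omitted building pixels.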

Discussions
This paper proposed a novel algorithm to extract buildings in VHR images by fusing multiscale deep features at the decision level. The algorithm can extract buildings at different spatial scales, with fine structures and little noise, by using three strategies: a multiscale CNN structure to extract multiscale features, an SVM-based fusion strategy, and superpixel refinement. The contributions of these three strategies are discussed as follows.

The Effectiveness of Deep Learning Strategy
To distinguish buildings from the background, the popular deep learning algorithm CNN was applied to explore the features. In addition, considering that buildings in VHR images vary widely in scale, a multiscale strategy was used to improve the traditional CNN architecture. Specifically, we fed image patches of three different sizes into three corresponding CNN models with three different kernel sizes. In this way, we obtain more complete features for extracting buildings. To verify the superiority of multiscale deep features over the original spectral bands, we set C1 as the comparison algorithm; the results are shown and compared in Figures 4c, 5c and 6c. The comparison shows that buildings extracted by the MCNN are more complete, while the roofs extracted by the SVM often contain holes. This is mainly because the CNN algorithm can automatically extract high-level, abstract, and spatially related features directly from the original data. In contrast, the SVM classification is based on spectral characteristics, which means that it cannot detect some inhomogeneous pixels on building roofs. Accordingly, it is effective to use deep learning algorithms to extract buildings, especially in complex urban environments.

The Effectiveness of Separately Using Deep Features at Each Scale
In most existing studies, features from multiple sources are stacked together and then fed into one classifier to produce the final classification maps. However, in this way, features from each source are treated equally, and it is difficult for a single classifier to match the different features, which leads to poor performance in recognizing objects. Therefore, in this paper, we designed a parallel SVM-based fusion strategy to use the deep features at different scales separately. To verify the effectiveness of this strategy, we set up the C3 algorithm as a comparison for each image. As we can see from

The Effectiveness of Superpixels
Due to the repeated pooling operations in the CNN algorithm, the deep features extracted from the CNN tend to have blurred boundaries. Therefore, we finally improved the classification results by using fine boundary information. Specifically, we used a simple region-based max voting for classification based on superpixels instead of individual pixels. To verify the effectiveness of combining superpixels, both the classification results and the refined building maps are illustrated in Figures 4-6. As these figures show, the buildings refined by superpixels have better structures and fewer speckles and less noise. Therefore, the use of superpixels is effective in improving the blurred boundaries of the CNN results.

A Word on Data Quality
In our experiments, we repeatedly noticed inaccurate ground truth maps in the Potsdam dataset, as shown in Figure 7. In the blue rectangles in Figure 7, the ground truth image clearly misses a small building, while in Figure 7c the proposed algorithm extracts this small building accurately. We also found that some building boundaries are missing from the provided ground truth data, which also lowers the quantitative evaluation of the proposed algorithm. In future work, we will use more accurate datasets for effective assessment.

Conclusions
Building extraction from VHR images has been a popular topic in the last two decades. However, the large variety in the scales and appearances of buildings makes the task very challenging. In this paper, we proposed a novel SVM-based fusion algorithm based on multiscale deep features to extract buildings in VHR images. The experimental results have validated the effectiveness of the proposed algorithm. Thanks to the multiscale deep features, the SVM-based fusion strategy, and the superpixel refinement, the proposed approach has achieved: (1) accurate extraction of buildings at different scales, and (2) complete and well-structured extraction of buildings with less speckle and noise. Specifically, the deep features extracted by the multiscale CNN, instead of a traditional single-scale CNN, contributed to the satisfactory performance in recognizing buildings at different spatial scales. Besides, instead of stacking features into one classifier, the proposed parallel SVM-based fusion strategy uses the deep features at each scale separately. Meanwhile, the superpixels also helped to improve the MCNN results, where region-based max voting refined the boundaries and reduced the noise. However, extracting buildings covered by other objects such as umbrellas and trees is still challenging. This is mainly because the spectral characteristics of covered and uncovered buildings are totally different, and the spectral bands of optical VHR images cannot penetrate these coverings. To meet this challenge, in the future, we will consider fusing different datasets such as SAR, or designing more effective classifiers by incorporating context and shape features of buildings.
