A Multi-Scale Filtering Building Index for Building Extraction in Very High-Resolution Satellite Imagery

Building extraction plays a significant role in many high-resolution remote sensing image applications. Many current building extraction methods need training samples, and it is common knowledge that different samples often lead to different generalization ability. The morphological building index (MBI), which represents morphological features of building regions in an index form, can effectively extract building regions, especially in Chinese urban regions, without any training samples and has drawn much attention. However, some problems, such as the heavy computation cost of multi-scale and multi-direction morphological operations, still exist. In this paper, a multi-scale filtering building index (MFBI) is proposed to overcome these drawbacks and to deal with the increasing noise in very high-resolution remote sensing images. The profile of multi-scale average filtering is averaged and normalized to generate this index. Moreover, to fully utilize the relatively limited spectral information in very high-resolution remote sensing images, two scenarios to generate the multi-channel multi-scale filtering index (MMFBI) are proposed. Since no high-resolution remote sensing building extraction dataset is currently open to the public, and the current very high-resolution building extraction datasets usually contain samples from North American or European regions, we offer a very high-resolution remote sensing building extraction dataset whose samples cover multiple building styles from multiple Chinese regions. The proposed MFBI and MMFBI outperform MBI and the currently used object-based segmentation method on the dataset, with a high recall and F-score. Meanwhile, the computation time of MFBI and MBI is compared on three large-scale very high-resolution satellite images, and the sensitivity analysis demonstrates the robustness of the proposed method.


Introduction
Building extraction plays a significant role in a series of high-resolution remote sensing applications (e.g., urban extension monitoring, urban mapping and planning, spatial analysis) [1][2][3][4]. Especially in China, rapid urbanization over the last thirty to forty years has created an urgent need for these remote sensing applications [5][6][7][8].
In the 20th century, when only medium and coarse spatial resolution remotely sensed imagery was available, built-up area extraction was usually a secondary product of land use and land cover classification [9][10][11]. Limited by the spatial resolution at that time, some applications such as illegal building detection and geo-database updating could not be implemented. It has been reported that a single building can be clearly represented in imagery only at a spatial resolution finer than five meters [12,13].
Over the last twenty to thirty years, the spatial resolution has improved significantly. For example, the well-known QuickBird imagery has a spatial resolution of 0.6 m, and the more recently launched WorldView-3 imagery has a spatial resolution of 0.3 m. The greatly increased spatial resolution makes applications such as object detection, geo-database updating and illegal building detection possible [14][15][16][17][18][19][20][21]. Usually, remotely sensed imagery with a spatial resolution of about 1 to 4 m is called high-resolution remotely sensed imagery (HRRSI), and remotely sensed imagery with a spatial resolution finer than 1 m is called very high-resolution remotely sensed imagery (VHRRSI) [22][23][24][25].
At the beginning of the 21st century, many studies focused on building extraction with HRRSI. Mathematical morphology, a theory that has been widely used in remote sensing image processing, provides a theoretical foundation for many methods such as morphological profiles (MP) [26], differential morphological profiles (DMP) [27], extended morphological profiles (EMP) [28] and attribute profiles (AP) [29]. The basic idea of these methods is to extract features via morphological operations and then feed these features into a classifier such as SVM to extract building regions or perform land cover classification. Object-based methods tend to segment an image using spectral, texture and contextual information, and then either use rule sets to extract building regions directly or perform supervised classification with the features of the segmented objects [30,31]. A series of built-up area indexes, such as the texture-derived built-up presence index (PanTex) [32], the multi-scale urban complexity index (MUCI) [33] and the morphological building index (MBI) [34], have been reported to achieve good performance on building extraction tasks in HRRSI. The basic idea of these methods is to present features that can discriminate buildings from other objects in an index form and then extract building regions by threshold segmentation rather than supervised classification. Meanwhile, some building extraction methods based on active contours [35,36] and graph cuts [37,38] have also been reported. In short, many of the aforementioned building extraction methods belong to supervised learning and thus need training samples. The number of training samples, the time cost and the generalization performance are all critical factors when these methods are put into application.
Since 2012, deep learning has outperformed almost all traditional methods in many visual tasks. In the field of building extraction, deep learning based methods usually need a dataset to train the model first and then predict labels for each image pixel. The problem, however, mainly lies in the lack of training samples for HRRSI or VHRRSI and in the generalization ability of a given method [39]. As will be discussed in Section 2, currently available building extraction datasets for VHRRSI and HRRSI have some limitations and might be inappropriate for building extraction tasks in Chinese regions because of the different building styles in China and Western countries.
It should be noted that although other sensor data such as LiDAR can also perform well on building extraction [40,41], their application is still subject to the availability of these data. Since this paper focuses on optical imagery and on the convenience of a building extraction method, methods based on these sensors are beyond the scope of this paper.
When the spatial resolution becomes finer than 1 m, several challenges make the built-up area extraction task more difficult (demonstrated in Figure 1). The first is that in VHRRSI, building roofs are represented in detail, and the varying spectral values within the same roof make it difficult to label all of its pixels as building area. The second is that road areas, which were already difficult to distinguish from building areas in HRRSI, are much wider in VHRRSI and can no longer be regarded as line structures, which makes them even harder to distinguish from building areas. The third is that the influence of sensor noise is more apparent in VHRRSI than in HRRSI. Hence, many building extraction tasks are nowadays implemented via object-oriented methods and deep learning based methods [42][43][44]. However, as mentioned above, deep learning based methods need a large number of training samples, and the generalization ability of the trained model might be poor in many cases. The performance of object-oriented methods depends largely on the segmentation results, which also challenges their reliability and generalization ability. Meanwhile, automatic building extraction methods have also received much attention because they avoid supervised learning and can bypass the problems mentioned above. For example, graph theories [45,46], fully connected conditional random fields [47] and multi-scale texture features [48] have been used for building extraction in VHRRSI.
The morphological building index (MBI) [33,34,49,50], which outperforms some state-of-the-art building extraction methods such as DMP [26], EMP [28] and PanTex [32], is a novel building extraction method for HRRSI (with a spatial resolution of 2 to 4 m in reported experiments). The basic idea of MBI is to first extract the spectral information of building areas from each pixel's maximal gray value among all spectral bands and then extract the spatial information via the differential profiles of multi-scale and multi-direction linear morphological operations. The success of MBI lies in its automatic building extraction without supervised learning and its avoidance of high-dimensional features. However, some drawbacks remain. Firstly, multi-scale and multi-direction morphological operations cause a heavy computation cost, especially in VHRRSI, and might perform worse because of the three challenges mentioned above. Secondly, the strategy of selecting the maximal gray value among all spectral bands ignores some spectral information that also contributes to building regions, while it is widely acknowledged that the performance of HRRSI interpretation tasks depends largely on the joint use of spatial and spectral information [51].
Inspired by MBI and by the fact that basic filters can suppress noise, we found in Reference [52] that multi-scale filters can extract building features in VHRRSI. In this paper, a novel multi-scale filtering building index (MFBI) is studied to automatically extract the feature map of building areas in VHRRSI. To fully utilize the spectral information, two scenarios to extend MFBI to multiple channels (multi-channel MFBI, MMFBI) are proposed. Extensive experiments on our newly published satellite image dataset for building extraction demonstrate the effectiveness of MFBI and MMFBI.
The remainder of this paper is organized as follows. In Section 2, we briefly summarize related work such as MP, MBI and the current building extraction datasets. In Section 3, we introduce the proposed MFBI and multi-channel MFBI (MMFBI). In Section 4, we first introduce our dataset for building extraction in VHRRSI and then present the details of our experiments. Finally, in Section 5, a brief conclusion is drawn.

Morphological Profile
The morphological profile (MP) was first applied to hyperspectral and high-resolution remotely sensed imagery in References [26][27][28][29]. The basic idea of the morphological profile is to extract a series of feature images by using a morphological operator of a certain shape (e.g., rectangular, circular) with different structural element sizes. The series of difference images between any two neighbouring images in the MP is called the differential morphological profile (DMP). It has been acknowledged that DMP can utilize more spatial information in HRRSI than MP. Later, by utilizing features from the difference of one image to all other images in the profile, the generalized differential morphological profile (GDMP) [53] was studied, and a better classification performance than DMP has been reported on several standard datasets. The relationship between MP, DMP and GDMP is demonstrated in Figure 2. Some other improvements on MP and DMP, such as AP and MPs, have also been reported. These morphological features are usually fed into a classifier such as SVM to perform land use classification or extract built-up areas. However, the generalization ability and the possible applications of all these methods are still limited by the chosen training samples.
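The relationship between a profile, its DMP and its GDMP can be sketched in a few lines. This is a minimal illustration, assuming the profile has already been computed and stacked as an array of shape (k, H, W), that differences are taken as absolute values, and that GDMP is represented by all pairwise differences; the function names are hypothetical and NumPy is an assumed dependency.

```python
import numpy as np

def differential_profile(profile):
    """DMP sketch: absolute difference between neighbouring images of a profile.

    profile: array of shape (k, H, W), one feature image per structural element size.
    Returns an array of shape (k - 1, H, W).
    """
    profile = np.asarray(profile, dtype=np.float64)
    return np.abs(profile[1:] - profile[:-1])

def generalized_differential_profile(profile):
    """GDMP-style sketch: difference of each image to every later image.

    Assumption: all k * (k - 1) / 2 unordered pairs are used as features.
    """
    profile = np.asarray(profile, dtype=np.float64)
    k = profile.shape[0]
    pairs = [np.abs(profile[j] - profile[i])
             for i in range(k) for j in range(i + 1, k)]
    return np.stack(pairs)
```

For a profile of k images, the first function yields k − 1 difference images, while the second yields k(k − 1)/2, which is why GDMP can carry more spatial information at the price of a higher feature dimension.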

Morphological Building Index
The morphological building index is calculated in the following steps [34].
Step 1. The brightness image is generated from each pixel's maximal gray value among all spectral bands. This is because the maxima of the multispectral bands correspond to high reflectance, and in an aerial image such reflectance usually indicates candidate buildings [34,50].
Step 2. Opening by reconstruction is applied to the brightness image to further enhance the signal of building areas. Note that the assumption here is that built-up areas tend to be brighter in imagery.
Step 3. A linear (line-shaped) morphological operator of a certain size serves as the structural element and is applied to the reconstructed image in multiple directions to generate the feature image. It has been reported that four directions (i.e., 0°, 45°, 90° and 135°) are enough to extract building features [34,50].
Step 4. Given different structural element sizes (set by the step size, the minimal window size and the maximal window size) for the operator in step 3, a series of feature images is generated. Then, the differential profile of these feature images is obtained.
Step 5. All difference images in the differential profile are averaged and normalized to (0, 1) to generate the morphological building index.
Step 6. Post-processing framework. After these five steps, a threshold is set to segment building areas, and a series of post-processing operations is applied, such as the removal of elongated areas and of false alarms caused by vegetation and water.
However, drawbacks such as the heavy computation cost caused by a series of morphological operations still remain.
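For orientation, Steps 1 to 5 can be summarized in formula form. The block below follows the commonly cited formulation of MBI from Reference [34] as far as we recall it, so it should be read as a sketch: the symbols b (brightness), γ^{re} (opening by reconstruction with a linear structural element of direction d and size s), D (number of directions) and S (number of scales) are notation assumed here, and details may differ from the wording of the steps above.

```latex
\begin{align*}
b(x) &= \max_{1 \le c \le C} \mathrm{band}_{c}(x) \\
\mathrm{WTH}(d, s) &= b - \gamma^{re}_{b}(d, s) \\
\mathrm{DMP}_{\mathrm{WTH}}(d, s) &= \left| \mathrm{WTH}(d, s + \Delta s) - \mathrm{WTH}(d, s) \right| \\
\mathrm{MBI} &= \frac{1}{D \cdot S} \sum_{d, s} \mathrm{DMP}_{\mathrm{WTH}}(d, s)
\end{align*}
```

The double sum over directions and scales makes the source of the computation cost visible: every (d, s) pair requires its own morphological reconstruction of the full image.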

Building Extraction Datasets
Over the last ten years, some datasets have served as benchmarks for the task of built-up area extraction in HRRSI. Several typical datasets among them are summarized in Table 1. These datasets are designed for experiments with model-driven methods. Each of them is usually an image of small size and is not available to the public.
Until now, two well-known datasets have been published for the task of building extraction in VHRRSI, namely the Massachusetts dataset and the Inria dataset [56,57]. These two datasets were primarily intended for the validation and comparison of data-driven methods. The first and second rows of Table 2 summarize the basic information of these two datasets.
However, the current datasets in HRRSI and VHRRSI still fall short of the needs of building extraction tasks, mainly for the following reasons:
1. No dataset for HRRSI is open to the public so far. This makes it hard to validate and compare traditional model-driven methods.
2. Every dataset for HRRSI usually consists of a few small image pieces and is often unable to represent the performance of a proposed method in different situations.
3. So far, almost all datasets for VHRRSI come from aerial imagery covering some regions in the US or Europe, acquired under good imaging conditions. No VHRRSI dataset designed for Chinese regions is available, while urban and suburban landscapes in China and Western countries are quite different (examples are shown in Figure 3). Note that different training samples usually lead to quite different performance for data-driven methods, and that in many cases the imaging conditions of satellite images differ from those of aerial images.
4. No open VHRRSI building extraction dataset from satellite imagery is available, let alone one that suits both model-driven and data-driven methods; for model-driven methods, the near infrared channel is quite important for their implementation and performance.

Figure 3. Some examples of Chinese and Western countries' landscapes: samples in China, samples in Europe [57] and samples in the U.S. [56].

Multi-Scale Building Index
Image filtering was first studied to remove noise and was later widely used to extract features in a series of visual tasks. Filtering can be divided into two categories, namely linear filtering and nonlinear filtering. Average filtering is a typical linear filter, while morphological operations belong to nonlinear filtering.
Inspired by the fact that average filtering removes noise effectively at a low computation cost, in this work average filters are used to generate the multi-scale filtering building index (MFBI) for the extraction of building areas. Compared with MBI, multi-scale and multi-direction linear morphological filtering is replaced by multi-scale average filtering, and the top-hat transformation in MBI is abandoned. In other words, with similar window size settings, all morphological operations are dropped to reduce the computation cost. Instead, the average filter is applied, which can also suppress noise in VHRRSI.
As Figure 4 shows, the proposed MFBI consists of the following steps.
Step 1. The generation of the brightness image I(x). In MFBI, the brightness image is generated from each pixel's maximal spectral value among the three optical bands, as expressed in (1):
$$I(x) = \max\{red(x),\ green(x),\ blue(x)\} \qquad (1)$$
Here, red, green and blue denote the red, green and blue bands of an image, respectively. The reason why we choose only the optical bands is that the visible bands have recently been reported to contribute significantly to the spectral property of building areas [55].
Step 2. The generation of filtering profiles. A series of average filters whose window sizes form an arithmetic sequence (the parameters are the initial window size $S_{min}$, the final window size $S_{max}$ and the step size $\Delta s$) is applied to the brightness image. It should be noted that these parameter settings are similar to those of MBI in VHRRSI [50]. Here, $FP_{avr}$ and $s$ denote the filtering profile of the average filters and the window size, respectively. With (x, y) a pixel of the brightness image I and i, j integer offsets within the window, the filtering profile is defined by Formulas (2) and (3).
Step 3. The generation of differential filtering profiles. After step 2, we obtain k − 1 differential images, where $k = (S_{max} - S_{min})/\Delta s + 1$. Let $DFP_{avr}$ denote the differential filtering profile of the average filters; it is expressed in Formula (4).
Step 4. The generation of MFBI. The k − 1 differential images from step 3 are averaged and normalized into [0, 1] to generate MFBI.
Step 5. Extraction of building areas. Similar to the extraction framework of MBI, after MFBI has been generated, building areas are extracted according to a series of rule sets. Since the original image and the MFBI feature image have the same size, let (x, y) denote a pixel of the MFBI feature image. It is then segmented by the rule defined in (6), where T denotes the threshold value for MFBI:
$$MFBI(x, y) > T \qquad (6)$$
Step 6. Post-processing framework. The segmented image is composed of a series of regions that could belong to building regions. Let NDVI and $T_{NDVI}$ denote the NDVI of the image and the NDVI threshold value, respectively. Meanwhile, let $l$, $Ratio$, $R_1$, $Area$ and $A_1$ denote such a region, its length-width ratio, the corresponding threshold of the length-width ratio, its area and the corresponding area threshold. The post-processing framework is composed of the series of operations given in the rule set (7). Note that the length-width ratio of each object is calculated via oriented bounding boxes so that objects at different orientations can be described more accurately.
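The formulas referenced in Steps 2 to 6 can be written out as follows. This is a hedged reconstruction from the definitions in the text rather than a verbatim copy of Formulas (2) to (7): it assumes an odd window size s centred on (x, y), absolute differences between neighbouring scales as in MBI, and inequality directions inferred from the stated purpose of each post-processing operation (removing vegetation, elongated objects and small objects).

```latex
\begin{align*}
FP_{avr}(x, y, s) &= \frac{1}{s^{2}} \sum_{i=-(s-1)/2}^{(s-1)/2} \; \sum_{j=-(s-1)/2}^{(s-1)/2} I(x+i,\, y+j),
  \qquad s \in \{S_{min},\, S_{min}+\Delta s,\, \dots,\, S_{max}\} \\
DFP_{avr}(x, y, s) &= \left| FP_{avr}(x, y, s+\Delta s) - FP_{avr}(x, y, s) \right|,
  \qquad s \in \{S_{min},\, \dots,\, S_{max}-\Delta s\} \\
MFBI(x, y) &= \frac{1}{k-1} \sum_{s} DFP_{avr}(x, y, s),
  \qquad k = \frac{S_{max}-S_{min}}{\Delta s} + 1 \\
\text{keep region } l \iff{}& NDVI(l) < T_{NDVI} \;\wedge\; Ratio(l) < R_{1} \;\wedge\; Area(l) > A_{1}
\end{align*}
```

The averaged index is subsequently rescaled to [0, 1], as stated in Step 4.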
A threshold value T is set to segment the pixels that possibly belong to building areas. Because the framework and operations of MFBI are similar to those of MBI, the threshold value of MFBI is similar to that of MBI, which has been carefully studied in References [34,51]. Pixels belonging to building areas usually have an MFBI between 0.4 and 0.6. The three operations in rule set (7) are strategies to remove false alarms (e.g., removal of vegetation, elimination of elongated roads), similar to the implementations in References [34,51,58]. It should be noted that after the NDVI thresholding, we first fill holes in the binary image and then apply the second and third operations in (7). Compared with former works [34,50], filling holes before region selection alleviates the problem that parts of a building roof covered by vegetation are excluded by the NDVI calculation.
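Steps 1 to 5 can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' OpenCV implementation: the mean filter uses an integral image with edge-replicated borders, window sizes are assumed odd, neighbouring scales are compared by absolute difference, and min-max scaling is assumed for the normalization. The function names and all parameter values passed by a caller are hypothetical.

```python
import numpy as np

def mean_filter(image, size):
    """Average filter with a size x size window (odd size assumed), edge-replicated borders."""
    r = size // 2
    padded = np.pad(np.asarray(image, dtype=np.float64), r, mode="edge")
    # Integral image with a leading row and column of zeros.
    integral = np.pad(padded, ((1, 0), (1, 0)), mode="constant").cumsum(0).cumsum(1)
    h, w = image.shape
    total = (integral[size:size + h, size:size + w]
             - integral[:h, size:size + w]
             - integral[size:size + h, :w]
             + integral[:h, :w])
    return total / (size * size)

def mfbi(rgb, s_min, s_max, step):
    """Multi-scale filtering building index of an (H, W, 3) array, scaled to [0, 1]."""
    brightness = np.asarray(rgb, dtype=np.float64).max(axis=2)       # Step 1
    sizes = range(s_min, s_max + 1, step)                            # k window sizes
    profile = [mean_filter(brightness, s) for s in sizes]            # Step 2
    diffs = [np.abs(b - a) for a, b in zip(profile, profile[1:])]    # Step 3: k - 1 images
    index = np.mean(diffs, axis=0)                                   # Step 4: average
    span = index.max() - index.min()
    if span == 0:
        return np.zeros_like(index)
    return (index - index.min()) / span                              # min-max scaling (assumed)

def segment(index, threshold):
    """Step 5: candidate building pixels are those with MFBI above the threshold T."""
    return index > threshold
```

Because each scale costs a fixed number of array operations regardless of the window size, the run time of this sketch grows linearly with the number of scales, which is consistent with the motivation for replacing the morphological operations.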

Joint Use of MFBI and Spectral Information
To fully and jointly utilize spectral and spatial information, two scenarios to extend MFBI to multiple channels (multi-channel MFBI, MMFBI) are proposed in this paper, as shown in Figure 5. As noted in Section 3.1, related work has pointed out that the visible bands contribute significantly to the spectral property of building areas [55].

Figure 5. Two proposed scenarios to fully utilize spectral information and MFBI.
To further discriminate the spectral information of building areas from that of other objects, principal component analysis (PCA) [59,60], one of the most commonly used methods to improve feature separability, is applied in these two scenarios.
Let z and x denote a lower-dimensional and a higher-dimensional matrix, respectively. PCA seeks a mapping w that expresses the relation between z and x.
The principal component $w_1$ is the direction along which the projected samples become the most distinctive, that is, the most spread out. The objective is therefore to find the $w_1$ that maximizes this spread, which can be regarded as a Lagrangian problem, as expressed in Formula (9).
With the utilization of PCA, the two scenarios are described in detail as follows.
Scenario 1. Principal component analysis (PCA) is applied to the three visible bands (red, green and blue) of VHRRSI. Then, the first component PC1(x) is used as the brightness image to generate the MFBI feature image, since much of the information on building areas has been transformed into PC1(x) after PCA.
Scenario 2. For each channel of the three-channel RGB image, an MFBI feature image is generated, so that a three-channel MFBI image is obtained. Then, principal component analysis is applied to this three-channel MFBI feature image. We then continue with steps 3 and 4 on the first component PC1(x) of this feature image. Similarly, it is reported in Reference [48] that after texture-derived feature extraction, the first component is selected since it contains much of the signal of building regions. In our experiments, the first component also contains much of the signal from building areas.
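The PCA step shared by both scenarios can be sketched as below. In the standard formulation, maximizing the projected variance $w_1^T \Sigma w_1$ subject to $w_1^T w_1 = 1$ leads to $\Sigma w_1 = \alpha w_1$, so $w_1$ is the eigenvector of the covariance matrix $\Sigma$ with the largest eigenvalue. The function name is hypothetical, NumPy is an assumed dependency, and the sign convention (flipping $w_1$ so that its entries sum to a positive value) is our own assumption, made so that bright objects stay bright in PC1.

```python
import numpy as np

def first_principal_component(image):
    """Project an (H, W, C) image onto its first principal component; returns (H, W)."""
    h, w, c = image.shape
    x = image.reshape(-1, c).astype(np.float64)   # one row per pixel (astype makes a copy)
    x -= x.mean(axis=0)                           # centre each band
    cov = np.cov(x, rowvar=False)                 # C x C covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    w1 = eigvecs[:, -1]                           # eigenvector of the largest eigenvalue
    if w1.sum() < 0:                              # eigenvector sign is arbitrary;
        w1 = -w1                                  # assumed convention: keep bright as bright
    return (x @ w1).reshape(h, w)
```

In Scenario 1 the input would be the RGB image and the output would take the place of the brightness image; in Scenario 2 the input would be the stack of three per-channel MFBI feature images.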

Dataset
MBI was proposed to extract individual buildings in HRRSI and is especially effective in Chinese urban regions. As an improvement of MBI, MFBI is also capable of building extraction.
However, as mentioned in Section 2.3, no open dataset for the building extraction task in HRRSI is currently available to compare these model-driven methods, while the aforementioned VHRRSI datasets sampled from Western countries are more appropriate for data-driven methods, since they do not contain the near infrared channel, which is important for many model-driven methods.
Hence, to fairly compare model-driven methods such as MBI and MFBI on VHRRSI, and to offer a benchmark for the performance of these algorithms on Chinese regions, an open dataset named Wuhan University Building Extraction Dataset (WHUBED) is introduced in this paper (download link: https://drive.google.com/open?id=1TfyNPSRSs8jMtbeSiP90SbGLW7fhjj6z).
The selection of samples for the dataset mainly takes the following aspects into consideration:
1. The inter-class similarity and intra-class dissimilarity in VHRRSI. It is widely acknowledged that with the increase of spatial resolution, both the similarity between different types of land cover and the dissimilarity within the same type of land cover have largely increased, resulting in a series of problems for the automatic interpretation of VHRRSI. Hence, to test the performance of a specific algorithm, the variety of building shapes, building sizes and building roofs must be considered when selecting samples for our dataset.
2. Land covers that are hard to distinguish in the building extraction task. In Reference [61], Mohsen concludes that one of the major challenges of building extraction in VHRRSI is the existence of shadow, vegetation, water regions and man-made non-building features. These types of land cover should also appear in the samples of our dataset to test the performance of an algorithm.
3. The coverage of as many typical Chinese landscapes in different regions as possible. It is known that different regions in China usually have different building structures because of factors such as economy, climate and population. Meanwhile, urban, suburban and rural areas should all be covered.
The data source of our dataset consists of 31 pan-sharpened VHRRSI scenes from seven provinces in China. These seven provinces are located in the eastern, north-western, southern and central regions of China. The sensors include QuickBird, Gaofen-2 and WorldView-2, with spatial resolutions of 0.6, 0.8 and 0.5 m, respectively. Based on the principles mentioned above, we carefully choose 57 image patches of 512 × 512 pixels to validate the performance of the proposed method. Figure 6 illustrates all the samples of our dataset. When choosing samples, effort has been made to present the complexity of building area landscapes and to include the challenging elements mentioned in Reference [61] as much as possible. After that, the ground truth is labeled by two experts who are not involved in our study.
Compared with the other two aforementioned VHRRSI building extraction datasets, our newly published dataset differs in several aspects. In terms of the data source, our dataset, taken entirely from very high-resolution satellite imagery, serves as a good complement to the other two aerial image datasets. In terms of the study region, our dataset represents the reality and complexity of building areas across China well, and can also be regarded as a good complement to these two datasets covering America and Europe. More importantly, our dataset offers the near infrared band from the satellite sensors and can thus be used to validate both model-driven and data-driven building extraction algorithms. Note that, as mentioned in Section 2.3, the near infrared information is important for many model-driven building extraction methods.

Parameter Settings
In all experiments described below, the parameter settings for MBI and MFBI are listed in Table 3. The most significant parameter for both MBI and MFBI is the segmentation threshold, which will be discussed later. The window sizes of the profiles also have a strong influence on the effectiveness of the extracted feature maps. For MBI, the window sizes are all set as in Reference [50], while MFBI needs a smaller maximum window size and a larger threshold value. For the object-oriented method, we use eCognition to extract building regions. To account for the multiple scales of building areas, a two-scale segmentation strategy, with scale parameters of 120 and 60 respectively, is used to segment the image, and the rule sets are the same as those of Huang in Reference [34]. As for the post-processing framework, since this paper focuses on the development of MFBI and MMFBI, we do not fine-tune its parameters on our dataset. Instead, for all test images, the NDVI threshold to remove false alarms caused by vegetation, the area threshold to remove small objects and the length-width ratio threshold to remove elongated roads are set to 0.1, 30 and 5.6, respectively, the same as in Reference [50]. It should be noted that, since these parameters are not fine-tuned on our own dataset, MFBI could well achieve a better accuracy on our dataset after fine-tuning the post-processing parameters.
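The three post-processing thresholds quoted above can be collected in a small sketch. It only illustrates how the values 0.1, 30 and 5.6 would act as filters; the function names are hypothetical, the area unit is assumed to be pixels, and whether the original comparisons are strict or inclusive is not stated in the text.

```python
def ndvi(nir, red):
    """Normalized difference vegetation index of one pixel; 0.0 when both bands are zero."""
    denom = nir + red
    return (nir - red) / denom if denom != 0 else 0.0

def passes_ndvi(nir, red, t_ndvi=0.1):
    """A pixel survives the vegetation filter when its NDVI is below the threshold."""
    return ndvi(nir, red) < t_ndvi

def keep_region(area, length_width_ratio, min_area=30, max_ratio=5.6):
    """A region survives when it is neither too small nor too elongated.

    area: region size (assumed to be in pixels).
    length_width_ratio: ratio measured on the oriented bounding box.
    """
    return area >= min_area and length_width_ratio <= max_ratio
```

For example, a 100-pixel region with a length-width ratio of 8 would be discarded as a likely road segment, while the same region with a ratio of 2 would be kept.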
All experiments are run on a personal computer with an i5-7500 CPU and 8 GB of RAM. All code is written in Visual Studio 2015 using the OpenCV 3.0 API.

Experiment on Computation Time
Three large-scale VHRRSIs are chosen to test the computation time of MBI and MFBI. The basic information of these images (i.e., image size, sensor type, spatial resolution) and the computation time are listed in Table 4. Figure 7 shows these three images and the corresponding MBI and MFBI feature maps. Note that some false alarms caused by vegetation have been removed from these feature maps by calculating NDVI. From Table 4, we can observe that the proposed MFBI requires much less computation time than MBI on large-scale satellite images. From Figure 7, in terms of visual effect, MFBI preserves the features of building areas better than MBI and with less noise in many regions of these feature maps, while MBI can cause cracks on building roofs. This can be explained by the fact that average filtering tends to generate homogeneous regions with similar gray values, whereas multi-scale and multi-direction morphological operations can exclude a few pixels on a building roof because of their different gray values.

Experiments on WHUBED
The developed MFBI is compared with MBI and the widely used object-oriented segmentation method on our newly published WHUBED. In this section, we compare them in terms of both visual effect and quantitative analysis.
In Figure 8, the extraction results and the corresponding ground truth maps of four samples from different landscapes are shown for visual comparison. The first row contains the original images, the second row the corresponding ground truth maps, and the third to the seventh rows the extraction results of the object-based segmentation method, MBI, MFBI, the first scenario of MMFBI and the second scenario of MMFBI, respectively. Although these four samples look simple at first glance, obtaining a relatively high accuracy on them is challenging, mainly because of the following characteristics.
Sample 1: Several informal settlements are located in the upper area, and several buildings with dark roofs are located on the right of the lower area. Note that two weaknesses of MBI are its inability to extract informal settlements and dark built-up regions [34]. Meanwhile, on the left of the lower region, the wide road and the boat on the river can easily introduce false alarms.
Sample 2: The difficulty lies in the imaging condition and the land covers on the river bank. Under this unsatisfactory imaging condition, ground objects in the image are somewhat obscure. The bright, wide roads and other man-made objects can easily cause false alarms.
Sample 3: The difficulty lies in the low intensity of the image and the dark building roofs of the informal settlements. As mentioned for Sample 1, these two problems are challenging for MBI. Since the elongated road can be easily eliminated by the morphological operations, it should not be considered as difficult as some earlier researchers suggest.
Sample 4: Building areas are relatively small in size and irregular in geometry, which tends to cause omission errors. Meanwhile, the texture of the farmland and the road makes it easier to introduce false alarms.
In terms of visual effect, informal settlements and small buildings in these samples are extracted more effectively by our proposed MFBI, and the false alarms introduced by roads or other small man-made objects are relatively fewer than with the other methods. These better-performing regions are marked by red bounding boxes in Figure 8.
For accuracy assessment, we use recall, precision and F1-score, which are commonly used for building extraction tasks, to evaluate the building extraction results [45,62]. Usually, recall reflects an algorithm's ability to find true positives, while precision reflects an algorithm's cost to find true positives. In addition, the F1-score measures precision and recall jointly.
$$\mathrm{Recall} = \frac{tp}{tp + fn}, \qquad \mathrm{Precision} = \frac{tp}{tp + fp}, \qquad F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
where tp, fp, tn, and fn denote true positive, false positive, true negative and false negative, respectively.
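As a minimal sketch of how these three measures can be computed at pixel level, the following example counts tp, fp, tn and fn from a predicted mask and a ground truth mask. The function names and the toy masks are hypothetical, and returning 0 when a denominator is zero is a convention assumed here, not one stated in the paper.

```python
def confusion_counts(pred, truth):
    """Count (tp, fp, tn, fn) over two same-shaped binary masks."""
    tp = fp = tn = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def precision_recall_f1(pred, truth):
    """Return (precision, recall, f1); a zero denominator yields 0.0."""
    tp, fp, _, fn = confusion_counts(pred, truth)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy 2 x 3 masks (1 = building pixel), made up for illustration.
truth = [[1, 1, 0],
         [0, 1, 0]]
pred = [[1, 0, 0],
        [1, 1, 0]]

print(confusion_counts(pred, truth))     # (2, 1, 2, 1)
print(precision_recall_f1(pred, truth))  # each value is 2/3 for these masks
```

Note that tn does not enter any of the three measures, so large non-building backgrounds do not inflate the scores.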
The accuracy of the three building extraction methods on every sample in our dataset is listed in Table 5. In addition, the mean and standard deviation of these three indices over all 57 samples in WHUBED are listed in Table 6. From these results, some important observations can be made:

1. For most of the samples in our dataset, the proposed MFBI achieves the highest F1-score and recall, while MBI tends to have higher precision (see Table 5). From Table 6, both MBI and the proposed MFBI outperform the widely used object-oriented method.

2. In terms of performance on different types of samples, all three methods, namely object-based segmentation, MBI and MFBI, tend to perform better in urban or suburban areas. This can be explained by the fact that in these regions many buildings have regular shapes, while roads, one of the main sources of false alarms, can be easily removed in post-processing by calculating the shape index and length-width ratio. However, although MFBI performs better than the other two methods in rural areas, none of these three methods is robust enough there, since farmland with a regular geometric shape and large areas of bright bare soil can cause severe false alarms.

3. The proposed MFBI tends to have relatively high recall, and its precision is lower than its recall. The high recall and relatively high false alarm rate of MFBI might be explained by the use of rectangular filtering windows. Rectangular filters tend to stress the influence of bright pixels belonging to building areas, but pixels around the building areas can also have a high gray value after MFBI is calculated. In other words, the rectangular filtering window might not fully utilize the abundant spatial information of building areas in VHRRSI.

Sensitivity Analysis
As mentioned in Section 3.1, because MFBI and MBI share a similar framework and similar operations, their threshold values tend to be similar. While the influence of MBI's segmentation threshold value T has been carefully studied in References [34,50], in this section we illustrate the influence of MFBI's segmentation threshold value T on the results of building extraction.
Different lighting conditions and landscapes are taken into account for the illustration. Four samples, covering different lighting conditions (i.e., bright, moderate and dark) and different landscapes (i.e., urban, suburban and rural regions), are shown in Figure 9. The first and second columns of Figure 9 show the samples and the corresponding MFBI feature maps. The third column of Figure 9 shows the relation between the MFBI threshold and the corresponding recall, precision and F-score. The observations from these figures are in accordance with the conclusions in References [34,50].

1. As the MFBI threshold increases, the F-score tends to increase first and then decrease, while the recall tends to decrease and the precision tends to increase. This trend fits the general pattern of the recall and precision curves given in References [34,50]. A small threshold usually selects a relatively large number of samples. Although this gives a high recall, these samples include a large number of false positives, leading to relatively low precision. On the contrary, when the threshold is set high, the algorithm selects a relatively small number of samples with relatively high precision, while some true positives are missed.

2. When the threshold is set between 0.4 and 0.6, the proposed method usually achieves its best performance, with a high recall value and a modest precision value, whether in urban, suburban or rural regions. This can be explained by the fact that, after MFBI feature extraction, pixels belonging to building areas often have an MFBI value of about 0.4 to 0.6. A similar pattern has also been reported in Reference [34].
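The threshold behaviour described in the two observations above can be reproduced on a toy example. The sketch below sweeps a threshold T over a normalized feature map and reports precision, recall and F-score at each value. The feature values, the ground truth and the rule "a pixel is labeled as building when its value is at least T" are assumptions for illustration and are not taken from WHUBED.

```python
def sweep_thresholds(feature, truth, thresholds):
    """Return a list of (threshold, precision, recall, f_score) tuples."""
    results = []
    for t in thresholds:
        tp = fp = fn = 0
        for feature_row, truth_row in zip(feature, truth):
            for value, is_building in zip(feature_row, truth_row):
                predicted = value >= t  # assumed decision rule
                if predicted and is_building:
                    tp += 1
                elif predicted:
                    fp += 1
                elif is_building:
                    fn += 1
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f_score = 2 * precision * recall / denom if denom else 0.0
        results.append((t, precision, recall, f_score))
    return results


# Hypothetical normalized feature values and ground truth (1 = building).
feature = [[0.9, 0.7, 0.5, 0.3],
           [0.8, 0.6, 0.45, 0.1]]
truth = [[1, 1, 0, 0],
         [1, 1, 1, 0]]

results = sweep_thresholds(feature, truth, [0.2, 0.4, 0.6, 0.8])
for t, precision, recall, f_score in results:
    print(f"T={t:.1f}  P={precision:.3f}  R={recall:.3f}  F={f_score:.3f}")

best_threshold = max(results, key=lambda item: item[3])[0]
```

Recall can never increase as T rises, because raising T only removes predicted pixels. Precision usually rises but is not guaranteed to be monotonic on real data. In this toy example the F-score peaks at T = 0.4 and falls on either side.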

Experiments on Two Proposed Scenarios
Before discussing the extraction results of MMFBI, we first present and discuss the feature maps of the two scenarios used to generate MMFBI. Figure 10 shows the results after PCA in the first and the second scenario. As mentioned in Section 3.2, we choose the first component of these results (shown in the second and third columns of Figure 10) to obtain the MMFBI feature maps for building extraction. From these results we can observe that:

1. The implementation of PCA can help extract building features. In the first scenario, PCA is applied to the original images from our dataset. Much of the information from building pixels is enhanced (see the second column of Figure 10), and these homogeneous regions are salient in the first component.

2. In the second scenario, after the calculation of MFBI in each channel and the PCA transformation, the MMFBI feature map enhances building areas better than the MMFBI feature map of the first scenario, which simply applies PCA to the optical image. This can be explained by the fact that calculating MFBI on each channel selects pixels that could belong to building areas, and the subsequent PCA refines these selected pixels. For example, some pixels belonging to vegetation are selected in the red channel but not in the green and blue channels, and PCA can exclude these pixels from the feature map.
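To make the role of the first principal component concrete, the sketch below projects per-pixel three-channel vectors onto their first principal axis. It is not the implementation used in the paper: power iteration is swapped in here for a full eigendecomposition to keep the example dependency-free, the function name is hypothetical, and the input vectors are made-up per-channel index values.

```python
import math


def first_principal_component(pixels, iterations=200):
    """Return (axis, scores) for the first principal component.

    pixels: list of equal-length channel vectors, one per pixel.
    Power iteration on the sample covariance matrix is used. It starts from
    a uniform vector, so it assumes the leading axis is not orthogonal to it.
    """
    n, d = len(pixels), len(pixels[0])
    mean = [sum(p[k] for p in pixels) / n for k in range(d)]
    centered = [[p[k] - mean[k] for k in range(d)] for p in pixels]
    cov = [[sum(row[i] * row[j] for row in centered) / (n - 1)
            for j in range(d)] for i in range(d)]
    axis = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * axis[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0.0:
            break  # degenerate case: no variance along the current axis
        axis = [v / norm for v in nxt]
    scores = [sum(row[k] * axis[k] for k in range(d)) for row in centered]
    return axis, scores


# Hypothetical per-channel (red, green, blue) index values for seven pixels.
pixels = [
    (0.90, 0.85, 0.90),  # building-like: high in all three channels
    (0.80, 0.90, 0.85),
    (0.85, 0.80, 0.90),
    (0.10, 0.10, 0.10),  # background: low in all three channels
    (0.15, 0.10, 0.05),
    (0.10, 0.15, 0.10),
    (0.80, 0.10, 0.10),  # vegetation-like: high in the red channel only
]

axis, scores = first_principal_component(pixels)
building_scores, vegetation_score = scores[:3], scores[6]
```

In this toy data the pixel that responds in a single channel receives a lower first-component score than the pixels that respond in all three channels, so a threshold on the first component can separate them.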
The two scenarios proposed in Section 3.2 are tested on our dataset, and their performance on each sample is listed in Table 5. The mean and standard deviation of these three indices over all 57 samples in WHUBED are listed in Table 6. From these results, we can observe that:

1. The feature extraction ability of the two proposed scenarios is better than that of the basic MFBI, especially in terms of precision and F-score. This result is reasonable, since information that contributes greatly to building areas and other man-made objects in the red, green and blue channels is all taken into account, and the signal of some false alarms from a single channel can be suppressed. This phenomenon is clearly demonstrated in Figures 8 and 11. From the sixth and seventh rows of Figure 8 and the first column of Figure 11, much noise, mainly from wide roads, can be observed in the MFBI feature map, while in the second and third columns of Figure 11 there is much less noise in the MMFBI feature maps.

2. The first scenario can make the feature map more compact, since the first component of the three-channel optical image contains more information on building structures while suppressing much information from other types of land cover. This phenomenon is clearly demonstrated in the second row of Figure 11.

3. The second scenario can improve the accuracy mainly because MFBI is calculated on the three channels separately and the PCA transformation is applied afterwards. Calculating MFBI on each channel makes full use of the information that can present the signal of building areas, and the PCA transformation on this image can refine the result by eliminating some pixels belonging to other land cover types such as road or bare soil. Some roads that are mixed with building areas in the feature maps of MFBI and the first scenario of MMFBI are removed in the feature map of the second scenario of MMFBI, as is evident in Figure 11c,f,i. Meanwhile, as mentioned in Section 3.2, after PCA the first component contains much signal from building areas, while the second component contains much information from other land covers such as roads and bare soil. However, simply using the second component also risks excluding some building areas whose material is similar to that of roads. It should be emphasized that in pixel-level building detection, one of the major differences between HRRSI and VHRRSI is that roads are much wider in VHRRSI and are more difficult to eliminate.

Conclusions
In this paper, a multi-scale filtering building index (MFBI) is proposed with the objective of avoiding complex morphological operations and using basic average filters instead. After a detailed study of current datasets for building extraction, and in the hope of offering a VHRRSI dataset for model-driven methods, we introduce our newly published dataset WHUBED and use it as a benchmark to compare our proposed method with the widely used object-oriented method and MBI. Experiments demonstrate that the proposed MFBI can generate building feature maps much faster than MBI and can outperform the other two methods in terms of accuracy. To fully utilize the spectral information that contributes to urban regions in VHRRSI, two scenarios to extend MFBI to multiple channels (MMFBI) are studied. Related experiments demonstrate that these two scenarios can reduce false alarms in MFBI and can therefore achieve higher accuracy.
However, the proposed MFBI has some weaknesses: (1) It does not fully utilize spatial information, especially multi-direction structural information, and can introduce artefacts. (2) The brightness image might not truly present building features when a sensor is too sensitive at many pixels in a particular channel.
Future work includes a systematic implementation of MFBI's post-processing framework and the use of more directional and structural information in MFBI.

Figure 1. Examples of more noise and wider roads in very high resolution remote sensing imagery (VHRRSI) than in high resolution remote sensing imagery (HRRSI). (a,c): The same region in VHRRSI and HRRSI, respectively. Some salient noise is marked by yellow bounding boxes; (b,d): The same region in VHRRSI and HRRSI, respectively. Some roads are marked by red bounding boxes.

Figure 3. Some examples of Chinese and Western countries' landscapes.

Figure 4. Flowchart of the proposed multi-scale filtering building index (MFBI).

Figure 5. Two proposed scenarios to fully utilize spectral information and MFBI.

Figure 7. (a-c): the original images of Image 1, Image 2, and Image 3, respectively; (d-f): the MBI feature maps of Image 1, Image 2, and Image 3, respectively; (g-i): the MFBI feature maps of Image 1, Image 2, and Image 3, respectively; (j-l): examples of the extracted building feature shape of MBI (left) and MFBI (right) in Image 1, Image 2 and Image 3, respectively.

Figure 8. Performance on four samples in WHUBED. The first row and the second row are the original images and the corresponding ground truth maps. The third, the fourth, the fifth, the sixth and the seventh rows are the extraction results of the object-based segmentation method, MBI, MFBI, the first scenario of MMFBI and the second scenario of MMFBI, respectively.

Figure 9. Performance on different samples in WHUBED. The first column: Five samples in WHUBED; the second column: Corresponding MFBI feature maps. Note that some false alarms caused by vegetation have been removed by the normalized differential vegetation index (NDVI). The third column: The relationship between the threshold value of MFBI and precision, recall and F1-score.

Figure 10. Samples and the corresponding results after principal component analysis (PCA) in Scenario 1 and Scenario 2. The first column: Five samples in WHUBED; the second and the third columns: Corresponding results from the first and the second scenario after the PCA step when generating MMFBI.

Figure 11. Feature map comparison of MFBI, MMFBI Scenario 1 and MMFBI Scenario 2. (a-c): A feature map example of MFBI, MMFBI Scenario 1 and Scenario 2 in Image 1. (d-f): A feature map example of MFBI, MMFBI Scenario 1 and Scenario 2 in Image 2. (g-i): A feature map example of MFBI, MMFBI Scenario 1 and Scenario 2 in Image 3.
* Three other bands in WorldView2 imagery are abandoned.

Table 2. Two available data-driven building extraction datasets for VHRRSI.

Table 3. Parameter settings of MBI and MFBI.

Table 4. Computation time of MBI and MFBI on three large-scale images.

Table 5. Accuracy of the proposed MFBI and two scenarios of multi-channel MFBI (MMFBI) compared with MBI and eCognition on WHUBED (in percentage). Note that R, P and F denote recall, precision and F1-score, respectively. eC, S1 and S2 denote the multi-scale segmentation-based method run in eCognition, Scenario 1 of MMFBI and Scenario 2 of MMFBI, respectively. The best recall, precision and F1-score on each sample are marked in bold.

Table 6. Mean and standard deviation of several methods compared on WHUBED (in percentage).