Buildings are the places where human beings live, work, and recreate [1]. The distribution of buildings is useful in many applications such as disaster assessment, urban planning, and environmental monitoring [2], and the precise location of buildings can also help municipalities in their efforts to better assist and protect their citizens [4]. Therefore, it is very important to accurately detect buildings. With the development of sensor technology, High spatial Resolution/Very High spatial Resolution (VHR) remote sensing images with multispectral channels can be acquired. In the context of this paper, images with a spatial resolution finer than one meter in the panchromatic channel are referred to as VHR imagery, and images with a spatial resolution between one and ten meters in the panchromatic channel are referred to as High Resolution imagery [5]. Since these High Resolution/VHR remote sensing images contain a large amount of spectral, structural, and textural information, they offer more potential for accurate building detection. However, manually extracting buildings from these images demands sustained human effort and attention, and it is impractical at regional or global scales. Therefore, it is necessary to develop methods that can automatically or semi-automatically detect buildings from High Resolution/VHR remote sensing images. In the past decades, a large number of studies in this area have been conducted. Depending on whether auxiliary information is used, the methods developed can be divided into two categories. The first category uses monocular remote sensing images to detect buildings, and the second category combines remote sensing images with auxiliary data such as height information. Several review articles can be found in [6]. Among them, Unsalan and Boyer [7] extended the work in [6] by comparing and analyzing the performance of different methods proposed until late 2003. Baltsavias [8] provided a review of different knowledge-based object extraction methods. Haala and Kada [9] discussed previous works on building reconstruction from a method and data perspective. More recently, Cheng and Han [10] systematically analyzed the existing methods devoted to object detection from optical remote sensing images. Since this study is dedicated to detecting buildings from a single VHR remote sensing image, our discussion of previous studies will focus on this area.
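The resolution categories defined above can be written as a small helper. This is only an illustrative sketch: the function name `resolution_category` and the label "other" are hypothetical, and assigning exactly one meter to the High Resolution class is an assumption, since the text does not say which class that boundary belongs to.

```python
def resolution_category(gsd_m):
    """Map a panchromatic ground sample distance (meters per pixel) to a category.

    Assumption: a pixel size of exactly 1 m is treated as High Resolution;
    the text leaves that boundary case unspecified.
    """
    if gsd_m <= 0:
        raise ValueError("ground sample distance must be positive")
    if gsd_m < 1.0:
        return "VHR"              # finer than one meter
    if gsd_m < 10.0:
        return "High Resolution"  # between one and ten meters
    return "other"                # coarser imagery, outside the paper's scope


categories = [resolution_category(g) for g in (0.5, 4.0, 30.0)]
```

Smaller pixel sizes mean finer detail, which is why the sub-meter class is the "very high" one.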
The development of low-orbit earth imaging technology has made available VHR remote sensing images with multispectral bands. In order to make full use of this spectral and spatial information, a large number of studies have used classification methods to detect buildings. For example, Lee et al. [11] combined supervised classification, the iterative self-organizing data analysis technique algorithm (ISODATA), and the Hough transformation to automatically detect buildings from IKONOS images. In their study, the classification process was designed to obtain the approximate locations and shapes of candidate building objects, and ISODATA segmentation followed by the Hough transformation was then performed to accurately extract building boundaries. Later, Inglada [12] used a large number of geometric features to characterize the man-made objects in high resolution remote sensing images and then combined them with support vector machine classification to extract buildings. In a different study, Senaras et al. [13] proposed a decision fusion method based on a two-layer hierarchical ensemble learning architecture to detect buildings. This method first extracted fundamental features such as color, texture, and shape features from the input image to train individual base-layer classifiers, and then fused the outputs of the multiple base-layer classifiers with a meta-layer classifier to detect buildings. More recently, a new method based on a modified patch-based Convolutional Neural Network (CNN) architecture has been proposed for automatic building extraction [14]. This method did not require any pre-processing operations, and it replaced the fully connected layers of the CNN model with global average pooling. In summary, although these classification methods are effective for building extraction, they require a large volume of training samples, the collection of which is laborious and time-consuming.
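To make the last architectural point concrete, the sketch below shows what global average pooling computes: each feature map is reduced to its mean, giving one value per channel without any learned fully connected weights. It is not the network of [14], whose layer sizes are not given here; the function name and the tiny 2x2 feature maps are hypothetical.

```python
def global_average_pool(feature_maps):
    """Reduce each 2-D feature map (one per channel) to its mean value."""
    pooled = []
    for fmap in feature_maps:
        values = [v for row in fmap for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


# Two hypothetical 2x2 feature maps, one per output channel.
maps = [
    [[1.0, 3.0], [5.0, 7.0]],
    [[0.0, 0.0], [2.0, 2.0]],
]
scores = global_average_pool(maps)  # [4.0, 1.0]
```

Because the pooled vector has one entry per channel regardless of the spatial size of the maps, this step adds no trainable parameters.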
Graph theory, as an important branch of mathematics, has also been used for building detection. For example, Unsalan and Boyer [7] developed a system to extract buildings and streets from satellite images using graph theory. In their work, four linear structuring elements were used to construct binary balloons, and these balloons were then represented in a graph framework to detect buildings and streets. However, due to the assumptions involved in the detection process, this method is only applicable to the building types found in North America. Later, Sirmacek and Unsalan [15] combined the scale invariant feature transform (SIFT) with graph theory to extract buildings, where the vertices of the graph were represented by the SIFT key points. They validated this method on 28 IKONOS images and obtained promising results with a building detection accuracy of 88.4%. However, this method can only detect buildings that correspond to preset templates and are spatially isolated. In a different work, Ok et al. [16] developed a novel approach for automatic building detection based on fuzzy logic and the GrabCut algorithm. In their work, the directional spatial relationship between buildings and their shadows was first modeled to generate fuzzy landscapes, and the buildings were then detected based on the fuzzy landscapes and shadow evidence using the GrabCut partitioning algorithm. Nevertheless, the performance of this method is limited by the accuracy of shadow extraction. Later, Ok [17] extended this work by introducing a new shadow detection method and a two-level graph partitioning framework to detect buildings more accurately. However, buildings whose shadows are not visible cannot be detected by this method.
Other studies have used active contour models to detect buildings. For example, Peng and Liu [18] proposed a new building detection method using a modified snake model combined with radiometric features and contextual information. Nevertheless, this method cannot effectively extract buildings in complex image scenes. In a different work, Ahmadi et al. [19] proposed a new active contour model based on a level set formulation to extract building boundaries. An experiment conducted on an aerial image showed that this model can achieve a completeness ratio of 80%. However, this model fails to extract buildings whose radiometric values are similar to those of the background. More recently, Liasis and Stavrou [20] used the HSV color components of the input image to modify the traditional active contour segmentation model to detect buildings. However, when this method is applied to high-density urban environments, some non-building objects such as roads and bridges are also incorrectly labeled as buildings.
In recent years, a number of feature indices that can predict the presence of buildings have also been proposed. For example, Pesaresi et al. [21] developed a novel texture-derived built-up presence index (PanTex) for automatic building detection based on a fuzzy composition of anisotropic textural co-occurrence measures. The construction of the PanTex was based on the fact that there is a high local contrast between buildings and their surrounding shadows. Therefore, they used the contrast textural measures derived from the gray-level co-occurrence matrix to calculate the PanTex. Later, Lhomme et al. [22] proposed a semi-automatic building detection method using a new feature index called “Discrimination by Ratio of Variance” (DRV). The DRV was defined based on the gray-level variations of the building’s body and its periphery. More recently, Huang and Zhang [23] proposed the morphological building index (MBI) to automatically detect buildings from GeoEye-1 images. The fundamental principle of the MBI was to represent the intrinsic spectral-structural properties of buildings (e.g., brightness, contrast, and size) using a set of morphological operations (e.g., top-hat by reconstruction, directionality, and granulometry). Furthermore, some improved versions of the original MBI, aiming at reducing the commission and omission errors in urban areas, have also been proposed [24]. The original MBI and its improved versions are effective for the detection of buildings in urban areas, but they fail to detect buildings in non-urban areas (e.g., mountainous, agricultural, and rural areas), where many irrelevant objects such as farmland, bright barren land, and impervious roads interfere heavily with the detection of buildings. To solve this problem, a postprocessing framework for the MBI algorithm was proposed in [26] to extend the detection of buildings to non-urban areas by additionally considering the geometrical, spectral, and contextual information of the input image. However, this method is limited by how accurately this additional information can be extracted.
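As a rough illustration of the contrast-based idea behind the PanTex, the sketch below computes the co-occurrence contrast of a small window for several displacement directions and keeps the minimum. It relies on the fact that the contrast of a normalized co-occurrence matrix equals the mean squared gray-level difference over all pixel pairs at the given displacement. The names `glcm_contrast` and `pantex_like`, the four displacement vectors, and the use of the minimum as the fuzzy composition operator are assumptions made for this example; the window size and displacement set of the published index are not stated in this text.

```python
def glcm_contrast(window, dr, dc):
    """Co-occurrence contrast for displacement (dr, dc).

    Equal to the mean squared gray-level difference over all pixel pairs
    separated by (dr, dc) that fit inside the window.
    """
    rows, cols = len(window), len(window[0])
    total, pairs = 0, 0
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                total += (window[r][c] - window[r2][c2]) ** 2
                pairs += 1
    return total / pairs if pairs else 0.0


def pantex_like(window, displacements=((0, 1), (1, 0), (1, 1), (1, -1))):
    """Minimum contrast over directions (assumed composition rule)."""
    return min(glcm_contrast(window, dr, dc) for dr, dc in displacements)


# A bright block on a dark background: contrast in every direction.
block = [[0, 0, 0, 0],
         [0, 9, 9, 0],
         [0, 9, 9, 0],
         [0, 0, 0, 0]]
# Vertical stripes (road-like): no contrast along the stripe direction.
stripes = [[0, 9, 0, 9] for _ in range(4)]

block_score = pantex_like(block)      # 27.0
stripes_score = pantex_like(stripes)  # 0.0
```

Taking the minimum over directions means that a linear structure, which has no contrast along its own axis, scores low, while a compact bright object surrounded by darker pixels scores high.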
In this study, a new building detection method based on the MBI algorithm is proposed to detect buildings from VHR remote sensing images captured in complex environments. The proposed method addresses the problem that many irrelevant objects with spectral characteristics similar to those of buildings interfere heavily with building detection. Specifically, the proposed method first extracts built-up areas from the VHR remote sensing image, and then detects buildings within the extracted built-up areas. For the extraction of built-up areas (first step), the spatial voting method [27] based on local feature points is used. The term “local feature point” is defined as a small point of interest that is distinct from the background [28]. In the literature, various local feature point detectors have been used to extract built-up areas, such as the Gabor-based detector [27], the SIFT-based detector [15], the Harris-based detectors [29], and the FAST-based detector [31]. However, these detectors share a common problem when used for built-up area extraction. Because they are mainly designed to detect local feature points over areas with complex textures or salient edges, they detect local feature points not only in built-up areas but also in non-built-up areas. The local feature points in non-built-up areas (referred to as false local feature points in this study) reduce the extraction accuracy of built-up areas, so a method that can effectively eliminate them is needed. To this end, a saliency index is proposed in this study, which is constructed from the density and the distribution evenness of the local feature points in a local circular window. In addition, we adopt the idea of superpixel-based voting in [32] to improve the original spatial voting method [27]. Through these processes, we can extract the built-up areas more accurately. For the detection of buildings (second step), the original MBI algorithm is susceptible to heavy interference from irrelevant objects (e.g., bright barren land, farmland, and impervious roads) in non-built-up areas, so it performs poorly when detecting buildings in non-urban areas, such as mountainous, agricultural, and rural areas. To solve this problem, we propose applying the MBI algorithm only within the built-up areas extracted in the first step, which directly eliminates much of the interference caused by irrelevant objects in non-built-up areas. In addition, to further eliminate some errors in built-up areas, we also build a rule based on shadow, spectral, and geometric information for the postprocessing of the initial building detection results. Through these processes, our proposed method can effectively detect buildings in images captured in complex environments.
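The intuition behind combining density and distribution evenness can be sketched as follows. This is not the saliency index defined later in the paper: the formula here is a hypothetical stand-in that multiplies the point density in a circular window by a normalized entropy of the point counts over angular sectors. The function name, the number of sectors, and the product form are all assumptions chosen for illustration.

```python
import math


def saliency_sketch(center, points, radius, n_sectors=8):
    """Hypothetical score: density times evenness of points in a circular window.

    Evenness is the Shannon entropy of the per-sector point counts divided by
    log(n_sectors), so it lies in [0, 1]. Requires n_sectors >= 2.
    """
    cx, cy = center
    counts = [0] * n_sectors
    for x, y in points:
        dx, dy = x - cx, y - cy
        if dx * dx + dy * dy > radius * radius:
            continue  # point lies outside the circular window
        angle = math.atan2(dy, dx) % (2 * math.pi)
        sector = min(int(angle / (2 * math.pi) * n_sectors), n_sectors - 1)
        counts[sector] += 1
    total = sum(counts)
    if total == 0:
        return 0.0
    density = total / (math.pi * radius ** 2)
    entropy = sum((c / total) * math.log(total / c) for c in counts if c)
    evenness = entropy / math.log(n_sectors)
    return density * evenness


# Points spread around the center, as expected inside a built-up area.
spread = [(3, 1), (-1, 3), (-3, -1), (1, -3)]
# Points lined up on one side, as along a field boundary or road edge.
clustered = [(1, 0.1), (2, 0.1), (3, 0.1), (4, 0.1)]

spread_score = saliency_sketch((0, 0), spread, 5.0)
clustered_score = saliency_sketch((0, 0), clustered, 5.0)
```

Under these assumptions, the same number of feature points yields a higher score when the points surround the window center than when they fall in a single direction, which is the behavior one would want for suppressing false local feature points along isolated edges.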
The remainder of this paper is arranged as follows: Section 2 provides a detailed description of the proposed method; Section 3 analyzes and compares the experimental results; Section 4 presents the discussion; and Section 5 provides the conclusion.