An Improved Boosting Learning Saliency Method for Built-Up Areas Extraction in Sentinel-2 Images

Built-up areas extraction from satellite images is an important aspect of urban planning and land use; however, this remains a challenging task when using optical satellite images. Existing methods may be limited because of the complex background. In this paper, an improved boosting learning saliency method for built-up area extraction from Sentinel-2 images is proposed. First, the optimal band combination for extracting such areas from Sentinel-2 data is determined; then, a coarse saliency map is generated, based on multiple cues and the geodesic weighted Bayesian (GWB) model, that provides training samples for a strong model; a refined saliency map is subsequently obtained using the strong model. Furthermore, cuboid cellular automata (CCA) is used to integrate multiscale saliency maps for improving the refined saliency map. Then, coarse and refined saliency maps are synthesized to create a final saliency map. Finally, the fractional-order Darwinian particle swarm optimization algorithm (FODPSO) is employed to extract the built-up areas from the final saliency result. Cities in five different types of ecosystems in China (desert, coastal, riverside, valley, and plain) are used to evaluate the proposed method. Analyses of results and comparative analyses with other methods suggest that the proposed method is robust, with good accuracy.


Introduction
Population density and resource utilization intensity tend to be very high in built-up areas. Rapid urbanization has resulted in several problems, including the urban heat island effect, air pollution, and unreasonable land use. Therefore, extracting built-up areas is a major topic of interest across numerous fields, including sustainability, remote sensing, and the social sciences. To efficiently distribute information regarding built-up areas to various research disciplines, remote sensing technology is widely used to extract and monitor these areas. The term "built-up areas" is widely used in the literature, and refers to the spatial extent of urbanized areas on a regional scale, but this is a nebulous and inconsistent definition [1]. For the purpose of this study, built-up areas are defined as areas dominated by buildings, streets, and impervious surfaces; golf courses, green urban parks, sparse buildings in suburbs, and rural settlements are not included within this working definition.
Over the last few decades, many methods for extracting built-up areas have been proposed [1][2][3][4][5][6][7][8][9][10][11][12][13][14][15]. These methods can be broadly categorized into four groups: classification-based, index-based, texture-based, and multisensor-based methods. Classification-based methods [1][2][3] primarily consider how suitable classifiers may be used to extract built-up areas. However, these methods are fraught with challenges and limitations when applied at regional and global scales; for example, scene-to-scene data analyses are subjective, while the overall process is time-consuming, and entails complicated computing [4]. Index-based methods [4][5][6][7][8][9] are designed based on spectral bands within which built-up areas exhibit their highest and lowest reflectance values amidst a multispectral dataset. Usually, these methods are unsuccessful in addressing the difficulties in distinguishing between built-up areas and other land cover types. Texture-based methods [10,11] can extract built-up areas based on high texture granularity. However, such methods may fail when ground objects with similar texture features to the built-up areas are encountered. Multisensor-based methods [1,[12][13][14] combine the various characteristics of multiple sensors to extract built-up areas. However, owing to difficulties associated with synthesizing different data types, such methods have not been widely used [15].
Saliency detection provides a unique perspective for ground object extraction, because it selects only interesting information related to the current behavior or task to be processed, ignoring irrelevant information [16]. Saliency detection methods have gradually been introduced into the field of remote sensing in recent years to detect specific objects [17][18][19] in very high or high spatial resolution images, and they have proven to be effective. The Sentinel-2 satellites carry a multispectral instrument with 13 spectral bands, and this richer spectrum enables superior detection and extraction of built-up areas. Owing to their unique spectral characteristics, built-up areas can be highlighted in some band combinations, so they can easily be identified by saliency detection methods. Saliency detection methods can be broadly categorized as bottom-up and top-down methods. Numerous bottom-up saliency detection methods have been proposed, and these can be divided into four groups: those based on contrast; graph theory and information theory; prior knowledge; and low-rank matrix recovery theory.
Contrast-based methods consist of local contrast and global contrast methods. The former is used to investigate the rarity of image regions with respect to nearby neighborhoods; the classic saliency method proposed by Itti et al. [20] is a typical local contrast method. Since its proposal, several approaches have also adopted the center-surround contrast strategy to calculate saliency, including the graph-based visual saliency detection method [21], the fuzzy growing method [22], and the discriminant center-surround hypothesis model [23,24]. The global contrast methods calculate saliency using the contrast of a pixel or region with respect to the entire image. Cheng et al. [25] proposed a regional contrast-based saliency extraction algorithm that simultaneously evaluates global contrast differences and spatial coherence. Perazzi et al. [26] developed a contrast-based filtering method, and Shi et al. [27] proposed a generic and fast computational framework called pixelwise image saliency aggregating (PISA), which uses prior spatial information to weight color contrast and direction contrast to generate a final saliency map.
Graph theory-based methods first use a graph model to represent an image, and then apply an established undirected or directed graph to predict the saliency value of each region. Gopalakrishnan et al. [28] performed random walks on graphs to detect salient objects. Wei et al. [29] proposed a model based on the geodesic method. Jiang et al. [30] proposed a saliency detection method using absorbing Markov chains. Yan et al. [31] applied a hierarchical model to analyze saliency cues from multiple levels of structure that were then integrated to infer a final saliency map. Qin et al. [32] introduced cellular automata to intuitively detect salient objects based on a dynamic evolution model. Among information theory-based models, Bruce and Tsotsos [33] used Shannon's self-information to measure saliency, and proposed a visual attention model based on information maximization. Zhang et al. [34] used the self-information of local features in an image to represent object rarity, and then measured the saliency value.
Prior knowledge-based methods improve the accuracy of saliency detection methods via the general image properties that researchers have summarized through observation or experiment.Common prior knowledge types include center, background, objectness, semantic, color, spatial distribution, and sparse priors, with the center and background priors particularly widely used.
Several studies [26,35,36] employed the center prior, and demonstrated that it can enhance model performance.The background prior has also been used in several studies [37,38].
Matrix recovery-based methods, which aim to decompose a matrix into a low-rank matrix and a sparse one, have shown the potential to address the problem of saliency detection: the decomposed low-rank matrix naturally corresponds to the background, and the sparse one captures salient objects. Peng et al. [39] introduced sparse structures into the robust principal component analysis (RPCA) model, and proposed a saliency detection method based on low-rank representation and structural sparse matrix decomposition. Lang et al. [40] proposed a multitask sparse pursuit algorithm based on low-rank representation.
Compared to bottom-up methods, little research has hitherto been conducted regarding the top-down saliency model. Jiang et al. [41] proposed a learning-based method by regarding saliency detection as a regression problem, where the saliency detection model was constructed based on the integration of numerous descriptors extracted from training samples with ground truth labels. Zhang et al. [34] integrated the top-down and bottom-up information to construct a Bayesian-based top-down model, where saliency is computed locally. Yang et al. [42] proposed a method combining conditional random field and sparse coding theory. Cholakkal et al. [43] regarded top-down saliency detection as an image-classification problem, and proposed a saliency detection method based on an image-classification framework.
As each group has different advantages, Tong et al. [44] proposed a bootstrap learning (BL) method to enhance performance; it exploits the strengths of both bottom-up contrast-based saliency methods and top-down learning methods. However, further work is needed before the BL method can be used to extract built-up areas. First, BL introduces a dark channel prior into the coarse saliency detection model to generate a coarse saliency map, but this prior is not suitable for all images. In images with darker backgrounds or brighter foregrounds, it may produce the opposite effect. Although the authors used adaptive weights to attenuate the inverse effects of the dark channel prior, remote sensing images are highly complex, especially where water bodies occur as dark background; in such cases, the BL algorithm may fail. Second, BL does not take into account the spatial information of ground objects, which may result in the detection of large amounts of background information. In addition, it simply superimposes multiscale saliency maps without fully integrating the information they provide.
In this study, an improved boosting learning saliency method for extracting built-up areas from remote sensing images is proposed. First, we determine the optimal band combination for extracting built-up areas. To overcome the shortcomings associated with a dark channel prior, we introduce a multi-cue fusion and water removal strategy into the coarse saliency model, to improve the accuracy of the coarse saliency map. Then, the GWB model [45] is used to consider spatial information, thereby eliminating the impact of land cover surrounding built-up areas, to further improve accuracy and provide reliable training samples for the strong saliency model. After that, CCA [46] is employed to effectively integrate multiscale saliency maps to optimize the accuracy of the final saliency map. Finally, the FODPSO algorithm [47] is used to segment the final saliency map to accurately capture information on built-up areas.
The contribution of this paper is threefold: (1) We improve the BL saliency method based on the characteristics of remote sensing images for extracting built-up areas. (2) GWB and CCA are introduced in the proposed method to suppress background regions and attach more importance to regions which are more likely to be parts of built-up areas. (3) We determine the optimal band combination of Sentinel-2 for built-up area detection.
The rest of this paper is organized as follows: the proposed method is illustrated in Section 2, Section 3 focuses on the experimental results, and Sections 4 and 5 provide the discussion and conclusions, respectively.

Proposed Method
A flowchart of the proposed method is shown in Figure 1. The method consists of four stages. First, the image is sharpened to 10 m, and the optimal band combination for built-up area extraction is determined. Then, the false color image generated by the optimal band combination is segmented into a group of segmented objects. Subsequently, a coarse saliency map is constructed based on multiple cues fusion and GWB, to generate training samples for a strong model. Based on the representation of three features, a strong classifier is trained to measure saliency. Next, the coarse and refined saliency maps are weighted in combination to generate the final saliency map. Finally, the built-up areas are extracted using the FODPSO method.
Remote Sens. 2018, 10, x FOR PEER REVIEW

Sentinel-2 Constellation
The Sentinel-2 constellation consists of two polar-orbiting satellites (Sentinel-2A and Sentinel-2B) placed in the same orbit. Sentinel-2A and Sentinel-2B are equipped with multispectral instruments capable of acquiring information in 13 bands at different spatial resolutions (10, 20, and 60 m). Sentinel-2 provides more details in the near-infrared (NIR) and short wavelength infrared (SWIR) band ranges, which is helpful for land cover mapping, land monitoring, and emergency response [48]. A high revisit time (10 days at the equator with one satellite, and 5 days with two satellites under cloud-free conditions, which results in 2-3 days at mid-latitudes) provides more cloudless images, and is good support for built-up area extraction.

Atmospheric Correction and Image Sharpening
The bottom-of-atmosphere (surface) reflectance is a basic input to many earth observation applications, ranging from land surface phenology to land cover classification and change detection [49]. To process top-of-atmosphere Level-1C data into atmospherically corrected bottom-of-atmosphere data, the Sen2Cor processor (version 2.4), developed by ESA to perform atmospheric correction, was employed [50].
In higher spatial resolution images, built-up areas tend to be more easily detected, because higher spatial resolution images can clearly define the boundaries of the built-up areas, uniformly highlight built-up areas, and eliminate redundant backgrounds in the extracted built-up areas [19]. To sharpen the bands of a Sentinel-2A image with spatial resolutions of 20 m and 60 m to a spatial resolution of 10 m, the modified selected and synthesized band scheme [51] was employed.

Optimal Band Selection
Built-up areas yield a higher reflectance response in the SWIR than in other bands [52], which may help in alleviating the problem of confusion between built-up areas and other types of land cover, such as artificial open spaces, river gravel, and sand dunes [53]. As Sentinel-2 has two SWIR bands, it is, therefore, inherently advantageous when applied to built-up area extraction. In this study, both SWIR bands were selected to form the optimal band combination for built-up area extraction. To select the third band of the optimal band combination, the optimum index factor (OIF) [54] was employed. OIF is a statistical value that can be used to select the optimum combination of three bands in a satellite image for creating a color composite. The optimal band combination, out of all possible 3-band combinations, is the one with the highest amount of "information" (highest sum of standard deviations) and the least amount of duplication (lowest correlation among band pairs). Band 9 and Band 10 are the water vapor and cirrus bands; Band 8 and Band 8A have very high correlation and large information overlap; and the standard deviations of Bands 1, 2, and 3 are relatively low, indicating less information. Therefore, these bands were not considered. The five candidate band combinations were Bands 12, 11, 8; Bands 12, 11, 7; Bands 12, 11, 6; Bands 12, 11, 5; and Bands 12, 11, 4. The OIF values of the candidate band combinations were calculated based on the TIFF images exported by SNAP in ENVI software. The OIF method can reflect the amount of information in a band combination, but it still has some limitations, so we needed to further analyze the separability of the candidate band combinations for built-up and non-built-up areas. Here, the Jeffries-Matusita (J-M) distance [55,56] was used as a separability criterion for optimal band combination selection, whereby the J-M value ranges from 0 to 2.
First, we used ENVI to select samples of built-up areas and non-built-up areas from the TIFF images exported by SNAP. Then, we calculated the J-M value and determined the band combination with the largest J-M value as the optimal band combination. The optimal results are shown in Section 3.
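To make the two selection statistics concrete, the sketch below computes the OIF of a 3-band combination and the J-M distance between two sample sets under the usual Gaussian class assumption. The band arrays and sample matrices are illustrative inputs, not data from the paper.

```python
import numpy as np

def oif(bands):
    """Optimum Index Factor for a 3-band combination:
    sum of band standard deviations over sum of absolute pairwise correlations."""
    stds = [b.std() for b in bands]
    corrs = [abs(np.corrcoef(bands[i].ravel(), bands[j].ravel())[0, 1])
             for i, j in ((0, 1), (0, 2), (1, 2))]
    return sum(stds) / sum(corrs)

def jm_distance(x1, x2):
    """Jeffries-Matusita distance between two sample sets (n_samples, n_bands),
    assuming Gaussian class distributions; the value ranges from 0 to 2."""
    m1, m2 = x1.mean(0), x2.mean(0)
    c1 = np.cov(x1, rowvar=False)
    c2 = np.cov(x2, rowvar=False)
    c = (c1 + c2) / 2.0
    dm = m1 - m2
    # Bhattacharyya distance, then JM = 2 * (1 - exp(-B))
    b = (dm @ np.linalg.inv(c) @ dm) / 8.0 \
        + 0.5 * np.log(np.linalg.det(c) / np.sqrt(np.linalg.det(c1) * np.linalg.det(c2)))
    return 2.0 * (1.0 - np.exp(-b))
```

A J-M value near 2 indicates that the built-up and non-built-up samples are nearly fully separable in the candidate band combination.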

Multiscale Segmentation
The saliency map accuracy level is sensitive to the segmentation scale, so a multiscale strategy was employed. The false color image generated by the optimal band combination was first segmented into homogeneous and compact regions using the simple linear iterative clustering (SLIC) superpixel segmentation method [57]. In SLIC, the number of superpixels, N', affects the segmentation quality. If N' is too small, it is often impossible to accurately separate the built-up areas from the background; if N' is too large, much more computing time is needed [58]. As can be seen from Figure 2a,b, some superpixels include both built-up area pixels and non-built-up area pixels, and the contour of the built-up areas cannot be accurately captured. In Figure 2c, the contour of the built-up areas can be accurately captured. Because the scenes of remote sensing images are highly complex, determining the optimal N' for each image is very time-consuming. To simplify the problem, we chose a large N' to avoid under-segmentation; in this study, we empirically set N' to 20,000. Then, Hu's method [59] was adopted to merge similar superpixels into a set of objects O_i, i = 1, ..., N, where N is the number of segmented objects. Figure 2d,e show the results of merging 20,000 superpixels into 2000 and 4000 objects, respectively. This not only reduces the number of superpixels, but also ensures that the outline of the built-up areas is well captured.
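The core idea of SLIC (k-means in a combined color-position space with grid-initialized centers) can be sketched in a few lines. This is a deliberately simplified, pure-NumPy stand-in, not the actual SLIC implementation; in practice a library implementation (e.g., scikit-image's `slic` with `n_segments` set to N') would be used, followed by the superpixel merging step of [59], which is omitted here.

```python
import numpy as np

def simple_slic(img, n_segments=100, m=10.0, n_iter=5):
    """Minimal SLIC-style superpixel sketch: k-means over (color, position)
    features with grid-initialised centers. `m` weights spatial against
    color distance, as in the original SLIC formulation."""
    h, w, _ = img.shape
    s = int(np.sqrt(h * w / n_segments))          # grid step between seeds
    ys, xs = np.mgrid[s // 2:h:s, s // 2:w:s]
    centers = np.column_stack([ys.ravel(), xs.ravel()]).astype(float)
    # per-pixel feature: color channels plus spatially weighted coordinates
    feats = np.concatenate([img, np.dstack(np.mgrid[0:h, 0:w]) * (m / s)], axis=2)
    cfeat = np.array([feats[int(y), int(x)] for y, x in centers])
    px = feats.reshape(-1, feats.shape[2])
    for _ in range(n_iter):
        d = ((px[:, None, :] - cfeat[None]) ** 2).sum(2)   # all-pairs distances
        labels = d.argmin(1)
        for k in range(len(cfeat)):                        # recompute centers
            sel = px[labels == k]
            if len(sel):
                cfeat[k] = sel.mean(0)
    return labels.reshape(h, w)
```

Note that this brute-force sketch computes all pixel-center distances; real SLIC restricts each center's search to a 2S x 2S window, which is what makes N' = 20,000 tractable on full scenes.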



Feature Selection
In this paper, three descriptors, covering the color, texture, and spatial features, are used to describe each segmented object. The color feature is an important feature in saliency detection; almost all saliency methods utilize it. In particular, CIELab [60] aspires to perceptual uniformity, and its L component closely matches human perception of lightness, while the a and b channels approximate the human chromatic opponent system; RGB is often the default choice for scene representation and storage [36]. The two spaces are complementary and widely used for saliency detection [41,44,61]. Hence, we calculated the average pixel value of each segmented object O_i in RGB space and CIELab space, and the color feature of the segmented object O_i can be described as the vector [c_r, c_g, c_b, c_L, c_a, c_b], where c_r,g,b and c_L,a,b represent the average values of each color channel of the pixels in the segmented object O_i in the RGB and CIELab color spaces. Built-up areas usually have unique texture features. Local binary patterns (LBPs) [62] were utilized to calculate the texture feature of segmented objects. First, the LBP encoding for each pixel in the image was calculated using a 3 × 3 window and, in the uniform pattern [62], each pixel was assigned a value between 0 and 56. It is worth pointing out that although a larger window size (such as 5 × 5 or 7 × 7) can utilize more information in the neighborhood, the noise corruption from pixels away from the center can be more severe, which inevitably deteriorates the discriminative ability of the LBP feature [63]. Then, an LBP histogram for each segmented object O_i was constructed, and the texture feature of the segmented object O_i can be described as the histogram vector [h_0, h_1, ..., h_56], where h_i is the value of the i-th bin in the LBP histogram.
For spatial features, the eccentricity and area properties were used to eliminate segmented objects with both a large eccentricity and a large area; such segmented objects are often strips of bare rock or river banks (such as the Yellow River bank). This can be described as the spatial feature F_spatial, which is set to 0 when O_Area > th_1 and O_Ecce > th_2, and to 1 otherwise, where O_Area is the area and O_Ecce is the eccentricity of segmented object O_i, with O_Ecce lying between 0 and 1. To avoid erroneously eliminating roads inside the built-up areas, we only consider long strip-shaped segmented objects with a large area. We experimentally set th_1 to 500 pixels, and th_2 was set to 0.95. The feature vector of segmented object O_i can then be obtained by combining the color, texture, and spatial descriptors.
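A pure-NumPy sketch of these descriptors follows. The paper uses a CIELab conversion and the 57-value uniform-pattern LBP mapping; here a basic 8-bit 3 × 3 LBP and caller-supplied Lab planes stand in, so the bin count and helper names are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def lbp_codes(gray):
    """Basic 3x3 LBP codes (0-255); the paper additionally maps these to the
    57-value 'uniform' pattern set, which is omitted here for brevity."""
    c = gray[1:-1, 1:-1]
    neigh = [gray[:-2, :-2], gray[:-2, 1:-1], gray[:-2, 2:], gray[1:-1, 2:],
             gray[2:, 2:], gray[2:, 1:-1], gray[2:, :-2], gray[1:-1, :-2]]
    code = np.zeros_like(c, dtype=np.int32)
    for bit, n in enumerate(neigh):
        code |= (n >= c).astype(np.int32) << bit
    return np.pad(code, 1)                      # keep the image shape

def object_feature(rgb, lab, lbp, mask):
    """Mean RGB + mean CIELab color concatenated with a normalised
    LBP histogram, computed over the pixels of one segmented object."""
    color = np.concatenate([rgb[mask].mean(0), lab[mask].mean(0)])
    hist, _ = np.histogram(lbp[mask], bins=16, range=(0, 256), density=True)
    return np.concatenate([color, hist])

def keep_object(mask, th1=500, th2=0.95):
    """Spatial filter F_spatial: drop objects that are both large
    (area > th1) and strongly elongated (eccentricity > th2); the
    eccentricity is estimated from the second moments of the mask."""
    ys, xs = np.nonzero(mask)
    if len(ys) < 2:
        return True
    cov = np.cov(np.vstack([ys, xs]).astype(float))
    lo, hi = np.sort(np.linalg.eigvalsh(cov))
    ecc = np.sqrt(max(1.0 - lo / max(hi, 1e-12), 0.0))
    return not (len(ys) > th1 and ecc > th2)
```

A 600-pixel strip one pixel wide has eccentricity near 1 and is filtered out, whereas a large compact block of the same area is kept, matching the intent of the th_1/th_2 rule.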

Coarse Saliency Map
In this section, we explain how to obtain coarse saliency maps and training samples for a strong model. In Section 2.4.1, we mainly used cues, such as color and texture, to obtain the initial saliency map of the built-up areas. However, this map often contains some background and water information. In Section 2.4.2, we considered the spatial information of ground objects and introduced the GWB model to eliminate the background information similar to the built-up areas, obtaining the coarse saliency map. In Section 2.4.3, we eliminated the water information from the coarse saliency map. In Section 2.4.4, we selected the training samples from the coarse saliency map for the strong model.

Multiple Cues Fusion

Compactness Saliency Using Color Cues
Following the image segmentation, a graph G = (V, E) with N nodes {v_1, v_2, ..., v_N} was constructed, with the edges E weighted by an affinity matrix W = [w_ij]_N×N. Node v_i corresponds to the i-th segmented object, edge e_ij links nodes v_i and v_j, and the CIELab color space distance l_ij between nodes v_i and v_j is defined as the distance between c_i and c_j, where c_i and c_j are the means of the segmented objects corresponding to nodes v_i and v_j in the CIELab color space. The affinity w_ij is defined in terms of l_ij and a constant σ, where Ω_i denotes the set of neighbors of node v_i; if v_i and v_j are adjacent, v_j is treated as a neighbor of v_i. Salient objects typically have compact spatial distributions, whereas background regions are widely distributed across the entire image. Therefore, compactness may be determined by calculating the spatial variances of the segmented objects to obtain the compactness saliency map [64]. First, the similarity a_ij between each pair of segmented objects, v_i and v_j, is defined, and the similarity based on manifold ranking through the constructed graph is then computed, where A = [a_ij]_N×N, D = diag{d_11, d_22, ..., d_NN}, d_ii is the degree of node v_i, and H = [h_ij]_N×N is the similarity matrix after the diffusion process; α balances the smoothness and fitting constraints of the manifold ranking algorithm and, empirically, α was set to 0.99, as in [65]. The spatial variance of the segmented objects can then be calculated, where n_j represents the number of pixels that belong to segmented object v_j, b_j = [b_xj, b_yj] represents the centroid of segmented object v_j, and µ_i = [µ_xi, µ_yi] represents the spatial mean. Considering that segmented objects at the center of an image are more noticeable, the spatial distances between segmented objects and the image center can also be calculated, where p = [p_x, p_y] is the spatial coordinate of the image center.
The saliency map based on compactness, S_com(i), is then defined from the spatial variance and center-distance terms, where Norm(x) is a function that normalizes x to [0, 1].

Foreground Saliency Using Multiple Cues Contrast
Although the compactness saliency method tends to perform well, as attested to by a previous study [64], it uses only the spatial variance of color in the image space. As it primarily depends on color information, the saliency deteriorates when the foreground and background objects are similar in color. To address this limitation, further cues, such as texture and position, should be incorporated to refine the results.
First, the foreground seed set was determined by segmenting the compactness saliency map. Then, the contrast of each segmented object with the seeds was calculated using multiple cues, including texture and position information. The foreground saliency S_FG is computed from these cues, where Ω_s is the foreground seed set, D_t is the texture similarity between segmented objects based on LBP, and ||b_i − b_j|| is the Euclidean distance between the positions of segmented objects.
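One plausible form of this multi-cue foreground contrast is sketched below; the exact equation is the one given in the paper, and the histogram-intersection similarity, the exponential spatial falloff, and the `sigma` parameter are all assumptions made for illustration.

```python
import numpy as np

def foreground_saliency(lbp_hists, centroids, seed_idx, sigma=0.25):
    """Illustrative multi-cue contrast with the foreground seeds Omega_s:
    an object's saliency grows with its texture similarity D_t to each seed,
    attenuated by the spatial distance ||b_i - b_j|| to that seed.
    lbp_hists: (N, B) normalised histograms; seed_idx: seed object indices."""
    pos = centroids / centroids.max()                 # normalise positions
    s = np.zeros(len(lbp_hists))
    for j in seed_idx:
        # histogram-intersection texture similarity D_t(i, j)
        d_t = np.minimum(lbp_hists, lbp_hists[j]).sum(1)
        d_b = np.linalg.norm(pos - pos[j], axis=1)    # ||b_i - b_j||
        s += d_t * np.exp(-d_b / sigma)
    return (s - s.min()) / (s.max() - s.min() + 1e-12)
```

Objects that share the seeds' texture and lie near them score high; texturally dissimilar or distant objects score near zero.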
Next, the S_FG map was propagated using manifold ranking; the propagated map was then normalized to [0, 1] and denoted S_fore(i). The S_com(i) and S_fore(i) maps are complementary to one another, and both saliency maps were integrated as a weighted combination to define the initial saliency map, where η balances the compactness saliency map S_com(i) and the foreground saliency map S_fore(i). In the optimal band combination, the built-up areas can be better identified using color features, while the built-up areas are also sensitive to texture features. Both make important contributions, so η was set to 0.5.
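The weighted integration step can be sketched directly; the min-max normalization helper is an assumption, standing in for the Norm(x) function defined earlier.

```python
import numpy as np

def fuse_saliency(s_com, s_fore, eta=0.5):
    """Initial saliency as the weighted combination of the (normalised)
    compactness and foreground maps; eta = 0.5 as chosen in the text."""
    norm = lambda x: (x - x.min()) / (x.max() - x.min() + 1e-12)
    return eta * norm(s_com) + (1.0 - eta) * norm(s_fore)
```

With η = 0.5 the two complementary cues contribute equally, so a region must be missed by both cues to vanish from the initial map.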

Geodesic Weighted Bayesian
Spatial information is a key aspect of the geographic information system (GIS) and remote sensing fields, and while spatial relationships have increasingly been incorporated into satellite image processing, less attention has been given to the use of higher-level spatial relationships [66]. To rectify this, a GWB model [45] was introduced to optimize the initial saliency map. The Bayesian inference for estimating the saliency map [67] is calculated as p(sal|v) = p(sal)p(v|sal) / (p(sal)p(v|sal) + p(bk)p(v|bk)), where p(sal) is the prior probability of being salient at pixel v, p(bk) is the prior probability of a pixel belonging to the background, p(v|sal) and p(v|bk) are the likelihoods of the observations, and v is the feature vector of a given pixel. When the spatial relationships are considered, p(v|sal) and p(v|bk) can be rewritten with geodesic weights, where s_i is a segmented object, sal is the initial set of salient regions, and bk is the initial set of background regions; p_geo(s_i) denotes the probability of s_i, namely, the weight of segmented object s_i.
Given pixel x, the feature vector was represented by its CIELab color and LBP texture features, and the observation likelihood of the given pixel x in segmented object O_i can be calculated from the feature histograms, where n_j denotes the number of pixels within segmented object O_i, n_j(f(x)) denotes the number of pixels in segmented object O_i that take the value f(x), and f ∈ {L, a, b, LBP} denotes a component of feature vector v. Substituting the observation likelihoods (16) and (17) into (14), and utilizing the initial saliency map as the prior distribution, generates a more precise saliency map. The initial saliency map was then further refined to obtain the coarse saliency map S_coarse based on the graph cut method [68].
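The pixelwise Bayesian fusion step can be sketched as follows; the likelihood maps are assumed inputs (in the GWB model they would additionally carry the geodesic weights p_geo(s_i) described above).

```python
import numpy as np

def bayes_refine(prior, lik_sal, lik_bk):
    """Bayesian refinement of a saliency map: posterior = prior * likelihood
    over the evidence. `prior` is the initial saliency map used as p(sal);
    `lik_sal` and `lik_bk` are the observation likelihoods p(v|sal), p(v|bk)."""
    p_bk = 1.0 - prior
    num = prior * lik_sal
    return num / (num + p_bk * lik_bk + 1e-12)   # guard against zero evidence
```

Pixels whose features are better explained by the salient-region histograms are pushed toward 1, and background-like pixels toward 0, which is what sharpens the coarse map.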

Removing the Water Bodies
Water bodies in remote sensing images are usually dark targets and are easily identified as salient objects, which can cause several saliency detection methods to fail. To avoid this interference, water bodies must be removed from the coarse saliency map. In [69], Xu noted that water bodies have stronger absorption, while the built-up class has greater radiation, in the SWIR band. Based on this characteristic, we set segmented objects whose average pixel values are smaller than a given threshold, T_w, to 0, thereby removing water bodies. To determine T_w, the histogram of the SWIR band was first generated. For cities with more water bodies, water occupies a larger area, so there is a peak on the left side of the histogram; the gray values of other ground objects are usually greater than those of water, so their peaks lie to the right of the water peak. We took the value corresponding to the first trough to the right of the water peak as T_w. Based on statistical results over multiple images, we set T_w to 0.15. Cities with less water are almost unaffected by water bodies, and for them T_w was set to 0.01. The gray value of building shadow is also low, but its area is small; to avoid removing building shadow, we only removed segmented objects with a large area and a gray value less than T_w. Considering that buildings in some areas are dense and the shadow area is then relatively large, we empirically set the area threshold for removing water to 100 pixels.
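The segment-wise removal rule can be sketched directly from the thresholds in the text (T_w = 0.15 and a 100-pixel minimum area); the label-image representation of the segmentation is an assumption for illustration.

```python
import numpy as np

def remove_water(saliency, swir, labels, t_w=0.15, min_area=100):
    """Zero out saliency for segments that are both large and dark in SWIR
    (water), per the thresholds given in the text."""
    out = saliency.copy()
    for lab in np.unique(labels):
        mask = labels == lab
        # large segment with low mean SWIR reflectance -> treat as water
        if mask.sum() > min_area and swir[mask].mean() < t_w:
            out[mask] = 0.0
    return out

# toy scene: left half is a dark water segment, right half is bright land
swir = np.tile(np.where(np.arange(20) < 10, 0.05, 0.5), (20, 1))
labels = np.tile((np.arange(20) >= 10).astype(int), (20, 1))
cleaned = remove_water(np.ones((20, 20)), swir, labels)
```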

Training Sample Selection
To select accurate training samples from the coarse saliency map, a set of selection rules was established. First, the average saliency value of each segmented object was computed, and two thresholds, T_h and T_l (T_h greater than T_l), were set to generate initial built-up and non-built-up training samples. Both thresholds are adaptively determined from the mean value of the coarse saliency map: T_h was set to ϑ times the mean, and T_l was likewise derived from the mean, where ϑ is a parameter set to 1.8; more discussion of the values of ϑ can be found in Section 3.1.3. Segmented objects with saliency values above T_h were selected as initial built-up samples, while those with saliency values below T_l were selected as initial non-built-up samples. Next, we constrained the initial training sample set using the spatial feature F_spatial to obtain the training samples {s_i, l_i}, i = 1, ..., P, where s_i is the i-th training sample from the coarse saliency map S_coarse, l_i is the binary label of the training sample, and P is the number of samples; built-up samples are labeled +1 and non-built-up samples are labeled −1.
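The thresholding rule can be sketched as follows. T_h = ϑ · mean follows the text; taking T_l equal to the mean itself is our assumption, since the paper's exact expression for T_l is not reproduced in this text.

```python
import numpy as np

def select_samples(obj_sal, theta=1.8):
    """Label segments by mean saliency: +1 (built-up) above T_h,
    -1 (non-built-up) below T_l, 0 (unused) in between.
    T_l = mean is an assumption; the text only fixes T_h = theta * mean."""
    mu = obj_sal.mean()
    t_h, t_l = theta * mu, mu
    labels = np.zeros(obj_sal.shape, dtype=int)
    labels[obj_sal > t_h] = 1
    labels[obj_sal < t_l] = -1
    return labels

# four segments with mean saliency 0.4 -> T_h = 0.72, T_l = 0.4
labels = select_samples(np.array([1.0, 0.05, 0.05, 0.5]))
```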

Refined Saliency Map
One of the main difficulties in using a support vector machine (SVM) is determining the appropriate kernel for a given image. To select the appropriate kernel function for any input image, a multiple kernel boosting method [70] was employed. In this method, SVMs with different kernels serve as weak classifiers, and a strong classifier is then learned via boosting. In this paper, we used N_f (N_feature × N_kernel) different standard SVM classifiers, where N_feature is the number of features and N_kernel is the number of kernel functions. The four kernel functions are linear, polynomial, radial basis function, and sigmoid. For the different feature sets, the decision function can be defined as where β_n is the kernel weight, w_i is the Lagrange multiplier, and b is the bias in the standard SVM algorithm. Equation (19) is the conventional multiple kernel learning formulation; when the boosting algorithm replaces the simple combination of single-kernel SVMs, Equation (19) can be rewritten as where k_n(s) = [k_n(s, s_1), k_n(s, s_2), ..., k_n(s, s_P)]^T, w = [w_1 l_1, w_2 l_2, ..., w_P l_P]^T, and the bias b is carried over unchanged. By setting the decision function as Z_n(S) = w^T k_n(S) + b_n, the AdaBoost method may be employed to train a strong classifier, and Formula (20) can be rewritten as where the AdaBoost method is used to calculate β_j, and J is the number of boosting iterations. The process is as follows:
Step 1: Begin with uniform weights ω_1(i) = 1/P, i = 1, 2, ..., P, and assign a set of decision functions {Z_n(S), n = 1, 2, ..., N_f} to the weak classifiers.
Step 2: Compute the classification error {ε_n} for each weak classifier and find the decision function z_j(s) with the minimum error ε_j; the combination coefficient β_j is then computed by where sgn(x) is the sign function, which equals 1 when x > 0 and −1 otherwise, and β_j must exceed 0.
Step 3: Update the weights according to Equation (23), and repeat Step 2 for the next iteration until J iterations are completed.
Following the J iterations, all β_j and z_j(s) are obtained and the strong classifier is learned. A pixel-wise saliency map is then generated using the strong classifier. Finally, the refined saliency map S_refined was improved based on the graph cut method [68] and the guided filter [71].
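Steps 1-3 can be sketched with the standard AdaBoost loop. As a stand-in, the weak learners below are arbitrary callables (here, a decision stump and a constant classifier) rather than the multi-kernel SVMs the paper boosts, and β_j = ½·ln((1 − ε_j)/ε_j) is the standard AdaBoost coefficient, assumed here since the paper's Equation (22) is not reproduced in this text.

```python
import numpy as np

def adaboost(X, y, weak_learners, J=10):
    """AdaBoost over a fixed pool of weak decision functions z(X) -> scores.
    Stand-in sketch for the paper's multiple kernel boosting (Steps 1-3)."""
    P = len(y)
    w = np.full(P, 1.0 / P)                      # Step 1: uniform weights
    betas, chosen = [], []
    for _ in range(J):
        # Step 2: pick the weak learner with minimum weighted error
        errs = [np.sum(w * (np.sign(z(X)) != y)) for z in weak_learners]
        j = int(np.argmin(errs))
        eps = max(errs[j], 1e-10)
        if eps >= 0.5:
            break
        beta = 0.5 * np.log((1 - eps) / eps)     # combination coefficient
        pred = np.sign(weak_learners[j](X))
        w = w * np.exp(-beta * y * pred)         # Step 3: reweight samples
        w = w / w.sum()
        betas.append(beta)
        chosen.append(weak_learners[j])
    def strong(Xq):
        return np.sign(sum(b * np.sign(z(Xq)) for b, z in zip(betas, chosen)))
    return strong

# toy 1-D problem: a stump at 1.5 separates the classes; a constant cannot
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([-1, -1, 1, 1])
weak = [lambda X: np.where(X[:, 0] > 1.5, 1.0, -1.0),
        lambda X: np.ones(len(X))]
strong = adaboost(X, y, weak, J=5)
```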

Multiscale Saliency
Since the sizes of ground objects in an image differ, salient objects can appear at a variety of scales. In other words, the accuracy of the saliency map is sensitive to the number of segmented objects [70], so a multiscale strategy was employed. In this study, seven layers (M = 7) of segmented objects with different granularities were generated, with N set to 1000, 1500, 2000, 2500, 3000, 3500, and 4000 in the respective layers; more discussion of the values of M can be found in Section 3.1.3. To effectively integrate the results of the M scales, the CCA method [46] was employed, whereby each cell corresponds to a pixel and the saliency values of all pixels constitute the set of cell states. Any cell in a saliency map has 5M − 1 neighbors, comprising the pixels with the same coordinates in the other saliency maps together with their 4-connected pixels [46]. The saliency value of pixel i in the m-th saliency map at time t stands for its probability of being foreground F, represented as S^(t)_{m,i}, while its probability of being background B is 1 − S^(t)_{m,i}. Otsu's method was used to binarize each map with an adaptive threshold; the threshold does not change over iterations and depends only on the initial image. The threshold of the m-th saliency map is denoted γ_m. Following segmentation, a pixel i is classified as foreground or background. If pixel i is foreground, the probability that one of its neighboring pixels, j, is measured as foreground is λ, while µ is the probability that j is measured as background when i belongs to the background. We assumed λ equal to µ, treating the two measurement outcomes as equally reliable. The posterior probability S^(t)_{m,i}·λ represents the probability of pixel i belonging to the foreground F given that its neighboring pixel j in the m-th saliency map was binarized as foreground at time t, and the posterior probability S^(t+1)_{m,i} represents the probability of pixel i belonging to the foreground F at time t + 1. Based on the prior ratio in [46], taking the logarithm of Equation (24) yields the log-odds l(s) = ln(s/(1 − s)), with Λ = ln(λ/(1 − λ)). Assuming that each neighbor's contribution is conditionally independent, the synchronous updating rule is defined over the log-odds l(s^(t+1)_m), where s_m is the m-th saliency map at time t, M is the number of multiscale saliency maps, s_{j,k} is the vector containing the saliency values of the j-th neighbor for all pixels in the m-th saliency map at time t, and 1 = [1, 1, ..., 1]^T. After T_C iterations, the multiscale maps are integrated into the saliency map s^(T_C).
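The synchronous log-odds update can be sketched as follows. Several simplifications are our assumptions for illustration: only the cross-scale neighbors are used (the 4-connected spatial neighbors are omitted), the per-map mean stands in for Otsu's threshold γ_m, λ = 0.72 is an assumed value, and the rule "add Λ per foreground neighbor, subtract Λ per background neighbor" is one reading of the paper's Equations (24)-(26).

```python
import numpy as np

def cca_integrate(maps, lam=0.72, iters=5):
    """Simplified cuboid-cellular-automata fusion of M saliency maps.
    Thresholds are fixed from the initial maps, as stated in the text."""
    eps = 1e-6
    S = np.clip(np.stack(maps, axis=0), eps, 1 - eps)    # (M, H, W)
    gammas = S.reshape(len(maps), -1).mean(axis=1)       # stand-in for Otsu
    Lam = np.log(lam / (1 - lam))
    logit = np.log(S / (1 - S))                          # l(s) = ln(s/(1-s))
    for _ in range(iters):
        fg = (S > gammas[:, None, None]).astype(float)   # binarized neighbors
        votes = fg.sum(axis=0, keepdims=True) - fg       # same pixel, other scales
        # +Lam per foreground neighbor, -Lam per background neighbor
        logit = logit + Lam * (2 * votes - (len(maps) - 1))
        S = 1 / (1 + np.exp(-logit))
    return S.mean(axis=0)

# two 1x2 maps that agree: fusion should sharpen the consensus
fused = cca_integrate([np.array([[0.9, 0.1]]), np.array([[0.8, 0.2]])])
```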

Integration
Coarse saliency maps are good at detecting details and capturing local structural information, while refined saliency maps are more adept at describing global shapes. To exploit the complementarity of the two saliency maps, we integrated them using a weighted combination, where κ (between 0 and 1) is a balance factor. In the extraction of built-up areas, greater attention is paid to the outer contours of a city, for which the refined saliency map is more applicable, so κ was set to 0.2.

Built-Up Area Extraction
In the final saliency map, S_final, built-up areas usually have the highest values, ground objects similar to built-up areas have the next highest values, and other ground objects have very low values. As such, the final saliency map can be broadly segmented into three parts based on gray value. To extract accurate built-up areas, an appropriate segmentation threshold needs to be determined. In [72], a genetic algorithm was used to determine the optimal segmentation threshold and achieved good results. In our paper, a multi-threshold segmentation algorithm, FODPSO [47], was employed. Following segmentation, the part with the highest values forms the binary map of the built-up areas. The pseudo-code for FODPSO is presented in Table 1.
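The three-class segmentation can be illustrated with the same between-class-variance criterion that multi-threshold Otsu methods optimize. The exhaustive search below is a stand-in for FODPSO, which searches the same objective far more efficiently on fine gray scales; the 64-level quantization is an assumption to keep the sketch small.

```python
import numpy as np

def two_threshold_otsu(img, levels=64):
    """Exhaustive two-threshold Otsu on a quantized image: maximizes
    sum_k p_k * mu_k^2, equivalent to between-class variance.
    Stand-in for the FODPSO search used in the paper."""
    q = np.clip((img * (levels - 1)).astype(int), 0, levels - 1)
    hist = np.bincount(q.ravel(), minlength=levels) / q.size
    grays = np.arange(levels)
    best, t_best = -1.0, (1, 2)
    for t1 in range(1, levels - 1):
        for t2 in range(t1 + 1, levels):
            crit = 0.0
            for lo, hi in ((0, t1), (t1, t2), (t2, levels)):
                p = hist[lo:hi].sum()
                if p > 0:
                    mu = (grays[lo:hi] * hist[lo:hi]).sum() / p
                    crit += p * mu * mu
            if crit > best:
                best, t_best = crit, (t1, t2)
    return t_best

# three well-separated gray clusters; the top class plays the built-up role
img = np.concatenate([np.full(50, 0.05), np.full(50, 0.5), np.full(50, 0.95)])
t1, t2 = two_threshold_otsu(img)
built_up = (img * 63).astype(int) >= t2
```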

Experimental Results
To date, there has been little investigation into saliency detection in remote sensing images; thus, there are no classic testing datasets with existing ground truth (GT) that can be consulted, so several categories of Sentinel-2 images are introduced to evaluate the effectiveness and novelty of the proposed method. The GT map is obtained by manual segmentation based on the definition of built-up areas in the first section. Since built-up area extraction is greatly affected by the surrounding land cover, the experimental cities are divided by ecosystem into five types: desert, coastal, riverside, valley, and plain cities (Figure 3). Desert cities are distributed in the northwest of China, surrounded by desert and loess with little vegetation; these conditions have a significant impact on built-up area extraction. Coastal cities are located in the eastern coastal areas of China, the most economically active areas, requiring more timely monitoring of built-up areas. Riverside cities are distributed along the banks of the Yangtze and Yellow Rivers; as dark ground objects, water bodies are likely to be detected as salient targets by the saliency model, which may affect built-up area extraction. Valley cities are located in western China and are often affected by bare rock, rendering built-up area extraction extremely challenging. Plain cities are distributed in eastern China, where the landform of the built-up areas is flat and surrounded by significant amounts of farmland (including bare land), which may also affect extraction. Details of the study areas are presented in Table 2. Based on the OIF and J-M methods, the selection of the optimal band combination for the different images is shown in Table 3.
From Table 3, the optimal band combination for most cities consists of bands 12, 11, and 7, while the optimal band combination for most valley cities is composed of bands 12, 11, and 5. To evaluate the performance of the proposed method with respect to saliency detection, we compared it to eight recent saliency detection methods: dense and sparse reconstruction (DSR) [73], discriminative regional feature integration (DRFI) [41], regional principal color (RPC) [74], diffusion-based compactness and local contrast (DCLC) [64], inner and inter label propagation (LPS) [75], bootstrap learning (BL) [44], diffusion process on a two-layer sparse graph (DPTLSG) [76], and reversion correction and regularized random walks ranking (RCRR) [77]. To ensure fairness, all methods used the optimal band combination as the original image for saliency detection. To further evaluate the extraction accuracy of the proposed method, the saliency map was segmented to obtain the binary map of the built-up areas based on FODPSO, and the results were compared to several built-up area extraction methods. Because index-based methods are sensitive to built-up areas [72], they are widely used in built-up area extraction; in this study, two index-based methods, NDBI [7] and NBI [5], were selected. PanTex [11], a method for extracting built-up areas based on texture features that has been validated in many experiments [78], was also selected.

Qualitative Experiment
The saliency maps generated by the nine methods are presented for qualitative comparison in Figure 4. It is clear that our method efficiently detects built-up areas and identifies their contours most accurately, while the results of the other eight methods are inferior. For riverside cities, several of the methods, including DSR, DRFI, RPC, BL, and RCRR, identify water bodies as salient objects rather than built-up areas. Of these methods, DCLC and DPTLSG perform better, but still fall short in terms of accuracy. The BL method also produces satisfactory saliency maps, but its performance is generally poor in coastal and riverside cities. DSR, DRFI, and RPC fail to highlight the built-up areas in their entirety. LPS focuses too much on an image's central information, overlooking information on built-up areas elsewhere in the image. The RCRR method detects and highlights unnecessary and irrelevant background information.

Quantitative Experiment
To quantitatively evaluate the performance of each saliency method, the receiver operating characteristic (ROC) curve and area under the curve (AUC) metric; precision, recall, and F-measure; and a time comparison were used.

ROC-AUC Metric
The receiver operating characteristic (ROC) curve is derived by thresholding a saliency map at every threshold in the range [0, 255] and classifying the saliency map into salient objects and background [79]. The ROC graph is generated by plotting the true positive rate (on the y-axis) against the false positive rate (on the x-axis). The true positive and false positive rates are expressed as where TPR is the true positive rate, FPR is the false positive rate, TP (true positives) is the number of correctly identified built-up pixels, FN (false negatives) is the number incorrectly rejected, FP (false positives) is the number incorrectly identified, and TN (true negatives) is the number correctly rejected. For the same FPR value, the higher the TPR, the better a method's performance; likewise, the larger the area under the curve (AUC), the better the performance. The AUC values for the different methods are presented in Table 4, from which it may be seen that the proposed method has the highest AUC value. The ROC curves for the different methods are shown in Figure 5a; the curve generated by our model demonstrates superior performance. To further evaluate the quality of the saliency maps, precision, recall, and F-measure were employed. They can be computed by where (x, y) denotes the image coordinates, t(x, y) is the ground truth, and s(x, y) is the binary image obtained by thresholding the saliency map. The threshold was set to twice the average gray value; segmented objects whose average gray value exceeded the threshold were designated foreground, with all others designated background. High recall means that a model returned most of the built-up areas, whereas high precision means that a model returned substantially more built-up areas than background regions. The F-measure is the harmonic mean of precision and recall, and β² was set to 1 to balance the importance of precision and recall.
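For concreteness, the quantities above can be computed from a pair of binary maps as follows (a minimal sketch; with β² = 1 the F-measure reduces to the harmonic mean of precision and recall).

```python
import numpy as np

def saliency_metrics(pred, gt):
    """TPR, FPR, precision, and F-measure (beta^2 = 1) for binary maps;
    pred is the thresholded saliency map, gt the ground truth.
    Recall equals TPR."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    tn = np.sum(~pred & ~gt)
    tpr = tp / max(tp + fn, 1)       # recall
    fpr = fp / max(fp + tn, 1)
    prec = tp / max(tp + fp, 1)
    f = 2 * prec * tpr / max(prec + tpr, 1e-12)
    return tpr, fpr, prec, f

# tiny example: one of each of TP, FP, FN, TN
tpr, fpr, prec, f = saliency_metrics(np.array([1, 1, 0, 0]),
                                     np.array([1, 0, 1, 0]))
```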
Figure 5b shows the precision, recall, and F-measure values of the evaluated methods. Our method achieves the highest precision, which means that the redundant background in the built-up areas acquired by our method is the smallest among the nine methods. The recall of our method is also the highest, which means that it extracts the most complete built-up areas. The built-up areas extracted by the RPC and BL methods are greatly affected by water bodies and have low recall and precision. LPS focuses too much on an image's central information, so it also has low recall and precision. DCLC and DPTLSG have good precision and recall, and can identify built-up areas well. Although RCRR has a good recall rate, it extracts more background information, so its precision is lower. Overall, the proposed method performs well against the state-of-the-art methods.

Time Comparison
We compared the computational time of each method using MATLAB on a PC with 8 GB RAM and an Intel Core i5-4590 CPU @ 3.30 GHz. The average times of our method and the competing methods are given in Table 5. As can be seen from Table 5, our method is very time-consuming, mainly due to the multiscale segmentation and the ensemble learning strategy.

Important Parameter Settings
There are two important parameters in the proposed method: M, the number of layers of segmented objects, and ϑ, the parameter used to calculate the thresholds T_h and T_l. To determine the optimal M, we compared the accuracy and running time for different values of M; the results are shown in Table 6. As M increases, the accuracy increases, but the computation time also rises sharply; were M increased further, the computational time would become unacceptable. To determine the optimal ϑ, we compared the accuracy for different values of ϑ; the results are shown in Table 7, from which we can see that the accuracy is highest when ϑ is 1.8. Although ϑ = 1.8 was determined to be optimal overall, it was not suitable for the images of Xining and Yulin, for which ϑ was set to 2.8. To evaluate the overall performance of the proposed method in extracting built-up areas, we further compared it to some advanced built-up area extraction methods, including two index-based methods [5,7] and one texture-based method [11], calculated according to Equations (35)-(37). To make these equations applicable, we used ESA's Sen2Cor atmospheric correction module to process Sentinel-2 Level-1C images into Level-2A bottom-of-atmosphere (BOA) reflectance images [50]. In the texture measure, CON = Σ_{i,j=1}^{N_g} (i − j)²·P_{ij}, where N_g is the number of gray levels present in the image and P_{ij} is the (i, j)-th entry of the co-occurrence matrix. Figure 6 presents the results achieved by the different methods. The binary maps shown in Figure 6f-h were obtained by automatically determining the segmentation threshold using Otsu's algorithm to segment the maps in Figure 6b-d. The binary maps shown in Figure 6j were obtained by segmenting the maps in Figure 6f based on the optimal thresholds determined by the FODPSO algorithm. Since built-up areas are usually large, we keep regions larger than T_area and remove regions smaller than T_area, where T_area is empirically set to 3000 pixels. From Figure 6, it is
clear that the two index-based methods perform poorly on the images of the desert and valley cities: they can hardly be used to identify and extract the built-up areas, while the desert and bare rock are clearly extracted instead. However, they perform very well on the images of coastal, riverside, and plain cities, because of the high vegetation coverage or large water area in these cities. The PanTex method performs better than both index-based methods in detecting built-up areas, with the locations clearly identifiable. However, PanTex only utilizes texture features, so areas with texture features similar to built-up areas may also be extracted; for example, desert cities are surrounded by large loess and desert areas whose textures resemble those of built-up areas, and these are incorrectly extracted. Although the land cover around cities varies, our proposed method can still efficiently identify the locations and boundaries of built-up areas and can accurately extract them. To quantitatively evaluate the various methods, three statistical measures were used: overall accuracy, commission error, and omission error. The commission error represents the percentage of pixels that belong to non-built-up areas but have been classified as built-up areas; the omission error represents the percentage of pixels that belong to built-up areas but have been classified as non-built-up areas. Table 8 shows the average statistical measurement results for the five types of cities. The overall accuracies of our proposed method in all five types of cities are higher than those of the other three methods, and the commission and omission errors of the proposed method are the lowest among the four methods. NDBI and NBI have low overall accuracy and high commission and omission errors on images of desert and valley cities, suggesting that these two index-based methods are not suitable for extracting built-up areas surrounded by bare rock and desert. However, they perform well on images of the other three types of cities and achieve high overall accuracy. PanTex performs very well, second only to the proposed method; its omission error is also relatively low, while its commission error is high in desert cities, indicating that when PanTex extracted the built-up areas of desert cities, a large amount of non-built-up area was incorrectly extracted. In summary, the proposed method takes into account the different features of the built-up areas, based on visual salience, and can achieve good results in different types of cities.
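The three statistical measures can be computed from a confusion matrix as sketched below. The denominators follow the usual convention (commission relative to all pixels classified as built-up, omission relative to all reference built-up pixels); this convention is our assumption, as the text states only what each error represents.

```python
import numpy as np

def extraction_errors(pred, gt):
    """Overall accuracy, commission error, and omission error for
    binary built-up maps (pred = extraction result, gt = ground truth)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)
    fp = np.sum(pred & ~gt)    # non-built-up classified as built-up
    fn = np.sum(~pred & gt)    # built-up classified as non-built-up
    tn = np.sum(~pred & ~gt)
    overall = (tp + tn) / pred.size
    commission = fp / max(tp + fp, 1)
    omission = fn / max(tp + fn, 1)
    return overall, commission, omission

# tiny example: one correct detection, one false alarm, no misses
oa, ce, oe = extraction_errors(np.array([1, 1, 0, 0]),
                               np.array([1, 0, 0, 0]))
```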

Discussion
In this paper, a new method for extracting built-up areas from images based on the principles of salient object detection is proposed. Compared to other saliency detection methods, the unique band information of remote sensing is fully exploited. For example, the optimal band combination composed of particular bands of the satellite images can highlight the built-up areas very effectively, and water bodies can be removed using the SWIR band to prevent them from being detected as salient objects. Compared to existing built-up area extraction methods using optical images, our method devotes greater attention to the most conspicuous built-up areas in the image, rather than only the spectral and textural information of the built-up areas; it is therefore more robust and not easily distorted by surrounding ground objects, such as bare rock or desert. Compared to the BL saliency method, we consider a greater number of cues and introduce the GWB model to improve the detection accuracy of the coarse saliency map, thus providing more reliable training samples for training the strong model. CCA is employed to integrate the multiscale detection results, as distinct from the simple superposition used by the BL saliency method, improving the ultimate detection accuracy and reducing background information. In addition, shape information of ground objects is also utilized. For the multiscale segmentation strategy, we do not set different SLIC parameters and segment the image multiple times but, rather, adopt a fixed segmentation parameter and then merge the superpixels with various merge parameters to obtain multiscale segmentation images, improving segmentation and algorithm efficiency.

Although our method is precise and robust, it has some shortcomings and limitations that cannot be ignored. First, it employs multiscale segmentation and an ensemble learning strategy that affect its processing efficiency, resulting in a computational time several times that of some other methods; however, fewer scales can be selected when the accuracy requirements allow, shortening the calculation time. Second, our method also incorrectly detects some non-built-up areas; for example, in the Lhasa image, the river bank is detected. The third limitation concerns the detection accuracy of the coarse saliency map, which affects the reliability of the training samples and the final result, and which depends on the detection method and the input image. Overall, the proposed method has the potential to extract built-up areas in different types of cities with adequate accuracy.

Conclusions
This paper proposes a new built-up area extraction method based on an improved BL saliency model. First, the band combinations that highlight built-up areas are explored. Then, we produce a coarse saliency map based on multiple cues and the GWB model to generate training samples for a strong classification model, which is subsequently used to produce a refined saliency map. To further improve detection performance, multiscale saliency maps are integrated by CCA, and the final saliency result combines the coarse and refined saliency maps. Finally, the built-up areas are extracted using the FODPSO algorithm. Comparative experimentation with other advanced saliency detection methods indicates that our method outperforms the other eight models in extracting built-up areas from various complex background environments, and comparative analyses with three advanced built-up area extraction methods confirm its superior performance. The proposed method therefore not only has good precision and robustness, but also practical value in the extraction of built-up areas.
Future research will focus primarily on three aspects. First, we intend to optimize our method further, shortening its computational time. Second, we will take more features into consideration to avoid extracting other ground objects. Third, we will determine the optimal band combinations of other satellites for extracting built-up areas, ultimately extending our method for use with more satellites.

Figure 5. Quantitative evaluation results of different methods: (a) ROC curves of different methods on Sentinel-2 images; (b) precision, recall, and F-measure of different methods on Sentinel-2 images.

where b_swir is the reflectance of the SWIR band, b_nir is the reflectance of the NIR band, and b_red is the reflectance of the red band; tx_i = f(w = 9, v_i, m = CON), i ∈ [(a_1, d_1); (a_2, d_2); ...; (a_n, d_n)], where w is the window size, a and d are the distance and angle defining the displacement vector v required to select the pixel pairs producing the co-occurrence matrix, and m is the textural measure applied to the given co-occurrence matrix distribution. Here CON denotes the grey-level co-occurrence contrast, CON = Σ_{i,j} (i - j)^2 p(i, j), where p(i, j) is the normalized co-occurrence frequency of grey levels i and j.
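The windowed contrast measure can be sketched as follows. This is a minimal Python illustration assuming a single displacement vector and uniform grey-level quantization; the helper name and the number of grey levels are illustrative, and the paper's exact implementation is not shown.

```python
import numpy as np

def glcm_contrast(window, levels=8, dx=1, dy=0):
    """Contrast (CON) of the grey-level co-occurrence matrix for one window.

    window : 2-D array of reflectance values (e.g. a 9 x 9 neighbourhood)
    dx, dy : displacement vector selecting the pixel pairs
    """
    # Quantize the window to a small number of grey levels
    span = window.max() - window.min()
    if span == 0:
        q = np.zeros(window.shape, dtype=int)
    else:
        q = np.minimum((window - window.min()) / span * levels,
                       levels - 1).astype(int)
    # Accumulate co-occurrence counts for the chosen displacement
    glcm = np.zeros((levels, levels), dtype=float)
    h, w = q.shape
    for y in range(max(0, -dy), min(h, h - dy)):
        for x in range(max(0, -dx), min(w, w - dx)):
            glcm[q[y, x], q[y + dy, x + dx]] += 1.0
    total = glcm.sum()
    if total > 0:
        glcm /= total                      # normalize to probabilities p(i, j)
    i, j = np.indices(glcm.shape)
    return float(np.sum((i - j) ** 2 * glcm))  # CON = sum (i-j)^2 p(i,j)
```

A homogeneous window yields a contrast of zero, while a window alternating between the extreme grey levels yields the maximum contrast (levels - 1)^2.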
1, Band 2, and Band 3 is relatively low, and these bands contain less information; therefore, they are not considered. The five candidate band combinations are Bands 12, 11, 8; Bands 12, 11, 7; Bands 12, 11, 6; Bands 12, 11, 5; and Bands 12, 11, 4. The OIF values of the candidate band combinations were calculated in ENVI software, based on the TIFF images exported by SNAP. The OIF method reflects the amount of information in a band combination, but it still has some limitations; we therefore further analyzed the separability of the candidate band combinations for built-up and non-built-up areas.
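A minimal sketch of ranking candidate combinations by OIF, using the standard definition (sum of band standard deviations divided by the sum of absolute pairwise correlation coefficients); the paper's computation was performed in ENVI, so this Python version is only illustrative.

```python
import numpy as np
from itertools import combinations

def oif(bands):
    """Optimum Index Factor for a candidate band combination.

    bands : list of 2-D arrays (e.g. three Sentinel-2 band rasters)
    OIF = sum of band standard deviations
          / sum of |pairwise correlation coefficients|
    """
    flat = [b.ravel().astype(float) for b in bands]
    std_sum = sum(np.std(b) for b in flat)
    corr_sum = sum(abs(np.corrcoef(a, b)[0, 1])
                   for a, b in combinations(flat, 2))
    return std_sum / corr_sum

# Usage: pick the combination with the highest OIF
# (band arrays assumed loaded elsewhere)
# best = max(candidate_combinations, key=oif)
```

Higher OIF indicates bands with large individual variance and low mutual redundancy, i.e. more total information in the combination.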
position parameter, x_n is the velocity parameter, χ_1n is the local best, and χ_2n is the global best, for i = 1, 2, ..., up to the maximum number of iterations.
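The update above builds on the canonical PSO scheme. A minimal Python sketch of that base scheme is given below; variable names and coefficient values are illustrative, and FODPSO additionally replaces the inertia term with a fractional-order memory of past velocities and adds Darwinian spawning and deletion of swarms, both omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)

def pso_step(x, v, local_best, global_best, w=0.7, c1=1.5, c2=1.5):
    """One canonical PSO velocity/position update for all particles."""
    r1, r2 = rng.random(x.shape), rng.random(x.shape)
    v = w * v + c1 * r1 * (local_best - x) + c2 * r2 * (global_best - x)
    return x + v, v

# Toy usage: minimise f(x) = x^2 with a 5-particle swarm
f = lambda x: x ** 2
x = rng.uniform(-10.0, 10.0, 5)
v = np.zeros(5)
pbest = x.copy()                               # per-particle (local) bests
for _ in range(100):
    x, v = pso_step(x, v, pbest, pbest[np.argmin(f(pbest))])
    pbest = np.where(f(x) < f(pbest), x, pbest)
gbest = pbest[np.argmin(f(pbest))]             # global best after the run
```

In the paper's setting the fitness function is a multilevel thresholding criterion on the final saliency map rather than this toy quadratic.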

Table 4. The area under the curve (AUC) for different methods.

Table 5. Running time comparisons for the nine methods.

Table 6. Comparison results for different values of M.

Table 7. Comparison results for different values of ϑ.

Table 8. Accuracy assessment of the resultant images.