A Comprehensive Evaluation of Approaches for Built-Up Area Extraction from Landsat OLI Images Using Massive Samples

Detailed information about built-up areas is valuable for mapping complex urban environments. Although a large number of classification algorithms for such areas have been developed, they are rarely tested from the perspective of feature engineering and feature learning. Therefore, we launched a unique investigation to provide a full test of Operational Land Imager (OLI) imagery for 15-m resolution built-up area classification in 2015, in Beijing, China. Training a classifier requires many sample points, and we proposed a method based on the European Space Agency's (ESA) 38-m global built-up area data of 2014, OpenStreetMap, and MOD13Q1-NDVI to achieve the rapid and automatic generation of a large number of sample points. Our aim was to examine the influence of a single pixel and an image patch under traditional feature engineering and modern feature learning strategies. In feature engineering, we consider spectra, shape, and texture as the input features, and support vector machine (SVM), random forest (RF), and AdaBoost as the classification algorithms. In feature learning, the convolutional neural network (CNN) is used as the classification algorithm. In total, 26 built-up land cover maps were produced. The experimental results show the following: (1) The approaches based on feature learning are generally better than those based on feature engineering in terms of classification accuracy, and the performance of ensemble classifiers (e.g., RF) is comparable to that of the CNN. The two-dimensional CNN and the 7-neighborhood RF have the highest classification accuracies, at nearly 91%; (2) Overall, the classification effect and accuracy based on image patches are better than those based on single pixels. Features that can highlight the information of the target category (e.g., PanTex (texture-derived built-up presence index) and the enhanced morphological building index (EMBI)) can help improve classification accuracy. The code and experimental results are available at https://github.com/zhangtao151820/CompareMethod.


Introduction
"Built-up area" refers to land used for urban and rural residential purposes and public facilities.Built-up areas are one of the most important elements of land use and play an extremely important role in urban development planning [1].Extracting built-up areas is crucial for mapping and the management of complex urban environments across local and regional scales [2][3][4][5][6].Landsat images, a good source of data for generating information about built-up areas over large areas, are frequently used for this purpose [7,8].However, the mapping of built-up land poses a significant features [6,23,24], and three-dimensional modeling [25] have increasingly been applied to building extraction.Pattern classification based on feature engineering has a good advantage in extracting certain ground objects (vegetation and water).However, for the recognition and classification of built-up areas, because of the uneven distribution and fragmentation of buildings, large surface spectral heterogeneity, and morphological characteristics without a fixed pattern, it is difficult to find a suitable feature in a wide range of urban and suburban areas.
In recent years, with the rise of artificial intelligence and big data, pattern classification based on feature learning has become a popular research topic, especially deep learning [9][10][11][26,27] and reinforcement learning [28][29][30][31][32][33]. Feature learning automatically learns and utilizes features from raw data. Deep learning (DL) can automatically extract hierarchical data features through unsupervised or semi-supervised feature-learning algorithms; in contrast, traditional machine learning methods require manually designed features. DL is a representation learning approach based on large-scale data. Modern DL methods have often been applied successfully in the field of feature learning, such as autoencoders [34], restricted Boltzmann machines [35], and generative adversarial networks [36]. These methods learn abstract feature representations automatically in an unsupervised or semi-supervised manner, and their results support advanced achievements in areas such as speech recognition, image classification [37], and object recognition [38]. With the rapid development of convolutional neural networks (CNNs), especially the excellent performance of deep CNNs in the ImageNet contest [39][40][41][42], CNNs have shown great advantages in image pattern recognition, scene classification [43], object detection, and other problems. An increasing number of researchers have applied CNNs to remote sensing image classification. CNNs of different structures were used for building extraction in [44][45][46]. Yang et al. [47] showed that combining a subset of spectral bands can promote the classification accuracy of CNNs.

Pixel-Based versus Patch-Based Classification
With the improvement of image spatial resolution, the basic unit of remote sensing land cover mapping has undergone a transformation from image pixels to image objects (segments and patches). The goal of remote sensing land cover mapping is usually to obtain the semantic category of each pixel. Traditionally, built-up extraction has been conducted using pixel-based approaches, in which land cover classes are allocated to each individual pixel. A classifier (e.g., SVM or k-nearest neighbors (KNN)) is used to separate the feature space into several regions. In the transformation between the feature space and the image plane, the same object can exhibit different spectra and different objects can share the same spectrum, because spatial relationships are not considered. Therefore, the classification results in the image plane exhibit salt-and-pepper noise and fragmentation. Momeni et al. [13] compared the influence of spatial resolution, spectral band set, and classification approach for mapping detailed urban land cover based on WorldView-2 images. Their results demonstrated that spatial resolution is clearly the most influential factor when mapping complex urban environments. Wang et al. [48] identified and inspected urban built-up area boundaries based on a temperature retrieval method and used qualitative and quantitative analyses to study the spatio-temporal characteristics of the expansion of the Jingzhou urban built-up area from 1990 to 2014.
With the improvement of image spatial resolution, and especially with the launch of the SPOT, QuickBird, and WorldView satellites, a large number of high-resolution images are now publicly available. High-resolution images, hyperspectral images, and radar data are widely used to extract built-up areas. Single-pixel spectral information alone cannot adequately describe and reflect the feature information of ground objects. Instead of per-pixel features, image patches can be used as the features of geo-objects. Within an image patch, the spatial relations and semantic links between pixels are considered, and a patch can be regarded as one or more target objects, as in scene recognition, semantic segmentation, and object detection. Zhong and Wang [7] presented a multiple conditional random fields (CRFs) ensemble model to incorporate multiple features and learn their contextual information; their experiments on a wide range of images showed that their ensemble model produces higher extraction accuracy for built-up areas than a single CRF. Ning and Lin [49] presented a method for extracting built-up areas from very high-resolution (VHR) remote sensing imagery using the feature-level fusion of right-angle corners, right-angle sides, and road marks. On average, the completeness and the quality of their proposed method are 17.94% and 13.33% better than those of the PanTex method, respectively.

Built-Up Area Extraction from Medium-Resolution Images
Although an increasing number of high-resolution images are available, medium-resolution images are still the most widely used for extracting a wide range of ground objects, due to limits in computing performance and data mining technology. Worldwide, the spatial resolution of built-up area data ranges from low to high, with resolutions of 500, 250, 38, and 30 m. The International Geosphere-Biosphere Programme (IGBP) scheme was classified using the C4.5 decision tree algorithm, which ingested a full year of 8-day moderate-resolution imaging spectroradiometer (MODIS) Nadir Bidirectional Reflectance Distribution Function (BRDF)-adjusted reflectance data [50]. Schaaf et al. [8] utilized a random forest (RF) classification algorithm to map global land cover in 2001 and 2010 with spatio-temporal consistency based on MODIS data and Landsat images. Gong [51] produced the first 30-m resolution global land cover maps based on four classifiers (maximum likelihood, J4.8 decision tree, RF, and SVM) using Landsat Thematic Mapper (TM) and Enhanced Thematic Mapper Plus (ETM+) data. Based on TM and ETM+ images, Chen et al. [52] applied a pixel-object hierarchical classification method to extract the global man-made surface, with a user accuracy of 80%. Using a symbolic classification algorithm [53], the ESA processed massive Landsat and high-resolution images to extract 38-m resolution global residential areas for 1975, 1990, 2000, and 2014, with an overall accuracy of more than 85%. Based on the Google Earth Engine (GEE) platform [54], Liu et al. proposed the urban comprehensive land use index [55], found appropriate thresholds in the global sub-climate areas, and extracted multi-temporal urban built-up areas [56].
In smaller regions, research on built-up area extraction methods has become a popular topic. For example, Zhang et al. [57] proposed an empirical normalized difference of a seasonal brightness temperature index (NDSTI) for enhancing a built-up area, based on the contrasting seasonal heat emission response of the area to solar radiation, and adopted a decision tree classification method for the rapid and accurate extraction of the area. Goldblatt et al. [5] presented an efficient and low-cost machine-learning approach for the pixel-based image classification of built-up areas at a large geographic scale using Landsat data. Their methodology combines nighttime-light data and Landsat 8 images and overcomes the lack of extensive ground reference data. Ma et al. [58] presented a sample-optimized approach for classifying urban area data in several cities of western China using a combination of Defense Meteorological Satellite Program (DMSP)-Operational Linescan System (OLS) nighttime-light data, Landsat images, and GlobeLand30. Goldblatt et al. [59] applied a classification and regression tree, SVM, and RFs to extract urban areas in India based on single pixels using the GEE platform.
In this paper, we compare the accuracy and efficiency of approaches for built-up area extraction from Landsat 8 OLI images based on single pixels or image patches, from the two perspectives of feature engineering and feature learning. We systematically and comprehensively compare the impact of features and classifiers on built-up area extraction results using 15-m resolution OLI images. Moreover, given the influential role that the classification approach plays in output accuracy, and how this is linked intrinsically with image specifications, all the image datasets are classified using parametric and non-parametric pixel-based and patch-based classifiers. This enables a fuller and more robust assessment of the Landsat 8 data and provides helpful, practical information for urban planners and other user communities on the level of thematic detail that can be achieved when mapping complex built-up areas. Finally, the analysis was conducted using a relatively large image covering approximately 32,400 km² of the city of Beijing, China, and its surroundings. This means that built-up area extraction is performed at a scale of practical value and relevance (the whole-city scale), unlike the earlier experiments of Li et al. [12] and Momeni et al. [13], which were limited to very small local areas.

Study Area and OLI Image
The study area is the city of Beijing, the capital of China, located at approximately 40° N latitude, 116° E longitude. Beijing has a population of slightly more than 21 million. The climate is a typical northern, temperate, semi-humid, continental monsoon climate, with a hot and rainy summer, a cold and dry winter, and a short spring and autumn. The landscape consists of 62% mountains and 38% plains. The topography of Beijing is high in the northwest and low in the southeast, with an average elevation of approximately 43.5 m. Beijing is a typical international metropolis with prosperous business districts and a developed transportation system, and the objects on the ground surface are complex and heterogeneous. Within the Fifth Ring Road, buildings are densely distributed, while buildings are sparse in the suburbs outside it. Beijing is therefore a representative and reasonable choice of experimental area.
The Landsat 8 OLI imager has nine bands, and the imaging swath is 185 × 185 km. The resolution of Bands 1-7 is 30 m, and Band 8 is a panchromatic band with a resolution of 15 m. Compared with the Enhanced Thematic Mapper Plus (ETM+) sensor on Landsat 7 (Table 1), the OLI imager made the following adjustments: (1) the wavelength of Band 5 was adjusted to 0.845-0.885 µm, thus eliminating the influence of water vapor absorption at 0.825 µm; (2) the Band 8 panchromatic band is narrower, so that vegetated and non-vegetated areas can be distinguished better; and (3) the newly added blue Band 1 (0.433-0.453 µm) is mainly used for the observation of coastal zones. We selected OLI images on GEE; because of the large amount of cloud cover in spring and autumn, the images are mainly from summer. To ensure data quality, we utilized the minimum cloud cover synthesis algorithm provided by GEE to preprocess and generate the required images [54]. To facilitate built-up area extraction at 15-m resolution, the first seven bands (Bands 1-7) of the Landsat 8 OLI images were up-sampled to 15 m using nearest-neighbor sampling. We then clipped the image to a size of 12,000 × 12,000 pixels. As shown in Figure 1, the false color composite (Bands 4, 6, and 7) showed that the quality of the data was good and met the requirements. To facilitate the map display and express the details more clearly, we chose two representative regions, A and B, each 1000 × 1000 pixels. Region B is an urban central region with densely distributed buildings, while region A is a suburban region with sparsely distributed buildings. However, we still considered the whole research area as the analysis object when conducting the experiments.

Massive Samples Automatically Selected from Built-Up Production
The training and testing samples were automatically selected from the 38-m global built-up product of the ESA in 2014 [53]. A large number of sample points were automatically generated, filtered, and corrected. As shown in Figure 2, the detailed process included three steps: (1) randomly selecting 20,000 sample points in each experimental area; (2) using the buildings and water datasets of OpenStreetMap (OSM) in China and the MOD13Q1-NDVI data to filter and correct the selected sample points; and (3) combining with ArcGIS Online imagery for manual correction. The aim was to change built-up sample points falling in vegetated areas or water bodies into non-built-up sample points and to change non-built-up sample points falling in built-up areas into built-up sample points. Finally, sample points of the built-up area and non-built-up area were obtained. The sample points after filtering and correction were divided, with stratification, into training and test samples at a proportion of 6:4, yielding a total of 11,499 training samples and 7667 test samples. In Figure 1, the yellow points represent built-up samples and the blue points represent non-built-up samples.
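The stratified 6:4 split can be reproduced with scikit-learn; a minimal sketch, assuming the filtered and corrected sample points and their binary labels are stored in hypothetical .npy files:

```python
# A minimal sketch of the stratified 6:4 split (1 = built-up,
# 0 = non-built-up). The file names are assumptions.
import numpy as np
from sklearn.model_selection import train_test_split

samples = np.load("sample_points.npy")   # assumed: (n, d) feature array
labels = np.load("sample_labels.npy")    # assumed: (n,) binary labels

# Stratify so both classes keep the same proportions in each subset.
X_train, X_test, y_train, y_test = train_test_split(
    samples, labels, test_size=0.4, stratify=labels, random_state=0)
print(len(X_train), len(X_test))         # about 11,499 and 7,667 here
```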

Research Methods
In this paper, the accuracy and efficiency of extracting a 15-m resolution built-up area based on a single pixel and an image patch were compared and analyzed comprehensively from the two perspectives of feature engineering and feature learning. As shown in Table 2, we proceeded with four classification setups: (1) single-pixel classification under feature engineering, with pre-set features comprising the original eight-band spectrum, the built-up remote sensing indices (NDBI and IBI), the enhanced morphological building index (EMBI), the built-up presence index (PanTex), and the mean GLCM texture features, five feature sets in total (the classifiers were SVM, RF, and AdaBoost); (2) image patch classification under feature engineering, using the original eight bands and considering each pixel within 3 × 3, 5 × 5, and 7 × 7 neighborhood patches, that is, generating a new feature vector and then classifying (the classifiers were again SVM, RF, and AdaBoost); (3) single-pixel classification under feature learning, in which a one-dimensional CNN was used to learn the spectral features from the eight-band vector of a single pixel and then classify; and (4) image patch classification under feature learning, using the original eight bands within a 5 × 5 neighborhood patch and a two-dimensional CNN that learns the spectral and spatial relationship features simultaneously. Thus, the built-up area was distinguished.

Pixel-Based Classification
For single-pixel classification, spectral (original eight bands (OS8) and the built-up remote sensing indices IBI and NDBI (RSBI)), morphological (EMBI), and textural features (PanTex and texture from the gray-level co-occurrence matrix (GLCM), named TEGL) were considered in this paper. The EMBI and textural features were computed from the panchromatic band, while the NDBI and IBI in RSBI were calculated from the multi-spectral bands. The panchromatic band is very important for the extraction of the 15-m resolution built-up area, and to make the feature dimension greater than 1, RSBI, EMBI, and PanTex included the panchromatic band. As shown in Figure 3, the range of values of each feature was different; to train the classifiers better, all the features were normalized to the range of 0-1 and then linearly stretched to 0-255 as UINT8. We applied SVM, RF, and AdaBoost to realize the classification for a single pixel. There were 15 classification results based on three kinds of classifiers and five kinds of features.
(1) Spectrum OS8: The first seven bands, which were up-sampled to 15 m, were stacked with the panchromatic band to form eight-band data as the original spectral feature.
RSBI: Buildings have unique spectral characteristics. Through the combination and operation of different bands, a remote sensing index that can characterize building information was obtained. In [18], Zha et al. proposed a method based on the normalized difference built-up index (NDBI) to automate the process of mapping built-up areas. Built-up areas were effectively mapped through the arithmetic manipulation of the NDBI (Equation (1)), derived from the near-infrared (NIR) and short-wave infrared (SWIR1) bands:

NDBI = (SWIR1 - NIR) / (SWIR1 + NIR), (1)

where SWIR1 is Band 6 of Landsat 8 and NIR is Band 5.
In [19], a new index derived from existing indices, the index-based built-up index (IBI), was proposed for the rapid extraction of built-up land features in satellite imagery. The IBI is distinguished from conventional indices by its first-time use of thematic index-derived bands, including RED, GREEN, NIR, and SWIR1, to construct an index rather than using the original image bands. Built-up areas are effectively extracted by setting an appropriate threshold for the IBI. The IBI is calculated using Equation (2):

IBI = (T - U) / (T + U), with T = 2·SWIR1/(SWIR1 + NIR) and U = NIR/(NIR + RED) + GREEN/(GREEN + SWIR1), (2)

where SWIR1 is Band 6 of Landsat 8, NIR is Band 5, RED is Band 4, and GREEN is Band 3.
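For reference, a NumPy sketch of Equations (1) and (2); green, red, nir, and swir1 are assumed reflectance arrays for Bands 3-6 resampled to 15 m, and the eps guard is an implementation detail rather than part of the original indices:

```python
# A sketch of the NDBI (Equation (1)) and the expanded IBI
# (Equation (2)) with NumPy.
import numpy as np

eps = 1e-6  # guard against division by zero

def ndbi(nir, swir1):
    # Equation (1)
    return (swir1 - nir) / (swir1 + nir + eps)

def ibi(green, red, nir, swir1):
    # Equation (2): expanded form of the index-based built-up index
    t = 2 * swir1 / (swir1 + nir + eps)
    u = nir / (nir + red + eps) + green / (green + swir1 + eps)
    return (t - u) / (t + u + eps)
```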
(2) Morphology: Pan + EMBI. Referring to the study of Huang and Zhang [60], the EMBI (Equation (3)), which is regarded as a characteristic feature of a building object, is the mean value of the multi-directional and multi-scale differential morphological sequence:

EMBI = (1 / (D_N × S_N)) Σ_{i=1..D_N} Σ_{j=1..S_N} DMP_W-TH(d_i, s_j), (3)

where DMP_W-TH(d_i, s_j) denotes the white-top-hat-based differential morphological profile for direction d_i and size s_j of the structural elements, and D_N and S_N denote the number of directions and sizes of the structural elements, respectively. Considering the building size in a 15-m resolution panchromatic image, we set the size of the linear structural element from 1 pixel to 6 pixels and the direction from 10° to 180°, so D_N = 18 and S_N = 6, giving 108 linear structural elements. As shown in Figure 3, the EMBI is calculated based on these linear structural elements.
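One plausible implementation of Equation (3) with scikit-image is sketched below: white top-hat transforms over line-shaped structuring elements, differenced across consecutive scales and averaged. The linear_se helper and the exact differencing scheme are assumptions, as the text fixes only the sizes (1-6 pixels) and directions (10°-180°):

```python
# A sketch of the EMBI (Equation (3)) under the assumptions above.
import numpy as np
from skimage.draw import line
from skimage.morphology import white_tophat

def linear_se(length, angle_deg):
    """Binary line-shaped structuring element of given length/direction."""
    theta = np.deg2rad(angle_deg)
    dr = int(round((length - 1) * np.sin(theta)))
    dc = int(round((length - 1) * np.cos(theta)))
    se = np.zeros((abs(dr) + 1, abs(dc) + 1), dtype=np.uint8)
    rr, cc = line(max(-dr, 0), max(-dc, 0), max(dr, 0), max(dc, 0))
    se[rr, cc] = 1
    return se

def embi(pan, sizes=(1, 2, 3, 4, 5, 6), angles=range(10, 181, 10)):
    pan = pan.astype(np.float64)
    dmp = []
    for a in angles:                     # D_N = 18 directions
        prev = white_tophat(pan, linear_se(sizes[0], a))
        for s in sizes[1:]:              # consecutive-scale differences
            cur = white_tophat(pan, linear_se(s, a))
            dmp.append(np.abs(cur - prev))
            prev = cur
    return np.mean(dmp, axis=0)          # mean of the DMP sequence
```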

Built-Up Presence Index: Pan + PanTex
Based on the high local contrast of buildings, a texture calculation method for the built-up area presence index (PanTex) was proposed by Pesaresi et al. [20]. Their method was slightly adjusted in this paper. For the panchromatic image, the gray-level co-occurrence matrix (GLCM) [61] of each of the 12 displacement vectors shown in Figure 4 was calculated in a sliding window of 5 × 5. Then, based on each GLCM, the contrast texture statistic was computed. Finally, the maximum of the 12 contrast features over all displacement vectors was taken as the value of the center pixel of the sliding window. The PanTex was calculated using Equation (4):

PanTex = max_{i=1,...,12} CON(GLCM_w(V_i)), (4)

where w = 5 indicates the size of the analysis window (5 × 5), V_i represents the displacement vector used to calculate the GLCM, the maximum value of i is 12, and CON represents the contrast characteristic calculated from the GLCM. The CON is calculated using Equation (5):

CON = Σ_i Σ_j (i - j)² P(i, j), (5)

where i and j are the discrete values in the row and column directions of the GLCM and P(i, j) is the corresponding entry of the GLCM.
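A sketch of Equations (4) and (5) for one 5 × 5 window using scikit-image follows; the exact set of 12 displacement vectors is an assumption based on Figure 4 and [20]:

```python
# PanTex for one 5 x 5 uint8 patch of the panchromatic band: one GLCM
# per displacement vector, contrast per GLCM, maximum over vectors.
import numpy as np
from skimage.feature import graycomatrix, graycoprops

VECTORS = [(1, 0), (0, 1), (1, 1), (1, -1), (2, 0), (0, 2),
           (2, 1), (2, -1), (1, 2), (1, -2), (2, 2), (2, -2)]

def pantex_window(win):
    cons = []
    for dr, dc in VECTORS:
        glcm = graycomatrix(win, [np.hypot(dr, dc)], [np.arctan2(dr, dc)],
                            levels=256, symmetric=True, normed=True)
        cons.append(graycoprops(glcm, "contrast")[0, 0])  # Equation (5)
    return max(cons)                                      # Equation (4)
```

In practice, pantex_window is evaluated at every pixel by sliding the 5 × 5 window across the panchromatic band.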
Five Mean Texture Features: Texture from GLCM
ASM: The angular second moment (ASM) reflects the uniformity of the gray-level distribution and the coarseness of the texture [62]. The value of the ASM depends on the distribution of the elements in the GLCM. If the values of all the elements tend to be the same, the ASM value is small; if the values of the elements differ greatly and are distributed more centrally, the ASM value is large. A large ASM value means that the texture pattern is more uniform and regular.
CON: The contrast (CON) reflects the clarity of the image and the depth of the texture [64]. It describes how the values of the GLCM elements are distributed and the local variation in the image, that is, the moment of inertia near the main diagonal of the GLCM. The farther the large-valued elements lie from the diagonal of the GLCM, the greater the contrast.
COR: The correlation (COR) measures the similarity of the GLCM elements in the row or column direction; thus, the magnitude of the correlation reflects the local gray-level correlation in the image [62][63][64]. When the values of the matrix elements are uniform and equal, the correlation is large; on the contrary, if the matrix values differ greatly, the correlation is small. If there are horizontal textures in the image, the COR of the horizontal-direction matrix is larger than the COR of the other matrices. The COR is calculated as

COR = Σ_i Σ_j (i - µ1)(j - µ2) P(i, j) / (σ1 σ2),

where µ1, µ2, σ1, and σ2 are respectively given by:

µ1 = Σ_i i Σ_j P(i, j), µ2 = Σ_j j Σ_i P(i, j), σ1² = Σ_i (i - µ1)² Σ_j P(i, j), σ2² = Σ_j (j - µ2)² Σ_i P(i, j).

ENT: Entropy (ENT) measures the amount of information in an image and represents the degree of non-uniformity or complexity of the image texture [62][63][64]; it is also a random measure of the image information. The entropy is largest when the elements of the GLCM have maximum randomness, that is, when all the values of the GLCM are almost equal and its elements are dispersed.
HOM: The homogeneity (HOM) reflects the homogeneity of the image texture and measures its local variation [63][64][65]. A large value indicates a lack of variation among different regions of the image texture and a very uniform local distribution.
For each pixel of the 15-m resolution panchromatic gray image, the GLCMs of all the displacement vectors were calculated over the 5 × 5 neighborhood window. Then, the ASM, CON, COR, ENT, and HOM corresponding to each GLCM were calculated, and their averages were taken. Five GLCM-based texture features in the 5 × 5 neighborhood were finally obtained: mean-ASM, mean-CON, mean-COR, mean-ENT, and mean-HOM.
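The five mean texture features can be sketched in the same framework; the displacement vectors are assumed to be the same 12 vectors as PanTex, and ENT is computed directly from the normalized GLCM because scikit-image's graycoprops does not provide it:

```python
# The five mean GLCM texture features for one 5 x 5 uint8 window.
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def mean_textures(win, vectors):
    feats = {"ASM": [], "CON": [], "COR": [], "ENT": [], "HOM": []}
    for dr, dc in vectors:
        glcm = graycomatrix(win, [np.hypot(dr, dc)], [np.arctan2(dr, dc)],
                            levels=256, symmetric=True, normed=True)
        p = glcm[:, :, 0, 0]             # normalized co-occurrence matrix
        feats["ASM"].append(graycoprops(glcm, "ASM")[0, 0])
        feats["CON"].append(graycoprops(glcm, "contrast")[0, 0])
        feats["COR"].append(graycoprops(glcm, "correlation")[0, 0])
        feats["HOM"].append(graycoprops(glcm, "homogeneity")[0, 0])
        feats["ENT"].append(-np.sum(p[p > 0] * np.log2(p[p > 0])))
    return {k: float(np.mean(v)) for k, v in feats.items()}
```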

Patch-Based Classification
For each pixel, we considered its neighborhood windows of 3 × 3, 5 × 5, and 7 × 7 and input the pixel patch into the classifier, which is equivalent to increasing the feature dimension. For the original eight-band images, the feature dimensions of the patches in the 3-, 5-, and 7-neighborhoods are 72, 200, and 392, respectively. We again applied SVM, RF, and AdaBoost to classify the image patches. There were nine classification results based on three types of classifiers and three image patch sizes.
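A minimal sketch of this patch flattening, assuming image is an (H, W, 8) array and rows/cols index the labeled sample pixels:

```python
# Unroll the k x k neighborhood of each labeled pixel into one feature
# vector (k = 3, 5, 7 gives 72, 200, and 392 dimensions).
import numpy as np

def patch_features(image, rows, cols, k):
    r = k // 2
    padded = np.pad(image, ((r, r), (r, r), (0, 0)), mode="reflect")
    return np.asarray([padded[i:i + k, j:j + k, :].ravel()
                       for i, j in zip(rows, cols)])
```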

Classification Algorithm
The main idea of SVM [23,58] is to establish an optimal decision hyperplane that maximizes the distance between the nearest samples of the two classes on either side of the plane, providing good generalization ability for classification problems. RF is a parallel ensemble classification algorithm, while AdaBoost is a serial one. The essence of RF is an improvement of the decision tree algorithm: it merges multiple decision trees, and the establishment of each tree depends on independently drawn samples [59,65]. The core idea of AdaBoost is to train different weak classifiers on the same training set and then assemble these weak classifiers into a stronger final classifier. In this paper, we used the three classifiers (svm.SVC, ensemble.RandomForestClassifier, and ensemble.AdaBoostClassifier) provided by the Python sklearn module. Referring to [12][13][14][51] and based on theoretical analysis and experiments, the parameters of the three classifiers are shown in Table 3.
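A sketch of the corresponding sklearn setup; C = 10 and gamma = 100 follow the SVM settings discussed later, while the RF and AdaBoost settings shown here are placeholders for the values in Table 3 (not reproduced):

```python
# The three sklearn classifiers used in this paper; X_train/X_test and
# y_train/y_test come from the 6:4 sample split above.
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

classifiers = {
    "SVM": SVC(C=10, gamma=100, kernel="rbf"),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=100, random_state=0),
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(name, clf.score(X_test, y_test))   # overall accuracy
```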

Feature Learning
The original image was input into the CNN model in the form of a three-dimensional pixel patch or a single-pixel sequence. For the input layer, the up-sampled seven bands were stacked with the panchromatic band. Through the convolution and pooling layers of the CNN, multi-level features of built-up and non-built-up areas could be learned automatically. For single-pixel classification, a one-dimensional convolution (CNN_1D) was utilized to learn the spectral features. For image patch classification, a two-dimensional convolution (CNN_2D) was applied to learn the spectral and spatial relations simultaneously. The loss function was cross-entropy, and the categories were determined by the softmax layer. We compared the accuracy and efficiency of the one-dimensional convolution on a single pixel with those of the two-dimensional spatial convolution in the neighborhood. We applied the Python Keras module to build the CNNs and combined it with the sklearn module to realize the classification and accuracy evaluation.

CNN_1D Classification
For each pixel, only the spectral information was taken into account, without considering the spatial relationship between pixels. As shown in Figure 5, there were two one-dimensional convolution (Conv1D) layers realizing convolution along the band direction, which is equivalent to a complex band operation. We then used three fully connected layers. To prevent overfitting, one batch-normalization layer and one dropout layer were added. The output layer consisted of a softmax operator, which output two categories. In the whole network, we used the rectified linear unit (ReLU) activation function to alleviate the vanishing gradient problem during training.
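A minimal Keras sketch of this architecture; the filter counts, kernel sizes, and layer widths are assumptions, since the text fixes only the layer types:

```python
# CNN_1D sketch: two Conv1D layers over the eight-band vector, three
# fully connected layers, batch normalization, dropout, softmax.
from tensorflow import keras
from tensorflow.keras import layers

cnn_1d = keras.Sequential([
    keras.Input(shape=(8, 1)),                 # one pixel, eight bands
    layers.Conv1D(32, 3, padding="same", activation="relu"),
    layers.Conv1D(64, 3, padding="same", activation="relu"),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.BatchNormalization(),
    layers.Dropout(0.5),
    layers.Dense(64, activation="relu"),
    layers.Dense(2, activation="softmax"),     # built-up / non-built-up
])
cnn_1d.compile(optimizer="adam", loss="categorical_crossentropy",
               metrics=["accuracy"])
```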

CNN_2D Classification
In a 15-m resolution image, the size of a building is generally less than five pixels. For each pixel, the 5 × 5 neighborhood was considered, which means that the size of the input image patch was 5 × 5 × 8. Therefore, as shown in Figure 6, an image patch with eight bands and a 5 × 5 neighborhood centered on each sample was input into the neural network. The convolutional part comprised two two-dimensional convolution (Conv2D) layers and two max-pooling layers, which aimed to extract spectral, spatial, and higher-level features. We then used three fully connected layers. Meanwhile, one batch-normalization layer and one dropout layer were added to prevent overfitting. The output layer consisted of a softmax operator, which output two categories. In the whole network, we also used ReLU as the activation function.
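A minimal Keras sketch of CNN_2D under the same caveat: filter counts and kernel sizes are assumptions, with only the layer types fixed by the text:

```python
# CNN_2D sketch: two Conv2D + two max-pooling layers over the
# 5 x 5 x 8 patch, then three fully connected layers with batch
# normalization and dropout.
from tensorflow import keras
from tensorflow.keras import layers

cnn_2d = keras.Sequential([
    keras.Input(shape=(5, 5, 8)),              # 5 x 5 patch, eight bands
    layers.Conv2D(32, (2, 2), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (2, 2), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.BatchNormalization(),
    layers.Dropout(0.5),
    layers.Dense(64, activation="relu"),
    layers.Dense(2, activation="softmax"),
])
cnn_2d.compile(optimizer="adam", loss="categorical_crossentropy",
               metrics=["accuracy"])
```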

Experimental Results and Evaluation
In total, 26 built-up land cover results were produced. Considering the large extent of the study area, the large amount of map data, and the reduction of the mapping resolution, we cut out the results of two small areas (A and B) for map display. As shown in Figure 7, for feature engineering based on a single pixel, 30 maps were produced using combinations of five features (OS8, RSBI, EMBI, PanTex, and TEGL) and three classifiers (SVM, RF, and AdaBoost). For feature engineering based on image patches, 18 maps were produced using combinations of three kinds of neighborhoods (P3, P5, and P7) and three classifiers (SVM, RF, and AdaBoost). For feature learning, four maps were produced, including the results of the one-dimensional convolution on single pixels and the two-dimensional convolution on image patches. The main aim of this paper was to compare the accuracy and efficiency of extracting a 15-m resolution built-up area based on a single pixel and an image patch in the two cases of feature engineering and feature learning. For completeness, all 52 classified maps for sub-regions A and B were extracted and are provided in Figures 7 and 8. For qualitative comparison, we compared the results under all conditions with those of Global-Urban-2015 [56] and GlobalLand30 [52].
A total of 7667 test samples were used for the accuracy evaluation. The test samples were classified to obtain the predictive label of each sample, and the confusion matrix shown in Table 4 was then obtained from the real and predictive labels. The overall accuracy (OA), recall, and precision were calculated based on the confusion matrix: OA represents the proportion of correctly predicted samples among all the samples, recall indicates the proportion of predicted built-up samples among all the true built-up samples, and precision indicates the proportion of true built-up samples among all the predicted built-up samples. These three indices were used to comprehensively evaluate the accuracy of the built-up area extraction. The OA, Recall, and Precision were calculated using Equations (11)-(13), respectively:

OA = (TP + TN) / (TP + TN + FP + FN), (11)
Recall = TP / (TP + FN), (12)
Precision = TP / (TP + FP), (13)

where the meanings of TP, TN, FN, and FP are shown in Table 4.
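These three indices can be computed directly from the sklearn confusion matrix; a minimal sketch, assuming y_test and y_pred hold the true and predicted labels of the 7667 test samples (1 = built-up):

```python
# Equations (11)-(13) from the confusion matrix.
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
oa = (tp + tn) / (tp + tn + fp + fn)   # Equation (11)
recall = tp / (tp + fn)                # Equation (12)
precision = tp / (tp + fp)             # Equation (13)
print(f"OA={oa:.3f} Recall={recall:.3f} Precision={precision:.3f}")
```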

Feature Engineering and Feature Learning
From the perspectives of feature engineering and feature learning: with feature engineering, the classification results were highly correlated with the choice of features, and appropriate features were conducive to improving the classification accuracy. Feature learning, by contrast, does not require manual feature design: a CNN can automatically learn multi-level features from the original image and then achieve classification as a black-box operation. As shown in Figures 7 and 8 and Tables 5 and 6, the classification accuracy based on feature learning is generally better than that based on feature engineering. However, in feature engineering, when the original eight bands were considered with a neighborhood and the classifier was RF, the overall accuracy reached 90%, which is comparable to the results of CNN_2D.

Single Pixel and Image Patch
Classification based on a single pixel considers only the feature vector of that pixel, ignoring the spatial relationships between pixels in the image plane. As shown in Figure 7 and Table 5, overall, the classification effect and accuracy based on image patches were better than those based on single pixels; however, the feature dimension of an image patch is large, and there may be feature redundancy. When the training samples are numerous, a more complex classification model is needed. As shown in Figure 9 and Table 6, the CNN based on image patches had a significant advantage over the CNN based on single pixels. (In Figure 9, the first column shows ESRI online images; the second and third columns show the classification results of CNN_1D and CNN_2D; and the fourth and fifth columns show GlobalLand30 and UrbanLand, respectively.)

Under feature engineering, the accuracy of classification based on a single pixel was significantly lower than that based on an image patch. Comparing OS8, RSBI, EMBI, PanTex, and TEGL, the order of OA and recall from high to low was as follows: OS8, PanTex, EMBI, RSBI, and TEGL. The original spectrum (OS8) had the best effect, and the OA of OS8 and PanTex was higher than 82%. The analysis shows that the feature dimension is not necessarily related to the improvement of classification accuracy; features that highlight the information of the target category can help improve classification accuracy. Combining Figures 4 and 7, PanTex and EMBI can effectively distinguish built-up areas, while RSBI and TEGL cannot reflect buildings well. In particular, the five texture features under TEGL are redundant and conflicting; in this five-dimensional feature space, it is difficult to learn an appropriate classification boundary, which leads to a poor classification effect.
As shown in Figures 8 and 10, the built-up area extraction based on CNN_2D performed the best. We found many details in the results of CNN_2D that were missing in the other two products (GlobalLand30 and Global-Urban-2015). One reason is that the result of CNN_2D was produced from satellite images with a higher spatial resolution. Consequently, within the urban area, non-built-up areas (e.g., water and vegetation among dense buildings) could be discriminated from built-up areas in the Landsat 8 images. Meanwhile, in the suburbs, small built-up areas and narrow roads became distinguishable from the background. Another reason is the higher classification accuracy of the CNN.

Classification Strategy
From the perspective of classification strategy, compared with traditional machine learning algorithms such as SVM, RF, and AdaBoost, the CNN has the advantages of autonomous learning, stability, and robustness. Additionally, the CNN can learn the dual characteristics of spectrum and spatial structure as a black box; users can transfer the trained network structure and only need to focus on input and output. As shown in Figures 7 and 8, the classification accuracy of P5-RF and P7-RF differed little from that of CNN_2D and was far superior to the other classification results. The CNN has batch-normalization and dropout structures, which can prevent overfitting; as convolutional and pooling layers are added, the network becomes increasingly complex and gains stronger fitting and predicting ability. Ensemble classifiers (such as RF and AdaBoost), which synthesize the prediction results of all the base classifiers and determine the final category by voting, can effectively prevent overfitting and retain high classification accuracy when the feature dimension is high. However, SVM is more suitable for small-sample learning. When the number of samples is too large and the feature dimension is high, most of the training samples are regarded as support vectors, resulting in overfitting, and the final classification accuracy is very low, even worse than a random guess. The OAs of P3-SVM, P5-SVM, and P7-SVM were 0.730, 0.525, and 0.525, respectively, while the training accuracy was 1 in all cases. Overfitting clearly occurred and led to classification failure.

Time Complexity
In terms of model training time, the classifiers based on feature engineering took significantly less time than the models based on feature learning. As shown in Table 7 and Figure 11, the training time of the classifiers based on feature engineering was less than 300 s, while that of the models based on feature learning was more than 400 s. For feature engineering, the training time of the classifiers based on image patches (P3, P5, and P7) was significantly longer than that of the classifiers based on single pixels; the lower the feature dimension, the shorter the training time. Comparing the three classifiers, the training time of SVM was significantly longer than that of RF and AdaBoost. When the feature dimension was very low (below three), the training time of RF and AdaBoost was extremely short. For feature learning, once the models were stable, the training times of the CNN_1D and CNN_2D models were about 400 s and 1000 s, respectively, so the time complexity of CNN_2D was higher than that of CNN_1D.

Support Vectors of SVM
As shown in Figures 7 and 9 and Tables 5 and 8, P3-SVM, P5-SVM, and P7-SVM were overfitted: the numbers of support vectors among the 11,499 training samples were 11,431, 11,485, and 11,493, respectively. Almost all the training samples were regarded as support vectors, so the trained model was too complex, had poor generalization ability, and was unable to accurately predict unseen data. In the above experiments, the key parameters, the penalty coefficient (C) and gamma, were set to 10 and 100, respectively; these settings were based on prior knowledge and experimental trials. When the features were OS8, RSBI, EMBI, PanTex, and TEGL, the classification results met expectations, and no overfitting was observed. When considering the 3-, 5-, and 7-neighborhoods, the numbers of features were 72, 200, and 392, respectively; the feature dimension increased significantly, and these high-dimensional features had greater correlation and redundancy, resulting in overfitting of the SVM. Analyzing the principle of the SVM classification algorithm, the overfitting was mainly due to the irregular distribution and clustering of the training data in the feature space, which caused a large number of samples to become support vectors and made the classification boundary very complex. Therefore, we utilized the L2 normalization method provided by Python sklearn to process the original eight-band data, eliminating noise and scatter in the data, and then classified with the SVM. The results show that the overfitting was effectively suppressed. As shown in Tables 8 and 9, after L2 normalization, the numbers of support vectors corresponding to OS8, P3, P5, and P7 were significantly reduced, and the OAs were 0.800, 0.832, 0.858, and 0.874, respectively. The classification effect improved significantly, which agreed with logic and expectations.
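A sketch of this L2 step with sklearn, assuming X_train and X_test are the patch feature matrices described above:

```python
# Scale each feature vector to unit Euclidean norm before retraining
# the SVM; this is the overfitting countermeasure described above.
from sklearn.preprocessing import normalize
from sklearn.svm import SVC

X_train_l2 = normalize(X_train, norm="l2")    # per-sample unit norm
X_test_l2 = normalize(X_test, norm="l2")

svm = SVC(C=10, gamma=100).fit(X_train_l2, y_train)
print(len(svm.support_))                      # far fewer support vectors
print(svm.score(X_test_l2, y_test))
```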

CNN_2D versus VGG16
We compared the results of CNN_2D with VGG16 [40]. As shown in Figure 12, we reserved the weights of the convolution and pooling layers of VGG16 and reset the top layers, including the fully connected layers, the batch-normalization layer, and the softmax layer. Because the original image has eight bands and cannot be directly input into VGG16, we fused the panchromatic and multi-spectral bands by Gram-Schmidt pan sharpening to obtain a fused image with 15-m resolution and seven bands. Then, we selected three bands in two ways: (1) the first three principal components after principal component analysis; and (2) Bands 4, 3, and 2, representing red, green, and blue (RGB), taken directly. VGG16 requires three-channel input data with a size greater than 48 × 48 (determined by the parameters of the VGG16 model in python-keras), so the 5 × 5 neighborhood was up-sampled to 50 × 50 by nearest-neighbor sampling.
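A minimal Keras sketch of this transfer setup; the new top-layer sizes are assumptions, and patch5 stands for a hypothetical 5 × 5 × 3 input patch:

```python
# VGG16 transfer sketch: frozen ImageNet convolution/pooling weights,
# a new fully connected head, and nearest-neighbor up-sampling of the
# 5 x 5 three-channel patch to 50 x 50.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.applications import VGG16

base = VGG16(weights="imagenet", include_top=False, input_shape=(50, 50, 3))
base.trainable = False                  # reserve the conv/pool weights

vgg_model = keras.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.BatchNormalization(),
    layers.Dropout(0.5),
    layers.Dense(2, activation="softmax"),
])

patch5 = np.zeros((5, 5, 3), dtype=np.float32)            # placeholder
patch50 = np.repeat(np.repeat(patch5, 10, axis=0), 10, axis=1)
```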
We set the ratio of training samples to validation samples to 6:4 for training the proposed CNN and VGG16. The accuracy and loss during training are shown in Figure 13, and we recorded the training accuracy, training loss, test accuracy, test loss, and training time over 200 epochs. Table 10 shows that the accuracy of CNN_2D was significantly better than that of VGG16 and that the training time was greatly shortened. As shown in Figure 14, the classification result of CNN_2D was clearly better than that of VGG16, and the extraction of built-up areas was more detailed and accurate.
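A minimal sketch of this training protocol, assuming the model from the previous sketch and random placeholder patches in place of the real samples, is:

import time
import numpy as np
from tensorflow.keras.utils import to_categorical

X = np.random.rand(500, 50, 50, 3).astype("float32")            # placeholder patches
y = to_categorical(np.random.randint(0, 2, 500), num_classes=2)  # placeholder labels

start = time.time()
# validation_split=0.4 realizes the 6:4 ratio of training to validation samples.
history = model.fit(X, y, validation_split=0.4, epochs=200, batch_size=64, verbose=0)
print("training time (s):", time.time() - start)
print("final validation accuracy:", history.history["val_accuracy"][-1])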

Conclusions
This paper presented a unique investigation providing a full evaluation of OLI imagery for 15-m resolution built-up area classification from two viewpoints. First, traditional feature engineering and modern feature learning strategies were compared. Next, the influence of a single pixel versus an image patch was examined. In contrast, previous studies have generally tended to conduct limited comparisons between, for instance, coarse and fine resolution or pixel- and object-based classification. First, our training samples were automatically selected and filtered based on existing product data. Then, we made a multi-level, all-around comparison from two different perspectives: (1) single pixel versus image patch; and (2) feature engineering versus feature learning. In feature engineering, we took into account the spectral, morphological, textural, and other characteristics; previous studies have not offered such a detailed and comprehensive treatment. Finally, our work was conducted on a relatively large image area (the city of Beijing, China, and its surroundings), ensuring that urban land cover information is generated at a scale of practical value, whereas earlier experiments have often been limited to very small, local areas. All the tests were evaluated with the same set of test samples using overall accuracy and the Kappa coefficient.
The results can be summarized as follows: (1) The classification accuracy based on feature learning is generally better than that based on feature engineering. However, in feature engineering, when the original eight bands are combined with neighborhood information and the classifier is RF, the overall accuracy reaches 90%, which is comparable to the results of CNN_2D. (2) The classification effect and accuracy based on image patches are better than those based on single pixels; however, the feature dimension of an image patch is large, and there may be feature redundancy. When the training samples are numerous, a more complex classification model is needed. CNNs based on image patches have a significant advantage over CNNs based on single pixels. In the results of CNN_2D, water and vegetation within dense building areas can be discriminated from built-up areas in the Landsat 8 images; meanwhile, in the suburbs, small built-up areas and narrow roads become distinguishable from the background. (3) Compared with traditional machine learning algorithms such as SVM, RF, and AdaBoost, CNN has the advantages of autonomous learning, stability, and robustness. The classification accuracy of P5-RF and P7-RF differs little from that of CNN_2D and is far superior to the other classification results. The accuracy of CNN_2D is significantly better than that of VGG16. L2 regularization can eliminate the noise and scattering of the original eight-band data, effectively suppress SVM overfitting, and significantly reduce the number of support vectors.
The results of this paper can serve as a reference for the extraction and mapping of built-up areas at 15-m resolution over large regions. The comprehensive comparison of classification algorithms can help researchers in remote sensing image pattern recognition understand the principles and applicability of each algorithm and better conduct scientific research. In this paper, a large number of samples were selected automatically from existing data products, which is of great significance for improving the efficiency and effectiveness of supervised classification and can save considerable manpower and time. At the same time, this research has some limitations: multi-scale remote sensing data (low-, medium-, and high-resolution) were not used for a comparative analysis of built-up area extraction, the spatial relationships of the pixels within an image patch were not analyzed in depth, and the hidden layers of the CNN were not visualized and analyzed in detail. We will study these problems in follow-up work and hope that more scholars will take an interest and become involved.

Figure 1 .
Figure 1. Research area and samples (Subregion B is an urban central region with dense buildings, while subregion A is a suburban region with sparse buildings).

Figure 3 .
Figure 3. Feature maps of different feature descriptors. ASM: angular second moment; CON: contrast; COR: correlation; ENT: entropy; and HOM: homogeneity. (A and B represent the two subregions.)

Figure 4 .
Figure 4. The 12 displacement vectors in the 5 × 5 window.

Five mean texture features (texture from the GLCM). ASM: the angular second moment (ASM) reflects the uniformity of the grayscale distribution and the coarseness of the texture [62]. The value of the ASM depends on the distribution of the elements in the GLCM: if all the elements tend toward the same value, the ASM is small; if the element values differ more and are concentrated, the ASM is large. A large ASM value indicates that the texture patterns are distributed more uniformly and regularly.
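As an illustration, the sketch below computes mean GLCM texture features for one window with scikit-image (graycomatrix/graycoprops, scikit-image >= 0.19); the displacement set used here (two distances by four angles) is an illustrative assumption rather than the paper's 12 displacement vectors, and entropy is computed directly from the normalized matrix because it is not a built-in graycoprops property in older releases.

import numpy as np
from skimage.feature import graycomatrix, graycoprops

def glcm_features(window, distances=(1, 2), angles=(0, np.pi/4, np.pi/2, 3*np.pi/4)):
    # Build a symmetric, normalized GLCM over the given displacement vectors.
    glcm = graycomatrix(window, distances=distances, angles=angles,
                        levels=256, symmetric=True, normed=True)
    feats = {
        "ASM": graycoprops(glcm, "ASM").mean(),
        "CON": graycoprops(glcm, "contrast").mean(),
        "COR": graycoprops(glcm, "correlation").mean(),
        "HOM": graycoprops(glcm, "homogeneity").mean(),
    }
    # Entropy from the normalized co-occurrence probabilities, averaged over displacements.
    p = glcm.astype(np.float64)
    feats["ENT"] = float(np.mean(-np.sum(p * np.log2(p + 1e-12), axis=(0, 1))))
    return feats

window = np.random.randint(0, 256, (5, 5), dtype=np.uint8)  # toy 8-bit 5 x 5 window
print(glcm_features(window))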

Figure 7 .
Figure 7. Classification results based on feature engineering. (The first five rows are the classification results of the three classifiers with the five feature sets. The last three rows are the classification results of the three classifiers with the three kinds of image patch.)

Figure 8 .
Figure 8. Classification results based on feature learning. (A and B represent the two subregions. The first column shows the online images from ESRI. The second and third columns show the classification results of CNN_1D and CNN_2D, respectively. The fourth and fifth columns show the results of GlobalLand30 and UrbanLand, respectively.)

Figure 9 .
Figure 9. Classification accuracies of the built-up area based on feature engineering.

Figure 10 .
Figure 10. Classification accuracies of the built-up area based on feature learning.

Figure 11 .
Figure 11. The training process of CNN_1D and CNN_2D (training time: s).

Figure 12 .
Figure 12. Transfer learning and fine-tuning of VGG16.

Figure 13 .
Figure 13. Accuracy and loss in the training process of the proposed CNN_2D, VGG16-PCA, and VGG16-RGB.

Figure 14 .
Figure 14. Results of built-up area extraction by CNN_2D, VGG16-PCA, and VGG16-RGB. (A and B represent the two subregions. The first column shows the online images from ESRI. The second column shows the classification results of CNN_2D. The third and fourth columns show the classification results of VGG16-PCA and VGG16-RGB, respectively.)

Table 1 .
Comparison of Landsat 7 and Landsat 8 satellite bands.

Table 2 .
Overall framework of method and technology.

Table 3 .
Parameters of the classifiers used in our experiments (Python sklearn).

Table 4 .
The representation of the confusion matrix for the test samples.

Table 5 .
Accuracy evaluation based on feature engineering.

Table 6 .
Accuracy evaluation based on feature learning.

Table 7 .
The training times of the different classifiers based on feature learning (unit: s).

Table 8 .
Number of support vectors of SVM.

Table 10 .
The accuracy and loss of CNN_2D, VGG16-PCA, and VGG16-RGB. (Columns: CNN strategy, training accuracy, OA, recall, precision, and training time (s).)