Mapping Impervious Surfaces in Town–Rural Transition Belts Using China's GF-2 Imagery and Object-Based Deep CNNs

Impervious surfaces play an important role in urban planning and sustainable environmental management. High-spatial-resolution (HSR) images containing pure pixels have significant potential for the detailed delineation of land surfaces. However, due to high intraclass variability and low interclass distance, mapping and monitoring impervious surfaces in complex town–rural areas using HSR images remains a challenge. The fully convolutional network (FCN) model, a variant of convolutional neural networks (CNNs), recently achieved state-of-the-art performance in HSR image classification applications. However, due to the inherent nature of FCN processing, it is challenging for an FCN to precisely capture the detailed information of classification targets. To solve this problem, we propose an object-based deep CNN framework that integrates object-based image analysis (OBIA) with deep CNNs to accurately extract and estimate impervious surfaces. We also adopt two widely used transfer learning techniques to expedite the training of deep CNNs. Finally, we compare our approach with conventional OBIA classification and state-of-the-art FCN-based methods, namely FCN-8s and U-Net, both of which are designed for pixel-wise classification and have achieved great success. Our results show that the proposed approach effectively identified impervious surfaces, with 93.9% overall accuracy, a clear improvement over the OBIA, FCN-8s, and U-Net methods. Our findings also suggest that the classification performance of the proposed method depends on the training strategy: transfer learning by fine-tuning achieves significantly higher accuracy than transfer learning by feature extraction. Our approach for the automatic extraction and mapping of impervious surfaces also lays a solid foundation for the intelligent monitoring and management of land use and land cover.


Introduction
Urban development, which has significantly changed land use and land cover (LULC) patterns over the past 30 years, typically involves the removal of natural surface cover and an increase in impervious surfaces [1]. Impervious surfaces mainly include artificial structures that eliminate water infiltration and soil moisture evaporation; these surfaces include rooftops, roads covered with asphalt and concrete, and parking lots. In recent years, impervious surfaces have been seen as an important indicator of urbanization and play an important role in natural environment assessment [2–4]. A high impervious surface ratio can cause heavy flooding and "urban heat island" effects, and may also adversely affect ecological environments. Thus, the accurate monitoring and estimation of impervious surfaces is critical for urban planning and sustainable environmental management.
With the increase in the availability of high-spatial-resolution (HSR) imagery, mapping impervious surfaces from HSR images has attracted increasing attention [5–10]. Object-based image analysis (OBIA), a new and evolving paradigm [11], reduces the effect of the high intraclass variability and low interclass distance in HSR imagery and has achieved high accuracy in information extraction from HSR images [12–14]. Object-based image classification approaches include two main steps: first, an image is divided into homogeneous and contiguous segments; second, classification is performed based on the attributes of the segments. However, since nearly all the features used are based on the statistical features of pixels or segments, which may exclude the intrinsic qualities of the land cover type from HSR pixels, it is impossible for these features to allow for high discrimination while maintaining robustness [15].
Over the past few years, deep convolutional neural networks (CNNs), which attempt to learn high-level feature representations in a hierarchical manner, have achieved state-of-the-art performance in computer vision, significantly outperforming other methods [16–18]. Thus, CNNs have been applied to remote sensing (RS) classification applications [15,19–21]. As deep CNNs require multidimensional inputs, a very simple method has been developed to predict each pixel in an image based on overlapping patches using a sliding-window search method [21–23]. However, this patch-wise procedure has a limited receptive field of a predefined size, and so objects that are obviously larger or smaller than the fixed size may be easily fragmented or misclassified as background. Although some studies try to improve performance using patches centered on superpixel segments as input [24,25], this is not a fundamental solution because of the inherent tradeoff between the fixed size of the receptive field and the varying size of meaningful semantic image objects in HSR imagery. A new trend in recent research is to employ fully convolutional network (FCN)-based approaches [26], which replace the fully connected layers in standard CNNs with convolutional layers for dense class map generation [27–29]. An FCN consists of an encoder structure and a decoder structure. The image is first converted to a low-resolution feature representation by the encoder and is then converted to pixel-wise predictions by the decoder. However, it is still challenging to restore the full detailed resolution of the input image during upsampling via learning, which may result in the loss of the detailed information available in HSR imagery [30]. This situation can be worse in complex rural environments because impervious surfaces usually cover a smaller area and are much more sparsely distributed than pervious surfaces. In particular, the capture of valuable edge information remains a challenge.
To solve this problem, it is necessary to combine the edge information provided by image segmentation with the feature learning capability of deep CNNs for information extraction from HSR images. Inspired by this idea, we have developed a framework and applied it to impervious surface extraction. In addition, we investigate two commonly used transfer learning techniques, which are used to expedite the training of deep CNNs. Finally, we compare our approach with conventional OBIA classification and with FCN-based approaches, namely the FCN-8s approach proposed in [26] and the U-Net approach proposed in [31], which are commonly used and have achieved success in image classification for remote sensing or natural images.

Study Area
We selected the Chongfu subdistrict, which is located in Tongxiang County in the northeastern part of Zhejiang Province, China (120°26′11″E, 30°32′48″N; see Figure 1), as our study area. The Chongfu subdistrict has an area of 110.44 km² and is representative of a typical town–rural pattern on the Hang-Jia-Hu plain. It is located in the subtropical monsoon climate zone and has abundant precipitation, distinct seasons, and an annual average temperature of 16.5 °C. Due to its favorable climate and physiognomy, the study area provides an ideal environment for agricultural development. In addition, due to its premium geographical location, essentially the geographical center of the Shanghai-Hangzhou-Suzhou (Huhangsu) triangle within the Huhangsu one-hour economic circle, and its comprehensive transportation network, the area has experienced a period of rapid development, during which various impervious surfaces were established through construction. Thus, its complex town–rural environment renders Tongxiang a suitable area for developing a robust method using HSR images to monitor impervious surfaces.
As shown in Figure 2, visual inspection of images of the entire area reveals that the impervious surfaces mainly comprise three categories: roads (including asphalt and concrete roads), rooftops (town buildings, industrial warehouses, and rural settlements), and other exposed impervious surfaces (squares, parking lots, and grain-basking fields). The other types of land cover in the study area represent pervious surfaces, such as water bodies (including rivers and ponds), vegetation (crops, shrubs, and trees), and bare land. To effectively extract impervious surfaces, we regard impervious and pervious surfaces as individual classes rather than detailed categories.

Materials and Methods
Our overall framework is shown in Figure 3. Following preprocessing and pan-sharpening, we first acquire semantically meaningful image objects by applying a segmentation algorithm to the imagery. Next, to meet the data format requirements of transfer learning and avoid abnormal gradients, standardization and normalization are conducted on every image object (a set of pixels). Subsequently, the image objects are randomly separated into three individual datasets that are used for training, validation, and testing. A pre-trained inception-resnet V2 model is employed for transfer learning with the training set for 30 epochs. We saved and evaluated the model after each epoch and selected the best model based on validation set performance. The model was then used to produce the final map of impervious surfaces. To verify whether our method effectively discriminates impervious surfaces, we compared it with conventional object-based nearest neighbor classification (NNC), FCN-8s, and U-Net.

Datasets and Preprocessing
We acquired imagery from the PMS sensor of Gaofen 2 (GF-2), comprising four multispectral bands (MSS) with a spatial resolution of 3.2 m and a panchromatic band (PAN) with a resolution of 0.8 m. The imagery of the entire study area was acquired on July 22, 2016, under cloud-free atmospheric conditions; thus, atmospheric correction was unnecessary for our LULC classification purposes during preprocessing [32,33]. The acquired MSS and PAN images were orthorectified into the universal transverse Mercator (UTM) projection system and fused using the Gram-Schmidt (GS) pan-sharpening method in ENVI (version 5.1, Exelis Visual Information Solutions, Boulder, CO, USA, 2014). We then calculated the normalized difference vegetation index (NDVI) and incorporated it into the fused images by using the layer stacking tool in ENVI. The NDVI values were calculated from surface reflectance, which was obtained by atmospheric correction with the Fast Line-of-sight Atmospheric Analysis of Spectral Hypercubes (FLAASH) model implemented in ENVI.
We also collected land-use change surveying maps from 2015 (provided by the Bureau of Land and Resources, Tongxiang) covering the entire study area as ancillary data for labeling the image for training and validation. Specifically, we carefully revised the map attributes by visual interpretation and ground surveying.

Multi-Resolution Segmentation
Following preprocessing and pan-sharpening, image segmentation was performed to produce semantically meaningful image objects. We adopted multiresolution segmentation (MRS), which is a widely used algorithm integrated into eCognition (version 9.0, Trimble Germany GmbH, Munich, Germany, 2014). MRS is a bottom-up region-merging method that iteratively merges small image objects into a larger image object until a heterogeneity threshold is reached [34]. MRS is controlled by three key criteria: scale parameter (SP), shape, and compactness. Choosing an optimal segmentation parameter is typically a subjective trial-and-error process; however, we adopted an objective method based on local variance (LV) to select the optimal SP based on an estimate provided by the Estimation of Scale Parameter 2 (ESP 2) tool [35]. The ESP 2 tool was used to iteratively perform segmentation in fixed step sizes and calculate the LV for each SP. Finally, we plotted the rate of change of the LV (ROC-LV) against the corresponding SP, with the peaks in the ROC-LV curve indicating the optimal SP [36,37]. Because impervious surfaces tend to have a regular shape and compact features, the weight of the shape and compactness criteria was set to 0.8. Meanwhile, we used four bands of GF-2 imagery and NDVI as input raster layers for the MRS algorithm, and assigned them the same weight of 1.
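The peak-picking step on the ROC-LV curve can be illustrated with a short sketch. It assumes the commonly used definition of ROC-LV as the percentage change in LV between consecutive scale levels; the function names and the LV values are hypothetical and are not taken from the ESP 2 tool or from this study.

```python
def roc_lv(lv_values):
    """Percentage rate of change of local variance between consecutive scale levels."""
    return [
        (curr - prev) / prev * 100.0
        for prev, curr in zip(lv_values, lv_values[1:])
    ]


def candidate_scales(scales, lv_values):
    """Return the scales at which ROC-LV is a local peak (above both neighbours)."""
    roc = roc_lv(lv_values)  # roc[i] belongs to scales[i + 1]
    peaks = []
    for i in range(1, len(roc) - 1):
        if roc[i] > roc[i - 1] and roc[i] > roc[i + 1]:
            peaks.append(scales[i + 1])
    return peaks


# Synthetic illustration only: LV generally grows with scale, with one local jump.
scales = [10, 20, 30, 40, 50, 60]
lv = [100.0, 110.0, 115.0, 125.0, 128.0, 130.0]
print(candidate_scales(scales, lv))
```

In practice the candidate peaks still need to be checked visually against the segmentation, since several peaks can occur along the curve.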

Standardization and Normalization
CNNs require a multidimensional array as input rather than a single feature vector. Therefore, we extracted multidimensional patches as samples as opposed to single pixels. To accomplish this goal, the image objects were first extracted as patches. These were padded with zero values and labeled with the attribute of the reference data region with which they overlap most. These patches consist of original and semantically meaningful image pixels from the HSR imagery; that is, they contain the most useful and essential information for each semantic category. The optimal patch size is expected to cover major semantically meaningful objects; thus, it can vary in accordance with one's interests and the resolution of the images. In this study, we adopted a patch size of 299 × 299 pixels (due to the limitations of transfer learning) and resized the patches using a bilinear method. In addition, using ArcGIS (version 10.2, ESRI Inc., Redlands, CA, USA), we eliminated objects with areas of less than 40 m² to avoid the influence of defective pixels.
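The zero-padding and majority-labeling steps can be sketched as follows. This is only an illustration under assumptions: the patch is centered inside the padded square (the text does not state where the padding is placed), the bilinear resize to the network input size is omitted, and the function names are hypothetical.

```python
import numpy as np


def pad_to_square(patch):
    """Zero-pad an (H, W, C) patch to (S, S, C) with S = max(H, W).

    Centering the patch is an assumption made for this sketch.
    """
    h, w, c = patch.shape
    side = max(h, w)
    out = np.zeros((side, side, c), dtype=patch.dtype)
    top = (side - h) // 2
    left = (side - w) // 2
    out[top:top + h, left:left + w, :] = patch
    return out


def majority_label(reference_labels):
    """Return the most frequent reference label among the pixels of one object."""
    values, counts = np.unique(reference_labels, return_counts=True)
    return int(values[np.argmax(counts)])


# Toy object: 2 x 4 pixels, 3 bands, mostly labeled 1 in the reference map.
patch = np.ones((2, 4, 3), dtype=np.float32)
square = pad_to_square(patch)
label = majority_label(np.array([1, 1, 0, 1, 0]))
```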
In general, the digital values in RS imagery are integers that can have a dynamic range of greater than 8 bits. Thus, during the training phase of CNNs, these values are typically transformed into an approximately normal distribution to avoid abnormal gradients. However, as the transfer learning strategy was adopted in our study, to reduce loss of information and shorten the distance between RS data and the original dataset, we used only three visible bands (the red, green, and blue bands) of the GF-2 imagery in our proposed method, and used a linear transformation method for each band:

$$DN_{norm} = \frac{DN_{orig}}{DN_{max}} \qquad (1)$$

where $DN_{norm}$ is the normalized value and $DN_{orig}$ and $DN_{max}$ are the original and maximum pixel values in the image, respectively. Following normalization, we randomly divided all the image objects into three independent subsets, i.e., the training, validation, and testing sets, which account for 70%, 10%, and 20% of the whole dataset, respectively.
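A minimal sketch of the per-band normalization and the random 70/10/20 split is given below. It assumes that the linear transformation divides each band by that band's maximum value in the image; the function names, the random seed, and the rounding of subset sizes are illustrative choices, not details from the study.

```python
import numpy as np


def normalize_bands(image):
    """Scale each band of an (H, W, C) array by that band's maximum value."""
    image = image.astype(np.float64)
    band_max = image.max(axis=(0, 1), keepdims=True)
    band_max[band_max == 0] = 1.0  # guard against division by zero for empty bands
    return image / band_max


def split_indices(n_objects, seed=0):
    """Randomly split object indices into 70% training, 10% validation, 20% testing."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_objects)
    n_train = int(round(0.7 * n_objects))
    n_val = int(round(0.1 * n_objects))
    train = order[:n_train]
    val = order[n_train:n_train + n_val]
    test = order[n_train + n_val:]
    return train, val, test


train_idx, val_idx, test_idx = split_indices(1000)
```

Splitting by object index rather than by pixel keeps all pixels of one image object inside a single subset.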

Transfer Learning Based on a Pre-Trained Inception-Resnet V2 Model
Rather than training a complex deep CNN from scratch, we adopted a transfer learning strategy to reduce the time and amount of labeled data required for training. Transfer learning assumes that knowledge learned from one task can be helpful in improving performance when applied in another task or domain [38]. This strategy has been successfully applied in many image classification applications by initializing a CNN from a set of pre-trained weights [39–41]. Typically, pre-trained CNNs include both the model structure and weights, which are fully trained with sufficiently labeled images collected from other applications.
Generally, there are two strategies for employing pre-trained CNNs. As shown in Figure 4, the first one, "feature extraction," regards pre-trained CNNs as feature extractors, as it only reconstructs and fine-tunes the final logits (classifier) layer, while all the remaining layers stay frozen. In our approach, we adopt multinomial logistic regression (also known as softmax regression) as our classifier because it is efficient and straightforward [42]. The second strategy for employing pre-trained CNNs, "fine-tuning," continues training all the layers to keep the output as close to the new task as possible [43]. Both of these commonly used strategies are compared in our study.
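The "feature extraction" strategy can be illustrated with a small, framework-free sketch: a softmax regression head is trained by gradient descent on fixed feature vectors, which stand in for the outputs of a frozen backbone. This is not the TF-Slim implementation used in the study; the function names, feature dimension, learning rate, and synthetic data are all hypothetical. Under "fine-tuning," the backbone that produces the features would be updated as well.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def train_softmax_head(features, labels, n_classes=2, lr=0.1, epochs=200):
    """Train only the classifier layer; the features themselves are never changed."""
    n, d = features.shape
    weights = np.zeros((d, n_classes))
    bias = np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    for _ in range(epochs):
        probs = softmax(features @ weights + bias)
        grad = (probs - onehot) / n  # gradient of mean cross-entropy w.r.t. logits
        weights -= lr * features.T @ grad
        bias -= lr * grad.sum(axis=0)
    return weights, bias


def predict(features, weights, bias):
    """Return the class with the highest softmax probability for each sample."""
    return np.argmax(softmax(features @ weights + bias), axis=1)


# Synthetic stand-in for frozen backbone features (two classes: pervious / impervious).
rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 8))
labels = (feats[:, 0] > 0).astype(int)
w, b = train_softmax_head(feats, labels)
```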
Remote Sens. 2019, 11, 280
Considering the performance and accessibility of the deep CNN architecture and our limited amount of labeled RS data, we decided to transfer the pre-trained inception-resnet V2 model [44], which has been fully trained on the ILSVRC-2012-CLS image classification dataset [39]. The inception-resnet V2 model combines the inherent computational efficiency of inception architectures with the accelerated training conferred by residual connections. The inception structure was introduced as a fundamental part of GoogLeNet [40] and has been optimized and refined over a series of iterations [41,44,45]. We therefore expected that most of the structure used in this model has been optimized with features suitable for image classification purposes.
To implement the pre-trained inception-resnet V2 model, we used the TensorFlow-Slim (TF-Slim, version 1.4) image classification model library, which contains a high-level application programming interface (API) for defining, training, and evaluating complex models. All of the programming code is based on Python 3.5.2. Four Tesla P40 graphics processing units (GPUs) on the deep learning service (DLS) of Meituan Open Services (MOS) were used for acceleration. We trained all the models using a batch size of 64, a learning rate of 0.01 decayed exponentially by 0.94 every 2 epochs, and RMSProp optimization with a momentum of 0.9 and decay of 0.9. CNN weights were recorded after every epoch; after all the epochs were completed, we selected the model with the highest accuracy on our validation set as our optimal model.
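The learning-rate schedule described above can be written out explicitly. The sketch assumes a staircase form, in which the rate is multiplied by 0.94 once every two epochs rather than decayed continuously; the text does not state which form was used, and the function name is hypothetical.

```python
def learning_rate(epoch, base_lr=0.01, decay_rate=0.94, epochs_per_decay=2):
    """Staircase exponential decay (assumed): one decay step every `epochs_per_decay` epochs."""
    return base_lr * decay_rate ** (epoch // epochs_per_decay)


# Learning rate at the start of each of the 30 training epochs (epochs counted from 0).
schedule = [learning_rate(epoch) for epoch in range(30)]
```

Under this assumption the rate stays at 0.01 for the first two epochs and ends at 0.01 × 0.94^14 in the final epoch.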

Object-Based Nearest Neighbor Classification
To provide a comparison with our proposed method, the object-based NNC method was applied to distinguish between impervious and pervious surfaces. NNC is straightforward to implement and does not require hyperparameter definitions. Moreover, due to its inherent mechanism, NNC can also achieve satisfactory results without detailed classification categories. Thus, NNC is an appropriate method for comparison with our proposed method. Following an identical selection of samples, we used feature space optimization (FSO), a tool available in eCognition, to select the optimal feature combination. Based on selected samples, FSO calculates the Euclidean distance in feature space between classes and chooses the best combination of features, resulting in the largest minimum distances between the least separable classes [46].

Fully Convolutional Neural Networks
FCN-based approaches were also selected for comparison with our proposed method due to their high performance in recent RS applications. We opted to use the widely used FCN-8s and U-Net models because of their successful performance in image classification for remote sensing or natural images. Both FCN-8s and U-Net are expected to produce accurate and detailed segmentations because they combine semantic information from deep, coarser layers and surface information from shallow, finer layers. FCN-8s, a VGG-16 network with a skip-layer structure, combines its final prediction layer with lower layers (the pool3 and pool4 layers). U-Net has a u-shaped architecture consisting of a contracting path (left) and an expansive path (right). Every step in the expansive path combines information from the corresponding lower layer. Detailed information on these model architectures can be found in [26] and [31].
We trained FCN-8s and U-Net on the GF-2 true color images for impervious surface extraction. During training, we modified the number of outputs to be 2, fine-tuned FCN-8s based on an ImageNet pre-trained model, and trained the U-Net from scratch. Additionally, we set the learning rate to 0.0001 to avoid abnormal gradients and the batch size to 10 due to the limits of the GPU's memory. The other training parameters for both FCN models are identical to those used in our proposed method. The proportions of the three independent subsets, i.e., the training, validation, and testing sets, are the same as in our proposed method.

Accuracy Assessment and Comparison
In this paper, we compared the proposed object-based deep CNN approach with the object-based NNC, FCN-8s, and U-Net methods. We conducted the accuracy assessment on the final classification maps, using a total of 44,970 randomly selected segments to construct the error matrix. We confirmed whether the segments were correctly identified by visual interpretation. During the visual interpretation, a 2015 land-use change surveying map was used as ancillary data. This map includes accurate spatial distribution information for the detailed categories of impervious and pervious surfaces, thus offering detailed locations of the impervious and pervious surfaces. Finally, an accurate spatial distribution map for these two types of surfaces was obtained by visual interpretation and used for the accuracy assessment. Based on the error matrix, we calculated several commonly used accuracy statistics, including user accuracy (UA), producer accuracy (PA), and overall accuracy (OA).
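To compare the accuracies of the classification results between different methods, we employed three commonly used evaluation metrics for impervious surfaces: precision, recall, and F-measure [47,48], which are calculated as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (2)$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (3)$$

$$F = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (4)$$

where TP, TN, FP, and FN refer to true positives, true negatives, false positives, and false negatives, respectively.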
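As a worked illustration, the three metrics can be computed from TP, FP, and FN using their standard definitions. The counts below are made up for the example and are not results from this study; the function name is hypothetical.

```python
def precision_recall_f(tp, fp, fn):
    """Standard precision, recall, and F-measure for the impervious (positive) class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical pixel counts: 80 correctly labeled impervious pixels,
# 20 pervious pixels labeled impervious, 20 impervious pixels missed.
precision, recall, f_measure = precision_recall_f(tp=80, fp=20, fn=20)
```

TN does not enter any of the three metrics, which is why they are well suited to a sparse target class such as impervious surfaces in rural areas.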
Another method widely used in accuracy assessment is Kappa statistical analysis, a discrete multivariate technique for statistically analyzing the difference between a classified map and a reference map [49]. In this study, the Kappa statistic of every error matrix was calculated. Furthermore, the Kappa statistics of two different error matrices were compared in a Z-test to determine whether there was a significant difference between the two classification methods [49]. The Z-statistic is calculated as follows:

$$Z = \frac{\left| K_1 - K_2 \right|}{\sqrt{\mathrm{Var}(K_1) + \mathrm{Var}(K_2)}} \qquad (5)$$

where $K_1$ and $K_2$ are the two Kappa statistics, and $\mathrm{Var}(K_1)$ and $\mathrm{Var}(K_2)$ are their estimated variances. The hypothesis that two Kappa statistics are equal is rejected if |Z| is greater than a critical value (1.96 for a test at the 95% confidence level).
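The Kappa comparison can be sketched as follows, using the standard definition of Cohen's Kappa from an error matrix and the usual Z-statistic for two independent Kappa values. The variances are taken as inputs because their estimation formula is not given in the text; the error matrix, Kappa values, and variances below are invented for illustration, and the function names are hypothetical.

```python
import math


def kappa(matrix):
    """Cohen's Kappa from a square error matrix (rows: classified, columns: reference)."""
    size = len(matrix)
    n = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(size)) / n
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(size)
    ) / n ** 2
    return (observed - expected) / (1 - expected)


def z_statistic(k1, var1, k2, var2):
    """Z-statistic for the difference between two independent Kappa statistics."""
    return abs(k1 - k2) / math.sqrt(var1 + var2)


# Hypothetical two-class error matrix and hypothetical variances.
k = kappa([[40, 10], [10, 40]])
z = z_statistic(0.90, 0.0001, 0.85, 0.0003)
significant = z > 1.96  # difference is significant at the 95% confidence level
```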
In this study, we regard the discrimination of impervious and pervious surfaces as binary classification. Therefore, TP and TN are defined as the areas of correctly labeled impervious and pervious surfaces, respectively. Precision and recall values of the different methods were calculated at the pixel level based on the test dataset. In total, we calculated five accuracy values for each accuracy statistic using Equations (2), (3), and (4), and 10 Z-statistics, one for each pair of methods, using Equation (5).

Segmentation Results with Optimal Scale Parameter
ROC-LV and LV values are plotted against corresponding SPs in Figure 5. Based on this diagram, the peaks of the ROC-LV curve indicate optimal SPs for segmentation. The graph shows that the scale of 41 represents the first break after a continuous and abrupt decay. As a result, we set 41 as the optimal SP, generating 224,889 segments. As shown in Figure 6, at the scale of 41, rooftops, water bodies, and grain-basking fields can be identified. The segmentation results have a border consistent with the major target in our study and are fully consistent with the expected results in the overall framework shown in Figure 3.

Final Map and Accuracy Assessment
Some detailed subset examples from the classification results using our proposed method are shown in Figure 8. Visual inspection of the final maps shows that semantically meaningful objects are accurately delineated and that most of the impervious and pervious surfaces are classified successfully.
To further quantitatively assess the classification results, we randomly selected over 44,970 segments for the accuracy assessment, which account for approximately 20% of all image objects. Tables 1 and 2 show the confusion matrices of the classification results. The pervious surfaces produced by the fine-tuning method achieve the highest UA value of 95.85%, indicating that over 95% of the pervious surfaces identified in the classification results are truly pervious surfaces. The impervious and pervious surfaces produced by the fine-tuning method also have relatively high PA values of 92.13% and 94.82%, respectively. The fine-tuning method performs better than the feature extraction method, with higher accuracy values, which is consistent with the findings of the earlier optimal model selection phase. Both the impervious and pervious surfaces classified by the feature extraction method have UAs greater than 80%, and all the UAs and PAs of the impervious and pervious surfaces produced by the fine-tuning method are greater than 90%. Thus, impervious and pervious surfaces are classified successfully, with all OAs greater than 85%.
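For readers less familiar with error matrices, the following sketch shows how UA, PA, and OA are conventionally derived from a two-class confusion matrix. It assumes rows hold the classified labels and columns the reference labels; the counts are hypothetical and do not reproduce Tables 1 or 2.

```python
def accuracy_from_matrix(matrix):
    """User accuracy, producer accuracy, and overall accuracy.

    matrix[i][j] is the number of samples classified as class i
    whose reference label is class j.
    """
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    ua = [matrix[i][i] / row_sums[i] for i in range(n)]
    pa = [matrix[i][i] / col_sums[i] for i in range(n)]
    oa = sum(matrix[i][i] for i in range(n)) / total
    return ua, pa, oa


# Hypothetical counts: index 0 = impervious (IS), index 1 = pervious (PS).
example_matrix = [
    [90, 10],
    [5, 95],
]
ua, pa, oa = accuracy_from_matrix(example_matrix)
```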

Accuracy Comparison
In this study, we used the object-based NNC and the state-of-the-art FCN-based methods for comparison. Table 3 shows the experimental setup and computational complexity of the different classification schemes in this study. In terms of computational complexity, the object-based NNC method is more time-intensive than the others. With GPU acceleration, our proposed methods and the FCN-based methods take less time. Furthermore, our proposed feature extraction method requires the least time. This is because the other methods are trained with the fine-tuning strategy: the networks continue training from pre-trained models, all the weights and biases in the models need to be updated, and more time and resources are required.

Table 3. Experimental setup and computational complexity of the classification schemes. OB-NNC: object-based nearest neighbor classification (NNC) method. Ours-FE: our feature extraction method. Ours-FT: our fine-tuning method. GPU: Tesla P40 graphics processing unit. CPU: Intel i7-7700K with 16 GB memory.

To quantitatively assess the performance of our proposed method and the other methods, precision, recall, and F-measure were calculated at the pixel level (Table 4), and our fine-tuned object-based deep CNN obtained the best performance. The approaches using FCN-8s and U-Net achieve similar accuracy levels, with F-measure values of approximately 80.0%, and both have lower precision and recall values than the fine-tuned object-based deep CNN. The fine-tuned object-based deep CNN achieves the highest F-measure value for impervious surfaces at 88.9%, compared with 80.9% for the object-based NNC method, and the highest precision value for impervious surfaces at 89.7%, which is 7.5% higher than that of the object-based NNC method.
Table 5 summarizes the Kappa analysis results between the five classification methods used in this study. According to the Z-test of the Kappa values, our proposed fine-tuning method was significantly different from all the other methods. Furthermore, since the Kappa value of the fine-tuning method is greater than those of all the other methods (Table 4) and Z ≥ 1.96, the classification results of our proposed fine-tuning method are statistically significantly better than the other four classification results at the 5% significance level. However, there was no significant difference among the other four methods. It is evident from the evaluation criteria that (1) the fine-tuned object-based deep CNNs achieve the highest precision, recall, and accuracy values, (2) the fine-tuned object-based deep CNNs outperform the feature extraction method, and (3) the fine-tuned object-based deep CNNs produce significantly higher accuracies than all the other methods.

Object-Based NNC vs. Our Approach
Our proposed method was first compared with the widely used object-based NNC approach. The major difference between the two methods lies in the feature design procedure. First, deep CNNs can automatically learn features, which are represented by the weights of each layer [50]. In contrast, the feature space of the object-based NNC method is built from manually designed features. Second, deep CNNs attempt to learn high-level feature representations in a hierarchical manner [51], whereas methods based on manually designed features mainly cover spectral, shape, and texture features. The majority of these hand-crafted features are based on statistical results, resulting in a lack of generalizability [15]. Thus, the use of deep CNNs improved our classification performance relative to the object-based NNC approach, with an 8% improvement in terms of F-measure at the pixel level.

Pixel-Wise FCN-Based Methods vs. Our Approach
Compared with state-of-the-art pixel-wise FCN-based methods, our approach takes segments as its basic classification units and extracts accurate semantic information from deep CNNs. Unlike the standard CNN model, FCN is a pixel-wise classification method composed of downsampling and upsampling processes. The downsampling path is usually a combination of convolutional and max-pooling layers, which are commonly used in CNNs to extract and interpret the context in image classification tasks. The upsampling path is generally composed of deconvolutional layers, which are used to upsample the feature maps and output the final dense classification results. However, during the downsampling process, the pooling operations of an FCN can lose a great deal of detailed information. Although FCN models can be enhanced by combining the feature map of a previous low-level pooling layer, as with FCN-8s and U-Net, it is still challenging to restore highly nonlinear object boundaries during upsampling by learning. Thus, the results of such methods tend to contain salt-and-pepper noise [52][53][54], and the detailed boundaries of an object are often lost or smoothed [30].
In our proposed method, we avoid the loss of detailed information by using image objects as basic classification units. First, we employed the MRS algorithm to overcome local spectral variance and provide precise boundary information. With the segments as basic processing units, the detailed and complete image information of land cover in the real world can then be observed by deep CNNs. Additionally, the fixed-size receptive field in FCN-based models becomes a semantically meaningful receptive field before classification; thus, shape information is also considered in our approach. Moreover, compared with the downsampling path in FCN-based models, the pre-trained Inception-ResNet V2 model, with its deeper and well-designed structure, can provide more accurate semantic information. Therefore, our method can achieve high accuracy with accurate boundary information and avoid salt-and-pepper noise. As shown in Figure 9, our approach outperforms FCN-8s and U-Net in terms of detailed information.
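The loss of boundary detail during downsampling can be seen in a toy example. The sketch below applies 2 × 2 max pooling to a small binary mask and then restores the original size by nearest-neighbor repetition. This is a deliberate simplification: real FCNs use learned deconvolution and skip connections rather than plain repetition, and the mask and function names are hypothetical.

```python
def max_pool_2x2(grid):
    """2 x 2 max pooling with stride 2 on a grid with even dimensions."""
    return [
        [
            max(grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1])
            for c in range(0, len(grid[0]), 2)
        ]
        for r in range(0, len(grid), 2)
    ]


def upsample_2x(grid):
    """Nearest-neighbor upsampling: repeat every cell twice in each direction."""
    out = []
    for row in grid:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


# A one-pixel-wide diagonal feature (1 = impervious, 0 = pervious).
mask = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]
restored = upsample_2x(max_pool_2x2(mask))
changed = sum(
    1
    for r in range(4)
    for c in range(4)
    if mask[r][c] != restored[r][c]
)
```

In this example the thin diagonal becomes two solid 2 × 2 blocks after the round trip, so the original boundary can no longer be recovered from the pooled map alone.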

Training Strategies and Scale Effects
Two widely used transfer learning strategies were adopted in this study: feature extraction and fine-tuning. The difference between these options highlights the general applicability of early pre-trained layers. Feature extraction uses the early pre-trained layers to produce features that are subsequently used to extract impervious surfaces. It achieves satisfactory overall accuracy (larger than 85%), since these layers extract low-level, more generic features that can be reused for other image classification tasks [55]. Fine-tuning, in contrast, continues training from a pre-trained model; that is, it adjusts the weights in each layer so that the output is as close to the new labels as possible, and it shows improvements over feature extraction (8.84% in overall accuracy). However, as only the pre-trained Inception-ResNet V2 model and multinomial logistic regression as the classifier were employed in this study, we emphasize that other classifiers and optimization methods should be subjected to further investigation.
In our approach, it is crucial to generate a set of semantically meaningful segments because they are regarded as the basic units for classification. These segments preserve the detailed and complete image information for land cover and provide additional geometric information for the deep CNNs. We chose the MRS algorithm for segmentation since it follows the region-merging principle and can generate satisfactory segmentation results with our imagery. One of the most important and sensitive parameters for the MRS algorithm is the SP, which is defined as the maximum threshold for the heterogeneity in an image object. By adjusting the SP, image objects at specific scales can be generated. Rather than using trial and error, we employed an objective method, the ESP 2 tool, to identify the optimal SP. However, segmentation errors, namely under- and over-segmentation, still persist and have not been completely solved. Therefore, segmentation methods that can directly and accurately delineate the boundaries of different land cover classes are still required.
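The mechanical difference between the two strategies is simply which parameters receive gradient updates. The toy sketch below makes this concrete with a two-parameter model, y = head * (base * x), trained by gradient descent on a mean squared error. It is not the Inception-ResNet V2 setup used in this study: the model, data, learning rate, and function name are hypothetical, and the toy only shows which weights move, not the accuracy gap between the strategies.

```python
def train(samples, base, head, train_base, lr=0.05, epochs=200):
    """Gradient descent on mean squared error for y = head * (base * x).

    train_base=False mimics feature extraction (pre-trained base frozen);
    train_base=True mimics fine-tuning (all weights updated).
    """
    n = len(samples)
    for _ in range(epochs):
        grad_base = 0.0
        grad_head = 0.0
        for x, y in samples:
            feature = base * x
            error = head * feature - y
            grad_head += 2.0 * error * feature
            grad_base += 2.0 * error * head * x
        head -= lr * grad_head / n
        if train_base:
            base -= lr * grad_base / n
    return base, head


# Hypothetical target task: y = 6 * x. The "pre-trained" base weight is 2.0.
samples = [(x, 6.0 * x) for x in (-1.0, -0.5, 0.5, 1.0)]
fe_base, fe_head = train(samples, base=2.0, head=1.0, train_base=False)
ft_base, ft_head = train(samples, base=2.0, head=1.0, train_base=True)
```

Under feature extraction the base weight keeps its pre-trained value and only the head adapts, whereas under fine-tuning both weights shift toward the new task.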

Conclusions
This paper presents an object-based deep CNN framework for impervious surface extraction from VHSR imagery. Compared with the conventional OBIA method and two other commonly used FCN-based methods, the classification accuracy is clearly improved.
Our proposed method, which is based on a combination of the MRS algorithm and deep CNNs, can effectively map impervious surfaces while retaining detailed information in HSR images. The MRS algorithm, with the optimal SP selected by the ESP 2 tool, provides semantically meaningful image objects for our study, and the pre-trained deep CNNs allow us to effectively extract and interpret the context. In addition, our performance comparison using different training strategies indicates that significantly higher accuracy can be achieved through transfer learning by fine-tuning rather than feature extraction.
As future research, we might focus on testing our method for the mapping of other land covers and on images with higher spatial resolution. Additionally, it is necessary to improve and investigate

Figure 1. The study area is a typical town-rural region in Zhejiang province. The Gaofen 2 (GF-2) image used in this study is presented in true color.

Figure 2. Image examples on the ground for different types of impervious and pervious surfaces in our study area, including typical examples in rural (a) and town areas (b).

Remote Sens. 2019, 11, 280

learning with the training set for 30 epochs. We saved and estimated the model after each epoch and selected the best model based on validation set performance. The model was then used to produce the final map of impervious surfaces. To verify whether our method effectively discriminates impervious surfaces, we compared our method with conventional object-based nearest neighbor classification (NNC), FCN-8s, and U-Net.

Figure 3. Outline of the overall framework presented in this paper, including image preprocessing, classification, and comparison methods.

Figure 4. The transfer learning strategies of this study: (a) feature extraction and (b) fine-tuning. The strategies differ in the trainable part of the pre-trained model, which is shown in the dotted box.

Figure 5. Local variance (LV) and rates of change of LV (ROC-LV) values against corresponding scale parameters (SPs) produced by the Estimation of Scale Parameter 2 (ESP 2) tool. The gray vertical dotted line indicates the optimal SP.

Figure 6. Subset examples of the segmentation results in rural (a) and town areas (b). The yellow lines represent the segmentation results at the scale of 41. Images are presented in true color.

Figure 7 shows the training and validation accuracy values with respect to the number of epochs completed. The fine-tuning method clearly achieves higher accuracy values than the feature extraction method during the training phase. As the number of epochs increases, the accuracy values of both transfer learning techniques become more stable. The highest validation accuracy values achieved by the fine-tuning and feature extraction methods are 93.90% and 85.05%, respectively. Based on these values, we selected the best model for fine-tuning after 17 epochs of training and the best model for feature extraction after 30 epochs. Vertical dotted lines in Figure 7 indicate the optimal models for fine-tuning and feature extraction.
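The selection rule described above, keeping the checkpoint with the highest validation accuracy, can be sketched in a few lines. The accuracy history and the function name are hypothetical and serve only to illustrate the rule.

```python
def select_best_epoch(val_accuracy):
    """Return (epoch, accuracy) of the highest validation accuracy.

    Epochs are numbered from 1; ties resolve to the earliest epoch.
    """
    best_index = max(range(len(val_accuracy)), key=lambda i: val_accuracy[i])
    return best_index + 1, val_accuracy[best_index]


# Hypothetical validation accuracies (percent) after each epoch.
history = [88.1, 90.4, 92.7, 93.9, 93.2, 93.5]
best_epoch, best_accuracy = select_best_epoch(history)
```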

Figure 7. The training and validation accuracy values with respect to the number of epochs. Red and gray vertical dotted lines highlight the highest overall accuracy on the validation set using the fine-tuning and feature extraction methods, respectively. Validation-FT: the validation accuracy values using our proposed fine-tuning method. Validation-FE: the validation accuracy values using our proposed feature extraction method. Train-FT: the training accuracy values using our proposed fine-tuning method. Train-FE: the training accuracy values using our proposed feature extraction method.

Figure 8. Subset examples from classification results using our proposed object-based deep convolutional neural networks (CNNs).

Figure 9. Detailed comparison between our proposed method and FCN-based models at subset areas (a) and (b). The arrows indicate impervious surfaces that are extracted accurately by the object-based deep CNNs, whereas in the results of the FCN-based models, salt-and-pepper noise persists and detailed boundaries are often smoothed.

Table 1. Error matrix for the final map using our proposed feature extraction method. PA: producer accuracy, UA: user accuracy, IS: impervious surface, PS: pervious surface.

Table 2. Error matrix for the final map using our proposed fine-tuning method.

Table 4. Quantitative comparison between methods using conventional object-based image analysis (OBIA), FCN-8s, U-Net, and our methods at the pixel level.

Table 5. Z-test for a 95% confidence level between methods using conventional OBIA, FCN-8s, U-Net, and our methods.