Improving Urban Land Cover/use Mapping by Integrating A Hybrid Convolutional Neural Network and An Automatic Training Sample Expanding Strategy

Abstract: Moderate spatial resolution (MSR) satellite images, which hold a trade-off among radiometric, spectral, spatial and temporal characteristics, are extremely popular data for acquiring land cover information. However, the low accuracy of existing classification methods for MSR images remains a fundamental issue restricting their capability in urban land cover mapping. In this study, we proposed a hybrid convolutional neural network (H-ConvNet) for improving urban land cover mapping with MSR Sentinel-2 images. The H-ConvNet is structured with two streams: a lightweight 1D ConvNet for deep spectral feature extraction and a lightweight 2D ConvNet for deep context feature extraction. To obtain a well-trained 2D ConvNet, a training sample expansion strategy was introduced to assist context feature learning. The H-ConvNet was tested in six highly heterogeneous urban regions around the world and compared with a support vector machine (SVM), object-based image analysis (OBIA), a Markov random field model (MRF) and a newly proposed patch-based ConvNet system. The results showed that the H-ConvNet performed best. We hope that the proposed H-ConvNet will benefit land cover mapping with MSR images in highly heterogeneous urban regions.


Introduction
Accurate and timely urban land cover mapping plays a major role in many urban applications, such as ecosystem monitoring, land management, planning and landscape analysis [1][2][3]. Remotely sensed satellite images can quickly capture land cover changes at a large scale and have been widely used for land cover mapping for decades [4,5]. Compared with rural regions, which mainly cover natural land surfaces (e.g., grass and water), urban regions have undergone nearly complete reconstruction and present a highly heterogeneous land surface. Confusion among land cover categories commonly arises when mapping urban land cover with remote sensing techniques, hampering mapping efficiency and accuracy.
With decades of optical imaging technique development, we can easily access vast amounts of high-quality moderate spatial resolution (MSR) satellite images (e.g., Landsat-8 and Sentinel-2 images), which allows broad applications of land cover mapping over large-scale areas. Many algorithms have been widely and successfully used in remotely sensed image classification for different applications. To reduce the labor of sample labeling in deep model training, an automatic training sample expansion strategy that incorporates the spectral feature-based classification result is proposed in this study.

Study Area
This study focused on highly developed urban regions around the world. Shanghai, China, was selected first to test our proposed method (Figure 1); it covers approximately 1000 km² from the city center to the rural fringe. The land use intensity in this region diminishes from the urban to the suburban region, and the artificial lands are covered by a wide range of construction materials, including asphalt, concrete, shingles and tiles in residential, industrial, civic (such as hospitals and institutes) and commercial land areas. Five additional highly heterogeneous urban regions, London, New York, Paris, Seoul and Tokyo, were selected for further validating the accuracy and efficiency of the proposed method.

Data Source and Preprocessing
The Sentinel-2A MSI image (orbit number 46, tile number T51RUQ), acquired on 20 July 2016, was used for urban land cover mapping in this study. The Level-1C geometrically corrected top-of-atmosphere reflectance product of the selected image was downloaded from the Sentinels Scientific Data Hub (https://scihub.copernicus.eu/) (Figure 1), and atmospheric correction was implemented to generate the bottom-of-atmosphere (BOA)-corrected reflectance product with the Sen2cor module (version 2.2.1) within the Sentinel-2 Toolbox (S2TBX) in the Sentinel Application Platform (SNAP, version 4.0.2). Since the three Sentinel-2A MSI bands with 60 m spatial resolution are designed mainly for atmospheric correction and cloud screening (band 1 for aerosol retrieval, band 9 for water vapor correction and band 10 for cirrus detection) [34], they were removed, and the remaining four 10 m bands and six 20 m bands were used for urban land cover mapping. The 20 m bands were upsampled to 10 m spatial resolution using a nearest-neighbor interpolation algorithm, and all ten 10 m bands were then stacked for the subsequent classification.
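The nearest-neighbor upsampling of the 20 m bands to 10 m can be sketched with a minimal numpy snippet (the band values below are hypothetical; a real implementation would operate on the full Sentinel-2 rasters):

```python
import numpy as np

def upsample_nearest(band: np.ndarray, factor: int = 2) -> np.ndarray:
    """Nearest-neighbor upsampling of a 2-D band by an integer factor
    (e.g., 20 m -> 10 m for Sentinel-2 bands)."""
    return np.repeat(np.repeat(band, factor, axis=0), factor, axis=1)

# Example: a 2 x 2 "20 m" band becomes a 4 x 4 "10 m" band.
band_20m = np.array([[1, 2],
                     [3, 4]])
band_10m = upsample_nearest(band_20m, factor=2)
```

Each 20 m pixel is simply replicated into a 2 × 2 block of 10 m pixels, so no new reflectance values are interpolated.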
To collect the ground truth for training and validating the proposed H-ConvNet, a Google Earth image with fine spatial resolution (acquired on 21 July 2016, closely matching the acquisition time of the Sentinel-2 image used in this study) was acquired. To minimize the labor in the supervised image classification process and to provide rigorous validation in the method evaluation, a limited number of training samples and an extensive number of validation samples were collected using a stratified random sampling strategy. Specifically, an unsupervised k-means algorithm was applied to divide the image into 10 classes for implementing stratified sampling, so that validation samples were collected evenly from each class of the image. Finally, 962 training and 229,247 validation pixels were obtained for the four-class urban surface mapping, and 1190 training and 156,682 validation pixels for the six-class urban surface mapping.

Land Cover Categories
Four land cover types, impervious surface, soil, vegetation and water, were first classified in this study. In addition, considering the spatial resolution of the selected Sentinel-2 image, the characteristics of the urban landscape and the application value in practice, a refined land use/cover system with six classes was employed for method testing; it contains water surface, green space, bare land, urban living area, urban industrial area and urban road.

Methods
A lightweight H-ConvNet consisting of one 1D ConvNet and one 2D ConvNet was proposed to perform pixelwise classification in this study (Figure 2). To fully understand the proposed method, detailed descriptions of the 1D ConvNet, 2D ConvNet, training sample expansion strategy and spectral-context feature joint urban land cover mapping are given as follows.

Architecture of H-ConvNet
The architecture of a typical convolutional neural network (ConvNet) is structured as a series of stages, each of which contains a cascade of layers, such as a convolutional layer, a batch normalization layer, a nonlinearity layer and a pooling layer; with end-to-end feature learning, data features at multiple levels can be obtained for data categorization [33]. Unlike the very deep designs common in computer vision, a lightweight 1D ConvNet and 2D ConvNet were structured in this study for the joint exploitation of spectral and context features from the MSR image (Figure 3). Each pixel in a remote sensing image holds the spectral signature of a specific land cover, and this distinctive per-pixel feature is commonly used for land cover classification. To take full advantage of the spectral information in a remote sensing image, a 1D ConvNet was proposed for deep spectral feature extraction. Unlike commonly used spectral feature extractors that recognize specific land cover categories through predefined spectral features [35,36], such as NDVI and NDWI, this 1D ConvNet can be fed with raw pixels and automatically exports many distinctive spectral features corresponding to each land cover category, thus alleviating the need for hand-engineered feature extractors. The most straightforward way to improve the performance of a ConvNet is to increase its depth. However, as depth increases, the notorious problem of vanishing/exploding gradients inevitably arises [37,38], which makes model training more difficult. In addition to increasing network depth, increasing model width also contributes to network performance. In this study, considering the relatively limited amount of information in a single pixel of a multispectral image, and inspired by the Inception module of GoogLeNet, a simple 1D ConvNet with balanced model width and depth was structured for spectral feature learning (Figure 3).
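The basic operation of the spectral stream, a 1-D convolution sliding along a pixel's band dimension followed by a ReLU nonlinearity, can be illustrated with a minimal numpy sketch (the spectrum and filter weights below are hypothetical; the actual network learns many such filters in parallel):

```python
import numpy as np

def conv1d_valid(spectrum, kernel):
    """'Valid' 1-D convolution (cross-correlation) of a spectral vector
    with one filter, as in the first layer of a 1D ConvNet."""
    n, k = len(spectrum), len(kernel)
    return np.array([np.dot(spectrum[i:i + k], kernel) for i in range(n - k + 1)])

# A hypothetical 10-band pixel spectrum and one width-3 filter.
pixel = np.array([0.08, 0.10, 0.12, 0.30, 0.35, 0.38, 0.40, 0.25, 0.20, 0.15])
kernel = np.array([-1.0, 0.0, 1.0])   # responds to local spectral slope
features = np.maximum(conv1d_valid(pixel, kernel), 0.0)  # ReLU nonlinearity
```

A trained 1D ConvNet stacks several such filtered responses (plus pooling and normalization) to build the deep spectral features used for classification.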

2D ConvNet for Context Feature Learning
To capture the context features of each pixel in a remote sensing image, a 2D ConvNet was designed under the inspiration of dense connectivity, i.e., each layer is directly connected to all subsequent layers in a feed-forward way [39]; that is, the information moves in only one forward direction, from the input nodes, through the hidden nodes (if any) and to the output nodes. This dense connectivity design has several merits, including alleviating the vanishing-gradient problem, strengthening feature propagation and encouraging feature reuse. The 2D ConvNet was designed for pixelwise land cover classification, and the context feature of each pixel is characterized by its surrounding patch of ground area. In particular, context feature-based classification was tested on image patches of 5 × 5, 15 × 15 and 25 × 25 pixels, and the optimal patch size of 15 × 15 pixels was finally selected for context feature learning. We structured the 2D ConvNet with three 3 × 3 convolutional layers, and the l-th layer receives the feature maps of all preceding layers as input:

x_l = H_l([x_0, x_1, ..., x_{l-1}]),

where [x_0, x_1, ..., x_{l-1}] refers to the concatenation of the feature maps produced in layers 0, 1, ..., l − 1.
If each function H_l produces k feature maps, then the l-th layer will have k_0 + k × (l − 1) input feature maps, where k_0 is the number of image channels in the input layer and k is the growth rate of the network. To improve computational efficiency, we introduced a 1 × 1 convolution to reduce the number of input feature maps ahead of each 3 × 3 convolution [26]. Consecutive batch normalization (BN) and rectified linear unit (ReLU) operations follow all convolutional layers in our network. Briefly, our 2D ConvNet can be described by such a composite layer, i.e., the Pooling-Conv(1 × 1)-BN-ReLU-Conv(3 × 3)-BN-ReLU-Pooling version of H_l. In our experiment, we set k = 8 and let each 1 × 1 convolution produce k feature maps. Consequently, the layers of the structured 2D ConvNet receive (4, 12, 20, 28) input feature maps, respectively.
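The input feature-map counts above follow directly from the dense-connectivity formula; a small sketch (assuming k_0 = 4 channels enter the block, which matches the counts reported here) reproduces them:

```python
def dense_block_channels(k0: int, k: int, num_layers: int):
    """Input feature-map counts for a densely connected block:
    the l-th layer sees k0 + k * (l - 1) input channels, because it
    receives the concatenation of the input and all earlier outputs."""
    return [k0 + k * (l - 1) for l in range(1, num_layers + 1)]

# Growth rate k = 8 with an assumed k0 = 4 input channels.
counts = dense_block_channels(k0=4, k=8, num_layers=4)
```

With these settings the four layers receive 4, 12, 20 and 28 input feature maps, matching the configuration described in the text.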

Automatic Training Sample Expansion
When applying a machine learning algorithm to remote sensing image classification, human-labeled samples are required for model training. However, training a deep learning model commonly requires a large amount of human-labeled samples, and sample labeling is a very laborious process. In this study, to minimize human labor in sample labeling, instead of using the original learning samples for 2D ConvNet training, we proposed an automatic training sample expansion method. The expanded training samples produced by our method were used for full context feature learning. The overview concept and the details of expanded training sample generation are illustrated in Figures 4 and 5, respectively. The expanded training samples were generated in two main steps: (1) the 1D ConvNet trained on the original learning samples was used to estimate the category probability of each pixel, and (2) an initial probability threshold was used to determine the confidence classification result. To avoid excessive data, which would make model training time-consuming, an adaptive probability growth and a maximum sample size threshold were set to constrain the size of the confidence classification result for each category. Specifically, if the resulting confidence classification result exceeds the given maximum size, the probability threshold is automatically incremented by a fixed probability growth. In this study, we set the initial probability threshold, adaptive probability growth and maximum sample size to 0.90, 0.005 and 30,000, respectively; accordingly, the final confidence classification result can be obtained in a fully automatic way. The details of parameter determination are given in Section 5.1.
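The adaptive thresholding step of the expansion can be sketched as follows (a simplified single-category numpy version with simulated probabilities; the actual method applies it per category to the 1D ConvNet's probability map):

```python
import numpy as np

def confident_pixels(prob, p0=0.90, growth=0.005, max_size=30000):
    """Select high-confidence pixels of one category: start at the
    initial probability threshold p0 and raise it by `growth` until
    at most `max_size` pixels remain."""
    t = p0
    idx = np.flatnonzero(prob >= t)
    while idx.size > max_size and t < 1.0:
        t += growth
        idx = np.flatnonzero(prob >= t)
    return idx, t

# Toy example: 100,000 simulated class probabilities.
rng = np.random.default_rng(0)
prob = rng.uniform(0.5, 1.0, size=100_000)
idx, final_t = confident_pixels(prob, max_size=10_000)
```

The selected indices would then feed the 2D ConvNet as automatically labeled context-learning samples.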
The confidence classification result obtained using the initial probability threshold suffers from a sample imbalance problem, which leads to imbalanced feature learning during 2D ConvNet training. We defined R_{a,b} as the sample size ratio between a pair of categories a and b. If the maximum ratio max_{a,b∈categories} R_{a,b} over all category pairs is larger than a predefined maximum ratio r, which is set to 3 in our study, the sample sizes are regarded as unbalanced, and further sample equalization is required. We therefore introduced an exponential transformation method for sample equalization. Specifically, given the maximum ratio r, the exponential transformation parameter s can be calculated from (max_{a,b∈categories} R_{a,b})^s = r; then, a relatively reasonable proportion can be obtained through the exponential transformation R_eq = R^s. The redundant pixels of the confidence classification result were removed by a random spatial approach, and the remaining classified pixels were regarded as the final expanded training samples. Generally, sample equalization is crucial for the subsequent 2D ConvNet-based classification; however, if the maximum ratio parameter falls within a reasonable scope of 1-4, it only slightly affects classification accuracy.
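A minimal numpy sketch of this exponential equalization, with hypothetical per-category sample sizes, might look like:

```python
import numpy as np

def equalize_sizes(sizes: dict, r: float = 3.0) -> dict:
    """Exponential-transformation equalization: find the exponent s with
    (max pairwise ratio)^s = r, then shrink over-represented classes so
    that the largest ratio is at most r."""
    n_min = min(sizes.values())
    ratios = {c: n / n_min for c, n in sizes.items()}
    r_max = max(ratios.values())
    if r_max <= r:
        return dict(sizes)                   # already balanced enough
    s = np.log(r) / np.log(r_max)            # so that r_max ** s == r
    return {c: int(round(n_min * ratio ** s)) for c, ratio in ratios.items()}

# Hypothetical confident-sample counts per category.
sizes = {"water": 2_000, "vegetation": 18_000, "soil": 6_000, "impervious": 30_000}
balanced = equalize_sizes(sizes, r=3.0)
```

After the transformation, the most over-represented class is reduced so that the largest pairwise ratio equals r; the surplus pixels would then be discarded by the random spatial approach described above.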

Model Training
The power of a ConvNet depends on the weights of the convolutional filters used to produce distinctive feature maps through feed-forward propagation. Therefore, it is very important to obtain optimal convolutional filters for each layer of the network. The gradient back propagation algorithm, a core component of neural networks, can be applied for parameter updating by repeatedly propagating gradients through all modules, from the top output (where the network produces its prediction) to the bottom (where the original input is fed). Before the parameters were updated during H-ConvNet training, the model parameters were initialized using a normal distribution with a mean of 0 and a standard deviation of 0.01.
To obtain gradients, an objective function should be predefined for the network, and we used the cross entropy loss as the objective function in this study. Additionally, considering that overfitting weakens generalization ability, we introduced L2 regularization to mitigate the overfitting problem of the proposed model. Thus, the objective function, containing a cross entropy loss, which measures the difference between two probability distributions [40], and a regularization term, is given as follows:

J(y, p, w) = -(1/n) Σ_{i=1..n} Σ_{j=1..k} y_{ij} log(p_{ij}) + λ ||w||^2,

where J(y, p, w) is the objective function; y, p and w denote the target class probability, predicted class probability and network weights, respectively; k is the number of categories in the classification; and λ is the L2 regularization parameter, which prevents model overfitting by controlling the L2 regularization intensity. To obtain an optimal regularization parameter, a simple grid search was performed using a holdout of 20% of the training data for validation. As we employed a mini-batch update strategy, the cost was computed over a mini-batch of inputs; n denotes the mini-batch size.
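A direct numpy transcription of this objective, for a toy mini-batch with one-hot targets, hypothetical predicted probabilities and zero weights (so the L2 term vanishes), could be:

```python
import numpy as np

def objective(y, p, w, lam=1e-4):
    """Mini-batch objective J(y, p, w): mean cross-entropy between the
    target and predicted class probabilities plus an L2 weight penalty."""
    n = y.shape[0]
    ce = -np.sum(y * np.log(p)) / n
    return ce + lam * np.sum(w ** 2)

# Toy mini-batch of n = 2 pixels and k = 3 classes (one-hot targets).
y = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
p = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])
w = np.zeros(5)                  # zero weights: no regularization term
loss = objective(y, p, w)
```

With one-hot targets, only the log-probability of the correct class contributes, so the loss here is -(log 0.7 + log 0.8) / 2.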
The proposed H-ConvNet consists of one 1D ConvNet and one 2D ConvNet, and the training process contains three steps: (1) the 1D ConvNet is trained with a limited number of training samples; (2) training sample expansion is performed using the pretrained 1D ConvNet; and (3) the expanded training samples are used for 2D ConvNet training. For optimization, Adam (Kingma and Ba, 2014) with momentum parameters β1 = 0.5 and β2 = 0.999 is applied; the learning rate and mini-batch size are set to 0.002 and 8 for the 1D ConvNet and 0.001 and 8 for the 2D ConvNet; and training stops when no improvement in the cost function is observed for 100 epochs (1D ConvNet) and 500 epochs (2D ConvNet).
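A single Adam update with the stated hyperparameters can be sketched in numpy as follows (the quadratic toy objective is only for illustration; in the actual training, the gradients come from back propagation through the ConvNets):

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=0.002, beta1=0.5, beta2=0.999, eps=1e-8):
    """One Adam update with the momentum parameters used in this study
    (beta1 = 0.5, beta2 = 0.999) and the 1D ConvNet learning rate."""
    m = beta1 * m + (1 - beta1) * g          # first-moment estimate
    v = beta2 * v + (1 - beta2) * g ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)             # bias corrections
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy objective f(w) = w^2 starting from w = 1; its gradient is 2w.
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 2001):
    w, m, v = adam_step(w, 2 * w, m, v, t)
```

Because the update normalizes the gradient by its running second moment, the effective step size stays near the learning rate early on, which is why the small rates of 0.002 and 0.001 suffice.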

Land Cover Category Determination
We applied the original and expanded training samples for the training of the 1D ConvNet and 2D ConvNet, respectively. Thus, for any input pixel location of the image, both spectral features and context features can be obtained through the well-trained ConvNets. The land cover category of each image pixel was determined according to the more prominent feature, as quantified by the feature-based classification probability (Equation (3)):

c = argmax_k max( p_k(i|spectral), p_k(i|scene) ),

where i is the input image pixel, c is the corresponding classification result, and p_k(i|spectral) and p_k(i|scene) represent the spectral feature-based and context feature-based classification probabilities of category k, respectively. With the collaboration of the obtained spectral and context features, pixelwise urban land cover mapping was performed on the remote sensing image.
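One plausible numpy reading of this decision rule, in which each pixel takes the category whose probability is largest across the two streams, is the following (the probabilities below are hypothetical):

```python
import numpy as np

def fuse_categories(p_spectral, p_context):
    """For each pixel, pick the category whose spectral- or context-based
    probability is largest, i.e. argmax_k max(p_spec_k, p_ctx_k)."""
    return np.argmax(np.maximum(p_spectral, p_context), axis=-1)

# Two pixels, three classes: each stream is confident about a different pixel.
p_spec = np.array([[0.9, 0.05, 0.05],   # spectral stream confident: class 0
                   [0.4, 0.3, 0.3]])
p_ctx = np.array([[0.5, 0.3, 0.2],
                  [0.1, 0.1, 0.8]])     # context stream confident: class 2
labels = fuse_categories(p_spec, p_ctx)
```

This way, whichever stream is more confident about a pixel dominates the final label, which is one way to realize the "more prominent feature" criterion described above.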

Methods Comparison
The performance of the H-ConvNet was quantitatively evaluated by comparison with existing classification methods, including the support vector machine (SVM), object-based image analysis (OBIA), the Markov random field model (MRF) and a newly proposed patch-based CNN system [17]. The patch-based CNNs were implemented strictly according to Sharma et al. [17], and the detailed implementations of the other three methods are given as follows. SVM: The radial basis function (RBF) kernel was adopted. As the penalty value C and kernel width σ must be configured for SVM implementation, we searched for the optimal parameters within the parameter spaces of (8, 16, 32, 64, 128, 256, 512, 1024) for C and (0.005, 0.01, 0.05, 0.1, 0.5, 1) for σ through a grid-search method with 5-fold cross validation.
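The SVM grid search can be sketched with scikit-learn as follows (the toy two-class features stand in for real pixel spectra; the C and gamma grids match those listed above):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy, well-separated two-class "spectra" so the search runs quickly.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (60, 10)), rng.normal(3, 1, (60, 10))])
y = np.array([0] * 60 + [1] * 60)

# C / gamma (kernel width) grids as used for the RBF-kernel SVM baseline,
# with 5-fold cross validation.
param_grid = {"C": [8, 16, 32, 64, 128, 256, 512, 1024],
              "gamma": [0.005, 0.01, 0.05, 0.1, 0.5, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)
```

After fitting, `search.best_params_` holds the selected (C, gamma) pair and `search.best_estimator_` the refitted SVM.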
OBIA: A multiresolution segmentation algorithm was initially implemented to segment the image into objects. Three key parameters, namely, scale, color and compactness, are required in image segmentation; among them, the scale parameter is the most influential [41]. In the comparison experiments, the color and compactness parameters were fixed at 0.9 and 0.5, respectively, and the optimal scale parameter was determined within the parameter space of (50, 60, 70, 80, 90, 100) using a trial-and-error method. The trial-and-error method is laborious and thus time consuming; however, for a convincing comparative test, we prioritized high accuracy over efficiency in the OBIA implementation. Therefore, although some semiautomatic and automatic methods have been proposed for scale parameter determination [42][43][44][45], the trial-and-error method was still adopted in our study. On the basis of the generated objects, a range of features, including the mean value, standard deviation and gray-level cooccurrence matrix, were fed into a parameterized SVM for object-based image classification.
MRF: A Markov random field (MRF) models the contextual correlation among image pixels in terms of the conditional prior probabilities of individual pixels given their neighboring pixels, so that the spatial dependencies within a pixel neighborhood can support the classification of the central pixel [46,47]. For the implementation of the MRF method, the conditional prior probability was calculated using a probabilistic SVM method [48], and the spatial context was incorporated through a predefined four-neighborhood system according to the SVM classification result. The optimal smoothness parameter β was determined within the parameter space of (0.5, 1, 2, 4, 8, 16, 32), and the α-expansion graph-cut-based algorithm was employed to iteratively optimize the MRF energy function.

H-ConvNet, 1D ConvNet and 2D ConvNet
With the implementation of the H-ConvNet, an urban land cover map consisting of four biophysical land cover categories, water, vegetation, soil and impervious surfaces, was produced (Figure 6). To illustrate the efficiency of feature cooperation in mapping urban land cover, we conducted a comparative analysis among the urban land cover maps derived by the 1D ConvNet, 2D ConvNet and H-ConvNet (Figure 6). The spectral feature-based 1D ConvNet could capture more detailed geospatial variations in the urban region, but it inevitably suffered from pixel-level misclassification, particularly in distinguishing impervious and soil surfaces (Figure 6c,g,k). Meanwhile, the 1D ConvNet demonstrated serious spatial discontinuity in urban landscape mapping. The context feature-based 2D ConvNet behaves in the opposite way by assigning the land cover category of a given context to the central pixel. The 2D ConvNet produced a spatially continuous landscape map while suffering from misclassifications in heterogeneous urban environments (Figure 6d,h,l). The urban maps in Figure 6b,f,j show that the H-ConvNet overcomes the respective shortcomings of the spectral feature-based 1D ConvNet and the context feature-based 2D ConvNet, and thus achieves the best performance in urban land cover mapping.

Classification in Terms of the Biophysical Composition and Land Use
The proposed H-ConvNet model was evaluated by mapping the urban land cover in terms of biophysical composition and land use. By comparing the correctly classified pixels of each classification map to the reference map, the proposed H-ConvNet achieved the highest overall accuracy of 87.15% in the four-class urban mapping (Figure 7), followed by the patch-based CNNs (82.87%), OBIA (81.92%), MRF (81.36%) and SVM (81.10%). The H-ConvNet also achieved the highest overall accuracy of 81.04% in the six-class urban mapping, followed by the patch-based CNNs (79.85%), MRF (77.62%), OBIA (77.61%) and SVM (76.94%) (Figure 7). The producer's accuracy [49], calculated by dividing the number of correctly classified reference samples by the total number of reference samples for that class, is used for a further accuracy analysis in the form of confusion matrices. As illustrated in Figures 8 and 9, the four-class land cover mapping experiences serious confusion between soil and impervious surfaces, while the six-class land cover mapping suffers observable confusion among the urban living area, urban industrial area and bare land. In general, across the compared methods, the H-ConvNet achieved the best performance in land cover mapping, with relatively less confusion in both the four-class and six-class classifications.
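The producer's accuracy described above can be computed from a confusion matrix as follows (the matrix values are hypothetical):

```python
import numpy as np

def producers_accuracy(confusion: np.ndarray) -> np.ndarray:
    """Producer's accuracy per class from a confusion matrix whose rows
    are reference classes and columns are mapped classes: correctly
    classified reference samples / total reference samples per class."""
    return np.diag(confusion) / confusion.sum(axis=1)

# Hypothetical 3-class confusion matrix (rows: reference, cols: mapped).
cm = np.array([[80, 15, 5],
               [10, 85, 5],
               [20, 10, 70]])
pa = producers_accuracy(cm)
```

Summing along columns instead of rows would give the user's accuracy, the complementary per-class measure.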

Urban Land Cover Mapping with Additional Test Regions
To further validate the performance and superiority of the proposed H-ConvNet, we selected five additional test regions corresponding to the cities of London, New York, Paris, Seoul and Tokyo (Figure 10). These five urban regions all exhibit high surface heterogeneity with distinct urban landscapes, making them well suited for testing the proposed classification method. With the acquired Sentinel-2 images, a series of preprocessing steps, including atmospheric correction and cloud masking, was carried out first; then, pixel-based training and validation samples were collected using the stratified random sampling strategy described in Section 2.2. Specifically, for the five study areas of London, New York, Paris, Seoul and Tokyo, the numbers of training pixels are 2780, 1221, 829, 657 and 1151, and the numbers of testing pixels are 17,411, 27,470, 17,816, 12,848 and 17,141, respectively. The proposed H-ConvNet and the comparison methods were applied to six-class urban land cover mapping in all test regions (Figure 11). According to the quantitative accuracy assessment (Figure 12), the proposed H-ConvNet performs the best in all test regions, achieving the highest overall accuracies of 77.21%, 90.14%, 87.16%, 89.05% and 87.65% in London, New York, Paris, Seoul and Tokyo, respectively. The accuracies of the comparison methods vary across urban landscapes; for example, OBIA achieved the second highest overall accuracy in the London and New York test regions but the second lowest in the Seoul test region. The MRF model achieved the second highest overall accuracy in the Paris and Seoul test regions but the lowest in the London test region. The SVM performed better in the Seoul and Tokyo test regions but notably worse in the New York and Paris test regions.
Generally, each method has its specific applicability and performs well when the algorithm fits the test urban landscape. Our proposed method achieved the highest accuracy in all the selected urban regions, further demonstrating its superiority in terms of both accuracy and robustness compared to the existing classification methods.

The Improvement in the ConvNet with the Expanded Sample Training
To fully learn the context features of land cover categories, we exploited a training sample expanding strategy that aims to acquire enough training samples for 2D ConvNet training. Through the expansion process, the quantity of training samples was expanded nearly 100-fold for the study region, providing more context information for 2D context feature learning (Figure 13). In this section, we assessed the effect of the expanded training samples on urban land cover mapping in practice. As Figure 14 shows, compared to the 2D ConvNet trained on the original training samples, the model trained on the expanded learning samples achieves improved urban mapping, with an average overall accuracy gain of 7.81%. These results show that a large number of training samples benefits context feature learning with the 2D ConvNet, demonstrating the effectiveness of the proposed training sample expanding strategy. Moreover, when the H-ConvNet is structured with the expanded sample-trained 2D ConvNet, the urban land cover mapping is improved to a certain extent in comparison with the original sample-trained H-ConvNet (Figure 14).

The automatic sample expanding method requires four predefined parameters; therefore, we further evaluated the sensitivity of these parameters by setting them to different values. As shown by the experimental results in Tables 1 and 2, when the parameters fall within reasonable scopes, they only slightly affect the final classification result. Specifically, the initial probability threshold should be larger than 0.85, the adaptive probability growth should lie within the approximate range 0.001-0.005, the maximum sample size within the approximate range 20,000-40,000, and the maximum ratio within the approximate range 1-4. Accordingly, in our study, we set the four parameters to 0.90, 0.005, 30,000 and 3.

Applicability of Spectral/Context Feature-Based Urban Mapping
The proposed H-ConvNet improves urban land cover mapping through an effective collaboration of the spectral feature-oriented 1D ConvNet and the context feature-oriented 2D ConvNet. Specifically, the 2D ConvNet is trained using expanded training samples, which are obtained in combination with the spectral feature-based classification result. Generally, three solutions are available for land cover mapping under different application requirements in our study: (1) the spectral feature-based 1D ConvNet, (2) the context feature-based 2D ConvNet and (3) the spectral and context feature-based H-ConvNet. We tested the performance of the three solutions in terms of mapping accuracy, model stability, operational efficiency and landscape mapping quality using the respective indicators of average overall accuracy, accuracy standard deviation, time consumption and patch density. As Table 3 illustrates, the structured ConvNets achieved overall accuracies in the decreasing order of H-ConvNet, 2D ConvNet and 1D ConvNet. However, as demonstrated by the standard deviation of the accuracies, the 1D ConvNet is the most stable model, followed by the H-ConvNet and 2D ConvNet. Additionally, the mapping quality differs significantly among the mapping solutions (Figure 6); that is, the 2D ConvNet-based urban land cover mapping achieves a low patch density and thus high landscape continuity in the obtained land cover map. However, the spectral-based 1D ConvNet suffers from much salt-and-pepper noise during image classification, and thus the derived land cover map shows high landscape fragmentation. Taking both mapping accuracy and visual mapping quality into account, the H-ConvNet performs the best due to the effective collaboration of spectral and context features.
Nevertheless, although the proposed H-ConvNet performs the best in terms of classification accuracy, it consists of one 1D ConvNet and one 2D ConvNet, and thus its implementation is relatively time-consuming on a CPU-only computer. As Table 3 illustrates, the H-ConvNet takes roughly ten times longer than the 1D ConvNet in land cover/land use classification. Taken together, the structured ConvNets have their respective applicability and can be used for different applications. The H-ConvNet, with its high accuracy and high model stability, is the preferred solution for urban land cover mapping. The structured 1D ConvNet is a good option when time efficiency is required, and the 2D ConvNet is applicable when visual mapping quality is regarded as the most important factor.

Comparison to Semantic Segmentation Models and Further Research
Semantic segmentation classifies each pixel in an image into a class in the field of computer vision [50]. This approach has also shown great potential in pixelwise satellite image classification. Generally, common semantic segmentation models are designed with a 2-D encoder-decoder structure, and state-of-the-art models are commonly structured with very deep layers to guarantee accuracy. Compared to semantic segmentation (Table 4), the newly proposed H-ConvNet is designed with a hybrid 3-D structure that corresponds to 3-D information analysis of multispectral satellite images. The H-ConvNet is also designed as a lightweight network that provides rapid model implementation under limited computing resources (e.g., a CPU-only configuration). In addition, the training conditions in terms of sample quantity and sample design differ between semantic segmentation models and the H-ConvNet. Unlike semantic segmentation models, which require large numbers of human-annotated training samples for feature learning, thereby limiting the intelligent design of these methods, the H-ConvNet can perform well with only a limited number of human-annotated training samples. From the perspective of sample design, the H-ConvNet requires pixel-based sample inputs, whereas semantic segmentation models require patch-based sample inputs. Current studies have explored many state-of-the-art semantic segmentation models for image classification in the field of computer vision [50][51][52]; however, many issues still prevent the direct transfer of existing models from computer vision to remote sensing. First, high-quality datasets for feature learning with respect to different remote sensing applications are lacking. Second, common deep learning methods require large amounts of training data and high-performance computing resources, thereby limiting the widespread application of deep learning.
Therefore, in future studies, more high-quality datasets need to be developed for training state-of-the-art deep learning models. In addition, building on the premise of this study, semi-supervised or weakly supervised deep learning models with lightweight network structures should be further explored.
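The sample-design difference discussed above can be made concrete with a minimal NumPy sketch. The band count, patch size, and class count below are illustrative assumptions, not the paper's actual configuration; the point is only the difference in annotation burden per training sample.

```python
import numpy as np

# Illustrative sizes (assumptions, not the paper's exact settings).
n_bands = 10          # e.g., a subset of Sentinel-2 spectral bands
patch_h = patch_w = 64
n_classes = 6

# Pixel-based sample (H-ConvNet style): one spectral vector, one scalar label.
pixel_sample = np.random.rand(n_bands).astype(np.float32)
pixel_label = 3  # a single class id annotates this sample

# Patch-based sample (semantic segmentation style): an image patch plus a
# dense label map annotating every pixel inside the patch.
patch_sample = np.random.rand(n_bands, patch_h, patch_w).astype(np.float32)
patch_labels = np.random.randint(0, n_classes, size=(patch_h, patch_w))

# Human-annotated labels required per training sample: 1 vs patch_h * patch_w.
labels_per_pixel_sample = 1
labels_per_patch_sample = patch_labels.size
print(labels_per_pixel_sample, labels_per_patch_sample)  # 1 4096
```

This contrast is why patch-based models demand far more annotation effort: each training patch must be densely labeled, whereas a pixel-based sample needs only one class id.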

Conclusions
It is difficult to capture the distinctive features of land cover categories in heterogeneous urban environments, which makes it challenging to map urban land cover with high precision.
Advanced deep ConvNets have brought breakthroughs in the field of computer vision [33]; however, due to the image data gap between the domains of remote sensing and computer vision, it is still challenging to directly and efficiently apply state-of-the-art ConvNets to remote sensing applications. When structuring a ConvNet for remote sensing image classification, the differences in image modality between remote sensing and computer vision should be considered. Specifically, (1) remote sensing images are acquired from a satellite view with widely varying spatial resolutions (e.g., approximately 1-1000 m), whereas images in computer vision are typically acquired from a human view with high spatial resolution. (2) Rather than a color image, which is commonly composed of three color channels, a remote sensing image contains several (e.g., 4-10) or even hundreds of spectral channels, corresponding to multispectral and hyperspectral images, respectively.
(3) In computer vision, massive datasets corresponding to specific applications have been built for ConvNet training; however, very few training datasets have been built for feature learning from remote sensing images. Therefore, model training based on a small number of training samples is still adopted for ConvNets, which results in the low efficiency and poor generalizability of the trained ConvNets in remote sensing image classification. Notably, the semantic segmentation methods of computer vision can also be used for per-pixel remote sensing image classification; however, semantic segmentation requires patch-based labeled data for model training, which makes the implementation of image classification different from that of our proposed method. Therefore, semantic segmentation methods are not compared in this study.
Many researchers have explored the application potential of ConvNets in high-resolution remotely sensed image processing; however, few ConvNet studies have been tailored for moderate-resolution remote sensing images. In this study, we proposed a new lightweight H-ConvNet to enhance urban land cover mapping using moderate-resolution Sentinel-2 images. In experimental testing, the H-ConvNet achieved a significant improvement over existing classification methods, including SVM, OBIA, the MRF model and patch-based CNNs. In general, the main contributions of this paper can be summarized as follows: (1) inspired by advanced deep learning techniques, we designed a new H-ConvNet that is particularly applicable to heterogeneous urban land cover mapping based on moderate-resolution images. Given the advantages of the proposed H-ConvNet and the large reserves of moderate-resolution data, the H-ConvNet demonstrates great potential for large-scale land investigations. (2) The structured spectral-based 1D ConvNet and context-based 2D ConvNet can also be used alone for land cover mapping in some cases. To fully learn context features via the structured 2D ConvNet, a training sample expanding strategy is proposed and shown to be effective in our study.
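The two-stream idea summarized above can be illustrated with a minimal NumPy sketch: a 1D convolution over a pixel's spectral vector, a 2D convolution over its spatial neighborhood, and a concatenation of the two feature streams. This is not the paper's actual H-ConvNet implementation; the band count, patch size, kernel sizes, and random weights are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d_valid(x, k):
    """Valid-mode 1-D convolution (cross-correlation) of a spectral vector."""
    n = len(x) - len(k) + 1
    return np.array([np.dot(x[i:i + len(k)], k) for i in range(n)])

def conv2d_valid(img, k):
    """Valid-mode 2-D convolution of a single-channel context patch."""
    kh, kw = k.shape
    h, w = img.shape
    return np.array([[np.sum(img[i:i + kh, j:j + kw] * k)
                      for j in range(w - kw + 1)]
                     for i in range(h - kh + 1)])

# Toy inputs: a 10-band spectral vector for the centre pixel and a 5x5
# single-band context patch around it (sizes are assumptions).
spectrum = rng.random(10)
patch = rng.random((5, 5))

# Stream 1: lightweight 1-D convolution extracts spectral features (ReLU).
spectral_feat = np.maximum(conv1d_valid(spectrum, rng.standard_normal(3)), 0)

# Stream 2: lightweight 2-D convolution extracts context features (ReLU).
context_feat = np.maximum(conv2d_valid(patch, rng.standard_normal((3, 3))), 0)

# Fuse the two streams into one feature vector for a final classifier head.
fused = np.concatenate([spectral_feat.ravel(), context_feat.ravel()])
print(fused.shape)  # (17,): 8 spectral features + 9 context features
```

A real implementation would stack several such layers with learned weights in each stream; the sketch only shows how the spectral and context streams produce complementary features that are fused before classification.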
Author Contributions: Conceptualization, X.L. and X.T.; methodology, X.L.; validation, X.L. and Z.H.; formal analysis, X.L. and G.W.; writing-original draft preparation, X.L.; writing-review and editing, Z.H. and G.W.; supervision, X.T. and G.W. All authors have read and agreed to the published version of the manuscript.