Intelligent Mapping of Urban Forests from High-Resolution Remotely Sensed Imagery Using Object-Based U-Net-DenseNet-Coupled Network

: The application of deep learning techniques, especially deep convolutional neural networks (DCNNs), in the intelligent mapping of very high spatial resolution (VHSR) remote sensing images has drawn much attention in the remote sensing community. However, the fragmented distribution of urban land use types and the complex structure of urban forests bring about a variety of challenges for urban land use mapping and the extraction of urban forests. Based on the DCNN algorithm, this study proposes a novel object-based U-net-DenseNet-coupled network (OUDN) method to realize urban land use mapping and the accurate extraction of urban forests. The proposed OUDN has three parts: the ﬁrst part involves the coupling of the improved U-net and DenseNet architectures; then, the network is trained according to the labeled data sets, and the land use information in the study area is classiﬁed; the ﬁnal part fuses the object boundary information obtained by object-based multiresolution segmentation into the classiﬁcation layer, and a voting method is applied to optimize the classiﬁcation results. The results show that (1) the classiﬁcation results of the OUDN algorithm are better than those of U-net and DenseNet, and the average classiﬁcation accuracy is 92.9%, an increase in approximately 3%; (2) for the U-net-DenseNet-coupled network (UDN) and OUDN, the urban forest extraction accuracies are higher than those of U-net and DenseNet, and the OUDN e ﬀ ectively alleviates the classiﬁcation error caused by the fragmentation of urban distribution by combining object-based multiresolution segmentation features, making the overall accuracy (OA) of urban land use classiﬁcation and the extraction accuracy of urban forests superior to those of the UDN algorithm; (3) based on the Spe-Texture (the spectral features combined with the texture features), the OA of the OUDN in the extraction of urban land use categories can reach 93.8%, thereby the algorithm achieved the accurate discrimination of di ﬀ erent land use types, especially urban forests (99.7%). Therefore, this study provides a reference for feature setting for the mapping of urban land use information from VHSR imagery.


Introduction
Urban land use mapping and the information extraction of urban forest resources are significant, yet challenging tasks in the field of remote sensing and have great value for urban environment monitoring, planning, and designing [1][2][3]. In addition, smart cities are now an irreversible trend in urban development in the world, and urban forests constitute "vital," "green," and indispensable infrastructure in cities. Therefore, the intelligent mapping of urban forest resources from remote sensing data is an essential component of smart city construction.
Over the past few decades, multispectral (such as the Thematic Mapper (TM)) [4][5][6][7], hyperspectral, and LiDAR [8][9][10] techniques have played important roles in the monitoring of urban forest resources. Currently, with the rapid development of modern remote sensing technologies, a very large amount of VHSR remotely sensed imagery (such as WorldView-3) is commercially available, creating new opportunities for the accurate extraction of urban forests at a very detailed level [11][12][13]. The application of VHSR images in urban forest resource monitoring has attracted increasing attention because of the rich and fine properties in these images. However, the ground objects of VHSR images are highly complex and confusing. For one thing, numerous land use types (such as Agricultural Land and grassland) have the same spectrum and texture characteristics [14], resulting in strong homogeneity in different categories [15], that is, the phenomenon of "same spectrum with different objects." For another, rich detailed information gives similar objects (such as building composed of different construction materials) strong heterogeneity in the spectral and structural properties [16], resulting in the phenomenon of "same object with different spectra". In addition, traditional statistical classification methods encounter these problems in the extraction of urban forests from VHSR remote sensing images. Additionally, urban forests with fragmented distributions are composed of scattered trees, street trees, and urban park forest vegetation. This also creates very large challenges for urban land use classification and the accurate mapping of urban forests [17].
Object-based classification first aggregates adjacent pixels with similar spectral and texture properties into complementary and overlapping objects through the image segmentation method to achieve image classification, and the processing units are converted from conventional pixels to image objects [18]. This classification method is based on homogeneous objects. In addition to applying the spectral information of images, this method fully exploits spatial features such as geometric shapes and texture details. The essence of object-based classification is to break through the limitations of traditional pixel-based classification and reduce the phenomena of the "same object with different spectra" and the "salt-and-pepper" phenomenon caused by distribution fragmentation. Therefore, object-based classification methods often yield better results than traditional pixel-based classification methods [19]. Recently, the combination of object-based and machine learning (ML) is widely used to detect features in the forest such as damage detection, landslide detection, and insect-infested forests [20][21][22][23]. In terms of ML, deep learning (DL) uses a large amount of data to train the model and can simulate and learn high-level features [24], making deep learning a new popular topic in the current research on the intelligent extraction of VHSR remote sensing information [15,25,26].
For DL, DCNNs and semantic segmentation algorithms are widely used in the classification of VHSR images, providing algorithmic support for accurate classification and facilitating great progress [27][28][29][30][31][32][33][34][35][36][37][38]. Among them, DCNNs are the core algorithms for the development of deep learning [39]. These networks learn abstract features through multiple layers of convolutions, conduct network training and learning, and finally, classify and predict images. DenseNet is a classic convolutional neural network framework [40]. This network can extract abstract features while combining the information features of all previous layers, so it has been widely applied in the classification of remote sensing images [41][42][43][44]. However, this network has some problems such as the limited extraction of abstract features. Semantic segmentation places higher requirements on the architectural design of convolutional networks, classifying each pixel in the image into a corresponding category, that is, achieving pixel-level classification. A typical representation of semantic segmentation is U-net [45], which combines upsampling with downsampling. U-net can not only extract deeper features but also achieve accurate classification [46,47]. Therefore, U-net and DenseNet can be integrated to address the problem of the limited extraction of abstract features in DenseNet, and this combination may facilitate more accurate extraction from VHSR images.
In summary, object-based multiresolution segmentation offers obvious advantages in dealing with the problems of "same object with different spectra" and the "salt-and-pepper" phenomenon caused by distribution fragmentation [48][49][50][51][52][53], and deep learning is an important method for the intelligent mapping of VHSR remote sensing images. Consequently, this research proposes the novel classification method of the object-based U-net-DenseNet-coupled network (OUDN) to realize the intelligent and accurate extraction of urban land use and urban forest resources. This study takes subregion of the Yuhang District of Hangzhou City as the study area, with WorldView-3 images as the data source. First, the DenseNet and U-net network architectures are integrated; then, the network is trained according to the labeled data sets, and land use classification results are obtained based on the trained model. Finally, the object boundaries derived by object-based multiresolution segmentation are combined with the classification results of deep learning to optimize the classification results with the majority voting method.

Study Area
In this research, a subregion of the Yuhang District (YH) of Hangzhou, in Zhejiang Province in Southeast China, was chosen as the study area ( Figure 1). WorldView-3 images of the study area were captured on 28 October 2018. The images contain four multispectral bands (red, green, blue, and near infrared (NIR)) with a spatial resolution of 2 m and a panchromatic band with a spatial resolution of 0.5 m. According to the USGS land cover classification system [54] and the FROM-GLC10 [55,56], the land use categories were divided into six classes, including Forest, Built-up, Agricultural Land, Grassland, Barren Land, and Water. As shown in Figure 1b, due to shadows in VHSR image, this study added a class of Others, including Shadow of trees and buildings. The detailed descriptions of each land use class and its corresponding subclasses are listed in Table 1.

Data Processing
The image preprocessing including radiation correction and atmosphere correction was first performed using ENVI 5.3. Then, this label maps of the actual land use categories were made by eCognition software based on the results of the field survey combined with the method of visual interpretation. Due to the limitations on the size of the processed images from the GPU as well as to obtain more training images and to better extract image features, this study used the overlapping cropping method ( Figure 2) to segment the images in the sample set into 4761 subimage blocks using 128 × 128 pixel windows for the minibatch training of the DL algorithms.

Feature Setting
In this study, the classification features are divided into three groups: (1) the original R, G, B, and NIR bands, namely, the spectral features (Spe), the spectral features combined with the vegetation index features (Spe-Index), and the spectral features combined with the texture features (Spe-Texture). Based on these three groups of features, the performance of the OUDN algorithm in the mapping of urban land use and urban forest information is evaluated. Descriptions of the spectral features, vegetation index, and texture are given in Table 2. The texture features based on the gray-level co-occurrence matrix (GLCM) [57] include mean, variance, entropy, angular second moment, homogeneity, contrast, dissimilarity, and correlation [58][59][60] with different calculation windows (3 × 3, 5 × 5, 7 × 7, 9 × 9, 11 × 11 and 13 × 13) [61]. Table 2. All the involved features are listed in detail in this paper, including original bands of WorldView-3 data, vegetation indices, and texture features based on the gray-level co-occurrence matrix (GLCM). Texture features based on the gray-level co-occurrence matrix (GLCM) is the ith row of the jth column in the Nth moving window

Methodology
The DenseNet architecture takes the output of all the previous layers as input and combines the previous information features to extract abstract features that are fairly limited. U-net performs deep feature extraction on the basis of the previous layer. Therefore, this study first improves the U-net and DenseNet networks and deeply couples them into the U-net-DenseNet-coupled network (UDN). Then, this network is combined with object-based multiresolution segmentation methods to construct the OUDN algorithm for intelligent and accurate extraction of urban land use and urban forest resources from VHSR images. The following introduces the DL algorithms in detail based on a brief introduction of DCNNs.

Brief Introduction of CNNs
Convolutional neural networks (CNNs) are the core algorithms of DL in the field of computer vision (CV) applications (such as image recognition) because of their ability to obtain hierarchically abstract representations with local operations [63]. This network structure was first inspired by biological vision mechanisms. There are four key ideas behind CNNs that take advantage of the properties of natural signals: local connections, shared weights, pooling, and the use of many layers [24], which are fully utilized.
As shown in Figure 3, the CNN structure consists of four basic processing layers: the convolution layer (Conv), nonlinear activation layer (such as ReLU), normalization layer (such as batch normalization (BN)), and pooling layer (Pooling) [63,64]. The first few layers are composed of two types of layers: convolutional layers and pooling layers. The units in a convolutional layer are organized in feature maps, within which each unit is connected to local patches in the feature maps of the previous layer through a set of weights called a filter bank, and all units in a feature map share the same filter bank. Different feature maps in every layer use different filter banks, so different features can be learned. The result of this local weighted sum is then passed through a nonlinear activation function such as a ReLU, and the output results are pooled and nonlinearly processed through normalization (such as BN). In addition, nonlinear activation and nonlinear normalization are nonlinear blocks of processing that leads to a bigger boost in model training, so they play a significant role in CNN architecture. After multiple convolutions (combining a convolutional layer and a pooling layer is called a convolution), the results are flattened as the input of the fully connected layer, namely, the artificial neural network (ANN). Thus, the prediction result is finally obtained. Specifically, the major operations performed in the CNNs can be summarized by Equations (1)- (5): (1) where S [l] indicates the feature map at the lth layer [25], S [l−1] denotes the input feature map to the lth layer, and W [l] and b [l] represent the weights and biases of the layer, respectively, that convolve the input feature map through linear convolution * . These steps are often followed by a max-pooling operation with p × p window size (pool p ) to aggregate the statistics of the features within specific regions, which forms the output feature map S [l] . The ϕ(Z), R, indicates the nonlinearity function outside the convolution layer and corrects the convolution result of each layer, Z denotes the result of the convolution operation by calculating , m represents the batch size (the number of samples required for a single training iteration), µ represents the mean, σ 2 represents the variance, ε is a constant set to keep the value stable to prevent √ σ 2 + ε from being 0, and R (i) norm is the normalized value.

Improved DenseNet (D):
DenseNet is based on ResNet [65], and its most important characteristic is that the feature maps of all previous networks are used as input for each layer of the network. Additionally, the feature maps are used as input by the following network layer, so the problem of gradient disappearance can be alleviated and the number of parameters can be reduced. The improved DenseNet network structure in this study is shown in Figure 4. Figure 4a is the complete structure, which adopts 3 Dense Blocks and 2 Translation layers. Before the first Dense Block, two convolutions are used. In this study, the bottle layer (1 × 1 convolution) in the Translation layer is converted to a 3 × 3 convolution operation, followed by an upsampling layer and finally the prediction result. The specific Dense Block structure is shown in Figure 4b and summarized by Equation (6): where [X 0 , X 1 , . . . , X −1 denotes the feature maps with layers of X 0 , X 1 . . . , X −1 and H [X 0 , X 1 , . . . , X −1 indicates that the layer takes all feature maps of the previous layers (X 0 , X 1 . . . , X −1 ) as input. In this study, all the convolution operations in the Dense Block use 3 × 3 convolution kernels, and the number of output feature maps (K) in each layer is set to 32. Improved U-net (U): U-net is an improved fully convolutional network (FCN) [66]. This network has attracted extensive attention because of its clear structure and excellent performance on small data sets. U-net is divided into a contracting path (to effectively capture contextual information) and an expansive path (to achieve a more precise position for the pixel boundary). Considering the characteristics of urban land use categories and the rich details of WorldView-3 images, the improved structure in this study mainly increases the number of network layers to 11 layers, and each layer increases the convolution operations, thereby obtaining increasingly abstract features. The network is constructed around convolution filters to obtain images with different resolutions, so the structural features of the image can be detected on different scales. More importantly, BN is performed before the convolutional layer and pooling layer, and the details are shown in Figure 5.
(1) The left half of the bottom layer is the contracting path. With the input of a 128 × 128 image, each layer uses three 3 × 3 convolution operations. After each convolution, followed by the ReLU activation function, max-pooling with a step of 2 is applied for downsampling. In each downsampling stage, the number of feature channels is doubled. Five downsamplings are applied, followed by two 3 × 3 convolutions in the bottom layer of the network architecture.
The size of the feature maps is eventually reduced to 4 × 4 pixels, and the number of feature map channels is 1024. (2) The right half of the network, that is, the expansive path, mainly restores the feature information of the original image. First, a deconvolution kernel with a size of 2 × 2 is used to perform upsampling. In this process, the number of the feature map channels is halved, while the feature maps of the symmetrical position generated by the downsampling and the upsampling are merged; then, three 3 × 3 convolution operations are performed on the merged features, and the above operations are repeated until the image is restored to the size of input image; ultimately, four 3 × 3 and one 1 × 1 convolution operations and a Softmax activation function are used to complete the category prediction of each pixel in the image. The Softmax activation function is defined as Equations (7): where a k (X) represents the activation value of the kth channel at the position of pixel X. K indicates the number of categories, and p k (X) denotes the function with the approximate maximum probability. If a k (X) is the largest activation value in the kth channel, p k (X) is approximately equal to 1; in contrast, p k (X) is approximately equal to zero for other k values.

UDN:
The detailed coupling process of the improved U-net and DenseNet is shown in Figure 6. (a) The first two layers use the same convolutional layer and pooling layer to obtain abstract feature maps; (b) then, the feature maps obtained by the above operations are input into the Combining Block structure to realize the coupling of the convolution results from the two structures. After two convolution operations are performed on the coupling result, max-pooling is used to perform downsampling, followed by two Combining Block operations; (c) after the downsampling, two convolutions are performed on the coupling result to obtain 1024 feature maps of 4 × 4; (d) the smallest feature maps (4 × 4 × 1024) are restored to the size of the original image after 5 upsamplings; (e) finally, the classification result is output based on the front feature maps through the 1 × 1 convolution operations and the Softmax function.

OUDN:
The boundary information of the categories is the basis of the accurate classification of VHSR images. In this study, the OUDN algorithm combines the category objects obtained by object-based multiresolution segmentation [18] with the classification results of the UDN algorithm to constrain and optimize the classification results. Four multispectral bands (red, green, blue, and near infrared) together with vegetation indices and texture features, useful for differentiating urban land use objects with complex information, are incorporated as multiple input data sources for the image segmentation using eCognition software. Then, all the image objects are transformed into GIS vector polygons with distinctive geometric shapes, which are combined with the classification results of the UDN algorithm. Based on the Spatial Analysis Tools of ArcGIS, the category with the largest statistics is taken as the category of the object by counting the number of pixels in each object and using the majority voting method. Thereby, final classification results of the OUDN algorithm are obtained. The segmentation scale directly affects the boundary accuracy of the categories. Therefore, according to the selection method of the optimal segmentation scale [67], this study gains the segmentation results by setting different segmentation scales, and determines the final segmentation scale 50. Finally, the template for training the minibatch neural network based on the above algorithms in this research is shown in Algorithm 1 [68]. The network uses the loss function of categorical cross entropy and the adaptive optimization algorithm of Adam. Additionally, the number of iterations is set to 50, and the learning rate (lr) is set to 0.0001. In each iteration, b images are sampled to compute the gradients, and then the network parameters are updated. The training of the network stops after K passes through the data set.

Experiment Design
The flowchart of steps is shown in Figure 7. The WorldView-3 image with 15.872 × 15.872 pixels is first preprocessed by image fusion, radiometric calibration, and atmospheric correction (Figure 7a). According to this preprocessed image, a 3968 × 3968 pixel subimage with various categories is cropped for model prediction, and other representative subimages are cropped as the sample set including training set and validation set for model training; then, labeled maps are made based on the sample set, followed by image cropping (Figure 7b); the cropped original images and the corresponding labeled maps are used to train the DL models ( Figure 7c); the image with 3968 × 3968 pixels is classified by the trained model ( Figure 7d); finally, objects of multiresolution segmentation are applied to optimize the classification results of the UDN algorithm to obtain the classification results of the OUDN, followed by detailed comparisons of the results from all algorithms including U, D, UDN, and OUDN (Figure 7e).

Results and Analysis
The tests of the proposed OUDN algorithm were presented in this section, and the classification results are compared with those of UDN, improved U-net (U), and improved DenseNet (D). To evaluate the proposed algorithm, the classification results in this study were assessed with the overall accuracy (OA), kappa coefficient (Kappa), producer accuracy (PA), and user accuracy (UA) [69]. The detailed results and analysis of the model training and classification results are clarified as follows.

Training Results of U, D and UDN Algorithms
There were a total of 4761 image blocks with 128 × 128 pixels in the sample set. Additionally, 3984 of these blocks were selected for the training, and the remaining blocks were used for the validation. Then, the cropped original image blocks and the corresponding labeled maps were used to train the minibatch network model according to the template of Algorithm 1. Based on the three feature groups of Spe, Spe-Index, and Spe-Texture, the overall model accuracies including training accuracy (TA) and validation accuracy (VA) of the U, D, and UDN algorithms were demonstrated in Table 3. In all feature combinations, the UDN algorithm obtained the highest training accuracies (98.1%, 98%, and 98.4%). However, for the U and U algorithms, the training accuracies of the Spe-Texture were the lowest (96.3% and 96%) compared with those of the Spe and Spe-Index. The UDN algorithm achieved the highest model accuracies (TA of 98.4% and VA of 93.8%, respectively) based on Spe-Texture.

Classification Results Based on Four Algorithms
The classification accuracies of the U, D, UDN, and OUDN algorithms on the three feature groups of Spe, Spe-Index, and Spe-Texture are demonstrated in Tables 4-6, respectively. In general, among the three feature combinations, the U and D algorithms yielded the lowest OA and Kappa, followed by UDN; in contrast, the OUDN algorithm achieved the highest OA (92.3%, 92.6%, and 93.8%) and Kappa (0.910, 0.914, and 0.928). The average accuracy of the OUDN algorithm was much higher (approximately 3%) than those of the U and D algorithms. As shown in Table 4, the UDN algorithm obtained better accuracies for Agricultural Land and Grassland than the U and D algorithms. For example, the PA values of Agricultural Land were 89%, 88%, and 90.3% for the U, D, and UDN algorithms, respectively, and the PA values of Grassland were 64.3%, 73%, and 74%, respectively. Compared with those of the UDN algorithm, the OUDN algorithm obtained better PA values for Agricultural Land, Grassland, Barren Land, and Water. Table 5 shows that the PA of Agricultural Land of the UDN algorithm was 5% and 3.3% higher than those of the U and D algorithms, respectively. In addition, the OUDN algorithm mainly yielded improvements in the PA values of Forest, Built-up, Agricultural Land, Grassland, and Barren Land. As shown in Table 6, the UDN algorithm yielded higher accuracies for Forest, Built-up, Grassland, Barren Land, and Water than the U and D algorithms, and in particular, the PA value of Grassland was significantly higher, by 15.6% and 17.3%, respectively. Meanwhile, the OUDN algorithm yielded the accuracies superior to those of the UDN algorithm in some categories. In summary, the OUDN algorithm obtained high extraction accuracies for urban land use types, and coupling object-based segmentation effectively addressed the fragmentation problem of classification with high-resolution images, thereby improving the image classification accuracy. Therefore, the OUDN algorithm offered great advantages for urban land-cover classification. Table 4. The classification accuracies of the U, D, UDN, and OUDN algorithms based on the Spe, including the accuracies (user accuracy (UA) and producer accuracy (PA)) of every class, overall accuracy (OA) and kappa coefficient (Kappa).   The classification maps of the four algorithms based on the Spe, Spe-Index, and Spe-Texture are presented in Figures 8-10, respectively, with the correct or incorrect classification results marked in black or red circles, respectively. In general, the classification results of the UDN and OUDN algorithms were better than those of the other methods, and there was no obvious "salt-and-pepper" effect in the classification results of the four algorithms. However, due to the splicing in the U, D, and UDN algorithms, the ground object boundary exhibited discontinuities, whereas the proposed OUDN algorithm addressed this problem to a certain extent.
Classification maps of different algorithms based on Spe: Based on the Spe, the proposed method in this paper better identified the ground classes that are difficult to distinguish, including Built-up, Barren Land, Agricultural Land, and Grassland. However, the recognition effect of the U and D algorithms was undesirable. As shown in Figure 8, the U and D algorithms confused Built-up and Barren Land (red circle (1)), while the UDN and OUDN algorithms correctly distinguished them (black circle (1)); for the U algorithm, Built-up was misclassified as Barren Land (red circle (2)), while the other algorithms accurately identified these classes (black circle (2)); the D algorithm did not identify Barren Land (red circle (3)), in contrast, the recognition effect of the other methods was favorable (black circle (3)); for the U and D algorithms, Grassland was misclassified as Agricultural Land (red circle (4)), while other algorithms precisely distinguished them (black circle (4)); the four algorithms mistakenly classified some Agricultural Land as Grassland and confused them (red circle (5)).
Classification maps of different algorithms based on Spe-Index: Based on the Spe-Index, the proposed method in this paper better recognized Built-up, Barren Land, Agricultural Land, and Grassland. However, the recognition effect of the U and D algorithms was poor. As demonstrated by Figure 9, U and D algorithms confused Built-up and Barren Land (red circle (1)), whereas the UDN and OUDN algorithms correctly distinguished them (black circle (1)); the U algorithm incorrectly identified Barren Land (red circle (2)), while the classification results of other algorithms were superior (black circle (2)); the U and D algorithms mistakenly classified Barren Land as Agricultural Land (red circle (3)), in contrast, the UDN and OUDN better identified them (black circle (3)); for all four algorithms, some Agricultural Land was misclassified as Grassland (red circle (4)).   Classification maps of the different algorithms based on Spe-Texture: Based on the Spe-Texture, the proposed method in this paper better identified each category, especially Grassland, yielding the best recognition result; nevertheless, the recognition effect of the U and D algorithms was worse. As shown in Figure 10, the U and D algorithms incorrectly classified much Barren Land as Agricultural Land (red circle (1)), whereas the UDN and OUDN algorithms identified these types better (black circle (1)); the D algorithm confused Built-up and Barren Land (red circle (2)), while the other algorithms better distinguished them (black circle (2)); the extraction effects for Grassland of the UND and OUDN algorithms (black circle (3)) were better than those of the U and D algorithms (red circle (3)); all the algorithms mistakenly classified some Agricultural Land as Grassland (red circle (4)).

Extraction Results of Urban Forests
This section focuses on the analysis of urban forest extraction based on the Spe, Spe-Index, and Spe-Texture with the four algorithms. As shown in Tables 4-6, the PA values of the urban forest information extraction for all algorithms were above 98%, which indicated that the DL algorithms used in this study offered obvious advantages in the extraction of urban forests. Additionally, for the OUDN algorithm, the average PA (99.1%) and UA (89.3%) of urban forest extraction were better than those of the other algorithms based on the three groups of features. This demonstrated that the OUDN algorithm exhibited fewer errors from urban forest leakage and misclassification errors between urban forests and other land use types.
The classification results for urban forests, including scattered trees and street trees, of the different algorithms based on the Spe-Texture are presented in Figure 11. In this study, two representative subregions (subset (1) and subset (2)) were selected for the analysis of the results of the different algorithms, with the correct or incorrect classification results marked in black or blue circles, respectively. In general, the urban forest extraction effect of the OUDN algorithm was the best. According to the classification results of subset (1), the U and D algorithms mistakenly identified some street trees (blue circles), while UDN and OUDN better extracted these trees (black circles). As shown in the results of subset (2), the extraction results for some scattered trees of the U and D algorithms were not acceptable (blue circles); nevertheless, UDN and OUDN accurately distinguished them (black circles). Additionally, the U and D algorithms misclassified some Forest as Grassland and Built-up (blue circles), whereas UDN and OUDN correctly identified the urban forests (black circles).

Result Analysis
According to the classification results of the four algorithms on the Spe, Spe-Index, and Spe-Texture, a confusion matrix is constructed, which is shown in Figure 12. In general, regardless of the feature combinations, the classification accuracies of each algorithm for Forest, Built-up, Water, and Others were relatively high, and the recognition accuracy was above 95%. In particular, the classification accuracy of the Forest was above 98%, whereas the classification accuracies of the other categories varied greatly. As demonstrated by Figure 12, (1) based on the Spe, the extraction accuracies of the OUDN algorithm for Agricultural Land and Grassland were significantly superior to those of the U and D algorithms, whereas the U and D algorithms misclassified Agricultural Land and Grassland as Forest at a higher rate. Compared with that of the U algorithm, the OUDN algorithm yielded better Grassland classification accuracy (75%, an increase in 11%) while optimizing the extraction accuracy of UDN (74%). The D algorithm misclassified 15% of the Barren Land as Built-up, whereas only 8% was incorrectly predicted by the UDN and OUDN algorithms. Therefore, the OUDN algorithm offered obvious advantages in urban land-cover classification. (2) For the Spe-Index, compared with those of the U and D algorithms (87% and 89%, respectively), the OUDN algorithm yielded higher extraction accuracies of Agricultural Land (94%) and optimized the classification accuracy of UDN (92%). The U and D algorithms misclassified 12% of the Barren Land as Built-up, whereas only 11% and 10% were incorrectly predicted by the UDN and OUDN algorithms, so the OUDN algorithm captured the best classification effect. (3) For the Spe-Texture, the extraction accuracies of the UDN and OUDN algorithms for Grassland were very high (83% and 85%, respectively), and the accuracies were the highest among all the Grassland classification results. Compared with the classification accuracies of the U and D algorithms (68% and 66%), the accuracies of UDN and OUDN were 15-19% higher. Figure 12 showed that the U and D algorithms misclassified 21% and 25% of the Grassland as Agricultural Land, respectively, whereas the misclassification rates of UDN and OUDN were fairly low (7% and 6%, respectively). For urban forests, as demonstrated by Figure 12, (1) based on the Spe, the extraction accuracy of urban forests was 99% for each algorithm, however, these algorithms misclassified Agricultural Land and Grassland as Forest generally. Compared with those of the U and D algorithms, OUDN's rate of misclassification of Agricultural Land and Grassland as Forest was the lowest (5% and 4%); (2) based on the Spe-Index, the OUDN algorithm obtained the highest urban forest extraction accuracy (99%) and the lowest rate of Agricultural Land and Grassland misclassified as Forest (4% and 7%); (3) based on the Spe-Texture, the urban forest extraction accuracy of the OUDN algorithm was the highest (approximately 100%).
Through the above analysis, it was concluded that (1) the classification results of the OUDN algorithm were significantly better than those of the other algorithms for confusing ground categories (such as Agricultural Land, Grassland, and Barren Land); (2) the accuracy of the UDN algorithm was improved through object constraints; (3) especially for Spe-Texture, the OUDN algorithm achieved the highest OA (93.8%), which was 4% and 4.1% higher than those of the U and D algorithms, respectively; (4) the UDN and OUDN algorithms had obvious advantages regarding the accurate extraction of urban forests, and they not only accurately extracted the street trees but also identified the scattered trees ignored by the U and D algorithms.

Discussion
The UDN and OUDN algorithms constructed by this study achieved higher accuracies in the extraction of urban land use information from VHSR imagery than the U and D algorithms. The UDN algorithm applied the coupling of the improved 11-layer U-net network and the improved DenseNet to train the network and realize prediction using the learned deep level features. With the advantages of both networks, the accurate extraction of urban land use and urban forests was ensured. Meanwhile, the UDN algorithm addressed the problems of common misclassifications and omissions in the classification process (Tables 4-6) and dealt with the confusion of Agricultural Land, Grassland, Barren Land and Built-up (Figure 12), thereby improving the classification accuracies of urban land use and urban forest. In all feature combinations, especially for the Spe-Texture, the classification accuracies of the UDN algorithm were 3.4% and 3.5% higher than those of the U and D algorithms, respectively. This study chose 50 as the optimal segmentation scale, and the phenomenon of misclassification with UDN was corrected by the constraints of the segmentation objects (Tables 4-6). The OUDN algorithm not only alleviated the distribution fragmentation of ground objects and the common "salt-and-pepper" phenomenon in the classification process but also dealt with the problem of discontinuous boundaries during the splicing process of the classification results of segmented image blocks (Figures 8-10). Compared with previous studies about classifications using U-net and DenseNet [41,46], this study fully combined the advantages of U-net and DenseNet network and achieved more high classification accuracies. Compared with previous object-based DL classification methods [49], in this study, object-based multiresolution segmentations were used to constrain and optimize the UND classification results and not used to participate in the UND classification. It is necessary to further study in this respect.
The overall classification accuracies (OA) of different features based on different algorithms are shown in Figure 13. (1) In terms of the UDN and OUDN algorithms, accuracies of the Spe-Texture were the highest (93.2% and 93.8%), followed by those of the Spe-Index (92.3% and 92.6%). As demonstrated by Figure 12, for Grassland, Built-up and Water, the classification accuracies of the Spe-Texture were significantly higher than those of the Spe-Index. For example, the Grassland accuracies of the Spe-Texture were 8% and 10% higher than those of the Spe-Index, and the Built-up accuracies of the Spe-Texture were 2% and 2% higher than those of the Spe-Index. It can be concluded from Table 3 that the TA and VA of Spe-Texture are higher. Thus, the classification results of the Spe-Texture were better than those of the Spe-Index and Spe. (2) In terms of the U and D algorithms, classification accuracies of the Spe-Texture were the lowest, 21%, 25% of the Grassland, and 13% and 17% of the Barren Land were misclassified as Forest and Built-up, respectively. In contrast, the accuracies of Spe-Index were the highest. For urban forests, after texture was added to the Spe, that is, based on the Spe-Texture, the UDN and OUDN algorithms achieved the highest classification accuracies (approximately 100%) for extracting the information of urban forests from VHSR imagery. Similarly, the U and D algorithms also offered relatively obvious advantages for extracting urban forest information based on the feature. As shown in Figure 12, (1) for the U algorithm, the Spe-Index yielded the lowest urban forest extraction accuracy. However, based on Spe-Texture, the accuracy is the highest (99%), and the ratio of Grassland misclassified as Forest was lower (10%) than that of Spe (12%), so the Spe-Texture offered advantages in the extraction of urban forests. (2) For the D algorithm, the urban forest extraction accuracy with the Spe-Texture, compared with those with the Spe and Spe-Index features, was the highest; meanwhile, the ratios of Agricultural Land and Grassland misclassified as Forest were the lowest (2% and 8%).
(3) For the UDN and OUDN algorithms, although the urban forest extraction accuracy based on the Spe-Texture was the highest, the ratios of Agricultural Land and Grassland that were misclassified as Forest, compared with those of the other features, were the highest (5% and 8%), thereby resulting in confusion between urban forests and other land use categories.

Conclusions
Urban land use classification using VHSR remotely sensed imagery remains a challenging task due to the extreme difficulties in differentiating complex and confusing land use categories. This paper proposed a novel OUDN algorithm for the mapping of urban land use information from VHSR imagery, and the information of urban land use and urban forest resources was extracted accurately. The results showed that the OA of the UDN algorithm for urban land use classification was substantially higher than those of the U and D algorithms in terms of Spe, Spe-Index, and Spe-Texture. Object-based image analysis (OBIA) can address the problem of the "salt-and-pepper" effect encountered in VHSR image classification to a certain extent. Therefore, the OA of urban land use classification and the urban forest extraction accuracy were improved significantly based on the UDN algorithm combined with object-based multiresolution segmentation constraints, which indicated that the OUDN algorithm offered dramatic advantages in the extraction of urban land use information from VHSR imagery. The OA of spectral features combined with texture features (Spe-Texture) in the extraction of urban land use information was as high as 93.8% with the OUDN algorithm, and different land use classes were identified accurately. Especially for urban forests, the OUDN algorithm achieved the highest classification accuracy of 99.7%. Thus, this study provided a reference for the feature setting of urban forest information extraction from VHSR imagery. However, for the OUDN algorithms, the ratios of Agricultural Land and Grassland misclassified as Forest were higher based on Spe-Texture, which led to confusion between urban forests and other categories. This issue will be further studied in future research.