An Improved Res-UNet Model for Tree Species Classification Using Airborne High-Resolution Images

Abstract: Tree species classification is important for the management and sustainable development of forest resources. Traditional object-oriented tree species classification methods, such as support vector machines, require manual feature selection and generally achieve low accuracy, whereas deep learning can automatically extract image features to achieve end-to-end classification. Therefore, a tree species classification method based on deep learning is proposed in this study. The method combines the semantic segmentation network U-Net and the feature extraction network ResNet into an improved Res-UNet network, in which the convolutional layers of U-Net are replaced by the residual units of ResNet and linear interpolation is used instead of deconvolution in each upsampling layer. At the output of the network, conditional random fields are used for post-processing. The network is used to perform classification experiments on airborne orthophotos of the Nanning Gaofeng Forest Farm in Guangxi, China, and the results are compared with those of the U-Net and ResNet networks. The proposed method achieves higher classification accuracy, with an overall classification accuracy of 87%. Thus, the proposed model can effectively implement forest tree species classification and provides new opportunities for tree species classification in southern China.


Introduction
Tree species classification is highly significant for sustainable forest management and ecological environmental protection [1]. High-spatial-resolution remote sensing images are preferred for detailed tree classification because of their better spatial characteristics.
In recent years, significant advances have been made in high-resolution image classification methods, which are typically categorized as pixel-based classification [2][3][4] or object-oriented classification [5][6][7][8]. Pixel-based classification methods use pixels as the unit of classification; they mainly consider the band spectral intensity information of the pixel and ignore the spatial structure relationship and contextual semantic information [9]. For high-resolution remote sensing images with fewer bands, pixel-based methods lead to substantial redundancy in the spatial data, resulting in "salt and pepper" effects. Many scholars have combined manual feature extraction with traditional object-oriented methods for tree species classification. Immitzer et al. [10] performed a Random Forest classification (object-based and pixel-based) using spectra of manually delineated sunlit regions of tree crowns, and the overall accuracy for classifying 10 tree species was around 82%. Li et al. [11] explored the potential of bitemporal WorldView-2 and WorldView-3 images for identifying five dominant urban [...] classification task is unsatisfactory for complex feature information. The U-Net network can combine the low-level spatial features obtained during downsampling with the upsampling inputs through skip connections, which improves its ability to capture tree edge information. However, gradient degradation commonly occurs as the network deepens. The ResNet network has a unique residual unit, which can avoid gradient degradation as the network deepens [39]. Introducing it into the U-Net network has become a research hotspot, and some scholars have carried out related research in the fields of single-target extraction and urban land classification. Chu et al. [40] proposed a U-Net-based method in which the contraction path was replaced by ResNet for sea-land segmentation. Xu et al. [41] designed an image segmentation neural network based on deep residual networks and used a guided filter to extract buildings in remote sensing imagery more effectively. Zhang et al. [42] proposed novel multiscale deep learning models, namely ASPP-UNet and ResASPP-UNet, for urban land cover classification based on very high-resolution satellite imagery; ResASPP-UNet produced the highest classification accuracy.
However, previous studies that combined U-Net and ResNet mainly performed simple binary classification with relatively simple network structures. Other studies mainly addressed urban land use classification; therefore, the ability of such networks to classify tree species in complex forest types is not clear. In addition, the small differences in spectral characteristics between tree species make tree species classification challenging. Therefore, the main objectives of this study are as follows: (1) to combine U-Net and ResNet and propose a Res-UNet network suitable for tree species classification, in which the convolutional layer of U-Net is replaced with the basic unit of ResNet to extract multiscale spatial features and simultaneously solve the gradient degradation problem that arises as the number of network layers increases, and in which post-processing with conditional random fields (CRF) is applied at the output of the network to optimize the tree species segmentation map; (2) to evaluate the ability of airborne CCD (charge-coupled device) images to identify complex forest tree species in southern China using the Res-UNet network; and (3) to analyze the parameters that affect the classification ability of the model.

Study Area
The study area is located in the Jiepai Forest Farm of the Guangxi Gaofeng State Owned Forest Farm in Nanning, Guangxi Province, southern China. As shown in Figure 1, it is located at 108°31′ east longitude and 22°58′ north latitude. The average annual temperature is approximately 21°C, the average annual rainfall is 1304.2 mm, and the red soil layer is deep, which is suitable for the growth of tropical and subtropical tree species [43]. The forest cover in the study area is dominated by artificial forests, predominantly eucalyptus (Eucalyptus robusta Smith), Illicium verum (Illicium verum Hook.f.), wetland pine (Pinus elliottii Engelm.), Masson pine (Pinus massoniana Lamb.), Chinese fir (Cunninghamia lanceolata (Lamb.) Hook.), and other broad-leaved tree types. Among them, eucalyptus (Eucalyptus robusta Smith) and Chinese fir (Cunninghamia lanceolata (Lamb.) Hook.) are planted over large areas, which has certain advantages for classification. Some broad-leaved tree species have a small planting area and are therefore classified as other broad-leaved trees. Some roads also exist in the study area. The classification system is shown in Table 1.

Acquisition and Preprocessing of Remote Sensing Image Data
The aerial flights took place on January 13, 2018 and January 30, 2018. The aerial photography area extended from 108°7′ to 108°38′ east longitude and from 22°49′ to 23°5′ north latitude, covering approximately 125 km². The specific area is shown in Figure 1. The actual flight altitude was approximately 1000 m, and the weather on the days of data acquisition was clear and cloudless. The onboard LiCHy (LiDAR, CCD, and Hyperspectral) system of the Chinese Academy of Forestry is equipped with an aerial digital camera to acquire CCD images [44]. It is also equipped with a LiDAR scanner and a hyperspectral sensor, and it records LiDAR data, hyperspectral data, inertial measurement unit (IMU) data, and GPS data. The aerial digital camera has 60 million pixels and a lens focal length of 50 mm; the images have a spatial resolution of 0.2 m and comprise three bands (red, green, and blue).

Ground Survey Data and Other Auxiliary Data
The ground data survey was conducted at Gaofeng Forest Farm from January 16, 2018 to February 5, 2018. First, the GF-2 data were visually interpreted to determine the location of the classification area. Then, a field survey was conducted in the classification area to understand the distribution and characteristics of the tree species. In addition, a vector map of the entire forest farm provided by the Guangxi Academy of Forest Sciences was used to assist in making labels for the training samples.

Datasets Production
The datasets used in this study were cropped from the image of the entire aerial survey area (as shown in Figure 1 (top right)). The training data comprised 1000 images of 1024 × 1024 pixels, covering all categories in the classification system. The test data had a size of 5334 × 4951 pixels, and the training data and test data were independent of each other. Based on the forest farm vector data, visual interpretation, and the field survey, the tree species categories were marked as labels. To obtain the number of samples required for training, data augmentation operations such as translation and rotation were performed on the training data to form a total of 2000 images, which were fed to the neural network as the training set. To enhance the robustness of the network, the training set was divided into training data (80%) and validation data (20%) using stratified sampling. The number of training samples and validation samples in each category is shown in Table 2. In addition, this study used 40%, 60%, 80%, and 100% of the training set for training in order to explore the most suitable number of training samples.
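The 80%/20% stratified division described above can be illustrated with a short sketch. It assumes, purely for illustration, that every image tile is summarized by one dominant class label; the function name `stratified_split` and the toy labels are hypothetical and are not taken from the study.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_fraction=0.8, seed=0):
    """Split sample indices into training and validation sets per class.

    labels: one class label per sample (assumption: each tile is summarized
    by a single dominant class, which the paper does not state).
    """
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train, val = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction from every class (stratum).
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        val.extend(indices[cut:])
    return sorted(train), sorted(val)

# Hypothetical toy labels: 2000 tiles spread over nine classes.
labels = [i % 9 for i in range(2000)]
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx))
```

Splitting inside each class keeps the class proportions of the training and validation data close to those of the full training set, which matters when some categories cover only small areas.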

Workflow Description
In this study, an improved U-Net network was used to classify tree species in high-resolution images. The convolutional layers of the network were replaced by the residual units of the ResNet network. The classification process is shown in Figure 2. First, 1024 × 1024 image blocks were cut from the entire image and the real feature categories were labeled to form training samples. The training samples were used as the training set after image augmentation. The selected test sample had a size of 5334 × 4951 pixels and contained nine feature types; the same method was used to label its true feature types. Image blocks, rather than individual pixels, were sent to the network for training, and the model loss was computed. The model parameters were updated by gradient backpropagation until the optimal parameters were obtained. In the classification stage, the test set was sent to the trained network for prediction, and the prediction result was subjected to CRF post-processing to obtain the final classification map.
Remote Sens. 2020, 12, 1128

ResNet Network
Kaiming He et al. [39] proposed the ResNet network in 2015, which won first place in the ImageNet competition classification task. ResNet was proposed to solve the degradation problem of deep networks, and many subsequent methods have been based on either ResNet50 or ResNet101. ResNet builds on the VGG19 network; it replaces the fully connected layer with global average pooling and uses a connection method called the "shortcut connection" (see Figure 3). The feature map is composed of a residual mapping and an identity mapping, and the output is y = F(x) + x. Residual learning is easier than learning the original features directly. If the network has already reached the optimum and continues to deepen, the residual approaches zero. In that case, the added layers only perform identity mapping, and performance does not decrease with increasing depth, which avoids the degradation problem caused by network deepening. In this study, two residual units were designed for different model requirements. As shown in Figure 4, when the numbers of input and output channels were equal, the residual unit shown in Figure 4a was used: three 3 × 3 convolutions with a stride of one were applied to the input, and the result was added to the original input. Conversely, when the numbers of input and output channels differed, the residual unit of Figure 4b was used with a configurable stride: a 3 × 3 convolution was applied to the input, and its output was added to the result of the three convolution operations. The ResNet network in this study was composed of these two types of residual units. To perform the tree species classification task, the residual unit of Figure 4b was used at the output end of the network instead of the fully connected layer. A two-dimensional feature map was output, and softmax was used for pixel-by-pixel class prediction.
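The relation y = F(x) + x and the two shortcut variants can be illustrated with a minimal sketch on plain feature vectors. The callables `branch` and `shortcut` are hypothetical stand-ins for the stacked 3 × 3 convolutions and the shortcut convolution; no real convolution is performed here.

```python
def residual_unit(x, branch, shortcut=None):
    """Return y = F(x) + x for a feature vector x.

    branch:   callable implementing the residual mapping F (a hypothetical
              placeholder for the stacked 3 x 3 convolutions).
    shortcut: optional callable applied to x when F(x) has a different
              length than x (placeholder for the shortcut convolution).
    """
    fx = branch(x)
    identity = x if shortcut is None else shortcut(x)
    if len(fx) != len(identity):
        raise ValueError("branch and shortcut outputs must have the same shape")
    return [f + i for f, i in zip(fx, identity)]

x = [1.0, -2.0, 0.5]

# If the residual F(x) is zero, the unit reduces to the identity mapping.
zero_branch = lambda v: [0.0] * len(v)
print(residual_unit(x, zero_branch))

# Different input and output widths: both paths are mapped to length 2.
branch = lambda v: [v[0] + v[1], v[1] + v[2]]
project = lambda v: [v[0], v[2]]
print(residual_unit(x, branch, project))
```

The first call shows why extra residual units need not hurt a network that is already optimal: a zero residual leaves the input unchanged.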

ResNet-Unet Network
Previously, when a CNN was used for classification tasks, the input was an image and the output was a single corresponding label; however, many visual tasks require a classification result for each pixel. Ronneberger et al. [35] proposed the U-Net network in 2015, whose network structure is shown in Figure 5. In the structure, "3 × 3 conv, n" represents a convolution layer with a 3 × 3 convolution kernel and n input channels, "max_pool_2 × 2" represents a maximum pooling layer with a step size of two, "3 × 3 deconv" represents a transposed convolution layer with a 3 × 3 convolution kernel, "concat" refers to splicing two tensors, and "m × m", such as "256 × 256", denotes a feature map of size m × m. U-Net was mainly used for medical image analysis before gradually being applied to image classification tasks. It is a variant of the CNN that improves on the FCN. U-Net is composed of two main parts: the contraction path and the expansion path. The contraction path is used to capture the semantic information of the image, whereas the symmetrical expansion path is used to accurately locate the semantic information. The fully connected layer is not used in the network structure, which reduces the number of parameters that need to be trained and enables the network to perform end-to-end output more efficiently.
Our tree classification strategy used the idea of semantic segmentation. Based on the advantages of the U-Net network, this study proposed a Res-UNet network by combining U-Net and ResNet, and the following improvements were made for the classification of tree species: (1) The convolutional layer, pooling layer, and residual unit were modified. (2) A residual unit was inserted to extract the spatial features of the image before fusing the feature maps of the downsampling layer and the upsampling layer, so as to adapt to the classification of complex tree species. (3) Linear interpolation was used instead of deconvolution to reduce the model complexity to a certain extent. (4) The final output layer was modified to nine channels to distinguish the nine classes. (5) At the output of the network, post-processing with the CRF was applied to optimize the tree species segmentation map. The network structure is shown in Figure 6; it includes a downsampling part and an upsampling part. In the structure, "3 × 3 conv, n" and "m × m", such as "256 × 256", have the same meaning as in the U-Net structure, "resize_bilinear" represents bilinear interpolation, and "add" refers to adding two matrices.
In the downsampling part of the network, four residual units with a step size of two are used for feature extraction. Every time the feature map passes through one of these residual units, its size is halved and the number of convolution filters is doubled. In each residual unit, batch normalization is applied so that the outputs of each forward pass follow, as far as possible, the same distribution. In this way, the distribution of the data samples referenced in the backward calculation is the same as that in the forward calculation, which leads to more meaningful adjustment of the weights and avoids the problem of gradient explosion during network training. The activation function is the rectified linear unit (ReLU), which makes the model sparse so that it can better mine relevant features and fit the training data, accelerating network convergence.
When a fully convolutional network is used for high-resolution image classification, deconvolution is often used to upsample the feature map to the size of the input image in order to achieve end-to-end classification. However, deconvolution needs to learn a large number of parameters and is computationally intensive. The bilinear interpolation algorithm does not require learned parameters, which reduces the amount of calculation [45]. Therefore, this study used bilinear interpolation instead of deconvolution in the upsampling part of the network and analyzed its impact on classification performance. Every time the interpolation is performed, the feature map doubles in size, until it reaches the size of the input image, so that the entire network operates end to end. As the number of convolutions increases, the extracted features become more effective; however, spatial information in the feature map is easily lost. Therefore, feature maps of the same size in the upsampling and downsampling layers are combined to obtain a feature map with higher spatial resolution.
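As a rough illustration of why bilinear upsampling needs no learned parameters, the sketch below doubles the size of a small feature map using fixed interpolation weights only. Corner-aligned sampling is an assumption made for this sketch; the alignment convention used in the study is not stated, and `upsample_bilinear_2x` is a hypothetical helper name.

```python
def upsample_bilinear_2x(feature_map):
    """Double the height and width of a 2-D feature map by bilinear interpolation.

    Corner-aligned sampling is an assumption of this sketch; the paper does
    not state which alignment convention was used.
    """
    h, w = len(feature_map), len(feature_map[0])
    out_h, out_w = 2 * h, 2 * w
    # Scale factors that map output coordinates back onto the input grid.
    scale_y = (h - 1) / (out_h - 1)
    scale_x = (w - 1) / (out_w - 1)

    out = []
    for i in range(out_h):
        y = i * scale_y
        y0 = int(y)
        y1 = min(y0 + 1, h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            x = j * scale_x
            x0 = int(x)
            x1 = min(x0 + 1, w - 1)
            wx = x - x0
            # Interpolate along x on the two neighbouring rows, then along y.
            top = (1 - wx) * feature_map[y0][x0] + wx * feature_map[y0][x1]
            bottom = (1 - wx) * feature_map[y1][x0] + wx * feature_map[y1][x1]
            row.append((1 - wy) * top + wy * bottom)
        out.append(row)
    return out

small = [[0.0, 1.0],
         [2.0, 3.0]]
large = upsample_bilinear_2x(small)
for row in large:
    print([round(value, 3) for value in row])
```

Every output value is a weighted average of at most four input values, and the weights depend only on position, so nothing has to be trained.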
Figure 6. The network structure of Res-UNet.
In this study, the downsampled feature map was first sent to a residual unit with a step size of one and then upsampled. When the upsampling features were fused, the output of each layer of the upsampling was first subjected to a residual operation with a step size of one to ensure that it had the same size and number of channels as the corresponding upsampling layer. At the output of the network, a 1 × 1 convolution layer was used to obtain a feature map with the same number of output channels as categories. By extracting the deep features of the image and restoring the feature map to the input size, the proposed Res-UNet network achieves end-to-end classification.
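The way feature-map sizes evolve through such an encoder-decoder can be traced with simple arithmetic. The sketch below assumes that each stride-two residual unit halves the spatial size and that each interpolation step doubles it; the starting filter count `base_filters` is a hypothetical value, since the filter counts are not given in the text.

```python
def trace_res_unet_sizes(input_size=256, base_filters=64, depth=4, num_classes=9):
    """Trace spatial sizes through a Res-UNet-like encoder and decoder.

    base_filters is a hypothetical value; the paper does not list filter counts.
    Returns (encoder, decoder, output) where encoder holds (size, filters)
    pairs, decoder holds sizes, and output is (size, channels).
    """
    size, filters = input_size, base_filters
    skip_sizes, encoder, decoder = [], [], []
    for _ in range(depth):
        skip_sizes.append(size)   # same-size maps are reused for fusion later
        size //= 2                # a stride-two residual unit halves the size
        encoder.append((size, filters))
        filters *= 2              # the number of filters doubles per unit
    for _ in range(depth):
        size *= 2                 # bilinear interpolation doubles the size
        if size != skip_sizes.pop():
            raise ValueError("feature maps to be fused must have the same size")
        decoder.append(size)
    # A final 1 x 1 convolution maps to one channel per category.
    return encoder, decoder, (size, num_classes)

encoder, decoder, output = trace_res_unet_sizes()
print(encoder)
print(decoder)
print(output)
```

The check inside the decoder loop mirrors the requirement that only feature maps of the same size are fused.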

Conditional Random Field (CRF)
The CRF is a discriminative probabilistic model that improves on the Hidden Markov Model (HMM) and the Maximum Entropy Markov Model (MEMM). The CRF overcomes the limitation of the HMM, which can only define specific types of feature functions; the CRF can define a larger number of feature functions, and these can take arbitrary weights. The MEMM is only normalized locally, so it easily falls into local optima. In the CRF model, the probability is normalized globally: the global distribution of the data is considered during normalization, which solves the label bias problem of the MEMM and allows the global optimum to be obtained.
In image segmentation, the CRF treats the label of each pixel as a random variable in a Markov random field, and the entire image is a global observation. The energy function of a labeling x can then be expressed as:
$$E(x) = \sum_{i} \psi_u(x_i) + \sum_{i<j} \psi_p(x_i, x_j)$$
The first term, $\psi_u$, is the data term, which comes from the segmentation result of the CNN and represents the probability that pixel i belongs to each category. The second term, $\psi_p$, is the smoothing term applied in post-processing, which depends on the difference in gray value and the spatial distance between the two pixels i and j. The most likely label combination is obtained by minimizing the energy function E(x), which yields the optimal segmentation result. Post-processing is critical to the classification results. To verify the impact of CRF post-processing on the classification results, a CRF operation was added to the network output.
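To make the two energy terms concrete, the following sketch evaluates a CRF-style energy on a single row of six pixels. The Potts-style pairwise penalty, the Gaussian kernel over spatial distance and gray-value difference, and all weights (`w`, `theta_pos`, `theta_gray`) are assumptions chosen for illustration; they are not the settings used in the study.

```python
import math

def crf_energy(labels, probs, gray, w=1.0, theta_pos=2.0, theta_gray=10.0):
    """Energy E(x) = sum of unary terms + sum of pairwise terms on a 1-D row.

    probs[i][k] is the network probability that pixel i has class k.
    The pairwise kernel and all weights are assumptions of this sketch.
    """
    n = len(labels)
    # Data term: negative log-probability of the chosen label at each pixel.
    unary = sum(-math.log(probs[i][labels[i]]) for i in range(n))
    # Smoothing term: penalize different labels on nearby, similar pixels.
    pairwise = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] != labels[j]:
                kernel = math.exp(-((i - j) ** 2) / (2 * theta_pos ** 2)
                                  - ((gray[i] - gray[j]) ** 2) / (2 * theta_gray ** 2))
                pairwise += w * kernel
    return unary + pairwise

gray = [100, 101, 99, 100, 180, 181]
probs = [[0.9, 0.1], [0.8, 0.2], [0.45, 0.55],
         [0.9, 0.1], [0.1, 0.9], [0.2, 0.8]]

argmax_labels = [0, 0, 1, 0, 1, 1]    # pixel 2 is an isolated noisy label
smoothed_labels = [0, 0, 0, 0, 1, 1]  # the noisy label is removed

print(crf_energy(argmax_labels, probs, gray))
print(crf_energy(smoothed_labels, probs, gray))
```

In this toy case the smoothed labeling has the lower energy: the isolated label is surrounded by pixels of nearly the same gray value, so the pairwise penalty outweighs the small gain in the data term, while the true boundary between the dark and bright pixels costs almost nothing.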

Network Training and Prediction
During network training, the model parameters were initialized randomly and the training set was input into the model for training. The average cross-entropy loss was used as the loss of the model:
$$L = \frac{1}{m} \sum_{i=1}^{m} \ell(x_i, z_i)$$
Here, m represents the size of the mini-batch, $x_i$ and $z_i$ represent the predicted and true values of the ith sample in each batch, respectively, and $\ell(x_i, z_i)$ is the cross-entropy between them. The loss was backpropagated and the network parameters were optimized using the Adam optimizer [46]. The Adam update is
$$\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{s_t} + \epsilon} m_t$$
where θ is the weight, α is the learning rate, t is the number of training iterations, m is the momentum vector, s is the accumulated squared gradient, and ε is a very small constant.
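The loss and the optimizer step can be sketched in a few lines of plain Python. The one-hot form of the cross-entropy, the bias-correction step, and the values of `beta1`, `beta2`, and `eps` are the commonly used Adam conventions and are assumptions here, not settings reported in the study; only the learning rate of 1e-5 comes from the text.

```python
import math

def mean_cross_entropy(predicted, true_labels):
    """Average cross-entropy over a mini-batch.

    predicted[i] is a list of class probabilities; true_labels[i] is the index
    of the true class (a one-hot target, assumed here for illustration).
    """
    m = len(predicted)
    return -sum(math.log(predicted[i][true_labels[i]]) for i in range(m)) / m

def adam_step(theta, grad, m, s, t, alpha=1e-5, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a list of weights; returns the new (theta, m, s).

    beta1, beta2, and eps are common defaults, not values from the paper.
    t is the 1-based iteration count.
    """
    new_theta, new_m, new_s = [], [], []
    for th, g, m_i, s_i in zip(theta, grad, m, s):
        m_i = beta1 * m_i + (1 - beta1) * g       # momentum (first moment)
        s_i = beta2 * s_i + (1 - beta2) * g * g   # accumulated squared gradient
        m_hat = m_i / (1 - beta1 ** t)            # bias correction
        s_hat = s_i / (1 - beta2 ** t)
        new_theta.append(th - alpha * m_hat / (math.sqrt(s_hat) + eps))
        new_m.append(m_i)
        new_s.append(s_i)
    return new_theta, new_m, new_s

loss = mean_cross_entropy([[0.5, 0.5], [0.25, 0.75]], [0, 1])
theta, m_state, s_state = adam_step([1.0], [2.0], [0.0], [0.0], t=1)
print(loss, theta)
```

On the first iteration the bias-corrected ratio is close to the sign of the gradient, so each weight moves by roughly the learning rate regardless of the gradient magnitude.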
Finally, under the optimal model, the learning rate was set to 1e-5, the batch size was 1, and 60,000 iterations were trained until the accuracy ceased to improve. The model weights were then saved. During prediction, due to computer memory limitations, the model predicted one 256 × 256 area of the test image at a time and applied CRF post-processing, traversing the entire image to obtain the classification result map. This study used Python with the TensorFlow deep learning framework. The hardware configuration of the operating platform included an Intel Xeon E5-2620 v4 CPU @ 2.10 GHz and two NVIDIA GeForce GTX 1080 Ti GPUs.
Table 3 shows the tree species classification accuracy of the Res-UNet (linear interpolation + CRF) network using 40%, 60%, 80%, and 100% of the training set. When only 40% of the training set is used, the classification results are very poor, and the Kappa coefficient is only 0.683. As the training set grows, the classification accuracy shows an upward trend, but the gains gradually diminish. Therefore, this study used 100% of the training set to conduct the experiments with different methods.
Figure 7 shows the tree species classification results for the various classification methods. According to the comparison and analysis of the classification results, Res-UNet has a better ability to distinguish each tree species. Eucalyptus and Illicium verum are classified well, but the small area of Mytilaria laosensis is seriously misclassified. After post-processing with the CRF, the confusion between Chinese fir and other broad-leaved trees is reduced.
The tree species classification results of the various methods are shown in Table 4. The classification accuracy of Illicium verum is high in all networks, indicating that they can all effectively extract the characteristics of Illicium verum and that the classification results are relatively stable. Except for other broad-leaved trees, Res-UNet improves the classification accuracy of every tree species relative to ResNet and U-Net. The classification accuracy of each tree species improves to a different degree after CRF post-processing is added, and the overall classification accuracy increases by 2.7%. The classification accuracy of tree species is also improved by using bilinear interpolation instead of deconvolution, with the overall classification accuracy improving by 5.8%. Figure 7f shows the results obtained with post-processing and upsampling by linear interpolation, which again indicates that the proposed model achieves the best classification effect. Although the classification accuracy is lower than the results obtained using hyperspectral imagery, it is higher than that of studies using three-band high-resolution imagery.
Figure 8 shows the training accuracy and cross-entropy loss curves of the ResNet, U-Net, and Res-UNet networks with linear interpolation in place of deconvolution and with CRF post-processing, where the x-axis represents the number of training iterations. After 80,000 iterations of training, the accuracy and loss of U-Net and Res-UNet tend to stabilize. The accuracy of Res-UNet is slightly higher than that of U-Net, and its loss decreases most rapidly, approaching zero. Conversely, the U-Net loss drops to 0.3 and remains stable, whereas ResNet exhibits the lowest accuracy and the poorest loss convergence; thus, ResNet is the least desirable model.
Table 5 shows the number of parameters that need to be trained for the different models, as well as the time required for model training and prediction. When linear interpolation is used instead of the deconvolution operation in the upsampling process, the training times are approximately equal. However, with linear interpolation, fewer parameters need to be trained, which reduces the complexity of the operation.
Table 5. Parameters, training, and prediction time of different classification methods.

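The parameter saving discussed for Table 5 can be illustrated with a short calculation. The following is a minimal sketch, not the configuration used in this study: the kernel size and channel counts are hypothetical, and the function names are introduced here only for illustration. It counts only the upsampling operator itself, not any convolution that may follow it.

```python
def transposed_conv_params(kernel_size: int, in_channels: int,
                           out_channels: int, bias: bool = True) -> int:
    """Trainable parameters of one 2-D transposed-convolution (deconvolution) layer."""
    weights = kernel_size * kernel_size * in_channels * out_channels
    return weights + (out_channels if bias else 0)


def bilinear_upsample_params() -> int:
    """Bilinear upsampling uses fixed interpolation weights, so nothing is learned."""
    return 0


# Hypothetical decoder stage (assumed values): 2 x 2 kernel, 128 -> 64 channels.
print(transposed_conv_params(kernel_size=2, in_channels=128, out_channels=64))  # 32832
print(bilinear_upsample_params())  # 0
```

Under these assumed sizes, a single deconvolution stage carries tens of thousands of weights that the interpolation-based stage does not, which is consistent with the smaller parameter count reported for the linear-interpolation variant.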

Impact of CRFs on Classification Results
When a deep neural network is used for image classification, the downsampling operations during encoding lose image information, resulting in poor restoration of image contours during decoding. In addition, the convolution operation is locally connected, so it can only extract information from a rectangular area around a pixel. Although repeated convolutions gradually enlarge this rectangular area, information from the whole image still cannot be extracted, even at the last convolution layer. The CRF model is a probabilistic graphical model that calculates the similarity between any two pixels to determine whether they belong to the same class. It uses the global information of the observation field to avoid errors caused by inappropriate modeling and to compensate for the boundary smoothing problem caused by deep neural networks. Starting from the pixel probabilities calculated by the deep neural network, the CRF fuses prior information on the local structure of the image, which can effectively improve the classification accuracy. In this study, CRF post-processing reduced the mixing between the other broad-leaved and Chinese fir classes, especially for the sparsely distributed other broad-leaved trees in the lower right corner of the study area. The resulting boundaries were clearer and smoother, and the overall classification accuracy increased by 2.7%. Figure 9 compares the classification effect for the mixed tree species in the red box in Figure 7f after CRF post-processing.
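The pairwise similarity between two pixels can be sketched as follows. This study does not state which CRF formulation or parameter values were used, so the sketch assumes the widely used fully connected CRF with two Gaussian kernels (an appearance kernel on position and color, and a smoothness kernel on position only). The function name, weights, and bandwidths are placeholders chosen for illustration.

```python
import math


def pairwise_affinity(pos_i, pos_j, color_i, color_j,
                      w_appearance=1.0, w_smooth=1.0,
                      theta_alpha=80.0, theta_beta=13.0, theta_gamma=3.0):
    """Similarity of two pixels under an assumed two-kernel dense-CRF model.

    pos_*   : (row, col) pixel coordinates
    color_* : (R, G, B) pixel values
    All weights and bandwidths are illustrative assumptions.
    """
    d_pos = sum((a - b) ** 2 for a, b in zip(pos_i, pos_j))
    d_col = sum((a - b) ** 2 for a, b in zip(color_i, color_j))
    # Appearance kernel: nearby pixels with similar color are likely the same class.
    appearance = math.exp(-d_pos / (2 * theta_alpha ** 2)
                          - d_col / (2 * theta_beta ** 2))
    # Smoothness kernel: discourages small isolated regions.
    smoothness = math.exp(-d_pos / (2 * theta_gamma ** 2))
    return w_appearance * appearance + w_smooth * smoothness


near_similar = pairwise_affinity((10, 10), (11, 10), (40, 90, 30), (42, 91, 31))
far_different = pairwise_affinity((10, 10), (200, 180), (40, 90, 30), (160, 150, 120))
print(near_similar > far_different)  # True
```

A high affinity encourages the two pixels to take the same label during inference, which is how isolated, mixed patches can be absorbed into the surrounding class.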

Effect of Bilinear Interpolation Instead of Deconvolution
Bilinear interpolation differs from ordinary linear interpolation in that it calculates the value of a point from the four pixels closest to the corresponding coordinate, which can effectively reduce the error. Assume that the source image has size m × n and the target image has size a × b; the side-length ratios of the two images are then m/a and n/b. Typically, these ratios are not integers, so floating-point values are used in computation and storage. The (i, j)-th pixel (row i, column j) of the target image is mapped to the source image through the side-length ratios, giving the corresponding coordinates (i × m/a, j × n/b). These coordinates are typically not integers either. Bilinear interpolation computes the value at such a non-integer position from the surrounding integer-coordinate pixels, which avoids the error that rounding the coordinates would introduce. Moreover, bilinear interpolation has no parameters to learn, which reduces the complexity of the model. In this study, replacing deconvolution with bilinear interpolation reduced the number of parameters that the model needed to train. The classification accuracy of the other broad-leaved, Pinus elliottii, and Chinese fir categories increased by 19%, 6.8%, and 6.8%, respectively, with the other broad-leaved category showing the greatest improvement. Furthermore, the overall accuracy and Kappa coefficient improved by 5.8% and 3.8%, respectively.
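The coordinate mapping and four-neighbor weighting described above can be sketched in plain Python. This is an illustrative implementation under the mapping stated in the text, (i × m/a, j × n/b), with neighbors clamped at the image border; the function name is hypothetical. Deep learning frameworks may use different pixel-alignment conventions, so their outputs can differ slightly from this sketch.

```python
def bilinear_resize(src, a, b):
    """Resize a 2-D grid `src` (m rows x n columns) to a rows x b columns."""
    m, n = len(src), len(src[0])
    out = [[0.0] * b for _ in range(a)]
    for i in range(a):
        for j in range(b):
            # Map the target pixel to (generally non-integer) source coordinates.
            y = i * m / a
            x = j * n / b
            # Four nearest source pixels, clamped at the lower and right borders.
            y0, x0 = int(y), int(x)
            y1, x1 = min(y0 + 1, m - 1), min(x0 + 1, n - 1)
            dy, dx = y - y0, x - x0
            # Interpolate along columns first, then along rows.
            top = src[y0][x0] * (1 - dx) + src[y0][x1] * dx
            bottom = src[y1][x0] * (1 - dx) + src[y1][x1] * dx
            out[i][j] = top * (1 - dy) + bottom * dy
    return out


small = [[0, 10],
         [20, 30]]
large = bilinear_resize(small, 4, 4)
print(large[1][1])  # 15.0, the mean of the four source pixels
```

For target pixel (1, 1) the mapped source coordinate is (0.5, 0.5), so each of the four neighbors receives weight 0.25 and no coordinate rounding is needed.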

Comparison of Improved Res-UNet with U-Net and ResNet Networks
The experimental results show that Res-UNet obtained the best classification results, i.e., the highest classification accuracy and Kappa coefficient for various tree species, followed by U-Net, with ResNet performing worst. When the ResNet network was used alone, the classification results were fragmented, the edges were rough, the accuracy was low, and severe mixing occurred between tree species. The improved Res-UNet network uses the ResNet residual unit in place of the U-Net convolution layers, which allows it to extract information at different scales of the image and to identify tree species in smaller areas. At the same time, it avoids the gradient degradation problem caused by deepening the network, and thereby obtains the best classification effect. Thus, the proposed Res-UNet can be an effective method for classifying complex tree species in southern China.
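The overall accuracy and Kappa coefficient used to compare the networks can both be derived from a confusion matrix. The sketch below uses the standard definition of Cohen's Kappa; the two-class matrix is made-up example data, not a result from this study, and the function name is introduced here for illustration.

```python
def overall_accuracy_and_kappa(cm):
    """Return (overall accuracy, Cohen's Kappa) for a square confusion matrix.

    cm[r][c] is the number of samples of reference class r predicted as class c.
    """
    total = sum(sum(row) for row in cm)
    # Observed agreement: fraction of samples on the diagonal.
    observed = sum(cm[k][k] for k in range(len(cm))) / total
    row_sums = [sum(row) for row in cm]
    col_sums = [sum(col) for col in zip(*cm)]
    # Agreement expected by chance from the row and column marginals.
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical two-class confusion matrix (illustrative values only).
example_cm = [[45, 5],
              [10, 40]]
oa, kappa = overall_accuracy_and_kappa(example_cm)
print(round(oa, 2), round(kappa, 2))  # 0.85 0.7
```

Because Kappa discounts chance agreement, it is lower than the overall accuracy for the same matrix, which is why the two measures are reported together.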

Comparison of Classification Accuracy for Different Categories
Because the various broad-leaved tree species were sparsely distributed, they were grouped into a single other broad-leaved category. However, owing to the differences in characteristics among these species, the classification effect for this category was not ideal, even though its accuracy was greatly increased by improving the network. Notably, eucalyptus, which had a large planting area and a sufficient sample size, exhibited the highest classification accuracy of all tree species. The classification accuracy of Illicium verum was second only to that of eucalyptus, as its clustered leaves are easily distinguishable from those of other tree species. Therefore, given a sufficient sample size, the improved Res-UNet network can be employed with high-spatial-resolution images to achieve higher tree species classification accuracy.

Impact of Label Samples on Classification
When CNNs are used to classify tree species in remote sensing images, labeled samples are very important; however, labeling is difficult [47]. For the other broad-leaved tree species, the proposed method exhibited relatively low accuracy because of the small sample size; the classification accuracy of any tree species with insufficient samples is affected in the same way. The issue of sample preparation is gaining increasing attention from scholars [48]. Some researchers have proposed a method of combining unsupervised learning and semisupervised learning to make samples of each tree species using sparse autoencoders and deep belief networks when testing organic carbon content [49]. This simplifies the production of samples. In future research, we will try to further optimize the network structure to address the small sample problem.

Conclusions
In this article, we proposed an improved Res-UNet network for tree species classification using high-resolution remote sensing images. The method uses the residual unit of ResNet in place of the convolutional layers of the U-Net network; it can therefore achieve multiscale feature extraction from an image, allowing information to propagate from shallow to deep layers while avoiding degradation of network performance. Conditional random fields are applied to the network output for post-classification processing, which results in smoother tree species boundaries. Using bilinear interpolation instead of deconvolution improved the overall accuracy by 5.8%. Experimental results show that, compared with U-Net and ResNet, the improved Res-UNet method can effectively extract the spatial and spectral characteristics of an image. For southern Chinese tree species with small differences in their spectral characteristics, the overall accuracy, average accuracy, and Kappa coefficient were 87.51%, 85.43%, and 84.21%, respectively. The proposed network provides new opportunities for tree species classification from high-spatial-resolution images.
Funding: This research was funded by the National Key R&D Program of China project "Research of Key Technologies for Monitoring Forest Plantation Resources" (2017YFD0600900).