A New Individual Tree Species Classification Method Based on the ResU-Net Model

Abstract: Individual tree species (ITS) classification is one of the key issues in forest resource management. Compared with traditional classification methods, deep learning networks may yield ITS classification results with higher accuracy. In this research, the U-Net and ResNet networks were combined to form a ResU-Net network by replacing the convolutional layers in the U-Net framework with the residual structure of ResNet. In addition, a second network, named ResU-Net2, was constructed to explore the effect of stacking residual structures on network performance. The ResU-Net2 structure is similar to that of ResU-Net, but each convolutional layer in the U-Net framework is replaced with a double-layer residual structure. The two proposed networks were used to classify ITSs in WorldView-3 imagery of the Huangshan Mountains, Anhui Province, China, acquired in March 2019, and the resulting ITS maps were compared with the classification results obtained with U-Net and ResNet. The total classification accuracy of the ResU-Net network reached 94.29%, higher than that of the U-Net and ResNet models, verifying that the ResU-Net model can classify ITSs more accurately. The ResU-Net2 model performed worse than ResU-Net, indicating that stacking residual modules in ResNet does not necessarily improve accuracy.


Introduction
Forest resources are among the most important natural resources for humankind, and increasing attention is being paid to the management of forest resources [1][2][3]. Determining how to classify individual trees, the smallest component of a forest, is of great significance and research value for forest resource management [4,5]. However, the traditional individual tree species (ITS) measurement method is time consuming and laborious, and it is difficult to widely use in rugged areas [6]. With the development of remote sensing technology, high-spatial-resolution remote sensing imagery can be used to determine the positions of individual trees and delineate individual tree crowns, thus providing the possibility for large-scale ITS classification [7][8][9][10].
At present, ITS classification technology is mainly based on high-spatial-resolution airborne multispectral or hyperspectral data, high-point-density LiDAR point cloud data, or their combination [11][12][13]. With these data, a variety of ITSs can be identified, and high classification accuracies can be obtained [14][15][16]. For example, Guan et al. [17] used deep Boltzmann machines (DBMs) and LiDAR data to obtain high-level features of individual trees; they then used a support vector machine (SVM) to classify ten tree species, with an overall accuracy of 86.1%. Zou et al. [18] employed a deep belief network (DBN) and 3D point clouds to classify five tree species, with an overall accuracy of 93.1%. Combining ResNet with a U-Net network to build a new network structure (the ResU-Net network) can solve the performance degradation issue of U-Net under extreme depth conditions [53]. Additionally, this approach enables the U-Net network to contain deeper layers while keeping the training parameters at the same depth scale. To a certain extent, the insufficient-depth problem caused by fixing the depth of U-Net can thus be avoided, and classification performance can be improved [54].
In this study, high-spatial-resolution satellite imagery and an improved U-Net model were used to identify ITSs. An ITS sample set from remote sensing images of the study area was established and enhanced, and U-Net, ResNet, and ResU-Net were used to classify ITSs in the study area. In addition, the effect of ResU-Net models composed of different combinations of ResNet and U-Net to potentially improve ITS classification accuracy is discussed.
The remainder of this paper is organized into six parts: Section II gives the details of the proposed methods, Section III introduces the experiments, Section IV presents the experimental results, Section V provides the discussion, and the conclusions are presented in Section VI.

Study Area
The study area is located in the scenic region of the Huangshan Mountains, Anhui Province, as shown in Figure 1; forest coverage in the scenic area is 56%. The gray area in Figure 1a denotes the province where the study area is located, and the red rectangle in Figure 1b represents the study area within the province. The study area extends from 118°9′16″ to 118°11′24″ east longitude and from 30°7′8″ to 30°10′37″ north latitude, with a total area of 21.6 km². The study area is located in a subtropical monsoon climate zone and on the northern edge of the central subtropics. This area also lies in an evergreen broad-leaved forest zone with red and yellow soils. As the topography is characterized by raised peaks and long ravines, the climate varies vertically, and the vertical zoning of plants is obvious in the study area. There are more natural forests than planted forests and more mixed forests than pure forests. Pure forests of Pinus taiwanensis (Pinus taiwanensis Hayata) are mainly distributed on the top of Huangshan Mountain. Mixed forests of Pinus taiwanensis (Pinus taiwanensis Hayata) (or Pinus massoniana (Pinus massoniana Lamb)) and broad-leaved trees are distributed near the peak and on the slopes of Huangshan Mountain. Pure broadleaf forests are distributed at the bottom of the mountain. Artificial arbor forests are mainly distributed around the scenic area. The tree species in the scenic area mainly include Cunninghamia lanceolata (Cunninghamia lanceolata (Lamb.) Hook), pine (Pinus), Phyllostachys pubescens (Phyllostachys heterocycla (Carr.) Mitford cv. Pubescens), and arbors (macrophanerophytes).

Experimental Data
The WorldView-3 imagery of the study area was acquired on 10 March 2019. The imagery has one panchromatic band with a spatial resolution of 0.3 m and eight multispectral bands with a spatial resolution of 1.2 m. The wavelength ranges of the eight multispectral bands are shown in Table 1. Due to the large extent and complex topography of the Huangshan scenic area, the field sampling areas were selected along both sides of the sightseeing route during the research group's field investigation, and GPS devices were used to collect sample points in the sampling areas.

Experimental Process
The research process is divided into four main parts: data preprocessing, sample dataset construction, network training and classification, and classification evaluation. This specific process is shown in Figure 3.

Data Preprocessing
First, the WorldView-3 imagery was preprocessed with radiometric calibration, orthorectification, and image clipping. Apparent reflectance was used without atmospheric correction. Then, to fully use the spectral and texture information in the remote sensing imagery, the fusion method based on haze and ratio (HR) information proposed by Jing et al. [55] was used to fuse the panchromatic and multispectral imagery and obtain 8-band imagery with a spatial resolution of 0.3 m for subsequent experiments. The HR fusion method accounts for haze in the spectral bands; the fused image retains the texture information of the panchromatic band while greatly reducing spectral distortion.

Building the Sample Set
To construct the remote sensing imagery sample set of ITSs, it is necessary to first extract image patches that each contain a single tree crown from the whole remote sensing images and then determine and label the species category of each individual tree crown.
Due to the wide coverage of remote sensing imagery and the complexity of ground objects, an automated method is needed to extract and delineate tree crowns from the imagery. In this study, the crown slice from imagery (CSI) algorithm proposed by Jing in 2014 [56] was used for multiscale segmentation of the remote sensing imagery and automatic crown delineation. Because the texture differences between tree species are small in remote sensing images, the CSI algorithm uses both crown texture and spectral brightness information to delineate crowns, making the delineation results more accurate.
The steps in constructing the remote sensing imagery sample set of ITSs are shown in Figure 4. The main steps are as follows: (1) conduct a field survey of the study area and collect samples of tree species categories; (2) use the CSI algorithm to automatically delineate individual tree crowns in the remote sensing images of the study area; (3) combine the collected sample points, crown delineation results, and remote sensing imagery interpretation results to label the species category of each individual tree crown; (4) output each labeled tree crown based on its smallest outer rectangle to obtain a remote sensing image patch for each individual tree; and (5) group the patches by species category to build a sample set containing the main tree species in the study area. The tree species categories were merged according to the field sampling results, and the individual tree crown images were sorted after category labeling. The ITS imagery sample set was divided into five categories, as shown in Table 2.
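Step (4) above, outputting a labeled crown based on its smallest outer rectangle, can be sketched as a simple bounding-box crop. The function name and array layout below are illustrative assumptions, not part of the original workflow:

```python
import numpy as np

def crop_crown_patch(image, mask, label):
    """Crop one labeled crown from a fused image using its minimal
    bounding rectangle.

    image : (H, W, 8) fused WorldView-3 array (assumed layout)
    mask  : (H, W) integer map of delineated crown IDs
    label : crown ID to extract
    """
    rows, cols = np.where(mask == label)
    r0, r1 = rows.min(), rows.max() + 1
    c0, c1 = cols.min(), cols.max() + 1
    return image[r0:r1, c0:c1, :]
```

Applied to every delineated crown, this yields the per-tree image patches that are then grouped by species in step (5).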

Data Augmentation
A deep learning network often requires a large number of training samples to achieve good accuracy, and the number of samples labeled in the above process may not be sufficient. Therefore, after each class in the sample set was divided into training, validation, and test subsets at a ratio of 3:1:1, the sample set was enhanced. In this study, the original ITS samples were rotated by 90, 180, and 270 degrees; flipped horizontally; and flipped vertically (as shown in Figure 5). The number of samples was thus expanded to six times the original number, as shown in Table 3.

U-Net Model

The U-Net structure is mainly divided into a downsampling stage and an upsampling stage, with only convolutional and pooling layers in the network and no fully connected layers. In the network, shallow high-resolution layers are used to solve the localization problem, and deeper layers are used to solve the pixel classification problem, supporting the subsequent semantic-level segmentation and classification of the imagery. In the upsampling and downsampling stages of U-Net, convolution operations are performed at the same level. A skip connection structure connects each downsampling layer with the corresponding upsampling layer so that the features extracted during downsampling can be transferred directly to the upsampling path, thereby increasing the accuracy of pixel localization in U-Net and the accuracy of segmentation and classification. Generally, the input imagery of U-Net consists of single-channel grayscale images used in image segmentation tasks. In this study, by adding reshape, flatten, and softmax operations, the improved U-Net network can classify 8-channel image data.
The structure of the improved U-Net network is shown in Figure 6. The short red arrows represent the convolution and activation operations. The long red arrows represent the copy and crop operation. The short downward yellow arrows represent the maximum pooling operation. The upward yellow arrows represent the up-convolution operation, and the green arrows represent the convolution operation with a 1 × 1 convolution kernel.
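The idea of the improved network, an encoder-decoder with skip connections followed by a flatten and softmax head so that a whole 8-channel crown patch receives a single species label, can be sketched at toy scale. This is a minimal illustration using the modern tf.keras API (the paper used Keras 2.2.4), with one encoder/decoder level instead of the full depth of Figure 6; layer sizes are illustrative assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def mini_unet_classifier(input_shape=(64, 64, 8), n_classes=5):
    """Toy U-Net-style classifier: one downsampling/upsampling level
    with a skip connection, then flatten + softmax so the network
    labels the whole patch instead of each pixel."""
    inputs = layers.Input(shape=input_shape)

    # Downsampling stage
    c1 = layers.Conv2D(16, 3, padding="same", activation="relu")(inputs)
    p1 = layers.MaxPooling2D()(c1)

    # Bottleneck
    b = layers.Conv2D(32, 3, padding="same", activation="relu")(p1)

    # Upsampling stage with a skip connection back to the encoder
    u1 = layers.Conv2DTranspose(16, 2, strides=2, padding="same")(b)
    u1 = layers.concatenate([u1, c1])
    c2 = layers.Conv2D(16, 3, padding="same", activation="relu")(u1)

    # Classification head replacing the usual per-pixel output
    x = layers.Flatten()(c2)
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    return models.Model(inputs, outputs)
```

The skip connection (`concatenate([u1, c1])`) is what passes the shallow high-resolution features directly to the upsampling path, as described above.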

ResU-Net Model
The core module of ResNet is the residual structure, which performs an elementwise addition of block inputs and outputs through a shortcut [52]. This simple addition adds no parameters or computations to the network model but can greatly improve the training speed and effectiveness of the model. In addition, the residual structure can solve the degradation problem that arises in deep models. At present, there are two main types of residual structures (as shown in Figure 7); since different numbers of filters are used in this study when combining the ResNet and U-Net networks, the latter residual structure is the more suitable one for maximizing the accuracy of the results.

The second residual structure is combined with U-Net to construct a new ResU-Net network structure, which is used for multichannel imagery classification. The ResU-Net network mainly adopts the U-Net framework with the same upsampling and downsampling phases, but unlike in U-Net, each convolutional module in the network is implemented as the residual module from ResNet. The specific network architecture is shown in Figure 8, where Figure 8a shows the overall framework of the ResU-Net model and Figure 8b illustrates the specific implementation of each block.

To study whether stacking residual modules helps to further improve classification accuracy, a second ResU-Net structure (the ResU-Net2 network) was designed. Similar to ResU-Net, ResU-Net2 also adopts the U-Net framework with upsampling and downsampling stages. However, two residual modules are stacked in each implementation block to explore whether reasonable stacking of residual modules can improve the classification effect. The specific network architecture is shown in Figure 9, where Figure 9a is the overall framework of the ResU-Net2 model and Figure 9b is the specific implementation of each block in the overall framework.
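The shortcut addition at the heart of the residual module can be written out in a few lines. Below is a minimal numpy sketch of an identity-shortcut residual unit; the two weight matrices and the dense (rather than convolutional) form are illustrative simplifications:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Minimal residual unit: two weighted transforms with a ReLU in
    between, plus an identity shortcut added elementwise to the output.
    The shortcut addition itself introduces no extra parameters."""
    y = relu(x @ w1)
    y = y @ w2
    return relu(y + x)  # elementwise addition of block input and output
```

Because the shortcut passes the input through unchanged, the block only needs to learn the residual (the difference from identity), which is what mitigates the degradation problem in deep networks.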

Experimental Environment
The deep learning environment in this study was built with a Keras 2.2.4 front end and a TensorFlow 1.14.0 back end. The programming language was Python 3.6.8. The operating system was Windows 10, and the graphics card was an NVIDIA GTX1060. Parallel computing was achieved through CUDA10.0.
Based on the software, hardware, and dataset, four combinations of models and data were compared: the improved U-Net model and test dataset, the ResNet model and test dataset, the ResU-Net model and test dataset, and the ResU-Net2 model and test dataset.

Training and Prediction
In this study, the ReLU function was chosen as the neuron activation function for all four network models; the loss function was categorical_crossentropy, and the optimizer was Adam. The batch size in each training cycle was 60, with a total of 500 iterations. In the iterative process, the network was allowed to complete five iterations without improvement (an accuracy increase greater than 0.001 was regarded as an improvement). If more than five iterations passed without an accuracy improvement, the learning rate was reduced by 0.005, with a lower limit of 0.5 × 10⁻⁶. After each epoch, results were obtained for the validation set. With increasing epochs, if the validation error increased or the accuracy improved by less than 0.001 for more than 10 iterations, network training ended. The model with the highest classification accuracy on the validation set was saved and used to produce predictions for the test set.
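This schedule resembles the standard Keras callback mechanism. A sketch is shown below; note that the paper reduces the learning rate subtractively by 0.005, which has no direct Keras equivalent, so the multiplicative `factor=0.5` here is illustrative only, as are the monitored metric name and checkpoint file name:

```python
from tensorflow.keras import callbacks

# Illustrative callback set mirroring the schedule described above:
# reduce the learning rate after 5 epochs without a >0.001 accuracy
# gain, stop after 10 such epochs, and keep the weights that scored
# best on the validation set.
training_callbacks = [
    callbacks.ReduceLROnPlateau(monitor="val_accuracy", factor=0.5,
                                patience=5, min_delta=0.001,
                                min_lr=0.5e-6),
    callbacks.EarlyStopping(monitor="val_accuracy", patience=10,
                            min_delta=0.001,
                            restore_best_weights=True),
    callbacks.ModelCheckpoint("best_model.h5", monitor="val_accuracy",
                              save_best_only=True),
]
```

These callbacks would be passed to `model.fit(..., callbacks=training_callbacks)` alongside the Adam optimizer and categorical_crossentropy loss named above.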

Classification Accuracy
The U-Net, ResNet, ResU-Net, and ResU-Net2 models were trained using the divided training sample sets. During training, the accuracy of each model was verified with the validation sample set. After iterative training, the convergence period, training accuracy, and validation accuracy of the U-Net, ResNet, ResU-Net, and ResU-Net2 models were recorded, as shown in Table 4. The training results in Table 4 indicate that the three models built on the U-Net framework (U-Net, ResU-Net, and ResU-Net2) yield the highest overall training and validation accuracies, all above 93%, verifying that the U-Net framework can accurately complete imagery classification tasks. Compared with U-Net, the ResU-Net model formed by combining U-Net and ResNet has a shorter convergence period, a faster training speed, and significantly higher training and validation accuracies. Compared with the ResU-Net model, the ResU-Net2 model has lower training and validation accuracies, although its convergence period is shorter.
The training and verification accuracies of the four models show that combining the ResNet module with the U-Net framework can effectively increase the training depth and improve the training and verification accuracies relative to traditional methods, but excessive stacking of residual modules decreases the classification accuracy.
To fully illustrate the applicability of the models in classifying different tree species, the confusion matrix of each of the four models was calculated using the test sample set. The producer's accuracy, user's accuracy, overall accuracy, and kappa coefficient were obtained from the confusion matrices, as shown in Table 5, where these metrics are compared across the four models and the highest value for each metric is shown in bold.
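The four metrics in Table 5 all derive directly from a confusion matrix. A short sketch, assuming rows are reference (ground-truth) classes and columns are predicted classes:

```python
import numpy as np

def accuracy_metrics(cm):
    """Producer's/user's accuracy, overall accuracy, and kappa from a
    confusion matrix cm[true, predicted]."""
    cm = np.asarray(cm, dtype=float)
    n = cm.sum()
    diag = np.diag(cm)
    producers = diag / cm.sum(axis=1)  # per class: correct / reference total
    users = diag / cm.sum(axis=0)      # per class: correct / classified total
    overall = diag.sum() / n
    # Chance agreement expected from the row and column marginals
    expected = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / n ** 2
    kappa = (overall - expected) / (1 - expected)
    return producers, users, overall, kappa
```

For example, for a two-class matrix [[50, 10], [5, 35]], the overall accuracy is 0.85 and the producer's accuracy of the first class is 50/60.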
According to Table 5, the ResU-Net model exhibits the highest overall accuracy and kappa coefficient, and the overall accuracy and kappa coefficient of the U-Net and ResU-Net2 models are the same. In general, compared with the other three models, the ResU-Net model provides better overall performance and higher stability, and it is more suitable for classifying ITSs in remote sensing imagery. Although the U-Net, ResNet, and ResU-Net2 models may be outstanding in terms of the producer's accuracy or user's accuracy for certain tree species, their overall classification ability and stability are inferior to those of the ResU-Net model.

Classification Map
The comparison in Section 3.1 shows that the ResU-Net model provides higher classification accuracy and stronger applicability than the other models. Therefore, the ResU-Net model was used to classify the ITSs in the WorldView-3 imagery of the study area. The following classification result maps were obtained, where the "others" category represents nontree types.
According to Figure 10, pure Phyllostachys pubescens forests are mainly distributed in the northern part of the research area. From north to south, mixed forests of Phyllostachys pubescens and evergreen arbors, evergreen arbor forests, mixed forests of evergreen arbors and Cunninghamia lanceolata, pure Cunninghamia lanceolata forests, mixed forests of Cunninghamia lanceolata and Pinus taiwanensis, and mixed forests of Pinus taiwanensis and deciduous arbors can be observed. Mixed forests of Cunninghamia lanceolata and Pinus taiwanensis and some pure Cunninghamia lanceolata forests are distributed in the southeastern part of the study area. These species display an obvious vertical distribution with elevation, consistent with the vertical change in climate in the region. Individuals of the same species are concentrated in their distribution, and the distribution of different species displays transitional trends that are closely related to the vertical climatic characteristics and the habitat preferences of the species in the study area.

The Reliability of the U-Net Model Framework
The experimental results indicate that the three models that use the U-Net framework achieve good accuracy in ITS classification. This result verifies the classification reliability of the U-Net model framework, which improves classification accuracy by connecting shallow and deep features. Many scholars have also explored this area; for example, Pan et al. [57] segmented individual buildings using WorldView satellite imagery with eight pan-sharpened bands at 0.5 m spatial resolution and U-Net, achieving a segmentation accuracy over 86%. Wang et al. [58] integrated a spatial pyramid pooling module into the U-Net structure to classify 11 feature types with an overall accuracy of more than 86%.
In future research, further utilization of the advantages of the U-Net model framework should be explored, shallow and deep features should be further connected to improve the modeling accuracy, and the use of other models with this framework should be investigated to improve the model.

The Residual Structure and U-Net Model Framework
The experimental results show that the ResU-Net model can complete high-precision classification of ITSs and reduce classification errors. Compared with the U-Net model, the ResU-Net model with a residual structure can mitigate the degradation issue in deep networks, allow the network parameters to be sufficiently trained, and improve the classification accuracy at a constant network depth.
The classification performance of traditional residual networks improves as the number of network layers is increased. However, within the U-Net framework, the accuracy of the ResU-Net2 model does not improve as more residual structures are stacked.
Compared with the ResU-Net model, the ResU-Net2 model displayed decreases in both classification accuracy and performance. The results suggest that it is necessary to explore the appropriate combination mode when the two network models are combined. An unsuitable combination mode will lead to performance degradation and network structure redundancy.
At present, some scholars have also explored different structures of ResU-Net models and achieved good results [59,60]. How to find a more suitable ResU-Net model among many combination methods is a problem to be explored in the future.

The Separability of Rare Tree Species Samples
Due to the difficulty of collecting ITS samples in the field and the effects of various factors during collection, samples of some species are often too sparse for a network to fully learn their classification characteristics and identify individuals. To address this, samples of similar classes are often merged into a larger class to expand the sample size and enhance classification. However, this approach also sacrifices the ability to recognize rare tree species, and only trees in the merged categories can be identified.
In addition, the difficulty of classifying individual tree species varies across remote sensing images. Individual trees in coniferous forests tend to be separate from one another, making it easy to delineate and classify them. In broadleaf and high-density mixed forests, the canopies cover a wide area and overlap and interlock with one another, so individual crowns are not distinct, and individual tree species are difficult to identify and classify [61].
Future research needs to continue to explore the use of small sample sets in network training to accurately discriminate among different tree classes and improve the accuracy of classification results.

The Selection of Data Sources
In this experiment, WorldView-3 data with a spatial resolution of 0.3 m were used as the main research dataset. The WorldView-3 data include eight multispectral bands with abundant spectral and texture information. However, there are still some defects in these data, such as a spatial resolution that is not high enough to support the accurate extraction of individual tree crowns and locations of individual trees. Determining how to improve the positioning accuracy of individual tree crowns is an urgent problem that must be solved.
In addition, with the development of remote sensing technology and the continuous enrichment of remote sensing information sources, increasing amounts of information can be used for ITS classification. High-resolution remote sensing imagery contains abundant canopy texture features that clearly reflect fine forest details, laying the foundation for high-precision extraction and classification of individual canopies. Hyperspectral images are rich in spectral information and have potential for distinguishing tree species of different ages [62,63]. LiDAR data, in turn, can effectively reconstruct the structure of individual trees, thus contributing to ITS classification [64].
For the many available types of classification information, determining how to effectively combine multiple pieces of information and improve the accuracy of ITS classification are important tasks for the future.

Experimental Errors
Although experimental errors were avoided as much as possible during the experimental process, some errors were unavoidable due to objective constraints. For instance, during the sampling process, there was a high possibility of misidentifying tree species due to similarities among trees and gaps in the knowledge of the sampling personnel. In addition, random noise introduced during remote sensing imagery acquisition can cause experimental error in tree species classification. The robustness of the classification network must be improved to reduce the influence of such errors. Moreover, the ability of a fully trained classification network to recognize and eliminate mislabeled samples while still training satisfactorily should be researched in the future.
In this area of research, the application of generative adversarial networks (GANs) can be considered. The GAN framework consists of two modules: a generative model and a discriminative model. The discriminative module can effectively identify incorrect samples, while the generative module can help address sample scarcity, thus helping to improve the classification ability for ITSs [65][66][67].

Conclusions
In this study, remote sensing imagery and field sampling data in the study area were used to construct and enhance a remote sensing imagery ITS sample set, and a ResU-Net model was proposed for ITS classification. The following conclusions were obtained.
First, the introduction of the U-Net model framework improved the classification accuracy of ITSs. By comparing the classification results of the improved U-Net model with those of the two ResU-Net models, we found that the training and verification accuracies of the three models were all above 93%, and the overall accuracies were all greater than 92%. These results verify that the U-Net model framework is reliable for ITS classification.
Second, the combination of ResNet and U-Net can improve the accuracy of classification models. The combination of the residual module in ResNet and the U-Net model framework can effectively increase the model training depth and improve the training and verification accuracies. The results indicate that the training and verification accuracies of the ResU-Net model are 95.77% and 95.67%, respectively, with an overall accuracy of 94.29%. Compared with those of the U-Net model, the training and verification accuracies of the ResU-Net model are improved by 1.49% and 1.51%, respectively, and the overall accuracy is improved by 2.11%.
Finally, the excessive stacking of residual modules leads to a decline in the network classification ability. When there are too many residual modules stacked in the U-Net model framework, the classification accuracy decreases. By comparing the classification accuracies of the two ResU-Net models, we find that the training, validation, and overall accuracies of the ResU-Net2 model are 1.91%, 2.29%, and 2.11% lower than those for the ResU-Net model, respectively.
Through this study, a new network structure with high accuracy and suitability for ITS classification is constructed, and the influence of stacking residual modules on network performance is further considered to aid in future experiments.