Deep Learning Approaches for the Mapping of Tree Species Diversity in a Tropical Wetland Using Airborne LiDAR and High-Spatial-Resolution Remote Sensing Images

The monitoring of tree species diversity is important for forest or wetland ecosystem service maintenance and resource management. Remote sensing is an efficient alternative to traditional field work for mapping tree species diversity over large areas. Previous studies have used light detection and ranging (LiDAR) and imaging spectroscopy (hyperspectral or multispectral remote sensing) for species richness prediction. The recent development of very high spatial resolution (VHR) RGB images has enabled detailed characterization of canopies and forest structures. In this study, we developed a three-step workflow for mapping tree species diversity, the aim of which was to increase knowledge of tree species diversity assessment using deep learning in a tropical wetland (Haizhu Wetland) in South China based on VHR-RGB images and LiDAR points. Firstly, individual trees were detected based on a canopy height model (CHM, derived from the LiDAR points) by the local-maxima-based method in the FUSION software (Version 3.70, Seattle, USA). Then, tree species at the individual tree level were identified via a patch-based image input method, which cropped the RGB images into small patches (the individually detected trees) centered on the detected tree apexes. Three different deep learning methods (i.e., AlexNet, VGG16, and ResNet50) were modified to classify the tree species, as they can make good use of the spatial context information. Finally, four diversity indices, namely, the Margalef richness index, the Shannon-Wiener diversity index, the Simpson diversity index, and the Pielou evenness index, were calculated from fixed subsets with a size of 30 × 30 m for assessment. In the classification phase, VGG16 had the best performance, with an overall accuracy of 73.25% for 18 tree species.
Based on the classification results, the mapped tree species diversity showed reasonable agreement with the field survey data (Margalef: R2 = 0.4562, root-mean-square error (RMSE) = 0.5629; Shannon-Wiener: R2 = 0.7948, RMSE = 0.7202; Simpson: R2 = 0.7907, RMSE = 0.1038; Pielou: R2 = 0.5875, RMSE = 0.3053). While challenges remain for individual tree detection and species classification, the deep-learning-based solution shows potential for mapping tree species diversity.


Introduction
There is much evidence that tree species diversity is important for maintaining wetland ecosystems (Schäfer et al. [1]). With the ongoing loss of biodiversity, mapping tree species diversity can provide valuable insights for ecologists [2] and is also essential from the perspective of environmental monitoring and conservation management [3]. Traditional biodiversity measurement is often conducted through field work or monitoring systems [4]. However, these means cannot provide spatially distributed and regularly updated information [5]. Remote sensing techniques can solve this problem and have been used for biodiversity monitoring, as they can cover large areas at multiple spatial scales [6][7][8][9].
Hyperspectral remote sensing, or imaging spectroscopy, has been widely utilized in measuring biodiversity distribution due to its significant capacity for identifying species through spectral measurement [10,11]. Zhao et al. [12] used a species-driven leaf optical trait method called "spectranomics" for forest species diversity mapping. They identified interspecies variations in biochemical and structural properties from airborne hyperspectral images. In their approach, a maximum of 13 species could be identified. They also reported that their method would be limited in other areas, as the algorithm reaches saturation when species richness is high. Previous studies have also explored supervised classification methods for tree species discrimination. Yang et al. [13] conducted mangrove species mapping based on AISA+ hyperspectral imagery. A minimum noise fraction transformation was employed for spectral dimensionality reduction. Classification approaches such as maximum likelihood (ML) and spectral angle mapper (SAM) were used to distinguish tree species, and ML showed good capability in their study. Nevalainen et al. [14] employed k-nearest neighbors (k-NN), naive Bayes, C4.5 decision trees, multilayer perceptron (MLP), and random forest (RF) for the classification of five tree species in southern Finland, and they investigated different feature compositions for the input. Dalponte et al. [15] analyzed multisensor images for tree species identification, feeding different data setups to supervised classifiers such as support vector machine (SVM) and RF classifiers. They reported that the multispectral data performed worse than the hyperspectral images and high-density light detection and ranging (LiDAR) data, and a similar conclusion was drawn by Goodenough et al. [16].
Isolating individual trees from remote sensing images has been found to be beneficial for tree species diversity mapping [1,17], as the within-class spectral variation can be reduced over individual tree crowns. However, plant-level mapping is constrained when an individual tree is smaller than the spatial resolution of the remote sensing images [18]. Recently, very high spatial resolution (VHR) images with low spectral resolution have been found to provide a detailed spatial distribution of tree species types, which enables individual tree detection. Getzin et al. [19] proposed a multiregression approach using canopy gaps derived from VHR unmanned aerial vehicle (UAV) images for biodiversity assessments in forests and demonstrated the potential of cost-effective VHR images in biodiversity mapping. Although studies have exploited VHR data for forest-nonforest classification or coniferous-broadleaf classification [20][21][22], it remains a challenge to use VHR images for tree species identification [15]. A number of studies have utilized pixel-wise classification based on the spectra of leaves but have largely ignored the texture information of tree canopies (a visual feature that captures the structural arrangement of the trees in an image and their relationship with their neighbors). To make identification of tree species from VHR images possible, there is a need to exploit detailed spatial information, such as shape, texture, and context information. To overcome the abovementioned difficulties, many studies have developed object-based methods that segment homogeneous and adjacent pixels into objects [23,24]. Object-based classification can make better use of features such as shape or texture than pixel-wise classification, but determining the object size is difficult across complex scenes.
Deep learning approaches have become powerful tools for feature extraction and image processing in computer vision [25] and even in remote sensing [26,27]. Deep learning approaches have proved to be superior to traditional machine learning methods in a number of remote sensing applications.
In their review, Ma et al. [26] showed that nearly 200 publications using deep convolutional neural networks (CNNs) had been published in the field of remote sensing by early 2019, most of which focused on land use and land cover (LULC) classification [28], urban feature extraction [29][30][31], and crop detection [32,33]. Deep learning approaches often require a large amount of training data, and benchmark datasets are publicly available for training and testing deep learning approaches in the abovementioned remote sensing fields. Compared with the studies mentioned above, very few studies using deep learning have focused on tree or forest classification [34]. Flood et al. [35] used a U-Net convolutional neural network to extract woody vegetation extent from high-resolution three-band Earth-I imagery. In their research, a selection of 1 km2 areas was manually labeled for training. The final results were pixel-wise, and only two types (trees and large shrubs) were mapped. If there are diverse tree species, pixel-wise data labeling will be difficult and costly, especially in forested areas. Li et al. [36] proposed a deep-learning-based approach for detecting and counting oil palm trees. In their research, the simple deep learning model LeNet was used as the classifier, and the model input was determined by a sliding window of 17 × 17 pixels. The sliding step was found to have a large impact on the detection results: palm trees could be repeatedly detected or missed if the sliding window was too small or too large, respectively. Compared with pixel-based tree species classification, patch-based tree samples can better capture useful spatial context features for classification. In this study, we aimed to evaluate the potential of deep learning to classify tree species at the individual tree level by using airborne LiDAR and high-spatial-resolution aerial images for the purpose of diversity mapping.
To achieve this goal, (1) individual trees were first identified using LiDAR data, and the tree apexes were distinguished. The tree patches were then cropped based on the detected tree apexes and three-band VHR images. (2) The training and test samples of 17 tree species and one class named "Others" were surveyed in the field work. The samples were labeled at the individual tree level, and three deep CNNs were modified for tree species classification. (3) Tree species diversity was mapped based on the individual tree species. A tropical wetland named Haizhu Wetland in South China was selected as the study area.

Study Area
The Haizhu National Wetland Park (centered at 23°04′7.06″ N, 113°20′2.30″ E) is located in the city of Guangzhou in Guangdong Province in South China. The wetland covers approximately 869 ha (of which 377 ha is water), with an elevation range of −1 to 9 m above sea level. It is a composite wetland ecosystem of the Pearl River Delta, combining inland lake wetland and orchard, but it also contains land cover types such as roads, water, and buildings. It is also an important ecological barrier in southern Guangzhou. The Haizhu National Wetland Park consists of three parts (marked A-C in Figure 1). Area A is also referred to as Haizhu Lake Park, and a forest (dominated by broadleaved trees) grows along the lake. Area B is a semiconstructed wetland mixed with various broadleaved trees, including flowering and fruit trees. Area C is mainly covered by neatly arranged fruit trees. On the basis of the original orchard, tree species enrichment and habitat restoration have been conducted in the Haizhu National Wetland over the past few years, making it a good ecological environment for birds. The forest across the Haizhu National Wetland Park is now heterogeneous, with approximately 20 dominant tree species, including evergreen broadleaved species (such as Ficus microcarpa Linn. f., Delonix regia, and Chorisia speciosa A.St.-Hil.) and fruit trees (such as longan and litchi). We selected all of Area A and parts of Areas B and C as the study area (Figure 1).

Figure 1.
The location of Haizhu Wetland and the three separate parts, shown with true-color composites, that were selected as the study area. Area A is also referred to as Haizhu Lake Park, and a forest (dominated by broadleaved trees) grows along the lake. Area B is a semiconstructed wetland mixed with various broadleaved trees, including flowering and fruit trees. Area C is mainly covered by neatly arranged fruit trees. The red squares in Areas A and B denote the locations of the field-surveyed plots.

Field Survey
The field work was conducted in June 2018 and March and May 2019. Twelve square plots with a size of 30 × 30 m were randomly surveyed across the Haizhu National Wetland, covering the dominant tree species. We also collected single-tree samples for the dominant tree species, including the broadleaved tree species (banyan tree (F. microcarpa Linn. f.), flame tree (D. regia), silk floss tree (C. speciosa A.St.-Hil.), Bauhinia (Bauhinia purpurea Linn.), eucalyptus trees (Eucalyptus robusta Smith), sakura tree (Cerasus sp.), pond cypress (Taxodium ascendens Brongn), Alstonia scholaris, Bischofia javanica Bl., Hibiscus tiliaceus Linn., and camphor tree (Cinnamomum camphora (L.) Presl.)) and the fruit tree species (litchi (Litchi chinensis Sonn.), longan (Dimocarpus longan Lour.), banana (Musa nana Lour.), papaya (Carica papaya), carambola (Averrhoa carambola L.), and mango tree (Mangifera indica L.)). The plots as well as the single-tree samples were positioned using the Guangzhou Continuous Operating Reference System (GZCORS) and an electronic total station. Trees with a diameter at breast height (DBH) larger than 5 cm were measured, and the corresponding tree species were recorded. Finally, a total of 2439 scattered trees and 304 trees in the 12 plots (the red squares in Figure 1) were surveyed for 17 dominant species and one class of nondominant tree species.

Remotely Sensed Data
The remotely sensed data we used were high-resolution RGB images and LiDAR point clouds, which were acquired in September 2017 over the research site. A Trimble Harrier 68i laser scanner and a frame-amplitude aerial digital camera were mounted on a Yun5 airplane for data collection, and the flight height was about 1000 m. The RGB images in TIFF format were orthorectified using ground control points and mosaicked, while the LiDAR point clouds were stored in LAS format. The aerial data were collected under consistently good weather conditions and atmospheric transparency, so the imaging conditions can be considered approximately uniform. The ground sampling distance of the RGB images was 0.1 m, and there were five to eight LiDAR points per square meter. The point clouds were first preprocessed to remove noise. Digital surface models (DSMs) and digital elevation models (DEMs) were derived using raster conversion and filtering in LAS tools with a spatial resolution of 0.1 m. A canopy height model (CHM) was created to obtain the height of trees by subtracting the DEM from the DSM for further individual tree detection.
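The CHM derivation (DSM minus DEM) can be sketched as below; `compute_chm` and the toy arrays are illustrative stand-ins, and in practice the DSM and DEM rasters would first be read from the LiDAR-derived products:

```python
import numpy as np

def compute_chm(dsm: np.ndarray, dem: np.ndarray) -> np.ndarray:
    """Derive a canopy height model by subtracting the DEM from the DSM;
    small negative differences (interpolation noise) are clipped to zero."""
    return np.clip(dsm - dem, 0.0, None)

# Toy 3 x 3 rasters (values in metres); real data would come from the
# 0.1 m DSM/DEM grids derived from the LiDAR point clouds.
dsm = np.array([[5.0, 7.2, 6.1],
                [4.8, 9.0, 6.5],
                [4.9, 5.1, 5.0]])
dem = np.array([[4.9, 5.0, 5.1],
                [4.8, 5.0, 5.2],
                [5.0, 5.1, 5.0]])
chm = compute_chm(dsm, dem)
```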

Overview
Unlike traditional tree species classification methods that predict the tree species label per pixel or per segmented object, our method was designed to infer patch-level predictions. As shown in Figure 2, the training and test samples were in the form of image patches rather than pixels or objects, and finally, each image patch was assigned a tree species label. Figure 2 presents the flowchart of our method, which consists of three major steps. Firstly, individual trees (i.e., tree locations with x and y coordinates) were isolated based on the CHM via a local maxima algorithm [37]. Individual tree patches were then cropped according to the detected tree apexes and the RGB images. Secondly, three deep learning methods (i.e., AlexNet, VGG16, and ResNet50) were modified for tree species classification. The cropped tree image patches were fed into the CNNs, and finally, 17 dominant tree species and a class of "Others" were identified. Lastly, we performed tree species diversity mapping based on the results of individual tree detection and tree species classification; the diversity indices, namely, the Margalef richness index, the Simpson diversity index, the Shannon-Wiener diversity index, and the Pielou evenness index, as well as the species richness, were calculated for the three parts of the Haizhu Wetland. The diversity mapping results were assessed based on the 12 field-surveyed plots.

Individual Tree Detection
Individual tree detection is an important step toward species diversity mapping. In this study, individual trees were detected based on the CHM (in FUSION's DTM format, derived from the LiDAR point clouds) by the local-maxima-based method (CanopyMaxima, Popescu et al. [37]) in the FUSION software (Version 3.70, Seattle, USA). The height thresholds were set between 1.8 and 3 m, varying among stands. The detected individual trees were output as a CSV file recording tree location (x and y coordinates), tree height, crown width, and height to crown base. As there were artificial structures in the research site, points on buildings or other structures were removed based on a nonvegetation mask. The 12 plots surveyed in the field work were used to assess the individual tree detection.
The individual tree image patches for subsequent tree species classification were obtained from the detected tree locations (the x and y coordinates of the tree apexes) and the RGB images over the research site. The treetops, stored as points, were overlaid on the RGB image. Centered on each treetop, we cropped the RGB image into image patches (individual trees), each of 64 × 64 pixels (Figure 3).
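The local-maxima detection and patch cropping can be illustrated with a simplified sketch; this is a stand-in for FUSION's CanopyMaxima rather than its actual algorithm, and the function names and synthetic CHM are hypothetical:

```python
import numpy as np

def detect_treetops(chm, window=3, min_height=1.8):
    """Return (row, col) positions where a CHM cell exceeds the height
    threshold and is the maximum of its window x window neighbourhood."""
    r = window // 2
    tops = []
    for i in range(r, chm.shape[0] - r):
        for j in range(r, chm.shape[1] - r):
            block = chm[i - r:i + r + 1, j - r:j + r + 1]
            if chm[i, j] >= min_height and chm[i, j] == block.max():
                tops.append((i, j))
    return tops

def crop_patch(rgb, top, size=64):
    """Crop a size x size patch from the RGB array centred on a treetop."""
    i, j = top
    h = size // 2
    return rgb[i - h:i + h, j - h:j + h, :]

# Synthetic example: one 5 m apex in an otherwise flat CHM.
chm = np.zeros((10, 10))
chm[5, 5] = 5.0
treetops = detect_treetops(chm)
rgb = np.zeros((200, 200, 3))
patch = crop_patch(rgb, (100, 100))
```

A real implementation would additionally handle image borders and apply the nonvegetation mask before cropping.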

Deep Learning Methods for Tree Species Classification
In the past few years, CNNs have been a hot topic in the field of image classification. Since the publication of AlexNet [38], a number of classical CNN architectures have been proposed, including VGG [39], GoogLeNet [40], and ResNet [41]. VGG can be considered a deepened version of AlexNet that employs small convolutional kernels. GoogLeNet adopted the Inception module, which is easy to use for network modification. It also removed the fully connected layers to reduce the number of parameters and used two auxiliary classifiers to accelerate network convergence. As a consequence of the auxiliary classifiers, however, GoogLeNet is not as scalable as VGG. On the other hand, network depth is a crucial factor influencing CNN performance [39]. Richer features at different levels can be extracted from deep CNN layers, but deep models are not easy to optimize. In many studies, batch normalization (BN) is employed to mitigate vanishing/exploding gradients in deep CNNs. However, the accuracy often becomes saturated and then degrades (the degradation problem) during training, even when BN layers are used. ResNet [41] addressed the degradation problem by having stacked layers learn residual functions, with shortcut connections performing identity mapping. Two types of shortcut (identity and projection shortcuts) were introduced for residual learning. Recently, these networks have been introduced into the field of remote sensing.
Our tree species classification strategy takes advantage of recent CNNs for patch-wise classification. We formulated the tree species classification as a supervised image classification problem to identify 18 classes (17 dominant tree species and a class of "Others"). For this purpose, we adopted AlexNet, VGG16, and ResNet50 implemented in Caffe for individual tree classification. Some adaptive modifications were made for our tree classification problem: (1) The input image size was set to 64 × 64 pixels instead of the original input size of each convolutional neural network. (2) The corresponding convolutional and pooling layers were adjusted for feature extraction accordingly.
(3) The final output layers were also modified to 18 classes, so as to distinguish the 17 dominant tree species and the class of "Others". The detailed architectures of the three CNNs are shown in Figure 4. In our deep-learning-based tree species classification, both the training and test data were image patches of individual trees. To prevent the networks from treating identical textures that differ only in orientation as distinct, and to increase the number of tree samples for training, we performed data augmentation: the tree sample patches were randomly rotated, mirrored, and flipped. Finally, a total of 5664 tree samples were used for CNN training. Scattered samples (627) and tree samples (304) in the 12 field-surveyed plots were used for testing and tree species classification accuracy assessment (931 test samples in total). The 12 plots were also used for diversity mapping assessment.
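The augmentation step can be sketched as follows; the exact augmentation parameters were not specified in the text, so the rotation and flip choices below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(patch):
    """Randomly rotate (multiples of 90 degrees), mirror, and flip a tree
    image patch to enlarge the training set without changing its content."""
    out = np.rot90(patch, k=int(rng.integers(0, 4)))
    if rng.random() < 0.5:
        out = np.fliplr(out)  # mirror horizontally
    if rng.random() < 0.5:
        out = np.flipud(out)  # flip vertically
    return out

patch = rng.random((64, 64, 3))           # a synthetic 64 x 64 RGB patch
augmented = [augment(patch) for _ in range(8)]
```

Because rotations and flips only permute pixels, the augmented patches keep the same size and content as the original.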

Forest Species Diversity Mapping
Based on the detected individual trees and the classified tree species, the diversity of the three parts of the Haizhu Wetland could be mapped. In this paper, the study area was divided into grids with a spatial resolution of 30 × 30 m, and grids without trees were excluded. Richness and evenness are the two components of alpha diversity. Richness is defined as the total number of species in a particular quadrat size, while evenness accounts for relative species abundance [1]. A single diversity measure is not necessarily appropriate for characterizing diversity [42]. In this paper, the Margalef richness index [43], the Simpson diversity index [44], the Shannon-Wiener diversity index [45], the Pielou evenness index [46], and the tree species richness were calculated (Table 1). The diversity indices consider both the richness and the evenness of the tree species in the study area. The ground truth of the diversity indices in the 12 field-surveyed plots was calculated based on the equations in Table 1 and then compared with the predicted values of the corresponding grids.
Table 1. The diversity indices used in this study.

Margalef richness index: D = (S − 1)/ln N. An index to measure the number of species in a certain region.

Simpson diversity index: D = 1 − Σ_i (N_i/N)^2. An index that takes into account the number of species, as well as the relative abundance of each species.

Shannon-Wiener diversity index: H′ = −Σ_i (N_i/N) ln(N_i/N). An index that indicates the relationship between species and community complexity.

Pielou evenness index: J = H′/ln S. An index that measures how evenly the individuals are distributed among the species.

Remarks: N_i is the number of individuals of species i; N is the total number of all individuals; S is the number of species.
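A minimal sketch of computing the Table 1 indices from the per-grid species labels (the function name and sample labels are illustrative):

```python
import math
from collections import Counter

def diversity_indices(species_labels):
    """Compute the Table 1 indices from the species labels of all trees
    falling in one 30 x 30 m grid cell."""
    counts = Counter(species_labels)
    n = sum(counts.values())              # N: total number of individuals
    s = len(counts)                       # S: species richness
    p = [c / n for c in counts.values()]  # relative abundances N_i/N
    shannon = -sum(pi * math.log(pi) for pi in p)
    return {
        "richness": s,
        "margalef": (s - 1) / math.log(n) if n > 1 else 0.0,
        "simpson": 1.0 - sum(pi ** 2 for pi in p),
        "shannon": shannon,
        "pielou": shannon / math.log(s) if s > 1 else 0.0,
    }

# Toy grid cell with 5 litchi, 3 longan, and 2 banana trees.
idx = diversity_indices(["litchi"] * 5 + ["longan"] * 3 + ["banana"] * 2)
```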

Experimental Setup
For deep-learning-based tree species classification, all the CNNs in this study were implemented in Caffe [47] on an NVIDIA GTX Titan X GPU. The initial learning rate was determined by trial and error from the two values 0.00001 and 0.0001, and 0.00001 was finally adopted for all three CNNs. The Adam optimizer [48] was used for training. The maximum number of iterations was set to 200,000 for all three networks in the training phase, and the training models were saved every 10,000 iterations to identify the best models for testing. The test tree samples were predicted by each saved training model of the three CNNs. The best performances of VGG16, ResNet50, and AlexNet were obtained at 140,000, 110,000, and 100,000 iterations, respectively.
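As an illustration of the training configuration described above, a hypothetical Caffe solver file might look as follows; the network file name and snapshot prefix are placeholders, and only the numeric settings come from the text:

```
net: "vgg16_trees.prototxt"    # network definition (placeholder name)
type: "Adam"                   # Adam optimizer [48]
base_lr: 0.00001               # initial learning rate chosen by trial and error
lr_policy: "fixed"
max_iter: 200000               # maximum training iterations
snapshot: 10000                # save the model every 10,000 iterations
snapshot_prefix: "snapshots/vgg16_trees"
solver_mode: GPU
```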

Assessment
The steps in our method were assessed in different manners. The commonly used metrics root-mean-square error (RMSE) and coefficient of determination (R2) were used to assess the performance of forest species diversity mapping. The reference species diversity was calculated based on the 12 field-surveyed plots.
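The two regression metrics can be computed as below (a minimal sketch; the observed and predicted values are illustrative):

```python
import math

def r2_rmse(observed, predicted):
    """R2 and RMSE between field-surveyed and predicted diversity values."""
    n = len(observed)
    mean_obs = sum(observed) / n
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot, math.sqrt(ss_res / n)

# Toy example with four plot-level index values.
r2, rmse = r2_rmse([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
```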
In terms of individual tree classification, the confusion matrix generally used in computer vision was calculated. The producer's (P) and user's (U) accuracies, the F1-score, and the overall accuracy (OA) were used for assessment. The P and U accuracies are also referred to as recall and precision [49], respectively: the producer's accuracy is the ratio of correctly detected trees to all positive tree samples in the ground truth, while the user's accuracy is the ratio of correctly detected trees to all tree samples that the model predicted as positive. For each class, P = TP/(TP + FN), U = TP/(TP + FP), and F1 = 2 × P × U/(P + U); OA is the number of correctly classified samples divided by N. Here, TP (true positives) is the number of samples of a class correctly predicted as that class, FP (false positives) is the number of samples of other classes incorrectly predicted as that class, FN (false negatives) is the number of samples of the class predicted as other classes, and N is the total number of test image patches. In this step, the ground truth samples were the 931 test tree samples mentioned above.
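These per-class metrics can be computed directly from the confusion matrix. A minimal sketch follows (the function name is ours; rows are assumed to index ground truth and columns the predictions):

```python
import numpy as np

def classification_metrics(cm):
    """Per-class producer's/user's accuracy (recall/precision), F1-score,
    and overall accuracy from a square confusion matrix cm[true, predicted]."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)                 # correctly classified samples per class
    fn = cm.sum(axis=1) - tp         # ground-truth samples the model missed
    fp = cm.sum(axis=0) - tp         # samples wrongly predicted as this class
    producers = tp / (tp + fn)       # recall (producer's accuracy)
    users = tp / (tp + fp)           # precision (user's accuracy)
    f1 = 2 * producers * users / (producers + users)
    oa = tp.sum() / cm.sum()         # overall accuracy
    return producers, users, f1, oa
```

Note that the sketch does not guard against empty classes (division by zero), which a production implementation should handle.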

Individual Tree Species Classification
Results: AlexNet, VGG16, and ResNet50
Table 2 reports the accuracies obtained on the testing samples with the abovementioned best training models. Taking advantage of patch-based training samples, the CNNs were able to learn discriminative texture features to identify tree species. Among the three networks, VGG16 achieved the highest overall accuracy (73.25%), ResNet50 was slightly lower (72.93%), and AlexNet performed the worst (68.53%). With VGG16 and ResNet50, most trees could be classified well; species such as banana, papaya, sakura tree, and Hibiscus tiliaceus had both high user's and producer's accuracies. Although hyperspectral chemical information was not included, the spatial and textural features extracted by the CNNs could identify these species well, especially trees with distinctive leaf shapes. However, some species were classified poorly, such as the silk floss tree and the camphor tree, whose user's and producer's accuracies were below 60.00%. This is probably because the numbers of training samples for these two species were small, so the CNNs could not learn their features well; a training model generally favors species with a large number of samples. Although the classification accuracies were not as good as results obtained from hyperspectral images [15], the high-resolution RGB images gave relatively good results compared with work using multispectral images [2]. In the research of Dalponte et al. [15], the average classification accuracy was about 80.4% using hyperspectral images and LiDAR, while in the research of Ferreira et al. [2], it was about 70% using visible/near-infrared bands.
As VGG16 performed the best among the three deep learning algorithms, we employed the VGG16 model at 140,000 iterations for tree species prediction. All the clipped image patches of the individual trees in the three parts were fed into this trained network, and the resulting tree species distribution across the three parts is shown in Table 3. This is in line with reality: Area A has a large proportion of banyan and silk floss trees, Area B has a large amount of Ceiba speciosa, and Area C has a large amount of longan. Overall, the proportions of tree species in Areas A and B are relatively balanced, while there is an obvious dominant species in Area C.
Figure 5 shows the tree species diversity mapping at a spatial resolution of 30 m for the three parts of the Haizhu Wetland. A grid with a cell size of 30 × 30 m was generated from the four boundaries of each part, and diversity was calculated and visualized only for cells containing trees. Whether the indices consider richness or evenness, the four subfigures show consistent patterns of tree species diversity. Generally, in Area A, the northern part has higher diversity than the southern part, which is related to the abundance of tree species there. Most of Area B has high values except the northern part, which is mainly composed of other land covers, such as broad sidewalks and squares, leading to low diversity. The spatial distribution of diversity in Area C is relatively scattered, with low values in the central part, where a large number of fruit trees were planted artificially. Overall, the different diversity indices showed similar spatial distributions at the same locations, indicating the reliability of our solution.
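The gridding step above can be sketched as follows; this is a minimal illustration, assuming tree apex coordinates in metres relative to each part's boundary, and the function and variable names are hypothetical:

```python
from collections import defaultdict

def bin_trees_to_grid(trees, cell_size=30.0):
    """Group detected trees into square grid cells.

    trees: iterable of (x, y, species) tuples, coordinates in metres.
    Returns {(col, row): [species, ...]}; diversity indices are then
    computed only for cells whose species list is non-empty.
    """
    cells = defaultdict(list)
    for x, y, species in trees:
        key = (int(x // cell_size), int(y // cell_size))
        cells[key].append(species)
    return dict(cells)
```

Changing `cell_size` to 10 or 20 reproduces the alternative mapping scales discussed later.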

Accuracy Assessment
The diversity indices at the 12 field-surveyed plots were calculated from the detected tree numbers and predicted tree species in each plot. The ground-truth species richness and diversity of each plot were calculated from the actual number of tree species recorded in the field. Table 4 compares the predicted species richness to the ground truth; the predicted richness was sometimes higher than the field-surveyed values (RMSE = 1.91). Figure 6 shows the validation of the four diversity indices calculated from the VGG16 predictions. The results for the Simpson index (R² = 0.7907, RMSE = 0.1038) and the Shannon-Wiener index (R² = 0.7948, RMSE = 0.7202) were much better than those for the Margalef (R² = 0.4562, RMSE = 0.5629) and Pielou (R² = 0.5875, RMSE = 0.3089) indices. We also mapped the diversity at two other scales (10 and 20 m; Figure 7), which showed different patterns compared with the 30 m scale: the measured diversity was lower at the two smaller scales, and the accuracy also decreased.
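The per-plot index calculation can be sketched as follows; these are the standard formulations of the four indices, while the helper name and the Gini-Simpson form of the Simpson index are our assumptions:

```python
import math
from collections import Counter

def diversity_indices(species_labels):
    """Compute Margalef, Shannon-Wiener, Simpson, and Pielou indices for one
    plot from a list of per-tree species labels."""
    counts = Counter(species_labels)
    n = sum(counts.values())            # total individuals in the plot
    s = len(counts)                     # number of species
    p = [c / n for c in counts.values()]
    shannon = -sum(pi * math.log(pi) for pi in p)       # H' = -sum p_i ln p_i
    simpson = 1.0 - sum(pi * pi for pi in p)            # 1 - sum p_i^2
    margalef = (s - 1) / math.log(n) if n > 1 else 0.0  # (S - 1) / ln N
    pielou = shannon / math.log(s) if s > 1 else 0.0    # J = H' / ln S
    return margalef, shannon, simpson, pielou
```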

Note: Ground truth means the actual number of tree species in the field survey, and prediction denotes the predicted number of tree species by our solution.

Discussion
Compared with pixel-based classification methods for tree species, our individual-tree-level approach made better use of spatial context information. The convolutional layers involve the neighborhood of each pixel, providing texture information and the spatial relationships of ground objects, whereas pixel-based methods employ only spectral features. Although only RGB images were used for classification, the deep CNNs generalized well to samples captured under different image perspectives and light conditions (Figure 8), and good accuracies were obtained. Among the three deep learning methods, the networks with deeper architectures (VGG16 and ResNet50) achieved better accuracies than AlexNet, indicating that richer features can be extracted by additional levels of CNN layers. In the literature, ResNet50 has been shown to perform well [41], as its residual learning is based on identity and projection shortcuts. However, VGG16 performed slightly better than ResNet50 in this study, which might be due to our specific individual tree dataset. Many tree species appear similar in RGB images, so the deeper ResNet50 might have overfitted during training, leading to slightly poorer results than in other applications [50]. Both VGG16 and ResNet50 are well suited to patch-based tree species classification, with only small performance differences between them. Moreover, the class merging strategy we used also influenced model performance. The class "Others" could contain a number of different tree species with various colors and textures. Although VGG16 could distinguish tree species according to the extracted high-level features, the softmax classifier could not assign an appropriate label to members of the "Others" class; instead, it assigned one of the labels from 0 to 17 (the dominant tree species whose features were most similar), leading to overestimation of the dominant species.

The method we employed to isolate individual trees may also have influenced the biodiversity mapping results. The algorithm was designed for mixed pine and deciduous forests [37], and its accuracy in this study was about 84.20% in terms of the correlation coefficient R; it might not work well for some forest types in our study area. The individual tree detection method could be further refined using species-specific models. Moreover, because the algorithm is based on canopy height, only trees in the upper canopy can be identified. It would be valuable to derive the complete structure of a tree, such as stem information, from the LiDAR point clouds. Although challenges remain in individual tree isolation and tree species classification, the biodiversity mapping results were reasonably satisfactory: the Simpson and Shannon-Wiener indices achieved high accuracy, with R² ≈ 0.79.
The prospects for transferring our solution to other regions are promising. First, the development of UAV technology makes forest image data acquisition easier; UAV LiDAR and optical sensors can obtain higher-spatial-resolution point clouds and images, and even higher-spectral-resolution images. Second, the performance of deep-learning-based approaches depends on the training data, that is, the quantity and quality of the tree species samples. As long as tree samples can be measured well in field work, deep learning networks can be expected to perform well; collecting tree samples with the aid of crowdsourcing would also be beneficial. In terms of the sampled area for species diversity, our method is also applicable at other scales (e.g., 50 × 50 m or 90 × 90 m), depending on the quadrat size used in the field work or the requirements of species diversity estimation.

Conclusions
The results of this study indicate the potential of deep learning methods for tree species diversity mapping with high-resolution RGB images and LiDAR data. Our proposed three-step workflow achieved accuracies of R² = 0.7948 and RMSE = 0.7202 for the Shannon-Wiener index; R² = 0.7907 and RMSE = 0.1038 for the Simpson index; R² = 0.4562 and RMSE = 0.5629 for the Margalef index; and R² = 0.5875 and RMSE = 0.3053 for the Pielou index. The method design and the deep learning technology also allow the processing of large datasets and offer the potential for transfer to other forest regions, owing to on-the-fly data acquisition and processing capability.
A comparison of the three deep learning algorithms showed that deep CNN architectures can perform well in tree species classification for diversity mapping. VGG16 achieved slightly better performance than ResNet50 due to the characteristics of the tree samples. The results obtained using only RGB images demonstrate the potential for tree species diversity mapping; more accurate diversity prediction can be expected with improved individual tree isolation and the addition of other spectral bands (such as near-infrared and red-edge bands) for tree species classification. We conclude that our proposed deep-learning-based solution is well suited to mapping tree species diversity in the Haizhu Wetland.