Tree Species Classification Using Ground-Based LiDAR Data by Various Point Cloud Deep Learning Methods

Tree species information is an important factor in forest resource surveys, and light detection and ranging (LiDAR), as a new technical tool for forest resource surveys, can quickly obtain the 3D structural information of trees. In particular, the rapid and accurate classification and identification of tree species information from individual tree point clouds using deep learning methods is a new development direction for LiDAR technology in forest applications. In this study, mobile laser scanning (MLS) data collected in the field are first pre-processed to extract individual tree point clouds. Two downsampling methods, non-uniform grid and farthest point sampling, are combined to process the point cloud data, and the obtained sample data are more conducive to the deep learning model for extracting classification features. Finally, four different types of point cloud deep learning models, including pointwise multi-layer perceptron (MLP) (PointNet, PointNet++, PointMLP), convolution-based (PointConv), graph-based (DGCNN), and attention-based (PCT) models, are used to classify and identify the individual tree point clouds of eight tree species. The results show that the classification accuracy of all models (except for PointNet) exceeded 0.90, where the PointConv model achieved the highest classification accuracy for tree species classification. The streamlined PointMLP model can still achieve high classification accuracy, while the PCT model did not achieve good accuracy in the tree species classification experiment, likely due to the small sample size. We compare the training process and final classification accuracy of the different types of point cloud deep learning models in tree species classification experiments, further demonstrating the advantages of deep learning techniques in tree species recognition and providing experimental reference for related research and technological development.


Introduction
Forest ecosystems cover approximately one-third of the Earth's land surface and are an important type of global land cover [1]. Forests have an irreplaceable role in maintaining the stability of ecosystems and play a vital role in the survival and development of human civilization [2,3]. Tree species information is essential in forest resource surveys, and the classification of tree species is one of the main tasks of forest science [4]. Timely and accurate information on tree species is essential for developing strategies for the sustainable management and conservation of planted and natural forests [5]. Over the past four decades, the development of remote sensing technology has made large-scale forest inventories possible. Light detection and ranging (LiDAR), as an active remote sensing technology, can obtain 3D point clouds of scanned forest scenes and has become the main technical tool for tree species identification and accurate extraction of forest parameters.
LiDAR data from different platforms have different roles in the study of forest parameter extraction and tree species classification. Most studies have used airborne laser scanning (ALS) data, from which scholars extract taxonomic features for region-wide mapping of tree species distribution [6][7][8][9]; however, there are relatively few studies using terrestrial laser scanning (TLS) data for individual tree species classification due to its high acquisition cost and difficulty in data processing [10,11]. Mobile laser scanning (MLS) based on simultaneous localization and mapping (SLAM) technology can easily and quickly collect point cloud data in a study area [12]. Research on the identification and classification of individual tree point cloud species based on ground-based LiDAR data extraction has been carried out by various scholars [13][14][15].
Based on traditional machine learning algorithms, LiDAR-derived feature metrics help to improve the accuracy of tree species identification [16]. Random forests (RF) [17][18][19][20] and support vector machines (SVM) [21][22][23] have been used in many studies. However, machine learning-based classification methods require the prior extraction of a large number of classification features, and the accuracy of these algorithms depends on a variety of feature metrics. Deep learning approaches, as end-to-end classifiers, can automatically extract classification features through the use of deep neural networks. Existing research has demonstrated that high classification accuracy can be achieved through the use of deep learning for tree species classification [8,9,14].
Over the past decade, deep learning has made rapid progress in the field of computer vision. Point clouds are disordered and do not conform to the regular lattice grid of 2D images, making it difficult to apply traditional convolutional neural networks (CNNs) to such unordered inputs. Therefore, scholars have developed a series of deep learning models tailored to the characteristics of 3D point cloud data. The development of point cloud analysis has been closely related to the development of image processing networks [24]. PointNet [25] and PointNet++ [26], as seminal studies in point-based deep learning, have driven the development of deep learning modeling techniques that learn categorical feature information directly from the original point cloud. In current research, point cloud deep learning methods can be classified into pointwise multi-layer perceptron (MLP), convolution-based, graph-based, and attention-based methods, depending on the point-based feature-learning network architecture [27,28]. Most existing deep learning methods used for tree species classification from individual tree point clouds are based on pointwise MLP. Other types of point cloud deep learning methods have rarely been used in tree species classification studies. In response to the limitations of point cloud deep learning models regarding the number of points in typical samples, the non-uniform grid-based downsampling algorithm, used in the studies [14,29], has shown potential for accurately extracting the 3D morphology of single plants. Zhou et al. [30] proposed CNN-specific [31] class activation maps (CAMs) for the visualization of class features, which represent a weighted linear sum of each class of visual patterns present at different spatial locations in the image [30]. In this line, Huang et al. [32] have attempted to interpret the PointNet model through the use of CAMs, and Xi et al. [11] have visualized the feature points of PointNet++ classification results using CAMs.
To fully explore the potential of point cloud deep learning methods with regard to the classification of individual tree species, we conducted the following experiments. A point cloud data downsampling strategy (NGFPS) that incorporates non-uniform grid and farthest point sampling methods was proposed. Four different types of point cloud deep learning approaches, including pointwise MLP, convolution-based, graph-based, and attention-based models, were used for tree species classification and recognition of individual tree point clouds. Six point cloud deep learning models were included in our experiments: PointNet [25], PointNet++ [26], PointMLP [24], PointConv [33], DGCNN [34], and the point cloud transformer (PCT) [35]. We analyzed and evaluated the classification results of all models. The CAM method was used to visualize the set of critical points at which the classification features of a sample converge. Some new results regarding tree species classification using point cloud deep learning methods were presented, thus providing a reference for future tree species classification and the further development of deep learning models.

Study Area and Data Collection
Experimental data were collected from three study areas in China: the Greater Khingan Station (GKS), Huailai Remote Sensing Comprehensive Experimental Station (HL), and Gaofeng forest farm (GF). Point cloud data from forest plots in the study areas were collected in September, July, and April of 2021, using a LiBackpack DGC50 backpack laser scanning (BLS) system from Beijing GreenValley Technology Co., Ltd. (Beijing, China). The LiBackpack DGC50 system has an absolute accuracy of ±5 cm and a relative accuracy of 3 cm, and is equipped with two VLP-16 laser sensors that can reach a scanning range of 100 m, with vertical and horizontal scanning angles of −90° to 90° and 360°, respectively. Figure 1 shows the locations of the three study areas and the LiDAR data of some sample sites. LiDAR data were collected from 17, 41, and 16 ground sample plots in the three experimental areas of GKS, HL, and GF, respectively. A total of eight tree species were included in the three experimental areas. Table 1 provides detailed tree species and forest plot information.

Data Pre-Processing
The data collected in the field were processed with SLAM using the LiFuser-BP software to obtain raw point cloud data in a standard format. To obtain individual tree point cloud data that fit the input requirements of the deep learning model, the LiDAR data of the plots collected in the field were subjected to a series of pre-processing steps.
First, the noise in the point cloud data was removed. A height threshold method was used to remove noise points that were significantly above and below ground level, and a local distribution-based algorithm was used to remove some stray isolated points. Then, we used the cloth simulation filtering (CSF) algorithm [36] to classify the raw point cloud data into two categories: ground points and vegetation points. To segment the point cloud of each tree from the point cloud of the whole plot, we normalized the point cloud of the plot based on the elevation information of the ground points, which were processed to be parallel to the horizontal plane. Subsequently, we eliminated the ground points and segmented the point cloud of each individual tree from the sample point cloud using the comparative shortest path (CSP) algorithm [37], which uses a bottom-up approach to detect each tree in the region. Immediately afterwards, we processed the point cloud data of each single tree manually and eliminated some points that were misclassified, such as ground and weeds. Finally, we obtained clean point cloud data for each tree.
An individual tree point cloud data set for tree species classification, called TS8, was successfully built following the file organization of the ModelNet40 [38] dataset, which has been commonly used in point cloud deep learning research. As we obtained an unbalanced number of trees of each species, we used a stratified random sampling method within species, randomly selecting 80% of each tree species as the training data set and the remaining 20% as the test set.
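The within-species 80/20 split can be illustrated with a minimal sketch. This is not the authors' code: the function name `stratified_split`, the fixed seed, the rounding rule, and the species labels in the usage lines are all assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_fraction=0.8, seed=0):
    """Return (train_idx, test_idx), drawing train_fraction of every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        members = by_class[label][:]
        rng.shuffle(members)                      # random selection within the class
        n_train = round(train_fraction * len(members))
        train_idx.extend(members[:n_train])
        test_idx.extend(members[n_train:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical, unbalanced labels (not the TS8 counts).
labels = ["birch"] * 50 + ["larch"] * 10
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 48 12
```

Splitting inside each class keeps the species proportions of the unbalanced data set roughly equal in the training and test sets.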

Research Workflow
After creation of the TS8 data set (Table 1), we downsampled the individual tree point clouds using the proposed NGFPS method. A variety of deep learning models were then trained on the data set for classification. Finally, high-precision tree species classification results and optimal model hyperparameters were obtained. Figure 2 depicts the technical route and logical structure of this study in detail.

Methods Combined with Non-Uniform Grid and Farthest Point Sampling
The difference between the non-uniform grid sampling (NGS) method and traditional grid-/voxel-based methods is that it can select representative points from grids of different sizes. The NGS method calculates the normal vector of each point before sampling the point cloud, which can better preserve the details of 3D objects. The NGS method was first used to align multiple sets of point clouds representing the same object but with different coordinate systems [39], as well as for 3D point cloud surface reconstruction [40][41][42]. This method can effectively filter out the key points describing the shape of the object from a dense point cloud and is more conducive to the mutual matching of multiple groups of point clouds, as it preserves the details of the 3D object's surface well. The dominant downsampling method used with current deep learning models is the farthest point sampling (FPS) method, which allows for rapid sampling to obtain a point cloud of objects containing a specified number of points (N = 1024, 2048, …). One of the main reasons for using the FPS method is that the ModelNet40 data set consists of point clouds uniformly sampled from the surface of regular computer-aided design (CAD) models; however, in practice, the point cloud density of collected LiDAR data is not uniformly distributed, and the points that can represent the details of the object need to be fully retained when identifying and classifying tree species. FPS is a downsampling method for uniform density point clouds, which cannot successfully retain the more detailed features of tree point clouds. In the study of tree species classification, more detailed information regarding 3D objects needs to be retained using NGS methods. However, the NGS method cannot return exactly the number of points (N) per sample required by the point cloud deep learning model. Therefore, in this experiment, we first used the NGS method to reduce each individual tree point cloud to a number of points close to, but not below, N. After obtaining samples with sufficient detail retention, we unified the number of points in each individual tree point cloud to N using the FPS method.
The NGS algorithm can be implemented quickly in MATLAB, but requires inputting the smallest number of points (k) within each grid point as the initial parameter, and the final sampling results in the representation of k points combined into 1 point. To obtain a fixed number of downsampling points N per object, the value of k tends to be inconsistent during data processing. In order to obtain individual tree point cloud data that are more conducive to model learning, we combine two methods, non-uniform grid and farthest point sampling, and the combined method is called NGFPS in this paper. The details of the combination of the two methods are as follows.

1. The objects are downsampled using the NGS algorithm, with k iterated as the input parameter. The minimum value of k is set to 6, and the value of k is increased by 1 at each iteration;
2. When the number of points after downsampling the object satisfies N(k) < N, the iteration is stopped, and the result for N(k−1) is retained;
3. The FPS algorithm is used to downsample the N(k−1) points to the specified number of points N.
Using the NGFPS method not only preserves the details of the 3D objects better, but also allows us to obtain a data set that satisfies the number of points N as the input to the deep learning model.
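The three steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' MATLAB implementation: `ngs` is a caller-supplied function standing in for non-uniform grid sampling, `every_kth` is a hypothetical placeholder that is not NGS at all, and `k_max` is a safety bound added here. Only the FPS routine and the iteration logic correspond to the description above.

```python
import numpy as np

def farthest_point_sampling(points, n_samples, start=0):
    """Greedy FPS: repeatedly add the point farthest from those already chosen."""
    chosen = np.empty(n_samples, dtype=int)
    chosen[0] = start
    min_dist = np.full(points.shape[0], np.inf)  # squared distance to chosen set
    for i in range(1, n_samples):
        diff = points - points[chosen[i - 1]]
        min_dist = np.minimum(min_dist, (diff ** 2).sum(axis=1))
        chosen[i] = np.argmax(min_dist)
    return points[chosen]

def ngfps(points, n_target, ngs, k_min=6, k_max=1000):
    """Raise k until N(k) < N, keep the N(k-1) result, then apply FPS."""
    kept = points  # fallback when even k_min leaves fewer than n_target points
    for k in range(k_min, k_max + 1):
        candidate = ngs(points, k)
        if candidate.shape[0] < n_target:
            break
        kept = candidate
    return farthest_point_sampling(kept, n_target)

# Hypothetical stand-in for NGS (NOT the real algorithm): keep every k-th point.
def every_kth(points, k):
    return points[::k]

cloud = np.random.default_rng(0).random((5000, 3))
sample = ngfps(cloud, 512, every_kth)
print(sample.shape)  # (512, 3)
```

The sketch assumes the input cloud holds at least N points; with fewer, this FPS routine would return repeated points.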

Point Cloud Deep Learning Methods
We conducted tree species classification research using four types of point cloud deep learning approaches: pointwise MLP, convolution-based, graph-based, and attention-based models. The second module in Figure 2 shows the categories of models used, along with the names of the specific models. To facilitate comparison and analysis of the models, the structures of all models are summarized and drawn in a single image (Figure 3). Additionally, the hyperparameters used in the training process of all the deep learning models are summarized in Table 2. We maintained the parameters used by the original authors of the models as much as possible.
Remote Sens. 2022, 14, x FOR PEER REVIEW 7 of 22

Pointwise MLP Methods
Multi-layer perceptrons (MLPs), developed from perceptrons, are the simplest kind of deep network. An MLP is a non-linear model obtained by adding hidden layers and activation functions, as well as changing the classification function. Commonly used activation functions include sigmoid, tanh, ReLU, and so on. The softmax function is used to address multi-classification problems. The number of hidden layers and the size of each layer are its main hyperparameters.
Due to the irregularity of point cloud data, traditional 2D deep learning methods cannot be directly used with 3D data. PointNet is a pioneering work that deals directly with disordered point sets using multiple shared MLPs, which achieves permutation invariance through symmetric functions. PointNet uses multiple MLP layers to learn point-wise features independently, as well as a maximum pooling layer to extract global features (see Figure 3a). This model has three main modules: a symmetric function aggregating information from all points, a combination of local and global information, and a joint alignment network. The global features extracted by the model are classified using a multi-layer perceptron classifier. The PointNet model can be expressed by Equation (1):
\[ f(x_1, x_2, \ldots, x_n) = \gamma\left(\max_{i=1,\ldots,n} \{h(x_i)\}\right) \tag{1} \]
where $\{x_1, x_2, \ldots, x_n\}$ is the set of points of the model input, $\gamma$ and $h$ are multi-layer perceptron networks, and the function $f$ is invariant to the arrangement of the points of the model input.
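The permutation invariance of this construction can be checked numerically. The sketch below is a toy illustration, not the PointNet implementation: the layer sizes, the random untrained weights, and the names `shared_mlp`, `init`, and `pointnet_f` are assumptions, and the joint alignment network and softmax are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def shared_mlp(x, weights):
    """Apply the same ReLU layers to every row (point) of x."""
    for w, b in weights:
        x = np.maximum(x @ w + b, 0.0)
    return x

def init(sizes):
    """Random, untrained weights for consecutive layer sizes."""
    return [(rng.normal(size=(a, b)) * 0.1, np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

h_weights = init([3, 64, 128])       # h: per-point feature extractor
gamma_weights = init([128, 64, 8])   # gamma: head producing 8 outputs

def pointnet_f(points):
    per_point = shared_mlp(points, h_weights)   # (n, 128), each point independent
    global_feature = per_point.max(axis=0)      # symmetric max pooling over points
    return shared_mlp(global_feature[None, :], gamma_weights)[0]

points = rng.random((100, 3))
shuffled = points[rng.permutation(100)]
print(np.allclose(pointnet_f(points), pointnet_f(shuffled)))  # True
```

Because the maximum over points does not depend on their order, shuffling the rows of the input leaves the output unchanged.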
As the features of each point in PointNet are learned independently, the local structure generated by the metric space points cannot be captured, which limits its ability to recognize fine-grained patterns and generalize to complex scenes. PointNet++ introduces a hierarchical feature learning paradigm that captures the fine geometric structure from the neighborhood of each point. Due to its ability to obtain local feature information at different scales of 3D objects, PointNet++ has become the basis for the development and exploitation of many other models [34,43,44]. As the core of the PointNet++ hierarchy, its set abstraction layer (SA) consists of three layers: A sampling layer, a grouping layer, and a PointNet-based learning layer. The SA in Figure 3b represents the structure of the model. The input of each SA module can be represented by a matrix of size N × (d + C), where N denotes the number of points of the 3D object, d denotes the coordinate dimension, and C denotes the feature dimension. The structure after SA processing is N' × (d + C').
(1) The points in each layer of the input are first downsampled using the FPS method. Compared with random sampling, FPS provides better coverage of the entire point set for the same number of points [26].
(2) The sampled data are subsequently grouped using a ball query method, which allows for better generalization of local area features in space. The output has size N' × K × (d + C), where K denotes the number of points contained in each query ball.
(3) Finally, the PointNet layer is used to learn the features of the grouped data. PointNet++ learns features from local geometric structures by stacking multiple levels of set abstraction and abstracting local features in a layer-wise manner. The core structure of the model can be expressed as Equation (2):
\[ g_i = \mathcal{A}\left(\Phi(f_{i,j}) \mid j = 1, \ldots, K\right) \tag{2} \]
where $g_i$ denotes the aggregated feature of the $i$th sampling point, $\mathcal{A}(\cdot)$ denotes the maximum pooling operation, $\Phi(\cdot)$ denotes the MLP local feature extraction function, and $f_{i,j}$ denotes the feature of the $j$th neighboring point of the $i$th sampling point. In this experiment, the point cloud density of our collected LiDAR data is heterogeneous, so a multi-scale grouping (MSG) approach was used. MSG is a simple but effective way of capturing multi-scale features: features of different scales are concatenated to form multi-scale features.
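The grouping and pooling steps, and the concatenation used by MSG, might look like the sketch below. It is illustrative only: centroids are drawn at random as a stand-in for FPS, the local MLP is a single random untrained layer, the radii and group sizes are arbitrary, and padding short groups by repeating the first index is an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def ball_query(points, centroids, radius, k):
    """Indices of up to k points within `radius` of each centroid, shape (N', k).
    Short groups are padded by repeating their first index (an assumption here)."""
    d2 = ((centroids[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    groups = np.empty((len(centroids), k), dtype=int)
    for i, row in enumerate(d2):
        idx = np.flatnonzero(row <= radius ** 2)[:k]
        groups[i, :len(idx)] = idx
        groups[i, len(idx):] = idx[0]
    return groups

def set_abstraction(points, centroid_idx, radius, k, w, b):
    centroids = points[centroid_idx]
    groups = ball_query(points, centroids, radius, k)
    local = points[groups] - centroids[:, None, :]   # (N', K, 3) relative coordinates
    feats = np.maximum(local @ w + b, 0.0)           # local MLP, (N', K, C')
    return centroids, feats.max(axis=1)              # max pooling over K: (N', C')

points = rng.random((1024, 3))
centroid_idx = rng.choice(1024, size=128, replace=False)  # random stand-in for FPS

msg = []
for radius, k in [(0.1, 16), (0.2, 32)]:                  # two illustrative scales
    w, b = rng.normal(size=(3, 32)), np.zeros(32)         # separate weights per scale
    msg.append(set_abstraction(points, centroid_idx, radius, k, w, b)[1])
msg_features = np.concatenate(msg, axis=1)
print(msg_features.shape)  # (128, 64)
```

Since every centroid is itself one of the input points, each ball contains at least one point and the padding step is always well defined.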
PointMLP is a deep residual MLP network that follows the design philosophy of PointNet and PointNet++, using a simpler but deeper network architecture explored by Ma et al. [24]. PointMLP uses a feed-forward residual MLP network to hierarchically aggregate the local features extracted by MLPs without any fine-grained local feature extractor. Thus, this method can avoid the large computational effort and continuous memory access caused by complex local feature extraction. A lightweight geometric affine module is also introduced, in order to adaptively convert local points to a normal distribution, further improving the performance and generalization capability of the model. The structure of the model is shown in Figure 3c, and the core algorithm of this network can be expressed as Equation (3):
\[ g_i = \Phi_{pos}\left(\mathcal{A}\left(\Phi_{pre}(f_{i,j}) \mid j = 1, \ldots, K\right)\right) \tag{3} \]
where $\Phi_{pre}(\cdot)$ and $\Phi_{pos}(\cdot)$ are residual MLP modules, the shared $\Phi_{pre}(\cdot)$ is used to learn shared weights from local regions, and $\Phi_{pos}(\cdot)$ is used to extract deep aggregated features. The aggregation function $\mathcal{A}(\cdot)$ is the maximum pooling function. The model uses the k-nearest neighbor algorithm (k-NN) to select neighboring points, with k = 24. Equation (3) describes one stage of PointMLP. We use a four-stage deep network, so this computation is repeated four times. Ma et al. [24] have also proposed a lightweight version, PointMLP-elite, which reduces the number of channels in the PointMLP intermediate FC layer by a factor of 4, slightly adjusts the network architecture, and reduces the number of MLP blocks and embedding dimensions.

Convolution-Based Method
PointConv [33] is one of the best-performing convolution-based methods. It extends the Monte Carlo approximation of the 3D continuous convolution operator. By approximating the continuous weight and density functions of the convolution filter with MLPs, PointConv extends the dynamic filter to a new convolution operation. The point cloud is represented as a set of 3D points $\{p_i \mid i = 1, \ldots, n\}$, where each point is a vector containing location $(x, y, z)$ features. PointConv is a permutation-invariant point cloud operation that makes point clouds compatible with convolution. PointConv can be expressed as follows:
\[ \mathrm{PointConv}(S, W, F)_{xyz} = \sum_{(\delta_x, \delta_y, \delta_z) \in G} S(\delta_x, \delta_y, \delta_z)\, W(\delta_x, \delta_y, \delta_z)\, F(x + \delta_x, y + \delta_y, z + \delta_z) \tag{4} \]
where $S(\delta_x, \delta_y, \delta_z)$ represents the inverse density at $(\delta_x, \delta_y, \delta_z)$, and $F(x + \delta_x, y + \delta_y, z + \delta_z)$ represents the features of the points in the local region $G$ centered at point $(x, y, z)$. For each local region, $(\delta_x, \delta_y, \delta_z)$ can be any position within the region.
PointConv approximates the weight function $W(\delta_x, \delta_y, \delta_z)$ from the 3D coordinates $(\delta_x, \delta_y, \delta_z)$ by a multi-layer perceptron and approximates the inverse density $S(\delta_x, \delta_y, \delta_z)$ through kernelized density estimation and a non-linear transformation implemented with an MLP. The weights of the MLP are shared over all points, in order to maintain permutation invariance. The density of each point in the point cloud is estimated using kernel density estimation, following which the density is input to the MLP for one-dimensional non-linear transformation, resulting in the inverse density scale $S(\delta_x, \delta_y, \delta_z)$.
Considering the high memory consumption and low efficiency of the PointConv implementation, Wu et al. [33] have simplified the model to two standard operations: matrix multiplication and 2D convolution.
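A single PointConv output, written as a density-weighted sum over one local region, can be sketched as follows. This is a simplified illustration rather than the implementation of Wu et al.: the weight MLP has random untrained parameters, the Gaussian bandwidth is arbitrary, and the inverse density is used directly instead of being passed through the additional MLP described above. All function and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def kde_density(points, bandwidth=0.2):
    """Gaussian kernel density estimate at every point, shape (n,)."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2 * bandwidth ** 2)).mean(axis=1)

def pointconv_at(i, points, features, nbr_idx, w1, w2, density):
    """Output feature at point i from the neighbours listed in nbr_idx."""
    k, c_in = len(nbr_idx), features.shape[1]
    delta = points[nbr_idx] - points[i]               # (K, 3) local offsets
    hidden = np.maximum(delta @ w1, 0.0)              # weight MLP, hidden layer
    weights = (hidden @ w2).reshape(k, c_in, -1)      # W(delta): (K, C_in, C_out)
    inv_density = 1.0 / density[nbr_idx]              # S(delta): (K,)
    # Sum over neighbours and input channels of S * W * F.
    return np.einsum("k,kc,kco->o", inv_density, features[nbr_idx], weights)

points = rng.random((200, 3))
features = rng.random((200, 4))                       # C_in = 4
w1 = rng.normal(size=(3, 16))
w2 = rng.normal(size=(16, 4 * 8))                     # C_out = 8
density = kde_density(points)
nbr_idx = np.argsort(((points - points[0]) ** 2).sum(axis=1))[:16]
out = pointconv_at(0, points, features, nbr_idx, w1, w2, density)
print(out.shape)  # (8,)
```

Because the result is a sum over neighbours with weights shared across points, reordering `nbr_idx` does not change the output.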

Graph-Based Method
Point clouds are difficult to process directly with convolution because of their disordered and irregular nature. Graph-based approaches use graphs to study the relationships between points. Wang et al. [34] have designed the DGCNN model and proposed a new neural network module, called EdgeConv, to generate edge features describing the relationship between points and their neighbors. It constructs a local graph, which preserves the relationships between points. EdgeConv dynamically constructs a graph structure at each layer of the network, using each point as a centroid to characterize the edge feature with each neighboring point, and then aggregates these features to obtain a new representation of that point. EdgeConv is one of the main components of the DGCNN structure. The set of points is defined as $X = \{x_1, x_2, \ldots, x_n\}$, where each point is a vector containing the features at a location, $x_i = (x_i, y_i, z_i)$. The directed graph of the local point cloud structure can be represented as:
\[ G = (V, E) \tag{5} \]
where $V = \{1, 2, \ldots, n\}$ denotes the vertices and $E \subseteq V \times V$ denotes the edges. In the simplest case, $G$ is defined as the k-nearest neighbors (k-NN) graph of the point set $X$. The edge features are defined as:
\[ e_{ij} = h_\Theta(x_i, x_j) \tag{6} \]
where $h_\Theta$ is a set of non-linear functions with learnable parameters $\Theta$. The output of EdgeConv is obtained by applying a channel-wise symmetric aggregation function $\square$ to the edge features of all the edges from each vertex. The output of EdgeConv at the $i$th vertex can be expressed as:
\[ x_i' = \mathop{\square}_{j:(i,j) \in E} h_\Theta(x_i, x_j) \tag{7} \]
\[ h_\Theta(x_i, x_j) = \bar{h}_\Theta(x_i, x_j - x_i) \tag{8} \]
\[ e'_{ijm} = \mathrm{ReLU}\left(\theta_m \cdot (x_j - x_i) + \phi_m \cdot x_i\right) \tag{9} \]
\[ x'_{im} = \max_{j:(i,j) \in E} e'_{ijm} \tag{10} \]
where $\Theta = (\theta_1, \ldots, \theta_M, \phi_1, \ldots, \phi_M)$. Equation (8) can preserve the global features in the neighborhood, as well as the information of the local neighborhood. The edge features are obtained by adding the perceptron in Equation (9), while the aggregation operation is implemented using Equation (10). Another major part of the DGCNN structure is the dynamic update graph. The graph is recomputed using the nearest neighbors in the feature space generated at each layer. Therefore, at each layer $l$, there is a different graph $G^{(l)} = (V^{(l)}, E^{(l)})$. The DGCNN learns how to construct the graph $G$ used in each layer, which is its largest difference from the GCN.
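An EdgeConv layer with a dynamically rebuilt k-NN graph can be sketched in a few lines. This is a toy illustration with random untrained weights, not the DGCNN implementation: each edge feature is ReLU(theta · (x_j − x_i) + phi · x_i) and the per-vertex output is the channel-wise maximum over its edges. The names `knn_graph` and `edge_conv`, the layer widths, and k = 20 are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def knn_graph(x, k):
    """Indices of the k nearest rows of x for every row, shape (n, k).
    A point's distance to itself is zero, so self-loops are normally included."""
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)
    return np.argsort(d2, axis=1)[:, :k]

def edge_conv(x, theta, phi, k):
    nbrs = knn_graph(x, k)                         # graph rebuilt from the current x
    diff = x[nbrs] - x[:, None, :]                 # (n, k, C): x_j - x_i
    edges = np.maximum(diff @ theta + (x @ phi)[:, None, :], 0.0)  # (n, k, M)
    return edges.max(axis=1)                       # max aggregation: (n, M)

points = rng.random((256, 3))
theta1, phi1 = rng.normal(size=(3, 16)), rng.normal(size=(3, 16))
theta2, phi2 = rng.normal(size=(16, 32)), rng.normal(size=(16, 32))

layer1 = edge_conv(points, theta1, phi1, k=20)     # k-NN in coordinate space
layer2 = edge_conv(layer1, theta2, phi2, k=20)     # k-NN in learned feature space
print(layer1.shape, layer2.shape)  # (256, 16) (256, 32)
```

The second call builds its neighbourhoods from `layer1` rather than from the coordinates, which is what makes the graph dynamic.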

Attention-Based Method
Attention-based methods, such as PCT [35] and Point Transformer [45,46], have also shown excellent capabilities for relationship exploration. Attention allows the model to adapt to diverse data by calculating dynamic weights. The self-attention (SA) module is the core component, generating refined attention features for its input features based on the global context. Self-attention is a mechanism that calculates the semantic affinities between different items within a sequence of data. It updates the features at each location by computing a weighted sum of the features at all locations, weighted by the pairwise affinities, to capture the long-range dependencies in a single sample.
The transformer does not depend on the order of the input data. For point cloud data, the transformer itself is permutation-invariant, and feature learning is performed through an attention mechanism. As such, it is considered very suitable for implementing point cloud deep learning models. The overall structure and details of the PCT model are shown in Figure 3f. Guo et al. [35] proposed the offset-attention method, which replaces the attention features in SA with the offset between the input features and the attention features. The offset-attention layer obtains better model performance by calculating the offset between the SA features and the input features via element-wise subtraction.
The PCT model is designed with a local neighborhood aggregation strategy to enhance local feature extraction. Two sampling and grouping (SG) layers gradually expand the receptive field during feature aggregation. The SG layers use Euclidean distance during point cloud sampling and perform feature aggregation of local neighbors for each point grouped by k-NN.
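The offset idea can be illustrated with a single attention layer. The sketch below is a simplification under stated assumptions, not the PCT implementation: it uses ordinary scaled dot-product attention with a row-wise softmax, a single linear layer with ReLU in place of the full output block, no batch normalization, and random untrained weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)        # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def offset_attention(x, wq, wk, wv, w_out):
    q, k, v = x @ wq, x @ wk, x @ wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[1]))  # (n, n) pairwise affinities
    sa = attn @ v                                  # self-attention features, (n, d)
    offset = x - sa                                # element-wise offset from the input
    return np.maximum(offset @ w_out, 0.0) + x     # transformed offset plus residual

d = 32
x = rng.normal(size=(128, d))
wq, wk = rng.normal(size=(d, d // 4)) * 0.1, rng.normal(size=(d, d // 4)) * 0.1
wv, w_out = rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1
y = offset_attention(x, wq, wk, wv, w_out)
print(y.shape)  # (128, 32)
```

Reordering the input points only reorders the output rows, so a symmetric pooling applied afterwards gives an order-independent global feature.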

Critical Points Visualization
Most deep learning models, after extracting features from samples, use aggregation functions to bring together important features to form global features, then use softmax functions to complete the multi-classification task. Zhou et al. [30] have proposed the class activation map (CAM) through the use of global average pooling (GAP) in the network. CAM is used to indicate which part of the image is responsible for the classification results in different networks. The CAM replaces the last fully connected layer with the GAP, which displays its decision as a saliency map. This improved structure allows us to efficiently locate important regions in the image for semantic prediction. CAM is a deep learning interpretation method, in which training network information, such as gradients, is passed backward to obtain results reflecting the basis of model decisions, such as a heatmap of each sample's contribution, which can then be used to interpret the deep learning model [47].
Among the six deep learning methods used in this study, PointNet, PointNet++, PointMLP, and PCT all use the max pooling function, while the DGCNN model uses both max and average pooling as aggregation functions. The global features are processed using the softmax function, in order to achieve the final classification. All deep learning models extract features through their own unique model structure, which is a pre-requisite for effective classification. The more accurate and comprehensive the features aggregated by the aggregation function, the more accurate the classification result of the model. The max pooling function extracts and retains the points that contain the largest feature values. In [25], these points are called critical points. The set of critical points that contribute to the max-pooled features summarizes the skeleton of the 3D point cloud object shape. We extracted and visualized critical points in the process of model training. Based on this, some explanatory notes on the deep learning model can be given.
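Given the per-point feature matrix that enters a max pooling layer, the critical points are the points that supply the maximum in at least one feature channel. A minimal sketch with a hand-made feature matrix follows; the function name `critical_points` and the example values are assumptions for illustration.

```python
import numpy as np

def critical_points(per_point_features):
    """Indices of points that supply at least one channel of the max-pooled feature."""
    return np.unique(per_point_features.argmax(axis=0))

# Rows are points, columns are feature channels (toy values).
features = np.array([[1.0, 0.0, 0.0],
                     [0.0, 5.0, 2.0],
                     [3.0, 1.0, 1.0],
                     [0.5, 0.5, 0.5]])
idx = critical_points(features)
print(idx)  # [1 2]
```

Pooling only the rows in `idx` reproduces the global feature obtained from all points, which is why this subset can be read as a skeleton of the object.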

Model Accuracy Evaluation Metrics
After obtaining the tree classification results for each model, we evaluated the models using the following metrics: balanced accuracy (BAcc), precision (Pr), recall (Re), F-score (F), and kappa coefficient (kappa). Equations (11)-(14) give the formulae for Pr, Re, F, and kappa for each category. In the final reporting of the model metrics, we calculate the weighted average of these metrics, using the number of samples in each category as the weights. The weighted average of Re equals the overall accuracy (Acc) of the model, and BAcc is the arithmetic mean of Re over the categories.
where TP is the number of outcomes in which the model correctly predicts the positive class, FP is the number in which the model incorrectly predicts the positive class, FN is the number in which the model incorrectly predicts the negative class, and p_e is the sum, over the categories, of the product of the actual sample size and the predicted sample size, divided by the square of the total number of samples. The F-score (F) can be interpreted as a harmonic mean of precision and recall, which reaches its best value at 1 and its worst value at 0. The kappa coefficient is a metric used for consistency testing, which can also be used to measure the effectiveness of classification. For classification problems, consistency is the agreement between the model prediction and the actual class. The kappa coefficient is calculated from the confusion matrix; it takes values between −1 and 1 and is usually greater than 0. Finally, we plotted the confusion matrix for each model's classification results.
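The metrics described above can be sketched from a confusion matrix as follows. The sketch assumes the standard definitions Pr = TP / (TP + FP), Re = TP / (TP + FN), F = 2·Pr·Re / (Pr + Re), and kappa = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement (the overall accuracy). These forms are an assumption about Equations (11)-(14), kappa is computed once from the full matrix, and the function name and the toy matrix are hypothetical.

```python
def classification_metrics(cm):
    """Compute weighted Pr, Re, F, plus Acc, BAcc, and kappa.

    cm[i][j] is the number of samples of actual class i predicted as j.
    Assumes every class has at least one sample and p_e < 1.
    """
    k = len(cm)
    total = sum(sum(row) for row in cm)
    actual = [sum(row) for row in cm]                           # row sums
    predicted = [sum(row[j] for row in cm) for j in range(k)]   # column sums

    precision, recall, f_score = [], [], []
    for c in range(k):
        tp = cm[c][c]
        fp = predicted[c] - tp
        fn = actual[c] - tp
        pr = tp / (tp + fp) if tp + fp else 0.0
        re = tp / (tp + fn) if tp + fn else 0.0
        precision.append(pr)
        recall.append(re)
        f_score.append(2 * pr * re / (pr + re) if pr + re else 0.0)

    def weighted(values):
        # Average weighted by the number of actual samples per class.
        return sum(a * v for a, v in zip(actual, values)) / total

    p_o = sum(cm[c][c] for c in range(k)) / total
    p_e = sum(a * p for a, p in zip(actual, predicted)) / total ** 2
    return {
        "Pr": weighted(precision),
        "Re": weighted(recall),
        "F": weighted(f_score),
        "Acc": p_o,
        "BAcc": sum(recall) / k,
        "kappa": (p_o - p_e) / (1 - p_e),
    }


# Toy two-class confusion matrix.
cm = [[5, 1], [2, 2]]
metrics = classification_metrics(cm)
```

For this toy matrix the weighted recall and the overall accuracy coincide, as stated in the text, while BAcc differs because it gives both classes equal weight regardless of their sample counts.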

Analysis of the Effect of NGFPS Downsampling Method
We conducted deep learning training and testing experiments using the PointNet++ model on sample data produced by both the FPS and NGFPS methods. We aggregated and analyzed the results of the optimal model obtained from the training, and recorded the model evaluation metrics for both the FPS and NGFPS approaches on the test data set. All of these results are presented in Figure 4. As can be seen from the figure, every evaluation metric was higher with the NGFPS method; notably, the balanced accuracy (BAcc) obtained with NGFPS was 0.02 higher than that obtained with FPS. The kappa coefficient values show that the classification results of both approaches were close to perfect and almost identical.

Training Process of Deep Learning Models
We recorded the change in training accuracy (Figure 5) and the change in loss (Figure 6) during the training of each deep learning model. Figure 5 shows that the training and testing accuracy of the PointNet model improved only slightly with increasing epochs. The training and testing accuracies of all other models eventually exceeded 0.90. Although the training accuracy of the PCT model stabilized early, its test accuracy increased slowly after reaching 0.80. The training accuracy of the two PointNet++ models was always greater than the testing accuracy during training, and the training accuracy of the MSG model was consistently higher than that of the SSG model. During the training of the DGCNN model, the training and testing accuracies were essentially the same up to approximately epoch 150, after which the testing accuracy fell below the training accuracy as training continued. From the perspective of model training accuracy, DGCNN performed relatively poorly. In the later stage of model training, the training accuracy kept improving, but the testing accuracy did not keep pace. For all other models, the test accuracy was always greater than the training accuracy during training. The advantages of the lightweight PointMLP-elite model can be seen. During training, the PointMLP-elite model reached a high training accuracy faster than PointMLP, and the final training accuracies of the two models were consistent. The testing accuracy of the PointMLP-elite model was also higher than that of the PointMLP model. PointConv displayed a unique advantage in test accuracy, exceeding 0.95 at an early stage of training (Epoch = 100). Its accuracy continued to improve, more slowly, as training continued.
The PCT model obtained a relatively high testing accuracy early in training (Epoch = 50); however, after epoch 75, the test accuracy did not grow significantly, although the training accuracy continued to increase.
We could not evaluate different models solely by the magnitude of the loss, as different models calculate the loss differently during training. However, we could compare models of the same type or models with the same loss calculation. Figure 6 shows that the loss of all the models decreased and eventually plateaued, indicating that the training of all models was stable. PointNet and PointNet++ used the same model training strategy. The loss of PointNet++ eventually stabilized close to 0, whereas the loss of the PointNet model was higher, stabilizing at approximately 1. This indicates that the classification performance of the PointNet model was worse than that of PointNet++. The loss values of the DGCNN and PointMLP models also finally stabilized at approximately 1.

Accuracy of Tree Species Classification
We summarize the final classification results of all models on the training and test sets and provide all model evaluation metrics in Table 3. As shown in the table, the PointNet model had the lowest evaluation metrics for all results. On the test set, the PointConv model had the highest values in all evaluation metrics. As a recently proposed state-of-the-art (SOTA) pointwise MLP method, PointMLP ranked second in all classification evaluation metrics on the test set. The confusion matrices of the classification results for all models on the test data set are presented in Figure 7 and corroborate the results detailed in Table 3. Overall, three models (PointConv, PointMLP, and DGCNN) classified nearly all samples correctly.

Model Comparison and Analysis
We summarize in Table 4 the values of some evaluation metrics, collected from the literature, for the point cloud deep learning models used in this study. The mean accuracy (mAcc) and overall accuracy (OA) in Table 4 are those reported for classification of the ModelNet40 data set.
The structure and performance of point cloud deep learning models have been continuously improved and optimized by researchers, and newly proposed models have recently achieved higher classification accuracy. Among the experiments performed to classify the ModelNet40 data set, Table 4 shows that the PointMLP method had the highest classification accuracy, and three other methods (PointConv, DGCNN, and PCT) had similar classification accuracy. Meanwhile, the classification accuracies of PointNet and PointNet++ were lower. Furthermore, the PointMLP model had the largest number of parameters, while the elite version had the smallest number of parameters, and the difference in classification accuracy between the two models was small. The PointNet model had the second-highest number of parameters, but also had the worst classification accuracy. PointNet++ (MSG) had higher FLOPs and, thus, a higher computational burden. DGCNN and PCT had high classification accuracy but place a notable memory and processor load on the computer.
We recorded the time each model spent training in the tree species classification experiment, which is plotted in Figure 8a, while Figure 8b displays the GPU memory occupancy of each model during training. As shown in Figure 8, PointNet++ and PointMLP required longer training times to obtain the optimal model parameters. PointMLP-elite is a streamlined, lightweight version of PointMLP; its runtime was 70 min shorter, approximately 30% less than that of PointMLP. With the same training parameters, the GPU memory usage of the PointMLP-elite model was 24% lower. Three models (PointConv, DGCNN, and PCT) took less time to train, but DGCNN had a higher GPU memory allocation.

Visualization of Critical Points
We extracted the critical points used for classification after training seven models, namely PointNet, PointNet++ (MSG), PointNet++ (SSG), PointMLP, PointMLP-elite, DGCNN, and PCT, and selected samples of some tree species as cases for demonstration (Figure S1). Figure S1 shows that all critical points summarized the structure of the 3D objects well. Tree species with similar shapes or structures were occasionally misclassified, but this was rare.

Discussion
In this study, we demonstrated that four different types of point cloud deep learning models, namely pointwise MLP, convolution-based, graph-based, and attention-based models, perform well when categorizing tree species from individual tree LiDAR point clouds. We proposed a new approach for downsampling point clouds that combines the non-uniform grid and farthest point sampling methods. For training and testing experiments involving tree species categorization, we explored eight point cloud deep learning models: PointNet, PointNet++ (MSG), PointNet++ (SSG), PointMLP, PointMLP-elite, PointConv, DGCNN, and PCT. With the exception of PointNet, the accuracy of tree species classification for all models was greater than 90%. We extracted and visualized the set of critical points used in feature aggregation by each model when classifying each sample, and these were found to comprehensively summarize the skeletons of the 3D point cloud object shapes.
Individual tree point cloud data processed with the NGFPS method can yield higher accuracy in tree species classification, as the method combines the non-uniform grid sampling and farthest point sampling (FPS) algorithms. Non-uniform grid sampling preserves the details of 3D objects well, while FPS is a widely used, distance-based fast sampling method that guarantees uniform sampling. NGFPS combines the advantages of both downsampling methods and can represent the structure of 3D objects comprehensively and accurately.
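The FPS stage can be sketched as below. This is a plain greedy implementation of farthest point sampling under the assumption of Euclidean distance and a fixed starting point. It covers only the FPS step; the non-uniform grid stage and the way the two stages are combined in NGFPS are not reproduced. The function name and the toy coordinates are hypothetical.

```python
import math


def farthest_point_sampling(points, n_samples, start=0):
    """Greedily pick indices so each new point is farthest from those chosen.

    points: list of coordinate tuples; start: index of the first sample.
    """
    if not 0 < n_samples <= len(points):
        raise ValueError("n_samples must be between 1 and len(points)")
    selected = [start]
    # Distance from every point to its nearest already-selected point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < n_samples:
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(nxt)
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


# Toy cloud: a cluster near the origin and two points high on the z axis.
cloud = [
    (0.0, 0.0, 0.0),
    (1.0, 0.0, 0.0),
    (0.1, 0.0, 0.0),
    (0.0, 0.0, 5.0),
    (0.0, 0.0, 5.1),
]
sampled = farthest_point_sampling(cloud, 3)
```

Because each step maximizes the distance to the points already chosen, the samples spread evenly over the object, which matches the uniform coverage attributed to FPS above. This simple version costs on the order of N operations per selected point.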
The classification accuracy of the PointNet model was poor. This was because PointNet only considers the global features of the samples and ignores local features. Trees are usually grouped into one category in 3D object classification studies because their shape characteristics are very similar. Local detailed features are therefore the key information needed to distinguish different species of trees. The PointNet model cannot capture the local structure induced by the metric space in which the points lie, which limits its ability to identify fine-grained patterns. All of the other five methods could extract the local feature information of 3D objects and achieved a higher classification accuracy.
The PointConv model achieved the highest classification accuracy and also required less training time. In 2D images, convolutional neural networks have fundamentally changed the landscape of computer vision by dramatically improving the results of almost all vision tasks [33]. Our experiments demonstrated the scalability of convolution-based point cloud deep learning models for the task of classifying 3D objects. The 3D convolutional deep learning method with local information extraction capability can successfully capture the structural features of trees for classification and recognition. Convolution plays a unique role in the study of deep learning.
The PointMLP model achieved a classification accuracy second only to that of the PointConv model in tree species classification. This confirmed that the existing feature extractors proposed by Ma et al. [24] can already describe the local geometric features of 3D objects well. More complex designs are no longer needed to further improve performance. This illustrates that we need to re-think and re-design algorithms for local feature extraction, in order to propose simple model structures for point cloud data analysis. The experimental results of the PointMLP-elite model illustrated that a simple deep learning model structure can still achieve good classification accuracy.
The PCT model did not achieve optimal classification accuracy in the tree species classification experiments. The transformer does not use recurrent neural layers but attention mechanisms, which have a high degree of parallelism. However, attention mechanisms make fewer assumptions about the whole model, meaning that the transformer model has fewer adjustable parameters, which leads to a need for more data and a larger model to train in order to achieve the same effect as a CNN [48]. This is why transformer-based models are considered to be larger and more expensive. In this experiment, the final classification accuracy was relatively low because the number of samples used for deep learning model training was relatively small. The transformer is a feature extraction method based on an attention mechanism, which was originally proposed for natural language translation. However, scholars have found that the model can be applied in any deep learning domain. Transformers can fuse various data, such as text, images, speech, and video, and extract features within a single framework as a means to train larger and better models. This is a hotspot and focus of deep learning research at present and is expected to remain so for some time.
For the different models, the ranking of OA on the ModelNet40 data set differs from the ranking obtained in the tree species classification experiments. The results in Tables 3 and 4 show that, although the recently developed models (PointMLP, PCT) achieved superior classification accuracy in experiments on 3D object classification, the convolution-based method (PointConv) and the MLP-based method (PointNet++ (MSG)) had higher classification accuracy when performing tree species identification on individual tree point clouds. This indicates that more models and methods need to be considered for use in different classification tasks of 3D objects. The evaluation of the number of parameters and the performance of different models needs to be repeated for different experimental tasks.
Our experiments still had several limitations that will need to be addressed in the future. Due to the small number of training examples, the transformer, a promising deep learning framework at the moment, did not demonstrate its advantages in this study. In future related work, we hope to broaden the application of transformer models in point cloud deep learning by collecting as many samples as possible. Similar to the authenticity test of remote sensing products [49][50][51], we must reduce the error tolerance of training samples while improving the accuracy of the model by drawing on the authenticity test. It has been proposed that very high-accuracy models can be trained using MLP or simpler architectures; however, we found that the training time of pointwise MLP-based methods was longer than that of other types of methods. The scalability of the deep learning models used at this stage must be further demonstrated through various experiments, as the methods currently designed and developed by scholars have been proposed based on a few data sets common to the field of computer vision. In this context, more architectures are sure to emerge and expand into more research areas in the future.
Overall, our experiment was very successful. The NGFPS method can aid in deep learning model training, obtaining higher accuracies, and different types of point cloud deep learning methods were shown to achieve good results in the study of tree species classification. The proposal of new point cloud deep learning methods, such as PointMLP, provides us with more options in the study of tree species classification.

Conclusions
Deep learning approaches were demonstrated to be capable of identifying the species of individual tree point clouds very accurately. The NGFPS method, which combines the non-uniform grid and farthest point sampling methods, can better preserve the details and structural information of 3D objects, providing accurate data input for deep learning models to identify the features of different tree species. All deep learning models generalized the classification objects in terms of both local and global features. In particular, the convolution-based PointConv model demonstrated superior performance for 3D object classification. Furthermore, deep learning algorithms need not be more complex than they currently are, as the lightweight PointMLP-elite model presented unique advantages over PointMLP. The PointNet++ model, a foundational model for point-based deep learning methods, still exhibited high classification accuracy in tree species classification. Point cloud deep learning models are constantly developing, and the potential of deep learning techniques will continue to be explored. The successful practice of using deep learning models for individual tree species classification provides a new solution, allowing for more efficient forest resource surveys.