Hierarchical Instance Recognition of Individual Roadside Trees in Environmentally Complex Urban Areas from UAV Laser Scanning Point Clouds

Abstract: Individual tree segmentation is essential for many applications in city management and urban ecology. Light Detection and Ranging (LiDAR) systems acquire accurate point clouds in a fast and environmentally friendly manner, which enables single tree detection. However, the large number of object categories and the occlusion from nearby objects in complex environments pose great challenges to urban tree inventory, resulting in omission or commission errors. Therefore, this paper addresses these challenges and increases the accuracy of individual tree segmentation by proposing an automated method for instance recognition of urban roadside trees. The proposed algorithm was implemented on unmanned aerial vehicle laser scanning (UAV-LS) data. First, an improved filtering algorithm was developed to identify ground and non-ground points. Second, we extracted tree-like objects by labeling non-ground points using a deep learning model with a few small modifications. Unlike previous methods that concentrate only on global features, the proposed method revises a pointwise semantic learning network to capture both global and local information at multiple scales, significantly reducing the information loss in local neighborhoods and avoiding useless convolutional computations. Afterwards, the semantic representation is fed into a graph-structured optimization model, which obtains globally optimal classification results by constructing a weighted undirected graph and solving the optimization problem with graph cuts. The segmented tree points were extracted and consolidated through a series of operations, and they were finally recognized by combining graph embedding learning with a structure-aware loss function and a supervoxel-based normalized cut segmentation method. Experimental results on two public datasets demonstrate that our framework achieves better performance in terms of classification accuracy and tree recognition ratio.

ISPRS Int. J. Geo-Inf. 2020, 9, 595

Urban roadside tree extraction and recognition is a long-standing yet still active topic. Research on roadside trees from point clouds in urban areas is relatively scarcer than that on forestry resource inventory. In addition, the LiDAR data collected in forest areas mainly consist of trees, while urban scenes contain various artifacts, which increases the complexity of the tree segmentation task. With regard to tree extraction, existing work can be further divided into two types: geometric rule-based and semantic labeling-based methods.
Geometric rule-based methods generally detect and extract trees from environmentally complex scenes using geometric attributes [19] such as tree shape, cylindrical trunk shape, point distribution, etc. They can be divided into three classes: clustering [20], region growing [21], and density threshold-based methods [22]. All of the above methods performed well in their own research areas. However, they generally struggle to extract trees from large-scale scenes of densely mixed objects in urban areas because of the heavy computing costs required for local geometric features. Specifically, most rule-based works are traditional and not competitive in speed or accuracy.
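The clustering class above can be illustrated with a minimal sketch. The greedy Euclidean clustering below is an illustrative stand-in, not the cited algorithms [20–22]: it groups points whose mutual distance falls below a radius, which is the basic mechanism such rule-based tree detectors build on.

```python
import numpy as np

def euclidean_cluster(points, radius=1.0):
    """Greedy Euclidean clustering: grow a cluster from an unvisited seed
    by repeatedly absorbing points within `radius` of any cluster member."""
    n = len(points)
    labels = np.full(n, -1)
    current = 0
    for seed in range(n):
        if labels[seed] != -1:
            continue
        frontier = [seed]
        labels[seed] = current
        while frontier:
            idx = frontier.pop()
            d = np.linalg.norm(points - points[idx], axis=1)
            for j in np.where((d < radius) & (labels == -1))[0]:
                labels[j] = current
                frontier.append(j)
        current += 1
    return labels

# Two well-separated point groups yield two clusters:
pts = np.array([[0, 0, 0], [0.5, 0, 0], [10, 0, 0], [10.4, 0, 0]])
print(euclidean_cluster(pts))  # -> [0 0 1 1]
```

In real urban scenes such a distance threshold alone merges touching crowns and poles, which is exactly the failure mode motivating the learning-based approach of this paper.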
In recent years, a series of deep learning methods have been applied to 3D semantic segmentation [23]. Since deep learning-based semantic segmentation of point clouds in outdoor scenes often includes a tree category, this line of work is beneficial to the task of tree segmentation from laser scanning data. The most direct way to exploit deep learning for LiDAR point cloud classification is to first convert the 3D points into regular collections of 2D images in various ways. Semantic segmentation is then performed by a mature deep neural network (DNN) from the image field, and the image segmentation result is finally transferred back to the point cloud [24][25][26]. However, these models cause spatial information loss and induce quantization errors in the conversion process [27]. Another feasible method is to extend 2D convolution to 3D convolutional neural networks (3D CNNs). Most researchers develop specialized models that transform point clouds into 3D voxels, followed by a 3D CNN that predicts the semantic label based on the occupancy grid [28]. However, 3D volumetric grids entail substantial memory consumption and computational cost.
To overcome the shortcomings of the above two methods, some work directly employs deep learning models on raw point clouds. However, CNNs can handle only regular input formats; the direct application of standard CNNs to unordered and unstructured point clouds is infeasible. To address this, the pioneering work PointNet [29] proposed learning per-point features using shared multi-layer perceptrons (MLPs) and global features using symmetric pooling functions. Based on PointNet, a series of point-based networks have been proposed recently [23]. Overall, these methods can be roughly divided into pointwise MLP methods [30][31][32][33], point convolution methods [34][35][36][37], RNN (Recurrent Neural Network)-based methods [38][39][40][41], and graph-based methods [42][43][44][45]. To summarize, most existing deep learning-related methods focus only on either global or local statistical information, resulting in reduced representativeness and descriptiveness. In addition, some of the above models employ standard convolution kernels with regular receptive fields, which neglect the structural connections between points and fail to account for varying point density.
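The PointNet building blocks mentioned here (a shared MLP plus a symmetric pooling function) can be sketched in a few lines of NumPy; the weights below are random placeholders, not a trained model. The key property is that max pooling makes the global feature invariant to the ordering of the input points:

```python
import numpy as np

rng = np.random.default_rng(0)

def shared_mlp(points, weights, bias):
    """Apply the same (shared) linear layer + ReLU to every point: (N, D) -> (N, C)."""
    return np.maximum(points @ weights + bias, 0.0)

# Toy point cloud: N = 4 points with D = 3 coordinates
cloud = rng.normal(size=(4, 3))
W, b = rng.normal(size=(3, 8)), np.zeros(8)

per_point = shared_mlp(cloud, W, b)       # per-point features, shape (4, 8)
global_feat = per_point.max(axis=0)       # symmetric max pooling -> shape (8,)

# Permuting the input points leaves the global feature unchanged
shuffled = cloud[rng.permutation(4)]
assert np.allclose(shared_mlp(shuffled, W, b).max(axis=0), global_feat)
print(global_feat.shape)  # -> (8,)
```

Because the pooled feature discards which point contributed each channel maximum, architectures built only on this global summary lose local neighborhood detail, which is the limitation the proposed network targets.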
Our method follows an idea similar to the point convolution methods. However, instead of focusing only on single-point features and ignoring the other relationships between individual points, as PointNet-derived deep learning models do, we propose a point-wise semantic learning network that acquires both the global and the local information of each point, avoiding the information loss in local neighborhoods and reducing useless convolutional computations.
To improve segmentation accuracy and efficiency simultaneously, we propose a deep learning approach designed for recognizing individual trees from UAV-LS data. The non-ground points are first identified; subsequently, a revised point-wise convolution algorithm is proposed to extract trees from the non-ground points; a graph-structured optimization algorithm is then performed to obtain optimal classification results; finally, a novel approach called 3D graph embedding learning with a structure-aware loss function is introduced for individual roadside tree segmentation. To the best of our knowledge, this paper is among the earliest applications of a deep learning-based instance segmentation method [46] to segmenting individual trees from ALS point clouds. We aim to obtain satisfactory segmentation results by utilizing deep learning while substantially reducing the impact of scene complexity.
The main contributions of our approach include the following three aspects: (i) Presentation of a complete pipeline for urban scene understanding and roadside tree recognition, from the input UAV-LS data to the final tree segmentation.
(ii) Improved classification of large-scale 3D unordered points without voxelizing by using pointwise semantic learning network. Most of the parameters in the network are learned, and thus the intensive parameter tuning cost is significantly reduced.
(iii) Effective aggregation of multi-level information (geometric information, discriminative embedding information, and information from neighbors) and elimination of the quantization errors caused by regular voxelization, by using a graph convolutional neural network that combines a structure-aware loss function with an attention-based k-nearest neighbor scheme.

Materials and Methods
As shown in Figure 1, our method consists of four main steps: (1) Ground removal (Section 2.1). The raw data are divided into ground and non-ground points by an improved filtering algorithm. (2) Roadside tree detection (Section 2.2). Roadside tree objects are obtained by a deep learning approach based on the revised PointNet network applied to the non-ground points. (3) Labeling refinement (Section 2.3). A graph-structured optimization algorithm is performed to achieve spatial smoothing of the initial semantic labeling results. (4) Recognition of individual roadside trees (Section 2.4). Roadside tree objects are segmented by a deep learning-based method from the point cloud of the given category. The details of each step are as follows.
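The four steps can be summarized as the following skeleton; all function names and toy thresholds here are illustrative placeholders rather than the actual components described in Sections 2.1–2.4:

```python
# Trivial stand-ins so the skeleton runs; each would be replaced by the
# corresponding component of the pipeline (IPTD filtering, the point-wise
# semantic network, graph-structured refinement, instance recognition).
def filter_ground(pts):      return [p for p in pts if p[2] < 0.2], [p for p in pts if p[2] >= 0.2]
def classify_pointwise(pts): return ["tree" if p[2] > 2.0 else "other" for p in pts]
def refine_with_graph(pts, labels): return labels
def split_into_instances(pts):      return [pts]  # one list per tree instance in practice

def segment_roadside_trees(raw_points):
    ground, non_ground = filter_ground(raw_points)   # (1) ground removal
    labels = classify_pointwise(non_ground)          # (2) semantic labeling
    labels = refine_with_graph(non_ground, labels)   # (3) label refinement
    trees = [p for p, l in zip(non_ground, labels) if l == "tree"]
    return split_into_instances(trees)               # (4) instance recognition

cloud = [(0, 0, 0.0), (1, 0, 0.1), (5, 5, 3.0), (5, 5, 4.0), (2, 2, 1.0)]
print(segment_roadside_trees(cloud))  # -> [[(5, 5, 3.0), (5, 5, 4.0)]]
```

The design point is that each stage consumes only the output of the previous one, so errors at the labeling stage propagate directly into instance recognition, which is why step (3) exists.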


Ground Removal by Improved Progressive TIN Densification Filtering
Because of the scanning mode of aerial LiDAR systems, ground points take up a large portion of the entire scene. Such a great number of ground points not only enlarges the search regions for extracting non-ground objects, but also increases the spatial complexity and slows the processing speed. Therefore, removing ground points from a variety of scenes is a preliminary but crucial step. To reduce the quantity of data to be handled and to account for terrain fluctuations in a large scene, we develop an improved progressive triangulated irregular network (TIN) densification (IPTD) filtering algorithm, which can rapidly and effectively distinguish ground points from non-ground points, particularly in structurally complex regions.
In existing filtering algorithms, the morphological filtering algorithm [47] obtains non-ground points while retaining the terrain details; the progressive densification method [48] has gained popularity owing to its robustness and effectiveness in segmenting ground points. However, it has inadequacies when dealing with topographically and environmentally complex regions, and it tends to remove ground points on steep regions and flatten the terrain. To obtain better filtering performance for complex urban areas, we investigate the feasibility of improving progressive TIN densification filtering of point clouds. Compared with previous work [49,50], the enhancements of the proposed IPTD filtering algorithm encompass three aspects. (1) Potential ground seed points are obtained through an extended local minimum for the grid cells containing points, and nearest-neighbor interpolation is adopted to estimate the elevations of empty grid cells, instead of simply using the lowest points in user-defined grids. (2) Accurate ground seed points are acquired by comparing the elevation difference in the neighborhood of a local thin-plate-spline interpolation against a given threshold. This operation provides more ground seed points that are generally evenly distributed. (3) Before upward densification, downward densification is performed to improve the algorithm's ability to cope with slope variations. Lastly, the ground points are extracted by iteratively densifying the ground seed points, and the remaining points are taken as non-ground points. Figure 2 shows an example of the point cloud before and after ground removal, where the colors represent elevation variations.
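A minimal sketch of the seed-selection idea in enhancement (1): taking the lowest point per grid cell as a potential ground seed. The extended-local-minimum test and the nearest-neighbor interpolation of empty cells are omitted here for brevity:

```python
import numpy as np

def grid_lowest_points(points, cell=2.0):
    """Collect the lowest point in each XY grid cell as a potential
    ground seed (a simplification of the IPTD seed-selection step)."""
    ij = np.floor(points[:, :2] / cell).astype(int)
    seeds = {}
    for key, p in zip(map(tuple, ij), points):
        if key not in seeds or p[2] < seeds[key][2]:
            seeds[key] = p
    return np.array(list(seeds.values()))

pts = np.array([
    [0.5, 0.5, 10.2],  # canopy return
    [0.6, 0.4,  0.1],  # ground return, same cell -> kept as seed
    [3.1, 0.2,  0.3],  # ground return, next cell
])
print(grid_lowest_points(pts, cell=2.0))
```

Choosing the per-cell minimum keeps canopy and building returns out of the seed set, but in cells dominated by low vegetation or noise it can still pick a non-ground point, which is why the paper adds the elevation-difference check of enhancement (2).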


Detection of Tree-Like Structures via Pointwise Classification on Non-Ground Points
As a significant requirement of urban tree inventory, labeling the input data with the tree category is fundamental to exploiting its informative value for the instance segmentation of trees. The problem has been extensively researched. Some unified approaches combine handcrafted features in a heuristic manner but fail to capture high-level semantic structures. The segmented points generated by deep-learning methods are often corrupted with noise and outliers due to the unorganized distribution and uneven point density, which are inevitable in complex urban environments. It is a challenge to effectively and automatically recognize individual trees in environmentally complex urban regions from such data. To solve this problem, we revise an existing deep-learning architecture to directly process unstructured point clouds and implement a point-wise semantic learning network.
PointNet has revolutionized how we think about processing point clouds, as it offers a learnable structured representation for 3D semantic segmentation tasks. However, it cannot effectively and robustly learn the local features of point clouds, limiting its ability to recognize the object of interest in complex scenes. The proposed architecture is similar to PointNet with a few small modifications, i.e., features are computed from local regions instead of the entire point cloud, which makes the estimation of local information more accurate. We propose an improved 3D semantic segmentation network based on a pointwise KNN (k-nearest neighbor) search method, which has the unique advantage of extracting multi-level features. The proposed network takes the local features of a query point, computed from the local region composed of its neighbor set, as its new features, and establishes connections between multiple layers by adding skip connections to strengthen the learning of local features.
In this section, we present the theoretical and logical principles of the proposed network for urban object segmentation from the non-ground points of ALS data in detail. Considering that the descriptiveness of the differences between points from low-dimensional initial features is far from satisfactory, and that richer point features are beneficial for local feature extraction, we first enhance the discriminability of the point features and then obtain local information in the new feature space. The proposed network mainly contains two modules: (1) an optimized PointNet implemented with KNN, which is used to extract more powerful local information in the high-dimensional feature space; and (2) semantic segmentation of the large-scale point cloud using the global and local features. The architecture of the proposed method is summarized in Figure 3. Our detailed and complete feature learning network is illustrated in Figure 3 (the green part), which consists of two key modules: a sub-module of point feature extraction based on PointNet and a sub-module of local feature extraction in the high-dimensional feature space.


Sub-Module of Point Feature Extraction
The module is designed to obtain richer high-dimensional point cloud features by applying feature extraction operations multiple times. For the whole pipeline, we directly use a simplified PointNet as our backbone network. Our network takes a point cloud of size N × D as input, then encodes it into an N × 128 matrix using shared Multi-Layer Perceptrons (MLPs) [29]. After max pooling, the dimension of the global feature of the point cloud is 128. Finally, the global feature is duplicated N times and concatenated with the point features into a feature vector. This vector is then input into an MLP to obtain a new N × 128 feature matrix, which is fed into the following module to acquire the local attributes.
In general, the feature learning network mainly uses three sub-modules of point feature extraction with the same principle. In other words, we perform three repeated feature extractions. We pass and fuse the features of different layers through skip connections, extracting and fusing richer high-level features. The feature learning network illustrated in Figure 4 is composed of two components: a simplified PointNet and connection channels.
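The data flow of one point feature extraction sub-module can be sketched as follows; the weights are random placeholders, and a single linear + ReLU layer stands in for each MLP:

```python
import numpy as np

rng = np.random.default_rng(1)

def mlp(x, w, b):
    """One shared MLP layer with ReLU, applied identically to every point."""
    return np.maximum(x @ w + b, 0.0)

N, D, C = 16, 3, 128
cloud = rng.normal(size=(N, D))

# Per-point encoding: (N, D) -> (N, 128)
w1, b1 = rng.normal(size=(D, C)) * 0.1, np.zeros(C)
point_feat = mlp(cloud, w1, b1)

# Global feature by max pooling, then duplicated back to every point
global_feat = point_feat.max(axis=0)                # (128,)
tiled = np.repeat(global_feat[None, :], N, axis=0)  # (N, 128)

# Concatenate point and global features, project back to 128 channels
w2, b2 = rng.normal(size=(2 * C, C)) * 0.1, np.zeros(C)
fused = mlp(np.concatenate([point_feat, tiled], axis=1), w2, b2)
print(fused.shape)  # -> (16, 128)
```

Repeating this sub-module three times, as the network does, re-injects the global context at each stage while the skip connections carry the earlier per-point features forward.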

Sub-Module of Local Feature Extraction
The features of points belonging to the same semantic category are similar; this module aims to reduce the feature distance between similar points and increase the discriminability of different point clouds. The output from the above module forms an N × 128 tensor with low-level information. To gain sufficient expressive power to transform each point feature into a higher-dimensional feature, a fully connected layer is added to the KNN module. We first transform the original points P to a canonical pose by an STN (spatial transformer network) [51], and then search for the K nearest neighbors on the spatially invariant points P̂ for each query point p̂_n. The point-to-point set KNN search is defined as follows:

{p̂_{n,1}, p̂_{n,2}, . . . , p̂_{n,K}} = KNN(p̂_n, P̂),

where p̂_{n,k} represents the k-th nearest neighbor of the query point p̂_n.
In the feature space of the input data, we search for K neighboring points with the smallest feature distance from the query point. The pseudocode of the KNN module, which shows the details of the feature space search algorithm based on KNN, is presented in Appendix A.
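A brute-force version of this feature-space KNN search can be written as follows (the pseudocode in Appendix A presumably uses an equivalent but more efficient formulation):

```python
import numpy as np

def knn_in_feature_space(features, k):
    """For each query point, return the indices of the k neighbors with
    the smallest Euclidean feature distance (excluding the point itself)."""
    d = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)           # a point is not its own neighbor
    return np.argsort(d, axis=1)[:, :k]   # (N, k) neighbor indices

# Two pairs of points that are close in feature space find each other:
feats = np.array([[0.0], [0.1], [5.0], [5.2]])
print(knn_in_feature_space(feats, k=1).ravel())  # -> [1 0 3 2]
```

Because neighborhoods are defined by feature distance rather than spatial distance, points of the same semantic category can be grouped even when they are not spatially adjacent, which is the property the contrastive loss later reinforces.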
The feature learning network contains three local feature extraction sub-modules with identical structures. At a given level, each module learns point cloud features within a local neighborhood. Therefore, this paper accomplishes learning within multi-level local neighborhoods by repeatedly applying the sub-module of local feature extraction three times. The detailed processing of this module is shown in Figure 5. The feature-space-based local feature extraction module extracts local features at one level; applying it multiple times expands the receptive field of the operation, which is equivalent to extracting multiple local features at different levels and helps to extract more and richer point cloud features.


Semantic Segmentation Module and Loss Function
The semantic segmentation component concatenates the three output vectors (the low-level point feature vector, the high-level local feature vector, and the global feature vector) into a 1280-dimensional feature vector. This vector is gradually reduced in dimensionality by MLPs and then fed into the Softmax layer to acquire the final segmentation result, which consists of N × M scores for each of the N points and each of the M semantic categories.

To minimize model errors during training, the loss function L of our network consists of the semantic segmentation loss L_sem and the contrastive loss L_pair:

L = L_sem + L_pair,

where L_sem is defined with the classical, off-the-shelf softmax cross-entropy loss function, formulated as follows:

L_sem = −(1/N) Σ_{i=1}^{N} g_i · log( e^{z_i} / Σ_j e^{z_j} ),

where g_i denotes the one-hot label of the i-th training sample, N denotes the batch size, and e^{z_i} / Σ_j e^{z_j} is the softmax prediction score vector. As for the contrastive loss, it is expressed by a discriminative function based on the assumption that the features of points belonging to the same semantic category are similar. Feature distance and point labels are the metrics used to measure the dissimilarity between two points. Therefore, during training, the proposed network minimizes the feature difference between points with the same label and expands the feature difference between points belonging to different object labels in the feature space. Specifically, the contrastive loss function is defined as follows:

L_pair = (1/N) Σ [ y · d² + (1 − y) · max(0, margin − d)² ],

where d represents the Euclidean distance between the features of two points, N is the number of point pairs, and margin is a preset threshold that measures the discrimination between two points. y is a binary function indicating whether two points belong to the same category. y equals 1 if the two points belong to the same object: in this case, the smaller the feature distance between the two points, the better the model fits and the smaller the loss L_pair.
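The two loss terms can be sketched directly from their definitions; the contrastive form below is the classical formulation matching the described behavior (zero loss for identical same-label features, and for different-label features farther apart than the margin):

```python
import numpy as np

def softmax_cross_entropy(logits, one_hot):
    """L_sem: mean softmax cross entropy over the batch."""
    z = logits - logits.max(axis=1, keepdims=True)            # numerical stability
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))  # log softmax
    return -(one_hot * log_p).sum(axis=1).mean()

def contrastive_loss(d, y, margin=1.0):
    """L_pair: pull same-label pairs together (y = 1); push different-label
    pairs apart until their feature distance exceeds the margin (y = 0)."""
    return np.mean(y * d**2 + (1 - y) * np.maximum(margin - d, 0.0)**2)

# A same-label pair at distance 0 and a different-label pair beyond the
# margin both incur zero contrastive loss:
print(contrastive_loss(np.array([0.0, 2.0]), np.array([1, 0])))  # -> 0.0
```

Note that for a different-label pair closer than the margin, the `max(0, margin − d)²` term grows as d shrinks, reproducing the retraining condition described below.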
If two points do not belong to the same category, then y = 0: if the feature distance between the two points is greater than the margin, the two points do not affect each other and the loss L_pair is 0. If the feature distance between the two points is less than the margin, the loss L_pair increases as the feature distance decreases, meaning the current model is not suitable and needs further training. The point-wise semantic label is determined from the prediction score vector after minimizing the loss function. Finally, the semantic segmentation maps the initial point cloud features to new high-level feature spaces; that is, points of the same semantic category are clustered together, while different classes are separated in the semantic feature space.

Graph-Structured Optimization for Classification Refinement
A few local errors remain in the output of the point-wise semantic learning network. For instance, a tree without a canopy may be misclassified as "others", or a pole mixed with a tree may be misclassified as a tree. Since the tree class in the classification result serves as the input to the next step, instance segmentation of trees, these inaccurate labels may have undesirable consequences. Therefore, we optimize the initial results and obtain locally smoothed results with a probabilistic model. The wrong labels can be revised using their local context, as advocated in [52]. We model spatial correlations by applying a graph structure to enforce the consistency of point-wise label predictions; in other words, the labeling refinement is converted into a graph-structured problem.
The first step is the construction of the graph that structures the objective functional. The graphical model is an undirected adjacency graph G = {V, E}, where the nodes V = {v_i} represent the points and the edges E = {e_ij}, with weights w, encode the spatial relationships of adjacent points. In graph G, for a central vertex v, we define its ten nearest neighbors (KNN) according to the minimum number of edges from any other vertex v_i to v. For the edge weights w ∈ [0, 1], the spatial distance, the difference of normal vector angles, and the similarity are adopted to estimate the weights. Furthermore, let P = {p_1, ..., p_N} denote the set of points, let C = {c_1, ..., c_m} be the set of labels (in this paper, m is determined by the number of labels in the dataset), let Ψ = {Ψ_i | i = 1, ..., N} represent the set of point feature variables, and let L = {l = (l_1, ..., l_N) | l_i ∈ C, i = 1, ..., N} denote all possible label configurations [53]. To achieve global optimization over the constructed graph, we formalize the optimal label configuration as an energy minimization problem:

E(L) = E_data(L) + λ E_smooth(L),

where the unary potential term E_data(L) quantitatively measures the disagreement between a possible label configuration L and the observed data, the smoothness potential term E_smooth(L) keeps predictions smooth and consistent, and λ is a weight coefficient that balances the unary potential against local smoothness.
With this configuration, a regularization of the initial semantic segmentation is performed to make the labels locally continuous and globally optimal. In the proposed framework, the two terms of the energy function are defined as follows. The unary potential E_data(L) takes the typical form

E_data(L) = Σ_{i=1}^{N} φ_i(l_i),

where φ_i(l_i) measures how well label l_i fits the feature variables Ψ_i of the observed data and enforces the influence of the labels. In urban regions, the same objects have similar features while different objects have distinct features, so the term φ_i(l_i) = −log P(l_i) is derived from the predicted probability P(l_i) output by the point-wise classification: the higher the category posterior probability, the smaller the unary potential. The second term in Equation (7) suppresses the "salt and pepper" effect and penalizes inconsistent label assignments between pairs of 3D points, following [54]:

E_smooth(L) = Σ_{i<j} µ(l_i, l_j) Σ_p ω_p k_p(f_i, f_j),

where µ(l_i, l_j) = 1 if l_i ≠ l_j and 0 otherwise, k_p represents a Gaussian kernel on the extracted features f, which are determined by the XYZ and intensity values of points i and j, and ω_p indicates constant coefficients. Two Gaussian kernels [55,56] are chosen as follows:

k(f_i, f_j) = ω_b exp(−|x_i − x_j|²/(2θ_α²) − |I_i − I_j|²/(2θ_β²)) + ω_s exp(−|x_i − x_j|²/(2θ_γ²)),

where ω_b is the weight of the bilateral kernel, ω_s is the weight of the spatial kernel, and θ_α, θ_β and θ_γ are three predefined hyperparameters.
The regularization coefficient λ is estimated from the local point spacing, where d_ij is the distance between points i and j and δ is the expectation of all neighboring distances.
Accordingly, the optimal label prediction L* is the solution of the energy minimization problem:

L* = argmin_L E(L).

Although exact minimization is intractable, the problem can be solved approximately and efficiently by a graph-cut algorithm using α-expansion [57,58]. Within a few graph-cut iterations, we can effectively find an approximate solution to the multi-label energy. A labeling cost term is not considered, since the graph-based optimization already leverages the predictions and their confidence, as well as the semantic label assignment between similar points in each region. The optimized results adapt automatically to the underlying urban scenes without predefined features for uncertain objects.
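To make the energy function concrete, the following sketch evaluates E(L) = E_data + λ·E_smooth for a candidate labeling on a small KNN graph. The kernel weights and hyperparameters are illustrative placeholders; an actual implementation would minimize this energy with α-expansion graph cuts (e.g., via the gco or PyMaxflow libraries) rather than enumerate labelings:

```python
import numpy as np

def labeling_energy(labels, probs, edges, xyz, intensity, lam=1.0,
                    w_b=1.0, w_s=1.0, th_a=1.0, th_b=1.0, th_g=1.0):
    """Energy of one label configuration on an undirected point graph.

    labels    : (N,) candidate labels l_i
    probs     : (N, M) per-point class probabilities from the network
    edges     : list of (i, j) index pairs (KNN adjacency)
    xyz       : (N, 3) coordinates; intensity : (N,) return intensities
    """
    eps = 1e-12
    # Unary term: phi_i(l_i) = -log P(l_i)
    e_data = -np.log(probs[np.arange(len(labels)), labels] + eps).sum()
    # Pairwise term: Potts indicator mu times two Gaussian kernels
    e_smooth = 0.0
    for i, j in edges:
        if labels[i] == labels[j]:
            continue  # mu(l_i, l_j) = 0 for equal labels
        d2 = np.sum((xyz[i] - xyz[j]) ** 2)
        di2 = (intensity[i] - intensity[j]) ** 2
        k_bilateral = w_b * np.exp(-d2 / (2 * th_a**2) - di2 / (2 * th_b**2))
        k_spatial = w_s * np.exp(-d2 / (2 * th_g**2))
        e_smooth += k_bilateral + k_spatial
    return e_data + lam * e_smooth

# Three nearby points; label 0 is the most probable for all of them.
probs = np.array([[0.9, 0.1], [0.9, 0.1], [0.6, 0.4]])
xyz = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.2, 0.0, 0.0]])
intensity = np.zeros(3)
edges = [(0, 1), (1, 2)]
e_smooth_labeling = labeling_energy(np.array([0, 0, 0]), probs, edges, xyz, intensity)
e_noisy_labeling = labeling_energy(np.array([0, 0, 1]), probs, edges, xyz, intensity)
# The spatially consistent labeling attains the lower energy.
```

The example shows why the minimizer suppresses "salt and pepper" noise: flipping one point to a different label both raises the unary cost and activates the pairwise kernels on its incident edges.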

Segmentation of Individual Roadside Trees with Deep Metric Learning
After the tree class is determined, a label-based segmentation method is used to extract the tree points. As tree crowns are often clumped and connected, it is vital to determine the points of individual trees by correctly assigning the crown points to each trunk. Instance recognition of trees in environmentally complex urban areas is challenging due to poor-quality data. To overcome this problem, the quality of the segmented tree data is first enhanced by recovering missing regions and removing noise and outliers. We extend the method previously proposed in [59,60], seeking self-similar points and denoising them simultaneously using graph Laplacian regularization. Inspired by the algorithm of [61], we exploit an edge-preserving smoothing algorithm using local neighborhood information to recover the regions with missing data. Delineation of individual trees from the quality-corrected points is then performed.
To automatically extract the crown points of each tree, a novel end-to-end architecture is applied to individual tree segmentation, combining a structure-aware loss function with attention-based k-nearest-neighbor (KNN) aggregation. The proposed framework is summarized in Figure 6. We elaborate on the three main components of the proposed network: the submanifold convolutional network, the structure-aware loss function, and the graph convolutional network (GCN). We first generate initial embeddings for each point with the submanifold sparse convolutional network. Inspired by the work of [62], we obtain discriminative embeddings for each tree from the LiDAR points with the structure-aware loss function, which considers both geometric and embedding information. To refine the embeddings, we develop an attention-based graph convolutional neural network that automatically selects and aggregates information from neighbors. Finally, to segment individual roadside trees, we employ an improved normalized cut segmentation algorithm to cluster the refined embeddings.

Figure 6. Illustration of the whole network architecture for individual tree segmentation. N is the number of points. F is the dimension of the backbone (submanifold convolutional network) output. E is the dimension of the instance embedding.
The over-segmentation algorithm is used to cluster the instance embeddings during inference. Specifically, we directly adopt the architecture of the submanifold convolutional network (SCN) from [63] as our first component. In our experiments, we use two backbone networks: a UNet-like architecture (with smaller capacity and faster speed) and a ResNet-like architecture (with larger capacity and slower speed). In this section, we mainly describe the last two components of our method for instance segmentation of trees. In the metric learning, points within the same tree have similar embeddings while points from different trees lie apart in the embedding space. Considering that points within each tree have not only embedding features but also geometric relations, we expect the final results to be more discriminative when structural information is combined with the embedding features. Some commonly used metrics for measuring the similarity between embeddings (e.g., the cosine distance) can make both the learning process and the post-processing more difficult for various reasons. To make the embeddings sufficiently discriminative, the Euclidean distance was chosen to measure the similarity between embeddings after extensive test experiments. Given this similarity measure, we obtain discriminative embeddings for each tree with a structure-aware loss function, which consists of the following two terms:

L = (1/N) Σ_{i=1}^{N} L_i^intra + (1/(N(N−1))) Σ_{i=1}^{N} Σ_{j≠i} L_ij^inter, (12)

where N is the total number of trees in the entire scene. The first term L_i^intra aims to minimize the distance between embeddings within the same tree. As shown in Equation (13), the overall feature of a tree is described by its mean embedding:

L_i^intra = (1/n_i) Σ_{j=1}^{n_i} sd_i,j · max(ed_i,j − α, 0)², (13)
where α denotes a threshold for penalizing large embedding distances and n_i is the number of points of the i-th tree. sd_i,j measures the spatial distance between the coordinates x_i,j of the j-th point within the i-th tree and the geometric center µ_sd,i of that tree; ed_i,j measures the distance between the embedding e_i,j of the j-th point and the mean embedding µ_ed,i. Explicitly, sd_i,j and ed_i,j are given by Equations (14) and (15), respectively:

sd_i,j = ||x_i,j − µ_sd,i||₂, (14)

ed_i,j = ||e_i,j − µ_ed,i||₂. (15)
On the other hand, the second term L_ij^inter makes points from different trees discriminative. Specifically,

L_ij^inter = max(β − ||µ_ed,i − µ_ed,j||₂, 0)², (16)

where β denotes a threshold for the distance between mean embeddings. After repeated experiments, α and β were set to 0.7 and 1.5, respectively. To generate similar embeddings within the same tree and discriminative embeddings between different trees, a KNN algorithm is applied to improve the local consistency of the embeddings and to aggregate information from surrounding points for each point. Unfortunately, some wrong information introduced by KNN can be harmful to the embeddings; most obviously, a point near the edge of a tree may aggregate information from other trees. Therefore, instead of the standard KNN aggregation (a uniform average over the k neighbors), different attention weights are learned for the neighbors. The transform process is formalized as follows:

x̃_i = Σ_{m=1}^{k} α_m x_{i_m}, (17)

where the input embeddings of the point cloud are denoted by X = {x_1, ..., x_n} ⊆ R^F, x_{i_1}, ..., x_{i_k} are the k nearest neighbors of x_i according to their spatial positions, and α_m is the attention weight of each neighbor, normalized with the softmax function. Figure 7 illustrates the attention-based KNN, which is the aggregator of our proposed graph convolutional neural network and contains two steps. In step 1, for each input point, the k nearest neighbors are searched according to the spatial coordinates. Different weights are assigned to the neighbors in step 2. The output of the aggregator is the weighted average of the embeddings of the k neighbors. In general, an aggregator in the form of attention-based KNN is a natural and meaningful operation for 3D points and allows the network to learn a different importance for each neighbor. In previous research, a GCN is normally composed of two parts: the aggregator and the updator (illustrated in Figure 8). As explained above, the aggregator gathers information from neighbors using the proposed attention-based KNN.
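The intra- and inter-tree terms of the structure-aware loss described above can be sketched as follows. This is a minimal NumPy reading of the text, not the authors' implementation: the exact weighting of the spatial term in the intra-tree hinge is an assumption, while the margins α = 0.7 and β = 1.5 are the values reported in the paper:

```python
import numpy as np

def structure_aware_loss(points, embeddings, tree_ids, alpha=0.7, beta=1.5):
    """Structure-aware metric-learning loss (illustrative reconstruction).

    points     : (P, 3) coordinates; embeddings : (P, E); tree_ids : (P,) instance ids.
    alpha/beta : intra-/inter-tree margins.
    """
    ids = np.unique(tree_ids)
    mu_ed, l_intra = [], 0.0
    for t in ids:
        mask = tree_ids == t
        mu_sd = points[mask].mean(axis=0)       # geometric centre of the tree
        mu = embeddings[mask].mean(axis=0)      # mean embedding of the tree
        mu_ed.append(mu)
        sd = np.linalg.norm(points[mask] - mu_sd, axis=1)    # spatial distances
        ed = np.linalg.norm(embeddings[mask] - mu, axis=1)   # embedding distances
        # Hinge on the embedding spread, weighted by spatial structure (assumed form)
        l_intra += np.mean(sd * np.maximum(ed - alpha, 0.0) ** 2)
    l_intra /= len(ids)
    # Inter-tree term: push mean embeddings of different trees apart
    l_inter, n = 0.0, len(ids)
    for i in range(n):
        for j in range(n):
            if i != j:
                dist = np.linalg.norm(mu_ed[i] - mu_ed[j])
                l_inter += max(beta - dist, 0.0) ** 2
    if n > 1:
        l_inter /= n * (n - 1)
    return l_intra + l_inter

# Two tight clusters with well-separated embeddings incur zero loss.
pts = np.array([[0.0, 0, 0], [0.1, 0, 0], [5.0, 0, 0], [5.1, 0, 0]])
emb = np.array([[0.0, 0.0], [0.0, 0.0], [5.0, 0.0], [5.0, 0.0]])
loss = structure_aware_loss(pts, emb, np.array([0, 0, 1, 1]))  # 0.0
```

Embedding spreads below α and mean-embedding gaps above β contribute nothing, so the loss only trains the network where trees are either internally inconsistent or mutually confusable.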
To update the aggregated information by mapping the embeddings into a new feature space, a simple fully connected layer without bias is used as the updator. The operation is formalized as follows:

x'_i = (x_i ⊕ x̃_i) W, (18)

where ⊕ denotes concatenation and W ∈ R^{2F×F} is a trainable parameter of the updator. Some GCNs describe the point relations using the Laplacian matrix and its eigendecomposition, which requires a huge computational cost (complexity O(n²)). In contrast to such GCNs, the main highlight of our spatial GCN is that the attention-based KNN is used as the aggregator; in other words, the relation is described by the KNN (complexity O(n × k)), which is crucial for applying the GCN directly to the original data. Finally, the proposed network is easily trained end-to-end using the Adam optimizer with a constant learning rate of 0.001. During implementation, we first pretrain the backbone network to obtain a pretrained segmentation model and then train the whole tree segmentation network based on that model, which saves time when conducting multiple experiments.
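One aggregator-plus-updator layer can be sketched as below. The attention scores here are a simple placeholder (softmax over negative spatial distance); the paper's learned scoring function is not reproduced, and W is random for illustration:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attention_knn_gcn_layer(xyz, emb, W, k=3):
    """One GCN layer: attention-weighted KNN aggregation + linear update.

    xyz : (N, 3) spatial coordinates used for the neighbour search.
    emb : (N, F) input embeddings.
    W   : (2F, F) trainable update matrix.
    """
    n, f = emb.shape
    out = np.empty((n, f))
    for i in range(n):
        d = np.linalg.norm(xyz - xyz[i], axis=1)
        nbrs = np.argsort(d)[1:k + 1]                 # step 1: k nearest neighbours (self excluded)
        att = softmax(-d[nbrs])                       # step 2: per-neighbour attention weights
        agg = (att[:, None] * emb[nbrs]).sum(axis=0)  # weighted average of neighbour embeddings
        out[i] = np.concatenate([emb[i], agg]) @ W    # updator: x'_i = (x_i concat agg) W
    return out
```

Because the neighbourhood is fixed by spatial position while the weights depend on the query point, a point near a crown boundary can down-weight neighbours that are likely to belong to another tree, which is the motivation given in the text.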
The spatially independent trees are quickly and effectively separated, but the overlapping or adjacent trees are difficult to isolate.
Several previous studies reduced omission errors with a graph-cut algorithm, which improved accuracy but increased computational complexity. To further segment objects containing more than one tree, a supervoxel-based normalized cut segmentation method is developed. The mixed objects are first partitioned into homogeneous supervoxels of approximately equal resolution using the existing over-segmentation algorithm proposed in [64], which preserves object boundaries much better than alternatives. Then, a complete weighted graph G(V, E) is constructed from the supervoxels according to their spatial neighborhoods, where the vertices V are the supervoxel centers and the edges E connect each pair of adjacent supervoxels. The weight assigned to an edge measures the similarity between the pair of supervoxels it connects and is calculated from the geometric information associated with the supervoxels as follows:

ω_ij = exp(−(D_XY_ij/σ_XY)²) · exp(−(D_Z_ij/σ_Z)²) · exp(−(G_max_ij/σ_G)²) if D_XY_ij ≤ r_XY, and ω_ij = 0 otherwise, (19)

where D_XY_ij and D_Z_ij are the horizontal and vertical distances between supervoxels i and j, respectively; σ_XY, σ_Z and σ_G indicate the standard deviations of D_XY_ij, D_Z_ij and G_max_ij, respectively; and r_XY is a distance threshold determining the maximal valid horizontal distance between two supervoxels. G_max_ij is expressed as

G_max_ij = max(D_XY(i, treeTop), D_XY(j, treeTop)), (20)

where D_XY(i, treeTop) and D_XY(j, treeTop) denote the horizontal distances between supervoxels i, j and the top of the nearest tree. In short, the similarity between two supervoxels is measured by their distance in the horizontal plane and by their relative horizontal and vertical distributions. With this definition, we partition the complete weighted graph G into two disjoint groups A and B (A ∩ B = ∅, A ∪ B = V) with the normalized cut segmentation method, which maximizes the similarity within each group and the dissimilarity between the two groups.
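The edge weight described above can be sketched directly. The Gaussian-product form and the hard horizontal cut-off follow the text; the σ values and r_XY below are placeholders, not tuned values:

```python
import numpy as np

def edge_weight(sv_i, sv_j, tree_top, sigma_xy=1.0, sigma_z=1.0,
                sigma_g=1.0, r_xy=5.0):
    """Similarity weight between two adjacent supervoxel centres.

    sv_i, sv_j : (3,) supervoxel centre coordinates.
    tree_top   : (3,) position of the nearest detected tree top.
    """
    d_xy = np.linalg.norm(sv_i[:2] - sv_j[:2])   # horizontal distance
    d_z = abs(sv_i[2] - sv_j[2])                 # vertical distance
    if d_xy > r_xy:
        return 0.0                               # beyond the valid horizontal range
    # Larger horizontal distance to the tree top for either supervoxel
    g_max = max(np.linalg.norm(sv_i[:2] - tree_top[:2]),
                np.linalg.norm(sv_j[:2] - tree_top[:2]))
    return float(np.exp(-(d_xy / sigma_xy) ** 2)
                 * np.exp(-(d_z / sigma_z) ** 2)
                 * np.exp(-(g_max / sigma_g) ** 2))
```

Two coincident supervoxels at a tree top get the maximal weight 1, and the weight decays smoothly with horizontal offset, vertical offset, and distance from the tree top, before dropping to 0 beyond r_XY.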
According to [65], the corresponding cost function is defined as

Ncut(A, B) = cut(A, B)/assoc(A, V) + cut(A, B)/assoc(B, V), (21)

where cut(A, B) = Σ_{i∈A, j∈B} ω_ij denotes the sum of the weights on the edges connecting groups A and B, and assoc(A, V) = Σ_{i∈A, j∈V} ω_ij and assoc(B, V) = Σ_{i∈B, j∈V} ω_ij denote the total edge weights associated with groups A and B, respectively.
The division of the weighted graph G into two separate groups A and B is cast as the minimization of Ncut(A, B). Since this minimization problem is NP-hard, the proposed method relies on an approximation strategy that achieves fairly good results in terms of both solution quality and speed. The minimization of Ncut(A, B) is then obtained by solving the corresponding generalized eigenvalue problem

(D − W) y = λ D y, (22)

where W(i, j) = ω_ij, and D is a diagonal matrix whose i-th diagonal entry records the sum of the weights on the edges associated with supervoxel i. Introducing the substitution z = D^{1/2} y, Equation (22) is represented as

D^{−1/2} (D − W) D^{−1/2} z = λ z. (23)

Here, λ = 0 with z_0 = D^{1/2} 1 is the smallest solution of this eigensystem; correspondingly, y_0 = 1 is the smallest eigenvector. From the basic principle of the Rayleigh quotient, solving the minimization of Ncut(A, B) is thus converted into solving for the second-smallest eigenvector of the eigensystem, which yields the best normalized-cut bipartition.
Based on the normalized-cut principle, the overlapping objects are divided into two segments by applying a threshold to the eigenvector associated with the second-smallest eigenvalue. Since the elements of this eigenvector generally take continuous real values, a separation point is needed for the bisection; usually 0 or the median of the eigenvector elements is used. To minimize Ncut(A, B), a heuristic method [66] is used to find the optimal separation point. Because elevation information plays a significant role in aerial LiDAR data processing, we finally adopt an elevation-attention module [67], which directly applies per-point elevation information to improve the segmentation results.
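The spectral bipartition described by Equations (22) and (23) can be sketched with plain NumPy. For brevity this version thresholds the second-smallest eigenvector at zero instead of running the heuristic search for the optimal separation point:

```python
import numpy as np

def normalized_cut_bipartition(W):
    """Bipartition a weighted graph with the normalized-cut relaxation.

    W : (n, n) symmetric, non-negative affinity matrix with positive degrees.
    Solves (D - W) y = lambda D y via the symmetric form
    D^{-1/2} (D - W) D^{-1/2} z = lambda z.
    """
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # Symmetrically normalized Laplacian
    L_sym = d_inv_sqrt[:, None] * (np.diag(d) - W) * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(L_sym)   # eigenvalues in ascending order
    z = vecs[:, 1]                       # second-smallest eigenvector
    y = d_inv_sqrt * z                   # map back: y = D^{-1/2} z
    return y >= 0                        # boolean group assignment

# Two tightly connected pairs joined by weak cross edges split cleanly.
W = np.array([[0.00, 1.00, 0.01, 0.01],
              [1.00, 0.00, 0.01, 0.01],
              [0.01, 0.01, 0.00, 1.00],
              [0.01, 0.01, 1.00, 0.00]])
groups = normalized_cut_bipartition(W)  # separates {0, 1} from {2, 3}
```

In the tree-segmentation context, W would be built from the supervoxel edge weights, and the bipartition would be applied recursively until each segment contains a single tree.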

Results
A brief description of the experimental data is first given in this section. Then, we qualitatively and quantitatively analyze the performance of the proposed method.

Data Description
We assess the performance of our approach using the following two public datasets: the 2019 IEEE Geoscience and Remote Sensing Society (GRSS) Data Fusion Contest 3D point cloud classification challenge (DFC 3D) [68] and the Dayton Annotated LiDAR Earth Scan (DALES) dataset [69]. DFC 3D is an aerial LiDAR dataset collected by the IEEE GRSS, covering approximately 100 km² over parts of the southern United States and provided as ASCII text files. Considering the differences in category definitions, we investigate six predefined semantic classes: ground, tree, building, water, elevated road/bridge, and unlabeled points. The XYZ coordinates and intensity are used as the inputs in our experiments. Scenes from three different types of areas are provided on the IEEE GRSS 3D labeling website: two scenes with 10 files are employed as the training set, and the remaining scene with 6 files is used as the test set.
The DALES dataset is also composed of ALS data, acquired with a Riegl LiDAR system flown at a mean height of about 1000 m above Canadian cities. As a new large-scale ALS dataset, it spans 10 km² and eight object categories with over 0.5 billion labeled points. The dataset is randomly split into two regions, a training set and a testing set, with roughly a 70/30 percentage split.

Classification Performances
We quantitatively assess the classification performance of the proposed approach in terms of the following evaluation metrics [70]: precision, recall, F1 score, overall accuracy (OA), and average F1 score (AvgF1). The first three metrics are applied to assess the performance on each single class, whereas overall accuracy and average F1 score evaluate the performance on the whole test set. To test the performance of the improved network based on the pointwise KNN search, a comparison with the original PointNet is conducted. Furthermore, to validate the feasibility of the graph-structured optimization designed for labeling refinement, a comparison between the results with and without smoothing is conducted. Tables 1 and 2 list the final classification results on the DFC 3D dataset and the DALES dataset, respectively. Specifically, we achieved a precision of 87.0%, a recall of 91.2%, and an F1 score of 89.1% for the tree class on the DFC 3D dataset, and an IoU of 94.1% for the tree category on the DALES dataset. Table 1. Performance of classification results of our model with optimization using the DFC 3D dataset. We report the precision, recall, and F1 score for each category in the first 3 columns as well as the OA and Avg F1 in the last two columns (all values are in %).
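The per-class and overall metrics above can be computed from a confusion matrix; a minimal sketch (the class layout of the matrix is illustrative):

```python
import numpy as np

def classification_metrics(conf):
    """Per-class precision/recall/F1 plus OA and AvgF1 from a confusion matrix.

    conf[i, j] counts points of true class i predicted as class j.
    """
    tp = np.diag(conf).astype(float)
    precision = tp / np.maximum(conf.sum(axis=0), 1)   # column sums: predicted counts
    recall = tp / np.maximum(conf.sum(axis=1), 1)      # row sums: reference counts
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    oa = tp.sum() / conf.sum()                         # overall accuracy
    return precision, recall, f1, oa, f1.mean()
```

For example, `classification_metrics(np.array([[8, 2], [1, 9]]))` yields an OA of 0.85 and a recall of 0.8 for the first class.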


Comparison between PointNet and Our Method
For the point-wise semantic segmentation, feature learning from ALS data is a fundamental problem that directly affects the quality of urban scene understanding. To validate the classification performance of the proposed network, we compared the results obtained by extracting more powerful local information with the KNN-optimized PointNet against those obtained by extracting global features directly with the original PointNet. The classification results of the original and optimized PointNet on the DFC 3D dataset are compared in Table 3. The proposed network largely improves the classification performance, with an increase of 10.7% in OA and a 19.9% increment in AvgF1. Additionally, the precision, recall, and F1 of the tree class show remarkable increases of 0.7%, 11.6%, and 6.1%, respectively, indicating that the proposed strategy produces higher accuracy, especially for tree object extraction. A possible explanation is that the introduction of multi-scale local information provides a better representation, especially for improving object integrity. Figure 9 presents a visualization of the classification results on the DFC 3D dataset; the optimized PointNet procedure provides a good initial result, which shows the efficacy of the proposed network in providing informative features.
Table 3. Performance of our model and PointNet on the DFC 3D dataset. We report the precision (p), recall (r), and F1 score for the tree category in the first 3 columns as well as the overall accuracy (OA) and average F1 score (Avg F1) in the last two columns (all values are in %).


Effectiveness of Labeling Smoothing Using Graph-Structured Optimization
There exists a small number of wrongly labeled points in the outputs, which can be corrected during label smoothing using contextual information. To evaluate the usefulness of the graph-based optimization, the regularization framework was tested for obtaining spatially smooth semantic labeling of UAV-LS data from the pointwise classification. Table 4 lists the initial and optimized semantic segmentation results. The overall performance does not show considerable improvement (label smoothing improves the OA by 0.2%), but the influence of the classification optimization is clear for some specific classes such as trees. Figure 10 shows a detailed visual illustration, which indicates an improvement in both smoothness and classification performance. As mentioned above, although wrongly classified points in most urban scenes are correctly relabeled, the change in the classification accuracy statistics does not seem apparent. This can be attributed to the fact that our strategy already provides global and local properties of high quality, especially through the pointwise KNN search strategy.
Table 4. Comparison of the initial and smoothed classification results using the DFC 3D dataset. We report the precision (p), recall (r), and F1 score for the tree category in the first 3 columns as well as the OA and Avg F1 in the last two columns (all values are in %).


Comparison with Other Published Methods
Additionally, we implemented several 3D semantic segmentation approaches to make a fair comparison using the two above-mentioned public datasets; this comparison reveals that the proposed method outperforms other classic methods.
After illustrating its effectiveness, we compare the proposed method with other published high-accuracy methods that have available code, using the DFC 3D dataset, including DANCE-Net [71] and GA-Conv [67]. The DANCE-Net method [71] classifies ALS data by introducing a density-aware convolution module, which uses the point-wise density to reweight the learnable weights of the convolution kernels, and by further developing a multi-scale CNN model to perform per-point semantic labeling. The GA-Conv method [67], whose strategy is similar to that of DANCE-Net [71], approximates convolution on unevenly distributed 3D point sets with a geometry-attentional network consisting of geometry-aware convolution, a dense hierarchical architecture, and an elevation-attention module to embed these three characteristics effectively; it can be trained in an end-to-end manner. Table 5 displays a comparison of the classification accuracy among the three methods using the aforementioned evaluation metrics. The proposed method, which also operates directly on point clouds, ranks second among the three strategies, with an OA value 5.1% higher than that of GA-Conv [67]. Compared with DANCE-Net [71], we achieve a lower OA but provide improved performance in classifying trees, producing higher accuracy for that class. Table 5. Quantitative comparison between our model and other methods on the DFC 3D dataset. We report the precision (p), recall (r), and F1 score for the tree category in the first 3 columns as well as the overall accuracy (OA) and average F1 score (Avg F1) in the last two columns. The boldface text indicates the best performance (all values are in %).

To further investigate the versatility of the proposed method on large-scale ALS datasets, we also obtained point-wise classification results on the DALES dataset. Following the evaluation protocol of similar large-scale LiDAR point cloud benchmarks, we use the mean IoU and the OA as our main evaluation metrics. The per-class IoU is defined in Equation (25); the mean IoU is simply the mean across all eight categories, excluding the unknown category, as in Equation (26); and the OA is calculated as in Equation (27). For further evaluation, the proposed method was compared to previously published methods (we selected only algorithms with published results and available code, including PointNet++ [30], ShellNet [72], and Superpoint Graphs [42]). Quantitative comparison results on the DALES dataset are listed in Table 6, showing that the proposed network achieves better classification performance in terms of OA and mean IoU than the other models. Specifically, the proposed model obtains state-of-the-art extraction performance for trees. Its notably strong performance on trees, with an IoU of 94.1% (over 2% higher than the other networks), is likely due to a key difference from the other methods: we do not rely on selecting a fixed number of points within a search radius. This batch-selection scheme makes it possible to choose a neighborhood wide enough to gather adequate information while retaining enough points to identify small objects.
OA = Σ_{k=1}^{N} c_{kk} / Σ_{j=1}^{N} Σ_{k=1}^{N} c_{jk} (27) where c_{jk} is the number of points of class j predicted as class k and N is the number of classes.
Figures 11a and 12a show the two selected scenes, colored by the elevation of each point. After the ground points were removed by the IPTD filtering algorithm, the road facilities were semantically recognized from the non-ground points. Figures 11b and 12b show the object recognition results in different colors, where gray, red, orange, green, and blue points represent ground, buildings, water, trees, and other objects, respectively. Figures 11c and 12c show the roadside tree extraction results, where trees are drawn in green and ground in gray. Figures 11d and 12d show the roadside tree instance segmentation results, with each instance in a different color. As in Figures 11 and 12, Figures 13a-d and 14a-d show the visualized results of each step of our method on the DALES dataset. More details of the individual roadside tree segmentation results are shown in Figure 15: Figure 15a shows the segmentation results for small trees, and Figure 15b shows those for incomplete trees. The trees are well detected as long as they do not overlap severely, implying that the proposed method performs well in instance recognition of roadside trees, even for small and incomplete objects.
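The semantic evaluation metrics of Equations (25)-(27) can be computed directly from a confusion matrix. The following is a minimal NumPy sketch of that computation, not the authors' evaluation code; handling of the excluded unknown category is left to the caller:

```python
import numpy as np

def semantic_metrics(conf):
    """Per-class IoU, mean IoU, and OA from an N x N confusion matrix.

    conf[j, k] = number of points of ground-truth class j predicted as class k.
    Illustrative sketch following Equations (25)-(27).
    """
    conf = np.asarray(conf, dtype=float)
    tp = np.diag(conf)                 # correctly labeled points per class
    fp = conf.sum(axis=0) - tp         # predicted as class k but belonging elsewhere
    fn = conf.sum(axis=1) - tp         # class k points predicted as something else
    iou = tp / (tp + fp + fn)          # Equation (25), per-class IoU
    mean_iou = iou.mean()              # Equation (26), mean over the classes given
    oa = tp.sum() / conf.sum()         # Equation (27), overall accuracy
    return iou, mean_iou, oa
```

For the DALES protocol, the mean would be taken over the eight labeled categories after dropping the unknown class from the matrix.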

Evaluation of the Proposed Method
We ran the tree segmentation algorithm in Python and compared the results with the reference trees. In this study, the performance of the proposed individual tree segmentation method on the two ALS datasets is evaluated by the following metrics [6]: segmentation accuracy (AC), omission error (OM), and commission error (COM). AC is the rate of trees correctly detected, OM is the rate of undetected trees, and COM is the rate of falsely detected trees.
AC = de/ref (28)
OM = ude/ref (29)
COM = fde/ref (30)
where de is the number of trees correctly segmented, ude is the number of unsegmented trees, fde is the number of trees falsely segmented, and ref is the number of reference trees.
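These three ratios follow directly from the counts defined above (de, ude, fde, ref). A minimal sketch of the computation, not the authors' code:

```python
def tree_segmentation_metrics(de, ude, fde, ref):
    """Segmentation accuracy, omission error, and commission error.

    de  : number of trees correctly segmented
    ude : number of unsegmented (omitted) trees
    fde : number of trees falsely segmented
    ref : number of reference trees
    """
    ac = de / ref    # segmentation accuracy
    om = ude / ref   # omission error
    com = fde / ref  # commission error
    return ac, om, com
```

Note that AC and OM partition the reference trees (AC + OM = 1 when de + ude = ref), while COM counts false detections relative to the same reference total.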

Table 7 shows the segmentation accuracy, omission error, and commission error of individual tree segmentation on the two datasets. Our method achieves good results in segmenting roadside trees, with an average AC, OM, and COM of 86.8%, 13.2%, and 9.5%, respectively, over the two datasets. The three metrics decline only slightly even as scene complexity increases sharply.


Comparative Studies
To evaluate the effectiveness of the tree instance segmentation, we designed a group of experiments comparing it with three other methods, Li's method [73], ForestMetrics [74], and treeseg [75], in terms of segmentation accuracy, omission error, and commission error for recognizing roadside trees, as listed in Table 8. We applied the same data to evaluate the proposed method and the other methods. Li et al. [73] adopted a top-to-bottom region-growing method for tree segmentation in coniferous forests; however, its performance is not ideal when applied to urban roadside trees. ForestMetrics [74] mainly detects trunks and delineates individual trees from ALS data with a new bottom-up algorithm that is well suited to trees with structurally complex crowns. Although ForestMetrics achieved good tree segmentation performance, with AC, OM, and COM values of 85.9%, 14.1%, and 11.8%, respectively, it fails to handle extended and irregular tree shapes, especially when the canopy border is not correctly delineated. The data-driven approach treeseg [75] utilizes generic point cloud processing techniques, including Euclidean clustering, principal component analysis, region-based segmentation, shape fitting, and connectivity testing. This open-source approach automates the tree segmentation task using few a priori assumptions about tree architecture, but achieved the worst segmentation accuracy, with AC, OM, and COM values of 80.9%, 19.1%, and 13.2%, respectively.
The proposed method employs an attention-based GCN, which automatically chooses and aggregates information from neighbors, and a structure-aware loss function for tree segmentation to improve the distinctiveness of the geometric and embedding information of individual trees. It also develops a novel and effective supervoxel-based normalized cut segmentation method, improving segmentation performance for incomplete and small trees. Thus, our tree segmentation accuracy is better than that of Li's method [73], ForestMetrics [74], and treeseg [75].
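The core idea of attention-based neighbor aggregation can be illustrated in a few lines. The following NumPy sketch is only a toy illustration of attention-weighted aggregation over a point's graph neighborhood; the function, attention parameterization, and names are our own simplifications and do not reproduce the actual network, its structure-aware loss, or the normalized cut step:

```python
import numpy as np

def attention_aggregate(feats, neighbors, w_att):
    """Attention-weighted aggregation of neighbor embeddings.

    feats     : (n, d) array of per-point embeddings
    neighbors : list where neighbors[i] is an index array of points adjacent to i
    w_att     : (2*d,) shared attention vector (learned in a real network)
    """
    n, d = feats.shape
    out = np.zeros_like(feats)
    for i in range(n):
        nbr = neighbors[i]
        # score each neighbor j by the attention vector applied to the pair (i, j)
        pair = np.concatenate(
            [np.repeat(feats[i][None, :], len(nbr), axis=0), feats[nbr]], axis=1)
        scores = pair @ w_att
        alpha = np.exp(scores - scores.max())
        alpha /= alpha.sum()              # softmax over the neighborhood
        out[i] = alpha @ feats[nbr]       # weighted sum of neighbor features
    return out
```

With a zero attention vector the weights reduce to a uniform average of the neighborhood; training the attention vector lets the network emphasize informative neighbors, which is the behavior the aggregation step relies on.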

Conclusions
To address the complicated problem of classifying large-scale scenes and segmenting individual roadside trees, we proposed a complete workflow for airborne LiDAR data in environmentally complex urban areas, including (1) an improved progressive TIN densification filtering algorithm to remove ground points, (2) a deep learning framework that integrates a point feature learning network and a local feature learning network for the efficient semantic parsing of large-scale UAV-LS data, (3) a graph-structured optimization model to ensure the consistency of point-wise label predictions, and (4) a simple yet novel method employing graph embedding learning with a structure-aware loss function and supervoxel-based normalized cut segmentation to isolate individual roadside trees. Our approach was evaluated on two publicly accessible ALS datasets, yielding satisfactory detection and segmentation of tree points from connected and clumped objects.
The experimental results demonstrate that our method provides a powerful solution for segmenting individual trees from urban UAV-LS data in terms of accuracy and correctness. It performed better than several classic 3D semantic segmentation methods and individual tree segmentation methods in detection and segmentation accuracy. The proposed approach utilizes only the 3D coordinates and intensity of the point clouds, requires no supplementary information, and is robust in detecting and segmenting roadside trees in overlapping regions. Tree instance segmentation from UAV-LS data also lays a good foundation for accurately calculating tree structure metrics and classifying urban tree species, providing a good database for environmental impact assessment, biomass estimation, and tree risk management. In the future, we will test the proposed method on more large-scale road environments to build a complete roadside tree database for the planning and management of urban forests.