Road-Side Individual Tree Segmentation from Urban MLS Point Clouds Using Metric Learning

Abstract: As one of the most important components of urban space, road-side trees require an up-to-date inventory; an outdated one may misguide managers in the assessment and upgrade of urban environments, potentially affecting urban road quality. Therefore, automatic and accurate instance segmentation of road-side trees from urban point clouds is an important task in urban ecology research. However, previous works show under- or over-segmentation effects for road-side trees due to overlapping, irregular shapes, and incompleteness. In this paper, a deep learning framework that combines semantic and instance segmentation is proposed to extract single road-side trees from vehicle-mounted mobile laser scanning (MLS) point clouds. In the semantic segmentation stage, the ground points are filtered to reduce the processing time. Subsequently, a graph-based semantic segmentation network is developed to segment road-side tree points from the raw MLS point clouds. In the individual tree segmentation stage, a novel joint instance and semantic segmentation network is adopted to detect instance-level road-side trees. Two complex Chinese urban point cloud scenes are used to evaluate the individual urban tree segmentation performance of the proposed method. The proposed method accurately extracts approximately 90% of the road-side trees and achieves better segmentation results than existing published methods on both urban MLS point clouds. Living Vegetation Volume (LVV) calculation can benefit from individual tree segmentation. The proposed method provides a promising solution for ecological construction based on the LVV calculation of urban roads.


Introduction
Statistics show that more than half of the world's population lives in cities, and by the middle of the 21st century, this proportion is expected to rise to 70% [1]. More seriously, cities, which account for less than 3% of the earth's surface, consume more than 75% of natural resources. Vegetation has the function of eliminating harmful pollutants, reducing noise, regulating temperature, protecting water sources, and providing various renewable energy sources [2][3][4]. If vegetation solutions can be reasonably incorporated into urban planning, methods similar to urban tree inventory are bound to overcome a series of existing challenges. To obtain comprehensive and accurate urban tree information, various emerging technologies, such as photogrammetry and remote sensing, have gradually replaced traditional manual measurement methods [5][6][7]. In particular, the rapidly developing LiDAR technology offers a promising method for capturing urban point clouds, demonstrating its value in large-scale mapping scenes [8].
Unlike images, which are limited by low resolution, weather sensitivity, and poor penetration, LiDAR offers high precision, high resolution, and flexibility [9][10][11]. Most importantly, LiDAR point clouds can reflect the detailed three-dimensional spatial distribution of trees at the individual level, which provides a new perspective for tree inventory. Urban tree inventory requires not only accurate spatial information but also individual tree parameters [12]. For management purposes, the timely update of spatial information, such as the distribution of trees and the location of each individual tree, helps maintain reliable monitoring of urban trees. The separation of woody parts and leaves provides a basis for the calculation of individual tree parameters such as species classification, leaf area index (LAI) estimation, crown volume estimation, and diameter at breast height (DBH) estimation [13][14][15][16]. Therefore, instance segmentation of urban trees and the separation of wood and leaf points for individual trees are indispensable components [17]. However, the acquired point clouds are unorganized and huge, which makes these two tasks technically challenging.
The above-mentioned research and results inspired us to use a deep learning network to extract single trees from the point cloud and further distinguish wood and leaf points. Therefore, in this article, we propose a new type of network applied to complex urban scenes to give full play to the potential and advantages of the combination of semantic segmentation and instance segmentation [18], making at least the following two contributions:
• A novel individual tree segmentation framework that combines semantic and instance segmentation networks is designed to separate instance-level road-side trees from point clouds.
• Extensive experiments on two mobile laser scanning (MLS) and one airborne laser scanning (ALS) point clouds have been carried out to demonstrate the effectiveness and generalization of the proposed tree segmentation method for urban scenes.

Related Work
First, we briefly survey point cloud semantic and instance segmentation, which inspires our study. Next, we present a review of recent progress regarding individual tree segmentation from point clouds.

Point Cloud Semantic Segmentation
Point cloud semantic segmentation is a practical solution for interpreting information of the 3D scene from point clouds, which aims to annotate each point in a given point cloud with a label of semantic meaning [19]. Previous works solve the problem of point cloud semantic segmentation by applying supervised classification models based on handcrafted features [20][21][22]. The performance of these methods usually depends on two very important factors: distinctive hand-crafted features (i.e., geometry features, waveform-based features, topology features, and contextual features from spatial distributions) and discriminative classifiers (i.e., support vector machines, random forest, Hough forest, and Markov random fields) [23][24][25][26]. However, the calculation of effective handcrafted features requires a large amount of prior knowledge, which limits the ability to learn good features of the scanned objects [27].
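As an illustration of the handcrafted geometry features mentioned above, the classic eigenvalue-based descriptors (linearity, planarity, scattering) that are commonly fed to classifiers such as random forests can be sketched as follows. The exact feature set and neighborhood sizes used in the cited works vary; this is a generic sketch, not their implementation.

```python
import numpy as np

def eigen_features(neighborhood: np.ndarray) -> dict:
    """Classic eigenvalue-based geometry descriptors for one point's
    k-neighborhood, given as a (k, 3) array of XYZ coordinates."""
    cov = np.cov(neighborhood.T)
    # Eigenvalues sorted in descending order: l1 >= l2 >= l3 >= 0.
    l1, l2, l3 = np.sort(np.linalg.eigvalsh(cov))[::-1]
    return {
        "linearity": (l1 - l2) / l1,   # high for trunks, poles, wires
        "planarity": (l2 - l3) / l1,   # high for facades and ground
        "scattering": l3 / l1,         # high for foliage / volumetric clutter
    }

# A synthetic "trunk": points spread along the vertical axis only.
rng = np.random.default_rng(0)
trunk = np.column_stack([rng.normal(0, 0.01, 200),
                         rng.normal(0, 0.01, 200),
                         rng.uniform(0, 2.0, 200)])
feats = eigen_features(trunk)
```

For a trunk-like neighborhood, linearity approaches 1 while scattering stays near 0, which is exactly the kind of separable signal a random forest exploits.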
To mitigate burdens in feature design, deep learning for point cloud semantic segmentation has drawn increasingly considerable attention because it provides an end-to-end solution [28][29][30]. Therefore, deep-learning-based methods have become the dominant technologies in the point cloud semantic segmentation task. As discussed in [31], there are four main paradigms of neural networks for point cloud semantic segmentation: projection-based methods [32][33][34] that usually project a 3D point cloud into 2D images, discretization-based methods [35][36][37] that usually transform a point cloud into a discrete representation, point-based networks [38][39][40] that directly work on the irregular point cloud, and graph-based methods [41][42][43] that construct an adjacency graph to model the contextual relationships among points.

Individual Tree Segmentation
In the past decade, various clustering methods have been applied, to varying degrees, to obtain better segmentation results. Chen et al. [56] compared four different single tree extraction methods (Euclidean distance clustering, region growing, normalized cut, and supervoxel segmentation) and found that applying the N-cut method after Euclidean distance clustering can obtain better segmentation results. Furthermore, various automated methods for extracting individual trees from point clouds have been proposed, which can be roughly divided into geometry-based unsupervised methods [57][58][59] and supervised methods based on semantic annotation [60][61][62]. These methods can process some trees with simple structures through tedious and labor-intensive parameter tuning, but they still lack generalization ability for the instance-level separation of trees with different shapes and canopy structures.
Benefiting from the advances in neural network architectures and their great potential in improving the generality and accuracy of point cloud segmentation, several works have successfully achieved individual tree segmentation from point clouds using deep learning methods [63]. Following an idea similar to instance segmentation, Wang et al. [64] propose a point-wise semantic learning network to acquire both local and global information. By avoiding information loss and reducing useless convolutional computations, it is an effective approach for individual tree segmentation from ALS point clouds. To automatically extract urban trees from large-scale point clouds, PointNLM [65] incorporates supervoxel-based and point-wise methods for capturing long-range relationships. Simultaneously, a fusion layer with a neighborhood max-pooling method is developed to concatenate the multi-level features for separating the road-side trees. For tree detection, Luo et al. [66] design a top-down slice module, which can effectively mine the vertical structure information in a top-down way. To detect trees more accurately, Luo et al. [66] also add a multi-branch network for providing discriminative information by fusing multichannel features. To extract individual urban trees from MLS point clouds, Luo et al. [67] develop a novel top-down framework, called DAE_Net, based on a semantic and instance segmentation network. After that, the boundaries of instance-level trees are enhanced by predicting the direction vector for isolated tree clusters.

Methodology
As shown in Figure 1, the proposed framework consists of two stages: road-side tree extraction by semantic segmentation and individual tree separation by instance segmentation.
Figure 1. Pipeline of the proposed framework. The tree extraction stage divides the input point cloud into tree and non-tree points. The individual tree separation module takes the tree points as input data and obtains the individual road-side trees.

Tree Point Extraction
Generally, an MLS system has a relatively direct scanning view of the ground. Therefore, the collected 3D points contain massive numbers of ground points, which undoubtedly increase algorithm complexity. A fast and effective preprocessing method is adopted to separate ground points from the original point clouds to reduce the data search range of point cloud processing. In addition, the ground points are projected and resampled to obtain the digital elevation model of the corresponding region, which is of great significance for the subsequent calculation of tree height features.
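The projection-and-resampling step can be sketched as a simple grid rasterization of the ground points. The cell size and the minimum-elevation rule here are illustrative choices, not the paper's exact parameters.

```python
import numpy as np

def rasterize_dem(ground: np.ndarray, cell: float = 1.0) -> np.ndarray:
    """Project ground points (N, 3) onto an XY grid and keep the minimum
    elevation per cell as a simple DEM. Empty cells stay NaN."""
    xy = ground[:, :2]
    origin = xy.min(axis=0)
    idx = np.floor((xy - origin) / cell).astype(int)
    shape = idx.max(axis=0) + 1
    dem = np.full(shape, np.nan)
    for (i, j), z in zip(idx, ground[:, 2]):
        if np.isnan(dem[i, j]) or z < dem[i, j]:
            dem[i, j] = z
    return dem

pts = np.array([[0.2, 0.3, 10.0], [0.8, 0.1, 9.5], [1.5, 0.4, 11.0]])
dem = rasterize_dem(pts, cell=1.0)
```

A tree height feature then follows directly: subtract the DEM elevation of a point's cell from the point's own elevation.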
There are often inevitable ups and downs on the ground in an entire urban scene [68]. The filtering effect of ground points is not ideal considering the influence of the accuracy of the initial triangulated irregular network (TIN). An improved progressive TIN densification filtering method is introduced to remove ground points. The undetermined seed points are first selected by the extended local minimum for grids containing points. Then, the elevations of the grids without points are interpolated using the nearest-neighbor method. The final seed points are determined by judging the elevation difference (ED) in the local neighborhood of the thin-plate spline interpolation against a threshold. Finally, the initial TIN is constructed and iteratively densified to extract the ground points [69].
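A minimal sketch of the grid-wise seed selection with an elevation-difference check follows. The cell size and ED threshold are hypothetical, and the neighbor test is a simplification: the paper uses thin-plate spline interpolation in the local neighborhood, whereas here a candidate is merely compared against the lowest points of adjacent cells.

```python
import numpy as np

def select_seed_points(points, cell=5.0, ed_thresh=0.5):
    """Pick the lowest point per grid cell as a candidate seed, then
    reject candidates whose elevation differs too much from every
    neighboring cell's minimum (a crude stand-in for the ED check)."""
    origin = points[:, :2].min(axis=0)
    keys = np.floor((points[:, :2] - origin) / cell).astype(int)
    cells = {}
    for k, p in zip(map(tuple, keys), points):
        if k not in cells or p[2] < cells[k][2]:
            cells[k] = p  # lowest point per occupied cell
    seeds = []
    for (i, j), p in cells.items():
        neigh = [cells[(i + di, j + dj)][2]
                 for di in (-1, 0, 1) for dj in (-1, 0, 1)
                 if (di, dj) != (0, 0) and (i + di, j + dj) in cells]
        # keep a candidate close in elevation to at least one neighbor;
        # isolated deep outliers (e.g., scanner noise below ground) fail this
        if not neigh or min(abs(p[2] - z) for z in neigh) < ed_thresh:
            seeds.append(p)
    return np.array(seeds)

pts = np.array([[1.0, 1.0, 0.0], [6.0, 1.0, 0.2],
                [11.0, 1.0, -5.0], [11.0, 2.0, 0.1]])
seeds = select_seed_points(pts, cell=5.0, ed_thresh=0.5)
```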
The complexity of an urban scene is the main obstacle to semantically extracting tree points, which usually contain many categories of objects as well as overlapping or closely positioned objects [20]. This study proposes a graph convolution network (GCN) that integrates a lightweight representation learning module and a deep context-aware sequential module with embedded residual learning to classify urban scenes into tree and non-tree point clouds.
Unstructured point clouds are first divided into geometrically homogeneous parts to ameliorate non-uniform distribution and reduce computational complexity. The geometric grouping algorithm is adopted from [41], which directly consumes original LiDAR point clouds and generates clusters with approximately equal resolution. A sparse auto-encoder is employed to compress and encode high-dimensional information as an embedding that represents the geometric attributes of every patch. Moreover, the spatial position of the geometric clusters is concatenated into the final descriptor to increase spatial relationships. To promote the formation of associated areas from the geometric patches generated from the non-ground points, an adjacency graph G = {V, E} is constructed to model neighboring relationships among the patches. The centers of the precomputed patches act as the nodes V = {v_i} in G, and edges E = {e_ij} are established between each pair of adjacent patches to allow the network to be relatively robust in handling varying point densities. Specifically, variable adjacency is adopted instead of a fixed-size neighborhood.
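The patch adjacency graph can be sketched with a brute-force nearest-neighbor linkage over patch centers. The fixed k below stands in for the variable adjacency described above, and the distance computation is deliberately naive for clarity.

```python
import numpy as np

def build_patch_graph(centers: np.ndarray, k: int = 3):
    """Adjacency graph G = (V, E) over patch centers: each node is
    linked to its k nearest patches; edges are stored undirected."""
    d = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # a node is never its own neighbor
    edges = set()
    for i in range(len(centers)):
        for j in np.argsort(d[i])[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges

centers = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0], [10.0, 0.0, 0.0]])
edges = build_patch_graph(centers, k=1)
```

For large scenes a k-d tree (e.g., `scipy.spatial.cKDTree`) would replace the O(n²) distance matrix.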
No spatial transformer network (STN) is used to align the groups because the geometric groups are computed based on normalization. Specific feature extraction for geometric groupings is performed as follows. First, the node feature F_n is obtained using a multilayer perceptron (MLP). Then, the k neighboring points of each node are found using the k-nearest neighbor (KNN) algorithm, and the neighborhood coordinates X_c^k of each node are obtained. The spatial position information of the neighboring point feature set F_c^k, obtained through the neighborhood point subscript attributes, is further encoded. This structure encodes the 3D geometric information of nodes (the coordinates) and the connection with corresponding neighbors (the Euclidean distance X_c − X_c^k). An MLP with a 3-layer fully connected structure adjusts the weights of the four kinds of spatial position information and extracts the geometric features F_G. The convolution operation on the node feature and the neighboring point features obtains the semantic feature F_S, so that the extraction of local and context information is more detailed. Finally, the geometric features encoded by the geometric coordinates, the node association information, and the neighboring point features are weighted and summed to form the neighborhood feature set F_CSG. Edge features are determined by the filter-generating network, which dynamically produces weights for filtering edge-specific information through element-wise multiplication [70].
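The four kinds of spatial position information described above (node coordinates, neighbor coordinates, their offsets, and the Euclidean distances) can be sketched as a plain concatenation per neighborhood; the subsequent MLP weighting is omitted here.

```python
import numpy as np

def encode_neighborhood(X_c: np.ndarray, X_k: np.ndarray) -> np.ndarray:
    """Concatenate the four spatial cues for one node: its own
    coordinates (repeated), neighbor coordinates, offsets X_c - X_k,
    and per-neighbor Euclidean distances. Returns shape (K, 10)."""
    K = X_k.shape[0]
    center = np.repeat(X_c[None, :], K, axis=0)           # (K, 3)
    offset = center - X_k                                  # (K, 3)
    dist = np.linalg.norm(offset, axis=1, keepdims=True)   # (K, 1)
    return np.concatenate([center, X_k, offset, dist], axis=1)

node = np.zeros(3)
neighbors = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
enc = encode_neighborhood(node, neighbors)
```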
The last step of semantic segmentation involves group-wise labeling by employing a GCN to classify the Voronoi adjacency graph. A residual network architecture is used for semantic segmentation to accelerate convergence and prediction [71]. The input is a graph with a varying number of edges and nodes, a structure that some regular neural networks cannot cope with. Therefore, a long short-term memory network [72] with an input gate is chosen, incorporating residual connections. This technique can handle graphs of varying sizes while avoiding the vanishing gradient problem.
Among the various classes of obtained point clouds, only the tree points are used for further individual urban tree segmentation.

Individual Tree Segmentation
After the tree points are extracted, we further carry out the individual tree segmentation task. Traditionally, there are two commonly used individual tree segmentation approaches: canopy height model (CHM)-based segmentation methods [6] and cluster-based graph cut methods [73]. CHM-based segmentation methods can quickly segment tree point clouds, but the CHM transformation can result in the loss of most crucial geometric and spatial context attributes. By contrast, cluster-based graph cut methods can preserve 3D spatial context information; however, their many parameters result in high computational costs. In addition, regular clustering strategies are insensitive to boundaries, which makes fine segmentation in complex tree scenes very difficult. Recently, point cloud processing has achieved significant progress with the development of deep learning techniques [64][65][66][67], which makes it possible to extract individual trees from point clouds. To effectively extract individual trees from urban MLS point clouds, we propose a novel segmentation network that combines the semantic information (spatial context information) of the tree category and the instance information of each tree. In this section, we elaborate on the proposed individual tree segmentation method in three parts, namely density-based point convolution, associatively segmenting instances and semantics in tree point clouds, and a loss function based on metric learning. An overview of the proposed individual tree segmentation method is shown in Figure 2.

Density-Based Point Convolution (DPC)
In general, a 2D convolution kernel cannot be applied to scattered and disordered point clouds. PointNet-series networks are the earliest architectures that extract features directly from points. PointNet uses a multi-layer perceptron (MLP) with shared weights to process the point cloud by weighted summation and solves the disorder of the point cloud through a maximum pooling operation. However, the maximum pooling (MP) operation easily causes loss of local point cloud information. To improve the ability of point information extraction, the weight of the convolution operator is treated as a continuous function of the local context information relative to the reference point. For functions f(x) and g(x) of a d-dimensional vector x, convolution is defined as follows:

$$(f * g)(\mathbf{x}) = \int_{\boldsymbol{\tau} \in \mathbb{R}^d} f(\boldsymbol{\tau})\, g(\mathbf{x} + \boldsymbol{\tau})\, \mathrm{d}\boldsymbol{\tau}$$

In an image, this can be interpreted as a 2D discrete function, usually represented by a grid matrix. In a convolutional neural network, the convolution kernel acts on a fixed-size local area for a weighted-sum operation. The relative position between pixels in the image is fixed; therefore, the filter can be discretized into a weighted summation over corresponding positions in each local region.
Unlike images, point cloud data are scattered and disordered. Each point in the point cloud takes arbitrary continuous values, rather than being distributed on a fixed grid. Traditional convolutional filters used on images therefore cannot be directly utilized on point clouds. To make full use of convolutional operations on point clouds, a permutation-invariant convolutional filter, called PointCONV [74], is used to define 3D convolutions for continuous functions by

$$\mathrm{PointConv}(x, y, z) = \sum_{(\delta_x, \delta_y, \delta_z) \in E} W(\delta_x, \delta_y, \delta_z)\, F(x + \delta_x,\, y + \delta_y,\, z + \delta_z)$$

where W and F are two functions, F(x + δ_x, y + δ_y, z + δ_z) indicates the contextual feature of a point p_i (i = 1, 2, . . . , n) in the local neighborhood E, and (x, y, z) is the center position of this local region. There is a difference in density between the canopy points and the trunk points in a tree point cloud. Therefore, density information is extracted to construct the density-based point convolution, as follows:

$$\mathrm{DensityConv}(x, y, z) = \sum_{(\delta_x, \delta_y, \delta_z) \in E} S(\delta_x, \delta_y, \delta_z)\, W(\delta_x, \delta_y, \delta_z)\, F(x + \delta_x,\, y + \delta_y,\, z + \delta_z)$$

where S(δ_x, δ_y, δ_z) represents the inverse density at the local neighborhood point (δ_x, δ_y, δ_z). Because the down-sampled point clouds are non-uniformly distributed, density-based weighting is very important. The weight function W(δ_x, δ_y, δ_z) is constructed through an MLP. The inverse density function S(δ_x, δ_y, δ_z) is constructed by kernel density estimation, and a nonlinear transformation is then realized using an MLP. The density point convolution with permutation invariance is constructed through the MLP with shared weights. The density parameters of each point in the fixed neighborhood are calculated based on the kernel density estimation function, and the density parameters are transformed nonlinearly through the MLP. The appropriate density function is learned adaptively, and the final inverse density scale is calculated. Figure 3 shows the operation of density-based point convolution in local regions.
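A minimal numpy sketch of the density-weighted point convolution follows. The learned MLP weight function W is replaced by a caller-supplied function, and the inverse density S is computed by a plain Gaussian kernel density estimate over the neighborhood offsets (the paper additionally passes this density through an MLP).

```python
import numpy as np

def inverse_density(offsets: np.ndarray, bandwidth: float = 0.2) -> np.ndarray:
    """Inverse density scale S via a Gaussian KDE over the K
    neighborhood offsets (shape (K, 3)). Dense regions get small S."""
    d2 = np.sum((offsets[:, None, :] - offsets[None, :, :]) ** 2, axis=-1)
    kde = np.exp(-d2 / (2 * bandwidth ** 2)).mean(axis=1)
    return 1.0 / kde

def density_conv(offsets, features, weight_fn, bandwidth=0.2):
    """DensityConv: sum_k S(delta_k) * W(delta_k) * F(p + delta_k).
    `weight_fn` maps an offset to a (C_in, C_out) weight matrix and
    stands in for the MLP-learned weight function W."""
    S = inverse_density(offsets, bandwidth)           # (K,)
    W = np.stack([weight_fn(o) for o in offsets])     # (K, C_in, C_out)
    return np.einsum('k,kio,ki->o', S, W, features)   # (C_out,)

# Two isolated neighbors, identity weights: output is S-scaled feature sum.
offsets = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
features = np.array([[1.0, 0.0], [0.0, 1.0]])
out = density_conv(offsets, features, lambda o: np.eye(2))
```

With the two neighbors far apart, each KDE value is 0.5 (only the self-kernel contributes), so S = 2 for both and the output is twice the feature sum.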
C_in and C_out are the numbers of channels of the input and output features, and C_kin and C_kout are the numbers of channels of the input and output features corresponding to the local neighborhood. The input is the local feature

$$F_{in} = X_c \oplus X_c^k \oplus (X_c - X_c^k) \oplus \|X_c - X_c^k\| \in \mathbb{R}^{K \times C_{in}}$$

calculated by the spatial context information fusion block, which also includes point coordinate information and other feature information (color, intensity, etc.). The MLP is implemented by a 1 × 1 convolution. After the convolution, the extracted neighborhood features F_in are encoded into the output features F_out ∈ R^{C_out}, as follows:

$$F_{out} = \sum_{k=1}^{K} S_k\, W_k\, F_{in}^{(k)}$$

where S ∈ R^K represents the density scale and W ∈ R^{K × C_in × C_out} is the output weight function.

Associatively Segmenting Instances and Semantics in Tree Point Clouds
To avoid a large number of parameter adjustment processes in traditional algorithms, the semantic information and specific instance information of individual trees are learned adaptively to obtain the optimal parameters, which makes it possible to segment tree point clouds that are spatially overlapping with varying shapes and incompleteness. In this section, we map the point clouds into the high-dimensional feature space and learn the distribution characteristics of the high-dimensional feature space based on the metric learning method.
As illustrated in Figure 2, our segmentation network is composed of three parts: an initial feature extraction block, two parallel decoders, and a feature fusion block. More specifically, the initial feature extraction block is designed to construct a shared encoder by combining DPC and PointNet++ [39]. In other words, we construct our backbone network by directly duplicating an abstraction module of PointNet++. However, PointNet++ may lose detailed information due to the MP operation and has expensive GPU memory consumption during the training process. Therefore, we follow JSPNet [55] to combine the set abstraction module of PointNet++ and three feature encoding layers of our DPC sequentially to construct the shared encoder. Similarly, the two decoders share the same structure, built by concatenating three depth-wise feature decoding layers of DPC and a feature propagation layer of PointNet++. These two decoders are developed for extracting point-wise semantic features and instance embeddings, respectively. Finally, in the feature fusion block, features from different layers are fused because the high-level layers carry richer semantic information while the low-level layers carry much more detailed information, which is beneficial for better segmentation.
The input of the network is the point cloud feature matrix of size N_a × 9. We encode the point cloud features as N_e × 512 by means of weight sharing. Next, the high-dimensional feature matrix is fed into the parallel decoders. In the semantic feature decoding branch, we fuse the features of different levels to form an N_a × 128 high-dimensional semantic feature matrix F_SS through skip connections. In the instance feature decoding branch, we output the instance feature matrix F_IS by skip-connecting the pre-enhanced and post-enhanced features. Finally, we integrate the semantic features and instance features through the semantic and instance information fusion module. As shown in Figure 2, the output final feature matrix F_ISS is used to distinguish individual trees. The shape of F_ISS is N_a × K, where K is the dimension of the embedding vector. We predict the instance label of each tree. Based on metric learning, we learn the distribution of features in the high-dimensional embedding space: features belonging to the same instance object are drawn closer together, while features of different instance objects are pushed apart.
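The pull/push behavior described above resembles the discriminative loss family used in embedding-based instance segmentation. A numpy sketch follows; the margin values delta_pull and delta_push are hypothetical, and the actual paper trains this objective end-to-end with gradients rather than evaluating it on fixed arrays.

```python
import numpy as np

def discriminative_loss(emb, labels, delta_pull=0.5, delta_push=1.5):
    """Metric-learning objective over instance embeddings (N, D):
    pull points toward their own instance centroid, push different
    instance centroids apart beyond a margin."""
    ids = np.unique(labels)
    centroids = np.stack([emb[labels == i].mean(axis=0) for i in ids])
    # pull (variance) term: hinge on distance to own centroid
    pull = 0.0
    for c, i in zip(centroids, ids):
        d = np.linalg.norm(emb[labels == i] - c, axis=1)
        pull += np.mean(np.maximum(d - delta_pull, 0) ** 2)
    pull /= len(ids)
    # push (distance) term: hinge on pairwise centroid distances
    push = 0.0
    if len(ids) > 1:
        for a in range(len(ids)):
            for b in range(a + 1, len(ids)):
                d = np.linalg.norm(centroids[a] - centroids[b])
                push += np.maximum(2 * delta_push - d, 0) ** 2
        push /= len(ids) * (len(ids) - 1) / 2
    return pull + push

emb = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 0.0], [10.1, 0.0]])
labels = np.array([0, 0, 1, 1])
loss = discriminative_loss(emb, labels)  # tight, well-separated clusters
```

Well-separated tight clusters incur zero loss, while overlapping instances are penalized, which is exactly what makes the embedding space clusterable into individual trees.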
As shown in Figure 4, K-nearest neighbor search is adopted to find a fixed number of adjacent points for each point in the high-dimensional instance embedding space. We use k nearest neighbor search to generate the index matrix of the shape N a × K. According to the generated index matrix, we use the context information fusion module to generate the local neighborhood feature matrix of the instance space. In the semantic space, the feature tensor with the shape of N a × K × N F is generated according to the index matrix, and each group corresponds to the local region near a centroid in the instance embedding space. Through Equation (5), we equalize the local examples and semantic features to each dimensional feature to enhance the semantics and examples of centroid refinement.
where {x i1 , . . . , x iK } represents the semantic and instance fusion features corresponding to K adjacent points centered on point i in the instance embedding space.
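Equation (5) is not reproduced in this text; consistent with the description above (averaging the K neighboring fusion features of each centroid), it can be written as the neighborhood mean, where the output symbol x̃ i is our notation:

```latex
\tilde{x}_{i} = \frac{1}{K}\sum_{k=1}^{K} x_{ik} \tag{5}
```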
In the enhanced high-dimensional semantic space, we construct a local neighborhood graph through k-nearest neighbors and use the graph attention mechanism to select more representative semantic features to enrich the instance features. The local neighborhood feature of each node, F = { f 1 , f 2 , . . . , f k } with F ∈ R k×m, is input into the graph attention module, where m is the dimension of the feature and k is the number of nodes. First, we encode the input context feature matrix through the shared weight matrix W ∈ R m×m. Then, we normalize the encoded features through the Softmax activation function to obtain the self-attention coefficient corresponding to the feature matrix of each node F, as shown in Equation (7), where e ij represents the influence of each neighborhood point feature on the node feature. The attention matrix is generated as a ij, and the activation function needs to be applied before obtaining the nodes in the next layer. The final selection of more representative semantic enhancement information F' = { f' 1 , f' 2 , . . . , f' k }, with F' ∈ R k×m, is shown in Equation (8).
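Equations (7) and (8) are not reproduced in this text; a graph-attention form consistent with the description (shared weight matrix W, Softmax normalization, neighborhood aggregation) is the following, where the compatibility function φ (e.g., a dot product) and the activation σ are our assumptions:

```latex
e_{ij} = \phi\!\left(W f_i,\, W f_j\right), \qquad
a_{ij} = \frac{\exp\!\left(e_{ij}\right)}{\sum_{k} \exp\!\left(e_{ik}\right)} \tag{7}

f'_i = \sigma\!\Big(\sum_{j} a_{ij}\, W f_j\Big) \tag{8}
```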
We combine the more representative semantic enhancement information with the high-dimensional instance features to form the final high-dimensional instance discrimination matrix, in which semantic and instance information mutually reinforce each other.

Loss Function Based on Metric Learning
The loss function is the discriminative loss function used in metric learning, as shown below.
where L pull pulls embeddings close to the mean embedding of each instance, while L push separates the mean embeddings of different instances from each other. L reg is the regularization term (Equation (10)), which keeps the centers of the instances close to the origin and keeps the gradient always active.
where µ i is the average embedding of tree instance i. For individual tree segmentation, L pull makes points on the same tree close to its center in the high-dimensional instance space, as defined in Equation (11), where δ v is the penalty margin for the center point of each instance: when the distance between a point on a single tree and its center point is less than δ v , no penalty is imposed. In addition, [x] + = max(0, x); ∥·∥ 1 is the L 1 norm; M is the number of road-side trees; N m refers to the number of points in instance m; and E n represents the embedding of points in the tree instance. As shown in Equation (12), L push keeps the points of different trees away from each other: when the distance between the centers of two tree instances exceeds 2δ d , no penalty is imposed, so that instances can be freely distributed in space.
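The equations themselves are not reproduced in this text. Consistent with the symbols defined above (M trees, N m points per instance, embeddings E n, mean embeddings µ m, margins δ v and 2δ d), the standard discriminative loss of metric-learning-based instance segmentation takes the following form; the weighting α on the regularization term is an assumed hyperparameter:

```latex
L = L_{\mathrm{pull}} + L_{\mathrm{push}} + \alpha\, L_{\mathrm{reg}} \tag{9}

L_{\mathrm{reg}} = \frac{1}{M}\sum_{m=1}^{M} \lVert \mu_m \rVert_1 \tag{10}

L_{\mathrm{pull}} = \frac{1}{M}\sum_{m=1}^{M} \frac{1}{N_m}\sum_{n=1}^{N_m}
  \big[\, \lVert \mu_m - E_n \rVert_1 - \delta_v \,\big]_+^2 \tag{11}

L_{\mathrm{push}} = \frac{1}{M(M-1)}\sum_{i=1}^{M}\sum_{\substack{j=1 \\ j \neq i}}^{M}
  \big[\, 2\delta_d - \lVert \mu_i - \mu_j \rVert_1 \,\big]_+^2 \tag{12}
```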
During the testing, the final instance labels are obtained by using a simple mean-shift clustering [75] on the high-dimensional embedded feature space.
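As an illustration of this clustering step, a minimal flat-kernel mean-shift over a toy 2-D embedding space can be sketched as follows; the function name, bandwidth, and merge tolerance are illustrative choices, not the paper's settings:

```python
import numpy as np

def mean_shift(X, bandwidth=1.0, n_iter=50, merge_tol=0.5):
    """Flat-kernel mean-shift: shift each point's mode to the mean of its
    bandwidth neighborhood, then merge converged modes into clusters."""
    modes = X.copy()
    for _ in range(n_iter):
        for i in range(len(modes)):
            d = np.linalg.norm(X - modes[i], axis=1)
            modes[i] = X[d < bandwidth].mean(axis=0)
    # merge modes closer than merge_tol into shared cluster centers
    centers, labels = [], np.empty(len(X), dtype=int)
    for i, m in enumerate(modes):
        for j, c in enumerate(centers):
            if np.linalg.norm(m - c) < merge_tol:
                labels[i] = j
                break
        else:
            centers.append(m)
            labels[i] = len(centers) - 1
    return np.array(centers), labels

# two well-separated groups of embeddings (toy stand-in for two trees)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (30, 2)), rng.normal(5, 0.1, (30, 2))])
centers, labels = mean_shift(X, bandwidth=1.0)
```

In practice, a library implementation such as scikit-learn's MeanShift would typically be used instead of this hand-rolled sketch.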

Estimation of Living Vegetation Volume
Living Vegetation Volume (LVV) calculation is an important task in urban ecology because it can objectively and accurately describe urban greenery quality and provide a reliable data foundation for quantitative studies on the mechanisms of urban greenery ecological functions. Benefiting from our high-quality instance segmentation results for road-side trees, the convex hull method is adopted to calculate the LVV of urban roads.
It is necessary to extract the canopy point cloud to calculate the LVV of road-side trees. According to the definition of the principal direction in differential geometry, the direction corresponding to the minimum curvature is adopted as the principal direction of the tree point cloud. The principal directions of leaf points are disordered, while those of branch points are largely coincident. Therefore, the normal vectors of the object points and their adjacent points are used to construct a dense tangent circle and further calculate the principal direction of the tree point cloud. After that, the tree canopy is extracted according to the axial distribution density and axial similarity of the trunk. At a given point, we evaluate the axial distribution density within a cylinder constructed around that point. The axis of the cylinder is divided into n segments, and the points inside the cylinder are projected onto the axis; the ratio of the number of segments occupied by projection points to n is the axial distribution density, with an optimal threshold of 0.8. The included angle of each point in the cylinder is then calculated, i.e., the angle between the principal direction of each point in the cylinder and that of the center point, with a best threshold of 20°. Finally, the proportion of effective points is calculated, where an effective point is one whose axial distribution density is greater than the density threshold and whose included angle is less than the angle threshold; the ratio of these points to the total number of points in the cylinder is computed, with an optimal threshold of 0.8. After many tests, the height of the constructed cylinder is set to 10 times the average point cloud density and the radius to 2 times the average density.
When the axial distribution density is greater than 0.8, the included angle is less than 20°, and the proportion of effective points is greater than 0.8, the constructed cylinder is considered the best cylinder approximation of the trunk point cloud. Then, region growing is performed on the trunk point cloud until all points are processed. Finally, the grown regions are merged to identify the trunk points, which are removed to complete the canopy extraction.
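The axial distribution density described above can be sketched as follows; the function name, the segment count n, and the toy trunk points are illustrative assumptions:

```python
import numpy as np

def axial_distribution_density(points, center, axis, height, n_segments=20):
    """Fraction of the n axial segments of a cylinder (centered at `center`,
    oriented along unit vector `axis`, of length `height`) that receive at
    least one projected point."""
    axis = axis / np.linalg.norm(axis)
    # signed projection of each point onto the cylinder axis
    t = (points - center) @ axis
    inside = np.abs(t) <= height / 2.0
    # map projections to segment indices along the axis
    seg = ((t[inside] + height / 2.0) / height * n_segments).astype(int)
    seg = np.clip(seg, 0, n_segments - 1)
    return len(np.unique(seg)) / n_segments

# toy trunk: points spread evenly along the z-axis fill every segment
trunk = np.column_stack([np.zeros(100), np.zeros(100),
                         np.linspace(-0.5, 0.5, 100)])
density = axial_distribution_density(trunk, np.zeros(3),
                                     np.array([0.0, 0.0, 1.0]), height=1.0)
```

A trunk-like point column yields a density near 1, while scattered leaf points projected onto the same axis would leave many segments empty and score much lower.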
From the perspective of dendrometry, the traditional LVV calculation takes the crown width and crown height as parameters and treats the crown as regular geometry [76]. However, most of the crown shapes are variable and have no specific regular shape, resulting in large errors. We calculate the LVV by the convex hull method and compare it with the traditional method and the platform method.
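A minimal sketch of the convex hull LVV computation, assuming the canopy points of a single tree have already been extracted (the function name is illustrative):

```python
import numpy as np
from scipy.spatial import ConvexHull

def crown_lvv(canopy_points):
    """LVV of one tree crown as the volume of the 3-D convex hull
    enclosing its canopy points (the convex hull method)."""
    return ConvexHull(canopy_points).volume

# sanity check: the hull of a unit cube's eight corners has volume 1
cube = np.array([[x, y, z] for x in (0, 1) for y in (0, 1) for z in (0, 1)],
                dtype=float)
volume = crown_lvv(cube)
```

Summing this per-tree volume over all segmented instances gives the road-level LVV, without assuming any regular crown geometry.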

Dataset Description
To check the performance of the proposed method, MLS point clouds from two different urban regions are used in the evaluation experiments. Dataset I was collected using a Riegl VMX-450 MLS system in the summer of 2020, covering approximately 6.0 km of urban roads in Shanghai, China. Dataset II was collected using a Trimble MX2 MLS system in Nanjing, China, covering approximately 8.0 km of urban roads. For training the neural network, 4.5 km of Dataset I and 6.0 km of Dataset II are manually labeled for quantitative evaluation. It is worth noting that both point cloud datasets contain many road-side trees whose distributions present different situations: several road-side trees are quite sparse, while others overlap heavily. Figure 5 shows an overview of the two urban MLS point cloud scenes.
To better present the urban MLS point cloud semantic segmentation performance of the proposed method, we quantitatively assess the semantic segmentation results in terms of two commonly used evaluation metrics [20,41]: overall accuracy (OA) and mean intersection over union (mIoU). The numerical point cloud semantic segmentation results for Dataset I and Dataset II are listed in Table 1. As can be perceived, the proposed method achieves excellent performance in semantically segmenting MLS point clouds, with average OA and mIoU of (89.1%, 63.8%) and (88.8%, 64.3%) for the two MLS point cloud datasets, respectively. From the global perspective, the OAs of Dataset I and Dataset II exceed 88%, which demonstrates the effectiveness of our semantic segmentation model. Meanwhile, the OAs and mIoUs of Dataset I and Dataset II show no evident performance differences. Moreover, the IoU of road-side trees, the most important urban objects, surpasses 87% in both Dataset I and Dataset II, achieving ideal results for tree point extraction.
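The two metrics can be computed from a class confusion matrix as follows; this is a generic sketch, and the toy 3-class matrix is illustrative rather than taken from the paper's results:

```python
import numpy as np

def oa_miou(conf):
    """Overall accuracy and mean IoU from a confusion matrix whose
    rows are ground-truth classes and columns are predicted classes."""
    conf = np.asarray(conf, dtype=float)
    oa = np.trace(conf) / conf.sum()
    tp = np.diag(conf)
    # per-class IoU: TP / (TP + FP + FN)
    iou = tp / (conf.sum(axis=0) + conf.sum(axis=1) - tp)
    return oa, iou.mean()

# toy 3-class example
conf = [[50, 2, 3],
        [4, 40, 1],
        [2, 3, 45]]
oa, miou = oa_miou(conf)
```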

Comparison with Other Published Methods
For further semantic segmentation performance evaluation, we compare our semantic segmentation model with existing published networks to obtain baseline results. These networks can be viewed as reference methods, including PointNet++ [39], TangentConv [77], MS3_DVS [78], SPGraph [41], KPConv [40], RandLA-Net [44], and MS-RRFSegNet [20]. Specifically, PointNet++ [39] is a follow-up work of PointNet [38], which is the pioneering work operating directly on irregular points. It grouped points hierarchically and progressively acquired both local and global features. TangentConv [77] is a representative model of projection-based methods for semantic segmentation of large scenes. It introduced a novel tangent convolution and operated directly on precomputed tangent planes. MS3_DVS [78] is a representative model of discretization-based methods, which proposed a multi-scale voxel network architecture to classify 3D point clouds of large scenes. SPGraph [41] is one of the first methods capable of directly processing large-scale point clouds based on an attributed directed graph, which consists of geometrically homogeneous partitioning, super-point embedding, and contextual segmentation. KPConv [40] is a flexible pointwise convolution operator for point cloud semantic segmentation, which proposed a kernel point fully convolutional network to achieve state-of-the-art performance on existing benchmarks. RandLA-Net [44] is a lightweight yet efficient point cloud semantic segmentation network, which utilized random point sampling to achieve vastly higher efficiency and captured geometric features with a local feature aggregation module. MS-RRFSegNet [20] is a multiscale regional relation feature segmentation network, which adopted a sparse auto-encoder for feature embedding representations of the homogeneous super-voxels that reorganize raw data, and semantically labeled super-voxels based on a regional relation feature reasoning module.
For fair comparison, we faithfully follow the experimental settings of each selected algorithm that has available code. In addition, the proposed model is also compared between Dataset I and Dataset II. All experiments are performed on a computer equipped with two NVIDIA GeForce RTX 3080 GPUs. Based on the same configurations, the quantitative results on Dataset I and Dataset II are also presented in Table 1. As can be perceived, TangentConv [77] has the worst performance since the orientation of the tangent plane may not be estimated well in urban road scenes with topographic relief variations. The mIoU score of the proposed method is the highest, followed by RandLA-Net [44] with a gap of approximately 2%, while KPConv [40] is slightly inferior to RandLA-Net [44] by approximately 0.4%, and SPGraph [41] achieves the fourth highest performance.
It is worth noting that the abovementioned four methods are greatly superior to others in general. There are a number of common categories such as ground, vegetation, and building that are finely segmented due to the abundance of points of these categories in the dataset. In general, while the proposed method achieves satisfying semantic segmentation results and ranks highly, the overall segmentation performances of other state-of-the-art deep-learning methods are far from satisfactory. In particular, some key elements of road infrastructures have weak performances across all of the techniques.

Tree Segmentation Results
The individual tree segmentation performance on the two urban MLS point clouds is first evaluated qualitatively. The visual samples in Figures 8 and 9 are selected from different spatial structures of complex urban environments to show the good segmentation ability of the proposed method. Figures 8a and 9a illustrate the road-side tree extraction results, where the extracted road-side tree points are overlaid on the original urban MLS point clouds. Figures 8b and 9b present individual tree segmentation outcomes, where every road-side tree is drawn in one color. Figures 8c and 9c show zoom-in views of the individual tree segmentation results at some randomly selected regions. We can see that there exist some errors in the boundary regions of segmented instance-level road-side trees; nonetheless, the method still has a high sensitivity in separating individual road-side trees.

To better show the individual tree segmentation results of the proposed method, we quantitatively evaluate the individual tree segmentation performance in terms of four commonly used instance segmentation evaluation metrics [79]: the precision (Prec), the recall (Rec), the mean coverage (mCov), and the mean weighted coverage (mWCov) (see Equations (13)-(16)). The numerical instance segmentation results for Dataset I and Dataset II are presented in Table 2. We can see that the proposed method obtains good performance in individual tree segmentation from urban MLS point clouds, with average mPrec, mRec, mCov, and mWCov of (90.27%, 89.75%, 86.39%, 88.98%) and (90.86%, 89.27%, 87.20%, 88.56%) for the two urban MLS point cloud datasets, respectively. From the global perspective, the mPrec and mRec of Dataset I and Dataset II both exceed 89%, which demonstrates the effectiveness of the proposed individual tree segmentation network.
Moreover, the mCov and mWCov of the instance segmentation of road-side trees surpass 86% and 88% in Dataset I and Dataset II, respectively, achieving significant performance for individual tree segmentation from urban MLS point clouds. Meanwhile, these four instance-level metrics show no evident performance differences between Dataset I and Dataset II.
where TP ins represents the number of segmented road-side tree instances whose IoU with the ground truth is larger than 0.5; P ins refers to the total number of predicted instances of the road-side trees, and G ins indicates the number of road-side tree instances in the ground truth.
where |I| represents the number of all road-side tree instances in the ground truth, I a indicates the a-th road-side tree instance in the ground truth collection, and P b refers to the b-th segmented road-side tree instance, with b ranging over the segmented instances in the point cloud to be processed.
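Equations (13)-(16) are not reproduced in this text; under the symbol definitions above, these four metrics take the standard instance segmentation forms:

```latex
\mathrm{mPrec} = \frac{|TP_{ins}|}{|P_{ins}|} \tag{13}

\mathrm{mRec} = \frac{|TP_{ins}|}{|G_{ins}|} \tag{14}

\mathrm{mCov} = \frac{1}{|I|}\sum_{a=1}^{|I|} \max_{b}\, \mathrm{IoU}\!\left(I_a, P_b\right) \tag{15}

\mathrm{mWCov} = \sum_{a=1}^{|I|} \frac{|I_a|}{\sum_{a'} |I_{a'}|}\, \max_{b}\, \mathrm{IoU}\!\left(I_a, P_b\right) \tag{16}
```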

Comparative Studies
To further prove the superiority of our individual tree segmentation method, we designed a number of experiments and compared it with selected popular methods, including two traditional methods (the watershed-based method [80] and the mean shift-based method [81]) and two deep learning approaches (SGE_Net [64] and DAE_Net [67]). To qualitatively present the effectiveness of the proposed method for individual tree segmentation in complex urban MLS point cloud scenes, selected examples of visual results are shown in Figure 10. Specifically, we can see that all methods obtain satisfactory results for road-side trees with consistent tree shapes in simple situations. With regard to complex position distributions, such as multiple trees distributed in a queue with serious spatial overlap, the two traditional methods [80,81] are fast and efficient but prone to omission or commission errors. By contrast, the two deep learning approaches [64,67] and our method achieve better tree segmentation results. The main reason is that the traditional methods strongly depend on the boundary spatial features between adjacent road-side trees and on a fixed shape assumption for road-side trees. For the deep learning methods, the layer structures and parameters of the designed neural networks can implicitly express the spatial interactions between tree point clouds, facilitating the feature representations for instance segmentation. Although the deep learning methods introduce additional computational complexity, this overhead is tolerable when conducting individual tree segmentation in large-scale MLS point clouds.

Furthermore, since visual presentation alone is of limited value in showing the advantages of the proposed method, we further quantitatively compare the individual tree segmentation performance of the four selected baselines and the proposed method. The numerical comparisons are also provided in Table 2 for Dataset I and Dataset II.
The quantitative comparison for individual tree segmentation among the four baselines shows that the proposed method achieves the best segmentation results on all four evaluation indicators, mPrec, mRec, mCov, and mWCov, not only for Dataset I but also for Dataset II. As can be perceived, the mPrec, mRec, mCov, and mWCov of [80] are the worst, followed by [81] with gaps of approximately 1.67%, 1.15%, 2.64%, and 2.69%, while SGE_Net [64] is superior to DAE_Net [67] by approximately 2.30%, 1.99%, 1.10%, and 1.34%. Because SGE_Net [64] uses pointwise direction embedding to distinguish the fine boundaries of individual road-side trees, it shows more obvious improvements over [80,81] and DAE_Net [67] on both urban MLS point cloud datasets, which reveals that SGE_Net [64] is good at the individual tree segmentation task in complex urban scenes. In practical application, however, there are inevitably errors in the detected tree centers. Therefore, from the comparison results in Table 2, we can see that SGE_Net [64] is slightly inferior to ours in general. For example, the proposed method outperforms SGE_Net [64] by average improvements of approximately 3.24% in mPrec, approximately 2.63% in mRec, approximately 1.59% in mCov, and more than 1.96% in mWCov. To sum up, it is evident that the proposed method has obtained a prominent improvement compared with the selected four reference baselines.


LVV Calculation Results
As can be seen from Table 3, the relative error (δ 1 ) between the adopted scheme and the traditional method is 12.6~33.7%, and the average relative error is 16.5~19.9%. The trees in real urban scenes are complex and changeable, and even trees of the same species have different crown shapes. As a result, the true canopy profile of a tree cannot be effectively expressed by regular geometry, so it is difficult to find the most suitable crown shape; in addition, the human factors involved in the visual estimation process make the error large. The relative error (δ 2 ) between the adopted scheme and the platform method is 2.7~14.6%, and the average relative error is 7.8~8.9%. The platform method does not need to consider the tree shape to calculate the LVV, which reduces the influence of human factors and improves the calculation efficiency. However, there is a large gap between the bottom layer of the platform and the bottom of the actual tree crown, resulting in a large error at the bottom layer. The calculation of the adopted scheme is based on high-precision tree point clouds, and the convex polyhedron used approximates the original shape of the tree crown, which can better express the space volume occupied by the tree stems and leaves. Therefore, the obtained LVV is more accurate and does not need to consider the tree shape, realizing automatic calculation of the LVV. To better reflect the accuracy of the LVV calculation, the correlation coefficient (R 2 ) is adopted to compare the results of manual measurements with the LVV calculated by the proposed model. The definition of this evaluation indicator is as follows: where Q denotes the number of trees; m q is the manually measured value of the LVV; m̂ q is the value of the LVV determined from the segmented tree point clouds; and m̄ q is the mean value of the manually measured LVV.
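The formula itself is not reproduced in this text; with the symbols above, the coefficient of determination takes the standard form:

```latex
R^2 = 1 - \frac{\sum_{q=1}^{Q}\left(m_q - \hat{m}_q\right)^2}
                {\sum_{q=1}^{Q}\left(m_q - \bar{m}_q\right)^2}
```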
To evaluate the accuracy of the calculated LVVs based on the segmentation results of the proposed method, the calculated results are compared with manually measured ground truths. The linear correlations between the calculated values and the manual measurements are given in Figure 11. In the comparative results, the R 2 of Dataset I and Dataset II are 0.9924 and 0.9873, respectively. The R 2 of the two point cloud datasets are close to 1, showing that the correlation for tree-level LVV is high. The two fitted lines are close to y = x, showing the high accuracy of our approach in extracting instance-level road-side trees.

Generalization Capability
To further show the generalization ability of our approach, an additional experiment is carried out on an urban ALS point cloud dataset captured in Wuhan, China. This dataset is a highly dense ALS point cloud dataset with various types of urban objects, covering approximately 3.5 km². The individual tree segmentation result is presented in Figure 12, showing that our method achieves good segmentation results on ALS point clouds. Moreover, SGE_Net [64] and DAE_Net [67] are selected as comparison methods, and the corresponding individual tree segmentation results are provided in Table 4. The proposed method outperformed these two deep learning methods, with average improvements of 5.56%, 3.58%, 4.78%, and 6.74% in terms of the mPrec, mRec, mCov, and mWCov scores, respectively.
Figure 11. The validation of the LVV calculation results for the two point cloud datasets.


Conclusions
Accurate individual tree segmentation is one of the most important eco-urban construction tasks. In this study, a novel top-down framework is developed to extract individual trees from MLS point clouds by integrating semantic and instance segmentation. Urban roads in various scenes contain large numbers of overlapping and irregularly shaped road-side trees. A semantic segmentation network is first used to semantically segment tree points from the raw point clouds. Next, an instance segmentation network is developed to isolate individual road-side trees; it consists of a shared feature encoder, two parallel feature decoders, and a feature fusion module. To improve network accuracy and efficiency, a loss function based on metric learning is adopted for training. Prec, Rec, mCov, and mWCov scores of (90.27%, 89.75%, 86.39%, and 88.98%) and (90.86%, 89.27%, 87.20%, and 88.56%) are obtained on two different MLS point cloud datasets. The achieved individual tree segmentation results are superior to those of other published methods. Individual tree segmentation results provide support for future eco-city analysis, such as calculating the LVV of urban roads. In conclusion, our work offers an effective solution to individual tree segmentation.
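The metric-learning loss mentioned above can be illustrated with a discriminative pull/push formulation in the style of De Brabandere et al.: points of the same tree are pulled toward their instance mean in the embedding space, while different instance means are pushed apart. This is a generic sketch; the paper's exact loss and the margins `d_v` and `d_d` are our assumptions:

```python
import numpy as np

def discriminative_loss(emb, labels, d_v=0.5, d_d=1.5):
    """Pull/push metric-learning loss for instance embeddings.
    emb: (N, D) per-point embeddings; labels: (N,) instance ids."""
    ids = np.unique(labels)
    means = np.stack([emb[labels == i].mean(axis=0) for i in ids])
    # Pull term: penalize points farther than d_v from their instance mean.
    pull = np.mean([np.mean(np.maximum(
        np.linalg.norm(emb[labels == i] - m, axis=1) - d_v, 0.0) ** 2)
        for i, m in zip(ids, means)])
    # Push term: penalize instance means closer than 2 * d_d to each other.
    push_terms = [max(2.0 * d_d - np.linalg.norm(means[i] - means[j]), 0.0) ** 2
                  for i in range(len(ids)) for j in range(i + 1, len(ids))]
    push = np.mean(push_terms) if push_terms else 0.0
    return pull + push

# Two tight, well-separated clusters incur zero loss.
emb = np.array([[0.0, 0.0], [0.0, 0.0], [10.0, 10.0], [10.0, 10.0]])
labels = np.array([0, 0, 1, 1])
print(discriminative_loss(emb, labels))  # 0.0
```

At inference, points can then be grouped into individual trees by clustering their embeddings, e.g., with mean-shift.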
