DGPolarNet: Dynamic Graph Convolution Network for LiDAR Point Cloud Semantic Segmentation on Polar BEV

: Semantic segmentation in LiDAR point clouds has become an important research topic for autonomous driving systems. This paper proposes a dynamic graph convolution neural network for LiDAR point cloud semantic segmentation using a polar bird’s-eye view, referred to as DGPolarNet. LiDAR point clouds are converted to polar coordinates, which are rasterized into regular grids. The points mapped onto each grid distribute evenly to solve the problem of the sparse distribution and uneven density of LiDAR point clouds. In DGPolarNet, a dynamic feature extraction module is designed to generate edge features of perceptual points of interest sampled by the farthest point sampling and K-nearest neighbor methods. By embedding edge features with the original point cloud, local features are obtained and input into PointNet to quantize the points and predict semantic segmentation results. The system was tested on the Semantic KITTI dataset, and the segmentation accuracy reached 56.5%


Introduction
LiDAR sensors are essential devices for environmental perception tasks in smart vehicles, as they can scan millions of 3D points for each frame [1][2][3]. In recent years, LiDARbased semantic segmentation technology has achieved rapid development. However, LiDAR point clouds have the characteristics of irregular structure, uneven density, and sparse distribution, which are challenging problems for deep learning approaches.
Three-dimensional (3D) segmentation methods [4,5] based on machine learning, such as support vector machine (SVM), random forest, naïve Bayesian supervised learning, etc., usually utilize the geometrical or distribution features of point clouds to train models. The feature extraction [6,7] process of large-scale LiDAR point clouds is computationally intensive, which limits machine learning approaches for outdoor environment perception tasks. Meanwhile, because LiDAR point cloud density is high in the near field and loose in the far field, such methods have poor adaptability and expansion capabilities [8]. From the traditional machine learning method to various deep neural networks, the semantic segmentation approaches of projected views, points, voxels, and graphs have been widely researched. The multiview projection and voxel mapping methods lead to feature information loss. Meanwhile, due to the uneven point distribution density in far and near fields, the sampled unstable features cause low performance of the trained network model. PolarNet [9] has been proposed to rasterize the polar coordinates of LiDAR points into regular grids as input into a convolution neural network (CNN) model for semantic segmentation. Figure 1 compares the density distribution of a LiDAR point cloud frame in the Cartesian and polar BEV coordinate systems. The density distribution is more uniform under the polar BEV coordinate system. Although PolarNet solves the problem of the semantic segmentation. Figure 1 compares the density distribution of a LiDAR cloud frame in the Cartesian and polar BEV coordinate systems. The density distr is more uniform under the polar BEV coordinate system. Although PolarNet sol problem of the uneven density of LiDAR point clouds, the applied feature ext method by using max-pooling operations causes the information loss of detailed g rical features.
(a) (b) This paper proposes a dynamic graph convolution network for LiDAR poin semantic segmentation using a polar bird's-eye view, referred to as DGPolarNet, as in Figure 2. The proposed DGPolarNet input is the original point clouds, and the is the semantic segmentation results. Firstly, the LiDAR point clouds are converted lar coordinates, which are registered into regular grids to balance the input DGPolarNet. Then, a dynamic feature extraction module generates edge features on perceptual points of interest sampled by the farthest point sampling (FPS) and K est neighbor (KNN) methods. Finally, the extracted edge features are combined w This paper proposes a dynamic graph convolution network for LiDAR point cloud semantic segmentation using a polar bird's-eye view, referred to as DGPolarNet, as shown in Figure 2. The proposed DGPolarNet input is the original point clouds, and the output is the semantic segmentation results. Firstly, the LiDAR point clouds are converted to polar coordinates, which are registered into regular grids to balance the input data to DGPolarNet. Then, a dynamic feature extraction module generates edge features based on perceptual Remote Sens. 2022, 14, 3825 3 of 18 points of interest sampled by the farthest point sampling (FPS) and K-nearest neighbor (KNN) methods. Finally, the extracted edge features are combined with the original point cloud through skip connections to recover spatial information lost and enhance local features with describable semantic information. The extracted dynamical edge features are input into a convolutional neural network to provide discriminant features for the semantic segmentation network.
Remote Sens. 2022, 14, x FOR PEER REVIEW 3 of 1 original point cloud through skip connections to recover spatial information lost and en hance local features with describable semantic information. The extracted dynamical edge features are input into a convolutional neural network to provide discriminant feature for the semantic segmentation network. The main contributions of this paper are as follows: (1) The semantic segmentation based on a polar bird's-eye view (BEV) solves the problems of the sparse distribution and uneven density of LiDAR point clouds. (2) The edge features generated from the points o interest sampled by FPS and KNN are more discriminants than the local features com puted by KNN.

Related Works
In this section, we briefly survey the related works of 3D semantic segmentation, in cluding neural network models to process converted multiviews, voxels, points, and graphs.

Multiview Projection Methods
To reuse deep neural network models for semantic segmentation in 2D images, 3D point clouds are projected onto 2D images or spherical images for pixel-wise semanti labeling. Lawin et al. [10], Boulch et al. [11], and Tatarchenko et al. [12] mapped 3D poin clouds onto multiple 2D images from different viewports, which were input into deep learning networks, such as convolutional neural network (CNN), full convolutional net work (FCN), and fully convolutional U-shaped network with skip connections, for pixel wise semantic labeling. Using an inverse mapping procedure, semantic segmentation wa implemented in 3D point clouds. Su et al. [13] proposed a sparse lattice network (SPLAT Net) with a bilateral convolution layer. Original point clouds were mapped to a sparse lattice, in which valid grids were input into bilateral convolution layers for hierarchica and spatial features extraction. SPLATNet was a joint 2D-3D network with a series of 1 × 1 CONV layers for the fusion of the extracted features of 3D point clouds and multiview images. Wu et al. [14] proposed a semantic segmentation framework, SqueezeSeg. It uti lized spherical projection to transform sparse and irregular 3D point clouds into dense 2D grids, which were input into a convolutional network, SqueezeNet, for feature extraction The segmentation results were further optimized based on conditional random fields Similarly, Milioto et al. [15] proposed RangeNet++ operating on the range views of spher ical projection from LiDAR point clouds. To improve computation speed performance they utilized a GPU parallel programming technique to speed up the postprocessing o KNN operations. However, such 3D semantic segmentation methods based on 2D projec tion were sensitive to viewport selection and had low accuracy performance when pro cessing point clouds of sparse distribution.
Su et al. [16] proposed a multiview convolutional neural network (MVCNN) tha captured 2D images from different viewports and aggregated them into 3D shape de scriptors by convolutional and pooling operations. These descriptive features were fed into a semantic segmentation network. However, this method only allowed selecting fixed The main contributions of this paper are as follows: (1) The semantic segmentation based on a polar bird's-eye view (BEV) solves the problems of the sparse distribution and uneven density of LiDAR point clouds. (2) The edge features generated from the points of interest sampled by FPS and KNN are more discriminants than the local features computed by KNN.

Related Works
In this section, we briefly survey the related works of 3D semantic segmentation, including neural network models to process converted multiviews, voxels, points, and graphs.

Multiview Projection Methods
To reuse deep neural network models for semantic segmentation in 2D images, 3D point clouds are projected onto 2D images or spherical images for pixel-wise semantic labeling. Lawin et al. [10], Boulch et al. [11], and Tatarchenko et al. [12] mapped 3D point clouds onto multiple 2D images from different viewports, which were input into deep learning networks, such as convolutional neural network (CNN), full convolutional network (FCN), and fully convolutional U-shaped network with skip connections, for pixelwise semantic labeling. Using an inverse mapping procedure, semantic segmentation was implemented in 3D point clouds. Su et al. [13] proposed a sparse lattice network (SPLATNet) with a bilateral convolution layer. Original point clouds were mapped to a sparse lattice, in which valid grids were input into bilateral convolution layers for hierarchical and spatial features extraction. SPLATNet was a joint 2D-3D network with a series of 1 × 1 CONV layers for the fusion of the extracted features of 3D point clouds and multiview images. Wu et al. [14] proposed a semantic segmentation framework, SqueezeSeg. It utilized spherical projection to transform sparse and irregular 3D point clouds into dense 2D grids, which were input into a convolutional network, SqueezeNet, for feature extraction. The segmentation results were further optimized based on conditional random fields. Similarly, Milioto et al. [15] proposed RangeNet++ operating on the range views of spherical projection from LiDAR point clouds. To improve computation speed performance, they utilized a GPU parallel programming technique to speed up the postprocessing of KNN operations. However, such 3D semantic segmentation methods based on 2D projection were sensitive to viewport selection and had low accuracy performance when processing point clouds of sparse distribution.
Su et al. [16] proposed a multiview convolutional neural network (MVCNN) that captured 2D images from different viewports and aggregated them into 3D shape descriptors by convolutional and pooling operations. These descriptive features were fed into a semantic segmentation network. However, this method only allowed selecting fixed viewports and ignored a large amount of geometric space information, which was not suitable for large-scale complex scenes. To generate discriminative information, Feng et al. [17] proposed a group-view convolutional neural network (GVCNN) based on MVCNN. The GVCNN used a fully convolutional network to extract view-level descriptors. In addition, grouping modules were used to learn association information and distinguish information between different viewports. Based on content-based discrimination, group-level descriptors were generated by dividing multiviews into several groups. The network applied a fully connected layer to complete the identification task. Classification performance was significantly improved by emphasizing discriminative information in the 3D shape descriptor, which was discovered at the group level. However, the GVCNN used maximum pooling for multiview clustering operations, which lost some useful information. Wang et al. [18] proposed a recurrent clustering and pooling convolutional neural network (RCPCNN). The RCP module was designed to generate dominant sets based on the similarity of multiple views. The RCPCNN grouped clusters of the dominant set, which were used for network updating by the pooling method. To improve the adaptability of the semantic segmentation network, the RCP module used the cyclic clustering and pooling module to generate feature vectors automatically. To fully exploit the correlation between multiple views, Ma et al. [19] combined the long short-term memory (LSTM) network with a CNN for 3D shape recognition. The 2D CNN network was used to extract the low-level features for each image of the view subsequences. The extracted features were input into the LSTM network as a time series. Through a sequence voting layer, the extracted features were aggregated into shape descriptors. This model fully utilized the advantages of CNN and LSTM, which effectively improved the discriminative ability of multiview descriptors.

Voxelization Methods
Maturana et al. [20] proposed VoxNet to convert unstructured point cloud data into grid data as the input of a 3D CNN. Point clouds were mapped to multiple 3D grids using an occupancy grid algorithm. The value of each grid cell was normalized and fed into the convolutional layers of the network to generate feature maps. A max-pooling method was performed on nonoverlapping blocks of voxels. Rethage et al. [21] proposed a fully convolutional point network (FCPN) for semantic voxel labeling. In low-level feature abstraction layers, the FCPN employed PointNet with a uniform sampling strategy to ensure the permutation invariance of local geometric feature extraction. The high-dimension feature extraction network formed an octree-like nonoverlapping feature space to reduce memory costs. To enhance the performance of convolutional networks on sparse 3D voxelized data, Graham et al. [22] designed submanifold sparse convolutional networks (SSCNs), which fixed the active sites of sparse convolutions to keep sparsity stable over multiple layers. To solve the submanifold dilation problem, the SSCNs only process active voxels. Moreover, strided operations incorporating pooling or strided convolution were introduced for data fusion in the hidden layers between disconnected components. Wu et al. [23] proposed 3DShapeNet, which represented 3D geometry features as a probability distribution of binary variables on 3D voxels. Binary tensors were fed into three layers of convolutional filters to extract features. Although these methods solved the problem of processing unstructured point clouds, the memory usage may increase cubically with the increase in resolution.
Riegler et al. [24] proposed OctNet with adaptive spatial partitioning capability that replaced fixed-resolution voxels with a flexible octree structure, effectively solving the problem of the high memory consumption of voxels. It divided the 3D space hierarchically into a set of unbalanced octree structures. In the octree structures, each leaf node stored a pooled feature representation. The 3D space was partitioned based on data density. Computational storage resources were dynamically pooled based on the input 3D structure, which greatly reduced memory consumption and improved computational speed. Wang et al. [25] proposed an O-CNN that implemented convolution operations only on sparse octants occupied by 3D surface boundaries. The O-CNN used the sparsity of the octree representation and the local orientation of the shape to reasonably allocate memory and improve computational efficiency. Xu et al. [26] proposed a learning-free 3D point cloud segmentation strategy based on the octree structure. The graph structure was gener-ated based on local contextual information for point cloud voxelization and clustering of voxels. Perceptual grouping was utilized to segment 3D point clouds in a purely geometric way. This method had better classification results for complex scenes and objects with nonplanar surfaces. However, the traversing process of the octree structure caused high time complexity for generating a high-resolution octree.
To solve the problems of the high memory consumption and long training time of voxelized networks, Li et al. [27] proposed a field-probing neural network for 3D data (FPNN). The 3D space was represented by a 3D vector field as the input to the network. Instead of a convolutional layer in the CNN, a field-probing filter was used to extract features from the 3D vector field efficiently. This way, the computational complexity only depended on the number of field-probing filters and sampling points, and it was not affected by the resolution of the voxel map. Le et al. [28] proposed a point-grid hybrid model, PointGrid, which sampled a certain number of points in each grid cell with a simple point quantization strategy. From these sampled points, local ensemble features were extracted. Tchapmi et al. [29] proposed SEGCloud, which voxelized 3D point clouds and generated coarse downsampled voxel labels by using 3D-FCNN. The voxel labels were interpolated back to the 3D points by trilinear interpolation layers. The original 3D point features were combined with the resulting class labels after interpolation. By using fully connected conditional random fields, the final class labels were inferred to obtain a finegrained semantic segmentation level. In SEGCloud, the shared computation using fully convolutional networks reduced a certain amount of computation consumption. Although the voxel-based semantic segmentation methods utilized lossy compression on sparse point clouds as preprocessing to reduce computational memory consumption, loss of geometric and topological information of point clouds was caused.

Point Methods
To avoid loss of original information by projection or voxelization, Qi et al. [30] proposed a point-wise semantic segmentation network, PointNet, that directly operated on unordered point clouds without sampling preprocessing. PointNet learned geometric features through shared multilayer perceptrons (MLPs), which were input into symmetrical max-pooling functions for global features extraction. However, PointNet based on pointwise MLP focused on the perception of global geometric information and ignored the topological structure of local features of points. To extend the PointNet applicability to complex scenes, Qi et al. [31] proposed a deep hierarchical feature learning network, PointNet++. PointNet++ exploited an individual PointNet to generate local features in multiscale set abstractions and utilized hierarchical concepts to extract global features.
Considering both orientation awareness and scale awareness, Jiang [32] proposed a PointSIFT module that supported various PointNet-based architectures. The module used orientation-encoding convolution to generate multiscale representation from eight orientations. A set abstraction module in the downsampling stage and a feature propagation module in the upsampling stage were combined to obtain the features. Li et al. [33] proposed SO-Net, which constructed a self-organizing map (SOM) to model the spatial distribution of point clouds. The SOM node features were aggregated into a global feature vector using average pooling. A parallel branching network with fully connected branches and convolutional branches was used to recover a single feature vector from the global features to represent the input point cloud.
Because LiDAR point clouds have an imbalanced distribution, several invalid grids have been registered in far fields for traditional grid rasterization methods [34,35]. BEV maps represent point cloud data from a top-down perspective without losing any scale and range information [36,37]. By projecting raw point clouds into a fixed-size polar BEV map, Zhang et al. [9] proposed a PolarNet that extracted the local features in polar grids and integrated them into a 2D CNN for semantic segmentation.

Graph Methods
Currently, the graph neural network (GNN) research proposed by Scarselli et al. [38] has been widely studied for 2D and 3D semantic analysis tasks. Bruna et al. [39] applied convolutional neural networks in non-Euclidean domains modeled with graphs to perform local filtering in both the spatial and spectral domains. Simonovsky et al. [40] proposed an edge conditional convolution network on a local graph neighborhood. By integrating edge labels and an asymmetric edge function, the relationship between local points was established. Landrieu et al. [41] introduced an attribute-directed graph hyper point graph with edge features. Gated graph neural networks and edge conditional convolution were utilized to obtain contextual information for large-scale point cloud segmentation. Landrieu et al. [42] proposed a 3D point cloud oversegmentation strategy to improve segmentation accuracy. Jiang et al. [43] aggregated point features into edge branches in a hierarchical manner to enhance the description of local features. Kipf et al. [44] proposed a semi-supervised classification network, the graph convolutional neural networks (GCN). To extend the applicability of the GCN, Te et al. [45] proposed a regularized graph convolutional neural network (RGCNN), which mainly consisted of graph construction, graph convolution, and feature filtering layers. In each convolution layer, the graph Laplacian matrix was updated to improve the flexibility to adapt to dynamic graphs. The Chebyshev polynomial was also used to reduce computational complexity. Experiments indicated that the RGCNN had good performances in both point cloud classification and segmentation tasks of different densities. Wang et al. [46] developed a dynamic graph convolutional neural network (DGCNN). To obtain the local features of point clouds, the DGCNN used the EdgeConv operation to extract the features of centroids in a local neighborhood map. EdgeConv only considered the point coordinates and the distances of the neighboring points. Because the vector directions between neighboring points were ignored, some local geometric information was eventually lost in this method.
To enhance local feature representation, this paper proposes DGPolarNet, an inheritance derived from DGCNN, PolarNet, and PointNet++. The local discriminate edge features generated from the points of interest sampled by FPS and KNN discriminate more than the local features computed by KNN. The extracted local features and the original point cloud data are aggregated for semantic segmentation to prevent feature loss during the convolution and pooling operations.

DGPolarNet for LiDAR Point Cloud Semantic Segmentation
Through a comprehensive analysis of global and local features, this paper proposes DGPolarNet, a dynamic graph convolution network with FPS and KNN for LiDAR point cloud semantic segmentation based on a polar BEV, as shown in

BEV Polar Converter
To solve the invalid grid waste problem of processing the uneven distribution of Li-DAR scanning, the polar BEV coordinate system is utilized to register 3D point clouds into regular grids. Using Equation (1), the 3D point (x, y, z, t) is converted into the polar coordinate (r, θ, t), where (r, θ) is the polar coordinate defined by Equations (1) and (2), and t is the intensity value of the laser reflection.
= arcsin y + + The points in the polar BEV grid defined as ( , , ) ∈ are rasterized into a 3D array ( ) of size ( × × ), which is then input into the FPS-KNN dynamic network as the first layer of the backbone network. For the first layer ( ) , is the batch size, and is the number of points in each batch. Each point has three attributes { , , }; thus, = 3.

FPS-KNN Dynamic Network
The FK-EdgeConv (FPS-KNN EdgeConv) method is developed by integrating FPS and KNN algorithms to extract the comprehensive edge features of the nearest and farthest vertices, as shown in Figure 4.

BEV Polar Converter
To solve the invalid grid waste problem of processing the uneven distribution of LiDAR scanning, the polar BEV coordinate system is utilized to register 3D point clouds into regular grids. Using Equation (1), the 3D point (x, y, z, t) is converted into the polar coordinate (r, θ, t), where (r, θ) is the polar coordinate defined by Equations (1) and (2), and t is the intensity value of the laser reflection.
The points in the polar BEV grid defined as p i (r i , θ i , t i ) ∈ P are rasterized into a 3D array V (1) of size n 1 1 × n 1 2 × n 1 3 , which is then input into the FPS-KNN dynamic network as the first layer of the backbone network. For the first layer V (1) , n 1 1 is the batch size, and n 1 3 is the number of points in each batch. Each point has three attributes {r, θ, t}; thus, n 1 2 = 3.

FPS-KNN Dynamic Network
The FK-EdgeConv (FPS-KNN EdgeConv) method is developed by integrating FPS and KNN algorithms to extract the comprehensive edge features of the nearest and farthest vertices, as shown in Figure 4.  The FPS-KNN dynamic network constructs a directed graph for each layer using the FK-EdgeConv method. For the l-th network layer, the dynamic graph ( ) is defined as Equation (3), where the datasets ( ) and ( ) with a dimension of ( ( ) × ( ) × ( ) ) represent the set of vertices and edges, respectively.
To obtain more effective semantic features, FPS and KNN are utilized to generate directed graphs from the LiDAR point clouds instead of the fully connected edges, which suffer from high memory consumption. The FPS and KNN operations sample the ( ) farthest and ( ) nearest neighbors, respectively, from the vertices set ( ) . Thus, the edge set ( ) has 2 × ( ) directed edge elements, which are calculated based on the target point ( ) and the neighbor points using Equation (6).
Then, the vertices in ( ) and the edges in ( ) are input into the Conv2D and pooling operations to generate the output dataset ( 1) with a dimension of ( 1 The edge feature computation as the Conv2D is defined as ℎ , . We utilize max-pooling and min-pooling operations to extract local features from the sampled farthest and nearest vertices, respectively. Accordingly, the output ( ) ∈ ( ) of the FK-EdgeConv operation is denoted as Equation (7). The dataset ( ) is also the input of the (l+1)-th layer processed by the following FK-EdgeConv operation. In particular, only the directed graph of the first layer of the FPS-KNN dynamic network is built based on the points in the polar BEV coordinate system. The following layers are constructed using the features extracted from the previous layer.  The FPS-KNN dynamic network constructs a directed graph for each layer using the FK-EdgeConv method. For the l-th network layer, the dynamic graph G (l) is defined as Equation (3), where the datasets V (l) and E (l) with a dimension of (n (l) 3 ) represent the set of vertices and edges, respectively.
To obtain more effective semantic features, FPS and KNN are utilized to generate directed graphs from the LiDAR point clouds instead of the fully connected edges, which suffer from high memory consumption. The FPS and KNN operations sample the k (l) farthest and k (l) nearest neighbors, respectively, from the vertices set V (l) . Thus, the edge set E (l) has 2 × k (l) directed edge elements, which are calculated based on the target point p (l) i and the neighbor points using Equation (6).
Then, the vertices in V (l) and the edges in E (l) are input into the Conv2D and pooling operations to generate the output dataset V (l+1) with a dimension of (n 3 . The edge feature computation as the Conv2D is defined as h p i , ε i j . We utilize max-pooling and min-pooling operations to extract local features from the sampled farthest and nearest vertices, respectively. Accordingly, the output p (l+1) i ∈ V (l+1) of the FK-EdgeConv operation is denoted as Equation (7). The dataset V (l+1) is also the input of the (l+1)-th layer processed by the following FK-EdgeConv operation. In particular, only the directed graph of the first layer of the FPS-KNN dynamic network is built based on the points in the polar BEV coordinate system. The following layers are constructed using the features extracted from the previous layer.
The FK-EdgeConv unit, as shown in Figure 3b, is developed to compute local features of the array V (l) = {v 3 ) for the l-th layer. The FPS and KNN algorithms are implemented on V (l) and output the feature array V (l) = {v is generated as the input of the following layer. In each layer, FK-EdgeConv is utilized to calculate the dynamic feature graph model as local semantic features, which are further aggregated for semantic feature enhancement. In our practice, we implement five FK-EdgeConv operations and four down operations accordingly.
After extracting the graph features of multiple layers, the down unit, as described in Figure 3c, is implemented to fuse the extract features of the layers l and l'. The inputs of the down unit consist of two feature sets of different sizes. The feature arrays V (l) and V (l ) are reshaped and concatenated into the aggregated feature array 3 ). Using the Conv+Relu operation on each batch of V (l ) , the higher-dimensional feature array V (l +1) is generated as the input of the following layer.
By using a skip architecture, the local features of FK-EdgeConv operations and down processes are joined as aggregated features, which are input into the postprocessing for global semantic segmentation. Similar to the skip architecture, the features generated by the FPS-KNN EdgeConv and down processes are reshaped by cropping and scaling operations to merge into the aggregated features in the concatenating procedure.

Postprocessing
The aggregation features generated by the FPS-KNN dynamic network are mapped back to their corresponding polar BEV grid as the input of postprocessing. The shared MLP is utilized to convolute the semantic segmentation prediction. In the l-th layer, the feature set v l is computed via Equation (8), where w lm pqt is the learnable parameter for the element (p, q, t) in layer m of the MLP. v l

Experiments and Analysis
The proposed model was tested on the SemanticKITTI [47] datasets. Compared with other typical semantic segmentation networks, the accuracy performances of the DGPolar-Net model with several critical parameters are analyzed and discussed in this section.

Datasets
SemanticKITTI is a dataset of LiDAR point clouds collected by Velodyne Lidar HDL-64E and annotated with point-level semantic labels. It consists of a total of 43,551 frames from 22 sequences collected from inner-city traffic, including 23,201 for training and the rest for testing. Each frame has around 104,452 points on average. There are a total number of 19 typical types of objects, including ground-related, structures, vehicles, nature, human, object, and outlier classes. The dataset is unbalanced in point counts for different objects. For example, there is a small number of motorcyclist objects in most scenes, and only a few points are labeled for the motorcyclist class. In our experiment, we used one sequence for validation and nine sequences for training.

Semantic Segmentation Performance
Experiments in this section were conducted using an Intel(R) Xeon(R) Silver 4110 CPU @ 2.10 GHz 2.10 GHz dual processor, an NVIDIA Quadro RTX 6000 graphics card, and 64 GB RAM. In our experiments, the points in the range of −50 m < x < 50 m, −50 m < y < 50 m, and −3 m < z < 1.5 m were mapped to the polar BEV coordinate system, which were then rasterized into the polar BEV grids with a resolution of (480 × 360 × 32). After analyzing the point distribution of the SemanticKITTI dataset, we specified the k value of each layer as equal to 20. In our experiment, we specified v 1 1 = 1, v 1 2 = 3, and v 1 3 = 1,843,200. Our DGPolarNet model had 14 layers, among which the 1st layer was the converted polar BEV grid data, the 2nd to 11th layers were FPS-KNN dynamic networks, and the 12th to 14th layers were for postprocessing. V (11) represented the aggregated features of the 11th layer, and V (14) represented the final semantic score of the 14th layer. The softmax function was utilized as the loss function. Table 1 illustrates the data dimension for each layer. In the 14th layer, there were 19 segmentation scores for 19 classes in SemanticKITTI accordingly.   Figure 5 shows some samples of the semantic segmentation results using the proposed DGPolarNet method. To evaluate the semantic segmentation performance, the mean intersection-over-union (mIoU) is applied (Equation (9)), where the variables TP c , FP c , and FN c are the number of true-positive, false-positive, and false-negative predictions for class c, respectively, and C is the number of classes. Table 2 shows the segmentation mIoU performances on all the object classes of the SemanticKITTI compared with the state-of-the-art methods. Our mIoU achieved 56.5% on average. Using the proposed DGPolarNet, the average segmentation IoUs of ground-related regions, buildings, vehicles, nature regions, humans, and other objects were 86.4%, 90.1%, 45.20%, 81.35%, 43.83%, and 52.40%, respectively. Our method had good performance for motorcycles, trucks, bicyclists, roads, sidewalks, buildings, vegetation, and pole objects. However, its IoU was low for other-ground, motorcyclist, and bicycle objects, because the extracted features of these objects were not discriminative. tem, which were then rasterized into the polar BEV grids with a resolution of (480 × 360 × 32). After analyzing the point distribution of the SemanticKITTI dataset, we specified the k value of each layer as equal to 20. In our experiment, we specified = 1, = 3, and = 1,843,200. Our DGPolarNet model had 14 layers, among which the 1st layer was the converted polar BEV grid data, the 2nd to 11th layers were FPS-KNN dynamic networks, and the 12th to 14th layers were for postprocessing. ( ) represented the aggregated features of the 11th layer, and ( ) represented the final semantic score of the 14th layer. The softmax function was utilized as the loss function. Table 1 illustrates the data dimension for each layer. In the 14th layer, there were 19 segmentation scores for 19 classes in SemanticKITTI accordingly.  3  3  3  3  3  3  3  3  3  3  3  18 Table 2 shows the segmentation mIoU performances on all the object classes of the SemanticKITTI compared with the state-of-the-art methods. Our mIoU achieved 56.5% on average. Using the proposed DGPolarNet, the average segmentation IoUs of ground-related regions, buildings, vehicles, nature regions, humans, and other objects were 86.4%, 90.1%, 45.20%, 81.35%, 43.83%, and 52.40%, respectively. Our method had good performance for motorcycles, trucks, bicyclists, roads, sidewalks, buildings, vegetation, and pole objects. However, its IoU was low for other-ground, motorcyclist, and bicycle objects, because the extracted features of these objects were not discriminative.  PointNet [30] extracted the global features from all the points directly and lacked the correlation among the local features. Meanwhile, PointNet implemented the semantic segmentation only using the 3D point coordinates without the intensity information. The la-  PointNet [30] extracted the global features from all the points directly and lacked the correlation among the local features. Meanwhile, PointNet implemented the semantic segmentation only using the 3D point coordinates without the intensity information. The laser reflection intensities of different materials were distinguished from each other. Without the intensity information, the connected objects of different types were easily detected as one object. Thus, we introduced the intensity data as one input of the DGPolarNet to enhance the discriminative local features of the dynamic graph. Compared with PointNet, the mIoU performance of our model was improved by 41.9%. For the ground-related, road, sidewalk, parking, and other-ground regions, our method increased by 31.8%, 43.7%, 42.6%, and 18.6%, respectively. RangeNet++ [15] projected the original point clouds onto the 2D range view, which caused spatial structure information loss and rasterization errors. In particular. when processing vehicle and human objects, the mIoU was only 27.3% and 14.2%, respectively. We used the polar BEV converter to solve the problem of uneven distribution of point clouds and applied both FPS and KNN to preserve the local geometrical features. Thus, the mIoU of our model was improved by around 4.3% compared with RangeNet++ and much improved for the vehicle and human objects.
Although PolarNet [9] utilized the polar BEV system to balance the input distribution, the extracted feature for each grid cell was insufficient by a learnable simplified PointNet with a max-pooling operation. To retain geometrical features, we constructed the dynamic graph for each BEV grid. Meanwhile, the extracted high-level semantic features were enhanced by the skip architecture of all the intermediate layers. Compared with the PolarNet network, the mIoU of our model is improved by 2.2%. For objects of complex shapes, such as vehicles and humans, our method improved by 14.5% on average.
Instead of only using KNN, we sampled both farthest and nearest neighbors by integrating FPS and KNN algorithms, which reduced the discriminative feature loss in the local feature encoding process. We conducted the comparison experiments using the KNN method and FPS-KNN method under different k values in dynamic graph construction, as shown in Table 3. When the k value was specified as 20, the models achieved the best performance. When it increased higher, the performance of the two models degraded on the contrary. Because the FPS-KNN method constructed feature maps obtaining both the internal geometrical structures and external contours of objects, the encoded features of the dynamic graphs were more describable than only using KNN.  Table 4 analyzes the DGPolarNet performances through true-positive (TP), falsenegative (FN), and true-negative (TN) samples of the semantic segmentation results. Accordingly, the precision (P), recall (R), and F1 scores were calculated by using Equation (10) to evaluate the semantic segmentation performances. The values of P, R, and F1 values in Table 4 indicated that the proposed DGPolarNet has lower segmentation accuracies for other-ground, bicycle, and motorcyclist objects than the other objects. Because the point distribution of other-ground was similar to that of road and sidewalk objects, and motorcyclist objects were also similar to person objects, the segmentation for such objects did not perform well. For bicycle objects, there were a small number of points scanned on the sample surface, which caused insufficient training of the network model.

Conclusions
This paper proposes DGPolarNet, an efficient approach for semantic segmentation in LiDAR point clouds. The polar BEV converter is utilized to rasterize the LiDAR points into regular polar grids of even point distribution. An FPS-KNN dynamic network is developed to construct dynamic directed graphs and extract the local features of each BEV grid. Employing skip connection, the graph features of each layer are aggregated into high-dimensional features. All the aggregated features of each BEV grid are then integrated into a shared MLP for semantic segmentation. We validate the proposed DGPolarNet on the SemanticKITTI dataset, which is more efficient than previous methods.