Fast Context Awareness Encoder for LiDAR Point Semantic Segmentation

The LiDAR sensor is a valuable tool for environmental perception, as it generates 3D point cloud data with reflectivity and position information by reflecting laser beams. However, it cannot provide the meaning of each point cloud cluster, which has led many studies to focus on identifying semantic information about point clouds. This paper explores point cloud segmentation and presents a network that encodes point cloud data at different levels to obtain semantic information about point cloud clusters. The local context awareness network uses each point and its surrounding points to contribute local features, which are then combined with global features to better capture the position, density, and other characteristics of the point cloud. The feature extraction network provides highly abstracted information, allowing for more accurate semantic segmentation of the discrete points in space. The proposed algorithm is compared and verified against other algorithms on the Semantic KITTI dataset and achieves state-of-the-art performance. Owing to its ability to capture fine-grained features along the z-axis, the algorithm shows higher prediction accuracy for certain types of objects. Moreover, the training and validation time is short, and the algorithm can meet the high real-time requirements of 3D perception tasks.


Introduction
Environmental perception is one of the most challenging tasks for an autonomous driving system, which must analyze real-time data and provide reliable driving recommendations. With the advent of the Internet of Things (IoT), several automated driving technologies utilize LiDAR sensors or depth cameras to acquire rich environmental information and obtain 3D data of the surrounding objects.
Typically, the approaches for perceiving point cloud data from LiDAR involve object detection and point cloud segmentation. While 3D object detection provides an approximate estimation of the point cloud, 3D point cloud semantic segmentation (Lawin, et al., 2017) classifies each point of the scene point cloud individually and allows for fine differentiation of point clouds. This paper focuses on exploring point cloud segmentation methods to obtain more accurate semantic information.
Most of the algorithms used for 3D object detection and segmentation are based on either voxel or point features. Point-level feature algorithms such as PointNet (Qi, et al., 2017) do not give much attention to the surrounding point clouds, leading to insufficient accuracy. PointNet++ (Qi, et al., 2017) improves the features of surrounding point clouds using multi-scale grouping, but this approach requires large and intensive computations. Voxel-based algorithms such as VoxelNet (Zhou and Tuzel, 2018) directly voxelize the point cloud, missing out on the fine-grained features of points. PointPillars (Lang, et al., 2019) and other algorithms transform 3D data into a pseudo-image by projecting the point cloud onto a bird's eye view, which significantly improves the processing speed, but the abstraction precision is impacted by the compression of height information. The trade-off between detection speed and accuracy remains challenging for 3D object detection and segmentation.
To address the aforementioned problems, this paper proposes a feature extraction network based on context-aware multi-level features of both points and voxels. The network incorporates a voxel-based feature extraction method to extract local features from each voxel of the point cloud, ensuring both efficient and accurate feature extraction. This method also enhances the context feature of each point and its surrounding points. Additionally, point features are extracted and fused, resulting in features that exhibit robust performance at the point level. According to the official data provided on the Semantic KITTI segmentation website, the highest accuracy achieved on the LiDAR point cloud is slightly over 70%. In this paper, we propose a lightweight network that aggregates point features, local context features, and global context features to improve the precision of point cloud segmentation while reducing training and inference time. Our method achieves state-of-the-art results on the Semantic KITTI dataset and enables quick detection of objects. The proposed method exhibits clear advantages in identifying objects that are stacked vertically, such as Motorcyclist and Motorcycle.

Related Work
This section provides a brief overview of some point cloud segmentation algorithms. Many studies have focused on segmenting input point clouds directly using artificial features. These segmentation methods often involve extracting point cloud features using clustering algorithms, such as K-means (Huang, et al., 2019), K-Medoids (Ma, et al., 2017), c-means (Yang, et al., 2020), KD-trees (Zhou, et al., 2008), octrees (Woo, et al., 2002), spectral clustering (Arias-Castro, et al., 2011), Mean-shift (Hu, et al., 2017), and DBSCAN (Density-Based Spatial Clustering of Applications with Noise) (Wang, et al., 2019). These algorithms can increase the constraints of the normal vector propagation direction on the segmentation (Zheng, 2015). However, segmentation using artificial features typically results in a rough division into surfaces or blocks based on shape, position, density, etc. Semantic information cannot be given for each part, and these segmentation methods have no learning ability.
With the advent of deep learning in the field of image analysis, many researchers have proposed methods for point cloud data based on deep learning (Li, et al., 2022) (Li, et al., 2021). Since object detection and segmentation are often studied together in deep learning networks, this paper also references some well-recognized object detection algorithms. Point cloud object detection and segmentation algorithms mainly include point-based, voxel-based, and graph neural network-based methods, depending on the organization of the point cloud data.

Point-based method
Many studies have used 3D projection to obtain 2D images from different views and then applied 2D CNNs, such as MVCNN (Su, et al., 2015), for object recognition. However, this approach results in a significant loss of geometric information. PointNet (Qi, et al., 2017), on the other hand, directly learns point-by-point features from point clouds and has had a significant impact on 3D object recognition and 3D semantic segmentation. However, PointNet neglects the spatial neighboring relations between points. Upgraded versions of PointNet, such as PointNet++ and 3D Point Capsule Networks (Zhao, et al., 2019), have addressed this issue by learning and extracting representative local features through hierarchies in the encoding stage. PointRCNN (Shi, et al., 2019) is a two-stage method that considers both geometric information and local point feature information and uses a bin-based loss function to train the 3D bounding box proposal. YOLO 3D (Ali, et al., 2018) is an end-to-end deep neural network based on the Darknet architecture, where the center coordinates of the bounding box and the 3D object bounding box are directly treated as regression targets. This method first maps the point cloud data to a bird's eye view grid image.

Voxel-based method
In order to organize unordered point cloud data, researchers have proposed voxelization techniques, such as VoxelNet. The method voxelizes the original point cloud data, which reduces the number of convolutional calculations required and improves execution speed and learning efficiency. VoxelNet also introduces sparse convolution and voxel feature encoding (VFE) layers. Building upon VoxelNet, Voxel-FPN (Wang, et al., 2019) stacks voxel features of different heights at the same position and then uses a pyramid structure to extract multi-scale voxel features. SECOND (Yan, et al., 2018) proposes an efficient spconv network to improve VoxelNet and adds data augmentation techniques. PointPillars takes this a step further by transforming voxels into long pillars to further improve calculation efficiency.

Voxel-Point-based method
The Fast Point R-CNN (Chen, et al., 2019) approach adopts voxels and 3D point clouds as input to the VoxelRPN sub-network, which outputs preliminary object predictions and addresses the issue of local information loss. Meanwhile, the PV-RCNN (Shi, et al., 2020) method combines a voxel CNN and point networks to extract and learn more distinct features from point clouds.

Graph neural network-based method
Graph Attention Convolution (Wang, et al., 2019) proposes a method for point cloud segmentation using graph attention convolution. Point-GNN (Shi and Rajkumar, 2020) uses a graph neural network to represent a disordered 3D point cloud and predicts the category and shape of objects through box merging and scoring operations over multiple vertices. PointRGCN (Zarzar, et al., 2019) uses an RGCN to provide feature and context aggregation for each proposal. However, this approach is time-consuming because graphs must be built and inference performed iteratively, which lengthens model training.

Others
Point cloud segmentation methods include (AF)2-S3Net (Cheng, et al., 2021), which uses attention fusion features, and several other CNN-based methods that improve the efficiency of segmentation (Xie, et al., 2021). Additionally, some semantic segmentation methods (Zhou, et al., 2021) demonstrate promising results in point cloud segmentation under occlusion, using adaptive augmentation and adversarial pruning methods.
According to the above research, point-based algorithms emphasize details and preserve the original geometric information of the point cloud, but they require heavy computation. Voxel-based algorithms retain point and local features well but may suffer from a significant loss of fine-grained accuracy. Graph-network-based algorithms capture the contextual relationship between points and better perceive 3D geometric information, but they have long feed-forward times and longer training and inference times. Therefore, we propose an algorithm that uses voxel-based features to make the irregularly distributed point cloud more balanced and easier to compute, while keeping and computing the fine-grained features of each point itself; it combines features from different levels of context awareness to identify unordered point sets and yields impressive results.

A. Architecture
The aim of this study is to improve the accuracy of semantic information for point cloud data by minimizing the occurrence of missed and mislabeled points. To achieve this, a segmentation algorithm that meets high real-time requirements was investigated, and a lightweight convolutional network called the Fast Context-Awareness Encoder (FCAE) was developed, which combines both local and global context features of the point cloud. The contributions of this paper can be summarized as follows:
1) This paper proposes a novel deep learning-based encoder network. The combination of semantic features involves three aspects: (i) the invariant feature of each point; (ii) the local structural features obtained by extracting the position relationships and point characteristics between scattered points through the context network; and (iii) the initial global feature abstracted by the network from these local features. By fusing the features of the point cloud at the point, voxel, and global levels, the network obtains more accurate feature encoding of the point cloud data while preserving the fine-grained features of points as context and position features. Furthermore, the network can process discrete point features with high variability in point cloud density faster, without additional pre-processing.
2) During each feature extraction stage, particularly the voxel and global feature extraction stages, the network architecture is inspired by feature networks with strong expressiveness. The hierarchical design is modified and optimized by simplifying certain network layers in various places to minimize and balance the computational load.
3) The feature extraction network generates local features by considering the context awareness features of points and their neighbors, while the initial global features are obtained by abstracting the voxel features and context features at higher levels. By leveraging the dual context relationship of points, the deep neural network can extract and abstract the position relationships and point features between scattered points through both local and global context awareness. In this paper, the algorithm follows the PyTorch implementation of the 3×3 T-Net transformation in PointNet for feature extraction, which is used to preserve some of the original point cloud features. The input data matrix is expanded in dimension and passed through three Conv1d convolutions; however, the output channel of the last convolution layer is set to 256, which is smaller than the corresponding channel size used in PointNet. The design of the network layer is shown in Fig. 2.
Following the convolutions, the data passes through a max-pooling layer, two linear fully connected layers, and a final linear layer that reduces the dimension to 9. A bias term is added to obtain the T-Net transform matrix, which is multiplied by the original input data to produce the output matrix.
Through this structural transformation, a coordinate transformation of the input point features is performed, and the transformation matrix is sufficiently trained, allowing for better preservation of point features.
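As an illustration of the description above, the following PyTorch sketch shows one possible implementation of this simplified transform branch; the intermediate channel sizes (64 and 128), the module name, and the application of the 3×3 matrix to the xyz coordinates only are assumptions not specified in the text.

import torch
import torch.nn as nn

class PointTransformNet(nn.Module):
    # Sketch of the simplified T-Net branch described above (assumed layer sizes).
    def __init__(self, in_channels: int = 4):
        super().__init__()
        # three Conv1d layers; the last output channel is capped at 256 as stated above
        self.convs = nn.Sequential(
            nn.Conv1d(in_channels, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.ReLU(),
            nn.Conv1d(128, 256, 1), nn.ReLU(),
        )
        # two fully connected layers, then a final linear layer down to 9 (= 3x3)
        self.fc = nn.Sequential(
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 9),
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (B, 4, N) with channels (x, y, z, reflectivity)
        x = self.convs(points)                                 # (B, 256, N)
        x = torch.max(x, dim=2).values                         # max-pooling over the points
        x = self.fc(x)                                         # (B, 9)
        # add an identity bias so the learned transform starts near the identity matrix
        identity = torch.eye(3, device=x.device).flatten().unsqueeze(0)
        t = (x + identity).view(-1, 3, 3)                      # (B, 3, 3) transform matrix
        # apply the transform to the xyz coordinates; reflectivity is passed through unchanged
        xyz = torch.bmm(t, points[:, :3, :])
        return torch.cat([xyz, points[:, 3:, :]], dim=1)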

C. Voxel feature network
This part focuses on the voxelization of the point cloud data and the extraction of voxel features in the network. Figure 3 shows the architecture of the network.

(1) Point cloud voxelization
The point cloud sample data has four dimensions, denoted as P ∈ R^(N×4), where N is the total number of points in the current scene. In order to extract local context features in the feature extraction network, the point cloud is first voxelized. The purpose is to divide the space into voxels and reduce the imbalance of points between them; by doing so, computation can be saved and features can be extracted more quickly. Given the sparsity of the point cloud and its uneven distribution in space, this article adopts VoxelNet's method to transform points with uneven density distribution into regularly distributed voxel blocks stored in memory. The first step in this process is to choose the voxel size so that the point cloud range is divisible by it, resulting in an integer number of voxel grids K. To prevent over-calculation in voxel grids with varying numbers of points, a proper maximum point number M is set as the sampling threshold. In the next step, random sampling is applied, and voxels with fewer points than this threshold are filled with 0. This ensures that the network reduces the imbalance between different voxels.
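A minimal NumPy sketch of this grouping step is given below; the function name, point cloud range, and the maximum of 35 points per voxel are illustrative placeholders rather than values taken from the paper.

import numpy as np

def voxelize(points, voxel_size=(0.64, 0.64, 0.64),
             pc_range=(0.0, -40.0, -3.0, 70.4, 40.0, 1.0), max_points=35):
    # Group an (N, 4) point cloud into K non-empty voxels with at most max_points points each.
    voxel_size = np.asarray(voxel_size)
    origin = np.asarray(pc_range[:3])
    grid_shape = np.round((np.asarray(pc_range[3:]) - origin) / voxel_size).astype(np.int64)

    # integer voxel index (ix, iy, iz) of every point; points outside the range are dropped
    coords = np.floor((points[:, :3] - origin) / voxel_size).astype(np.int64)
    keep = np.all((coords >= 0) & (coords < grid_shape), axis=1)
    points, coords = points[keep], coords[keep]

    unique_coords, inverse = np.unique(coords, axis=0, return_inverse=True)
    voxels, voxel_coords = [], []
    for i, c in enumerate(unique_coords):
        pts = points[inverse == i]
        if len(pts) > max_points:
            # random sampling above the threshold M
            pts = pts[np.random.choice(len(pts), max_points, replace=False)]
        buffer = np.zeros((max_points, points.shape[1]), dtype=points.dtype)
        buffer[:len(pts)] = pts          # voxels below the threshold are padded with zeros
        voxels.append(buffer)
        voxel_coords.append(c)
    return np.stack(voxels), np.stack(voxel_coords)   # (K, M, 4) and (K, 3)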
(2) Voxel feature values and relative coordinates
After voxelizing the original point clouds, the dense voxels represented by V ∈ R^(K×M×4) are obtained, and the corresponding voxel coordinates represented by C ∈ R^(K×3) are calculated to indicate the voxel indices along the x, y, and z axes in the voxel grid. In the next step of the voxel feature decoration procedure, additional relative coordinates are also calculated. These include the distance (xc, yc, zc) between each point and the arithmetic mean (centroid) of the points within the voxel, as well as the offsets (xp, yp, zp) of each point to its own voxel center (physical center). The computational process for obtaining these coordinates is shown in Fig. 4.
(3) Voxel feature decoration
The voxel feature decoration method described in this paper is an extension of the PointPillars algorithm, with a focus on enhancing feature extraction along the z-axis. In the PointPillars approach, the offset of a point within a pillar is calculated relative to the arithmetic mean and the pillar grid center (x and y center coordinates only), with no division along the z-axis. The pillar itself is used as a feature unit, with the position feature of the point treated as a whole. To reduce the computational complexity of using Conv3d, a pseudo-image is adopted in the subsequent feature map.
Our method uses Conv3d for convolution to extract features from the voxels and the complete voxel structure, rather than Conv2d on a pseudo-image. This allows the offset from the voxel center coordinates on the z-axis to be calculated, resulting in a voxel-based feature representation that expands the feature dimension of the enhanced data and captures more spatial feature information. The point cloud data is augmented to a 10-dimensional feature matrix (x, y, z, r, xc, yc, zc, xp, yp, zp).
Because the xc, yc, zc, xp, yp, zp values are computed for every entry, valid points in a voxel can no longer be distinguished from the filled zeros. Therefore, this paper retains the mask method used in VoxelNet to reset these filled feature points to 0. Unlike the VoxelNet network, where the mask operation is executed once in each layer of the stacked VFE modules at a later stage, in this work it is executed directly after the generation of the enhanced voxel data to ensure that the input data is purer. This results in the creation of enhanced dense decorated voxels represented by U ∈ R^(K×M×10), which serve as the input to the Simplified Voxel Feature Network (SVFE).
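The decoration and masking steps can be sketched as follows; the helper name and grid parameters are hypothetical, and the ordering of the ten output channels follows the list given above.

import numpy as np

def decorate_voxels(voxels, voxel_coords, voxel_size=(0.64, 0.64, 0.64),
                    grid_origin=(0.0, -40.0, -3.0)):
    # Augment (K, M, 4) voxels to (K, M, 10): x, y, z, r, xc, yc, zc, xp, yp, zp.
    voxel_size = np.asarray(voxel_size)
    grid_origin = np.asarray(grid_origin)

    # valid (non-padded) points: padded rows are all zeros
    mask = np.any(voxels != 0, axis=2, keepdims=True)                  # (K, M, 1)
    num_valid = np.maximum(mask.sum(axis=1, keepdims=True), 1)

    # offsets (xc, yc, zc) to the arithmetic mean (centroid) of the valid points in each voxel
    centroid = voxels[:, :, :3].sum(axis=1, keepdims=True) / num_valid
    offsets_to_centroid = voxels[:, :, :3] - centroid

    # offsets (xp, yp, zp) to the physical center of the voxel cell, including the z-axis
    cell_center = grid_origin + (voxel_coords + 0.5) * voxel_size      # (K, 3)
    offsets_to_center = voxels[:, :, :3] - cell_center[:, None, :]

    decorated = np.concatenate([voxels, offsets_to_centroid, offsets_to_center], axis=2)
    # reset the decorated features of padded points to zero before they enter the SVFE
    return decorated * mask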

(4) SVFE
The point cloud features can be learned effectively by extending the data to 10 dimensions. As shown in Fig. 3, the voxel feature extraction (VFE) in our paper is similar to the VFE in VoxelNet, but with one key difference: it uses only one layer of VFE. This is because, in the subsequent steps, the point features and the voxel features are concatenated to obtain the local aggregation features of each voxel, eliminating the need for multiple VFE layers and significantly reducing the computation of this step. The SVFE proposed in our paper is inspired by the PFN layer used in the PointPillars network. The SVFE contains only a fully connected layer and a max operation, with the fully connected layer using a (10, C) linear layer. The output is a matrix of shape (K, M, C), where C is the output feature dimension of the voxel. The second dimension is then max-pooled to obtain an output matrix of shape (K, 1, C). In this way, the feature information of the voxelized point cloud is directly aggregated into a local feature for each voxel, describing the interaction among its points.
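A possible PyTorch realization of the SVFE as described, a single (10, C) linear layer followed by a max over the points in each voxel, is sketched below; the default output dimension C = 64 and the class name are assumptions.

import torch
import torch.nn as nn

class SimplifiedVFE(nn.Module):
    # SVFE sketch: one (10, C) linear layer followed by a max over the M points per voxel.
    def __init__(self, in_dim: int = 10, out_dim: int = 64):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, decorated_voxels: torch.Tensor) -> torch.Tensor:
        # decorated_voxels: (K, M, 10) enhanced dense decorated voxels
        x = self.linear(decorated_voxels)       # (K, M, C)
        return torch.max(x, dim=1).values       # (K, C), one aggregated feature per voxel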
(5) Generate 5D spatial feature blob
Only non-empty voxels are processed in the SVFE, and these voxels correspond to only a small part of the original space. Here, it is necessary to remap the obtained non-empty voxel features back into the original 3D space, i.e., the dense voxel feature represented by B ∈ R^(K×C). This involves using a 3D scatter to combine the spatial information of the points inside each voxel with the dense voxel feature, resulting in a 5D feature matrix π ∈ R^(nx×ny×nz×N×C). The index of each voxel grid along the x, y, and z axes is computed from its Cartesian topological structure as X * ny * nz + Y * nz + Z, where X, Y, and Z are its indices in the three dimensions, and nx, ny, and nz represent the number of grids along the x, y, and z axes, respectively. The feature data of the voxel points are then placed at these voxel grid indices to create the output 5D feature blob, where N is the number of points in the data queue and C is the dimension of the voxel feature. Due to the sparsity of the point cloud data, about 90% of the voxels are empty, which greatly reduces the memory and computational consumption in backpropagation.
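The remapping step can be sketched as a scatter into a dense per-sample canvas, as below; the function name is hypothetical and the canvas is built per batch sample, following the linear index X * ny * nz + Y * nz + Z described above.

import torch

def scatter_to_dense(voxel_feats, voxel_coords, grid_shape):
    # Scatter per-voxel features (K, C) back into a dense 3D canvas (C, nx, ny, nz).
    nx, ny, nz = grid_shape
    K, C = voxel_feats.shape
    canvas = voxel_feats.new_zeros(C, nx * ny * nz)
    # linear index of each non-empty voxel: X * ny * nz + Y * nz + Z
    index = (voxel_coords[:, 0] * ny * nz
             + voxel_coords[:, 1] * nz
             + voxel_coords[:, 2]).long()
    canvas[:, index] = voxel_feats.t()          # empty voxels keep their zero features
    return canvas.view(C, nx, ny, nz)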

D. Scene context feature extraction network (SCFEN)
The SCFEN network proposed in this paper aims to extract scene context features from the local contextual features of the point cloud. By combining the previously extracted local point-space features, the network produces an initial global feature that captures the awareness of the scene context in the point cloud. This process aggregates context information from voxel local features to abstract higher-level point cloud information. Figure 5 shows the architecture of the network.
The 5D local feature blobs are fed into the SCFEN network as input to obtain the global context features. The process is similar to the convolutional middle layers (CML) in VoxelNet, which use three Conv3d layers for feature extraction. The difference is that our network applies max-pooling after each Conv3d layer to retain the more significant local features of each voxel. Compared with VoxelNet, the initial global features extracted by this method can highlight the fine-grained features of the point cloud to a greater extent.
The procedure for the network implementation is described here. The 5D spatial feature matrix is passed through three sets of convolutional max-pooling units (each consisting of a Conv3d layer, a ReLU activation layer, and a max-pooling layer). The number of channels is halved in each unit. The resulting features are then fed into a fully connected layer to obtain the initial global feature vector (represented by G) with a dimension of 1024. This paper also conducts experiments comparing the performance with a global feature dimension of 512. In this step, the output (N, G) matrix is used as the scene context feature vector of the global feature. After extracting the local features and sparse tensors of each voxel in the VFN network, SCFEN extends the receptive field to obtain rich shape information.
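A sketch of this procedure under the stated design (three Conv3d + ReLU + max-pooling units with channels halved per unit, followed by a fully connected layer) is given below; the kernel and pooling sizes, the default grid shape, and the class name are assumptions.

import torch
import torch.nn as nn

class SceneContextFeatureNet(nn.Module):
    # SCFEN sketch: three Conv3d + ReLU + max-pool units, then a linear layer to the global dimension G.
    def __init__(self, in_channels: int = 64, grid_shape=(128, 128, 8), global_dim: int = 1024):
        super().__init__()
        c = in_channels
        blocks = []
        for _ in range(3):
            blocks += [nn.Conv3d(c, c // 2, kernel_size=3, padding=1),
                       nn.ReLU(),
                       nn.MaxPool3d(kernel_size=2, ceil_mode=True)]
            c //= 2
        self.blocks = nn.Sequential(*blocks)
        nx, ny, nz = grid_shape
        # each pooling halves the spatial size (rounded up), giving nx/8, ny/8, nz/8 and C/8 channels
        flat = (-(-nx // 8)) * (-(-ny // 8)) * (-(-nz // 8)) * c
        self.fc = nn.Linear(flat, global_dim)

    def forward(self, canvas: torch.Tensor) -> torch.Tensor:
        # canvas: (B, C, nx, ny, nz) dense voxel feature blob
        x = self.blocks(canvas)
        return self.fc(x.flatten(start_dim=1))      # (B, G) scene context feature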

E. Point-Voxel-Global feature fused network
This paper fuses point cloud features from three levels to fully integrate the semantic features of the context. Following the extraction of context features at all levels, the features from all levels are cascaded together to generate new abstract features. These features combine the invariant point features, the local features of voxelized points and their surrounding points, and the abstracted initial global features, each occupying a different proportion according to its degree of importance. The specific proportion is determined by the number of output channels of each feature network.
The context features from the three levels (f_p, f_l, and f_g, representing the point, local, and global features, respectively) are fused to generate an output context feature F ∈ R^(N×(4 + C + G)). The fused features are then processed by a multi-layer perceptron (MLP) network. This network uses multiple fully connected layers applied in a sliding manner to achieve higher feature expression ability than a convolutional neural network, while reducing computational requirements and avoiding the overfitting that may occur when using multiple convolution kernels.
The MLP consists of three fully connected layers (FC1, FC2, FC3), and the number of output channels D corresponds to the number of final classes. After these layers, the output is represented as O ∈ R^(N×D) and passed through the log_softmax activation function, which normalizes the exponential outputs. This maps the outputs of multiple neurons to the range (0, 1), with the normalized values summing to 1. Finally, the entire point cloud segmentation network calculates the probability distribution over the classes to obtain the semantic information of each output point.
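The fusion and classification head can be sketched as follows; the hidden layer widths of FC1 and FC2 and the class name are assumptions, while the concatenated input width 4 + C + G and the log_softmax output follow the description above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SegmentationHead(nn.Module):
    # Fuse point, local, and global features and classify each point with a three-layer MLP.
    def __init__(self, point_dim: int = 4, local_dim: int = 64,
                 global_dim: int = 1024, num_classes: int = 19):
        super().__init__()
        in_dim = point_dim + local_dim + global_dim        # 4 + C + G
        self.fc1 = nn.Linear(in_dim, 512)
        self.fc2 = nn.Linear(512, 128)
        self.fc3 = nn.Linear(128, num_classes)

    def forward(self, point_feats, local_feats, global_feat):
        # point_feats: (N, 4), local_feats: (N, C), global_feat: (G,) broadcast to every point
        n = point_feats.shape[0]
        fused = torch.cat([point_feats, local_feats,
                           global_feat.unsqueeze(0).expand(n, -1)], dim=-1)
        x = torch.relu(self.fc1(fused))
        x = torch.relu(self.fc2(x))
        return F.log_softmax(self.fc3(x), dim=-1)          # (N, D) per-point log-probabilities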

Training details
PyTorch was used to implement the 3D point cloud semantic segmentation network, which was trained on an NVIDIA Tesla T4 GPU.
In this experiment, the batch size was set to 4 during network training, and the Adam optimizer was used; the learning rate was adjusted according to the number of epochs trained. The initial learning rate was 0.001 and was multiplied by 0.1 every ten epochs. The learning rate thus decreased gradually during the experiment to avoid large oscillations that would impede quick convergence.
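The schedule described above (Adam, initial learning rate 0.001, multiplied by 0.1 every ten epochs) corresponds to the following sketch; the model, data loop, and epoch count are placeholders.

import torch
import torch.nn as nn

model = nn.Linear(10, 19)        # placeholder for the full FCAE network
criterion = nn.NLLLoss()         # pairs with the log_softmax output of the segmentation head
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

for epoch in range(30):
    # ... iterate over the dataloader with batch size 4; for each batch call
    # optimizer.zero_grad(), compute the loss, loss.backward(), and optimizer.step() ...
    scheduler.step()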
As shown in Fig. 7, during training the accuracy quickly increased to approximately 0.9, while the training loss decreased rapidly. After completing model training, the prediction results were obtained on the validation set. This study conducted the experiment using a voxel size of (0.64, 0.64, 0.64) and a training time of 76434 seconds.

Dataset
Semantic-KITTI (Behley, et al., 2019) is one of the large-scale datasets for 3D LiDAR point cloud segmentation, covering both semantic and panoptic segmentation.
The dataset is about 80 GB and divided into 22 sequence subdirectories. It is further divided into training and validation sets at a 6:1 ratio. The dataset contains 32 classes, of which seven have moving and non-moving variants. To simplify the classification process, classes with similar properties were combined, and classes with too few points were ignored. This resulted in the retention of 19 classes used for training and evaluation.

Evaluation
1. Qualitative evaluation metric
In Fig. 8, several semantic segmentation results generated by the 3D point cloud segmentation network on the Semantic KITTI test set are presented for qualitative evaluation. Each color represents a different semantic class. The segmented graph shows that the ground points are divided into roads and sidewalks. The network also accurately distinguishes objects, for example separating the edges of sidewalks from roads.
The visible segmentation results closely matched the ground truth in the Semantic KITTI dataset.
In the comparison experiment presented in this paper, Fig. 9 shows the segmentation effect when the voxel size is (0.32, 0.32, 0.32). While the segmentation in some places is better than with voxel size (0.64, 0.64, 0.64), there are also some objects where the segmentation is worse, and the difference in the qualitative evaluation is not significant. For example, the side of the road in the figure with voxel size (0.64, 0.64, 0.64) is unclear, but the effect of the road extending forward in the figure with voxel size (0.32, 0.32, 0.32) is not as good as in the original.
Figure 10 illustrates how two sets of point clouds are segmented after removing or adjusting the corresponding components in the ablation experiment. The figure shows that removing each of the three components from the network leads to a varying degree of accuracy loss, with the local feature module having the greatest impact and the point feature module the lowest. Since the global module plays a crucial role in the output feature, this paper only modifies the output dimension of that module, setting it to 512. A comparison of the experimental results shows that this modification also leads to a significant loss in segmentation quality.

2. Quantitative evaluation metric
A. Accuracy
In the experiment, the predicted truth value consists of true-positive (tp) and false-positive (fp) values. The accuracy of the algorithm in the prediction process is determined by the ratio of correctly predicted values to the whole sample dataset. Additionally, the true values in the negative sample dataset are considered, and the true-negative (tn) value is included, giving

Accuracy = (tp + tn) / (tp + tn + fp + fn),     (3)

which serves as an evaluation metric for the model. In the experiments conducted in this paper, the accuracy was found to be 0.7998. However, to avoid the influence of the fp value on the predicted truth value, this work uses the MIoU value, which is commonly used in point cloud segmentation, to calculate the prediction accuracy of the model.

B. MIoU
In order to evaluate the proposed method, we followed the official guidance and used the MIoU as the evaluation metric. It can be formulated as follows:

MIoU = (1/C) * Σ_{c=1}^{C} TPc / (TPc + FPc + FNc),     (4)

where TPc, FPc, and FNc correspond to the number of true-positive, false-positive, and false-negative predictions for class c, and C is the number of classes.
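For reference, the metric can be computed per scan as in the sketch below; the function name and the ignore label are assumptions, and classes absent from both prediction and ground truth are skipped in the mean.

import numpy as np

def mean_iou(pred, target, num_classes=19, ignore_index=255):
    # Compute the class-wise IoU and its mean from per-point predictions and labels.
    valid = target != ignore_index
    pred, target = pred[valid], target[valid]
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (target == c))
        fp = np.sum((pred == c) & (target != c))
        fn = np.sum((pred != c) & (target == c))
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return float(np.mean(ious))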
The accuracy of our proposed model was evaluated using the MIoU over 19 categories, and the quantitative results are presented in Table 1, along with a comparison to other state-of-the-art methods for point cloud segmentation. The average IoU score of our model (68.9%) is slightly lower than that of the model with the highest score (the Point-Voxel-KD algorithm reaches 71.2% after the dataset is enhanced by fine-tuning, flipping, and rotation tests; without this data enhancement, its IoU is 68.9%, the same as our result), but our model exhibits the best performance in 8 out of the 19 categories. Our model performs particularly well for small targets such as people and for objects that are easily confused along the z-axis, such as motorcycles, motorcyclists, and bicycles. Additionally, the average detection time of our model is 90 ms, which is relatively short compared to the detection times of the state-of-the-art algorithms.

Comparative and ablation experiments
In this paper, a comparative experiment was conducted using an initial voxel size of (0.64, 0.64, 0.64), and the experiment was repeated with the voxel size set to (0.32, 0.32, 0.32). To avoid memory overflow, the batch size was set to 1, and the corresponding training time was extended almost four-fold. The experimental results showed that the MIoU increased by 0.6 percentage points to 69.5, and the accuracy increased by 1.84 percentage points to 0.8182. It was observed that the accuracy and MIoU of the model improved with a reduction in voxel size. However, the training time increased significantly, the gradient descent was slower, and training was less likely to converge. Additionally, for some classes, the segmentation effect was worse than that of the original model. Consequently, the experiment determined that (0.64, 0.64, 0.64) was the optimal voxel size.
The article includes ablation experiments that involve modifying the values of each component; the results are shown in Table 2. As the global feature module is the basis of feature extraction and its features cannot be eliminated, the ablation instead reduces the dimension occupied by its features, thereby reducing its proportion in the fused features. The experiments also involve eliminating the point feature module and the voxel feature module from the perception network, and the results of these ablation experiments are as follows.
This paper presents the ablation experiments on the three fusion modules. The experiments revealed that removing the point feature module reduced the MIoU by 4.2%. Similarly, when the voxel feature module did not participate in the fused feature, the loss in MIoU was significant, with a reduction of 16.6%. Furthermore, reducing the dimension of the global feature output to 512 also resulted in a loss of MIoU, with a reduction of 13.3%. These results suggest that each module of the proposed feature network plays an important role in improving the accuracy of point cloud segmentation.

Discussion
The experimental results showed that our network performs well in segmenting point clouds in the Semantic KITTI dataset. However, the effectiveness of the segmentation was not consistent across all point cloud classes. To improve our point cloud segmentation network, the following design methods can be considered in future work.

One technique is to divide the point cloud data into sector data blocks based on its 360-degree distribution around the origin, as sketched after this section. The sector data blocks are densest near the origin and larger further away from the center, where the point cloud is sparser. This blocking method results in a more balanced distribution of data within the blocks, which can improve segmentation accuracy for details and learning outcomes while avoiding the omission of details. Within each sector data block, the 360-degree data can be further divided into several equal parts in different directions, enhancing the segmentation process.

In addition to the fan-shaped blocks, the point cloud voxels can be designed vertically by dividing the data into voxels of different sizes along the z-axis. However, in actual point cloud data, most of the data above a certain height is sparse or empty, resulting in wasted data blocks for learning. To optimize this, the data above a certain height can be identified as sparse data and treated separately, while the data below this height can be divided into equal parts as usual. The upper region is taller than the other parts but relatively sparse, so balancing the distribution of the point cloud divided along the z-axis avoids invalid divisions based on height. These are some data block division methods that can be applied based on the existing data distribution.

Research can also be conducted on the data layer of the learning network, where point data, local features, or overall features are extracted, followed by cascade feature learning on both the feature values and the original data values. To address the loss of characteristic data caused by the pooling layers on the original layer data, pyramid zooming and cascading operations can be performed in the network. These operations can improve the learning effect and the accuracy of point cloud segmentation.
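As a concrete illustration of the fan-shaped blocking discussed above, the sketch below assigns each point a sector index from its azimuth and a ring index from its range; the function name, the number of sectors, and the radial edges are illustrative values only.

import numpy as np

def sector_block_indices(points, num_sectors=8, radial_edges=(5.0, 10.0, 20.0, 40.0, 80.0)):
    # Assign every point a sector index (azimuth bin) and a ring index (radial bin).
    azimuth = np.arctan2(points[:, 1], points[:, 0])                       # in (-pi, pi]
    sector = ((azimuth + np.pi) / (2 * np.pi) * num_sectors).astype(np.int64) % num_sectors
    radius = np.linalg.norm(points[:, :2], axis=1)
    ring = np.searchsorted(np.asarray(radial_edges), radius)               # coarser rings far from the origin
    return sector, ring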

Conclusions
This study examines the technique of point cloud segmentation and proposes the Fast Context-Awareness network, which demonstrates superior performance compared to other state-of-the-art methods on the Semantic KITTI dataset. Although the average IoU of our algorithm is lower than that of the method with the highest score, the two are equal when that method is trained without data enhancement, and our method exhibits the best performance in several categories. The network leverages multi-level features, such as point features, voxel features, and global features, to quickly impart semantic information to points in the point cloud. The logical structure of our method is clear and allows for easy parameter adjustments based on the point cloud's characteristics. For instance, by changing the final feature dimension of the point feature module, the proportion of point features can be increased or reduced, while the proportions of local voxel features and global features can be decreased or increased accordingly. This feature network facilitates the extraction of detailed information from point clouds and yields more accurate point cloud segmentation results on datasets such as Semantic KITTI. The paper also enhances the segmentation effect of feature extraction along the z-axis, so that objects with more prominent features along the z-axis, such as motorcyclists and motorcycles, are segmented better than by other point cloud segmentation networks. The results suggest that the proposed network may be more effective for driving datasets with large road slopes, such as coal mine alleys, where the slope may exceed ten degrees. The network can better segment stacked point clouds belonging to the same object category at different heights on the z-axis, potentially leading to improved point cloud segmentation performance.

Figures
Figure 8 The segmentation effect comparison of our algorithm and ground truth on the Semantic KITTI test set
Figure 9 The segmentation effect of our algorithm with different voxel sizes on the Semantic KITTI test set

Step 5: Local feature extraction
for batch_idx in range(batch_size):
    batch_mask = (point batch indices) == batch_idx
    local_feats <- canvas_3d[batch_idx, :, x_coords, y_coords, z_coords]
    local_feats[batch_mask, :] <- local_feats  ▷ the local feature (N, C) is an input of Step 7
end for
Step 6: Scene global feature extraction  ▷ canvas_3d (batch_size, C, nx, ny, nz)
max_pool3d(relu(Conv3d(C, C/2)))  ▷ repeated three times
Linear((nx/8) * (ny/8) * (nz/8) * (C/8), G)  ▷ global feature (N, G); G is the global feature dimension, set to 1024 in this paper; the matrix is an input of Step 7
Step 7: Semantic segmentation head
fuse_feats <- concat([points, local_feats, global_feats], dim=-1)  ▷ combine the point, local, and global features; fuse_feats (N, 4 + C + G)
MLP perception: MLP(F) = log_softmax(Relu(FC3(Relu(FC2(Relu(FC1(F)))))))  ▷ shape = (N, D), where D is the number of semantic segmentation classes

The architecture of our point cloud segmentation network is shown in Fig. 1. Firstly, we refer to the T-Net in PointNet for the input transformation when extracting point-invariant features. At the same time, the point cloud is voxelized to obtain its local features; the point cloud data can be stored as tensors in the form of a hash table, with reference to the VFE (Voxel Feature Encoding) method adopted in VoxelNet. In the next step, the voxel features output the semantic features of the surrounding context of each point, and a 5D spatial feature matrix is generated. Then the global context features are abstracted from the local contextual features around these points. In the critical step, the contextual fusion features of the point cloud are constructed by cascading the three different levels of features. Finally, the multi-layer perceptron is applied to classify the point cloud.

B. Points' invariant feature network
To capture local and global features across different abstraction levels, this paper proposes a method combining point cloud features from three different levels to fully integrate the semantic features of the context. These levels include (i) the invariant feature of each point, (ii) the context semantic information around each point, and (iii) the high-level abstract feature of the global scene context. The initial point feature extraction is performed to preserve the rotation invariance, translation invariance, and scale invariance of the point cloud. In this paper, the network employs the T-Net module from PointNet, a transformation function based on the data itself. The generated point cloud rotation matrix is multiplied by the input point cloud data to perform the affine transformation. The purpose of this module is to fully retain the point features of the original point cloud by training the transformation matrix.

Figure 6 shows the architecture of the network. In the first part, the T-Net transformation extracts invariant point features, producing the output I ∈ R^(N×4). The second part uses voxel coordinates to sample surrounding features, generating a 5D spatial feature matrix π ∈ R^(nx×ny×nz×N×C); this, along with the corresponding position information, forms the local context awareness information L ∈ R^(N×C). The third part presents a high-level abstract global feature, represented by S ∈ R^(N×G), which combines point position features and voxel context features from the 5D abstract voxel feature matrix. Since the global feature extraction process integrates the local context features from the previous section, and the local features abstract not only the position information of the point itself but also that of other points, the abstraction ability of this feature is stronger.

Figure 3 Voxel feature network and local contextual feature extraction

Figure 4 Voxel feature decoration procedure

Figure 5 Scene context feature extraction network

Figure 7 The graph of accuracy and loss with training batch index