LiDAR Point Clouds Semantic Segmentation in Autonomous Driving Based on Asymmetrical Convolution

Abstract: LiDAR has become a vital sensor for autonomous driving scene understanding. To meet the accuracy and speed requirements of LiDAR point cloud semantic segmentation, an efficient model, ACPNet, is proposed in this paper. In the feature extraction stage, the backbone is constructed with asymmetric convolutions, so the skeleton of the square convolution kernel is enhanced, which leads to greater robustness to target rotation. Moreover, a contextual feature enhancement module is designed to extract richer contextual features. During training, global scaling and global translation are performed to enrich the diversity of the datasets. Compared with the baseline network PolarNet, the mIoU of ACPNet on the SemanticKITTI, SemanticPOSS and nuScenes datasets is improved by 5.1%, 1.6% and 2.9%, respectively. Meanwhile, ACPNet runs at 14 FPS, which basically meets the real-time requirements of autonomous driving scenarios. The experimental results show that ACPNet significantly improves the performance of LiDAR point cloud semantic segmentation.


Introduction
Scene understanding is one of the most critical tasks in autonomous driving. In recent years, a detailed and accurate understanding of the road scene has become a main part of any outdoor autonomous robotic system. Although semantic segmentation of 2D images is crucial to scene understanding, visual sensors still have limitations, such as poor information acquisition under insufficient light, lack of depth information and a limited field of view. In contrast, LiDAR obtains accurate depth information with higher density and a wider field of view regardless of lighting conditions, which makes it a more reliable source of information for environmental perception. Therefore, scene understanding of LiDAR point clouds through semantic segmentation has become a focal point in autonomous driving.
According to how the point clouds are encoded, current LiDAR point cloud semantic segmentation methods can be divided into three categories: point-based methods, voxel-based methods, and projection-based methods. In terms of speed, point-based and voxel-based methods consume a lot of computation and memory, which makes it difficult to run them in real time on an on-board computing platform. In autonomous driving, real-time performance should be given higher priority than segmentation accuracy. Projection-based methods, in contrast, are lightweight and fast, so real-time operation can be achieved during deployment. In terms of segmentation accuracy, projection-based methods have shown some success. However, since the point cloud information is not fully utilized during feature extraction, there is still room for improving their segmentation accuracy.
Improving segmentation accuracy while keeping real-time speed is therefore highly relevant in autonomous driving scenarios. To meet both accuracy and speed requirements, an efficient real-time network, ACPNet (Asymmetric Convolution based on PolarNet), is proposed in this paper. PolarNet [1], the baseline network of ACPNet, encodes point clouds through a polar bird's-eye-view (BEV) representation. A bird's-eye view is a perspective that looks at an object or scene from above, like a bird looking down at the ground from the air; it is also known as God's perspective and serves as a coordinate system for describing the perceived world. Using polar BEV has two advantages. First, in terms of point allocation within grid cells, the polar BEV method assigns points to their respective grid cells more evenly. Second, since this partitioning yields a more balanced distribution of points, the theoretical upper limit of prediction accuracy for the semantic classification of point clouds is increased, thereby improving the performance of downstream semantic segmentation models [1]. In ACPNet, the encoded point cloud features are fed into an Asymmetric Convolution Backbone Network (ACBN) for feature extraction. Then, the features extracted by the backbone are input to the Contextual Feature Enhancement Module (CFEM) for further mining of contextual features. Moreover, global scaling and global translation are used as Enhanced Data Augmentation (EDA) while ACPNet is being trained. Experiments are conducted on the SemanticKITTI [2], SemanticPOSS [3] and nuScenes [4] datasets to verify the validity and generalization of our method. The main contributions of this paper can be summarized as follows:

• An Asymmetric Convolution Backbone Network is proposed. Asymmetric convolutions are used in the backbone to enhance the skeleton of square convolution kernels and reduce interference caused by target rotations.

• A Contextual Feature Enhancement Module is proposed, which can fully extract the contextual feature by decomposing and aggregating the features.

• Enhanced Data Augmentation methods of global scaling and global translation are used to enrich the diversity of the dataset samples. Thus, the generalization capability of the model is further improved without increasing the computational cost.

Related Works
Due to the sparsity and disorderliness of point clouds, encoding the input point cloud is a crucial issue when using convolutional neural networks for semantic segmentation of 3D point clouds. According to the encoding method, existing approaches can be divided into three categories: point-based methods, voxel-based methods, and projection-based methods.

Point-Based Methods
PointNet [5] is a point-wise learning method for point cloud features, in which max pooling is used to integrate global features. PointNet++ [6] extends PointNet and strengthens the ability to extract local information at different scales. A spatially continuous convolution is proposed in PointConv [7], which effectively reduces the memory consumption of the algorithm. For semantic segmentation in large-scale point cloud scenarios, SPG [8] represents the point clouds as interconnected superpoint graphs and then uses PointNet to learn the features of the superpoint graph. An attention-based module is designed in RandLA-Net [9] to integrate local features, achieving efficient segmentation in large-scale point clouds. Segmentation performance is further improved in KPConv [10] with a novel spatial kernel-based point convolution. Lu et al. [11] suggested the use of distinct aggregation strategies for within-category and between-category data. Employing aggregation or enhancement techniques on local features [12] can effectively enhance the perception of intricate details. Furthermore, to effectively learn features from extensive point clouds encompassing diverse target types, Fan et al. [13] introduced SCF-Net, which incorporates a dual-distance attention mechanism and global contextual features to enhance semantic segmentation performance.
Point-based methods work directly on the raw point clouds without excessive initial transformation steps. However, when handling expansive point cloud scenes, a local nearest-neighbor search is inevitably involved, which is computationally inefficient. Thus, there is still clear room for improvement in point-based methods.

Voxel-Based Methods
Voxel-based methods divide point clouds regularly into 3D cubic voxels and employ 3D convolution to extract features. SEGCloud [14] is one of the earlier methods for semantic segmentation based on voxel representation. To utilize 3D convolution efficiently and expand the receptive field, 3D sparse convolution [15] is used in Minkowski CNN [16], which reduces the computational complexity of convolution. In pursuit of higher segmentation accuracy, a neural architecture search (NAS) based model, SPVNAS [17], is proposed, which trades high computational cost for accuracy. To fit the spatial distribution of LiDAR point clouds, a cylindrical voxel division method is proposed in Cylinder3D [18], which allows it to obtain high accuracy. To streamline computations and better capture the details of smaller instances, an attention-focused feature fusion module and an adaptive feature selection module are proposed by Cheng et al. [19]. To improve the speed of voxel-based networks, a point-to-voxel knowledge distillation method is proposed in PVKD [20] to achieve model compression.
Voxel-based methods typically achieve high segmentation accuracy. However, 3D convolution is inevitably used, resulting in significant memory occupation and high computational cost.

Projection-Based Methods
The basic concept behind projection-based methods is to transform point clouds into images that can undergo 2D convolution operations. The SqueezeSeg [21][22][23] series of algorithms, based on SqueezeNet [24], perform semantic segmentation after projecting point clouds. RangeNet++ [25] implements semantic segmentation on a DarkNet53 [26] backbone, and a K-Nearest Neighbor (KNN) post-processing algorithm is proposed to improve segmentation accuracy. 3D-MiniNet [27] builds the network on a lightweight backbone, achieving a faster speed. A polar BEV representation is proposed in PolarNet [1], which uses a simplified version of PointNet to encode the points of each polar grid cell into a pseudo image, so that KNN post-processing is no longer needed. Peng et al. [28] introduced a multi-attention mechanism to enhance the understanding of driving scenes, specifically focused on dense top-view semantic segmentation using sparse LiDAR data. SalsaNext [29] introduced a new context module, replaced the ResNet encoder blocks with a residual convolution stack that has increasing receptive fields, and incorporated a pixel-shuffle layer into the decoder. MINet [30] employed multiple paths with varying scales to effectively distribute computational resources across different scales. FIDNet-Point [31] designed a fully interpolation decoding module that directly upsamples the multi-resolution feature maps using bilinear interpolation. CENet+KNN [32] replaced the MLP with convolutional layers of larger kernel size and integrated multiple auxiliary segmentation heads into its architecture.
Projection-based methods have obvious advantages in computational complexity and speed. Therefore, improving their segmentation accuracy is significant for practical application in autonomous driving.

Methodology
The overall framework of ACPNet is shown in Figure 1. First, the raw point clouds are encoded using a polar BEV encoder. Then, the encoded point cloud features are input into the ACBN, constructed with asymmetric convolutions, for feature extraction. Next, the features extracted by the backbone are input into the CFEM for further mining of contextual features. Finally, the output features are processed by the semantic head to obtain the semantic segmentation results.
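The first step of this pipeline, assigning each point to a polar BEV grid cell, can be sketched as follows. This is a minimal illustration and not the PolarNet encoder itself: the function name `polar_grid_indices` and the radius and height ranges are assumptions chosen for the example, while the 480 × 360 × 32 grid (radius, angle, height) is the ACPNet grid size reported in the grid density experiments.

```python
import numpy as np

def polar_grid_indices(points, grid=(480, 360, 32),
                       r_range=(3.0, 50.0), z_range=(-3.0, 1.5)):
    """Map (N, 3) Cartesian points to integer (radius, angle, height) cell indices.

    r_range and z_range are hypothetical bounds, not values from the paper.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.hypot(x, y)            # distance from the sensor in the ground plane
    theta = np.arctan2(y, x)      # azimuth in [-pi, pi]
    lo = np.array([r_range[0], -np.pi, z_range[0]])
    hi = np.array([r_range[1], np.pi, z_range[1]])
    # Points outside the bounds are clipped onto the border cells.
    polar = np.clip(np.stack([r, theta, z], axis=1), lo, hi)
    size = np.asarray(grid)
    cell = (hi - lo) / size       # extent of one cell along each axis
    idx = np.floor((polar - lo) / cell).astype(np.int64)
    return np.minimum(idx, size - 1)
```

Because the cells are indexed by radius and angle rather than by x and y, cells near the sensor cover a smaller ground area than distant ones, which is what evens out the number of points per cell.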

Asymmetric Convolution Backbone Network (ACBN)
Objects such as vehicles, riders, and pedestrians are the main detection targets in autonomous driving scenarios. These objects are presented in a small rectangular area after the BEV projection. Furthermore, it is common for objects to rotate in the horizontal direction. Horizontal rotation refers to the rotation angle of objects on the road relative to the front of the LiDAR sensor. When an object is not directly in front of the LiDAR, horizontal rotation occurs. Recent studies [33] also indicate that the central crisscross weights play a more significant role in the square convolution kernel.
Asymmetric convolution is a type of convolutional operation used in convolutional neural networks. Unlike square convolutions, which use local convolution blocks with equal length and width, asymmetric convolutions use rectangular blocks with unequal length and width. These convolutions are characterized by their ability to extract different global features depending on the orientation of the rectangular block. When the length is greater than the width, the convolution can extract more global features in the vertical direction, resulting in a larger receptive field or attention range. Conversely, when the length is less than the width, the convolution can extract more global features in the horizontal direction. By combining asymmetric convolutions with different orientations, the weights in the horizontal and vertical directions can be overlaid to enhance the weights at the center cross position of the square convolutional kernel.
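The overlay of horizontal and vertical weights on the center cross can be checked numerically. The sketch below illustrates the additivity property underlying [33]: running a 3 × 3, a 1 × 3 and a 3 × 1 kernel in parallel and summing the outputs is equivalent to a single 3 × 3 kernel whose middle row and middle column have been reinforced. It is not the ACPNet block, which chains the asymmetric kernels sequentially, and the helper `correlate2d_valid` is a name introduced for this example.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def correlate2d_valid(x, k):
    """'Valid' 2D cross-correlation, the operation CNN layers call convolution."""
    windows = sliding_window_view(x, k.shape)
    return np.einsum("ijkl,kl->ij", windows, k)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8))
k_square = rng.normal(size=(3, 3))
k_hor = rng.normal(size=(1, 3))   # horizontal kernel
k_ver = rng.normal(size=(3, 1))   # vertical kernel

# Overlay the 1x3 and 3x1 weights on the middle row and column of the 3x3 kernel.
k_fused = k_square.copy()
k_fused[1:2, :] += k_hor
k_fused[:, 1:2] += k_ver

# Parallel branches; the inputs are cropped so that all three outputs align.
parallel = (correlate2d_valid(x, k_square)
            + correlate2d_valid(x[1:-1, :], k_hor)
            + correlate2d_valid(x[:, 1:-1], k_ver))
fused = correlate2d_valid(x, k_fused)
```

Here `parallel` and `fused` agree up to floating-point error, which is the sense in which the asymmetric kernels strengthen the skeleton of the square kernel while leaving its corners untouched.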
As shown in Figure 2, when the feature map is flipped left-right or up-down, the information extracted by the original square kernel changes. However, if the convolution combination contains horizontal or vertical kernels, some of these kernels still produce the same output as on the original feature map at the axially symmetric locations. Asymmetric convolution can therefore still extract correct features under such rotational distortions, which enables the model to generalize better to unseen rotated samples and makes it more robust.
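The flipping argument can be verified on a toy feature map. In this sketch, which uses random kernels and a hypothetical helper `correlate2d_valid`, a horizontal 1 × 3 kernel applied to an up-down flipped input yields exactly the up-down flipped output, and likewise for a vertical 3 × 1 kernel under a left-right flip, whereas a generic 3 × 3 kernel does not have this property.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def correlate2d_valid(x, k):
    """'Valid' 2D cross-correlation of a single-channel map with kernel k."""
    return np.einsum("ijkl,kl->ij", sliding_window_view(x, k.shape), k)

rng = np.random.default_rng(1)
x = rng.normal(size=(6, 6))
k_hor = rng.normal(size=(1, 3))
k_ver = rng.normal(size=(3, 1))
k_square = rng.normal(size=(3, 3))

def commutes_with(flip, k):
    """True if filtering the flipped map equals flipping the filtered map."""
    return np.allclose(correlate2d_valid(flip(x), k),
                       flip(correlate2d_valid(x, k)))

hor_ok = commutes_with(np.flipud, k_hor)        # 1x3 kernel, up-down flip
ver_ok = commutes_with(np.fliplr, k_ver)        # 3x1 kernel, left-right flip
square_ok = commutes_with(np.flipud, k_square)  # generic 3x3 kernel
```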
To enhance the horizontal and vertical responses, we introduce the Asymmetric Convolution Block, which improves the robustness of the model to certain transformations, such as target flipping and rotation in BEV. Inspired by the observation and conclusion in [33], 1 × 3 and 3 × 1 asymmetric convolutions are used to build the block; they strengthen the skeleton of the square convolution kernel while weakening the corners. Moreover, the receptive field of a combination of 1 × 3 and 3 × 1 asymmetric convolutions is the same as that of a 3 × 3 square convolution. As shown in Figure 3, the ACBN consists of four downsampling asymmetric convolution blocks and four upsampling asymmetric convolution blocks. In addition, three skip connections (the dotted lines in Figure 3) merge the low-level and high-level features within the network, thereby enhancing the network's ability to learn details. The Downsample Asymmetric Convolution Block is shown in Figure 4: a square convolution with a stride of 2 is first applied to the features. After that, two asymmetric convolution combinations are applied separately, and their summed results are output. In these combinations, the kernels are 3 × 1 followed by 1 × 3, and 1 × 3 followed by 3 × 1, respectively.
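The downsample asymmetric convolution block is computed as in Equation (1):

$$F_{out} = C_{1\times3}\big(C_{3\times1}(C_{3\times3}(F_{in}))\big) + C_{3\times1}\big(C_{1\times3}(C_{3\times3}(F_{in}))\big) \qquad (1)$$

where $F_{in}$ and $F_{out}$ represent the input features and output features, respectively, and $C_{3\times3}$, $C_{1\times3}$ and $C_{3\times1}$ represent the 3 × 3, 1 × 3 and 3 × 1 convolutions, respectively. Figure 5 illustrates the Upsample Asymmetric Convolution Block, which first applies bilinear interpolation and then concatenates the low-level features from the skip connections. Lastly, an asymmetric convolution combination consisting of 1 × 3 and 3 × 1 kernels is applied. The upsample asymmetric convolution block is computed as in Equation (2):

$$F_{out} = C_{3\times1}\big(C_{1\times3}(B(F_{in}) \,\Delta\, F_{low})\big) \qquad (2)$$

where $F_{low}$ represents the low-level features, $\Delta$ represents feature concatenation, and $B$ represents the bilinear interpolation operation.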
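The data flow of the downsample block can be sketched on a single-channel feature map. This is a simplified reading of the description above, not the authors' implementation: channels, normalization and activation layers are omitted, zero "same" padding is assumed, and the names `conv2d` and `downsample_block` are introduced for illustration only.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d(x, k, stride=1):
    """Single-channel cross-correlation with zero 'same' padding, then striding."""
    ph, pw = k.shape[0] // 2, k.shape[1] // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.einsum("ijkl,kl->ij", sliding_window_view(xp, k.shape), k)
    return out[::stride, ::stride]

def downsample_block(f_in, k33, k31_a, k13_a, k13_b, k31_b):
    """Stride-2 square convolution followed by two summed asymmetric branches."""
    y = conv2d(f_in, k33, stride=2)             # halves the spatial resolution
    branch_a = conv2d(conv2d(y, k31_a), k13_a)  # 3x1 followed by 1x3
    branch_b = conv2d(conv2d(y, k13_b), k31_b)  # 1x3 followed by 3x1
    return branch_a + branch_b
```

Each branch covers a 3 × 3 neighborhood of the downsampled map, so the block keeps the receptive field of a square kernel while putting more weight on the central row and column.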

Contextual Feature Enhancement Module (CFEM)
One of the primary challenges of semantic segmentation is the lack of contextual features in the whole network, so exploring global contextual features at different scales is crucial for learning the complex correlations among classes. Recent studies on the semantic segmentation of 3D point clouds have also paid attention to the extraction of global contextual features [12,34] and achieved good results. Constructing high-rank global context features directly is challenging because sufficient capacity is needed to capture extensive contextual variations [35]. To simplify high-rank feature extraction, the Contextual Feature Enhancement Module is proposed. We utilize tensor decomposition theory [36] to construct the high-rank contextual feature by combining low-rank tensors. This involves using two rank-1 kernels to generate the low-rank features, which are then aggregated to produce the final global context.
As shown in Figure 6, rank-1 kernels are first used to decompose the high-rank contextual features by dimension, which generates low-rank encodings. Next, the activation values of the Sigmoid function are added together. Finally, the result is multiplied with the input features to obtain the enhanced high-rank contextual features. This decomposition and aggregation strategy avoids the difficulties of direct high-rank feature extraction. The CFEM is computed as in Equation (3):

$$F_{out} = F_{in} \odot \big(Sig(C_{3\times1}(F_{in})) + Sig(C_{1\times3}(F_{in}))\big) \qquad (3)$$

where $F_{in}$ and $F_{out}$ represent the input feature and output feature, respectively, $\odot$ denotes element-wise multiplication, $Sig$ represents the logistic Sigmoid function, and $C_{3\times1}$ and $C_{1\times3}$ represent the 3 × 1 and 1 × 3 convolutions, respectively.
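The decompose, activate, aggregate and multiply steps described above can be written as a small gating function. This is a single-channel sketch under the assumption that each rank-1 kernel is a zero-padded 3 × 1 or 1 × 3 convolution; the names `conv2d`, `sigmoid` and `cfem` are hypothetical and the real module operates on multi-channel features.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d(x, k):
    """Single-channel cross-correlation with zero 'same' padding."""
    ph, pw = k.shape[0] // 2, k.shape[1] // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    return np.einsum("ijkl,kl->ij", sliding_window_view(xp, k.shape), k)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cfem(f_in, k31, k13):
    """Gate the input with the sum of two Sigmoid-activated rank-1 encodings."""
    gate = sigmoid(conv2d(f_in, k31)) + sigmoid(conv2d(f_in, k13))
    return f_in * gate  # element-wise re-weighting of the input features
```

Since each Sigmoid output lies in (0, 1), the gate lies in (0, 2): locations supported by both the vertical and the horizontal context are amplified, and the others are attenuated.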

Enhanced Data Augmentation (EDA)
Inspired by [37], global scaling and global translation are employed in training to provide more sample information and improve the model's generalization ability. Global scaling increases the diversity of sample scales in the training data by randomly magnifying and shrinking the whole original point cloud together with its label information, thereby adding different scale information to the dataset. Global translation enriches the dataset samples by randomly translating all points in each frame of the point cloud, which changes the distance between the targets and the sensor. Implementation details are shown in Figure 7. As shown in Figure 7b, global scaling draws a scalar $s$ from the uniform distribution $U(1-t, 1+t)$ with $t \in \{0.05, 0.1, 0.25\}$ and scales each point $p(x, y, z) \in P$ in every direction, so the randomly scaled point $p^*$ can be represented as $p^*(s \cdot x, s \cdot y, s \cdot z)$. Each label $a(x_c, y_c, z_c, w, l, h, \theta) \in A$ is scaled in the same way and can be represented as $a^*(s \cdot x_c, s \cdot y_c, s \cdot z_c, s \cdot w, s \cdot l, s \cdot h, \theta) \in A^*$. As shown in Figure 7c, global translation shifts each point $p(x, y, z) \in P$, so each translated point $p^*$ can be represented as $p^*(x + \Delta x, y + \Delta y, z + \Delta z)$. Each label $a(x_c, y_c, z_c, w, l, h, \theta) \in A$ is likewise converted to $a^*(x_c + \Delta x, y_c + \Delta y, z_c + \Delta z, w, l, h, \theta) \in A^*$, where $\Delta x$, $\Delta y$ and $\Delta z$ are sampled independently from the normal distribution $N(0, \sigma^2)$ with $\sigma^2 \in \{0.1, 0.2, 0.4\}$.
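Both augmentations amount to one random draw per frame applied to every point. The sketch below assumes the frame is an (N, 3) NumPy array; the function names and the explicit `rng` argument are choices made for this example, and the `var` parameter follows the reading that the listed values {0.1, 0.2, 0.4} are variances, which the source text leaves ambiguous. Per-point semantic labels travel with their points and need no change.

```python
import numpy as np

def global_scale(points, t=0.05, rng=None):
    """Scale all points of a frame by one factor s drawn from U(1 - t, 1 + t)."""
    rng = np.random.default_rng() if rng is None else rng
    s = rng.uniform(1.0 - t, 1.0 + t)
    return points * s, s

def global_translate(points, var=0.1, rng=None):
    """Shift all points of a frame by one offset drawn from N(0, var) per axis."""
    rng = np.random.default_rng() if rng is None else rng
    # Assumption: `var` is the variance, so the standard deviation is its root.
    delta = rng.normal(0.0, np.sqrt(var), size=3)
    return points + delta, delta
```

A training loader would typically apply both transforms to each frame before the polar BEV encoding, alongside the random flip and rotation kept from the baseline.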
Apart from the methods discussed above, Random Flip and Random Rotation in the baseline model are still preserved in the training of ACPNet.

Loss Function
The loss in ACPNet follows existing models [19,29]: the weighted cross-entropy loss and the Lovász-Softmax loss [38] are used together to improve the segmentation accuracy and the Intersection-over-Union (IoU), i.e., the Jaccard index.
The weighted cross-entropy loss is given by Equation (4), where $v_i$ is the frequency of each class, and $P(y_i)$ and $P(\hat{y}_i)$ correspond to the ground truth probability and the prediction probability of the model, respectively.
The Lovász-Softmax loss is given by Equation (5), where $J$ is the Lovász extension of the Jaccard index, $C$ is the number of classes, $e(c) \in [0, 1]^p$ is the vector of errors for class $c$, and $p$ is the number of pixels considered. The total loss of ACPNet is then given by Equation (6).
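The equation bodies for the three losses did not survive in this copy of the text. For orientation, the forms below are the ones commonly used in the cited works [29,38], written with the symbols defined above; they are a plausible reconstruction, not a transcription. In particular, the class weight $1/\sqrt{v_i}$ and the unweighted sum in the total loss are assumptions.

```latex
\begin{align}
L_{wce}(y, \hat{y}) &= -\sum_{i} \frac{1}{\sqrt{v_i}}\, P(y_i) \log P(\hat{y}_i) \tag{4} \\
L_{ls} &= \frac{1}{C} \sum_{c=1}^{C} J\big(e(c)\big) \tag{5} \\
L &= L_{wce} + L_{ls} \tag{6}
\end{align}
```

Under this reading, the cross-entropy term counteracts class imbalance by up-weighting rare classes, while the Lovász-Softmax term optimizes a surrogate of the IoU directly.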

Experiments
In order to evaluate the performance of ACPNet, experiments are conducted in this part. During training, the Adam optimizer is used to fit the parameters with a learning rate of 0.001 and a batch size of 2, and the maximum number of training epochs is 30. The training process is conducted on a server with an Intel Xeon Gold 5118 @ 2.30 GHz CPU and an NVIDIA RTX 3090 GPU.

Dataset and Metric
SemanticKITTI [2] is a LiDAR point cloud segmentation dataset for large-scale outdoor scenes, built on the KITTI Vision Odometry Benchmark [39]. SemanticKITTI provides 22 sequences with dense point-level annotations, and 19 main classes are used for evaluation. Among the 22 sequences, sequences 00 to 10 are used as the training set (of which sequence 08 is the validation set), and sequences 11 to 21 are used as the test set.
SemanticPOSS [3] is a challenging benchmark created by Peking University, comprising 2988 intricate LiDAR scenes with a large number of sparse dynamic instances, such as people and riders. It is smaller and sparser than other benchmarks, which makes it more challenging. The dataset is divided into six sequences, with sequence 2 designated as the test set and the remaining sequences used for training.
nuScenes [4] is a large-scale autonomous driving dataset created by Motional. It consists of 1000 scenes, each 20 s in duration and captured using a 32-beam LiDAR sensor. In total, the dataset comprises 40,000 frames and comes with an official split into training and validation sets. After similar classes are merged and infrequent ones removed, a total of 16 classes remain for LiDAR semantic segmentation.
The mean Intersection-over-Union (mIoU) [40] over all classes is used as the primary evaluation metric. The mIoU is defined in Equation (7):

$$mIoU = \frac{1}{n} \sum_{c=1}^{n} \frac{TP_c}{TP_c + FP_c + FN_c} \qquad (7)$$

where $TP_c$, $FP_c$ and $FN_c$ represent the numbers of True Positive, False Positive and False Negative predictions for class $c$, respectively, and $n$ is the number of classes. FPS is also used as an evaluation metric and is measured on a single NVIDIA RTX 2080Ti GPU. Note that a LiDAR semantic segmentation model is considered real-time when it reaches 10 FPS on an NVIDIA RTX 2080Ti. This is because the computing power of this GPU is comparable to that of current mainstream on-board computing platforms, and the acquisition frequency of the Velodyne HDL-64E LiDAR used in the SemanticKITTI dataset is 10 Hz.
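The metric can be computed from a confusion matrix over per-point labels. This is a minimal sketch rather than the official benchmark evaluation code: the function name `mean_iou` is introduced here, and skipping classes that appear in neither the ground truth nor the predictions is a design choice of this example.

```python
import numpy as np

def mean_iou(y_true, y_pred, n_classes):
    """Return (mIoU, per-class IoU) for integer label arrays of equal length."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # conf[i, j] counts points of true class i that were predicted as class j.
    conf = np.bincount(n_classes * y_true + y_pred,
                       minlength=n_classes ** 2).reshape(n_classes, n_classes)
    tp = np.diag(conf)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    denom = tp + fp + fn
    # Classes absent from both arrays get NaN and are left out of the mean.
    iou = np.where(denom > 0, tp / np.maximum(denom, 1), np.nan)
    return np.nanmean(iou), iou
```

For example, with true labels [0, 0, 1, 1, 2, 2] and predictions [0, 1, 1, 1, 2, 0], the per-class IoUs are 1/3, 2/3 and 1/2, giving an mIoU of 0.5.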

Results on SemanticKITTI, SemanticPOSS and nuScenes
In order to verify the effectiveness of ACPNet, its performance is evaluated on the SemanticKITTI, SemanticPOSS and nuScenes datasets. In this section, ACPNet is compared with several current mainstream methods.
Table 1 shows the results on the SemanticKITTI test set. Compared to the baseline model PolarNet, ACPNet achieves a significant performance improvement. In particular, the IoU is improved in 17 out of 19 classes, with gains of more than 5% for traffic participants such as trucks, buses, motorcycles, motorcyclists and bicyclists. Furthermore, the mIoU over all classes is increased by 5.1%, reaching 59.4%. Compared with the other methods in Table 1, ACPNet has advantages in seven classes, and its IoU for the car class is outstanding. Regarding speed, ACPNet runs at more than 14 FPS, meeting the demand for real-time autonomous driving. Table 2 shows the results on the SemanticPOSS test set, where ACPNet significantly outperforms the compared methods in terms of mIoU. Additionally, ACPNet achieves the highest results in seven classes, including rider, plants and traffic sign.
Table 3 shows the results on the nuScenes validation set. ACPNet achieves an mIoU of 72.8%, which is 2.9% higher than the baseline model PolarNet. Besides, the IoU of ACPNet is improved in 15 out of 16 classes, including traffic participant classes such as car, bus, bicycle and motorcycle. The experimental results show that our method performs semantic segmentation of LiDAR point clouds effectively and outperforms other methods. Some visualizations of the prediction results on the SemanticKITTI dataset are shown in Figure 8.

Ablation Studies
To investigate the individual contribution of each module over the baseline model PolarNet [1], ablation studies are conducted on the validation set of the SemanticKITTI dataset (seq 08). The studied modules include the Contextual Feature Enhancement Module (CFEM), the Asymmetric Convolution Backbone Network (ACBN), and the Enhanced Data Augmentation (EDA). GS and GT stand for global scaling and global translation, respectively. The results of the ablation experiments are presented in Table 4. By adding the CFEM, the mIoU of the model is improved by 1.6%. This result indicates that the module is able to mine and extract contextual features while avoiding the difficulty of directly extracting high-rank features.
By adding the ACBN, the mIoU of the model is improved by 1.1%, because the asymmetric convolutions strengthen the skeleton of the square convolution kernel.
By adding the EDA, the richness of the training samples is increased through global scaling and global translation. The mIoU is improved by another 0.5% on the SemanticKITTI validation set, reaching 59.6%. The finer-grained ablation results show that global scaling and global translation have broadly similar effects, with global scaling being slightly stronger.
From the results of the ablation experiments, it can be concluded that all the methods proposed in this paper lead to gains in performance.

Influence of Grid Density
In this section, the influence of grid density on the model is analyzed. When the original point clouds are partitioned, both segmentation accuracy and speed are affected by the grid density. To verify whether higher speed can be achieved by sacrificing some accuracy, ACPNet-mini is designed by varying the grid density. The grid sizes of ACPNet and ACPNet-mini are 480 × 360 × 32 and 320 × 240 × 32, respectively, where the three dimensions represent radius, angle and height.
According to Table 5, ACPNet-mini trades 1.9% of mIoU for a 33.3% improvement in running speed by reducing the computation. Besides, ACPNet achieves real-time operation without introducing additional computation while providing a large improvement in mIoU over the baseline model.
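To put the two grid sizes in perspective, the number of polar cells can be compared directly. The short calculation below uses only the grid dimensions stated above; the variable names are illustrative.

```python
# Grid sizes as (radius, angle, height) cell counts, taken from the text above.
full_grid = (480, 360, 32)
mini_grid = (320, 240, 32)

def cell_count(grid):
    """Total number of cells in a 3D grid."""
    n = 1
    for size in grid:
        n *= size
    return n

# Fraction of the full-size grid that ACPNet-mini has to process.
cell_ratio = cell_count(mini_grid) / cell_count(full_grid)
```

The ratio is 4/9, so ACPNet-mini processes fewer than half as many cells. The reported speed-up of 33.3% is smaller than this reduction would suggest, which is consistent with part of the runtime not scaling with the grid size.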

Conclusions
An efficient real-time LiDAR point cloud semantic segmentation model, ACPNet, is proposed in this paper. The Asymmetric Convolution Backbone Network and the Contextual Feature Enhancement Module are proposed to improve the feature extraction ability of the model, and Enhanced Data Augmentation methods are used to enrich the diversity of the training samples. Compared with the baseline network PolarNet, the mIoU of ACPNet on the SemanticKITTI, SemanticPOSS and nuScenes datasets is improved by 5.1%, 1.6% and 2.9%, respectively. Meanwhile, ACPNet runs at 14 FPS, which basically meets the real-time requirements of autonomous driving scenarios. Besides, ACPNet-mini is designed by reducing the grid density in the point cloud encoding stage, significantly increasing the speed at the expense of a small loss in segmentation accuracy. In summary, ACPNet essentially satisfies the demands of real-time semantic segmentation of LiDAR point clouds for autonomous driving.

Discussion
In the future, we will continue to investigate more general and effective methods to enhance performance. Additionally, we plan to expand our approach to achieve end-to-end 3D panoptic segmentation on LiDAR point clouds for autonomous driving.

Funding:
The generous support of the Discipline Construction of Computer Science and Technology of Shanghai Polytechnic University under Grant B60KY150002-02 is gratefully acknowledged.

Figure 2. In contrast to square kernels, horizontal and vertical kernels demonstrate greater resilience against flipping.

Figure 7. Visualization of the original scene and enhanced data augmentation methods (shown in BEV).(a) Original scene, (b) Scene augmented by global scaling, (c) Scene augmented by global translation.

Figure 8. Visualization on the SemanticKITTI validation set. (a,b) LiDAR raw data and semantic segmentation ground truth; (c,d) predictions for this frame by PolarNet and by our method. The red circles mark areas where the segmentation results differ.

Table 1. Evaluation results of ACPNet and existing methods on the SemanticKITTI test set.

Table 2. Evaluation results of ACPNet and existing methods on the SemanticPOSS test set.

Table 3. Evaluation results of ACPNet and existing methods on the nuScenes validation set.

Table 4. Ablation studies for network components on the SemanticKITTI validation set (seq 08).

Table 5. Experiments with different grid sizes on the SemanticKITTI validation set (seq 08).