Real-Time Semantic Segmentation of 3D Point Cloud for Autonomous Driving

Abstract: Autonomous vehicles perceive objects through various sensors. Cameras, radar, and LiDAR are generally used as vehicle sensors, each of which has its own characteristics. For example, cameras are used for a high-level understanding of a scene, radar is applied to weather-resistant distance perception, and LiDAR is used for accurate distance recognition. The ability of a camera to understand a scene has increased dramatically with the recent development of deep learning. In addition, technologies that emulate other sensors using a single sensor are being developed. Therefore, in this study, a LiDAR data-based scene understanding method was developed through deep learning. Deep learning approaches to LiDAR data are mainly divided into point, projection, and voxel methods. The purpose of this study is to apply a projection method to secure real-time performance. The convolutional neural network methods used with conventional camera images can be easily applied to the projection method. In addition, an adaptive break point detector method used for conventional 2D LiDAR information is utilized to solve the misclassification caused by the conversion from 2D into 3D. The results of this study are evaluated through a comparison with other technologies.


Introduction
Scene understanding through semantic segmentation is one of the components of the perception system used in autonomous vehicles. Autonomous vehicles understand the overall situation through multiple onboard sensors. Typically, information is acquired through radar, cameras, and LiDAR. Each sensor has its own advantages and disadvantages. The perception system configures the sensors in a complementary relationship to compensate for their individual shortcomings. In addition, more accurate results are sought through the redundant information provided by each sensor. Greater stability can be secured when there is more redundant information with high reliability. To acquire redundant information, technologies that emulate the functions of different sensors with a single sensor are being developed [1][2][3].
Semantic segmentation is a field of computer vision. Scene understanding can be divided mainly into classification, detection, and segmentation. Classification is a method of predicting a label for an image. Detection is a method of predicting the position of an object in an image while predicting its label. Segmentation is a task that divides an image into meaningful units, that is, a method of predicting a label for every pixel.
The use of semantic information is increasing in areas such as localization, object detection, and tracking, which are the roles of LiDAR in autonomous vehicles. It is used to improve algorithms such as loop closure in simultaneous localization and mapping [1] or to increase the performance of object tracking. Methods using deep learning have been proposed for the semantic segmentation of 3D LiDAR data. Such methods can be mainly divided into three types: point methods, which use the original raw data without preprocessing; voxel grid methods, which standardize and reduce the number of data; and 2D projection methods, which use a 2D projection similar to an image [4]. Although a point method is robust against data distortion because it uses the original raw data, it has difficulty guaranteeing real-time performance. For a voxel grid method, the distortion rate and computation speed vary depending on the size of the grid. Finally, in a 2D projection, the data are simplified by converting the coordinate system from 3D into 2D.
A real-time LiDAR semantic segmentation method was developed in this study. LiDAR data were projected from the 3D coordinate system into a 2D image coordinate system. The segmentation of each pixel of the 2D-projected LiDAR image was inferred using a convolutional neural network, and the inferred results were applied back to the 3D coordinate system to segment the LiDAR data. In this paper, a modified version of an existing image semantic segmentation network is proposed by considering the characteristics of the point cloud. A filter using an adaptive break point detector (ABD) was used to reduce the misclassification that occurs when data inferred in the 2D coordinate system are applied to the 3D coordinate system. The resulting system operates faster than the measurement speed of the LiDAR sensor (approximately 10 Hz) and performs semantic segmentation of LiDAR data with a reliable level of inference.

Related Work
Scene perception in autonomous vehicles has made rapid progress with the advent of deep learning. In particular, techniques such as semantic segmentation have been developed. However, semantic segmentation requires a large amount of computing power. This problem has been significantly resolved through parallel processing using a graphics processing unit. In addition, research on the weight pruning of deep neural networks (DNNs), such as MobileNetV2, has been conducted [5].
Studies on semantic segmentation are also being conducted for LiDAR data, following the semantic segmentation of images. An indirect method for obtaining semantic segmentation information of an image through calibration was developed, and a method for directly applying LiDAR data to a DNN and achieving the semantic segmentation of LiDAR data has also been applied [1]. Methods for directly applying LiDAR data to DNNs are of three main types: applying a 3D convolution by splitting the 3D space into voxels of a given size, applying a 2D convolution by using a multi-view image as an input, and directly applying the point cloud to the network [6].
PointNet, proposed by Qi et al., uses a transform network, which is an end-to-end DNN that learns the characteristics directly from a point cloud [6]. It was applied to 3D object perception and 3D semantic segmentation. Subsequently, an improved PointNet++ was proposed to learn local characteristics [7]. VoxelNet, proposed by Zhou et al., was the first to employ an end-to-end DNN in the 3D domain. The data were simplified and standardized for application to a DNN by splitting the space into voxels and expressing only certain points as voxels [8]. SqueezeSeg, proposed by Wu et al., projects the point cloud onto the image coordinate system for use in a 2D convolution [9]. In addition, a conditional random field, used in image semantic segmentation, was applied. The results showed a faster performance than the measurement speed of the sensor by projecting onto the image coordinate system. Improved networks, SqueezeSegV2 and V3, were subsequently proposed [2]; as with SqueezeSeg, they project the point cloud onto the image coordinate system.
A DNN requires a large amount of data to extract features. The Cityscapes and Mapillary datasets are mainly used in the semantic segmentation of images. In this study, the SemanticKITTI dataset was used to obtain 3D semantic segmentation training data [3,10].

Method
The purpose of this study was to conduct a semantic segmentation of LiDAR data that can be used in perception systems of autonomous vehicles. The method for projecting LiDAR data into a 2D image coordinate system, and the configuration and characteristics of a DNN for a semantic segmentation of the projected image, are described in this section. Postprocessing using an ABD was used to limit the misclassification when recovering the LiDAR data from the inferred image. After projecting from (A) the input to the 2D image coordinate system, the semantic segmented LiDAR data were output by passing through (B) the proposed DNN and (C) the ABD filter, as shown in Figure 1.

LiDAR Point Cloud Representation
The method for projecting 3D LiDAR data (x, y, z, i) onto the image coordinate system (u, v) for a 2D convolution application is described in this section. The LiDAR coordinate system can be projected onto the image coordinate system using a spherical coordinate system. Because 3D data are projected onto a 2D coordinate system, the projection according to the sensor measurement method generates data with a large amount of noise, such as the overlapping of objects. To prevent this noise, only the data with the shortest path from the sensor are represented in the image. The 360° data acquired from the sensor are projected [11]:

u = (1/2)[1 − arctan(y, x)/π] · W
v = [1 − (arcsin(z/r) + f_down)/f] · H    (1)

where (W, H) denote the width and height of the image, and (x, y, z) denote the LiDAR coordinates. Here, f = f_up + f_down is the vertical field of view of the sensor, and r = √(x² + y² + z²) [12]. The [x, y, z, r, i] image is generated by projecting the LiDAR data and r onto the converted coordinate system using Equation (1). The created image is input into the network in the form of [W × H × 5]. A spherical projection is shown in Figure 2.
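As an illustration, the projection of Equation (1) can be sketched in Python with NumPy. The field-of-view values below are assumptions typical of a 64-channel sensor, not the paper's exact calibration; points are written in order of decreasing range so that, per pixel, only the point with the shortest path from the sensor survives.

```python
import numpy as np

def spherical_projection(points, W=1024, H=64, fov_up=3.0, fov_down=-25.0):
    """Project LiDAR points (N, 4: x, y, z, i) onto an (H, W, 5) image.

    fov_up/fov_down are in degrees; the defaults are illustrative
    assumptions for a typical 64-channel sensor.
    """
    f_up = np.radians(fov_up)
    f_down = np.radians(abs(fov_down))
    f = f_up + f_down                      # total vertical field of view

    x, y, z, i = points[:, 0], points[:, 1], points[:, 2], points[:, 3]
    r = np.sqrt(x**2 + y**2 + z**2)        # range of each point

    # Equation (1): horizontal angle -> u, vertical angle -> v
    u = 0.5 * (1.0 - np.arctan2(y, x) / np.pi) * W
    v = (1.0 - (np.arcsin(z / r) + f_down) / f) * H

    u = np.clip(np.floor(u), 0, W - 1).astype(np.int32)
    v = np.clip(np.floor(v), 0, H - 1).astype(np.int32)

    # Fill in order of decreasing range so nearer points overwrite
    # farther ones: each pixel keeps the closest measurement.
    order = np.argsort(-r)
    img = np.zeros((H, W, 5), dtype=np.float32)
    img[v[order], u[order]] = np.stack([x, y, z, r, i], axis=1)[order]
    return img
```

A point straight ahead of the sensor, for example, lands in the middle column of the image, since arctan(y, x) = 0 there.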

Network Structure
The proposed network uses DeepLabV3+ as the main network, with partial convolution and semilocal convolution as the main convolution layers.

Partial Convolution
A partial convolution is a padding method proposed by Nvidia. The size of the input data decreases as convolution and pooling proceed, and as the depth of the network increases, the data may be excessively reduced and information lost. Padding extends the borders of the input data by filling them with a specific value to prevent this loss. Generally, zero padding is used. However, data containing errors are obtained at the border of the image if it is filled with zero or another fixed value. A partial convolution conditions the output on the valid input data by adding a binary mask that defines 0 as a hole and 1 as a non-hole. Through this mechanism, a partial convolution mitigates data loss; here, it is applied to the holes and border errors generated when LiDAR data are projected onto the 2D coordinate system [13]. The partial convolution is shown in Figure 3.
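The masked re-weighting rule of a partial convolution can be sketched as follows. This is a didactic single-channel NumPy version (stride 1, square kernel) written for clarity, not Nvidia's optimized implementation; the function name and interface are illustrative.

```python
import numpy as np

def partial_conv2d(x, mask, kernel):
    """Single-channel partial convolution (stride 1, zero padding).

    x: (H, W) input; mask: (H, W) binary, 1 = valid, 0 = hole;
    kernel: (k, k) weights. The output at each pixel is computed only
    from valid inputs and re-scaled by (window size / valid count).
    """
    k = kernel.shape[0]
    pad = k // 2
    xp = np.pad(x * mask, pad)             # holes contribute zeros
    mp = np.pad(mask, pad)                 # padded border counts as holes
    H, W = x.shape
    out = np.zeros_like(x, dtype=float)
    new_mask = np.zeros_like(mask)         # updated mask for the next layer
    for i in range(H):
        for j in range(W):
            valid = mp[i:i+k, j:j+k].sum()
            if valid > 0:
                out[i, j] = (kernel * xp[i:i+k, j:j+k]).sum() * (k * k / valid)
                new_mask[i, j] = 1
    return out, new_mask
```

A useful sanity check of the re-weighting: with a constant input and an averaging kernel, the output stays constant even at borders and around holes, which zero padding alone does not achieve.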

Semilocal Convolution
A semilocal convolution uses the fact that, unlike the image data of a camera, LiDAR data projected into the 2D coordinate system measure data in a fixed region of space. This convolution is applied by dividing the input data into α regions. A different kernel can be applied to each region, and the convolution weights are shared within each region. Because the input data are divided by region and weighted accordingly, the network can learn the characteristics of the LiDAR data [14]. The semilocal convolution is shown in Figure 4.
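A minimal sketch of this band-wise weight sharing, assuming the α regions split the projected image along the vertical axis; the function name and the per-band kernels are illustrative, not the paper's implementation.

```python
import numpy as np

def semilocal_conv(x, kernels):
    """Apply a different kernel to each of alpha horizontal bands.

    x: (H, W) projected LiDAR image; kernels: list of alpha (k, k)
    filters, one per band. Within a band, weights are shared as in a
    normal convolution; across bands they are not.
    """
    alpha = len(kernels)
    H = x.shape[0]
    band_h = H // alpha                    # assume H is divisible by alpha
    out = np.zeros_like(x, dtype=float)
    for a, ker in enumerate(kernels):
        band = x[a * band_h:(a + 1) * band_h]
        k = ker.shape[0]
        pad = k // 2
        bp = np.pad(band, pad)             # 'same' output size per band
        for i in range(band.shape[0]):
            for j in range(band.shape[1]):
                out[a * band_h + i, j] = (ker * bp[i:i+k, j:j+k]).sum()
    return out
```

With α = 2 and two distinct 1 × 1 kernels, the top and bottom halves of the output are scaled differently, which makes the regional weight separation easy to verify.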

Atrous Convolution
An atrous convolution creates and uses empty space inside the kernel, unlike a conventional convolution. For segmentation, it is better when each pixel of a DNN has a wider field of view. A conventional approach constructs a deeper DNN so that each pixel has a wider field of view; however, more original information is lost as the DNN becomes deeper. An atrous convolution instead expands the field of view of a pixel by inserting empty space into the kernel. This is advantageous for segmentation, and a lighter DNN can be configured because the field of view is expanded without added depth. Here, r represents the size of the empty space (the dilation rate). Different rates r can be used to obtain multiscale features simultaneously [15]. An atrous convolution is shown in Figure 5. This convolution is used in a network, as shown in Figure 6.
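The effect of the rate r on the receptive field can be shown with a one-dimensional sketch: a kernel of size k with rate r covers r·(k − 1) + 1 input samples while using only k weights. The function below is illustrative.

```python
import numpy as np

def atrous_conv1d(x, kernel, rate):
    """1D atrous (dilated) convolution over the valid extent.

    With rate r, kernel taps are spaced r apart, so a size-k kernel
    spans a receptive field of r*(k-1)+1 samples without extra depth.
    """
    k = len(kernel)
    span = rate * (k - 1) + 1              # effective receptive field
    n_out = len(x) - span + 1
    out = np.empty(n_out)
    for i in range(n_out):
        # sample the input every `rate` steps under the kernel taps
        out[i] = sum(kernel[t] * x[i + t * rate] for t in range(k))
    return out
```

With rate 1 this reduces to an ordinary valid convolution; increasing the rate widens the span without adding weights, which is the property the text exploits.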

DeepLabV3+
DeepLabV3+, used in image semantic segmentation, was applied as the backbone. DeepLabV3+ is designed for image semantic segmentation with an encoder-decoder structure. Four versions have been proposed, from V1 to V3+: an atrous convolution was proposed in V1, atrous spatial pyramid pooling in V2, and a ResNet structure applying atrous convolution in V3. The V3+ used in this study employs an atrous separable convolution [15].


Network Details
Data are received in the form of 64 × 1024 × 5 (H × W × C) as input. An encoder-decoder structure is used with DeepLabV3+ as the backbone. Xception-41 is used as the backbone network, and the entry part is replaced with a partial convolution and a semilocal convolution, considering that the inputs are LiDAR data. Cross entropy is used as the loss function [14].
CE = −Σ_i Σ_c ŷ_i^c log(y_i^c)

This loss function is applied most frequently. Here, ŷ_i^c denotes the one-hot encoded ground truth for class c at pixel position i, and y_i^c denotes the softmax prediction. Generally, semantic segmentation is evaluated using the mean intersection over union (mIoU). The goal of training is to minimize the cross entropy so as to reach a high mIoU. The modified 3D model of Xception is shown in Figure 7. The network structure is shown in Figure 8.
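The cross-entropy loss used by the network can be sketched in a few lines; the small epsilon guards against log(0) and is a standard numerical precaution, not part of the paper's definition.

```python
import numpy as np

def cross_entropy(y_hat, y_pred, eps=1e-12):
    """Pixel-wise cross entropy summed over all pixels and classes.

    y_hat: (N, C) one-hot ground truth; y_pred: (N, C) softmax outputs.
    Only the log-probability of each pixel's true class contributes.
    """
    return -np.sum(y_hat * np.log(y_pred + eps))
```

For a single pixel whose true class receives softmax probability 0.5, the loss is −log(0.5) = log 2 ≈ 0.693.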

Postprocessing
Conversion from a 3D coordinate system into a 2D coordinate system causes errors because only the data with the shortest path from the sensor are shown as a representative point in the 2D image coordinate system. A misclassification occurs when the data that were classified in the 2D image coordinate system are applied to the 3D data. In this study, a filter using an ABD was applied to reduce the misclassification. The ABD was originally used as a clustering method for 2D LiDAR data.
When the distance ||p_n − p_{n−1}|| between consecutive points is greater than the threshold circle (D_max), the ABD designates this as a break point [16]. If the threshold of the circle is too small, points of the same object are separated, and if it is too large, different objects are merged. The pseudocode of the ABD, shown in Table 1, adapts the threshold according to ∆θ and r, as shown in Figure 9. Here, ∆θ = θ_n − θ_{n−1}, λ denotes a user-definable constant, and σ_r is the sensor noise associated with r; the range of influence of the circle is wider when λ is small or σ_r is large. The distance r and height h of the data projected onto a pixel [u, v] were passed through the ABD filter in order of distance, and the classification result was applied by designating the data up to each break point as one object. The parameter values were determined empirically: in this system, λ is 10 and σ_r is 2. The ABD is shown in Figure 9. The ABD postprocessing is shown in Figure 10. Table 1. Adaptive break point detector.
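The ABD rule of Table 1 can be sketched for a single range scan as follows, using the adaptive threshold D_max = r_{n−1}·sin(∆θ)/sin(λ − ∆θ) + 3σ_r from Borges and Aldon's formulation. The default λ and σ_r below are illustrative values chosen for the toy scan in the usage note; the paper reports λ = 10 and σ_r = 2 in its own units.

```python
import numpy as np

def abd_break_points(r, theta, lam=np.radians(10), sigma_r=0.02):
    """Adaptive break point detector over a scan ordered by bearing.

    r, theta: arrays of ranges and bearings (radians). Returns the
    indices where a new object starts, i.e. where the gap between
    consecutive points exceeds the adaptive threshold D_max.
    """
    breaks = []
    for n in range(1, len(r)):
        dtheta = theta[n] - theta[n - 1]
        # Adaptive threshold: grows with range r and sensor noise sigma_r
        d_max = r[n - 1] * np.sin(dtheta) / np.sin(lam - dtheta) + 3 * sigma_r
        # Euclidean gap between consecutive points (polar -> Cartesian)
        p_prev = r[n - 1] * np.array([np.cos(theta[n - 1]), np.sin(theta[n - 1])])
        p_cur = r[n] * np.array([np.cos(theta[n]), np.sin(theta[n])])
        if np.linalg.norm(p_cur - p_prev) > d_max:
            breaks.append(n)
    return breaks
```

On a toy scan of a near surface followed by a far surface (ranges 5, 5, 5, 10, 10 at evenly spaced bearings), only the jump from 5 to 10 exceeds the adaptive threshold, so a single break point is reported there.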

Experiments
The network was trained and evaluated using the SemanticKITTI dataset. SemanticKITTI provides labeled LiDAR data based on the KITTI dataset. The dataset consists of more than 43,000 scans. The data are organized in sequences 00 to 21; the 21,000 scans from sequences 00 to 10 can be used for training because they provide the ground truth, and sequences 11 to 21 are used as test data. The dataset provides 28 classes including moving objects, which were merged into 19 classes for the experiment [10].
For training the network from scratch, the base learning rate was 0.03, the weight decay was 0.000015, and the batch size was 38. The hardware and software configurations are presented in Tables 2 and 3, respectively. The mIoU, which is mainly used in semantic segmentation evaluations, was employed to evaluate the inference results [12]. The evaluation is shown in Table 4.
mIoU = (1/C) Σ_c TP_c / (TP_c + FP_c + FN_c)

Here, TP_c, FP_c, and FN_c are the true positive, false positive, and false negative predictions of class c, respectively, and C is the number of classes. For the evaluation, the network was improved by applying DeepLabV3+, a 2D semantic segmentation network, as a backbone for 3D LiDAR data. An evaluation of the network inference results is shown in Table 4, which contains the results according to the image size and the decoder stride of DeepLabV3+. The network results are shown in Figure 11.
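The mIoU metric described above can be computed as follows; the handling of classes absent from both prediction and ground truth (skipped to avoid division by zero) is a common convention and an assumption here, not a detail stated in the paper.

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """mIoU = (1/C) * sum_c TP_c / (TP_c + FP_c + FN_c).

    pred, gt: integer label arrays of the same shape. Classes absent
    from both prediction and ground truth are skipped to avoid 0/0.
    """
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))   # true positives
        fp = np.sum((pred == c) & (gt != c))   # false positives
        fn = np.sum((pred != c) & (gt == c))   # false negatives
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return float(np.mean(ious))
```

For example, with predictions [0, 0, 1, 1] against ground truth [0, 1, 1, 1], class 0 scores IoU 1/2 and class 1 scores 2/3, giving an mIoU of 7/12.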
It was found that only the data projected on the 2D image were classified through the proposed filter, as shown in Figure 12. The white data are unclassified data.

Conclusions
A 2D network was designed for the semantic segmentation of 3D LiDAR data, and a semantic segmentation algorithm using the network was proposed. The error propagation that is a disadvantage of 2D classification was reduced by using an ABD filter as postprocessing to reduce the classification error of the 2D network. Finally, semantic segmentation was conducted on 3D LiDAR data. The practicality of the approach was demonstrated by its computing speed of 13 Hz, considerably faster than the sensor measurement speed of 10 Hz.
Further studies are planned on developing the network to improve classes with low classification accuracy and on weight pruning.