Real-Time LiDAR Point Cloud Semantic Segmentation for Autonomous Driving

Abstract: LiDAR has been widely used in autonomous driving systems to provide high-precision 3D geometric information about the vehicle's surroundings for perception, localization, and path planning. LiDAR-based point cloud semantic segmentation is an important task with a critical real-time requirement. However, most existing convolutional neural network (CNN) models for 3D point cloud semantic segmentation are very complex and can hardly be processed in real time on an embedded platform. In this study, a lightweight CNN structure is proposed for projection-based LiDAR point cloud semantic segmentation with only 1.9 M parameters, an 87% reduction compared to state-of-the-art networks. When evaluated on a GPU, the processing time was 38.5 ms per frame, and it achieved a 47.9% mIoU score on the Semantic-KITTI dataset. In addition, the proposed CNN is targeted on an FPGA using an NVDLA architecture, which results in a 2.17x speedup over the GPU implementation and a 46-times improvement in power efficiency.


Introduction
Nowadays, LiDAR sensors have become indispensable for emerging autonomous driving vehicles. LiDAR sensors are usually installed on autonomous cars for perception [1], mapping [2], and positioning [3]. Compared with a traditional camera sensor, LiDAR can capture very precise distance measurements of the surrounding environment. One of the main tasks for LiDAR-based processing in the Semantic-KITTI challenge [4] is real-time point cloud semantic segmentation.
In general, LiDAR point cloud segmentation networks can be divided into two major subcategories: projection-based methods and point-wise methods. The projection-based method projects the 3D point cloud into a 2D spherical projection [5] or a bird's eye view (BEV) [6]; subsequently, deep neural networks for 2D images can be directly employed. As described in [7], the point-wise method operates directly on the original 3D points, or divides the 3D space into voxel grids and utilizes neural networks to extract features from the grids. Although the number of network parameters of a point-wise method is slightly lower, the computational load is much higher owing to the fully connected operations applied to all points. The projection-based method can achieve comparable, state-of-the-art accuracy while running significantly faster. In this study, we followed this method based on spherical projection.
Most of the existing point cloud segmentation networks are very complex in structure and have a large number of parameters. Even when running on the latest GPUs, they can rarely match the real-time rate of a LiDAR sensor at 10 Hz. This prevents them from being applied directly to an autonomous driving system. Therefore, a lightweight CNN at a comparable accuracy is highly desirable for those time-critical applications.
In this article, we propose a real-time LiDAR point cloud semantic segmentation network, which can run in real time on a GPU. In addition, we propose several optimization techniques aimed at converting the ordinary CNN structure into a hardware-friendly structure. As a result, the proposed CNN is successfully targeted on an FPGA with a processing time of only 17.7 ms. That is much faster than the 10 Hz data rate of the LiDAR sensor, leaving ample time for other perception and planning tasks in the vehicle controller. The contributions of this article are summarized as follows: (1) To our knowledge, this is one of the first end-to-end FPGA implementations for real-time LiDAR point cloud semantic segmentation. A LiDAR sensor is directly connected to the FPGA via an Ethernet port. After pre-processing by the LiDAR driver on the embedded processor inside the FPGA, the point cloud is stored in the DDR memory that can be accessed by the on-chip CNN hardware accelerator.
(2) A real-time, lightweight CNN is proposed, whose mIoU score was 47.9% on the Semantic-KITTI dataset. The network extracts features from two branches: one shallow branch for spatial information and one deep branch for context information. Its inference time on an NVIDIA RTX 3090Ti was about 38.5 ms.
(3) By balancing the on-chip memory and multipliers, our proposed network achieved a real-time processing speed of 56 frames per second (fps) when targeted on the ZCU104 MPSoC FPGA platform.
The rest of the article is organized as follows: Section 2 summarizes the existing research on LiDAR point cloud semantic segmentation and the FPGA implementations of segmentation networks. In Section 3, the proposed segmentation network model is described in detail with evaluation performance on the GPU. The FPGA system architecture and its implementation results are discussed in Sections 4 and 5, respectively. Finally, Section 6 concludes the entire article.

Semantic Segmentation of LiDAR Point Clouds
In recent years, the use of deep neural networks for semantic segmentation of 3D LiDAR point clouds has made great progress [7][8][9][10][11][12]. These mainstream network structures for semantic segmentation basically use fully convolutional networks [13], multi-branch models [14], and encoder-decoder structures [2]. However, these network structures struggle to balance processing speed and segmentation accuracy. For these reasons, we propose a multi-branch network model that extracts spatial features and context features separately and then fuses these features to generate the semantic information, thus achieving both a real-time processing speed and a high segmentation accuracy.
The core difference between these advanced methods lies not only in the network design but also in the representation of the point-cloud data. Point-wise methods directly process the original irregular 3D points without any pre-processing, such as the mainstream PointNet [15] and its successor PointNet++ [16], along with related point-wise networks [17,18]. Although such methods are powerful on small point clouds, the computation load on larger point-cloud datasets becomes very heavy and requires a much longer processing time.
The projection-based method projects the 3D point cloud into a 2D bird's eye view (BEV) [6] or a spherical projection [5]. PolarNet [19] used the BEV projection, which projects the point-cloud data into the BEV representation in polar coordinates. SqueezeSeg [8], SqueezeSegV2 [10], SqueezeSegV3 [20], and RangeNet++ [9] utilized the spherical-projection mechanism to convert a 3D point cloud into a frontal-view image or a spherical-projection image and adopted standard 2D convolutional networks in image space for semantic segmentation. SalsaNext [7] made a series of improvements to the backbone network in SalsaNet [21], adding a new global context block and an improved encoder-decoder [22] to achieve state-of-the-art results in 3D LiDAR semantic segmentation using a spherical-projection image as input.

FPGA Implementations of Segmentation Networks
In autonomous driving or advanced driver-assistance systems (ADAS), the LiDAR point-cloud-segmentation algorithm must fulfill a real-time requirement, so it is often implemented on embedded platforms such as ASICs, FPGAs, or mobile CPU/GPU processors [23,24]. Some previous works [25,26] mainly used a one-dimensional systolic array to accelerate matrix multiplication on an FPGA, which achieved efficient resource utilization and a low bandwidth. A newly proposed neural-network architecture [27] used a single-instruction multiple-data (SIMD) structure for matrix multiplication. Continental AG released the assisted autonomous driving control unit (ADCU) based on the Zynq UltraScale+ MPSoC chip; the control unit supports LiDAR processing, but no technical details were disclosed. In [28,29], the LiDAR sensor was connected to a PC via Ethernet; after pre-processing the LiDAR point cloud on the PC, feature maps were fed to the neural-network accelerator in the FPGA. NVIDIA proposed the DRIVE AGX autonomous driving computer platform based on the Xavier SoC chip, which can process the point-cloud data received from the LiDAR sensor.

Proposed Network
In this section, we describe our method in detail from the point-cloud representation, the CNN architecture, and the network training. Finally, the performance evaluation of the network is given.

Spherical Projection of LiDAR Point Cloud
As in [9], we projected a sparse 3D LiDAR point cloud onto a spherical surface; the generated range-view (RV) image of the LiDAR allows standard 2D convolutions to be applied.
In the RV image, each raw LiDAR point (x, y, z) is mapped to image coordinates (u, v) as

u = (1/2)[1 − arctan(y, x)/π] · w
v = [1 − (arcsin(z/r) + |f_down|)/f] · h    (1)

where (h, w) are the height and width of the desired range-image representation, r represents the range of each point as r = √(x² + y² + z²), and f defines the sensor's vertical field of view as f = |f_down| + f_up. For each point projected to (u, v), we used its 3D coordinates (x, y, z), range r, and intensity value i as the features and stacked them along the channel dimension. In this way, a 3D LiDAR point cloud can be represented as an RV image with the shape [w × h × 5] and fed to the network, and the point-cloud segmentation is thus transformed into image segmentation.
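As an illustration, the projection of Equation (1) can be sketched in NumPy. The field-of-view limits (3° up, −25° down, typical of a 64-beam sensor) and the last-write-wins overwrite policy are assumptions of this sketch, not taken from the paper:

```python
import numpy as np

def spherical_projection(points, intensity, h=64, w=2048,
                         f_up=np.radians(3.0), f_down=np.radians(-25.0)):
    """Project an (N, 3) LiDAR point cloud to an h x w x 5 range-view image.

    Channels: x, y, z, range r, intensity i. The FOV angles are placeholders
    and must match the actual sensor.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.maximum(np.sqrt(x**2 + y**2 + z**2), 1e-8)  # range of each point
    f = abs(f_down) + f_up                             # total vertical FOV

    # Equation (1): image coordinates from yaw and pitch angles
    u = 0.5 * (1.0 - np.arctan2(y, x) / np.pi) * w
    v = (1.0 - (np.arcsin(z / r) + abs(f_down)) / f) * h

    u = np.clip(np.floor(u), 0, w - 1).astype(np.int32)
    v = np.clip(np.floor(v), 0, h - 1).astype(np.int32)

    rv = np.zeros((h, w, 5), dtype=np.float32)
    # Points sharing a pixel overwrite in array order here; sorting the
    # points by range first would implement a distance-based policy.
    rv[v, u] = np.stack([x, y, z, r, intensity], axis=1)
    return rv
```

A point straight ahead of the sensor, e.g. (10, 0, 0), lands in the middle column (u = w/2) of the image.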

Network Architecture
Our proposed network follows a multi-branch model design, extracting the spatial and contextual features separately and then fusing these features to recover the network information. The network structure is shown in Figure 1. The network is inspired by ContextNet [30], BiSeNet [31], and SalsaNext [7]. The RV image projection of the point cloud is used as the input of the network, as described in Section 3.1.

Context path: Semantic segmentation involves pixel-level classification and the spatial location information of pixel categories. Therefore, effective context information and the spatial detail information of the original image are important to semantic-segmentation tasks. In order to capture the contextual information of different global regions, we placed an input convolutional layer and two residual modules from ResNet18 [32] at the front end of the network for fast 8x down-sampling. We added global average pooling (GAP) at the end of ResNet18, thus providing a larger receptive field. Next, a global context module (GCM) was introduced to refine the context information, and we used it to guide the feature learning of the context path. In our previous work [23], the GCM was modified from the attention refinement module in [31]. As shown in Figure 2a, a GCM consists of a global-average-pooling layer and a convolutional layer that extracts global context features. These refined global features are applied to contextual feature fusion by multiplication, and a sigmoid layer determines whether to apply the global features or not. The GCM integrates the global context information easily without any up-sampling operation; therefore, its computation cost is negligible.
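A GCM along these lines can be sketched in PyTorch. Only the pooling/convolution/sigmoid structure is described above, so the layer sizes and the placement of batch normalization here are assumptions:

```python
import torch
import torch.nn as nn

class GlobalContextModule(nn.Module):
    """Sketch of a GCM: global average pooling -> 1x1 conv -> BN -> sigmoid,
    with the resulting per-channel gate multiplying the input feature map.
    Layer sizes are illustrative, not the paper's exact configuration."""
    def __init__(self, channels):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)           # global average pooling
        self.conv = nn.Conv2d(channels, channels, 1)  # extract global context
        self.bn = nn.BatchNorm2d(channels)
        self.gate = nn.Sigmoid()                      # whether to apply the context

    def forward(self, x):
        g = self.gate(self.bn(self.conv(self.pool(x))))
        return x * g                                  # fuse by multiplication
```

Because the gate is computed from a 1 x 1 pooled tensor, no up-sampling is needed before the multiplication, which matches the negligible-cost argument above.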
Spatial path: The spatial path mainly captures the spatial details of the input image and contains only four convolutional layers. The first three convolutional layers each contain a stride 2 convolution, followed by batch normalization [33] and ReLU [34]. Therefore, the output feature map extracted by this path was 1/8 of the original image. Due to the large spatial size of the feature map, it encodes rich spatial information. Figure 1 shows the details of the structure.
The feature-fusion module (FFM) [31] fuses the output features from the context path and the spatial path, as shown in Figure 2b. Since the features of the two paths are at different representation levels, they cannot simply be weighted and added; instead, they are combined through concatenation. The FFM includes a global-average-pooling layer, 1 × 1 convolutional layers, a ReLU activation layer, and a sigmoid layer. At the end of the network, the FFM output is up-sampled eight times using the bi-linear resize algorithm to produce an output of the same size as the input.
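The FFM can likewise be sketched in PyTorch. The channel counts and the exact arrangement of the gate branch are assumptions, loosely following the BiSeNet-style module the text cites:

```python
import torch
import torch.nn as nn

class FeatureFusionModule(nn.Module):
    """Sketch of an FFM: concatenate the two paths, project with a 1x1 conv,
    then re-weight the fused features with a pooled sigmoid gate.
    Layer sizes are illustrative."""
    def __init__(self, c_spatial, c_context, c_out):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(c_spatial + c_context, c_out, 1),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
        )
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),       # global average pooling
            nn.Conv2d(c_out, c_out, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(c_out, c_out, 1),
            nn.Sigmoid(),
        )

    def forward(self, sp, cx):
        x = self.fuse(torch.cat([sp, cx], dim=1))  # concatenation, not addition
        return x + x * self.attn(x)                # gated residual fusion
```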
The number of channels is chosen to be a multiple of 32. This matches the degree of parallelism that the hardware accelerator can support, in order to maximize its efficiency.

Training Details
This 3D point-cloud semantic-segmentation network was implemented using PyTorch, and it was trained on a single NVIDIA RTX 3090Ti GPU. Following previous work [9,10], we performed spherical projection processing on all points following Equation (1). We projected all points in a scan into a 64 × 2048 image. If multiple points are projected to the same pixel of a 2D image, the point with the largest distance is retained. Then, we used 2D convolution to process the range-view image to obtain a 2D predicted label map, and then we restored it to 3D space.
During training, the network was trained with an initial learning rate of 0.01, and the batch size was set to 24. In the inference process, the original point cloud via spherical projection was fed into the network, and the 2D prediction result was obtained. Then, we used the restoration operation to obtain the 3D prediction, as in the previous work [9,10].

Dataset and Evaluation
Semantic-KITTI [4] is a large-scale dataset for 3D LiDAR point-cloud segmentation, covering both semantic segmentation and panoptic segmentation. The dataset contains 22 sequences of point-cloud data. We followed the same protocol as in [9], where sequences 00-10 are the training data, sequence 08 is used for validation, and the remaining sequences 11-21 are the testing data.
Evaluation Metric: In order to evaluate the proposed method, we followed the official guidance and used the mean intersection-over-union (mIoU) as the evaluation metric defined in [4,35], which can be formulated as

mIoU = (1/C) · Σ_c TP_c / (TP_c + FP_c + FN_c)

where TP_c, FP_c, and FN_c correspond to the numbers of true-positive, false-positive, and false-negative predictions for class c, and C is the number of classes.
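The metric can be computed from a confusion matrix; the following NumPy sketch mirrors the formula above (as a simplification, a class absent from both prediction and label gets an IoU of 0 here, whereas evaluation toolkits typically exclude such classes):

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """mIoU = (1/C) * sum_c TP_c / (TP_c + FP_c + FN_c), from flat
    integer label arrays, via a C x C confusion matrix."""
    cm = np.bincount(num_classes * target + pred,
                     minlength=num_classes ** 2).reshape(num_classes, num_classes)
    tp = np.diag(cm).astype(np.float64)
    fp = cm.sum(axis=0) - tp   # predicted as c but labelled otherwise
    fn = cm.sum(axis=1) - tp   # labelled c but predicted otherwise
    iou = tp / np.maximum(tp + fp + fn, 1)  # guard against division by zero
    return iou.mean()
```

For example, with predictions [0, 1, 1, 2] against labels [0, 1, 2, 2] over three classes, the per-class IoUs are 1, 1/2, and 1/2, giving an mIoU of 2/3.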
Quantitative Results: Following previous work [9], we used the mIoU over 19 categories to evaluate the accuracy. Table 1 shows the quantitative results compared with other state-of-the-art point-wise and projection-based methods. The mean IoU score (47.9%) of our proposed model was slightly lower than that of the state-of-the-art models due to the reduced network complexity. In contrast to the baseline BiSeNet, we obtained an accuracy improvement greater than 6.5%, outperforming it in 18 of the 19 categories. In terms of the performance on individual categories, the proposed network is comparable to the other approaches.
For qualitative evaluation, Figures 3 and 4 show some semantic-segmentation results generated by the 3D point-cloud segmentation network on the Semantic-KITTI test set. In order to compare the results more easily, Figure 3 shows the results using spherical projection, with each color representing a different semantic class. We can see that ground points are divided into road and sidewalk. In particular, the road requires a significant amount of contextual information from neighboring points, since usually only a small curb distinguishes the sidewalk from the road. The network can clearly distinguish objects such as cars on the road.

Run-Time Evaluation on GPU
In autonomous driving, the processing time of the network must meet real-time requirements. In order to obtain fair statistics, all measurements were performed on the same single NVIDIA RTX 3090Ti-24GB card using the entire Semantic-KITTI dataset. The total run-time performance of our proposed network compared with other networks is shown in Table 2. As depicted in Table 2, compared with BiSeNet [31], our method clearly shows better performance, while the number of parameters was reduced by about seven times. When the uncertainty calculation is excluded and a fair comparison is made with the deterministic model, our model can run at 26 Hz. Note that this speed is significantly faster than the frame rate of mainstream LiDAR sensors, which run at 10 Hz or 20 Hz [39].

FPGA-Based System Architecture
The FPGA-based system architecture for the proposed LiDAR point-cloud semantic-segmentation network is shown in Figure 1, which is partitioned into the software part on the ARM processors and the hardware part on the programmable logic (PL). The image adjustment of the input and output of the neural network is mainly processed using the OpenCV library functions [40] running on the ARM processor. In this study, we implemented the NVIDIA Deep Learning Accelerator (NVDLA) framework on the PL side, which is interconnected with the dual-core ARM processor through the AXI4 bus. The CNN inference executes on the NVDLA architecture as an FPGA-based hardware accelerator.
The NVDLA hardware-accelerator architecture consists of the following four components: (1) calculation of the convolution engine unit; (2) the activation engine; (3) the pooling engine; (4) the feature-map (FM) buffer and the weight buffer. The 2D convolution and adder tree were embedded in the convolution engine module. The above parts were configured based on the on-chip resources available on the target FPGA platform, and the block diagram of the hardware architecture is illustrated in Figure 5.

Software Tasks on ARM Processor
As shown in Figure 6, in order to fully use the computing resources available on the FPGA, our point-cloud processing algorithm was partitioned to the ARM processor and programmable logic. The neural network input pre-processing and output-image resize were processed on the ARM processor as software, while the entire CNN inference was implemented using NVDLA on the programmable logic. All three tasks were scheduled as a pipeline, thus increasing system throughput.

Convolution Engine
As shown in Figure 7, the NVDLA standard convolution engine was configured with a convolution buffer. When the convolution operation is enabled, the addresses of the FM and weights in the main memory are configured in the registers, and the data are then pre-fetched into the FM buffer for operations, reducing the overhead of memory access. The core of the convolution engine is the multiply-accumulate (MAC) array, whose size is configurable and can be optimized for parallel operations. Taking the 3 × 3 kernel in Figure 7 as an example, it consists of nine multipliers and an adder tree. In this work, the size of the MAC array was 32 × 32.

GCM Module and FFM Module
The GCM and FFM modules require several completely different computing modes. The pooling engine of NVDLA was adopted to realize the global average pooling, which computes the average value over an entire channel. The following 1 × 1 convolution is mathematically equivalent to a vector-matrix multiplication and can be routed to the convolution-engine module. The NVDLA activation engine module supports a wide variety of linear and non-linear operations: it provides native support for linear functions and uses look-up tables (LUTs) to implement non-linear functions. Therefore, both the ReLU function and the sigmoid function can be realized.
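The equivalence that lets the 1 × 1 convolutions be routed to the convolution engine can be checked numerically: a 1 × 1 convolution is a matrix multiplication between the (C_out, C_in) weight matrix and each pixel's channel vector. A minimal NumPy demonstration (shapes chosen arbitrarily):

```python
import numpy as np

# A 1x1 convolution has no spatial extent, so applying it at every pixel
# is the same as one matrix multiply over flattened spatial positions.
rng = np.random.default_rng(0)
c_in, c_out, h, w = 8, 4, 3, 5
fmap = rng.standard_normal((c_in, h, w))
weight = rng.standard_normal((c_out, c_in))   # the 1x1 kernel

# "Convolution" form: apply the kernel independently at each location
conv = np.einsum('oi,ihw->ohw', weight, fmap)

# Matrix form: flatten the spatial dims and multiply once
matmul = (weight @ fmap.reshape(c_in, h * w)).reshape(c_out, h, w)

assert np.allclose(conv, matmul)
```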

Memory Mapping
The on-chip memory separately stores the weight data, feature maps, and global-pooling results. In order to increase the processing speed and to reduce the data-transfer rate between the FPGA and the DDR memory, NVDLA uses a ping-pong buffer mechanism to improve system efficiency. The use of two register banks reduces the reprogramming delay: the second bank is programmed while the first bank is driving the current convolution calculation.

Quantization
To maximize the computation capability of NVDLA, fixed-point operations are preferred. In this study, we chose eight-bit integer quantization for the NVDLA hardware implementation. Firstly, from a storage perspective, the memory space for eight-bit weights is only half of that for 16-bit quantization. Secondly, from a hardware-resources perspective, each DSP slice inside the FPGA can perform two eight-bit multiplications concurrently but only one 16-bit multiplication. This results in faster computations.
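As an illustration of eight-bit quantization, a symmetric per-tensor scheme can be sketched as follows. This is one common choice; the calibration actually used by the NVDLA toolchain is not detailed here:

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor INT8 quantization (an assumed scheme, not
    necessarily the one used in the paper): map floats to [-127, 127]
    with a single scale factor derived from the maximum magnitude."""
    max_abs = np.abs(x).max()
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from the INT8 codes."""
    return q.astype(np.float64) * scale
```

The round-trip error of each weight is bounded by half the scale step, which is what keeps the mIoU drop after quantization small.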

Results and Discussion
The target hardware platform was a Zynq UltraScale+ MPSoC ZCU104 development board. The PetaLinux operating system was running on the ARM core. The test setup is shown in Figure 8, where the ZCU104 board and the LiDAR were connected through the Ethernet port via the UDP protocol. The Velodyne LiDAR driver runs on the ARM processor within the FPGA chip (PS side); users can send Python commands to it, and it receives the point-cloud data from the LiDAR and stores them in the DDR memory after pre-processing. During execution, the CNN accelerator reads the point-cloud data from the DDR, or stores intermediate data to the DDR, using DMA via a high-performance (HP) port according to the on-chip AXI bus protocol.

The hardware resource consumption is summarized in Table 3. The bottleneck of this design is the DSP resources, since 95.78% of them are already in use; increasing the parallelism would require more DSP slices, so a larger FPGA would be necessary. Furthermore, due to the large feature-map size, 100% of the on-chip memory, including both BRAM and URAM, was utilized to buffer as many feature maps and parameters as possible.

The FPGA performance on the Semantic-KITTI dataset is shown in Table 4. After replacing all the large-kernel convolutions with a uniform kernel size and quantizing with INT8 fixed-point weights, the mIoU of the network on the FPGA was 46.4%, which is slightly lower than that of the floating-point computations on the GPU. The estimated power consumption of the FPGA implementation is shown in Table 5. For comparison purposes, the GPU power was estimated at 115 Watts. Considering that the FPGA processing time was 2.17 times faster than the GPU, our FPGA implementation was 46 times better than the GPU in terms of power efficiency.

Conclusions
In this study, we proposed a lightweight CNN for the task of LiDAR point-cloud semantic segmentation. For a feature-map size of 64 × 2048, this network achieved a 47.9% mIoU score on the Semantic-KITTI dataset with a processing time of 38.5 ms on the RTX 3090Ti GPU. Compared to state-of-the-art networks, the network achieved similar accuracy while using only 13% of the parameters. Furthermore, the proposed network was successfully targeted on an MPSoC FPGA platform using an NVDLA hardware architecture, which speeds up the processing time by a factor of 2.17 compared to the GPU implementation.