An Efficient LiDAR Point Cloud Map Coding Scheme Based on Segmentation and Frame-Inserting Network

In this article, we present an efficient coding scheme for LiDAR point cloud maps. Since a point cloud map consists of numerous single scans spliced together, by recording the time stamp and quaternion matrix of each scan during map building, we cast point cloud map compression as a point cloud sequence compression problem. The coding architecture includes two techniques: intra-coding and inter-coding. For intra-frames, a segmentation-based intra-prediction technique is developed. For inter-frames, an interpolation-based inter-frame coding network is explored to remove temporal redundancy by generating virtual point clouds from the decoded frames. We only need to code the difference between the original LiDAR data and the intra/inter-predicted point cloud data. The point cloud map can be reconstructed from the decoded point cloud sequence and quaternion matrices. Experiments on the KITTI dataset show that the proposed coding scheme can largely eliminate temporal and spatial redundancy: the point cloud map can be encoded to 1/24 of its original size at 2 mm-level precision. Our algorithm also obtains better coding performance than the octree and Google Draco algorithms.


Introduction
LiDAR point clouds have been widely used in many emerging applications [1], such as the preservation of historical relics, mobile robots, and remote sensing [2][3][4][5][6][7][8][9]. LiDAR sensors are essential for autonomous vehicles, and dense LiDAR point cloud maps play an indispensable role in unmanned driving tasks such as obstacle detection [10], localization [11], and navigation [12]. Over wide geographic areas, a LiDAR point cloud map consists of a vast number of points and requires large bandwidth and storage space to transmit and store [4,[13][14][15]. Therefore, developing compression algorithms for dense LiDAR point cloud maps is an urgent task.
LiDAR point cloud maps cover a large area, are unstructured, and have a huge volume [16], which makes it difficult to remove redundancy without distortion during encoding. The octree, a data structure in which each internal node has exactly eight children, has been widely used to encode point clouds: octree-based compression recursively subdivides a 3D point cloud into eight octants. As octree-based compression methods are lossy, they are suboptimal for autonomous vehicles, which have strict accuracy requirements in tasks such as path planning and obstacle detection.
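As a rough illustration of this data structure (a minimal sketch, not the codec used in this paper), the recursive subdivision at the heart of octree geometry coding can be written as follows; the occupancy-byte serialization and the max_depth cutoff are generic choices for illustration.

```python
import numpy as np

def encode_octree(points, origin, size, depth, max_depth, stream):
    """Recursively subdivide a cubic cell, emitting one occupancy byte per
    internal node. Leaves approximate their points by the cell itself, so
    the scheme is lossy at any finite max_depth."""
    if len(points) == 0 or depth == max_depth:
        return
    half = size / 2.0
    center = origin + half
    # 3-bit child index from each point's position relative to the cell center
    idx = ((points[:, 0] > center[0]).astype(int)
           | ((points[:, 1] > center[1]).astype(int) << 1)
           | ((points[:, 2] > center[2]).astype(int) << 2))
    occupancy = 0
    children = []
    for c in range(8):
        child = points[idx == c]
        if len(child) > 0:
            occupancy |= 1 << c
            offset = np.array([c & 1, (c >> 1) & 1, (c >> 2) & 1]) * half
            children.append((child, origin + offset))
    stream.append(occupancy)  # the serialized occupancy codes form the bitstream
    for child, child_origin in children:
        encode_octree(child, child_origin, half, depth + 1, max_depth, stream)
```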
In this research, we focus on compressing large-scale dense LiDAR point cloud maps. The well-known low-drift and real-time LiDAR odometry and mapping (LOAM) algorithm is used to build the 3D map [17]. The proposed LiDAR point cloud map coding algorithm can be used in mobile robots or in surveying and mapping fields. The major contributions are as follows.
• By recording the time stamp and quaternion matrix of each scan during mapping, the large-scale point cloud map compression can be formulated as a point cloud sequence compression problem;
• For intra-coding, we develop an intra-prediction method based on segmentation and plane fitting, which exploits the spatial structure of the point cloud to remove spatial redundancy;
• For inter-coding, we develop an interpolation-based frame-inserting network that removes temporal redundancy by generating virtual point clouds from the decoded frames.

Point Cloud Coding: A Brief Review
The compression of 3D point cloud data is a hot topic, and preliminary investigations have recently been made.

Volumetric/Tree-Based Point Clouds Coding
For unstructured point clouds, the octree representation is commonly utilized for point cloud geometry compression. To improve the immersive visual experience, de Oliveira Rente et al. [18] proposed an efficient geometric coding scheme for point clouds, in which an octree-based compression algorithm serves as a base layer and a graph-transform technique serves as an enhancement layer to encode the residual data. Their reported evaluation results show that the method produces significant improvement, especially at low and medium bit rates. Traditional point cloud compression algorithms are limited to encoding the positions and attributes of discrete point clouds. In [19], Krivokuća et al. introduced an alternative technique based on volumetric functions. As in regression analysis, a volumetric function is a continuous function that interpolates values on a finite set of points as a linear combination of continuous basis functions. A B-spline wavelet basis is utilized to encode the volumetric function, which represents the geometry and attributes of the point clouds. Compared to the latest MPEG point cloud coding standard [20], their algorithm achieves better coding performance in both geometry and attributes. To alleviate the computational complexity of 3D point cloud models in registration, data abstraction, and visualization, Elseberg et al. [21] proposed an effective octree-based point cloud storage and compression scheme, which can be used in file format conversion and 3D model registration.

Image/Video-Based Point Clouds Coding
For structured point clouds, some studies have focused on employing image/video codecs to compress point cloud data by mapping it into 2D form. Following this approach, Tu et al. [22] transformed LiDAR point clouds into a range image sequence and used a simultaneous localization and mapping (SLAM) algorithm to perform inter-prediction; the intra-frame and inter-prediction residual data are encoded by an MPEG-like compression method. Unlike the aforementioned methods, Wang et al. [23] proposed a method to compress RGB-D point clouds. The properties of RGB-D and LiDAR point clouds are similar, except for the measurement range (around 100 m for LiDAR versus a few meters for RGB-D cameras). They developed a warping-based depth data coding method, in which a point cloud registration algorithm is utilized to remove redundancy; experimental results showed that their algorithm achieves a higher compression ratio with less distortion than recent methods. Tu et al. [24] used conventional image- and video-based schemes to compress the 2D arrays obtained by converting LiDAR data to range images. Feng et al. [25] proposed a real-time spatio-temporal LiDAR point cloud compression scheme, in which key frames are identified and encoded by iterative plane fitting, and the temporal streams are then encoded by referencing the spatially encoded data. In [26], Tu et al. first chose frames as keyframes (I-frames) and obtained the optical flow between the two nearest keyframes; then, according to the two keyframes and the optical flow, a U-net was utilized to generate the remaining LiDAR frames (P-frames) between them, removing temporal redundancy by interpolating point cloud data between two non-adjacent frames. In [27], Tu et al. proposed an RNN-based network to encode LiDAR point clouds, in which only the input of the first layer is point cloud data, while the inputs of the other layers are residual data. Their method removes the spatial redundancy within a single frame of point clouds, not the temporal redundancy: each frame is coded independently of other frames.

Summary
Currently, there is still no efficient coding solution for LiDAR point cloud maps. Though volumetric/tree-based schemes, such as Google Draco [28] and MPEG TMC13 [20], can be used to encode a LiDAR point cloud map, their coding performance is far from satisfactory, as these methods fail to utilize the spatial structure of the point clouds. Converting LiDAR data into a 2D matrix is in itself an efficient way to reduce spatial redundancy. However, most existing methods [22,24,29] directly encode the LiDAR range image with image/video codecs without further exploiting the temporal and spatial redundancies.

Overall Codec Architecture
In this paper, we address the problem of efficiently coding dense LiDAR point cloud maps, which can be used in mobile robots or in surveying and mapping fields. The low-drift and real-time LOAM algorithm is utilized to construct the LiDAR point cloud map [17]. During the mapping process, the time stamp and quaternion matrix of each LiDAR scan are recorded relative to the global coordinate origin. As the LiDAR point cloud map is constructed from these frames, the dense point cloud map compression can be cast as a point cloud sequence coding task.
Generally, point cloud sequence compression needs to exploit both temporal and spatial redundancies. The system architecture of our LiDAR point cloud map coding algorithm is illustrated in Figure 1. We divide the frames in the point cloud sequence into intra-frames (I-frames) and inter-frames (B-frames, bi-prediction). An I-frame is compressed independently by removing spatial redundancy, while a B-frame is compressed by referring to encoded I-frames or B-frames to remove temporal redundancy. Two I-frames are encoded first, followed by the B-frames between them; B-frames cannot be encoded independently and rely on the two encoded I-frames [30]. The intra-frames are coded by a segmentation-based prediction technique, while for inter-frames, we develop an interpolation-based coding network to remove the temporal redundancy.
Decoding is the inverse process of encoding. The decoded residual data are added to the prediction data to recover the point clouds. According to the decoded point clouds, quaternion matrices, and translation matrices, the point cloud map can be reconstructed as

$$\begin{bmatrix} x_I \\ y_I \\ z_I \end{bmatrix} = R_{yaw} R_{pitch} R_{roll} \begin{bmatrix} x_B \\ y_B \\ z_B \end{bmatrix} + \begin{bmatrix} C_x \\ C_y \\ C_z \end{bmatrix}$$

where $(x_I, y_I, z_I)$ are the coordinates in the coordinate system of an intra-frame, and $(x_B, y_B, z_B)$ are the corresponding coordinates when the predicted B-frame is used as the coordinate origin. $(C_x, C_y, C_z)$ denotes the translation matrix, and $R_{yaw}$, $R_{pitch}$, $R_{roll}$ represent the rotation matrices of the yaw, pitch, and roll angles.
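As a concrete sketch of this reconstruction step (assuming quaternions stored in (x, y, z, w) order and using SciPy's Rotation helper, which are our choices rather than details given in the paper):

```python
import numpy as np
from scipy.spatial.transform import Rotation  # one possible quaternion utility

def reconstruct_map(decoded_frames, quaternions, translations):
    """Splice decoded scans back into the global map frame.
    decoded_frames: list of (N_i, 3) arrays, each in its scan's local frame
    quaternions:    list of (x, y, z, w) tuples recorded during mapping
    translations:   list of (C_x, C_y, C_z) tuples recorded during mapping
    """
    map_points = []
    for pts, q, t in zip(decoded_frames, quaternions, translations):
        R = Rotation.from_quat(q).as_matrix()          # 3x3 rotation matrix
        map_points.append(pts @ R.T + np.asarray(t))   # rotate, then translate
    return np.vstack(map_points)
```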

Overview of Intra-Coding Network
The pipeline of the proposed segmentation-based intra-frame point cloud coding method is illustrated in Figure 2. Firstly, the point cloud is converted from 3D to 2D to obtain its 2D matrix representation. Then, the RangeNet++ [31] network is utilized to segment the point cloud. The residual data, contour map, and quadric surface parameters are encoded with lossless coding schemes and packaged as the intra-bitstream.

LiDAR Point Cloud Segmentation
The LiDAR data from the KITTI dataset, captured by a 64-channel sensor covering a 26.9° vertical field of view and a 360° horizontal field of view [32], are utilized to verify our method. Considering that LiDAR sensors acquire data in an orderly way, the point clouds can be converted from $\mathbb{R}^3$ to $\mathbb{R}^2$. Let $(p_i = (x_i, y_i, z_i))_{i=1\ldots N}$ be the coordinates of a frame of point cloud captured by Velodyne sensors. Since the point cloud is ordered, it can be transformed into a 2D matrix $X(u, v)_{u=1\ldots M,\, v=1\ldots N}$ by the spherical projection

$$u = \frac{1}{2}\left[1 - \arctan(y, x)\,\pi^{-1}\right] w, \qquad v = \left[1 - \left(\arcsin\!\left(z\, r^{-1}\right) + f_{down}\right) f^{-1}\right] h$$

where (x, y, z) are the coordinates of point P, (u, v) are the 2D matrix coordinates, (h, w) are the height and width of the desired 2D matrix representation, $r = \sqrt{x^2 + y^2 + z^2}$ is the distance of point P = (x, y, z) from the origin, and $f = f_{up} + f_{down}$ denotes the vertical field of view of the LiDAR sensor [33].
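A minimal NumPy sketch of this projection is given below; the HDL-64E field-of-view split (2.0° up, 24.9° down, matching the 26.9° total above) and the 64 x 2000 image size are assumptions for illustration.

```python
import numpy as np

def to_range_image(points, h=64, w=2000,
                   f_up=np.radians(2.0), f_down=np.radians(24.9)):
    """Project one LiDAR scan (N, 3) onto an (h, w) range image using the
    spherical projection above. FOV values are assumed HDL-64E figures."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.sqrt(x**2 + y**2 + z**2)
    f = f_up + f_down
    u = 0.5 * (1.0 - np.arctan2(y, x) / np.pi) * w
    v = (1.0 - (np.arcsin(z / r) + f_down) / f) * h
    u = np.clip(np.floor(u), 0, w - 1).astype(int)
    v = np.clip(np.floor(v), 0, h - 1).astype(int)
    image = np.zeros((h, w), dtype=np.float32)
    image[v, u] = r  # store range; later points in a cell overwrite earlier ones
    return image
```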
The RangeNet++ network takes the fused 2D information as input and outputs the segmentation results. Three 2D convolutional blocks are adopted as the 2D feature extractor, and the output of the final layer is the point cloud segmentation result. The segmentation results pave the way for the subsequent surface fitting-based intra-prediction technique.

Segmentation-Based Intra-Prediction Technique
Nearly one-third of the point cloud data are ground points. After segmentation, we can separate the ground from the other objects. A quadric surface fitting-based technique is then utilized to fit the point clouds. Considering both complexity and compression efficiency, we fit the segmented regions with a plane for the ground points and a sphere for the other objects.
A plane is defined by J = (n, d), where n is the normal vector, with $\|n\| = 1$, and d is the vertical distance from the origin to the plane. The distance from a point $p_i$ to the plane is defined as

$$d(p_i, J) = n^T p_i - d$$

Then we can construct the fitting error $\varepsilon_{plane}$ as the sum of squared distances to be minimized:

$$\varepsilon_{plane} = \sum_{i=1}^{N} \left(n^T p_i - d\right)^2$$
A sphere is represented by J = (c, r), where $r \in \mathbb{R}$ denotes the radius and $c \in \mathbb{R}^3$ represents the center. The distance from point $p_i$ to the sphere is defined as

$$d(p_i, J) = \|p_i - c\| - r$$

Then we can construct the fitting error $\varepsilon_{sphere}$ as the sum of squared distances to be minimized:

$$\varepsilon_{sphere} = \sum_{i=1}^{N} \left(\|p_i - c\| - r\right)^2$$
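For illustration, both fitting problems admit closed-form least-squares solutions. The sketch below is our formulation, not necessarily the authors' solver: the plane normal is the eigenvector of the covariance matrix with the smallest eigenvalue, and the sphere fit is linearized algebraically.

```python
import numpy as np

def fit_plane(pts):
    """Least-squares plane J = (n, d): minimize sum_i (n^T p_i - d)^2 with
    ||n|| = 1. The optimal n is the eigenvector of the point covariance
    matrix with the smallest eigenvalue; then d = n^T centroid."""
    centroid = pts.mean(axis=0)
    cov = np.cov((pts - centroid).T)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    n = eigvecs[:, 0]                       # direction of least variance
    d = n @ centroid
    return n, d

def fit_sphere(pts):
    """Algebraic sphere fit J = (c, r): rewrite ||p - c||^2 = r^2 as a linear
    system in (c, r^2 - ||c||^2) and solve it by least squares."""
    A = np.hstack([2 * pts, np.ones((len(pts), 1))])
    b = (pts ** 2).sum(axis=1)
    sol, *_ = np.linalg.lstsq(A, b, rcond=None)
    c = sol[:3]
    r = np.sqrt(sol[3] + c @ c)
    return c, r
```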

Residual Data Coding
According to the parameters of the LiDAR sensor, we use the fitted plane to calculate the coordinates of the virtual points. The residual data $R_{intra}(x, y)$ is the difference between the original range image $X_{intra}(x, y)$ and the predicted range image $P_{intra}(x, y)$. As the pixel values of the residual data are nearly zero, the entropy of $R_{intra}(x, y)$ is smaller compared with $X_{intra}(x, y)$, so the residual data can be encoded with fewer bits.
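A minimal sketch of this residual coding step, assuming a 2 mm quantization step and using Deflate (via Python's zlib) as a stand-in for the lossless back end, since the PPMd coder used for the paper's best results has no standard-library binding:

```python
import zlib
import numpy as np

def code_intra_residual(X_intra, P_intra, scale=500):
    """Quantize the intra residual and compress it losslessly with Deflate.
    scale=500 gives 1/500 m = 2 mm steps (an assumed quantization choice)."""
    R_intra = X_intra - P_intra                      # mostly near-zero values
    q = np.round(R_intra * scale).astype(np.int16)   # low-entropy integer field
    return zlib.compress(q.tobytes(), level=9)

def decode_intra_residual(bitstream, P_intra, scale=500):
    """Invert the coding: decompress, dequantize, add back the prediction."""
    q = np.frombuffer(zlib.decompress(bitstream), dtype=np.int16)
    return P_intra + q.reshape(P_intra.shape).astype(np.float32) / scale
```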

Overall Inter-Prediction Network
To remove temporal redundancy in the LiDAR point clouds [34], an inter-frame point cloud inserting network is designed, as illustrated in Figure 3. The interpolation module takes the encoded point clouds $X = \{X_{t_0}, X_{t_0+2k}\}$ and generates the intermediate point cloud frame $P_{t_0+k}$. We calculate the difference between the predicted result $P_{t_0+k}$ and the real point cloud $X_{t_0+k}$ as the residual data $R_{inter}(x, y)$, which is encoded as the inter-bitstream. Figure 3 also shows the diagram of the LiDAR point cloud interpolation module. The encoder and decoder parts predict the 3D voxel flow, which is used to generate the required intermediate frames. The network generates the predicted frame $P_{t_0+k}$ from the input point clouds $\hat{X}_{t_0}$ and $\hat{X}_{t_0+2k}$, which are already encoded frames. The predicted frame can be the middle frame, obtained by interpolation, or the next frame, obtained by extrapolation of the input point clouds; we focus on interpolating the intermediate frame from two decoded frames. The network is represented by $H(X_{rec}, \Theta)$, where the output F is the 3D voxel flow for the input $X_{rec}$, and $\Theta$ denotes the network parameters.

Point Cloud Interpolation Module
The network outputs a per-pixel voxel flow

$$F = H(X_{rec}, \Theta) = (\Delta x, \Delta y, \Delta \omega)$$

where F plays the role of the optical flow between two adjacent frames. The opposite direction of the optical flow is used to identify the corresponding position in the previous frame. The coordinates of the corresponding positions in the preceding and following frames are defined as $L_{former} = (x - \Delta x, y - \Delta y)$ and $L_{later} = (x + \Delta x, y + \Delta y)$. We use tri-linear interpolation over the eight corner points of the voxel to calculate the output value P(x, y): four points $X_{00}(x_{ceil}, y_{ceil})$, $X_{01}(x_{ceil}, y_{floor})$, $X_{10}(x_{floor}, y_{ceil})$, and $X_{11}(x_{floor}, y_{floor})$ from the former frame, and the other four points $\hat{X}_{00}(\hat{x}_{ceil}, \hat{y}_{ceil})$, $\hat{X}_{01}(\hat{x}_{ceil}, \hat{y}_{floor})$, $\hat{X}_{10}(\hat{x}_{floor}, \hat{y}_{ceil})$, and $\hat{X}_{11}(\hat{x}_{floor}, \hat{y}_{floor})$ from the later frame. The time component of the voxel flow can be considered a linear blending weight between the two adjacent frames. We employ this voxel flow to sample the two input frames and use the volume sampling function $T_{x,y,\omega}$ to generate the final predicted frame P.
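The sampling-and-blending step can be sketched in NumPy as follows. The per-pixel flow fields dx, dy and the temporal weight omega are assumed to come from the network, and boundary handling is simplified by clamping:

```python
import numpy as np

def bilinear_sample(img, xs, ys):
    """Sample img at continuous (xs, ys) from the four floor/ceil corners."""
    h, w = img.shape
    xs = np.clip(xs, 0, w - 1)
    ys = np.clip(ys, 0, h - 1)
    x0 = np.floor(xs).astype(int)
    y0 = np.floor(ys).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx, wy = xs - x0, ys - y0
    return ((1 - wx) * (1 - wy) * img[y0, x0] + wx * (1 - wy) * img[y0, x1]
            + (1 - wx) * wy * img[y1, x0] + wx * wy * img[y1, x1])

def warp_and_blend(X_former, X_later, dx, dy, omega):
    """Volume sampling T_{x,y,omega}: warp both decoded range images along the
    voxel flow (backward into the former frame, forward into the later one)
    and blend them with the temporal weight omega."""
    h, w = X_former.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    P_former = bilinear_sample(X_former, xs - dx, ys - dy)
    P_later = bilinear_sample(X_later, xs + dx, ys + dy)
    return (1 - omega) * P_former + omega * P_later
```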
The predicted frame is blended from the two warped frames:

$$P(x, y) = (1 - \omega)\, P_{former}(x, y) + \omega\, P_{later}(x, y)$$

where $P_{former}(x, y)$ and $P_{later}(x, y)$ are computed by bilinearly interpolating the four corner points sampled from the former and later frames, respectively. The interpolation network adopts a fully convolutional structure, with four convolutional layers and four deconvolutional layers. To better preserve the spatial features of the low-dimensional layers, skip connections are added between the corresponding convolutional and deconvolutional layers.
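A PyTorch sketch of such an encoder-decoder with skip connections is shown below; the channel widths, kernel sizes, and the stacking of the two decoded range images into a two-channel input are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class InterpolationNet(nn.Module):
    """Four conv and four deconv layers with skip connections between
    corresponding levels; outputs a 3-channel voxel flow (dx, dy, omega)."""
    def __init__(self):
        super().__init__()
        conv = lambda i, o: nn.Sequential(nn.Conv2d(i, o, 4, 2, 1), nn.ReLU())
        deconv = lambda i, o: nn.Sequential(nn.ConvTranspose2d(i, o, 4, 2, 1), nn.ReLU())
        self.e1, self.e2 = conv(2, 32), conv(32, 64)    # two stacked range images in
        self.e3, self.e4 = conv(64, 128), conv(128, 256)
        self.d1, self.d2 = deconv(256, 128), deconv(256, 64)
        self.d3 = deconv(128, 32)
        self.d4 = nn.ConvTranspose2d(64, 3, 4, 2, 1)    # voxel flow out

    def forward(self, x_t0, x_t2k):
        x = torch.cat([x_t0, x_t2k], dim=1)             # (B, 2, H, W)
        e1 = self.e1(x)
        e2 = self.e2(e1)
        e3 = self.e3(e2)
        e4 = self.e4(e3)
        d1 = self.d1(e4)
        d2 = self.d2(torch.cat([d1, e3], dim=1))        # skip connection
        d3 = self.d3(torch.cat([d2, e2], dim=1))        # skip connection
        return self.d4(torch.cat([d3, e1], dim=1))      # skip connection
```

With a 64 x 2000 range image (divisible by 16), the four stride-2 encoder stages and four stride-2 decoder stages restore the original resolution exactly.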

Inter Loss Function Design
The prediction module is represented by $H(X_{(t_0, t_0+2k)}, \Theta)$, where the output $P_{t_0+k}$ is the predicted point cloud at time $t_0 + k$ and $\Theta$ denotes the network parameters.
The predicted point cloud data are converted to a range image, and the difference between the original range image $X_{t_0+k}(x, y)$ and the predicted range image $P_{t_0+k}(x, y)$ is calculated as the residual data $R_{inter}(x, y)$, which is encoded losslessly. Explicitly, the training loss is defined as

$$L_{loss} = \|R_{inter}\|_2$$

Visualization Results
For the inter-coding network, the more accurately the point cloud interpolation module synthesizes point clouds, the smaller the residual data and the higher the compression performance. For a deeper evaluation, point cloud interpolation results for four scenes are presented in Figure 4. The predicted point clouds and the original point clouds are almost identical, confirming the effectiveness of the frame-inserting module.

Experimental Results
The proposed point cloud compression scheme is implemented in Python using the Point Cloud Library (PCL) [35] on a PC with a TITAN RTX GPU. The KITTI dataset [36], including city, residential, campus, and road scenes, is used to evaluate our algorithm.

Evaluation Metric
To evaluate the overall performance, we consider the compression ratio (CR) and the relative distance ($D_d$). The CR is the ratio between the compressed point cloud size and the original size:

$$CR = \frac{Size_{compressed}}{Size_{original}} \times 100\%$$
$D_d$ represents the distance between the ground-truth LiDAR data $P_{input}$ and the reconstructed data $P_{decode}$, defined as the symmetric average nearest-neighbour distance

$$D_d = \frac{1}{2}\left(\frac{1}{|P_{input}|}\sum_{p \in P_{input}} \min_{q \in P_{decode}} \|p - q\|_2 \;+\; \frac{1}{|P_{decode}|}\sum_{q \in P_{decode}} \min_{p \in P_{input}} \|p - q\|_2\right)$$

The measure is sensitive to false positives (rebuilding points in unoccupied areas) and false negatives (excluding occupied areas).
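Under this symmetric definition, $D_d$ can be computed with a k-d tree, as in the sketch below (SciPy's cKDTree is our choice of implementation):

```python
import numpy as np
from scipy.spatial import cKDTree

def relative_distance(P_input, P_decode):
    """Symmetric nearest-neighbour distance between (N, 3) point arrays.
    Averaging both directions penalizes false positives (spurious
    reconstructed points) and false negatives (missed points) alike."""
    d_fwd = cKDTree(P_decode).query(P_input)[0].mean()   # input -> reconstruction
    d_bwd = cKDTree(P_input).query(P_decode)[0].mean()   # reconstruction -> input
    return 0.5 * (d_fwd + d_bwd)
```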

Coding Performance for a Single Frame
To find the most efficient coding method for the residual data, several lossless coding schemes are used to encode it, including Zstandard, LZ5, Deflate, Lizard, LZ4, and PPMd. To verify the efficiency of our scheme in removing temporal and spatial redundancy, the 2D matrices are also encoded directly with the lossless coding algorithms, without any preprocessing. Figure 5 shows the experimental results; the smaller the CR, the better the coding performance. For intra-coding, the smallest CR value, 4.64%, is achieved by the PPMd scheme on the city scene, owing to its simple structure. The point cloud of the residential scene, in contrast, is complex, and its CR is much higher.
For inter-coding, it can be observed that the PPMd scheme achieves the best compression performance on the campus scene, with a CR of 3.09%. The worst CR is 6.87%, for the residential scene using the LZ4 scheme. Even in this worst case, the CR of the proposed inter-coding network is smaller than that of directly coding the point cloud data with lossless coding schemes.

Comparison with Octree and Draco
According to the time stamp and quaternion matrix of each scan, the scans are merged into a panoramic LiDAR map. Taking this point cloud map as a whole, Table 1 reports the CR results of the proposed coding method compared to Octree [35] and Google Draco [28]. The quantization accuracy (QA) of our method is set to 2 mm, 5 mm, and 1 cm, while the distance resolution (DR) of Octree is set to 1 mm³, 5 mm³, and 1 cm³. The quantization bits (QB) of Draco are set to 17, 15, and 14, corresponding to 1 mm, 5 mm, and 1 cm accuracy, respectively. In addition, the compression level is set to CL = 10 to achieve the highest compression rate. In our experiments, PPMd is selected to encode the residual data. From Table 1, it can be observed that the proposed algorithm achieves a smaller CR value than Octree.
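For reference, the Draco baseline settings in Table 1 correspond to invocations of Google's draco_encoder reference tool along the following lines (file paths are illustrative):

```python
import subprocess

# Encode the merged map with the Table 1 settings: position quantization
# bits (QB) via -qp and the highest compression level via -cl.
for qb in (17, 15, 14):            # ~1 mm, 5 mm, 1 cm accuracy, respectively
    subprocess.run([
        "draco_encoder", "-point_cloud",
        "-i", "map.ply",           # illustrative input path
        "-o", f"map_qp{qb}.drc",
        "-qp", str(qb),            # quantization bits for positions
        "-cl", "10",               # maximum compression level
    ], check=True)
```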

Rate-Distortion Curves
Several state-of-the-art baselines are chosen for comparison, including Google Draco [28], MPEG TMC13 [20], and Tu's method [26]. The results are evaluated in terms of the relationship between $D_d$ and bits per point (bpp) for four scenes, as illustrated in Figure 6. The proposed LiDAR point cloud coding method shows clear advantages in terms of bpp and $D_d$ among these methods [37][38][39]. Google Draco and MPEG TMC13 are not designed specifically for multi-line LiDAR data [40,41]. Compared to these methods, our method generates point clouds more accurately, which contributes to largely removing the redundancy and obtaining a better $D_d$-bpp trade-off.

Computational Complexity
The proposed intra-coding method consists of three steps, namely segmentation, intra-prediction, and residual data coding, while the inter-coding method consists of frame inserting and residual data coding. To measure the average coding time, 100 frames are selected. Experimental results show that the total intra-coding time with the lossless method (PPMd) is 0.21 s and the total inter-coding time is 0.15 s.

Discussion
With a user-selectable frame rate of 5–15 Hz, the HDL-64E S2 LiDAR sensor captures over 1.3 million points per second. The drawback of the proposed LiDAR point cloud map coding scheme is that the intra-coding and inter-coding cannot meet this real-time requirement [42][43][44]. However, the scheme can be used for off-line LiDAR point cloud map coding to reduce storage space and transmission bandwidth, which is useful in mobile robots or surveying and mapping fields. Follow-up research will focus on porting the coding scheme to an FPGA platform to accelerate the algorithm and achieve real-time performance.

Conclusions
Ranging sensors such as LiDAR are considered very robust under all light conditions and in foggy weather, and they have been widely used in autonomous driving tasks, for instance, navigation, obstacle avoidance, target tracking, and recognition. However, the enormous volume of LiDAR point clouds brings great challenges to data storage and transmission. To address this issue, this paper focuses on LiDAR point cloud map coding. Firstly, an intra-coding technique is designed based on point cloud segmentation and geometric reconstruction, which can effectively remove the spatial redundancy of the LiDAR point cloud. Secondly, we design a point cloud inserting network that removes the temporal redundancy of point clouds by interpolating frames at intermediate moments from the encoded point clouds. Experiments demonstrate that the proposed method obtains better compression performance than several representative point cloud coding methods.