Instance Segmentation of Sparse Point Clouds with Spatio-Temporal Coding for Autonomous Robot

Abstract: In Simultaneous Localization and Mapping (SLAM), dynamic obstacles degrade mapping quality, and environments with many dynamic obstacles make mapping especially challenging. Segmenting the dynamic objects in the environment is therefore particularly important. Point clouds are the common data format in the field of autonomous robots, and this study focuses on how to use them to segment dynamic objects. Existing point cloud instance segmentation methods are mostly based on dense point clouds. In our application scenario, we use a 16-line LiDAR, which yields sparse point clouds, and propose a sparse point cloud instance segmentation method based on spatio-temporal encoding and decoding for autonomous robots in dynamic environments. Compared with other point cloud instance segmentation methods, the proposed algorithm significantly improves average precision and average recall for instance segmentation on our point cloud dataset. In addition, annotating point clouds is time-consuming and laborious, and existing datasets for point cloud instance segmentation are very limited. We therefore propose an autonomous point cloud annotation algorithm that integrates object tracking, segmentation, and point cloud to 2D mapping; the resulting data can then be used to train a robust model.


Introduction
SLAM [1] is an important module of autonomous robots [2]. The tasks of robots include mapping, localization, and path planning. Building an environment map is the foundation of these tasks, since the map is used by the subsequent ones. Map construction often relies on point clouds, and dynamic targets in the point clouds pose challenges to it [3]. A point cloud map containing dynamic objects is shown in Figure 1. Current solutions address these challenges by processing the point clouds directly. Point cloud processing includes semantic segmentation and instance segmentation [4]. Instance segmentation must not only determine which class each point belongs to, but also distinguish different individuals within the same class [5].
Dynamic objects can be handled at different stages. At the registration stage [6,7], objects whose motion state changes rapidly can usually be filtered out with traditional methods or neural networks. At the mapping stage [3,8-10], highly dynamic objects are filtered synchronously during the SLAM process so that the information of all frames can be used. Post-processing is performed on the map after the SLAM process is completed to filter out objects whose motion state changes slowly; this is more effective for temporarily stationary objects. In addition, at the map construction level, lifelong processing [11,12] covers dynamic object filtering and semi-static object updates. Because post-processing can combine more information to filter out target objects more accurately, it is the preferable approach. All of these methods ultimately rest on point cloud processing. Environmental perception in autonomous driving likewise needs to process LiDAR data, perceive specific targets in the scanned point clouds, and provide corresponding strategies. Processing the dynamic objects in the point clouds directly avoids the influence of the dynamic environment on map construction. Currently, most applications that collect data to study specific targets for autonomous robots use LiDAR with 64 lines [13] or more. By number of lines, LiDAR can be divided into single-line, 4-line, 16-line, 64-line, 128-line, etc. As the number of lines increases, the number of points in the resulting point cloud increases, so instance segmentation becomes easier, but the cost also increases. A 64-line LiDAR costs about three times as much as a 16-line LiDAR, and autonomous robots typically use 16-line LiDAR for development and research. A 16-line LiDAR obtains one fourth as many points as a 64-line LiDAR. A comparison between sparse and dense point clouds is shown in Figure 2; in the sparse point cloud, the target has fewer features. Currently, there is relatively little research on instance segmentation for sparse point clouds, even though in practice 16-line LiDAR is much more common than 64-line and 128-line LiDAR.
Thus, we propose a solution for instance segmentation of sparse point clouds. In general, sparser point clouds have fewer features, and target objects are difficult to recognize with the naked eye, which makes manual annotation more difficult. To address this issue, we propose a scheme for the instance segmentation and annotation of sparse point clouds using integrated spatio-temporal information. The contributions of this paper are as follows. First, a new point cloud annotation method is proposed to provide a large amount of data for training point cloud instance segmentation models. Second, we propose a novel spatio-temporal encoding and decoding and incorporate a spatio-temporal semantic loss into the instance segmentation model; the segmentation results improve significantly compared with the model without them. Finally, we propose spatio-temporal information splitting to generate instance segmentation results for sparse point clouds.
The remainder of this paper is organized as follows. After presenting the related work, we present the methods for dataset creation and object-specific instance segmentation in point clouds of dynamic environments, followed by the experiments, discussion, and conclusions.

Point Clouds Instance Segmentation
Top-down, proposal-based approaches first generate region proposals and then segment the objects within each region. Because point clouds are irregular data, Yi et al. [14] propose a top-down method that generates highly characteristic proposals, with the overall network built on PointNet. Some studies also consider RGB information. Hou et al. [15] present the 3D semantic instance segmentation network (3DSIS), which fuses the RGB information of the views with geometry information to predict bounding boxes and instances. Yang et al. [16] propose a network that breaks away from traditional anchors: the method does not use non-maximum suppression, and a classifier that classifies each point is used to achieve object segmentation. Liu et al. [17] present a network that extracts the approximate instance center of each object and then samples the results to obtain the desired instances. Proposal-based methods process each target proposal independently, without interference from other instances. However, they struggle to generate high-quality proposals, because the acquired points lie only on the surfaces of objects.

Autonomous Robot Dynamic Environment Target Filtering
Dynamic environments have attracted widespread attention, and there are many studies based either on camera data [18,19] or on laser scans [20-22]. To extract dynamic objects from camera data, Chabot et al. [19] as well as Reddy et al. [23] use neural networks that take images as input and output classification and motion status. Similarly, Vertens et al. [24] propose fused detection of vehicle motion status, in which the neural network takes the camera images and optical flow as inputs. Chen et al. [25] select image information from different 3D views to predict bounding boxes of different categories. Xu et al. [26] combine the information of images and 3D scans for object detection and assign 3D scan points to each detection. Li et al. [22] use neural networks to detect objects in range images obtained from 3D scans. Engelcke et al. [21] achieve object detection in 3D point clouds with feature-centered voting schemes. Wang et al. [27] detect target objects directly in 3D scans using a fast sliding-window network. Dewan et al. [20] detect moving points in 3D scans by computing the motion between two scans. Hahnel et al. [28] propose a probability-based method that estimates which beams are reflected by moving objects throughout the 3D scans and builds a map of the stationary objects. Meyer-Delius et al. [29] propose an occupancy grid method using a hidden Markov model, which can detect potential changes in each cell.
Most current methods rely on traditional machine learning and on tracking specific objects in point clouds of dynamic environments. To perceive the environment and identify dynamic targets more accurately, we apply instance segmentation to point clouds of dynamic environments. Sparse point cloud datasets are scarce and difficult to label, whereas deep learning methods for 2D images are relatively mature. We therefore combine tracking and segmentation methods for 2D images to label sparse point clouds.

Automatic Data Annotation
Most point cloud annotation is manual: annotators use existing annotation software to identify point cloud instances and label them. Manual annotation is time-consuming and laborious, and because the point clouds obtained by the 16-line LiDAR used in our study are sparse, the target objects are not obvious, which makes manual annotation even more difficult. We therefore also studied an annotation scheme for sparse point clouds. While the LiDAR of the autonomous robot collects point clouds, an Intel RealSense Depth Camera D435 (Intel-D435 camera; Intel, Santa Clara, CA, USA) collects image data. Given the extrinsic parameters between the LiDAR and the camera, the point clouds can be projected onto the 2D image. Compared with instance segmentation of point clouds, instance segmentation of 2D images has been studied more, and the segmentation models are relatively mature. To preserve the spatio-temporal information of adjacent frames, we integrate image-based target tracking with instance segmentation to annotate point clouds autonomously. In this study, we automatically annotated the person and car classes in the point clouds.
The process of autonomous point cloud annotation is shown in Figure 3. The leftmost column in Figure 3 is the original data obtained by the LiDAR and the camera. First, the masks of persons and cars are obtained from the image data through the 2D instance segmentation network and the target tracking process. Second, based on the coordinate system relationship between the camera and the LiDAR, the point cloud is projected to 2D, and the annotations of the corresponding point cloud instances are obtained from the masks, as shown in the rightmost columns of Figure 3. We choose YOLOv5 as the image instance segmentation model and FastMOT as the target tracking model. The details of these two models are not described here. The following describes in detail how image segmentation results are used to generate point cloud annotations.
The key to this process lies in the coordinate transformation from the LiDAR to the camera. A 3D point in space has coordinates (X_w, Y_w, Z_w)^T, expressed in homogeneous coordinates as (X_w, Y_w, Z_w, 1)^T. Its projection has coordinates (u_c, v_c)^T, expressed in homogeneous coordinates as (u_c, v_c, 1)^T. The intrinsic parameter matrix of the camera is K. With rotation R and translation t, the perspective projection model is described by Equations (1) and (2). Writing Equation (1) as a system of equations and eliminating z_c yields Equation (3). Each pair of matched 3D-2D points contributes two equations; with 12 unknowns in total, at least 6 pairs of matched points are required. Equation (3) is written in matrix form, and the values f_11 to f_34 of the resulting system of linear equations are solved. The rotation matrix and translation vector can then be obtained as in Equations (4) and (5). After obtaining the extrinsic parameter matrix between the camera and the LiDAR, we project the point cloud onto the 2D image. Based on the image segmentation results, we extract the corresponding 3D points and save them as a point cloud instance, which completes the autonomous annotation of the point cloud.
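The projection and labeling step described above can be sketched as follows. This is a minimal illustration, assuming the standard pinhole model z_c (u_c, v_c, 1)^T = K (R X_w + t) with an upper-triangular K; the function names, the toy calibration values, and the 5 x 5 mask are hypothetical and are not taken from the original implementation.

```python
# Sketch: label LiDAR points with a 2D instance mask.
# Assumed pinhole model: z_c * [u, v, 1]^T = K (R X_w + t), K upper triangular.

def project_point(point, K, R, t):
    """Project one 3D point to pixel coordinates (u, v).

    Returns None when the point lies behind the camera (z_c <= 0).
    """
    # Camera-frame coordinates: X_c = R * X_w + t
    xc = [sum(R[r][c] * point[c] for c in range(3)) + t[r] for r in range(3)]
    if xc[2] <= 0:
        return None
    # Apply the intrinsics, then divide by the depth z_c
    u = (K[0][0] * xc[0] + K[0][1] * xc[1] + K[0][2] * xc[2]) / xc[2]
    v = (K[1][1] * xc[1] + K[1][2] * xc[2]) / xc[2]
    return u, v


def label_points(points, mask, K, R, t):
    """Return one instance id per point (0 = background or outside the image)."""
    height, width = len(mask), len(mask[0])
    labels = []
    for p in points:
        uv = project_point(p, K, R, t)
        label = 0
        if uv is not None:
            col, row = int(round(uv[0])), int(round(uv[1]))
            if 0 <= row < height and 0 <= col < width:
                label = mask[row][col]
        labels.append(label)
    return labels


# Toy data with hypothetical calibration values
K = [[2.0, 0.0, 2.0], [0.0, 2.0, 2.0], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 0.0]
mask = [[0] * 5 for _ in range(5)]
mask[2][2] = 1  # instance 1 covers the image center
points = [(0.0, 0.0, 4.0), (4.0, 0.0, 4.0), (0.0, 0.0, -1.0)]
labels = label_points(points, mask, K, R, t)
```

In this toy case only the first point projects into the masked pixel; the second lands on a background pixel and the third is behind the camera, so both stay unlabeled.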

Proposed Instance Segmentation
The overall architecture of the proposed model is depicted in Figure 4. The model consists of two stages. In the first stage, semantic segmentation, the network generates point-level semantic labels and offset vectors for the input point cloud; in the second stage, instance proposals are generated by grouping these outputs. For each proposal, a backbone network extracts features that are used for classification, instance mask generation, and scoring of the generated masks. During the movement of the autonomous robot, the point clouds obtained by the LiDAR are continuous in time and space, so we add spatio-temporal coding to make full use of the spatio-temporal information. Prediction is point-wise: the input of the prediction network is a set of N points, each containing coordinate and color information, and the point cloud is voxelized into an ordered voxel grid. These voxel grids are the input of a U-Net-style backbone [30] that extracts features. The U-Net-style backbone is shown in Figure 5. The term 'cat' in the network refers to the concatenation of feature vectors, and the term 'identity' refers to the identity mapping of the feature. The structures of 'conv' and 'deconv' are shown in Figure 6. The Spconv (Spatially Sparse Convolution) in the figure is a spatially sparse convolution library used in this study to replace conventional convolutions. The conv and deconv operations are represented by Equations (6) and (7), where µ is the mean of x, σ is the variance of x, ϵ is a very small positive number (used for numerical stability), and γ and β are the learnable scaling factor and offset parameters. The ReLU function sets each negative value in the input vector to zero, that is, max(0, f). Our 3D point cloud feature extraction uses Submanifold Sparse Convolution [31], and the model outputs features through two branches to obtain point-wise semantic scores and offset vectors.
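The normalization and activation that follow the sparse convolution can be sketched as below. This is only an illustration, assuming the usual batch-normalization form γ (x − µ) / sqrt(var + ϵ) + β followed by ReLU; the sparse convolution itself is omitted, and the function name and default values are hypothetical.

```python
import math


def batch_norm_relu(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a list of feature values, then apply ReLU.

    mu and var are the mean and (biased) variance of xs;
    gamma and beta stand in for the learnable scale and offset.
    """
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    normalized = [gamma * (x - mu) / math.sqrt(var + eps) + beta for x in xs]
    # ReLU: negative values become zero, i.e. max(0, f)
    return [max(0.0, y) for y in normalized]


out = batch_norm_relu([1.0, 2.0, 3.0])
```

For the input [1, 2, 3] the normalized values are symmetric around zero, so ReLU keeps only the last one.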
Cross-entropy (CE) loss is used in the semantic branch, and l1 regression loss is used in the offset branch. The semantic loss and offset loss are given in Equations (8) and (9), where s is the output semantic score, o is the output offset vector, s* is the semantic label, o* is the offset label representing the vector from a point to the geometric center of the instance that the point belongs to (analogous to [32-34]), N is the number of points, and I{p_i} is the indicator function indicating whether the point p_i belongs to any instance. In addition, to preserve the spatio-temporal information, we add spatio-temporal encoding and decoding during training and include a loss between adjacent stacked point clouds in the loss function: the results for the N − 1 frames shared by two adjacent stacked point clouds are extracted, and a cross-entropy loss is computed on them. This semantic loss is shown in Equation (10).
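For reference, one plausible form of the semantic and offset losses, consistent with the symbol definitions above and with the SoftGroup formulation on which the model is based, is written out below. This is an assumption rather than a transcription of Equations (8) and (9); the per-point subscripts s_i and o_i are introduced here for illustration.

```latex
\begin{align}
L_{\text{semantic}} &= \frac{1}{N} \sum_{i=1}^{N} \mathrm{CE}\left(s_i, s_i^{*}\right) \\
L_{\text{offset}} &= \frac{1}{\sum_{i=1}^{N} I\{p_i\}} \sum_{i=1}^{N} I\{p_i\} \, \left\lVert o_i - o_i^{*} \right\rVert_1
\end{align}
```

The indicator restricts the offset loss to points that belong to an instance, since background points have no instance center to regress to.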
Here, if the frames shared by two overlapping point clouds are frames i-j, then sl1 is the semantic score of frames i-j in the first point cloud and sl2 is the semantic score of frames i-j in the second point cloud. The generated instances are refined in a top-down manner to obtain classification and refinement results: features are extracted from each proposal by a feature extractor and then fed into a U-Net network with fewer layers. This tiny U-Net network is shown in Figure 9. The structural details of the network, such as 'conv', 'deconv', and 'blocks', are consistent with those described above. The training loss [35,36] of these branches is a combination of cross-entropy, binary cross-entropy (BCE), and l2 regression losses. The class, mask, and mask score losses are given in Equations (11), (12), and (13), respectively.
In these losses, c*, m*, and r* are the classification, segmentation, and mask scoring targets, respectively, K is the total number of proposals, and I{.} indicates whether the proposal is a positive sample. Similarly, in the classification stage, we also apply the idea of spatio-temporal encoding and decoding and compute a loss on the overlapping parts of the two point clouds. This classification loss is represented by Equation (14), where cl1 is the semantic score of frames i-j in the first point cloud and cl2 is the semantic score of frames i-j in the second point cloud.
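A plausible form of the three proposal-level losses, matching the description above (cross-entropy for classes, BCE for masks, l2 regression for mask scores, with K proposals and a positive-sample indicator), is sketched below. It follows the SoftGroup formulation and is an assumption, not a transcription of Equations (11) to (13); the per-proposal predictions c_k, m_k, and r_k are notation introduced here.

```latex
\begin{align}
L_{\text{class}} &= \frac{1}{K} \sum_{k=1}^{K} \mathrm{CE}\left(c_k, c_k^{*}\right) \\
L_{\text{mask}} &= \frac{1}{\sum_{k=1}^{K} I\{m_k\}} \sum_{k=1}^{K} I\{m_k\} \, \mathrm{BCE}\left(m_k, m_k^{*}\right) \\
L_{\text{mask score}} &= \frac{1}{\sum_{k=1}^{K} I\{r_k\}} \sum_{k=1}^{K} I\{r_k\} \, \left\lVert r_k - r_k^{*} \right\rVert_2
\end{align}
```

Only the classification loss averages over all K proposals; the mask and mask score losses are normalized by the number of positive proposals.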

Spatio-Temporal Encoding and Decoding
Because the point clouds obtained by LiDAR are continuous in space and time, we preserve spatio-temporal information by overlaying the point clouds of adjacent frames, starting from the first frame, and using the overlaid point clouds as network inputs. The first point cloud is the superposition of frames 1 to N, the second is the superposition of frames 2 to N+1, and so on, so adjacent point clouds share N-1 frames. During model training, the segmentation results for the N-1 shared frames should therefore be the same in adjacent point clouds. After segmentation, adjacent outputs still contain N-1 shared frames with similar results. We intersect the results of the shared frames to obtain more accurate point cloud segmentation results.
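The stacking and intersection steps above can be sketched as follows. This is a simplified illustration under stated assumptions: frames are indexed from 0, each prediction is the set of point indices assigned to one already-matched instance in one frame, and matching instances across windows is omitted. The function names and data layout are hypothetical.

```python
def sliding_windows(num_frames, window):
    """Frame indices of each stacked input: [0..window-1], [1..window], ..."""
    return [list(range(start, start + window))
            for start in range(num_frames - window + 1)]


def fuse_frame(frame, windows, predictions):
    """Intersect the points predicted for `frame` across all windows containing it.

    predictions[w][frame] is the set of point indices that window w assigns
    to the instance of interest in that frame.
    """
    sets = [predictions[w][frame]
            for w, frames in enumerate(windows) if frame in frames]
    return set.intersection(*sets) if sets else set()


# Four frames stacked three at a time: adjacent windows share two frames
windows = sliding_windows(4, 3)
predictions = [
    {0: {0, 1}, 1: {1, 2, 3}, 2: {5, 6}},     # output for window 0
    {1: {2, 3, 4}, 2: {5, 6, 7}, 3: {9}},     # output for window 1
]
fused_frame_1 = fuse_frame(1, windows, predictions)
fused_frame_0 = fuse_frame(0, windows, predictions)
```

Frame 1 appears in both windows, so only the points both outputs agree on survive; frame 0 appears in a single window and is returned unchanged.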

Autonomous Robot Hardware Settings
We build a hardware platform for point cloud and image data collection. The platform includes a two-wheel differential chassis, a LiDAR for point cloud collection, four Intel-D435 cameras for image data collection, and a Jetson AGX Xavier (AGX; NVIDIA, Santa Clara, CA, USA) for computing. The hardware platform is shown in Figure 10. A 16-line LiDAR is mounted on top of the robot, and the four Intel-D435 cameras are distributed below the LiDAR (front, rear, left, and right).

Dataset
We used the 16-line LiDAR and the four Intel-D435 cameras to collect the data required for the experiments in nine scenes, for a total of 8321 point clouds and 33,284 images. In addition to the two specific targets required for our experiments, person and car, the scenes also contain trees and buildings on both sides of the road. In these point clouds, the points belonging to the target objects are very sparse, which poses a major challenge for labeling and instance segmentation.

Experiments and Discussions
The point clouds obtained by the 16-line LiDAR are sparse, but they are acquired at a relatively high frequency of 15 frames per second. The difference between the point clouds of adjacent frames is relatively small, so the sparsity problem can be addressed by overlaying the point clouds of adjacent frames, which also provides more features. We stack 5, 9, 13, and 15 frames in separate experiments to find the optimal number of stacked frames. The evaluation metrics are the standard average precision (AP) and average recall (AR). Here, AP_50, AP_25, RC_50, and RC_25 denote the scores at IoU (Intersection over Union) thresholds of 0.5 and 0.25, respectively. Similarly, AP and AR denote the scores averaged over IoU thresholds from 0.5 to 0.95 with a step size of 0.05.
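The role of the IoU thresholds can be illustrated with a short sketch. It only computes the IoU between a predicted and a ground-truth instance, each given as a set of point indices, and checks it against the thresholds; a full AP computation additionally ranks predictions by confidence and integrates the precision-recall curve, which is omitted here. The function names are hypothetical.

```python
def instance_iou(pred, gt):
    """IoU between two instances given as collections of point indices."""
    pred, gt = set(pred), set(gt)
    union = pred | gt
    return len(pred & gt) / len(union) if union else 0.0


# IoU thresholds 0.50, 0.55, ..., 0.95 used for the averaged scores
THRESHOLDS = [round(0.5 + 0.05 * k, 2) for k in range(10)]


def hit_rate_over_thresholds(iou):
    """Fraction of thresholds at which a prediction with this IoU counts as a match."""
    return sum(iou >= th for th in THRESHOLDS) / len(THRESHOLDS)


# Two of six distinct points are shared, so the IoU is 1/3
iou = instance_iou({1, 2, 3, 4}, {3, 4, 5, 6})
matches_at_25 = iou >= 0.25
matches_at_50 = iou >= 0.5
```

A prediction like this one counts as correct under the 0.25 threshold but not under 0.5, which is why the scores at the looser threshold are never lower than those at the stricter one.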
We trained and tested on each set of data separately. The model was implemented using the PyTorch v1.11 (https://pytorch.org/get-started/previous-versions/) deep learning framework and trained using the Adam optimizer. The batch size is set to 2. The learning rate is initialized to 0.001 and scheduled through cosine annealing. The voxel size and grouping bandwidth are set to 0.02 m and 0.04 m, respectively. The score threshold for soft grouping is set to 0.2.
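The shape of a cosine-annealed learning rate can be sketched as below. Only the initial rate of 0.001 comes from the text; the total number of steps and the minimum rate of 0 are assumptions chosen for illustration. PyTorch provides a scheduler of this shape as torch.optim.lr_scheduler.CosineAnnealingLR.

```python
import math


def cosine_annealing_lr(step, total_steps, base_lr=0.001, min_lr=0.0):
    """Learning rate after `step` of `total_steps` under cosine annealing."""
    step = min(max(step, 0), total_steps)  # clamp to the schedule range
    cosine = math.cos(math.pi * step / total_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cosine)


# Start, midpoint, and end of a hypothetical 100-step schedule
schedule = [cosine_annealing_lr(s, 100) for s in (0, 50, 100)]
```

The rate starts at the initial value, passes through half of it at the midpoint, and decays smoothly to the minimum at the end.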
The results of training and testing on these sets of data are shown in Tables 1-3. As the number of stacked adjacent frames increases, the point cloud gradually becomes denser, and the trained model becomes increasingly able to segment the specific targets in it. When 15 frames are stacked, the AP of the target segmentation result is three times higher than when 5 frames are stacked. We separated the AP and AR of person and car across these sets of data and plotted them in Figure 11. Tables 1-3 and Figure 11 show that overlaying the point clouds of adjacent frames to obtain dense information has a significant impact on the instance segmentation of person: with 5 stacked frames, the model can already segment the persons in the point cloud. However, because the characteristics of the car are not obvious, the model achieves good car segmentation only when 15 frames are stacked. Therefore, we adopted the overlay of 15 adjacent frames as the input of the model. Moreover, the frequency of the 16-line LiDAR we used is exactly 15 Hz, so each input corresponds to the point cloud information obtained by the LiDAR within one second.
Based on the control experiment, we stack 15 adjacent frames as the final setting. Experiments are then set up to verify the effectiveness of the spatio-temporal coding. On the same point clouds, we set up a control experiment with two groups: the original point cloud instance segmentation model, and the model trained with spatio-temporal coding. We quantitatively compare the segmentation performance of our model and SoftGroup on car and person in the point clouds. The comparison results are shown in Tables 4-6, which compare the point cloud instance segmentation results of SoftGroup and our method, a model based on the SoftGroup network structure with spatio-temporal encoding and decoding. Both models are based on the PyTorch framework, and the learning rate, thresholds, other parameter configurations, and training point clouds were the same. Table 4 shows that after adding spatio-temporal encoding and decoding, the model achieves better segmentation of car, with varying degrees of improvement in AP and AR. Because the car has fewer feature points than the person, the additional information helps instance segmentation. Table 5 shows that for the person class, the segmentation performance of the model is already relatively good without spatio-temporal encoding and decoding, and it improves slightly when they are added. Overall, adding spatio-temporal information benefits the instance segmentation of the model. The test results of the model contain the point cloud results of fifteen adjacent frames. We extract the segmentation results of a single frame and perform post-processing, intersecting the results of the same frame to obtain more accurate segmentation results. Figure 12 shows the point cloud of a single frame, the overlay of fifteen adjacent frames, the model output for the fifteen adjacent frames, and the segmentation result of the final extracted single frame. The different colors in the figure represent different instances, and the background is displayed in black. We compared panels (c), (d), and (e); the detailed comparison of the segmentation results is shown in Figure 13. The figure shows that our method segments nearly all instances, whereas the results without spatio-temporal encoding and decoding not only miss car instances but also contain misidentified ones. Both the visual and the quantitative results indicate that the proposed method for sparse point cloud instance segmentation is feasible.

Conclusions
This study focuses on instance segmentation of sparse point clouds. First, most point clouds in practical applications are sparse, but datasets of sparse point clouds are relatively rare. We therefore built a hardware platform, selected different scenarios, and collected sparse point clouds. Second, because of the sparsity, the characteristics of specific targets in the point clouds are not obvious, which makes annotation relatively difficult. We therefore propose an autonomous annotation scheme for sparse point clouds that uses target tracking and segmentation methods for 2D images combined with the relationship between 3D point clouds and their 2D projections, and we use it to annotate the point clouds autonomously. Third, because the point clouds collected by LiDAR in practical applications are continuous in both space and time, we incorporate spatio-temporal encoding and decoding into the point cloud instance segmentation model. To address the sparsity of the point clouds, we also overlay adjacent frames to generate training data and propose a point cloud instance segmentation model that integrates spatio-temporal information. Finally, we extract the single-frame instance segmentation results from the model output and process them to obtain the segmentation result of a single-frame point cloud.
The entire proposed process can be applied to the segmentation, extraction, and filtering of specific targets in dynamic environments, which will help autonomous robots construct maps in dynamic environments and avoid the impact of dynamic targets on map construction. Because this study introduces spatio-temporal encoding and decoding, it is more effective for segmenting point cloud instances with temporal information, but it brings little improvement for point clouds without spatio-temporal information. In the future, we will strive to use the point cloud instance segmentation model to perceive specific objects in the environment for autonomous driving, which will help generate strategies during the autonomous driving process.

Figure 1. Point cloud map containing dynamic targets. The red rectangle in the map shows a moving person. Due to this person's movement, the point cloud in the red rectangle looks like a "ghost shadow".

Figure 2. Sparse (top) and dense (bottom) point clouds.

Figure 3. The flowchart of the proposed annotation for point clouds.

Figure 4. The framework of the proposed instance segmentation for point clouds.

Figure 5. The U-Net-style backbone and the detailed structure descriptions. The structures of the blocks and blocks_tail are shown in Figures 7 and 8.

Figure 7. The structure of the blocks.

Figure 10. The hardware platform used in this study.

Figure 11. Comparisons of person (top) and car (bottom) instance segmentation of point clouds with different numbers of stacked frames.

Figure 12. Point cloud processing and comparison. (a) Original single frame of point clouds. (b) Point clouds obtained by overlaying fifteen adjacent frames. (c) The annotation of the point clouds. (d) The instance segmentation of point clouds obtained by the SoftGroup model. (e) The instance segmentation of point clouds obtained by the proposed model. (f) A single-frame point cloud extracted from (e).

Figure 13. Comparisons of instance segmentation between SoftGroup and the proposed method on point clouds.

Table 1. Comparisons of segmentation results for cars with different numbers of stacked frames.

Table 2. Comparisons of segmentation results for person with different numbers of stacked frames.

Table 3. Comparisons of segmentation results for cars and person with different numbers of stacked frames.

Table 4. Comparisons of instance segmentation for cars.

Table 5. Comparisons of instance segmentation for person.

Table 6. Comparisons of instance segmentation for cars and person.