3.1. Automatic Data Annotation
Most existing approaches to point cloud annotation are manual: annotators use existing labeling software to identify point cloud instances by hand. Manual annotation is therefore time-consuming and laborious, and because the 16-line LiDAR used in our study produces sparse point clouds in which target objects are not obvious, it becomes even more difficult. We therefore also studied an annotation scheme for sparse point clouds. While the LiDAR of the autonomous robot collects point clouds, an Intel RealSense Depth Camera D435 (Intel D435; Intel Corporation, Santa Clara, CA, USA) simultaneously collects image data. Given the external parameters between the LiDAR and the camera, the point clouds can be projected onto the 2D image plane. Compared with instance segmentation of point clouds, instance segmentation of 2D images has been studied far more extensively, and the segmentation models are relatively mature. In addition, to preserve the spatio-temporal information of adjacent frames, we integrate image target tracking with instance segmentation to annotate point clouds autonomously. In this study, we autonomously annotated the person and car classes in the point clouds.
The process of autonomous point cloud annotation is shown in Figure 3. The leftmost column of Figure 3 shows the original data obtained by the LiDAR and camera. First, the image data pass through the instance segmentation network and the 2D target tracking process to obtain masks of persons and cars. Second, based on the coordinate relationship between the camera and the LiDAR, the point cloud is projected onto the 2D image, and the annotations of the corresponding point cloud instances are obtained from the masks, as shown in the rightmost column of Figure 3. We choose YOLOv5 as the image instance segmentation model and FastMOT as the target tracking model; the details of these two models are not repeated here. The following describes in detail how the image segmentation results are used to generate point cloud annotations.
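As a sketch of the per-frame annotation loop (our pseudocode, not the released implementation; `run_yolov5_seg_and_track` and `lidar_to_pixel` are hypothetical wrappers standing in for the YOLOv5 + FastMOT step and the LiDAR-to-image projection derived below):

```python
import numpy as np

def annotate_frame(points_xyz, image, run_yolov5_seg_and_track, lidar_to_pixel):
    """Label each LiDAR point with the image instance its projection hits.

    points_xyz: (N, 3) LiDAR points synchronized with `image`;
    run_yolov5_seg_and_track: hypothetical wrapper returning per-instance
        binary masks (M, H, W), class ids (M,), and track ids (M,);
    lidar_to_pixel: hypothetical projection returning (N, 2) pixel
        coordinates and an (N,) bool validity mask (in front of the camera,
        inside the image), cf. Equations (1)-(5).
    """
    masks, class_ids, track_ids = run_yolov5_seg_and_track(image)
    uv, valid = lidar_to_pixel(points_xyz)
    labels = np.full(len(points_xyz), -1, dtype=np.int64)  # -1 = background
    for m, mask in enumerate(masks):
        hit = np.zeros(len(points_xyz), dtype=bool)
        hit[valid] = mask[uv[valid, 1], uv[valid, 0]] > 0
        labels[hit] = m  # instance index; class_ids[m] gives person/car
    return labels, class_ids, track_ids
```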
The key to this process lies in the coordinate transformation from the LiDAR to the camera. A 3D point in space is $P = (X, Y, Z)^{T}$, and its homogeneous coordinates are expressed as $\tilde{P} = (X, Y, Z, 1)^{T}$. The coordinates of the projection point are $p = (u, v)^{T}$, with homogeneous coordinates $\tilde{p} = (u, v, 1)^{T}$. The internal parameter matrix of the camera is $K$. The perspective projection model with rotation $R$ and translation $t$ is specifically described by Equations (1) and (2):

$$ s\,\tilde{p} = K \left[ R \mid t \right] \tilde{P} \quad (1) $$

$$ T = K \left[ R \mid t \right] = \begin{pmatrix} t_{1} & t_{2} & t_{3} & t_{4} \\ t_{5} & t_{6} & t_{7} & t_{8} \\ t_{9} & t_{10} & t_{11} & t_{12} \end{pmatrix} \quad (2) $$
where $s$ is the scale factor (the depth of the point in the camera frame). Writing Equation (1) in the form of a system of equations and eliminating $s$ yields Equation (3):

$$ u = \frac{t_{1}X + t_{2}Y + t_{3}Z + t_{4}}{t_{9}X + t_{10}Y + t_{11}Z + t_{12}}, \qquad v = \frac{t_{5}X + t_{6}Y + t_{7}Z + t_{8}}{t_{9}X + t_{10}Y + t_{11}Z + t_{12}} \quad (3) $$
Each set of 3D-2D matching points yields two equations; with twelve unknowns in total, at least six sets of matching points are required. Equation (3) is rewritten in matrix form, and the values $t_{1}$-$t_{12}$ of the resulting system of linear equations are solved. The rotation matrix and translation vector can then be recovered as Equations (4) and (5):

$$ R = K^{-1} \begin{pmatrix} t_{1} & t_{2} & t_{3} \\ t_{5} & t_{6} & t_{7} \\ t_{9} & t_{10} & t_{11} \end{pmatrix} \quad (4) $$

$$ t = K^{-1} \begin{pmatrix} t_{4} \\ t_{8} \\ t_{12} \end{pmatrix} \quad (5) $$
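For illustration, the twelve entries can be obtained by a standard direct linear transform (DLT); a minimal NumPy sketch under the formulation above, assuming at least six correspondences and omitting the orthonormalization of $R$ that a practical calibration would add:

```python
import numpy as np

def solve_extrinsics(pts3d, pts2d, K):
    """Solve K[R|t] by DLT from n >= 6 LiDAR-camera correspondences.

    pts3d: (n, 3) LiDAR points, pts2d: (n, 2) pixel coordinates, K: (3, 3).
    """
    n = len(pts3d)
    A = np.zeros((2 * n, 12))
    for i, ((X, Y, Z), (u, v)) in enumerate(zip(pts3d, pts2d)):
        # Two rows per correspondence, from Equation (3) after cross-multiplying.
        A[2 * i]     = [X, Y, Z, 1, 0, 0, 0, 0, -u * X, -u * Y, -u * Z, -u]
        A[2 * i + 1] = [0, 0, 0, 0, X, Y, Z, 1, -v * X, -v * Y, -v * Z, -v]
    # Homogeneous system A t = 0: the right singular vector with the smallest
    # singular value is the least-squares solution for t1..t12.
    _, _, Vt = np.linalg.svd(A)
    T = Vt[-1].reshape(3, 4)
    Rt = np.linalg.inv(K) @ T                    # Equations (4) and (5)
    scale = np.cbrt(np.linalg.det(Rt[:, :3]))    # fix the arbitrary DLT scale
    return Rt[:, :3] / scale, Rt[:, 3] / scale   # R, t
```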
After obtaining the external parameter matrix between the camera and the LiDAR, we map the point cloud onto the 2D image. Based on the image segmentation results, we extract the corresponding 3D points and save them as point cloud instances, which completes the autonomous annotation of the point clouds.
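With $R$ and $t$ in hand, the point-to-mask association reduces to a projection and a lookup; a minimal sketch (variable names are ours, not from the released code):

```python
import numpy as np

def project_points(points_xyz, K, R, t, img_hw):
    """Project LiDAR points into the image plane (Equation (1))."""
    cam = points_xyz @ R.T + t                  # LiDAR frame -> camera frame
    z = cam[:, 2]
    uvw = cam @ K.T                             # homogeneous pixel coordinates
    uv = np.round(uvw[:, :2] / np.maximum(uvw[:, 2:3], 1e-9)).astype(np.int64)
    h, w = img_hw
    valid = (z > 0) & (uv[:, 0] >= 0) & (uv[:, 0] < w) \
                    & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    return uv, valid

def extract_instance(points_xyz, mask, K, R, t):
    """Keep the 3D points whose projections fall inside one instance mask."""
    uv, valid = project_points(points_xyz, K, R, t, mask.shape)
    keep = np.zeros(len(points_xyz), dtype=bool)
    keep[valid] = mask[uv[valid, 1], uv[valid, 0]] > 0
    return points_xyz[keep]
```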
3.2. Proposed Instance Segmentation
The overall architecture of the proposed model is depicted in Figure 4; the model consists of two stages. The first stage is semantic segmentation: from the input point cloud, it produces point-level semantic labels and offset vectors. The second stage groups these outputs to generate instance proposals. Following the proposal-based paradigm, a backbone network extracts features from the data that are used for classification, generation of instance masks, and scoring of the generated masks. During the movement of the autonomous robot, the point clouds obtained by the LiDAR are continuous in time and space; we therefore add spatio-temporal coding to make full use of this spatio-temporal information.
Using a point-by-point prediction method, the input to the prediction network is a set of $N$ points, each containing coordinate and color information; the point cloud is then voxelized into an ordered voxel grid. These voxel grids are used as input to a U-Net-style backbone [30] to obtain features. The U-Net-style backbone is shown in Figure 5; in the network, 'cat' denotes concatenation of feature vectors and 'identity' denotes a feature passed through unchanged. The structures of 'conv' and 'deconv' are shown in Figure 6. Spconv (spatially sparse convolution) in the figure is a spatially sparse convolution library used in this study to replace conventional convolutions. The conv and deconv operations are represented by Equations (6) and (7):

$$ \mathrm{conv}(x) = \mathrm{ReLU}\big(\mathrm{BN}(\mathrm{Spconv}(x))\big) \quad (6) $$

$$ \mathrm{deconv}(x) = \mathrm{ReLU}\big(\mathrm{BN}(\mathrm{Spdeconv}(x))\big) \quad (7) $$

where $\mathrm{BN}(x) = \gamma\,\frac{x - \mu}{\sqrt{\sigma^{2} + \epsilon}} + \beta$, $\mu$ is the mean of $x$, $\sigma^{2}$ is the variance of $x$, $\epsilon$ is a very small positive number (used for numerical stability), and $\gamma$ and $\beta$ are the learnable scaling factor and offset parameters. The ReLU function sets each negative value of its input to zero, i.e., $\mathrm{ReLU}(f) = \max(0, f)$. Our 3D point cloud feature extraction is achieved using submanifold sparse convolution [31], and the model outputs features through two branches to obtain pointwise semantic scores and offset vectors.
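As a concrete illustration of the 'conv' and 'deconv' blocks of Figure 6 (Equations (6) and (7)), the following sketch uses the spconv 2.x PyTorch API; the layer hyperparameters (kernel sizes, indice keys) are our assumptions, not values taken from the paper:

```python
import torch.nn as nn
import spconv.pytorch as spconv  # spatially sparse convolution library (spconv 2.x)

def conv_block(in_ch, out_ch, key):
    # 'conv' of Equation (6): submanifold sparse convolution, then batch
    # normalization (gamma*(x-mu)/sqrt(var+eps)+beta), then ReLU.
    return spconv.SparseSequential(
        spconv.SubMConv3d(in_ch, out_ch, kernel_size=3,
                          padding=1, bias=False, indice_key=key),
        nn.BatchNorm1d(out_ch, eps=1e-5),
        nn.ReLU(),
    )

def deconv_block(in_ch, out_ch, key):
    # 'deconv' of Equation (7): inverse sparse convolution restoring the
    # indices of the matching downsampling layer, then BN and ReLU.
    return spconv.SparseSequential(
        spconv.SparseInverseConv3d(in_ch, out_ch, kernel_size=2,
                                   indice_key=key, bias=False),
        nn.BatchNorm1d(out_ch, eps=1e-5),
        nn.ReLU(),
    )
```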
Cross-entropy loss ($CE$) is used in the semantic training branch, and an $\ell_{1}$ regression loss is used in the offset branch. The semantic loss and offset loss are given in Equations (8) and (9):

$$ L_{\mathrm{sem}} = \frac{1}{N} \sum_{i=1}^{N} CE(s_{i}, s_{i}^{*}) \quad (8) $$

$$ L_{\mathrm{offset}} = \frac{1}{\sum_{i=1}^{N} \mathbb{1}(p_{i})} \sum_{i=1}^{N} \mathbb{1}(p_{i})\, \lVert o_{i} - o_{i}^{*} \rVert_{1} \quad (9) $$

where the output semantic score is $s_{i}$, the output offset vector is $o_{i}$, $s_{i}^{*}$ is the semantic label, $o_{i}^{*}$ is the offset label representing the vector from a point to the geometric center of the instance the point belongs to (analogous to [32,33,34]), $N$ is the number of points, and $\mathbb{1}(p_{i})$ is the indicator function indicating whether point $p_{i}$ belongs to any instance. In addition, in order to preserve spatio-temporal information, we add spatio-temporal encoding and decoding in the training process and introduce a loss between point clouds in the loss function: we extract the results of the $N-1$ frames that two adjacent point clouds share exactly and compute a cross-entropy loss between them. This spatio-temporal semantic loss is shown in Equation (10):

$$ L_{\mathrm{st\text{-}sem}} = CE\!\left(s^{1}_{i\text{-}j},\, s^{2}_{i\text{-}j}\right) \quad (10) $$
wherein, if frames $i$-$j$ are the overlapping frames of the two point clouds, then $s^{1}_{i\text{-}j}$ is the semantic score of frames $i$-$j$ of the first point cloud, and $s^{2}_{i\text{-}j}$ is the semantic score of frames $i$-$j$ of the second point cloud.
The generated instance proposals are refined in a top-down manner to obtain classification and refinement results: a feature extractor extracts features from each proposal, and the features are fed into a U-Net network with fewer layers. This tiny U-Net network is shown in Figure 9; its structural details, such as 'conv', 'deconv', and 'blocks', are consistent with those described above. The training losses [35,36] of these branches combine cross-entropy, binary cross-entropy ($BCE$), and $\ell_{2}$ regression losses. The losses of class, mask, and mask score are given in Equations (11), (12), and (13), respectively:

$$ L_{\mathrm{class}} = \frac{1}{K} \sum_{k=1}^{K} CE(c_{k}, c_{k}^{*}) \quad (11) $$

$$ L_{\mathrm{mask}} = \frac{1}{\sum_{k=1}^{K} \mathbb{1}(k)} \sum_{k=1}^{K} \mathbb{1}(k)\, BCE(m_{k}, m_{k}^{*}) \quad (12) $$

$$ L_{\mathrm{mask\_score}} = \frac{1}{\sum_{k=1}^{K} \mathbb{1}(k)} \sum_{k=1}^{K} \mathbb{1}(k)\, \lVert r_{k} - r_{k}^{*} \rVert_{2} \quad (13) $$

where $c_{k}^{*}$, $m_{k}^{*}$, and $r_{k}^{*}$ are the classification, segmentation, and mask scoring targets, respectively, $K$ is the total number of proposals, and $\mathbb{1}(k)$ indicates whether proposal $k$ is a positive sample.
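The proposal-level losses of Equations (11)-(13) admit a similar sketch; `pos` marks positive proposals, masks are assumed padded to a fixed size, and all shapes are our assumptions:

```python
import torch
import torch.nn.functional as F

def proposal_losses(cls_scores, cls_targets, masks, mask_targets,
                    mask_scores, score_targets, pos):
    """Equations (11)-(13) over K proposals (masks padded to P entries).

    cls_scores: (K, C) logits, cls_targets: (K,) class ids,
    masks: (K, P) mask logits, mask_targets: (K, P) float in {0, 1},
    mask_scores / score_targets: (K,), pos: (K,) bool positive-sample flag.
    """
    pos = pos.float()
    denom = pos.sum().clamp(min=1.0)
    l_class = F.cross_entropy(cls_scores, cls_targets)              # Eq. (11)
    per_mask = F.binary_cross_entropy_with_logits(
        masks, mask_targets, reduction="none").mean(-1)
    l_mask = (per_mask * pos).sum() / denom                         # Eq. (12)
    l_score = ((mask_scores - score_targets).pow(2) * pos).sum() / denom  # Eq. (13)
    return l_class, l_mask, l_score
```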
Similarly, in the classification stage we also apply the idea of spatio-temporal encoding and decoding and compute a loss for the overlapping parts of the two point clouds. The classification loss is represented by Equation (14):

$$ L_{\mathrm{st\text{-}class}} = CE\!\left(c^{1}_{i\text{-}j},\, c^{2}_{i\text{-}j}\right) \quad (14) $$

where $c^{1}_{i\text{-}j}$ is the classification score of frames $i$-$j$ of the first point cloud, and $c^{2}_{i\text{-}j}$ is the classification score of frames $i$-$j$ of the second point cloud.