The framework of our method is shown in Figure 1. It comprises three parts, which are shown in different colors. Lines in yellow show the generation of hybrid point clouds, which includes the simulation of the LiDAR sensor and the combination of hybrid point clouds; these are described in Section 3.1 and Section 3.2, respectively. Lines in red show the computational experiments on 3D models, such as 3D object detection. Lines in blue illustrate our human-in-the-loop optimization of the data generation and the training process, which is discussed in Section 3.3.
3.1. Point-Based LiDAR Simulation
In this part, we discuss how to simulate the effects of LiDAR scanning for real or virtual objects with our virtual LiDAR. A point-based method was adopted to avoid the tedious computation of collisions between LiDAR beams and object meshes. The simulation of the LiDAR sensor is illustrated in Figure 2. By imitating the mechanism of a real LiDAR, a virtual LiDAR was built with similar horizontal and vertical resolutions. For objects to be scanned, the original mesh models were first transformed into point clouds. Finally, equivalent scanning grids were proposed to simulate the scanning of the LiDAR sensor and generate the point clouds.
The beams of the LiDAR can be encoded by a two-dimensional index $(i, j)$, where $i$ is the id of the beam in the vertical dimension and $j$ is the id of the beam in the horizontal dimension. The coordinate system of the virtual LiDAR is illustrated on the left of Figure 2. By only considering the range of scanning within the Field of View (FoV) of the camera, we denote the minimum angles in the vertical and horizontal directions as $\theta_{min}$ and $\phi_{min}$, respectively. The resolution of the virtual LiDAR is denoted as $\Delta\theta$ and $\Delta\phi$ in the vertical and horizontal directions. Then, we have:

$$\theta_i = \theta_{min} + i \cdot \Delta\theta, \qquad \phi_j = \phi_{min} + j \cdot \Delta\phi,$$

where $\theta_i$ and $\phi_j$ are the angles of the LiDAR beam in the vertical and horizontal directions. To calculate the projection of the beam with the index $(i, j)$ onto the image plane, we assume the distance of a point on the beam to the LiDAR sensor is $l$. Then, the coordinates of the point $P = (x, y, z)$ are:

$$x = l \cos\theta_i \sin\phi_j, \qquad y = l \sin\theta_i, \qquad z = l \cos\theta_i \cos\phi_j.$$
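Under the definitions above, the beam angles and the resulting 3D point can be sketched as follows; the axis convention (z forward, y up) and the function name `beam_point` are illustrative assumptions, not the paper's implementation:

```python
import math

def beam_point(i, j, l, theta_min, phi_min, d_theta, d_phi):
    """3D coordinates of a point at distance l on beam (i, j).

    theta_min / phi_min: minimum vertical / horizontal FoV angles.
    d_theta / d_phi: vertical / horizontal angular resolution.
    """
    theta = theta_min + i * d_theta  # vertical angle of beam i
    phi = phi_min + j * d_phi        # horizontal angle of beam j
    # Assumed axis convention: z forward, x right, y up.
    x = l * math.cos(theta) * math.sin(phi)
    y = l * math.sin(theta)
    z = l * math.cos(theta) * math.cos(phi)
    return x, y, z
```

For the central beam ($\theta_i = \phi_j = 0$), the point lies at distance $l$ straight ahead on the z-axis.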
Assuming the LiDAR sensor has no relative rotation with respect to the camera, the rotation matrix can be ignored in the calculation of the extrinsic parameters of the camera. Let us denote the translation matrix as $T$, which is defined as:

$$T = [t_x, t_y, t_z]^{\mathsf{T}},$$

where $(t_x, t_y, t_z)$ is the translation between the LiDAR coordinates and the camera coordinates. The intrinsic matrix $K$ can be denoted as:

$$K = \begin{bmatrix} f_x & s & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix},$$

where $f_x$ and $f_y$ are the focus-related parameters, $(c_x, c_y)$ is the offset, and $s$ is the shear parameter. Then, we can project the 3D point $P$ onto the image plane. The projected point $p' = (u, v)$ can be calculated as:

$$z' \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K\,(P + T),$$

where $z'$ is the depth of the point in the camera coordinates.
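The pinhole projection step can be sketched in Python; the intrinsic values below (`fx = fy = 700`, `cx = 320`, `cy = 240`, zero shear) are assumed example values, not calibration results from the paper:

```python
import numpy as np

def project_to_image(P, K, T):
    """Project a 3D LiDAR point P onto the image plane, ignoring rotation
    between the LiDAR and camera frames (as assumed in the text)."""
    P_cam = np.asarray(P, dtype=float) + np.asarray(T, dtype=float)
    u, v, w = K @ P_cam  # homogeneous pixel coordinates
    return u / w, v / w

# Assumed example intrinsics: fx = fy = 700, cx = 320, cy = 240, no shear.
K = np.array([[700.0,   0.0, 320.0],
              [  0.0, 700.0, 240.0],
              [  0.0,   0.0,   1.0]])
```

A point on the camera's optical axis projects to the principal point $(c_x, c_y)$, which is a quick sanity check for the matrices.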
Then, for each beam, its projected position on the image plane is calculated to generate the equivalent scanning grids. For each point on the grids, the mean distance $\bar{d}$ to all its neighbors can be calculated. A receptive range $d_r$ is assigned for point $p$ as follows:

$$d_r = \lambda \cdot \bar{d},$$

where $\lambda$ is used to adjust the receptive range, and it is within the range $(0, 1]$. The points from the virtual object will be kept if they are within the range of $d_r$ from $p$. If there is more than one point assigned to the same grid, we randomly keep one of them. With such a design, we can project the point clouds onto the image plane and compare them with the equivalent grids to determine whether they would be scanned by the virtual LiDAR. To simulate the reflectance of the LiDAR scan, we used the distance of the point and a random function as the inputs. Then, the reflectance $r$ is:

$$r = \min\big(\max\big(\alpha / l + \mathrm{rand}(),\ 0\big),\ 1\big),$$

where $\alpha$ is used to adjust the reflectance strength, $l$ is the distance of the point to the LiDAR sensor, and $\mathrm{rand}$ is the random function used to generate a random value within $[0, 0.3]$. The $\min$ and $\max$ functions are used to restrict the value of the reflectance within $[0, 1]$.
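A minimal sketch of the visibility test and reflectance simulation described above; the attenuation term `alpha / l` and the default parameter values are assumptions consistent with the description, not the paper's exact formula:

```python
import random

def reflectance(l, alpha=5.0, rng=random.random):
    """Simulated reflectance: a distance-attenuated term plus noise in
    [0, 0.3], clipped to [0, 1]. `alpha` and the `alpha / l` attenuation
    are assumed; the paper's exact equation may differ."""
    r = alpha / l + 0.3 * rng()
    return min(max(r, 0.0), 1.0)

def visible_on_grid(proj_pt, grid_pt, mean_neighbor_dist, lam=0.5):
    """Keep a virtual point if its image-plane projection falls within the
    receptive range d_r = lam * mean_neighbor_dist of a grid point."""
    du = proj_pt[0] - grid_pt[0]
    dv = proj_pt[1] - grid_pt[1]
    return (du * du + dv * dv) ** 0.5 <= lam * mean_neighbor_dist
```

Points that pass `visible_on_grid` for some grid cell are treated as scanned by the virtual LiDAR; distant points receive a lower simulated reflectance.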
3.2. Hybrid Point Cloud Integration
To improve the realism of synthetic point clouds and avoid heavy data annotation work, we separated the background and foreground point clouds. Real environmental point clouds were used as the background to reduce the domain difference. Virtual point clouds or scanned point clouds of objects of interest were used as the foreground to generate annotated data automatically. The public point cloud dataset KITTI [25] was used as the source of real point clouds. The hybrid point cloud integration pipeline is illustrated in Figure 3. To eliminate the influence of real points from foreground objects, all the points within the 3D bounding boxes were removed according to the labels of KITTI. Point clouds of foreground objects can be generated either from CAD models or from scanned objects. For CAD models, we obtained their surface points with our surface point sampling method. To scan objects in the real world, an iPad Pro device with a LiDAR sensor was used; it can conveniently generate the point clouds of objects of interest. We also proposed an automatic placing algorithm to choose proper places for the foreground virtual point clouds. Although, in our scenario, foreground point clouds could simply be placed in the positions where real foreground objects were removed, such a placing mechanism cannot be generalized to background scenes with no real objects. Finally, the background and foreground points are combined, and the virtual points are further processed by our virtual LiDAR to keep only the visible points.
Existing CAD models can be used as sources to generate the point clouds of objects of interest. Considering that only the outside points of an object can be captured in real scenes, we proposed a surface point sampling method to produce and upsample the points lying on the surface of the CAD models. As shown in Figure 4, the complete point clouds are first generated from the CAD model with uniform sampling. Complete point clouds include both the outside and inside points of the CAD model, so it is necessary to remove the points lying inside the object. Using current 3D graphics software such as Blender, surface points can be extracted easily. However, in this paper, we avoided external software or applications in order to develop an online processing mechanism. We used multiview projection to decide which points were on the surface of the object. Five views (front view, back view, left view, right view, and bird's eye view) were used to match the visibility of surfaces in real situations. Finally, the points from the different views were combined, and duplicate points were eliminated to generate the surface point clouds. The surface point clouds were further upsampled to guarantee that every beam of the virtual LiDAR generates a point when it collides with the target.
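The multiview surface extraction can be approximated with a per-view depth buffer, as sketched below; the grid cell size and the axis convention are assumptions for illustration:

```python
def surface_points(points, cell=0.5):
    """Approximate surface extraction: from each of five views, keep only
    the extreme point per projected grid cell (a per-view depth buffer).
    The cell size and axis convention (x right, y forward, z up) are
    assumed for illustration."""
    # (image-plane axes, depth axis, which extreme survives in that view)
    views = [((0, 2), 1, min), ((0, 2), 1, max),   # front / back
             ((1, 2), 0, min), ((1, 2), 0, max),   # left / right
             ((0, 1), 2, max)]                     # bird's eye view
    keep = set()
    for (a, b), d, extreme in views:
        buf = {}  # cell -> index of the currently extreme point
        for idx, p in enumerate(points):
            key = (int(p[a] // cell), int(p[b] // cell))
            if key not in buf or extreme(p[d], points[buf[key]][d]) == p[d]:
                buf[key] = idx
        keep.update(buf.values())
    # Combine the five views and drop duplicate indices.
    return [points[i] for i in sorted(keep)]
```

Interior points are visible from none of the five views and are discarded; since there is no bottom view, downward-facing hidden points are also dropped, mirroring what a real scan would miss.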
Foreground point cloud generation: Two approaches were adopted to generate the foreground point clouds, i.e., generating from CAD models and generating from scanned real objects.
CAD models provide a flexible imitation of common objects but are not realistic enough. The existing real objects that were removed from the background only contain partial information (the visible parts) and thus cannot be used for flexible data generation. Therefore, we proposed to use mobile LiDAR to reconstruct objects in the real world. An iPad Pro (2020) was used to scan the real objects, with the app 3D Scanner APP [37] used to scan objects and transform them into point clouds. The 3D models and corresponding point clouds are shown in Figure 5.
Group-based placing algorithm: To make the integration of the virtual and real point clouds reasonable, we proposed a group-based placing algorithm to find proper areas to place the virtual points into the real background points. Firstly, 2D keypoints on the BEV plane are uniformly defined with a resolution of $\delta$. Then, $k$ neighbors for each keypoint are found according to the $x$ and $y$ coordinates using the constrained K-Nearest Neighbor (KNN) algorithm with a range of $r$. Finally, we compute the maximum and minimum height of the points within each group. If the height difference $\Delta h$ is below the threshold $\tau$, the corresponding keypoint is selected as a candidate position. For an object that is only provided with normalized coordinates, we randomly generate its size $s$ to obtain its absolute coordinates using a constrained normal distribution, which is defined as:

$$s = \min\big(\max(X,\ s_{min}),\ s_{max}\big),$$

where $s_{min}$ and $s_{max}$ are the minimum and maximum values of the size, respectively, and $X$ is a random variable with a normal distribution whose mean and variance are $\mu$ and $\sigma^2$, respectively. To find a suitable orientation for the foreground object, we proposed a "random try and region-based validation" method.
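The constrained (clipped) normal size sampling can be sketched as follows; the choice of mean (range midpoint) and standard deviation (one-sixth of the range) is an assumption for illustration, since the paper's exact values are not given here:

```python
import random

def constrained_normal_size(s_min, s_max, rng=random.gauss):
    """Sample a size from a normal distribution and clip it to
    [s_min, s_max]. The mean (midpoint) and standard deviation
    (one-sixth of the range) are assumed choices."""
    mu = 0.5 * (s_min + s_max)
    sigma = (s_max - s_min) / 6.0
    return min(max(rng(mu, sigma), s_min), s_max)
```

Clipping guarantees the generated size never leaves the physically plausible range, while the normal distribution concentrates samples near typical object dimensions.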
We evenly generated $n$ anchor angles. For each anchor angle, a random shift, which obeys a uniform distribution, is added. The $t$-th value of the angle is:

$$\theta_t = \frac{2\pi t}{n} + \epsilon_t, \qquad \epsilon_t \sim U\!\left(-\frac{\pi}{n}, \frac{\pi}{n}\right),$$

where $\epsilon_t$ is the random shift.
Then, we tried each random angle and cropped the points in the 3D region formed by the candidate position, the random size, and the random angle with 3D Region of Interest (RoI) pooling. If the height difference of the points in the 3D region is lower than the threshold $\tau$, the current box is labeled as valid, and the corresponding position, size, and angle are recorded for placing the foreground object. To avoid collisions between multiple foreground objects, we iteratively removed the keypoints that lie inside the current valid box. The group-based placing algorithm is illustrated in Algorithm 1.
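The keypoint-selection and anchor-angle steps of the placing algorithm can be sketched as follows; the brute-force constrained KNN and the half-step shift range are assumed implementation choices, not the paper's code:

```python
import math
import random

def candidate_keypoints(bg_points, keypoints, radius, k, height_thresh):
    """Select BEV keypoints whose k nearest background points (within
    `radius` in the x-y plane) form a near-flat group, i.e., whose
    height difference is below `height_thresh`."""
    candidates = []
    for kx, ky in keypoints:
        # Constrained KNN: points within `radius`, closest k by x-y distance.
        neigh = sorted(
            (p for p in bg_points
             if (p[0] - kx) ** 2 + (p[1] - ky) ** 2 <= radius ** 2),
            key=lambda p: (p[0] - kx) ** 2 + (p[1] - ky) ** 2)[:k]
        if not neigh:
            continue
        zs = [p[2] for p in neigh]
        if max(zs) - min(zs) < height_thresh:
            candidates.append((kx, ky))
    return candidates

def anchor_angles(n, rng=random.uniform):
    """n evenly spaced anchor angles, each perturbed by a uniform random
    shift (the +/- half-step shift range is an assumed choice)."""
    step = 2 * math.pi / n
    return [t * step + rng(-step / 2, step / 2) for t in range(n)]
```

Each candidate keypoint is then paired with a sampled size and one of the perturbed anchor angles, and the resulting box is validated by the height-difference check on the cropped 3D region.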
3.3. Human-in-the-Loop Optimization
The parallel point cloud framework makes possible the consistent optimization of both the point cloud generation and the update of the model parameters. In this part, we mainly discuss the adjustment of the point cloud generation with the human in the loop.
There are many factors to be considered in virtual point cloud generation, such as the types of virtual objects, the density of points on each object, the size of the object, and so on. Although machine intelligence such as auto-augmentation [38] may have the potential to solve these problems, it demands massive computational capacity and carefully designed learning networks. In our situation, taking advantage of human experience and the flexibility of virtual data was the most direct and effective way. After the training process, several randomly sampled data frames were used to evaluate the performance of the current model. Then, we visualized the outputs and asked three volunteers to comment on the detection results. Based on their opinions, the parameters of the point cloud generation, such as the mean number of foreground objects in each frame, the mean size of foreground objects, the receptive range of the virtual LiDAR, the reflectance rate of virtual points, and so on, were optimized accordingly. After several iterations, both the quality of the point clouds and the performance of the detectors improved.
Algorithm 1: Group-based placing algorithm.