6.3.1. Paris-Carla-3D
We start by validating the choice of parameters used for each semantic group. To do so, we generate the basic splats model on PC3D-Paris using different
-nn, with
= 10, 40, and 120, and compare the results to our AdaSplats method (see
Figure 7). For a fair comparison with Basic Splats on the rendering and LiDAR simulation sides, we found that the best trade-off between geometric accuracy and a hole-free approximation is to use
= 40. This results in holes on the surface, especially the ground. A larger
would result in very large splats, and a smaller
would result in many holes, which would not express well the local geometry.
Figure 8 shows renderings of the different surface representation methods on PC3D-Paris dataset, where the last image (bottom right) shows the original point cloud. In these qualitative results, we show the modeling capabilities of the different surface representations. We can see that we obtain the best results with AdaSplats, especially on fine structures, shown in colored squares, thanks to the adaptiveness of our method.
Figure 9 illustrates qualitative results comparing the accumulated point clouds from the LiDAR simulation with the different surface representations. The last image is the original point cloud used to model the environment. The other images are an accumulation of simulated scans (we show one simulated LiDAR scan in blue). We can see that AdaSplats results in higher-quality LiDAR data when compared to Basic Splats and other meshing techniques.
The simulation in meshed or basic splat environments does not perform well on thin objects containing few points, such as fences, poles, and traffic signs. Basic splatting techniques are not able to adapt to the local sparsity without semantic information. Screened Poisson [
32] and IMLS [
8] suffer from a performance drop on outdoor noisy LiDAR data, especially on thin objects. These surface reconstruction methods result in artifacts on open shapes; borders are dilated because these functions attempt to close the surface, as they are performing inside/outside classification. To limit this effect, we truncate the IMLS function at three voxels and perform surface trimming with Poisson. However, we can still see artifacts in
Figure 8 and
Figure 9 (e.g., red, orange, and green areas).
Our method is also verified quantitatively on PC3D-Paris. We report the generation time of the different surface representations, rendering, and LiDAR simulation details of PC3D-Paris in
Table 1. AdaSplats-KPConv includes KPConv for automatic SS (trained on the training set of PC3D dataset). On Paris-CARLA-3D, KPconv has an average mIoU of 52% over all classes and 68% IoU for the class “vehicles” (computed on test set Soufflot-0 and Souffot-3). AdaSplats-GT uses the ground truth semantic (manual annotation), and AdaSplats-Descr computes the local descriptors to arrange points into the three semantic groups. We obtain the lowest number of primitives with our AdaSplats method. We observe that Basic Splats has the lowest generation time, and this is because there is no resampling, which results in generating the splatted environment only once. Moreover, the generation time of AdaSplats-KPConv includes the inference time of KPConv, which is 600 seconds for 10 million points. Adasplats-Descr achieves similar C2C distance to Basic Splats; however, it performs better on thin structures, as we see in the comparisons below. Moreover, it results in a lower number of primitives, which consequently accelerates the simulation and rendering time. The generation time of IMLS is very high, and we attribute this to our implementation, which could be improved. However, it would still result in a generation time higher than Screened Poisson.
LiDAR sensors such as Velodyne HDL32 or HDL64 in default mode acquire one scan (full 360
azimuth turn) in around 100 ms (being at 10 Hz). The LiDAR simulation frequency (LiDAR Freq in
Table 1) is the simulation time of one LiDAR scan. It includes generating the LiDAR rays at a given position, the host (CPU)-to-device (GPU) communication, ray-casting, primitives intersection, and reporting back the buffer containing points of intersection (device-to-host). Our ray–splat intersection is very fast, and we are able to simulate one scan in around 5 ms (203 Hz LiDAR Frequency for AdaSplats-KPconv in
Table 1), being 20 times faster than real time. This is interesting for doing massive simulation. The rendering and LiDAR simulation frequency of meshed environments is higher, and this is expected because rendering pipelines are optimized to accelerate the ray–primitive intersection on polygonal meshes. Moreover, the ray–primitive intersection is hard-coded on GPU. However, we obtain a higher quality surface representation (
Figure 8 and
Figure 9) and still achieve a rendering frequency that is faster than real time with our AdaSplats method. Furthermore, we obtain a lower LiDAR simulation frequency with respect to rendering frequency because it includes host–device (CPU-GPU) and device–host (GPU-CPU) communications for each scan, as we explain above, while the new frame position and rays generation for rendering is done on the device side.
We also report in
Table 1 the C2C distance between the simulated and original point cloud (see Equation (
15)). AdaSplats-Descr achieves similar overall accuracy, compared with Basic Splats. AdaSplats-KPConv improves the accuracy over Basic Splats, while AdaSplats-GT shows that, with improved semantics, the simulated data can be the closest to the original.
To see the effect of resampling on the final simulation, we remove the resampling step from the AdaSplats generation and report the results in
Table 2. We observe that, without resampling, we obtain a higher number of geometric primitives, which affects the rendering frequency and results in a higher C2C distance. This demonstrates that, with our resampling technique, we are able to increase the accuracy of splats generation and lower the number of generated primitives thanks to the re-distribution of points.
Our resampling method does not result in a higher number of points; rather, it re-distributes the points and removes the excess in the form of noise and outliers. Moreover, it reduces the density throughout the whole point cloud. This re-distribution results in a better surface representation, which consequently reduces the number of overlapping splats in a given spherical neighborhood. Our resampling method preserves the hole-free approximation of the surface and sharp features thanks to the checks performed during the generation step.
We notice that the point clouds contain a huge amount of points on the ground; this is the easiest class to model and has a higher effect on the computed distance. However, thin structures contain fewer points and are important for AV simulation. To measure the modeling of thin structures, we pick three classes from PC3D-Paris, compute the C2C distance on these classes, and report the results in
Table 3.
We observe that AdaSplats (all variants) obtains much better results than IMLS, Poisson, or Basic Splats. AdaSplats-KPConv is able to achieve a C2C distance very close to the model constructed with ground truth semantic information. We achieve a lower C2C distance on poles and traffic signs with KPConv due to misclassifications, leading to the generation of smaller splats.
We make an important observation, which is that we always obtain better quantitative and qualitative results independent from the source of the semantic classes. This proves that our method achieves better scene modeling, especially on fine structures, even if the semantic information is not perfect (see
Table 3 and
Table 4).
We measure the contribution of resampling on thin structures we perform once more the semantic C2C distance on the AdaSplats model without resampling and report the results in
Table 4. We observe a drop in performance, which can be seen from the higher distance we obtain between the simulated and the original point clouds on thin structures.
In
Table 2 and
Table 4, we compare our AdaSplats method with and without resampling. By comparison, we obtain less primitives (1.72M) with resampling than without (2.84M). We also have a better repartition of splats on thin objects with resampling (2.2 cm) than without (2.3 cm).
AdaSplats is a method that uses, but does not require, perfect semantics, as can be seen from the simulation results inside the scene modeled using AdaSplats-Descr and AdaSplats-KPConv, which have errors. On the contrary, modeling methods that have specific models for semantic objects (e.g., a specific model for traffic lights) are highly dependent on the quality of the semantics and no longer work with the slightest error. Compared to mesh-based models using surface reconstruction, splats are independent surface elements whose parameters can be easily changed according to semantics, unlike methods based on SDFs, such as IMLS, or on an indicator function, like Poisson.
6.3.3. M-City
Figure 12 and
Figure 13 show renderings and the simulated point clouds, respectively, using the different surface representation methods on M-City. As a reminder, for M-City, we did not perform Poisson and IMLS surface reconstruction, since we do not have the position of the scanners to orient the normals. However, we use the manually reconstructed mesh to compare a manual reconstruction of the scene to Basic Splats and our AdaSplats method. Moreover, we cannot train KPConv on this small dataset; therefore, we do not include AdaSplats-KPConv in the comparisons. However, AdaSplats-Descr can be computed, and we can see that it is able to achieve lower C2C distance and higher simulation frequency compared to Basic Splats. Looking at the results, we can observe that our method (AdaSplats-GT) obtains a better surface representation, which can be clearly seen on the grass and vegetation that are hard to manually reconstruct due to the complexity of the geometry. When manually reconstructing complex geometry, 3D artists need to simplify the local geometry.
We report the generation time of the different surface representations, rendering, and LiDAR simulation details of M-City in
Table 6. The same as PC3D-Paris and SemanticKITTI, we are able to obtain the lowest number of primitives with our AdaSplats method. Moreover, our automatic pipeline drastically reduces the generation time (more than 1 month for manual reconstruction against 8.5 min for AdaSplats), while obtaining a higher rendering quality. AdaSplats still provides accurate surface modeling capabilities, even without a correct normals orientation.
With M-City, we demonstrate that our pipeline can also be used on point clouds collected using a TLS, achieving LiDAR simulation results that are closer to reality than a manually reconstructed model. This is due to the modification of the local geometry done by 3D artists to simplify the reconstruction task (e.g., on the vegetation or some traffic signs).
6.3.4. SimKITTI32
In the context of AVs, 3D SS methods [
12,
13,
14,
15] provide important information about the surroundings of the vehicles, increasing the level of scene understanding. Many manually labeled datasets are available [
18,
67]; however, not all datasets were acquired using the same sensor model or configuration. When SS networks are trained on a given dataset, they perform poorly when tested on datasets acquired using different LiDAR sensors, such as training on data acquired using a Velodyne HDL-64 and testing on datasets acquired using a Velodyne HDL-32. This is mainly due to the domain gap arising from the different sensor model, which affects the points density and scan pattern.
We introduce SimKITTI32, which is an automatically annotated dataset simulated using a Velodyne HDL-32 LiDAR sensor model in SemanticKITTI [
18] sequence 08 (used in the validation procedure of 3D SS methods) that was acquired originally using a Velodyne HDL-64. SimKITTI32 is created with the aim to test the ability of SS methods to generalize to different sensor models. We use our AdaSplats method to model the full sequence 08 using the point-wise semantic labels provided with the original dataset. For the simulation of the Velodyne HDL-32 sensor, we use a slightly different LiDAR placement. More specifically, we use the original trajectory of the LiDAR sensor and offset its position by −0.5 m on the
z-axis. This offset provides more scan lines on high elevations when simulating the Velodyne HDL-32, since it has a larger vertical field of view with respect to the Velodyne HDL-64.
First, we obtain the static scene by removing the dynamic objects from the dataset. Here, the dynamic objects refer to points belonging to the semantic classes of moving objects only while static objects, such as parked vehicles, are considered part of the static background. We extract frame-wise dynamic objects points and generate the splats on each frame separately. Due to the sparsity of points obtained from the moving objects, we generate one splat per point, with a fixed radius of 14 cm, which is equal to their average point-to-point distance.
We concatenate the splatted frames of the dynamic objects with the splatted static scene (using AdaSplats-GT) and simulate the Velodyne HDL-32 LiDAR. We make three improvements on the previous LiDAR simulation method.
First, we simulate the HDL-32 distance error computation with an additive white Gaussian noise with zero mean and = 0.005.
Second, instead of returning the first ray–splat intersection, we accumulate several intersections between the ray and the overlapping splats using a recursive call of the ray tracing function. More specifically, we define the depth (
) of intersection (
= 5 in our experiments), which defines how many overlapping splats we want to intersect along the same ray. The higher the depth, the more computations are required, which ultimately affects the simulation frequency, so we are left with a trade-off between precision and time complexity. Having defined the depth, we cast the rays from the sensor origin, traverse the BVH and return the data from the intersected splat, such as the center and the semantic class (the semantic class is saved with each splat during generation time). We offset the intersection point by an
(we use
in our implementation) to prevent self intersections, and cast a new ray; we then save the intersection data. We repeat these steps until the maximum depth is reached, or all splats of the same semantic class are intersected. Moreover, we put a threshold on the distance between two consecutive intersections (10 cm in our experiment), to prevent the accumulation of intersections belonging to different objects. If the number of overlapping splats is less than
, they do not belong to the same semantic class, or the distance threshold is exceeded, we reset
to the maximum number of intersections. In a final step, we compute a weighted average of the intersection point, taking into consideration the depth of the intersection:
where
is the final intersection point,
is the intersection point at depth i, and
is a Gaussian kernel used to weight the contribution of each intersection to the final intersection point along the ray direction using the depth information:
where
is the current ray–splat intersection depth and (
) the maximum depth of intersection.
Finally, when we concatenate all of the separate dynamic objects frames with the static background, we obtain trails of splats representing the displaced dynamic objects through time. If we simulate the LiDAR sensor directly on the concatenated splatted environment, we also obtain a trail of points. We make use of the semantic labels that we associate with the splat during generation time and check if the intersected splat belongs to a moving object. In that case, we invoke the any-hit program of OptiX to accumulate all the intersections along the ray. At generation time, we also assign the splats that belong to a moving object the corresponding frame number as an attribute, which we use at intersection time inside OptiX. If the number of the simulated frames matches the frame number of the splat intersected inside any-hit, we report only this intersection back. This ensures that we only intersect the splats belonging to the frame currently being simulated.
Figure 14 shows a frame from the SemanticKITTI dataset sequence 08, the same frame simulated with an HDL-64 LiDAR model at the same position without any offset, and the simulated frame using an HDL-32 LiDAR model shifted by −0.5 m on the
z-axis. We can see that, with our implementation, we are able to accurately simulate the data with a different sensor model.