4.1. Experimental Setup
To experimentally evaluate the performance of LP-DSG, a Livox MID-360 (Livox, Shenzhen, China) LiDAR sensor is used as the low-density LiDAR, and a low-power NVIDIA Jetson AGX Orin (NVIDIA, Santa Clara, CA, USA) is employed to run the algorithm. The Livox MID-360 sensor can detect objects with 80% reflectivity at a maximum range of 70 m and objects with 10% reflectivity at a maximum range of 40 m. It has a horizontal FOV of 360° and a vertical FOV that ranges from −7° to 52°. Sensor data are collected at 10 Hz, with approximately 20,000 points per scan frame. The NVIDIA Jetson AGX Orin features an 8-core Cortex-A78AE CPU and an Ampere-architecture GPU with 2048 CUDA cores, delivering up to 275 TOPS.
NDT is used as the scan-to-map matching technique. NDT partitions the point cloud into voxels and models each voxel as a normal distribution, then estimates the optimal transformation that minimizes the distributional difference between two LiDAR scans or between a LiDAR scan and a map. In LP-DSG, NDT is employed to match incoming LiDAR scans against the map, ensuring stable map generation.
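As a rough illustration of the NDT modeling step described above, the sketch below (our own simplified illustration, not the LP-DSG implementation; the voxel size and point threshold are arbitrary placeholders) partitions a point cloud into voxels and fits a mean and covariance per voxel. Full NDT registration would then search for the transformation that best aligns a new scan with these per-voxel distributions.

```python
import numpy as np
from collections import defaultdict

def ndt_voxel_model(points, voxel_size=1.0, min_points=5):
    """Fit one normal distribution per voxel (the modeling step of NDT).

    points: (N, 3) array in the map frame. Returns {voxel_index: (mean, cov)}.
    The registration step of NDT would then optimize the transform that
    maximizes the likelihood of a new scan under these distributions.
    """
    voxels = defaultdict(list)
    for idx, p in zip(map(tuple, np.floor(points / voxel_size).astype(int)), points):
        voxels[idx].append(p)

    model = {}
    for idx, pts in voxels.items():
        pts = np.asarray(pts)
        if len(pts) >= min_points:                  # enough points for a stable covariance
            cov = np.cov(pts.T) + 1e-6 * np.eye(3)  # small regularization avoids singular voxels
            model[idx] = (pts.mean(axis=0), cov)
    return model
```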
For 3D object detection, we use the voxel-based PointPillars, PillarNet, SECOND, and Voxel R-CNN models. Among these, PointPillars is selected as the experimental model for evaluating the entire LP-DSG system, including the LOP and DSG algorithms, because its computational efficiency suits real-time operation on embedded platforms while maintaining sufficient detection performance. For PointPillars, the voxel size is set to (0.16, 0.16, 6) m, and each voxel is configured to hold a maximum of 16 points for computational efficiency. The point cloud queue size in Equation (6) is set to 5, a value determined through systematic experiments using this PointPillars configuration.
Figure 8 visualizes the effect of different queue sizes on voxel occupancy as a heatmap, while Table 1 provides a detailed quantitative analysis. The analysis region contains 1662 voxels from the same ground-truth bounding box. As shown in Table 1, increasing the queue size from 3 to 5 adds 90 non-empty voxels while introducing only 15 over-capacity voxels, a ratio of 6 to 1. Further increasing the queue size from 5 to 7, however, adds just 63 non-empty voxels but introduces 41 over-capacity voxels, a ratio of roughly 1.5 to 1. Here, non-empty voxels represent features that capture object geometry, while over-capacity voxels require random sampling that can disrupt learning. In PointPillars, when a voxel exceeds the 16-point limit, random sampling may produce point distributions that do not represent actual object surfaces, creating unrealistic geometric patterns that hinder model training. Additionally, larger queue sizes increase computational overhead, which can compromise real-time performance on embedded platforms. Therefore, a queue size of 5 is selected as the optimal balance, maximizing geometric feature coverage while minimizing both learning interference from random sampling artifacts and the computational burden for real-time operation.
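The occupancy statistics in Table 1 can in principle be reproduced with a short counting routine such as the sketch below; the 0.16 m pillar size, the 16-point cap, and the ROI are taken from the text, while the function and variable names are our own. In the paper the counts are restricted to the voxels inside one ground-truth bounding box, whereas this sketch counts over the whole ROI for simplicity.

```python
import numpy as np

def pillar_occupancy_stats(points, xy_voxel=0.16, max_points=16,
                           roi_min=(-69.12, -69.12), roi_max=(69.12, 69.12)):
    """Count non-empty and over-capacity pillars in a (queue-stacked) point cloud.

    points: (N, >=2) array in the sensor frame. Pillars follow the PointPillars
    setup in the text: 0.16 m x 0.16 m in XY, full height in Z, and at most
    16 points per pillar (extra points would be dropped by random sampling).
    """
    roi_min, roi_max = np.asarray(roi_min), np.asarray(roi_max)
    mask = np.all((points[:, :2] >= roi_min) & (points[:, :2] < roi_max), axis=1)
    cols = np.floor((points[mask, :2] - roi_min) / xy_voxel).astype(int)
    _, counts = np.unique(cols, axis=0, return_counts=True)
    non_empty = len(counts)                           # pillars that carry object geometry
    over_capacity = int(np.sum(counts > max_points))  # pillars that force random sampling
    return non_empty, over_capacity

# Comparing stacked clouds built with different queue sizes would reproduce
# the trade-off reported in Table 1, e.g.:
# print(pillar_occupancy_stats(stacked_cloud_q5))
```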
Furthermore, the ROI for the USV is set to [−69.12, 69.12] m, [−69.12, 69.12] m, and [−2, 4] m along the X, Y, and Z axes, respectively, to enable detection in all directions. The voxel size for each model is determined with its downsampling ratio in mind, as shown in
Table 2. Because 3D backbone models preserve detailed information along the Z-axis, reducing XY resolution incurs minimal performance loss. In contrast, 2D backbone models lose some Z-axis information when projecting to BEV and therefore require higher XY resolution to maintain a fine-grained representation. Accordingly, PillarNet uses a voxel size with half the XY dimensions of the other models to ensure sufficient performance despite its 0.125 downsampling ratio. The anchor sizes for the Boat and Front are set to (2.0, 3.8, 3.6) m and (3.6, 0.8, 1.5) m, respectively, reflecting actual object dimensions. All training-related settings, including network architecture, optimizer, scheduler, learning rate, and loss function, are implemented in OpenPCDet [
44] using the default hyperparameter values specified in the original publications [
35,
37,
40,
41].
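For reference, the ROI and the PointPillars voxel size above determine the resolution of the BEV pseudo-image that the 2D backbone operates on; the short check below (our own arithmetic) computes that grid.

```python
import numpy as np

# ROI and PointPillars voxel size from the text: x_min, y_min, z_min, x_max, y_max, z_max.
pc_range = np.array([-69.12, -69.12, -2.0, 69.12, 69.12, 4.0])
voxel_size = np.array([0.16, 0.16, 6.0])

grid = np.round((pc_range[3:] - pc_range[:3]) / voxel_size).astype(int)
print(grid)  # [864 864 1]: an 864 x 864 pillar grid with a single Z bin
```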
LOP employs DBSCAN as the clustering method. The DBSCAN parameters in Algorithm 2 are set considering the size of the USV, with the cluster distance threshold set to 1.0 m and the minimum number of cluster points set to 3. Because LP-DSG operates over a relatively small range of approximately 50 m, the sizes of the bounding box queue and the long-term object queue are both set to 100, allowing up to 100 objects to be stored on the map. Considering the size of the USV and registration errors, the matching distance threshold between the cluster buffer and the long-term object queue is set to 2.0 m. The long-term object queue update weight is set to 0.9, meaning that 90% of the information comes from the cluster buffer and 10% from the long-term object queue, so that the most up-to-date information is retained during updates. Given the actual berthing environment for DSG, the dock length is set to 32.0 m, the dock width to 5.6 m, and the dock distance to 4.0 m. The specific dimensions for these parameters are derived from the physical constraints of the test site and the USV, as shown in Figure 9a.
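A minimal sketch of the clustering and update rule described in this paragraph is given below, using scikit-learn's DBSCAN with the stated parameters; the queue handling is simplified, and all function and variable names are stand-ins rather than the actual LOP implementation.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_centers(points_xy, eps=1.0, min_samples=3):
    """Cluster object points in the XY plane with the DBSCAN settings from the text."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points_xy)
    return [points_xy[labels == k].mean(axis=0) for k in set(labels) if k != -1]

def update_long_term_objects(long_term, clusters, match_dist=2.0, weight=0.9, capacity=100):
    """Blend cluster-buffer positions into the long-term object queue.

    A cluster within 2.0 m of a stored object updates it with a 0.9 / 0.1
    (new / old) weighting; otherwise it is appended as a new object, up to
    the queue capacity of 100 objects.
    """
    long_term = [np.asarray(o, dtype=float) for o in long_term]
    for c in clusters:
        if long_term:
            dists = [np.linalg.norm(c - o) for o in long_term]
            i = int(np.argmin(dists))
            if dists[i] <= match_dist:
                long_term[i] = weight * c + (1.0 - weight) * long_term[i]
                continue
        if len(long_term) < capacity:
            long_term.append(np.asarray(c, dtype=float))
    return long_term
```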
4.2. Dataset Construction
As shown in
Figure 9b, the experiments are conducted on a large lake using a USV, from which the necessary data for training are collected. To ensure diversity in the point cloud data collected from LiDAR, three datasets were constructed based on four scenarios (
Figure 10). The first dataset consists of raw point cloud data collected directly from LiDAR (
Figure 11a), while the second dataset was generated by applying frame stacking (FS), which accumulates point clouds from previous frames into the current frame to produce a merged point cloud without any registration process. The third dataset was constructed using RFS, which registers and merges five consecutive frames of raw point cloud data to produce high-quality point clouds (
Figure 11b).
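To make the difference between FS and RFS concrete, the sketch below (our own simplification; it assumes per-frame sensor poses from the scan-to-map matching are available as 4 x 4 matrices) transforms the most recent five frames into the current sensor frame before merging, whereas plain FS simply concatenates the raw clouds without registration.

```python
import numpy as np

def stack_frames(clouds, poses=None, num_frames=5):
    """Merge the most recent `num_frames` point clouds into the newest frame.

    clouds: list of (N_i, 3) arrays, ordered oldest to newest.
    poses:  matching list of 4x4 world-from-sensor poses (e.g. from the
            scan-to-map matching). If None, this degenerates to plain FS:
            clouds are concatenated as-is without registration.
    """
    clouds = clouds[-num_frames:]
    if poses is None:                        # frame stacking (FS)
        return np.vstack(clouds)

    poses = poses[-num_frames:]
    ref_inv = np.linalg.inv(poses[-1])       # world -> current sensor frame
    merged = []
    for cloud, pose in zip(clouds, poses):
        rel = ref_inv @ pose                 # past sensor frame -> current sensor frame
        merged.append(cloud @ rel[:3, :3].T + rel[:3, 3])
    return np.vstack(merged)                 # registered frame stacking (RFS)
```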
All three datasets share the same label information, which was generated using the high-quality point clouds from RFS. This labeling approach is justified by the fact that all three datasets exist in the same sensor coordinate system and represent the same physical scenes captured at the same time. In LiDAR-based 3D object detection, bounding box annotations represent the physical dimensions and locations of objects based on surface points, independent of point density.
This follows the standard practice in major 3D detection benchmarks [
17,
42,
45], where objects are labeled based on their physical extent regardless of point sparsity in any particular frame. Using consistent labels across all datasets ensures fair comparison by eliminating annotation variance as a confounding factor, allowing us to isolate and evaluate the effect of point cloud quality on detection performance. Each of the three datasets consists of the same 1722 scenes, and the labeled dataset is divided into 1378 and 344 scenes for training and validation, respectively.
We employ four data augmentation techniques to enhance training data diversity and reduce observation angle bias. First, we perform ground-truth sampling [
37], which extracts points from labeled bounding boxes and inserts them into other scenes. We set the minimum number of points per sampled object to 5 for both the Boat and Front classes, with up to 20 Boat objects and 5 Front objects inserted per scene. Second, we randomly flip each scene along the X and Y axes. Third, we apply random rotation around the Z axis within a fixed angular range. Finally, we apply random scaling with factors between 0.95 and 1.05. These augmentations ensure that each training iteration encounters a different scene variation, allowing the model to learn from diverse inputs.
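A minimal sketch of the global augmentations listed above is shown below, applied jointly to points and boxes; ground-truth sampling is omitted for brevity, and the rotation bound is left as a placeholder parameter because the exact range is set in the training configuration.

```python
import numpy as np

def augment_scene(points, boxes, rot_range=np.pi / 4, rng=None):
    """Random flip / Z-axis rotation / global scaling for one training scene.

    points: (N, 3+) point cloud; boxes: (M, 7) boxes as (x, y, z, dx, dy, dz, yaw).
    rot_range is a placeholder; the actual bound comes from the training config.
    """
    rng = np.random.default_rng() if rng is None else rng
    points, boxes = points.copy(), boxes.copy()

    # Random flips along the X and Y axes (each with probability 0.5).
    if rng.random() < 0.5:                  # flip along the X axis: y -> -y
        points[:, 1] *= -1; boxes[:, 1] *= -1; boxes[:, 6] *= -1
    if rng.random() < 0.5:                  # flip along the Y axis: x -> -x
        points[:, 0] *= -1; boxes[:, 0] *= -1; boxes[:, 6] = np.pi - boxes[:, 6]

    # Random rotation around the Z axis.
    a = rng.uniform(-rot_range, rot_range)
    rot = np.array([[np.cos(a), -np.sin(a), 0.0],
                    [np.sin(a),  np.cos(a), 0.0],
                    [0.0,        0.0,       1.0]])
    points[:, :3] = points[:, :3] @ rot.T
    boxes[:, :3] = boxes[:, :3] @ rot.T
    boxes[:, 6] += a

    # Random global scaling with factors in [0.95, 1.05].
    s = rng.uniform(0.95, 1.05)
    points[:, :3] *= s
    boxes[:, :6] *= s
    return points, boxes
```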
4.3. Experimental Results
Experiments are conducted to evaluate the improvement in object detection reliability and accuracy, the efficiency of long-term perception and object management, and the stability of DSG in real-world scenarios. In these experiments, after comparing object detection performance across models, PointPillars was selected as the experimental model for the subsequent tasks.
To evaluate the improvement in object detection reliability and accuracy due to the use of high-quality training data, a comparative analysis is conducted using 344 validation scenes, as shown in
Table 3. For each model, the evaluation compares its baseline and extended versions incorporating FS and RFS.
Table 3 presents a quantitative evaluation of detection accuracy by comparing class-wise AP and mAP for both 3D bounding boxes and BEV. For all classes, the IoU threshold is set to 0.7, and 40 recall positions are used for a more detailed performance analysis [
46,
47].
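The 40-recall-position metric referenced here is the KITTI-style interpolated AP; a compact sketch of that computation for one class at a fixed IoU threshold is shown below, assuming the precision-recall pairs have already been obtained from score-ranked detections.

```python
import numpy as np

def ap_r40(recalls, precisions):
    """KITTI-style interpolated AP over 40 recall positions.

    recalls, precisions: arrays from score-ranked detections for one class
    at a fixed IoU threshold (0.7 in the text). At each of the 40 recall
    sample points, the precision is taken as the maximum precision achieved
    at any recall greater than or equal to that point.
    """
    recalls, precisions = np.asarray(recalls), np.asarray(precisions)
    sample_points = np.linspace(1.0 / 40, 1.0, 40)   # 1/40, 2/40, ..., 40/40
    interpolated = []
    for r in sample_points:
        mask = recalls >= r
        interpolated.append(precisions[mask].max() if mask.any() else 0.0)
    return float(np.mean(interpolated))
```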
As a result, compared to the single-frame baseline models, applying RFS consistently improves performance across all four model architectures. Quantitatively, RFS achieves an average 3D mAP improvement of 23.38 percentage points across all models. Specifically, PointPillars improves by 22.45 percentage points, PillarNet by 23.26 percentage points, SECOND by 26.44 percentage points, and Voxel R-CNN by 21.37 percentage points. Notably, the Front class shows particularly substantial improvements across all architectures. PointPillars achieves a 35.87 percentage point improvement, PillarNet 39.19 percentage points, SECOND 41.73 percentage points, and Voxel R-CNN 35.68 percentage points. These results indicate that high-quality point clouds generated by RFS are especially effective for detecting small objects with sparse point clouds. This consistent improvement pattern across diverse model architectures, ranging from lightweight PointPillars to complex Voxel R-CNN, validates the robustness and generalizability of our approach.
RFS also produces higher object detection performance than FS, the other multi-frame approach. This improvement is attributed to the multi-frame alignment process in RFS, which refines the point cloud and makes it more suitable for object detection.
Figure 12 visualizes the point cloud datasets generated by applying FS and RFS to the experimental model and shows the class-wise precision-recall (PR) curve results for each model. Notably, in the Front object region, indicated by the red ellipse, the FS data sample produces an unstable object shape. In contrast, with RFS, the object shape is more stable due to precise frame alignment. This is a critical factor in detecting relatively small objects such as the Front.
Figure 13 presents the inference results for the experimental model. Compared to the ground truth, the baseline and baseline with FS models (
Figure 13a and
Figure 13b, respectively) fail to detect the Front, while the baseline with RFS model reliably does so (
Figure 13c).
To evaluate the effectiveness of long-term perception and object management, we compare the detection recall (DR) with and without LOP.
Figure 14 presents the DR over time, representing the detected objects as a percentage of the total number of actual objects in Scenario 3 of
Figure 10. When LOP is not employed, the DR gradually increases over time but starts to decline after reaching 60% (
Figure 14a). This decline can be attributed to objects moving out of the LiDAR FOV. In addition, due to noise in the real-world environment, the DR fluctuates during the detection process. In contrast, when LOP is used, the DR steadily increases over time (
Figure 14b). This is because LOP retains object information even after objects move out of the LiDAR FOV. As a result, a DR of approximately 90% is achieved in Scenario 3, contributing to the stability of DSG.
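The effect of LOP on the detection recall curve can be summarized with a small sketch: without LOP, only objects detected in the current frame count toward the DR, whereas with LOP, once an object has been perceived it remains counted even after leaving the FOV. Matching by nearest neighbor within a fixed threshold is a simplification of the actual LOP matching step, and the names below are our own.

```python
import numpy as np

def detection_recall_over_time(detections_per_frame, gt_positions,
                               match_dist=2.0, use_lop=True):
    """Detection recall (DR) per frame, with or without long-term perception.

    detections_per_frame: list of (M_t, 2) XY detections, one array per time step.
    gt_positions: (G, 2) XY positions of all actual objects in the scenario.
    With use_lop=True, an object stays 'detected' once it has been matched,
    even after it leaves the LiDAR FOV.
    """
    detected = np.zeros(len(gt_positions), dtype=bool)
    recall = []
    for dets in detections_per_frame:
        current = np.zeros(len(gt_positions), dtype=bool)
        for d in dets:
            dist = np.linalg.norm(gt_positions - d, axis=1)
            i = int(np.argmin(dist))
            if dist[i] <= match_dist:
                current[i] = True
        detected = (detected | current) if use_lop else current
        recall.append(float(detected.mean()))
    return recall
```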
The stability of the DSG algorithm employed in Scenario 1 of
Figure 10 is also tested.
Table 4 presents the DSG results for the baseline and RFS versions of the experimental model. Nearest-neighbor matching is performed in the XY plane between the bounding boxes of the Boat stored in LOP and the candidate docking spots generated by the DSG algorithm. The mean, standard deviation, and 95% confidence interval of the matching distances between paired points are calculated. As a result, the baseline model exhibits a mean matching distance of 0.654 m, whereas the model with RFS achieves an improved mean matching distance of 0.579 m. Additionally, when comparing the confidence intervals of both models, the baseline model has a range of [0.614, 0.693] m, while the model with RFS has a range of [0.531, 0.627] m. Because the confidence interval widths are relatively small compared to the size of the USV, which has a length of 3.8 m and a width of 1.8 m, the DSG algorithm can be considered stable in both cases.
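The stability statistics in Table 4 can be computed with a routine like the sketch below: nearest-neighbor matching in the XY plane between the Boat box centers stored in LOP and the candidate docking spots, followed by the mean, standard deviation, and a normal-approximation 95% confidence interval of the matching distances. The function and variable names are our own.

```python
import numpy as np
from scipy.spatial import cKDTree

def docking_spot_matching_stats(boat_centers_xy, candidate_spots_xy):
    """Mean / std / 95% CI of XY nearest-neighbor matching distances."""
    tree = cKDTree(boat_centers_xy)
    dists, _ = tree.query(candidate_spots_xy)      # nearest LOP Boat per candidate spot
    mean, std = dists.mean(), dists.std(ddof=1)
    half_width = 1.96 * std / np.sqrt(len(dists))  # normal-approximation 95% CI
    return mean, std, (mean - half_width, mean + half_width)
```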
Figure 15 visualizes the DSG results for the model with RFS, confirming that the system can reliably generate autonomous docking spots in real-world environments. A detailed visualization of the complete LP-DSG system workflow is provided in
Video S1.