5.3. Comparison of SLAM Robustness in Dynamic Environments
To evaluate our algorithm, we conducted comparative experiments on both a public dataset and our own datasets, comparing VINS-mono, dynamic-VINS, and the proposed algorithm. VINS-mono is the base framework on which the algorithm in this paper is built, but it cannot handle dynamic environments. Dynamic-VINS is a relatively advanced open-source algorithm, but it only eliminates dynamic features. Comparison with VINS-mono therefore verifies the ability of the proposed algorithm to cope with dynamic environments, while comparison with dynamic-VINS verifies the accuracy gain brought by the rigid-point-set modeling in our algorithm. We compare the three algorithms on three datasets; during operation and evaluation, all settings are kept identical except for the algorithm under test.
The public dataset we selected is the market sequence of OpenLORIS-Scene, recorded in a supermarket and containing a large number of pedestrians and high dynamics. For indoor scenes, we use a Qualisys 3D motion-capture system with nine Miqus M3 cameras arranged around the room, as shown in Figure 12, to obtain the robot's ground-truth motion; it provides sub-millimeter accuracy for the target trajectory. The system supports a maximum resolution of 4 MP; in full-field-of-view high-speed mode it reaches a frame rate of up to 650 frames per second, and its maximum capture distance is 18 m. The experimental site is an indoor room of 5 m by 6 m. The acquisition device is a D455 fitted with marker points recognizable by the motion-capture system; the pose between the markers and the camera is fixed by the mechanical installation. The D455 has a built-in IMU, and the camera-IMU extrinsics are calibrated with the Kalibr toolbox. Some of the scenes where data was collected are shown in Figure 13.
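The camera's ground-truth pose follows from composing the mocap-reported marker pose with the fixed marker-to-camera extrinsic. A minimal sketch of that composition is shown below; the transforms and numbers are illustrative placeholders, not values from our calibration:

```python
# Compose the mocap-reported marker pose with the fixed marker-to-camera
# extrinsic to obtain the camera's ground-truth pose:
#   T_world_camera = T_world_marker * T_marker_camera
# Poses are 4x4 homogeneous transforms stored as nested lists.

def mat4_mul(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def make_pose(rotation, translation):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a 3-vector."""
    return [rotation[i] + [translation[i]] for i in range(3)] + \
           [[0.0, 0.0, 0.0, 1.0]]

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Illustrative marker pose as reported by the motion-capture system...
T_world_marker = make_pose(identity, [1.0, 2.0, 0.5])
# ...and an illustrative fixed extrinsic from the mechanical installation.
T_marker_camera = make_pose(identity, [0.0, -0.05, 0.02])

T_world_camera = mat4_mul(T_world_marker, T_marker_camera)
print([row[3] for row in T_world_camera[:3]])  # camera position in the world frame
```

With a non-identity marker rotation the same composition also rotates the extrinsic offset, which is why the transform must be applied as a full SE(3) product rather than a simple translation sum.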
For outdoor scenes, we use a differential GNSS positioning device (M66H2H-Lite GNSS receiver, Johannes Kepler Luojia Technology Co., Ltd., Wuhan, China) to obtain the robot's ground-truth motion. A fixed base station collects satellite data and sends it to the mobile receiver over a 4G network; the mobile receiver collects its own satellite data, combines it with the base-station information, and performs carrier-phase differential calculations to obtain high-precision positions. The GNSS device used in this paper achieves sub-centimeter accuracy and outputs positioning information at 20 Hz. The data-collection platform is a mobile robot, as shown in Figure 7; the GNSS-related devices are shown in Figure 7b. The total length of the trajectory is 315 m. Schematic views of the scenes along the trajectory and the route are shown in Figure 14.
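Before comparison with an estimated trajectory, the receiver's geodetic fixes must be expressed in a local metric frame. A common approach, sketched below under the assumption of a standard WGS-84 → ECEF → ENU conversion anchored at the first fix (the coordinates shown are illustrative, not from our dataset):

```python
import math

# WGS-84 ellipsoid constants
A = 6378137.0            # semi-major axis [m]
E2 = 6.69437999014e-3    # first eccentricity squared

def geodetic_to_ecef(lat_deg, lon_deg, alt):
    """Convert a WGS-84 latitude/longitude/altitude fix to ECEF coordinates."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    n = A / math.sqrt(1.0 - E2 * math.sin(lat) ** 2)
    x = (n + alt) * math.cos(lat) * math.cos(lon)
    y = (n + alt) * math.cos(lat) * math.sin(lon)
    z = (n * (1.0 - E2) + alt) * math.sin(lat)
    return x, y, z

def ecef_to_enu(p, ref_geodetic):
    """Express an ECEF point in the local ENU frame anchored at ref_geodetic."""
    lat, lon = math.radians(ref_geodetic[0]), math.radians(ref_geodetic[1])
    ox, oy, oz = geodetic_to_ecef(*ref_geodetic)
    dx, dy, dz = p[0] - ox, p[1] - oy, p[2] - oz
    east = -math.sin(lon) * dx + math.cos(lon) * dy
    north = (-math.sin(lat) * math.cos(lon) * dx
             - math.sin(lat) * math.sin(lon) * dy + math.cos(lat) * dz)
    up = (math.cos(lat) * math.cos(lon) * dx
          + math.cos(lat) * math.sin(lon) * dy + math.sin(lat) * dz)
    return east, north, up

# The first fix anchors the local frame; later fixes become metric ENU points.
origin = (30.5, 114.3, 40.0)      # illustrative coordinates
fix = (30.5001, 114.3001, 40.5)
enu = ecef_to_enu(geodetic_to_ecef(*fix), origin)
```

The resulting ENU positions can then be fed to trajectory-evaluation tools as a position-only ground-truth file.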
The evaluation metrics are Absolute Pose Error (APE) and Relative Pose Error (RPE), computed with the EVO analysis tools [23]. APE is the error between the estimated pose at each timestamp and the corresponding ground-truth pose, reflecting the overall deviation of the trajectory. RPE is the error between the relative transformation of adjacent estimated poses and that of the ground truth, reflecting local estimation accuracy. In the experiments of this paper, we report the Root Mean Square Error (RMSE), Mean, Median, and Standard Deviation (STD) of APE and RPE, respectively.
The degree of precision improvement is computed as

η = (β − α) / β × 100%

where η represents the degree of precision improvement, α represents the trajectory error of the algorithm in this paper, and β represents the trajectory error of the compared algorithm.
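For position-only trajectories, APE and RPE reduce to simple statistics over per-timestamp residuals. A minimal sketch, assuming the estimated trajectory is already time-synchronized and aligned with the ground truth (tools such as EVO handle the alignment); the trajectories here are toy values:

```python
import math

def ape_rmse(estimate, ground_truth):
    """RMSE of absolute position error between aligned, synchronized trajectories."""
    sq = [sum((e - g) ** 2 for e, g in zip(p, q))
          for p, q in zip(estimate, ground_truth)]
    return math.sqrt(sum(sq) / len(sq))

def rpe_rmse(estimate, ground_truth):
    """RMSE of the error between consecutive relative translations."""
    def deltas(traj):
        return [tuple(b - a for a, b in zip(p, q))
                for p, q in zip(traj, traj[1:])]
    sq = [sum((de - dg) ** 2 for de, dg in zip(d1, d2))
          for d1, d2 in zip(deltas(estimate), deltas(ground_truth))]
    return math.sqrt(sum(sq) / len(sq))

# Toy example: a straight ground-truth path and a slightly drifted estimate.
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
est = [(0.0, 0.0, 0.0), (1.0, 0.1, 0.0), (2.0, 0.1, 0.0)]
```

Note how a single lateral drift inflates APE at every subsequent timestamp but contributes to RPE only at the step where it occurs, which is why APE reflects global deviation and RPE local accuracy.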
Firstly, we conducted a comparative experiment on OpenLORIS-Scene; the trajectories are shown in Figure 15a, the trajectory errors of the three algorithms are shown in Figure 15b, and the statistics are presented in Table 3 and Table 4. As shown in Figure 15a, the trajectory of the proposed algorithm is overall closest to the ground truth, followed by dynamic-VINS, while VINS-mono performs worst. As Figure 15b shows, on the x- and y-axes the proposed algorithm has no significant advantage over dynamic-VINS, but both are significantly better than VINS-mono. In the z-axis direction, the proposed algorithm has no significant advantage over VINS-mono, yet both are significantly better than dynamic-VINS. This is because VINS-mono cannot handle dynamic targets: people and objects moving in the horizontal plane seriously interfere with its horizontal position estimation. Dynamic-VINS, by contrast, filters dynamic features out directly, which reduces the number of features on people and objects available to constrain the z-axis and thus lowers its z-axis accuracy.
Analyzing Figure 16, Table 3 and Table 4, we can see that in terms of both the APE and RPE indicators, the proposed algorithm outperforms dynamic-VINS and is far superior to VINS-mono. Compared with dynamic-VINS, it improves the RMSE, Mean, and STD indicators by 13.6%, 12.3%, and 17.0%, respectively; compared with VINS-mono, the improvements are 74.1%, 75.7%, and 66.6%, respectively. In summary, in the market scenario with highly dynamic targets, the proposed algorithm is superior to the other two algorithms.
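The percentage improvements are computed as (β − α) / β × 100, with α the error of the proposed algorithm and β that of the compared algorithm. A one-line sketch, with made-up error values rather than numbers from the tables:

```python
def improvement(ours, other):
    """Percentage precision improvement of our trajectory error over another's.

    `ours` and `other` are error statistics (e.g. APE RMSE in metres);
    a smaller `ours` yields a positive improvement.
    """
    return (other - ours) / other * 100.0

# Illustrative error values only, not taken from Table 3 or Table 4.
print(improvement(0.216, 0.250))  # smaller error -> positive improvement
```

The same relation is applied to each statistic (RMSE, Mean, STD) in the comparisons that follow.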
Next are the comparative experiments on our own indoor dataset. The trajectories are shown in Figure 17a, the trajectory errors of the three algorithms in Figure 17b, and the statistics in Table 5 and Table 6. In this experiment, the camera repeatedly performed complex circular motions in a small scene, and moving people occupied a large portion of the camera's view on multiple occasions. This tests the algorithm's ability to cope with dynamic-target interference while the camera is in continuous rotational motion. As Figure 17 shows, most of the trajectories of dynamic-VINS and the proposed algorithm stay close to the ground truth and are far superior to VINS-mono, while the proposed algorithm tracks the ground truth better than dynamic-VINS over most of the trajectory.
Analyzing Figure 18, Table 5 and Table 6, we can see that in terms of both the APE and RPE indicators, the proposed algorithm outperforms dynamic-VINS and is far superior to VINS-mono. Compared with dynamic-VINS, it improves the RMSE, Mean, and STD indicators by 17.3%, 15.3%, and 28.7%, respectively; compared with VINS-mono, the improvements are 35.0%, 32.7%, and 47.1%, respectively. In summary, in the indoor scenario with highly dynamic objects, the proposed algorithm is superior to the other two algorithms.
Next are the comparative experiments on our own outdoor dataset. The trajectories are shown in Figure 19a, the trajectory errors of the three algorithms in Figure 19b, and the statistics in Table 7. In this experiment the camera has a wide field of view, so dynamic feature points make up only a small proportion over most of the trajectory. This tests the anti-interference ability of the proposed algorithm when the proportion of dynamic content is low. Analyzing Figure 19, in several trajectory segments, such as 350 to 400 on the x-axis, 350 to 400 and 450 to 520 on the y-axis, and 300 to 420 on the z-axis, the proposed algorithm and dynamic-VINS stay closer to the ground truth and perform better than VINS-mono.
Analyzing Figure 20 and Table 7, we can see that in terms of the APE indicators, the proposed algorithm outperforms dynamic-VINS and is far superior to VINS-mono. Compared with dynamic-VINS, it improves the RMSE, Mean, and STD indicators by 12.7%, 12.2%, and 16.7%, respectively; compared with VINS-mono, the improvements are 75.8%, 73.2%, and 84.8%, respectively. Since the ground-truth trajectory converted from the differential GNSS data does not contain attitude information, we do not compare relative pose errors.
In addition, as depicted in Figure 13 and Figure 14, the self-collected datasets contain multi-target interactions. Indoors, two to three moving people cross paths repeatedly, creating complex scenarios with close-range occlusion between the device and the people. Similarly, the outdoor dataset contains numerous cases of people occluding one another, as well as vehicles passing by and obscuring pedestrians. The algorithm is thus shown to be capable of handling dynamic occlusion.
In summary, in a dynamic environment, the algorithm proposed in this paper improves the overall positioning accuracy by separating dynamic feature points and performing rigid-body modeling.