The Method of Static Semantic Map Construction Based on Instance Segmentation and Dynamic Point Elimination

Abstract: Semantic information usually contains a description of the environment content, which enables a mobile robot to understand the environment and improves its ability to interact with it. In high-level human-computer interaction applications, a Simultaneous Localization and Mapping (SLAM) system not only needs higher accuracy and robustness, but must also be able to construct a static semantic map of the environment. However, traditional visual SLAM lacks semantic information. Furthermore, in real scenes, dynamic objects reduce system performance and introduce redundancy into the map. Both problems directly affect the robot's ability to perceive and understand its surroundings. Based on ORB-SLAM3, this article proposes a new algorithm that uses semantic information and global dense optical flow as constraints to generate a dynamic-static mask and eliminate dynamic objects. Then, to construct a static 3D semantic map in indoor dynamic environments, 2D semantic information is fused with the 3D point cloud. Experimental results on different types of dataset sequences show that, compared with the original ORB-SLAM3, both Absolute Pose Error (APE) and Relative Pose Error (RPE) are reduced to varying degrees. On freiburg3-walking-xyz in particular, the APE is reduced by 97.78% from the original average value of 0.523, and the RPE is reduced by 52.33% from the original average value of 0.0193. Compared with DS-SLAM and DynaSLAM, our system improves real-time performance while maintaining accuracy and robustness. Meanwhile, the expected map with environmental semantic information is built, and the map redundancy caused by dynamic objects is successfully reduced. Test results in real scenes further demonstrate the construction of static semantic maps and the effectiveness of our algorithm.


Introduction
SLAM (Simultaneous Localization and Mapping) is a method by which intelligent mobile devices estimate their pose and build a map of the surrounding environment in unknown scenes. It is widely used in fields such as autonomous driving, robotics, and AR (Augmented Reality). A typical SLAM framework is mainly composed of front-end odometry, back-end pose optimization, and loop detection. According to the type of sensor used to obtain environmental data, SLAM systems can be roughly divided into Laser SLAM, which uses lidar and whose front end is laser odometry, and Visual SLAM, which uses a camera and whose front end is visual odometry [1]. The sensors used by Visual SLAM are not only inexpensive, but also capture environmental content more intuitively. In recent years, vision-based SLAM solutions have been developed extensively, such as MonoSLAM [2], ORB-SLAM2 [3], and ORB-SLAM3 [4].
However, these schemes assume that the unknown environment is static, while real scenes often contain both dynamic and static objects. The visual odometry is a functional module that processes sensor data, performs feature selection, extraction, and matching, and produces pose optimization and partial mapping results in a short time [5]. Unfortunately, the movement of dynamic objects in the environment directly affects feature selection, extraction, matching, and data association in the visual odometry [6], and ultimately degrades the performance of the entire system. At the same time, to meet the needs of human-computer interaction, Visual SLAM must do more than estimate pose and construct a consistent map of the environment. Many practical applications of Visual SLAM, such as robotic home care, require higher system accuracy and robustness, as well as a semantic map with perceptual information that provides the robot with higher-level environmental information and helps it complete complex interactive tasks [7].
With the continuous development of deep learning, a number of classic deep learning networks have been proposed, such as CNN [8], R-CNN [9], SegNet [10], Faster R-CNN [11], and Mask R-CNN [12]. Combining a deep learning network with vSLAM can help robots perceive scenes at both the geometric and semantic levels, understand the environmental content abstractly, and obtain high-level perception of the environment. Compared with commonly used object detection networks such as YOLOv3 [13] and SSD [14], a semantic segmentation network has two major advantages when combined with Visual SLAM. On the one hand, semantic segmentation yields an accurate outline of the target rather than the rectangular box that contains it. This means that, during dynamic point elimination in Visual SLAM, a relatively accurate prior region for dynamic objects can be provided, which avoids the loss of tracking accuracy or the tracking failure caused by eliminating too many feature points. On the other hand, semantic segmentation directly provides the 2D semantic information in the scene, which is convenient for constructing an environment map integrated with semantic information. Some related solutions are based on ORB-SLAM [15] or ORB-SLAM2, because ORB-SLAM3 was formally reported only this year. ORB-SLAM3 [4] mainly proposes a new visual-inertial navigation and multi-map fusion algorithm, supports pinhole and fisheye camera models, and relies on maximum a posteriori estimation in a tightly coupled feature-based formulation. As a result, its performance is greatly improved compared to the previous version. In general, the ORB-SLAM3 algorithm is more mature than its predecessors, which will promote the practical engineering deployment of Visual SLAM to a certain extent.
The work in this article is built around ORB-SLAM3. We propose a new method to improve the accuracy and robustness of its visual odometry in dynamic scenes and, in order to provide environmental semantic information, we further construct an indoor static semantic map. The main contributions are the following three points:

1. On the basis of ORB-SLAM3, we use multithreading to add an instance segmentation thread. This thread uses an FPN (Feature Pyramid Network) [16] + Mask R-CNN network and is written in C++ to extract the semantic information of image frames. Since ORB-SLAM3 itself is mainly written in C++, this keeps the modules of the system consistent and well integrated.

2. We propose a new method that combines the FPN + Mask R-CNN deep learning network with global dense optical flow to obtain semantic information and eliminate dynamic points on objects in dynamic scenes. This solves the redundant tracking problem of visual odometry and effectively improves the accuracy and robustness of ORB-SLAM3 in dynamic scenes.

3. Our system integrates 2D semantic information and the 3D point cloud to construct a semantic map with perceptual information, further improving the robot's ability to perceive and understand the surrounding environment.
The rest of this article is structured as follows: Section 2 reviews related work on improving the performance of visual odometry, reducing the impact of dynamic objects, and constructing perceptible semantic maps. Section 3 describes the design and implementation of our SLAM system. Section 4 reports the performance of our system on a dataset and in a real scene to illustrate its effectiveness. Finally, the work of this article is summarized and discussed in Section 5.

Related Work
In practical applications, the accuracy and robustness of the visual odometry are very important to a Visual SLAM system, and the construction of a perceptible environment map is an indispensable condition for high-level interaction.

Improvement of Visual Odometry Performance
Several algorithms have been proposed to improve the performance of visual odometry. For example, Cui [17] used image intensity for data association in common frames and used photometric calibration for accuracy and robustness. Zhang [18] matched lines and computed a collinear relationship of points to assist bundle adjustment, and then modified perspective-n-point to improve the tracking accuracy in poorly textured situations. Konstantinos-Nektarios [19] proposed a novel visual semantic odometry framework that uses semantics to enable medium-term continuous tracking of points. Zhu [20] fused a purely event-based tracking algorithm with an inertial measurement unit to provide accurate metric tracking of a camera's full 6-DOF pose. ORB-SLAM3 [4] proposes a feature-based, tightly coupled visual-inertial navigation system that relies entirely on maximum a posteriori estimation, so that the system can run robustly in real time indoors or outdoors and is two to five times more accurate than ORB-SLAM2.

Visual SLAM in a Dynamic Scene
The appearance of a dynamic object not only affects the accuracy of pose tracking and adds computational burden, but also hinders advanced applications such as robot navigation and interaction. Therefore, it is necessary to eliminate the influence of dynamic factors. For example, Chao [21] used the semantic segmentation network SegNet and LK sparse optical flow combined with motion consistency to eliminate dynamic objects. Berta [22] used the instance segmentation network Mask R-CNN and multi-view geometry to eliminate dynamic objects. However, the speed of multi-view geometry and the accuracy of the sparse optical flow algorithm are often unsatisfactory. Deyvid [23] used scene flow to propagate dynamic objects within the map. Palazzolo [24] used the residuals obtained after an initial registration, together with explicit modeling of free space in the model. Rünz [25] used a multiple model fitting approach in which each object can move independently of the background and still be tracked effectively. These methods place certain demands on computing power and hardware, and our compromise is to use global dense optical flow combined with semantic information.

Semantic Information of Maps
Perceivable semantic maps are essential for robots to complete interactive behaviors, and there are different ways to add semantic information to maps. For example, Weinmann [26] used neighborhood selection, feature extraction, feature selection, and classification to segment the point cloud directly and obtain semantics. Qi [27] used YOLOv3 to obtain object types and contours to construct environmental label maps, while Guan [28] used semantic information to process point clouds and objects to construct real-time semantic maps. Yue [29] used multi-robot collaboration, in which local semantic maps are shared among robots for global semantic map fusion. Qin [30] used robust semantic features, an inertial measurement unit, and wheel encoders to generate a global visual semantic map. Wei [31] used instance networks and built instance-oriented 3D semantic maps directly from images acquired by an RGB-D camera. As previously mentioned, semantic segmentation has great advantages over other ways of obtaining semantic information. In this work, we fuse the 2D semantic information obtained from the Mask R-CNN instance segmentation network into 3D point clouds to endow the map with semantic information.

System Description
In this section, our Visual SLAM algorithm is introduced in detail. The section covers four aspects: first, the main framework of the system; second, the instance segmentation network; third, the dynamic point elimination algorithm; and last, the method of constructing the static semantic map.

System Components
As mentioned in Section 2, the ORB-SLAM3 algorithm performs better than ORB-SLAM2. Therefore, our solution uses ORB-SLAM3 as the basic framework. However, ORB-SLAM3 is not robust in dynamic scenes. In order to reduce the impact of dynamic objects on accuracy and robustness, we designed the system to eliminate dynamic objects first and then build static semantic maps in indoor dynamic scenes. The main framework of the system is shown in Figure 1. The purple part is our improvement, the yellow part is our work on the choice and deployment of the instance segmentation network, and the green part is the final output of our system. Because ORB-SLAM3 builds only a sparse point cloud map, the orange part is the added dense point cloud construction algorithm [32]. As shown in Figure 1, there are five main threads: the Tracking thread, the Local Mapping thread, the Loop Closing thread, the Full Bundle Adjustment thread, and the Semantic Segmentation thread. When the image frame of the current scene (including an RGB image and a depth image) is obtained, the RGB image is passed simultaneously to the Tracking thread and the Semantic Segmentation thread to extract feature points and semantic information. Then, the global dense optical flow mask of the RGB image is calculated to obtain the actual dynamic object information, which is combined with the prior semantic results from semantic segmentation to form a mutually constrained dynamic-static mask. Subsequently, the mask is used to eliminate the feature points on dynamic objects, and the remaining points are used to obtain keyframes and calculate dense point cloud maps. Finally, the semantic information and the dense point cloud are fused to generate the static semantic map.

Semantic Segmentation
In order to obtain the semantic information in the scene, we use the instance segmentation network Mask R-CNN in the Semantic Segmentation thread, and rewrite it in C++ when deploying it to our system. The network structure is shown in Figure 2. Deep features of images usually carry rich semantic information. To obtain accurate target recognition results in instance segmentation, ResNet101 [33] is selected as the feature extraction layer in the first stage of the Mask R-CNN network to extract the basic features of images. In addition, because large, medium, and small targets may all appear in the actual environment when constructing the semantic map, the FPN network is applied after feature extraction to perform skip-connection fusion between the bottom and top layers of the extracted features. The basic structure of the network was also adjusted according to actual demand, so that Mask R-CNN has better semantic segmentation accuracy and can recognize up to 80 categories on the COCO dataset [34].

Dynamic Points Elimination
In the dynamic point elimination process, we use optical flow estimation to detect dynamic objects in the scene and obtain the actual dynamic information. Optical flow is a two-dimensional, pixel-level detection method that uses the changes of pixels in the time domain, combined with the correlation between neighboring frames, to calculate the correspondence between the previous frame and the current frame and thereby obtain the motion information of objects. Optical flow methods are divided into sparse and dense methods. Based on the general principle of the optical flow method, as shown in Figure 3, the dense optical flow algorithm proposed by Gunnar Farnebäck uses the pixels of two consecutive image frames to perform motion estimation, and its effect is better than that of the sparse optical flow algorithm [35].

While eliminating dynamic information, most of the useful static information should be retained to ensure the accuracy and reliability of the tracking process. In the present work, in order to obtain a higher detection accuracy, we use the global dense optical flow method to detect dynamic objects in the image frame, and set a small threshold so that small-scale moving targets are detected. At the same time, subsampling is used to improve the detection speed. Then, to deal with motion noise, this article treats the optical flow constraint as a soft threshold condition, which is combined with the prior semantic information from the semantic thread to obtain further constraints. The specific algorithm flow is as follows:
(1) Carry out semantic segmentation to obtain the prior Semantic-Mask of dynamic objects, and calculate the dense optical flow to obtain the Optical-Flow-Mask generated by the actual movement of objects.
(2) Traverse the pixels of the Semantic-Mask and determine whether each point has dynamic information in the corresponding 3 × 3 area of the Optical-Flow-Mask.
(2.1) If Semantic-Mask (i,j) is a prior dynamic point, and optical flow information appears in the corresponding region, then the pixel belongs to the dynamic object region.
(2.2) If Semantic-Mask (i,j) is a prior dynamic point, and no optical flow information appears in the corresponding region, then the pixel belongs to the prior dynamic object region.
(2.3) If Semantic-Mask (i,j) is a prior static point, and optical flow information appears in the corresponding region, then the pixel belongs to the dynamic object region.
(2.4) If Semantic-Mask (i,j) is a prior static point, and no optical flow information appears in the corresponding region, then the pixel belongs to the static target region.
(3) Fuse the pixels in all dynamic areas to generate the final dynamic-static mask.
(4) Use the dynamic-static information of the mask to judge the previously extracted feature points: if a feature point falls in the dynamic area of the mask, it is eliminated.
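The fusion rules in steps (2) to (4) can be sketched as follows. This is a minimal pure-Python illustration rather than the C++ implementation used in the system: masks are nested lists of 0/1 values, the function names `flow_in_window`, `fuse_masks`, and `filter_keypoints` are hypothetical, and the sketch assumes that the prior dynamic region of case (2.2) is counted as dynamic when the final mask is formed in step (3).

```python
from typing import List, Tuple

Mask = List[List[int]]  # 1 = dynamic (or moving), 0 = static


def flow_in_window(flow_mask: Mask, i: int, j: int) -> bool:
    """Return True if any pixel in the 3x3 window around (i, j) has optical flow."""
    rows, cols = len(flow_mask), len(flow_mask[0])
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            r, c = i + di, j + dj
            if 0 <= r < rows and 0 <= c < cols and flow_mask[r][c]:
                return True
    return False


def fuse_masks(semantic_mask: Mask, flow_mask: Mask) -> Mask:
    """Combine the prior Semantic-Mask with the Optical-Flow-Mask."""
    rows, cols = len(semantic_mask), len(semantic_mask[0])
    fused = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            prior_dynamic = bool(semantic_mask[i][j])
            moving = flow_in_window(flow_mask, i, j)
            # Cases (2.1), (2.2), (2.3) -> dynamic; case (2.4) -> static.
            # Treating (2.2) as dynamic is an assumption of this sketch.
            fused[i][j] = 1 if (prior_dynamic or moving) else 0
    return fused


def filter_keypoints(keypoints: List[Tuple[int, int]], fused: Mask) -> List[Tuple[int, int]]:
    """Keep only the feature points (x, y) that lie on static pixels."""
    return [(x, y) for (x, y) in keypoints if fused[y][x] == 0]


if __name__ == "__main__":
    # Toy 5x5 frame: a 2x2 prior dynamic object and one moving pixel elsewhere.
    semantic = [
        [0, 0, 0, 0, 0],
        [0, 1, 1, 0, 0],
        [0, 1, 1, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
    ]
    flow = [[0] * 5 for _ in range(5)]
    flow[4][4] = 1
    mask = fuse_masks(semantic, flow)
    print(filter_keypoints([(0, 0), (1, 1), (3, 4), (4, 0)], mask))
```

In this toy frame, the point on the segmented object and the point next to the moving pixel are discarded, while the two points on static background are kept for tracking.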

Static Semantic Map Construction
The original ORB-SLAM3 generates sparse point clouds. In our system, after the dynamic points are eliminated, the keyframes are obtained, and the dense point cloud is then calculated from the keyframes. Finally, the obtained 2D semantic information is fused with the 3D dense point cloud to build a static 3D semantic dense point cloud map. The algorithm flow is shown in Figure 4. In the fusion of 2D semantic information and the 3D point cloud, points at an inappropriate distance are first eliminated according to the depth information. Then, the semantic colors of the different objects are extracted from the acquired 2D semantic information, and the semantic color is looked up using the coordinate index that corresponds to each depth value. Finally, the semantic color is attached to the 3D space point of the corresponding depth value, which gives the point cloud its semantic information. The entire semantic map construction process is carried out after dynamic object elimination. In this way, the map redundancy caused by dynamic objects participating in the mapping is avoided, and the static map of the actual scene is restored to a certain extent. That is, through the fusion processing of the 3D point cloud, the point cloud is endowed with semantic information, and a perceptible static semantic map of the indoor dynamic environment is generated.
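The fusion step described above can be illustrated with a short sketch. It is not the system's C++ code: it assumes a pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy`, depth values in meters, a hypothetical valid range `min_depth` to `max_depth`, a hypothetical label-to-color `palette`, and a gray default color for unlabeled pixels. Pixels flagged in the dynamic-static mask are skipped so that only static points enter the map.

```python
from typing import Dict, List, Tuple

Color = Tuple[int, int, int]
SemanticPoint = Tuple[float, float, float, Color]  # x, y, z, (r, g, b)


def build_semantic_cloud(
    depth: List[List[float]],
    labels: List[List[int]],
    dynamic_mask: List[List[int]],
    palette: Dict[int, Color],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    min_depth: float = 0.3,   # hypothetical near limit (m)
    max_depth: float = 5.0,   # hypothetical far limit (m)
    default_color: Color = (128, 128, 128),
) -> List[SemanticPoint]:
    """Back-project static pixels to 3D and attach their semantic color."""
    cloud: List[SemanticPoint] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if dynamic_mask[v][u]:
                continue  # pixel belongs to a dynamic object
            if not (min_depth <= z <= max_depth):
                continue  # point at an inappropriate distance
            # Pinhole back-projection of pixel (u, v) with depth z.
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            color = palette.get(labels[v][u], default_color)
            cloud.append((x, y, z, color))
    return cloud


if __name__ == "__main__":
    palette = {1: (255, 165, 0), 2: (0, 128, 0)}  # hypothetical label colors
    depth = [[1.0, 0.0], [2.0, 2.0]]
    labels = [[1, 0], [0, 2]]
    dynamic = [[0, 0], [1, 0]]
    for point in build_semantic_cloud(depth, labels, dynamic, palette, 1.0, 1.0, 0.0, 0.0):
        print(point)
```

In the toy 2 × 2 frame, one pixel is dropped because its depth is out of range and one because it is dynamic, so only two colored points remain.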

Experimental Results
In order to test the actual effect of our algorithm, we used two types of scenes in the TUM dataset [36] (high dynamic scenes and low dynamic scenes).
Generally, when using the TUM dataset, APE (Absolute Pose Error) and RPE (Relative Pose Error) are used to evaluate the robustness and accuracy of visual odometry. APE represents the global trajectory consistency: the smaller the value, the higher the consistency and the better the robustness of the system. RPE measures the drift in rotation and translation: the smaller the drift, the more accurate the system [36]. Our experiments also use APE and RPE to analyze and compare the estimated trajectory and the real trajectory. For both APE and RPE, we report the RMSE (Root Mean Square Error), Median Error, Mean Error, and S.D. (Standard Deviation).
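To make the reported statistics concrete, the sketch below computes them for a toy trajectory. It is a simplified, translation-only illustration under stated assumptions: the two trajectories are already time-associated and aligned, rotation is ignored, the RPE uses a fixed frame step `delta`, and the standard deviation is the population form. The function names are hypothetical; the experiments in this article use the EVO tool rather than this code.

```python
import math
import statistics
from typing import Dict, List, Sequence

Position = Sequence[float]  # (x, y, z)


def ape_errors(est: List[Position], gt: List[Position]) -> List[float]:
    """Per-pose absolute translation error between aligned trajectories."""
    return [math.dist(p, q) for p, q in zip(est, gt)]


def rpe_errors(est: List[Position], gt: List[Position], delta: int = 1) -> List[float]:
    """Per-step relative translation error over a fixed frame step (no rotation)."""
    errors = []
    for i in range(len(est) - delta):
        d_est = [b - a for a, b in zip(est[i], est[i + delta])]
        d_gt = [b - a for a, b in zip(gt[i], gt[i + delta])]
        errors.append(math.dist(d_est, d_gt))
    return errors


def summarize(errors: List[float]) -> Dict[str, float]:
    """RMSE, mean, median, and (population) standard deviation of the errors."""
    return {
        "rmse": math.sqrt(sum(e * e for e in errors) / len(errors)),
        "mean": statistics.fmean(errors),
        "median": statistics.median(errors),
        "std": statistics.pstdev(errors),
    }


if __name__ == "__main__":
    gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
    est = [(0.0, 0.0, 0.0), (1.0, 0.3, 0.4), (2.0, 0.0, 0.0)]
    print("APE:", summarize(ape_errors(est, gt)))
    print("RPE:", summarize(rpe_errors(est, gt)))
```

A single outlying pose raises the APE only at that pose, but it affects two consecutive relative steps in the RPE, which is why the two metrics are reported together.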
In Section 4.1, we first verify and analyze the effectiveness of the dynamic point elimination algorithm. In Section 4.2, we compare and analyze the performance of our SLAM system in constructing a static semantic map on the dataset. Finally, the actual mapping effect is shown through the real scene experiment in Section 4.3. In the high-dynamic scene used here, two people walk around the desk. The left image in Figure 5 is the raw image of a frame in the scene, and the right image is the result of semantic segmentation. The left image in Figure 6 shows the feature point distribution in a frame during tracking. It can be seen that there are a large number of feature points on the moving person. The right image in Figure 6 shows the feature point distribution after the algorithm in this article has eliminated the selected dynamic points.

Dynamic Object Eliminating Experiment
For further analysis and comparison of the above experimental results, we calculated the APE and RPE between the estimated trajectory and the real trajectory. The experimental results are shown in Tables 1-3, where fr3 indicates that the sequence belongs to the freiburg3 dataset; sitting and walking denote two different states of the people in the scene, with sitting being low dynamic and walking being high dynamic; and xyz, rpy, static, and halfsphere stand for four types of camera ego-motion [36]. For example, sitting means that the person is sitting, and xyz means that the camera moves along the x, y, and z axes. Table 3 further illustrates the experimental results, in which the improvement is the percentage by which the error obtained with the algorithm in this article is reduced relative to the original error. In addition, the average percentage reductions in APE and RPE are shown in Tables 4 and 5. Compared with the original ORB-SLAM3 on the different fr3-walking sequences, applying the dynamic point elimination algorithm greatly reduces the APE and also noticeably reduces the RPE. On fr3-walking-xyz in particular, the APE decreases by 97.78% on average and the RPE decreases by 52.33% on average. Note that, since only part of the human body moves in the low-dynamic dataset, the static part of the human body still provides pose estimation information during visual odometry tracking. Because the ultimate goal of this article is to construct a static semantic map by eliminating the influence of dynamic objects, our focus is on the final static semantic mapping effect. Thus, we only use the low-dynamic sequence fr3-sitting-static to illustrate the effect here. Although the APE decreases only slightly and the RPE changes little, the precision of the static semantic map in the low-dynamic scene is ensured.
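As a worked check of the improvement figures, assume the improvement is the relative reduction of the error with respect to the original ORB-SLAM3, as defined for Table 3. The resulting error values below are implied by the reported percentages and original averages for fr3-walking-xyz; they are rounded estimates from that arithmetic, not values quoted from the tables.

```latex
\eta = \frac{e_{\mathrm{orig}} - e_{\mathrm{ours}}}{e_{\mathrm{orig}}} \times 100\%
\quad\Longrightarrow\quad
e_{\mathrm{ours}} = e_{\mathrm{orig}} \left( 1 - \frac{\eta}{100\%} \right)

\text{APE: } e_{\mathrm{ours}} \approx 0.523 \times (1 - 0.9778) \approx 0.0116
\qquad
\text{RPE: } e_{\mathrm{ours}} \approx 0.0193 \times (1 - 0.5233) \approx 0.0092
```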
In terms of dynamic point elimination, our algorithm improves not only the pose accuracy but also the robustness of the system in dynamic environments, and it helps to construct a static map that is consistent with the environment of the dynamic scene. To show more intuitively the effectiveness of the dynamic point elimination algorithm and the improvement in ORB-SLAM3's pose accuracy and robustness in a dynamic environment, we take the freiburg3-walking-xyz sequence as an example. Under the same experimental conditions, we compared the real trajectory of the dataset and the estimated trajectories of the original ORB-SLAM3, DS-SLAM, and DynaSLAM with the estimated trajectory of our algorithm. In Figures 7 and 8, the real trajectory of the original dataset is labeled groundtruth, Our denotes the result of our algorithm, and the trajectories are plotted together for comparison.
In Figure 7, the estimated trajectory of the original ORB-SLAM3 deviates the most from the real trajectory and is the most affected by dynamic objects. DS-SLAM reduces the influence of dynamic objects to a certain extent, while DynaSLAM and our algorithm deviate the least from the real trajectory and effectively reduce the impact of dynamic objects. With the algorithm proposed in this article, the estimated trajectory fits the real trajectory well, which improves the trajectory accuracy and robustness of ORB-SLAM3 and therefore enhances the global consistency of the mapping.
In Figure 8, comparing the trajectories along the three axes shows the deviation in each direction at different moments: the estimated trajectory of ORB-SLAM3 clearly deviates from the real trajectory in the x, y, and z directions, and the deviation of the estimated trajectory of DS-SLAM in x, y, and z is smaller. DynaSLAM and our algorithm fit the real trajectory well in all directions. We then further evaluate the performance of DS-SLAM, DynaSLAM, and our algorithm on APE and RPE with EVO [37] (Evaluation of Odometry). The results are shown in Tables 6 and 7.
In addition, Table 8 shows the mean tracking time of DS-SLAM, DynaSLAM, and our SLAM system. Although the real-time performance of DS-SLAM is good, it is not as effective as DynaSLAM and our SLAM system at reducing the impact of dynamic objects. DynaSLAM and our system are comparable in reducing the impact of dynamic objects, and their errors are of the same order of magnitude. However, our real-time performance is better than that of DynaSLAM. These results also show that our algorithm improves the robustness of the ORB-SLAM3 system in dynamic scenes.
Table 9 shows the running time of the main algorithm modules of our SLAM system. These results are obtained by averaging the time over 10 runs of the algorithm. The first column is the processing time of semantic segmentation; the second column is the time required to calculate the global dense optical flow and the dynamic-static mask; the third column is the time required for dynamic point elimination. Taken together, the experimental results above show that our algorithm effectively mitigates the performance degradation of ORB-SLAM3 during tracking that is caused by the movement of dynamic objects.
However, for applications on lightweight platforms, the real-time performance of the algorithm on our platform needs to be further improved.

Dataset Experiment
This section describes experiments in a low-dynamic dataset scene (two people sit on chairs and move only parts of their bodies) and a high-dynamic dataset scene (two people walk around the desk). In order to fully assess the effect of our algorithm when constructing a static semantic map, we first use ORB-SLAM3 to construct the original sparse point cloud map. The results are shown in Figure 9.
In Figure 9, (a) is a frame image in the low dynamic dataset, and (b) is its corresponding sparse point cloud map. (c) is a frame in the high-dynamic dataset, and (d) is its corresponding sparse point cloud map.
After dense mapping, the mapping results for the low-dynamic and high-dynamic datasets are compared with and without our algorithm, as shown in Figure 10. The results without the proposed algorithm are in the left column, and their dense point cloud maps do not have semantic information. The right column shows the semantic maps obtained with our algorithm. Specifically, in Figure 10, (a) is the dense point cloud map of the low-dynamic dataset without our algorithm, in which the hands and heads of the people appear as obvious redundancy; (b) is the corresponding result of dynamic object elimination and semantic mapping with the proposed algorithm. The redundant information is clearly reduced, and the semantic information from the instance segmentation network is assigned to the objects in the environment: orange represents the computer screen, green the keyboard, yellow the bottle, light purple the book, dark purple the mouse, and dark blue the chair. (c) is the dense point cloud map of the high-dynamic scene; because the people in the scene move a great deal, a large amount of redundancy is visible in the point cloud map. Compared with (c), (d) is the mapping result after our algorithm eliminates the dynamic people, which effectively reduces the map redundancy while carrying semantic information in different colors. From the comparison in Figure 10, it can be concluded that the environment map built after eliminating dynamic objects not only reduces the redundancy those objects introduce, but also carries semantic information that can support high-level tasks.

Real Scene Test
Finally, we test our system in the actual laboratory. The hardware platform is as mentioned above. The depth camera is Astra-Pro, and the experimental results are shown in Figure 11.
In Figure 11, (a) shows the low-dynamic scene in the laboratory (part of the person's body moves), and (b) shows the static semantic map of (a) generated by the algorithm in this article. In the figure, dark blue represents the chair, orange represents the display screen, dark purple represents books, light purple represents the mouse, yellow represents the bottle, and green represents the keyboard. Similarly, (c) is a high-dynamic scene in the laboratory (the person walks back and forth in front of the desk), and (d) is the static semantic map of (c) constructed by the algorithm in this article, whose color coding is consistent with that of (b). These results from running the dynamic point elimination algorithm in an actual scene show that the proposed algorithm effectively eliminates the dynamic object and completes the construction of the static semantic map.

Conclusions and Future Work
This article explores a solution for improving the robustness and accuracy of ORB-SLAM3 in dynamic scenes. By combining semantic information and global dense optical flow to eliminate dynamic points, it reduces the influence of dynamic objects on pose estimation in the visual odometry. Compared with the original ORB-SLAM3, on different types of dataset sequences, both APE and RPE are reduced to varying degrees. On fr3-walking-xyz in particular, the APE decreased by 97.78% from the original average value of 0.523, and the RPE decreased by 52.33% from the original average value of 0.0193. In addition, compared with DS-SLAM and DynaSLAM, the system makes a trade-off between speed and performance. At the same time, the fusion of 2D semantic information and the 3D point cloud gives the map semantic information and successfully reduces map redundancy. As a feasible way for vSLAM to perceive the surrounding environment in higher-level applications, this research provides a perceivable indoor environmental map with semantics that helps robots understand their surroundings. In future work, we will put more emphasis on moving vSLAM toward engineering practice, modifying the instance segmentation network based on the wavelet transform, and investigating a variety of motion detection strategies that form strong constraints on the TensorRT platform, in order to improve the precision of dynamic point elimination and the speed of the system.

Data Availability Statement: "COCO dataset" at https://cocodataset.org (accessed on 5 August 2021). "TUM dataset" at https://vision.in.tum.de/data/datasets/rgbd-dataset/download (accessed on 5 August 2021).

Conflicts of Interest:
The authors declare no conflict of interest.