DM-SLAM: Monocular SLAM in Dynamic Environments

: Many classic visual monocular SLAM (simultaneous localization and mapping) systems have been developed over the past decades, yet most of them fail when dynamic scenarios dominate. DM-SLAM is proposed for handling dynamic objects in environments based on ORB-SLAM2. This article mainly concentrates on two aspects. Firstly, we proposed a distribution and local-based RANSAC (Random Sample Consensus) algorithm (DLRSAC) to extract static features from the dynamic scene based on awareness of the nature di ﬀ erence between motion and static, which is integrated into initialization of DM-SLAM. Secondly, we designed a candidate map points selection mechanism based on neighborhood mutual exclusion to balance the accuracy of tracking camera pose and system robustness in motion scenes. Finally, we conducted experiments in the public dataset and compared DM-SLAM with ORB-SLAM2. The experiments corroborated the superiority of the DM-SLAM.


Introduction
The main task of visual SLAM (simultaneous localization and mapping) is to estimate the pose of the camera and reconstruct the three-dimensional structure of the environments in the field of view of continuous frames. Visual-based autonomous robots are able to acquire camera poses and environmental information to perform complex tasks such as robot navigation, human-robot interaction, and path planning.
Monocular, binocular, and RGB-D cameras can be used to implement visual SLAM. Due to their acquisition of depth of environments directly, binocular-based and RGB-D-based SLAM are more robust than monocular SLAM. Although monocular cameras have their own inherent limitations (such as inability to observe scale and state initialization), with respect to size, power, and cost [1], monocular cameras have broad application scenarios.
However, despite some remarkable results such as ORB-SLAM2 [2], LSD [3], DSO [4] in visual SLAM, most approaches, whether feature-based, direct or semi-dense [5,6], assume that static targets or backgrounds occupy the majority in the scene [1]. For feature-based SLAM, this assumption can be interpreted as that the feature points of the static scene occupy the majority in all feature points in the image. These systems generally adopt RANSAC [7] and robust kernel algorithms that perform well when the dynamic feature points are relatively few, but as the proportion of static feature points gradually decreases, the computation required for obtaining camera poses with higher confidence increases dramatically. Such assumptions greatly limit possible applications of visual SLAM in actual scenarios.
The existence of dynamic objects not only affects the tracking accuracy if the matches belonged to dynamic objects are used in tracking algorithms, but also causes the system to establish an imprecise map including dynamic objects and reduces the accuracy of loop closure detection, which is used to reduce cumulative error. For many long-term applications such as virtual reality, augmented reality or medical imaging, retaining precise and stable maps from previous runs which can be reused is significant, especially if dynamic objects can be detected and processed appropriately [8].
The premise of solving the camera pose is to get the static feature set, and the camera pose is required to filter the static features from the image features of the noise features, mismatches and dynamic features. When the scene does not satisfy the main assumption of static scene as assumed by typical SLAM system, how to get an accurate camera pose seems to be a problem with "chicken-and-egg" characteristics [8]. There are currently three main ideas for solving this chicken-egg problem. We will review these ideas and introduce the main methods in detail in subsequent context.
The first method is reconstruction before motion segmentation. Tan et al. [9] propose a prior-based adaptive RANSAC algorithm called PARSAC to establish a camera motion model, which will estimate the probability that current model is a camera motion model by the degree of distribution of points within the model for each iteration. Sun et al. [10] warp the last frame with the perspective matrix and subtract the current frame from the warped frame to distinguish dynamic areas. The perspective matrix is constructed by computing the optimal homography with the RANSAC algorithm, wherein camera ego-motions are represented as 2-D perspective matrices. The above methods carry out reconstruction before motion segmentation, but their results are greatly affected by the initial reconstruction result.
There also exist some methods of reconstruction after motion segmentation. Kitt et al. [11] used a classifier trained in advance to classify the feature points as dynamic or static, but it cannot be used to explore an unknown environment because it is limited by training data. Berta [1] et al. removed features on the movable objects in prior-dynamic environments by CNN (Convolutional Neural Network), then use multi-view geometry to eliminate moving objects that are not set to movable. Chao Yu [12] added semantic segmentation threads and dense semantic map creation threads based on ORB-SLAM2 [2]. SegNet is used to segment the image semantically and the result is given to tracking thread to eliminate the features on movable objects. After solving the pose of camera with the remaining features, the system determines whether the movable objects are actually moving by checking whether the features on the movable object satisfy the solved camera model. In addition to the above possible solutions, factorization has an elegant mathematical formulation and can solve the problems of segmentation and reconstruction simultaneously. Schindler, K [13] and Vidal, R [14] separated moving objects while objects should be all rigid and enough frames have been considered, they usually required that the same feature set is available throughout the certain chosen frames. Yan, J [15] utilized two properties of trajectory data, geometric constraint and locality, to segment a wide range of motions including independent, articulated, rigid, non-rigid, degenerate, non-degenerate, or any combination.
Presently, we propose a distribution and local-based RANSAC algorithm (DLRSAC) to address this chicken-egg problem currently. DLRSAC clusters feature points by roughly dividing the image into grids, then reconstructing the motion model in the local grid, and finally competing the local motion model for the entire image to create the camera motion model. DLRSAC transforms SLAM problems in dynamic environments into numerical optimization problems. DLRSAC will be introduced in detail in Section 2. We integrate DLRSAC into the ORB-SLAM monocular system, where DLRSAC is used to initialize the map by extracting static features. During the tracking process, we designed a map updating mechanism to deal with scenarios including either non-rigid dynamic objects or objects whose movement is difficult to identify between frames, which will decay subsequent map expansion and camera pose calculation.

1.
The problem has chicken-and-egg characteristics, and the natural difference between motion and static is the key to solving the problem.

2.
Due to the presence of inherent sensor noise, motion of different moving objects and camera motion are distinguished only when the motion exceeds noise interference.
The second issue can be resolved simply by checking more frames, or in two frames with larger time intervals. In the DS-SLAM system to be introduced later, we designed a targeted mechanism to avoid noise interference in order to construct a map with high confidence and more accurate camera pose.
We believe that understanding the nature of static and motion is the key to solving the problem inherent in the chicken-and-egg situation. In the relative motion of the camera with the environment, all motion models solved in the scene can be characterized as Equation (1).
represents the synthesis of two motions; T represents the motion of the object (including rotation and translation), T c represents the motion of the camera, and T s represents the motion of the static object, which is the only constant in the equation (rotation as identity matrix and translation into three-dimensional 0 vector), T n represents the motion of the nth motion model. Due to the coupling with T c , the value of T c T s is constantly changing. T c is indistinguishable when there is no prior information of T c or T n .
However, considering first the issue from agents in the environment, we find a way to define static. For the exploring agent, it is necessary to provide a reference coordinate system with a wide distribution and constant motion state so that the agent can locate itself and conveniently reconstruct the environment. This is the difference in nature between static and motion in our living environment. Therefore, we designed the DLRSAC based on three basic observations as follow: • The definition of static relies on the distribution of inliers within the model, not the numerical score of the motion model and the number of feature points included. • Models between different moving objects share a few inliers.
Taking into account the above observations, our algorithm aims to create a motion model that is solved by a set of feature points geometrically as close as possible, with the final inliers within the model distributed as evenly as possible. Equation (2) below defines our objective function: In Equation (2), (x s1 ,x s2 ••• x s8 )refers to the 8 points sampled from current frame, F is the fundamental matrix solved through eight-points algorithm by(x s1 ,x s2 ••• x s8 ). (x t1 ,x t2 ••• x tn )are all inliers of F. D(·) is the function of calculating the distribution of input variables. The specific form of D(·) is shown in Equation (3).
In (3), X is the features set, which is also the input variables of D(•). p i is the feature belonging to X. p is the centroid of X and n is the number of features in X. To reduce the complexity of solving the problem, we designed the DRLRSAC algorithm to extract static features.
We use the 8-point method when solving the fundamental matrix, so the number of molecular variables is 8. The smaller the molecule, the closer the 8 feature points used in the solution model, and the higher the confidence that they also belong to our motion model. The denominator is the distribution of the interior points belonging to the model. The larger the distribution, the more feasible the model should be adopted as the reference system. In other words, the probability that we think it is camera motion model increases with the distribution.
Before we describe the algorithm flow, some definitions need to be clarified in advance.
Grid: We will divide the image into l * m grids of the same size. Grid model: Fundamental matrix represents the current grid solved by RANSAC [8] with the matched feature in current grid.
Couple: Refers to the test of how many inliers in tested grid satisfy the Grid model. Coupled Grid set: Consists of the grids whose Couple Degree with Coupling Grid is greater than a certain threshold and Coupling Grid itself.
Grid model distribution: Refers to the geometric distribution of Coupled Grid set of one Grid model. Algorithm flow is described as below: 1.
Extract features in the frames used for initialization and match them.

2.
Divide the image into l ×m grids, assign features into grids according to whether features belong. For each Grid, if the number of features doesn't satisfy the solution condition, it is discarded. Then, solve the centroid with all the feature points within these grids which pass above check.

3.
For each grid, normalize all the features, then use RANSAC to solve the Grid model with the normalized features.

4.
Constructing Couple Matrix by calculating Coupled Degree of each grid with other grids.

5.
Get Coupled Grid set of each Grid model by setting certain threshold, calculate distribution of the set using the centroids of the grids in the set.

6.
Select the Grid model with the largest Grid model distribution as the camera pose, the features in grids contained in the Coupled Grid set of selected grid are considered to be candidate static feature points. 7.
Triangulate static feature points in step 6, initialize map with successful triangulated features. 8.
Use bundle adjustment to adjust 3D map points and camera pose.

Neighborhood Mutual Exclusion for Selecitng Candidate Map Points.
The newly added map points are more likely to be used for subsequent system tracking in SLAM, which will affect subsequent estimation of the camera pose and map expansion. Therefore, it is important to ensure that the newly selected map points belong to a stable map, or else the system will quickly fail. Conformity to geometric constraints in more key frames must be verified, but this will slow down map expansion and reduce robustness of the system, especially in low-texture area.
The matches introduced by the new key frame are the main source for generating new map points. When key frame K i inserts, matches are searched for between features in K i and other unmatched features in connected key frames K c . If a match passes the following test, it will be taken as a candidate map point.

1.
The new match satisfies the fundamental matrix.

2.
Triangular ORB pairs after checking parallax, projection error, uniform scale, and forward depth.

3.
There are no alternate map points in the neighborhood of current tested feature match.
Details can be seen Algorithm 1.

Algorithm 1. Candidate map points selection mechanism
For new match (x-x') between optimal co-view key frames with new key frame; The distribution and number of added map points are constrained by the neighborhood. The advantage of this is that the map points on the objects that become moving can be removed, and the method of neighborhood suppression is used in the process of selecting the map points. There are two benefits to this:

•
Due to the concentrated distribution of moving objects, the number and probability of features sampled onto moving objects are reduced by the neighborhood suppression selection • Limit the number of selected objects that become dynamic. Even if features on moving objects are selected, the impact on reconstruction is weakened

System Review
ORB-SLAM2 integrates modules such as map initialization, inter-frame estimation, local map management, and closed-loop detection, which provides good experimental comparison objects for subsequent research. So, we adapted ORB-SLAM2 to provide basic framework for creating a static map and robustly estimating camera pose. Among them, the closed loop process in the system is identical to ORB-SLAM2. Figure 1 shows the framework of DM-SLAM. DLRSAC is integrated into initialization in the tracking thread to construct an original map. Also, the expansion rules of the map are designed to weaken the effects of dynamic objects in the mapping thread. Below, we describe the measures that have been taken to improve the robustness of system motion when compared to ORB-SLAM2. There are three process including tracking, local mapping and loop closing. Tracking is mainly responsible for extracting the initial pose of the camera utilizing the input images, where we integrate distribution and local-based RANSAC algorithm (DLRSAC) into the initialization in tracking process. Local mapping is responsible for saving, filtering and optimizing key frames and map points, where we integrate algorithms 1 into candidate point decision. The closed-loop detection mainly detects whether there is a closed-loop, and if so, global optimization is performed.

DLRSAC Implementation in DM-SLAM
DLRSAC performs static semantic extraction to create a static initial map with respect to ORB-SLAM2. This algorithm doesn't require any prior information, yet it introduces distribution degree evaluation to break the problem of "chicken-egg" circulation in dynamic scenes. The camera pose estimation can be completed more confidentially and more accurately even if static feature points in the scene don't take precedence.
Static initial maps are constantly used in camera tracking, which affect subsequent map expansion. Therefore, the accuracy of DLRSAC is much more important than the recall rate. The accuracy of the algorithm is improved by analyzing the reason why the algorithm fails, introducing detection methods and resetting the system when failure occurs. Analysis of the reasons for failure is reviewed below: 1. The parallax between the two frames is so small that solving F matrix at this time is pathological problem. 2. The motion in the scene is too complicated to affect the model selection in each grid, so the motion model is finally selected. 3. The scene appears motion blur (low-textured area), leading to few matches and few sufficient Grid models.
Detection methods are reviewed below: 1. Verify the parallax of map points. If the mean value is too small, the algorithm is considered to fail. In order to call a large parallax, we use two frames separated by several frames instead of two consecutive frames at the time of initialization. An additional benefit of doing this is that it makes the motion and noise separable. 2. Verify that the distribution of selected Grid model. If the value is too small, the algorithm is considered to fail. 3. Check the two frames used for initialization, reset system if there are too few features matching.
We delay the initialization until this method passes the failure test. Figure 2 shows the static feature extraction effect of the algorithm in different motion scenarios in TUM fr3_walking_rpy. There are three process including tracking, local mapping and loop closing. Tracking is mainly responsible for extracting the initial pose of the camera utilizing the input images, where we integrate distribution and local-based RANSAC algorithm (DLRSAC) into the initialization in tracking process. Local mapping is responsible for saving, filtering and optimizing key frames and map points, where we integrate algorithms 1 into candidate point decision. The closed-loop detection mainly detects whether there is a closed-loop, and if so, global optimization is performed.

DLRSAC Implementation in DM-SLAM
DLRSAC performs static semantic extraction to create a static initial map with respect to ORB-SLAM2. This algorithm doesn't require any prior information, yet it introduces distribution degree evaluation to break the problem of "chicken-egg" circulation in dynamic scenes. The camera pose estimation can be completed more confidentially and more accurately even if static feature points in the scene don't take precedence.
Static initial maps are constantly used in camera tracking, which affect subsequent map expansion. Therefore, the accuracy of DLRSAC is much more important than the recall rate. The accuracy of the algorithm is improved by analyzing the reason why the algorithm fails, introducing detection methods and resetting the system when failure occurs. Analysis of the reasons for failure is reviewed below: 1.
The parallax between the two frames is so small that solving F matrix at this time is pathological problem.

2.
The motion in the scene is too complicated to affect the model selection in each grid, so the motion model is finally selected.

3.
The scene appears motion blur (low-textured area), leading to few matches and few sufficient Grid models.
Detection methods are reviewed below: 1.
Verify the parallax of map points. If the mean value is too small, the algorithm is considered to fail. In order to call a large parallax, we use two frames separated by several frames instead of two consecutive frames at the time of initialization. An additional benefit of doing this is that it makes the motion and noise separable.

2.
Verify that the distribution of selected Grid model. If the value is too small, the algorithm is considered to fail.

3.
Check the two frames used for initialization, reset system if there are too few features matching.
We delay the initialization until this method passes the failure test. Figure 2 shows the static feature extraction effect of the algorithm in different motion scenarios in TUM fr3_walking_rpy.

Tracking Thread
In tracking thread of other visual SLAM systems, the motion model [16] is usually used to initialize camera pose, and then outliers are eliminated in subsequent optimizations. First we track the matching features between current frame and current reference key frame, estimate the camera's initial pose, then track the local map to get more feature matches. Finally, camera pose optimization is performed. This approach essentially is a method of reconstruction after segmentation. Dynamic objects are able to be managed because the definition of the static map has been implemented in the initialization. See ORB-SLAM2 [16] for details on local map.

Tracking Thread
In tracking thread of other visual SLAM systems, the motion model [16] is usually used to initialize camera pose, and then outliers are eliminated in subsequent optimizations. First we track the matching features between current frame and current reference key frame, estimate the camera's initial pose, then track the local map to get more feature matches. Finally, camera pose optimization is performed. This approach essentially is a method of reconstruction after segmentation. Dynamic objects are able to be managed because the definition of the static map has been implemented in the initialization. See ORB-SLAM2 [16] for details on local map.

Results
In order to show the effectiveness of the DLRSAC algorithm in more detail, we not only run DM-SLAM on a public data set, but also perform DLRSAC experiments. Experiments are tested in dynamic environments within the public TUM RGB-D dataset [17] while depth images are ignored. High dynamic fr3_walking sequences are typically very challenging because people walking in the scene increases the difficulty of monocular SLAM. First, we show the static features extraction results of initialization using DLRSAC in different scenarios of fr3_walking_rpy sequence after tacking fails, as well as intermediate results in the tracking. Then, we test the time required by the DLRSAC algorithm and compared it with RANSAC and ARSAC. Finally, we evaluate the performance of DM-SLAM in some sequences of TUM-RGB-D dataset. All experiments are performed on a computer with Intel i7 CPU, and 8GB RAM.

Evaluations of DLRSAC
Unlike the data used in comparison of algorithms [9] like RANSAC, LoSAC [18], Multi-GS [19] and ARSAC [9], where the dynamic objects change significantly in geometry, the data used in the operation of this system is quite different. In our experiments, we used successive frames in video sequences, including our own data and TUM sequence fr3_walking_rpy. Figure 3 displays results tested in our data, where the middle book in the source images is moving. Our data is used to compare algorithms where there are rigid-body motions in the scene. Figure 4 shows the results tested in TUM fr3_walking_rpy, which verify the performance of algorithms when there are non-rigid objects in the scene. We compared our algorithm with others, including standard RANSAC [8] and ARSAC [9]. We choose ARSAC instead of PARSAC because obtaining prior information of the inliers ratio in the images is not practical. As can be seen in Figure 4e-f, most static features can be extracted with DLRSAC, with the exception of static features in the conditions below: 1.
The feature is so independent from other features that the number of feature points in the local grid does not satisfy the solution condition.

2.
Some static features belong to the grid in which moving features dominate, meaning that the grid is not classified into the Coupled Grid set (so all features in the grid are considered as moving features).
Appl. Sci. 2019, 9, x FOR PEER REVIEW 8 of 16 In order to show the effectiveness of the DLRSAC algorithm in more detail, we not only run DM-SLAM on a public data set, but also perform DLRSAC experiments. Experiments are tested in dynamic environments within the public TUM RGB-D dataset [17] while depth images are ignored. High dynamic fr3_walking sequences are typically very challenging because people walking in the scene increases the difficulty of monocular SLAM. First, we show the static features extraction results of initialization using DLRSAC in different scenarios of fr3_walking_rpy sequence after tacking fails, as well as intermediate results in the tracking. Then, we test the time required by the DLRSAC algorithm and compared it with RANSAC and ARSAC. Finally, we evaluate the performance of DM-SLAM in some sequences of TUM-RGB-D dataset. All experiments are performed on a computer with Intel i7 CPU, and 8GB RAM.

Evaluations of DLRSAC
Unlike the data used in comparison of algorithms [9] like RANSAC, LoSAC [18], Multi-GS [19] and ARSAC [9], where the dynamic objects change significantly in geometry, the data used in the operation of this system is quite different. In our experiments, we used successive frames in video sequences, including our own data and TUM sequence fr3_walking_rpy. Figure 3 displays results tested in our data, where the middle book in the source images is moving. Our data is used to compare algorithms where there are rigid-body motions in the scene. Figure 4 shows the results tested in TUM fr3_walking_rpy, which verify the performance of algorithms when there are non-rigid objects in the scene. We compared our algorithm with others, including standard RANSAC [8] and ARSAC [9]. We choose ARSAC instead of PARSAC because obtaining prior information of the inliers ratio in the images is not practical. As can be seen in Figure  4e  We can see from Figures 3c and 4a that RANSAC has almost no distinguishing ability for moving objects in consecutive frames. And, although ARSAC can extract a part of static features, like Figures 3d and 4b, the extracted features are much lower in number than those of DLRSAC. We hypothesize that this is because ARSAC only takes the degree of distribution as the objective function optimization variable.
In addition, the computational overhead of DLRSAC is slightly longer than RANSAC and ARSAC while within acceptable circumstances, as listed in Table 1 (the iteration number is set to 2000). The purpose of dividing an image into grids is to reduce the complexity of motion within the grid so that the points that are sampled to solve the Grid model are more likely to belong to the same motion model. As we know, under RANSAC taken to solve Grid model, the number N of iterations required to guarantee a correct solution is: where s is the number of data points, ε is the percentage of outliers, and p is the probability of success (confidence level) [20]. In fact, RANSAC tends to choose models that are solved by features from multiple source models, as shown in Figure 3. These models have stronger Coupling ability in local grid but less in global image, which helps DLRSAC filter out real static features. Aside from this, we find that the selected model has stabilized after several hundred iterations due to this nature. The grid selected in Figure 5    We can see from Figures 3c and 4a that RANSAC has almost no distinguishing ability for moving objects in consecutive frames. And, although ARSAC can extract a part of static features, like Figures  3d and 4b, the extracted features are much lower in number than those of DLRSAC. We hypothesize that this is because ARSAC only takes the degree of distribution as the objective function optimization variable.
In addition, the computational overhead of DLRSAC is slightly longer than RANSAC and ARSAC while within acceptable circumstances, as listed in Table 1 (the iteration number is set to 2000).

Initialization and Tracking Performance
In sequence fr3_walking, there are two men walking around the desk. The men are the only movable objects, which causes a highly dynamic scene. Sequence fr3_walking includes rpy, xyz, and half sphere, which stands for three types of camera ego-motions. Sequence fr3_walking_rpy indicates that rotation is dominant in camera movement, sequence fr3_walking_xyz indicates that translation is dominant in camera movement, and sequence fr3_walking_half_sphere indicates that both camera motion and rotation are obvious. In order to verify the performance of DLRSAC in different motion scenarios, we randomly tested our algorithm in different frames in the fr3_walking sequence, and then selected several typical motion scenes to show performance of static features extraction of our algorithm. The results are shown in Figure 6. In Figure 6 e-f, although there is certain motion blur in the environment, the algorithm still extracts most of the static features.
It can be seen that DM-SLAM removes most motion features during the tracking process. In the map expansion, although motion features are added to the map points occasionally, the limitation on the distribution and number of newly added map points weakened the influence of new added points on the camera pose calculation. In turn, the correct camera pose assures the geometric constraint relationship, which allows the system to eliminate motion features based on the projection error of the newly added map points in subsequent frames.

Overall Evaluation Using TUM RGB-D Dataset
It is worth noting that the determination of static objects in the DM-SLAM system stems from the solution of the model in the grid. Therefore, DM-SLAM is not suitable for the situation where massive moving objects exist in the scene, resulting in the static background blocked in various parts of images. The most scenes in TUM RGB-D dataset are indoor and the motion situation is not too complicated. At the same time, sequence fr3_walking includes many different types of camera motions, such as rpy, xyz, and half_sphere, where we can observe the performance of the SLAM system more comprehensively. Therefore, we chose TUM sequence fr3_walking to compare ORB-SLAM2 with DM-SLAM. Tables 2-4 show quantitative comparison results of ORB-SLAM2 and DM-SLAM in sequences fr3_walking and fr1_xyz. As Berta, B and Facil, J proposed in [1], we run each sequence ten times and show RMSE (RootMeanSquareError), mean, media, S.D. results, to account for non-deterministic nature of the system. In order to compare the basic performance of DRLRSAC, we have added alignment with sequence fr1_xyz, which is dominated by static scenes. RMSE, Mean Error, Median Error, and Standard Deviation (S.D.) indicates SLAM errors from different aspects, but RMSE and S.D. are more relevant because they can better indicate the robustness and stability of the system. The calculation method of improvements refers to literature [17].

Initialization and Tracking Performance
In sequence fr3_walking, there are two men walking around the desk. The men are the only movable objects, which causes a highly dynamic scene. Sequence fr3_walking includes rpy, xyz, and half sphere, which stands for three types of camera ego-motions. Sequence fr3_walking_rpy indicates that rotation is dominant in camera movement, sequence fr3_walking_xyz indicates that translation is dominant in camera movement, and sequence fr3_walking_half_sphere indicates that both camera motion and rotation are obvious. In order to verify the performance of DLRSAC in different motion scenarios, we randomly tested our algorithm in different frames in the fr3_walking sequence, and then selected several typical motion scenes to show performance of static features extraction of our algorithm. The results are shown in Figure 6. In Figure 6e-f, although there is certain motion blur in the environment, the algorithm still extracts most of the static features.
It can be seen that DM-SLAM removes most motion features during the tracking process. In the map expansion, although motion features are added to the map points occasionally, the limitation on the distribution and number of newly added map points weakened the influence of new added points on the camera pose calculation. In turn, the correct camera pose assures the geometric constraint relationship, which allows the system to eliminate motion features based on the projection error of the newly added map points in subsequent frames.

Overall Evaluation Using TUM RGB-D Dataset
It is worth noting that the determination of static objects in the DM-SLAM system stems from the solution of the model in the grid. Therefore, DM-SLAM is not suitable for the situation where massive moving objects exist in the scene, resulting in the static background blocked in various parts of images. The most scenes in TUM RGB-D dataset are indoor and the motion situation is not too complicated. At the same time, sequence fr3_walking includes many different types of camera motions, such as rpy, xyz, and half_sphere, where we can observe the performance of the SLAM system more comprehensively. Therefore, we chose TUM sequence fr3_walking to compare ORB-SLAM2 with DM-SLAM. Tables 2-4 show quantitative comparison results of ORB-SLAM2 and DM-SLAM in sequences fr3_walking and fr1_xyz. As Berta, B and Facil, J proposed in [1], we run each sequence ten times and show RMSE (RootMeanSquareError), mean, media, S.D. results, to account for non-deterministic nature of the system. In order to compare the basic performance of DRLRSAC, we have added alignment with sequence fr1_xyz, which is dominated by static scenes. RMSE, Mean Error, Median Error, and Standard Deviation (S.D.) indicates SLAM errors from different aspects, but RMSE and S.D. are more relevant because they can better indicate the robustness and stability of the system. The calculation method of improvements refers to literature [17]. As we can see from Tables 2-4 there is obvious improvement when compared DM-SLAM with ORB-SLAM2. In sequence fr3_walking, the RMSE and S.D. improvement values of ATE can reach up to 96.42% and 98.45%. DM-SLAM errors are mainly caused by outliers, mismatches, and numerical calculations. We think that improvement depends on the ratio between the errors caused by the moving object and the above errors. In fr1_xyz, the improvement in performance is not obvious and we hypothesized that when the scene is mainly static, the distinction of DM-SLAM for dynamic features is not necessary for ORB-SLAM2 system. Figure 7 shows the ATE and results of metric rotational drift (RPE) from ORB-SLAM2 and DM-SLAM respectively. It can be seen that the overall error of DM-SLAM is much smaller than ORB-SLAM2. In experiments, when the number of feature points extracted per frame (default) is 1000, ORB-SLAM2 cannot work (while DM-SLAM works). For comparison, we set the number of feature points extracted per frame 3000, which improves the data association between frames and robustness of the system. As we can see from Tables 2-4 there is obvious improvement when compared DM-SLAM with ORB-SLAM2. In sequence fr3_walking, the RMSE and S.D. improvement values of ATE can reach up to 96.42% and 98.45%. DM-SLAM errors are mainly caused by outliers, mismatches, and numerical calculations. We think that improvement depends on the ratio between the errors caused by the moving object and the above errors. In fr1_xyz, the improvement in performance is not obvious and we hypothesized that when the scene is mainly static, the distinction of DM-SLAM for dynamic features is not necessary for ORB-SLAM2 system. Figure 7 shows the ATE and results of metric rotational drift (RPE) from ORB-SLAM2 and DM-SLAM respectively. It can be seen that the overall error of DM-SLAM is much smaller than ORB-SLAM2. In experiments, when the number of feature points extracted per frame (default) is 1000, ORB-SLAM2 cannot work (while DM-SLAM works). For comparison, we set the number of feature points extracted per frame 3000, which improves the data association between frames and robustness of the system.

Conclusions
In this paper, we propose a DM-SLAM system based on ORB-SLAM2 to adapt to dynamic environments. DLRSAC is proposed to extract the most static matches for initialization even when dynamic objects take dominate whether dynamic objects are non-rigid or rigid. In regard to tracking thread, we designed a neighborhood mutual exclusion algorithm to limit the number of candidate maps. This algorithm reduces the interference of dynamic features on camera tracking and ensures rapid expansion of the map. Due to the above techniques, DM-SLAM system can successfully operate in a dynamic environment even when moving objects dominate the scene.
The limitation of this method is that when a large number of stationary objects appear in the scene and then move in the tracking process, the algorithm will fail. Because in this case the static objects essentially represent the reference coordinate system, the motion of the reference coordinate system cannot be detected in this algorithm.
In our observation, static means the most suitable reference coordinate system in the whole world. In under-textured areas, non-feature pixels can be used to find the global optimal reference coordinate system, which is beyond feature-based methods. In the future, we are considering introducing a direct method to extend the ability of DM-SLAM to adapt to richer dynamic scenarios.

Conclusions
In this paper, we propose a DM-SLAM system based on ORB-SLAM2 to adapt to dynamic environments. DLRSAC is proposed to extract the most static matches for initialization even when dynamic objects take dominate whether dynamic objects are non-rigid or rigid. In regard to tracking thread, we designed a neighborhood mutual exclusion algorithm to limit the number of candidate maps. This algorithm reduces the interference of dynamic features on camera tracking and ensures rapid expansion of the map. Due to the above techniques, DM-SLAM system can successfully operate in a dynamic environment even when moving objects dominate the scene.
The limitation of this method is that when a large number of stationary objects appear in the scene and then move in the tracking process, the algorithm will fail. Because in this case the static objects essentially represent the reference coordinate system, the motion of the reference coordinate system cannot be detected in this algorithm.
In our observation, static means the most suitable reference coordinate system in the whole world. In under-textured areas, non-feature pixels can be used to find the global optimal reference coordinate system, which is beyond feature-based methods. In the future, we are considering introducing a direct method to extend the ability of DM-SLAM to adapt to richer dynamic scenarios.