VINS-dimc: A Visual-Inertial Navigation System for Dynamic Environment Integrating Multiple Constraints

Abstract: Most visual-inertial navigation systems (VINSs) suffer from moving objects and achieve poor positioning accuracy in dynamic environments. To improve the positioning accuracy of VINSs in dynamic environments, a monocular visual-inertial navigation system, VINS-dimc, is proposed. This system integrates multiple constraints for eliminating dynamic feature points. First, the motion model computed from the inertial measurement unit (IMU) data is combined with the epipolar constraint and the flow vector bound (FVB) constraint to eliminate feature matches that deviate significantly from the motion model. The algorithm then combines multiple feature-matching constraints, which avoids the weaknesses of any single constraint and makes the system more robust and general. Finally, these constraints are built into VINS-dimc, which can adapt to dynamic environments. Experiments show that the proposed algorithm accurately eliminates the dynamic feature points on moving objects while preserving the static feature points. This improves the positioning accuracy and robustness of VINSs on both self-collected data and public datasets.


Introduction
Visual simultaneous localization and mapping (SLAM), which uses image data from cameras, is one of the most important topics in computer vision [1,2]. This technology has important applications in robotics [3,4], autopiloting [5,6], and augmented reality [7,8].
When SLAM technology integrates an inertial measurement unit (IMU) and cameras, it is called a visual-inertial navigation system (VINS). The best-known VINS is VINS-mono [9], which was proposed by Qin et al. This system achieves accurate positioning of a device by combining observed visual feature points with pre-integrated IMU measurements. It can also compute and calibrate the extrinsic and temporal offsets between the camera and the IMU online. OKVIS [10] uses the concept of "keyframes" and partially marginalizes old states to keep computational costs low and ensure real-time operation. ROVIO [11] directly uses the pixel intensity errors of the images and achieves accurate tracking performance with great robustness. MSCKF [12] is a VINS based on an extended Kalman filter whose measurement model expresses geometric constraints that are useful for localization. OpenVINS, an open platform proposed by Geneva et al. [13], provides components such as a sliding window Kalman filter, consistent First-Estimates Jacobian treatments, and SLAM landmarks.
However, most VINSs are designed for static environments [14]. In the front-end visual odometry module, the system extracts feature points from the images and then matches the feature points of two adjacent frames. The position and attitude of the camera and the positions of the feature points in the real world are then determined by a local bundle adjustment of the visual and IMU data in the sliding window. The residual of the bundle adjustment optimization, which should be minimized, can be formulated as follows [15]:
$$R = \sum_{k=1}^{m} \sum_{i=1}^{n} \rho_{i,k} \left\| x_{i,k} - \pi(T_k, P_i) \right\|^2 + B \tag{1}$$
where $R$ is the sum of the residuals, $m$ is the number of images, and $n$ is the number of feature points. If the 3D key point $P_i$ can be observed in image $k$, the value of $\rho_{i,k}$ is set to 1; otherwise, it is set to 0. Here, $x_{i,k}$ represents the two-dimensional (2D) coordinates in image $k$ of 3D point $P_i$. Moreover, $\pi$ represents the projection of a 3D key point onto an image based on the position, pose, and intrinsic parameters of the camera, $T_k$ is the position and pose of the device corresponding to the $k$th image frame, $P_i$ is the 3D coordinates of the $i$th feature point, and $B$ is the residual of the IMU data in this short time period.
This scheme assumes that the VINS operates in an environment where all objects are static. However, a real environment often contains many dynamic objects, so this assumption is often untenable. Figure 1a shows a situation where the feature point $P_i$ is stationary. In this case, the motion of the camera is determined by solving Equation (1). Figure 1b shows a situation in a dynamic environment where the feature point $P_i$ is stationary, whereas the feature point $P_j$ is moving. As a result of the movement of the point $P_j$, its coordinates on the image also change from $x_{j,k+1}$ to $x'_{j,k+1}$. Let the vector $d_{j,k+1}$ be the displacement of the dynamic feature point $P_j$ on the 2D image $k+1$. The dynamic information in the environment causes the corresponding pixels to shift, so that $d_{j,k+1}$ is not zero, which affects the robustness and accuracy of the VINS.
Figure 1. Camera observations of a stationary feature point $P_i$ and a moving feature point $P_j$. $O_k$ and $O_{k+1}$ represent the center of the camera at two points in time. $T_{k,k+1}$ is the motion of the camera during the period from time point $k$ to time point $k+1$. The rectangles represent the imaging plane of the camera. $x_{i,k}$ represents the imaging of the $i$th feature point at time point $k$.
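The effect described above can be illustrated with a minimal numerical sketch. The intrinsic matrix, camera motion, and point positions below are hypothetical values chosen only for illustration, and `project` is a plain pinhole model standing in for the projection π.

```python
import numpy as np

def project(K, R, t, P):
    """Pinhole projection: world point P -> pixel, where (R, t) maps world to camera."""
    p_cam = R @ P + t
    return (K @ (p_cam / p_cam[2]))[:2]

# Hypothetical intrinsics and camera motion (not taken from the paper).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t_k1 = np.array([-0.1, 0.0, 0.0])   # camera moved 0.1 m along +x, so world-to-camera t is -0.1

P_static = np.array([0.5, 0.2, 4.0])
P_moving_k = np.array([-0.5, 0.1, 4.0])                  # position at time k
P_moving_k1 = P_moving_k + np.array([0.2, 0.0, 0.0])     # the object itself moved 0.2 m

# Observations in frame k+1.
x_static = project(K, R, t_k1, P_static)
x_moving = project(K, R, t_k1, P_moving_k1)

# Residuals predicted under the static-world assumption (points kept at their time-k positions).
res_static = np.linalg.norm(x_static - project(K, R, t_k1, P_static))
res_moving = np.linalg.norm(x_moving - project(K, R, t_k1, P_moving_k))
print(res_static, res_moving)
```

In this toy setup the static point contributes no residual, while the moving point contributes a pixel displacement that the optimizer would wrongly try to explain through camera motion.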
In summary, to improve the robustness and accuracy of VINSs in dynamic environments, feature point matches must be filtered: feature points in motion should be eliminated, and static feature points should be used to compute the position and attitude of the device. To solve the dynamic problem in VINSs, researchers have proposed various methods. Some conventional methods are described below. Shimamura et al. [16] proposed a vSLAM method that includes an initialization method, camera pose estimation, and an outlier rejection method for moving objects. In addition, they constructed an angular histogram from the outlier flows and approximated it by estimating the parameters of a Gaussian mixture. To adaptively model dynamic environments, Tan et al. [17] proposed an online keyframe representation and update method. They projected the feature points from the keyframe images onto the current frame image and detected dynamic features by comparing appearance and structure. Moreover, an adaptive random sample consensus (RANSAC) algorithm was proposed to efficiently remove mismatched feature points. Rünz et al. [18] used a multiple model-fitting approach in which each dynamic object can move independently and still be tracked effectively. Their method enables a robot to maintain a 3D model of every object and to improve it over time through fusion. Sun et al. [19] proposed a moving-feature removal approach using an RGB-D camera and integrated it into a SLAM system, where it acts as a preprocessing stage that filters out the feature points corresponding to dynamic objects. Alcantarilla et al. [20] introduced the concept of dense scene flow for SLAM, so that moving objects can be detected. Lee et al. [21] proposed a solution to the dynamic problem that uses a pose graph: the nodes of the graph are grouped according to grouping rules based on the noise covariance, and the constraints are truncated. Li et al. [22] proposed a sparse visual SLAM based on a motion probability propagation model for dynamic keypoint elimination. Their approach combines geometric information and semantic segmentation information to track keypoints. Nam et al. [23] proposed a robust tightly coupled VINS that uses a multi-stage outlier removal method to deal with the influence of moving objects based on feedback from the estimated states.
However, most of these methods assume that a scene contains more static information than dynamic information, so that the model generated from static feature matches can be used to eliminate dynamic feature matches. In practice, a scene may contain a significant amount of dynamic information, in which case this assumption does not hold. Furthermore, most methods for eliminating feature matches on moving objects use only one constraint. A single constraint often fails, making it difficult to correctly handle all types of feature point matches.
Researchers have also proposed numerous solutions based on deep learning. Wang et al. [24] proposed a method for dense segmentation of moving objects in dynamic scenarios. They combined dense dynamic object segmentation with dense visual SLAM, which allows the SLAM system to estimate the attitudes of the cameras. Yang et al. [25] presented a method for object detection and multi-view object SLAM that is effective in both dynamic and static environments. The method generates high-quality cuboid proposals and uses multi-view bundle adjustment, so that the SLAM system can jointly optimize the attitudes of cameras, objects, and feature points. Bescos et al. [26] integrated moving object detection and background inpainting into SLAM. Their system can operate in multiple modes in dynamic scenarios and can detect moving objects by multi-view geometric constraints, deep learning, or both. Yu et al. [27] proposed a visual SLAM based on semantic segmentation information that is suitable for dynamic environments. The method combines semantic segmentation with a motion consistency check to reduce the influence of moving objects, which improves the accuracy of SLAM in a dynamic environment. Brasch et al. [28] proposed a semantic SLAM framework that combines feature-based and direct approaches and uses the semantic information extracted from images to improve robustness and accuracy in challenging conditions such as dynamic environments. In addition, Jiao et al. [29] proposed a SLAM framework that employs a deep learning method for object detection and closely links the detection results with the geometric information of the feature points in the visual SLAM system, so that each extracted visual feature point is associated with a dynamic probability. Zhang et al. [30] proposed a dynamic SLAM that uses deep learning for semantic segmentation. Two consistency-checking strategies filter out the feature points located on moving objects, and point and line features are then used together to calculate the pose of the cameras.
However, deep learning methods often generalize poorly, especially for the SLAM problem: a model that performs well on its training dataset may perform poorly on the test dataset, which lowers the robustness of the system. Moreover, deep learning methods require highly accurate training data, especially for object detection and semantic segmentation, and it is quite difficult to determine the motion state of an object from its category alone.
To improve the visual-inertial navigation system, this study extends our previous work [31] by combining a variety of constraints to improve the accuracy of feature matching. In our previous study, we determined the validity of the IMU data and used the epipolar geometric constraint to eliminate abnormal feature point matches that deviated from the epipolar line. Building on this, the fundamental matrix calculated from the IMU data is used together with the flow vector bound (FVB) constraint to remove dynamic feature points that are offset along the epipolar direction. A sliding window model is used to match feature points between the current frame and a frame several frames earlier, which reduces the interference from dynamic feature points. Furthermore, a grid-based motion statistics (GMS) constraint uses the spatial consistency between feature points to refine feature point matching. Finally, the algorithm is integrated into VINS-mono to obtain VINS-dimc, which achieves better robustness and accuracy in dynamic environments. In summary, the contributions of this study are as follows.
1. FVB constraints were combined with IMU data. The motion model and the epipolar line were calculated from the IMU data, and dynamic feature points offset along the epipolar line were eliminated using the FVB constraint.
2. Multiple constraints (epipolar, FVB, GMS, and sliding window) were combined to compensate for the shortcomings of any single constraint and to help the VINS achieve more accurate feature matching.
3. The proposed algorithm was integrated into VINS-mono to obtain VINS-dimc, which was evaluated experimentally.
The rest of the paper is organized as follows. First, in Section 2, the basic principle of VINS and the dynamic information feature point elimination algorithm based on multiple constraints are presented. Then, Section 3 presents the experimental scheme used in this study, the experimental procedure, and the results. Finally, Section 4 provides some discussion and concluding remarks.

Overview
To improve the positioning accuracy of VINS in a dynamic environment, we propose a dynamic feature point elimination algorithm. It is integrated into VINS-mono to obtain VINS-dimc, which is designed to perform well in dynamic environments. The flowchart of VINS-dimc is shown in Figure 2. During the initialization stage, the system creates a local map through structure from motion. The initialization parameters of the system are determined by aligning the IMU data with the visual data. The graph optimization method based on bundle adjustment is an important state estimation algorithm for the tight fusion of camera and IMU data; the optimal solution is obtained by globally optimizing all the measurements [32,33]. The improvement in this study lies in the feature point matching stage. Combined with the IMU data, a multi-constraint dynamic feature point elimination algorithm refines the feature point matches, and the positioning accuracy of the system improves as a result.

IMU Data Validity Discrimination and Epipolar Constraint
The content of this part is similar to that of our previous study, so we will only briefly introduce it here. For further details, please refer to our earlier work.
Based on the position and attitude changes measured by the IMU, the fundamental matrix of the camera motion can be obtained using the following formulas:
$$E = t^{\wedge} R$$
$$F = K^{-T} E K^{-1}$$
where $E$ is the essential matrix, $R$ is the rotation, $t^{\wedge}$ is the antisymmetric matrix of translation $t$, $F$ is the fundamental matrix, and $K$ is the intrinsic parameter matrix of the camera. The feature point $p_1$ on the previous frame image determines the epipolar line: $l = F p_1 = [A, B, C]^T$. The distance between a feature point match and the epipolar line given by the fundamental matrix is calculated as [34]:
$$D = \frac{\left| p_2^T F p_1 \right|}{\sqrt{A^2 + B^2}}$$
where $p_1$ and $p_2$ are the homogeneous pixel coordinates of the matched feature point in the previous and current frames. If the distance is less than threshold $a$, the feature match is consistent with the fundamental matrix; otherwise, it is inconsistent. The value of $\sigma_{i,k}$ is set to 0 when the $i$th feature point of image frame $k$ is consistent; otherwise, it is set to 1, that is,
$$\sigma_{i,k} = \begin{cases} 0, & D_{i,k} < a \\ 1, & D_{i,k} \geq a \end{cases}$$
If the number of consistent feature matches is not less than threshold $b$, the IMU data are considered valid; otherwise, they are invalid. The value of $\lambda_k$ is set to 0 when the IMU data are invalid; otherwise, it is set to 1, that is,
$$\lambda_k = \begin{cases} 1, & \sum_{i=1}^{n} (1 - \sigma_{i,k}) \geq b \\ 0, & \text{otherwise} \end{cases}$$
If $\lambda_k = 1$, the fundamental matrix is considered accurate. The validity of each feature point match is then determined by its distance to the epipolar line. If the distance is larger than threshold $c$, the match differs from the motion model computed using the IMU data. Such a match is probably wrong or lies on a dynamic object. Therefore, it is removed from the feature point matching set.

FVB Constraint
When a feature point moves along the epipolar line, the distance between the point and the epipolar line does not change. Therefore, the epipolar constraint cannot detect it. To compensate for this situation, we introduced the FVB constraint. Let us assume that the translation of the camera is $t$. Let $p$ and $p'$ be the corresponding pixels of the point $P$ in the previous and current images, respectively, and let $Z$ be the depth of the feature point in the scene. Then, the 3D coordinate of point $P$ is $Z K^{-1} p$, and its coordinate in the second image can be obtained using the following formulas:
$$Z' p' = K \left( Z K^{-1} p + t \right) = Z p + K t$$
where $Z'$ is the depth of the point in the second image. When the change in depth is negligible ($Z' \approx Z$), a static point therefore satisfies
$$p' = p + \frac{K t}{Z}$$
whereas a moving point $P$ generally violates this relation. When the camera moves, the pixels corresponding to static three-dimensional points move along the line determined by the point $p$ and the epipole $e = K t$, and the amplitude of the motion depends on the translation and the depth [35].
We can specify a possible depth interval for 3D points and then determine the maximum and minimum displacements of the corresponding points on the epipolar line. If the displacement amplitude of a point is not between the minimum and maximum, it is probably a dynamic feature point.
If the IMU data are invalid (that is, the IMU validity flag is 0), the translation calculated from the IMU data is considered inaccurate. Therefore, the FVB constraint is not used in such a case.
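The IMU-based checks of this and the preceding subsection can be sketched together as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the thresholds `a`, `b`, `c`, the depth interval, and the intrinsics are hypothetical values; the motion convention is P2 = R P1 + t; and the FVB part assumes pure translation (R close to identity), using the exact flow of a static point rather than the small-depth-change approximation.

```python
import numpy as np

def skew(t):
    """Antisymmetric (cross-product) matrix of a 3-vector t."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def fundamental_from_imu(K, R, t):
    """F = K^-T (t^ R) K^-1 for the motion P2 = R @ P1 + t."""
    K_inv = np.linalg.inv(K)
    return K_inv.T @ skew(t) @ R @ K_inv

def epipolar_distance(F, pts1, pts2):
    """Distance of each pixel in pts2 to the epipolar line l = F @ p1."""
    p1 = np.column_stack([pts1, np.ones(len(pts1))])
    p2 = np.column_stack([pts2, np.ones(len(pts2))])
    lines = p1 @ F.T                       # each row is (A, B, C)
    return np.abs(np.sum(p2 * lines, axis=1)) / np.hypot(lines[:, 0], lines[:, 1])

def fvb_bounds(K, t, pts1, z_min, z_max):
    """Smallest and largest flow magnitude of a static point with depth in [z_min, z_max]."""
    num = np.linalg.norm((K @ t)[:2] - t[2] * pts1, axis=1)
    return num / (z_max + t[2]), num / (z_min + t[2])

def filter_matches(K, R, t, pts1, pts2, a=1.0, b=3, c=2.0, z_min=2.0, z_max=10.0):
    dist = epipolar_distance(fundamental_from_imu(K, R, t), pts1, pts2)
    imu_valid = bool(np.count_nonzero(dist < a) >= b)   # enough matches agree with the IMU motion
    keep = np.ones(len(pts1), dtype=bool)
    if imu_valid:                                        # both checks are skipped for invalid IMU data
        d_min, d_max = fvb_bounds(K, t, pts1, z_min, z_max)
        flow = np.linalg.norm(pts2 - pts1, axis=1)
        keep = (dist <= c) & (flow >= d_min) & (flow <= d_max)
    return imu_valid, keep

# Toy scene: four static points, one moving off the epipolar line, one moving along it.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.array([0.1, 0.0, 0.0])
pix = lambda P: (K @ (P / P[2]))[:2]
P1 = np.array([[0.3, 0.1, 4.0], [-0.5, 0.2, 6.0], [0.2, -0.3, 3.0], [0.0, 0.0, 5.0],
               [0.1, 0.1, 4.0], [-0.2, -0.1, 4.0]])
own_motion = np.zeros_like(P1)
own_motion[4] = [0.0, 0.2, 0.0]    # leaves the epipolar line
own_motion[5] = [0.5, 0.0, 0.0]    # slides along the epipolar line
pts1 = np.array([pix(P) for P in P1])
pts2 = np.array([pix(R @ P + t + m) for P, m in zip(P1, own_motion)])
imu_valid, keep = filter_matches(K, R, t, pts1, pts2)
print(imu_valid, keep)
```

In this toy scene the fifth match is rejected by the epipolar distance, while the sixth has a near-zero epipolar distance and is rejected only because its flow exceeds the bound implied by the assumed depth interval.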

GMS Constraint
IMU data are not always accurate, and if the data have a large deviation, the epipolar and FVB constraints become invalid. In that case, the GMS constraint is needed to reject feature matches on moving objects.
We use the spatial consistency between feature point matches to constrain the feature points. The constraint uses a grid-based motion statistics method, namely, GMS [36].
In this method, motion smoothness is expressed as the statistical likelihood of a certain number of matches in a region. All feature point matches are checked against this model to eliminate feature points on moving objects and to obtain correct feature matches. When feature points are matched, GMS keeps high-quality feature matches and eliminates low-quality ones, which makes it an extremely robust matching method. In evaluations on videos with weak texture, blurred images, and wide baselines, GMS has been found to consistently outperform other feature-matching algorithms that can run in real time, and it can achieve the same accuracy as more complex and slower algorithms. Since GMS offers both high accuracy and high execution speed, it is well suited for use in VINSs.
We incorporated the GMS constraint into the VINS and used spatial consistency to eliminate feature point matches on dynamic objects.
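The idea of grid-based support counting can be sketched in a few lines. This is a deliberately simplified illustration and not the GMS algorithm of [36]: it counts support only within a single pair of grid cells (the original method also considers neighboring cells), and the grid size and the factor `alpha` are hypothetical choices.

```python
import numpy as np
from collections import Counter

def gms_like_filter(pts1, pts2, size1, size2, grid=4, alpha=1.0):
    """Keep matches whose (cell in image 1, cell in image 2) pair is shared by enough other matches."""
    def cell_ids(pts, size):
        w, h = size
        cx = np.minimum((pts[:, 0] / w * grid).astype(int), grid - 1)
        cy = np.minimum((pts[:, 1] / h * grid).astype(int), grid - 1)
        return (cy * grid + cx).tolist()

    pairs = list(zip(cell_ids(pts1, size1), cell_ids(pts2, size2)))
    pair_counts = Counter(pairs)                    # matches per cell pair
    cell_counts = Counter(c1 for c1, _ in pairs)    # matches per source cell
    support = np.array([pair_counts[p] - 1 for p in pairs])        # other matches agreeing with this one
    n = np.array([cell_counts[c1] for c1, _ in pairs], dtype=float)
    return support > alpha * np.sqrt(n)             # GMS-style threshold on the support

# Toy data: 20 matches moving together inside one cell, plus 2 matches that jump elsewhere.
rng = np.random.default_rng(0)
static1 = rng.uniform([20.0, 20.0], [100.0, 80.0], size=(20, 2))
static2 = static1 + np.array([10.0, 5.0])
odd1 = np.array([[50.0, 50.0], [60.0, 60.0]])
odd2 = np.array([[600.0, 400.0], [300.0, 450.0]])
mask = gms_like_filter(np.vstack([static1, odd1]), np.vstack([static2, odd2]),
                       size1=(640, 480), size2=(640, 480))
print(mask)
```

Matches that move coherently with their neighbors accumulate support and are kept; isolated matches, such as those on an independently moving object, receive little support and are dropped.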

Sliding Window Constraint
In a VINS based on the feature point method, feature points are usually matched between two adjacent images to calculate the relative motion of the camera. In a dynamic environment, however, dynamic feature points must be filtered out and only stationary matches retained. Because the camera frame rate is high, usually more than 10 Hz, the pixels corresponding to a moving object do not move significantly between two adjacent images. Therefore, matching adjacent frames alone does not allow the VINS to reduce the influence of the moving object.
To solve this problem, the sliding window constraint matches features between the current frame and a frame several frames earlier. A schematic representation of the sliding window constraint is shown in Figure 4. Compared with adjacent frames, the time difference between the two matched frames is much larger, so the displacement of the pixels corresponding to a moving object becomes more obvious, whereas the static feature points remain consistent with the camera motion. Thus, the influence of dynamic objects is reduced, and the retained feature matches lie on static objects.
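The frame pairing behind this constraint can be sketched as a small buffer. The class name, the `gap` parameter, and the example speed are hypothetical; the sketch only shows which frames would be matched and how the expected pixel shift of a moving object grows with the frame gap.

```python
from collections import deque

class FrameWindow:
    """Pairs each new frame with the frame `gap` frames earlier."""

    def __init__(self, gap=5):
        self.gap = gap
        self.frames = deque(maxlen=gap + 1)

    def push(self, frame):
        self.frames.append(frame)
        if len(self.frames) <= self.gap:
            return None                      # not enough history yet
        return self.frames[0], frame         # (earlier frame, current frame)

def expected_shift(speed_px_per_s, fps, gap):
    """Pixel shift of an object moving at constant image speed over `gap` frames."""
    return speed_px_per_s / fps * gap

window = FrameWindow(gap=3)
pairs = [window.push(i) for i in range(8)]   # integers stand in for images
print(pairs)
print(expected_shift(30.0, 15.0, 1), expected_shift(30.0, 15.0, 5))
```

For an object moving at an assumed 30 pixels per second in a 15 Hz image stream, the shift between adjacent frames is only 2 pixels, but it grows to 10 pixels across a gap of five frames, which makes the object easier to separate from the static background.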

Multi-Constraint Fusion Algorithm
The epipolar geometric constraint applies only to moving feature points that deviate from the epipolar line, whereas the FVB constraint applies only to feature points that move along the epipolar line. These two constraints are effective only when the accuracy of the IMU data is high. The GMS constraint restricts feature point matching by spatial consistency, and the sliding window constraint improves feature-matching accuracy through the temporal and spatial relationships of the visual data. These two constraints are independent of the IMU data. Thus, the four feature-matching constraints complement each other. The proposed algorithm therefore integrates all four constraints and avoids the failure of any single constraint.
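One way to read this fusion strategy is as a gated combination of per-match decisions. The sketch below assumes that each constraint has already been reduced to a boolean flag per match and that the flags are combined by logical AND; both the function name and this combination rule are assumptions made for illustration, not details given in the text.

```python
import numpy as np

def fuse_constraints(epipolar_ok, fvb_ok, gms_ok, window_ok, imu_valid):
    """Combine per-match flags; IMU-based constraints count only when the IMU data are valid."""
    keep = np.asarray(gms_ok, dtype=bool) & np.asarray(window_ok, dtype=bool)
    if imu_valid:
        keep = keep & np.asarray(epipolar_ok, dtype=bool) & np.asarray(fvb_ok, dtype=bool)
    return keep

epipolar_ok = [True, False, True, True]
fvb_ok = [True, True, False, True]
gms_ok = [True, True, True, False]
window_ok = [True, True, True, True]

with_imu = fuse_constraints(epipolar_ok, fvb_ok, gms_ok, window_ok, imu_valid=True)
without_imu = fuse_constraints(epipolar_ok, fvb_ok, gms_ok, window_ok, imu_valid=False)
print(with_imu, without_imu)
```

When the IMU data are judged invalid, only the vision-based constraints decide which matches survive, which mirrors the fallback described above.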

Experiment
We integrated the multi-constraint fusion algorithm proposed in this paper into VINS-mono to obtain VINS-dimc. An Intel(R) Core(TM) i7-7700HQ CPU @ 2.80 GHz was used as the experimental platform. We performed the experiments on an Ubuntu 16.04 LTS system.
First, we verified the proposed algorithm with a feature point matching experiment that tested whether it could accurately eliminate dynamic feature matches. Then, we ran the VINSs on self-collected data and used the closed-loop error to check whether the algorithm improves positioning accuracy. Finally, we ran VINS-dimc on a public dataset. The positioning results were compared with the ground truth to calculate the absolute positioning error, from which the root mean square error (RMSE) was obtained. We chose VINS-mono, OKVIS-mono, and ROVIO as references, which are among the most representative visual-inertial navigation systems. We also used our previous work as a reference to show that the improvement is effective.

Materials and Experimental Setup
The equipment we used for data acquisition was the Intel RealSense D435i camera [37], which provides an RGB camera, a depth camera, and an IMU. Only RGB images and IMU data were used in this experiment. We set the resolution of the image data to 640 × 480 and the frame rate to 15 Hz. The frequency of the accelerometer was set to 60 Hz, and the frequency of the gyroscope to 200 Hz.
During data acquisition, the experimenter shook a file bag in front of the lens. In the scene, the file bag is a moving object, while the other objects are motionless. We used the traditional method and the proposed algorithm to perform the experiments. An example of the collected images is shown in Figure 5. Figure 5a is the image of the previous frame and Figure 5b is the image of the current frame.

Results
Since only the file bag is moving in the scene, the ideal result is that no feature points remain on the file bag. The results of feature point matching are shown in Figure 6. Figure 6a,b shows the results of feature matching using the traditional method, and Figure 6c,d shows the results of the proposed algorithm. In Figure 6a,b, there are many feature points on the moving file bag. With the proposed method, however, there are no feature points on the moving file bag, and the feature points on the static objects are the same as those in the traditional method. Therefore, the proposed algorithm can effectively eliminate dynamic feature points.

Materials and Experimental Setup
We performed an experimental verification using real scenes. The experimental scenario is shown in Figure 7. During the experiment, a person walked around the scene. Therefore, the scene contains a certain amount of dynamic information, which is a challenge for the VINS. Since no exact ground truth was available, we set the start and end of the experiment to the same point, so that we could evaluate the positioning accuracy from the deviation between the estimated start and end positions. During the experiment, the experimenter held the camera and collected data along a closed-loop path in the scene. The camera and the data acquisition parameters were the same as in the previous experiment. An example of the visual data for two adjacent frames is shown in Figure 8.

Results
We used the proposed algorithm to eliminate feature points on moving objects, and the feature-matching results are shown in Figure 9. Figure 9a,b shows the distribution of the feature points. The experiment shows that the proposed algorithm remains effective when integrated into the VINS. The algorithm was able to eliminate the feature points on the pedestrian and retain the feature points of the static scene. Because the start and end points of the positioning experiment were set to the same position, the difference between the positioning result at the last moment and the initial value can be used as an index of accuracy. Figure 10 shows the results of the experiment. Figure 10a shows the track of the system in 3D space. Figure 10b shows the changes in the X, Y, and Z coordinates during the experiment. The positioning results of the three systems differed only slightly in the XY direction; however, the loop error of VINS-dimc in the Z-direction was much smaller than that of OKVIS-mono and VINS-mono. We calculated the difference in positioning results between the end point and the start point. The difference for OKVIS-mono was 0.516 m and the difference for VINS-mono was 0.168 m, whereas the difference for VINS-dimc was 0.153 m.
With the proposed algorithm, the closed-loop error of VINS-dimc was about 9% lower than that of VINS-mono (0.153 m versus 0.168 m) and about 70% lower than that of OKVIS-mono (0.516 m). Thus, the proposed algorithm is also effective for the self-collected data and contributes to positioning accuracy.
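The closed-loop metric used here is simple enough to state as code. The helper names and the toy trajectory below are hypothetical; the two error values passed to `relative_reduction` are the VINS-mono and VINS-dimc results reported above.

```python
import numpy as np

def loop_closure_error(positions):
    """Distance between the first and last estimated positions of a closed-loop trajectory."""
    positions = np.asarray(positions, dtype=float)
    return float(np.linalg.norm(positions[-1] - positions[0]))

def relative_reduction(baseline, improved):
    """Fraction by which `improved` is smaller than `baseline`."""
    return (baseline - improved) / baseline

# Toy trajectory that should end where it started but drifts slightly.
toy_track = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.09, 0.12, 0.0]]
toy_error = loop_closure_error(toy_track)

reduction = relative_reduction(0.168, 0.153)   # reported VINS-mono vs. VINS-dimc errors
print(toy_error, round(reduction * 100, 1))
```

The metric only measures drift at the end of the loop, so it complements rather than replaces a comparison against ground truth along the whole trajectory.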

Materials and Experimental Setup
To quantitatively evaluate the positioning accuracy of VINS-dimc, we performed experiments on the ADVIO public dataset [38], which provides ground truth. ADVIO is a public dataset for visual-inertial odometry that was collected with handheld devices, using sensors such as those in a smartphone. The dataset contains 23 sequences that include recordings from both outdoor and indoor settings. The ground truth is obtained with the help of a recent pure inertial navigation system [39].
To verify the algorithm in different scenarios, this study selected the six most representative of the 23 sequences. The selected sequences were 1, 2, 6, 11, 16, and 21, which contain various experimental scenes, including a shopping mall, a subway station, an office, and an outdoor area. They also contain all kinds of dynamic objects, such as pedestrians, elevators, and moving cars.
Two adjacent raw images from sequence 1 of the public dataset are shown in Figure 11. Figure 11a shows the previous image, and Figure 11b shows the current image. In these two image frames, all objects are motionless except for the moving elevator, so only the elevator affects the feature matching. The ideal result is therefore that all feature points on the elevator are eliminated and the other feature points are retained.
Figure 11. Two adjacent images in the ADVIO dataset: (a) the previous image; (b) the current image.

Experimental Results
During the experiment, the VINS extracts and matches feature points in the two images and eliminates abnormal feature matches using the proposed feature point elimination algorithm. Figure 12 shows the feature-matching results. Figure 12a,b shows the distribution of feature points in the image. The proposed algorithm is also suitable for public datasets. It eliminates the feature points on the moving elevator while retaining the other feature points, and there is no obvious error in feature matching.
We used the Evo library [40] to calculate the RMSE of the absolute positioning error (APE) of the positioning result. APE is the direct difference between the estimated pose and the ground truth, which directly reflects the accuracy of the algorithm and the global consistency of the trajectory. It should be noted that the estimated pose and the ground truth are usually not in the same coordinate system, so they need to be aligned first. We calculate a transformation matrix $S \in \mathrm{SE}(3)$ from the estimated pose to the ground truth using the least-squares method. The APE of frame $i$ is then defined as follows:
$$E_i = Q_i^{-1} S T_i$$
where $Q_i$ represents the ground truth and $T_i$ represents the calculation result of the algorithm.
The root mean square error (RMSE) is then used to summarize the APE:
$$\mathrm{RMSE}(E_{1:m}, \Delta) = \left( \frac{1}{m} \sum_{i=1}^{m} \left\| \mathrm{trans}(E_i) \right\|^2 \right)^{1/2}$$
where $\Delta$ represents the interval time, $m$ represents the number of samples taken, and $\mathrm{trans}(E_i)$ is the translational part of $E_i$.
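The alignment and RMSE steps can be sketched with NumPy for the translational part of the trajectory. This is not the Evo implementation: it is a minimal illustration that assumes the two trajectories are already time-synchronized, aligns positions only with a least-squares rigid transform (Kabsch/Umeyama without scale), and uses hypothetical function names and synthetic data.

```python
import numpy as np

def align_rigid(est, gt):
    """Least-squares rotation R and translation t such that gt is approximately R @ est + t."""
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    H = (est - mu_e).T @ (gt - mu_g)                 # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])   # guard against reflections
    R = Vt.T @ D @ U.T
    return R, mu_g - R @ mu_e

def ape_rmse(est, gt):
    """RMSE of the position error after rigid alignment of est onto gt."""
    R, t = align_rigid(est, gt)
    err = np.linalg.norm(est @ R.T + t - gt, axis=1)
    return float(np.sqrt(np.mean(err ** 2)))

# Synthetic check: the estimate is the ground truth expressed in another coordinate frame.
rng = np.random.default_rng(0)
gt = rng.normal(size=(50, 3))
angle = 0.3
Rz = np.array([[np.cos(angle), -np.sin(angle), 0.0],
               [np.sin(angle), np.cos(angle), 0.0],
               [0.0, 0.0, 1.0]])
est_exact = (gt - np.array([1.0, 2.0, 3.0])) @ Rz.T
est_noisy = est_exact + rng.normal(scale=0.05, size=gt.shape)

rmse_exact = ape_rmse(est_exact, gt)
rmse_noisy = ape_rmse(est_noisy, gt)
print(rmse_exact, rmse_noisy)
```

A pure change of coordinate frame yields an RMSE of essentially zero after alignment, so only genuine estimation error contributes to the reported value.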
For the six sequences of the ADVIO public dataset, the errors of the proposed method and the reference methods are compared in Table 1. Since the dataset contains many dynamic objects that require the system to be highly robust, both OKVIS-mono and ROVIO were unable to run some of the sequences. As shown in Table 1, VINS-dimc achieved excellent performance when compared to the other three state-of-the-art VINSs. When compared to our previous work, VINS-dimc has also made significant progress. The selected sequences include not only scenes with considerable dynamic information, but also scenes where almost all objects are stationary, such as an office. The system not only runs successfully on all sequences, but also improves the positioning accuracy. VINS-dimc achieves better robustness and higher positioning accuracy than conventional VINSs in both dynamic and static environments. Figure 13 shows the error data of the results for different sequences in the ADVIO public dataset. The figures in the left column show the absolute error over time. The abscissa represents the experimental time in seconds, and the ordinate is the absolute position error in meters. The figures in the right column show the statistics for the absolute position error. The statistics in the figures from top to bottom are the maximum, minimum, standard deviation, median, mean, and RMSE. The horizontal axis represents the error magnitude in meters.

Discussion and Conclusions
In this study, we present an algorithm for eliminating dynamic feature matches based on multiple constraints. This algorithm has two main advantages. First, the information from the IMU is combined with the epipolar and FVB constraints to eliminate abnormal feature points using geometric relationships. Second, the GMS and sliding window constraints are introduced and combined with the epipolar and FVB constraints to eliminate feature point matches on dynamic objects. The epipolar and FVB constraints combine IMU data with geometric information to eliminate dynamic feature points. The GMS constraint considers the spatial consistency of the feature point matches. The sliding window constraint uses temporal correlation to constrain feature points. Thus, the proposed algorithm integrates four constraints to avoid the failure of a single constraint.
We integrated the algorithm into VINS-mono to obtain VINS-dimc. The feature point matching experiments showed that the proposed algorithm helps to remove dynamic feature points. In the experiment with self-collected data, the closed-loop error of VINS-dimc was lower than that of the conventional methods. In the experiment with the ADVIO public dataset, the positioning error of VINS-dimc was the smallest. Therefore, the proposed method can improve the matching accuracy of feature points and the positioning accuracy of VINSs.
In future work, we will consider another difficulty of VINS: illumination changes. In practical applications, the change of illumination has a great impact on the stability of feature points. We intend to propose an image enhancement method and integrate it with VINS to improve the accuracy and robustness of the system.