Robust Stereo Visual Inertial Navigation System Based on Multi-Stage Outlier Removal in Dynamic Environments

Robotic mapping and odometry are the primary competencies of a navigation system for an autonomous mobile robot. However, the state estimation of the robot typically accumulates drift over time, and its accuracy degrades critically when using only proprioceptive sensors in indoor environments. Moreover, the accuracy of ego-motion estimation is severely diminished in dynamic environments because of the influence of both dynamic objects and light reflection. To this end, a multi-sensor fusion technique is employed to bound the navigation error by exploiting the complementary nature of the Inertial Measurement Unit (IMU) and the bearing information of the camera. In this paper, we propose a robust tightly-coupled Visual-Inertial Navigation System (VINS) based on multi-stage outlier removal using the Multi-State Constraint Kalman Filter (MSCKF) framework. First, an efficient and lightweight VINS algorithm is developed for the robust state estimation of a mobile robot by employing a stereo camera and an IMU in dynamic indoor environments. Furthermore, we propose strategies to deal with the impact of dynamic objects by using multi-stage outlier removal based on the feedback information of estimated states. The proposed VINS is implemented and validated on public datasets. In addition, we develop a sensor system and evaluate the VINS algorithm in dynamic indoor environments with different scenarios. The experimental results show better performance in terms of robustness and accuracy with low computational complexity as compared to state-of-the-art approaches.


Introduction
In recent years, the field of robotics has witnessed remarkable advances in both academia and industry with the assistance of Artificial Intelligence (AI). The evolution of technology has awakened various real-time applications of mobile robots, such as search and rescue missions in disasters, shipping and delivering small packages, and self-driving vehicles [1,2]. The localization system is a crucial competence of an autonomous mobile robot. However, the system typically accumulates error over time, particularly in long-term navigation. Although Real-Time Kinematic (RTK)-GPS can provide outstanding precision for outdoor localization, it is not suitable for indoor environments or GPS-denied outdoor environments such as under bridges or near high buildings. To this end, the Multi-Sensor Fusion Technique (MSFT) has been developed to bound the navigation error by joining the advantages of proprioceptive and exteroceptive sensors. Simultaneous Localization and Mapping (SLAM) has employed MSFT to estimate the state and to reconstruct and understand the surroundings over the last 20 years [3]. It also enables loop closure detection to further enhance accuracy. Among outlier-rejection approaches, one line of work searches the image for a similar direction and then dismisses outliers by using the most frequent angle vector. However, this study uses a constant threshold to separate inliers from outliers, which is not a robust approach. In [30], Santana Pedro et al. proposed a solution to reduce the influence of rotational movement by separating the obtained image into a left half and a right half, and then rejecting outliers efficiently. M. Buczko and V. Willert [31] proposed a study based on the comparison of the optical flow and the camera movement. The resulting error is then applied as a criterion to treat rotation errors in the camera motion. Researchers in [32] applied optimization strategies to efficiently classify outliers in the VO problem by using only a monocular camera without depth information.
Nevertheless, the aforementioned techniques do not use information such as the velocity and rotation angle in the sliding window to update the threshold. This work proposes a robust multi-stage outlier removal strategy that leverages all estimated states in the sliding window of the VIO to eliminate visually uncertain features in stochastic environments. Figure 1 shows that our proposed method can extract more robust features than the state-of-the-art VINS-Fusion [13] when influenced by illumination and a dynamic object. In this paper, a robust stereo visual-inertial navigation system for a mobile robot in dynamic indoor environments is proposed. The system architecture is briefly illustrated in Figure 2, where the inputs are images from the left and right cameras and the inertial data from an IMU. First, feature extraction is utilized to obtain the Features from Accelerated Segment Test (FAST) features [33] from the images. Then, the Kanade-Lucas-Tomasi (KLT) feature tracker with RANdom SAmple Consensus (RANSAC) [34] is utilized to match the features. A robust outlier removal method is developed by employing forward-backward-up-down tracking, which then incorporates the state feedback information of the navigation system to remove the outliers. Once robust visual features are available, a Bayesian filter [6] is applied to correct the robot state obtained from the propagation phase. The experimental results on public datasets show that the proposed methodology achieves more reliable performance in both computational complexity and accuracy than state-of-the-art approaches. Besides, tests are conducted in various scenarios under dynamic environments with an in-house built sensor setup.
Figure 2. The proposed navigation system architecture is based on a robust visual-inertial fusion with multi-stage outlier removal. Herein, the features are classified into three classes, which are represented by blue, yellow, and red.
In summary, the key contributions of this work are as follows:
• We provide the design of a robust stereo vision-aided inertial navigation system with high accuracy and low computational cost, based on multi-stage outlier removal, targeting resource-constrained devices in stochastic environments;
• We introduce multi-stage outlier removal strategies, in particular a multi-stage approach that removes outlier features caused by the influence of dynamic objects based on the state feedback information in the sliding window;
• The experimental evaluation of the proposed VINS algorithm is carried out on both public and our own datasets for the indoor environment. We demonstrate on relevant datasets that the proposed solution can robustly reject outliers and improve the system's accuracy in both static and dynamic environments.
The remainder of this paper is structured as follows. Section 2 introduces the proposed system. In Section 3, we present the strategies to reject outliers. Experimental results are illustrated in Section 4, followed by the conclusion in Section 5.

Robust Stereo Vision-Aided Inertial Navigation System Estimator Description
In this section, we provide the design of a robust stereo vision-aided inertial navigation system. In order to design the system, we first set up a sensor coordinate configuration that consists of the left camera {C_L}, right camera {C_R}, and IMU {I} frames, as shown in Figure 3. The robot frame is aligned with the inertial frame {I}, which is chosen as the reference frame for the navigation system. The global frame {G} is world-centric and is set at the initial pose of the robot at the start of operation.

The Definition of System State on Manifolds
Following the MSCKF approach [7,27], we define a state vector of the VINS system at each camera clock instance, which consists of the IMU state with respect to the global frame {G} and a series of K camera pose clones in a sliding window. Besides, a group of ρ robust visual landmarks, a set of stereo camera extrinsic and intrinsic calibration parameters, and a time offset are also employed. All of this information is combined in a unique state vector x represented as:

x = [x_I^T  x_C^T  o_S^T  o_Z^T  δt]^T,  (1)

where the IMU state x_I can be expressed as:

x_I = [{}^I_G q^T  b_ω^T  {}^G v_I^T  b_a^T  {}^G p_I^T]^T.  (2)

The Jet Propulsion Laboratory (JPL) unit quaternion {}^I_G q in Equation (2) describes the rotation from the {G} frame to the {I} frame, and {}^G p_I and {}^G v_I are the position and velocity of the IMU frame {I} with respect to the global frame {G} at the current time step, respectively. b_ω and b_a are the gyroscope and accelerometer biases, respectively. The inertial vector x_I has a total of 15 Degrees of Freedom (DOF) and is represented on a manifold M formed by the product of the vector space R^12 and the unit quaternion space U (M = U × R^12). The camera clones collected at different time steps within the sliding window are given by:

x_C = [{}^{C_1}_G q^T  {}^G p_{C_1}^T  ...  {}^{C_K}_G q^T  {}^G p_{C_K}^T]^T,  (3)

where {}^G p_{C_i} is the position of a camera pose clone with respect to the global frame {G} at time instance k − i. The vector o_S represents the set of ρ robust visual features and is given by:

o_S = [{}^G p_{S_1}^T  ...  {}^G p_{S_ρ}^T]^T.  (4)

The calibration vector o_Z, including the extrinsic parameters with rotation {}^I_{C_z} q and translation {}^{C_z} p_I from the IMU frame to the left camera frame, and the intrinsic parameters ϑ_z of the stereo camera, is expressed as:

o_Z = [{}^I_{C_z} q^T  {}^{C_z} p_I^T  ϑ_z^T]^T.  (5)

We assume that the left and right camera clocks are synchronized and have a time offset {}^C t_I, denoted by δt, relative to the IMU clock, as shown in Figure 4.

Herein, a unit quaternion q can be updated by multiplying an estimated quaternion q̂ with an error quaternion δq, yielding q = δq ⊗ q̂. For a small angle, the error quaternion can be approximated as δq ≈ [½ δθ^T  1]^T and R(δq) ≈ I_{3×3} − ⌊δθ×⌋, where ⊗ denotes quaternion multiplication and R(·) denotes a 3 × 3 rotation matrix on the special orthogonal group SO(3) [6,35]. On differentiable manifolds, the boxplus (⊞) and boxminus (⊟) operations are used to represent the update of a matrix on a Lie group by addition or subtraction of a vector on its tangent space [6]. For simplicity, ⊞ (or ⊟) is applied to update each element in the state vector of Equation (1) with its Lie algebra [6], yielding x = x̂ ⊞ δx.
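For illustration, the on-manifold update above can be sketched in a few lines of code. This is only an illustrative sketch with our own variable names, not the paper's implementation: it applies the small-angle correction R(δq) ≈ I − ⌊δθ×⌋ to the rotation and plain vector addition to the remaining 12 DOF of the IMU error state.

```python
import numpy as np

def skew(v):
    """Skew-symmetric matrix [v]_x such that skew(v) @ u == np.cross(v, u)."""
    return np.array([[0.0,  -v[2],  v[1]],
                     [v[2],  0.0,  -v[0]],
                     [-v[1], v[0],  0.0]])

def boxplus(x_hat, dx):
    """On-manifold update x = x_hat [+] dx for the 15-DOF IMU state.

    x_hat: dict with rotation matrix 'R' (3x3) and vector parts
           'p', 'v', 'bw', 'ba' (each of shape (3,)).
    dx:    15-vector ordered as [dtheta, dp, dv, dbw, dba]
           (ordering is our illustrative choice).
    """
    dtheta, dp, dv, dbw, dba = np.split(np.asarray(dx, dtype=float), 5)
    return {
        # small-angle rotation correction: R(dq) ~ I - [dtheta]_x
        'R': (np.eye(3) - skew(dtheta)) @ x_hat['R'],
        'p': x_hat['p'] + dp,
        'v': x_hat['v'] + dv,
        'bw': x_hat['bw'] + dbw,
        'ba': x_hat['ba'] + dba,
    }
```

For small δθ the updated rotation stays orthogonal to first order, which is what makes this linearized update admissible inside the filter.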

Propagation Model
In order to apply the standard Extended Kalman Filter (EKF) propagation [6,36], the inertial data should be converted to a control input of a dynamic system:

ẋ_I = f(x_I, ω_m, a_m, n).  (6)

Note that the input of Equation (6) is the IMU state vector of Equation (2), which only needs to be processed during propagation. The focus is on how to propagate the IMU data between two adjacent camera clocks [t_k, t_{k+1}], as shown in Figure 4. Without loss of generality, at a time step t between two IMU clocks, t ∈ [t_γ, t_{γ+1}], the first element of the IMU state vector in Equation (2) is computed as given in [35], where Φ(t, t_γ) is a transition matrix that can be solved by using Linear Time-Varying (LTV) system theory [36]. Then, the propagation term of the rotation from time t_γ to t_{γ+1} is evaluated [7,36], where ω̂ is assumed either constant, ω̂(t) = ω̂, or linear, ω̂(t) = ω̂_0 + κt, throughout the propagation.
To implement the algorithm in recursive form, the function of each variable in the IMU state vector of Equation (2) needs to be transformed into a discrete-time formula under the following assumptions:
- The angular velocities between two IMU clocks t ∈ [t_γ, t_{γ+1}] are constant: ω_m(t) = ω_{m,γ};
- The accelerations between two IMU clocks t ∈ [t_γ, t_{γ+1}] are constant: a_m(t) = a_{m,γ}.
According to the state vector in Equation (2), the error-state vector is established, where x̃ is the error state between the true state x and the estimated state x̂, such that x = x̂ + x̃. The error-state vector can be written as the input of the dynamic Equation (6) as follows:

x̃_{γ+1} = Φ(t_{γ+1}, t_γ) x̃_γ + G_γ n,  (13)

where Φ(t_{γ+1}, t_γ) is the system matrix, G_γ is the noise matrix, and n = [n_ω^T n_a^T n_{bω}^T n_{ba}^T]^T is the inertial noise input vector. The covariance matrix [7,36] is computed by rearranging the error states in terms of the dynamic Equation (13) with the input state vector of Equation (2). Finally, the covariance matrix is propagated from t_γ to t_{γ+1} as [35,36]:

P_{γ+1|γ} = Φ(t_{γ+1}, t_γ) P_{γ|γ} Φ(t_{γ+1}, t_γ)^T + G_γ,  (15)

with the state transition matrix Φ(t_{γ+1}, t_γ) (A1) and the discrete-time noise covariance matrix G_γ (A2) given in Appendix A. The rest of the state vector of Equation (1) does not depend on time, so the corresponding Jacobian matrices are identity matrices. This means the state covariance matrix forms a sparse matrix, which allows for computational cost savings. In summary, the propagation process from t_γ to t_{γ+1} is illustrated in Algorithm 1.

Algorithm 1: Inertial Propagation Process Algorithm
Input: angular velocity ω_m and acceleration a_m measurements
Output: the estimated state and its covariance matrix
while running do
    if an inertial measurement is obtained then
        Propagate the rotation Equation (9), position and velocity Equation (10), and biases Equation (11);
        Compute the state transition matrix Φ (A1);
        Calculate the discrete-time noise covariance matrix G (A2);
        Evaluate the state covariance matrix P according to Equation (15);
    end
end
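The sparsity noted above (only the IMU block of the covariance evolves, while the clone/feature/calibration Jacobians are identity) can be exploited as in the following sketch. This is a minimal illustration with our own variable names, assuming the IMU error state occupies the first 15 rows/columns of P; Phi_I and Qd_I stand for the matrices (A1) and (A2) of the paper.

```python
import numpy as np

def propagate_covariance(P, Phi_I, Qd_I, imu_dim=15):
    """Propagate the state covariance from t_gamma to t_gamma+1.

    Only the IMU sub-block evolves; the clone/feature/calibration
    blocks have identity Jacobians, so the full transition matrix is
    block-diagonal and the update stays cheap (no dense N x N product).
    """
    P = P.copy()
    PII = P[:imu_dim, :imu_dim]
    PIC = P[:imu_dim, imu_dim:]
    # IMU-IMU block: Phi P Phi^T + Qd
    P[:imu_dim, :imu_dim] = Phi_I @ PII @ Phi_I.T + Qd_I
    # IMU-rest cross terms: only left-multiplied by Phi
    P[:imu_dim, imu_dim:] = Phi_I @ PIC
    P[imu_dim:, :imu_dim] = P[:imu_dim, imu_dim:].T
    return P
```

The lower-right block (clones, features, calibration) is left untouched, which is where the computational saving comes from as the sliding window grows.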

Visual Measurement Update Model
To use the bearing information from the camera, we first extract FAST features from the obtained images and classify them into three groups [7,27]. If a feature survives through the whole sliding window, it is categorized as a robust feature; if its identity is also found in the state vector of Equation (4), it is called a SLAM feature. Otherwise, if such a robust feature is searched for in the state vector of Equation (4) but not found, it is marked as a Delay feature. All remaining features are MSCKF features, whose tracks are generally lost in the current camera image. Later, the Delay features are initialized into the state vector of Equation (4) and the system covariance matrix. Finally, the SLAM and MSCKF features are leveraged to update the state with a standard EKF process [6,36].
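The classification rule above can be sketched as follows; this is an illustrative sketch (track bookkeeping and data structures are our own simplification, not the paper's implementation):

```python
def classify_features(tracks, state_feature_ids, window_size):
    """Classify tracked features into SLAM / Delay / MSCKF groups.

    tracks: dict feature_id -> number of consecutive frames tracked.
    state_feature_ids: ids of features already in the state vector.

    A feature surviving the whole sliding window is 'robust'; a robust
    feature already in the state is a SLAM feature, a robust feature
    not yet in the state is a Delay feature (to be initialized), and
    every remaining feature is an MSCKF feature (track lost).
    """
    slam, delay, msckf = [], [], []
    for fid, track_length in tracks.items():
        if track_length >= window_size:
            (slam if fid in state_feature_ids else delay).append(fid)
        else:
            msckf.append(fid)
    return slam, delay, msckf
```

For example, with a window of 5 frames, a feature tracked for 5 frames that is already in the state vector lands in the SLAM group, while a feature lost after 2 frames lands in the MSCKF group.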

On-Manifold Correction Phase
Whenever images are acquired, the visual feature information is processed to correct the state after the inertial propagation. In order to apply the standard EKF [6], all optical elements should be written in terms of a non-linear function as:

r_{m,k+1} = h(x_{k+1}) + n_{m,k+1},  (16)

where r_{m,k+1} is the measurement vector, x_{k+1} is the state vector represented on the manifold M, and n_{m,k+1} ∼ N(0, R_{m,k+1}) is the white measurement noise. We linearize Equation (16) around the current estimated state x̂_{k+1|k}; therefore, the current zero-mean error state is given as:

r̃_{m,k+1} ≈ H_{k+1} x̃_{k+1|k} + n_{m,k+1},  (17)

where H_{k+1} is the Jacobian matrix with respect to the error state. From Equation (17), the standard EKF is used to update the state on the manifold as follows:

K_{k+1} = P_{k+1|k} H_{k+1}^T (H_{k+1} P_{k+1|k} H_{k+1}^T + R_{m,k+1})^{−1},  (18)

x̂_{k+1|k+1} = x̂_{k+1|k} ⊞ K_{k+1} r̃_{m,k+1},  (19)

P_{k+1|k+1} = (I − K_{k+1} H_{k+1}) P_{k+1|k}.  (20)

The observed visual information is represented on the distorted image plane; therefore, the measurement model needs to transform the 3D feature representation into 2D data on the image plane, as shown in Figure 5. For a particular variable, we need to transform and calculate the Jacobian based on the obtained information of the visual feature.
Figure 5. A demonstration of the transformation between the 3D and 2D visual feature representations. The dark blue arrows indicate how to convert the 3D feature representation into 2D data on the distorted image plane, and the green arrows show the process of computing the Jacobian with respect to a variable by using the chain rule of differentiation. The brown circle contains the Anchored Inverse Depth model [7] of a visual feature.
For example, the Jacobian of a variable in the state vector of Equation (1) can be determined by leveraging the chain rule of differentiation, as explained in Figure 5 and expressed in Equation (21). Then, we compute the residuals of the feature by employing the measurements acquired in the sliding window (for j = k − m + 1, ..., k + 1):

r̃_{ij} ≈ H_{x_{ij}} x̃_j + H_{f_{ij}} {}^G p̃_{f_i} + n_{ij},  (22)

where {}^G p̃_{f_i} and x̃_j are the error position of the 3D feature and the current error state vector evaluated at the j-th measurement, as shown in the vector of Equation (1), respectively. The Jacobian matrices H_{x_{ij}} and H_{f_{ij}} are computed at x̂_{ij} and p̂_{f_i} by leveraging Equation (21). We then stack all residuals from the total observed measurements of the feature:

r̃_i ≈ H_{x_i} x̃ + H_{f_i} {}^G p̃_{f_i} + n_i,  (23)

where H_{x_i} and H_{f_i} [27] are block matrices formed by the elements in Equation (22). However, a severe problem in Equations (22) and (23) is that the primitive covariance matrices P_{xf}, P_{ff}, P_{nf} related to the feature are unknown. Therefore, the linearized measurement in Equation (23) needs to eliminate the error feature {}^G p̃_{f_i}. To this end, we apply the QR decomposition [38] to perform a null-space projection, so Equation (23) is rewritten as [7,27]:

r̃_i ≈ [Q_1 Q_2] [R_1^T 0]^T {}^G p̃_{f_i} + H_{x_i} x̃ + n_i.  (24)

Multiplying both sides of Equation (24) by Q_2^T, and noting that Q_1 and Q_2 are orthonormal, the feature error term vanishes, leaving:

z̃_o = Q_2^T r̃_i ≈ Q_2^T H_{x_i} x̃ + Q_2^T n_i.  (25)

In addition, we can also apply the QR decomposition using Givens rotations [38] to further reduce the computational complexity, as given in Equation (26). The final step employs the EKF update process with Equations (25) and (26), introduced as in Equations (19) and (20), to obtain Equation (27). The most common scenario is that a large number of features are detected and processed at the same time instance. In order to save further computational cost, we can stack the residuals of all features in Equation (26) into one block, yielding Equation (28). Similarly, we employ the thin QR factorization [39] of H_o = Q H_T for the residual z̃_o, as given in Equation (29). Finally, the standard EKF update [6] is implemented for Equation (29) to obtain the final estimation result.
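The null-space projection that eliminates the feature error can be sketched compactly with a complete QR factorization. This is an illustrative sketch with our own variable names: the columns of Q beyond the column rank of H_f span its left null space, so multiplying by their transpose removes the feature term from the linearized measurement.

```python
import numpy as np

def nullspace_project(r, Hx, Hf):
    """Eliminate the feature-position error from the linearized
    measurement r ~ Hx @ x_err + Hf @ pf_err + n.

    Hf is (2m x 3) with full column rank; the last 2m-3 columns of the
    complete QR factor Q span its left null space, so projecting the
    residual and state Jacobian onto them drops the Hf term entirely.
    """
    k = Hf.shape[1]                       # feature dimension (3)
    Q, _ = np.linalg.qr(Hf, mode='complete')
    Q2 = Q[:, k:]                         # left null space of Hf
    return Q2.T @ r, Q2.T @ Hx            # projected residual, Jacobian
```

The projected system has 2m − 3 rows per feature, and the projected noise keeps the same (isotropic) covariance because Q2 has orthonormal columns.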
The last step is the state augmentation of Equation (1), which occurs when obtaining a new image, as in Equation (3), or when inserting a new SLAM feature, as in Equation (4). In the first scenario, when a new image is received, we append the latest camera pose to the state vector of camera clones in Equation (3) with a copy of the indicated pose and its covariance matrix, associated with the IMU state of Equation (2). A similar procedure is applied when removing a camera clone from the sliding window. The second situation arises when initializing a new SLAM feature f into the state vector of Equation (4); we follow the same process as presented above. Once the covariance matrix P_xx and the cross-correlation matrix P_xf [7,27] are available, the augmented covariance matrix is inserted into the feature covariance matrix.

Multi-Stage Outlier Removal in the Stochastic Environments
The accuracy of a state estimation system that uses the bearing information of the camera is typically influenced by dynamic objects and illumination. This section presents a multi-stage outlier removal method to overcome this problem. The generic system is presented in Figure 2, and the overall processing stages are as follows:

1. Stage 1: The FAST features are extracted, then the KLT Feature Tracker algorithm with RANSAC is used to find correspondences. We perform forward-backward-up-down matching to correct the tracking;
2. Stage 2: 3D feature triangulation with an optimization solution is utilized to calculate the 3D position, and outlier rejection is also performed in this phase;
3. Stage 3: A robust outlier rejection scheme is proposed that considers the estimated state motion with the feedback information of the robot poses and velocity;
4. Stage 4: During the EKF update, as shown in Equation (27), statistical analysis with the chi-squared test is used to separate outliers from inliers.
Stage 1 is used to find robust feature correspondences. An image is first divided into blocks, and FAST features are extracted with a multi-threaded process for each image block to increase the computational speed [8,33]. Next, optical flow with the KLT tracking algorithm and RANSAC [33,34] is employed to determine the feature correspondences within the block boundaries.
However, a mobile robot regularly performs sharp turns, which cause the loss of optical tracking. To overcome this difficulty, we propose robust feature matching based on forward-backward-up-down optical flow tracking. For the current images of the left camera I_{L,k+1} and the right camera I_{R,k+1}, we find the forward correspondences from image I_{L,k+1} to image I_{R,k+1}, and then use backward optical flow tracking from image I_{R,k+1} back to I_{L,k+1}. Later, to mitigate illumination effects, we apply the up-down correspondences from the previous image I_{L,k} to the current image I_{L,k+1} of the left camera, and from the previous image I_{R,k} to the current image I_{R,k+1} of the right camera. Stage 1 is exhibited in Figure 6. In the next stage, once we have a robust feature f, triangulation is utilized to find the 3D position of the feature with respect to the anchor pose. Linear triangulation [7,8,27] is applied to find the initial position of the feature. In this method, we use the m obtained measurements of the feature and stack them into a block matrix in the form of a linear equation as follows:

K_{2m×3} {}^A p_f = H_{2m×1},  (31)

where K_{2m×3} is a 2m × 3 matrix related to the bearing information with respect to the anchor position, and H_{2m×1} is a 2m × 1 matrix corresponding both to the bearing information and the translation of the feature relative to the anchor camera position [27]. We use the Householder rank-revealing QR decomposition [27,38] to efficiently solve the linear Equation (31) and obtain its condition number [38]; the condition number, which is sensitive to error, is then applied to reject outliers. After initialization, a more accurate feature position can be obtained by non-linear optimization, as in Equation (32), where {}^A u*_f, {}^A v*_f, {}^A ρ*_f are the optimal values of the 3D feature position with respect to the anchor pose, z^i_f is the bearing measurement obtained in image i, and h^i_f(·) is a function of the 3D feature position.
The least-squares problem of Equation (32) is solved by the Levenberg-Marquardt (LM) algorithm [8,27]. After that, the feature is filtered out if its depth is out of range or the baseline ratio [8,37] is higher than a threshold.
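The linear-triangulation step with condition-number gating can be sketched as follows. This is an illustrative sketch under our own formulation (it builds the stacked system from skew-symmetric bearing constraints and solves it by least squares rather than the Householder QR of the paper; the gate value cond_max is a hypothetical parameter):

```python
import numpy as np

def triangulate_with_gate(Rs, ts, bearings, cond_max=1000.0):
    """Linear triangulation of a feature in the anchor frame.

    For each observation, the measured bearing b_i (camera frame) must
    be parallel to R_i @ p + t_i, so its skew matrix yields linear
    constraints on p. We solve the stacked system in a least-squares
    sense and reject the feature when the condition number indicates
    ill-posed geometry (e.g. a near-zero baseline).
    """
    def skew(v):
        return np.array([[0.0, -v[2], v[1]],
                         [v[2], 0.0, -v[0]],
                         [-v[1], v[0], 0.0]])

    A_rows, b_rows = [], []
    for R, t, bi in zip(Rs, ts, bearings):
        S = skew(np.asarray(bi) / np.linalg.norm(bi))
        A_rows.append(S @ R)          # constraint: S (R p + t) = 0
        b_rows.append(-S @ np.asarray(t))
    A, b = np.vstack(A_rows), np.hstack(b_rows)

    s = np.linalg.svd(A, compute_uv=False)
    cond = s[0] / s[-1]
    if cond > cond_max:
        return None, cond             # outlier: geometry too degenerate
    p, *_ = np.linalg.lstsq(A, b, rcond=None)
    return p, cond
```

The returned initial estimate would then seed the non-linear refinement of Equation (32).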
Stage 3 is applied before the EKF update process. We focus on how to remove outliers by exploiting the relation between the optical flow and the robot motion.
The ideal solution is to use the estimated 3D position of each feature, project it, and establish a vector reflecting the camera motion in the sliding window, called the visual vector. Another vector, acquired from the captured bearing information of the camera, is named the actual vector. Once the vectors are established, they are used to compute a relation error [32], and the error is compared with a threshold to detect the outlier. The whole process is presented in Figure 7. Assume that a feature f with the estimated 3D position {}^A p_f needs to be considered. The feature position is first converted to the {G} coordinate frame. Then, we transfer it to each camera coordinate frame {C_i} in the sliding window and project it onto the normalized image plane of camera C_i, as shown in Figures 5 and 7c. We then have a set Ω_1 of actual vectors and a set Ω_2 of visual vectors, as shown in Figure 7a. We use only the first and the last vector in the sets Ω_1 and Ω_2; the actual vector v_1 and the visual vector v_2 are then established as shown in Figure 7b. With the assumption that the distance ξ_{N−k} between the initial points of both vectors is small, the error of the endpoints is approximated as ∆ξ ≈ ξ_N. To avoid the sensitivity of the visual vector v_2, we employ the normalized re-projection error ε [32], where ε_{C_L} and ε_{C_R} are the normalized re-projection errors of the left and right cameras, respectively. We finally compare the error with a threshold and remove the feature if the error exceeds the limit. Nevertheless, in practice, the camera performance depends on its linear velocity and rotation. Thus, we propose a threshold that is adapted as a function of the linear and angular velocity of the robot. More simply, the updated threshold is a linear combination of the moduli of the linear and angular velocities:

τ = K_ω ||ω|| + K_v ||v||,  (36)

where the parameters K_ω and K_v are selected depending on the camera.
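The velocity-adaptive gate of stage 3 can be sketched as below. This is an illustrative sketch: the gain values K_w and K_v are hypothetical placeholders (in practice they are tuned per camera, as stated above), and eps_left/eps_right stand for the normalized re-projection errors of the two cameras.

```python
import numpy as np

def is_outlier(eps_left, eps_right, omega, vel, K_w=0.5, K_v=0.1):
    """Stage-3 check: compare the normalized re-projection errors of
    both cameras against a threshold that grows linearly with the
    robot's angular and linear speed (tau = K_w*||omega|| + K_v*||v||).

    K_w and K_v are camera-dependent gains; the defaults here are
    purely illustrative.
    """
    tau = K_w * np.linalg.norm(omega) + K_v * np.linalg.norm(vel)
    return max(eps_left, eps_right) > tau
```

The intent is that fast rotations, where optical flow is naturally large, loosen the gate instead of wrongly discarding well-tracked features.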
Finally, stage 4 handles the case where the computed results are not close enough to the expected solution, which can influence the outcome of the estimation algorithm. To this end, the Mahalanobis gating test [8,27] is applied. After leveraging the null-space projection in Equation (25), the linearization results give us z̃_{o,k+1} and H_{o,k+1}. In this case, we need to find the difference between the observed and estimated values. The residual covariance is given by H_{o,k+1} P_{k+1|k} H_{o,k+1}^T, and the condition value is computed as follows:

ξ_i = z̃_{o,k+1}^T (H_{o,k+1} P_{k+1|k} H_{o,k+1}^T)^{−1} z̃_{o,k+1}.  (37)

Once the value ξ_i is computed, it is compared with a threshold given by the Chi-squared distribution χ² at the 95th percentile of the cumulative density function. Here, the degrees of freedom of the χ² distribution equal the number of rows in the residual vector z̃_{o,k+1}. If the value ξ_i is larger than the threshold, the feature is removed from the inlier set. The χ² threshold could be computed at this stage; however, to save computational cost, we apply an efficient look-up table to retrieve the value.
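The gating test with a precomputed look-up table can be sketched as follows; this is an illustrative sketch (the table lists well-known 95th-percentile chi-squared values for small degrees of freedom, and an optional measurement-noise term R is included for generality):

```python
import numpy as np

# Look-up table of 95th-percentile chi-squared values, indexed by
# degrees of freedom, to avoid evaluating the inverse CDF online.
CHI2_95 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070, 6: 12.592}

def mahalanobis_gate(z, H, P, R=None):
    """Stage-4 gating: xi = z^T S^{-1} z with S = H P H^T (+ R).

    Accept the feature (return True) when xi stays below the 95%
    chi-squared bound for dof = len(z); reject it otherwise.
    """
    S = H @ P @ H.T
    if R is not None:
        S = S + R
    xi = float(z @ np.linalg.solve(S, z))
    return xi <= CHI2_95[len(z)]
```

Using `np.linalg.solve` instead of an explicit matrix inverse keeps the test numerically stable for the small residual dimensions involved.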

Experimental Results
In this section, the details of the experimental results are presented and compared with state-of-the-art methods on public datasets. Moreover, the proposed algorithm is evaluated with a sensor platform in indoor stochastic environments. First, we implement the proposed methodology, as shown in Algorithm 2. Herein, the size of the sliding window is fixed to K = 5. The maximum number of extracted FAST features is set to 400, and the maximum number of SLAM features in the state vector is limited to ρ = 50. The proposed algorithm runs in real-time on a Legion Y520 laptop with an Intel Core i7-7700HQ four-core 2.8 GHz processor, a GTX 1060 GPU, and 16 GB RAM. However, the algorithm only utilizes the CPU to evaluate results. We develop and run the algorithm on Ubuntu 16.04.5 LTS with the Robot Operating System (ROS) Kinetic [40], written in C/C++.

Algorithm 2: A Robust Stereo VINS Algorithm Based on Multi-Stage Outlier Removal
Input: angular velocity ω_m, acceleration a_m measurements, and stereo images.
Output: state estimation and its covariance.
while running do
    Propagation: collect IMU measurements, then propagate as in Algorithm 1;
    if stereo images are obtained then
        Add the latest camera pose to the state vector;
        Extract FAST features, then match with outlier rejection stage 1 as in Section 3;
        For features passing stage 1, classify the features into three categories;
        for all MSCKF features do
            3D feature triangulation, then reject outliers as in stage 2 of Section 3;
            Perform outlier removal at stage 3 of Section 3;
            Compute the residual and its Jacobian for each feature using Equation (25);
            Perform the Chi-squared test as in stage 4 of Section 3;
            For all surviving features, combine the residual vector and the Jacobian matrix as in Equation (26);
            Update the state vector and covariance matrix as in Equation (27);
        end
        for all SLAM features do
            Do the same process as for MSCKF features without 3D triangulation;
            Update the state vector and covariance matrix as in Equation (27);
        end
        for all Delay features do
            Do the same process as for MSCKF features without updating the state vector but only the covariance matrix;
            Initialize the features in the state vector;
        end
        Process State Management

• Remove SLAM features that do not survive in the newest frame, then change the anchor pose of the remaining SLAM features to the new anchor position;
• Reject the oldest pose in the sliding window from the state vector and also update its covariance;
• Delete the features used for the MSCKF update at the last time step to avoid using coinciding information in the next update.

Experimental Results on the Public Datasets
In this experiment, the visual-inertial EuRoC dataset [41] with 11 sequences is utilized. The datasets were collected on board a Micro Aerial Vehicle (MAV) in three categories of static indoor environments. The available data consist of 20 Hz stereo images and 200 Hz measurements from a MEMS IMU ADIS16448. The camera intrinsic, camera-IMU extrinsic, and spatio-temporal calibration parameters are given. These datasets also provide 200 Hz ground-truth, evaluated by 3D and 6D motion capture systems consisting of a Vicon system and a Leica MS50 laser tracker. We evaluate the proposed algorithm without using the SLAM landmarks supported by the EuRoC dataset. Our algorithm operates successfully in real-time on all sequences of the datasets. We also conduct a comparison with the following VIO algorithms:

1. (Stereo) S-MSCKF VIO [8]: A state-of-the-art open-sourced filtering-based method, implemented from the initial MSCKF [7], which uses stereo images for high-accuracy navigation;
2. VINS-Fusion VIO [13]: A modern optimization-based solution, an extended version of VINS-Mono [14]. This method utilizes IMU pre-integration with factor-graph optimization. Loop closure is disabled for this test;
3. Basalt VIO [15]: The most recent state-of-the-art open-sourced VIO, which employs a fixed-lag smoother optimization methodology. This method uses stereo key-frames to extract the relevant information for VIO based on non-linear factor recovery.
To perform the comparison, we use the SLAM trajectory evaluation toolbox [42] to compute the Root Mean Square Error (RMSE) of the Absolute Trajectory Error (ATE) for each sequence of the EuRoC datasets; the experimental outcomes are shown in Table 1. Our proposed method outperforms S-MSCKF and VINS-Fusion in most sequences of the EuRoC dataset, and is only slightly behind Basalt overall. Table 1 indicates that the S-MSCKF filtering-based VIO had the worst results, followed by the VINS-Fusion optimization-based method. Although the Basalt solution was better than ours in the cases of V1_01, MH_04, and MH_05, the proposed approach outperformed it in sequences V1_03 and V2_02. The other routes had similar results. The best cases of our proposed method were the V1_02, V1_03, V2_02, and V2_03 sequences, which are illustrated in Figure 8. Figure 9 provides an example comparison of the relative translation error between our method and Basalt on the V2_02 sequence. Furthermore, the three-sigma bounds of the state estimation errors were considered [27], as shown in Figure 10. The ±3σ boundaries show that more than 99% of the errors fall within the confidence regions. These bounded uncertainty outcomes indicate that the proposed system is a well-functioning and consistent filter. Table 1. Root Mean Square Error (RMSE) of the absolute trajectory error for S-MSCKF, VINS-Fusion, and Basalt compared with our algorithm on the EuRoC dataset. Bold results are the best, and X denotes a failed test. We examined the processing time of each task, as provided in Table 2. Our algorithm was implemented and operated in parallel on CPU resources. Table 2 shows that the most time-consuming task was the tracking process, with feature extraction and feature matching. In contrast, the shortest processing period was propagation, which took less than half a millisecond. The time consumed by the feature update and initialization increases linearly with the number of features to be updated [7].
In summary, the total processing time was about 26 ms, approximately equal to the 25 ms of S-MSCKF. The optimization-based VIO methods consumed nearly double the time of our process. In detail, VINS-Fusion and Basalt, which use factor-graph optimization, spent 60 ms and 54 ms, respectively, compared with our proposed method, which took only 26 ms. The quantitative results show that the proposed method achieved accuracy similar to the state-of-the-art VIO techniques with only half the computational complexity.
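The ATE metric reported in Table 1 can be sketched as below. This is an illustrative sketch, not the toolbox implementation: it assumes the estimated and ground-truth trajectories are already time-associated and aligned (the evaluation toolbox [42] performs the SE(3) alignment internally).

```python
import numpy as np

def ate_rmse(p_est, p_gt):
    """RMSE of the absolute trajectory error between an estimated and
    a ground-truth trajectory, given as N x 3 arrays of positions that
    are already associated in time and spatially aligned."""
    err = np.linalg.norm(np.asarray(p_est) - np.asarray(p_gt), axis=1)
    return float(np.sqrt(np.mean(err ** 2)))
```

Because the error is squared before averaging, a few large excursions (for example, during a tracking failure) dominate the score, which is why a lost track shows up as an X rather than a finite RMSE.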

Experimental Results in Indoor Dynamic Environments
In order to evaluate the experimental results in indoor dynamic environments, we built a sensor system comprising a Zed stereo camera and a 3DM-GX3-25 IMU, which are rigidly attached to a convenient light-weight platform, as shown in Figure 11. During the experiments, the stereo camera's frequency was set at 30 Hz with an image resolution of 640 × 480, while the IMU signal was sampled at a rate of 200 Hz.
Our method could detect and remove outliers efficiently, as shown in Figures 1 and 12. The use of multi-stage outlier removal could efficiently discard the dynamic features, indicated by the blue circles in Figure 12b. We first evaluated the proposed algorithm inside a highly dynamic room where people continuously moved within the scope of a static camera. The RMSE of the proposed method was only 0.02 m, compared with the drift of VINS-Fusion [13], which was more than half a meter, as shown in Figure 13. In the second situation, the sensor system was moved around a small lab room of around 6 × 6 m² while a pedestrian carrying a pattern walked inside the scope of the camera. We evaluated the position precision by applying the KITTI standard [43], which is calculated as the ratio of the drift when returning to the initial pose to the total length of the trajectory. The rotation accuracy is likewise the ratio of the bearing error to the entire path. The result of the traveled path is illustrated in Figure 14; note that the sensor system turned around many times in this case. After obtaining the last pose, we computed the position precision ∆p = 0.2563 m/45.52 m = 0.56 (cm/m), and the bearing accuracy ∆ϕ = 2.4443 deg/45.52 m = 0.0541 (deg/m). Although the system performed several sharp turns, the result still achieved high precision. VINS-Fusion [13] was also used to perform this experiment; however, its accuracy was about 4 (cm/m), which is approximately 10 times worse than our algorithm, as shown in Figure 14. For the last scenario, we moved the sensor configuration along a hallway of around 7 m × 25 m and returned to the initial position. The result is presented in Figure 15. Herein, Line 1 (case 1) shows the estimated path measured by moving the sensor system along the hallway without a dynamic object, but affected by the illumination of electric lights. In case 2, a person carrying a pattern moves along and appears sparsely in the camera's scope.
In this scenario, the dotted green line (Line 2) presents the trajectory estimated under these dynamic and illumination effects. Line 3 (case 3), as shown in Figure 15, demonstrates the navigation path in a highly dynamic environment with illumination effects; in this case, a person carrying a pattern moved at high frequency within the field of view of both cameras. Figure 16 shows the comparison between our method, VINS-Fusion [13], and the Zed SLAM [44] provided by the manufacturer. Figure 16a depicts the results of our proposed method and the Zed SLAM algorithm, while VINS-Fusion achieved poor accuracy with an approximately 1 m drift. In particular, Zed SLAM detected a loop closure at the endpoint and then adjusted the end pose to coincide with the initial position; nevertheless, if loop closure detection is not considered, the proposed algorithm achieves a better result. As observed from Figure 16b, even with the loop-closure technique, our method, with an error of 0.78, outperformed Zed SLAM's result of 1.975. VINS-Fusion lost tracking in area L1, so it produced a severe drift of around 47 cm/m. Zed SLAM's worst result is shown in Figure 16c, where it headed in the wrong direction when returning to the initial pose, while our solution remained stable with an error of only 1.4%. VINS-Fusion also failed when coming back to the original position, but its accuracy was approximately twice as good as Zed SLAM's. The position and bearing accuracies of our method and the compared systems are presented in Table 3. Figure 15. The estimated trajectories along the hallway outside our lab, a stochastic environment of about 7 m × 25 m, under various scenarios. Line 1 is a path without any dynamic object; Line 2 is with a sparsely appearing moving person; Line 3 is with a very densely moving person; all cases are under the influence of illumination.
Points S1, S2, and S3 are the start poses of all paths, while E1, E2, and E3 are the end positions of scenarios 1, 2, and 3, respectively. Figure 16. The comparison of estimated trajectories between our proposed method, VINS-Fusion, and Zed SLAM in the various situations described in Figure 15.

Conclusions
The problem of vision-aided inertial navigation has seen significant progress over the past decade. Nevertheless, stochastic environments typically contain dynamic objects and illumination variation, so the precision and robustness of the estimation system are adversely affected. In this paper, we presented a robust visual-inertial navigation system based on a tightly-coupled methodology within the MSCKF framework, using a stereo camera and an IMU, for stochastic surroundings. The system integrates a multi-stage outlier removal algorithm to limit the influence of dynamic environments. Herein, FAST features are extracted from the visual data and passed through the removal algorithm to reject uncertain features; the surviving features are then employed in the correction phase of the EKF. We evaluated the algorithm on the EuRoC dataset and achieved notable results compared with state-of-the-art VINS. In addition, a sensor system consisting of a Zed camera and an inertial measurement unit was assembled to examine the performance of the proposed method in dynamic surroundings. The outcomes demonstrated that our solution outperformed Zed SLAM and VINS-Fusion with a drift of only 0.4% to 1%, depending on the scenario. The proposed estimator operates in real time, consuming only 26 ms per update without a GPU. Finally, this solution enables high-precision pose estimation in real time on resource-constrained devices.
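The multi-stage outlier removal itself is detailed earlier in the paper; as one representative stage, a Mahalanobis (chi-squared) gate on the measurement residual is commonly applied before the EKF correction in MSCKF-style filters. The sketch below is our own illustration of that generic test, not the paper's exact implementation; the matrices and thresholds are assumptions for the example.

```python
import numpy as np

# 95% chi-squared thresholds indexed by residual dimension (standard table values).
CHI2_95 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}

def mahalanobis_gate(residual, H, P, R):
    """Accept a feature measurement only if its normalized residual passes
    the 95% chi-squared test: with innovation covariance S = H P H^T + R,
    require r^T S^{-1} r < chi2_95(dim(r)). Failing features are treated
    as outliers and excluded from the EKF correction step."""
    S = H @ P @ H.T + R
    d2 = float(residual @ np.linalg.solve(S, residual))
    return d2 < CHI2_95[residual.shape[0]]

# Toy example with a 2-D reprojection residual (illustrative values only).
H = np.eye(2)          # measurement Jacobian
P = 0.01 * np.eye(2)   # state covariance
R = 0.25 * np.eye(2)   # pixel-noise covariance
print(mahalanobis_gate(np.array([0.1, -0.2]), H, P, R))  # small residual -> True
print(mahalanobis_gate(np.array([5.0, 5.0]), H, P, R))   # large residual -> False
```

Gating of this kind is complementary to the geometric checks described in the paper: it uses the filter's own uncertainty estimate, so features on moving objects that contradict the predicted motion are rejected automatically.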