1. Introduction
In the research field of multi-sensor fusion simultaneous localization and mapping (SLAM), the fusion of vision and inertia has been a hot research topic in recent years [1,2,3,4,5,6,7,8]. The rich environmental information captured by images can be used for building maps and estimating the up-to-scale trajectory of the camera attached to the mobile platform. The measurements of the gyroscope and accelerometer contained in the inertial measurement unit (IMU) can be integrated to estimate accurate short-term rigid body motion. In addition, both the camera and the IMU are available as mature miniaturized products. These properties make the combination of a monocular camera and an IMU economical, effective, and energy-efficient for localization and mapping on small mobile robots and other platforms with limited payload.
The state-of-the-art visual-inertial SLAM methods [1,2,3,4,5,6,7,8] presented in the literature widely use IMU preintegration [9] and keyframe-based nonlinear optimization [6]. However, the performance of these techniques heavily depends on prior precise extrinsic calibration of the six-degree-of-freedom (6-DOF) transformation between the camera and the IMU. As the coordinate conversion parameters between the camera and the IMU, the extrinsic parameters play an important role in accurate data fusion between the two sensors. Incorrect extrinsic calibration leads to systematic deviation and degrades the overall navigation performance. For the acquisition of extrinsic parameters, most visual-inertial SLAM methods [1,2,3,4,5,6,7] use Kalibr, a state-of-the-art marker-based offline calibration method [10]. The extrinsic parameters calibrated by Kalibr are then considered accurate and constant during the running of the whole program. However, this process is complex and time-consuming, and it usually has to be repeated whenever the sensors are repositioned. Therefore, in order to improve the portability and operability of the program, automatic extrinsic parameter estimation is necessary. At present, automatic extrinsic parameter calibration [11] has been implemented in a monocular visual-inertial system built from one camera and a low-cost IMU (VINS-Mono) [8]. Weibo Huang et al. [12] implemented self-calibration of the camera-IMU extrinsic parameters in the monocular ORB feature-based SLAM (ORB-SLAM) framework [1,2], although this code is not open source. ORB features are binary features invariant to rotation and scale (within a certain range), enabling a very fast recognizer with good invariance to viewpoint.
According to the matching strategy, visual measurement processing can be divided into optical flow methods [13] and feature-based methods [14,15,16].
The feature-based methods, represented by the ORB-SLAM series [1,2,3,4], extract ORB feature points [14] from the image, compute the descriptors of the corresponding feature points, and obtain feature point pairs between two frames by descriptor comparison. This kind of method achieves high positioning accuracy, but it is time-consuming and prone to losing tracking in weakly textured environments or when images are blurred by fast motion [17].
Like the feature-based methods, the optical flow methods represented by VINS-Mono also extract feature points from the image. However, they match feature points using image pixel values directly, eliminating the descriptor computation that is the most time-consuming step in feature-based methods. This makes optical flow methods suitable for real-time and resource-constrained applications, while their accuracy is still guaranteed to a certain extent. Moreover, optical flow methods are more robust to pixel noise than feature-based methods.
It must be noted that, owing to the photometric invariance assumption [14], optical flow methods are easily affected by external illumination changes, which means the feature-based methods are more robust to illumination than optical flow methods. Hence, a mixed method can be built from the optical flow and feature-based methods to achieve their robustness complementarity. The optical flow method complements the feature-based method when image descriptors cannot be accurately computed due to motion blur, while the feature-based method provides better inter-frame pose estimates to suppress the motion drift caused by wrong feature matching in the optical flow method.
Although the mixed method achieves robustness complementarity, the risk of tracking loss in weakly illuminated and low-texture environments cannot be eliminated. Thus, incorporating the IMU is necessary.
The involvement of IMU data requires an initialization process [18,19], which mainly finds the gravity direction and corrects the biases of the gyroscope and accelerometer [20]. The initialization of IMU data has been implemented by many algorithms [1,2,3,4,5,6,7,8,9]. Nowadays, most state-of-the-art methods use the visual initialization information as a reference to correct the gyroscope and accelerometer biases of the IMU [4,5,6,7,8] at the start. Taking VINS-Mono as an example [19], the initialization of the IMU is realized by aligning the IMU preintegration results with the vision-only structure from motion (SfM) results, which yields a rough estimate of the IMU state vector. The related IMU parameters are then continuously optimized in the subsequent tracking process. As a result, the accuracy of VINS-Mono may not be ideal immediately after map initialization. For the monocular mode of ORB-SLAM2 [21], only two keyframes with sufficient matching feature points are needed for initialization, and the combination with the IMU is not considered, so the initialization process is faster than that of VINS-Mono, but the map is constructed up-to-scale. Although the open-source code of ORB-SLAM3 [22] adds a monocular-inertial mode, it also includes multi-map management and feature-based relocalization and loop closing. These functions rely on feature storage and management that grow continuously as the map expands, which consumes a large amount of computer memory and prevents ORB-SLAM3 from running in real time on resource-constrained platforms. In addition, the self-calibration of extrinsic parameters is not considered in ORB-SLAM3.
Based on the above analysis, this paper proposes a robust parallel initialization method for monocular visual-inertial SLAM, which integrates the characteristics of VINS-Mono and the monocular mode of ORB-SLAM2 in the initialization phase. Building on the initialization frameworks of ORB-SLAM2 and VINS-Mono, this work first executes the monocular initialization thread of ORB-SLAM2 and the monocular-inertial initialization process of VINS-Mono in parallel, and then uses the Umeyama algorithm [23] and bundle adjustment (BA) [24] optimization to fuse their initialization data. The result of this method is more robust and more accurate than the original VINS-Mono initialization. Compared with the original monocular initialization trajectory of ORB-SLAM2, the trajectory obtained by the proposed initialization is in metric units instead of up-to-scale, which better describes the real external environment. The method also retains the online extrinsic parameter calibration of the VINS-Mono initialization, which improves the portability of the initialization algorithm. The subsequent experimental comparison with the state-of-the-art ORB-SLAM3 initialization shows that the accuracy of the maps obtained by the two initializations is comparable. The contributions can be summarized as follows:
The design of two-thread parallel initialization gives the SLAM system better portability and initialization robustness.
The designed initialization mode improves the initialization accuracy of the original VINS-Mono, providing a more accurate initial state estimate to the visual-inertial SLAM system.
The initialization state estimated by the proposed method yields a trajectory as accurate as that estimated by ORB-SLAM3 in monocular-inertial mode.
The remainder of this article is arranged as follows: In Section 2, the measurement preprocessing of the IMU and camera information is described. In Section 3, the proposed initialization process is detailed. Section 4 presents the experimental results of the improved initialization of monocular visual-inertial SLAM. Finally, the conclusions of this study are presented in Section 5.
3. The Initialization Method Integrating Feature-Based Pose Information, Optical Flow Pose Information, and Inertial Information
If we processed the results of the optical flow method, the feature-based method, and the IMU preintegration all at once, the amount of data to be optimized would be considerable and would demand high computer performance; therefore, we initialize in parallel threads and use a two-step optimization. To make full use of existing open-source programs while reducing the computational load, we first initialize the optical flow data and the feature-based data separately, and then fuse the two initialization results to combine the visual and inertial information.
The specific implementation process is shown in Figure 1. First, the feature-based and optical flow initialization threads run in parallel. After both threads complete, the initial maps are coarsely aligned using the Umeyama algorithm. Then, a local BA fusion of the visual terms is performed. Finally, the visual and inertial terms from the two threads are fused to obtain the optimized initialization results.
3.1. Initialization Framework for Monocular VIO in the Optical Flow Method
The optical flow monocular-inertial initialization method adopted by VINS-Mono is advanced and widely used. It also includes the estimation of the extrinsic parameters [11] between the camera and the IMU, and the alignment of the world coordinate frame with the gravity direction. The procedure of the initialization framework for VINS is shown in Figure 2.
Visual tracking based on the optical flow method requires continuous frames and small inter-frame motion. However, to obtain the initial map points, the matched feature points must be triangulated, and triangulation requires a sufficiently large displacement between the two image frames in which the feature points are observed. Therefore, initialization based on the optical flow method typically maintains a sliding window of size 10 as a data buffer whose unit of storage is one image frame. The initialization finds two frames with appropriate disparity in the sliding window for triangulation and propagates the triangulation results to the other frames in the window based on the optical flow tracking results. Note that only frames that are successfully tracked and differ sufficiently from the previous frame are added to the sliding window. Section VI-D of the VINS-Mono paper [8] details the marginalization of state items in the sliding window.
3.1.1. Vision-Only SfM in Sliding Window
As can be seen in Figure 2, the initialization method based on the optical flow method starts on the premise that the sliding window, which serves as the image frame data buffer, has been filled with image frames output by the camera. When the sliding window is full, we check feature correspondences between the latest frame and all previous frames. If we find stable feature tracking (more than 30 tracked features) and sufficient parallax (more than 20 pixels) between the latest frame and any other frame in the sliding window, we recover the relative rotation and up-to-scale translation between these two frames using the five-point algorithm [28]. Then, the feature points observed in both frames are triangulated. Based on these triangulated features, a PnP method [29] is performed to estimate the poses of all other frames in the window. Finally, a global full BA is carried out to minimize the total reprojection error of all feature observations. Here, the first image frame in the sliding window serves as the reference coordinate system.
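For illustration, this sliding-window SfM step can be sketched with OpenCV. This is a minimal sketch, not the VINS-Mono implementation: the helper name sfm_in_window, the input layout, and the use of cv2.findEssentialMat/cv2.solvePnPRansac are assumptions; known intrinsics K and undistorted Nx2 float pixel arrays are assumed given.

```python
# A minimal sketch (not the VINS-Mono implementation) of the
# sliding-window SfM step described above.
import cv2
import numpy as np

def sfm_in_window(pts_a, pts_b, K, other_frame_obs):
    """pts_a, pts_b: matched points in the two frames chosen for their
    parallax; other_frame_obs: observations of the same features in the
    remaining window frames (tracked by optical flow)."""
    # Five-point algorithm [28]: relative rotation, up-to-scale translation
    E, _ = cv2.findEssentialMat(pts_a, pts_b, K, cv2.RANSAC, 0.999, 1.0)
    _, R, t, _ = cv2.recoverPose(E, pts_a, pts_b, K)

    # Triangulate the features observed in both frames
    P0 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P1 = K @ np.hstack([R, t])
    pts4d = cv2.triangulatePoints(P0, P1, pts_a.T, pts_b.T)
    pts3d = (pts4d[:3] / pts4d[3]).T

    # PnP [29] to estimate the poses of all other frames in the window
    poses = []
    for obs in other_frame_obs:
        ok, rvec, tvec, _ = cv2.solvePnPRansac(pts3d, obs, K, None)
        if ok:
            poses.append((cv2.Rodrigues(rvec)[0], tvec))
    return R, t, pts3d, poses  # a global full BA would then refine all of these
```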
3.1.2. Visual-Inertial Alignment
In VINS-Mono, to recover the metric scale of the visual map, the IMU preintegration is aligned with the SfM pose estimates from the previous step, as shown in Figure 3 [8]. The basic idea is to match the up-to-scale visual structure with the IMU preintegration.
According to the spatial transformation relationship, the coordinate transformation between the camera frame and the IMU frame satisfies

$$\mathbf{q}_{c_{i+1}}^{c_i} \otimes \mathbf{q}_c^{b} = \mathbf{q}_c^{b} \otimes \mathbf{q}_{b_{i+1}}^{b_i} \;\;\Longrightarrow\;\; \left( \left[ \mathbf{q}_{c_{i+1}}^{c_i} \right]_L - \left[ \mathbf{q}_{b_{i+1}}^{b_i} \right]_R \right) \mathbf{q}_c^{b} = \mathbf{0} \quad (5)$$

where $[\mathbf{q}]_L$ and $[\mathbf{q}]_R$ are the matrix representations for left and right quaternion multiplication,

$$[\mathbf{q}]_L = \begin{bmatrix} q_w \mathbf{I}_3 + \lfloor \mathbf{q}_v \rfloor_{\times} & \mathbf{q}_v \\ -\mathbf{q}_v^{\mathsf{T}} & q_w \end{bmatrix}, \qquad [\mathbf{q}]_R = \begin{bmatrix} q_w \mathbf{I}_3 - \lfloor \mathbf{q}_v \rfloor_{\times} & \mathbf{q}_v \\ -\mathbf{q}_v^{\mathsf{T}} & q_w \end{bmatrix}$$

$\lfloor \mathbf{q}_v \rfloor_{\times}$ is the skew-symmetric matrix formed from the first three elements $\mathbf{q}_v$ of a quaternion, and $q_w$ is the real part of the quaternion. Moreover, $\mathbf{q}_y^x$ denotes the rotation quaternion from the $y$ coordinate system to the $x$ coordinate system, where $c_i$ and $b_i$ at the positions of $x$ and $y$ represent the camera coordinate system and the inertial coordinate system corresponding to the $i$-th image frame, respectively.
The relation of Equation (5) holds between all adjacent frames, and we can construct an overdetermined equation from the equality relationships among multiple frames as

$$\begin{bmatrix} w_1^0 \left( \left[ \mathbf{q}_{c_1}^{c_0} \right]_L - \left[ \mathbf{q}_{b_1}^{b_0} \right]_R \right) \\ \vdots \\ w_N^{N-1} \left( \left[ \mathbf{q}_{c_N}^{c_{N-1}} \right]_L - \left[ \mathbf{q}_{b_N}^{b_{N-1}} \right]_R \right) \end{bmatrix} \mathbf{q}_c^{b} = \mathbf{Q}_N \, \mathbf{q}_c^{b} = \mathbf{0}$$

where $N$ is the index of the latest frame, which keeps growing until the end of the rotation calibration, and $w_{i+1}^{i}$ is a weight derived from a robust norm for better outlier handling. Singular value decomposition (SVD) is then used to obtain the rotation between the IMU and the camera, $\mathbf{q}_c^{b}$, as the right singular vector corresponding to the smallest singular value of $\mathbf{Q}_N$. The extrinsic rotation obtained by this method is not completely accurate because the IMU error is not considered, and the value needs to be refined by back-end optimization later. If $\mathbf{q}_c^{b}$ is known exactly, this step can be skipped. See Formulas (4)–(9) in [11] for the detailed derivation.
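The rotation calibration above admits a compact sketch. This is a minimal illustration under assumed naming (quat_left, quat_right, calibrate_rotation are hypothetical helpers); the robust weights are omitted for brevity, and quaternions are assumed in [x, y, z, w] order.

```python
# A minimal sketch of the SVD-based camera-IMU rotation calibration.
import numpy as np

def skew(v):
    """Skew-symmetric matrix of a 3-vector."""
    return np.array([[0, -v[2], v[1]],
                     [v[2], 0, -v[0]],
                     [-v[1], v[0], 0]])

def quat_left(q):
    """Left-multiplication matrix [q]_L, q = [x, y, z, w]."""
    v, w = q[:3], q[3]
    L = np.zeros((4, 4))
    L[:3, :3] = w * np.eye(3) + skew(v)
    L[:3, 3] = v
    L[3, :3] = -v
    L[3, 3] = w
    return L

def quat_right(q):
    """Right-multiplication matrix [q]_R, q = [x, y, z, w]."""
    v, w = q[:3], q[3]
    R = np.zeros((4, 4))
    R[:3, :3] = w * np.eye(3) - skew(v)
    R[:3, 3] = v
    R[3, :3] = -v
    R[3, 3] = w
    return R

def calibrate_rotation(cam_quats, imu_quats):
    """cam_quats[i] is q_{c_{i+1}}^{c_i}, imu_quats[i] is q_{b_{i+1}}^{b_i}.
    Stack the ([q]_L - [q]_R) blocks for all adjacent pairs and solve the
    homogeneous system by SVD; returns q_c^b as [x, y, z, w]."""
    A = np.vstack([quat_left(qc) - quat_right(qb)
                   for qc, qb in zip(cam_quats, imu_quats)])
    _, _, Vt = np.linalg.svd(A)
    q = Vt[-1]   # right singular vector of the smallest singular value
    return q / np.linalg.norm(q)
```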
Combining the rotation results of the SfM, $\mathbf{q}_{b_k}^{c_0}$, and the IMU preintegration in Equations (2) and (4), we can construct an equation between two consecutive frames, $b_k$ and $b_{k+1}$:

$$\left( \mathbf{q}_{b_{k+1}}^{c_0} \right)^{-1} \otimes \mathbf{q}_{b_k}^{c_0} \otimes \boldsymbol{\gamma}_{b_{k+1}}^{b_k} = \begin{bmatrix} 1 \\ \mathbf{0} \end{bmatrix}$$

The cost function is constructed accordingly, and the IMU preintegration is linearized with respect to the gyroscope bias to minimize the cost function:

$$\min_{\delta \mathbf{b}_w} \sum_{k \in \mathcal{B}} \left\| \left( \mathbf{q}_{b_{k+1}}^{c_0} \right)^{-1} \otimes \mathbf{q}_{b_k}^{c_0} \otimes \boldsymbol{\gamma}_{b_{k+1}}^{b_k} \right\|^2, \qquad \boldsymbol{\gamma}_{b_{k+1}}^{b_k} \approx \hat{\boldsymbol{\gamma}}_{b_{k+1}}^{b_k} \otimes \begin{bmatrix} 1 \\ \frac{1}{2} \mathbf{J}_{b_w}^{\gamma} \delta \mathbf{b}_w \end{bmatrix}$$

where $\mathcal{B}$ indexes all frames in the window. In this way, we obtain an initial calibration of the gyroscope bias $\mathbf{b}_w$. Then, we repropagate all IMU preintegration terms $\hat{\boldsymbol{\alpha}}_{b_{k+1}}^{b_k}$, $\hat{\boldsymbol{\beta}}_{b_{k+1}}^{b_k}$, and $\hat{\boldsymbol{\gamma}}_{b_{k+1}}^{b_k}$ using the new gyroscope bias.
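A sketch of the corresponding normal-equation solution is given below, assuming the preintegration Jacobian $\mathbf{J}_{b_w}^{\gamma}$ is available; the helper names and the input dictionary layout are hypothetical, and quaternions use [x, y, z, w] order.

```python
# A hedged sketch of the gyroscope bias correction: accumulate the
# normal equations of the linearized cost over all consecutive frame
# pairs, then solve for the bias update.
import numpy as np

def q_mul(a, b):
    """Hamilton product a ⊗ b, quaternions in [x, y, z, w] order."""
    av, aw, bv, bw = a[:3], a[3], b[:3], b[3]
    return np.concatenate([aw * bv + bw * av + np.cross(av, bv),
                           [aw * bw - av @ bv]])

def q_inv(q):
    """Inverse (conjugate) of a unit quaternion."""
    return np.array([-q[0], -q[1], -q[2], q[3]])

def solve_gyro_bias(pairs):
    """pairs: per consecutive pair k, a dict with SfM rotations
    'q_c0_bk', 'q_c0_bk1', preintegrated rotation 'gamma', and the
    3x3 Jacobian 'J_gamma_bw' of gamma w.r.t. the gyro bias."""
    A, b = np.zeros((3, 3)), np.zeros(3)
    for f in pairs:
        # error rotation: gamma^{-1} ⊗ (q_{b_k}^{c0})^{-1} ⊗ q_{b_{k+1}}^{c0}
        q_err = q_mul(q_inv(f['gamma']),
                      q_mul(q_inv(f['q_c0_bk']), f['q_c0_bk1']))
        J = f['J_gamma_bw']
        A += J.T @ J
        b += J.T @ (2.0 * q_err[:3])   # vector part drives the update
    return np.linalg.solve(A, b)       # delta b_w; then repropagate alpha, beta, gamma
```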
We set the first camera frame $(\cdot)^{c_0}$ as the reference frame for the SfM. All frame poses ($\bar{\mathbf{p}}_{c_k}^{c_0}$, $\mathbf{q}_{c_k}^{c_0}$) and feature positions are represented with respect to $(\cdot)^{c_0}$. Moreover, ($\mathbf{p}_c^{b}$, $\mathbf{q}_c^{b}$) are the extrinsic parameters between the camera and the IMU. After the gyroscope bias is initialized, the gravity vector and the metric scale in the world coordinate system can be initialized by relating the IMU preintegration between two consecutive frames to the camera pose information obtained by the SfM as

$$\hat{\boldsymbol{\alpha}}_{b_{k+1}}^{b_k} = \mathbf{R}_{c_0}^{b_k} \left( s \left( \bar{\mathbf{p}}_{b_{k+1}}^{c_0} - \bar{\mathbf{p}}_{b_k}^{c_0} \right) + \frac{1}{2} \mathbf{g}^{c_0} \Delta t_k^2 - \mathbf{R}_{b_k}^{c_0} \mathbf{v}_{b_k}^{b_k} \Delta t_k \right), \qquad \hat{\boldsymbol{\beta}}_{b_{k+1}}^{b_k} = \mathbf{R}_{c_0}^{b_k} \left( \mathbf{R}_{b_{k+1}}^{c_0} \mathbf{v}_{b_{k+1}}^{b_{k+1}} + \mathbf{g}^{c_0} \Delta t_k - \mathbf{R}_{b_k}^{c_0} \mathbf{v}_{b_k}^{b_k} \right) \quad (10)$$

among which

$$\mathbf{q}_{b_k}^{c_0} = \mathbf{q}_{c_k}^{c_0} \otimes \left( \mathbf{q}_c^{b} \right)^{-1}, \qquad s \bar{\mathbf{p}}_{b_k}^{c_0} = s \bar{\mathbf{p}}_{c_k}^{c_0} - \mathbf{R}_{b_k}^{c_0} \mathbf{p}_c^{b}$$

According to the above equations, the velocity $\mathbf{v}_{b_k}^{b_k}$, the gravity vector $\mathbf{g}^{c_0}$, and the scale $s$ are taken as the state quantities to be estimated:

$$\mathcal{X}_I = \left[ \mathbf{v}_{b_0}^{b_0}, \mathbf{v}_{b_1}^{b_1}, \ldots, \mathbf{v}_{b_n}^{b_n}, \mathbf{g}^{c_0}, s \right] \quad (11)$$

Combining Equations (10) and (11) yields the following linear measurement model:

$$\hat{\mathbf{z}}_{b_{k+1}}^{b_k} = \begin{bmatrix} \hat{\boldsymbol{\alpha}}_{b_{k+1}}^{b_k} - \mathbf{p}_c^{b} + \mathbf{R}_{c_0}^{b_k} \mathbf{R}_{b_{k+1}}^{c_0} \mathbf{p}_c^{b} \\ \hat{\boldsymbol{\beta}}_{b_{k+1}}^{b_k} \end{bmatrix} = \mathbf{H}_{b_{k+1}}^{b_k} \mathcal{X}_I + \mathbf{n}_{b_{k+1}}^{b_k}, \qquad \mathbf{H}_{b_{k+1}}^{b_k} = \begin{bmatrix} -\mathbf{I} \Delta t_k & \mathbf{0} & \frac{1}{2} \mathbf{R}_{c_0}^{b_k} \Delta t_k^2 & \mathbf{R}_{c_0}^{b_k} \left( \bar{\mathbf{p}}_{c_{k+1}}^{c_0} - \bar{\mathbf{p}}_{c_k}^{c_0} \right) \\ -\mathbf{I} & \mathbf{R}_{c_0}^{b_k} \mathbf{R}_{b_{k+1}}^{c_0} & \mathbf{R}_{c_0}^{b_k} \Delta t_k & \mathbf{0} \end{bmatrix}$$

where $\mathbf{R}_{c_0}^{b_k}$, $\mathbf{R}_{b_{k+1}}^{c_0}$, $\bar{\mathbf{p}}_{c_k}^{c_0}$, and $\bar{\mathbf{p}}_{c_{k+1}}^{c_0}$ are obtained from the up-to-scale monocular visual SfM, and $\Delta t_k$ is the time interval between two consecutive frames. By solving this linear least-squares problem

$$\min_{\mathcal{X}_I} \sum_{k \in \mathcal{B}} \left\| \hat{\mathbf{z}}_{b_{k+1}}^{b_k} - \mathbf{H}_{b_{k+1}}^{b_k} \mathcal{X}_I \right\|^2$$

the body frame velocities for every frame in the window, the gravity vector in the visual reference frame $\mathbf{g}^{c_0}$, as well as the scale parameter $s$ can be obtained.
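The stacked least-squares problem can be illustrated as follows; the function name, input layout, and array shapes are assumptions for this sketch, and no robustification or conditioning checks are included.

```python
# A minimal sketch of stacking the per-frame linear measurement models
# and solving for velocities, gravity, and scale in one least squares.
import numpy as np

def solve_alignment(pairs, p_bc):
    """pairs: one dict per consecutive frame pair k with keys
    'R_c0_bk', 'R_bk1_c0' (3x3), 'p_ck', 'p_ck1', 'alpha', 'beta' (3,),
    'dt' (float). State x = [v_0 ... v_n, g_c0, s], size 3(n+1)+4."""
    n = len(pairs)
    dim = 3 * (n + 1) + 4
    rows_A, rows_b = [], []
    for k, f in enumerate(pairs):
        R0k = f['R_c0_bk']            # rotation c0 -> b_k
        Rk1 = f['R_bk1_c0']           # rotation b_{k+1} -> c0
        dt = f['dt']
        H = np.zeros((6, dim))
        # alpha (position preintegration) rows
        H[0:3, 3*k:3*k+3] = -np.eye(3) * dt
        H[0:3, -4:-1] = 0.5 * dt**2 * R0k
        H[0:3, -1] = R0k @ (f['p_ck1'] - f['p_ck'])
        # beta (velocity preintegration) rows
        H[3:6, 3*k:3*k+3] = -np.eye(3)
        H[3:6, 3*(k+1):3*(k+1)+3] = R0k @ Rk1
        H[3:6, -4:-1] = dt * R0k
        z = np.concatenate([f['alpha'] - p_bc + R0k @ Rk1 @ p_bc, f['beta']])
        rows_A.append(H)
        rows_b.append(z)
    A, b = np.vstack(rows_A), np.concatenate(rows_b)
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    vels, g_c0, s = x[:-4].reshape(-1, 3), x[-4:-1], x[-1]
    return vels, g_c0, s
```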
Then, the gravity vector is perturbed with two variables on its tangent space, which preserves its 2-DOF, as in Equation (16). The aim is to refine the direction of the gravity vector obtained in the previous step:

$$\hat{\mathbf{g}} = g \cdot \bar{\hat{\mathbf{g}}} + w_1 \mathbf{b}_1 + w_2 \mathbf{b}_2 \quad (16)$$

where $g$ is the known magnitude of the gravity, $\bar{\hat{\mathbf{g}}}$ is the unit vector representing the direction of gravity, $\mathbf{b}_1$ and $\mathbf{b}_2$ are two orthogonal bases spanning the tangent plane, as shown in Figure 4 [8], and $w_1$ and $w_2$ are 2-D perturbations toward $\mathbf{b}_1$ and $\mathbf{b}_2$, respectively. Lastly, $\mathbf{b}_1$ and $\mathbf{b}_2$ can be set arbitrarily as long as they are not parallel.
Then, the $\mathbf{g}^{c_0}$ in Equation (10) is replaced by $g \cdot \bar{\hat{\mathbf{g}}} + w_1 \mathbf{b}_1 + w_2 \mathbf{b}_2$, and the 2-D perturbation $[w_1, w_2]$ is solved together with the other state variables. This process iterates several times until $\hat{\mathbf{g}}$ converges.
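A small sketch of this 2-DOF parameterization follows; the gravity magnitude 9.81 and the helper names are assumptions, and the basis choice mirrors the "any non-parallel vector" rule stated above.

```python
# A minimal sketch of the tangent-space gravity refinement (NumPy only).
import numpy as np

GRAVITY_MAGNITUDE = 9.81  # known |g|; assumed value

def tangent_basis(g):
    """Two orthonormal vectors b1, b2 spanning the plane
    perpendicular to the gravity direction g."""
    g_hat = g / np.linalg.norm(g)
    tmp = np.array([1.0, 0.0, 0.0])
    if abs(np.dot(g_hat, tmp)) > 0.9:   # avoid a near-parallel choice
        tmp = np.array([0.0, 0.0, 1.0])
    b1 = np.cross(g_hat, tmp)
    b1 /= np.linalg.norm(b1)
    b2 = np.cross(g_hat, b1)
    return b1, b2

def refine_gravity(g_init, w):
    """Apply a 2-D perturbation w = [w1, w2] on the tangent plane and
    rescale to the known gravity magnitude."""
    g_hat = g_init / np.linalg.norm(g_init)
    b1, b2 = tangent_basis(g_init)
    g_new = GRAVITY_MAGNITUDE * g_hat + w[0] * b1 + w[1] * b2
    return GRAVITY_MAGNITUDE * g_new / np.linalg.norm(g_new)
```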
The rotation $\mathbf{q}_{c_0}^{w}$ between the world frame and the camera reference frame $(\cdot)^{c_0}$ can be obtained by rotating the refined gravity vector to the z-axis. All variables are then rotated from the reference frame $(\cdot)^{c_0}$ to the world frame $(\cdot)^{w}$, the body frame velocities are also rotated to the world frame, and the translational components from the visual SfM are scaled to metric units. At this point, the initialization procedure is finished.
3.2. Initialization Framework for Monocular VO in the ORB Feature-Based Method
After the feature points in an image frame are detected and described, feature matching is carried out.
In this paper, the ORB initialization framework only carries out the pure vision initialization process as in ORB-SLAM2.
The monocular initialization process of ORB-SLAM2 can be divided into the following three stages.
3.2.1. Selecting Two Frames as Initial Reference Frames
The first reference frame selected during initialization must contain a sufficient number of detected feature points (>100); the same requirement applies to the following image frames. If two consecutive frames do not meet this feature count requirement, the first reference frame is selected again and the above process is repeated.
3.2.2. Calculation of Relative Pose Based on the Matching Features and Triangulation
When the number of feature points in two consecutive frames is sufficient, the DBoW2 bag-of-words model is used to accelerate the feature matching process. After bag-of-words matching, multi-thread parallel processing is used to calculate the relative pose using the fundamental matrix model and the homography matrix model, respectively, yielding two pose transformation results for the two frames. The reprojection error of the features is used to score the two pose results, and the pose from the higher-scoring model is selected to complete the triangulation of the matched points. Then, the PnP algorithm is used to calculate the relative poses of the new images.
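As an illustration of this two-model strategy, the following hedged OpenCV sketch fits a homography and a fundamental matrix in parallel, scores them by inlier count (a simplification of ORB-SLAM2's symmetric-transfer-error score), and recovers the pose from the better model; the planar-degeneracy and cheirality checks of ORB-SLAM2 are omitted, and the function name and inputs are assumptions.

```python
# A hedged sketch of ORB-SLAM2-style two-model initialization, assuming
# OpenCV, pinhole intrinsics K, and Nx2 matched keypoint arrays.
import cv2
import numpy as np

def initialize_two_view(pts1, pts2, K):
    H, mask_h = cv2.findHomography(pts1, pts2, cv2.RANSAC, 3.0)
    F, mask_f = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 3.0, 0.99)
    score_h, score_f = mask_h.sum(), mask_f.sum()   # simplified scoring

    if score_f >= score_h:
        E = K.T @ F @ K                       # essential from fundamental
        _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)
    else:
        _, Rs, ts, _ = cv2.decomposeHomographyMat(H, K)
        R, t = Rs[0], ts[0]                   # cheirality check omitted

    # Triangulate the matched points (up-to-scale map points)
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([R, t.reshape(3, 1)])
    pts4d = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
    return R, t, (pts4d[:3] / pts4d[3]).T
```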
3.2.3. Global BA
Global BA is performed on the previously recovered feature points and poses to optimize the initialized map.
So far, we have obtained an initial up-to-scale map based on the ORB feature-based method, which stores the relative poses between image frames and the 3D coordinates of the corresponding feature points.
From the above description, we can see that the initialization of the visual map based on ORB feature points is simple and fast, but the established map is not in metric units and is not aligned with the inertial coordinate system. Since the two initialization processes are carried out synchronously, the map initialization based on ORB feature points is expected to be completed first.
3.3. The Initialization Framework That Fuses the Initial Results of the Optical Flow and Feature-Based Methods
The data fusion operation proposed in this paper is performed after the initialization of both the optical flow method and the feature-based method. Since the initialization of the feature-based method depends only on the images of the monocular camera, the visual map it establishes must be multiplied by a fixed scale factor to be in metric units. In the following, this map is referred to as the feature-based map, and the map established by the optical flow method is referred to as the optical flow map.
The fusion process can be divided into three steps, as shown in Figure 5. First, the similarity transformation of the feature-based map is obtained by the Umeyama algorithm, which is designed to compute the positional relationship between two sets of data. When aligning two trajectories $\{\mathbf{x}_i\}_{i=1}^{n}$ and $\{\mathbf{y}_i\}_{i=1}^{n}$, the goal is to compute a set of $s$, $\mathbf{R}$, and $\mathbf{t}$ such that the objective function (17) is minimized:

$$\min_{s, \mathbf{R}, \mathbf{t}} \frac{1}{n} \sum_{i=1}^{n} \left\| \mathbf{y}_i - \left( s \mathbf{R} \mathbf{x}_i + \mathbf{t} \right) \right\|^2 \quad (17)$$

where $s$ is the scaling factor, $\mathbf{R}$ is the rotation matrix, and $\mathbf{t}$ represents the translation.
Then, the transformation between two co-visible frames in the feature-based map is substituted into the corresponding relationship in the optical flow map for optimization. Finally, the position and rotation components of the state vector optimized in the previous step are fixed, and the remaining states in the state vector are further optimized.
3.3.1. Rough Map Alignment Using Umeyama Algorithm
In order to better perform the data fusion operation between the two maps, we use the Umeyama algorithm to solve the similarity transformation matrix, using the positions of the common keyframes in the two maps as input. The basic idea of this step is to restore the feature-based map to metric units and align it with the world coordinate system of the optical flow map through scaling, rotation, and translation.
Through the Umeyama algorithm, we can obtain the similarity transformation matrix

$$\mathbf{T}_s = \begin{bmatrix} s \mathbf{R} & \mathbf{t} \\ \mathbf{0}^{\mathsf{T}} & 1 \end{bmatrix}$$

which makes the following equation between corresponding keyframe positions $\mathbf{p}_k^{fb}$ in the feature-based map and $\mathbf{p}_k^{of}$ in the optical flow map approximately hold:

$$\mathbf{p}_k^{of} \approx s \mathbf{R} \, \mathbf{p}_k^{fb} + \mathbf{t} \quad (18)$$

After the similarity transformation above, the feature-based map can be approximately aligned with the optical flow map; the alignment process is shown in Figure 6.
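The Umeyama solution used in this step admits a compact closed form; the following NumPy sketch (the function name and input layout are assumptions) computes $s$, $\mathbf{R}$, and $\mathbf{t}$ from corresponding keyframe positions via an SVD of the cross-covariance.

```python
# A hedged sketch of the Umeyama similarity alignment between the
# feature-based trajectory X and the optical flow trajectory Y.
import numpy as np

def umeyama_alignment(X, Y):
    """X, Y: 3xN corresponding keyframe positions. Returns s, R, t
    minimizing sum_i || y_i - (s * R @ x_i + t) ||^2."""
    mu_x, mu_y = X.mean(axis=1), Y.mean(axis=1)
    Xc, Yc = X - mu_x[:, None], Y - mu_y[:, None]
    n = X.shape[1]
    var_x = (Xc**2).sum() / n
    Sigma = Yc @ Xc.T / n                 # cross-covariance of the sets
    U, D, Vt = np.linalg.svd(Sigma)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1                      # reflection correction
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / var_x
    t = mu_y - s * R @ mu_x
    return s, R, t
```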
3.3.2. Local BA in Visual States
The feature-based map mainly stores the relative rotation and translation information between frames. Because the feature matching algorithm used in the feature-based initialization is more accurate, the states in the feature-based map are not optimized further; instead, the relevant information provided by the feature-based map is added to the optimization as measurements.
Thus, the state vector in the local BA is defined as:

$$\mathcal{X} = \left[ \mathbf{x}_0, \mathbf{x}_1, \ldots, \mathbf{x}_n, s, \lambda_0, \lambda_1, \ldots, \lambda_m \right], \qquad \mathbf{x}_k = \left[ \mathbf{p}_{b_k}^{w}, \mathbf{q}_{b_k}^{w} \right]$$

where $\mathbf{x}_k$ is the position state of the IMU at the time the $k$th image is captured; it contains the position and orientation of the IMU in the world frame. Moreover, $s$ is the scale value in the similarity matrix calculated by the Umeyama algorithm, which needs to be optimized in the BA; $n$ is the total number of keyframes; and $m$ is the total number of features tracked by the optical flow method in the sliding window. Lastly, $\lambda_l$ is the inverse distance of the $l$th feature from its first observation.
The visual BA formulation can be optimized by minimizing the sum of the Mahalanobis norms of all measurement residuals to obtain a maximum a posteriori estimation:

$$\min_{\mathcal{X}} \left\{ \sum_{(i,j) \in \mathcal{F}} \rho\left( \left\| \mathbf{r}_s\left( \hat{\mathbf{z}}_{ij}, \mathcal{X} \right) \right\|^2 \right) + \sum_{(i,j) \in \mathcal{F}} \rho\left( \left\| \mathbf{r}_q\left( \hat{\mathbf{z}}_{ij}, \mathcal{X} \right) \right\|^2 \right) + \sum_{(l,j) \in \mathcal{C}} \rho\left( \left\| \mathbf{r}_{\mathcal{C}}\left( \hat{\mathbf{z}}_l^{c_j}, \mathcal{X} \right) \right\|^2 \right) \right\}$$

where the Huber norm is defined as:

$$\rho(s) = \begin{cases} s, & s \le 1 \\ 2\sqrt{s} - 1, & s > 1 \end{cases}$$

and $\mathbf{r}_s$, $\mathbf{r}_q$, and $\mathbf{r}_{\mathcal{C}}$ are the residuals for scale estimation in the feature-based map, rotation estimation between two co-visible frames in the feature-based map, and visual measurements in the optical flow map, respectively. The detailed definition of each residual term is presented next.
The position parameters of the image frames in the optical flow map are mainly optimized based on the relative rotation between frames solved by the feature-based map. According to the inter-frame conversion matrix of the feature-based map, combined with Equation (18), the following rotation constraint holds:

$$\mathbf{r}_q\left( \hat{\mathbf{z}}_{ij}, \mathcal{X} \right) = 2 \left[ \left( \mathbf{q}_{b_j}^{b_i} \right)^{-1} \otimes \left( \mathbf{q}_{b_i}^{w} \right)^{-1} \otimes \mathbf{q}_{b_j}^{w} \right]_{xyz}$$

where $\mathbf{q}_{b_i}^{w}$, $\mathbf{q}_{b_j}^{w}$, and $\mathbf{q}_{b_j}^{b_i}$ are the quaternion forms of $\mathbf{R}_{b_i}^{w}$, $\mathbf{R}_{b_j}^{w}$, and $\mathbf{R}_{b_j}^{b_i}$, respectively, and $\mathbf{q}_{b_j}^{b_i}$ is calculated from Equation (18). The $i$th image is co-visible with the $j$th image. Lastly, $\mathcal{F}$ is the set of keyframe pairs in the feature-based map with co-visible feature points.
Since the inter-frame rotations of the feature-based map are of considerable accuracy, only the translation requires a scale factor to restore it.
According to the feature-based map and the similarity matrix provided by the Umeyama algorithm in the first step, the translation between two co-visible frames and the scale factor $s$ are added to the residual formulation. Combined with Equation (18), the residual corresponding to $s$ is defined as:

$$\mathbf{r}_s\left( \hat{\mathbf{z}}_{ij}, \mathcal{X} \right) = \left\| \mathbf{p}_{b_j}^{w} - \mathbf{p}_{b_i}^{w} \right\| - s \left\| \hat{\mathbf{p}}_{b_j}^{b_i} \right\|$$

where $\mathbf{T}_{b_j}^{w}$, from which $\mathbf{p}_{b_j}^{w}$ is taken, is the transformation matrix from the inertial coordinate system of the $j$th image frame to the world coordinate system of the optical flow map. The reason for considering the inertial coordinate system is to further optimize the extrinsic parameters between the camera and the IMU. Lastly, $\left\| \cdot \right\|$ denotes the norm of a vector.
Considering the $l$th feature that is first observed in the $i$th image, the residual for the feature observation in the $j$th image is defined as:

$$\mathbf{r}_{\mathcal{C}}\left( \hat{\mathbf{z}}_l^{c_j}, \mathcal{X} \right) = \left[ \mathbf{b}_1 \;\; \mathbf{b}_2 \right]^{\mathsf{T}} \cdot \left( \hat{\bar{\mathcal{P}}}_l^{c_j} - \frac{\mathcal{P}_l^{c_j}}{\left\| \mathcal{P}_l^{c_j} \right\|} \right) \quad (24)$$

$$\hat{\bar{\mathcal{P}}}_l^{c_j} = \pi_c^{-1}\left( \begin{bmatrix} \hat{u}_l^{c_j} \\ \hat{v}_l^{c_j} \end{bmatrix} \right), \qquad \mathcal{P}_l^{c_j} = \mathbf{R}_b^{c} \left( \mathbf{R}_w^{b_j} \left( \mathbf{R}_{b_i}^{w} \left( \mathbf{R}_c^{b} \frac{1}{\lambda_l} \pi_c^{-1}\left( \begin{bmatrix} \hat{u}_l^{c_i} \\ \hat{v}_l^{c_i} \end{bmatrix} \right) + \mathbf{p}_c^{b} \right) + \mathbf{p}_{b_i}^{w} - \mathbf{p}_{b_j}^{w} \right) - \mathbf{p}_c^{b} \right)$$

where $\left[ \hat{u}_l^{c_i}, \hat{v}_l^{c_i} \right]^{\mathsf{T}}$ is the first observation of the $l$th feature, which happens in the $i$th image, $\left[ \hat{u}_l^{c_j}, \hat{v}_l^{c_j} \right]^{\mathsf{T}}$ is the observation of the same feature in the $j$th image, $\pi_c^{-1}$ is the back-projection function, which turns a pixel location into a unit vector using the camera intrinsic parameters, $\hat{\bar{\mathcal{P}}}_l^{c_j}$ is the unit vector for the observation of the $l$th feature in the $j$th frame, and $\mathcal{P}_l^{c_j}$ is the predicted feature measurement on the unit sphere, obtained by transforming its first observation in the $i$th frame to the $j$th frame. Here, $\mathbf{b}_1$ and $\mathbf{b}_2$ are two orthogonal bases spanning the tangent plane of $\hat{\bar{\mathcal{P}}}_l^{c_j}$.
3.3.3. Full BA with Some Optimized States Fixed
In the previous step, the inter-frame constraints in the feature-based map were introduced to optimize the image frame states and the extrinsic parameters in the optical flow map. These optimized states are now fixed, and a full BA over the visual states in the optical flow map and the inertial states is performed:

$$\mathcal{X} = \left[ \mathbf{x}_0, \mathbf{x}_1, \ldots, \mathbf{x}_n, \mathbf{x}_c^{b}, \lambda_0, \lambda_1, \ldots, \lambda_m \right], \qquad \mathbf{x}_k = \left[ \mathbf{p}_{b_k}^{w}, \mathbf{v}_{b_k}^{w}, \mathbf{q}_{b_k}^{w}, \mathbf{b}_a, \mathbf{b}_g \right]$$

where $\mathbf{x}_k$ is the IMU state at the time the $k$th image is captured. It contains the position, velocity, and orientation of the IMU in the world frame, and the accelerometer bias and gyroscope bias in the IMU body frame.
The visual measurement model here is the same as in the local BA, as in Equation (24). The only difference is that the position information ($\mathbf{p}_{b_k}^{w}$, $\mathbf{q}_{b_k}^{w}$) of the frames that exist in both the optical flow map and the feature-based map is treated as constant. To this end, the nonlinear cost function here is formulated as:

$$\min_{\mathcal{X}} \left\{ \sum_{k \in \mathcal{B}} \left\| \mathbf{r}_{\mathcal{B}}\left( \hat{\mathbf{z}}_{b_{k+1}}^{b_k}, \mathcal{X} \right) \right\|^2 + \sum_{(l,j) \in \mathcal{C}} \rho\left( \left\| \mathbf{r}_{\mathcal{C}}\left( \hat{\mathbf{z}}_l^{c_j}, \mathcal{X} \right) \right\|^2 \right) \right\}$$

where $\mathbf{r}_{\mathcal{B}}$ and $\mathbf{r}_{\mathcal{C}}$ are the residuals for IMU and visual measurements, respectively. Moreover, $\mathbf{r}_{\mathcal{C}}$ is defined in Equation (24), and $\mathbf{r}_{\mathcal{B}}$ is defined as:

$$\mathbf{r}_{\mathcal{B}}\left( \hat{\mathbf{z}}_{b_{k+1}}^{b_k}, \mathcal{X} \right) = \begin{bmatrix} \mathbf{R}_w^{b_k} \left( \mathbf{p}_{b_{k+1}}^{w} - \mathbf{p}_{b_k}^{w} + \frac{1}{2} \mathbf{g}^{w} \Delta t_k^2 - \mathbf{v}_{b_k}^{w} \Delta t_k \right) - \hat{\boldsymbol{\alpha}}_{b_{k+1}}^{b_k} \\ \mathbf{R}_w^{b_k} \left( \mathbf{v}_{b_{k+1}}^{w} + \mathbf{g}^{w} \Delta t_k - \mathbf{v}_{b_k}^{w} \right) - \hat{\boldsymbol{\beta}}_{b_{k+1}}^{b_k} \\ 2 \left[ \left( \mathbf{q}_{b_k}^{w} \right)^{-1} \otimes \mathbf{q}_{b_{k+1}}^{w} \otimes \left( \hat{\boldsymbol{\gamma}}_{b_{k+1}}^{b_k} \right)^{-1} \right]_{xyz} \\ \mathbf{b}_{a_{b_{k+1}}} - \mathbf{b}_{a_{b_k}} \\ \mathbf{b}_{g_{b_{k+1}}} - \mathbf{b}_{g_{b_k}} \end{bmatrix}$$

where $\left[ \cdot \right]_{xyz}$ extracts the vector part of a quaternion for the error-state representation, $\delta \boldsymbol{\theta}_{b_{k+1}}^{b_k}$ is the 3-D error-state representation of the quaternion, and $\hat{\boldsymbol{\alpha}}_{b_{k+1}}^{b_k}$, $\hat{\boldsymbol{\beta}}_{b_{k+1}}^{b_k}$, and $\hat{\boldsymbol{\gamma}}_{b_{k+1}}^{b_k}$ are the preintegrated IMU measurement terms between two consecutive image frames. Accelerometer and gyroscope biases are also included in the residual terms for online correction.