Visual-Inertial Odometry of Smartphone under Manhattan World

: Based on the hypothesis of the Manhattan world, we propose a tightly-coupled monocular visual-inertial odometry (VIO) system that combines structural features with point features and can run on a mobile phone in real-time. The back-end optimization is based on the sliding window method to improve computing efﬁciency. As the Manhattan world is abundant in the man-made environment, this regular world can use structural features to encode the orthogonality and parallelism concealed in the building to eliminate the accumulated rotation error. We deﬁne a structural feature as an orthogonal basis composed of three orthogonal vanishing points in the Manhattan world. Meanwhile, to extract structural features in real-time on the mobile phone, we propose a fast structural feature extraction method based on the known vertical dominant direction. Our experiments on the public datasets and self-collected dataset show that our system is superior to most existing open-source systems, especially in the situations where the images are texture-less, dark, and blurry.


Introduction
Positioning and navigation have attracted much attention in recent years, and many achievements have been made in the fields of robotics, micro aircraft, and autonomous driving. These devices can achieve very high accuracy by fusing global navigation satellite systems (GNSS), wheel-encoders, cameras, lasers, inertial measurement units (IMU), and other sensors [1][2][3]. Particularly, the fusion scheme of IMU and camera has been widely used because of its low cost and lightweight computation [4,5]. In general, the location can be obtained through GNSS outdoors, but in indoor environments, such as shopping malls and airports where GNSS is easily blocked [6,7], it becomes more difficult for people to obtain high-precision positioning services. As the smartphone is an indispensable portable device in modern life, it is particularly important to explore the potential of its positioning abilities [8]. At the same time, with the rapid development of augmented reality (AR), mobile phones have also received widespread attention as a platform for the interaction of AR technology. AR requires the mobile phone to estimate the 6 degrees of freedom (DoF) pose rather than just a 2 DoF or three DoF position [9]. The smartphone is usually equipped with at least an IMU and a consumer-level camera, which meets the minimum requirement of visual-inertial fusion.
Due to the inherent shortcomings of the temporal offsets between mobile phone sensors (camera and IMU), the motion blur generated by rolling shutter cameras [10], and the frequency reduction mechanism after overheating, etc., it is difficult to achieve a real-time estimation of the pose on mobile phones. Simultaneously, the behaviors of people when holding mobile phones are unpredictable, as they may rotate quickly for seeing the surrounding environment clearly, which may lead to blurred images. Several studies [11][12][13] that can estimate pose in real-time on mobile phones have been proposed. However, they can only eliminate the accumulated drift through loop closure, which may not exist in the paths of users. Furthermore, these methods still cannot solve the effect of image blur on rotation drift during fast rotation.
At present, the mainstreams of vision-based methods are still using point features [14][15][16], texture-less areas such as indoor corridors remain a challenge to point features-based methods. When there are few point features, the performance and accuracy of the system will be greatly degraded. Existing works use non-structural line features or structural features in the scene to deal with such difficulty. The monocular version of PL-SLAM [17] based on ORB-SLAM [15], stereo PL-SLAM [18] and PL-VIO [19] that incorporates IMU are all systems that use non-structural line features to improve the system accuracy. They match and triangulate line features between successive frames, and add line feature constraints to the residual functions by different parameterization methods.
Non-structural line features can help state estimation in the texture-less area. However, line features are unstable and easily occluded. Meanwhile, the same as point features, line features can only generate local constraints on the co-visibility graph, not a global one. By contrast, structural features can provide global observations for constraints, especially in indoor environments where most structures are man-made. The structure of this environment is usually regular and consists of three mutually orthogonal dominant directions, which is called the Manhattan world (MW) hypothesis [20]. Straight lines corresponding to each dominant direction in the space are no longer parallel in the image after projective transformation, but intersect at the vanishing point (VP) [21]. Some previous works use the structural regularity of the MW on monocular [22][23][24][25], stereo [26] and RGB-D cameras [27,28], respectively, essentially using the orthogonality of vanishing points to calculate accurate rotation or constrain the relative rotation between frames. From these works, it can be seen that the structural feature can eliminate the accumulative rotation drift of the system. Moreover, Zhou et al. [29] show that the rotation error is the main reason for long-term drift.
In our proposed method, we add the structural feature constraint to the tightly-coupled optimization-based visual-inertial odometry (VIO). We use the reprojection constraint of the point features and the global observation constraint of the structural features for state estimation. Furthermore, orthogonality between structural features is taken into account. We also optimize the process of the structural feature extraction in order to reduce the computing power of the mobile phone. As for precision, our method outperforms most of the state-of-the-art methods in the test on the public datasets. In the indoor field experiments with a mobile phone, the results show that our method achieves the best performance in terms of accuracy. The main contributions of this paper can be listed as follows: • A fast structural feature extraction method that can run in real-time on the mobile phone is proposed. We adopt the method of exhausting VP hypotheses to obtain the optimal global solution. • We propose to directly parameterize the three VPs into an orthogonal basis and define the orthogonal basis as a structural feature. In mathematics, we use Hamilton quaternion to represent this orthogonal basis to avoid singularity. At the same time, we use the tangent space of the rotating manifold to update the structural feature. The orthogonality of the structural feature is considered in this updating method. • We propose a tightly-coupled, optimization-based monocular visual-inertial odometry where IMU measurements, point features, and structural features are used as observation information. As far as we know, this is the first to add structural regularity constraint to VIO in the form of an orthogonal basis. Moreover, it can run in real-time on an Android phone with Kirin 990 5G processor at an average processing speed of 28.1 ms for a single frame.

VI-SLAM and VIO
IMU provides additional observation constraints, effectively improving the robustness of the monocular vision task with motion blur and occlusion problems. Both visual-inertial simultaneous localization and mapping (VI-SLAM) and VIO can be generally divided into filtering-based methods [4,14,30] and optimization-based methods [3,16,[31][32][33]. The filtering-based methods only retain the current camera state and the landmarks that may be observed in the future. Comparatively, optimization-based methods usually retain the states of multiple historical cameras and associated landmarks. The most representative filter-based methods are MSCKF [14] and ROVIO [30]. MSCKF [14] improves the algorithm efficiency by marginalizing landmarks from the state vector and reduces computation cost caused by the increase of landmarks. ROVIO [30] proposes an EKF framework that minimizes the photometric error of the patches around the point feature instead of minimizing the reprojection error.
OKVIS [32] is a classical optimization-based VIO system. It is the first to combine the optimization-based tightly-coupled method with the sliding window, and the constraints of the old states to the states in the sliding window are preserved by marginalization. The sliding window mechanism can help to maintain a constant computational cost. Similar to OKVIS [32], VINS-Mono [16] also uses the sliding window method as its back-end method. However, it proposes a fast and robust visual-inertial initialization method that estimates multiple states in the system. A 4 DoF pose graph optimization method is also adopted to eliminate the accumulated drift on x, y, z, and yaw angles. ORB-SLAM3 [33] recently released their work into the community, on the basis of the original [31]. Its VI-SLAM estimates multiple states of the system through Maximum-a-Posteriori (MAP) during visual-inertial initialization.
Several studies deploy SLAM systems on mobile phones for performing real-time estimation of the 6 DoF pose. RKSLAM [11] shows the AR effect of small range movement in the paper and proposes the use of a homography matrix to calculate inter-frame rotation for dealing with fast-rotating scenes. When the image is very blurry, and the point features cannot be matched, the homography matrix cannot be calculated. Piao et al. [12] replace the front end of ORB-SLAM with IMU pre-integration and KLT sparse optical flow tracking, making the average single-frame processing speed reach 23.2 ms on Android devices with a Qualcomm Snapdragon 805 processor. However, the performance is poor under fast rotation because the back end of ORB-SLAM is optimized based on the reprojection error under the local map, which will be interrupted during rapid rotation and fails to update the landmarks. VINS-Mobile [13] proposes a fast and robust initialization method to register the pose into the inertial frame and only optimizes the sliding window of the last few frames during the back-end optimization. However, maintaining a loop closure thread is a large overhead for a mobile phone and may occupy the computing resources of back-end optimization, causing frequency reduction after much heating. This paper disables the loop closure thread and allocates the computing resources to the back-end optimization.

Vanishing Point Extraction
The conventional process of VP extraction is first to extract the line segments in the image, and then cluster the extracted lines. Existing real-time VP extraction methods are mainly divided into two categories: One is the exhaustive method improved by Lu et al. [34]. They exhaustively list VPs hypotheses based on the orthogonality of VPs and construct a polar grid to speed up the VPs hypothesis query, finally improving the real-time performance. The other kind is based on RANSAC [22,[35][36][37]. The initial vanishing point is usually estimated by defining the minimum solution set, namely two line segments. A random sampling algorithm iteratively generates the vanishing point hypothesis, and the optimal vanishing point is selected as the final solution. Tardif et al. [36] use the J-linkage algorithm to extract VPs without any prior information. Bazin et al. [37] propose a method to solve the VPs using 1-line RANSAC when the normal vector of the horizontal plane is known. Camposeco et al. [22] then accelerate the RANSAC line clustering process by getting the direction of gravity from the accelerometer in their VIO system. Though the RANSAC method is speedy, there is a problem that the optimal global solution cannot be obtained.

Structural Regularity
A line of works uses structural regularity in MW to improve the performance of pose estimation. Camposeco et al. [22] combine the VPs with VIO in the framework of EKF and parametrize the VPs into a tangent space on the unit circle for representation. However, they do not consider that the VPs should maintain orthogonality after parameterization. StructSLAM [23] does not directly use the VPs as observation constraints but takes the structural lines corresponding to VPs as the observation. The structural line is parameterized by the parametric plane in which it is located and the intersection point across the parametric plane. The work most similar to ours is the system proposed by Li et al. [24]. They add the VP constraints based on PL-SLAM [17] and use the VPs to correct the pose from the aspects of absolute rotation and relative rotation. The absolute rotation residual is the residual between the currently extracted MW axes and the global MW axes. Also, the idea of average rotation in Global SFM [38] is adopted to further optimize the absolute rotation. However, they think that VP measurements are noiseless, so they do not continue to adjust the extracted VPs. Some studies use the density distribution of surface normal vectors to improve the accuracy of rotation estimation. LPVO [27] improves a mean-shift algorithm based on the normal vectors density distribution of the surface proposed by MVO [29]. LPVO uses the RGB image to extract line segments to generate the vanishing direction hypotheses. The accuracy of rotation motion estimation is improved using the density distribution of direction vectors and surface normal vectors. Guo et al. [28] use the cost function composed of point, line, and plane features to estimate the rotation during tracing. The keyframe rotation is refined by aligning the currently extracted MW axes with the global MW axes, while Li et al. [25] conduct the surface normal prediction of the RGB image by the convolutional neural network (CNN) to replace the role of the depth camera. Such studies rely on the surface normal and are difficult to be deployed on the mobile phone, whether using an RGB-D camera or CNN. Liu et al. [26] first obtain the rotation estimation after aligning the current MW axes with the global MW axes. They then separately estimate the translation, transforming the nonlinear optimization problem into a linear optimization problem. However, they ignore the error in the initialization of the global MW axes, which would always affect the rotation estimation of the system.

Preliminaries
We first define the notations and coordinates used throughout the paper. We employ (·) W to denote the Earth's inertial frame, (·) B k and (·) C k to denote the inertial frame and camera frame for the kth image. The body frame is defined to be same as the inertial frame. We use Hamilton quaternion A q B and the corresponding rotation matrix A R B ∈ SO(3) to represent the rotation from frame {B} to frame {A}, and R(q) and q(R) to represent the conversion between the quaternion and rotation matrix. A p B represents the position of frame {B} in frame {A}. We also denote the homogeneous transformation matrix A T B ∈ SE(3): B q C and B p C represent the extrinsic parameters between the inertial frame and the camera frame.
Due to the temporal offset between the two sensors (IMU and camera), it is not easy to calibrate extrinsic parameters between IMU and camera in the mobile phone through open-source calibration tools such as Kalibr [39]. So we manually measure the position of the sensor mount on the printed circuit board (PCB). We employ( ·) as a noisy measurement or an estimate of the state variable. We denote that [q] L and [q] R are, respectively, the left and right quaternion-product matrices, moreover, ⊗ represents the multiplication operation between two quaternions [40]. Since quaternion is four-dimensional, we use the perturbation of the tangent space of the rotation manifold δθ in order to prevent over-parameterization, and likewise for the perturbation representation of the rotation matrix. As is illustrated in (2), where [·] × represents the skew symmetric matrix corresponding to a vector: (2) Figure 1 shows an overview of the proposed visual-inertial odometry system. The system takes the image and IMU data flow from the mobile phone as input and outputs the estimated pose of 6 DoF. The system makes use of the rotation residuals of the structural feature on-manifold, the reprojection residuals of point feature, and the IMU pre-integration residuals for the Maximum-a-Posteriori (MAP). These observations are used as local observation constraints, while the structural features can also be used as global constraints to improve the accuracy of pose estimation. The system is mainly divided into two modules: front end and back end.

Front End
Initialized?

States Feedback
Sliding Window Optimizer Y Figure 1. Pipeline of the proposed visual-inertial odometry system. Front End. The front end receives the measurements of the image and IMU from the mobile phone, and at the same time, the old optimized states are received to match the structural features. Each time a new image comes, we update the states using IMU in the time interval of the current frame and the previous frame as the initial value for the next back-end optimization. Simultaneously, the pre-integration is used as a measurement between the continuous states during the back-end optimization, which will be described in Section 4.2.1.
The processing of acquiring point features and structural features is divided into two threads. After IMU pre-integration, we will track the old point features through the KLT sparse optical flow algorithm [41] and maintain a minimum number of point features. When the number of point features is insufficient, new point features will be added. In consideration of the efficiency, a FAST detector [42] is used to extract the corner features; furthermore, meshing and non-maximal suppression are used to control the uniform distribution of the extracted point features. We then use RANSAC with an essential model test to remove outliers. When the inlier rate is too low (lower than 0.7), we use the homograph model to perform further RANSAC verification to prevent degradation of the essential matrix due to pure rotation. While processing point features, we extract and match the structural feature of the keyframe as described in Section 4.2.2. The keyframe is decided based on the following three criteria:

1.
There are many newly observed landmarks.

3.
The average parallax between point features matched between successive frames is large.
Back End. The back end is mainly based on the sliding window tightly-coupled method for state estimation. First of all, point management is carried out for the tracking results of the front-end point features. We represent landmark by the back-projection of the first observed point feature and its corresponding inverse depth. Then, according to the prior residual, point feature residuals, structural feature residuals, and IMU pre-integration residuals, a Maximum-a-Posteriori (MAP) estimation based on sliding window is carried out as described in detail in Section 4.3. Finally, landmarks and structural feature measurements with a large error need to be culled because of the change of the optimized states. If the inverse depth of the landmarks is negative, we need to cull the observations of the landmark and all corresponding point features. If the angle error between the corresponding structural feature observation of a frame in the sliding window and the maintained structural feature state is greater than the threshold (6 degrees), the corresponding structural feature observation of that frame would be culled.

IMU Pre-Integration
The raw measurements of the gyroscope and accelerometer can be obtained from IMU, which represents the measurements of the body frame. The gyroscope measurementω B and accelerometer measurementâ B are affected by gyroscope bias b B ω , accelerometer bias b B a , gyroscope noise n B ω , and accelerometer noise n B a . The measurements of time step t can be expressed as: where R( B t q W ) rotates the acceleration of the body in the inertial frame to the representation in the body frame, and g W is the gravity vector in the inertial frame. As in [4,16], we assume that the noises of the gyroscope and accelerometer are zero-mean Gaussian white noises, The noises of gyroscope bias and accelerometer bias are modeled as random walk noises; therefore, its derivative are Gaussian noises, The nominal-state kinematics in continuous time is shown in (4) : where ∆t is the time interval between i and j. From (4), we can see that the propagation of states is based on the position, velocity, and rotation at the time i. When the starting states change, subsequent states need to be re-propagated. Using an optimization-based approach, we need to re-propagate the IMU measurements during each iteration that will cause a change in the starting states, which is very time consuming. To avoid repeated re-propagation, we rewrite (4) as (5) and (6), and decompose the starting states from the integral, and make the integral being carried out in the local frame: As can be seen from (6), the IMU integration measurements in the time interval of [i, j] take B i as a reference frame, which is only related to bias and is called IMU pre-integration measurements for restricting the states of consecutive keyframes. When the bias error is slight in optimization, it will not be re-propagated, and IMU pre-integration measurement error is considered to be caused by a tiny perturbation of bias, so we use a first-order approximation to update it: b ω are the Jacobian matrices of IMU pre-integrated measurements with respect to bias, such as , which can be obtained through the error state transfer matrix.
In practice, IMU measurements need to be pre-integrated in discrete time, and sensor noises need to be considered. We propagate states at time l by mid-point integration of IMU measurements at the discrete time of l, l+1. In (8), δt l is the time interval between l and l+1: Finally, as manifested in (10), we can derive the error state kinematics in discrete time based on the mid-point integration method. We represent quaternion as minimum parameterization q ≈q ⊗ 1 We summarize (10) and use the simplified linear model to propagate the covariance according to the forward propagation method of covariance [21]. Moreover, we can also iteratively calculate the jacobian matrix of the pre-integration measurements for the error states and provide the matrix blocks to (7).
where Q l ∈ R 18×18 is the covariance matrix of the raw IMU noise and initial condition Σ i,i = 0 15×15 , J i,i = I 15×15 . Therefore, we can express the pre-integration noise of IMU in the time interval [i, j] of two consecutive keyframes as:

Structural Feature Detection and Matching
The essence of the structural feature extraction is to classify the line segments in the image and then calculate the three orthogonal dominant directions {d i | i=0,1,2 }. The orthogonal vanishing directions can be expressed as an orthogonal basis form, then we can obtain the representation of the structural feature in the camera frame C q VPs = q( d 0 d 1 d 2 3×3 ). We use the EDline algorithm [43] to extract line segments.
Before the initialization of the structural feature is finished, we extract VPs using a combined RANSAC and exhaustive method proposed by Lu et al. [34]. After the initialization is completed, the quick vanishing point extraction is carried out based on the known vertical direction. Suppose the angle errors between the extracted structural feature and the global state of the structural feature in each dominant direction are small (less than 6 degrees). In that case, it is considered to be a useful measurement of the structural feature.

Global state initialization for structural features
We maintain each structural feature measurement with a sliding window of the same size as the back-end sliding window. When the back-end initialization is complete, we begin the global state initialization of the structural features. Ideally, each structural feature measurement should be the same rotation W q VPs = W q B ⊗ B q C ⊗ C q VPs after being transformed to the world frame, but noise perturbation prevents them from overlapping. After the structural features in the sliding window are transformed to the world frame, spherical interpolation is carried out in turn to obtain WqVPs . If the angle error of each dominant direction between WqVPs and the measurement of structural features in the world frame is less than the angle threshold (less than 6 degrees), the measurement is considered as an inlier. If the inlier rate is greater than 0.8, it is considered a successful initialization. Otherwise, the structural feature measurement with the maximum error is discarded.

Structural feature extraction
After the global state initialization of structural features is completed, we propose a fast and straightforward method to extract structural feature based on the known vertical direction. The whole process is shown in Figure 2. First, as shown in Figure 2a, we project the dominant vertical direction into the image to obtain VP init and find the line segment set {l v } belonging to this dominant direction. We define the angle between the vanishing point and the line segment, as shown in Figure 3. When the angle is less than 6 degrees, we believe that the line segment belongs to this dominant direction. We can use the 3 × n matrix L formed by {l v } to construct the least square problem to get the first vanishing point VP 1 : L T VP 1 = 0 n×1 . It can be seen that the known vertical direction only provides us with an initial value so that we can distinguish the line segments. We consider a disturbance in W q C , which leads to a slight error in the vertical direction from the world frame to the camera frame. The orthogonal relation is satisfied between VP 1 and VP 2 . VP 2 must be on a circle formed by the intersection of the normal plane of VP 1 and the equivalent sphere, as shown in Figure 2b. Therefore, we take n degree as a step size and sample uniformly on the circle to generate the hypothesis of (360/n) VP 2 s. In our work, we set n as 0.5. Meanwhile, VP 3 is obtained by the cross product of VP 1 and VP 2 .
To quickly calculate the global optimal vanishing point hypotheses, we use the polar grid proposed in [34], which maps the intersection points obtained by all line segment pairs in the image to the polar grid, and calculates the corresponding weight in Figure 2c: The weight is defined as l , where the response is the proportion of the line segment in the image. This means that the vanishing point hypothesis generated by the more obvious line segment gets a higher weight. After constructing the polar grid, a gaussian smoothing filter is used to reduce the influence of measurement noise. Finally, we only need to map all VP 2 and VP 3 obtained from the vanishing point hypotheses into the polar grid, and select the vanishing point with the maximum corresponding weight in the polar grid as the final optimal vanishing point Figure 2d.

Structural feature matching
Our structural feature comprises three orthogonal vanishing points, so the matching problem of the structural feature is transformed into the matching problem of vanishing point direction. Since the vertical direction is known, we only have to match one dominant direction. First of all, for continuous keyframes I and J, we use W q C I and W q C J to convert their respective structural feature measurement to the world frame, respectively. When the included angle between dominant directions of I and J is less than 6 degrees, the structural features are considered to be well matched. In considering that if the pose drift is too large, the matching will fail. So when the above matching method fails, we use the relative rotation C I q C J of the IMU pre-integration between keyframes I and J to transform the structural feature measurement in I to J for rematching.

Tightly-Coupled Nonlinear Optimization
In this work, we first define the state variable X in the sliding window, which consists of body states, landmark inverse depths, and three orthogonal VPs in the Earth's inertial frame. In our work, we define three orthogonal VPs to make up a structural feature and explain in Section 4.2.2 how to initialize the global structural feature in the inertial frame. Since VPs are calculated based on the image line segments with noises that may reflect in the global structural feature, we optimize the global structure feature as a state W q VPs in the sliding window. For the consistency of our VIO system, we use quaternion to represent the orthogonal basis composed of three orthogonal VPs: where X is the state of the body frame corresponding to the keyframe in the sliding window, λ represents the inverse depth of each landmark, the inverse depth refers to the inverse depth corresponding to the back-projection of the starting point feature of the landmark point-track. We use F and M to denote keyframes and landmarks in the sliding window. In line with other quaternion state update strategies, we incrementally update the global structure feature state on the manifold. Then we denote the measurement Z as the observation on the state X . In (15), Z v l represents the point feature measurement of the mth landmark observed in the lth keyframe. If z v l,m is the starting point feature of the landmark point-track, we rewrite it as z v l s ,m . C l q VPs represents the orthogonal basis measurement of the VPs under the camera frame of the lth keyframe. B p,q represents the IMU pre-integration measurements between successive keyframes p and q: We apply Maximum-a-Posteriori (MAP) criterion to estimate the state of X with the measurement Z, i.e.,: With Bayes' rule, X |Z in (16) can be decomposed by the prior p(X ) and the likelihood p(Z |X ), i.e., Under the assumption of Gaussian distribution, we use the negative logarithm to represent the Maximum-a-Posteriori (MAP) problem, which transforms the problem into a minimum negative logarithm problem. In other words, we need to find the optimal estimation on the state X , which can minimize the Mahalanobis norm of all measurement residuals: where {r P , H P } represent the prior information obtained after the oldest keyframe is marginalized from the sliding window. r F is the point feature of the reprojection residual. r A r is the absolute rotation residual between the structural feature measurement of keyframe and the global structural feature state. At the same time, r R r is the relative rotation residual between the structural feature measurement of each keyframe. r I is the IMU pre-integration measurement residual between successive keyframes in the sliding window. ρ is the robust function used to suppress outliers. We use the factor graph in Figure 4 to illustrate this least square problem. In this work, we use the Levenberg-Marquardt (LM) algorithm to solve this nonlinear optimization problem in (18). The detailed residual terms and Jacobian matrixes are given in the following sections.  Figure 4. The factor graph represents our optimization problem. Circles represent the states of being optimized, and squares represent the factors as probability constraints between states.

Point Feature Measurement Factor
We define the measurement residual of the point feature as the reprojection error on the normalized image plane. The continuous observations of the same landmark form a point-track, and there is a reprojection error between the tracked point feature and the first observation of the landmark: We use the chain rule to derive the Jacobian matrix of the point feature factor, and the states to be optimized can be expressed as: X F = W p B ls , W q B ls , W p B l , W q B l , λ m . The Jacobian matrix is derived as follows: where

Structural Feature Measurement Factor
In order to make use of the orthogonality and parallelism of the structural feature, absolute rotation residual and relative rotation residual are defined, respectively, as the constraint factors of structure features. The absolute rotation residual is used to reduce accumulated rotation errors of our VIO system over a long time and refine the global VPs state. The relative rotation residuals are used to constrain the relative rotation between the keyframes where the structural feature can be observed.

Absolute Rotation Residual
where C l q VPs is the measurement of the structural feature corresponding to the lth keyframe in the camera frame. W q VPs is a global structural feature state. The structural feature observed in this keyframe can be represented in the world frame by the body rotation W q B l of this keyframe, forming an absolute rotation residual with W q VPs . The states to be optimized in the absolute rotation factor are X A r = [ W q B l , W q VPs ], and the corresponding Jacobian matrix is as follows: where (·) br represents the 3 × 3 submatrix block in the bottom right corner of matrix [·]. Quaternion is represented by three-dimensional error state δθ to prevent over-parameterization, when updating the quaternion state,q ⊗ 1 1 2 δθ is used for updating.

Relative Rotation Residual
where C l 1 q VPs , C l 2 q VPs are the measurements of the structural feature corresponding to l 1 th and l 2 th keyframes in the camera frame, respectively. The structural feature measurement C l 2 q VPs in the l 2 th keyframe can be represented in l 1 th keyframe by the relative rotation between keyframes. Similar to the absolute rotation residual, we define the rotation error between C l 2 q VPs after the rotation transformation and C l 1 q VPs as the relative rotation residual. For the relative rotation factor, the states associated with it to be optimized are X R r = [ W q B l 1 , W q B l 2 ], and the corresponding Jacobian matrix is:

IMU Measurement Factor
IMU pre-integration (Section 4.2.1) can be used to constrain the states between two consecutive keyframes in the sliding window. IMU measurement residual is defined as follows: where B p α B q , B p β B q and B p γ B q are the IMU pre-integration measurements, and ∆t is the time interval between the pth and qth consecutive keyframes. IMU measurements constrain all body states of two consecutive frames, so we need to optimize X I = [X p , X q ]. The Jacobian matrix corresponding to the IMU measurement residual can be obtained as:

Experimental Results
We evaluate the performance of the proposed Manhattan world based VIO system on the public benchmark datasets and also in the mobile phone based indoor field tests. The state-of-the-art optimization methods are compared in both tests. We analyze the computing complexity of the proposed method, the running time of each main module is compared on the mobile phone. All of our comparative experiments are carried out on a computer with an Intel Core i5-8250 CPU at 1.6GHz, and 16 GB RAM. The Android phone we use to record the running time is HUAWEI Mate30 equipped with a Kirin 990 5G processor and 8 GB memory.

Dataset Comparison
We evaluate our VIO system on two public datasets: EuRoC dataset [44] and TUM-VI dataset [45], and we compare the performance of our system with OKVIS [32], VINS-Mono [16], PL-VIO [19], and ORB-SLAM3 [33], respectively. These are all optimization-based systems, among which OKVIS, VINS-Mono, PL-VIO are sliding window optimization-based systems, and ORB-SLAM3 is a SLAM system based on local map tracking and map reuse. OKVIS is the first VIO system to combine a sliding window approach with a tightly-coupled optimization approach. Based on the sliding window optimization, VINS-Mono adds loop closure optimization of 4 DoF and robust initialization, which results in a complete VI-SLAM system. PL-VIO adds line features based on VINS-Mono, which improves the performance of the point feature-based approach in texture-less regions. ORB-SLAM3 is a complete SLAM system containing visual, visual-inertial, and multi-map SLAM; the use of a local map for optimization can help to achieve extremely high accuracy on public datasets within centimeters.

EuRoC Dataset
EuRoC dataset [44] is captured by a micro aerial vehicle (MAV) in two scenes. It contains 752 × 480 resolution stereo images from global shutter cameras at 20 fps and synchronized IMU measurements at 200 Hz. The ground truth of the entire trajectory is obtained by using the VICON motion capture system. In this work, we use IMU and images from the left camera as inputs for each system.
Unlike point features and line features, structural features encode global rotation information. Regardless of the running time of the system, we can always observe the same structural feature, which can effectively reduce the accumulated rotation error of the system, as well as decrease the translation error accordingly. As shown in Figure 5, Figure 5a is processed by our system in MH_03_medium sequence, while Figure 5b is processed in V1_02_medium sequence. Red, green, and blue lines represent the three orthogonal orientations in the scene, respectively. Meanwhile, × is used to indicate the position of the vanishing point corresponding to the dominant direction of red in the image. Machine Hall in the EuRoC dataset is a scene that fully conforms to the Manhattan world hypothesis. Simultaneously, the VICON Room has some interference in other directions besides the three orthogonal directions of the room. However, our initial structural feature method can help us filter out other directions and find the three VPs corresponding to the room structure. It can be seen that the structural features are global and the MAV can observe the structural features in different places so that the accumulated errors are eliminated.   Table 1, we give the root mean square error (RMSE) of translation and rotation between the estimated trajectories and ground truth, and corresponding line charts, as shown in Figure 6.  It can be seen from Table 1 and Figure 6 that ORB-SLAM3 achieves the best performance in the EuRoC dataset, which is superior in accuracy to other systems, except for V2_01_easy sequence. The reason is that it utilizes a local map for reprojection residual constraint, which can maximize the use of historical information without loop closure. This strategy generates high map consistency when tracking point features stably. However, in addition to point feature extraction, ORB-SLAM3 also needs to calculate point feature descriptors. The point feature matching in local map tracking is O(n 2 ), which presents a challenge to pose estimation on mobile phones in real-time.
Compared with the methods based on sliding window optimization such as OKVIS, VINS-Mono, and PL-VIO, we significantly improved the trajectory accuracy by utilizing the structural features. Except for MH_05_difficult, V1_02_medium, and V2_02_medium, the translation error and rotation error of other sequences are all lower than the existing similar systems. At the beginning of the V2_02_medium sequence, MAV is in a scene with a lot of outlier structure interferences, making it difficult to initialize structural features and form global constraints for an extended period. This is the main reason why the trajectory error of the proposed system in this sequence increases significantly.
We also compare the trajectory errors versus time of several sequences to show the advantages of our proposed system with structural features. Figure 7a-c represents the performance of our system in the three sequences of MH_02_easy, MH_04_difficult, and V2_01_easy, respectively. From top to bottom are the estimated trajectory, the rotation error versus time, and the translation error versus time, respectively. The rotation error of our proposed system is much smaller than that of other optimization-based methods throughout the trajectory, especially in the MH_02_easy sequence where the MAV moves stably, without too much fast motion and rotation.

TUM-VI Dataset
The TUM-VI dataset [45] is composed of 28 sequences collected in 6 different scenes. In order to verify the algorithmic performance of our proposed VIO system, we used the corridor scene, which satisfies our assumptions of the Manhattan world. The TUM-VI dataset provides ground truth from a motion capture system at the start and end of the sequences. We chose 512 × 512 resolution images and used the default parameters provided by the TUM-VI dataset to evaluate other open-sourced systems. Our parameters are consistent with those of VINS-Mono.
This scene has texture-less regions, illumination changes, and other factors that affect the precision of pose estimation. As shown in Figure 8, when the camera is in pure rotation motion Figure 8a   Consistent with the analysis method of the EuRoC dataset in Section 5.1.1, RMSE of translation and rotation after the estimated trajectory aligned with the ground truth is given, respectively, as shown in Table 2 and Figure 9. It is important to note that the aligned parts are only the start and end of the sequences. It is mentioned in [33] that pose RMSE calculated in this way is about half of the accumulated drift.  As shown in Table 2 and Figure 9, our proposed VIO system is significantly better than other open-source systems based on slide-window optimization. In addition to the corridor1 and corridor5 sequences, our translation performance is superior to ORB-SLAM3 without loop closure among the other sequences. In the corridor sequence, except for the three dominant directions of the corridor itself, there are few outlier line segments that belong to other directions in the image, which leaves us less noise of structural feature measurement. Some scenes with texture-less regions in the corridor pose a significant challenge to the point features-based system, but they have little impact on systems based on structural features. At the same time, it can be seen from the pose errors of PL-VIO that the addition of line feature residuals to the optimization items may bring adverse effects.
In Figure 10, we show the result of each system in the corridor1 sequence. We use our proposed system as a reference trajectory for alignment. As shown in Figure 10a, except for ORB-SLAM3 and our proposed system, the other methods have an apparent Z-axis drift. From Figure 10b, it can be seen that the Z-axis drift of PL-VIO and VINS-Mono significantly increase from 220s onwards. We check the motion state of the camera around this time and find the camera moves rapidly in a texture-less room. As shown in Figure 11, most point features are lost in VINS-Mono and our system due to darkness and image blur at this moment. However, our structural features are well extracted. Our system can still measure and correct rotation according to the structural features in this scene, thus reducing the Z-axis drift.
(a) The estimated trajectory in corridor1 sequence (b) Z-axis translation versus time in corridor1 sequence Figure 10. The performance of each system in the corridor1 sequence is shown in the figure. We give the estimated trajectory and side view of each system after alignment, and we give the curve of Z-axis translation versus time.

Field Test
We also carried out experiments on a mobile phone to compare the performance of various systems. We evaluated the performance of each system by walking back and forth along the same straight line in a 48-meter corridor scene. Because of the limitation of not having expensive motion capture devices, we strictly tried our best to restrict the position and height of the phone to make sure that the phone was moving along a straight line and at a steady height (along the line on the floor). For real-time performance, we used 480 × 480 resolution images to conduct experiments on the phone. Simultaneously, we collected 640 × 640 resolution images from the mobile phone to do offline experiments to compare the system performances. As shown in Figure 12a, our movement in the corridor is different from that in the TUM-VI dataset. We turn more quickly, which makes images very blurred. Figure 12b shows how our system works on a mobile phone in real-time. (a) Raw image (b) Proposed system in phone Figure 12. Subfigure (a) are images we collect in the corridor using an Android phone. It can be seen that texture-less area and the image blur appear in our experiment. Subfigure (b) shows the real-time operation of our system on the mobile phone, where the top of the image is the estimated trajectory. As can be seen, after we walk back and forth along the corridor three times, the trajectory is still in a straight line.
For a more intuitive comparison, we compare our trajectory with those estimated trajectories by other systems in Figure 13a. It can be seen in Figure 13b that after walking back and forth twice, the cumulated angle drift of our proposed system is significantly smaller than those of other methods. Moreover, the reduction of angle drift also makes the Z-axis drift of our system much smaller than those of other systems. In our corridor sequence, ORB-SLAM3 has the most significant drift. Because point features are wholly lost when we turn quickly, the continuity of the local map is broken. PL-VIO is second only to our proposed system, indicating that line features create useful constraints in our environment.  We also record the elapsed time of each major module while the phone is running, as shown in Figure  14. As can be seen from the cyan curve, even using EDline [43], the fastest line segment detector, it still takes 18 ms on average to extract on the mobile phone. At the same time, the red curve represents the elapsed time of structure feature extraction. It can be seen that when the global structural feature is not initialized, the method in [34] is used to extract the vanishing points in three directions, and the elapsed time is about 36 ms. After the successful initialization, the proposed structural feature extraction method can reach below 6 ms.

Conclusions and Future works
In this work, we propose a novel tightly-coupled monocular vision-inertial odometry, which is based on the sliding window optimization method. The property of the structural feature is incorporated into the residual term for tightly-coupled optimization. The method is effective in the Manhattan world where the structural feature as a global measurement can constrain the rotation error, and thus improve the translation accuracy of the system. When considering the noise in extracting the structural feature, we take the global structure feature as the state variable, and update it on the rotation manifold. The performance of our system has been tested on two benchmark datasets and the field tests. Both of the results show that the proposed system is superior to the listed mainstream open-source systems based on sliding window optimization. We achieve higher precision in some sequences than the ORB-SLAM3 without loop closure, especially in the situations when the images are texture-less, dark, and blurry. In order to achieve low computation complexity, we further propose to quickly extract structural features based on the known vertical dominant direction. With the improvements, the VIO system we proposed can run in real-time on a mobile phone and the whole extraction process can be completed within 6 ms on the tested mobile phone.
With respect to future work, it has been noticed that there are cases of the Atlanta world (AW) [46], which consists of multiple Manhattan worlds with the same vertical dominant direction. How to optimize rotation error with multiple structural features in the Atlanta world will be left to future research.