
This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

In this paper, measurements from a monocular vision system are fused with inertial/magnetic measurements from an Inertial Measurement Unit (IMU) rigidly connected to the camera. Two Extended Kalman filters (EKFs) were developed to estimate the pose of the IMU/camera sensor moving relative to a rigid scene (ego-motion), based on a set of fiducials. The two filters were identical in terms of the state equation and the measurement equations of the inertial/magnetic sensors. The DLT-based EKF exploited visual estimates of the ego-motion obtained with a variant of the Direct Linear Transformation (DLT) method; the error-driven EKF exploited pseudo-measurements based on the projection errors from measured two-dimensional point features to the corresponding three-dimensional fiducials. The two filters were analyzed off-line in different experimental conditions and compared to a purely IMU-based EKF used for estimating the orientation of the IMU/camera sensor. The DLT-based EKF was more accurate than the error-driven EKF, less robust against loss of visual features, and equivalent in terms of computational complexity. Orientation root mean square errors (RMSEs) of 1° (1.5°) and position RMSEs of 3.5 mm (10 mm) were achieved in our experiments by the DLT-based EKF (error-driven EKF); by contrast, orientation RMSEs of 1.6° were achieved by the purely IMU-based EKF.

Sensor fusion methods combine data from disparate sources of information in a way that should ideally give better performance than that achieved when each source of information is used alone. The design of systems based on sensor fusion methods requires the availability of complementary sensors in order that the disadvantages of each sensor are overcome by the advantages of the others. An interesting application niche for sensor fusion—the one dealt with in this paper—is motion tracking. None of the several existing sensor technologies, taken alone, can meet the desired performance specifications, especially when motion is to be tracked without restrictions in space and time [

Vision-based tracking systems can accurately track the relative motion between the camera and objects within its field of view (FOV) by measuring the frame-by-frame displacements of selected features, such as points or lines [

Inertial-based tracking systems integrate Inertial Measurement Units (IMUs) that incorporate accelerometers and gyroscopes for measuring, at high sampling rates, the translational accelerations and angular velocities of the objects they are affixed to; this feature makes them ideally suited to capture fast motion dynamics. Being internally referenced and immune to shadowing and occlusions, inertial sensors can track body motion, in principle, without restrictions in space. Unfortunately, the measurements of linear acceleration and angular velocity are affected by the time-varying bias and wideband measurement noise of the inertial sensors. Accurate estimates of body orientation in three-dimensional (3D) space can be produced using quite complex filtering algorithms, sometimes with the addition of magnetic sensors that sense the Earth's magnetic field to help produce drift-free heading estimates [

Fusing visual and inertial/magnetic measurements can therefore yield, in principle, a tracking system for pose estimation in all six DOFs that combines the long-term stability and accuracy of a vision-based tracking system with the short-term robustness and promptness of response typical of an INS [

In this paper the problem of estimating the ego-motion of a hand-held IMU-camera system is addressed. The presented development stems from our ongoing research on tracking position and orientation of human body segments for applications in telerehabilitation. While orientation tracking can be successfully performed using EKF-based sensor fusion methods based on inertial/magnetic measurements [

A tightly coupled approach was adopted in the design of a system in which pose estimates were derived from observations of fiducials. Two EKF-based sensor fusion methods were developed that build somewhat upon the approaches investigated in [

The main contributions of this paper are: (a) the comparative analysis and performance evaluation of two different forms of visual aiding; the study was also extended to the case when visual or inertial/magnetic measurements were used alone; (b) the investigation of the role played by magnetic sensors and the related measurements of the Earth's magnetic field in heading stabilization, never attempted before in research on visuo-inertial integration (to the best of our knowledge). This paper is organized as follows: Section 2 describes our experimental setup and provides a detailed mathematical analysis of the filtering methods. The main results are presented in Section 3 and discussed in Section 4. Finally, concluding remarks and perspectives for future work are offered in Section 5.

We introduce the reference frames that are used in the experimental setup shown in

Navigation frame {

Body frame {

Camera frame {

Image frame {

The following notation is used to express the relation between two frames, for instance the camera frame {c} and the body frame {b}: the quaternion q^{cb} encodes the relative orientation, while ^{c}t denotes the relative translation expressed in {c}.

The IMU is an MTx orientation tracker (Xsens Technologies B.V., Enschede, The Netherlands) equipped with one tri-axial accelerometer, one tri-axial gyroscope and one tri-axial magnetic sensor, with mutually orthogonal sensitive axes; the raw sensory data are delivered to the host computer at 100 Hz via a USB port. Both the camera and the IMU are electrically synchronized to a Vicon 460 optical motion analysis system equipped with six infrared (IR) cameras running at 100 Hz. The 3D coordinates of eight IR-reflective markers are acquired. Four markers (diameter: 15 mm) are located at the corners of the plastic box housing the sensor unit assembly, and four markers of the same diameter are located on the chessboard plane, where they are used for capturing the 3D coordinates of the four extreme black-and-white corners of the chessboard. Since the size of the chessboard printed on an A3 sheet of paper is known, the 3D coordinates resolved in {

The ancillary laboratory frame where the 3D positions of the markers are given is used to compute the transformation from {b} to {c}, namely the quaternion q^{cb} and the translation vector ^{c}t.

The purely IMU-based method for determining the IMU orientation relative to {n} operates on the rotational state vector x_{R,k}, which coincides with the orientation quaternion q^{nb}_{k}.

The rotational state transition matrix Φ_{R,k−1} is related to the angular velocity ω_{k} measured by the gyroscopes, where T_{k−1} = t_{k} − t_{k−1} is the sampling interval.

The process noise vector w_{R,k−1} is related to the noise in the angular velocity measurements as follows:
where v^{g}_{k−1} is the gyroscope measurement noise, which is assumed white Gaussian with zero mean and covariance matrix Σ_{n}.
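As a concrete illustration of the state equation, the first-order quaternion propagation can be sketched as follows. This is a minimal Python sketch of the standard discretization Φ_R ≈ I + (T/2)·Ω(ω), assuming a scalar-last quaternion convention q = [q1 q2 q3 q4]; function and variable names are ours, not the paper's, and the paper's exact discretization may differ:

```python
import numpy as np

def omega_matrix(w):
    """4x4 matrix Omega(w) such that q_dot = 0.5 * Omega(w) @ q
    for scalar-last quaternions q = [q1 q2 q3 q4]."""
    wx, wy, wz = w
    return np.array([
        [0.0,  wz, -wy,  wx],
        [-wz, 0.0,  wx,  wy],
        [ wy, -wx, 0.0,  wz],
        [-wx, -wy, -wz, 0.0],
    ])

def propagate_quaternion(q, w, dt):
    """One time-update step: q_k = Phi_R @ q_{k-1},
    with the first-order transition matrix Phi_R = I + (dt/2) * Omega(w)."""
    phi = np.eye(4) + 0.5 * dt * omega_matrix(w)
    q_new = phi @ q
    return q_new / np.linalg.norm(q_new)  # re-normalize to keep a unit quaternion
```

Re-normalization after each step compensates for the small norm drift the first-order approximation introduces.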

When the tracked motions are relatively slow, as assumed in this paper, the sensed acceleration is simply taken as the projection of the gravity vector ^{n}g into the body frame.

Since no heading information is available when only the gravity vector is sensed, the measurement of the Earth's magnetic field ^{n}h is used as well. The predicted measurements z^{h}_{k} and z^{a}_{k} are obtained by rotating the navigation-frame reference vectors ^{n}h and ^{n}g into the body frame through the orientation quaternion (q^{−1} denotes the quaternion inverse).
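Under the slow-motion assumption, the predicted accelerometer and magnetometer readings are simply the navigation-frame reference vectors rotated into the body frame. A minimal Python sketch follows; the reference field H_N with a 60° dip angle is an illustrative assumption, not the paper's calibrated value, and the names are ours:

```python
import numpy as np

def quat_to_rotmat(q):
    """Rotation matrix of a unit, scalar-last quaternion q = [q1 q2 q3 q4],
    taken here as rotating body coordinates into the navigation frame."""
    x, y, z, w = q
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - z*w),     2*(x*z + y*w)],
        [2*(x*y + z*w),     1 - 2*(x*x + z*z), 2*(y*z - x*w)],
        [2*(x*z - y*w),     2*(y*z + x*w),     1 - 2*(x*x + y*y)],
    ])

G_N = np.array([0.0, 0.0, 1.0])                  # gravity direction in {n}, in g units
DIP = np.deg2rad(60.0)                           # illustrative magnetic dip angle
H_N = np.array([np.cos(DIP), 0.0, np.sin(DIP)])  # Earth field in {n}, unit norm

def predict_inertial_magnetic(q_nb):
    """Predicted body-frame accelerometer and magnetometer measurements."""
    R = quat_to_rotmat(q_nb)       # body -> navigation
    return R.T @ G_N, R.T @ H_N    # rotate reference vectors into the body frame
```

The EKF residuals are then the differences between the sensed (normalized) vectors and these predictions.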

The EKF linearization requires the computation of the Jacobian matrices of the measurement

The operator acts on the components of the quaternion q = [q_{1} q_{2} q_{3} q_{4}]^{T}.

The measurement noise covariance matrix is given by:
The norm of the sensed magnetic field h_{norm} and the magnetic dip angle m_{dip}, namely the angle that is formed between the sensed magnetic field and the sensed gravity acceleration, are compared to their nominal values using suitably chosen threshold values ε_{dip} and ε_{h}; likewise, the norm of the sensed acceleration a_{norm} is compared with the value of gravity (1 g = 9.81 m/s^{2}) using the threshold values ε_{norm} and ε_{g}.
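These validity checks can be sketched as follows in Python; the threshold values and the function signature are illustrative placeholders, not the paper's tuned parameters:

```python
import numpy as np

def validate_vector_measurements(acc, mag, h_norm_ref, dip_ref_deg,
                                 eps_g=0.05, eps_h=0.05, eps_dip_deg=5.0):
    """Gate accelerometer/magnetometer samples before the vector-matching
    update. acc is in g units; mag in the same units as h_norm_ref."""
    acc_ok = abs(np.linalg.norm(acc) - 1.0) < eps_g         # |a| should be ~1 g
    mag_ok = abs(np.linalg.norm(mag) - h_norm_ref) < eps_h  # field magnitude check
    # dip angle: angle between the sensed field and the sensed gravity
    cos_dip = np.dot(acc, mag) / (np.linalg.norm(acc) * np.linalg.norm(mag))
    dip = np.degrees(np.arccos(np.clip(cos_dip, -1.0, 1.0)))
    dip_ok = abs(dip - dip_ref_deg) < eps_dip_deg
    return acc_ok, mag_ok and dip_ok
```

Samples failing a check would simply be excluded from the corresponding measurement update (e.g., during high dynamics or magnetic disturbances).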

We assume that the visual features are the projections into the image plane of the N_{f} fiducials.

The image point features are fed to a least-squares estimation algorithm to calculate the transformation from {
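The least-squares estimation at the heart of the DLT can be sketched, for the planar chessboard case, as a homography fit from point correspondences. This is the generic textbook formulation (SVD on the stacked design matrix), not necessarily the exact DLT variant used in the paper:

```python
import numpy as np

def dlt_homography(pts_plane, pts_image):
    """Estimate the 3x3 homography H mapping planar object points (X, Y)
    to image points (u, v) by direct linear transformation: the null vector
    of the stacked 2n x 9 design matrix, found via SVD. Needs at least
    4 correspondences with no 3 of them collinear."""
    rows = []
    for (X, Y), (u, v) in zip(pts_plane, pts_image):
        rows.append([-X, -Y, -1, 0, 0, 0, u * X, u * Y, u])
        rows.append([0, 0, 0, -X, -Y, -1, v * X, v * Y, v])
    A = np.asarray(rows, float)
    _, _, vt = np.linalg.svd(A)
    H = vt[-1].reshape(3, 3)     # right singular vector of the smallest singular value
    return H / H[2, 2]           # fix the projective scale
```

Given the camera intrinsics, the chessboard pose (rotation and translation) can then be factored out of H in the usual way.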

The EKF-based sensor fusion method of body pose estimation requires that the rotational state vector x_{R} = q^{nb} be augmented with a translational component.

In our approach the accelerometers are used for stabilizing the IMU-camera attitude with respect to gravity (roll and pitch angles), as prescribed by the measurement model z^{a}; the associated process noise covariance is I_{3} · ^{a}σ^{2}, where the variance ^{a}σ^{2} is also called the strength of the driving noise [

The state transition matrix can be written as:

The noise covariance matrix of the process noise w_{T,k−1} can be written as:

The covariance matrix of the process noise

Two different sensor fusion strategies are considered to account for how to add the visual measurements

A common element to both methods is the approach to visual feature tracking. While the purely vision-based method of pose estimation relies on the popular frame-to-frame KLT tracker, visual feature tracking in either the DLT-based EKF or the error-driven EKF exploits the predicted pose: f_{x} and f_{y} denote the focal lengths, while u_{c} and v_{c} are the coordinates of the principal point (image centre) relative to the origin of the frame {
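With those intrinsic parameters, predicting where a fiducial should appear in the image reduces to a plain pinhole projection. A minimal sketch (lens distortion neglected; symbol names are ours):

```python
import numpy as np

def project_points(p_c, fx, fy, uc, vc):
    """Pinhole projection of 3D points expressed in the camera frame {c}.
    p_c: (n, 3) array of camera-frame coordinates; returns (n, 2) pixels."""
    p_c = np.atleast_2d(np.asarray(p_c, float))
    u = fx * p_c[:, 0] / p_c[:, 2] + uc   # perspective division, then
    v = fy * p_c[:, 1] / p_c[:, 2] + vc   # scaling and principal-point offset
    return np.column_stack([u, v])
```

Searching for each feature in a small window around its predicted projection is what lets the fusion filters avoid a full frame-to-frame KLT pass.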

Feature points

For either method, the overall linearized measurement model can be written in the following form:

The measurement noise covariance matrix is written as follows:

The size of the matrices H_{k} and R_{k}

A multi-rate filtering strategy is needed to deal with the different sampling rates of the IMU and camera measurements: the IMU measurement process runs at a rate of 100 Hz, while the camera measurement process is slower, running at a rate of approximately 30 fps.
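The multi-rate strategy can be sketched as a dispatch loop in which the time update (and the inertial/magnetic measurement update) runs at every 100 Hz IMU sample, while the visual update fires only when a camera frame has arrived. A Python sketch with a stub filter; the class and method names are ours, not the paper's:

```python
class MultiRateEKF:
    """Stub filter that only counts how often each stage runs."""
    def __init__(self):
        self.n_imu_steps = 0
        self.n_vis_updates = 0

    def step(self, t, frame=None):
        self.n_imu_steps += 1        # time update + inertial/magnetic update (100 Hz)
        if frame is not None:
            self.n_vis_updates += 1  # DLT-based or error-driven visual update (~30 fps)

def run_multirate(imu_times, cam_times, ekf):
    """Dispatch events in timestamp order, consuming at most one camera
    frame per IMU sample (valid since the camera is the slower stream)."""
    cam_iter = iter(sorted(cam_times))
    next_cam = next(cam_iter, None)
    for t in sorted(imu_times):
        frame = None
        if next_cam is not None and next_cam <= t:
            frame, next_cam = next_cam, next(cam_iter, None)
        ekf.step(t, frame)
```

With a slightly irregular camera clock, as in the timestamp figure, the same loop naturally yields a variable number of IMU-only steps between successive visual updates.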

The DLT method reviewed in Section 2.2 provides, for each incoming image frame, the estimate of the chessboard pose in {

We recall that q^{cb} and ^{c}t are known from solving the IMU-camera relative pose calibration problem, as already described above.

The visual observation matrix can be simply written as:

The measurement noise covariance matrix is:

In principle, the covariance matrix cov(

The feature projection errors at time t_{k}

The measurement equation can be written as:

Since the dependence of the measurements

The Jacobian matrix related to the translational part of the state vector can be written:

The visual measurement noise covariance matrix R_{vis} is assumed diagonal, with the same standard deviation ^{vis}σ for all visual measurement channels.
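The construction of the projection-error pseudo-measurements can be sketched as follows. Here the predicted pose is parameterized by a camera-from-navigation rotation matrix R_cn and translation t_cn, which is our notation for illustration, not the paper's:

```python
import numpy as np

def reprojection_residuals(R_cn, t_cn, fiducials_n, measured_uv, fx, fy, uc, vc):
    """Stacked pseudo-measurement vector: differences between measured image
    features and the projections of the corresponding 3D fiducials under the
    predicted pose (pinhole model, no distortion)."""
    p_c = fiducials_n @ R_cn.T + t_cn           # fiducials in the camera frame
    pred_u = fx * p_c[:, 0] / p_c[:, 2] + uc
    pred_v = fy * p_c[:, 1] / p_c[:, 2] + vc
    pred = np.column_stack([pred_u, pred_v])
    return (measured_uv - pred).ravel()         # ~0 when the predicted pose is right
```

In the error-driven EKF these residuals play the role of measurements that the filter drives toward zero, which is why each tracked feature contributes two measurement channels.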

Eight tracking experiments, each lasting 60 s, were conducted by moving the sensor unit assembly by hand in front of the chessboard. An additional tracking experiment was performed by moving the sensor unit assembly 1° at a time.

The IMU sensors were calibrated using the in-field calibration techniques described in [

The following filtering methods were tested: the purely IMU-based method of orientation estimation (Section 2.1); the purely vision-based method of pose estimation (Section 2.2); and the two sensor fusion methods named DLT-based EKF (Section 2.3.1) and error-driven EKF (Section 2.3.2). In all cases no gating technique was implemented in the EKFs to detect outliers due to mismatched features in consecutive image frames. The sensor data acquired during the tracking experiments were analyzed for the off-line validation study in five different filtering scenarios: (a) inertial/magnetic sensor measurements from the IMU were ignored by the filters; (b) inertial/magnetic sensor measurements from the IMU were assimilated by the filters; (c) the magnetic sensor measurements from the IMU were ignored by the filters; (d) gyro bias was not compensated by bias capture, in the situation when magnetic sensor measurements from the IMU were ignored by the filters; (e) a mechanism of intentional damage to the integrity of the visual information was implemented and inertial/magnetic sensor measurements were assimilated by the filters. The rationale behind (c) was to stress the importance of magnetic sensor measurements for heading stabilization. The rationale behind (d) was to test the capability of the proposed sensor fusion methods to accommodate the slight imperfections that are typical of inertial sensors. Finally, the rationale behind (e) was to assess the tracking robustness of the sensor fusion methods against visual gaps. The mechanism for degrading the visual information made available to the DLT-based EKF and the error-driven EKF was implemented as follows: for each incoming image frame, a random sample of visual features with size randomly selected from 0 (^{vis}

The reference data were interpolated using cubic splines to the time instants when inertial/magnetic and visual measurements were made. Standard conversion formulae were then used to convert the reference and estimated quaternions in the corresponding Euler angles. The performance assessment was based on the root mean square errors (RMSEs) of the estimated roll, pitch and yaw angles. Moreover, the error quaternion
was converted into the orientation error angle Δθ = 2 cos^{−1}(Δq_{4}), which was used to compute the orientation RMSE. The RMSE of the estimated position was computed separately for each coordinate axis (RMSE_{X}, RMSE_{Y}, RMSE_{Z}).
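The per-sample orientation error, i.e., the rotation angle of the error quaternion θ = 2·arccos(|Δq4|), can be sketched as follows (Python, scalar-last quaternions; names are ours):

```python
import numpy as np

def quat_mul(p, q):
    """Hamilton product for scalar-last quaternions [x, y, z, w]."""
    px, py, pz, pw = p
    qx, qy, qz, qw = q
    return np.array([
        pw * qx + px * qw + py * qz - pz * qy,
        pw * qy - px * qz + py * qw + pz * qx,
        pw * qz + px * qy - py * qx + pz * qw,
        pw * qw - px * qx - py * qy - pz * qz,
    ])

def orientation_error_deg(q_est, q_ref):
    """Angle (deg) of the error quaternion dq = q_ref^{-1} * q_est:
    theta = 2 * arccos(|dq4|); |.| makes the result sign-independent."""
    q_ref_inv = np.array([-q_ref[0], -q_ref[1], -q_ref[2], q_ref[3]])  # unit inverse
    dq = quat_mul(q_ref_inv, q_est)
    return np.degrees(2.0 * np.arccos(np.clip(abs(dq[3]), 0.0, 1.0)))
```

Averaging the squared angles over a trial and taking the square root then gives the orientation RMSE reported in the tables.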

The filtering algorithms were implemented using Matlab; the experimental validation was carried out in off-line conditions. Since the 10-dimensional state vector was the same for either the DLT-based EKF or the error-driven EKF, the operations involved in the prediction stage were exactly the same, which took (approximately) 1 ms in the current implementation (standard laptop, 2.2 GHz clock frequency). Another common element was the vector matching process for the sensed acceleration and magnetic field vectors, which required 1 ms, while the computation of the inertial/magnetic Jacobian matrix took approximately 1 ms. The difference between the two EKFs was in the visual measurement equations: in the DLT-based EKF 10 measurement channels were deployed, in contrast with the 24 measurement channels needed by the error-driven EKF. The computation of the visual features required 14 ms in both filters, which included state propagation and prediction. In the DLT-based EKF, the DLT method was implemented at each iteration cycle, followed by the update of the time-varying measurement noise covariance matrix in

The RMSE values of the eight tracking experiments are summarized as mean value ± SD in

The representative plots in

The plot of

In this paper positioning is not attempted using inertial/magnetic sensors alone, as is done, e.g., in pedestrian navigation systems, where the IMU is attached to the foot. The exploitation of biomechanical constraints concerning the dynamics of human walking indeed mitigates the error growth incurred in the double-time integration of gravity-compensated acceleration components: for instance, the cubic-time growth of positioning errors can be reduced to a linear-time growth by implementing zero-velocity updates (ZUPT) at the times when the foot is detected to be stationary during walking [

However, the informative contribution of the inertial/magnetic or just the inertial measurements to the DLT-based EKF is not relevant to boost the accuracy of pose estimation—for slow tracked motions, the DLT-based visual measurements are sufficient to obtain very accurate pose estimates—see

In contrast to the DLT-based EKF, the error-driven EKF benefits greatly from the integration of inertial/magnetic measurements, or, to a lesser extent, of inertial measurements alone, without which it fails in the experimental trials of this paper. The error-driven EKF performs better, or even much better, than the purely IMU-based method in terms of attitude estimation accuracy, while also yielding quite accurate estimates of position. However, the error-driven EKF exhibits some problems, especially when magnetic measurements are not incorporated in the filtering process, which the DLT-based EKF does not show. Our explanation is that providing the sensor fusion method with direct measurements of the quaternion and translation vector of interest is much more informative than relying on visual projection errors, as the error-driven EKF does.

The value of incorporating the magnetic sensor measurements in the sensor fusion process is assessed by analyzing the data reported in

The reason is that the error-driven EKF may suffer from gross mismatches between estimated and reference poses. In practice, wrong state vector estimates are produced, which do not, however, preclude the system from successfully tracking the image point features. This is a good instance of the problem of ambiguous/multiple solutions to the pose estimation problem. As discussed in [

The visual sabotage implemented in this paper is not as extreme as permanent losses of image point features would be, such as those occurring in case of occlusions, or when the ego-motion is so fast that part or all of the chessboard area escapes the camera FOV. We simply reduce, at random, the number and location of the coplanar feature points, sometimes even below the minimum number theoretically needed for pose estimation. The data reported in

The main problem experienced with regard to loss of vision is as follows: since positioning relies on vision alone, the position estimates tend to diverge quickly when the system is, visually speaking, blind. While the orientation estimates are continuously and accurately provided by the inertial/magnetic sensors, it is this diverging trend that explains why the projection errors may rapidly grow to an extent that makes it impossible for the system to maintain the track on the chosen fiducial markers. To make matters worse, we decided not to implement any mechanism for monitoring filter divergence based on the number of registered visual features, or any re-initialization procedure in case of divergence [

In this paper two approaches to fusing visual and inertial/magnetic measurements have been considered, and correspondingly two EKFs have been developed to track the ego-motion in all six DOFs. They were analyzed with the aim of elucidating how, and to what extent, the visual and inertial/magnetic measurements cooperate in ego-motion estimation. The two filters perform differently in terms of accuracy and robustness: in the DLT-based EKF the visual measurements seem to have a higher informational content compared to the inertial/magnetic measurements, and the overall system shows remarkably good accuracy in estimating all six DOFs; conversely, in the error-driven EKF the inertial/magnetic measurements are fundamental for the correct operation of the filter, and the overall system can thus gain in robustness against loss of visual information, at the expense of accuracy in estimating all six DOFs. Moreover, the sensor fusion strategy is interesting in other respects: on the one hand, the DLT-based EKF takes advantage of the inertial/magnetic measurements since visual features can be tracked without computationally expensive tools such as the KLT; on the other hand, the error-driven EKF performs positioning only because of its capability to exploit the projection errors of the image point features.

That magnetic sensor measurements can be helpful to stabilize heading is highlighted in our results, although this statement cannot be overemphasized given the difficulties of motion tracking in magnetically perturbed environments [

In conclusion, in this paper we proposed two different models of visual measurements to be used within Kalman-based filters that also incorporate inertial/magnetic measurements for estimating the ego-motion of a hand-held IMU/camera sensor unit. The two proposed EKFs were off-line analyzed in different experimental conditions: the DLT-based EKF was more accurate than the error-driven EKF, less robust against loss of visual features, and equivalent in terms of computational complexity. Orientation RMSEs of 1° (1.5°) and position RMSEs of 3.5 mm (10 mm) were achieved in our experiments by the DLT-based EKF (error-driven EKF). By contrast, the purely IMU-based EKF achieved orientation RMSEs of 1.6°.

This work has been supported in part by funds from the Italian Ministry of Education and Research.

The camera is rigidly attached to the same support where the IMU sensor case is also attached. The axes of the three frames {

The nine feature points constructed during the initialization stage are shown. Red squares: the four feature points manually selected by the user; green circles: the nine feature points constructed during the initialization stage.

Block diagrams of the two Kalman-filter-based methods of sensor fusion.

Timestamps of camera (blue) and IMU (red) samples. The number of IMU samples between successive image frames is slightly variable due to the irregular sampling rate of the camera.

For one of the eight tracking experiments, plots of reference position and Euler angles of the body pose, together with plots of the position and orientation RMSE. For the sake of visualization, the RMSE values are computed over moving average windows of duration 5 s (DLT-based EKF, in red; error-driven EKF, in black).

Reference and filtered position and Euler angles that are produced using the error-driven EKF (pure vision). Reference data in blue; filtered data in red.

Error plots of the position and orientation RMSEs

Parameter tuning.

_{g} | 0.40
[^{α}σ^{2}] | 0.05
_{h} | 2
_{α} | 10
_{θ} | 0.05
_{b} | 1
^{vis}σ | 0.75
_{h} | 20
_{dip} [°] | 5
_{g} | 20

Summary statistics of the performance metrics in the scenario (a).

Metric | Purely vision-based method | DLT-based EKF | Error-driven EKF
Yaw, ° | 0.45 ± 0.08 | 0.40 ± 0.04 | TF
Pitch, ° | 0.64 ± 0.10 | 0.66 ± 0.13 | TF
Roll, ° | 0.78 ± 0.14 | 0.73 ± 0.09 | TF
Orientation, ° | 1.11 ± 0.16 | 1.09 ± 0.19 | TF
X, mm | 1.55 ± 0.42 | 1.54 ± 0.31 | TF
Y, mm | 2.67 ± 0.59 | 2.88 ± 0.43 | TF
Z, mm | 4.45 ± 0.71 | 2.14 ± 0.32 | TF
Position, mm | 5.45 ± 0.76 | 3.95 ± 0.52 | TF

Summary statistics of the performance metrics in the scenario (b).

Metric | Purely IMU-based EKF | DLT-based EKF | Error-driven EKF
Yaw, ° | 1.04 ± 0.27 | 0.41 ± 0.06 | 0.81 ± 0.16
Pitch, ° | 0.76 ± 0.19 | 0.63 ± 0.10 | 0.78 ± 0.31
Roll, ° | 0.96 ± 0.15 | 0.78 ± 0.10 | 0.92 ± 0.13
Orientation, ° | 1.61 ± 0.29 | 1.08 ± 0.13 | 1.46 ± 0.25
X, mm | N/A | 1.37 ± 0.41 | 2.59 ± 0.56
Y, mm | N/A | 2.82 ± 0.48 | 6.72 ± 1.20
Z, mm | N/A | 1.96 ± 0.24 | 6.64 ± 2.41
Position, mm | N/A | 3.40 ± 1.10 | 10.00 ± 1.75

Summary statistics of the performance metrics in the scenario (c).

Metric | Purely IMU-based EKF | DLT-based EKF | Error-driven EKF
Yaw, ° | 2.16 ± 2.03 | 0.43 ± 0.06 | 2.34 ± 1.81
Pitch, ° | 0.80 ± 0.16 | 0.65 ± 0.09 | 0.84 ± 0.20
Roll, ° | 1.29 ± 1.40 | 0.81 ± 0.11 | 0.81 ± 0.15
Orientation, ° | 3.00 ± 1.95 | 1.10 ± 0.15 | 2.72 ± 1.63
X, mm | N/A | 1.48 ± 0.44 | 4.47 ± 2.01
Y, mm | N/A | 3.03 ± 0.32 | 20.67 ± 13.4
Z, mm | N/A | 2.01 ± 0.29 | 5.89 ± 2.13
Position, mm | N/A | 3.90 ± 0.51 | 22.22 ± 13.0

Summary statistics of the performance metrics in the scenario (d).

Metric | Purely IMU-based EKF | DLT-based EKF | Error-driven EKF
Yaw, ° | 29.93 ± 0.90 | 0.41 ± 0.06 | 3.07 ± 0.62
Pitch, ° | 1.01 ± 0.27 | 0.69 ± 0.09 | 1.20 ± 0.51
Roll, ° | 1.16 ± 0.17 | 0.77 ± 0.10 | 1.00 ± 0.16
Orientation, ° | 29.99 ± 0.91 | 1.08 ± 0.13 | 3.41 ± 0.65
X, mm | N/A | 1.36 ± 0.41 | 4.92 ± 1.58
Y, mm | N/A | 2.82 ± 0.48 | 24.27 ± 6.33
Z, mm | N/A | 1.97 ± 0.24 | 9.19 ± 4.80
Position, mm | N/A | 3.71 ± 0.60 | 26.79 ± 6.27