Article

MAV Localization in Large-Scale Environments: A Decoupled Optimization/Filtering Approach

IBISC Laboratory, Université d’Evry-Paris Saclay, 91020 Evry-Courcouronnes, France
*
Author to whom correspondence should be addressed.
Sensors 2023, 23(1), 516; https://doi.org/10.3390/s23010516
Submission received: 31 October 2022 / Revised: 28 December 2022 / Accepted: 29 December 2022 / Published: 3 January 2023
(This article belongs to the Special Issue Multi-Sensor and Multi-Modal Place Localization)

Abstract

Developing new sensor fusion algorithms has become indispensable to tackle the daunting problem of GPS-aided micro aerial vehicle (MAV) localization in large-scale landscapes. Sensor fusion should guarantee high-accuracy estimation with the least amount of system delay. Towards this goal, we propose a linear optimal state estimation approach for the MAV to avoid complicated and high-latency calculations, together with an immediate metric-scale recovery paradigm that relies on low-rate noisy GPS measurements when available. Our proposed strategy shows how the vision sensor can quickly bootstrap a pose that has been arbitrarily scaled and recovered from the various drifts that affect vision-based algorithms. Thanks to the proposed optimization/filtering-based methodology, the camera can be treated as a “black-box” pose estimator, which keeps the sensor fusion algorithm’s computational complexity constant and makes it suitable for the MAV’s long-term operations in expansive areas. Since GPS sensors provide only limited global tracking and localization data, our localization solution explicitly accounts for the sensor measurement uncertainty under such constraints. Extensive quantitative and qualitative analyses on real-world, large-scale MAV sequences demonstrate that our technique outperforms the most recent state-of-the-art algorithms in terms of trajectory estimation accuracy and system latency.

1. Introduction

Robust localization of micro aerial vehicles (MAVs) in uncharted large-scale areas can rely on complementary data gathered by multiple sensor modalities. The study of simultaneous localization and mapping (SLAM), primarily used for MAV navigation in expansive and dynamic settings, may be enriched and expanded by using multi-modal datasets [1]. These settings have certain traits, such as the dynamic range of the scene’s object intensities. For instance, mapping a small, well-illuminated indoor space will typically yield higher quality results than mapping a rural area at night under heavy rain, wind, and fog (a dynamic outdoor environment). The benefits of multimodal approaches become apparent when systems rely on sensors with a high dynamic range and strong sensing capabilities, such as event cameras, LiDARs, or radars, or on typical inexpensive cameras fused with other sensor modalities such as inertial measurement units (IMUs) and GPS sensors. Such multimodal approaches can fill data gaps during scene mapping and MAV localization.
Towards this aim, we develop a trustworthy (quick and precise) localization solution that utilizes information from three sensor modalities: camera frame data, IMU measurements, and GPS readings. Nevertheless, the GPS sensor readings are consistently slower and noisier than those from the IMU or camera modules, and they frequently experience signal loss in GPS-restricted locations. Therefore, a localization system that depends on GPS data must perform effectively when GPS readings are lost.
Visual-inertial odometry (VIO) is one of the most mature and well-established approaches in the localization field [2,3,4]. Efficient visual odometry can be achieved using a high-quality perception of the surroundings, and the sensors performing this perception task differ in how they collect data. On the one hand, the most common visual odometry sensors are cameras such as RGB cameras [5], event cameras [6], and RGB-D cameras [7]. On the other hand, a LiDAR sensor [8] can provide point clouds, and a GPS sensor [9,10] can locate the MAV using satellite signal triangulation, as represented in Figure 1.
The accuracy of the state estimation process relies on an error-state extended Kalman filter (ES-EKF) and the bootstrapping quality of its states. A well-established IMU-based state estimator initialization technique was discussed in [5]. In this bootstrapping method, the global metric scale of the trajectory and the IMU-camera gravity alignment is optimized using a specific amount of IMU readings preintegration combined with an initial up-to-scale trajectory estimated using the camera only. This bootstrapping process is prone to failure due to insufficient IMU excitation, especially when the MAV navigates in a planar terrain.
The MAV should carry a localization system that continually computes its pose with high accuracy and low latency, for instance during search and rescue missions. For long-term navigation in large-scale environments, the MAV has restricted resources in terms of its data processing unit and its limited power source capacity. In light of this, the state estimation approach should consistently have low computational complexity and be robust to sensor readings that deviate from the norm.
Our work’s main contributions to tackle the aforementioned challenges are three-fold:
-
In the case of state estimator initialization failure, we propose a unique instant bootstrapping technique based on continuous-time manifold optimization via pose graph optimization (PGO) and range factors, which depends on low-rate GPS signals.
-
A closed-form estimation method without nonlinear optimization during IMU/CAM fusion produces a reduced system latency with constant CPU computing complexity. The mathematical modeling of a linear ES-EKF with a precise and quick gyroscope integration strategy accounts for the simplicity of our proposed localization solution.
-
Our IMU/GPS-CAM fusion system has been thoroughly tested on two publicly available real-world benchmarks: the EuRoC benchmark [12] for MAV localization assessment in indoor environments and the Fast Flight dataset [11] for large-scale outdoor environments. The assessment is conducted against the most recent state-of-the-art visual-inertial odometry methodologies, with thorough ablation studies on the role of each sensor modality in the overall accuracy of the state estimation process.

2. Related Work

2.1. Sensor Fusion

Figure 2 presents a global overview of the current state-of-the-art approaches for localization. The ability to continually estimate the robot’s ego-motion (position and orientation) over time is a significant challenge in autonomous navigation, path planning, object tracking, and collision avoidance platforms [13]. The Global Positioning System (GPS), one kind of global navigation satellite system (GNSS) [10], is a well-known localization method applied in several autonomous system domains. GPS is used as a self-localization source, for example in MAV security applications, and gives any user with a GPS receiver positional information with meter-level precision. On the other hand, satellite signal blockage, high noise levels, multipath effects, and other issues make GPS a less trustworthy sensor for self-localization modules. However, real-time kinematic (RTK) and precise point positioning (PPP) [9], two rapidly developing GPS technologies, can provide locations with decimeter- or centimeter-level precision.
The effectiveness of GPS satellite signals heavily depends on the surrounding environment; it works best in locations with clear skies and is ineffective for indoor navigation since walls and other obstacles impede it [14]. This makes the GPS module an unsuitable primary sensor for reliable autonomous vehicle localization under adverse weather and environmental conditions. Hence, the fusion of GPS signals with other inertial and/or visual sensors is indispensable for a reliable localization solution, especially in such environments. State-of-the-art sensor fusion systems fall into two prominent families: loosely-coupled [15] and tightly-coupled [16] fusion strategies. In loosely-coupled fusion, the camera frames used for pose estimation are processed as a black-box, and a filter or an optimization model fuses the arbitrarily-scaled poses from the visual sensor with the noisy metric-scaled re-integrated IMU readings [17].
On the contrary, in the tightly coupled approach, scene information from the visual sensor is fused with the IMU measurements (linear accelerations and angular velocities) using a fusion filter or an optimization model that estimates the metric-scaled pose, visual odometry scale factor, IMU biases, and visual drift between the IMU-camera inertial frames. One of the prominent advantages of a tightly-coupled fusion scheme is that it can estimate accurate scene information to reconstruct a precise scene map, along with providing the SLAM system with high confidence in loop closure during re-localization situations.

2.2. Fusion Strategies

The two sensor fusion strategies (loosely and tightly coupled) have two main execution techniques: filter-based and optimization-based execution. Some filter-based state-of-the-art approaches are deterministic, such as MSCKF [18], S-MSCKF [11], S-UKF-LG/S-IEKF [19], and ROVIO [20]. At the same time, alternative strategies can be based on nondeterministic filters such as particle filters [21].
Optimization-based methods such as VINS-Mono [22], OKVIS [23], ORB-SLAM [24], and BASALT [25], can be deterministic or non-deterministic based on the optimization strategy and the convergence constraints. The estimation and robustness of visual localization frameworks have advanced significantly in recent decades, and this development may be furthered by tightly integrating visual and inertial data. Most methods integrate data utilizing optimization methods or filtering-based procedures.
Filtering approaches are ideally suited to real-time applications [26,27], which is the main emphasis of this study. In contrast, optimization-based methods are more precise but often have a more extensive processing complexity. The observability-constrained technique addresses the consistency issue, a shortcoming of traditional VIO filter-based algorithms [28]. The EKF/MSCKF and its cutting-edge variations are among the most widely used solutions because they effectively balance accuracy and computational complexity.
A recent study [29] shows that the EKF system states of an MAV are observable if the random character of the air mass is considered. The drag and lift forces on the MAV directly impact the estimated pose and velocity due to this randomness. To update online the uncertainties that these random effects introduce on the precise position of the sensors’ reference frames, we contribute a visual drift augmentation technique to our EKF measurement model. What we target is the EKF’s ability to tolerate significant disturbances in the MAV’s velocity state variable and still converge to the undisturbed estimates.

2.3. Visual Odometry

The main objective of a visual odometry solution is to perform an accurate and precise localization of the robot (ground or aerial vehicle) to estimate its pose during the navigation task. Estimated poses can be on either discrete- or continuous-time manifolds. Cioffi et al. [30] studied the reliability of the estimated poses on both manifolds using IMU/Visual/GPS sensors. They came to an important conclusion: similar results are produced by the two representations when the camera and IMU are time-synchronized.
In [13], the sliding window pose-graph optimization of the most recent robot states uses global position data together with poses predicted by a VIO method. Similarly, in [15], pose-graph optimization employs an independent VIO technique to generate pose estimations that are fused with GPS data. In contrast to [13], the pose-graph in [15] includes an extra node representing the local coordinate frame’s origin to constrain the absolute orientation. However, these methods are loosely coupled, meaning that a separate VIO algorithm generates the relative pose estimations. Inspired by [13,15], we present a loosely coupled strategy that considers the correlations between all measurements by including them in a hybrid optimization and filtering problem.
It is demonstrated in [23] that considering all measurement correlations is essential for high-precision estimations in the visual-inertial situation. A tightly coupled sliding window optimization for visual and inertial data with loosely connected GPS refinement is presented in [14]. The GPS readings are given the same timestamp as the temporally nearest image to be included in the sliding window because it is believed that they would only be accessible at low rates. As opposed to [14], we efficiently compute the global positional factors by closely coupling the global position measurements using the Runge–Kutta 4th-order gyroscope preintegration scheme [31]. This enables the sliding window to incorporate numerous global parameters and each keyframe with barely any additional processing load.

2.4. Methodology Background

We highlight the methodology that inspires our study with blue-dashed rectangles in Figure 2. The loosely coupled fusion strategy [32] is adopted to keep the computational complexity constant for real-time performance, together with a reset mode for the framework, as discussed in [33], and an online IMU-camera extrinsic calibration paradigm [4]. Integrating the IMU/GPS readings with the monocular frames of a global-shutter visual sensor raises our localization solution’s accuracy level by leveraging the MAV’s inertial and global localization information.
Pushing the limits of the extended Kalman filter to raise the robustness of our localization solution towards a resilient system, we leverage the high accuracy of the optimization to initialize the filter pose states using a novel instant approach utilizing the low-rate noisy GPS readings when available. Sensor fusion on continuous-time (CT) manifolds, such as B-splines [34], suffers from a high execution complexity, especially with the time derivatives of high-order manifolds for integrating the IMU measurements in the estimation process. Hence, in our novel method, we avoid this dilemma with a simple spline-fitting approach for the GPS readings during the data pre-processing stage.

3. System Architecture

Our core sensor setup consists of an inertial navigation sensor (IMU), a global positioning sensor (GPS), and a monocular camera, as illustrated in Figure 3. The pipeline starts with the data acquisition and pre-processing for the initialization process, as discussed in Section 3.1. The initialization is an optimization-based phase (see Algorithm 1) with a considerably low complexity and processing time whose output is an instant metric-scaled pose estimated from the camera, GPS, and gyroscope readings. Then, an ES-EKF (see Algorithm 2) whose dynamic model is given in Section 3.2, is applied to estimate all the system states, including the MAV’s trajectory, velocity, and a scale factor to recover the initially estimated trajectory in the case of GPS readings loss. Finally, we present the measurement model in Section 3.3 with a novel false pose augmentation paradigm to ensure the observability of all the filter states, as analyzed in Appendix A.
Algorithm 1 Bootstrapping: Pose Graph Optimization and Range Factors
 Input: RGB frames (c), camera matrix (K_c), GPS readings (DT-GPS), IMU readings (I)
 Output: Metric-scaled trajectory (T_v^c = [p_v^c, q_v^c] ∈ SE(3))
 1: T_v^{c0} ← KLT-VO(c, K_c)                        ▹ Arbitrary-scaled pose
 2: p(u) ← spline_fit(DT-GPS)                        ▹ CT-GPS by Equation (4)
 3: [φ, θ, ψ] ← RK4(I_gyro(ω))                       ▹ Initial orientations
 4: while not converged do                           ▹ Initial trajectory optimization
 5:     T_v^c ← optimize(T_v^{c0}, p(u), [φ, θ, ψ])  ▹ Equation (6)
 6: end while
The state representation is a 31-element state vector X:
X = \big[\, p_w^{i\top} \;\; v_w^{i\top} \;\; q_w^{i\top} \;\; b_\omega^\top \;\; b_a^\top \;\; \lambda \;\; p_i^{c\top} \;\; q_i^{c\top} \;\; p_v^{w\top} \;\; q_v^{w\top} \,\big]^\top ,
where p_w^i is the position of the IMU in the world frame (w) (a gravity-aligned frame), v_w^i its velocity, and q_w^i its attitude quaternion describing a rotation from the IMU frame (i) into the world frame (w). b_ω and b_a are the gyroscope and accelerometer biases, and λ is the visual odometry scale factor. R(q) is the rotation matrix of the quaternion q, g is the gravity vector aligned with the world frame (w), and Ω(ω) is the quaternion-multiplication matrix of ω.
The IMU/camera calibration states are the rotation from the camera frame into the IMU frame q i c , and the position of the camera center with regard to the IMU frame p i c .
Finally, the visual attitude drifts between the black-boxed visual frame (vision frame is the frame to which the camera pose is estimated in the black-box vision framework) ( v ) and the world inertial frame ( w ) are reflected in q v w and the translational ones in p v w . We assume that all the visual drifts are spatial without any temporal drifts, i.e., the IMU and the camera have synchronized timestamps.
Algorithm 2 End-to-End State Estimation Scheme
     Input: IMU readings, initial optimized trajectory T_v^c
     Output: FilterStates X = {λ, K_i[b_a, b_ω], T_i^c, T_w^v, T_w^i, v_w^i}, with T[p, q] ∈ SE(3)
 1: P, Q_c, R ← initialization; FilterStates ← initialization
 2: ErrorStates ← 0
 3: while imuRead do
 4:     Read last step (k) P, FilterStates, ErrorStates
 5:     Read last step (k) IMU (accelerometer, gyroscope) values
 6:     Read current (k+1) IMU (accelerometer, gyroscope) values
 7:     Step 1: Propagate IMU states                           ▹ Equation (12)
 8:     Step 2: Calculate F_d and Q_d                          ▹ Equations (14) and (16)
 9:     Step 3: Compute the state covariance matrix P          ▹ Equation (18)
10:     if camRead then
11:         Read current (k+1) CAM T_v^c values                ▹ Metric-scaled pose
12:         Step 4: Estimate the false pose                    ▹ Equation (19)
13:         Step 5: Calculate z̃, H                            ▹ Equation (20)
14:         Step 6: Calculate S, K, ErrorStates x̃̂, P          ▹ Equations (26) and (27)
15:         Step 7: Update: FilterStates += ErrorStates
16:         Step 8: Reset x̃̂ = 0, P                            ▹ Equation (29)
17:     end if
18: end while
The corresponding 28-element error state vector is defined by:
\tilde{x} = \big[\, \Delta p_w^{i\top} \;\; \Delta v_w^{i\top} \;\; \delta\theta_w^{i\top} \;\; \Delta b_\omega^\top \;\; \Delta b_a^\top \;\; \Delta\lambda \;\; \Delta p_i^{c\top} \;\; \delta\theta_i^{c\top} \;\; \Delta p_v^{w\top} \;\; \delta\theta_v^{w\top} \,\big]^\top ,
as the difference between the true quantity x and its estimate x̂, i.e., x̃ = x − x̂. We apply this to all state variables except the error quaternions, which are defined by:
\delta q_y^x = q_y^x \otimes (\hat{q}_y^x)^{-1} \approx \big[\, \tfrac{1}{2}\delta\theta_y^{x\top} \;\; 1 \,\big]^\top .
This error quaternion representation increases the numerical stability of the estimation process and handles the quaternion in its minimal representation [35].
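For readers implementing the filter, the sketch below shows one possible way to lay out the 31-element state vector and its 28-element error state in code. The index map and field names are illustrative assumptions (the authors' MATLAB back-end is not shown); only the block sizes follow the definitions above, with each quaternion occupying four slots in the state and three small-angle slots in the error state.

```python
import numpy as np

# Hypothetical index maps for the 31-element state X and the 28-element error state.
STATE_IDX = {
    "p_w_i": slice(0, 3),   "v_w_i": slice(3, 6),   "q_w_i": slice(6, 10),
    "b_w":   slice(10, 13), "b_a":   slice(13, 16), "lambda": slice(16, 17),
    "p_i_c": slice(17, 20), "q_i_c": slice(20, 24),
    "p_v_w": slice(24, 27), "q_v_w": slice(27, 31),
}
ERROR_IDX = {
    "dp_w_i": slice(0, 3),   "dv_w_i": slice(3, 6),   "dth_w_i": slice(6, 9),
    "db_w":   slice(9, 12),  "db_a":   slice(12, 15), "dlambda": slice(15, 16),
    "dp_i_c": slice(16, 19), "dth_i_c": slice(19, 22),
    "dp_v_w": slice(22, 25), "dth_v_w": slice(25, 28),
}

X = np.zeros(31)
X[STATE_IDX["q_w_i"]] = [0.0, 0.0, 0.0, 1.0]   # identity quaternion, (x, y, z, w) ordering assumed
X[STATE_IDX["q_i_c"]] = [0.0, 0.0, 0.0, 1.0]
X[STATE_IDX["q_v_w"]] = [0.0, 0.0, 0.0, 1.0]
X[STATE_IDX["lambda"]] = 1.0                   # unit visual-odometry scale factor
x_err = np.zeros(28)                           # error state starts at zero
```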

3.1. State Estimator Initialization

An incremental structure from motion (SfM) algorithm [36] is applied to the acquired image frames, whose goal is to retrieve the camera poses and the 3D structure of the scene, based on the five-point algorithm proposed in [37]. ORB features are detected, and the highest quality points are tracked between 10 consecutive frames using the KLT method [38].
To solve the arbitrary-scale problem of the camera-only trajectory, we apply an on-manifold cumulative B-spline (https://github.com/AbanobSoliman/B-splines (accessed on 1 October 2022)) interpolation [34] to synthesize a smooth continuous-time (CT) trajectory in R³ from the low-rate noisy GPS readings.
The matrix form of the cumulative B-spline manifold of order k = n + 1, where n is the spline degree, is modeled at t ∈ [t_i, t_{i+k−1}] as:
p(u) = p_i + \sum_{j=1}^{k-1} \big( \tilde{B}_j^{(k)} \cdot \bar{u}^{(k)} \big)\, d_j^i ,
where p(u) ∈ R³ is the continuous-time B-spline increment that interpolates k GPS measurements on the normalized unit of time u(t) := (t − t_i)/Δt_s − P_n, with 1/Δt_s denoting the spline generation frequency and P_n ∈ [0, ⋯, k − 1] being the pose number that contributes to the current spline segment. p_i is the initial discrete-time (DT) GPS location measurement at time t_i, and d_j^i = p_{i+j} − p_{i+j−1} is the difference vector between two consecutive DT-GPS readings. The j-th row of the cumulative basis blending matrix, \tilde{B}_j^{(k)}, and the normalized time vector \bar{u}^{(k)} are defined as:
\tilde{B}_j^{(k)} = \tilde{b}_{j,n}^{(k)} = \sum_{s=j}^{k-1} b_{s,n}^{(k)}, \qquad b_{s,n}^{(k)} = \frac{C_{k-1}^{n}}{(k-1)!} \sum_{l=s}^{k-1} (-1)^{l-s}\, C_{k}^{l-s}\, (k-1-l)^{k-1-n}, \qquad \bar{u}^{(k)} = [\, u^0, \dots, u^{k-1} \,]^\top, \; u \in [0, 1] .
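As an illustration of this pre-processing stage, the following sketch fits a smooth continuous-time curve to low-rate DT-GPS positions. It uses SciPy's cubic smoothing spline as a simple stand-in for the on-manifold cumulative B-spline above; the sensor rates, noise level, and synthetic trajectory are assumed values, not taken from the benchmarks.

```python
import numpy as np
from scipy.interpolate import splprep, splev

# Synthetic low-rate, noisy DT-GPS positions (assumed 5 Hz) in a local metric frame.
t_gps = np.arange(0.0, 20.0, 0.2)
p_gps = np.c_[np.cos(0.3 * t_gps), np.sin(0.3 * t_gps), 0.05 * t_gps]
p_gps += 0.3 * np.random.randn(*p_gps.shape)           # GPS-like position noise

# Fit a cubic (degree n = 3, order k = 4) smoothing spline through the 3D positions.
tck, u = splprep(p_gps.T, u=t_gps, k=3, s=len(t_gps) * 0.3**2)

# Evaluate the continuous-time trajectory at the camera timestamps (assumed 40 Hz).
t_cam = np.arange(0.0, t_gps[-1], 0.025)
p_ct = np.array(splev(t_cam, tck)).T                   # (N, 3) CT-GPS positions
```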
Our GPS-IMU aided initialization system comprises two optimization factors: the first is a pose graph optimization (PGO) factor r p that optimizes the 6-DoF of every pose, whereas the second is a range factor r s that constraints the translation limits between every two KLT-VO poses. Hence, the metric scale of the visual odometry pose is recovered using the gyroscope and GPS readings, leveraging the high accuracy of the optimization process. An illustrative scheme for the initialization process factor graph is shown in Figure 4.
Level 1’s objective function L p , s is modeled as:
\mathcal{L}_{p,s} = \arg\min_{T_w^i} \sum_{(i,j) \in N} \big\| r_p(i,j) \big\|^2_{\Sigma_{i,j}^p} + \big\| r_s(i,j) \big\|^2_{\Sigma_{i,j}^s} .
Σ i , j p , Σ i , j s are the information matrices associated with the GPS readings covariance, reflecting the PGO and Range factors noises on the global metric scale estimation process between two RGB-D aligned frames.
Pose Graph Optimization (PGO) factor. The PGO is a 6-DoF factor that controls the relative pose error between two consecutive edges i , j and is formulated as:
r_p = \big\| \hat{T}_i^{-1}\, \hat{T}_j \ominus \Delta T_{ij}^{\omega,GPS} \big\|^2 ,
where | | . | | 2 is the L2-norm, T ^ i , j S E ( 3 ) is the T w i 0 estimated from the front-end pipeline at frames i , j . The operator ⊖ is the S E ( 3 ) logarithmic map as defined in [39]. The error transformation Δ T i j ω , G P S [ δ R i j ω , δ p i j G P S ] se ( 3 ) , where δ p i j G P S = p j p i is the CT-GPS measurement increment and δ R i j ω = [ δ ϕ , δ θ , δ ψ ] so ( 3 ) is the gyroscope integrated increment δ R i j ω = k = i j ( ω k ) . d k using the Runge–Kutta 4th order (RK4) integration method [31] between the keyframes i and j.
Range factor. The range factor limits the front-end visual drift and keeps the global metric scale under control within a sensible range defined by the GPS signal and is formulated as:
r_s = \Big( \big\| \hat{t}_j - \hat{t}_i \big\|_2 - \big\| p_j^{GPS} - p_i^{GPS} \big\|_2 \Big)^2 ,
where t ^ i , j , p i , j G P S R 3 are the translation vectors of two consecutive front-end (KLT-VO) poses and CT-GPS signals, respectively.
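A minimal sketch of how the Level-1 objective (Equation (6)) could be set up: it keeps only the translational part of the PGO factor and the range factor, optimizes the keyframe translations with SciPy instead of the Ceres-based C++ front-end, and anchors the first keyframe to fix the gauge. The RK4 gyroscope orientation increments of the full factor graph are omitted here, and all variable names, weights, and synthetic data are illustrative.

```python
import numpy as np
from scipy.optimize import least_squares

def residuals(x, p_ct, w_pgo=1.0, w_range=1.0):
    """Translation-only PGO residuals (cf. r_p) plus range-factor residuals (cf. r_s)."""
    t = x.reshape(-1, 3)                                  # optimized keyframe translations
    res = [t[0] - p_ct[0]]                                # anchor the first keyframe (gauge fix)
    for i in range(len(t) - 1):
        dp_gps = p_ct[i + 1] - p_ct[i]                    # CT-GPS increment between keyframes
        res.append(w_pgo * (t[i + 1] - t[i] - dp_gps))    # relative-translation residual
        r_s = np.linalg.norm(t[i + 1] - t[i]) - np.linalg.norm(dp_gps)
        res.append([w_range * r_s, 0.0, 0.0])             # range residual, padded to length 3
    return np.concatenate(res)

# t_vo: arbitrary-scaled KLT-VO translations used as the initial guess;
# p_ct: CT-GPS positions interpolated at the keyframe timestamps (synthetic here).
t_vo = np.cumsum(0.1 * np.random.rand(50, 3), axis=0)
p_ct = 4.0 * t_vo + 0.05 * np.random.randn(50, 3)
sol = least_squares(residuals, x0=t_vo.ravel(), args=(p_ct,), method="lm")
t_metric = sol.x.reshape(-1, 3)                           # metric-scaled keyframe translations
```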

3.2. Dynamic Model

The core state estimation is performed by fusing the RGB camera frames and the IMU reading using an error states extended Kalman filter (ES-EKF). Figure 5 illustrates the inter-sensor extrinsic relation between the IMU/GPS sensors and a monocular camera.
To use the linear states estimator, we assume that the IMU measurements contain a particular bias b a N ( 0 , σ b a ) , b ω N ( 0 , σ b ω ) and a white Gaussian noise n a N ( 0 , σ a ) , n ω N ( 0 , σ ω ) .
Thus, the real angular velocities ω and accelerations a in the IMU body frame ( i ) can be written as:
\omega = \omega_m - b_\omega - n_\omega \quad \text{and} \quad a = a_m - b_a - n_a ,
where the subscript m denotes the measured value. The dynamics of the non-static biases are modeled as a random process:
\dot{b}_\omega = n_{b_\omega}, \qquad \dot{b}_a = n_{b_a} .
The standard deviation values σ_{b_ω}, σ_{b_a}, σ_ω, σ_a are generally given in the IMU manufacturer’s Allan deviation plots. Since the filter is applied at discrete time steps, these continuous-time values must be converted according to their units:
d\sigma_{\omega,a}^2 = \frac{\sigma_{\omega,a}^2}{\Delta t}, \qquad d\sigma_{b_{\omega,a}}^2 = \sigma_{b_{\omega,a}}^2\, \Delta t .
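A small worked example of this conversion, with assumed data-sheet noise densities and an assumed 200 Hz IMU rate; the ordering of the blocks in Q_c is one possible choice, not necessarily the authors':

```python
import numpy as np

dt = 1.0 / 200.0                      # IMU sampling period (assumed 200 Hz)
sigma_a  = 2.0e-3                     # accel. noise density   [m/s^2/sqrt(Hz)]  (assumed)
sigma_w  = 1.7e-4                     # gyro noise density     [rad/s/sqrt(Hz)]  (assumed)
sigma_ba = 3.0e-3                     # accel. bias random walk                  (assumed)
sigma_bw = 2.0e-5                     # gyro bias random walk                    (assumed)

# White-noise variances scale with 1/dt, bias random-walk variances with dt (see above).
d_sigma_a2,  d_sigma_w2  = sigma_a**2 / dt,  sigma_w**2 / dt
d_sigma_ba2, d_sigma_bw2 = sigma_ba**2 * dt, sigma_bw**2 * dt

# Continuous-time process noise covariance Q_c (assumed block ordering: n_a, n_ba, n_w, n_bw).
Q_c = np.diag(np.r_[np.full(3, sigma_a**2), np.full(3, sigma_ba**2),
                    np.full(3, sigma_w**2), np.full(3, sigma_bw**2)])
```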
The following differential equations govern IMU state propagation:
\dot{p}_w^i = v_w^i, \quad \dot{v}_w^i = R(q_w^i)\,(a_m - b_a - n_a) - g, \quad \dot{q}_w^i = \tfrac{1}{2}\,\Omega(\omega_m - b_\omega - n_\omega)\, q_w^i, \quad \dot{b}_\omega = n_{b_\omega}, \quad \dot{b}_a = n_{b_a}, \quad \dot{\lambda} = 0, \quad \dot{p}_i^c = 0, \quad \dot{q}_i^c = 0, \quad \dot{p}_v^w = 0, \quad \dot{q}_v^w = 0 .
For the quaternion integration inside the ES-EKF, we use the first-order integrator defined in [35] as:
\bar{\omega} = \frac{\omega_{k+1} + \omega_k}{2}, \qquad \kappa = \tfrac{1}{2}\,\Omega(\bar{\omega})\,\Delta t, \qquad \hat{q}_{w\,k+1}^i = \Big[ e^{\kappa} + \frac{\Delta t^2}{48}\big( \Omega(\omega_{k+1})\,\Omega(\omega_k) - \Omega(\omega_k)\,\Omega(\omega_{k+1}) \big) \Big]\, \hat{q}_{w\,k}^i .
where the hat term ^ means the estimated value. The exponential term e κ is expanded by the Maclaurin series.
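A compact sketch of this first-order quaternion integrator. It assumes a [x, y, z, w] quaternion ordering with the Ω(ω) matrix of Trawny and Roumeliotis [35], and uses scipy.linalg.expm in place of the Maclaurin expansion of e^κ; these are implementation choices, not the authors' code.

```python
import numpy as np
from scipy.linalg import expm

def skew(v):
    return np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])

def Omega(w):
    """Quaternion-multiplication matrix of an angular rate (q = [x, y, z, w] ordering assumed)."""
    O = np.zeros((4, 4))
    O[:3, :3] = -skew(w)
    O[:3, 3] = w
    O[3, :3] = -w
    return O

def integrate_quaternion(q_k, w_k, w_k1, dt):
    """First-order quaternion integrator sketched from the expression above."""
    w_bar = 0.5 * (w_k + w_k1)
    kappa = 0.5 * Omega(w_bar) * dt
    Phi = expm(kappa) + (dt**2 / 48.0) * (Omega(w_k1) @ Omega(w_k) - Omega(w_k) @ Omega(w_k1))
    q_k1 = Phi @ q_k
    return q_k1 / np.linalg.norm(q_k1)        # re-normalize to keep a unit quaternion

q = np.array([0.0, 0.0, 0.0, 1.0])            # identity attitude
q = integrate_quaternion(q, np.array([0.0, 0.0, 0.10]),
                         np.array([0.0, 0.0, 0.12]), 1.0 / 200.0)
```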
The states transition matrix F d is modeled as:
F_d = \begin{bmatrix}
Id_3 & Id_3\,\Delta t & A & B & -R(\hat{q}_w^i)\frac{\Delta t^2}{2} & 0_{3\times13} \\
0_3 & Id_3 & C & D & -R(\hat{q}_w^i)\,\Delta t & 0_{3\times13} \\
0_3 & 0_3 & E & F & 0_3 & 0_{3\times13} \\
0_3 & 0_3 & 0_3 & Id_3 & 0_3 & 0_{3\times13} \\
0_3 & 0_3 & 0_3 & 0_3 & Id_3 & 0_{3\times13} \\
0_{13\times3} & 0_{13\times3} & 0_{13\times3} & 0_{13\times3} & 0_{13\times3} & Id_{13}
\end{bmatrix} .
Then, we apply the small-angle approximation |ω| → 0 and de l’Hôpital’s rule to obtain a compact solution for the six matrix blocks A, B, C, D, E, F [35]:
A = -R(\hat{q}_w^i)\,\lfloor \hat{a} \rfloor_\times \Big( \frac{\Delta t^2}{2!} - \frac{\Delta t^3}{3!}\lfloor \hat{\omega} \rfloor_\times + \frac{\Delta t^4}{4!}\lfloor \hat{\omega} \rfloor_\times^2 \Big), \qquad
B = -R(\hat{q}_w^i)\,\lfloor \hat{a} \rfloor_\times \Big( -\frac{\Delta t^3}{3!} + \frac{\Delta t^4}{4!}\lfloor \hat{\omega} \rfloor_\times - \frac{\Delta t^5}{5!}\lfloor \hat{\omega} \rfloor_\times^2 \Big),
C = -R(\hat{q}_w^i)\,\lfloor \hat{a} \rfloor_\times \Big( \Delta t - \frac{\Delta t^2}{2!}\lfloor \hat{\omega} \rfloor_\times + \frac{\Delta t^3}{3!}\lfloor \hat{\omega} \rfloor_\times^2 \Big), \qquad
D = -A, \qquad
E = Id_3 - \Delta t\,\lfloor \hat{\omega} \rfloor_\times + \frac{\Delta t^2}{2!}\lfloor \hat{\omega} \rfloor_\times^2, \qquad
F = -\Delta t + \frac{\Delta t^2}{2!}\lfloor \hat{\omega} \rfloor_\times - \frac{\Delta t^3}{3!}\lfloor \hat{\omega} \rfloor_\times^2 ,
with \hat{\omega} = \omega_m - \hat{b}_\omega, \hat{a} = a_m - \hat{b}_a, and \lfloor \hat{\omega} \rfloor_\times, \lfloor \hat{a} \rfloor_\times the skew-symmetric matrices of the IMU readings.
We can now derive the discrete-time input noise covariance matrix Q d as:
Q_d = \int_{\Delta t} F_d(\tau)\, G_c\, Q_c\, G_c^\top\, F_d^\top(\tau)\, d\tau ,
where Q c is the CT process noise covariance, and G c is calculated in the form:
G_c = \begin{bmatrix}
0_3 & 0_3 & 0_3 & 0_3 \\
R(\hat{q}_w^i) & 0_3 & 0_3 & 0_3 \\
0_3 & 0_3 & Id_3 & 0_3 \\
0_3 & 0_3 & 0_3 & Id_3 \\
0_3 & Id_3 & 0_3 & 0_3 \\
0_{13\times3} & 0_{13\times3} & 0_{13\times3} & 0_{13\times3}
\end{bmatrix} .
The closed-form solution of the complete derivation of the Q d covariance matrix is given in detail in Appendix B.
Finally, the propagated state covariance matrix computation is defined as:
P_{k+1|k} = F_d\, P_{k|k}\, F_d^\top + Q_d .
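In code, this propagation step reduces to a single line; the placeholder F_d and Q_d below stand in for the matrices of Equations (14) and (16):

```python
import numpy as np

def propagate_covariance(P, F_d, Q_d):
    """Propagate the 28x28 error-state covariance through one IMU step (Equation (18))."""
    return F_d @ P @ F_d.T + Q_d

# Example with placeholder matrices; real values come from the discretized dynamic model.
P = np.eye(28) * 1e-2
F_d = np.eye(28)
Q_d = np.eye(28) * 1e-6
P = propagate_covariance(P, F_d, Q_d)
```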

3.3. Measurement Model

The main contribution of our measurement model for an observable ES-EKF is the false relative pose augmentation methodology: the visual drift quaternion state at the previous time step (k) is updated with the current camera measurement at time (k+1) and modeled as:
q_v^w(k) = \hat{q}_w^i(k)^{-1} \otimes \hat{q}_i^c(k)^{-1} \otimes q_v^c(k+1) .
The camera position measurement model yields the position of the camera with respect to the vision frame, p_v^c. The measurement error is modeled as z̃_p and linearized as z̃_p^L:
\tilde{z}_p = z_p - \hat{z}_p = p_v^c - R(\hat{q}_v^w)\big( \hat{p}_w^i + R(\hat{q}_w^i)\,\hat{p}_i^c \big)\hat{\lambda} \;\doteq\; \tilde{z}_p^L = H_p\,\tilde{x} ,
with
H_p = \Big[\; R(\hat{q}_v^w)\hat{\lambda} \quad 0_{3\times3} \quad R(\hat{q}_v^w) R(\hat{q}_w^i) \lfloor \hat{p}_i^c \rfloor_\times \hat{\lambda} \quad 0_{3\times6} \quad R(\hat{q}_v^w) R(\hat{q}_w^i)\hat{p}_i^c + R(\hat{q}_v^w)\hat{p}_w^i \quad R(\hat{q}_v^w) R(\hat{q}_w^i)\hat{\lambda} \quad 0_{3\times6} \quad \lfloor R(\hat{q}_v^w)\big(\hat{p}_w^i + R(\hat{q}_w^i)\hat{p}_i^c\big)\hat{\lambda} \rfloor_\times \;\Big] ,
using the definition of the error-quaternion
q_w^i = \delta q_w^i \otimes \hat{q}_w^i, \qquad R(q_w^i) \approx \big( Id_3 - \lfloor \delta\theta_w^i \rfloor_\times \big)\, R(\hat{q}_w^i) .
The vision algorithm yields the rotation from the camera frame into the vision frame q v c . We can model the error measurement as
\tilde{z}_q = z_q \otimes \hat{z}_q^{-1} = \big( q_i^c \otimes q_w^i \otimes q_v^w \big) \otimes \big( q_i^c \otimes \hat{q}_w^i \otimes \hat{q}_v^w \big)^{-1} .
Finally, the measurements Jacobian H in z ˜ = H . x ˜ is calculated based on the method in [33] and can be stacked together in the form
\begin{bmatrix} \tilde{z}_p \\ \tilde{z}_q \end{bmatrix} = \begin{bmatrix} H_p \\ 0_{3\times6} \quad \tilde{H}_{q_w^i} \quad 0_{3\times10} \quad \tilde{H}_{q_i^c} \quad 0_{3\times3} \quad \tilde{H}_{q_v^w} \end{bmatrix} \tilde{x} ,
with the Jacobian matrices H̃_{q_x^y}, known as the right Jacobians of SO(3), defined as:
\tilde{H}_{q_x^y} = J_r(\theta_x^y) = \lim_{\delta\theta \to 0} \frac{\mathrm{Log}\big( \mathrm{Exp}(\theta)^{-1}\, \mathrm{Exp}(\theta + \delta\theta) \big)}{\delta\theta}, \qquad J_r(\theta_x^y) = Id_3 - \frac{1 - \cos\|\delta\theta\|}{\|\delta\theta\|^2}\,\lfloor \delta\theta_x^y \rfloor_\times + \frac{\|\delta\theta\| - \sin\|\delta\theta\|}{\|\delta\theta\|^3}\,\lfloor \delta\theta_x^y \rfloor_\times^2 .
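Since this right Jacobian also reappears in the reset step, a small self-contained implementation of the closed form above may be useful; the small-angle branch is an added numerical safeguard, and the function name is ours.

```python
import numpy as np

def skew(v):
    return np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])

def right_jacobian_so3(theta):
    """Right Jacobian J_r of SO(3) for a rotation vector theta (closed form above)."""
    angle = np.linalg.norm(theta)
    if angle < 1e-8:                              # small-angle limit: J_r ~ I - 0.5*skew(theta)
        return np.eye(3) - 0.5 * skew(theta)
    S = skew(theta)
    return (np.eye(3)
            - (1.0 - np.cos(angle)) / angle**2 * S
            + (angle - np.sin(angle)) / angle**3 * (S @ S))
```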

3.4. States Update

To update the framework for the current time step (k+1), we compute the innovation term S, Kalman gain K, and the states correction vector x ˜ ^ defined as:
S = H\,P\,H^\top + R, \qquad K = P\,H^\top S^{-1}, \qquad \hat{\tilde{x}} = K\,\tilde{z} .
The error state covariance is updated as follows:
P_{k+1|k+1} = \big( Id_{28} - K H \big)\, P_{k+1|k}\, \big( Id_{28} - K H \big)^\top + K R K^\top ,
where R_{6\times6} = \mathrm{diag}(R_{position}, R_{orientation}) is the measurement noise covariance matrix.
The error quaternion is computed by Equation (3) to ensure its unit length, and the state vector is then updated as X_{k+1} = X_k + \hat{\tilde{x}}.
For the quaternions state update:
\hat{q}_{k+1} = \frac{\big[\, 1 \;\; \tfrac{1}{2}\delta\theta_{k+1}^{1} \;\; \tfrac{1}{2}\delta\theta_{k+1}^{2} \;\; \tfrac{1}{2}\delta\theta_{k+1}^{3} \,\big]^\top \otimes \hat{q}_k}{\Big\| \big[\, 1 \;\; \tfrac{1}{2}\delta\theta_{k+1}^{1} \;\; \tfrac{1}{2}\delta\theta_{k+1}^{2} \;\; \tfrac{1}{2}\delta\theta_{k+1}^{3} \,\big]^\top \otimes \hat{q}_k \Big\|} ,
where δ θ k + 1 i is the ith error state of this quaternion.
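The update step can be summarized in a short sketch. The Joseph-form covariance update mirrors the expression above, and the 6-dimensional residual and Jacobian used in the example are placeholders only.

```python
import numpy as np

def eskf_update(P, H, R, z_tilde):
    """Innovation, Kalman gain, error-state correction, and Joseph-form covariance update."""
    S = H @ P @ H.T + R                           # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)                # Kalman gain
    dx = K @ z_tilde                              # error-state correction
    I_KH = np.eye(P.shape[0]) - K @ H
    P_new = I_KH @ P @ I_KH.T + K @ R @ K.T       # Joseph form keeps P symmetric
    return dx, P_new

# Example with a stacked position/orientation residual and a placeholder Jacobian.
P = np.eye(28) * 1e-2
H = np.zeros((6, 28)); H[:3, :3] = np.eye(3); H[3:, 6:9] = np.eye(3)
R = np.eye(6) * 1e-3
dx, P = eskf_update(P, H, R, z_tilde=np.zeros(6))
```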

3.5. Reset Mode

The ES-EKF reset mode is performed by setting x̃̂ ← 0 and P ← G P G^⊤, where G is the Jacobian matrix defined by
G = \mathrm{diag}\big( Id_6,\; {J_r}_w^i,\; Id_{10},\; {J_r}_i^c,\; Id_3,\; {J_r}_v^w \big), \qquad {J_r}_x^y = \frac{\partial\, \delta\theta_x^{y+}}{\partial\, \delta\theta_x^y} = Id_3 - \big\lfloor \tfrac{1}{2}\hat{\delta\theta}_x^y \big\rfloor_\times .
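A sketch of the reset step under the error-state layout assumed earlier (attitude error blocks at indices 6-9, 19-22, and 25-28). These indices are an assumption consistent with the 28-element error state, not taken from the authors' code.

```python
import numpy as np

def skew(v):
    return np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])

def reset_covariance(P, dth_wi, dth_ic, dth_vw):
    """Build G from the three attitude error corrections and reset P <- G P G^T."""
    def J_r(dth):
        return np.eye(3) - skew(0.5 * dth)
    G = np.eye(28)
    G[6:9, 6:9] = J_r(dth_wi)        # IMU attitude error block
    G[19:22, 19:22] = J_r(dth_ic)    # IMU-camera extrinsic attitude block
    G[25:28, 25:28] = J_r(dth_vw)    # visual attitude drift block
    return G @ P @ G.T
```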

4. Experiments

4.1. Setup

An extensive quantitative and qualitative evaluation is carried out to validate all aspects of the state estimation process. This thorough performance analysis is run on the EuRoC benchmark [12] for indoor global positioning evaluation in low-speed flights and on the Fast Flight dataset [11] for outdoor experimentation at relatively high-speed flights. For a fair comparison, all the pipeline processing stages in both Algorithms 1 and 2 are performed on a 16 GB RAM laptop computer running 64-bit Ubuntu 20.04.3 LTS with an AMD Ryzen 7 4800H processor (16 cores, 2.9 GHz) and a Radeon RTX NV166 Renoir graphics card. Table 1 presents detailed quantitative information on our experimental settings, including the benchmarks’ statistics and the sensor parameters.
The front-end of the pipeline, including both the data acquisition and pre-processing steps, is developed as a Python API that sends the optimization variables to the factor graph implemented in C++ using the Ceres solver [40] to achieve the lowest possible system latency before the state estimation process. The Sparse Normal Cholesky linear solver by the Ceres solver is employed to solve the least-squares convex optimization problem formulated in Equation (6) along with the Levenberg–Marquardt trust region strategy with the automatic differentiation tool for Jacobian calculations. The sparse Schur linear method is applied to utilize the Schur complement for a more robust and fast optimization process. The pipeline’s back-end for the state estimation process is developed entirely in MATLAB (https://github.com/AbanobSoliman/VIO_RGB_IMU (accessed on 30 October 2022)) and all the initialization parameters are given explicitly in Table 2.
The performance analysis is performed using the two trajectory evaluation metrics: root mean square error (RMSE) for the Fast Flight dataset compared to the GPS trajectory p g p s , and the RMS absolute trajectory error (ATE) for the EuRoC benchmark compared to the ground truth trajectory T g t provided with Vicon room sequences. The positional RMSE metric for the Fast Flight sequences is chosen because the ground truth GPS trajectories exist with unknown ground truth orientations. However, for EuRoC sequences, we select the RMS ATE metric for two reasons: 1. the Vicon system provides ground truth poses (positions and orientations); and 2. to ensure a fair comparison with the latest state-of-the-art methods based on the same error metric. The two trajectory evaluation metrics are formulated as follows:
\mathrm{RMSE} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \big\| \hat{p}(i) - p_{gps}(i) \big\|^2 }, \qquad \mathrm{ATE} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \big\| p\big( T_{gt}^{-1}(i)\, T_{rel}\, \hat{T}(i) \big) \big\|^2 } \;\; [\mathrm{m}],
where p̂ is the estimated translation vector of the trajectory T̂ ∈ SE(3); p(·) is the translation vector of a pose T ∈ SE(3); and T_rel is the rigid-body transformation corresponding to the least-squares solution that maps the estimated trajectory T̂ onto the ground truth trajectory T_gt, computed by optimization. We keep it constant for all sequences that belong to the same benchmark.
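The two metrics can be computed as follows. The trajectories are assumed to be given as 4x4 homogeneous matrices, and T_rel is obtained from an external least-squares alignment (e.g., a Umeyama-style fit), which is not shown here.

```python
import numpy as np

def rmse_position(p_est, p_gps):
    """Positional RMSE against the GPS trajectory (Fast Flight evaluation)."""
    return np.sqrt(np.mean(np.sum((p_est - p_gps)**2, axis=1)))

def rms_ate(T_est, T_gt, T_rel):
    """RMS ATE: translation error after mapping the estimate onto the ground truth with T_rel."""
    err = []
    for T_e, T_g in zip(T_est, T_gt):
        E = np.linalg.inv(T_g) @ T_rel @ T_e      # residual pose T_gt^-1 * T_rel * T_hat
        err.append(np.linalg.norm(E[:3, 3]))      # translation component of the residual
    return np.sqrt(np.mean(np.square(err)))
```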

4.2. The EuRoC MAV Benchmark

The two main characteristics of the EuRoC MAV sequences are the complex combined 6-DoF motions and the relatively low speeds compared to the Fast Flight sequences. These prominent characteristics allow an accurate evaluation of the ES-EKF marginally stable states, such as the velocity and the visual drift. In Table 3, we report the ATE values as an evaluation parameter for the trajectory estimation accuracy compared to the ground truth. Moreover, Table 3 shows an ablation study that investigates the contribution of the GPS sensor to the overall estimation accuracy, especially for the monocular vision-based optimization methods: ours (PGO) and the recent work of Cioffi et al. [30]. The selection of the six Vicon room sequences from the EuRoC benchmark is because a comparison with an alternative method such as [30] incorporating GPS signals simulated from the Vicon system readings better emphasizes the findings of this ablation study.
A prominent finding of this ablation study is that vision is the most significant type of sensor. In most sequences, the lowest ATE is obtained by fusing the camera trajectory from the vision KLT-based SfM algorithm to a gravity-aligned frame using the noisy simulated GPS data, and adding inertial measurements does not provide a measurable benefit in this case. However, adding the gyroscope measurements to the visual-GPS fusion has led to the least ATE achieved by our PGO model compared to all other discrete-time (DT) methods. Figure 6 and Figure 7 show our trajectory and velocity estimations after incorporating the accelerometer readings in the ES-EKF model, resulting in the lowest achievable errors that can compete with the continuous-time optimization model in [30].

4.3. The Fast Flight Dataset

The main observation, validated on both the EuRoC and Fast Flight sequences (see Table 4 and Figure 8), is that for velocities below 5 m/s, the monocular loosely coupled ES-EKF achieves considerably lower estimation errors than the other filter- or optimization-based methods. For velocities above 5 m/s, our proposed optimization-based initialization scores the lowest RMSE among all the methods compared in Table 4. In addition, the monocular ES-EKF scores a lower RMSE than the best-performing stereo Kalman filter model, S-MSCKF, especially for velocities above 10 m/s.
Since the maximum velocity achieved by the EuRoC MAV is nearly 2.3 m/s, the quantitative results in Table 3 further support this conclusion, with our ES-EKF scoring the best performance compared to the other state-of-the-art methods. The degraded performance at high speeds (above 5 m/s) can be explained by the hardware characteristics of the MAV sensors, such as their data rate, latency, and noise behavior at high speeds. Our optimization-based (PGO) initialization outperforms all other optimization- or filtering-based methods with high-rate visual-inertial sensors.
An insightful overview of the velocity profiles estimated by our ES-EKF is presented in Figure 9. The main conclusion is that the estimated velocity profile during the planar motion of the MAV in the X–Z plane fits the upper and lower bounds of the top speed for each sequence. To investigate the high perturbations in the estimated velocity when approaching the maximum limit, we plot the velocity error states of the ES-EKF; they show a high error at the instants when the MAV approaches its top speed, due to strong vibrations in the MAV structure affecting the IMU readings.
The high estimation accuracy of our ES-EKF model compared to the GPS readings and the PGO optimization-based initialization is further verified by the Y-axis trajectory estimation in Figure 8. The maximum altitude estimated by the ES-EKF for all sequences is nearly 60 m, whereas both the GPS readings and the initialization optimizer estimate a maximum altitude of nearly 100 m. To physically validate which altitude estimate is more accurate, we took snapshots of the scene at a time instance exactly halfway along all trajectories, as shown in Figure 1. We can observe that the MAV is nearly at the same level as the roof of a commercial aircraft hangar, which is in the range of 30 m to 66 m. This observation validates the high accuracy of the altitude estimated by our ES-EKF.

4.4. Real-Time Performance Analysis

Filter-based approaches are more advantageous for real-time onboard applications because they use the CPU more efficiently than the monocular and stereo optimization-based methods. Due to its computationally intensive front-end pipeline for both temporal and stereo matching, OKVIS uses more CPU than VINS-Mono. Additionally, OKVIS’s back-end operates at a rate much faster than VINS-Mono’s fixed 10 Hz. Approximately 90% of the workload of our approach comes from the front-end, which includes ORB feature detection, KLT-based tracking, and matching, rather than from the ES-EKF back-end. At 200 Hz, the filter uses approximately 10% of a core. Our suggested technique offers the highest estimation frequency, which provides the optimal balance between precision and computing cost.
Figure 10 contrasts the CPU time that various VIO solutions used on the EuRoC benchmark and the Fast Flight dataset. Since V2-03 exhibits considerable scale drift with the S-IEKF and S-UKF-LG techniques, and hence significantly worse accuracy than the other methods, its CPU consumption is excluded from the comparison. According to the tests, the ES-EKF achieves the lowest CPU consumption while retaining a similar level of accuracy compared with the other methods. We notice that the proposed method spends more computing effort in the image processing front-end on Fast Flight than in the tests on the EuRoC dataset. One explanation is the higher imaging frequency and resolution; another is that Fast Flight results in a shorter feature lifetime, necessitating frequent identification of new features.

5. Conclusions

Our work aimed to provide an accurate and computationally inexpensive localization solution for MAVs’ long-term navigation in large-scale environments. Towards this goal, we presented a loosely coupled IMU/GPS-camera fusion framework with a pose failure detection methodology. Moreover, we proposed a novel decoupled optimization- and filtering-based sensor fusion technique that achieves high estimation accuracy and minimum system complexity compared to the other methods in the literature. We used real-world indoor and outdoor settings for the MAV localization studies to validate and test the findings of our proposed method.
The vision-based black-box pose estimation accuracy is first examined in a controlled laboratory Vicon room of the EuRoC benchmark. The outcomes confirmed the system’s reliance on monocular vision. The experiments on EuRoC and Fast Flight sequences have shown remarkable accuracy in the trajectory estimation studies. We also evaluated the proposed scheme in terms of computational complexity, measured by CPU usage, where our monocular-vision optimization/filtering solution outperformed all the competing techniques.
This conclusion reinforces our work’s contribution to a reliable (fast and accurate) sensor fusion solution for challenging, large-scale environments. From a future perspective, it will be necessary to incorporate GPS sensor constraints, such as multipath effects, into the optimizer. Finally, further generalizing the optimization problem will be necessary to extend the algorithm’s pose estimation capability to multiple vision sensors (stereo RGB, for instance).

Author Contributions

Conceptualization, A.S.; methodology, A.S.; software, A.S.; validation, A.S., H.H.-A., F.B., D.S. and S.B.; formal analysis, A.S., H.H.-A. and F.B.; investigation, A.S.; resources, S.B.; data curation, A.S.; writing—original draft preparation, A.S.; writing—review and editing, S.B., D.S., H.H.-A., F.B. and A.S.; supervision, S.B., D.S., H.H.-A. and F.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Direction Générale de l’Armement (DGA) and the French National Research Agency as a part of the LOCA3D project in the framework of the MALIN challenge (https://challenge-malin.fr).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The EuRoC MAV benchmark is publicly available online at (https://projects.asl.ethz.ch/datasets/doku.php?id=kmavvisualinertialdatasets (accessed on 1 October 2022)). The Fast Flight dataset is publicly available online at (https://github.com/KumarRobotics/msckf_vio/wiki/Dataset (accessed on 3 October 2022)).

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are ordered according to their first appearance in this article:
MAV	Micro Aerial Vehicle
VO	Visual Odometry
VIO	Visual Inertial Odometry
SLAM	Simultaneous Localization and Mapping
CT/DT	Continuous Time/Discrete Time
KLT	Kanade–Lucas Tracking
PGO	Pose Graph Optimization
ES-EKF	Error States Extended Kalman Filter
S-MSCKF	Stereo Multi-State Constraint Kalman Filter
S-UKF-LG	Stereo Unscented Kalman Filter on Lie Groups
S-IEKF	Stereo (Invariant-)Extended Kalman Filter
ATE	Absolute Trajectory Error
RMS	Root Mean Square
RMSE	Root Mean Square Error

Appendix A. Observability Analysis

The EKF-based VIO for 6-DoF motion estimation contains four unobservable states corresponding to the global position and the rotation around the gravity axis (the yaw angle), as demonstrated in [44]. A naive EKF VIO implementation gathers spurious information about yaw because the process and measurement models are linearized at different points, which causes this unobservability issue. To ensure that the uncertainty of the current camera states in the state vector is not impacted by the uncertainty of the current IMU state during the propagation step, the camera poses in our implementation are represented with respect to their inertial frame (v) instead of the latest IMU frame. Together with the efficient RK4 gyroscope integration scheme during the initialization process, our ES-EKF implementation minimizes the effect of the unobservable modes of the basic EKF. Figure A1 shows the IMU intrinsics, IMU-CAM extrinsic parameters, and odometry scale ES-EKF states plotted for sample EuRoC and Fast Flight sequences.
The main observation from Figure A1 is that when the motion of the MAV is smooth, with no abrupt rotations and translations, our optimization-based initialization estimates an optimal metric-scaled trajectory with λ = 1. Moreover, when the IMU-camera setup is not accurately calibrated, the ES-EKF can optimally align the sensor setup through a robust online calibration process. Furthermore, the IMU biases estimated by our ES-EKF model are accurate and within a sensible range. One crucial observation is the estimated attitude drift of the visual sensor, with consistent drift patterns that depend on the MAV speed (Fast Flight sequences) and on abrupt motions (EuRoC sequences). These observations validate the contribution of the ES-EKF to the resilience of the proposed method, which observes all the state vector parameters in addition to the 6 DoF of the MAV trajectory. Finally, after the initial trajectory optimization, the filtering process remains indispensable to estimate the false camera poses caused by visual attitude drift during long-term navigation.
Figure A1. Our ES−EKF estimated states. Columns from left to right: IMU (accelerometer/gyroscope) biases b a , b ω , odometry scale factor λ , visual drift orientations q v w , and IMU-camera translation online calibration p i c . Rows 1,2 for sample Fast Flight sequences (gps5, gps10) and rows 3, 4 for sample EuRoC sequences (V2-02, V2-03), respectively.

Appendix B. Qd Derivation Equations

The Q_d in Equation (16) can be obtained by carrying out the consecutive matrix multiplications using the following formulas. For simplicity, let t = Δt, σ = dσ, and β = R(q̂_w^i):
Q d 11 = Δ t ( σ n a 2 . t 2 . β . β + σ n ω 2 . A . A + σ n b ω 2 . B . B + σ n b a 2 . t 4 4 . β . β ) ,
Q d 11 = σ n a 2 . t 3 3 . β . β + σ n ω 2 [ ( β a ^ × ) ( I d 3 . t 5 20 + t 7 504 ω ^ × 2 ) ( β a ^ × ) ] + + σ n b ω 2 [ ( β a ^ × ) ( I d 3 . t 7 252 + t 9 8640 ω ^ × 2 ) ( β a ^ × ) ] + σ n b a 2 . t 5 20 . β . β ,
Q d 12 = Δ t ( σ n a 2 . t . β . β + σ n ω 2 . A . C + σ n b ω 2 . B . D + σ n b a 2 . t 3 2 . β . β ) ,
Q d 12 = σ n a 2 . t 2 2 . β . β + σ n ω 2 [ ( β a ^ × ) ( I d 3 . t 4 8 + t 5 60 ω ^ × + t 6 144 ω ^ × 2 ) ( β a ^ × ) ] + + σ n b ω 2 [ ( β a ^ × ) ( I d 3 . t 6 72 + t 7 1008 ω ^ × + t 8 1920 ω ^ × 2 ) ( β a ^ × ) ] + σ n b a 2 . t 4 8 . β . β ,
Q d 13 = Δ t ( σ n ω 2 . A . E + σ n b ω 2 . B . F ) ,
Q d 13 = σ n ω 2 [ ( β a ^ × ) ( I d 3 . t 3 6 + t 4 12 ω ^ × + t 5 40 ω ^ × 2 ) ] + + σ n b ω 2 [ ( β a ^ × ) ( I d 3 . t 5 30 + t 6 144 ω ^ × + 11 . t 7 5040 ω ^ × 2 ) ] ,
Q d 14 = Δ t ( σ n b ω 2 . B ) = σ n b ω 2 [ ( β a ^ × ) ( I d 3 . t 4 24 + t 5 120 ω ^ × t 6 720 ω ^ × 2 ) ] ,
Q d 15 = Δ t ( σ n b a 2 . t 2 2 . β ) = σ n b a 2 . t 3 6 . β ,
Q d 16 = Δ t ( 0 3 × 13 ) = 0 3 × 13 ,
Q d 21 = Δ t ( σ n a 2 . t . β . β + σ n ω 2 . C . A + σ n b ω 2 . D . B + σ n b a 2 . t 3 2 . β . β ) ,
Q d 21 = σ n a 2 . t 2 2 . β . β + σ n ω 2 [ ( β a ^ × ) ( I d 3 . t 4 8 t 5 60 ω ^ × + t 6 144 ω ^ × 2 ) ( β a ^ × ) ] + + σ n b ω 2 [ ( β a ^ × ) ( I d 3 . t 6 72 t 7 1008 ω ^ × + t 8 1920 ω ^ × 2 ) ( β a ^ × ) ] + σ n b a 2 . t 4 8 . β . β ,
Q d 22 = Δ t ( σ n a 2 . β . β + σ n ω 2 . C . C + σ n b ω 2 . D . D + σ n b a 2 . t 2 . β . β ) ,
Q d 22 = σ n a 2 . t . β . β + σ n ω 2 [ ( β a ^ × ) ( I d 3 . t 3 3 + t 5 60 ω ^ × 2 ) ( β a ^ × ) ] + + σ n b ω 2 [ ( β a ^ × ) ( I d 3 . t 5 20 + t 7 504 ω ^ × 2 ) ( β a ^ × ) ] + σ n b a 2 . t 3 3 . β . β ,
Q d 23 = Δ t ( σ n ω 2 . C . E + σ n b ω 2 . D . F ) ,
Q d 23 = σ n ω 2 [ ( β a ^ × ) ( I d 3 . t 2 2 + t 3 6 ω ^ × + t 4 24 ω ^ × 2 ) ] + + σ n b ω 2 [ ( β a ^ × ) ( I d 3 . t 4 8 + t 5 60 ω ^ × + t 6 144 ω ^ × 2 ) ] ,
Q d 24 = Δ t ( σ n b ω 2 . D ) = σ n b ω 2 [ ( β a ^ × ) ( I d 3 . t 3 6 t 4 24 ω ^ × + t 5 120 ω ^ × 2 ) ] ,
Q d 25 = Δ t ( σ n b a 2 . t . β ) = σ n b a 2 . t 2 2 . β ,
Q d 26 = Δ t ( 0 3 × 13 ) = 0 3 × 13 ,
Q d 31 = Δ t ( σ n ω 2 . E . A + σ n b ω 2 . F . B ) ,
Q d 31 = σ n ω 2 [ ( I d 3 . t 3 6 t 4 12 ω ^ × + t 5 40 ω ^ × 2 ) ( β a ^ × ) ] + + σ n b ω 2 [ ( I d 3 . t 5 30 t 6 144 ω ^ × + 11 . t 7 5040 ω ^ × 2 ) ( β a ^ × ) ] ,
Q d 32 = Δ t ( σ n ω 2 . E . C + σ n b ω 2 . F . D ) ,
Q d 32 = σ n ω 2 [ ( I d 3 . t 2 2 t 3 6 ω ^ × + t 4 24 ω ^ × 2 ) ( β a ^ × ) ] + + σ n b ω 2 [ ( I d 3 . t 4 8 t 5 60 ω ^ × + t 6 144 ω ^ × 2 ) ( β a ^ × ) ] ,
Q d 33 = Δ t ( σ n ω 2 . E . E + σ n b ω 2 . F . F ) ,
Q d 33 = σ n ω 2 [ ( I d 3 . t ) ] + σ n b ω 2 [ ( I d 3 . t 3 3 + t 5 60 ω ^ × 2 ) ] ,
Q d 34 = Δ t ( σ n b ω 2 . F ) = σ n b ω 2 [ ( I d 3 . t 2 2 + t 3 6 ω ^ × t 4 24 ω ^ × 2 ) ] ,
Q d 35 = Δ t ( 0 3 × 3 ) = 0 3 × 3 ,
Q d 36 = Δ t ( 0 3 × 13 ) = 0 3 × 13 ,
Q d 41 = Δ t ( σ n b ω 2 . B ) = σ n b ω 2 [ ( β a ^ × ) ( I d 3 . t 4 24 + t 5 120 ω ^ × t 6 720 ω ^ × 2 ) ] ,
Q d 42 = Δ t ( σ n b ω 2 . D ) = σ n b ω 2 [ ( β a ^ × ) ( I d 3 . t 3 6 t 4 24 ω ^ × + t 5 120 ω ^ × 2 ) ] ,
Q d 43 = Δ t ( σ n b ω 2 . F ) = σ n b ω 2 [ ( I d 3 . t 2 2 + t 3 6 ω ^ × t 4 24 ω ^ × 2 ) ] ,
Q d 44 = Δ t ( σ n b ω 2 ) = σ n b ω 2 . t ,
Q d 45 = Δ t ( 0 3 × 3 ) = 0 3 × 3 ,
Q d 46 = Δ t ( 0 3 × 13 ) = 0 3 × 13 ,
Q d 51 = Δ t ( σ n b a 2 . β . t 2 2 ) = σ n b ω 2 . β . t 3 6 ,
Q d 52 = Δ t ( σ n b a 2 . β . t ) = σ n b ω 2 . β . t 2 2 ,
Q d 53 = Δ t ( 0 3 × 3 ) = 0 3 × 3 ,
Q d 54 = Δ t ( 0 3 × 3 ) = 0 3 × 3 ,
Q d 55 = Δ t ( σ n b a 2 ) = σ n b a 2 . t ,
Q d 56 = Δ t ( 0 3 × 13 ) = 0 3 × 13 ,
Q d 61 65 = Δ t ( 0 13 × 3 ) = 0 13 × 3 , Q d 66 = Δ t ( 0 13 × 13 ) = 0 13 × 13 .

References

  1. Soliman, A.; Bonardi, F.; Sidibé, D.; Bouchafa, S. IBISCape: A Simulated Benchmark for multi-modal SLAM Systems Evaluation in Large-scale Dynamic Environments. J. Intell. Robot. Syst. 2022, 106, 53.
  2. Dong, B.; Zhang, K. A Tightly Coupled Visual-Inertial GNSS State Estimator Based on Point-Line Feature. Sensors 2022, 22, 3391.
  3. Gu, N.; Xing, F.; You, Z. GNSS Spoofing Detection Based on Coupled Visual/Inertial/GNSS Navigation System. Sensors 2021, 21, 6769.
  4. Huang, W.; Wan, W.; Liu, H. Optimization-Based Online Initialization and Calibration of Monocular Visual-Inertial Odometry Considering Spatial-Temporal Constraints. Sensors 2021, 21, 2673.
  5. Ma, S.; Bai, X.; Wang, Y.; Fang, R. Robust Stereo Visual-Inertial Odometry Using Nonlinear Optimization. Sensors 2019, 19, 3747.
  6. Zhang, S.; Wang, W.; Li, H.; Zhang, S. EVtracker: An Event-Driven Spatiotemporal Method for Dynamic Object Tracking. Sensors 2022, 22, 6090.
  7. Ren, G.; Yu, Y.; Liu, H.; Stathaki, T. Dynamic Knowledge Distillation with Noise Elimination for RGB-D Salient Object Detection. Sensors 2022, 22, 6188.
  8. Alliez, P.; Bonardi, F.; Bouchafa, S.; Didier, J.Y.; Hadj-Abdelkader, H.; Muñoz, F.I.I.; Kachurka, V.; Rault, B.; Robin, M.; Roussel, D. Real-Time Multi-SLAM System for Agent Localization and 3D Mapping in Dynamic Scenarios. In Proceedings of the International Conference on Intelligent Robots and Systems (IROS 2020), Las Vegas, NV, USA, 25–29 October 2020.
  9. Alonge, F.; Cusumano, P.; D’Ippolito, F.; Garraffa, G.; Livreri, P.; Sferlazza, A. Localization in Structured Environments with UWB Devices without Acceleration Measurements, and Velocity Estimation Using a Kalman-Bucy Filter. Sensors 2022, 22, 6308.
  10. Cao, S.; Gao, H.; You, J. In-Flight Alignment of Integrated SINS/GPS/Polarization/Geomagnetic Navigation System Based on Federal UKF. Sensors 2022, 22, 5985.
  11. Sun, K.; Mohta, K.; Pfrommer, B.; Watterson, M.; Liu, S.; Mulgaonkar, Y.; Taylor, C.J.; Kumar, V. Robust Stereo Visual Inertial Odometry for Fast Autonomous Flight. IEEE Robot. Autom. Lett. 2018, 3, 965–972.
  12. Burri, M.; Nikolic, J.; Gohl, P.; Schneider, T.; Rehder, J.; Omari, S.; Achtelik, M.W.; Siegwart, R. The EuRoC micro aerial vehicle datasets. Int. J. Robot. Res. 2016, 35, 1157–1163.
  13. Qin, T.; Pan, J.; Cao, S.; Shen, S. A general optimization-based framework for local odometry estimation with multiple sensors. arXiv 2019, arXiv:1901.03638.
  14. Yu, Y.; Gao, W.; Liu, C.; Shen, S.; Liu, M. A GPS-aided Omnidirectional Visual-Inertial State Estimator in Ubiquitous Environments. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; pp. 7750–7755.
  15. Mascaro, R.; Teixeira, L.; Hinzmann, T.; Siegwart, R.; Chli, M. GOMSF: Graph-Optimization Based Multi-Sensor Fusion for robust UAV Pose estimation. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, QLD, Australia, 21–25 May 2018; pp. 1421–1428.
  16. Cioffi, G.; Scaramuzza, D. Tightly-coupled Fusion of Global Positional Measurements in Optimization-based Visual-Inertial Odometry. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020–24 January 2021; pp. 5089–5095.
  17. Dai, J.; Liu, S.; Hao, X.; Ren, Z.; Yang, X. UAV Localization Algorithm Based on Factor Graph Optimization in Complex Scenes. Sensors 2022, 22, 5862.
  18. Mourikis, A.I.; Roumeliotis, S.I. A Multi-State Constraint Kalman Filter for Vision-aided Inertial Navigation. In Proceedings of the 2007 IEEE International Conference on Robotics and Automation, Rome, Italy, 10–14 April 2007; pp. 3565–3572.
  19. Brossard, M.; Bonnabel, S.; Barrau, A. Unscented Kalman Filter on Lie Groups for Visual Inertial Odometry. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 649–655.
  20. Bloesch, M.; Burri, M.; Omari, S.; Hutter, M.; Siegwart, R. Iterated extended Kalman filter based visual-inertial odometry using direct photometric feedback. Int. J. Robot. Res. 2017, 36, 1053–1072.
  21. Brunello, A.; Urgolo, A.; Pittino, F.; Montvay, A.; Montanari, A. Virtual Sensing and Sensors Selection for Efficient Temperature Monitoring in Indoor Environments. Sensors 2021, 21, 2728.
  22. Qin, T.; Li, P.; Shen, S. VINS-Mono: A robust and versatile monocular visual-inertial state estimator. IEEE Trans. Robot. 2018, 34, 1004–1020.
  23. Leutenegger, S.; Lynen, S.; Bosse, M.; Siegwart, R.; Furgale, P. Keyframe-based visual–inertial odometry using nonlinear optimization. Int. J. Robot. Res. 2015, 34, 314–334.
  24. Campos, C.; Elvira, R.; Rodríguez, J.J.G.; Montiel, J.M.; Tardós, J.D. ORB-SLAM3: An accurate open-source library for visual, visual–inertial, and multimap SLAM. IEEE Trans. Robot. 2021, 37, 1874–1890.
  25. Usenko, V.; Demmel, N.; Schubert, D.; Stückler, J.; Cremers, D. Visual-inertial mapping with non-linear factor recovery. IEEE Robot. Autom. Lett. 2019, 5, 422–429.
  26. Schimmack, M.; Haus, B.; Mercorelli, P. An Extended Kalman Filter as an Observer in a Control Structure for Health Monitoring of a Metal–Polymer Hybrid Soft Actuator. IEEE/ASME Trans. Mechatron. 2018, 23, 1477–1487.
  27. Mercorelli, P. A switching Kalman Filter for sensorless control of a hybrid hydraulic piezo actuator using MPC for camless internal combustion engines. In Proceedings of the 2012 IEEE International Conference on Control Applications, Dubrovnik, Croatia, 3–5 October 2012; pp. 980–985.
  28. Huang, G.; Kaess, M.; Leonard, J.J. Towards consistent visual-inertial navigation. In Proceedings of the 2014 IEEE International Conference on Robotics and Automation (ICRA), Hong Kong, China, 31 May–7 June 2014; pp. 4926–4933.
  29. Huang, P.; Meyr, H.; Dörpinghaus, M.; Fettweis, G. Observability Analysis of Flight State Estimation for UAVs and Experimental Validation. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 4659–4665.
  30. Cioffi, G.; Cieslewski, T.; Scaramuzza, D. Continuous-Time Vs. Discrete-Time Vision-Based SLAM: A Comparative Study. IEEE Robot. Autom. Lett. 2022, 7, 2399–2406.
  31. Nurhakim, A.; Ismail, N.; Saputra, H.M.; Uyun, S. Modified Fourth-Order Runge-Kutta Method Based on Trapezoid Approach. In Proceedings of the 2018 4th International Conference on Wireless and Telematics (ICWT), Nusa Dua, Bali, Indonesia, 12–13 July 2018; pp. 1–5.
  32. Lv, M.; Wei, H.; Fu, X.; Wang, W.; Zhou, D. A Loosely Coupled Extended Kalman Filter Algorithm for Agricultural Scene-Based Multi-Sensor Fusion. Front. Plant Sci. 2022, 13, 9260.
  33. Sola, J. Quaternion kinematics for the error-state Kalman filter. arXiv 2017, arXiv:1711.02508.
  34. Sommer, C.; Usenko, V.; Schubert, D.; Demmel, N.; Cremers, D. Efficient Derivative Computation for Cumulative B-Splines on Lie Groups. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, 13–19 June 2020; pp. 11145–11153.
  35. Trawny, N.; Roumeliotis, S.I. Indirect Kalman filter for 3D attitude estimation. Eng. Tech. Rep. 2005, 2, 2005.
  36. Moulon, P.; Monasse, P.; Marlet, R. Adaptive Structure from Motion with a Contrario Model Estimation. In Proceedings of the Computer Vision—ACCV 2012, Daejeon, Republic of Korea, 5–9 November 2012; Lee, K.M., Matsushita, Y., Rehg, J.M., Hu, Z., Eds.; Springer: Berlin/Heidelberg, Germany, 2013; pp. 257–270.
  37. Nister, D. An efficient solution to the five-point relative pose problem. IEEE Trans. Pattern Anal. Mach. Intell. 2004, 26, 756–770.
  38. Tomasi, C.; Kanade, T. Detection and tracking of point features. Int. J. Comput. Vis. 1991, 9, 137–154.
  39. Wang, Y.; Chirikjian, G.S. Nonparametric second-order theory of error propagation on motion groups. Int. J. Robot. Res. 2008, 27, 1258–1273.
  40. Agarwal, S.; Mierle, K. Ceres Solver. Available online: https://github.com/ceres-solver/ceres-solver (accessed on 10 October 2022).
  41. Geneva, P.; Eckenhoff, K.; Lee, W.; Yang, Y.; Huang, G. OpenVINS: A Research Platform for Visual-Inertial Estimation. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 4666–4672.
  42. Zuo, X.; Merrill, N.; Li, W.; Liu, Y.; Pollefeys, M.; Huang, G.P. CodeVIO: Visual-Inertial Odometry with Learned Optimizable Dense Depth. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 14382–14388.
  43. Rosinol, A.; Abate, M.; Chang, Y.; Carlone, L. Kimera: An open-source library for real-time metric-semantic localization and mapping. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 1689–1696.
  44. Li, M.; Mourikis, A.I. High-precision, consistent EKF-based visual-inertial odometry. Int. J. Robot. Res. 2013, 32, 690–711.
Figure 1. An example of the on-map GPS readings for the large-scale environment of the Fast Flight dataset [11] sequences gps175, gps15, gps10, and gps5. The sequence number denotes the maximum flight velocity of each sequence: 17.5, 15, 10, and 5 m/s, respectively. The color bar (bottom) denotes the map scale in km on the x axis and the altitude of each sequence in m on the y axis. In the blue dotted box: comparison of the MAV's maximum altitude just before the descent stage with the height of an aircraft hangar; the estimated height of this airport asset is 54.72 m, corresponding to the maximum MAV altitude. Images are courtesy of Google Earth.
Figure 2. Visual odometry is generally categorized together with self-contained and global localization methods.
Figure 3. Overview of the complete architecture of our proposed system.
Figure 4. Initialization factor graph. $p(u)$ is the CT-GPS trajectory generated at high frequency. RK4 is the Runge–Kutta 4th-order gyroscope integration scheme. Dotted lines denote the error term $(\hat{T}_i^{-1}\hat{T}_j)$ in Equation (7) between any two KLT-VO poses.
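For readers who want to prototype the RK4 gyroscope integration referenced in Figure 4, the sketch below shows one standard way to integrate the quaternion kinematics with a classical Runge–Kutta 4th-order step. It assumes an (x, y, z, w) quaternion ordering and a linearly interpolated gyro rate over the sampling interval; the function names are illustrative, and this is a sketch rather than the exact implementation used in the paper.

```python
import numpy as np

def quat_mult(q, r):
    """Hamilton product of two quaternions stored as (x, y, z, w)."""
    x1, y1, z1, w1 = q
    x2, y2, z2, w2 = r
    return np.array([
        w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,
        w1 * y2 - x1 * z2 + y1 * w2 + z1 * x2,
        w1 * z2 + x1 * y2 - y1 * x2 + z1 * w2,
        w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2,
    ])

def q_dot(q, omega):
    """Quaternion derivative q_dot = 0.5 * q (x) (omega, 0)."""
    return 0.5 * quat_mult(q, np.array([omega[0], omega[1], omega[2], 0.0]))

def rk4_gyro_step(q, omega_k, omega_k1, dt):
    """One RK4 step of the attitude quaternion given gyro rates at t_k and t_{k+1}."""
    omega_mid = 0.5 * (omega_k + omega_k1)        # linearly interpolated midpoint rate
    k1 = q_dot(q, omega_k)
    k2 = q_dot(q + 0.5 * dt * k1, omega_mid)
    k3 = q_dot(q + 0.5 * dt * k2, omega_mid)
    k4 = q_dot(q + dt * k3, omega_k1)
    q_new = q + (dt / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
    return q_new / np.linalg.norm(q_new)          # re-normalize to the unit sphere

# Example: integrate a constant 0.1 rad/s yaw rate over one 5 ms IMU step.
q0 = np.array([0.0, 0.0, 0.0, 1.0])
w = np.array([0.0, 0.0, 0.1])
print(rk4_gyro_step(q0, w, w, 0.005))
```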
Figure 5. The frames of reference annotations.
Figure 6. EuRoC 3D trajectory estimation compared to the ground truth.
Figure 7. Estimated velocity profiles validated against the ground truth for sample sequences from the EuRoC benchmark.
Figure 8. Fast Flight X (top), Y (middle), and Z (bottom) trajectory estimation compared to the GPS readings.
Figure 9. (Top): Fast Flight velocity profiles validated against the top speed of each sequence. (Bottom): velocity error states in the ES-EKF.
Figure 10. CPU usage as a real-time performance analysis indicator.
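Figure 10 reports CPU usage as a real-time performance indicator. As a rough illustration of how such a curve can be sampled, the snippet below polls per-process CPU load with the third-party psutil package; the monitored process and the 1 s sampling interval are arbitrary choices for this sketch, not the profiling setup used in the paper.

```python
import psutil  # third-party: pip install psutil

# Sample the CPU usage of the current process; replace psutil.Process() with
# psutil.Process(pid_of_estimator) to monitor a separately running estimator node.
proc = psutil.Process()
samples = []
for _ in range(10):
    samples.append(proc.cpu_percent(interval=1.0))  # % of one core over the last second

print(f"mean CPU usage over {len(samples)} s: {sum(samples) / len(samples):.1f}%")
```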
Table 1. Insights into our experiments' statistical information and sensor settings.

| | Parameter | EuRoC Benchmark [12] | Fast Flight Dataset [11] |
|---|---|---|---|
| Stats | Total processed sequences | 6 (Vicon room) | 4 (airport runway) |
| | Total sequences duration | 11.6111 min | 8.8867 min |
| | Total sequences length | 411.5425 m | 2539.0599 m ¹ |
| | Maximum speed | 2.3 m/s | 17.5 m/s |
| Camera | Total processed frames | 13,736 | 21,312 |
| | Frame resolution | 752 × 480 pixels | 960 × 800 pixels |
| | Intrinsics (f_x, f_y, c_x, c_y) | 458.65, 457.30, 367.22, 248.38 | 606.58, 606.73, 474.93, 402.28 |
| | Distortion (k_1, k_2, p_1, p_2) | −0.2834, 0.0739, 0.0001, 0.000018 | −0.0147, −0.0058, 0.0072, −0.0046 |
| | Camera-IMU p_i^c (x, y, z, 1) (m) | −0.0216, −0.0647, 0.0098, 1.0000 | 0.1058, −0.0177, −0.0089, 1.0000 |
| | Camera-IMU q_i^c (x, y, z, w) [-] | −0.0077, 0.0105, 0.7018, 0.7123 | −1.0000, 0.0042, −0.0039, 0.0015 |
| | Frame rate | 20 Hz | 40 Hz |
| IMU | Gyroscope noise density (σ_nω) | 1.6968 × 10⁻⁴ [rad/s/√Hz] | 6.1087 × 10⁻⁵ [rad/s/√Hz] |
| | Gyroscope random walk (σ_nbω) | 1.9393 × 10⁻⁵ [rad/s²/√Hz] | 9.1548 × 10⁻⁵ [rad/s²/√Hz] |
| | Accelerometer noise density (σ_na) | 2.0000 × 10⁻³ [m/s²/√Hz] | 1.3734 × 10⁻³ [m/s²/√Hz] |
| | Accelerometer random walk (σ_nba) | 3.0000 × 10⁻³ [m/s³/√Hz] | 2.7468 × 10⁻³ [m/s³/√Hz] |
| | Data rate (1/Δt) | 200 Hz | 200 Hz |
| GPS | Type/operation | Indoors/Vicon system | Outdoors/satellite triangulation |
| | Readings | X (m), Y (m), Z (m) | Long. (deg), Lat. (deg), Alt. (m) |
| | Data rate | 1 Hz (down-sampled) | 5 Hz |

¹ Denotes the exact total trajectory length over all Fast Flight sequences, shown on the x axis of Figure 1 (≈2.5 km).
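Because the Fast Flight GPS readings arrive as geodetic (longitude, latitude, altitude) tuples while the EuRoC "GPS" readings are already metric Vicon coordinates, the outdoor readings must be mapped to a local metric frame before fusion. The paper's exact conversion is not reproduced here; the sketch below uses a common small-area equirectangular approximation around a reference fix, and the function name and WGS-84 radius constant are chosen purely for illustration.

```python
import numpy as np

EARTH_RADIUS = 6378137.0  # WGS-84 equatorial radius in meters

def geodetic_to_enu(lon_deg, lat_deg, alt_m, lon0_deg, lat0_deg, alt0_m):
    """Map a (longitude, latitude, altitude) GPS fix to local East-North-Up
    coordinates around a reference fix, using a small-area equirectangular
    approximation (adequate for trajectories of a few kilometers)."""
    lat0 = np.radians(lat0_deg)
    east = EARTH_RADIUS * np.radians(lon_deg - lon0_deg) * np.cos(lat0)
    north = EARTH_RADIUS * np.radians(lat_deg - lat0_deg)
    up = alt_m - alt0_m
    return np.array([east, north, up])

# Example: a fix 0.001 deg east of the reference at 45 deg latitude is ~78.7 m east.
print(geodetic_to_enu(2.001, 45.0, 60.0, 2.000, 45.0, 10.0))
```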
Table 2. The ES-EKF initialization parameters for both the EuRoC and Fast Flight sequences.

| Parameter Initialization | EuRoC Benchmark [12] | Fast Flight Dataset [11] |
|---|---|---|
| 28-element error state vector (x̃̂) | 0_{28×1} | 0_{28×1} |
| 31-element state vector ¹ (X) | [0_{3×1}, 0_{3×1}, q̄, 0_{3×1}, 0_{3×1}, 1, p_i^c, q_i^c, 0_{3×1}, q̄] | same structure, with p_i^c and q_i^c of Table 1 |
| States propagation covariance (P) | 10⁻⁷ × Id_{28} | 10⁻¹² × Id_{28} |
| CT process noise covariance ² (Q_c) | diag(d·σ²_{na}·Id_3, d·σ²_{nba}·Id_3, d·σ²_{nω}·Id_3, d·σ²_{nbω}·Id_3) (common to both datasets) | |
| Measurement noise covariance (R) | diag(0.01, 0.01, 0.03, 10⁻⁴, 10⁻⁴, 10⁻⁴) (common to both datasets) | |

¹ q̄ denotes the unity quaternion [0, 0, 0, 1]. ² The IMU noise density values for each dataset are taken from Table 1 and discretized using Equation (11).
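As a concrete reading of Table 2, the following sketch initializes the filter quantities with the EuRoC values of Tables 1 and 2. The state-vector ordering and the Δt-based discretization of the continuous-time noise densities are assumptions made for illustration (the paper's Equation (11) is not reproduced here), so this should be treated as a sketch rather than the authors' implementation.

```python
import numpy as np

# EuRoC sensor settings from Table 1 (ordering and units assumed)
dt = 1.0 / 200.0                                    # IMU sampling period (200 Hz)
sigma_na, sigma_nba = 2.0000e-3, 3.0000e-3          # accel noise density / random walk
sigma_nw, sigma_nbw = 1.6968e-4, 1.9393e-5          # gyro noise density / random walk
p_ic = np.array([-0.0216, -0.0647, 0.0098])         # camera-IMU translation (m)
q_ic = np.array([-0.0077, 0.0105, 0.7018, 0.7123])  # camera-IMU rotation (x, y, z, w)
q_unit = np.array([0.0, 0.0, 0.0, 1.0])             # unity quaternion

# 28-element error state and its propagation covariance (Table 2)
x_err = np.zeros(28)
P = 1e-7 * np.eye(28)

# 31-element nominal state: p, v, q, b_a, b_w, scale, p_ic, q_ic, drift p, drift q
# (this ordering is an assumption consistent with the 31-element count)
X0 = np.concatenate([np.zeros(3), np.zeros(3), q_unit,
                     np.zeros(3), np.zeros(3), [1.0],
                     p_ic, q_ic, np.zeros(3), q_unit])
assert X0.size == 31

# Discretized process noise, assuming Q_c = diag(dt * sigma^2) blocks as in Table 2
Qc = np.diag(np.concatenate([
    dt * sigma_na**2 * np.ones(3),
    dt * sigma_nba**2 * np.ones(3),
    dt * sigma_nw**2 * np.ones(3),
    dt * sigma_nbw**2 * np.ones(3),
]))

# Measurement noise covariance (Table 2)
R = np.diag([0.01, 0.01, 0.03, 1e-4, 1e-4, 1e-4])
```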
Table 3. Ablation study on the contribution of the GPS sensor to the system accuracy. The latest state-of-the-art (monocular/stereo) VI-SLAM systems are compared to our proposed trajectory initialization (PGO factors) and ES-EKF state estimation methods. Bold denotes the most accurate.

RMS ATE [m] on the EuRoC Benchmark [12]:

| | Method | V1-01 | V1-02 | V1-03 | V2-01 | V2-02 | V2-03 | Avg. |
|---|---|---|---|---|---|---|---|---|
| Mono-VIO | OKVIS [23] | 0.090 | 0.200 | 0.240 | 0.130 | 0.160 | 0.290 | 0.185 |
| | ROVIO [20] | 0.100 | 0.100 | 0.140 | 0.120 | 0.140 | 0.140 | 0.123 |
| | VINS-Mono [22] | 0.047 | 0.066 | 0.180 | 0.056 | 0.090 | 0.244 | 0.114 |
| | OpenVINS [41] | 0.056 | 0.072 | 0.069 | 0.098 | 0.061 | 0.286 | 0.107 |
| | CodeVIO ¹ [42] | 0.054 | 0.071 | 0.068 | 0.097 | 0.061 | 0.275 | 0.104 |
| | Cioffi et al. ² [16] | 0.034 | 0.035 | 0.042 | 0.026 | 0.033 | 0.057 | 0.038 |
| Stereo-VI | VINS-Fusion [13] | 0.076 | 0.069 | 0.114 | 0.066 | 0.091 | 0.096 | 0.085 |
| | BASALT [25] | 0.040 | 0.020 | 0.030 | 0.030 | 0.020 | 0.050 | 0.032 |
| | Kimera [43] | 0.050 | 0.110 | 0.120 | 0.070 | 0.100 | 0.190 | 0.107 |
| | ORB-SLAM3 [24] | 0.038 | 0.014 | 0.024 | 0.032 | 0.014 | 0.024 | 0.024 |
| Mono-(V/I/G) ³ | CT (V+I+G) [30] | 0.024 | 0.014 | 0.011 | 0.012 | 0.010 | 0.010 | 0.014 |
| | CT (V+G) [30] | 0.011 | 0.013 | 0.012 | 0.009 | 0.008 | 0.012 | 0.011 |
| | CT (I+G) [30] | 0.062 | 0.102 | 0.117 | 0.112 | 0.164 | 0.363 | 0.153 |
| | DT (V+I+G) [30] | 0.016 | 0.024 | 0.018 | 0.009 | 0.018 | 0.033 | 0.020 |
| | DT (V+G) [30] | 0.010 | 0.025 | 0.024 | 0.010 | 0.012 | 0.029 | 0.018 |
| | DT (I+G) [30] | 0.139 | 0.137 | 0.138 | 0.138 | 0.138 | 0.139 | 0.138 |
| | Ours (PGO) | 0.008 | 0.017 ⁴ | 0.023 ⁴ | 0.008 | 0.022 | 0.025 ⁴ | 0.017 |
| | Ours (ES-EKF) | 0.009 | 0.012 | 0.011 | 0.010 | 0.011 | 0.010 | 0.011 |

¹ Denotes the only learning-based baseline in the table; it incorporates point clouds using LiDAR. ² Denotes values from the original work with four GPS readings connected to each optimization state. ³ V, I, G: vision, IMU, and GPS (generated from the Vicon system readings). ⁴ Denotes that KLT-VO tracks features over 5 consecutive frames instead of 10 due to the rapid movement of the MAV.
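For reference, the RMS ATE values reported above follow the usual convention of taking the root-mean-square Euclidean error between time-associated estimated and ground-truth positions after trajectory alignment. The helper below is a minimal sketch under that assumption (a prior SE(3)/Sim(3) alignment, e.g., an Umeyama fit, is presumed); the function name is hypothetical and not tied to the paper's evaluation scripts.

```python
import numpy as np

def rms_ate(p_est: np.ndarray, p_gt: np.ndarray) -> float:
    """RMS absolute trajectory error for time-associated, aligned N x 3 positions."""
    err = np.linalg.norm(p_est - p_gt, axis=1)   # per-timestamp Euclidean error
    return float(np.sqrt(np.mean(err ** 2)))

# Example: a constant 1 cm offset yields an RMS ATE of 0.01 m.
p_gt = np.zeros((100, 3))
p_est = p_gt + np.array([0.01, 0.0, 0.0])
print(rms_ate(p_est, p_gt))  # -> 0.01
```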
Table 4. Ablation study on the effect of high MAV speeds on the accuracy of filtering approaches compared to optimization approaches. The first sub-section compares monocular (VINS-Mono and ours) to stereo (OKVIS) optimization-based VI systems. The second sub-section compares stereo filtering-based approaches to our proposed method. Bold denotes the most accurate in each sub-section.

RMSE (m) on the Fast Flight dataset [11]:

| Method | gps5 | gps10 | gps15 | gps175 | Avg. |
|---|---|---|---|---|---|
| OKVIS [23] | 3.224 | 4.987 | 3.985 | 4.535 | 4.183 |
| VINS-Mono [22] | 5.542 | 8.753 | 2.875 | 3.452 | 5.156 |
| Ours (PGO) | 0.417 | 0.759 | 0.180 | 0.927 | 0.571 |
| S-MSCKF [11] | 4.985 | 2.751 | 4.752 | 7.852 | 5.085 |
| S-UKF-LG [19] | 4.875 | 2.589 | 5.128 | 7.865 | 5.114 |
| S-IEKF [19] | 4.986 | 2.544 | 5.124 | 8.152 | 5.201 |
| Ours (ES-EKF) | 4.751 | 7.924 | 7.221 | 9.488 | 7.346 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
