MAV Localization in Large-Scale Environments: A Decoupled Optimization/Filtering Approach

Developing new sensor fusion algorithms has become indispensable to tackle the daunting problem of GPS-aided micro aerial vehicle (MAV) localization in large-scale landscapes. Sensor fusion should guarantee high-accuracy estimation with the least amount of system delay. Towards this goal, we propose a linear optimal state estimation approach for the MAV to avoid complicated and high-latency calculations, together with an immediate metric-scale recovery paradigm that uses low-rate noisy GPS measurements when available. Our proposed strategy shows how the vision sensor can quickly bootstrap a pose that has been arbitrarily scaled and recovered from the various drifts that affect vision-based algorithms. Thanks to our proposed optimization/filtering-based methodology, we can consider the camera as a "black-box" pose estimator. This keeps the sensor fusion algorithm's computational complexity constant and makes it suitable for the MAV's long-term operation in expansive areas. Because GPS sensors provide limited global tracking and localization data, our MAV localization solution accounts for the sensor measurement uncertainty constraints under such circumstances. Extensive quantitative and qualitative analyses utilizing real-world large-scale MAV sequences demonstrate the superior performance of our technique in comparison with the most recent state-of-the-art algorithms in terms of trajectory estimation accuracy and system latency.


Introduction
Robust localization of micro aerial vehicles (MAVs) in uncharted large-scale areas can rely on complementary data gathered by many sensor modalities. The study of simultaneous localization and mapping (SLAM), primarily used for MAV navigation in expansive and dynamic settings, may be enriched and expanded by using multi-modal datasets [1]. These settings have certain traits, such as the dynamic range of the scene's object intensities. For instance, mapping a small interior space with adequate illumination might yield higher quality than mapping a rural area at night with heavy rain, wind, and fog (an outdoor dynamic environment). The benefits of multimodal approaches become apparent when systems rely on sensors with a high dynamic range and strong sensing capabilities, such as event cameras, LiDARs, or radars, or on typical inexpensive cameras fused with other sensor modalities such as inertial measurement units (IMUs) and GPS sensors. These multimodal approaches can compensate for missing data during scene mapping and MAV localization.
Towards this aim, we develop a trustworthy (fast and precise) localization solution that utilizes information from three sensor modalities: camera frame data, IMU measurements, and GPS readings. Nevertheless, GPS readings are consistently slower and noisier than those from the IMU or camera modules, and they frequently experience signal loss in GPS-restricted locations. Therefore, a localization system that depends on GPS data must perform effectively when GPS readings are lost.

Figure 1. The Fast Flight [11] sequences: gps175, gps15, gps10, and gps5, where the sequence number denotes the maximum flight velocity of each sequence: 17.5, 15, 10, and 5 (m/s), respectively. The color bar (bottom) denotes the map scale in (km) on the x axis and the altitude of each sequence in (m) on the y axis. In the blue dotted box: a comparison of the maximum MAV altitude at the instant before the descent stage with the height of an aircraft hangar; the estimated airport asset height of 54.72 (m) corresponds to the maximum MAV altitude. Images are courtesy of Google Earth.
The accuracy of the state estimation process relies on an error-state extended Kalman filter (ES-EKF) and the bootstrapping quality of its states. A well-established IMU-based state estimator initialization technique was discussed in [5]. In this bootstrapping method, the global metric scale of the trajectory and the IMU-camera gravity alignment are optimized using the preintegration of a certain number of IMU readings combined with an initial up-to-scale trajectory estimated using the camera only. This bootstrapping process is prone to failure due to insufficient IMU excitation, especially when the MAV navigates over planar terrain.
The MAV should contain a localization system that continually calculates the pose with high accuracy and low latency during search and rescue missions, for instance. The MAV is equipped with restricted resources regarding the data processing unit and the limited power source capacity for long-term navigation operations in large-scale situations. In light of this, the state estimate approach should consistently have low computational complexity and resist sensor readings that deviate from the norm.
Our work's main contributions to tackle the aforementioned challenges are three-fold:
- In the case of state estimator initialization failure, we propose a unique instant bootstrapping technique based on continuous-time manifold optimization via pose graph optimization (PGO) and range factors, which depends on low-rate GPS signals.
- A closed-form estimation method without nonlinear optimization during IMU/CAM fusion produces reduced system latency with constant CPU computational complexity. The mathematical modeling of a linear ES-EKF with a precise and fast gyroscope integration strategy accounts for the simplicity of our proposed localization solution.
- Our IMU/GPS-CAM fusion system has been thoroughly tested on two publicly available real-world benchmarks: the EuRoC benchmark [12], for MAV localization assessment in indoor environments, and the Fast Flight dataset [11], for large-scale outdoor environments. With thorough ablation studies into the role of each sensor modality in the overall accuracy of the state estimation process, the assessment is conducted against the most recent state-of-the-art visual-inertial odometry methodologies.

Figure 2 presents a global overview of the current state-of-the-art approaches for localization. The ability to continually estimate the robot's ego-motion (position and orientation) over time is a significant challenge in autonomous navigation, path planning, object tracking, and collision avoidance platforms [13]. The Global Positioning System (GPS) is a well-known localization method applied in several autonomous system domains; GPS is one kind of global navigation satellite system (GNSS) [10]. GPS is used as a self-localization source, e.g., for MAV security applications, and gives any user with a GPS receiver positional information with meter-level precision.
Satellite signal blockage, high noise levels, multipath effects, and other issues with GPS, on the other hand, make it a less trustworthy standalone sensor for self-localization modules. However, real-time kinematic (RTK) and precise point positioning (PPP) [9], two GPS technologies that are rapidly developing, can provide locations with decimeter- or centimeter-level precision. The effectiveness of GPS satellite signals heavily depends on the surrounding environment; it works best in locations with clear skies and is ineffective for indoor navigation since walls and other obstacles impede it [14]. This makes the GPS module an unsuitable primary sensor for reliable autonomous vehicle localization under adverse weather and environmental conditions. Hence, the fusion of GPS signals with other inertial and/or visual sensors is indispensable for a reliable localization solution, especially in such environments. State-of-the-art sensor fusion systems are differentiated into two prominent families: loosely- [15] and tightly-coupled [16] fusion strategies. In loosely-coupled fusion, the camera frames for pose estimation are processed as a black-box. A filter or an optimization model is developed to fuse the arbitrarily-scaled poses from the visual sensor with the noisy metric-scaled re-integrated IMU readings [17].

Localization
On the contrary, in the tightly coupled approach, scene information from the visual sensor is fused with the IMU measurements (linear accelerations and angular velocities) using a fusion filter or an optimization model that estimates the metric-scaled pose, visual odometry scale factor, IMU biases, and visual drift between the IMU-camera inertial frames. One of the prominent advantages of a tightly-coupled fusion scheme is that it can estimate accurate scene information to reconstruct a precise scene map, along with providing the SLAM system with high confidence in loop closure during re-localization situations.

Fusion Strategies
The two sensor fusion strategies (loosely and tightly coupled) have two main execution techniques: filter-based and optimization-based execution. Some filter-based state-of-the-art approaches are deterministic, such as MSCKF [18], S-MSCKF [11], S-UKF-LG/S-IEKF [19], and ROVIO [20]. At the same time, alternative strategies can be based on non-deterministic filters such as particle filters [21].
Optimization-based methods such as VINS-Mono [22], OKVIS [23], ORB-SLAM [24], and BASALT [25], can be deterministic or non-deterministic based on the optimization strategy and the convergence constraints. The estimation and robustness of visual localization frameworks have advanced significantly in recent decades, and this development may be furthered by tightly integrating visual and inertial data. Most methods integrate data utilizing optimization methods or filtering-based procedures.
Filtering approaches are ideally suited to real-time applications [26,27], which is the main emphasis of this study. In contrast, optimization-based methods are more precise but often have a more extensive processing complexity. The observability-constrained technique addresses the consistency issue, a shortcoming of traditional VIO filter-based algorithms [28]. The EKF/MSCKF and its cutting-edge variations are among the most widely used solutions because they effectively balance accuracy and computational complexity.
A recent study [29] shows that if the air mass's random character is considered, the EKF system states of an MAV are observable. The drag and lift forces on the MAV will directly impact the projected pose and velocity due to the nature of air mass randomization. To make an online update for the uncertainties brought on by these random effects on the precise position of the sensors' reference frames, we contribute with a visual drift augmentation technique to our EKF measurement model. The EKF's ability to tolerate significant disturbances in the MAV's velocity state variable and still converge to the undisturbed estimates is what we target.

Visual Odometry
The main objective of a visual odometry solution is to perform an accurate and precise localization of the robot (ground or aerial vehicle) to estimate its pose during the navigation task. Estimated poses can be on either discrete- or continuous-time manifolds. Cioffi et al. [30] studied the reliability of the estimated poses on both manifolds using IMU/Visual/GPS sensors. They came to an important conclusion: the two representations produce similar results when the camera and IMU are time-synchronized.
In [13], the sliding window pose-graph optimization of the most recent robot states uses global position data with poses predicted by a VIO method. Like [15], pose-graph optimization employs an independent VIO technique to generate pose estimations fused with GPS data. In contrast to [13], the pose-graph in [15] includes an extra node representing the local coordinate frame's origin to confine the absolute orientation. However, these methods are loosely connected, meaning that a separate VIO algorithm generates the relative pose estimations. Inspired by [13,15], we present a loosely coupled strategy that considers the correlations between all measures by including them in a hybrid optimization and filtering problem.
It is demonstrated in [23] that considering all measurement correlations is essential for high-precision estimations in the visual-inertial situation. A tightly coupled sliding window optimization for visual and inertial data with loosely connected GPS refinement is presented in [14]. The GPS readings are given the same timestamp as the temporally nearest image to be included in the sliding window because it is believed that they would only be accessible at low rates. As opposed to [14], we efficiently compute the global positional factors by closely coupling the global position measurements using the Runge-Kutta 4th-order gyroscope preintegration scheme [31]. This enables the sliding window to incorporate numerous global parameters and each keyframe with barely any additional processing load.

Methodology Background
We highlight the methodology that inspires our study in blue-dashed rectangles in Figure 2. The loosely coupled fusion strategy [32] is adopted to keep the computational complexity constant for real-time performance, along with a reset mode for the framework, as discussed in [33], and an online IMU-camera extrinsic calibration paradigm [4]. Integrating the IMU/GPS readings with the monocular frames of a global-shutter visual sensor raises our localization solution's accuracy, leveraging the MAV's inertial and global localization information.
To push the limits of the extended Kalman filter and raise the robustness of our localization solution towards a resilient system, we leverage the high accuracy of optimization to initialize the filter pose states using a novel instant approach that utilizes the low-rate noisy GPS readings when available. Sensor fusion on continuous-time (CT) manifolds, such as B-splines [34], suffers from high execution complexity, especially with the time derivatives of high-order manifolds for integrating the IMU measurements in the estimation process. Hence, in our novel method, we avoid this dilemma with a simple spline-fitting approach for the GPS readings during the data pre-processing stage.

System Architecture
Our core sensor setup consists of an inertial navigation sensor (IMU), a global positioning sensor (GPS), and a monocular camera, as illustrated in Figure 3. The pipeline starts with the data acquisition and pre-processing for the initialization process, as discussed in Section 3.1. The initialization is an optimization-based phase (see Algorithm 1) with a considerably low complexity and processing time whose output is an instant metric-scaled pose estimated from the camera, GPS, and gyroscope readings. Then, an ES-EKF (see Algorithm 2) whose dynamic model is given in Section 3.2, is applied to estimate all the system states, including the MAV's trajectory, velocity, and a scale factor to recover the initially estimated trajectory in the case of GPS readings loss. Finally, we present the measurement model in Section 3.3 with a novel false pose augmentation paradigm to ensure the observability of all the filter states, as analyzed in Appendix A.

Algorithm 1 Bootstrapping: Pose Graph Optimization and Range Factors
Input: RGB frames (c), camera matrix (K_c), GPS readings (DT-GPS), IMU readings (I). The algorithm first computes the initial orientations and then iterates the initial trajectory optimization until convergence.

The state representation is a 31-element state vector X:

X = [ p^i_w  v^i_w  q^i_w  b_ω  b_a  λ  q^c_i  p^c_i  q^w_v  p^w_v ]^T

where p^i_w is the position of the IMU in the world frame (the world frame (w) is a gravity-aligned frame), v^i_w its velocity, and q^i_w its attitude quaternion describing a rotation from the IMU frame (i) into the world frame (w). b_ω and b_a are the gyroscope and accelerometer biases, and λ is the visual odometry scale factor. R(q) is the rotation matrix of quaternion q, g is the gravity vector aligned with the world frame (w), and Ω(ω) is the quaternion-multiplication matrix of ω.
The IMU/camera calibration states are the rotation from the camera frame into the IMU frame q^c_i and the position of the camera center with respect to the IMU frame p^c_i. Finally, the visual attitude drifts between the black-boxed vision frame (v) (the vision frame is the frame in which the camera pose is estimated by the black-box vision framework) and the world inertial frame (w) are reflected in q^w_v, and the translational ones in p^w_v. We assume that all the visual drifts are spatial without any temporal drifts, i.e., the IMU and the camera have synchronized timestamps.

Figure 3. Overview of our proposed entire system architecture.
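As a concrete illustration, the 31-element state vector can be laid out as index slices over a flat array (a minimal sketch; the slice names and layout order are ours, not taken from the paper's implementation):

```python
import numpy as np

# Hypothetical index layout for the 31-element ES-EKF state vector X.
# Quaternions are stored as 4-vectors [w, x, y, z]; other blocks are
# 3-vectors, except the scalar visual-odometry scale factor lambda.
STATE_SLICES = {
    "q_w_i_pos": slice(0, 3),    # p^i_w: IMU position in the world frame
    "v_w_i":     slice(3, 6),    # v^i_w: IMU velocity in the world frame
    "q_w_i":     slice(6, 10),   # q^i_w: IMU-to-world attitude quaternion
    "b_w":       slice(10, 13),  # gyroscope bias
    "b_a":       slice(13, 16),  # accelerometer bias
    "lam":       slice(16, 17),  # visual-odometry scale factor
    "q_i_c":     slice(17, 21),  # q^c_i: camera-to-IMU rotation (extrinsics)
    "p_i_c":     slice(21, 24),  # p^c_i: camera center in the IMU frame
    "q_v_w":     slice(24, 28),  # q^w_v: rotational visual drift
    "p_v_w":     slice(28, 31),  # p^w_v: translational visual drift
}

def make_initial_state():
    """Return a 31-element state with identity quaternions and unit scale."""
    X = np.zeros(31)
    for key in ("q_w_i", "q_i_c", "q_v_w"):
        X[STATE_SLICES[key].start] = 1.0   # identity quaternion: w = 1
    X[STATE_SLICES["lam"]] = 1.0           # unit visual-odometry scale
    return X
```

The block sizes (3 + 3 + 4 + 3 + 3 + 1 + 4 + 3 + 4 + 3) sum to the 31 elements stated above, which is a useful consistency check when wiring up the filter.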

The corresponding 28-element error state vector is defined by:

x̃ = [ Δp^i_w  Δv^i_w  δθ^i_w  Δb_ω  Δb_a  Δλ  δθ^c_i  Δp^c_i  δθ^w_v  Δp^w_v ]^T

as the difference of an estimate x̂ to its true quantity x, i.e., x̃ = x − x̂. We apply this to all state variables except the error quaternions, which are defined by:

δq = q ⊗ q̂⁻¹ ≈ [1, ½ δθ^T]^T    (3)

This error quaternion representation increases the numerical stability of the estimation process and handles the quaternion in its minimal representation [35].
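The small-angle error-quaternion injection described above can be sketched in a few lines of numpy (a hedged sketch using the Hamilton [w, x, y, z] convention; function names are ours):

```python
import numpy as np

def quat_mul(q, p):
    """Hamilton product of two quaternions stored as [w, x, y, z]."""
    w1, x1, y1, z1 = q
    w2, x2, y2, z2 = p
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])

def apply_error(q_hat, delta_theta):
    """Inject a minimal 3-DoF attitude error into a quaternion estimate:
    q = delta_q (x) q_hat, with delta_q ~ [1, delta_theta/2] renormalized
    to unit length for numerical stability."""
    dq = np.concatenate(([1.0], 0.5 * np.asarray(delta_theta, dtype=float)))
    dq /= np.linalg.norm(dq)
    return quat_mul(dq, q_hat)
```

Renormalizing δq before the product is what keeps the updated quaternion on the unit sphere despite the first-order approximation.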

State Estimator Initialization
An incremental structure from motion (SfM) algorithm [36] is applied to the acquired image frames, whose goal is to retrieve the camera poses and the 3D structure of the scene, based on the five-point algorithm proposed in [37]. ORB features are detected, and the highest quality points are tracked between 10 consecutive frames using the KLT method [38].
To solve the arbitrary-scale problem of the camera-only trajectory, we apply an on-manifold cumulative B-spline (https://github.com/AbanobSoliman/B-splines (accessed on 1 October 2022)) interpolation [34] to synthesize a smooth continuous-time (CT) trajectory in R^3 from the low-rate noisy GPS readings.
The matrix form of the cumulative B-spline manifold of order k = n + 1, where n is the spline degree, is modeled at t ∈ [t_i, t_{i+k−1}] as:

p(u) = p_i + Σ_{j=1}^{k−1} B̃^{(k)}_j(u) · d^i_j,   with   B̃^{(k)}_j(u) = B^{(k)}_j ū^{(k)}(u)

where p(u) ∈ R^3 is the continuous-time B-spline increment that interpolates k GPS measurements on the normalized unit of time u(t) := (t − t_i)/Δt_s − P_n, with 1/Δt_s denoting the spline generation frequency and P_n ∈ [0, ..., k − 1] being the pose number that contributes to the current spline segment. p_i is the initial discrete-time (DT) GPS location measurement at time t_i. The term d^i_j = p_{i+j} − p_{i+j−1} is the difference vector between two consecutive DT-GPS readings. The matrix B^{(k)}_j is the cumulative basis blending matrix and ū^{(k)} = [1, u, ..., u^{k−1}]^T is the normalized time vector.

Our GPS-IMU-aided initialization system comprises two optimization factors: the first is a pose graph optimization (PGO) factor r_p that optimizes the 6-DoF of every pose, whereas the second is a range factor r_s that constrains the translation limits between every two KLT-VO poses. Hence, the metric scale of the visual odometry pose is recovered using the gyroscope and GPS readings, leveraging the high accuracy of the optimization process. An illustrative scheme of the initialization factor graph is shown in Figure 4.
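For the common cubic case (k = 4), the cumulative form above can be evaluated directly in numpy. This is a sketch under our own naming; the blending polynomials are the standard cumulative basis of a uniform cubic B-spline, not copied from the paper's repository:

```python
import numpy as np

# Cumulative blending matrix for a uniform cubic B-spline (k = 4):
# row j holds the coefficients of B~_j(u) = c0 + c1*u + c2*u^2 + c3*u^3.
C_TILDE = np.array([
    [6.0,  0.0,  0.0,  0.0],   # B~_0(u) = 1
    [5.0,  3.0, -3.0,  1.0],   # B~_1(u)
    [1.0,  3.0,  3.0, -2.0],   # B~_2(u)
    [0.0,  0.0,  0.0,  1.0],   # B~_3(u)
]) / 6.0

def ct_gps_position(p_ctrl, u):
    """Evaluate p(u) = p_i + sum_j B~_j(u) d_j over one spline segment.
    p_ctrl: (4, 3) array of k = 4 consecutive DT-GPS control points;
    u: normalized segment time in [0, 1)."""
    u_vec = np.array([1.0, u, u**2, u**3])
    b = C_TILDE @ u_vec               # cumulative basis values B~_j(u)
    d = np.diff(p_ctrl, axis=0)       # d_j = p_{i+j} - p_{i+j-1}
    return p_ctrl[0] + b[1:] @ d
```

On a straight, uniformly sampled GPS track the spline reproduces the line exactly (affine precision of B-splines), which makes this a cheap sanity check for the CT-GPS pre-processing stage.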
Level 1's objective function L_{p,s} is modeled as:

L_{p,s} = Σ_{i,j} ( ||r_p||²_{Σ^p_{i,j}} + ||r_s||²_{Σ^s_{i,j}} )    (6)

where Σ^p_{i,j}, Σ^s_{i,j} are the information matrices associated with the GPS readings covariance, reflecting the PGO and range factor noises on the global metric scale estimation process between two aligned RGB frames. Both residuals are evaluated on the relative transformation (T̃^{-1}_i T_j) in Equation (7) between any two KLT-VO poses.
Pose Graph Optimization (PGO) factor. The PGO is a 6-DoF factor that controls the relative pose error between two consecutive edges i, j and is formulated as:

r_p = || Log( T̃^{-1}_{i,j} · T^{-1}_i T_j ) ||²₂    (7)

where ||·||₂ is the L2-norm and T̃_{i,j} ∈ SE(3) is the relative pose estimated from the front-end pipeline at frames i, j. The operator Log(·) is the SE(3) logarithmic map as defined in [39]. The rotational error term in SO(3) is the gyroscope-integrated increment δR^ω_{ij} = Π_{k=i}^{j} Exp(ω_k · dk), computed using the Runge-Kutta 4th-order (RK4) integration method [31] between the keyframes i and j.
Range factor. The range factor limits the front-end visual drift and keeps the global metric scale under control within a sensible range defined by the GPS signal, and is formulated as:

r_s = || ||t̃_{i,j}||₂ − ||p^{GPS}_{i,j}||₂ ||²

where t̃_{i,j}, p^{GPS}_{i,j} ∈ R^3 are the translation vectors of two consecutive front-end (KLT-VO) poses and CT-GPS signals, respectively.
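The two residuals above can be sketched in plain numpy (the actual factors run inside Ceres; the SO(3) logarithm is hand-rolled here, and all names and simplifications are ours):

```python
import numpy as np

def so3_log(R):
    """Axis-angle vector of a rotation matrix (safe near the identity)."""
    cos_t = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos_t)
    if theta < 1e-9:
        return np.zeros(3)
    w = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    return theta / (2.0 * np.sin(theta)) * w

def pgo_residual(Ri, ti, Rj, tj, R_meas, t_meas):
    """6-DoF relative-pose residual between graph nodes i and j:
    compares the predicted relative pose T_i^-1 T_j with the measured one."""
    R_rel = Ri.T @ Rj
    t_rel = Ri.T @ (tj - ti)
    r_rot = so3_log(R_meas.T @ R_rel)   # rotational part of the SE(3) error
    r_trans = t_rel - t_meas            # translational part
    return np.concatenate([r_rot, r_trans])

def range_residual(t_vo_ij, p_gps_ij):
    """Scalar range factor: keeps the KLT-VO step length consistent with
    the CT-GPS step length (the metric-scale constraint)."""
    return np.linalg.norm(t_vo_ij) - np.linalg.norm(p_gps_ij)
```

A solver such as Ceres would square-and-weight these residuals with the information matrices of Equation (6); here they are left raw for clarity.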

Dynamic Model
The core state estimation is performed by fusing the RGB camera frames and the IMU readings using an error-state extended Kalman filter (ES-EKF). Figure 5 illustrates the inter-sensor extrinsic relation between the IMU/GPS sensors and a monocular camera.
Thus, the real angular velocities ω and accelerations a in the IMU body frame (i) can be written as:

ω = ω_m − b_ω − n_ω,    a = a_m − b_a − n_a

where the subscript m denotes the measured value. The dynamics of the non-static biases are modeled as a random process:

ḃ_ω = n_{b_ω},    ḃ_a = n_{b_a}

The standard deviations σ_{b_ω}, σ_{b_a}, σ_ω, σ_a are generally given by the IMU manufacturer's data in Allan deviation plots. Since the filter is applied at discrete time steps, we convert these values according to their units:

σ_{ω,d} = σ_ω / √Δt,   σ_{a,d} = σ_a / √Δt,   σ_{b_ω,d} = σ_{b_ω} √Δt,   σ_{b_a,d} = σ_{b_a} √Δt

The following differential equations govern the IMU state propagation:

ṗ^i_w = v^i_w,   v̇^i_w = R(q^i_w)(a_m − b_a − n_a) − g,   q̇^i_w = ½ Ω(ω_m − b_ω − n_ω) q^i_w,   ḃ_ω = n_{b_ω},   ḃ_a = n_{b_a},   λ̇ = 0

For the quaternion integration inside the ES-EKF, we use the first-order integrator defined in [35]:

q̂_{k+1} = e^κ q̂_k,   with   ω̄ = (ω̂_{k+1} + ω̂_k)/2,   κ = ½ Ω(ω̄) Δt

where the hat termˆdenotes the estimated value. The exponential term e^κ is expanded by the Maclaurin series. The state transition matrix F_d is modeled block-wise over the error states (Δp, Δv, δθ, Δb_ω, Δb_a, Δλ) as:

F_d =
⎡ I₃   Δt·I₃   A    B    ·    0 ⎤
⎢ 0    I₃      C    D    ·    0 ⎥
⎢ 0    0       E    F    0    0 ⎥
⎢ 0    0       0    I₃   0    0 ⎥
⎢ 0    0       0    0    I₃   0 ⎥
⎣ 0    0       0    0    0    1 ⎦

Then, applying the small-angle approximation |ω| → 0 and the de l'Hôpital rule, we obtain a compact solution for the six matrix blocks A, B, C, D, E, F [35], with ω̂ = ω_m − b̂_ω, â = a_m − b̂_a, and ⌊ω̂×⌋, ⌊â×⌋ the skew-symmetric matrices of the IMU readings.
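The gyroscope propagation step q̇ = ½ Ω(ω) q referenced above can be integrated with the RK4 scheme as follows. This is a self-contained sketch assuming piecewise-constant ω over the step and the [w, x, y, z] quaternion convention; it is not the paper's exact integrator:

```python
import numpy as np

def omega_matrix(w):
    """Quaternion-multiplication matrix Omega(w) for q' = 0.5 * Omega(w) q,
    with quaternions stored as [w, x, y, z]."""
    wx, wy, wz = w
    return np.array([
        [0.0, -wx, -wy, -wz],
        [wx,  0.0,  wz, -wy],
        [wy, -wz,  0.0,  wx],
        [wz,  wy, -wx,  0.0],
    ])

def rk4_quat_step(q, w, dt):
    """One Runge-Kutta 4th-order step of the quaternion kinematics,
    assuming the angular velocity w is constant over the step."""
    f = lambda qq: 0.5 * omega_matrix(w) @ qq
    k1 = f(q)
    k2 = f(q + 0.5 * dt * k1)
    k3 = f(q + 0.5 * dt * k2)
    k4 = f(q + dt * k3)
    q_next = q + dt / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
    return q_next / np.linalg.norm(q_next)  # renormalize to unit length
```

For a constant rotation rate the integrated quaternion should match the analytic solution [cos(θ/2), 0, 0, sin(θ/2)] for a z-axis rotation of total angle θ, which is an easy regression test for the propagation code.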
We can now derive the discrete-time input noise covariance matrix Q_d as:

Q_d = ∫₀^Δt F_d(τ) G_c Q_c G_c^T F_d(τ)^T dτ

where Q_c is the CT process noise covariance and G_c maps the continuous-time noise vector into the error-state space. The closed-form solution of the complete derivation of the Q_d covariance matrix is given in detail in Appendix B. Finally, the propagated state covariance matrix is computed as:

P_{k+1|k} = F_d P_{k|k} F_d^T + Q_d
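When a closed form is not at hand, the integral above can be evaluated numerically with Van Loan's method, which is a convenient cross-check against a derivation like the one in Appendix B. This is a generic sketch (requires scipy), not the paper's own code:

```python
import numpy as np
from scipy.linalg import expm

def discretize_noise(F_c, G_c, Q_c, dt):
    """Van Loan discretization: returns (F_d, Q_d) for the CT system
    x' = F_c x + G_c w, with E[w w^T] = Q_c * delta(t - tau)."""
    n = F_c.shape[0]
    M = np.zeros((2 * n, 2 * n))
    M[:n, :n] = -F_c
    M[:n, n:] = G_c @ Q_c @ G_c.T
    M[n:, n:] = F_c.T
    E = expm(M * dt)
    F_d = E[n:, n:].T          # discrete state transition matrix
    Q_d = F_d @ E[:n, n:]      # discrete-time noise covariance
    return F_d, Q_d
```

For a scalar random walk (F_c = 0, G_c = 1) this reduces to the familiar Q_d = σ²Δt, which the test below checks.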

Measurement Model
The main contribution of our measurement model for an observable ES-EKF is the false relative pose augmentation methodology: the visual drift quaternion state at the previous time step (k) is updated with the current camera measurement at time (k+1). The camera position measurement model yields the position of the camera with respect to the vision frame p^c_v, scaled by λ and offset by the translational drift p^w_v, with additive noise n_p. The error in this measurement is modeled as z̃_p = z_p − ẑ_p and linearized as z̃_{pL} = H_p x̃, using the definition of the error quaternion in Equation (3). The vision algorithm also yields the rotation from the camera frame into the vision frame q^c_v, obtained by chaining the camera-to-IMU rotation q^c_i, the IMU attitude q^i_w, and the rotational drift q^w_v; we model its error measurement as the corresponding error quaternion. Finally, the measurement Jacobian H in z̃ = H x̃ is calculated based on the method in [33], and the position and attitude blocks are stacked together, with the Jacobian matrices H^q_{xy} being the right Jacobians of SO(3).

States Update
To update the framework for the current time step (k+1), we compute the innovation term S, the Kalman gain K, and the state correction vector x̃, defined as:

S = H P_{k+1|k} H^T + R,   K = P_{k+1|k} H^T S^{-1},   x̃ = K z̃

The error state covariance is updated as follows:

P_{k+1|k+1} = (I − KH) P_{k+1|k} (I − KH)^T + K R K^T

where R_[6×6] = diag(R_position, R_orientation) is the measurement noise covariance matrix. The error quaternion is calculated by Equation (3) to ensure its unit length, and the state vector is then updated: X_{k+1} = X_k + x̃.
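The update equations map one-to-one onto a few numpy lines. This generic sketch uses the same Joseph-form covariance update as above, which keeps P symmetric positive semi-definite regardless of the gain:

```python
import numpy as np

def es_ekf_update(P, H, R, z_tilde):
    """One ES-EKF measurement update.
    P: prior error-state covariance, H: measurement Jacobian,
    R: measurement noise covariance, z_tilde: measurement residual.
    Returns the error-state correction x_tilde and the posterior covariance."""
    S = H @ P @ H.T + R                       # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)            # Kalman gain
    x_tilde = K @ z_tilde                     # error-state correction
    I_KH = np.eye(P.shape[0]) - K @ H
    P_post = I_KH @ P @ I_KH.T + K @ R @ K.T  # Joseph form
    return x_tilde, P_post
```

In the scalar case with P = R = H = 1 the posterior variance must equal 0.5, which is a quick sanity check on the implementation.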
For the quaternion state updates:

q̂_{k+1} = q̂_k ⊗ δq̂,   with   δq̂ ≈ [1, ½ δθ^T]^T normalized to unit length

where δθ is the error state of the corresponding quaternion at time step (k+1).

Reset Mode
The ES-EKF reset mode is performed by setting x̃ ← 0 and P ← G P G^T, where G is the Jacobian matrix defined by G = diag(Id_6, J_r^{wi}, Id_10, J_r^{ic}, Id_3, J_r^{vw}).

Setup
An extensive quantitative and qualitative evaluation is carried out to validate all aspects of the state estimation process. This thorough performance analysis is run on the EuRoC benchmark [12] for indoor system global positioning evaluation in low-speed flights, and on the Fast Flight dataset [11] for outdoor experimentation at relatively high-speed flights. For a fair comparison, all the pipeline processing stages in both Algorithms 1 and 2 are performed on a 16 GB RAM laptop computer running 64-bit Ubuntu 20.04.3 LTS with an AMD Ryzen 7 4800H 16-core 2.9 GHz processor and a Radeon RTX NV166 Renoir graphics card. In Table 1, we present the quantitative details of our experiment settings regarding the benchmarks' statistical data and the sensors' parameters.

The front-end of the pipeline, including both the data acquisition and pre-processing steps, is developed as a Python API that sends the optimization variables to the factor graph implemented in C++ using the Ceres solver [40] to achieve the lowest possible system latency before the state estimation process. The Sparse Normal Cholesky linear solver of Ceres is employed to solve the least-squares convex optimization problem formulated in Equation (6), along with the Levenberg-Marquardt trust region strategy and the automatic differentiation tool for Jacobian calculations. The sparse Schur linear method is applied to utilize the Schur complement for a more robust and fast optimization process. The pipeline's back-end for the state estimation process is developed entirely in MATLAB (https://github.com/AbanobSoliman/VIO_RGB_IMU (accessed on 30 October 2022)), and all the initialization parameters are given explicitly in Table 2.
The performance analysis is performed using two trajectory evaluation metrics: the root mean square error (RMSE) for the Fast Flight dataset, compared to the GPS trajectory p_gps, and the RMS absolute trajectory error (ATE) for the EuRoC benchmark, compared to the ground truth trajectory T_gt provided with the Vicon room sequences. The positional RMSE metric is chosen for the Fast Flight sequences because ground truth GPS trajectories exist but the ground truth orientations are unknown. For the EuRoC sequences, we select the RMS ATE metric for two reasons: 1. the Vicon system provides ground truth poses (positions and orientations); and 2. it ensures a fair comparison with the latest state-of-the-art methods based on the same error metric. The two trajectory evaluation metrics are formulated as follows:

RMSE = sqrt( (1/N) Σ_{i=1}^{N} || p̂_i − p^{gps}_i ||² )   [m]
ATE = sqrt( (1/N) Σ_{i=1}^{N} || p(T_rel T̂_i) − p(T^{gt}_i) ||² )   [m]    (30)

where p̂ is the estimated translation vector of the T̂ ∈ SE(3) trajectory; p(·) is the translation vector of a T ∈ SE(3) pose; and T_rel is the rigid-body transformation corresponding to the least-squares solution that maps the T̂ trajectory onto the T_gt trajectory, calculated by optimization. We set it constant for all sequences that belong to the same benchmark.
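Both metrics can be sketched as follows. The SE(3) alignment T_rel is computed here with the standard closed-form rigid (Umeyama-style, no scale) least-squares fit; this is our generic sketch, not the paper's exact evaluation code:

```python
import numpy as np

def align_rigid(P_est, P_gt):
    """Closed-form least-squares SE(3) alignment mapping P_est onto P_gt.
    Both inputs are (N, 3) arrays of trajectory positions."""
    mu_e, mu_g = P_est.mean(axis=0), P_gt.mean(axis=0)
    H = (P_est - mu_e).T @ (P_gt - mu_g)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T                     # rotation (reflection-safe)
    t = mu_g - R @ mu_e                    # translation
    return R, t

def rmse(P_est, P_ref):
    """Positional RMSE without alignment (Fast Flight vs. GPS trajectory)."""
    return np.sqrt(np.mean(np.sum((P_est - P_ref) ** 2, axis=1)))

def rms_ate(P_est, P_gt):
    """RMS ATE after rigid alignment (EuRoC vs. Vicon ground truth)."""
    R, t = align_rigid(P_est, P_gt)
    return rmse(P_est @ R.T + t, P_gt)
```

A trajectory that is an exact rotated-and-translated copy of the ground truth should score an ATE of (numerically) zero, which verifies the alignment step.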

The EuRoC MAV Benchmark
The two main characteristics of the EuRoC MAV sequences are the complex combined 6-DoF motions and the relatively low speeds compared to the Fast Flight sequences. These prominent characteristics allow an accurate evaluation of the marginally stable ES-EKF states, such as the velocity and the visual drift. In Table 3, we report the ATE values as an evaluation parameter for the trajectory estimation accuracy compared to the ground truth. Moreover, Table 3 shows an ablation study that investigates the contribution of the GPS sensor to the overall estimation accuracy, especially for the monocular vision-based optimization methods: ours (PGO) and the recent work of Cioffi et al. [30]. We select the six Vicon room sequences from the EuRoC benchmark because a comparison with an alternative method such as [30], incorporating GPS signals simulated from the Vicon system readings, better emphasizes the findings of this ablation study.

Table 3. Ablation study on the contribution of the GPS sensor to the system accuracy. The latest state-of-the-art (monocular/stereo) VI-SLAM systems are compared to our proposed trajectory initialization (PGO factors) and ES-EKF state estimation methods. Bold denotes the most accurate.

A prominent finding of this ablation study is that vision is the most significant sensor modality. In most sequences, the lowest ATE is obtained by fusing the camera trajectory from the vision KLT-based SfM algorithm into a gravity-aligned frame using the noisy simulated GPS data; adding inertial measurements does not provide a measurable benefit in this case. However, adding the gyroscope measurements to the visual-GPS fusion leads to the lowest ATE among all discrete-time (DT) methods, achieved by our PGO model.
Figures 6 and 7 show our trajectory and velocity estimations after incorporating the accelerometer readings in the ES-EKF model, resulting in the lowest achievable errors that can compete with the continuous-time optimization model in [30].

The Fast Flight Dataset
The main observation, which is validated on both the EuRoC and Fast Flight sequences (see Table 4 and Figure 8), is that for velocities below 5 (m/s), the monocular loosely coupled ES-EKF achieves considerably lower estimation errors than the other filter- or optimization-based methods. For velocities above 5 (m/s), our proposed optimization-based initialization scores the lowest RMSE compared to all other methods in Table 4. Moreover, the monocular ES-EKF scores the lowest RMSE, especially for velocities above 10 (m/s), compared to the best-performing stereo Kalman filter model, S-MSCKF.

Table 4. Ablation study on the effect of high MAV speed on the accuracy of the filtering approaches compared to optimization approaches. The first sub-section compares monocular (VINS-Mono and Ours) to stereo (OKVIS) optimization-based VI systems. The second sub-section compares stereo filtering-based approaches to our proposed method. Bold denotes the most accurate in each sub-section.

Since the maximum achieved velocity of the EuRoC MAV is nearly 2.3 (m/s), the quantitative results in Table 3 further support this conclusion, where our ES-EKF scores the best performance compared to the other state-of-the-art methods. An in-depth explanation for the degraded performance at high speeds (above 5 (m/s)) lies in the hardware characteristics of the MAV's sensors, such as the data rate, latency, and noise effects at high speeds. Our optimization-based (PGO) initialization outperforms all other optimization- or filtering-based methods with high-rate visual-inertial sensors.
An insightful overview of the velocity profiles estimated by our ES-EKF is represented in Figure 9. The main conclusion is that the estimated velocity profile during the planar motion of the MAV in the X-Z plane optimally fits the upper and lower bounds of the top speed for each sequence. Towards an in-depth investigation to understand the high perturbations in the estimated velocity when approaching the maximum limit, we plot the velocity error states in the ES-EKF showing a high error at the instances when approaching top speeds due to the strong vibrations in the MAV structure affecting the IMU readings. The high estimation accuracy of our ES-EKF model compared to GPS readings and the PGO optimization-based initialization process is further verified by the Y axis trajectory estimation in Figure 8. The maximum estimated altitude for all sequences by the ES-EKF is nearly 60 (m), whereas both the GPS readings and the initialization optimizer estimate a maximum altitude of nearly 100 (m). To physically validate which is a more accurate altitude estimation, we took snippets of the scene at a time instance in the exact halfway of all trajectories as shown in Figure 1. We can observe that the MAV is nearly on the same level as the roof of a commercial aircraft hangar, which is in the range of 30 (m) to 66 (m). This observation validates the high estimation accuracy of the altitude using our ES-EKF.

Real-Time Performance Analysis
The filter-based approaches are more advantageous for real-time onboard applications because they use the CPU more efficiently than the monocular and stereo optimization-based methods. Due to its computationally intensive front-end pipeline for both temporal and stereo matching, OKVIS uses more CPU than VINS-Mono. Additionally, OKVIS's back-end operates at a speed that is much faster than the fixed 10 (Hz) rate of VINS-Mono. Approximately 90% of the work in our back-end, the ES-EKF, is brought on by the front-end, which includes ORB feature detection, KLT-based tracking, and matching. At 200 (Hz), the filter uses approximately 10% of a core. Our suggested technique offers the highest estimation frequency, which provides the optimal balance between precision and computing cost. Figure 10 contrasts how much CPU time various VIO solutions used on the EuRoC benchmark and the Fast Flight dataset. Since V2-03 exhibits considerable scale drift with the S-IEKF and S-UKF-LG techniques, and hence significantly worse accuracy than the other methods, the CPU consumption of V2-03 is excluded from the comparison. According to the tests, the ES-EKF achieves the lowest CPU consumption while retaining a similar level of accuracy compared with the other methods. We notice that the proposed method puts more computing work into the image processing front-end than in the tests using the EuRoC dataset. One explanation is the higher imaging frequency and resolution; another is that Fast Flight results in a shorter feature lifetime, necessitating frequent new feature detection. Figure 10. CPU usage as a real-time performance analysis indicator.

Conclusions
Our work aimed to provide an accurate and computationally inexpensive localization solution for MAVs' long-term navigation in large-scale environments. Toward this goal, we presented a loosely coupled IMU/GPS/camera fusion framework with a pose-failure detection methodology. Moreover, we proposed a novel decoupled optimization- and filtering-based sensor fusion technique that achieves high estimation accuracy with minimum system complexity compared to the other methods in the literature. We used real-world indoor and outdoor settings for the MAV localization studies to validate and test the findings of our proposed method.
The vision-based black-box pose estimation accuracy is first examined in the controlled laboratory Vicon room of the EuRoC benchmark. The outcomes confirmed the system's reliability when relying solely on monocular vision. The experiments on the EuRoC and Fast Flight sequences have shown remarkable accuracy in the trajectory estimation studies. We also evaluated the proposed scheme in terms of computational complexity, measured by CPU usage, where our monocular-vision optimization/filtering solution outperformed all the competing techniques.
This conclusion reinforces our work's contributions toward a reliable (fast and accurate) sensor fusion solution for challenging and large-scale environments. From a future perspective, it will be necessary to incorporate situations where GPS sensor constraints, such as multipath effects, degrade the optimizer. Finally, further generalizing the optimization problem will be necessary to extend the algorithm's pose estimation capability to include multiple vision sensors (stereo RGB, for instance).

Data Availability Statement: The EuRoC MAV benchmark is publicly available online at (https://projects.asl.ethz.ch/datasets/doku.php?id=kmavvisualinertialdatasets (accessed on 1 October 2022)). The Fast Flight dataset is publicly available online at (https://github.com/KumarRobotics/msckf_vio/wiki/Dataset (accessed on 3 October 2022)).

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this article, listed in order of first appearance:

Appendix A. Observability Analysis
As demonstrated in [44], EKF-based VIO for 6-DOF motion estimation contains four unobservable states, corresponding to the global position and the rotation around the gravity axis (the yaw angle). A naive EKF VIO implementation will therefore gather spurious information about yaw. This unobservability is caused by the different linearization points used for the process and measurement models. To ensure that the uncertainty of the camera states in the state vector is not impacted by the uncertainty of the current IMU state during the propagation step, our implementation represents the camera poses in the state vector with respect to the inertial frame (v) instead of the latest IMU frame. Together with the efficient RK4 gyroscope integration scheme used during the initialization process, our ES-EKF implementation minimizes the effect of the unobservable modes of the basic EKF. Figure A1 shows the IMU intrinsics, IMU-camera extrinsic parameters, and odometry-scale ES-EKF states plotted for sample EuRoC and Fast Flight sequences.
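To illustrate the RK4 gyroscope integration scheme mentioned above, the following minimal Python/NumPy sketch integrates the quaternion kinematics q̇ = ½ q ⊗ [0, ω] over one IMU interval; it is a generic textbook form of the scheme, not our flight code, and the function names are chosen for this example:

```python
import numpy as np

def quat_mul(q, r):
    """Hamilton product of two quaternions in [w, x, y, z] convention."""
    w0, x0, y0, z0 = q
    w1, x1, y1, z1 = r
    return np.array([
        w0*w1 - x0*x1 - y0*y1 - z0*z1,
        w0*x1 + x0*w1 + y0*z1 - z0*y1,
        w0*y1 - x0*z1 + y0*w1 + z0*x1,
        w0*z1 + x0*y1 - y0*x1 + z0*w1,
    ])

def q_dot(q, omega):
    """Quaternion kinematics: q_dot = 0.5 * q (x) [0, omega]."""
    return 0.5 * quat_mul(q, np.concatenate(([0.0], omega)))

def rk4_gyro_step(q, omega0, omega1, dt):
    """One RK4 attitude-integration step from two successive gyro samples.

    The angular rate at the midpoint is approximated by linear
    interpolation between the two samples, as is common when applying
    RK4 to discretely sampled IMU measurements.
    """
    omega_mid = 0.5 * (omega0 + omega1)
    k1 = q_dot(q, omega0)
    k2 = q_dot(q + 0.5 * dt * k1, omega_mid)
    k3 = q_dot(q + 0.5 * dt * k2, omega_mid)
    k4 = q_dot(q + dt * k3, omega1)
    q_next = q + (dt / 6.0) * (k1 + 2.0*k2 + 2.0*k3 + k4)
    return q_next / np.linalg.norm(q_next)  # renormalize to a unit quaternion
```

For a constant angular rate, this recovers the exact axis-angle rotation up to the RK4 truncation error, which is far below gyroscope noise at typical IMU rates.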
The main observation from Figure A1 is that when the motion of the MAV is smooth, with no abrupt rotations or translations, our optimization-based initialization estimates an optimal metric-scaled trajectory with λ = 1. Moreover, we observe that when the IMU-camera setup is not accurately calibrated, the ES-EKF can optimally align the sensor setup through a robust online calibration process. Furthermore, the IMU biases estimated by our ES-EKF model are accurate and within a sensible range. One crucial observation is the estimated attitude drift of the visual sensor and the detection of consistent drift patterns that depend on the MAV speed (Fast Flight sequences) and abrupt motions (EuRoC sequences). These observations validate the contribution of the ES-EKF to the resilience of the proposed system, which observes all the state-vector parameters in addition to all 6 DoF of the MAV trajectory. Finally, after the initial trajectory optimization, the filtering process is indispensable for correcting the false camera poses caused by visual attitude drifts during long-term navigation.

Figure A1. Our ES-EKF estimated states. Columns, from left to right: IMU (accelerometer/gyroscope) biases b_a, b_ω; odometry scale factor λ; visual drift orientations q^w_v; and online-calibrated IMU-camera translation p^c_i. Rows 1 and 2 show sample Fast Flight sequences (gps5, gps10); rows 3 and 4 show sample EuRoC sequences (V2-02, V2-03), respectively.