MarsBird-VII: An Autonomous Stereo–Inertial Navigation System with Real-Time Optimization for a Mars Rotorcraft Space Drone

Xiao, Ju; Qiu, Hanchen; Zhou, Yukun; Wang, Rui; Liu, Peng

doi:10.3390/drones10050346

by Ju Xiao, Hanchen Qiu, Yukun Zhou, Rui Wang and Peng Liu *

Pattern Recognition and Intelligent System Research Center, Harbin Institute of Technology, No. 92, Xidazhi Street, Nangang District, Harbin 150001, China

* Author to whom correspondence should be addressed.

Drones 2026, 10(5), 346; https://doi.org/10.3390/drones10050346

Submission received: 7 March 2026 / Revised: 26 April 2026 / Accepted: 1 May 2026 / Published: 4 May 2026

(This article belongs to the Special Issue Navigation, Control and Mission Planning Advances for Safe, Efficient and Autonomous Drones: 2nd Edition)

Highlights

What are the main findings?

Mission-constrained stereo–inertial navigation and real-time optimization for a Mars rotorcraft (Tianwen-3 concept).
A Parity-Window sliding-window back-end achieves bounded per-update complexity and deterministic onboard execution under Tianwen-3-class compute constraints.

What are the implications of the main findings?

Enables reliable real-time navigation under the tight computing budgets of space avionics.
Flight experiments demonstrate accurate and robust state estimation and provide strong evidence for the feasibility of Tianwen-3-class rotorcraft navigation.

Abstract

Reliable autonomous navigation for Tianwen-3-class Mars rotorcraft must satisfy both sampling-level accuracy and hard real-time execution under severe onboard computational constraints. To address this challenge, we develop MarsBird-VII, a mission-constrained stereo visual–inertial navigation system that combines a computation-aware vision front-end with a Parity-Window sliding-window optimization back-end. The front-end decouples high-rate tracking from feature replenishment to bound perception latency, while the back-end alternates updates over interleaved state subsets and preserves full-window coupling through unified marginalization. Unlike simply reducing the sliding-window size, the proposed strategy reduces the per-update optimization cost without shrinking the geometric observation horizon, thereby improving the accuracy–runtime trade-off for embedded avionics. Earth-analog flight experiments demonstrate strong navigation performance under mission-relevant conditions. In full-sequence evaluation, the proposed system achieves an SE(3)-aligned translation APE of 0.31 m RMSE/0.47 m Max and further reaches 0.06 m RMSE/0.15 m Max on a nominal stable segment. Runtime profiling over 5000+ update cycles shows that the Parity-Window back-end keeps the maximum optimization latency below 58.32 ms, satisfying the 66.7 ms hard real-time deadline while maintaining accuracy close to full-window optimization. These results show that the proposed system provides a practical balance of accuracy, robustness, and deterministic real-time performance for Tianwen-3-class Mars rotorcraft navigation.

Keywords:

Mars sample return; rotorcraft; autonomous navigation; multi-sensor fusion; real-time optimization

1. Introduction

Mars remains a major focus of planetary exploration due to its unique geologic history, atmospheric conditions, and potential for past life. Of the two sampling strategies envisioned for the Tianwen-3 mission (robotic-arm collection and rotorcraft-based sampling), the latter represents a more demanding step toward high-precision and robust autonomous operation. While orbiters and rovers have historically dominated Mars exploration, rotorcraft are now emerging as a powerful complement [1,2,3,4,5]. NASA’s Ingenuity Mars Helicopter has proven the feasibility of powered flight in the thin Martian atmosphere, completing 72 flights and accumulating about 17.0 km of total flight distance [6]. Compared with rovers limited by terrain hazards and low traversal speed, rotorcraft can rapidly access otherwise unreachable regions, provide high-resolution imaging at low altitudes, and support science operations through scouting, terrain assessment, or independent exploration. In the context of planetary exploration, such rotorcraft may be classified as space drones, i.e., unmanned aerial robotic platforms designed for autonomous operation in extraterrestrial environments. These capabilities indicate that rotorcraft can serve as an effective complement to rovers and orbiters in future planetary missions, including the possible exploration of other atmospheric bodies such as Titan and Venus.

However, under the complex environmental conditions of Mars, achieving reliable autonomous navigation for rotorcraft remains a major challenge, especially for sample–return missions like Tianwen-3, which impose far more stringent requirements for accuracy and robustness than those met by Ingenuity [7,8,9,10]. In addition to severe mass and computational constraints—limiting sensor quality and processing capability—the navigation system must cope with highly variable illumination, pervasive dust, and extreme thermal swings. Conventional terrestrial navigation systems are ill-suited to these conditions, as they typically rely on more capable sensors and processors. Although Ingenuity successfully demonstrated basic flight capability, its navigation framework was designed as a technology demonstrator and was not intended to satisfy the high-precision requirements of sample collection and return. These limitations motivate the development of a new navigation approach tailored to Tianwen-3-class mission requirements.

To bridge this gap, we develop a mission-constrained autonomous navigation system for Tianwen-3-class Mars rotorcraft that enables robust and deterministic real-time operation under limited onboard computational resources. The target platform is MarsBird-VII, a deployable Mars rotorcraft space drone concept for autonomous sampling operations, whose overall structure is shown in Figure 1 and detailed in [11]. In this paper, we focus on its onboard stereo–inertial navigation workflow with altimeter-aided visual depth initialization. The proposed design co-optimizes front-end perception latency and the back-end optimization workload so that reliable state estimation can be sustained within a strict per-cycle computing budget.

To reconcile the stringent mission requirements of Mars sampling with the severe computational constraints of onboard avionics, we propose a navigation system tailored for MarsBird-VII. The core contributions are as follows:

Mission-constrained navigation architecture. We establish an integrated stereo visual–inertial framework tailored to Mars sampling operations. The framework is designed to support decimeter-level accuracy while maintaining deterministic real-time updates on resource-constrained hardware.
Computation-aware vision front-end. We propose a conditionally decoupled stereo vision front-end that separates high-rate tracking from feature replenishment. Parallel feature detection over image regions, hysteresis-based mode switching, and deferred stereo matching are used to bound perception latency while maintaining sufficient visual constraints.
Parity-Window optimization back-end. We propose a Parity-Window sliding-window optimization strategy that alternately updates interleaved state subsets while retaining the original full-window horizon. Unlike full-window optimization, it reduces the per-cycle solve scope; unlike reduced-window or keyframe-skipping strategies, it preserves long-horizon visual connectivity and cross-state coupling through unified marginalization.

The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 presents MarsBird-VII’s vision front-end and optimization back-end algorithms, emphasizing their adaptability to Martian environments. Section 4 details the system-level design optimizations, namely, the vision-front-end re-architecture and the Parity-Window strategy, that are tailored to enable real-time execution under strict resource constraints. Section 5 presents experimental validation results and system-level benchmarks from real-world outdoor and ground experiments in mission-relevant scenarios.

2. Related Work

2.1. Operational Context: Mars Environment and Tianwen-3’s Constraints

Autonomous navigation for the Tianwen-3 rotorcraft must function under extreme environmental and mission-specific constraints. The Martian atmosphere, with a density of only about 1% of Earth’s, significantly alters aerodynamic characteristics and reduces damping effectiveness [12]. Highly variable illumination conditions caused by frequent dust storms and low solar elevation angles impair visual sensing [13]. In addition, large dust-covered plains and eroded terrains often result in visually ambiguous or low-texture landscapes, complicating feature-based navigation [14]. Communication delays between Earth and Mars, which can exceed 20 min, preclude any ground-in-the-loop control, necessitating full onboard autonomy [15]. Extreme thermal variations, with temperatures often dropping below −80 °C, impact sensor calibration and operational reliability [16], while the low atmospheric density limits convective cooling, requiring dedicated thermal management designs.

The Tianwen-3 mission, a Mars sample–return mission with a planned launch in 2028, imposes strict limitations on mass and computational resources. These constraints restrict the selection of sensors and the available computational capacity of the onboard platform. Therefore, the navigation system must not only operate robustly under challenging Martian environmental conditions but also remain compatible with strict onboard resource limits, requiring both environmental adaptability and computational efficiency.

These combined challenges significantly impair the performance of conventional visual–inertial odometry (VIO) and simultaneous localization and mapping (SLAM) systems designed for terrestrial applications. To fulfill Tianwen-3’s stringent requirements for accuracy, robustness, autonomy, and real-time operation under severe onboard resource constraints, an adaptive and computationally efficient navigation solution is essential.

2.2. Terrestrial and Martian Navigation Methods: Gaps and Limitations

Filter-based VIO approaches, such as the Multi-State Constraint Kalman Filter (MSCKF) and Robust Visual–Inertial Odometry (ROVIO) [17,18,19], are lightweight and suitable for real-time operation. However, as sequential estimators, they rely on local linearization at the current estimate, which can lead to accumulated linearization error and estimator inconsistency over long trajectories, especially when the system is repeatedly linearized around biased states. Such methods may struggle to simultaneously satisfy accuracy and consistency requirements under the specific computing and sensing constraints considered in this work. Learning-based VIO methods [20,21,22], although effective in Earth-like settings, lack proven generalization to Martian spectral and geometric conditions and are therefore difficult to trust for critical sampling operations.

In contrast to lightweight filter-based estimators, modern optimization-based frameworks such as VINS-Mono, its multi-sensor extension VINS-Fusion, and ORB-SLAM3 [23,24] achieve higher accuracy and robustness through tightly coupled nonlinear optimization over sliding windows. Their accuracy benefits largely come from iterative optimization over multi-frame constraints, but this also leads to substantial computational overhead due to dense marginalization and full-window solving. Under the resource constraints of the MarsBird-VII platform, such standard formulations are difficult to deploy directly while still guaranteeing deterministic real-time execution. This limitation motivates the development of a tailored optimization strategy that preserves high estimation accuracy while explicitly bounding computational complexity.

Navigation systems on Earth often incorporate supplementary sensors—such as barometers, light detection and ranging (LiDAR), the global navigation satellite system (GNSS), or magnetometers—to enhance reliability. These options are largely infeasible on Mars: barometric readings are unstable in the thin atmosphere, GNSS signals are unavailable, LiDAR systems typically exceed mass budgets, and the planet lacks a globally stable magnetic field for reliable magnetometer use [25,26,27]. Consequently, visual–inertial navigation has emerged as the primary solution, as exemplified by the performance of Ingenuity [7,9].

Overall, existing navigation approaches still leave a clear gap for MarsBird-VII-class rotorcraft. Filter-based VIO methods are computationally efficient but may suffer from long-term inconsistency, while full sliding-window optimization improves accuracy at the cost of a high computational load. Learning-based methods lack proven generalization to Martian visual conditions, and many terrestrial sensor-aided solutions relying on GNSS, LiDAR, barometers, or magnetometers are difficult to deploy under Mars mission constraints. Therefore, a navigation framework is needed that can maintain visual–inertial accuracy and robustness while explicitly bounding onboard computational complexity.

2.3. Ingenuity: Processor, Sensors, and Navigation Framework

NASA’s Ingenuity Mars Helicopter successfully demonstrated autonomous, vision-based flight on Mars. Its onboard computation platform consisted of a Qualcomm Snapdragon 801, supporting the real-time processing of navigation data. The sensor suite included a downward-looking monocular camera, an inertial measurement unit (IMU), and a laser altimeter. For state estimation, Ingenuity employed the Minimal Augmented State Estimator for Vision-Based Navigation (MAVeN) algorithm, a cloned-state extended Kalman filter (EKF) that fused IMU propagation, feature-tracked velocity from the monocular camera, and altimeter measurements to estimate the helicopter’s state in real-time. MAVeN enabled over 70 successful flights, validating the feasibility of vision-based autonomous navigation in the thin Martian atmosphere [6,7,8,9].

Despite its effectiveness as a technology demonstrator, Ingenuity’s navigation system was designed around a locally planar ground assumption, which simplified the perception and state estimation but limited robustness over non-planar terrains and in perceptually degraded conditions. The Mars Science Helicopter (MSH) concept proposed a structure-invariant extension that fuses a 1D range measurement with monocular vision to remove the planarity constraint and mitigate scale drift [8]. While range-aided monocular designs are attractive from a sensing and computation standpoint, their scale correction can become sensitive to the quality and representativeness of the range measurement, particularly over rough topography or during rapid attitude changes. In practice, the effective constraint provided by a single-beam range sensor can be degraded by local geometry, footprint-scale irregularities, and attitude-dependent projection effects. This requires careful measurement validation and robust weighting to prevent occasional range outliers from adversely affecting the estimator.

Collectively, the MarsBird-VII mission requirements shift the emphasis from demonstrating VIO feasibility on Mars to delivering robust and deterministic real-time navigation under challenging visual conditions and tight onboard resource budgets. Under these coupled constraints, existing approaches may struggle to meet robustness, consistency, and timing requirements simultaneously. We therefore adopt a stereo visual–inertial sensing configuration. Unlike monocular approaches, where metric scale recovery depends on sufficient inertial excitation, a fixed stereo baseline provides instantaneous metric depth observability that is largely decoupled from IMU motion dynamics, helping to maintain scale consistency during low-excitation maneuvers and in the presence of inertial bias [28]. Building on this sensing choice, we co-design the processing workflow to jointly target robustness and timing determinism on embedded hardware. We employ a tightly coupled nonlinear optimization formulation to jointly incorporate multi-sensor constraints and improve resilience to outliers, while explicitly managing the computational cost to maintain a predictable runtime. Accordingly, we propose two complementary components. The first is a computation-aware vision front-end designed for Martian visual conditions, which maintains tracking robustness while bounding worst-case latency. The second is a Parity-Window nonlinear optimization back-end that alternates between two interleaved (odd/even) state subsets while maintaining a unified marginalization prior. Its role is not to shorten the sliding-window horizon but to reduce the per-update solve scope while preserving cross-state coupling, long-horizon consistency, and timing determinism under the real-time budget.

3. Navigation System: Architecture, Vision Front-End, and Optimization Back-End

An overview of the MarsBird-VII rotorcraft is provided in [11]. In this paper, we focus on its autonomous navigation system, which performs multi-sensor fusion by integrating an inertial measurement unit and stereo vision within a sliding-window nonlinear optimization framework. The proposed system is a tightly coupled stereo–inertial navigation framework for Mars rotorcraft. It is designed to provide robust and accurate state estimation under strict onboard real-time and computational constraints.

3.1. Navigation Architecture

The navigation architecture (Figure 2) is centered on the Navigation Board (NB), which executes high-level navigation functions, including perception, sensor fusion, and state estimation. The Flight Control Board (FCB) streams IMU and altimeter measurements to the NB. An onboard FPGA serves as a low-latency acquisition interface, capturing raw signals from the gyroscope (2000 Hz), accelerometer (1600 Hz), and laser altimeter (50 Hz). Two redundant MCUs perform preprocessing, filtering, and IMU time synchronization, and then return the synchronized measurements to the NB via the FPGA. This partitioning decouples sensing and preprocessing from navigation computation, ensuring time-aligned inputs to the NB and improving reliability under stringent onboard resource constraints.

The Navigation Board is built on the industrial-grade Rockchip RK3588S system-on-chip (SoC) (Rockchip Electronics Co., Ltd., Fuzhou, China), featuring four 2.25 GHz ARM Cortex-A76 high-performance cores and four 1.8 GHz ARM Cortex-A55 (Arm Ltd., Cambridge, UK) energy-efficient cores. It is equipped with 32 GB of Low-Power Double Data Rate 4X (LPDDR4X) random-access memory (RAM) and 64 GB of embedded MultiMediaCard (eMMC) storage to support the onboard navigation workload. This hardware configuration supports two main navigation workflows:

The vision front-end primarily uses all eight cores when executing intensive tasks such as feature detection and stereo matching. The eight-core allocation accelerates the processing of 15 Hz synchronized images from the two downward-facing monochrome navigation cameras (Navcam1 and Navcam2), triggered via the NB’s General-Purpose Input/Output (GPIO). Once each batch of vision front-end tasks is completed, the cores are released and made available to the optimization back-end module and other tasks, ensuring the efficient use of computing resources.

The optimization back-end runs on the four high-performance ARM Cortex-A76 cores. This module, based on nonlinear optimization, fuses visual observations with inertial measurements, while using laser altimeter readings as auxiliary geometric references for visual depth initialization and triangulation. The other four cores, ARM Cortex-A55, are assigned to communication, landing-site terrain analysis, and other auxiliary tasks.

The onboard computer must support concurrent real-time workloads (flight control, navigation estimation, sensor I/O, communication, and mission imaging) within a limited computational envelope. To balance the computational margin, mission imaging reliability and tracking stability, and the need for predictable latency in closed-loop control, we set the downward-looking stereo camera and navigation update rate to 15 Hz, which imposes a per-cycle deadline of $T_{b} = 1/15\ \mathrm{s} \approx 66.7\ \mathrm{ms}$. Each cycle must complete both the vision front-end and back-end optimization within $T_{b}$; otherwise, latency accumulates across cycles and can degrade the closed-loop stability. Guided by this budget, Section 4 co-designs the real-time strategies (bounded sliding-window complexity, parallel execution of the front-end and back-end, and workload bounding via thresholds/feature capping), and Section 5.4 validates sustained 15 Hz operation using runtime statistics (e.g., average and tail latency) alongside accuracy metrics.
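The per-cycle budget can be made concrete with a small bookkeeping sketch. It derives the deadline from the 15 Hz update rate and summarizes logged cycle latencies against it. The `CycleTiming` and `deadline_report` names, the sample timings, and the conservative assumption that front-end and back-end latencies simply add are all illustrative; they are not taken from the authors' profiling tools.

```python
from dataclasses import dataclass

UPDATE_RATE_HZ = 15.0                  # navigation update rate stated in the text
T_B_MS = 1000.0 / UPDATE_RATE_HZ       # per-cycle deadline, about 66.7 ms


@dataclass
class CycleTiming:
    """Measured latencies for one navigation cycle (hypothetical log record)."""
    front_end_ms: float
    back_end_ms: float

    @property
    def total_ms(self) -> float:
        # Conservative assumption: the two stages run back to back.
        return self.front_end_ms + self.back_end_ms


def deadline_report(cycles, budget_ms=T_B_MS):
    """Return mean latency, worst-case latency, and number of deadline misses."""
    totals = [c.total_ms for c in cycles]
    return {
        "mean_ms": sum(totals) / len(totals),
        "max_ms": max(totals),
        "misses": sum(1 for t in totals if t > budget_ms),
    }
```

For a hard real-time claim, the worst-case (`max_ms`) and miss count matter more than the mean, which is why both are reported.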

Notation Convention

Throughout the paper, a superscript denotes the reference frame, and a subscript denotes the body/camera or keyframe index. Rotations are represented by $R \in SO(3)$, positions and velocities by bold lowercase vectors, and attitudes in the optimization state by unit quaternions. For readability, the state of the IMU body at keyframe $k$ is denoted by $x_{k}$, and the stereo cameras are indexed by $c \in \{1, 2\}$.

We perform state estimation in a local navigation frame $F_{w}$. The origin of $F_{w}$ is fixed at the landing-site initialization; the vertical axis $z_{w}$ is aligned with the local “up” direction (opposite gravity), and $x_{w}$ is defined by projecting the vehicle’s initial heading onto the horizontal plane, with $y_{w} = z_{w} \times x_{w}$ to complete a right-handed frame.
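The frame construction above can be sketched in a few lines of NumPy. The function name `build_nav_frame`, the choice of a gravity-direction vector expressed in the initial body frame as input, and the degeneracy threshold are assumptions made for illustration; the paper does not describe how the initial gravity direction is obtained.

```python
import numpy as np


def build_nav_frame(gravity_b, heading_b=(1.0, 0.0, 0.0)):
    """Build the navigation-frame axes from gravity and the initial heading.

    gravity_b: gravity direction (pointing down) in the initial body frame.
    heading_b: vehicle heading in the same frame (x_b for an FLU body frame).
    Returns a 3x3 matrix whose rows are x_w, y_w, z_w expressed in the body
    frame, i.e. it maps initial-body coordinates to navigation-frame coordinates.
    """
    g = np.asarray(gravity_b, dtype=float)
    h = np.asarray(heading_b, dtype=float)

    z_w = -g / np.linalg.norm(g)              # local "up", opposite gravity
    h_horiz = h - np.dot(h, z_w) * z_w        # project heading onto the horizontal plane
    n = np.linalg.norm(h_horiz)
    if n < 1e-6:
        raise ValueError("heading is (nearly) vertical; x_w is undefined")
    x_w = h_horiz / n
    y_w = np.cross(z_w, x_w)                  # completes the right-handed frame
    return np.vstack([x_w, y_w, z_w])
```

For a level vehicle, for example `build_nav_frame([0.0, 0.0, -3.71])`, the result is the identity, so the navigation frame coincides with the initial FLU body frame.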

The body frame $F_{b}$ is attached to the IMU and follows the forward–left–up (FLU) convention: $x_{b}$ points forward, $y_{b}$ points left, and $z_{b}$ points upward.

For each downward-facing stereo camera, we define a camera frame $F_{c_{i}}$ ($i \in \{1, 2\}$), which is rigidly attached to $F_{b}$. The camera–IMU extrinsic parameters, representing the transformation from the camera frame to the body frame, are denoted as $(p_{c_{i}}^{b}, q_{c_{i}}^{b})$ and are initialized by offline calibration and further refined online in the estimator. In the optimization back-end, these extrinsic parameters are equivalently represented as $T_{c_{i}}^{b} = (R_{c_{i}}^{b}, p_{c_{i}}^{b})$. In $F_{c_{i}}$, the optical axis $z_{c_{i}}$ points nadir (downward), $x_{c_{i}}$ points to the right side of the image plane, and $y_{c_{i}}$ points to the bottom of the image plane, so that $x_{c_{i}} \times y_{c_{i}} = z_{c_{i}}$.

3.2. Vision Front-End for Martian Visual Conditions

3.2.1. Vision Strategy for Mars Flight Environments

To address the environmental challenges outlined in Section 2.1—including highly variable illumination, low-texture terrains, and limited onboard computing capacity—we design a vision front-end tailored to Martian conditions that balances robustness, efficiency, and accuracy. Conventional feature-based methods involve a trade-off between feature distinctiveness and computational cost. The Features from Accelerated Segment Test (FAST), which was used by NASA’s Ingenuity Mars Helicopter [7,9], enables rapid detection for real-time operation but can be unreliable in low-texture regions, potentially degrading navigation performance. More robust descriptor-based alternatives such as the Scale-Invariant Feature Transform (SIFT) and Speeded-Up Robust Features (SURF) handle sparse visual environments better but are too computationally expensive for resource-constrained onboard platforms. To balance these considerations, we adopt the Shi–Tomasi corner detector, which provides a favorable compromise between robustness in low-texture Martian landscapes and processing efficiency on MarsBird-VII [29,30,31,32].

For feature tracking, we employ a sparse pyramidal Lucas–Kanade (LK) optical flow algorithm. This approach enhances resilience to illumination variations, such as dust-induced lighting changes and surface reflectivity, while reducing computational demands compared to descriptor-based methods [33,34]. Because LK assumes small inter-frame motion, the imaging frequency must be sufficiently high to maintain small inter-frame displacements for reliable tracking. This combination provides a practical trade-off between tracking robustness and computational efficiency and serves as an effective front-end foundation for subsequent real-time optimization on low-computational onboard platforms.
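To illustrate why the Shi–Tomasi criterion separates corners from edges and flat regions, the sketch below computes its response (the smaller eigenvalue of the local gradient structure tensor) with plain NumPy. The window radius, the box filter, and the function names are illustrative assumptions; an onboard implementation would more likely rely on an optimized routine such as OpenCV's `goodFeaturesToTrack`.

```python
import numpy as np


def box_sum(a, r):
    """Sum of `a` over a (2r+1)x(2r+1) window, zero-padded at the borders."""
    padded = np.pad(a, r)
    out = np.zeros(a.shape, dtype=float)
    for dy in range(2 * r + 1):
        for dx in range(2 * r + 1):
            out += padded[dy:dy + a.shape[0], dx:dx + a.shape[1]]
    return out


def shi_tomasi_response(img, r=1):
    """Per-pixel minimum eigenvalue of the 2x2 gradient structure tensor."""
    img = np.asarray(img, dtype=float)
    gy, gx = np.gradient(img)                 # central-difference image gradients
    sxx = box_sum(gx * gx, r)
    syy = box_sum(gy * gy, r)
    sxy = box_sum(gx * gy, r)
    # Closed-form smaller eigenvalue of [[sxx, sxy], [sxy, syy]].
    return 0.5 * (sxx + syy) - np.sqrt(0.25 * (sxx - syy) ** 2 + sxy ** 2)
```

On a synthetic image containing a single bright quadrant, the response peaks at the quadrant's corner and vanishes along its straight edges, which is the behavior that makes the detector usable on weakly textured terrain where edges alone are not trackable.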

3.2.2. Vision Front-End: Implementation and Workflow

Building on the strategy in Section 3.2.1, we implement a tightly coupled workflow to maintain robust stereo–temporal feature tracks under challenging flight conditions. For the first frame, Shi–Tomasi features are detected in the left image and propagated to the right image using bidirectional pyramidal LK optical flow to establish initial stereo correspondences. In subsequent frames, features are first temporally tracked across consecutive left images to maintain continuity, with new features added when the number of valid tracks falls below a predefined threshold. The resulting feature tracks are then propagated to the right image to preserve stereo consistency. All correspondences are subsequently validated using a random sample consensus (RANSAC)-based fundamental-matrix estimation, which rejects outliers and retains epipolar-consistent matches.

This left-first workflow provides a practical balance between robustness and computational efficiency: temporal tracking preserves feature continuity across frames, on-demand replenishment maintains sufficient coverage without unnecessary computation, and left-to-right stereo propagation preserves geometric consistency for depth estimation.
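Two decisions in this workflow lend themselves to a short sketch: the bidirectional (forward–backward) consistency check applied to LK tracks, and the on-demand replenishment trigger. The thresholds `max_err_px` and `min_tracks` and both function names are hypothetical; the text only states that a predefined threshold exists.

```python
import numpy as np


def forward_backward_mask(pts_src, pts_back, max_err_px=0.5):
    """Keep tracks whose round trip (source -> target -> source) is consistent.

    pts_src:  (N, 2) feature locations in the source image.
    pts_back: (N, 2) locations obtained by tracking the target-image points
              back into the source image.
    """
    src = np.asarray(pts_src, dtype=float)
    back = np.asarray(pts_back, dtype=float)
    err = np.linalg.norm(src - back, axis=1)   # round-trip error in pixels
    return err <= max_err_px


def needs_replenishment(num_valid_tracks, min_tracks=120):
    """Trigger new feature detection only when coverage drops too low."""
    return num_valid_tracks < min_tracks
```

The mask would be applied before the RANSAC fundamental-matrix test, so that the more expensive geometric check only sees tracks that already survived the cheap photometric one.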

Each feature is represented as a ten-dimensional observation vector:

$$[t,\ \mathit{feature}_{id},\ \mathit{cam}_{id},\ x,\ y,\ 1,\ u,\ v,\ v_{x},\ v_{y}] \tag{1}$$

which includes the time of the image frame $t$; a globally unique feature ID $\mathit{feature}_{id}$; the camera ID (0 for left, 1 for right) $\mathit{cam}_{id}$; the normalized homogeneous image coordinate of the feature $(x, y, 1)$; pixel coordinates $(u, v)$; and feature point velocity in the image plane $(v_{x}, v_{y})$.
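A minimal sketch of how the ten-dimensional observation in (1) might be assembled is given below. It assumes an ideal, distortion-free pinhole camera and computes the velocity as the finite difference of normalized coordinates between consecutive frames; the text does not specify the camera model or whether the velocity is expressed in pixels or in normalized units, so both choices, along with the `Pinhole` and `make_observation` names, are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pinhole:
    """Ideal pinhole intrinsics (hypothetical; lens distortion is ignored)."""
    fx: float
    fy: float
    cx: float
    cy: float

    def normalize(self, u, v):
        """Map pixel coordinates to the normalized image plane (z = 1)."""
        return (u - self.cx) / self.fx, (v - self.cy) / self.fy


def make_observation(t, feature_id, cam_id, uv, prev_uv, dt, cam):
    """Return [t, feature_id, cam_id, x, y, 1, u, v, vx, vy] for one feature."""
    u, v = uv
    x, y = cam.normalize(u, v)
    if prev_uv is None:
        vx = vy = 0.0                     # newly detected feature: no history yet
    else:
        px, py = cam.normalize(*prev_uv)
        vx, vy = (x - px) / dt, (y - py) / dt
    return [t, feature_id, cam_id, x, y, 1.0, u, v, vx, vy]
```

Carrying the image-plane velocity alongside each observation is what allows the back-end to compensate for a camera–IMU time offset by shifting the observation along its velocity.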

3.3. Robust Multi-Sensor Optimization Back-End for MarsBird-VII

3.3.1. State Estimation Strategy for Mars Flight Environments

Robust and accurate state estimation is critical for the autonomous navigation of MarsBird-VII in challenging Martian environments. High-frequency wind disturbances, unpredictable atmospheric conditions, and nonlinearities inherent to the navigation system—such as the camera motion model and feature point triangulation—introduce time-varying noise and pose strong challenges for state estimation. We formulate MarsBird-VII state estimation as a nonlinear sliding-window optimization problem over a finite set of recent keyframes. All measurements in the active window are jointly optimized, reducing sensitivity to noisy observations and transient dropouts. To bound computation, past states are marginalized and their information is retained as a prior on the remaining variables. This design provides stable pose and velocity estimates under high-dynamic motion and intermittent measurement degradation.

3.3.2. Optimization Back-End Workflow

Building on the estimation strategy described in Section 3.3.1, MarsBird-VII’s optimization back-end is formulated as a tightly coupled visual–inertial framework based on sliding-window nonlinear optimization. Laser altimeter measurements are not introduced as explicit residual factors in the back-end optimization. Instead, the nearest-in-time altimeter reading is used as an auxiliary geometric reference during visual depth initialization and feature triangulation, thereby indirectly regularizing the vertical geometry and reducing height-drift tendencies. This design is adopted because, during low-altitude flight over rough terrain, a single-beam altimeter may reflect highly local terrain fluctuations; using it as an explicit residual could therefore introduce terrain-dependent noise into the back-end optimization. The framework is illustrated in Figure 3. Visual feature factors provide geometric constraints, and IMU pre-integration supplies high-rate motion priors. After visual–inertial initialization, the estimator iteratively minimizes the joint cost over the active keyframes and applies marginalization to keep the problem size bounded while preserving prior information for subsequent updates.

The workflow consists of the following key stages:

IMU pre-integration: High-frequency inertial measurements are integrated between camera frames to provide relative motion priors. These priors enhance short-term pose prediction and continuity, particularly during rapid maneuvers or transient visual degradation.
Vision front-end measurements: Visual information is processed using a Perspective-n-Point (PnP) solver and feature triangulation. To improve robustness, visual measurements are temporally aligned with altimeter readings by using the nearest-in-time laser altitude measurement as an auxiliary geometric reference during feature triangulation. This strategy improves the reliability of 3D feature initialization, particularly in regions with sparse visual texture or partial occlusion.
Visual–inertial initialization: The IMU and camera observations are aligned into a consistent reference frame to establish coherent multi-sensor fusion before entering iterative optimization.
Sliding-window nonlinear optimization: Once initialized, the system maintains a fixed-size window of recent keyframes and jointly optimizes the following residuals:
(a) Visual reprojection residuals, enforcing consistency between projected landmarks and observed features, including stereo residuals, temporal residuals across consecutive frames, and cross-camera temporal residuals;
(b) IMU residuals, derived from pre-integrated measurements, which constrain the relative pose, velocity, and inertial bias drift;
(c) Marginalization priors, which summarize the information from states removed from the active window, preserving their influence on current estimates while keeping the optimization problem size manageable.
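The nearest-in-time altimeter association mentioned in the stages above can be sketched as a lookup over sorted timestamps. The function name, the `max_gap_s` rejection threshold (set here to one period of the 50 Hz altimeter), and the choice to return `None` when no reading is close enough are assumptions for illustration.

```python
from bisect import bisect_left


def nearest_altimeter(stamps, ranges, t_img, max_gap_s=0.02):
    """Return the range reading closest in time to t_img, or None if too far.

    stamps: altimeter timestamps in seconds, sorted ascending.
    ranges: range readings aligned with `stamps`.
    """
    if not stamps:
        return None
    i = bisect_left(stamps, t_img)
    # Only the neighbors on either side of the insertion point can be nearest.
    candidates = [j for j in (i - 1, i) if 0 <= j < len(stamps)]
    j = min(candidates, key=lambda k: abs(stamps[k] - t_img))
    if abs(stamps[j] - t_img) > max_gap_s:
        return None
    return ranges[j]
```

Rejecting stale readings matters here because the altimeter is used only as a reference for depth initialization; a reading taken long before the image may describe different terrain.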

The full optimization state includes the keyframe states of MarsBird-VII in the active sliding window, the inverse-depth parameters of the tracked landmarks, the camera–IMU extrinsic parameters, and the camera–IMU temporal offset. Specifically, the state of the $k$-th keyframe consists of the IMU/body position $p_{b_{k}}^{w}$, orientation quaternion $q_{b_{k}}^{w}$, velocity $v_{b_{k}}^{w}$, accelerometer bias $b_{a,k}$, and gyroscope bias $b_{g,k}$. Here, the superscript $w$ denotes the world frame, and the subscript $b_{k}$ denotes the IMU body frame at keyframe $k$. The inverse depth of landmark $l$ is denoted by $\lambda_{l}$ and is anchored to its host camera frame. The camera–IMU extrinsic parameters are denoted by $X_{c \to b} = \{T_{c_{1}}^{b}, T_{c_{2}}^{b}\}$, where $T_{c_{i}}^{b}$ represents the rigid body transformation from camera frame $c_{i}$ to the IMU body frame, consistent with the notation defined in Section 3.1. The variable $t_{d}$ denotes the temporal offset between camera and IMU measurements.

The full optimization state vector $X$ is expressed as:

$$X = [x_{0}, x_{1}, \dots, x_{N}, \lambda_{1}, \dots, \lambda_{M}, X_{c \to b}, t_{d}]^{T} \tag{2}$$

where $N + 1$ is the number of keyframes in the active sliding window, $M$ is the number of landmarks currently parameterized by the inverse depth, and $x_{k}$ denotes the state of the $k$-th keyframe:

$$x_{k} = [p_{b_{k}}^{w}, q_{b_{k}}^{w}, v_{b_{k}}^{w}, b_{a,k}, b_{g,k}]^{T} \tag{3}$$

The optimization problem is formulated by minimizing the sum of squared residuals constructed from various sensor observations:

$$\min_{X} \; \| r_{prior}(X) \|^{2} + \sum_{(i,j) \in B} \| r_{imu}^{ij}(X) \|_{\Sigma_{imu,ij}^{-1}}^{2} + \sum_{(l,k,c) \in C} \rho \left( \| r_{vis}^{l,k,c}(X) \|_{\Sigma_{vis,l,k,c}^{-1}}^{2} \right) \tag{4}$$

where $r_{prior}(X)$ denotes the prior residual carried over from previous marginalization steps, $B$ denotes the set of consecutive keyframe pairs connected by IMU pre-integration, and $C$ denotes the set of valid visual observations in the active window. The term $r_{imu}^{ij}(X)$ is the standard bias-corrected IMU pre-integration residual between keyframes $i$ and $j$, and $r_{vis}^{l,k,c}(X)$ is the reprojection residual of landmark $l$ observed by camera $c$ at frame $k$. The matrices $\Sigma_{imu,ij}$ and $\Sigma_{vis,l,k,c}$ are the corresponding covariance matrices of the IMU and visual measurements, respectively, and $\rho(\cdot)$ denotes the Huber loss used to suppress the influence of visual outliers.

At each iteration, the nonlinear residuals in (4) are locally linearized at the current estimate, and the increment $\delta X$ is solved in the tangent space. The state is then updated using the manifold retraction $X \leftarrow X \oplus \delta X$, where $\oplus$ denotes the state composition operator on the underlying manifold.

In this work, the prior residual is maintained in square-root information form as

$$r_{prior}(X) = J_p \, \delta X - r_p \tag{5}$$

where $\delta X$ is the local perturbation at the current linearization point, $J_p$ is the square-root prior Jacobian, and $r_p$ is the corresponding square-root prior residual vector obtained from marginalization. These two quantities are used to construct

$$H_p = J_p^{T} J_p, \quad b_p = J_p^{T} r_p \tag{6}$$

where $H_p$ and $b_p$ represent the prior information matrix and vector, respectively.
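The relation between the square-root form (5) and the information form (6) can be checked numerically. The following is a minimal sketch with randomly generated stand-ins for $J_p$ and $r_p$ (the sizes and values are assumptions for illustration, not taken from the flight system): minimizing $\| J_p \delta X - r_p \|^2$ gives the same increment as solving $H_p \delta X = b_p$.

```python
import numpy as np

rng = np.random.default_rng(0)
J_p = rng.standard_normal((8, 4))   # hypothetical square-root prior Jacobian
r_p = rng.standard_normal(8)        # hypothetical square-root prior residual vector

# Information form, as in (6)
H_p = J_p.T @ J_p
b_p = J_p.T @ r_p

# Minimizing ||J_p dx - r_p||^2 leads to the normal equations H_p dx = b_p
dx_normal = np.linalg.solve(H_p, b_p)
dx_lstsq, *_ = np.linalg.lstsq(J_p, r_p, rcond=None)
```

Keeping the prior as $(J_p, r_p)$ lets it enter the solver as an ordinary residual block, while $(H_p, b_p)$ is the form produced by the Schur complement during marginalization.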

The IMU residual in (4) follows the standard bias-corrected pre-integration model between consecutive keyframes. It consists of rotational, velocity, and positional consistency terms, together with accelerometer-bias and gyroscope-bias residual terms. Its formulation follows the standard on-manifold IMU pre-integration model in [35] and can be expressed as

$$r_{imu}^{ij} = \begin{bmatrix} r_{\Delta R}^{ij} \\ r_{\Delta v}^{ij} \\ r_{\Delta p}^{ij} \\ r_{b_a}^{ij} \\ r_{b_g}^{ij} \end{bmatrix} \tag{7}$$

The five components correspond to rotational, velocity, positional, accelerometer-bias, and gyroscope-bias residuals, respectively, where the last two terms capture the random-walk evolution of the IMU biases.

The visual residual is defined using the standard perspective reprojection model. For landmark $l$ observed by camera $c$ at frame $k$, the landmark position in the corresponding camera frame is first computed as

$$P_{l}^{c_k} = R_{w}^{c_k} (P_{l}^{w} - p_{c_k}^{w}) \tag{8}$$

where $P_{l}^{w}$ denotes the 3D position of landmark $l$ in the world frame, and $p_{c_k}^{w}$ and $R_{w}^{c_k}$ denote the position and rotation of camera $c$ at frame $k$, respectively. The predicted normalized image observation is then defined as

$$\hat{z}_{l}^{k,c} = \pi (P_{l}^{c_k}) = \begin{bmatrix} X_{l}^{c_k} / Z_{l}^{c_k} \\ Y_{l}^{c_k} / Z_{l}^{c_k} \end{bmatrix} \tag{9}$$

where $\pi(\cdot)$ denotes the standard perspective projection onto the normalized image plane. Accordingly, the visual reprojection residual is defined as

$$r_{vis}^{l,k,c} = \hat{z}_{l}^{k,c} - z_{l}^{k,c} \tag{10}$$

where $z_{l}^{k,c} \in \mathbb{R}^{2}$ is the measured undistorted normalized image coordinate of landmark $l$ observed by camera $c$ at frame $k$. In this work, landmarks are parameterized by inverse depth anchored at their host frames.
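Equations (8)–(10) translate almost line by line into code. The sketch below is illustrative only: the function names are hypothetical, the landmark is passed directly as a world-frame point (the inverse-depth parameterization used in the paper is not reproduced), and covariance weighting and the Huber loss are left out.

```python
import numpy as np

def project(P_c):
    """Perspective projection onto the normalized image plane, as in (9)."""
    X, Y, Z = P_c
    return np.array([X / Z, Y / Z])

def reprojection_residual(P_w, R_cw, p_c_w, z_meas):
    """Residual (10): predicted minus measured normalized coordinates.

    R_cw rotates world-frame vectors into the camera frame (R_w^{c_k} in (8));
    p_c_w is the camera position expressed in the world frame.
    """
    P_c = R_cw @ (P_w - p_c_w)   # Equation (8)
    return project(P_c) - z_meas

# Hypothetical example: a landmark 4 m in front of an axis-aligned camera.
P_w = np.array([1.0, 0.5, 4.0])
z_meas = np.array([0.25, 0.125])
r_at_origin = reprojection_residual(P_w, np.eye(3), np.zeros(3), z_meas)
r_moved = reprojection_residual(P_w, np.eye(3), np.array([0.0, 0.0, 2.0]), z_meas)
```

When the camera pose is consistent with the measurement the residual vanishes; moving the camera 2 m toward the landmark halves the depth and doubles the predicted normalized coordinates, which shows up directly in the residual.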

The iterative linearization and retraction process preserves the nonlinear nature of the residual models while enabling efficient sparse linear solving within the sliding window. By integrating measurements from multiple sensors, this tightly coupled sliding-window formulation preserves temporal correlations, improves consistency across keyframes, and provides a robust estimation baseline under mission-relevant sensing constraints.

In summary, Section 3 establishes a high-performance baseline navigation framework for MarsBird-VII. The framework integrates high-resolution stereo vision and inertial measurements within a tightly coupled estimation procedure while using laser altimeter readings to assist visual depth initialization, providing a principled baseline for operation under Martian environmental conditions. This baseline, however, exposes a fundamental tension between estimation fidelity and onboard computational limits. At the front-end, high-resolution stereo processing is executed in a strictly serial procedure, which can induce pronounced per-frame latency, especially during feature replenishment. On the back-end, each update requires iterative linearization and the solution of a large sparse system [35,36,37]. As the window state dimension and the number of constraints increase, the resulting solve time grows accordingly. Together, these front-end and back-end costs can compromise sustained real-time operation on the onboard processor. Section 4 therefore presents a computation-aware redesign for real-time onboard execution. It introduces a conditionally decoupled front-end in which feature replenishment is kept off the critical tracking path. It also introduces a Parity-Window back-end with unified marginalization, which bounds the per-update solution complexity and improves timing determinism.

4. Real-Time Optimization of MarsBird-VII Navigation Workflow

4.1. Acceleration of Vision Front-End for Real-Time MarsBird-VII Navigation

Achieving real-time high-resolution stereo perception on the resource-constrained MarsBird-VII platform presents a fundamental trade-off between computational demand and available processing power. The baseline serial procedure suffers from a rigid dependency structure: the continuous feature tracking process is strictly coupled with and frequently blocked by the computationally expensive feature replenishment step, introducing unacceptable latency during rapid maneuvers. To overcome the real-time limitation imposed by the serial dependency between tracking and replenishment, we introduce a computation-aware vision front-end. The architecture is built on two complementary design elements: (i) a spatially partitioned detection scheme that bounds the replenishment cost under high-resolution stereo input and (ii) a conditional decoupling strategy that removes replenishment from the critical path of continuous tracking. Together, these elements reconfigure the execution flow to match the computation and latency constraints of the MarsBird-VII embedded processor while retaining the robustness of the baseline tracker.

4.1.1. Parallel Feature Detection with Region-Based Masking

Standard Shi–Tomasi detection requires a global scan over the full image, which becomes a major bottleneck under high-resolution inputs. We therefore adopt a spatially partitioned detection strategy with deterministic batch scheduling and dynamic region-based masking to exploit the available onboard parallel computing resources (Figure 4).

Grid-Based Decomposition and Batch Scheduling

Each image is partitioned into a 4 × 4 grid of 16 non-overlapping regions. Region-level detection is executed in two sequential batches: eight regions are processed in parallel using eight threads, followed by the remaining eight regions, with a fixed thread pool and a batch-level barrier synchronization to bound tail latency. To maintain spatial coverage without over-constraining texture-poor areas, we enforce a per-region feature cap (not a minimum): up to 10 features per region, i.e., ≤160 features per frame across all regions. The per-frame cap (≤160) was selected to provide sufficient spatial coverage for robust estimation while bounding worst-case detection time under high-resolution input. The value was determined empirically through onboard runtime profiling and repeated flight tests, balancing track density against the per-cycle budget $T_b$ (defined in Section 3.1).

Consistent with the stereo-temporal procedure in Section 3.2.2, this parallelized detection is only applied to the left image, while right-image correspondences are obtained via the existing sparse optical flow propagation step, avoiding redundant binocular detection.
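The grid decomposition and two-batch schedule can be sketched as follows. This is a simplified illustration, not the onboard implementation: a precomputed corner-score map stands in for the Shi–Tomasi response, the per-region detector simply keeps the highest-scoring pixels, and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

GRID = 4              # 4 x 4 grid of regions
CAP_PER_REGION = 10   # per-region cap, so at most 160 features per frame
BATCH = 8             # regions processed in parallel per batch

def detect_region(score, r, c):
    """Return up to CAP_PER_REGION (x, y) pixels with the highest score in region (r, c)."""
    h, w = score.shape
    y0, y1 = r * h // GRID, (r + 1) * h // GRID
    x0, x1 = c * w // GRID, (c + 1) * w // GRID
    tile = score[y0:y1, x0:x1]
    k = min(CAP_PER_REGION, tile.size)
    flat = np.argsort(tile, axis=None)[::-1][:k]
    ys, xs = np.unravel_index(flat, tile.shape)
    return [(int(x + x0), int(y + y0)) for y, x in zip(ys, xs)]

def detect_all(score):
    """Run region-level detection in two sequential batches of eight regions."""
    regions = [(r, c) for r in range(GRID) for c in range(GRID)]
    features = []
    with ThreadPoolExecutor(max_workers=BATCH) as pool:   # fixed thread pool
        for start in range(0, len(regions), BATCH):
            batch = regions[start:start + BATCH]
            # list(...) blocks until the whole batch is done: a batch-level barrier
            results = list(pool.map(lambda rc: detect_region(score, *rc), batch))
            for pts in results:
                features.extend(pts)
    return features

score_map = np.random.default_rng(1).random((64, 64))  # stand-in corner-score map
features = detect_all(score_map)
```

Because the cap applies per region, a single highly textured area cannot absorb the whole feature budget, and the worst-case number of features handed to the tracker is fixed regardless of scene content.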

Dynamic Region-Based Masking

During replenishment, each region uses an occupancy mask to exclude areas already covered by active tracks, enforcing a minimum image–space separation and prioritizing long-lived features. The mask is updated every frame from surviving tracks and cleared for lost tracks, concentrating computation on feature-starved regions.
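A minimal version of such an occupancy mask is sketched below, assuming a circular exclusion zone of radius `min_sep` pixels around every surviving track. The radius, image size, and function name are hypothetical, and the prioritization of long-lived features is not modeled.

```python
import numpy as np

def occupancy_mask(shape, tracks, min_sep):
    """Boolean mask: True where a new feature may be placed,
    False within min_sep pixels of an active track at (x, y)."""
    h, w = shape
    yy, xx = np.mgrid[0:h, 0:w]
    free = np.ones((h, w), dtype=bool)
    for x, y in tracks:
        free &= (xx - x) ** 2 + (yy - y) ** 2 > min_sep ** 2
    return free

# Rebuilt every frame from the surviving tracks; lost tracks simply drop out of the list.
mask = occupancy_mask((48, 64), tracks=[(10, 10), (40, 30)], min_sep=5)
```

A detector restricted to the `True` pixels of each region only spends computation where track coverage is missing.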

4.1.2. Conditional and Parallelized Feature Processing

The baseline front-end executes feature tracking and replenishment in a strictly serial order, which accumulates latency in feature-poor scenes. When replenishment is triggered, the additional feature detection and stereo matching must be completed before pose updates can proceed, lengthening the critical path and causing significant per-frame latency spikes. To remove this bottleneck, we redesign the front-end as a parallelized framework (Figure 5) that switches modes based on the number of valid tracked features $N$. Mode switching is governed by two hysteresis thresholds $T_u$ and $T_d$ (set to 100 and 60, respectively). The hysteresis thresholds were tuned to avoid frequent mode switching while maintaining the minimum track density required for stable optimization. Candidate ($T_u$, $T_d$) pairs were evaluated through onboard runtime profiling and flight tests, and the final setting was chosen to keep tail latency within the per-cycle budget $T_b$ while preventing prolonged low-feature intervals. In particular, $T_u = 100$ was selected to ensure that the back-end optimization retains a sufficient number of visual constraints before entering the replenishment regime, whereas $T_d = 60$ provides a safety margin before the estimator becomes severely under-constrained.

Stable state ($N \geq T_u$): Tracking-only operation with dual-stream execution to minimize CPU load.

Critical state ($T_d \leq N < T_u$): Continued tracking with background replenishment to recover track density without breaking real-time responsiveness.

Emergency state ($N < T_d$): Intensified replenishment is activated. If recovery fails, a degradation flag is sent to the flight controller to request conservative maneuvering. During consecutive visual-track loss, the estimator falls back to IMU-only propagation, with rapidly increasing uncertainty. Prolonged absolute tracking loss remains outside the robust operating regime of the current system.
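The three operating states reduce to a threshold test on the tracked-feature count. The sketch below encodes only the inequalities stated above with $T_u = 100$ and $T_d = 60$; the enum and function names are hypothetical, and the degradation-flag logic and IMU-only fallback are not modeled.

```python
from enum import Enum

T_U = 100   # upper threshold T_u
T_D = 60    # lower threshold T_d

class Mode(Enum):
    STABLE = "stable"         # tracking only
    CRITICAL = "critical"     # tracking plus background replenishment
    EMERGENCY = "emergency"   # intensified replenishment

def front_end_mode(n_tracked: int) -> Mode:
    """Map the number of valid tracked features N to a front-end state."""
    if n_tracked >= T_U:
        return Mode.STABLE
    if n_tracked >= T_D:
        return Mode.CRITICAL
    return Mode.EMERGENCY
```

For example, `front_end_mode(99)` falls in the critical state, so replenishment starts in the background while tracking continues on the critical path.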

Dual-stream temporal parallelism: In the stable state, bidirectional pyramidal LK tracking for the left and right camera streams is executed concurrently as independent threads. This removes the left–right serial dependency on wall-clock time while preserving the geometric verification and outlier rejection described in Section 3.2.2.

Three-way concurrency with deferred stereo matching: When $N < T_u$, the system executes three tasks concurrently: (i) temporal LK tracking on the left stream, (ii) temporal LK tracking on the right stream, and (iii) parallelized feature detection on the left image (Section 4.1.1). To ensure pose update determinism, newly detected left-image features are not stereo-matched within the current frame. Instead, their right-image correspondences are established in the subsequent frame through the regular stereo propagation step. This deferred stereo matching decouples detection from immediate stereo matching, amortizes the replenishment cost across frames, and bounds the worst-case execution time (WCET) during replenishment bursts.

4.2. Parity-Window Alternating Optimization with Unified Marginalization

As outlined in Section 3.3, the optimization back-end is formulated as a tightly coupled sliding-window nonlinear optimization. On the MarsBird-VII avionics, repeatedly solving the full window at each update imposes a substantial computational burden, making it difficult to sustain the target update rate when a 10-frame window is maintained. To achieve bounded per-update complexity without sacrificing long-horizon consistency, we propose an alternating Parity-Window optimization strategy. The window is divided into two interleaved parity groups that are optimized alternately; the IMU and visual factors are reorganized accordingly at each update; and a unified marginalization strategy is applied to preserve cross-group constraints as the window slides. Compared with a conventional full sliding-window optimizer, which updates all states in the active window at every cycle, the proposed method only solves an interleaved subset of states at each update. Unlike a reduced-window strategy, however, it does not shorten the original window horizon. Instead, it keeps the full horizon for constraint retention and marginalization while only reducing the per-cycle solution scope.

4.2.1. Parity-Based Grouping and Selective Constraint Construction

Global Frame Indexing for Stable Grouping

To ensure consistent grouping under window sliding, each keyframe is assigned a unique, monotonically increasing global ID $\mathrm{id}_{global}$. At optimization cycle $t$, the active set $X_a$ is selected by parity:

$$X_a = \{ x_k \in X \mid \mathrm{id}_{global}(k) \bmod 2 = p \}, \quad p \in \{0, 1\} \tag{11}$$

where $p$ alternates between cycles. Variables outside $X_a$ are treated as fixed parameters during the current solution (i.e., their increments are set to zero), while they remain part of the window for marginalization (Section 4.2.2).
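The parity rule in (11) can be illustrated with a few lines of code. The window contents and function name below are hypothetical; the point is that grouping by global ID, rather than by position in the window, keeps each keyframe in the same group as the window slides.

```python
def split_by_parity(global_ids, p):
    """Split keyframe global IDs into the active set (ID mod 2 == p) and the inactive set."""
    active = [g for g in global_ids if g % 2 == p]
    inactive = [g for g in global_ids if g % 2 != p]
    return active, inactive

window = list(range(37, 47))       # hypothetical 10-keyframe window, global IDs 37..46
active_even, inactive_even = split_by_parity(window, 0)

slid_window = list(range(38, 48))  # the same window after sliding by one keyframe
active_odd, _ = split_by_parity(slid_window, 1)
```

Keyframe 39 sits at a different position in the two windows but belongs to the odd group in both, which would not hold if parity were computed from the window index.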

The proposed Parity-Window update can be interpreted as a block-coordinate Gauss–Newton step on the full sliding-window objective. At each update, only the increments of the active parity group are solved, while the remaining states are treated as anchors (zero increment) at the current linearization point. This strategy preserves the original objective and differs fundamentally from pruning-based approaches, which permanently remove states or measurements from the problem.
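The block-coordinate interpretation can be illustrated on a small linear toy problem. The sketch below is an assumption-laden stand-in for the real estimator: scalar states, one absolute measurement per state, and relative measurements between neighbours forming a chain. Alternating updates of the even and odd states, with the other group held fixed, converge to the same minimizer as solving all states at once.

```python
import numpy as np

n = 10
# Hypothetical toy problem: absolute measurements x_k = k and
# relative measurements x_{k+1} - x_k = 1.2 along the chain.
rows, rhs = [], []
for k in range(n):
    e = np.zeros(n); e[k] = 1.0
    rows.append(e); rhs.append(float(k))
for k in range(n - 1):
    e = np.zeros(n); e[k] = -1.0; e[k + 1] = 1.0
    rows.append(e); rhs.append(1.2)
A, z = np.array(rows), np.array(rhs)

# Reference: full-window least-squares solution over all states
x_full, *_ = np.linalg.lstsq(A, z, rcond=None)

def parity_update(x, p):
    """Solve only the states whose index parity is p; the others act as fixed anchors."""
    act = np.arange(n) % 2 == p
    r = z - A @ x                      # residual at the current estimate
    J_a = A[:, act]                    # Jacobian columns of the active states
    dx_a = np.linalg.solve(J_a.T @ J_a, J_a.T @ r)
    x = x.copy()
    x[act] += dx_a
    return x

x = np.zeros(n)
for cycle in range(60):
    x = parity_update(x, cycle % 2)    # toggle parity every cycle
```

In this toy setting the objective never changes between cycles; only the subset of variables being solved does, which is the property that distinguishes the update from pruning states or measurements.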

Importantly, parity scheduling alone does not guarantee a substantial runtime reduction in tightly coupled VIO, because factor evaluation and linearization may still involve the full window. We therefore redesign factor construction to match the reduced solution scope. The key difference is that runtime reduction is achieved not by shortening the horizon but by jointly redesigning the update scope, inertial factor construction, visual factor retention, and marginalization strategy.

Sparse IMU Factorization via Expanded Pre-Integration

In standard tightly coupled formulations, IMU factors connect every adjacent keyframe pair ($k \to k + 1$). In sliding-window optimization, these factors must be repeatedly evaluated and linearized (residuals and Jacobians) across Gauss–Newton/Levenberg–Marquardt (LM) iterations, which contributes non-negligibly to the runtime when the window is long.

In the proposed Parity-Window strategy, we introduce a sparse IMU topology during the optimization step: instead of including all adjacent factors $\Delta_{k \to k+1}$, we construct expanded IMU factors that directly bridge consecutive active frames. For an active keyframe $k$, the pre-integrated measurements over two intervals are composed to link the next active frame $k + 2$:

$$\Delta_{k \to k+2} = \Delta_{k \to k+1} \bigoplus \Delta_{k+1 \to k+2} \tag{12}$$

where $\bigoplus$ denotes standard pre-integration composition, i.e., concatenating two consecutive pre-integrated measurements with bias correction.

This design reduces the number of IMU residuals that need to be evaluated and linearized per iteration by approximately half, while still providing a direct temporal constraint between the variables being updated in the current cycle. Importantly, the information from intermediate inertial measurements is not discarded; it is incorporated through the composed pre-integration. Therefore, compared with standard full-window factorization, the proposed IMU construction reduces the number of residuals that must be evaluated and linearized in each iteration, while preserving the inertial information across the same temporal span.
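The composition in (12) can be sketched for the noise-free, bias-free case using the usual pre-integrated quantities $(\Delta R, \Delta v, \Delta p, \Delta t)$. This is a simplified illustration under stated assumptions: bias correction, covariance propagation, and bias Jacobians are omitted, the IMU samples are synthetic, and the function names are hypothetical.

```python
import numpy as np

def so3_exp(phi):
    """Rodrigues formula: rotation vector -> rotation matrix."""
    theta = np.linalg.norm(phi)
    K = np.array([[0.0, -phi[2], phi[1]],
                  [phi[2], 0.0, -phi[0]],
                  [-phi[1], phi[0], 0.0]])
    if theta < 1e-9:
        return np.eye(3) + K
    return np.eye(3) + np.sin(theta) / theta * K + (1 - np.cos(theta)) / theta**2 * (K @ K)

def preintegrate(gyro, accel, dt):
    """Accumulate (dR, dv, dp, T) from bias-free IMU samples."""
    dR, dv, dp, T = np.eye(3), np.zeros(3), np.zeros(3), 0.0
    for w, a in zip(gyro, accel):
        dp = dp + dv * dt + 0.5 * (dR @ a) * dt**2
        dv = dv + (dR @ a) * dt
        dR = dR @ so3_exp(w * dt)
        T += dt
    return dR, dv, dp, T

def compose(d1, d2):
    """Concatenate two consecutive pre-integrated measurements (k->k+1, k+1->k+2)."""
    R1, v1, p1, T1 = d1
    R2, v2, p2, T2 = d2
    return R1 @ R2, v1 + R1 @ v2, p1 + v1 * T2 + R1 @ p2, T1 + T2

rng = np.random.default_rng(0)
gyro = rng.normal(0.0, 0.5, (20, 3))    # synthetic angular rates (rad/s)
accel = rng.normal(0.0, 2.0, (20, 3))   # synthetic specific forces (m/s^2)
dt = 0.005

full = preintegrate(gyro, accel, dt)                 # integrate k -> k+2 directly
bridged = compose(preintegrate(gyro[:10], accel[:10], dt),
                  preintegrate(gyro[10:], accel[10:], dt))
```

The bridged measurement matches direct integration over the whole span, which is the sense in which the intermediate inertial information is retained even though the factor between $k$ and $k+1$ is not evaluated separately.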

Inclusive Construction of Visual Factors

Unlike the IMU topology, visual constraints are crucial for limiting drift and maintaining geometric observability, especially when feature tracks span an irregular number of frames [38]. If only factors whose host and observation frames are both active were retained (e.g., Odd–Odd), the number of usable reprojection constraints would drop sharply, weakening geometric conditioning and degrading convergence.

We adopt an inclusive visual factor policy: a visual residual $r_{vis}^{l,k,c}$ is included if at least one endpoint is active:

$$r_{vis}^{l,k,c} \text{ is included} \iff (x_{h(l)} \in X_a) \lor (x_k \in X_a) \tag{13}$$

where $h(l)$ denotes the host frame of landmark $l$. This yields three cases: (i) Active–Active provides full constraints between updated variables; (ii) Active–Inactive uses the inactive state as a fixed anchor, supplying a prior-like geometric constraint that stabilizes the active update; and (iii) Inactive–Inactive factors are skipped only in the current reduced solution because their Jacobians, with respect to the active increments, are zero when both endpoints are held fixed. They are still retained in the factor graph and are re-linearized whenever one endpoint becomes active in the next update or when either endpoint is involved in the unified marginalization step. This is also a key distinction from reduced-window approximations: the proposed method reduces the update scope but does not weaken geometric observability by discarding long-horizon visual connectivity.
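The inclusion rule and its three cases can be written as a small classifier over the global IDs of the host and observing keyframes. The function name and return labels are hypothetical and serve only to illustrate the rule in (13).

```python
def visual_factor_case(host_id, obs_id, p):
    """Classify a visual factor by the parity of its host and observation keyframes."""
    host_active = host_id % 2 == p
    obs_active = obs_id % 2 == p
    if host_active and obs_active:
        return "active-active"      # full constraint between updated states
    if host_active or obs_active:
        return "active-inactive"    # inactive endpoint acts as a fixed anchor
    return "skipped"                # retained in the graph, not used in this solve
```

A factor hosted at keyframe 5 and observed at keyframe 7 is skipped when the even group is active (`p = 0`) and becomes an active–active factor in the following cycle (`p = 1`).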

Unlike conventional keyframe-triggering methods, which risk IMU drift and linearization errors due to temporal gaps during aggressive maneuvers [39,40], our approach maintains continuous temporal constraints. Furthermore, simply reducing the sliding-window size (e.g., from 10 to 5 frames) decreases latency but compromises geometric robustness and noise rejection. By contrast, the Parity-Window employs an interleaved update strategy that achieves latency comparable to that of a small window while preserving long-horizon constraints through unified marginalization. Importantly, the proposed strategy should not be interpreted as a simple reduced-window approximation: it does not shorten the sliding-window horizon itself but instead reduces only the per-cycle solution scope while retaining cross-state coupling through the unified marginalization prior. This architecture improves timing determinism under resource-constrained onboard computation while retaining the estimation benefits of a longer geometric horizon.

4.2.2. Cross-Correlation-Preserving Unified Marginalization

A key risk of alternating (odd/even) optimization is system decoupling: if the two parity groups were marginalized independently, cross-correlations would be lost, potentially leading to inconsistent estimation [41]. We avoid this by decoupling the solve scope from the marginalization scope.

Asymmetric Design: Solution Reduced, Marginalization Full

Optimization step: Solve only for increments of the active variables $\delta X_a$, while setting inactive increments to zero ($\delta X_i = 0$). This can be viewed as a block-coordinate Gauss–Newton step on the full problem.

Marginalization step: Performed on the joint state of the full window. When the oldest state $X_m$ is removed, we linearize all factors connected to $X_m$ regardless of parity.

Unified Marginalization Formulation

Partition the window state into $X_m$ (to be marginalized), $X_a$, and $X_i$. The linearized normal equations at the marginalization point are [42]:

$$H_{full} = \begin{bmatrix} H_{mm} & H_{ma} & H_{mi} \\ H_{am} & H_{aa} & H_{ai} \\ H_{im} & H_{ia} & H_{ii} \end{bmatrix}, \quad b_{full} = \begin{bmatrix} b_m \\ b_a \\ b_i \end{bmatrix} \tag{14}$$

Here, the off-diagonal blocks $H_{ma}$ and $H_{mi}$ represent the couplings between the oldest state and the remaining active and inactive states in the window. To eliminate $X_m$, we apply the Schur complement:

$$H_p = \begin{bmatrix} H_{aa} & H_{ai} \\ H_{ia} & H_{ii} \end{bmatrix} - \begin{bmatrix} H_{am} \\ H_{im} \end{bmatrix} H_{mm}^{-1} \begin{bmatrix} H_{ma} & H_{mi} \end{bmatrix}, \quad b_p = \begin{bmatrix} b_a \\ b_i \end{bmatrix} - \begin{bmatrix} H_{am} \\ H_{im} \end{bmatrix} H_{mm}^{-1} b_m \tag{15}$$

The resulting prior information matrix $H_p$ and information vector $b_p$ are defined by the joint remaining state $[X_a^{T}, X_i^{T}]^{T}$ rather than by the active subset alone. The corresponding prior matrix is generally dense because the Schur complement introduces couplings between all remaining state blocks. Expanding the prior matrix in block form gives

$$H_p = \begin{bmatrix} H_{aa}^{*} & H_{ai}^{*} \\ H_{ia}^{*} & H_{ii}^{*} \end{bmatrix} \tag{16}$$

Crucially, the off-diagonal blocks $H_{ai}^{*}$ and $H_{ia}^{*}$ encode cross-parity correlations, coupling odd and even states through a shared prior. Even when inactive variables are held fixed in the current cycle, this unified prior still links the active update to the inactive linearization point and prevents the estimator from splitting into two weakly connected sub-estimators. Since each state becomes active at least every other update, the solver remains warm-started and regularly re-linearized, reducing the worst-case solution time while preserving full-window information.
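The Schur complement in (15) can be sketched on a small dense system. The example uses a randomly generated symmetric positive-definite $H_{full}$ as a stand-in for the linearized window (an assumption for illustration); indices 0–1 play the role of $X_m$, 2–3 of $X_a$, and 4–5 of $X_i$.

```python
import numpy as np

def marginalize(H, b, m_idx):
    """Eliminate the states in m_idx from a symmetric system (H, b) via the Schur complement."""
    m_set = set(m_idx)
    keep = [i for i in range(H.shape[0]) if i not in m_set]
    H_mm = H[np.ix_(m_idx, m_idx)]
    H_rm = H[np.ix_(keep, m_idx)]      # stacked [H_am; H_im]
    H_rr = H[np.ix_(keep, keep)]       # [[H_aa, H_ai], [H_ia, H_ii]]
    H_mm_inv = np.linalg.inv(H_mm)
    H_p = H_rr - H_rm @ H_mm_inv @ H_rm.T   # H symmetric, so [H_ma, H_mi] = H_rm.T
    b_p = b[keep] - H_rm @ H_mm_inv @ b[m_idx]
    return H_p, b_p

rng = np.random.default_rng(0)
A = rng.standard_normal((12, 6))
H_full = A.T @ A + np.eye(6)           # hypothetical SPD information matrix
b_full = rng.standard_normal(6)

H_p, b_p = marginalize(H_full, b_full, m_idx=[0, 1])
H_ai_star = H_p[:2, 2:]                # cross-parity block of the unified prior
```

Solving the reduced system $(H_p, b_p)$ reproduces the remaining part of the full solution, and the cross block $H_{ai}^{*}$ is generally non-zero. Marginalizing each parity group separately would drop exactly this block.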

Runtime Reduction Rationale

Because the active state dimension is approximately halved in each cycle and the number of IMU residuals evaluated per iteration is also reduced accordingly (Section 4.2.1), the per-iteration cost of linearization and solving decreases significantly in practice. In Section 5.4.2, we further quantify the runtime reduction and accuracy impact through a comparative study of different windowing strategies (parity vs. 5/10 window).

4.2.3. Summary of the Parity-Window Back-End

This section introduces a Parity-Window back-end for real-time sliding-window optimization under embedded resource constraints. The key idea is to alternate updates over two interleaved parity groups so that only a subset of states is solved at each cycle, while estimation consistency is maintained through factor-graph restructuring and unified full-window marginalization. Specifically, we (i) sparsify inertial constraints by bridging consecutive active frames via composed pre-integration, (ii) retain all visual factors incident to the active set (including Active–Inactive links) to preserve geometric observability, and (iii) apply joint marginalization over the full window to produce a single Schur-complement prior that preserves cross-parity correlations and prevents odd–even decoupling. Compared with conventional full-window optimization, the proposed method reduces per-cycle solution complexity; compared with reduced-window approaches, it preserves the original long-horizon constraint structure.

Update procedure: At each time step $t$, the Parity-Window back-end executes the following steps:

1. Determine the active state set $X_a$ according to the parity rule in (11), and treat the remaining states as anchors (inactive).
2. Construct the reduced factor set by (i) composing IMU pre-integrations to form sparse IMU bridges between successive active states as in (12) and (ii) building the inclusive visual factor set that retains all visual constraints incident to $X_a$ as in (13).
3. Initialize the optimization from the previous solution and solve the reduced problem by updating only the increments of active states ($\delta X_a$), while fixing anchor-state increments to zero ($\delta X_i = 0$).
4. Apply the obtained increments to update the full window state and slide the window forward by adding the newest state and its associated measurements.
5. Select the marginalization target(s) at the window tail and perform unified marginalization on the full joint state following (14)–(16) so that cross-parity correlations are preserved through the Schur complement.
6. Toggle the parity for the next step ($t \to t + 1$) and repeat the above procedure.

Figure 6 provides an overview of the proposed back-end, covering the alternating update mechanism, factor-graph construction, computational impact, and the unified marginalization principle. Specifically, (a) illustrates the parity Gauss–Newton schedule; (b) contrasts the standard sliding-window factorization with the proposed sparse IMU bridging and visual anchoring strategy; (c) reports the normalized per-iteration complexity reduction in linearization, assembly, and solving; and (d) explains why marginalization must be joint to preserve the off-diagonal cross-parity terms and thus maintain odd–even coupling.

For clarity, the overall real-time update procedure of the proposed system, including the interaction between the parallelized vision front-end and the Parity-Window back-end, is summarized in Algorithm 1.

Algorithm 1. Real-time update procedure of the proposed navigation system.

Input: synchronized stereo images, IMU and altimeter measurements, current sliding-window state X, parity label p
Output: updated navigation state X and system status

1. Acquire synchronized stereo, IMU, and altimeter measurements.
2. Execute the vision front-end according to the current feature state:
(a) stable state: dual-stream temporal tracking only;
(b) critical state: temporal tracking and background feature replenishment in parallel;
(c) emergency state: intensified replenishment and degradation-flag triggering if recovery fails.
3. Validate stereo–temporal correspondences and construct the visual observation set $C$ with measurements $z_{l}^{k,c}$.
4. Use the nearest-in-time altimeter reading as an auxiliary geometric reference for visual depth initialization and triangulation.
5. Select the active keyframe-state subset $X_a$ according to the parity rule in (11), and define the inactive keyframe-state subset as $X_i = X \setminus X_a$.
6. Construct the reduced optimization problem:
(a) build IMU residuals $r_{imu}^{ij}$ between consecutive active states using the composed pre-integration in (12);
(b) collect visual residuals $r_{vis}^{l,k,c}$ incident to $X_a$ according to (13);
(c) include the prior residual $r_{prior}$ from marginalization.
7. Solve for the active-state increment:

$$\delta X_a = \arg \min_{\delta X_a} \left( \| r_{prior} \|^{2} + \sum_{(i,j) \in B_a} \| r_{imu}^{ij} \|_{\Sigma_{imu,ij}^{-1}}^{2} + \sum_{(l,k,c) \in C_a} \rho \left( \| r_{vis}^{l,k,c} \|_{\Sigma_{vis,l,k,c}^{-1}}^{2} \right) \right)$$

while fixing inactive increments $\delta X_i = 0$, where $B_a$ and $C_a$ denote the IMU and visual factor sets built in steps 6(a) and 6(b). Form the full keyframe-state increment $\delta X = [\delta X_a^{T}, 0^{T}]^{T}$ and update the state using manifold retraction: $X \leftarrow X \oplus \delta X$.
8. Slide the window forward and update the state buffer.
9. Apply unified marginalization to the full window following (14)–(16), and update the prior $(H_p, b_p)$.
10. Toggle the parity label $p$ for the next cycle and output the updated navigation state.

5. Experimental Validation of MarsBird-VII Navigation System

All experiments were conducted on the MarsBird-VII onboard navigation system described in Section 3.1. Flight testing was performed at the Songjiang Experimental Site of the Harbin Institute of Technology and nearby open areas. The platform carried a custom payload, including an engineering model of the Tianwen-3 avionics and the proposed onboard navigation system. Ground truth was provided by an onboard dual-antenna RTK-GPS system, offering centimeter-level positioning accuracy and approximately 0.2° heading accuracy. Temporal alignment between the RTK-GPS and the onboard sensor suite was achieved through a tightly coupled hardware–software synchronization mechanism based on RTK Pulse-Per-Second (PPS) triggering. Specifically, the PPS signal from the RTK module was used to align the software timestamps of the sensor acquisition chain to the RTK reference clock, thereby minimizing cross-sensor timing bias during trajectory evaluation.

During all test flights, onboard navigation was executed in real time. The experimental campaign evaluated the system in three aspects: mission functionality, environmental robustness, and real-time performance. Mission functionality was assessed through cruise and low-altitude sampling-related flight profiles representative of Tianwen-3-class rotorcraft operations. The 80 s cruise flight represented a nominal reconnaissance segment, whereas the 60 s low-altitude flight represented the core near-surface sampling phase requiring sustained hovering and local maneuvering under tight clearance constraints. These durations were selected to reflect representative operational segments rather than the upper limit of mission endurance. Long-horizon stability was further evaluated using separate 130 s high-dynamic sequences. Environmental robustness was assessed under diverse terrain conditions designed to emulate key Martian challenges, such as low-texture surfaces and undulating terrain. Real-time performance was quantified by comparing the system before and after the proposed optimizations.

5.1. Basic Flight Mission Evaluation of MarsBird-VII

5.1.1. Cruise Flight

The cruise flight task was designed to assess the system’s ability to support extended-area surveys, enabling MarsBird-VII to explore broader regions. In this experiment, the rotorcraft executed an 80 s back-and-forth trajectory (Figure 7). Even when the rotorcraft occasionally exceeded speeds of 2 m/s, which caused intermittent degradation in feature tracking, the sliding-window nonlinear optimization effectively mitigated transient frame-level degradation, ensuring robust state estimation and reliable navigation performance. As a result, the maximum position error remained below 0.25 m, velocity deviations did not exceed 0.2 m/s, attitude errors were limited to 0.1 rad, and angular velocity errors, after an initial transient of 0.06 rad/s, stabilized below 0.03 rad/s throughout the maneuver. These results confirm that the navigation system can maintain stable and accurate flight and indicate its suitability for extended-area survey missions.

5.1.2. Low-Altitude Sampling Flight

Low-altitude sampling is a key operational scenario in China’s Tianwen-3 Mars sample-return concept. In this phase, the rotorcraft is required to maintain a flight height of approximately 2 m above the terrain for precision detection. This altitude provides sufficient spatial resolution while mitigating dust disturbances that can degrade sensor performance, especially for vision-based navigation. However, operating this close to the surface also presents substantial navigation challenges. The limited clearance leaves little room for maneuvering and amplifies the effect of navigation errors, so even minor deviations in velocity or attitude can produce noticeable trajectory offsets and reduce flight stability.

As demonstrated in a 60 s low-altitude flight test (Figure 8), the proposed navigation system achieved high-precision performance under these challenging conditions. Position error was bounded within 0.2 m, and velocity deviations—after an initial transient of 0.5 m/s—converged and remained below 0.2 m/s. Attitude deviations did not exceed 0.03 rad, while angular rate errors stabilized below 0.04 rad/s following an initial offset of 0.025 rad/s. These results demonstrate that the system can support the navigation demands of a low-altitude near-surface sampling flight and indicate its potential suitability for future autonomous sampling missions on Mars.

Overall, the two experimental scenarios demonstrate that the proposed navigation system can sustain stable flight performance for MarsBird-VII. It consistently achieves decimeter-level position accuracy, bounded velocity deviations, and stable attitude control across both the extended-area survey flight and the low-altitude sampling flight. Together, these results indicate that the system can meet the navigation demands of near-surface sampling flight and is a promising candidate for future autonomous sampling missions on Mars.

5.2. Comparative Evaluation Against Filter-Based Baselines

We evaluated the proposed VIO framework using a real-world flight sequence recorded by the MarsBird-VII avionics platform at a flight altitude of approximately 5 m. This altitude was chosen to mimic the operational height of the low-altitude sample-collection phase envisioned for the Tianwen-3 mission. The sequence is characterized by moderate-to-high dynamics and was explicitly designed to stress-test autonomous navigation. The flight profile includes phases of high-speed forward motion (up to 2 m/s) and aggressive in-place yaw rotations, which induce rapid feature displacement and intermittent motion blur. Ground truth is provided by a dual-antenna RTK-GPS system, offering high-precision position and heading references.

5.2.1. Baselines, Metrics, and Protocols

To benchmark the proposed optimization-based system, we compared it against two widely adopted filter-based estimators: OpenVINS, a state-of-the-art feature-based MSCKF estimator, and ROVIO, a direct photometric EKF estimator known for rapid initialization.

Metric: Trajectory accuracy is quantified using the translation absolute pose error (APE). To account for global reference-frame differences while preserving the metric scale, all estimated trajectories are aligned to the RTK-GPS ground truth using the Special Euclidean group (SE(3)) Umeyama alignment (rigid transformation without scale correction). We report both the root mean square error (RMSE) and the maximum error (Max).
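The metric above can be illustrated with a minimal sketch, assuming time-synchronized N×3 position arrays for the estimate and the RTK-GPS reference. The function names `align_se3` and `translation_ape` are illustrative and are not taken from the authors' evaluation tooling. The alignment is the closed-form Umeyama solution with the scale fixed to 1.

```python
import numpy as np

def align_se3(est, ref):
    """Least-squares rigid transform (R, t), without scale, mapping est onto ref."""
    est = np.asarray(est, dtype=float)
    ref = np.asarray(ref, dtype=float)
    mu_e, mu_r = est.mean(axis=0), ref.mean(axis=0)
    # Cross-covariance between the centred reference and estimate points.
    cov = (ref - mu_r).T @ (est - mu_e) / est.shape[0]
    U, _, Vt = np.linalg.svd(cov)
    # Reflection guard so that det(R) = +1.
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0
    R = U @ S @ Vt
    t = mu_r - R @ mu_e
    return R, t

def translation_ape(est, ref):
    """Return (RMSE, Max) of the position error after SE(3) alignment."""
    est = np.asarray(est, dtype=float)
    ref = np.asarray(ref, dtype=float)
    R, t = align_se3(est, ref)
    aligned = (R @ est.T).T + t
    err = np.linalg.norm(aligned - ref, axis=1)
    return float(np.sqrt(np.mean(err ** 2))), float(err.max())
```

Because the scale is not estimated, any metric-scale error in the estimator remains visible in the reported APE rather than being absorbed by the alignment.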

Sensor parity and evaluation tracks: Our onboard platform provides synchronized stereo imagery, an IMU, and a laser altimeter. We conduct two complementary evaluation tracks to ensure (i) a fair comparison with external baselines and (ii) a controlled assessment of visual modality. (i) Filter-based baseline comparison: In the comparison with OpenVINS and ROVIO, all methods are evaluated under identical sensing conditions using stereo imagery and IMU only; the laser altimeter is excluded to avoid disadvantaging methods that do not natively fuse range measurements. (ii) Visual-modality comparison (ours only): We compare stereo visual–inertial (SVI) and monocular visual–inertial (MVI) configurations within our system on MarsBird-VII. MVI uses the left camera only with the same IMU stream, while SVI uses the synchronized stereo pair with the IMU. All estimator settings (sliding-window size, initialization, marginalization, and parameters) are kept identical.

For OpenVINS and ROVIO, the reported APE statistics are mean values over runs that were completed normally and produced continuous trajectories without interruption. All baseline methods were configured using the authors’ recommended settings.

In this work, robustness refers to the ability of an estimator to maintain stable and reliable state estimation under high-dynamic flight conditions, especially during rapid motion, large yaw rotations, and transient visual degradation. In practice, robustness is reflected by whether the estimator can preserve bounded performance without pronounced covariance/uncertainty growth, fast drift accumulation, or severe degradation in trajectory quality. Consistency is used here in the practical long-horizon sense, namely, whether local estimation errors remain containable and recoverable over time rather than propagating into severe global trajectory corruption.

5.2.2. Evaluation Protocols

The aggressive maneuvers in this sequence pose significant challenges to conventional filter-based estimators. Across repeated trials, we observed the following practical limitations in the baselines:

OpenVINS instability: OpenVINS is highly sensitive to initialization in this sequence and frequently fails to initialize. Even when initialization succeeds, during segments with high-speed motion and large yaw rotations, the estimator may experience rapid covariance growth, leading to fast drift and, in some trials, divergence.

ROVIO degradation: ROVIO generally initializes more reliably, but its estimation uncertainty increases under rapid motion and degraded imagery. In addition, the computational load of direct photometric tracking can occasionally exceed the real-time processing budget on the embedded platform.

These effects are treated as robustness-related degradation behaviors in this study, because they indicate reduced estimator stability and increased sensitivity to high-dynamic motion and degraded measurements.

To provide a fair and comprehensive comparison, we report results under two complementary evaluation settings:

(Setting A) Full-sequence evaluation (robustness): We evaluate each method over the complete flight without reset or re-initialization. This setting reflects a realistic autonomous sortie, where estimator restart is undesirable for safety-critical operation. The position/velocity error time histories are shown in Figure 9a–c, while Figure 9d provides the corresponding reference trajectory and velocity profiles to indicate the high-dynamic motion segments. Initialization failures or divergence are reported as robustness limitations.

(Setting B) Stable-segment evaluation (precision): We additionally evaluate a nominal stable segment within the same flight to assess precision under comparatively stable conditions. The segment is selected using objective criteria: (i) RTK-GPS remains in fixed-solution status, and (ii) all compared methods output continuous trajectories without discontinuities. For this segment, we report the position/velocity error time histories in Figure 10a–c, together with the reference trajectory and velocity profiles in Figure 10d.

Since OpenVINS is sensitive to the start time, we repeat runs with n = 40 starting offsets (step size Δt = 0.5 s); 29/40 runs produced continuous trajectories over the full flight under a no-restart/no-re-initialization protocol. We categorize failures as (i) initialization failure (no valid pose output within 5 s or explicit initialization failure reported by the estimator), (ii) trajectory interruption (missing pose outputs or process termination/crash), and (iii) divergence/numerical instability (rapidly growing residual/innovation statistics). Quantitative errors are reported over successful runs, whereas failed runs are summarized separately as robustness limitations.
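The start-offset sweep and the three failure categories can be sketched as follows. This is only an illustration: the first offset of 0 s, the order in which the checks are applied, and the names `start_offsets`, `classify_run`, and `summarize` are assumptions, not details given in the protocol.

```python
from collections import Counter

def start_offsets(n=40, step_s=0.5, first_s=0.0):
    """Start times for the repeated runs (first_s = 0 is an assumption)."""
    return [first_s + i * step_s for i in range(n)]

def classify_run(first_pose_delay_s, has_gap, diverged, init_timeout_s=5.0):
    """Map per-run observations to one of the categories used in the text.

    first_pose_delay_s: seconds until the first valid pose, or None if the
    estimator reported an explicit initialization failure.
    """
    if first_pose_delay_s is None or first_pose_delay_s > init_timeout_s:
        return "init_failure"
    if has_gap:          # missing pose outputs or process termination
        return "interruption"
    if diverged:         # rapidly growing residual/innovation statistics
        return "divergence"
    return "success"

def summarize(labels):
    """Return per-category counts and the fraction of successful runs."""
    counts = Counter(labels)
    return counts, counts["success"] / len(labels)
```

With 29 successful runs out of 40, `summarize` would report a success fraction of 0.725.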

5.2.3. Quantitative Results

Full-Sequence Results (Robustness Test)

Table 1 summarizes the full-sequence performance after SE(3) alignment. The proposed method remains stable over the complete aggressive flight and achieves the lowest error (RMSE 0.31 m, Max 0.47 m). In contrast, the filter-based baselines show substantial degradation over the full sequence: ROVIO reaches an RMSE of 2.93 m (Max 4.72 m), while OpenVINS accumulates large errors, with an RMSE of 6.89 m and a maximum error of 14.75 m.

The corresponding time-varying position/velocity error traces are provided in Figure 9a–c. Figure 9d shows the reference trajectory and velocity profiles, which indicate the high-speed and sharp-turn segments where filter-based estimators are more prone to error accumulation. The baseline estimators exhibit a clear tendency of error accumulation over time (more pronounced in the second half of the flight), whereas the proposed method maintains bounded errors throughout the mission. Together with the full-sequence RMSE/Max statistics in Table 1, these results indicate improved robustness and long-horizon stability of the proposed approach on this aggressive flight sequence.

Stable-Segment Results (Precision Test)

We further evaluate a nominal stable segment to assess intrinsic precision under comparatively benign conditions, where all estimators maintain continuous tracking. As reported in Table 1 and visualized in Figure 10a–c, the proposed method still achieves the best accuracy (RMSE 0.06 m, Max 0.15 m), outperforming OpenVINS (RMSE 0.23 m, Max 0.52 m) and ROVIO (RMSE 0.81 m, Max 1.78 m). These results indicate that, beyond enhanced robustness over the full aggressive flight, the proposed sliding-window optimization also provides superior precision relative to sequential filtering under nominal conditions. Figure 10d further confirms that this segment corresponds to a comparatively stable motion profile, which explains why all methods can maintain continuous trajectories and why this setting is suitable for precision-oriented comparison.

5.2.4. Effect of Visual Modality: Stereo vs. Monocular Configuration

We isolate the effect of visual modality by comparing stereo visual–inertial and monocular visual–inertial configurations within our system on the same MarsBird-VII dataset. MVI uses the left camera only, together with the same IMU stream, whereas SVI uses the synchronized stereo pair and the IMU. All estimator settings (state definition, IMU pre-integration, sliding-window size, initialization, marginalization, and solver parameters) are kept identical; the only change is disabling the right-camera measurements for MVI (thus removing stereo constraints).

Table 2 summarizes the translation APE statistics (RMSE and Max), and Figure 11 visualizes the corresponding full-sequence position/velocity error time histories. Overall, SVI achieves a lower translation error and more stable long-horizon behavior than MVI, particularly during high-dynamic segments with rapid attitude changes and brief visual degradation, where MVI shows stronger error accumulation.

Mechanistically, the difference arises from how metric geometry is established. In the stereo configuration, the fixed binocular baseline provides direct depth constraints at each update, which improves landmark initialization and reduces dependence on motion-induced observability. In the monocular configuration, by contrast, the metric scale is recovered more indirectly through the coupling of motion excitation and inertial estimation, making it more vulnerable to low-excitation flight, inertial bias, and temporary image degradation. As a result, under rapid attitude changes or brief visual degradation, stereo is better able to preserve geometric constraint continuity and thus suppress rapid error accumulation.
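The direct depth constraint can be made concrete with the standard relation for an ideal rectified stereo pair. This is a textbook model used here for illustration; the symbols f (focal length in pixels), b (baseline), d (disparity in pixels), and Z (depth) are assumptions and do not refer to the calibration of the MarsBird-VII cameras.

```latex
Z = \frac{f\,b}{d},
\qquad
|\delta Z| \approx \frac{Z^{2}}{f\,b}\,|\delta d|
```

The first expression shows that a single stereo observation fixes metric depth without any motion. The second shows that, for a given disparity error, depth uncertainty grows with the square of the range, which is one reason why a low flight height favors stereo depth.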

5.2.5. Discussion

The experiments demonstrate that the proposed navigation framework provides a practical accuracy–robustness advantage for MarsBird-VII-relevant flight scenarios compared with representative filter-based baselines and with alternative visual–inertial sensing modalities.

Performance Gains via Iterative Re-Linearization

Compared with sequential filter-based estimators, the proposed sliding-window optimizer improves practical long-horizon consistency under aggressive maneuvers. Recent states and measurements are repeatedly re-linearized and jointly refined within the active window, making the estimator less sensitive to short-term visual degradation or local data-quality fluctuations. This repeated refinement helps contain local errors before they propagate into severe trajectory drift. In contrast, recursive filters are more vulnerable to error accumulation once affected by high-dynamic motion or degraded measurements, consistent with the bounded long-horizon errors shown in Figure 9.

Higher Precision for Sampling-Related Operations

The sensing-modality comparison indicates a clear advantage of stereo over monocular VIO. Under identical estimator settings, stereo reduces translation RMSE from 0.64 m to 0.31 m and the maximum error from 2.11 m to 0.47 m (Table 2; Figure 11). This improvement is consistent with stereo’s fixed baseline, which provides instantaneous metric depth and structurally decouples the metric observability from IMU excitation, thereby enhancing robustness to low-excitation maneuvers, inertial bias, and degraded imaging. These results support stereo visual–inertial sensing as a balanced design choice that jointly improves precision, robustness, and reliability for sampling-related operations.

Mission Relevance

These results indicate that the proposed navigation system is capable of supporting the operational demands of MarsBird-VII-relevant flight scenarios, enabling autonomous flight under degraded imaging and challenging flight dynamics. The system maintains a practical balance between precision and reliability during aggressive maneuvers and transient tracking degradation, supporting sampling-related flight profiles that require stable and consistent state estimation.

5.3. Environmental Robustness Experiments

To evaluate the robustness of the proposed visual front-end under diverse terrain conditions, we selected representative environments that emulate key Mars-related challenges (Figure 12). Specifically, snow was used to simulate low-texture surfaces, whereas sand mounds were used to represent undulating terrain with varying elevations.

Figure 12 shows the representative tracked features and histograms of long-term tracking counts. For visualization purposes, continuously tracked features are shown in red, newly extracted features are indicated in blue, while green arrows indicate the pixel displacements of tracked features between consecutive frames, i.e., the inter-frame motion of feature points. The parameters were configured such that 160 features were initially extracted in the first frame, with a replenishment threshold of T_u = 100. During flight, the number of tracked features gradually decreased but remained above the replenishment threshold for most of the sequence. When the count dropped below T_u, the system entered replenishment mode, where new features were extracted while stereo association was deferred to the next frame. The histogram confirms that, across both snow and sand environments, feature tracking was maintained in a stable manner, demonstrating the robustness of the visual front-end and its adaptability to low-texture and uneven-terrain conditions.
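The mode switch described above can be sketched as a small per-frame state update. The sketch assumes that replenishment tops the feature set back up to the initial count of 160 and that deferred features join the tracked set on the following frame. The names `FrontEndState`, `update`, and `refill_to` are hypothetical, and the refill policy is an assumption.

```python
from dataclasses import dataclass

T_U = 100      # replenishment threshold from the text
N_INIT = 160   # features extracted in the first frame

@dataclass
class FrontEndState:
    tracked: int = N_INIT  # features with established tracks
    pending: int = 0       # new features awaiting stereo association

def update(state, lost_this_frame, refill_to=N_INIT):
    """Advance one frame and return (new_state, mode)."""
    # Features detected on the previous frame are stereo-associated now.
    tracked = state.tracked + state.pending
    tracked = max(tracked - lost_this_frame, 0)
    if tracked < T_U:
        # Replenishment mode: extract new features, defer their association.
        return FrontEndState(tracked, refill_to - tracked), "replenish"
    # Tracking-only mode: no detection on the critical path.
    return FrontEndState(tracked, 0), "track_only"
```

Deferring the association keeps the extra detection work off the frame that triggered it, so the cost of a replenishment frame is spread over two frames.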

In addition to the terrain-oriented experiments, we further evaluated the vision front-end under representative Martian visual disturbance conditions, including illumination variation and dust-induced image blur. The navigation cameras use rapid autoexposure to maintain usable image brightness during illumination changes, thereby supporting continuous feature tracking. For dust-related disturbance, Figure 13 compares feature extraction before and after dust-induced blur. Although the blurred image shows reduced contrast and local loss of texture, the front-end still retains a sufficient number of usable feature points. This indicates that the proposed front-end can support feature extraction and tracking under illumination variation and light-to-moderate dust-induced blur. Under severe dust occlusion, visible-spectrum feature tracking may become unreliable. In such cases, the system-level degradation flag is triggered to request conservative flight control.

5.4. Experimental Evaluation: Real-Time Performance

To separately validate the effectiveness of the two proposed modules, Section 5.4.1 evaluates the front-end redesign in terms of runtime reduction and tracking stability, while Section 5.4.2 evaluates the Parity-Window back-end in terms of accuracy–efficiency trade-off and deterministic real-time compliance.

5.4.1. Vision Front-End Validation: Runtime Reduction and Tracking Stability

Evaluation of Parallel Feature Detection with Region-Based Masking

The feature detection module was evaluated before and after optimization (Figure 14 and Table 3). Originally, Shi–Tomasi corners were detected across the entire image, providing robust feature coverage but at a high computational cost, especially for high-resolution stereo frames. The optimized algorithm adopts a parallel, region-based strategy: each image is divided into 16 regions processed in parallel on eight ARM Cortex-A cores, with a dynamic, region-wise masking scheme to maintain spatially uniform and long-lived features. For visualization purposes, the detected feature points are highlighted in yellow. While the optimized method yields feature locations that differ somewhat from those of the original implementation, the detected points consistently satisfy the quality requirements for navigation, thereby improving real-time performance without compromising robustness.
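The region-based strategy can be outlined in a few lines. This sketch assumes a 4×4 grid for the 16 regions and uses pre-scored candidate points as a stand-in for the Shi–Tomasi detector. The function names, the per-region quota, and the minimum-distance mask are illustrative, not the onboard implementation. Python threads only show the structure here and do not reproduce the multi-core speed-up.

```python
from concurrent.futures import ThreadPoolExecutor

def grid_regions(width, height, cols=4, rows=4):
    """Tile the image into cols*rows boxes (x0, y0, x1, y1); 4x4 is an assumption."""
    xs = [width * i // cols for i in range(cols + 1)]
    ys = [height * j // rows for j in range(rows + 1)]
    return [(xs[i], ys[j], xs[i + 1], ys[j + 1])
            for j in range(rows) for i in range(cols)]

def far_from(pt, others, min_dist):
    """True if pt is at least min_dist pixels from every point in others."""
    return all((pt[0] - q[0]) ** 2 + (pt[1] - q[1]) ** 2 >= min_dist ** 2
               for q in others)

def detect_in_region(region, candidates, existing, quota, min_dist):
    """Keep the best-scoring candidates in one region, masked around known features."""
    x0, y0, x1, y1 = region
    inside = [c for c in candidates if x0 <= c[0] < x1 and y0 <= c[1] < y1]
    kept = []
    for x, y, _score in sorted(inside, key=lambda c: -c[2]):
        if len(kept) >= quota:
            break
        if far_from((x, y), existing + kept, min_dist):
            kept.append((x, y))
    return kept

def detect_parallel(width, height, candidates, existing,
                    quota=10, min_dist=20, workers=8):
    """Run the per-region selection on a pool of workers and merge the results."""
    regions = grid_regions(width, height)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(
            lambda r: detect_in_region(r, candidates, existing, quota, min_dist),
            regions)
        return [p for kept in results for p in kept]
```

The per-region quota is what keeps the features spatially uniform, and the mask around existing tracks avoids spending detections on areas that are already covered. In this simplified version the mask is not enforced between new points in neighboring regions.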

Performance statistics were collected over 2270 frames. In the original implementation, the mean processing time was 81.85 ms, with a standard deviation of 7.37 ms (min/max 76.92/100.27 ms; median 81.85 ms). After optimization, the mean runtime decreased to 23.13 ms, with a standard deviation of 4.20 ms (min/max 19.46/51.97 ms; median 23.13 ms). Overall, the optimization reduced the mean runtime by 69.74%, while the reduced variance indicates an improved runtime stability.

Evaluation of Conditional and Parallelized Feature Processing

The original serial procedure incurred high latency, particularly when feature supplementation was needed. The optimized framework switches adaptively between tracking-only and feature-replenishment modes, executing bidirectional LK optical flow in parallel on both cameras and performing feature re-detection concurrently when necessary. Newly detected features are associated in the subsequent frame, reducing critical-path latency.

The time consumption of the feature tracking module before and after optimization is summarized in Table 4. In the original implementation, the average per-frame processing time was 71.37 ms, with a standard deviation of 18.10 ms, which often exceeded the hard real-time budget (Section 3.1). In contrast, the optimized procedure reduced the mean runtime to 22.93 ms, with a smaller standard deviation of 8.27 ms, ensuring that most frames were processed within the available interval. Re-detection was required in only 170 out of 2270 frames, where the average runtime was 42.43 ms, while frames without re-detection averaged 21.35 ms. Overall, the optimized procedure satisfies the real-time requirement not only in the mean runtime but also in upper-quartile latency, with a Q3 of 25.33 ms for the total optimized workflow.
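As a quick consistency check on these figures, the overall mean runtime should equal the frame-weighted average of the two per-mode means. The sketch below uses only the numbers quoted in the paragraph above; the helper name `weighted_mean` is illustrative.

```python
def weighted_mean(groups):
    """groups: iterable of (frame_count, mean_runtime_ms) pairs."""
    groups = list(groups)
    total = sum(n for n, _ in groups)
    return sum(n * mean for n, mean in groups) / total

frames_total = 2270
frames_redetect = 170

overall_ms = weighted_mean([
    (frames_redetect, 42.43),                 # frames with re-detection
    (frames_total - frames_redetect, 21.35),  # tracking-only frames
])
redetect_share = frames_redetect / frames_total

print(f"overall mean: {overall_ms:.2f} ms")        # 22.93 ms
print(f"re-detection share: {redetect_share:.1%}")  # 7.5%
```

The weighted average reproduces the reported 22.93 ms mean, and re-detection affects only about 7.5% of frames, which is why the slower replenishment path has a limited effect on the mean.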

The optimized vision front-end ensures stable feature tracking, as shown in Figure 15 and Figure 16. Figure 15 provides example images of tracked feature distributions before and after optimization. In the left image, red points denote continuously tracked features, while green arrows indicate the pixel displacements of these features between consecutive frames (i.e., the inter-frame motion of feature points). In the right image, green points represent stereo correspondences between the left and right cameras. This visualization confirms that the optimized method maintains stable and sufficient feature tracking. Figure 16 illustrates the distribution of feature counts during continuous tracking after optimization. Although the number of tracked features gradually decreases over time, it remains above the threshold of 100 required for reliable navigation. Only when the count falls below this threshold does the system supplement a limited number of features to maintain an adequate set of points.

5.4.2. Optimization Back-End Validation: Accuracy–Efficiency Trade-Off

To isolate the contribution of the proposed optimization back-end, we evaluate the Parity-Window strategy against two baselines: (i) a Full 10-Window optimizer (W = 10), representing a near upper bound of accuracy under stronger multi-frame constraints, and (ii) a Reduced 5-Window optimizer (W = 5), representing a conventional load-reduction approach.

Statistical Distribution of Optimization Latency

To verify real-time feasibility on the RK3588S, we analyze the distribution of back-end optimization latency over 5000+ update cycles. Following the hard real-time budget T_b (Section 3.1), we define real-time compliance using a conservative tail-latency criterion: the maximum back-end latency must remain below T_b over the evaluated cycles. Table 5 summarizes the resulting latency statistics.

Although the W = 10 configuration achieves a mean latency of 57.54 ms, its third quartile and maximum latency (68.94 ms and 140.76 ms) exceed the 66.7 ms real-time budget T_b, indicating frequent deadline violations caused by tail-latency jitter. Such latency spikes may accumulate across update cycles and degrade closed-loop responsiveness. In contrast, Parity-Window substantially tightens the latency distribution: its maximum latency is 58.32 ms, safely below T_b, while maintaining accuracy close to the full-window optimizer.
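The tail-latency criterion can be expressed as a short check over logged per-cycle latencies. This is a sketch under stated assumptions: the function name `latency_summary` and the inclusive quartile method are illustrative choices, and the authors' profiling tool and quartile definition are not specified here.

```python
import statistics

T_B_MS = 66.7  # hard real-time budget quoted in the text

def latency_summary(latencies_ms, budget_ms=T_B_MS):
    """Summarize back-end latency and test the max-latency compliance rule."""
    _q1, _q2, q3 = statistics.quantiles(latencies_ms, n=4, method="inclusive")
    worst = max(latencies_ms)
    return {
        "mean": statistics.fmean(latencies_ms),
        "q3": q3,
        "max": worst,
        # Cycles that reach or exceed the budget count as deadline misses.
        "violations": sum(1 for x in latencies_ms if x >= budget_ms),
        # Conservative criterion: every cycle must finish below the budget.
        "compliant": worst < budget_ms,
    }
```

A configuration whose mean is under budget can still fail this check, which is the situation described for W = 10: a third quartile above T_b implies that more than a quarter of the cycles miss the deadline.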

Accuracy and Robustness Trade-Off

While the W = 5 strategy offers the lowest latency, it incurs a significant loss in estimation fidelity and robustness. Table 6 reports full-trajectory accuracy using APE (RMSE and maximum error). Reducing the window to five frames increases the RMSE by 102.5% relative to W = 10, reflecting reduced long-horizon constraints and fewer opportunities for iterative re-linearization and refinement under strong nonlinearities. In the low-texture Mars-analog segment (Figure 17a), the reduced number of visual constraints in a five-frame horizon weakens the geometric observability, leading to step-like error growth when tracks become intermittent. In the aggressive turning segment with a high yaw rate (Figure 17b), the shorter horizon exacerbates sensitivity to linearization errors under strong nonlinearity, causing significantly larger drift than the Parity-Window and Full 10-Window configurations. The challenging segments shown in Figure 17 were selected based on objective flight-condition criteria, namely, sustained low tracked-feature density and an elevated yaw rate, respectively.

By contrast, the Parity-Window preserves a 10-frame geometric horizon through a unified marginalization prior, while alternating the active optimization set to reduce the per-update computation. As a result, it achieves an accuracy close to W = 10 (only a 19.2% RMSE increase), while maintaining deterministic real-time behavior comparable to reduced-order configurations.
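The alternation of the active optimization set can be illustrated with a minimal sketch. It assumes a plain even/odd split of the window indices that flips at every update. The actual subset definition and the handling of the marginalization prior follow the method section and are not reproduced here, and the name `active_indices` is hypothetical.

```python
def active_indices(window_size, update_index):
    """Window positions optimized at this update; the rest are held fixed.

    Assumes an even/odd split that alternates with the update counter.
    """
    parity = update_index % 2
    return [i for i in range(window_size) if i % 2 == parity]

# Two consecutive updates over a 10-frame window:
even_set = active_indices(10, 0)  # [0, 2, 4, 6, 8]
odd_set = active_indices(10, 1)   # [1, 3, 5, 7, 9]
```

Under this assumption each update solves for half of the window states, yet any two consecutive updates together touch every state, so the 10-frame horizon is retained while the per-update problem stays small.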

Conclusions of Comparative Study

Overall, the Parity-Window provides the most favorable accuracy–efficiency trade-off among the evaluated strategies: it avoids the large tail-latency violations of the full W = 10 optimizer while preventing the accuracy degradation and brittleness observed with W = 5. This windowing strategy is therefore a key component of the proposed real-time navigation architecture, because it enables bounded optimization latency without sacrificing the long-horizon estimation benefits required for embedded Mars rotorcraft operation. From a system perspective, the reduced back-end computational load also improves onboard efficiency and thermal stability. The navigation stack was deployed on the RK3588S platform and operated continuously during flight tests and extended runtime evaluations without observable thermal throttling or performance degradation. These observations suggest that the reduced optimization workload is beneficial not only for real-time latency but also for sustained embedded operation under practical power and thermal constraints.

6. Conclusions

This study developed and validated an autonomous navigation system for the MarsBird-VII rotorcraft under Tianwen-3-class onboard constraints. Earth-analog flight experiments demonstrate stable closed-loop navigation performance in both cruise and low-altitude sampling-related flight profiles. The system achieves decimeter-level positioning and stable attitude control while sustaining real-time onboard updates within the available computing budget. In full-sequence evaluation, the method attains an SE(3)-aligned translation APE of 0.31 m RMSE/0.47 m Max and further reaches 0.06 m RMSE/0.15 m Max on a nominal stable segment. Runtime profiling over 5000+ cycles shows that the Parity-Window back-end keeps the maximum optimization latency at 58.32 ms, below the 66.7 ms hard real-time budget, while maintaining accuracy close to full-window optimization. Overall, the results indicate that optimization-based navigation under strict computational constraints is a promising solution for reliable real-time state estimation in Tianwen-3-class rotorcraft navigation scenarios.

Despite these promising results, the current system still has several limitations. While the vision front-end incorporates an emergency degradation flag to request conservative maneuvering during feature depletion, prolonged complete tracking loss remains a fundamental challenge. Under such conditions, extended reliance on IMU-only propagation inevitably leads to unbounded drift. Future work will therefore investigate perception-aware motion planning strategies that explicitly account for state uncertainty and environmental observability. Such strategies may help the rotorcraft avoid perceptually degraded regions and maintain more reliable state estimation during sampling-related operations, thereby improving overall mission robustness and safety.

Author Contributions

Conceptualization, P.L.; Methodology, J.X., H.Q., and P.L.; Software, J.X. and H.Q.; Validation, J.X., H.Q. and Y.Z.; Formal analysis, J.X. and H.Q.; Investigation, J.X., H.Q., Y.Z. and R.W.; Resources, P.L.; Data curation, J.X. and H.Q.; Writing—original draft preparation, J.X.; Writing—review and editing, H.Q., Y.Z., R.W. and P.L.; Visualization, J.X. and H.Q.; Supervision, P.L.; Project administration, P.L.; Funding acquisition, P.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Key R&D Program of China (No. 2024YFC3015804) and the Tianwen-3 Key Technology Research Project.

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to project restrictions. Source code and additional implementation details are not publicly available for the same reason.

Acknowledgments

The authors would like to thank the members of the Pattern Recognition and Intelligent System Research Center, Harbin Institute of Technology, for their support in system integration and flight experiments.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Zhu, K.; Quan, Q.; Wang, K.; Tang, D.; Tang, B.; Dong, Y.; Wu, Q.; Deng, Z. Conceptual design and aerodynamic analysis of a Mars octocopter for sample collection. Acta Astronaut. 2023, 207, 10–23. [Google Scholar] [CrossRef]
Zhao, P.; Li, R.; Wu, P.; Liu, H.; Gao, X.; Deng, Z. Review of Key Technologies of Rotary-Wing Mars UAVs for Mars Exploration. Inventions 2023, 8, 151. [Google Scholar] [CrossRef]
Zhang, W.; Xu, B.; Zhang, H.; Xiang, C.; Fan, W.; Zhao, Z. Analysis of Aerodynamic Characteristics of Propeller Systems Based on Martian Atmospheric Environment. Drones 2023, 7, 397. [Google Scholar] [CrossRef]
Zurek, R.W.; Smrekar, S.E. An Overview of the Mars Reconnaissance Orbiter (MRO) science mission. J. Geophys. Res. Planets 2007, 112, E05S01. [Google Scholar] [CrossRef]
Vasavada, A.R. Mission Overview and Scientific Contributions from the Mars Science Laboratory Curiosity Rover After Eight Years of Surface Operations. Space Sci. Rev. 2022, 218, 20. [Google Scholar] [CrossRef]
National Aeronautics and Space Administration Jet Propulsion Laboratory (NASA JPL). After Three Years on Mars, NASA’s Ingenuity Helicopter Mission Ends; News Release; NASA JPL: Pasadena, CA, USA, 2024. Available online: https://www.jpl.nasa.gov/news/after-three-years-on-mars-nasas-ingenuity-helicopter-mission-ends/ (accessed on 8 February 2026).
Bayard, D.S.; Conway, D.T.; Brockers, R.; Delaune, J.H.; Matthies, L.H.; Grip, H.F.; Merewether, G.B.; Brown, T.L.; San Martin, A.M. Vision-Based Navigation for the NASA Mars Helicopter. In Proceedings of the AIAA SciTech 2019 Forum, San Diego, CA, USA, 7–11 January 2019. Paper AIAA 2019-1411. [Google Scholar] [CrossRef]
Alberico, I.; Delaune, J.; Cioffi, G.; Scaramuzza, D. Structure-Invariant Range-Visual-Inertial Odometry. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2024), Abu Dhabi, United Arab Emirates, 14–18 October 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 10613–10620. [Google Scholar] [CrossRef]
Grip, H.F.; Lam, J.; Bayard, D.S.; Conway, D.T.; Singh, G.; Brockers, R.; Delaune, J.H.; Matthies, L.H.; Malpica, C.; Brown, T.L.; et al. Flight Control System for NASA’s Mars Helicopter. In Proceedings of the AIAA SciTech 2019 Forum, San Diego, CA, USA, 7–11 January 2019. Paper AIAA 2019-1289. [Google Scholar] [CrossRef]
Bayard, D.S. An Overview of the Mars Helicopter Vision-Based Navigation System; Presentation; Jet Propulsion Laboratory, National Aeronautics and Space Administration: Pasadena, CA, USA, 2021. Available online: https://ntrs.nasa.gov/citations/20230004754 (accessed on 30 April 2026).
Zhu, K.; Quan, Q.; Tang, D.; Dong, Y.; Wang, K.; Tang, B.; Wu, Q.; Deng, Z. A Mars quadcopter capable of autonomous flight and sample collection: Structure and avionics. Acta Astronaut. 2024, 214, 712–721. [Google Scholar] [CrossRef]
Dong, Z.; Li, C.; Lü, P. Atmosphere of Mars. In Martian Aeolian Geomorphology; Springer: Cham, Switzerland, 2025; pp. 149–196. [Google Scholar] [CrossRef]
Maimone, M.; Cheng, Y.; Matthies, L. Two years of Visual Odometry on the Mars Exploration Rovers. J. Field Robot. 2007, 24, 169–186. [Google Scholar] [CrossRef]
Matthies, L.; Maimone, M.; Johnson, A.; Cheng, Y.; Willson, R.; Villalpando, C.; Goldberg, S.; Huertas, A.; Stein, A.; Angelova, A. Computer Vision on Mars. Int. J. Comput. Vis. 2007, 75, 67–92. [Google Scholar] [CrossRef]
National Aeronautics and Space Administration (NASA). Mars Communications Disruption and Delay; 2023 Moon to Mars Architecture Concept Review White Paper; NASA: Washington, DC, USA, 2023. Available online: https://www.nasa.gov/wp-content/uploads/2024/01/mars-communications-disruption-and-delay.pdf (accessed on 8 February 2026).
Vasavada, A.R.; Chen, A.; Barnes, J.R.; Burkhart, P.D.; Cantor, B.A.; Dwyer-Cianciolo, A.M.; Fergason, R.L.; Hinson, D.P.; Justh, H.L.; Kass, D.M.; et al. Assessment of Environments for Mars Science Laboratory Entry, Descent, and Surface Operations. Space Sci. Rev. 2012, 170, 793–835.
Mourikis, A.I.; Roumeliotis, S.I. A Multi-State Constraint Kalman Filter for Vision-aided Inertial Navigation. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA 2007), Rome, Italy, 10–14 April 2007; pp. 3565–3572.
Geneva, P.; Eckenhoff, K.; Lee, W.; Yang, Y.; Huang, G. OpenVINS: A Research Platform for Visual-Inertial Estimation. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 4666–4672.
Bloesch, M.; Burri, M.; Omari, S.; Hutter, M.; Siegwart, R. Iterated extended Kalman filter based visual-inertial odometry using direct photometric feedback. Int. J. Robot. Res. 2017, 36, 1053–1072.
Wudenka, M.; Müller, M.G.; Demmel, N.; Wedler, A.; Triebel, R.; Cremers, D.; Stürzl, W. Towards Robust Monocular Visual Odometry for Flying Robots on Planetary Missions. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2021), Prague, Czech Republic, 27 September–1 October 2021; pp. 8737–8744.
Cadena, C.; Carlone, L.; Carrillo, H.; Latif, Y.; Scaramuzza, D.; Neira, J.; Reid, I.; Leonard, J.J. Past, Present, and Future of Simultaneous Localization and Mapping: Toward the Robust-Perception Age. IEEE Trans. Robot. 2016, 32, 1309–1332.
Solodar, D.; Klein, I. VIO-DualProNet: Visual-Inertial Odometry with Learning-Based Process Noise Covariance. Eng. Appl. Artif. Intell. 2024, 133, 108466.
Qin, T.; Li, P.; Shen, S. VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator. IEEE Trans. Robot. 2018, 34, 1004–1020.
Campos, C.; Elvira, R.; Rodríguez, J.J.G.; Montiel, J.M.M.; Tardós, J.D. ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual-Inertial, and Multi-Map SLAM. IEEE Trans. Robot. 2021, 37, 1874–1890.
Zurria, A.; Durante, D.; Iess, L. Preliminary Design and Performance Assessment of a Semi-Autonomous Global Navigation Satellite System on Mars. Acta Astronaut. 2025, 229, 232–245.
Johnson, A.E.; Cheng, Y.; Trawny, N.; Montgomery, J.F.; Schroeder, S.; Chang, J.; Clouse, D.; Aaron, S.; Mohan, S. Implementation of a Map Relative Localization System for Planetary Landing. J. Guid. Control Dyn. 2023, 46, 618–637.
Amzajerdian, F.; Hines, G.D.; Gragossian, A.; Barnes, B.W.; Dostart, N.A. Development of a Coherent Doppler Lidar for Precision Landing on Planetary Bodies. In Proceedings of the 21st Coherent Laser Radar Conference (CLRC-21), Big Sky, MT, USA, 26 June–1 July 2022; Available online: https://ntrs.nasa.gov/citations/20220008277 (accessed on 8 February 2026).
Mittelholz, A.; Johnson, C.L. The Martian Crustal Magnetic Field. Front. Astron. Space Sci. 2022, 9, 895362.
Motohashi, M.; Kubota, T. Feature Point Selection Scheme of Stereo Visual Odometry for Planetary Exploration Rover. In Proceedings of the 2024 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM), Boston, MA, USA, 15–19 July 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1219–1224.
Kadhima, H.A.; Araheemah, W.A. A Comparative Between Corner-Detectors (Harris, Shi-Tomasi & FAST) in Images Noisy Using Non-Local Means Filter. J. Al-Qadisiyah Comput. Sci. Math. 2019, 11, 86–93.
Shi, J.; Tomasi, C. Good Features to Track. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 1994), Seattle, WA, USA, 21–23 June 1994; pp. 593–600.
Zhang, Y.; Zhong, B.; Sun, X. A Benchmark for the Evaluation of Corner Detectors. Appl. Sci. 2022, 12, 11984.
Zingg, S.; Scaramuzza, D.; Weiss, S.; Siegwart, R. MAV Navigation through Indoor Corridors Using Optical Flow. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA 2010), Anchorage, AK, USA, 3–7 May 2010; pp. 3361–3368.
Alfarano, A.; Maiano, L.; Papa, L.; Amerini, I. Estimating Optical Flow: A Comprehensive Review of the State of the Art. Comput. Vis. Image Underst. 2024, 249, 104160.
Forster, C.; Carlone, L.; Dellaert, F.; Scaramuzza, D. On-Manifold Preintegration for Real-Time Visual–Inertial Odometry. IEEE Trans. Robot. 2017, 33, 1–21.
Leutenegger, S.; Lynen, S.; Bosse, M.; Siegwart, R.; Furgale, P. Keyframe-Based Visual–Inertial Odometry Using Nonlinear Optimization. Int. J. Robot. Res. 2015, 34, 314–334.
Jiang, J.; Niu, X.; Guo, R.; Liu, J. A Hybrid Sliding Window Optimizer for Tightly-Coupled Vision-Aided Inertial Navigation System. Sensors 2019, 19, 3418.
Lu, J.; Shen, Y.; Hang, Q.; Tang, Y. Research on Tightly Coupled Visual Inertial SLAM Based on Point and Line Features. Multimed. Syst. 2024, 30, 286.
Dellaert, F.; Kaess, M. Factor Graphs for Robot Perception. Found. Trends Robot. 2017, 6, 1–139.
Chen, H.; Wang, B.; Gu, D.; Ye, W. A Novel Adaptive Keyframe Selection Method with Multi-Source Joint Constraints for Visual SLAM. Intell. Serv. Robot. 2025, 18, 513–527.
Chen, C.; Geneva, P.; Peng, Y.; Lee, W.; Huang, G. Optimization-Based VINS: Consistency, Marginalization, and FEJ. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2023), Detroit, MI, USA, 1–5 October 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1517–1524.
Chen, C.; Peng, Y.; Huang, G. Fast and Consistent Covariance Recovery for Sliding-Window Optimization-Based VINS. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA 2024), Yokohama, Japan, 13–17 May 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 13724–13731.

Figure 1. Structural composition of the MarsBird-VII [11].

Figure 2. Navigation architecture of MarsBird-VII [11].

Figure 3. Robust multi-sensor optimization back-end framework for MarsBird-VII.

Figure 4. Parallel feature detection for MarsBird-VII images.

Figure 5. Parallelized feature processing for MarsBird-VII images.

Figure 6. Parity-Window back-end: Alternating update, factor sparsification, and unified marginalization.

Figure 7. Navigation accuracy during cruise flight.

Figure 8. Navigation accuracy during low-altitude sampling flight.

Figure 9. Full-sequence comparison under aggressive flight: position/velocity error time histories and reference motion profile.

Figure 10. Stable-segment comparison: position/velocity error time histories and reference motion profile.

Figure 11. Stereo vs. monocular configuration comparison: full-sequence position/velocity error time histories.

Figure 12. Feature tracking performance in representative environments. (In the visualization, red circles represent continuously tracked features, blue circles denote newly extracted features during replenishment, and green arrows indicate inter-frame pixel displacements).

Figure 13. Feature extraction under dust-induced image degradation. (Red circles indicate the extracted feature points).

Figure 14. Feature point detection before and after optimization. (Yellow dots indicate the detected feature points).

Figure 15. Feature tracking before and after optimization. (Red dots represent continuously tracked feature points between consecutive temporal frames, while green dots indicate stereo-matched feature points between the left and right cameras).

Figure 16. Tracked feature points after optimization.

Figure 17. APE comparison in challenging segments (low texture and sharp turn).

Table 1. SE(3)-aligned translation APE (RMSE/Max) for full-sequence and stable-segment evaluations.

Protocol	Method	APE RMSE (m)	APE Max (m)
A: Full sequence	OpenVINS [18]	6.89	14.75
A: Full sequence	ROVIO [19]	2.93	4.72
A: Full sequence	Parity-Window (Ours)	0.31	0.47
B: Stable segment	OpenVINS	0.23	0.52
B: Stable segment	ROVIO	0.81	1.78
B: Stable segment	Parity-Window (Ours)	0.06	0.15
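
For readers who want to reproduce this kind of metric, the following is a minimal sketch of how a translation APE RMSE and maximum could be computed once the estimated trajectory has been time-associated with and aligned to the reference. The function name `translation_ape` and the toy coordinates are illustrative assumptions, not the authors' evaluation code, and the SE(3) alignment step is assumed to have been applied beforehand.

```python
import math


def translation_ape(estimated, reference):
    """Return (rmse, max_error) of per-pose translation errors in metres.

    Both inputs are equal-length sequences of (x, y, z) positions that are
    assumed to be time-associated and already SE(3)-aligned.
    """
    if len(estimated) != len(reference) or not estimated:
        raise ValueError("trajectories must be non-empty and of equal length")
    # Euclidean distance between each estimated position and its reference.
    errors = [math.dist(p, q) for p, q in zip(estimated, reference)]
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return rmse, max(errors)


# Hypothetical toy trajectory, not data from the flight experiments.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
estimated = [(0.0, 0.0, 0.0), (1.0, 0.3, 0.4), (2.0, 0.0, 1.0)]
rmse, max_err = translation_ape(estimated, reference)
print(f"APE RMSE = {rmse:.3f} m, APE Max = {max_err:.3f} m")
```

Reporting the maximum alongside the RMSE matters here because a single transient excursion can dominate the worst case while barely moving the RMSE.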

Table 2. SE(3)-aligned translation APE (RMSE/Max) for MVI and SVI.

Method	APE RMSE (m)	APE Max (m)
Stereo visual–inertial (SVI)	0.31	0.47
Monocular visual–inertial (MVI)	0.64	2.11

Table 3. Processing time of the feature detection module before and after optimization.

Condition	Sample Count	Mean (ms)	Std. Dev. (ms)	Min (ms)	Max (ms)	Median (ms)	Performance Improvement
Original	2270	81.85	7.37	76.92	100.27	81.85	-
Optimized	2270	23.13	4.20	19.46	51.97	23.13	69.74%

Table 4. Processing time of the feature tracking module before and after optimization.

Condition	Sample Count	Mean (ms)	Std. Dev. (ms)	Min (ms)	Max (ms)	Median (ms)	Q3 (ms)
Original	2270	71.37	18.10	27.06	150.78	72.45	83.10
Optimized (Total)	2270	22.93	8.27	12.76	78.23	21.87	25.33
Optimized (Re-extracted)	170	42.43	14.24	23.72	68.23	39.66	45.75
Optimized (Non-re-extracted)	2100	21.35	4.93	12.76	50.36	21.33	24.49
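
As a consistency check on the rows above, the "Optimized (Total)" mean should equal the sample-weighted mean of the re-extracted and non-re-extracted subsets. The short sketch below does that arithmetic with the values copied from Table 4; the variable names are illustrative.

```python
# Sample counts and mean runtimes (ms) copied from Table 4.
rows = {
    "re_extracted": (170, 42.43),
    "non_re_extracted": (2100, 21.35),
}

total_count = sum(count for count, _ in rows.values())
# Pooled mean = sum(count_i * mean_i) / sum(count_i).
pooled_mean = sum(count * mean for count, mean in rows.values()) / total_count
# Fraction of frames that triggered feature re-extraction.
re_extract_share = rows["re_extracted"][0] / total_count

print(f"frames: {total_count}")
print(f"pooled mean: {pooled_mean:.2f} ms")
print(f"re-extraction share: {100 * re_extract_share:.1f} %")
```

The pooled mean reproduces the 22.93 ms reported for the total, and re-extraction occurs on roughly 7.5% of frames, so the more expensive replenishment path has only a small effect on the average.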

Table 5. Back-end runtime comparison under different windowing strategies.

Strategy	Mean (ms)	Median (ms)	Q3 (ms)	Maximum (ms)	Std. Dev. (ms)	Real-Time Compliance (Max < 66.7 ms)
Full 10-Window	57.54	54.77	68.94	140.76	20.16	Failed
Reduced 5-Window	22.45	21.05	24.12	31.05	2.15	Compliant
Parity-Window (Ours)	38.08	36.45	42.26	58.32	4.51	Compliant
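
The compliance column is a worst-case test: a strategy passes only if its maximum latency stays under the 66.7 ms deadline. The sketch below applies that test to the values copied from Table 5 and also reports the remaining margin; the helper name `compliance` and the dictionary layout are illustrative assumptions.

```python
DEADLINE_MS = 66.7  # hard real-time deadline stated in the paper

# Mean and maximum optimization latency (ms) copied from Table 5.
latency = {
    "Full 10-Window": {"mean": 57.54, "max": 140.76},
    "Reduced 5-Window": {"mean": 22.45, "max": 31.05},
    "Parity-Window": {"mean": 38.08, "max": 58.32},
}


def compliance(stats, deadline_ms=DEADLINE_MS):
    """Map each strategy to (is_compliant, worst-case margin in ms)."""
    return {
        name: (s["max"] < deadline_ms, round(deadline_ms - s["max"], 2))
        for name, s in stats.items()
    }


result = compliance(latency)
for name, (ok, margin) in result.items():
    status = "Compliant" if ok else "Failed"
    print(f"{name}: {status} (margin {margin:+.2f} ms)")
```

The full 10-frame window shows why the mean is not a sufficient criterion: its mean of 57.54 ms is below the deadline, yet its 140.76 ms maximum exceeds it by 74.06 ms, while the Parity-Window back-end keeps 8.38 ms of worst-case margin.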

Table 6. APE and peak errors in challenging segments under different windowing strategies.

Strategy	APE RMSE (m)	APE Max (m)	Peak APE in Low Texture (m)	Peak APE in Sharp Turn (m)
Full 10-Window	0.26	0.43	0.40	0.23
Reduced 5-Window	0.48	0.92	0.88	0.83
Parity-Window (Ours)	0.31	0.47	0.43	0.27

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Xiao, J.; Qiu, H.; Zhou, Y.; Wang, R.; Liu, P. MarsBird-VII: An Autonomous Stereo–Inertial Navigation System with Real-Time Optimization for a Mars Rotorcraft Space Drone. Drones 2026, 10, 346. https://doi.org/10.3390/drones10050346
