1. Introduction
Visual–inertial odometry (VIO) has been widely adopted as a core component of autonomous navigation systems, enabling accurate estimation of ego-motion by combining visual and inertial measurements [1,2,3]. Among existing systems, methods based on tightly coupled optimization, such as VINS-Mono [4], have demonstrated strong performance and robustness in a variety of scenarios. However, most traditional pipelines rely on hand-crafted features [5,6], which often suffer from limited repeatability and sensitivity to appearance variations, making them vulnerable in visually complex or rapidly changing environments.
Recent advances in deep learning have produced feature extraction methods that outperform traditional descriptors in terms of semantic richness and robustness to appearance changes [7,8,9]. Deep features offer higher-level representations that are more invariant to viewpoint and illumination changes. However, their direct application in SLAM (Simultaneous Localization and Mapping) or VIO systems introduces new challenges [10,11]. Deep features tend to cluster in semantically salient regions and lack temporal stability, which may lead to unstable tracking or degeneracy in optimization-based pipelines.
To address these limitations, a novel VIO system is proposed that leverages a lightweight deep feature extractor in combination with a multi-scale image pyramid and a spatially balanced keypoint selection strategy. Additionally, an optical-flow-based filtering method is introduced, which applies forward–backward consistency checks to reject unstable keypoints. This hybrid front-end is designed to retain the descriptive power of deep features while improving spatial coverage and temporal stability for reliable tracking.
The system is implemented as an extension of VINS-Mono, where the original front-end is replaced by a deep learning-based feature extraction and filtering module. The main contributions of this work are summarized as follows:
A hybrid front-end for visual–inertial odometry is developed by integrating a lightweight deep feature extractor with an image pyramid and grid-based keypoint sampling strategy, ensuring spatial diversity of extracted features.
An optical-flow-based keypoint filtering method is introduced, which employs forward–backward consistency checks to improve temporal stability and reject unreliable features.
The proposed system is implemented to run in real time on standard GPU-equipped hardware and demonstrates competitive trajectory accuracy on the EuRoC MAV dataset under fair conditions.
The remainder of this paper is organized as follows. Section 2 reviews relevant studies, and Section 3 describes the overall architecture of the proposed system and presents the details of the feature extraction and filtering modules. The experimental results and comparisons are discussed in Section 4, and finally, Section 5 concludes the paper.
3. Proposed Method
3.1. Overview of the Modified VIO System
The proposed system builds upon the well-established VINS-Mono [4] framework, which tightly integrates visual and inertial measurements through a nonlinear optimization back-end. While the original system relies on handcrafted corner features for point detection and tracking, our modification introduces a deep-learning-based feature extraction module to improve robustness and accuracy, particularly under challenging visual conditions.
Figure 1 illustrates the architecture of the modified system. The core pipeline of VINS-Mono, including IMU pre-integration and visual–inertial state estimation, is preserved in our system, but loop closure is omitted to focus solely on front-end improvements. This exclusion allows an isolated evaluation of the proposed keypoint extraction and filtering strategies without the influence of global optimization effects. Nevertheless, the proposed front-end is compatible with standard SLAM frameworks and can be integrated into existing systems without modification. The front-end is redesigned to integrate a lightweight deep feature extractor along with a pyramid-based sampling and optical-flow validation strategy for point selection.
Specifically, when the number of successfully tracked points falls below a predefined threshold, the system triggers a re-detection phase. At this point, an image pyramid is constructed, and a deep feature extractor is applied at each scale to generate candidate keypoints. The image is then divided into uniform grids, and keypoints are selected within each grid based on confidence scores to ensure spatial diversity. To further improve the stability of selected points, forward–backward optical-flow consistency is used to filter out unreliable candidates. The combination of deep feature extraction and optical-flow-consistency filtering provides a complementary balance between semantic robustness and temporal coherence. While previous hybrid front-ends [11,20] mainly focus on spatial matching or residual weighting without explicit temporal regularization, the proposed formulation introduces forward–backward flow consistency directly at the feature level. Deep features offer high-level invariance to viewpoint and illumination changes, whereas geometric flow consistency enforces stable correspondences over time by penalizing bidirectional motion discrepancies. This integration effectively regularizes feature trajectories, reducing outlier influence in the optimization back-end and improving the convergence of visual–inertial residual minimization. Consequently, the proposed hybrid front-end achieves greater temporal stability and optimization consistency than conventional appearance-only or purely geometric approaches.
This hybrid front-end leverages deep features while maintaining the spatial coverage and stability of traditional multi-scale detectors. The integration is designed to retain the real-time performance of the original system and requires only minimal additional computational overhead. This configuration isolates the proposed improvements. The two modules (pyramid-based sampling and forward and backward flow checks) are extractor-agnostic and can be reused with lightweight detectors or descriptors with only minor changes to the optimization back-end.
3.2. Deep Accelerated Feature Extraction
To replace the traditional handcrafted feature detector in VINS-Mono, we adopt a deep-learning-based module that generates keypoints and confidence maps from a single image. Since visual–inertial odometry systems require fast and consistent front-end processing to ensure high real-time performance, the feature extractor must be both lightweight and efficient. To meet these requirements, we use XFeat [9], a modular deep local feature framework designed for rapid keypoint and descriptor extraction with minimal computational overhead. Its streamlined architecture makes it well-suited for tightly coupled VIO pipelines, where delays in feature computation can directly degrade system responsiveness.
In our system, the model takes a grayscale image as input and produces a dense keypoint confidence map along with corresponding local descriptors in a single forward pass. The architecture consists of a shared encoder followed by two parallel heads for confidence and descriptor prediction. Unlike other deep detectors that incorporate scale-awareness through multi-scale convolutions or feature fusion, XFeat omits such mechanisms for efficiency. To compensate, we externally apply an image pyramid during inference to recover scale diversity in the extracted keypoints.
The adopted feature extractor achieves high speed and produces distinctive and robust keypoints, but presents two major limitations when deployed in VIO systems. First, due to the heatmap-like nature of the confidence output, keypoints tend to concentrate around localized high-response regions, resulting in poor coverage in homogeneous or distant areas. Second, the spatial distribution of keypoints is often uneven, which can negatively affect the accuracy and stability of pose estimation.
To overcome these limitations, we introduce a custom feature selection strategy, described in Section 3.3. The key idea is to compensate for the lack of scale-awareness and ensure spatial diversity by constructing an external image pyramid and applying grid-based sampling across multiple scales.
3.3. Multi-Scale Feature Selection with Flow-Based Filtering
To address the limitations of uneven spatial distribution and lack of scale-awareness commonly found in learning-based feature detectors, we propose a multi-scale and flow-guided feature selection strategy. This process is triggered when the number of tracked points falls below a predefined threshold, indicating degraded tracking quality.
First, we construct an image pyramid from the current grayscale frame. At each pyramid level, a lightweight deep feature extractor is applied to generate dense confidence maps and descriptors. Since the adopted feature extractor omits internal multi-scale processing for efficiency, our use of an explicit image pyramid compensates for the lack of scale diversity in the keypoints.
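As a concrete illustration of this step, the following sketch builds an OpenCV image pyramid and runs a keypoint extractor on each level, mapping the detected coordinates back to the full-resolution frame. The extractor interface (`extract_fn` returning keypoints and confidence scores) and the pyramid depth are illustrative assumptions, not the exact implementation used in the paper.

```python
import cv2
import numpy as np

def extract_multiscale(gray, extract_fn, num_levels=3):
    """Run a keypoint extractor on each level of an image pyramid.

    gray       : full-resolution grayscale image (H, W), uint8
    extract_fn : callable(img) -> (kpts [N, 2] in level coordinates, scores [N])
                 (assumed interface, e.g., a thin wrapper around a deep extractor)
    Returns keypoints mapped to full-resolution coordinates with their scores.
    """
    all_kpts, all_scores = [], []
    level_img = gray
    for level in range(num_levels):
        kpts, scores = extract_fn(level_img)
        if len(kpts) > 0:
            # Map level coordinates back to the full-resolution image.
            scale = 2.0 ** level
            all_kpts.append(np.asarray(kpts, dtype=np.float32) * scale)
            all_scores.append(np.asarray(scores, dtype=np.float32))
        level_img = cv2.pyrDown(level_img)  # halve resolution for the next level
    if not all_kpts:
        return np.empty((0, 2), np.float32), np.empty((0,), np.float32)
    return np.concatenate(all_kpts), np.concatenate(all_scores)
```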
To ensure spatial uniformity, each image at every scale level is divided into fixed-size grids. Within each grid cell, we select the top-N keypoints based on the confidence score provided by the network. This grid-based sampling helps to avoid the typical clustering of keypoints in high-texture regions, which can otherwise lead to unbalanced feature distributions.
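The grid-based sampling can be expressed compactly as follows; the cell size and per-cell budget are illustrative defaults rather than the values used in the paper.

```python
import numpy as np

def grid_select(kpts, scores, img_shape, cell=40, top_n=2):
    """Keep at most `top_n` highest-confidence keypoints per grid cell.

    kpts      : (N, 2) array of (x, y) positions in full-resolution coordinates
    scores    : (N,) confidence scores from the extractor
    img_shape : (H, W) of the full-resolution image
    """
    h, w = img_shape[:2]
    # Assign each keypoint to a cell index based on its (x, y) location.
    cells_per_row = int(np.ceil(w / cell))
    cell_ids = (kpts[:, 1] // cell).astype(int) * cells_per_row \
             + (kpts[:, 0] // cell).astype(int)
    keep = []
    for cid in np.unique(cell_ids):
        idx = np.flatnonzero(cell_ids == cid)
        # Sort the cell's candidates by confidence and keep the best few.
        best = idx[np.argsort(-scores[idx])[:top_n]]
        keep.extend(best.tolist())
    return np.array(sorted(keep), dtype=int)  # indices into kpts / scores
```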
Figure 2 presents a visualization of the multi-scale feature extraction and selection pipeline. The left column shows the keypoints extracted at three different levels of the image pyramid, each capturing features at different spatial resolutions. It is evident that the spatial distribution and density of keypoints vary significantly across levels, reflecting their complementary nature. The rightmost image displays the final set of selected keypoints after grid-based aggregation. The result demonstrates improved spatial coverage while avoiding over-concentration of keypoints in textured regions. This spatial balancing plays a critical role in improving tracking robustness, especially under challenging conditions such as viewpoint changes or partial occlusions.
Following the multi-scale extraction and spatial sampling, we apply a forward–backward flow-consistency check to filter out temporally unstable keypoints. For each candidate keypoint $\mathbf{p}_t$ in the current image $I_t$, we compute the forward flow to the previous image $I_{t-1}$ to obtain the corresponding point $\hat{\mathbf{p}}_{t-1}$:

$$\hat{\mathbf{p}}_{t-1} = \mathbf{p}_t + \mathrm{flow}_{t \rightarrow t-1}(\mathbf{p}_t) \tag{1}$$
It should be noted that the function $\mathrm{flow}(\cdot)$ in Equation (1) refers to a sparse optical flow estimated for detected keypoints using the pyramidal Lucas–Kanade (KLT) algorithm [23]. Unlike dense flow estimation, which computes pixel-wise motion fields, this sparse formulation focuses on maintaining accurate trajectories for selected features that directly contribute to visual–inertial optimization. The approach provides sub-pixel accuracy for individual keypoint motion while ensuring computational efficiency for real-time processing. Consequently, the forward–backward consistency check is applied per keypoint track to validate temporal stability without computing dense flow across the entire image.
We then compute the backward flow from $\hat{\mathbf{p}}_{t-1}$ in $I_{t-1}$ back to $I_t$, and measure the discrepancy between the original point and the round-trip result:

$$e(\mathbf{p}_t) = \left\| \hat{\mathbf{p}}_{t-1} + \mathrm{flow}_{t-1 \rightarrow t}(\hat{\mathbf{p}}_{t-1}) - \mathbf{p}_t \right\|_2 \tag{2}$$
Equivalently, the consistency error can be written as the norm of the sum of the forward and backward flows:

$$e(\mathbf{p}_t) = \left\| \mathrm{flow}_{t \rightarrow t-1}(\mathbf{p}_t) + \mathrm{flow}_{t-1 \rightarrow t}(\hat{\mathbf{p}}_{t-1}) \right\|_2 \tag{3}$$
Keypoints with a consistency error $e(\mathbf{p}_t)$ exceeding a predefined threshold $\epsilon$ are rejected to eliminate temporally inconsistent or geometrically unreliable candidates. As shown in Figure 3, this filtering process enhances stability and leads to more reliable tracking by removing temporally inconsistent keypoints. While features are extracted from a multi-scale image pyramid, all optical-flow fields are upsampled to the highest resolution before computing the forward–backward consistency error. This ensures that the consistency evaluation benefits from the higher accuracy of full-resolution flow estimation, eliminating the need for level-wise scaling.
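A minimal sketch of the forward–backward check in Equations (1)–(3), using OpenCV's pyramidal Lucas–Kanade tracker. The window size, pyramid depth, and rejection threshold below are placeholder values for illustration, not the settings reported in the paper.

```python
import cv2
import numpy as np

def flow_consistency_filter(cur_gray, prev_gray, kpts, max_err=1.0):
    """Reject keypoints whose forward-backward KLT round trip exceeds `max_err` pixels.

    cur_gray, prev_gray : grayscale frames at full resolution
    kpts                : (N, 2) float32 keypoints detected in the current frame
    Returns a boolean mask over the input keypoints (True = keep).
    """
    if len(kpts) == 0:
        return np.zeros((0,), dtype=bool)
    pts_t = kpts.reshape(-1, 1, 2).astype(np.float32)
    lk = dict(winSize=(21, 21), maxLevel=3)

    # Forward flow: current frame -> previous frame (Equation (1)).
    pts_prev, st_fwd, _ = cv2.calcOpticalFlowPyrLK(cur_gray, prev_gray, pts_t, None, **lk)
    # Backward flow: previous frame -> current frame (round trip, Equation (2)).
    pts_back, st_bwd, _ = cv2.calcOpticalFlowPyrLK(prev_gray, cur_gray, pts_prev, None, **lk)

    # Round-trip error per keypoint (Equation (3)) and status check.
    err = np.linalg.norm(pts_back - pts_t, axis=2).ravel()
    ok = (st_fwd.ravel() == 1) & (st_bwd.ravel() == 1) & (err < max_err)
    return ok
```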
By combining pyramid-based deep feature extraction, grid-constrained sampling, and temporally consistent filtering, the proposed strategy yields robust, evenly distributed, and geometrically stable keypoints suitable for reliable VIO front-end operation.
3.4. Implementation Details
Our system builds on the open-source VINS-Mono framework, replacing its original feature module with the proposed deep feature extraction and selection pipeline. We adopt pre-trained weights from XFeat for efficient and stable inference. The deep feature extractor is integrated using TorchScript-based deployment, allowing efficient GPU inference without requiring a separate deep learning framework at runtime. The image pyramid is constructed on-the-fly using standard OpenCV functions, and feature extraction is performed independently on each scale level. The confidence threshold and the number of keypoints selected per grid cell are tunable parameters that can be adjusted based on the desired trade-off between performance and computational load.
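The TorchScript-based deployment described above can be sketched as follows. The model path, the input normalization, and the assumption that the scripted module returns a confidence map and a descriptor map are illustrative and may differ from the actual exported XFeat graph.

```python
import torch

# Hypothetical path to an exported TorchScript artifact; the real file name may differ.
MODEL_PATH = "xfeat_scripted.pt"

model = torch.jit.load(MODEL_PATH).eval().cuda()

def infer(gray):
    """Single forward pass on a grayscale frame (NumPy array, uint8, shape (H, W)).

    Assumes the scripted module maps a (1, 1, H, W) float tensor in [0, 1] to a
    keypoint confidence map and a dense descriptor map; adapt the unpacking to
    whatever the exported graph actually returns.
    """
    tensor = torch.from_numpy(gray).float().div(255.0)[None, None].cuda()
    with torch.no_grad():
        confidence, descriptors = model(tensor)
    return confidence, descriptors
```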
For tracking, we retain the original optical-flow-based method from VINS-Mono, using a pyramidal Lucas–Kanade tracker. Flow-guided filtering is implemented by computing bidirectional flow between frames and comparing the displacement of forward–backward matches. Points with inconsistency beyond a threshold are removed before being passed to the front-end estimator.
The feature extraction and selection process is triggered only when the number of tracked points drops below a certain threshold. This strategy helps maintain high tracking quality while reducing unnecessary computation. All modules run in real time on a desktop platform with GPU acceleration, and further details on runtime performance and resource usage are provided in Section 4.
4. Experiment
We evaluate the proposed system on the EuRoC MAV dataset, a widely used benchmark for visual–inertial odometry that provides synchronized stereo images, IMU data, and accurate ground truth for six-degree-of-freedom pose estimation. We compare it against several representative visual–inertial odometry (VIO) systems. VINS-Mono [4] serves as a strong baseline for traditional tightly coupled monocular VIO based on hand-crafted features. In addition, we include OKVIS [12], another classical keyframe-based VIO system based on tightly coupled nonlinear optimization with stereo input. VINS-Fusion [14,15] extends VINS-Mono to support multiple sensor modalities and provides improved robustness in diverse conditions. Furthermore, we compare our method with two recent learning-based systems: SuperVINS [11], which integrates SuperPoint features with a LightGlue matcher to enhance correspondence robustness within an optimization-based pipeline, and VIO-DualProNet [24], which employs a dual-network fusion of visual and inertial data for end-to-end motion estimation. These recent approaches represent state-of-the-art learning-augmented VIO frameworks and serve as strong references for assessing both the accuracy and computational efficiency of the proposed method.
All experiments are conducted on a desktop equipped with an Intel Core i9-14900K CPU with 32 GB RAM and an NVIDIA RTX 5080 GPU. Although the evaluation is performed on a high-end desktop, the proposed system is designed with lightweight modules, including the XFeat-based feature extractor and an efficient filtering scheme. Thanks to this design, the method maintains real-time operation and can operate reliably even on lower-end computing platforms without significant degradation in accuracy. Each method is executed on the same hardware under identical conditions to ensure fairness. For methods without publicly available implementations, the reported results are adopted from their original papers to maintain consistency in the comparisons.
The trajectory accuracy is evaluated using the Root-Mean-Square Error (RMSE) of absolute poses, computed after aligning the estimated trajectory with the ground truth using a standard SE(3) alignment procedure [25]. The RMSE is defined as

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left\| \mathrm{trans}\!\left(\mathbf{T}_{\mathrm{gt},i}^{-1}\, \hat{\mathbf{T}}_i\right) \right\|_2^2},$$

where $\hat{\mathbf{T}}_i$ denotes the estimated pose at frame $i$, $\mathbf{T}_{\mathrm{gt},i}$ denotes the corresponding ground-truth pose, $\mathrm{trans}(\cdot)$ extracts the translational component of the relative pose error, and $N$ is the total number of evaluated frames.
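For reference, the metric above reduces to a few lines of NumPy once the estimated trajectory has been aligned to the ground truth (e.g., with an SE(3)/Umeyama alignment as in [25]); the sketch assumes both trajectories are provided as aligned position arrays at matched timestamps.

```python
import numpy as np

def ate_rmse(est_xyz, gt_xyz):
    """Absolute trajectory RMSE over translations, with SE(3) alignment already applied.

    est_xyz, gt_xyz : (N, 3) arrays of estimated and ground-truth positions
                      at matched timestamps
    """
    err = est_xyz - gt_xyz                            # per-frame translation error
    return float(np.sqrt(np.mean(np.sum(err ** 2, axis=1))))
```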
To assess the effectiveness of our contributions, we also report the results from an internal ablation: a version of our system that uses deep features extracted from a pretrained network but omits the proposed flow-consistency filtering and spatial sampling steps (denoted as Ours (raw deep features)). This variant helps isolate the contribution of each component in the proposed front-end. In addition, an ablation study is conducted to analyze the impact of key parameters that significantly influence the performance of the proposed system, including the maximum number of tracked features, the grid size for deep feature extraction, the number of pyramid levels, and the flow-consistency threshold. For other parameters related to the camera configuration and sensor noise characteristics, the same values as those used in VINS-Mono are adopted to ensure fairness and consistency across methods. A detailed summary of the IMU noise parameters used in this study is provided in Appendix A (Table A1). All methods are evaluated without loop closure and use monocular input only, ensuring a fair comparison under consistent back-end optimization frameworks.
4.1. Quantitative Results
Table 1 presents the absolute trajectory RMSE for all evaluated methods on the EuRoC MAV dataset. The results reveal that the proposed method achieves robust and consistent tracking performance across diverse scenarios. To provide a clearer quantitative comparison, the table also reports the average percentage improvement of the proposed method over VINS-Mono, which is the primary baseline.
In sequences such as V102 and V201, which involve challenging visual conditions like dynamic illumination or aggressive motion, the proposed method outperforms all baselines. In V102, the RMSE is reduced, outperforming both VINS-Mono and VINS-Fusion. This improvement is attributed to the grid-based keypoint selection, which ensures that features are spatially well-distributed even in regions with low texture or uneven illumination. Similarly, in V201, which involves rapid camera motion and potential motion blur, our method outperforms traditional baselines and ablation variants. The forward–backward flow-consistency filtering is particularly effective here in eliminating unstable keypoints caused by blur or occlusion.
The benefit of the proposed front-end becomes especially clear when comparing it to the Ours (raw deep features) variant. In sequences like V203, where raw deep features alone lead to noticeable degradation, our full method reduces the error by suppressing temporally inconsistent keypoints. This shows the importance of the filtering mechanism, which prevents unstable keypoints from contributing to drift in dynamic scenes.
On the other hand, in more regular and static sequences such as MH03 and MH04, the performance gap between methods is narrower. These sequences offer abundant stable features, making handcrafted descriptors like ORB sufficient for accurate tracking. Nonetheless, our method maintains competitive accuracy without relying on stereo input or loop closure, demonstrating its general applicability.
It is also worth noting the comparison with VINS-Fusion, a widely adopted visual–inertial odometry framework that extends VINS-Mono by enhancing optimization stability and feature management. Although VINS-Fusion generally serves as a strong baseline in the community, our method shows competitive or even superior results on several sequences. In particular, on V201 and V202, the proposed pipeline outperforms VINS-Fusion, suggesting that the integration of learned features and temporal filtering enhances robustness and consistency even under challenging conditions. Overall, these findings confirm that the proposed approach achieves accuracy comparable to that of state-of-the-art VIO systems while maintaining a lightweight monocular configuration.
Although the proposed system achieves consistent improvements across most sequences, a few challenging cases, such as MH04 and V203, exhibit slightly increased drift. These sequences contain severe motion blur and abrupt illumination changes, which can distort optical-flow estimation and reduce temporal consistency across frames. In such conditions, a small number of temporally inconsistent keypoints may remain even after flow-consistency filtering, resulting in minor localization drift. Such effects may be mitigated by photometric normalization, contrast-invariant feature encoding, or adaptive flow-thresholding strategies, which will be considered in future extensions. The system may also experience reduced robustness in environments with repetitive textures or partial occlusion, where spatially balanced sampling cannot fully prevent feature ambiguity or mismatched correspondences. These observations highlight the inherent limitations of deep feature-based front-ends in visually degraded environments and emphasize the need for more adaptive perception strategies. Future research will therefore focus on integrating semantic priors and blur- or occlusion-aware feature weighting mechanisms to improve resilience under such challenging conditions.
In contrast to recent deep learning-based approaches, the proposed system exhibits comparable or even superior performance on several sequences such as MH03, MH04, and V102. While fully learned pipelines like SuperVINS and VIO-DualProNet show strong results in visually challenging scenarios, they generally rely on heavier network architectures and greater computational cost. Our method, on the other hand, achieves competitive accuracy using a lightweight hybrid front-end that combines compact deep features with optical-flow-based filtering. This demonstrates that stable and spatially balanced feature selection can yield robustness on par with complex end-to-end frameworks while preserving real-time efficiency.
Overall, these results indicate that the proposed system not only improves performance in difficult scenarios where traditional features struggle, but also retains robustness and accuracy in favorable conditions. On average, the method achieves an approximately 21% lower trajectory error than VINS-Mono, confirming its effectiveness across both the machine-hall (MH) and Vicon-room (V) sequences. By integrating deep learning-based feature extraction with principled spatial and temporal filtering, the system achieves a well-balanced VIO pipeline suitable for diverse real-world environments.
Taken together, Table 1 and Figure 4 suggest that robustness gains stem from the synergy between spatial coverage (grid-constrained selection across pyramid scales) and temporal consistency (forward–backward flow filtering). The former prevents over-concentration in textured regions and secures motion observability across directions, while the latter removes transient, blur- or occlusion-induced instabilities before they propagate to the optimizer. Notably, the method remains competitive against stereo-enabled baselines while using monocular input only, which is attractive for platforms where additional sensors or calibration effort are impractical.
4.2. Impact of Feature Selection Strategy
The quality of visual–inertial odometry (VIO) strongly depends on the robustness, spatial distribution, and temporal consistency of selected keypoints. Learned features often exhibit high repeatability and semantic invariance, but they tend to concentrate in highly textured regions, leading to spatial imbalance and potential instability across frames. Our method explicitly addresses these limitations by combining three core strategies: (1) multi-scale deep feature extraction to capture keypoints across different resolutions, (2) grid-based spatial sampling to enforce even distribution, and (3) optical-flow-based consistency filtering to eliminate geometrically unstable or temporally inconsistent points.
Figure 5 provides a qualitative comparison on the MH05 sequence of the EuRoC dataset. The raw deep feature baseline shows a high density of keypoints clustered in certain regions, particularly around textured edges and corners, while large homogeneous regions remain underrepresented. In contrast, our method yields more uniform spatial coverage across the image due to grid-level selection across all scales. This balanced distribution ensures that motion in all directions is adequately observed, reducing the risk of estimation degradation due to local feature dropout. To further verify temporal stability, the keypoint survival rate was analyzed across consecutive frames. The proposed front-end maintained an average survival rate of 78.4%, compared to 62.1% for the baseline without flow-consistency filtering. This confirms that the filtering step effectively enhances feature longevity and tracking stability over time.
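One plausible way to compute the keypoint survival rate reported above is sketched below, assuming each frame provides the set of active track IDs; the exact bookkeeping used in the implementation may differ.

```python
def mean_survival_rate(track_ids_per_frame):
    """Average fraction of keypoints in frame t that are still tracked in frame t+1.

    track_ids_per_frame : list of sets of track IDs, one set per frame
    """
    rates = []
    for cur, nxt in zip(track_ids_per_frame[:-1], track_ids_per_frame[1:]):
        if cur:  # skip frames with no active tracks
            rates.append(len(cur & nxt) / len(cur))
    return sum(rates) / len(rates) if rates else 0.0
```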
Furthermore, the incorporation of flow-consistency filtering significantly improves the temporal reliability of selected keypoints. By enforcing forward–backward consistency, the system effectively discards outliers caused by motion blur, occlusion, or viewpoint-induced descriptor drift. As a result, our method maintains a denser and more stable set of 3D landmarks over time, as visualized in the bottom row of Figure 5.
Together, these strategies demonstrate the importance of a principled front-end design. Instead of relying purely on raw deep features or handcrafted descriptors, our approach leverages the strengths of learned representations while addressing their limitations, resulting in a more robust and consistent VIO pipeline.
However, limitations remain. Flow consistency hinges on accurate correspondences; under severe blur, abrupt exposure changes, or large unmodeled parallax, its rejection power diminishes and some outliers persist. Highly repetitive textures can also stress the grid selection by trading off coverage and descriptor distinctiveness. These cases suggest a need to add complementary cues (e.g., semantics or depth) and scene-adaptive thresholds to further increase robustness.
4.3. Runtime Analysis
To assess the computational efficiency of the proposed front-end, we measured the runtime of its two main components: deep feature extraction (including multi-scale processing and grid-based keypoint selection) and optical-flow-based consistency filtering. On average, the feature extraction stage requires 33.73 ms per frame, covering the computation of multi-scale feature maps, keypoint extraction, and spatial sampling. The flow-consistency step adds 0.62 ms per frame for forward–backward flow and pixel-wise checks. End-to-end, the front-end therefore takes approximately 34.4 ms per frame (about 29 FPS), operating within real-time constraints on the evaluated desktop GPU. In addition, a computational resource analysis was performed to evaluate adaptability to embedded platforms. The XFeat extractor requires less than 150 MB of GPU memory during inference and utilizes fewer than 20% of the available CUDA cores on an RTX-level GPU, while the KLT-based optical-flow-consistency module runs entirely on the CPU at real-time rates. This configuration enables real-time operation even on embedded platforms with limited computational resources. Further speedups can be achieved through mixed-precision inference or pruning.
The breakdown shows that flow consistency contributes a negligible share of the total runtime (roughly 2%), indicating that latency is dominated by the extractor. In practice, further speedups are therefore most effectively sought on the extraction path (e.g., lighter backbones or fewer pyramid levels) while keeping the lightweight filtering intact. This aligns with the empirical trend that robustness gains come primarily from enforcing spatial coverage and temporal stability rather than from expensive post-processing. Even under reduced-frame-rate conditions, the proposed system maintains stable pose estimation. The flow-consistency filtering enforces temporal smoothness across consecutive frames, while the feature selection strategy prioritizes long-lived correspondences with consistent motion. As a result, the system remains accurate and reliable even when the image sampling frequency is lowered, demonstrating robustness for low-frame-rate cameras or resource-constrained hardware.
For deployment on embedded or low-power platforms, additional engineering may be required to keep the end-to-end latency within real-time bounds at lower budgets, such as model compression (quantization or pruning), kernel fusion, and pipeline scheduling. Because the proposed modules are decoupled from the back-end, such optimizations can proceed without modifying the estimator.
4.4. Ablation Study
4.4.1. Parameter Sensitivity
To validate the robustness of the proposed method and analyze the impact of key parameters, we conducted an ablation study by varying the maximum number of extracted features, the grid size, and the pyramid level while comparing the full pipeline against a variant without the flow-consistency filtering. As summarized in Table 2, the full pipeline consistently outperforms the variant without flow filtering across all settings and sequences. When the maximum number of features is limited to 300, the accuracy slightly decreases compared to the default configuration, confirming that the proposed filtering benefits from a sufficient feature pool but remains stable even under reduced density. Larger grid sizes slightly degrade the performance due to lower spatial diversity, whereas reducing the pyramid levels increases the drift on long sequences such as MH05. Overall, the default configuration achieves the best trade-off between accuracy and computational efficiency across the tested settings, confirming stable performance without dataset-specific tuning. These results confirm that the observed gain is not merely due to higher feature density, but rather stems from the improved robustness of the flow-based filtering mechanism.
4.4.2. Effect of Flow-Consistency Filtering
To examine the impact of the proposed filtering module, an ablation experiment was performed on the EuRoC MH04 sequence. Figure 6 presents side-by-side trajectories estimated with filtering (left) and without filtering (right). Although both results follow a similar overall path, the filtered version shows smoother motion and less local drift, especially in curved regions where deviations are most evident. This visual comparison clearly demonstrates that enforcing bidirectional flow consistency enhances temporal stability in feature tracking. Table 3 further summarizes the quantitative results, showing that despite a slight reduction in the average number of tracked keypoints, the overall trajectory accuracy improves, confirming the benefit of the filtering step.
4.4.3. Flow-Consistency Threshold
To evaluate the influence of the flow-consistency threshold $\epsilon$, an ablation study was conducted on the EuRoC MAV dataset using sequences MH03, MH05, and V202. Table 4 reports the trajectory RMSE under different threshold settings. Without the consistency check, all sequences show noticeably higher errors due to temporally unstable matches. When a strict (small) threshold is applied, unstable points are successfully filtered out, yielding improved accuracy; however, over-filtering may remove useful keypoints, slightly reducing robustness in dynamic frames. The intermediate threshold listed in Table 4 achieves the lowest average ATE across sequences, providing a good trade-off between rejecting inconsistent matches and preserving sufficient spatial coverage. In contrast, a relaxed (large) threshold allows noisy flows to remain, increasing trajectory drift. Therefore, this intermediate value was adopted as the default setting for all experiments in this paper, balancing precision and stability.
5. Conclusions
We present a visual–inertial odometry (VIO) system that augments a lightweight deep feature extractor with a multi-scale image pyramid, grid-based keypoint selection for spatial coverage, and forward–backward flow-consistency checks for temporal stability. The front-end directly targets two common weaknesses of learned features, namely spatial clustering and frame-to-frame instability, while preserving real-time operation and requiring only modest computational overhead.
Evaluations on the EuRoC MAV benchmark demonstrate that the proposed method achieves up to 19.35% lower RMSE compared with existing feature-based and learning-based VIO systems, showing competitive and robust tracking across diverse scenarios. The method reduces drift in challenging sequences and remains strong even without loop closure or stereo input, indicating that a simple and principled front-end can substantially stabilize optimization-based VIO. These results support the practicality of incorporating learned perception modules into classical geometric pipelines when paired with lightweight selection and filtering.
The proposed front-end shows practical potential for deployment in real-world robotic and mobile platforms. The architecture maintains real-time localization performance even on embedded systems with limited computational resources. These results indicate that the approach provides a solid foundation for reliable visual–inertial navigation in real-world environments. Future research will focus on extending the framework to multi-sensor configurations, including LiDAR, visual, and inertial inputs, and enhancing robustness in dynamic environments such as urban traffic scenes or indoor spaces with moving agents. To this end, semantic and depth cues will be integrated to improve environmental understanding and motion separation. Semantic information will guide keypoint selection toward stable, static regions, while depth priors will refine spatial distribution in texture-sparse or partially occluded areas. Quantitative evaluation will be conducted using metrics such as trajectory RMSE, dynamic-object recall, and scene-level consistency to assess the impact of these extensions. Additional lightweight branches can operate in parallel with the front-end to preserve real-time performance. Further directions include self-supervised and domain-adaptive training for temporal consistency, as well as hardware-aware optimization for efficient embedded deployment.