Article

RWKV-VIO: An Efficient and Low-Drift Visual–Inertial Odometry Using an End-to-End Deep Network

1 School of Aeronautics and Astronautics, Sun Yat-sen University, Guangzhou 510275, China
2 Department of Electrical, Computer and Biomedical Engineering, University of Pavia, 27100 Pavia, Italy
3 Department of Mechanical Engineering, National University of Singapore, Singapore 119077, Singapore
* Author to whom correspondence should be addressed.
Sensors 2025, 25(18), 5737; https://doi.org/10.3390/s25185737
Submission received: 7 August 2025 / Revised: 9 September 2025 / Accepted: 11 September 2025 / Published: 15 September 2025
(This article belongs to the Section Sensors and Robotics)

Abstract

Visual–Inertial Odometry (VIO) is a foundational technology for autonomous navigation and robotics. However, existing deep learning-based methods face key challenges in temporal modeling and computational efficiency. Conventional approaches, such as Long Short-Term Memory (LSTM) networks and Transformer-based methods, often struggle to handle dependencies across different temporal scales while incurring high computational costs. To address these issues, this work introduces Receptance Weighted Key Value (RWKV)-VIO, a novel framework based on the RWKV architecture. The proposed framework is designed with a lightweight structure and linear computational complexity, which effectively reduces the computational burden of temporal modeling. Furthermore, a newly developed Inertial Measurement Unit (IMU) encoder improves feature extraction through residual connections and channel alignment, allowing efficient use of historical inertial data. A parallel encoding strategy employs two independently initialized encoders to extract features from different dimensions, strengthening the model's ability to detect complex patterns. Experimental results on publicly available datasets show that RWKV-VIO prioritizes computational efficiency and lightweight design: it significantly reduces model size and inference time compared to existing advanced methods while achieving top-ranked positioning accuracy among the evaluated approaches.

1. Introduction

Accurate localization is essential for robot navigation, autonomous driving [1], aircraft navigation, underwater positioning [2], and augmented reality [3,4,5]. Single sensors often fail to ensure accuracy and robustness in complex environments [6,7]. For example, visual sensors perform poorly in low-texture scenes or under changing lighting conditions. The Inertial Measurement Unit (IMU) offers high-frequency motion estimates but accumulates errors over time [8]. To solve these problems, multisensor fusion technology combines the strengths of different sensors to improve accuracy and reliability [9,10,11,12,13].
Visual–Inertial Odometry (VIO) is a multisensor fusion technique that combines visual data and IMU measurements to deliver high-frequency pose estimation with enhanced robustness under challenging conditions. Compared to Visual Odometry (VO) based solely on vision [14,15,16,17], VIO demonstrates stronger robustness in environments with poor lighting or sparse visual features.
VIO methods can be broadly categorized into two types: geometry-based approaches and learning-based approaches. As shown in Figure 1a, geometry-based methods typically consist of visual correspondence, IMU integration, and state estimation. State estimation, in turn, can be divided into two subcategories: optimization-based and filtering-based methods. Geometry-based approaches [18,19,20] generally achieve high accuracy under controlled or favorable conditions. However, these methods rely heavily on manually designed feature extraction and non-linear optimization techniques, making them highly sensitive to the quality of visual features and the accuracy of initialization. Furthermore, the reliance on manual feature extraction and nonlinear optimization leads to high computational complexity, limiting real-time applicability. In recent years, learning-based VIO models [21,22,23] have shown robustness in complex scenarios by automatically extracting visual and inertial features end-to-end. The general architecture of these methods is shown in Figure 1b, mainly consisting of a visual–inertial encoder, a feature fusion network, and a pose regression network. However, these methods often rely on recurrent neural networks (RNNs), such as Long Short-Term Memory (LSTM) [24] and Gated Recurrent Unit (GRU) [25], for temporal modeling. These models suffer from gradient vanishing problems [26], especially as the sequence length increases. They also struggle to capture long-term dependencies, which reduces accuracy. In addition, their sequential processing prevents parallel computation [27], leading to slower training and inference and limiting scalability.
Deep learning-based VIO models also encounter challenges in extracting meaningful IMU features. Simpler architectures, such as Convolutional Neural Networks (CNNs), often fail to capture long-term dependencies and lose historical information over time. These limitations affect the accuracy of long-duration pose estimation and contribute to accumulated drift in predicted trajectories.
To address these challenges, RWKV-VIO is proposed as a framework based on the RWKV architecture [28]. RWKV is a lightweight temporal modeling method. It captures both long-term and short-term dependencies in multimodal data with linear computational complexity. This improves computational efficiency and enables real-time performance.
The main contributions of this work are summarized as follows:
  • A temporal modeling method based on the RWKV architecture is proposed. It captures long-term and short-term dependencies in visual and inertial data efficiently. Its lightweight design reduces computational complexity.
  • A novel IMU encoder, Res-Encoder, integrates convolutional layers with residual connections to enhance the extraction of inertial features. It effectively preserves historical information, enhancing the depth and stability of feature representations.
  • A parallel IMU encoder structure is introduced to extract diverse features from inertial data, enhancing the expressiveness of inertial features and improving the fusion of multi-modal data.
Experimental results show that RWKV-VIO performs better than existing methods in terms of both localization accuracy and computational efficiency, fully demonstrating its robustness in complex environments and advantages in practical applications.

2. Related Work

In this section, we provide an overview of recent advancements in visual odometry (VO), visual–inertial odometry (VIO), and key supporting techniques for odometry, with a focus on deep learning-driven approaches. To systematically clarify the characteristics, technical paths, and limitations of representative methods in these areas, Table 1 first summarizes classic and influential works—including those discussed in subsequent VO and VIO sections—by category, sensor configuration, and core attributes. This structured comparison provides a foundational reference for analyzing gaps in existing research and highlighting the necessity of our proposed framework.

2.1. Visual Odometry

Existing VO research primarily includes traditional feature-based and deep learning-based approaches. Early feature-based methods relied on manually designed descriptors: ORB [19] was widely used for feature extraction and matching, enabling motion estimation via geometric triangulation, but failed in texture-less scenes and was sensitive to motion blur.
The shift to deep learning led to end-to-end VO models. DeepVO [30] used deep recurrent convolutional neural networks (DRCNNs) to learn motion directly from image sequences, eliminating manual feature engineering. Monodepth2 [34] introduced self-supervised learning for monocular depth estimation, supporting VO scale calibration. BeyondTracking [35] added a memory selection mechanism to refine long-sequence poses, while GFS-VO [37] used guided feature selection to enhance visual feature quality. However, these VO methods suffered from inherent limitations: monocular setups had scale ambiguity, leading to long-term drift, and pure visual data failed to compensate for feature loss in dynamic or low-light environments.

2.2. Visual–Inertial Odometry

To address VO’s gaps, VIO fuses visual data with IMU measurements, leveraging the camera’s spatial precision to suppress drift and the IMU’s high temporal resolution to fill visual feature gaps.
Early VIO methods were geometry-driven. ORB-SLAM3 [19] integrated IMU data into nonlinear optimization and supported multiple sensor modalities, achieving high accuracy in controlled environments. VINS-Mono [18] used tightly coupled sliding-window optimization with IMU pre-integration to resolve monocular scale ambiguity. ROVIO [20] avoided explicit feature extraction, using direct image alignment within an EKF to fuse IMU data with pixel intensities, making it efficient in low-texture scenes. However, these methods relied on manual parameter tuning and had high computational complexity, limiting real-time performance on resource-constrained platforms.
Deep learning-based VIO enabled end-to-end spatiotemporal learning. VINet [29]—the first end-to-end deep VIO model—used DeepVO’s [30] visual extractor and LSTM to process IMU data, projecting fused features into SE(3) space. Subsequent works optimized fusion: Chen et al. [32] proposed soft/hard fusion for corrupted inputs; ATVIO [36] used attention to balance visual–inertial streams and adaptive loss for pose regression. VIOLearner [31] added unsupervised training for trajectory correction, while DeepVIO [33] used 3D geometric constraints to suppress drift. Recent models like VIFT [38] adopted Transformers for long-range dependencies, and Fusion [32] refined selective fusion. Yet, limitations persisted: LSTM models had gradient vanishing in long sequences, Transformers had quadratic complexity, and many underutilized IMU data, leading to insufficient feature representation and dynamic environment inaccuracy.

2.3. Temporal Modeling

In addition to the advancements in VIO, significant progress has been made in temporal modeling. In recent years, innovative approaches to temporal modeling have significantly improved efficiency and performance through breakthroughs in architectural design. Transformer-based models [39,40,41] have demonstrated exceptional performance in handling long-sequence tasks due to their self-attention mechanism and parallel computing capabilities. However, their quadratic complexity limits their application in resource-constrained scenarios. In contrast, CNNs [42,43] have shown strong modeling capabilities in long-term sequence prediction tasks by effectively capturing local and global patterns in time series. Additionally, explorations of Multilayer Perceptrons (MLPs) in temporal modeling have also achieved remarkable progress. For instance, models like DLinear [44] achieve a performance comparable to more complex architectures with simple linear structures, prompting a reevaluation of the necessity of Transformer models. Recently, RNN-based architectures, such as RWKV [28], combine the global modeling capabilities of Transformers with the recursive nature of RNNs, exhibiting greater efficiency and competitiveness.
Traditional LSTM-based methods in VIO are limited by their sequential nature, resulting in slow inference speeds, which hinder real-time performance. Inspired by the efficiency of RWKV-TS [45], this work designs a lightweight temporal modeling framework that directly addresses these issues. By reducing the computational complexity of temporal dependency modeling, the proposed approach accelerates inference. It ensures a robust integration of IMU and visual data, making it suitable for dynamic environments and real-time applications.

3. Materials and Methods

We now provide an in-depth description of the proposed deep learning-based VIO model. As illustrated in Figure 2a, the framework consists of a visual encoder, a Res-Encoder with a parallel encoding strategy, a positional encoding module, and a decoder built upon the RWKV network architecture. Each component is crafted to handle a specific challenge of visual–inertial fusion, enabling effective feature extraction, reliable temporal modeling, and accurate pose estimation.

3.1. End-to-End Deep Learning-Based Visual–Inertial Odometry

The end-to-end Visual–Inertial Odometry (VIO) algorithm takes as input two consecutive monocular image frames along with the set of Inertial Measurement Unit (IMU) measurements recorded during the time interval between them. More comprehensively, the inputs to a VIO system consist of the monocular video frames $\{V_i\}_{i=1}^{N}$, the IMU measurements $\{I_i\}_{i=1}^{Nl}$ (captured at a sampling frequency $l$ times higher than the video frame rate), and the initial camera pose $P_1$. Here, $N$ denotes the total number of video frames, with $i = 1$ representing the first frame and $i = N$ the last frame. The goal of VIO is to estimate the camera poses $\{P_i\}_{i=2}^{N}$ for the entire path, where $V_i \in \mathbb{R}^{3 \times H \times W}$, $I_i \in \mathbb{R}^{6}$, and $P_i \in \mathrm{SE}(3)$. The IMU provides six-dimensional measurements, comprising three-dimensional linear accelerations (along the x-, y-, and z-axes of the IMU coordinate system) and three-dimensional angular velocities (around the same axes), which reflect the instantaneous motion state of the device. One typical way to perform VIO is to estimate the six-DoF relative pose $T_t^{t+1}$ satisfying $P_t T_t^{t+1} = P_{t+1}$ from two consecutive images $V_t^{t+1} = \{V_t, V_{t+1}\}$ and the corresponding IMU measurements $I_t^{t+1} = \{I_{tl}, \ldots, I_{(t+1)l}\}$ for the time indices $t = 1, 2, \ldots, N-1$. The relative pose $T_t^{t+1}$ can be further decomposed into a rotational component (a rotation vector $\boldsymbol{\phi}_t \in \mathbb{R}^{3}$ of Euler angles describing the orientation change) and a translational component (a translation vector $\mathbf{v}_t \in \mathbb{R}^{3}$ representing the displacement in three-dimensional space). Our method learns a selection strategy that opportunistically skips the visual information $V_t^{t+1}$ to reduce the computational complexity while maintaining the relative pose estimation accuracy.
During training, our method employs a sliding-window approach in which image frames of length $seq\_len$ and their corresponding IMU measurements are fed to the model at each step. This is enabled by the parallel computation capability of our pose regression network, which differs from the LSTM approach of predicting the next pose from the current one. Our model predicts the $seq\_len - 1$ relative poses $P_t^{t + seq\_len - 1}$ in each step and then predicts the next set of $seq\_len - 1$ relative poses in the following step, thus traversing the entire sequence.
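For concreteness, the sliding-window batching described above can be sketched as follows in PyTorch; the helper name, the 10:1 IMU-to-frame ratio, and the dummy shapes are illustrative assumptions rather than the authors' released code.

```python
import torch

def make_window(frames, imu, start, seq_len=11, imu_per_frame=10):
    """Build one training sample: seq_len consecutive frames plus the IMU
    samples spanning the seq_len - 1 intervals between them."""
    imgs = torch.stack(frames[start:start + seq_len])                      # (seq_len, 3, H, W)
    imu_seg = torch.stack(imu[start * imu_per_frame:
                              (start + seq_len - 1) * imu_per_frame + 1])  # (10*(seq_len-1)+1, 6)
    return imgs, imu_seg

# Dummy usage: 100 frames at 10 Hz, 1000 IMU samples at 100 Hz.
frames = [torch.zeros(3, 256, 512) for _ in range(100)]
imu = [torch.zeros(6) for _ in range(1000)]
imgs, imu_seg = make_window(frames, imu, start=0)
print(imgs.shape, imu_seg.shape)   # torch.Size([11, 3, 256, 512]) torch.Size([101, 6])
```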

3.2. Feature Encoder

3.2.1. Visual Encoder

In this study, the classic FlowNet architecture [46] is adopted as the visual encoder to efficiently extract motion features, local textures, and robust global characteristics from image sequences. FlowNet employs a series of convolutional layers that progressively capture features at different levels of abstraction. Starting from basic elements such as edges and textures, the network advances to more complex motion patterns and spatial relationships. The architecture consists of multiple convolutional layers with increasing channel sizes, enabling the network to effectively model dynamic relationships within the scene. As shown in Figure 2a, consecutive image frames $V_T$ and $V_{T+1}$ are fed into the optical-flow estimation network to obtain a 512-dimensional visual feature $F_V$:
$$F_V = \mathrm{FlowNet}(V_T; V_{T+1})$$
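For illustration only, a minimal stand-in for this step is sketched below: two consecutive frames are concatenated along the channel axis and reduced to a 512-dimensional feature vector. The layer configuration is an assumption and does not reproduce the pretrained FlowNet-S encoder used in the paper.

```python
import torch
import torch.nn as nn

class TinyFlowEncoder(nn.Module):
    """Simplified stand-in for the FlowNet-S feature extractor: two consecutive
    frames are concatenated along the channel axis and reduced to one 512-d vector."""
    def __init__(self, out_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(6, 64, 7, stride=2, padding=3), nn.LeakyReLU(0.1),
            nn.Conv2d(64, 128, 5, stride=2, padding=2), nn.LeakyReLU(0.1),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.LeakyReLU(0.1),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(256, out_dim)

    def forward(self, v_t, v_t1):
        x = torch.cat([v_t, v_t1], dim=1)              # (B, 6, H, W)
        return self.proj(self.conv(x).flatten(1))      # (B, 512): visual feature F_V

f_v = TinyFlowEncoder()(torch.zeros(2, 3, 256, 512), torch.zeros(2, 3, 256, 512))
print(f_v.shape)   # torch.Size([2, 512])
```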

3.2.2. IMU Encoder

The IMU encoder architecture, as depicted in Figure 2b, is based on a Res-Encoder (residual convolutional neural network) designed for effective inertial representation learning. The input to the encoder is a sequence $I \in \mathbb{R}^{B \times S \times C \times L}$, where $B$ denotes the batch size, $S$ the temporal window length (e.g., the number of video frames per sample), $C = 6$ the number of raw IMU channels (three-axis accelerometer and gyroscope), and $L$ the number of IMU steps between two video frames.
The encoding process begins with an initial 1D convolution to extract low-level features from each IMU data channel:
$$X_0 = \phi_{\mathrm{init}}(I) = \mathrm{Drop}\big(\sigma(\mathrm{BN}(\mathrm{Conv1d}_{6 \to 64}(I)))\big),$$
where $\mathrm{Conv1d}_{6 \to 64}$ denotes a 1D convolution with 64 filters, BN is batch normalization, $\sigma$ is the LeakyReLU activation, and Drop is dropout for regularization. Here, $X_0$ is the output feature after the initial 1D convolution of the IMU data, and $\phi_{\mathrm{init}}$ is the initial feature extraction function integrating convolution, batch normalization, activation, and dropout.
Subsequently, the encoder employs a stack of three residual convolutional blocks, each denoted as Bi-Conv, to progressively extract higher-level temporal and channel features while suppressing noise. The k-th Bi-Conv block consists of two convolutional layers with possible channel expansion, and incorporates a residual (shortcut) connection to promote feature reuse and gradient flow:
$$Z_1^{[k]} = \sigma\big(\mathrm{BN}(\mathrm{Conv1d}_{C_k \to C_{k+1}}(X_{k-1}))\big),$$
$$Z_2^{[k]} = \sigma\big(\mathrm{BN}(\mathrm{Conv1d}_{C_{k+1} \to C_{k+1}}(Z_1^{[k]}))\big),$$
$$Z_2^{[k]} \leftarrow \mathrm{Drop}(Z_2^{[k]}),$$
where $Z_1^{[k]}$ and $Z_2^{[k]}$ are intermediate feature tensors in the $k$-th residual block, $\sigma$ denotes an activation function (e.g., LeakyReLU), BN is batch normalization, $\mathrm{Conv1d}_{C_{\mathrm{in}} \to C_{\mathrm{out}}}$ represents a 1D convolution with $C_{\mathrm{in}}$ input channels and $C_{\mathrm{out}}$ output channels, $X_{k-1}$ is the input feature tensor to the $k$-th block, and Drop is dropout for regularization. When the input and output channels differ ($C_k \neq C_{k+1}$), the residual path is projected via a $1 \times 1$ convolution:
$$R^{[k]} = \begin{cases} X_{k-1}, & \text{if } C_k = C_{k+1}, \\ \mathrm{Conv1d}_{C_k \to C_{k+1}}(X_{k-1}), & \text{if } C_k \neq C_{k+1}, \end{cases}$$
$$X_k = Z_2^{[k]} + R^{[k]}.$$
The three Bi-Conv blocks operate at channel dimensions $C_1 = 64$, $C_2 = 128$, and $C_3 = 256$, respectively. This hierarchical design enables the encoder to capture increasingly abstract inertial patterns across the temporal window and enhances robustness to sensor noise.
After the stacked residual blocks, the resulting feature tensor is flattened across channel and temporal dimensions, and then projected into a fixed-length latent vector to obtain the final IMU feature representation:
$$\hat{F}_I = \psi_{\mathrm{proj}}(X_3) = \mathrm{FC}\big(\mathrm{Flatten}(X_3)\big),$$
where $\hat{F}_I$ is the final IMU feature representation, $\psi_{\mathrm{proj}}$ is the projection function, $X_3$ is the feature tensor produced by the stacked residual blocks, Flatten reshapes $X_3$ into a vector, and FC projects it to dimension $d = 256$.
To further enhance representation diversity and robustness, inspired by ensemble learning [47], a parallel architecture is employed. Specifically, two independently initialized Res-Encoders process the same IMU input in parallel:
$$F_I = [\hat{F}_I^{(1)}; \hat{F}_I^{(2)}] \in \mathbb{R}^{B \times S \times 512},$$
where $[\,\cdot\,;\,\cdot\,]$ denotes vector concatenation, and $\hat{F}_I^{(1)}$, $\hat{F}_I^{(2)}$ are the outputs of the parallel encoders. Such integration yields a 512-dimensional IMU feature per sample and temporal window, and ensures that the inertial features are well aligned with the extracted visual features for subsequent cross-modal fusion.
In summary, the IMU encoder adopts a deeply residual 1D convolutional backbone augmented with dimension-matching projections, dropout, and a parallel ensemble strategy, to extract expressive and robust inertial representations for deep VIO.
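A compact PyTorch sketch of the Res-Encoder and the parallel ensemble is given below. The channel widths (64/128/256), the 256-dimensional projection, and the concatenation to 512 dimensions follow the text; kernel sizes, dropout rates, the exact channel progression of the three Bi-Conv blocks, and the IMU segment length are assumptions.

```python
import torch
import torch.nn as nn

class BiConv(nn.Module):
    """Residual 1D convolution block: two conv layers plus a (projected) shortcut."""
    def __init__(self, c_in, c_out, p_drop=0.2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv1d(c_in, c_out, 3, padding=1), nn.BatchNorm1d(c_out), nn.LeakyReLU(0.1),
            nn.Conv1d(c_out, c_out, 3, padding=1), nn.BatchNorm1d(c_out), nn.LeakyReLU(0.1),
            nn.Dropout(p_drop),
        )
        # 1x1 projection when the channel count changes, identity otherwise
        self.shortcut = nn.Identity() if c_in == c_out else nn.Conv1d(c_in, c_out, 1)

    def forward(self, x):
        return self.body(x) + self.shortcut(x)

class ResEncoder(nn.Module):
    """Initial conv -> three Bi-Conv blocks (reading C1=64, C2=128, C3=256 as block
    output widths) -> flatten -> FC projection to a 256-d inertial feature."""
    def __init__(self, imu_len=11, out_dim=256, p_drop=0.2):
        super().__init__()
        self.init = nn.Sequential(
            nn.Conv1d(6, 64, 3, padding=1), nn.BatchNorm1d(64),
            nn.LeakyReLU(0.1), nn.Dropout(p_drop),
        )
        self.blocks = nn.Sequential(BiConv(64, 64), BiConv(64, 128), BiConv(128, 256))
        self.proj = nn.Linear(256 * imu_len, out_dim)

    def forward(self, imu):                      # imu: (B*S, 6, L), reshaped from (B, S, 6, L)
        x = self.blocks(self.init(imu))
        return self.proj(x.flatten(1))           # (B*S, 256)

class ParallelIMUEncoder(nn.Module):
    """Two independently initialized Res-Encoders; outputs concatenated to 512-d."""
    def __init__(self, imu_len=11):
        super().__init__()
        self.enc1, self.enc2 = ResEncoder(imu_len), ResEncoder(imu_len)

    def forward(self, imu):
        return torch.cat([self.enc1(imu), self.enc2(imu)], dim=-1)   # (B*S, 512)

f_i = ParallelIMUEncoder()(torch.zeros(4, 6, 11))
print(f_i.shape)   # torch.Size([4, 512])
```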

3.3. Decoder

To capture both short- and long-term dependencies in time-series modeling while meeting real-time requirements, this study designs a decoder module based on a single-layer RWKV network, as shown in Figure 2c. By incorporating positional encoding, time-mixing, and channel-mixing mechanisms, efficient time-series feature modeling and decoding are achieved.

3.3.1. Positional Encoding

To enable the model to perceive and leverage the temporal ordering of the multimodal sequence, we employ an explicit positional encoding mechanism that augments the fused feature representations with absolute position information. The inputs to the embedding module consist of visual features $F_V$ and inertial features $F_I$, concatenated along the feature dimension to form a unified multimodal representation:
$$X_{\mathrm{concat}} = \mathrm{concat}([F_V, F_I])$$
Next, to extract meaningful local temporal dependencies, we apply a one-dimensional convolutional layer to $X_{\mathrm{concat}}$, which is particularly effective for modeling short-range correlations and projecting the concatenated features into the embedding space:
$$X_{\mathrm{feature}} = \mathrm{Conv1D}(X_{\mathrm{concat}})$$
Here, $\mathrm{Conv1D}(\cdot)$ denotes a 1D convolution with learnable kernels, which captures dynamic transitions within short windows of the sequential data [48] and offers a rich representation for subsequent temporal modeling.
To encode absolute position information, we incorporate a fixed, non-learnable sinusoidal positional encoding scheme as originally proposed by Vaswani et al. [49]. For each time step $pos$ (corresponding to the synchronized sensor data at that moment) and each dimension $i$ of the embedding space of dimensionality $d$, the positional encoding is defined as
$$PE(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d}}\right)$$
$$PE(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)$$
where $pos$ is the temporal index of the sequence and $i \in \{0, 1, \ldots, d/2 - 1\}$. This alternating use of sine and cosine functions at different frequencies allows the model to encode temporal ordering in a manner that facilitates generalization to sequence lengths and positions not seen during training, promoting both local and global awareness of sequence structure.
The final input embedding is produced by performing an element-wise addition between the feature-encoded representation and the positional encoding:
$$X_{\mathrm{embed}} = X_{\mathrm{feature}} + PE$$
This operation ensures that each embedded feature vector not only contains multimodal semantic content but is also uniquely conditioned on its temporal position in the sequence. Such a design enables the network to utilize both local spatiotemporal features and explicit global order information, resulting in robust and contextually enriched descriptors for downstream tasks such as sequence modeling and temporal reasoning.
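A minimal sketch of the sinusoidal encoding and the element-wise addition described above follows; the embedding dimensionality (taken here as 1024, i.e., the concatenated 512-dimensional visual and 512-dimensional inertial features) is an assumption.

```python
import math
import torch

def sinusoidal_pe(seq_len, d_model):
    """Fixed sinusoidal positional encoding (Vaswani et al.); shape (seq_len, d_model)."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)            # (seq_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))                        # 10000^(-2i/d)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

# x_feature: Conv1D-projected fused features, shape (B, seq_len, d_model)
x_feature = torch.zeros(2, 10, 1024)
x_embed = x_feature + sinusoidal_pe(10, 1024)   # element-wise addition, broadcast over the batch
print(x_embed.shape)   # torch.Size([2, 10, 1024])
```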

3.3.2. Time-Mixing Block

The time-mixing module models temporal dependencies. In RWKV, Token Shift is introduced as a simple method for temporal feature mixing. For the current time step $x_t$ and the previous time step $x_{t-1}$, mixed features are generated through a linear combination, as given by the following equations:
$$g_t = W_g \cdot \big(\mu_g \odot x_t + (1 - \mu_g) \odot x_{t-1}\big),$$
$$r_t = W_r \cdot \big(\mu_r \odot x_t + (1 - \mu_r) \odot x_{t-1}\big),$$
$$k_t = W_k \cdot \big(\mu_k \odot x_t + (1 - \mu_k) \odot x_{t-1}\big),$$
$$v_t = W_v \cdot \big(\mu_v \odot x_t + (1 - \mu_v) \odot x_{t-1}\big),$$
where $\mu_g$, $\mu_r$, $\mu_k$, $\mu_v$ are trainable weight parameters that dynamically adjust the influence ratio of the current time step and the previous time step; $\odot$ denotes the Hadamard product (element-wise multiplication of vectors or matrices of the same dimension); and $\cdot$ denotes matrix multiplication (the weight matrix multiplied by the feature vector) when paired with weight matrices such as $W_g$.
The above operations are applied over the entire sequence of length T, producing four temporal feature sequences:
$$G = \{g_t\}_{t=1}^{T}, \quad R = \{r_t\}_{t=1}^{T}, \quad K = \{k_t\}_{t=1}^{T}, \quad V = \{v_t\}_{t=1}^{T}$$
These sequences form what we denote as the GRKV block, which is visualized in Figure 2c. Each path (G, R, K, V) represents the full set of corresponding vectors across all time steps. Specifically, G is used for gating, while R, K, and V are fed into the WKV attention mechanism to capture temporal dependencies. Thus, GRKV refers not to a single timestep, but to the collection of all ( g t , r t , k t , v t ) vectors over the sequence.
To further capture global temporal dependencies, the temporal mixing module employs a multi-head Weight-Key-Value (WKV) operation for global temporal modeling. The computation for a single-head WKV is defined as
$$wkv_t = \mathrm{diag}(u) \cdot k_t^{\top} \cdot v_t + \sum_{i=1}^{t-1} \mathrm{diag}(w)^{t-1-i} \cdot k_i^{\top} \cdot v_i$$
where $u$ represents the reward (bonus) parameter for the current time step, $w$ is the temporal decay vector (balancing the influence of the current time step and historical time steps), and $\mathrm{diag}(\cdot)$ denotes the operator that converts a 1D vector into a diagonal matrix (the vector elements serve as the main diagonal entries, with all other entries set to 0). The temporal decay weights are updated iteratively and constrained to $w \in (0, 1)$, ensuring that $\mathrm{diag}(w)$ is a contraction matrix:
$$w = \exp(-\exp(w))$$
To enhance the expressiveness of the model, RWKV introduces a multi-head WKV mechanism, computed as
$$\mathrm{Multihead\text{-}wkv}_t = \mathrm{Concat}(wkv_t^{1}, \ldots, wkv_t^{h})$$
where $h$ is the number of heads and $\mathrm{Concat}(\cdot)$ denotes the concatenation operator that merges multiple vectors along the feature dimension. In practical implementations, reshaping operations are used to optimize computational efficiency when completing the multi-head WKV calculation. The final output of the temporal mixing module is processed by the SiLU activation function and a normalization operation for each head, effectively performing GroupNorm [50] with $h$ groups. Note that $\cdot$ in $r_t \cdot wkv_t$ denotes element-wise multiplication (multiplying corresponding elements of the feature vectors $r_t$ and $wkv_t$). The output is given by
$$o_t = \big(\mathrm{SiLU}(g_t) \odot \mathrm{LayerNorm}(r_t \cdot wkv_t)\big)\, W_o$$
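For intuition, the single-head WKV computation can be sketched in its recurrent form, which makes the linear cost per step explicit. The token-shift helper uses one shared mixing vector for brevity, the projection matrices $W_g$, $W_r$, $W_k$, $W_v$, the SiLU gate, and the per-head normalization are omitted, and all shapes and initializations are illustrative simplifications of the formulation above.

```python
import torch

def token_shift(x, mu):
    """mu*x_t + (1 - mu)*x_{t-1}, with zeros shifted in at t = 0 (the full model uses
    separate mu_g, mu_r, mu_k, mu_v followed by the corresponding projections)."""
    x_prev = torch.cat([torch.zeros_like(x[:1]), x[:-1]], dim=0)
    return mu * x + (1 - mu) * x_prev

def wkv_single_head(r, k, v, w, u):
    """Recurrent single-head WKV. r, k, v: (T, d); w, u: (d,). Returns (T, d).
    One rank-1 state update per step, hence linear cost in the sequence length T."""
    T, d = k.shape
    S = torch.zeros(d, d)                       # decayed sum of k_i^T v_i over past steps
    outs = []
    for t in range(T):
        kv = torch.outer(k[t], v[t])            # k_t^T v_t, shape (d, d)
        wkv = torch.diag(u) @ kv + S            # current-step bonus term + accumulated history
        outs.append(r[t] @ wkv)                 # receptance-weighted read-out
        S = torch.diag(w) @ S + kv              # decay the history, then add the current step
    return torch.stack(outs)

T, d = 8, 16
x = torch.randn(T, d)
mu = torch.rand(d)
r, k, v = token_shift(x, mu), token_shift(x, mu), token_shift(x, mu)
w = torch.exp(-torch.exp(torch.randn(d)))       # decay constrained to (0, 1), as in the text
u = torch.randn(d)
print(wkv_single_head(r, k, v, w, u).shape)     # torch.Size([8, 16])
```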

3.3.3. Channel-Mixing Block

The channel-mixing module is designed to enhance feature interaction across channels, further processing temporal features through weighted combinations and non-linear transformations. The operations are as follows:
$$k_t = W_k \cdot \big(\mu_k \odot x_t + (1 - \mu_k) \odot x_{t-1}\big),$$
$$r_t = W_r \cdot \big(\mu_r \odot x_t + (1 - \mu_r) \odot x_{t-1}\big),$$
$$v_t = \mathrm{ReLU}^2(k_t \cdot W_v),$$
$$o_t = \mathrm{Sigmoid}(r_t) \odot v_t,$$
where $\mathrm{ReLU}^2(\cdot)$ denotes the squared ReLU operator (it first applies the ReLU activation, $\mathrm{ReLU}(x) = \max(0, x)$, and then squares the result); ReLU is employed as a non-linear activation function to enhance the model's expressive capability; and Sigmoid is used to regulate the influence of temporal features on the final output.
The above operations are applied across all timesteps in the sequence, generating three feature sequences:
$$K = \{k_t\}_{t=1}^{T}, \quad R = \{r_t\}_{t=1}^{T}, \quad V = \{v_t\}_{t=1}^{T}$$
These represent the channel-mixing counterparts to GRKV: $K$ encodes transformed features, $R$ acts as a gate, and $V$ contains nonlinearly activated values. They are combined at each time step to form the final output $\{o_t\}_{t=1}^{T}$. This process is performed in parallel across the entire sequence.
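The channel-mixing block above can be sketched as a small PyTorch module; the hidden width of the value path and the weight shapes are assumptions, and the equations are implemented as written (value projection inside the squared ReLU).

```python
import torch
import torch.nn as nn

class ChannelMix(nn.Module):
    """Channel-mixing block: token shift, value path with squared ReLU, sigmoid receptance gate."""
    def __init__(self, d, hidden=None):
        super().__init__()
        hidden = hidden or 4 * d                 # expansion ratio of the value path is assumed
        self.mu_k = nn.Parameter(torch.rand(d))
        self.mu_r = nn.Parameter(torch.rand(d))
        self.W_k = nn.Linear(d, hidden, bias=False)
        self.W_r = nn.Linear(d, d, bias=False)
        self.W_v = nn.Linear(hidden, d, bias=False)

    def forward(self, x):                        # x: (T, d)
        x_prev = torch.cat([torch.zeros_like(x[:1]), x[:-1]], dim=0)
        k = self.W_k(self.mu_k * x + (1 - self.mu_k) * x_prev)
        r = self.W_r(self.mu_r * x + (1 - self.mu_r) * x_prev)
        v = torch.relu(self.W_v(k)) ** 2         # squared ReLU applied to k_t . W_v, as in the equations
        return torch.sigmoid(r) * v              # o_t = Sigmoid(r_t) * v_t (element-wise)

print(ChannelMix(16)(torch.randn(8, 16)).shape)  # torch.Size([8, 16])
```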

3.3.4. Pose Regression Module

After processing by the RWKV module, the learned temporal and contextual features are forwarded to a dedicated pose regression network. This regression module consists of a simple yet effective two-layer fully connected neural network, whose structure is illustrated as follows:
nn.Sequential(
    nn.Linear(d_in, 128),   (feature projection)
    nn.LeakyReLU(0.1),      (nonlinearity)
    nn.Linear(128, 6)       (pose output)
)
Given the high-dimensional embedded features $X_{\mathrm{embed}} \in \mathbb{R}^{d_{\mathrm{in}}}$ from the RWKV module, the first linear layer projects the input to a 128-dimensional latent space. This serves to condense and distill the most salient information necessary for regression, enabling more effective learning. A LeakyReLU activation function with a negative slope of 0.1 is then applied to introduce nonlinearity and mitigate issues such as the dying-ReLU problem, increasing the model's expressive capacity while allowing a small gradient when the unit is not active.
Following this, a final linear transformation maps the 128-dimensional latent representation to a six-dimensional output vector, which corresponds to the target pose parameters: translation $(x, y, z)$ and rotation $(r_x, r_y, r_z)$. These six values represent the location and orientation of the object at the predicted future time step.
The overall regression operation can be formally written as
$$P_t^{t + seq\_len - 1} = R\big(\mathrm{RWKV}(X_{\mathrm{embed}})\big)$$
where $R(\cdot)$ denotes the two-layer regression module described above, and $P_t^{t + seq\_len - 1}$ encodes the predicted six-DoF poses (three translations and three rotations) from time step $t$ to $t + seq\_len - 1$, providing both the position and attitude information required for pose estimation tasks.
To summarize, although the regression network adopts a straightforward, fully connected design, its careful dimensionality reduction and the inclusion of nonlinear activation allow it to precisely transform the temporally enriched features from the RWKV module into accurate pose estimates.
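As a usage sketch, the regression head can be applied to the RWKV output for all $seq\_len - 1$ time steps at once; the embedding dimensionality d_in and the batch shapes below are assumptions.

```python
import torch
import torch.nn as nn

d_in = 1024                                    # embedding size assumed (512 visual + 512 inertial)
regressor = nn.Sequential(
    nn.Linear(d_in, 128),                      # feature projection
    nn.LeakyReLU(0.1),                         # nonlinearity
    nn.Linear(128, 6),                         # (x, y, z, r_x, r_y, r_z)
)

B, steps = 4, 10                               # seq_len - 1 relative poses per window (seq_len = 11)
x_embed = torch.randn(B, steps, d_in)          # temporally enriched RWKV decoder output
poses = regressor(x_embed)                     # applied independently at each time step
print(poses.shape)                             # torch.Size([4, 10, 6])
```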

3.4. Loss Function

During training, we use the Mean Squared Error (MSE) loss function, while also introducing a weight factor to balance the rotation and translation errors. The loss is defined as
$$\mathcal{L}_{\mathrm{pose}} = \frac{1}{T-1} \sum_{t=1}^{T-1} \left( \left\| \hat{\mathbf{v}}_t - \mathbf{v}_t \right\|_2^2 + \alpha \left\| \hat{\boldsymbol{\phi}}_t - \boldsymbol{\phi}_t \right\|_2^2 \right)$$
where $T$ is the sequence length, $\hat{\mathbf{v}}_t$ and $\hat{\boldsymbol{\phi}}_t$ represent the predicted translation and rotation parameters, $\mathbf{v}_t$ and $\boldsymbol{\phi}_t$ denote the ground-truth translation and rotation parameters, and $\alpha$ is a weight factor used to balance the scale of the translation and rotation errors.
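A minimal sketch of this loss in PyTorch, assuming the pose-vector ordering (translation first, then rotation) described in Section 3.3.4 and an illustrative value for the weight $\alpha$ (the paper does not state the value used):

```python
import torch

def pose_loss(pred, gt, alpha=100.0):
    """MSE pose loss: translation term plus alpha-weighted rotation term.
    pred, gt: (B, T-1, 6), columns ordered [x, y, z, r_x, r_y, r_z]; alpha=100 is an assumed value."""
    t_err = ((pred[..., :3] - gt[..., :3]) ** 2).sum(dim=-1)   # ||v_hat - v||^2
    r_err = ((pred[..., 3:] - gt[..., 3:]) ** 2).sum(dim=-1)   # ||phi_hat - phi||^2
    return (t_err + alpha * r_err).mean()

loss = pose_loss(torch.randn(4, 10, 6), torch.randn(4, 10, 6))
print(loss.item())
```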

4. Results

We conducted extensive experiments to evaluate the performance and merits of the proposed framework. These evaluations include the main results, ablation studies, and detailed analyses of computational efficiency and parameter count. We further analyzed the experimental findings to provide deeper insight into the effectiveness of the proposed approach.

4.1. Experiment Setup

4.1.1. Dataset

This study uses the KITTI Odometry dataset [51] as the experimental benchmark, which is a widely recognized evaluation standard in the visual/visual–inertial odometry (VO/VIO) domain. The dataset was collected with a real vehicle equipped with a set of advanced sensors, including cameras, a 3D lidar, a GPS navigation system, and an IMU. As shown in Figure 3, the vehicle was driven through various real-world scenarios, such as densely populated urban areas, serene rural areas, high-speed highways, quiet residential neighborhoods, areas with different weather and traffic conditions, and campus environments. The dataset contains 22 sequences. Among them, sequences 00 to 10 supply ground-truth pose data for the training phase, whereas sequences 11 to 21 lack ground-truth pose data and are exclusively employed for evaluation. We selected sequences 00, 01, 02, 04, 06, 08, and 09 for training and sequences 05, 07, and 10 for testing. Since sequence 03 lacks IMU data, it was excluded from the experiments.
The images and ground-truth poses in the dataset are sampled at a frequency of 10 Hz, while the IMU data are sampled at 100 Hz. To address temporal misalignment between the IMU data, images, and ground-truth poses, we interpolate the raw IMU data to achieve temporal synchronization with the images and ground-truth poses. For data alignment, we handle the frequency difference between the visual and inertial sensors: given that images are captured at 10 Hz and IMU data at 100 Hz, each image frame is mapped to 10 consecutive IMU measurements to synchronize the two modalities. For an image sequence of length $seq\_len$, the corresponding IMU data segment is extracted from the $(i \times 10)$-th to the $((i + 10) \times 10 + 1)$-th data point (where $i$ denotes the starting index of the sequence), ensuring equidistant temporal alignment across the sequence. This indexing strategy maintains consistent time-series correspondence between the images, their associated poses, and the IMU data, as commonly adopted in visual–inertial odometry research on the KITTI dataset.
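A small sketch of this frame-to-IMU index mapping, assuming an already interpolated and time-synchronized 100 Hz IMU stream; the function name and the exclusive-slice reading of the end index are one interpretation of the description above.

```python
import numpy as np

def imu_segment(imu_100hz, i, seq_len=11, ratio=10):
    """Return the IMU rows aligned with image frames i .. i + seq_len - 1, i.e. samples
    i*ratio through (i + seq_len - 1)*ratio, so that every frame interval owns ratio samples."""
    start = i * ratio
    stop = (i + seq_len - 1) * ratio + 1        # exclusive slice end
    return imu_100hz[start:stop]                # shape (ratio*(seq_len - 1) + 1, 6)

imu = np.zeros((5000, 6))                       # interpolated, synchronized 100 Hz IMU stream
seg = imu_segment(imu, i=12)
print(seg.shape)                                # (101, 6)
```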
In the experiments, only monocular images from the left camera of the KITTI dataset are utilized as input data. The main metrics are $t_{rel}$ and $r_{rel}$, representing the average translational error and average rotational error over all subsequences with lengths ranging from 100 to 800 m.

4.1.2. Implementation Details

During the training process, all images are resized to $512 \times 256$, and $seq\_len$ is set to 11. The visual encoder utilizes the FlowNet-S network pre-trained on the FlyingChairs [52] dataset (with the final layer removed) for optical-flow feature extraction. The subsequent processing details and the inertial encoder were described earlier. Training employs a 60-epoch two-stage strategy with the Adam optimizer (weight decay $5 \times 10^{-6}$, betas $(0.9, 0.999)$, eps $1 \times 10^{-8}$): the first 20 epochs use a warm-up learning rate of $5 \times 10^{-4}$ to stabilize initial training, followed by 40 fine-tuning epochs with a reduced learning rate of $1 \times 10^{-4}$ for parameter refinement. A fixed training schedule is adopted without early stopping, with checkpoints saved every 10 epochs and the final model weights from the 60th epoch used for evaluation. All experiments are executed on an NVIDIA RTX 4090 GPU (24 GB) with a batch size of 16, balancing training efficiency and model convergence.
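The two-stage schedule can be written as a short training-loop sketch; the placeholder model, the omitted data pipeline, and the checkpoint naming are illustrative only, while the optimizer hyperparameters follow the values above.

```python
import torch

model = torch.nn.Linear(8, 6)                   # placeholder for the full RWKV-VIO network

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=5e-4,                                    # warm-up learning rate for the first 20 epochs
    betas=(0.9, 0.999), eps=1e-8, weight_decay=5e-6,
)

for epoch in range(60):
    if epoch == 20:                             # switch to the fine-tuning rate for epochs 20-59
        for group in optimizer.param_groups:
            group["lr"] = 1e-4
    # ... one training epoch over sliding-window batches of size 16 goes here ...
    if (epoch + 1) % 10 == 0:                   # checkpoint every 10 epochs
        torch.save(model.state_dict(), f"checkpoint_epoch{epoch + 1}.pt")
```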

4.2. Main Results

Table 2 summarizes the pose estimation results of various methods on the KITTI dataset. The proposed RWKV-VIO model demonstrates excellent overall performance, achieving an average translational error of 2.29% and an average rotational error of 1.26°. Specifically, RWKV-VIO ranks second in $t_{rel}(\%)$ on Sequence 05 and consistently remains among the top three on Sequences 07 and 10 in terms of both translational and rotational errors. On average, RWKV-VIO attains the third-best results in both metrics, highlighting the model's robustness and accuracy across different sequences.
Compared to other methods, RWKV-VIO ranks third on average in terms of accuracy. Although the overall average translational error and rotational error of RWKV-VIO are slightly higher than those of DeepVIO [33] and ORB-SLAM2 [19], it outperforms both methods in their weaker areas. DeepVIO achieves a lower rotational error, but suffers from a higher translational error, while ORB-SLAM2 performs well in rotation, but has a significantly higher translational error. In contrast, RWKV-VIO balances these two metrics effectively, achieving a lower translational error than ORB-SLAM2 and a better rotational error than DeepVIO. This balanced performance makes RWKV-VIO stand out in tasks where both translational and rotational errors are critical. Moreover, RWKV-VIO has a faster inference speed than almost all other models, making it a more practical choice for real-world applications.

4.3. Ablation Study

IMU measurements are affected by various sources of error, such as bias drift, temperature sensitivity, random sensor noise, and random walk. These errors can be significantly amplified during the integration phase of conventional inertial navigation algorithms, leading to trajectory drift and the accumulation of positioning errors. This degradation in accuracy becomes more pronounced over time and distance. In environments with poor visual textures, weak visual features, or even complete absence of visual cues, visual-based localization methods are severely limited. Under these challenging conditions, the reliance on IMU data becomes increasingly important, and the accuracy of IMU measurements has a direct impact on the reliability of the entire localization system.
Ablation experiments were conducted to evaluate the effectiveness of the proposed inertial encoder in reducing the trajectory drift and cumulative errors. These experiments were performed in consistent settings, with modifications applied only to the inertial encoder. Specifically, three configurations were tested: a baseline encoder consisting of three convolutional layers, a single Res-Encoder, and the proposed parallel encoder strategy. The baseline encoder processes six-channel IMU data through three sequential 1D convolutional layers with channel dimensions increasing from 6 to 64, 128, and 256 respectively, each followed by BatchNorm, LeakyReLU (negative slope 0.1), and Dropout, before projecting features to the target dimension via a fully connected layer. As illustrated in Figure 4, the parallel encoder strategy significantly improved performance. The trajectory in Sequence 10 exhibited marked convergence, while the trajectory in Sequence 05, particularly near the end, aligned more closely with the ground truth (GT) compared to both the baseline encoder and the single Res-Encoder.
To validate the drift suppression efficacy of the proposed strategy, we evaluated average translation and rotation errors across distances on KITTI Sequence 05. As shown in Figure 5, the baseline encoder lacks residual connections and loses critical historical IMU features, resulting in notable error accumulation over distance and abrupt initial translation errors. The single Res-Encoder preserves partial shallow features via residuals yet shows distinct initial translation fluctuations and cumulative error trends. In contrast, the parallel strategy uses dual independent Res-Encoders for complementary IMU feature extraction, effectively reducing error accumulation, enhancing curve stability, and suppressing initial translation abruptness, which demonstrates that residual connections and parallel structures collectively enable robust suppression of IMU error accumulation.
As summarized in Table 3, the parallel strategy reduced the translation error by 31.8% and the rotation error by 20.2%, demonstrating substantial accuracy improvements. The use of multiple independently initialized IMU encoders enables the model to capture complementary features, improving robustness and reducing the risk of overfitting, similarly to an ensemble learning approach. Additionally, the parallel architecture aligns the IMU features with the visual features, enabling a more effective fusion of temporal and spatial information. This design significantly mitigates short- and long-term trajectory drift, resulting in improved overall pose estimation accuracy.

4.4. Efficiency Analysis

This section begins with a comparison of the time and space complexity of various models. As shown in Table 4, RWKV is a linear RNN model with time and space complexity of $O(L)$, where $L$ is the sequence length. For LSTM-based models, the method of Yang et al. [22] is a typical example. The time complexity of LSTM is also $O(L)$; however, its processing is sequential, with the output of one time step relying on the previous one. Therefore, when inferring a sequence of length $L$, the number of inference steps for LSTM is $L$, as reflected in Table 4. For Transformer-based models, VIFT is a representative example. Due to their self-attention mechanism, Transformer models have a time complexity of $O(L^2)$, which escalates rapidly as the sequence length increases, resulting in high computational costs.
As shown in Figure 6a, we compared the inference time and frame rate of various models. RWKV-VIO achieves the best performance, with an inference time of only 4.59 ms and a frame rate of 217.86 fps, significantly outperforming DeepVO (46.81 ms, 21.73 fps), VINet (16.04 ms, 62.34 fps), VIFT (22.80 ms, 43.86 fps), and the method of Yang et al. [22] (11.39 ms, 87.80 fps). In Figure 6b, RWKV-VIO has a parameter size of only 37.97 M, much smaller than DeepVO (71.23 M), VIFT (45 M), VINet (47.97 M), and the method of Yang et al. [22] (85.65 M). This shows that RWKV-VIO excels in both inference efficiency and lightweight model design.
The remarkable advantage of RWKV-VIO in inference time stems from the unique architecture of the RWKV algorithm. Traditional models like DeepVO and VINet rely on LSTM networks for temporal sequence processing. LSTM requires recursive computations, with each step depending on the previous state, leading to a serialized inference process. Transformer-based models such as VIFT achieve high accuracy, but their self-attention mechanism computes correlations between every position in the sequence and all others, so the computational complexity soars as the sequence length increases and inference efficiency suffers; VIFT's inference time of 22.80 ms, for example, is not particularly fast.
In contrast, RWKV avoids recursion by using a key-value mechanism to capture temporal features efficiently, enabling RWKV-VIO to achieve an inference speed of 4.59 ms, making it one of the fastest models currently available.
In terms of parameter size, RWKV-VIO also performs exceptionally well, with its parameter count being nearly half that of the Selective method. This is because RWKV reduces redundant parameters by optimizing its architecture, without relying on larger hidden dimensions or deeper networks as LSTM-based models do to compensate for their limitations in long-sequence dependency modeling. Additionally, RWKV aligns temporal and spatial features directly during fusion, reducing model complexity while maintaining strong feature representation capabilities.

5. Conclusions

In conclusion, the proposed RWKV-VIO framework introduces a novel approach to visual–inertial odometry, combining the strengths of the RWKV architecture with efficient temporal modeling. Using a lightweight structure with linear computational complexity, RWKV-VIO effectively captures both short- and long-term dependencies, overcoming the limitations of conventional temporal models such as LSTMs and Transformers. The introduction of the parallel IMU encoder further enhances the feature extraction process, significantly improving performance in complex environments. Experimental findings demonstrate that RWKV-VIO performs better than existing methods in terms of localization accuracy, computational efficiency, and real-time performance. The framework improves both translational and rotational accuracy in a balanced manner, making it a reliable solution for real-world applications in autonomous navigation and robotics.

Author Contributions

Conceptualization, J.Y.; methodology, J.Y.; software, J.Y.; validation, J.Y., X.X. and Z.X.; formal analysis, J.Y.; investigation, J.Y.; resources, Z.W.; data curation, J.Y.; writing—original draft preparation, J.Y.; writing—review and editing, J.Y., X.X. and Z.X.; visualization, J.Y.; supervision, X.X. and W.C.; project administration, J.Y., X.X. and W.C.; funding acquisition, Z.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Basic Science Center for “Intelligent Multisource Autonomous Navigation” of the National Natural Science Foundation of China under Project 62388101, the European Union’s Horizon Europe Program for Marie Skłodowska-Curie Actions under Grant 101149415 (FEASIBLE), and the Key Laboratory Fund of National Defense Science and Technology [2025-WDZC-004-03].

Data Availability Statement

The data used in this study is derived from the KITTI dataset, which is publicly available at its official website: http://www.cvlibs.net/datasets/kitti/ (accessed on 9 September 2025).

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Ma, F.; Shi, J.; Wu, L.; Dai, K.; Zhong, S. Consistent Monocular Ackermann Visual–Inertial Odometry for Intelligent and Connected Vehicle Localization. Sensors 2020, 20, 5757. [Google Scholar] [CrossRef]
  2. Cao, H.; Yang, T.; Yiu, K.-F.C. A Method Integrating the Matching Field Algorithm for the Three-Dimensional Positioning and Search of Underwater Wrecked Targets. Sensors 2025, 25, 4762. [Google Scholar] [CrossRef]
  3. Wang, W.; Zhang, Q.; Hu, Y.; Gallay, M.; Zheng, W.; Guo, J. Recent Advances in SLAM for Degraded Environments A Review. IEEE Sens. J. 2025, 25, 27898–27921. [Google Scholar] [CrossRef]
  4. Song, J.; Jo, H.; Jin, Y.; Lee, S.J. Uncertainty-Aware Depth Network for Visual Inertial Odometry of Mobile Robots. Sensors 2024, 24, 6665. [Google Scholar] [CrossRef]
  5. Xu, Z.; Xi, B.; Yi, G.; Ahn, C.K. High-precision control scheme for hemispherical resonator gyroscopes with application to aerospace navigation systems. Aerosp. Sci. Technol. 2021, 119, 107168. [Google Scholar] [CrossRef]
  6. Lee, J.-C.; Chen, C.-C.; Shen, C.-T.; Lai, Y.-C. Landmark-Based Scale Estimation and Correction of Visual Inertial Odometry for VTOL UAVs in a GPS-Denied Environment. Sensors 2022, 22, 9654. [Google Scholar] [CrossRef]
  7. Nan, F.; Gao, Z.; Xu, R.; Wu, H.; Yu, Z.; Zhang, Y. Stiffness Tuning of Hemispherical Shell Resonator Based on Electrostatic Force Applied to Discrete Electrodes. IEEE Trans. Instrum. Meas. 2024, 73, 8505810. [Google Scholar] [CrossRef]
  8. Cao, H.; Zhang, Y.; Han, Z.; Shao, X.; Gao, J.; Huang, K.; Shi, Y.; Tang, J.; Shen, C.; Liu, J. Pole-Zero Temperature Compensation Circuit Design and Experiment for Dual-Mass MEMS Gyroscope Bandwidth Expansion. IEEE/ASME Trans. Mechatronics 2019, 24, 677–688. [Google Scholar] [CrossRef]
  9. Yang, C.; Cheng, Z.; Jia, X.; Zhang, L.; Li, L.; Zhao, D. A Novel Deep Learning Approach to 5G CSI/Geomagnetism/VIO Fused Indoor Localization. Sensors 2023, 23, 1311. [Google Scholar] [CrossRef]
  10. Zhang, Y.; Chu, L.; Mao, Y.; Yu, X.; Wang, J.; Guo, C. A Vision/Inertial Navigation/Global Navigation Satellite Integrated System for Relative and Absolute Localization in Land Vehicles. Sensors 2024, 24, 3079. [Google Scholar] [CrossRef] [PubMed]
  11. Chen, C.; Wang, B.; Lu, C.X.; Trigoni, N.; Markham, A. Deep Learning for Visual Localization and Mapping: A Survey. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 17000–17020. [Google Scholar] [CrossRef] [PubMed]
  12. Luo, R.C.; Yih, C.-C.; Su, K.L. Multisensor fusion and integration: Approaches, applications, and future research directions. IEEE Sens. J. 2002, 2, 107–119. [Google Scholar] [CrossRef]
  13. Zhu, H.; Qiu, Y.; Li, Y.; Mihaylova, L.; Leung, H. An Adaptive Multisensor Fusion for Intelligent Vehicle Localization. IEEE Sens. J. 2024, 24, 8798–8806. [Google Scholar] [CrossRef]
  14. Cadena, C.; Carlone, L.; Carrillo, H.; Latif, Y.; Scaramuzza, D.; Neira, J.; Reid, I.; Leonard, J.J. Past, Present, and Future of Simultaneous Localization and Mapping: Toward the Robust-Perception Age. IEEE Trans. Robot. 2016, 32, 1309–1332. [Google Scholar] [CrossRef]
  15. Engel, J.; Koltun, V.; Cremers, D. Direct Sparse Odometry. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 611–625. [Google Scholar] [CrossRef]
  16. Forster, C.; Pizzoli, M.; Scaramuzza, D. SVO: Fast semi-direct monocular visual odometry. In Proceedings of the 2014 IEEE International Conference on Robotics and Automation (ICRA), Hong Kong, China, 31 May–7 June 2014; pp. 15–22. [Google Scholar] [CrossRef]
  17. Sun, Q.; Tang, Y.; Zhang, C.; Zhao, C.; Qian, F.; Kurths, J. Unsupervised Estimation of Monocular Depth and VO in Dynamic Environments via Hybrid Masks. IEEE Trans. Neural Networks Learn. Syst. 2022, 33, 2023–2033. [Google Scholar] [CrossRef]
  18. Qin, T.; Li, P.; Shen, S. VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator. IEEE Trans. Robot. 2018, 34, 1004–1020. [Google Scholar] [CrossRef]
  19. Campos, C.; Elvira, R.; Rodríguez, J.J.G.; Montiel, J.M.; Tardós, J.D. ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual–Inertial, and Multimap SLAM. IEEE Trans. Robot. 2021, 37, 1874–1890. [Google Scholar] [CrossRef]
  20. Bloesch, M.; Omari, S.; Hutter, M.; Siegwart, R. Robust visual inertial odometry using a direct EKF-based approach. In Proceedings of the 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Hamburg, Germany, 28 September–2 October 2015; pp. 298–304. [Google Scholar] [CrossRef]
  21. Chen, C.; Rosa, S.; Miao, Y.; Lu, C.X.; Wu, W.; Markham, A.; Trigoni, N. Selective Sensor Fusion for Neural Visual-Inertial Odometry. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 10534–10543. [Google Scholar] [CrossRef]
  22. Yang, M.; Chen, Y.; Kim, H.S. Efficient deep visual and inertial odometry with adaptive visual modality selection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer Nature: Cham, Switzerland, 2022; pp. 233–250. [Google Scholar]
  23. Tu, Z.; Chen, C.; Pan, X.; Liu, R.; Cui, J.; Mao, J. EMA-VIO: Deep Visual–Inertial Odometry With External Memory Attention. IEEE Sens. J. 2022, 22, 20877–20885. [Google Scholar] [CrossRef]
  24. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  25. Cho, K.; van Merrienboer, B.; Gülçehre, Ç.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the Conference Empirical Methods in Natural Language Processing, Doha, Qatar, 25–29 October 2014. Available online: https://api.semanticscholar.org/CorpusID:5590763 (accessed on 9 September 2025).
  26. Antonello, R.; Vaidya, A.; Huth, A. Scaling laws for language encoding models in fMRI. Adv. Neural Inf. Process. Syst. 2023, 36, 21895–21907. [Google Scholar]
  27. Zou, Y.; Ji, P.; Tran, Q.H.; Huang, J.B.; Chandraker, M. Learning monocular visual odometry via self-supervised long-term modeling. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer International Publishing: Cham, Switzerland, 2020; pp. 710–727. [Google Scholar]
  28. Peng, B.; Alcaide, E.; Anthony, Q.; Anthony, Q.; Albalak, A.; Arcadinho, S.; Biderman, S. RWKV: Reinventing RNNs for the Transformer Era. arXiv 2023, arXiv:2305.13048. [Google Scholar] [CrossRef]
  29. Clark, R.; Wang, S.; Wen, H.; Markham, A.; Trigoni, N. VINet: Visual-Inertial Odometry as a Sequence-to-Sequence Learning Problem. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31. [Google Scholar] [CrossRef]
  30. Wang, S.; Clark, R.; Wen, H.; Trigoni, N. DeepVO: Towards end-to-end visual odometry with deep recurrent convolutional neural networks. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 24 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 2043–2050. [Google Scholar]
  31. Shamwell, E.J.; Leung, S.; Nothwang, W.D. Vision-Aided Absolute Trajectory Estimation Using an Unsupervised Deep Network with Online Error Correction. In Proceedings of the 2018 IEEE/RSJ international conference on intelligent robots and systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 2524–2531. [Google Scholar]
  32. Chen, C.; Rosa, S.; Lu, C.X.; Wang, B.; Trigoni, N.; Markham, A. Learning Selective Sensor Fusion for State Estimation. IEEE Trans. Neural Netw. Learn. Syst. 2025, 36, 4103–4117. [Google Scholar] [CrossRef]
  33. Han, L.; Lin, Y.; Du, G.; Lian, S. DeepVIO: Self-supervised Deep Learning of Monocular Visual Inertial Odometry using 3D Geometric Constraints. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 4–8 November 2019; pp. 6906–6913. [Google Scholar] [CrossRef]
  34. Godard, C.; Aodha, O.M.; Firman, M.; Brostow, G. Digging Into Self-Supervised Monocular Depth Estimation. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3827–3837. [Google Scholar] [CrossRef]
  35. Xue, F.; Wang, X.; Li, S.; Wang, Q.; Wang, J.; Zha, H. Beyond Tracking: Selecting Memory and Refining Poses for Deep Visual Odometry. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 8567–8575. [Google Scholar]
  36. Liu, L.; Li, G.; Li, T.H. ATVIO: Attention Guided Visual-Inertial Odometry. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 4125–4129. [Google Scholar] [CrossRef]
  37. Xue, F.; Wang, Q.; Wang, X.; Dong, W.; Wang, J.; Zha, H. Guided feature selection for deep visual odometry. In Asian Conference on Computer Vision; Springer International Publishing: Cham, Switzerland, 2018; pp. 293–308. [Google Scholar]
  38. Kurt, Y.B.; Akman, A.O.; Alatan, A.A. Causal Transformer for Fusion and Pose Estimation in Deep Visual Inertial Odometry. arXiv 2024, arXiv:2409.08769. [Google Scholar] [CrossRef]
  39. Li, S.; Jin, X.; Xuan, Y.; Zhou, X.; Chen, W.; Wang, Y.X.; Yan, X. Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting. Adv. Neural Inf. Process. Syst. 2019, 32. [Google Scholar]
  40. Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; Zhang, W. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 2–9 February 2021; Volume 35, pp. 11106–11115. [Google Scholar]
  41. Liu, S.; Yu, H.; Liao, C.; Li, J.; Lin, W.; Liu, A.X.; Dustdar, S. Pyraformer: Low-Complexity Pyramidal Attention for Long-Range Time Series Modeling and Forecasting. In Proceedings of the Tenth International Conference on Learning Representations (ICLR 2022), Virtual Event, 25–29 April 2022. [Google Scholar]
  42. Wang, H.; Peng, J.; Huang, F.; Wang, J.; Chen, J.; Xiao, Y. MICN: Multi-scale local and global context modeling for long-term series forecasting. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  43. Liu, M.; Zeng, A.; Chen, M.; Xu, Z.; Lai, Q.; Ma, L.; Xu, Q. SciNet: Time series modeling and forecasting with sample convolution and interaction. Adv. Neural Inf. Process. Syst. 2022, 35, 5816–5828. [Google Scholar]
  44. Zeng, A.; Chen, M.; Zhang, L.; Xu, Q. Are transformers effective for time series forecasting? In Proceedings of the AAAI Conference on Artificial Intelligence 2023, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 11121–11128. [Google Scholar]
  45. Hou, H.; Yu, F.R. RWKV-TS: Beyond Traditional Recurrent Neural Network for Time Series Tasks. arXiv 2024, arXiv:2401.09093. [Google Scholar] [CrossRef]
  46. Ilg, E.; Mayer, N.; Saikia, T.; Keuper, M.; Dosovitskiy, A.; Brox, T. FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1647–1655. [Google Scholar] [CrossRef]
  47. Yang, J.; Zeng, X.; Zhong, S.; Wu, S. Effective Neural Network Ensemble Approach for Improving Generalization Performance. IEEE Trans. Neural Netw. Learn. Syst. 2013, 24, 878–887. [Google Scholar] [CrossRef]
  48. Bai, S.; Kolter, J.Z.; Koltun, V. An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling. arXiv 2018, arXiv:1803.01271. [Google Scholar] [CrossRef]
  49. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
  50. Wu, Y.; He, K. Group Normalization. arXiv 2018, arXiv:1803.08494. [Google Scholar] [CrossRef]
  51. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3354–3361. [Google Scholar] [CrossRef]
  52. Dosovitskiy, A.; Fischer, P.; Ilg, E.; Hausser, P.; Hazirbas, C.; Golkov, V.; Van Der Smagt, P.; Cremers, D.; Brox, T. FlowNet: Learning Optical Flow with Convolutional Networks. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 13–16 December 2015; pp. 2758–2766. [Google Scholar] [CrossRef]
Figure 1. (a) VIO methods based on geometric principles. (b) VIO methods based on deep learning.
Figure 2. (a) RWKV-VIO Architecture. (b) The structure of the IMU encoder, which uses convolution and residual modules to extract temporal features from inertial data. (c) The RWKV module diagram with time mixing and channel mixing for modeling temporal and channel feature relationships.
Figure 3. Examples of the KITTI Odometry dataset displaying scenes of residential areas, highways, hard traffic roads, and scenes with pedestrians.
Figure 4. The results of the RWKV-VIO model’s positioning on KITTI Sequences 05 (left), 07 (middle), and 10 (right) with different encoder strategies.
Figure 5. Average errors of RWKV-VIO on KITTI Sequence 05: Translation error ((left), %) and rotation error ((right), °/100 m) under different encoder strategies.
Figure 6. (a) A comparison of inference time and frame rate among different models. (b) The frame rate and parameter quantity of these models.
Table 1. Classification and comparison of representative VO/VIO methods.
Method Category | Model | Year | Sensor | Keywords
Geometry-based | ORB-SLAM2 [19] | 2021 | Mono/RGB-D/Stereo | ORB feature tracking; nonlinear optimization
 | VINS-Mono [18] | 2018 | Mono + IMU | Tight mono-IMU coupling; EKF-based estimation
 | ROVIO [20] | 2015 | Mono + IMU | Direct EKF; robust inertial fusion
Deep Learning-based | VINet [29] | 2017 | Mono + IMU | End-to-end seq2seq; LSTM fusion
 | DeepVO [30] | 2017 | Mono | DRCNNs; end-to-end VO
 | VIOLearner [31] | 2018 | Mono + IMU | Unsupervised learning; online trajectory correction
 | Chen et al. [32] | 2019 | Mono + IMU | Selective sensor fusion; neural network
 | DeepVIO [33] | 2019 | Mono + IMU | Self-supervised learning
 | Monodepth2 [34] | 2019 | Mono | Self-supervised depth
 | BeyondTracking [35] | 2019 | Mono | Memory selection; pose refinement
 | Zou et al. [27] | 2020 | Mono | Self-supervised long-term; CNN-RNN hybrid
 | ATVIO [36] | 2021 | Mono + IMU | Attention fusion; adaptive loss
 | GFS-VO [37] | 2018 | Mono | Guided feature selection
 | Tu et al. (EMA-VIO) [23] | 2022 | Mono + IMU | External memory attention
 | Yang et al. [22] | 2022 | Mono + IMU | Adaptive visual modality; LSTM modeling
 | VIFT [38] | 2024 | Mono + IMU | Causal Transformer
 | Fusion [32] | 2025 | Mono + IMU | Selective fusion
Table 2. Comparison of VO/VIO methods categorized into geometry-based and learning-based approaches.
Method | Model | Type | Seq. 05 t_rel (%) | Seq. 05 r_rel (°) | Seq. 07 t_rel (%) | Seq. 07 r_rel (°) | Seq. 10 t_rel (%) | Seq. 10 r_rel (°) | avg t_rel (%) | avg r_rel (°)
Geometry | ORB-SLAM2 [19] | VO | 9.12 | 0.2 | 10.34 | 0.3 | 4.04 | 0.3 | 7.8 | 0.27
 | VINS-Mono [18] | VIO | 11.6 | 1.26 | 10.0 | 1.72 | 16.5 | 2.34 | 12.7 | 1.77
 | ROVIO [20] | VIO | 3.21 | 1.22 | 2.97 | 1.38 | 3.20 | 1.33 | 3.13 | 1.31
Learning | Monodepth2 [34] | VO | 4.66 | 1.7 | 4.58 | 2.6 | 7.73 | 3.4 | 5.65 | 2.56
 | Zou et al. [27] | VO | 2.63 | 0.5 | 6.43 | 2.1 | 5.81 | 1.8 | 4.95 | 1.46
 | VIOLearner [31] | VIO | 3.00 | 1.40 | 3.60 | 2.06 | 2.04 | 1.37 | 2.88 | 1.61
 | DeepVIO [33] | VO | 2.86 | 2.32 | 2.71 | 1.66 | 0.85 | 1.03 | 2.14 | 1.67
 | GFS-VO [37] | VO | 3.27 | 1.6 | 3.37 | 2.2 | 6.32 | 2.3 | 4.32 | 2.03
 | DeepVO [30] | VO | 2.62 | 3.61 | 3.91 | 4.60 | 8.11 | 8.83 | 4.88 | 5.58
 | BeyondTracking [35] | VO | 2.59 | 1.2 | 3.07 | 1.8 | 3.94 | 1.7 | 3.2 | 1.56
 | ATVIO [36] | VIO | 4.93 | 2.4 | 3.78 | 2.59 | 5.71 | 2.96 | 4.8 | 2.65
 | Fusion [32] | VIO | 4.44 | 1.69 | 2.95 | 1.32 | 3.41 | 1.41 | 3.02 | 1.42
 | Yang et al. [22] | VIO | 2.61 | 1.06 | 1.83 | 1.35 | 3.11 | 1.12 | 2.55 | 1.17
 | VIFT [38] | VIO | 2.02 | 0.53 | 1.75 | 0.47 | 2.57 | 0.54 | 2.01 | 0.71
 | (Ours) RWKV-VIO | VIO | 2.03 | 1.0 | 2.73 | 1.79 | 2.1 | 0.99 | 2.29 | 1.26
bold indicates the highest score, underlined indicates the second-highest score, and italicized indicates the third-highest score for each block.
Table 3. Ablation study results on the IMU encoder.
Method | Seq. 05 t_rel (%) | Seq. 05 r_rel (°) | Seq. 07 t_rel (%) | Seq. 07 r_rel (°) | Seq. 10 t_rel (%) | Seq. 10 r_rel (°) | avg t_rel (%) | avg r_rel (°)
base | 3.38 | 1.25 | 2.65 | 1.70 | 4.04 | 1.85 | 3.36 | 1.58
+new IMU encoder | 2.76 | 1.13 | 2.99 | 1.81 | 3.13 | 1.56 | 2.96 | 1.50
+parallel IMU encoder | 2.03 | 1.00 | 2.73 | 1.79 | 2.13 | 0.99 | 2.29 | 1.26
bold indicates the highest score, underlined indicates the second-highest score.
Table 4. Time complexity and space complexity comparison on LSTM, Transformer and RWKV.
Method | Model | Time | Space | Inference Step
LSTM | Yang et al. [22] | O(L) | O(L) | L
Transformer | VIFT [38] | O(L²) | O(L²) | L
RWKV | RWKV-VIO | O(L) | O(L) | 1
