RIOT: Recursive Inertial Odometry Transformer for Localisation from Low-Cost IMU Measurements

Inertial localisation is an important technique as it enables ego-motion estimation in conditions where external observers are unavailable. However, low-cost inertial sensors are inherently corrupted by bias and noise, which lead to unbounded errors, making straight integration for position intractable. Traditional mathematical approaches are reliant on prior system knowledge and geometric theories and are constrained by predefined dynamics. Recent advances in deep learning, which benefit from ever-increasing volumes of data and computational power, allow for data-driven solutions that offer more comprehensive understanding. Existing deep inertial odometry solutions rely on estimating the latent states, such as velocity, or are dependent on fixed-sensor positions and periodic motion patterns. In this work, we propose taking the traditional state estimation recursive methodology and applying it in the deep learning domain. Our approach, which incorporates the true position priors in the training process, is trained on inertial measurements and ground truth displacement data, allowing recursion and the learning of both motion characteristics and systemic error bias and drift. We present two end-to-end frameworks for pose-invariant deep inertial odometry that utilise self-attention to capture both spatial features and long-range dependencies in inertial data. We evaluate our approaches against a custom 2-layer Gated Recurrent Unit, trained in the same manner on the same data, and test each approach on a number of different users, devices and activities. Each network had a sequence-length-weighted relative trajectory error mean ≤0.4594 m, highlighting the effectiveness of the learning process used in the development of the models.


Introduction
Inertial odometry is crucial in mobile agents as it facilitates ego-motion estimation in many applications, such as autonomous driving, health/activity monitoring, indoor navigation, human-robot interaction and augmented/virtual reality. Inertial measurement units (IMUs) are low-power, offer high privacy, and are robust in various environments. As such, they offer a cheap and completely ego-centric means of localisation. IMUs typically consist of a 3D accelerometer, 3D gyroscope and 3D magnetometer. By accurately integrating data from IMUs and other sensors, it is possible to build a reliable system for estimating the motion and position of autonomous systems and for pedestrian navigation. However, low-cost inertial sensors have high levels of noise and biases, causing unbounded system error growth in long-term inertial navigation [1].
Neural networks have the ability to employ continuous activation functions that inherently understand time and are capable of modelling complex non-linear system behaviours, which are typically too complex for classical mathematical approaches [2]. Recurrent neural networks (RNNs) have long been the primary choice for sequence-to-sequence modelling. Most existing deep inertial navigation solutions utilise RNNs, some supplementing with convolutional neural networks (CNNs) (see Section 2). However, these architectures have well-documented limitations, such as an inability to capture long-term dependencies and sequential computation that cannot be parallelised [3].
These deficiencies led to the development of new architectures, the most notable being self-attention-based Transformer models, first proposed in [4], which, since inception, have become ubiquitous in natural language processing (NLP). The success seen in NLP has led to their use in a number of domains. Recently, we have seen Transformers employed in computer vision (CV) [5], NLP [6], time-series forecasting [7], image recognition/production [8], text summarisation [9], speech recognition [10] and music generation [11]. These implementations have displayed the network's ability to model long dependencies between input sequence elements and to be processed in parallel, in contrast to RNNs. As noted in [12], these capabilities have the advantageous property of resolving the memory bottleneck commonly found in RNNs. Additionally, a comparison of a long short-term memory (LSTM) network (an RNN variant) and a self-attention-based Transformer showed significant performance gains for self-attention-based techniques on datasets with long-range dependencies [13].
In contrast to CNNs, Transformers do not necessitate design specifications and are proficient in handling set functions. Additionally, their uncomplicated architecture facilitates the processing of diverse modalities through the utilisation of homogeneous processing units, which have proven to exhibit remarkable scalability to both large networks and datasets. The incorporation of a self-attention mechanism in neural networks enables the inputs to engage with one another and to be evaluated according to their correlation with the final prediction. Despite extensive investigation into this formulation, limited research has been conducted utilising self-attention and unprocessed sequential readings obtained from low-cost, noisy inertial sensors in the inertial odometry domain. The substantial results achieved in other sequence-to-sequence learning problems indicate that the application of self-attention-based techniques could eliminate the need for accurate dynamic models and offer a more robust solution compared to previous RNN or CNN-based methodologies.

Related Work
Recent work has shown that the implementation of an accurate inertial odometry algorithm can serve as a foundation for a more robust and reliable ego-motion estimation system through the fusion of inertial sensors. The body of literature in this domain includes both end-to-end solutions and work applying machine learning (ML) directly to enhance the quality of IMU measurements and error models.
In [14], the authors propose an ML-based adaptive neuro-fuzzy inference system to compensate for the errors from a low-cost IMU. Similarly, the authors of [15] look at the effectiveness of different RNN architectures for IMU measurement noise reduction. In [16], the authors successfully utilised a CNN as an accelerometer error reduction method, and the authors of [17] applied a temporal convolutional network (TCN) to construct the gyroscope measurement model. The authors of [18] demonstrated the feasibility of using a smartphone's 6D inertial sensor (consisting of 3D accelerations and 3D angular velocities) and orientation estimation provided by its application programming interface (API) for pedestrian localisation. Termed RIDI, their approach leverages patterns in natural human movement to learn how to predict velocity and correct linear accelerations using linear least squares. Recent advances in deep learning (DL) have further accelerated data-driven inertial navigation. In [19], a deep learning approach called PDRNet was developed for pedestrian dead reckoning (PDR). PDRNet consists of a classification network for smartphone location recognition and a regression network for determining the change in distance and heading. These approaches rely heavily on a large number of carefully tuned parameters and users' walking habits, which leaves them fundamentally susceptible to rapid drift and a lack of generalisability. IONet [20] utilises a bidirectional long short-term memory (Bi-LSTM) and kinetic models to regress the magnitude of the velocity and the rate of change of direction. RONIN [21] takes ResNet [22], a residual CNN built for CV, as a backbone to again regress a velocity vector. The authors of [23] apply preintegration and an LSTM as a solution to supplement the IMU motion model for deep inertial odometry. These unified deep neural networks provide more robust solutions in highly dynamic conditions. However, they are still reliant on direct integration.
Additionally, these methods rely on IMU orientation, are all limited in dimensionality and are dependent on dynamical models that require prior knowledge of the dynamics of the system.
In [24], the authors present TLIO, again using ResNet, to regress 3D displacement estimates and the uncertainty, allowing them to tightly fuse the relative state measurement into a stochastic cloning extended Kalman filter (EKF) to solve for pose, velocity and sensor biases. Owing to its reliance on an EKF, it was shown to be susceptible to a system failure during highly dynamic or unusual motion, which is in line with previous work [25]. Similarly, in [26], the authors propose a hybridised approach using an LSTM and EKF in a modular design that consists of orientation and position subsystems, termed IDOL. Having a dedicated orientation module that included 3D magnetometer readings proved advantageous in contrast to previous approaches that are reliant on the system API. The authors of [27] present a novel loss formulation for smartphone-based deep inertial odometry. The authors use a ResNet-style neural network utilising two-second inertial signals from an IMU to estimate the average velocity and direction of movement. It is noted in this work that despite the obvious benefits of incorporating the magnetometer readings, the network would not converge.
The most recent approach is given in [28] and iterated in [29], where the authors propose attention-driven rotation-equivariance-supervised inertial odometry based on 6D IMU readings. They adopted ResNet to show that adding a self-supervised auxiliary task based on rotation-equivariance can improve the performance of the model when it is jointly trained and can be further improved with a Test-Time Training strategy. In their follow-up, the authors propose a hybrid neural network model for inertial odometry that combines a CNN block with attention mechanisms and a Bi-LSTM network. The CNN block is used to extract spatial features from normalised 6D IMU measurements, and the attention mechanisms, which include a spatial attention mechanism and a channel attention mechanism, are used to refine these features. The Bi-LSTM network is then used to capture temporal features.
The effectiveness of data-driven solutions in inertial odometry is well-documented, but these approaches share a common issue in network design. A well-designed network can improve performance in various applications [30], and IMU data, collected at high frequencies, can be challenging to process using traditional machine learning approaches, such as RNNs and CNNs. One drawback of using RNNs for IMU data processing is the issue of "washout", whereby the network's ability to remember past inputs diminishes over time [31], making it difficult to accurately process long sequences of data. On the other hand, CNNs require deep architectures to cover large enough receptive fields to effectively process IMU data, which can result in significant computational expenses during training and deployment [32]. Finally, it should be noted that any effective end-to-end solution for deep inertial odometry must subsume the solutions proposed solely for measurement error reduction.

Contributions
We investigate the efficacy of utilising self-attention for addressing the challenges in inertial navigation. IMU data from low-cost inertial sensors are inherently noisy, biased and incomplete. This can lead to inaccurate readings and make long-term tracking problematic. To address this, we incorporate a sliding window of input data and use prior network outputs as inputs to improve accuracy and robustness. By incorporating multiple readings over a short period of time, the network can average out noise and fill in gaps in the data. The window size is a hyperparameter that can be adjusted to balance the trade-off between incorporating enough information to improve accuracy while avoiding over-fitting to noise or short-term fluctuations. This formulation is designed to emulate the recursiveness leveraged in traditional mathematical approaches such as an EKF [33].
To mitigate the challenges associated with high-frequency time series data, we propose reliance entirely on the self-attention mechanism to compute representations of the inputs and outputs rather than using sequence-aligned RNNs or convolutions. As the self-attention mechanism will be the primary method for extracting information from the inputs and generating the outputs, there is the potential to be more efficient and flexible than using RNNs or CNNs, as self-attention mechanisms can capture long-range dependencies in the data and can be parallelised during the training process. Additionally, self-attention mechanisms can provide a degree of interpretability by allowing the model to identify the most important input features at each time step.
We term our approaches: Recursive Inertial Odometry Transformer (RIOT) and Attitude Recursive Inertial Odometry Transformer (ARIOT). To the best of our knowledge, our approaches are the only networks that leverage self-attention and all available IMU information (from a 3D accelerometer, 3D gyroscope and 3D magnetometer) to provide an end-to-end, 3D inertial odometry solution.
For the network design of RIOT, a number of modifications were made to the original Transformer proposed in [4]. The embedding layer is replaced by a generic linear layer to reduce the dimensionality of the input. Additionally, residual connections were added between the multi-head attention and the feed-forward layers to improve information flow through the network. Lastly, we forgo an activation function after the final linear layer to facilitate boundless position estimation.
The ARIOT model is a hierarchical Transformer; it differs from RIOT by the incorporation of an additional, internal attitude estimation network that regresses the orientation of the IMU from the sensor measurements. This subsystem benefits from the self-attention-based framework design in [34]. However, we were able to formulate a new loss function, which, to the best of our knowledge, is absent in the literature. The angle between quaternions is a well-known quantity; however, when training a network to regress to unit quaternions, the inner product is frequently outside the range [−1, 1], resulting in numerical instability. We propose the use of Equation (18) to negate this instability. The output is used in the odometry network to further regress the accelerometer readings and prior position to give updated IMU-based localisation.
The effectiveness of RIOT and ARIOT is validated on unseen sequences in their entirety from different users, activities and smartphone IMU devices.

Sensor Models
First, we consider the problem of modelling measurements from a 9D IMU. It is implicit that these systems are characterised by high noise levels and time-varying additive biases. The available measurements from a typical IMU are from three-axis rate gyros, three-axis accelerometers and three-axis magnetometers. The reference frame of the IMU is termed the body frame (B), which is rotated with respect to some fixed inertial frame (I), e.g., the Earth-centered inertial (ECI) frame or the North-East-Down (NED) frame. However, for brevity, these reference frames are assumed and not incorporated into the notation.
The gyroscope measures the angular velocity of B relative to I, corrupted by a slowly varying bias and noise. Therefore, we can define the gyroscope measurements, $I_{\omega,t}$, as
$$I_{\omega,t} = \omega_t + \delta_{\omega,t} + e_{\omega,t} \tag{1}$$
where $\omega_t$ is the true angular velocity at each time instance $t$, $\delta_{\omega,t}$ denotes the time-varying bias and $e_{\omega,t}$ is the noise, typically assumed to be Gaussian, $e_{\omega,t} \sim \mathcal{N}(0, \Sigma_\omega)$.
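To make the consequence of the gyroscope model above concrete, the following minimal sketch simulates one axis of a stationary gyroscope and integrates its readings. The sample rate, bias and noise level, as well as the helper names `gyro_measurement` and `integrate_rate`, are illustrative assumptions rather than values from this work, and the bias is held constant for simplicity.

```python
import random


def gyro_measurement(omega_true, bias, noise_std, rng):
    """One-axis reading: true rate + bias + zero-mean Gaussian noise."""
    return omega_true + bias + rng.gauss(0.0, noise_std)


def integrate_rate(readings, dt):
    """Naive rectangular integration of angular rate into an angle."""
    angle = 0.0
    for w in readings:
        angle += w * dt
    return angle


rng = random.Random(0)
dt, n = 0.01, 6000          # assumed: 100 Hz for 60 s
bias = 0.01                 # assumed: rad/s, constant over the run
readings = [gyro_measurement(0.0, bias, 0.005, rng) for _ in range(n)]

# The sensor is stationary, so the integrated angle is pure error.
drift = integrate_rate(readings, dt)
```

Because the noise averages out while the bias does not, the integrated error grows roughly linearly with time at the rate of the bias, which is the drift that motivates correcting gyro integration with other measurements.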
The accelerometer measures the linear acceleration of B relative to I. Again, with added noise and bias, the accelerometer measurements, $I_{a,t}$, are given by
$$I_{a,t} = f_t + \delta_{a,t} + e_{a,t} \tag{2}$$
where $f_t$ is the specific force at each time instance $t$, and $\delta_{a,t}$ and $e_{a,t}$ denote the bias and noise, respectively, with $e_{a,t} \sim \mathcal{N}(0, \Sigma_a)$.
Magnetometers provide information about the direction and intensity of the local magnetic field surrounding the sensor. The local magnetic field is composed of the Earth's magnetic field as well as any additional magnetic fields that arise due to the existence of magnetic materials. As magnetometer measurements are used primarily in attitude determination, we assume the magnitude of the local magnetic field vector, denoted by $m_l$, is equal to 1, i.e., $\|m_l\| = 1$. Assuming that the magnetometer only measures the local magnetic direction, its measurements, $I_{m,t}$, can be modelled as
$$I_{m,t} = R^b_t m_l + e_{m,t} \tag{3}$$
where $R^b_t$ denotes the rotation matrix from navigation to body frame and $e_{m,t} \sim \mathcal{N}(0, \Sigma_m)$ is the Gaussian noise.
By incorporating magnetometer measurements, we enable the system to determine its initial attitude. This is predicated on the principle that, given a set of two or more linearly independent vectors in two distinct reference frames, the rotation between said frames can be calculated. The underlying assumption here is that the accelerometer only measures the gravity vector, and the magnetometer only measures the local magnetic field. Hence we have two pairs of linearly independent vectors: the measurements $I_{a,t}$ and $I_{m,t}$ in the body frame, and the local gravity vector $g^n$ and the local magnetic vector $m_l$ in the navigation frame. Whilst this is seen as a major advantage, it does come with the drawback of requiring local magnetic field knowledge in order to transform Equation (3) into local coordinates.
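The principle that two non-parallel vectors observed in two frames determine the rotation between them can be illustrated with the classical TRIAD construction. TRIAD is named here purely as an illustration of the principle; it is not part of the proposed networks, and the vectors and function names below are assumed example values.

```python
import math


def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def triad_basis(v1, v2):
    """Orthonormal triad built from two non-parallel vectors."""
    t1 = unit(v1)
    t2 = unit(cross(v1, v2))
    return [t1, t2, cross(t1, t2)]


def triad(body_1, body_2, nav_1, nav_2):
    """Rotation matrix R (body -> navigation) such that R @ body_k = nav_k."""
    tb = triad_basis(body_1, body_2)
    tn = triad_basis(nav_1, nav_2)
    return [[sum(tn[k][i] * tb[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


# Assumed example: the body frame is yawed by 90 degrees relative to navigation.
nav_g, nav_m = [0.0, 0.0, 1.0], [1.0, 0.0, 0.5]     # gravity and magnetic directions
body_g, body_m = [0.0, 0.0, 1.0], [0.0, -1.0, 0.5]  # the same vectors seen in the body frame
R = triad(body_g, body_m, nav_g, nav_m)
```

With noise-free inputs the recovered matrix is the 90 degree yaw rotation. With real measurements, the first vector pair is matched exactly and the second only approximately, so the more trusted sensor is usually placed first.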

Attitude and Position Estimation
Traditional attitude estimation approaches rely on gyro integration as the baseline for deriving the attitude. However, it is well documented that gyroscope measurements lack the information to give absolute attitude determination. Therefore, applying numerical gyro integration results in an accumulated error that grows boundlessly. As such, specific force measurements from an accelerometer are often used in tandem with magnetic field measurements from a magnetometer to correct the estimate, as they provide information on the absolute angular position. These methods typically involve complex mathematical models and computations and require prior state knowledge and specific sensor parameters [35].
Analogous to attitude estimation, traditional methods for position estimation are also susceptible to unbounded errors due to the lack of information for an absolute position change. To overcome this limitation, we propose using self-attention and raw data in gradient descent optimisation to analyse and retain information related to accelerometer error and bias over long sequences. The relevant features are extracted by the attention mechanism to learn the relationship between acceleration, attitude and position. In the case of ARIOT, this method also has the benefit of recognising and compensating for attitude estimation errors in the initial attitude estimation network.
We use the accelerometer and gyroscope measurements as inputs to the dynamics for the purpose of estimating the position. The state vector includes the position and a quaternion parametrisation of the attitude (detailed in Section 5.2). We use the inertial measurements along with prior positions to estimate the attitude and position.
The dynamics of the position for an interval of time $\Delta t$ are given by the equation
$$p_{t+1} = p_t + \Delta t\, v_t + \frac{\Delta t^2}{2}\left(R^n_t\,(I_{a,t} - \delta_{a,t}) + g^n + e_{a,t}\right) \tag{4}$$
where $v_t$ is the velocity, time is denoted as $t$ and $R^n_t$ is the rotation matrix from the body to the navigation frame. We switched the sign on the noise term for convenience. The dynamics of the attitude are then given by Equation (5), where $f(\cdot)$ is a function of the accelerometer and magnetometer measurements to calculate the correction term for the quaternion, and the notation $\otimes$ denotes the quaternion multiplication given by
$$p \otimes q = \begin{pmatrix} p_0 q_0 - p_v \cdot q_v \\ p_0 q_v + q_0 p_v + p_v \times q_v \end{pmatrix} \tag{6}$$
with $p_0$, $q_0$ the scalar parts and $p_v$, $q_v$ the vector parts of the quaternions.
A significant advantage of traditional state estimation algorithms over neural networks is the retention of prior state estimates to update subsequent states. This allows the algorithm to use past information to correct or refine the current estimate, which improves accuracy and reliability. In contrast, neural networks are typically trained to make predictions based on input data without explicit retention of past estimates.
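The recursive structure of these classical updates, in which each step consumes the previous state, can be sketched as follows. The snippet is a minimal illustration under stated assumptions: quaternions are stored scalar-first, the acceleration passed to `dead_reckon_step` is already rotated into the navigation frame with gravity compensated, the velocity update is the standard companion to the position update, and the bias and sample rate are made-up example values.

```python
def quat_mul(p, q):
    """Hamilton product p ⊗ q for scalar-first quaternions (q0, q1, q2, q3)."""
    p0, p1, p2, p3 = p
    q0, q1, q2, q3 = q
    return (
        p0 * q0 - p1 * q1 - p2 * q2 - p3 * q3,
        p0 * q1 + p1 * q0 + p2 * q3 - p3 * q2,
        p0 * q2 - p1 * q3 + p2 * q0 + p3 * q1,
        p0 * q3 + p1 * q2 - p2 * q1 + p3 * q0,
    )


def dead_reckon_step(p, v, acc_nav, dt):
    """p_{t+1} = p_t + dt*v_t + 0.5*dt^2*a_t, with the matching velocity update."""
    p_next = [pi + dt * vi + 0.5 * dt * dt * ai for pi, vi, ai in zip(p, v, acc_nav)]
    v_next = [vi + dt * ai for vi, ai in zip(v, acc_nav)]
    return p_next, v_next


# Stationary sensor with an uncorrected 0.1 m/s^2 accelerometer bias on x (assumed).
p, v = [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]
for _ in range(1000):                       # 10 s at 100 Hz
    p, v = dead_reckon_step(p, v, [0.1, 0.0, 0.0], 0.01)
```

After ten seconds the position error along x is about 5 m and grows quadratically with time, which is why uncorrected double integration is unusable with low-cost sensors.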
We recognise that RNNs are somewhat the exception here as they can be designed to have internal state memory that can retain past state information; however, this does not actually give the network the desired recursive property. Instead, it acts more as a pseudo-recursion in which the hidden states store some information from previous steps and use it to influence future steps. This distinction is crucial in understanding the nature and limitations of RNNs in terms of recursive behaviour. Furthermore, networks that adopt this come with the aforementioned memory bottlenecks and vanishing gradient drawbacks.

Network Components
ARIOT and RIOT models are presented in Sections 5.2 and 5.3, respectively. Here we will introduce the common components found in both networks. The modular-specific adaptations will follow. In model design, we follow the original NLP Transformer proposed in [4], comprising encoder-decoder blocks and multi-head attention (MHA). An advantage of this intentionally straightforward system design is that it is efficient to implement and provides an out of the box solution. The input of the standard Transformer is a 1D sequence of token embeddings. To handle IMU data, the sequence embeddings are expanded to N-dimensions corresponding to feature inputs, each with a set of additional position embeddings, which represent the temporal information. The network produces a sequence of representations for each input time step, which is then used as feature vectors in downstream tasks.

Positional Encoding
In the Transformer model, as described in [4], the relative sequential position is not explicitly encoded. To incorporate relative sequential position information, we add sinusoidal position encoding functions over the inputs before the first layer. The values of the encoding are calculated using the trigonometric functions sin and cos. The argument of these functions is the sequence position (pos) divided by a scaling factor $10000^{2i/d_{model}}$, where $i$ is an index variable, $0 \leq i \leq \frac{d_{model}-1}{2}$, used to calculate different dimensions of the positional encoding vector. For each value of $i$, two dimensions of the positional encoding vector are calculated, one using the sine function ($PE_{(pos,2i)}$) and one using the cosine function ($PE_{(pos,2i+1)}$). The idea behind using both the sine and cosine functions is to capture both the magnitude and phase of the sequential position information, i.e., relative order and distance between elements in the sequence. This, in turn, allows the model to attend to different parts of the input sequence at different stages of processing. Additionally, this approach allows the model to generalise to different sequence lengths and attend to elements based on their relative positions rather than their absolute positions. This can be useful when the model needs to handle sequences of different lengths or in cases of mismatched sampling [36]. The positional encoding in both networks is defined by
$$PE_{(pos,2i)} = \sin\left(pos/10000^{2i/d_{model}}\right), \qquad PE_{(pos,2i+1)} = \cos\left(pos/10000^{2i/d_{model}}\right) \tag{7}$$
where pos denotes the position, $i$ the dimension and $d_{model}$ is the model dimensionality; in this work $d_{model}$ is 64 and 224 for the attitude and position networks (see Section 7), respectively.

Self-Attention
Self-attention sublayers in these networks employ h = 2 heads.
The self-attention mechanism works by first projecting the IMU measurements into a higher-dimensional space using a linear transformation parameterised by a set of weights $W^Q, W^K, W^V \in \mathbb{R}^{d_I \times d_{model}}$, which are learned during training. These parameter matrices are unique per layer and attention head. The transformed input data are then passed through a function (often called the "attention function"), which produces a set of attention weights for each input element, representing the importance of each input element in regressing to an attitude or position estimate.
Each attention head operates on an input sequence $I = (I_1, \ldots, I_n)$ of $n$ elements, where $I_i \in \mathbb{R}^{d_I}$. A new sequence of the same length is computed as $z = (z_1, \ldots, z_n)$, where $z_i \in \mathbb{R}^{d_{model}}$, and each output element is computed as a weighted sum of linearly transformed inputs per
$$z_i = \sum_{j=1}^{n} \alpha_{ij}\,(I_j W^V) \tag{8}$$
where each weight coefficient, $\alpha_{ij}$, is computed through a softmax function, which normalises the compatibility scores for each element to produce a probability distribution over the input sequence
$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{n} \exp(e_{ik})} \tag{9}$$
where $e_{ij}$ is computed using a compatibility function that compares two input elements,
$$e_{ij} = \frac{(I_i W^Q)(I_j W^K)^\top}{\sqrt{d_{model}}} \tag{10}$$
Scaled dot product is used as the compatibility function to enable efficient computation. Linear transformations of the inputs add sufficient expressive power. The self-attention layer is implemented using MHA.
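The computation performed by a single attention head, as described above, can be sketched in plain Python. This is a minimal illustration, not the implementation used in this work: the helper names, the tiny 3 x 2 input and the identity projection matrices are assumptions chosen so the result is easy to inspect.

```python
import math


def matmul(a, b):
    """Matrix product of two nested lists (rows of a times columns of b)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    m = max(row)                                  # subtract the max for stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(inputs, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention; returns (outputs, weights)."""
    q, k, v = matmul(inputs, w_q), matmul(inputs, w_k), matmul(inputs, w_v)
    scale = math.sqrt(len(w_q[0]))                # projected dimension
    k_t = [list(col) for col in zip(*k)]
    scores = [[e / scale for e in row] for row in matmul(q, k_t)]
    alpha = [softmax(row) for row in scores]      # one distribution per query
    return matmul(alpha, v), alpha


inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]     # three time steps, two features
identity = [[1.0, 0.0], [0.0, 1.0]]
z, alpha = self_attention(inputs, identity, identity, identity)
```

Each row of `alpha` sums to one, and the third time step attends most strongly to itself because its dot product with its own key is the largest.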
In the context of using IMU information to estimate an attitude quaternion or position, the self-attention mechanism is used to weigh the different sensor measurements differently, depending on how relevant they are to the position estimate. For example, the gyroscope measurements may be given a higher weight than the accelerometer measurements when estimating rotational motion, while the accelerometer measurements may be given a higher weight when estimating linear acceleration.

Encoder
The element-wise addition of the input vector and positional encoding vector is fed into two identical encoder layers. Each encoding layer is made up of two sub-layers: an MHA sub-layer and a fully connected feed-forward (FF) sub-layer. In the case of the ARIOT attitude module, we trialled a number of convolution layers to extract the spatial structure features of the data. However, the self-attention mechanism proved sufficient to capture the relevant information, and no benefit was seen.
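The first step of this paragraph, adding the sinusoidal positional encoding to the embedded input before the encoder, can be sketched as follows. The window length of four samples and the all-zero embedded input are placeholders chosen for illustration; only the model dimensionality of 64 is taken from the text.

```python
import math


def positional_encoding(seq_len, d_model):
    """Sinusoidal encoding table of shape [seq_len][d_model] (d_model even)."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(d_model // 2):
            angle = pos / (10000 ** (2 * i / d_model))
            pe[pos][2 * i] = math.sin(angle)       # even dimensions
            pe[pos][2 * i + 1] = math.cos(angle)   # odd dimensions
    return pe


def add_positional_encoding(embedded):
    """Element-wise sum of an embedded window and its positional encoding."""
    pe = positional_encoding(len(embedded), len(embedded[0]))
    return [[x + p for x, p in zip(row, pe_row)] for row, pe_row in zip(embedded, pe)]


window = [[0.0] * 64 for _ in range(4)]            # stand-in for embedded IMU samples
encoded = add_positional_encoding(window)
```

Because the stand-in input is zero, `encoded` equals the encoding itself: the first row alternates 0 and 1, and later rows rotate through the sinusoids at frequencies that decrease with the dimension index.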
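Our encoder follows the Query-Key-Value model, proposed in [4], where the scaled dot-product attention used is given by
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{D_k}}\right)V$$
where the input, $I$, is used to obtain the queries $Q = I(k)W^Q \in \mathbb{R}^{N \times D_k}$, keys $K = I(k)W^K \in \mathbb{R}^{N \times D_k}$ and values $V = I(k)W^V \in \mathbb{R}^{N \times D_v}$. In this work, we employ h = 2 parallel attention layers or heads. For each, we use $D_k = D_v = d_{model}/h$.
In addition to the attention sub-layers, each encoder/decoder layer consists of a fully connected FF network consisting of linear transformations and an activation function. We use a LeakyReLU [37] activation in the FF network as follows
$$\mathrm{LeakyReLU}(x) = \max(0, x) + \alpha \min(0, x)$$
where $\alpha$ is the negative slope. The point-wise FF network is a fully connected module
$$\mathrm{FFN}(H') = \mathrm{LeakyReLU}(H'W_1 + b_1)W_2 + b_2$$
where $H'$ is the output of the previous layer, and $W_1$, $W_2$, $b_1$ and $b_2$ are trainable weights and biases. Each sub-layer is wrapped in a residual connection followed by layer normalisation,
$$H' = \mathrm{LayerNorm}(\mathrm{SelfAttn}(X) + X), \qquad H = \mathrm{LayerNorm}(\mathrm{FFN}(H') + H')$$
where $X$ is the input to the layer, SelfAttn(·) denotes the self-attention module and LayerNorm(·) the layer normalisation operation. The resultant vector is then fed into the decoder.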
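The feed-forward sub-layer with its residual connection and layer normalisation can be sketched for a single time step as below. This is a simplified illustration under stated assumptions: the negative slope of 0.01 and the normalisation epsilon are common defaults rather than values reported here, the layer normalisation omits the learnable gain and offset, and the 2 x 2 identity weights are placeholders.

```python
import math


def leaky_relu(x, slope=0.01):
    """LeakyReLU with an assumed negative slope."""
    return x if x >= 0.0 else slope * x


def linear(vec, w, b):
    """Compute vec @ w + b for w given as [d_in][d_out]."""
    return [sum(x * wij for x, wij in zip(vec, col)) + bj
            for col, bj in zip(zip(*w), b)]


def ffn(h, w1, b1, w2, b2):
    hidden = [leaky_relu(x) for x in linear(h, w1, b1)]
    return linear(hidden, w2, b2)


def layer_norm(vec, eps=1e-5):
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]


def ffn_sublayer(h, w1, b1, w2, b2):
    """Residual connection around the FF network, then layer normalisation."""
    out = ffn(h, w1, b1, w2, b2)
    return layer_norm([a + b for a, b in zip(out, h)])


eye, zeros = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
h_in = [1.0, -1.0]
h_out = ffn_sublayer(h_in, eye, zeros, eye, zeros)
```

The negative input component survives the activation scaled by the slope instead of being zeroed, and the normalised output has zero mean across the feature dimension.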

Decoder
The decoder is composed of two identical layers. The decoder contains the sub-layers found in the encoder, with the addition of a third sub-layer that performs MHA over the output vector from the encoder. The MHA mechanism allows the model to attend to multiple parts of the input sequence in parallel, allowing it to capture a more detailed and nuanced representation of the input. This is achieved by dividing the attention mechanism into multiple "heads". Each head performs attention with a different linear projection. Additionally, the self-attention mechanism in the decoder stacks is masked to prevent positions from attending to subsequent positions, ensuring that predictions at k can depend only on the known outputs at or before k − 1. In our attitude network, the output layer maps the final layer into the estimated quaternion through a hyperbolic tangent. For the position network, no activation function is applied after the final linear layer.
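The masking of the decoder self-attention can be sketched as a lower-triangular mask applied before the softmax. This follows the usual construction in [4], where the mask is combined with a one-step shift of the decoder input; the function names and the uniform scores are illustrative assumptions.

```python
import math


def causal_mask(n):
    """mask[i][j] is True where position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def masked_softmax(scores, mask):
    """Row-wise softmax that assigns zero weight to masked-out positions."""
    weights = []
    for row, allowed in zip(scores, mask):
        m = max(s for s, ok in zip(row, allowed) if ok)
        exps = [math.exp(s - m) if ok else 0.0 for s, ok in zip(row, allowed)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


scores = [[0.0] * 3 for _ in range(3)]        # uniform compatibility scores
weights = masked_softmax(scores, causal_mask(3))
```

With uniform scores, the first position can only attend to itself, the second splits its weight over two positions, and the third over three, so no weight is ever placed on a later time step.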

Attitude Recursive Inertial Odometry Transformer
The Attitude Recursive Inertial Odometry Transformer is a hierarchical framework composed of two self-attention-based encoder-decoder networks, depicted in Figures 1 and 2. The foundation for the initial network is based on previous work [34] and functions to regress attitude estimation from 9D inertial measurements (from Equations (1)-(3)). This allows for the modular estimation of both attitude and position in a single framework, providing a robust solution for inertial odometry. The use of self-attention mechanisms within both modules allows for the modelling of long-term dependencies in the data, effectively handling the highly dynamic motion present over long sequences.
We follow by parameterising the attitude in quaternions. Quaternions, which represent attitudes in $\mathbb{R}^4$, have several advantages over representations in $\mathbb{R}^3$. They are free of discontinuities and singularities and are more computationally efficient and numerically stable. To be a valid representation of an attitude, a quaternion must be a unit quaternion. Unit quaternions double cover the group SO(3), meaning that both $q$ and $-q$ represent the same attitude. However, by requiring that $q_0 \geq 0$, we can ensure that there is a unique correspondence between quaternions and rotation matrices [38]. We propose to use the self-attention mechanism and raw 9D IMU data in gradient descent optimisation to analyse and retain information related to gyroscope error and bias over long sequences. This minimises the complexity by forgoing preintegration. Additionally, the solution is unconstrained by not forcing the network into predefined dynamic models. These features and the inclusion of magnetometer measurements also have the advantage of the network being an out-of-the-box solution where the local magnetic field is known.
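The two quaternion constraints mentioned above, unit norm and a non-negative scalar part, can be enforced on a raw four-component output as in the following sketch. Whether the networks in this work apply such a post-processing step is not stated, so this is only an assumed illustration; the scalar-first storage order and the function name are also assumptions.

```python
import math


def to_unit_quaternion(q):
    """Normalise a scalar-first 4-vector and pick the representative with q0 >= 0."""
    norm = math.sqrt(sum(c * c for c in q))
    if norm == 0.0:
        raise ValueError("a zero vector cannot be normalised to a unit quaternion")
    unit = [c / norm for c in q]
    # q and -q encode the same attitude, so flipping the sign loses nothing.
    return [-c for c in unit] if unit[0] < 0.0 else unit


q_a = to_unit_quaternion([-2.0, 0.0, 0.0, 0.0])
q_b = to_unit_quaternion([1.0, 1.0, 1.0, 1.0])
```

Here `q_a` becomes the identity rotation with a positive scalar part, and `q_b` is scaled so that every component equals 0.5.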
We propose a new loss function for quaternions, which we call the Quaternion Loss. To define the Quaternion Loss, we first introduce some quaternion background and notation. A quaternion is a 4-tuple $(x, y, z, w)$, where $x, y, z, w$ are real numbers. Quaternions can be represented in the form
$$q = w + xi + yj + zk \tag{15}$$
where $i, j, k$ are the imaginary units, satisfying $i^2 = j^2 = k^2 = ijk = -1$.
Quaternions can be used to represent rotations in three-dimensional space by setting $w$ to the cosine of half the rotation angle and $x, y, z$ to the sine of half the rotation angle, multiplied by the rotation axis [39]. Given a pair of quaternions $(q_1, q_2)$, we can measure the similarity between them using the inner product as
$$\langle q_1, q_2 \rangle = x_1 x_2 + y_1 y_2 + z_1 z_2 + w_1 w_2 \tag{16}$$
This product is related to the angle between the quaternions by the following,
$$\langle q_1, q_2 \rangle = |q_1|\,|q_2| \cos\theta \tag{17}$$
where $\theta$ is the angle between the quaternions and $|\cdot|$ denotes the L2 norm. We then define the Quaternion Loss function as
$$L(q_1, q_2) = \cos^{-1}\left(\mathrm{clamp}\left(\langle q_1, q_2 \rangle, -1 + \epsilon, 1 - \epsilon\right)\right) \tag{18}$$
where $\mathrm{clamp}(x, a, b) = \min(\max(x, a), b)$ and $\epsilon$ is a small positive constant used to avoid numerical instability when the inner product is outside the range [−1, 1].
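The mean angle across the batch is then returned as
$$\bar{L} = \frac{1}{N} \sum_{n=1}^{N} L(q_{1,n}, q_{2,n}) \tag{19}$$
where $N$ is the batch size and $(q_{1,n}, q_{2,n})$ is the $n$-th pair of quaternions in the batch.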
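A minimal sketch of the Quaternion Loss and its batch mean is given below. The value of the small constant and the function names are assumptions; a training implementation would use a differentiable tensor library rather than plain Python floats.

```python
import math


def clamp(x, lo, hi):
    return max(lo, min(hi, x))


def quaternion_loss(q1, q2, eps=1e-7):
    """Angle between two quaternions, with the inner product clamped into (-1, 1)."""
    inner = sum(a * b for a, b in zip(q1, q2))
    return math.acos(clamp(inner, -1.0 + eps, 1.0 - eps))


def batch_quaternion_loss(pred, true, eps=1e-7):
    """Mean angle over a batch of predicted and reference quaternions."""
    return sum(quaternion_loss(p, t, eps) for p, t in zip(pred, true)) / len(pred)


orthogonal = quaternion_loss((1.0, 0.0, 0.0, 0.0), (0.0, 1.0, 0.0, 0.0))
# A slightly non-unit prediction gives an inner product of 1.2, which would make
# an unclamped arccos fail with a domain error.
overshoot = quaternion_loss((1.2, 0.0, 0.0, 0.0), (1.0, 0.0, 0.0, 0.0))
```

Without the clamp, the second call would raise a domain error in `math.acos`. With it, the loss stays finite and close to zero, and keeping the argument strictly inside the interval also avoids the unbounded derivative of the inverse cosine at ±1 when gradients are needed.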

Recursive Inertial Odometry Transformer
The Recursive Inertial Odometry Transformer is a self-attention-based encoder-decoder network, depicted in Figure 3. It forgoes the attitude module and directly applies self-attention to raw 9D IMU data (from Equations (1)-(3)) in gradient descent optimisation for 3D displacement regression. The input to the network is a concatenation of the inertial measurements and true position priors in the first cycle of training; then, true position priors are replaced by estimated position priors. The input is passed through an embedding layer to generate embedded representations. The encoder then applies self-attention to compute a weighted sum of the embedded representations for each time step, which is used to compute a context vector. The context vector is then passed through a decoder to estimate the 3D position at each time step. The equations for the input sequence, the embedding function and the self-attention mechanism are provided in Section 5.1. The model is then trained to minimise the Mean Square Error (MSE) loss function in Equation (20) using the ADAM optimisation algorithm [40].
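$$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{NT} \sum_{n=1}^{N} \sum_{t=1}^{T} \left\| \hat{p}_{n,t} - p_{n,t} \right\|_2^2 \tag{20}$$
where $\|\cdot\|_2^2$ represents the squared Euclidean norm, $N$ is the batch size, $T$ the sequence length and $\hat{p}$ and $p$ are the estimated and true positions, respectively.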
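The recursion over position priors and the MSE loss can be sketched together as follows. This is a simplified illustration, not the training code used in this work: `toy_model` is a hypothetical stand-in for the network, the same prior is attached to every sample in a window for brevity, and the loss is assumed to average the squared error over both batch and time steps.

```python
def rollout(model, imu_windows, initial_position):
    """Feed each position estimate back in as the prior for the next window."""
    prior = list(initial_position)
    trajectory = []
    for window in imu_windows:
        features = [list(sample) + prior for sample in window]   # 9D IMU + 3D prior
        prior = list(model(features))
        trajectory.append(prior)
    return trajectory


def toy_model(features):
    """Hypothetical stand-in: shift the prior by the mean of the first three channels."""
    n = len(features)
    step = [sum(f[i] for f in features) / n for i in range(3)]
    prior = features[-1][-3:]
    return [p + s for p, s in zip(prior, step)]


def mse_position_loss(pred, true):
    """Mean squared Euclidean error over nested lists shaped [batch][time][3]."""
    total, count = 0.0, 0
    for seq_pred, seq_true in zip(pred, true):
        for p_hat, p in zip(seq_pred, seq_true):
            total += sum((a - b) ** 2 for a, b in zip(p_hat, p))
            count += 1
    return total / count


sample = [1.0] + [0.0] * 8                       # one made-up 9D IMU sample
windows = [[sample, sample] for _ in range(3)]   # three windows of two samples
trajectory = rollout(toy_model, windows, [0.0, 0.0, 0.0])
loss = mse_position_loss([trajectory], [[[1.0, 0.0, 0.0]] * 3])
```

During the first training cycle the prior passed to the network would be the ground-truth position rather than the previous estimate; the loop above corresponds to the later cycles and to inference, where estimates are fed back.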

Evaluation
Despite numerous proposed solutions in the literature attempting to solve inertial navigation, these approaches evaluate their algorithms on their own datasets with various preprocessing and alignment techniques, such as the Umeyama algorithm [41]. Under these conditions, it is difficult to compare directly against these algorithms. Additionally, to the best of our knowledge, no other approach leverages all available IMU information. However, the inclusion of magnetometer measurements comes with the drawback of our solutions being dependent on the local magnetic field, as the magnetometer readings are used to disambiguate the orientation of the IMU. This results in our network calibrations being region-specific and not generalisable to other datasets without local magnetic field knowledge. To this end, we build our own implementation of an RNN as a means of comparison.

Gated Recurrent Unit
Recent work on RNNs has shown that a Gated Recurrent Unit (GRU) surpasses the preferred LSTM in a number of scenarios [42][43][44]. Additionally, GRUs have fewer parameters, making them more computationally efficient, and have been shown to be more robust to noise and missing data [45].
We have added our own implementation of a two-layer GRU as a means of comparison. GRUs have been shown to be effective in inertial attitude estimation [46]; however, to the best of our knowledge, GRUs are untested in the inertial odometry domain. The network is formulated with the hidden state $h_t$ at time step $t$ as follows:
$$r_t = \sigma(W_{ir} x_t + b_{ir} + W_{hr} h_{t-1} + b_{hr}) \tag{21}$$
$$z_t = \sigma(W_{iz} x_t + b_{iz} + W_{hz} h_{t-1} + b_{hz}) \tag{22}$$
$$\tilde{h}_t = \tanh\big(W_{in} x_t + b_{in} + r_t \odot (W_{hn} h_{t-1} + b_{hn})\big) \tag{23}$$
$$h_t = (1 - z_t) \odot \tilde{h}_t + z_t \odot h_{t-1} \tag{24}$$
where $x_t$ is the input at time step $t$, $W_{i*}$ and $b_{i*}$ are the input-to-hidden weights and biases, $W_{h*}$ and $b_{h*}$ are the hidden-to-hidden weights and biases, $\sigma$ is the sigmoid function and $\odot$ denotes element-wise multiplication. Equations (21) and (22) compute the reset gate, $r_t$, and update gate, $z_t$, respectively. These gates control the amount of information that is passed through to the next time step. Equation (23) computes the candidate hidden state $\tilde{h}_t$, and Equation (24) updates the hidden state $h_t$ by combining the previous hidden state $h_{t-1}$ and the candidate hidden state $\tilde{h}_t$. We implement this network in the same manner as RIOT, depicted in Figure 4, where a stack of two GRU layers transforms the 9D IMU input at sampling instant $t$, concatenated with the 3D position vector at time $t - 1$, to an $N_n$-dimensional feature vector $h_t$, with $N_n = 200$ being the number of neurons per layer.
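To make the gate computations concrete, the following is a minimal single-step GRU sketch in plain Python. It assumes the convention documented for PyTorch's GRU layer, in which the reset gate scales the hidden-to-hidden term inside the candidate state; the helper names, the random weights and the tiny hidden size of 4 (instead of the 200 neurons used in this work) are illustrative assumptions only.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def affine(W, b, v):
    """Return W v + b for a matrix W (list of rows), bias b and vector v."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def gru_step(x, h_prev, p):
    """One GRU update; p maps names such as 'W_ir' and 'b_hr' to weights."""
    def pre(gate):
        # Input-to-hidden and hidden-to-hidden pre-activations for one gate.
        return (affine(p["W_i" + gate], p["b_i" + gate], x),
                affine(p["W_h" + gate], p["b_h" + gate], h_prev))

    (ir, hr), (iz, hz), (i_n, hn) = pre("r"), pre("z"), pre("n")
    r = [sigmoid(a + c) for a, c in zip(ir, hr)]                  # reset gate
    z = [sigmoid(a + c) for a, c in zip(iz, hz)]                  # update gate
    cand = [math.tanh(a + g * c) for a, g, c in zip(i_n, r, hn)]  # candidate state
    # Blend the candidate state with the previous hidden state.
    return [(1 - u) * n + u * h for u, n, h in zip(z, cand, h_prev)]

def init_params(n_in, n_hidden, scale=0.1, seed=0):
    """Random illustrative weights; a trained network would supply real ones."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-scale, scale) for _ in range(cols)]
                for _ in range(rows)]

    p = {}
    for g in "rzn":
        p["W_i" + g] = mat(n_hidden, n_in)
        p["W_h" + g] = mat(n_hidden, n_hidden)
        p["b_i" + g] = [0.0] * n_hidden
        p["b_h" + g] = [0.0] * n_hidden
    return p

# 12-D input: a 9-D IMU sample concatenated with the previous 3-D position.
x_t = [0.0] * 9 + [1.0, 2.0, 0.0]
h_t = gru_step(x_t, [0.0] * 4, init_params(n_in=12, n_hidden=4))
```

With all weights set to zero, both gates evaluate to 0.5 and the candidate state to 0, so the step simply halves the previous hidden state, which is a convenient sanity check for the update rule.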

Training and Dataset
This approach was trained and tested on publicly available smartphone data published by Chen et al. [47]. The dataset contains 158 sequences, covering more than 42 km in total distance, and incorporates a variety of attachments, activities and users to best reflect the broad use cases seen in real life. The data were captured by five different users with four different types of off-the-shelf consumer smartphones. The IMU data were collected and synchronised at a frequency of 100 Hz, which is generally accepted in various applications and research [48][49][50]. A high-precision optical motion capture system (Vicon) was used to capture full pose ground truth with 0.01 m location and 0.1 degree attitude accuracy [51]. The dataset was randomly divided into training, validation and test sets, following [52]. A single sequence was left out for each of the variables as a means of complete, unseen comparison with other techniques. To avoid overfitting and to improve the computing efficiency, we used a sliding window of 100 measurements with a stride of 50 to feed into the encoder and used random search to tune the hyperparameters. This gave us 63,614 training samples, 18,175 validation samples and 9089 test samples. The implementation of all adaptations was carried out with PyTorch. The attitude network converges after 300 epochs. The position network converges after 120 and 30 epochs, using true and recursive inputs, respectively. A learning rate of 0.001, an ADAM optimiser and a dropout of 0.2 were used across each implementation. The training was conducted in parallel on 4× Nvidia V100 GPUs.
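The windowing step can be illustrated with a short sketch. It assumes that "100 measurements every 50" means windows of 100 samples taken at a stride of 50 samples (a 50% overlap), and that any trailing samples that do not fill a complete window are discarded; the function name is hypothetical.

```python
def sliding_windows(samples, window=100, stride=50):
    """Split a sequence of samples into overlapping fixed-length windows."""
    last_start = len(samples) - window
    return [samples[s:s + window] for s in range(0, last_start + 1, stride)]

# 250 samples (2.5 s at 100 Hz) give windows starting at 0, 50, 100 and 150.
windows = sliding_windows(list(range(250)))
```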

Inference
The inference procedure for each model closely resembles the second training cycle, as illustrated in Figures 1, 3b and 4 for ARIOT, RIOT and GRU, respectively. The initial window of each sequence is pre-padded with zeroes, followed by a given initial position. The position window is then iteratively updated by processing the subsequent windows of data. This inference approach is designed to reflect both recurrent architectures and recursive mathematical models whilst leveraging the benefits of self-attention. This process is visualised in Figure 5.
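The recursive inference loop can be sketched as follows. This is a simplified illustration under several assumptions that are not confirmed by the text: windows are processed without overlap, the model returns one 3D position per time step, and the whole estimated window becomes the prior for the next window. `toy_model` is a hypothetical stand-in for a trained network.

```python
def recursive_inference(imu_windows, initial_position, model):
    """Run a windowed model recursively, feeding estimates back as priors."""
    window = len(imu_windows[0])
    # First prior window: zero padding followed by the given initial position.
    priors = [[0.0, 0.0, 0.0] for _ in range(window - 1)]
    priors.append(list(initial_position))
    trajectory = []
    for imu in imu_windows:
        estimates = model(imu, priors)  # one 3D position per time step
        trajectory.extend(estimates)
        priors = estimates              # estimated priors feed the next window
    return trajectory

def toy_model(imu_window, priors):
    """Stand-in for a trained network: moves 0.01 m along x per sample."""
    x, y, z = priors[-1]
    return [[x + 0.01 * (k + 1), y, z] for k in range(len(imu_window))]

# Two windows of five 9-D IMU samples each, starting from position (1, 2, 3).
imu = [[[0.0] * 9 for _ in range(5)] for _ in range(2)]
trajectory = recursive_inference(imu, (1.0, 2.0, 3.0), toy_model)
```

The point of the sketch is the data flow: after the first window, no ground truth position enters the loop, so any estimation error is carried forward through the priors.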

Evaluation Metrics
In order to quantitatively assess the performance of each approach on each unobserved sequence of length K, the following three metrics were employed:
• Absolute Trajectory Error (ATE) (m): The ATE is commonly used to assess the performance of a guidance or navigation system and represents the global accuracy of the estimated position.
• Relative Trajectory Error (RTE) (m): The RTE is a measure of the difference between the estimated and true position at a given time relative to the distance between the two positions. It is often used to quantify the local position consistency over a predefined duration ∆t; ∆t = 1 s in this work.
• Cumulative Distribution Function (CDF): The CDF is the distribution function $F(x)$, used to characterise the distribution of a variable. In this context, it describes the probability that the error in the estimated position will be less than or equal to a certain value, $F(x) = P(e \le x) = \int_{-\infty}^{x} f(e)\,de$, where $f$ is the probability density function of the localisation error $e$.
ATE and RTE are used in deep inertial odometry papers [21,29], and CDF is a common metric in indoor localisation research [53].
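The three metrics can be sketched as below. The root-mean-square forms of ATE and RTE used here are the ones commonly found in the deep inertial odometry literature and are an assumption, since the exact expressions are not reproduced in this section; the function names are hypothetical, and `step` is the number of samples spanning ∆t (100 samples for 1 s at 100 Hz).

```python
import math

def _sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def ate(est, true):
    """Root-mean-square position error over a whole sequence (metres)."""
    return math.sqrt(sum(_sq_dist(p, q) for p, q in zip(est, true)) / len(est))

def rte(est, true, step=100):
    """Root-mean-square error of the displacement over `step` samples."""
    errs = []
    for k in range(len(est) - step):
        d_est = [b - a for a, b in zip(est[k], est[k + step])]
        d_true = [b - a for a, b in zip(true[k], true[k + step])]
        errs.append(_sq_dist(d_est, d_true))
    return math.sqrt(sum(errs) / len(errs))

def empirical_cdf(errors, x):
    """Fraction of localisation errors that are less than or equal to x."""
    return sum(e <= x for e in errors) / len(errors)

# A constant (0.3, 0.4) m offset gives an ATE of 0.5 m but an RTE of about 0,
# because the offset cancels when displacements are compared.
true_path = [[0.1 * k, 0.0] for k in range(5)]
est_path = [[0.1 * k + 0.3, 0.4] for k in range(5)]
```

The example highlights why both metrics are reported: ATE penalises a global offset or accumulated drift, whereas RTE only reflects how well local motion is reproduced.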

Discussion
This work presents three approaches and evaluates them on unseen sequences from different users, devices and activities. The ATE and RTE evaluation results are quantified in Table 1, with the best-performing approach for each sequence and metric highlighted in bold. In addition, a qualitative analysis was conducted on the models' outputs, which revealed a close correspondence between the predicted trajectories and the ground truth trajectories. This is depicted in Appendix A, which provides visualisations of the position estimates for each approach during the first and last minute of data.
Table 1. Two-dimensional position error metric comparison. A complete sequence was left out of the training data for each variable in the dataset. This was performed as a means of unseen comparison over full sequences, allowing for different user, activity and device evaluations as well as an overview of the generalisability of each approach. Note that each network is capable of producing a 3D position estimate; however, as the data was largely taken on a level plane where the discrepancy in the z-axis is far smaller than the x-y plane, the addition of the vertical dimension would skew the error metrics. The best-performing model over each sequence and for each metric has been made bold. RIOT performs best under most conditions. However, ARIOT tracks better during highly dynamic motion.
A closer look at Figures A1 and A2 in Appendix A gives a clear visualisation of the advantages of the self-attention-based models in maintaining smooth, life-like path trajectories that almost mirror the true path. In contrast, the GRU path estimates are noticeably noisy, consistently producing a far greater total path length than RIOT or ARIOT. Whilst all models are implemented recursively, the GRU's inability to attend to the entire sequence when updating the current position greatly affects the overall performance.
RIOT performed best overall with the lowest ATE and RTE values, except when the IMU was handheld. When the IMU is handheld, and is therefore subject to consistently dynamic motion, the attitude estimation module in ARIOT is beneficial as it can help to disambiguate the accelerometer measurements that are affected by both linear acceleration and gravity. This leads to a more accurate trajectory estimate. However, when the IMU is mostly stable or cyclic through the motion, the additional complexity of the attitude estimation module is redundant and actually hinders performance. We hypothesise that the model overly leans on attitude representation, which is only beneficial in highly dynamic scenarios.
When analysing the performance of our models, it is important to consider the characteristics of the data and the specific scenario. We theorise that the reason for the superior performance of RIOT is due to the simpler architecture, which is seemingly better suited for scenarios where the IMU is less dynamic. On the other hand, the additional complexity of ARIOT's attitude estimation module allows for improved handling of dynamic motion.
It is evident that the GRU performed considerably worse than both RIOT and ARIOT in all of the sequences. This is likely due to the fundamentally inferior design of the RNN model, leading to its inability to effectively process the complex motion present over long sequences. However, it is important to note that the GRU model still performed relatively well, which highlights the effectiveness of our learning process used in the development of the models.
This analysis is further evidenced in Figure 6, which depicts the mean CDF of the localisation error over the total set of test sequences. RIOT performs almost consistently, indicated by the steep gradient of the CDF in the lower error range, whereas for ARIOT and GRU, the errors are more spread out over a wider range of values.
Our models utilise multi-headed self-attention, which is achieved through multiple parallel attention mechanisms. By allowing the model to attend to different parts of the input sequence dynamically, self-attention can capture complex relationships and dependencies. Each self-attention mechanism calculates an attention matrix A of size T × T, where T is the sequence length, by utilising the softmax operation, as described in Equation (11). The attention scores determine the influence of the input time features on the higher-level output time features.
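As an illustration of how a single head produces a T × T attention matrix, the sketch below uses standard scaled dot-product attention with a row-wise softmax. It assumes that Equation (11) takes this standard form; the query, key and value lists and all function names are illustrative and not taken from the trained models.

```python
import math

def softmax(row):
    m = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_matrix(Q, K):
    """T x T scores A = softmax(Q K^T / sqrt(d)), with softmax over each row."""
    d = len(Q[0])
    scores = [[sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d)
               for k_row in K] for q_row in Q]
    return [softmax(row) for row in scores]

def attend(A, V):
    """Weighted sum of the value vectors for each output time step."""
    return [[sum(a * v[j] for a, v in zip(a_row, V)) for j in range(len(V[0]))]
            for a_row in A]

# Three time steps (T = 3) with 2-D queries and keys.
Q = K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
A = attention_matrix(Q, K)
```

Each row of `A` sums to one and describes how strongly one output time step draws on every input time step, which is the quantity shown in the attention heat matrices.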
The matrix visualisations in Figures 7a,c and 8a,c provide a glimpse into how the model weighs and combines multiple inputs to make a prediction. Each attention matrix, taken from one of two attention heads in the first self-attention layer of each encoder, is drawn as an adjacency matrix between input nodes and output nodes. The matrix can also be represented as a bipartite graph, as shown in Figures 7b,d and 8b,d. The edge weights represent the strength of the attention, and the opacity of the edges indicates the magnitude. The input time series is shown above the attention graph as a reference, and the attention scores are depicted as vertical lines corresponding to the values in A.
From the visual representations of the attention matrices, we can directly observe the distinctions between the different self-attention heads and encoders. The heads in the first encoder appear to be highly concentrated on the latter part of the sequence, whereas the heads in the second encoder concentrate on the beginning but have greater overall attention. From Figures 7 and 8, we observe that the model considers both short and longer-term temporal dependencies in the data when making predictions rather than just focusing on the prior time step, as seen in traditional methodologies. This is largely the reason for the accurate and stable position estimates, especially in situations where the motion is complex or noisy.
Figure 7. Visualisations of the self-attention scores from the first encoder in RIOT on an arbitrary sequence of input data. (a,b) (blue) depict the attention scores from the first head of the first encoder as a matrix and bipartite graph, respectively. (c,d) (red) depict the attention scores from the second head of the first encoder as a matrix and bipartite graph, respectively. (Left): The heat matrix displays the attention scores assigned to each input element in a sequence. The darker the colour, the higher the attention weight given to that element, indicating that it has a greater impact on the final output. (Right): The graph represents each input element as a node on one side of the graph, while the attention scores assigned by the model are represented as nodes on the other side. Edges connecting the nodes represent the attention weights or the degree to which the model is considering each input element. The thickness of the edges represents the magnitude of the attention weights, with thicker edges indicating higher attention scores.
Figure 8. Visualisations of the self-attention scores from the second encoder in RIOT on an arbitrary sequence of input data. (a,b) (blue) depict the attention scores from the first head of the second encoder as a matrix and bipartite graph, respectively. (c,d) (red) depict the attention scores from the second head of the second encoder as a matrix and bipartite graph, respectively. (Left): The heat matrix displays the attention scores assigned to each input element in a sequence. The darker the colour, the higher the attention weight given to that element, indicating that it has a greater impact on the final output. (Right): The graph represents each input element as a node on one side of the graph, while the attention scores assigned by the model are represented as nodes on the other side. Edges connecting the nodes represent the attention weights or the degree to which the model is considering each input element. The thickness of the edges represents the magnitude of the attention weights, with thicker edges indicating higher attention scores.
In summary, we evaluated the performance of three novel recursive deep inertial odometry frameworks. Our results show that self-attention-based networks have superior performance over a GRU-based RNN, with RIOT performing best overall, with a sequence length weighted mean ATE of 0.0865 m and RTE of 0.0091 m. The mean ATE and RTE were 0.1134 m and 0.0095 m for ARIOT and 0.4594 m and 0.0130 m for the GRU, respectively. Our results also revealed that a simpler architecture could generally yield better results; however, having an attitude module dramatically improves performance in specific scenarios where the IMU experiences highly dynamic motion, highlighting the importance of evaluating solutions on diverse datasets.
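For clarity, a sequence length weighted mean can be computed as in the sketch below. The weighting by number of samples per sequence is an assumption about how the summary figures were aggregated, and the numbers in the example are made up purely for illustration.

```python
def length_weighted_mean(errors, lengths):
    """Mean of per-sequence errors, weighted by each sequence's length."""
    return sum(e * n for e, n in zip(errors, lengths)) / sum(lengths)

# Hypothetical values: a 100-sample sequence with 0.1 m error and a
# 300-sample sequence with 0.3 m error give a weighted mean of 0.25 m.
weighted = length_weighted_mean([0.1, 0.3], [100, 300])
```

Weighting by length keeps short sequences from having the same influence on the summary figure as long ones, where drift has more time to accumulate.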

RIOT Ablations
Model Dimensionality: We trained the model with differing dimensionality vector sizes from 56 to 896. Increasing the dimensionality of the model makes a small but measurable improvement up to 224. This finding aligns with the general principle in deep learning that, beyond a certain level of model complexity, further increases lead to overfitting and degraded performance on new data.
Encoder-Decoder Blocks: We trained three different models, with two, four and six blocks. Increasing the number of encoder-decoder blocks did not result in a decrease in the model's perplexity.
Attention Heads: We trained the model with two, three and four attention heads in each encoder-decoder block. There were small, almost immeasurable improvements in the networks' performance on the test set. However, when applied to the unseen sequences, the models with three or four heads performed considerably worse. We believe increasing the number of heads past two forces the network into overfitting.
Window Size: We trained the model with differing window sizes from 50 to 500 (0.5 to 5 s). As the window size incrementally increased over 100, we saw better test set results but worse results on the unseen sequences. Increasing the window size of the input data rapidly increases the model complexity, as every additional time step adds 12 input features and the T × T attention matrix grows quadratically with the window length. The added complexity forces the network into overfitting.

Conclusions
This work proposes novel self-attention-based recursive neural network models, RIOT and ARIOT, for pose invariant inertial odometry. The proposed approaches utilise a sliding window as a hyperparameter to mitigate noise spikes and missing measurements. True position priors are included in the training process in conjunction with raw inertial measurements and ground truth displacement data, allowing for recursion and the ability to learn both motion characteristics and systemic error bias and drift. The evaluation results demonstrate that RIOT outperforms ARIOT and a GRU in terms of position error metrics, with a sequence length weighted mean Absolute Trajectory Error (ATE) of 0.0865 m and sequence length weighted mean Relative Trajectory Error (RTE) of 0.0091 m. These results are significantly better than the existing deep-learning inertial odometry methods in the literature, highlighting the effectiveness of the proposed approaches and learning methodology. Future work will consider the scalability of these approaches and make them local magnetic field agnostic.

Data Availability Statement:
The publicly available dataset analysed in this work can be found here: [47].

Acknowledgments:
The training was conducted in parallel on 4× Nvidia V100 GPUs, made possible with the assistance of resources and services from the National Computational Infrastructure (NCI), which is supported by the Australian Government.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Position Estimate Visualisations from the First and Last Minute of Each Network
Figure A1. Position estimates from the first and last minute of each network as well as the true path for the first half of the unseen sequences, detailed in Section 6.5. The sequence run time is given in addition to each network's estimated total path length. The initial position is given to each network and is emphasised with individual markers. The final network estimate for each approach is also emphasised to easily visualise the positional drift over each sequence time period.
Figure A2. Position estimates from the first and last minute of each network as well as the true path for the second half of the unseen sequences, detailed in Section 6.5. The sequence run time is given in addition to each network's estimated total path length. The initial position is given to each network and is emphasised with individual markers. The final network estimate for each approach is also emphasised to easily visualise the positional drift over each sequence time period.