1. Introduction
The study of smartphone-based navigation has emerged as a prominent research focus, driven by the escalating demand for navigation services and the pervasive adoption of smartphones. In outdoor environments, smartphones leverage signals from Global Navigation Satellite Systems (GNSSs) to enable precise navigation. However, in urban and indoor environments, obstructions often block satellite signals or induce multipath interference, compromising the accuracy of GNSS-based navigation [1]. To enhance navigation precision, integrating supplementary data streams such as WiFi, Bluetooth beacons, visual inputs, and inertial data has become imperative [2]. Existing indoor navigation techniques, however, are typically constrained by environmental dependencies. For instance, WiFi and Bluetooth beacon-based methods are effective only in environments equipped with the requisite infrastructure [3], while visual navigation systems necessitate adequate lighting conditions for clear image capture, limiting their operational applicability [4]. These constraints pose substantial challenges to the broad deployment of such localization methodologies. In contrast, navigation systems utilizing inertial data circumvent the reliance on images or wireless signals, thereby overcoming limitations related to illumination and beacon infrastructure. Consequently, inertial sensor integration has emerged as a critical enabler for pedestrian navigation, bolstering localization capabilities irrespective of environmental conditions [5].
Traditional inertial navigation approaches primarily rely on the Pedestrian Dead Reckoning (PDR) algorithm [6], which utilizes data from an Inertial Measurement Unit (IMU) to estimate displacement and orientation. PDR algorithms typically employ double-integration and zero-velocity update (ZUPT) techniques [7]. However, computing velocity and position by integration accumulates navigation errors over time, necessitating high-precision sensors to mitigate inaccuracies [8]. Moreover, ZUPT requires the IMU to be affixed to the user’s foot, rendering it impractical for smartphones equipped with an integrated IMU [9].
In addition to PDR, traditional inertial navigation also includes Strapdown Inertial Navigation Systems (SINS), which rely on integration algorithms and filtering techniques such as the extended Kalman filter (EKF) [10] for orientation estimation. Derivative algorithms such as ZARU [11] and MARU [12,13] have been developed to reduce error drift [14,15], but they all require strict sensor calibration and foot-mounted IMUs, leading to poor practicality [16]. To address these limitations, early data-driven inertial navigation methods have been proposed, such as Hidden Markov Models (HMMs) [17] and multi-layer perceptrons (MLPs) [18] for motion parameter estimation, and RIDI [19], which combines sensor fusion with data-driven strategies for smartphone-based navigation. However, these early data-driven methods either focus on single-joint movements with limited applicability or still suffer from cumulative errors due to their reliance on integration algorithms.
Recent advancements in inertial navigation have increasingly incorporated neural networks, although the domain of neural inertial localization (NIL) remains relatively nascent [20]. Nevertheless, neural inertial navigation methodologies have demonstrated promising developments. For example, IONet [21] employs a two-layer Convolutional Neural Network (CNN) in conjunction with a two-layer bidirectional Long Short-Term Memory (LSTM) [22] unit to process raw data from a nine-axis IMU, enabling the estimation of displacement and direction. Building on this foundation, the authors of RoNIN [23] developed neural network architectures based on ResNet, LSTM, and Temporal Convolutional Network (TCN) frameworks. These models process data in normalized coordinates to generate velocity vector outputs, relying solely on accelerometer and gyroscope data while excluding inclination and azimuth angles. This exclusion effectively mitigates directional drift errors. Furthermore, these models obviate the need for initial coordinate system inputs during trajectory estimation. Despite their theoretical potential, these methods face limitations, including insufficient model testing and inaccuracies in displacement direction estimation, which collectively result in suboptimal navigation performance [24,25,26,27]. While neural network technologies have driven significant advancements in other scientific disciplines [28], the field of inertial indoor navigation remains deficient in rigorous experimental validation [29,30]. Although multiple projects have studied deep inertial navigation, several aspects remain unexplored. Research on deep networks has mostly been limited to designing and testing model architectures in pursuit of an optimal model. However, inertial measurements are inaccurate owing to the variability of sensor noise, device placement, and environmental conditions, so an architecture-centered approach alone limits how well deep networks generalize across conditions and users. Research should therefore also address other aspects of deep networks, including innovative training methods, data augmentation strategies, and optimization techniques.
To address the aforementioned limitations, we propose EDIN (Enhanced Deep Inertial Navigation), a novel inertial navigation method designed to enhance the accuracy of smartphone-based indoor pedestrian localization. All subsequent research and experiments in this study are centered on this framework. The main contributions of this work can be summarized as follows:
A novel neural network architecture is proposed using ResNeXt [31] and the Convolutional Block Attention Module (CBAM) to enhance feature representation, which improves the navigation accuracy.
An enhanced training pipeline for deep inertial navigation models is proposed through refined data augmentation and a refined loss function, which improves model robustness.
Extensive evaluations are conducted on several publicly available inertial navigation datasets, demonstrating superior accuracy compared to existing methods.
The subsequent sections of this study are organized as follows:
Section 2 describes the materials and methods, including the architecture and training method of the EDIN system we have developed.
Section 3 analyzes navigation performance and presents corresponding results.
Section 4 provides a detailed discussion and the conclusions based on our findings.
2. Materials and Methods
In this section, we present our proposed indoor inertial navigation system, EDIN. Initially, we provide a succinct overview of the system architecture. Subsequently, we detail the core modules comprising the architecture, notably the neural network model and the model training method.
2.1. Inertial Navigation Problem
Inertial navigation involves estimating the user’s positional information by interpreting the data obtained from an IMU. IMU data consist of a sequence of measurements from accelerometers, gyroscopes, and, optionally, magnetometers, with each sensor providing data along three degrees of freedom (DOF).
Define an IMU sequence as $\{(a_i, \omega_i)\}_{i=1}^{n}$, where $a_i \in \mathbb{R}^3$ and $\omega_i \in \mathbb{R}^3$ represent the local measurements of the accelerometer and gyroscope, respectively. Orientation information is typically required before using IMU data in a deep network to align it with a common reference system. Define an orientation sequence as $\{q_i\}_{i=1}^{n}$, where $q_i$ represents the rotation in quaternion format. The inertial navigation problem can then be addressed by

$$\{\hat{p}_i\}_{i=1}^{n} = f\left(\{(R(q_i)\,a_i,\; R(q_i)\,\omega_i)\}_{i=1}^{n}\right)$$

where $R(q)$ represents the rotation matrix dependent on $q$, and $f$ is determined by the navigation method.
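Subsequently, the user’s location is estimated based on the corresponding location sequence $\{\hat{p}_i\}_{i=1}^{n}$ derived from $\{(a_i, \omega_i)\}_{i=1}^{n}$. Assuming the ground truth location of $\hat{p}_i$ is $p_i$, the primary metrics used to calculate the location estimation error are the absolute trajectory error (ATE) and the relative trajectory error (RTE), as given by

$$\mathrm{ATE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \lVert p_i - \hat{p}_i \rVert^2} \quad (3)$$

$$\mathrm{RTE} = \sqrt{\frac{1}{n-m}\sum_{i=1}^{n-m} \lVert (p_{i+m} - p_i) - (\hat{p}_{i+m} - \hat{p}_i) \rVert^2} \quad (4)$$

where $m$ is the number of samples in the evaluation time interval.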
The ATE quantifies the spatial proximity between the trajectory coordinates generated by the navigation algorithm and the ground truth. Meanwhile, the RTE gauges the local spatial proximity between the trajectory derived from the localization algorithm and the ground truth within a predetermined time interval, commonly set at 1 min.
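As a hedged illustration of the two metrics, the following Python sketch computes ATE and RTE for 2D trajectories given as lists of (x, y) tuples. The function names, the displacement-difference form of RTE, and the `window` argument (the number of samples spanning the evaluation interval, which depends on the sampling rate) are assumptions made for this example and are not the evaluation code used in this study.

```python
import math


def ate(est, gt):
    """Root-mean-square position error over the whole trajectory."""
    assert len(est) == len(gt) and len(est) > 0
    sq = [(ex - gx) ** 2 + (ey - gy) ** 2
          for (ex, ey), (gx, gy) in zip(est, gt)]
    return math.sqrt(sum(sq) / len(sq))


def rte(est, gt, window):
    """Root-mean-square error of displacements taken over `window` samples.

    If the trajectory is shorter than one window, the whole trajectory
    is used as a single interval (an assumption of this sketch).
    """
    assert len(est) == len(gt) and len(est) >= 2
    w = min(window, len(est) - 1)
    sq = []
    for i in range(len(est) - w):
        # Displacement over the interval, estimated vs. ground truth.
        dex, dey = est[i + w][0] - est[i][0], est[i + w][1] - est[i][1]
        dgx, dgy = gt[i + w][0] - gt[i][0], gt[i + w][1] - gt[i][1]
        sq.append((dex - dgx) ** 2 + (dey - dgy) ** 2)
    return math.sqrt(sum(sq) / len(sq))


# A trajectory shifted by a constant offset of (3, 4):
gt = [(i, 0) for i in range(20)]
est = [(x + 3, y + 4) for x, y in gt]
print(ate(est, gt), rte(est, gt, window=5))  # 5.0 0.0
```

The toy example shows the difference between the two metrics: a constant offset is fully counted by ATE but cancels in RTE, which only measures drift accumulated within each interval.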
2.2. Model Network
2.2.1. Model Overview
The proposed EDIN model processes IMU data through a sophisticated deep learning architecture, which is shown in Figure 1. Raw IMU time-series data first undergoes frame randomization and coordinate frame normalization in the input module. The core feature extraction leverages a ResNeXt backbone comprising four stages, each containing multiple Bottleneck Residual Blocks (BRBs) enhanced with CBAM [32]. These CBAM-equipped BRBs sequentially apply channel and spatial attention to refine feature representations. The processed features are then passed through an output module consisting of fully connected (FC) layers with ReLU activations and dropout for regularization, ultimately predicting the location change. This design effectively combines the multi-branch efficiency of ResNeXt with the adaptive feature weighting of CBAM to improve inertial navigation accuracy.
The mathematical formulation of the EDIN network is as follows,

$$\Delta P = f_{\theta}(\omega, a, h)$$

where $f_{\theta}$ is the function defined by the neural network. $\omega$ and $a$ represent the raw values of gyroscope and accelerometer data, respectively, which are combined as the input to the neural network. $h$ is the hidden state of the GRU unit at the most recent timestamp.
The EDIN model functions by processing continuous IMU data and calculating the corresponding coordinate changes within the geodetic coordinate system. By employing sliding windows, the model aggregates time-series IMU data, predominantly consisting of acceleration and gyroscope data. Each window encompasses 200 frames of IMU data. The navigation trajectory is derived through a recursive process, in which the coordinates of each navigation point are obtained by sequentially accumulating the predicted coordinate changes onto the initial coordinates.
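The windowing and recursive accumulation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the `stride` parameter, the function names, and `toy_predictor` (a stand-in that returns a fixed coordinate change instead of calling a trained network) are hypothetical and not part of EDIN.

```python
def sliding_windows(samples, window=200, stride=200):
    """Split a sequence of IMU frames into fixed-length windows.

    `window=200` follows the text; `stride` is a hypothetical parameter.
    Trailing frames that do not fill a whole window are dropped.
    """
    return [samples[s:s + window]
            for s in range(0, len(samples) - window + 1, stride)]


def accumulate_trajectory(p0, deltas):
    """Recursively add per-window coordinate changes to the initial coordinate."""
    trajectory = [p0]
    x, y = p0
    for dx, dy in deltas:
        x, y = x + dx, y + dy
        trajectory.append((x, y))
    return trajectory


def toy_predictor(window):
    """Stand-in for the trained network: a fixed (dx, dy) per window."""
    return (0.5, 0.0)


frames = [(0.0,) * 6 for _ in range(1000)]   # 1000 dummy 6-axis IMU frames
windows = sliding_windows(frames)            # 5 non-overlapping windows
deltas = [toy_predictor(w) for w in windows]
print(accumulate_trajectory((0.0, 0.0), deltas)[-1])  # (2.5, 0.0)
```

Because each position depends on all previous coordinate changes, any per-window error is carried forward, which is why the per-window prediction accuracy matters for long trajectories.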
2.2.2. Coordinate Systems for Neural Inertial Navigation
When deriving displacement distances from the network using IMU data, the initial velocity within the sensor coordinate system emerges as a latent variable that is not directly observable in the original IMU data [21]. Consequently, addressing this issue necessitates the inclusion of initial velocity as an input variable. Given the user’s coordinate transformation, which can be described as a polar vector denoting displacement and directional changes, the equation for calculating displacement from IMU data is articulated as follows:

$$(\Delta l, \Delta \psi) = f_{\theta}(v, a, \omega) \quad (6)$$

where $\Delta l$ represents the displacement distance, $\Delta \psi$ represents the direction change, and $v$ represents the velocity vector at the beginning of navigation.
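If the above output data are used directly in polar coordinates, the loss function used for training neural network models can be obtained by

$$\mathcal{L} = \sum \left( \lVert \Delta \hat{l} - \Delta l \rVert_2^2 + \kappa \lVert \Delta \hat{\psi} - \Delta \psi \rVert_2^2 \right) \quad (7)$$

where $\Delta \hat{l}$ is the estimated displacement distance, $\Delta l$ is the ground truth displacement distance, $\Delta \hat{\psi}$ is the estimated heading angle change, and $\Delta \psi$ is the ground truth heading angle change. $\kappa$ is a factor determining the relative weights of the $\Delta l$ and $\Delta \psi$ terms.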
However, heading regression in practical applications suffers from an inherent ambiguity when the user is stationary. Typically, the ground truth heading angle of the user is derived computationally from the collected ground truth trajectory. Owing to the measurement errors inherent in the ground truth trajectory, the calculated heading angle may be considerably inaccurate, particularly when the user is stationary or moving at low speeds. In addition, determining a suitable value for the weighting factor in (7) poses a considerable challenge. The absence of a universally applicable method for choosing this ratio necessitates an extensive array of experiments for validation, entailing a substantial investment of time and effort.
To mitigate this challenge, we express the results generated by the neural network in the Cartesian coordinate system. To obtain the localization trajectory, the coordinates can be computed through Cartesian projection from the initial coordinate data by

$$P_n = P_0 + \sum_{i=1}^{n} \Delta P_i$$

where $P_n$ represents the current location, $n$ represents the number of positioning times, $P_0$ represents the initial coordinate, and $\Delta P_i$ represents the coordinate change, which is expressed as follows:

$$\Delta P_i = (\Delta x_i, \Delta y_i)$$

Because the displacement and directional information can be obtained from

$$\Delta l_i = \sqrt{\Delta x_i^2 + \Delta y_i^2}, \qquad \psi_i = \operatorname{atan2}(\Delta y_i, \Delta x_i)$$

(6) is equivalent to

$$(\Delta x, \Delta y) = f_{\theta}(v, a, \omega)$$

According to the above equations, the two schemes are theoretically equivalent.
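The equivalence between the polar and Cartesian schemes can be checked numerically. The sketch below converts a sequence of (displacement, heading change) pairs into Cartesian coordinate changes and back. The function names and the convention that the heading is updated before each displacement is applied are assumptions of this example. The stationary branch also illustrates the ambiguity discussed earlier: with zero displacement the heading cannot be recovered from the coordinates.

```python
import math


def wrap_angle(a):
    """Wrap an angle to the interval [-pi, pi)."""
    return (a + math.pi) % (2.0 * math.pi) - math.pi


def polar_to_cartesian(steps, heading0=0.0):
    """steps: list of (dl, dpsi). Returns a list of (dx, dy)."""
    heading, out = heading0, []
    for dl, dpsi in steps:
        heading += dpsi  # assumed convention: turn first, then move
        out.append((dl * math.cos(heading), dl * math.sin(heading)))
    return out


def cartesian_to_polar(deltas, heading0=0.0):
    """deltas: list of (dx, dy). Returns a list of (dl, dpsi)."""
    heading, out = heading0, []
    for dx, dy in deltas:
        dl = math.hypot(dx, dy)
        if dl > 0.0:
            new_heading = math.atan2(dy, dx)
            dpsi = wrap_angle(new_heading - heading)
            heading = new_heading
        else:
            dpsi = 0.0  # heading is undefined when the user is stationary
        out.append((dl, dpsi))
    return out


steps = [(1.0, 0.0), (2.0, math.pi / 2), (0.5, -math.pi / 4)]
recovered = cartesian_to_polar(polar_to_cartesian(steps))
print(all(abs(a - b) < 1e-9 and abs(c - d) < 1e-9
          for (a, c), (b, d) in zip(steps, recovered)))  # True
```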
2.2.3. ResNeXt Module
The core feature extraction backbone of the proposed model is constructed upon the ResNeXt architecture, which represents a significant advancement over the conventional ResNet framework. The primary innovation of ResNeXt lies in the introduction of cardinality—the number of parallel transformation paths within a residual block—as an additional dimension of model design, complementing depth and width. In the proposed implementation, each ResNeXt block adopts a split–transform–merge strategy, implemented using a Bottleneck Residual Block (BRB). Specifically, the input feature map is divided into multiple low-dimensional embeddings across a predefined number of groups determined by the cardinality parameter. Each group is processed independently through a dedicated set of convolutional filters, typically implemented using grouped convolutions. The resulting feature maps from all groups are then aggregated through summation and subsequently fused with the shortcut connection to form the final residual output. This architecture enhances the network’s representational capacity and facilitates multi-branch feature learning while maintaining computational efficiency. Consequently, the ResNeXt-based backbone achieves a more favorable trade-off between accuracy and complexity compared to conventional approaches that solely increase network depth or width.
From the perspective of theoretical suitability for IMU time-series feature extraction, the grouped convolution and cardinality design in ResNeXt are highly consistent with the characteristics of 6-axis IMU signals. IMU data include 3-axis acceleration and 3-axis gyroscope, which are multi-channel time series with independent physical attributes but strong temporal correlation. Grouped convolution divides the feature extraction process into multiple parallel branches, which can implicitly decouple the learning of motion-related features and noise-related features, as well as distinguish the heterogeneous characteristics between acceleration and gyroscope. Compared to standard convolution in ResNet, cardinality as an independent design dimension avoids the cross-channel interference caused by single-branch convolution, and is more suitable for capturing multi-modal, low-dimensional, and high-sampling time-series features of IMU. Such a structure can improve feature diversity without significantly increasing computation, which is beneficial for long-sequence inertial trajectory estimation.
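The claim that grouped convolution adds feature diversity without significantly increasing computation can be made concrete with a parameter count. The sketch below uses the standard weight-count formula for a one-dimensional convolution with groups; the channel widths, kernel size, and cardinality of 32 are illustrative values chosen for this example and are not the EDIN configuration.

```python
def conv1d_params(c_in, c_out, kernel, groups=1, bias=False):
    """Number of learnable parameters in a (grouped) 1D convolution.

    Each output channel only sees c_in / groups input channels, so the
    weight count shrinks by a factor of `groups`.
    """
    assert c_in % groups == 0 and c_out % groups == 0
    weights = (c_in // groups) * c_out * kernel
    return weights + (c_out if bias else 0)


# Illustrative comparison: 128 -> 128 channels, kernel size 3.
print(conv1d_params(128, 128, 3), conv1d_params(128, 128, 3, groups=32))  # 49152 1536
```

With the same parameter budget, a grouped layer can therefore be made considerably wider than a standard one, which is the trade-off that cardinality exposes as a design dimension.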
2.2.4. Attention Mechanism Module
CBAM [32] is a simple and effective attention module for feedforward convolutional neural networks that infers attention maps along the channel and spatial dimensions. The attention maps are multiplied by the input features for adaptive feature refinement. This module can enhance informative features, reduce noise, and help the network learn the relationship between IMU data and speed more effectively.
In our work, we further enhance the vanilla ResNeXt block by integrating a CBAM with BRB. Specifically, we append one CBAM after each ResNeXt module to obtain the weighted features. The CBAM module sequentially infers attention maps along both the channel and spatial axes, allowing the network to adaptively emphasize informative features and suppress less useful ones. This integration enables the model not only to capture a rich set of features through multi-branch processing, but also to refine these features by focusing on critical elements within the inertial data, leading to more robust and accurate navigation estimates.
For low-dimensional 6-axis IMU signals, the rationality of applying CBAM lies in its lightweight sequential attention structure, which avoids excessive freedom and insufficient constraints of attention maps. Unlike high-dimensional image data, IMU signals have limited channel dimension (only 6 channels), so channel attention can effectively learn the importance of acceleration and gyroscope channels without redundant parameters. The spatial attention of CBAM is naturally converted into temporal attention in IMU sequences, which focuses on key time steps such as motion switching, turning, and starting, rather than meaningless spatial regions. Since the attention weights are normalized and bounded, the attention distribution will not be too scattered or out of constraint. Therefore, CBAM is not only compatible with low-dimensional inertial signals, but also enhances the model’s ability to suppress sensor noise and focus on effective motion features.
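To make the sequential channel and temporal attention concrete, the following is a minimal pure-Python sketch of a CBAM-style block applied to a one-dimensional feature map (C channels by T time steps). It is written without a deep learning framework so that every step is visible. The function names, the random weights, the reduction ratio of 2, and the kernel size of 7 are assumptions for illustration, and the sketch is not the implementation used in EDIN.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def shared_mlp(v, w1, w2):
    """Two-layer perceptron with ReLU, shared by the avg- and max-pooled descriptors."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, v))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def cbam_1d(x, w1, w2, kernel):
    """x: list of C channels, each a list of T values. Returns (refined, m_c, m_t)."""
    C, T = len(x), len(x[0])
    # Channel attention: pool over time, pass both descriptors through the shared MLP.
    avg_c = [sum(ch) / T for ch in x]
    max_c = [max(ch) for ch in x]
    m_c = [sigmoid(a + b) for a, b in
           zip(shared_mlp(avg_c, w1, w2), shared_mlp(max_c, w1, w2))]
    x = [[m * v for v in ch] for m, ch in zip(m_c, x)]
    # Temporal attention: pool over channels, then a 1D convolution over time.
    avg_t = [sum(x[c][t] for c in range(C)) / C for t in range(T)]
    max_t = [max(x[c][t] for c in range(C)) for t in range(T)]
    pad = len(kernel[0]) // 2
    m_t = []
    for t in range(T):
        s = 0.0
        for j in range(len(kernel[0])):
            idx = t + j - pad
            if 0 <= idx < T:  # zero padding at the sequence borders
                s += kernel[0][j] * avg_t[idx] + kernel[1][j] * max_t[idx]
        m_t.append(sigmoid(s))
    return [[v * m for v, m in zip(ch, m_t)] for ch in x], m_c, m_t


rng = random.Random(0)
C, T, r, k = 8, 16, 2, 7  # hypothetical sizes
features = [[rng.gauss(0, 1) for _ in range(T)] for _ in range(C)]
w1 = [[rng.uniform(-0.5, 0.5) for _ in range(C)] for _ in range(C // r)]
w2 = [[rng.uniform(-0.5, 0.5) for _ in range(C // r)] for _ in range(C)]
kernel = [[rng.uniform(-0.5, 0.5) for _ in range(k)] for _ in range(2)]
refined, m_c, m_t = cbam_1d(features, w1, w2, kernel)
print(len(refined), len(refined[0]))  # 8 16
```

Both attention maps pass through a sigmoid, so every weight lies strictly between 0 and 1 and the block can only rescale features, never amplify them; this is the bounded behavior referred to in the paragraph above.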
2.3. Enhanced Model Training Method for Deep Indoor Inertial Navigation
2.3.1. Overview of Training Process
In the usual training process of a deep inertial navigation model, the first step is to collect inertial sensor data during smartphone motion together with the corresponding trajectory data, which are usually obtained from the sensors built into the smartphone and from other external devices. The data are then preprocessed, including a coordinate system transformation from the device coordinate system to the navigation coordinate system, as well as data augmentation to expand the data. Finally, the model is trained using the training set and validation set composed of the above data. Training consists of multiple rounds, each of which yields a temporary model and its predictions on the training set. The difference between the predicted result and the actual result, called the loss value, is then calculated and used to adjust the model in the next round of training. The formula for calculating the loss value is called the loss function.
Considering that relevant research has shown that adopting new model training methods can effectively improve model accuracy, we have improved the training process of EDIN, which is shown in Figure 2. Specifically, in the preprocessing stage, we added Gaussian-distributed white noise on top of existing data augmentation methods to better simulate real-world data. In addition, we adopted the Logcosh function as the loss function to further improve the performance of the model.
2.3.2. Data Augmentation
Because indoor inertial navigation data still rely on manual collection, collecting extensive datasets in this field is often challenging. To address this issue, RIDI proposed the stabilized IMU coordinate system, which is obtained from the device coordinate system by aligning its Y-axis with the negative gravity direction. Building on RIDI, RoNIN proposed a heading-agnostic coordinate frame (HACF), which is any coordinate system whose Z-axis is aligned with gravity. In other words, any such coordinate system can be chosen as long as it is kept consistent throughout the entire sequence. By using appropriate rotational representations (such as quaternions), transforming coordinates to HACF is not affected by singularities or discontinuities.
In addition to the aforementioned rotation processing, noise addition is also a commonly used data augmentation method. Because IMU data collected in practice are accompanied by significant noise, adding noise is a feasible way to augment the data. We have therefore added noise to the data in EDIN on top of the above scheme.
2.3.3. Logcosh Loss Function
Mean Squared Error (MSE) is the most commonly employed loss function in the training of deep inertial navigation systems, as it computes the squared Euclidean distance between the predicted and ground-truth coordinate changes. However, in the presence of outliers with large deviations, MSE may yield excessively high loss values, potentially causing gradient explosion and degrading model robustness. Considering that outliers frequently occur in inertial data collected from low-cost sensors, it is preferable to address this issue by employing a more effective loss function. The Logcosh loss function provides a smoother alternative. For small prediction errors, it behaves similarly to MSE with quadratic growth, ensuring stable gradient updates, whereas for large errors, it approximates Mean Absolute Error (MAE) with linear growth, thereby reducing sensitivity to outliers. By integrating the advantages of both MSE and MAE, Log-Cosh achieves a balance between smooth optimization, convergence efficiency, and robustness, making it particularly suitable for regression tasks involving dynamically varying error ranges or scenarios requiring both precision and resilience.
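The behavior described above can be illustrated with a short Python sketch. For small residuals, log(cosh(x)) is approximately x²/2, and for large residuals it approaches |x| − log 2, so its gradient tanh(x) stays bounded. The function names and the toy residuals are assumptions of this example; the numerically stable rewriting avoids the overflow that a direct call to `math.cosh` would produce for large inputs.

```python
import math


def logcosh(x):
    """Numerically stable log(cosh(x)) = |x| + log(1 + exp(-2|x|)) - log(2)."""
    ax = abs(x)
    return ax + math.log1p(math.exp(-2.0 * ax)) - math.log(2.0)


def mse_loss(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def mae_loss(pred, target):
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def logcosh_loss(pred, target):
    return sum(logcosh(p - t) for p, t in zip(pred, target)) / len(pred)


def logcosh_grad(x):
    """Derivative of log(cosh(x)) is tanh(x), bounded in (-1, 1)."""
    return math.tanh(x)


target = [0.0, 0.0, 0.0]
pred = [0.1, -0.1, 10.0]  # the last prediction is an outlier
print(mse_loss(pred, target), logcosh_loss(pred, target))
print(2 * 10.0, logcosh_grad(10.0))  # per-sample gradient of x**2 vs. logcosh at x = 10
```

On this toy example the single outlier drives the MSE to roughly 33.3, whereas the Logcosh loss is roughly 3.1, and the per-sample gradient at the outlier is 20 for the squared error but just under 1 for Logcosh.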
3. Evaluations
3.1. Dataset
We conduct comprehensive evaluations of position estimation on the RoNIN, OXIOD, and RIDI datasets. For each dataset, 80% is allocated for training, 10% for validation, and the remaining 10% for testing.
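A minimal sketch of such an 80/10/10 split is given below. It assumes that the split is made at the level of whole sequences and uses a fixed random seed for reproducibility; both choices, as well as the function and sequence names, are hypothetical, since the text does not specify how the partition was drawn.

```python
import random


def split_sequences(names, train=0.8, val=0.1, seed=0):
    """Shuffle sequence names reproducibly and split them into train/val/test."""
    names = sorted(names)                 # make the result independent of input order
    random.Random(seed).shuffle(names)
    n_train = int(len(names) * train)
    n_val = int(len(names) * val)
    return (names[:n_train],
            names[n_train:n_train + n_val],
            names[n_train + n_val:])


sequences = [f"seq_{i:03d}" for i in range(100)]  # hypothetical sequence names
train_set, val_set, test_set = split_sequences(sequences)
print(len(train_set), len(val_set), len(test_set))  # 80 10 10
```

Splitting by sequence rather than by window keeps overlapping windows of the same recording from appearing in both the training and the test sets.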
3.1.1. RoNIN
The RoNIN Dataset is a large-scale benchmark dataset for inertial navigation released by a team at Simon Fraser University in 2020. It contains 42.7 h of IMU-motion data collected from 100 human subjects with several Android devices, covering natural human movements such as walking, running, and stair-climbing. A 3D tracking phone (Asus Zenfone AR) was used to provide ground truth trajectories, with location drift less than 0.3 m and orientation drift less than 10 degrees. Based on whether the subjects in the test sequences also appear in the training set, the RoNIN dataset is divided into two subsets: RoNIN-seen and RoNIN-unseen.
3.1.2. OXIOD
The OXIOD Dataset was released by the University of Oxford team in 2018, focusing on smartphone-based IMU navigation in daily scenarios. It comprises 158 sequences with a total distance exceeding 42 km of IMU data under various motion modes such as walking, running, and hand-held movements.
3.1.3. RIDI
The RIDI Dataset contains multiple short-duration IMU sequences covering basic motion patterns such as walking and backward walking, with ground truth velocity and position provided by motion capture devices.
3.2. Model Training Details
For model training, we used PyTorch 2.10.0, torchvision 0.25.0, and tensorboardX 2.6.4, and conducted our experiments on an NVIDIA GeForce RTX 4070 with 32 GB GPU memory. We used a batch size of 128 and the Adam optimizer with an initial learning rate of 0.0001, reducing the learning rate by a factor of 0.1 if the validation loss did not decrease over ten epochs. The network was trained for 500 epochs, with each epoch involving a complete pass through all training data. For EDIN, we applied dropout with a keep probability of 0.5.
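The learning-rate rule described above can be sketched as a small plateau scheduler. The class below is a hypothetical, framework-free illustration; it assumes that the rate is reduced once ten consecutive epochs pass without a new best validation loss, and that the counter is reset afterwards. PyTorch provides `torch.optim.lr_scheduler.ReduceLROnPlateau` for this purpose, whose exact counting and threshold behavior may differ from this sketch.

```python
class PlateauScheduler:
    """Reduce the learning rate when the validation loss stops improving."""

    def __init__(self, lr=1e-4, factor=0.1, patience=10):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.stale_epochs = 0

    def step(self, val_loss):
        """Call once per epoch with the current validation loss."""
        if val_loss < self.best:
            self.best = val_loss
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
            if self.stale_epochs >= self.patience:
                self.lr *= self.factor   # e.g. 1e-4 -> 1e-5
                self.stale_epochs = 0
        return self.lr


scheduler = PlateauScheduler()
val_losses = [1.0, 0.8] + [0.8] * 10     # loss improves once, then plateaus
for loss in val_losses:
    lr = scheduler.step(loss)
print(lr)
```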
As described in RoNIN [23], HACF was employed to circumvent singularity and discontinuity issues stemming from coordinate system transformations. This frame features a Z-axis aligned with gravity. During training, a random HACF was assigned at each step, defined by randomly rotating the ground truth trajectory within the horizontal plane. IMU data were then transformed to this HACF through device orientation and horizontal rotation. This approach effectively integrated sensor fusion into our data-driven system. During testing, we utilized the coordinate system defined by the orientation of the Android device, aligning its Z-axis with gravity.
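The random horizontal rotation can be sketched as follows. The example assumes that the IMU samples have already been rotated into a gravity-aligned frame (Z-axis along gravity) using the device orientation, so that choosing a random HACF reduces to rotating the accelerometer samples, the gyroscope samples, and the 2D ground truth targets by the same random angle about the Z-axis. The function names are hypothetical.

```python
import math
import random


def rotate_about_z(vec, angle):
    """Rotate a 3D vector about the Z-axis; the Z component is unchanged."""
    x, y, z = vec
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)


def random_hacf(acc, gyro, targets, rng=random):
    """Apply one random heading rotation to IMU samples and 2D targets."""
    angle = rng.uniform(0.0, 2.0 * math.pi)
    c, s = math.cos(angle), math.sin(angle)
    acc_r = [rotate_about_z(a, angle) for a in acc]
    gyro_r = [rotate_about_z(w, angle) for w in gyro]
    targets_r = [(c * x - s * y, s * x + c * y) for x, y in targets]
    return acc_r, gyro_r, targets_r, angle


acc = [(1.0, 0.0, 9.8)] * 3      # dummy gravity-aligned accelerometer samples
gyro = [(0.0, 0.0, 0.1)] * 3     # dummy gyroscope samples
targets = [(0.5, 0.0)]           # dummy ground truth coordinate change
acc_r, gyro_r, targets_r, angle = random_hacf(acc, gyro, targets, random.Random(0))
print(acc_r[0][2], gyro_r[0][2])  # 9.8 0.1
```

Because the same rotation is applied to inputs and targets, the network sees the same motion under many headings, while the vertical components, which carry the gravity reference, are left untouched.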
Regarding the noise addition, we used Gaussian functions to generate white noise. Specifically, white noise with a variance of 0.1 was added to the acceleration, and white noise with a variance of 0.001 was added to the gyroscope. To further explore the impact of noise intensity on model performance, additional ablation experiments under different noise intensities were conducted. Specifically, five levels of acceleration noise and corresponding gyroscope noise were set, and the R-ResNeXt model was trained and tested under each intensity.
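A minimal sketch of this noise augmentation is shown below. It takes the variances stated above (0.1 for acceleration and 0.001 for the gyroscope) as defaults and converts them to standard deviations before sampling. The function name, the sample layout as (x, y, z) tuples, and the choice to draw independent noise per axis are assumptions of this example.

```python
import math
import random


def add_white_noise(acc, gyro, acc_var=0.1, gyro_var=0.001, rng=None):
    """Add zero-mean Gaussian white noise to accelerometer and gyroscope samples."""
    rng = rng or random.Random()
    acc_std = math.sqrt(acc_var)     # random.gauss expects a standard deviation
    gyro_std = math.sqrt(gyro_var)
    noisy_acc = [tuple(v + rng.gauss(0.0, acc_std) for v in a) for a in acc]
    noisy_gyro = [tuple(v + rng.gauss(0.0, gyro_std) for v in w) for w in gyro]
    return noisy_acc, noisy_gyro


acc = [(0.0, 0.0, 9.8)] * 200    # one dummy window of 200 frames
gyro = [(0.0, 0.0, 0.0)] * 200
noisy_acc, noisy_gyro = add_white_noise(acc, gyro, rng=random.Random(0))
print(len(noisy_acc), len(noisy_gyro))  # 200 200
```

Drawing fresh noise each time a window is loaded, rather than once per dataset, means that the network rarely sees exactly the same input twice.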
In the evaluations, we tested the MSE and Logcosh loss functions in the model training. We define $\hat{y}_i$ as the estimated displacement distance and $y_i$ as the ground truth displacement distance. $n$ is the length of each set of data; for each set of 200 frames of input data, $n = 200$. The calculation formulas for MSE and Logcosh are as follows:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (\hat{y}_i - y_i)^2$$

$$\mathrm{Logcosh} = \frac{1}{n}\sum_{i=1}^{n} \log\left(\cosh(\hat{y}_i - y_i)\right)$$
3.3. Metrics
As shown in (3) and (4), we utilized ATE and RTE as evaluation metrics. In the evaluations, the initial orientation was taken from the orientation data provided with each dataset. To mitigate potential biases in trajectory direction and ensure an accurate comparison of model performance, we employed the Iterative Closest Point (ICP) method to align the initial 5 s of the estimated trajectory with the ground truth. Both ATE and RTE are reported in meters.
3.4. Competing Methods
We selected a series of methods for comparative analysis. In the non-neural network domain, the PDR approach was adopted, while the RIDI method was excluded, as RIDI classifies different smartphone placement states, making it unsuitable for our comparative evaluation. In the neural network domain, we selected IONet, RoNIN-ResNet (R-ResNet) from the RoNIN study, and its modified version RoNIN-ResNeXt (R-ResNeXt). Other methods, such as IDOL, were excluded due to either negligible differences in network performance or the unavailability of publicly accessible source code. The following section outlines the characteristics and attributes of the selected comparison methods.
3.4.1. PDR
An inertial navigation system that utilizes a step-counting method. We employ a technique from [33] to detect the steps and determine the heading direction, with the step distance set at a predefined value.
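The position update of such a step-counting baseline can be sketched as follows. Step detection and heading estimation are taken as given inputs here, since they come from the technique in [33]; the function name and the step length of 0.7 m are hypothetical placeholders, not the values used in the experiments.

```python
import math


def pdr_track(p0, step_headings, step_length=0.7):
    """Advance the position by a fixed step length along each detected step's heading.

    step_headings: heading angle in radians for every detected step.
    step_length: predefined step distance in meters (hypothetical default).
    """
    x, y = p0
    track = [p0]
    for heading in step_headings:
        x += step_length * math.cos(heading)
        y += step_length * math.sin(heading)
        track.append((x, y))
    return track


# Four steps along the x-axis followed by four steps after a 90 degree turn.
headings = [0.0] * 4 + [math.pi / 2] * 4
print(pdr_track((0.0, 0.0), headings)[-1])
```

Because the step length is fixed, any mismatch with the user's actual stride scales the whole trajectory, which is one reason why this baseline is sensitive to the walking style of the user.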
3.4.2. IONet
A neural network-driven inertial navigation approach leveraging accelerometer and gyroscope data within a sliding window framework supported by the LSTM model.
3.4.3. R-ResNet
A robust neural inertial navigation network model utilizing a ResNet-based architecture. Because only 50% of the RoNIN dataset is publicly available, we retrained all R-ResNet models on this publicly available portion.
3.4.4. R-ResNeXt
Identical to R-ResNet, except that the original ResNet module is replaced with a ResNeXt module of the same specification.
Except for the PDR method, all models required training. To ensure a fair comparison, these models were trained on the same datasets and under the same conditions as the EDIN model.
3.5. Results
3.5.1. Position Evaluations
The location evaluation results are reported in Table 1 and partially visualized in Figure 3, revealing different performance trends among the evaluated methods. Models based on ResNeXt perform better than models based on ResNet on the RoNIN, OXIOD, and RIDI datasets. The model trained with our proposed EDIN method improves further over simply replacing ResNet with ResNeXt. Specifically, EDIN reduces the ATE by 18.78% and RTE by 18.71% compared to R-ResNeXt on the RoNIN-seen dataset; on the more challenging RoNIN-unseen dataset, EDIN still maintains a 13.35% ATE reduction and 11.97% RTE reduction, reflecting its strong adaptability. On all three datasets, the proposed EDIN method achieves the lowest 95th percentile ATE and RTE among all comparison methods, which demonstrates its superior robustness to severe estimation errors. Consistent with the average localization results, the 95th percentile indicators also verify that EDIN can effectively suppress large tracking errors and improve the reliability of inertial navigation, even in rare and difficult scenarios.
In particular, Figure 4 shows the box plots of the evaluation results from the tested deep learning methods on the RoNIN dataset. The proposed EDIN model yields results with a more compact distribution and reduced median and mean values, demonstrating improved stability and superior navigation performance. Nevertheless, the method produces a larger number of outliers with greater deviation, which can be attributed to its limited effectiveness in handling particularly challenging cases.
Figure 5 shows the cumulative distribution function (CDF) results on the RoNIN-seen and RoNIN-unseen test datasets. On the RoNIN-seen dataset, the performance difference between EDIN and other RoNIN-series models is relatively minor. However, on the RoNIN-unseen dataset, EDIN consistently outperforms the RoNIN-series models, indicating its superior robustness and adaptability to diverse environments. Nevertheless, as previously discussed, the figure also reveals that EDIN exhibits limited capability in handling particularly challenging cases. For instance, in the RoNIN-unseen dataset test, the maximum ATE of EDIN is noticeably higher than that of the RoNIN-series models.
3.5.2. Ablation Study
The improvements of our proposed EDIN method compared to the R-ResNeXt model can be summarized in three aspects: adding noise in the data augmentation part, adding CBAM to the ResNeXt module, and replacing MSE with the Logcosh loss function (see Table 2). To verify the effectiveness of these changes, we conducted additional ablation experiments, in which the R-ResNeXt model is trained with each of the improvements applied individually. The test results are shown in Table 3. To further validate the superiority of the Logcosh loss function over other mainstream robust loss functions and provide empirical evidence for its gradient explosion mitigation ability, we conduct a dedicated comparative experiment on the R-ResNeXt baseline with the Huber, Tukey, MAE, and Logcosh loss functions (see Table 2). The results show that the Logcosh loss achieves the best ATE and RTE on both the RoNIN-seen and RoNIN-unseen datasets, with 4.32 m/2.93 m and 4.95 m/4.32 m, respectively, outperforming the other loss functions.
It can be observed that, in comparison with the baseline R-ResNeXt model, all three proposed improvement strategies generally contribute to performance enhancement. Specifically, the integration of the CBAM effectively strengthens feature representation by selectively emphasizing informative channels and spatial locations, which in turn leads to higher localization accuracy. The adoption of the Logcosh loss function provides a more suitable optimization criterion for navigation-related tasks, as it mitigates the influence of outliers while maintaining sensitivity to small errors. Additionally, the incorporation of noise-based data augmentation increases the diversity of the training set. To further explore the impact of noise intensity on model performance, additional ablation experiments under different noise intensities were conducted, with the detailed results presented in Table 4. Five levels of acceleration noise (0.05, 0.1, 0.2, 0.5, 1.0 m/s²) and corresponding gyroscope noise (0.0005, 0.001, 0.002, 0.005, 0.01 rad/s) were set, and the R-ResNeXt model was trained and tested under each intensity. The results indicate that an appropriate noise intensity is conducive to enhancing the model’s robustness against sensor noise. Thus, such a data augmentation strategy helps improve the model’s generalization capability across different datasets, with particularly noticeable improvements in reducing translational errors.
Despite these overall benefits, it is noteworthy that in certain scenarios, the use of these enhancements can result in decreased performance on specific evaluation metrics. Such degradation is likely attributable to variations in data characteristics among datasets, including differences in motion patterns, sensor noise profiles, and device placements, which may interact differently with the respective improvement strategies.
4. Discussion
In this study, we demonstrate that the proposed EDIN model achieves superior performance compared to other state-of-the-art approaches, particularly in terms of positioning accuracy and robustness. For most test sequences characterized by relatively small localization errors, EDIN is capable of producing trajectories that closely align with the ground truth, thereby reflecting its high precision in stable motion conditions. However, in a subset of anomalous sequences with substantially larger errors, the performance improvement becomes limited, and, in some cases, EDIN even underperforms relative to existing RoNIN-based methods. This degradation can primarily be attributed to the use of the log-cosh loss function during model training, which inherently suppresses the influence of outliers by assigning lower gradient weights to samples with large residuals. Consequently, the model tends to neglect anomalous sequences, leading to suboptimal generalization in these scenarios. Addressing this limitation—specifically, how to effectively identify and adaptively handle anomalous sequences to further enhance system robustness—constitutes a key direction for our future research efforts.
In addition to the issues observed during testing, future work will focus on further improving the training methodology for deep inertial navigation. Specifically, we aim to investigate the integration of existing deep learning model optimization techniques to identify and refine optimal model architectures. This includes the application of automated hyperparameter optimization, neural architecture search, and other algorithmic strategies that have demonstrated effectiveness in enhancing model performance. Furthermore, we plan to explore alternative data augmentation techniques and loss functions tailored to inertial navigation tasks, with the goal of increasing model generalization and robustness across diverse motion scenarios. The effectiveness of these approaches will be rigorously evaluated through systematic training and testing on relevant inertial navigation datasets, thereby providing a foundation for more accurate and reliable deep inertial navigation systems.