Deep Learning Methods for Speed Estimation of Bipedal Motion from Wearable IMU Sensors

The estimation of the speed of human motion from wearable IMU sensors is required in applications such as pedestrian dead reckoning. In this paper, we test deep learning methods for the prediction of the motion speed from raw readings of a low-cost IMU sensor. Each subject was observed using three sensors at the shoe, shin, and thigh. We show that existing general-purpose architectures outperform classical feature-based approaches and propose a novel architecture tailored for this task. The proposed architecture is based on a semi-supervised variational auto-encoder structure with a novel decoder in the form of a dense layer with a sinusoidal activation function. The proposed architecture achieved the lowest average error on the test data. Analysis of sensor placement reveals that the best location for the sensor is the shoe. A significant accuracy gain was observed when all three sensors were available. All data acquired in this experiment and the code of the estimation methods are available for download.


Introduction
We are concerned with the problem of estimating the locomotion speed of an individual for the purpose of inertial indoor navigation. Specifically, our research investigates indoor navigation in special situations where the traditional means of localization, such as GPS or infrastructure-based (WiFi) signals, are not available. A typical example is emergency response, e.g., locating a firefighter in a hostile unknown environment. Due to the unreliable infrastructure, it is impossible to use vision-aided inertial systems (VINS) or magnetic and thermal fingerprinting systems; thus, Pedestrian Dead Reckoning (PDR) is the most reliable source of information about the location of the tracked person.
We are concerned with analyzing data from inertial measurement unit (IMU) sensors due to their robustness in hostile environments. The IMUs are assumed to be attached to the human body by an elastic band to ensure that their movement closely follows that of the body. This is not guaranteed in approaches using the IMU of a mobile phone, where the coupling is weaker. Due to the chosen application area, we aim to estimate a wide range of speeds of bipedal motion, i.e., walking as well as running. This is a narrower scope than general gait speed estimation [1], but wider than walking speed estimation [2], since we also include running.
Many classical PDR methods use manually designed features [1][2][3], often without detailed specification, which prevents their replication. One of the most popular features is the zero-velocity update (ZUPT) [4], which has been extensively used [5][6][7]. Detection of the feature itself may be a complicated problem requiring adaptive approaches [8]. The application of the approach to indoor navigation often requires a combination of multiple methods [9].

The main contributions of this paper are as follows.

1. We measured IMU data on eight human subjects while walking and running, together with a reference speed recorded by a monowheel, and provide these data publicly. We also provide all code to run the proposed methods to make replication of our results as convenient as possible.

2. We test existing deep learning architectures for the task of predicting the motion speed from the IMU data. We show the benefits of approaches based on the auto-encoder topology. Moreover, we propose a novel decoder architecture that achieves the best results on our data. The architecture is motivated by the nature of the IMU signal.

3. We provide sensitivity studies of the methods with respect to: (i) the subjects (via leave-one-out cross-validation), (ii) the number of IMU sensors on the body and their location, and (iii) the availability of additional knowledge, such as the length of the leg. We observed that these factors are more important than the architecture of the neural network.

The work is structured as follows. A literature survey highlighting the most related approaches is provided in Section 2. The data, the problem formulation, and the main computational methods, including the proposed architecture, are described in Section 3. All methods are evaluated on the recorded data in Section 4, where we show the superiority of the proposed architecture. The results are discussed in Section 5, and the conclusion is presented in Section 6.

Related Work
We now briefly review the details of the publications that we find closest to our approach, as summarized in Table 1. From the wide range of classical feature-based methods, we select those most related to our setting; we then review approaches based on neural networks.

Feature-Based Approaches
A large number of methods based on classical feature-based approaches are available; see the summary in Table 1. Since this approach is sensitive to the tuning of the feature detection, many papers focus on improving the feature extraction, e.g., [7,23]. However, since our primary aim is to learn from the raw data, we consider feature-based methods as a baseline for comparison. From the wide range of available methods, we analyze the performance of the methods listed in the survey [24], including extensive parameter tuning. Moreover, the default sensor placement in these approaches is almost always the foot attachment [4], with few exceptions such as [25]. We study the sensitivity of the estimation precision to this choice in the experimental section.

Neural Networks
Neural networks have been applied to the speed estimation problem in many forms, either in combination with ZUPT [16] or standalone using classical architectures such as the LSTM. However, most often the networks use shallow architectures that cannot exploit the key benefit of deep architectures, i.e., the ability to learn the most suitable high-level features. The pioneering work using deep architectures in this domain is [13], where a large corpus of data was recorded using a smartphone, which is sufficient to train a single feed-forward architecture. We have not collected such a large corpus of data with the body-mounted IMUs; hence, we use methods that are based on semi-supervised architectures. We show experimentally that autoencoder-based architectures outperform the feed-forward networks.
We have tested standard convolutional-recurrent architectures used in autoencoders [20,21] and designed a novel decoder based on a weighted harmonic signal. Our approach can also be seen as a generalization of the weighting used in feature-based approaches [26], where the weighting strategy is fixed. In our approach, the weights as well as the latent space (i.e., the features) are learned from the data.

Available Datasets
Existing datasets of IMU-style recordings were often designed for particular applications, such as patient monitoring or daily-life monitoring. Thus, they differ in the measurement setup; see the survey [27]. We can recognize two principal types of datasets for motion speed estimation: (i) shoe-mounted IMUs, e.g., [27][28][29][30], and (ii) smartphone-based data with varying positions (backpack or pocket) [14,[31][32][33]. While the shoe-mounted sensor is the traditional setup for feature-based methods, the deep-learning methods are predominantly trained on smartphone data recorded in the hand or pocket. Since these datasets do not contain information from both locations, it is impossible to select the most appropriate location. The dataset that we provide in this study collects data from three different sensor locations (shoe, shin, and thigh) to allow a direct comparison of the sensor locations for the same subjects under the same conditions; its relation to similar datasets is summarized in Table 2. An essential piece of information for speed estimation is the measurement conditions and the speed reference. While a properly calibrated treadmill could be an excellent way to obtain the speed reference at a stable speed, it is unnatural during speed transitions. Measuring stable speeds in discrete steps poses the risk of an incomplete dataset. This is not an issue for feature-based methods with few parameters, but it becomes problematic for deep learning methods, which struggle to learn the speeds between the discretization steps. Alternatives for obtaining the speed reference are (i) optical systems such as VICON [30,34], which provide only a limited space for performing experiments, complicating especially the running scenarios, and (ii) visual SLAM [14,33], which does not suffer from space limitations, but the accuracy of its ground truth is lower. Therefore, we used a different approach and obtained the speed reference with a monowheel.
We provide both the data and the code of our methods to guarantee reproducibility, which is still not standard in the field. Many datasets used in published studies remain private, many papers do not describe their methods in sufficient detail (SVM-based methods often use more than a hundred features, of which the paper describes only a few), source code is rarely provided, and the ground truth is often missing or contaminated by errors.

Data Acquisition: Sensors on a Single Leg
Low-cost BMX055 MEMS sensors were used to acquire the IMU measurements. Three locations (thigh, shin, and foot) were chosen strategically to capture informative signals about bipedal movement from a single leg. This setup is a compromise between the number of sensors and the information gain. It is based on the underlying assumption that the motion of both legs is, to some extent, symmetric (this assumption is expected to hold for healthy subjects, such as firefighters).
The use of less accurate sensors poses an additional estimation challenge, and our results may thus differ from studies that used highly accurate and thus expensive sensors. At each location, one BMX055 was used to capture 6DOF information from an orthogonal accelerometer and gyroscope. The sensor breakout board was attached by adhesive to a 3D-printed platform with elastic bands to obtain tight body attachment for the thigh and shin placements. The foot sensor platform was attached to the shoe via a hook-and-loop fastener; see Figure 1 right. The data from the sensors were read by an ESP32 via I2C in fast mode and redirected wirelessly over WiFi to an Android phone, which served as the data storage during the measurement. After some optimization, the final data rate fluctuates around 400 Hz.
The data were collected in an open environment, with the reference speed obtained using a custom modification of a monowheel. The actual measurement with the monowheel was performed for all subjects by the same researcher to avoid corrupting the subject's natural movement; see Figure 1 left. Data for each subject are stored in a single array, with each column containing the data recorded at one time instant, x_t, i.e., the accelerometer and gyroscope readings of the three sensors.
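For illustration, a minimal loading sketch in Python is given below. The file name, array orientation, and channel grouping are hypothetical placeholders; the authoritative layout is documented in the published repository (see the Data Availability Statement).

```python
import numpy as np

# Illustrative sketch only: the file name and channel ordering are assumptions,
# the actual layout is documented in the published repository.
data = np.load("subject_01.npy")   # hypothetical file, shape assumed to be (channels, T)

# Assumed layout: one column per time instant x_t, rows holding the accelerometer
# and gyroscope axes of the thigh, shin, and foot sensors.
n_channels, n_samples = data.shape
print(f"{n_channels} channels, {n_samples} time instants")
```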
Examples of the recorded data are provided in Figure 2 for the first subject, where two short periods of walking and running are selected for comparison. Note that while in the walking phase it is possible to identify the heel-strike moments by the naked eye, it is much more demanding in the running phase. Moreover, the limits of the sensor, ±2000 deg/s for the gyroscope and ±80 m/s² for the accelerometer, are reached during the running phase. The saturation of the acceleration is critical for methods relying on its integration.

Problem Formulation
We are concerned with estimating the instantaneous speed of human bipedal motion using data from IMUs attached to the human body. The instantaneous speed at time index t is denoted y_t, and the instantaneous data record x_t. Due to insufficient information in a single observation, we predict the speed from a window of the w most recent observations,

ŷ_t = f_θ(X_t), X_t = (x_{t−w+1}, ..., x_t), (1)

where f_θ(·) is a parametric predictor function with parameters θ (aggregating, e.g., the parameters of all layers of the neural network). We approach the problem by supervised learning, i.e., we record a dataset on multiple subjects,

D = {(X_t^(i), y_t^(i)) : t = 1, ..., T_i, i = 1, ..., S}, (2)

where T_i is the length of the time series recorded for the i-th subject. These data are used to train the predictor by minimizing a loss function between the predicted and the measured speed,

θ̂ = arg min_θ L(θ; D). (3)
The deep learning methods vary in the architecture of the neural network, yielding different types of parameters, and in the formulation of the loss function (3). These details will be elaborated in Section 3.3.
Note that various classical methods also fit into this formulation; however, their parameter space is much more restricted, containing, e.g., only tuning parameters of the data processing. Thus, we may consider the conventional feature-based methods as predictors with restricted degrees of freedom, while the deep-learning methods are free to learn the model structure from the data. The key challenge of the deep methods is to find the right level of generalization, i.e., to avoid overfitting the training data, which results in poor performance on unseen (testing) data.

Deep Learning Methods
Even deep learning predictors are essentially just more complicated non-linear regressors, as defined in Section 3.1. Different model structures are represented by different architectures of the network. The challenge is to find the architecture that best corresponds to the nature of the studied problem. In this Section, we briefly review the architectures that will be tested on the collected data. Specifically, we compare two types of network architectures: (i) feed-forward architectures and (ii) the semi-supervised variational autoencoder architecture. The former is a direct prediction method [10]. The latter extends the feed-forward architecture by incorporating a generative model [35] as a tool for improving the generalization capabilities of the model.

Feed-Forward Networks
The feed-forward approach is a straightforward application of a neural network to the regression problem: it defines a single network ŷ_t = f_θ(X_t) and a mean square error loss,

L_pred(θ) = Σ_t (y_t − f_θ(X_t))², (4)

i.e., a classical least-squares criterion with a well-known closed-form solution for the linear model. A vast number of neural architectures have been proposed and tested. For example, a simple classical LSTM architecture has been tested on the speed estimation problem in [16]. Recent progress in deep learning indicates that much better results can be obtained with convolutional architectures. We test the two most recent methods of this category:
InceptionTime is an architecture defined using convolutional layers with a bottleneck [18].
Perceiver is a complex architecture based on the attention mechanism [19].
These architectures have a number of hyper-parameters, mostly the sizes and dimensions of the latent variables; a detailed discussion is beyond the scope of this paper, see the original publications for details. We used the reference implementations of these methods provided by their authors. The tuning of their hyper-parameters was performed using a uniform random search over the ranges given in Table 3. The ranges were adjusted to ensure that the best results were not obtained at the border of the range.
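As an illustration, a minimal sketch of such a uniform random search is given below; the parameter names and ranges are placeholders only, the actual ranges being those of Table 3.

```python
import math
import random

# Hypothetical hyper-parameter ranges; the actual ranges are listed in Table 3.
RANGES = {
    "learning_rate": (1e-4, 1e-2),   # sampled log-uniformly below
    "latent_dim": (8, 64),           # integer range
    "num_filters": (16, 128),        # integer range
}

def sample_hyperparameters():
    """Draw one configuration uniformly at random from the ranges above."""
    lo, hi = RANGES["learning_rate"]
    return {
        "learning_rate": 10 ** random.uniform(math.log10(lo), math.log10(hi)),
        "latent_dim": random.randint(*RANGES["latent_dim"]),
        "num_filters": random.randint(*RANGES["num_filters"]),
    }

configs = [sample_hyperparameters() for _ in range(15)]   # 15 draws per architecture
```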

Semi-Supervised Variational Autoencoders
While the feed-forward approach is methodologically simple, it often suffers from overfitting, i.e., it achieves a good fit on the training data but a poor one on the test data. While various regularization techniques have been proposed, they are often hard to tune for good performance. An interesting approach to this problem is based on the use of autoencoders for regularization. Informally, an autoencoder projects the input onto a latent code, often of a much smaller dimension than the input, from which it tries to reconstruct the input as closely as possible. Specifically, the autoencoder architecture defines two networks: (i) an encoder generating the latent code, z_t = g_ψ(X_t), and (ii) a decoder generating a reconstruction of the input from the code, X̂_t = f_θ(z_t). The training loss is the mean square error of the reconstruction,

L_AE(θ, ψ) = Σ_t ||X_t − f_θ(g_ψ(X_t))||², (5)

where the free variables are the parameters of the neural networks, ψ (encoder) and θ (decoder). The autoencoder alone is too ambiguous to train and has to be coupled with a regularization. The most popular, and often the best performing in practice, is the variational autoencoder [36]. The additional assumption is that the latent variable (code) is distributed as an independent Gaussian, p(z_t) = N(0, I). The encoder does not provide a single point estimate of z_t but its distribution q(z_t|X_t), which is prescribed to have a Gaussian form with mean µ_ψ(X_t) and standard deviation σ_ψ(X_t), both functions being represented by neural networks. The loss of the variational autoencoder is derived from the evidence lower bound, yielding the following formula:

L_VAE(θ, ψ) = Σ_t [ E_{q(z_t|X_t)} ||X_t − f_θ(z_t)||² + β KL(q(z_t|X_t) || p(z_t)) ]. (6)
The parameter β > 0 governs the compromise between the first (reconstruction) term and the second (regularization) term of (6); KL(q||p) denotes the Kullback-Leibler divergence between the probability distributions q and p, which is available in closed form for Gaussian distributions, see [36] for details.
The combination that we use originates in semi-supervised learning, which aims to combine the variational autoencoder with feed-forward predictors [35]. The idea is to use a simple weighted combination of the previous loss functions, (i) the prediction loss (4) and (ii) the VAE loss (6), yielding

L_SVAE(θ, ψ) = α L_pred + L_VAE, (7)

where α is the weighting factor of the prediction quality. Note that by various choices of the weighting factors α and β we can recover various architectures, for example: (i) for β → 0 and α → 0, (7) approaches the pure autoencoder, and (ii) for α ≫ β and α ≫ 1, it approaches the feed-forward predictor. Tuning the values of α and β thus allows obtaining the best of both approaches. Models of this type will be denoted semi-supervised variational autoencoders (SVAE); the architecture is illustrated in Figure 3: the path from the input X_t to the prediction ŷ_t corresponds to the feed-forward path, while the path from X_t to the reconstruction X̂_t provides regularization. A wide range of such methods may be designed by various choices of the architectures of the involved neural networks.
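A schematic PyTorch-style sketch of the combined loss (7) is given below. The encoder, decoder, and prediction head are placeholders, and the choice to predict the speed from the latent mean is our assumption; the sketch illustrates the structure of the loss rather than the exact implementation used in the experiments.

```python
import torch
import torch.nn.functional as F

def svae_loss(x, y, encoder, decoder, predictor, alpha, beta):
    """Combined SVAE loss (7): alpha * prediction MSE + reconstruction MSE + beta * KL.

    encoder(x) is assumed to return the Gaussian parameters (mu, log_var) of q(z|x),
    decoder(z) the reconstruction of x, and predictor the speed estimate.
    """
    mu, log_var = encoder(x)
    z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)          # reparameterization trick

    x_rec = decoder(z)
    y_hat = predictor(mu)                                             # assumption: predict from the latent mean

    rec_loss = F.mse_loss(x_rec, x)                                   # reconstruction term of (6)
    kl = -0.5 * torch.mean(1 + log_var - mu.pow(2) - log_var.exp())   # KL(q || N(0, I)) in closed form
    pred_loss = F.mse_loss(y_hat.squeeze(-1), y)                      # prediction loss (4)

    return alpha * pred_loss + rec_loss + beta * kl
```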
Recent studies indicate that state-of-the-art performance on time series data is obtained by a combination of CNN layers (acting as adaptive filters) followed by LSTM layers capturing the dynamics of the process. While this architecture was not successful in our feed-forward setting, it worked very well as an encoder in the VAE, as proposed in [22]. However, motivated by the nature of our signal (see Figure 2), we investigated the possibility that the signal can be reconstructed using periodic functions. We therefore test two versions of the SVAE, differing in the decoder:

Conventional Decoder: SVAE-LSTM-CNN
The decoder is the reverse of the encoder, i.e., an LSTM layer followed by deconvolution (ConvTranspose) layers. This architecture is an extension of the classical LSTM autoencoder [22].

Proposed Decoder: SVAE-Sine
Motivated by the periodic nature of the recorded signals (Figure 2), we propose a decoder in the form of a weighted sum of sine waves, implemented as a single dense layer with a sinusoidal activation function applied to the latent code. This decoder has only one layer and is thus much simpler than the LSTM-CNN decoder. We conjecture that this is the reason why SVAE-Sine was experimentally found to be more reliable than the LSTM-CNN version.
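A minimal sketch of one possible realization of such a decoder is given below; the exact parameterization of the sine waves in the published implementation may differ, so the class should be read as an illustration of a single dense layer with a sinusoidal activation rather than as the reference code.

```python
import torch
import torch.nn as nn

class SineDecoder(nn.Module):
    """Illustrative single-layer decoder: a dense layer with a sinusoidal
    activation that reconstructs the input window from the latent code."""

    def __init__(self, latent_dim, n_channels, window_len):
        super().__init__()
        self.n_channels = n_channels
        self.window_len = window_len
        # One dense layer maps the latent code to all output samples.
        self.linear = nn.Linear(latent_dim, n_channels * window_len)

    def forward(self, z):
        # Sinusoidal activation applied to the output of the dense layer.
        out = torch.sin(self.linear(z))
        return out.view(-1, self.n_channels, self.window_len)
```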

Remark 1.
The architectures mentioned in this Section were selected from a larger pool of methods using preliminary studies. We tested many simpler architectures, such as plain LSTM autoencoders [16] or plain CNN networks, for a limited number of runs. However, the results of the simple architectures were significantly worse than those of the architectures described above (e.g., it was difficult to obtain a converging pure LSTM network). Since this agrees with previously reported experiments, e.g., [37,38], we removed these simple architectures from the study and performed the computationally expensive cross-validation study with Monte Carlo hyper-parameter search only on the four methods mentioned above.
Hyperparameters of the considered SVAE architectures are again sampled from ranges summarized in Table 4.

Experimental Protocol
The evaluation procedure follows the leave-one-out cross-validation protocol [39]. Specifically, the methods were trained 8 times, each time with the data collected on one subject used as testing data and the remaining data used for training and validation (85% for training, 15% for validation). The validation data were used for monitoring the convergence of the training procedure, which was stopped when the validation error did not improve for 20 consecutive steps. The training loss is the mean square error (4) or its augmented version (7).
If not stated otherwise, all reported accuracies are estimation errors on the testing subject, averaged over all 8 repetitions. The mean absolute error was chosen as the main evaluation metric,

MAE_s = (1/T_s) Σ_{t=1}^{T_s} |y_t^(s) − f_s(X_t^(s))|, (8)

where f_s(·) is a model trained on the data from all subjects except the s-th. The error is reported in units of speed, i.e., km/h.
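The protocol and the metric (8) can be summarized by the following sketch; the data containers and the train_model function are placeholders for the actual training pipeline.

```python
import numpy as np

def leave_one_out_mae(windows_by_subject, speeds_by_subject, train_model):
    """Leave-one-out cross-validation with the mean absolute error (8) in km/h.

    windows_by_subject[s] holds the input windows X_t of subject s and
    speeds_by_subject[s] the reference speeds y_t; train_model is a placeholder
    that trains a predictor on the listed training subjects.
    """
    subjects = list(windows_by_subject)
    errors = []
    for s in subjects:
        train_subjects = [k for k in subjects if k != s]
        f_s = train_model(train_subjects)            # model trained without subject s
        y_hat = f_s(windows_by_subject[s])
        errors.append(np.mean(np.abs(speeds_by_subject[s] - y_hat)))
    return float(np.mean(errors))                    # average over the 8 subjects
```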

Conventional Feature-Based Methods
We compare all tested deep-learning methods with state-of-the-art heuristic/feature-based approaches. We used the recent survey [24] to select the candidates. These methods integrate the speed from the accelerometer, using various features, such as detected heel strikes of each limb, to calibrate the integration. Ten different variants of the features and their data processing are compared in [24], called method 1 to method 10, with increasing complexity of the processing. The best-performing method on their data was method 10, which uses information from a wrist sensor. Since we do not have sensors on the wrist, we compared only the methods that do not use wrist signals, i.e., the methods listed in Table 5. The parameters of all relevant methods of [24], i.e., the complementary filter cut-off frequency, the gyroscope scale error compensation parameter, and the accelerometer scale error compensation parameter, were optimized using Matlab's fminsearch for the best accuracy (8), independently for each method. The parameters obtained by the optimization are not very intuitive, for example, the negative scaling for method 5. However, the error of the speed estimation with more usual parameter values was higher than with the optimized ones. We conjecture that this is due to the nature of the collected signals (containing, e.g., saturations, Figure 2) from a cheap and noisy sensor.
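In Python, an analogous tuning procedure could be sketched with SciPy's Nelder-Mead simplex search (the same algorithm as Matlab's fminsearch); run_method and the initial parameter values are placeholders, not the original implementation.

```python
import numpy as np
from scipy.optimize import minimize

def tune_feature_method(run_method, imu_data, reference_speed, initial_params):
    """Tune the three processing parameters of a feature-based method for minimal MAE (8).

    run_method(params, imu_data) is a placeholder returning the estimated speed profile;
    params holds the complementary filter cut-off frequency and the gyroscope and
    accelerometer scale error compensation parameters.
    """
    def mae(params):
        estimate = run_method(params, imu_data)
        return np.mean(np.abs(reference_speed - estimate))

    result = minimize(mae, initial_params, method="Nelder-Mead")   # simplex search, as in fminsearch
    return result.x, result.fun
```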
The best performance on our data was obtained by method 4 of [24], which uses the mid-stance to mid-stance segmentation to obtain the heel-strike and toe-off indices. The yaw angle was manually set to a constant zero because only the forward speed is measured.

Deep Learning Methods
For training and testing of all deep learning models, the data were resampled to a uniform 512 Hz and split into 2 s time windows, with the reference speed taken at the middle element of each window.
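A minimal preprocessing sketch is given below; it assumes that the raw samples come with time stamps and that the windows do not overlap, both of which are our assumptions for illustration.

```python
import numpy as np

def resample_and_window(timestamps, samples, speed, fs=512, window_s=2.0):
    """Resample irregular IMU data to a uniform 512 Hz grid and cut 2 s windows.

    samples has shape (channels, N); timestamps and speed have length N.
    The reference speed of each window is taken at its middle element.
    """
    t_uniform = np.arange(timestamps[0], timestamps[-1], 1.0 / fs)
    resampled = np.stack([np.interp(t_uniform, timestamps, ch) for ch in samples])
    speed_u = np.interp(t_uniform, timestamps, speed)

    win = int(window_s * fs)                          # 1024 samples per window
    windows, targets = [], []
    for start in range(0, resampled.shape[1] - win + 1, win):
        windows.append(resampled[:, start:start + win])
        targets.append(speed_u[start + win // 2])     # speed at the middle of the window
    return np.stack(windows), np.array(targets)
```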
Since the number of training samples is still low, the training data for the neural networks were augmented [40]. Two techniques were used to create the augmented samples: (i) sensor rotation, i.e., the measurements of one sensor were rotated by a small constant angle drawn from a Gaussian distribution with a standard deviation of 2.5°, and (ii) addition of Gaussian white noise, with a relative standard deviation of 1% of the observed value. The sensor rotation simulates a slightly different sensor placement on the clothes or body. The small portion of white noise forces the models to learn long-term dependencies.
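A sketch of the two augmentation steps is given below; the rotation axis is chosen arbitrarily for illustration, since the exact rotation used in the experiments is not specified here.

```python
import numpy as np

def augment_window(window, angle_std_deg=2.5, noise_rel_std=0.01, rng=np.random):
    """Augment one IMU window by a small random sensor rotation and relative white noise.

    window has shape (channels, T); channels are assumed to be grouped as
    consecutive (x, y, z) triplets of accelerometer and gyroscope axes.
    """
    angle = np.deg2rad(rng.normal(0.0, angle_std_deg))
    # Rotation about the z axis; the axis choice is an illustrative assumption.
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

    rotated = window.copy()
    for i in range(0, window.shape[0], 3):            # rotate each (x, y, z) triplet
        rotated[i:i + 3] = rot @ window[i:i + 3]

    noise = rng.normal(0.0, noise_rel_std, size=rotated.shape) * np.abs(rotated)
    return rotated + noise
```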

Method Comparison for a Single Foot Sensor
For all four tested deep architectures (InceptionTime, Perceiver, SVAE-LSTM-CNN, SVAE-Sine), we sampled 15 random draws from the ranges provided in Tables 3 and 4, respectively. The leave-one-out cross-validation was performed for each hyper-parameter value, and the results were sorted according to their testing error (8). Tables with the three best hyperparameter values for the data from the single foot sensor are provided in Appendix A. We note that the differences between the best hyperparameter values of each method are relatively small, indicating consistent performance with low sensitivity to the individual hyper-parameters. Thus, we did not perform any averaging over the initial conditions of the network training.
A summary of the speed estimation errors of all deep methods is displayed in Figure 4, in comparison with the best feature-based method from Table 5; the results are averaged over the 8 subjects of the leave-one-out cross-validation, and all methods were optimized for the best performance. Note that all deep learning methods achieved significantly better results than the conventional feature-based methods. The difference between the approaches is even more striking when the measured and predicted speeds on the tested subject (unseen in training) are displayed as a scatter plot in Figure 5. Note that the prediction of the deep method does not exhibit any significant speed-dependent distortions, which are visible in the results of the conventional method. Scatter plots for the other subjects and for the other deep methods are similar to that of SVAE-Sine.
While the proposed SVAE-Sine model yields the lowest prediction error, we note that even feed-forward methods, such as the InceptionTime, provide comparable results. The limiting factor for improving the estimation is thus not the architecture of the network but more likely other issues such as (i) inter-subject variability, and (ii) informativeness of the sensor data.

Inter-Subject Variability
The inter-subject variability is illustrated in Figure 6 via a box plot of the speed prediction errors for the individual subjects in the cross-validation study, where the difference between subjects is clearly visible. The results of two methods are compared in this chart: SVAE-Sine as a representative of the semi-supervised methods and InceptionTime as a representative of the feed-forward networks. Both methods achieve consistent results with low error on some subjects, such as numbers 5 and 6. However, they differ on other subjects, such as numbers 3 and 7, where the winner is SVAE-Sine and InceptionTime, respectively. The differences between these subjects may be caused by insufficient training or other factors. We consider SVAE to provide better results due to its lower average error, but also its lower error on the majority of the subjects. Note, however, that even the largest error, on the most outlying subject number 7, is still significantly lower than the errors of the conventional method (Figure 4).

Sensitivity to the Sensor Location
In this Section, we investigate how much accuracy can be gained by using additional sensors. We do not investigate all architectures but focus only on the SVAE-Sine approach. The performance of the method was evaluated for all individual sensors, all pairs, and all three sensors together; see Figure 7. The hyperparameters of the architecture were those of the best-performing architecture for the foot sensor (Appendix A). The foot appears to be the most valuable sensor location, both when used as a single sensor and in tandem with a sensor at another location. The error of the speed estimation monotonically decreases with the increasing number of sensors.

Additional Biometric Information
Biometric information, such as the length of the leg or the weight of the body, may be relevant for the estimation of the motion speed from IMUs. The question for deep learning is whether the availability of such information reduces the prediction error. We trained the SVAE-Sine model with two additional inputs: (i) the length of the leg, and (ii) the body mass of the subject. We found that, in our study, this extension did not yield any significant improvement in the estimation error.

Discussion and Future Work
We designed the dataset to measure the signal at three different positions, which allows comparison of our results with both shoe-mounted and smartphone-based data.
Previous studies indicate that since a smartphone is not properly coupled to the body movement (e.g., when held in one hand), the estimates have higher errors [33]. The situation where the smartphone is in the pocket corresponds to its placement on the thigh in our study. We have shown that this is the least accurate of all locations tested in our study (Figure 7), even though our error at this position is much lower than that reported for the smartphone (0.4 km/h for our method vs. 1.08 km/h for the smartphone [15]). Moreover, we have also shown that the error of the speed estimation is significantly reduced with an increasing number of sensors (0.25 km/h for three sensors). Fusion of multiple sensors is very straightforward in deep-learning methods, contrary to feature-based approaches. We conjecture that multiple sensors will be necessary to address more complicated motions, such as knee-walking or ladder climbing. Due to the decreasing prices of the sensors, we foresee great potential for future research in multi-sensor data, e.g., sensors integrated into special-purpose suits [41].
One of the key observations of our study is the inter-subject sensitivity of the speed estimates. We have shown in the cross-validation study that the differences between subjects are much larger than the differences between the various deep-learning architectures (Figure 6). This highlights the inherent difficulty of designing a universal speed estimation algorithm (which is the goal of all feature-based methods). On the other hand, it opens a way for personalization of the method. The benefits of personalizing the estimation algorithms for each individual have already been studied in [3], and deep learning is in a great position to address this need. We foresee great potential in the deep learning technique of pre-training [42], which prepares a universal representation on a larger data corpus at a high computational cost and then finalizes the training of personalized models with a smaller amount of data and a lower computational cost.

Conclusions
Estimation of the speed of motion from IMUs attached to a human body has been dominated by feature-based methods. We have collected a dataset of eight subjects with ground-truth speed and provide it publicly as a benchmarking dataset. We have demonstrated that recently developed deep learning architectures are able to provide much more accurate estimates than conventional methods on this dataset. Moreover, we have proposed a modification of the existing semi-supervised variational auto-encoder using a decoder in the form of a dense layer with sinusoidal activation functions. The proposed deep architecture was tailored for this application and outperformed, on average, the general-purpose state-of-the-art methods on this task. The estimation error was always evaluated on the test subject (unseen during training). The estimation error monotonically decreases with an increasing number of sensors on the body. The primary source of variability of the error is the human subject. However, even the subject with the largest error of the deep-learning method has a lower error than that of the best tested conventional feature-based method. On the other hand, the various architectures of deep networks yield comparable performance. Therefore, for future work we recommend focusing on data acquisition and subject variability rather than on details of the network architectures.

Institutional Review Board Statement: The study was conducted according to the guidelines of the Declaration of Helsinki, and approved by the Institutional Ethics Committee of the University of West Bohemia, protocol ZCU 001806/2022 from 24 January 2022.
Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.
Data Availability Statement: All data collected during this study and all code to reproduce our experiments is available from: https://github.com/Josef4Sci/DeepGait/, accessed on 14 May 2022.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Details of Hyperparameter Selection
In this section, we provide a review of the top-performing hyperparameters selected for each of the tested deep architectures. The hyperparameters are sorted according to their mean testing error. Note that the performance of each architecture is quite consistent, and the differences between hyperparameter settings are often smaller than the differences between the methods. The tables below show the three best errors with the corresponding hyperparameter settings from the 15 draws for each architecture.