Deep Learning in Gait Parameter Prediction for OA and TKA Patients Wearing IMU Sensors

Quantitative assessments of patient movement quality in osteoarthritis (OA), specifically spatiotemporal gait parameters (STGPs), can provide in-depth insight into gait patterns, activity types, and changes in mobility after total knee arthroplasty (TKA). A study was conducted to benchmark the ability of multiple deep neural network (DNN) architectures to predict 12 STGPs from inertial measurement unit (IMU) data and to identify an optimal sensor combination, which has yet to be studied for OA and TKA subjects. DNNs were trained using movement data from 29 subjects walking at slow, normal, and fast paces and evaluated with cross-validation over the subjects. Optimal sensor locations were determined by comparing prediction accuracy across 15 IMU configurations (pelvis, thighs, shanks, and feet). Percent error across the 12 STGPs ranged from 2.1% (stride time) to 73.7% (toe-out angle), and predictions were overall more accurate for temporal parameters than for spatial parameters. The most and least accurate sensor combinations were feet-thighs and the singular pelvis, respectively. DNNs showed promising results in predicting STGPs for OA and TKA subjects from IMU signals and overcome the dependency on sensor location that can hinder the design of patient monitoring systems for clinical application.


Introduction
Quantitative assessments of movement quality in osteoarthritic (OA) and joint reconstruction patients, specifically spatial-temporal gait parameters (STGPs), provide valuable insight into gait patterns, activity type [1], risk of falling, and disease progression [2,3]. This diagnostic information is used in a number of applications that include development of personalized treatment plans, optimized post-operative rehabilitation, monitoring changes in mobility of patients after surgery [4][5][6][7], advancement of promising new interventions, and reducing overall medical costs [2]. Conventional methods for measuring gait characteristics, which include motion capture (MOCAP) systems and force plates, require a laboratory environment and expensive, time-consuming equipment [8]. In contrast, wearable sensors, specifically inertial measurement units (IMUs), are lightweight, inexpensive, and mobile. IMU measurement fidelity has improved significantly in recent years, and IMUs have been used in various applications including 3D character animation, robotics, automotive vehicles, drones, and human motion measurement [9].
Processing streams of IMU data to extract clinically meaningful movement characteristics, such as activity classification, spatial-temporal parameters, gait pathology, and gait phase detection is challenging [10]. Several studies calculate spatial-temporal gait parameters by reconstruction of foot

Gait Measurements of Osteoarthritic and Total Knee-Replacement Subjects
Twenty-nine subjects participated in the study as part of a larger investigation: 14 subjects with OA (age = 67 ± 7 years, weight = 79 ± 12 kg, height = 168 ± 16 cm, 4 females and 10 males) and 15 subjects with total knee arthroplasty (TKA) (age = 68 ± 4 years, weight = 76 ± 14 kg, height = 164 ± 9 cm, 11 females and 4 males, 7 uni-lateral and 8 bi-lateral). All participants signed a consent form prior to the experiment with IRB approval (# 1328728). Subjects were fitted with 71 reflective markers on anatomical landmarks and 17 IMUs on various limb segments and the trunk. For this study, only the 7 IMUs located on the feet, shanks, thighs [35,36], and pelvis [37] were used in the subsequent data analysis (Figure 1a,b). Subjects performed 15 trials of a 5-m walking task at three different speeds, self-selected, slow, and fast, to cover the entire range of possible daily walking paces. During fast walking, subjects were instructed to walk at their maximum comfortable speed without running (brisk walking), typified by longer steps at a faster cadence. During slow walking, subjects were instructed to walk at their slowest speed, typified by shorter steps at a slower cadence. During the walking tests, synchronized data were collected from a 13-camera Vicon motion capture system (Centennial, CO), 4 Bertec force platforms (Columbus, OH), and the IMUs (Xsens, Enschede, Netherlands) (Figure 1a). The sampling frequencies of the force data, MOCAP, and IMUs (free acceleration and angular velocity) were 1000 Hz, 100 Hz, and 40 Hz, respectively.

Gait Data Processing
MOCAP data were segmented into a full stride for each leg based on two successive heel strikes identified using the heel markers' vertical position [38]. For each full stride, the heel strike and toe off times, spatial characteristics (step length, stride length, step width, and toe out angle), temporal characteristics (step time, stride time, stance time, swing time, single support time, and double support time), and general characteristics (cadence and speed) were calculated [39,40].
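The temporal characteristics above follow directly from the detected gait-event times. A minimal sketch of those definitions, with illustrative function and variable names (not the study's code):

```python
# Hypothetical sketch: deriving temporal gait parameters from gait-event
# timestamps (heel strikes and toe-offs, in seconds) for one full stride,
# following the definitions in the text. Names are illustrative.

def temporal_parameters(hs_ipsi, to_ipsi, hs_contra):
    """Temporal parameters for one full stride of the ipsilateral leg.

    hs_ipsi   : (hs1, hs2) two successive ipsilateral heel-strike times
    to_ipsi   : ipsilateral toe-off time between hs1 and hs2
    hs_contra : contralateral heel-strike time between hs1 and hs2
    """
    hs1, hs2 = hs_ipsi
    stride_time = hs2 - hs1        # heel strike to next ipsilateral heel strike
    stance_time = to_ipsi - hs1    # foot in contact with the ground
    swing_time = hs2 - to_ipsi     # foot in the air
    step_time = hs_contra - hs1    # ipsilateral HS to contralateral HS
    cadence = 120.0 / stride_time  # steps/min (two steps per stride)
    return {
        "stride_time": stride_time,
        "stance_time": stance_time,
        "swing_time": swing_time,
        "step_time": step_time,
        "cadence": cadence,
    }
```

Single and double support times follow analogously from the contralateral toe-off and heel-strike events within the stride.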
IMU data for each trial were up-sampled to 100 Hz and segmented into full strides for each leg based on the angular velocities of the feet sensors in the sagittal plane using a peak detection method (Figure 1c,d) [41,42]. The mean absolute errors between heel strike and toe-off events identified using the IMU and MOCAP data were 0.02 ± 0.01 and 0.04 ± 0.01 s, respectively. Linear accelerations and angular velocities from the IMU-based coordinate systems for left legs were reflected about the medio-lateral axis to provide consistent anatomical directions for left and right limb segments. The IMUs' six channels of acceleration and angular velocity data were normalized using the maximum sensor acceleration and angular velocity ranges. A zero-padding technique was used to ensure the IMU data sets had a consistent length of 212 points prior to use in the deep-learning models [24]. The IMU data for each stride segment were labeled with the gait characteristics calculated from the MOCAP data for use in the subsequent supervised machine-learning models. The gait data processing yielded 3778 segmented and labeled strides from the 29 subjects. A descriptive statistical analysis was conducted on the measured spatial, temporal, and general gait parameters to characterize the dataset, including the mean, standard deviation, coefficient of variation, and interquartile range for the knee OA and TKA subject cohorts at the three paces: slow, normal, and fast.
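The range normalization and zero-padding steps can be sketched as follows; the full-scale range values are placeholders, not the study's actual Xsens sensor specifications:

```python
import numpy as np

# Illustrative preprocessing sketch: normalize the six IMU channels by the
# sensor's full-scale range and zero-pad each stride segment to a fixed
# length of 212 samples, as described above. Range constants are
# assumed placeholders, not the study's sensor specifications.

ACC_RANGE = 160.0                    # assumed accelerometer full scale, m/s^2
GYR_RANGE = 2000.0 * np.pi / 180.0   # assumed gyroscope full scale, rad/s

def preprocess_stride(segment, target_len=212):
    """segment: (n_samples, 6) array of [ax, ay, az, gx, gy, gz]."""
    seg = segment.astype(float).copy()
    seg[:, :3] /= ACC_RANGE          # normalize accelerations
    seg[:, 3:] /= GYR_RANGE          # normalize angular velocities
    n = seg.shape[0]
    if n > target_len:
        raise ValueError("stride segment longer than target length")
    padded = np.zeros((target_len, 6))
    padded[:n] = seg                 # zero-padding at the tail
    return padded
```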

Preliminary Neural Network Architecture Benchmarking and Selection
Six contemporary multivariate time-series neural network architectures were utilized to predict stride length from our subject cohort based solely on the feet IMU data (Table 1). Stride length and the feet IMUs were chosen to enable benchmarking prediction accuracy against published studies. For network training, 80% of strides from 26 of the 29 subjects were randomly allocated to the training set and the remaining 20% of strides from the same subjects were allocated to a validation set. Strides from the final three subjects not included in the training set, one OA subject, one uni-lateral TKA subject, and one bi-lateral TKA subject, were allocated to a test set. Network prediction accuracy was assessed using 5-fold cross-validation, with the training, validation, and test sets randomly reallocated for each fold. The architecture with the lowest errors for both the validation and test sets was selected for a design of experiments on the prediction of STGPs with different sensor numbers and locations. Neural networks were trained using backpropagation and stochastic gradient descent optimization to minimize the loss function, the mean squared error (MSE) between the model-predicted and labeled stride lengths:

MSE = (1/n) Σ_{i=1}^{n} (ŷ_i − y_i)²,

where ŷ_i was the model-predicted stride length, y_i was the labeled stride length, and n was the total number of strides in the training set. An adaptive learning rate optimization method (Adam) with a learning rate, beta-1, and beta-2 of 0.001, 0.9, and 0.999, respectively, was used for training all networks [43] for a total of 300 epochs. Once each network was trained, the predictive accuracy was quantified by calculating the mean error (ME) and the mean absolute error (MAE) between the predicted and measured stride lengths for both the validation and test sets. ME was calculated to enable comparison with previously published studies, whereas the MAE provides a better metric for true prediction accuracy.
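The stated hyperparameters (learning rate 0.001, beta-1 0.9, beta-2 0.999) match the common defaults of the Adam optimizer [43]. A minimal NumPy sketch of that update rule on a toy one-parameter MSE problem, purely for illustration (not the study's training code):

```python
import numpy as np

# Minimal sketch of the adaptive optimizer update with the stated
# hyperparameters, applied to a toy least-squares problem with a single
# weight. Illustrative only; the study trained full networks in Keras.

def adam_minimize(grad_fn, w0, lr=0.001, b1=0.9, b2=0.999, eps=1e-8,
                  steps=10000):
    w, m, v = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = b1 * m + (1 - b1) * g          # first-moment estimate
        v = b2 * v + (1 - b2) * g * g      # second-moment estimate
        m_hat = m / (1 - b1 ** t)          # bias correction
        v_hat = v / (1 - b2 ** t)
        w -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

# Toy MSE loss L(w) = mean((w*x - y)^2) with y = 2x, so the minimum is w = 2.
x = np.array([1.0, 2.0, 3.0])
y = 2.0 * x
w = adam_minimize(lambda w: np.mean(2.0 * (w * x - y) * x), w0=0.0)
```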
To enable an equitable comparison of the prediction accuracy across various gait characteristics with different magnitudes and units, the absolute error was divided by the mean of the labeled test data resulting in the normalized absolute percent error (NAPE).
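The three accuracy metrics defined above can be sketched directly:

```python
import numpy as np

# Sketch of the three accuracy metrics used here: mean error (ME), mean
# absolute error (MAE), and normalized absolute percent error (NAPE),
# computed between predicted and labeled (MOCAP-derived) gait parameters.

def gait_metrics(y_pred, y_true):
    y_pred = np.asarray(y_pred, float)
    y_true = np.asarray(y_true, float)
    me = np.mean(y_pred - y_true)            # signed bias
    mae = np.mean(np.abs(y_pred - y_true))   # true prediction accuracy
    nape = 100.0 * mae / np.mean(y_true)     # unit-free, enables comparison
    return me, mae, nape
```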

Assessing Optimal Sensor Combinations for Each Gait Characteristic
Based on the results of the preliminary neural network architecture selection, the 1D convolutional neural network (CNN) architecture proposed by Zrenner et al. was chosen for a larger design-of-experiments study on sensor combinations [19]. This network consisted of two convolutional layers followed by two max pooling layers, a flattening layer, and two fully-connected layers.
Rectified linear unit (ReLU) activation functions were placed after each layer. Keras with a TensorFlow backend was used for training the architecture [44,45].
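The two building blocks of this architecture, a 1D convolution with ReLU activation followed by max pooling, can be sketched in plain NumPy; kernel size and filter count below are placeholders, not the paper's exact hyperparameters:

```python
import numpy as np

# Forward-pass sketch of a 1D convolution (valid padding), ReLU, and max
# pooling over a stride segment of 212 time steps x 6 IMU channels.
# Layer sizes are illustrative placeholders.

def conv1d(x, w, b):
    """x: (length, ch_in); w: (k, ch_in, ch_out); b: (ch_out,)."""
    k = w.shape[0]
    out_len = x.shape[0] - k + 1
    out = np.empty((out_len, w.shape[2]))
    for i in range(out_len):
        # sum over the kernel window and input channels
        out[i] = np.tensordot(x[i:i + k], w, axes=([0, 1], [0, 1])) + b
    return out

def relu(x):
    return np.maximum(x, 0.0)

def maxpool1d(x, size=2):
    n = (x.shape[0] // size) * size
    return x[:n].reshape(-1, size, x.shape[1]).max(axis=1)

rng = np.random.default_rng(0)
x = rng.standard_normal((212, 6))                 # one stride segment
w1 = rng.standard_normal((5, 6, 8)) * 0.1         # kernel 5, 8 filters (assumed)
h = maxpool1d(relu(conv1d(x, w1, np.zeros(8))))   # shape: (104, 8)
```

In the actual study this stack was implemented in Keras, followed by a flattening layer and two fully-connected layers producing the scalar gait-parameter prediction.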
A full factorial design of experiments was implemented to analyze the prediction accuracy based on 15 unique combinations of the feet, pelvis, shank, and thigh sensors (Table 2). Leveraging the ensemble approach proposed by Hannink et al., individual CNNs were trained using the segmented and labeled stride IMU data to predict each of the 12 spatial, temporal, and general gait parameters (Figure 2) for each unique sensor combination [24]. The same training set definitions, 5-fold cross-validation, and training approaches were used as in the preliminary analysis. Likewise, the same MAE and NAPE error estimates were calculated for each gait parameter with each sensor combination. The Friedman test, a non-parametric analog to a repeated-measures analysis of variance (ANOVA), was used to detect statistically significant differences in prediction accuracy (NAPE) across sensor combinations. Stepwise Dunn's post hoc tests with Bonferroni correction for multiple testing were performed to establish significant differences (adjusted significance level: 0.05/105 = 0.000476). To determine an overall optimal sensor combination, sensor combinations were ranked based on the Friedman ranking, averaged across all the gait parameters for each sensor combination [46].
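The 15 combinations in the full factorial design are simply the non-empty subsets of the four sensor groups, which also explains the 105 pairwise comparisons (15 choose 2) behind the Bonferroni correction:

```python
from itertools import combinations
from math import comb

# The full-factorial sensor design: every non-empty subset of the four
# sensor groups (feet, pelvis, shanks, thighs) yields the 15 combinations
# evaluated in the study.

groups = ["feet", "pelvis", "shanks", "thighs"]
sensor_combos = [c for r in range(1, len(groups) + 1)
                 for c in combinations(groups, r)]

n_pairwise = comb(len(sensor_combos), 2)   # 105 post hoc comparisons
```

Prediction accuracies across these combinations can then be compared with a Friedman test (available, for example, as scipy.stats.friedmanchisquare).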

Spatial-Temporal Gait Parameters (STGPs) Statistical Analysis
The OA group demonstrated larger step width (+2.8 cm) and toe-out angle (+4.9 deg), as well as smaller step length (−1.9 cm), stride length (−3.9 cm), double support time (−0.1 s), and speed (−3.4 cm/s) on average across the three paces compared to the TKA group. In general, OA patients demonstrated greater variation (standard deviation (SD), coefficient of variation (CV), and range) in all but two of the measured STGPs compared to TKA patients. Increases in variability (SD) were also observed for step length, stride length, cadence, and speed in the fast trials of both the OA and TKA groups compared to the normal and slow paces (Table 3).

Benchmarking Neural Network Architecture
MAE for stride length ranged from 2.9 ± 2.6 cm to 6.9 ± 3.2 cm for the validation set and 7.6 ± 6.1 cm to 11.9 ± 7.1 cm for the test set (Table 4). The CNN architecture proposed by Zrenner et al. yielded the lowest MAE for both the validation and test data sets, and the lowest ME, NAPE, and ME standard deviation for the test set, indicating negligible bias and low variance in the stride length predictions. Additionally, this network architecture included only 148,529 parameters, fewer than the other networks, reducing the computational cost of training and the risk of overfitting.
The Friedman test indicated statistically significant differences (p = 0.001) between sensor combinations. Multiple pairwise comparisons based on the Friedman ranking are displayed as homogeneous subsets in Table 6. Consistent with the mean NAPE ranking, the feet-thigh and feet-shank sensor combinations ranked first and second in the Friedman ranking. There was no statistically significant difference between feet-thigh and feet-shank (adjusted p-value = 0.077), while there was a statistically significant difference between feet-thigh and the rest of the sensor combinations (adjusted p-value < 0.001). The pelvis sensor had the lowest accuracy, with a significant difference compared to the other homogeneous subsets of sensor combinations.

Discussion
The primary outcome of this study was the development of a robust deep-learning framework to predict diagnostic gait parameters for subjects with OA and TKA and to investigate the effect of various sensor combinations on prediction accuracy. A simple ensemble deep neural network with two layers of 1D-CNNs demonstrated robust performance in predicting each STGP compared to more complex networks. A design of experiments conducted on 15 combinations of sensors and locations for different patient populations and gait paces revealed how the prediction accuracy of STGPs can change across conditions and why identifying a single optimal sensor combination can be challenging. Overall, feet sensors combined with either shank or thigh sensors produced the highest accuracy for most STGPs, and the isolated pelvis sensor showed the lowest accuracy.
The CNN architecture proposed by Zrenner et al. resulted in the lowest MAE and the lowest standard deviations for both the validation and test subject datasets, with errors of 2.9 ± 2.6 cm and 7.6 ± 6.1 cm, respectively [19]. Both Hannink et al. and Zrenner et al. reported predictive error for stride length using unique datasets, enabling a direct comparison with our results [19,24]. Hannink et al. predicted stride length based on more than 1300 strides from 101 geriatric subjects, with a mean error of −0.15 ± 6.09 cm compared to our error of −2.2 ± 9.7 cm using the same network architecture. They used a larger number of subjects (n = 101) compared to our study (n = 29), giving the network more unique data points to train on. We induced additional variability in our dataset by asking subjects to walk at three different paces, all determined by the subjects. However, since this additional variability was mainly within-subject and may have contained a large amount of replication, it resulted in a slightly larger mean error compared to Hannink et al. In general, given the large standard deviations in both studies, this difference was trivial. In this context, while our dataset had considerable variability, it likely had less variability than the running dataset employed by Zrenner et al., which reported a mean predictive error in stride length of 2.5 ± 20.1 cm and a mean absolute error of 15.3 cm. The robustness of these CNN architectures for prediction of stride length points to the validity of using deep learning for this application, but also suggests that prediction accuracy is reduced when variability in the dataset is increased.
Direct comparisons of prediction accuracy between the current study and previous studies across all the STGPs are difficult due to differences in subject characteristics, dataset size, and experimental procedures. Comparable reported results from Hannink et al. for geriatric subjects, from Zrenner et al. for runners, and from Carcreff et al. for youths with cerebral palsy for sensor combinations and gait parameters are compiled in Table 7. Specifically, when comparing spatial parameter predictions using feet sensors, our results were within the range reported by previous studies. However, our results showed a larger mean error in the prediction of stride length and step width compared to Hannink et al., which could be attributed to the larger number of subjects in Hannink et al. (n = 101) compared to our study. Diseases that induce pathologic movements, like OA, inherently increase the variability in gait parameters. Accordingly, prediction accuracy for the TKA group (NAPE of 14.7%) was higher than for the OA group (19.0%). When accounting for this limitation, our errors and standard deviations were comparable to previously reported results.
The neural networks trained on all sensor combinations predicted spatial, temporal, and general parameters with varying levels of accuracy. The NAPE averaged across all sensor combinations for step length, stride length, step width, and toe-out angle was 8.6 ± 0.7%, 7.8 ± 0.7%, 38.5 ± 1.8%, and 77 ± 2%, respectively. The increased predictive error for step width and toe-out angle was likely associated with the smaller mean movements for those parameters, reducing the signal-to-noise ratio compared to the larger sagittal plane motions. For temporal parameters, the NAPE ranged from 2.3 ± 0.1% for stride time to 24.9 ± 1.5% for double support time. For the general parameters, the NAPE was 3.5 ± 0.2% and 7.5 ± 0.8% for cadence and speed, respectively. Descriptive statistical analysis of the STGPs in our dataset revealed that neural network predictions were more accurate for parameters with a lower coefficient of variation (CV). The CV was defined as the ratio of the standard deviation to the mean, an indicator of the dispersion of a probability distribution [48]. This was evident in the lower prediction accuracy observed for step width, toe-out angle, double support time, and speed, which had larger CVs than the other parameters (Table 3).
Table 7. Deep-learning accuracy comparison with previous studies for (a) spatial parameters, (b) general, and (c) temporal parameters.
The differences in predicted accuracy for OA versus TKA groups were multifactorial. First, there was more variability in the gait of OA subjects due to their pathology, which makes it harder to predict certain STGPs. This higher variability for OA subjects was expressed by higher standard deviations and coefficients of variation for all gait parameters except toe-out angle, stride time, and cadence (Table 3). Second, the accuracy in prediction of STGPs was slightly higher at normal and slow walking compared to fast walking, with a mean NAPE of 15.8% for normal and slow versus 17.6% for fast walking. This aligns with findings by Zrenner et al. that indicated increasing speed could negatively impact predictive accuracy due to higher variability at fast walking (Table 3) [19]. Stressing the OA group with higher-demand walking at a fast pace resulted in even greater variability and decreased predictive accuracy.
Perhaps most important, one of the randomly selected OA test subjects (subject S21, see Appendix A Figures A1 and A2) walked with the shortest step length, shortest stride length, largest step width, and slowest speed among all subjects in the study, making this subject an outlier. Since our sample size was small, the impact of a single outlier was amplified and negatively affected prediction results. The Jensen-Shannon divergence, which measures the similarity between two probability distributions, showed a larger divergence for subject S21 compared to the other two subjects in the same fold (S19 and S27). The divergences of step length, stride length, and step width for subject S21 were 5.67, 7.01, and 5.96, while for subjects S19 and S27 the divergences were 0.18, 0.17, and 0.10 and 0.38, 0.59, and 0.67, respectively (see Appendix A Table A1). The divergence of S21 from the distribution of subjects used to train the CNNs resulted in poor performance, driving up the reported error for the OA cohort. Removing subject S21 from the test set reduced the mean and median NAPE from 19.0% and 6.6% to 17.3% and 4.8%, which is comparable to the TKA group. Subject S21 had severe OA of the right knee, which caused pain during activities of daily living and manifested in a noticeable limp on the affected limb compared to the other subjects in the OA cohort. Investigation of the NAPE from the validation set revealed almost equal performance for both knee groups, with mean and median NAPEs of 7.7% and 2.9% for OA and 7.4% and 2.8% for TKA. The impact of subject S21 in the test set is an example of how CNNs perform poorly when faced with data that are outside the distribution of the training data, which is one of the main challenges in the use of machine-learning models for real-world applications. Hence, gaining intuition on training data set completeness is important prior to interpreting prediction accuracy.
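A minimal NumPy sketch of the Jensen-Shannon divergence between two discrete gait-parameter histograms; note that this sketch uses the common base-2 convention (bounded by 1), whereas the study's reported values exceed 1, so a different base or aggregation was presumably used there:

```python
import numpy as np

# Sketch of the Jensen-Shannon divergence used to quantify how far a test
# subject's gait-parameter distribution lies from the training
# distribution. Inputs are discrete probability histograms; binning
# choices are left to the caller and are illustrative.

def js_divergence(p, q, eps=1e-12):
    p = np.asarray(p, float); p = p / p.sum()
    q = np.asarray(q, float); q = q / q.sum()
    m = 0.5 * (p + q)                      # mixture distribution

    def kl(a, b):
        a = a + eps; b = b + eps           # avoid log(0)
        return np.sum(a * np.log2(a / b))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Unlike the Kullback-Leibler divergence it is built from, the JS divergence is symmetric and finite, which makes it convenient for comparing subject distributions.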
Out-of-distribution detection has recently been recognized as an important area for developing trustworthy machine learning [49,50] and will continue to be addressed in this work as patient numbers increase.
Our statistical analysis indicated statistically significant differences in accuracy between the various sensor combinations tested across all conditions. The F-T combination was the highest-ranked sensor combination based on the Friedman test, showing a significant improvement in accuracy with respect to every other sensor combination except F-S. It should be noted that although statistically significant, the differences between the most and least accurate sensor combinations were small. The best sensor combination based on mean NAPE was F-T-S (15.25%), while the worst was F-P (16.65%). Similarly, the Friedman test indicated that F-T, F-S, and F-P-T were the top three ranked sensor combinations for TKA subjects, while T, F-S, and F-T were the top three for OA subjects (see Appendix A Table A2a,b). F-T and F-S were the sensor combinations common to both the OA and TKA groups. In addition, the F-T combination was also among the top three for the slow, normal, and fast walking paces (see Appendix A Table A2c,e). As noted earlier, while the F-T sensor combination proved to be statistically better than other combinations, a 2-5% improvement in overall STGP prediction accuracy may be impactful in certain clinical applications. For instance, given the small difference in stride length at the normal pace between the OA and TKA groups (~3 cm), higher-accuracy predictions may be necessary for diagnostic purposes. However, higher accuracy may not be important for parameters with large differences between patient groups. This is an advantage of data-driven modeling compared to other algorithm-based techniques in the prediction of STGPs. If the accuracy of STGPs is not largely impacted by sensor combination, there is freedom to design patient monitoring systems for specific patient groups based on other factors, such as cost and patient compliance.
Feet sensors were necessary for stride segmentation during gait, which is an input to the trained models. Therefore, including feet sensors is imperative for using this data-driven approach. Testing these sensor combinations on more complex tasks, such as climbing stairs and sit-to-stand, and evaluating other joint kinematic and kinetic parameters would be necessary to clarify the value of particular sensor combinations.
There are limitations to this study that should be considered. This study focused on gait to demonstrate the ability to predict STGPs from IMU data. In the OA population, other activities of daily living that place a greater demand on the subject will likely provide additional clinical value. The methods demonstrated in this study can be extended to predict analogous spatial-temporal parameters for activities that include stair ascent/descent, sit-to-stand, and other high-demand activities. This study was also limited in the number of subjects included. It demonstrated acceptable accuracy with 3778 segmented and labeled strides from the 29 subjects; increasing the number of subjects and labeled strides should improve the predictive accuracy. Like other data-driven approaches, the trained networks described in this study are only suitable for the selected population. There are also practical limitations to deploying our algorithm to a large patient population outside of a laboratory environment, including variability in sensor placement, reduced signal quality from low-cost IMUs, soft-tissue artifacts for high-body-mass-index patients, and identification of patients with gait parameters outside the training data set. To implement this workflow for other populations with movement impairments that would benefit from patient monitoring, such as patients with cerebral palsy or stroke, the algorithm would need to be re-trained with data from these populations. However, with this initial model architecture defined and trained, a transfer-learning approach could be used on other populations to drastically reduce training time and the need for high volumes of data.

Conclusions
This study demonstrated that a deep-learning, data-driven approach was able to predict spatial-temporal gait characteristics of OA and TKA patients based on signals from IMU sensors. Using a comprehensive analysis of various sensor combinations and their sensitivity to STGPs, patient population, and walking pace, our results showed that deep learning can overcome the dependency on sensor location that hinders the design of patient monitoring systems and negatively impacts patient compliance. Additionally, we demonstrated the importance of sufficient variability in training and test data as a critical factor in the performance of DL models, especially for clinically relevant data with small sample sizes. A system that is able to leverage data streams from wearable sensors to produce real-time monitoring of STGPs in OA and TKA patients has the ability to improve clinical care and patient quality of life.
Table A1. Jensen-Shannon (JS) divergence (relative entropy) between the test subjects and the training data for the fold containing subjects S19, S21, and S27.