Is a Wearable Sensor-Based Characterisation of Gait Robust Enough to Overcome Differences Between Measurement Protocols? A Multi-Centric Pragmatic Study in Patients with Multiple Sclerosis

Inertial measurement units (IMUs) allow accurate quantification of gait impairment of people with multiple sclerosis (pwMS). Nonetheless, it is not clear how IMU-based metrics might be influenced by pragmatic aspects associated with clinical translation of this approach, such as data collection settings and gait protocols. In this study, we hypothesised that these aspects do not significantly alter those characteristics of gait that are more related to quality and energetic efficiency and are quantifiable via acceleration-related metrics, such as intensity, smoothness, stability, symmetry, and regularity. To test this hypothesis, we compared 33 IMU-based metrics extracted from data retrospectively collected by two independent centres on two matched cohorts of pwMS. As a worst-case scenario, the walking test was performed in the two centres at different speeds along corridors of different lengths, using different IMU systems, which were also positioned differently. The results showed that the majority of the temporal metrics (9 out of 12) exhibited significant between-centre differences. Conversely, the between-centre differences in the gait quality metrics were small and comparable to those associated with a test-retest analysis under equivalent conditions. Therefore, the gait quality metrics are promising candidates for reliable multi-centric studies aiming at assessing rehabilitation interventions within a routine clinical context.


Introduction
Multiple sclerosis (MS) is a chronic demyelinating disease of the central nervous system affecting 2.3 million people worldwide [1]. MS is the major non-traumatic cause of disability in young and middle-aged adults [2], with a significant negative impact on independence and social participation [3]. Walking impairment is one of the most common functional deficits due to MS, even in the early stages to those observable between two sessions performed by the same centre (between-day test-retest reliability).

Participants
Two research centres, one located in Italy (centre A) and one in the United Kingdom (centre B), provided retrospective IMU data collected while pwMS walked back and forth for 6 min along a hospital corridor. The patients' level of disability was assessed with the Expanded Disability Status Scale (EDSS), scored by an experienced neurologist. Patients were excluded if they had any orthopaedic, musculoskeletal, or neurological disorder other than MS that may have affected their gait and balance. Since there were no restrictions on MS subtype, both patients with relapsing-remitting MS who were relapse-free for 30 days prior to assessment (centre A) and patients with secondary progressive MS (centre B) were included in the study. Thirteen pwMS were selected from each data set to form two cohorts, with individual patients matched if they had the same age, gender, EDSS score, and type of assistive device (Table 1). As a result of this matching, the sample size, percentage of females, EDSS score distribution, number of pwMS who required an assistive device, and type of assistive device used during the walking test were the same in the two centres. The average walking speed was calculated as the total distance walked during the test divided by the duration of the walking trial.
Table 1. Clinical characteristics of people with multiple sclerosis for centre A and centre B. Abbreviations: expanded disability status scale (EDSS); people with multiple sclerosis (pwMS); Mann-Whitney U (MWU) statistic; p-value (p); chi-square (χ²).
pwMS from centre B repeated the instrumented walking test on a second visit, which was held 7-14 days after the first test at the same time of day. The testing procedures were also kept constant between the two sessions. These data were used to assess between-day test-retest reliability.
Institutional review boards or ethics committees at the institutions in each country approved the separate protocols (NRES Committee Yorkshire & The Humber-Bradford Leeds (reference 15/YH/0300) and Ethical Committee of Don Carlo Gnocchi Foundation, Milan, Italy, references 29-03-2017 and 13-02-2019). Written informed consent was provided by all subjects. Data were collected in accordance with the International Declaration of Helsinki.

Experimental Protocol
Acceleration and angular velocity data from three IMUs, located at the fifth lumbar vertebra and around the right and left ankles, were recorded in both centres while pwMS walked back and forth for 6 min along a straight corridor free of obstacles and other people. If needed, they could use an assistive device and take short resting breaks while standing. Each IMU was manually aligned along the anatomical antero-posterior (AP), medio-lateral (ML), and vertical (V) axes.
The differences between the experimental protocols followed by centre A and centre B were: (i) device manufacturers and sampling frequency used to record acceleration and angular velocity signals; (ii) ankle IMU position; (iii) length of the walkway; (iv) instructions given to participants (Figure 1). Specifically, Xsens IMUs (unit weight 16 g, unit size 47 mm × 30 mm × 13 mm; MTw, Xsens, NL) with a sampling frequency of 75 Hz were used in centre A and OPAL IMUs (unit weight 22 g, unit size 48.5 mm × 36.5 mm × 13.5 mm; OPAL, APDM Inc., Portland, OR, USA) with a sampling frequency of 128 Hz were used in centre B. The IMUs around both ankles were placed laterally in centre A and frontally in centre B. PwMS were requested to walk at their maximum speed along a 30-meter straight corridor in centre A and at preferred comfortable speed along a 10-meter straight corridor in centre B.

Data Processing
Data processing routines were developed in Matlab® (MATLAB R2019b, MathWorks, Inc., Natick, MA, USA). A total of 33 IMU-based metrics were included in this analysis. IMU signals collected in centre B were down-sampled from 128 Hz to 75 Hz to match data from centre A, and the influence of down-sampling was investigated by comparing the outcome metrics from centre B as obtained before and after the down-sampling. Data from the lumbar IMU were reoriented to a horizontal-vertical coordinate system [48] and filtered with a 10 Hz cut-off, zero-phase, low-pass Butterworth filter.
The turning motion and resting breaks were detected and removed from IMU signals to isolate steady-state walking bouts, which were used to compute the metrics of interest. The approach proposed by Salarian, et al. [49] was adapted to determine 180° turns, which appear in the V component of the lumbar angular velocity, ω_z(t), as peaks of a given duration. The turning onset and offset were identified from the trunk rotation angle around the V axis, θ_z(t), obtained after integrating the ω_z(t) signal. The turning components were evidenced in θ_z(t) as steep positive or negative gradients, whereas walking components were evidenced as small oscillations around a flat line. Specifically, θ_z(t) was first smoothed using a weighted least-squares linear regression. Abrupt change points and their locations were then searched in θ_z(t) using a predefined Matlab® function based on the minimisation of a linear computational cost function [50]. Resting breaks were automatically detected by checking in 2-s window increments if: (i) the norm of the lumbar IMU angular velocity was less than 0.5 rad/s; (ii) the norm of the lumbar IMU acceleration was within ±10% of 9.81 m/s² [51]. A 2-s window was considered motionless if more than 50% of its samples fulfilled both criteria mentioned above.
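The resting-break rule described above can be illustrated with a minimal Python sketch. It assumes consecutive, non-overlapping 2-s windows (the text does not state whether windows overlap) and takes precomputed angular velocity and acceleration norms as inputs. The function and parameter names are hypothetical and do not reproduce the authors' Matlab routines.

```python
G = 9.81  # gravitational acceleration, m/s^2


def motionless_windows(gyro_norm, acc_norm, fs, win_s=2.0,
                       gyro_thr=0.5, acc_tol=0.10, min_fraction=0.5):
    """Flag each consecutive window as motionless (True) or not (False).

    gyro_norm: norm of lumbar angular velocity per sample, rad/s
    acc_norm:  norm of lumbar acceleration per sample, m/s^2
    fs:        sampling frequency, Hz
    Assumption: windows are non-overlapping; a trailing partial
    window is ignored.
    """
    n = int(round(win_s * fs))  # samples per window
    flags = []
    for start in range(0, len(gyro_norm) - n + 1, n):
        hits = 0
        for w, a in zip(gyro_norm[start:start + n], acc_norm[start:start + n]):
            # Both criteria must hold for the same sample
            if w < gyro_thr and abs(a - G) <= acc_tol * G:
                hits += 1
        # Motionless only if MORE than 50% of samples meet both criteria
        flags.append(hits / n > min_fraction)
    return flags


# Hypothetical 75 Hz data: 2 s of standing followed by 2 s of walking
fs = 75
gyro = [0.1] * 150 + [1.5] * 150
acc = [9.8] * 150 + [12.0] * 150
flags = motionless_windows(gyro, acc, fs)
```

In this sketch, windows flagged `True` would be discarded before steady-state walking bouts are extracted.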
Twelve gait metrics were extracted from the angular velocities recorded from the ankle IMUs and 21 were extracted from the lumbar IMU accelerations. Following the suggestions of Lord, et al. [52] and Buckley, et al. [53], these metrics were organised in independent gait domains (e.g., rhythm, variability, asymmetry, intensity, stability, smoothness, symmetry, and regularity).
Initial and final foot contact instances, referred to as gait events (GE), were identified for each steady-state walking bout as local minimum values of the ML angular velocity recorded from ankle IMUs of both legs [54]. These minima occur just before and after the instant of maximum ML angular velocity. Once the GE were determined, stride, step, swing and stance durations (representing rhythm domain) were separately estimated for left and right sides. Variability (i.e., within-subject combined standard deviation of left and right; variability domain) and asymmetry (i.e., absolute difference between the mean of left and right time series; asymmetry domain) of these metrics were also computed, applying the established formula in Galna, et al. [55] and Godfrey, et al. [56].
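As an illustration of the variability and asymmetry definitions above, the following Python sketch computes both from left and right stride durations. It assumes the combined standard deviation is the square root of the mean of the left and right sample variances; this is one common reading of "combined standard deviation" and should be checked against the formula in [55]. All names and values are hypothetical.

```python
import math
from statistics import mean, variance


def combined_sd(left, right):
    """Variability: combined SD of left and right time series.

    Assumption: sqrt of the average of the two sample variances.
    """
    return math.sqrt((variance(left) + variance(right)) / 2.0)


def asymmetry(left, right):
    """Asymmetry: absolute difference between left and right means."""
    return abs(mean(left) - mean(right))


# Hypothetical stride durations in seconds
left_stride = [1.0, 1.2, 1.1, 1.3]
right_stride = [1.1, 1.1, 1.1, 1.1]

stride_variability = combined_sd(left_stride, right_stride)
stride_asymmetry = asymmetry(left_stride, right_stride)
```

The same two functions would apply unchanged to step, swing, and stance durations.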
From processing the filtered acceleration signals in the time and frequency domains, 21 additional metrics, referred to as gait quality metrics [57], were separately extracted for each acceleration component (AP, ML, and V): (i) intensity as the root mean square (RMS) of each acceleration component around its mean value [44]; (ii) stability as the ratio of the RMS in a given direction to the RMS vector magnitude [58]; (iii) smoothness as the RMS of the jerk [59]; (iv) symmetry represented by the harmonic ratio (HR), defined as the ratio of the sum of the amplitudes of the in-phase harmonics to the sum of the amplitudes of the out-of-phase harmonics [60,61]; (v) regularity as the ensemble of the following three metrics obtained from the unbiased normalised autocorrelation [62]: step regularity, stride regularity, and the regularity index, defined as

$$\text{Regularity index} = \frac{\text{Stride regularity} - \text{Step regularity}}{\operatorname{mean}(\text{Stride regularity},\ \text{Step regularity})} \qquad (3)$$

All metrics were calculated for the part of the signals corresponding to the middle eight steps of each pass along the corridor and then averaged over the whole trial. Eight steps were chosen because this was the maximum number of steps that subjects in centre B could take while walking completely straight. Since centre A adopted a three-times longer path, in order to process the same number of steps, only one walking bout in every three was included for centre B.
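Some of the gait quality metrics listed above can be sketched in a few lines of Python. This is a simplified illustration, not the authors' implementation, and it rests on several assumptions: the RMS vector magnitude is taken as the square root of the summed squared per-axis RMS values; jerk is approximated by a first difference; and the step and stride lags are supplied by the caller, whereas in practice they would be located as peaks of the autocorrelation sequence. The harmonic ratio is not covered. All function names are hypothetical.

```python
import math


def rms(x, about_mean=True):
    """Root mean square of x, optionally after removing its mean."""
    m = sum(x) / len(x) if about_mean else 0.0
    return math.sqrt(sum((v - m) ** 2 for v in x) / len(x))


def intensity(acc):
    """Intensity: RMS of one acceleration component around its mean."""
    return rms(acc, about_mean=True)


def stability(acc, ap, ml, v):
    """Stability: RMS of one component over the RMS vector magnitude.

    Assumption: magnitude = sqrt(RMS_AP^2 + RMS_ML^2 + RMS_V^2).
    """
    total = math.sqrt(intensity(ap) ** 2 + intensity(ml) ** 2
                      + intensity(v) ** 2)
    return intensity(acc) / total


def smoothness(acc, fs):
    """Smoothness: RMS of jerk (first-difference approximation)."""
    jerk = [(b - a) * fs for a, b in zip(acc, acc[1:])]
    return rms(jerk, about_mean=False)


def unbiased_autocorr(x, lag):
    """Unbiased autocorrelation at `lag`, normalised by the lag-0 value."""
    n = len(x)
    m = sum(x) / n
    d = [v - m for v in x]
    c0 = sum(v * v for v in d) / n
    ck = sum(d[i] * d[i + lag] for i in range(n - lag)) / (n - lag)
    return ck / c0


def regularity_index(step_reg, stride_reg):
    """(Stride regularity - step regularity) over their mean."""
    return (stride_reg - step_reg) / ((stride_reg + step_reg) / 2.0)


# Synthetic signal: step period of 40 samples, stride period of 80 samples
sig = [math.sin(2 * math.pi * i / 40) + 0.3 * math.sin(2 * math.pi * i / 80)
       for i in range(800)]
step_reg = unbiased_autocorr(sig, 40)    # autocorrelation at the step lag
stride_reg = unbiased_autocorr(sig, 80)  # autocorrelation at the stride lag
ri = regularity_index(step_reg, stride_reg)
```

For this perfectly periodic synthetic signal the stride regularity is close to 1 and the regularity index is about 0.18, reflecting the deliberate difference introduced between consecutive steps.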

Statistical Analysis
Statistical analyses were performed in R version 3.4.3 [63]. Participant characteristics from centre A and centre B were compared using the Mann-Whitney U test for age and EDSS scores and Pearson's chi-square test for gender. Given the limited sample size and the non-normal distribution of most of the investigated metrics (as indicated by the Shapiro-Wilk test), non-parametric tests were performed. The level of significance was set at 5%. A Wilcoxon signed-rank test was performed to compare the centre B metrics obtained from IMU data sampled at 128 Hz with those obtained after down-sampling to 75 Hz.
Between-day test-retest reliability of the metrics was evaluated for centre B through the intra-class correlation coefficients (ICCs) with a 95% confidence interval (CI). ICCs were calculated using a two-way random-effect model and absolute agreement (ICC(2,k)) [64]. An ICC lower than 0.40 was classified as poor, an ICC between 0.40 and 0.59 as fair, an ICC between 0.60 and 0.74 as moderate, and an ICC of 0.75 or greater as excellent [65]. The minimum detectable change (MDC), representing the smallest amount of change that can be considered above the bounds of the measurement error and/or within-subject variability, was also computed for each metric at the 95% confidence level, according to Equation (4):

$$\mathrm{MDC} = 1.96 \times \sqrt{2} \times \mathrm{SEM}, \qquad \mathrm{SEM} = \mathrm{SD} \times \sqrt{1 - \mathrm{ICC}} \qquad (4)$$

where SEM is the standard error of the measurement and SD corresponds to the average of the standard deviations from test and re-test sessions [66]. A Wilcoxon signed-rank test was used to determine if there was a median difference in centre B metrics between the two sessions, whereas a Mann-Whitney U test was carried out to compare IMU-based metrics from centre A and centre B.
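The reliability quantities described above can be illustrated with a short Python sketch. It assumes the commonly used definitions SEM = SD × sqrt(1 − ICC) and MDC at 95% = 1.96 × sqrt(2) × SEM, with SD taken as the average of the test and re-test standard deviations, and it applies the ICC classification bands quoted in the text. Function names and input values are hypothetical.

```python
import math


def sem(sd_test, sd_retest, icc):
    """Standard error of measurement from the averaged test/re-test SD.

    Assumption: SEM = SD * sqrt(1 - ICC).
    """
    sd = (sd_test + sd_retest) / 2.0
    return sd * math.sqrt(1.0 - icc)


def mdc95(sem_value):
    """Minimum detectable change at the 95% confidence level.

    Assumption: MDC = 1.96 * sqrt(2) * SEM.
    """
    return 1.96 * math.sqrt(2.0) * sem_value


def classify_icc(icc):
    """Reliability band following the thresholds quoted in the text."""
    if icc < 0.40:
        return "poor"
    if icc < 0.60:
        return "fair"
    if icc < 0.75:
        return "moderate"
    return "excellent"


# Hypothetical metric: SD of 0.1 s in both sessions, ICC of 0.75
sem_value = sem(0.1, 0.1, 0.75)
mdc_value = mdc95(sem_value)
band = classify_icc(0.75)
```

A between-session change smaller than `mdc_value` would, under these assumptions, be indistinguishable from measurement error.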
In all the above tests, if the p-value was lower than 0.05, the null hypothesis (e.g., the two population medians were identical) was rejected and the alternative hypothesis accepted. To avoid misinterpretation of the p-values and to account for a type II error, the effect size (r) for non-parametric tests was also calculated as follows:

$$r = \frac{z}{\sqrt{N}} \qquad (5)$$

where z is the z-score and N is the size of the study (i.e., the number of total observations) on which z is based. Cohen [67] suggested thresholds of 0.1, 0.3, and 0.5 for small, medium, and large effect sizes, respectively. Median, inter-quartile range, minimum, and maximum values were finally calculated for IMU-based metrics from centre A and centre B (both sessions).
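The effect size calculation can be sketched as follows, assuming r = z / sqrt(N) and reporting its magnitude so that it can be compared directly with Cohen's thresholds. The z-score used here is a hypothetical value, not one taken from the study tables, and the label for values below 0.1 is our own.

```python
import math


def effect_size_r(z, n_obs):
    """Magnitude of the non-parametric effect size, |z| / sqrt(N)."""
    return abs(z) / math.sqrt(n_obs)


def cohen_label(r):
    """Label an effect size using Cohen's 0.1 / 0.3 / 0.5 thresholds."""
    if r >= 0.5:
        return "large"
    if r >= 0.3:
        return "medium"
    if r >= 0.1:
        return "small"
    return "below small"  # hypothetical label, not from the source


# Hypothetical between-centre comparison: 13 + 13 = 26 observations
r = effect_size_r(-1.8, 26)
label = cohen_label(r)
```

Reporting r alongside p is useful with small cohorts such as these, because a non-significant p-value can still coincide with a medium effect size.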

Effect of Sampling Frequency
The results of the comparison between the metrics calculated using the 128 Hz and 75 Hz sampling frequencies are reported in Table 2. The HR, representative of the symmetry domain, was the only metric that significantly differed between the two analyses.
Table 2. Effect of down-sampling of the acceleration and angular velocity signals on the investigated gait metrics. Abbreviations: sampling frequency (F_S), z-score (z), p-value (p), and effect size (r).

Between-Day Test-Retest Reliability
ICC, SEM, and MDC values for between-day assessment are shown in Table 3 for each metric estimated for pwMS from centre B who completed two testing visits. Overall, 17 out of 33 metrics revealed excellent test-retest reliability (ICC: 0.93-0.98; 95% CI: 0.76-0.93), 11 metrics showed moderate test-retest reliability (ICC: 0.88-0.92; 95% CI: 0.62-0.74), and only 5 metrics exhibited poor to fair test-retest reliability with ICC values between 0.72 and 0.86 and 95% CI between 0.13 and 0.52. The Wilcoxon signed-rank test showed no significant differences in any of the metrics between the two sessions ( Figure 2 and Table 4).

Figure 2. Minimum, first quartile (q1), median, mean, third quartile (q3), and maximum values of each IMU-based metric for centre A (red) and for centre B in the between-day test-retest assessment (blue empty boxplots and blue filled boxplots). Values larger than q3 + 1.5(q3 − q1) or smaller than q1 − 1.5(q3 − q1) are considered outliers and are represented with crosses (+). * p < 0.05. Note that, for graphical convenience, the absolute values have been depicted for the step regularity and regularity index in the ML direction.

Between-Centre Differences
As expected, the comparison between centre A and centre B via the Mann-Whitney U test highlighted significant differences for all the temporal metrics (Figure 2 and Table 5; rhythm domain), except for swing duration. Apart from asymmetry of step duration and asymmetry of swing duration, variability and asymmetry of the temporal metrics were significantly lower in centre A compared to centre B (Figure 2 and Table 5; variability and asymmetry domains). However, even though the difference in asymmetry of swing duration between the two centres was non-significant (U = 48.0; p = 0.06), a medium effect size was found for this specific metric (r = 0.37). Conversely, the two centres were consistent for 18 out of 21 metrics extracted from acceleration signals (Figure 2 and Table 5; intensity, stability, smoothness, symmetry, and regularity domains). Only the differences in the regularity index in the ML direction and in the HR in the AP and ML directions were statistically significant between centre A and centre B (Figure 2 and Table 5).
Table 5. Descriptive statistics for the investigated gait metrics from centre A and centre B (session 1), including the Mann-Whitney U (MWU) statistic, p-value (p), and effect size (r).

Discussion
This study aimed to identify comparable gait metrics as quantified from IMU data measured from two different hospital settings on two matched cohorts of pwMS (13 pwMS for each centre, Table 1), under the hypothesis that those metrics associated with the overall balance control and coordination of gait (i.e., gait quality metrics) would be robust, even when obtained from different experimental protocols. Reported results overall corroborated this assumption and showed that between-centre differences for most of these metrics were comparable to those obtained by the same centre in two different sessions.
The small sample size, resulting from the attempt of maximising the cohort match, is certainly a limitation of this study. It is worth noting, in fact, that while some of the investigated gait metrics in centre A (e.g., asymmetry of swing duration from asymmetry domain and regularity index from regularity domain) did not differ significantly from those in centre B, an observed medium effect size suggested the opposite might hold true (Table 5). This is indeed likely to be due to the small sample size and possibly due to the higher inter-subject variability observed in centre B.
Since MS is well known for heterogeneity of symptoms, high day-to-day fluctuations, and a large variability in its course [68], care must be taken before generalising our findings to all pwMS with different levels of gait impairment. Another limitation of this study might lie in the fact that patients recruited by the two centres differed in the subtypes of MS. Nonetheless, Dujmovic, et al. [69] showed that the altered gait pattern in pwMS did not depend on the disease phenotype. Additional studies are of course needed to further investigate this aspect.
The comparison between centre A and centre B implied down-sampling the data from the latter. As expected, this affected only the calculation of HR, which is the only metric based on frequency analysis. In particular, changing sampling frequency from 128 Hz to 75 Hz led to decreased values in the AP and V directions and increased values in the ML direction (Table 2). This is in line with what was previously reported by Riva, et al. [35].
Moderate to excellent between-day test-retest reliability was observed for 28 out of 33 IMU-based metrics; the remaining five exhibited poor to fair reliability (Table 3). Additionally, none of the investigated metrics differed significantly between the two sessions (Figure 2 and Table 4), although some of these results (swing duration in particular) should be interpreted with care because of the medium effect size. These findings confirm that sensor-based gait analysis is a reliable tool in pwMS, as also reported in previous test-retest studies on pwMS [34].
Walking speed clearly affected the gait outcomes. In particular, the gait metrics representative of rhythm, variability, and asymmetry domains were evidently lower in centre A compared to centre B ( Figure 2 and Table 5) due to different instructions given to the participants in terms of walking speed (i.e., walk at maximum speed versus walk at self-selected speed). This finding is in agreement with previous studies on pwMS [28] and on people with other neurological conditions, such as Parkinson's disease [70], which observed a reduction of the above metrics with increasing walking speed. The shorter length of the walkway used in centre B could also have contributed to these differences. In fact, Storm, et al. [22] demonstrated that rhythm and variability metrics decreased when walking longer distances (e.g., lower stride duration and lower variability of stride duration). However, the data available for our study did not allow us to separate walking speed and path effects, and further studies should hence be performed to this purpose.
Unlike the temporal metrics, the gait quality metrics appeared to be robust with respect to the notable differences in the experimental gait protocols adopted by the two centres. Among these metrics, in fact, only differences in the regularity index in the ML direction and the HR (representative of the symmetry domain) in the AP and ML directions were found to be statistically significant between centre A and centre B (Figure 2 and Table 5). Again, this specific result could be explained both by the different walking speeds and by the different lengths of the walkway in the two centres. Indeed, an association between walking speed and HR has been previously shown, both in healthy young [43,44] and older subjects [39]. These authors observed that the HR increased at the self-selected comfortable walking speed and decreased at slower and faster speeds. A similar trend emerged from our analysis, except for the HR in the ML direction, but this specific metric should be handled with care due to its observed low test-retest reliability (Table 3). The low number of steps (i.e., eight steps) used for calculating the HR for each walking bout might also have reduced the robustness and reliability of this metric [57,71]. However, this choice was imposed by the reduced length of the corridor in centre B. Testing the participants along a shorter path also implied a higher number of turns over the 6 min, resulting in a lower validity of the HR, as shown by Riva, et al. [35] and by Brach, et al. [40].
While further studies are of course needed to fully validate this hypothesis, our results suggest that, in agreement with what is already reported for other neurological diseases, such as Parkinson's disease [53], the gait quality metrics extracted from the upper body accelerations should not be considered as a simple reflection of gait spatio-temporal features and might bring complementary informative content in quantifying patients' gait ability. Additionally, these metrics have been recently shown to be sensitive to fatigue and pathology progression in pwMS [72] and, as such, they are promising candidates for quantification of disease progression and rehabilitation interventions in these patients.

Conclusions
In conclusion, this pragmatic study showed consistency in the gait metrics from two matched groups of pwMS, even when they were assessed in two different hospitals and under notably different gait testing conditions. The identification of such robust gait metrics opens the possibility of comparing retrospective data and paves the way for reliable multi-centre studies to be conducted in routine hospital settings rather than in specialised gait research laboratories. This is essential to allow an increase of sample size and statistical power of clinical trials in which rehabilitation interventions need to be quantitatively assessed.
The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR, the Department of Health and Social Care, the IMI, the European Union, the EFPIA, or any associated partners.