Performance of Deep Learning Models in Forecasting Gait Trajectories of Children with Neurological Disorders

Forecasted gait trajectories of children could be used as feedforward input to control lower limb robotic devices, such as exoskeletons and actuated orthotic devices (e.g., the Powered Ankle Foot Orthosis, PAFO). Several studies have forecasted healthy gait trajectories but, to the best of our knowledge, none has yet forecasted the gait trajectories of children with pathological gait. These exhibit higher inter- and intra-subject variability than the typically developing gait of healthy subjects, and they represent the gait patterns that rehabilitative exoskeletons and actuated orthoses would typically target. In this study, we implemented two deep learning models, a Long Short-Term Memory (LSTM) network and a Convolutional Neural Network (CNN), to forecast hip, knee, and ankle trajectories, expressed as Euler angles in the pitch, roll, and yaw dimensions, for children with neurological disorders, up to 200 ms into the future. The deep learning models implemented in our study are trained on data (available online) from children with neurological disorders collected by Gillette Children's Specialty Healthcare over the years 1994–2017. The children's ages range from 4 to 19 years, and the majority had cerebral palsy (73%), while the rest had a combination of neurological, developmental, orthopaedic, and genetic disorders (27%). Data were recorded with a motion capture system (VICON) at a sampling frequency of 120 Hz while the children walked for 15 m. We investigated a total of 35 combinations of input and output time-frames, with input window sizes ranging from 50 to 1000 ms and output window sizes from 8.33 to 200 ms. Results show that the LSTM outperforms the CNN, and that the gap in performance grows with the size of the input and output windows; the maximum difference between the Mean Absolute Errors (MAEs) of the CNN and LSTM networks was 0.91 degrees.
Results also show that the input size has no significant influence on mean prediction errors when the output window is 50 ms or smaller. For output window sizes greater than 50 ms, the larger the input window, the lower the error. Overall, we obtained MAEs ranging from 0.095 to 2.531 degrees for the LSTM network and from 0.129 to 2.840 degrees for the CNN. This study establishes the feasibility of forecasting the pathological gait trajectories of children, which could be integrated with exoskeleton control systems, and experimentally explores the characteristics of such intelligent systems under varying input and output window time-frames.


Introduction
Assistive devices such as wheelchairs and exoskeletons are designed to facilitate or enhance movement [1]. For people whose movement is affected by motor or nervous system impairments, these devices increase mobility and independence and allow for better interaction with the environment [1].
Exoskeletons are external load-bearing robotic suits made of sensors, actuators, and controllers to increase the strength and endurance of the user [2,3]. In the 1960s, after earlier conceptualisations, the first exoskeleton 'Hardiman' was built by General Electric [2].
Gait trajectory prediction can be integrated into the control strategy of exoskeletons [23]. Several researchers have investigated the use of deep learning for gait trajectory prediction. Liu et al. [24] developed a deep spatio-temporal model that consists of Long Short-Term Memory (LSTM) units to predict two time-steps in the future, with predictions averaged to smooth fluctuations. Zaroug et al. [25] implemented an auto-encoder LSTM for predicting linear acceleration and angular velocity trajectories. They experimented with varying lengths of input time-steps, between five and 40 steps, to predict five or 10 steps in the future (equivalent to 30 ms or 60 ms). An LSTM with a weighted discount loss function was proposed by Su et al. [26] for the prediction of the angular velocities of the thigh, shank, and foot segments. They used 10 or 30 time-steps as input to predict five or 10 steps in the future, corresponding to 100 ms and 200 ms, respectively. Hernandez et al. [27] used a hybrid Convolutional Neural Network (CNN) and LSTM neural network, DeepConvLSTM, to forecast kinematic trajectories with an average MAE of 3.6°, while Jia et al. [28] implemented a deep neural network with LSTM units and a feature fusion layer that combines kinematic (i.e., joint angles) and physiological (i.e., EMG) data for trajectory prediction. Zaroug et al. [29] also compared vanilla, stacked, bidirectional, and autoencoder LSTMs, while Zhu et al. [30] used an attention-based CNN-LSTM, predicting trajectories 60 ms in the future.
Values of kinematic parameters within a gait cycle vary more significantly between different individuals (inter-subject) compared to the values of kinematic parameters of the same individual (intra-subject) [31]. Intra-subject trajectory prediction models (models tested on data from the same individuals used for training the models) have been shown to be more accurate in their predictions than inter-subject models (models tested on data from individuals who were not used for training the models) [26]. However, the variability of pathological gait compared to healthy gait is even greater. For example, children with spastic cerebral palsy were found to have higher within-day and between-day variability compared to healthy children, possibly attributed to the spasticity that limits the range of motion of their joints [32,33].
Given that existing studies have used models trained on healthy gait trajectories only, the ability of deep learning models to forecast pathological trajectories with greater heterogeneity and variability is yet to be evaluated. The main contribution of this study is to investigate, for the first time, the performance of deep learning networks, specifically the Long Short-Term Memory (LSTM) neural network and Convolutional Neural Network (CNN), in forecasting pathological gait trajectories of children with neurological disorders. We conduct a comparison between the two networks. Furthermore, we investigate the influence of the length of the input and output windows on prediction accuracy and provide technical recommendations.

Data
The deep learning models implemented in our study are trained on data from children with neurological disorders. The data used are available online and were collected by Gillette Children's Specialty Healthcare over the years 1994–2017 [34]. They have been previously used for the development of a deep learning model to automatically detect gait events, specifically foot contact and foot off events [35]. The children of the dataset ranged from 4 to 19 years of age. The majority of them had cerebral palsy (73%), while the rest had a combination of neurological, developmental, orthopaedic, and genetic disorders (27%). Children were recorded with a motion capture system (VICON) at a sampling frequency of 120 Hz while walking for 15 m. The data consist of a 99-dimensional vector with kinematics and marker positions. The data were divided into a training and a testing set. The statistical distribution of the features of the children in the training set is as follows: age (11.4 ± 6.2 years), weight (35.7 ± 17.7 kg), height (135.7 ± 21.6 cm), leg length (70.3 ± 14.0 cm), and walking speed (0.84 ± 0.28 m/s). The statistical distribution of the children in the testing set is as follows: age (11.0 ± 4.5 years), weight (35.9 ± 16.7 kg), height (135.6 ± 21.4 cm), leg length (70.6 ± 12.8 cm), and walking speed (0.85 ± 0.29 m/s) [35]. In our study, only kinematics were used for trajectory forecasting; these include the angles of the hip, knee, and ankle in the yaw, pitch, and roll dimensions. The kinematics represent Euler angles, calculated using the plug-in-gait mechanical model [35].

Data Processing
Euler angles of the hip, knee, and ankle in the yaw, pitch, and roll dimensions were available for both legs; however, only one leg was used for training the models. Preprocessing involved trimming the leading and trailing zeros and removing trials with spurious data (assumed to be Euler angles outside a ±90° cut-off).
Since deep learning models are required to be trained with fixed-length sequences, fixed-length inputs and their corresponding target trajectories were generated using the sliding window method, illustrated in Figure 1. The sliding window method involves generating a shorter sequence x_in with k time-steps, where k is the size of the input, i.e., the number of time-steps utilised by the model to make predictions: x_in = {x_1, x_2, ..., x_k}. Another shorter sequence y_out with z time-steps is created, where z is the size of the output, i.e., the number of time-steps that will be predicted by the model: y_out = {x_(k+1), x_(k+2), ..., x_(k+z)}. Each input window has a corresponding output window (the target label used to train the models), and the two windows together represent one training sample. The stride, which is the distance between the beginning of one sample and the beginning of the next, is set to 5 time-steps. Ten samples are generated from each trial.
Figure 1. Illustration of the sliding window method. Continuous gait trajectories are used to generate input and output windows for training the model using the sliding window method. Each pair of input and output windows forms one sample. Several samples can be generated from one continuous gait trajectory by sliding the window for a specified distance, also known as the stride length.
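The sliding window procedure described above can be sketched as follows. This is a minimal illustration, not the authors' code; the function name and the use of random data are our own, while the stride of 5 time-steps and the 10 samples per trial follow the text.

```python
import numpy as np

def sliding_windows(trajectory, k, z, stride=5, max_samples=10):
    """Generate (input, output) window pairs from one continuous trial.

    trajectory: array of shape (T, f) -- T time-steps, f features.
    k: input window size in time-steps; z: output window size.
    Returns X with shape (n, k, f) and Y with shape (n, z, f).
    """
    X, Y = [], []
    start = 0
    # slide until the combined window no longer fits, or the per-trial cap is hit
    while start + k + z <= len(trajectory) and len(X) < max_samples:
        X.append(trajectory[start:start + k])            # input window x_in
        Y.append(trajectory[start + k:start + k + z])    # target window y_out
        start += stride
    return np.array(X), np.array(Y)

# Example: one trial of 120 time-steps, 9 features (3 joints x 3 Euler angles)
trial = np.random.rand(120, 9)
X, Y = sliding_windows(trial, k=48, z=12)
```

Each returned pair (X[i], Y[i]) is one training sample; consecutive samples start 5 time-steps apart.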
The models were trained using several sizes of input and output windows. The input window sizes for the LSTM were 50, 100, 200, 400, 600, 800, and 1000 ms. Given that the data were captured at a sampling frequency of 120Hz, those durations correspond to 6, 12, 24, 48, 72, 96, and 120 time-steps. Output window sizes for the LSTM were 8.33, 25, 50, 100, and 200 ms which correspond to 1, 3, 6, 12, and 24 time-steps. Every combination of input and output window sizes was used, yielding a total of 35 combinations. As for the CNN, we only focused on using 6 and 120 input time-steps (the smallest and largest input sizes we used with the LSTM) to predict 1, 3, 6, 12, and 24 output time-steps. We used these time ranges following values proposed in the literature by researchers that forecasted healthy gait. Zaroug et al. experimented with a wide range of input windows from 5-40 steps to forecast 5 or 10 steps in the future, which correspond to 30 and 60 ms respectively [25]. Output windows ranging from 50-60 ms were common too [25,28,30]. A few researchers have also predicted output windows of 100 ms or larger [26,29].
For n training samples, f features (hip, knee, and ankle angles in the yaw, pitch, and roll directions), input window size l_in, and output window size l_out, an input matrix X and a target output matrix Y are created, where X ∈ R^(n×l_in×f) and Y ∈ R^(n×l_out×f). The matrices are then normalised such that X ∈ [0, 1] and Y ∈ [0, 1]. The aim is to build a model g(·) that maps X to Ŷ, i.e., Ŷ = g(X), where Ŷ is a close approximation of the true value Y.
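A sketch of the normalisation step is shown below. The paper states only that X and Y are scaled to [0, 1]; the per-feature min-max scheme, shared between inputs and targets, is our assumption, as is the inverse transform (used later when metrics are computed after de-normalisation).

```python
import numpy as np

def fit_min_max(X, Y):
    """Per-feature min and max computed over both input and target windows,
    so X and Y share the same scaling (an assumption; the paper does not
    specify the exact scheme)."""
    stacked = np.concatenate([X.reshape(-1, X.shape[-1]),
                              Y.reshape(-1, Y.shape[-1])], axis=0)
    return stacked.min(axis=0), stacked.max(axis=0)

def normalise(A, lo, hi):
    return (A - lo) / (hi - lo)

def denormalise(A, lo, hi):
    """Re-scale back to the original angle ranges."""
    return A * (hi - lo) + lo

# Illustrative joint-angle data in degrees
X = np.random.uniform(-90, 90, size=(100, 72, 9))
Y = np.random.uniform(-90, 90, size=(100, 12, 9))
lo, hi = fit_min_max(X, Y)
Xn, Yn = normalise(X, lo, hi), normalise(Y, lo, hi)
```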

Long Short-Term Memory (LSTM) Architecture
The Long Short-Term Memory (LSTM) neural network is commonly used for time-series applications, including forecasting future trajectories. It is a type of gated Recurrent Neural Network (RNN) that mitigates the problem of vanishing and exploding gradients during training and is capable of learning long-term dependencies [36]. Each LSTM cell contains three gates: an input, an output, and a forget gate. The equations of these gates, as explained in Goodfellow et al. [36], are reported below.
The forget gate unit f_i^(t), for time-step t, biases b^f, input weights U^f, and recurrent weights W^f, is:

f_i^(t) = σ( b_i^f + Σ_j U_(i,j)^f x_j^(t) + Σ_j W_(i,j)^f h_j^(t−1) )   (1)

The internal state of an LSTM cell, s_i^(t), is updated depending on the value of the forget gate unit f_i^(t); for biases b, input weights U, and recurrent weights W, its equation is:

s_i^(t) = f_i^(t) s_i^(t−1) + g_i^(t) σ( b_i + Σ_j U_(i,j) x_j^(t) + Σ_j W_(i,j) h_j^(t−1) )   (2)

The external input gate unit, g_i^(t), is calculated as:

g_i^(t) = σ( b_i^g + Σ_j U_(i,j)^g x_j^(t) + Σ_j W_(i,j)^g h_j^(t−1) )   (3)

The output of the LSTM cell, h_i^(t), is calculated as:

h_i^(t) = tanh( s_i^(t) ) q_i^(t)   (4)

For biases b^o, input weights U^o, and recurrent weights W^o, the value of the output gate unit q_i^(t) is:

q_i^(t) = σ( b_i^o + Σ_j U_(i,j)^o x_j^(t) + Σ_j W_(i,j)^o h_j^(t−1) )   (5)

The LSTM network implemented in this study contained four layers of LSTM units, with 128 units per layer. The hidden state of the final layer is used as input to a fully connected layer, whose output is then reshaped to obtain the desired output. The overall architecture is illustrated in Figure 2.
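The architecture just described (four LSTM layers of 128 units, with the final layer's hidden state feeding a fully connected layer that is reshaped into the output window) can be sketched in PyTorch, the library the paper reports using. The layer sizes follow the text; the class and variable names are our own and other details (initialisation, dropout) are unspecified assumptions.

```python
import torch
import torch.nn as nn

class TrajectoryLSTM(nn.Module):
    """Sketch of the described LSTM: 4 layers x 128 units, followed by a
    fully connected layer reshaped to (output time-steps, features)."""
    def __init__(self, n_features=9, hidden=128, n_layers=4, out_steps=12):
        super().__init__()
        self.out_steps = out_steps
        self.n_features = n_features
        self.lstm = nn.LSTM(n_features, hidden, n_layers, batch_first=True)
        # hidden state of the final LSTM layer feeds the fully connected layer
        self.fc = nn.Linear(hidden, out_steps * n_features)

    def forward(self, x):                 # x: (batch, l_in, f)
        _, (h, _) = self.lstm(x)          # h: (n_layers, batch, hidden)
        out = self.fc(h[-1])              # final layer's hidden state
        return out.view(-1, self.out_steps, self.n_features)

model = TrajectoryLSTM()
pred = model(torch.randn(32, 72, 9))      # e.g., 600 ms in, 100 ms out
```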

Convolutional Neural Network (CNN) Architecture
In this study, a Convolutional Neural Network (CNN) is used to map input trajectories X to forecasted predictions Ŷ. The convolution operation is formed between a sequence and a kernel, whose weights are tuned during the learning process. Equation (6) is used to calculate the output S of the convolution operation, adapted from the format included in Goodfellow et al. [36] to work with a 1D time-series I and kernel K:

S(i) = (I ∗ K)(i) = Σ_a I(a) K(i − a)   (6)
The CNN architecture consists of several 1D convolution and pooling layers, followed by a dense fully connected layer. The number of kernels in each convolution layer, as well as the size of the pooling layers, is illustrated in Figure 3. Dilated convolutions were also tried but did not improve performance; they were therefore not included.
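A minimal 1D-CNN in the spirit of the architecture above is sketched below. The exact kernel counts, kernel sizes, and pooling sizes are given in Figure 3 of the paper and are not reproduced here; the values used in this sketch are illustrative assumptions, as are the class and variable names.

```python
import torch
import torch.nn as nn

class TrajectoryCNN(nn.Module):
    """Illustrative 1D CNN: stacked Conv1d + pooling layers followed by a
    dense layer reshaped to (output time-steps, features)."""
    def __init__(self, n_features=9, out_steps=12, l_in=72):
        super().__init__()
        self.out_steps, self.n_features = out_steps, n_features
        self.conv = nn.Sequential(
            nn.Conv1d(n_features, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool1d(2),
        )
        # two poolings of size 2 shrink the time axis by a factor of 4
        self.fc = nn.Linear(128 * (l_in // 4), out_steps * n_features)

    def forward(self, x):                 # x: (batch, l_in, f)
        z = self.conv(x.transpose(1, 2))  # Conv1d expects (batch, f, l_in)
        out = self.fc(z.flatten(1))
        return out.view(-1, self.out_steps, self.n_features)

model = TrajectoryCNN()
pred = model(torch.randn(8, 72, 9))
```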

Baseline Methods
The performance of the deep learning models was benchmarked against a simpler machine learning model, a Fully Connected Network (FCN), and two non-intelligent models, which we refer to as Naïve Method 1 and Naïve Method 2. The FCN contains three hidden layers and 200 nodes per layer. As for the non-intelligent models, the first naïve method uses the final time-step in the input window as the predicted value for all output time-steps. The second naïve method uses the mean value of the input as the predicted value for all output time-steps.
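The two naïve methods are simple enough to state in a few lines. The sketch below implements them exactly as described in the text; the function names are our own.

```python
import numpy as np

def naive_last_value(X, out_steps):
    """Naive Method 1: repeat the final input time-step for every
    output time-step."""
    return np.repeat(X[:, -1:, :], out_steps, axis=1)

def naive_input_mean(X, out_steps):
    """Naive Method 2: repeat the mean of the input window for every
    output time-step."""
    return np.repeat(X.mean(axis=1, keepdims=True), out_steps, axis=1)

# Illustrative input windows: 5 samples, 48 time-steps, 9 features
X = np.random.rand(5, 48, 9)
Y1 = naive_last_value(X, 12)
Y2 = naive_input_mean(X, 12)
```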

Details of Network Implementation
The objective of the deep learning models is to find a mapping between the input trajectories X and forecasted trajectories Ŷ, such that the error between the predicted trajectories Ŷ and the true trajectories Y is minimised. The loss function used to optimise the deep learning models is the Mean Squared Error (MSE).
The number of trials used from the dataset was 16,782. Each trial was used to extract 10 samples using the sliding window method, described in Section 2.2, resulting in a total of 167,820 samples. The samples were split into training (70%), validation (20%), and testing (10%) sets. Both the CNN and LSTM are trained in mini-batches, where the size of each batch is 32 samples, with the Adam optimiser used for learning.
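The training setup above (mini-batches of 32, the Adam optimiser, and an MSE loss) can be sketched as a standard PyTorch loop. The dummy data and the stand-in linear model below are our own placeholders; in the paper, the samples come from the sliding window method and the model is the LSTM or CNN described earlier, trained for 60 to 180 epochs.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data: 256 samples, 72 input and 12 output time-steps, 9 features
X = torch.rand(256, 72, 9)
Y = torch.rand(256, 12, 9)

# Stand-in model; the paper's LSTM/CNN would be used here instead
model = nn.Sequential(nn.Flatten(),
                      nn.Linear(72 * 9, 12 * 9),
                      nn.Unflatten(1, (12, 9)))

loader = DataLoader(TensorDataset(X, Y), batch_size=32, shuffle=True)
optimiser = torch.optim.Adam(model.parameters())
criterion = nn.MSELoss()

for epoch in range(2):                    # shortened; the paper uses 60-180
    for xb, yb in loader:
        optimiser.zero_grad()
        loss = criterion(model(xb), yb)   # MSE between prediction and target
        loss.backward()
        optimiser.step()

final_loss = float(loss)
```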
To select the hyper-parameters and architecture for the CNN, LSTM, and FCN models (described in Sections 2.3-2.5), a hyper-parameter search was performed. Hyper-parameter optimisation involved defining a search space, where ranges of values for selected model parameters were specified (e.g., learning rate, number of LSTM layers, number of hidden units, and size of kernels). The objective is to choose the parameter values that optimise the performance of the networks. The tree-structured Parzen estimator algorithm, a type of Bayesian hyper-parameter sampler [37], was used to select the optimal parameters (from the defined search space) that minimise the validation loss. The search space for the hyper-parameters and the corresponding selected values are included in Table 1. The parameters were optimised for predictions with an input window size of 72 time-steps and an output window size of 12 time-steps. The number of epochs was also included in the hyper-parameter optimisation, but its values were fine-tuned manually afterwards. The numbers of epochs for training the LSTM, CNN, and FCN were 60, 150, and 180, respectively. Once the optimal architecture and parameters of the networks were selected, the training and validation sets were combined to train these networks. The total numbers of trainable parameters of the optimised LSTM, CNN, and FCN are 495,320, 4,666,888, and 380,216, respectively. The final performance of the networks was evaluated on the testing set. The framework described has been implemented using Python, with the following libraries: PyTorch, NumPy, Matplotlib, SciPy, and Scikit-learn. Optuna was used for the hyper-parameter search [37]. An Nvidia GeForce RTX 2070 GPU was used for computation.

Evaluation Metrics
Several metrics have been used to evaluate the performance of the models. To measure how close the predicted trajectories Ŷ are to the observed trajectories Y, we have calculated the mean squared error (MSE), the mean absolute error (MAE), and the Pearson correlation coefficient P. These were calculated after de-normalisation (i.e., re-scaling to the original ranges) of the predicted trajectories. Given that n is the number of testing samples, f is the number of features, and l_out is the output window size, the equations are as follows:

Mean squared error:

MSE = (1 / (n · f · l_out)) Σ_(i=1)^n Σ_(j=1)^f Σ_(k=1)^(l_out) (Y_(i,j,k) − Ŷ_(i,j,k))²   (7)

Mean squared error standard deviation:

σ_MSE = sqrt( (1 / (n · f · l_out)) Σ_(i=1)^n Σ_(j=1)^f Σ_(k=1)^(l_out) ( (Y_(i,j,k) − Ŷ_(i,j,k))² − MSE )² )   (8)

Mean absolute error:

MAE = (1 / (n · f · l_out)) Σ_(i=1)^n Σ_(j=1)^f Σ_(k=1)^(l_out) |Y_(i,j,k) − Ŷ_(i,j,k)|   (9)

Mean absolute error standard deviation:

σ_MAE = sqrt( (1 / (n · f · l_out)) Σ_(i=1)^n Σ_(j=1)^f Σ_(k=1)^(l_out) ( |Y_(i,j,k) − Ŷ_(i,j,k)| − MAE )² )   (10)

Pearson correlation coefficient, computed over all n · f · l_out values, with Ȳ and μ_Ŷ denoting the means of Y and Ŷ:

P = Σ_(i,j,k) (Y_(i,j,k) − Ȳ)(Ŷ_(i,j,k) − μ_Ŷ) / sqrt( Σ_(i,j,k) (Y_(i,j,k) − Ȳ)² · Σ_(i,j,k) (Ŷ_(i,j,k) − μ_Ŷ)² )   (11)

These metrics are used to evaluate and compare the performance of the networks we implemented, with results presented in Section 3.
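The three headline metrics can be computed in a few lines of NumPy. This sketch follows the definitions above (means taken over all samples, time-steps, and features, after de-normalisation); the function name and the synthetic data are our own.

```python
import numpy as np

def evaluate(Y_true, Y_pred):
    """MAE, MSE, and Pearson correlation over all samples, output
    time-steps, and features; inputs are assumed already de-normalised."""
    err = Y_pred - Y_true
    mse = float(np.mean(err ** 2))
    mae = float(np.mean(np.abs(err)))
    # Pearson correlation over the flattened trajectories
    p = float(np.corrcoef(Y_true.ravel(), Y_pred.ravel())[0, 1])
    return mae, mse, p

# Synthetic example: predictions with small additive noise
Y_true = np.random.rand(100, 12, 9)
Y_pred = Y_true + np.random.normal(0, 0.01, Y_true.shape)
mae, mse, p = evaluate(Y_true, Y_pred)
```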

LSTM Network Performance for Varying Input and Output Window Sizes
The LSTM model has been trained using 35 combinations of input and output window sizes. The input window sizes were 6, 12, 24, 48, 72, 96, and 120 time-steps, which correspond to 50, 100, 200, 400, 600, 800, and 1000 ms, respectively, while the output window sizes were 1, 3, 6, 12, and 24 time-steps, corresponding to 8.33, 25, 50, 100, and 200 ms. The performance of the models (MAE, MSE, and Pearson correlation coefficient) is reported in Table 2. The results in Table 2 show that the smaller the size of the output window, the lower the prediction error. LSTMs predicting one output time-step had the lowest mean errors, while LSTMs predicting 24 time-steps had the highest, as expected. On the other hand, the size of the input window did not significantly influence the mean losses when predicting short output windows, specifically six output time-steps or fewer (see Figure 4a-c). The size of the input window did have an influence when predicting larger output windows, i.e., 12 and 24 time-steps, with larger input sizes resulting in lower mean errors (see Figure 4d,e). For 12 and 24 time-step output windows, using 120 time-steps as the input window size, the largest input size we tried, led to the lowest mean absolute errors (see Figure 4d,e).

Performance of the CNN and Comparisons with LSTM Network
We trained the CNN and compared its performance with that of the LSTM network. We focused only on six and 120 input time-steps (the smallest and largest input sizes we used with the LSTM) to predict 1, 3, 6, 12, and 24 output time-steps. The performance of both models (MAE, MSE, and Pearson correlation coefficient) is reported in Table 3. As shown in Table 3, the CNN errors increase as the number of predicted time-steps increases, as also observed with the LSTM. In Figure 5, we compare the performance of the CNN and LSTM: their mean errors are very similar when predicting small output windows, such as one and three time-steps. However, the difference between the CNN and LSTM MAEs widens for larger output windows (6, 12, and 24 future time-steps), with the LSTM outperforming the CNN. Interestingly, the size of this difference depends on the input window size: when six time-steps are used as input, the CNN MAE is larger than that of the LSTM; when 120 time-steps are used as input, the difference in their MAEs is even larger.

Benchmarking Performance of Deep Learning Models
We benchmark the performance of our two deep learning models against a simpler machine learning architecture, the Fully Connected Network (FCN), and two naïve (non-intelligent) methods. The first naïve method uses the final time-step in the input window as the predicted value for all output time-steps; the second uses the mean value of the input as the predicted value for all output time-steps. Table 4 shows the MAE for the deep learning and benchmark models. Our deep learning models outperformed both naïve methods and the majority of the FCN predictions. In two cases, the FCN obtained better results: the first was when the input and output window sizes were six and 24 time-steps, respectively, where the FCN had a lower MAE than both the LSTM and CNN; the other was when the input and output sizes were 120 and 24 time-steps, respectively, with the FCN performing better than the CNN, but not better than the LSTM.
a Naïve Method 1: all output time-steps are predicted to be the value of the last input time-step. b Naïve Method 2: all output time-steps are predicted to be the mean value of the input time-steps. * and † represent statistical significance compared to Naïve Method 1 and the LSTM, respectively. Significance is based on pairwise t-tests (p < 0.05). Bold entries denote the lowest MAE value for a given input and output window size.
Naïve Method 1 resulted in lower MAEs than Naïve Method 2; therefore, pairwise t-tests were conducted between Naïve Method 1 and all intelligent methods (LSTM, CNN, and FCN) to determine whether the mean absolute errors were significantly different (p < 0.05). The results in Table 4 confirm that the intelligent models perform better than the non-intelligent ones. Furthermore, pairwise t-tests were conducted between the LSTM and all other models; the differences in MAEs were found to be statistically significant, as shown in Table 4.
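A paired comparison of this kind can be run with SciPy. The sketch below uses synthetic per-sample absolute errors for two methods evaluated on the same test samples (hypothetical stand-ins for, e.g., Naïve Method 1 versus the LSTM); only the use of pairwise t-tests at p < 0.05 comes from the text.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical per-sample absolute errors (degrees) on the same test set
errors_naive = rng.normal(1.0, 0.2, size=500)
errors_lstm = rng.normal(0.6, 0.2, size=500)

# Paired (related-samples) t-test: is the mean difference in absolute
# error significantly different from zero?
t_stat, p_value = stats.ttest_rel(errors_naive, errors_lstm)
significant = p_value < 0.05
```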

Accuracy of the Models across the Different Time-Steps
In the previous sections, we reported the mean error across all features and time-steps. Here, we calculate the MAE for each time-step of a given output window separately (see Figure 6); the MAE is calculated using an adapted form of Equation (9) (see Section 2.7), in which the summation over k = 1:l_out is not performed. As expected, the results show that predictions further in the future deviate more from the actual values, and this deviation becomes more pronounced after around the third time-step, as shown in Figure 7. Figure 7 also shows that the LSTM MAE increases as the size of the output window increases.
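Keeping the time-step dimension separate amounts to averaging the absolute errors over samples and features only. A sketch, with our own function name and synthetic data:

```python
import numpy as np

def mae_per_time_step(Y_true, Y_pred):
    """Adapted form of the MAE: average over samples (axis 0) and features
    (axis 2) only, leaving one MAE value per output time-step."""
    return np.mean(np.abs(Y_pred - Y_true), axis=(0, 2))

# Synthetic example: 50 samples, 24 output time-steps, 9 features
Y_true = np.random.rand(50, 24, 9)
Y_pred = Y_true + np.random.normal(0, 0.05, Y_true.shape)
per_step = mae_per_time_step(Y_true, Y_pred)   # shape (24,)
```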

Performance of the Models for Each Joint
We investigated whether the errors for a particular joint were higher than for the others. The MAE results for each of the hip, knee, and ankle joints are presented in Figure 8. The MAE for each joint represents the combined errors of the angles predicted in the pitch, roll, and yaw dimensions. They were calculated using an adapted form of Equation (9) (see Section 2.7), in which the summation over j = 1:f is restricted to the pitch, yaw, and roll angles of a single joint, rather than all joints. The results do not show any particular trend.

Discussion
In this paper, we implemented deep learning models, specifically an LSTM and a CNN, to forecast the gait trajectories of children with pathological gait. To the best of our knowledge, this is the first time that the trajectories of pathological gait of children, which exhibit larger inter- and intra-subject variability than the trajectories of typically developing children, have been predicted using deep learning models.
The advantage of deep learning models is that they make predictions based on current input data while also utilising learned representations of gait trajectories acquired during a prior learning stage from numerous gait sequences. We used the LSTM and CNN to forecast hip, knee, and ankle trajectories based on varying input and output window sizes. Input window sizes for the LSTM were 50, 100, 200, 400, 600, 800, and 1000 ms (for data captured at a sampling frequency of 120 Hz, these durations correspond to 6, 12, 24, 48, 72, 96, and 120 time-steps). Input window sizes for the CNN were 50 and 1000 ms (corresponding to six and 120 time-steps). The reason for using input window sizes up to 1000 ms is that the average length of a gait cycle for a typically developing school-aged child is 980-990 ms [38]. This means we trained the deep learning models to make predictions based on data from approximately one full gait cycle, or less. Output window sizes for the LSTM and CNN were 8.33, 25, 50, 100, and 200 ms (corresponding to 1, 3, 6, 12, and 24 time-steps); this means that we have tried to forecast up to 20% of the cycle. We used these time ranges following values proposed in the literature by researchers who forecasted healthy gait (refer to Section 2.2 for details).
The LSTM (a type of gated recurrent network) was selected for forecasting gait trajectories as it has been commonly and successfully used with sequential data [36]. The LSTM has the advantage of taking into account the order of values in an input sequence, and it has the ability to learn long-term dependencies [36]. The LSTM network was compared to a CNN, which is mostly used for computer vision problems with 2D grid-like inputs and 2D convolutions, but is increasingly applied to time-series sequences using 1D convolutions [39]. We therefore implemented a CNN to evaluate whether it shows promising performance in the task of forecasting trajectories for children with neurological disorders.
Our results show that the LSTM's performance is better than the CNN's. Moreover, the difference between the MAEs of the two networks grew with larger input and output window sizes. The performance gap was highest, at 0.91 degrees of MAE, with 120 time-steps as input and 24 time-steps as output. There was one case in which the MAE of the CNN was lower than that of the LSTM, namely the smallest combination of input and output windows (six and one time-steps, respectively). In this case, the difference in MAE between the CNN and LSTM was small, 0.014 degrees, and the CNN had a higher standard deviation. Our results differ from those of Moreira et al. [40], who found the CNN to be more robust for ankle joint torque estimation based on kinematics, speed, and anthropometry. They also differ from those of Molinaro et al. [41], in which a temporal convolution network (with dilated convolution layers) outperformed an LSTM implementation. It must be stressed that Molinaro et al. considered joint moments rather than joint angles.
We also compared the influence of the size of the input and output windows on predictions. The size of the input window did not have a significant influence on the accuracy of the LSTM network when predicting small output window sizes (one, three, and six time-steps). However, for longer output window sizes (12 and 24 time-steps), larger input windows resulted in lower errors. For both 12 and 24 time-steps, the lowest error was achieved using 120 input time-steps. This differs from what was reported by [25], who found that after increasing the input size beyond 30 time-steps, the mean errors for predicting five time-steps in the future increased (10 time-steps correspond to 60 ms in that paper).
There were cases in our results in which the difference between the actual and predicted trajectories was large compared to the mean absolute error. We could not tell whether this was due to the model's lack of generalisability for certain types of pathological gait patterns, or due to an underlying issue with the data for those samples, such as sensor or marker errors. This is because the subjects were anonymised and the dataset used did not contain supplementary information or videos for each trial. This is one of the limitations of our study. Another limitation, also caused by the anonymisation of the dataset, was the inability to test whether there is a significant difference in performance between individualised models (subject-specific models that need to be trained on data from the user of the exoskeleton) and generalised models (subject-independent models that make predictions without needing to be trained on data from a specific user). This is an important point to consider when designing exoskeleton control strategies.

Conclusions
To conclude, we have used two deep learning models, an LSTM and a CNN, to forecast the trajectories of children with neurological disorders. The results show that our deep learning models outperform the three baseline methods we implemented in our study (with the LSTM being the top performer), with only two exceptions in which the FCN was better (see Section 3.5 for details). We also experimented with varying input and output windows to quantify how performance is affected by the amount of input data and the length of the future horizon. A potential application of our approach is the control of lower limb robotic devices, whereby the forecasted trajectories of the models can be used as a proxy for the user's intentions. These intentions can be integrated into the control hierarchy of exoskeletons, specifically into the high-level control responsible for detecting the user's intention and passing it on to the mid and low levels to generate appropriate movement commands. For real-time systems, there is always a trade-off between performance and speed; input windows therefore need to be large enough to achieve acceptable errors, but not so large that they slow the system down. As future work, we should evaluate the performance of the models on data collected using wearable sensors (e.g., IMUs and foot pressure sensors) rather than motion capture systems. We should also evaluate the difference between the performance of individualised and generalised models. Furthermore, we believe that the forecasted trajectories need to be paired with a corrective algorithm, unique to every gait sequence. In this scenario, the user intention (coming from a trajectory forecasting model) is adjusted by a corrective algorithm that produces the 'desired trajectory' used by the mid- and low-level controllers of the exoskeleton.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript: