Movement Analysis for Neurological and Musculoskeletal Disorders Using Graph Convolutional Neural Network

Abstract: Optical motion capture and wearable sensors are commonly used to analyze impaired movement in individuals with neurological and musculoskeletal disorders. However, these systems are expensive and often require highly trained professionals to identify specific impairments. In this work, we propose a graph convolutional neural network that mimics the intuition of physical therapists to identify patient-specific impairments from video of a patient. In addition, two modeling approaches are compared: a graph convolutional network applied solely to skeleton input data and a graph convolutional network accompanied by a 1-dimensional convolutional neural network (1D-CNN). Experiments on the dataset showed that the proposed method not only improves the correlation of the predicted gait measure with the ground truth value (speed = 0.791, gait deviation index (GDI) = 0.792) but also enables faster training with fewer parameters. In conclusion, the proposed method shows that video-based data can be used to treat neurological and musculoskeletal disorders with acceptable accuracy instead of depending on expensive and labor-intensive optical motion capture systems.


Introduction
Over 40 million people in the United States are diagnosed with movement disorders, such as Parkinson's disease, stroke, dementia, cerebral palsy, osteoarthritis, multiple sclerosis, etc. Impaired movement often leads to a reduced ability to perform activities of daily living, decreased quality of life, and substantial societal costs (e.g., costs of health care, social care services, productivity loss, etc.) [1,2]. For example, in 2014 the Centers for Disease Control in the US estimated the lifetime costs of caring for an individual with cerebral palsy are approximately USD 1.3 billion [3]. Gait analysis is a popular method for diagnosing movement impairments in these populations, where it can be used to inform rehabilitative treatment and quantify the progress of improvements throughout the rehabilitative process.
The current gold standard for quantitative movement analysis is optical motion capture [4,5], which requires markers to be placed at specific locations on a subject's body. The 3D position of each marker can then be triangulated by image sensors placed around the room. Although these systems provide very accurate measurements of movement, several limitations prevent their wide use in clinical settings, such as costly equipment that is confined to laboratory settings and the time-consuming, accurate placement of markers. Wearable sensors, such as inertial measurement units, have gained some traction for clinic-based quantitative movement analysis [6,7] because they overcome some of these limitations; in particular, their main benefit is that they can be taken out of the lab to measure more naturalistic movements. However, wearable systems are still expensive and require time-consuming and accurate sensor placement.
Recent advances in computer vision-based tracking of videos show promise for gait analysis in natural environments without requiring expensive equipment or placement of sensors. Such video data are easier to capture from patients and inexpensive to process. The potential clinical utility of video-based pose tracking has recently been demonstrated in studies on infants [8], healthy adults [9], older adults at risk of falls [10], and children with cerebral palsy [11]. With regard to gait analysis, these studies have attempted to measure either joint kinematics [11] or spatiotemporal parameters, such as step length and cadence [9], from a single video camera and have achieved moderate accuracy. This moderate accuracy can be attributed in part to the machine learning techniques that were employed. The purpose of our study was to use more advanced deep learning techniques to bring the accuracy of video-based gait analysis closer to that of gold-standard optical motion capture.
From our perspective, two barriers have prevented wider use of video-based recognition for human gait analysis. First, interest in video-based recognition that uses pose estimation for keypoint extraction is relatively new and has been primarily confined to researchers in computer science and related fields. Second, there is an expectation that gait parameters calculated by video-based recognition approaches be validated against the gold-standard method, i.e., optical motion capture systems.
The goals of this study were two-fold: (1) to extract the keypoints from a video using OpenPose [12] and then predict quantitative gait metrics commonly used in clinical gait analysis; and (2) to propose a novel method based on a graph convolutional neural network that predicts quantitative gait metrics and can also be applied to classification tasks. The proposed method was trained on 1792 videos of 1026 unique patients with cerebral palsy [13]. Clinical gait analysis was applied to the optical motion capture data to calculate the gait parameter values, which served as the ground truth for our proposed network. The gait parameters used as metrics in this study were walking speed, cadence, gait deviation index, and knee flexion angle at maximum extension.

Related Work
With current technology, it is possible to capture skeleton data in real time using depth sensors and pose estimation algorithms [12] with low resource and computational demands. For modeling motion dynamics, skeleton data are the best choice because they are robust to illumination change, complex backgrounds, and scene variation. Hand-crafted approaches and deep learning approaches are the common ways to extract skeleton data from images or videos. The hand-crafted approaches focus on capturing the dynamic motion of the joints, such as the relative position of the joints [14] and the covariance matrix of the joint trajectories [15], and on the design of several view-invariant features. Among these feature designs, group sparsity-based class-specific dictionary coding [16], rotations and translations of body parts [15,17], and canonical views of transformed features [18] are the common ones. Other traditional methods [19–21] combine information from different modalities, for instance depth information and skeleton data, to further enhance performance. However, these methods do not capture the features needed to predict gait parameters as well as deep learning methods do. Recently, deep learning has achieved considerable success in many fields. It is preferred to traditional machine learning methods because it performs better with less complicated models and without requiring extensive feature engineering. Among the deep learning methods, recurrent neural network (RNN)-based methods [13,22–24] are known for modeling temporal dynamic behavior, and CNN-based methods use parameter sharing and sparse connectivity to reduce the number of parameters that need to be learned [25]. Additionally, a CNN model has been used to map anatomical keypoints to an outcome metric (e.g., cadence) [11].
To improve performance, a two-stream model [26] integrates a CNN and an RNN that operate on RGB images and on temporally ordered coordinate vectors of skeletons, respectively. However, both the single networks and the combined two-stream model fall short in understanding human movement because they cannot capture the spatial dependence between correlated joints.
The idea of a graph neural network that represents the complex relationships and inter-dependencies among data in non-Euclidean domains, now classified as a recurrent graph neural network (RecGNN), was conceived in [27] and further explained in [28,29]. RecGNN methods use neighbor information iteratively to learn the representation of a target node, and training continues until a stable fixed point is attained. In general, these methods focus on learning node representations with recurrent neural architectures, which demands high computational power. Consequently, the RecGNN methods in [30,31] were proposed to address this problem. Analogous to convolutional neural networks for images and videos, graph convolutional networks (ConvGNNs) generalize the convolution operation from grid data to graph data: a node's representation is obtained by aggregating its own features and those of its neighbors. Unlike RecGNNs, ConvGNNs stack multiple graph convolutional layers and pooling layers to extract high-level node and graph representations. ConvGNNs fall into two main streams, spectral-based approaches [32–35] and spatial-based approaches [36–38]. Whereas spectral-based approaches focus on removing noise from graph signals, spatial-based approaches define graph convolution by information propagation, an idea inherited from RecGNNs. Recently, graph convolutional neural networks (GCNNs) [34] have been proposed to bridge the gap between spectral-based and spatial-based approaches. In addition to being flexible and simple, spatial-based methods are more efficient for graph-related data; thus, we chose the spatial-based approach in this paper. Graph convolutional neural networks were initially proposed to model spatial dependency. For applications such as traffic forecasting and action recognition, a CNN or an RNN is used alongside the graph convolutional neural network to model temporal dependency.
The spatial-temporal graph neural network [39] addresses the time-series prediction problem in the traffic domain by applying a graph convolutional neural network to both spatial and temporal dependencies. However, that method uses an adjacency matrix only for the spatial dimension; across the temporal dimension, it uses a conventional convolutional neural network, which does not capture the features of skeleton data very well. Our method defines a single adjacency matrix that covers both the temporal and spatial dimensions. Moreover, our method trains much faster with fewer parameters.
Based on the landmarks obtained by OpenPose [4], deep learning methods show promising results in gait analysis. Although OpenPose shows a higher location error, between 20 and 40 mm, than marker-based motion capture systems in laboratory-based gait analysis, it estimates landmarks that are good enough for further analysis [40]. A CNN-based method [11] predicts gait metrics that approach the theoretical limits of accuracy imposed by natural variability within the dataset prepared at Gillette Children's Speciality Healthcare. However, there is still room for improvement: our method, a graph convolutional neural network (GCNN) that fits the graph nature of the data, improves the predicted gait metrics.
The aim of this paper is to produce gait metrics from a video-based system that match the metrics calculated from a marker-based optical motion capture system. Besides using mean squared error (MSE) during training, we use correlation, a measure from information theory [41–45], to show how the value predicted by our method is related to the ground truth value. For our dataset, the correlation values approach the maximum limit imposed by natural variability in the gait metrics.

Problem Formulation
Let $X = (q_1, \ldots, q_T)$ denote a temporal sequence where each frame $q_t \in \mathbb{R}^{n_j \times 2}$ represents a human body pose at time $t$, with $n_j$ joints in the skeleton and 2 dimensions per joint (the $(x, y)$ position on the Cartesian plane). The OpenPose [12] method extracts the joints from each video segment, and we use the joints, after flattening them, as the input $X \in \mathcal{X}$ for our proposed method. Let $Y \in \mathcal{Y}$ be the gait parameter corresponding to the input $X$.
Let $F$ be a non-linear function that maps $\mathcal{X} \subseteq \mathbb{R}^{T \times n_j \times 2}$ to $\mathcal{Y} \subseteq \mathbb{R}^{1 \times 1}$, i.e., $F : \mathcal{X} \to \mathcal{Y}$. We parameterize the non-linear function $F$ by a deep neural network with parameters $\theta_F$. In general, given a dataset with $N$ training samples, i.e., $\{(X_i, Y_i)\}_{i=1}^{N}$, we learn the network $F(\cdot\,; \theta_F)$ so that it can predict the gait parameter. In particular, the learning objective can be defined as follows:
$$\theta_F^{*} = \arg\min_{\theta_F} \frac{1}{N} \sum_{i=1}^{N} L\big(F(X_i; \theta_F), Y_i\big) \qquad (1)$$
where $L$ is the loss function between the predicted gait parameter $F(X; \theta_F)$ and the corresponding ground truth $Y$.
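As an illustration of this formulation, the following minimal Python sketch flattens a pose sequence of T frames with n_j joints into a list of T × n_j (x, y) entries and evaluates the loss averaged over the training samples. The frame-major flattening order, the averaging over samples, and the names `flatten_sequence`, `empirical_risk`, and `squared_error` are assumptions made for illustration; they are not taken from the implementation used in this work.

```python
from typing import Callable, List, Sequence, Tuple

Joint = Tuple[float, float]      # (x, y) position of one joint
Frame = List[Joint]              # the n_j joints of one pose q_t


def flatten_sequence(frames: Sequence[Frame]) -> List[Joint]:
    """Flatten T frames of n_j joints into T * n_j joints (frame-major order, assumed)."""
    return [joint for frame in frames for joint in frame]


def squared_error(pred: float, target: float) -> float:
    """Per-sample loss L used in this toy example."""
    return (pred - target) ** 2


def empirical_risk(
    model: Callable[[List[Joint]], float],
    dataset: Sequence[Tuple[List[Joint], float]],
    loss: Callable[[float, float], float] = squared_error,
) -> float:
    """Average of L(F(X_i), Y_i) over the N samples (averaging is an assumption)."""
    return sum(loss(model(x), y) for x, y in dataset) / len(dataset)


# Toy usage: 3 frames with 2 joints each give 6 flattened nodes.
frames = [
    [(0.0, 0.0), (1.0, 1.0)],
    [(2.0, 2.0), (3.0, 3.0)],
    [(4.0, 4.0), (5.0, 5.0)],
]
flat = flatten_sequence(frames)
```

In this sketch the model is any callable that maps a flattened sequence to a single number, which mirrors the mapping F from a pose sequence to one gait parameter.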

Data Preprocessing
To analyze the gait parameters of patients, we extract anatomical skeleton data from the Gillette Children's Speciality Healthcare dataset using OpenPose [5]. The extracted data have 25 keypoints for each frame. Each keypoint is taken as a node, and the (x, y) location of each keypoint is the feature of that node. To suit our model, we flattened the keypoint inputs for each video segment, X. Although OpenPose [5] is an efficient pose estimation method, some joints are missing after extraction. Such missing data might contribute to inaccurate prediction of gait metrics. We address the problem by replacing each missing joint with the mean of that feature over the properly extracted joints.
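One possible reading of this mean replacement is sketched below: for each joint, the mean (x, y) over the frames in which that joint was detected fills the frames in which it is missing. Representing a missing joint as `None` and averaging per joint over time are assumptions made for illustration; the function name `impute_with_joint_mean` is hypothetical.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]


def impute_with_joint_mean(frames: List[List[Optional[Point]]]) -> List[List[Point]]:
    """Replace each missing joint (None) with the mean of its detected positions."""
    n_joints = len(frames[0])
    filled = [list(frame) for frame in frames]
    for j in range(n_joints):
        seen = [frame[j] for frame in frames if frame[j] is not None]
        if not seen:
            raise ValueError(f"joint {j} was never detected in this segment")
        mean_xy = (
            sum(p[0] for p in seen) / len(seen),
            sum(p[1] for p in seen) / len(seen),
        )
        for t, frame in enumerate(frames):
            if frame[j] is None:
                filled[t][j] = mean_xy
    return filled


# Toy segment: 3 frames, 2 joints, with one missing detection per joint.
segment = [
    [(1.0, 2.0), (0.0, 0.0)],
    [None, (2.0, 4.0)],
    [(3.0, 6.0), None],
]
clean = impute_with_joint_mean(segment)
```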

Graph Convolutional Neural Network
There are two approaches to defining convolutional filters in a graph convolutional neural network: spectral-based graph convolution and spatial-based graph convolution. Spectral-based graph convolution has a solid mathematical foundation and has produced good results in some applications. However, it is not flexible enough to apply to many graph structures because the learned filter is domain-dependent. Additionally, the eigendecomposition in the spectral domain has a high computational cost. In contrast, like the typical convolution applied to videos, images, and sounds, spatial-based graph convolution is defined by the spatial relationships of the nodes.
The skeleton of the body is represented as an undirected graph $G = \{V, E\}$ on a skeleton sequence with $n_j$ joints per frame and $T$ frames, featuring both intra-body and inter-frame connections. In this graph, the node set $V = \{v_i \mid i = 0, \ldots, N - 1\}$ includes all the joints in the skeleton sequence. Instead of treating the spatial and temporal features as separate entities, we joined them into a single dimension. Therefore, the set of joints $V$ consists of all joints from the intra-body and inter-frame connections, and the total number of nodes is $N = n_j \times T$, where $n_j$ is the number of joints per frame and $T$ is the number of frames. Figure 1 shows the connectivity of joints in a frame and the connection of the same joint in consecutive frames.
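The joined spatial-temporal node set and its two kinds of connections can be sketched as follows. The node numbering `t * n_joints + joint` and the 4-joint leg chain used as an example are assumptions made for illustration, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Edge = Tuple[int, int]


def node_index(t: int, joint: int, n_joints: int) -> int:
    """Map (frame t, joint) to a single node id in 0 .. n_joints * T - 1."""
    return t * n_joints + joint


def build_edges(
    n_joints: int, n_frames: int, bones: Sequence[Edge]
) -> Tuple[List[Edge], List[Edge]]:
    """Return (intra-frame edges, inter-frame edges) as pairs of node ids."""
    spatial = [
        (node_index(t, i, n_joints), node_index(t, j, n_joints))
        for t in range(n_frames)
        for i, j in bones
    ]
    temporal = [
        (node_index(t, i, n_joints), node_index(t + 1, i, n_joints))
        for t in range(n_frames - 1)
        for i in range(n_joints)
    ]
    return spatial, temporal


# Hypothetical leg chain hip(0)-knee(1)-ankle(2)-toe(3) observed over 4 frames.
BONES = [(0, 1), (1, 2), (2, 3)]
spatial_edges, temporal_edges = build_edges(n_joints=4, n_frames=4, bones=BONES)
```

With 4 joints and 4 frames this gives N = 16 nodes, 12 intra-frame edges (3 bones in each frame), and 12 inter-frame edges (4 joints across 3 frame transitions).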
In spatial and temporal graph convolution, where edges are defined on both the spatial and temporal dimensions, the edge set $E$ is composed of two subsets: the first subset contains the intra-skeleton connections at each frame, and the second subset contains the inter-frame edges, which connect the same joint in consecutive frames. The intra-skeleton edges are denoted as $E_S = \{(v_{ti}, v_{tj}) \mid (i, j) \in n_j\}$ and the inter-frame edges as $E_T = \{(v_{ti}, v_{(t+1)i})\}$, where $n_j$ is the set of joints. For two joints in the skeleton that are connected by a single bone, we set the value of the edge between them to 1. For inter-frame connectivity, we set the edge between the same joint in consecutive frames to 1. We set a small value $\delta = 0.01$ for joints that are not connected by intra-body or inter-frame edges. The connectivity of all joints in a given video is represented by the adjacency matrix $A$, as shown in Equation (2):
$$A = \begin{bmatrix} d_{0,0} & d_{0,1} & \cdots & d_{0,N-1} \\ d_{1,0} & d_{1,1} & \cdots & d_{1,N-1} \\ \vdots & \vdots & \ddots & \vdots \\ d_{N-1,0} & d_{N-1,1} & \cdots & d_{N-1,N-1} \end{bmatrix} \qquad (2)$$
where $d_{i,j}$ is the edge between two joints. As stated earlier, if there is no edge between two joints, then $d_{i,j}$ is set to $\delta = 0.01$. In Equation (2), the first row of the adjacency matrix shows the connection of joint 0 with all joints, both intra-skeleton and inter-frame. Let the number of frames be $M$ and let each frame have 4 joints. The total number of joints is then $N = M \times 4$. Thus, in Equation (2), $d_{0,1}$ is the connection between the first joint in the first frame and the second joint, which is located in the same frame. If there is an edge between the joints, the value of $d_{i,j}$ is set to 1; otherwise, it is set to $\delta$. The toy example in Figure 2 illustrates the case of $N = 16$ nodes. Each row shows how node $i$ is related to all of the nodes in the input. Since we organized the spatial-temporal data and the adjacency matrix as 2D arrays, we can directly apply the graph convolution definition from [34], with a learnable weight matrix $W$, to our problem:
$$f_{out} = D^{-\frac{1}{2}} (A + I) D^{-\frac{1}{2}} f_{in} W \qquad (3)$$
where $I$ is the identity matrix representing self-connections and $D$ is the diagonal node degree matrix used for normalization, given by the summation of the adjacency matrix across the columns, i.e., $D_{ii} = \sum_j A_{i,j}$. $f_{in}$ is the feature matrix with dimension $\mathbb{R}^{N \times d}$; before the first graph convolution is applied, $f_{in}$ is the spatial location of each node, with dimension $\mathbb{R}^{N \times 2}$.
For readers' understanding, Figure 2 depicts a toy example of the proposed graph convolutional method. Among the 25 keypoints, the 8 keypoints that are directly related to gait measurements are selected. The selected keypoints are divided into left and right keypoints for further processing. The (x, y) locations of the left keypoints, namely left hip (LHIP), left knee (LKNE), left ankle (LANK), and left big toe (LBTO), for t frames are used as the input features for the first graph convolutional layer. Our model uses two graph convolutional layers: the first layer takes the location of each keypoint in a video segment as input and produces 8 features. The output of the first layer is fed into the second layer, which produces features with 16 channels; these are fed into a conventional convolutional layer for downsampling, and finally the output is predicted. Figure 3 further illustrates the pipeline of our method from input to output. This method can also be applied to classification problems; the only modification needed is adding a softmax after the fully connected layer.
It is worth noting that, although we achieve a good result using only keypoints extracted with OpenPose, we attained a better result when we included hand-engineered time-series data crafted from the relationships of the joints. In addition to the skeleton body joints, using derived time-series data as input improves the overall performance of the network model. The first derived time series is the angle between the vector from the knee to the hip and the vector from the knee to the ankle. The second time series is the difference between the x-coordinates of the left and right ankles. Our second model takes both types of input, i.e., skeleton joints and derived time-series data, and predicts the gait parameters. As shown in Figure 4, a 1D convolutional neural network is applied to the derived time-series data, and its output is concatenated with the output of the graph convolutional neural network. The combined features are then passed through another 1D CNN and average pooling to learn more features and attain translation invariance. Specifically, average pooling computes the average of the features of the skeleton joints over the video segment.
Since our task is a regression problem, we used the most common loss function, mean squared error (MSE). In Equation (4), $Y_i$ is the ground truth gait metric evaluated from optical motion capture sensors. Based on reflective markers placed on patients, high-frequency cameras and motion capture software tracked the 3D positions, and the 3D joint kinematics were computed using inverse kinematics [11]. The time-series data of the 3D joint kinematics were then analyzed to compute the gait metrics used as ground truth, $Y_i$. $F(X_i; \theta_F)$ is the gait parameter predicted by our model from video inputs. Finally, the loss function is formulated as follows:
$$L_{MSE} = \frac{1}{N} \sum_{i=1}^{N} \big(Y_i - F(X_i; \theta_F)\big)^2 \qquad (4)$$
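The single spatial-temporal adjacency matrix and one graph convolution step described in this section can be sketched in plain Python as follows. The propagation rule D^{-1/2} (A + I) D^{-1/2} f_in W is a common form of the definition in [34], with D computed from A as stated above. The weight matrix `W`, its placeholder values, the zero diagonal of A, and the node numbering are assumptions made for illustration; batch normalization and the activation are omitted.

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def build_adjacency(n_joints: int, n_frames: int,
                    bones: Sequence[Tuple[int, int]], delta: float = 0.01) -> Matrix:
    """Adjacency over N = n_joints * n_frames nodes; node id = t * n_joints + joint."""
    n = n_joints * n_frames
    # Unconnected pairs get the small weight delta. The diagonal is left at 0
    # (an assumption) because self-connections are added through I later.
    adj = [[0.0 if r == c else delta for c in range(n)] for r in range(n)]

    def connect(a: int, b: int) -> None:
        adj[a][b] = adj[b][a] = 1.0  # undirected edge

    for t in range(n_frames):
        base = t * n_joints
        for i, j in bones:                      # intra-skeleton (bone) edges
            connect(base + i, base + j)
        if t + 1 < n_frames:                    # same joint in consecutive frames
            for i in range(n_joints):
                connect(base + i, base + n_joints + i)
    return adj


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def graph_conv(adj: Matrix, f_in: Matrix, weight: Matrix) -> Matrix:
    """One step of D^{-1/2} (A + I) D^{-1/2} f_in W with D_ii = sum_j A_ij."""
    n = len(adj)
    inv_sqrt_deg = [1.0 / math.sqrt(sum(row)) for row in adj]
    norm = [
        [inv_sqrt_deg[r] * (adj[r][c] + (1.0 if r == c else 0.0)) * inv_sqrt_deg[c]
         for c in range(n)]
        for r in range(n)
    ]
    return matmul(matmul(norm, f_in), weight)


# Toy example: leg chain hip(0)-knee(1)-ankle(2)-toe(3) over 4 frames, N = 16 nodes.
A = build_adjacency(n_joints=4, n_frames=4, bones=[(0, 1), (1, 2), (2, 3)])
f_in = [[float(k), float(k) / 2.0] for k in range(16)]   # placeholder (x, y) features
W = [[0.1] * 8, [0.2] * 8]                               # 2 -> 8 features, placeholder
f_out = graph_conv(A, f_in, W)
```

Because the temporal links live in the same matrix as the bone links, a single matrix product mixes information across joints and across frames, which is the point of joining both dimensions in one adjacency matrix.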

Dataset
We used a dataset from Gillette Children's Speciality Healthcare collected from 1994 to 2015 [5]. The dataset has 1792 videos of 1026 unique patients with cerebral palsy. The patients' average age was 11 years, with a standard deviation (SD) of 5.9. Average height and mass were 133 cm (SD = 22) and 34 kg (SD = 17), respectively. The ground truth metrics were computed from optical motion capture data. As described in [5], reflective markers were placed on anatomical landmarks of each patient. Then, the optical motion capture system, together with tracking software, captured the positions of the markers as the patients moved in a controlled space.
Finally, engineers post-processed these data and computed the gait metrics that were used as ground truth. The video data used to train our model were collected in the same setting as the ground truth but at a different time. The skeleton data were extracted from the videos using the publicly available OpenPose [12] toolbox, which gave 2D coordinates (x, y) with a confidence score C for each of 25 joints.

Performance Metrics
The gait metrics measured by these methods are common in many neurological and musculoskeletal disorders. The gait metrics used in our model were walking speed, cadence, knee flexion angle at maximum extension, and GDI. As the performance criterion, we used the correlation coefficient between the predicted values and the ground truth values computed from optical motion capture data. During training, the MSE loss function was optimized with adaptive moment estimation (Adam). After choosing the best parameters for each metric on the validation set, we took 300 samples from the test data to examine how the values predicted by our model relate to the labels using correlation coefficients.
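A correlation coefficient between predicted and ground truth gait metrics can be computed as in the sketch below. The text does not state which coefficient was used, so the Pearson coefficient is an assumption here, and the sample values are made up for illustration.

```python
import math
from typing import Sequence


def pearson_r(pred: Sequence[float], truth: Sequence[float]) -> float:
    """Pearson correlation between predicted and ground truth values."""
    n = len(pred)
    if n != len(truth) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_p = sum(pred) / n
    mean_t = sum(truth) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, truth))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in truth)
    if var_p == 0.0 or var_t == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_p * var_t)


# Made-up walking speeds (m/s): model output versus motion capture ground truth.
predicted = [0.9, 1.1, 1.0, 1.3, 0.7]
measured = [1.0, 1.2, 0.9, 1.4, 0.6]
r = pearson_r(predicted, measured)
```

Unlike MSE, this value is insensitive to a constant offset or scale in the predictions, so the two measures complement each other.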

Experiment Settings
In this work, training and testing were implemented on a Linux cluster machine with an Intel(R) Core i7-8700K CPU @ 3.7 GHz × 12 and an NVIDIA GeForce GTX 1080 GPU. The network was trained in a fully supervised way with an L2 loss function, using adaptive moment estimation (Adam) as the optimization method. We trained both models for a maximum of 100 epochs with a learning rate of 0.015 and early stopping with a window size of 10, i.e., we stopped training if the validation loss did not decrease for 10 consecutive epochs. To avoid stopping training too early, we decreased the learning rate by a factor of 10 every 20,000 iterations. We allocated 60% of the data for training, 20% for validation, and the remaining 20% for testing.
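The early stopping rule and the step-wise learning rate decay described above can be sketched independently of any deep learning framework. The class and function names are hypothetical, and treating only a strict decrease of the validation loss as an improvement is an assumption.

```python
class EarlyStopping:
    """Signal a stop once the validation loss has not decreased for `patience` epochs."""

    def __init__(self, patience: int = 10) -> None:
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def learning_rate(iteration: int, base_lr: float = 0.015,
                  step: int = 20_000, factor: float = 10.0) -> float:
    """Divide the base learning rate by `factor` once every `step` iterations."""
    return base_lr / (factor ** (iteration // step))


# Toy run with a short patience so that the stop is visible.
stopper = EarlyStopping(patience=3)
decisions = [stopper.step(loss) for loss in [1.0, 0.9, 0.95, 0.96, 0.97]]
```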

Data Normalization
From each frame in a video, the OpenPose toolkit extracted the 2D image-plane coordinates of 25 keypoints together with the confidence of each keypoint. For each detected person in a frame, the given points were the (x, y) coordinates, in pixels, of the centers of the torso, nose, and pelvis, and the centers of the left and right shoulders, elbows, hands, hips, knees, ankles, heels, first and fifth toes, ears, and eyes [12]. In some frames, the toolkit failed to detect the person; we removed 1443 such cases from the training data. In other cases, only a few of the skeleton joints were missed, and we used linear interpolation to fill in the missing points.
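Linear interpolation of a keypoint coordinate over time can be sketched as follows. Marking missing detections as `None` and filling gaps at the start or end of a segment with the nearest detected value are assumptions made for illustration; the function name is hypothetical.

```python
from typing import List, Optional


def interpolate_missing(series: List[Optional[float]]) -> List[float]:
    """Fill None entries of one coordinate time series by linear interpolation."""
    known = [i for i, v in enumerate(series) if v is not None]
    if not known:
        raise ValueError("no detected values to interpolate from")
    out = list(series)
    for i, value in enumerate(series):
        if value is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:            # gap at the start: copy the first detection
            out[i] = series[right]
        elif right is None:         # gap at the end: copy the last detection
            out[i] = series[left]
        else:                       # interior gap: weight by distance in frames
            w = (i - left) / (right - left)
            out[i] = series[left] + w * (series[right] - series[left])
    return out


# Toy x-coordinate of one joint over 6 frames with three missed detections.
filled = interpolate_missing([None, 1.0, None, None, 4.0, None])
```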
Some of the input data were noisy and might therefore not give the expected result. For example, the x-coordinate of the left ear and a few other time series were noisy and contributed undesired results. To mitigate the effects of these noisy data, we normalized the image-plane coordinates of the knees, ankles, hips, and big toes, the projected angles of ankle and knee flexion, the distance between the first toe and the ankle, and the distance between the left and right ankles [13]. In addition, more time-series data were generated using window slicing, and using the augmented dataset avoided sensitivity to variation in the starting frame of each segment.
Figure 5 depicts the comparison of the graph convolutional neural network (GCNN), CNN, ridge regression (RR), and random forest (RF). In the proposed method, we used mean squared error (MSE) to train our model. MSE measures how far the predictions are from the actual output, but it does not tell us whether we are under-predicting or over-predicting. Therefore, we used an additional metric, correlation, to evaluate the outcome. Here, correlation describes how the gait parameters predicted from video inputs relate to the gait parameters calculated from laboratory-based motion capture sensors. We took 300 gait metric samples predicted by our model on the test data, compared them with the ground truth, and examined how they relate to each other. The correlation between the speed predicted by our model and the ground truth was 0.791 (0.742-0.853). For GDI, the correlation was 0.792 (0.710-0.822). For cadence and knee maximum flexion, our proposed method produced results similar to those of the CNN [11]. During inference, our system takes 0.2 s to predict the gait parameter of a patient. Table 1 shows the detailed architecture of the proposed model (Figure 4).
The 2D joints extracted from a video using OpenPose [12] were fed into the first part of the model. In addition to the keypoints extracted using OpenPose, we included a hand-engineered time series derived from the joints to get an optimal result. Concatenating the time-series data (computed from the relations of the skeleton joints), after it is passed through a 1D-CNN, with the GCNN output increases the prediction accuracy. Table 2 shows the comparison of these two approaches in terms of the correlation coefficient with the ground truth. Part i has the two graph convolutional layers, which compute 8- and 16-dimensional feature maps, respectively. Part ii of the proposed method has two convolutional layers, each with a kernel size of 7. In part iii, i.e., after the concatenation of the joint features from the graph convolutional layers and the time-series features from the convolutional layers, we used a 1D convolutional layer, followed by max pooling and another 1D convolutional layer with kernels of size 8. All the layers are followed by batch normalization and ReLU activation.
Table 1. Layer descriptions of the proposed method: part i is composed of two graph convolutional layers; part ii is a 1D convolutional network for the hand-engineered input data; and part iii comprises 1D convolutional, max pooling, and fully connected layers for the concatenated output of parts i and ii.
[Table 1 columns: Part, Layer Type, Layer, Number of Units, Kernel Size, Dropout]
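The hand-engineered time series fed to part ii are described earlier as the angle between the knee-to-hip and knee-to-ankle vectors and the difference between the x-coordinates of the left and right ankles. A minimal per-frame sketch is given below; the function names, the use of degrees, and the toy coordinates are assumptions made for illustration.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def knee_angle(hip: Point, knee: Point, ankle: Point) -> float:
    """Angle in degrees between the knee->hip and knee->ankle vectors."""
    v1 = (hip[0] - knee[0], hip[1] - knee[1])
    v2 = (ankle[0] - knee[0], ankle[1] - knee[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    cross = v1[0] * v2[1] - v1[1] * v2[0]
    # atan2 of |cross| and dot stays numerically stable near 0 and 180 degrees.
    return math.degrees(math.atan2(abs(cross), dot))


def ankle_x_difference(left_ankle: Point, right_ankle: Point) -> float:
    """Difference between the x-coordinates of the left and right ankles."""
    return left_ankle[0] - right_ankle[0]


# Toy frame in pixel coordinates: a fully extended leg gives an angle of 180 degrees.
straight = knee_angle(hip=(100.0, 50.0), knee=(100.0, 150.0), ankle=(100.0, 250.0))
gap = ankle_x_difference(left_ankle=(120.0, 250.0), right_ankle=(90.0, 252.0))
```

Applying both functions frame by frame over a video segment yields the two derived time series.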

Conclusions
In this paper, we proposed a graph convolutional neural network that can capture the characteristics and relationships of skeleton input data in the spatial and temporal dimensions. Our method takes advantage of the structural information in the input human skeleton pose, which is extracted using OpenPose. In the method, the spatial and temporal features of the input are represented in a single adjacency matrix, which allows graph convolution to be applied in both dimensions. As a result, the method captures the main features of the patient's motion that contribute to predicting the gait parameters. Our approach was evaluated on patients with cerebral palsy, and it outperformed the state-of-the-art method on the dataset prepared by Gillette Speciality Healthcare. Our method predicted clinically apparent motion metrics from ordinary videos of patients with cerebral palsy. The method helps clinicians address the symptoms of neurological and musculoskeletal disorders without placing reflective markers on patients' anatomical landmarks, a procedure that is very expensive and takes a lot of effort and time when diagnosing patients. In addition, the method achieved this result with fewer parameters, faster training, and earlier convergence. In future work, we will apply the proposed method to other types of musculoskeletal and neurological disorders.
Funding: This research received no external funding.
Data Availability Statement: As described in this paper, the data prepared by Gillette Speciality Healthcare [5] were processed using the OpenPose algorithm [12]. To protect patient privacy, the video data are not publicly available. However, the preprocessed data are publicly available at [11].
