Home-Based Measurements of Dystonia in Cerebral Palsy Using Smartphone-Coupled Inertial Sensor Technology and Machine Learning: A Proof-of-Concept Study

Accurate and reliable measurement of the severity of dystonia is essential for the indication, evaluation, monitoring and fine-tuning of treatments. Assessment of dystonia in children and adolescents with dyskinetic cerebral palsy (CP) is currently performed by visual evaluation, either directly in the doctor's office or from video recordings, using standardized scales. Both methods lack objectivity and require much time and effort from clinical experts. Moreover, they capture only a snapshot of the severity of dyskinetic movements (i.e., choreoathetosis and dystonia), which are known to fluctuate over time and can increase with fatigue, pain, stress or emotions, all of which are likely in a clinical environment. The goal of this study was to investigate whether it is feasible to assess and evaluate the severity of dystonia from home-based measurements using smartphone-coupled inertial sensors and machine learning. Video and sensor data during both active and rest situations from 12 patients were collected outside a clinical setting. Three clinicians analyzed the videos and clinically scored the dystonia of the extremities on a 0-4 scale, following the definition of amplitude of the Dyskinesia Impairment Scale. The clinical scores and the sensor data were coupled to train different machine learning models using cross-validation. The average F1 scores (0.67 ± 0.19 for the lower extremities and 0.68 ± 0.14 for the upper extremities) on independent test datasets indicate that it is possible to detect dystonia automatically using individually trained models. The predictions could complement standard dyskinetic CP measures by providing frequent, objective, real-world assessments that could enhance clinical care. A generalized model, trained with data from other subjects, shows lower F1 scores (0.45 for the lower extremities and 0.34 for the upper extremities), likely due to a lack of training data and dissimilarities between subjects.
However, the generalized model is reasonably able to distinguish between high and low scores. Future research should focus on gathering more high-quality data and on studying how the models perform over the whole day.

As dystonia and choreoathetosis are significantly variable in dyskinetic CP between as well as within subjects concerning the involved body parts, and depend on environmental factors and the activity performed [20][21][22], automatic evaluation is a challenging machine learning task.
Monitoring the movement disorders of children and young adults with dyskinetic CP for a longer period of time within a familiar environment would provide a realistic and reliable evaluation of dystonia and choreoathetosis, and could support treatment decisions and monitoring for this complex group. Within this proof-of-concept study, we used four IMUs coupled to a smartphone, allowing the collection of IMU data and time-synchronized video recordings at home. We aim (1) to show the feasibility of data collection in a natural environment in children and young adults with dyskinetic CP and (2) to train a machine learning model that can detect and score dystonia using IMU data.

Materials and Methods
The flowchart in Figure 1 summarizes the dataflow from the home measurements (IMUs and videos) towards the final evaluation of the selected classification models. Below, a detailed description of the methods is provided.

Participants
Participants were recruited from the pediatric outpatient rehabilitation department during regular appointments from 1 March until 31 October 2021. Patients were included if they (1) had a clinical diagnosis of dyskinetic CP [23,24], (2) were 4-24 years old, and (3) had parents/caregivers able to follow the instructions for the home-based measurements.
In total, 12 participants were included. Participants had the following characteristics (mean ± standard deviation (range)): The study was approved by the Medical Ethics Committee of the VU University Medical Center Amsterdam (The Netherlands). Written informed consent was obtained from participants and, if applicable, their parents for participation in this study.

The Xsens DOT is a wearable sensor incorporating 3D accelerometers, gyroscopes and magnetometers to provide acceleration, angular velocity, and the Earth's magnetic field. Combined with the Xsens sensor fusion algorithms, 3D orientation and free acceleration are provided [10]. Inertial and orientation data outputs of the Xsens DOT sensor are presented in Table 1. The Xsens DOT sensors were set to measure at a sampling frequency of 60 Hz with an accelerometer range of ±16 g and a gyroscope range of ±2000 dps.

Procedure
For the measurements within this proof-of-concept study, participants could choose between measurements at home or in the hospital. For home measurements, participants received a measurement set containing a mobile phone with the MODYS@home app installed, four IMUs, chargers for the phone and sensors, fixation material and a manual. The four Xsens DOT sensors were attached to the forearm (palmar side, proximal of the processus styloideus ulnae) and lower leg (proximal of the lateral malleolus) (Figure 2). The method of fixation on the attachment site was individually determined. Participants and parents/caregivers were instructed on how to place the IMUs on the participant and how to use the MODYS@home app to record videos and collect sensor data. They were asked to record 10 videos of about 1 minute each day, for both active and resting situations, for 7 days within a period of 2 weeks. After the 2-week period, the measurement set was picked up by the researcher and the data were transferred via USB connection for further analysis. For the individuals measured in the hospital, activity and rest data were collected mimicking a home-based environment. Examples of activities performed at home as well as in the hospital are wheelchair driving, walking, stair climbing, cycling, eating/drinking, sports activities, gaming, computer use, playing music, playing a board game, reading, watching a video/television, using a communication device and resting in a chair or lying down. Activities were chosen by parents/caregivers and participants depending on the functional level of the individual. Videos during passive movements (e.g., caregiving, transfers) were excluded from the current analysis.

Software
Clinical scoring was done using ELAN version 6.2 (Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands), an open-source tool for video annotation, available from https://archive.mpi.nl/tla/elan/download (accessed on 3 May 2022). MATLAB (MathWorks Inc., Natick, MA, USA) release R2018b was used for processing the data and developing the machine learning models. The code used in the current study is made available (Supplementary S1).


Clinical Scoring
Three clinicians assessed the videos. For each 5 s time window, a score between 0 and 4 was assigned for dystonia, separately for the left and right arm and the left and right leg, following the definition of amplitude of the Dyskinesia Impairment Scale (DIS) [25]. In Parkinson's disease, a 5 s time window was found to be optimal to achieve the minimum estimation error when estimating the severity of tremor, bradykinesia and dyskinesia using accelerometers and machine learning [26]. The DIS distinguishes between proximal and distal segments of the extremities when scoring amplitude; in the current scoring, these were summarized into a single score per extremity. Thus, each clinician provided four scores for dystonia for each time window of each video. The median of the three clinicians' scores was used as the final score for the machine learning.
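The consensus scoring step can be sketched in a few lines (Python here for illustration; the study's own pipeline was implemented in MATLAB, and the function and variable names below are hypothetical):

```python
import statistics

def consensus_scores(rater_scores):
    """Combine per-window dystonia scores (0-4) from three raters
    into a final score by taking the median per window."""
    # rater_scores: three equal-length lists, one per clinician
    n_windows = len(rater_scores[0])
    return [statistics.median(r[w] for r in rater_scores) for w in range(n_windows)]

# Three clinicians scoring four 5-second windows of one limb:
final = consensus_scores([[2, 3, 1, 0],
                          [2, 2, 1, 1],
                          [3, 3, 0, 0]])
# final -> [2, 3, 1, 0]
```

Using the median rather than the mean keeps the consensus on the original 0-4 ordinal scale and is robust to one clinician deviating from the other two.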

Data Pre-Processing
Data from the IMUs required pre-processing to serve as input for machine learning. As some time stamps were missing for different sensors, the sensor data from the four sensors were synchronized using linear interpolation between the values from adjacent timestamps.
For each sensor, the resultant free acceleration (a) and resultant angular velocity (ω) at each time stamp were calculated using Equations (1) and (2), respectively:

a = √(ax² + ay² + az²) (1)

ω = √(ωx² + ωy² + ωz²) (2)

Each sensor therefore provided 11 signals: 4 accelerations (3 components and the resultant), 4 angular velocities and 3 Euler angles. A single timestamp containing data from all four sensors thus comprised 4 × 11 = 44 signals. Each 5 s time window contained 300 timestamps.
In MATLAB, the videos were automatically linked to the sensor data by cutting out the parts of the sensor data during which a video was recorded. These cut-out parts were segmented into time windows of 5 s, equal to the clinical scoring windows. Finally, the clinical scores were automatically linked to the corresponding time windows. Figure 3 shows an example of the sensor signals together with the clinical scoring.
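The resultant computation (Equations (1) and (2)) and the 5 s windowing at 60 Hz can be sketched as follows (an illustrative Python sketch, not the study's MATLAB code; names are hypothetical):

```python
import numpy as np

FS = 60            # sampling frequency (Hz)
WIN_S = 5          # window length (s)
WIN = FS * WIN_S   # 300 samples per 5 s window

def resultant(xyz):
    """Root-sum-of-squares of an (n, 3) array of x/y/z components,
    as in Equations (1) and (2)."""
    return np.sqrt((xyz ** 2).sum(axis=1))

def segment(signal, win=WIN):
    """Split a 1-D signal into non-overlapping 5 s windows,
    dropping any incomplete trailing window."""
    n = len(signal) // win
    return signal[: n * win].reshape(n, win)

acc = np.zeros((900, 3))   # 15 s of dummy free acceleration
a = resultant(acc)         # resultant acceleration per timestamp
windows = segment(a)       # shape (3, 300): three 5 s windows
```

The same two helpers apply unchanged to the angular velocity, yielding the resultant ω per timestamp and its 5 s windows.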

Per subject, two tables containing input data and output were created for machine learning: one for the upper and one for the lower extremities, combining the data from the left and right extremities.

Feature Selection and Extraction
Research has shown that feature selection is an effective way to improve the learning process and recognition accuracy, and decreases the complexity and computational cost [27]. We used a method recently described by Den Hartog et al. [28]. In brief, time domain and frequency domain features were tested on the data from all subjects. A Fast Fourier Transform was used to extract frequency-domain features. Initially, 32 different feature classes were tested for usability. For each time window, a single feature class was extracted per IMU signal, creating 11-dimensional feature vectors (1 feature class × 11 signals). These feature vectors were then fed to six different machine learning algorithms (Decision Tree, Discriminant Analysis, Naïve Bayes, Support Vector Machine, k-nearest neighbors, and Ensemble Learning), to test the feature classes' predictive power. Feature classes were only selected if they were capable of achieving an F1 score of at least 0.7 with a machine learning algorithm, indicating a strong correlation with the output. A total of 10 feature classes passed the selection round (Table 2).
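The screening rule, keep a feature class if any algorithm reaches F1 ≥ 0.7 with it, can be sketched as follows (illustrative Python; the feature-class names and F1 values below are hypothetical):

```python
def screen_feature_classes(f1_scores, threshold=0.7):
    """Keep a feature class only if at least one of the tested
    algorithms reaches an F1 score of `threshold` with it."""
    return [cls for cls, by_model in f1_scores.items()
            if max(by_model.values()) >= threshold]

# Hypothetical screening results: F1 per algorithm per feature class
f1_scores = {
    "mean":            {"tree": 0.72, "knn": 0.68},
    "kurtosis":        {"tree": 0.55, "knn": 0.61},
    "shannon_entropy": {"tree": 0.66, "knn": 0.74},
}
kept = screen_feature_classes(f1_scores)
# kept -> ["mean", "shannon_entropy"]
```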

Next, for each time window, all 10 feature classes were extracted for each of the 11 IMU signals, creating 110-dimensional feature vectors (10 feature classes × 11 signals). This means that for each time window there are 110 features that could describe the characteristics of that window.
Next, sequential feature selection (SFS), as described by MATLAB (Sequential Feature Selection - MATLAB & Simulink - MathWorks Benelux), was used, as this is an effective way to identify redundant and irrelevant features. SFS is a wrapper-type feature selection algorithm that starts training using a subset of features and then adds or removes a feature using a selection criterion. The selection criterion directly measures the change in model performance that results from adding or removing a feature. The algorithm repeats training and improving a model until its stopping criteria are satisfied.
In this study, SFS sequentially added features to an empty candidate set until the addition of further features no longer decreased the objective function, with the misclassification rate as the objective function and a maximum of 20 objective function evaluations.
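A minimal sketch of greedy sequential forward selection with a misclassification-rate objective and an evaluation budget (illustrative Python, not MATLAB's sequentialfs; the names and the toy objective are hypothetical):

```python
def sfs(features, error_of, max_evals=20):
    """Greedy sequential forward selection: starting from an empty set,
    repeatedly add the feature that most reduces the objective
    (misclassification rate), stopping when no addition helps or the
    evaluation budget is exhausted."""
    selected, best_err, evals = [], error_of([]), 0
    improved = True
    while improved and evals < max_evals:
        improved = False
        candidate, cand_err = None, best_err
        for f in features:
            if f in selected:
                continue
            err = error_of(selected + [f])
            evals += 1
            if err < cand_err:
                candidate, cand_err = f, err
        if candidate is not None:
            selected.append(candidate)
            best_err = cand_err
            improved = True
    return selected

# Toy objective: features "a" and "b" genuinely reduce the error, "c" does not
def error_of(subset):
    return 1.0 - 0.4 * ("a" in subset) - 0.3 * ("b" in subset)

selected = sfs(["a", "b", "c"], error_of)
# selected -> ["a", "b"]
```

In the study, `error_of` would retrain and cross-validate a classifier on the candidate feature subset, which is why the evaluation budget matters.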
Finally, the extracted features were normalized to rescale the data to a common scale. Supervised machine learning algorithms learn the relationship between input and output, and the unit, scale, and distribution of the input data may vary from feature to feature, which can impact the classification accuracy of the models. In this work, the data was normalized by scaling each input variable to a range of 0 to 1.
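Min-max normalization to [0, 1] can be sketched as follows (illustrative Python; in practice the minima and maxima should be computed on the training data only and reused for the test data):

```python
import numpy as np

def minmax_normalize(X):
    """Rescale each feature (column) to the range [0, 1]."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero for constant features
    return (X - lo) / span

X = [[1.0, 10.0], [3.0, 30.0], [2.0, 20.0]]
Xn = minmax_normalize(X)  # each column now spans 0..1
```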

Machine Learning and Algorithms
After processing the data and extracting features, the next step is to feed the feature vectors to machine learning algorithms. In this study, six types of supervised machine learning algorithms were tested: Decision Tree, Discriminant Analysis, Naïve Bayes, Support Vector Machine, k-nearest neighbors, and Ensemble Learning.

Training, Validating and Testing
For an objective evaluation of the machine learning algorithms, the datasets were divided into a training dataset, validation dataset and testing dataset.
Since the datasets were small, 5-fold cross-validation was used to evaluate the performance of the models. In each iteration, 80% of the data was used for training and validation, and 20% for testing. Within the training and validation data, another 5-fold cross-validation was used to train the machine learning models.
The validation dataset provides an evaluation of a model fit on the training dataset while tuning the model's hyperparameters [29]. After training and validating, the trained models were evaluated with the testing data containing 20% of the data. The testing dataset was used to provide an unbiased evaluation of a final model fit on the training dataset [29]. This testing dataset was not used for training. Since a 5-fold cross-validation was used, all samples were tested in the testing dataset. The models' predicted clinical scores of the testing data were compared with the true clinical scores, to calculate the precision, recall and F1 score of the model when used on unseen data [29].
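The nested cross-validation scheme, an outer 5-fold train/test split with an inner 5-fold split of the training portion, can be sketched as follows (illustrative Python; hypothetical names):

```python
import numpy as np

def kfold_indices(n, k=5, seed=0):
    """Shuffle sample indices and split them into k roughly equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), k)

def nested_cv(n_samples, k=5):
    """Outer k-fold split into train+validation (80%) and test (20%);
    the train+validation portion is itself split k-fold for tuning."""
    outer = kfold_indices(n_samples, k)
    splits = []
    for i in range(k):
        test = outer[i]
        train_val = np.concatenate([outer[j] for j in range(k) if j != i])
        inner = kfold_indices(len(train_val), k, seed=i)  # folds index into train_val
        splits.append((train_val, test, inner))
    return splits

splits = nested_cv(100)  # 5 outer folds; every sample is tested exactly once
```

Because every sample appears in exactly one outer test fold, predictions on unseen data are obtained for the whole dataset, as described above.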
Most datasets contained a severe skew in the class distribution, which could lead to the machine learning algorithms performing poorly on the minority classes. To address this problem, the training data were oversampled to equalize the number of samples per score.
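Random oversampling to equalize class counts can be sketched as follows (illustrative Python; hypothetical names; applied to the training data only, so duplicated samples never leak into the test set):

```python
import random
from collections import Counter

def oversample(samples, labels, seed=0):
    """Randomly duplicate minority-class samples until every class
    has as many samples as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [x for x, y in zip(samples, labels) if y == cls]
        for _ in range(target - n):
            out_x.append(rng.choice(pool))
            out_y.append(cls)
    return out_x, out_y

X, y = oversample(['a', 'b', 'c', 'd'], [0, 0, 0, 1])
# class 1 is duplicated until both classes have 3 samples
```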
Different models were trained, validated, and tested using four different settings for each of the six machine learning algorithms, for both the upper extremity and the lower extremity datasets. Models were trained (1) using all features (ALL), (2) using all features with hyperparameter tuning to find the optimal set of hyperparameters (ALL + HYP), (3) using selected features (SFS) and (4) using selected features with hyperparameter tuning (SFS + HYP) (Table 3). For ALL + HYP and SFS + HYP, the hyperparameters were determined using a Bayesian optimization algorithm with 15 iterations during the first fold (Table 3). The hyperparameters found were then used during the remaining folds to determine the model's precision, recall and F1 score. Individual models (i.e., using the data of one participant only) as well as generalized models (i.e., using all data) were trained, and the performance of each model was calculated. The trained individual models were tested on holdout testing data using 5-fold cross-validation. Generalized models were evaluated using leave-two-subjects-out cross-validation (6-fold): for each of the 6 folds, the data from 10 subjects was used for training and validation (5-fold cross-validation), and the models were tested on the data from the two left-out subjects.
As the main performance metric, the F1 score was computed from the precision and recall (Equations (3)-(5)), calculated from the 'true positive' (TP), 'false positive' (FP) and 'false negative' (FN) counts:

precision = TP / (TP + FP) (3)

recall = TP / (TP + FN) (4)

F1 = 2 × (precision × recall) / (precision + recall) (5)

F1 scores were calculated after training and validating, and after testing the models on the holdout test data. Per patient, the models with the highest F1 scores were selected as the final models for that patient. In addition, for the generalized models the root mean square error (RMSE) was calculated and confusion matrices were plotted for better interpretation of the model performance.
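Equations (3)-(5) can be computed directly from the confusion counts; the sketch below shows the per-class (binary) case, and for multiple classes these values are computed per class and averaged (illustrative Python; hypothetical names):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from true-positive, false-positive
    and false-negative counts (Equations (3)-(5))."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
# p = 0.8, r = 0.8, and f1 is their harmonic mean, also 0.8
```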

Datasets
Two patients were measured within the movement laboratory mimicking a home environment and activities; the other ten patients were measured at home by parents/caregivers. Even though parents/caregivers were instructed to record 10 one-minute videos each day, there were large differences in the number of samples (5 s time windows) in the final datasets for each subject. Not all parents/caregivers recorded as many videos as instructed. One participant stopped after one measurement due to discomfort while attaching and wearing the sensors. The data of this subject were excluded from the individually trained models. Furthermore, sensor errors occurred in some measurements, resulting in loss of data. The most common errors were failure of one or more sensors and an error in the synchronization between the sensors. Moreover, not all windows could be scored because certain body parts were not visible in the videos. These factors led to different dataset sizes for each subject. Table 4 lists the number of samples in each dataset of each subject. See Supplementary S2 for an overview of the distribution of the scores for each patient. The full dataset is available (Supplementary S3).

Table 4 gives an overview of the best models (algorithm and model type) for each patient, together with the corresponding F1 scores, precision and recall. k-nearest neighbors algorithms led to the highest F1 validation score in most datasets and were therefore most often chosen as the final model. Table 5 gives an overview of the mean F1 scores, precision and recall of all best models combined. High F1 scores (0.97 ± 0.03 for lower extremity dystonia and 0.93 ± 0.06 for upper extremity dystonia) were observed during validation of the individual models.
In the independent test datasets, the F1 scores (0.67 ± 0.19 for lower extremity dystonia and 0.68 ± 0.14 for upper extremity dystonia) were lower (Table 5).

Generalized Clinical Scores Classification
See Table 6 for an overview of the best models per dataset. Figures 4 and 5 show the confusion matrices of the datasets. The generalized models showed lower F1 scores (0.45 for the lower extremities and 0.34 for the upper extremities) on the test datasets than the individual models. F1 scores were high in the validation datasets, but considerably lower in the test datasets, indicating that the models do not work equally well on unseen data. The majority of misclassifications occurred in neighboring clinical scores, since these present similar behaviors. The RMSEs were 1.07 for lower extremity dystonia and 0.98 for upper extremity dystonia, respectively. A clinical score of 4 in the upper extremity dystonia dataset was never correctly classified, likely due to a lack of such samples during training of the models.

Discussion
Within this study, we assessed the feasibility of training machine learning models with sufficient performance in dyskinetic CP, using home-based IMU and video data collected by parents/caregivers.
In summary, most of the parents/caregivers were able to collect enough data to clinically score the videos and to use the IMU data for feature calculation. For 1 patient out of 12, discomfort due to the fixation of the sensors was reported. We consider the performance (i.e., F1 score) of the individually trained models as moderate and the overall performance of the generalized models as low. However, when looking at the confusion matrices, the misclassifications were most often observed in neighboring classes, indicating that these models are reasonably able to relate the severity of the disorder to the clinical score. This observation is confirmed by the RMSEs of about 1 on a 4-point scale.
The current results are in line with previous studies using wearable IMUs or accelerometers in other patient populations (e.g., Parkinson's disease [30][31][32] and Huntington's disease [19]), showing that it is feasible to automatically predict the severity of movement disorders such as tremor, bradykinesia and dyskinesia. Most studies using wearables to monitor movement disorders have been performed in Parkinson's disease, including steps towards clinical implementation (i.e., assessment of the measurement properties of methods). However, widespread clinical use is still lacking [16,18]. When relating the current results to studies in Parkinson's disease, the reported performances are comparable: e.g., Tsipouras et al. [31] used IMUs to automatically classify levodopa-induced dyskinesia within standardized tasks on a 0-4 scale, using machine learning algorithms and multiple combinations of sensors and features. A generalized model within that study achieved an average accuracy of 79% ± 11% [31]. However, these results need to be interpreted with care, as no independent test set was used and no F1 scores were computed. Another study used sensors placed on the upper and lower extremities; a high correlation was found between the model-predicted and the expert-rated dyskinesia severity scores (r = 0.77, p < 0.001) [33]. Although the Parkinson's disease and dyskinetic CP populations are not directly comparable, this indicates the potential of the proposed methodology for individuals with dyskinetic CP. A recent study suggested that IMUs can be used as a mobile alternative to marker-based motion capture (omitting the need for an advanced movement laboratory) for upper extremity movement analysis of standardized movements in dyskinetic CP [14]. The proposed methodology goes one step further by using home-based collected IMU data within unstandardized situations.
This methodology is especially interesting for individuals who cannot perform standardized movements (such as gait and reaching/grasping), for whom instrumented methods are lacking [34]. In addition, the methodology offers the opportunity to capture the variability of dystonia over a longer period of time. The results show that within an individual, dystonia is 'consistent' enough to be detected within unseen data. However, this is not true for all individuals (i.e., subject 11 and subject 12 showed lower F1 scores on the test set). The same applies to the generalized models. A possible inconsistency within data of individuals as well as between individuals could be explained by, on the one hand, the challenge for clinicians to score home-based videos consistently and, on the other hand, the variation in velocity and position of dystonia that can occur within and between individuals [20]. As the performance of machine learning models greatly depends on the amount, coverage and quality of the data, the performance of the individual models would most likely increase with the collection of more data from each individual, as well as measurements from more patients.
A limitation of this study is the low number of subjects included, which limits the amount of data used to train and test the generalized models. Another limitation is that data was collected only at certain fixed moments, mainly while standing, sitting, and lying down. The developed models are therefore not properly trained with data from other everyday activities, which is likely to lead to inaccuracies in the predictions if the models are used on data from an entire day. Future research should focus on gathering more types of movements and activities, to train even more accurate predictive models. The models might also improve by adding IMU data from children and young adults without a movement disorder, especially as it has been hypothesized that overflow movements seen in dystonia may contain a small repertoire of involuntary movements within a more variable repertoire of intended voluntary movements [35]. As the collection of more and variable data might be difficult to perform on a large scale, data augmentation techniques for time series should be considered in future studies [36]. In addition, it could be an option to perform a 'calibration measurement' for each individual before using the sensors in a home environment [16], add some extra clinical scores on time windows, and use transfer learning (i.e., adding the individually scored data to the pretrained generalized model) to improve the performance of the generalized models for each patient individually.
Since the results of this study demonstrated the feasibility of monitoring dystonia at home, it would be interesting to study the use of the models for treatment assessment (e.g., how the clinical scores vary before and after intrathecal baclofen treatment), with the hypothesis that the clinical scores will decline after the treatment. Moreover, the methods described in this paper could also be used to classify choreoathetosis, which also occurs in dyskinetic CP. However, there was too little variation in the scores in the current data to train models to classify choreoathetosis.

Conclusions
The results of this study indicate that it is feasible to assess dystonia in dyskinetic CP outside a clinical setting using home measurements and individually trained machine learning models, and thereby to provide clinically useful information about the progression of dystonia over a longer period of time. The findings are in line with previous research on automatic assessment of dyskinesia in Parkinson's disease. To enhance clinical care, future studies should evaluate how standard dyskinetic CP measures can be complemented by frequent, objective, real-world assessments. Even though the generalized models were trained with a limited amount of data and achieved low F1 scores, they are reasonably able to link high clinical scores to high severity of the disorder and vice versa. Future research should focus on gathering more high-quality data and on studying how the models perform over longer periods of time.