Using Wearable Sensors and Machine Learning to Automatically Detect Freezing of Gait during a FOG-Provoking Test

Freezing of gait (FOG) is a debilitating motor phenomenon that is common among individuals with advanced Parkinson’s disease. Objective and sensitive measures are needed to better quantify FOG. The present work addresses this need by leveraging wearable devices and machine-learning methods to develop and evaluate automated detection of FOG and quantification of its severity. Seventy-one subjects with FOG completed a FOG-provoking test while wearing three wearable sensors (lower back and each ankle). Subjects were videotaped before (OFF state) and after (ON state) they took their antiparkinsonian medications. Annotations of the videos provided the “ground truth” for FOG detection. A leave-one-patient-out validation process with a training set of 57 subjects resulted in 84.1% sensitivity, 83.4% specificity, and 85.0% accuracy for FOG detection. Similar results were seen in an independent test set (data from 14 other subjects). Two derived outcomes, percent time frozen and number of FOG episodes, were associated with self-report of FOG. Both derived metrics were higher in the OFF state than in the ON state and in the most challenging level of the FOG-provoking test, compared to the least challenging level. These results suggest that this automated machine-learning approach can objectively assess FOG and that its outcomes are responsive to therapeutic interventions.


Introduction
Freezing of gait (FOG) is one of the most disabling and enigmatic phenomena that impact people with Parkinson's disease (PD) [1,2]. PD subjects with FOG suffer from restrictions in the movement of their feet, especially in stressful situations. FOG increases the risk of falling and impairs functional independence [3]. It can lead to the total arrest of the movement of the feet (i.e., akinesia); at other times, however, there is a kinetic manifestation in which the legs shake at a relatively high frequency.

Participants
The current work is based on a dataset that was originally collected as part of another study. That study aimed to examine the impact of transcranial direct current brain stimulation (tDCS) on FOG in individuals with PD (manuscript in preparation). Seventy-one PD subjects with FOG were selected from the dataset. They were recruited from the Center for the Study of Movement, Cognition, and Mobility (CMCM) at the Neurological Institute in Tel Aviv Sourasky Medical Center, Israel, and the Hinda and Arthur Marcus Institute for Aging Research, Hebrew SeniorLife, Harvard Medical School, Boston, MA. All study participants met the UK Brain Bank criteria for idiopathic PD. Additional inclusion criteria included the following: (1) a minimum of 21 points on the Mini-Mental State Examination [30], (2) a minimum score of 6 on the NFOGQ [31], (3) age 40-95 years, and (4) a stable medication regimen, i.e., no change in medications for the month prior to study participation. Exclusion criteria were as follows: (1) inability to walk independently, (2) self-report of a neurological or psychiatric disorder other than PD, (3) history of seizures or head trauma, and (4) implanted metals in the head area. All subjects provided informed written consent, as approved by the local human studies committees.

FOG-Provoking Test and Wearable Sensors
The study participants performed a FOG-provoking test in the lab (see Figure 1 for an overview of the test) [10]. Each patient performed the test at three levels of difficulty: (1) as a single task, (2) as a dual-motor task, i.e., performing the trial while carrying a tray with a bottle on it, and (3) as a motor-cognitive task. In this most-challenging level, the participants were asked to perform the trial as described in the second level of difficulty, while also performing serial seven subtractions. The subjects were assessed while wearing Opal inertial sensors (APDM, Inc., Portland, OR, USA) on the lower back and at the side of each leg, above the ankle. Each sensor included a tri-axial accelerometer, gyroscope, and magnetometer recording at 128 Hz. The signals from the lower-back sensor provide insight into the left and right feet, as well as the trunk and upper body, and the signals from the ankle sensors provide more detailed information regarding movement of the legs. The subjects were tested before the administration of tDCS or sham, and again in three different visits after the administration of tDCS or sham (for more study details see ClinicalTrials.gov protocol NCT02656316). In each visit, the test was performed in the OFF state (at least 12 h with anti-parkinsonian medication withdrawal) and/or in the ON state and included the three levels mentioned above.

Figure 1 [10]. The test was repeated three times, with three different levels of difficulty. From a seated position, the subject walks, turns in a circle clockwise and counterclockwise, as indicated, enters a doorway, turns, and then returns to the seated position. CW, clockwise; CCW, counterclockwise.
As a "gold standard" reference for validating the output of the ML algorithm, each test was videotaped and annotated offline for any FOG episodes that occurred. An in-house FOG tagging (annotation) program was developed for this purpose as a MATLAB GUI; it enabled synchronization between the videos and the signals and facilitated the annotation process. Predefined guidelines were used for the manual detection of FOG [5,20,32]. Briefly, FOG was defined as an intention to walk without movement of the feet, as heel lifting while the toes stay on the ground, or as an irregular turning rhythm while the pivot foot stays on the ground. To reduce inter-rater variability, all videos of the same individual were annotated by the same assessor. Videos that were noted with no FOG episodes, or that were difficult to annotate (e.g., due to very short episodes or akinesia), were reviewed by two additional assessors. Preprocessing of the labeled dataset included merging separately identified FOG episodes that occurred <1 s apart, as well as relabeling identified FOG episodes shorter than 1 s as non-FOG.
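A minimal sketch of this annotation post-processing, assuming episodes are represented as sorted, non-overlapping (start, end) pairs in seconds (the helper name and data layout are illustrative, not the authors' MATLAB implementation):

```python
def postprocess_episodes(episodes, merge_gap=1.0, min_duration=1.0):
    """Merge FOG episodes separated by < merge_gap seconds, then drop
    (i.e., relabel as non-FOG) episodes shorter than min_duration."""
    merged = []
    for start, end in episodes:
        if merged and start - merged[-1][1] < merge_gap:
            # close to the previous episode: extend it instead of starting anew
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return [(s, e) for s, e in merged if e - s >= min_duration]
```

For example, episodes at 0.0-2.0 s and 2.5-4.0 s (0.5 s apart) would be merged into one 4-second episode, while an isolated 0.4-second episode would be discarded.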

Machine-Learning Algorithms
An automatic FOG-detection model was created by using ML methods. A support vector machine (SVM) with a radial basis function (RBF) kernel was chosen as the classification method; this classifier has shown good results in previous publications on FOG detection [21,22,24,25,33]. Fourteen subjects (20%) were selected randomly as the test set, and the other 57 subjects (80%) were selected for the training set. All the data were first filtered with a second-order Butterworth low-pass filter with a cutoff frequency of 15 Hz, to reduce high-frequency noise. The training set was divided into three-second windows with 50% overlap for non-FOG windows and 80% overlap for FOG windows, thus creating more FOG examples and improving the model's performance. For classification, only a window that contained at least 1.5 s of FOG episodes was considered a FOG window; otherwise, it was considered a non-FOG window. For testing the model, the test set was partitioned into three-second windows with 50% overlap. The windows were overlapped to retain maximum information and to fine-tune the temporal resolution.
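The filtering and windowing steps above can be sketched as follows. This is an illustrative Python reconstruction; the function names and the exact stride arithmetic are assumptions, not the original code:

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 128          # sampling rate of the sensors (Hz)
WIN = 3 * FS      # 3-second analysis window, in samples

def lowpass(x, cutoff=15.0, fs=FS, order=2):
    """Zero-phase second-order Butterworth low-pass filter (15 Hz cutoff)."""
    b, a = butter(order, cutoff / (fs / 2), btype="low")
    return filtfilt(b, a, x, axis=0)

def make_windows(signal, fog_mask, overlap_fog=0.8, overlap_non=0.5):
    """Slice a signal into 3-s windows. A window is labeled FOG when it
    contains at least 1.5 s of annotated FOG; FOG windows are cut with a
    denser stride (80% overlap) to generate more FOG training examples."""
    windows, labels = [], []
    for ovl in (overlap_fog, overlap_non):
        step = int(WIN * (1 - ovl))
        for start in range(0, len(signal) - WIN + 1, step):
            is_fog = fog_mask[start:start + WIN].sum() >= 1.5 * FS
            # keep the densely-strided pass for FOG windows only,
            # and the 50%-overlap pass for non-FOG windows only
            if (ovl == overlap_fog) == bool(is_fog):
                windows.append(signal[start:start + WIN])
                labels.append(int(is_fog))
    return np.stack(windows), np.array(labels)
```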
A set of 86 features was computed for each window and from several sensors (see Table A1 in the Appendix A). The feature set was based on previous publications [21][22][23] and was adapted to the current dataset (e.g., in the axes and frequency bands that were used). Two features that capture the relationship between the movement of the back and that of the legs were added; Figure 2 demonstrates one of these features. Feature selection was performed to identify the most discriminative features for distinguishing FOG windows from non-FOG windows. This was done through a leave-one-patient-out procedure in which each iteration held out the data of a single patient, so that the model was always tested on data that were not included in its training and a model with a final set of features could be obtained.

Feature selection in each leave-one-patient-out iteration was done as follows: t-tests were applied to each feature, and only features that significantly separated windows with and without FOG after Bonferroni correction for multiple comparisons were retained. Next, the selected features were ranked by their relevance, using the minimum-redundancy, maximum-relevance (MRMR) algorithm [34]. Finally, an iterative feature-selection process analyzed the misclassification error as a function of the number of ranked features, and the feature set that yielded the minimum misclassification error was chosen.
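A sketch of this three-stage selection under stated assumptions: |t| is used as a stand-in for the MRMR relevance ranking (MRMR itself is not part of scikit-learn), and stage three uses a simple cross-validated SVM error; names and details are illustrative:

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def select_features(X, y, alpha=0.05):
    """Three-stage selection: (1) univariate t-tests with Bonferroni
    correction, (2) relevance ranking, (3) keep the set size with the
    lowest cross-validated misclassification error."""
    # 1) keep features that significantly separate FOG / non-FOG windows
    t, p = ttest_ind(X[y == 1], X[y == 0], axis=0)
    kept = np.where(p < alpha / X.shape[1])[0]        # Bonferroni threshold
    # 2) rank retained features (|t| stands in for the MRMR ranking here)
    ranked = kept[np.argsort(-np.abs(t[kept]))]
    # 3) grow the feature set and keep the size minimizing the CV error
    errors = [1 - cross_val_score(SVC(kernel="rbf"),
                                  X[:, ranked[:k]], y, cv=3).mean()
              for k in range(1, len(ranked) + 1)]
    best_k = int(np.argmin(errors)) + 1
    return ranked[:best_k]
```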
Classification validation was also performed with the leave-one-patient-out method. The data of the single patient kept out of the training batch were tested, and each window was compared to the reference annotations. The performance of each iteration in detecting FOG was measured by sensitivity, specificity, and accuracy at the window level. When there were no FOG windows in a signal, sensitivity was considered to be 100%. A receiver operating characteristic (ROC) curve was calculated for each iteration. Afterward, the means of sensitivity, specificity, and accuracy across all iterations, as well as the mean ROC curve, were calculated.
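The validation loop can be sketched as follows, assuming a per-window feature matrix `X`, labels `y`, and a patient ID per window (all names illustrative); the convention that sensitivity is 100% when a signal contains no FOG windows is handled explicitly:

```python
import numpy as np
from sklearn.svm import SVC

def sens_spec_acc(y_true, y_pred):
    """Window-level sensitivity, specificity, and accuracy."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    pos = y_true.sum()
    neg = len(y_true) - pos
    sens = tp / pos if pos else 1.0   # no FOG windows -> sensitivity 100%
    spec = tn / neg if neg else 1.0
    return sens, spec, (tp + tn) / len(y_true)

def leave_one_patient_out(X, y, patients):
    """Hold out all windows of one patient per iteration, train an RBF-SVM
    on the rest, and average the window-level metrics over iterations."""
    scores = []
    for pid in np.unique(patients):
        train, test = patients != pid, patients == pid
        clf = SVC(kernel="rbf").fit(X[train], y[train])
        scores.append(sens_spec_acc(y[test], clf.predict(X[test])))
    return np.mean(scores, axis=0)    # mean sensitivity, specificity, accuracy
```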
A final model was generated with the training set of 57 subjects and tested on the data of the 14 other subjects who were not included in the training set. The previously described iterative feature-selection process was used for the final model, except that the features were first ranked according to how commonly they were selected in the leave-one-patient-out process. The final set of features was then selected according to the minimum classification error achieved within the training group and used as the input to the final classifier. Automatic FOG detection was then performed with the final model, and its accuracy was evaluated. Finally, each patient's dataset was evaluated for the number of FOG episodes and the total duration of FOG; the latter was expressed as a percentage of the test's duration and referred to as "percent time frozen" [12]. See Figure 3 for a flowchart of the classification process.
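Under the simplifying assumption of a sequence of consecutive, non-overlapping window-level predictions, the two derived outcome measures can be computed as follows (the hop length is an illustrative value, not a parameter stated in the text):

```python
import numpy as np

def fog_outcomes(pred, step_s=1.5):
    """Derive percent time frozen and the number of FOG episodes from a
    sequence of consecutive window-level predictions (0 = non-FOG, 1 = FOG).
    step_s is the hop between windows (1.5 s for 3-s windows with 50%
    overlap); treating each window as one hop is a simplification."""
    pred = np.asarray(pred)
    percent_time_frozen = 100.0 * pred.mean()
    # an episode is a maximal run of consecutive FOG-labeled windows
    edges = np.diff(np.concatenate(([0], pred, [0])))
    n_episodes = int(np.sum(edges == 1))
    total_fog_s = float(pred.sum() * step_s)
    return percent_time_frozen, n_episodes, total_fog_s
```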
Two other algorithms were additionally evaluated for comparison: the freezing index [16], as a single measure, and the feature set suggested by Samà et al. [23]. The windowing process for both was the same as for the proposed algorithm. The freezing index, originally developed by Moore et al. [16], is a threshold-based method for identifying FOG.
It is based on the idea that the power spectrum of FOG differs from that of normal gait [35]. The freezing index was calculated for each window as the ratio of the integrated power spectral density in the 3-8 Hz "freeze" band to that in the 0.5-3 Hz "walking" band. The anterior-posterior acceleration axis was used for this purpose [17], and a threshold of 2.5 (AU) was determined for the detection of FOG. The freezing index was calculated for each leg separately, and the maximum value between the legs was chosen. The results were examined, as described earlier, in terms of per-window sensitivity, specificity, and accuracy. The second feature set that was implemented (Samà et al. [23]) was used as the input to the SVM classifier (see Table 1 for details). This feature set was chosen as a starting point because it showed promising results for detection with ML methods. The feature set included cross-axis information, such as correlations and differences between the mean values of the axes.
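The freezing-index computation can be reproduced directly; the Welch PSD settings below are assumptions rather than the original implementation's, and the function names are illustrative:

```python
import numpy as np
from scipy.signal import welch
from scipy.integrate import trapezoid

def freezing_index(acc_ap, fs=128):
    """Freezing index for one window: power in the 3-8 Hz 'freeze' band
    divided by power in the 0.5-3 Hz 'walking' band (Moore et al. [16])."""
    f, psd = welch(acc_ap, fs=fs, nperseg=min(256, len(acc_ap)))
    freeze = trapezoid(psd[(f >= 3) & (f <= 8)], f[(f >= 3) & (f <= 8)])
    walk = trapezoid(psd[(f >= 0.5) & (f < 3)], f[(f >= 0.5) & (f < 3)])
    return freeze / walk

def detect_fog_window(left_ap, right_ap, threshold=2.5):
    """Flag a window as FOG when the larger of the two legs' freezing
    indices exceeds the 2.5 (AU) threshold used in the comparison."""
    return max(freezing_index(left_ap), freezing_index(right_ap)) > threshold
```

A tremulous 6 Hz oscillation falls in the freeze band and drives the index up, while ordinary stepping near 1-2 Hz keeps it low.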

Statistical Analysis
Statistical analyses were performed by using the SPSS statistical package, version 25 (SPSS Inc., Chicago, IL), and MATLAB, version 9.7.0 (R2019b) (The MathWorks, Inc., Natick, MA). A two-tailed p-value ≤ 0.05 was considered statistically significant. Sensitivity was calculated as the number of true-positive (FOG) windows detected by the algorithm divided by the total number of FOG windows, according to the video annotations. Specificity was calculated as the number of true-negative (non-FOG) windows detected by the algorithm divided by the total number of non-FOG windows, according to the video annotations. Accuracy was the ratio of the sum of true-positive and true-negative detections to the total number of windows. For the train-test process of the classifier, the entire dataset was used. To study the responsiveness (e.g., change in the outcome measures in response to a challenging condition) and associations of the output of the algorithm, we focused on the baseline data (before the administration of tDCS), so as not to be influenced by any effect of the tDCS treatment on the existence of FOG. A square-root transformation was used to achieve a normal distribution for the average of percent time frozen, the number of FOG episodes, and the clinical test score across the three levels of the FOG-provoking test. Paired t-tests were used to investigate differences between the OFF- and ON-medication states. Friedman's tests were used to investigate differences between the test levels within each medication state, with Wilcoxon signed-rank tests for post hoc analysis. Wilcoxon signed-rank tests were also used to investigate differences between the performance of the new algorithm and that of the two other methods. The performance of the three methods was measured in terms of sensitivity, specificity, and accuracy.
Effect sizes for paired-sample t-tests (d) were calculated by dividing the mean difference by the standard deviation of the difference: d = 0.2, small effect; d = 0.5, medium effect; and d = 0.8, strong effect [36]. Effect sizes for Wilcoxon signed-rank tests (r) were calculated by dividing the test statistic by the square root of the number of observations: r = 0.1, small effect; r = 0.3, medium effect; and r = 0.5 and higher, strong effect [36]. The associations between the algorithm-based measures and the annotation-based measures, as well as other tests that reflect FOG severity, were calculated with Spearman's correlations.

Table 2 summarizes the characteristics of the study participants. In general, the study participants had relatively advanced disease and relatively high scores on the NFOGQ, consistent with the inclusion criteria.
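The effect-size definitions in the Statistical Analysis section reduce to two one-line formulas; a minimal sketch:

```python
import numpy as np

def cohens_d_paired(a, b):
    """Paired-samples effect size d: mean difference divided by the
    standard deviation of the differences (0.2 small, 0.5 medium, 0.8 strong)."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return diff.mean() / diff.std(ddof=1)

def wilcoxon_r(z, n):
    """Wilcoxon signed-rank effect size r = Z / sqrt(N)
    (0.1 small, 0.3 medium, 0.5 and higher strong)."""
    return z / np.sqrt(n)
```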

Detection Performance
For the classification process, 1041 signals with a total duration of 17.6 h were recorded. Based on the video annotations, 1754 FOG episodes were identified before merging episodes that were close to each other (<1 s apart), and 1487 remained after merging. The total FOG duration across all records was 6.75 h, with 0 to 11 episodes per test. Single-episode duration ranged from 0.18 s (before the elimination of short episodes of <1 s) to a maximum of 11.46 min. Based on the manual video annotations, the training set (57 subjects) consisted of 12,629 FOG windows (average of 550 windows per subject) and 20,172 non-FOG windows (average of 340 windows per subject). Each signal could contain both FOG and non-FOG windows.
Following the feature-selection process, 18 features were used per training iteration on average. The average results were 84.1 ± 22.3% sensitivity, 83.4 ± 12.2% specificity, and 85.0 ± 10.0% accuracy. The average ROC curve across all the iterations had an AUC = 0.93 for the training set (see Figure 4a).
For the final classification model, in which the dataset of 57 individuals was used to train the SVM classifier, 14 important features were chosen (Table 3) out of the list of 86 features (Table A1 in the Appendix A). The test set of 14 subjects showed 80.0 ± 19.2% sensitivity, 82.5 ± 11.2% specificity, and 86.6 ± 7.8% accuracy. The AUC of the ROC curve of the test set was 0.94 (see Figure 4a). Figure 4b illustrates ROC curves of the test set separately for the OFF- and ON-medication states. In addition, the algorithm-derived percent time frozen in the test set was significantly correlated with the annotation-derived percent time frozen (r = 0.897, p < 0.01).

Figure 5 presents the sensitivity, specificity, and accuracy of the freezing index as a single feature and of an SVM with the feature set from Samà et al. [23], as well as the results of the proposed algorithm. As shown in the figure, the proposed algorithm had better accuracy, sensitivity, and specificity than the two previously used algorithms.


Responsiveness of the Assessment of FOG Based on the Outcomes of the Algorithm
The dataset (71 subjects) was evaluated by the proposed algorithm for the quantity and extent of FOG episodes. As expected, paired t-tests showed that percent time frozen and the number of episodes, averaged across the three levels of the test, were higher in the OFF state than in the ON state (n = 43 subjects; %FOG: p = 0.004, effect size = 0.5; number of episodes: p < 0.001, effect size = 0.6) (Figure 6). Similarly, the clinical score on the FOG-provoking test increased (p = 0.002) from 4.0 ± 2.2 in the ON state to 5.1 ± 2.3 in the OFF state, with an effect size of 0.5. Furthermore, Friedman's tests revealed significant differences in the FOG measures between the levels of difficulty, both OFF and ON (OFF: n = 41 subjects; %FOG: p = 0.011; number of episodes: p < 0.001. ON: n = 62 subjects; %FOG and number of episodes: p < 0.001). Post hoc analysis of the test levels in the OFF and ON states showed that the highest percent time frozen occurred at the most challenging test level (Table 4). In Table 4, values are presented as median (inter-quartile range); because some of the subjects were not able to perform the testing in the OFF state, the number of subjects is higher in the ON state.

Associations of Algorithm Measures of FOG with Related Measures of FOG
Correlations between the algorithm's estimate of FOG and self-report of NFOGQ score, disease duration, TUG time, and MDS-UPDRS part III in OFF and ON are presented in Table 5. Mild-to-moderate correlations were observed, with some of the associations being dependent on the medication state.

Discussion and Conclusions
Here, we describe a machine-learning method for the automatic detection of FOG episodes during a FOG-provoking test [10], using data from three wearable sensors. The FOG-provoking test included conditions that aimed to trigger FOG, and thus it involves both walking and turning, as well as cognitive, motor, and emotional challenges. As was recently described in Reference [9], objective ways of evaluating FOG are needed. To the best of our knowledge, this is the first study that aimed to automatically detect FOG episodes during a previously validated FOG-provoking test. Several findings support the utility and validity of the present approach: (1) the relatively high sensitivity and specificity in the training set and, perhaps more importantly, in the test set; (2) the correlations between the ML-based outcomes and the manual video review of FOG; (3) the responsiveness of the measures derived from the automated detection to anti-parkinsonian medications and to the more challenging FOG-provoking condition; and (4) the correlations between the algorithm-derived measures of FOG and the NFOGQ.
Recent studies underline the importance of using relatively large datasets for studying FOG detection and specifically for ML approaches [6,9,13]. For example, a 2019 review screened 68 papers that aimed to detect FOG [13], and the largest studied population was only 32 FOG-PD subjects. In the present study, records from 71 subjects with a total of 6.75 hours of FOG episodes were included.
This reflects a relatively large number of annotated FOG episodes with prolonged total duration, consistent with the recent recommendations.
As suggested before [21][22][23], we found that patient-independent SVM classifiers can be used to detect FOG. The statistical measures of the classifier's performance and the high correlation between the algorithm-derived and the actual percent time frozen (r = 0.897, p < 0.01) indicate the good performance of the algorithm. The AUC reinforces this conclusion, specifically the similarity between the ROC curves of the OFF- and ON-medication states (Figure 4b). The minor differences between the OFF and ON curves likely reflect false negatives, due to a decrease in the occurrence of FOG in the ON state. Furthermore, there was some variation in detection accuracy among the tested subjects, as can be seen from the standard deviations of the algorithm's performance metrics. This might be due to variations in the manifestations of FOG during the test across individuals. FOG has previously been described as a mysterious symptom that is challenging to observe, that manifests differently among individuals with PD, and that changes across situations even for the same patient [2]. Episode duration ranged from 1.0 to 687.6 s, and very short episodes were challenging both for the clinicians to annotate and for automatic detection. Indeed, the present analyses did not consider FOG lasting less than one and a half seconds (due to the window length of 3 s), a limitation. Akinesia, impaired gait, akinetic episodes, and deceleration (preceding opening the door, for example) likely also challenged the automated detection. Promisingly, the sensitivity and specificity of FOG detection in the test set were very close to those of the training process. This suggests that the classifier was not over-trained or tuned to the specifics of the study participants and that it is generalizable.
The feature-selection process reduced the feature count from the original 86 to 14 in total (Table 3), without compromising detection performance. This makes the classification process more accessible for further use. It is also important to note that we introduced new features: the ratios between the RMS of the leg-sensor signals and the RMS of the back-sensor signal (leg angular velocity to back acceleration, and leg angular velocity to back angular velocity) (recall Figure 2). These features were based on the idea that, during kinetic freezing, the forward movement of the body decreases while the legs are still moving. This notion was supported by the high rank that these features received from the MRMR algorithm as part of the feature-selection process. Most of the features that were included in the final model input were calculated from the leg sensors (Table 3). This finding may reflect that the manifestation of FOG weakens the farther the sensor is from the ground.
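A minimal sketch of such a leg-to-back RMS ratio feature, assuming a plain RMS over one analysis window (the function names and the small epsilon guard are illustrative, not the authors' implementation):

```python
import numpy as np

def rms(x):
    """Root mean square of a 1-D window."""
    return np.sqrt(np.mean(np.square(x)))

def leg_to_back_rms_ratio(leg_gyro, back_signal):
    """Ratio of leg angular-velocity RMS to a back-sensor signal RMS.

    During trembling (kinetic) freezing, the legs keep oscillating
    while forward trunk motion drops, so this ratio should rise.
    """
    return rms(leg_gyro) / (rms(back_signal) + 1e-12)  # epsilon avoids /0
```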
FOG duration and episode counts may add value to the evaluation of the FOG-provoking test, which is usually scored only by the presence of episodes in predefined tasks. As expected, since FOG occurs more frequently in the OFF state and in more challenging tasks [2], the percent time frozen and the number of episodes were higher in the OFF state than in the ON state (Figure 6), and higher in the most challenging level of the test than in the easiest level (Table 4). These findings support the validity of these derived outcome measures. Interestingly, the effect sizes of the clinical score and the FOG measures were similar (recall Table 5), suggesting that the automated assessment performs as well as the clinical rating. Of course, evaluating the percent time frozen and the number of episodes in a FOG-provoking test can be accomplished by manual review of video-recorded tests. However, that rating process is time-consuming, requires trained personnel, and is subject to bias. The present findings indicate that the automated process based on wearable sensors and machine learning can be used instead of the manual review of videos.
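Given per-window binary predictions from a classifier, the two derived outcomes can be computed in a few lines. This is a sketch under the assumption that consecutive FOG-positive windows are merged into a single episode; the paper's exact episode-merging rule is not specified here:

```python
import numpy as np

def fog_outcomes(pred):
    """Percent time frozen and episode count from per-window
    binary FOG predictions (1 = FOG, 0 = no FOG)."""
    pred = np.asarray(pred, dtype=int)
    pct_frozen = 100.0 * pred.mean() if pred.size else 0.0
    # an episode starts wherever a 1 follows a 0 (or opens the series)
    starts = np.diff(np.concatenate(([0], pred))) == 1
    return pct_frozen, int(starts.sum())

pct, n_ep = fog_outcomes([0, 1, 1, 0, 1, 0, 0, 1])  # 50.0% frozen, 3 episodes
```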
To directly compare the detection performance with the literature, we tested two additional methods of FOG detection. Figure 5 shows that the results based on those methods were lower than those of the new algorithm. Although the freezing index was one of the main input features in our classifier, its performance as a single feature was insufficient. Samà et al. [23] reported higher results than those we achieved when applying their features to our dataset. That might be a consequence of the location of the sensors and the specific characteristics of the dataset. The FOG-provoking test consisted of short-distance trials with turns and few walking bouts, whereas the datasets in both of those studies contained longer recordings of free walking. Difficulties in reproducing the results of previously published methods of automatic FOG detection have been described before [23,26,37]. More specifically, the importance of validating ML methods on different cohorts was recently highlighted [6]. Many factors may influence the ability to generalize one method to different datasets, including the characteristics of the participants, the placement of the sensors, the preprocessing of the data, and the evaluation methods. Caution is therefore warranted when comparing automatic FOG detection methods or applying them to different cohorts under different conditions. Nonetheless, our use of a training set and a completely separate test set should mitigate this issue.
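For reference, the freezing index is commonly defined in the literature as the spectral power in the "freeze" band (3-8 Hz) divided by the power in the "locomotion" band (0.5-3 Hz). The sketch below uses Welch's PSD estimate and these conventional band limits, which may differ slightly from the exact implementation used in this study:

```python
import numpy as np
from scipy.signal import welch

def freezing_index(x, fs=100):
    """Power in the 'freeze' band (3-8 Hz) divided by power in the
    'locomotion' band (0.5-3 Hz), estimated via Welch's method."""
    f, psd = welch(x, fs=fs, nperseg=min(len(x), 256))
    freeze = psd[(f >= 3) & (f < 8)].sum()
    loco = psd[(f >= 0.5) & (f < 3)].sum()
    return freeze / (loco + 1e-12)  # epsilon avoids division by zero

t = np.arange(0, 3, 1 / 100)           # 3 s at 100 Hz
tremble = np.sin(2 * np.pi * 5 * t)    # 5 Hz: inside the freeze band
walking = np.sin(2 * np.pi * 1.5 * t)  # 1.5 Hz: inside the locomotion band
```

A trembling-type signal should yield a high index, and normal locomotion a low one, which is why the index discriminates kinetic freezing but, as noted above, is insufficient on its own.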
Accuracy, sensitivity, and specificity above 80% can be considered good in this context. Still, for some applications, it might be helpful to increase detection performance further. This might be achieved by adding sensor types (e.g., EMG [38] and ECG [39]) to help differentiate some of the more challenging conditions (e.g., akinetic FOG vs. standing still), by focusing on subtypes of episodes (e.g., turning and start hesitation), or by applying techniques such as deep learning [37]. The present work can also be extended to develop an automated, instrumented scoring. A total score might be based on the percent time frozen and the number of episodes, with sub-scores relating to the different parts of the FOG-provoking test, e.g., gait initiation, turns, and straight-line walking. This might enhance monitoring and provide directions for therapy. Furthermore, it might be interesting to combine the objective score with other sensor-derived measures (e.g., duration of subtasks), self-report (e.g., NFOGQ score), and other known FOG characteristics [40]. Reducing the number of sensors to achieve simplicity, especially for future use in home and daily-living detection, should also be considered. Trade-offs between simplicity and ease of use, on the one hand, and complexity and granularity of the tool, on the other, will need to be evaluated.
The present results suggest that automated detection of FOG within a laboratory-based FOG-provoking test can be achieved using wearables and a machine-learning algorithm. This tool promises to augment the assessment of this debilitating phenomenon and, hopefully, to aid in the development of improved treatments by affording an objective evaluation of their impact on FOG.

Data Deposition Information: The data will be made available from the authors upon reasonable request.

Footnote 1: Freezing index (x); while turning: gyroscope, x ∈ V; otherwise: acceleration, x ∈ AP.