1. Introduction
Parkinson’s disease is a neurological disorder caused by nerve cell degeneration in the brain’s substantia nigra, and it currently has no cure [1]. Diagnosis relies not on X-rays or blood tests but on observable symptoms that affect daily activities [2], such as freezing of gait (freezing while walking or performing tasks), tremors, dyskinesia (uncontrolled movements), and bradykinesia (slowed movements). These symptoms pose significant risks: bradykinesia while crossing a street or dyskinesia while handling sharp objects such as knives can lead to harm [3]. Timely detection of these symptoms can greatly mitigate such risks by enabling prompt alerts and strategies that reduce their impact [4].
Devices such as smartwatches, smart fitness bands, and smartphones are equipped with built-in inertial sensors, including accelerometers and gyroscopes. These sensors can be attached non-invasively to the human body, enabling real-time data analytics and allowing the devices to provide immediate alerts and cues. Inertial data make these devices versatile across applications, including terrain classification [5], fall detection [6], and human activity recognition [7]. Previous studies and the widespread availability of inertial measurement units (IMUs) show promise in leveraging these data for Parkinson’s disease (PD) diagnosis and monitoring [8,9,10,11,12]. IMUs typically include accelerometers, gyroscopes, and sometimes magnetometers, all of which are found in most modern digital devices.
Motivated by the high performance achieved with inertial data in these applications, this study demonstrates the critical role of wearable technology, such as smartwatches and fitness bands, in detecting and classifying anomalous events for PD patients. These devices capture real-time inertial data, enabling continuous monitoring and analysis of movement patterns and thus facilitating timely alerts and medical interventions. Our paper makes the following contributions to wearable technology for PD management:
An end-to-end pipeline for real-time movement disorder detection, achieving an inference time of 80 ms, and multi-label classification (inference time of 165 ms) of tremors, dyskinesia, and bradykinesia.
The proposed model is very lightweight (around 9 MB in size), which makes it well suited to deployment on edge devices such as smartwatches and fitness bands.
This study proposes two auto-labeling techniques (early and late labeling) to identify optimal pre-processing configurations and evaluates the proposed model using three different window sizes (50, 150, 250) for optimal signal segmentation.
Evaluation of two modeling techniques—manual feature crafting for a machine learning pipeline and processing raw inertial data for a deep learning pipeline—to find the optimal approach. A detailed comparison between the two modeling techniques can be found in the Results Section. For movement disorder detection, the deep learning approach achieves a recall of 93%. Similarly, for movement disorder classification, the deep learning approach achieves a recall score of 91.54%.
2. Literature Review
Many recent studies employ inertial data for detecting and characterizing PD and its symptoms. For instance, one study focused on freezing of gait (FOG), a common PD symptom that affects gait and increases fall risk, achieving a 91.5% F1-score using time–frequency features and convolutional neural networks [13]. Dvorani et al. [14] targeted bradykinesia by analyzing upper limb inertial signals with a multi-layer perceptron and achieved an accuracy of 85%. Another study addressed FOG using inertial sensors and a novel evaluation metric called GaitScore [15], achieving 97% sensitivity and 87% specificity.
Wearable inertial sensors offer promise for monitoring axial impairments in PD. Integration with machine learning techniques yielded a low root mean square error (RMSE) when estimating posture instability/gait difficulty (PIGD) scores [16], and validation studies confirmed their reliability in capturing gait parameters [17]. Investigations into hand tremors and levodopa effects on gait parameters achieved high accuracy using inertial sensors and advanced machine learning methods [18,19]. Studies on stride segmentation using hidden Markov models achieved a 92.1% F1-score on real-time gait data [19], confirming the reliability of wearable sensors for gait analysis under supervised and unsupervised settings [20]. Peres et al. [21] employed neural networks on inertial data for early PD detection, demonstrating the effectiveness of inertial data and machine learning in PD diagnosis and management. Su et al. [22] presented an interpretable CNN-based architecture and used spatio-temporal features for Parkinson’s disease detection, achieving an accuracy of 98%. Uchitomi et al. [23] utilized a deep learning model on a time-series gait dataset acquired from an IMU mounted on a subject’s leg to differentiate between healthy persons and persons with mild Parkinson’s disease. Dimoudis et al. [24] proposed two deep learning architectures (LN-inception and InSEption) to predict FOG episodes, achieving an F1-score of 97%.
Inertial data require pre-processing before being used for machine learning or fed directly into deep learning models, as shown in many studies. Pre-processing steps involve noise removal using techniques such as a simple moving average [25,26]. The cleaned signal is then decomposed into smaller segments using techniques such as peak/valley detection [25], or by dividing the signal into windows of fixed length and stride [27]. Labeling of the segmented data depends on the use case. Straightforward cases apply the same label to all segments, e.g., gender identification [25], while cases where the label is not uniform across the whole signal, e.g., tremor episodes occurring within the gait of a patient with PD, require more sophisticated segment-labeling methods. For non-homogeneous segments, common methods either use the mode as the overall segment label, as in [26], or treat the segment as anomalous even if only a small subset of it is anomalous [27]. For machine learning models, features from the time, frequency, or wavelet domains are computed for each segment, which improves model accuracy but increases processing time. Deep learning models offer improved computation time and accuracy by directly using segmented and labeled raw data for training and inference [28].
Apart from inertial data, diverse data sources have been used to detect and quantify PD, such as electroencephalography (EEG), electromyography (EMG), speech, and vision. Commonly employed deep learning models include CNNs and RNNs, which achieve high accuracies in various tasks. For instance, a combination of CNN+RNN classifies PD patients based on EEG with 99.2% accuracy [29], while EMG readings detect motor fluctuations with 99% accuracy [30]. Speech- and voice-based models also show promising results; however, real-world noise affects their reliability [31,32,33,34,35,36]. Vision-based approaches reveal PD-related impairments, though occlusion and illumination changes remain challenging [37,38]. Some other vision-based models analyze subjects’ handwriting, as presented in [39]. Alazeb et al. [40] employed a combination of RGB, inertial, and depth sensor data, and used machine learning to offer therapeutic advice to patients. For vision-based models, overcoming data availability, equipment accessibility, and environmental challenges is crucial for real-world deployment.
While existing studies use wearable inertial sensors for PD detection, they often focus on isolated symptoms, rely on computationally expensive models, or lack standardized data segmentation. Real-time, multi-label classification of movement disorders remains under-explored. Moreover, models in the existing literature are not optimized for edge deployment. Our work addresses these gaps by proposing a lightweight model capable of real-time movement disorder detection and classification with an inference time of 80–165 ms. Additionally, we introduce two auto-labeling techniques to improve data segmentation and compare feature-engineered and deep learning approaches for optimal performance. We believe that these contributions advance real-time multi-label classification of movement disorders while ensuring computational efficiency for deployment on smartwatches and fitness bands.
3. Methodology
This section presents the proposed methodology, beginning with a brief overview of the publicly available dataset used in this study for training, validation, and testing. Subsequently, the proposed pipeline is discussed, which encompasses pre-processing and signal decomposition, feature extraction for classical machine learning algorithms, and the application of deep learning techniques.
3.1. Dataset
This study utilizes a publicly available dataset derived from the levodopa response study [4]. The dataset is tailored to exploring anomalies related to the motor fluctuations observed in individuals with PD and covers three distinct anomalies: tremors, dyskinesia, and bradykinesia. The raw data are accessible through an open data repository. Data were collected from 28 subjects diagnosed with PD, with Hoehn and Yahr scores ranging from II to IV, over a span of 4 days. The Hoehn and Yahr scale was used to gauge the progression of Parkinson’s symptoms and the degree of disability. Subjects were aged from 30 to 80 years, were undergoing L-Dopa treatment, and exhibited at least mild dyskinesia and motor fluctuations. Subjects with significant neurological disorders such as epilepsy, brain tumors, or hydrocephalus were excluded from this study.
Data acquisition used three sensors: a Samsung S2 smartphone, a GeneActiv IMU, and a Pebble smartwatch, positioned at the waist, the most affected upper limb, and the least affected upper limb, respectively. Three-dimensional acceleration data, capturing movement along the x, y, and z axes, were collected at a frequency of 50 Hz. The dataset also includes the vector magnitude, together with clinician-annotated labels giving symptom severity (for tremors) and presence (for dyskinesia and bradykinesia) for each limb and motor task.
On the first day, participants performed various motor tasks in an on-medication state within a laboratory setting. These tasks included walking, finger movements, and fine motor tasks, repeated every 30 min over a 3- to 4-h period. Participants were then equipped with sensors to record their daily activities over the second and third days, yielding two days of unlabeled data. On the fourth day, participants returned to the laboratory in a non-medicated state to repeat the motor tasks performed on the first day. Following this initial testing phase, participants took their scheduled medication dose and performed an additional set of motor tasks. Annotation of the data was conducted by a trained clinician.
3.2. Pipeline Overview
This section presents a novel two-stage methodology for detecting and categorizing anomalous events in inertial data acquired from wearable sensors worn by patients diagnosed with PD, as shown in Figure 1.
The proposed pipeline processes the captured inertial data on a segment-by-segment basis. First, the pipeline uses binary classification to discriminate between normal and anomalous signal segments. To accomplish this, the three anomalies (tremor, dyskinesia, and bradykinesia) from the dataset are consolidated into a unified binary label: a signal segment is treated as anomalous when any anomaly is detected and as normal when it is devoid of anomalies. If a segment is identified as anomalous, it proceeds to the second stage of multi-label classification, which categorizes the specific types of anomalies present. In contrast to past studies, which dealt with only a single anomaly, such as FOG or dyskinesia, our approach addresses multiple mutually inclusive classes. This implies that a single segment may be classified as having both tremor and dyskinesia simultaneously. Such an approach proves valuable for quantifying the severity of symptoms; for instance, an individual exhibiting both tremor and dyskinesia would receive a higher severity score than someone experiencing only dyskinesia. Conversely, if a signal segment is categorized as normal during the initial stage, the subsequent stage is bypassed, as no anomalies are detected. This study focuses on the data collected from the IMU attached to the most affected limb; thus, all experiments and results are derived from the GeneActiv IMU.
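The control flow of the two stages can be illustrated with a minimal sketch. The function `two_stage_predict` and the stand-in models `toy_detect` and `toy_classify` are hypothetical names introduced only for illustration; they do not represent the trained models of this study.

```python
from typing import Callable, Dict, Sequence

ANOMALIES = ("tremor", "dyskinesia", "bradykinesia")


def two_stage_predict(
    segment: Sequence[float],
    detect: Callable[[Sequence[float]], bool],
    classify: Callable[[Sequence[float]], Dict[str, bool]],
) -> Dict[str, bool]:
    """Run stage 1 (detection); run stage 2 (classification) only if anomalous."""
    result = {name: False for name in ANOMALIES}
    if not detect(segment):
        # Normal segment: stage 2 is bypassed entirely.
        return result
    # Anomalous segment: labels are mutually inclusive, so several may be True.
    result.update(classify(segment))
    return result


# Hypothetical stand-in models, used only to exercise the control flow.
def toy_detect(segment):
    return max(segment) - min(segment) > 1.0


def toy_classify(segment):
    return {"tremor": True, "dyskinesia": True, "bradykinesia": False}


calm = two_stage_predict([1.0, 1.1, 0.9], toy_detect, toy_classify)
active = two_stage_predict([0.2, 2.5, 0.1], toy_detect, toy_classify)
```

In this sketch, `calm` never reaches the second stage, whereas `active` is reported with two co-occurring anomalies.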
3.3. Signal Pre-Processing
The raw input signal is first decomposed into smaller segments. To optimize the window length, this study explored three window sizes, based on the 50 Hz sampling rate used for data collection: (1) a 1 s window (50 data points), (2) a 3 s window (150 data points), and (3) a 5 s window (250 data points). Consecutive segments overlap by 80%, which is obtained by using a stride of 20% of the window size, as shown in Figure 2.
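With respect to the desired overlap (O) and selected window size (WS), the stride (S) can be derived from the following equation:
$$S = WS \times (1 - O) \qquad (1)$$
whereas the exact window indices (Start, End) for the $i$-th segment (starting from 0) can be determined by the following equation:
$$\mathrm{Start}_i = i \times S, \qquad \mathrm{End}_i = \mathrm{Start}_i + WS \qquad (2)$$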
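As a worked illustration of the stride and window-index computation, the following minimal sketch assumes that the stride is the non-overlapping fraction of the window, that the end index is exclusive, and that trailing partial windows are discarded. The helper names `stride_for` and `window_indices` are introduced here for illustration only.

```python
def stride_for(window_size: int, overlap: float) -> int:
    """Stride in samples for a given window size and fractional overlap."""
    return max(1, round(window_size * (1.0 - overlap)))


def window_indices(n_samples: int, window_size: int, overlap: float):
    """Return (start, end) pairs, end exclusive, for every full window."""
    stride = stride_for(window_size, overlap)
    indices = []
    start = 0
    while start + window_size <= n_samples:
        indices.append((start, start + window_size))
        start += stride
    return indices


# Strides for the three window sizes explored in this study at 80% overlap.
strides = {ws: stride_for(ws, 0.8) for ws in (50, 150, 250)}
first_windows = window_indices(n_samples=100, window_size=50, overlap=0.8)[:3]
```

Under these assumptions, the three window sizes give strides of 10, 30, and 50 samples, and the first windows of a 50-sample segmentation start at samples 0, 10, and 20.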
The decomposed signal segments are then labeled. For this purpose, this study explored two labeling approaches. The first (S1 methodology, early labeling) treats a segment as anomalous if it contains any anomalous data points, even a single one; otherwise (zero anomalous data points), the segment is treated as normal. The second (S2 methodology, late labeling) assigns the segment label by majority voting: if most of the data points are anomalous, the label is anomalous, and if most are normal, the label is normal. In case of a tie, the label is anomalous. Pseudo-code for signal pre-processing can be found in Algorithm 1. Overall, this study explored three window sizes, each with two labeling approaches. Their comparative performance is outlined in the Results Section.
Algorithm 1 Signal pre-processing.
Input: Raw signal data X, sampling rate f_s, window size WS, overlap O
Output: Segmented windows with indices and labels
S ← WS × (1 − O) ▹ Calculate stride based on overlap
i ← 0 ▹ Initialize segment counter
while not end of signal do
    Start_i ← i × S
    End_i ← Start_i + WS
    Extract segment seg_i ← X[Start_i : End_i]
    i ← i + 1
end while
for each segment seg_i do
    ▹ S1 - Early labeling
    if any anomalous data point in seg_i then
        Label seg_i as anomalous
    else
        Label seg_i as normal
    end if
    ▹ S2 - Late labeling
    N_a ← Number of anomalous points in seg_i
    N_n ← Number of normal points in seg_i
    if N_a > N_n then
        Label seg_i as anomalous
    else if N_a < N_n then
        Label seg_i as normal
    else
        Label seg_i as anomalous (tie case)
    end if
end for
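The two labeling rules of Algorithm 1 can be sketched in a few lines of Python. The sketch assumes that per-point labels are encoded as 0 (normal) and 1 (anomalous); the function names `label_early` and `label_late` are illustrative and not part of the original implementation.

```python
from typing import Sequence


def label_early(point_labels: Sequence[int]) -> int:
    """S1: anomalous (1) if any data point in the segment is anomalous."""
    return int(any(point_labels))


def label_late(point_labels: Sequence[int]) -> int:
    """S2: majority vote; a tie counts as anomalous (1)."""
    n_anomalous = sum(1 for p in point_labels if p)
    n_normal = len(point_labels) - n_anomalous
    return int(n_anomalous >= n_normal)


segment = [0, 0, 0, 1, 0, 0]  # one anomalous point out of six
tie = [0, 1, 0, 1]            # equal numbers of normal and anomalous points
```

For `segment`, early labeling yields an anomalous label while late labeling yields a normal one, which shows why S1 produces more anomalous segments than S2 on the same signal.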
3.4. Feature Crafting and Machine Learning
The machine learning pipeline involves computing features from input segments and using them to train a random forest model. This study proposes a set of 150 features from the temporal, frequency, and wavelet domains [25,26]: 60 from the time domain, 60 from the frequency domain, and 30 from the wavelet domain, covering statistical, amplitude-related, and spectral measures across all three axes of the inertial data (x, y, and z), as shown in Table 1. Since this study explores 3 window sizes and 2 labeling approaches, there are 6 (3 × 2) different datasets of signal segments, and features are computed for each of them. This study used MATLAB R2024a for both signal segmentation and feature computation.
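To give a flavor of per-axis feature crafting, the following minimal sketch computes a handful of time-domain statistics and one frequency-domain measure for a single axis. The specific features are illustrative choices and should not be read as the entries of Table 1; the study itself used MATLAB, and the naive DFT is written in pure Python only to keep the sketch self-contained.

```python
import cmath
import math
from statistics import mean, pstdev


def time_features(x):
    """A few time-domain statistics for one axis of one segment."""
    return {
        "mean": mean(x),
        "std": pstdev(x),
        "rms": math.sqrt(sum(v * v for v in x) / len(x)),
        "range": max(x) - min(x),
    }


def dominant_frequency(x, fs):
    """Frequency (Hz) of the largest non-DC bin of a naive DFT."""
    n = len(x)
    best_k, best_mag = 1, -1.0
    for k in range(1, n // 2 + 1):
        coeff = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        if abs(coeff) > best_mag:
            best_k, best_mag = k, abs(coeff)
    return best_k * fs / n


fs = 50  # Hz, the sampling rate of the dataset
x_axis = [math.sin(2 * math.pi * 5 * t / fs) for t in range(50)]  # synthetic 5 Hz tone
feats = time_features(x_axis)
peak_hz = dominant_frequency(x_axis, fs)
```

Applying such functions to each of the three axes and concatenating the results gives one feature vector per segment.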
The manually crafted features are used for model training and evaluation, which is carried out with Anaconda’s Python 3.10 distribution in Jupyter Notebook, using PyCaret (for model training and testing) and Imblearn (for random under-sampling). Segment-wise features are then labeled for the first stage of the pipeline (movement disorder detection) by consolidating the three anomalies into a single binary anomaly label. Afterwards, the data are normalized using z-score normalization, which rescales the values to a mean of 0 and a standard deviation of 1 and thus supports faster and more accurate convergence. Data exploration revealed a class imbalance in both stages. For anomalous event detection, this study addressed the problem using signal segmentation and segment-level labels, as shown at the top of Figure 3. A similar imbalance was observed for movement disorder classification in the raw data points, as shown at the bottom of Figure 3. This occurred because the anomalous segments used for classification did not undergo any labeling technique.
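The normalization and under-sampling steps can be sketched as follows. This is a pure-Python illustration of the two ideas rather than the PyCaret/Imblearn calls used in the study; the function names and the choice of shrinking every class to the size of the rarest one are assumptions made for the example.

```python
import random
from statistics import mean, pstdev


def zscore(column):
    """Scale one feature column to zero mean and unit standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    if sigma == 0:
        return [0.0 for _ in column]
    return [(v - mu) / sigma for v in column]


def random_undersample(rows, labels, seed=0):
    """Randomly drop rows until every class is as small as the rarest one."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = min(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        for row in rng.sample(members, target):
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


scaled = zscore([2.0, 4.0, 6.0, 8.0])
rows, labels = random_undersample(list(range(10)), [0] * 8 + [1] * 2)
```

In practice, the scaling statistics would normally be estimated on the training portion only, so that no information from the held-out data leaks into the model.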
The prepared data are then used to train a random forest classifier (number of trees = 100) for binary classification of the input signal segment as anomalous or normal (stage 1). This study uses 10-fold cross-validation for model training and evaluation. The evaluation metrics include accuracy, recall, precision, F1-score, and the confusion matrix.
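For readers unfamiliar with k-fold cross-validation, the following sketch shows how ten train/test index splits can be generated. It assumes plain shuffled folds; whether the study used stratified folds is not stated, and `k_fold_indices` is a hypothetical helper rather than the PyCaret routine.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold cross-validation."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(n_samples=95, k=10))
```

Each sample appears in exactly one test fold, so every segment contributes once to the evaluation and nine times to training.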
The models were first trained on the full feature set and then retrained exclusively on the highest-performing features to reduce execution time. The top-performing features for movement disorder detection (AD) and movement disorder classification (AC), using data from the most affected upper limb (GeneActiv IMU), with a window size of 250 and S1 labeling, can be seen in Figure 4. If Stage 1 classifies the input signal segment as anomalous, it passes to the second stage, which predicts the types of anomalies (tremor, dyskinesia, bradykinesia) present in the signal. This is a multi-label classification problem and is handled using the label power set approach, which transforms it into a multi-class classification problem by assigning a unique class label to each possible combination (based on presence) of the three anomalies, thus collapsing the three-element output (tremor, dyskinesia, bradykinesia) into a single class label.
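The label power set transformation can be illustrated with a short sketch. The class numbering below is an arbitrary choice made for the example, and the helper names `encode` and `decode` are hypothetical; the sketch enumerates all eight presence/absence combinations, although the all-absent combination would presumably not occur among segments already flagged as anomalous.

```python
from itertools import product

ANOMALIES = ("tremor", "dyskinesia", "bradykinesia")

# Every presence/absence combination gets its own class id (2**3 = 8 ids).
COMBO_TO_CLASS = {combo: idx for idx, combo in enumerate(product((0, 1), repeat=3))}
CLASS_TO_COMBO = {idx: combo for combo, idx in COMBO_TO_CLASS.items()}


def encode(tremor, dyskinesia, bradykinesia):
    """Map a multi-label triple to a single multi-class label."""
    return COMBO_TO_CLASS[(tremor, dyskinesia, bradykinesia)]


def decode(class_id):
    """Map a multi-class prediction back to the three anomaly flags."""
    return dict(zip(ANOMALIES, CLASS_TO_COMBO[class_id]))


cls = encode(1, 1, 0)  # tremor and dyskinesia present, bradykinesia absent
```

A standard multi-class classifier can then be trained on the encoded labels, and its predictions decoded back into the three anomaly flags.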
3.5. Deep Learning
Similar to the above pipeline, a two-stage approach is followed here, comprising a binary classifier (movement disorder detection) and a multi-label classifier (movement disorder classification). This study computes the magnitude of the raw 3D accelerations (see Equation (3)) to convert the 3D signal into a 1D signal, which is then passed to the deep learning model for training and inference. Using the 1D magnitude instead of the 3D input significantly improves computational efficiency, making the solution more suitable for real-time applications, and helps to reduce the impact of sensor orientation [41]. This deep learning-based pipeline is also trained and evaluated using six data configurations, formed from three window sizes for segmentation and two segment labeling approaches.
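The 3D-to-1D conversion can be sketched as follows, assuming that the magnitude is the usual Euclidean norm of the three acceleration components. The function names are illustrative.

```python
import math


def magnitude(x, y, z):
    """Euclidean norm of one 3D acceleration sample."""
    return math.sqrt(x * x + y * y + z * z)


def to_magnitude_signal(samples):
    """Collapse a sequence of (x, y, z) samples into a 1D magnitude signal."""
    return [magnitude(x, y, z) for x, y, z in samples]


# The same 1 g reading expressed in two different sensor orientations.
upright = to_magnitude_signal([(0.0, 0.0, 1.0)])
rotated = to_magnitude_signal([(1.0, 0.0, 0.0)])
```

Both orientations produce the same magnitude, which illustrates why the 1D representation is less sensitive to how the sensor sits on the limb.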
Overview of the Algorithm
This study employs our previously developed deep learning algorithm, HARDenseRNN [41], as the model for training and validation. HARDenseRNN combines two multi-kernel convolutional neural network (CNN) modules followed by a recurrent neural network (RNN) module (see Figure 5). The model first passes the input signal through the CNN-based network for spatial feature extraction. The spatial features are then concatenated with the raw input signal and passed to the RNN-inspired network (bi-directional GRU) for capturing temporal features. Concatenating the raw input signal with the feature map generated by the CNN ensures that the RNN can take advantage of both the raw signal and the CNN features for temporal feature extraction. These features are then passed through batch normalization and flattened. Afterwards, the flattened output is passed through a dropout layer and finally through a series of dense (fully connected) layers for making predictions. For the first stage, which involves binary classification, a single output node is used, with a sigmoid activation function and binary cross-entropy loss function.
The second stage, involving multi-label classification, uses three output nodes, with a sigmoid activation function and binary cross-entropy loss function. For the first stage of binary classification, the model had 753,035 trainable parameters and 896 non-trainable parameters. For the second stage of multi-label classification, the model had 754,061 trainable parameters and 896 non-trainable parameters. Deep learning model training was carried out on Google Colab Pro with a GPU (Tesla P100), using TensorFlow 2.4 and Keras 2.4.3 for model construction and training. The model was trained for 100 epochs using a batch size of 128. The number of epochs was selected empirically by monitoring the plots of training and validation loss and accuracy, as shown in Figure 6. The pipeline also uses early stopping on validation accuracy (with the ‘max’ option) and a patience of 10 epochs. This study also used scikit-learn for splitting the data into training and testing sets and seaborn for visualizations.
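The early stopping rule can be illustrated with a framework-independent sketch. It mimics the behavior of monitoring validation accuracy in ‘max’ mode with a patience of 10 epochs; the function name and the validation curve are made up for the example and are not results from this study.

```python
def early_stopping_epoch(val_accuracy, patience=10):
    """Return the 1-based epoch at which training stops.

    Training stops once validation accuracy has failed to improve on its
    best value for `patience` consecutive epochs ("max" mode).
    """
    best = float("-inf")
    epochs_without_gain = 0
    for epoch, acc in enumerate(val_accuracy, start=1):
        if acc > best:
            best = acc
            epochs_without_gain = 0
        else:
            epochs_without_gain += 1
            if epochs_without_gain >= patience:
                return epoch
    return len(val_accuracy)


# Made-up curve: improves for 5 epochs, then plateaus slightly lower.
curve = [0.60, 0.70, 0.75, 0.80, 0.85] + [0.84] * 95
stop_at = early_stopping_epoch(curve, patience=10)
```

With this curve, training halts at epoch 15, i.e., ten epochs after the last improvement, instead of running the full 100 epochs.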
4. Results and Discussion
This section presents the results of the experiment, involving the evaluation of three segment sizes (250, 150, 50), two labeling approaches (S1 labeling and S2 labeling), two stages (movement disorder detection, movement disorder classification), and two models per stage (random forest, HARDenseRNN), encompassing a total of 24 models (3 × 2 × 2 × 2). It is important to note that this study uses recall as the primary metric for model evaluation to ensure that our models accurately identify as many anomalous segments as possible, thus ensuring accurate classification of critically anomalous segments.
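For reference, the reported metrics follow from the confusion-matrix counts as sketched below. The labels are toy values chosen for illustration, and `binary_metrics` is a hypothetical helper, not the evaluation code of this study.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, recall, precision, and F1-score for binary labels (1 = anomalous)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "recall": recall,
        "precision": precision,
        "f1": f1,
    }


y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0]
scores = binary_metrics(y_true, y_pred)
```

Recall counts only missed anomalous segments (false negatives) as errors, which is why it is preferred here over accuracy: a missed anomaly is costlier than a false alert.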
4.1. Movement Disorder Detection
The initial phase of our proposed pipeline emphasizes the application of a binary classifier for categorizing input signal segments as normal or anomalous. The evaluation reveals that HARDenseRNN outperforms random forest in detecting anomalous events in patients with Parkinson’s disease when focusing on recall. The optimal configuration for HARDenseRNN is a segment size of 250 and S1 labeling, achieving the highest recall of 93.03%, along with an accuracy of 88.58% and precision of 86.13%. For S2 labeling, HARDenseRNN’s best recall is 87.92% with a segment size of 150. However, HARDenseRNN shows greater variability in performance depending on the segment size and labeling approach. In contrast, random forest consistently performs well across various configurations, with the highest recall of 89.10% at a segment size of 250 and S2 labeling, followed by 88.65% with the same segment size and S1 labeling. While random forest offers more consistent performance across various pre-processing configurations, the superior recall achieved by HARDenseRNN with larger segment sizes and S1 labeling makes it the most effective approach for this task.
Confusion matrices for movement disorder detection with HARDenseRNN and the optimal pre-processing configuration (segment size 250 and S1 labeling) are shown in Figure 7. In addition to achieving a higher recall, HARDenseRNN exhibits faster execution times and a smaller model size (approximately 9 MB), even when compared with a random forest model trained only on the top 10 features, as shown in Figure 8, making it well suited to real-time applications and deployment on low-end devices. Detailed results of the 12 models evaluated for movement disorder detection can be observed in Figure 9.
4.2. Movement Disorder Classification
In the multi-stage movement disorder identification process, where the initial stage identifies a signal as anomalous, the signal segment proceeds to the second stage for multi-label classification, which determines the type or types of anomalies present in the input signal segment. Here, too, HARDenseRNN demonstrates superior recall, particularly with a segment size of 250 and S1 labeling, achieving the highest recall of 91.54%, along with an accuracy of 79.71% and precision of 91.95%. For S2 labeling, HARDenseRNN also performs well, with a recall of 90.54% at a segment size of 250 and 89.82% at a segment size of 150. However, HARDenseRNN exhibits greater variability in its performance, with recall dropping to 79.13% and 80.33% at smaller segment sizes of 50 for S2 and S1 labeling, respectively. Random forest shows consistent (across various pre-processing configurations) but lower recall compared to HARDenseRNN. The highest recall for random forest is 71.99% at a segment size of 150 with S2 labeling, followed closely by 71.23% at a segment size of 250 with S2 labeling. Recall values are generally lower with S1 labeling, with the best performance being 67.72% at a segment size of 250. Despite this, random forest maintains relatively high precision across all configurations, indicating its reliability in classification.
Overall, HARDenseRNN with a segment size of 250 and S1 labeling emerges as the best-performing model and pre-processing configuration, achieving the highest recall of 91.54%. Although random forest is more stable across different configurations, HARDenseRNN’s higher recall makes it the more effective method for movement disorder classification in this context. Detailed results of the 12 models evaluated for movement disorder classification can be seen in Figure 10. For random forest movement disorder classification, this study used a power set approach to perform multi-label classification. HARDenseRNN, on the other hand, performs multi-label classification without the power set approach by using three output nodes, one for each anomaly. The confusion matrix for HARDenseRNN (across each node) using a window size of 250 and S1 labeling is shown in Figure 7.
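The three-output-node formulation can be sketched as follows. Each node is passed through a sigmoid and thresholded independently, so several anomalies can be reported at once. The 0.5 threshold, the example logits, and the function names are assumptions made for illustration.

```python
import math

ANOMALIES = ("tremor", "dyskinesia", "bradykinesia")


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_labels(logits, threshold=0.5):
    """Threshold each output node independently, so labels can co-occur."""
    return {name: sigmoid(z) >= threshold for name, z in zip(ANOMALIES, logits)}


def binary_cross_entropy(targets, logits, eps=1e-7):
    """Mean binary cross-entropy across the three output nodes."""
    total = 0.0
    for y, z in zip(targets, logits):
        p = min(max(sigmoid(z), eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(targets)


example_logits = [2.0, 0.3, -1.5]  # made-up pre-activation outputs
pred = predict_labels(example_logits)
loss_good = binary_cross_entropy([1, 1, 0], example_logits)
loss_bad = binary_cross_entropy([0, 0, 1], example_logits)
```

Because the loss is computed per node, the network needs only three outputs, whereas the power set formulation needs one class per combination of anomalies.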
4.3. Inference Time Evaluation
The HARDenseRNN deep learning model outperforms the random forest-based pipeline in terms of inference time, as seen in Figure 8. Although random forest takes 441 ms to complete a pipeline with all 150 features, HARDenseRNN takes only 165 ms. Even after excluding feature computation time, HARDenseRNN’s inference time remains significantly lower. For movement disorder detection, our model takes 45% less time compared to random forest, completing the task in 83 ms. In movement disorder classification, the time gap is even more substantial, with random forest taking 62% and 44% more time for the full feature set and top 10 features, respectively, compared to the HARDenseRNN deep learning model.
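Per-segment latency of this kind can be measured with a simple timing harness, sketched below. The harness, the stand-in predictor, and the helper `percent_less` are illustrative assumptions and not the benchmarking code used in this study.

```python
import time
from statistics import median


def median_latency_ms(predict, segment, repeats=50):
    """Median wall-clock time of `predict(segment)` in milliseconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        predict(segment)
        timings.append((time.perf_counter() - start) * 1000.0)
    return median(timings)


def percent_less(fast_ms, slow_ms):
    """How much less time `fast_ms` is relative to `slow_ms`, in percent."""
    return 100.0 * (slow_ms - fast_ms) / slow_ms


# Stand-in predictor on a 250-sample segment; a real model call would go here.
latency = median_latency_ms(lambda seg: sum(v * v for v in seg), [0.1] * 250)
saving = percent_less(165.0, 441.0)
```

Applied to the figures quoted above, 165 ms against 441 ms corresponds to a reduction of roughly 62.6%.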
5. Conclusions
In Parkinson’s disease detection research, prior studies focused on a single specific anomaly or on early detection using various modalities. Our work stands out by addressing multiple anomalies, namely tremors, dyskinesia, and bradykinesia, achieving 93.03% recall for movement disorder detection (binary) and 91.54% recall for movement disorder classification (multi-label). Utilizing data captured with the on-board IMU of a wearable device worn on the most affected upper limb, our approach introduces a novel two-stage pipeline for movement disorder detection and classification (severity quantification). Unlike methods relying on vision or speech, our model provides a comprehensive solution. This study presents optimal pre-processing configurations, highlights key features for machine learning, and presents a deep neural network-based pipeline that is suitable for real-world deployment, even on low-end devices, with real-time performance. This marks a significant advancement in Parkinson’s disease research, offering a holistic approach to movement disorder detection and severity quantification. Our current research emphasizes movement disorder detection and classification and excludes the prediction of severity for individual anomalies due to limitations in the dataset.
Future endeavors could involve addressing the skewed distribution by creating a more balanced dataset and incorporating individual anomaly severity quantification into our pipeline. Our future work aims to go beyond detection and quantification by incorporating anomaly forecasting, making our approach more proactive. Moreover, the dataset used in this study has 28 subjects, which may impact the generalization. This constraint should be considered when interpreting the results. To improve generalization, we plan to work with larger and more diverse datasets. Additionally, we aim to explore lower-limb and waist data to better capture full-body movement, including gait. These expansions will enhance Parkinson’s disease monitoring by providing a more comprehensive understanding and early anticipation of movement anomalies.