Article

Machine Learning-Based Assessment of Parkinson’s Disease Symptoms Using Wearable and Smartphone Sensors

1 Faculty of Cybernetics, Military University of Technology, gen. Sylwestra Kaliskiego 2, 00-908 Warsaw, Poland
2 Department of Neurology, Faculty of Health Sciences, Medical University of Warsaw, Żwirki i Wigury 61, 02-091 Warsaw, Poland
* Author to whom correspondence should be addressed.
Sensors 2025, 25(16), 4924; https://doi.org/10.3390/s25164924
Submission received: 4 July 2025 / Revised: 29 July 2025 / Accepted: 7 August 2025 / Published: 9 August 2025

Abstract

This study explores the use of machine learning models to assess the severity of Parkinson’s disease symptoms based on data from wearable and smartphone sensors. It presents models to predict the severities of individual symptoms—tremor, bradykinesia, stiffness, and dyskinesia—as well as the overall state of patients, using both clinician and patient self-assessments as labels. The dataset, although limited and imbalanced, enabled the identification of key trends. The best performance was achieved when combining data from both the MYO armband and smartphone, and when using patient self-assessments as targets. Tremor was the most predictable symptom, while others proved more challenging—especially at higher severity levels, which were poorly represented in the dataset. These results highlight the value of multimodal data and the importance of patient input in symptom monitoring. However, they also point to the need for more balanced and extensive datasets to improve prediction accuracy across all severity levels and symptoms.

1. Introduction

Parkinson’s disease (PD) is one of the most common neurological disorders [1], significantly impacting patients’ quality of life and making daily tasks increasingly difficult. The disease manifests through both motor and non-motor symptoms, each of which requires different approaches for treatment [2,3].
Motor symptoms are the most widely recognized features of PD. The characteristic resting tremor, typically observed when muscles are at rest, is a defining symptom that presents as repetitive limb trembling, with a frequency between 4 and 6 Hz [2]. Bradykinesia, another primary motor symptom, represents the slowness of voluntary movement [2]. It is marked by longer action times and requires patients to exert greater effort and focus to perform movements. Additionally, muscle rigidity or stiffness is common, further complicating movements, causing pain, and reducing the range of motion [2]. These are the three core motor symptoms—tremor, bradykinesia, and rigidity—experienced by the patient due to the presence and advancement of the disease. More complex motor issues, such as freezing of gait, balance instability leading to falls, handwriting difficulties, and voice disorders, may manifest in advanced stages or in a smaller subset of patients [3].
Apart from motor symptoms, patients also suffer from a diverse range of non-motor symptoms encompassing physiological and psychological issues [2]. Examples include depression, lack of emotional involvement, sleep problems, and constipation. With the advancement of the disease, these symptoms might become more troublesome than the motor symptoms [2].
The cause of the disease is the progressive degeneration of dopaminergic neurons responsible for producing dopamine, which is a neurotransmitter related to regulating movement and various neurocognitive functions [4]. Dopamine deficiency is the main cause of PD symptoms, and the treatment strategies focus on restoring dopamine levels or enhancing the brain’s sensitivity to this neurotransmitter. Levodopa, a dopamine precursor capable of crossing the blood–brain barrier, remains the most effective medication for managing PD symptoms [5]. Usually, it is administered orally using pills, though alternative delivery methods such as duodenal levodopa infusion (Duodopa) are available for advanced cases [6].
However, many long-term levodopa users can face additional complications, such as dyskinesias—rapid, involuntary movements caused by fluctuations in dopamine levels [5]. As the disease progresses, the therapeutic window narrows, complicating medication management and necessitating highly personalized treatment regimens [3,5].
The complexity of PD management requires individual, patient-specific plans aimed at symptom mitigation to improve quality of life [7]. In order to find the optimal medicine doses, the clinicians assign an initial medication schedule that is then adjusted based on the patient’s individual reports [5,7]. However, these adjustments are heavily dependent on subjective patient reports, which are often imprecise and subject to variability. This introduces significant challenges in optimizing treatment, underscoring the urgent need for objective and reliable methods to monitor and quantify PD symptoms. Such a comprehensive assessment should encompass tremor, bradykinesia, rigidity, dyskinesias, and an overall evaluation of the patient’s condition.
Over the years, various approaches have been developed to objectively assess PD symptoms [8,9]. These methods often employ algorithms to detect and quantify specific symptoms or leverage machine learning models to evaluate the patient’s clinical state [10,11,12,13,14,15,16]. Traditional algorithmic approaches have typically focused on identifying the presence of symptoms or, in more advanced cases, classifying their severity into predefined categories. However, these systems have several limitations: they often rely on data from a single sensor or a specific exercise, making it difficult to generalize findings across different sensor types, tasks, or symptom manifestations. An example of such an approach was proposed by Griffiths et al. [14]. They created an algorithm which, based on continuous accelerometer readings, provides a score for both bradykinesia and dyskinesia severity throughout the day. Other studies focused on using machine learning models for predicting symptom severity. During the DREAM Challenge [12], and in a study by Gutowski [17], both shallow machine learning and advanced deep learning models were developed to predict the severity and presence of three individual symptoms, tremor, bradykinesia, and dyskinesia, also based on accelerometer readings from different limbs. A study published by Thomas et al. [13] focused on predicting a universal value—a treatment–response index, which was designed to represent the symptoms that could be captured using accelerometers and gyroscopes and reflect the response to treatment.
Existing machine learning solutions have thus been limited to detecting the presence of a symptom or, in more sophisticated cases, classifying its intensity into one of several predefined severity classes, and they have typically relied on data from a single sensor or a single exercise, making it difficult to compare results across sensors, tasks, and symptoms. The study presented in this paper focuses on experiments aimed at building prediction models for evaluating the severity of four main symptoms associated with PD (tremor, bradykinesia, muscle stiffness, and dyskinesia) as well as the overall patient state. In contrast to previous work, it aims to build models capable of predicting a real-valued score on a 0–4 scale representing symptom severity. This enables fine-grained prediction for all considered symptoms and the overall condition based on multimodal sensor data collected during diverse motor tasks. Unlike prior studies focusing on binary classification or severity categories, this work provides continuous severity estimates and evaluates both clinician- and patient-based labels, facilitating future research into subjective versus objective symptom assessment. This approach offers more nuanced feedback for monitoring disease progression and treatment efficacy and can support personalized adjustments to treatment regimens.

2. Materials and Methods

2.1. Dataset

The dataset used in this research was created as a result of cooperation between two research facilities, the Military University of Technology (gen. Sylwestra Kaliskiego 2, 00-908 Warsaw, Poland) and the Medical University of Warsaw (Żwirki i Wigury 61, 02-091 Warsaw, Poland). The dataset consists primarily of recordings from patients with PD. It is the outcome of a study on the use of a mobile application in the differential diagnosis and treatment of tremor in patients with essential tremor, PD, and atypical parkinsonism. This study was approved by the Bioethics Committee of the Medical University of Warsaw. During this study, data collection was initially supervised by clinicians. The process was supported and organized by an information system, which consists of a mobile application and a web portal. The data was mostly collected using the mobile application.
The mobile application is designed to collect patient demographic and clinical data upon registration. Its main goal, however, is to enable evaluation of the patient’s state and to track state changes, medication schedules, and intakes. It also allows examinations to be conducted, during which the patient engages in a series of exercises aimed at capturing different symptoms of the disease. These were performed using the mobile device and wearable sensors, including the Myo armband [18]. The application supports the collection of data during four types of exercises: sensor, reaction, handwriting, and speech exercises. However, this paper focuses solely on the severity assessment of individual symptoms based on the signals recorded during sensor exercises.
The main goal of sensor examinations is to detect motor symptoms of PD, particularly tremor, bradykinesia, and dyskinesia. To this end, the application collects data from the sensors built into the mobile device (an accelerometer and gyroscope sampled at 50 Hz) and from the wearable device (the accelerometer, gyroscope, and EMG sensors of the MYO armband). Before starting the examination, the patients were asked to put the wearable sensors (if applicable) on one or both arms and hold the mobile phone in the examined hand.
The first sensor task focused on detecting rest tremor: the patient was asked to keep their hands on their knees or a vertical platform for 30 s while the sensor data were collected. The second task focused on detecting postural tremor: for 30 s, the patient extended their arms in front of them while the data were recorded. Next, the patient performed a 30 s pronation–supination task, used primarily for detecting bradykinesia. Finally, the sensor data were collected for another 30 s while the patient performed a further task aimed at assessing kinetic tremor.
After each examination, performed under clinical supervision, the clinician evaluated the four main symptoms associated with PD (bradykinesia, tremor, dyskinesia, and muscle stiffness) on a scale from 0 to 4. Additionally, both the clinician and the patient provided separate evaluations of the overall patient state (representing the response to medication) on a scale from −4 (severe symptoms) to 4 (severe dyskinesia), with 0 representing the optimal state. This evaluation captures the overall response to medication and can later be used to evaluate and adjust treatment.
A neurologist, Stanisław Szlufik, from the Mazovian Bródno Hospital (Ludwika Kondratowicza 8, 03-242 Warsaw, Poland), along with his team, was responsible for providing state evaluations, which were treated as the ground truth for experiments described in the paper. At the time of data analysis, this dataset contained accounts of 241 patients with PD, resulting in 739 examinations. However, not all the examinations included all exercises for two main reasons: first, the scope of the scales and assessments evolved over time; and second, to better capture the scope and magnitude of Parkinson’s disease symptoms, clinicians were allowed to restrict the set of exercises for each examination. The characteristics of the dataset are presented in Table 1.

2.2. Data Preparation

The dataset contains a limited number of samples; therefore, deep learning methods—despite their popularity in modern research—did not yield significant results. Previous work by Gutowski [17,19] involved the development of deep learning models using a significantly larger dataset, which resulted in strong performance. However, when these models were applied to the current, much smaller dataset—including attempts with transfer learning—the results were significantly worse than those obtained using shallow models. This motivated the decision to exclude deep learning approaches from this manuscript and focus solely on conventional, shallow ML models, which require additional preprocessing and the extraction of relevant features from raw signal data. The process of preparing the raw sensor signal for ML training and prediction is presented in Figure 1.
The raw signal was processed through steps such as filtering to remove unwanted components, calculating the signal magnitude, and decomposing the signal into multiple frequency bands. This was followed by the feature extraction step. Based on the signal type and the purpose of the model (i.e., which variable is predicted), a set of features was selected and calculated to represent the signal with respect to the model’s prediction task. When the number of created features is high, dimensionality reduction methods are often employed to retain only the most relevant variables. The reduced feature set can then serve as the input to the ML model.
The signals from the inertial sensors required preprocessing before being fed to the ML models. They were first filtered with a high-pass filter with a cut-off frequency of 0.1 Hz to remove the gravitational acceleration component. All of the signals were also filtered with a low-pass filter to attenuate high-frequency noise; since the sampling frequency was 50 Hz, a cut-off frequency of 20 Hz was selected, as PD-related tremors and other motor features do not typically exceed this frequency. A Butterworth filter [20] was used, as it is often chosen for its flat frequency response in the passband, ensuring minimal signal distortion below the cut-off frequency. This approach balances the removal of unwanted components with the retention of crucial movement data, facilitating accurate feature extraction and analysis.
After filtering, the signals were used to calculate the magnitude signal (1), representing movement captured in all directions. This derived signal allows for a more aggregated analysis, disregarding the direction of movement, which might be important, especially in cases where the sensors are not always worn in the same orientation.
M = \sqrt{X^2 + Y^2 + Z^2} \quad (1)
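The filtering and magnitude steps described above can be sketched as follows; the filter order (4) and the use of zero-phase filtering (`filtfilt`) are assumptions, since the text specifies only the cut-off frequencies:

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 50.0  # sampling frequency of the inertial sensors (Hz)

def preprocess(xyz):
    """Band-limit a 3-axis signal (n_samples x 3) and derive the
    magnitude signal; assumes a 4th-order Butterworth filter."""
    # High-pass at 0.1 Hz removes the gravitational component;
    # low-pass at 20 Hz attenuates high-frequency noise.
    b_hp, a_hp = butter(4, 0.1, btype="highpass", fs=FS)
    b_lp, a_lp = butter(4, 20.0, btype="lowpass", fs=FS)
    filtered = filtfilt(b_lp, a_lp, filtfilt(b_hp, a_hp, xyz, axis=0), axis=0)
    magnitude = np.linalg.norm(filtered, axis=1)  # sqrt(X^2 + Y^2 + Z^2), Eq. (1)
    return filtered, magnitude
```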

2.3. Feature Extraction

Based on the description of the main PD symptoms (dyskinesia, bradykinesia, and tremor) and consultations with neurologists, the signal was decomposed into three frequency bands: 0–3, 3–9, and 9–14 Hz. The features were then calculated for each of these frequency bands and for the whole signal, for each axis and the magnitude signal. This captured different aspects of the signal, providing an accurate and precise representation. The features were chosen based on a literature review [11,13,21,22,23,24,25,26] covering the analysis of inertial signals for detecting activities, diagnosing PD, and quantifying PD symptoms.
These features were divided into time domain features, calculated directly from the signal, and frequency domain features, which provided insights into the signal’s frequency content. To extract frequency domain features, the signal underwent a Fourier Transform [27], a process that decomposes the signal into its constituent frequencies, revealing the spectrum of frequencies present and their relative intensities. Part of this analysis involved computing the Power Spectral Density (PSD), which quantifies the power present within each frequency component of the signal. The PSD is crucial for understanding the energy distribution across various frequencies, enabling the identification of dominant frequency bands that may indicate the presence of tremors or other PD-related motor symptoms. This helped highlight the specific frequencies contributing to the signal and aided in detecting patterns or abnormalities in the frequency domain, offering a better representation of how PD affects motor functions. Table 2 contains a list of features that are calculated based on the signal for each axis and the magnitude.
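As an illustration of PSD-based features, the sketch below estimates the power in each of the three frequency bands; the choice of Welch's method and its parameters are assumptions, since the text specifies only that the PSD was computed from the Fourier Transform:

```python
import numpy as np
from scipy.signal import welch

# The three bands described in the text, plus labels for readability.
BANDS = {"0-3 Hz": (0.0, 3.0), "3-9 Hz": (3.0, 9.0), "9-14 Hz": (9.0, 14.0)}

def band_powers(signal, fs=50.0):
    """Approximate the power in each frequency band by summing the
    PSD bins that fall inside the band."""
    freqs, psd = welch(signal, fs=fs, nperseg=min(256, len(signal)))
    return {name: float(psd[(freqs >= lo) & (freqs < hi)].sum())
            for name, (lo, hi) in BANDS.items()}
```

For a pure 5 Hz tremor-like oscillation, the 3–9 Hz band dominates, which is exactly the behaviour these features exploit.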
The features were calculated using functions from the NumPy (v1.24.4), SciPy (v1.10.1), PyWavelets (v1.4.1), and EntropyHub (v2.0) libraries for Python (v3.8.10). Additional custom features were implemented individually in Python. Time domain features were calculated for the entire signal across all three axes and the signal magnitude, resulting in 44 features. Frequency domain features were computed for the 3 previously described frequency bands and the original signal across all axes (X, Y, Z, and magnitude), yielding 128 features. To capture the signal’s characteristics as accurately as possible, additional features were added.
A short-time Fourier transform (STFT) was performed with a window size of 4 s and a 2 s overlap. For each window, the mean PSD was calculated, and the following statistics were computed for the vector: the mean, standard deviation, skewness, min, and max. This resulted in 5 additional features for each axis and each frequency band, totaling 80 features. Similarly, the raw signal was segmented into windows of this size. For each window, the value range and the entropy were calculated, as described by E. Sejdić et al. [22]. Based on these values, the previously described statistics were calculated, adding 40 more features.
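A minimal sketch of the windowed statistics described above, using the per-window value range as the summarised quantity; the 4 s window and 2 s overlap follow the text, while the implementation details are assumptions:

```python
import numpy as np
from scipy.stats import skew

def windowed_stats(signal, fs=50.0, win_s=4.0, overlap_s=2.0):
    """Slide a 4 s window with 2 s overlap over the signal, compute the
    value range per window, and summarise it with five statistics."""
    win, step = int(win_s * fs), int((win_s - overlap_s) * fs)
    ranges = np.asarray([np.ptp(signal[i:i + win])
                         for i in range(0, len(signal) - win + 1, step)])
    return {"mean": float(ranges.mean()), "std": float(ranges.std()),
            "skew": float(skew(ranges)), "min": float(ranges.min()),
            "max": float(ranges.max())}
```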
Following the methodology described by Thomas et al. [13], a three-level Discrete Wavelet Transform was applied using a Daubechies wavelet of order 10. The means and the standard deviations were calculated for first-level high-frequencies, second-level high-frequencies, and third-level high-frequencies. These calculations resulted in an additional 24 features.
To capture the correlations between different axes, Pearson correlation coefficients were calculated for each axis pair (X and Y, X and Z, and Y and Z), resulting in three features. In total, 275 features were extracted from a single accelerometer signal.

2.4. Examination Metadata

To build appropriate models for predicting the patient state, additional features were added alongside those extracted from the collected sensor data to improve the quality of the model and its prediction precision. These comprised patient characteristics, which are identical across all examinations of a given patient, as well as characteristics of the specific examination. The full list is presented in Table 3.
For categorical features, such as the affected side, handedness, and groups, one-hot encoding was applied to ensure correct interpretation of the values by ML models. The remaining features were normalized along with the sensor-derived features by subtracting the mean and dividing by the standard deviation.
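A minimal sketch of this metadata preparation, assuming pandas; the column names below are illustrative only, as the actual features are those listed in Table 3:

```python
import pandas as pd

# Hypothetical examination metadata (illustrative column names).
meta = pd.DataFrame({
    "affected_side": ["left", "right", "left"],
    "handedness": ["right", "right", "left"],
    "age": [61, 70, 58],
})

# One-hot encode the categorical columns...
encoded = pd.get_dummies(meta, columns=["affected_side", "handedness"])
# ...and standardize the numerical ones (zero mean, unit variance).
encoded["age"] = (encoded["age"] - encoded["age"].mean()) / encoded["age"].std()
```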

2.5. Feature Selection

A single sensor examination performed by a patient can consist of 3 exercises; when the examination is performed for both hands, this number can increase to 6. Considering the number of sensors and extracted features, one examination can provide thousands of features, easily exceeding the number of patients and even the total number of examinations performed. As stated by Guyon and Elisseeff [29], in such situations it is important to consider feature selection methods. These reduce the number of dimensions, making it easier for the ML model to learn the dependencies in the data and speeding up the training process. Recent advances in feature selection, such as multi-objective binary grey wolf optimization with guided mutation [30], demonstrate how intelligent optimization techniques can further enhance feature selection efficiency and model performance.
Feature reduction was performed in two steps. The first step focused on removing variables that are highly correlated with each other: duplicate features do not improve the performance of ML models but only slow down the process. Therefore, Pearson’s correlation coefficient was calculated for every pair of features, and whenever the correlation between two features exceeded 0.97, the redundant feature was excluded from further analysis.
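The correlation-based pruning step could be implemented as follows; keeping one feature of each highly correlated pair is the usual convention and is an assumption here:

```python
import numpy as np
import pandas as pd

def drop_correlated(X, threshold=0.97):
    """Drop one feature from every pair whose absolute Pearson
    correlation exceeds the threshold."""
    corr = X.corr().abs()
    # Keep only the upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return X.drop(columns=to_drop)
```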
The second step in the reduction of feature dimensionality was applied during the training process. First, all features were assigned importance scores using the Random Forest [31] ML model. They were then sorted in descending order, and only the top 60% were further considered. Feature selection then proceeded as follows: beginning with the most important feature, additional features were added one by one; at each step, the model’s cross-validated performance was evaluated, and a feature was retained only if it led to an improvement.
This method was selected because, even after the initial step, some features were highly correlated, which can negatively impact methods that evaluate features individually, such as permutation importance [32], since permuting one feature does not fully remove shared information with correlated features. Moreover, methods such as Principal Component Analysis (PCA) [33] and scikit-learn’s SelectFromModel [34] did not improve model performance in the experiments. The iterative, performance-driven feature selection, combined with initial removal of highly correlated features, proved to be more effective, enabling the construction of quality models that provided accurate predictions based on as few as 30 features.
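The iterative, performance-driven selection can be sketched as below; the scoring metric (scikit-learn's default R² for regressors) and the Random Forest hyperparameters are assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

def forward_select(X, y, keep_frac=0.6, cv=5, seed=0):
    """Rank features by Random Forest importance, keep the top 60%,
    then add them one by one, retaining a feature only when it
    improves the cross-validated score."""
    rf = RandomForestRegressor(random_state=seed).fit(X, y)
    order = np.argsort(rf.feature_importances_)[::-1]
    order = order[: max(1, int(len(order) * keep_frac))]
    selected, best = [int(order[0])], -np.inf
    for idx in map(int, order):
        trial = selected if idx in selected else selected + [idx]
        score = cross_val_score(
            RandomForestRegressor(random_state=seed), X[:, trial], y, cv=cv
        ).mean()
        if score > best:
            best, selected = score, trial
    return selected
```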

2.6. ML Model Training

Three different machine learning models from the scikit-learn (v1.2.2) and XGBoost (v2.0.3) Python libraries were selected for regression: Random Forest [31], Extreme Gradient Boosting (XGBoost) [31], and Support Vector Machine (SVM) [35]. These models were initialized with default parameters.
Random forest (RF) [31] is an ensemble method that bases its decisions on multiple individual ML models, such as decision trees. Each tree is trained on a random subset of the data and features, and the final prediction is made by combining the predictions of all trees. This approach reduces overfitting and improves generalization compared to a single decision tree.
XGBoost (XG) [31] is also an ensemble model that uses a collection of decision trees. However, it builds these trees sequentially, where each tree aims to correct the errors of the previous ones. This method, known as boosting, enhances the model’s accuracy and robustness by focusing on the mistakes made by prior models, thereby improving predictive performance.
SVM [35] can be used both for classification and regression tasks. For classification, it tries to find the optimal hyperplane that best separates members of different classes in the feature space. SVM supports the use of kernel functions, which can transform the data into a higher-dimensional space, making the separation process easier and enabling the handling of non-linear boundaries. The default kernel in scikit-learn is the Radial Basis Function [35].
To evaluate the performance of the regression models, metrics are calculated on the predictions for the test set. Commonly used metrics are the mean squared error (MSE) (13), the coefficient of determination (R2) (14), the mean absolute error (MAE) (15), and Pearson’s correlation coefficient (r) (16) between the true values and the predicted outcomes [36]. These give a good overview of the overall performance of the machine learning models. However, for highly imbalanced datasets, such as this one, these metrics might not provide enough information for model evaluation. Therefore, two additional metrics were constructed. Since the original labels are discrete values that can be treated as classes, it was possible to calculate class-specific metrics. The MAE was calculated for every class separately (17); from these values, a derived metric, bMAE (18), was defined as the mean absolute error averaged across classes, and bMSE (19) was defined analogously using the mean squared error calculated for every class.
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_{\mathrm{true},i} - y_{\mathrm{pred},i}\right)^2 \quad (13)
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_{\mathrm{true},i} - y_{\mathrm{pred},i}\right)^2}{\sum_{i=1}^{n}\left(y_{\mathrm{true},i} - \overline{y_{\mathrm{true}}}\right)^2} \quad (14)
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_{\mathrm{true},i} - y_{\mathrm{pred},i}\right| \quad (15)
r = \frac{\sum_{i=1}^{n}\left(y_{\mathrm{true},i} - \overline{y_{\mathrm{true}}}\right)\left(y_{\mathrm{pred},i} - \overline{y_{\mathrm{pred}}}\right)}{\sqrt{\sum_{i=1}^{n}\left(y_{\mathrm{true},i} - \overline{y_{\mathrm{true}}}\right)^2 \sum_{i=1}^{n}\left(y_{\mathrm{pred},i} - \overline{y_{\mathrm{pred}}}\right)^2}} \quad (16)
\mathrm{MAE}_k = \frac{1}{n_k}\sum_{i \in \mathrm{class}\,k}\left|y_{\mathrm{true},i} - y_{\mathrm{pred},i}\right| \quad (17)
\mathrm{bMAE} = \frac{1}{C}\sum_{k=1}^{C}\mathrm{MAE}_k \quad (18)
\mathrm{bMSE} = \frac{1}{C}\sum_{k=1}^{C}\frac{1}{n_k}\sum_{i \in \mathrm{class}\,k}\left(y_{\mathrm{true},i} - y_{\mathrm{pred},i}\right)^2 \quad (19)
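The class-balanced metrics bMAE (18) and bMSE (19) can be computed directly from the per-class errors:

```python
import numpy as np

def balanced_errors(y_true, y_pred):
    """bMAE and bMSE: average the per-class MAE and MSE so that rare
    severity levels weigh as much as frequent ones."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    mae_k, mse_k = [], []
    for k in np.unique(y_true):
        err = y_true[y_true == k] - y_pred[y_true == k]
        mae_k.append(np.abs(err).mean())
        mse_k.append((err ** 2).mean())
    return float(np.mean(mae_k)), float(np.mean(mse_k))
```

For example, predicting 0 for the labels [0, 0, 0, 1] yields a plain MAE of 0.25 but a bMAE of 0.5, illustrating how the balanced metric penalizes errors on rare classes.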

2.7. Individual Symptom Evaluation

The features derived from signals collected during patient exercises represent the patient’s condition during the examination. The features provide different aspects of the examination performance and might be important in identifying specific symptoms of PD. At the end of examinations conducted in the presence of the clinician, a state assessment screen is displayed, where the overall state evaluation is provided along with individual symptoms, including tremor, bradykinesia, muscle stiffness, and dyskinesia. The clinician is asked to evaluate their severity on a scale of 0 (not present) to 4 (very severe). While this evaluation was not provided in all examinations for PD patients, 356 of the patient examinations contain these evaluations. This section focuses on building ML models capable of predicting individual symptom severities (as evaluated by clinicians) based on exercise-derived features.
In this dataset, the problem of imbalance is significant. The total number of samples is low, and higher symptom severities are poorly represented. For example, there is only one sample for dyskinesia severity of 4, making it impossible to train and evaluate the model for this severity. Other symptoms have better representation, with the most balanced dataset being for tremor prediction—10 samples for a severity of 4. The class distributions for all symptoms (tremor, bradykinesia, muscle stiffness, and dyskinesia) are shown in Figure 2.
Due to the small number of examinations in the dataset, cross-validation [37] was employed to validate the ML models. In this technique, the data is randomly split into k disjoint subsets; the training process is then performed and evaluated k times, with k − 1 subsets used for training and the remaining subset used for evaluation, ensuring that every subset is treated as the test set exactly once.
In the simplest version of cross-validation, splitting into subsets is performed randomly. However, there are more advanced versions that can be used for specific scenarios. For example, stratified k-fold cross-validation is often used for classification problems. In this method, the partitioning is performed so that the distribution of class samples in different subsets is similar.
Additionally, Leave-One-Out (LOO) cross-validation can be used. It can be performed either on individual samples or on groups. When performed on samples, each subset contains only one sample. When performed on groups, the number of subsets is equal to the number of groups, with each model evaluated on one group while being trained on the remaining groups.
These splits are included in the scikit-learn Python library in the form of the following classes: KFold, StratifiedKFold, LeaveOneOut, and LeaveOneGroupOut. These are used to perform the training process in this part of the study.
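A brief sketch of how these scikit-learn splitters map onto the study's setup, with leave-one-patient-out realized via `LeaveOneGroupOut` and patient identifiers as groups; the toy data below is illustrative:

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneGroupOut

X = np.arange(20, dtype=float).reshape(10, 2)         # toy feature matrix
patients = np.array([0, 0, 0, 1, 1, 2, 2, 2, 3, 3])   # examination -> patient id

# k-fold split over examinations (5 folds for the toy data; the study used 10).
kfold_sizes = [len(test) for _, test
               in KFold(n_splits=5, shuffle=True, random_state=0).split(X)]

# Leave-one-patient-out: all examinations of one patient form the test set.
logo = LeaveOneGroupOut()
n_folds = logo.get_n_splits(groups=patients)          # one fold per patient
```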
Numerous training runs were executed; they can be divided into three groups based on the goal of the training:
  • Single exercise, single sensor from one device—finding which sensor and device combination best captures specific symptoms during different exercises,
  • Single exercise—finding which exercise is best at capturing each of the symptoms,
  • All exercises and devices—finding out how the models perform at capturing symptoms when all of the data can be used.
Each experiment was performed using all of the previously defined models. Each experiment was validated using cross-validation with two different splits—10-fold split (10F) and leave one patient out (LOO)—to see how models perform in these different situations.

2.8. Overall State Evaluation

The comprehensive assessment of a patient’s overall state plays an important role in understanding the nature of PD. While detailed evaluations of specific symptoms offer valuable insights into the disease’s characteristics, severity, and symptom manifestations, they may not fully encompass the impact on a patient’s quality of life and daily functioning. To address this, the MDS-UPDRS [38] provides a foundational framework for a more inclusive evaluation. In an effort to simplify the case and represent the therapeutic effect of medication, Westin et al. [39] proposed the TRS scale, optimizing it to capture the spectrum of patient experiences from severe symptoms to severe dyskinesia, with 0 being the optimal state. The TRS scale used in this study ranges from −4 to +4, as presented in Figure 3. It was adjusted to allow clinicians to gauge the overall state of PD patients more effectively. Such a comprehensive assessment is crucial for monitoring disease progression and customizing treatment plans to align with the dynamic needs of each patient, thereby enhancing therapeutic outcomes and patient well-being.
In this section, the focus is on the development of machine learning models capable of predicting the adjusted TRS scale values. The predictions are based on a set of data collected during patient evaluations. These include sensor exercises, screen interactions, handwriting, and vocal exercises. By analyzing a diverse collection of examination data, the models aim to achieve a more accurate and personalized understanding of patient conditions. This approach is designed to enhance the precision of treatment plans, tailoring interventions to meet the unique needs of individuals with PD.
The goal of training the ML models is to evaluate the patient’s state during examinations. The ground truth values for this were provided both by the patient (their subjective opinion) and by their clinician, whose assessment is presumably more objective.
While the models were trained separately on clinician- and patient-reported labels, no direct comparison between these two groups was performed in this study. However, the predictions based on these two sources showed noticeable differences, which is likely caused by the nature of the input: clinician assessments tend to reflect standardized diagnostic criteria and are influenced by medical training and experience, whereas patient self-reports are subjective and may be shaped by individual perceptions or mood. Previous research [40,41] has also highlighted discrepancies between patient- and clinician-reported outcomes, especially in conditions involving fluctuating or non-visible symptoms.
As with the individual symptom severities, this dataset was affected by an imbalance in the label values. Furthermore, the range of values is more than twice as large and the resolution is finer, as shown in Figure 4 together with the number of examinations assigned each label. This makes it more difficult to build a model that performs well across the entire range of values.
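Under such imbalance, the evaluation relies on class-balanced error metrics (bMAE, bMSE). A minimal sketch, assuming the common formulation that averages per-class errors so that rare severity levels count as much as frequent ones (the paper’s exact definitions are given elsewhere and may differ):

```python
import numpy as np

def balanced_mae(y_true, y_pred):
    """Average of per-class MAEs: rare severity levels weigh as much as common ones."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean([np.abs(y_pred[y_true == c] - c).mean()
                          for c in np.unique(y_true)]))

def balanced_mse(y_true, y_pred):
    """Average of per-class MSEs."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean([((y_pred[y_true == c] - c) ** 2).mean()
                          for c in np.unique(y_true)]))

# With many level-1 samples and a single level-4 sample, plain MAE hides
# the large error on the rare class, while bMAE exposes it.
y_true = [1, 1, 1, 1, 4]
y_pred = [1, 1, 1, 1, 2]
mae = float(np.mean(np.abs(np.subtract(y_true, y_pred))))  # 0.4
bmae = balanced_mae(y_true, y_pred)                        # (0 + 2) / 2 = 1.0
```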
The process of building ML models to predict the patient’s state is similar to the prediction of specific symptom severities. Regression models were built using sensor signals registered during different exercises; as before, 10-fold and leave-one-patient-out cross-validation were performed, and the previously described metrics (R2, r, MAE, bMAE, bMSE) were used to evaluate the models. The scope of the experiments is the following:
  • Single device—finding which device is more useful in capturing the scope of the disease,
  • All collected data—building a complete and optimal model for predicting a patient’s state.
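The cross-validation setup described above can be sketched with scikit-learn; the feature matrix, target, patient grouping, and SVR settings below are synthetic placeholders, not the study’s data or tuned models:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import KFold, LeaveOneGroupOut, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 8))                        # stand-in for sensor features
y = 2.0 * X[:, 0] + rng.normal(scale=0.3, size=120)  # stand-in severity target
patients = np.repeat(np.arange(12), 10)              # 12 patients x 10 examinations

model = make_pipeline(StandardScaler(), SVR(C=10.0))

# 10-fold CV ignores patient identity...
pred_10f = cross_val_predict(
    model, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=0))
# ...while leave-one-patient-out keeps each patient's examinations in one test fold.
pred_loo = cross_val_predict(model, X, y, cv=LeaveOneGroupOut(), groups=patients)

r2_10f, r2_loo = r2_score(y, pred_10f), r2_score(y, pred_loo)
mae_loo = mean_absolute_error(y, pred_loo)
```

Keeping all of a patient’s examinations in a single test fold is what prevents patient-specific leakage between training and test splits.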
Because the overall state takes many more possible values than the symptom severities, the results are presented as scatterplots rather than violin plots. Each scatterplot shows the true values on the x-axis and the predicted values on the y-axis, allowing the model’s performance to be assessed visually.

3. Results

3.1. Individual Symptom Evaluation

The primary goal of the initial training process was to evaluate symptom severities based on data from individual exercises and sensor signals. For each training instance, all previously discussed machine learning models were applied and configured to address the regression task of estimating the severity of tremor, bradykinesia, muscle stiffness, and dyskinesia.
Model performance was assessed using standard evaluation metrics, and the top-performing models (those achieving the highest R2 scores) are summarized in Table 4, with complete results provided in Appendix A. The table lists two models per symptom: one trained with 10-fold cross-validation (10F) and the other using leave-one-patient-out cross-validation (LOO). Comparing these models helped assess how individual patient characteristics influence model performance.
Table 5 and Table 6 extend this analysis by evaluating symptom severity predictions using all sensor signals collected during a single exercise (Table 5) and the full dataset comprising all sensor data recorded during the examination (Table 6).
The results clearly show that tremor severity can be predicted with the highest accuracy, as confirmed by the Wilcoxon signed-rank test [42] (e.g., a p-value of 1.1 × 10⁻⁶ when compared with bradykinesia). This is expected, as tremor is one of the most prominent and easily observable symptoms of PD. It is followed by bradykinesia, likely because its characteristic slowness of movement is relatively easy to detect in time-series sensor data. In contrast, dyskinesia prediction performed the worst, which is consistent with its limited representation in the dataset—particularly at higher severity levels, as shown in Figure 2.
The differences between 10F and LOO cross-validation results were generally small. To assess statistical significance, the Wilcoxon signed-rank test was again applied. With a test statistic of 1420.5 and a p-value of 0.436, no statistically significant difference was found at the 0.05 significance level. This suggests consistent model performance across different split approaches and indicates low susceptibility to data leakage—likely due to the diverse patient pool and the limited number of repeated examinations per patient.
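A comparison of this kind can be run with `scipy.stats.wilcoxon` on paired scores; the score vectors below are illustrative, not the paper’s results:

```python
import numpy as np
from scipy.stats import wilcoxon

# Paired R2 scores of the same model configurations under the two split schemes
# (illustrative numbers only).
r2_10f = np.array([0.52, 0.60, 0.34, 0.29, 0.55, 0.41, 0.63, 0.22])
r2_loo = np.array([0.55, 0.59, 0.31, 0.32, 0.53, 0.44, 0.61, 0.25])

stat, p_value = wilcoxon(r2_10f, r2_loo)
significant = p_value < 0.05   # expected False: no 10F-vs-LOO difference here
```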
From a task-specific perspective, exercises influenced performance differently depending on the symptom. Tremor and dyskinesia, which are observable through involuntary movement, were best captured during the first exercise, where the patient remained at rest. Conversely, bradykinesia and stiffness—requiring active motion—were best evaluated during the third exercise, which involved the pronation–supination task. Overall, accelerometer signals provided more informative features for severity prediction across all symptoms.
To visualize the prediction accuracy across severity levels, violin plots were created for models using all sensor data. These violin plots provide the distribution of predictions and class-specific bMAE values (Figure 5 and Figure 6).
The violin plots highlight several limitations of the models. First, they consistently struggle to predict higher severity levels: predictions rarely exceed a value of 3 when the ground truth is 4. Second, particularly for stiffness, a severity level of 0 is rarely predicted. This is a notable limitation, as correctly identifying the absence of symptoms is critical for clinical validity. One possible explanation is the imbalance in class distribution: for both bradykinesia and stiffness, level 1 was the class with the most samples. Furthermore, severity level 0 in these two symptoms often corresponds to extended periods of immobility, making it harder for the models to distinguish from low but non-zero symptom levels. Lastly, some models occasionally predict negative values, which do not occur in the original dataset. To solve this problem, one of the following approaches could be selected:
  • Using post-processing to clip predicted values to the valid range [0, 4];
  • Reformulating the problem as an ordinal classification task, at the cost of some prediction precision;
  • If neural networks were explored, applying a bounded activation function scaled to the target range in the final layer of the model.
Considering the advantages and disadvantages of these methods, clipping the values to the valid range is the best solution for this problem.
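The selected clipping step amounts to a single NumPy call; a minimal sketch:

```python
import numpy as np

def clip_severity(predictions, low=0.0, high=4.0):
    """Post-process regression outputs onto the valid severity scale."""
    return np.clip(np.asarray(predictions, dtype=float), low, high)

raw = np.array([-0.31, 0.8, 2.4, 4.35, 3.1])
clipped = clip_severity(raw)   # -> [0.0, 0.8, 2.4, 4.0, 3.1]
```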
Since clinical deployment depends on the transparency of the models, further experiments were performed to assess the importance of specific features in predicting the severity of each symptom. To calculate these importances, the permutation feature importance method [32] was used with R2 as the scoring metric. This method works by randomly shuffling the values of each feature and measuring the resulting decrease in the model’s performance. This identifies the features the model relies on most for accurate predictions—the greater the drop in performance, the more important the feature. The top five most relevant features and their corresponding importance scores are presented in Table 7.
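The described procedure corresponds to scikit-learn’s `permutation_importance`; the synthetic data below only demonstrates the mechanics, not the study’s features:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 2] + rng.normal(scale=0.2, size=200)  # only feature 2 is informative

model = RandomForestRegressor(random_state=0).fit(X, y)
result = permutation_importance(model, X, y, scoring="r2",
                                n_repeats=10, random_state=0)

# Greater drop in R2 after shuffling a column -> more important feature.
ranking = np.argsort(result.importances_mean)[::-1]
top_feature = int(ranking[0])   # expected: 2
```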
Based on the results presented in the table, distinct sets of features emerged as most relevant for detecting and estimating the severity of specific Parkinson’s disease symptoms. For tremor, the highest-ranked features were frequency-based parameters, particularly derived from gyroscope data on the Z-axis, such as the weighted mean power and spectral centroid within the 3–9 Hz and 0–25 Hz bands, respectively. These frequency bands align with the known physiological range of tremor in Parkinson’s disease. Notably, both MYO and smartphone sensors contributed top-ranking features, indicating that multiple modalities are effective for tremor characterization.
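A band-limited spectral centroid of the kind named here can be computed from a Welch power spectral density; the 100 Hz sampling rate and the synthetic 5 Hz tremor-like signal below are assumptions for illustration, not the study’s configuration:

```python
import numpy as np
from scipy.signal import welch

fs = 100.0                                    # assumed sampling rate (Hz)
t = np.arange(0, 10, 1 / fs)
gyro_z = np.sin(2 * np.pi * 5.0 * t)          # synthetic 5 Hz tremor-like oscillation
gyro_z += 0.1 * np.random.default_rng(0).normal(size=t.size)

f, psd = welch(gyro_z, fs=fs, nperseg=256)

def band_centroid(f, psd, f_lo, f_hi):
    """Power-weighted mean frequency (spectral centroid) within a band."""
    mask = (f >= f_lo) & (f <= f_hi)
    return float(np.sum(f[mask] * psd[mask]) / np.sum(psd[mask]))

centroid_3_9 = band_centroid(f, psd, 3.0, 9.0)   # close to the 5 Hz component
```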
In the case of bradykinesia, the most important features were predominantly time-domain statistics, such as the skewness, mean, and median, computed from accelerometer signals. These features reflect the irregular and reduced amplitude of movement typically associated with bradykinesia. Interestingly, features from both hands and all three axes contributed, suggesting that bradykinesia manifests in a more globally distributed motor pattern.
Dyskinesia and stiffness also showed distinct profiles. For dyskinesia, which involves involuntary, excessive movements, frequency-domain features again dominated, especially spectral power and related descriptors like the interquartile range and frequency of maximum power. These features capture the erratic, high-amplitude fluctuations characteristic of dyskinesia. In contrast, stiffness was best detected using both time- and frequency-domain features, including the maximum range and absolute mean differences, reflecting limited movement variability. The distribution of informative features reflects the physiological nature of each symptom and highlights the value of combining multiple sensors and feature types.
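Several of the descriptors mentioned above can be computed directly with NumPy/SciPy; a minimal sketch, noting that the exact feature definitions used in the study may differ:

```python
import numpy as np
from scipy.stats import skew

def time_domain_features(x):
    """Illustrative subset of the time-domain descriptors discussed above."""
    x = np.asarray(x, dtype=float)
    return {
        "mean": float(np.mean(x)),
        "median": float(np.median(x)),
        "skewness": float(skew(x)),
        "iqr": float(np.percentile(x, 75) - np.percentile(x, 25)),
        "max_range": float(np.max(x) - np.min(x)),
        # average sample-to-sample change, a simple movement-variability measure
        "abs_mean_diff": float(np.mean(np.abs(np.diff(x)))),
    }

feats = time_domain_features([0.0, 0.2, -0.1, 0.4, 0.1, -0.3])
```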

3.2. Overall State Evaluation

To predict the patient’s overall state, two experiments were conducted. The first one involved using data from a single sensor device—either the MYO armband or a smartphone—while the second experiment used data from both devices simultaneously. The metric values obtained through cross-validation are presented in Table 8. It showcases results received for predicting both the state according to the clinician and according to the patient.
In the case of patient state prediction, similar to symptom severity prediction, no significant differences were observed between the 10-fold (10F) and leave-one-patient-out (LOO) validation splits. This suggests that the models are not prone to overfitting or data leakage, further confirming the robustness of the methodology. Interestingly, across all prediction tasks, the best-performing models were consistently those based on SVMs, demonstrating their effectiveness in handling this type of biomedical data. This was confirmed using the Wilcoxon signed-rank test: when comparing SVM results (R2) with those of RF and XG, p-values of 0.035 and 0.00024 were obtained, respectively (both below the 0.05 significance level).
As expected, models trained on combined data from both sensor devices outperformed those trained on data from a single device. This reinforces the necessity of using both sensors to collect comprehensive data. The improvement can be attributed to the fact that the sensors are positioned on different parts of the body during examination—the phone is held in the hand, while the MYO armband is worn on the forearm, thereby capturing a wider range of motion and providing complementary information.
The best models achieved strong predictive power, with correlations between true labels and predictions reaching values as high as 0.8. Notably, the model predicting the patient’s self-assessed state performed slightly better than the one based on the clinician’s evaluation. This was unexpected and may suggest that the patients’ own perceptions of their condition—when paired with sensor data—can be modeled more accurately.
To further analyze model performance, scatter plots are shown in Figure 7, illustrating the correlation between true values and predictions for both the patient’s and clinician’s assessments.
Both models demonstrate good predictive performance when estimating symptom severity (negative values), reflecting the motor deficits associated with PD. However, their ability to predict dyskinesias (positive values) is limited. Neither model was able to predict values higher than approximately 0.5, falling short of the upper range of possible scores (up to 4). This limitation is likely due to the small number of samples exhibiting pronounced dyskinesias in the dataset. This is consistent with the poor performance observed in the dyskinesia-specific model described earlier.
Overall, the results emphasize the importance of multimodal data collection and the strengths of SVMs in modeling patient states, while also highlighting the challenges of accurately capturing rarer symptoms such as dyskinesias.
To gain additional insights into the modeling process, feature importance was analyzed for the best overall state prediction models. This helped identify which features contributed most to the final predictions (Table 9).
The analysis revealed that both clinician- and patient-based models rely on sensor features reflecting movement variability and distribution. For clinician assessments, top features include absolute mean difference and interquartile range from accelerometer and gyroscope signals, along with the time since diagnosis. Patient models highlight similar features such as skewness, axis correlations, and the time since diagnosis. While slight differences appear in the specific metrics emphasized, these top features largely reflect aspects of motor function relevant to PD severity. This suggests that both perspectives capture overlapping information from sensor data.

4. Discussion and Conclusions

The experiments in this paper focused on building machine learning models to predict both the individual severities of symptoms and the overall state of patients related to PD, as assessed by clinicians and patients. Due to the limited and imbalanced dataset, shallow machine learning models were used.
The results reveal both the promise and the limitations of such models in clinical applications. The best performance was observed when data from all available exercises were combined, suggesting that aggregating diverse movement patterns improves model robustness. Tremor emerged as the most predictable symptom, likely due to its more visible and measurable nature in sensor data. In contrast, symptoms such as bradykinesia, stiffness, and dyskinesia were more difficult to assess. This may be attributed not only to their subtler manifestations but also to the significant class imbalance and the underrepresentation of higher severity levels in the dataset. These factors highlight a central challenge in PD symptom modeling: while wearable sensors can provide rich input, the clinical variability and skewed distribution of symptom severities can severely limit model generalizability and predictive power.
One notable finding was that models trained on patient self-assessments performed slightly better than those using clinician ratings. While this might initially seem surprising, it may reflect the fact that patients experience and recognize their symptoms throughout the day, while clinical evaluations are limited to short check-ups. This highlights the potential value of incorporating patient-reported data into monitoring systems, especially for symptoms that fluctuate. Another observation was that positive state values—indicating the presence and severity of dyskinesias on the TRS scale—were predicted less accurately, likely due to their underrepresentation in the dataset. Lastly, the best results were achieved when data from both sensors—the MYO armband and the smartphone—were used together, suggesting that combining multiple sources of information gives a fuller picture of symptom expression.
This study’s main limitations stem from the small dataset size and the imbalance in symptom severities, which likely hindered model performance for rarer or less pronounced symptoms. Addressing this will require broader data collection and possibly augmentation techniques to simulate underrepresented cases. Furthermore, although shallow models such as SVMs and RFs were appropriate given the dataset, more complex approaches—such as deep neural networks or larger ensembles—may uncover richer patterns if more data becomes available.
In conclusion, this work supports the feasibility of using wearable sensor data and machine learning to monitor PD symptoms, especially when models are tuned to specific symptoms and incorporate multimodal inputs. Future research should focus on expanding datasets, improving class balance, and exploring hybrid modeling approaches that blend patient and clinical insights. With these advancements, such models could become valuable tools for real-time, individualized disease monitoring and management in Parkinson’s disease.

Author Contributions

Conceptualization, T.G. and S.S.; methodology, T.G.; software, T.G.; validation, T.G., O.S., A.Ć., K.G., K.K., M.B., R.A., D.K. and S.S.; resources, T.G., O.S., A.Ć., K.G., K.K., M.B., R.A., D.K. and S.S.; writing—original draft preparation, T.G.; writing—review and editing, T.G., O.S., A.Ć., K.G., K.K., M.B., R.A., D.K. and S.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was co-financed by Military University of Technology under research project UGB 531-000023-W500-22.

Institutional Review Board Statement

This study was carried out in accordance with the recommendations of the Bioethics Committee of Warsaw Medical University, with written informed consent from all subjects. All subjects gave written informed consent in accordance with the Declaration of Helsinki. The protocol was approved by the Bioethics Committee of Warsaw Medical University, approval code KB/285/2023 and date 11 December 2023.

Informed Consent Statement

Informed consent was obtained from all subjects involved in this study.

Data Availability Statement

The original contributions presented in the study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Table A1. All training results for models predicting symptom severities based on single sensor signals for single exercises.
Symptom | Split | Dataset | Model | R2 | r | MAE | bMAE | bMSE
bradykinesia | 10F | MYO-ACC-#1 | XG | 0.157 | 0.417 | 0.666 | 1.194 | 2.215
bradykinesia | LOO | MYO-ACC-#1 | XG | 0.146 | 0.399 | 0.673 | 1.219 | 2.245
bradykinesia | 10F | MYO-ACC-#2 | SVM | 0.222 | 0.477 | 0.630 | 1.170 | 2.075
bradykinesia | LOO | MYO-ACC-#2 | SVM | 0.204 | 0.458 | 0.637 | 1.213 | 2.268
bradykinesia | 10F | MYO-ACC-#3 | SVM | 0.344 | 0.594 | 0.572 | 1.114 | 1.989
bradykinesia | LOO | MYO-ACC-#3 | SVM | 0.317 | 0.565 | 0.593 | 1.003 | 1.461
bradykinesia | 10F | MYO-GYRO-#1 | RF | 0.200 | 0.450 | 0.648 | 1.209 | 2.240
bradykinesia | LOO | MYO-GYRO-#1 | SVM | 0.233 | 0.484 | 0.630 | 1.180 | 2.165
bradykinesia | 10F | MYO-GYRO-#2 | SVM | 0.197 | 0.447 | 0.648 | 1.179 | 2.072
bradykinesia | LOO | MYO-GYRO-#2 | SVM | 0.188 | 0.437 | 0.657 | 1.167 | 1.968
bradykinesia | 10F | MYO-GYRO-#3 | RF | 0.357 | 0.600 | 0.596 | 0.992 | 1.436
bradykinesia | LOO | MYO-GYRO-#3 | XG | 0.408 | 0.639 | 0.570 | 0.896 | 1.187
bradykinesia | 10F | Phone-ACC-#1 | SVM | 0.167 | 0.414 | 0.644 | 1.195 | 2.149
bradykinesia | LOO | Phone-ACC-#1 | SVM | 0.200 | 0.454 | 0.650 | 1.168 | 1.998
bradykinesia | 10F | Phone-ACC-#2 | RF | 0.220 | 0.475 | 0.651 | 1.161 | 2.004
bradykinesia | LOO | Phone-ACC-#2 | SVM | 0.161 | 0.418 | 0.641 | 1.229 | 2.407
bradykinesia | 10F | Phone-ACC-#3 | SVM | 0.400 | 0.638 | 0.547 | 0.956 | 1.402
bradykinesia | LOO | Phone-ACC-#3 | SVM | 0.360 | 0.603 | 0.585 | 0.978 | 1.402
bradykinesia | 10F | Phone-GYRO-#1 | SVM | 0.130 | 0.368 | 0.668 | 1.302 | 2.598
bradykinesia | LOO | Phone-GYRO-#1 | SVM | 0.094 | 0.330 | 0.674 | 1.341 | 2.761
bradykinesia | 10F | Phone-GYRO-#2 | RF | 0.215 | 0.478 | 0.645 | 1.196 | 2.140
bradykinesia | LOO | Phone-GYRO-#2 | SVM | 0.204 | 0.457 | 0.619 | 1.190 | 2.172
bradykinesia | 10F | Phone-GYRO-#3 | XG | 0.342 | 0.590 | 0.578 | 0.970 | 1.531
bradykinesia | LOO | Phone-GYRO-#3 | RF | 0.290 | 0.539 | 0.615 | 1.059 | 1.708
dyskinesia | 10F | MYO-ACC-#1 | XG | 0.248 | 0.516 | 0.275 | 1.399 | 3.219
dyskinesia | LOO | MYO-ACC-#1 | XG | 0.302 | 0.557 | 0.266 | 1.371 | 3.046
dyskinesia | 10F | MYO-ACC-#2 | SVM | 0.263 | 0.599 | 0.269 | 1.539 | 3.493
dyskinesia | LOO | MYO-ACC-#2 | XG | 0.335 | 0.580 | 0.281 | 1.109 | 1.810
dyskinesia | 10F | MYO-ACC-#3 | SVM | 0.259 | 0.542 | 0.260 | 1.615 | 4.305
dyskinesia | LOO | MYO-ACC-#3 | XG | 0.326 | 0.572 | 0.265 | 1.266 | 2.344
dyskinesia | 10F | MYO-GYRO-#1 | XG | 0.359 | 0.600 | 0.253 | 1.209 | 2.179
dyskinesia | LOO | MYO-GYRO-#1 | XG | 0.379 | 0.616 | 0.259 | 1.160 | 2.009
dyskinesia | 10F | MYO-GYRO-#2 | XG | 0.302 | 0.561 | 0.283 | 1.159 | 1.962
dyskinesia | LOO | MYO-GYRO-#2 | RF | 0.298 | 0.557 | 0.270 | 1.379 | 2.862
dyskinesia | 10F | MYO-GYRO-#3 | SVM | 0.145 | 0.432 | 0.294 | 1.724 | 4.493
dyskinesia | LOO | MYO-GYRO-#3 | SVM | 0.166 | 0.457 | 0.320 | 1.709 | 4.421
dyskinesia | 10F | Phone-ACC-#1 | RF | 0.343 | 0.586 | 0.235 | 1.283 | 2.487
dyskinesia | LOO | Phone-ACC-#1 | SVM | 0.355 | 0.641 | 0.240 | 1.379 | 2.837
dyskinesia | 10F | Phone-ACC-#2 | RF | 0.240 | 0.490 | 0.269 | 1.393 | 2.912
dyskinesia | LOO | Phone-ACC-#2 | XG | 0.277 | 0.530 | 0.254 | 1.354 | 2.894
dyskinesia | 10F | Phone-ACC-#3 | RF | 0.125 | 0.385 | 0.315 | 1.355 | 2.636
dyskinesia | LOO | Phone-ACC-#3 | XG | 0.084 | 0.367 | 0.317 | 1.439 | 3.091
dyskinesia | 10F | Phone-GYRO-#1 | RF | 0.385 | 0.622 | 0.228 | 1.238 | 2.353
dyskinesia | LOO | Phone-GYRO-#1 | SVM | 0.348 | 0.625 | 0.237 | 1.291 | 2.475
dyskinesia | 10F | Phone-GYRO-#2 | RF | 0.343 | 0.589 | 0.240 | 1.208 | 2.194
dyskinesia | LOO | Phone-GYRO-#2 | RF | 0.307 | 0.557 | 0.254 | 1.259 | 2.405
dyskinesia | 10F | Phone-GYRO-#3 | RF | 0.236 | 0.485 | 0.274 | 1.399 | 2.903
dyskinesia | LOO | Phone-GYRO-#3 | SVM | 0.191 | 0.471 | 0.284 | 1.585 | 3.655
stiffness | 10F | MYO-ACC-#1 | XG | 0.141 | 0.412 | 0.633 | 1.137 | 1.958
stiffness | LOO | MYO-ACC-#1 | XG | 0.127 | 0.383 | 0.633 | 1.156 | 2.015
stiffness | 10F | MYO-ACC-#2 | RF | 0.165 | 0.408 | 0.602 | 1.204 | 2.224
stiffness | LOO | MYO-ACC-#2 | RF | 0.152 | 0.399 | 0.614 | 1.206 | 2.113
stiffness | 10F | MYO-ACC-#3 | RF | 0.309 | 0.562 | 0.568 | 0.992 | 1.402
stiffness | LOO | MYO-ACC-#3 | XG | 0.360 | 0.600 | 0.543 | 0.869 | 1.073
stiffness | 10F | MYO-GYRO-#1 | XG | 0.132 | 0.400 | 0.621 | 1.168 | 2.051
stiffness | LOO | MYO-GYRO-#1 | SVM | 0.160 | 0.400 | 0.589 | 1.215 | 2.280
stiffness | 10F | MYO-GYRO-#2 | XG | 0.140 | 0.410 | 0.639 | 1.083 | 1.698
stiffness | LOO | MYO-GYRO-#2 | RF | 0.178 | 0.425 | 0.611 | 1.175 | 2.009
stiffness | 10F | MYO-GYRO-#3 | SVM | 0.261 | 0.515 | 0.565 | 1.046 | 1.623
stiffness | LOO | MYO-GYRO-#3 | XG | 0.303 | 0.557 | 0.558 | 0.862 | 1.075
stiffness | 10F | Phone-ACC-#1 | SVM | 0.180 | 0.424 | 0.592 | 1.096 | 1.712
stiffness | LOO | Phone-ACC-#1 | SVM | 0.143 | 0.381 | 0.600 | 1.167 | 1.969
stiffness | 10F | Phone-ACC-#2 | RF | 0.172 | 0.415 | 0.598 | 1.197 | 2.193
stiffness | LOO | Phone-ACC-#2 | RF | 0.180 | 0.424 | 0.593 | 1.179 | 2.120
stiffness | 10F | Phone-ACC-#3 | SVM | 0.272 | 0.529 | 0.569 | 1.046 | 1.591
stiffness | LOO | Phone-ACC-#3 | SVM | 0.250 | 0.512 | 0.572 | 1.049 | 1.574
stiffness | 10F | Phone-GYRO-#1 | XG | 0.172 | 0.439 | 0.604 | 1.034 | 1.575
stiffness | LOO | Phone-GYRO-#1 | SVM | 0.208 | 0.457 | 0.584 | 1.122 | 1.838
stiffness | 10F | Phone-GYRO-#2 | SVM | 0.232 | 0.486 | 0.593 | 1.123 | 1.832
stiffness | LOO | Phone-GYRO-#2 | SVM | 0.242 | 0.494 | 0.579 | 1.131 | 1.962
stiffness | 10F | Phone-GYRO-#3 | SVM | 0.291 | 0.541 | 0.562 | 1.036 | 1.612
stiffness | LOO | Phone-GYRO-#3 | RF | 0.285 | 0.536 | 0.575 | 0.980 | 1.356
tremor | 10F | MYO-ACC-#1 | RF | 0.523 | 0.723 | 0.570 | 0.770 | 0.875
tremor | LOO | MYO-ACC-#1 | SVM | 0.557 | 0.750 | 0.541 | 0.777 | 0.926
tremor | 10F | MYO-ACC-#2 | SVM | 0.535 | 0.732 | 0.544 | 0.800 | 1.017
tremor | LOO | MYO-ACC-#2 | SVM | 0.531 | 0.730 | 0.561 | 0.799 | 0.951
tremor | 10F | MYO-ACC-#3 | RF | 0.293 | 0.548 | 0.683 | 1.054 | 1.634
tremor | LOO | MYO-ACC-#3 | SVM | 0.353 | 0.606 | 0.647 | 0.986 | 1.421
tremor | 10F | MYO-GYRO-#1 | RF | 0.590 | 0.769 | 0.537 | 0.690 | 0.685
tremor | LOO | MYO-GYRO-#1 | RF | 0.596 | 0.773 | 0.523 | 0.698 | 0.720
tremor | 10F | MYO-GYRO-#2 | SVM | 0.524 | 0.726 | 0.565 | 0.821 | 0.996
tremor | LOO | MYO-GYRO-#2 | SVM | 0.544 | 0.739 | 0.552 | 0.785 | 0.912
tremor | 10F | MYO-GYRO-#3 | RF | 0.340 | 0.591 | 0.676 | 0.965 | 1.305
tremor | LOO | MYO-GYRO-#3 | RF | 0.311 | 0.565 | 0.690 | 1.026 | 1.484
tremor | 10F | Phone-ACC-#1 | RF | 0.595 | 0.772 | 0.537 | 0.675 | 0.652
tremor | LOO | Phone-ACC-#1 | XG | 0.616 | 0.786 | 0.514 | 0.652 | 0.642
tremor | 10F | Phone-ACC-#2 | SVM | 0.535 | 0.733 | 0.562 | 0.780 | 0.922
tremor | LOO | Phone-ACC-#2 | SVM | 0.528 | 0.728 | 0.562 | 0.805 | 0.983
tremor | 10F | Phone-ACC-#3 | SVM | 0.323 | 0.573 | 0.679 | 0.985 | 1.396
tremor | LOO | Phone-ACC-#3 | SVM | 0.359 | 0.607 | 0.659 | 0.945 | 1.314
tremor | 10F | Phone-GYRO-#1 | RF | 0.590 | 0.768 | 0.546 | 0.662 | 0.614
tremor | LOO | Phone-GYRO-#1 | RF | 0.566 | 0.752 | 0.553 | 0.696 | 0.673
tremor | 10F | Phone-GYRO-#2 | RF | 0.536 | 0.734 | 0.581 | 0.762 | 0.827
tremor | LOO | Phone-GYRO-#2 | SVM | 0.516 | 0.718 | 0.568 | 0.776 | 0.894
tremor | 10F | Phone-GYRO-#3 | SVM | 0.345 | 0.590 | 0.652 | 0.967 | 1.420
Table A2. All training results for models predicting symptom severities for single exercises.
Symptom | Split | Dataset | Model | R2 | r | MAE | bMAE | bMSE
bradykinesia | 10F | #1 | RF | 0.234 | 0.495 | 0.638 | 1.163 | 2.024
bradykinesia | LOO | #1 | SVM | 0.283 | 0.541 | 0.616 | 1.083 | 1.713
bradykinesia | 10F | #2 | SVM | 0.313 | 0.562 | 0.586 | 1.101 | 1.886
bradykinesia | LOO | #2 | SVM | 0.372 | 0.623 | 0.565 | 1.076 | 1.840
bradykinesia | 10F | #3 | RF | 0.422 | 0.654 | 0.556 | 0.956 | 1.412
bradykinesia | LOO | #3 | SVM | 0.392 | 0.631 | 0.562 | 0.977 | 1.437
dyskinesia | 10F | #1 | SVM | 0.433 | 0.690 | 0.251 | 1.175 | 2.024
dyskinesia | LOO | #1 | SVM | 0.477 | 0.722 | 0.245 | 1.182 | 2.178
dyskinesia | 10F | #2 | RF | 0.386 | 0.624 | 0.245 | 1.181 | 2.015
dyskinesia | LOO | #2 | SVM | 0.338 | 0.641 | 0.271 | 1.440 | 3.098
dyskinesia | 10F | #3 | SVM | 0.197 | 0.488 | 0.279 | 1.668 | 4.251
dyskinesia | LOO | #3 | SVM | 0.296 | 0.603 | 0.289 | 1.557 | 3.771
state (doctor) | 10F | #1 | SVM | 0.370 | 0.610 | 0.823 | 1.463 | 3.489
state (doctor) | LOO | #1 | SVM | 0.384 | 0.627 | 0.819 | 1.459 | 3.383
state (doctor) | 10F | #2 | RF | 0.333 | 0.580 | 0.820 | 1.546 | 3.808
state (doctor) | LOO | #2 | SVM | 0.329 | 0.580 | 0.825 | 1.562 | 3.822
state (doctor) | 10F | #3 | SVM | 0.343 | 0.592 | 0.740 | 1.586 | 3.932
state (doctor) | LOO | #3 | SVM | 0.361 | 0.615 | 0.727 | 1.600 | 3.990
state (patient) | 10F | #1 | SVM | 0.364 | 0.604 | 0.858 | 1.402 | 3.327
state (patient) | LOO | #1 | SVM | 0.383 | 0.623 | 0.859 | 1.385 | 3.241
state (patient) | 10F | #2 | RF | 0.313 | 0.566 | 0.903 | 1.510 | 3.524
state (patient) | LOO | #2 | SVM | 0.328 | 0.582 | 0.868 | 1.514 | 3.911
state (patient) | 10F | #3 | RF | 0.317 | 0.567 | 0.812 | 1.458 | 3.478
state (patient) | LOO | #3 | SVM | 0.361 | 0.607 | 0.771 | 1.498 | 3.646
stiffness | 10F | #1 | RF | 0.244 | 0.511 | 0.566 | 1.108 | 1.784
stiffness | LOO | #1 | RF | 0.212 | 0.469 | 0.575 | 1.125 | 1.831
stiffness | 10F | #2 | RF | 0.228 | 0.483 | 0.590 | 1.118 | 1.816
stiffness | LOO | #2 | SVM | 0.277 | 0.533 | 0.555 | 1.093 | 1.858
stiffness | 10F | #3 | SVM | 0.402 | 0.640 | 0.511 | 0.882 | 1.152
stiffness | LOO | #3 | SVM | 0.439 | 0.672 | 0.503 | 0.860 | 1.085
tremor | 10F | #1 | RF | 0.630 | 0.795 | 0.496 | 0.657 | 0.637
tremor | LOO | #1 | SVM | 0.674 | 0.822 | 0.464 | 0.646 | 0.645
tremor | 10F | #2 | SVM | 0.607 | 0.780 | 0.511 | 0.740 | 0.839
tremor | LOO | #2 | SVM | 0.665 | 0.818 | 0.468 | 0.683 | 0.730
tremor | 10F | #3 | SVM | 0.424 | 0.664 | 0.627 | 0.906 | 1.183
tremor | LOO | #3 | SVM | 0.467 | 0.700 | 0.595 | 0.885 | 1.168

References

  1. Kirmani, B.F.; Shapiro, L.A.; Shetty, A.K. Neurological and Neurodegenerative Disorders: Novel Concepts and Treatment. Aging Dis. 2021, 12, 950. [Google Scholar] [CrossRef]
  2. Sveinbjornsdottir, S. The Clinical Symptoms of Parkinson’s Disease. J. Neurochem. 2016, 139, 318–324. [Google Scholar] [CrossRef]
  3. Bloem, B.R.; Okun, M.S.; Klein, C. Parkinson’s Disease. Lancet 2021, 397, 2284–2303. [Google Scholar] [CrossRef]
  4. de Lau, L.M.; Breteler, M.M. Epidemiology of Parkinson’s Disease. Lancet Neurol. 2006, 5, 525–535. [Google Scholar] [CrossRef]
  5. Connolly, B.S.; Lang, A.E. Pharmacological Treatment of Parkinson Disease: A Review. JAMA 2014, 311, 1670–1683. [Google Scholar] [CrossRef]
  6. Giugni, J.C.; Okun, M.S. Treatment of Advanced Parkinson’s Disease. Curr. Opin. Neurol. 2014, 27, 450. [Google Scholar] [CrossRef]
  7. Lee, T.K.; Yankee, E.L. A Review on Parkinson’s Disease Treatment. Neurosciences 2021, 8, 222–244. [Google Scholar] [CrossRef]
  8. Yanase, J.; Triantaphyllou, E. A Systematic Survey of Computer-Aided Diagnosis in Medicine: Past and Present Developments. Expert. Syst. Appl. 2019, 138, 112821. [Google Scholar] [CrossRef]
  9. Shehab, M.; Abualigah, L.; Shambour, Q.; Abu-Hashem, M.A.; Shambour, M.K.Y.; Alsalibi, A.I.; Gandomi, A.H. Machine Learning in Medical Applications: A Review of State-of-the-Art Methods. Comput. Biol. Med. 2022, 145, 105458. [Google Scholar] [CrossRef] [PubMed]
  10. Oung, Q.W.; Hariharan, M.; Lee, H.L.; Basah, S.N.; Sarillee, M.; Lee, C.H. Wearable Multimodal Sensors for Evaluation of Patients with Parkinson Disease. In Proceedings of the 5th IEEE International Conference on Control System, Computing and Engineering, ICCSCE 2015, Penang, Malaysia, 27–29 November 2015; pp. 269–274. [Google Scholar] [CrossRef]
  11. Patel, S.; Lorincz, K.; Hughes, R.; Huggins, N.; Growdon, J.; Standaert, D.; Akay, M.; Dy, J.; Welsh, M.; Bonato, P. Monitoring Motor Fluctuations in Patients with Parkinsons Disease Using Wearable Sensors. IEEE Trans. Inf. Technol. Biomed. 2009, 13, 864–873. [Google Scholar] [CrossRef]
  12. Sieberts, S.K.; Schaff, J.; Duda, M.; Pataki, B.Á.; Sun, M.; Snyder, P.; Daneault, J.F.; Parisi, F.; Costante, G.; Rubin, U.; et al. Crowdsourcing Digital Health Measures to Predict Parkinson’s Disease Severity: The Parkinson’s Disease Digital Biomarker DREAM Challenge. npj Digit. Med. 2021, 4, 53. [Google Scholar] [CrossRef]
  13. Thomas, I.; Westin, J.; Alam, M.; Bergquist, F.; Nyholm, D.; Senek, M.; Memedi, M. A Treatment-Response Index from Wearable Sensors for Quantifying Parkinson’s Disease Motor States. IEEE J. Biomed. Health Inform. 2018, 22, 1341–1349. [Google Scholar] [CrossRef]
  14. Griffiths, R.I.; Kotschet, K.; Arfon, S.; Xu, Z.M.; Johnson, W.; Drago, J.; Evans, A.; Kempster, P.; Raghav, S.; Horne, M.K. Automated Assessment of Bradykinesia and Dyskinesia in Parkinson’s Disease. J. Parkinsons Dis. 2012, 2, 47–55. [Google Scholar] [CrossRef]
  15. Lin, F.; Wang, Z.; Li, Z.; Zhao, H.; Shi, X.; Liu, R.; Li, J.; Peng, D.; Ru, B. Fine-Grained Assessment of Upper-Limb Bradykinesia Through Multimodal Feature Enhancement and Deep Learning. IEEE Trans. Hum. Mach. Syst. 2025, 55, 508–518. [Google Scholar] [CrossRef]
  16. Rodriguez, F.; Krauss, P.; Kluckert, J.; Ryser, F.; Stieglitz, L.; Baumann, C.; Gassert, R.; Imbach, L.; Bichsel, O. Continuous and Unconstrained Tremor Monitoring in Parkinson’s Disease Using Supervised Machine Learning and Wearable Sensors. Parkinsons Dis. 2024, 2024, 5787563. [Google Scholar] [CrossRef]
  17. Gutowski, T. Deep Learning for Parkinson’s Disease Symptom Detection and Severity Evaluation Using Accelerometer Signal. In Proceedings of the European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN 2022), Bruges, Belgium, Online Event, 5–7 October 2022; pp. 271–276. [Google Scholar] [CrossRef]
  18. Visconti, P.; Gaetani, F.; Zappatore, G.A.; Primiceri, P. Technical Features and Functionalities of Myo Armband: An Overview on Related Literature and Advanced Applications of Myoelectric Armbands Mainly Focused on Arm Prostheses. Int. J. Smart Sens. Intell. Syst. 2018, 11, 1–25. [Google Scholar] [CrossRef]
  19. Gutowski, T. Optimization of Medicine Dosing in Parkinson’s Disease, Based on Signals from Sensor Measurements; Military University of Technology: Warsaw, Poland, 2024. [Google Scholar]
  20. Erer, K.S. Adaptive Usage of the Butterworth Digital Filter. J. Biomech. 2007, 40, 2934–2943. [Google Scholar] [CrossRef]
  21. Bazgir, O.; Frounchi, J.; Habibi, S.A.H.; Palma, L.; Pierleoni, P. A Neural Network System for Diagnosis and Assessment of Tremor in Parkinson Disease Patients. In Proceedings of the 2015 22nd Iranian Conference on Biomedical Engineering, ICBME 2015, Tehran, Iran, 25–27 November 2015; pp. 1–5. [Google Scholar]
  22. Sejdic, E.; Lowry, K.A.; Bellanca, J.; Redfern, M.S.; Brach, J.S. A Comprehensive Assessment of Gait Accelerometry Signals in Time, Frequency and Time-Frequency Domains. IEEE Trans. Neural Syst. Rehabil. Eng. 2014, 22, 603–612. [Google Scholar] [CrossRef]
  23. Tsipouras, M.G.; Tzallas, A.T.; Rigas, G.; Tsouli, S.; Fotiadis, D.I.; Konitsiotis, S. An Automated Methodology for Levodopa-Induced Dyskinesia: Assessment Based on Gyroscope and Accelerometer Signals. Artif. Intell. Med. 2012, 55, 127–135. [Google Scholar] [CrossRef]
  24. Alam, M.N.; Johnson, B.; Gendreau, J.; Tavakolian, K.; Combs, C.; Fazel-Rezai, R. Tremor Quantification of Parkinson’s Disease—A Pilot Study. In Proceedings of the IEEE International Conference on Electro Information Technology, IEEE Computer Society, Grand Forks, ND, USA, 19–21 August 2016; pp. 755–759. [Google Scholar]
  25. Eskofier, B.M.; Lee, S.I.; Daneault, J.F.; Golabchi, F.N.; Ferreira-Carvalho, G.; Vergara-Diaz, G.; Sapienza, S.; Costante, G.; Klucken, J.; Kautz, T.; et al. Recent Machine Learning Advancements in Sensor-Based Mobility Analysis: Deep Learning for Parkinson’s Disease Assessment. In Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society, EMBS 2016, Orlando, FL, USA, 16–20 August 2016; pp. 655–658. [Google Scholar] [CrossRef]
  26. San-Segundo, R.; Zhang, A.; Cebulla, A.; Panev, S.; Tabor, G.; Stebbins, K.; Massa, R.E.; Whitford, A.; de la Torre, F.; Hodgins, J. Parkinson’s Disease Tremor Detection in the Wild Using Wearable Accelerometers. Sensors 2020, 20, 5817. [Google Scholar] [CrossRef]
  27. Duhamel, P.; Vetterli, M. Fast Fourier Transforms: A Tutorial Review and a State of the Art. Signal Process 1990, 19, 259–299. [Google Scholar] [CrossRef]
  28. Delgado-Bonal, A.; Marshak, A. Approximate Entropy and Sample Entropy: A Comprehensive Tutorial. Entropy 2019, 21, 541. [Google Scholar] [CrossRef]
  29. Guyon, I.; Elisseeff, A. An Introduction to Variable and Feature Selection. J. Mach. Learn. Res. 2003, 3, 1157–1182. [Google Scholar] [CrossRef]
  30. Li, X.; Fu, Q.; Li, Q.; Ding, W.; Lin, F.; Zheng, Z. Multi-Objective Binary Grey Wolf Optimization for Feature Selection Based on Guided Mutation Strategy. Appl. Soft Comput. 2023, 145, 110558. [Google Scholar] [CrossRef]
  31. Bentéjac, C.; Csörgő, A.; Martínez-Muñoz, G. A Comparative Analysis of Gradient Boosting Algorithms. Artif. Intell. Rev. 2021, 54, 1937–1967. [Google Scholar] [CrossRef]
  32. Thakur, D.; Biswas, S. Permutation Importance Based Modified Guided Regularized Random Forest in Human Activity Recognition with Smartphone. Eng. Appl. Artif. Intell. 2024, 129, 107681. [Google Scholar] [CrossRef]
  33. Archana, T.; Sachin, D. Dimensionality Reduction and Classification through PCA and LDA. Int. J. Comput. Appl. 2015, 122, 4–8. [Google Scholar] [CrossRef]
  34. 1.13. Feature Selection—Scikit-Learn 1.5.0 Documentation. Available online: https://scikit-learn.org/stable/modules/feature_selection.html#selectfrommodel (accessed on 14 June 2024).
  35. Bishop, C.M. Pattern Recognition and Machine Learning (Information Science and Statistics); Springer New York, Inc.: Secaucus, NJ, USA, 2006; ISBN 0387310738. [Google Scholar]
  36. Naser, M.Z.; Alavi, A.H. Error Metrics and Performance Fitness Indicators for Artificial Intelligence and Machine Learning in Engineering and Sciences. Archit. Struct. Constr. 2021, 3, 499–517. [Google Scholar] [CrossRef]
  37. Wong, T.T. Performance Evaluation of Classification Algorithms by K-Fold and Leave-One-out Cross Validation. Pattern Recognit. 2015, 48, 2839–2846. [Google Scholar] [CrossRef]
  38. Goetz, C.G.; Tilley, B.C.; Shaftman, S.R.; Stebbins, G.T.; Fahn, S.; Martinez-Martin, P.; Poewe, W.; Sampaio, C.; Stern, M.B.; Dodel, R.; et al. Movement Disorder Society-Sponsored Revision of the Unified Parkinson’s Disease Rating Scale (MDS-UPDRS): Scale Presentation and Clinimetric Testing Results. Mov. Disord. 2008, 23, 2129–2170. [Google Scholar] [CrossRef]
  39. Westin, J.; Nyholm, D.; Pålhagen, S.; Willows, T.; Groth, T.; Dougherty, M.; Karlsson, M.O. A Pharmacokinetic-Pharmacodynamic Model for Duodenal Levodopa Infusion. Clin. Neuropharmacol. 2011, 34, 61–65. [Google Scholar] [CrossRef] [PubMed]
  40. Stacy, M.; Bowron, A.; Guttman, M.; Hauser, R.; Hughes, K.; Larsen, J.P.; Le Witt, P.; Oertel, W.; Quinn, N.; Sethi, K.; et al. Identification of Motor and Nonmotor Wearing-off in Parkinson’s Disease: Comparison of a Patient Questionnaire versus a Clinician Assessment. Mov. Disord. 2005, 20, 726–733. [Google Scholar] [CrossRef] [PubMed]
  41. Kikuya, A.; Tsukita, K.; Sawamura, M.; Yoshimura, K.; Takahashi, R. Distinct Clinical Implications of Patient- Versus Clinician-Rated Motor Symptoms in Parkinson’s Disease. Mov. Disord. 2024, 39, 1799–1808. [Google Scholar] [CrossRef]
  42. Pratt, J.W. Remarks on Zeros and Ties in the Wilcoxon Signed Rank Procedures. J. Am. Stat. Assoc. 1959, 54, 655–667. [Google Scholar] [CrossRef]
Figure 1. A chart presenting the preparation of the raw signal for conventional machine learning models.
Figure 2. A histogram presenting the distribution of symptom severities for the dataset.
Figure 3. Value range of the adjusted TRS scale.
Figure 4. The distribution of label values representing the patient state evaluated by the clinician (top) and by the patient (bottom).
Figure 5. Violin plots presenting regression results with class-specific MAE values for tremor (left) and bradykinesia (right) using best-performing models evaluated based on a single exercise sensor signal.
Figure 6. Violin plots presenting regression results with class-specific MAE values for muscle stiffness (left) and dyskinesia (right) using best-performing models evaluated based on a single exercise sensor signal.
Figure 7. Scatter plots presenting regression results for predicting the patient’s overall state regarding PD using best-performing models trained on the patient’s self-evaluation (left) and clinician’s evaluations (right).
Table 1. Characteristics of the dataset.
Characteristic | Value
Total number of patients | 241
Age (years) * | 62.0 (11.1)
Years since diagnosis * | 10.5 (6.10)
Patient sex | 98 female, 143 male
Examination count | 739
Examinations per patient * | 3.07 (2.77)
State according to clinician * | −1.64 (1.38)
State according to patient * | −1.66 (1.42)
Examinations with state assessment by the clinician | 700
Examinations with symptom assessment | 356
*—represented by mean (standard deviation).
Table 2. List of features extracted from inertial sensor signals.
Feature | Equation/Explanation

Time domain
Mean | $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ (2)
Standard deviation | $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$ (3)
Median | The middle value of the sorted signal samples.
Skewness | $S = \frac{n}{(n-1)(n-2)}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s}\right)^3$ (4)
Kurtosis | $K = \frac{n\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^4}{\left(\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2\right)^2} - 3$ (5)
Max | The maximum value in the signal.
Min | The minimum value in the signal.
Interquartile range | The difference between the 75th and 25th percentiles of the signal.
Approximate entropy | A measure of the regularity and unpredictability of fluctuations in a time series [28].
Sample entropy | A measure of the likelihood that similar sequences in time-series data remain similar over time [28].
Power | $P = \frac{1}{n}\sum_{i=1}^{n} x_i^2$ (6)
Absolute mean difference | $\left|\frac{2}{n}\sum_{i=1}^{n/2} x_i - \frac{2}{n}\sum_{i=n/2+1}^{n} x_i\right|$ (7)

Frequency domain
Max power | Maximum power found in the PSD.
Max power frequency | The frequency at which the maximum power occurs.
Spectral power | $P = \frac{1}{N}\sum_{i=0}^{N-1} X(f_i)^2$ (8)
Weighted mean power | $WMP = \frac{\sum_{i=0}^{N-1} X(f_i)^2 f_i}{\sum_{i=0}^{N-1} f_i}$ (9)
Kurtosis | $K = \frac{N\sum_{i=1}^{N}\left(X(f_i) - \overline{X(f)}\right)^4}{\left(\sum_{i=1}^{N}\left(X(f_i) - \overline{X(f)}\right)^2\right)^2} - 3$ (10)
Skewness | $S = \frac{N}{(N-1)(N-2)}\sum_{i=1}^{N}\left(\frac{X(f_i) - \overline{X(f)}}{s}\right)^3$ (11)
Interquartile range | Interquartile range of the PSD values.
Spectral centroid | $C = \frac{\sum_{i=0}^{N-1} f_i X(f_i)}{\sum_{i=0}^{N-1} X(f_i)}$ (12)

$n$—number of samples, $x_i$—the i-th sample, $N$—number of frequency bins, $f_i$—frequency of the i-th bin, $X(f_i)$—magnitude of the Fourier transform at the i-th bin.
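To make the feature definitions above concrete, the following is a minimal NumPy sketch of how a subset of the time- and frequency-domain features from Table 2 could be computed; the function names, the sampling rate, and the test signal are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def time_domain_features(x):
    """Illustrative subset of the time-domain features in Table 2."""
    n = len(x)
    mean = x.mean()                            # Eq. (2)
    std = x.std(ddof=1)                        # Eq. (3), sample standard deviation
    power = np.mean(x ** 2)                    # Eq. (6)
    # Eq. (7): absolute difference between the means of the two signal halves
    half = n // 2
    amd = abs(x[:half].mean() - x[half:].mean())
    return {"mean": mean, "std": std, "power": power, "abs_mean_diff": amd}

def frequency_domain_features(x, fs):
    """Illustrative subset of the frequency-domain features in Table 2."""
    X = np.abs(np.fft.rfft(x))                 # magnitude spectrum X(f_i)
    f = np.fft.rfftfreq(len(x), d=1.0 / fs)    # frequency f_i of each bin
    max_power = (X ** 2).max()
    max_power_freq = f[np.argmax(X ** 2)]      # frequency of the maximum power
    centroid = (f * X).sum() / X.sum()         # Eq. (12), spectral centroid
    return {"max_power": max_power,
            "max_power_freq": max_power_freq,
            "spectral_centroid": centroid}

# Example: a 5 Hz sine sampled at 50 Hz should peak at 5 Hz in the spectrum
fs = 50.0
t = np.arange(0, 4, 1 / fs)
x = np.sin(2 * np.pi * 5 * t)
print(time_domain_features(x))
print(frequency_domain_features(x, fs))
```

In practice such features would be computed per window, per axis, and per sensor (accelerometer and gyroscope of both the MYO armband and the smartphone), then concatenated into the feature vector used for training.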
Table 3. Features created from patient and examination metadata.
Name | Description | Source
affected side | The side of the body more affected by the disease | patient
handedness | The dominant hand of the patient | patient
groups | Group membership (disease, treatment method) | patient
diagnosis | Time from diagnosis to the examination | patient + exam
age | Age at the time of examination | patient + exam
Table 4. Training results for models predicting symptom severities based on single sensor signals for single exercises.
Symptom | Split | Dataset | Model | R² | r | MAE | bMAE | bMSE
bradykinesia | 10F | Phone-ACC-#3 | SVM | 0.400 | 0.638 | 0.547 | 0.956 | 1.402
bradykinesia | LOO | MYO-GYRO-#3 | XG | 0.408 | 0.639 | 0.570 | 0.896 | 1.187
dyskinesia | 10F | Phone-GYRO-#1 | RF | 0.385 | 0.622 | 0.228 | 1.238 | 2.353
dyskinesia | LOO | Phone-ACC-#1 | SVM | 0.355 | 0.641 | 0.240 | 1.379 | 2.837
stiffness | 10F | MYO-ACC-#3 | RF | 0.309 | 0.562 | 0.568 | 0.992 | 1.402
stiffness | LOO | MYO-ACC-#3 | XG | 0.360 | 0.600 | 0.543 | 0.869 | 1.073
tremor | 10F | Phone-ACC-#1 | RF | 0.595 | 0.772 | 0.537 | 0.675 | 0.652
tremor | LOO | Phone-ACC-#1 | XG | 0.616 | 0.786 | 0.514 | 0.652 | 0.642
GYRO—gyroscope, ACC—accelerometer.
Table 5. Training results for models predicting symptom severities for single exercises.
Symptom | Split | Dataset | Model | R² | r | MAE | bMAE | bMSE
bradykinesia | 10F | #3 | RF | 0.422 | 0.654 | 0.556 | 0.956 | 1.412
bradykinesia | LOO | #3 | SVM | 0.392 | 0.631 | 0.562 | 0.977 | 1.437
dyskinesia | 10F | #1 | SVM | 0.433 | 0.690 | 0.251 | 1.175 | 2.024
dyskinesia | LOO | #1 | SVM | 0.477 | 0.722 | 0.245 | 1.182 | 2.178
stiffness | 10F | #3 | SVM | 0.402 | 0.640 | 0.511 | 0.882 | 1.152
stiffness | LOO | #3 | SVM | 0.439 | 0.672 | 0.503 | 0.860 | 1.085
tremor | 10F | #1 | RF | 0.630 | 0.795 | 0.496 | 0.657 | 0.637
tremor | LOO | #1 | SVM | 0.674 | 0.822 | 0.464 | 0.646 | 0.645
Table 6. Training results for models predicting symptom severities based on all data.
Symptom | Split | Model | R² | r | MAE | bMAE | bMSE
bradykinesia | 10F | SVM | 0.632 | 0.826 | 0.438 | 0.722 | 0.762
bradykinesia | LOO | SVM | 0.629 | 0.827 | 0.435 | 0.721 | 0.765
dyskinesia | 10F | SVM | 0.585 | 0.802 | 0.238 | 0.979 | 1.442
dyskinesia | LOO | SVM | 0.567 | 0.790 | 0.245 | 0.983 | 1.432
stiffness | 10F | SVM | 0.604 | 0.817 | 0.420 | 0.750 | 0.842
stiffness | LOO | SVM | 0.617 | 0.822 | 0.410 | 0.738 | 0.834
tremor | 10F | SVM | 0.777 | 0.888 | 0.382 | 0.576 | 0.526
tremor | LOO | SVM | 0.780 | 0.887 | 0.378 | 0.561 | 0.498
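The tables above report each model under both a 10-fold (10F) and a leave-one-out (LOO) cross-validation split, with R², Pearson's r, and MAE as metrics. A minimal scikit-learn sketch of this evaluation protocol follows; the synthetic data, the plain `KFold`/`LeaveOneOut` splitters, and the default `SVR` hyperparameters are assumptions for illustration (the study's actual splits may, for instance, group examinations by patient).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the sensor feature matrix and severity labels
X, y = make_regression(n_samples=120, n_features=20, noise=10.0, random_state=0)

# SVM regressor (the best performer in Tables 6 and 8), with feature scaling
model = make_pipeline(StandardScaler(), SVR())

for name, cv in [("10F", KFold(n_splits=10, shuffle=True, random_state=0)),
                 ("LOO", LeaveOneOut())]:
    pred = cross_val_predict(model, X, y, cv=cv)  # out-of-fold predictions
    r2 = r2_score(y, pred)
    r = np.corrcoef(y, pred)[0, 1]                # Pearson correlation
    mae = mean_absolute_error(y, pred)
    print(f"{name}: R2={r2:.3f} r={r:.3f} MAE={mae:.3f}")
```

Balanced variants such as bMAE and bMSE would additionally average the per-class errors so that rare high-severity examples are not drowned out, which is why they exceed the plain MAE in the tables.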
Table 7. Most relevant sensor-based features for symptom severity prediction.
Symptom | Device, Sensor, Exercise | Hand | Axis | Parameter | Score
Tremor | Phone-ACC-#1 | Left | Z | Spectral centroid (0–25 Hz) | 0.0269
Tremor | MYO-GYRO-#1 | Right | Z | Weighted mean power (3–9 Hz) | 0.0261
Tremor | MYO-GYRO-#1 | Left | Z | Min of entropy (4 s window) | 0.0242
Tremor | MYO-GYRO-#3 | Left | X | Skewness of entropy (4 s window) | 0.0238
Tremor | Phone-ACC-#1 | Right | M | Kurtosis (3–9 Hz) | 0.0230
Bradykinesia | Phone-ACC-#3 | Right | M | Skewness of value range (4 s window) | 0.0406
Bradykinesia | Phone-ACC-#3 | Left | X | Mean of entropy (4 s window) | 0.0371
Bradykinesia | MYO-ACC-#1 | Right | X | Median | 0.0342
Bradykinesia | MYO-ACC-#3 | Left | Y | Max power (0–25 Hz) | 0.0335
Bradykinesia | Phone-ACC-#3 | Left | Z | Absolute mean difference | 0.0329
Dyskinesia | Phone-GYRO-#1 | Left | Z | Spectral power | 0.0694
Dyskinesia | MYO-GYRO-#3 | Right | Z | Interquartile range | 0.0615
Dyskinesia | MYO-ACC-#1 | Left | Y | Frequency of max power | 0.0458
Dyskinesia | Phone-GYRO-#1 | Left | Z | D1 | 0.0434
Dyskinesia | Phone-GYRO-#3 | Right | Y | Mean PSD | 0.0410
Stiffness | Phone-GYRO-#3 | Right | Z | Max of value range (4 s window) | 0.0685
Stiffness | Phone-ACC-#1 | Left | Y | Absolute mean difference | 0.0504
Stiffness | Phone-ACC-#3 | Left | Y | Mean PSD (9–14 Hz) | 0.0489
Stiffness | MYO-ACC-#3 | Left | Z | Skewness | 0.0482
Stiffness | Phone-GYRO-#1 | Left | Y | Min of entropy (4 s window) | 0.0447
Table 8. Training results for models predicting patient’s state.
State According to | Split | Dataset | Model | R² | r | MAE | bMAE | bMSE
clinician | 10F | All | SVM | 0.543 | 0.767 | 0.617 | 1.309 | 2.786
clinician | LOO | All | SVM | 0.534 | 0.758 | 0.629 | 1.309 | 2.765
clinician | 10F | MYO | SVM | 0.468 | 0.703 | 0.670 | 1.413 | 3.209
clinician | LOO | MYO | SVM | 0.471 | 0.704 | 0.675 | 1.397 | 3.122
clinician | 10F | Phone | SVM | 0.406 | 0.652 | 0.695 | 1.514 | 3.716
clinician | LOO | Phone | SVM | 0.409 | 0.653 | 0.697 | 1.505 | 3.693
patient | 10F | All | SVM | 0.610 | 0.816 | 0.603 | 1.144 | 2.232
patient | LOO | All | SVM | 0.608 | 0.812 | 0.610 | 1.131 | 2.155
patient | 10F | MYO | SVM | 0.454 | 0.689 | 0.723 | 1.311 | 2.907
patient | LOO | MYO | SVM | 0.452 | 0.687 | 0.724 | 1.306 | 2.878
patient | 10F | Phone | SVM | 0.436 | 0.674 | 0.729 | 1.378 | 3.166
patient | LOO | Phone | SVM | 0.396 | 0.639 | 0.760 | 1.433 | 3.384
Table 9. Most relevant features for overall state prediction.
According To | Device, Sensor, Exercise | Hand | Axis | Parameter | Score
clinician | Phone-ACC-#1 | Left | Z | Absolute mean difference | 0.0365
clinician | Phone-GYRO-#1 | Left | X | Interquartile range (0–3 Hz) | 0.0313
clinician | MYO-GYRO-#3 | Left | Z | Maximum | 0.0307
clinician | – | – | – | Time since diagnosis | 0.0285
clinician | Phone-GYRO-#1 | Right | X | Weighted mean power (0–3 Hz) | 0.0231
patient | MYO-ACC-#1 | Left | Y | Skewness (0–25 Hz) | 0.0329
patient | MYO-ACC-#3 | Left | – | Correlation (Y and Z) | 0.0262
patient | – | – | – | Time since diagnosis | 0.0231
patient | MYO-GYRO-#1 | Right | – | Correlation (X and Y) | 0.0222
patient | Phone-ACC-#1 | Left | Z | Spectral centroid (0–25 Hz) | 0.0217
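The reference list cites permutation importance [32] for feature relevance; assuming the Score columns in Tables 7 and 9 come from a permutation-style procedure, the sketch below shows how such rankings could be produced with scikit-learn. The synthetic data, feature names, and hyperparameters are hypothetical, and the study may have used a different scoring method.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Hypothetical feature names mimicking the "Device-Sensor-Exercise" convention
names = [f"feat_{i}" for i in range(10)]
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

model = RandomForestRegressor(random_state=1).fit(X_tr, y_tr)

# Shuffle each feature column and measure the drop in held-out score
result = permutation_importance(model, X_te, y_te, n_repeats=10,
                                random_state=1)

# Rank features by mean importance drop, analogous to the Score column
for i in np.argsort(result.importances_mean)[::-1][:5]:
    print(f"{names[i]}: {result.importances_mean[i]:.4f}")
```

Because the columns are permuted on held-out data, this ranking reflects what the fitted model actually relies on, rather than impurity-based importances that can overstate high-cardinality features.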

Citation: Gutowski, T.; Stodulska, O.; Ćwiklińska, A.; Gutowska, K.; Kopeć, K.; Betka, M.; Antkiewicz, R.; Koziorowski, D.; Szlufik, S. Machine Learning-Based Assessment of Parkinson’s Disease Symptoms Using Wearable and Smartphone Sensors. Sensors 2025, 25, 4924. https://doi.org/10.3390/s25164924