Predicting Wearing-Off of Parkinson’s Disease Patients Using a Wrist-Worn Fitness Tracker and a Smartphone: A Case Study

Parkinson’s disease (PD) patients experience varying symptoms related to their illness. Therefore, each patient needs a tailored treatment program from their doctors. One approach is the use of anti-PD medicines. However, a “wearing-off” phenomenon occurs when these medicines lose their effect. As a result, patients start to experience the symptoms again until their next medicine intake. In the long term, the duration of “wearing-off” begins to shorten. Thus, patients and doctors have to work together to manage PD symptoms effectively. This study aims to develop a prediction model that can determine the “wearing-off” of anti-PD medicine. We used fitness tracker data and self-reported symptoms from a smartphone application in a real-world environment. Two participants wore the fitness tracker for a month while reporting any symptoms using the Wearing-Off Questionnaire (WoQ-9) on a smartphone application. Then, we processed and combined the datasets for each participant’s models. Our analysis produced prediction models for each participant. The average balanced accuracy with the best hyperparameters was at 70.0–71.7% for participant 1 and 76.1–76.9% for participant 2, suggesting that our approach would be helpful to manage the “wearing-off” of anti-PD medicine, motor fluctuations of PD patients, and customized treatment for PD patients.


Introduction
Parkinson's disease (PD) patients experience difficulties in managing the symptoms of their illness. Each PD patient experiences symptoms differently and with varying severity. Doctors and medical practitioners customize the treatment of PD for every patient depending on their experienced symptoms. In order to do that, patients have to monitor their symptoms actively and report to their doctor. Then, doctors provide different doses of the anti-PD drug depending on the reported symptoms. The "wearing-off" phenomenon [1][2][3][4] occurs when the anti-PD medicine loses its effect. Patients start to experience the symptoms again until their next intake. In the long term, the duration of wearing-off begins to shorten. Thus, patients and doctors have to work together to manage PD symptoms effectively. Patients have to track their symptoms and the wearing-off period. Meanwhile, doctors need to have this information to give a customized approach to the patients' treatment.
Previous works have used different wearable data to detect PD symptoms. They mainly worked with accelerometer, electromyography (EMG), and gyroscope data to detect tremors, freezing of gait (FoG), and "wearing-off" periods [5][6][7]. Some of these previous works collected data in a closed and controlled environment. In contrast, other studies showed the feasibility of medium to large-scale data collection using different wearable sensors [8,9]. The Parkinson's KinetiGraph (PKG) has been used in some of these studies [9,10]. Statistical analysis [11] and machine learning-based analysis have been conducted on gait features; in one of these studies, machine learning algorithms were applied to wearable accelerometer-based sensor data to detect "on" and "off" states for PD patients [12]. In another study, machine learning classifiers were applied to PKG data to estimate the levodopa response from the Unified Parkinson's Disease Rating Scale Part III (UPDRS III) [10]. Smartwatches have also been employed to monitor motor fluctuations based on accelerometer data in a real-world environment [8,13]. Studies have usually employed either PKG or other medical-grade wearable sensors. The medium to large-scale data collection approach has yet to apply automated analysis using the machine learning approach to analyze and predict the wearing-off phenomenon. Data collection in a real-life environment has shown promise for further analysis and thus could help PD patients.
There is still untapped potential in the use of commercially available fitness trackers and smartwatches for PD management and for predicting the wearing-off phenomenon. Fitness trackers and smartwatches can now report a user's heart rate, sleep quality, stress, and even blood pressure, among many other data types. Although the reliability of fitness trackers and smartwatches is contestable [14], these data can be used to verify earlier studies. For example, the blood pressure and heart rate of PD patients have been investigated for the wearing-off phenomenon, and blood pressure change was statistically significant among PD patients experiencing wearing-off [15].
PD patients have also reported effects on their sleep patterns. The associated risk factors of rapid eye movement (REM) sleep behavior disorder (RBD) in PD patients were examined [16,17]. Aside from RBD as a form of sleep disturbance, light, fragmented sleep due to increased muscle activity, disruption of biological rhythms during sleep, breathing difficulties, insomnia, and excessive daytime sleepiness were sleep disturbances that manifested in PD patients [18,19]. A study with 3075 PD patients showed an increase of PD symptoms during sleep disturbances for 32% of the patients, while depressive moods were found for 20% of the patients. This study reported that each PD patient had varying degrees of symptoms and a differing psychological stress structure [20].
Moreover, in a recent study, PD patients were shown with clinical evidence to be highly sensitive to the effects of stress. The prevalence of stress-related symptoms in PD patients was 30% to 40% for depression and 25% to 30% for anxiety. Furthermore, stress worsened tremors, FoG, and dyskinesia. Thus, the authors suggested further investigation using wearable sensors [21].
As more people are adopting wearable technologies, this study explores the use of other wearable datasets to predict the "wearing-off" of PD patients. This research aims to develop a prediction model to determine the "wearing-off" of anti-PD drugs from fitness tracker data in a real-world environment. This study seeks to answer the following research questions:
1. How do we collect and combine a fitness tracker dataset and wearing-off dataset?
2. Can we develop a prediction model to determine the wearing-off of PD patients?
This paper is organized as follows: Section 2 describes the participants, the data collection tools, and the collected datasets. It also presents the data collection, data processing, and model development. Then, Section 3 presents the summary of collected and combined datasets. The performance and the best configuration of the prediction models are also reported. Next, Section 4 elaborates the results of our analysis and prediction models. Finally, Section 5 concludes this paper.

Materials and Methods
This section describes the data collection, data processing, and model development, as summarized in Figure 1.
Figure 1. An overview of the study, describing the process from data collection to model development.

Data and Data Collection Tools
This study used two data collection tools that were distributed to the participants. First, a fitness tracker collected the participants' heart rate, step count, stress score, and sleep data. Second, a smartphone application enabled the participants to record wearing-off periods, drug intake time, and other one-time basic information such as age and gender.

Garmin Vivosmart4 Fitness Tracker
The Garmin vivosmart4 fitness tracker was chosen for this study. The participants preferred the vivosmart4 over smartwatches because it was sleek, lightweight, and waterproof. It weighs 16.5 g up to 17.1 g, with dimensions of 15 × 10.5 × 197 mm (https://buy.garmin.com/en-US/US/p/605739, accessed on 1 May 2021 ). The Garmin vivosmart4 contains an optical photoplethysmography (PPG) sensor and accelerometer sensor. Using light emitted into the skin, PPG estimates heart rate by monitoring the changes in the intensity of the reflected light due to the contraction and swelling of the arteries and arterioles caused by pulsating blood pressure [22,23]. Then, the Garmin vivosmart4 uses heart rate variability to estimate stress levels [24,25]. The combination of the heart rate, heart rate variability, and accelerometer reading is used to estimate sleep stages [23,26]. Finally, the Garmin vivosmart4 uses the accelerometer sensor to assess the number of steps taken [27].
End-users can monitor their data using the Garmin Connect smartphone application. On the other hand, developers can access heart rate, sleep, steps, and stress level data using the Garmin Health Application Programming Interface (Garmin Health API) (https://developer.garmin.com/gc-developer-program/health-api/, accessed on 1 May 2021). Upon gaining access, the Garmin Health API expected a web application server to receive the data via HTTP POST. We deployed the web application server with Amazon Web Services (AWS), and the acquired data were stored on that server. The data flow from the fitness tracker to data storage is described in Figure 1, while the data available via the Garmin Health API are summarized in Table 1.
A smartphone application, called FonLog, was used as a data collection tool for human activity recognition in nursing services [30]. In this study, FonLog was customized specifically to collect wearing-off periods and drug intake time. To collect the data, we adapted the Wearing-Off Questionnaire (WoQ-9) using the Japanese translation [1,31]. The participants answered the remaining components of FonLog one time; these included age, gender, the Hoehn and Yahr Scale, and the Parkinson's Disease Questionnaire (PDQ-8). The Hoehn and Yahr scale was used to determine the patient's PD stage [32,33]. Then, the PDQ-8 identified the self-reported quality of life (QoL) among the PD patients [34]. Table 2 summarized all the data recorded using FonLog.
Two patients participated in this initial study with their informed consent. Both of our participants were female and were in their late 30s to early 40s. According to their Hoehn and Yahr Scale (H&Y) scores and the Japan Ministry of Health, Labor, and Welfare's classification of living dysfunction (JCLD), both participants were in a similar PD stage. However, their views on their QoL were different from each other. Table 3 summarized the demographics of the two participants.
The participants received levodopa therapy, taking different medicines such as levodopa and carbidopa, a dopamine agonist for D2 receptors, and a selective monoamine oxidase type B (MAO-B) inhibitor drug. Participant 1 was taking 400 mg levodopa and carbidopa hydrate per day while participant 2 was taking 600 mg to 800 mg per day. Both participants were taking 1 tablet of 8 mg dopamine agonist medicine to stimulate D2 receptors. Finally, participant 1 was taking 1 tablet of 2.5 mg MAO-B inhibitor while participant 2 was taking 0.5 mg per day.

Data Collection Method
The two PD patients received the data collection tools needed for our experiment. The data collection lasted for 30 days, with an additional one-day setup for the PD patients.
Each participant created their own Garmin Connect account to start the Garmin vivosmart4 fitness tracker's data collection process. Next, the participants signed into our web application server using their Garmin Connect account. At this point, they had control over what information they would share with our web application server while reading the privacy policy. After all the steps, the Garmin Health API automatically transmitted available data to our AWS Web Application Server.
Throughout the 30-day data collection period, the participants were encouraged to always wear the Garmin vivosmart4, even during sleep. However, there were still cases when they forgot to wear it, such as after taking a shower or when doing household chores. In the worst case, they could not wear it due to some PD symptoms. These conditions were accepted to capture data that were as close as possible to the real-world scenario.
Using FonLog, the participants answered questions regarding (1) basic information, (2) the Hoehn and Yahr Scale, and (3) the PDQ-8 once. Then, they reported any of the following nine symptoms: tremors, slowing down of movement, change in mood or depression, rigidity of muscles, sharp pain or prolonged dull pain, impairment of complex movements of the hand and fingers, difficulty integrating thoughts or slowing down of thought, anxiety or panic attacks, and muscle spasms [1]. However, this study simplified the wearing-off label to either wearing-off (1) or not (0). Any recorded symptoms were considered to represent a wearing-off label.
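The label simplification described above can be sketched in a few lines of Python. This is only an illustration: the symptom keys below are hypothetical names chosen for readability and are not FonLog's actual field names.

```python
# Hypothetical keys for the nine WoQ-9 symptoms; FonLog's real field names may differ.
WOQ9_SYMPTOMS = (
    "tremors", "slowness_of_movement", "mood_change", "rigidity", "pain",
    "reduced_dexterity", "slowed_thinking", "anxiety", "muscle_spasms",
)

def wearing_off_label(report):
    """Return 1 if any WoQ-9 symptom is reported in `report`, otherwise 0."""
    return int(any(report.get(symptom, False) for symptom in WOQ9_SYMPTOMS))

print(wearing_off_label({"tremors": True, "anxiety": False}))  # 1
print(wearing_off_label({"tremors": False}))                   # 0
```

Collapsing the nine symptoms into one binary target discards symptom-specific information, which is the trade-off the study accepts for this initial model.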
The participants recorded the onset of any symptoms as accurately as possible. From the home screen presented in Figure 2a, the participants could click on the activity on the left side; i.e., activity enclosed in the red box. The home screen showed every recording on the right side. However, if a recording was difficult for the participants due to their symptoms, they could retroactively record data with their best estimation of the time. They could correct and review the onset of the symptoms, as shown in the top part of Figure 2b.
Figure 2. The customized FonLog smartphone application. (a) The home screen shows each questionnaire: WoQ-9 step 1 for symptoms onset, WoQ-9 step 2 for drug intake time, basic information, H&Y, and PDQ-8, as discussed in Section 2.1.3. (b) A sample FonLog form presents the WoQ-9 step 1, asking if each symptom described in Section 2.1.5 has been experienced.

Data Processing
This subsection describes how we processed and combined the fitness tracker dataset and the wearing-off dataset.
First, the sleep dataset was aggregated according to each calendar date and the sleep classification, as listed in Table 1. The original sleep data (Table 4) were transformed into aggregated sleep data (Table 5). Furthermore, additional sleep features were calculated. The total non-REM sleep duration was the sum of the deep and light sleep durations (Equation (1)). Next, the total sleep duration was the sum of the total non-REM sleep duration and the REM sleep duration (Equation (2)). Then, the percentage of non-REM sleep was the ratio of the non-REM sleep duration to the total sleep duration (Equation (3)) [35]. Finally, we estimated the sleep efficiency as the ratio of the total sleep duration to the sum of the total sleep and total waking duration (Equation (4)) [36].
$$\text{Total non-REM duration} = \text{Deep sleep duration} + \text{Light sleep duration} \tag{1}$$
$$\text{Total sleep duration} = \text{Total non-REM duration} + \text{REM sleep duration} \tag{2}$$
$$\text{Non-REM percentage} = \frac{\text{Total non-REM duration}}{\text{Total sleep duration}} \tag{3}$$
$$\text{Sleep efficiency} = \frac{\text{Total sleep duration}}{\text{Total sleep duration} + \text{Total awake duration}} \tag{4}$$
Second, the missing values were filled with "−1" before re-sampling. This was in accordance with how Garmin Health API reported stress scores when they could not estimate the stress level [28]. Replacing missing values with "−1" would also indicate that the participant was not wearing the fitness tracker.
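The derived sleep features can be computed directly from the aggregated durations. The following is a minimal sketch, assuming durations are given in minutes; the function name, the dictionary keys, and the zero-division guard are illustrative choices that the text does not specify.

```python
def sleep_features(deep, light, rem, awake):
    """Derive the sleep features of Equations (1)-(4) from durations in minutes."""
    non_rem_total = deep + light                # Equation (1)
    total_sleep = non_rem_total + rem           # Equation (2)
    time_in_bed = total_sleep + awake
    # Returning 0.0 when no sleep was recorded is an assumption of this sketch.
    non_rem_pct = non_rem_total / total_sleep if total_sleep else 0.0   # Equation (3)
    efficiency = total_sleep / time_in_bed if time_in_bed else 0.0      # Equation (4)
    return {
        "non_rem_total": non_rem_total,
        "total_sleep": total_sleep,
        "non_rem_percentage": non_rem_pct,
        "sleep_efficiency": efficiency,
    }

# Hypothetical night: 60 min deep, 240 min light, 100 min REM, 20 min awake.
print(sleep_features(deep=60, light=240, rem=100, awake=20))
```

For this hypothetical night, the non-REM total is 300 min, the total sleep is 400 min, the non-REM percentage is 0.75, and the sleep efficiency is 400/420.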
Third, the fitness tracker datasets were re-sampled due to their varying granularity. We chose 15 s and 15 min intervals to represent the minimum and maximum granularity in the fitness tracker datasets, respectively. The resulting missing values due to re-sampling were filled by copying the last available data. This was the chosen fill method to replicate the streaming behavior of the fitness tracker dataset, as the future values would not be available.
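The forward-fill behavior described above can be illustrated with a small, dependency-free sketch. It assumes timestamps are integer seconds and records are sorted by time; the function name and the use of "−1" before the first observation are assumptions of this example, and the study's own implementation may differ.

```python
def resample_ffill(records, start, end, step, missing=-1):
    """Place (timestamp, value) records on a regular grid, copying the last known value.

    Only past observations are used, which mimics streaming data where
    future values are not yet available.
    """
    resampled = []
    last_value = missing   # value used before the first observation (assumption)
    i = 0
    for t in range(start, end, step):
        # Consume every record observed at or before grid time t.
        while i < len(records) and records[i][0] <= t:
            last_value = records[i][1]
            i += 1
        resampled.append((t, last_value))
    return resampled

# Hypothetical heart rate readings at 0 s and 45 s, placed on a 15 s grid.
heart_rate = [(0, 70), (45, 72)]
print(resample_ffill(heart_rate, start=0, end=75, step=15))
# [(0, 70), (15, 70), (30, 70), (45, 72), (60, 72)]
```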
Finally, the wearing-off dataset was cleaned by removing overlapping wearing-off data. Then, the wearing-off and the drug intake datasets were combined with the fitness tracker dataset. Each record was checked if it fell within the timestamp. If so, a value of "1" was assigned for the "y" or wearing-off variable; otherwise, a value of "0" was set. Similarly, if each drug intake record fell within the timestamp, the time that elapsed from the start of each drug intake record was computed. The time that elapsed for the succeeding drug intake record was set back to "0".
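One way to read the combination step is sketched below: each grid timestamp receives a wearing-off label and the time elapsed since the most recent drug intake, which resets to "0" at every new intake. This is an interpretation rather than the study's code; the function name, the inclusive period bounds, and the "−1" placeholder before the first intake are assumptions.

```python
def label_and_elapsed(timestamps, wearing_off_periods, drug_intakes):
    """Return (timestamp, y, elapsed) rows for a regular grid of timestamps in seconds.

    wearing_off_periods: list of (start, end) tuples in seconds.
    drug_intakes: list of intake times in seconds.
    """
    rows = []
    for t in timestamps:
        # y = 1 when the timestamp falls inside any reported wearing-off period.
        y = int(any(start <= t <= end for start, end in wearing_off_periods))
        previous = [intake for intake in drug_intakes if intake <= t]
        # Elapsed time restarts from 0 at each intake; -1 marks "no intake yet" (assumption).
        elapsed = t - max(previous) if previous else -1
        rows.append((t, y, elapsed))
    return rows

# Hypothetical 15 min grid with one wearing-off period and one drug intake.
rows = label_and_elapsed([0, 900, 1800, 2700], [(800, 1900)], [900])
print(rows)
# [(0, 0, -1), (900, 1, 0), (1800, 1, 900), (2700, 0, 1800)]
```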
The raw available datasets from Garmin and the FonLog smartphone application and the combined dataset are provided in the Supplementary Materials.

Model Development
The prediction models for wearing-off were developed using the 15 s and 15 min datasets for each participant. For this initial case study, we built individual-level models because each participant experienced PD differently. Thus, we wanted to discover and optimize a prediction model for each individual to understand their PD situation fully.
In this subsection, the features and the different prediction models are discussed in detail (1) to understand how each model estimates the wearing-off phenomenon, and (2) to interpret how they use different features in the prediction. We also explain how we trained, evaluated, and optimized the models using the nested cross-validation approach. We used the Python programming language with PhotonAI [37] and Scikit-Learn [38] as the main libraries.
The following features extracted from the processed dataset were used to develop the prediction model:
• $x_1$: Heart rate (HR);
• $x_2$: Step count (Steps);
• $x_3$: Stress score (Stress);
• $x_4$: Awake duration during the estimated sleep period (Awake);
In this study, five machine learning algorithms were applied to develop the prediction models. These algorithms were Logistic Regression (LR), Linear Support Vector Machine (L. SVM), Decision Tree (DT), Random Forest (RF), and Gradient Boosting (GB) classifiers. We used these machine learning models to interpret the importance of each feature to the wearing-off prediction.

Logistic Regression (LR)
In this study, the Logistic Regression model $p(X)$ was used to approximate the probability of wearing-off using the different data from the fitness tracker and drug intake data. During the training of a Logistic Regression model, it estimates the weights for each feature in relation to wearing-off. A larger weight for a feature shows that it contributes more to the probability of wearing-off than smaller weights. Mathematically, the Logistic Regression model is a sigmoid function of $f(X)$ such that
$$p(X) = \frac{1}{1 + e^{-f(X)}}, \qquad f(X) = b + w_1 x_1 + \cdots + w_{12} x_{12} \tag{5}$$
where $x_1 \ldots x_{12}$ are the defined features, $b$ refers to a bias, and $w_1 \ldots w_{12}$ are the weights for each feature.
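As a small illustration of the sigmoid form described above, the following sketch computes the wearing-off probability from a weighted sum of features. The function name and the example weights are hypothetical; fitted weights would come from training.

```python
import math

def wearing_off_probability(x, w, b):
    """Sigmoid of the linear score f(X) = b + sum_i w_i * x_i."""
    f = b + sum(w_i * x_i for w_i, x_i in zip(w, x))
    return 1.0 / (1.0 + math.exp(-f))

# Hypothetical two-feature example whose linear score is exactly zero.
print(wearing_off_probability(x=[1.0, 2.0], w=[0.5, -0.25], b=0.0))  # 0.5
```

A positive weight pushes the probability above 0.5 as its feature grows, which is why the weights can be read as a rough indication of each feature's contribution.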

Linear Support Vector Machine (L. SVM)
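The Linear Support Vector Machine's key task is to find a hyperplane $w \in \mathbb{R}^p$ which has the largest distance (by $\min_{w,b} \frac{1}{2} w^T w$) to the nearest training data $x_i \in \mathbb{R}^p$. It is defined as
$$\min_{w,b} \; \frac{1}{2} w^T w + C \sum_{i=1}^{n} \max\left(0,\; 1 - y_i\left(w^T \phi(x_i) + b\right)\right) \tag{6}$$
where $y \in \{1, -1\}^n$ is the target, $\phi(x_i)$ is an identity function, $b$ refers to a bias, and $C$ refers to the strength of the penalty [38][39][40].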
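To make the roles of the margin term and the penalty $C$ concrete, the sketch below evaluates a linear SVM objective for given parameters. It assumes the standard hinge-loss formulation with an identity feature map, as documented for scikit-learn's linear SVM; the function name and the toy data are hypothetical.

```python
def linear_svm_objective(w, b, X, y, C=1.0):
    """Evaluate 0.5 * w^T w + C * sum of hinge losses, with labels in {1, -1}."""
    margin_term = 0.5 * sum(w_j * w_j for w_j in w)
    hinge_total = 0.0
    for x_i, y_i in zip(X, y):
        score = sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b
        # Samples classified correctly with margin >= 1 contribute no loss.
        hinge_total += max(0.0, 1.0 - y_i * score)
    return margin_term + C * hinge_total

# Toy data: the third sample lies inside the margin and adds a loss of 0.5.
X = [[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]]
y = [1, -1, 1]
print(linear_svm_objective(w=[1.0, 0.0], b=0.0, X=X, y=y, C=1.0))  # 1.0
```

A larger $C$ weights margin violations more heavily relative to the margin width.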

Decision Tree (DT)
The Decision Tree model finds the best split of data multiple times based on thresholds for each feature. Each subset of the data is assigned to a node of the tree with specific feature thresholds. Equation (7) encapsulates the Decision Tree model [41].
$$\hat{y} = \hat{f}(x) = \sum_{m=1}^{M} c_m I\{x \in R_m\} \tag{7}$$
where $R_m$ is the $m$-th of the $M$ subsets of the data, $I\{x \in R_m\}$ is an indicator function which returns "1" if $x$ is in subset $R_m$ and "0" otherwise, and $\hat{y}$ is the predicted value, which is equal to the average $c_m$ of all training instances in the subset that contains $x$.
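The Decision Tree prediction can be read as a sum over leaf regions in which exactly one indicator is non-zero. The sketch below illustrates this with a single feature; the threshold of 180 min and the leaf averages are invented for illustration and are not taken from the study's fitted models.

```python
def tree_predict(x, regions):
    """Sum c_m * I{x in R_m} over all leaf regions.

    regions: list of (membership_test, c_m) pairs, where membership_test(x)
    returns True when x falls inside the region R_m.
    """
    return sum(c_m * int(in_region(x)) for in_region, c_m in regions)

# Hypothetical tree with one split on "minutes since drug intake".
regions = [
    (lambda minutes: minutes < 180, 0.1),    # R_1: recently medicated
    (lambda minutes: minutes >= 180, 0.8),   # R_2: long since last intake
]
print(tree_predict(240, regions))  # 0.8
print(tree_predict(60, regions))   # 0.1
```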

Random Forest (RF)
Built using Decision Trees, the Random Forest model is an ensemble of smaller Decision Trees that produces a predictive model [38]. A Random Forest classification model follows these steps [42]:
3. The tree is grown to the largest extent;
4. Steps 1 to 3 are repeated to build bootstrapped Decision Trees;
5. The different Decision Trees are combined using majority voting, $\hat{f} = \mathrm{mode}(\hat{f}_1 \ldots \hat{f}_B)$.
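The final majority-voting step can be sketched as follows. The per-tree predictions are hypothetical, and the tie-breaking rule (the label encountered first wins) is a property of this sketch rather than something the text specifies.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Combine per-tree label lists into one label per sample using the mode.

    tree_predictions: list with one list of predicted labels per tree,
    all of the same length.
    """
    voted = []
    for votes in zip(*tree_predictions):
        # most_common(1) returns the most frequent label; ties go to the first seen.
        voted.append(Counter(votes).most_common(1)[0][0])
    return voted

# Three hypothetical trees voting on three samples.
predictions = [
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
]
print(majority_vote(predictions))  # [1, 0, 1]
```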

Gradient Boosting (GB)
Unlike the Random Forest model, in which a majority vote occurs among Decision Trees, the Gradient Boosting model is an additive model that proceeds in a forward stagewise fashion. The goal is to improve a weak classifier that starts from initialization. Then, the negative gradient is calculated and the error is minimized. The next step size is determined until the final model is achieved [38,40,43,44].

Analysis Pipeline
This section describes the nested cross-validation (nested CV) approach for model selection, hyperparameter optimization, and performance evaluation using PhotonAI. PhotonAI is a high-level machine learning library for designing and optimizing machine learning models using pipelines [37]. PhotonAI wraps around other machine learning libraries such as Scikit-Learn for general machine learning tools [38], Scikit-Optimize for hyperparameter optimization [45], and Imbalanced-Learn for handling class imbalance [46].

Nested CV Approach
Furthermore, PhotonAI allows the implementation of the nested CV approach, which was used for small to medium datasets. The dataset was divided into training, validation, and test splits in this approach, depending on the outer and the inner fold data split technique. The outer fold handled the model evaluation while the inner fold chose the best model with the optimized hyperparameters. Over-fitting would be avoided because the hyperparameter optimization was only exposed to the subset of each outer fold. The expected output of a nested CV approach was a set of optimized best models [37,42,47–49]. The model development in Figure 1 showed the nested CV approach.

Stratified Validation Technique
A non-shuffled stratified validation technique was used to divide the dataset to preserve the time-based dependency and the wearing-off distribution percentage of the dataset.
Similar to k-fold cross-validation, the dataset was divided by k folds while maintaining the distribution of each class. The non-shuffled stratified validation split was used in this study due to the imbalanced nature of wearing-off reports. The dataset was divided into 5 outer folds, where each fold had approximately 24 days of training data and 6 days of test data. Then, each outer fold's training data were divided into 3 inner folds, with approximately 16 days of training data and 8 days of validation data. The depiction of the outer fold and inner fold of the model development in Figure 1 illustrates the stratified validation technique.
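The fold structure can be illustrated with a dependency-free sketch of a non-shuffled stratified split nested inside another. This is not PhotonAI's or Scikit-Learn's implementation: it simply cuts each class's time-ordered samples into k contiguous chunks, and the toy labels (one sample per "day", five positives) are hypothetical.

```python
def stratified_folds(labels, k):
    """Yield (train_idx, test_idx) for k non-shuffled, stratified folds."""
    test_sets = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        for fold in range(k):
            # Contiguous chunk of this class, so time order is preserved.
            lo = fold * len(idx) // k
            hi = (fold + 1) * len(idx) // k
            test_sets[fold].extend(idx[lo:hi])
    for fold in range(k):
        test_idx = sorted(test_sets[fold])
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        yield train_idx, test_idx

# Toy dataset: 30 time-ordered samples, every sixth one is a wearing-off label.
labels = ([0] * 5 + [1]) * 5

for outer_train, outer_test in stratified_folds(labels, 5):      # model evaluation
    inner_labels = [labels[i] for i in outer_train]
    for inner_train, inner_val in stratified_folds(inner_labels, 3):  # model selection
        # Inner indices are positions within outer_train, never within outer_test,
        # so hyperparameter search does not see the held-out test fold.
        assert len(inner_train) + len(inner_val) == len(outer_train)
    print(len(outer_train), len(outer_test))
```

With 30 samples, each outer fold holds 24 training and 6 test samples, mirroring the 24-day and 6-day split described above.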

Hyperparameter Optimization Using Scikit-Optimize
The hyperparameter optimization was conducted using the Scikit-Optimize Python library within the PhotonAI library. Given a hyperparameter search space, the balanced accuracy metric in our study was maximized. Then, the next hyperparameter was suggested until 30 configurations were generated and tested. This optimization search occurred in the nested CV's inner fold.

Pipelines for Model Selection and Hyperparameter Optimization
With PhotonAI, we compared different learning algorithms while each learning algorithm was optimized (initial pipeline). Next, the recursive feature elimination (RFE) algorithm was added for the feature selection pipeline (FS). The goal of RFE was to choose the features by repeatedly removing the least important feature for each iteration until the desired number of features was reached [38,41]. Then, a class imbalance pipeline (CI) handled the few wearing-off labels in our dataset by adding ImbalanceDataTransformer. This transformer was part of the Imbalanced-Learn [46] library in PhotonAI. The CI pipeline chose the best over and under-sampling technique that improved the model's performance. Finally, we combined FS and CI elements into one pipeline (CI + FS) for comparison with the previous pipelines. Table 6 summarizes the hyperparameter search space used in these pipelines.
Table 6. Hyperparameter search space for each learning algorithm and other pipeline elements specific to each pipeline.
In this study, the balanced accuracy was the chosen primary metric to handle the prediction model's bias towards the more frequent class in our dataset ("on" or "0"). The balanced accuracy was also chosen so that both a high true positive rate (or sensitivity, Sn) and a high true negative rate (or specificity, Sp) were achieved. The balanced accuracy was defined as the arithmetic mean of the sensitivity and specificity, as defined in Equations (8) and (9) [38,50].

$$\text{Bal. Acc.} = \frac{Sn + Sp}{2} \tag{8}$$
$$Sn = \frac{TP}{TP + FN}, \qquad Sp = \frac{TN}{TN + FP} \tag{9}$$
where TP is the true positive value, FN is the false negative value, FP is the false positive value, and TN is the true negative value, according to the confusion matrix in Table 7.
In this study, the precision and F1-score were also calculated, as shown in Equations (10) and (11). The precision and F1-score were considered because we wanted to avoid an "on" state being predicted as wearing-off. Using the balanced accuracy and F1-score gave further importance to false negative and false positive errors in our prediction model.
$$\text{Precision} = \frac{TP}{TP + FP} \tag{10}$$
$$\text{F1-score} = \frac{2 \times \text{Precision} \times Sn}{\text{Precision} + Sn} \tag{11}$$
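The metrics used in this study follow directly from the four confusion matrix counts. The sketch below uses the standard definitions; the counts in the example are hypothetical, and the function assumes that every denominator is non-zero.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute sensitivity, specificity, balanced accuracy, precision, and F1-score."""
    sensitivity = tp / (tp + fn)          # true positive rate (Sn)
    specificity = tn / (tn + fp)          # true negative rate (Sp)
    balanced_accuracy = (sensitivity + specificity) / 2
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": balanced_accuracy,
        "precision": precision,
        "f1": f1,
    }

# Hypothetical imbalanced result: 40 wearing-off records and 160 "on" records.
print(classification_metrics(tp=30, fn=10, fp=20, tn=140))
```

In this hypothetical case the plain accuracy would be 170/200 = 0.85, while the balanced accuracy is 0.8125 because the minority class is detected less reliably.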

Wearing-Off Data Collection
Fitness tracker data and wearing-off data were collected for 30 days from 23 February until 24 March 2021. The participants recorded their wearing-off periods using a customized FonLog application. Participant 1 collected a total of 227 wearing-off periods, while participant 2 reported 44 wearing-off periods. The average duration of wearing-off for participant 1 was 87.432 min (σ = 77.943 min); on the other hand, participant 2 had an average duration of wearing-off of 25.295 min (σ = 34.512 min). Over the 30-day collection period, participant 1 reported at least one wearing-off every day. Meanwhile, participant 2 reported more sparsely than participant 1, with no reports at all in the third week. Figures 3 and 4 present the distribution of wearing-off data for both participants.

Fitness Tracker Data Collection
Each participant was given a Garmin vivosmart4 fitness tracker. We asked the participants to wear the fitness tracker during the 30-day collection period. The collected data are summarized in Table 8. Both participants missed at least one day of wearing the fitness tracker over 30 days. Still, participant 1 wore the fitness tracker for longer than participant 2, as shown by participant 2's lower number of records (n) in Table 8. Moreover, participant 1 had a higher average heart rate than participant 2 over 30 days, as shown in Figure 5. Figure 6 shows the distribution of 15 min records for the number of steps. Both participants' step records were positively skewed, with most 15 min records containing fewer than 100 steps.
In terms of the reported stress score from the Garmin Health API, participant 1 had an average stress score (n = 14,400 of 3 min records) of 29.67 ± 32.51 over the 30-day collection period, or a low-stress classification based on Garmin's stress level classification [28]. On the other hand, participant 2 had an average stress score (n = 13,135 of 3 min records) of 11.23 ± 20.75 over the 30-day collection period, or a resting state classification. Figure 7 shows the stress scores for each participant.

Difference between the Participants' PD Experience
We also investigated our initial hypothesis that each patient would experience PD differently using a two-sample t-test with unequal variances, and effect sizes were reported using Cohen's d. There were significant differences between the participants' heart rates (t(300,765) = 1.96, p < 0.001, d = 0.995), stress scores (t(17,043) = 1.96, p < 0.001, d = 0.688), and numbers of steps (t(1606) = 1.96, p < 0.001, d = 0.486), where all three p values were below 0.001. However, there was no significant difference between the participants' daily amount of sleep, t(52) = 2.01, p = 0.23, d = 0.328.
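For readers who wish to reproduce this kind of comparison, the sketch below computes a two-sample t statistic with unequal variances (Welch's test), its approximate degrees of freedom, and Cohen's d. The pooled-standard-deviation form of Cohen's d is an assumption, since the text does not state which variant was used, and the sample values are hypothetical. The p value is omitted because it requires the t distribution, which is not in the Python standard library.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)   # squared standard error of each sample mean
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

def cohens_d(a, b):
    """Effect size using the pooled standard deviation (assumed variant)."""
    na, nb = len(a), len(b)
    pooled = math.sqrt(((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2))
    return (mean(a) - mean(b)) / pooled

# Hypothetical measurements for two participants.
a = [2, 4, 6, 8]
b = [1, 2, 3, 4]
t, df = welch_t(a, b)
print(round(t, 3), round(df, 3), round(cohens_d(a, b), 3))
```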

Combined Datasets
The fitness tracker datasets, the wearing-off dataset, and the drug intake dataset were combined according to the method explained in Section 2.2. To recap, we prepared 15 s and 15 min datasets based on the 15 s heart rate interval and 15 min step interval, respectively. Then, these 15 s and 15 min datasets were used for each participant in our prediction model.

Wearing-Off Prediction Model
A series of pipelines for developing the wearing-off prediction model was applied to each participant's 15 s and 15 min datasets. We observed a class imbalance for the wearing-off y variable of participant 2 in both time frames. On the other hand, participant 1's dataset did not have a substantial class imbalance. Figure 9 presents the disparity in the classes for each participant.

Participant 1 Prediction Model
For participant 1, we analyzed the performance of each pipeline on the validation set. The CI pipeline produced the best balanced accuracy of 74.7% for the 15 min time frame and 73.5% for the 15 s time frame, as shown in Table 9.
Table 9. Best hyperparameter configuration performance on the validation set for participant 1. The CI pipeline was the best pipeline for both time frames, as shown in bold.
Next, we used the CI pipeline to compare the performance of each learning algorithm using the validation set. In the 15 min time frame, the GB algorithm produced the best balanced accuracy of 74.3% and an F1-score of 74.0% over other algorithms. The next best learning algorithm was DT, with a 73.9% balanced accuracy and a 74.4% F1-score. Similarly, the GB algorithm had the best balanced accuracy of 72.7% and an F1-score of 69.7% on the 15 s time frame. However, the L. SVM had the second best performance, with 71.4% balanced accuracy and an F1-score of 63.3%. Table 10 shows the performance of the other learning algorithms.
Thus, we evaluated the GB, DT, and L. SVM algorithms using the CI pipeline for model evaluation. In the end, the CI pipeline with the GB learning algorithm had the best balanced accuracy and F1-score in both time frames. The prediction model had an average balanced accuracy of 71.7% ± 0.035 for the 15 min time frame and 70.0% ± 0.032 for the 15 s time frame, as shown in Table 11. In addition, the permutation feature importance for the prediction models was computed to identify which features contributed to the prediction model. The permutation feature importance measured the prediction model error after a feature was permuted [41]. DrugElapsed, Stress, and HR were the top features in both time frames for participant 1 (Table 12).
Table 11. Average performance of best models on the test set using the CI pipeline for participant 1. The GB learning algorithm produced the best results in both intervals, as shown in bold.
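The permutation feature importance mentioned above can be sketched without any library: each feature column is shuffled in turn, and the resulting drop in the score is recorded. The function names, the accuracy score, the number of repeats, and the toy model that only looks at the first feature are all assumptions of this illustration.

```python
import random

def accuracy(y_true, y_pred):
    """Fraction of matching labels."""
    return sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)

def permutation_importance(predict, X, y, score, n_repeats=10, seed=0):
    """Average drop in score after shuffling each feature column of X (list of lists)."""
    rng = random.Random(seed)
    baseline = score(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only column j permuted.
            permuted = [row[:j] + [value] + row[j + 1:] for row, value in zip(X, column)]
            drops.append(baseline - score(y, [predict(row) for row in permuted]))
        importances.append(sum(drops) / n_repeats)
    return importances

# Toy model that predicts from feature 0 only, so feature 1 cannot matter.
X = [[0, 5], [1, 3], [0, 8], [1, 1], [0, 2], [1, 9]]
y = [0, 1, 0, 1, 0, 1]
print(permutation_importance(lambda row: row[0], X, y, accuracy))
```

In this toy case the importance of the second feature is exactly zero, because permuting it never changes a prediction.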

Participant 2 Prediction Model
The same process was conducted for participant 2's prediction model. First, we compared the performance of each pipeline on the validation set in both time frames. The initial pipeline produced the best balanced accuracy in both time frames. However, upon further testing with the pipelines, and due to the lack of wearing-off reports from participant 2, we opted to use the CI pipeline to handle the class imbalance. The 15 min time frame had a balanced accuracy of 75.6%, while the 15 s time frame had a balanced accuracy of 74.8%. The other pipelines' performances on the validation set are presented in Table 13. Second, the learning algorithms were compared using the CI pipeline for both time frames. For the 15 min time frame, the LR learning algorithm had the best balanced accuracy of 73.4% on the validation set. For the 15 s time frame, on the other hand, GB had the best balanced accuracy of 70.7%. In both time frames, the second best algorithm was L. SVM. The results of the other learning algorithms, as well as other metrics, are shown in Table 14. Third, the LR, GB, and L. SVM algorithms were evaluated on the test sets using the CI pipeline. In the 15 min time frame, LR had an average balanced accuracy of 76.9% ± 0.176. Meanwhile, GB had an average balanced accuracy of 76.1% ± 0.120 in the 15 s time frame (Table 15). The evaluated prediction models for participant 2 showed only DrugElapsed as a common top feature, while the importance of the other features varied, as shown in Table 16.
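To make the reported metrics concrete, the following sketch computes the balanced accuracy and f1-score from binary labels and summarizes scores over several test splits. It is an illustration only: the labels and split scores are invented, the function names are our own, and we assume (the text does not state it) that the "±" values denote a standard deviation across splits.

```python
from statistics import mean, stdev

def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, with 1 = wearing-off."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return tp, tn, fp, fn

def balanced_accuracy(y_true, y_pred):
    """Average of sensitivity and specificity (assumes both classes occur)."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2

def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class."""
    tp, _, fp, fn = confusion_counts(y_true, y_pred)
    return 2 * tp / (2 * tp + fp + fn)

# Invented imbalanced example: 8 normal records and 2 wearing-off records.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 7 + [1] + [1, 0]

ba = balanced_accuracy(y_true, y_pred)  # sensitivity 0.5, specificity 0.875
f1 = f1_score(y_true, y_pred)

# Invented balanced accuracies from three test splits, summarized as mean and std.
split_scores = [0.70, 0.75, 0.80]
avg, sd = mean(split_scores), stdev(split_scores)
```

In this example, the plain accuracy would be 80% even though only half of the wearing-off records are detected, which is why a balanced metric is more informative when wearing-off reports are scarce, as for participant 2.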

Discussion
This study aimed to develop an individual-level prediction model for the wearing-off of anti-PD medicine using a readily available fitness tracker in a real-world environment. Our first goal was to collect and combine the related datasets needed for this study. Second, prediction models were built to predict the wearing-off period. Lastly, we analyzed the produced results to assess our primary goal.
In response to our first research question, there were two main challenges in data collection and processing. First, the datasets from the different tools were collected at different time intervals, and there were some significant gaps without available data. Before resampling, missing values were filled with "−1", based on Garmin's data handling protocol for stress scores. After resampling, missing values were filled with the last known values. For future studies, it would be more convenient to collect data at a uniform interval than to resample it. Smartwatches could also be beneficial for reading and recording raw sensor data without sacrificing the comfort of the participants. Second, there were few wearing-off reports for participant 2. The sparsity and the low number of reports affected the performance of the prediction models: participant 1 had sufficient reports, while participant 2 did not.
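The gap handling described above can be sketched as follows. This is a simplified illustration, not the study's processing code: the function name, the timestamps in seconds, and the sample values are hypothetical. Each point on a uniform grid takes the last known value, and grid points that precede any observation keep the "−1" sentinel.

```python
MISSING = -1  # sentinel for "no data", following the convention mentioned in the text

def resample_ffill(samples, start, end, step):
    """Resample irregular (timestamp, value) pairs onto a uniform grid.

    samples must be sorted by timestamp. Each grid point in [start, end)
    takes the most recent value at or before it, or MISSING if there is none.
    """
    out = []
    last = MISSING
    i = 0
    for t in range(start, end, step):
        # Consume every sample recorded at or before this grid point.
        while i < len(samples) and samples[i][0] <= t:
            last = samples[i][1]
            i += 1
        out.append((t, last))
    return out

# Hypothetical heart-rate samples at irregular times (seconds).
samples = [(15, 70), (30, 72), (75, 80)]
grid = resample_ffill(samples, start=0, end=90, step=15)
```

With a 15 s step, the grid values are −1, 70, 72, 72, 72, and 80: the two grid points between the second and third samples repeat the last known value. Long runs of repeated values of this kind are one way in which missing data can influence a model trained on the finer time frame.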
In this study, we have shown the feasibility of using a fitness tracker to predict the wearing-off phenomenon. Based on the feature importance from the models, the elapsed time since the last drug intake affected both participants' models regardless of the sampling interval. For participant 1, the stress score and heart rate were also indicators of the wearing-off phenomenon.
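As an illustration of the elapsed-time feature, the sketch below derives the time since the most recent drug intake for each record. The exact definition of DrugElapsed used in the study is not restated here, so the units (minutes), the function name, and the use of `None` before the first reported intake are assumptions.

```python
from bisect import bisect_right

def drug_elapsed(record_times, intake_times):
    """Minutes since the most recent intake at or before each record time.

    Returns None for records that precede the first reported intake.
    """
    intake_times = sorted(intake_times)
    elapsed = []
    for t in record_times:
        k = bisect_right(intake_times, t)  # number of intakes at or before t
        elapsed.append(t - intake_times[k - 1] if k else None)
    return elapsed

# Hypothetical day: intakes at 08:00 and 12:00, expressed as minutes since midnight.
intakes = [480, 720]
records = [450, 480, 495, 700, 735]
elapsed = drug_elapsed(records, intakes)
```

For these made-up times, the result is `[None, 0, 15, 220, 15]`: the value resets to zero at each intake and grows until the next one, which matches the intuition that wearing-off becomes more likely as this value increases.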
For both participants, the 15 min time frame produced better results than the 15 s time frame. The difference in the time frame affected the prediction model due to the presence of large amounts of missing data. In the future, the sampling interval can be optimized since the differences between the two time frames were substantial.
In response to the second research question, our case study showed the possibility of utilizing the fitness tracker dataset to estimate wearing-off. The individual-level prediction models were built to personalize the estimation based on the participants' experiences of PD. A previous study highlighted that each PD patient had varying degrees of symptoms [20]. In this study's analysis, the participants' heart rate, number of steps, and stress score were also shown to have significant differences. Thus, this study was able to produce prediction models with a balanced accuracy of 70.0–71.7% for participant 1 and 76.1–76.9% for participant 2 using a non-motor dataset, in contrast with accelerometer and gyroscope datasets, which are based on motor functions (Table 17). The prediction models could also be improved by reducing false positive and false negative predictions, as shown in Tables 18 and 19. It is possible that the prediction model detected a wearing-off event that the participant had not reported. We learned of this possibility from our short interview with the participants. Participant 2 often experienced mild to severe symptoms, which resulted in fewer self-reports. Moreover, the participants forgot to put their fitness tracker back on after doing household chores, and their symptoms then began to manifest again. These events were backed up by the known information about the participants, such as fitness tracker usage (Table 8), PDQ-8, and H&Y profile (Table 3). In future work, this information will be helpful before data collection and would improve the study overall.

Conclusions
This study showed the development of prediction models to predict wearing-off for PD patients. A commercially available fitness tracker (Garmin vivosmart4) was used to collect data in a real-world environment. Moreover, the participants recorded their wearing-off periods and drug intake with a customized smartphone application (FonLog). Then, the datasets were combined to build our prediction models. Predictive models were built for each participant, as there were significant differences between the participants' data, except for one feature. Developing predictive models for each participant in a challenging data collection environment resulted in a balanced accuracy of 70.0–71.7% for participant 1 and 76.1–76.9% for participant 2 on the test set. Both models utilized a class imbalance transformer in their model development pipeline. Finally, only the time that elapsed after taking anti-PD medicine was a common predictor between the participants. Heart rate and stress score contributed to the prediction model of participant 1, who presented more reports of wearing-off over the 30 days.
In the future, more participants will be part of the study to assess the differences among demographic groups. Moreover, these baseline models will be deployed to an early warning system for the participants. The reported feedback will be used to improve the prediction models and develop a symptom-based prediction model. We envision the utilization of these baseline models while collecting more data using a smartwatch.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/10.3390/app11167354/s1: "data/garmin" contains the available raw Garmin dataset from the Garmin Health API, "data/fonlog/records.xlsx" contains the raw FonLog dataset, "data/combined_data/combined_data_participant*.xlsx" contains the processed and combined dataset for each participant in the different re-sampled time intervals, "Data Processing.ipynb" contains the data processing source code, "Model Development" contains the source code for the model pipelines, and "About Supplementary Files" contains the guide to the Supplementary Materials.

Acknowledgments:
The authors would like to thank the Japanese Ministry of Education, Culture, Sports, Science, and Technology (MEXT) and the participants of our study.

Conflicts of Interest:
The authors declare no conflict of interest.
Sample Availability: Samples of the data and the analysis can be obtained from the corresponding author.

Abbreviations
The following abbreviations are used in this manuscript:
WoQ-9: Wearing-off Questionnaire-9