Prediction of the Levodopa Challenge Test in Parkinson’s Disease Using Data from a Wrist-Worn Sensor

The response to levodopa (LR) is important for managing Parkinson’s Disease and is measured with clinical scales prior to (OFF) and after (ON) levodopa. The aim of this study was to ascertain whether an ambulatory wearable device could predict the LR from the response to the first morning dose. The ON and OFF scores were sorted into six categories of severity so that separating Parkinson’s Kinetigraph (PKG) features corresponding to the ON and OFF scores became a multi-class classification problem according to whether they fell below or above the threshold for each class. Candidate features were extracted from the PKG data and matched to the class labels. Several linear and non-linear candidate statistical models were examined and compared to classify the six categories of severity. The resulting model predicted a clinically significant LR with an area under the receiver operator curve of 0.92. This study shows that ambulatory data could be used to identify a clinically significant response to levodopa. This study has also identified practical steps that would enhance the reliability of this test in future studies.


Introduction
Parkinson's Disease (PD) is a progressive neurodegenerative disorder that affects the frontal lobe, the brainstem and the autonomic nervous system. Impaired dopamine transmission is a defining feature [...]
Table notes: φ Unified Parkinson's Disease Rating Scale Part III. § Expressed as a percentage of the total levodopa equivalent dose (LED); the LED in mg is calculated according to Tomlinson et al. [12]. χ Absolute difference between "OFF" and "ON" UPDRS III. * Difference between "OFF" and "ON" UPDRS III expressed as a percentage of "OFF" UPDRS III.
Controls were 191 subjects aged 60 years and over, without a known neurological disorder and who had worn a PKG logger. Subjects whose bradykinesia scores (as measured by the PKG) were >90th percentile for controls were removed. Thus, there were 174 control subjects. The reason control subjects were included was to provide a large cohort of people who would not be bradykinetic in the early morning and whose bradykinesia score would not change after a nominal "dose time" (DT). DT was 07:00 for building the model: this was chosen to ensure that levels of sleep and inactivity would be similar to the first dose in the morning of PwP. For estimating the delta of the LDCT, DT was 10:00 and chosen explicitly to avoid sleep and inactivity. The UPDRS III was deemed to be "0" at DT.

The PKG System
The PKG system consists of a wrist-worn data logger (the PKG logger), a series of algorithms that produce data points every two minutes and a series of graphs and scores that synthesize this data into a clinically useful format known as the PKG. The PKG system [10] was the first system to receive clearance from the United States Food and Drug Administration to measure bradykinesia, and also has indications for measuring dyskinesia, tremor, and sleep.
The logger is a smart watch that is worn on the most affected wrist (see http://www.globalkineticscorporation.com.au/ for details). It weighs 35 g and contains a rechargeable battery and a 3-axis accelerometer set to record 11-bit digital measurements of acceleration with a range of ±4 g and a sampling rate of 50 samples per second, using a digital micro-controller and data storage on flash memory. The device is water resistant. The logger has been designed and approved for easy cleaning and reuse. After 6 days of recording, data are uploaded to the cloud for application of the algorithms. The algorithms were built using an expert system approach to model neurologists' recognition of bradykinesia and dyskinesia on accelerometry data. Inputs to the expert system included Mean Spectral Power (MSP) within bands of acceleration between 0.2 and 4 Hz, peak acceleration, and the amount of time within these epochs that there was no movement. These inputs were weighted to model neurologists' rating of bradykinesia and dyskinesia and to produce a bradykinesia score (BKS) and dyskinesia score (DKS) every 2 min.
The PKG is the graphical representation of the output of the algorithms from data collected every 2 min over an extended period, typically 6 days. The PKG plots the BKS and DKS against the time of day. The times at which medications were due and consumed are also shown, making it possible to assess whether there were dose-related variations in BKS or DKS and how the median value at any time of day compared with a normal subject. Other scores are also provided, including compliance with the reminders, tremor scores [13], and times when the logger was not worn. The system also provides scores of day sleep and inactivity [14]. The numerical output relevant to this study is summarized further in the Glossary that follows. The reader is referred to the company's web site for further details (http://www.globalkineticscorporation.com.au/) and to other publications [10,11,13-20]. Specific scores used in this paper are described in further detail below.

Glossary of PKG Terms
• Bradykinesia Score (BKS). The score for bradykinesia produced by the PKG algorithm for a 2-min epoch. The algorithm's inputs are weighted to model neurologists' rating of bradykinesia. BKS are produced every 2 min while the logger is being worn.
• Epoch. In this paper, "epoch" will refer to a 2-min period in which PKG scores are calculated.
• Tremor Amplitude (TA): This is the 75th percentile of the magnitude of the tremulous peak identified from the spectrogram from each epoch. Epochs without tremor are scored zero.
• OFF Wrist: The logger has a capacitive sensor to detect whether it is being worn. An epoch marked as "OFF wrist" was unavailable for analysis.
• Dose reminder: The logger can be programmed to vibrate for 10 s at the time when levodopa should be taken, as a reminder to the wearer to take medications.
• Dose acknowledgement. Following a reminder, the smart screen on the logger can be "swiped" by moving the finger slowly across the screen to indicate when the medication was consumed.
There are some terms that specifically apply to the first dose of the day (Figure 1) and relate to the presumption in this study that the level of motor function at the time of this first dose would be similar to the UPDRSIII OFF and the response to this first dose would be similar to the UPDRSIII ON.

• Dose Time (DT). This is the 5 epochs (10 min) centered on the acknowledgement of the first reminder of the day, shown as a red diamond in Figure 1. It may be a few minutes after the reminder was delivered (as in Figure 1).
• Effect Time (ET): This is the time when levodopa had its peak effect. It is calculated for the PKG as the peak of the Smoothed Weekly BKS Time series from 46 min to 90 min after the acknowledgement of the dose reminder.
• Active epochs: Epochs whose linearly weighted moving median of BKS ≤ 40.
• Inactive epochs: Epochs whose linearly weighted moving median of BKS > 40 were attributed to inactivity or sleep and excluded [14].
• Reminder Time: The PKG logger is programmed to vibrate at specific times to remind subjects to take their medications. Subjects can acknowledge when they actually consume the medications by "swiping" the smart screen on the watch. The first acknowledgement of the reminder is shown as a red diamond in Figure 1.
The 5 epochs (10 min) centered on DT and ET were used to calculate further features of motor function from the PKG (see Section 2.7. Feature Engineering and Selection for more detailed discussion). These features are:
• Thirty-minute window moving percentiles of BKS. These will be referred to as, for instance, BKS_M50P for the 30-min window moving 50th percentile of BKS.
• Thirty-minute window weighted moving percentiles of BKS. Similarly, 30-min weighted moving percentiles of the BKS were also calculated. The weighting was linear, being maximal at the center of the window and declining linearly and symmetrically on each side to zero, which provides some indication of the influence of shorter window sizes. These will be referred to as, for instance, BKS_WM75P for the 30-min window weighted moving 75th percentile of BKS.
• Thirty-minute window moving 50th percentile of Tremor Amplitude. This is calculated in a similar manner to the 30-min moving percentiles of BKS and will be referred to as TA_M50P.
• Thirty-minute window weighted moving 50th percentile of Tremor Amplitude. This was calculated in a similar manner to the 30-min window weighted moving percentiles of BKS and will be referred to as TA_WM50P.
• Logarithm of 1 + 30-min window weighted moving 50th percentile of Tremor Amplitude: While the BKS is linearly correlated with UPDRS III [10] and scales logarithmically with respect to the acceleration of movement, the Tremor Score requires translating into the same scaling by defining Log (1 + Tremor Score). The "1" is added to obtain a finite non-negative feature. For example, the Log10(1 + 30-min window weighted moving 50th percentile of Tremor Amplitude) will be referred to as TA_WM50P_Log.
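The weighted moving percentile and log features described above can be illustrated with a minimal sketch. It is not the PKG implementation: the 15-epoch window (30 min of 2-min epochs), the exact shape of the triangular weights, the truncated windows at the edges and the weighted-percentile rule are all assumptions made for illustration, and every function name is hypothetical.

```python
import math

WINDOW_EPOCHS = 15  # assumed: a 30-min window of 2-min epochs, centred on the epoch


def triangular_weights(n):
    """Linear weights, maximal at the centre and declining symmetrically.

    Assumption: the weight reaches zero just outside the window, so the
    outermost epochs still carry a small positive weight.
    """
    half = (n + 1) / 2
    centre = (n - 1) / 2
    return [half - abs(i - centre) for i in range(n)]


def weighted_percentile(values, weights, pct):
    """Smallest value whose cumulative weight share reaches pct percent."""
    pairs = sorted(zip(values, weights))
    target = sum(w for _, w in pairs) * pct / 100.0
    running = 0.0
    for value, weight in pairs:
        running += weight
        if running >= target:
            return value
    return pairs[-1][0]


def weighted_moving_percentile(series, pct, window=WINDOW_EPOCHS):
    """Centred weighted moving percentile; edge epochs use a truncated window."""
    half = window // 2
    weights = triangular_weights(window)
    out = []
    for i in range(len(series)):
        lo, hi = max(0, i - half), min(len(series), i + half + 1)
        # Slice the weights so they stay aligned with the truncated window.
        w = weights[lo - i + half: hi - i + half]
        out.append(weighted_percentile(series[lo:hi], w, pct))
    return out


def log_feature(x):
    """Log10(1 + x), finite and non-negative for x >= 0 (cf. TA_WM50P_Log)."""
    return math.log10(1.0 + x)
```

Under these assumptions, a feature such as TA_WM50P_Log would be obtained by applying `log_feature` to each value of `weighted_moving_percentile(tremor_amplitude, 50)`.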

The Unified Parkinson's Disease Rating Scale Part III
The UPDRS Part III is a formalized version of the clinical examination with scores (usually 0-4) for various severities of motor function. The scores of the items are summed to give a total score. There are two versions of the UPDRS III: the older is known simply as the UPDRS III and the newer version, which is owned by The Movement Disorder Society, is termed the MDS-UPDRS III. Six items were added to the scale in the MDS-UPDRS III, so that the maximum possible total score increased from 108 to 132 points. Unfortunately, both versions are frequently used in practice. Formulae for allowing the older and newer total scores to be used have been proposed [21], but we have used a method that requires the addition of 7 points to the older scale to produce a robust correction [22].
In this study, no subjects had missing values in the UPDRS III. Two clinics used the older UPDRS III scale and one used the MDS-UPDRS III scale. As only the total scores were used, the older scales were adjusted by the addition of 7 points and UPDRS III is used from here to refer to the adjusted score. It is important to note that in the UPDRS III, scores for tremor and bradykinesia contribute ~30% and ~40% (respectively), whereas the BKS of the PKG only relates to bradykinesia, with a separate score for tremor.

The LDCT
As described earlier, the LDCT is performed by administering a dose of levodopa in the "OFF" state which is then compared with the "ON" state using the UPDRS III. As already noted, there is considerable variation in the performance of this test, which was also apparent in the procedures used by the three clinics participating in this study.
• OFF state assessment. All clinics assessed the "OFF" state at least 12 h after the PwP's last PD medication. No clinic ceased D2 agonist for more than 12 h.
• Loading dose. One clinic gave a supramaximal dose of 50% more than their usual first dose of levodopa. One clinic gave 20% more than the usual morning dose in dispersible form. The third reviewed the "ON" state on their usual dose.
• ON state assessment. Two clinics assessed the "ON" state following the levodopa dose: one from 45 min on and the other at about 50-60 min. The third clinic made the assessment on another day to when the "OFF" state was assessed.

The PKG Representation of the LDCT
The PKG records the first dose on six consecutive mornings and the presumption in this study was that the level of motor function at the time of this first dose would be similar to the UPDRSIII OFF. Similarly, the response to this first dose should be similar to the UPDRSIII ON. As "ON" and "OFF" have specific clinical meanings, in this study the terminology of motor function at DT (or ET) is used to signify motor function measured by the PKG (see Figure 1). In modelling the LDCT using the PKG in this way we recognize the following issues:
• The first levodopa dose of the morning may be substantially less than the dose of levodopa used in the LDCT. In fact, the average levodopa in the first dose was 164 mg (±79 SD) and only 25% of PwP took 200 mg or more at the first morning dose (Figure 2).
• D2 agonists were taken by 68% of PwP, in which case it constituted a median of 15% (8%-30% inter-quartile range) of the total levodopa equivalent dose. The size of the levodopa response measured using the UPDRS III was not smaller in those on D2 agonists, even in those where it constituted more than 25% of the levodopa equivalent dose (Figure 2 and Table 1).
• Some subjects may have remained in bed or resting during the dose time, so that all epochs that were inactive epochs (see definition above) or OFF wrist were excluded, and Table 1 shows the number of cases whose dose time was available for modeling UPDRSIII OFF.
• Similar exclusions occurred on examination of the ET, and Table 2 shows the number of cases whose dose time was available for modeling UPDRSIII ON.
• In some subjects, especially those with impaired gastric emptying, there may be variability in the time to peak effect and in the amplitude of the peak effect. Thus, a single day captured by the LDCT may not represent the summary of the 6 days from the PKG.
Sensors 2019, 19
Table note: § "Plus" indicates that this line includes subjects who met basic criteria PLUS the added criteria in the column. DT = Dose Time, which is the 5 epochs (10 min) centered on the acknowledgement of the first reminder of the day and shown as a red diamond in Figure 1. It may be a few minutes after the reminder was delivered (as in Figure 1). ET = Effect Time, which is the time when levodopa had its peak effect. It is calculated for the PKG as the peak of the Smoothed Weekly BKS Time series from 46 min to 90 min after the acknowledgement of the dose reminder.

Model Design for Response Estimation
The first step in designing the computational model that estimates the response to a dose of levodopa was to divide the UPDRS III into a range of motor function severity levels (MFSL UPDRS). This was to allow severity of motor function to be categorically classified instead of requiring regressions. This allows a more robust use of the dataset given the non-identically distributed noise and variations in UPDRS III scoring, which introduce complexity for regression models. MFSL UPDRS "0" was set as a UPDRS III score <10 and levels above that were separated by increments of 12.5 UPDRS III units, with all scores ≥60 in level 5 (see Table 3). Unfortunately, the literature is relatively silent on the UPDRS III of normal subjects, and while a score of UPDRS III < 10 for MFSL UPDRS "0" is somewhat arbitrary, it is designed to be conservatively low, based on the range of scores in newly diagnosed subjects [23] and on the clinical experience of the authors. The size of the increments in UPDRS III between higher MFSL UPDRS relates to the upper limits of inter/intra rater variability [21,24] and clinically important differences [25], although we accept that there is little literature to guide this choice. Table 3 shows the number of UPDRS OFF and UPDRS ON (or support sets) in each MFSL UPDRS. The various PKG features (Table 4 and glossary, Section 2.3) used to measure motor function were assigned to one of the 6 MFSL class labels (Table 3) using the corresponding UPDRS III score.
We decompose this multi-class classification problem into 5 binary classification problems, where samples are separated according to whether they fell below or above the threshold for each class. For instance, Classifier 3 will be a binary classifier separating UPDRS III < 35 and UPDRS III ≥ 35. The next section explains the design of these classifiers.
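As an illustration of this decomposition, the sketch below maps a UPDRS III score to its severity level and to the five binary targets. The thresholds follow from the text (level 0 below 10, increments of 12.5, level 5 at 60 and above); the function names are hypothetical.

```python
from bisect import bisect_right

# Class boundaries implied by the text: 10, then steps of 12.5 up to 60.
THRESHOLDS = [10.0, 22.5, 35.0, 47.5, 60.0]


def mfsl(updrs_iii):
    """Motor function severity level (0-5) for a UPDRS III total score."""
    return bisect_right(THRESHOLDS, updrs_iii)


def binary_targets(updrs_iii):
    """Targets for Classifiers 1-5: 1 if the score is at or above the threshold."""
    return [int(updrs_iii >= t) for t in THRESHOLDS]
```

For example, a score of 35 falls in level 3 and is labelled positive for Classifiers 1 to 3 and negative for Classifiers 4 and 5.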

Feature Engineering and Selection
A set of candidate features (enumerated in Table 4 and in the glossary, Section 2.3) for performing the classification task were extracted from the PKG data and statistically matched to the class labels in Table 3. When dealing with a relatively small dataset, having many structurally dependent features can degrade the performance of the statistical model and risks overfitting due to noise-induced variance. What follows is a reduction in the size of the feature space by elimination and selection, in a supervised manner, to enhance model generalizability and interpretability. The first phase of feature selection was to identify features of the same nature and select the most relevant. For example, the two features BKS_M50P and BKS_WM50P in Table 4 are of the same nature in that they are both moving medians of BK Score and would thus contain similar information.
The mutual information (MI) test was performed to assess the relevance of these features to the target classes. This approach is neutral with respect to models and identifies any statistical relationship, linear or nonlinear, between the features and the class labels [26]. Table 4 shows the MI of each feature with the 6 target classes, estimated using 6 nearest neighbors for these continuously valued features, with the 6 class labels.
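A nearest-neighbour MI estimate of this kind can be sketched with scikit-learn's `mutual_info_classif`, using 6 neighbours as stated above. The data below are synthetic and purely illustrative (one feature that tracks the class label and one that does not); they are not PKG data, and the study's own estimator may differ in detail.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 600
labels = rng.integers(0, 6, size=n)            # six hypothetical severity classes
informative = labels + rng.normal(0, 0.5, n)   # synthetic feature tracking the class
noise = rng.normal(0, 1, n)                    # synthetic feature unrelated to the class
X = np.column_stack([informative, noise])

# k-nearest-neighbour MI estimate between each continuous feature and the labels.
mi = mutual_info_classif(
    X, labels, discrete_features=False, n_neighbors=6, random_state=0
)
```

The first entry of `mi` should be clearly larger than the second, which is the kind of ranking used to judge relevance.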
Based on the above discussion, the feature set was refined to BKS_M10P, BKS_M25P, BKS_M50P, BKS_WM75P, BKS_M90P and TA_WM50P_Log. Although these scores each have considerable relevance, some carry redundant information with respect to one or a combination of the others. Joint Mutual Information Maximization (JMIM) [27] was used to maximize relevancy while minimizing redundancy. This method first picks the feature with maximal MI with the target classes and adds it to the set of selected features. It then iteratively adds the candidate feature whose minimum joint mutual information with the target classes, taken over its pairings with each already selected feature, is the largest among the remaining candidates. This heuristically ensures that a newly selected feature has greater relevancy and less redundancy. Table 5 shows the ranking of iterative selection of the refined feature set applying JMIM. Also shown is the mutual information with the target classes of the Weekly Aggregate Mean of each feature, i.e., the average of the values of the feature over a 10 min interval across the entire week. This shows the relevance of a feature when aggregated at DT and ET. Evidently, BKS_M90P and BKS_WM75P were relatively less relevant. Furthermore, BKS_M50P was relatively redundant and was also less relevant after weekly aggregation. Although there was the high level of collinearity expected between BKS_M10P and BKS_M25P, both features were deployed as they represent different nonparametric measures of the temporal distribution of BKS. In summary, we selected the set of features BKS_M10P, BKS_M25P and TA_WM50P_Log.
Table note: Rows in italics were features selected following assessment of relevancy and redundancy.
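The JMIM selection rule can be sketched as follows. This is a simplified illustration, not the authors' code: it assumes the features have already been discretized (binned) and uses a plug-in estimate of mutual information, whereas the study works with continuous features. All names are hypothetical.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Plug-in mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(
        (c / n) * log(c * n / (px[x] * py[y])) for (x, y), c in pxy.items()
    )


def jmim(features, target, k):
    """Select k feature names by Joint Mutual Information Maximization.

    features: dict mapping a feature name to a list of binned values.
    """
    remaining = set(features)
    # Step 1: the feature with maximal MI with the target classes.
    first = max(sorted(remaining), key=lambda f: mutual_information(features[f], target))
    selected = [first]
    remaining.remove(first)
    # Step 2: repeatedly add the candidate whose worst-case joint MI
    # (over pairings with already selected features) is largest.
    while remaining and len(selected) < k:
        def score(f):
            return min(
                mutual_information(list(zip(features[f], features[s])), target)
                for s in selected
            )
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

In a toy case where two features are exact copies and a third carries complementary information, the second pick is the complementary feature rather than the copy, which is the redundancy-avoiding behaviour described above.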

Model Selection, Training and Validation
Several discriminative statistical models were examined for the purpose of designing the five binary classification algorithms. Although the selected features are monotonically and linearly correlated with the UPDRS III, both linear and nonlinear models were considered: namely Logistic Regression [28], Support Vector Classifier [29,30] with Radial Basis Function (RBF) kernel and Gradient Boosting Decision Trees [31]. These discriminative models represent the categories of linear, kernel nonlinear and ensemble (arbitrarily nonlinear) approaches, which have proved to be effective in a variety of problem types. The three features used (BKS_M10P, BKS_M25P and TA_WM50P_Log) are defined above.
When comparing the performance of the learning models described above, it was noted that Logistic Regression assumes that the observations in the dataset are independent and that multicollinearity is not present. We previously noted the structural collinearity between BKS_M10P and BKS_M25P. To investigate the effect of the violation of this assumption on Logistic Regression, we introduce a fourth model, where unsupervised reduction of the dimension of these two features into one is achieved using the first component of Principal Component Analyses (PCA) of BKS_M10P and BKS_M25P together with TA_WM50P_Log, followed by a Logistic Regression classifier. Furthermore, as the employed features are moving statistics, neighboring epochs (observations of the same subject) are not independent when 5 epochs in a row are sampled. Two strategies were used to address assumptions regarding independence of observations: (i) using other candidate models whose requirements around this assumption are relaxed (SVM and Gradient Boosting Trees), (ii) assessing the performance of the models both in terms of model selection and out of sample predictions on Weekly Aggregate predictions, where two observations from one subject (weekly aggregate of dose time and effect time) are spaced more than 30 min apart and are thus reasonably independent. Therefore, the use of Logistic Regression as one of the candidate classifiers is possible and the performance comparisons would be meaningful.
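The fourth model (PCA on the two collinear BKS features, then Logistic Regression) could be assembled along the following lines with scikit-learn. This is a sketch under assumptions: the column order, the standardization step and the default hyperparameters are illustrative choices not stated in the text, and the data are synthetic.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumed column order: [BKS_M10P, BKS_M25P, TA_WM50P_Log]
reduce_bks = ColumnTransformer([
    # Collapse the two collinear BKS features into their first principal component.
    ("bks_pc1", make_pipeline(StandardScaler(), PCA(n_components=1)), [0, 1]),
    # Pass the tremor feature through (standardized only).
    ("ta", StandardScaler(), [2]),
])
pca_logreg = Pipeline([("reduce", reduce_bks), ("clf", LogisticRegression())])

# Synthetic stand-in data with two strongly collinear columns.
rng = np.random.default_rng(0)
bks = rng.normal(30, 10, 400)
X = np.column_stack([
    bks * 0.8 + rng.normal(0, 1, 400),
    bks,
    rng.normal(0.5, 0.2, 400),
])
y = (bks > 30).astype(int)

pca_logreg.fit(X, y)
reduced = pca_logreg.named_steps["reduce"].transform(X)  # two columns: PC1 and TA
```

Comparing this pipeline with a plain Logistic Regression on all three columns is one way to probe the effect of the collinearity, as done in the text.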
A train set and a test set were defined for each of the classification problems (i.e., into one of the levels defined in Table 3). For each binary classification problem, we allocated 30% of the samples from each class around Dose and Effect Time of the entire set to the test set and used the remainder in the train set. As the binary classes for each classifier are different, the train and test sets vary randomly across classifiers. We ensured that if samples from a subject were used in the test set, samples from that subject were not used in the train set. This retains the original class balance in the test set and ensures the complete separation of train and test sets for each classification problem.
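A subject-wise split of this kind might look like the sketch below. The stratification rule (grouping each subject by the highest binary label among their samples) is an assumption made to approximate "30% from each class" while keeping subjects intact; the text does not specify how this was done, and the function name is hypothetical.

```python
import random
from collections import defaultdict


def subject_wise_split(subject_ids, labels, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx) such that no subject appears in both sets."""
    by_subject = defaultdict(list)
    for idx, sid in enumerate(subject_ids):
        by_subject[sid].append(idx)

    # Assumed stratum for a subject: the highest label among their samples.
    strata = defaultdict(list)
    for sid, idxs in by_subject.items():
        strata[max(labels[i] for i in idxs)].append(sid)

    rng = random.Random(seed)
    test_subjects = set()
    for key in sorted(strata):
        sids = sorted(strata[key])
        rng.shuffle(sids)
        test_subjects.update(sids[: round(test_fraction * len(sids))])

    train = [i for i, s in enumerate(subject_ids) if s not in test_subjects]
    test = [i for i, s in enumerate(subject_ids) if s in test_subjects]
    return train, test
```

Because whole subjects are assigned to one side, neighbouring epochs from the same person cannot leak from the train set into the test set.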
We report two performance measures: (i) area under curve of Receiver Operating Characteristics (ROC AUC) which provides an average Sensitivity at different Specificities; (ii) area under curve of Precision (ratio of true positives to all positive declarations including false positives) vs. Recall (=Sensitivity) (PR AUC) which provides an average Precision at different Recalls averaged over the two classes. The performance metric (ii) is reported to account for the class imbalance that varies over the five Classifiers.
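The two metrics can be computed along these lines with scikit-learn. The averaging of PR AUC over the two classes is implemented here by treating each class in turn as the positive class and negating the scores for class 0; this is one reading of "averaged over the two classes" and should be taken as an assumption. The helper name is hypothetical.

```python
import numpy as np
from sklearn.metrics import auc, precision_recall_curve, roc_auc_score


def pr_auc_two_class_mean(y_true, scores):
    """Area under the precision-recall curve, averaged over both classes."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    areas = []
    # Class 1 uses the scores as given; class 0 uses the negated scores.
    for positive, s in ((1, scores), (0, -scores)):
        precision, recall, _ = precision_recall_curve(y_true == positive, s)
        areas.append(auc(recall, precision))
    return float(np.mean(areas))


y_demo = [0, 0, 1, 1]
s_demo = [0.1, 0.4, 0.35, 0.8]
roc_demo = roc_auc_score(y_demo, s_demo)
pr_demo = pr_auc_two_class_mean(y_demo, s_demo)
```

Reporting both values is useful when class balance differs between classifiers, since ROC AUC alone can look optimistic for a rare positive class.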
To tune the critical hyperparameters of each model, and assess their generalisability, a 10-fold hold-out cross validation (CV) on the train set was performed (Table 6). First, it is evident that reducing the dimensionality of the structurally collinear features with PCA did not improve the performance of the Logistic Regression. This supports using BKS_M10P and BKS_M25P together as separate features in the Logistic Regression model, which allows the distribution of BK score reflected by these two features to be captured. Second, the RBF kernel SVC and Gradient Boosting Decision Trees models do not perform as well as Logistic Regression on Classifiers 4 and 5, while not offering any advantage for the classifiers at lower levels. Table 6 shows the performance of these models of various complexities on the train set. Although the Gradient Boosting Decision Trees model offers a lower bias than Logistic Regression for all classifiers, the cross-validation performance shows that the variance was not convincingly resolvable through parameter tuning.
The Logistic Regression model performs as well as, if not better than, the other models and is simpler, which enhances the likelihood of generalizability and interpretability of the algorithm. Thus, the Logistic Regression model was further analyzed for validation and bias-variance assessment using the five classifier models, and Logistic Regression was used in the following sections of this study.
The generalizability of the classifier models was examined by applying them to an unseen test set. Train and test sets for each classifier model were selected according to the description in the previous section. We decomposed this multi-class classification into 5 binary classification problems, where samples were separated according to whether they fell below or above the threshold for each class. Table 7 shows the performance of each classifier model on train and test sets.

Results
The aim of this study was to use the model described above to compare the response to levodopa (LR) measured during the LDCT with the change in motor function following the first morning dose measured by the PKG over 6 days. To remind the reader, the LR in the LDCT (LR UPDRS) is calculated as the change in the UPDRS III scores (abs∆ UPDRS) from the OFF state to the ON state, which was 22 (±11 SD) UPDRS III units for the whole cohort (see Table 1 for each clinic). The LR UPDRS is also commonly expressed clinically as %∆ UPDRS (abs∆ UPDRS /UPDRSIII OFF × 100), or the percent improvement, which was 47 (±18 SD) for the whole cohort (see Table 1 for each clinic). Figure 3 shows the relationship between abs∆ UPDRS and %∆ UPDRS. We compare the model with the abs∆ UPDRS first and then with %∆ UPDRS; the justification for considering the abs∆ is fully reviewed in the Discussion.
For the PKG, the Logistic Regression model designed in Section 2.9 was used to obtain a binary prediction for each epoch in the DT and ET as to whether it was below or above each MFSL. At DT, the estimated motor function scores for all available epochs were averaged to produce the PKG estimate of the motor function score at DT (MFSL DT), and similarly at ET (MFSL ET). These were used to produce the model's estimate of the LR: abs∆ PKG (MFSL DT - MFSL ET) and %∆ PKG (abs∆ PKG /MFSL DT × 100).
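The arithmetic of the PKG estimate can be illustrated with a short sketch. One step is an assumption: the text does not state how the five binary predictions for an epoch are combined into a single motor function score, so the sketch simply counts the classifiers that predict "above threshold". The function names are hypothetical.

```python
from statistics import mean


def epoch_score(above_flags):
    """Assumed per-epoch score: number of the five classifiers voting 'above'."""
    return sum(above_flags)


def pkg_levodopa_response(dt_epochs, et_epochs):
    """Return (MFSL_DT, MFSL_ET, absolute change, percent change).

    dt_epochs / et_epochs: one list of five 0/1 predictions per available epoch.
    """
    mfsl_dt = mean(epoch_score(e) for e in dt_epochs)
    mfsl_et = mean(epoch_score(e) for e in et_epochs)
    abs_delta = mfsl_dt - mfsl_et
    # Percent change is undefined when the score at dose time is zero.
    pct_delta = abs_delta / mfsl_dt * 100 if mfsl_dt else float("nan")
    return mfsl_dt, mfsl_et, abs_delta, pct_delta
```

For instance, if every DT epoch is rated above the first three thresholds and every ET epoch above only the first, the estimated change is 2 levels, or about 67% of the DT level.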
Controls were used to augment the number of cases whose LR could be considered insignificant (Support Class 0 for ROC) and ensure class balance. For controls, DT was set at 10:00 a.m. and ET was at the peak of BKS smoothed weekly summary from 45 min to 90 min. UPDRS III was set at "0". The PwP and Controls included in the assessment of the response to the first morning dose of levodopa in the PKG as a model of the LDCT are shown in Table 2.

Performance of the PKG Predictions of LR Compared to the UPDRS III Measurement of LR
The performance of the abs∆ PKG as a prediction of abs∆ UPDRS was estimated in terms of the two metrics of Receiver Operator Characteristic and Precision-Recall Curve (see Section 2.8 for definition) (Table 8). When all subjects (column headed "All" in Table 8 and Figure 4) in each class were included, the ROC AUC was 0.8 and PR AUC was 0.79. The performance of the model was potentially degraded by three possibilities:
• That an uncertain region lay between an abs∆ UPDRS that was clinically meaningful (Support Class 1) and one that was not clinically meaningful (Support Class 0).
• That the MFSL DT did not reflect the subjects' worst motor function (i.e., they were already "ON" in the sense that their usual level of motor function was not being seen at this time, possibly because they had already responded to medications).
• Day-to-day variability in the amplitude of MFSL DT and MFSL ET, and in latency to peak response, especially from clinical variation in enteric delivery of the drug to absorption sites.
Means for identifying each of these possibilities were developed, and the identified cases were then excluded from the comparison to assess their effect on the ROC statistics (Table 8 and Figure 4). In the following, we elaborate on these exclusion mechanisms.
Figure 4. [...] Table 8 (A) and 10 (B) plotted according to the 4 corresponding column groups in that table: "U" indicates "uncertain", "O" indicates "Already ON" and "V" indicates "Variable". The small, pink shaded "box and whiskers" plot, between Case 0 and Case 1 in (A) in the Exclude U group, shows the distribution of abs∆ PKG of the uncertain cases. The boxes are the median and quartiles, with the "whiskers" showing the 90th and 10th percentile.
Subjects in whom there was Uncertainty as to whether abs∆ UPDRS was Meaningful
The performance of the model will be enhanced if only subjects whose abs∆ UPDRS are clearly positive or clearly negative are included, and cases whose labelling is ambiguous or uncertain are removed. It was noted that (i) the boundary between Level 0 and Level 1 was 10 UPDRS III points (Table 3); (ii) MFSL were separated by 12.5 UPDRS III points; (iii) inspection of Figure 3 suggests that a clinical "uncertain range" also exists (see further discussion in Section 4.2). Thus, a region from 11-14 and centered on 12.5 would remove most "uncertain" cases, and uncertain abs∆ UPDRS were defined as being 11 ≤ abs∆ UPDRS ≤ 14 [25]. That is, for the purpose of an LDCT, a meaningful improvement was defined as abs∆ UPDRS > 14 (Class 1) and an insignificant improvement as abs∆ UPDRS ≤ 10 (Class 0). Figure 4A shows that these "uncertain" cases did indeed have an abs∆ PKG that was intermediate between cases in Class 0 and 1. When uncertain abs∆ UPDRS were removed, the ROC AUC and PR AUC improved to 0.83 and 0.81, respectively.
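The labelling rule just described is simple enough to state as code. The sketch below returns the support class for a given abs∆ UPDRS and marks the uncertain band for exclusion; the function name is hypothetical, and any value strictly between 10 and 11 (not expected when scores are whole numbers) is treated here as uncertain.

```python
def lr_class(abs_delta_updrs):
    """Return 1 (meaningful), 0 (insignificant) or None (uncertain, excluded)."""
    if abs_delta_updrs > 14:
        return 1
    if abs_delta_updrs <= 10:
        return 0
    # 11 <= abs_delta_updrs <= 14: the uncertain range defined in the text.
    return None
```

Cases for which `lr_class` returns `None` are the ones removed before recomputing the ROC and PR statistics.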

Subjects whose Motor Function at dose Time did not Represent their Worst Motor Function Level
If MFSLDT was not the most severe level of motor function present in a subjects PKG, it suggests that they may be "already ON": that is, they may have already received treatment. This is relevant because the aim clinically, when performing an LDCT, is for motor function at the time of the dose during the LDCT to represent the PwP's worst motor function (highest score) and to this end, anti-PD medications are not taken after the prior evening. Motor function is usually assessed several hours after awakening so any "so-called" sleep benefit has dispersed [32,33].
When performing an LDCT, the "OFF" state is considered to represent the subject's worst motor function. Similarly, the motor function at DT during a PKG is assumed as being the time of maximum  Table 8 (A) and 10 (B) plotted according the 4 corresponding column groups in that table: "U" indicates "uncertain", "O" indicates "Already ON" and "V" indicates "Variable". The small, pink shaded "box and whiskers" plot, between Case 0 and Case 1 in (A) in the Exclude U group shows the distribution of abs∆PKG of the uncertain cases. The boxes are the median and quartiles with the "whiskers" showing the 90th and 10th percentile.

Subjects in whom there was Uncertainty as to whether abs∆ UPDRS was Meaningful
The performance of the model will be enhanced if only subjects whose abs∆ UPDRS are clearly positive or clearly negative are included, and cases whose labelling is ambiguous or uncertain are removed. It was noted that (i) the boundary between Level 0 and Level 1 was 10 UPDRS III points (Table 3); (ii) MFSL were separated by 12.5 UPDRS III points; (iii) inspection of Figure 3 suggests that a clinical "uncertain range" also exists (see further discussion in Section 4.2). Thus, a region from 11 to 14, centered on 12.5, would remove most "uncertain" cases, and uncertain abs∆ UPDRS were defined as being 11 ≤ abs∆ UPDRS ≤ 14 [25]. That is, for the purpose of an LDCT, a meaningful improvement was defined as abs∆ UPDRS > 14 (Class 1) and an insignificant improvement as abs∆ UPDRS ≤ 10 (Class 0). Figure 4A shows that these "uncertain" cases did indeed have an abs∆ PKG that was intermediate between cases in Class 0 and Class 1. When uncertain abs∆ UPDRS were removed, the ROC AUC and PR AUC improved to 0.83 and 0.81, respectively.
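The labelling rule described above can be written as a short sketch. The function name and the string labels are illustrative and are not taken from the study; only the thresholds (≤ 10, 11 to 14, and > 14 UPDRS III points) come from the text, and integer-valued UPDRS III differences are assumed.

```python
def label_abs_delta_updrs(abs_delta: int) -> str:
    """Label an absolute OFF-to-ON UPDRS III change using the thresholds in the text.

    The label names are hypothetical; the cut-offs are those stated above.
    """
    if abs_delta > 14:
        return "class_1"    # meaningful improvement
    if abs_delta >= 11:
        return "uncertain"  # 11 <= abs_delta <= 14, removed from the comparison
    return "class_0"        # abs_delta <= 10, insignificant improvement


# Illustrative values only, not study data.
labels = [label_abs_delta_updrs(d) for d in (7, 10, 11, 14, 15, 22)]
print(labels)
```

Cases labelled "uncertain" would simply be dropped before the ROC and PR statistics are computed.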

Subjects whose Motor Function at dose Time did not Represent their Worst Motor Function Level
If MFSL DT was not the most severe level of motor function present in a subject's PKG, it suggests that they may be "already ON": that is, they may have already received treatment. This is relevant because the clinical aim, when performing an LDCT, is for motor function at the time of the dose to represent the PwP's worst motor function (highest score), and to this end anti-PD medications are not taken after the prior evening. Motor function is usually assessed several hours after awakening so any "so-called" sleep benefit has dispersed [32,33].
When performing an LDCT, the "OFF" state is considered to represent the subject's worst motor function. Similarly, the motor function at DT during a PKG is assumed to be the time of maximum motor dysfunction. Subjects were excluded if they were known to routinely take medications prior to the first reminder; however, on inspection of the PKG of some cases, it was apparent that higher levels of bradykinesia occurred later in the day, and so it was possible that medications were consumed prior to the first reminder without this being recorded. This possibility was examined further. First, cases whose MFSL DT was in the treated range were found: that is, MFSL DT was less than MFSL 3 (Table 3, and see reference [15] for the rationale for choosing this level). The next step was to establish whether motor function deteriorated significantly later in the day by estimating the subject's MFSL (as per the model: average plus one standard deviation) from 46 min after the first dose until 18:00 h. If this was more than 1 MFSL greater than MFSL DT, they were flagged as "already ON": 19% of PwP met this criterion and its effect was analyzed using the ROC statistic (Table 8). Exclusion of these "already ON" subjects improved the ROC AUC to 0.87 and the PR AUC to 0.85. When cases that were "already ON" and those that were in the uncertain abs∆ UPDRS range were both excluded, the ROC AUC and PR AUC further improved to 0.89 and 0.87, respectively.
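The two-step "already ON" rule can be sketched as follows. This is a minimal illustration under stated assumptions: the function and parameter names are hypothetical, and the two MFSL values are assumed to have been estimated beforehand (the estimation itself, average plus one standard deviation as per the model, is not reproduced here).

```python
def is_already_on(mfsl_dt: float, mfsl_later: float,
                  treated_below: float = 3.0, min_rise: float = 1.0) -> bool:
    """Flag a case as "already ON" at dose time (DT).

    mfsl_dt:    MFSL at DT.
    mfsl_later: MFSL estimated from 46 min after the first dose until 18:00 h.
    Both parameter defaults follow the thresholds stated in the text.
    """
    in_treated_range = mfsl_dt < treated_below       # MFSL DT below MFSL 3
    worse_later = (mfsl_later - mfsl_dt) > min_rise  # more than 1 MFSL greater
    return in_treated_range and worse_later


# Illustrative values only, not study data.
print(is_already_on(2.0, 3.5))  # treated range at DT and clearly worse later
print(is_already_on(4.0, 5.5))  # not in the treated range at DT
```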
Sixty-eight percent of all PwP in this study were on D2 agonists and the percent of PwP taking D2 agonists who were flagged as "already ON" was also 32%. Furthermore, the average LED was 177 in those who were "already ON", compared to 170 in the rest of the cohort. D2 agonists contributed less than 20% of the total levodopa equivalent dose in most PwP (Figure 2, median 15% (8-30% IQR)). Neither abs∆ PKG nor abs∆ UPDRS was smaller when the size or the relative contribution of the D2 agonist dose was large. This suggested that D2 agonists did not contribute to whether an LR could be assessed using the PKG.

Subjects with Excess Variability in the Amplitude and Latency to Peak Response
In terms of applying the model, estimates of MFSL DT or MFSL ET could be affected in some subjects by epochs excluded because of inactivity or exercise. This reduced the sample size and increased the variability in estimates of motor function at DT and ET. Also, the size and time to peak response to a dose of levodopa (such as in an LDCT) will be affected by variability in gut motility and gastric emptying causing erratic delivery of levodopa to transport sites in the gut, which would result in day-to-day variability even if measured by UPDRS III. There is also variation in the UPDRS III itself, both within an individual assessor and between assessors. Excess variability in amplitude of response was assessed by measuring the standard deviation in the MFSL DT and MFSL ET of all epochs in DT and ET of all available days. If both were greater than 1 MFSL (Table 3), it was deemed that there was significant variability in the motor function at both times: 23% of PwP had excess variability in amplitude. Variability in latency from dose to peak was also assessed: cases with excess variability were cases where the standard deviation of the estimation of MFSL was greater than 1 for both DT and ET. Thirty-five percent of PwP were flagged as having increased variability in latency to peak. Exclusion of cases with variability measured in this way, in addition to cases that were "already ON" and those in the uncertain abs∆ UPDRS range, resulted in a ROC AUC of 0.92 and a PR AUC of 0.87 (Table 8).
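The amplitude-variability criterion can be illustrated with a short sketch. The function name and the input lists are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not state which estimator was used.

```python
from statistics import stdev


def has_excess_variability(mfsl_dt_epochs, mfsl_et_epochs, limit=1.0):
    """Return True when the SD of MFSL exceeds `limit` at BOTH DT and ET.

    Each argument holds MFSL estimates pooled over all available days.
    Sample SD (statistics.stdev) is an assumption made for this sketch.
    """
    return stdev(mfsl_dt_epochs) > limit and stdev(mfsl_et_epochs) > limit


# Illustrative values only, not study data.
variable_dt = [1, 3, 5, 2, 4]   # SD about 1.58
stable_et = [2, 2, 3, 2, 3]     # SD about 0.55
variable_et = [0, 2, 4, 1, 3]   # SD about 1.58

print(has_excess_variability(variable_dt, stable_et))    # only DT is variable
print(has_excess_variability(variable_dt, variable_et))  # both are variable
```

Requiring both conditions means that a case with a noisy estimate at only one of the two times is not flagged.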

Classification Performance of the %∆ PKG in Predicting the %∆ UPDRS in the LDCT
As discussed above, it is common practice clinically to use the %∆ UPDRS as an outcome of the LDCT. While the performance of the model in predicting the percent improvement in motor function was not as effective as predicting the change in motor function (Table 9 and Figure 4), it nevertheless resulted in ROC AUC and PR AUC of 0.82 and 0.73 (respectively) when cases that were "already ON", in the uncertain zone and with excess variability were excluded.

Clinical Relevance of Measuring LR with the PKG
The duration of disease was plotted against the LR measured by abs∆ PKG and abs∆ UPDRS (Figure 5). In this study the abs∆ PKG increased with disease duration, which is similar to reports of others using abs∆ UPDRS [4,34], whereas the abs∆ UPDRS showed only a weak trend to increase with disease duration; the variation between different clinical reporters of the UPDRS III may have obscured this trend.
The average time to peak response was 45 min (±19 SD) and this compares to 51 min (±25 SD) reported elsewhere [35]. This also provides support that the PKG is indeed measuring an LR similar to that measured by the LDCT.
The size of the first dose used on the PKG ranged widely, with a median of 150 mg (interquartile range ±50 mg). The minimum was 50 mg, the 10th percentile was 100 mg and the 90th percentile was 250 mg. These doses are less than those used by most clinics, including the three that provided participants for this study. However, there was a non-significant trend for the median abs∆ PKG to be larger when the 1st dose LED was 150 mg or less compared with >200 mg. There was a very similar pattern between the median of the abs∆ UPDRS and the size of the 1st dose of the morning.
The abs∆ UPDRS decreases following DBS due to improved motor function in the "off" state [36]. Thus, it would be expected that abs∆ PKG will reduce following DBS. PKGs performed before and 6 months after DBS were available for estimation of their abs∆ PKG (Figure 5C,D). As expected, there was a marked reduction in abs∆ PKG and the size of the reduction was predicted by the pre-surgery LR as measured by the abs∆ PKG (Figure 5D).
The data regarding the size of the LR in relationship to age and the response to DBS are presented here to show that the findings of the instrumented assessment of the LR reproduce the findings found by UPDRS III.
Figure 5. ... (A) shows the abs∆ PKG and (B) shows the abs∆ UPDRS (note that 1 unit on the Y axis of (A) approximates 12 UPDRS III units shown on the Y axis of (B)). (C) shows abs∆ PKG before and after deep brain stimulation (DBS). (D) shows the same data, with the difference in abs∆ PKG before and after DBS (X axis) plotted against the abs∆ PKG before DBS. The grey region marked NSD (no significant response) in (D) indicates the absolute delta before DBS that is not significant. In (A-C) the boxes are the median and quartiles with the "whiskers" showing the 90th and 10th percentile.

Discussion
The aim of this study was to establish whether data from the PKG could be used to assess LR with a similar classification performance to the LDCT. We designed a Machine Learning model to define 6 levels of motor function severity using UPDRS III scores from before and after a dose of levodopa. The models were designed and validated against unseen data to ensure their generalizability. Employing these models, we could use the PKG data to predict the abs∆ UPDRS with a ROC AUC of 0.92, which suggests that the PKG can be used to accurately replicate the LDCT in an ambulatory fashion. One of the factors contributing to the difficulty in modelling the LDCT is the wide variety of clinical practice, both in the literature and in the clinics participating in this study. There are differences as to whether the LR is expressed as an absolute difference in the scores (abs∆) [34,37] or as a percentage of the "OFF" score (%∆) [2,38,39]. There is also variation in what constitutes a significant response: a %∆ of 30% is widely accepted [40,41], although changes including 20% [42], 25% [43], 33% [2], >40% [44] and 25-50% [45] are cited. Some centers measure the "ON" state at a specific time [2,38,39], typically 45 min [46,47], whereas others establish a peak score. There is also no uniformity in the size of the dose: most use an absolute dose [40,41,48] ranging from 150-400 mg of levodopa but others use some multiple of the usual morning dose [49-51].

Steps That Improve the Classification Performance of the abs∆ PKG
While the ROC AUC of all subjects was acceptable (0.80), substantial improvement (ROC AUC = 0.92) could be obtained by excluding those in the uncertain abs∆ UPDRS range, those who were "already ON" with deterioration observed later in the day and those with excess variability in response. The issue of an "uncertain zone" in the classification of the abs∆ UPDRS is addressed in Section 4.2.
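The effect of excluding flagged cases on the ROC AUC can be illustrated with a small sketch. It is not the analysis pipeline used in this study: the data below are invented for illustration, the function name is hypothetical, and the AUC is computed with the standard pairwise (Mann-Whitney) formulation rather than with any particular statistics package.

```python
def roc_auc(labels, scores):
    """ROC AUC as the probability that a positive case outscores a negative one.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case in each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy cases: (class label, abs-delta PKG score, flagged for exclusion).
# These numbers are invented and are not study data.
cases = [
    (1, 1.8, False), (1, 1.2, False), (1, 0.3, True),
    (0, 0.2, False), (0, 0.5, False), (0, 0.9, True),
]

auc_all = roc_auc([c[0] for c in cases], [c[1] for c in cases])
kept = [c for c in cases if not c[2]]
auc_kept = roc_auc([c[0] for c in kept], [c[1] for c in kept])
print(round(auc_all, 3), round(auc_kept, 3))
```

In this toy example the flagged cases are exactly those that overlap the other class, so removing them raises the AUC. In real data the size of the gain depends on how well each flag identifies cases that are genuinely mislabelled or poorly measured.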

Addressing the "Already ON" Issue
When an LDCT is performed, there is an expectation that "OFF" motor function represents the PwP's worst motor function. To achieve this, anti-PD medications are not taken after the prior evening and the test is usually performed several hours after awakening so any "so-called" sleep benefit [32,33] has dispersed. Similarly, for the PKG, the intention was that the level of motor function at the DT would represent the worst motor function experienced over the course of the day. In many cases, medications were taken immediately on awakening so, if "sleep benefit" were a real entity, it would interfere with the model's estimations. We are aware of the discourse relating to this entity [32,33] and merely note here that it is possible that it could have prejudiced the performance of the model. However, prior consumption of medications is a more tangible problem. Subjects were excluded from this study if the clinical notes indicated routine consumption of medications (including apomorphine) prior to the first reminder because this would improve the motor function at DT. However, it is possible that some subjects consumed medications prior to their first reminder without this being recorded.
Whatever the mechanism, subjects whose bradykinesia has been partially alleviated at the time of the first PKG reminder could be revealed as having adequately treated bradykinesia at the time of the first dose but having worse levels of motor function later in the day. The statistical method used in this study identified subjects whose level of motor function was already in the treated range and also experienced higher levels of bradykinesia (as identified by the PKG) later in the day. These subjects were flagged as "already ON" at DT: 19% of PwP met this criterion and excluding these PwP improved the ROC AUC and PR AUC (to 0.87 and 0.85 respectively).
The main purpose of introducing this flag was to indicate that early morning alleviation of bradykinesia through whatever mechanism (e.g., medication or sleep benefit) reduced the performance of detecting a levodopa response with the PKG. In applying the PKG as a test of levodopa responsiveness in the future, this would be addressed by specifically requesting PwP to refrain from medications and to be "up and active" for some time prior to the first reminder. The latter point would not only address the potential contribution of sleep benefit but would also reduce the number of epochs removed because of inactivity or sleep (see Section 3.1.1). This proposition should be addressed in a future study. However, the flag should remain as a marker to question the compliance of individuals in relation to medication consumption in the morning.

Addressing the Issue of Increased Variability
While the effect of variability in gut motility on response to levodopa is well recognized [52], it is not commented on as a source of error in interpretation of the LDCT. Performing the study first thing in the morning is thought to reduce variability in gastric emptying. Arguably, however, the average time and amplitude of response from 6 days may be a truer representation than a single sample as in the LDCT. Another important factor is that many PwP are still in bed, or at least inactive, at the time of the first dose. As a result, many epochs are excluded because inactivity indicates that they cannot be used to assess bradykinesia. The smaller sample size leads to increased variability. In the future this could be simply addressed by requesting that PwP are awake and active for half an hour prior to the first dose, which is much less intrusive than asking them to attend hospital in the untreated state for an LDCT. Removal of cases with excess variability had the least impact on improving the classification performance.

Reporting Absolute Size or Percentage Change of LR
The justification for including an "uncertain zone" was both empirical, in that it improved the ROC statistics, and clinical, because there is uncertainty about what constitutes a positive response to levodopa. Clinical convention usually recognizes a significant LR as a %∆ UPDRS of 30% [40,41], but thresholds ranging from 20-50% have been reported [2,42-45]. The literature is not forthcoming as to why a percentage improvement is preferred. In clinical practice, the LDCT is most commonly performed in subjects with significant levels of bradykinesia in the untreated state because the questions for performing the LDCT relate to suitability for advanced therapy or the diagnosis of responsive parkinsonism. For example, in this study the mean UPDRS OFF was 48 (±13 SD) and less than 10% had UPDRS OFF < 30. The abs∆ UPDRS increases with duration of disease, most likely due to increasing severity of UPDRS OFF [34,37]. Pieterman et al. [4] recently reported that abs∆ UPDRS correlated better with the UPDRS OFF (Figure 5) and with a range of motor and non-motor scores than did the more commonly used %∆ UPDRS. While both %∆ UPDRS and abs∆ UPDRS were compared with the LR measured by the UPDRS III, we also found that the abs∆ UPDRS from the PKG provided the best predictor. While PwP in the early stages of disease do have a measurable LR, the abs∆ UPDRS is less than 11 UPDRS III points (~7) and the %∆ UPDRS cutoff for clinically meaningful results is around 33% [41]. Figure 3 indicates that PwP whose disease duration was less than 5 years were more likely (Fisher's test, p = 0.015) to have scores that were not clinically significant in absolute terms, and this would argue that a different model is required for early disease.
The justification for the size of the zone (4 UPDRS III units) and its location on the UPDRS III scale (11-14) was empirical, drawing on the observations described in Section 3.1.1, and led to an "uncertain zone" of 11 ≤ abs∆ UPDRS ≤ 14 being used. The model would be strengthened if a future study included a greater number of cases with parkinsonism in whom clinicians are confident that the LR is not clinically meaningful: these cases might include multiple system atrophy, progressive supranuclear palsy or vascular parkinsonism.
The current model suits the clinical question as to whether the LR is suitable for deploying advanced therapies such as deep brain stimulation. In these subjects, fluctuations are well developed and the LR is usually large. However, evidence of LR is sometimes sought in early PD for both clinical and research reasons, and this would argue for a separate model to deal with small amplitude changes in early disease.
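The difference between reporting the LR as an absolute or a percentage change can be made concrete with a short worked sketch. The function name is hypothetical. The "OFF" score of 48 is the study mean quoted above, while the "ON" scores and the early-disease "OFF" score are invented for illustration; the early-disease case is chosen to give the ~7 point change mentioned in the text.

```python
def levodopa_response(off_score: float, on_score: float):
    """Return (abs_delta, pct_delta) for UPDRS III scores before and after levodopa.

    abs_delta is OFF minus ON; pct_delta expresses it as a percent of OFF.
    """
    if off_score <= 0:
        raise ValueError("OFF score must be positive to compute a percentage")
    abs_delta = off_score - on_score
    pct_delta = 100.0 * abs_delta / off_score
    return abs_delta, pct_delta


# Advanced disease: OFF of 48 (study mean), hypothetical ON of 30.
print(levodopa_response(48, 30))  # 18 points, 37.5%

# Early disease: hypothetical OFF of 20 and ON of 13.
print(levodopa_response(20, 13))  # 7 points, 35%
```

Both hypothetical cases exceed a 33% threshold, yet only the first exceeds 14 UPDRS III points. This is the situation in which a model built on the absolute change would classify an early-disease response as not clinically significant.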

Pharmacological Agents Prescribed while Measuring the LR with the PKG
The median size of the first dose used on the PKG was less than those used by most clinics, including the 3 that provided participants for this study. However, as noted above, there appears to be little uniformity in the size of the dose. The use of the smaller dose may not affect the interpretation of an LDCT [53] and does not appear to have adversely affected the modelling. In further studies it might be advisable to use the PKG to ensure that the first morning dose is successful in alleviating bradykinesia. This would ensure that, whatever the size of the 1st morning dose, it is large enough to produce a clinically meaningful improvement.
There is also no clarity around the handling of D2 agonists for an LDCT [40,48,49], let alone the PKG's version of the LDCT. Many D2 agonists have a half-life that is long enough to have lingering effects in the morning if taken the night before. The CAPSIT protocol [2] and others [50] recommend ceasing agonists for 24 h before the LDCT, but few clinics adhere to this. D2 agonists did not appear to affect the outcome when using the PKG data to model the LDCT.

Routine Reporting of Levodopa Responsiveness with the PKG
While this study was intended to test the feasibility of measuring LR using the PKG system, it is relevant to speculate on how it may be incorporated into routine clinical care. If future studies confirm that the prediction of an individual PwP's LR is accurate, provided the PwP is awake and active from 30 or more minutes prior to consuming the first dose since retiring the previous evening until the dose is effective (~2 h later), then the LR could be considered as part of routine monitoring. Under these circumstances, it might be possible to report the abs∆ PKG and the "OFF" and "ON" PKG levels. An indicator of the reliability of the measure, based on the presence of excessive inactivity and sleep or the probability that an earlier dose has been taken despite advice to the contrary, might also be possible. Such a reporting mechanism is subject to further study.

Conclusions
A Machine Learning model using features from the PKG can be used to predict the LR obtained from an LDCT. The performance of the test for individual PwP is likely to be very high if two practical steps are taken:

• PwP are awake and active for 30 or more minutes prior to consuming the first dose.
• No PD medications are taken between retiring and the first dose.
These suggestions should be confirmed in future studies which also include a greater proportion of people with parkinsonian features that do not respond to levodopa. It does not appear that a first dose of under 200 mg levodopa substantially alters the modelling, but future studies may be needed to establish a more certain method for ensuring the first dose is adequate for eliciting an LR. This model was built with the assumption that a clinically useful LR was greater than 14 UPDRS III points. The LR in early PD and some other conditions is smaller, and a suitable model could be built in the future to address this need.
In summary, this approach seems likely to be effective in establishing levodopa responsiveness in most cases, provided the PKG is performed appropriately.