Is the Random Forest Algorithm Suitable for Predicting Parkinson's Disease with Mild Cognitive Impairment out of Parkinson's Disease with Normal Cognition?

Because it is possible to delay the progression of dementia if it is detected and treated in an early stage, identifying mild cognitive impairment (MCI) is an important primary goal of dementia treatment. The objectives of this study were to develop a random forest-based Parkinson's disease with mild cognitive impairment (PD-MCI) prediction model considering health behaviors, environmental factors, medical history, physical functions, depression, and cognitive functions using the Parkinson's Dementia Clinical Epidemiology Data (a national survey conducted by the Korea Centers for Disease Control and Prevention) and to compare the prediction accuracy of our model with those of decision tree and multiple logistic regression models. We analyzed 96 subjects (PD-MCI = 45; Parkinson's disease with normal cognition (PD-NC) = 51). The prediction accuracy of the model was calculated using the overall accuracy, sensitivity, and specificity. Based on the random forest analysis, the major risk factors of PD-MCI were, in descending order of magnitude, the Clinical Dementia Rating (CDR) sum of boxes, the Unified Parkinson's Disease Rating Scale (UPDRS) motor score, the Korean Mini Mental State Examination (K-MMSE) total score, and the Korean Montreal Cognitive Assessment (K-MoCA) total score. The random forest method achieved a higher sensitivity than the decision tree model. Thus, it is advisable to develop a protocol to easily identify early stage PDD based on the PD-MCI prediction model developed in this study, in order to establish individualized monitoring to track high-risk groups.


Introduction
Over the past decade, the field of geriatrics has experienced emerging interest in Parkinson's disease with mild cognitive impairment (PD-MCI) [1][2][3][4]. The Sydney cohort study [5], the most highly representative epidemiology study on the subject, examined 136 patients diagnosed with Parkinson's disease (PD) over 20 years. The study reported that 84% of PD patients had cognitive impairment, and 50% of them progressed to PD dementia (PDD). Likewise, PD is often accompanied by cognitive dysfunction in addition to dyskinesia [2].
The mild cognitive impairment (MCI) stage is the earliest at which we can detect dementia [6]. Because it is possible to delay the progression of dementia when it is detected and treated in an early stage, identifying MCI is an important primary goal of dementia treatment [6]. PD-MCI is frequently found in patients with PD [7,8]. However, the sociodemographic and neuropsychological characteristics of PD-MCI are less well-known than those of MCI and vascular mild cognitive impairment (vascular-MCI) [7,8]. The distinctive neuropsychological characteristics found in early stage

Data Source
This study was conducted using the Parkinson's Dementia Clinical Epidemiology Data obtained from the National Biobank of Korea, the Korea Centers for Disease Control and Prevention, the Republic of Korea (no. KBN-2019-005). We obtained the approval of the Research Ethics Review Board, the National Biobank of Korea (no. KBN-2019-005), and the data use approval of the Korea Centers for Disease Control and Prevention (no. KBN-2019-1327). The National Biobank of Korea was established in 2008 with the approval of the Ministry of Health and Welfare and is managed by the Korea Centers for Disease Control and Prevention in response to the emerging need to manage bio-data systematically at a national level. The ultimate goal of the National Biobank of Korea is to promote biomedical research and public health. Please refer to Lee et al. [26] for the specific activities of the National Biobank of Korea, including its quality control programs. The Parkinson's Dementia Clinical Epidemiology Data used in this study were collected under the supervision of the Korea Centers for Disease Control and Prevention at 14 tertiary care organizations (university hospitals) from January to December 2015. Health surveys, including health behavior questions, were conducted using computer-assisted personal interviews. The data are composed of sociodemographic factors (e.g., gender), environmental factors (e.g., exposure to pesticides), health behaviors (e.g., smoking), disease history (e.g., hypertension), exercise characteristics related to PD (e.g., tremor), sleep behavior disorders (e.g., rapid eye movement (REM)), and neuropsychological characteristics (e.g., cognitive function). PD-MCI was diagnosed by neuropsychologists according to the criteria of the International Working Group on MCI [27].

Subjects
Observational studies frequently utilize secondary data, and such studies are prone to data imbalance when comparing patients and healthy subjects [28]. Propensity score matching (PSM) was used to minimize selection bias and resolve the case-control imbalance [29]. This study found an imbalance between PD-NC and PD-MCI. To solve this issue, PSM with nearest-neighbor matching, controlling for the age of the case-control groups, was used to balance the two populations [30]. Moreover, subjects without a match in the other group were excluded to ensure good data balance. Before matching, there were 274 subjects (PD-MCI = 223; PD-NC = 51); after PSM, 96 subjects remained (PD-MCI = 45, PD-NC = 51; Figure 1). This study therefore analyzed 96 subjects.
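The greedy 1:1 nearest-neighbor step described above can be sketched as follows (an illustrative Python fragment, not the software used in the study; the subject IDs and ages are hypothetical):

```python
def nearest_neighbor_match(cases, controls):
    """Greedy 1:1 nearest-neighbor matching on a single covariate (age).
    cases/controls are lists of (subject_id, age); each control is used once."""
    available = list(controls)
    pairs = []
    for case_id, case_age in cases:
        if not available:
            break  # no unmatched controls left; remaining cases are dropped
        # pick the unused control whose age is closest to the case's age
        best = min(available, key=lambda c: abs(c[1] - case_age))
        available.remove(best)
        pairs.append((case_id, best[0]))
    return pairs
```

In practice, matching is performed on a propensity score estimated from the covariates; here age itself stands in for the score, mirroring the age-controlled matching described above.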

Development and Evaluation of Prediction Models
The prediction model was developed using a random forest algorithm, and its results were compared with those of models based on multiple logistic regression and a classification and regression tree (CART). The prediction accuracy of the models was evaluated using the recognition rate.
Random forests are ensemble classifiers that randomly learn multiple decision trees. The random forest method consists of a training step that constructs several decision trees and a test step that classifies or predicts an outcome variable based on an input vector. The trained ensemble can be expressed as Forest F = {f1, ..., fn} (Figure 2). The class distributions obtained from the decision trees of each forest were first averaged over T (the number of decision trees), and then classification was conducted. The predictions for each sample were combined using the mean for continuous target variables and the majority vote for categorical target variables. Random forest is similar to the bagging technique, because both approaches combine decision trees generated from multiple bootstrap samples using the majority vote principle in order to increase stability. However, they differ because the former uses a few explanatory variables randomly selected from each bootstrap sample.
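The training and voting steps above can be illustrated with a minimal, self-contained sketch (Python, purely for illustration; the study itself used R). Each "tree" here is reduced to a one-level stump, and `mtry` candidate features are drawn at random for each bootstrap sample:

```python
import random
from collections import Counter

def train_stump(sample, feature_idx):
    """Fit a one-level tree: split feature `feature_idx` at the sample mean."""
    thresh = sum(x[feature_idx] for x, _ in sample) / len(sample)
    left = [y for x, y in sample if x[feature_idx] <= thresh]
    right = [y for x, y in sample if x[feature_idx] > thresh]
    label = lambda ys: Counter(ys).most_common(1)[0][0] if ys else 0
    return (feature_idx, thresh, label(left), label(right))

def stump_accuracy(stump, sample):
    f, t, l, r = stump
    return sum((l if x[f] <= t else r) == y for x, y in sample)

def train_forest(data, n_trees, mtry, seed=0):
    """Grow n_trees stumps, each on a bootstrap sample using a random
    subset of mtry candidate features (the random forest idea)."""
    rng = random.Random(seed)
    n_features = len(data[0][0])
    forest = []
    for _ in range(n_trees):
        boot = [rng.choice(data) for _ in data]        # bootstrap sample
        feats = rng.sample(range(n_features), mtry)    # random candidates
        forest.append(max((train_stump(boot, f) for f in feats),
                          key=lambda s: stump_accuracy(s, boot)))
    return forest

def predict(forest, x):
    """Majority vote over all trees (categorical target)."""
    votes = [l if x[f] <= t else r for f, t, l, r in forest]
    return Counter(votes).most_common(1)[0][0]
```

With majority voting over many such weak trees, variance drops while the per-tree bias is retained, which is the stability argument made above.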
This study presented a partial dependence plot and variable importance to show the prediction power of the main explanatory variables. The variable importance indicates the effect of an explanatory variable on the accuracy of a model; when an explanatory variable improves the performance of a model, its importance increases. A partial dependence plot shows the change in the response variable as each explanatory variable varies continuously. The contribution of an independent variable to the dependent variable is expressed as a function of that variable. The partial dependence function is presented in Equation (2).
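The partial dependence computation can be sketched as: fix the explanatory variable of interest at each grid value for every subject, average the model's predictions, and plot the resulting curve. An illustrative Python fragment (`model_predict` stands in for any fitted classifier's prediction function):

```python
def partial_dependence(model_predict, data, feature_idx, grid):
    """For each grid value v, set feature `feature_idx` to v for every subject,
    average the model's predictions, and return the (v, mean prediction) pairs."""
    curve = []
    for v in grid:
        preds = []
        for x in data:
            x_mod = list(x)
            x_mod[feature_idx] = v   # hold this feature fixed at v
            preds.append(model_predict(tuple(x_mod)))
        curve.append((v, sum(preds) / len(preds)))
    return curve
```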
Theoretically, RF is resistant to overfitting and is not strongly affected by noise or outliers [20]. Moreover, it can generate highly accurate results by reducing generalization error [20]. However, the error curve of RF typically has an elbow point, after which adding more trees brings little further improvement. Moreover, each tree is more likely to become complex when unimportant explanatory variables are selected as split candidates. Therefore, this study improved the accuracy of the model by tuning mtry, the number of candidate explanatory variables, in advance.
The prediction performance of a model was validated while considering the overall accuracy, sensitivity, and specificity together. Sensitivity means the prediction accuracy for PD-MCI, while specificity indicates that for PD-NC. As the objective of this study was to develop a model that can predict PD-MCI, overall prediction accuracy and sensitivity were considered the most important criteria for evaluating prediction performance. When the overall prediction accuracies and sensitivities of two models were identical, their specificities were compared. This study first established a random forest model and then compared its results and accuracy with those of the models obtained from multiple logistic regression and CART. Forward selection based on standard likelihood ratio tests was used to select variables in the multiple logistic regression analysis. All of the statistical analyses were conducted using the "randomForest" package of R version 3.6.1 (R Foundation for Statistical Computing, Vienna, Austria).
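The three evaluation measures reduce to simple confusion-matrix ratios; a minimal sketch (Python, with PD-MCI coded as the positive class):

```python
def evaluate(y_true, y_pred, positive=1):
    """Overall accuracy, sensitivity (recall on PD-MCI, the positive class),
    and specificity (recall on PD-NC, the negative class)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return accuracy, sensitivity, specificity
```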

General Characteristics of the Subjects
The general characteristics of the subjects are presented in Table 2. Of the 96 subjects (after matching), 47.9% were male, 52.1% were female, 38.5% had a high school or higher level of education, 8.0% had a family history of PD, and 6.8% had a family history of Alzheimer's dementia. Additionally, 5.7%, 2.3%, 23.2%, and 40.0% of the subjects had a history of head injury (e.g., traumatic brain injury), stroke, diabetes, and hypertension, respectively.

Major Risk Factors of Random Forest-Based PD-MCI Prediction Model
A PD-MCI prediction model was established using random forests, and the results are presented in Figure 3. The random forest model estimated the major risk factors using the mean decrease in the Gini coefficient. The major risk factors of PD-MCI were, in descending order of magnitude, the CDR sum of boxes, the UPDRS motor score, the K-MMSE total score, and the K-MoCA total score. Among these factors, the CDR sum of boxes was the most important predictor of PD-MCI. In contrast, the importance of atrial fibrillation and stroke was zero.
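The mean decrease in the Gini coefficient used for this variable importance ranking accumulates, for each variable, the impurity reduction of every split made on it. A small illustrative sketch of the underlying quantities:

```python
def gini(labels):
    """Gini impurity of a set of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    probs = [labels.count(c) / n for c in set(labels)]
    return 1.0 - sum(p * p for p in probs)

def gini_decrease(parent, left, right):
    """Impurity decrease achieved by splitting `parent` into `left` and `right`;
    summing these over all splits on a variable gives its Gini importance."""
    n = len(parent)
    return (gini(parent)
            - (len(left) / n) * gini(left)
            - (len(right) / n) * gini(right))
```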

Comparison of the Accuracy of the Developed Prediction Models
This study varied mtry, the number of candidate explanatory variables considered at each split of the decision trees constituting the RF, from 5 to 15, and selected the value with the smallest out-of-bag (OOB) error. The changes in the OOB error are presented in Table 3. The optimal mtry for this study was 5, which showed the lowest error rate (34.4%).
When ntree (the number of trees) and mtry were set to 500 and 5, respectively, the final RF model of this study had an overall accuracy of 65.6%, a sensitivity of 70.6%, and a specificity of 60.0% (Table 4). On the other hand, the overall accuracy of CART was 67.7%, higher than that of RF, but its sensitivity was the lowest (51.1%). In Figure 4, the black line indicates the change in each error rate over the 500 bootstrap samples; the error rates became relatively stable after the number of bootstrap samples exceeded 150. The partial dependence plot for the CDR sum of boxes, the most important variable in the predictive model, is presented in Figure 5. The results showed that, when other factors were held constant, the risk of PD-MCI increased with a higher CDR sum of boxes (Figure 5).
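The OOB tuning described above can be sketched as follows (illustrative Python, not the R code used in the study). `train_tree` is a hypothetical stand-in for growing one tree; the stub trainer below ignores mtry, which a real implementation would use for feature subsampling:

```python
import random
from collections import Counter

def oob_error(data, n_trees, train_tree, seed=0):
    """Out-of-bag error: each subject is classified only by the trees whose
    bootstrap sample did not contain it; misclassifications are then averaged."""
    rng = random.Random(seed)
    n = len(data)
    votes = [Counter() for _ in range(n)]
    for _ in range(n_trees):
        in_bag = [rng.randrange(n) for _ in range(n)]
        tree = train_tree([data[i] for i in in_bag])
        for i in set(range(n)) - set(in_bag):        # out-of-bag subjects
            votes[i][tree(data[i][0])] += 1
    scored = [(y, v.most_common(1)[0][0])
              for (x, y), v in zip(data, votes) if v]
    return sum(y != p for y, p in scored) / len(scored)

def best_mtry(data, n_trees, make_trainer, candidates):
    """Try each candidate mtry and keep the one with the smallest OOB error,
    mirroring the tuning described in the text."""
    return min(candidates,
               key=lambda m: oob_error(data, n_trees, make_trainer(m)))

def make_trainer(mtry):
    """Stub trainer (mtry unused): a mean-threshold stump on feature 0."""
    def train(sample):
        thresh = sum(x[0] for x, _ in sample) / len(sample)
        left = Counter(y for x, y in sample if x[0] <= thresh)
        right = Counter(y for x, y in sample if x[0] > thresh)
        l = left.most_common(1)[0][0] if left else 0
        r = right.most_common(1)[0][0] if right else 1
        return lambda x: l if x[0] <= thresh else r
    return train
```

Because every tree leaves roughly a third of the subjects out of its bootstrap sample, the OOB error gives a validation estimate without a separate hold-out set, which matters for a sample as small as 96 subjects.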




Discussion
Diagnosing early stage PD-MCI is important in the health sciences, because it can delay the cognitive decline associated with PDD. Previous studies [22,40] have reported that impairment of executive function is a major cognitive feature of PDD. However, it is challenging to distinguish PD-MCI from PD-NC solely based on executive function. Therefore, we explored the major differential indicators of PD-MCI, taking into account sociodemographic variables, health habits, PD-related motor and non-motor symptoms, cognitive tests, and neuropsychological tests. We developed a PD-MCI prediction model based on random forests and confirmed that the CDR sum of boxes, the UPDRS motor score, the K-MMSE total score, and the K-MoCA total score were major predictors of PD-MCI. Among all of the neuropsychological screening tests, the CDR sum of boxes was the most important predictor for distinguishing PD-MCI from PD-NC. Therefore, when a neuropsychological test is performed to diagnose PD-MCI in patients with PD, CDR (sum of boxes) scoring should be conducted before other cognitive-language screening tests so as to achieve higher sensitivity.
Previous studies [41,42] examining the sociodemographic and emotional characteristics of PDD reported that depression is a main characteristic of PDD. For example, Aarsland et al. (2007) [41] evaluated 537 patients with PDD and observed that 58% of the patients had depression. However, in the present study, depression was not an important indicator for predicting PD-MCI. This result might differ from previous studies [41,42] because those studies compared healthy elderly individuals with individuals with PD-MCI, while the present study only examined people with PD. In other words, depression is potentially not a major differential indicator in this study because both the PD-NC and PD-MCI groups had high depression rates (31.3%). As only a few studies have tried to distinguish PD-MCI from PD-NC considering neuropsychological characteristics, health habits, and depression, more observational studies on PD-MCI are needed in order to verify the major predictors of PD-MCI.
Another meaningful finding of this study is that the sensitivity of random forests is higher than that of the decision tree model. These results agree with the results of previous studies predicting MCI [6] or cardiovascular disease in the elderly using random forests [43]. The prediction accuracy of random forests is higher than that of regression models or decision trees, because random forests are based on the bagging algorithm, which generates diverse decision trees using 500 bootstrap samples. As outliers can form decision tree nodes, the effects of the parameters that determine nodes are substantial, and, consequently, carry a risk of overfitting [44]. In contrast, random forests based on the bagging algorithm can prevent overfitting, because they reduce variance while maintaining tree bias. Moreover, random forests achieve a higher prediction accuracy than decision trees [45]. In addition, one advantage of random forests is their reduction of variance compared with the bagging model, which is achieved by decreasing the correlation between trees [43]. Random forests show a particularly better prediction accuracy than bagging models when there are many input variables [43]. Therefore, when selecting the key independent variables from a dataset containing many independent variables, such as the disease data used in this study, or developing prediction models on big data, random forests provide a higher accuracy than decision tree or multiple logistic regression models.
The merit of this study was the development of an MCI prediction model using examination data from a national survey. The limitations of this study are the following: (1) The number of study subjects was small. (2) The obsessive-compulsive symptoms commonly observed in patients with PD were not examined. (3) The prediction model did not include a biomarker, such as cerebrospinal fluid (CSF) markers. (4) This study adjusted the balance of the number of subjects between the groups by using age-matched PSM to solve the problem of unbalanced data. However, as a result of the PSM, a number of samples were excluded from the analysis, and the sample size decreased. As a result, the overall accuracy, sensitivity, and specificity of the multiple logistic regression analysis were not calculated. Moreover, the age used for matching could not be used as an explanatory variable in the predictive model. Future studies will require more advanced techniques that can reduce the probability of overfitting and minimize imbalance, in addition to PSM. (5) Subjects taking PD medications (e.g., dopaminergics) were not evaluated. As PD medication particularly affects the expression of cognitive and behavioral symptoms, future studies should consider whether or not a subject takes medication.

Conclusions
It is necessary to develop a protocol that can easily identify early stage PDD in order to establish individualized monitoring for tracking high-risk groups based on the PD-MCI prediction model developed in this study. Moreover, to further increase the prediction accuracy of the present method, a random forest model using weighted voting is warranted. In addition, the development of multi-modal data-based machine learning models that include biomarkers and brain imaging test indicators, as well as sociodemographic factors, health habits, and neuropsychiatric indicators, is needed.