Cluster Analysis of Categorical Variables of Parkinson’s Disease Patients

Parkinson’s disease (PD) is a chronic disease. No treatment stops its progression, and it presents symptoms in multiple areas. One way to understand the PD population is to investigate the clustering of patients by demographic and clinical similarities. Previous PD cluster studies included scores from clinical surveys, which provide a numerical but ordinal, non-linear value. In addition, these studies did not include categorical variables, as the clustering method utilized was not applicable to categorical variables. It was discovered that the numerical values of patient age and disease duration were similar among past cluster results, pointing to the need to exclude these values. This paper proposes a novel and automatic discovery method to cluster PD patients by incorporating categorical variables. No estimate of the number of clusters is required as input, whereas the previous cluster methods require a guess from the end user in order for the method to be initiated. Using a patient dataset from the Parkinson’s Progression Markers Initiative (PPMI) website to demonstrate the new clustering technique, our results showed that this method provided an accurate separation of the patients. In addition, this method provides an explainable process and an easy way to interpret clusters and describe patient subtypes.


Introduction
Parkinson's disease (PD) is a disabling and progressive disease that is prevalent in the ageing population [1]. James Parkinson first described PD in 1817 in a publication titled "An Essay on the Shaking Palsy", distinguishing between tremors at rest and during motion [1]. The main pathological feature of PD is the degeneration of neuromelanin-containing neurons and the presence of or increase in Lewy bodies (LBs), densely packed proteins, in the remaining neurons and other areas of the central nervous system [1]. The cause of the degeneration is unknown, but it results in a loss of dopamine. Men may be 1.5 times more likely than women to be diagnosed with PD; other possible risk factors include a family history of Parkinson's disease, environmental factors, and even personality traits. People with a family history of the disease may have twice the risk [1]. PD diagnosis is based on a clinical examination to determine whether any of the four motor symptoms are present: tremors at rest, rigidity, bradykinesia, and postural instability. Bradykinesia is the most disabling feature, affecting everything from fastening buttons to handwriting to the stopping of one or both arms swinging while walking, whereas tremors, involuntary movements caused by muscle contractions, are the presenting feature in most cases [1].
Standardized rating scales attempt to quantify disease progression and severity, but they rely on interpretation, are subjective, and yield ordinal values. The surveys contain successive categories to choose from, but successive categories do not represent equal differences of a measured attribute; hence, the resulting data are ordinal and categorical [2]. Because these scales are ordinal, their resulting scores are nonlinear values that do not provide a quantifiable progression or severity level, even though past cluster

Materials and Methods
Given the limitations of existing clustering algorithms, a universal clustering discovery tree is proposed; a patent on this method is pending. It is universal because it can handle a multitude of data types, including categorical (in numerical or text format), numerical, and mixed datasets. Discovery refers to the method's automatic discovery of clusters, without requiring any input or guess from the user as to how many clusters may exist. In this tree-like approach, each variable is separated by its attributes, one variable at a time. The process can be viewed in Figure 1. Starting with the first variable, its two attributes are separated out. Then, the attributes of the next variable are separated out, again shown for a two-attribute variable, although any number of attributes can be evaluated with this method. The discovery tree continues until the last variable's attributes are separated out.
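The level-by-level separation described above can be sketched in Python. This is a minimal illustration and not the authors' patent-pending implementation: it assumes that each tree level simply partitions every existing group by the attribute values of the next variable. The function name `discovery_tree`, the field names, and the toy records are hypothetical.

```python
from collections import defaultdict


def discovery_tree(records, variables):
    """Split records one variable at a time; return leaves keyed by attribute path."""
    groups = {(): list(records)}  # root: all records, empty attribute path
    for var in variables:  # one tree level per variable
        next_groups = defaultdict(list)
        for path, members in groups.items():
            for rec in members:
                # Extend the path with this record's attribute for the variable.
                next_groups[path + (rec[var],)].append(rec)
        groups = dict(next_groups)
    return groups


# Hypothetical toy records (not PPMI data)
patients = [
    {"id": 1, "sex": "M", "family_history": 0, "side": "L"},
    {"id": 2, "sex": "M", "family_history": 0, "side": "L"},
    {"id": 3, "sex": "F", "family_history": 1, "side": "R"},
    {"id": 4, "sex": "M", "family_history": 1, "side": "R"},
]

clusters = discovery_tree(patients, ["sex", "family_history", "side"])
for path, members in clusters.items():
    print(path, [m["id"] for m in members])
```

In this sketch each leaf key is the path of attribute values, so a cluster can be described by reading its key, and the number of leaves is found from the data rather than supplied by the user.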
The discovery tree method was developed so that variable order does not affect the final cluster results and the same result is produced every time; the lack of this property is a major limitation of existing clustering algorithms, as discussed earlier. A larger variable subset, such as a larger subset of males compared to females in the dataset, will not negatively affect the cluster result, which addresses another limitation of existing methods. The method automatically discovers the total number of clusters, eliminating the need for end-user input and making this a truly unsupervised method. This method can cluster all types of variables, including categorical, discrete, text, and mixed datasets. For continuous numerical variables, the proposed method provides two conversion options: the end user can convert the continuous variables to discrete numbers or to categorical sets prior to clustering. To highlight how the proposed method works, its application to the categorical variables of a Parkinson's disease patient dataset is explored in the Results section. Descriptive statistical analysis of the numerical variables is applied to the largest patient clusters.
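The source does not specify how the two conversions are carried out, so the following is only one plausible sketch of converting a continuous variable into a categorical set before clustering. The cut points, labels, and the helper `to_category` are illustrative assumptions, not values from the study.

```python
from bisect import bisect_right


def to_category(value, edges, labels):
    """Map a continuous value to the label of the interval it falls in.

    `edges` must be sorted; `labels` needs len(edges) + 1 entries.
    A value equal to an edge is assigned to the interval above it.
    """
    return labels[bisect_right(edges, value)]


# Hypothetical cut points for age of onset, in years (not from the source)
AGE_EDGES = [40, 60]
AGE_LABELS = ["<40", "40-59", ">=60"]

ages = [33.5, 59.9, 60.0, 71.2]
categories = [to_category(a, AGE_EDGES, AGE_LABELS) for a in ages]
print(categories)
```

Once converted, such a variable could be treated like any other categorical variable by the tree; the choice of cut points would directly affect how many clusters are discovered.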
The results of this clustering method can be evaluated with existing clustering metrics. One of these metrics is the silhouette coefficient, an unsupervised measure that incorporates both cohesion and separation [3]. It is also referred to as the silhouette score or index, and it is calculated as follows [3]. For the ith object, calculate its average distance to all other objects in its cluster; this is $a_i$. For the ith object and any cluster not containing the object, calculate the object's average distance to all the objects in the given cluster; then, find the minimum such value with respect to all clusters; this is $b_i$. For the ith object, the silhouette coefficient is $s_i = (b_i - a_i) / \max(a_i, b_i)$. The silhouette score takes values in the range from −1 to 1, where a value close to −1 means the objects are poorly clustered, a value close to 1 means the objects are tightly clustered, and a value equal to zero indicates an indifferent case [11]. An overall measure of the goodness of a clustering method can be obtained by computing the average silhouette coefficient of all points [3].
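The calculation steps above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the software used in the study: the Hamming distance is assumed as the dissimilarity between categorical attribute tuples, singleton clusters are assigned a score of 0 (a common convention), and the toy data are hypothetical.

```python
def hamming(x, y):
    """Number of positions at which two attribute tuples differ."""
    return sum(a != b for a, b in zip(x, y))


def silhouette_scores(points, labels, dist=hamming):
    """Return the silhouette coefficient s_i for every object."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            # Assumption: a singleton cluster gets s_i = 0.
            scores.append(0.0)
            continue
        # a_i: average distance to the other objects in the same cluster
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b_i: smallest average distance to the objects of any other cluster
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


# Hypothetical coded attribute tuples (not PPMI data)
points = [(0, 0), (0, 1), (1, 1), (1, 1)]
labels = ["A", "A", "B", "B"]
scores = silhouette_scores(points, labels)
average = sum(scores) / len(scores)
print(scores, average)
```

In this toy case cluster B holds two identical objects and scores 1.0 for each, while cluster A mixes two different attribute tuples and scores lower, which pulls the average down.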
The Parkinson's Progression Markers Initiative (PPMI) was initiated to identify biomarkers of Parkinson's disease progression through biologic sampling and clinical and behavioral assessments. PPMI is taking place at clinical sites in the United States, Europe, Israel, and Australia. Data and samples from study participants will enable the development of a comprehensive Parkinson's database and biorepository available to the scientific community [12]. The PPMI dataset utilized for this study contained 423 patients, of whom 277 (65.5%) were male and 146 (34.5%) were female. In total, 19 of the 423 patients (4.5%) were under the age of 40 (25-39 years of age).
The median disease duration was 4.3 months, ranging from weeks to 35.8 months (less than 3 years). In this dataset, 75% of patients did not have a family history of PD. In addition, 234 patients (55.3%) had motor symptoms that affected the right side of their body, 179 (42.3%) had symptoms that affected the left side, and 10 (2.4%) had symptoms that affected both sides. In terms of motor symptoms present at diagnosis, 78.3% had tremor symptoms, 75.6% had rigidity symptoms, and 82.3% had bradykinesia symptoms. The patient number variable was retained as this was a metadata point that referred back to the patient in the spreadsheet. The variables retained for analysis are displayed in Table 1. PD patients with symmetrical symptoms were removed from the study because they made up a small percentage (2.4%) of the total. One patient was removed because of a missing data point for family history, leaving a total of 412 patients for cluster analysis. The information on this dataset, after pre-processing, is in Table 2. Median disease duration remained the same as in the original dataset at 4.3 months, with the same range. Of the 412 patients, 76% (312) did not have a family history of the disease, whereas 24% (100) had a family history of the disease. In addition, 233 patients (56.6%) had motor symptoms that affected the right side of their body, whereas 179 (43.4%) had symptoms that affected the left side. In terms of motor symptoms present at diagnosis, the percentages remained the same.
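The two exclusion rules described above can be sketched as a simple filter. The record layout, field names, and the helper `preprocess` are hypothetical; only the counts in the final lines come from the text, and the subtraction assumes the patient with the missing value was not among the ten symmetric cases.

```python
def preprocess(records):
    """Drop symmetric-symptom patients and patients with missing family history."""
    kept = []
    for rec in records:
        if rec["side"] == "both":  # symmetric symptoms excluded
            continue
        if rec["family_history"] is None:  # missing data point excluded
            continue
        kept.append(rec)
    return kept


# Hypothetical toy records (not PPMI data)
records = [
    {"patient": 101, "side": "right", "family_history": 0},
    {"patient": 102, "side": "both", "family_history": 1},
    {"patient": 103, "side": "left", "family_history": None},
    {"patient": 104, "side": "left", "family_history": 1},
]
kept = preprocess(records)
print([r["patient"] for r in kept])

# Counts reported in the text: 423 patients, 10 symmetric, 1 missing value
remaining = 423 - 10 - 1
print(remaining)
```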

Results
This section provides the results of the universal clustering discovery tree applied to the six categorical variables of the PPMI dataset. The two numerical variables, age of onset and disease duration, were analyzed during the post-analysis of the cluster results. Starting with biological sex, the male (0) and female (1) attributes were divided into two distinct groups. The next variable, family history, was also split into two groups, one per attribute value (with a family history of PD and no family history of PD), under each of the male and female divisions. This separation continued with the attributes of the remaining categorical variables: dominant (symptom) side, symptom 1 (resting tremor), symptom 2 (rigidity), and symptom 3 (bradykinesia), present at disease diagnosis.
Utilizing Tableau's plotting functions to display the result, each variable was selected and placed into the plotting space, starting with the biological sex variable. From the 412 PD patients, 47 clusters were automatically discovered, ranging in size from 1 patient to 66 patients. The results are shown in Figure 2, where the division of variable attributes can be viewed. The patient clusters, which are listed one per row, can be described by reciting each attribute value. For example, cluster one (row one) contains male patients (1) with a family history of PD (1), affected on the left side (1), and with symptom 3 (bradykinesia, 1) present at diagnosis. This cluster contained three patients, as indicated by the orange color coding.
There was a total of 15 clusters containing 51 patients (12% of the patients) who had 1 motor symptom and 23 clusters with 154 (37%) patients who had 2 motor symptoms. The clusters with patients with 1 motor symptom included 9 patients with bradykinesia, 2 patients with rigidity, and 40 patients with tremors, present at diagnosis. The clusters with patients with 2 motor symptoms included 76 patients with rigidity and bradykinesia, 48 patients with tremors and bradykinesia, and 30 patients with tremors and rigidity. As the number of symptoms increased, the number of patients increased in this dataset, as viewed in Figure 3.
In addition, the clusters with the largest number of patients, the five bottom rows, contained patients who had all three motor symptoms. A further review of the clusters in which patients had all three motor symptoms present at diagnosis identified eight clusters, outlined in orange. These eight clusters contain exactly 50% of the entire dataset, with 206 of the 412 patients. The four largest male clusters with all three motor symptoms contained 142 of 268 patients, i.e., more than half of the male subset (53%). The four largest female clusters with all three symptoms contained 64 of 144 patients, i.e., less than half of the female subset (44%).
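The reported counts can be cross-checked with a little arithmetic. The sketch below uses only figures stated in the text; the remainder of one cluster with one patient is an inference from the totals rather than a figure given in the source, and the upper bound of 2^6 assumes that each of the six categorical variables has exactly two attributes after pre-processing.

```python
TOTAL_PATIENTS = 412
TOTAL_CLUSTERS = 47

# number of motor symptoms -> (clusters, patients), as reported in the text
reported = {1: (15, 51), 2: (23, 154), 3: (8, 206)}

clusters_listed = sum(c for c, _ in reported.values())
patients_listed = sum(p for _, p in reported.values())

# Share of the dataset per symptom count, rounded to whole percent
shares = {k: round(100 * p / TOTAL_PATIENTS) for k, (_, p) in reported.items()}

# What the totals leave unaccounted for (an inference, not a reported figure)
unlisted_clusters = TOTAL_CLUSTERS - clusters_listed
unlisted_patients = TOTAL_PATIENTS - patients_listed

# Patients with all three symptoms, by sex
male_share = round(100 * 142 / 268)
female_share = round(100 * 64 / 144)

# Upper bound on leaves for six two-attribute variables
max_clusters = 2 ** 6

print(shares, unlisted_clusters, unlisted_patients)
print(male_share, female_share, max_clusters)
```

Under these assumptions, the 47 discovered clusters are a subset of at most 64 possible attribute combinations, meaning some combinations do not occur in the dataset.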
Descriptive statistical measures were calculated for the five largest categorical cluster results from the universal discovery tree method. The largest male cluster contained 66 patients and the largest female cluster contained 33 patients, as shown in Table 4. The age of onset and disease durations were similar among the clusters, with age of onset median ranging from 59.5 to 61.4 years and disease duration median ranging from 3.0 to 4.3 months. This aligns with previous cluster analysis results and is expected as the age of onset for PD tends to be a limited age range, from 60 years of age and older, with a limited number of younger patients. This set of patients also had limited disease durations of less than 3 years, hence the similarities among durations in the five largest clusters. This supports the need to include categorical variables for PD patients and exclude variables that are similar among the population or original dataset.
Previous PD patient cluster studies did not utilize nor report silhouette scores of the cluster results. As mentioned earlier, an average silhouette score is commonly utilized for evaluating cluster results. Cluster validity measures tend to define cohesion, separation, or a combination of these, and can be applied to overall cluster results and individual clusters. The silhouette score incorporates both cohesion and separation [3]. A high silhouette score points to similarities among the data points within the clusters.
Utilizing the software program, IBM ® SPSS Statistics, for silhouette score calculation, a high score of 1.0 was computed for the 47 discovered clusters, as seen in Figure 4. This demonstrates the high accuracy of this method when applied to a categorical dataset. This result is expected because the variable attributes are creating the splits, meaning no attribute will be placed in an incorrect cluster, and the patients in each cluster have identical attributes.
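The expectation of a score of 1.0 can be made explicit with a short derivation. It assumes the standard silhouette definition $s_i = (b_i - a_i)/\max(a_i, b_i)$, a dissimilarity $d$ that is zero for identical attribute vectors and positive otherwise (for example, the Hamming distance), and a cluster $C_k$ with at least two members; the symbols $C_k$, $C_l$, $x_i$, and $d$ are notation introduced here for illustration.

```latex
\begin{align*}
a_i &= \frac{1}{|C_k| - 1} \sum_{j \in C_k,\, j \neq i} d(x_i, x_j) = 0
  && \text{(all members of } C_k \text{ are identical)} \\
b_i &= \min_{l \neq k} \frac{1}{|C_l|} \sum_{j \in C_l} d(x_i, x_j) > 0
  && \text{(every other cluster differs in at least one attribute)} \\
s_i &= \frac{b_i - a_i}{\max(a_i, b_i)} = \frac{b_i - 0}{b_i} = 1
\end{align*}
```

For clusters that contain a single patient, $a_i$ is undefined and the value assigned depends on the convention of the software used, so the derivation applies as written only to clusters with two or more members.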

Discussion
Past PD cluster analysis studies utilized clinical scales to define the presence and severity of motor and non-motor symptoms; however, these scores do not define disease progression or severity in a quantifiable way but instead provide an ordinal classification result. In addition, a set of these studies defined and calculated disease progression by dividing the scale scores by the time since diagnosis, which is incorrect and does not convert the scale score(s) to a quantifiable value. Cluster analysis needs to be conducted with accurate patient demographics, disease symptoms, and treatment results. In addition, past studies did not utilize the categorical variables of biological sex, family history, or dominant body side affected by the disease in the cluster analysis; including them provides a new way to describe patient subtypes.
The majority of past studies utilized K-means clustering, a common method in which a predefined number of clusters is supplied prior to analysis. Some of the studies did not report how the number of clusters was chosen. This selection is important, as an inaccurately chosen value will produce inaccurate patient assignments. In some studies, K values were based on a previous study and not on the dataset under review; this is incorrect. In addition, the K-means method has several limitations, including not yielding the same result across runs because of its random initial assignments [4], not taking the data distribution into consideration [8], and being unable to handle data outliers [3].
In past PD cluster studies, silhouette scores were not reported. An average silhouette score is commonly utilized for evaluating cluster results. A high silhouette score points to similarities among the data points within the clusters. Future studies with a rigorous design, standardized with respect to the included variables, data processing, and clustering analysis technique, may advance the knowledge of PD subtypes [7].

Conclusions
This paper outlined a universal clustering discovery tree applied to a categorical dataset. This study utilized a PD patient dataset from the Parkinson's Progression Markers Initiative (PPMI) website to demonstrate the proposed clustering technique and to test its accuracy on the categorical variables. A total of 47 clusters were discovered with the 6 categorical variables. These results yielded a perfect average silhouette score of 1.0. As with any method, end users have to balance accuracy with results management; for end users who are looking for the most accurate method, which may provide a large number of resulting clusters, this will be the method of choice. A large number of clusters does not point to a weakness in this method but instead to the diversity that may occur among patients and disease symptoms. In addition, this clustering method is simple to use for medical practitioners and researchers, and its results are interpretable and explainable. The next step is to automate this new clustering technique for widespread ease of use. The expectation is that further use of this method will provide a direction for treating Parkinson's disease patients, with a focus on personalized medicine, treatment in clusters, or a mixture of both applications.

Patent
The universal, clustering discovery tree proposed is patent pending.

Conflicts of Interest:
The authors declare no conflict of interest.