Improving the Accuracy of Diagnosis for Multiple-System Atrophy Using Deep Learning-Based Method

Simple Summary
Diagnosis of neurodegenerative diseases requires examination of a variety of characteristics. A definitive diagnosis is obtained using a comprehensive evaluation of family history, neurological findings, brain imaging, genetic testing, and other medical information. Multiple-system atrophy (MSA) is a neurodegenerative disease associated with autonomic dysfunction, parkinsonism, and cerebellar ataxia, and early diagnosis is difficult because the disease changes over time. The aims of this study were to examine whether machine learning can improve diagnostic accuracy using MSA case data from a national survey, and to identify the features that are important for differentiation among MSA subtypes using machine learning.

Abstract
Multiple-system atrophy (MSA) is primarily an autonomic disorder with parkinsonism or cerebellar ataxia. Clinical diagnosis of MSA at an early stage is challenging because the symptoms change over the course of the disease. Recently, various artificial intelligence-based programs have been developed to improve the diagnostic accuracy of neurodegenerative diseases, but most are limited to the evaluation of diagnostic imaging. In this study, we examined the validity of diagnosis of MSA using a pointwise linear model (deep learning-based method). The goal of the study was to identify features associated with disease differentiation that were found to be important in deep learning. A total of 3377 registered MSA cases from FY2004 to FY2008 were used to train the model. The diagnostic probabilities of SND (striatonigral degeneration), SDS (Shy-Drager syndrome), and OPCA (olivopontocerebellar atrophy) were estimated to be 0.852 ± 0.107, 0.650 ± 0.235, and 0.858 ± 0.270, respectively. In the pointwise linear model used to identify and visualize features involved in individual subtypes, autonomic dysfunction was found to be a more prominent component of SDS compared to SND and OPCA. Similarly, respiratory failure was identified as a characteristic of SDS, dysphagia was identified as a characteristic of SND, and brain-stem atrophy was identified as a characteristic of OPCA.


Introduction
Multiple-system atrophy (MSA) is a neurodegenerative disorder characterized by progressive autonomic dysfunction, parkinsonism, and cerebellar and pyramidal features that occur in various combinations [1]. MSA used to be classified into olivopontocerebellar atrophy (OPCA), striatonigral degeneration (SND), and Shy-Drager syndrome (SDS); however, in the first diagnostic consensus on MSA, cases were classified as MSA-P for those with parkinsonism and MSA-C for those with cerebellar ataxia [2]. The term SDS, which had been used to describe MSA cases with prominent autonomic dysfunction, was formally taken out of use. In the second diagnostic consensus, MSA cases were categorized as definite, probable, and possible [3]. Since diagnosis of definite MSA requires pathological autopsy, early confirmation of a probable or possible case is clinically significant for patient management and disease-modifying therapy. In the second consensus, probable MSA was defined as a sporadic, progressive, and adult-onset (age ≥ 30 years) case with autonomic dysfunction and poorly L-DOPA-responsive parkinsonism (bradykinesia with rigidity, tremor, or postural instability) or cerebellar syndrome (gait ataxia with cerebellar dysarthria, limb ataxia, or cerebellar oculomotor dysfunction), while possible MSA was defined as a sporadic, progressive adult-onset case including parkinsonism or cerebellar ataxia and at least one feature suggesting autonomic dysfunction plus one other feature that may be a clinical or neuroimaging abnormality.
The accuracy of MSA diagnosis using the second consensus is 71% for probable MSA and 60% for possible MSA [4]. Therefore, an improvement in diagnostic accuracy is needed. Recently, artificial intelligence has been used to improve the diagnostic accuracy of neurodegenerative diseases, including MSA [5]. For example, differentiation of MSA from Parkinson's disease was examined using brain imaging with computed tomography (CT) and magnetic resonance imaging (MRI) [6]. However, although neurological findings and other medical information are important in MSA diagnosis [7], few studies have examined the early diagnosis of neurodegenerative diseases by machine learning using datasets obtained in clinical practice as training data [8]. The issues associated with machine learning include standardization of the dataset [9] and evaluation of the obtained diagnostic probability, i.e., how to define the thresholds of certainty [10].
The second consensus diagnostic criteria for MSA do not cover all early-stage MSA cases [11]. SDS is not included as a subtype in this consensus but is known to be a clinical form of MSA with autonomic dysfunction as the primary symptom. The usefulness of the SDS concept in planning therapeutic trials is being evaluated, as early onset of autonomic dysfunction is a poor prognostic factor [12]. Therefore, we decided to use machine learning and conventional statistical methods to determine the features that influence diagnosis of the MSA subtypes in the earlier classification. The aim of the study was to evaluate the diagnostic accuracy of machine learning for the OPCA, SND, and SDS subtypes based on diagnoses by neurologists and to identify the important variables in classifying these subtypes.
Conventional statistical methods are mainly models based on linear or logistic regression between a small number of variables and outcomes. Machine learning can derive a broader range of standard variables using a neural network [13], but conventional machine learning is limited in visualization of this process [14]. Therefore, in this study, we used a deep learning-based method based on a pointwise linear model [15,16], which expresses the relationship between each explanatory variable and the target variable as a weight. We show that machine learning can increase the diagnostic accuracy of MSA to >80%, and we identify the important features for diagnosis of MSA and their relationship with each MSA subtype using the pointwise linear model. This method improves the diagnostic accuracy for early MSA and demonstrates the effectiveness of machine learning in validating diagnostic criteria.

Ethics
This study was performed under the ethical guidelines for medical and biological research involving human subjects issued by the Ministry of Education, Culture, Sports, Science, and Technology (MEXT), the Ministry of Health, Labor, and Welfare (MHLW), and the Ministry of Economy, Trade, and Industry (METI) in Japan. The ethics committee of the National Center of Neurology and Psychiatry approved the study (A2019-056; 14 January 2021). All patients gave written informed consent for registration in the Specified Disease Treatment Research Program. After submission of informed consent forms and approval of a review committee, including neurologists in their respective prefectural governments, personal information was anonymized, and cases were registered in the MHLW database [17]. The anonymized data were provided to us for analysis by the MHLW (9 March 2021).

Data and Diagnosis
Data were obtained from forms submitted to and digitized by the MHLW between FY2004 and FY2008. We first excluded duplicate cases and those without essential demographic data, such as onset age. Cases included in data analysis fulfilled the diagnostic criteria for MSA established by the MHLW Research Committee on MSA.

Items
The following demographic and clinical features were obtained from the forms submitted for cases with MSA: sex, age, symptoms at onset, mode of onset, progression, neurological findings, autonomic findings, other neurological findings, brain images on CT and MRI, activities of daily living (ADL), and medication (Table 1); walking capacity, standing capacity/eyes open, finger-to-nose test, and knee-tibia test on the International Cooperative Ataxia Rating Scale (ICARS) [18]; gait abnormalities due to parkinsonism on the Unified Parkinson Disease Rating Scale (UPDRS) [19]; and bent posture, posture stability, tremor at rest, rigidity, finger taps, and rising from a chair on the Unified Multiple System Atrophy Rating Scale (UMSARS) [20]. Except for walking, ADLs were classified on a three-point scale (without assistance, with assistance, and unable). Walking was classified on a four-point scale (without assistance, with assistance, with a wheelchair, and unable) [21].

Classification of MSA Subtypes
A pointwise linear model, a deep learning-based method, was used for classification of MSA subtypes [22]. To evaluate over-learning in discriminant boundary generation, the data were divided into training and test datasets. Furthermore, to optimize the hyperparameters and evaluate prediction performance of the model, we conducted 10-fold double cross-validation using the training dataset, and the model with the best prediction accuracy (highest AUC) was adopted. The best hyperparameters are listed in Table 2. In this study, 60 items were used as explanatory variables, and diagnosis was used as the objective variable (Table 1). Of the 3377 cases registered from FY2004 to FY2008, 3220 (851 SND, 359 SDS, and 2010 OPCA) were included, after exclusion of 157 cases with missing data. Ten cases of each MSA subtype were held out as the test dataset, and the remaining 3190 cases were used as the training dataset.
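The split described above can be illustrated with a minimal sketch. It assumes simple random sampling with a fixed seed; the function names `hold_out_per_class` and `k_folds` are hypothetical and are not taken from the study's code. Only the outer 10-fold partition is shown, whereas double cross-validation would repeat a similar split inside each training portion to tune hyperparameters.

```python
import random
from collections import defaultdict


def hold_out_per_class(labels, n_per_class=10, seed=0):
    """Return (train_idx, test_idx), drawing n_per_class test cases at random from each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    test_idx = []
    for y in sorted(by_class):
        test_idx.extend(rng.sample(by_class[y], n_per_class))
    test_set = set(test_idx)
    train_idx = [i for i in range(len(labels)) if i not in test_set]
    return train_idx, sorted(test_idx)


def k_folds(indices, k=10, seed=0):
    """Shuffle the indices and deal them into k folds of nearly equal size."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


# Class sizes reported in the text: 851 SND, 359 SDS, 2010 OPCA (3220 cases).
labels = ["SND"] * 851 + ["SDS"] * 359 + ["OPCA"] * 2010
train_idx, test_idx = hold_out_per_class(labels, n_per_class=10)
folds = k_folds(train_idx, k=10)
```

With the reported class sizes, this yields 30 held-out cases and 3190 training cases, matching the counts in the text.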

Extraction of Important Features Involved in Classification of MSA Subtypes
Diagnoses by neurologists were classified into three subtypes: SND, SDS, and OPCA. A pointwise linear model (implemented in PyTorch 1.5.1, Python 3.7.4) was used to identify important features in these diagnoses. Two datasets were used: dataset (a), which included the rank order of items, and dataset (b), which did not include the rank order of items. Both datasets excluded medication information. The feature variables of the two datasets were classified into binary variables (B), categorical variables (C), ordinal variables (O), and quantitative variables (Q). All features included in datasets (a) and (b) are listed in Table 3. Binary variables were encoded as 1 or 0. One-hot encoding was used for categorical variables. Quantitative variables were normalized (mean = 0, standard deviation = 1), and ordinal variables were expressed on a scale of 1, 2, 3, etc., corresponding to the order of ranks. The final numbers of feature variables in datasets (a) and (b) were 58 and 126, respectively. The predictive performance of the pointwise linear model was calculated using the area under the curve (AUC) evaluated by 10-fold double cross-validation (DCV). The one-vs.-rest strategy was used, in which a multiclass classification is split into one binary classification problem per class. Ultimately, the mean of the three different AUCs was considered to be the predictive performance.
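The four encodings described above can be sketched as follows. This is an illustrative sketch, not the study's preprocessing code: the helper names are hypothetical, the ages are made-up values, and the use of the population standard deviation for normalization is an assumption. The walking levels are the four categories given in the Items section.

```python
from statistics import mean, pstdev

# Four-point walking scale taken from the Items section.
WALKING_LEVELS = ["without assistance", "with assistance", "with a wheelchair", "unable"]


def encode_binary(value, positive):
    """Binary variable (B): 1 if the value equals the positive level, else 0."""
    return 1 if value == positive else 0


def one_hot(value, categories):
    """Categorical variable (C): one indicator column per category."""
    return [1 if value == c else 0 for c in categories]


def encode_ordinal(value, ordered_levels):
    """Ordinal variable (O): rank 1, 2, 3, ... in the order of the levels."""
    return ordered_levels.index(value) + 1


def z_scores(values):
    """Quantitative variable (Q): rescale to mean 0 and standard deviation 1."""
    mu = mean(values)
    sd = pstdev(values)  # assumption: population SD; the study does not say which was used
    return [(v - mu) / sd for v in values]


# Dataset (a) style: walking capacity kept as a single ordinal column.
walking_ordinal = encode_ordinal("with a wheelchair", WALKING_LEVELS)
# Dataset (b) style: the same item expanded into one column per level.
walking_one_hot = one_hot("with a wheelchair", WALKING_LEVELS)
# Hypothetical onset ages, for illustration only.
ages_z = z_scores([60.0, 65.0, 70.0])
```

Expanding each ordinal item into one column per level is what makes dataset (b) wider than dataset (a) (126 versus 58 feature variables).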
To evaluate important features in each diagnosis, the importance score was defined using the weight vector. First, we calculated the sample-wise importance score $s_k^{(n)}$ from the weight tailored for the $k$-th feature $x_k^{(n)}$ of sample $n$ by the pointwise linear model. Next, for each subtype (e.g., SDS), the top 10% of features with sample-wise importance scores that were the largest in the classification model for each subtype were determined for each patient. Finally, the importance score was defined for each feature as the rate of samples whose top 10% features contained the feature.
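The last two steps of this procedure (top 10% of features per sample, then the rate of samples containing each feature) can be sketched as below. The sketch takes the sample-wise scores as given and does not assume how they are computed from the weights. The function names, the rounding rule for "top 10%", and the toy score matrix are all hypothetical.

```python
def top_features(scores, fraction=0.10):
    """Indices of the highest-scoring `fraction` of features for one sample."""
    # Assumption: the number of top features is rounded to the nearest integer, at least 1.
    n_top = max(1, round(fraction * len(scores)))
    order = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    return set(order[:n_top])


def importance_scores(sample_scores, fraction=0.10):
    """For each feature, the rate of samples whose top features contain it."""
    n_samples = len(sample_scores)
    n_features = len(sample_scores[0])
    counts = [0] * n_features
    for scores in sample_scores:
        for k in top_features(scores, fraction):
            counts[k] += 1
    return [c / n_samples for c in counts]


# Toy sample-wise scores: 4 samples x 10 features (values are made up).
toy_scores = [
    [0.9, 0.1, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.8, 0.1, 0.3, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.1, 0.7, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.6, 0.2, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
]
toy_importance = importance_scores(toy_scores)
```

In the toy data, feature 0 is the top feature in three of four samples, so its importance score is 0.75. Under the threshold used later in the text, a score ≥ 0.30 means the feature was among the top 10% in at least 30% of patients.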

Statistics
Descriptive statistics reported as counts (percentage) were used to describe the characteristics of the patients included in the analysis. A Kruskal-Wallis one-way analysis-of-variance-by-ranks test was performed to compare variables between MSA subtypes. All p-values are reported to three decimal places, with those less than 0.001 reported as p < 0.001. χ² tests were used to compare categorical variables. Residual analysis was performed to determine which cell numbers in the cross-table represented sources of bias (p < 0.05) when significant bias was observed in a χ² test (p < 0.05). All analyses were performed using STATA ver. 17.0 (Stata Corporation LLC, College Station, TX, USA).
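As an illustration of the χ² test followed by residual analysis, the sketch below computes the Pearson χ² statistic and adjusted standardized residuals for a cross-table. The study used STATA; this Python version is only a stand-in, the function name is hypothetical, and the counts are invented. Using adjusted residuals with a cutoff of |residual| > 1.96 for p < 0.05 is a common convention and is assumed here rather than taken from the text.

```python
import math


def chi_square_with_residuals(table):
    """Return the Pearson chi-square statistic and adjusted standardized residuals."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    residuals = []
    for i, row in enumerate(table):
        res_row = []
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
            # Variance term of the adjusted standardized residual.
            variance = expected * (1 - row_totals[i] / n) * (1 - col_totals[j] / n)
            res_row.append((observed - expected) / math.sqrt(variance))
        residuals.append(res_row)
    return chi2, residuals


# Invented counts: rows are three subtypes, columns are finding present / absent.
toy_table = [
    [30, 10],
    [20, 40],
    [25, 25],
]
chi2, residuals = chi_square_with_residuals(toy_table)
# Cells with |residual| > 1.96 would be flagged as sources of bias.
flagged = [(i, j) for i, row in enumerate(residuals) for j, r in enumerate(row) if abs(r) > 1.96]
```

In this toy table, the first row has more "present" cases than expected and the second row has fewer, while the third row matches its expected counts and is not flagged.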

Patient Characteristics
A progressive course characterized the three MSA subtypes, and all had onset in the late 60s. Among early symptoms, ataxia (90.6%) was significantly more common in OPCA, parkinsonism (87.6%) was significantly more common in SND, and autonomic dysfunction (72.9%) was significantly more common in SDS. SND showed significantly more severe neurological findings than the other subtypes, and OPCA had a trend toward poorer finger-to-nose and knee-tibia tests and a higher frequency of dysarthria (79.3%). SDS showed significantly more autonomic dysfunction, especially respiratory failure (43.5%), than the other MSA subtypes. Brain CT and MRI revealed cerebellar atrophy common to all subtypes, with significant incidences of striatal atrophy/signal abnormality (58.7%) in SND and of brain-stem atrophy (79.3%) and a hot-cross-bun sign (47.9%) in OPCA. Regarding medication, dopamine receptor stimulants (40.5%) and amantadine hydrochloride (24.8%) were commonly used in SND, taltirelin hydrate (35.7%) was commonly used in OPCA, and droxidopa (37.7%) was commonly used in SDS (Table 4).

Diagnostic Probability Using the Point-Wise Linear Model
Ten cases each were randomly selected from the SND (851 cases), SDS (359 cases), and OPCA (2010 cases) groups diagnosed by neurologists and used as the test dataset for the deep learning-based method. The remaining cases were used as the training dataset. The AUC in the 10-fold DCV was 0.958 ± 0.001 in the training set and 0.959 ± 0.012 in the test dataset. The deep learning-based method resulted in high accuracies with a diagnostic probability of 0.852 ± 0.107 for SND and 0.858 ± 0.270 for OPCA. In contrast, SDS showed a significantly lower diagnostic probability of 0.650 ± 0.235 compared to SND and OPCA (p < 0.05). The diagnostic probability of SND, SDS, and OPCA for each case is shown in Table 5. Of the SDS cases classified as other subtypes, two were assigned to SND (cases 1 and 8) and one was assigned to OPCA (case 5); among OPCA cases, one was categorized as SND (case 2). In contrast, all SND cases were classified as SND.
Table 5. The deep learning-based method was used to analyze SND, SDS, and OPCA cases diagnosed by neurologists to determine the diagnostic probability for each subtype. Columns with the highest diagnostic probability were colored.

Identifying Important Features Using the Pointwise Linear Model
The pointwise linear model was used to extract important features that were closely associated with the diagnosis for each of the three MSA subtypes.

Verification of the Prediction Performance for the Pointwise Linear Model
To investigate whether a specific rank in ordinal variables contributed to the classification of each diagnosis, we generated models with and without consideration of the rank order of items. Important features (score ≥ 0.3) were initially extracted from dataset (a) (number of feature variables: 58) or (b) (number of feature variables: 126) (Table 3). The AUCs for the models were calculated using 10-fold DCV. For each fold, a model was determined using the training set, and then the trained model was evaluated using the test set. The prediction performance for each learning model was evaluated as the mean AUC over the 10 folds. The AUCs for the training and test sets were 0.954 ± 0.001 and 0.956 ± 0.010, respectively, in the classification model including the order of neurological findings, and 0.962 ± 0.001 and 0.960 ± 0.010, respectively, in the model that did not include this order.
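The per-fold evaluation can be illustrated with a minimal sketch of a one-vs.-rest AUC averaged over the three subtypes. The pairwise-comparison formula for AUC is standard; the function names and the toy probabilities are hypothetical and do not come from the study. In a 10-fold evaluation, this value would be computed for each fold and then averaged.

```python
def binary_auc(scores, is_positive):
    """AUC as the probability that a positive case outscores a negative one (ties count 0.5)."""
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def one_vs_rest_mean_auc(probabilities, labels, classes):
    """Split the multiclass problem into one binary problem per class and average the AUCs."""
    aucs = []
    for c_index, c in enumerate(classes):
        scores = [p[c_index] for p in probabilities]
        aucs.append(binary_auc(scores, [y == c for y in labels]))
    return sum(aucs) / len(aucs), aucs


classes = ["SND", "SDS", "OPCA"]
# Made-up predicted probabilities for six cases (columns follow `classes`).
toy_labels = ["SND", "SND", "SDS", "SDS", "OPCA", "OPCA"]
toy_probs = [
    [0.80, 0.10, 0.10],
    [0.60, 0.30, 0.10],
    [0.50, 0.25, 0.25],  # an SDS case that the toy model scores closer to SND
    [0.20, 0.70, 0.10],
    [0.10, 0.20, 0.70],
    [0.20, 0.10, 0.70],
]
mean_auc, per_class_auc = one_vs_rest_mean_auc(toy_probs, toy_labels, classes)
```

In the toy data, only the SDS-vs.-rest AUC falls below 1 (one SDS case receives a lower SDS score than a non-SDS case), which lowers the mean accordingly.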

Extraction of Important Features Closely Associated with the Diagnosis for the MSA Subtypes
Features with importance scores ≥ 0.30 were extracted from the pointwise linear model, i.e., features found to be important in ≥30% of patients. In the model including the order of severity, the finger-to-nose test ranked highest for all types, and ataxia onset was common to all types. Findings specific to each type included striatal atrophy/signal abnormality (score = 0.798) for SND, brain-stem atrophy (score = 0.347) for OPCA, and respiratory failure (score = 0.796) for SDS (Table 6a). If the order of severity was not taken into account (Table 6b), features common to all subtypes included autonomic dysfunction onset, parkinsonism onset, syncope, and striatal atrophy/signal abnormality. Common features were respiratory failure and head-up tilt test (positive) for SDS and OPCA, ataxia onset for SND and OPCA, and erectile dysfunction for SND and SDS. Dysphagia (score = 0.343) and walking capacity (normal) (score = 0.342) were specific to SND, severe constipation (score = 0.414) was specific to OPCA, and toileting (without assistance) (score = 0.433), urinary disturbance (score = 0.372), and urinary incontinence (score = 0.341) were specific to SDS. To understand whether the occurrence of an event tended to raise the probability in an MSA subtype, the median regression coefficients (weights) tailored to each case are listed in Table 6, in addition to the score. Note that the presence/absence registration in the registry form is reversed from the usual relationship (presence = 1, absence = 0) because presence = 1, and absence = 2. Therefore, a negative value indicates a stronger correlation, while a positive value indicates a weaker correlation. On the other hand, for items classified as "O" in Table 3 (neurological findings and ADL), positive values indicate a positive correlation (severe disability). Therefore, finger-to-nose test, rigidity, and walking capacity are treated as "O" in Table 6a and "C" in Table 6b. 
In Table 6a, the weights for the finger-to-nose test were positive and those for ataxia onset were negative in all subtypes. In Table 6b, the weight for autonomic dysfunction was −0.254 in SDS and that for ataxia onset was 0.366 in OPCA. For brain images from CT/MRI, the weights for striatal atrophy/signal abnormality were positive (0.111 and 0.130) in SDS and OPCA and negative (−0.161) in SND.
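The sign convention can be checked with a small worked example. It assumes, as a simplification, that an item contributes weight × coded value to a linear score and that the weight stays fixed when the item changes; in a pointwise linear model the weights are tailored to each case, so this is only a first-order reading. The function name is hypothetical, and the weight −0.254 is the value quoted above for autonomic dysfunction in SDS.

```python
def score_shift_when_present(weight, present_code, absent_code):
    """Change in weight * value when an item goes from absent to present."""
    return weight * (present_code - absent_code)


# Registry coding used in this dataset: presence = 1, absence = 2.
shift_registry = score_shift_when_present(-0.254, present_code=1, absent_code=2)

# Usual coding, shown for contrast: presence = 1, absence = 0.
shift_usual = score_shift_when_present(-0.254, present_code=1, absent_code=0)
```

Under the registry coding, a negative weight raises the score when the finding is present (a shift of +0.254 here), which is why negative values are read as a stronger association for presence/absence items.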

Discussion
MSA is a complex disease caused by a combination of parkinsonism, cerebellar ataxia, and autonomic dysfunction, and clinical presentation varies from onset of MSA [1]. The pathogenesis of MSA is thought to be accumulation of insoluble α-synuclein in neurons and oligodendroglia, which leads to progressive neurodegeneration [23]. In the latest international diagnostic consensus, diagnosis of MSA is classified into three stages: definite, probable, and possible. Since definite MSA requires an autopsy, a probable or possible clinical diagnosis is significant for disease management and selection of disease-modifying treatment [24]. However, in the current consensus, the probability of diagnosis is <70% [4]. A clinical trial targeting α-synuclein as a disease-modifying treatment for MSA is being planned, but since MSA is suspected only when clinical findings become apparent, early diagnosis is needed for development of treatment strategies [24].
Recent studies have examined utilization of artificial intelligence to improve the diagnostic accuracy of neurodegenerative diseases. However, these studies have mainly been limited to differential diagnosis using images and have not utilized medical information such as family history and neurological findings [6]. Moreover, it is challenging to make an early diagnosis by brain imaging alone, and a comprehensive evaluation of all medical information is necessary to improve diagnostic accuracy [25]. MSA is also a rare disease, and a nationwide collection of MSA cases requires the same diagnostic criteria, survey questions, and case validation framework to ensure data uniformity. For this reason, we used case information registered in a uniform nationwide survey. In addition, since newly registered cases were used, the data were likely to be from a relatively early point after MSA was suspected, as shown by the characteristics of the cases in Table 4.
Conventional statistical methods can reveal differences among subtypes, but machine learning has the advantage of linking all explanatory and objective variables for each case [26]. Thus, the diagnostic probability can be given for each case according to the MSA subtype with which it is most strongly associated. The MSA cases used in this study were not classified as MSA-C or MSA-P, but as the subtypes of SND, SDS, and OPCA, and we trained the model on these three subtypes to verify the diagnostic validity. As shown in Table 5, some cases that were registered as SDS were classified as SND or OPCA. However, about 70% of the cases remained as SDS, which indicates that this subtype has features that differ from those of OPCA and SND. The diagnostic accuracies for SND and OPCA, which are strongly characterized by parkinsonism and cerebellar ataxia, respectively, exceeded 90%, suggesting that each of these populations has a homogeneous component.
One limitation of machine learning is that the processing method is hidden, which limits the understanding of the relationship between objective and explanatory variables [14]. Therefore, we used a pointwise linear model, which can display the relationship between explanatory and objective variables as coefficients, to identify features involved in the diagnosis of each MSA subtype. This model was applied to the relationship between clinical information and diagnostic results for all items shown in Table 1. We note that the cases were registered at an early stage when MSA was suspected; hence, only the details of the medication could be ascertained, and results for drug responsiveness are still needed. We then tried to extract the important features that influence the classification of each subtype, excluding medication information. The accuracy of the machine learning model for MSA obtained using the pointwise linear model was not significantly affected by the amount of information used or the information structure, as the AUCs all showed high accuracy of ≥0.95, regardless of whether the severity of each item was ordered or not. Since the inclusion of a large amount of information can result in over-learning, which may lead to a decrease in model accuracy [27], we attempted to verify whether accuracy could be improved by narrowing down the items, but no significant differences were found. Therefore, the items ranked high for each condition in Table 6 were identified as important features for differentiation among the MSA subtypes.
Results from the corresponding component analysis using the pointwise linear model showing the impact of autonomic dysfunction (A), parkinsonism (P), and cerebellar ataxia (C) components for each subtype are shown in Figure 1. When the order of severity was considered (Figure 1a), the A component was more important in SDS than in SND and OPCA, the P component (rigidity, parkinsonism onset, and striatal atrophy/signal abnormality) was more important in SND, and the C component (finger-to-nose test, brain-stem atrophy, and ataxia onset) was more important in OPCA. The finger-to-nose test and ataxia onset (C component) were commonly ranked high in all subtypes, reflecting the significantly greater impairment and more frequent occurrence of OPCA compared to the other two subtypes (Table 4).
Figure 1. The features in Table 6a,b were divided into three components that were strongly associated with autonomic dysfunction (A), parkinsonism (P), and cerebellar ataxia (C). Analyses were performed with (a) and without (b) considering the order of severity of items for neurological findings and ADL. Based on the weight value, those with a strong relationship with the subtype were designated as P (positive) and those with a weak relationship with the subtype were designated as N (negative).

The head-up tilt test (positive) and urinary disturbance (A component) ranked high for SND and OPCA, but not for SDS, for which respiratory failure, syncope, and urinary incontinence ranked higher. These respective items tended to be significantly higher in SDS than in SND and OPCA, at 76.9%, 43.5%, and 57.8%, respectively (Table 4). In particular, respiratory failure, which is a poor prognostic factor in MSA, ranked high for SDS and may be a unique feature determining prognosis for this subtype [28]. For neurological findings and ADL, each item had more subitems according to the degree of disability. If order was not considered, the relationships among the 84 subitems included in these items and other items could be evaluated (Figure 1b). In contrast to Figure 1a, the P component commonly ranked high in all subtypes. In particular, parkinsonism onset and striatal atrophy/signal abnormality were common in all subtypes, while dysphagia was specific for SND and apraxia was specific for SDS. The higher frequency of cases assigned to this subtype compared to the other subtypes may be due to these specific findings. It is of note that dysphagia was identified as an important feature for SND because this condition is an additional feature of possible MSA-P and a poor prognostic factor for MSA-P [29].
Since the score indicates the importance of each item in the diagnosis, it is necessary to clarify whether this contributes positively or negatively to the diagnosis. Therefore, we examined the involvement of each item in the diagnosis in terms of positive and negative correlations using the median regression coefficient (weight) obtained in the pointwise linear model. With consideration of the rank order of the items, the finger-to-nose test tended to be positive, with a positive weight for all subtypes. In addition, ataxia onset was negative for all subtypes because presence/absence was reversed in the dataset structure (presence = 1, absence = 2) from the general dataset structure (presence = 1, absence = 0), indicating a tendency for all subtypes to have ataxia onset in common (Table 6a). Similarly, without considering the rank order of items, SDS tended to be associated with autonomic dysfunction and was less likely to be associated with parkinsonism (Table 6b). On the other hand, SDS showed a weight of −0.030 for apraxia, which indicates that apraxia is a significant differential feature in SDS. SND tended to be accompanied by erectile dysfunction as a form of autonomic dysfunction, and walking capacity (normal) had a weight of −0.103. Given that the severity of walking capacity was among the important features of the model that considered the order of severity of items, this suggests that "inability to walk normally" is more important than the severity of walking capacity in the diagnosis of SND. In OPCA, autonomic dysfunction and parkinsonism were less likely to be present.
This study shows that use of machine learning can improve the diagnostic accuracy for MSA. However, important items in differential diagnosis need to be identified using a pointwise linear model, so that future studies of complex conditions such as MSA can be conducted. This will be a key aspect of development of diagnostic criteria for MSA.

Conclusions
In this study, we examined the feasibility of machine learning for differential diagnosis of MSA, which is characterized by a complex interplay of autonomic dysfunction, parkinsonism, and cerebellar ataxia that changes over time, and we identified important features in the diagnosis. Unlike conventional statistical methods that capture the characteristics of MSA subtypes, we were able to determine the influence of features that may have been overlooked in diagnosis by considering relationships among all the variables. Although poorly L-DOPA-responsive parkinsonism is a diagnostic criterion for MSA, treatment response could not be included in the machine learning because of the lack of data on response assessment after L-DOPA medication in this study. In this regard, it is not easy to objectively assess L-DOPA responsiveness at early diagnosis. On the other hand, it may be possible to predict the prognosis from the information at the time of initial diagnosis by applying machine learning to the long-term course of medical treatment.