Neurological diseases strain health systems and pose a considerable ongoing burden on healthcare resources. Parkinson’s Disease (PD) has been reported as one of the fastest-growing neurological disorders in terms of prevalence and deaths [1
]. A large, global burden of disease study identified PD as one of the top 5 leading causes of death from neurological disorders in the US [2
]. There were an estimated 6.1 million people with PD (PwP) globally in 2016, a sharp increase from 2.5 million in 1990 [1].
Diagnosis of PD requires subjective in-clinic assessment, which incurs logistical costs. Crucially, consultant neurologists may misdiagnose PD in up to about 20% of cases, and the accuracy of symptom monitoring is inherently limited by intra- and inter-rater variability in the standard clinical scales used to assess the severity of PD symptoms [3]. Given these practical constraints and the limitations of subjective assessment, there is an urgent, unmet need for diagnostic support tools for the objective detection and monitoring of PD.
PD is a neurodegenerative disease characterized by four cardinal signs: tremor, bradykinesia, rigidity, and postural instability [5
]. Most PwP also experience some form of speech performance degradation as a result of PD [6
]. For this reason, the potential of acoustic analysis of speech signals for developing PD decision support tools has been pursued vigorously, and with considerable success, over the last 10–15 years. Encouragingly, studies have proposed voice-based technologies that use acoustic analysis to: (1) differentiate PwP from controls [7
], (2) monitor the symptom severity of PD [11
], (3) assess voice rehabilitation in PD [15
], (4) identify at-risk participants (i.e., those with isolated Rapid Eye Movement (REM) sleep behavior disorder as confirmed by a polysomnography test) [16
], (5) identify participants with a higher genetic predisposition for developing PD (i.e., those with a mutation in the Leucine-Rich Repeat Kinase 2 (LRRK2) gene) [17
], and (6) predict a range of clinical scores that quantify participants’ motor symptoms, cognition, daytime sleepiness, depression, and overall state of health [18
]. A limitation of these studies, however, is that they typically rely on high-quality voice recordings collected under carefully controlled acoustic conditions with high-end specialized equipment.
Recently, to assess the scalability of voice as a population screening tool for PD, we undertook the largest PD characterization study employing telephone-quality voice [19
], which we refer to as the Parkinson’s Voice Initiative (PVI) study. PVI is the first large-scale study of its kind to collect speech data from PwP and control participants under free-living acoustic conditions. Arora et al. [19] sought to discriminate PwP from controls using sustained vowel phonations (International Phonetic Alphabet /a:/) collected from participants in 7 countries under acoustically non-controlled conditions.
The use of sustained phonations for quantifying vocal impairment is well established [20
]. However, our understanding of how dysphonia measures extracted from sustained phonations vary across participants with different linguistic backgrounds is still rather limited. Historically, the use of sustained vowels has been motivated by the fact that they can be considered generic (certain vowels such as /a/ occur across different languages), and hence the processing of sustained vowel phonations overcomes linguistic differences [20
]. In their analyses, Arora et al. (2019) [19
] relied on the underlying assumption that sustained vowel phonations are considered generalizable across people from different linguistic backgrounds, pooling together all the data from PVI. Tsanas and Arora (2021) [22
] investigated the differences in dysphonia measures between UK- and US-English speaking PwP, and reported that although the two cohorts agree closely on classical acoustic measures (such as jitter and shimmer), they differ markedly on some of the more advanced acoustic measures. Given that phonations may be language-dependent, this raises the further question of whether acoustic analyses should be performed separately for participants from different linguistic backgrounds, alongside cross-cohort comparisons. This study is therefore a natural extension of the work undertaken by Arora et al. (2019) [19]: we focus on a stratified analysis of sustained phonations using voice recordings from participants from a single linguistic background, specifically the US-English cohort.
The paper is organized as follows. Section 2 presents the data, followed by the methodology used for acoustic analysis comprising data pre-processing, feature extraction, feature selection, classification, and evaluation strategy. Section 3 presents the results, focusing on describing the most salient dysphonia measures that differentiate PwP from controls, along with the out-of-sample classification results. Discussions and directions for future research are provided in Section 4. Conclusions are provided in Section 5.
We investigated the potential of differentiating between PwP and controls using telephone-recorded speech collected under acoustically non-controlled conditions, utilizing different statistical machine learning techniques and strategies. This study is part of our wider goal to explore whether we can develop a PD screening tool that is readily accessible, accurate, and ideally free of charge; this goal is the underlying reason we set up the PVI study, from which the data for this study were drawn. We demonstrated 67.34% balanced accuracy using 27 acoustic features fed into an SVM with a standard 10-fold CV approach. This finding was further verified on an additional out-of-sample unbalanced dataset, where we found a balanced accuracy of 66.3% (sensitivity: 65.09%, specificity: 67.49%). Overall, this performance is very similar to what we had previously reported in Arora et al. (2019) (66.4% balanced accuracy); however, it has now been achieved using 27 acoustic features rather than the 100 features reported in the aforementioned study, and so is a more parsimonious result.
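As a quick consistency check on the held-out figures quoted above, the following minimal sketch assumes the usual definition of balanced accuracy as the arithmetic mean of sensitivity and specificity. The function names and the confusion-matrix counts in the second example are illustrative assumptions and are not taken from the study's code or data.

```python
def balanced_accuracy(sensitivity, specificity):
    """Arithmetic mean of sensitivity and specificity (same units as inputs)."""
    return (sensitivity + specificity) / 2.0


def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity and specificity (in %) from confusion-matrix counts."""
    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (tn + fp)
    return sensitivity, specificity


# Held-out sensitivity and specificity quoted in the text (in %).
held_out_ba = balanced_accuracy(65.09, 67.49)
print(f"held-out balanced accuracy: {held_out_ba:.2f}%")

# Hypothetical confusion-matrix counts, for illustration only.
sens, spec = sensitivity_specificity(tp=50, fn=50, tn=75, fp=25)
print(f"illustrative balanced accuracy: {balanced_accuracy(sens, spec):.1f}%")
```

The mean of the quoted sensitivity and specificity is about 66.29%, which rounds to the reported 66.3%. Because each class contributes equally to the mean, the metric is not inflated by the majority class in an unbalanced test set.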
Unlike our previous exploration of the ability of the PVI dataset to differentiate PwP from controls, here we used only the US cohort. This was motivated by findings in some of our earlier investigations that some of the feature distributions are different across the PVI cohorts [22
], which suggests that we should carefully consider stratifying the PVI data and investigating cohorts independently. We aim to explore transfer learning approaches [40
] to account for covariate shift between the different datasets in the PVI study (given that data were collected across 7 countries and participants from different countries may come from different linguistic backgrounds, e.g., English or Spanish).
Placing the results in the wider context of the research literature, this study’s findings are very modest, given that we had previously reported more than 98% accuracy in binary differentiation between PwP and controls using a similar protocol to collect sustained vowel /a/ phonations [10
]. Similarly, other research groups had indicatively reported accuracies around and over 90% in this binary differentiation application [8
]. However, we stress that previous work had focused on collecting data under carefully controlled acoustic conditions (e.g., sound-treated booths, using high-quality standardized microphones [10
]), whereas in PVI participants self-enrolled using their own devices, which have different specifications in terms of microphone quality and frequency attenuation characteristics, in their own environments, which typically had some background noise, and over different telephone networks. Moreover, unlike most research studies, participants in PVI were not screened or clinically assessed for study enrollment, and thus we cannot rule out the presence of clinical-pathologic differences in voice within this cohort. Collectively, all these ‘degrees of freedom’ lead to lower-quality data, and considerable performance degradation is therefore to be expected. For example, some of the most successful nonlinear dysphonia measures in this application rely on high frequencies (2.5–10 kHz) to compute the ‘noise’ component in the recorded signal (see [10
] for details). Given that the sampling rate in PVI is 8 kHz (and therefore the useful recorded information is up to 4 kHz according to the Nyquist sampling theorem), this constrains the extraction of clinically informative features.
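The bandwidth constraint described above can be made concrete with a small calculation. The sketch below assumes an ideal sampler, so that the usable band ends exactly at half the sampling rate; the helper names are hypothetical and are not part of the PVI processing code.

```python
def nyquist_khz(sampling_rate_khz):
    """Highest representable frequency for a given sampling rate."""
    return sampling_rate_khz / 2.0


def usable_band_khz(band_low_khz, band_high_khz, sampling_rate_khz):
    """Part of [band_low, band_high] that lies below the Nyquist frequency.

    Returns a (low, high) tuple, or None if the band is entirely out of reach.
    """
    upper = min(band_high_khz, nyquist_khz(sampling_rate_khz))
    if upper <= band_low_khz:
        return None
    return (band_low_khz, upper)


# Band used by some nonlinear dysphonia measures (from the text): 2.5-10 kHz.
# PVI sampling rate (from the text): 8 kHz.
band = usable_band_khz(2.5, 10.0, 8.0)
fraction = (band[1] - band[0]) / (10.0 - 2.5)
print(f"usable band: {band[0]}-{band[1]} kHz ({fraction:.0%} of the original band)")
```

Under this idealized assumption, only the 2.5–4 kHz portion of the band is available at an 8 kHz sampling rate, i.e., about one fifth of the frequency range these measures were designed to use.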
Speech impairment is commonly associated with Parkinson’s [40
] and is characterized by pitch monotonicity, variable rate, imprecise consonants, breathiness, and harshness. As opposed to other types of speech signals that are often used in clinical assessments, such as running speech and reading aloud a linguistically rich pre-specified text, e.g., the Grandfather Passage [20
], the use of sustained phonations helps circumvent challenges associated with different accents and linguistic confounds [20
]. For example, our previous work has shown that sustained phonations can provide high accuracy in differentiating PwP from controls [10
], and has offered other interesting insights in the speech-PD literature, including replicating PD symptom severity metrics and assisting PD rehabilitation [10]. We also emphasize that the methodology adopted in this study for processing sustained vowels has previously been generalized to analyze different types of speech, e.g., voice fillers [42
], and to provide useful insights more widely in different biomedical speech signal processing applications [43
]. Therefore, the use of sustained vowels is strongly motivated and has been practically vindicated. A further practical consideration is that this study draws data from PVI, where data were collected across 7 countries with participants coming from different linguistic backgrounds [19
]. One of the aims of PVI was to provide cross-linguistic comparisons for the assessment of PD within a short time span of speech samples from a large, self-selected population group. Therefore, for practical reasons and to minimize participant burden, we decided in PVI to collect sustained vowels exclusively. For these reasons, this study focuses on analyzing sustained phonations. Nevertheless, we remark that alternative speech types, e.g., running speech, might carry additional acoustic information that is not captured in sustained vowels (although we stress that the argument goes both ways: sustained vowels may capture information not accounted for in running speech). An interesting line of future work would be to evaluate the efficacy of telephone-quality sustained phonations in conjunction with running speech to develop screening tools for PD.
The participants in this study were entirely self-selected: they were prompted to answer the question ‘Do you have Parkinson’s disease?’, and their response was treated as the gold standard (or label) for statistical mapping. In the absence of detailed clinical assessments, we cannot rule out clinical-pathologic differences in voice within this cohort, which could be one of the factors contributing to the relatively low discrimination accuracy reported in this study. It is worth noting that diagnosis and monitoring of PD require in-person subjective assessment, typically by a trained neurologist, which can incur substantial logistical costs in resource-constrained and remote settings. Thus, we deemed it necessary to rely only on self-reported PD status. Specifically, the data collection protocol of PVI was designed with the objective of developing a population-based screening (and not monitoring) tool for PD, which would have the potential to transform current practices by reducing the logistical costs associated with in-person clinical assessments, while exploring alternative routes to recruiting participants for clinical trials.
This study builds on our previous work on PVI [19
] and acoustic analysis [10
] to almost completely automate the data processing pipeline. In principle, it may be useful to apply auditory-perceptual analysis, relying on human expertise, to analyze the data and potentially identify problems, e.g., highly aperiodic or overly noisy signals, and also to perceptually characterize the signals (producing additional features). This is indeed often done in studies with a small number of speech samples and with speech signals of a different nature (e.g., running speech, counting days, reading pre-specified linguistically rich text, etc.). However, auditory-perceptual analysis is not commonly used when processing sustained vowels, at least in the biomedical speech signal processing literature. Moreover, it would be practically very challenging and costly given the size of the available data in PVI. Instead, developing automated pattern recognition tools combined with statistical machine learning offers a replicable, objective, automated, and directly scalable approach. This has enabled us to automatically identify, for example, highly aperiodic and noisy signals, which were discarded from further analysis (for details on the algorithm see our previous work [19]).
We explored three different feature selection methods, as well as standard feature transformation using PCA, to reduce the dimensionality of the dataset. The features transformed using PCA led to consistently worse results, and hence these results are not presented in the paper due to space constraints. The three feature selection algorithms led to quite different feature subsets (results not shown), and SIMBA combined with an SVM provided somewhat better overall performance on the balanced dataset, where we applied the standard CV approach. Therefore, Figure 2 reports the performance of the classifiers as a function of the number of features progressively selected by SIMBA.
SVMs and RF worked considerably better than AdaBoost in this application (see Figure 2). In our experience with this and related PD problems using classification tools, bagging approaches generally tend to outperform boosting approaches, although we do not have a theoretical justification for this finding. SVMs led to the best overall result, which is broadly in agreement with our empirical observations in related studies on Parkinson’s applications; we have previously reported that SVMs slightly outperform RF in binary classification problems, whereas RF generally leads to better outcomes in multiclass classification problems [21
]. Again, this should be considered cautiously, as it is based on our experience in related applications, and we make no further claims about the generalizability of this finding. We remark that the three classifiers used here are indicative of some commonly used methods; there are many alternative classifiers that could be explored. For example, an interesting line of further research would be to compare different classification methods, including deep learning. Moreover, it would be worth exploring different classifiers in further detail in conjunction with different class balancing schemes and model validation strategies.
There are different model validation strategies that could be explored, and the choice is particularly important here because of the highly unbalanced nature of the dataset. In principle, when using a single dataset it is useful to perform CV (e.g., 5-fold or 10-fold CV, along with additional iterations for statistical confidence) rather than leaving a single portion of the data out for testing (‘the testing dataset’). This is because we often want to assess the model’s robustness to perturbed training/test data, while also assessing variability in performance across folds (and iterations) to provide an estimate of the generalization performance including a confidence interval. However, the highly unbalanced nature of the problem given the available dataset in this study poses considerable challenges when using a standard CV approach. Therefore, we decided on a strategy that uses both model validation approaches: we retained a completely separate subset of the data for testing at the very end, and used a balanced subset with 3000 randomly selected samples (which overcomes problems with highly unbalanced data) for a standard training/testing scenario using 10-fold CV. This enables us both to assess the model’s performance in a ‘classifier-friendly’ binary classification setting with a balanced dataset, where we can also provide a confidence interval on the estimates (see Figure 2), and to test the model’s performance on an additional unbalanced subset (see Figure 3).
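The validation strategy described above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the study's pipeline: the labels are synthetic, the class sizes (1800 PwP and 12,000 controls) are invented, the balanced subset is assumed to hold 1500 samples per class (the text gives only the total of 3000), and the function names are hypothetical.

```python
import random


def split_balanced_and_held_out(labels, n_per_class, seed=0):
    """Draw n_per_class random indices per class; the rest are held out."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    balanced, held_out = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        balanced.extend(members[:n_per_class])
        held_out.extend(members[n_per_class:])
    return balanced, held_out


def stratified_folds(indices, labels, k=10, seed=0):
    """Assign indices to k folds so that each fold keeps the class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for idx in indices:
        by_class.setdefault(labels[idx], []).append(idx)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        for position, idx in enumerate(members):
            folds[position % k].append(idx)
    return folds


# Synthetic labels: 1 = PwP, 0 = control (class sizes are invented).
labels = [1] * 1800 + [0] * 12000
balanced, held_out = split_balanced_and_held_out(labels, n_per_class=1500)
folds = stratified_folds(balanced, labels, k=10)

train_sizes = []
for test_fold in folds:
    test_set = set(test_fold)
    train = [i for i in balanced if i not in test_set]
    train_sizes.append(len(train))
    # A classifier would be fitted on `train` and scored on `test_fold` here.

print(len(balanced), len(held_out), train_sizes[0])
```

The held-out indices never enter the cross-validation loop, so they remain available for a single final evaluation on data with the original, unbalanced class proportions.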
We remark that the developed SVM model was further validated on an unbalanced ‘held-out’ dataset (see Figure 3), where we observe that most PwP were correctly detected. The false positive rate is still fairly high, and there is ample room for improving these results before the model can be meaningfully used as an accurate clinical decision support tool. Nonetheless, the findings in Figure 3 suggest that this freely accessible tool for PD screening might be a useful direction and could be complemented with additional modalities (e.g., smell [44
] and smartphone-based tests [11
]) to form a more accurate and practical tool that people could use periodically for mobile check-ups and that could potentially facilitate referrals for specialized physical neurological assessment.
This study has some key limitations, primarily regarding the quality of the speech dataset. The standard recommendation of the speech community is that speech signals should be sampled at a frequency of at least 20 kHz for clinical applications because there is useful information in the higher frequencies of the spectrum [20]. Also, the data in PVI were collected under acoustically non-controlled conditions, which clearly degrades the quality of the recorded speech signals. Nevertheless, some recent exploratory work has demonstrated that, for sustained vowel /a/ phonations transmitted over a simulated standard telephone network (following the typical digital communications process with down-sampling to 8 kHz, encoding, transmission through a noisy channel, and decoding), the reduction in voice quality was not prohibitive for replicating the standard PD symptom severity metric [14]. Therefore, there is some justification that, despite the reduced sampling rate used in PVI (8 kHz), useful information can still be extracted from the sub-optimally recorded data. In principle, a study could be designed these days in which people collect speech samples on a high-end smartphone (which uses a high-quality microphone), captured using a dedicated smartphone app at the recommended sampling rate. However, that would require people to have access to expensive high-end equipment, and thus such a solution would not be widely available. Instead, PVI was conceptualized as an approach to democratize access to a potentially useful PD screening tool that could be accessible to all at practically no cost. We maintain that if we want to scale up this work and deliver responsible, innovative solutions that make a meaningful difference in practice with a widely accessible tool, there are some compromises we will likely need to make when collecting data in a practical setting so that the tool is as accessible as possible to those who would like to use it.