1. Introduction
Parkinson’s disease (PD) is a progressive neurodegenerative disorder that affects millions of people worldwide [1]. It primarily results from the gradual loss of dopaminergic neurons in the substantia nigra, a brain region essential for motor control. This degeneration leads to motor symptoms such as bradykinesia (slowness of movement), muscle rigidity, and resting tremors [2]. A significant challenge to early detection is that these symptoms often manifest only after 50–60% of the dopaminergic neurons have degenerated [3] and 60–80% of their striatal projections have degenerated [4].
PD mainly affects adults over the age of 50; approximately 10% of cases are classified as early-onset, with symptoms beginning before age 40 [5]. Several occupational exposures typical of maritime environments are recognized or suspected risk factors for PD, warranting attention within occupational health frameworks. The disease often starts with mild symptoms, such as tremors or stiffness in one limb, and gradually progresses to include speech impairment, cognitive decline, and behavioral changes. In advanced stages, dementia may also occur [6]. These clinical features often overlap with normal aging, which complicates early and accurate diagnosis [5].
The global burden of PD has significantly increased in recent decades, with cases rising from 2.5 million in 1996 to 8.5 million in 2019 [7]. Factors contributing to this rise include aging populations, improved diagnostic capabilities, and environmental influences [2]. Conventional diagnostic procedures frequently rely on invasive and empirically based clinical evaluations [8]. These approaches are typically expensive, technically demanding, and lack the sensitivity required for early-stage detection. Therefore, there is a growing need for alternative diagnostic methods that are accurate, cost-effective, non-invasive, and scalable.
Speech analysis has emerged as a promising non-invasive technique for detecting PD [9]. Voice changes, such as hypokinetic dysarthria, a motor speech disorder marked by reduced pitch variation, phonatory instability, and articulation deficits, are among the early signs of the disease [2]. These changes can be measured through various acoustic features, including jitter, shimmer, formant frequencies, voice onset time, and vowel space area [10,11].
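As a rough illustration of two of these measures, the following sketch computes local jitter and local shimmer from a sequence of glottal cycle periods and per-cycle peak amplitudes. It assumes the common "local" definitions (mean absolute difference between consecutive cycles divided by the mean value); the function names and toy values are hypothetical and are not taken from the feature extraction pipeline of this study.

```python
def _relative_perturbation(values):
    """Mean absolute difference between consecutive values, divided by their mean."""
    if len(values) < 2:
        raise ValueError("at least two cycles are required")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    mean_diff = sum(diffs) / len(diffs)
    mean_value = sum(values) / len(values)
    return mean_diff / mean_value


def local_jitter(periods_s):
    """Cycle-to-cycle variability of the fundamental period (as a fraction)."""
    return _relative_perturbation(periods_s)


def local_shimmer(amplitudes):
    """Cycle-to-cycle variability of the peak amplitude (as a fraction)."""
    return _relative_perturbation(amplitudes)


if __name__ == "__main__":
    # Hypothetical cycle measurements; real values would come from a pitch tracker.
    periods = [0.0100, 0.0102, 0.0099, 0.0101]
    amplitudes = [0.80, 0.78, 0.83, 0.79]
    print(f"jitter  = {100 * local_jitter(periods):.2f}%")
    print(f"shimmer = {100 * local_shimmer(amplitudes):.2f}%")
```

Higher values indicate less stable phonation, which is the property these features are intended to expose in PD speech.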
Mel-Frequency Cepstral Coefficients (MFCCs) are widely used in speech processing to capture the spectral characteristics of the vocal tract. They effectively model the timbre and articulation patterns that are often altered in PD speech. Gammatone Frequency Cepstral Coefficients (GTCCs), which simulate the cochlear filtering properties of human hearing, are also useful for identifying speech impairments in PD due to their sensitivity to vocal instability and reduced intensity.
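One practical difference between the two representations is the frequency scale on which the filterbank is laid out. The sketch below compares band center frequencies spaced uniformly on the mel scale with centers spaced uniformly on the ERB-rate scale typically associated with gammatone filterbanks. The formulas are the widely used HTK-style mel mapping and the Glasberg and Moore ERB-rate mapping; the band count and frequency range are arbitrary assumptions, and the exact filterbank settings used in this study are not stated here.

```python
import math


def hz_to_mel(f_hz):
    """HTK-style mel mapping."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def hz_to_erb_rate(f_hz):
    """ERB-rate scale (Glasberg and Moore approximation)."""
    return 21.4 * math.log10(1.0 + 0.00437 * f_hz)


def erb_rate_to_hz(erb):
    return (10.0 ** (erb / 21.4) - 1.0) / 0.00437


def center_frequencies(n_bands, f_min, f_max, to_scale, from_scale):
    """Return n_bands center frequencies (Hz), uniformly spaced on the given scale."""
    if n_bands < 2:
        raise ValueError("n_bands must be at least 2")
    lo, hi = to_scale(f_min), to_scale(f_max)
    step = (hi - lo) / (n_bands - 1)
    return [from_scale(lo + i * step) for i in range(n_bands)]


if __name__ == "__main__":
    # Assumed settings for illustration only: 10 bands between 50 Hz and 8 kHz.
    mel_centers = center_frequencies(10, 50.0, 8000.0, hz_to_mel, mel_to_hz)
    erb_centers = center_frequencies(10, 50.0, 8000.0, hz_to_erb_rate, erb_rate_to_hz)
    for m, e in zip(mel_centers, erb_centers):
        print(f"mel: {m:8.1f} Hz   ERB: {e:8.1f} Hz")
```

Both scales compress high frequencies, but they allocate bands differently, which is one reason the two coefficient sets can carry partly different information.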
Machine learning (ML) and deep learning (DL) approaches have shown strong performance in analyzing speech signals for PD detection. These methods can learn complex patterns in vocal features and distinguish between healthy individuals and those with PD. This study investigates the use of ML and DL models trained on MFCCs, GTCCs, and acoustic features from both reading tasks and spontaneous dialog to develop accurate, scalable tools for automatic early PD detection and monitoring.
4. Discussion
This study investigated the performance of various machine learning classifiers in distinguishing individuals with PD from healthy controls using both reading task recordings and spontaneous dialog, based on diverse feature sets: acoustic features, MFCCs, GTCCs, and their combinations. The results demonstrate several key trends, highlighting the importance of feature representation, the type of speech, and the choice of classifier.
4.1. Effect of Feature Sets
Among the single feature representations, the GTCC-based model achieved superior performance on both tasks compared to MFCC and acoustic features, particularly in terms of F1-score for PD subjects and ROC-AUC. The MFCC-based model also demonstrated strong performance, while acoustic features alone yielded comparatively lower results. Although acoustic features capture prosodic and phonation-related cues (e.g., jitter, shimmer, and intensity), they lack the spectral richness of cepstral features, which may limit their discriminative power in isolation. MFCCs and GTCCs, being cepstral features, are effective in modeling the spectral characteristics of speech. GTCCs, which are more sensitive to glottal and pitch variations, appear particularly well-suited for PD detection when combined with MFCCs or acoustic features.
Feature fusion improved classification performance. In particular, the combination of GTCC and acoustic features resulted in the highest overall performance, indicating that these features capture complementary information relevant to Parkinson’s disease detection. However, combining all feature sets did not consistently yield further improvements, suggesting potential redundancy and increased model complexity.
4.2. Task-Specific Observations
A critical observation is that classification performance was consistently higher for the reading task compared to spontaneous dialog. This can be attributed to several factors.
Controlled speech content in the reading task reduces lexical and prosodic variability, enabling more consistent acoustic pattern extraction across samples.
Spontaneous dialog introduces variability in speaking style, turn-taking, and emotion, making the detection of subtle PD-related speech impairments more challenging.
Additionally, speaker diarization errors and overlapping speech in spontaneous dialog may have degraded feature quality.
Nevertheless, despite the increased complexity of spontaneous dialog, classifiers trained on cepstral features and their combinations still achieved strong performance (up to 83.7% accuracy and 0.88 AUC with XGBoost). This confirms that robust PD speech markers can be extracted even from naturalistic conversations.
4.3. Statistical Significance Analysis
The performance of machine learning models was evaluated using different feature representations, including acoustic features, MFCCs, GTCCs, and their combinations. Across both the reading task and spontaneous dialog, hybrid feature sets generally achieved higher classification accuracy compared to individual feature groups. However, since several improvements were marginal, a paired t-test (α = 0.05) was conducted to evaluate whether performance differences between feature sets are statistically significant for both the reading task and spontaneous dialog, as shown in the Supplementary Materials (see Tables S9 and S16).
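To make the test concrete, the following sketch computes a paired t-statistic and a two-sided p-value for two sets of matched scores using only the standard library. The scores are invented for illustration, and the assumption that the paired units are per-fold accuracies is ours; the pairing actually used is documented in the Supplementary Materials. The p-value is obtained by numerically integrating the Student t density with Simpson's rule rather than by calling a statistics package.

```python
import math


def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for matched samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally sized samples with at least 2 pairs")
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean_d = sum(d) / n
    var_d = sum((x - mean_d) ** 2 for x in d) / (n - 1)  # sample variance
    if var_d == 0:
        raise ValueError("all paired differences are identical")
    return mean_d / math.sqrt(var_d / n), n - 1


def t_pdf(x, df):
    """Density of the Student t distribution."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=20000):
    """Two-sided p-value: 1 - 2 * integral of the density over [0, |t|]."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return max(0.0, 1.0 - 2.0 * total * h / 3.0)


if __name__ == "__main__":
    # Hypothetical per-fold accuracies for two feature sets (not study results).
    hybrid = [0.90, 0.92, 0.88, 0.91, 0.89]
    mfcc_only = [0.85, 0.86, 0.84, 0.88, 0.83]
    t, df = paired_t_statistic(hybrid, mfcc_only)
    p = two_sided_p(t, df)
    print(f"t = {t:.3f}, df = {df}, p = {p:.4f}, significant = {p < 0.05}")
```

In practice, a library routine for the paired t-test would normally be preferred; the manual version is shown only to expose the calculation.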
Reading Task:
Acoustic vs. MFCCs/GTCCs: Highly significant (p ≈ 0.0000) for most models, showing that acoustic features provide complementary information.
MFCCs vs. GTCCs: Not significant (p > 0.05), indicating redundancy.
MFCCs vs. Hybrid Features: Mostly not significant, with few exceptions (e.g., SVM, RF, DT, and MLP).
Hybrid vs. All Features: Significant improvements mainly for RF, KNN, and SVM, suggesting ensemble models benefit most from feature fusion.
Spontaneous Dialog:
Acoustic vs. MFCCs/GTCCs: Still significant across most models, confirming robustness of acoustic features.
MFCCs vs. GTCCs: Mostly not significant, except KNN (p = 0.0014).
MFCCs vs. Hybrid Features: Largely not significant, with minor improvements (MLP).
Hybrid vs. All Features: Fewer significant results; several p-values = 1.0000 indicate negligible differences.
Acoustic features are consistently significant across both tasks, highlighting their importance in PD detection. MFCCs and GTCCs are redundant, contributing similar spectral information. Feature fusion improves accuracy, but gains are not always statistically significant. Reading speech shows stronger significance than spontaneous speech, due to lower variability. Model dependency exists: Random Forest benefits most from feature fusion, while XGBoost and MLP show minimal sensitivity.
Overall, combining MFCCs, GTCCs, and acoustic features yields the best performance, but statistical tests show that improvements are not always significant, especially in spontaneous speech. A balanced approach between feature complexity and model selection is therefore recommended.
4.4. Bias Analysis
Per-class evaluation revealed that some models exhibited lower recall for PD subjects compared to healthy controls, indicating potential bias toward the majority class. Notably, GTCC-based and fused feature models demonstrated improved sensitivity for PD detection, which is critical for clinical applications where minimizing false negatives is essential.
4.5. Performance Comparison with Previously Published Studies
In this subsection, we compare the performance of our study with previous research using the MDVR-KCL dataset. Through a comprehensive literature review, we identified six studies that have utilized this dataset. However, two of these studies, by Rohit et al. [17] and Huang et al. [18], are not open access, and therefore, we could not access the methodological details or reported results. Consequently, our comparative analysis is limited to the remaining four studies, for which sufficient information is publicly available.
Table 1 presents a summary of the methods, features, and best-reported performances of these studies.
Di Cesare et al. [15] utilized both reading text and conversational speech tasks in their study. They applied speaker diarization techniques to remove overlaps between the interviewer and participants in the conversational speech recordings. Following this, they implemented a sampling strategy on both types of audio signals, although the sampling methodology was not clearly described. Their feature set included MFCCs and GTCCs. The best performance was achieved using a K-Nearest Neighbors (KNN) classifier on the read-text task, yielding an accuracy of 92.3%, sensitivity of 93%, specificity of 91%, and an F1-score of 92. For the conversational speech, the highest performance was obtained using a Support Vector Machine (SVM), with similar metrics except for an F1-score of 90. Acoustic features were not included in their analysis, and they did not report AUC or MCC values.
Klempíř et al. [16] focused solely on read-text speech signals. They used the wav2vec 1.0 method and extracted only MFCC features. Their study reported an AUC of 0.72 using a Random Forest classifier and 0.78 with XGBoost. Other performance metrics were not disclosed in their publication.
Yousif et al. [13] also analyzed read text (spontaneous speech) recordings. Their approach involved both numerical and graphical processing of voice signals. They analyzed spectrograms, Mel-spectrograms, STFT, and MFCC features (using both Slaney and HTK toolkits). Supervised machine learning models applied to these features achieved near-perfect results using KNN, with reported accuracy, sensitivity, specificity, F1-score, and AUC all approaching 100%. Graphical features were evaluated using pre-trained convolutional neural networks (CNNs), where the best performance was obtained using the VGG16 model, with an accuracy, sensitivity, specificity, and F1-score of 96.93%, and an AUC of 99.55%.
Reddy and Akku [15] analyzed both read text and spontaneous dialog using a sparse learning approach. They did not specify any sampling technique or strategy for managing overlapping speakers. Their analysis was limited to MFCC features. Their proposed model, NSRC, demonstrated strong results on the read-text task, achieving an accuracy of 82.46%, sensitivity of 89.24%, precision of 82.73%, F1-score of 86.14, and MCC of 0.50. For the spontaneous dialog task, the model slightly outperformed traditional methods with an accuracy of 83.08%, sensitivity of 82.46%, precision of 79.66%, F1-score of 81.03, and MCC of 0.57.
Figure 10 shows a bar chart comparison of each study’s results. The subplot on the left shows metric values, and the subplot on the right shows the AUC and MCC values of each study.
In comparison, our study is distinct in both methodological design and feature selection. We analyzed all available recordings from both read text and spontaneous dialog tasks, incorporating acoustic features alongside MFCCs and GTCCs for machine learning model development. We trained six different ML models. Our best performance on the read-text task was achieved using an SVM with an accuracy of 93.46%, sensitivity of 90.90%, precision of 95.20%, F1-score of 91.84, AUC of 0.97, and MCC of 0.86. For the spontaneous dialog task, the best result was obtained with the XGBoost model, achieving an accuracy of 83.70%, sensitivity of 76.32%, precision of 88.89%, F1-score of 79.45, AUC of 0.88, and MCC of 0.66.
In medical diagnosis tasks, especially those involving imbalanced datasets such as PD detection, precision and sensitivity are critical for evaluating true positive predictions. Relying solely on accuracy can be misleading. The F1-score is particularly informative in such contexts, as it balances precision and recall. AUC is a valuable metric for comparing models on the same dataset, while MCC provides an overall measure of model performance that considers all confusion matrix categories. Our results indicate a balanced and effective approach for both speech tasks using a comprehensive feature set and robust model evaluation.
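The relationship between these metrics can be illustrated with a short sketch that derives them from the four cells of a binary confusion matrix, treating PD as the positive class. The counts below are invented to show how accuracy can remain high while sensitivity, F1-score, and MCC collapse on an imbalanced set; they are not results of this study, and the function name is hypothetical.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Compute common binary classification metrics from confusion matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # recall for the positive class
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if (precision + sensitivity) else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # 0 when undefined
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Hypothetical imbalanced test set: 10 PD and 90 healthy controls.
    # A classifier that labels everyone as healthy still reaches 90% accuracy.
    always_healthy = binary_metrics(tp=0, fp=0, fn=10, tn=90)
    # A classifier that finds 8 of the 10 PD subjects at the cost of 5 false alarms.
    useful = binary_metrics(tp=8, fp=5, fn=2, tn=85)
    for name, m in (("always healthy", always_healthy), ("useful", useful)):
        print(name, {k: round(v, 3) for k, v in m.items()})
```

The first classifier detects no PD subject at all, which only the sensitivity, F1-score, and MCC reveal.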
4.6. Study Limitations and Future Work
The primary limitation of this study is the relatively small sample size. A larger cohort of participants during data acquisition could have significantly influenced methodological choices and led to more generalizable and robust results. The slightly lower performance observed in the spontaneous conversation task can be attributed to the limited number of samples; it is likely that increasing the number of subjects would improve model performance.
However, it is important to acknowledge that applying the same methodological framework to a larger and more heterogeneous dataset, without adequate adjustments and a rigorous training phase, may result in diminished performance. This underscores the need for carefully tailored training strategies when scaling up, and it raises important questions about the nature and complexity of the training required to maintain high model performance across broader populations.
Although this study combines multiple feature types, the overall feature dimensionality remains relatively low. Specifically, we utilized mean values of the first 13 MFCCs, 13 GTCCs, and 11 acoustic features, resulting in a total of 37 features per sample. Given this manageable feature size, explicit feature selection or dimensionality reduction techniques were not applied. However, combining heterogeneous feature sets may still introduce a risk of overfitting, particularly with limited data. To mitigate this risk, we employed leave-one-subject-out cross-validation (LOSOCV) and regularized ML models to support robust generalization. We also acknowledge that external validation is essential for assessing generalizability. Due to the limited availability of compatible datasets, independent cohort validation was not performed in this study. Future work will focus on evaluating the proposed framework on larger and multi-center datasets and will explore feature selection and dimensionality reduction methods, such as Recursive Feature Elimination (RFE) and Principal Component Analysis (PCA), to further improve model performance and interpretability.
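A minimal sketch of the two ideas in this paragraph is given below: concatenating the three feature groups into a single 37-dimensional vector, and generating cross-validation folds in which all samples from one subject are held out together. The function names, subject identifiers, and placeholder feature values are hypothetical; this is an illustration of the general technique, not the code used in this study.

```python
def fuse_features(mfcc_means, gtcc_means, acoustic):
    """Concatenate 13 MFCC means, 13 GTCC means, and 11 acoustic features."""
    if (len(mfcc_means), len(gtcc_means), len(acoustic)) != (13, 13, 11):
        raise ValueError("expected 13 MFCC, 13 GTCC, and 11 acoustic values")
    return list(mfcc_means) + list(gtcc_means) + list(acoustic)


def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) for each subject.

    Every sample belonging to the held-out subject goes to the test fold, so
    no speaker contributes to both training and testing in the same fold.
    """
    for subject in sorted(set(subject_ids)):
        test_idx = [i for i, s in enumerate(subject_ids) if s == subject]
        train_idx = [i for i, s in enumerate(subject_ids) if s != subject]
        yield subject, train_idx, test_idx


if __name__ == "__main__":
    # Placeholder feature values; real ones would come from the audio analysis.
    sample = fuse_features([0.0] * 13, [0.0] * 13, [0.0] * 11)
    print("features per sample:", len(sample))

    # Hypothetical layout: several audio segments per subject.
    subjects = ["s01", "s01", "s02", "s03", "s03", "s03"]
    for held_out, train_idx, test_idx in leave_one_subject_out(subjects):
        print(held_out, "train:", train_idx, "test:", test_idx)
```

Splitting by subject rather than by segment matters because segments from the same speaker are highly correlated, and mixing them across folds would inflate the estimated performance. scikit-learn's `LeaveOneGroupOut` splitter provides the same grouping behavior for pipelines built on that library.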
Another limitation of this study is the lack of standardization in the selection of speech passages. Variability in linguistic complexity, phonetic richness, emotional tone, passage length, and familiarity can significantly affect speech production and acoustic features. All factors that influence speech production should be systematically controlled in future experimental designs. Future work will address this by incorporating controlled and standardized speech material, along with phonetic-level analysis, such as vowel space area and consonant articulation patterns, to better account for variability across participants.
The challenges posed by spontaneous dialog reflect real-world conditions, where background noise, overlapping speech, and variable conversational dynamics are common. Such complexities may affect feature stability and model performance. Although MFCC and GTCC features are widely recognized for their robustness in speech analysis, variability in recording conditions, such as device differences and environmental noise, can still influence their reliability, particularly in uncontrolled settings. Addressing these challenges will require more advanced speaker diarization techniques and the extraction of richer feature sets, including extended cepstral, prosodic, voice quality, and articulatory features. These enhancements may improve model robustness and accuracy in complex acoustic environments and support more reliable real-world deployment.
Although deep learning approaches, such as CNNs, have shown strong performance in speech-based disease detection, they typically require large-scale datasets to achieve reliable generalization. In this study, the relatively limited dataset size motivated the use of classical machine learning models, which are better suited for small, structured feature sets and offer greater robustness and interpretability under such conditions. Nevertheless, deep learning remains a promising direction for future research. With the availability of larger and more diverse datasets, future work will explore lightweight architectures, such as 1D CNNs applied to MFCC or GTCC features, as well as hybrid approaches combining deep and handcrafted features to further enhance performance.
While the proposed models achieved strong classification performance, clinical interpretability remains limited, as the current analysis focuses primarily on predictive accuracy rather than the contribution of individual features. Features related to voice stability (e.g., jitter and shimmer) and cepstral representations (MFCCs and GTCCs) are likely to play an important role in classification, given their relevance to phonatory and articulatory impairments in PD. Future work will incorporate feature importance analyses, such as SHAP or model-based importance measures, to improve clinical insight.
This study is limited to binary classification (PD vs. HC), which restricts its applicability for monitoring disease progression. Future research will extend this framework to predict disease severity (e.g., MDS-UPDRS scores) and model longitudinal progression, subject to the availability of longitudinal clinical data. The current work relies solely on speech-based biomarkers. Integrating speech features with complementary modalities, such as neuroimaging biomarkers, may further improve diagnostic performance and pathophysiological interpretability. However, access to such medical imaging data is often limited due to privacy concerns and the lack of publicly available datasets. Future work will explore multimodal frameworks incorporating imaging and clinical data, contingent upon the availability of appropriate datasets or the ability to collect such data in compliance with ethical and regulatory requirements.
It should also be emphasized that this study provides only a preliminary, voice-based assessment of Parkinson’s disease and is not intended to substitute for clinical evaluations conducted by healthcare professionals, such as those based on the MDS-UPDRS. Nonetheless, this research represents a promising step toward automated, real-time health monitoring through voice analysis. A future implementation, potentially via a mobile application, could support both in-home and clinical use cases. However, such applications must rigorously address issues of data privacy, informed consent, and regulatory compliance in accordance with frameworks such as the GDPR and HIPAA.
In summary, this study highlights the potential of cepstral features and machine learning models in the detection of PD and lays important groundwork for future applications in the diagnosis and monitoring of neurodegenerative conditions through speech-based analysis.
5. Conclusions
This study explored the application of machine learning techniques to the automatic detection of PD using voice recordings from a reading task and spontaneous dialog. By extracting and combining multiple feature types, including acoustic parameters, Mel-Frequency Cepstral Coefficients (MFCCs), and Gammatone Frequency Cepstral Coefficients (GTCCs), we developed and evaluated classifiers capable of distinguishing PD patients from healthy controls with strong performance across multiple metrics.
Our results show that the combination of cepstral and acoustic features consistently improves classification accuracy, sensitivity, and robustness, compared to using any single feature type. The best performances were achieved using an SVM model on the reading task (accuracy: 95.45%, AUC: 0.98, and MCC: 0.90) and an XGBoost model on spontaneous dialog (accuracy: 83.70%, AUC: 0.88, and MCC: 0.66). These findings indicate that relevant vocal biomarkers of PD are present even in unstructured, conversational speech, though performance is somewhat reduced due to its inherent variability.
Nonetheless, the study is constrained by certain limitations, including a limited sample size, a lack of standardized speech content, and difficulties in managing overlapping speech during dialog. Addressing these challenges in future research through larger and more diverse datasets, advanced diarization techniques, and deep learning models capable of capturing temporal and contextual dependencies will be crucial for improving system performance and real-world applicability.
Future work will explore cross-linguistic generalization by evaluating the proposed models on speech data from native Italian speakers. Specifically, one dataset will be used for training and another for testing in an inter-database evaluation setup. This approach will help assess the models’ generalizability across different languages and recording conditions. However, the limited size of available datasets may still pose challenges for training high-capacity models, making transfer learning or data augmentation valuable strategies to explore.
Overall, this work demonstrates the feasibility of using voice as a non-invasive, accessible biomarker for PD detection and monitoring. It contributes to the growing body of evidence supporting speech-based digital biomarkers and lays the groundwork for the development of real-time, mobile, and telehealth-compatible screening tools for neurodegenerative diseases.