Machine Learning Approaches to Early Detection of Parkinson’s Disease Using Speech Analysis Technique
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
Major comments:
- Clarify the validation strategy mismatch.
The abstract states 10-fold cross‑validation and LOSOCV were used, but Section 2.5 and the results only mention LOSOCV. Specify which validation strategy corresponds to which reported results, or unify the description.
- Report class‑wise performance metrics for imbalanced data.
The dataset is imbalanced (16 PD vs. 21 HC). Provide precision, recall, and F1‑score per class (PD and HC) for the best models, not just macro averages, to better assess bias.
- Explain how the “816 segmented audio files” from the reading task were handled in subject‑wise splitting.
- Add statistical significance testing for performance differences.
For comparisons (e.g., MFCC vs. GTCC vs. combined features), include statistical tests (e.g., paired t‑test or McNemar’s test) to show whether observed differences are significant, especially when accuracy differences are small (e.g., 91.4% vs. 90.2%).
- Discuss potential overfitting from feature engineering.
Combining acoustic + MFCC + GTCC yields many features relative to sample size (e.g., reading task training: 566 samples). Mention whether dimensionality reduction or feature selection was considered, and report feature counts.
- Address the lack of standardization in speech passages as a limitation more concretely.
The limitation is noted but not quantified. Suggest analyzing phonetic content (e.g., vowel space, consonant clusters) across participants to control for passage variability in future work.
- The authors should enhance their manuscript by adding the following studies to previous research:
- Automatic and Early Detection of Parkinson’s Disease by Analyzing Acoustic Signals Using Classification Algorithms Based on Recursive Feature Elimination Method.
- Hybrid Techniques of Multi-CNN and Ensemble Learning to Analyze Handwritten Spiral and Wave Drawing for Diagnosing Parkinson's Disease.
And cite the following studies to improve methodology:
- Analysis of dermoscopy images of multi-class for early detection of skin lesions by hybrid systems based on integrating features of CNN models
- Predicting of diabetic retinopathy development stages of fundus images using deep learning based on combined features
- Analyzing histological images using hybrid techniques for early detection of multi-class breast cancer based on fusion features of cnn and handcrafted
- Reconcile the abstract’s highest performance with the main text.
Abstract reports XGBoost with 91.6% accuracy (10‑fold CV), but reading task best is MLP with 93.46% (subject‑wise split). Clarify which result is the main finding and ensure consistency between abstract and Sections 3.1–3.2.
- Improve reproducibility by providing code or pseudo‑code for segmentation and diarization.
The methods describe using PyDub for silence detection and a generic “speaker diarization techniques” without naming the library (e.g., pyannote.audio). Specify the exact tools and parameters.
- Discuss why no deep learning (e.g., CNNs, RNNs) was attempted despite mentioning DL in the introduction.
The introduction highlights deep learning, but experiments use only classical ML. Explain this gap or add a brief pilot comparison with a simple deep model (e.g., 1D‑CNN on MFCCs) to support the chosen approach.
Author Response
To
The Editor,
Journal of Neurology International
Dear Editor,
We are writing to you regarding our paper "Machine Learning Approaches to Early Detection of Parkinson’s Disease Using Speech Analysis Technique", which has been submitted to the Journal of Neurology International for publication.
We would like to thank you for your helpful review comments that have aided us in improving our manuscript. Our point-by-point reply to the reviewers’ comments and suggestions is below.
We thank the reviewer for the constructive feedback and thoughtful suggestions. Below we address each comment in detail and indicate where changes have been made in the manuscript.
Reviewer# 1
Major comments:
- Clarify the validation strategy mismatch.
The abstract states 10-fold cross‑validation and LOSOCV were used, but Section 2.5 and the results only mention LOSOCV. Specify which validation strategy corresponds to which reported results, or unify the description.
Response:
Thank you for highlighting this important point. We clarify that 10-fold cross-validation was used exclusively within the GridSearchCV procedure for hyperparameter tuning, while Leave-One-Subject-Out Cross-Validation (LOSOCV) was employed as the final evaluation strategy for reporting model performance.
To avoid confusion, the manuscript has been revised accordingly. The abstract now clearly states that hyperparameter tuning was performed using GridSearchCV with 10-fold cross-validation, while final model evaluation was conducted using LOSOCV to ensure subject-independent performance assessment (Page 1, Lines 27–29).
In addition, Section 2.5 has been updated to explicitly distinguish between the internal validation used during hyperparameter optimization and the external LOSOCV procedure used for final performance evaluation (Page 7, Lines 280–301).
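The two-level scheme described in this response can be illustrated with a minimal pure-Python sketch. It is not the authors' code: the subject IDs, fold counts, and helper names (`loso_folds`, `k_fold`) are hypothetical, and the inner folds here are split at the segment level, since the response does not state whether the inner 10-fold split was grouped by subject. In scikit-learn the same roles are played by `LeaveOneGroupOut` (outer loop) and `GridSearchCV` with `cv=10` (inner loop).

```python
from collections import defaultdict


def loso_folds(subject_ids):
    """Yield (held_out_subject, train_idx, test_idx), one subject held out per fold."""
    by_subject = defaultdict(list)
    for idx, sid in enumerate(subject_ids):
        by_subject[sid].append(idx)
    for held_out in sorted(by_subject):
        test_idx = by_subject[held_out]
        train_idx = [i for sid, idxs in by_subject.items() if sid != held_out for i in idxs]
        yield held_out, sorted(train_idx), test_idx


def k_fold(indices, k=10):
    """Split the training indices into k validation folds for hyperparameter tuning."""
    folds = [indices[i::k] for i in range(k)]
    for i, val_idx in enumerate(folds):
        fit_idx = [j for f_i, f in enumerate(folds) if f_i != i for j in f]
        yield fit_idx, val_idx


# Hypothetical toy data: 4 subjects with 3 segments each.
subject_ids = ["S1"] * 3 + ["S2"] * 3 + ["S3"] * 3 + ["S4"] * 3

for held_out, train_idx, test_idx in loso_folds(subject_ids):
    # Tuning folds are drawn only from the training subjects,
    # so the held-out subject never influences hyperparameter choice.
    for fit_idx, val_idx in k_fold(train_idx, k=3):
        assert not set(val_idx) & set(test_idx)
```

The point of the structure is that tuning happens entirely inside each outer training set, so the reported LOSOCV score comes from a subject the tuned model has never seen.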
- Report class‑wise performance metrics for imbalanced data.
The dataset is imbalanced (16 PD vs. 21 HC). Provide precision, recall, and F1‑score per class (PD and HC) for the best models, not just macro averages, to better assess bias.
Response:
We have added per-class (PD and HC) precision, recall, and F1-score for each model to the Supplementary Materials: please see Supplementary S4, S6, and S8 for the reading task and S11, S13, and S15 for spontaneous dialog. We have also updated the Results section with class-wise performance for the best models. Please see page 8, lines 334–350; page 9, lines 375–397; and pages 10–11, lines 426–441. For Section 3.2, Spontaneously Conversational Speech, please see pages 12–13, lines 476–499; page 14, lines 532–555; and page 15, lines 562–572.
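For readers who want to see what per-class reporting involves, the following is a minimal sketch of one-vs-rest precision, recall, and F1 for the PD and HC classes. The labels below are invented for illustration and are not results from the manuscript; the helper name `per_class_metrics` is hypothetical.

```python
def per_class_metrics(y_true, y_pred, labels=("PD", "HC")):
    """Return {label: (precision, recall, f1)}, treating each label as the positive class."""
    out = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        out[label] = (precision, recall, f1)
    return out


# Hypothetical labels for seven test segments (not data from the study).
y_true = ["PD", "PD", "PD", "HC", "HC", "HC", "HC"]
y_pred = ["PD", "PD", "HC", "HC", "HC", "HC", "PD"]
metrics = per_class_metrics(y_true, y_pred)
```

Reporting both rows side by side makes it visible when a model favors the majority class, which a macro average alone can hide.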
- Explain how the “816 segmented audio files” from the reading task were handled in subject‑wise splitting.
Response:
We thank the reviewer for this clarification request. Although 816 audio segments were generated from the reading task, subject-wise splitting was strictly enforced to ensure that all segments belonging to a given subject were included exclusively in either the training or testing set during LOSOCV. This approach prevents data leakage and preserves subject independence. We note that conventional segment-level splitting may inadvertently distribute samples from the same subject across both training and testing sets, leading to overfitting and inflated performance. To avoid this, we adopted a subject-level partitioning strategy, where unique participant identifiers were used to group all corresponding segments prior to splitting.
For the experiments, a 70:30 split at the subject level was applied, where the segments of 70% of the participants were assigned to the training set and those of the remaining 30% to the testing set. Post-split verification confirmed that no subject contributed data to both sets, thereby preserving strict independence between training and evaluation data. Please see subsection 2.4.1, data splitting, on page 6, lines 255–269.
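The subject-level partitioning and the post-split check described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' implementation: the subject IDs, the seed, and the function name `subject_level_split` are hypothetical, and class stratification is deliberately left out because the response does not say whether it was used.

```python
import random


def subject_level_split(segment_subjects, test_fraction=0.3, seed=0):
    """Split segment indices so that each subject lands entirely in train or test."""
    subjects = sorted(set(segment_subjects))
    rng = random.Random(seed)
    rng.shuffle(subjects)
    n_test = max(1, round(len(subjects) * test_fraction))
    test_subjects = set(subjects[:n_test])

    train_idx = [i for i, s in enumerate(segment_subjects) if s not in test_subjects]
    test_idx = [i for i, s in enumerate(segment_subjects) if s in test_subjects]

    # Post-split verification: no subject may contribute segments to both sets.
    train_subjects = {segment_subjects[i] for i in train_idx}
    if not train_subjects.isdisjoint(test_subjects):
        raise RuntimeError("subject leakage between train and test")
    return train_idx, test_idx


# Hypothetical data: 10 subjects with 2 to 4 segments each.
segment_subjects = [f"S{n:02d}" for n in range(10) for _ in range(n % 3 + 2)]
train_idx, test_idx = subject_level_split(segment_subjects)
```

Note that the split ratio applies to subjects, so the segment-level ratio will only be approximately 70:30 when subjects contribute different numbers of segments.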
- Add statistical significance testing for performance differences.
For comparisons (e.g., MFCC vs. GTCC vs. combined features), include statistical tests (e.g., paired t‑test or McNemar’s test) to show whether observed differences are significant, especially when accuracy differences are small (e.g., 91.4% vs. 90.2%).
Response:
To compare the performance of different models and feature sets, statistical significance was evaluated using paired t-tests across the LOSOCV folds. This approach was selected to account for fold-wise variability and to give a more robust assessment of model performance. The analysis indicates, based on the calculated p-values, whether the observed differences between feature sets reached statistical significance. Detailed results of the paired t-test analyses are available in Supplementary Materials S9 and S16. We have also added subsections 4.3, “Statistical Significance Analysis”, and 4.4, “Bias Analysis”, to the Discussion, where we report the paired t‑test results and discuss bias risk. Please see pages 17–18, lines 639–673.
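As a minimal sketch of the paired comparison described above, the t statistic can be computed from per-fold score differences as t = mean(d) / (sd(d) / sqrt(n)) with n - 1 degrees of freedom. The per-fold accuracies below are invented for illustration and the function name is hypothetical; the p-value step is omitted here and would normally come from a statistics library (for example `scipy.stats.ttest_rel`).

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired per-fold scores of two models."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (divides by n - 1)
    if sd_d == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    return mean_d / (sd_d / math.sqrt(n)), n - 1


# Hypothetical per-fold accuracies for two feature sets (not results from the study).
mfcc_scores = [0.90, 0.85, 0.95, 0.80, 0.90]
gtcc_scores = [0.92, 0.88, 0.94, 0.86, 0.93]
t_stat, dof = paired_t_statistic(gtcc_scores, mfcc_scores)
```

Pairing by fold matters because both models are evaluated on the same held-out subject in each fold, so subject difficulty cancels out in the differences.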
- Discuss potential overfitting from feature engineering.
Combining acoustic + MFCC + GTCC yields many features relative to sample size (e.g., reading task training: 566 samples). Mention whether dimensionality reduction or feature selection was considered, and report feature counts.
Response:
Although this study combines multiple feature types, the overall feature dimensionality remains relatively low. Specifically, we use the mean values of the first 13 MFCCs, 13 GTCCs, and 11 acoustic features, resulting in a total of 37 features per sample. Given this manageable feature count, explicit feature selection or dimensionality reduction techniques were not applied. We have clarified the feature types and the total feature count in Section 2.3, “features extraction”; please see pages 5–6, lines 215–245.
However, combining heterogeneous feature sets may still introduce a risk of overfitting, particularly with limited data. To mitigate this, we employed LOSOCV and regularized ML models to support robust generalization. We also updated section 4.4, study limitations, where we discuss potential overfitting; please see pages 20–21, lines 758–770.
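The feature count stated above (13 + 13 + 11 = 37) can be illustrated with a short sketch that averages frame-level cepstral coefficients and concatenates them with the acoustic measures. It assumes the "mean values" are taken over frames within a segment, which the response implies but does not state outright; all values and function names are placeholders.

```python
def mean_over_frames(frames):
    """Average a list of per-frame coefficient vectors into a single vector."""
    n_frames = len(frames)
    return [sum(column) / n_frames for column in zip(*frames)]


def build_feature_vector(mfcc_frames, gtcc_frames, acoustic):
    """Concatenate mean MFCCs, mean GTCCs, and segment-level acoustic features."""
    return mean_over_frames(mfcc_frames) + mean_over_frames(gtcc_frames) + list(acoustic)


# Placeholder inputs: 50 frames of 13 coefficients each, plus 11 acoustic values.
mfcc_frames = [[0.1 * k for k in range(13)] for _ in range(50)]
gtcc_frames = [[0.2 * k for k in range(13)] for _ in range(50)]
acoustic = [0.0] * 11

features = build_feature_vector(mfcc_frames, gtcc_frames, acoustic)
```

With 37 features and several hundred training segments, the feature-to-sample ratio stays modest, although segments from the same subject are correlated, so the effective sample size is closer to the number of subjects.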
- Address the lack of standardization in speech passages as a limitation more concretely.
The limitation is noted but not quantified. Suggest analyzing phonetic content (e.g., vowel space, consonant clusters) across participants to control for passage variability in future work.
Response:
We have expanded the limitations section to more clearly address the impact of variability in speech passages. Also, we added future directions, including the use of standardized speech material and phonetic-level analysis like vowel space and consonant articulation patterns. Please see lines 771-778 on page 21.
- The authors should enhance their manuscript by adding the following studies to previous research:
- Automatic and Early Detection of Parkinson’s Disease by Analyzing Acoustic Signals Using Classification Algorithms Based on Recursive Feature Elimination Method.
- Hybrid Techniques of Multi-CNN and Ensemble Learning to Analyze Handwritten Spiral and Wave Drawing for Diagnosing Parkinson's Disease.
And cite the following studies to improve methodology:
- Analysis of dermoscopy images of multi-class for early detection of skin lesions by hybrid systems based on integrating features of CNN models
- Predicting of diabetic retinopathy development stages of fundus images using deep learning based on combined features
- Analyzing histological images using hybrid techniques for early detection of multi-class breast cancer based on fusion features of cnn and handcrafted
Response:
We thank the reviewer for the suggestion. We have cited the relevant works in our study.
- Reconcile the abstract’s highest performance with the main text.
Abstract reports XGBoost with 91.6% accuracy (10‑fold CV), but reading task best is MLP with 93.46% (subject‑wise split). Clarify which result is the main finding and ensure consistency between abstract and Sections 3.1–3.2.
Response:
The abstract has been revised to reflect the primary evaluation setting (LOSOCV). The previously reported 10-fold CV result has been removed to avoid confusion. Please see page 1, lines 27–35, and the Results section, subsections 3.1–3.2, pages 7–15. For our best results, please also see page 19, lines 723–731.
- Improve reproducibility by providing code or pseudo‑code for segmentation and diarization.
The methods describe using PyDub for silence detection and a generic “speaker diarization techniques” without naming the library (e.g., pyannote.audio). Specify the exact tools and parameters.
Response:
We have improved reproducibility by specifying the tools and parameters used in this work:
Silence detection: PyDub (minimum silence = 500 ms, threshold = −16 dBFS).
Speaker diarization: pyannote.audio library.
Please see pages 4 and 5, lines 170–202. We also plan to share our code in a GitHub repository under an open license after the end of our project timeline.
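To show how the two silence parameters quoted above interact, here is a simplified stand-in for silence-based segmentation. It is not PyDub's implementation and not the authors' code: it works on a precomputed list of per-frame loudness values rather than on audio, and the frame size and function name are assumptions. In PyDub itself the corresponding arguments of `pydub.silence.split_on_silence` are `min_silence_len` and `silence_thresh`.

```python
def nonsilent_ranges(levels_dbfs, frame_ms=10, min_silence_ms=500, silence_thresh_dbfs=-16.0):
    """Return (start_ms, end_ms) speech ranges separated by sufficiently long silences."""
    min_silent_frames = min_silence_ms // frame_ms
    ranges = []
    start = None      # index of the first frame of the current speech segment
    silent_run = 0    # consecutive quiet frames seen inside the current segment
    for i, level in enumerate(levels_dbfs):
        if level >= silence_thresh_dbfs:
            if start is None:
                start = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_silent_frames:
                # Close the segment at the first frame of this silent run.
                ranges.append((start * frame_ms, (i - silent_run + 1) * frame_ms))
                start, silent_run = None, 0
    if start is not None:
        end = len(levels_dbfs) - silent_run
        ranges.append((start * frame_ms, end * frame_ms))
    return ranges


# Hypothetical loudness track: speech, a 600 ms pause, speech, a 100 ms pause, speech.
levels = [-10.0] * 30 + [-40.0] * 60 + [-10.0] * 20 + [-40.0] * 10 + [-10.0] * 20
segments = nonsilent_ranges(levels)
```

In this toy track the 600 ms pause splits the audio, while the 100 ms pause is shorter than the 500 ms minimum and stays inside the second segment.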
- Discuss why no deep learning (e.g., CNNs, RNNs) was attempted despite mentioning DL in the introduction.
The introduction highlights deep learning, but experiments use only classical ML. Explain this gap or add a brief pilot comparison with a simple deep model (e.g., 1D‑CNN on MFCCs) to support the chosen approach.
Response:
We acknowledge that deep learning methods were discussed in the introduction. They were not implemented in this study due to the limited dataset size, which increases the risk of overfitting and reduces the generalization capability of deep models. In contrast, classical machine learning methods are better suited for small datasets and provide more stable and interpretable results under such conditions.
We acknowledge the importance of deep learning methods and will explore their application in future work using larger datasets. A preliminary comparison with lightweight deep models (e.g., 1D CNNs on MFCC features) is also planned to further validate the proposed approach. Please see section 4.6, study limitations and future work, page 21, lines 789–797.
We hope that these revisions and clarifications address all concerns and enhance the quality, transparency, and impact of our article, and that the revised manuscript is now acceptable for publication.
I am looking forward to hearing from you, and I would like to thank you in advance for your attention and your time.
Sincerely,
Dr. Mohammad Amran Hossain
On behalf of all authors
Reviewer 2 Report
Comments and Suggestions for Authors
This study investigates machine learning approaches for early detection of Parkinson’s disease (PD) using speech analysis. The authors utilize the MDVR-KCL dataset and extract acoustic, MFCC, and GTCC features from both reading and spontaneous speech tasks. Multiple classifiers, including SVM, Random Forest, MLP, and XGBoost, are evaluated with robust validation strategies such as LOSOCV. The results demonstrate strong classification performance, with XGBoost achieving the highest accuracy (~91.6%). The study highlights the potential of speech-based digital biomarkers as a non-invasive and scalable tool for early PD detection.
Major comments:
The study is based on a relatively small dataset (n=37 participants).
- How do the authors ensure that the model is not overfitting despite LOSOCV?
- Can the authors comment on external validation using independent cohorts?
MFCC and GTCC features show strong performance, but speech signals can be highly variable.
- How robust are these features to recording conditions (e.g., device differences, background noise)?
- Were any normalization or domain adaptation strategies considered?
While the model achieves high accuracy, clinical interpretability remains limited.
- Which specific features (e.g., jitter, shimmer, specific MFCC components) contributed most to classification?
- Can the authors provide feature importance or SHAP analysis to improve clinical insight?
The study focuses on classification (PD vs HC), but PD is a progressive disease.
- Can this framework be extended to predict disease severity (e.g., UPDRS scores) or longitudinal progression?
- This would significantly enhance clinical applicability.
The current study focuses solely on speech-based biomarkers. However, multimodal approaches may further improve diagnostic performance.
- Combining speech features with neuroimaging biomarkers, particularly quantitative susceptibility mapping (QSM) or other MRI-based measures of nigrostriatal degeneration, could provide complementary information.
- Such integration may enhance both diagnostic accuracy and pathophysiological interpretability, especially in early-stage PD.
- The authors are encouraged to discuss the potential of multimodal frameworks integrating speech and imaging data.
Please consider including the following papers in the manuscript:
10.1002/mds.27717
Author Response
To
The Editor,
Journal of Neurology International
Dear Editor,
We are writing to you regarding our paper "Machine Learning Approaches to Early Detection of Parkinson’s Disease Using Speech Analysis Technique", which has been submitted to the Journal of Neurology International for publication.
We would like to thank you for your helpful review comments that have aided us in improving our manuscript. Our point-by-point reply to the reviewers’ comments and suggestions is below.
Reviewer# 2
We thank Reviewer 2 for a careful evaluation and critical feedback. We have worked to address all concerns.
Comments and Suggestions for Authors
This study investigates machine learning approaches for early detection of Parkinson’s disease (PD) using speech analysis. The authors utilize the MDVR-KCL dataset and extract acoustic, MFCC, and GTCC features from both reading and spontaneous speech tasks. Multiple classifiers, including SVM, Random Forest, MLP, and XGBoost, are evaluated with robust validation strategies such as LOSOCV. The results demonstrate strong classification performance, with XGBoost achieving the highest accuracy (~91.6%). The study highlights the potential of speech-based digital biomarkers as a non-invasive and scalable tool for early PD detection.
Major comments:
- The study is based on a relatively small dataset (n=37 participants).
- How do the authors ensure that the model is not overfitting despite LOSOCV?
- Can the authors comment on external validation using independent cohorts?
Response:
To mitigate overfitting, we employed Leave-One-Subject-Out Cross-Validation (LOSOCV), which ensures strict subject independence and maximizes training data usage while preventing data leakage. We also used a low-dimensional feature set with a total of 37 features and regularized machine learning models to further reduce overfitting risk. We acknowledge that external validation is essential for assessing generalizability. However, due to the limited availability of compatible datasets, independent cohort validation was not performed in this study. This will be a key focus of future work, where the proposed framework will be evaluated on larger, multi-center datasets. Please see the study limitations section, page 21, lines 758–770.
- MFCC and GTCC features show strong performance, but speech signals can be highly variable.
- How robust are these features to recording conditions (e.g., device differences, background noise)?
- Were any normalization or domain adaptation strategies considered?
Response:
MFCC and GTCC features are widely used due to their robustness in capturing spectral and perceptual characteristics of speech. In this study, basic preprocessing and feature averaging were applied to reduce variability across recordings. Please see pages 4-5, lines 170-183.
We acknowledge that recording conditions such as device differences and background noise may still affect stability. Advanced normalization and domain adaptation techniques were not explicitly applied in this work. These approaches, including noise-robust feature extraction and domain adaptation methods, will be explored in future studies to enhance model robustness. Please see page 21 lines 778-787.
- While the model achieves high accuracy, clinical interpretability remains limited.
- Which specific features (e.g., jitter, shimmer, specific MFCC components) contributed most to classification?
- Can the authors provide feature importance or SHAP analysis to improve clinical insight?
Response:
Across both tasks, among the single-feature representations, the GTCC-based model achieved superior performance compared to the MFCC and acoustic features, particularly in terms of F1-score for PD subjects and ROC-AUC. The MFCC-based model also demonstrated strong performance, while acoustic features alone yielded comparatively lower results. Although acoustic features capture prosodic and phonation-related cues (e.g., jitter, shimmer, intensity), they lack the spectral richness of cepstral features, which may limit their discriminative power in isolation. Please see pages 16–17, lines 610–625.
We thank the reviewer for this valuable suggestion. While this study focused on evaluating classification performance, we acknowledge the importance of interpretability for clinical applications. At present, explicit feature importance analysis and SHAP values are not included. We have highlighted this as a limitation and an important direction for future work. Feature importance analysis will be incorporated to better understand the contribution of specific features such as jitter, shimmer, and cepstral coefficients. Please see page 21, lines 798–804.
- The study focuses on classification (PD vs HC), but PD is a progressive disease.
- Can this framework be extended to predict disease severity (e.g., UPDRS scores) or longitudinal progression?
- This would significantly enhance clinical applicability.
Response:
We thank the reviewer for this insightful comment. We agree that extending the framework beyond binary classification to predict disease severity and progression would significantly enhance clinical applicability.
While this study focuses on PD vs HC classification, the proposed feature extraction and modeling framework can be extended to regression or multi-class tasks, such as predicting clinical scores (e.g., UPDRS) or tracking longitudinal changes. This will be explored in future work using datasets with annotated severity and follow-up recordings. Please see page 21 lines 805-808.
- The current study focuses solely on speech-based biomarkers. However, multimodal approaches may further improve diagnostic performance.
- Combining speech features with neuroimaging biomarkers, particularly quantitative susceptibility mapping (QSM) or other MRI-based measures of nigrostriatal degeneration, could provide complementary information.
- Such integration may enhance both diagnostic accuracy and pathophysiological interpretability, especially in early-stage PD.
- The authors are encouraged to discuss the potential of multimodal frameworks integrating speech and imaging data.
Please consider including the following papers in the manuscript: 10.1002/mds.27717
Response:
We agree that multimodal approaches integrating speech features with neuroimaging biomarkers, such as quantitative susceptibility mapping (QSM) or other MRI-based measures, could provide complementary diagnostic information. However, access to such medical imaging data is often limited due to privacy concerns and the lack of publicly available datasets.
While this study focuses on speech-based analysis, future work will explore multimodal frameworks incorporating imaging and clinical data, contingent upon the availability of appropriate datasets or the ability to collect such data in compliance with ethical and regulatory requirements. Please see page 21 lines 808-814.
We hope that these revisions and clarifications address all concerns and enhance the quality, transparency, and impact of our article, and that the revised manuscript is now acceptable for publication.
I am looking forward to hearing from you, and I would like to thank you in advance for your attention and your time.
Sincerely,
Dr. Mohammad Amran Hossain
On behalf of all authors
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
The authors addressed concerns from reviewers properly.

