Review
Peer-Review Record

Machine Learning Applications for Diagnosing Parkinson’s Disease via Speech, Language, and Voice Changes: A Systematic Review

Inventions 2025, 10(4), 48; https://doi.org/10.3390/inventions10040048
by Mohammad Amran Hossain *, Enea Traini and Francesco Amenta *
Reviewer 1:
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 29 April 2025 / Revised: 20 June 2025 / Accepted: 24 June 2025 / Published: 27 June 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This systematic review addresses a timely and clinically relevant topic with methodological rigor. Its focus on speech/voice biomarkers for PD fills a gap in existing literature. However, improvements in bias assessment, data presentation, and clinical contextualization are needed to maximize impact. 

  1. The review adheres to PRISMA guidelines, employs a rigorous search strategy across major databases, and includes 34 studies with diverse datasets (e.g., PD-GITA, mPower). The inclusion of multilingual datasets (e.g., Italian, Turkish) adds breadth. However, the exclusion of non-English studies is not sufficiently justified and may introduce language bias. Acknowledging this limitation explicitly would strengthen transparency. Supplementary tables (e.g., SPICMO framework, dataset details) are referenced but not included in the main text, reducing accessibility. Key supplementary data (e.g., Table 3: ML models) should be summarized in the main manuscript for clarity.
  2. This is the first review focusing exclusively on ML applications for PD diagnosis via speech/voice, differentiating it from prior reviews on general neurological disorders or cognitive impairment. The emphasis on real-world data collection (e.g., smartphone-based mPower dataset) highlights emerging trends. The discussion of "conversational dialogue data" as an under-explored resource is novel but underdeveloped. Expanding this section with specific examples or proposed methodologies would enhance originality.
  3. The review underscores the clinical potential of non-invasive, cost-effective ML tools for early PD detection, aligning with global healthcare priorities. The analysis of cross-cultural datasets (e.g., Korean, Spanish) emphasizes scalability. However, the clinical applicability of findings is discussed superficially. A dedicated subsection on "Translational Challenges" (e.g., integration into clinical workflows, regulatory hurdles) would better contextualize the significance.
  4. The risk of bias assessment lacks depth. A standardized tool (e.g., QUADAS-2 for diagnostic studies) should be applied and results summarized in the main text. Data synthesis is narrative; a meta-analysis (even limited) of accuracy/sensitivity metrics across studies would strengthen evidence synthesis.

  5. The introduction is overly lengthy; condensing background on PD pathophysiology would allow more space for critical analysis of ML advancements. Performance metrics in Table 3 are inconsistently reported (e.g., missing AUC/F1-scores for some studies). Standardizing columns (e.g., "N/A" for unreported metrics) would improve readability.

Comments on the Quality of English Language

The language is overall well written, though the following points could be improved:


  1. Grammar and Syntax:

    • Missing Articles:

      • Example: "Key manifestations of PD include bradykinesia (slowness of movement)..." → "Key manifestations of PD include the slowness of movement..."

    • Verb Tense Consistency:

      • Example: "utilizing databases such as PubMed..." → "utilized databases such as PubMed..."

    • Subject-Verb Agreement:

      • Example: "A total of 34 research articles were included... with an in-depth analysis..." → Correct as written, but ensure consistency in plural/singular forms elsewhere.

  2. Sentence Complexity:

    • Overly Long Sentences:

      • Example: "The progressive depletion of dopamine leads to... deterioration of other vital functions."

  3. Repetition and Redundancy:

    • Phrases like "promising results" recur frequently.

  4. Passive Voice Overuse:

    • Example: "Several studies have reported high diagnostic accuracy."

  5. Inconsistent Abbreviations:

    • Some abbreviations (e.g., SVM, KNN) are defined only in tables.

  6. Formatting and Presentation:

    • Table 3 has inconsistent entries (e.g., missing AUC/F1-scores).

  7. Jargon Accessibility:

    • Terms like "Mel-frequency cepstral coefficients (MFCC)" may confuse non-specialists.

Author Response

We thank Reviewer 1 for the constructive feedback and thoughtful suggestions. Below we address each comment in detail and indicate where changes have been made in the manuscript.

 

Comments and Suggestions for Authors

This systematic review addresses a timely and clinically relevant topic with methodological rigor. Its focus on speech/voice biomarkers for PD fills a gap in existing literature. However, improvements in bias assessment, data presentation, and clinical contextualization are needed to maximize impact. 

  1. The review adheres to PRISMA guidelines, employs a rigorous search strategy across major databases, and includes 34 studies with diverse datasets (e.g., PD-GITA, mPower). The inclusion of multilingual datasets (e.g., Italian, Turkish) adds breadth. However, the exclusion of non-English studies is not sufficiently justified and may introduce language bias. Acknowledging this limitation explicitly would strengthen transparency. Supplementary tables (e.g., SPICMO framework, dataset details) are referenced but not included in the main text, reducing accessibility. Key supplementary data (e.g., Table 3: ML models) should be summarized in the main manuscript for clarity.

 

Response:
We appreciate the reviewer’s concern. In the study selection phase, we excluded non-English studies due to our limited language proficiency, which hindered our ability to interpret methodologies, findings, and author commentary. We now explicitly acknowledge this limitation in the manuscript (Page 16, Lines 652–660).

Regarding supplementary tables, we have ensured they are appropriately referenced and discussed within the main text. For example, please refer to Page 4, Lines 165–174, and within the Results and Discussion sections where relevant content from Supplementary Table 3 and others has been integrated or summarized.

 

  2. This is the first review focusing exclusively on ML applications for PD diagnosis via speech/voice, differentiating it from prior reviews on general neurological disorders or cognitive impairment. The emphasis on real-world data collection (e.g., smartphone-based mPower dataset) highlights emerging trends. The discussion of "conversational dialogue data" as an under-explored resource is novel but underdeveloped. Expanding this section with specific examples or proposed methodologies would enhance originality.

 

Response:
Thank you for this suggestion. We have expanded this section under Subsection 4.1.1, Voice and Speech Tasks (Pages 13–14, Lines 512–537), to provide specific examples and methodological insights regarding conversational dialogue as an emerging but underexplored data modality.

 

  3. The review underscores the clinical potential of non-invasive, cost-effective ML tools for early PD detection, aligning with global healthcare priorities. The analysis of cross-cultural datasets (e.g., Korean, Spanish) emphasizes scalability. However, the clinical applicability of findings is discussed superficially. A dedicated subsection on "Translational Challenges" (e.g., integration into clinical workflows, regulatory hurdles) would better contextualize the significance.

 

Response:
We have added a new subsection titled "Translational Challenges" under the Discussion section (Page 16, Lines 637–650) to contextualize key issues such as clinical integration, regulatory barriers, and workflow alignment.

 

  4. The risk of bias assessment lacks depth. A standardized tool (e.g., QUADAS-2 for diagnostic studies) should be applied and results summarized in the main text. Data synthesis is narrative; a meta-analysis (even limited) of accuracy/sensitivity metrics across studies would strengthen evidence synthesis.

 

Response:
All full-text articles underwent a quality assessment based on the Kitchenham and Charters framework. We used 19 assessment questions (Supplementary Table 3) and excluded studies that satisfied fewer than 12 of them. Details are provided in Section 2.4 (Pages 4–5, Lines 139–155), and findings are summarized in Supplementary Table 4.
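
As a purely illustrative sketch of this thresholding step (the study names and scores below are hypothetical; the 19 criteria themselves appear in Supplementary Table 3), the exclusion rule amounts to:

    # Hypothetical quality scores out of the 19 checklist questions.
    # Per Section 2.4, studies satisfying fewer than 12 criteria are excluded.
    scores = {"Study A": 17, "Study B": 11, "Study C": 14, "Study D": 9}
    included = [study for study, score in scores.items() if score >= 12]
    print(included)  # ['Study A', 'Study C']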

 

Furthermore, we now include a forest plot summarizing performance metrics across studies in the Results section (Pages 11–13, Lines 428–483), addressing both this and another reviewer’s recommendation.


 

  5. The introduction is overly lengthy; condensing background on PD pathophysiology would allow more space for critical analysis of ML advancements. Performance metrics in Table 3 are inconsistently reported (e.g., missing AUC/F1-scores for some studies). Standardizing columns (e.g., "N/A" for unreported metrics) would improve readability.

 

Response:
We have condensed the Introduction to focus more on ML-specific developments and reduced background content on PD pathophysiology. Additionally, Table 3 has been replaced with a forest plot summarizing model performance, ensuring consistency and improved clarity (Pages 11–13, Lines 428–483).

 

We hope that these revisions and clarifications address all concerns, enhance the quality, transparency, and impact of our systematic review, and that the revised manuscript can now be accepted for publication.

I look forward to hearing from you and thank you in advance for your attention and time.

Sincerely,

Mohammad Amran Hossain

On behalf of all authors

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

After carefully reviewing this work, some comments are given below:

  1. The manuscript does not offer substantial novelty over existing systematic reviews in the field of ML for Parkinson’s Disease diagnosis using speech/voice data. For example, related recent systematic reviews (Hecker et al., 2022; Idrisoglu et al., 2023; Altham et al., 2024) already comprehensively cover ML/AI applications in voice analysis for PD and other neurodegenerative diseases. The present review simply aggregates similar studies without offering new insights, meta-analysis, or actionable frameworks.
  2. There is vague or insufficient description of the quality assessment framework; for example, it is unclear how many studies failed which criteria, or what the “19 questions” entailed.

  3. There is no attempt to quantitatively compare model performance or conduct a meta-analysis (even though the field could support this).

  4. The review simply lists datasets and algorithms, reporting the “best performance” metric per paper, without discussing differences in validation design, data leakage threats, or external validity.

  5.  The manuscript frequently makes unsupported or overly optimistic claims about the transformative ability of ML for PD diagnosis, without qualifying the actual evidence strength.
  6. The manuscript contains numerous grammatical, typographical, and stylistic errors, undermining professionalism and readability. For example, the authors frequently use awkward or incorrect phrasing (e.g., “Telephony Center” instead of “Telepharmacy Center” in the Acknowledgments); there are missing or inconsistent tenses, singular/plural mismatches, and formatting errors (misplaced mathematical notations, table captions, etc.); and references are inconsistently formatted.
  7. The authors’ discussion of study limitations is highly generic and does not provide actionable recommendations or acknowledge the field’s major methodological flaws.

Author Response

Reviewer #2:

 

We thank Reviewer 2 for a careful evaluation and critical feedback. We have worked to address all major concerns.

After carefully reviewing this work, some comments are given below:

  1. The manuscript does not offer substantial novelty over existing systematic reviews in the field of ML for Parkinson’s Disease diagnosis using speech/voice data. For example, related recent systematic reviews (Hecker et al., 2022; Idrisoglu et al., 2023; Altham et al., 2024) already comprehensively cover ML/AI applications in voice analysis for PD and other neurodegenerative diseases. The present review simply aggregates similar studies without offering new insights, meta-analysis, or actionable frameworks.

 

Response:
We differentiate our work by analyzing real-world datasets and introducing the concept of conversational dialogue as an underutilized modality. We also provide an updated synthesis of recent studies (up to 2024), which were not covered in prior reviews, and offer an actionable framework on translational challenges. Regarding the meta-analysis, although it was outside our study scope, we attempted to address the reviewer’s comment by conducting one. However, the heterogeneity of the included studies was very high (more than 90%), so we decided not to report the meta-analysis results. We appreciate the reviewer’s understanding.
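
For readers unfamiliar with the heterogeneity figure cited here, the following minimal Python sketch (with entirely hypothetical per-study accuracies and standard errors, not the review's extracted data) shows how Cochran's Q and the I² statistic quantify between-study heterogeneity:

    import numpy as np

    # Hypothetical per-study effect estimates (e.g., accuracies) and standard errors.
    effects = np.array([0.91, 0.84, 0.97, 0.78, 0.93])
    se = np.array([0.02, 0.05, 0.01, 0.06, 0.03])

    w = 1.0 / se**2                           # inverse-variance weights
    pooled = np.sum(w * effects) / np.sum(w)  # fixed-effect pooled estimate
    Q = np.sum(w * (effects - pooled) ** 2)   # Cochran's Q
    df = len(effects) - 1
    I2 = max(0.0, (Q - df) / Q) * 100         # I^2 as a percentage

    print(f"pooled = {pooled:.3f}, Q = {Q:.1f}, I^2 = {I2:.1f}%")

An I² above roughly 75% is conventionally read as high heterogeneity, so a value above 90% supports the authors' decision not to report pooled estimates.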

 

 

  2. There is a vague or insufficient description of the quality assessment framework; for example, it is unclear how many studies failed which criteria, or what the “19 questions” entailed.

 

Response:
We clarified the quality assessment process in Section 2.4 (Pages 3–4, Lines 138–155). All criteria and excluded studies are listed in Supplementary Tables 3 and 4. Studies that did not satisfy at least 12 of the 19 questions were excluded after group discussion and consensus.

 

 

  3. There is no attempt to quantitatively compare model performance or conduct a meta-analysis (even though the field could support this).

 

Response:
We now include a forest plot summarizing model performance in the Results section (Pages 11–13, Lines 427–482).

 

 

  4. The review simply lists datasets and algorithms, reporting the “best performance” metric per paper, without discussing differences in validation design, data leakage threats, or external validity.

 

 

Response:
We now discuss model validation designs in Section 3.4.1 (Page 10, Lines 384–409), with additional discussion of data leakage and external validity risks in Sections 4.4 and 4.5 (Pages 15–16, Lines 589–660).

 

  5. The manuscript frequently makes unsupported or overly optimistic claims about the transformative ability of ML for PD diagnosis, without qualifying the actual evidence strength.

 

Response:
We have revised the manuscript to avoid overstated claims and instead emphasize the current limitations, the need for external validation, and translational barriers. These are covered under the Research Challenges and Recommendations and Translational Challenges subsections (Pages 15–16, Lines 621–665).

 

 

  6. The manuscript contains numerous grammatical, typographical, and stylistic errors, undermining professionalism and readability. For example, the authors frequently use awkward or incorrect phrasing (e.g., “Telephony Center” instead of “Telepharmacy Center” in the Acknowledgments); there are missing or inconsistent tenses, singular/plural mismatches, and formatting errors (misplaced mathematical notations, table captions, etc.); and references are inconsistently formatted.

 

 

Response:
We have thoroughly proofread and revised the manuscript for grammar, clarity, and academic tone. Specific corrections include terminology (e.g., replacing “Telephony Center” with “Telepharmacy Center”), tense consistency, singular/plural agreement, and formatting (e.g., reference styling, table alignment, and figure captions).

 

 

 

  7. The authors’ discussion of study limitations is highly generic and does not provide actionable recommendations or acknowledge the field’s major methodological flaws.

 

 

Response:
We have revised the Study Limitations section (Page 16, Lines 652–660) to include more specific limitations such as language bias, dataset imbalance, lack of external validation, and reproducibility challenges. Actionable recommendations are now included under Research Challenges and Recommendations.

 

We hope that these revisions and clarifications address all concerns, enhance the quality, transparency, and impact of our systematic review, and that the revised manuscript can now be accepted for publication.

I look forward to hearing from you and thank you in advance for your attention and time.

Sincerely,

Mohammad Amran Hossain

On behalf of all authors

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

This manuscript presents a systematic review of recent studies that apply machine learning (ML) techniques to diagnose Parkinson’s disease (PD) through voice, speech, and language data. The topic is highly relevant and timely, and the authors have made a commendable effort to identify and synthesize a wide range of literature. However, several methodological and interpretive issues should be addressed to improve the clarity, reproducibility, and utility of the review. Below are my major comments:

1. While the review addresses three main research questions, the synthesis of findings remains largely descriptive. The application of frameworks such as SPICMO or PRISMA is not clearly reflected in the organization of the results. A more structured thematic synthesis (e.g., by feature type, ML model family, or evaluation metric) would improve clarity and analytical depth.

2. The inclusion criteria are only briefly described. The specific search strings, database-specific filters, and application of Boolean logic are not sufficiently reported. Moreover, the boundary between “speech,” “voice,” and “language” data is conceptually unclear, which could compromise the consistency of study selection.

3. The authors mention that a quality appraisal was conducted based on the Kitchenham and Charters checklist, but the review does not present individual quality scores or a summary of the appraisal results. This omission weakens the credibility of the selection process and the relative weight of findings. Including a summary table or matrix of quality indicators is strongly recommended.

4. Although the authors state that meta-analysis was beyond the scope of the current field, many included studies report standard metrics (e.g., accuracy, precision, F1-score). A basic forest plot or aggregated comparison across model types and datasets would add quantitative insight. The reasons for omitting such analysis should be more rigorously justified.

5. The discussion section lacks a deeper exploration of why ML models, despite their high accuracy in research contexts, have not yet been widely adopted in clinical practice. Potential obstacles—such as dataset generalizability, linguistic variability, and lack of external validation—should be discussed more thoroughly to enhance the translational value of the review.

6. It is not always clear whether the reported “best performances” (e.g., highest accuracy or F1-score) were achieved using individual ML/DL models or through hybrid/ensemble techniques. This ambiguity makes it difficult for readers to identify which specific techniques consistently yielded superior results. I recommend the authors revise the performance summary tables to clearly indicate the model combinations (e.g., SVM alone vs. CNN+LSTM ensembles), and to provide a synthesized conclusion on which model types or architectures are most promising across studies.

Author Response

Reviewer #3:

 

We thank Reviewer 3 for their thorough review and insightful recommendations, which have helped improve the quality and clarity of the manuscript.

 

Comments and Suggestions for Authors

This manuscript presents a systematic review of recent studies that apply machine learning (ML) techniques to diagnose Parkinson’s disease (PD) through voice, speech, and language data. The topic is highly relevant and timely, and the authors have made a commendable effort to identify and synthesize a wide range of literature. However, several methodological and interpretive issues should be addressed to improve the clarity, reproducibility, and utility of the review. Below are my major comments:

  1. While the review addresses three main research questions, the synthesis of findings remains largely descriptive. The application of frameworks such as SPICMO or PRISMA is not clearly reflected in the organization of the results. A more structured thematic synthesis (e.g., by feature type, ML model family, or evaluation metric) would improve clarity and analytical depth.

 

Response:
We have reorganized the Results and Discussion sections thematically, grouping content by feature type, ML model family, and evaluation metric to improve analytical depth and coherence.

 

  2. The inclusion criteria are only briefly described. The specific search strings, database-specific filters, and application of Boolean logic are not sufficiently reported. Moreover, the boundary between “speech,” “voice,” and “language” data is conceptually unclear, which could compromise the consistency of study selection.

Response:
We now clarify the inclusion criteria in Section 2.2 (Page 3, Lines 110–119). We used Boolean operators such as “voice OR speech OR language” to identify studies analyzing audio signals for PD diagnosis. While our scope was focused on signal-level analyses, we did not exclude studies based on the spoken language or speech task type.
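
Purely as an illustration (this is not the authors' exact registered search string), a PubMed-style query built from such Boolean operators might look like:

    ("Parkinson Disease"[MeSH Terms] OR parkinson*[Title/Abstract])
    AND (voice[Title/Abstract] OR speech[Title/Abstract] OR language[Title/Abstract])
    AND ("machine learning" OR "deep learning" OR "artificial intelligence")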

 

  3. The authors mention that a quality appraisal was conducted based on the Kitchenham and Charters checklist, but the review does not present individual quality scores or a summary of the appraisal results. This omission weakens the credibility of the selection process and the relative weight of findings. Including a summary table or matrix of quality indicators is strongly recommended.

Response:
We used the Kitchenham and Charters checklist (19 questions) for quality assessment, excluding studies scoring below 12. This process is detailed in Section 2.4 (Pages 3–4, Lines 139–155) and Supplementary Tables 3–4. We have now added a summary table of quality indicators to enhance transparency.

 

  4. Although the authors state that meta-analysis was beyond the scope of the current field, many included studies report standard metrics (e.g., accuracy, precision, F1-score). A basic forest plot or aggregated comparison across model types and datasets would add quantitative insight. The reasons for omitting such analysis should be more rigorously justified.

Response:
We agree and now include a meta-analysis and forest plot in the Results section summarizing model performance across studies (Pages 11–13, Lines 427–482).
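
As a sketch of the kind of figure described, the following Python/matplotlib snippet draws a per-study accuracy forest plot; the study labels, accuracies, and confidence intervals are hypothetical and do not reproduce the manuscript's figure:

    import numpy as np
    import matplotlib.pyplot as plt

    # Hypothetical per-study best accuracies with 95% confidence intervals.
    studies = ["Study A (SVM)", "Study B (CNN)", "Study C (KNN)", "Study D (CNN-LSTM)"]
    acc = np.array([0.89, 0.94, 0.82, 0.96])
    ci_lo = np.array([0.84, 0.90, 0.75, 0.93])
    ci_hi = np.array([0.94, 0.98, 0.89, 0.99])

    y = np.arange(len(studies))
    fig, ax = plt.subplots(figsize=(6, 3))
    ax.errorbar(acc, y, xerr=[acc - ci_lo, ci_hi - acc],
                fmt="s", color="black", capsize=3)
    ax.set_yticks(y)
    ax.set_yticklabels(studies)
    ax.invert_yaxis()  # first study at the top, following forest-plot convention
    ax.set_xlabel("Accuracy (95% CI)")
    plt.tight_layout()
    plt.show()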

 

 

  5. The discussion section lacks a deeper exploration of why ML models, despite their high accuracy in research contexts, have not yet been widely adopted in clinical practice. Potential obstacles, such as dataset generalizability, linguistic variability, and lack of external validation, should be discussed more thoroughly to enhance the translational value of the review.

Response:
We discuss clinical adoption barriers, including dataset generalizability and regulatory constraints, in a newly added subsection titled "Translational Challenges" (Page 16, Lines 638–650), as well as under Research Challenges and Recommendations.

  6. It is not always clear whether the reported “best performances” (e.g., highest accuracy or F1-score) were achieved using individual ML/DL models or through hybrid/ensemble techniques. This ambiguity makes it difficult for readers to identify which specific techniques consistently yielded superior results. I recommend the authors revise the performance summary tables to clearly indicate the model combinations (e.g., SVM alone vs. CNN+LSTM ensembles), and to provide a synthesized conclusion on which model types or architectures are most promising across studies.

Response:
We revised the Model Performance section (Pages 11–13, Lines 427–480) and added specific model types to the forest plot. We also clarify model architectures (e.g., SVM, CNN-LSTM) in Supplementary Table 7 and discuss performance trends in the Discussion section (Pages 14–15, Lines 565–587).

We hope that these revisions and clarifications address all concerns, enhance the quality, transparency, and impact of our systematic review, and that the revised manuscript can now be accepted for publication.

I look forward to hearing from you and thank you in advance for your attention and time.

Sincerely,

Mohammad Amran Hossain

On behalf of all authors

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

In this revision, most of my previous concerns have been addressed. However, the following concerns should be addressed as well:

  1. Although the review is qualitative, a basic meta-analysis or vote-counting approach (e.g., proportion of studies above certain accuracy thresholds by model type) would have added analytical depth.
  2. More critical analysis is needed regarding studies with flawed validation (e.g., no test set, poor metric choices). These should be flagged more explicitly.
  3. Although eligibility criteria are provided, justification for the inclusion of studies with very small sample sizes or limited demographic data should be better clarified.
  4. Figures 5 and 6 are valuable but need more detailed captions and explanations to be understandable as standalone visuals.

 

Author Response

We thank Reviewer 2 for the constructive feedback and thoughtful suggestions. Below we address each comment in detail and indicate where changes have been made in the manuscript.

 

In this revision, most of my previous concerns have been addressed. However, the following concerns should be addressed as well:

  1. Although the review is qualitative, a basic meta-analysis or vote-counting approach (e.g., proportion of studies above certain accuracy thresholds by model type) would have added analytical depth.

 

Response:
Thank you for this valuable suggestion. We have now integrated a vote-counting analysis to enhance the analytical depth of our review. This includes categorizing studies by model family and reporting the proportion achieving ≥90%, 80–89%, and <80% accuracy. This analysis is presented in Figure 7 and discussed in Section 4.3 (Pages 15–16, Lines 592–610).
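
The vote-counting approach described can be illustrated with a short Python sketch; the model families and best accuracies below are hypothetical, not the review's extracted data:

    import pandas as pd

    # Hypothetical best-reported accuracy per study, tagged by model family.
    df = pd.DataFrame({
        "model_family": ["SVM", "SVM", "CNN", "CNN", "KNN", "CNN-LSTM"],
        "best_accuracy": [0.91, 0.86, 0.95, 0.88, 0.78, 0.97],
    })

    # Bin accuracies into the <80%, 80-89%, and >=90% bands used in Figure 7.
    bands = pd.cut(df["best_accuracy"], bins=[0.0, 0.80, 0.90, 1.01],
                   labels=["<80%", "80-89%", ">=90%"], right=False)
    counts = (df.groupby(["model_family", bands], observed=False)
                .size().unstack(fill_value=0))
    print(counts)                                  # studies per family per band
    print(counts.div(counts.sum(axis=1), axis=0))  # proportions per family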

 

 

  2. More critical analysis is needed regarding studies with flawed validation (e.g., no test set, poor metric choices). These should be flagged more explicitly.

 

Response:

We agree and have revised the Risk of Bias section (Pages 16–17, Lines 639–651) to more explicitly highlight methodological concerns, including the absence of independent test sets, use of training data for evaluation, and inappropriate metrics. These issues are now clearly flagged in Supplementary Table 7, under the column "Methodological Flaws."

 

 

  3. Although eligibility criteria are provided, justification for the inclusion of studies with very small sample sizes or limited demographic data should be better clarified.

 

Response:
We appreciate this observation. A clarification has been added to Section 2.2, Eligibility Criteria (Page 4, Lines 156–161), to explain that such studies were retained due to their methodological novelty, unique datasets, and the limited availability of large open datasets in this emerging research area.

 

 

  4. Figures 5 and 6 are valuable but need more detailed captions and explanations to be understandable as standalone visuals.

 

Response:

We have revised the captions for Figures 5 and 6 to ensure they are fully self-contained and informative for standalone interpretation. Additionally, explanatory sentences have been added to the main text to guide the reader’s understanding of the figures in context (Pages 11–12, Lines 435–501).

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

It seems that the current form of this manuscript is now ready for publication.

Author Response

It seems that the current form of this manuscript is now ready for publication.

 

 

We are grateful to Reviewer 3 for the positive assessment, noting that the manuscript is now suitable for publication in its current form.

Author Response File: Author Response.pdf

Round 3

Reviewer 2 Report

Comments and Suggestions for Authors

In this revision, all my previous concerns have been addressed. No further comments on this manuscript.
