1. Introduction
A wide range of vocal cord disorders affects the general population worldwide, impacting individual health and lifestyle. Early and systematic detection therefore contributes significantly to the effective treatment of these disorders [
1]. Such detection methods, including automated systems, are key to diagnosing, treating, and tracking voice disorders, helping physicians treat patients in a timely manner and identify those who require hospitalization.
Advances in acoustic analysis and in-depth research on automated systems have enabled significant growth in the automated detection of vocal disorders. Because communication reflects a person’s physical and mental status, recognizing voice pathology is crucial for patient-specific healthcare applications [
2].
Modern techniques, including indirect laryngoscopy, video laryngoscopy, stroboscopy, and the trained ear of a clinician, can help identify problems with the larynx and vocal folds in individual patients, facilitating a qualitative evaluation of these structures. In addition, medical professionals use the Grade, Roughness, Breathiness, Asthenia, Strain (GRBAS) scale and the Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V) to evaluate voice quality [
3].
Structural disorders are the most commonly observed category of voice disorders. They result from physical abnormalities or strain-related damage to the vocal folds and other laryngeal structures [
4]. These disorders can significantly impair voice production, limiting the ability to speak. With technological advancements, however, a growing range of tools is being developed to distinguish a normal voice from a pathological one using acoustic analysis [
5].
Voice pathology assessment has evolved from subjective auditory-perceptual evaluation toward instrumental and AI-based methods. The use of instrumental techniques such as videostroboscopy for visualizing vocal fold movement, aerodynamic evaluation for measuring subglottic pressure and airflow, and acoustic analysis for assessing jitter, shimmer, and the harmonics-to-noise ratio (HNR) has significantly improved diagnostic accuracy [
6,
7]. Combining machine learning and artificial intelligence enables predictive analytics and the automated classification of vocal and speech disorders, strengthening the evidence base for their identification and treatment and leading to more reliable clinical outcomes.
Logistic regression is a widely used statistical technique for binary classification [
8,
9,
10], making it an appropriate choice for predicting the presence of voice pathology. Owing to its simplicity, interpretability, and ability to provide insights into predictor significance, logistic regression is a preferred initial approach for voice pathology diagnosis. This study applied logistic regression to the Perceptual Voice Qualities Database (PVQD) to evaluate the relationship between various predictors, including age, severity, loudness, gender, and voice quality features, and the probability of diagnosis.
Logistic regression (LR) is used as the primary classification model for predicting voice pathology from perceptual voice quality ratings on the Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V) scale. CAPE-V provides quantitative assessments of vocal qualities such as roughness, breathiness, strain, pitch, and loudness, which serve as predictor variables in the logistic regression model. The dependent variable is binary, indicating the presence or absence of voice pathology, making LR an appropriate statistical tool for evaluating the relationship between voice quality measures and pathology diagnosis.
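As a concrete illustration of this setup, the sketch below fits such a model with statsmodels. It is only a minimal example under assumed names: the file `pvqd_capev.csv`, the column labels, and the 0/1 coding of the outcome are placeholders for illustration, not an official PVQD export.

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical export of the PVQD ratings; file and column names are placeholders.
data = pd.read_csv("pvqd_capev.csv")

# CAPE-V style ratings plus demographics; gender assumed to be coded numerically (e.g., 0/1).
predictors = ["age", "gender", "severity", "roughness",
              "breathiness", "strain", "pitch", "loudness"]

X = sm.add_constant(data[predictors])   # design matrix with an intercept term
y = data["diagnosis"]                   # 1 = pathology present, 0 = absent

model = sm.Logit(y, X).fit()            # maximum-likelihood logistic regression
print(model.summary())                  # coefficients, standard errors, p-values
```

Each fitted coefficient is on the log-odds scale, so exponentiating it yields the corresponding odds ratio.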
The findings of this study indicated that certain features significantly predicted voice pathology, while others did not meaningfully contribute to the model. Despite its usefulness, logistic regression has inherent limitations, including its assumption of linearity in the log-odds and its difficulty in capturing complex relationships between variables. The model’s sensitivity to outliers and the potential for overfitting or underfitting further impact its predictive reliability. These constraints highlight the necessity for broader validation and the exploration of more advanced machine learning techniques to enhance diagnostic accuracy and generalizability.
2. Literature Review
Annually, around 8% to 10% of individuals are affected by voice pathologies, and the risk rises with the amount of strain placed on the voice. As a result, professionals who use their voice continuously at work, such as teachers, are affected more often, with reported incidence exceeding 28% [
11,
12,
13] and reaches up to 58% [
14]. Even so, the number of people being treated for voice disorders remains low.
To improve the detection and treatment of these pathologies, several Automatic Voice Disorder Detection (AVDD) systems have been proposed. In recent years, research on AVDD system design has focused on the acoustic features used to represent the audio signal, including Mel-Frequency Cepstral Coefficients (MFCCs) [
15,
16,
17], Linear Predictive Cepstral Coefficients (LPCCs) [
18], and glottal source features.
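As a point of reference, the sketch below computes a typical MFCC front end with the librosa library and summarizes it with per-coefficient statistics. This is an illustrative sketch only; the file path and frame settings are assumptions, not the configuration used in the cited studies.

```python
import librosa
import numpy as np

# Load a sustained-vowel recording (placeholder path) at its native sampling rate.
signal, sr = librosa.load("sustained_vowel.wav", sr=None)

# 13 MFCCs per ~25 ms frame with a 10 ms hop, a common front-end configuration.
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13,
                            n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))

# Many AVDD systems summarize the frame-level matrix with per-coefficient statistics.
features = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])
print(features.shape)  # (26,)
```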
The range of statistical classifiers used includes logistic regression for binary classification tasks, Gaussian Mixture Models (GMMs), and Support Vector Machines (SVMs) [
6]. Classifiers based on Deep Neural Network (DNN) architectures, such as Convolutional Neural Networks (CNNs) [
8] and Long Short-Term Memory (LSTM) networks [
7], have also demonstrated their effectiveness in voice-related studies. Self-supervised (SS) learning representations have recently been employed to extract features from audio [
9]. The strength of these representations lies in the ability of SS representation learning to expose underlying relationships in the data. As classifiers, such systems use architectures including Multi-Layer Perceptrons (MLPs) and Class Token Transformer (CT Transformer) designs based on Multihead Self-Attention (MSA) [
10]. A drawback is that these representations require large amounts of audio data for training. The limited availability and variability of pathological voice data constrain progress on these systems, and as a result several current studies focus on generating synthetic data.
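To make the comparison of such classifiers concrete, the hedged sketch below cross-validates a logistic regression and an SVM with scikit-learn on a stand-in feature matrix; in a real AVDD system the features would be per-recording acoustic descriptors such as the MFCC statistics above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data; in practice X would hold acoustic features and y pathology labels.
X, y = make_classification(n_samples=300, n_features=26, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "SVM (RBF kernel)": SVC(kernel="rbf"),
}
for name, clf in models.items():
    pipe = make_pipeline(StandardScaler(), clf)            # scale, then classify
    scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean AUC = {scores.mean():.3f}")
```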
Machine learning algorithms have contributed to commendable progress in the classification and detection of voice disorders. Deep learning and machine learning, emerging as powerful tools for medical diagnostics, can identify pathological conditions non-invasively and accurately from analyses of the human voice. Notable work has applied deep learning models within mobile healthcare frameworks to identify voice pathology. For example, Alhussein et al. developed a system for detecting voice pathology using CNN architectures such as VGG-16 and CaffeNet with transfer learning. Tested on the Saarbruecken voice disorder database, the system produced an overall classification accuracy of 97.5%, clearly demonstrating the potential of deep learning in real-time, mobile-oriented healthcare applications [
19].
In another instance, Mittal et al. proposed a deep learning-based architecture featuring decision-level fusion with non-parametric learners for classifying voice pathologies. The model performed considerably better than traditional machine learning approaches, detecting voice pathologies with a classification accuracy of 99.14% and underscoring the potential of deep learning to enhance the robustness and accuracy of voice pathology detection systems [
20].
Ghoniem et al. proposed a hybrid diagnostic system in which deep CNNs are integrated with genetic algorithms (GAs) to optimize the classification process. This not only increased classification performance by 5.4% over standard CNNs but also yielded a total accuracy of 99.37%, demonstrating the potential of hybrid AI models in voice pathology diagnostics [
21].
Another important contribution was made by Miliaresi et al., who proposed a deep learning architecture that processes the voice signal together with medical metadata to classify voice disorders. Their model, which reached a classification accuracy of 64.4%, provided evidence of the potential of combining several data modalities to improve diagnostic performance. This approach is important for addressing the challenges posed by limited training data, a common issue in medical applications [
22].
Electroglottographic signals have also been integrated into deep learning models for voice pathology classification. The multimodal system developed by Miliaresi et al. uses such signals alongside conventional audio descriptors and medical data. This incorporation greatly improved the system’s classification ability, reaching a state-of-the-art accuracy of 89.3%, and indicates that a multimodal approach is important for capturing the highly complex nature of voice disorders and improving diagnostic accuracy [
23].
Similarly, research by Wang et al. demonstrated the potential of DNNs for classifying the voice signals of patients with glottic neoplasms versus those with benign voice disorders. With a testing accuracy of 86.11%, this work reflects the suitability of DNNs for medical diagnosis, especially the early-stage detection of certain voice pathologies [
24].
Ksibi et al. developed a two-level classifier model that first categorizes the gender of the speaker and then decides whether the voice is healthy or pathological. The approach, applied to the vowel /a/ taken from the Saarbruecken Voice Database, obtained 88.84% accuracy, forming a promising direction toward more accurate early diagnosis of voice pathologies [
25]. Chen and Chen developed an automatic diagnosis method using a DNN for detecting pathological voices. Their DNN model outperformed Support Vector Machines and random forests, achieving an accuracy of 98.6% and demonstrating the effectiveness of deep learning in this application [
26].
Oropharyngeal dysphagia (OD) impacts health and quality of life to a great extent, with aspiration being a key concern, typically assessed using gold-standard tools like FEES and VFSS. A study by Heijnen et al. employed penalized logistic regression to predict aspiration in OD patients using self-evaluation questionnaires (e.g., SWAL-QOL, DHI, EAT-10) and oral intake status, achieving excellent model performance with an AUC of 0.92. These findings suggest that non-invasive assessments combined with logistic regression can provide an accurate alternative for predicting aspiration in situations where instrumental evaluations are unavailable [
27].
A retrospective study by Huwyler et al. evaluated compliance with speech therapy for voice disorders among English- and non-English-speaking patients using logistic regression models. Results indicated that non-English-speaking patients had significantly lower compliance (63% vs. 74%,
p = 0.0011), with interpreter needs, younger age, and lower income being associated with higher noncompliance (aOR 1.56, 95% CI 1.11–2.18). The study highlights that interventions like in-person interpreters or multilingual speech therapists may mitigate barriers to therapy compliance for non-English-speaking populations [
28].
Parkinson’s disease (PD) is often associated with hypokinetic dysarthria (HD), and a study by D. Kovac et al. aimed to identify acoustic features for language-independent diagnosis. Using logistic regression, the study achieved 72–73% classification accuracy in single-language models, which dropped to 67% in multilingual models, highlighting the challenge of cross-language generalization. Prosodic features, interword pause frequency in particular, were identified as significant predictors, suggesting their potential for automatic and language-independent PD diagnosis [
29].
Wahengbam et al. proposed a deep learning architecture for the multiclass classification of voice pathologies using a “group decision analogy” technique. Their model achieved a maximum overall classification accuracy of 97.7%, and some pathologies, such as hypokinetic dysphonia and reflux laryngitis, were classified with 100% accuracy [
30].
A review by Seok et al. of current trends in AI-based voice pathology research outlined the progress made in both binary and multi-class classification tasks. They discussed challenges and limitations of applying AI in clinical settings, including the need for large datasets and for incorporating continuous speech analysis, which could further enhance diagnostic accuracy [
31]. Applications of ML have shown promising success in the domain of voice pathology detection [
32].
This study explores the insights and challenges involved in using logistic regression for voice pathology detection.
The objectives of this study are as follows:
- 1.
Evaluate the effectiveness of logistic regression in voice pathology detection by analyzing its predictive performance on the Perceptual Voice Qualities Database (PVQD) using CAPE-V scale ratings.
- 2.
Identify significant predictors of voice pathology and assess their statistical influence through odds ratios and p-values in the logistic regression model, as illustrated in the sketch after this list.
- 3.
Analyze the limitations and challenges of logistic regression in handling non-linear relationships, feature interactions, and dataset variability and propose potential improvements through advanced machine learning techniques.
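Objective 2 can be read directly from a fitted logistic regression: exponentiated coefficients are odds ratios, and the Wald tests supply p-values. The minimal sketch below continues the hypothetical statsmodels example from the Introduction and assumes a fitted result named `model`.

```python
import numpy as np
import pandas as pd

# Odds ratio = exp(coefficient); values above 1 raise the odds of pathology.
ci = np.exp(model.conf_int())                  # 95% CI on the odds-ratio scale
report = pd.DataFrame({
    "odds_ratio": np.exp(model.params),
    "p_value": model.pvalues,
    "ci_low": ci[0],
    "ci_high": ci[1],
})
print(report.round(3))
```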
The improvements in accuracy and robustness reported in these studies point to the potential of such technologies to transform voice pathology diagnostics. Although challenges with data limitations, model interpretability, and clinical implementation persist, future research into how these challenges can be met is needed to realize the full potential of AI in this domain.
5. Results and Discussion
The logistic regression analysis of the dataset provided insights into the influence of age, gender, loudness, and severity on the probability of voice pathology diagnosis.
Figure 2 shows that age exhibited a positive, approximately linear relationship with the likelihood of diagnosis, with older individuals being at higher risk (
Figure 2, top-left). This aligns with the general understanding that aging is a risk factor for many voice disorders due to degenerative changes in the vocal folds. Nevertheless, this pattern may not generalize to datasets focusing on pediatric populations or those with specific non-age-related pathologies.
Gender appeared to have a minimal impact on diagnosis probabilities, as shown by the nearly identical values for men and women (
Figure 2, top-right). This suggests that the dataset used in this study lacks significant gender bias in the prevalence of voice disorders. However, other datasets may reveal variations due to known differences in the anatomy and physiology of male and female vocal folds, as well as the prevalence of gender-specific disorders. Also, the relationship between loudness and diagnosis (
Figure 2, bottom-left) might vary depending on the specific disorder studied, as certain conditions, such as vocal fatigue, are closely associated with changes in loudness levels.
Interestingly, the results showed a decreasing trend in diagnosis probability with increasing severity scores (
Figure 2, bottom-right), which is counterintuitive to clinical expectations. This trend may reflect a limitation of the dataset or of the logistic regression model’s assumptions, as severity typically correlates positively with the likelihood of pathology.
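The per-predictor trends discussed above can be reproduced from a fitted model by sweeping one predictor over its observed range while holding the others at their means. The sketch below continues the hypothetical statsmodels example from the Introduction (`data`, `predictors`, `X`, and `model` are assumed from that sketch) and computes the predicted-probability curve for age.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# 50 evaluation points: every predictor held at its mean except age, which is swept.
grid = pd.DataFrame({p: [data[p].mean()] * 50 for p in predictors})
grid["age"] = np.linspace(data["age"].min(), data["age"].max(), 50)
grid = sm.add_constant(grid, has_constant="add")

# Predicted probability of pathology along the age sweep (cf. Figure 2, top-left).
grid["p_diagnosis"] = model.predict(grid[X.columns])
print(grid[["age", "p_diagnosis"]].round(3).head())
```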
Figure 3 and
Figure 4 show the ROC curve and the confusion matrix, respectively. The counterintuitive severity trend emphasizes the importance of carefully evaluating model assumptions and feature relationships when applying logistic regression. These findings also suggest that while logistic regression provides valuable insights, its linearity assumption may not capture complex, non-linear relationships in some datasets.
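Both diagnostics follow directly from the model's predicted probabilities. The sketch below, again continuing the hypothetical example from the Introduction, computes the ROC curve points, the AUC, and a confusion matrix at a 0.5 threshold with scikit-learn; probabilities are computed in-sample purely for illustration, whereas a held-out split or cross-validation would be preferable in practice.

```python
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve

# Predicted probabilities on the design matrix from the earlier sketch.
p_hat = model.predict(X)

auc = roc_auc_score(y, p_hat)
fpr, tpr, _ = roc_curve(y, p_hat)                      # points on the ROC curve (Figure 3)
cm = confusion_matrix(y, (p_hat >= 0.5).astype(int))   # 0.5 decision threshold (Figure 4)

print(f"AUC = {auc:.3f}")
print(cm)   # rows: true class, columns: predicted class
```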