Article

Automated Severity and Breathiness Assessment of Disordered Speech Using a Speech Foundation Model

by Vahid Ashkanichenarlogh 1,2,*, Arman Hassanpour 1,3 and Vijay Parsa 1,2,3

1 National Centre for Audiology, Western University, London, ON N6A 3K7, Canada
2 Department of Electrical and Computer Engineering, Western University, London, ON N6A 3K7, Canada
3 School of Communication Sciences and Disorders, Faculty of Health Sciences, Western University, London, ON N6A 3K7, Canada
* Author to whom correspondence should be addressed.
Information 2026, 17(1), 32; https://doi.org/10.3390/info17010032
Submission received: 3 October 2025 / Revised: 26 December 2025 / Accepted: 29 December 2025 / Published: 3 January 2026

Abstract

In this study, we propose a novel automated model for speech quality estimation that objectively evaluates perceptual dysphonia severity and breathiness in audio samples and demonstrates strong correlation with expert ratings. The proposed model integrates Whisper encoder embeddings with Mel spectrograms augmented by second-order delta features, combined through a sequential-attention fusion network (SAFN) feature mapping path. This hybrid approach enhances the model’s sensitivity to phonetic content, high-level feature representations, and spectral variations, enabling more accurate predictions of perceptual speech quality. The SAFN feature mapping module captures long-range dependencies through multi-head attention, while LSTM layers refine the learned representations by modeling temporal dynamics. Comparative analysis against state-of-the-art methods for dysphonia assessment demonstrates our model’s stronger correlation with clinicians’ judgments across test samples. Our findings underscore the effectiveness of ASR-derived embeddings alongside the deep feature mapping structure in disordered speech quality assessment, offering a promising pathway for advancing automated evaluation systems.

Graphical Abstract

1. Introduction

In clinical practice, evaluating the vocal system (mostly for voice disorders and for accurately assessing dysphonia) primarily relies on auditory-perceptual judgment, a widely used subjective assessment method [1,2]. Auditory-perceptual evaluation plays a vital role in identifying vocal pathologies and monitoring speech disorders, particularly after invasive treatments such as subthalamic nucleus deep brain stimulation for Parkinson’s disease (PD) [3,4]. This method enables clinicians to assess voice quality and track changes over time, offering valuable insights into treatment efficacy. Clinicians assess voice characteristics based on their auditory perception, typically employing standardized rating scales. Examples of standardized scales include the Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V) [5], which allows clinicians to rate voice/speech attributes such as breathiness, roughness, strain, pitch, and overall severity, and the GRBAS scale [6], which facilitates judgments of grade, roughness, breathiness, asthenia, and strain. Unlike GRBAS, which uses an ordinal scale, CAPE-V employs a visual analog scale, allowing for more precise detection of subtle voice quality differences. It also ensures higher intra- and inter-rater reliability, making it more consistent across different clinicians and over time. CAPE-V follows a structured and standardized protocol, reducing variability in assessments and improving comparability across studies. Moreover, it provides a more comprehensive evaluation of voice quality by incorporating a broader range of perceptual parameters beyond the Grade score in GRBAS. These advantages make CAPE-V-based evaluation models particularly useful for clinical assessments and for validating disordered speech processing models in real-world applications.
Despite its significance, auditory-perceptual judgment presents several limitations. First, it requires raters with substantial clinical expertise to ensure accurate assessments. Second, achieving reliability often demands evaluations from multiple experts, adding to the complexity of the process. Third, the procedure is time-intensive, leading to delays in obtaining results, which can hinder timely clinical decision-making. These challenges underscore the need for more objective, standardized, and efficient evaluation methods [7,8]. Furthermore, a recent study [9] investigated the consistency with which experienced voice clinicians applied the CAPE-V protocol for evaluating voice quality. In this study, twenty clinicians assessed audio recordings from twelve individuals with diverse vocal characteristics, using the CAPE-V scales under conditions that reflected typical clinical practice. The results revealed notable variability in clinicians’ ratings, particularly across the dimensions of breathiness, roughness, and strain. This inconsistency highlights a critical challenge in clinical voice assessment: the lack of standardization in applying CAPE-V, which may compromise the accuracy and reliability of dysphonia severity evaluations. As such, this approach is inherently susceptible to inter-rater variability and subjective bias, highlighting the need for more standardized and objective evaluation techniques for the acoustic voice sample [1,2].
In clinical voice assessments, objective evaluation methods have traditionally emphasized the analysis of sustained vowel phonations. This preference stems from the fact that sustained vowels offer a stable and consistent vocal sample, reducing the impact of rapid articulatory movements and prosodic variations typically found in continuous speech. As a result, they provide a controlled environment for measuring key acoustic parameters related to voice quality [10,11]. The meta-analysis by Maryn et al. [10] has identified autocorrelation peak value, spectral flatness of linear prediction residual, and smoothed Cepstral Peak Prominence (CPP) as prominent acoustic correlates of sustained vowel voice quality. However, relying exclusively on sustained vowels presents certain limitations. These phonations may not fully reflect the dynamic characteristics of natural speech, potentially missing critical features of voice disorders that become evident during connected speech. To overcome this, recent advancements have incorporated analyses of continuous speech. For example, the Acoustic Voice Quality Index (AVQI) integrates measurements from both types of speech samples, offering a more comprehensive assessment of dysphonia severity [12]. In particular, the AVQI incorporates measurements of smoothed CPP, amplitude perturbation measures (“shimmer”), and the profile of the long-term average speech spectrum (LTAS) [13].
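Because CPP reappears later as a reference metric, a rough, illustrative computation is sketched below: the cepstral peak in the expected F0 quefrency range is measured relative to a linear regression line over quefrency. The frame length, F0 search range, and regression range are assumptions chosen for illustration; this is not the smoothed CPP recipe used in the AVQI or Praat implementations.

```python
# Rough, illustrative single-frame cepstral peak prominence (CPP) in dB.
import numpy as np

def cpp(frame: np.ndarray, sr: int = 16_000, f0_min: float = 60.0, f0_max: float = 330.0) -> float:
    frame = frame * np.hamming(len(frame))
    log_spectrum = np.log(np.abs(np.fft.fft(frame)) + 1e-12)
    cepstrum_db = 20 * np.log10(np.abs(np.fft.ifft(log_spectrum)) + 1e-12)

    quefrency = np.arange(len(frame)) / sr                 # in seconds
    lo, hi = int(sr / f0_max), int(sr / f0_min)            # quefrency bins spanning the F0 range
    peak_idx = lo + int(np.argmax(cepstrum_db[lo:hi]))

    # Trend line fitted over the low-quefrency half of the cepstrum.
    fit = slice(1, len(frame) // 2)
    slope, intercept = np.polyfit(quefrency[fit], cepstrum_db[fit], 1)
    return float(cepstrum_db[peak_idx] - (slope * quefrency[peak_idx] + intercept))
```

For a 16 kHz recording, calling this on a voiced frame of roughly 1024 samples yields a frame-level value; smoothed CPP variants additionally smooth the cepstrum and average across frames.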
In recent years, Deep Neural Networks (DNNs) have revolutionized speech quality (SQ) and speech intelligibility (SI) assessment in other fields such as telecommunications and assistive hearing devices, offering non-intrusive, end-to-end evaluations that eliminate the need for reference signals [13,14,15,16]. For example, Quality-Net [17] is a DNN-based model designed for non-intrusive speech quality assessment. The model is trained using a Recurrent Neural Network (RNN) with bidirectional long short-term memory (BLSTM) layers to predict perceptual speech quality scores, such as the Perceptual Evaluation of Speech Quality (PESQ), directly from degraded speech. By capturing temporal dependencies in speech signals, Quality-Net refines its predictions by minimizing the mean squared error (MSE) between its predicted and actual quality scores, enhancing its assessment accuracy. Similarly, several advanced approaches have emerged, further improving non-intrusive speech quality assessment. These include STOI-Net [16], which focuses on SI prediction, MOSA-Net [14], a model designed for MOS prediction, CCATMos [18], which integrates contextual and temporal modeling, and TorchAudio-Squim [19], an open-source framework offering pretrained models for speech quality evaluation.
In some research work, self-supervised learning (SSL) has emerged as an approach to overcoming data scarcity in speech processing tasks. By enabling models to learn robust representations from unlabeled data, SSL has been instrumental in developing non-intrusive speech assessment systems [20,21,22,23,24]. Building on this foundation, researchers have explored the potential of Speech Foundation Models (SFMs) for SI and SQ prediction [25]. For instance, Cuervo and Marxer [25] conducted a systematic evaluation of ten SFMs in the context of the Clarity Prediction Challenge 2 (CPC2). Their study revealed that certain SFMs could effectively predict the percentage of correctly perceived words by hearing-impaired listeners from speech-in-noise recordings, achieving state-of-the-art performance with minimal adaptation. Similarly, Mogridge et al. [26] investigated noise-robust SFMs for non-intrusive SI prediction. They proposed extracting temporal-hierarchical features from SFMs trained on large and diverse datasets, such as Whisper, to enhance SI model prediction accuracy. Their results demonstrated that leveraging these rich representations significantly improved non-intrusive SI prediction performance.
These advances in DNN-based methods have impacted disordered voice quality research as well. However, a substantial majority of studies have focused on automatic disorder detection and classification. In a recent scoping review, Liu et al. [27] revealed that 88% of published research between 2000 and 2023 aimed at detecting the presence of voice pathology. For example, in [28], researchers proposed a deep learning framework for the automatic detection of dysphonia based on acoustic features derived from sustained vowel phonations. The study utilized recordings from 238 dysphonic and 223 healthy Mandarin speakers, from which Mel-spectrograms were extracted from 1.5 s audio segments. These features were then used to train Convolutional Neural Networks (CNNs) for binary classification of dysphonic versus healthy voices.
Liu et al.’s scoping review also showed that only 5% of the studies investigated the assessment of voice quality attributes, and all of these studies focused on the GRBAS scale. For example, in [6], the authors developed a DNN model to predict the overall severity of dysphonia using the GRBAS scale, encompassing Grade, Roughness, Breathiness, Asthenia, and Strain. Their model achieved performance that was comparable to, and in some cases exceeded, that of expert human raters [29]. Furthermore, Dang et al. [30] recently introduced a deep learning-based approach for voice quality assessment, incorporating ASR and SSL representations trained on large-scale normal speech datasets. Their work highlights the growing impact of SSL and SFMs in advancing objective and non-intrusive speech assessment methodologies.
Very few studies have applied DNN-based speech quality prediction models for estimating the CAPE-V ratings associated with a speech sample. As an example, Benjamin et al. [31] developed a machine-learning model using support vector machines and acoustic-prosodic features extracted via OpenSMILE to predict CAPE-V overall severity from voice recordings. The model achieved a high correlation (r = 0.847) with expert ratings, with improved performance when combining features from vowels, sentences, and whole audio samples. Lin et al. [32] combined jitter, shimmer, HNR, and zero crossings, along with age and sex variables, in a random forest (RF) machine learning model to predict CAPE-V ratings, but for sustained vowels only. Lin et al. [32] also compared the performance of their Machine Learning (ML) model with SSL and SFM models such as Whisper and WavLM and reported that these models performed equivalently. The performance of these models in predicting CAPE-V ratings for continuous speech sentences is unknown.
In summary, very few machine-learning models have been developed for the comprehensive assessment of dysphonia severity and breathiness that incorporate sentence samples while remaining accurate relative to expert-rated samples [31]; for speech quality measurement, technical feasibility studies are often conducted in the context of clinical disordered voice assessments. In this paper, we propose a model that combines the Whisper transformer encoder (OpenAI, San Francisco, CA, USA), Mel spectrograms with second-order deltas, and a Sequential-Attention Fusion Network (SAFN) feature mapping path, details of which are given in the following section.

2. Proposed Model

The proposed model, illustrated in Figure 1, is designed to estimate speech quality by utilizing features extracted from an ASR model along with Mel-spectrogram representations. The model takes two primary inputs: ASR-derived embeddings, obtained from the encoder hidden states of a pretrained ASR model, and Mel-spectrogram features with second-order deltas. The ASR representations are derived from the Whisper model (whisper-small) [9], which was pretrained to capture rich linguistic information. Specifically, we used the publicly available checkpoint “openai/whisper-small” from the Hugging Face Transformers library (version ≥ 4.30). Whisper embeddings were extracted by passing each waveform (16 kHz) through the Whisper encoder using its default feature extractor, which computes 80-channel log-Mel spectrograms from 25 ms windows with a 10 ms hop size. During training, we did not modify the Whisper architecture or fine-tune its parameters; instead, the encoder’s hidden states from six of its layers were linearly projected via six adapter layers (GELU + LayerNorm + Dropout) and combined using a learnable softmax-weighted fusion. To enhance feature diversity and improve robustness across various auditory conditions, the Whisper-derived embeddings were concatenated with handcrafted 120-dimensional acoustic features derived from Mel-spectrograms and their first- and second-order deltas, computed using torchaudio.transforms.MelSpectrogram.
This combination ensures that both linguistic and spectral features are effectively captured. The extracted features are then processed by downstream modules, which map them to the output labels. In our implementation, the Whisper padding operation was removed before utilizing the ASR features to ensure consistent feature alignment. The ASR model consists of 12 transformer encoder layers (denoted as W) that extract hierarchical speech representations.
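As a concrete illustration of this front end, the sketch below extracts Whisper-small encoder hidden states together with a 40-band log-Mel spectrogram and its first- and second-order deltas (120 dimensions in total), using the Hugging Face Transformers and torchaudio APIs named above. The Mel parameters follow Section 3.2; the log-compression constant and the handling of Whisper’s 30 s padding are simplifications for illustration, not the authors’ released code.

```python
# Illustrative two-stream feature front end (Whisper encoder states + Mel deltas).
import torch
import torchaudio
from transformers import WhisperFeatureExtractor, WhisperModel

SAMPLE_RATE = 16_000

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
whisper_encoder = WhisperModel.from_pretrained("openai/whisper-small").encoder.eval()

mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE, n_fft=400, hop_length=320, n_mels=40
)

def extract_features(waveform: torch.Tensor):
    """waveform: (1, num_samples) mono audio already resampled to 16 kHz."""
    # Whisper stream: 80-channel log-Mel input, encoder hidden states per layer.
    inputs = feature_extractor(
        waveform.squeeze(0).numpy(), sampling_rate=SAMPLE_RATE, return_tensors="pt"
    )
    with torch.no_grad():
        enc_out = whisper_encoder(inputs.input_features, output_hidden_states=True)
    hidden_states = enc_out.hidden_states  # tuple of (1, frames, 768) tensors

    # Handcrafted stream: 40-band log-Mel spectrogram plus first/second deltas
    # -> 120-dimensional frame-level features.
    mel = torch.log(mel_transform(waveform) + 1e-6)              # (1, 40, frames)
    d1 = torchaudio.functional.compute_deltas(mel)
    d2 = torchaudio.functional.compute_deltas(d1)
    mel_feats = torch.cat([mel, d1, d2], dim=1).transpose(1, 2)  # (1, frames, 120)
    return hidden_states, mel_feats
```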
To incorporate representations from multiple depths, an adapter architecture is employed to process the outputs from the deeper six layers of the Whisper model. This design enables the model to capture and utilize hierarchical features learned at varying levels of abstraction, enhancing its ability to extract meaningful information across different layers of the neural network. The adapter architecture consists of a fully connected (FC) layer, a GELU activation function [33], a normalization layer, and a dropout layer with 10% probability, as shown in Figure 2. Residual connections in transformer encoders are known for preserving and propagating features across layers, but adapter networks offer the advantage of fine-tuning and reweighting multi-depth representations, making them valuable for tasks like quality prediction. Adapters, which are lightweight task-specific layers, adapt pre-trained features for downstream tasks, effectively integrating hierarchical information that residual connections alone may not fully capture. The outputs from the six adapters are assigned learnable weights that are passed through a softmax layer so that they sum to 1, and the adapter outputs are then combined using a weighted summation. Additionally, the feature set includes the Mel-spectrogram, along with its first- and second-order delta coefficients, which are crucial for capturing frequency-level characteristics necessary for auditory perceptual assessment. Whisper operates on Mel-spectrogram inputs rather than time-domain data, while our handcrafted stream uses 40-dimensional Mel-spectrograms. For our downstream module, we concatenate all features from Whisper and the Mel-spectrogram (with its deltas) along the feature dimension to create a comprehensive input representation. In our proposed structure, the designed SAFN model, depicted in Figure 3, is constructed from a three-layer unidirectional LSTM for sequence processing, with an input size of 360 and a hidden size of 128, followed by a dropout layer with a probability of 30%. The LSTM output is further refined using an FC layer (128 → 128 dimensions), followed by GELU activation and dropout (30% probability) for additional feature transformation.
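A minimal sketch of the adapter stack and softmax-weighted fusion described above is given below. The 768-dimensional Whisper-small hidden size and the 10% dropout follow the text; the 240-dimensional projection size is an assumption inferred from the 360-dimensional SAFN input (240 Whisper + 120 Mel-delta features) and is not stated explicitly in the paper.

```python
# Sketch of per-layer adapters with learnable softmax-weighted fusion.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """FC -> GELU -> LayerNorm -> Dropout, as in Figure 2."""
    def __init__(self, in_dim: int = 768, out_dim: int = 240, p_drop: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.GELU(),
            nn.LayerNorm(out_dim),
            nn.Dropout(p_drop),
        )

    def forward(self, x):
        return self.net(x)

class WeightedLayerFusion(nn.Module):
    """Adapts each of the six deep encoder layers and fuses them with
    learnable weights constrained by a softmax to sum to 1."""
    def __init__(self, num_layers: int = 6, in_dim: int = 768, out_dim: int = 240):
        super().__init__()
        self.adapters = nn.ModuleList([Adapter(in_dim, out_dim) for _ in range(num_layers)])
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))

    def forward(self, layer_states):
        # layer_states: list of (batch, frames, in_dim) tensors, one per selected layer.
        weights = torch.softmax(self.layer_logits, dim=0)
        adapted = torch.stack([a(h) for a, h in zip(self.adapters, layer_states)])
        return (weights[:, None, None, None] * adapted).sum(dim=0)  # (batch, frames, out_dim)
```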
To enhance contextual modeling and capture long-range dependencies in speech, the model employs two stacked multi-head attention (MHA) layers, each with 128-dimensional embeddings and 16 attention heads. The first MHA layer processes the LSTM-enhanced features by computing self-attention over the input sequences, producing an output which is then added to the original input through a residual connection and normalized via a normalization layer. The second MHA layer follows the same residual formulation, refining the representation further through another normalization operation. These residual connections facilitate stable gradient flow and improve network convergence by preserving essential information from earlier layers. After attention processing, the features are passed through a two-layer feedforward network, where the first layer expands the dimensionality from 128 to 256, followed by a GELU activation, and then projected back to 128 dimensions. Another residual connection ensures that the feedforward transformation is integrated smoothly into the existing feature space. For speech quality estimation, the model employs a linear projection layer (128 → 1 dimension) to predict frame-level quality scores, which are then passed through a sigmoid activation function to normalize the outputs. A global average pooling (GAP) layer is applied to aggregate frame-level scores into a single utterance-level quality prediction. The use of residual connections, layer normalization, and dropout mechanisms ensures robust feature learning and prevents overfitting, making the model effective in capturing both short-term and long-term dependencies in speech. By leveraging ASR embeddings and spectral representations, the proposed structure efficiently models the underlying quality characteristics of speech and provides a reliable framework for objective speech assessment tasks. Although the proposed framework builds upon established architectures such as Whisper-based encoders and Mel-spectrogram feature extractors, its novelty lies in the structured fusion and adaptive mapping strategy realized through the proposed SAFN block (Figure 3). The SAFN block further enhances temporal context modeling via stacked LSTM + GELU feedforward layers, coupled with residual normalization pathways that promote stable gradient flow and selective feature retention. This design enables the network to jointly capture low-level spectral cues and high-level linguistic embeddings, improving sensitivity to perceptual voice distortions such as breathiness and hoarseness. The proposed Whisper + Mel + SAFN architecture was specifically optimized and evaluated for predicting CAPE-V severity and breathiness scores, two perceptually critical dimensions in dysphonia assessment. Importantly, to examine robustness and clinical transferability, we validated the model’s generalization performance on a custom-collected dataset rated by human listeners following the CAPE-V protocol.
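The following sketch assembles the SAFN mapping head as described: a three-layer unidirectional LSTM (input 360, hidden 128), a dropout and FC + GELU refinement stage, two 16-head self-attention blocks with residual connections and layer normalization, a 128 → 256 → 128 feedforward stage with a residual connection, and a sigmoid-scored linear head followed by global average pooling. The exact wiring is a plausible reading of the text, not the authors’ implementation.

```python
# Plausible assembly of the SAFN mapping head described in Section 2.
import torch
import torch.nn as nn

class SAFN(nn.Module):
    def __init__(self, in_dim: int = 360, hidden: int = 128, heads: int = 16, p_drop: float = 0.3):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, num_layers=3, batch_first=True)
        self.refine = nn.Sequential(
            nn.Dropout(p_drop), nn.Linear(hidden, hidden), nn.GELU(), nn.Dropout(p_drop)
        )
        self.mha1 = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.mha2 = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(hidden)
        self.norm2 = nn.LayerNorm(hidden)
        self.ffn = nn.Sequential(nn.Linear(hidden, 256), nn.GELU(), nn.Linear(256, hidden))
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                       # x: (batch, frames, 360)
        x, _ = self.lstm(x)                     # temporal modeling
        x = self.refine(x)                      # FC + GELU + dropout refinement
        attn, _ = self.mha1(x, x, x)
        x = self.norm1(x + attn)                # residual self-attention + layer norm
        attn, _ = self.mha2(x, x, x)
        x = self.norm2(x + attn)
        x = x + self.ffn(x)                     # residual feedforward expansion
        frame_scores = torch.sigmoid(self.head(x)).squeeze(-1)  # frame-level scores in [0, 1]
        return frame_scores.mean(dim=1)         # global average pooling -> utterance score
```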

3. Methodology

3.1. Datasets

We conducted our primary experiments on the Perceptual Voice Qualities Database (PVQD) [34]. PVQD comprises speech samples collected from 296 different talkers, consisting of the sustained vowels /a/ and /i/ and running speech in English (a small number of samples do not contain the sustained vowel /a/ or /i/). Audio signals were captured at a sampling rate of 44.1 kHz and rated by 19 experienced voice clinicians using both the CAPE-V (100-point visual analog scale) and GRBAS scales. Each rater had extensive clinical experience (mean = 13.6 years) and completed ratings via a randomized online protocol to ensure independent and balanced assessment. Additionally, demographic information, including the speaker’s age and gender, was provided to support the analysis. The dataset demonstrated good-to-excellent inter-rater reliability (ICC = 0.860 for CAPE-V; 0.859 for GRBAS) and high intra-rater reliability (ICC > 0.89). The present study used the CAPE-V connected-speech ratings, which represent global perceptual judgments that integrate phonatory, articulatory, and resonance cues. Accordingly, the model learns to predict overall perceptual dysphonia (including severity and breathiness) rather than isolated glottal-source features. In our experiments, the running speech recordings were segmented into clips lasting 2 to 5 s. To ensure subject-level independence and prevent any potential speaker leakage across data partitions, we constructed our training, validation, and test sets based on speaker identity rather than individual segments. Specifically, the PVQD dataset was segmented into 1659 clips, which were divided into 1005 samples for training, 343 for testing, and 311 for validation. It is worth noting that no segments from the same subject were shared across splits; each speaker’s recordings were entirely contained within a single subset. This speaker-exclusive partitioning ensures that the model’s performance reflects genuine generalization to unseen talkers rather than memorization of speaker-specific acoustic characteristics. Consequently, the reported results represent true speaker-independent performance, providing a reliable assessment of the model’s robustness.
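A speaker-exclusive partition of this kind can be expressed compactly with scikit-learn’s GroupShuffleSplit, as sketched below; the split proportions here are placeholders and would need adjustment to reproduce the exact 1005/311/343 partition reported above.

```python
# Sketch of a speaker-exclusive train/validation/test split.
from sklearn.model_selection import GroupShuffleSplit

def speaker_exclusive_split(clips, labels, speaker_ids, test_size=0.2, val_size=0.2, seed=42):
    # Hold out a set of test speakers first.
    outer = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_val_idx, test_idx = next(outer.split(clips, labels, groups=speaker_ids))

    # Split the remaining speakers into train and validation groups.
    inner = GroupShuffleSplit(n_splits=1, test_size=val_size, random_state=seed)
    sub_clips = [clips[i] for i in train_val_idx]
    sub_labels = [labels[i] for i in train_val_idx]
    sub_speakers = [speaker_ids[i] for i in train_val_idx]
    rel_train, rel_val = next(inner.split(sub_clips, sub_labels, groups=sub_speakers))

    train_idx = [train_val_idx[i] for i in rel_train]
    val_idx = [train_val_idx[i] for i in rel_val]
    return train_idx, val_idx, list(test_idx)
```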
In order to test the generalization of the trained models, we utilized a small secondary dataset [35] that contained continuous speech samples from 24 disordered and 6 normal talkers. These speech samples were rated by thirty naïve listeners in four sessions following the CAPE-V protocol. Results showed that the overall severity had high reliability, both across listeners and sessions. As such, the models trained on predicting the CAPE-V severity scores for the PVQD were applied to these secondary dataset samples to assess potential generalization.

3.2. Input Data

In our experiments, all audio recordings were resampled to 16,000 Hz to standardize the input data. For the proposed non-intrusive metric, we extracted Mel-spectrogram features from the recordings using a 400-point FFT for the Short-Time Fourier Transform (STFT). The hop length between consecutive STFT windows was set to 320 samples. To generate the spectrogram, we used a Hann window with a length of 400 samples. Additionally, 40 Mel filter banks were applied to capture the frequency characteristics of the signal. Figure 4 illustrates the Mel-spectrograms of two representative test samples from the dataset, highlighting the frequency content and temporal dynamics of the signals. This approach allows for an efficient and detailed representation of the audio that is critical for auditory perceptual assessments. It is important to note that no external pretraining, data augmentation, or artificial noise injection was applied, and the model was trained purely on the available CAPE-V-annotated PVQD dataset to preserve the natural voice quality characteristics. The dataset was pre-segmented into train and validation subsets using a subject-independent, balanced split strategy: the distribution of CAPE-V ratings was kept approximately uniform across both partitions to avoid bias and reflect inter-speaker variability. This ensured that the validation set served as an unbiased performance estimate rather than a random subset of the same speaker pool.

3.3. Training Setup

The dataset was divided into three parts: training (1005 samples), validation (311 samples), and testing (343 samples). Our experiments were conducted using two distinct target labels, CAPE-V breathiness and CAPE-V severity, with a separate experiment for each. Although severity and breathiness are primarily phonatory attributes, the CAPE-V scores used in this study represent global perceptual judgments that integrate phonatory, articulatory, and prosodic cues present in connected speech. Consequently, the proposed model estimates perceived dysphonia severity and breathiness at the perceptual level rather than purely physiological phonation quality. In our experiments, we utilized the mean absolute error (MAE) loss for the regression tasks. To ensure interpretability and consistency, we reported the MAE as the primary loss metric during training and validation. MAE provides a direct measure of the average absolute deviation between predicted and perceptual CAPE-V ratings, making it particularly suitable for ordinal, bounded clinical scales where large outliers should not disproportionately influence optimization, as would occur with MSE.
Our model was trained using the AdamW optimizer, a variant of Adam with decoupled weight decay regularization. AdamW is particularly well suited for mitigating the issue of weight decay being coupled with the adaptive learning rate updates. In our training process, the optimizer was configured with a learning rate of 5 × 10−6, which is relatively small to ensure stable and gradual convergence, preventing sudden fluctuations in the loss function and helping the model generalize well to unseen data. Additionally, a weight decay of 1 × 10−4 was applied, together with a ReduceLROnPlateau scheduler (factor = 0.5, patience = 4), to regularize the model and discourage excessively large weight values, thereby preventing overfitting. If there was no improvement on the validation set for four consecutive epochs, the learning rate was reduced by half. The batch size was set to 1, and during fine-tuning, the weights of the pre-trained modules remained adjustable and were not frozen. We trained the model for 200 epochs, and the experiments were performed on a system with 32 GB of RAM and an NVIDIA GeForce RTX 3080 Ti GPU (10,240 CUDA cores, 12 GB of memory), using PyTorch ≥ 2.0 (2.5.1+cu121). Table 1 summarizes the proposed model and training configuration.
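The sketch below mirrors the reported training configuration (AdamW with learning rate 5 × 10−6 and weight decay 1 × 10−4, ReduceLROnPlateau with factor 0.5 and patience 4, MAE loss, batch size 1); the model and data loader objects are placeholders for the components of Section 2, not the authors’ training script.

```python
# Hedged sketch of the reported training configuration.
import torch
import torch.nn as nn

def build_training(model: nn.Module):
    criterion = nn.L1Loss()  # mean absolute error
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.5, patience=4
    )
    return criterion, optimizer, scheduler

def train_one_epoch(model, loader, criterion, optimizer, device="cuda"):
    model.train()
    total = 0.0
    for features, target in loader:          # batch size 1, as in the paper
        features, target = features.to(device), target.to(device)
        optimizer.zero_grad()
        pred = model(features)               # utterance-level score in [0, 1]
        loss = criterion(pred, target)
        loss.backward()
        optimizer.step()
        total += loss.item()
    return total / max(len(loader), 1)       # epoch-average MAE
```

After each epoch, the validation MAE would be passed to `scheduler.step(val_loss)` so the learning rate halves after four epochs without improvement, matching the schedule described above.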
In our paper, the Dang et al. model is adopted from [30]. Their proposed architecture employs a multi-stream design in which ASR embeddings, SSL representations, and second-order delta Mel-spectrogram features are processed through three parallel streams and subsequently fused at the feature representation stage. To implement this model, we used the official code released by the authors on GitHub (https://github.com/) and trained it using the same dataset as our proposed model to ensure a fair comparison. For the ablation version considered in this paper, we simplified the original Dang et al. architecture by retaining only the ASR and second-order delta Mel-spectrogram streams. This modification allows us to isolate and examine the contribution of these components to voice quality assessment performance, while making the ablation model more comparable to our proposed approach.

3.4. Statistical Analyses

Comprehensive statistical analyses were performed to evaluate the relationship between the average clinician ratings and the corresponding model-predicted ratings. First, because the data were non-Gaussian, non-parametric tests were applied. The Friedman test was used to compare ratings across models, followed by pairwise post hoc comparisons using the Wilcoxon signed-rank test with Bonferroni correction when the Friedman test indicated significant differences. Second, linear regression analyses were conducted on scatter plots of clinician versus predicted ratings to assess the strength and direction of associations. Third, both Pearson and Spearman correlation coefficients were computed to quantify linear and monotonic relationships, respectively, with 95% confidence intervals estimated via a bootstrap resampling approach (10,000 iterations). Differences in correlation coefficients were tested for statistical significance using Fisher’s z-transformation. Finally, Bland–Altman plots were generated to visualize the level of agreement and potential systematic bias between clinician and predicted ratings.
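This workflow can be reproduced with standard SciPy routines, as sketched below; for brevity the post hoc comparisons are limited to each model versus the clinician ratings, whereas the analyses reported here covered all pairwise contrasts. All input arrays are placeholders for matched per-sample ratings.

```python
# Sketch of the non-parametric tests and bootstrapped correlations described above.
import numpy as np
from scipy import stats

def compare_raters(clinician, *model_preds):
    """Friedman omnibus test, then Wilcoxon post hoc tests of each model
    against the clinician ratings with Bonferroni correction."""
    chi2, p = stats.friedmanchisquare(clinician, *model_preds)
    posthoc = []
    for pred in model_preds:
        _, p_w = stats.wilcoxon(clinician, pred)
        posthoc.append(min(p_w * len(model_preds), 1.0))   # Bonferroni-adjusted p-value
    return (chi2, p), posthoc

def correlations_with_ci(x, y, n_boot=10_000, seed=0):
    """Pearson and Spearman coefficients with 95% bootstrap confidence intervals."""
    x, y = np.asarray(x), np.asarray(y)
    rng = np.random.default_rng(seed)
    pearson, spearman = stats.pearsonr(x, y)[0], stats.spearmanr(x, y)[0]
    boot_p, boot_s = [], []
    for _ in range(n_boot):
        idx = rng.integers(0, len(x), len(x))              # resample pairs with replacement
        boot_p.append(stats.pearsonr(x[idx], y[idx])[0])
        boot_s.append(stats.spearmanr(x[idx], y[idx])[0])

    def ci(values):
        return tuple(np.percentile(values, [2.5, 97.5]))

    return {"pearson": (pearson, ci(boot_p)), "spearman": (spearman, ci(boot_s))}
```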

4. Results

4.1. Training and Validation Results

Although no explicit data augmentation was applied in this study, several complementary regularization strategies were incorporated to mitigate overfitting. Specifically, the model employed L1 regularization together with the AdamW optimizer, which introduces decoupled weight decay for stable and controlled parameter updates. Early stopping was applied after 10 epochs to prevent overfitting. Additionally, dropout layers were strategically integrated across the LSTM, fully connected, and adapter modules within the proposed architecture to reduce co-adaptation of neurons and enhance generalization. The training and validation loss trends (Figure 5) further confirm that no overfitting behavior was observed, as both curves converge smoothly with consistent reductions in loss across epochs. These results collectively indicate that the proposed regularization scheme is effective in maintaining generalization performance even under limited data conditions.
Figure 5. Training and validation loss trend using breathiness data.

4.2. Descriptive Data and Analyses

First, we visualized the distribution of subjective clinician ratings and the predicted objective ratings, emphasizing the differences in model predictions by comparing their mean and median values across the entire test dataset. Figure 6 displays the box-whisker plot of the CAPE-V severity and CAPE-V breathiness data, with the y-axis representing the normalized scores (0 to 1).
Across the test data, clinician ratings had a mean of 0.326 (SD: 0.278) and a median of 0.202 [IQR: 0.096–0.504] for the severity attribute, indicating a broad spread. The proposed model showed a mean of 0.301 (SD: 0.234) and a median of 0.209 [IQR: 0.117–0.431], suggesting a central tendency comparable to the clinician ratings with somewhat narrower dispersion. The Dang et al. [30] model exhibited a mean of 0.276 (SD: 0.243) and a median of 0.178 [IQR: 0.083–0.381], reflecting slightly lower central tendency relative to the clinician and proposed model ratings. Finally, the Dang et al. [30] ablated model had a mean of 0.307 (SD: 0.227) and a median of 0.225 [IQR: 0.127–0.428], with dispersion comparable to the proposed model.
For the breathiness data, clinician ratings exhibited a mean of 0.243 (SD: 0.255) and a median of 0.130 [IQR: 0.051–0.339], indicating substantial spread. The proposed model produced ratings with a mean of 0.190 (SD: 0.221) and a median of 0.091 [IQR: 0.049–0.181], suggesting generally lower predicted breathiness scores compared to the clinician ratings. The Dang et al. [30] model showed a mean of 0.197 (SD: 0.204) and a median of 0.123 [IQR: 0.043–0.231], while the Dang et al. [30] ablated model yielded a mean of 0.193 (SD: 0.197) and a median of 0.108 [IQR: 0.046–0.226]. Overall, all models demonstrated narrower interquartile ranges than the clinician ratings, reflecting reduced dispersion in predicted breathiness scores.
The significance of the differences between the mean clinician ratings and the model-predicted ratings was assessed next. Shapiro–Wilk tests and visual inspection of the Q-Q plots indicated non-normality. The non-parametric Friedman test was therefore employed and yielded a statistically significant result for both the severity and breathiness data (χ2(3) = 9.814, p = 0.0202 and χ2(3) = 9.814, p = 0.0202, respectively). Post hoc Wilcoxon signed-rank tests with Bonferroni correction for the severity data revealed that only the full Dang et al. [30] model predictions were significantly different from the clinician ratings. In contrast, predicted breathiness ratings from all models were significantly different from the clinician ratings. Thus, all models significantly underestimated the breathiness voice quality attribute across the samples in the test data.

4.3. Linear Regression and Correlational Analyses

Scatter plots between the subjective and objective data are shown in Figure 7 for further assessment of their relationship. In Figure 7, the y-axis represents the objective score averaged across all sentences produced by a talker in the test dataset, and the x-axis represents the averaged CAPE-V expert rating for that talker. Furthermore, the linear regression fits to the scattered data are indicated as dashed lines in all sub-panels of Figure 7.
Scatter plots in Figure 7a,b reveal that the breathiness data exhibit greater skewness in their distribution, with a greater concentration of data points at the lower end of the breathiness scale. In general, strong linear relationships were observed between the perceptual and predicted ratings for both CAPE-V attributes. For example, the variances explained by the linear regression fits to the CAPE-V severity scatter data (Figure 7a) were 82.6%, 81.6%, and 81.0% for the proposed, Dang et al. [30], and Dang et al. [30] ablated versions, respectively. Similarly, the corresponding variances for the CAPE-V breathiness scatter data (Figure 7b) were 88.0%, 87.3%, and 87.1%, respectively. The findings from Figure 7 confirm the efficacy of objective measures based on SFM models in predicting breathiness and overall severity as perceived in subjective assessments by expert clinicians.
For reference, scatter plots representing the CPP data are presented in Figure 7c,d. As noted in the Introduction, CPP is a widely used metric in acoustic analyses of voice. Linear regression fits to the CPP scatter plots explained 71.8% of the variance for the CAPE-V severity data and 73.3% for the CAPE-V breathiness data. It is important to emphasize, however, that these CPP data are included only for contextual comparison; a direct comparison between the CPP results and model predictions would be inappropriate, as the CPP measure involves no training, whereas the SFM-based models were explicitly trained to predict perceptual ratings. Similar patterns can be observed for the CAPE-V severity data.
Figure 7. Scatter plots displaying subjective CAPE-V severity and breathiness scores against the predicted ratings from different models (our proposed model, the Dang et al. [30] model, and the Dang et al. [30] ablated model). (a) CAPE-V severity scatter data for the three models along with their linear regression fits; (b) CAPE-V breathiness scatter data for the three models along with their linear regression fits; (c) CAPE-V severity scatter data for the CPP metric along with the linear regression fit; (d) CAPE-V breathiness scatter data for the CPP metric along with the linear regression fit.
Table 2 displays the Pearson correlation coefficient and RMSE results between the predicted and actual CAPE-V breathiness/severity values across the different models. These values are reported at the individual sentence level as well as at the talker level, where the sentence-level scores are averaged for each talker. It is evident from this table that the proposed model achieves the highest correlation coefficients for the CAPE-V severity and breathiness attributes at both the sentence and talker levels while maintaining the lowest RMSE values, indicating strong predictive performance with minimal error. The correlation coefficients and RMSE values associated with the full and ablated Dang et al. [30] models are close behind. It is further evident that the ML models perform significantly better than low-level traditional acoustic parameters such as CPP and HNR. For comparative purposes, Table 2 also lists the correlation coefficient and RMSE values reported by Benjamin et al. [31] for CAPE-V severity prediction. It must be noted that Benjamin et al. [31] reported these predictive metrics for the entire database of 295 talkers, rather than for the test data subset used to evaluate the other models. Once again, CPP and HNR values are included for contextual reference. CPP and HNR quantify specific glottal-source properties, whereas CAPE-V ratings reflect global perceptual impressions shaped by multiple speech dimensions. Therefore, lower CPP/HNR correlations do not suggest inferiority but rather a different focus. The proposed model, trained on perceptual scores, captures broader acoustic cues that align more closely with listener judgments, offering enhanced perceptual correspondence rather than deeper glottal-source modeling.
The statistical significance of the differences in correlation coefficients associated with the proposed model and the full and ablated versions of the Dang et al. [30] models was assessed using the z test statistic. At the sentence level, there was no statistically significant difference between the proposed model and the Dang et al. [30] full model for the overall severity attribute (z = 0.382, p = 0.351) and breathiness. The comparison between the proposed model and the ablated Dang et al. model did result in a significant difference (z = 2.56, p < 0.01). For the breathiness attribute at the sentence level, the proposed model resulted in significantly better correlations than both versions of the Dang et al. model (z = 1.67, p = 0.047 and z = 2.395, p < 0.01, respectively). When collapsed across sentences, i.e., at the talker level, the correlation coefficients between the subjective scores and the predictions from the proposed and Dang et al. models were statistically similar. Due to the smaller size of the test dataset when collapsed across sentences (n = 59), future research with data from a larger cohort of talkers is needed to further assess the performance differences between the competitive SFM models.

4.4. Bland–Altman Analyses

To gain further insight into the relationship between the subjective ratings and their objective predictions, Bland–Altman analyses were conducted. In particular, the differences between the averaged clinician rating and the predicted scores were plotted against their mean values for both the proposed model and the Dang et al. [30] model (shown in Figure 8). The mean bias associated with the proposed model was statistically insignificant, while the mean bias associated with the Dang et al. [30] model was statistically significant. The linear regression fits resulted in positive slopes for both models, indicating a growing discrepancy between predicted and clinician ratings at higher CAPE-V severity values. Similar results were observed for the CAPE-V breathiness parameter.
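For reference, a minimal Bland–Altman computation of the kind used here is sketched below: it reports the mean bias and 95% limits of agreement and plots the differences against the pairwise means. The input arrays are placeholders for matched clinician and model scores, and the regression of differences on means (used above to examine proportional bias) is omitted for brevity.

```python
# Minimal Bland-Altman agreement analysis between clinician and predicted scores.
import numpy as np
import matplotlib.pyplot as plt

def bland_altman(clinician, predicted):
    clinician, predicted = np.asarray(clinician), np.asarray(predicted)
    mean_scores = (clinician + predicted) / 2
    diff = clinician - predicted
    bias = diff.mean()
    loa = 1.96 * diff.std(ddof=1)            # half-width of the 95% limits of agreement

    plt.scatter(mean_scores, diff, s=12)
    for y in (bias, bias + loa, bias - loa):
        plt.axhline(y, linestyle="--", linewidth=1)
    plt.xlabel("Mean of clinician and predicted ratings")
    plt.ylabel("Clinician minus predicted rating")
    plt.title("Bland-Altman plot")
    plt.show()
    return bias, (bias - loa, bias + loa)
```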

4.5. Ablation Analyses

The ablation analysis highlights the specific contribution of the proposed SAFN beyond standard mapping approaches such as LSTM-based architectures. As shown in Table 3, incorporating the SAFN block led to a notable improvement in correlation with perceptual breathiness ratings on the validation set (0.9244 vs. 0.9031), an absolute gain of approximately 2% under identical training conditions. This enhancement can be attributed to SAFN’s ability to dynamically recalibrate multi-level Whisper embeddings and Mel-spectral features through multi-head attention and residual normalization, effectively capturing both temporal dependencies and perceptual subtleties that conventional LSTM fusion schemes may overlook. Furthermore, removing the Mel-spectral + delta stream caused a measurable drop in correlation (0.8993), underscoring the complementary contribution of low-level acoustic cues to perceptual assessment. Similarly, when the adapter layers were restricted to the last three ASR encoder blocks (0.9026), performance declined relative to the full SAFN configuration, confirming the importance of distributing adaptive attention across deeper feature hierarchies. Overall, the ablation results confirm that the proposed structure introduces distinct representational advantages and better optimization, leading to more robust and perceptually aligned voice quality prediction.

4.6. Generalization to an Unseen Dataset

Given the lack of any other publicly available CAPE-V databases, we tested our model on an unseen, small, private dataset collected by Ensar et al. [35]. In this database, CAPE-V ratings were collected for sentence recordings from 30 talkers (24 disordered and 6 normal). Unlike the PVQD dataset, which contained expert CAPE-V evaluations, these samples were rated by thirty inexperienced listeners. Ensar et al. [35] reported high inter- and intra-rater reliabilities for the CAPE-V overall severity scores, and hence their average scores for the 30 speech samples were used for comparison against the corresponding model predictions. Results showed a correlation of 0.9338 between the mean subjective CAPE-V severity scores and the severity scores predicted by the proposed model. The generalization performance of the proposed model on this unseen dataset is illustrated in Figure 9. The scatter plot shows the relationship between the subjective severity scores, obtained from human listeners, and the predicted severity scores generated by the model. A strong positive linear association is observed, with data points closely distributed around the fitted regression line. The regression equation y = 0.7131x + 0.2146 and the coefficient of determination demonstrate that the model accounts for approximately 87% of the variance in the subjective ratings, indicating a high degree of predictive accuracy. For comparison, the Dang et al. [30] full model resulted in a linear regression fit that explained 85% of the variance. This result highlights the robustness and reliability of the proposed approach in capturing perceptual severity patterns beyond the training data, reinforcing its potential for real-world applicability in objective disordered speech quality assessment.

5. Discussion

The results of our proposed model demonstrate the efficacy of integrating Whisper encoder embeddings with Mel spectrograms augmented by second-order delta features along with the proposed SAFN for speech quality estimation within the disordered speech signals. This approach leverages the robustness of Whisper’s deep-learned speech representations, which encapsulate phonetic and linguistic information, enhancing the model’s ability to discern subtle quality variations in speech signals. Additionally, the inclusion of second-order delta features captures dynamic spectral changes, improving the sensitivity of the system to transient distortions and artifacts.
Our SAFN feature mapping module effectively learns long-range dependencies, preserving the global, long-range structure of the speech signal while reducing redundant information. The LSTM layers further refine the learned representations by modeling temporal dependencies, ensuring a more accurate prediction of speech quality. This hybrid architecture allows our model to balance the strengths of self-attention mechanisms with the sequential modeling capabilities of recurrent networks.
Comparative analysis with existing state-of-the-art methods highlighted the advantages of our approach. As demonstrated in Figure 5 and Figure 6, the proposed model consistently outperforms baseline methods, achieving the highest Pearson correlation coefficients and lowest RMSE values for CAPE-V breathiness and severity estimation. Notably, our model achieved a Pearson correlation of 0.9382 for breathiness and 0.9090 for severity, slightly outperforming the Dang et al. model and its ablated version. The lower number of trainable parameters (242.4 M vs. 336.1 M in the Dang et al. model) highlights the computational efficiency of our approach without compromising accuracy. Furthermore, the model’s agreement with perceptual ratings, as evidenced by its alignment with subjective assessments, underscores its clinical relevance.
Our findings also highlight the limitations of traditional acoustic measures such as CPP, which, despite showing a strong correlation with perceptual ratings, exhibited higher variability and lower predictive performance compared to our proposed model. As illustrated in Table 2, the CPP-based approach yielded a correlation of −85.59% for breathiness and −84.72% for severity, with relatively higher RMSE values, indicating that our model achieves better alignment with perceptual dysphonia ratings than traditional acoustic measures.
Beyond its predictive accuracy, the proposed model demonstrates good generalization on unseen samples, maintaining robustness against both stationary and non-stationary distortions. The ability of the Whisper encoder embeddings to retain meaningful phonetic structure contributes to improved correlation with perceptual quality scores, particularly in cases where traditional spectral-based methods struggle. These findings highlight the potential of ASR-derived embeddings, together with second-order delta Mel spectrograms and the SAFN, for speech quality estimation, paving the way for more intelligent and adaptable automated assessment systems.
The proposed model offers promising potential for clinical translation, particularly within telemedicine and remote voice assessment contexts. Since the framework operates on short audio segments (2–5 s) and requires minimal preprocessing, it can be adapted for near real-time inference on local or cloud-based systems. This makes it suitable for remote monitoring, teleconsultations, and self-assessment applications where patients can record short speech samples using smartphones or web-based platforms. In a clinical setting, the model could be integrated as a decision-support tool within existing ENT and speech-language pathology workflows, providing clinicians with an objective, continuous estimate of perceptual voice quality indices (like CAPE-V Grade, Breathiness, and Severity). Such automated scores could complement expert ratings, assist in longitudinal tracking of therapy progress, and help standardize subjective evaluations across raters and clinics. Future developments will focus on optimizing inference latency, deploying lightweight versions for point-of-care use, and establishing clinician-in-the-loop interfaces to ensure seamless integration with electronic medical record systems and existing diagnostic pipelines.
While the proposed model demonstrates strong correlations with CAPE-V ratings and generalizes well to unseen data, several limitations should be acknowledged. First, the lack of publicly available CAPE-V rated datasets, particularly those including connected or running speech, restricts the diversity and scale of training data and limits cross-study benchmarking. The PVQD data used in this study remains one of the few open sources, but its sample size and coverage may not fully capture the variability of disordered voices across pathologies and recording conditions. Second, in the unseen test dataset (Ensar et al. [35]), the CAPE-V ratings were provided by relatively inexperienced raters, which may introduce additional variability or bias in perceptual ground-truth scores, even though averaging across thirty listeners mitigated random error. Finally, although our architecture is computationally lighter than Dang et al. [30] (≈242 million trainable parameters vs. ≈336 million), the model’s size remains substantial for deployment on resource-constrained or embedded platforms such as mobile devices or hearing aids. Future work will explore model compression, pruning, and knowledge-distillation strategies to improve efficiency without degrading perceptual accuracy.
Finally, while the proposed model achieves high correlations with subjective ratings, its interpretability, particularly in relation to the underlying acoustic features associated with disordered speech quality, is a limiting factor. Interpreting the behavior of the proposed model (and indeed Dang et al. [30] model) is challenging, particularly because it relies on high dimensional embeddings derived from large speech foundation models. Although these embeddings are effective for predicting overall severity and breathiness, their internal structure does not map transparently onto well-defined phonatory or prosodic constructs. As a result, it is difficult to determine the extent to which the model is capturing glottal features—such as vocal fold excitation patterns—or suprasegmental attributes—such as intonation, rhythm, or stress. The distributed and nonlinear nature of the learned representations makes it hard to attribute prediction outcomes to specific acoustic mechanisms, highlighting the broader challenge of interpretability in deep learning models that operate on representations learned from massive ASR- or SSL-based speech encoders.

6. Conclusions and Future Work

This paper presents an effective disordered speech quality estimation framework that leverages Whisper encoder embeddings and Mel spectrograms with second-order delta features, together with a deep sequential-attention fusion network architecture. By incorporating the SAFN feature mapping module and LSTM layers, our model efficiently captures both global and temporal dependencies in speech signals. The proposed approach slightly outperformed existing methods, particularly in handling diverse conditions and retaining phonetic structures relevant to perceptual quality. In terms of computational complexity, the proposed structure is significantly lighter than the state-of-the-art model while outperforming it on both CAPE-V breathiness and severity for the same unseen data. Overall, our results showed a 92.43% Pearson correlation for CAPE-V breathiness and an 88.09% correlation for CAPE-V severity on the test set. Future research will explore strategies such as domain adaptation, alternative embedding techniques, and model compression methods to enhance deployability. Additionally, validating the system on real-world noisy datasets, including conversational and multi-speaker environments, would further solidify its practical applicability.

Author Contributions

Conceptualization: V.A. and V.P.; investigation: V.A.; methodology: V.A.; project administration: V.A.; software: V.A.; validation: V.A. and V.P.; visualization: V.A. and A.H.; formal analysis: V.A., A.H. and V.P.; supervision: V.P.; writing—original draft: V.A. and V.P.; writing—review & editing: V.A. and V.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Discovery Grant from Natural Sciences and Engineering Research Council (NSERC), Canada, to Vijay Parsa.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The source code with the segmented running speech based dataset and some example predictions are available at GitHub through the provided link (accessed on 2 October 2025): https://github.com/vahidashkani/vahidashkani-Impaired-Patients-Voice-Quality-Assessment-Model.

Acknowledgments

We gratefully thank Philip Doyle for sharing the additional CAPE-V database that was used in assessing the generalization performance of ML models.

Conflicts of Interest

The authors confirm that there are no conflicts of interest related to the research, writing, authorship and/or publication of this article.

Abbreviations

The following abbreviations are used in this manuscript:
ASR	Automatic Speech Recognition
DNN	Deep Neural Network
LSTM	Long Short-Term Memory
SQ	Speech Quality
SI	Speech Intelligibility
SFMs	Speech Foundation Models
CPP	Cepstral Peak Prominence
AVQI	Acoustic Voice Quality Index
CAPE-V	Consensus Auditory-Perceptual Evaluation of Voice
GRBAS	Grade, Roughness, Breathiness, Asthenia, and Strain
PD	Parkinson’s Disease
CPC2	Clarity Prediction Challenge 2
SSL	Self-Supervised Learning
SAFN	Sequential-Attention Fusion Network
FC	Fully Connected
GAP	Global Average Pooling
PVQD	Perceptual Voice Qualities Database
RMSE	Root Mean Square Error
BLSTM	Bidirectional Long Short-Term Memory
RNN	Recurrent Neural Network
PESQ	Perceptual Evaluation of Speech Quality
MOS	Mean Opinion Score
CNNs	Convolutional Neural Networks
RF	Random Forest
ML	Machine Learning
GELU	Gaussian Error Linear Unit
MHA	Multi-Head Attention
MAE	Mean Absolute Error

References

  1. Barsties, B.; De Bodt, M. Assessment of voice quality: Current state-of-the-art. Auris Nasus Larynx 2015, 42, 183–188. [Google Scholar] [CrossRef]
  2. Kreiman, J.; Gerratt, B.R. Perceptual Assessment of Voice Quality: Past, Present, and Future. Perspect. Voice Voice Disord. 2010, 20, 62–67. [Google Scholar] [CrossRef]
  3. Tsuboi, T.; Watanabe, H.; Tanaka, Y.; Ohdake, R.; Yoneyama, N.; Hara, K.; Nakamura, R.; Watanabe, H.; Senda, J.; Atsuta, N.; et al. Distinct phenotypes of speech and voice disorders in Parkinson’s disease after subthalamic nucleus deep brain stimulation. J. Neurol. Neurosurg. Psychiatry 2015, 86, 856–864. [Google Scholar] [CrossRef] [PubMed]
  4. Tsuboi, T.; Watanabe, H.; Tanaka, Y.; Ohdake, R.; Hattori, M.; Kawabata, K.; Hara, K.; Ito, M.; Fujimoto, Y.; Nakatsubo, D.; et al. Early detection of speech and voice disorders in Parkinson’s disease patients treated with subthalamic nucleus deep brain stimulation: A 1-year follow-up study. J. Neural Transm. 2017, 124, 1547–1556. [Google Scholar] [CrossRef] [PubMed]
  5. Kim, S.; Le, D.; Zheng, W.; Singh, T.; Arora, A.; Zhai, X.; Fuegen, C.; Kalinli, O.; Seltzer, M.L. Evaluating User Perception of Speech Recognition System Quality with Semantic Distance Metric. arXiv 2022, arXiv:2110.05376. [Google Scholar] [CrossRef]
  6. Hidaka, S.; Lee, Y.; Nakanishi, M.; Wakamiya, K.; Nakagawa, T.; Kaburagi, T. Automatic GRBAS Scoring of Pathological Voices using Deep Learning and a Small Set of Labeled Voice Data. J. Voice 2025, 39, 846.e1–846.e23. [Google Scholar] [CrossRef]
  7. Kent, R.D. Hearing and Believing. Am. J. Speech-Lang. Pathol. 1996, 5, 7–23. [Google Scholar] [CrossRef]
  8. Mehta, D.D.; Hillman, R.E. Voice assessment: Updates on perceptual, acoustic, aerodynamic, and endoscopic imaging methods. Curr. Opin. Otolaryngol. Head Neck Surg. 2008, 16, 211. [Google Scholar] [CrossRef]
  9. Nagle, K.F. Clinical Use of the CAPE-V Scales: Agreement, Reliability and Notes on Voice Quality. J. Voice 2025, 39, 685–698. [Google Scholar] [CrossRef]
  10. Maryn, Y.; Roy, N.; De Bodt, M.; Van Cauwenberge, P.; Corthals, P. Acoustic measurement of overall voice quality: A meta-analysisa. J. Acoust. Soc. Am. 2009, 126, 2619–2634. [Google Scholar] [CrossRef]
  11. Gómez-García, J.A.; Moro-Velázquez, L.; Mendes-Laureano, J.; Castellanos-Dominguez, G.; Godino-Llorente, J.I. Emulating the perceptual capabilities of a human evaluator to map the GRB scale for the assessment of voice disorders. Eng. Appl. Artif. Intell. 2019, 82, 236–251.
  12. Maryn, Y.; Weenink, D. Objective Dysphonia Measures in the Program Praat: Smoothed Cepstral Peak Prominence and Acoustic Voice Quality Index. J. Voice 2015, 29, 35–43.
  13. Leng, Y.; Tan, X.; Zhao, S.; Soong, F.; Li, X.-Y.; Qin, T. MBNET: MOS Prediction for Synthesized Speech with Mean-Bias Network. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 391–395.
  14. Zezario, R.E.; Fu, S.-W.; Chen, F.; Fuh, C.-S.; Wang, H.-M.; Tsao, Y. Deep Learning-Based Non-Intrusive Multi-Objective Speech Assessment Model With Cross-Domain Features. IEEE/ACM Trans. Audio Speech Lang. Process. 2023, 31, 54–70.
  15. Dong, X.; Williamson, D.S. An Attention Enhanced Multi-Task Model for Objective Speech Assessment in Real-World Environments. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 911–915.
  16. Zezario, R.E.; Fu, S.-W.; Fuh, C.-S.; Tsao, Y.; Wang, H.-M. STOI-Net: A Deep Learning based Non-Intrusive Speech Intelligibility Assessment Model. In Proceedings of the 2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Auckland, New Zealand, 7–10 December 2020; pp. 482–486. Available online: https://ieeexplore.ieee.org/abstract/document/9306495 (accessed on 18 May 2025).
  17. Fu, S.-W.; Tsao, Y.; Hwang, H.-T.; Wang, H.-M. Quality-Net: An End-to-End Non-intrusive Speech Quality Assessment Model based on BLSTM. arXiv 2018, arXiv:1808.05344.
  18. Liu, Y.; Yang, L.-C.; Pawlicki, A.; Stamenovic, M. CCATMos: Convolutional Context-aware Transformer Network for Non-intrusive Speech Quality Assessment. In Proceedings of the Interspeech 2022, Incheon, Republic of Korea, 18–22 September 2022; pp. 3318–3322.
  19. Kumar, A.; Tan, K.; Ni, Z.; Manocha, P.; Zhang, X.; Henderson, E.; Xu, B. Torchaudio-Squim: Reference-Less Speech Quality and Intelligibility Measures in Torchaudio. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5.
  20. Gao, Y.; Shi, H.; Chu, C.; Kawahara, T. Enhancing Two-Stage Finetuning for Speech Emotion Recognition Using Adapters. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 11316–11320.
  21. Gao, Y.; Chu, C.; Kawahara, T. Two-stage Finetuning of Wav2vec 2.0 for Speech Emotion Recognition with ASR and Gender Pretraining. In Proceedings of the INTERSPEECH 2023, ISCA, Dublin, Ireland, 20–24 August 2023; pp. 3637–3641.
  22. Tian, J.; Hu, D.; Shi, X.; He, J.; Li, X.; Gao, Y.; Toda, T.; Xu, X.; Hu, X. Semi-supervised Multimodal Emotion Recognition with Consensus Decision-making and Label Correction. In Proceedings of the 1st International Workshop on Multimodal and Responsible Affective Computing, Ottawa, ON, Canada, 29 October 2023; Association for Computing Machinery: New York, NY, USA, 2023; pp. 67–73.
  23. Dang, S.; Matsumoto, T.; Takeuchi, Y.; Kudo, H. Using Semi-supervised Learning for Monaural Time-domain Speech Separation with a Self-supervised Learning-based SI-SNR Estimator. In Proceedings of the INTERSPEECH 2023, ISCA, Dublin, Ireland, 20–24 August 2023; pp. 3759–3763.
  24. Sun, H.; Zhao, S.; Wang, X.; Zeng, W.; Chen, Y.; Qin, Y. Fine-Grained Disentangled Representation Learning For Multimodal Emotion Recognition. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 11051–11055.
  25. Cuervo, S.; Marxer, R. Speech Foundation Models on Intelligibility Prediction for Hearing-Impaired Listeners. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 1421–1425.
  26. Mogridge, R.; Close, G.; Sutherland, R.; Hain, T.; Barker, J.; Goetze, S.; Ragni, A. Non-Intrusive Speech Intelligibility Prediction for Hearing-Impaired Users Using Intermediate ASR Features and Human Memory Models. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 306–310.
  27. Liu, G.S.; Jovanovic, N.; Sung, C.K.; Doyle, P.C. A Scoping Review of Artificial Intelligence Detection of Voice Pathology: Challenges and Opportunities. Otolaryngol.–Head Neck Surg. 2024, 171, 658–666.
  28. Chen, Z.; Zhu, P.; Qiu, W.; Guo, J.; Li, Y. Deep learning in automatic detection of dysphonia: Comparing acoustic features and developing a generalizable framework. Int. J. Lang. Commun. Disord. 2023, 58, 279–294.
  29. García, M.A.; Rosset, A.L. Deep Neural Network for Automatic Assessment of Dysphonia. arXiv 2022, arXiv:2202.12957.
  30. Dang, S.; Matsumoto, T.; Takeuchi, Y.; Tsuboi, T.; Tanaka, Y.; Nakatsubo, D.; Maesawa, S.; Saito, R.; Katsuno, M.; Kudo, H. Developing vocal system impaired patient-aimed voice quality assessment approach using ASR representation-included multiple features. arXiv 2024, arXiv:2408.12279.
  31. van der Woerd, B.; Chen, Z.; Flemotomos, N.; Oljaca, M.; Sund, L.T.; Narayanan, S.; Johns, M.M. A Machine-Learning Algorithm for the Automated Perceptual Evaluation of Dysphonia Severity. J. Voice 2023, 39, 1440–1445.
  32. Lin, Y.-H.; Tseng, W.-H.; Chen, L.-C.; Tan, C.-T.; Tsao, Y. Lightly Weighted Automatic Audio Parameter Extraction for the Quality Assessment of Consensus Auditory-Perceptual Evaluation of Voice. In Proceedings of the 2024 IEEE International Conference on Consumer Electronics (ICCE), Las Vegas, NV, USA, 6–8 January 2024; pp. 1–6.
  33. Lee, M. Mathematical Analysis and Performance Evaluation of the GELU Activation Function in Deep Learning. J. Math. 2023, 2023, 4229924.
  34. Walden, P.R. Perceptual Voice Qualities Database (PVQD): Database Characteristics. J. Voice 2022, 36, 875.e15–875.e23.
  35. Ensar, B.; Searl, J.; Doyle, P. Stability of Auditory-Perceptual Judgments of Vocal Quality by Inexperienced Listeners. In Proceedings of the American Speech and Hearing Convention, Seattle, WA, USA, 5–7 December 2024.
Figure 1. Overview of the proposed structure for quality measurement.
Figure 2. Overview of the proposed adapter structure.
Figure 3. Proposed feature mapping block.
Figure 4. (Left): Waveforms of two test samples. (Right): Corresponding Mel-spectrograms.
Figure 6. Box-whisker plots comparing the average clinician ratings of overall severity and breathiness with the corresponding model predicted ratings (Our proposed model, Dang et al. [30] model, and Dang et al. [30] ablated model). Here, “X”s indicate the mean values, while the solid line within each box represents the median. Lines with “*”s indicate statistically significant differences.
Figure 8. Bland–Altman plots between the averaged clinician overall severity ratings and their predicted scores. (a) proposed model; and (b) Dang et al. [30] model. Plots show the mean bias, the linear regression fit to the scatter data, and the upper and lower limits of agreement along with their 95% confidence intervals.
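For readers unfamiliar with Bland–Altman analysis, the quantities plotted in Figure 8 (the mean bias and the 95% limits of agreement between clinician ratings and model predictions) can be computed as in the short sketch below. This is a generic illustration with made-up values on the normalized severity scale, not the code used to produce the figure, and the 1.96·SD limits follow the standard Bland–Altman convention.

```python
import numpy as np

def bland_altman(ratings, predictions):
    """Return per-sample means/differences, mean bias, and 95% limits of agreement."""
    ratings = np.asarray(ratings, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    diff = predictions - ratings                 # prediction error per talker
    mean = (predictions + ratings) / 2.0         # x-axis of the Bland-Altman plot
    bias = diff.mean()                           # mean bias (horizontal centre line)
    sd = diff.std(ddof=1)
    loa = (bias - 1.96 * sd, bias + 1.96 * sd)   # lower/upper limits of agreement
    return mean, diff, bias, loa

# Illustrative values on the [0, 1] severity scale (not data from the study):
_, _, bias, loa = bland_altman([0.20, 0.55, 0.80, 0.35], [0.25, 0.50, 0.72, 0.40])
print(f"bias = {bias:.3f}, limits of agreement = ({loa[0]:.3f}, {loa[1]:.3f})")
```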
Figure 9. Scatter plot displaying the mean subjective CAPE-V severity scores and predicted severity scores using the proposed model on the unseen dataset.
Table 1. Summary of the Proposed Whisper + Mel + SAFN Model Architecture and Training Configuration.

| Component | Layer/Block | Key Parameters | Output Dimension | Notes |
|---|---|---|---|---|
| Input | Audio waveform | 16 kHz | | Resampled to 16 kHz |
| Mel-Spectrogram | STFT (Hann) | FFT = 400, hop = 320, 40 Mel filters | 40 × T | Used for deltas |
| Delta Features | 1st & 2nd order | | 120 × T | Concatenated to Mel |
| ASR Encoder | Whisper-small | 12 Transformer layers, hidden 384 | 384 × T | Pre-trained, not frozen |
| Adapters (×6) | FC (384 → 128) → GELU → LayerNorm → Dropout (0.1) | 6 learnable weights (softmax normalized) | 128 × T | Fuse multi-depth features |
| Fusion Block | Concatenation | Whisper + Mel + Deltas | (128 + 120) × T | Combined representation |
| SAFN (Feature Mapping Block) | 3 × Uni-LSTM (360 → 128, dropout 0.3) + 2 MHA (128 dim, 16 heads) | FFN (128 → 256 → 128) | 128 × T | Temporal context modeling |
| Output Head | FC (128 → 1) → Sigmoid → Global Avg Pooling | | 1 | Utterance-level quality score |
| Loss Function | MAE | | | Regression loss |
| Optimizer | AdamW | lr = 5 × 10⁻⁶, weight decay = 1 × 10⁻⁴ | | ReduceLROnPlateau (f = 0.5, p = 4) |
| Training | 200 epochs | batch = 1 | | Pretrained modules unfrozen |
| Hardware | RTX 3080 Ti (10,240 CUDA cores), 32 GB RAM | | | PyTorch ≥ 2.0 (2.5.1+cu121) |
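To make the configuration in Table 1 concrete, the following is a minimal PyTorch sketch of the Whisper + Mel + SAFN pipeline, not the authors' released code. The Whisper-small encoder hidden states are mocked with random tensors so the snippet runs without downloading weights (in practice they could be obtained from a pretrained encoder, e.g., Hugging Face's WhisperModel with output_hidden_states=True); the class names Adapter, SAFN, and QualityModel, the truncation-based alignment of the two streams, and the resulting LSTM input width are illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch (assumptions noted inline) of the Whisper + Mel + SAFN model in Table 1.
import torch
import torch.nn as nn
import torchaudio

SR, N_MELS, HIDDEN = 16000, 40, 128
WHISPER_DIM, N_ADAPTERS = 384, 6              # per Table 1: hidden 384, six adapters


class Adapter(nn.Module):
    """FC (384 -> 128) -> GELU -> LayerNorm -> Dropout, as listed in Table 1."""
    def __init__(self, d_in=WHISPER_DIM, d_out=HIDDEN, p=0.1):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_in, d_out), nn.GELU(),
                                 nn.LayerNorm(d_out), nn.Dropout(p))

    def forward(self, x):                      # x: (B, T, 384)
        return self.net(x)


class SAFN(nn.Module):
    """Sequential-attention fusion: 3 uni-LSTMs + 2 MHA blocks + FFN (layout assumed)."""
    def __init__(self, d_in, d=HIDDEN, heads=16, p=0.3):
        super().__init__()
        self.lstms = nn.ModuleList([nn.LSTM(d_in if i == 0 else d, d,
                                            batch_first=True) for i in range(3)])
        self.drop = nn.Dropout(p)
        self.mha = nn.ModuleList([nn.MultiheadAttention(d, heads, batch_first=True)
                                  for _ in range(2)])
        self.ffn = nn.Sequential(nn.Linear(d, 256), nn.GELU(), nn.Linear(256, d))

    def forward(self, x):                      # x: (B, T, d_in)
        for lstm in self.lstms:
            x, _ = lstm(x)
            x = self.drop(x)
        for attn in self.mha:
            a, _ = attn(x, x, x)
            x = x + a                          # residual attention refinement
        return x + self.ffn(x)


class QualityModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.mel = torchaudio.transforms.MelSpectrogram(
            sample_rate=SR, n_fft=400, hop_length=320, n_mels=N_MELS)
        self.adapters = nn.ModuleList([Adapter() for _ in range(N_ADAPTERS)])
        self.adapter_logits = nn.Parameter(torch.zeros(N_ADAPTERS))  # softmax-fused depths
        # 128 + 120 = 248 here; Table 1 lists 360, so the original fusion width may differ.
        self.safn = SAFN(d_in=HIDDEN + 3 * N_MELS)
        self.head = nn.Linear(HIDDEN, 1)

    def forward(self, wav, whisper_states):
        # Spectral stream: log-Mel plus first/second-order deltas (40 x 3 = 120 dims).
        mel = torch.log(self.mel(wav) + 1e-6)                     # (B, 40, T)
        d1 = torchaudio.functional.compute_deltas(mel)
        d2 = torchaudio.functional.compute_deltas(d1)
        spec = torch.cat([mel, d1, d2], dim=1).transpose(1, 2)    # (B, T, 120)

        # ASR stream: softmax-weighted fusion of adapted encoder depths.
        w = torch.softmax(self.adapter_logits, dim=0)
        asr = sum(wi * ad(h) for wi, ad, h in zip(w, self.adapters, whisper_states))

        # Align stream lengths by truncation (the paper may align differently).
        T = min(asr.shape[1], spec.shape[1])
        fused = torch.cat([asr[:, :T], spec[:, :T]], dim=-1)

        x = self.safn(fused)
        frame_scores = torch.sigmoid(self.head(x)).squeeze(-1)    # (B, T), FC -> Sigmoid
        return frame_scores.mean(dim=1)                           # GAP -> utterance score


if __name__ == "__main__":
    wav = torch.randn(1, SR * 3)                                  # 3 s of dummy audio
    states = [torch.randn(1, 150, WHISPER_DIM) for _ in range(N_ADAPTERS)]  # mocked Whisper states
    print(QualityModel()(wav, states).shape)                      # torch.Size([1])
```

The softmax-normalized adapter weights and the frame-level sigmoid followed by global average pooling mirror the corresponding rows of Table 1; the training-side settings (MAE loss, AdamW with lr = 5 × 10⁻⁶, ReduceLROnPlateau) would be applied in a training loop, which is omitted here.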
Table 2. Pearson r, Spearman ρ, and RMSE values from subjective and objective rating comparisons. The 95% confidence intervals for the two correlation coefficients are also shown. Note: * Benjamin et al. [31] results are from the entire dataset of 295 talkers, whereas the remaining model scores were calculated for the test dataset of 59 talkers.

| Method | CAPE-V Severity: Pearson r (95% CI) | CAPE-V Severity: Spearman ρ (95% CI) | CAPE-V Severity: RMSE | CAPE-V Breathiness: Pearson r (95% CI) | CAPE-V Breathiness: Spearman ρ (95% CI) | CAPE-V Breathiness: RMSE | Trainable Parameters |
|---|---|---|---|---|---|---|---|
| Sentence Level: | | | | | | | |
| Proposed | 0.8810 (0.8504, 0.9042) | 0.7652 (0.7024, 0.8161) | 0.1335 | 0.9244 (0.9040, 0.9401) | 0.8095 (0.7580, 0.8515) | 0.1118 | 242,423,655 |
| Dang et al. [30] | 0.8784 (0.8446, 0.9017) | 0.7648 (0.7017, 0.8166) | 0.1423 | 0.9155 (0.8924, 0.9328) | 0.8217 (0.7760, 0.8587) | 0.1159 | 336,119,692 |
| Dang et al. [30], Ablated | 0.8685 (0.8351, 0.8946) | 0.7560 (0.6942, 0.8073) | 0.1386 | 0.9104 (0.8856, 0.9290) | 0.8014 (0.7515, 0.8421) | 0.1216 | 241,177,500 |
| CPP (dB) | −0.7468 (−0.6935, −0.7890) | −0.6554 (−0.5816, −0.7217) | 0.1835 | −0.7577 (−0.7050, −0.7978) | −0.6223 (−0.5381, −0.6940) | 0.1665 | |
| HNR (dB) | −0.4916 (−0.3936, −0.5793) | −0.3649 (−0.2635, −0.4591) | 0.2402 | −0.4898 (−0.3787, −0.5873) | −0.2367 (−0.1196, −0.3454) | 0.2225 | |
| Talker Level: | | | | | | | |
| Proposed | 0.9092 (0.8463, 0.9437) | 0.8062 (0.6416, 0.8953) | 0.1189 | 0.9394 (0.8880, 0.9654) | 0.8621 (0.7402, 0.9315) | 0.1029 | |
| Dang et al. [30] | 0.9034 (0.8327, 0.9413) | 0.8042 (0.6484, 0.8943) | 0.1286 | 0.9352 (0.8828, 0.9623) | 0.8645 (0.7587, 0.9281) | 0.1060 | |
| Dang et al. [30], Ablated | 0.9000 (0.8257, 0.9403) | 0.8110 (0.6640, 0.8988) | 0.1237 | 0.9342 (0.8823, 0.9623) | 0.8438 (0.7242, 0.9172) | 0.1111 | |
| Benjamin et al. [31] * | 0.8460 | - | 0.1423 | - | - | - | |
| CPP (dB) | −0.8489 (−0.7487, −0.9048) | −0.7512 (−0.5896, −0.8530) | 0.1458 | −0.8576 (−0.7662, −0.9135) | −0.6933 (−0.4865, −0.9135) | 0.1307 | |
| HNR (dB) | −0.5330 (−0.2758, −0.7142) | −0.3785 (−0.1199, −0.5952) | 0.2333 | −0.5296 (−0.2232, −0.7275) | −0.2161 (−0.0792, −0.4881) | 0.2156 | |
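As a reference for how the agreement statistics reported in Table 2 can be computed from paired clinician ratings and model predictions, the small sketch below uses NumPy and SciPy. The Fisher z-transform interval for Pearson r is a standard approximation and an assumption on our part (the exact CI procedure used in the paper is not restated here), and the example arrays are purely illustrative.

```python
# Sketch of the agreement metrics in Table 2 (Pearson r, Spearman rho, RMSE).
import numpy as np
from scipy import stats

def agreement_metrics(y_true, y_pred, alpha=0.05):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    r, _ = stats.pearsonr(y_true, y_pred)
    rho, _ = stats.spearmanr(y_true, y_pred)
    rmse = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
    # Fisher z-transform 95% CI for Pearson r (assumed procedure; the paper's may differ).
    z = np.arctanh(r)
    se = 1.0 / np.sqrt(len(y_true) - 3)
    zc = stats.norm.ppf(1 - alpha / 2)
    ci = tuple(np.tanh([z - zc * se, z + zc * se]))
    return {"pearson_r": r, "pearson_ci": ci, "spearman_rho": rho, "rmse": rmse}

# Dummy ratings on the normalized [0, 1] CAPE-V scale (not data from the study):
print(agreement_metrics([0.2, 0.5, 0.7, 0.9, 0.4], [0.25, 0.45, 0.65, 0.85, 0.5]))
```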
Table 3. Ablation study showing the effect of the proposed SAFN relative to an LSTM-only fusion model, reported as the Pearson correlation with breathiness scores on the validation set.

| Structure | Pearson r (Breathiness) |
|---|---|
| With the proposed structure | 0.9244 |
| The proposed structure with an LSTM fusion model | 0.9031 |
| The proposed structure without the Mel-spectral + deltas stream | 0.8993 |
| The proposed model with three adapters in the last three ASR encoder blocks | 0.9026 |
