Machine Learning Assessment of Spasmodic Dysphonia Based on Acoustical and Perceptual Parameters

Adductor spasmodic dysphonia is a type of adult-onset focal dystonia characterized by involuntary spasms of laryngeal muscles. This paper applies machine learning techniques to the severity assessment of spasmodic dysphonia. To this aim, 7 perceptual indices and 48 acoustical parameters were estimated from the Italian word /a’jwɔle/ emitted by 28 female patients, manually segmented from a standardized sentence and used as features in two classification experiments. Subjects were divided into three severity classes (mild, moderate, severe) on the basis of the G (grade) score of the GRB scale. The first aim was to find relationships between perceptual and objective measures through the Local Interpretable Model-Agnostic Explanations method. Then, the development of a diagnostic tool for adductor spasmodic dysphonia severity assessment was investigated. Reliable relationships were found between G, R (Roughness), B (Breathiness), and Spasmodicity on the one hand and the acoustical parameters voiced percentage, F2 median, and F1 median on the other. After data scaling, Bayesian hyperparameter optimization, and leave-one-out cross-validation, a k-nearest neighbors model provided 89% accuracy in distinguishing patients among the three severity classes. The proposed methods highlight the best acoustical parameters that could be used jointly with the GRB indices to support the perceptual evaluation of spasmodic dysphonia and provide a tool to help its severity assessment.


Introduction
Dystonia refers to a set of movement disorders typically characterized by involuntary sustained or intermittent muscle contractions that may affect anatomical areas such as the neck muscles (cervical dystonia), the orbicularis oculi muscles (blepharospasm), or the laryngeal muscles, causing laryngeal dystonia [1]. This latter condition has a relatively low prevalence, about 1 in 100,000 [2]. Two main types of laryngeal dystonia have been described: adductor spasmodic dysphonia (AdSD) and abductor spasmodic dysphonia (AbSD), which rarely occur together [1,3]. AdSD is an impairing disease that affects the adductor muscles of the vocal folds during phonation, causing irregular voice breaks and a typically strained, strangled voice; it is the most common type of spasmodic dysphonia (SD), accounting for 90% of cases [2], with a female predominance and an average age of onset of 45 years [2]. Spasms tend to occur during vowel emissions, mainly during the glottal stop between them [1]. AbSD concerns the vocal fold abductor muscles, causing glottis opening during voice production and consequent breathy vocal breaks, especially when a voiceless consonant is uttered. The present work aims to support speech-language pathologists and otolaryngologists in grading AdSD severity and in monitoring its evolution over time and treatment [15].

Material and Methods
Voice samples were collected from 28 female patients (mean = 64.6 years, std = 11.6 years) diagnosed with AdSD and recruited at Ospedale Maggiore Policlinico Milano, Milano, Italy. Four male patients were recorded as well, but they were excluded from further analyses to avoid unbalanced data issues.
Each patient underwent a complete phoniatric investigation including videolaryngostroboscopy. The diagnosis was unanimously made by a team consisting of a highly experienced laryngologist and two voice therapists, both with long-standing experience with SD patients. Several patients were known to have suffered from SD for years and had already been successfully treated with botulinum toxin.
All subjects uttered only once, at comfortable pitch and loudness, an Italian standardized sentence rich in vocalic sounds and constantly voiced: "il bambino ama le aiuole della mamma" ("the child loves mother's flowerbeds"). Recordings were collected anonymously with smartphone-integrated microphones in a controlled (quiet) room. This study was approved by the Institutional Review Board of the Fondazione IRCCS Ca' Granda Ospedale Maggiore Policlinico, Milano, Italy, and informed consent was obtained from each participant.

Perceptual Parameters
Perceptual evaluation of the whole sentence was performed blindly by three independent judges with experience in voice disorders (1 otolaryngologist and 2 voice therapists), and scores were averaged. The scores ranged from 0 to 10, where 0 is the rating for the worst voice condition and 10 for the best one. Seven perceptual parameters were considered. Three of them were taken from the GRB scale proposed by Hirano [16] for voice assessment: G (Grade), R (Roughness), and B (Breathiness). The GIRBAS scale, derived from [16], represents the most widely used protocol for voice assessment and is currently suggested for clinical research by the European Research Group [17] and the Società Italiana di Fonologia e Laringologia [18]. Indices I, A, and S were not included in this work as they are less reliable in clinical settings [19].
The other four perceptual parameters were taken from the IINFVo scale, defined for substitution voicing assessment by Moerman et al. [20] and later validated by Siemons-Luhring [21] for SD assessment as well:
• I (Intelligibility): the impression of how well the patient can be understood by the listener.
• F (Fluency): intended as the smoothness of speech production.
• Vo (Voicing): the capability to correctly utter voiced or unvoiced speech, that is, the speech is voiced or unvoiced when it actually needs to be [20].
• S (Spasmodicity): related to the perception of voice breaks, tremor, and strain.

Acoustical Analysis
The Italian word /a'jwɔle/ was manually segmented from the standardized sentence and analyzed with the BioVoice open-source software tool [22,23], already successfully applied to the acoustical analysis of voices from newborns to children and adults, such as shape classification of newborn cry melody [24], dysprosody detection in Parkinson's disease [25], and genetic syndrome speech phenotype characterization [26]. The word /a'jwɔle/ is widely used as it is mainly composed of vowels and is also suited for studying articulation capabilities. BioVoice automatically resamples audio files at 44.1 kHz, saves them in .wav format, and sets proper frequency ranges for male/female/child/newborn voices to obtain reliable acoustical parameters, both in the time and in the frequency domain. Specifically, frame duration, number and percentage of voiced/unvoiced parts, F0, formants F1-F3, jitter, and noise (Normalized Noise Energy, NNE) are estimated, along with their standard deviations. BioVoice also computes the Power Spectral Density (PSD), normalized to its maximum value. In this work, the PSD spectrum in the adult range up to 5500 Hz was divided into 500 Hz wide bands in which the average power was calculated [27]. A total of 48 acoustical parameters were estimated, summarized in Table 1 along with a short description (e.g., mean PSD in the frequency range 5000-5500 Hz; in Table 1, * stands for 1, 2, and 3, as these parameters were computed for the first, second, and third formant, respectively F1, F2, and F3). Other BioVoice parameters are specific to newborn/infant cry and to the singing voice; as they were not used in the present work, they are not listed here.
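The band-averaged PSD computation described above can be sketched as follows; this is an illustrative NumPy approximation (the function name and the plain FFT-based periodogram are our assumptions, not the BioVoice implementation):

```python
import numpy as np

def psd_band_averages(signal, fs=44100, band_hz=500.0, fmax=5500.0):
    """Estimate a periodogram PSD, normalize it to its maximum value,
    and average it over consecutive 500 Hz wide bands up to fmax."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2        # raw periodogram
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    psd = spectrum / spectrum.max()                    # normalize to max = 1
    n_bands = int(fmax // band_hz)                     # 11 bands up to 5500 Hz
    means = []
    for b in range(n_bands):
        mask = (freqs >= b * band_hz) & (freqs < (b + 1) * band_hz)
        means.append(psd[mask].mean())
    return np.array(means)
```

For a pure 440 Hz tone, for instance, nearly all normalized power falls in the first band (0-500 Hz).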

Machine Learning and Statistical Analysis
Automatic assessment of voice pathologies based on artificial intelligence techniques represents an established approach to analyzing high-dimensional data. The most commonly used models include k-nearest neighbors (KNN), support vector machine (SVM), and random forest (RF); all three were implemented in this work. One advantage of machine learning (ML) methods is that they require very few assumptions about the data-generating systems, assumptions that are instead needed to obtain reliable statistical results but are often not available in clinical practice. Moreover, ML models are capable of generalizing underlying data patterns to predictions on completely new observations [30][31][32]. Furthermore, many studies [11,33] have highlighted that even when data do not show statistically significant differences, ML can still effectively identify parameters that differentiate observations. As the present study is focused on prediction rather than inference, statistical analysis was implemented only to support the ML results. The items listed in Table 1 were used as features for a set of multiclass and multivariate classification experiments. Features were z-score normalized to allow comparisons between variables with different scales and units of measurement; they made up the whole training set. Hyperparameter tuning was performed with Bayesian optimization using MATLAB® 2020b [34]. The function bayesopt.m allows the selection of several properties, such as the number of iterations for the evaluation of the objective function (set to 60 iterations, according to [35]), the optimization metric (here, global accuracy), and the model hyperparameters that minimize the overall misclassification error. In our study:

• For random forest, the fitcensemble.m function was used, with the aggregation method set to "Bag". The minimum number of leaves was selected between 2 and 27, and the maximum number of splits between 2 and 27. The split criterion was chosen among "deviance", "gdi", and "twoing" (according to [37]), and the number of variables to sample between 1 and 55.
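The z-score normalization applied to the features above can be expressed compactly; a minimal NumPy sketch (our helper, analogous to column-wise standardization in MATLAB):

```python
import numpy as np

def zscore_columns(X):
    """Normalize each feature (column) to zero mean and unit variance,
    so that parameters with different units become comparable."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0  # guard: leave constant features unscaled
    return (X - mu) / sigma
```

Each column of the returned matrix has mean 0 and standard deviation 1, which is the scaling assumed by distance-based classifiers such as KNN.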
To reduce overfitting and obtain reliable results, the trained models were cross-validated with the leave-one-subject-out (LOSO) method. Perceptual indices were used to separate AdSD severity into three classes: severe, moderate, and mild, as shown in Table 2. In the first artificial intelligence (AI) experiment, each rating from the GRB and IINFVo scales was iteratively used as the response variable to identify relationships between objective and perceptual variables and to find the most relevant features defining the voice characteristics of AdSD. At each iteration, only one perceptual parameter was considered, and the others were ignored.
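Since there is one recording per subject, LOSO reduces here to leave-one-out over samples. A minimal sketch with a plain k-nearest-neighbors classifier (hypothetical helper functions, not the MATLAB code used in the paper):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=2):
    """Classify x by majority vote among its k nearest neighbors (Euclidean)."""
    d = np.linalg.norm(X_train - x, axis=1)
    nearest = y_train[np.argsort(d)[:k]]
    values, counts = np.unique(nearest, return_counts=True)
    return values[np.argmax(counts)]

def loso_accuracy(X, y, k=2):
    """Leave-one-subject-out: hold out each sample once, train on the rest,
    and report the fraction of held-out samples classified correctly."""
    hits = 0
    for i in range(len(y)):
        X_tr = np.delete(X, i, axis=0)
        y_tr = np.delete(y, i)
        hits += knn_predict(X_tr, y_tr, X[i], k) == y[i]
    return hits / len(y)
```

With well-separated classes this returns 1.0; in the paper, the same scheme yielded 89% accuracy for the best KNN.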
In the second AI experiment, index G was used as the response variable, while all other parameters were used to evaluate the classifiers' capability in AdSD severity assessment. In Table 3, the number of patients in each class is shown. Categorical values were not z-score normalized. MATLAB® 2020b [34] code was developed to automatically save validation accuracies, precision, sensitivity, specificity, F-score (i.e., the harmonic mean between precision and sensitivity), area-under-the-curve values (AUCs), and model hyperparameters. An important issue concerning machine learning techniques, especially when they are used as diagnostic tools for decision making, is trust in individual predictions: although convincing, prediction results can be difficult to relate directly to existing physiological knowledge [32]. Evaluation metrics such as accuracy may not be indicative; therefore, Ribeiro et al. [38] proposed textual and visual artefacts that provide a qualitative explanation of the relationships between the features of a single observation (or instance) and model predictions. Thus, Local Interpretable Model-Agnostic Explanations (LIME) were used to identify which parameters supported or disagreed with the prediction made by the trained models for an individual observation. The LIME method simplifies the relationship between data labels and independent variables and is based on the assumption that every model behaves like a simple linear model at the local scale, i.e., at single-row-level data. Once a particular instance is selected to be "explained", LIME operates in three steps: (1) It samples new data around the instance and calculates the distances between the sampled data and the original observation. (2) It uses the complex model that needs to be explained to make predictions on the synthesized data and then trains the simple model.
(3) The simple model is weighted using the same distance metric as in step 1, thereby identifying the features that contributed to a specific prediction.
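The three steps can be condensed into a toy implementation; this is an illustrative sketch of the LIME idea for a scalar model output (Gaussian sampling, an exponential proximity kernel, and a weighted linear surrogate are our simplifying assumptions, not the exact algorithm in [38]):

```python
import numpy as np

def lime_weights(predict_fn, x, n_samples=500, kernel_width=0.75, seed=0):
    """Return local per-feature weights for the prediction at query point x."""
    rng = np.random.default_rng(seed)
    # Step 1: sample perturbations around x and compute their distances.
    Z = x + rng.normal(scale=0.5, size=(n_samples, x.size))
    d = np.linalg.norm(Z - x, axis=1)
    w = np.exp(-(d ** 2) / kernel_width ** 2)   # closer samples weigh more
    # Step 2: query the black-box model on the synthesized data.
    yz = np.array([predict_fn(z) for z in Z])
    # Step 3: fit a distance-weighted linear surrogate; its coefficients
    # are the local feature weights (last column holds the intercept).
    A = np.hstack([Z, np.ones((n_samples, 1))])
    sw = np.sqrt(w)[:, None]
    coef, *_ = np.linalg.lstsq(A * sw, yz * sw.ravel(), rcond=None)
    return coef[:-1]
```

For a locally linear model such as f(z) = 3 z0 - 2 z1, the recovered weights approach (3, -2).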
In MATLAB, the function lime.m requires the user to specify the trained complex model (with its hyperparameters), the observation to be explained (named the "query point"), and the number of predictors used to train the simple model; it returns a bar graph showing the relevance and weight of each predictor. Code was developed to iteratively compute LIME for all observations and for each trained classifier, setting the number of features to 10. As an example, Figure 1 depicts the LIME outcome for one row of the dataset (i.e., one query point or observation) in the first AI experiment, with G as the response variable. "Blackbox model" refers to the trained complex classifier, "Simple Model" to the classifier trained by LIME, and "1" to the severity class introduced in Table 2. On the vertical axis, the features that contributed to the correct classification are shown (in this case: % voiced, F1 mean, duration mean, and F0 mean), while on the horizontal axis their individual weights are shown.
With reference to Figure 1, in order to find the relevant features, the corresponding models and plots were obtained for each observation, and the predictor weights were averaged. The total weight was then computed, and the ratio between each individual predictor weight and the total weight was used to identify relevant parameters, with an arbitrary threshold set at 0.1.
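The aggregation rule described above (averaging per-observation LIME weights, normalizing by the total, and keeping predictors whose share exceeds 0.1) can be sketched as follows; taking absolute values before averaging is our assumption, as the paper does not specify it:

```python
import numpy as np

def relevant_features(weights, names, threshold=0.1):
    """weights: one row of LIME predictor weights per observation.
    Average them over observations, normalize by the total weight,
    and keep the features whose share exceeds the threshold."""
    avg = np.abs(np.asarray(weights, dtype=float)).mean(axis=0)
    share = avg / avg.sum()
    return [n for n, s in zip(names, share) if s > threshold]
```

For example, a predictor carrying 85% of the averaged weight is retained, while predictors at 7-8% (like jitter and NNE in the Discussion) fall below the 0.1 cut-off.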
As for statistical analysis, a Shapiro-Wilk test indicated that data were normally distributed in both experiments. Therefore, mirroring the AI experiment settings, a first ANOVA test was performed, taking perceptual indices as response variables and acoustical parameters as dependent variables.
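The statistical pipeline (normality check, then one-way ANOVA across severity classes) can be sketched with SciPy; the group data here are synthetic placeholders, not the study's measurements:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-ins for one acoustical parameter in three severity classes
mild, moderate, severe = (rng.normal(loc=m, scale=1.0, size=10)
                          for m in (0.0, 0.5, 3.0))

# Shapiro-Wilk per class: p > 0.05 gives no evidence against normality
shapiro_ps = [stats.shapiro(g).pvalue for g in (mild, moderate, severe)]

# One-way ANOVA across the three classes (significance level 0.05)
f_stat, p_value = stats.f_oneway(mild, moderate, severe)
```

With clearly separated group means, as in this synthetic example, the ANOVA p-value falls well below 0.05.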
A second ANOVA test was carried out, considering G as a response variable and all other parameters as dependent variables. Significance level was set at 0.05 for both tests.
Results

Table 4 summarizes the best validation accuracies of the first experiment, in which perceptual indices were iteratively used as response variables to highlight possible relationships between them and objective parameters. Table 4 also displays the models' hyperparameters: d stands for distance metric, w for distance weight, c for split criterion, v for the number of predictors selected at random for each split, l for the number of leaves, s for the maximum number of splits, and N for the number of learning cycles. LIME was applied to these models and computed which features contributed to each outcome; it was therefore possible to identify relationships between the response variables and their acoustical correlates, as Table 5 shows.
In the second experiment, G was used as the response variable to assess AdSD severity with both perceptual and objective parameters. The best model was a KNN with k = 2 and Euclidean distance metric, which reached 89% LOSO validation accuracy. LIME results are summarized in Table 6.

Table 6. Most relevant features obtained by averaging and normalizing LIME predictors' weights for the G index in the second AI experiment.

Response Variable    Features
G                    Spasmodicity, R, Voicing

In Table 7, the performance of the KNN model is presented; in brackets, the 95% confidence bounds for each metric are reported. In Table 8, results from the first ANOVA test are shown, with p-values in brackets. No significant difference between classes was observed for the B index. For the second statistical test, perceptual indices showed significant differences between severity classes for R (p < 0.001), B (p = 0.029), Intelligibility (p < 0.001), Fluency (p < 0.001), Voicing (p < 0.001), and Spasmodicity (p < 0.001). Furthermore, G showed significant differences between classes for F0 mean (p = 0.012), F0 median (p = 0.043), and F0 max (p = 0.003).
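The per-class metrics reported in Table 7 follow the usual one-vs-rest definitions; an illustrative computation (our helper, not the paper's code):

```python
import numpy as np

def class_metrics(y_true, y_pred, cls):
    """One-vs-rest precision, sensitivity (recall), specificity, and
    F-score (harmonic mean of precision and sensitivity) for one class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == cls) & (y_true == cls))
    fp = np.sum((y_pred == cls) & (y_true != cls))
    fn = np.sum((y_pred != cls) & (y_true == cls))
    tn = np.sum((y_pred != cls) & (y_true != cls))
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f_score = (2 * precision * sensitivity / (precision + sensitivity)
               if precision + sensitivity else 0.0)
    return precision, sensitivity, specificity, f_score
```

For instance, a class with one true positive, one false negative, and no false positives yields precision 1.0, sensitivity 0.5, specificity 1.0, and F-score 2/3.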

Discussion
The identification of relationships between perceptual and objective measures of voice is still a topic under discussion. Dejonckere et al. [39] found correlations between G and both shimmer and HNR, between R and jitter, and between B and shimmer. In another paper, the same authors showed that G was also highly correlated (0.71) with the dominant cepstral peak [40]. Bhuta et al. [17] found relationships between the voice turbulence index (VTI) and G, between NHR and both G and R, and between the soft phonation index (SPI) and both G and B. These results were confirmed in more recent papers by Park et al. [41] and Narisimhan et al. [42]. Focusing on SD, Dejonckere et al. [43] found relationships only between B and devoicing measures extracted with AMPEX: PVF (the proportion of voiced frames), PVS (the proportion of speech frames), and VL90 (the 90th percentile of the voicing length distribution). However, that study had different aims from the present one, as it concerned the comparison between pre- and post-treatment voice quality.
In the present study, the LIME method was implemented to investigate relationships between perceptual indices and objective parameters for AdSD, which represents an unexplored task. Specifically:

• The percentage of voiced parts in the whole audio signal (% voiced) was the parameter most strongly related to the G value. This may be linked to repeated interruptions that significantly lower voice quality. This result is also supported by two parameters, duration mean and duration max, which highlight the longer time required to emit the word /a'jwɔle/ due to alterations in vocal fold mobility. F0 mean and F1 mean contributed as well; the relevance of F1 mean could be associated with the degree of pharynx constriction [44].
• R assessment is linked to F2 median and % voiced. These results suggest that roughness is related to tongue movements [44]. Jitter was found to be relevant in 8 out of 28 LIME predictions, but its total contribution weight was only 3.1%. Interestingly, jitter acted as a "confounding" parameter, causing moderate cases of AdSD to be classified as severe. In 25% of the observations, NNE, a measure of noise alternative to HNR, was considered relevant, but its total contribution weight was only 2.4%. These discrepancies with literature results [39,41,42] are probably caused by the different tasks: while jitter and NNE typically show strong correlations with R when healthy subjects are compared with patients diagnosed with AdSD, they probably are not relevant parameters for distinguishing among different AdSD severity classes.
• B ratings showed the strongest relationship with F1 median values. As shown in Figure 4, even if PSD III (range 1-1.5 kHz) is considered a relevant parameter, it caused two out of four misclassifications from moderate to mild, and therefore it should be discarded.
• Spasmodicity is associated with F1 median as well as with the medium-high frequency region of the spectrum, described by PSD VIII (range 3.5-4 kHz).
For Intelligibility, Fluency, and Voicing, the low validation accuracies of the trained classifiers made the relevant parameters identified by LIME unreliable. However, statistical analysis highlighted significant differences between classes for Intelligibility in Pause duration min (p = 0.049), which can be related to the high correlation between Intelligibility and PVF found in [21], and in F0 mean (p = 0.041). Fluency presented significant differences in NNE (p = 0.037) and PSD I (p = 0.01), with this last parameter confirmed by LIME as well; Voicing presented significant differences in F2 min (p = 0.03) and F0 mean (p = 0.005), in agreement with Table 5.
It is interesting to note that the LIME outcome is supported by the statistical analysis, especially for R: indeed, F2 median, % voiced, and F1 mean show significant differences between classes, in particular between the severe and moderate classes, with p = 0.005, p = 0.040, and p = 0.040, respectively. This is partly true also for G and Spasmodicity, where F0 mean and F1 median each showed significant differences. For Spasmodicity, the ANOVA highlighted that, besides F1- and F2-related measures, % voiced, Number of Units, and Number of pauses provided significant differences between severity classes: since spasmodicity directly depends on voice breaks, these results reflect a strong connection with objective correlates.
Both acoustical and perceptual parameters were included as features in a multiclass classification experiment aimed at developing a model capable of supporting otolaryngologists in AdSD assessment. To the authors' knowledge, this represents the first attempt to assess the severity of this pathology with machine learning techniques. Indeed, the literature mainly concerns general voice disorders, SD assessment alone, or only the discrimination between SD and MTD. With a KNN model, a high accuracy of 89% was obtained: this is a promising result that highlights how MLVA could be successfully applied as a support for ENT specialists in AdSD assessment. Table 7 shows high precision, F-score, and AUC values, especially for severity class 1, which means that, with this classifier, acoustical parameters and perceptual indices are able to distinguish the most severe cases of AdSD. The model also shows strong recognition capabilities for moderate cases of AdSD, while mild cases are characterized by lower performance. In fact, our sample comprised only two mild cases; mild cases of SD were thus underrepresented in our study, possibly as a consequence of our strict inclusion criterion (only cases with a compelling clinical diagnosis of SD), and further analyses will be required in the future. However, precision and specificity for the mild class are equal to 100%, underlining that no severe or moderate cases were misclassified as mild: this is an important result, useful to avoid underestimation. LIME was also used to identify the relevant features in prediction: Table 6 shows that they were all perceptual parameters, particularly Spasmodicity, R, and Voicing. Moreover, ANOVA detected significant differences for all perceptual parameters (especially R, Voicing, and Spasmodicity, with p-values < 0.001) and for F0-related measures, which can be associated with PSD I. These results confirm the validity of the IINFVo scale for SD assessment [21].
For each perceptual index, the proposed method was capable of detecting the best corresponding acoustical feature: % voiced for G, F2 median for R, and F1 median for B and Spasmodicity. These parameters could be associated with voice breaks caused by spasms, tongue movement limitations, and pharynx constriction, respectively, which may also depend on the anomalous laryngeal muscle contraction of AdSD. Due to the low validation accuracy of the trained models, possible relationships between IINFVo indices and acoustical parameters were not considered, and further analysis is required to improve classification performance. However, the combination of machine learning techniques, which can highlight underlying patterns in data, and LIME allowed for the identification of useful relationships between perceptual and acoustical parameters in SD assessment, a task where statistical analysis methods showed poor performance. Finally, a KNN model provided 89% overall accuracy in dividing patients affected by AdSD into three severity classes; this promising result could help ENT specialists recognize subjects that would benefit from botulinum toxin A treatment and reduce underestimation of pathology severity.

Conclusions
AdSD is characterized by strained voice, pitch breaks, and intermittent breathiness, which are usually assessed with perceptual ratings. Acoustical analysis and machine learning techniques can support clinicians' diagnosis by providing reliable relationships between objective and perceptual indices. In this paper, an innovative approach based on the LIME method is proposed to address this problem; it highlighted the most relevant objective parameters that should be taken into account to support and validate G, R, B, and Spasmodicity assessment. This could be helpful to quantify possible improvements or worsening of voice quality after specific treatments. The results were also supported by statistical analysis. Moreover, a novel machine learning experiment was carried out to develop an automatic tool that could help otolaryngologists in AdSD severity assessment and in monitoring SD voice quality over time after botulinum toxin injection, as an aid to check whether and to what extent the treatment was successful and if and when it may be appropriate to repeat it. However, adjustments of the method and a larger dataset for validation, including male patients as well, are required to reduce the number of severe cases classified as moderate.