Machine Learning Approach to Dysphonia Detection

Abstract: This paper addresses the processing of speech data and their utilization in a decision support system. The main aim of this work is to utilize machine learning methods to recognize pathological speech, particularly dysphonia. We extracted 1560 speech features and used these to train the classification model. As classifiers, three state-of-the-art methods were used: K-nearest neighbors, random forests, and support vector machine. We analyzed the performance of classifiers with and without gender taken into account. The experimental results showed that it is possible to recognize pathological speech with a classification accuracy as high as 91.3%.


Introduction
Neurological control through muscle and sensing is part of almost every human activity. It is so natural that we do not even realize it. This is also the case for speech production. Even though the process itself is complex, its functioning is taken for granted. Unfortunately, neurological diseases, infections, tissue changes, and injuries can negatively affect speech production. Impaired speech production is frequently represented by dysphonia or dysphonic voice. Dysphonia is a perceptual quality of the voice that indicates that some negative changes have occurred in the phonation organs [1]. The relationship between voice pathology and acoustic voice features has been clinically established and confirmed both quantitatively and subjectively by speech experts [2][3][4].
As indicated above, pathological voice is an indication of health-related problems. However, recognizing dysphonic voice at an early stage is not an easy task. Usually, a trained expert is required, and a series of speech exercises needs to be completed. Automated speech evaluation can allow for time- and cost-effective speech analysis that, in turn, enables the screening of a wider population. Since some diseases, such as Parkinson's disease, manifest themselves early in speech disruption, early discovery through screening can lead to earlier treatment and to improved treatment results. The main motivation for this work is the use of artificial intelligence in the diagnosis of different diseases. This can lead to significant improvements in diagnosis and healthcare and, in turn, to the improvement of human life [5,6]. Several diseases include a speech disorder as an early symptom. This is usually due to damage to the neurological system or caused directly by damage to some part of the vocal tract, such as the vocal cords [7]. Frequently, a speech disorder leads to a secondary symptom, and early detection may reveal many high-risk illnesses [1,8].
The ultimate goal of this research is to develop a decision support system that provides an accurate, objective, and time-efficient diagnosis and helps medical personnel provide the correct diagnostic decision and treatment. In this work, we focused on the detection of pathological speech based on features obtained from voice samples from multiple subjects. It is a noninvasive way of examining a voice, and only digital voice recordings are needed. Voice data were obtained from the publicly available Saarbrucken Voice Database (SVD). We exported 194 samples from SVD, of which 94 samples originated from patients with dysphonia and 100 samples came from healthy subjects. In order to detect pathological speech, it is necessary to build and train a classification model. For this purpose, three state-of-the-art classification algorithms were utilized: support vector machine (SVM), random forest classifier (RFC), and K-nearest neighbors (KNN).
The paper is organized as follows. In the next section, we provide a brief overview of similar works in the area of pathological speech processing. Then, we describe the preprocessing of the dataset and provide a brief overview of classification algorithms. Later, we propose a decision support model and present the results of numerical experiments. Conclusions are drawn in the last section.

Related Work
Speech processing is a very active area of research. There are many contributions that focus on different aspects of speech processing, from feature extraction to decision support systems based on speech analysis. We provide a brief overview of some recent findings related to our research topic.
There are several existing solutions in the field of pathological speech detection [9,10]. For example, Al-Nasheri et al. [11] concentrated on developing feature extraction for the detection and classification of voice pathologies by investigating different frequency bands using autocorrelation and entropy. The voice impairment cases studied were caused by vocal cysts, vocal polyps, and vocal paralysis. They found that the most contributive frequency bands in both detection and classification were between 1000 and 8000 Hz. Each voice sample consisted of the sustained vowel /a/, and a support vector machine was used as a classifier. The highest obtained accuracies in the case of detection were 99.69%, 92.79%, and 99.79% for Massachusetts Eye and Ear Infirmary (MEEI), the Saarbrucken Voice Database (SVD), and the Arabic Voice Pathology Database (AVPD), respectively.
Martinez et al. [12] presented a set of experiments on pathological voice detection with the SVD by using the MultiFocal toolkit for discriminative calibration and fusion. Results were compared with the MEEI database. Since they used the data from the SVD dataset, sustained vowel recordings of /a/, /i/, and /u/ were analyzed. The samples were not differentiated according to the diagnosis; instead, all samples in SVD were used, and only healthy and pathological ones were distinguished. Samples of 650 subjects were healthy and 1320 samples were pathological. Extracted features included mel-cepstral coefficients (MFCCs), harmonics-to-noise ratio (HNR), normalized noise energy (NNE), and glottal-to-noise excitation ratio (GNE), and they mostly measured the quality of the voice. A Gaussian mixture model was used as a classifier. In the case of the vowel /a/, they reached an accuracy of 80.4%; with the vowel /i/, it was 78.3%; and with the vowel /u/, it was 79.9%. For the fusion of all vowels, they reached an accuracy of 87.9%. Using the MEEI dataset, they achieved an accuracy of 94.3%, which is 6.4 percentage points more than with the SVD dataset.
Little et al. [13,14] focused on discriminating healthy subjects from subjects with Parkinson's disease by detecting dysphonia. They introduced a new measure of dysphonia: pitch period entropy (PPE). The utilized data consisted of 195 sustained vowels from 31 subjects, of which 23 were diagnosed with Parkinson's disease. The extracted features included pitch period entropy, shimmer, jitter, fundamental frequency, pitch marks, HNR, etc. They found that the combination of the features HNR, PPE, detrended fluctuation analysis, and recurrence period density entropy led to quite an accurate classification of the subjects with Parkinson's disease from healthy subjects. A classification performance of 91.4%, using a kernel support vector machine, was achieved.
Other authors have also investigated the effect of Parkinson's disease on speech [15] or various aspects of speech deterioration caused by Alzheimer's disease [16], dysphagia [8], or impaired speech [17].
Speech signal processing and machine learning are being increasingly explored in the developmental disorder domain, where the methods range from supervised classification to knowledge-based data mining of highly subjective constructs of psychological states [18,19].

Dataset
We used the publicly available Saarbrucken Voice Database [20]. It is a collection of voice recordings where one subject's samples consist of recordings of the vowels /a/, /i/, and /u/ produced at a normal, high, low, and low-high-low pitch. The length of the recordings with sustained vowels is between 1 and 2 s. We exported 194 samples from this database, of which 94 samples belong to patients with dysphonia (41 men, 53 women) and 100 samples belong to healthy subjects. All subjects are over 18 years of age.

Speech Feature Extraction
To obtain some representative characteristics of speech, it was necessary to implement feature extraction from individual samples and to build a feature matrix. For each intonation of each vowel, 130 features were extracted. This means that 520 features were extracted for all intonations of a particular vowel. For one subject and all corresponding vowel recordings of /a/, /i/, and /u/, the number of features is 1560. The features consist of the following specific types of parameters: energy, low-short time energy ratio, zero crossing rate, Teager-Kaiser energy operator, entropy of energy, Hurst's coefficient, fundamental frequency, mel-cepstral coefficients, formants, jitter, shimmer, spectral centroid, spectral roll-off, spectral flux, spectral flatness, spectral entropy, spectral spread, linear prediction coefficients, harmonics-to-noise ratio, power spectral density, and phonatory frequency range. Some of the features contain multiple subtypes (e.g., shimmer:local, shimmer:apq3, shimmer:apq5, etc.), and some of the features were extracted from smaller time frames of the recording. In the latter case, several statistical functionals (median, average, minimum, maximum, and standard deviation) were determined.
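The frame-level extraction followed by statistical functionals can be sketched as follows. This is a minimal illustration assuming NumPy; the frame length, hop size, synthetic test signal, and helper names (`frame_signal`, `zero_crossing_rate`, `functionals`) are hypothetical choices and not the settings used in this work.

```python
import numpy as np

def frame_signal(signal, frame_len, hop_len):
    """Split a 1-D signal into overlapping frames, one frame per row."""
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    return signal[idx]

def zero_crossing_rate(frames):
    """Fraction of adjacent sample pairs in each frame whose signs differ."""
    signs = np.sign(frames)
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)

def functionals(values):
    """Summarize a frame-level feature track with the five statistics named in the text."""
    return {
        "median": float(np.median(values)),
        "mean": float(np.mean(values)),
        "min": float(np.min(values)),
        "max": float(np.max(values)),
        "std": float(np.std(values)),
    }

# A synthetic 1 s, 100 Hz sine sampled at 16 kHz stands in for a vowel recording.
sr = 16000
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 100 * t)

# Hypothetical framing: 25 ms frames with a 12.5 ms hop.
frames = frame_signal(signal, frame_len=400, hop_len=200)
zcr_stats = functionals(zero_crossing_rate(frames))

# Feature-count bookkeeping from the text: 130 features per recording,
# 4 intonations per vowel, 3 vowels per subject.
features_per_vowel = 130 * 4
features_per_subject = features_per_vowel * 3
```

Each frame-level feature thus contributes five scalar entries to the feature vector of a recording, which is one way a fixed-length representation can be obtained from recordings of different durations.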

Feature Selection
It was previously shown that feature selection (FS) can improve prediction performance in some areas [21]. We applied feature selection to find the optimal subset of features for better classification between healthy and pathological samples. We selected a simple filter FS method to obtain the k best features. Filter FS is a computationally efficient approach that provides results competitive with more complex methods. Mutual information for discrete target variables was used to estimate the score of features. This estimate is based on entropy estimation from k-nearest neighbor distances. Mutual information between two random variables is a non-negative value that measures the dependency between the variables. If the variables are independent, it is zero, and the higher the value, the greater the dependence.
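A filter selection of this kind can be sketched with scikit-learn, whose `mutual_info_classif` estimator is likewise based on k-nearest neighbor entropy estimation. The library choice, the synthetic data, and the value of k below are assumptions made for illustration only.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic stand-in for the feature matrix: 194 samples, 40 features.
rng = np.random.default_rng(0)
n_samples, n_features = 194, 40
X = rng.normal(size=(n_samples, n_features))
y = rng.integers(0, 2, size=n_samples)

# Make the first three features depend on the class label.
X[:, :3] += 2.0 * y[:, None]

def mi_score(X, y):
    """Mutual information between each feature and the discrete target."""
    return mutual_info_classif(X, y, random_state=0)

# Keep the k features with the highest mutual information (k is hypothetical).
selector = SelectKBest(score_func=mi_score, k=5)
X_selected = selector.fit_transform(X, y)
selected_idx = selector.get_support(indices=True)
```

Because a filter method scores each feature independently of the classifier, the same selected subset can be reused for SVM, KNN, and RFC.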

Principal Component Analysis
The purpose of conducting principal component analysis is similar to the purpose of feature selection, i.e., to find a smaller set of features to improve prediction performance. Unlike the FS method, dimensionality reduction does not preserve the original features but instead transforms features from a high-dimensional space to a new, lower-dimensional space. We implemented dimensionality reduction using the principal component analysis (PCA) method. PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. Each principal component is a linear combination of the original features, and the retained components are used to build a new feature matrix.
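The transformation can be illustrated with a short sketch, assuming scikit-learn's `PCA`; the synthetic data and the number of retained components are hypothetical. The manual projection shows that the reduced features are obtained by centering the data and projecting it onto orthonormal component vectors.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the feature matrix, with two strongly correlated columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(194, 30))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=194)

# Project onto the first three principal components.
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)

# The same projection written out by hand: center, then multiply by the
# component vectors (rows of pca.components_).
components = pca.components_
X_manual = (X - pca.mean_) @ components.T

# Covariance of the new features; off-diagonal entries vanish because the
# principal components are linearly uncorrelated.
cov_reduced = np.cov(X_reduced, rowvar=False)
```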

Classification Models
Currently, there are many classification algorithms, and none of them outperforms the others in every scenario [22]. To improve the robustness of our results, we selected three state-of-the-art classifiers: support vector machine (SVM) with a nonlinear kernel, K-nearest neighbors (KNN), and random forest classifier (RFC). All three classifiers are based on different underlying principles and have shown very promising results in many areas.

Support Vector Machine
SVM uses a hyperplane to classify data samples into two categories. The hyperplane is built in such a way that it allocates the majority of points of the same category on the same side of the hyperplane while trying to maximize the distance of data samples from both categories to this hyperplane. The subset of data samples closest to the separating hyperplane is denoted as the support vectors [23].
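Assume that the training set consists of data samples $x_i \in \mathbb{R}^n$ with class labels $y_i \in \{1, -1\}$ for $i = 1, \ldots, N$. The optimal hyperplane is defined as [24]

$$w \cdot x + b = 0,$$

where $w$ represents a weight vector and $b$ represents bias.
The goal is to maximize the margin, which can be achieved through a constrained optimization problem

$$\min_{w, b} \; \frac{1}{2} \|w\|^2 \quad \text{subject to} \quad y_i (w \cdot x_i + b) \geq 1, \quad i = 1, \ldots, N.$$

By introducing slack parameters $\xi_i$, the objective function is updated to:

$$\min_{w, b, \xi} \; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{N} \xi_i \quad \text{subject to} \quad y_i (w \cdot x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1, \ldots, N,$$

where $C > 0$ controls the trade-off between the margin width and the penalty on margin violations. The introduction of kernelization into the SVM allows it to solve difficult nonlinear tasks in data mining. For the SVM classifier, we searched through the following parameters: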
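As an illustration of such a parameter search, the following sketch assumes scikit-learn's `SVC` and `GridSearchCV`. The kernels and the values of C and gamma in the grid are hypothetical placeholders and are not the values used in this work; the data are synthetic.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic two-class data with two informative features.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 60)
X = rng.normal(size=(120, 10))
X[:, :2] += 1.5 * y[:, None]

# Hypothetical grid: C is the soft-margin penalty, gamma the kernel width.
param_grid = {
    "kernel": ["rbf", "poly"],
    "C": [0.1, 1, 10],
    "gamma": ["scale", 0.01],
}

# Exhaustive search with 5-fold stratified cross-validation.
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

best_params = search.best_params_
best_cv_accuracy = search.best_score_
```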

K-Nearest Neighbors
Akbulut et al. [25] stated that the KNN method is considered to be one of the oldest and simplest types of nonparametric classifier. A nonparametric model for KNN means that the classification of the test data does not use a function that has been set up in advance based on the learning process on the training dataset. Instead, it uses the memory of this training set and measures the similarity of the new test sample to the original samples based on distance. The goal is to determine the true class of an undefined test pattern by finding the nearest neighbors within a hypersphere of a predefined radius. The disadvantage is that when choosing a low k value, the separating boundary is highly adapted to the training data, and overfitting occurs. At larger values of k, the boundary tends to be smoother and to achieve better prediction results for new samples. The optimal value of k needs to be determined experimentally. For the KNN method, we grid-searched through the following parameters:

Random Forest Classifier
RFC is considered a complex classifier, as it comprises a set of decision trees. Each tree is an independent classifier that consists of decision nodes. Each node evaluates a certain condition using the test data from the set {x_1, x_2, ..., x_n}. Based on the result, the branch continues to the next node, up to the tree leaf that holds the classification information [26,27].
As stated in [28], a decision tree is built by selecting, at each node, the attribute with the greatest information gain, which divides the data into two subsets that are as distinct as possible. This process is repeated at each node until a tree leaf is obtained. However, this can lead to a large depth of the tree and increase the risk of overfitting. For this reason, the maximum depth of the tree is limited in practice. Another drawback associated with decision tree classifiers is their high variance. In order to increase classification accuracy, these trees, with weaker prediction capabilities, are grouped to create a more robust and accurate model, the RFC. Random forests generally achieve better results with a higher number of decision trees, k [26]. The value of k should be chosen at the point where accuracy stabilizes and no longer increases. The set of parameters for RFC in our experiments was as follows: the number of features to consider when looking for a split: sqrt(number of features), log2(number of features)
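A search over the split-feature setting can be sketched as follows, assuming scikit-learn's `RandomForestClassifier`. Only the two `max_features` options come from the text; the numbers of trees, the depth limits, and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic two-class data with three informative features.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 60)
X = rng.normal(size=(120, 16))
X[:, :3] += 1.5 * y[:, None]

param_grid = {
    "max_features": ["sqrt", "log2"],  # options named in the text
    "n_estimators": [50, 100],         # hypothetical numbers of trees
    "max_depth": [3, 5],               # hypothetical depth limits
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, cv=5, scoring="accuracy"
)
search.fit(X, y)

best_params = search.best_params_
best_cv_accuracy = search.best_score_
```

Limiting `max_depth` reflects the depth restriction discussed above, while `n_estimators` corresponds to the number of trees k.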

Pathological Speech Detection
The main aim of this work is to design a machine learning model that is able to discriminate pathological speech. As was indicated above, from the machine learning point of view, this is a binary classification task. The following list describes the sequence of steps for creating the system for pathological speech detection:
1. Export of data: Recordings of the vowels /a/, /i/, and /u/ were obtained from the freely available Saarbrucken Voice Database.
2. Feature extraction: From the exported samples, it was necessary to extract speech features that express voice quality and potential pathological disorders present in the voice. The entire feature extraction was performed in Python. The types of features are described in Section 3.
3. Dimensionality reduction: In order to improve the accuracy of the classification, feature selection and principal component analysis were performed. Both methods are described in more detail in Section 3. The most relevant features are listed in Table 1.
4. Visualization of features: In order to get a better view of the data structure, we visualized features using the PCA method. These visualizations are shown in Figures 1-3. For this visualization, we used the first three principal components from the PCA output.
5. Model training: Each machine learning model used in this work (SVM, KNN, RFC) must be trained prior to classification. A combination of different features and genders was tested. After this step, we should have classifiers with their optimal parameters, because tuning the hyperparameters of the model is also included in this step. A detailed description of the training and cross-validation procedure is provided in the Experimental Results section.
6. Model evaluation: The created model was tested on new test data, which means that these data were not part of the training process and the classifier did not come into contact with them.
The graphical design of the sequence of steps is shown in Figure 4.

Experimental Results
Two types of result comparisons were made for all three classifiers: SVM, KNN, and RFC. In the first case, we compared the influence of dimensionality reduction on classification performance. In the second case, we compared the results by considering the gender and the results yielded by processing different vowels.
Prediction performance is measured by accuracy, defined as

$$\text{accuracy} = \frac{tp + tn}{tp + tn + fp + fn},$$

where tp represents the number of true positive samples and tn the number of true negative samples. Similarly, fp denotes the number of false positive samples and fn the number of false negative samples.
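Accuracy is the share of correctly classified samples among all samples and can be computed directly from the four confusion counts. The sketch below is plain Python; the function names and the example labels are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, tn, fp, fn) for a binary classification task."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    """Fraction of correctly classified samples."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one sample is required")
    return (tp + tn) / total

# Hypothetical labels: 1 = pathological, 0 = healthy.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
counts = confusion_counts(y_true, y_pred)
acc = accuracy(*counts)
```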

Influence of Feature Selection on Prediction Performance
First, we aimed to analyze how feature selection affects the prediction performance of the classifiers. We evaluated the classification accuracy using only selected features; a new subset of k features was obtained in a loop, where k ranged from 50 to 1560 and was incremented by 50 in each iteration. Then, parameter tuning for the classifier model (SVM, KNN, RFC) was done. The classification performance of the classifiers for each iteration is shown in Figure 5.
For SVM and RFC, the accuracy increases steadily up to 400 features. After this point, the accuracy decreases and rises repeatedly, and the values usually do not exceed the accuracy that was reached with 400 features. On the other hand, the accuracy of the KNN classifier shows a decreasing tendency with an increasing number of features. This is probably due to the higher dimensionality of the data space, since the KNN classifier is known to suffer in higher dimensions. Even though the accuracy of KNN dropped with higher dimensionality in this case, this does not always happen, as can be seen, for example, in [29].
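The loop over the number of selected features can be sketched as follows, assuming scikit-learn. The synthetic data, the smaller range of k, the fixed RBF-kernel SVM (without the per-iteration parameter tuning described above), and the use of a pipeline with cross-validation are simplifying assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Synthetic two-class data: 60 features, of which the first five are informative.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 60)
X = rng.normal(size=(120, 60))
X[:, :5] += 1.0 * y[:, None]

def mi_score(X, y):
    """Mutual information between each feature and the discrete target."""
    return mutual_info_classif(X, y, random_state=0)

# Hypothetical, scaled-down counterpart of the loop over k in steps of 50.
k_values = list(range(10, 61, 10))
accuracy_by_k = {}
for k in k_values:
    # Selection sits inside the pipeline, so it is refit on each training fold.
    model = make_pipeline(SelectKBest(score_func=mi_score, k=k), SVC(kernel="rbf"))
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    accuracy_by_k[k] = float(scores.mean())

best_k = max(accuracy_by_k, key=accuracy_by_k.get)
```

Plotting `accuracy_by_k` against k would give a curve of the same kind as the accuracy curves reported for the three classifiers.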

Influence of PCA on Prediction Performance
The goal of this section is to compare the accuracy of the classifiers with data of reduced dimensions. The first step was to set the number of principal components that make up a new set of training and test data. As with the previous selection method, the hyperparameters of the classification model were tuned with these new data in the training process. An overview of the accuracy of the classification results using the PCA method for different numbers of principal components is shown in Figure 6.
Using PCA, the classification results did not improve as we expected, so we did not use this method in the subsequent experiments. From this point on, we used only filter feature selection.

Results by Gender and Classifiers
The previous experiments indicated that the feature selection positively influenced the prediction performance, so we utilized it in further experiments. The number of all features for one subject was 1560, and only 300 features were selected in feature selection.
For each classifier, we applied two types of cross-validation. In the first case, 75% of the data were used for training the model. Out of these training data, 25% were utilized as validation data, employed for hyperparameter tuning. The classifier model with the best parameters found in the previous step was applied to the test data that were not part of the training dataset. The whole process

Figure 1. Visualization of mixed gender samples, with 1560 features for each sample.

Figure 2. Visualization of female samples, with 1560 features for each sample.

Figure 3. Visualization of male samples, with 1560 features for each sample.

Figure 4. System design for pathological speech detection.

Figure 5. Prediction accuracy as a function of the number of features.

Figure 6. Prediction accuracy as a function of the number of principal component analysis (PCA) components.

Table 1. The most important features, as selected by feature selection, including the vowel being pronounced and intonation (L-low, N-neutral, H-high, LHL-changing low-high-low).