Experimental Evaluation of Deep Learning Methods for an Intelligent Pathological Voice Detection System Using the Saarbruecken Voice Database

Abstract: This work focuses on deep learning methods, namely the feedforward neural network (FNN) and the convolutional neural network (CNN), for pathological voice detection using mel-frequency cepstral coefficients (MFCCs), linear prediction cepstrum coefficients (LPCCs), and higher-order statistics (HOSs) parameters. In total, 518 voice data samples were obtained from the publicly available Saarbruecken voice database (SVD), comprising recordings of 259 healthy and 259 pathological women and men, respectively, producing the /a/, /i/, and /u/ vowels at normal pitch. Significant differences were observed between the normal and the pathological voice signals for normalized skewness (p = 0.000) and kurtosis (p = 0.000), except for the normalized kurtosis (p = 0.051) estimated from the /u/ samples in women. These parameters are useful and meaningful for classifying pathological voice signals. The highest accuracy, 82.69%, was achieved by the CNN classifier with the LPCCs parameter for the /u/ vowel in men. The second-best performance, 80.77%, was obtained with a combination of the FNN classifier, MFCCs, and HOSs for the /i/ vowel samples in women. There was merit in combining the acoustic measures with HOS parameters for better characterization in terms of accuracy. The combination of various parameters and deep learning methods was also useful for distinguishing normal from pathological voices.


Introduction
The automatic detection of speech disabilities has attracted significant clinical and academic attention, with the hope of accurately diagnosing speech impairments before they are identified by well-trained experts and expensive equipment. Although many researchers have focused on acoustic analysis, parametric and nonparametric feature extraction, and the automatic detection of speech pathology using pattern recognition algorithms and statistical methods [1][2][3][4], studies of pathological voice detection using deep learning techniques have recently been published in growing numbers. Machine learning techniques in particular perform classification well in various areas. Many machine learning algorithms, such as random forest (RF), gradient boosting, support vector machines (SVM), k-nearest neighbors (kNN), and artificial neural networks (ANNs), have recently been adopted to identify various signals [5][6][7].
Pathological voices represent health-related problems. Some diseases include speech impairment as an early symptom. This is usually caused by damage to the nervous system or damage to parts of the vocal tract, such as the vocal cords [8]. Speech disorders often lead to secondary symptoms, and many high-risk diseases can be found in their early stages through pathological voice analysis [9,10]. In particular, speech disorders are sometimes indicative of early-stage Parkinson's disease; thus, early detection through screening can lead to early treatment and can improve treatment results. However, it is not easy for speech experts to analyze and evaluate speech, even at the early stage of a voice disorder.
Usually, a trained professional is needed, and this individual must undergo training in voice evaluation to accurately evaluate speech. Therefore, automatic pathological voice detection enables efficient speech evaluation in terms of time and cost, resulting in more speech impairment screening. The main motivation for realizing this work is the use of artificial intelligence to diagnose various diseases. This can lead to significant improvements in diagnosis and healthcare, as well as further improvements in human life [11,12].
The ultimate goal of this research is to develop an intelligent pathological voice detection system that supports an accurate and objective diagnosis. This work focuses on deep learning methods, namely the feedforward neural network (FNN) and the convolutional neural network (CNN), for the detection of pathological speech using mel-frequency cepstral coefficients (MFCCs) and linear prediction cepstrum coefficients (LPCCs), as well as higher-order statistics (HOSs) parameters. Voice data were obtained from the publicly available Saarbruecken voice database (SVD). The author exported 518 samples from the SVD, comprising 259 healthy and 259 pathological female recordings of the /a/, /i/, and /u/ vowels at normal pitch. Additionally, 259 healthy and 259 pathological male recordings of the same vowels at normal pitch are used in this work. Normalized kurtosis and skewness are shown in the form of box plots to provide better visualizations of the normal and pathological voice signals for men and women in each /a/, /i/, and /u/ vowel. Finally, this work investigates voice detection performance using deep learning methods with various combinations of parameters. Therefore, the originality of this work can be found in its proposal of a new parameter and a novel deep learning method that combines HOSs, MFCCs, and LPCCs in the /a/, /i/, and /u/ voice signals of healthy and pathological individuals. The contributions of this paper can be summarized in the following points:
• This paper introduces an intelligent pathological voice detection system that supports an accurate and objective diagnosis based on deep learning and the parameters introduced.
• The suggested combinations of various parameters and deep learning methods can effectively distinguish normal from pathological voices.
• Extensive experimental tests are performed to confirm the effectiveness of the pathological voice detection system using the Saarbruecken voice database.
• The experimental results emphasize the superiority of the proposed pathological voice detection system, which integrates machine learning methods and various parameters to monitor and diagnose a pathological voice effectively and reliably.

Related Work
As a field of study, pathological voice signal processing has always aimed to create objective and accurate classifications of voice disorders. Additionally, there have been many contributions that focus on various aspects of speech processing from feature extractions to decision support systems based on deep learning methods [13][14][15][16][17][18][19]. This section provides a brief overview of several recent findings related to the research topic of this paper.
Dankovicová et al. focused on feature selection (FS) and machine learning methods, such as K-nearest neighbors (KNN), random forests (RF), and support vector machines (SVM). The sustained vowels /a/, /i/, and /u/ produced at normal, high, low, and low-high-low pitches were used. These vowels were selected from the Saarbruecken voice database, and 94 pathological subjects and 100 healthy subjects were chosen. The SVM classifier achieved the highest accuracy when the original 1560 features were reduced to 300 using the filter FS method. The overall classification performance based on feature selection was the highest, with 80.3% for mixed samples, 80.6% for female samples, and 86.2% for male samples [13].
A study by Mohammed et al. focused on transfer learning strategies, such as an effective pre-trained ResNet34 model for CNN training. Due to the unequal distribution of samples, this model adjusted the weights of the samples from the minority groups during training as a means of compensation. The final sample weight is the product of three weights: a class weight (α), a gender weight (β), and a gender-age weight (γ), so that the final sample weight (ω) is calculated as ω = α·β·γ. The 300 training samples extracted from the SVD were divided equally into 150 healthy and 150 pathological classes to ensure a balanced training process. An additional 1074 test samples, divided into 200 healthy and 874 pathological classes, were included in the study. The system achieved a prediction accuracy of up to 94.54% on the training data and 95.41% on the testing data [17].
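The weight product described above can be illustrated with a minimal sketch. The function name and the numeric weights are hypothetical and chosen only for illustration; how the cited study derived α, β, and γ is not described here.

```python
def sample_weight(class_w: float, gender_w: float, gender_age_w: float) -> float:
    """Final per-sample weight: omega = alpha * beta * gamma."""
    return class_w * gender_w * gender_age_w


def weighted_mean_loss(losses, weights):
    """Weighted average of per-sample losses, one common way to apply such weights."""
    return sum(l * w for l, w in zip(losses, weights)) / sum(weights)


# Hypothetical weights for one sample from an under-represented group.
omega = sample_weight(class_w=2.0, gender_w=1.5, gender_age_w=0.5)
print(omega)  # 1.5

# Two samples with equal loss contribute in proportion to their weights.
print(weighted_mean_loss([1.0, 3.0], [1.0, 1.0]))  # 2.0
```

A sample with a larger ω pulls the averaged loss toward its own error, which is how minority groups can be compensated during training.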
Hedge et al. presented a survey of research works conducted on the automatic detection of voice disorders and explored ways to identify different types of voice disorders. They also analyzed the different databases, feature extraction techniques, and machine learning approaches used in various studies. The voices were generally categorized as normal and pathological in most of the papers; however, some studies included Alzheimer's disease and Parkinson's disease (PD). Finally, the survey reviewed the performance of some of the significant research work conducted in this area [18].
A study by Hemmerling et al. sought to evaluate the usefulness of various speech signal analysis methods in the detection of voice pathologies. First, the initial vector consisted of 28 parameters extracted from the sustained vowels /a/, /i/, and /u/ at high, low, and normal pitch in time, frequency, and cepstral domains. Subsequently, linear feature extraction techniques (principal component analysis) were used to reduce the number of parameters and select the most effective acoustic features describing speech signals. They also performed nonlinear data transformations that were calculated using kernel principal components. The initial and extracted feature vectors were classified using k-means clustering and random forest classifiers. Using random forest classification for female and male recordings, they obtained accuracies of up to 100% for the classification of healthy versus pathological voices [14].
Other authors have also investigated the theoretical aspects of voice disorders, feature extraction techniques, and machine learning (ML) techniques, and they have also reviewed the performance of some of the significant research performed in the field of pathological voice signal processing [20][21][22][23].

Database
Voice samples of sustained vowels /a/, /i/, and /u/ were digitally recorded and published online in the SVD, created by the Institute of Phonetics of the University of Saarland [24]. The SVD consists of voices recorded by more than 2000 people. The speakers' voices were recorded for the vowels /a/, /i/, and /u/ at four pitch categories: high, low, low-high, and normal. The length of the recordings ranges from 1 to 4 s. The audio format is a waveform with 16-bit samples, sampled at 50 kHz. The entire database contains recordings of 71 different and well-defined voice pathologies as well as healthy speakers. This paper used 259 healthy and 259 pathological female recordings of the /a/, /i/, and /u/ vowels at normal pitch (97 suffered from hyperfunctional dysphonia, 51 had functional dysphonia, 30 suffered from laryngitis, and 81 suffered from other pathologies listed in the database [24]) and 259 healthy and 259 pathological male recordings of the same vowels at normal pitch (62 men suffered from laryngitis, 44 had hyperfunctional dysphonia, 25 suffered from a vocal fold polyp, and 128 suffered from other pathologies listed in the database [24]). The details are described in Table 1. Due to the essential differences in voice behavior between men and women, the parameters were statistically analyzed for men and women separately.

Feature Extraction
Classical parameters, such as MFCCs and LPCCs, are used in this study. MFCCs are a standard feature extraction method that exploits knowledge of the human auditory system [1]. The first step of linear predictive (LP) analysis is to estimate a source signal using inverse filtering. After that, the spectrum is computed from the source signal. The computed spectrum is used to study the energy distribution in both normal and pathological voices. The number of LP coefficients is one of the key elements of LP analysis for determining the formant peaks, because removing the effect of the formants from the speech signal can provide an accurate estimation of the source signal [25]. In this work, 20-dimensional MFCCs and LPCCs were extracted from a 40-ms window signal using a 20-ms frameshift.
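To make the 40-ms window and 20-ms frameshift concrete, the following minimal sketch converts them to sample counts and splits a signal into overlapping frames. It assumes the 50 kHz sampling rate stated for the SVD recordings; whether the recordings were resampled before feature extraction is not stated, and the function names are hypothetical.

```python
def frame_params(sample_rate_hz: int, window_ms: float = 40.0, shift_ms: float = 20.0):
    """Convert window and shift durations in milliseconds to sample counts."""
    win = int(round(sample_rate_hz * window_ms / 1000.0))
    hop = int(round(sample_rate_hz * shift_ms / 1000.0))
    return win, hop


def frame_signal(signal, win: int, hop: int):
    """Split a sequence into overlapping frames, dropping any incomplete tail."""
    if len(signal) < win:
        return []
    n_frames = 1 + (len(signal) - win) // hop
    return [signal[i * hop : i * hop + win] for i in range(n_frames)]


win, hop = frame_params(50_000)               # assumed 50 kHz sampling rate
frames = frame_signal([0.0] * 50_000, win, hop)  # a 1-s dummy signal
print(win, hop, len(frames))  # 2000 1000 49
```

Each frame would then yield one 20-dimensional MFCC or LPCC vector, so a 1-s recording gives about 49 feature vectors under these assumptions.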
To identify speech impairments, this work used a novel set of HOS parameters extracted from the time domain. A primary advantage of this method is that periodic or quasi-periodic signals are not required for reliable analysis [1-3,26]. HOS parameters obtained from the time domain provide promising results in this field [1][2][3]. Inspired by the above three studies, this work aims to extract HOS parameters from the time domain to detect and classify speech impairments. Among the various HOSs, the third- and fourth-order cumulants are used as characteristic parameters in this study. These parameters are called normalized skewness, γ3, and normalized kurtosis, γ4, and are defined as shown in (1).
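A standard form of these two statistics, written with the symbols explained in the next sentence, is given below. The 1/N normalization is an assumption: some conventions divide by N − 1 instead, and some subtract 3 from γ4 (excess kurtosis), which would be consistent with the values below zero reported later for the kurtosis.

```latex
\gamma_3 = \frac{1}{N}\sum_{n=1}^{N}\left(\frac{x_n-\mu}{\sigma}\right)^{3},
\qquad
\gamma_4 = \frac{1}{N}\sum_{n=1}^{N}\left(\frac{x_n-\mu}{\sigma}\right)^{4}
```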
where x_n is the n-th sample value, N is the number of samples, and µ and σ represent the mean and the standard deviation, respectively.
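As an illustration of how these two parameters can be computed from a frame of samples, the sketch below uses the standardized third and fourth central moments with 1/N normalization. That normalization and the optional `excess` flag (subtracting 3 from the kurtosis) are assumptions, not details confirmed by the text; the function name is hypothetical.

```python
import math


def normalized_skewness_kurtosis(x, excess: bool = False):
    """Return (gamma3, gamma4) for a sequence of sample values.

    Assumes 1/N normalization. With excess=True, 3 is subtracted from
    gamma4 so that a Gaussian signal gives a value near zero.
    """
    n = len(x)
    if n < 2:
        raise ValueError("need at least two samples")
    mu = sum(x) / n
    var = sum((v - mu) ** 2 for v in x) / n
    if var == 0.0:
        raise ValueError("signal has zero variance")
    sigma = math.sqrt(var)
    g3 = sum(((v - mu) / sigma) ** 3 for v in x) / n
    g4 = sum(((v - mu) / sigma) ** 4 for v in x) / n
    if excess:
        g4 -= 3.0
    return g3, g4


# A symmetric toy signal has zero skewness.
g3, g4 = normalized_skewness_kurtosis([-1.0, 0.0, 1.0])
print(g3, round(g4, 6))  # 0.0 1.5
```

With `excess=True` the same toy signal gives a kurtosis of −1.5, which shows how values below zero can arise under that convention.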

Deep Learning Methods
As shown in Figure 1a,b, information moves in only one direction in an FNN: forward from the input nodes through the hidden nodes to the output nodes. There are no cycles or loops in the network. Figure 1a shows an example of a fully connected feedforward neural network with two hidden layers. "Fully connected" means that each node is connected to all nodes in the next layer. This work addresses the binary classification of normal and pathological speech using an FNN. In machine learning, classification is a supervised learning method that divides data samples into predefined groups using decision-making features [27]. This study uses two feed-forward layers. The first layer is followed by rectified linear unit (ReLU) activation and the last layer by softmax activation, as shown in Figure 1b [28]. The parameter values used are shown in Table 2.
A CNN is similar to a typical neural network, as shown in Figure 2. It consists of neurons with learnable weights and biases. Each neuron receives some inputs, performs a dot product, and optionally follows it with a non-linearity. The entire network represents one differentiable score function, i.e., from the raw image pixels at one end to the class scores at the other end. Finally, there are activation functions, such as softmax, in the fully connected layer [28,29]. This work used two convolutional layers and two feed-forward layers. Dropout with a probability of 0.05 and batch normalization were applied to all layers except the feed-forward layers. Max-pooling and average-pooling operations were performed between the convolutional layers to downsample the intermediate representations over time and to add some time invariance to the process. The details are provided in Table 3.
Figure 3 shows normalized kurtosis in the form of box plots to better visualize the normal and pathological voice signals for men and women in each /a/, /i/, and /u/ vowel.
Overall, the normalized kurtosis estimated from pathological signals tends to be larger and more widely distributed than that estimated from normal speech signals. As shown in Figure 3a,b, the normalized kurtosis extracted from the women's /a/ vowel tended to be distributed below zero, while that from the men's /a/ vowel tended to be close to zero. In the /i/ samples shown in Figure 3c,d, both the women's and the men's samples tended to have values less than zero, but the men's /i/ samples tended to have slightly smaller negative values. For both the women's and the men's /u/ samples, shown in Figure 3e,f, the distributions of the four plots are less than zero, and they are almost identical in shape.
Figure 4 shows the distributions of normalized skewness extracted from normal and pathological voice signals for men and women in each /a/, /i/, and /u/ vowel. The normalized skewness extracted from a pathological voice has a smaller average value than that extracted from a normal voice, as well as a wider range. The normalized skewness estimated from the women's /a/ samples in Figure 4a,b has a positive mean for both pathological and normal voices, while that estimated from the men's /a/ samples has a negative mean for both. In Figure 4c, the normalized skewness of both the normal and the pathological /i/ samples tended to have a positive mean. In Figure 4d, however, the average normalized skewness of the normal voice is positive, while that of the pathological voice is negative. In both Figure 4e,f, the normalized skewness of the normal /u/ voices tends to have a positive mean, while that of the pathological /u/ voices tends to have a negative mean.
Table 4 shows the statistical analyses between normal and pathological voice signals for women and men, which were performed using the Mann-Whitney U test for independent samples.
Information on the means, minimums, maximums, percentiles, and p values is presented. The significance level was set a priori at p < 0.05. In Table 4, an asterisk (*) indicates that the p value is less than 0.05, which means that the parameter differs statistically between normal and pathological voice signals. The Mann-Whitney U test showed a significant difference between normal and pathological voice signals for normalized skewness (p = 0.000) and kurtosis (p = 0.000) for both women and men, except for the normalized kurtosis of the women's /u/ samples (p = 0.051). These parameters are therefore useful and meaningful for classifying pathological voice signals.
The input of the FNN and CNN consists of various parameters, such as MFCCs, MFCCs + HOSs, LPCCs, and LPCCs + HOSs, and the output of the FNN and CNN is coded as 0 for a pathological voice and as 1 for a normal voice. In Table 5, the first and second parameters, consisting of 20-dimensional MFCCs and LPCCs, were extracted from window signals of 40 milliseconds using a frameshift of 20 milliseconds. The same settings have been applied in many previous studies [30]. The following parameter, MFCCs + HOSs, was created by adding the two HOS parameters to the 20-dimensional MFCCs; thus, it had 22 dimensions. The LPCCs + HOSs parameter has the same dimensions as MFCCs + HOSs. Finally, the MFCCs + MFCC deltas + HOSs parameter consists of 20-dimensional MFCCs, 20-dimensional MFCC deltas, and skewness and kurtosis, for a total of 42 dimensions. Therefore, the total dimensions range from 20 to 42.
All voice data were grouped into training (70% of the data) and testing (30%) sets to implement all methods. As shown in Table 1, in the mixed data of men and women, 363 samples were used for the training dataset and 155 were used for the testing dataset.
In addition, for the data samples of men or women alone, 181 were used for the training dataset and 78 were used for the test dataset. Each set was randomly selected from the subset for a fivefold cross-validation scheme [1-4]. The whole process was repeated 10 times, and the results were averaged. The results are shown as the means and standard deviations. In Table 5, each column shows the highest values (bold blue font) obtained within each type of testing parameter, classifier, and sex. The bold red values in the rows show the highest results according to the various parameters among all women and men. The results of the individual classifiers were also compared by evaluating those of each vowel separately. The best performance results among the vowels are highlighted in bold in each column and row in Table 5. The accuracy of a model is usually determined after the model parameters have been learned and fixed, when no further learning takes place. In this paper, the test samples were fed to the model, and the number of mistakes that the model made was recorded after comparison with the actual targets. Then, the percentage of misclassification was calculated. For example, if the number of test samples was 155 and the model classified 130 of those correctly, then the model's accuracy was 83.9%. In this paper, the accuracy will be shown through a confusion matrix.
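The split sizes and the worked accuracy example above can be reproduced with a short sketch. Rounding the 70% share to the nearest integer is an assumption that happens to match the reported counts; the function names are hypothetical.

```python
def split_sizes(n_samples: int, train_fraction: float = 0.7):
    """Return (n_train, n_test), rounding the training share to the nearest integer."""
    n_train = round(n_samples * train_fraction)
    return n_train, n_samples - n_train


def accuracy_percent(n_correct: int, n_total: int) -> float:
    """Percentage of test samples classified correctly."""
    return 100.0 * n_correct / n_total


print(split_sizes(518))                      # (363, 155)
print(split_sizes(259))                      # (181, 78)
print(round(accuracy_percent(130, 155), 1))  # 83.9
```

The misclassification percentage mentioned in the text is simply 100 minus this accuracy.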

Experimental Results and Discussion
For the FNN classifier in Table 5, a combination of MFCCs and MFCC deltas showed the best performance, 80.13%, for classifying pathological and normal voices in the men's vowel /a/. For the same vowel, an average accuracy of 76.92% was obtained with the MFCCs + HOSs parameter in the mixed data samples of women and men, and an average accuracy of 76.28% was obtained for the women's data samples with the MFCCs + MFCC deltas parameter. The vowel /i/ achieved the greatest accuracy, 80.77%, in the classification between pathological and normal voices using a combination of MFCCs and HOSs for the women's data samples. Additionally, in the mixed and the men's data samples, accuracies of 75.64% and 75.00% were obtained from the MFCCs + MFCC deltas + HOSs and MFCCs + HOSs parameters, respectively. In the vowel /u/, the highest accuracy, 80.77%, was achieved with the LPCCs + HOSs parameter for the men's data samples.
For the CNN classifier, the LPCCs parameter achieved 82.69%, the best performance in classifying pathological and normal voices, for the men's /u/ vowel. The vowel /a/ achieved good accuracy, 76.60%, in differentiating between pathological and normal voices with a combination of MFCCs and HOSs for the mixed data samples. Moreover, 76.28% and 75.64% were obtained from the MFCCs + MFCC deltas + HOSs and MFCCs parameters in the women's and the men's data samples, respectively. When using the vowel /i/, the same result, 76.92%, was found for women and men with MFCCs and LPCCs + HOSs, respectively. In the mixed data samples of women and men, an average accuracy of 75.00% was obtained from the MFCCs + MFCC deltas + HOSs parameter. In the vowel /u/, the highest accuracy, 82.69%, was achieved with the LPCCs parameter for the data samples from men. In addition, in the mixed and the women's data samples, accuracies of 70.83% and 76.92% were achieved by the LPCCs + HOSs and MFCCs + MFCC deltas + HOSs parameters, respectively.
For voiced /u/ vowels in men, the utilization of the CNN classifiers and LPCC parameters obtained the highest accuracy, 82.69%, compared to the other vowels, classifiers, and parameters. For women, the accuracy of each classifier is higher in vowel /i/ than in other vowels. Another finding is that the fusion of the HOS parameters in a particular vowel has a positive effect on the overall accuracy of the classification. The best results for mixed samples were achieved by the CNN classifier, although most of the results were very similar to those of the FNN classifier.
To further analyze the behavior of the model, the author investigated the confusion matrix and the relationship between the loss and the learning rate. Figure 5 shows the relationship between the loss and the learning rate for the configuration with the highest performance, 82.69%. In this experiment, the number of epochs was 100 and the learning rate was 0.001. As the epoch runs from 1 to 100, the learning rate tends to decrease from 0.001 to 0.0081, and the loss value tends to decrease from 0.66 to 0.58. The classification results represented by the confusion matrix are described in Figure 6. The confusion matrix of the testing set shows that a classification accuracy of 82.69% can be achieved using the proposed combination of CNN and LPCCs.
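For readers less familiar with confusion matrices, the following minimal sketch builds a 2 × 2 matrix and derives the accuracy from it. It uses the label coding given earlier (0 for a pathological voice, 1 for a normal voice); the toy labels are invented for illustration and are not the data behind Figure 6.

```python
def confusion_matrix(y_true, y_pred):
    """2 x 2 matrix: rows are true classes, columns are predicted classes.

    Class 0 is pathological and class 1 is normal.
    """
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def accuracy_from_matrix(matrix) -> float:
    """Accuracy in percent: diagonal entries divided by all entries."""
    total = sum(sum(row) for row in matrix)
    return 100.0 * (matrix[0][0] + matrix[1][1]) / total


# Toy labels, invented for illustration only.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]
m = confusion_matrix(y_true, y_pred)
print(m)                        # [[3, 1], [1, 3]]
print(accuracy_from_matrix(m))  # 75.0
```

The off-diagonal entries separate the two kinds of error: pathological voices predicted as normal, and normal voices predicted as pathological.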

Conclusions
This work proposed and implemented a system for pathological voice detection using deep learning methods. The training and testing data were recordings of the sustained vowels /a/, /i/, and /u/. The pathological recordings were from 259 female and 259 male subjects, and the control group was formed by samples from 259 female and 259 male healthy subjects, all recorded at normal pitch. To obtain voice information from these recordings, various methods for extracting speech features were implemented, including MFCCs, LPCCs, and HOSs. To design an optimal classification model, two types of classifiers based on deep learning methods were studied: FNN and CNN.
The distributions of normalized skewness and kurtosis extracted from normal and pathological voice signals in men and women are described in each /a/, /i/, and /u/ vowel. The Mann-Whitney U test showed a significant difference between the normal and pathological voice signals for normalized skewness (p = 0.000) and kurtosis (p = 0.000), excluding the normalized kurtosis (p = 0.051) extracted from the female /u/ sample.
The highest accuracy, 82.69%, was achieved by the CNN classifier with the LPCCs parameter for the /u/ vowel in men. Several other combinations can be used for classification. The second-best performance, 80.77%, was obtained in the differentiation between pathological and normal voices with a combination of the FNN classifier, MFCCs, and HOSs; this was attained for the /i/ vowel samples in women. The combination of the FNN classifier, MFCCs, and MFCC deltas showed the third-best performance, 80.13%, for the /a/ vowel in men. The fourth-best accuracy, 79.42%, was also obtained with the FNN classifier and the MFCCs parameter for the /i/ vowel samples in women. In addition, the fifth-best performance was 77.56% for the men's /u/ vowel samples using the CNN classifier and the MFCCs + MFCC deltas parameter. Experiments on female-only or male-only samples were more effective than experiments on a mixture of these samples. There was also merit in combining the acoustic measures with the HOS parameters for better characterization, as both are useful when it comes to imparting important voice information. As the most important finding of this study, the combination of various parameters and deep learning methods was useful for distinguishing normal from pathological voices.
In future work, pathological voice detection systems could be developed to classify the stage of a specific disease and voice quality, and a monitoring function for voice disorders could also be added. Future research will need to recognize the differences between two or more diseases and voice quality. Additionally, the author will continue to study the parameters that reflect important information about pathological voice signals to realize high classification performance using various deep learning methods and artificial intelligence (AI) techniques developed in various areas [31][32][33][34][35][36][37]. Finally, gender analysis in the field of pathological voice signal processing needs to become more widespread.
The sponsor had no involvement in the study design; in the collection, analysis, and interpretation of data; in the writing of the report; or in the decision to submit the article for publication.

Conflicts of Interest:
The author declares no conflict of interest.