Employing Energy and Statistical Features for Automatic Diagnosis of Voice Disorders

The presence of laryngeal disease affects vocal fold(s) dynamics and thus causes changes in pitch, loudness, and other characteristics of the human voice. Many frameworks based on the acoustic analysis of speech signals have been created in recent years; however, they are evaluated on just one or two corpora and are not independent of voice illness and human bias. In this article, a unified wavelet-based paradigm for evaluating voice diseases is presented. This approach is independent of voice disease, human bias, and dialect. A voice disorder impacts the vocal folds' dynamics and thereby modifies the voice source. Therefore, inverse filtering is used to capture the modified voice source. Furthermore, fundamental-frequency-independent statistical and energy metrics are derived from each spectral sub-band to characterize the retrieved voice source. Speech recordings of the sustained vowel /a/ were collected from four different datasets in German, Spanish, English, and Arabic to run several intra- and inter-dataset experiments. The classifiers' achieved performance indicators show that energy and statistical features uncover vital information on a variety of clinical voices, and therefore the suggested approach can be used as a complementary means for the automatic medical assessment of voice diseases.


Introduction
A key component of human communication is the speech signal. However, various laryngeal illnesses harm the parts of the voice box responsible for speech, producing an abnormal voice that is unfit for everyday use [1]. As a result, it is essential to identify the source of a voice anomaly as early as possible. Traditional methods for examining voice dysfunction in medical practice are subjective and invasive. The grade, roughness, breathiness, asthenia, and strain (GRBAS) scale and the Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V) scale are used in traditional perceptual analysis [2]. Such perceptual examination is subjective, imprecise, and necessitates expert professionals.
(AUC = 0.74). Studies suggest that perturbation and complexity metrics may categorize speech signals owing to their complementarity. A further investigation, including D2, jitter, shimmer, and second-order entropies, is reported by Zhang and Jiang [25]. For the study of continuous speech production, the vowel /a/ and the vowels derived from the reading passage of the MEEI corpus are used. The findings revealed the ability of complexity attributes to distinguish between disordered and normal speech cases. On the other hand, perturbation metrics worked well with sustained vowels but should be applied with caution to vowels derived from continuous speech. Little et al. [26] presented a couple of new complexity attributes for the diagnosis of voice impairments: recurrence period density entropy (RPDE) and detrended fluctuation analysis (DFA). A recognition rate of 92% was attained with the proposed new attributes by employing quadratic discriminant analysis for classification and bootstrap resampling for validation on the MEEI dataset. This recognition rate is greater than that attained by integrating other conventional parameters, which do not surpass 81%. Two new entropy attributes based on hidden Markov models introduced in [27] are Renyi HMM entropy and Shannon HMM entropy. These new attributes have been used to characterize the MEEI dataset, together with nine other complexity attributes, three perturbation parameters, and MFCC. With all parameters combined, an accuracy of 98% is attained using a GMM.
Although spectral and cepstral attributes have been extensively employed in the assessment of voice impairment, it is important to note that they are more likely to portray the vocal tract system than the variations in the vibratory pattern of the vocal folds directly caused by voice illnesses. As a result, glottal source attributes derived from the glottal wave have high research value. Forero et al. [28] captured the vocal tract excitation by glottal inverse filtering (GIF). Sixteen time-instant-based, quotient-based, and frequency-domain glottal metrics are extracted from the glottal wave, together with 12 MFCC extracted from the speech signal, to carry out the voice assessment. Three classes of subjects, namely, vocal fold paralysis, nodule, and healthy, are considered for the experiments. The classification accuracy attained with glottal parameters is 95.80%, 82.00%, and 96.20%, whereas with the MFCC parameters these values are 75.20%, 87.00%, and 80.00%, using ANN, HMM, and SVM, respectively. With the fusion of glottal and MFCC parameters, the achieved classification rates are 96.60%, 92.00%, and 97.20% using ANN, HMM, and SVM, respectively. Similarly, in [29], the authors proposed a voice disorder identification and classification system using an interlaced derivative pattern (IDP). The authors compared the IDP-based system with MDVP and MFCC parameters. The suggested scheme attains average accuracies of 99.20%, 93.20%, and 91.50% in disorder identification tests using the SVM classifier with recordings from MEEI, SVD, and AVPD, respectively. In [30], different time-based, frequency-based, and Liljencrants-Fant (LF) model-based attributes were mined from the glottal pulse-form using the well-known Aalto Aparat voice inverse filtering and parameterization tool. Normal-pitch utterances of the sustained vowel /a/ mined from German, English, Arabic, and Spanish voice records are used.
The highest rates of voice disorder identification achieved are 99.80%, 99.70%, 99.80%, and 99.80% over SVD, PdA, AVPD, and MEEI, respectively. The best voice illness categorization rates attained are 90.80%, 99.30%, 99.80%, and 90.10% for SVD, PdA, AVPD, and MEEI, respectively. The LF model-based attributes have demonstrated good discrimination power compared to other attributes for the assessment of voice dysfunction. This paper proposes an acoustic-assessment-based technique for evaluating voice illnesses induced by malfunctioning vocal fold(s). A multi-resolution transform, such as the stationary wavelet transform (SWT), can detect the voice changes produced by a disorder, since they have significant amplitude fluctuation at an extremely low scale. Sub-band distribution can identify unusual spectrum distribution, speech non-periodicity, and unexpected energy variations in the speech signal [31]. This motivated the use of the SWT in our research.

Voice Disorder Databases
This article employs voice disorder records in four distinct languages. Table 1 shows the number of voice signals in each database, and Table 2 shows the database characteristics. The speech records with the sustained vowel /a/ are mined from these databases. Since the sampling frequency of each database varies, all speech records were resampled to 22.05 kHz to create a consistent (unique) sampling frequency across all datasets. In addition to the speech recordings indicated in rows 1 to 4, the following pathological speech samples from the PdA [27] database were used for the experiments in Section 3.5: acquired iatrogenic trauma on the vocal cords = 2, upper motor neuron injury = 14, extrapyramidal alterations = 1, and lack of closure = 6.

Proposed Speech Pathology Assessment Technique
This paper details a SWT-based framework for assessing voice illnesses. The block diagram in Figure 1 depicts the distinct steps employed in the voice assessment system. The methodology comprises two main phases: (a) learning and (b) evaluation. During the learning stage, a 10-fold cross-validated classification model has been developed using healthful and pathological speech training recordings. In the evaluation phase, unseen speech records are tested on a cross-validated trained model. The normalization of the speech signal amplitude, framing, and windowing are the three main steps in pre-processing. The true voice source is then acquired from the pre-processed voice signal using the GIF method, which isolates the glottal pulse from the spoken signal [35]. In this study, an iterative adaptive inverse filtering (IAIF) technique is applied to obtain the glottal flow and is endorsed due to its robustness in noisy surroundings [36]. In IAIF, the vocal tract response is characterized by the discrete all-pole (DAP) model as it performs better compared to linear predictive coding (LPC) [37]. In an IAIF algorithm, the model of the acoustic tube is estimated in two phases and, hence, is named iterative inverse filtering. In the first step of IAIF, the input speech signal is pre-emphasized. The final refined glottal source is obtained by canceling the influence of lip radiation from the previous stage [38]. Figure 2 gives the illustration of the speech signal and voice source signal acquired using the IAIF. SWT is used to decompose the extracted glottal flow. To represent the voice source in a numerical form, energy and statistical metrics are obtained from each sub-band's coefficients.
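To make the pre-processing step concrete, the following sketch illustrates amplitude normalization, framing, and windowing in NumPy. The 40 ms frame length, 20 ms hop, and Hamming window are illustrative assumptions, not values specified in this paper:

```python
import numpy as np

def preprocess(signal, fs=22050, frame_ms=40, hop_ms=20):
    """Amplitude-normalize, frame, and window a speech signal."""
    # Peak amplitude normalization to the range [-1, 1]
    signal = signal / np.max(np.abs(signal))
    frame_len = int(fs * frame_ms / 1000)   # samples per frame
    hop_len = int(fs * hop_ms / 1000)       # samples between frame starts
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    window = np.hamming(frame_len)          # taper each frame's edges
    frames = np.stack([
        signal[i * hop_len : i * hop_len + frame_len] * window
        for i in range(n_frames)
    ])
    return frames

# One second of a 220 Hz tone at the paper's 22.05 kHz sampling rate
sig = np.sin(2 * np.pi * 220 * np.arange(22050) / 22050)
frames = preprocess(sig)
print(frames.shape)   # (n_frames, frame_len)
```

Each row of the returned array is one windowed frame, ready for the inverse-filtering stage.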


The information gain-centered feature selection method is used to pick the optimal subset of features and omit redundant and irrelevant features. The following section briefly describes the various statistical attributes and the SWT. Before feeding the metrics into the classifier, min/max normalization is used to scale all descriptors. This procedure limits each feature to the range [0, 1]. After normalizing the training data, the min and max values of each attribute from the training data are applied to standardize the test data.
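A minimal sketch of this train-then-apply min/max scaling, assuming NumPy arrays with samples in rows and features in columns (the values are invented for illustration):

```python
import numpy as np

def minmax_fit(train):
    """Learn per-feature min and max from the training data only."""
    return train.min(axis=0), train.max(axis=0)

def minmax_apply(data, lo, hi):
    """Scale features to [0, 1] using the training min/max."""
    return (data - lo) / (hi - lo)

train = np.array([[2.0, 10.0], [4.0, 30.0], [6.0, 20.0]])
test  = np.array([[3.0, 25.0]])
lo, hi = minmax_fit(train)
print(minmax_apply(train, lo, hi))   # training data maps exactly into [0, 1]
print(minmax_apply(test, lo, hi))    # test data scaled with the *training* statistics
```

Fitting the scaler on the training split only, as the paper describes, avoids leaking test-set statistics into the model.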
For voice evaluation, two supervised classification algorithms are used: support vector machine (SVM) and stochastic gradient descent (SGD). The typical classifier quality metrics, such as the area under the curve (AUC), classification accuracy (CA), precision (PPV), recall, and F1 score, are calculated. The suggested approach has been validated against four different databases.

Support vector machine (SVM)
An SVM is a linear two-class classifier, and its non-probabilistic facet is a key feature. The decision boundary (hyperplane) along which an SVM partitions samples is determined by just a small subset of the data. The subset of data that supports the decision boundary is aptly termed the support vectors. The remaining training samples have no impact on the location of the hyperplane in the feature space. In probabilistic classifiers, however, the classification model is built by considering all of the data, thus requiring more computational resources [39].
Furthermore, the binary and linear facets are two constraints of the SVM. The latest developments using "Kernel Trick" handled the linearity constraint of the decision boundary. Additionally, the lack of ability to categorize the data in more than two categories seems to be a topic of current research. The techniques to date support the formation of various SVMs that compare input data to one another in a variety of means, such as one-to-all or all-to-all (the latter is also termed one-to-one) [39,40].
We are provided with training data of n samples of the form (s_i, t_i), where t_i is the class label and is either 1 or −1. The label specifies the class to which each sample s_i belongs, and every s_i is a real vector of dimension x. During learning, the SVM algorithm determines a decision boundary inside the input space that splits all data points into the two classes. The optimization task is to determine the linear hyperplane that has the maximal margin between the two categories; the SVM then employs this hyperplane to evaluate the category of new, unseen test samples [40]. The hyperplanes are shown in Figure 3. Thus, the hyperplane separates the set of points s_i with label t_i = 1 from those with t_i = −1. Any hyperplane may be given as the set of points s satisfying

v · s − k = 0,

where v is the vector perpendicular (normal) to the hyperplane and need not be a unit vector. The quantity k/‖v‖ gives the offset of the hyperplane from the origin along the normal vector v.
If the input data used to train the SVM are linearly separable, we can select two parallel hyperplanes that split the two data classes and ensure that the separation between them is as large as possible. The region bounded by these two hyperplanes is termed the "margin", and the max-margin hyperplane is the hyperplane that lies midway between them. With standardized input data, these hyperplanes can be characterized by the equations v · s − k = 1 and v · s − k = −1: any data point on or above the first plane belongs to the class with label 1, and any data point on or below the second plane belongs to the class with label −1. The spacing between these two planes is 2/‖v‖, so to maximize the margin we need to minimize ‖v‖. We also need to prevent data samples from falling inside the margin, which is achieved by adding a constraint: for every i, t_i (v · s_i − k) ≥ 1. These restrictions assert that every data point falls on the correct side of the margin.
The optimization problem can thus be put together as: "Minimize ‖v‖ subject to t_i (v · s_i − k) ≥ 1, for i = 1, …, n". The v and k that solve this optimization problem determine the classifier s ↦ sgn(v · s − k). An important consequence of this geometric description is that the max-margin hyperplane is completely determined by those s_i that lie nearest to it. These s_i are called the support vectors [40].
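As an illustration of this geometry, the sketch below evaluates the linear decision function and the margin width for a hypothetical, already-trained pair (v, k); the parameter values are invented:

```python
import numpy as np

# Hypothetical trained parameters of a linear SVM: normal vector v and offset k
v = np.array([2.0, 1.0])
k = 1.0

def classify(s):
    """Assign label +1 or -1 by the side of the hyperplane v·s - k = 0."""
    return 1 if v @ s - k >= 0 else -1

margin = 2.0 / np.linalg.norm(v)  # distance between the planes v·s - k = +1 and -1
print(classify(np.array([1.0, 1.0])))   # 1
print(classify(np.array([0.0, 0.0])))   # -1
print(round(margin, 3))                  # 0.894
```

Note that only v and k are needed at test time; the training samples that are not support vectors play no role.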

Stochastic gradient descent (SGD)
The SGD is simply an optimization method, but it does not necessarily apply to a particular family of ML models. It is just a means to train a classifier model and is employed in ML for training a wide range of models. However, in this work, it is employed to train the linear SVM, where it plays the role of optimizer [41].
The term "batch" refers to the number of samples employed to compute the gradient in each iteration of gradient descent (GD). In classic GD optimization, such as batch GD, the batch is taken to be the entire dataset. Although using the entire dataset to reach the minimum in a less noisy, less random way is very useful, the trouble starts when the dataset is very big. If one is performing a classification task on two million data samples, then all two million samples must be used for each iteration of the GD optimization technique, and this repeats for every iteration until the minimum is attained. This optimization technique is therefore computationally expensive [41]. This issue is resolved by SGD. In SGD, only one sample, i.e., a batch size of one, is used to execute each iteration. The training samples are randomly shuffled and drawn for the iterations. The path followed by this algorithm toward the minimum is generally noisier than in classical GD, since in each iteration a single sample is drawn randomly. However, this noisy path does not affect the performance as long as the algorithm reaches the minimum in considerably less training time. The paths followed in GD and SGD are shown in Figure 4. Further, SGD normally takes relatively many iterations to reach the minimum, due to the random descent. Although it takes a higher number of iterations than traditional GD, it is still less of a computational burden. Therefore, in most cases, SGD is preferred over traditional GD to optimize an ML algorithm [41]. Let us look at the problem of minimizing an objective function of the form:

P(v) = (1/n) Σ_{m=1}^{n} P_m(v),

where v is the parameter that minimizes P(v). Every summand function P_m corresponds to the mth observation of the dataset considered for training the model.
If normal GD is employed to minimize the above function, it executes iterations of the form [42]:

v := v − η∇P(v) = v − (η/n) Σ_{m=1}^{n} ∇P_m(v),

where the parameter η is the learning rate.
In SGD, the true gradient of P(v) is approximated by the gradient at a single observation:

v := v − η∇P_m(v).
When the algorithm runs through the training data, it performs the update above for every training observation. Multiple passes over the whole training dataset can be made until the algorithm converges, and the training data (samples) can be shuffled before each pass to avoid cycles. Classic frameworks can utilize an adaptive learning rate so that the algorithm converges. The SGD may be presented in pseudocode as follows [43]:
• Select an initial vector of parameters v and step size (learning rate) η.
• Repeat until an approximate minimum is obtained:
  • Shuffle the observations in the training dataset randomly.
  • For m = 1, 2, 3, . . . , n, do: v := v − η∇P_m(v).
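The pseudocode above can be sketched as an SGD trainer for a linear SVM with a hinge loss and an L2 (ridge) penalty, matching the classifier configuration described later; the learning rate, regularization strength, epoch count, and toy data are assumptions for illustration:

```python
import numpy as np

def sgd_linear_svm(S, t, eta=0.1, lam=0.01, epochs=100, seed=0):
    """Train a linear SVM by SGD on the L2-regularized hinge loss.
    Each update uses a single randomly drawn sample, as in the pseudocode."""
    rng = np.random.default_rng(seed)
    n, d = S.shape
    v, k = np.zeros(d), 0.0
    for _ in range(epochs):
        for m in rng.permutation(n):          # shuffle each pass to avoid cycles
            if t[m] * (S[m] @ v - k) < 1:     # sample violates the margin
                v += eta * (t[m] * S[m] - lam * v)
                k -= eta * t[m]
            else:                              # only the regularizer contributes
                v -= eta * lam * v
    return v, k

# Toy linearly separable data
S = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
t = np.array([1, 1, -1, -1])
v, k = sgd_linear_svm(S, t)
print(np.sign(S @ v - k))   # should recover the training labels
```

Each inner-loop step is exactly the update v := v − η∇P_m(v), with P_m the hinge loss of the mth sample plus its share of the ridge penalty.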

Stationary Wavelet Transform
In the study of transient phenomena, the wavelet transform is widely used to obtain temporal and frequency information from a signal [44,45]. A four-level decomposition of the signal employing the SWT is shown in Figure 5. In the first stage, the SWT generates two sets of coefficients for a signal s[x] of length L: approximation coefficients A1 and detail coefficients D1. Convolving s with the low-pass filter (LF_D) for the approximation component and with the high-pass filter (HF_D) for the detail component yields these coefficients. Using a similar strategy, the approximation coefficients A1 are separated into two parts at the next level. In this stage, however, the filters are up-sampled, and s[x] is substituted by A1, resulting in A2 and D2. In this research, the glottal wave is decomposed up to level four. The Haar wavelet is used, as the energy and statistical metrics mined from the sub-bands have better discrimination ability than those obtained with higher-order Daubechies wavelets.
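A minimal sketch of one undecimated (à trous) SWT stage with the Haar filters, using circular convolution via the FFT; the periodic boundary handling and filter normalization are implementation choices, not prescribed by the paper:

```python
import numpy as np

def haar_swt_level(s, level=1):
    """One SWT stage: convolve with the (up-sampled) Haar analysis filters
    without down-sampling, so A and D keep the input length."""
    h = np.array([1.0, 1.0]) / np.sqrt(2)    # low-pass filter (LF_D)
    g = np.array([1.0, -1.0]) / np.sqrt(2)   # high-pass filter (HF_D)
    # At level j the filters are up-sampled by inserting 2^(j-1) - 1 zeros
    z = 2 ** (level - 1) - 1
    if z:
        h = np.insert(h, 1, np.zeros(z))
        g = np.insert(g, 1, np.zeros(z))
    # Circular (periodic) convolution keeps the coefficient count equal to len(s)
    A = np.real(np.fft.ifft(np.fft.fft(s) * np.fft.fft(h, len(s))))
    D = np.real(np.fft.ifft(np.fft.fft(s) * np.fft.fft(g, len(s))))
    return A, D

s = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0])
A1, D1 = haar_swt_level(s, level=1)
print(len(A1) == len(s))   # True: undecimated, same length as the input
```

To reach the paper's level-four decomposition, the stage is reapplied to A1 with `level=2`, and so on, so every sub-band has as many coefficients as the input signal.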

SWT Sub-Band Features
As stated earlier, the presented scheme employs wavelet-based energy and statistical attribute extraction. Following SWT decomposition, every sub-band yields a total of seven features. In the concise descriptions below, the sub-band is denoted by W_SBi, the number of coefficients by L, and the sub-band mean by µ:

1. Energy: the energy of each sub-band is calculated as E_i = Σ_{j=1}^{L} |W_SBi(j)|².

2. Spectral entropy: the sub-band entropy is calculated as H_i = −Σ_{j=1}^{L} p_j log(p_j), where p_j = |W_SBi(j)|² / E_i is the normalized energy of the jth coefficient.

3. Mean (µ): the mean of the coefficients of the sub-band is calculated as µ = (1/L) Σ_{j=1}^{L} W_SBi(j).

4. Variance: the variance of the coefficients of the sub-band is calculated as σ² = (1/L) Σ_{j=1}^{L} (W_SBi(j) − µ)².

5. Standard deviation: this is calculated as σ = √(σ²), where µ denotes the mean of the sub-band.

6. Kurtosis: this is evaluated as K = E[(W_SBi − µ)⁴] / σ⁴, where E(·) denotes the expected value.

7. Skewness: skewness computes the asymmetry of the data in relation to the sample mean and is estimated as S = E[(W_SBi − µ)³] / σ³.
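The seven attributes above can be sketched as follows for one sub-band's coefficient vector; the base of the entropy logarithm and the use of population (1/L) moments are assumptions, since the paper does not specify them:

```python
import numpy as np

def subband_features(w):
    """Seven energy/statistical features of one sub-band's coefficients."""
    energy = np.sum(w ** 2)
    p = w ** 2 / energy                        # normalized coefficient energies
    entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))   # spectral entropy (bits)
    mu = np.mean(w)
    var = np.var(w)                            # population variance (1/L)
    std = np.sqrt(var)
    kurt = np.mean((w - mu) ** 4) / var ** 2   # kurtosis (non-excess)
    skew = np.mean((w - mu) ** 3) / std ** 3   # skewness
    return np.array([energy, entropy, mu, var, std, kurt, skew])

w = np.array([1.0, -2.0, 3.0, -1.0, 0.5])     # toy sub-band coefficients
f = subband_features(w)
print(f.shape)   # (7,)
```

Applying this to the approximation and detail coefficients of all four levels yields the paper's (7 + 7) × 4 = 56-dimensional feature vector.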

Evaluation of Discernment Potential of Attributes
It is extremely important to test the discernment ability of the descriptors before they are passed to a classifier for a decision. This aids in selecting the best features and removing those that are unwarranted. Both subjective and objective methods are used in this study. Probability density function (PDF) plots were used to subjectively test the discernment ability of all of the descriptors. The spacing between the peaks in the PDF plots was visually inspected for all four datasets: the greater the distance between the peaks of the healthy and unhealthy classes, the greater the discrimination power of the descriptor. The spacing is largest for the Mean-D3 descriptor in the SVD dataset (followed by Mean-D4 and Mean-D1), for Mean-D2 in the PdA dataset (followed by Mean-D3 and Mean-D1), for Mean-D3 in the AVPD dataset (followed by Mean-D4 and Mean-D2), and for Pentropy-A4 in the MEEI dataset (followed by Mean-D4 and Mean-D2). Figure 6 shows the density plots for these top three descriptors. The density plot of the Mean-D3 feature for the SVD dataset is shown in Figure 6A, where Mean-D3 signifies the mean of the detail coefficients of the sub-band at the third decomposition level. As one more example, the PDF plot of the Mean-D2 feature for the PdA dataset is shown in Figure 6D, where Mean-D2 signifies the mean of the detail coefficients of the sub-band at the second decomposition level.
The Mean-D1 and Mean-D4 descriptors represent the mean of the detail coefficients of the sub-band at the first and fourth decomposition levels, respectively. A similar procedure was used to assess the discernment ability of the descriptors for the classification of speech disorders. Figure 7 shows the density plots for the top three features in each dataset for the classification of speech disorders. For the PdA and AVPD datasets, the density plots of four classes are plotted, three of which are pathological and the fourth healthy. For the SVD and MEEI databases, however, the graphs of three groups are shown, two of which are diseased and the third healthy. The number of cyst recordings in the SVD and MEEI databases is small and of short duration, so only three categories were investigated. Each kind of disorder yields a distinct peak, well separated from the others, implying high discrimination capacity.
The Information Gain (IG) feature-scoring method is used to objectively assess the discriminating ability of all 56 attributes. The attributes are then placed in descending order of discernment power. Table 3a,b shows the top 10 attributes with their IG values for each dataset for voice dysfunction identification and classification, respectively. The size of the feature vector depends on the decomposition level: its dimension grows in proportion to the decomposition level. Hence, to reduce the dimension of the feature vector, a feature-selection method is harnessed. It aids in the selection of the best features while eliminating redundant ones. By deleting unnecessary characteristics, feature-selection approaches pick a small group of characteristics that boosts the classification rate.
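A minimal sketch of information gain for one continuous feature, discretized here by a single threshold (one common convention; the paper does not state its discretization scheme):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (bits) of a discrete label vector."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels, threshold):
    """IG of a feature split at a threshold: class entropy minus the
    split-weighted conditional entropy of the two sides."""
    left = labels[feature <= threshold]
    right = labels[feature > threshold]
    n = len(labels)
    cond = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - cond

x = np.array([0.1, 0.2, 0.3, 0.8, 0.9, 1.0])   # a perfectly separating feature
y = np.array([0, 0, 0, 1, 1, 1])               # healthy = 0, pathological = 1
print(round(information_gain(x, y, threshold=0.5), 3))   # 1.0: one full bit gained
```

Ranking the 56 descriptors by this score and keeping the top few is exactly the dimension-reduction step described above.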

Results
Two supervised classifiers, SVM and SGD, are used to evaluate vocal abnormalities. The radial basis function (RBF) kernel is utilized in the SVM classifier, and the parameter C is set to 1. For the SGD, also known as linear SVM, the hinge loss was chosen as the classification loss function, squared loss as the regression loss function, and ridge (L2) as the regularization approach. A stratified k-fold cross-validation resampling procedure is preferable to general cross-validation in terms of bias and variance, so it is used to develop the machine learning models for the assessment of voice abnormalities. Many inter- and intra-dataset (cross-dataset) tests were undertaken on the four separate datasets to determine the utility of the extracted descriptors for voice disease identification and categorization. The acquired glottal wave is decomposed up to level 4, and seven descriptors are mined from the approximation as well as the detail coefficients of each sub-band ((7 approximation + 7 detail descriptors) × 4 levels = 56).
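The stratified k-fold idea can be sketched as a round-robin assignment of each class's samples to folds, so that every fold preserves the overall class proportions (shuffling is omitted for brevity):

```python
from collections import defaultdict

def stratified_kfold(labels, k=10):
    """Assign each sample index to one of k folds so that every fold
    preserves the overall class proportions (round-robin per class)."""
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds

# Toy label set: 6 healthy and 4 pathological recordings, split into 2 folds
labels = ["healthy"] * 6 + ["pathological"] * 4
folds = stratified_kfold(labels, k=2)
print([sorted(f) for f in folds])   # each fold keeps the 60/40 class ratio
```

Each fold then serves once as the held-out test set while the remaining folds train the model, as in the 10-fold procedure described above.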

Intra-Dataset Voice Pathology Detection Experiments
Two experiments are carried out to validate the efficacy of the extracted wavelet-based features for voice dysfunction recognition: one with the top three ranked descriptors and one with all 56 descriptors. Table 4 displays the calculated performance indicators of the classifiers. The observation that the detection rates for the top three and all 56 features are nearly equal leads to the conclusion that only the top three descriptors are vital to detect voice pathology. The suggested system's computational complexity decreases because of the fewer descriptors and smaller feature vectors. In addition, these results reveal that the features are not noisy, as increasing the number of descriptors has no negative influence on the detection accuracy. The values of the classifier performance metrics look identical in many tables because of the very small number of false positives (FP) and false negatives (FN). The scatter plot for the top three descriptors is shown in Figure 8 and clearly shows their discrimination power, as the clusters of healthy and pathological data points (samples) do not overlap. The outliers are not removed; hence, some data points mix into the other class's data points.
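The reported indicators can be reproduced from confusion-matrix counts as follows; the counts used here are invented for illustration:

```python
def classifier_metrics(tp, fp, fn, tn):
    """CA, precision (PPV), recall, and F1 score from confusion-matrix counts."""
    ca = (tp + tn) / (tp + fp + fn + tn)       # classification accuracy
    precision = tp / (tp + fp)                 # PPV
    recall = tp / (tp + fn)                    # sensitivity
    f1 = 2 * precision * recall / (precision + recall)
    return ca, precision, recall, f1

# Hypothetical counts: 95 true positives, 2 FP, 3 FN, 100 true negatives
ca, ppv, rec, f1 = classifier_metrics(95, 2, 3, 100)
print(round(ca, 3), round(ppv, 3), round(rec, 3), round(f1, 3))
```

With few FP and FN, all four metrics converge toward 1, which is why the tabulated values look nearly identical.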

Voice Disorder Categorization
Using the information gain (IG) feature ranking method, 12 experiments with distinct sets of selected descriptors are conducted in each database for multiclass classification. These experiments are carried out to see whether the derived sub-band descriptors can distinguish between healthy people and those with a clinical condition of a vocal cord(s) cyst, paralysis, or polyp. The outcomes of these tests are presented in Table 5, which unmistakably demonstrates that the categorization accuracy (CA) rises as the number of descriptors increases. This assures that the features are not noisy and do not impact the categorization accuracy negatively.
The classification rate with all 56 features and with the top five features is nearly identical across all datasets. As a result, less than 9% of the entire set of descriptors is sufficient to achieve an almost equal recognition rate. The suggested system's computational complexity decreases because of the fewer descriptors and smaller feature vectors. The highest accuracy obtained with the AVPD database using the top five descriptors was 99.87%. In the case of the SVD, AVPD, and MEEI databases, the top three descriptors performed extraordinarily well. The classification accuracy attained with these top three features is nearly identical to that achieved with the top five and with all 56 features. Figure 3 shows the PDF plots of the top three descriptors, which do not overlap and endorse their high discrimination capacity.
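The IG ranking described above can be illustrated with a minimal pure-Python sketch on discretized features; the toy data below is invented for illustration and is not from the paper:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """IG(Y; X) = H(Y) - H(Y | X) for a discrete feature X."""
    n = len(labels)
    groups = {}
    for x, y in zip(feature, labels):
        groups.setdefault(x, []).append(y)
    cond = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - cond

# Toy data: feature A predicts the class perfectly, feature B is independent.
labels = [0, 0, 1, 1]
feat_a = [0, 0, 1, 1]   # matches the labels exactly
feat_b = [0, 1, 0, 1]   # carries no class information
ranking = sorted([("A", information_gain(feat_a, labels)),
                  ("B", information_gain(feat_b, labels))],
                 key=lambda t: t[1], reverse=True)
```

Ranking the 56 sub-band descriptors by this score and keeping the top d of them is the selection step the 12 experiments vary.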

Dataset-Independent Voice Illness Assessment System
A key objective of this study is to propose a voice pathology identification and classification framework that is not affected by language, accent, age group, ethnic background, sex, or other factors. No single feature provides a high identification and classification rate across all datasets. Therefore, three features common to the top five of all datasets are carefully picked. The chosen three descriptors are Mean-D2, Mean-D3, and Mean-D4. This makes the proposed framework dataset-independent, and as a result, the method is not affected by factors such as sex, age, culture, ethnicity, or language.
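The selection of the common descriptors amounts to intersecting the per-dataset top-5 lists. In the sketch below, only the three common entries (Mean-D2, Mean-D3, Mean-D4) come from the text; the remaining entries of each top-5 list are hypothetical placeholders:

```python
# Hypothetical top-5 lists per dataset; only Mean-D2/D3/D4 are from the paper.
top5 = {
    "SVD":  ["Mean-D1", "Mean-D2", "Mean-D3", "Mean-D4", "Energy-D2"],
    "PdA":  ["Mean-D2", "Mean-D3", "Mean-D4", "Std-D1",  "Energy-D1"],
    "AVPD": ["Mean-D2", "Mean-D3", "Mean-D4", "Mean-D1", "Std-D3"],
    "MEEI": ["Mean-D2", "Mean-D3", "Mean-D4", "Energy-D3", "Std-D2"],
}

# Descriptors that appear in every dataset's top-5 list.
common = set.intersection(*(set(v) for v in top5.values()))
```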
To ascertain the discrimination power of these three selected common descriptors, scatter plots and PDF plots are drawn. Figure 9 shows the scatter plots of these descriptors for the SVD, PdA, AVPD, and MEEI datasets for the pathology identification process. In the scatter plot, hardly any of the samples are mixed, indicating that the three chosen descriptors have excellent ability to separate pathological samples. The quality metrics of the classifiers for speech condition identification with the three most common features are presented in Table 6, and the maximum recognition rate attained is 99.97% for the Arabic dataset. Figure 10 shows the density graphs for the three most common descriptors for speech condition classification. As there are few cyst pathology samples in the SVD and MEEI datasets, three groups (healthy, paralyzed, and polyp) are analyzed. The AVPD and PdA datasets are classified into four categories: normal, cystic, paralyzed, and polyp. The classifiers' key parameters for categorization are presented in Table 7, and the AVPD dataset attained the best categorization accuracy of 99.87%.

Voice Disorder Detection (Inter-Database)
Inter-database tests are carried out as an extra task to examine the discriminative power of the features for voice pathology identification and to ascertain the proposed voice evaluation system's independence from language, accent, age, social background, sex, and so on. This set of experiments helped determine how well a voice pathology identification system trained on one dataset discriminates speech recordings from another dataset. In the inter (cross)-dataset setting, 14 experiments are performed using the same three most common features, namely, Mean-D2, Mean-D3, and Mean-D4, considered in Section 3.3. For the first four tests, the classifier was trained on one of the datasets, and the learned model was tested on the other three datasets individually. For experiments five through ten, all conceivable pairings of two datasets were employed for learning, and the learned model was tested using each of the two leftover datasets individually. In the last four experiments, every possible combination of three datasets was used to train the classifier, and the leftover dataset was used to assess the learned model. The detection rates of these 14 experiments are shown in Table 8 and reveal that the proposed approach is unbiased in terms of spoken language, age, social background, accent, sex, etc.
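Enumerating those 14 train/test configurations is a simple combinatorial exercise (4 single-dataset + 6 pair + 4 triple training sets, each evaluated on the remaining corpora individually); a sketch:

```python
from itertools import combinations

datasets = ["SVD", "PdA", "AVPD", "MEEI"]

# Train on every 1-, 2-, or 3-dataset subset; the leftover datasets are
# the held-out test corpora, each evaluated individually.
experiments = []
for k in (1, 2, 3):
    for train in combinations(datasets, k):
        test = tuple(d for d in datasets if d not in train)
        experiments.append((train, test))
```

Each `(train, test)` pair would then drive one fit/evaluate cycle with the three common descriptors.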

Voice Pathology Independent System
One more goal of this study is to propose a voice-pathology-independent system. To ascertain this, extra tests are carried out using a variety of pathologies. Up to Section 3.4, all of the experiments were performed using healthy, cyst, paralysis, and polyp speech samples. In the SVD, PdA, and MEEI databases, laryngitis and Reinke's edema are the most common pathologies, along with paralysis, polyp, and cyst. These five pathologies were therefore considered together with the healthy speech samples to perform voice identification experiments. The descriptor rankings obtained with these five pathologies are given in Table 9. The descriptor rankings for voice disorder identification shown in Tables 3 and 9 do not change much. In addition, the descriptors Mean-D1, Mean-D2, Mean-D3, and Mean-D4 retained their positions among the top eight descriptors. The classifier performance measures of this experiment are shown in Table 10. To confirm that the system is independent of voice pathologies, the same three most common descriptors Mean-D2, Mean-D3, and Mean-D4 mentioned in Section 3.3 are used. The highest detection rate obtained exceeds 99% in all databases using only the three most common descriptors. The scatter plots in Figure 11 justify the high detection rate achieved: these three common descriptors are highly discriminative. The PdA databank covers 15 organic and traumatic etiologies, and the AVPD database consists of five organic speech pathology samples.
All speech samples available in the PdA and AVPD databases are also considered in the next voice disorder identification experiment, which therefore covers a wide variety of voice pathology samples. The top 10 ranked descriptors are shown in Table 11 and are almost the same as those listed in Tables 3 and 9. In addition, the Mean-D1, Mean-D2, Mean-D3, and Mean-D4 descriptors are at ranks 1, 2, 3, and 4, respectively. Thus, although the number of voice pathologies increased compared to earlier experiments, the most discriminatory descriptors remain the same. The classifier performance measures of this experiment are shown in Table 12. Here, two separate experiments are performed: one with the three most common (Mean-D2, Mean-D3, and Mean-D4) descriptors and one with all 56 descriptors. Again, the three most common descriptors perform well, and the highest detection rate obtained is 99.99%. The scatter plots for these three common descriptors are shown in Figure 12 and show that the healthy and unhealthy classes' data points are non-overlapping. The outcome of this experiment ensures that the proposed system is independent of voice pathologies. Further, two multiclass experiments are performed with the PdA and AVPD dataset speech samples. In the PdA dataset multiclass experiment, a healthy class and seven pathological classes (cyst, paralysis, polyp, laryngitis, Reinke's edema, nodules, and sulcus) are used.
However, in the AVPD dataset multiclass experiment, a healthy class and five organic pathological classes (cyst, paralysis, polyp, nodules, and sulcus) are used. The top 10 descriptor rankings obtained are shown in Table 13 and are almost identical to the descriptor rankings shown in Table 3. Thus, although the number of pathologies increased, the ranking of the descriptors does not change much, and the Mean-D1, Mean-D2, Mean-D3, and Mean-D4 descriptors retained their positions in the top four. The results of these experiments are shown in Table 14, where it is observed that the accuracy obtained with the top 10 and with all 56 descriptors is nearly the same when using the SGD classifier. The performance of the system using the three most common descriptors is not as good for the PdA database, as the number of classes in this multiclass experiment is higher than in the experiments performed in Section 3.2. However, in the AVPD database, the classification rate achieved with the three common (Mean-D2, Mean-D3, and Mean-D4) features is almost identical. Thus, the proposed framework can be used for the assessment of almost any voice pathology that affects the vibratory pattern of the vocal folds. Moreover, for voice disorder identification or classification, the three most common (Mean-D2, Mean-D3, and Mean-D4) features are adequate to attain sufficient accuracy.

Impact of the Wavelet Decomposition Level on Detection and Categorization Accuracy
The feature vector's dimension is determined by the decomposition level. As the decomposition level grows, so does the dimensionality of the feature vector, increasing the system's computational complexity and response time. As a result, the optimal level of decomposition must be determined to obtain satisfactory detection and categorization accuracy. Table 15 depicts the impact of the decomposition level on the detection rate and reveals that the detection rate stays nearly the same even when the decomposition level is increased. Table 16 portrays the impact of the decomposition level on the classification accuracy, revealing that level 4 achieves the highest classification accuracy. The glottal waveform is therefore decomposed to level 4 in this study, at which the suggested system achieves both the desired identification and classification accuracy. Table 17 compares the accuracies achieved in this study to those stated in the existing state-of-the-art literature.
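The level-4 decomposition and per-sub-band feature extraction can be sketched in pure Python with the à trous form of the undecimated Haar SWT. This is a minimal illustration, not the paper's implementation: the boundary handling, the 1/2 normalization, and the descriptor set shown (energy, mean, standard deviation) are assumptions, and the paper's exact seven descriptors are not reproduced here.

```python
import math

def haar_swt(signal, levels=4):
    """Undecimated Haar SWT via the a trous scheme, circular boundary.
    Returns one (approximation, detail) pair per level, each full-length."""
    bands, approx, n = [], list(signal), len(signal)
    for level in range(levels):
        step = 2 ** level  # filter "holes" double at each level
        a = [(approx[i] + approx[(i + step) % n]) / 2 for i in range(n)]
        d = [(approx[i] - approx[(i + step) % n]) / 2 for i in range(n)]
        bands.append((a, d))
        approx = a  # the next level refines this approximation
    return bands

def descriptors(coeffs):
    """Illustrative per-sub-band descriptors: energy plus simple statistics."""
    n = len(coeffs)
    mean = sum(coeffs) / n
    var = sum((c - mean) ** 2 for c in coeffs) / n
    return {"energy": sum(c * c for c in coeffs),
            "mean": mean, "std": math.sqrt(var)}

# Toy stand-in for a glottal waveform obtained via inverse filtering.
wave = [math.sin(2 * math.pi * 5 * t / 64) for t in range(64)]
features = [(descriptors(a), descriptors(d)) for a, d in haar_swt(wave)]
```

Because the SWT is undecimated, every sub-band keeps the full signal length, so the descriptors at each level are computed over the same number of coefficients; with seven descriptors per approximation and detail band across four levels, this yields the 56-dimensional feature vector used above.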

Discussion
This study uses the well-known IAIF method to acquire the glottal wave, which is then decomposed using the SWT up to level 4. The obtained glottal wave is then numerically quantified using the attributes extracted from each sub-band. The usefulness of the extracted features for the identification and categorization of vocal disorders is evaluated using a variety of intra- and inter-database tests. The statistical features demonstrated outstanding performance for assessing voice diseases. The findings reveal that the descriptors behaved differently for each database, resulting in variations in detection and classification accuracy. The following factors could account for the minor variance in accuracy and other metrics.

1. The voice-acquisition equipment and acoustic settings in each corpus were different.
2. The sampling frequency varies by dataset. This has an impact on the accuracy and reliability of a voice analysis.
3. In the MEEI registry, healthy people were not clinically evaluated. As a result, there is no way of knowing whether these people were truly healthy.
4. Although the recordings are trimmed to retain just the stable section of the phonation, research has revealed that the start and end of the vocalization carry more acoustic clues than the steady section.
5. In the German and English datasets, many audio recordings are labeled with multiple diseases.
6. For certain disorders, only a few recordings are available.
7. The severity of pathology differs across datasets.
8. The distribution of the subjects' age and sex varies by dataset.
9. The SVD dataset does not indicate whether a paralysis is unilateral or bilateral.
10. The PdA dataset includes audio files with strong background noise or barely discernible vocalizations.
The wavelet transform has the power to extract time-frequency information from a non-stationary signal, and since disordered speech is transient in nature, we used the orthogonal Haar wavelet for multilevel decomposition. To validate the use of the Haar wavelet, we examined Daubechies wavelets over the four datasets. Tables 18 and 19 show the calculated detection and classification rates; Haar outperforms the Daubechies wavelets in terms of the classification rate, justifying our choice of the Haar wavelet for decomposition. For detecting voice disorders, both Haar and Daubechies wavelets proved effective. To confirm that the developed classifier model is independent of the database, it was tested on the three databases other than the one on which it was trained. This set of experiments helped to investigate how efficiently a voice pathology detection system trained on one database can distinguish samples from another database. However, the results were not encouraging. To bring uniformity to the application of the system across databases, common descriptors of voice pathologies were identified. The proposed system for identifying and classifying voice disorders is therefore evaluated with three common features collected from all databases. In terms of identification and classification accuracy, 99.9% has been attained so far.

Conclusions
The variation in speech quality due to voice pathology is directly linked to the real biological source of voiced excitations emerging from the glottis. This source of voiced excitations is obtained by IAIF, which gives better insight into the cause of changes in speech quality. For the modeling of the vocal tract, a discrete all-pole model was used in inverse filtering, as it performs better than LPC for high-pitched voices. This study proposes a novel voice dysfunction diagnostic and categorization system based on the SWT. Multilevel SWT decomposition was used to detect variation in speech quality, and energy as well as statistical parameters were derived from each sub-band to quantify the glottal wave. The derived parameters are independent of the fundamental frequency and were found to be very successful in distinguishing healthy and disordered speech. The Mean-D1, Mean-D2, Mean-D3, and Mean-D4 descriptors, extracted from the detail coefficients of decomposition levels 1 through 4, have very good discrimination power compared to the other descriptors for speech dysfunction assessment. The proposed system achieved an average recognition rate of 99.99% and 99.60% for detection and classification, respectively. The SGD classification model outperformed the SVM classifier, signifying that the extracted features are linearly separable.
The purpose of this study is to create a voice condition assessment system that is independent of voice pathologies as well as the language, accent, age, cultural background, and gender of the speaker by utilizing three common descriptors from the top ten descriptors.
Additionally, this article presents an automated voice disease diagnostic algorithm that is recommended for a wide variety of voice diseases. Our proposed system is among the few speech impairment diagnostic systems that have identified a broad range of laryngeal pathologies. This framework could be deployed as an Android app to check voice quality in daily life at any instant. The conceived laryngeal pathology evaluation framework can be used to diagnose voice concerns objectively and non-invasively. It can also serve as a supplemental aid to track the degree of recovery during and after a cycle of voice therapy.