3.1. Experimental Setup
The 20-dimensional MFCC and LPC parameters are obtained from each speech file of the SVD. The modules used to extract the MFCCs and LPCs are set to a frame size of 40 ms and a frame overlap rate of 50%. The SMOTE, ADASYN, and Borderline-SMOTE algorithms are used with four nearest neighbors (k = 4) and a random state of 42. We experimented with the SMOTE, ADASYN, and Borderline-SMOTE algorithms with k = 3, 4, and 5 and confirmed that the best performance was obtained with k = 4. This study focuses on the effectiveness of using the proposed method to generate minority class samples rather than on tuning the hyperparameters of the deep learning classifiers to achieve optimal classification performance.
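The oversampling configuration above can be illustrated with a minimal, self-contained sketch of SMOTE-style interpolation. This is not the imbalanced-learn implementation used in the study; `smote_sketch`, the toy 20-dimensional data, and the sample counts are illustrative assumptions, using k = 4 neighbors and a random state of 42 as described above.

```python
import numpy as np

def smote_sketch(minority, n_new, k=4, seed=42):
    """SMOTE-style interpolation sketch: for each synthetic sample, pick a
    random minority point, pick one of its k nearest minority neighbors,
    and interpolate between them with a random factor in [0, 1)."""
    rng = np.random.default_rng(seed)
    synthetic = np.empty((n_new, minority.shape[1]))
    for i in range(n_new):
        idx = rng.integers(len(minority))
        x = minority[idx]
        # Euclidean distances from x to all minority samples
        d = np.linalg.norm(minority - x, axis=1)
        d[idx] = np.inf                       # exclude the point itself
        neighbors = np.argsort(d)[:k]         # k nearest minority neighbors
        nb = minority[rng.choice(neighbors)]
        gap = rng.random()                    # interpolation factor
        synthetic[i] = x + gap * (nb - x)     # point on the segment x -> nb
    return synthetic

# Toy 20-dimensional "minority class" features, echoing the 20-dim MFCC/LPC setup
minority = np.random.default_rng(0).normal(size=(30, 20))
new = smote_sketch(minority, n_new=50)
print(new.shape)  # (50, 20)
```

Because each synthetic point lies on a segment between two minority samples, it always falls inside the per-dimension range of the original minority data.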
Table 2 lists the main parameters of each model. First, we study the classification between normal and pathological voices using an FNN with two hidden layers. A rectified linear unit (ReLU) activation follows the first layer, and a softmax activation follows the last layer [
38]. The main parameter values are shown in
Table 2. This study also uses a CNN with four consecutive convolutional layers, in which each convolutional mask has a kernel of size 3 × 3 and a ReLU activation function, with 64, 64, 32, and 32 convolutional masks in the respective layers. The CNN also has four max pooling layers of size 2 × 2, one dense layer with 512 nodes, each with a ReLU activation function, and one softmax output layer with four neurons. The details are presented in
Table 2. Experimental results are obtained through 10-fold cross-validation, with the data split into training (70%) and testing (30%) sets such that each fold contains at least one sample from the minority class. The models are implemented using Python 3.7 with the scikit-learn, imbalanced-learn, and PyTorch libraries.
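To make the FNN description concrete, the following is a minimal numpy sketch of its forward pass, not the PyTorch implementation used in the study. The hidden layer sizes (64 and 32) are assumptions for illustration only; the actual values are listed in Table 2. As described above, a ReLU follows the first layer and a softmax follows the last.

```python
import numpy as np

rng = np.random.default_rng(42)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))  # numerically stabilized
    return e / e.sum(axis=1, keepdims=True)

# Assumed layer sizes (actual values are in Table 2):
# 20-dim MFCC/LPC input -> two hidden layers -> 2-class softmax output.
sizes = [20, 64, 32, 2]
weights = [rng.normal(scale=0.1, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def fnn_forward(x):
    h = relu(x @ weights[0] + biases[0])        # first hidden layer + ReLU
    h = h @ weights[1] + biases[1]              # second hidden layer
    return softmax(h @ weights[2] + biases[2])  # softmax over the two classes

batch = rng.normal(size=(8, 20))                # a batch of 20-dim feature vectors
probs = fnn_forward(batch)
print(probs.shape)  # (8, 2)
```

Each output row is a probability distribution over the normal and pathological classes, so the rows sum to one.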
3.3. Oversampling Method Comparison
Figure 5 shows the fifth and fifteenth MFCC distributions of the normal voice samples obtained before and after using various oversampling methods, including the ADASYN, SMOTE, and Borderline-SMOTE on the SVD database.
Figure 5a,b show the fifth and fifteenth MFCCs of the original samples, respectively. Overall, when the oversampled waveforms are compared with the original waveforms, the amplitudes are slightly higher, and the samples are slightly denser in the oversampled MFCCs. The differing components are marked with circles. In
Figure 5c, the samples oversampled by the SMOTE are observed in the second and third circles. This waveform is more densely sampled than that of the original samples, as shown in
Figure 5a. In
Figure 5e, it is possible to observe the aspects oversampled by ADASYN in the last circle. When oversampled with Borderline-SMOTE, the dense waveforms can be observed in the first circle of
Figure 5g. When oversampling the fifth MFCCs extracted from the normal voices using the three methods discussed, the SMOTE best interpolates the samples while maintaining the characteristics of the original fifth MFCC waveform, whereas the fifth MFCC oversampled by ADASYN appears most similar to the original. In the case of the fifteenth MFCCs, in
Figure 5d, the samples oversampled by the SMOTE are clearly observed in the first and second circles. When oversampling the fifteenth MFCCs extracted from normal voices using the three methods mentioned, the SMOTE best interpolates the samples, producing dense waveforms. The fifteenth MFCC oversampled by Borderline-SMOTE appears most similar to the original fifteenth MFCC waveform.
Figure 6 shows the fifth and fifteenth LPC distributions obtained for the normal voice samples before and after using various oversampling methods, including the ADASYN, SMOTE, and Borderline-SMOTE, on the SVD database.
Figure 6a,b show the fifth and fifteenth LPCs of the original samples, respectively. As with the MFCCs, when the oversampled waveforms are compared with the original waveforms, the amplitudes are slightly higher, and the samples are slightly denser in the oversampled LPCs. The differing components are marked with circles. In the LPCs oversampled by Borderline-SMOTE in
Figure 6g,h, significant differences are observed in all circles.
Similarly, based on the results of other oversampling methods, such as ADASYN and the SMOTE, prominent differences are observed in the various circles shown in
Figure 6c–f. Therefore, among the three methods, Borderline-SMOTE best interpolates the samples while maintaining the characteristics of the original LPC waveform.
The fifth and fifteenth coefficients of the MFCCs and LPCs are randomly selected to visualize the oversampling effect. Overall, as shown in
Figure 5 and
Figure 6, the samples oversampled for the LPCs tend to be better interpolated than those for the MFCCs. In both cases, the samples oversampled using the ADASYN method tend to be similar to the original waveforms. Consequently, an equal number of normal and pathological voice samples is produced. Although the data lengths in (c) to (h) appear to match those in (a) and (b) in
Figure 5 and
Figure 6, this is because the oversampled samples overlap at the same time points and are plotted together. In reality, after oversampling, the normal voice samples are as numerous as the pathological voice samples.
3.4. Experimental Results and Analysis
The model evaluation measures, namely the specificity, recall, G value, and F1 value, are obtained through 10-fold cross-validation.
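For reference, these measures can be computed directly from binary confusion matrix counts. The sketch below assumes the G value is the geometric mean of recall and specificity (the usual definition) and treats the pathological voice as the positive class; the counts are toy values, not results from the study.

```python
import math

def vpd_metrics(tp, fn, tn, fp):
    """Compute the four evaluation measures from a binary confusion matrix.
    Assumption: pathological voice is the positive class."""
    recall = tp / (tp + fn)                    # sensitivity on the positive class
    specificity = tn / (tn + fp)               # true negative rate
    precision = tp / (tp + fp)
    g_value = math.sqrt(recall * specificity)  # geometric mean of the two rates
    f1 = 2 * precision * recall / (precision + recall)
    return recall, specificity, g_value, f1

# Toy counts for illustration only
rec, spec, g, f1 = vpd_metrics(tp=90, fn=10, tn=80, fp=20)
print(round(rec, 2), round(spec, 2), round(g, 2), round(f1, 2))  # 0.9 0.8 0.85 0.86
```

The G value penalizes classifiers that favor one class: a model with recall 1.0 but specificity 0.2 scores only about 0.45, which is why it is informative on imbalanced data.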
Table 4 presents the results measured by the FNN and CNN models before applying the oversampling methods. The performance of all models is poor. For the SVD, most model evaluation metrics are lower than 0.6 because the class distribution of the samples is imbalanced.
Figure 7 shows the confusion matrix produced by each deep learning model using feature parameters to classify pathological and normal voices in the initial imbalanced class dataset. As shown in
Table 4 and
Figure 7, the recognition rate of the classifiers is biased toward samples of the majority class composed of pathological voices, whereas the accuracy for samples of the minority class composed of normal voices is insufficient. Consequently, the overall accuracy results are misleading. From
Figure 7,
Figure 8,
Figure 9 and
Figure 10, an accuracy of 72.96% is obtained for the combination of MFCCs and the FNN. Additionally, accuracies of 72.31% and 71.34% are obtained with a combination of LPCs and MFCCs using the FNN and CNN, respectively. The lowest performance of 66.94% is obtained using the LPCs and the CNN, as shown in
Figure 10a.
Table 5 shows the model evaluation matrices obtained from each classifier after performing imbalanced handling using various oversampling algorithms. The results in terms of binary class confusion matrices are shown in
Figure 8 and
Figure 9. Model evaluations of FNN and CNN classifiers described in
Table 5 show that VPD systems with the SMOTE have better specificity, recall, G values, and F1 values. Comparing the various deep learning models and feature parameters, the combination of the CNN and LPCs oversampled by the SMOTE has the highest accuracy (98.89%), and the best overall performance is achieved when the model metrics are evaluated with the SMOTE algorithm. The optimal classifier, namely the CNN with LPCs oversampled by the SMOTE, increases the recall, specificity, G value, and F1 value to 1.0, 0.97, 0.98, and 0.99, respectively. The conventional methods, which are applied to the imbalanced dataset, show significant deviations between the recall and specificity, ranging from 0.20 to 0.73. Our proposed method, which combines the CNN and LPCs oversampled by the SMOTE, improves the recall and specificity by 0.01–0.73 and 0.00–0.28, respectively, compared to the performance of the other conventional methods.
Figure 8 and
Figure 9 show the confusion matrices of each deep learning model using feature parameters to classify normal and pathological voices in the balanced dataset. Comparing the distributions of the confusion matrices in
Figure 7,
Figure 8 and
Figure 9, we find that the classification performance for majority class samples composed of pathological voices is lower for the VPD system using the FC-SMOTE algorithm, but it significantly improves its ability to classify minority class samples composed of normal voices.
Figure 10b demonstrates that utilizing the CNN classifier with the LPCs oversampled by the SMOTE yields the highest accuracy at 98.89% compared to combinations of the other deep learning classifiers and feature parameters. Additionally, the next-highest performances (98.28% and 93.61%) are obtained via the LPCs oversampled by Borderline-SMOTE and ADASYN, respectively, with the CNN classifier. In the FNN classifier, the best accuracy (91.52%) is achieved with the LPCs oversampled by the SMOTE. The combination of the MFCCs oversampled by ADASYN and the FNN classifier yields the lowest performance of 69.53%.
Overall, as a deep learning model, the CNN outperforms the other deep learning classifier for disordered voice detection, a finding also reported in the most recent published paper on this topic [
6]. Additionally, the proposed method, namely the combination of the LPCs oversampled by the SMOTE and the CNN, is generally superior in the binary class confusion matrices of the two models in terms of the accurate prediction of both normal and pathological voices. The experimental results indicate that the SMOTE is a useful approach for building a binary classification model between pathological and normal voices. They also confirm that our suggested VPD algorithm can better learn the minority class and can achieve improved binary classification performance.
Because the MFCC is widely used in speech signal processing, the VPD system can achieve good classification and detection performance on imbalanced datasets, as demonstrated in recent studies [
39,
40]. Therefore, the performances obtained from the MFCCs are better than those of the LPCs when using the two deep learning models in the class-imbalanced binary classification, as shown in
Figure 10a. The performance of the FNN is approximately 6% better than that of the CNN. However, regarding the accuracies obtained on the balanced class dataset, all the models exhibit good predictive ability in the binary classification of normal and pathological voices with LPCs oversampled by the various methods. The deep learning model using the CNN achieves the best overall performance. Experimentally, the LPCs appear to be more sensitive to the oversampling methods than the MFCCs. In conclusion, the CNN as a deep learning model and LPCs oversampled by the SMOTE as feature parameters obtain the highest performance in the classification between pathological and normal voices using the SVD. Notably, the classification rate of the VPD system configured using the SMOTE method demonstrates considerable improvement compared to the results obtained without oversampling. Therefore, we conclude that the proposed method, namely the combination of the CNN and the LPCs oversampled by the SMOTE, can efficiently increase the performance of classification between pathological and normal voices.
In summary, our VPD system uses the SMOTE, ADASYN, and Borderline-SMOTE algorithms to balance the binary imbalanced class data in the SVD, and the accuracy of the approach is verified using a set of deep learning classifiers, namely the FNN and CNN. When classifying normal and pathological voices, the recall, specificity, G value, and F1 value of our VPD system with the SMOTE method are all higher than those of a VPD system without the oversampling algorithms. These results confirm that our proposed method is a good strategy for achieving successful classification between pathological and normal voices. They also show that our method of addressing class imbalance in a limited pathological speech database can be applied to pathological speech detection in the field of biomedical engineering.
3.5. Comparison with Existing Techniques
In the previous subsections, the performance of our method is compared with that of deep learning methods using vocal tract-based cepstral features and their combinations as references. The results in
Figure 10b and
Table 6 show that the LPC combination based on the SMOTE + CNN yields the best overall detection accuracy. In this subsection, this optimal combination is compared with existing methodologies and deep learning techniques. Many studies have developed different VPD techniques over the past few decades; in our work, we select four studies with databases and deep learning methods similar to those used in our research, as shown in
Table 6.
Table 6 presents the databases, methodologies, features, deep learning methods, and performances of competing approaches under the binary detection model scenario for voice disorder detection. In [
22], the CGAN-IFCM algorithm achieves an accuracy of 95.15% with 869 normal and 1356 pathological voice samples from the SVD. Although our study also uses all of the SVD data, the reason why the number of samples in that study differs from the total amount of SVD data is unclear. The authors of [
24] show that the proposed FC-SMOTE method outperforms the other oversampling methods, achieving accuracies of 100% and 90% with the CNN model on the MEEI and SVD, respectively. For the SVD data, 687 normal and 194 pathological voice samples are used, and the pathological voice samples are augmented to 687 via oversampling. Although the approach in [
26] demonstrated good performance, that study used the SPDD. In a comparison with [
2,
22,
24] using the SVD, the proposed combination containing LPCs based on the SMOTE and the CNN improved the resulting accuracy by 3.74–11.29%. Compared to [
26], using the SPDD, our proposed method increased the performance by 2.26%. All existing works [
2,
22,
24,
26] solved imbalanced class issues with various features, databases, and methodologies. In conclusion, our method (LPCs oversampled by the SMOTE with the CNN) improves the accuracy by up to 11.29% compared to papers using the same parameters (MFCC) and database (SVD) as our experiments.