Class-Imbalanced Voice Pathology Detection and Classiﬁcation Using Fuzzy Cluster Oversampling Method

: The Massachusetts Eye and Ear Inﬁrmary (MEEI) database is an international-standard training database for voice pathology detection (VPD) systems. However, there is a class-imbalanced distribution in normal and pathological voice samples and different types of pathological voice samples in the MEEI database. This study aimed to develop a VPD system that uses the fuzzy clustering synthetic minority oversampling technique algorithm (FC-SMOTE) to automatically detect and classify four types of pathological voices in a multi-class imbalanced database. The proposed FC-SMOTE algorithm processes the initial class-imbalanced dataset. A set of machine learning models was evaluated and validated using the resulting class-balanced dataset as an input. The effectiveness of the VPD system with FC-SMOTE was further veriﬁed by an external validation set and another pathological voice database (Saarbruecken Voice Database (SVD)). The experimental results show that, in the multi-classiﬁcation of pathological voice for the class-imbalanced dataset, the method we propose can signiﬁcantly improve the diagnostic accuracy. Meanwhile, FC-SMOTE outperforms the traditional imbalanced data oversampling algorithms, and it is preferred for imbalanced voice diagnosis in practical applications.


Introduction
Traditional pathological voice detection mainly depends on experienced clinicians or laryngoscopes to observe the vocal cord structure [1], which is subjective and invasive. In recent years, we have witnessed the success of artificial intelligence in medical applications [2][3][4]. A voice pathology detection (VPD) system based on machine learning algorithms and well-established features has become a hot topic in research. In biomedical engineering, different features are extracted from signals to build VPD systems that automatically detect pathological voices. Most of these studies have experimented with the Massachusetts Eye and Ear Infirmary (MEEI) database [5], which has become one of the standard databases for VPD systems [6]. Nevertheless, in the past studies on voice pathology detection, many researchers ignored the class-imbalanced distribution of voice samples in the MEEI database. The MEEI database contains 657 pathological voice samples but only 53 normal voice samples from healthy speakers. Thus, taking the Acc as the evaluation result of the classifier makes the pathological voice detection model's performance better than it actually is. This is because learning with imbalanced datasets usually results in a biased classifier that obtains a higher detection accuracy in majority classes and a lower one in minority classes [7]. For example, given a dataset with 95% of the samples labelled as positive, if all negative samples are misclassified and positive samples are all classified correctly, we can obtain an Acc = 95%. However, the accuracy rate for negative samples is 0. Obviously, this is too optimistic regarding the performance of the classifier when the data are imbalanced [8,9]. In practical applications, the class-imbalanced data result from the insufficient number of samples in the pathological voice database, which also makes it difficult for the traditional VPD system to classify multiple pathological types. Given its importance, pathological voice diagnoses with imbalanced data have attracted the interest of researchers [10,11].
Due to the high class-imbalanced ratio in the MEEI database, is not suitable for researchers to use Acc as the primary model evaluation measure. Thus, it is necessary to consider imbalanced learning to model the VPD system based on the class-imbalanced pathological voice database.

Imbalanced Learning
The class-imbalanced problem is found in many real-world applications, including fraud detection [12], medical diagnosis [13,14], bioinformatics [15], and so forth. Obtaining class-balanced datasets is very difficult in medical diagnosis because it requires the expensive endoscopy (e.g., laryngoscopy) of patients and labelling with manual supervision. Classifiers trained on imbalanced datasets will reduce their robustness and generalization performance, which is also one of the main challenges of machine learning.
Existing solutions to solve class-imbalanced problems can be divided into internal methods and external methods. The internal method [16][17][18] uses a variety of cost adjustment techniques in the learning algorithm and makes full use of the initial class-imbalanced dataset during the training process. The external method [19][20][21] refers to rebalancing the external training data while keeping the learning algorithm unchanged. Current research [22] shows that external methods are more popular than other methods and are universally applied in various fields. The external method includes oversampling; undersampling; and hybrid undersampling, which combines oversampling and undersampling. The majority of classes are usually undersampled to reduce the number of samples to balance each class's sample distribution. Minority classes are usually oversampled to generate new samples by duplicating or synthesizing the samples to decrease the classes' imbalance ratio. The synthetic minority oversampling technique (SMOTE) [23] is one of the main oversampling methods used to handle imbalanced data. SMOTE is a method to improve Random OverSampling (ROS), reducing the risk of overfitting, but it may cause overgeneralization, is susceptible to generating noise in samples, and causes increased overlapping between different classes [24]. Due to the generation of wrong minority samples, this problem generalizes the minority class region to the majority class region. Therefore, in the current research on imbalanced learning algorithms, SMOTE has multiple variations intended to solve the original algorithm's weaknesses. Many researchers have focused on improving the SMOTE algorithm. The variants of SMOTE that have been proposed are Borderline-SMOTE [25], adaptive synthetic sampling (ADASYN) [26], Random-SMOTE [27], among others.
The oversampling algorithm based on clustering works better for imbalanced learning with a large number of unknown samples. On the one hand, clustering algorithms are based on unsupervised learning and are vital elements of machine learning in general. On the other hand, as a soft clustering algorithm, the usefulness of the fuzzy clustering algorithm has been confirmed in existing work [28]. However, in the recent research and application of fuzzy clustering [29,30], the algorithm has mainly been used as a classification technique. Especially for the pathological voice dataset, which has fuzzy boundaries and overlaps between clusters, fuzzy clustering algorithms can be useful [31]. In imbalanced learning, the indistinct boundaries and overlaps between the clusters affect the sampling technique's performance, resulting in the generation of noisy samples with no information value. Thus, we used this particularity of fuzzy clustering for imbalanced learning as a consolidation technique for oversampling operations. Some imbalanced learning algorithms based on clustering have been proposed in [32,33]. They can omit the steps of labelling all samples in a set and are more suitable for the imbalanced learning of unknown samples in medical diagnosis. Nonetheless, clustering-based SMOTE algorithms ignore the imbalanced distribution within the sample class [34], which may lead to the generation of non-differentiated minority samples underfitting the classifier. Furthermore, the related research [35] is mainly based on the k-means clustering algorithm, but it is limited to the k-means hard clustering principle, which is easily affected by abnormal points and is not suitable for too-discrete classification. The fuzzy clustering synthetic minority oversampling technique algorithm (FC-SMOTE) used in this paper overcomes the influence of noise points and abnormal points and the imbalanced distribution between and within classes at the same time. Finally, it is necessary to generate as many differentiated and representative minority samples as possible to improve the pathological voice classification model's performance.

Objective
Researchers usually use the MEEI database as the standard training database for VPD systems in the research on pathological voice detection and intelligent diagnosis, but ignore the imbalance between normal and pathological classes of this database. As a model evaluation measure, Acc is not suitable for class-imbalanced data, leading to inaccurate model results. The current classification studies of pathological voices only focus on binary classification between normal and pathological voices or binary classification between one pathological voice type and other pathological voices; there are few studies on the multi-classification of pathological voices.
To tackle these issues, we propose a VPD system that combines the data oversamplingbased fuzzy c-means clustering SMOTE method and machine learning models to resolve the multi-classification of pathological voices with imbalanced datasets. Mel-Frequency Cepstral Coefficients (MFCCs) are some of the most effective feature parameters in speech recognition [36][37][38] and were proposed based on the human auditory perception structure [39]. Recent studies have shown that the feature extraction of MFCCs has been widely used in VPD systems. Thus, this paper selects MFCCs as the input parameters of the model. Considering the issue of class-imbalance in the MEEI database, we conjunct the c-means clustering algorithm with SMOTE and propose a VPD system in the class-imbalanced pathological voice database called FC-SMOTE. FC-SMOTE finds safe areas where minority class samples are located by the c-means clustering algorithm and oversamples minority class samples in safe areas according to the density of the minority sample distribution. Ultimately, more samples will be generated in sparse minority areas to solve the imbalance within the class. Moreover, FC-SMOTE can detect safe oversampled areas without considering all sample class labels. In the modeling of the VPD system, a set of machine learning models are evaluated and validated using the resulting class-balanced dataset as an input. Some reasonable metrics [40] include Recall, Specificity, G value, F1 value, and AUC/PRC as evaluation metrics for the classifier model. The FC-SMOTE algorithm is used to select minority and majority classes, and a class-balanced training dataset is constructed to train machine learning models. The results show that the VPD system with FC-SMOTE has shown good results in the detection and multi-classification of pathological voices, which helps in the pathological diagnosis of voices in practical clinical use. For the selected model evaluation measures, FC-SMOTE outperforms other traditional oversampling algorithms, which include SMOTE, Borderline-SMOTE, and ADASYN.

Overview of the Framework
In this paper, we develop a VPD multiple classification system for class-imbalanced data in the pathological voice database and provide a framework to solve the problem, which is shown in Figure 1, including three steps-feature extraction, FC-SMOTE algorithm handing, and model validation and evaluation. The theoretical background of the MFCC feature parameters is briefly described. We use FC-SMOTE to generate new minority samples to obtain a balanced dataset from the initial dataset. Finally, we evaluate and validate a set of machine learning models using the resulting class-balanced dataset as an input.  In the feature extraction step, we can compute a set of MFCCs for each speech frame of one-second audio to obtain approximately 64 MFCC samples (window size = 32 ms, frame overlapping rate = 50%). Finally, we use their multi-frame averaging as the MFCC vector for a voice. A total of thirty-six MFCC parameters includes MFCCs' original feature parameters and their first-order and second-order derivatives.

FC-SMOTE Algorithm Handing
As one of the standard oversampling methods used to deal with class-imbalanced datasets, SMOTE artificially generates new minority class samples. It combines the generated samples with the initial training set to balance the training dataset. SMOTE generates synthetic samples in three steps. Firstly, it selects a random minority sample a for observation. Then, it selects the sample b among its k nearest minority sample neighbours. Finally, it creates a new sample x by randomly interpolating two samples. The x function is shown in Formula (1).
where w is the random weight in [0, 1]. However, SMOTE randomly selects a minority sample to oversample, which may generate minority class samples in majority regions in the presence of noise, as shown in Figure 2a.
The cluster-based SMOTE can avoid the generation of noise by oversampling only in safe areas, as shown in Figure 2b. Besides, when considering the clinically imbalanced dataset dealing with the pathological voice, FC-SMOTE focuses on within-class imbalance by expanding the sparse minority areas to generate more differentiated minority samples to avoid underfitting the classifier. Thus, it can generate as many differentiated and representative minority samples as possible to improve the VPD system performance.
FC-SMOTE consists of three steps-clustering, filtering, and oversampling. In the clustering step, after setting the number of clusters to be divided according to the number of categories, the fuzzy c-means (FCM) clustering algorithm is used to cluster the input samples to specify the number of clusters. The FCM algorithm is one of the most widely used fuzzy clustering algorithms and a popular iterative method for finding groups. It first selects the number of clusters. Then, it assigns coefficients between [0, 1] randomly to initialize the membership matrix. The centroid of the cluster is calculated as the average of all points and weighted according to the degree of the cluster to which they belong. The position of the centroids for each cluster and coefficient in the clusters for each data point is updated. Repeat until the algorithm has converged. In the filtering step, the safe area for oversampling was selected based on the fuzzy set's spatial distribution. In the safe area, the Euclidean matrix of the membership degree of the minority samples and the centroid of a cluster is calculated to obtain the density distribution matrix to perform the next step of oversampling. Therefore, the filtering step allocates more generated samples to the sparse minority samples space in the safe area rather than in the dense minority sample space in order to generate as many minority samples with significant differences as possible to overcome the underfitting problem of the classifier caused by oversampling. Finally, in the oversampling step SMOTE is used in each selected cluster based on sampling weight to balance the number of samples in the minority and majority classes. The safe areas used for oversampling are selected based on a fuzzy set. A fuzzy set is a set that has degrees of membership between 0 and 1. In the following formula, the fuzzy set is represented with a tilde character(∼). We define the safe area C (k_sa f e) as shown in Formula (2).
In Formula (2), j is the class number of minority samples and U is the membership matrix universal set. In the safe area, the Euclidean matrix of the membership degree of the minority samples and the membership degree of the cluster center is calculated according to Formulas (3)- (5).
In Formula (3), n is the number of minority samples. The distance matrix D uv means that each row of U is subtracted by each row of V, and its Euclidean distance is calculated. The Euclidean distance matrix of each selected cluster is calculated based on the above formulas, and the majority of samples are ignored. The mean minority distance within each safe area is obtained by dividing all non-diagonal elements of the distance matrix by the number of non-diagonal elements. The density of each safe area is defined as the distance matrix of the safe area divided by the mean minority distance raised to the membership matrix's dimension root. Sampling weight is defined as the inversion of the density. The sampling weight(SW) can be calculated according to Formulas (6) and (7).
For the initial class-imbalanced dataset, c is the membership matrix dimension and S maj and S min are used to represent the sample number of the majority class and the minority class, respectively, and their difference equals the total number of new minority samples that need to be synthesized. Therefore, the sample number of the synthetic minority classes around each minority sample can be calculated using Formula (8).
In Formula (8), the function round(·) represents rounding a number down or up according to the decimals. Thus, N u min represents the number of new minority samples that need to be generated by SMOTE around the uth minority sample. The distribution of samples generated in oversampling depends on the sampling weight-that is, high and low sampling weights correspond to minority samples of low density and high density, respectively. In Formula (10), P w ∈ [20%, 18%, 16%, 14%, 12%, 8%, 6%, 4%, 2%](w = 1, 2, ..., 9). When (S maj − S min ) > ∑ S min u=1 N u min , each sample of the first nine initial minority samples with a higher SW corresponds to the weight in P w in turn, and increases P w W 1 synthetic samples around it. When (S maj − S min ) < ∑ S min u=1 N u min , each sample of the last nine initial minority samples with a lower SW correspond to the weight in P u in turn and decrease P w W 2 synthetic samples around them. If P w W 2 is bigger than the number of synthetic samples around the minority sample, we skip this minority sample without deleting the generated sample and assign the weight to the next minority sample. The steps of FC-SMOTE are shown in Algorithm 1, and the flowchart is as follows in Figure 3.

Model Validation and Evaluation
After processing the imbalanced algorithm FC-SMOTE, four types of pathological voice class-balanced datasets and normal and pathological voice samples are obtained from the initial class-imbalanced dataset. In this process, in order to ensure that our results are not biased towards specific machine learning models and reduce the risk of overfitting, we validated and evaluated a set of standard representative machine learning classifiers (logistic regression(LR), naive Bayes (NB), decision tree (DT), support vector machine (SVM), K nearest neighbours (KNN), random forest (RF), XGBoost, and gradient tree boosting (GBDT)), as well as a deep-learning-based classification model (multi-layer perceptron (MLP) and convolutional neural network (CNN)).
Calcula te the distance matrix according to Formula (3)- (5) Class-imbalanced dataset Select the safe areas based on fuzzy set according to Formula (2) For each minority class sample in the safe area.
Obtain the m ean dista nce of minority sam ples

Algorithm 1 The algorithm of the FC-SMOTE
Input: X = {x 1 , x 2 , ..., x n }, j(a specify a number of clusters), k(number of nearest neighbors to be found by SMOTE), c(membership matrix dimensions). Output: a class-balanced dataset.
Step 1: The input elements X is clustered into j clusters by Fuzzy C-means.
Step 2: Select the safe areas used for oversampling based on fuzzy set, according to Formula (2).
Step 3: for safe areas j do (a): Calculate the distance matrix D j between the membership value of each minority sample point S min and the cluster center point, according to Formulas  (9) and (10). end for

Database
The database we used was developed by the Massachusetts Eye and Ear Infirmary (MEEI) Voice and Speech Lab. The speakers in the MEEI database samples are native English speakers. The MEEI database is the most widely used international commercial database in voice pathology detection, but it has the defect of a class-imbalanced distribution of samples. The MEEI database contains more than 1400 voiced samples of the sustained vowel /a/, but only 53 normal voice samples. In this study, all available 53 normal samples and four pathology types-vocal cord nodules, vocal cord polyps, vocal cord edema, and vocal cord paralysis-were chosen because these four are common in clinical diagnosis. Table 1 summarizes the primary information of the MEEI database samples used. It can be seen that in the MEEI database, there is a class-imbalanced distribution between normal and pathological voice samples, as well as each type of pathological voice sample. The class-imbalanced distribution of samples will lead to the poor performance of a classifier, which many researchers have ignored. To better illustrate this problem of imbalanced characteristics and the difficulties of example distribution in the attribute space, we visualized them in a two-dimensional space using Principal Component Analysis (PCA). Figure 4a,b show the data distribution of the normal and pathological samples before and after using FC-SMOTE in the MEEI database.
In this work, we first focus on modeling VPD systems in the class-imbalanced MEEI database, which can mislead the diagnosis results. Therefore, an improved SMOTE technique based on the fuzzy c-means clustering algorithm is proposed for balancing the dataset. In order to broaden the scope of the problem and maximize the applicability of the proposed method, we selected vowel /a/ samples with vocal cord nodules, vocal cord polyps, vocal cord edema, and vocal cord paralysis from another voice pathology database (Saarbruecken Voice Database (SVD) [41]) and conducted the same experiment using the proposed method. The SVD database is German, and is a collection of voice recordings and EGG signals from more than 2000 persons. It contains recordings of 687 healthy persons and 1356 patients with one or more of 71 different pathologies. Table 2 summarizes the primary information of the SVD database samples used. Figure 4c,d shows the sample point distribution before and after using FC-SMOTE in the SVD database.

Parameter Settings
All 50 kHz files were downsampled to 25 kHz to ensure that all voice files had the same sampling frequency. All 3-second files were edited to include only the first second of phonation. The 36-dimensional MFCC parameters and their first and second derivatives were extracted from each file in the database. The parameters for extracting the MFCC features were set as follows: frame size = 32 ms, frame overlapping rate = 50%, Mel filter order = 24. For FC-SMOTE, the 3 nearest neighbours SMOTE algorithm was used (k = 3), four clusters needed to be found by c-means for an imbalanced class of four pathology types (j = 4), and two clusters needed to be found by c-means for an imbalanced class of normal and pathological voices (j = 2). This work focused on the robustness of the method to generate minority class samples rather than adjusting the classifier hyperparameters to obtain the classifier's optimal performance. Therefore, Bayesian Optimization [42] was used to find the relatively optimal hyperparameters so that the machine learning classifier's performance was in a good performance range. Table 3 lists the main hyperparameter search space of each model. For the experimental verification of the deep learning model CNN, we refer to the method proposed in [2] to construct the network. Experiment results were obtained by 10-fold cross-validation to sure that each fold of the training or testing contained at least one sample of the minority class. To illustrate the generated data's reliability, we randomly isolated 10% of the dataset not used for oversampling from the initial dataset as external validation. Table 4 presents the training and test data distribution information of the experimental database. In the experiment based on the MEEI database, we set up an external dataset to prove the effectiveness of the FC-SMOTE algorithm in generating samples. Therefore, in the SVD database experiment no external dataset was set to repeat this verification experiment. The calculations and models were implemented by Python 3.7 with the Imbalanced-learn, Scikit-learn, and TensorFlow 2.0 libraries.

Model Evaluation Measure
Traditionally, the metric most commonly used to measure classifier performance has been Acc. However, in imbalanced datasets the common measure of classifier performance is inappropriate because its results can be misleading. If the wrong metric is chosen to evaluate models, it is likely to choose a flawed model, or, in the worst case, be misled about the model's expected performance. Thus, this study adopts some reasonable evaluation measures, including recall, specificity, G value, F1 value, and AUC/PRC, to measure the model's performance in a class-imbalanced dataset. Recall can represent the minority class accuracy rate. Specificity can represent the majority class accuracy rate. G value and F1 value also consider the classification performance for the minority class. When dealing with imbalanced datasets, precision-recall curves (PRC) give a more informative picture of an algorithm's performance. In short, these evaluation measures will not be affected by the number of samples in different classes, so they are usually considered "unbiased" and can be used in scenarios where classes are imbalanced. Relevant model evaluation measures are calculated according to Formula (9) through to Formula (14). In the formulae, the meanings of TP, TN, FP, and FN are shown in Table 5. TP is the number of accurate positive samples (actual minority and predicted as a minority), FN is the number of false-negative samples (actual minority, but predicted as a majority), FP is the number of false-positive samples (actual majority, but predicted as a minority), and TN is the number of accurate positive samples (actual majority, and predicted as a majority).
where n is the number of samples. For multi-class models, the macro average rule is used to calculate the above model evaluation measures.

Positive Class Negative Class
Positive class TP FN Negative class FP TN

Experimental Results and Analysis
In the construction part of the multi-classification system, the classification problem is extended to four exact types of pathological voice: vocal cord nodules, vocal cord polyps, vocal cord edema, and vocal cord paralysis. The model evaluation measure is obtained by recall, specificity, G value, and F1 value through 10-fold cross-validation. Meanwhile, each classifier training time is also presented to evaluate the data complexity. Table 6 shows the results of ten machine learning models before using FC-SMOTE to balance the dataset. The predictive ability of all models is inferior. Due to the classimbalanced distribution of the sample in the MEEI database, most of the model evaluation measures are lower than 0.5, which may also be due to the current lack of work on the multiclassification of pathological voices. The model evaluation measures of each classifier after imbalanced handing using the FC-SMOTE algorithm are shown in Table 7. Additionally, the results of external validation are shown by a multi-class confusion matrix in Figure 5.  It can be seen from the comprehensive model evaluation measures of the ten classifiers in Table 7 that the VPD system with FC-SMOTE performs better in terms of the recall, specificity, G value, and F1 value. Compared with the single model and the neural network model, the three ensemble models (RF, XGboost, GBDT) have the most significant improvement and have the best overall performance in model evaluation measures after the FC-SMOTE algorithm handing. The optimal classifier RF increases the recall to 0.83, the specificity to 0.95, the G value to 0.88, and the F1 value to 0.86. As an ensemble learning model, RF performs better than single classifiers in pathological voice classification, which is also reflected in the latest review paper [6]. Meanwhile, the same effect is shown in two other typical ensemble learning models (GBDT, XGBoost). The external validation set samples are regarded as independent data and do not participate in the imbalanced algorithm handing. From the multi-class confusion matrix of the ten models, it can be seen that the method we proposed has an excellent overall effect on the number of correct predictions for each class. Such results indicate that the FC-SMOTE processing is an effective method to build a multi-classification model of pathological voices, which means that the VPD system can learn minority classes better to improve performance in multi-classification. Since MFCC is a well-established feature in the field of speech recognition, the VPD system can also show promising results in binary classification detection in imbalanced datasets, as confirmed in a recent research work [43,44]. Therefore, in the class-imbalanced binary classification, we focus on the correct recognition of minority samples by models in binary classification. In addition, to respond to the experimental results after using the FC-SMOTE algorithm, an external validation set composed of the same samples was used to generate a confusion matrix. The confusion matrix gives the class-imbalanced binary classification results of external validation in Figure 6.  Figure 6 shows the confusion matrix of each machine learning model for detecting normal and pathological voices in the initial class-imbalanced dataset. As can be observed, the classifier's recognition ability is obviously biased towards the majority class samples (pathological type), while the recognition ability of the minority class samples (normal type) is lacking. Therefore, the model has misleading results for the overall accuracy (Acc = TP + TN TP + FN + FP + TN ). Meanwhile, the binary classification results of the classbalanced dataset are shown in Table 8. Similarly, in the binary classification our experiment focuses on the confusion matrix of the external dataset, as shown in Figure 7.  All the models exhibit good predictive capabilities for the binary classification results of normal and pathological voices with FC-SMOTE. Similarly, the ensemble learning model showed the best performance overall. RF performs better than the other models in all model evaluation measures for recall (1.00), specificity (1.00), G value (0.99), and F1 value (1.00). When comparing the distribution of the confusion matrix in the scenarios of Figures 6 and 7, it can be seen although the VPD system with the FC-SMOTE algorithm sacrifices some degree of recognition capability for majority class samples (pathological voice), it significantly improves it for the minority class samples (normal voice). For the time consumption of the machine learning model, the training time of a single classifier is the shortest, and the training time of the deep learning model is the longest. In contrast, the ensemble learning model reflects the best performance in the pathological voice detection and classification experiments of the MEEI database.
Furthermore, the area under the curve (AUC) of pathological voice multi-classification and binary classification in the external validation dataset corresponding to each model is shown by the corresponding PRC curve diagram. Figure 8a,b show the PRC curves of each model without and with the FC-SMOTE algorithm, respectively, in the multi-classification. Figure 8c,d show the binary classification PRC curves of the class-imbalanced and classbalanced datasets, respectively, which are graphed as curves. The commonly used model evaluation measure AUC/ROC is a biased metric. Although it is commonly used to summarize the ROC curve and is widely used in medical applications, it is too optimistic regarding the performance of the classifier when the data are imbalanced [8]. Both precision and recall are helpful in cases where there is a classimbalance in the observations. The precision and recall are focused on the positive class (the minority class) and are unconcerned with the true negatives (majority class). Therefore, the PRC was not impacted by the addition of majority classes. Compared with the AUC value of multi-classification in the initial class-imbalanced dataset, the VPD system with The perfect test will have a PRC that passes through the upper right corner (corresponding to 100% precision and 100% recall). Generally, the closer a PRC is to the upper right corner, the better the test is. Figure 8 reports the comparison results of the AUC, showing that the use of the FC-SMOTE algorithm can greatly improve the machine learning model's ability to detect the minority class voice samples. Meanwhile, the PRCs demonstrate that the VPD system with FC-SMOTE has a good multi-class classification ability for pathological voices and illustrates that FC-SMOTE has a high ability to improve the multi-classification efficiency. Furthermore, the PRC curve in Figure 8d shows the AUC for the binary classification with FC-SMOTE. It can be seen that the AUC can reach 1.00 in most models. Additionally, this indicates that the VPD system with the FC-SMOTE algorithm used in this paper can achieve excellent results in the diagnosis of pathological voice.
Subsequently, to widen the problem scope and verify the applicability of the proposed method, we experimented with the same scenario in the SVD database. For the multiclassification system, Tables 9 and 10 present the model evaluation measures of each classifier without and with the C-SMOTE algorithm by 10-fold cross-validation in the SVD database. In the experiment based on the MEEI database, we set up an external dataset to prove the effectiveness of the FC-SMOTE algorithm in generating samples. Therefore, in the SVD database experiment no external dataset was set to repeat this verification experiment. The results in Table 9 show that pathological voice classification with a class-imbalanced dataset cannot obtain good results in these typical machine learning models. This problem also exists in the SVD database. As can be seen from Tables 9 and 10 regarding the time consumption of the machine learning model of the SVD database, although the performance of the deep model obtained the best results, the training time was also significantly extended compared to that of the other two types of models. Therefore, in our opinion, the ensemble learning model with a sub-optimal performance is still a preferred choice for pathological voice detection and classification. Besides this, the overall performance of the machine learning model in the SVD database is lower than that in the MEEI database, which may be due to the higher complexity of the data in the SVD database. This is because there are multiple types of speech disorders in the same voice sample in the SVD database, resulting in classifiers that do not classify well. The data complexity of the two databases can be shown indirectly through the sample point distribution in Figure 4. Thus, in the result of the SVD database, the model based on deep learning is better than the other learning models, and MLP yields good results for recall (0.75), specificity (0.94), G value (0.84), and F1 value (0.76), which are better than those of the other models. However, it is worth noting that compared with the result of non-sampling, the performance of the VPD system constructed by the FC-SMOTE algorithm was significantly improved. Similarly, the binary classification results of the 10-fold cross-validation before and after the FC-SMOTE treatment are shown in Tables 10 and 11, respectively. Tables 11 and 12 show that the ten machine learning models with the FC-SMOTE algorithm in the SVD database perform better than the non-sampling models. The model evaluation measures produced by CNN are all 0.90, which is a little higher than that of other models. Thus, it can be concluded that the proposed method can effectively improve the performance of the pathological voice diagnosis model. To further demonstrate the effectiveness of FC-SMOTE in solving the problem of a class-imbalanced database, a comparison is made with SMOTE and two typical SMOTEbased extension approaches: Borderline-SMOTE and ADASYN. Table 13 shows the performance of SMOTE, Borderline-SMOTE, ADASYN, and the proposed FC-SMOTE with the optimal classifier in binary and multi-class pathological voice classification in the MEEI and SVD databases. The results show that our proposed FC-SMOTE method outperforms the other three methods by 0.02-0.17 and 0.03-0.14 in terms of the model evaluation measure in the MEEI and SVD databases, respectively. Thus, it can be concluded that the proposed method outperforms SMOTE, Borderline-SMOTE, and ADASYN in class-imbalanced pathological voice detection and classification.

Discussion
We considered the serious class-imbalance in the MEEI database, an international standardized database for pathological voice detection research. Researchers have ignored this problem in the past. Therefore, using Acc as the main model evaluation measure in most research on modeling VPD systems is misleading. This paper proposed a fuzzy clustering algorithm-based SMOTE oversampling technology to balance classes for a pathological voice database with a class-imbalance. The FC-SMOTE algorithm improves the SMOTE technique, which avoids unnecessary noise and classifier underfitting and overcomes the imbalanced data distribution between and within classes. FC-SMOTE has shown a good performance in processing class-imbalanced datasets with a high overlap density, such as in pathological voice samples.
In the research on pathological voice detection, most current works have regarded pathological voice detection as a binary classification whose results are only output as normal and pathological voice, and there is less research on multi-classification between different pathologies. Table 14 presents some of the performances in existing works on pathological voice detection based on cross-validation in the MEEI database. In Table 14, recall and specificity indicate the recognition ability of the classifier for the minority class and the majority class, respectively. These works were conducted in an imbalanced dataset, with notable deviations between recall and specificity ranging from 0.20 to 0.64. Our proposed method improves the recall and specificity by 0.04-0.65 and 0.00-0.07, respectively, compared to the performance in existing works.
In summary, our VPD system used the FC-SMOTE algorithm to handle the multiclass imbalanced data in the MEEI database and verified the generalization ability of the algorithm through a set of standard machine learning classifiers. In the multi-classification of vocal cord nodules, vocal cord polyps, vocal cord edema, and vocal cord paralysis, compared with a VPD system without FC-SMOTE algorithm, all the performance measures of our VPD system with the FC-SMOTE algorithm in terms of recall, specificity, G value, F1 value, and AUC/PRC are higher than the former. This justifies our claim that the proposed method is a promising strategy for voice pathological multi-class classification. Additionally, biomedical engineering is applied to pathological voice detection to solve class-imbalances in limited pathological voice databases.

Conclusions
This paper proposes a VPD system combined with an FC-SMOTE imbalanced learning algorithm. The FC-SMOTE combines the fuzzy c-means clustering algorithm with SMOTE. By analyzing the distribution between and within classes of the samples, different oversampling treatments are applied to minority class samples. The empirical experiment is firstly carried out in the MEEI database. A set of machine learning models are evaluated and validated using the resulting class-balanced dataset as an input, and a set of reasonable metrics is selected for model evaluation measures. Meanwhile, the effectiveness of FC-SMOTE is tested on an external validation set of the MEEI database. The experiment results are evaluated by the model evaluation measures recall, specificity, G value, F1 value, and AUC/PRC, which are reasonable metrics for class-imbalanced learning. The experimental results show that the VPD system without the FC-SMOTE algorithm has a very poor performance in recognizing minority classes and can hardly multi-classify pathological voices types. The VPD system with FC-SMOTE is an effective method to build a multi-classification model of pathological voice, which means that the VPD model can learn minority classes better to improve the VPD system's performance. Subsequently, to broaden the scope of the problem and maximize the applicability of the proposed method, the same experiment using the proposed method was conducted in the SVD database. The results again verify the effectiveness of our proposed method. Finally, the performance of FC-SMOTE and traditional oversampling algorithms are discussed. In pathological voice detection and classification, FC-SMOTE shows a better performance than traditional oversampling methods.
In future research, our method can be applied to assist clinicians in diagnosing pathological voices, solving the problem of insufficient medical resources and shortening the diagnosis time for patients. Besides this, FC-SMOTE can further learn more pathological voice types to propose an effective VPD system for more types of class-imbalanced pathological voice multi-classification problems.