Hierarchical Boosting Dual-Stage Feature Reduction Ensemble Model for Parkinson’s Disease Speech Data

As a neurodegenerative disease, Parkinson’s disease (PD) is hard to identify at an early stage, and machine learning diagnosis models built on speech data have proved effective for early diagnosis. However, speech data show high degrees of redundancy and repetition and contain unnecessary noise, all of which reduce the accuracy of diagnosis results. Although feature reduction (FR) can alleviate this issue, traditional FR is one-sided: traditional feature extraction can construct high-quality features but has no feature preference, while traditional feature selection can achieve feature preference but cannot construct high-quality features. To address this issue, the Hierarchical Boosting Dual-Stage Feature Reduction Ensemble Model (HBD-SFREM) is proposed in this paper. The major contributions of HBD-SFREM are as follows: (1) The instance space of the deep hierarchy is built by an iterative deep extraction mechanism. (2) The manifold feature extraction method embeds the nearest neighbor feature preference method to form the dual-stage feature reduction pair. (3) The dual-stage feature reduction pair is iteratively performed by the AdaBoost mechanism to obtain instance features of higher quality, thus achieving a substantial improvement in model recognition accuracy. (4) The deep hierarchy instance space is integrated into the original instance space to improve the generalization of the algorithm. Three public PD speech datasets and a self-collected dataset are used to test HBD-SFREM in this paper. Compared with other FR algorithms and deep learning algorithms, HBD-SFREM significantly improves the accuracy of PD speech recognition and is not degraded by small sample sizes. Thus, HBD-SFREM could serve as a reference for other related studies.


Introduction
Parkinson's disease (PD) is a neurodegenerative disease characterized by motor stiffness, movement retardation, tremor, and some non-motor symptoms (NMS, such as bass disorder, sleep disorder, depression, constipation, pain, and dysarthria). Numerous studies have shown that PD patients develop NMS as the disease progresses, which seriously affects their quality of life [1].
NMS can be detected at the early stage of the disease, which allows a sound treatment plan to be designed. Dysarthria is the primary NMS and plays a guiding role in the study of PD pathogenesis. In addition, the advantages of speech data collection have made speech analysis gradually become the main analysis method for PD recognition as well as a key research area for early PD recognition [2].
However, speech data exhibit a high rate of redundancy and repetition and contain much unnecessary noise. Feature reduction (FR) can help alleviate this issue, and the topic has attracted extensive attention from researchers [3]. Early FR research on PD speech recognition primarily focused on feature selection, which can be understood simply as selecting the optimal feature subset from the original feature space. Feature selection algorithms used for this purpose include Relief [4][5][6], mRMR [3], SBS [7], PSO [8], SFS [9], LASSO [2,4], and Pvalue [10]. Erika R et al. used the Pvalue algorithm to select the optimal subset of features from the original features [10]. Sakar and Kursun [11] proposed a new feature selection algorithm based on mutual information and trained the model using support vector machines (SVM), achieving an accuracy of 92.75%. Musa Peker [12] used mRMR to identify valid features and then fed the obtained features into a complex-valued artificial neural network. Benba et al. [13] selected features based on pathology thresholds through a multi-dimensional voice detection procedure (MDPV) and then fed the obtained features to K-nearest neighbors (KNN) and SVM classifiers, achieving an accuracy of 95%. Shirvan RA et al. [14] used genetic algorithms and KNN to determine the optimal features that affected the recognition result.
Feature extraction is another type of FR algorithm. The idea is to map the high-dimensional features to a low-dimensional space while keeping as much of the information of the original instances as possible [15]. Linear approaches were primarily used at first, with PCA [16,17] and LDA [18][19][20][21] as representative methods. Chen et al. [21] developed a PD detection system that used PCA to extract features and trained the model with a fuzzy KNN classifier, which achieved an accuracy of 96.7%. Hariharan M et al. extracted features of PD using PCA and LDA and obtained a high accuracy rate [10]. Linear feature extraction methods generally assume that the data lie in a high-dimensional linear space, which conflicts with the non-linear characteristics of real-world PD speech datasets [22][23][24]. Thus, linear feature extraction does not transfer well to non-linear data spaces, which limits the accuracy of PD recognition [25]. More recently, non-linear feature extraction has been developed and applied to PD recognition [19,26,27]. Kernel mapping and deep neural network mapping are two representative types of non-linear feature extraction methods. Yang achieved good results by extracting features of PD speech data through SFS and kernel PCA [19]. Derya A proposed the Genetic Algorithm-Wavelet Kernel-Extreme Learning Machine (GA-WK-ELM), in which wavelet kernels were used to map non-linear features from PD speech data [25]. Grover used deep neural networks to process PD speech data features and predict the severity of PD [26]. Camilo considered multimodal information, including not only speech data of PD patients but also handwriting, gait, and posture data, and trained the recognition model with deep learning methods [27].
Manifold learning is another type of feature extraction method that can be applied to small sample datasets. Locality preserving projection (LPP) is a representative manifold learning algorithm, which preserves the nearest neighbor structure between data samples after feature extraction while minimizing the dimension of the features [28]. However, since LPP is a nearest neighbor retention algorithm, most of the improved algorithms based on LPP only focus on the differences between classes and do not consider the large differences within classes [29][30][31]. Liu et al. considered both interclass and intraclass data aliasing, which effectively addresses these problems [16].
In recent studies, some scholars have attempted to integrate the advantages of feature selection and feature extraction to create hybrid feature processing methods. M. Hariharan et al. [9] proposed a hybrid system that uses SFS and PCA to process the data features and feeds the processed features into a least square support vector machine classifier to learn the prediction model. H. Almayyan et al. [32] proposed a hybrid recognition system that uses PCA and Relief for feature processing and SVM combined with recursive feature elimination (SVMRFE) as a classifier to train the model. In addition, that study also used the SMOTE technique to equalize and diversify the dataset.
Based on the above analysis, FR methods can address the high redundancy, high repetition, and noise of speech data. However, traditional feature extraction can construct high-quality features but cannot achieve feature preference, while traditional feature selection can achieve feature preference but cannot construct high-quality features. The two types of FR methods differ in principle but complement each other. Thus, a feature reduction method is needed that simultaneously achieves feature preference and high-quality feature construction. Although some related studies have made progress in this field [21,32], critical problems remain to be solved: (1) feature extraction and feature selection are typically integrated only once, and without multiple iterations to search for the optimal fusion, higher-quality merged features cannot be obtained; (2) existing methods only consider the feature information of the samples in the original space and ignore the structural information of instances in deeper spaces. To address these issues, the Hierarchical Boosting Dual-Stage Feature Reduction Ensemble Model for Parkinson's disease speech data (HBD-SFREM) is proposed in this study. The major contributions and innovations of this model are listed below.

1. The instance space of the hierarchy is built by an iterative deep extraction mechanism (IDEM).
2. The manifold feature extraction method embeds the nearest neighbor feature preference method to form a dual-stage feature reduction pair module.
3. The dual-stage feature reduction pair (D-Spair) module is iteratively performed by the AdaBoost mechanism to obtain higher quality features, thus achieving a substantial improvement in model diagnosis accuracy.
4. The deep hierarchy instance space is integrated into the original instance space to enhance the generalization ability of the model.

The writing structure of this paper is given here. Section 2 introduces the principles related to the proposed model; Section 3 describes the experiments designed in this paper as well as the presentation and analysis of the results; Section 4 analyzes the limitations and contributions of this study.

Symbol Description
To facilitate the presentation of the HBD-SFREM, some symbols are defined first. The datasets used in this study are numerical matrices, described as $X \in \mathbb{R}^{N \times D}$, where $N = N_1 + N_2 + \cdots + N_C$. By default, each row represents an instance; $N$ is the number of instances in $X$, $D$ is the dimension of $X$, and $C$ is the number of categories in the dataset. The labels of the instances are expressed as $y = [y_1, y_2, \cdots, y_N]^T \in \mathbb{R}^N$. The number of instances in each hierarchy is determined by the number of instances in the upper hierarchy and $P$, where $P$ is the proportion of instances retained when the iterative deep extraction mechanism (IDEM) is performed. The mapping matrix generated by the D-Spair maps $\mathbb{R}^D$ to $\mathbb{R}^d$, where $\mathbb{R}^D$ is the high-dimensional data space and $\mathbb{R}^d$ is the low-dimensional data space ($d < D$).

Construction of the Different Hierarchy Instance Space
In this part, $H$ denotes the number of layers of the hierarchical instance space and $n$ the number of independent instance subspaces. One of the primary innovations of this paper is that the deep hierarchy instance space is constructed based on IDEM. The relationship between the different hierarchies of instance spaces is captured by learning from the instances of each hierarchy space, which also improves the generalization ability of the final model.
In the IDEM mechanism, $\pi_j$ denotes a cluster and the clustered partition of the data points is denoted by $\{\pi_j\}_{j=1}^{k}$, while the radial basis function $\varphi$ is used to map the data to a high-dimensional space. The objective function is defined over this partition, where $m_j$ is the center of each cluster and $x_i$ are the instances of $X_{train}$. Assuming that each cluster has the same weight, the Euclidean distance from each mapped sample $\varphi(s)$ to the cluster center $m_j$ is then computed.
Figure 1 describes the detailed process of the IDEM. The IDEM is based on the k-means clustering method with a radial basis kernel [33][34][35]. The original dataset is defined as the first hierarchy instance, and the IDEM mechanism is used to cluster this hierarchy instance to generate the second hierarchy instance. Then, the second hierarchy instance is clustered to generate the third hierarchy instance, and so on until the $H$-th hierarchy instances are generated, where $H \in \mathbb{N}^+$ ($\mathbb{N}^+$ is the set of positive integers). The number of newly generated instances is $P\%$ of the number of instances in the upper hierarchy.
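The clustering step described above can be illustrated with a minimal kernel k-means sketch. It is not the authors' implementation: the function names, the kernel form $\exp(-\|x_i - x_l\|^2 / t)$, the random initialization, and the optional `init` argument are assumptions made for illustration. The sketch relies only on the standard identity $\|\varphi(x) - m_j\|^2 = K(x,x) - \frac{2}{|\pi_j|}\sum_{x_i \in \pi_j} K(x, x_i) + \frac{1}{|\pi_j|^2}\sum_{x_i, x_l \in \pi_j} K(x_i, x_l)$, which lets the distance to a cluster center be computed without forming $\varphi$ explicitly.

```python
import numpy as np

def rbf_kernel(X, t=1.0):
    """Gram matrix K[i, l] = exp(-||x_i - x_l||^2 / t) (assumed kernel form)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / t)

def kernel_distances(K, labels, k):
    """Squared distance ||phi(x_i) - m_j||^2 for every instance i and cluster j."""
    n = K.shape[0]
    dist = np.full((n, k), np.inf)
    for j in range(k):
        members = labels == j
        size = members.sum()
        if size == 0:
            continue  # an empty cluster keeps infinite distance
        cross = K[:, members].sum(axis=1) / size
        within = K[np.ix_(members, members)].sum() / size ** 2
        dist[:, j] = np.diag(K) - 2.0 * cross + within
    return dist

def kernel_kmeans(X, k, t=1.0, n_iter=50, seed=0, init=None):
    """Alternate assignment and (implicit) center update until labels stop changing."""
    rng = np.random.default_rng(seed)
    K = rbf_kernel(X, t)
    if init is None:
        labels = rng.integers(0, k, size=X.shape[0])  # random start (assumption)
    else:
        labels = np.asarray(init).copy()
    for _ in range(n_iter):
        new_labels = kernel_distances(K, labels, k).argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels
```

Under the description above, one would presumably choose the number of clusters at each level as roughly $P\%$ of the instances in the upper hierarchy; how each cluster is turned into a new instance and label is not specified in the recoverable text, so the sketch stops at the cluster assignment.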

Boosting Dual-Stage Feature Reduction Pair Ensemble Module
PD speech datasets are typically small and exhibit high repetition, high redundancy, and a certain amount of noise. To address these characteristics, the boosting dual-stage feature reduction pair ensemble module (BD-SFREM) is designed, which consists of the dual-stage feature reduction pair (D-Spair) module and the boosting ensemble module.

D-Spair module;
Suppose the number of instances of the $c$-th class is $N_c$; then the total number of instances is $N = N_1 + N_2 + \cdots + N_C$. In the first step, D-Spair draws instances belonging to the same category closer together after mapping; that is, the within-class variance matrix of similar samples is reduced. This matrix is computed from the center of the $c$-th class and the samples belonging to the same class.
Similarly, instances with different class labels are mapped as far apart as possible; that is, the variance (scatter) matrix between different classes should be increased as much as possible. This matrix is computed from the local class centers and the number of instances of the $c$-th class in the local part.
In addition, the nearest neighbor structure between samples is preserved during the mapping process (i.e., locality preservation). Thus, the objective function of the feature extraction part of the D-Spair is designed to minimize the local variance matrix within the same category and maximize the variance matrix between different categories, while preserving the nearest neighbor structure of each instance. Based on Equations (3)-(5), the feature extraction part is expressed as Equation (6).
Equation (6) can be transformed by the Lagrange multiplier method into Equation (7), and the derivative with respect to $M$ is taken to obtain the optimality condition, Equation (8), where $\lambda$ and $\gamma$ are penalty factors. Solving Equation (8) gives the projection matrix $M \in \mathbb{R}^{D \times d}$, whose columns are the generalized eigenvectors of $(\mu S_{DC} - \gamma X_{train} A X_{train}^T)^{-1} S_{SC}$ associated with the $d$ largest eigenvalues $\lambda$. The matrix $M_k = (m_1, m_2, \cdots, m_k)$ is composed of the first $k$ eigenvectors of $M$. Next, $M_k$ is used to map $X_{train}$, which yields the extracted high-quality features.
Define the mapped training set as $S$ and divide it into $S^+$ and $S^-$ according to the class labels of the instances. An instance $X_i$ is randomly selected from $S$ without replacement ($X_i \in S$). According to the nearest neighbor criterion, one instance is also selected from each of $S^+$ and $S^-$, denoted $nearst_{i+}$ and $nearst_{i-}$, respectively. Assume that $X_i$ has $p$ features, i.e., each $X_i$ is a $p$-dimensional vector. Similarly, $W_i$ denotes the feature weights of $X_i$ and is also a $p$-dimensional vector, as are $nearst_{i+}$ and $nearst_{i-}$. First, the weights are initialized; once the weights have been computed, the optimal features are selected according to them. Then, these optimal features are used to train the classifier.
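The nearest neighbor feature preference step described above follows the general pattern of the Relief algorithm. The sketch below is a minimal classic Relief variant, given only as an illustration: the function name `relief_weights`, the min-max scaling, the L1 neighbor distance, and the update rule (reward the difference to the nearest other-class instance, penalize the difference to the nearest same-class instance) are assumptions, since the exact update used in D-Spair is not recoverable from the text.

```python
import numpy as np

def relief_weights(X, y, n_samples=None, seed=0):
    """Classic Relief-style feature weights (illustrative, not the authors' code)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n, p = X.shape
    # Scale each feature to [0, 1] so per-feature differences are comparable.
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0
    Xs = (X - X.min(axis=0)) / span
    m = n if n_samples is None else min(n_samples, n)
    w = np.zeros(p)  # weights start at zero
    # Draw instances without replacement, as in the description above.
    for i in rng.choice(n, size=m, replace=False):
        d = np.abs(Xs - Xs[i]).sum(axis=1)  # L1 distance to every instance
        d[i] = np.inf                       # never pick the instance itself
        same = y == y[i]
        hit_d = np.where(same, d, np.inf)
        miss_d = np.where(~same, d, np.inf)
        if not (np.isfinite(hit_d.min()) and np.isfinite(miss_d.min())):
            continue  # no same-class or other-class neighbor available
        hit, miss = hit_d.argmin(), miss_d.argmin()
        w += (np.abs(Xs[i] - Xs[miss]) - np.abs(Xs[i] - Xs[hit])) / m
    return w
```

Features can then be ranked with `np.argsort(w)[::-1]` and the top-ranked ones passed to the classifier; how many features to keep is a tuning choice that the sketch leaves open.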

Boosting ensemble module;
In the boosting ensemble module, the AdaBoost mechanism is used to combine various D-Spair, thereby constructing the boosting ensemble module. Finally, the pseudocode of BD-SFREM is shown as follows.

BD-SFREM
Input:
$X_{train}$: training dataset (an N × D matrix) and the corresponding labels $Y_{train}$ (an N × 1 matrix)
$X_{valid}$: validation dataset (an N × D matrix) and the corresponding labels $Y_{valid}$ (an N × 1 matrix)
$X_{test}$: test dataset (an N × D matrix) and the corresponding labels $Y_{test}$ (an N × 1 matrix)
$T$: number of times the boosting module is used
Threshold: flag for ending the boosting module
Output: Final prediction $P_{final\_i}$, $K$ of the independent instance space
Use the D-Spair module to obtain dual-stage features.
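Since most of the pseudocode body is not recoverable, the following is only a generic sketch of how an AdaBoost loop with the inputs `T` and `Threshold` could be organized. It is discrete AdaBoost with labels in {-1, +1}; the decision stump is a stand-in for the "D-Spair features plus SVM" weak learner, and reading `threshold` as an upper bound on the weighted error that ends boosting is an assumption.

```python
import numpy as np

def fit_stump(X, y, w):
    """Weighted decision stump; a stand-in for the D-Spair + classifier step."""
    best = (np.inf, 0, 0.0, 1)
    for f in range(X.shape[1]):
        for thr in np.unique(X[:, f]):
            for sign in (1, -1):
                pred = np.where(X[:, f] <= thr, -sign, sign)
                err = w[pred != y].sum()
                if err < best[0]:
                    best = (err, f, thr, sign)
    _, f, thr, sign = best
    return lambda Z: np.where(Z[:, f] <= thr, -sign, sign)

def adaboost(X, y, T=10, threshold=0.5, fit_weak=fit_stump):
    """Discrete AdaBoost; stops early when the weighted error reaches `threshold`."""
    n = X.shape[0]
    w = np.full(n, 1.0 / n)          # uniform instance weights to start
    learners, alphas = [], []
    for _ in range(T):
        h = fit_weak(X, y, w)
        pred = h(X)
        err = w[pred != y].sum()
        if err >= threshold:         # assumed meaning of the Threshold input
            break
        err = max(err, 1e-12)        # avoid division by zero for a perfect learner
        alpha = 0.5 * np.log((1.0 - err) / err)
        learners.append(h)
        alphas.append(alpha)
        w = w * np.exp(-alpha * y * pred)  # raise weights of misclassified instances
        w = w / w.sum()

    def predict(Z):
        score = np.zeros(Z.shape[0])
        for a, h in zip(alphas, learners):
            score += a * h(Z)
        return np.where(score >= 0, 1, -1)

    return predict
```

Any callable with the signature `fit_weak(X, y, w)` that returns a prediction function can replace the stump, which is where a D-Spair feature reduction step followed by a weighted classifier would be plugged in.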

Hierarchical Space Instance Learning Mechanism
The implementation of the hierarchical space instance learning mechanism is based on the construction of the different hierarchy spaces and BD-SFREM. First, the IDEM mechanism is used to construct the deep hierarchy space. Then, the BD-SFREM is applied to different hierarchy spaces to perform the hierarchy space instance learning mechanism, and the results of the deep hierarchy spaces are integrated with the results of the original hierarchy spaces in order to improve the generalization ability of the model.
In the pseudocode of the hierarchical space instance learning mechanism, $A$ denotes the labels of the output samples after IDEM, and $a_j$ represents the labels that belong to the same category.

Overall Description of the Proposed Model
This part gives an overall description of the proposed model (HBD-SFREM). First, the different hierarchy spaces are constructed by IDEM. Second, a boosting dual-stage feature reduction process (the boosting dual-stage feature reduction pair ensemble module) is established based on the proposed objective function. Finally, this module is applied to the different hierarchy spaces to perform hierarchy space instance learning, and the results of the deep hierarchy spaces are integrated with the results of the original hierarchy instance space in order to improve the generalization ability of the algorithm. Figure 2 depicts the algorithm of this paper.

Datasets
Three representative PD speech datasets and a self-collected PD speech dataset were utilized to validate the innovation of the HBD-SFREM.
LSVT: The LSVT dataset was created by Professor Athanasios Tsanas of the University of Oxford (tsanasthanasis@gmail.com). The purpose of this dataset was to assess the effectiveness of rehabilitation treatment. In total, 14 subjects with PD (eight male and six female) participated in the entire data collection process. For more details, see [36].
PSDMTSR: The dataset consisted of a total sample of 40 subjects, in which 20 samples were from people with PD and 20 samples were from healthy people. For more details, see [37].
Parkinson: A total of 31 subjects' speech data were collected in this dataset, 23 of whom were people with PD and eight of whom were healthy. For more details, see [38].
SelfData: The dataset was collected from a total of 31 subjects, 10 of whom suffered from PD and 21 of whom were healthy. Specifically, five of the 10 with PD were male and five were female; 12 of the 21 healthy subjects were male and nine were female. Thirteen voice segments (samples) were collected for each subject, and each voice segment consisted of 26 features. The SONY ICD-SX2000 recording tool was used for voice acquisition, and the recording tool was kept at a distance of 15 cm from the subject's lips during the acquisition. Each subject was asked to read a specific piece of pronunciation material and the pronunciation made by each subject was recorded. The sampling was set to a frequency of 44.1 kHz and the resolution was set to 16 bits.
Three of the four datasets (LSVT, PSDMTSR, and Parkinson) are publicly available and can be downloaded from the UCI dataset repository created by the University of California, Irvine (www.archive.ics.uci.edu/ml/index.php (accessed on 24 November 2021)). The Chinese Army Medical University provided the SelfData dataset. Brief information about the datasets is shown in Table 1. For the LSVT dataset, 'healthy people' denotes the number of patients whose clinicians allowed ongoing rehabilitation, and 'patients' denotes the number of patients whose clinicians did not allow rehabilitation. For the SelfData dataset, 'healthy people' denotes the number of patients treated with the relevant medication, and 'patients' denotes the number of patients treated with the relevant medication before.

Experimental Environment
All experiments were conducted in MATLAB version 2017b, running on a PC with 64-bit Windows 10, an Intel(R) Core i5-2300 CPU (2.80 GHz), and 8 GB of RAM. Praat, a computer speech processing software package, was used to analyze and extract the speech features in this paper. The base classifier used in this study was the SVM. For optimal performance of the D-Spair, the affinity matrix $Z$ was constructed with adjustable regularization coefficients $\lambda$ and $\gamma$ and an adjustable kernel parameter $t$, each tuned over the set $\{10^{-4}, 10^{-3}, 10^{-2}, \cdots, 10^{2}, 10^{3}, 10^{4}\}$. The dimension $d$ of the subspace stack network was tuned over the set $\{5, 10, 15, \cdots\}$. The local ratios $r_b$ and $r_w$ were empirically set to 0.9 for this study. The parameter description and setting of the HBD-SFREM are shown in Table 2. In this study, all experiments were repeated ten times and the statistical results are reported.
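As a small illustration of the search space described above, the following sketch enumerates the candidate settings. The function name `parameter_grid`, the dictionary keys, and the upper bound `max_dim` on the subspace dimension are hypothetical; the text only gives the set as {5, 10, 15, ...}, and the powers of ten are assumed to run over consecutive integer exponents from -4 to 4.

```python
import itertools

# Candidate values 10^-4, 10^-3, ..., 10^4 for lambda, gamma, and t.
POWERS_OF_TEN = [10.0 ** e for e in range(-4, 5)]

def parameter_grid(max_dim):
    """Yield every (lambda, gamma, t, d) combination; `max_dim` is an assumed cap."""
    dims = range(5, max_dim + 1, 5)
    for lam, gamma, t, d in itertools.product(
        POWERS_OF_TEN, POWERS_OF_TEN, POWERS_OF_TEN, dims
    ):
        yield {"lam": lam, "gamma": gamma, "t": t, "d": d}
```

With nine values per coefficient the grid grows quickly (729 combinations per candidate dimension), so in practice the search would be run against the validation set rather than the test set.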

Evaluation Criteria
A newly proposed algorithm needs to be evaluated using a series of criteria. This study selected five evaluation metrics to comprehensively evaluate the HBD-SFREM: accuracy (Acc), precision (Pre), recall (Rec), and the two comprehensive metrics F-score and G-mean. All of these metrics are computed from a confusion matrix, a table that summarizes the model predictions against the true labels [39]. The PD speech diagnosis studied in this paper is a binary classification problem, so the confusion matrix was constructed as shown in Table 3.
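For reference, the five metrics can be computed from the four cells of a binary confusion matrix as sketched below. The paper's own formulas are not recoverable from the text, so the sketch assumes the usual definitions: F-score as the harmonic mean of precision and recall, and G-mean as the geometric mean of recall (sensitivity) and specificity. The function and key names are illustrative.

```python
import math

def binary_metrics(tp, fn, fp, tn):
    """Acc, Pre, Rec, F-score, and G-mean from a binary confusion matrix."""
    total = tp + fn + fp + tn
    acc = (tp + tn) / total
    pre = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    spe = tn / (tn + fp) if tn + fp else 0.0  # specificity, needed for G-mean
    f_score = 2 * pre * rec / (pre + rec) if pre + rec else 0.0
    g_mean = math.sqrt(rec * spe)
    return {"Acc": acc, "Pre": pre, "Rec": rec, "F-score": f_score, "G-mean": g_mean}
```

G-mean is informative on imbalanced datasets because it drops toward zero whenever either class is recognized poorly, even if overall accuracy stays high.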

Results and Analysis
In this part, the ablation method was used to verify the major innovations of the HBD-SFREM, and representative feature extraction and feature selection algorithms were then selected for comparison. Furthermore, existing feature reduction algorithms for PD speech recognition and two deep learning methods were also compared with the proposed model. In the experiments, the hold-out method was used to divide the PD speech dataset: the dataset was randomly partitioned into three disjoint sets, namely the training, validation, and test sets. As multiple speech segments (instances) were collected for each subject in the datasets used, all instances from the same subject were assigned to the same set. This prevents instances of one subject from appearing in more than one set, so that the reported results reflect performance on unseen subjects.
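The subject-wise hold-out split described above can be sketched as follows. The split fractions and the function name `subject_holdout` are assumptions, since the text does not state the proportions used; the point of the sketch is only that the shuffle is applied to subject identifiers rather than to individual instances.

```python
import numpy as np

def subject_holdout(subject_ids, frac=(0.6, 0.2, 0.2), seed=0):
    """Return (train, valid, test) instance indices with no subject shared between sets."""
    rng = np.random.default_rng(seed)
    subject_ids = np.asarray(subject_ids)
    subjects = rng.permutation(np.unique(subject_ids))  # shuffle subjects, not instances
    n_train = int(round(frac[0] * len(subjects)))
    n_valid = int(round(frac[1] * len(subjects)))
    groups = (
        subjects[:n_train],
        subjects[n_train:n_train + n_valid],
        subjects[n_train + n_valid:],
    )
    return tuple(np.flatnonzero(np.isin(subject_ids, g)) for g in groups)
```

For example, with 10 subjects and 3 segments each, the assumed 60/20/20 split assigns 6, 2, and 2 subjects (18, 6, and 6 instances) to the training, validation, and test sets.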

Verification of the Effectiveness of HBD-SFREM
This section introduces the verification results of the innovation of HBD-SFREM, including the results of the BD-SFREM and those of the hierarchical space instance learning mechanism. It is worth noting that since the construction of the different hierarchy space is the basis for its learning mechanism, the validity of the hierarchical space instance learning mechanism could further prove the effectiveness of the construction of the different hierarchy instance space.

1. Verification of the BD-SFREM
This part gives the results of both D-Spair and BD-SFREM. Two feature processing methods were chosen to construct the D-Spair: local discriminant preservation projection (LDPP) and Relief. To present the results clearly, the following symbols are defined. Only-FE denotes using only LDPP to process the features, and Only-FS denotes using only Relief. D-Spair stands for the results of the D-Spair module, while BD-SFREM represents the results of the boosting dual-stage feature reduction pair ensemble module. (B) denotes the binary mode of the affinity matrix in the feature extraction, and (H) denotes the heat kernel mode. The experiments in this section were performed in the original instance space.
As shown in Table 4, for LSVT, Parkinson, and PSDMTSR, the BD-SFREM had the best results in Acc, Pre, Rec, G-mean, and F-score regardless of the classifier, while for SelfData, the BD-SFREM had the best results in Acc and Pre. In addition, the results of D-Spair and BD-SFREM were much more accurate than those of Only-FS and Only-FE. Thus, the D-Spair module and BD-SFREM are effective. Three of the four datasets used in this paper are imbalanced. The results in Table 4 indicate that the BD-SFREM is helpful in handling imbalanced datasets, and its advantages are most obvious on the LSVT, PSDMTSR, and Parkinson datasets. Since the quality of the self-collected dataset was lower than that of the public datasets, the model was less effective on it. However, this can be improved by the IDEM mechanism, as illustrated in the next section.

Table 4. Results of the validation of the algorithm using the ablation method (%).

This section compares the results of the deep hierarchy instance space with those of the original instance space, and illustrates the effectiveness of the hierarchical space instance learning mechanism. (O) represents the results in the original instance space and (H) the results in the deep hierarchy instance space. For example, Only-FS (O) stands for the results of Only-FS in the original instance space, and Only-FS (H) for its results in the deep hierarchy instance space.

As shown in Table 5, the results of the deep hierarchy instance space (H) were improved for all PD speech datasets and all methods compared with the results of the original instance space (O). For LSVT, PSDMTSR, and SelfData, the results of (H) were clearly better than those of (O). For Parkinson, the results of (H) were also improved, though only slightly. The last two columns of the table are the results of BD-SFREM, for which the results of (H) were clearly better than those of (O) on all datasets, with a maximum improvement rate of 9.53% on the LSVT dataset. Therefore, the hierarchical space instance learning mechanism in this paper is effective.

Table 5. Verification of hierarchy space instance learning mechanism (%).

Table 6 shows the results of HBD-SFREM in different spaces (in which the SVM (RFE) classifier is used). From the results in Table 6, we can see that the integrated output is always optimal, which further improves the generalization performance of the whole model.

In this section, some representative feature processing methods, such as mRMR, Pvalue, SVMRFE, PCA, and LDA, were selected for comparison with the proposed model (HBD-SFREM). Because deep learning is also a major family of feature processing methods, two of its representative methods, namely the deep belief network (DBN) and the stacked encoder (SE), were also compared with HBD-SFREM. To facilitate the presentation of the results, the following symbols are defined. HBD-SFREM (B) stands for the results in mode B, and HBD-SFREM (H) for the results in mode H.

As shown in Table 7, HBD-SFREM outperformed the reference algorithms on Acc and Pre across all datasets and classifiers. For the LSVT dataset, HBD-SFREM also outperformed the reference groups on Rec, G-mean, and F-score. For the PSDMTSR and Parkinson datasets, the G-mean and F-score of HBD-SFREM were higher than those of the reference groups. For SelfData, the Acc and Pre of HBD-SFREM were better than those of the reference groups. To show the advantages of HBD-SFREM more clearly, the results of the SVM (RBF) classifier on the different datasets are given in Figure 3, where HBD-SFREM achieved the best accuracy. In summary, HBD-SFREM outperformed the reference groups in most cases, which further verifies its effectiveness.
Figure 3. Comparison Results Using Different Datasets.

In addition, the ROC curves of all models on the different datasets are shown in Figure 4. From Figure 4, we can see that the area under the curve (AUC) of HBD-SFREM is higher than that of the comparison models. It is worth noting that, since SelfData is designed to simulate the real diagnosis environment of doctors, it is weaker in quality than the other three public datasets; even under such conditions, the AUC results shown in Figure 4 still indicate that HBD-SFREM is better than the comparative methods.

Comparison with Relevant PD Speech Recognition Methods
HBD-SFREM primarily aims to improve the accuracy of PD speech recognition. This section demonstrates the effectiveness of HBD-SFREM by comparing it with other PD speech FR algorithms. The reference algorithms are as follows:
(1) Relief-SVM [4]: This method was used by Little in 2012. Four feature processing methods are first applied to the features of the dataset, and Relief is then combined with an SVM classifier with a linear kernel (Relief-SVM) to learn a model.
(2) mRMR classifier [3]: This method was used by Sakar in 2018. In [3], feature selection is first performed using mRMR, and the predictions of seven classifiers are then integrated by voting or stacking strategies.
(3) LDA-NN-GA [20]: This algorithm was proposed by L Ali and C Zhu in 2019. In [20], the dataset is partitioned into a training set and a test set using the leave-one-subject-out (LOSO) method: since each subject in the dataset contributes multiple samples, all samples from one subject are left out at a time. The feature dimension of the dataset is then reduced using the LDA dimension reduction algorithm, and a BP neural network optimized by a genetic algorithm is used to train the optimal prediction model (LDA-NN-GA).
(4) FC-SVM [6]: This algorithm was proposed by Cigdem O in 2018. In [6], a Fisher criterion (FC)-based feature selection method is used to rank the feature weights; the first K useful features are then selected based on a threshold and input to the classifier (SVM with an RBF kernel) for training to obtain the model.
(5) SFFS-RF [40]: This algorithm was proposed by Galaz Z in 2016. In that study, the sequential floating feature selection algorithm (SFFS) is adopted to process the data features, and the processed results are then input into an RF classifier to learn the prediction model.
Table 8 shows that HBD-SFREM performed better than the other algorithms overall. For LSVT and Parkinson, its results were higher than those of the other algorithms, and the largest improvement rates in accuracy were 16.67% and 38.71%, respectively, demonstrating the advantages of HBD-SFREM. For SelfData and PSDMTSR, the results of HBD-SFREM were higher than those of the other algorithms in most cases, and the largest improvement rates in accuracy were 22.37% and 22.27%, respectively. In addition, the comparison algorithms selected in this section did not perform as well as reported in the relevant studies, probably because the experimental conditions in this study differed slightly from those used in the reference group. For instance, the data diversity method differed from the method used by the authors in [20]. Additionally, the number of training instances used in this study was smaller than that in [20]. In general, the larger the number of training instances, the higher the prediction accuracy of the trained model.
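Two of the steps described for the reference algorithms, the subject-wise hold-out of [20] and the Fisher-criterion ranking of [6], can be illustrated with a short sketch. It is not the implementation used in those studies: the helper names (`loso_splits`, `fisher_scores`, `top_k_features`), the toy data, and the specific two-class Fisher score, (mean1 - mean0)^2 / (var1 + var0), are assumptions chosen for illustration, since the exact criterion in [6] is not restated here.

```python
from statistics import mean, pvariance


def loso_splits(subject_ids):
    """Yield (held_out_subject, train_idx, test_idx), one fold per subject.

    All samples of the held-out subject go to the test set together.
    """
    for held_out in sorted(set(subject_ids)):
        test_idx = [i for i, s in enumerate(subject_ids) if s == held_out]
        train_idx = [i for i, s in enumerate(subject_ids) if s != held_out]
        yield held_out, train_idx, test_idx


def fisher_scores(X, y):
    """Assumed two-class Fisher score per feature: (m1 - m0)^2 / (v1 + v0)."""
    scores = []
    for j in range(len(X[0])):
        col0 = [row[j] for row, label in zip(X, y) if label == 0]
        col1 = [row[j] for row, label in zip(X, y) if label == 1]
        gap = (mean(col1) - mean(col0)) ** 2
        spread = pvariance(col0) + pvariance(col1)
        if spread > 0:
            scores.append(gap / spread)
        else:
            # Zero within-class variance: perfectly separating or constant feature.
            scores.append(float("inf") if gap > 0 else 0.0)
    return scores


def top_k_features(X, y, k):
    """Indices of the k features with the largest Fisher score."""
    scores = fisher_scores(X, y)
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order[:k]


# Hypothetical toy data: 4 subjects, 2 recordings each, 2 features.
# Feature 0 separates the classes; feature 1 is noise.
subject_ids = ["s1", "s1", "s2", "s2", "s3", "s3", "s4", "s4"]
y = [0, 0, 0, 0, 1, 1, 1, 1]
X = [[1.0, 5.0], [1.2, 3.0], [0.9, 4.0], [1.1, 6.0],
     [3.0, 5.0], [3.2, 3.0], [2.9, 4.0], [3.1, 6.0]]

folds = list(loso_splits(subject_ids))
selected = []
for held_out, train_idx, test_idx in folds:
    X_train = [X[i] for i in train_idx]
    y_train = [y[i] for i in train_idx]
    # Rank features on the training fold only, so the held-out subject
    # never influences which features are selected.
    selected.append(top_k_features(X_train, y_train, k=1))
    print(held_out, "test samples:", test_idx, "selected:", selected[-1])
```

Ranking the features inside each fold, rather than once on the full dataset, keeps the held-out subject out of the selection step; whether the cited studies did this is not stated in the text above.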

Discussion and Conclusions
HBD-SFREM introduces a dual-stage feature processing method that integrates the advantages of traditional feature extraction and feature selection algorithms. It can generate high-quality features that are most useful for model learning and thus support an early and accurate diagnosis of PD, improving both the identification accuracy and its stability. In addition, HBD-SFREM can be applied to small-sample PD speech datasets, including some unbalanced speech datasets. The experimental results demonstrate that HBD-SFREM outperforms other existing algorithms for PD speech diagnosis.
Currently, relatively few PD speech datasets are publicly available. Three public PD speech datasets from UCI are used to validate the effectiveness and innovativeness of HBD-SFREM, together with a Chinese PD speech dataset collected by the authors. The experimental results indicate that HBD-SFREM achieves significantly better performance on the datasets studied. For all datasets, HBD-SFREM largely improves the diagnosis accuracy, especially on the Parkinson dataset. The accuracy is enhanced by at least 19.36% compared with the other representative feature processing algorithms. At present, relatively few fusion methods combine feature selection and feature extraction for PD speech recognition, so this paper lays a foundation for future research.
In future work, more types of feature extraction and selection methods should be introduced into this framework to develop and evaluate further effective algorithms. In addition, improvements to the hierarchical space instance learning mechanism should be verified. As a framework algorithm, HBD-SFREM differs from other feature extraction and feature selection algorithms and is therefore valuable as a reference for study in this field.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Written informed consent has been obtained from the patient(s) to publish this paper.
Data Availability Statement: Three publicly available datasets are used in this paper and they can be downloaded from: www.archive.ics.uci.edu/ml/index.php (accessed on 24 November 2021). The code used in this study can be found at: https://github.com/YangMingYaoo/Deep-Embeddedhybrid-FR-algorithm-about-parkinsons-disease.git (accessed on 24 November 2021).