Speaker Recognition Based on Fusion of a Deep and Shallow Recombination Gaussian Supervector

Abstract: Extracting a speaker's personalized feature parameters is vital for speaker recognition. A single kind of feature cannot fully reflect the speaker's personality information. In order to represent the speaker's identity more comprehensively and improve the speaker recognition rate, we propose a speaker recognition method based on the fusion feature of a deep and shallow recombination Gaussian supervector. In this method, the deep bottleneck features are first extracted by a Deep Neural Network (DNN) and used as the input of the Gaussian Mixture Model (GMM) to obtain the deep Gaussian supervector. In parallel, we input the Mel-Frequency Cepstral Coefficient (MFCC) to the GMM directly to extract the traditional Gaussian supervector. Finally, the two categories of features are combined in the form of horizontal dimension augmentation. In addition, to prevent the system recognition rate from falling sharply when the number of speakers to be recognized increases, we introduce an optimization algorithm to find the optimal weights before the feature fusion. The experimental results indicate that the speaker recognition rate based on the directly fused feature can reach 98.75%, which is 5% and 0.62% higher than the traditional feature and the deep bottleneck feature, respectively. When the number of speakers increases, the fusion feature based on optimized weight coefficients can improve the recognition rate by 0.81%. These results validate that the proposed fusion method effectively exploits the complementarity of the different types of features and improves the speaker recognition rate.


Introduction
Over the last two decades, with the rapid development of artificial intelligence, voiceprint, iris, fingerprint, face and other biometrics have attracted wide attention [1][2][3]. Speech is the most common way to communicate and convey information in daily life. A person's vocal tract structure determines that person's unique vocal characteristics, which makes speaker recognition possible. Speaker recognition is a biometric technology that automatically determines the speaker's identity from the unique features contained in the voice. Generally speaking, speaker recognition comprises two important branches: speaker identification and speaker verification [4]. The former selects the speaker with the highest similarity by comparing the speech of the speaker to be recognized with the trained models; it is a multi-class classification problem. The latter determines whether the input speech belongs to a specific trained speaker model; it is a binary decision problem. Speaker identification technologies have been widely discussed in recent years. A speaker identification system mainly consists of three parts: speech signal preprocessing, feature parameter extraction, and classification [5]. Since speakers are influenced by their own physical condition and the external environment in the course of communication, extracting distinguishable information from complex speech is a challenging task. Mel-Frequency Cepstral Coefficient (MFCC) [6][7][8][9][10], Linear Prediction Cepstrum Coefficient (LPCC) [11,12], Perceptual Linear Predictive (PLP) [13] and Linear Predictive Coding (LPC) [14] are the most frequently used traditional features in speaker recognition. Wu and Cao [6] replaced the logarithmic transformation in the standard MFCC analysis with a combined function to reduce the sensitivity to noise. 
The experiments showed that MFCC-based feature reduced the error rate significantly under the noisy environment. Sahidullah and Saha [9] proposed a novel windowing technique to compute MFCC. The method was based on the fundamental property of Discrete Time Fourier Transform (DTFT) related to differentiation in the frequency domain and it achieved good substance and consistency. In Reference [10], a novel algorithm of extracting MFCC for speech recognition was proposed. They modified the filter bank and added the filter bank to generate the power coefficient. It could effectively reduce the consumption of computer hardware. There were also some classical models such as Gaussian Mixture Model (GMM)-Support Vector Machine (GMM-SVM) [15], GMM-Universal Background Model (GMM-UBM) [16] and Probabilistic Linear Discriminant Analysis/i-vector (PLDA/ivector) [17] applied to speaker recognition. In recent years, more and more researchers have used deep networks to complete speaker recognition. Shahin et al. [18] proposed a new classifier called cascaded Gaussian mixture model-deep neural network. They used the GMM to generate the emotional tags of each speaker under each emotional speaking condition, and then the new vector of features was used as the input of the Deep Neural Network (DNN) classifier. Lastly, the output of the DNN was the final classification results. The performance of the system was tested on the Emirati speech database and "speech under simulated and actual stress" English dataset. Work in [19] used the neural network for classification and wavelet transform to extract feature parameters. The result demonstrated that the performance was better than the Multi-Layer Perceptron (MLP)-based classification in the aspect of recognition accuracy, average precision, average recall and root mean square error. Matejka et al. [20] investigated combining the deep bottleneck features with the traditional MFCC features to complete the speaker identification. 
In Reference [21], the DNN was utilized to extract deep features for automatic speaker recognition and language recognition. The results showed a 55% reduction in equal error rate for the 2013 Domain Adaptation Challenge out-of-domain condition and a 48% reduction on the NIST 2011 language recognition evaluation 30 s test condition.
The speaker identification methods described above have been widely accepted and applied for their respective advantages and good recognition performance, but they still have some shortcomings. The traditional features only reflect the speaker's physical information and represent shallow characteristics of the speech. They cannot fully exploit the deep structural information of speech signals [22]. The deep neural network can extract deep features of speech segments by simulating the structure of the human brain, but it ignores the most basic physical-layer characteristics. Therefore, in order to fully express the features of speech signals and take advantage of each model, some studies have proposed different fusion strategies for speaker recognition in recent years. Omar et al. [23] proposed an MLP network based on feature fusion to train the recognition system. The LPC and MFCC were fused and then input into the MLP, which was used as the classifier of the speaker identification system. The authors of [24] took full account of the complementarity between different levels of speech signals and proposed a fusion method of deep and shallow features for the speaker verification system. Compared with the baseline system, the equal error rate (EER) was reduced by 54.8%. When the training speech consists of short utterances, the GMM fails to achieve good performance. To address this problem, the work in [25] utilized the Convolutional Neural Network (CNN) to process the spectrogram of the speech signal and combined it with the GMM for score fusion. In order to improve the robustness of speaker verification in noisy environments, Asbai and Amrouche [26] proposed a new method of weighted score fusion. These studies show that fusion can effectively improve the performance of the speaker recognition system.
Since the deep and shallow features reflect the speaker's information from different aspects, the speaker's characteristics can be more comprehensively represented by effective fusion. Therefore, we propose a new speaker recognition method based on the fusion of deep and shallow Gaussian supervector. In this method, the MFCC is firstly obtained from the input speech signal, and then DNN is used to obtain the bottleneck features, which are used to acquire the deep Gaussian supervector. On the other hand, we input the MFCC into the GMM directly to obtain the traditional Gaussian supervector. Lastly, we fuse the two kinds of features to form a new vector to train the SVM and complete the speaker classification.
The main contributions made in this paper can be summarized here: (1) We design a DNN network to extract the deep bottleneck features that contain more discriminative information from different speakers. (2) In order to take into account the complementarity between different hierarchical features, we propose a novel fusion model to form a new Gaussian supervector for speaker recognition. (3) We propose a speaker recognition system based on optimization weight coefficient, which improves the robustness of the system. (4) We explore the factors that affect the performance of the system recognition and utilize the Fisher criterion to filter redundant information.
The remainder of the paper is organized as follows: Section 2 describes the proposed speaker identification system based on fusion features, including the MFCC, the recombined Gaussian supervector and the feature selection strategy. Section 3 elaborates on the speaker identification system based on optimized weight coefficients. The experimental results and analysis are presented in Section 4. Finally, the conclusion of this work is given in Section 5.

Proposed Speaker Recognition System
In a speaker identification system, it is vital to extract features that indicate the speaker's identity; these features are then used to train the classification model, and the model is finally used for identification. Therefore, the performance of the speaker recognition system is directly affected by the quality of the features. A single feature often cannot fully reflect the speaker's personality information, resulting in a low recognition rate. The traditional acoustic characteristics mostly consider the physical layer of the speech signal and mainly reflect shallow features of human auditory perception and the vocal tract. Therefore, it is difficult for them to represent the high-level information of speech segments. In recent years, the DNN has adopted a multi-layer network structure to simulate the human brain, which can extract deeper identity information related to the speaker. However, it does not involve the most intuitive acoustic features of the physical layer, which may also lead to poor system performance. Thus, in order to further improve the performance of the speaker identification system, we propose a novel recognition method that fuses the deep features and traditional acoustic features to accurately recognize the speaker's identity. The system block diagram of the proposed model is shown in Figure 1.
Electronics 2021, 9, x FOR PEER REVIEW
In the training stage, the input speech signal is preprocessed by endpoint detection, pre-emphasis, framing, and windowing. Then, the MFCC is obtained from the processed signals to train the DNN. After the training process is completed, we extract the deep bottleneck features. Since the GMM achieves excellent performance in the field of speaker recognition, we use the GMM to further obtain the deep Gaussian supervector of the speech. In parallel, in order to obtain the traditional acoustic characteristics, we input the MFCC to the GMM directly to obtain the traditional Gaussian supervector. 
Each Gaussian supervector reflects the mean statistical characteristics of the speech signal, but it ignores the relevance between different frames. Therefore, we recombine the extracted traditional and deep Gaussian supervectors. Finally, the deep and traditional recombined supervectors are fused by augmenting the vector dimension. That is, the traditional Gaussian mean supervector is horizontally spliced onto the deep supervector to form a new vector with higher dimension and more personalized information. The new fused features are used to train the SVM classifier. In the test stage, we obtain the fusion supervector of the test speech data by the same processing as in the training phase, and then input it into the trained SVM to obtain the classification results.

Recombined Gaussian Supervector
In previous work, traditional features such as MFCC [27] and Gaussian statistical characteristics [15] were often applied to speaker identification and performed well. In this paper, we extract the 48-dimensional MFCC from the input speech and calculate the Gaussian statistics as the input feature to train the SVM.

MFCC
A speech segment is considered short-term stationary if it lasts no more than 30 ms, with a frame shift of 10 ms. Therefore, before extracting the MFCC, the speech needs to be preprocessed. The preprocessing mainly includes endpoint detection, pre-emphasis, framing, and windowing.
The pre-emphasis part can be realized by a high-pass filter with the transfer function
$$H(z) = 1 - \alpha z^{-1},$$
where α is the pre-emphasis coefficient (usually in the interval [0.9, 1]). In the experiments, we adopt the Hamming window to smooth the edges of the framed signals and make them periodic. The window function is defined as
$$w(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1,$$
where N is the frame length. MFCC represents the short-term power spectrum of human speech [15]. The Mel frequency reflects the conversion relationship between the actual frequency and the perceptual frequency. It can be obtained by using the formula [28]
$$\mathrm{Mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right),$$
where f is the actual frequency in Hz. The specific steps and flow chart of the extraction process are shown in Figure 2. Firstly, the continuous speech signal in the time domain is transformed into a discrete digital signal by sampling, framing and windowing, and then the FFT or DFT is applied to each frame to obtain the corresponding linear spectrum. Secondly, the actual frequency is converted into the Mel frequency scale, and the linear spectrum is passed through the Mel filter bank to obtain the Mel spectrum. Next, the logarithmic power spectrum is obtained by a logarithmic operation. Finally, the correlation between the components is eliminated by the DCT, and the MFCC parameters are obtained. In addition, the first-order
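The preprocessing steps and the Mel-scale conversion described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes the common first-order pre-emphasis filter y[n] = x[n] − αx[n−1], the standard Hamming window, and the widely used Mel formula 2595·log10(1 + f/700). The function names and the default α = 0.97 are hypothetical choices.

```python
import math


def pre_emphasis(signal, alpha=0.97):
    """Apply y[n] = x[n] - alpha * x[n-1]; the first sample is kept unchanged."""
    if not signal:
        return []
    return [signal[0]] + [
        signal[n] - alpha * signal[n - 1] for n in range(1, len(signal))
    ]


def hamming_window(length):
    """Return w(n) = 0.54 - 0.46 * cos(2*pi*n / (N - 1)) for n = 0..N-1."""
    if length == 1:
        return [1.0]
    return [
        0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
        for n in range(length)
    ]


def hz_to_mel(f_hz):
    """Convert a frequency in Hz to the Mel scale."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel, used when placing Mel filter bank centres."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)
```

A windowed frame is then the element-wise product of a frame and `hamming_window(len(frame))`.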

Extraction of Recombined Gaussian Supervector
GMM has been widely utilized in speaker recognition. In this paper, we mainly adopt the GMM to extract the Gaussian mean supervector. The parameters are estimated from the input data using the Expectation Maximization (EM) algorithm [29]. The probability density of an M-order GMM is given by
$$p(X \mid \lambda) = \sum_{i=1}^{M} w_i f_i(X \mid \mu_i, \Sigma_i),$$
where X is a D-dimensional random vector, $f_i(X \mid \mu_i, \Sigma_i)$ is the density function of the i-th component in the vector space $\mathbb{R}^D$, and $w_i$ is the mixture weight, which satisfies $\sum_{i=1}^{M} w_i = 1$. The density is given by
$$f_i(X \mid \mu_i, \Sigma_i) = \frac{1}{(2\pi)^{D/2} |\Sigma_i|^{1/2}} \exp\left(-\frac{1}{2}(X - \mu_i)^{T} \Sigma_i^{-1} (X - \mu_i)\right),$$
where $\mu_i$ and $\Sigma_i$ refer to the mean vector and covariance matrix, respectively. Therefore, we usually use the model $\lambda_k = (w_k, \mu_k, \Sigma_k)$ to represent the k-th mixture component. Given a set of feature vectors $X = \{x_1, x_2, x_3, \ldots, x_T\}$, the aim of applying the GMM is to compute the necessary statistics. We first set the number of Gaussian components and the initial values, and then use the EM algorithm to estimate a new parameter $\hat{\lambda}$. The new model parameters are fed into the next iteration until the model converges. In the standard EM re-estimation, the parameters are updated as follows:
$$\hat{w}_i = \frac{1}{T} \sum_{t=1}^{T} P(i \mid x_t, \lambda), \quad \hat{\mu}_i = \frac{\sum_{t=1}^{T} P(i \mid x_t, \lambda)\, x_t}{\sum_{t=1}^{T} P(i \mid x_t, \lambda)}, \quad \hat{\Sigma}_i = \frac{\sum_{t=1}^{T} P(i \mid x_t, \lambda)(x_t - \hat{\mu}_i)(x_t - \hat{\mu}_i)^{T}}{\sum_{t=1}^{T} P(i \mid x_t, \lambda)},$$
where $P(i \mid x_t, \lambda) = w_i f_i(x_t \mid \mu_i, \Sigma_i) / \sum_{k=1}^{M} w_k f_k(x_t \mid \mu_k, \Sigma_k)$ is the posterior probability of the i-th component. In this paper, we mainly use the mean vector $\mu_i$ of each Gaussian component. In order to make the input vector contain more personality information, we concatenate the mean vectors to form the mean supervector. It can be represented as $m = \{m_1, m_2, m_3, \ldots, m_M\}$, where $m_i$ (i = 1, 2, ..., M) is the mean vector of the i-th component. If we consider each mean vector separately, the correlation between them will be ignored. If the Gaussian correlation number is too large, the performance of the system will be reduced due to the decrease of the feature correlation between multiple frames. Therefore, it is important to select an appropriate Gaussian correlation number J to recombine the feature vector. The first new mean vector obtained is $m'_1 = [m_1, m_2, \ldots, m_J]$. According to this rule, the recombined supervectors are obtained by traversing the entire supervector in turn. 
Finally, we obtain K recombined supervectors. The number K is determined by J and M, where M is the number of original Gaussian mean vectors. The new traditional recombined vector can be expressed as $m' = \{m'_1, m'_2, \ldots, m'_K\}$, where $m'_p$ (p = 1, 2, ..., K) represents each recombined supervector.
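The recombination rule can be illustrated with a short sketch. It assumes a sliding window over the M mean vectors in which every J consecutive vectors are concatenated; the `stride` parameter is a hypothetical addition, because the exact traversal step, and therefore the exact relation between J and K, is not stated here. With `stride=1` this sketch yields K = M − J + 1 windows.

```python
def recombine_supervector(means, j, stride=1):
    """Concatenate every J consecutive mean vectors into one recombined vector.

    means  : list of M mean vectors (each a list of floats)
    j      : Gaussian correlation number J
    stride : step between windows (an assumption, not taken from the paper)
    """
    if j < 1 or j > len(means):
        raise ValueError("J must be between 1 and the number of mean vectors")
    if stride < 1:
        raise ValueError("stride must be a positive integer")
    recombined = []
    for start in range(0, len(means) - j + 1, stride):
        window = means[start:start + j]
        # Flatten the J mean vectors of this window into a single vector.
        recombined.append([value for vector in window for value in vector])
    return recombined
```

For example, four 2-dimensional mean vectors with J = 2 and `stride=1` give three 4-dimensional recombined vectors.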

Deep Recombined Gaussian Supervector
Since the traditional features such as MFCC, LPCC and Gaussian supervector simply represent the shallow physical information of the speaker's voice, they cannot extract the features on a deeper level. Therefore, it is necessary to get the feature vector, which can remove redundant information and reflect the speaker's identity information more deeply. DNN has achieved an overall success in automatic speaker recognition [30,31]. There are two major applications of DNN: one is used as a classifier and the other is to extract speech features frame by frame. In our work, we design a DNN to obtain deep bottleneck features.

Deep Neural Network Model
The DNN is an MLP with multiple hidden layers, each of which is implemented by a Restricted Boltzmann Machine (RBM) [32]. The values of the visible and hidden units are generally binary and obey the Bernoulli distribution. The energy function is defined by
$$E(v, h; \theta) = -\sum_{i} b_i v_i - \sum_{j} c_j h_j - \sum_{i} \sum_{j} v_i w_{ij} h_j,$$
where v and h represent the states of the visible and hidden layers, respectively. The parameter set θ = {w, b, c} contains the connection weights $w_{ij}$ between visible unit i and hidden unit j, and the biases b and c of the visible and hidden layers. The training of the DNN can be divided into two stages: pre-training and fine-tuning. In the pre-training stage, we adopt an unsupervised method to initialize the DNN. Contrastive Divergence (CD) [33] is used to estimate the parameters of the RBM. In the fine-tuning stage, we adopt the Back Propagation (BP) algorithm to fine-tune the network parameters. In this process, the network parameters are adjusted in a supervised manner. Thus, it is necessary to align each frame of the training data with the corresponding speaker label.
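As a small worked example of the RBM energy, the sketch below evaluates the standard Bernoulli-Bernoulli energy E(v, h) = −Σ b_i v_i − Σ c_j h_j − Σ v_i w_ij h_j and the conditional activation probability of a hidden unit, which is the quantity sampled during Contrastive Divergence. It assumes that b holds the visible biases and c the hidden biases; the function names are hypothetical.

```python
import math


def rbm_energy(v, h, w, b, c):
    """Energy of a joint configuration (v, h); w[i][j] links visible i to hidden j."""
    visible_term = sum(b_i * v_i for b_i, v_i in zip(b, v))
    hidden_term = sum(c_j * h_j for c_j, h_j in zip(c, h))
    interaction = sum(
        v[i] * w[i][j] * h[j] for i in range(len(v)) for j in range(len(h))
    )
    return -visible_term - hidden_term - interaction


def hidden_activation_prob(v, w, c, j):
    """P(h_j = 1 | v) = sigmoid(c_j + sum_i v_i * w_ij)."""
    pre_activation = c[j] + sum(v[i] * w[i][j] for i in range(len(v)))
    return 1.0 / (1.0 + math.exp(-pre_activation))
```

Lower energy corresponds to a more probable configuration under the model.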

Extraction of Deep Recombined Gaussian Supervector
In general, the DNN is composed of the input layer, the output layer and the hidden layers. We mainly extract deep bottleneck features from raw speech. Figure 3 shows the DNN structure used in this paper. We design a five-layer network to train on the speech signal, including the input layer, three hidden layers and the output layer. The structure of the network is 200-200-48-200-10. The number of neuron nodes in the output layer is the total number of speakers to be identified. Firstly, the MFCC features of the preprocessed training speech signals are extracted as the input of the DNN. After the pre-training and fine-tuning stages are completed, the bottleneck layer is regarded as the new output. Thus, the traditional characteristic parameters are converted into the deep bottleneck features. After the deep bottleneck features are obtained, we use them as the input of the GMM to obtain the Gaussian mean vectors. Since there are some correlations between the Gaussian mean vectors of different frames, we further recombine these vectors according to the rules mentioned in Section 2.1.2. The new deep recombined Gaussian supervector is expressed as $v' = \{v'_1, v'_2, \ldots, v'_K\}$, where $v'_p$ (p = 1, 2, ..., K) represents each recombined component.
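Extracting bottleneck features amounts to running a forward pass and reading the activations of the narrow hidden layer instead of the output layer. The sketch below illustrates this on one 48-dimensional frame. It is only an illustration under stated assumptions: the weights are randomly initialized rather than pre-trained and fine-tuned, sigmoid activations are assumed, the layer widths 48-200-48-200-10 follow the configuration reported in the batch-size experiment, and all function names are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases):
    """One fully connected sigmoid layer; weights[j] holds the input weights of unit j."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + bias)
        for row, bias in zip(weights, biases)
    ]


def init_layer(n_in, n_out, rng):
    """Random small weights and zero biases (stand-in for a trained layer)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def bottleneck_features(frame, layers, bottleneck_index):
    """Forward one frame and return the activations of the bottleneck layer."""
    activations = frame
    for index, (weights, biases) in enumerate(layers):
        activations = dense(activations, weights, biases)
        if index == bottleneck_index:
            return activations
    raise IndexError("bottleneck_index is beyond the last layer")


rng = random.Random(0)
sizes = [48, 200, 48, 200, 10]  # assumed layer widths; weights are untrained
layers = [init_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
frame = [rng.uniform(-1.0, 1.0) for _ in range(48)]  # placeholder MFCC frame
features = bottleneck_features(frame, layers, bottleneck_index=1)
```

Here `bottleneck_index=1` selects the second weight layer, whose output is the 48-unit bottleneck.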

Classification Based on Fusion Features
In order to exploit the complementarity between the deep and shallow recombination supervectors, we splice the traditional recombined supervector horizontally behind the deep recombined Gaussian supervector. From $m' = \{m'_1, m'_2, \ldots, m'_K\}$ and $v' = \{v'_1, v'_2, \ldots, v'_K\}$, we obtain the new fusion feature
$$[v'_1, v'_2, \ldots, v'_K, m'_1, m'_2, \ldots, m'_K].$$
After the fusion vectors are obtained, we input them into the SVM to make the classification decision.

Support Vector Machine Classifier
The target of SVM training is to find the maximum-margin hyperplane over the training samples that can distinguish different speaker identities. When the input sample data are linearly separable, the learning of the SVM can be achieved by solving the following optimization problem:
$$\min_{w, b} \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i (w^{T} x_i + b) \ge 1, \quad i = 1, 2, \ldots, n,$$
where w represents the weight vector, b is the bias, and $y_i \in \{-1, +1\}$ is the class label of sample $x_i$. If the data set is not linearly separable, we need to introduce a kernel function, which maps the original data to a new feature space. The dimension of the new feature space is higher than that of the original one. Since the Radial Basis Function (RBF) has shown its advantages in pattern recognition, the RBF is used in the proposed model. The formula of the RBF is as follows:
$$K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right),$$
where σ is the kernel width. It is a radially symmetric scalar function, usually defined as a monotone function of the Euclidean distance between any point $x_i$ and a certain center $x_j$ in the space, which can be denoted as $k(\|x_i - x_j\|)$. In our work, speaker recognition is a multi-class classification task. One-to-one (one-versus-one) and one-to-many (one-versus-rest) are the two main methods to realize multi-class classification with the SVM. Since the former is much faster, we adopt the one-to-one method for speaker recognition.
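The two ingredients named above, the RBF kernel and the one-to-one multi-class scheme, can be sketched briefly. The kernel is written in the common Gaussian form exp(−‖x_i − x_j‖² / (2σ²)); the parameterization by σ and the function names are assumptions made for illustration. The one-to-one scheme trains one binary SVM per pair of speakers, so 10 speakers require 45 binary classifiers.

```python
import math
from itertools import combinations


def rbf_kernel(x_i, x_j, sigma=1.0):
    """Gaussian RBF kernel; depends only on the Euclidean distance between inputs."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def one_vs_one_pairs(labels):
    """List every unordered pair of classes; one binary SVM is trained per pair."""
    return list(combinations(sorted(set(labels)), 2))
```

At test time, each pairwise classifier casts a vote and the speaker with the most votes is selected, which is the usual decision rule for this scheme.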

Fisher Criterion Selection
In the fusion model, the dimension of the fused feature input into the SVM classifier may be very large, which increases the modeling time. Therefore, an effective dimension reduction strategy should be adopted to remove information that is irrelevant to speaker identity and reduce the computational complexity of the model. We choose the Fisher criterion [34] to perform feature selection on the deep and shallow recombination Gaussian supervector. The main idea of Fisher criterion selection is that the Euclidean distance within the same class should be small, while the distance between different classes should be large. We define the q-th dimension of the feature of the i-th speaker as $T_{iq} = \{X^{i}_{1q}, X^{i}_{2q}, X^{i}_{3q}, \ldots, X^{i}_{Mq}\}$, and the Fisher criterion discriminant coefficient can be calculated by
$$F(q) = \frac{\sum_{i=1}^{s} \sum_{j=1}^{s} (\mu_{iq} - \mu_{jq})^2}{\sum_{i=1}^{s} \sigma_{iq}^2},$$
where s represents the total number of speakers, and $\mu_{iq}$ and $\sigma_{iq}^2$ are the mean and variance of the vector $T_{iq}$. In our proposed method, after the deep and shallow recombined supervectors are fused, we calculate the Fisher coefficients of the fusion features over the different speakers. Next, we sort them in ascending order and remove the features with the smaller coefficients. Finally, the retained features form a new vector, from which some irrelevant information has been eliminated.
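The selection step can be sketched as follows. The sketch assumes that the Fisher coefficient of a dimension is the sum of squared differences between speaker means divided by the sum of within-speaker variances, which is one common form of the criterion; the function names and the `keep` parameter are hypothetical.

```python
from statistics import mean, pvariance


def fisher_coefficients(features_by_speaker):
    """features_by_speaker[i] is the list of feature vectors of speaker i."""
    n_dims = len(features_by_speaker[0][0])
    coefficients = []
    for q in range(n_dims):
        columns = [[vec[q] for vec in samples] for samples in features_by_speaker]
        means = [mean(column) for column in columns]
        within = sum(pvariance(column) for column in columns)
        between = sum((m_i - m_j) ** 2 for m_i in means for m_j in means)
        if within > 0:
            coefficients.append(between / within)
        else:
            # Zero within-speaker variance: perfectly compact classes.
            coefficients.append(float("inf") if between > 0 else 0.0)
    return coefficients


def select_features(features_by_speaker, keep):
    """Return the indices of the `keep` dimensions with the largest coefficients."""
    coefficients = fisher_coefficients(features_by_speaker)
    ranked = sorted(range(len(coefficients)), key=lambda q: coefficients[q], reverse=True)
    return sorted(ranked[:keep])
```

Dimensions whose speaker means are far apart relative to their spread receive large coefficients and are retained.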

Optimization of Feature Weight Coefficient
In the above speaker recognition system, the two types of features are directly spliced in the horizontal direction when they are fused. That is to say, we assume by default that the two features contribute equally to the system, and no further processing is applied in the subsequent steps. When the number of speakers increases, recognition becomes more difficult, and the recognition accuracy of the system decreases. For each speaker, different features contribute differently to the final recognition result, so the importance of each feature must be taken into account. When several types of features are fused, the weight coefficients between them are considered in order to further improve the recognition rate and reduce the probability of misjudgment. Therefore, in order to measure the weight of each feature more accurately, we use two common optimization algorithms, namely the Genetic Algorithm (GA) [35] and the Simulated Annealing (SA) algorithm [36], to find the most appropriate weights. The system block diagram is shown in Figure 4. When the number of speakers increases, using only two features is not enough to describe the speakers comprehensively. In order to describe the identity information more fully, the system shown in Figure 4 uses three different features to obtain the fusion features: the deep recombined Gaussian supervector, the traditional recombined Gaussian supervector and the MFCC. Assuming that the i-th feature is denoted by $x_i$ and the corresponding weight coefficient is $w_i$, the fused feature with optimized weight coefficients can be expressed as
$$X_f = [w_1 x_1, w_2 x_2, \ldots, w_N x_N],$$
where $X_f$ represents the fused feature and N is the number of features. In this paper, the value of N is 3.
In the training stage, the GA or SA algorithm is used to find the optimal weights of the three types of features, and each feature is then multiplied by its coefficient. Lastly, the weighted features are spliced horizontally to form a new feature. In the test stage, we use the same method as in the training phase to obtain the fusion features of the test set. Once the training and test features are obtained, they are input into the SVM classifier to obtain the identity of the speakers.
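The weighted fusion and the weight search can be sketched as follows. The fusion scales each feature vector by its weight and concatenates the results. The search is a generic simulated annealing loop, not the authors' exact configuration: the `score` callable stands in for whatever criterion is maximized (for instance the recognition rate of the SVM on held-out data), and the weight range [0, 1], the step size, the temperature schedule and the function names are all assumptions.

```python
import math
import random


def fuse(features, weights):
    """Scale each feature vector by its weight, then splice them horizontally."""
    return [w * value for vector, w in zip(features, weights) for value in vector]


def simulated_annealing(score, n_weights, steps=200, temperature=1.0,
                        cooling=0.98, seed=0):
    """Maximise score(weights) over weights in [0, 1] with a basic annealing loop."""
    rng = random.Random(seed)
    current = [rng.random() for _ in range(n_weights)]
    current_score = score(current)
    best, best_score = list(current), current_score
    for _ in range(steps):
        # Perturb every weight slightly and clip it back into [0, 1].
        candidate = [min(1.0, max(0.0, w + rng.gauss(0.0, 0.1))) for w in current]
        candidate_score = score(candidate)
        delta = candidate_score - current_score
        # Always accept improvements; accept worse moves with decreasing probability.
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_score = candidate, candidate_score
            if current_score > best_score:
                best, best_score = list(current), current_score
        temperature *= cooling
    return best, best_score
```

In use, `score` would fuse the training features with the candidate weights, train the classifier, and return its accuracy; a GA could be substituted for the search without changing `fuse`.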

Database Description and Experiment Setup
In order to verify the effectiveness of the proposed method, we conduct experiments on a database recorded in Chinese. The database, which is widely used in speaker recognition, derives from the national 863 key projects (2006AA010102) and contains 210 speakers in total. We randomly select 10 speakers, five male and five female, for the following experiments. Each speaker reads 180 utterances that last about five seconds each. We randomly choose 80 utterances, of which 60 are used as training samples and the rest are used as test samples. Since the SVM classifier performs well in the field of speaker recognition, we utilize the LIBSVM toolbox (https://www.csie.ntu.edu.tw/~cjlin/libsvm/) by Professor Chih-Jen Lin of National Taiwan University to realize the classification.
We perform endpoint detection before extracting the speech feature parameters. In general, speech signals are considered to be stationary over a short time, so the input samples need to be divided into short frames. We use a frame length of 256 points and a frame shift of 128 points. The feature used in this work is the MFCC, which comprises the 24-order MFCC and its 24-order dynamic characteristics (∆MFCC). The DNN structure and training parameters are determined by experiment. The input layer takes the original 48-dimensional MFCC acoustic feature and is followed by three hidden layers. The output layer is used for classification, and its number of neurons equals the number of speakers to be identified.
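The framing step (256-point frames with a 128-point shift) can be illustrated as follows. The function name `frame_signal` is hypothetical, and dropping trailing samples that do not fill a whole frame is an assumption of this sketch; the paper does not state how the tail is handled.

```python
from typing import List, Sequence


def frame_signal(samples: Sequence[float], frame_len: int = 256,
                 frame_shift: int = 128) -> List[List[float]]:
    """Split a signal into overlapping frames.

    Trailing samples that do not fill a complete frame are dropped
    (an assumption of this sketch).
    """
    frames: List[List[float]] = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(list(samples[start:start + frame_len]))
        start += frame_shift
    return frames


# A synthetic 1024-sample ramp stands in for a speech segment.
signal = [float(n) for n in range(1024)]
frames = frame_signal(signal)
print(len(frames), len(frames[0]))  # 7 256
```

With a shift of half the frame length, consecutive frames overlap by 50%, so a signal of L samples yields 1 + (L - 256) // 128 frames.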

The Impact of Deep Network Parameters on the System
In a pattern classification problem, the feature parameters are an important factor in determining system performance. Therefore, to obtain the best performance from the fusion model, we vary the network structure and training parameters experimentally to find the optimal deep network.
(1) In deep network training, the batch size is a key factor affecting the extraction of deep features. We design groups of experiments with the batch size set to 5, 10, 15, 20 and 25 to search for the best value. The network structure is 48-200-48-200-10. The recognition rate is shown in Figure 5. As can be seen from Figure 5, when the batch size is greater than 10, the recognition rate decreases as the batch size increases. This can be explained by the fact that an oversized batch may lead to overfitting during network training, that is, the network generalizes poorly. On the other hand, if the batch size is too small, convergence is slow, which greatly increases the training time. An appropriate batch size is therefore important for network training. The recognition rate of the system peaks when the batch size is 10, so we set the batch size to 10 in the subsequent experiments.
(2) Previous studies have found that the learning rate of the DNN also has a significant impact on system performance. In our work, a group of comparative experiments with different learning rates is conducted. The results in Figure 6 indicate that, as long as the learning rate is not higher than 0.002, the performance of speaker recognition improves as the learning rate increases. When the learning rate is small, convergence is slow, which consumes a large amount of time. Conversely, if the learning rate is too large, the optimal value may be missed during the iterations and the extracted features will be undesirable. The experimental results show that the most suitable learning rate is 0.002, so the learning rate is set to 0.002 in this paper.
(3) Table 1 shows the system recognition rate with the bottleneck layer at different locations. The system performs best when the second hidden layer is set as the bottleneck layer. This can be explained as follows: if the features are extracted from an earlier layer, some important information may be neglected; conversely, if they are extracted from a later layer, the features may be redundant, which also affects system performance. Moreover, the rows of the table show that, for the different network structures, the system achieves the highest recognition rate when the correlation number is 64 and the parameters of the second hidden layer are used as the bottleneck features. This also illustrates the importance of the correlation between features. Therefore, we adopt the structure 48-200-48-200-10 in the following experiments.
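To make the 48-200-48-200-10 structure concrete, the sketch below pushes one 48-dimensional frame through the first two weight layers and reads out the 48-dimensional activation of the second hidden layer as the bottleneck feature. It is an illustration only: the weights are random and untrained, the sigmoid activation is an assumption (the activation function is not specified in this section), and the names `init_layers` and `bottleneck_feature` are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple

LAYER_SIZES = [48, 200, 48, 200, 10]  # input, three hidden layers, output
Layer = Tuple[List[List[float]], List[float]]  # (weight rows, biases)


def init_layers(sizes: Sequence[int], seed: int = 0) -> List[Layer]:
    """Create small random weights; a trained DNN would supply real ones."""
    rng = random.Random(seed)
    layers: List[Layer] = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def bottleneck_feature(frame: Sequence[float], layers: Sequence[Layer],
                       hidden_index: int = 2) -> List[float]:
    """Return the activation of hidden layer `hidden_index` (1-based)."""
    activation = list(frame)
    for weights, biases in layers[:hidden_index]:
        activation = [
            sigmoid(sum(w * a for w, a in zip(row, activation)) + b)
            for row, b in zip(weights, biases)
        ]
    return activation


layers = init_layers(LAYER_SIZES)
frame = [0.1] * 48                 # stand-in for one MFCC + delta-MFCC frame
feat = bottleneck_feature(frame, layers)
print(len(feat))  # 48
```

Passing `hidden_index=1` or `hidden_index=3` instead returns the 200-dimensional activations of the first or third hidden layer, which corresponds to the alternative extraction positions compared in Table 1.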

The Superiority of Bottleneck Features
As an important modern tool for feature extraction, the deep neural network has shown unique advantages. A deep network has more than two layers, and the output of each layer can be used as a type of feature that represents the information of the input signal at a different level. Choosing the layer whose output serves as the deep feature is therefore an important issue. In this paper, the deep network has five layers. The third layer is set as the bottleneck layer and the number of Gaussian components is 1024. In addition, we extract features from the first, second and third hidden layers, respectively, for the experiments.
According to Figure 7, when the Gaussian correlation number is lower than 64, the system trained on the bottleneck features achieves a recognition rate similar to that of the systems using features from the first or third hidden layer. However, when the correlation number is higher than 64, the bottleneck feature performs significantly better than the features from the other layers. The bottleneck feature is thus superior to the other deep features. This can be explained by the fact that the deep bottleneck feature removes the redundant part of the input features and provides a more concise and abstract representation.
To further compare the traditional features with the deep bottleneck features, we conduct a series of comparative experiments using the MFCC and the bottleneck features. In the MFCC-based system, we use the MFCC directly as the input of the GMM and then recombine the mean supervector, which is classified by the SVM. In the system based on the bottleneck feature, we first obtain the deep bottleneck feature by using the MFCC as the input of the DNN; the subsequent process is the same as for the traditional recombination supervector, and the SVM is again used for the final classification. The results for the two kinds of features are displayed in Table 2.
It can be seen from Table 2 that the DNN-based system is far better than the system based on traditional features in most cases. This indicates that the features extracted from the DNN capture deeper identity information and are more discriminative. Moreover, the dimension of the bottleneck layer is far lower than that of the other layers, which shows that identity-related information can be compressed effectively in the bottleneck layer. Therefore, in this paper, we make full use of these advantages and select the bottleneck feature as the deep feature.

Performance of Speaker Recognition Using the Proposed Fusion Model
In speaker recognition, although the traditional acoustic features and the deep features each show their own advantages, few researchers have considered the complementarity between them. To further improve the performance of speaker recognition, we have proposed to fuse the shallow and deep features as described in Section 2. From the findings in Sections 4.2.1 and 4.2.2, we have obtained the optimal deep features and network parameters. Therefore, we set the network to 200-200-48-200-10, the learning rate to 0.002 and the batch size to 10. On one side, we use the bottleneck features to train the GMM. On the other side, we use the MFCC to obtain the traditional Gaussian supervector. Lastly, we combine them to train the classifier. To verify the validity of the fusion model, we compare it with the MFCC, the deep recombined supervector and the i-vector [37]. In our work, after the deep bottleneck features are obtained, we input them to the GMM-UBM to extract the i-vector. The dimension of the MFCC is set to 24; combined with its first-order difference, the total dimension of the acoustic feature is 48. The UBM has 1024 components and the dimension of the i-vector is 400. After the i-vector is extracted, it is input into the SVM for classification to complete speaker identification. A more detailed introduction can be found in [38,39]. The results of the comparative experiments are displayed in Figures 8 and 9.
The experimental results above show that the recognition rate of the proposed fusion feature is significantly better than that of the other three features in most cases. In particular, when the Gaussian correlation number is set to 64, the performance of the proposed fusion model peaks at a recognition rate of 98.75%. Compared with the methods based on the traditional feature and the deep feature, the proposed fusion method achieves a recognition rate that is higher by 5% and 0.62%, respectively. We conclude that the accuracy of the speaker recognition system improves when the traditional feature and the deep feature are combined. This also shows that the proposed fusion method exploits the complementarity between different categories of features.
In general, the number of speakers may affect system performance. In the experiment above, only 10 speakers were selected for the comparative experiment. Therefore, to test the robustness of the system, we enlarge the number of speakers to 20, selected randomly from the database. We extract the traditional Gaussian supervector and the deep Gaussian supervector in the same way and then fuse the two kinds of vectors into the new acoustic feature. The other parameters are the same as in the 10-speaker recognition system. We compare the performance of the shallow feature, the deep feature and the fused feature. The experimental results are shown in Figure 10.
These results show that when the Gaussian component number is 1024 and the correlation number is 64, the proposed system reaches its highest recognition rate of 95.94%. Compared with the other two methods, the proposed method improves the recognition rate by 4.71% and 1.06%, respectively. As the number of speakers increases, the highest recognition rate of the system decreases to some extent. However, the proposed system still outperforms the other two methods even with more speakers. At the same time, the new method works well when the training data sample is small. This indicates that the proposed method generalizes well and that the result is not a special case.
To further verify the influence of the Gaussian component number, we set it to 256, 512, 1024 and 2048 in the fusion model and carry out a set of experiments. The number of speakers is 10 and the network parameters are the same as in Section 4.2.3.
As we can see from Figure 11, when the Gaussian correlation number is greater than 32, the larger the Gaussian component number is, the better the performance will be. When the Gaussian component number is 256 and the correlation number is less than 32, the system recognition rate is relatively better, but it is still unsatisfactory. This indicates that the performance of the system can be enhanced when the number of Gaussian components is large enough. However, the recognition rate cannot be improved indefinitely. The recognition rate reaches 98.75% when the Gaussian component number is 1024 and the correlation number is 64. As the number of components increases, the computational complexity and the training time also grow. Thus, choosing a suitable number of Gaussian components is also an important factor in the speaker recognition model.
In the proposed model, there may be some redundancy in the fusion features. On the one hand, when the original Gaussian mean vectors are recombined, the dimension of the vector is enlarged by splicing the mean vectors. On the other hand, fusing the deep and shallow recombination vectors also expands the dimension of the vector. Both effects increase the computational complexity and the classification time. Therefore, to reduce the influence of this redundancy, we adopt the Fisher criterion to filter out some useless information. The experiments above show that the performance is best when the correlation number is 64. Thus, in the following experiments, we focus on the case where the Gaussian mixture number is 1024 and the correlation number is 64. The experimental results are displayed in Figure 12. The performance of the system remains unchanged when the reduced dimension is less than 200. However, when the reduced dimension is higher than 200, the recognition rate decreases gradually. It can be concluded that the Fisher criterion can filter the redundant information of the fusion features only to a limited degree. There is still plenty of room for improvement, so better algorithms are needed to reduce the dimension of the features.
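The paper does not spell out its Fisher criterion in this section, so the following sketch assumes the common per-dimension Fisher score, that is, the ratio of between-class scatter to within-class scatter, and removes the lowest-scoring dimensions. The function names and the toy data are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def fisher_scores(samples: Sequence[Sequence[float]],
                  labels: Sequence[int]) -> List[float]:
    """Per-dimension ratio of between-class to within-class scatter."""
    by_class: Dict[int, List[Sequence[float]]] = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    scores: List[float] = []
    for d in range(len(samples[0])):
        overall_mean = sum(x[d] for x in samples) / len(samples)
        between = 0.0
        within = 0.0
        for members in by_class.values():
            class_mean = sum(x[d] for x in members) / len(members)
            between += len(members) * (class_mean - overall_mean) ** 2
            within += sum((x[d] - class_mean) ** 2 for x in members)
        if within > 0:
            scores.append(between / within)
        else:
            # No within-class spread: perfectly separable or constant dimension
            scores.append(float("inf") if between > 0 else 0.0)
    return scores


def drop_weakest(samples: Sequence[Sequence[float]], labels: Sequence[int],
                 n_drop: int) -> Tuple[List[List[float]], List[int]]:
    """Remove the n_drop dimensions with the lowest Fisher score."""
    scores = fisher_scores(samples, labels)
    ranked = sorted(range(len(scores)), key=lambda d: scores[d], reverse=True)
    keep = sorted(ranked[:len(scores) - n_drop])
    return [[x[d] for d in keep] for x in samples], keep


# Toy data: two speakers, three feature dimensions.
samples = [[0.0, 5.0, 1.0], [0.2, 3.0, 1.1], [1.0, 4.0, 0.9], [1.2, 6.0, 1.0]]
labels = [0, 0, 1, 1]
reduced, kept = drop_weakest(samples, labels, n_drop=1)
print(kept)  # [0, 2]
```

Here `n_drop` plays the role of the reduced dimension on the horizontal axis of Figure 12: the larger it is, the more low-scoring dimensions are removed before the features are passed to the SVM.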

Performance of Speaker Recognition Based on the Optimized Weight Coefficients
In Section 4.2.3, when the number of speakers is enlarged to 20, the highest recognition rate reaches 95.94%. Compared with the case of 10 speakers, the performance of the system decreases to some extent. Therefore, to demonstrate the superiority of the fusion feature based on optimized weight coefficients, this section conducts comparative experiments on the system proposed in Section 3. There are 20 speakers in total, and the three types of features are the deep recombination Gaussian supervector, the traditional Gaussian supervector and the MFCC. The relevant parameters are set as in Section 4.2.3. The dimension of the MFCC is 48 and the number of Gaussian components is 1024. When the GA is used for optimization, the maximum number of individuals in the population is 50, the number of iterations is 500, the crossover probability is 0.45 and the mutation probability is 0.02. In SA, the annealing rate is 0.95, the number of iterations is 500 and the Metropolis step size is 0.02. We conduct experiments on the traditional Gaussian supervectors, the deep Gaussian supervectors, the directly fused features and the fused features with weighted coefficients. Figure 13 shows the recognition rate of the different types of features under different correlation numbers.
The last two bars in Figure 13 represent the performance obtained when the GA and SA algorithms are used for weight optimization. As can be seen from the figure, the highest recognition rates corresponding to the different features are 91.23%, 94.88%, 95.94%, 96.75% and 96.3%. The fused feature with weighted coefficients performs better than the other three kinds of features. Compared with the directly fused features, the recognition rate increases by 0.81% with the GA and by 0.36% with SA. In addition, under all Gaussian correlation numbers, the performance is better than that of the system with a single feature. This demonstrates the effectiveness of the weighted fusion method using the optimization algorithms of Section 3 in the multi-speaker scenario. Across the different correlation numbers, the speaker recognition rate obtained with the GA is generally higher than that obtained with SA. Therefore, in the speaker recognition scenario of this paper, the GA is more suitable for weight optimization than SA.
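A simulated annealing search over the three fusion weights can be sketched as below, using the settings quoted above (annealing rate 0.95, 500 iterations, step size 0.02). Several points are assumptions of this sketch rather than details from the paper: the step size is interpreted as the maximum perturbation of each weight, the initial temperature is set to 1.0, the search starts from equal weights, and `toy_score` is a hypothetical stand-in for the real objective, which would be the recognition rate obtained by training and evaluating the SVM on the weighted fusion feature.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple


def anneal_weights(score: Callable[[Sequence[float]], float],
                   n_weights: int = 3, iterations: int = 500,
                   cooling: float = 0.95, step: float = 0.02,
                   start_temp: float = 1.0,
                   seed: int = 0) -> Tuple[List[float], float]:
    """Maximize `score` over non-negative weights with simulated annealing."""
    rng = random.Random(seed)
    current = [1.0] * n_weights          # equal weights = direct fusion
    current_score = score(current)
    best, best_score = list(current), current_score
    temp = start_temp
    for _ in range(iterations):
        candidate = [max(0.0, w + rng.uniform(-step, step)) for w in current]
        cand_score = score(candidate)
        delta = cand_score - current_score
        # Metropolis rule: always accept improvements, and accept a worse
        # candidate with probability exp(delta / temp)
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, current_score = candidate, cand_score
            if current_score > best_score:
                best, best_score = list(current), current_score
        temp *= cooling                  # geometric cooling schedule
    return best, best_score


def toy_score(weights: Sequence[float]) -> float:
    """Hypothetical objective that peaks at weights (0.8, 1.0, 0.6)."""
    target = (0.8, 1.0, 0.6)
    return -sum((w - t) ** 2 for w, t in zip(weights, target))


best_weights, best_score = anneal_weights(toy_score)
print(best_weights, best_score)
```

Because the best solution seen so far is tracked separately, the returned score can never be worse than that of the equal-weight starting point, which mirrors the comparison between weighted and direct fusion in Figure 13.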
The study in Section 4.2.3 shows that there is some redundancy in the fusion features. The system proposed in Section 3 fuses three kinds of features, which is even more likely to cause feature redundancy. Therefore, to reduce the adverse effects of an excessively high fusion feature dimension, two dimensionality reduction strategies, Fisher screening and PCA, are applied in the GA-based speaker recognition system to screen out the redundant features. The remaining features are then input into the classifier for training to perform speaker identification.
The experimental results in Figure 14 show that when the number of filtered dimensions is less than 200, the recognition rate of the Fisher screening method does not decrease, and the recognition rate of the PCA-based system changes only slightly. When the number of removed dimensions is greater than 200, the performance of both methods declines rapidly, and the change in the recognition rate of the PCA method is more pronounced. These results show that the fusion features contain a certain amount of redundant information, which must be selected carefully so that the performance does not decline when the redundant information is removed.

Conclusions
In this paper, we present a novel speaker recognition model based on the deep and shallow recombined Gaussian supervector, which can effectively improve system performance. In the proposed approach, we first extract the MFCC from the original speech signal and input it into the DNN to extract the deep bottleneck feature, from which the deep Gaussian supervector is obtained. In parallel, we use the MFCC directly to train the Gaussian mixture model and obtain the traditional Gaussian supervector. Finally, the two supervectors are recombined and spliced horizontally to form a higher-dimensional fusion feature, which is used to train the SVM for the final decision. To obtain the best performance, we adjust the network parameters and select the optimal deep features through experiments. To assess the new approach, we compare it with systems based on deep features or traditional features alone. The results show that the fusion method enhances system performance effectively. In addition, to prevent the recognition rate from falling sharply when the number of speakers to be recognized increases, we introduce an optimization algorithm to find the optimal weights before feature fusion. The experimental results demonstrate that the fusion feature based on optimized weight coefficients improves the recognition rate by 0.81%. Because of the feature fusion, the dimension of the vectors input into the SVM is enlarged, resulting in higher system complexity and longer running time. The Fisher criterion can remove only a small part of the redundant information. Therefore, the focus of our future research is to find a better algorithm to further reduce the time and computational complexity of the system.