Whispered Speech Conversion Based on the Inversion of Mel Frequency Cepstral Coefficient Features

Abstract: A conversion method based on the inversion of Mel frequency cepstral coefficient (MFCC) features was proposed to convert whispered speech into normal speech. First, the MFCC features of whispered speech and normal speech were extracted, and a matching relation between the MFCC feature parameters of whispered and normal speech was developed through a Gaussian mixture model (GMM). The MFCC feature parameters of the normal speech corresponding to the whispered speech were then obtained from the GMM, and finally the whispered speech was converted into normal speech through the inversion of the MFCC features. The experimental results showed that the cepstral distortion (CD) of the normal speech converted by the proposed method was 21% lower than that of the normal speech converted using linear predictive coefficient (LPC) features, the mean opinion score (MOS) was 3.56, and a satisfactory outcome was achieved in both intelligibility and sound quality.


Introduction
Whispered speech is a mode of articulation different from normal speech [1]; it is produced without vibration of the vocal cords and at a low sound level, which causes whispered speech to lack a fundamental frequency in voiced sounds and to have an energy approximately 20 dB lower than that of normal speech [2]. Because of these characteristics, whispered speech is widely used in places where loud sounds are prohibited, such as conference rooms, libraries, and concert halls. Because of its potential applications [3][4][5][6], studies on whispered speech (e.g., whispered speech emotion recognition, whispered speech enhancement, whispered speech recognition, and whispered speech conversion) have gradually attracted researchers' attention in recent years. Among them, whispered speech conversion, a technique that converts whispered speech to normal speech, has been widely used in mobile communication, medical equipment, security monitoring, and crime identification [7][8][9]. For example, for a laryngeal cancer patient who has undergone a laryngectomy, the conversion of whispered speech to normal speech can greatly improve the patient's speech communication experience and efficiency [10].
At present, whispered speech conversion methods fall mainly into two categories. One is rule-based conversion, which derives transformation rules from empirical observation or statistical modeling. The other is machine learning based, which includes Gaussian mixture model and neural network methods [11]. For the conversion of whispered speech to normal speech, researchers used the multiple-excitation linear prediction method to implement Chinese speech reconstruction [12,13], in which normal speech is reconstructed by adding fundamental frequencies to the speech energy based on the whispered speech formants. However, the formant distribution of whispered speech is not identical to that of normal speech, and its formant frequencies are shifted toward higher frequencies [14]; therefore, the sound quality of the reconstructed speech was poor. Perrotin et al. adopted the mixed excitation linear prediction (MELP) model [15] to reconstruct whispered speech, dividing the whispered speech into five frequency bands. Of these, the four low-frequency bands were treated as voiced excitation and the high-frequency band as unvoiced excitation, to which fundamental frequencies were added based on the speech energy, modifying the characteristics of the formant spectrum. The advantage of this method is its easy application to communication systems; however, because the added fundamental frequency is more monotonous than that of normal speech, the reconstructed speech often sounds "metallic". Sharifzadeh et al. improved the code-excited linear prediction (CELP) method by adding a whispered speech preprocessing module, a fundamental frequency generation module, and a formant modification module to achieve the conversion of whispered speech to normal speech. Their model has very limited effectiveness on continuous speech and is prone to quantization errors when performing vector quantization on the excitation source [16]. Xu et al. proposed a whispered speech reconstruction system based on homomorphic signal processing and relative entropy segmentation [17,18], which better eliminates the metallic sound caused by adding fundamental frequencies, regardless of rhyme.
The above methods are all rule-based whispered speech conversion techniques. In recent years, an increasing number of scholars have begun to use the statistical distribution characteristics of speech features to perform conversion, adopting probabilistic methods to predict target feature vectors from source feature vectors. Among these, the Gaussian mixture model (GMM) is the most widely used. Toda et al. applied GMM-based speech conversion methods to whispered speech conversion for the first time [19][20][21][22][23], making it possible to convert whispered speech to normal speech. In addition, Chen et al. proposed building a continuous probability model for the spectral envelope of whispered speech using a probability-weighted GMM [24] to obtain the mapping relations of channel parameters between whispered speech and the corresponding normal speech. The probability-weighted GMM can theoretically obtain a reasonable mapping between the source and target feature vectors, but the high dimensionality of the spectral envelope still makes it difficult for the GMM to build a model. Therefore, the choice of speech features is important when converting whispered speech using a GMM. Compared with other feature parameters, the Mel frequency cepstral coefficient (MFCC) simulates the hearing characteristics of the human ear. Boucheron et al. used MFCC feature inversion to synthesize speech, and the perceptual evaluation of speech quality (PESQ) score of the synthesized speech was over 3.5 [25]. To account for the sparseness of speech, in this study we used the L1/2 algorithm to reconstruct speech, building on the work of Boucheron et al. and improving the reconstruction effectiveness.
In this paper, we report a method for converting whispered speech to normal speech based on MFCC features and a GMM. In the model establishment stage, we extracted the MFCC parameters of each frame of whispered speech and reference normal speech from a parallel corpus; we then used the GMM to establish the joint probability distribution of the frame feature parameters. In the conversion stage, we first input the frame feature parameters of whispered speech into the model; after estimating the feature parameters of normal speech, we used MFCC feature inversion to directly reconstruct normal speech. Compared with existing GMM-based whispered speech conversion methods, the GMM developed in this study takes into account the correlation between adjacent frames of speech, and the proposed method does not require fundamental frequency estimation when reconstructing speech.

GMM-Based Whispered Speech Conversion Model
GMM-based whispered speech conversion reconstructs normal speech from whispered speech by estimating the acoustic parameters of normal speech. As shown in Figure 1, the method has two stages: a model establishment stage and a conversion stage. In the first stage, the MFCC parameters of each frame of normal speech and whispered speech were extracted, and the reference normal speech and whispered speech were aligned using dynamic time warping (DTW). The joint normal speech-whispered speech MFCC feature distribution model was then constructed using a GMM. In the second stage, the MFCC features of whispered speech were converted to the MFCC features of normal speech using the GMM established in the first stage, and normal speech was then synthesized from the converted MFCC features. The probability density function of an M-order GMM is a weighted sum of M Gaussian probability density functions:

p(x) = \sum_{i=1}^{M} a_i N(x; \mu_i, \Sigma_i), \quad a_i \ge 0, \quad \sum_{i=1}^{M} a_i = 1.

Each subdistribution is a D-dimensional joint Gaussian probability distribution, which can be expressed as

N(x; \mu_i, \Sigma_i) = (2\pi)^{-D/2} |\Sigma_i|^{-1/2} \exp\!\left(-\tfrac{1}{2}(x - \mu_i)^{\mathrm T} \Sigma_i^{-1} (x - \mu_i)\right),

where \mu_i is the mean vector and \Sigma_i is the covariance matrix.
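As a minimal illustration, the weighted sum of Gaussian densities above can be evaluated directly (a sketch; the function and variable names `gaussian_pdf`, `weights`, `mus`, `covs` are illustrative, not from the paper):

```python
import numpy as np

def gaussian_pdf(x, mu, cov):
    """D-dimensional Gaussian density N(x; mu, cov)."""
    D = len(mu)
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** D * np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / norm

def gmm_pdf(x, weights, mus, covs):
    """M-order GMM density: sum_i a_i * N(x; mu_i, Sigma_i)."""
    return sum(a * gaussian_pdf(x, mu, cov)
               for a, mu, cov in zip(weights, mus, covs))
```

For a single zero-mean component with identity covariance in two dimensions, the density at the origin is 1/(2π), which gives a quick sanity check of the implementation.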
Assuming that the source feature vector and the target feature vector conform to a joint Gaussian probability distribution, the GMM was used to model the joint MFCC feature parameters as follows:

P(z_t) = \sum_{m=1}^{M} a_m N(z_t; \mu_m, \Sigma_m), \quad z_t = [x_t^{\mathrm T}, y_t^{\mathrm T}]^{\mathrm T},

where a_m is the probability weight of the mth Gaussian component, x_t is the source feature parameter vector of the tth frame, and y_t is the corresponding target feature vector.
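The conversion step implied by this joint model is the standard GMM mapping: each component contributes its conditional mean of the target given the source, weighted by the component's posterior probability. A sketch under that standard formulation (names are illustrative; the mean and covariance of each component are partitioned into source and target blocks):

```python
import numpy as np

def convert_frame(x, weights, mus, covs):
    """Estimate the target (normal-speech) feature vector for one source
    (whispered) frame x: the posterior-weighted sum over components of
    the conditional mean mu_y + S_yx S_xx^{-1} (x - mu_x)."""
    dx = len(x)
    liks, cond_means = [], []
    for a, mu, cov in zip(weights, mus, covs):
        mu_x, mu_y = mu[:dx], mu[dx:]
        S_xx, S_yx = cov[:dx, :dx], cov[dx:, :dx]
        inv_xx = np.linalg.inv(S_xx)
        diff = x - mu_x
        # weighted marginal likelihood of x under this component
        lik = a * np.exp(-0.5 * diff @ inv_xx @ diff) / np.sqrt(
            (2 * np.pi) ** dx * np.linalg.det(S_xx))
        liks.append(lik)
        cond_means.append(mu_y + S_yx @ inv_xx @ diff)
    post = np.array(liks) / sum(liks)   # posterior component weights
    return sum(p * m for p, m in zip(post, cond_means))
```

With a single component whose joint covariance has unit variances and cross-covariance 0.5, a source value of 1.0 maps to a target estimate of 0.5, as the conditional-mean formula predicts.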

Inversion of Speech Features
MFCC analysis is based on the mechanism of human hearing, i.e., the speech spectrum is analyzed according to the results of human hearing experiments. In this process, the speech signal is passed through a series of filter banks, called Mel filter banks; when the speech signal passes through the filter banks, the output is the power spectrum at the Mel frequency. Performing a discrete cosine transform (DCT) on the logarithm of the filter bank outputs yields the MFCC:

c = DCT(\log Z),

where Z is the Mel-band energy spectrum,

Z = \Phi y.

Here \Phi is the weight matrix of the Mel frequency scale, a pitch-based frequency scale derived from human auditory perception; the Mel scale is roughly logarithmic in the actual frequency. The speech frequency range is divided into a sequence of triangular filters, i.e., the Mel filter banks, where K is the number of spectral lines of the Mel filter banks, y is the power spectrum of the speech signal x, and X is the frequency-domain representation of the speech obtained by the short-time Fourier transform (STFT):

y = |X|^2, \quad X = \mathrm{STFT}(x).

To reconstruct speech by inverting the MFCC, the process of obtaining the MFCC features is inverted. Since the DCT, log, and squaring operations are all invertible, when Z and \Phi are known, the energy spectrum y of the speech and its phase spectrum can be accurately estimated from the MFCC features [26,27] to obtain the inverted speech frame.
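The forward chain (filter bank, log, DCT) can be written compactly. The sketch below uses an unnormalized DCT-II and takes an illustrative `mel_fb` matrix in place of a true Mel filter bank; since the log and DCT steps are invertible (up to truncation of cepstral coefficients), reconstruction then reduces to recovering y from Z = Φy:

```python
import numpy as np

def mfcc_from_power(y, mel_fb, n_ceps=13):
    """Forward MFCC chain: Z = Phi @ y (Mel filter-bank energies),
    then a log, then a DCT-II.  `mel_fb` plays the role of the weight
    matrix Phi (K x F); `y` is one frame's power spectrum."""
    Z = mel_fb @ y                       # Mel-band energy spectrum Z = Phi y
    log_z = np.log(Z + 1e-12)            # log compression; small offset
                                         # guards against log(0)
    K = len(log_z)
    n = np.arange(K)
    # unnormalized DCT-II basis: basis[k, n] = cos(pi * k * (2n + 1) / (2K))
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * K))
    return basis @ log_z
```

With an identity "filter bank" and a constant power spectrum of e, the log outputs are all 1, so the zeroth coefficient equals K and the higher coefficients vanish by symmetry of the cosine basis.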

Estimation of the energy spectrum y
Because of the sparse characteristics of speech signals, exploiting sparsity in their decomposition can effectively avoid interference from noise, which is of great significance for signal reconstruction. Xu et al. [28] conducted an in-depth study of the sparse decomposition of signals and demonstrated the advantages of the L1/2 norm for sparse decomposition, which has achieved good results when applied to the reconstruction of sparse signals. In this study, we propose an L1/2 + L2 double sparse constraint-based signal reconstruction model and a solution for y with known Z and \Phi. The objective function is as follows:

\min_{y} \; \|Z - \Phi y\|_2^2 + \lambda_1 \|y\|_{1/2}^{1/2} + \lambda_2 \|y\|_2^2,   (4)

where \lambda_1 and \lambda_2 are the regularization parameters that control the sparseness and smoothness, respectively, of the coefficient vector, satisfying \lambda_1 > 0 and \lambda_2 > 0. From Equation (4), an iterative update formula for y is obtained, and following the literature [29], the regularization parameters are set accordingly.
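The paper's own closed-form update derived from Equation (4) is not reproduced here; the following is a plain projected-gradient sketch for the same doubly regularized objective, with the L1/2 term smoothed by a small epsilon so it is differentiable at zero (all names and the solver choice are assumptions, not the paper's method):

```python
import numpy as np

def reconstruct_power(Z, Phi, lam1=0.01, lam2=0.001, steps=500, lr=0.1):
    """Minimize ||Z - Phi y||^2 + lam1*||y||_{1/2}^{1/2} + lam2*||y||^2
    over y >= 0 by projected gradient descent."""
    eps = 1e-8
    y = np.maximum(Phi.T @ Z, 0.0)                   # nonnegative start
    for _ in range(steps):
        grad = -2.0 * Phi.T @ (Z - Phi @ y)          # data-fit term
        grad = grad + lam1 * 0.5 / np.sqrt(y + eps)  # smoothed L1/2 term
        grad = grad + 2.0 * lam2 * y                 # L2 smoothness term
        y = np.maximum(y - lr * grad, 0.0)           # project onto y >= 0
    return y
```

With both regularizers switched off and an invertible Φ, the solver should recover the exact solution of Z = Φy, which serves as a basic correctness check.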

Phase spectrum estimation
We used the least squares method to estimate the speech frame from the power spectrum, i.e., the power spectrum was square-rooted to obtain an amplitude spectrum. An appreciable amount of phase information is lost when the power spectrum is obtained during feature extraction; thus, it is important to recover the phase spectrum. The least squares error inverse short-time Fourier transform magnitude (LSE-ISTFTM) algorithm, which iterates the discrete Fourier transform (DFT) and inverse discrete Fourier transform (IDFT), was used to estimate the phase spectrum [30]; in combination with the given amplitude spectrum, the estimated phase was then used to estimate the speech of each frame in the time domain. Finally, the speech was reconstructed by overlap-adding the sequence of speech frames. The LSE-ISTFTM procedure (Algorithm 1) is as follows:
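Algorithm 1 itself is not reproduced in this text; the following is a compact sketch of the Griffin-Lim-style magnitude-consistency loop that LSE-ISTFTM describes: start from random phase, alternately invert and re-analyze, and re-impose the known magnitudes each round (window choice and function names are illustrative):

```python
import numpy as np

def stft(x, win, hop):
    """Windowed forward STFT: one rFFT per frame."""
    n = len(win)
    frames = [x[i:i + n] * win for i in range(0, len(x) - n + 1, hop)]
    return np.array([np.fft.rfft(f) for f in frames])

def istft(S, win, hop):
    """Least-squares overlap-add inverse STFT."""
    n = len(win)
    length = (len(S) - 1) * hop + n
    x, norm = np.zeros(length), np.zeros(length)
    for k, spec in enumerate(S):
        x[k * hop:k * hop + n] += np.fft.irfft(spec, n) * win
        norm[k * hop:k * hop + n] += win ** 2
    return x / np.maximum(norm, 1e-12)

def lse_istftm(mag, win, hop, iters=100, seed=0):
    """Estimate a phase consistent with the magnitude spectrogram `mag`:
    random initial phase, then alternating ISTFT/STFT with the known
    magnitudes re-imposed on every iteration."""
    rng = np.random.default_rng(seed)
    phase = np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, mag.shape))
    for _ in range(iters):
        x = istft(mag * phase, win, hop)
        phase = np.exp(1j * np.angle(stft(x, win, hop)))
    return istft(mag * phase, win, hop)
```

The analysis/synthesis pair round-trips a signal exactly wherever the overlapped window energy is nonzero, and each LSE iteration reduces the mismatch between the target magnitudes and those of the synthesized signal.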

Experimental Conditions
The experiment was performed on a PC with a Core i5 3.2 GHz processor and 4 GB of memory, using MATLAB 2013a. To verify the effectiveness of the proposed algorithm, we used 100 sentences recorded in a quiet environment by 10 speakers (five female and five male), each contributing 10 sentences, covering five topics: the stock market, catering, tourism, sports, and film. Each sentence was approximately 2 s long, with a sampling rate of 8 kHz and a precision of 16 bits. The signals were segmented into frames of 512 sampling points with 50% interframe overlap; three frames of whispered speech and one frame of normal speech constituted a joint feature vector. We randomly selected 90% of the whispered speech sentences and the corresponding parallel corpus as training data; the remaining 10% were used as test data. The model does not consider speaker relevance. Because the speech information of the current frame is closely related to the preceding and following frames, to achieve better conversion effectiveness, we used not only the current speech frame after DTW but also the frames before and after it. Moreover, provided the pronunciation is standard Chinese, using different topics, or even unseen speakers, in the training and test sets has little impact on the final result.
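The joint vectors used for training, three consecutive whispered-speech MFCC frames concatenated with the DTW-aligned normal-speech frame, can be assembled as follows (a sketch; the array layout and function name are assumptions):

```python
import numpy as np

def joint_vectors(whisper_mfcc, normal_mfcc):
    """Build joint training vectors: the previous, current, and next
    whispered-speech MFCC frames concatenated with the time-aligned
    normal-speech frame.  Both inputs are (n_frames, n_ceps) arrays,
    already aligned by DTW."""
    T = len(whisper_mfcc)
    joint = []
    for t in range(1, T - 1):            # skip edge frames lacking a neighbor
        joint.append(np.concatenate([whisper_mfcc[t - 1],
                                     whisper_mfcc[t],
                                     whisper_mfcc[t + 1],
                                     normal_mfcc[t]]))
    return np.array(joint)
```

For n_ceps cepstral dimensions this yields joint vectors of length 4 * n_ceps, of which the first 3 * n_ceps dimensions form the source part and the last n_ceps the target part of the GMM.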
To verify the effectiveness of the proposed L1/2 algorithm, we compared it with the L2 algorithm; the results, obtained with 300 EM iterations, are shown in Figure 2. The data in Figure 2 show that the proposed algorithm greatly improves speech reconstruction compared with the L2 algorithm. To further validate the advantages of the proposed method, we generated spectrograms of the speech inverted by the two algorithms, as shown in Figure 3. The data in Figure 3 show that, compared with the L2 algorithm, the speech reconstructed using the proposed algorithm was clearer, with reduced noise interference. The proposed method obtained spectrograms similar to those of the reference normal speech without estimating the fundamental frequency. The harmonic components are clearly visible in the spectrograms, indicating that the proposed method yields a good spectrogram estimate of normal speech. We therefore used the proposed algorithm in the subsequent experiments.
In the next experiment, we examined the performance of the proposed method both subjectively and objectively. The subjective test used the auditory characteristics of the human ear to evaluate the converted speech in multiple aspects, such as propensity, intelligibility, and naturalness, using the mean opinion score (MOS) scale. The MOS scale categorizes speech quality into five levels: 1: Bad, unbearably annoying; 2: Poor, obviously distorted and annoying; 3: Fair, perceptibly distorted and slightly annoying; 4: Good, slightly distorted; 5: Excellent, no perceivable distortion. The subjects were asked to score the converted speech on this scale to assess its quality. For the objective test, we used the cepstral distortion (CD), which is calculated per frame as

CD = \frac{10}{\ln 10} \sqrt{2 \sum_{d=1}^{D} (C_d - C_d')^2},

where C_d and C_d' are the dth-dimensional Mel frequency cepstra of the converted speech and the target speech, respectively, and D is the number of cepstral dimensions. The average over all frames is used as the CD value of the speech segment. The number of Gaussian components of the GMM (M) must be determined first; it is difficult to derive the optimal M theoretically, but it can be determined experimentally for a given data set and is generally set to 4, 8, 16, etc. In this study, based on the results in Figure 4, we found the optimal conditions to be multiframe speech and M = 32, which were adopted in the subsequent experiments.
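Assuming the standard mel-cepstral distortion formula (10/ln 10) · sqrt(2 · Σ_d (C_d − C'_d)²) per frame, averaged over frames, the CD measure can be computed as follows (a sketch; names are illustrative):

```python
import numpy as np

def cepstral_distortion(C, C_ref):
    """Mean cepstral distortion in dB between two cepstral sequences of
    shape (n_frames, D): the per-frame distortion
    (10/ln 10) * sqrt(2 * sum_d (C_d - C'_d)^2), averaged over frames."""
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(
        2.0 * np.sum((C - C_ref) ** 2, axis=1))
    return per_frame.mean()
```

Identical sequences give a CD of zero, and a unit difference in every dimension of a two-dimensional cepstrum gives 20/ln 10 ≈ 8.69 dB, which matches the formula term by term.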

Experimental Results
When using the MOS score for subjective evaluation, a higher MOS score means higher intelligibility and naturalness of the converted speech. In the MOS evaluation, we invited five testers to score 10 converted utterances from each conversion direction. The results are shown in Table 1, in which LPC [31] denotes extracting partial correlation coefficients and building signal generation models for synthesis, the DNN-based method [11] used a 60-120-60-120-60 network configuration, MFCC denotes using only the current frame for feature conversion, and MFCCs denotes using the current frame together with the previous and subsequent frames (i.e., multiple frames). The results in Table 1 show that using multiple frames is more effective than using a single frame. Moreover, the MOS scores of the whispered speech converted using MFCC feature inversion are all greater than 3, indicating that the converted speech is highly intelligible.
CD is an objective evaluation standard, especially suitable for evaluating the performance of cepstral feature conversion. According to the formula, the smaller the CD, the closer the converted speech is to normal speech. As shown in Figure 5, the speech reconstructed using the proposed method was more similar to normal speech, indicating that the feature inversion algorithm is a feasible method for whispered speech conversion. To show the conversion effectiveness more visually, a spectrogram example is given in Figure 6, in which the sentence is "gu min sang shi xin xin" ("shareholders have lost confidence"). The data in Figure 6 show that the MFCC feature inversion-based method proposed in this study achieved a better conversion of the formants and spectral envelope. Whispered speech has an energy approximately 20 dB lower than that of normal speech; in the converted speech, the sound energy is increased significantly. The speech converted using the proposed method also exhibits clearer harmonics than whispered speech, with pronounced horizontal harmonic bars and distinct alternation of voiced and unvoiced sounds, indicating that the spectrogram of the converted speech is more similar to that of normal speech.

Conclusions
To account for the sparseness of speech, we proposed using the L1/2 algorithm to invert the MFCC features, which yields good perceptual quality. We used a GMM to jointly model the feature parameters of whispered speech and normal speech, obtaining a conversion model that maps whispered speech features to normal speech features, and finally adopted feature inversion to reconstruct the converted speech. Experiments showed that the proposed method was highly effective in converting whispered speech, with a significant improvement in sound quality.