Whispered Speech Detection Using Glottal Flow-Based Features

Abstract: Recent studies have reported that the performance of Automatic Speech Recognition (ASR) technologies designed for normal speech deteriorates notably when evaluated on whispered speech. The detection of whispered speech is therefore useful for attenuating the mismatch between training and testing conditions. This paper proposes two new Glottal Flow (GF)-based features, namely, the GF-based Mel-Frequency Cepstral Coefficient (GF-MFCC) as a magnitude-based feature and the GF-based relative phase (GF-RP) as a phase-based feature, for whispered speech detection. The main contribution of the proposed features is to extract magnitude and phase information from the GF signal. In the GF-MFCC, Mel-Frequency Cepstral Coefficient (MFCC) feature extraction is modified by using the GF signal estimated by iterative adaptive inverse filtering as the input in place of the raw speech signal. Similarly, the GF-RP feature modifies relative phase (RP) feature extraction by using the GF signal instead of the raw speech signal. Whispered speech production yields a lower amplitude from the glottal source than normal speech production; consequently, whispered speech analyzed via the Discrete Fourier Transform (DFT) exhibits lower magnitude and different phase information, which distinguishes it from normal speech. It is therefore hypothesized that the two proposed feature types are useful for whispered speech detection. In addition, building on the individual GF-MFCC/GF-RP features, feature-level and score-level combinations are proposed to further improve the detection performance. The performance of the proposed features and combinations is investigated using the CHAINS corpus. The proposed GF-MFCC outperforms MFCC, while GF-RP performs better than RP. Further improvements are obtained via the feature-level combinations of MFCC and GF-MFCC (MFCC&GF-MFCC) and of RP and GF-RP (RP&GF-RP), compared with using either feature alone. In addition, the combined score of MFCC&GF-MFCC and RP&GF-RP gives the best frame-level accuracy of 95.01% and an utterance-level accuracy of 100%.


Introduction
Recently, Automatic Speaker Verification (ASV) and Automatic Speech Recognition (ASR) technologies have been applied in many modern speech applications [1][2][3]. However, the existing ASV and ASR systems designed for normal speech [4,5] deteriorate notably when they are evaluated on whispered speech, which is mainly characterized as unvoiced speech. Therefore, to improve the performance of ASV and ASR systems, the detection of whispered speech is useful for feature transformation [6] or model adaptation [7,8] to attenuate the mismatch between training and testing conditions [9]. In addition, the detection of whispered speech can be applied to the investigation of human diseases such as laryngeal cancer [10], functional voice disorders [11], and functional aphonia [12]. It can also have an interdisciplinary impact on ambient technologies for future smart homes and smart cities [13]. This study focuses on whispered speech detection, a pattern recognition task in the field of computer science that falls within the subject area of symmetry.
A typical whispered speech detection system, whose task is to determine whether a given speech sample is normal or whispered, usually comprises front-end feature extraction and a back-end classifier. Most whispered speech detection systems focus either on investigating the audio evidence for the front-end feature extraction [9,[14][15][16][17] or on creating/exploiting effective Deep Neural Network (DNN) models for the back-end classifier [18,19]. For the front-end feature extraction, earlier studies have explored the use of Relative Spectral Transform-Perceptual Linear Prediction (RASTA-PLP) coefficients [14], Auditory-inspired Modulation Spectrum (AMS) features [14,15], Long-Term Logarithmic Energy Variation (LTLEV) features [16], Teager Energy Cepstral Coefficients (TECC) [9], Mel-Frequency Cepstral Coefficients (MFCC), Linear-Frequency Cepstral Coefficients (LFCC) [9], and spectral entropy [17]. Although these results have shown the importance of the aforementioned features to an extent, the features are not computed from a modified/improved version of the raw speech signal, which may be a more effective input for feature extraction. Therefore, the design of new feature extraction remains an open research subject for improving the performance of whispered speech detection.
For the back-end classifier, DNN and Long Short-Term Memory (LSTM) models using log-filterbank energies were proposed in [18] and exhibited useful results for the detection of whispered speech. Moreover, the authors of [19] introduced a Convolutional Neural Network (CNN) and an Xception CNN with the joint magnitude spectrum and group delay spectrum for classifying whispered speech. The results showed that the CNN and Xception CNN provided promising accuracy at the utterance level. Although the mentioned DNN-based classifiers provide encouraging classification, their detection performance strongly depends on a large amount of training data. Moreover, we also note that a conventional DNN-based classifier trained on limited training data still requires the design of auxiliary features to augment the conventional input features and improve model performance, as seen in [19,20]. This paper aims to investigate the principles of whispered speech; therefore, it focuses on devising relevant features rather than creating DNN models.
In this paper, we propose two types of new features for separating whispered speech from normal speech. The main contribution of the proposed features is to extract magnitude and phase information based on the Glottal Flow (GF) signal [21]. For the first proposed feature, the MFCC feature extraction is modified by using the GF signal estimated by Iterative Adaptive Inverse Filtering (IAIF) [21][22][23] as the input in place of the raw speech signal. This modified MFCC feature is referred to as GF-based MFCC (GF-MFCC) and serves as the new magnitude-based feature in this experiment. Similarly, for the second proposed feature, the RP feature extraction is modified by using the GF signal instead of the raw speech signal. The modified RP is called GF-based RP (GF-RP) and serves as the new phase-based feature. Because whispered speech production has a lower amplitude from the glottal source than normal speech production, whispered speech yields lower magnitude values via the Discrete Fourier Transform (DFT), making it different from normal speech. It is therefore hypothesized that the two types of new features provide efficient magnitude and phase information for the detection of whispered speech. In addition, building on the individual GF-MFCC/GF-RP features, a feature-level combination is applied to exploit the complementary nature between the raw-speech-based and GF-based features and further improve the detection performance. A score-level combination is also implemented to improve the decision accuracy by exploiting the complementary nature between magnitude and phase information.
The remaining sections of this paper are organized as follows: Section 2 analyzes the effect of the GF signal on whispered speech detection and introduces the conventional and proposed feature sets, including the MFCC vs. GF-MFCC and RP vs. GF-RP extraction sets. The experimental setup is described in Section 3, including the details of the database, the feature extraction parameters, and the classifier. The results and discussion are presented in Section 4. Finally, Section 5 gives the conclusions.

GF-Based Feature Extraction
In this section, the estimation of the GF signal and the analysis of its effect on whispered speech detection are first described, and then feature extraction based on the MFCC vs. GF-MFCC and RP vs. GF-RP features is introduced.

Estimating the GF Signal and Analyzing Its Effect
Although various techniques [24][25][26] for estimating the GF signal have been proposed, this study applies the IAIF technique introduced in [21] because of its computational efficiency and simplicity. The IAIF is based on an iterative refinement of both the glottal components and the vocal tract transfer function. The GF signal is estimated using inverse filtering to cancel the effects of the vocal tract and lip radiation. In practice, the IAIF method has two iterations. In the first, a preliminary estimate of the glottal contribution is calculated using first-order linear predictive coding (LPC) analysis. In the second, higher-order LPC analysis is added to yield a more accurate model of the glottal contribution.
The detailed framework of the IAIF technique is shown in Figure 1. Here, the blocks numbered from 1 to 6 are defined as the first iteration and the ones numbered from 7 to 11 are the second iteration. The estimation of the GF signal using the IAIF technique comprises the following brief stages.

•
Block no. 1: the high-pass filter, which is a standard pre-processing step in glottal inverse filtering, is applied to the given speech sample s to remove the low-frequency ambient noise originating from the microphone. Further details of the IAIF technique can be found in [22,23].

The magnitude spectrograms obtained from the speech and GF signals are compared to analyze their effect on distinguishing between normal and whispered speech. Two utterances, one normal speech signal (frfo1_S03_solo.wav) (https://chains.ucd.ie/ftpaccess.php (accessed on 28 March 2020)) and one whispered speech signal (frfo1_S03_whsp.wav) (https://chains.ucd.ie/ftpaccess.php (accessed on 29 March 2020)), were produced by the same speaker reading the same text. Figure 2 displays the speech signal, the GF signal, the magnitude spectrogram of the speech signal, and the magnitude spectrogram of the GF signal.
It can be observed from Figure 2e,f that the magnitude spectrogram derived from the normal speech signal preserves more information in the low and high frequencies than that derived from the whispered speech signal. This is owing to the lack of periodic excitation of the vocal folds in whispered speech production. Similarly, Figure 2g,h shows that the magnitude spectrogram of the GF signal from normal speech provides more information than that from whispered speech, owing to the lack of glottal source information in whispered speech production. Comparing the spectrograms of the speech and GF signals, although the magnitude spectrograms of the raw speech signal give more information in the low and high frequencies than those of the GF signal, the reduced low- and high-frequency information of the GF signal may be efficiently captured as a new feature. This motivates the hypothesis of this study that feature extraction using the GF signal is powerful for whispered speech detection.
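To make the inverse-filtering idea concrete, the following is a simplified single-pass sketch of glottal inverse filtering. It is not the paper's implementation: IAIF [21][22][23] performs two iterations with refined LPC orders, and all function names, the LPC-order rule of thumb, and the leaky-integration constant here are illustrative assumptions.

```python
import numpy as np

def lpc(x, order):
    """Autocorrelation-method LPC; returns coefficients [1, a1, ..., a_order]."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R + 1e-9 * np.eye(order), -r[1:order + 1])
    return np.concatenate(([1.0], a))

def simple_glottal_inverse_filter(s, fs, vt_order=None):
    """Single-pass sketch of glottal inverse filtering:
    1) crude high-pass to remove low-frequency ambient noise,
    2) LPC estimate of the vocal tract,
    3) inverse filtering to obtain the glottal excitation residual,
    4) leaky integration to cancel the lip-radiation derivative."""
    if vt_order is None:
        vt_order = int(fs / 1000) + 2             # rule-of-thumb LPC order
    x = np.append(s[0], s[1:] - 0.99 * s[:-1])    # step 1
    a = lpc(x, vt_order)                          # step 2
    residual = np.convolve(x, a, mode="same")     # step 3
    gf = np.zeros_like(residual)                  # step 4
    for n in range(1, len(residual)):
        gf[n] = residual[n] + 0.99 * gf[n - 1]
    return gf
```

The estimated `gf` waveform is what the magnitude spectrograms in Figure 2g,h are computed from.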

MFCC vs. GF-MFCC Extraction
The MFCC is a popular magnitude-based feature for speech/speaker tasks because it can extract effective spectral characteristics in the high frequency range, which is also suitable for separating whispered speech from normal speech, as summarized in [9]. In this paper, the MFCC is used as the baseline magnitude-based feature. The process of MFCC feature extraction is shown in Figure 3a and is briefly summarized as follows: the time-domain speech signal, s(n), is first framed and windowed to obtain the pre-processed speech signal, x(n). After that, the DFT is calculated for each frame at time t to obtain the spectrum, X(ω_k, t), as follows:

X(ω_k, t) = ∑_{n=0}^{N−1} x(n) e^{−jω_k n}, (1)

where ω_k = 2πk/N and N denotes the length of the DFT. Next, the power of the magnitude spectrum is computed and weighted by the l-th Mel-scale filter to obtain the energy coefficient, E(l, t), given by

E(l, t) = ∑_{ω_k = L_l}^{U_l} W_l(ω_k) |X(ω_k, t)|², l = 1, ..., L, (2)

where L denotes the total number of filters, W_l is the l-th Mel filter, and L_l and U_l denote its lower and upper band-edge frequencies, respectively.
Finally, since the energy coefficients in adjacent bands tend to be correlated, the discrete cosine transform (DCT) is applied to the logarithm of E(l, t). The result in the cepstral domain, C(m, t), which is referred to as the MFCC, is computed as:

C(m, t) = ∑_{l=1}^{L} log E(l, t) cos(πm(l − 0.5)/L), m = 1, ..., N_c, (3)

where m is the index of the cepstral coefficient and N_c is the number of MFCC coefficients. The previous subsection revealed that the magnitude information derived from GF signals can efficiently expose the differences between whispered and normal speech owing to their different magnitude distribution characteristics. However, detecting whispered speech through magnitude feature extraction applied to the GF signal estimated by IAIF has been less studied. This paper therefore proposes the GF-MFCC for the detection of whispered speech: the MFCC extraction is modified by using the estimated GF signal derived from IAIF as the input in place of the raw speech signal. Figure 3b shows the process of GF-MFCC extraction.
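The DFT, Mel weighting, log, and DCT steps above can be sketched for a single frame as follows. This is a hedged illustration, not the paper's implementation: the helper names, the Hamming window, and the filterbank construction are assumptions; for the GF-MFCC, the frame `x` would simply be taken from the IAIF-estimated GF signal instead of the raw speech.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular Mel filters W_l(omega_k) spanning 0 .. fs/2."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for l in range(1, n_filters + 1):
        lo, mid, hi = bins[l - 1], bins[l], bins[l + 1]
        for k in range(lo, mid):
            fb[l - 1, k] = (k - lo) / max(mid - lo, 1)   # rising edge
        for k in range(mid, hi):
            fb[l - 1, k] = (hi - k) / max(hi - mid, 1)   # falling edge
    return fb

def mfcc_frame(x, fs, n_filters=40, n_ceps=13):
    """MFCC of one windowed frame: DFT -> power -> Mel weighting -> log -> DCT."""
    n_fft = len(x)
    X = np.fft.rfft(x * np.hamming(n_fft))            # spectrum X(omega_k, t)
    power = np.abs(X) ** 2                            # power of magnitude spectrum
    E = mel_filterbank(n_filters, n_fft, fs) @ power  # energy coefficients E(l, t)
    return dct(np.log(E + 1e-12), type=2, norm="ortho")[:n_ceps]
```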
To observe the MFCC and GF-MFCC characteristics visually, Figure 4a-d shows visualization maps of the MFCC and GF-MFCC features for normal and whispered speech. We can observe from Figure 4a,b that the MFCC of normal and whispered speech provides a good representation, but the magnitude distribution (orange and yellow colors) spreads over high and low frequencies, leading to an ambiguous representation, as seen in the unvoiced segments from 0.5 s to 0.55 s. In contrast, Figure 4c,d shows that the GF-MFCC of normal and whispered speech gives a distinct representation owing to its compact magnitude distribution, which can be efficiently captured as an input feature for the detection of whispered speech.

RP vs. GF-RP Extraction
Recently, the RP feature, a phase-based feature, has been used for many speech tasks such as speech emotion recognition [27], speaker verification [28], speaker recognition [29], replay attack detection [30], and synthetic speech detection [20]. Although the RP feature can efficiently capture information of the provided speech by introducing a normalization process followed by cosine and sine functions and provides promising results, it has been less exploited to detect whispered speech. In this paper, we explore the significance of the RP feature for separating whispered speech from normal speech.
Based on conventional short-time windowing, the changes in the original phase information strongly depend on the clipping position of the input speech. As summarized in [27], even for the same sentence, the phase representations of adjacent windows computed from the original phase information were very different although they should be similar. To overcome this obstacle caused by the clipping position, the RP was introduced to obtain smaller phase differences between adjacent windows. The phase is kept constant at a certain base frequency ω_base, and the phases of the other frequencies are computed relative to it. The spectrum is obtained as follows:

X̃(ω, t) = |X(ω, t)| × e^{jθ(ω, t)} × e^{j(ω/ω_base)(−θ(ω_base, t))}, (4)

Here, the difference in phase information between Equations (1) and (4) is (−θ(ω_base, t)) when ω = ω_base. For any other frequency, where ω = 2πf, the difference in phase information between Equations (1) and (4) is (ω/ω_base)(−θ(ω_base, t)).
Next, the phase information is normalized as follows:

θ̃(ω, t) = θ(ω, t) + (ω/ω_base)(−θ(ω_base, t)). (5)

Finally, the normalized phase information is mapped into coordinates on the unit circle:

RP → {cos(θ̃), sin(θ̃)}. (6)

Figure 3c shows the process of RP extraction. For further details of RP feature extraction, readers are referred to [28].
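A minimal sketch of RP extraction for one frame, following the normalization and unit-circle mapping described above, is given below. The bin selection that reduces the feature to the 38 dimensions used in the experiments is omitted, and the function name and window choice are illustrative assumptions; for the GF-RP, the frame would be taken from the estimated GF signal.

```python
import numpy as np

def relative_phase(x, fs, base_freq=1000.0):
    """Relative-phase (RP) features of one windowed frame x.
    The phase at base_freq is used as the reference, every other bin's
    phase is shifted proportionally to its frequency, and each normalized
    phase is mapped onto the unit circle as {cos, sin}."""
    n = len(x)
    X = np.fft.rfft(x * np.hamming(n))
    theta = np.angle(X)                        # original phase theta(omega_k, t)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)     # omega_k / (2*pi)
    k_base = int(np.argmin(np.abs(freqs - base_freq)))
    # shift each bin's phase by (omega / omega_base) * (-theta(omega_base, t))
    theta_norm = theta - (freqs / freqs[k_base]) * theta[k_base]
    # wrap into (-pi, pi] and map onto the unit circle
    theta_norm = np.angle(np.exp(1j * theta_norm))
    return np.concatenate([np.cos(theta_norm), np.sin(theta_norm)])
```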
Motivated by Section 2.1, the phase information derived from the GF signal may also provide a distinct representation, because the magnitude and phase information obtained from the DFT have a strong relationship. In this paper, the GF-RP feature is proposed as a new phase-based feature for the detection of whispered speech: the RP is modified by using the GF signal instead of the raw speech signal. In practice, the IAIF block is added as a new process preceding the conventional RP feature extraction. The process of GF-RP feature extraction is shown in Figure 3d.
To visualize the RP and GF-RP characteristics, visualization maps are shown in Figure 4e-h, which can be compared with the MFCC and GF-MFCC maps in Figure 4a-d. Although the magnitude-based MFCC and GF-MFCC features provide more distinguishable representations than the phase-based features, the RP and GF-RP are still useful for separating whispered speech from normal speech. Figure 4e-h reveals that the GF-RP provides a clearer representation than the RP, because the phase information based on the raw speech yields ambiguous representations between whispered and normal speech, whereas the GF signal can be efficiently exploited as a phase-based feature.

Used Database
In this paper, the publicly available CHAINS corpus [31] is used to investigate the proposed features and combinations. The main reason for using this database is that it is lightweight and suitable for conducting the feature experiments. The CHAINS corpus was produced by 36 speakers (16 females/20 males) and contains 1332 utterances in normal and whispered speech. The utterances were recorded with three different accents: Irish (12 females and 16 males), American (3 females and 2 males), and British (1 female and 2 males). All utterances are sampled at 44.1 kHz. The details of the CHAINS corpus are shown in Table 1.

Feature Extraction Parameters
Prior to the feature extraction process, all utterances are first downsampled from 44.1 kHz to 16 kHz to save computational cost. To directly compare the frame-level performance, the frame-blocking of all features uses a 20 ms frame length and a 10 ms frame shift. For the MFCC and GF-MFCC features, 39-dimensional feature vectors based on 40 subband filters in a Mel filterbank are used, as suggested in [9]: 13 static coefficients are appended with their ∆ and ∆∆ coefficients to obtain the 39-dimensional vectors. The parameters of the IAIF method are set as in [32] to estimate the GF signal. Next, 38-dimensional feature vectors are used for the RP and GF-RP features, and the base frequency of both phase features is set to ω_base = 2π × 1000 Hz as suggested in [28,30,33].
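The appending of ∆ and ∆∆ coefficients to the 13 static coefficients can be sketched as below with a standard regression-based delta; the window width is an assumed parameter, since the paper does not state its exact delta settings.

```python
import numpy as np

def deltas(feats, width=2):
    """Regression-based delta coefficients over a (frames x dims) feature matrix."""
    t = len(feats)
    pad = np.pad(feats, ((width, width), (0, 0)), mode="edge")
    denom = 2 * sum(w * w for w in range(1, width + 1))
    return sum(w * (pad[width + w:t + width + w] - pad[width - w:t + width - w])
               for w in range(1, width + 1)) / denom

def add_deltas(static):
    """Stack static, delta, and delta-delta coefficients (13 dims -> 39 dims)."""
    d = deltas(static)
    return np.hstack([static, d, deltas(d)])
```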

Classifier
Since the current experiment focuses on limited training data, a deep learning-based classifier, which tends to perform poorly with small training datasets [34], was not considered. In this paper, the Gaussian Mixture Model (GMM) is used: a very simple choice that provides good performance for the detection of whispered speech [9]. The Expectation Maximization (EM) algorithm for two-class GMM classifiers is implemented to find the Maximum Likelihood Estimation (MLE) parameters for the given normal and whispered speech samples. The decision of whether the tested speech sample is normal or whispered is made using the following logarithmic likelihood ratio:

Λ(O) = log p(O | λ_normal) − log p(O | λ_whisper), (7)

where O is the new testing sample, and λ_normal and λ_whisper denote the GMMs for normal and whispered speech, respectively. Here, we implement the Vlfeat toolkit [35] to model the GMMs with 512 mixture components, as suggested in [9]. Because of the lightweight database, 5-fold cross-validation is used in all our experiments. Motivated by the success of score-level combination [28], improved detection performance compared with a single classifier is obtained by combining the scores of two classifiers using different features. In this paper, the linear combination introduced in [28] is used to produce a new decision score as follows:

Ls_combined = α · Ls_first + (1 − α) · Ls_second, (8)

where α is the weighting coefficient, and Ls_first and Ls_second represent the likelihood scores of the GMMs obtained from the first and second selected features, respectively. After repeated experiments, α is set to 0.6 when combining two magnitude/phase features and to 0.8 when combining magnitude and phase features. In addition to the score-level combination, the feature-level combination is applied to investigate the complementary nature between the conventional and GF-based features.
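A minimal sketch of the two-class GMM detector and the linear score fusion is given below. It uses scikit-learn's `GaussianMixture` as a stand-in for the Vlfeat toolkit used in the paper, with the mixture count reduced from 512 for illustration; all function names are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_detector(feats_normal, feats_whisper, n_mix=8):
    """Fit one GMM per class with EM (the paper uses 512 mixtures; fewer here)."""
    gmm_n = GaussianMixture(n_components=n_mix, covariance_type="diag",
                            random_state=0).fit(feats_normal)
    gmm_w = GaussianMixture(n_components=n_mix, covariance_type="diag",
                            random_state=0).fit(feats_whisper)
    return gmm_n, gmm_w

def llr_scores(gmm_n, gmm_w, feats):
    """Frame-wise log-likelihood ratio: positive -> normal, negative -> whispered."""
    return gmm_n.score_samples(feats) - gmm_w.score_samples(feats)

def fuse(scores_a, scores_b, alpha=0.8):
    """Linear score-level combination of two classifiers' score streams."""
    return alpha * scores_a + (1.0 - alpha) * scores_b
```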

Results and Discussion
The effectiveness of our proposed features and combinations is first evaluated in terms of frame-level performance. Two common evaluation criteria suggested in [9] are employed: (1) Frame-level accuracy: the ratio of the number of correctly predicted segments to the total number of testing segments, which contain both normal and whispered speech segments. (2) Equal Error Rate (EER): the error rate at the operating point where the false alarm rate equals the miss rate, based on frame-level decisions.
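The two frame-level criteria can be sketched as follows, assuming positive log-likelihood-ratio scores indicate normal speech (label 1) and using a simple threshold sweep for the EER; both function names and conventions are illustrative.

```python
import numpy as np

def frame_accuracy(scores, labels):
    """Fraction of frames whose score sign matches the label (1 = normal, 0 = whispered)."""
    pred = (scores > 0).astype(int)
    return float(np.mean(pred == labels))

def equal_error_rate(scores, labels):
    """EER: operating point where the false-alarm rate equals the miss rate."""
    best_gap, eer = np.inf, 1.0
    for t in np.sort(scores):
        pred = scores > t
        miss = 1.0 - np.mean(pred[labels == 1])   # normal frames rejected
        fa = np.mean(pred[labels == 0])           # whispered frames accepted as normal
        if abs(miss - fa) < best_gap:
            best_gap, eer = abs(miss - fa), (miss + fa) / 2.0
    return float(eer)
```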
The results of our proposed features and combinations based on frame-level performance are presented in Table 2. The following conclusions can be drawn:

•
The comparison of the baseline magnitude and phase features, as seen in Table 2, reveals that the RP gives a poorer performance than the MFCC. The reason is that magnitude-based discrimination provides more distinguishable characteristics than phase-based discrimination.

•
As seen from the proposed features in Table 2, the GF-MFCC performs better than the MFCC, while the GF-RP is also superior to the RP in terms of accuracy and EER. These results indicate that the extraction of magnitude and phase information from the GF signal, which gives compact scattering information, can reduce the ambiguous differences between normal and whispered speech. Secondly, the feature-level combination/augmentation of the MFCC and GF-MFCC (MFCC&GF-MFCC) features significantly improves the classification performance compared with using a single MFCC/GF-MFCC feature. Similarly, improved results are obtained using the augmentation of the RP and GF-RP (RP&GF-RP) features. Considering magnitude and phase information separately, these results confirm that the GF-based features are complementary to the conventional features computed from the raw speech signal. However, the augmentation of magnitude- and phase-based features, which is not reported in Table 2, gives worse performance than using the individual features, because the GMM-based classifier cannot adequately model the joint magnitude- and phase-based features. Thirdly, improved performance is also obtained using the score combination. These results indicate that the complementary nature between magnitude- and phase-based features can be exploited via the score-level combination; a similar trend can be found in [28,33]. Finally, it is evident that the combined score of the augmented MFCC&GF-MFCC and the augmented RP&GF-RP gives the best performance among all the methods used.

•
The results of the currently proposed features are compared with some known systems based on the CHAINS corpus. Here, the LFCC and TECC are compared with the proposed features. As seen in Table 2, the GF-MFCC performs better than the LFCC because of the advantages of capturing the GF signal and using the Mel filterbank. However, the TECC performs better than all of our proposed features, including the score combination of MFCC&GF-MFCC and RP&GF-RP, because this feature incorporates both the amplitude and frequency information of the raw signal. Note, however, that the reported TECC result was obtained with self-selected training and testing datasets, whereas the current results were evaluated using a five-fold cross-validation strategy, which ensures a more reliable estimate of the classification performance of the proposed methods.
To represent the miss and false alarm probabilities of the conventional features, our proposed features, and the proposed feature/score-level combinations, Figure 5 shows the DET curves of MFCC, GF-MFCC, MFCC&GF-MFCC, RP, GF-RP, RP&GF-RP, and MFCC&GF-MFCC+RP&GF-RP, using the experimental results of the first fold of our experiments. Comparing the DET curves of the MFCC, GF-MFCC, MFCC&GF-MFCC, RP, GF-RP, and RP&GF-RP feature sets, it can be seen that the MFCC/RP yields higher miss and false alarm probabilities than the GF-MFCC/GF-RP. This indicates that our proposed features capturing the GF signal can reduce the ambiguous differences between normal and whispered speech. Moreover, the results show that the MFCC&GF-MFCC/RP&GF-RP provides significantly lower miss and false alarm probabilities than the individual features, because the features based on the raw speech and GF signals have strong complementarity under the GMM-based classifier. Finally, it is observed that the score combination of MFCC&GF-MFCC and RP&GF-RP further reduces the miss and false alarm probabilities, because the complementarity of the magnitude and phase features is exploited via the score-level combination.
From Table 2, MFCC&GF-MFCC+RP&GF-RP gave the best result based on the frame-level performance. Therefore, it is also compared with some known systems based on the utterance-level performance. Here, the single frame-wise scores are averaged to obtain the utterance-level decision for the tested utterance. To investigate the utterance-level performance, two common evaluation criteria, accuracy and F1-score, are used. The utterance-level accuracy is similar to the frame-level accuracy, but it uses the averaged frame-wise decision instead of the single-frame decision. The F1-score is defined as the harmonic mean of recall (Re) and precision (Pr), expressed as:

F1 = 2 · Pr · Re / (Pr + Re).

The evaluation results of the proposed and existing methods are shown in Table 3. As seen in Table 3, MFCC&GF-MFCC+RP&GF-RP provides an accuracy of 100% and an F1-score of 100%, indicating that the proposed method is very effective at the utterance level. Compared with the known systems, our proposed method outperforms all of them. This is because the referenced systems are based on deep-learning classifiers, which might not perform well on limited training datasets.
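The utterance-level scoring and the F1 computation can be sketched as below; the label conventions (1 = normal, 0 = whispered, with "whispered" as the positive class) and function names are assumptions for illustration.

```python
import numpy as np

def utterance_scores(frame_scores, utt_ids):
    """Average frame-wise scores per utterance to get one decision score each."""
    ids = sorted(set(utt_ids))
    return ids, np.array([np.mean(frame_scores[utt_ids == u]) for u in ids])

def f1_score(pred, labels):
    """F1 = harmonic mean of precision and recall for the 'whispered' class (label 0)."""
    tp = np.sum((pred == 0) & (labels == 0))
    fp = np.sum((pred == 0) & (labels == 1))
    fn = np.sum((pred == 1) & (labels == 0))
    pr = tp / max(tp + fp, 1)
    re = tp / max(tp + fn, 1)
    return 0.0 if pr + re == 0 else 2 * pr * re / (pr + re)
```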

Conclusions and Future Work
In this paper, two GF-based features, namely the GF-MFCC and GF-RP features, have been introduced to exploit GF signals for whispered speech detection. The MFCC and GF-MFCC/RP and GF-RP features have been augmented, termed MFCC&GF-MFCC and RP&GF-RP, respectively, to combine the merits of the raw-speech-based and GF-based features. The score combination of MFCC&GF-MFCC and RP&GF-RP has been proposed to exploit the complementarity of the magnitude and phase information. The performance of the proposed features has been evaluated using the CHAINS corpus. The experimental results have revealed that the GF-MFCC performs better than the MFCC and, similarly, the GF-RP performs better than the RP. Moreover, the MFCC&GF-MFCC/RP&GF-RP provided better performance than either feature alone. Finally, further improvement was obtained using the combined scores of MFCC&GF-MFCC and RP&GF-RP. These results indicate that the GF-MFCC and GF-RP features are powerful for whispered speech detection.
Although the proposed systems have indicated a promising performance under the utterance-level condition for the whispered speech detection, there are still many challenges for frame-level performance. In future work, the effect of the proposed extraction method based on empirical mode decomposition [36] will be investigated. In addition, it is worth exploring the efficient whispered voice activity detection algorithms [37] to improve the frame-level classification accuracy.

Conflicts of Interest:
The authors declare no conflict of interest.