Whispered Speech Detection Using Glottal Flow-Based Features
Abstract
1. Introduction
2. GF-Based Feature Extraction
2.1. Estimating the GF Signal and Analyzing Its Effect
- Block no.1: a high-pass filter, which is a standard pre-processing step in glottal inverse filtering, is applied to the given speech signal s to remove the low-frequency ambient noise picked up by the microphone.
- Block no.2: a first-order LPC analysis is applied to the filtered speech to estimate the combined contribution of the GF and the lip radiation.
- Block no.3: the estimated GF and lip radiation are eliminated from the filtered speech signal through inverse filtering.
- Block no.4: the output of block 3 is analyzed using LPC analysis to obtain the first estimate of the vocal tract.
- Block no.5: the estimated vocal tract is eliminated from the filtered speech signal through inverse filtering.
- Block no.6: the first estimation for the glottal excitation is obtained by eliminating the lip radiation effect through integration.
- Block no.7: a second-order LPC analysis is used to estimate the glottal contribution. Because this LPC analysis has a higher order than the one in block 2, it yields a more accurate estimate.
- Block no.8: the estimated glottal contribution is eliminated again through inverse filtering.
- Block no.9: the final estimate of the vocal tract is obtained by applying a p-order LPC analysis to the output of the previous block.
- Block no.10: the effect of the vocal tract is canceled from the filtered speech through inverse filtering.
- Block no.11: the glottal flow g is obtained by eliminating the lip radiation effect by integrating the output of block 10.
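The eleven blocks above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`lpc`, `inverse_filter`, `integrate`, `highpass`, `iaif`, `toy_signal`) are hypothetical, the vocal tract order `p=12`, the one-pole DC blocker used as the high-pass filter, and the leak coefficient 0.99 are illustrative choices that do not come from the text, and block 8 is assumed to act on the filtered speech. LPC is computed with the autocorrelation method and the Levinson-Durbin recursion.

```python
import math
import random


def lpc(x, order):
    """Return [1, a1, ..., a_order] via autocorrelation + Levinson-Durbin."""
    n = len(x)
    r = [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(order + 1)]
    a = [1.0] + [0.0] * order
    err = r[0] + 1e-12  # small floor avoids division by zero on silence
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err = max(err * (1.0 - k * k), 1e-12)
    return a


def inverse_filter(x, a):
    """Apply the FIR inverse filter A(z) = sum_k a[k] z^-k to x."""
    return [sum(a[k] * x[n - k] for k in range(len(a)) if n - k >= 0)
            for n in range(len(x))]


def integrate(x, leak=0.99):
    """Leaky integrator 1 / (1 - leak z^-1); cancels the lip-radiation effect."""
    y, prev = [], 0.0
    for v in x:
        prev = v + leak * prev
        y.append(prev)
    return y


def highpass(x, r=0.995):
    """One-pole DC blocker (assumed stand-in for the block 1 high-pass filter)."""
    y, px, py = [], 0.0, 0.0
    for v in x:
        py = v - px + r * py
        px = v
        y.append(py)
    return y


def iaif(s, p=12, g=2):
    """Eleven-block IAIF sketch; returns the glottal flow estimate."""
    x = highpass(s)                              # block 1
    a_g1 = lpc(x, 1)                             # block 2: GF + lip radiation
    y1 = inverse_filter(x, a_g1)                 # block 3
    a_vt1 = lpc(y1, p)                           # block 4: first vocal tract estimate
    g1 = integrate(inverse_filter(x, a_vt1))     # blocks 5-6
    a_g2 = lpc(g1, g)                            # block 7: second-order glottal model
    y2 = inverse_filter(x, a_g2)                 # block 8 (assumed to act on x)
    a_vt2 = lpc(y2, p)                           # block 9: final vocal tract estimate
    return integrate(inverse_filter(x, a_vt2))   # blocks 10-11


def toy_signal(n=800, fs=8000, seed=0):
    """Two sinusoids plus noise; a synthetic test input, not real speech."""
    rng = random.Random(seed)
    return [math.sin(2 * math.pi * 120 * t / fs)
            + 0.5 * math.sin(2 * math.pi * 700 * t / fs)
            + 0.05 * rng.gauss(0.0, 1.0) for t in range(n)]
```

Calling `iaif(toy_signal())` returns a list with the same length as the input. For real recordings, a vectorized implementation would be preferable to these pure-Python loops.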
2.2. MFCC vs. GF-MFCC Extraction
2.3. RP vs. GF-RP Extraction
3. Experimental Setup
3.1. Used Database
3.2. Feature Extraction Parameters
3.3. Classifier
4. Results and Discussion
- (1) Frame-level accuracy: the ratio of the number of correctly predicted segments to the total number of testing segments, which include both normal speech and whispered speech segments.
- (2) Equal Error Rate (EER): the error rate at the operating point where the false alarm rate equals the miss rate, based on frame-level decisions.
- The comparison of the baseline magnitude and phase features in Table 2 shows that RP performs worse than MFCC, because the magnitude-based features provide more discriminative characteristics than the phase-based features.
- For the proposed features in Table 2, first, GF-MFCC performs better than MFCC, and GF-RP is superior to RP, in terms of both accuracy and EER. This indicates that extracting magnitude and phase information from the GF signal, which gives compact scattering information, can reduce the ambiguity between normal and whispered speech. Second, the feature-level combination (augmentation) of MFCC and GF-MFCC (MFCC&GF-MFCC) improves the classification performance over either MFCC or GF-MFCC alone, and the augmentation of RP and GF-RP (RP&GF-RP) gives a similar improvement. Within the magnitude domain and within the phase domain, these results confirm that the GF-based features are complementary to the conventional features extracted from the raw speech signal. However, augmenting magnitude-based features with phase-based features (not reported in Table 2) can perform worse than the individual features, because the GMM-based classifier cannot adequately model the joint magnitude and phase feature space. Third, improved performance is also obtained with the score combination of MFCC and GF-MFCC (MFCC+GF-MFCC) and of RP and GF-RP (RP+GF-RP), because combining scores exploits the complementary nature of the conventional and GF-based features. Moreover, unlike feature-level augmentation, the score-level combination of magnitude-based and phase-based features improves performance, for example MFCC and RP (MFCC+RP), GF-MFCC and RP (GF-MFCC+RP), and GF-MFCC and RP&GF-RP (GF-MFCC+RP&GF-RP). These results indicate that the complementary nature of magnitude and phase-based features can be exploited through score-level combination; a similar trend is reported in [28,33]. Finally, the score combination of the augmented MFCC&GF-MFCC and the augmented RP&GF-RP (MFCC&GF-MFCC+RP&GF-RP) gives the best performance among the proposed feature sets.
- The proposed features are also compared with known systems evaluated on the CHAINS corpus, namely LFCC and TECC (results from [9]). As seen in Table 2, GF-MFCC performs better than LFCC, which is attributed to capturing the GF signal and using the Mel filterbank. TECC, however, performs better than all of our proposed features, including the score combination of MFCC&GF-MFCC and RP&GF-RP, because it incorporates both amplitude and frequency information of the raw signal. Note that the TECC result was obtained with self-selected training and testing sets, whereas our results were evaluated using five-fold cross-validation, which gives a more reliable estimate of classification performance.
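The two evaluation metrics defined at the start of this section, and the distinction between feature-level augmentation (written with & in Table 2) and score-level combination (written with +), can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, whispered speech is assumed to be the target class (label 1) with higher scores meaning more whisper-like, and the weighted sum in `fuse_scores` is an assumed fusion rule, since the text does not state how the scores are combined.

```python
def frame_accuracy(labels, predictions):
    """Fraction of frames whose predicted class matches the reference label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def equal_error_rate(labels, scores):
    """Sweep thresholds; return the error rate where false alarm and miss rates meet.

    labels: 1 = whispered (assumed target), 0 = normal.
    scores: higher means more whisper-like (assumption).
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_gap, eer = float("inf"), 1.0
    for t in sorted(set(scores)):
        fa = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= t) / n_neg
        miss = sum(1 for y, s in zip(labels, scores) if y == 1 and s < t) / n_pos
        if abs(fa - miss) < best_gap:
            best_gap, eer = abs(fa - miss), (fa + miss) / 2
    return eer


def augment_features(frames_a, frames_b):
    """Feature-level combination ("&"): concatenate per-frame feature vectors."""
    return [list(a) + list(b) for a, b in zip(frames_a, frames_b)]


def fuse_scores(scores_a, scores_b, weight=0.5):
    """Score-level combination ("+"): weighted sum of two systems' frame scores."""
    return [weight * a + (1 - weight) * b for a, b in zip(scores_a, scores_b)]
```

With augmentation, a single classifier is trained on the concatenated vectors; with score combination, each feature set keeps its own classifier and only the per-frame scores are merged before the decision.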
5. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Wang, D.; Wang, X.; Lv, S. An overview of end-to-end automatic speech recognition. Symmetry 2019, 11, 1018.
2. Memon, N. How biometric authentication poses new challenges to our security and privacy [in the spotlight]. IEEE Signal Process. Mag. 2017, 34, 196–194.
3. Heigold, G.; Moreno, I.; Bengio, S.; Shazeer, N. End-to-end text-dependent speaker verification. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 5115–5119.
4. Grozdić, D.T.; Jovičić, S.T. Whispered speech recognition using deep denoising autoencoder and inverse filtering. IEEE/ACM Trans. Audio Speech Lang. 2017, 25, 2313–2322.
5. Jin, Q.; Jou, S.S.; Schultz, T. Whispering speaker identification. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), Beijing, China, 2–5 July 2007; pp. 1027–1030.
6. Yang, C.; Brown, G.; Lu, L.; Yamagishi, J.; King, S. Noise-robust whispered speech recognition using a non-audible-murmur microphone with VTS compensation. In Proceedings of the 8th International Symposium on Chinese Spoken Language Processing, Hong Kong, China, 5–8 December 2012; pp. 220–223.
7. Ito, T.; Takeda, K.; Itakura, F. Analysis and recognition of whispered speech. Speech Commun. 2005, 45, 139–152.
8. Mathur, A.; Reddy, S.M.; Hegde, R.M. Significance of parametric spectral ratio methods in detection and recognition of whispered speech. EURASIP J. Adv. Signal Process. 2012, 2012, 1–20.
9. Khoria, K.; Kamble, M.R.; Patil, H.A. Teager energy cepstral coefficients for classification of normal vs. whisper speech. In Proceedings of the 28th European Signal Processing Conference (EUSIPCO), Virtual, 18–22 January 2021; pp. 1–5.
10. Gavidia-Ceballos, L.; Hansen, J.H. Direct speech feature estimation using an iterative EM algorithm for vocal fold pathology detection. IEEE Trans. Biomed. Eng. 1996, 43, 373–383.
11. Koufman, J.A.; Isaacson, G. The spectrum of vocal dysfunction. Otolaryngol. Clin. N. Am. 1991, 24, 985–988.
12. Hansen, J.H.; Gavidia-Ceballos, L.; Kaiser, J.F. A nonlinear operator-based speech feature analysis method with application to vocal fold pathology assessment. IEEE Trans. Biomed. Eng. 1998, 45, 300–313.
13. Thakur, N.; Han, C. An ambient intelligence-based human behavior monitoring framework for ubiquitous environments. Information 2021, 12, 81.
14. Sarria-Paja, M.; Falk, T.H. Whispered speech detection in noise using auditory-inspired modulation spectrum features. IEEE Signal Process. Lett. 2013, 20, 142–149.
15. Kinnunen, T.; Lee, K.A.; Li, H. Dimension reduction of the modulation spectrogram for speaker verification. In Proceedings of the Odyssey 2008: The Speaker and Language Recognition Workshop, Stellenbosch, South Africa, 21–24 July 2008.
16. Meenakshi, G.N.; Ghosh, P.K. Robust whisper activity detection using long-term log energy variation of sub-band signal. IEEE Signal Process. Lett. 2015, 22, 1859–1863.
17. Zhang, C.; Hansen, J.H. Whisper-island detection based on unsupervised segmentation with entropy-based speech feature processing. IEEE Trans. Audio Speech Lang. 2010, 19, 883–894.
18. Raeesy, Z.; Gillespie, K.; Ma, C.; Drugman, T.; Gu, J.; Maas, R.; Rastrow, A.; Hoffmeister, B. LSTM-based whisper detection. In Proceedings of the IEEE Spoken Language Technology Workshop (SLT), Athens, Greece, 18–21 December 2018; pp. 139–144.
19. Shah, N.J.; Shaik, M.A.B.; Periyasamy, P.; Patil, H.A.; Vij, V. Exploiting phase-based features for whisper vs. speech classification. In Proceedings of the 29th European Signal Processing Conference (EUSIPCO), Virtual, 23–27 August 2021; pp. 1–5.
20. Wang, L.; Phapatanaburi, K.; Oo, Z.; Nakagawa, S.; Iwahashi, M.; Dang, J. Phase aware deep neural network for noise robust voice activity detection. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), Hong Kong, China, 10–14 July 2017; pp. 1087–1092.
21. Raitio, T.; Suni, A.; Yamagishi, J.; Pulakka, H.; Nurminen, J.; Vainio, M.; Alku, P. HMM-based speech synthesis utilizing glottal inverse filtering. IEEE Trans. Audio Speech Lang. 2010, 19, 153–165.
22. Alku, P. Glottal wave analysis with pitch synchronous iterative adaptive inverse filtering. Speech Commun. 1992, 11, 109–118.
23. Alku, P.; Tiitinen, H.; Näätänen, R. A method for generating natural-sounding speech stimuli for cognitive brain research. Clin. Neurophysiol. 1999, 110, 1329–1333.
24. Wong, D.; Markel, J.; Gray, A. Least squares glottal inverse filtering from the acoustic speech waveform. IEEE Trans. Audio Speech Lang. 1979, 27, 350–355.
25. Akande, O.O.; Murphy, P.J. Estimation of the vocal tract transfer function with application to glottal wave analysis. Speech Commun. 2005, 46, 15–36.
26. Fu, Q.; Murphy, P. Robust glottal source estimation based on joint source-filter model optimization. IEEE Trans. Audio Speech Lang. 2006, 14, 492–501.
27. Guo, L.; Wang, L.; Dang, J.; Chng, E.S.; Nakagawa, S. Learning affective representations based on magnitude and dynamic relative phase information for speech emotion recognition. Speech Commun. 2022, 136, 118–127.
28. Nakagawa, S.; Wang, L.; Ohtsuka, S. Speaker identification and verification by combining MFCC and phase information. IEEE/ACM Trans. Audio Speech Lang. 2011, 20, 1085–1095.
29. Wang, L.; Minami, K.; Yamamoto, K.; Nakagawa, S. Speaker recognition by combining MFCC and phase information in noisy conditions. IEICE Trans. Inf. Syst. 2010, 93, 2397–2406.
30. Oo, Z.; Wang, L.; Phapatanaburi, K.; Liu, M.; Nakagawa, S.; Iwahashi, M.; Dang, J. Replay attack detection with auditory filter-based relative phase features. EURASIP J. Audio Speech Music Process. 2019, 2019, 1–11.
31. Cummins, F.; Grimaldi, M.; Leonard, T.; Simko, J. The CHAINS corpus: Characterizing individual speakers. In Proceedings of the International Conference on Speech and Computer, Saint Petersburg, Russia, 25–29 June 2006; pp. 431–435.
32. Degottex, G.; Kane, J.; Drugman, T.; Raitio, T.; Scherer, S. COVAREP - A collaborative voice analysis repository for speech technologies. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Florence, Italy, 4–9 May 2014; pp. 960–964.
33. Wang, L.; Yoshida, Y.; Kawakami, Y.; Nakagawa, S. Relative phase information for detecting human speech and spoofed speech. In Proceedings of the Sixteenth Annual Conference of the International Speech Communication Association, Dresden, Germany, 6–10 September 2015; pp. 2092–2096.
34. Deng, L. Deep learning: From speech recognition to language and multimodal processing. APSIPA Trans. Signal Inf. Process. 2016, 2016, 5.
35. Vedaldi, A.; Fulkerson, B. VLFeat: An open and portable library of computer vision algorithms. In Proceedings of the 18th ACM International Conference on Multimedia, New York, NY, USA, 25–29 October 2010; pp. 1469–1472.
36. Phapatanaburi, K.; Kokkhunthod, K.; Wang, L.; Jumphoo, T.; Uthansakul, M.; Boonmahitthisud, A.; Uthansakul, P. Brainwave classification for character-writing application using EMD-based GMM and KELM approaches. CMC-Comput. Mater. Contin. 2021, 66, 3029–3044.
37. Naini, A.R.; Satyapriya, M.; Ghosh, P.K. Whisper activity detection using CNN-LSTM based attention pooling network trained for a speaker identification task. In Proceedings of the 21st Annual Conference of the International Speech Communication Association, Shanghai, China, 14–18 September 2020; pp. 2922–2926.





| Female speakers | Male speakers | Whisper duration (h:mm:ss) | Whisper no. of utterances | Normal duration (h:mm:ss) | Normal no. of utterances |
|---|---|---|---|---|---|
| 16 | 20 | 2:28:28 | 1332 | 2:33:08 | 1332 |

| Group | Features | Accuracy (%) | EER (%) |
|---|---|---|---|
| Conventional | MFCC (our implementation set as in [9]) | 91.20 | 8.89 |
| Conventional | RP | 70.16 | 30.73 |
| Proposed | GF-MFCC | 93.15 | 6.78 |
| Proposed | GF-RP | 71.50 | 29.21 |
| Proposed | MFCC&GF-MFCC | 94.58 | 5.33 |
| Proposed | RP&GF-RP | 77.81 | 22.81 |
| Proposed | MFCC+GF-MFCC | 94.57 | 5.36 |
| Proposed | RP+GF-RP | 75.99 | 25.01 |
| Proposed | MFCC+RP | 91.38 | 8.74 |
| Proposed | MFCC+GF-RP | 91.83 | 8.25 |
| Proposed | MFCC+RP&GF-RP | 92.31 | 7.72 |
| Proposed | GF-MFCC+RP | 93.44 | 6.42 |
| Proposed | GF-MFCC+GF-RP | 93.17 | 6.71 |
| Proposed | GF-MFCC+RP&GF-RP | 93.76 | 6.03 |
| Proposed | MFCC&GF-MFCC+RP | 94.66 | 5.22 |
| Proposed | MFCC&GF-MFCC+GF-RP | 94.74 | 5.18 |
| Proposed | MFCC&GF-MFCC+RP&GF-RP | 95.01 | 4.85 |
| Compared | LFCC (result in [9]) | 83.97 | 16.05 |
| Compared | TECC (result in [9]) | 95.61 | 4.46 |

| Group | Features | Classifier | Accuracy (%) | F1-Score |
|---|---|---|---|---|
| Proposed | MFCC&GF-MFCC+RP&GF-RP | GMM | 100 | 100 |
| Compared | SPEC (result in [19]) | DNN | 98.94 | 98.96 |
| Compared | CPSPEC (result in [19]) | DNN | 99.42 | 99.42 |
| Compared | GDSPEC (result in [19]) | DNN | 97.98 | 98.03 |
| Compared | SPEC&CPSPEC (result in [19]) | DNN | 99.89 | 99.89 |
| Compared | SPEC&GDSPEC (result in [19]) | DNN | 99.78 | 99.79 |
| Compared | SPEC&GDSPEC (result in [19]) | CNN | 99.99 | 99.99 |
| Compared | SPEC&GDSPEC (result in [19]) | Xception CNN | 99.98 | 99.98 |
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Phapatanaburi, K.; Pathonsuwan, W.; Wang, L.; Anchuen, P.; Jumphoo, T.; Buayai, P.; Uthansakul, M.; Uthansakul, P. Whispered Speech Detection Using Glottal Flow-Based Features. Symmetry 2022, 14, 777. https://doi.org/10.3390/sym14040777