Estimation of the Underlying F0 Range of a Speaker from the Spectral Features of a Brief Speech Input
Featured Application
Abstract
1. Introduction
1.1. F0 Range and F0 Normalization
1.2. Traditional Methods of F0 Range Estimation
1.3. Pitch Range Perception from a Brief Speech Input
1.4. A Method for F0 Range Estimation from a Brief Speech Input
1.5. The Motivation of the Current Study
2. Experiment 1: The Refined Model
2.1. Corpora
2.2. Refined Model Setup
2.3. Results
3. Experiment 2: F0 Range Estimation from a Bilingual Parallel Corpus
3.1. Speech Data
3.2. Estimation from the L1 and L2 Speech by the Spectral-Based Model and the Direct F0 Analysis
3.3. Results and Discussion
4. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
F0 Range Parameter | MAPE (%) |
---|---|
Ceiling | 2.19 |
Mean | 2.15 |
Floor | 2.52 |
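
The MAPE column in the table above reports a mean absolute percentage error for each F0 range parameter. As a minimal sketch of how such a figure is conventionally computed, the snippet below averages the absolute relative errors between reference and estimated values and scales the result to percent. The function name `mape` and the speaker values are hypothetical and chosen only for illustration; they are not taken from the article, and this excerpt does not state whether the reported errors were computed on a Hz or a logarithmic scale.

```python
def mape(actual, predicted):
    """Return the mean absolute percentage error (in percent).

    `actual` holds the reference values and `predicted` the model
    estimates; both are equal-length sequences of non-zero numbers.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if len(actual) == 0:
        raise ValueError("at least one value pair is required")

    total = 0.0
    for a, p in zip(actual, predicted):
        if a == 0:
            raise ValueError("reference values must be non-zero")
        # Absolute error relative to the reference value.
        total += abs((p - a) / a)

    # Average over all pairs and convert the fraction to percent.
    return 100.0 * total / len(actual)


# Hypothetical per-speaker F0 ceilings in Hz (illustrative only,
# NOT data from the article).
actual_ceiling = [200.0, 250.0, 400.0]
predicted_ceiling = [210.0, 245.0, 392.0]

ceiling_mape = mape(actual_ceiling, predicted_ceiling)
print(f"Ceiling MAPE: {ceiling_mape:.2f}%")
```

The same function could be applied separately to the ceiling, mean, and floor estimates to obtain one MAPE value per row of the table.
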
F0 Range Parameter | DFA-JC (Taken as the Actual Value) | SBM-J | SBM-C | DFA-J | DFA-C |
---|---|---|---|---|---|
Ceiling | 2.410 (0.12) | 2.387 (0.09) | 2.380 (0.09) | 2.402 (0.12) | 2.374 (0.13) |
Mean | 2.243 (0.12) | 2.206 (0.08) | 2.199 (0.09) | 2.256 (0.12) | 2.229 (0.12) |
Floor | 2.084 (0.13) | 2.056 (0.08) | 2.048 (0.08) | 2.100 (0.13) | 2.088 (0.13) |
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Zhang, W.; Xie, Y.; Lin, B.; Wang, L.; Zhang, J. Estimation of the Underlying F0 Range of a Speaker from the Spectral Features of a Brief Speech Input. Appl. Sci. 2022, 12, 6494. https://doi.org/10.3390/app12136494