Cochleogram-Based Speech Emotion Recognition with the Cascade of Asymmetric Resonators with Fast-Acting Compression Using Time-Distributed Convolutional Long Short-Term Memory and Support Vector Machines
Abstract
:1. Introduction
2. Related Studies
3. Proposed Approach
4. Experimental Design
4.1. Datasets
4.2. Deep Learning Architectures Used in the Experiments
5. Results
6. Limitations of CARFAC 24 System
- CARFAC 24 outperforms traditional methods.
- Time-Distributed Convolutional LSTM models achieve high accuracy.
- SVM-SMO is quite successful when used in tandem with RBF kernel.
7. Discussions
8. Conclusions
Funding
Institutional Review Board Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
ADAM | Adaptive Moment |
AGC | Automatic Gain Control |
ASED | Amharic Speech Emotion Dataset |
CARFAC | Cascade of Asymmetric Resonators with Fast-Acting Compression |
DCT | Discrete Cosine Transform |
FFT | Fast Fourier Transform |
FPGA | Field Programmable Gate Array |
HWR | Half Wave Rectifier |
IHC | Inner Hair Cell |
LSTM | Long Short Term Memory |
MFCC | Mel Frequency Cepstral Coefficients |
OHC | Outer Hair Cell |
PZFC | Pole-Zero Filter Cascade |
RBF | Radial Basis Functions |
ResNet | Residual Network |
SER | Speech Emotion Recognition |
SM | Scala Media |
ST | Scala Tympani |
SV | Scala Vestibuli |
SVM | Support Vector Machines |
SMO | Sequential Minimal Optimization |
TDConvLSTM | Time Distributed Convolutional LSTM |
Weka | Waikato Environment for Knowledge Analysis |
VGG | Visual Geometry Group |
References
- Lyon, R.F. Human and Machine Hearing: Extracting Meaning from Sound; Cambridge University Press: Cambridge, UK, 2017. [Google Scholar]
- Stevens, S.S.; Volkmann, J.; Newman, E.B. A scale for the measurement of the psychological magnitude pitch. J. Acoust. Soc. Am. 1937, 8, 185–190. [Google Scholar] [CrossRef]
- Lyon, R.F. A computational model of filtering, detection and compression in the cochlea. In Proceedings of the ICASSP’82. IEEE International Conference on Acoustics, Speech, and Signal Processing, Paris, France, 3–5 May 1982; pp. 1282–1285. [Google Scholar]
- Hermansky, H. Perceptual linear predictive (PLP) analysis of speech. J. Acoust. Soc. Am. 1990, 87, 1738–1752. [Google Scholar] [CrossRef] [PubMed]
- Johannesma, P.I.M. The pre-response stimulus ensemble of neurons in the cochlear nucleus. In Symposium on Hearing Theory; IPO: Eindhoven, The Netherlands, 1972; pp. 58–69. [Google Scholar]
- Glasberg, B.R.; Moore, B. Derivation of auditory filter shapes from notched-noise data. Hear. Res. 1990, 47, 103–138. [Google Scholar] [CrossRef]
- Seneff, S. A joint synchrony/mean-rate model of auditory speech processing. J. Phon. 1988, 16, 55–76. [Google Scholar] [CrossRef]
- Meddis, R. An evaluation of eight computer models of the mammalian Inner Hair Cell Function. J. Acoust. Soc. Am. 1988, 90, 904–917. [Google Scholar]
- Bruce, C.; Erfani, Y.; Zilany, M.S.R. A phenomenological model of the synapse between the inner hair cell and auditory nerve: Implications of limited neurotransmitter release sites. Hear. Res. 2018, 360, 40–54. [Google Scholar] [CrossRef] [PubMed]
- Hohmann, V. Frequency analysis and synthesis using a gammatone filterbank. Acta Acust. United Acoust. 2002, 88, 433–442. [Google Scholar]
- King, A.; Varnet, L.; Lorenzi, C. Accounting for masking of frequency modulation by amplitude modulation with the modulation filter-bank concept. J. Acoust. Soc. Am. 2019, 145, 2277–2293. [Google Scholar] [CrossRef]
- Relaño-Iborra, H.; Zaar, J.; Dau, T. A speech-based computational auditory signal processing and perception model. J. Acoust. Soc. Am. 2019, 146, 3306–3317. [Google Scholar] [CrossRef]
- Jepsen, M.; Ewert, S.; Dau, T. A computational model of human auditory signal processing and perception. J. Acoust. Soc. Am. 2008, 124, 422–438. [Google Scholar] [CrossRef]
- Verhulst, S.; Altoè, A.; Vasilkov, V. Functional modeling of the human auditory brainstem response to broadband stimulation. Hear. Res. 2018, 360, 55–75. [Google Scholar] [CrossRef]
- Zilany, M.S.A.; Bruce, I.C.; Carney, L.H. Updated parameters and expanded simulation options for a model of the auditory periphery. J. Acoust. Soc. Am. 2014, 135, 283–286. [Google Scholar] [CrossRef] [PubMed]
- Jackson, B.S.; Carney, L.H. The spontaneous rate histogram of the auditory nerve can be explained by only two or three spontaneous rates and long-range dependence. J. Assoc. Res. Otolaryngol. 2005, 6, 148–159. [Google Scholar] [CrossRef]
- Gutkin, A. Eidos: An open-source auditory periphery modeling toolkit and evaluation of cross-lingual phonemic contrasts. In Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL); European Language Resources Association: Marseille, France, 2020; pp. 9–20. [Google Scholar]
- Masetto, S.; Spaiardi, P.; Johnson, S.J. Signal Transmission by Auditory and Vestibular Hair Cells. In Recent Advances in Audiological and Vestibular Research; IntechOpen: Rijeka, Croatia, 2022. [Google Scholar]
- Moser, T.; Karagulyan, N.; Neef, J.; Jaime Tobón, L.M. Diversity matters-extending sound intensity coding by inner hair cells via heterogeneous synapses. EMBO J. 2023, 42, e114587. [Google Scholar] [CrossRef] [PubMed]
- Ashmore, J. Cochlear outer hair cell motility. Physiol. Rev. 2008, 88, 173–210. [Google Scholar] [CrossRef] [PubMed]
- Feher, J.J. Quantitative Human Physiology: An introduction; Academic Press: Cambridge, MA, USA, 2017. [Google Scholar]
- Qing, Z.; Mao-Li, D. Anatomy and physiology of peripheral auditory system and common causes of hearing loss. J. Otol. 2009, 4, 7–14. [Google Scholar] [CrossRef]
- Lord, R.M.; Abel, E.W.; Wang, Z.; Mills, R.P. Effects of draining cochlear fluids on stapes displacement in human middle-ear models. J. Acoust. Soc. Am. 2001, 110, 3132–3139. [Google Scholar] [CrossRef]
- Parent, P.; Allen, J.B. Time-domain “wave” model of the human tympanic membrane. Hear. Res. 2010, 263, 152–167. [Google Scholar] [CrossRef]
- Parent, P.; Allen, J.B. Wave model of the cat tympanic membrane. J. Acoust. Soc. Am. 2007, 122, 918–931. [Google Scholar] [CrossRef]
- Naghibolhosseini, M.; Long, G.R. Fractional-order modelling and simulation of human ear. Int. J. Comput. Math. 2018, 95, 1257–1273. [Google Scholar] [CrossRef]
- Pastras, C.J.; Gholami, N.; Jennings, S.; Zhu, H.; Zhou, W.; Brown, D.J.; Curthoys, I.S.; Rabbitt, R.D. A mathematical model for mechanical activation and compound action potential generation by the utricle in response to sound and vibration. Front. Neurol. 2023, 14, 1109506. [Google Scholar] [CrossRef] [PubMed]
- Kuokkanen, P. Modelling the Activity of the Auditory Nerve After Hearing Loss. Master’s Thesis, Department of Physics, University of Jyväskylä, Jyväskylä, Finland, 2005. [Google Scholar]
- Weremczuk, A.; Rusinek, R. Dynamics of the middle ear with an implantable hearing device: An improved electromechanical model. Nonlinear Dyn. 2024, 112, 2219–2235. [Google Scholar] [CrossRef]
- De Paolis, A.; Bikson, M.; Nelson, J.T.; de Ru, J.A.; Packer, M.; Cardoso, L. Analytical and numerical modeling of the hearing system: Advances towards the assessment of hearing damage. Hear. Res. 2017, 349, 111–128. [Google Scholar] [CrossRef]
- Zablotni, R.; Tudruj, S.; Latalski, J.; Szymanski, M.; Kucharski, A.; Zając, G.; Rusinek, R. Sound-Induced Round Window Vibration—Experiment and Numerical Simulations of Energy Transfer Through the Cochlea of the Human Ear. Appl. Sci. 2025, 15, 301. [Google Scholar] [CrossRef]
- Chen, J.; Sprigg, J.; Castle, N.; Matson, C.; Hedjoudje, A.; Dai, C. A Virtual Inner Ear Model Selects Ramped Pulse Shapes for Vestibular Afferent Stimulation. Bioengineering 2023, 10, 1436. [Google Scholar] [CrossRef]
- Schurzig, D.; Fröhlich, M.; Raggl, S.; Scheper, V.; Lenarz, T.; Rau, T.S. Uncoiling the human cochlea-Physical scala tympani models to study pharmacokinetics inside the inner ear. Life 2021, 11, 373. [Google Scholar] [CrossRef]
- Xu, Y.; Afshar, S.; Wang, R.; Cohen, G.; Singh Thakur, C.; Hamilton, T.J.; van Schaik, A. A biologically inspired sound localisation system using a silicon cochlea pair. Appl. Sci. 2021, 11, 1519. [Google Scholar] [CrossRef]
- Xu, Y.; Thakur, C.S.; Singh, R.K.; Hamilton, T.J.; Wang, R.M.; Van Schaik, A. A FPGA implementation of the CAR-FAC cochlear model. Front. Neurosci. 2018, 12, 198. [Google Scholar] [CrossRef]
- Lyon, R.F. Cascades of two-pole–two-zero asymmetric resonators are good models of peripheral auditory function. J. Acoust. Soc. Am. 2011, 130, 3893–3904. [Google Scholar] [CrossRef]
- Scarpiniti, M.; Parisi, R.; Lee, Y.C. A Scalogram-based CNN approach for audio classification in construction sites. Appl. Sci. 2023, 14, 90. [Google Scholar] [CrossRef]
- Davis, S.; Mermelstein, P. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Acoust. Speech Signal Process. 1980, 28, 357–366. [Google Scholar] [CrossRef]
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2012; Volume 25. [Google Scholar]
- Schroeder, M.R. An integrable model for the basilar membrane. J. Acoust. Soc. Am. 1973, 53, 429–434. [Google Scholar] [CrossRef] [PubMed]
- Zweig, G.; Lipes, R.; Pierce, J.R. The cochlear compromise. J. Acoust. Soc. Am. 1976, 59, 975–982. [Google Scholar] [CrossRef] [PubMed]
- Slaney, M. Lyon’s Cochlear Model; Apple Computer, Advanced Technology Group: Cupertino, CA, USA, 1988; Volume 13. [Google Scholar]
- Duda, R.O.; Lyon, R.F.; Slaney, M. Correlograms and the separation of sounds. In Proceedings of the 1990 Conference Record Twenty-Fourth Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, USA, 5 October–7 November 1990; IEEE Computer Society: Washington, DC, USA, 1990; Volume 2, pp. 457–461. [Google Scholar]
- Lyon, R.F.; Schonberger, R.; Slaney, M.; Velimirović, M.; Yu, H. The CARFAC v2 Cochlear Model in Matlab, NumPy, and JAX. arXiv 2024, arXiv:2404.17490. [Google Scholar]
- Ahmed, N.; Natarajan, T.; Rao, K.R. Discrete cosine transform. IEEE Trans. Comput. 1974, 100, 90–93. [Google Scholar] [CrossRef]
- Christop, I. nEMO: Dataset of Emotional Speech in Polish. arXiv 2024, arXiv:2404.06292. [Google Scholar]
- Retta, E.A.; Sutcliffe, R.; Mahmood, J.; Berwo, M.A.; Almekhlafi, E.; Khan, S.A.; Chaudhry, S.A.; Mhamed, M.; Feng, J. Cross-Corpus Multilingual Speech Emotion Recognition: Amharic vs. Other Languages. Appl. Sci. 2023, 13, 12587. [Google Scholar] [CrossRef]
- Skowroński, K.; Gałuszka, A.; Probierz, E. Polish Speech and Text Emotion Recognition in a Multimodal Emotion Analysis System. Appl. Sci. 2024, 14, 10284. [Google Scholar] [CrossRef]
- Burkhardt, F.; Paeschke, A.; Rolfes, M.; Sendlmeier, W.F.; Weiss, B. A database of German emotional speech. In Proceedings of the Ninth European Conference on Speech Communication and Technology, Lisbon, Portugal, 4–8 September 2005; pp. 1517–1520. [Google Scholar]
- Livingstone, S.R.; Russo, F.A. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 2018, 13, e0196391. [Google Scholar] [CrossRef]
- Latif, S.; Qayyum, A.; Usman, M.; Qadir, J. Cross lingual speech emotion recognition: Urdu vs. western languages. In Proceedings of the 2018 International Conference on Frontiers of Information Technology (FIT), Islamabad, Pakistan, 17–19 December 2018; pp. 88–93. [Google Scholar]
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
- Schmidhuber, J.; Hochreiter, S. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar]
- Liu, W.; Wang, J.; Li, Z.; Lu, Q. A Hybrid Improved Dual-Channel and Dual-Attention Mechanism Model for Water Quality Prediction in Nearshore Aquaculture. Electronics 2025, 14, 331. [Google Scholar] [CrossRef]
- Vapnik, V.N. The Support Vector method. In Artificial Neural Networks—ICANN’97; Lecture Notes in Computer Science; Gerstner, W., Germond, A., Hasler, M., Nicoud, J.D., Eds.; Springer: Berlin/Heidelberg, Germany, 1997; Volume 1327. [Google Scholar] [CrossRef]
- Platt, J.C. Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines; Microsoft Research Technical Report; Microsoft: Redmond, WA, USA, 1998. [Google Scholar]
- Hall, M.; Frank, E.; Holmes, G.; Pfahringer, B.; Reutemann, P.; Witten, I.H. The WEKA data mining software: An update. ACM SIGKDD Explor. Newsl. 2009, 11, 10–18. [Google Scholar] [CrossRef]
- Majdak, P.; Hollomey, C.; Baumgartner, R. AMT 1.x: A toolbox for reproducible research in auditory modeling. Acta Acust. 2022, 6, 19. [Google Scholar] [CrossRef]
- Chollet, F.; Anonymous. Keras. GitHub. Available online: https://github.com/fchollet/keras (accessed on 1 January 2024).
- Kingma, P.D.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar] [CrossRef]
- Badshah, A.M.; Ahmad, J.; Rahim, N.; Baik, S.W. Speech emotion recognition from spectrograms with deep convolutional neural network. In Proceedings of the 2017 International Conference on Platform Technology and Service (PlatCon), Busan, Republic of Korea, 13–15 February 2017; pp. 1–5. [Google Scholar]
- Rudd, D.H.; Huo, H.; Xu, G. Leveraged Mel spectrograms using harmonic and percussive components in speech emotion recognition. In Advances in Knowledge Discovery and Data Mining, Pacific-Asia Conference on Knowledge Discovery and Data Mining, Chengdu, China, May 16–19, 2022; Springer International Publishing: Cham, Switzerland; pp. 392–404.
- Glackin, C.; Wall, J.; Chollet, G.; Dugan, N.; Cannings, N. TIMIT and NTIMIT phone recognition using convolutional neural networks. In Pattern Recognition Applications and Methods, 7th International Conference, ICPRAM 2018, Funchal, Madeira, Portugal, 16–18 January 2018; Revised Selected Papers 7; Springer International Publishing: Berlin/Heidelberg, Germany, 2018; pp. 89–100. [Google Scholar]
Layer | Output Shape | #Parameters |
---|---|---|
Input | (None, 1, 44, 66, 1) | 0 |
Time Distributed Conv2D (256, (3 × 3), “relu”) | (None, 1, 44, 66, 256) | 2560 |
Time Distributed Conv2D (256, (3 × 3), “relu”) | (None, 1, 44, 66, 256) | 590,080 |
Time Distributed MaxPool (2, 2) | (None, 1, 22, 33, 256) | 0 |
Time Distributed (Flatten) | (None, 1, 185,856) | 0 |
LSTM (512) | (None, 1, 512) | 381,683,712 |
LSTM (256) | (None, 1, 256) | 787,456 |
LSTM (128) | (None, 128) | 197,120 |
Drop Out (0.25) | (None, 128) | 0 |
Flatten | (None, 128) | 0 |
Dense (512) | (None, 512) | 66,048 |
Drop Out (0.5) | (None, 512) | 0 |
Dense (#classes) | (None, 6) | 3078 |
ACC | Precision | Recall | F1 | Kappa | |
---|---|---|---|---|---|
SVM-SMO | 84.51 | 84.60 | 84.51 | 84.51 | - |
TDConvLSTM | 84.64 | 84.59 | 84.64 | 84.56 | 80.74 |
VGG19 | 84.24 | 84.87 | 84.24 | 84.23 | 80.25 |
LSTM | 83.23 | 83.37 | 83.23 | 83.18 | 78.99 |
ACC | Precision | Recall | F1 | Kappa | |
---|---|---|---|---|---|
SVM-SMO | 86.72 | 86.90 | 86.72 | 86.80 | - |
TDConvLSTM | 88.51 | 88.83 | 88.51 | 88.57 | 86.19 |
VGG19 | 88.40 | 88.46 | 88.40 | 88.35 | 86.07 |
LSTM | 82.60 | 84.33 | 82.60 | 82.55 | 79.14 |
ACC | Precision | Recall | F1-Score | Kappa | |
---|---|---|---|---|---|
SVM-SMO | 85.97 | 86.70 | 86.00 | 85.90 | - |
TDConvLSTM | 82.93 | 83.77 | 82.93 | 82.77 | 78.66 |
VGG19 | 79.67 | 81.60 | 79.67 | 79.39 | 74.58 |
LSTM | 82.11 | 84.10 | 82.11 | 82.06 | 77.64 |
ACC | Precision | Recall | F1-Score | Kappa | |
---|---|---|---|---|---|
SVM-SMO | 64.07 | 67.40 | 64.07 | 63.60 | - |
TDConvLSTM | 65.70 | 70.95 | 65.70 | 64.41 | 58.98 |
VGG19 | 60.67 | 64.18 | 60.67 | 60.26 | 52.92 |
LSTM | 64.45 | 65.11 | 64.45 | 63.86 | 57.33 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Parlak, C. Cochleogram-Based Speech Emotion Recognition with the Cascade of Asymmetric Resonators with Fast-Acting Compression Using Time-Distributed Convolutional Long Short-Term Memory and Support Vector Machines. Biomimetics 2025, 10, 167. https://doi.org/10.3390/biomimetics10030167
Parlak C. Cochleogram-Based Speech Emotion Recognition with the Cascade of Asymmetric Resonators with Fast-Acting Compression Using Time-Distributed Convolutional Long Short-Term Memory and Support Vector Machines. Biomimetics. 2025; 10(3):167. https://doi.org/10.3390/biomimetics10030167
Chicago/Turabian StyleParlak, Cevahir. 2025. "Cochleogram-Based Speech Emotion Recognition with the Cascade of Asymmetric Resonators with Fast-Acting Compression Using Time-Distributed Convolutional Long Short-Term Memory and Support Vector Machines" Biomimetics 10, no. 3: 167. https://doi.org/10.3390/biomimetics10030167
APA StyleParlak, C. (2025). Cochleogram-Based Speech Emotion Recognition with the Cascade of Asymmetric Resonators with Fast-Acting Compression Using Time-Distributed Convolutional Long Short-Term Memory and Support Vector Machines. Biomimetics, 10(3), 167. https://doi.org/10.3390/biomimetics10030167