Article

On the Speech Properties and Feature Extraction Methods in Speech Emotion Recognition

1 Institute of Multimedia Information and Communication Technologies, Faculty of Electrical Engineering and Information Technology, Slovak University of Technology in Bratislava, 2412 Bratislava, Slovakia
2 Institute of Robotics and Cybernetics, Faculty of Electrical Engineering and Information Technology, Slovak University of Technology in Bratislava, 2412 Bratislava, Slovakia
3 Institute of Computer Science and Mathematics, Faculty of Electrical Engineering and Information Technology, Slovak University of Technology in Bratislava, 2412 Bratislava, Slovakia
* Author to whom correspondence should be addressed.
Academic Editor: Chiman Kwan
Sensors 2021, 21(5), 1888; https://doi.org/10.3390/s21051888
Received: 20 January 2021 / Revised: 1 March 2021 / Accepted: 3 March 2021 / Published: 8 March 2021
(This article belongs to the Section Physical Sensors)
Many speech emotion recognition systems have been designed using different features and classification methods, yet there is a lack of knowledge and reasoning regarding the underlying speech characteristics and processing, i.e., how basic characteristics, methods, and settings affect the accuracy, and to what extent. This study aims to extend the physical perspective on speech emotion recognition by analyzing basic speech characteristics and modeling methods, e.g., time characteristics (segmentation, window types, and classification region lengths and overlaps), frequency ranges, frequency scales, processing of the whole speech signal (spectrograms), of the vocal tract (filter banks, linear prediction coefficient (LPC) modeling), and of the excitation signal (inverse LPC filtering), magnitude and phase manipulations, cepstral features, etc. In the evaluation phase, a state-of-the-art classification method and rigorous statistical tests were applied, namely N-fold cross-validation, the paired t-test, and rank and Pearson correlations. The results revealed several settings in the 75% accuracy range (seven emotions). The most successful methods were based on vocal tract features using psychoacoustic filter banks covering the 0–8 kHz frequency range. Spectrograms, which carry both vocal tract and excitation information, also scored well. It was found that even basic processing such as pre-emphasis, segmentation, and magnitude modifications can dramatically affect the results. Most findings are robust, exhibiting strong correlations across the tested databases.
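As a rough illustration of the vocal-tract feature pipeline the abstract refers to (pre-emphasis, segmentation, windowing, and psychoacoustic filter banks), the following NumPy sketch computes log mel filter-bank energies. All parameter values here (25 ms frames, 10 ms shift, 26 filters, 0.97 pre-emphasis) are common defaults assumed for illustration, not the settings evaluated in the paper.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(n_filters, n_fft, fs, f_min=0.0, f_max=None):
    """Triangular filters spaced evenly on the mel (psychoacoustic) scale."""
    f_max = f_max or fs / 2.0
    mel_pts = np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):          # rising slope
            fb[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):         # falling slope
            fb[i, k] = (right - k) / max(right - center, 1)
    return fb

def filter_bank_features(signal, fs, frame_len=0.025, frame_shift=0.010,
                         n_filters=26, n_fft=512, pre_emph=0.97):
    # Pre-emphasis: boost high frequencies to flatten the spectral tilt.
    sig = np.append(signal[0], signal[1:] - pre_emph * signal[:-1])
    # Segmentation into overlapping frames, each weighted by a Hamming window.
    size = int(frame_len * fs)
    step = int(frame_shift * fs)
    n_frames = 1 + max(0, (len(sig) - size) // step)
    window = np.hamming(size)
    fb = mel_filter_bank(n_filters, n_fft, fs)
    feats = np.empty((n_frames, n_filters))
    for t in range(n_frames):
        frame = sig[t * step : t * step + size] * window
        mag = np.abs(np.fft.rfft(frame, n_fft))
        # Log filter-bank energies; the floor avoids log of zero.
        feats[t] = np.log(fb @ (mag ** 2) + 1e-10)
    return feats

# Usage: one second of a synthetic 440 Hz tone sampled at 16 kHz.
fs = 16000
t = np.arange(fs) / fs
feats = filter_bank_features(np.sin(2 * np.pi * 440 * t), fs)
print(feats.shape)  # (98, 26): 98 frames, 26 filter-bank energies each
```

Cepstral features such as MFCCs would follow by applying a discrete cosine transform to each row of the resulting matrix; LPC-based vocal tract features would instead fit an all-pole model per frame.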
Keywords: windows; frequency scales; spectrograms; psychoacoustic filter banks; LPC; cepstral features; phases; speech emotions; classification
MDPI and ACS Style

Kacur, J.; Puterka, B.; Pavlovicova, J.; Oravec, M. On the Speech Properties and Feature Extraction Methods in Speech Emotion Recognition. Sensors 2021, 21, 1888. https://doi.org/10.3390/s21051888

AMA Style

Kacur J, Puterka B, Pavlovicova J, Oravec M. On the Speech Properties and Feature Extraction Methods in Speech Emotion Recognition. Sensors. 2021; 21(5):1888. https://doi.org/10.3390/s21051888

Chicago/Turabian Style

Kacur, Juraj, Boris Puterka, Jarmila Pavlovicova, and Milos Oravec. 2021. "On the Speech Properties and Feature Extraction Methods in Speech Emotion Recognition" Sensors 21, no. 5: 1888. https://doi.org/10.3390/s21051888

Note that from the first issue of 2016, MDPI journals use article numbers instead of page numbers.
