Open Access Article

3D CNN-Based Speech Emotion Recognition Using K-Means Clustering and Spectrograms

Department of Electrical and Electronics Engineering, Eastern Mediterranean University, 99628 Gazimagusa, North Cyprus, via Mersin 10, Turkey
Author to whom correspondence should be addressed.
Entropy 2019, 21(5), 479;
Received: 12 March 2019 / Revised: 3 May 2019 / Accepted: 4 May 2019 / Published: 8 May 2019
(This article belongs to the Special Issue Statistical Machine Learning for Human Behaviour Analysis)


Detecting human intentions and emotions helps improve human–robot interaction. Emotion recognition has been a challenging research direction over the past decade. This paper proposes an emotion recognition system based on the analysis of speech signals. First, we split each speech signal into overlapping frames of equal length. Next, for each frame we extract an 88-dimensional vector of audio features, including Mel Frequency Cepstral Coefficients (MFCC), pitch, and intensity. In parallel, the spectrogram of each frame is generated. In the final preprocessing step, we apply k-means clustering to the extracted features of all frames of each audio signal and select the k most discriminant frames, called keyframes, to summarize the speech signal. The sequence of spectrograms corresponding to these keyframes is then encapsulated in a 3D tensor. These tensors are used to train and test a 3D Convolutional Neural Network using 10-fold cross-validation. The proposed 3D CNN has two convolutional layers and one fully connected layer. Experiments are conducted on the Surrey Audio-Visual Expressed Emotion (SAVEE), Ryerson Multimedia Laboratory (RML), and eNTERFACE’05 databases. The results are superior to the state-of-the-art methods reported in the literature.
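The preprocessing pipeline the abstract walks through (framing, per-frame feature extraction, k-means keyframe selection, spectrogram tensor) can be sketched in a few lines. This is a minimal NumPy-only illustration, not the authors' implementation: the paper's 88-dimensional MFCC/pitch/intensity feature vector is replaced by a simple FFT-magnitude stand-in, the k-means step is hand-rolled for self-containment, and all function names and parameter values are hypothetical.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping, equal-length frames."""
    n = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def frame_spectrogram(frame, n_fft=64):
    """Log-magnitude spectrogram of a single frame (short-time FFT)."""
    hop = n_fft // 2
    cols = 1 + (len(frame) - n_fft) // hop
    S = np.stack([np.abs(np.fft.rfft(frame[c * hop : c * hop + n_fft]))
                  for c in range(cols)], axis=1)
    return np.log1p(S)

def select_keyframes(features, k, n_iter=50, seed=0):
    """Hand-rolled k-means over per-frame feature vectors; returns, for
    each of the k centroids, the index of its nearest frame (a 'keyframe')."""
    rng = np.random.default_rng(seed)
    centroids = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(n_iter):
        d = np.linalg.norm(features[:, None] - centroids[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = features[labels == j].mean(axis=0)
    d = np.linalg.norm(features[:, None] - centroids[None], axis=2)
    return np.sort(d.argmin(axis=0))

# Toy end-to-end run on 1 s of synthetic "audio" at 16 kHz.
rng = np.random.default_rng(1)
signal = rng.standard_normal(16000)
frames = frame_signal(signal, frame_len=400, hop=160)   # 25 ms / 10 ms
# Stand-in 88-dim feature per frame (the paper uses MFCC, pitch, intensity).
features = np.abs(np.fft.rfft(frames, axis=1))[:, :88]
idx = select_keyframes(features, k=9)
tensor = np.stack([frame_spectrogram(frames[i]) for i in idx])
print(frames.shape, tensor.shape)   # (98, 400) (9, 33, 11)
```

The resulting (k, frequency, time) tensor is the kind of object that would be fed to the 3D CNN; the network itself (two convolutional layers plus one fully connected layer) is not reproduced here.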
Keywords: speech emotion recognition; 3D convolutional neural networks; deep learning; k-means clustering; spectrograms


This is an open access article distributed under the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


MDPI and ACS Style

Hajarolasvadi, N.; Demirel, H. 3D CNN-Based Speech Emotion Recognition Using K-Means Clustering and Spectrograms. Entropy 2019, 21, 479.



Entropy EISSN 1099-4300, published by MDPI AG, Basel, Switzerland.