
Open Access Article
Computation 2017, 5(2), 26;

Deep Visual Attributes vs. Hand-Crafted Audio Features on Multidomain Speech Emotion Recognition

Computer Science and Engineering Department, University of Texas at Arlington, Arlington, TX 76019, USA
Institute of Informatics and Telecommunications, National Center for Scientific Research—“Demokritos”, Athens 15310, Greece
Department of Informatics, Ionian University, Corfu 49100, Greece
Author to whom correspondence should be addressed.
Academic Editor: Demos T. Tsahalis
Received: 31 March 2017 / Revised: 25 May 2017 / Accepted: 27 May 2017 / Published: 1 June 2017


Emotion recognition from speech may play a crucial role in many applications related to human–computer interaction, or in understanding the affective state of users in tasks where other modalities, such as video or physiological parameters, are unavailable. In general, a human’s emotions may be recognized using several modalities, such as analyzing facial expressions, speech, or physiological parameters (e.g., electroencephalograms, electrocardiograms). However, measuring these modalities may be difficult, obtrusive, or require expensive hardware. In that context, speech may be the best alternative modality in many practical applications. In this work we present an approach that uses a Convolutional Neural Network (CNN) functioning as a visual feature extractor and trained on raw speech information. In contrast to traditional machine learning approaches, CNNs are responsible for identifying the important features of the input, thus making hand-crafted feature engineering optional in many tasks. In this paper, no features are required other than the spectrogram representations; hand-crafted features were extracted only to validate our method. Moreover, the approach does not require any linguistic model and is not specific to any particular language. We evaluate the proposed approach on cross-language datasets and demonstrate that it provides superior results compared to traditional approaches that use hand-crafted features.
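The pipeline described above feeds spectrogram images of raw speech to a CNN. As an illustration only (the frame length, hop size, and windowing below are common defaults, not the authors' stated parameters), a minimal NumPy sketch of how such a spectrogram representation can be computed from a waveform:

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Log-magnitude spectrogram via a framed FFT with a Hann window.

    Returns an array of shape (n_frames, frame_len // 2 + 1) that can be
    treated as a 2-D image and passed to a CNN.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))  # magnitude per frequency bin
    return np.log1p(mag)                       # log compression, as is typical

# 1 s of a synthetic 440 Hz tone at 16 kHz as a stand-in for a speech signal
sr = 16000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (124, 129): time frames x frequency bins
```

The resulting time–frequency image is what lets a CNN, originally a visual architecture, operate on audio: convolutions pick out local spectro-temporal patterns without hand-crafted descriptors.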
Keywords: emotion recognition; convolutional neural networks; spectrograms

Figure 1

This is an open access article distributed under the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


MDPI and ACS Style

Papakostas, M.; Spyrou, E.; Giannakopoulos, T.; Siantikos, G.; Sgouropoulos, D.; Mylonas, P.; Makedon, F. Deep Visual Attributes vs. Hand-Crafted Audio Features on Multidomain Speech Emotion Recognition. Computation 2017, 5, 26.


Note that from the first issue of 2016, MDPI journals use article numbers instead of page numbers.

Computation EISSN 2079-3197, published by MDPI AG, Basel, Switzerland