Open Access Article

Emotion Recognition from Speech Using the Bag-of-Visual Words on Audio Segment Spectrograms

Institute of Informatics and Telecommunications, National Centre for Scientific Research “Demokritos”, 15341 Athens, Greece
Department of Computer Science, University of Thessaly, 38221 Lamia, Greece
Department of Computer Engineering T.E., Technological Education Institute of Sterea Ellada, 34400 Lamia, Greece
Department of Informatics, Ionian University, 49132 Corfu, Greece
Author to whom correspondence should be addressed.
This paper is an extended version of our paper published in Proceedings of the 11th PErvasive Technologies Related to Assistive Environments Conference, Corfu, Greece, 26–29 June 2018; pp. 106–107.
Technologies 2019, 7(1), 20
Received: 30 November 2018 / Revised: 21 January 2019 / Accepted: 30 January 2019 / Published: 4 February 2019
(This article belongs to the Special Issue The PErvasive Technologies Related to Assistive Environments (PETRA))


Monitoring and understanding a human’s emotional state plays a key role in current and forthcoming computational technologies. At the same time, this monitoring and analysis should be as unobtrusive as possible, since the digital world has been smoothly adopted in everyday life activities. In this framework, and within the domain of assessing humans’ affective state during their educational training, the most popular approach is to use sensory equipment that allows them to be observed without any kind of direct contact. Thus, in this work, we focus on human emotion recognition from audio stimuli (i.e., human speech) using a novel approach based on a computer vision inspired methodology, namely the bag-of-visual words method, applied to the spectrograms of several audio segments. Each spectrogram is treated as the visual representation of its audio segment and may be analyzed with well-known traditional computer vision techniques, such as construction of a visual vocabulary, extraction of speeded-up robust features (SURF), quantization into a set of visual words, and image histogram construction. As a last step, support vector machine (SVM) classifiers are trained on the resulting representation. Finally, to further generalize the proposed approach, we use publicly available datasets in several human languages to perform cross-language experiments on both actor-created and real-life datasets.
Keywords: emotion recognition; bag-of-visual words; spectrograms
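
The quantization and histogram steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that local descriptors (for example SURF descriptors computed on a spectrogram image) have already been extracted and are available as plain lists of floats, and it assumes the visual vocabulary is built with a basic k-means loop, which the abstract does not specify. The function names `build_vocabulary`, `nearest_word`, and `bovw_histogram` are hypothetical and chosen for illustration only.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_word(descriptor, vocabulary):
    """Index of the visual word (centroid) closest to the descriptor."""
    return min(range(len(vocabulary)),
               key=lambda i: squared_distance(descriptor, vocabulary[i]))


def build_vocabulary(descriptors, k, iterations=20, seed=0):
    """Build k visual words with a basic k-means loop.

    Assumption: k-means is used here only as a common choice for
    vocabulary construction; the source does not name the algorithm.
    """
    rng = random.Random(seed)
    centroids = [list(d) for d in rng.sample(descriptors, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for d in descriptors:
            clusters[nearest_word(d, centroids)].append(d)
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def bovw_histogram(descriptors, vocabulary):
    """Normalized histogram of visual-word occurrences for one spectrogram."""
    counts = [0] * len(vocabulary)
    for d in descriptors:
        counts[nearest_word(d, vocabulary)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


# Toy example with a fixed two-word vocabulary and 2-D stand-in descriptors.
vocabulary = [[0.0, 0.0], [10.0, 10.0]]
segment = [[0.0, 1.0], [9.0, 9.0], [10.0, 11.0], [1.0, 0.0]]
print(bovw_histogram(segment, vocabulary))  # [0.5, 0.5]
```

Under these assumptions, each audio segment ends up as one fixed-length histogram regardless of how many descriptors its spectrogram yields, which is what makes the representation suitable as input to an SVM classifier.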


This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited (CC BY 4.0).

Share & Cite This Article

MDPI and ACS Style

Spyrou, E.; Nikopoulou, R.; Vernikos, I.; Mylonas, P. Emotion Recognition from Speech Using the Bag-of-Visual Words on Audio Segment Spectrograms. Technologies 2019, 7, 20.



Technologies, EISSN 2227-7080, published by MDPI AG, Basel, Switzerland