Special Issue "Sensor Based Multi-Modal Emotion Recognition"

A special issue of Sensors (ISSN 1424-8220). This special issue belongs to the section "Intelligent Sensors".

Deadline for manuscript submissions: 30 November 2022 | Viewed by 10975
Please contact the Guest Editor or the Section Managing Editor at ([email protected]) for any queries.

Special Issue Editors

Prof. Soo-Hyung Kim
Guest Editor
Chonnam National University, Gwangju, South Korea
Interests: deep-learning-based emotion recognition; medical image analysis; pattern recognition
Prof. Gueesang Lee
Guest Editor
Chonnam National University, Gwangju, South Korea
Interests: image processing; computer vision; medical imaging

Special Issue Information

Dear Colleagues,

Emotion recognition is one of the most active topics in AI research. This Special Issue is being assembled to share in-depth research results related to emotion recognition, such as the classification of emotion categories (anger, disgust, fear, happiness, sadness, surprise, neutral, etc.), arousal/valence estimation, and the assessment of mental states such as stress, pain, cognitive load, engagement, curiosity, and humor. All of these problems deal with streams of data not only from individual sensors such as RGB-D cameras, EEG/ECG/EMG sensors, wearable devices, or smartphones, but also from the fusion of multiple sensors.

Please join this Special Issue, entitled “Sensor-Based Multi-Modal Emotion Recognition”, and contribute your valuable research. Thank you very much.

Prof. Soo-Hyung Kim
Prof. Gueesang Lee
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the Special Issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Sensors is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • multi-modal emotion recognition
  • audio-visual, EEG/ECG/EMG, wearable devices
  • emotion classification
  • arousal/valence estimation
  • stress, pain, cognitive load, engagement, curiosity, humor
  • related issues in emotion recognition or sentiment analysis

Published Papers (7 papers)

Research

Article
EEG Connectivity during Active Emotional Musical Performance
Sensors 2022, 22(11), 4064; https://doi.org/10.3390/s22114064 - 27 May 2022
Viewed by 276
Abstract
The neural correlates of intentional emotion transfer by the music performer are not well investigated, because present-day research mainly focuses on the assessment of emotions evoked by music. In this study, we aim to determine whether EEG connectivity patterns can reflect differences in information exchange during emotional playing. The EEG data were recorded while subjects performed a simple piano score with contrasting emotional intentions and then evaluated the subjectively experienced success of the emotion transfer. The brain connectivity patterns were assessed from the EEG data using the Granger causality approach. The effective connectivity was analyzed in different frequency bands (delta, theta, alpha, beta, and gamma). The features that (1) discriminated between the neutral baseline and emotional playing and (2) were shared across conditions were used for further comparison. The low-frequency bands (delta, theta, and alpha) showed a limited number of connections (4 to 6) contributing to the discrimination between the emotional playing conditions. In contrast, a dense pattern of connections able to discriminate between conditions (30 to 38) was observed in the beta and gamma frequency ranges. The current study demonstrates that EEG-based connectivity in the beta and gamma frequency ranges can effectively reflect the state of the networks involved in emotional transfer through musical performance, whereas the utility of the low-frequency bands (delta, theta, alpha) remains questionable.
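
For readers unfamiliar with the approach, the following is a minimal sketch of band-limited pairwise Granger causality between EEG channels; the sampling rate, band edges, channel count, and lag order are illustrative assumptions, not the authors' settings.

```python
# Sketch of band-limited pairwise Granger causality between EEG channels.
# Sampling rate, band edges, and lag order are illustrative assumptions.
import numpy as np
from scipy.signal import butter, filtfilt
from statsmodels.tsa.stattools import grangercausalitytests

FS = 250  # sampling rate in Hz (assumed)
BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 13),
         "beta": (13, 30), "gamma": (30, 45)}

def bandpass(x, low, high, fs=FS, order=4):
    b, a = butter(order, [low / (fs / 2), high / (fs / 2)], btype="band")
    return filtfilt(b, a, x)

def granger_connectivity(eeg, band, maxlag=5):
    """eeg: (n_channels, n_samples). Returns an (n_ch, n_ch) matrix of p-values
    for 'column channel Granger-causes row channel' in the given band."""
    low, high = BANDS[band]
    filtered = np.array([bandpass(ch, low, high) for ch in eeg])
    n_ch = filtered.shape[0]
    pvals = np.ones((n_ch, n_ch))
    for i in range(n_ch):
        for j in range(n_ch):
            if i == j:
                continue
            # statsmodels expects a 2-column array: tests whether column 2 -> column 1
            data = np.column_stack([filtered[i], filtered[j]])
            res = grangercausalitytests(data, maxlag=maxlag, verbose=False)
            pvals[i, j] = res[maxlag][0]["ssr_ftest"][1]
    return pvals

# Example: connectivity in the beta band for simulated 8-channel EEG
rng = np.random.default_rng(0)
eeg = rng.standard_normal((8, 5000))
beta_p = granger_connectivity(eeg, "beta")
print((beta_p < 0.05).sum(), "directed connections at p < 0.05")
```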

Article
AttendAffectNet–Emotion Prediction of Movie Viewers Using Multimodal Fusion with Self-Attention
Sensors 2021, 21(24), 8356; https://doi.org/10.3390/s21248356 - 14 Dec 2021
Viewed by 1097
Abstract
In this paper, we tackle the problem of predicting the affective responses of movie viewers based on the content of the movies. Current studies on this topic focus on video representation learning and fusion techniques to combine the extracted features for predicting affect. Yet, these approaches typically ignore both the correlation between multiple modality inputs and the correlation between temporal inputs (i.e., sequential features). To explore these correlations, we propose a neural network architecture, AttendAffectNet (AAN), that uses the self-attention mechanism to predict the emotions of movie viewers from different input modalities. In particular, visual, audio, and text features are considered for predicting emotions, expressed in terms of valence and arousal. We analyze three variants of the proposed AAN: Feature AAN, Temporal AAN, and Mixed AAN. The Feature AAN applies the self-attention mechanism in an innovative way to the features extracted from the different modalities (video, audio, and movie subtitles) of a whole movie, thereby capturing the relationships between them. The Temporal AAN takes the time domain of the movies and the sequential dependency of affective responses into account: self-attention is applied to the concatenated (multimodal) feature vectors representing subsequent movie segments. The Mixed AAN combines the strong points of the Feature AAN and the Temporal AAN by applying self-attention first to the vectors of features obtained from the different modalities in each movie segment and then to the feature representations of all subsequent (temporal) movie segments. We extensively trained and validated the proposed AAN on both the MediaEval 2016 dataset for the Emotional Impact of Movies Task and the extended COGNIMUSE dataset. Our experiments demonstrate that audio features play a more influential role than those extracted from video and movie subtitles when predicting the emotions of movie viewers on these datasets. Models that use visual, audio, and text features simultaneously as inputs performed better than those using features extracted from each modality separately. In addition, the Feature AAN outperformed the other AAN variants on the above-mentioned datasets, highlighting the importance of taking different features as context to one another when fusing them. The Feature AAN also performed better than the baseline models when predicting the valence dimension.
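
The core Feature AAN idea, self-attention applied across modality-level feature vectors before predicting valence and arousal, can be sketched roughly as follows; the feature dimensions, projection size, and head count are illustrative assumptions rather than the authors' configuration.

```python
# Minimal sketch of self-attention fusion across modality feature vectors.
# Feature sizes, embedding dimension, and head count are illustrative assumptions.
import torch
import torch.nn as nn

class FeatureFusionSketch(nn.Module):
    def __init__(self, dims, d_model=256, n_heads=4):
        super().__init__()
        # Project each modality's feature vector into a shared embedding space.
        self.proj = nn.ModuleDict({m: nn.Linear(d, d_model) for m, d in dims.items()})
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Linear(d_model, 2)  # predict (valence, arousal)

    def forward(self, feats):
        # feats: dict mapping modality name -> (batch, dim) feature tensor
        tokens = torch.stack([self.proj[m](x) for m, x in feats.items()], dim=1)
        fused, _ = self.attn(tokens, tokens, tokens)  # each modality attends to the others
        return self.head(fused.mean(dim=1))

model = FeatureFusionSketch({"video": 2048, "audio": 1582, "text": 768})
batch = {"video": torch.randn(4, 2048), "audio": torch.randn(4, 1582), "text": torch.randn(4, 768)}
print(model(batch).shape)  # torch.Size([4, 2])
```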

Article
EEG-Based Emotion Recognition by Convolutional Neural Network with Multi-Scale Kernels
Sensors 2021, 21(15), 5092; https://doi.org/10.3390/s21155092 - 27 Jul 2021
Viewed by 1009
Abstract
Besides facial or gesture-based emotion recognition, electroencephalogram (EEG) data have been drawing attention thanks to their capability to counter the effect of deceptive external expressions, such as facial expressions or speech. Emotion recognition based on EEG signals relies heavily on the features and their delineation, which requires selecting the feature categories converted from the raw signals and the types of expressions that can display the intrinsic properties of an individual signal or a group of them. Moreover, the correlation or interaction among channels and frequency bands also contains crucial information for emotional state prediction, and it is commonly disregarded in conventional approaches. Therefore, in our method, the correlations between 32 channels and frequency bands were put to use to enhance emotion prediction performance. The extracted features, chosen from the time domain, were arranged into feature-homogeneous matrices, with their positions following the corresponding electrodes placed on the scalp. Based on this 3D representation of EEG signals, the model must be able to learn the local and global patterns that describe the short- and long-range relations of EEG channels, along with the embedded features. To deal with this problem, we proposed a 2D CNN with convolutional layers of different kernel sizes assembled into a convolution block, combining features distributed in small and large regions. Ten-fold cross-validation was conducted on the DEAP dataset to prove the effectiveness of our approach. We achieved average accuracies of 98.27% and 98.36% for arousal and valence binary classification, respectively.
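
A rough sketch of such a multi-scale convolution block, parallel 2D convolutions with different kernel sizes whose outputs are concatenated, is shown below; the electrode-grid size, feature count, and kernel sizes are assumptions for illustration only.

```python
# Sketch of a multi-scale convolution block over a scalp-grid EEG representation.
# Grid size, feature count, and kernel sizes are illustrative assumptions.
import torch
import torch.nn as nn

class MultiScaleBlock(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch, k, padding=k // 2),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(),
            )
            for k in kernel_sizes
        ])

    def forward(self, x):
        # Concatenate small- and large-receptive-field features along channels.
        return torch.cat([b(x) for b in self.branches], dim=1)

# Example: 4 time-domain feature maps arranged on a 9x9 electrode grid
x = torch.randn(8, 4, 9, 9)
block = MultiScaleBlock(in_ch=4, out_ch=16)
print(block(x).shape)  # torch.Size([8, 48, 9, 9])
```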

Article
Deep-Learning-Based Multimodal Emotion Classification for Music Videos
Sensors 2021, 21(14), 4927; https://doi.org/10.3390/s21144927 - 20 Jul 2021
Cited by 10 | Viewed by 1779
Abstract
Music videos contain a great deal of visual and acoustic information. Each information source within a music video influences the emotions conveyed through the audio and video, suggesting that only a multimodal approach is capable of achieving efficient affective computing. This paper presents an affective computing system that relies on music, video, and facial expression cues, making it useful for emotional analysis. We applied audio–video information exchange and boosting methods to regularize the training process and reduced the computational costs by using a separable convolution strategy. In sum, our empirical findings are as follows: (1) multimodal representations efficiently capture all acoustic and visual emotional cues included in each music video; (2) the computational cost of each neural network is significantly reduced by factorizing the standard 2D/3D convolution into separate channel and spatiotemporal interactions; and (3) information-sharing methods incorporated into multimodal representations are helpful in guiding individual information flow and boosting overall performance. We tested our findings across several unimodal and multimodal networks against various evaluation metrics and visual analyzers. Our best classifier attained 74% accuracy, an F1-score of 0.73, and an area-under-the-curve score of 0.926.
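
The separable-convolution strategy mentioned in finding (2) can be illustrated with a minimal sketch that factorizes a standard 2D convolution into a depthwise and a pointwise step; the channel counts and kernel size are assumed for illustration.

```python
# Sketch of a separable convolution: depthwise (per-channel spatial) convolution
# followed by a pointwise (1x1 channel-mixing) convolution.
# Channel counts and kernel size are illustrative assumptions.
import torch
import torch.nn as nn

class SeparableConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

x = torch.randn(2, 32, 56, 56)
separable = SeparableConv2d(32, 64)
standard = nn.Conv2d(32, 64, 3, padding=1)

def count(m):
    return sum(p.numel() for p in m.parameters())

print(count(separable), "vs", count(standard), "parameters")  # far fewer in the separable version
```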

Article
Robust Multimodal Emotion Recognition from Conversation with Transformer-Based Crossmodality Fusion
Sensors 2021, 21(14), 4913; https://doi.org/10.3390/s21144913 - 19 Jul 2021
Cited by 6 | Viewed by 2006
Abstract
Decades of scientific research have been conducted on developing and evaluating methods for automated emotion recognition. With exponentially growing technology, there is a wide range of emerging applications that require recognition of the user's emotional state. This paper investigates a robust approach for multimodal emotion recognition during a conversation. Three separate models for the audio, video, and text modalities are structured and fine-tuned on the MELD dataset. In this paper, a transformer-based crossmodality fusion with the EmbraceNet architecture is employed to estimate the emotion. The proposed multimodal network architecture achieves up to 65% accuracy, which significantly surpasses any of the unimodal models. We provide multiple evaluation techniques applied to our work to show that our model is robust and can even outperform state-of-the-art models on the MELD dataset.
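
As a rough illustration of the general idea of transformer-based crossmodality fusion, the sketch below treats per-modality utterance embeddings as a short token sequence mixed by a Transformer encoder before seven-class classification; the embedding size, layer count, and mean pooling are assumptions and not the authors' EmbraceNet-based design.

```python
# Sketch of transformer-based crossmodality fusion over per-utterance embeddings.
# Embedding size, depth, and pooling are illustrative assumptions.
import torch
import torch.nn as nn

class CrossmodalFusionSketch(nn.Module):
    def __init__(self, d_model=256, n_classes=7):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.cls = nn.Linear(d_model, n_classes)

    def forward(self, modality_embeddings):
        # modality_embeddings: (batch, n_modalities, d_model), e.g. audio/video/text
        fused = self.encoder(modality_embeddings)  # modalities exchange information
        return self.cls(fused.mean(dim=1))         # pool and classify into 7 emotions

emb = torch.randn(4, 3, 256)  # audio, video, text embeddings for 4 utterances
print(CrossmodalFusionSketch()(emb).shape)  # torch.Size([4, 7])
```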

Article
Context-Aware Emotion Recognition in the Wild Using Spatio-Temporal and Temporal-Pyramid Models
Sensors 2021, 21(7), 2344; https://doi.org/10.3390/s21072344 - 27 Mar 2021
Cited by 2 | Viewed by 1209
Abstract
Emotion recognition plays an important role in human–computer interactions. Recent studies have focused on video emotion recognition in the wild and have run into difficulties related to occlusion, illumination, complex behavior over time, and auditory cues. State-of-the-art methods use multiple modalities, such as frame-level, spatiotemporal, and audio approaches. However, such methods have difficulty exploiting long-term dependencies in temporal information, capturing contextual information, and integrating multimodal information. In this paper, we introduce a flexible multimodal system for video-based emotion recognition in the wild. Our system tracks and votes on significant faces corresponding to persons of interest in a video to classify seven basic emotions. The key contribution of this study is that it proposes the use of face feature extraction with context-aware and statistical information for emotion recognition. We also build two model architectures to effectively exploit long-term dependencies in temporal information: a temporal-pyramid model and a spatiotemporal model with a "Conv2D+LSTM+3DCNN+Classify" architecture. Finally, we propose a best-selection ensemble to improve the accuracy of multimodal fusion. The best-selection ensemble selects the best combination of the spatiotemporal and temporal-pyramid models to achieve the best accuracy for classifying the seven basic emotions. In our experiments, we benchmark on the AFEW dataset and achieve high accuracy.
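
The Conv2D+LSTM portion of such a spatiotemporal model can be sketched as follows; the backbone, hidden size, input resolution, and frame count are illustrative assumptions rather than the authors' architecture.

```python
# Sketch of a Conv2D+LSTM spatiotemporal model: per-frame CNN features followed by
# an LSTM over time and a 7-class emotion classifier.
# Backbone, hidden size, resolution, and frame count are illustrative assumptions.
import torch
import torch.nn as nn

class Conv2DLSTMSketch(nn.Module):
    def __init__(self, feat_dim=128, hidden=256, n_classes=7):
        super().__init__()
        # Small per-frame CNN standing in for a face-feature backbone.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.cls = nn.Linear(hidden, n_classes)

    def forward(self, clip):
        # clip: (batch, time, 3, H, W) sequence of face crops
        b, t = clip.shape[:2]
        feats = self.cnn(clip.flatten(0, 1)).view(b, t, -1)  # per-frame features
        out, _ = self.lstm(feats)                            # temporal modeling
        return self.cls(out[:, -1])                          # classify from the last step

clip = torch.randn(2, 16, 3, 112, 112)
print(Conv2DLSTMSketch()(clip).shape)  # torch.Size([2, 7])
```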

Article
CorrNet: Fine-Grained Emotion Recognition for Video Watching Using Wearable Physiological Sensors
Sensors 2021, 21(1), 52; https://doi.org/10.3390/s21010052 - 24 Dec 2020
Cited by 7 | Viewed by 2503
Abstract
Recognizing users' emotions while they watch short-form videos anytime and anywhere is essential for facilitating video content customization and personalization. However, most works either classify a single emotion per video stimulus or are restricted to static, desktop environments. To address this, we propose a correlation-based emotion recognition algorithm (CorrNet) to recognize the valence and arousal (V-A) of each instance (a fine-grained segment of signals) using only wearable physiological signals (e.g., electrodermal activity, heart rate). CorrNet takes advantage of features both inside each instance (intra-modality features) and between different instances for the same video stimulus (correlation-based features). We first test our approach on an indoor-desktop affect dataset (CASE) and thereafter on an outdoor-mobile affect dataset (MERCA), which we collected using a smart wristband and a wearable eye tracker. Results show that for subject-independent binary classification (high/low), CorrNet yields promising recognition accuracies: 76.37% and 74.03% for V-A on CASE, and 70.29% and 68.15% for V-A on MERCA. Our findings show that (1) instance segment lengths between 1 and 4 s result in the highest recognition accuracies; (2) accuracies of laboratory-grade and wearable sensors are comparable, even at low sampling rates (≤64 Hz); and (3) large amounts of neutral V-A labels, an artifact of continuous affect annotation, result in varied recognition performance.
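
The combination of intra-instance and correlation-based features can be sketched as follows; the segment length, sampling rate, and the specific statistics are illustrative assumptions rather than CorrNet's actual feature set.

```python
# Sketch of the correlation-based feature idea: each fine-grained instance of a
# physiological signal is described both by its own statistics and by its Pearson
# correlation with the other instances of the same video stimulus.
# Segment length, sampling rate, and chosen statistics are illustrative assumptions.
import numpy as np

def instance_features(signal, fs=64, seg_seconds=2):
    """signal: 1D physiological trace (e.g., EDA) recorded for one stimulus."""
    seg = fs * seg_seconds
    n = len(signal) // seg
    instances = signal[: n * seg].reshape(n, seg)

    # Intra-instance features: simple per-segment statistics
    intra = np.stack([instances.mean(1), instances.std(1),
                      instances.min(1), instances.max(1)], axis=1)

    # Correlation-based features: how each instance co-varies with the others
    corr = np.corrcoef(instances)           # (n, n) Pearson correlations
    np.fill_diagonal(corr, 0.0)             # ignore self-correlation
    inter = np.stack([corr.mean(1), corr.std(1)], axis=1)

    return np.hstack([intra, inter])        # (n_instances, 6)

eda = np.cumsum(np.random.default_rng(1).standard_normal(64 * 60))  # 1 min of fake EDA
print(instance_features(eda).shape)  # (30, 6)
```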

Planned Papers

The list below represents only planned manuscripts. Some of these manuscripts have not yet been received by the Editorial Office. Papers submitted to MDPI journals are subject to peer review.

Title: Subjective Evaluation of Basic Emotions from Audio-Visual Data
Authors: Sudarsana Reddy Kadiri; Paavo Alku
Affiliation: Department of Signal Processing and Acoustics, Aalto University, Finland
Abstract: Understanding the perception of emotions or affective states in humans is important for developing emotion-aware systems that work in realistic scenarios. In this paper, the perception of emotions in naturalistic human interaction (audio-visual data) is studied using perceptual evaluation. For this purpose, a naturalistic audio-visual emotion database collected from TV broadcasts such as soap operas and movies, called the IIIT-H Audio-Visual Emotion (IIIT-H AVE) database, is used. The database consists of audio-alone, video-alone, and audio-visual data in English. Using data from all three modes, perceptual tests are conducted for four basic emotions (angry, happy, neutral, and sad) based on category labeling, and for two dimensions, namely arousal (active or passive) and valence (positive or negative), based on dimensional labeling. Interestingly, the general patterns in the perception of emotions were remarkably different for different emotions. This finding emphasizes the importance of emotion-specific features compared to commonly used features in the development of emotion-aware systems.
