1. Introduction
Human senses are physiological mechanisms that play a vital role in perception. Humans perceive their surrounding environment using multiple senses, i.e., vision, hearing (auditory), taste (gustatory), smell (olfactory), and touch (tactile). The sensing organs transmit information to the human brain, which helps in perceiving the surrounding environment. Traditional multimedia engages only two of these senses, i.e., hearing and vision, whereas the experience of viewing multimedia content can be enhanced by engaging more than two senses simultaneously. Multimedia content that engages more than two human senses simultaneously is termed multiple sensorial media (mulsemedia) [
1,
2]. Mulsemedia has recently attracted research attention, with the aim of providing an immersive, real-world-like environment during multimedia interactions. Mulsemedia could add a new dimension to the development of immersive systems in diverse fields such as education, medicine, advertising, and home entertainment. Furthermore, recent advancements in wearable sensing technologies have given researchers a broad spectrum of tools for analyzing mulsemedia and its impact on human emotions and behavior. A detailed survey of the devices that engage the haptic, olfactory, and gustatory senses, in addition to vision and hearing, for building a mulsemedia environment was presented in [
3]. Similarly, a framework was proposed for the delivery of multi-sensory effects to a heterogeneous system [
4].
The ability to recognize and adapt to the affective state of a user has increased the potential of affective computing. The affective state of an individual conveys emotional intent and is considered a primary means of communication. In everyday life, emotions play an essential role in understanding human behavior and non-verbal communication. Emotions are physiological responses evoked in reaction to external stimuli and could be used for evaluating the type of stimulus. Affective computing has supported the development of models and systems that can process human activity and, in turn, simulate it through smart recognition and interpretation [
5]. Emotions have been characterized into six basic types, i.e., anger, disgust, surprise, fear, happiness, and sadness [
6], whereas Russell’s Circumplex model categorizes emotions in a two-dimensional space based on valence and arousal scores [
7]. A person’s emotional state may change depending on their subjective experience [
8]. An emotional state can be evaluated under varying environmental conditions, and this evaluation can benefit from self-reports as well as data collected by various sensing devices [
9,
10]. Integrating these sources of information can help in better understanding an individual’s behavior or emotional state.
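As an illustration of how the Circumplex model can be operationalized, the following minimal Python sketch maps valence and arousal ratings (assumed here to lie on a 1–9 scale) to the four quadrant emotions; the midpoint threshold and the function name are illustrative assumptions rather than a prescribed procedure.

```python
def quadrant_emotion(valence: float, arousal: float, midpoint: float = 5.0) -> str:
    """Map valence/arousal ratings (assumed 1-9 scale) to a Circumplex quadrant label.

    high valence / high arousal -> happy
    low  valence / high arousal -> angry
    low  valence / low  arousal -> sad
    high valence / low  arousal -> relaxed
    """
    if valence >= midpoint:
        return "happy" if arousal >= midpoint else "relaxed"
    return "angry" if arousal >= midpoint else "sad"


if __name__ == "__main__":
    # Example: valence=7, arousal=3 falls in the 'relaxed' quadrant.
    print(quadrant_emotion(7, 3))
```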
Whenever a person engages with an emotional stimulus, their feelings are communicated through physiological and behavioral cues such as brain activity, heart rate, facial expressions, body gestures, or changes in voice. These cues are used to associate the emotional state of an individual with an external stimulus. Emotion recognition using speech [
11,
12,
13], facial expressions [
14,
15,
16] and their fusion [
17,
18] has been explored. These conventional methods for emotion recognition have limitations such as privacy concerns and camera positioning [
19]. Emotion recognition from physiological cues like brain activity, skin conductance, and heart rate has shown promising results and is relatively new in this line of research. Human emotions originate in the limbic system, which directs our attention and affects brain patterns [
20,
21]. Recently, interest in brain activity evaluation using electroencephalography (EEG) has increased due to the availability of low-cost wearable headsets and their ease of use. Emotional markers are present in EEG signals and cannot easily be masked by a user’s voluntary actions [
22,
23]. Emotion recognition using EEG focuses on identifying the emotional state of the mind. Changes in skin conductance are also observed across different emotional states [
24,
25]. Variations in heart rate have been reported as a discriminating cue for human emotion recognition [
26,
27].
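As an example of how such a cue can be derived in practice, the following minimal Python sketch estimates mean heart rate from a pulse (photoplethysmography, PPG) trace via peak detection with SciPy; the sampling rate and the minimum inter-peak distance are illustrative assumptions, not a prescribed method.

```python
import numpy as np
from scipy.signal import find_peaks


def ppg_heart_rate(ppg: np.ndarray, fs: float = 64.0) -> float:
    """Estimate mean heart rate (beats per minute) from a PPG trace via peak detection.

    ppg : 1-D array of photoplethysmography samples
    fs  : sampling rate in Hz (64 Hz is assumed here for illustration)
    """
    # Require peaks to be at least 0.4 s apart (~150 bpm upper bound, an assumption)
    peaks, _ = find_peaks(ppg, distance=int(0.4 * fs))
    if len(peaks) < 2:
        return float("nan")
    inter_beat_intervals = np.diff(peaks) / fs  # seconds between consecutive beats
    return 60.0 / float(np.mean(inter_beat_intervals))


if __name__ == "__main__":
    fs = 64.0
    t = np.arange(0, 30, 1 / fs)                 # 30 s of synthetic data
    synthetic_ppg = np.sin(2 * np.pi * 1.2 * t)  # ~72 bpm pulse wave
    print(round(ppg_heart_rate(synthetic_ppg, fs)))  # prints approximately 72
```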
The quality of experience (QoE) of mulsemedia content has been subjectively analyzed, where different genders and age groups showed varying levels of perception [
28]. Similarly, synchronization errors between audio-visual content and external devices have been analyzed and discussed with a view to enhancing the viewing experience [
29]. Mulsemedia has been explored in a 360-degree video environment, where a higher quality of perception and enjoyment was achieved [
30]. The QoE of mulsemedia has been objectively analyzed using heart rate and electrodermal activity (EDA) [
31]. A correlation was found between these objective metrics and the arousal and subjective ratings of QoE [
32]. Eye gaze data and heart rate have been analyzed for evaluating the enjoyment and perception of viewers while experiencing mulsemedia content [
33,
34]. Cross-modal correspondences were also identified when mapped with multi-sensory effects. Eye gaze and heart rate were found to have a significant influence on the QoE of viewers while experiencing cross-modal sensory effects. Human emotions were recognized in response to tactile-enhanced multimedia (TEM) using brain signals [
35]. EEG data were acquired and four emotions (i.e., sad, relaxed, angry, and happy) were classified using time-domain features. A significant change in human emotions was observed when the additional tactile sense was engaged. An increase in emotion recognition accuracy was achieved by extracting frequency-domain features [
36].
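As an illustration of the kind of time- and frequency-domain features referenced above, the following Python sketch (using NumPy and SciPy, with hypothetical variable names and an assumed sampling rate) computes a few commonly used descriptors from a single EEG channel; it is not the feature set of any specific cited work.

```python
import numpy as np
from scipy.signal import welch


def eeg_channel_features(x: np.ndarray, fs: float = 128.0) -> dict:
    """Compute simple time- and frequency-domain features from one EEG channel.

    x  : 1-D array of EEG samples (a single channel, single trial)
    fs : sampling rate in Hz (128 Hz is assumed here for illustration)
    """
    # Time-domain descriptors
    feats = {
        "mean": np.mean(x),
        "std": np.std(x),
        "first_diff_mean": np.mean(np.abs(np.diff(x))),
    }

    # Frequency-domain descriptors: band powers from Welch's PSD estimate
    freqs, psd = welch(x, fs=fs, nperseg=min(len(x), 256))
    bands = {"theta": (4, 8), "alpha": (8, 13), "beta": (13, 30)}
    for name, (lo, hi) in bands.items():
        mask = (freqs >= lo) & (freqs < hi)
        feats[f"{name}_power"] = np.trapz(psd[mask], freqs[mask])
    return feats


if __name__ == "__main__":
    # Example usage on two seconds of synthetic data
    rng = np.random.default_rng(0)
    print(eeg_channel_features(rng.standard_normal(256), fs=128.0))
```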
Hence, the human response to mulsemedia content can be evaluated using various physiological signals. While TEM clips have been used as stimuli together with EEG data for recognizing emotions [
35,
36], no emotion recognition framework based on multimodal physiological signals has used TEM as a stimulus. Towards this, we generate four TEM clips and curate a multimodal dataset of EEG, galvanic skin response (GSR), and photoplethysmography (PPG) signals recorded in response to these clips. Emotion annotation is performed using the self-assessment manikin (SAM) questionnaire. Four human emotions (sad, relaxed, happy, and angry) are recognized using each modality individually and using a fusion of these modalities. Our results show that the fusion strategy achieves better emotion recognition performance. Our major contributions in this work are two-fold:
We present a method utilizing multimodal physiological signals, including EEG, GSR, and PPG (acquired using wearable sensors), for emotion recognition in response to TEM.
Our results show that utilizing a multimodal fusion strategy for emotion recognition in response to TEM outperforms using data individually from EEG, GSR, and PPG (a minimal sketch of such a feature-level fusion pipeline is given below).
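The sketch below, in Python with scikit-learn, concatenates hypothetical per-trial feature vectors from the three modalities and trains a standard classifier; the array names, dimensions, and classifier choice are illustrative assumptions rather than the exact implementation used in this work.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Hypothetical per-trial feature matrices: rows are trials, columns are features
# extracted separately from each modality (dimensions chosen only for illustration).
rng = np.random.default_rng(0)
n_trials = 120
eeg_feats = rng.standard_normal((n_trials, 40))  # e.g., band powers per channel
gsr_feats = rng.standard_normal((n_trials, 8))   # e.g., SCR counts, tonic level
ppg_feats = rng.standard_normal((n_trials, 6))   # e.g., heart-rate statistics
labels = rng.integers(0, 4, size=n_trials)       # 4 classes: sad, relaxed, happy, angry

# Feature-level fusion: concatenate the modality-specific feature vectors per trial
fused = np.hstack([eeg_feats, gsr_feats, ppg_feats])

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
scores = cross_val_score(clf, fused, labels, cv=5)
print("Cross-validated accuracy on synthetic data:", scores.mean())
```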
The rest of the paper is structured as follows.
Section 2 presents a review of the latest emotion recognition methods using physiological signals.
Section 3 describes the proposed methodology for emotion recognition using physiological signals. Emotion recognition results using multiple modalities are presented in
Section 4, which is followed by conclusions in
Section 5.
2. Related Work
In the literature, various stimuli have been used to evoke human emotions by engaging either a single human sense [
37,
38,
39,
40,
41] or two human senses [
42,
43,
44,
45,
46,
47,
48,
49,
50]. These evoked emotions are then recognized using features extracted from data acquired by different physiological sensors. Audio music was used as a stimulus that engaged a single human sense (i.e., auditory) [
37]. EEG-based features were extracted to classify human emotions in response to music stimuli, and the impact of different genres on different age groups was analyzed. Different nightscape images were used as stimuli, engaging the sense of vision [
38]. EEG signals were recorded to analyze brain patterns for evaluating the images in terms of fear. An asymmetry index method was introduced for EEG-based emotion recognition in response to images [
51]. Different odors were used to recognize emotions using content that engaged the sense of olfaction [
39]. EEG signals were used to analyze different brain regions to discriminate pleasant and unpleasant odors. Brain signals were recorded while engaging the tactile sense through the caressing of textile fabric on the forearm [
40]. EEG signals were then used to classify pleasant and unpleasant states. A practical GSR- and PPG-based emotion recognition framework was proposed, where the Geneva affective picture database (GAPED) was used as the stimulus [
41].
The use of physiological signals has been found to be more effective for emotion recognition when compared with speech and gestures [
52]. Moreover, multimodal data analysis has a significant impact on emotion detection performance [
53,
54,
55,
56,
57]. Emotions were recognized by using music as a stimulus [
58]. Different physiological signals, i.e., electromyogram (EMG), electrocardiogram (ECG), GSR, and respiration changes, were acquired to classify different emotional states in the valence-arousal plane. Different time- and frequency-domain features were extracted, and the effectiveness of the features was demonstrated by the classification accuracies. A music recommendation system was designed by analyzing physiological signals, i.e., GSR and PPG [
59]. Emotions were linked with the physiological responses in real time to feed into the recommendation engine. Images were presented, engaging only one human sense, to evoke emotions [
60]. Facial expressions and different physiological signals such as GSR, ECG, and temperature data were acquired while presenting the stimulus. A fusion strategy was employed to improve the emotion recognition performance. Different images were presented to detect emotions using EEG and peripheral signals [
61]. Emotion detection performance was analyzed by using EEG and peripheral signals individually as well as together.
Different datasets have been created for emotion detection using physiological signals in response to various types of stimuli. For instance, the dataset for emotion analysis using physiological signals (DEAP) was created to recognize human emotions [
42]. Different video clips were displayed to subjects, and EEG, GSR, EMG, electrooculogram (EoG), and blood volume pressure (BVP) data were recorded. It was shown that fusing multiple modalities significantly improves emotion recognition performance. Similarly, EEG, EMG, GSR, and temperature data were acquired by presenting video clips as a stimulus [
62]. A significant improvement in emotion recognition performance was reported by applying modality fusion strategies. A dataset comprising EEG, ECG, EoG, and magnetoencephalogram (MEG) signals was created for emotion recognition [
63]. Emotions were elicited while presenting music videos, and brain signals were also acquired using MEG sensors and compared with EEG sensors. Another physiological dataset comprising EEG, ECG, and GSR signals was created to study the effect of personality and emotions by presenting video clips as a stimulus [
64]. The relationship between emotions and personality was analyzed using the physiological cues. A physiological dataset was created to study the effect of mood and personality by presenting emotional videos [
65]. EEG, GSR, and ECG data were acquired to investigate affective levels using valence and arousal scores. A new multimodal physiological emotion database (MPED) was made public to recognize human emotions using physiological signals including EEG, GSR, respiration, and ECG [
66]. The emotions in MPED were categorized based on a discrete emotion model. Emotions were recognized by extracting features from ECG and GSR signals [
67]. The dataset was acquired by exposing individuals to emotional videos. A pre-processing and feature extraction mechanism was proposed to improve emotion detection accuracy. Emotion detection was performed for ageing people by analyzing ECG, EMG, EDA, and skin temperature data [
68]. These physiological responses were analyzed to monitor and detect emotional states in elderly people. These datasets have been created to recognize emotions by analyzing classifier performance using individual modalities or a fusion of multiple modalities. Moreover, these studies used stimuli that engage either one human sense (audio music) or two human senses (videos).
The impact of different modalities, i.e., EEG, eye blink data, and their fusion, on emotion recognition was also investigated [
43]. Self-induced emotion patterns were investigated using EEG in response to video clips presented as stimulus [
69]. An ensemble classification approach was used to classify emotional states using ECG signals [
70]. Emotion monitoring was proposed for healthcare using a low-cost wearable EEG headset [
71]. Moreover, the effect of culture on emotion recognition was investigated using EEG signals by presenting video clips in two different languages [
72]. A feature extraction method was proposed to improve emotion recognition accuracy using EEG signals [
73]. A quadratic time-frequency feature extraction scheme was proposed to recognize emotions using EEG signals [
74]. Physiological signals (EEG and ECG) were used to investigate driver’s emotional states [
75]. Emotion recognition was analyzed in response to different movie clips using blood oxygen saturation, GSR, and heart rate variability to evaluate these clips in terms of prompted emotions [
44]. The EEG data from the DEAP dataset were used, and wavelet-based features were extracted from selected channels to recognize emotions [
45]. Different frequency bands of brain signals were analyzed to identify more sensitive brain lobes for the emotion recognition task [
46]. Physiological and inertial sensors were also used to recognize emotions in response to video clips [
47]. EDA, PPG, GSR, accelerometer, skin temperature, blood volume pulse, and heart rate data were collected to recognize different emotional states of an individual. Feature- and decision-level fusion was applied to facial and EEG-based features in a multimodal video-induced emotion recognition framework [
48].
The efficiency of GSR and PPG data from the DEAP dataset was examined for emotion categorization. The fusion of GSR and PPG features was also studied to recognize emotions [
76]. A machine-learning framework was proposed for boredom classification using a fusion of EEG and GSR data acquired in response to videos [
77]. A correlation between EEG and GSR data and boredom state was also revealed. Negative emotions were classified using multimodal physiological signals (including ECG, skin temperature, and EDA) in response to videos [
78]. A ReliefF-based channel selection method was applied to EEG data from the DEAP dataset to classify four human emotions [
79]. The channel reduction technique was validated by comparing the accuracy and F-score of the system using a support vector machine classifier. A commercially available wearable smart bracelet was used to acquire heart rate data while participants watched traditional video clips to recognize three emotions (neutral, happy, and sad) [
80]. Four human emotions, i.e., anger, sadness, joy, and pleasure in response to videos were recognized by extracting four types of features from ECG signals [
70]. Ensemble learning methods were employed to improve the classification accuracy of the system for real-world machine learning problems. EEG and GSR signals were also used to classify boredom states in response to video clips [
77]. A gradient boosting decision tree (GBDT) based classification scheme was proposed to improve emotion recognition accuracy using physiological signals (ECG, EMG, GSR, and PPG) in response to videos [
53]. Fusion of features from EEG and GSR data was used to improve emotion recognition accuracy in response to video clips [
81].
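To give a concrete sense of the GSR-derived features used by several of the studies above, the following Python sketch separates a skin conductance trace into tonic and phasic components and counts skin conductance responses with SciPy; the sampling rate, filter cutoff, and peak threshold are illustrative assumptions not tied to any cited implementation.

```python
import numpy as np
from scipy.signal import butter, filtfilt, find_peaks


def gsr_features(gsr: np.ndarray, fs: float = 32.0) -> dict:
    """Extract simple GSR features (illustrative choices, assumed 32 Hz sampling).

    gsr : 1-D array of skin conductance samples for one trial (in microsiemens)
    """
    # Tonic component: low-pass filter to capture the slowly varying level
    b, a = butter(2, 0.05 / (fs / 2), btype="low")
    tonic = filtfilt(b, a, gsr)

    # Phasic component: the residual fast fluctuations (skin conductance responses)
    phasic = gsr - tonic
    # Count SCR peaks; the amplitude threshold here is an illustrative assumption
    peaks, _ = find_peaks(phasic, height=0.01, distance=int(fs))

    return {
        "mean_tonic_level": float(np.mean(tonic)),
        "scr_count": int(len(peaks)),
        "phasic_std": float(np.std(phasic)),
    }


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # 60 seconds of synthetic skin conductance data
    signal = 2.0 + 0.05 * np.cumsum(rng.standard_normal(60 * 32)) / 32
    print(gsr_features(signal, fs=32.0))
```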
Most of the abovementioned emotion recognition methods extract time-, frequency-, and wavelet-domain features from physiological signals. There are some recent studies that have used deep learning techniques for emotion recognition [
81,
82,
83,
84,
85]. A convolutional neural network (CNN) model was employed to improve emotion recognition performance using physiological signals (including EDA, ECG, and skin temperature) while engaging individuals with a video stimulus [
86]. A CNN-based model was proposed using the DEAP dataset for detecting emotions in response to videos [
87]. A capsule network model was proposed using EEG data for emotion recognition [
88]. A CNN-based model was also proposed to improve accuracy by recognizing emotions using heart rate variability and respiration changes [
85]. A deep belief network was proposed for EEG-based emotion recognition, which selected critical frequency bands and channels [
82]. A spatial-temporal recurrent neural network was proposed for the emotion recognition task and showed promising results on EEG and facial expression datasets [
83]. EEG and GSR data from DEAP were used to improve the emotion classification accuracy [
81]. Spectrograms calculated from EEG signals were given as input to a CNN to extract EEG features, which were then fused with GSR-based features. Another CNN-based approach was proposed to recognize emotions in response to videos, and the results were evaluated in both subject-dependent and subject-independent settings [
84]. Six basic emotions were classified using various CNN models in response to videos as stimuli [
85]. Although high classification accuracy was achieved for selected CNN models, training these deep CNN models remains a challenge. A comprehensive review of emotion recognition and sentiment analysis using multimodal data was presented in some recent works [
10,
22,
89].
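To illustrate the spectrogram-based CNN input mentioned above, the following Python sketch converts a single EEG channel into a log-power time-frequency image with SciPy; the sampling rate, window length, and frequency cutoff are illustrative assumptions rather than the configuration of any cited work.

```python
import numpy as np
from scipy.signal import spectrogram


def eeg_spectrogram_image(x: np.ndarray, fs: float = 128.0) -> np.ndarray:
    """Convert one EEG channel into a log-power spectrogram suitable as a CNN input.

    x  : 1-D array of EEG samples for a single trial
    fs : sampling rate in Hz (128 Hz assumed here for illustration)
    Returns a 2-D array of shape (frequency bins, time frames).
    """
    freqs, times, sxx = spectrogram(x, fs=fs, nperseg=128, noverlap=64)
    # Keep the bands typically analyzed for emotion (the 45 Hz cutoff is an assumption)
    keep = freqs <= 45
    return np.log(sxx[keep, :] + 1e-10)  # log power stabilizes the dynamic range


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    trial = rng.standard_normal(60 * 128)       # 60 s of synthetic EEG
    image = eeg_spectrogram_image(trial, fs=128.0)
    print(image.shape)  # one time-frequency "image" per channel, ready to stack for a CNN
```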
Recently, emotion recognition techniques have been explored in response to content engaging three human senses (mulsemedia) [
35,
36,
90,
91]. Olfaction-enhanced multimedia engaging the senses of vision, olfaction, and hearing was generated [
90]. Brain activity was statistically analyzed, and it was reported that engaging the olfactory sense alongside traditional multimedia significantly activates different brain regions. Features from these brain regions were utilized to recognize pleasantness states, and it was found that olfaction-enhanced content allows human emotions to be recognized more accurately than traditional multimedia. Vibro-tactile enhanced multimedia engaging the senses of vision, hearing, and touch was used as a stimulus [
91]. Heart rate and eye-tracking data were used to analyze the effect of vibro-tactile enhanced multimedia on users’ perception. Two TEM clips were used as stimuli, and EEG signals were used to recognize four human emotions [
35,
36]. A summary of recent works on emotion recognition using physiological signals is presented in
Table 1. It should be noted that these methods are delineated based on stimuli (including images and videos). While high emotion classification accuracy (>90%) has been reported for videos in multiple instances, the performance for TEM has been significantly lower. In this work, not only is the number of TEM clips increased, but a multimodal strategy for emotion recognition in response to TEM is also proposed.