A Deep-Learning Model for Subject-Independent Human Emotion Recognition Using Electrodermal Activity Sensors

One of the main objectives of Active and Assisted Living (AAL) environments is to ensure that elderly and/or disabled people live well in their immediate environments; this can be monitored by, among others, the recognition of emotions based on non-highly intrusive sensors such as Electrodermal Activity (EDA) sensors. However, designing a machine-learning model that is trained on one group of persons and then recognizes the emotions of a totally new group of persons is still a serious challenge in the field, since the new group may exhibit different emotion patterns. Accordingly, this paper contributes to the field of human emotion recognition by proposing a Convolutional Neural Network (CNN) architecture which ensures promising robustness-related results for both subject-dependent and subject-independent human emotion recognition. The CNN model has been trained using a grid search technique, a model hyperparameter optimization technique, to fine-tune the parameters of the proposed architecture. The overall concept's performance is validated and stress-tested using the MAHNOB and DEAP datasets. The results demonstrate a promising robustness improvement regarding various evaluation metrics. We could increase the accuracy for subject-independent classification to 78% and 82% for MAHNOB and DEAP respectively, and to 81% and 85% for subject-dependent classification for MAHNOB and DEAP respectively (4 classes/labels). The work clearly shows that, using solely the non-intrusive EDA sensors, a robust classification of human emotion is possible even without involving additional physiological signals.


Introduction
Emotion recognition plays an important role in various areas of life, especially in the field of Active and Assisted Living (AAL) [1] and Driver Assistance Systems (DAS) [2]. Recognizing emotions automatically is one of the technical enablers of AAL, as it is considered a significant help for monitoring and observing the mental state of elderly or disabled persons. This paper contributes to this research field by relying on the only slightly intrusive EDA-based sensors. The structure of the paper is as follows: Section 2 presents an overview of the state-of-the-art approaches. Section 3 introduces the datasets. Section 4 portrays the overall architecture of the proposed classification model. Sections 5 and 6 present the overall results and the related discussions respectively. The paper ends with a conclusion in Section 7.

Related Works
Regarding human emotion recognition based on EDA sensors, which can be embedded in smart wearable devices, few works have been published so far. In [17], the authors proposed a system to recognize the driver's emotional state after transforming the EDA signals using a short-time Fourier transform. They considered three classes: neutral-stress, neutral-anger, and stress-anger. Furthermore, in [18], the authors applied a convex optimization-based electrodermal activity (cvxEDA) framework and clustering algorithms to automatically classify the arousal and valence levels induced by affective sound stimuli.
In the literature, it has been proven that the nature of the stimuli plays an important role in increasing the EDA response, which makes the emotion recognition process less complex [19]. Furthermore, other works showed promising results when EDA responses are modulated by musical emotional stimuli [20,21]. Consequently, these results encouraged researchers to work on classifying arousal and valence levels induced by auditory stimuli.
In [22], the authors used the AVEC 2016 dataset [23,24] and proposed a deep-learning model that consists of a CNN followed by a recurrent neural network and fully connected layers. They showed that an end-to-end deep-learning approach operating directly on raw signals can replace feature engineering for emotion recognition purposes.
Moreover, the use of different physiological signals has been previously investigated [25,26]. However, mounting different types of sensors on the human body is neither preferred nor well-accepted. In [26], the authors fused different types of sensors, ECG (Electrocardiogram), EDA and ST (Skin Temperature), through a hybrid neural model which combines cellular neural networks and echo state neural networks to recognize four classes of valence and arousal, namely high valence high arousal, high valence low arousal, low valence high arousal, and low valence low arousal. In [25], the authors combined facial electromyograms, electrocardiogram, respiration, and EDA data collected during racing conditions. The emotional classes identified are high stress, low stress, disappointment, and euphoria. Support vector machines (SVMs) and an adaptive neuro-fuzzy inference system (ANFIS) were used for the classification.
In [27], the researchers reported results using only EDA to recognize four different states (joy, anger, sadness, pleasure) induced by music, using 193 features and classifiers based on a genetic algorithm and the K-nearest-neighbor method. Table 1 shows a summary of the state-of-the-art for human emotion recognition using physiological signals. More details regarding state-of-the-art experiments and obtained results can be found in Section 6.
The major limitations of the state-of-the-art can be summarized in three major points. First, the limitation regarding generalized models to recognize human emotions based on EDA signals (i.e., published works do not comprehensively consider the lab-setting independence property of emotion classifiers for EDA signals). Second, the limitation concerning subject-independent human emotion recognition (i.e., published works do not comprehensively address the subject-independence property of emotion classifiers for EDA signals). Third, most published related works focus mostly on classifying only 2 (active/passive) emotional states.
In this work, we focus on the second and the third limitations. Classifying human emotion with respect to different lab settings is a research question that may require adjusting the raw data at the feature-engineering level, which is not the focus of this work, since the CNN, as a deep-learning model, extracts the desired features internally.

Datasets
This study uses public benchmark datasets (MAHNOB and DEAP) of physiological signals to test our proposal for a robust emotion recognition system. For both datasets, solely the EDA-related data is used in the experiments of this paper.

MAHNOB
The dataset used is called MAHNOB and was collected by Soleymani et al. [31]. The data comprises different physiological signals.
The data was collected from 30 young healthy adults who participated in the study; 17 of the participants were female and 13 were male. Their age varied between 19 and 40. The participants were shown 20 emotional video clips, which they evaluated in terms of both valence and arousal using the Self-Assessment Manikins (SAM) questionnaire [32]. SAM is a prominent tool that visualizes the degrees of valence and arousal by manikins; the participants rated each dimension on a scale from 1 to 9, see Figure 1. In the MAHNOB experiments, electroencephalogram (EEG), blood volume pressure (BVP), respiration pattern, skin temperature, electromyogram (EMG), electrooculogram (EOG), electrocardiogram (ECG), and EDA signals of the 30 participants were collected.

DEAP
DEAP [33] is a multimodal dataset used to analyze human emotional states.
The stimuli used in the experiments were chosen in several steps. First, 120 initial stimuli were selected both semi-automatically and manually. Second, a one-minute highlight part was specified for each stimulus. Third, through a web-based subjective assessment experiment, 40 final stimuli were chosen.
During the physiological experiment, 32 participants evaluated 40 videos via a web interface used for subjective emotion assessment in terms of the levels of arousal, valence, like/dislike, dominance, and familiarity. The age of the participants varied between 19 and 37. Concerning the classes/labels for DEAP, we considered the same classes as in Section 4.1.

Classification Using a Convolutional Neural Network (CNN)
In this section, we present the labelling of EDA signals, the design details of the proposed CNN for emotion classification, and the evaluation metrics and validation concept.

Preprocessing and Labelling
First, the raw EDA data were scaled such that the distribution is centered around 0 with a standard deviation of 1. After this normalization, the two emotional dimensions of valence and arousal [34] are addressed for classification. In this regard, the SAM ratings on the scale 1-9 were mapped into 2 levels for each of the valence and arousal dimensions.
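As an illustration, the normalization and label mapping described above can be sketched as follows (the mid-scale threshold of 5 and the class encoding are assumptions for illustration; the text only states a binary split per dimension):

```python
import numpy as np

def preprocess(eda_raw, valence, arousal):
    """Z-score one raw EDA recording and map its SAM ratings (1-9)
    to one of four combined valence/arousal classes."""
    x = (eda_raw - eda_raw.mean()) / eda_raw.std()  # zero mean, unit std
    v = int(valence > 5)  # 0 = low valence, 1 = high valence (threshold assumed)
    a = int(arousal > 5)  # 0 = low arousal, 1 = high arousal (threshold assumed)
    label = 2 * v + a     # 0..3: LVLA, LVHA, HVLA, HVHA
    return x, label
```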

Classifiers
To perform the emotion classification task, we propose a deep-learning approach. A CNN is a kind of feedforward network structure that consists of multiple layers of convolutional filters followed by subsampling filters and ends with a fully connected classification layer. The classical LeNet-5 CNN, first proposed by LeCun et al. in [35], is the basic model for various CNN applications in object detection, localization, and prediction.
First, the EDA signals are converted into matrices, whereby the goal is to make the application of the CNN model possible (see Section 5).
As illustrated in Figure 2, the proposed CNN architecture has three convolutional layers (C1, C2, and C3), three subsampling layers in between (i.e., P1, P2, and P3), and an output layer F. The convolutional layers generate feature maps using 72 (3 × 3) filters followed by a Scaled Exponential Linear Units (SELU) as an activation function, 196 (3 × 3) filters followed by a Rectified linear unit (ReLU) as an activation function and 392 (3 × 3) filters followed by a ReLU as an activation function.
Additionally, in the subsampling layers, the generated feature maps are spatially down-sampled. In our proposed model, the feature maps generated in layers C1, C2 and C3 are sub-sampled using pooling kernels of size 2 × 2, 3 × 3 and 3 × 3 in the subsequent layers P1, P2, and P3 respectively.
The output layer F is a fully connected neural model that performs the classification; it consists of three layers. The first layer has 1176 nodes, each activated by a ReLU activation function. The second layer has 1024 nodes, each activated by a SELU activation function. The final layer is the SoftMax output layer.
The result of the mentioned layers is a 2D representation of extracted features from input feature map(s) based on the input EDA signals.
Dropout is a regularization technique to avoid over-fitting in neural networks by preventing complex co-adaptations on the training data [36]; the dropout rate for each layer was set to 0.25, i.e., the fraction of the input units to drop. Table 2 shows the parameters used for all the layers of the proposed CNN model.

Table 2 lists, per layer, the kernel size, the number of units, and the other layer parameters. C denotes a convolution layer, P a max-pooling layer, and SELU the Scaled Exponential Linear Unit activation function.
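The described architecture can be sketched in Keras as follows. Note that the padding mode, the interpretation of the pooling sizes as kernel sizes, and the optimizer/loss are not stated in the text and are assumptions chosen for illustration:

```python
from tensorflow.keras import layers, models

def build_cnn(input_shape=(39, 66, 1), n_classes=4):
    """Sketch of the proposed CNN: three Conv2D/MaxPooling2D stages
    (C1/P1, C2/P2, C3/P3) followed by the fully connected part F."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(72, (3, 3), activation='selu', padding='same'),   # C1
        layers.MaxPooling2D((2, 2)),                                    # P1
        layers.Dropout(0.25),
        layers.Conv2D(196, (3, 3), activation='relu', padding='same'),  # C2
        layers.MaxPooling2D((3, 3)),                                    # P2
        layers.Dropout(0.25),
        layers.Conv2D(392, (3, 3), activation='relu', padding='same'),  # C3
        layers.MaxPooling2D((3, 3)),                                    # P3
        layers.Dropout(0.25),
        layers.Flatten(),
        layers.Dense(1176, activation='relu'),
        layers.Dropout(0.25),
        layers.Dense(1024, activation='selu'),
        layers.Dropout(0.25),
        layers.Dense(n_classes, activation='softmax'),                  # output
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model
```

With 'same' padding and the assumed pooling kernels, a 39 × 66 input is reduced to a 2 × 3 × 392 feature volume before the fully connected part.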
A grid search technique has been used to fine-tune the CNN model hyperparameters and to find the optimal number of filters and layers needed to perform the emotion classification task. We have used the GridSearchCV class of Scikit-learn [37] and provided a dictionary of the hyperparameters to be checked during the performance evaluation. By default, the grid search uses one thread, but it can be configured to use all available cores to reduce the computation time. The Scikit-learn class has been combined with Keras to find the best hyperparameter values. Additionally, cross-validation is used to evaluate each individual model; 10-fold cross-validation has been used.
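The grid-search pattern can be sketched as follows. For a self-contained example, a small SVM is tuned on synthetic data rather than the wrapped Keras model, but the pattern (parameter dictionary → GridSearchCV → best_params_) is identical; the grid values here are hypothetical:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy data standing in for the EDA feature matrices
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Dictionary of candidate hyperparameter values (illustrative grid)
param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}

# 10-fold cross-validation; n_jobs=-1 uses all available cores
search = GridSearchCV(SVC(), param_grid, cv=10, n_jobs=-1)
search.fit(X, y)
print(search.best_params_)  # best hyperparameter combination found
```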
All provided results have been obtained while using the following computer platform: Intel Corei7-7820HK processor Quad-Core 2.90 GHz, 16 GB DDR4 SDRAM, NVIDIA GeForce GTX 1080 with 8 GB dedicated storage.
Additionally, we examine several classifiers to compare the performance of existing models with that of the one proposed here. In particular, Support Vector Machine (SVM) [38], K-Nearest Neighbor (KNN) [39], Naive Bayes [40] and Random Forest [41] are considered for benchmarking.
Based on Figures 3 and 4, selecting the previous classifiers offers different advantages for comparison purposes. For example, random forests take a set of high-variance, low-bias decision trees and convert them into a model that has both low variance and low bias. KNN, on the other hand, is an algorithm which stores all the available cases and classifies new cases based on a similarity measure (e.g., distance functions); it has therefore been applied in statistical estimation and pattern recognition as a non-parametric technique since the beginning of the 1970s [39]. Support Vector Machines are well-known for handling non-linearly separable data via their non-linear kernels, e.g., the SVM with a polynomial kernel (SVM (poly)) and the SVM with a radial basis kernel (SVM (rbf)). Therefore, we classify the EDA data using three types of SVMs: SVM (linear) (i.e., the standard linear SVM), SVM (poly), and SVM (rbf). Finally, we used a simple probabilistic model, Naive Bayes, to show how such a model behaves on EDA data. Table 3 shows the parameter values of the proposed CNN and the other classifiers.
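A minimal sketch of the benchmarking setup with the classifiers named above; synthetic data stands in for the EDA feature matrices and default hyperparameters are assumed:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Toy 4-class data standing in for the EDA feature matrices
X, y = make_classification(n_samples=300, n_features=20, n_classes=4,
                           n_informative=6, random_state=0)

classifiers = {
    'SVM (linear)':  SVC(kernel='linear'),
    'SVM (poly)':    SVC(kernel='poly'),
    'SVM (rbf)':     SVC(kernel='rbf'),
    'KNN':           KNeighborsClassifier(),
    'Naive Bayes':   GaussianNB(),
    'Random Forest': RandomForestClassifier(random_state=0),
}

# Mean 10-fold cross-validation accuracy per classifier
scores = {name: cross_val_score(clf, X, y, cv=10).mean()
          for name, clf in classifiers.items()}
```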

Evaluation Metrics and Validation Concept
To evaluate the overall performance of the classifiers, we consider several performance metrics. In particular, we use precision, recall, f-measure, and accuracy, as in [42].
Regarding the evaluation scenarios, we consider two cases: the subject-dependent and the subject-independent case. Subject-dependent means that training and testing are performed on the same subject. Subject-independent means that training is performed on one group of subjects and testing on a totally new group of subjects.
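The two protocols can be sketched with scikit-learn splitters (the data and subject IDs below are synthetic): GroupKFold keeps whole subjects out of the training folds for the subject-independent case, while an ordinary K-fold within one subject's data corresponds to the subject-dependent case.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold

X = np.random.rand(100, 8)               # toy feature matrix
y = np.random.randint(0, 4, 100)         # 4 emotion classes
subjects = np.repeat(np.arange(10), 10)  # 10 samples per subject

# Subject-independent: no subject appears in both train and test folds
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=subjects):
    assert not set(subjects[train_idx]) & set(subjects[test_idx])

# Subject-dependent: ordinary K-fold within a single subject's data
one_subject = np.where(subjects == 0)[0]
for train_idx, test_idx in KFold(n_splits=5).split(one_subject):
    assert set(train_idx).isdisjoint(test_idx)
```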

Results
To gain a deeper understanding of the performance of the proposed CNN model, the MAHNOB and DEAP datasets were used for testing the overall classification performance.
Moreover, the data distribution should be taken into consideration when choosing suitable classifiers for comparison purposes. In this regard, Fisher mapping [43] was used to project the investigated samples onto their three major discriminant scores. Based on Figures 3 and 4, it can be concluded that the data is highly overlapped and that there is a class imbalance problem.
In this assessment, 10 subjects were selected from the MAHNOB and DEAP datasets. Each dataset for each subject consists of four classes (see Sections 3.1 and 3.2). The average training time for each subject was approximately 21 min.
The length of each considered EDA signal is 2574 samples, which are converted to matrices of size 39 × 66. All results are presented for ten-fold cross-validation.
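The conversion mentioned above is a plain reshape; row-major ordering is an assumption, as the text does not state how the 2574 samples are arranged in the matrix:

```python
import numpy as np

signal = np.arange(2574, dtype=float)  # stand-in for one EDA recording
matrix = signal.reshape(39, 66)        # 39 * 66 == 2574, row-major order
```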
Tables 4 and 5 present the average values of the precision, the recall, and the f-measure using the DEAP and MAHNOB datasets respectively, when training and testing are performed on the same subject. The performance metric values of all subjects have been summed and divided by the total number of subjects. The major target of this experiment is to assess the overall performance of subject-dependent EDA-based emotion classification.
Tables 6 and 7 present the precision, the recall, and the f-measure using the DEAP and MAHNOB datasets respectively. The results are obtained when training and testing are performed on different subjects. The major target of this experiment is to assess the overall performance of subject-independent EDA-based emotion classification.
In all tables, the proposed CNN model shows the highest performance, followed by K-NN and random forest as the next two best classifiers. That K-NN and random forest perform well indicates that the dataset is not easily separable and that the nonlinearity is high; this can be observed in Figure 4. Accordingly, the decision planes generated by the other classifiers (see Tables 4-7) assign some points in space to inappropriate regions, which K-NN and random forest largely avoid.
The performance metrics and the implementation are written in Python using Numpy (http://www.numpy.org/), Scikit-learn (https://scikit-learn.org/) and Keras (https://keras.io/). All performance metrics are calculated per class and weighted taking the class imbalance into account; i.e., the evaluation metrics for each label have been calculated and their average has been weighted by the support, which is the number of true instances of each label. Tables 8 and 9 show the confusion matrices for MAHNOB and DEAP (the average performance results for training and testing on the same subjects) and for MAHNOB and DEAP (the average performance results for training and testing on different subjects), respectively.
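The support-weighted averaging described above corresponds to scikit-learn's average='weighted' option; the following sketch uses a hypothetical 4-class prediction vector:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [0, 0, 1, 1, 2, 2, 3, 3, 3, 3]  # hypothetical ground-truth labels
y_pred = [0, 1, 1, 1, 2, 0, 3, 3, 3, 2]  # hypothetical predictions

# Per-class metrics, averaged with weights equal to each class's support
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average='weighted')
acc = accuracy_score(y_true, y_pred)
```

Note that the support-weighted recall coincides with the overall accuracy, since each class's recall is the fraction of its true instances recovered.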

Discussion
To highlight the contribution of this work, other works should be considered and analyzed. However, such a comparison is not easy, because (a) other works may combine other types of physiological signals rather than using only EDA, and (b) the reaction and response of EDA depend strongly on the stimuli type, with better results reported for acoustic stimuli [18].
To our knowledge, this study shows for the first time that developing subject-independent human emotion recognition using only EDA signals with a promising recognition rate is possible. It is also worthwhile noting that we were able to:
• increase the f-measure for subject-independent classification to 78% and 81% for MAHNOB and DEAP respectively (4 classes/labels);
• increase the f-measure for subject-dependent classification to 83% and 85% for MAHNOB and DEAP respectively (4 classes/labels).
In the state-of-the-art, the researchers in [22] tested a deep-learning model consisting of an RNN and a CNN, which showed a Concordance Correlation Coefficient (CCC) [44] of 0.10 on the arousal dimension and 0.33 on the valence dimension based on EDA only. They used the AVEC 2016 dataset [23,24].
In addition, in [27], the authors reported an emotion recognition analysis using only the EDA signal for the subject-dependent case, with an accuracy of 56.5% for the arousal dimension and 50.5% for the valence dimension based on four song stimuli. In [18], the authors suggested a system which can achieve a recognition accuracy of 77.33% on the arousal dimension and 84% on the valence dimension based on three emotional states induced by affective sounds taken from the IADS collection [45].
Furthermore, it should be mentioned that the binary classification (passive/active cases) of EDA signals showed high results, with an accuracy of 95% using SVM in [28] and an accuracy of 80% using CNNs in [29].
However, such high performance for two classes is expected, as other studies showed clearly that EDA signals for active and passive states form clearer patterns than the 4 arousal/valence classes of emotion recognition [46]. Table 10 summarizes the state-of-the-art for EDA-based emotion detection regarding experiment, number of classes, used classifiers, and reported accuracy. Additionally, the results of the state-of-the-art clearly show that feature engineering for subject-independent and subject-dependent human emotion detection based on EDA does not lead to high performance, in particular when the number of classes is higher than two. This is because extracting the sympathetic response patterns which are part of each emotion is difficult. Furthermore, approaches that try to overcome this by analyzing more basic features, such as level, response amplitude, rate, rise time, and recovery time, discard flexibly elicited behavior which might improve emotion recognition. This work shows that deep learning can overcome this drawback quite well.
Regarding the point of testing the proposed model using datasets from different labs: human emotions do not form similar patterns across settings. Consequently, the research community should develop generalized models to recognize human emotions where the subjects, elicitation materials, and physiological sensor brands differ from the ones involved in the initial training. Dealing with such a research question has an important impact on human support in the frame of smart environments in different applications.
Concerning human emotion recognition with respect to different lab settings, the authors in [30] showed that adjusting and manipulating the feature space to bring both datasets to a homogeneous feature space as a pre-processing step may increase the overall performance even when datasets come from different labs.
Moreover, in [47], the authors examined the ability of 504 school children aged between 8 and 11 years to recognize emotions from pictures of facial expressions. The overall performance was approximately 86% for recognizing anger, fear, sadness, happiness, disgust, and neutral facial expressions. It is impressive to see that the proposed automated EDA-based emotion recognition system comes close to the human capability of interpreting facial expressions.

Conclusions
This study can be considered a basic contribution towards overcoming the generalization problem in human emotion recognition. The aim was to show the feasibility of building such generalized models for relevant application contexts. Furthermore, this study examined the less intrusive EDA sensors based on statistical analyses of real-life datasets and reviewed various state-of-the-art approaches to human emotion recognition in smart home environments.
Additionally, emotion recognition is a cornerstone of advanced intelligent systems for monitoring a subject's comfort. Thus, information on a subject's emotion and stress level is a key component for the future of smart AAL environments.
In our future work, we will focus on human emotion recognition using EDA with respect to different lab settings; that is, we will try to build a generalized approach which is trained using lab settings X and tested using lab settings Y. Additionally, we plan to combine Stacked Sparse Auto-Encoders with the CNN. Moreover, a CNN essentially learns local (spatial) features, whereas an RNN rather learns temporal features. Consequently, combining both neural network concepts will result in a neuro-processor which can learn both contextual dependencies (i.e., spatial and temporal) from the inputted local features. Such a combination can potentially improve the overall performance.