A Systematic Exploration of Deep Neural Networks for EDA-Based Emotion Recognition

Subject-independent emotion recognition based on physiological signals has become a research hotspot. Previous research has shown that electrodermal activity (EDA) signals are an effective data resource for emotion recognition. Benefiting from their great representation ability, an increasing number of deep neural networks have been applied to emotion recognition, and they can be classified as a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), or a combination of these (CNN+RNN). However, there has been no systematic research on the predictive power and configurations of different deep neural networks in this task. In this work, we systematically explore the configurations and performances of three adapted deep neural networks: ResNet, LSTM, and hybrid ResNet-LSTM. Our experiments use the subject-independent method to evaluate the three-class classification on the MAHNOB dataset. The results show that the CNN model (ResNet) reaches a better accuracy and F1 score than the RNN model (LSTM) and the CNN+RNN model (hybrid ResNet-LSTM). Extensive comparisons also reveal that our three deep neural networks with EDA data outperform previous models with handcrafted features on emotion recognition, which demonstrates the great potential of the end-to-end DNN method.


Introduction
Robust information about the emotional state of a user is key to providing an empathetic experience during human-computer interaction (HCI) [1]. To make the interaction go well, it is important to ensure that computers can understand the feelings of users throughout the interaction process [2]. In recent decades, emotion recognition has become a significant field in HCI, and it has been applied in a wide range of areas such as usability testing, development process improvement, enhanced website customization, and video games [3].
In the study of emotion recognition, different data resources have been applied, including facial and body expressions, eye gaze, audio signals, physiological signals (ECG, EEG, and EDA/GSR), respiration amplitude, and skin temperature [4]. Among these data resources, physiological signals have recently received more attention, as they reflect emotional states objectively, whereas expressions and body motions can be influenced by subjective behavior and can therefore be misleading. Early research on emotion recognition based on physiological signals adopted a subject-dependent method [5], in which the training and test data come from the same person. After decades of research, the subject-dependent method has achieved an accuracy of more than 90% [6]. However, the subject-dependent model lacks robustness and universality, as its performance becomes unstable when moving from experiment participants to the general population [7]. Hence, some researchers have started to focus on the subject-independent method, where the data is acquired from multiple persons. Compared to the subject-dependent method, the subject-independent approach generalizes better across people, which makes it better suited to emotion classification in practice. In this work, we explore three deep neural networks (ResNet, LSTM, and hybrid ResNet-LSTM) for this task. To fit our task, the dimension of the convolutional operation in the original ResNet is changed from 2D to 1D, so that the network can directly process the sequential EDA signal. We evaluate our models on the commonly used open-source dataset MAHNOB-HCI for three-class classification. Extensive ablation experiments systematically compare the performance of different structures of the three types of deep neural networks, and the results also demonstrate the superiority of deep neural networks over previous methods on the MAHNOB-HCI dataset.
The rest of this article is structured as follows: Section 2 reviews the related works, including the theories of emotion models and EDA analysis. Section 3 presents our proposed method. Section 4 discusses the results, and Section 5 presents the conclusions.

Theories of Emotion Models
Emotion is a complex concept and has many different descriptions, depending on which aspect is emphasized: feelings of arousal and/or hedonic value, appraisal and/or labeling processes, external emotion-generating stimuli, or the relationship between emotion and motivation [32]. One definition [32], written by Theodore D. Kemper in 1978, is that "emotion is a relatively short-term evaluative response essentially positive or negative in nature involving distinct somatic (and often cognitive) components." For further study, researchers built emotion models to evaluate people's emotions in qualitative and quantitative ways. The main emotion models can be classified into discrete, dimensional, dynamical, and appraisal models. Discrete models consider emotions to be discrete and mixable. The color wheel model proposed by Plutchik [33] is a typical discrete model (see Figure 1). In this model, emotions are just like colors. There are eight basic emotions (joy, trust, fear, surprise, sadness, disgust, anger, and anticipation) arranged in a ring, which are analogous to the primary colors. Other emotions can be created by mixing two basic emotions. For example, the mixture of "joy" and "trust" creates a new emotion, "love", just as the color orange can be obtained by mixing red and yellow. Dimensional models are very different from discrete models. In the theory of dimensional models, emotions such as joy, trust, and surprise have strong intrinsic connections with each other, rather than being isolated, and share some common features that can be measured quantitatively. Based on this theory, dimensional models appear as coordinate systems with two or more dimensions; each axis (or dimension) represents a feature, and different emotions occupy different positions in the coordinate system.
Figure 2 shows the valence and arousal model proposed by Russell [34], which is the most representative dimensional model and the one most used in the study of emotion recognition. Russell's model is quantized by discrete values in the 2D coordinate system of valence and arousal. In this model, the valence axis represents how positive or negative the emotion is, where high valence means pleasant and low valence means unpleasant, and the arousal axis represents the intensity of the emotion, where high arousal means activation and low arousal means deactivation. For example, emotions such as stressed and nervous show low valence and high arousal, while excited shows both high valence and high arousal; depressed shows both low valence and low arousal, while relaxed shows high valence and low arousal. The valence and arousal model established a quantifiable evaluation index for emotions, and our work is based on this model.
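The placement of emotions in the valence and arousal plane can be illustrated with a short sketch. It assumes ratings on a 1 to 9 scale with a midpoint of 5; the function name `quadrant` and the example coordinates are hypothetical and are not taken from any dataset.

```python
def quadrant(valence, arousal, midpoint=5.0):
    """Name the quadrant of the valence-arousal plane for one rating pair."""
    v_level = "high" if valence > midpoint else "low"
    a_level = "high" if arousal > midpoint else "low"
    return f"{v_level} valence, {a_level} arousal"


# Illustrative coordinates only (assumed, not measured).
examples = {
    "excited": (8, 8),
    "relaxed": (8, 2),
    "depressed": (2, 2),
    "nervous": (2, 8),
}
placements = {name: quadrant(v, a) for name, (v, a) in examples.items()}
```

Each emotion keyword lands in exactly one quadrant, which is what makes the dimensional model usable as a quantitative labeling scheme.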

EDA Analysis
EDA is a general term for the autonomic changes in the electrical properties of the skin [35]. Electrodermal activity is shaped by many complex factors. According to neurophysiology and psychophysiology research, EDA reflects the sweat gland activity on the skin surface, which is innervated by the autonomic nervous system (mainly the sudomotor nerves) [35]. The underlying reason for the connection between EDA and sweat gland activity is that sweat on the skin surface changes the electrical conductivity of the skin. Furthermore, sweat gland activity is influenced by people's emotional states and changes. One simple example from daily experience is that people sweat when they are scared or nervous. These findings indicate that human emotional states can be inferred from EDA signals, which provides the theoretical foundation for EDA-based emotion recognition.
Generally, researchers use skin conductance (SC), a representative manifestation of EDA, as the parameter measured in experiments. SC activity is composed of two parts: tonic and phasic activity. The tonic signal is a slowly changing baseline that is caused by the drifting skin conductance level (SCL) and other unconscious activities. The phasic signal, also known as the skin conductance response (SCR), is a quick response caused by external stimuli (e.g., changes in emotional states). The EDA data that researchers collect from the experiment equipment are the origin (raw) SC signals, which need to be decomposed into tonic and phasic signals.
EDA is commonly processed with model-based methods, including discrete deconvolution analysis (DDA), continuous decomposition analysis (CDA), and convex optimization-based electrodermal activity (CvxEDA). DDA uses nonnegative deconvolution to decompose SC data into discrete compact responses [36], and CDA outperforms DDA by establishing a continuous measure that reflects the origin signal more closely [37]. CvxEDA is a more recent method proposed by Greco et al. in 2016 [31]. As Greco et al. showed that CvxEDA has a stronger correlation and discriminant ability than CDA [35], we choose CvxEDA to decompose the EDA signal in this work.

Materials and Methods
In this work, we use three typical deep neural networks, CNN (ResNet), RNN (LSTM), and CNN+RNN (ResNet-LSTM), to improve the recognition baseline on the MAHNOB-HCI dataset and to systematically compare their performance. Before being fed into the DNN models, the origin signal is decomposed into tonic and phasic signals using the CvxEDA method. The same signals are then fed into each of the three networks for training. Each DNN model outputs the prediction probabilities of the three classes through a nonlinear function and chooses the class with the maximum probability as the final result. By comparing the predictions with the corresponding ground truth labels, we obtain the performance of the models in terms of the evaluation metrics.
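The final prediction step can be sketched as follows. This is a minimal, framework-free illustration that assumes SoftMax is the nonlinear function producing the class probabilities; the names `softmax` and `predict_class` and the example scores are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw class scores into probabilities that sum to one."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_class(scores):
    """Return the index of the most probable class and the probabilities."""
    probs = softmax(scores)
    return probs.index(max(probs)), probs


# Hypothetical network outputs for one sample and three classes.
label, probs = predict_class([1.0, 2.0, 0.5])
```

The predicted label is the index of the largest probability, so it can be compared directly with an integer ground truth label.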

CvxEDA
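The convex optimization-based electrodermal activity (CvxEDA) model [31] is used to decompose the origin EDA signal into tonic and phasic signals. The model of CvxEDA can be written as

$$y = r + t + \epsilon \qquad (1)$$

where $y$ represents the origin EDA signal, $r$ represents the phasic signal, $t$ represents the tonic signal, and $\epsilon$ represents the noise produced by measurement and modeling errors. The tonic signal $t$ is composed of B-spline functions as well as an offset and linear trend term:

$$t = B\ell + Cd \qquad (2)$$

where the B-spline functions make up the columns of the matrix $B$, $\ell$ represents the spline coefficients, $C$ represents an $N \times 2$ matrix ($C_{i,1} = 1$, $C_{i,2} = i/N$), and $d$ is a vector that represents the offset and linear trend. The phasic signal can be modeled by the Bateman function

$$h(\tau) = \left(e^{-\tau/\tau_0} - e^{-\tau/\tau_1}\right) u(\tau) \qquad (3)$$

where $\tau_0$ and $\tau_1$ represent the slow and fast time constants, respectively, and $u(\tau)$ represents the step function. We transform Equation (3) through the Laplace transform and replace $s$ as

$$s = \frac{2}{\delta}\,\frac{z - 1}{z + 1} \qquad (4)$$

where $\delta$ is the sampling interval. Following the ARMA model, we can finally obtain the discrete-time approximation of the phasic signal as

$$r = M A^{-1} p \qquad (5)$$

where $M$ and $A$ are tridiagonal matrices, $\psi$, $\theta$, and $\xi$ (the nonzero elements of $A$) are constants calculated from $\tau_0$, $\tau_1$, and $\delta$, and $p$ is the activity of the autonomic nervous system. According to Equations (2) and (5), Equation (1) can be written as

$$y = M A^{-1} p + B\ell + Cd + \epsilon \qquad (6)$$

Using the maximum a posteriori estimation, the parameters $p$, $\ell$, and $d$ can be represented as

$$(\hat{p}, \hat{\ell}, \hat{d}) = \arg\max_{p,\,\ell,\,d} P(p, \ell, d \mid y) \qquad (7)$$

According to Bayes' theorem, we have

$$P(p, \ell, d \mid y) \propto P(y \mid p, \ell, d)\, P(p)\, P(\ell)\, P(d) \qquad (8)$$

where $P(p)$ can be written as a Poisson distribution [38], $P(\ell)$ as a normal distribution, and $P(y \mid p, \ell, d)$ as an error model, while $P(d)$ is discarded from further computations. Taking the logarithm, we therefore have the transformed equation

$$(\hat{p}, \hat{\ell}, \hat{d}) = \arg\max_{p,\,\ell,\,d} \left[ \ln P(y \mid p, \ell, d) + \ln P(p) + \ln P(\ell) \right] \qquad (9)$$

Finally, Equation (9) can be written as an optimization problem:

$$\min_{p,\,\ell,\,d}\ \frac{1}{2}\left\| M A^{-1} p + B\ell + Cd - y \right\|_2^2 + \alpha \left\| p \right\|_1 + \frac{\gamma}{2} \left\| \ell \right\|_2^2 \quad \text{subject to } p \ge 0 \qquad (10)$$

where $\alpha$ and $\gamma$ are the regularization weights defined in [31]. The optimization problem of Equation (10) is convex and can be solved by many existing methods. More details can be found in [31].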
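To make the generative side of this model concrete, the sketch below evaluates the Bateman impulse response and synthesizes a toy signal as the sum of a phasic and a tonic part. It is not the CvxEDA solver: it only builds a signal forward from a known driver and does not solve the optimization problem. The time constants (2.0 s and 0.7 s), the sampling interval, the driver, and the linear baseline are all assumed values chosen for illustration, and the function names are hypothetical.

```python
import math

TAU0, TAU1 = 2.0, 0.7  # slow and fast time constants in seconds (assumed)


def bateman(tau, tau0=TAU0, tau1=TAU1):
    """Bateman impulse response: zero before onset, bi-exponential after."""
    if tau < 0:
        return 0.0
    return math.exp(-tau / tau0) - math.exp(-tau / tau1)


def synthesize_sc(driver, baseline, delta):
    """Build y = r + t from a sparse driver p and a tonic baseline t."""
    n = len(driver)
    kernel = [bateman(k * delta) for k in range(n)]
    # Causal discrete convolution of the driver with the impulse response.
    phasic = [
        sum(driver[j] * kernel[i - j] for j in range(i + 1)) for i in range(n)
    ]
    observed = [r + t for r, t in zip(phasic, baseline)]
    return phasic, observed


delta = 0.25                 # sampling interval in seconds (assumed)
n = 80
driver = [0.0] * n
driver[8] = 1.0              # one sudomotor burst at 2 s
driver[40] = 0.5             # a weaker burst at 10 s
baseline = [2.0 + 0.01 * i * delta for i in range(n)]  # slow linear drift
phasic, observed = synthesize_sc(driver, baseline, delta)

# Continuous-time peak latency of the Bateman function after a burst.
peak_latency = math.log(TAU0 / TAU1) * TAU0 * TAU1 / (TAU0 - TAU1)
```

With these assumed constants, the response peaks roughly 1.1 s after each burst and then decays slowly, which is the shape the decomposition attributes to the phasic component.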

ResNet
ResNet [28] was proposed by He et al. in 2015. It is a deep convolutional neural network that is broadly used in image recognition and image feature extraction. ResNet mainly solves the degradation problem in convolutional neural networks with deeper layers (such as VGG) by introducing a residual learning block. As shown in Figure 3, the residual learning block has an "identity shortcut connection" that skips some layers and feeds the data straight into the next layer. The output of the block can be represented as

$$y = F(x, \{W_1, W_2\}) + W_3 x$$

where $x$ is the input, $F$ is the residual mapping computed by the stacked layers, $W_1$ and $W_2$ are the convolutional layers, and $W_3$ is the downsampling operation. Our model is adapted from the backbone of ResNet. ResNet and many other deep convolutional neural networks were proposed in the image domain, with two-dimensional convolution operations for image feature extraction. To apply the original 2D ResNet, the 1D input signal has to be rearranged into a 2D matrix, as in [11]. However, the 2D convolutional kernel then disturbs the sequential order of the signal during feature extraction. As physiological signals such as EDA and EEG are one-dimensional sequences, we replace the two-dimensional convolution with a one-dimensional convolution so that the input signal can be processed directly. We use the one-dimensional ResNet to extract features and feed the feature vector into a regression layer to obtain the prediction of three classes. The structure of our ResNet is shown in Table 1.

Table 1. The structure of our one-dimensional ResNet. Each convolution layer in this network is followed by a batch normalization operation.
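The idea of a one-dimensional residual block can be sketched without any deep learning framework. The version below is deliberately simplified and is not the network of Table 1: it assumes a single channel, omits batch normalization, uses a plain identity shortcut instead of a downsampling operation, and applies a ReLU after the addition. All function names and kernel values are hypothetical.

```python
def conv1d(x, kernel):
    """Zero-padded 1D convolution that keeps the sequence length (odd kernel)."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [
        sum(w * padded[i + k] for k, w in enumerate(kernel))
        for i in range(len(x))
    ]


def relu(x):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in x]


def residual_block_1d(x, w1, w2):
    """Two stacked 1D convolutions plus an identity shortcut."""
    out = relu(conv1d(x, w1))
    out = conv1d(out, w2)
    # The shortcut adds the unchanged input to the residual branch.
    return relu([residual + skip for residual, skip in zip(out, x)])


signal = [1.0, -2.0, 3.0, 0.5]      # made-up 1D input sequence
identity_kernel = [0.0, 1.0, 0.0]   # passes the sequence through unchanged
zero_kernel = [0.0, 0.0, 0.0]       # switches the residual branch off

shortcut_only = residual_block_1d(signal, identity_kernel, zero_kernel)
both_paths = residual_block_1d(signal, identity_kernel, identity_kernel)
```

When the residual branch outputs zeros, the block reduces to the (rectified) input, which is the property that lets deeper stacks fall back to an identity mapping. Because the kernel slides along the time axis only, the temporal order of the samples is preserved.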

LSTM Network
One of the special RNN structures is the LSTM. It has a more complex structure that solves the problem of gradient vanishing in a conventional RNN. It is capable of capturing long-distance dependencies and is especially suitable for long-sequence feature extraction. Figure 4 shows the structure of LSTM. It has a repeating module for recurrence, as do all other RNNs, but the module is much more complex. Each module contains a forget gate, an input gate, an output gate, and a cell state ($c_t$). The cell state is the core of LSTM. It runs like a belt through the repeating modules: it receives messages from the previous cell state, adjusts the messages, and then delivers them to the next cell state. The gates control the delivery process and filter the messages in the network by using a sigmoid function. The forget gate decides how much of the information that the cell state receives from the previous one should be maintained. The maintained information can be written as

$$f_t = \sigma\left(W_{if} x_t + b_{if} + W_{hf} h_{t-1} + b_{hf}\right)$$

where $\sigma$ represents the sigmoid function, $W_{if}$ and $W_{hf}$ represent the weight coefficients, $b_{if}$ and $b_{hf}$ represent the biases, $x_t$ represents the input, and $h_{t-1}$ is both the state of the hidden layer and the output of the previous module. The input gate screens the information from the input and the previous output, and generates new information together with a tanh function. This process can be described as

$$i_t = \sigma\left(W_{ii} x_t + b_{ii} + W_{hi} h_{t-1} + b_{hi}\right)$$

$$g_t = \tanh\left(W_{ig} x_t + b_{ig} + W_{hg} h_{t-1} + b_{hg}\right)$$

where $i_t$ represents the input gate, $g_t$ represents the result of the tanh function, and $W$ and $b$, respectively, represent the weight coefficients and biases. Information from the forget gate and the input gate constitutes a new cell state:

$$c_t = f_t \odot c_{t-1} + i_t \odot g_t$$

where $c_t$ represents the cell state and $\odot$ denotes element-wise multiplication. The output gate also selects information from the input and the previous output, and together with a tanh function decides the new output:

$$o_t = \sigma\left(W_{io} x_t + b_{io} + W_{ho} h_{t-1} + b_{ho}\right)$$

$$h_t = o_t \odot \tanh(c_t)$$

where $o_t$ represents the output gate, and $h_t$ is the output, which is sent to the next module as the hidden state. The structure of our LSTM model is shown in Figure 5.
The input EDA signal is fed into the LSTM modules point by point, each point containing three channels (phasic, tonic, and origin). Each LSTM module outputs a feature vector (the state of the hidden layer mentioned above) to the next. At the end of the final module, there is a regression layer that contains stacked linear functions, ReLU functions, and the SoftMax function to predict the class.
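One LSTM time step can be written out directly from the gate definitions. The sketch below is a plain-Python illustration, not the trained model: it assumes a hidden size of 2 and three input channels, merges the two bias terms of each gate into a single bias vector, and uses all-zero parameters so that the result can be checked by hand. The names `affine`, `lstm_step`, and `zero_gate` are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(gate, x, h_prev):
    """Compute W x + U h_prev + b for one gate, row by row."""
    W, U, b = gate
    return [
        sum(w * xi for w, xi in zip(w_row, x))
        + sum(u * hi for u, hi in zip(u_row, h_prev))
        + bias
        for w_row, u_row, bias in zip(W, U, b)
    ]


def lstm_step(x, h_prev, c_prev, params):
    """Advance the LSTM by one time step and return (h_t, c_t)."""
    f = [sigmoid(z) for z in affine(params["f"], x, h_prev)]    # forget gate
    i = [sigmoid(z) for z in affine(params["i"], x, h_prev)]    # input gate
    g = [math.tanh(z) for z in affine(params["g"], x, h_prev)]  # candidate
    o = [sigmoid(z) for z in affine(params["o"], x, h_prev)]    # output gate
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c


def zero_gate(hidden, channels):
    """All-zero weights and biases, used only to make the demo checkable."""
    return (
        [[0.0] * channels for _ in range(hidden)],
        [[0.0] * hidden for _ in range(hidden)],
        [0.0] * hidden,
    )


params = {name: zero_gate(hidden=2, channels=3) for name in "figo"}
# One time point with three channels (phasic, tonic, origin); values made up.
h_t, c_t = lstm_step([0.2, 1.5, 1.7], [0.0, 0.0], [2.0, -2.0], params)
```

With zero parameters every sigmoid gate equals 0.5 and the candidate is 0, so the cell state is simply halved. Feeding a sequence means calling `lstm_step` once per time point and passing the returned state forward.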

ResNet-LSTM
The ResNet-LSTM network combines the descriptive power of ResNet with the sequential feature-capturing ability of LSTM. The main structures of ResNet and LSTM in this hybrid network are essentially the same as the single ResNet and LSTM described above. As illustrated in Figure 6, in this ResNet-LSTM network, the EDA signal is first fed into ResNet as a three-channel input (phasic, tonic, and origin). The ResNet structure outputs a one-dimensional feature vector and feeds it into the LSTM structure. Similar to the single LSTM, there is a regression layer containing a linear function at the end of the final LSTM module for three-class classification.

Implementation Details
In this work, we use a public benchmark dataset, MAHNOB-HCI, of EDA signals. The emotional states of MAHNOB-HCI are labeled with nine emotion keywords in valence and arousal dimensions. Following the baseline of the MAHNOB-HCI dataset [4], we relabeled nine emotional states to three classes, conducted classification with our models, and evaluated the performance with an average accuracy and F1 score. The details are described as follows.

Dataset
MAHNOB-HCI [4] is a popular emotion dataset for affective computing. 32-channel EEG and multiple physiological signals, including EDA, ECG, RSP, and TMP signals, were recorded from 30 participants in response to external stimuli (video and imagery) [4]. MAHNOB-HCI includes two experiments: an emotion recognition (also called emotion elicitation) experiment and an implicit tagging experiment. The data we used in this work are from the emotion elicitation experiment, in which 27 participants were asked to watch 20 emotional videos. To be more specific, in each sample, a neutral clip was shown first to neutralize the participant's emotion, and the emotional video was then played, after which the participant filled in the form. EDA (GSR) and other physiological signals were recorded from 30 s before to 30 s after the emotional video. After watching each video, participants conducted a self-assessment with nine emotion keywords.
Referring to the previous baseline in [9,10], we utilized the valid samples downloaded from the dataset server in the "Selection of Emotion Elicitation" item. We followed the annotation strategy of [4,9,10] to relabel the nine annotations for three-class classification as shown in Table 2.
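In a subject-independent evaluation, no participant should contribute samples to both the training and the test set. The sketch below shows one way to build such folds, assuming that whole participants are assigned to folds in round-robin order and that every one of 27 participants has 20 samples; the function name `subject_folds` and the toy subject IDs are hypothetical, and the number of valid samples in the real dataset may differ.

```python
def subject_folds(subject_ids, n_folds=10):
    """Assign whole subjects to folds; return the test indices of each fold."""
    subjects = sorted(set(subject_ids))
    fold_of = {subject: k % n_folds for k, subject in enumerate(subjects)}
    folds = [[] for _ in range(n_folds)]
    for index, subject in enumerate(subject_ids):
        folds[fold_of[subject]].append(index)
    return folds


# Toy layout: 27 participants with 20 samples each (assumed).
subject_ids = [subject for subject in range(27) for _ in range(20)]
folds = subject_folds(subject_ids)

# For fold k, train on every index that is not in folds[k].
test_idx = folds[0]
train_idx = [i for i in range(len(subject_ids)) if i not in set(test_idx)]
```

Because the split is made per participant rather than per sample, the test accuracy of each fold measures how well the model generalizes to people it has never seen.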

Evaluation Metrics
Accuracy and F1 score are used as evaluation metrics in this work. In this subsection, we briefly introduce their definitions as follows.
Accuracy is calculated as

$$\text{Accuracy} = \frac{N_{correct}}{N_{total}}$$

where $N_{correct}$ is the number of samples classified correctly, and $N_{total}$ is the total number of samples. In this work, all reported accuracies are averages over 10-fold cross validation. The F1 score is based on precision and recall:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$

where $TP$, $FP$, and $FN$ represent true positives (predicted as positive and actually positive), false positives (predicted as positive but actually negative), and false negatives (predicted as negative but actually positive). As this is a three-class problem, there is an F1 score for each class and an average F1 score for the overall classifier. The F1 score of class $i$ is

$$F1_i = \frac{2 \cdot \text{Precision}_i \cdot \text{Recall}_i}{\text{Precision}_i + \text{Recall}_i}$$

and the overall average F1 score can be presented as

$$F1 = \frac{F1_1 + F1_2 + F1_3}{3}$$

where $F1_1$, $F1_2$, and $F1_3$, respectively, represent the F1 score of each class.
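These metrics can be checked on a small example. The sketch below computes accuracy, the per-class F1 score, and the unweighted average F1 score for three classes; the labels are made up for illustration, the function names are hypothetical, and a score of 0 is assumed whenever a denominator would be zero.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_class(y_true, y_pred, label):
    """One-vs-rest F1 score for a single class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def average_f1(y_true, y_pred, labels=(0, 1, 2)):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(y_true, y_pred, c) for c in labels) / len(labels)


# Made-up labels for six samples and three classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
acc = accuracy(y_true, y_pred)
per_class = [f1_for_class(y_true, y_pred, c) for c in (0, 1, 2)]
avg = average_f1(y_true, y_pred)
```

In this example, four of six samples are correct, and the per-class F1 scores differ because the errors are not spread evenly over the classes, which is why the average F1 score is reported alongside accuracy.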

Training Setup
As described in the previous sections, we built three models based on ResNet, LSTM, and hybrid ResNet-LSTM network, respectively. To figure out the best configuration of each model, three ablation experiments about the structure settings were conducted. After establishing the best architecture of each DNN model, we conducted an analysis comparing the three DNN models to discover which model performs best on EDA-based emotion recognition. Finally, we compared the results of the DNN models in this work with the existing studies based on the MAHNOB-HCI dataset.
In our experiments, emotions are classified based on the three-class strategy shown in Table 2, and results are evaluated based on the metrics in Section 4.2. It should be emphasized that the accuracies reported in this work are all averaged over 10-fold cross validation. For consistency of the comparison, each model was trained for 25 epochs with the Stochastic Gradient Descent (SGD) optimizer. The initial learning rate was set to 0.001 and was multiplied by 0.1 every five epochs. The detailed experimental results and analyses are described below.
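The learning rate schedule can be written as a one-line step decay. The sketch assumes zero-indexed epochs, so that the first reduction takes effect at the start of epoch 5; the function name `learning_rate` is hypothetical.

```python
def learning_rate(epoch, base_lr=0.001, gamma=0.1, step=5):
    """Step decay: multiply the rate by `gamma` once every `step` epochs."""
    return base_lr * gamma ** (epoch // step)


# Learning rate used in each of the 25 training epochs.
schedule = [learning_rate(epoch) for epoch in range(25)]
```

Under this assumption the rate stays at 0.001 for epochs 0 to 4, drops to 0.0001 for epochs 5 to 9, and has been reduced four times by the final five epochs.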

Configuration of ResNet
For ResNet, the configuration concerns the number of stacked residual learning blocks (Res-blocks) in one convolutional group, as seen in Table 1. More residual blocks mean a deeper network. The depth of the DNN model is one of the key factors that influence the performance of the network. Previous research has shown that a deeper network generally performs better in image classification [28]. However, the relationship between the depth of ResNet and its performance has not been studied in the field of EDA-based emotion recognition on the MAHNOB-HCI dataset. Moreover, a deeper model comes with a higher computational expense and a slower processing speed. We seek to balance recognition performance with computing expense. Therefore, we conducted an ablation experiment that considers three different values (1, 2, and 3) for the number of stacked Res-blocks and compares the original 2D ResNet with our adapted 1D ResNet. Table 3 shows the configuration sets of ResNet in this experiment, and Table 4 shows the experiment results. From the results, we can see that the adapted 1D ResNet is superior to the original 2D ResNet. Moreover, the accuracies do not improve significantly as the number of stacked Res-blocks increases. This suggests that the representation ability of the simplest ResNet is good enough to distinguish the EDA signals of different emotional states in the MAHNOB-HCI dataset. Considering that a deeper network incurs a greater cost in GPU memory and time, we choose the architecture with one Res-block in one convolutional group (Configuration ID 1) as the best architecture for ResNet.

Table 3. ResNet configuration sets.

Configuration of LSTM
The hidden cell dimension and the layer dimension are two adjustable configurations of LSTM. For LSTM in physiological signal analysis [39], common settings of the hidden cell dimension are 128 and 256, and the layer dimension can be deepened from 1 to 2 or 3. To determine the best LSTM structure, we designed the ablation experiment from combinations of the hidden cell and layer dimensions. Table 5 shows the configuration sets of LSTM in this experiment, and Table 6 shows the experiment results. Comparing the performances of the different configurations, we can see that LSTM with one layer (Configuration ID 1 and 4) performs poorly, which suggests that one layer is too shallow for mining useful features from complex EDA signals; LSTM with three layers (Configuration ID 3 and 6) is not as good as LSTM with two layers (Configuration ID 2 and 5), which suggests that a deeper LSTM leads to instability and increases training difficulty; LSTM with two layers (Configuration ID 2 and 5) balances representation ability and training difficulty and obtains the best performance. Configuration ID 5 achieves the highest accuracies in terms of both valence and arousal, but its results are very similar to those of Configuration ID 2. For the same cost considerations as in Section 5.2, we choose the simpler LSTM (Configuration ID 2), configured with 128 hidden cells and two layers, as the best architecture for LSTM.

Configuration of Hybrid ResNet-LSTM
The previous two ablation experiments indicate that ResNet performs well with a low number of stacked Res-blocks, and that LSTM performs better with a hidden cell dimension of 128 or 256 and two layers. To avoid unnecessary repetition, we combined the better configurations of the single ResNet and LSTM to obtain four sets of configurations for the hybrid ResNet-LSTM (see configuration sets in Table 7 and results in Table 8). In line with the results in Section 5.2 and Section 5.3, the testing accuracies and F1 scores of the four configurations are very close. Therefore, the simplest structure (Configuration ID 1) is chosen as the best architecture for the hybrid ResNet-LSTM. The best architecture consists of the ResNet with one Res-block and the LSTM with two layers and 128 hidden cells.

Comparison of ResNet, LSTM, and Hybrid ResNet-LSTM
After establishing the best architectures of the three deep neural networks, we can analyze their performances. We directly use the respective experiment results of the three best architectures from Sections 5.2-5.4 to produce Table 9. ResNet achieves the best performance among the three DNN models, which indicates that the CNN framework has a greater ability to mine dynamic and static features from decomposed EDA data than the RNN and the hybrid CNN+RNN. Considering that our task is to predict a global emotional state from a sequential EDA input, a likely reason why the LSTM network achieves the poorest results is that it concentrates on sequential processing over time and ignores some global information. Moreover, as described in Section 5.3, the training difficulty of LSTM is also a strong factor. While the hybrid ResNet-LSTM combines the advantages of a CNN and an RNN, the complicated hybrid architecture further increases the training difficulty, which limits the model performance.

Comparison with Baselines in MAHNOB-HCI Dataset
To validate the effectiveness of the DNN-based methods, we compare the three DNN models in this work with existing approaches for three-class emotion recognition using physiological signals of the MAHNOB-HCI dataset. Considering that very few methods use only EDA signals from the MAHNOB-HCI database, we also include approaches that use other signals. Ferdinando et al. [9] use a KNN classifier with handcrafted features to establish baseline accuracies for the MAHNOB-HCI database. They further [10] improve emotion recognition using fused physiological features. Moreover, Liu et al. [40] apply the LSTM to a combination of EEG signals and external video features. We compare the three DNN models (described in Section 5.5) with these methods in Table 10. For the three-class prediction task in MAHNOB-HCI, the three DNN models in this work outperform the existing methods in almost all cases in terms of the valence and arousal dimensions. Moreover, the ResNet model in this work improves the recognition accuracy significantly and performs much better than the previous methods, which validates the superiority of the deep convolutional neural network in this task.

Table 10. Comparison with the state-of-the-art three-class classification tasks in the MAHNOB-HCI dataset. Accuracies (%) and F1 score (%) are subject to 10-fold cross validation. The best results are highlighted in bold.

Conclusions
In this work, we investigate three typical deep neural networks (ResNet, LSTM, and hybrid ResNet-LSTM) on an EDA-based three-class emotion recognition task and systematically conduct ablation experiments to compare the different structures and their advantages. The comparison between the three DNN models and the existing methods shows that the CNN-based ResNet in this work has the best performance: it has the best average accuracy (86.73%) and F1 score (85.71%) for valence, and the best average accuracy (86.92%) and F1 score (85.95%) for arousal in the MAHNOB-HCI dataset, which validates that the CNN framework is superior in EDA-based emotion recognition. The strong performance of ResNet can be attributed to the following factors: (1) the great representation ability of the end-to-end deep learning network, which can directly extract useful features from EDA signals; (2) the CvxEDA decomposition, which augments the one-channel EDA data with phasic and tonic components that can, respectively, reveal dynamic and static emotion changes. In future research, we will explore more novel CNN architectures to enhance the accuracy and generalization of models for EDA-based emotion recognition.