A Multi-Column CNN Model for Emotion Recognition from EEG Signals

We present a multi-column CNN-based model for emotion recognition from EEG signals. Recently, a deep neural network is widely employed for extracting features and recognizing emotions from various biosignals including EEG signals. A decision from a single CNN-based emotion recognizing module shows improved accuracy than the conventional handcrafted feature-based modules. To further improve the accuracy of the CNN-based modules, we devise a multi-column structured model, whose decision is produced by a weighted sum of the decisions from individual recognizing modules. We apply the model to EEG signals from DEAP dataset for comparison and demonstrate the improved accuracy of our model.


Introduction
Emotion is a kind of conscious or unconscious feeling for a phenomenon or a work. Emotion is expressed through various biological and physical reactions including voice, text, gestures, facial expressions, and biosignals. People cannot conceal their emotions even though they don't want to reveal them. A reaction to emotion is a very important key for a communication between people.
Emotional reaction plays a very important role to many human-computer interaction (HCI) applications. Recently, many venders present various systems that react to their users without any explicit requests or commands. The systems catch the implicit request from their users by recognizing their emotion. For example, a recently launched TV set automatically controls its contrast and brightness according to the content playing on the TV. When they play a foot game as their content, it increases contrast and brightness, since the user's emotion is in exciting mode.
To properly recognize emotion from users, video signal catching the facial expressions, audio signals catching the voice and various biosignals including electroencephalography (EEG), electrocardiogram (ECG), electromyogram (EMG), and photoplethysmogram (PPG), respiration pattern (RSP), and galvanic skin response (GSR) are employed. Among them, the biosignals gather high interest for emotion recognition, since they are spontaneous signals, which people cannot control these reactions at their own will. Some people who can keep calm their faces and voices at the change of their emotions cannot conceal their biosignals.
Emotion recognition using these physiological signals is a very interesting research field. Among them, EEG is very frequently-used signals for emotion recognition. Even though EEG presents a relatively precise measure and an easy interface, it suffers from the non-stationary property of the signal. Therefore, the extraction of temporal correlations of spontaneous EEG signals is a key issue for the emotion recognition from EEG signals.
Many researchers presented various machine learning schemes such as support vector machine (SVM) and k nearest neighborhood (kNN) for emotion recognition from biosignals. Recently, deep learning-based schemes replace the traditional machine learning schemes. CNN and RNN-based models are proposed for emotion recognition from EEG signals. However, they still have problems in resolving the non-stationary property of the EEG signals.
We present a multi-column CNN-based emotion recognition model from EEG signals (see Figure 1). Our model is composed of several recognition modules that decide the valence and arousal from input EEG signals. The final decision by our model is made by collecting the decisions of individual recognition modules through a weighted voting strategy. Since the individual decision is affected by the non-stationary property of EEG signals, our strategy that makes final decision through a weighted voting can resolve the non-stationary property in a great scale. The experiment on the DEAP dataset demonstrates the excellence of our model.

Related Work
We categorize the emotion recognition works according to the sources. Petrantonakis and Hadjileontiadis [2] employed EEG-based feature extraction scheme for emotion recognition. Their technique is designed using higher order crossings (HOC) analysis which is a robust feature extraction and classification method, which is tested with four classical methods including QDA, kNN, Mahalanobis distance and SVM. As a result, they show that the HOC scheme outperforms their competitors.
Korats et al. [3] conducted a similar experiment by comparing four major ICA algorithms including FastICA, AMICA, Extended Infomax and JADER. In this experiment, AMICA shows an impressive performance.
Duan et al. [4] presented an effective EEG-based emotion classifier based on a feature named differential entropy. They further proposed linear dynamical system (LDS) for feature smoothing method and minimal redundancy maximal relevance algorithm for feature selection algorithm to increase the accuracies and efficiencies of EEG-based emotion classifiers.
Jenke et al. [5] compared 33 studies that extract features for emotion recognition from EEG signals. Their result reveals that features selected by multivariate schemes slightly outperform those from univariate methods.
Zheng [6] proposed a multi channel EEG-based emotion recognition scheme using a novel group sparse canonical correlation analysis (GSCCA), which is a group sparse extension of the conventional CCG method. The GSCCA shows good performance in handing the group feature selection problem from raw EEG features. And, they prove that their GSCCA method outperform the state-of-the-art EEG-based emotion recognition methods.
Mert and Akan [7] employed empirical mode decomposition (EMD) and its multivariate extension (MEMD) for emotion recognition from EEG signals. To resolve the non-stationary behavior and the multichannel issue of EEG signals, they applied MEMD-based feature extraction method to process non-stationary multichannel EEG. They analyzed intrinsic mode functions extracted by MEMD using various time and frequency domain.

CNN-Based Works
Jirayucharoensak et al. [8] presented a deep learning approach for recognizing emotion from nonstationary EEG signals. They employed a stacked autoencoder (SAE) for feature learning. To resolve overfitting problem caused by the non-stationary EEG, they applied PCA to extract the most important component and covariate shift adaptation to minimize the nonstationary effect.
Khosrowabadi et al. [9] presented a six-layer biologically inspired feedforward neural network to discriminate human emotion from EEG signals. The network has a spectral filtering for the input layer, which is followed by a shift register memory. The emotion is discriminated from EEG signals based on valence and arousal levels.
Alhagry et al. [10] developed an LSTM RNN-based emotion recognition technique from EEG signals. The emotions they aim to recognize are in three axes: arousal, valence and liking. They demonstrated accuracy of greater than 85% for the three axes.
Tripathi et al. [11] presented a CNN-based emotion recognition method from EEG signals in the DEAP dataset. They explored two different neural models: a simple deep neural network and a convolutional neural network. The latter demonstrated an improvement of 4.96% than the state-of-the-art techniques.
Salama et al. [12] presented a 3D CNN approach for recognizing emotions from multichannel EEG signals. They developed a data augmentation phase to improve the performance of their 3D CNN model. They employed DEAP dataset to achieve 87.44% accuracy for valence and 88.49% for arousal.
Yang et al. [13] proposed a hybrid neural network combining CNN and RNN to classify human emotion by learning spatial-temporal representation of raw EEG signals. The CNN module mines the inter-channel correation among physically adjacent EEG signals and RNN module mines contextual information of the signals. They tested their model on DEAP dataset and demonstrated that their model achieves 90.80% for valence and 91.03% for arousal.
Moon et al. [14] applied CNN for EEG-based emotion recognition. They employed brain connectivity features, which has not been used in previous studies, to account for synchronous activation of different brain regions. Therefore, their method effectively captures asymmetric brain activity patterns, which plays an important role in emotion recognition.
Li et al. [15] proposed an emotion recognition approach that considers temporal, spatial and frequency characteristics of EEG signals. Their approach extracts RASM as the feature to describe the frequency-space domain characteristic of EEG signals and constructs an LSTM network to explore the temporal correlations of EEG signals. They applied DEAP dataset to their model and achieved 76.67% mean accuracy.
Xing et al. [16] presented a framework that consists of a linear EEG mixing model, which is designed using Stack AutoEncoder, and an emotion timing model, which employes LSTM RNN. Their framework decomposes EEG source signals from collected EEG signals and improves classification accuracy by using the context correlations of the EEG feature sequences. Their model achieved 81.10% accuracy for valence and 74.38% for arousal from DEAP dataset.

Facial Expression
Scovanner et al. [17] introduced 3D SIFT descriptor for video, which represents the 3D nature of video data for action recognition. Their scheme discovers relationships between spatio-temporal words in order to describe video data better.
Klaser et al. [18] presented a novel local descriptor based on histograms of oriented 3D spatiotemporal gradients to extract emotion from facial features embedded in video sequences. They applied their descriptor to various action datasets and showed to outperform the state-of-the-art methods.
Liu et al. [19] resolved two important issues including temporal alignment and semantics-aware dynamic representation using manifold modeling of video based on a novel mid-level representation, which is denoted as expressionlet. They modeled expression video clip as a spatio-temporal manifold (STM) and learned a universal manifold model (UMM) for all low-level features. Finally, the local modes on each STM is instantiated by fitting the UMM. Through this approach, the aligned expression video spatially and temporally. They tested their model on four public expression datasets including CK+, MMI, Oulu-CASIA, and AFEW to demonstate that their model outperforms the state-of-the-art methods.
Soleymani et al. [20] presented an emotion recognizing technique based on LSTM RNN and continuous CRF. They annotated the valence from the facial expressions of participants while watching video and measured EEG signals. Both facial expression and EEG are processed on LSTM RNN to extract emotion information.
Zhang et al. [21] combined both EEG and video face signals to recognize human emotions. They proposed a spatial-temporal recurrent neural network (STRNN) that unifies the learning of two different signal sources into a spatial-temporal dependency model.

Model Construction
We build a multi-column structured model for emotion recognition from EEG signals. Ciresan et al. presented a pioneering multi-column deep neural network model for image classification [22]. This model classifies simple image dataset such as MNIST through a series of identical modules and determines categories by averaging the decisions of the individual modules. Our model is composed of several individually processing recognizing modules. Each recognizing module is designed based on a convolutional neural network (CNN) structure, composed of convolution layers and pooling layers. The structure of a recognizing module is illustrated in Figure 2. The basic building block of the module is a conv3-16, which is composed of conv3 layer followed by bias-add layer. Each module utilizes data from temporal snapshots of DEAP data. The cost function is defined to reflect both class of the emotion and intensity of the emotion. Details of the cost function is : where sim cos is cosine similarity between projected arousal-valence and actual arousal-valence, both vectorized. The activation function is leacky ReLU. The input, which is reformulated in a rectangular shape, is processed by two conv3-16 layers, followed by a maxpooling layer. After the maxpooling layer, two conv3-32 layers are added. After these layers, four fully connected layers are placed. The final result of these layers are two-fold, each of which corresponds to valence and arousal. The structure of our multi-column model is illustrated in Figure 3. Our model is composed of k recognitizing modules illustrated in Figure 2. We vary k from 3 to 7 to verify the most effective value. The individual decisions (v i ) of the modules are merged to a final decision (v f inal ) via either voting or a weight sum strategy according to the following formula: where v f inal is the final decision of our model, v i is the decision from i-th module and w i is the predicted probability for v i . w i , the weight term for i-th decision, comes from the predicted probability of the module. v i is a binary value having either +1 or -1, whereas +1 for high emotion state and -1 for low emotion state. To extract v i , we quantize w i in (0.0 ∼ 1.0) into nine-point metric in (1 ∼ 9). Then, the resulting value following the metric is converted into +1 if it is equal or greater than 5, and -1 if it is less than 5. The voting strategy follows majority-win rule, which concludes the final decision according to the v i value of majority. Therefore, the final decision v f inal of the voting strategy is made by the following formula: In Section 4, we test the accuracies of both strategies and select a better one.   The number of parameters in our model is estimated in Table 1.

Implementation Detail
We have implemented our model in a personal computer with Pentium i7 CPU, 16 GB main memory, and nVidia GeForce TitanX GPU. The model is implemented using Python with Pytorch library.

Data Collection
We employ EEG signal data from DEAP dataset collecting different kinds of physiological signals for human affective state analysis [23]. DEAP dataset contains EEG signals from 32 subjects who watched 40 one-minute music videos. During watching the video, the signals from the subjects are recorded. After watching the video, the subjects rate their emotion in terms of valence, arousal, dominance and liking in nine point metric (1 ∼ 9). We used these ratings as the ground truth for the model. Because we classify emotions with a circumplex model composed with valence and arousal, we excludeed dominance and liking for the research.
The EEG signals from DEAP dataset are downsampled to 128 Hz and applied by a 4.0-4.5 Hz band-pass filter. Therefore, each trial has 128 × 60 samples for 40 channels, but we exclude 8 channels with supplementary video data due to normalization issue. 32 channels with EEG signal forms one row for our two-dimensional input. 32 consecutive samples form single input data for the model. Figure 4 shows construction of a single input data. The data from single trial is segmented into (3 + T*k) trial where k denotes the number of columns in our model. The first 3 trials are removed for stability and remaining T trials are for each column.

Model Training
We collect the training data from DEAP dataset in the following strategy. For 32 participants, we collect training dataset from EEG signals of 22 participants, validation dataset from 5 participants, and test dataset from 5 participants. Each participants participated into 40 experiments. We composite a training datum from a participant by sampling 32 consecutive values from equally spaced 7680/(32 * k) different positions of an EEG signal, where k is varying number of the columns consisting our model. Therefore, We collect 33∼80 training data from one EEG signal of a participant for a video, leading into 29,040∼70,400 training dataset for our model. Similarly, we collect each 6600∼16,000 validation dataset and test dataset.
The learning rate for our model is 0.0001, which is reduced by 10 times according to the decrease of the error on validation set. We assign 0.5 for weight decay and 100 for batch size. Our training takes approximately 1 h and 30 min.

Result
We compare the accuracies of our model by comparing the strategy for final decision: voting and weighted sum. We segment the high-state and low-state of valence and arousal by assigning a threshold of 0.5. A valence or arousal whose value is greater than 0.5 is regarded as a high valence or arousal and A valence or arousal less than 0.5 is regarded as a low valence or arousal. We apply a majority-win strategy for voting.
The accuracies for valence and arousal from our model are listed in Table 2, and precision, recall, and F1 score are listed in Table 3. We measure the values by varying the number of columns from 1 to 7.

Comparison
We compare our model with some existing models that recognize emotion from EEG signals. The configuration of our model is 5-column with weighted sum strategy.
We compare 11 studies that present emotion recognition techniques from EEG signals in Table 4. Since they measured their performance using DEAP dataset [23], the comparison of the performances is very meaningful. Among them, 7 studies employed handcrafted features and classical recognition framework such as Bayesian network, SVM and HMM. The recent 4 studies employ deep neural network architecture such as CNN and LSTM RNN for recognition framework. We illustrate the result in Figure 5.  The summary of the comparison is suggested in Table 5. In average, the handcrafted feature based works record 71.67% for valence and 69.37% for arousal, while the deep neural network-based works record 81.4% for valence and 80.50% for arousal. The deep neural network-based works record more than 10% higher accuracies. For arousal, the lowest accuracy of deep neural network-based approach, which is 75.12% is higher than the highest accuracy of handcrafted feature-approach, which is 73.06%.

Multi-Class Emotion Recognition
Since the result of our model is valence and arousal, the coordinate of valence and arousal in widely-used Russell's model corresponds to various emotions including excited, happy, pleased, peaceful, calm, gloomy, sad, fear and suspense. We illustrated this process in Figure 6. Russel's emotion model Figure 6. The process of multi-class emotion recognition. The result of our model, which is (valence, arousal), is mapped into the valence-arousal coordinate space of Russell's model.

Limitation
The most important limitation of our model is a heavy computational load that comes from our multi-column structure. The computational cost for our model increased by the number of columns. Since we employ 5-column models for comparison, the computation time becomes longer than the existing models.

Conclusions and Future Work
In this paper, we have presented a multi-column structured model for emotion recognition. Each component of our model is an emotion recognizing module based on CNN with various conv, pooling, and fully connected layers. As an input for our model, EEG signal data from DEAP dataset are employed. Our model records higher accuracy than the existing models.
We are going to extend our model to process various biosignals including PPG and GRP. Other features such as facial expressions may be considered. Another direction is to apply our model to compare the emotional responses to real images against those to artworks.