Human Emotion Recognition with Electroencephalographic Multidimensional Features by Hybrid Deep Neural Networks



Introduction
Emotion is an important symbol of human intelligence; accordingly, an important sign of machine intelligence is the ability to understand human emotions. As early as the 1980s, Minsky, one of the founders of artificial intelligence, proposed that a machine without emotions is not intelligent. Recently, research on human emotion recognition has been applied in many fields, such as entertainment [1], safe driving [2,3], health care [4], and social security [5]. Picard et al. [6] argued that human emotional changes are embodied in speech [7], facial expressions [8], body posture [9], the central nervous system, and autonomic physiological activities [10]. In addition, researchers have adopted multi-scale entropy as an EEG emotional feature [36], and their results demonstrated that combined time-frequency features obtain better results than traditional single-domain features.
EEG signals are obtained by measuring the electrical voltage at multiple electrodes affixed to different positions on the scalp. From this acquisition method, we can see that the information is highly correlated across the spatial, time, and frequency dimensions. However, few previous studies have paid attention to the spatial domain. Spatial-information studies have been limited to the asymmetry between electrode pairs; these methods mostly calculate the differences in band power between corresponding electrode pairs on the left and right hemispheres of the scalp [21,37]. Recently, Bashivan et al. transformed EEG activity into a sequence of topology-preserving multi-spectral images to study human cognitive function [23], but few studies have analyzed human emotions with the spatial information of EEG signals.
Our method integrates EEG multidimensional features based on the spatial distribution of the EEG electrodes (according to the 10-20 system [38]) and maps the frequency-domain features onto a two-dimensional image. With this method, we obtain a sequence of images from consecutive time windows of the EEG signal. The details of the construction method are presented in Section 3.

Emotion Classification Methods
To provide a comparison with our method, Table 2 lists studies that classified human emotions on the scales of Valence and Arousal, along with their classification accuracy and number of subjects. As seen in Table 2, the most commonly used emotion classification methods include k-Nearest Neighbor (k-NN, used in [15,39]), Support Vector Machine (SVM, used in [14,40,41,42]), Random Decision Forest (RDF), Bayes Neural Networks (used in [43]), and Neural Networks (used in [44,45]). These methods are all used as baselines for comparison with our method, with details given in Section 5.2.
Table 2. Survey of studies on emotion classification methods with EEG signals 1.

Author and Study | Emotion Classification Basis | Subjects | Accuracy | Classification Method
Horlings [14] | Valence and Arousal (2 classes) | 10 | 81% | SVM
Schaaff [41] | Valence and Arousal (3 classes) | 30 | 66.7% | SVM
Frantzidis [40] | Valence and Arousal (2 classes each) | 28 | 81.3% | SVM
Murugappan [15] | Valence (…) | … | … | k-NN

It is noteworthy that most of the methods listed in Table 2 classify emotions statically; the exception is [44], where an LSTM RNN was adopted to learn from the EEG features incrementally and dynamically. Another point worth noting is that, among these methods, only the CNN is suitable for automatically extracting features from images. These two points are the reasons for selecting the CNN and the LSTM RNN as parts of our classification method. The second column of Table 2 shows the classification basis and the number of classes in the previous studies. As we can see, previous studies have mostly divided emotions into two or three categories; in this study, we divide the emotional state into four classes. All the studies in Table 2 classify emotion by Valence and Arousal. The third column of Table 2 shows the number of subjects included in the evaluated dataset.

Materials and Methods
The data preparation phase mainly included two aspects: the construction of EEG MFI sequences and the building of the emotion classification labels.

The Construction of EEG MFI Sequences
The International 10-20 System is an internationally recognized method of describing and applying the location of scalp electrodes in the context of an EEG test. The system is based on the relationship between the location of an electrode and the underlying area of the cerebral cortex. The "10" and "20" refer to the fact that the actual distances between adjacent electrodes are either 10% or 20% of the total front-back or right-left distance of the skull [46].
Figure 1 shows a plan view of the International 10-20 System and a generalized square matrix derived from it. The left of Figure 1 is the International 10-20 System, where the EEG electrodes circled in red are the test points used in the DEAP dataset. In this study, we generalized the International 10-20 System with the test electrodes used in the DEAP dataset to form a square matrix (N × N), where N is the maximum number of test points along the horizontal or vertical direction. With the DEAP dataset, N equals 9. The square matrix before filling in the EEG frequency features is shown at the right of Figure 1. The gray triangle above the center of the square matrix represents the nasion, while the red points are the electrodes corresponding to the red circles in the International 10-20 System. The gray points are added to form a full matrix. The value of a red point corresponds to the frequency feature (PSD) of its EEG electrode; the value of a gray point is interpolated from the red points surrounding it.
Figure 1 thus presents the method of mapping the International 10-20 System to a generalized EEG feature matrix. With this method, a single-frame EEG MFI can be built from the EEG signal within a time window; as the time window moves forward, an EEG MFI sequence is constructed from the EEG signal. The process is presented in Figure 2. The red points and gray points are defined as in Figure 1. The colors in the EEG images represent the value of the EEG feature: the higher the feature, the closer it is to dark red; the lower the feature, the closer it is to dark blue. The EEG feature value ranges from 0 to 1.
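To make the mapping concrete, the sketch below places a few of DEAP's channels onto the 9 × 9 matrix. The (row, column) coordinates are illustrative stand-ins rather than the paper's exact grid, and Python is used only for illustration (the authors worked in MATLAB and DL4J).

```python
# A minimal sketch of placing DEAP electrodes on the 9x9 feature matrix.
# The (row, col) coordinates below are hypothetical placements for a few
# channels, not the paper's exact layout; row 0 is the front of the head.
import numpy as np

GRID_SIZE = 9

ELECTRODE_POS = {
    "Fp1": (0, 3), "Fp2": (0, 5),
    "F7":  (2, 0), "F3": (2, 2), "Fz": (2, 4), "F4": (2, 6), "F8": (2, 8),
    "T7":  (4, 0), "C3": (4, 2), "Cz": (4, 4), "C4": (4, 6), "T8": (4, 8),
    # ... the remaining DEAP channels would be placed analogously
}

def frame_from_features(feature_by_channel):
    """Fill the 9x9 matrix with normalized PSD values at the red (electrode)
    points. Gray points stay 0 here; they are interpolated later (Equation (2))."""
    m = np.zeros((GRID_SIZE, GRID_SIZE))
    for ch, value in feature_by_channel.items():
        r, c = ELECTRODE_POS[ch]
        m[r, c] = value
    return m
```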

The EEG MFI sequence construction process consists of three steps. First, the raw EEG signals are extracted from DEAP, which includes the multi-channel EEG signals of 32 subjects. Each subject has 40 trials, where each trial includes the EEG signals of 32 channels, each signal lasting 60 s. In the leftmost image of Figure 2, we schematically show the raw EEG signal of the first 10 channels. After that, the power spectral density (PSD, [14,31,33,39,40]) is extracted as the EEG frequency-domain feature from the raw signals. The PSD is estimated with Welch's method in MATLAB (R2016a), using a Hamming window and different time window sizes (1, 2, 3, 4, 5, 6, 8, 10, 12, 15, 20, 30, and 60 s) with no overlap. A total of (32 channels × 60 s/Tl) features is obtained per trial, where Tl is the size of the time window. Using a one-second time window as an example, 1920 (32 × 60) features are obtained from a raw EEG signal. The features of each subject are then normalized to reduce inter-participant variability by scaling between 0 and 1, as shown in Equation (1):

$\widetilde{F}_i = \frac{F_i - F_{min}}{F_{max} - F_{min}}$ (1)

where $\widetilde{F}_i$ is the normalized value of the feature; $F_{max}$ and $F_{min}$ are the maximum and minimum values of the within-subject features; and $F_i$ is the ith value in the feature sequence. The red points in the feature matrix are directly filled with the normalized feature values. The values of the gray points are calculated from the values of the surrounding points, as expressed in Equation (2):

$V_{(m,n)} = \frac{1}{K}\sum_{i,j \in \{-1,0,1\}} V'_{(m+i,\,n+j)}$ (2)

where $V_{(m,n)}$ is the value of the gray point at position $P_{(m,n)}$, and $V'_{(m+i,\,n+j)}$ is the value of a point surrounding $P_{(m,n)}$. If the index of a surrounding point exceeds the range 0 to 8, its value is taken as 0. K is the number of non-zero elements in the numerator. The resulting feature matrix has size (32 × 60 × 40 × 32). With different time windows, it is possible to produce different numbers of EEG MFIs: assuming the number of EEG MFIs is N and the length of the time window is t, N equals the sequence length divided by t. The pseudo-code for producing the feature matrix is given in Appendix B.

Figure 4 displays the first five MFIs of Subject 1 with different time windows. Each row represents the MFIs with the same time window, and each column represents the MFIs with the same sequence order. Taking the first and the second rows in Figure 4 as an example, the first row represents the EEG variation over five seconds with five frames, whereas the second row represents the same time span with two frames. MFI(1,5) and MFI(2,3) are very similar. Accordingly, we can infer that an MFI sequence with a short time window provides more detail about the variation than an MFI sequence with a long time window. The meaning of the colors in Figure 4 is the same as in Figure 3.
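A minimal Python sketch of these three steps follows, under stated assumptions: the authors used MATLAB's Welch estimator, so SciPy is shown here instead, and total band power is used as the single scalar feature per window.

```python
# A sketch of the MFI construction steps: Welch PSD per channel and time window,
# within-subject min-max normalization (Eq. (1)), and gray-point interpolation
# (Eq. (2)). Feature definition details are simplifying assumptions.
import numpy as np
from scipy.signal import welch

FS = 512  # DEAP EEG sampling frequency, Hz

def psd_features(channel, window_s=1):
    """Split one channel into non-overlapping windows; one PSD feature each."""
    n = window_s * FS
    windows = [channel[i:i + n] for i in range(0, len(channel) - n + 1, n)]
    return np.array([welch(w, fs=FS, window="hamming")[1].sum() for w in windows])

def minmax(f):
    """Equation (1): scale a subject's features to [0, 1]."""
    return (f - f.min()) / (f.max() - f.min())

def fill_gray_points(m):
    """Equation (2): each gray (empty) cell takes the mean of its non-zero
    neighbors; out-of-range indices contribute nothing, matching the text."""
    out = m.copy()
    rows, cols = m.shape
    for r in range(rows):
        for c in range(cols):
            if m[r, c] == 0:
                neigh = [m[r + dr, c + dc]
                         for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                         if (dr or dc)
                         and 0 <= r + dr < rows and 0 <= c + dc < cols
                         and m[r + dr, c + dc] != 0]
                out[r, c] = np.mean(neigh) if neigh else 0.0
    return out
```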

The Construction of the Emotion Classification Labels
The classification method adopted in this paper is a supervised machine learning method; therefore, the corresponding classification labels for the EEG signal also need to be prepared in advance. The DEAP dataset contains emotional evaluation values (including Valence, Arousal, Dominance, Liking, and Familiarity) for the trials. In this paper, Valence and Arousal are extracted as the emotional evaluation criteria used to generate the emotion labels. According to the levels of Valence and Arousal, we divided the emotional two-dimensional plane into four quadrants: High Valence High Arousal (HVHA), High Valence Low Arousal (HVLA), Low Valence Low Arousal (LVLA), and Low Valence High Arousal (LVHA). Each quadrant corresponds to one emotion class, as shown in Figure 5. According to the positive or negative deviation of the Valence and Arousal ratings, we mapped each trial into one of the four quadrants to form its emotion classification label.
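As an illustration, a trial's label can be derived from its ratings as below. DEAP ratings are on a 1-9 scale; using 5 as the midpoint is an assumption, since the text only mentions the positive or negative deviation of Valence and Arousal.

```python
# A sketch of the quadrant labeling; the midpoint value is an assumption.
def emotion_label(valence, arousal, midpoint=5.0):
    """Map a trial's (Valence, Arousal) ratings to one of the four quadrants."""
    if valence >= midpoint:
        return "HVHA" if arousal >= midpoint else "HVLA"
    return "LVHA" if arousal >= midpoint else "LVLA"

print(emotion_label(7.1, 3.2))  # -> HVLA
```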

The Construction of the Hybrid Deep Neural Networks
We propose a hybrid deep learning model called Convolutional and LSTM Recurrent Neural Networks (CLRNN) to conduct emotion recognition tasks. This model is a composite of two kinds of deep learning structures: a CNN and an LSTM recurrent neural network (LSTM RNN). The structure of the model is presented in Figure 6. The CNN is used to extract features from the EEG MFIs, and the LSTM RNN is used to model the context information of the long-term EEG MFI sequences. The features automatically extracted by the CNN reflect the spatial distribution of the EEG signals. In this work, two stacked convolutional stages form the basic structure of the CNN, which includes two convolution layers, two max pooling layers, and a fully connected layer. Given the dynamic nature of the EEG data, the LSTM RNN is a reasonable choice for modeling the emotion classification. Before connecting to the LSTM unit, a flattening operation transforms the final feature maps into a one-dimensional vector.


The Construction of Convolutional Neural Networks
The input MFI size for the network is 200 × 200 pixels with three color channels. We set the number of convolutional filters to 30 in the first convolutional layer to extract 30 different kinds of correlation information, namely 30 different features. At the same time, to extract the multi-scale spatial characteristics of the MFI, we use receptive fields of different sizes in the first convolutional layer. The field sizes are 2 × 2, 5 × 5, and 10 × 10 pixels, and the corresponding strides are 2, 5, and 10 pixels, respectively, without overlap between strides. The activation function is ReLU. The first convolutional layer is followed by a max pooling layer with a pooling size of 2 × 2 and a stride of 2. The second convolutional layer is set to 10 different filters with a size of 2 × 2, without overlap between strides. This setting helps to further fuse the information of a specific scale range from the prior features. As with the first convolutional layer, we add a max pooling stage after this convolutional layer for information aggregation. Before connecting to the LSTM unit, a flattening operation transforms the final features into a one-dimensional feature vector. The configuration of the CNN described above is presented in Table 4. The dense layer in Table 4 is the layer that transforms the final features into a one-dimensional feature vector; in this layer, we set the output to 1/10 of the input to further compress the features and simplify the network. The LSTM RNN layer is fully connected to the dense layer. Finally, the RNN output layer takes 'softmax' as its activation function, and its output size is set to 4, corresponding to the four emotion states.
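Since the paper's implementation is DL4J code (Appendix C), the following is only a rough Keras sketch of the frame-level CNN just described: three parallel first-layer branches with the stated receptive fields, each followed by the 2 × 2 pooling and second convolution. Concatenating the flattened branches is an assumption; the text does not specify how the three scales are merged.

```python
# A rough Keras sketch (not the authors' DL4J code) of the frame-level CNN:
# three parallel first-layer branches with 2x2, 5x5, and 10x10 receptive fields
# (strides equal to the field size), each followed by 2x2 max pooling and a
# second 2x2 convolution with 10 filters.
from tensorflow.keras import layers, models

inputs = layers.Input(shape=(200, 200, 3))  # one MFI frame, three color channels

branches = []
for size in (2, 5, 10):
    x = layers.Conv2D(30, (size, size), strides=(size, size), activation="relu")(inputs)
    x = layers.MaxPooling2D(pool_size=(2, 2), strides=2)(x)
    x = layers.Conv2D(10, (2, 2), strides=(2, 2), activation="relu")(x)
    x = layers.MaxPooling2D(pool_size=(2, 2), strides=2)(x)
    branches.append(layers.Flatten()(x))

features = layers.Concatenate()(branches)  # one feature vector per MFI frame
cnn = models.Model(inputs, features)
```

Each MFI frame thus reduces to a single feature vector, which the dense layer then compresses before the LSTM stage.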

The Construction of LSTM Recurrent Neural Networks
In the DEAP experiment, the stimulus intensity changes over the 60 s of each trial, and the emotion scores given by the subjects are often based on the most exciting part of the entire video. Therefore, we needed to model the context information of long-term sequences. As mentioned before, an RNN is good at sequential modeling. However, a simple RNN faces the challenge of gradient vanishing or explosion in back-propagation when its dependencies are too long [47]. LSTM units have been adopted to replace the simple units of a traditional RNN. LSTM units combine gate mechanisms in their structure so that the key features of the timing data are effectively maintained and transmitted during long-period calculation. The gates are able to forget used information, and the self-loop structure allows the gradient to flow for long durations [48].
A typical LSTM unit structure is illustrated in Figure 7. For comparison, Figure 7 shows the structures of two neural network units: the upper left corner of the figure is a simple recurrent neural network unit, and the LSTM unit is below it. As seen in Figure 7, the simple RNN unit only contains feedback from the output to the input. In contrast, the LSTM unit contains three gate structures, i.e., the input gate, forget gate, and output gate, which determine what information from the prior step should be forgotten and what information from the current time step should be added into the main data flow. f_i, f_o, and f_g are the activation functions of the input data, output data, and gates, respectively; in this study, they are all sigmoid functions.

The different gates generate decision vectors to decide what candidate information will be selected. Using the input gate as an example, it generates the vector $i_t$ from the hidden state $h_{t-1}$ of the prior LSTM cell and the current step's input $x_t$. The process of generating $i_t$ can be formalized as Equation (4):

$i_t = \sigma(w_i \cdot [h_{t-1}, x_t] + b_i)$ (4)

where $w_i$ is the weight matrix of the input function and $b_i$ is the bias. The input candidate information $\widetilde{C}_t$ is also generated from $h_{t-1}$ and $x_t$, and can be formalized as Equation (5):

$\widetilde{C}_t = \sigma(w_c \cdot [h_{t-1}, x_t] + b_c)$ (5)

The final updating information is the multiplication of the candidate information by the decision vector, $\widetilde{C}_t \times i_t$. Another gate is the forget gate, which generates the vector $f_t$ to determine whether the prior unit state $C_{t-1}$ should be retained via the multiplication $C_{t-1} \times f_t$. $f_t$ can be formalized as Equation (6):

$f_t = \sigma(w_f \cdot [h_{t-1}, x_t] + b_f)$ (6)

where $f_t$ is scaled between 0 and 1 by the sigmoidal operation. A '0' element causes the corresponding information in $C_{t-1}$ to be wiped out, while a '1' means the corresponding information is allowed to pass. The current unit state $C_t$ is a combination of $C_{t-1}$ and $\widetilde{C}_t$, and can be formalized as Equation (7):

$C_t = f_t \times C_{t-1} + i_t \times \widetilde{C}_t$ (7)

The output state of the LSTM unit is determined by the output gate. The output gate also generates a decision vector $o_t$ to decide the hidden state $h_t$; they can be formalized as Equations (8) and (9):

$o_t = \sigma(w_o \cdot [h_{t-1}, x_t] + b_o)$ (8)

$h_t = o_t \times \sigma(C_t)$ (9)

In this study, the LSTM RNN is adopted to learn contextual information from the spatial feature sequences extracted from the MFIs.
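To make Equations (4)-(9) concrete, here is a minimal NumPy sketch of one LSTM step under the reconstruction above. Weight shapes are illustrative, and the all-sigmoid activations follow the text; standard LSTM implementations typically use tanh in Equations (5) and (9).

```python
# A NumPy sketch of one LSTM step following Equations (4)-(9) as reconstructed.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One time step; W[k] has shape (hidden, hidden + input), b[k] is (hidden,)."""
    z = np.concatenate([h_prev, x_t])       # [h_{t-1}, x_t]
    i_t = sigmoid(W["i"] @ z + b["i"])      # input gate,      Eq. (4)
    C_cand = sigmoid(W["c"] @ z + b["c"])   # candidate state, Eq. (5)
    f_t = sigmoid(W["f"] @ z + b["f"])      # forget gate,     Eq. (6)
    C_t = f_t * C_prev + i_t * C_cand       # unit state,      Eq. (7)
    o_t = sigmoid(W["o"] @ z + b["o"])      # output gate,     Eq. (8)
    h_t = o_t * sigmoid(C_t)                # hidden state,    Eq. (9)
    return h_t, C_t
```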

The Construction of CLRNN with DL4J
DeepLearning4J (DL4J) is a Java-based toolkit for building, training, and deploying neural networks [49]. In this study, DL4J is adopted as the framework for constructing the CLRNN. We present the network configuration in Appendix C; the code in Appendix C constructs the network structure of the CLRNN. The size of the kernel in each layer is set according to the configuration given in Table 4. The learning rate of each layer was adjusted during the tuning process of network training.
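The paper's actual configuration is the DL4J code in Appendix C. As a hedged, language-agnostic illustration, the sketch below assembles an equivalent CLRNN in Keras, reusing the frame-level `cnn` model from the earlier sketch; the LSTM hidden size and optimizer are assumptions, since the text does not specify them.

```python
# A Keras sketch of the CLRNN assembly (the authors' implementation is DL4J):
# frame-level CNN applied to every MFI, dense compression, LSTM, softmax over 4 classes.
from tensorflow.keras import layers, models

SEQ_LEN = 60  # e.g., sixty 1-second MFIs per trial; depends on the time window

frame_cnn = cnn  # assumes 'cnn' from the earlier sketch is in scope

model = models.Sequential([
    layers.InputLayer(input_shape=(SEQ_LEN, 200, 200, 3)),
    layers.TimeDistributed(frame_cnn),                  # CNN on every MFI frame
    layers.TimeDistributed(                             # compress to ~1/10 of the
        layers.Dense(frame_cnn.output_shape[-1] // 10,  # input, per Table 4's text
                     activation="relu")),
    layers.LSTM(128),                                   # hidden size is an assumption
    layers.Dense(4, activation="softmax"),              # HVHA / HVLA / LVLA / LVHA
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```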

Results and Discussion
In this section, we present the process of the experiment and compare our method with the baselines to show its effectiveness.

Experiment Dataset and Settings
As mentioned earlier, we used the open dataset DEAP to verify the effectiveness of our method; it includes EEG signals from 32 channels collected from 32 subjects. Each subject took 40 trials, and each trial lasted 60 s. The sampling frequency of the EEG signal was 512 Hz. With different time windows, we obtained EEG MFI sequences with different numbers of EEG MFIs. For example, with a one-second time window we obtained 2400 MFIs for one EEG MFI sequence, and with a two-second time window we obtained 1200 MFIs. Among the baseline methods, we also built a 'CNN + RNN' network without the LSTM unit; the rest of its network structure is the same as the CLRNN. For peer-reviewed studies, we chose the studies listed in Table 2 for purposes of comparison.

Due to the variation of the parameters in the classification methods, we only present the best results obtained by each method. First, a comparison of the classification accuracies between CLRNN and the baseline methods is presented in Figure 8, which shows a boxplot of the mean emotion recognition accuracies over the different time windows for each subject. The comparison shows the effectiveness of our method. The average emotion classification accuracy across subjects with CLRNN is 75.21%, whereas the average accuracies of the other classification methods are 69.58% with CNN + RNN, 67.45% with SVM, 45.47% with Random Decision Forest, and 62.84% with k-NN. The highest accuracy is obtained for Subject 4 with CLRNN, at 90.54%.

After the comparison with the baseline methods, we chose the relevant studies listed in Table 2 to compare with our method. The selection of previous studies is based on two criteria: (1) the emotion analysis is based on EEG signals; and (2) the emotion labels are produced from the scales of Valence and Arousal. We found that most studies in Table 2 classified emotions into two classes (Pleasant/Unpleasant or Positive/Negative), while some studies [7,43,44] classified emotion into three categories: Pleasant, Neutral, and Unpleasant. In our study, we classify emotion into four types (HVHA, HVLA, LVLA, and LVHA). Two-class emotion classification problems are relatively simple, and the highest reported accuracy reached 82%. Multi-class (more than two) emotion classification problems are more complex, and the accuracy of our method reaches 75.21%, which is higher than the results presented in [43,44]; those studies also employed DEAP as the dataset for recognizing human emotions. This shows the effectiveness of our method. In addition, [41,43,44] and this study all employed DEAP as the dataset for emotion analysis, and the performance of our method is better than the others. A similar research approach is used in [41], which also built neural networks from CNNs and LSTM RNNs; the difference is that the two-dimensional EEG feature images constructed in [41] ignored the spatial characteristics of the EEG signals. In this paper, the spatial features of the EEG signals are considered very important for emotion recognition, and the experiments in this study confirmed this point.

To further validate the effectiveness of our method, we investigated the effect of the time window size on the classification analysis. The MFIs with different time windows are trained and tested in CLRNN and CNN + RNN, respectively. For comparison, the features trained in the baseline methods are also extracted from the raw EEG signals with the same time window sizes. The results are presented in Table 5, which shows the average emotion classification accuracy over the 32 subjects under different time windows. As seen from Table 5, CLRNN is sensitive to the time window size: with an increase in the time window, the classification accuracy shows a decreasing trend. The accuracies of the other methods did not change significantly with the time window size.

With our method, empirical research is carried out on the DEAP dataset. We compare our results with those from the baseline methods and find that the emotion classification accuracy of our method reaches 75.21%, which is higher than the accuracies of the baseline methods. Among the baseline methods, we chose a 'CNN + RNN' neural network without the LSTM unit to compare with our method, and we find that the LSTM unit provides the time sensitivity. Furthermore, we reviewed the state of the art of human emotion recognition by EEG signals; compared with similar studies, our study improves the classification accuracy.

Additionally, we analyzed the effects of different time windows on classification accuracy and found that time windows of two to three seconds achieved good classification accuracy, while accuracy decreased for time windows of four seconds and longer. Given these results, we inferred that MFI sequences from a smaller time window represent more details of the variation of the EEG signal. We selected Subject 4 to seek corresponding evidence in the MFI sequences and found that, with smaller time windows, the MFIs reveal more details about activation in the FP1 and FP2 areas.

Figure 1. The International 10-20 System and the corresponding square EEG feature matrix (9 × 9) with the tested EEG electrodes (the red points are tested in the trials and the gray points are not tested).

Figure 2. The construction process diagram of the electroencephalographic (EEG) Multidimensional Feature Image (MFI) sequence.

Figure 3. An enlarged EEG MFI with the names of the electrodes and contour lines.

Figure 4. The MFI sequence of Subject 1 with different time windows.

Figure 5. The Valence-Arousal dimension model of human emotion.

Figure 6. The structure of the hybrid deep neural networks used for emotion classification.

Figure 7. The Long Short-Term-Memory (LSTM) unit and simple recurrent network unit.

Figure 8. Emotion recognition accuracies with different classification methods.

Figure 10. MFI sequences with different time windows corresponding to Subject 4 (corresponding to the emotion HVHA).

In this study, we try to improve the accuracy of classifying human emotion by EEG signals. The innovation of our method involves two aspects. First, we propose a new method for EEG feature extraction and representation: EEG frequency features (PSD) are extracted from the different EEG channels and mapped onto a two-dimensional plane to construct the EEG MFI, and EEG MFI sequences are built from the raw EEG signal. The EEG MFI sequences fuse together the spatial, frequency-domain, and time characteristics of the raw EEG signal. Second, we propose a hybrid deep neural network that processes the EEG MFI sequences and recognizes the emotions. The hybrid deep neural network combines Convolutional Neural Networks with Long Short-Term Memory Recurrent Neural Networks: in the hybrid structure, the CNN is used to learn image patterns from the EEG MFI frames, and the LSTM RNN is used to classify human emotions from the resulting sequences.

Table 3 shows the number of the different emotional samples mapped into the four quadrants. The number of samples contained in the different emotion types is basically balanced, which ensured the balance of the neural network classification training.

Table 3. The number of samples in the different emotion classifications 1.

Table 4. The configurations of the CNN. The parameters are denoted as <input size/receptive field size/pooling size> × <number of kernels/channels/out size>.