Emotion Recognition by Correlating Facial Expressions and EEG Analysis

Emotion recognition is a fundamental task that any affective computing system must perform to adapt to the user's current mood. The analysis of electroencephalography (EEG) signals has gained prominence in the study of human emotions because of its non-invasive nature. This paper presents a two-stage deep learning model that recognizes emotional states by correlating facial expressions and brain signals. Most work on the analysis of emotional states examines large segments of the signal, generally as long as the evoked potential lasts, which allows many other phenomena to become involved in the recognition process. Unlike other phenomena, such as epilepsy, emotions have no clearly defined marker of when an event begins or ends. The novelty of the proposed model resides in the use of facial expressions as markers to improve the recognition process. This work uses a facial emotion recognition (FER) technique to create identifiers each time an emotional response is detected and uses them to extract segments of the EEG records that, a priori, are considered relevant for the analysis. The proposed model was tested on the DEAP dataset.


Introduction
The study of emotions and how they interact with computer systems has advanced considerably due to the significant developments in machine learning and increased computing power, allowing the analysis of several physiological responses to extract emotional traits, such as the voice, facial expressions, body temperature, heart rate, body movements or brain signals [1,2].
Emotion recognition has become a topic of interest for researchers in different areas, especially for the computer science community seeking to establish emotional interactions between humans and computers. Emotions play a crucial role in several activities related to human intelligence, such as decision making, perception, human interaction and human cognition in general. The analysis of EEG signals has gained prominence in the study of human emotions because of its non-invasive nature, as well as the developments in consumer-grade EEG devices, which provide affordable and simple solutions for applications that require emotion recognition [3]. Several efforts have been made to analyze the physiological responses to emotions; some consider direct reactions such as facial expressions, while others focus on the analysis of biological signals such as the heart rate or brain activity [4][5][6][7].
The analysis of biological processes tends to rely on the fact that it is more difficult for the research subject to hide or fake a response to a stimulus, so the response is considered more reliable. However, the human body is complex, and many physiological reactions act as noise; for example, eye movements can disturb the information collected in electroencephalography (EEG) records. In addition, when we analyze cognitive phenomena such as emotions, there is no explicit reference for when an emotional stimulus begins or ends, so the full span of the signal is analyzed under the assumption that some trait within that period characterizes the emotion. This, in turn, assumes that the evoked response is the expected one and that no other responses occur throughout the experimental process, which cannot be guaranteed.
We propose to use facial expressions as time markers that dictate which moments of the recorded brain signals we should focus on, and thus obtain a more robust and accurate methodology. The structure of this proposal is shown in Figure 1. The first stage analyzes the video records using a facial emotion recognition (FER) technique to obtain markers that indicate which emotion occurred and when. This information is then used to extract fragments of the signals from an EEG record (the two seconds before the marker appears) and to build a collection of data that, a priori, can be considered linked to an emotional process. The extracted information is subsequently processed using digital signal processing techniques and a recurrent neural network training phase. Finally, following the structure presented in [8], the highlights of this article are:
• A methodology is proposed that delimits the analysis space for detecting emotions through EEG analysis, increasing the certainty that a behavior linked to emotion is being recognized.
• A boosting methodology is used in which two neural networks operate together to strengthen the result.
• ESL-optimized networks are used to reduce the effects of noise.
• An electrode reduction technique based on Brodmann's regions is used.
• The DEAP database is used as a reference point, and recognition rates of 88.2% ± 0.23 for the FER and 89.6% ± 0.109 for the EEG analysis are obtained.
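The marker-driven segment extraction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function name, the 128 Hz sampling rate (DEAP's preprocessed rate) and the array layout are our own assumptions.

```python
import numpy as np

def extract_marked_segments(eeg, marker_times, fs=128, pre_seconds=2.0):
    """Cut the EEG window preceding each FER marker.

    eeg          : array of shape (channels, samples)
    marker_times : times (in seconds) at which the FER stage detected
                   an emotional expression
    fs           : EEG sampling rate in Hz
    pre_seconds  : how much signal before the marker to keep
    """
    win = int(pre_seconds * fs)
    segments = []
    for t in marker_times:
        end = int(t * fs)
        if end - win >= 0:  # skip markers too close to the start of the record
            segments.append(eeg[:, end - win:end])
    return segments

# Toy usage: a 32-channel record with markers at 5 s and 10 s
eeg = np.random.randn(32, 60 * 128)
segs = extract_marked_segments(eeg, [5.0, 10.0])
print(len(segs), segs[0].shape)  # 2 (32, 256)
```

Each returned segment covers only the two seconds preceding a detected expression, which is the collection that feeds the later processing stages.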

Materials and Methods
Affective computing emphasizes that emotions play an essential role in human behavior and decision-making processes; however, they have been ignored in the design of most digital systems. This highlights the need to develop or optimize methodologies that bring us closer to a more natural human-machine interaction.

Emotions
The definition of emotion used in this work is described by Scherer [9], who defines emotions as organicistic responses of the nervous system to an event or experience. We also rely on the theories of Ekman and Russell to label and delimit emotional stimuli [10,11]. Ekman establishes that there are basic emotions present in all humans regardless of their environment: happiness, sadness, disgust, fear, surprise, and anger. Russell defines the circumplex model of emotions, which addresses the level of activation of an emotion by placing emotional states according to their levels of arousal and valence. Combining the two theories creates a four-class emotion model: happiness (high arousal-high valence (HA-HV)), anger (high arousal-low valence (HA-LV)), sadness (low arousal-low valence (LA-LV)) and neutral (low arousal-high valence (LA-HV)). This model was implemented in our experimental process.
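As a concrete illustration of this four-class model, the mapping from arousal-valence ratings to quadrant labels can be sketched as below. The function name and the midpoint threshold of 5 (for DEAP-style 1-9 self-assessment scales) are our own assumptions for illustration.

```python
def quadrant_label(valence, arousal, threshold=5.0):
    """Map self-assessment ratings (e.g., DEAP's 1-9 scales) to the
    four arousal-valence classes of the circumplex-based model."""
    high_v = valence >= threshold
    high_a = arousal >= threshold
    if high_a and high_v:
        return "happiness (HA-HV)"
    if high_a:
        return "anger (HA-LV)"
    if high_v:
        return "neutral (LA-HV)"
    return "sadness (LA-LV)"

print(quadrant_label(7.5, 8.0))  # happiness (HA-HV)
```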

Facial Emotion Recognition (FER)
Facial expressions are more important than we usually think, since our brains have developed a remarkable ability to identify faces and expressions, whether for social or survival purposes. Thus, there are various studies focused on facial emotion recognition (FER) that specialize in detecting movements of the facial muscles [12][13][14]. Our work adopts the machine learning technique implemented in [15], which proposed a convolutional neural network (CNN) optimized by an extreme sparse learning (ESL) algorithm (as can be observed in Figure 2), whose objective is to strengthen its performance in real-life situations. The extreme learning machine (ELM) is a very competitive classification technique, especially for multi-class classification problems. It also requires little optimization, which results in a simple implementation, fast learning, and better generalization performance [13]. ESL and ELM were implemented under the assumption that the underlying sparse representation of natural signals or images is efficiently approximated by a linear combination of few dictionary elements. The dictionary is obtained by applying predefined transforms to the data or by learning it directly from the training data, and since this usually leads to a satisfactory reconstruction, the function is defined by Equation (1), where Y is the input set. The dictionary for the sparse representation of Y can be learned by solving Equation (2), which uses the learned overcomplete dictionary in the form X = [d_1, d_2, ..., d_m] ∈ ℝ^{N×S}; X represents the sparse matrix of the inputs, and N_0 is the sparsity constraint.
The objective function of the ELM can be summarized as proposed by the authors of [13] and shown in Equation (3), where X denotes the set of training samples, H is the hidden-layer output matrix (H(X) ∈ ℝ^{L×S}), L is the number of nodes in the hidden layer, β is the output weight vector of length L, and Z is the vector of class labels of length S.
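The ELM idea behind Equation (3) — random, untrained input weights followed by a closed-form least-squares solution for the output weights β — can be sketched as follows. This is a minimal generic ELM, not the ESL-optimized network of [13,15]; all names and the tanh activation are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def elm_train(X, Z, L=64):
    """X: (S, d) training samples, Z: (S, c) one-hot labels, L: hidden nodes."""
    W = rng.normal(size=(X.shape[1], L))  # random input weights, never trained
    b = rng.normal(size=L)
    H = np.tanh(X @ W + b)                # hidden-layer output matrix H(X)
    beta = np.linalg.pinv(H) @ Z          # closed-form output weights (beta)
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.argmax(np.tanh(X @ W + b) @ beta, axis=1)
```

Because only β is solved for (no backpropagation), training reduces to a single pseudoinverse, which is what gives the ELM its fast learning and simple implementation.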

Signal and Behavior Analysis
Machine learning techniques allow a faster and more efficient way to process physiological or biological behaviors and find patterns in records of emotional stimuli, such as the analysis of voice [16][17][18], heart rate [7], body temperature [19] and brain signals [20]. Each technique has certain advantages; for example, voice analysis is low-cost and non-invasive, but it requires the subject to produce a physiological response, whereas far more sophisticated techniques that do not, such as positron emission tomography (PET), are expensive and invasive. Each method can be chosen according to the experimental conditions. We implement an EEG analysis, which, despite being considered noisy, is a low-cost, non-invasive technique that does not require significant technical capacities for its implementation.

EEG Analysis
Our brain interprets emotional stimuli through a series of organicistic responses of the central nervous system [9]. Several studies use ML to analyze emotional situations by recognizing patterns in the signals collected from the brain cortex, either by region or by considering all available information [20,21]. However, one of the main challenges is that these patterns are sought in large time windows (the time the evoked potential lasts, which can range from a few seconds to minutes), which implies that many other events can affect the experimental process, such as eye movements, facial muscle activity or cognitive states unrelated to the experiment. Nevertheless, ML-based analysis has been proven to successfully diagnose medical conditions and characterize cognitive phenomena [5,[22][23][24][25][26][27].
Performing an adequate analysis of the EEG signals is one of the most critical stages for improving the recognition rate. The signal analysis stage is based on the considerations published in [21], which can be observed in Figure 3 and are described as follows:
• Filtering. A bandpass filter with a cutoff of 0.2 to 47 Hz was implemented to exclude all frequencies outside the brain rhythms, and the information obtained was subdivided into the rhythm ranges: delta (0.2 to 3.5 Hz), theta (3.5 to 7.5 Hz), alpha (7.5 to 13 Hz), beta (13 to 28 Hz) and gamma (>28 Hz).
• Electrode selection by Brodmann bounded regions. The electrode selection methodology proposed in [21] was implemented, linking the Brodmann areas to the regions of the cortex specialized in visual and auditory processing and to the hypothalamus. In this way, the analysis region is limited to the electrodes over the parietal, temporal, frontal and occipital lobes of the cortex, since these could provide more information related to the emotional process. We therefore only consider the electrodes F7, FC5, T7, CP5, P7, P3, O1, Pz, P4, P8, O2, CP6, T8, FC6 and F8. This reduction to 15 electrodes of a differentiated 10/20 scheme considerably decreases the amount of data to be processed, which improves the processing time.
• Blind source separation (BSS). BSS is used to eliminate the information from electrodes outside our boundary region, using Equations (4) and (5).
Equation (4) is the probability distribution set and Equation (5) is the sum of the overlapped signals.
• Feature extraction. We use a Daubechies 4 wavelet transform to extract the translation and scale coefficients (Equation (6)), which serve as features in the recognition process. Likewise, the variance between the scales is analyzed and added as an extra feature for the analysis,
where Ψ_{j,k}(t) is the mother wavelet with four vanishing moments in Equation (7).
The Daubechies wavelets are orthogonal and biorthogonal, but they are not symmetrical. Their compact support (the range over which they are non-zero) is [0, 2N − 1], and these waveforms can be implemented as db2, db4, db8 and db16; however, there is no rule for selecting the number of vanishing moments, so four vanishing moments were chosen through experimentation. The literature reports that most experimental proposals rely on complex preprocessing techniques, such as wavelets or matching pursuits, to process the signals [4,28], while some other proposals analyze the signals without a pre-processing stage by using exhaustive methods instead of traditional digital signal processing [29,30].
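The filtering and rhythm-splitting steps above can be sketched with SciPy as follows. This is an illustrative pipeline under our own assumptions: the 128 Hz rate of DEAP's preprocessed data, a 4th-order Butterworth design, and per-band variance as a simple stand-in for the wavelet-coefficient features actually used in the paper.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 128  # Hz; DEAP's preprocessed recordings are downsampled to 128 Hz
# Rhythm bands from the filtering step; the gamma upper edge is capped at
# the 47 Hz cutoff of the overall bandpass filter.
BANDS = {"delta": (0.2, 3.5), "theta": (3.5, 7.5), "alpha": (7.5, 13.0),
         "beta": (13.0, 28.0), "gamma": (28.0, 47.0)}

def bandpass(x, lo, hi, fs=FS, order=4):
    # Second-order sections keep the very-low-cutoff delta filter stable
    sos = butter(order, [lo, hi], btype="band", fs=fs, output="sos")
    return sosfiltfilt(sos, x)

def band_features(x, fs=FS):
    """Variance of each rhythm-band component of one channel."""
    return {name: float(np.var(bandpass(x, lo, hi, fs)))
            for name, (lo, hi) in BANDS.items()}
```

For example, a pure 10 Hz tone concentrates its variance in the alpha band, confirming that the subdivision isolates the intended rhythms.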

EEG Neural Network Architecture
A recurrent neural network structure with four hidden layers was used, with the architecture shown in Figure 4; this configuration was extracted from an experimental process that can be consulted in [21], which determined that this kind of low-complexity neural network shows excellent performance without increasing the computational burden. The network uses a one-dimensional convolutional structure to perform a depthwise convolution that acts separately on the channels, followed by a pointwise convolution that mixes the channels. The three emotional states define the three classes of the emotions model. The experimental configuration for the analysis of the EEG signals in this work is as follows: each data vector is composed of 8064 kernel coefficients, and a total of 270 stimuli were used, 90 for each emotional state.
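The depthwise-then-pointwise convolution described above can be illustrated with a minimal NumPy sketch. This is not the paper's network, only the two convolution primitives; the function names, kernel sizes and channel counts are our own assumptions.

```python
import numpy as np

def depthwise_conv1d(x, kernels):
    """Depthwise step: one filter per channel, acting on each channel separately.
    x: (channels, time); kernels: (channels, kernel_size)."""
    out = np.empty((x.shape[0], x.shape[1] - kernels.shape[1] + 1))
    for c in range(x.shape[0]):
        # reversed kernel -> cross-correlation, the deep-learning convention
        out[c] = np.convolve(x[c], kernels[c][::-1], mode="valid")
    return out

def pointwise_conv1d(x, weights):
    """Pointwise step: a 1x1 convolution that mixes channels.
    weights: (channels_out, channels_in)."""
    return weights @ x

# Toy usage mirroring the setting above: 15 electrodes, 8064-sample vectors
x = np.random.randn(15, 8064)
y = pointwise_conv1d(depthwise_conv1d(x, np.random.randn(15, 5)),
                     np.random.randn(8, 15))
print(y.shape)  # (8, 8060)
```

Splitting the operation this way keeps the per-channel filtering cheap and delegates all cross-channel mixing to the 1x1 step, which is what keeps the computational burden low.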

Data Source
The DEAP database [31] was used in this work; this dataset includes video recordings of the experimental process, detailed metadata, and a very descriptive methodology. It therefore allows us to analyze the video images, detect facial expressions, and link them with the recorded EEG signals.
The dataset contains EEG, EMG and EOG signals, galvanic skin response, temperature and respiration rate records. A total of 32 participants were analyzed; the frontal video recordings of the faces of 22 of them are available. The video was recorded at 50 fps using the H.264 codec. The dataset contains 32 .bdf files (BioSemi's data format, generated by the ActiView recording software), each with 48 channels recorded at 512 Hz (32 EEG channels, 12 peripheral channels, three unused channels, and one status channel).
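The channel layout of one such recording can be made explicit with a small bookkeeping sketch (the variable names and the dummy one-minute array are our own illustration; reading actual .bdf files requires a BioSemi-capable reader, which is not shown here):

```python
import numpy as np

FS_RAW = 512  # Hz, the .bdf recording rate

# One minute of dummy data shaped like a DEAP .bdf recording:
# 48 channels = 32 EEG + 12 peripheral + 3 unused + 1 status.
record = np.zeros((48, FS_RAW * 60))
eeg        = record[:32]    # the 32 EEG channels used in this work
peripheral = record[32:44]  # GSR, respiration, temperature, etc.
unused     = record[44:47]
status     = record[47:]

print(eeg.shape)  # (32, 30720)
```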

Results
This work presents an average recognition rate of 89.6% ± 0.109, which is superior to the 82.9% average performance calculated for 2016 to 2020 in [3]. Although some works present much higher recognition rates, this is due to the kind of analysis presented in them: they are mono- or bi-modal methods, while multimodal works such as the one shown here obtain considerably more conservative results, as shown in Table 1.
The confusion matrix presented in Figure 5 shows that HALV (anger) is recognized almost completely. Hence, its false-positive rate considerably affects the performance of the other classes, as shown in Figure 6, which indicates that 10% of the samples are recognized as HALV in both cases; however, as can be seen in Figure 7, the predictions are very stable, and the overall recognition does not over-fit. Tables 1 and 2 show that the recognition rates obtained in our implementation are quite competitive with the averages obtained in recent years. It is essential to note that our work focuses on increasing confidence in the emotional analysis process; nevertheless, isolating the signals could also increase the recognition rate in multi-class scenarios. Table 1. State-of-the-art recognition rates obtained by analyzing electrocardiogram (ECG), galvanic skin response (GSR), heart rate variability (HRV), respiration rate (RR), skin temperature (SKT), electromyogram (EMG) and electrooculography (EOG) signals, as presented in [32].

Measurement Methods — Average Accuracy

Table 2. State-of-the-art performance comparison between analysis techniques reported in [3]. The averages of each of the compiled techniques are shown, and the total performance percentage for the 2016-2020 period is calculated to be 82.9%.

Discussion
The premise of this work is that most studies on the analysis of emotions through EEG analyze the complete signals of the experimental process and that, despite the recognition rates obtained, many other phenomena can be involved in the recognition task, for example, facial muscle activity or mental workload. It must also be considered that brain activity is chaotic and that during the experimental process various cognitive functions may occur that we cannot control, for example, when the person has an intrusive thought or is thinking about something else. Therefore, it is essential to add an extra level of verification, and thus we propose the delimitation methodology presented in this work.
Although many works already implement filters to eliminate some of the phenomena produced by the body's natural dynamics, it is very difficult to characterize every one of them, since we would have to map each of the functions that the brain performs; moreover, this mapping would change from person to person. Even if we could set this limitation aside, there is another one that is perhaps harder to control: the emotional and mental state of the person, since, as mentioned, we would have to know whether the person is thinking about something else while undergoing the experiment, which is entirely unworkable. So this proposal, although it barely scratches the surface of the problem, helps identify the exact moments when a physiological response related to an emotion occurred and analyze the moments before it.
This proposal arises after publishing various works focused on EEG analysis in which, although high recognition rates were achieved in experimental processes, the results were complicated to replicate in real settings since, in addition to situations such as neuroplasticity, we also noticed that people were distracted, remembered, rambled and imagined situations during the experimental processes. This can generate conditions entirely distinct from those of our study, so although this is not a definitive proposal, we believe that it brings us a little closer to a real-life implementation.

Conclusions
This work presents a competitive recognition rate for emotional states through EEG signals, despite not focusing on increasing the recognition rate but on proposing a methodology that reduces the effects produced by non-related phenomena that may occur during an experimental process, creating a cardinal correlation between the recognition and the experimental procedure.
One aspect to highlight is the need to expand the scope of this work to a broader population sector. However, the scarcity of databases such as DEAP is a critical issue to overcome in future work, motivating us to create an original database that helps corroborate our research. Another important aspect of this work is that the recognition stage is faster, since the NNs are trained with short fragments of the signal instead of large amounts of traits that may or may not be related to the physiological response to an emotional stimulus.
An aspect not fully addressed in this document is the combination of the two machine learning techniques used to generate a more robust output. Nevertheless, the results obtained in this research demonstrate the advantages of using ML techniques to link two phenomena that are, a priori, seen as belonging to two different research areas. Both the analysis of facial expressions and EEG analysis have their own fields of study; however, when combined, they generate a more robust and reliable analysis methodology and open the possibility of using many other physiological responses to emotions to consolidate the results.
Finally, we understand that the process presented in this work depends on the recognition rate of the FER process; despite the optimization techniques that have been implemented, these techniques should be improved to achieve more precise references of when emotions occur. However, this can be expected to improve as analysis and ML methodologies advance.

Data Availability Statement: The DEAP dataset can be downloaded from: http://www.eecs.qmul.ac.uk/mmv/datasets/deap/.