Affective State Assistant for Helping Users with Cognition Disabilities Using Neural Networks

Non-verbal communication is essential in the communication process, and its absence can cause misinterpretations of the message that the sender tries to transmit to the receiver. With the rise of video calls, this problem seems to have been partially solved. However, people with cognitive disorders, such as those with some kind of Autism Spectrum Disorder (ASD), are unable to interpret non-verbal communication either live or by video call. This work analyzes the relationship between some physiological measures (EEG, ECG, and GSR) and the affective state of the user. To do that, some public datasets are evaluated and used to train a system composed of multiple Deep Learning (DL) classifiers. Each physiological signal is pre-processed using a feature extraction process after a frequency study with the Discrete Wavelet Transform (DWT), and the resulting coefficients are used as inputs for a single DL classifier focused on that signal. These classifiers (one for each signal) are evaluated independently, and their outputs are combined in order to optimize the results and obtain additional information about the most reliable signals for classifying the affective states into three levels: low, middle, and high. The full system is carefully detailed and tested, obtaining promising results (more than 95% accuracy) that demonstrate its viability.


Introduction
High-context communication relies heavily on sensitivity to non-verbal behaviors and environmental cues to decipher meaning, while low-context exchanges are more verbally explicit, with little reliance on tacit or nuanced cues [1].
Interpersonal interactions involve both emotional and cognitive processes. As emotions and related phenomena such as desires, moods, and feelings can be revealed through nonverbal behavior as well as through spoken words, nonverbal behavior has a significant role in those interactions [2].
Thus, nonverbal behavior includes a variety of communicative behaviors that have no linguistic content. These include (but are not limited to): facial expressiveness, smiles, eye contact, head nodding, hand gestures, postural positions (open or closed body posture and forward or backward body lean); paralinguistic speech characteristics such as speech speed, volume, pitch, pauses, and lack of fluency of speech; and dialogic behaviors such as interruptions. It is widely recognized that nonverbal behavior conveys affective and emotional information, although it also has other functions (such as regulating turn-taking in the conversation). For example, a frown may convey disapproval, and a smile may convey approval or agreement. A blank expression can also convey an emotional message to a listener, such as indifference, boredom, or rejection. Nonverbal behaviors often (but not always) accompany words and therefore give words meaning in context (for example, by amplifying or contradicting the verbal message). Thus, a verbal message of agreement (like "That's okay") can be interpreted differently depending on whether the statement is accompanied by a frown, a smile, or a blank expression [3].
However, certain mental and physical disabilities can prevent the receiver from capturing or understanding non-verbal communication. This is the case for people with Autism Spectrum Disorders (ASD).
Several investigations reaffirm the importance of emotional aspects in social integration and, consequently, in the quality of life of people with ASD [4]. However, these people present a deficit in understanding non-verbal language and, as many non-verbal expressions are related to the emotional state of the transmitter, they consequently present a deficit in understanding emotions. Several studies, such as [5,6], support these claims.
Thus, in order to help with these cognitive deficits, our objective is to design an aid system for recognizing affective states; before that, however, we must examine how these states can be distinguished.
Human emotions represent physiological phenomena that include feelings, memories, evaluations, unconscious reactions, body gestures, vocalization, postural orientations, etc. [7]. Emotional theories have evolved over time, distinguishing between those that hold that emotions come from the perception of physiological states (first suggested by James and Lange in the 1880s [8,9] and known as 'somatic theories'), and those that consider emotions as cognitive evaluations [10], which are known as 'cognitive theories'. All these theories relate emotions to particular locations of the brain: cognitive theories relate emotions to the neocortex and somatic theories to the limbic system.
In fact, many authors consider both theories valid, since the mind is not cognitive or emotional individually, but rather a combination of both. Both emotion and cognition are unconscious processes that are transformed into conscious experiences [11]. Furthermore, emotions are related to unconscious reactions in various parts of the body (not only in brain activity); this implies that, if we were able to detect and discretize these unconscious reactions, we could infer the emotional state of the person. However, to get to that point, it is necessary to induce certain emotions in the user in order to be able to detect the physiological variations.
There are different techniques to induce emotions, such as images and sounds [12,13], expressive behaviors [14,15], social interactions [16,17], and music [18], among others [19]. In these studies, when working with multiple users, it is important that the stimuli can be reproduced multiple times without change; this is why the use of expressive behaviors and social interactions is ruled out. The most efficient and widely used methods are image presentation and video playback. In addition, the use of videos implies the inclusion of auditory elements, which enhance the emotional state of the user. Thus, we focus on video playback [20].
Regarding the classification of emotions, two theories are distinguished: the 'categorical approach' and the 'dimensional approach'. The categorical approach is based on the number and type of emotions, such as fear, anger, sadness, and joy. In contrast, the dimensional approach describes emotions as points within a multidimensional space, which can be defined by activation and valence [12], or by activation and approach-distance [21].
The 'activation' value represents the intensity of the emotion. The 'valence' determines whether the emotion is positive or negative. The approach-distance value, in turn, determines whether the emotion makes the person move towards the cause of the emotion or away from it. The classification of affective states according to those metrics is evaluated in depth by Posner et al. [22] and is summarized in Figure 1. As can be observed, the emotional state depends on the Valence and Arousal (activation) values.
Focusing on the dimensional approach, it is important to remark that the study of emotions requires a multidimensional approach [23] combining questionnaires (to determine the emotion perceived by the user), physiological responses, and the analysis of the user's behavior; among these, the analysis of physiological signals is the best approximation to an emotional response. This is because questionnaires have several limitations: users may be unable to understand or express what they feel [16], or their answers may be influenced by questionnaires from previous experiences [24]. Thus, the analysis of user behavior using questionnaires does not guarantee the same emotional response in different contexts or in different people, so it is difficult to obtain a behavior pattern for different emotions. Focusing instead on tangible measures that allow us to obtain a more precise measurement with less variability, there are several physiological signals related to the unconscious reactions of the user's body that can be logged and studied: facial muscle activity, heart activity, skin conductivity, and brain activity, among others. These signals are described in detail in the next section.
Moreover, extracting useful information from physiological signals can be a difficult task, as many elements are involved. Some works extract useful features from the data [25], others use thresholds in decision trees [26,27], and others convert the signal's temporal representation to images and apply computer vision techniques [28,29].
On the other hand, the study of medical signals and images has progressed greatly with the inclusion of Machine Learning (ML) systems capable of automatically identifying and extracting the relevant characteristics to make a correct diagnosis, obtaining better results than those obtained by classical diagnostic systems [30,31].
These techniques have been mainly applied to imaging systems using Convolutional Neural Networks (CNN), which typically use classical Artificial Neural Networks as a final step for providing one classification output for each sample input. However, when working with physiological signals, the temporal dimension is a very relevant source of information and, for this reason, in many cases, it is necessary to analyze a time window to obtain a correct classification.
In order to analyze these time windows, two main approaches are commonly used in Deep Learning (DL): using Recurrent Neural Networks (RNN) with the original data obtained from the physiological sensors, or using feature extraction techniques (before training the neural network) to work only with the most useful information.
Previous works obtained very good results using RNNs with analog and time-dependent sensors (like accelerometers) [32,33], but working with physiological signals such as the electrocardiogram (ECG) is a harder task, since these signals contain a lot of noise (at high and low frequencies), they are not periodic and, in some cases like galvanic skin response (GSR), an incremental peak does not lead to a subsequent decline. This is why it is common to extract information from the frequency components of the physiological signals before using a neural network system.
Therefore, the main objective of this work is the design, implementation, and testing of a Deep Learning (DL) system for classifying the user's affective state, in order to help in the emotional learning process of users with cognitive disabilities. Each physiological signal will be pre-processed by extracting its frequency features, and the obtained coefficients will be used as inputs of a neural network (NN) system. Each signal will be tested independently to classify the main affective states (Arousal and Valence). Results will be analyzed, and the signals with better results will be combined in order to obtain a global affective state classification system.
The main novelty of this work is the system's architecture itself, as it allows each signal to be evaluated independently, so that each network model can be adapted to the particular characteristics of its signal. In this way, two great benefits are obtained: (1) we can observe how each signal affects affective state classification; and, thanks to these partial evaluations, (2) the overall system designed by combining the independent ones obtains better results than others trained by brute force (as they do not take into account the specific characteristics of each type of signal).
The rest of the paper is organized as follows: first, in the 'Materials and Methods' section, the different physiological signals and datasets analyzed for this work are presented, as well as the system's architecture, including the different implemented subsystems. Next, the results obtained after the training process and the evaluation of each model are detailed and explained in the 'Results and Discussion' section. Finally, conclusions are presented.

Materials and Methods
In this section, the classifier architecture used for the task previously explained is described in detail. In order to do that, the information provided as output of the model, which is necessary for the classification of the different affective states, needs to be described; moreover, the physiological signals that provide the input data need to be presented too.
Thus, first, the physiological signals used for obtaining the input data are detailed. Following this, the datasets used in this work are presented and compared, focusing on the signals logged by each one. Finally, the proposed classifier to identify affective states is described.

Physiological Signals
Collecting data from physiological signals requires medical instruments designed to measure those signals. Each instrument can only collect information from a specific type of physiological signal. However, some of these instruments can be placed in different positions on the user's body; thus, the information obtained can be interpreted in different ways (depending on where the instrument is placed). Next, the four most used physiological signals in emotional studies and their associated instruments are presented.

Brain's Electrical Activity
Brain evoked potentials have been used as a measure of the response to visual stimuli in several works [34,35]. Positive potential modulations are obtained when visualizing both pleasant and unpleasant stimuli. Moreover, a positive slow wave that is maintained once the stimulus has ended has also been detected [35].
The main problem with using these signals is their low amplitude as well as the surrounding noise, which makes it considerably difficult to obtain the emotional response through the detection of evoked potential patterns. Furthermore, depending on the helmet used to capture these potentials, it can be a very invasive technique for the user. Despite this, the asymmetry of the electroencephalogram (EEG) allows us to determine the valence against a stimulus or the approach-distance [36].
Brain electrical activity is measured using an electroencephalograph, obtaining a representation of the activity of all the electrodes over time (electroencephalography, EEG). As detailed before, this physiological signal can help to determine the emotional valence.

Heart's Electrical Activity
A wide variety of studies demonstrate that heart rate variations are related to emotions. Graham [37] found that viewing images for 6 s causes a pattern in heart rate variation. This pattern consists of three phases: an initial deceleration, followed by an acceleration component and, finally, by a last deceleration component [12]. Some emotions such as fear, anger, or joy have been evaluated using cardiac activity [38]; for example, cardiac frequency accelerates in anger and fear [39].
However, this measure presents several problems related to the duration and repetition of the stimulus. On the one hand, short stimuli (around half a second) do not modify heart rate [40]. On the other hand, the initial deceleration decreases when the same stimulus is displayed repeatedly during an experiment [41].
The heart's electrical activity is measured using an Electrocardiograph (ECG), obtaining a representation over time (electrocardiography) from which the heart rate can be obtained. As detailed before, this physiological signal has been demonstrated to be related to emotions, and it is important to evaluate time windows in order to detect heart rate accelerations and decelerations.

Muscles' Electrical Activity
The muscles most directly related to the user's emotions are those in the face: the mood is expressed through grimaces or facial gestures caused by these muscles. The electrical activity of the facial muscles is useful for studies of emotions in which emotional arousal is so low that it does not produce observable facial gestures [42]. There are two muscles that can be used for emotional assessment: the major zygomatic, related to smiling, and the superciliary corrugator, related to the frown gesture.
Schwartz [43] was the first researcher to link both muscles with emotions, realizing that unpleasant images produced greater activity of the superciliary corrugator, and that pleasant images caused greater activity in the major zygomatic, relating the activity of both muscles to the valence of emotions.
However, the relationship of both muscles to valence differs considerably: some researchers report greater activity in the superciliary corrugator than in the major zygomatic [44]. Furthermore, the relationship between valence and the electrical signal of these muscles appears to be linear in the case of the superciliary corrugator, but has a "J" shape in the case of the zygomatic [45]. There is less consensus on whether this electrical activity is related to activation level. Some authors suggest that, at least in the case of the corrugator, this relationship exists, while others suggest the opposite [46]. From the point of view of Cacioppo [47], both valence and activation can be obtained by recording the electrical activity of the facial muscles. Although corrugator and zygomatic electrical activities have been widely used, other muscles such as the orbicularis have also been used to measure valence-positive emotions [48].
Muscle electrical activity is measured using an Electromyograph (EMG), obtaining a representation over time (electromyography) of the electrical activity of the muscles near the placed electrodes. Thus, in order to detect the activity of the facial muscles, those electrodes need to be placed near the corrugator and the zygomatic muscles. However, as detailed before, there are several discrepancies about the information provided by these muscles.

Dermal Electrical Activity
Changes in skin conductivity are strongly related to variations in the level of activation [49,50]. An increase in the activation level causes an increase in the level of skin conductivity, generating a positive potential that begins about 400 ms after the stimulus [12] and lasts between 400 ms and 700 ms [35,45].
Skin conductivity is also known as 'galvanic skin response' (GSR) or 'electro-dermal activity' (EDA). This GSR signal has two components: a tonic component and a phasic component. On the one hand, the tonic component is a low-frequency signal that is associated with the baseline (trend) of the signal and undergoes slight variations over time. On the other hand, the phasic component corresponds to rapid and punctual variations, and is directly associated with the response to a stimulus.
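The distinction between the tonic and phasic components can be illustrated with a minimal sketch. It assumes, purely for illustration, that the tonic baseline is approximated by a centered moving average and that the phasic component is the residual; the function names and the window width are hypothetical and are not taken from the datasets or from the method used in this work.

```python
def moving_average(signal, width):
    """Centered moving average; the window is clipped at the signal edges."""
    half = width // 2
    smoothed = []
    for i in range(len(signal)):
        lo = max(0, i - half)
        hi = min(len(signal), i + half + 1)
        window = signal[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


def split_tonic_phasic(gsr, width):
    """Assumption: tonic = slow trend (moving average), phasic = residual."""
    tonic = moving_average(gsr, width)
    phasic = [x - t for x, t in zip(gsr, tonic)]
    return tonic, phasic
```

With this decomposition, a flat recording yields a zero phasic component, while a short peak after a stimulus appears mostly in the phasic part.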
Thus, dermal electrical activity (GSR or EDA) is measured using typical electrodes located in places where the density of the sweat glands is higher (forehead, cheeks, palms, fingers, and the soles of the feet), obtaining a representation over time of the skin's conductivity (which varies because of the user's sweating). This measure, although simple, has been demonstrated to reflect the affective state of the user very reliably.

Dataset
In this subsection, several datasets are presented and analyzed in order to find the most adequate set for training and testing our classifier. Several pieces of information are collected from each one: number of participants, sensors used (physiological signals recorded), logging frequency, and emotional states labeled. As will be observed next, most datasets use similar sensors (EEG, ECG, and GSR) and label similar emotional states.
In the dataset search, we considered those that met the following criteria: freely accessible, output labeled with emotional states, videos used as stimuli for the test subjects, and information recorded from various sensors. According to these criteria, five datasets were found: AMIGOS, ASCERTAIN [51], DEAP, DREAMER [52], and HR-EEG4EMO.
The main advantage of working with public and labeled datasets in our case is that the video clips that serve as stimuli have been carefully selected by psychologists to provoke a specific affective state. Moreover, in order to check that the labels are correct, at the end of each data collection, a survey is given to the users so that they can detail their affective state at that moment and how they felt while viewing each video. Thus, we can assume that those labels are correct and work with them.
First, a brief summary about the used datasets is presented; and, after that, a summarizing table with the most relevant information is shown.

ASCERTAIN
This is a multi-modal dataset for personality and affect recognition using commercial neuro-physiological sensors. This dataset contains the information obtained from 58 participants, 38 male and 20 female, with a mean age of 30 years. The data were collected using EEG, ECG, GSR, and Facial Gestures while the participants were watching 36 short musical videos. The stored data were labeled using Arousal, Valence, Dominance, Liking, and Familiarity. The ECG was recorded at 256 Hz, the GSR at 128 Hz, and the EEG at only 32 Hz.

DREAMER
This is a multi-modal database consisting of EEG and ECG signals recorded during affect elicitation by means of audio-visual stimuli. Signals from 23 participants watching 18 video clips were recorded along with the participants' self-assessment of their affective state after each stimulus, in terms of valence, arousal, and dominance. All the signals were captured using portable, wearable, wireless, low-cost, and off-the-shelf equipment at 256 Hz for the ECG and 128 Hz for the EEG.

Summary
The most relevant information about the analyzed datasets is summarized in Table 1. As can be seen, all datasets have several similarities. On the one hand, they used a good number of participants and several sensors (EEG and ECG are common to all of them, and GSR is present in almost all of them), so the sensors we will use are EEG, ECG, and GSR. On the other hand, regarding the emotional states labeled and the studies detailed in the Introduction, the main affective states are "Arousal" and "Valence". Accordingly, HR-EEG4EMO cannot be used because it only labels "Valence". Another problem is that DREAMER does not have information about GSR, but its information about EEG and ECG can be combined with the others. Thus, according to these filters, our database should be composed of the datasets AMIGOS, ASCERTAIN, DEAP, and DREAMER. However, after trying several times to contact the creators of AMIGOS and DEAP, and after waiting several months for an answer, we had to discard these datasets because we could not access their data. Thus, our database is composed of the datasets ASCERTAIN [51] and DREAMER [52]. However, five adaptations are needed so that the information given by both datasets is equivalent:
• Sensors: only EEG, ECG, and GSR data are used. The facial data recorded by ASCERTAIN are discarded, and GSR data are not provided by DREAMER (so only the GSR information from ASCERTAIN is used).
• Data frequency: based on the different frequencies used, it will be 32 Hz for EEG, 128 Hz for GSR, and 256 Hz for ECG. These are the lowest frequencies for each sensor across both datasets. The others will be down-sampled in order to balance them.
• Data amount: in order to perform a balanced training, which gives equal weight to the information from all the datasets, we must use a similar amount of information from each of the datasets. In this way, no bias is introduced when training. Thus, the number of samples used from each of the datasets will be restricted by the dataset that has the smallest number of samples. In the case of GSR, the full amount of data provided by ASCERTAIN will be used.
• Labels: both datasets share the two main labels ("Valence" and "Arousal"). According to the studies discussed earlier, the information from both affective states is enough to characterize the emotional state of the user.
• EEG: eight channels are used, with a view to designing our own device in the near future with the OpenBCI platform. These channels are (according to the 10-20 international placement system) Fp1, Fp2, C3, C4, T5, T6, O1, and O2. The information from other channels is discarded.
Regarding the number of classes used for training and testing each affective state, some previous considerations are needed to build the dataset used in our classifier:

• Most previous works divide each affective state into two classes (this can be observed in the final comparison made in the "Results and Discussion" section). We think that this is not enough, as more classes are needed in order to discretize the different states correctly.
• Among the works that use three classes, the central zone ('neutral') is given greater weight than the extremes ('low' and 'high'); for example, one of the works uses a division of 25% (low), 50% (medium), and 25% (high). We thought that this division was not realistic and, although these values depend on the subject, we made a more balanced division between those classes: we use the lower 30% of values for 'low', the central 40% for 'medium', and the upper 30% for 'high'.
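The three-level division described above can be sketched as follows. This is only an illustration under one stated assumption: the 30%/40%/30% split is interpreted as fractions of the rating scale of each dataset (an alternative reading, not covered here, would be percentiles of the recorded scores). The function name and its parameters are hypothetical.

```python
def discretize(score, scale_min, scale_max, low_frac=0.30, high_frac=0.30):
    """Map a self-assessment score to 'low', 'medium' or 'high'.

    Assumption: the fractions refer to the rating scale
    [scale_min, scale_max], not to the distribution of the samples.
    """
    position = (score - scale_min) / (scale_max - scale_min)
    if position < low_frac:
        return "low"
    if position > 1.0 - high_frac:
        return "high"
    return "medium"


# Example on a hypothetical 0-10 scale
labels = [discretize(s, 0, 10) for s in (1, 5, 9)]
```

Because the scale limits are passed explicitly, the same function can be applied to datasets that use different rating resolutions.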

Affective State Classifier
The proposed system is based on several neural networks (one for each sensor) trained for each of the affective states (Arousal and Valence), obtaining six neural networks that, after being evaluated, are combined in order to improve the classification results. An overview of the proposed system is shown in Figure 2.
For all the neural networks, we use 500 training epochs, a learning rate of 0.001, the Adam optimizer, and the sigmoid function as activation function for all layers except the output (where a softmax is used). The loss function used is the Mean Squared Error.
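As a small worked example of the loss named above, the following sketch computes the mean squared error between a three-class probability vector and a one-hot target. The one-hot encoding of the labels and the helper names are assumptions made for illustration; the training framework actually used is not specified here.

```python
def one_hot(class_index, num_classes=3):
    """Encode a class index (0 = low, 1 = medium, 2 = high) as a one-hot list."""
    return [1.0 if i == class_index else 0.0 for i in range(num_classes)]


def mean_squared_error(predicted, target):
    """Average of the squared differences between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


# A confident, correct prediction for 'medium' gives a small loss
loss = mean_squared_error([0.1, 0.8, 0.1], one_hot(1))
```

A perfect prediction gives a loss of zero, and the loss grows as probability mass moves to the wrong classes.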
In order to make the results more reliable, and because they vary depending on the distribution between the training and testing subsets, each test has been performed 10 times. The information presented in the "Results and Discussion" section corresponds to the mean of those ten repetitions.
In the following sections, this process will be specified step by step, detailing from the extraction of useful information from the sensors to the design of the experiments carried out.

Pre-Processing
First, the information from each dataset is filtered. The creators of the ASCERTAIN dataset specify in an external file the confidence of the information obtained from each video clip of each user, providing information about the noise level and the problems detected. In order to avoid classifier malfunctioning, all the data streams with high noise levels or labeled with low confidence by the creators have been discarded.
After that, the information from both datasets is combined in an equitable proportion for each sensor, obtaining three separate datasets: one for EEG (with eight channels), one for ECG (with two channels), and one for GSR (one channel). The ECG and GSR data are down-sampled to 64 Hz, using a mean window filter of size 4 and 2, respectively. The goal of this filter is to obtain a smoothed signal, reducing the noise of both data streams. However, it cannot be applied to the EEG signal, since it is sampled at 32 Hz in the ASCERTAIN dataset, and reducing the frequency of this signal can cause irreversible loss of information. Thus, after the pre-processing step, three datasets are obtained: the EEG dataset (sampled at 32 Hz), the ECG dataset, and the GSR dataset (both sampled at 64 Hz).
For the labeled information about Arousal and Valence, each dataset discretizes its values into a range between 0 and N-1 (using an N-level resolution that varies according to the dataset). In almost all the works about emotion classification using Arousal and Valence, only two levels are taken into account for each affective state (low/deactivation and high/activation) but, in this case, we discretize both affective states into three levels: LOW, MEDIUM, and HIGH. In the "Results and Discussion" section, the results obtained in this work will be compared with other works, and those differences will be detailed.
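The mean window filter described above can be sketched as follows: consecutive, non-overlapping groups of samples are replaced by their mean, so a window of size 4 takes 256 Hz to 64 Hz and a window of size 2 takes 128 Hz to 64 Hz. The function name is hypothetical, and dropping an incomplete trailing group is an assumption.

```python
def mean_downsample(samples, factor):
    """Replace each non-overlapping group of `factor` samples by its mean.

    Assumption: trailing samples that do not fill a whole group are dropped.
    """
    usable = len(samples) - len(samples) % factor
    return [
        sum(samples[i:i + factor]) / factor
        for i in range(0, usable, factor)
    ]


one_second_ecg = list(range(256))   # placeholder for 1 s of ECG at 256 Hz
one_second_gsr = list(range(128))   # placeholder for 1 s of GSR at 128 Hz
ecg_64hz = mean_downsample(one_second_ecg, 4)
gsr_64hz = mean_downsample(one_second_gsr, 2)
```

Both outputs contain 64 values per second of recording, which matches the common 64 Hz rate stated for the ECG and GSR datasets.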

Feature Extraction
In order to train the system, and according to the explanation given in the Introduction section, a feature extraction process will be applied to each dataset independently. This process will consist of representing the information of each signal in time windows (whose width will be studied for the implementation of the neural networks), in order to extract the frequency characteristics of these windows. With the information represented in the frequency space, the coefficients obtained can be selected to obtain the main features of each time window. These features will be used as inputs of the neural networks.
The classical way of extracting the frequency components of a digitized periodic analog signal is their decomposition using the Discrete Fourier Transform (DFT). However, with physiological signals, this type of transformation is not efficient, since the components of these signals vary their periodicity continuously. For these cases, the Discrete Wavelet Transform (DWT) is used.
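The windowing step can be sketched as follows. The sketch assumes non-overlapping windows and discards an incomplete final window; whether the windows overlap is not stated, so this choice, like the function name, is an assumption.

```python
def split_into_windows(samples, sampling_rate_hz, window_seconds):
    """Cut a signal into consecutive, non-overlapping time windows."""
    size = int(sampling_rate_hz * window_seconds)
    return [
        samples[start:start + size]
        for start in range(0, len(samples) - size + 1, size)
    ]


# 10 s of a placeholder 64 Hz signal split into 4 s windows
signal = [0.0] * 640
windows = split_into_windows(signal, 64, 4)
```

Each window is then processed independently, so a 4 s window at 64 Hz provides 256 samples to the frequency analysis.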
After applying a third-order Daubechies wavelet (db3) to the time window, two coefficient sets are obtained: the approximation coefficients (high frequencies) and the detailed coefficients (low frequencies). As the only high frequencies in a physiological signal come from the device noise or the environment alternating current, these are removed for the final coefficients (so the approximation coefficients are not taken into account). The obtained set of coefficients (detailed coefficients of the first-level decomposition) will be denoted as "Detailed Coefficients of the Original Decomposition" or DCOD. Moreover, to reduce spurious noise from the signal, the lowest frequencies can be removed: to do this, the maximum useful level of decomposition for the given input data length is calculated in order to extract the detailed components of this level (which represent the lowest frequencies of the signal), and those coefficients (denoted as "Detailed Coefficients of the Maximum Decomposition" or DCMD) are removed from the DCOD. Thus, after this processing, the coefficients obtained are separated and filtered, erasing the highest and lowest frequencies from them; the resulting set of coefficients will be denoted as "Detailed Coefficients of the Original Decomposition without Lowest Frequencies" or DCOD-LF. However, the DCMD are stored too, as some basic information (like the signal offset) is given by them.
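The "maximum useful level of decomposition" can be illustrated with a short calculation. The sketch below uses a commonly applied rule, floor(log2(N / (F - 1))), where N is the window length in samples and F is the wavelet filter length (6 taps for db3); that this exact rule was used in this work is an assumption, and the function name is hypothetical.

```python
import math

DB3_FILTER_LENGTH = 6  # number of filter taps of the db3 wavelet


def max_decomposition_level(window_length, filter_length=DB3_FILTER_LENGTH):
    """Largest decomposition level that is still useful for a window.

    Assumed rule: floor(log2(window_length / (filter_length - 1))).
    """
    if window_length < filter_length - 1:
        return 0
    return int(math.floor(math.log2(window_length / (filter_length - 1))))


# 4 s windows: 256 samples at 64 Hz (ECG, GSR) and 128 samples at 32 Hz (EEG)
level_ecg_gsr = max_decomposition_level(4 * 64)
level_eeg = max_decomposition_level(4 * 32)
```

Under this rule, longer windows allow deeper decompositions, which is one reason why the window width affects the features that can be extracted.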
With both processed sets of coefficients (DCOD-LF and DCMD), the features used as input of the neural network can be extracted. We have followed the studies done by JeeEun Lee and Sun K. Yoo [53,54], where different features are analyzed to extract information from physiological signals that have been previously processed by DWT. Following those studies, in our work, we have used these features: zero-crossing of DCOD-LF coefficients (ZC_DCOD-LF), standard deviation of DCOD-LF coefficients (SD_DCOD-LF), zero-crossing of DCMD coefficients (ZC_DCMD), mean of DCMD coefficients (M_DCMD), standard deviation of DCMD coefficients (SD_DCMD), and amplitude of DCMD coefficients (A_DCMD).
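The six features listed above can be computed from two lists of coefficients, as in the following sketch. Several details are assumptions made for illustration: zero-crossings are counted as sign changes between consecutive coefficients, the standard deviation is the population version, and "amplitude" is interpreted as the peak-to-peak range. The function names are hypothetical, and the coefficient lists are taken as already computed.

```python
import statistics


def zero_crossings(coeffs):
    """Count sign changes between consecutive coefficients."""
    return sum(1 for a, b in zip(coeffs, coeffs[1:]) if (a < 0) != (b < 0))


def extract_features(dcod_lf, dcmd):
    """Return [ZC_DCOD-LF, SD_DCOD-LF, ZC_DCMD, M_DCMD, SD_DCMD, A_DCMD]."""
    return [
        zero_crossings(dcod_lf),
        statistics.pstdev(dcod_lf),   # assumption: population std. deviation
        zero_crossings(dcmd),
        statistics.fmean(dcmd),
        statistics.pstdev(dcmd),
        max(dcmd) - min(dcmd),        # assumption: amplitude = peak-to-peak
    ]


# Toy coefficient lists, not real DWT output
features = extract_features([1.0, -1.0, 1.0, -1.0], [0.0, 2.0, 4.0])
```

Each channel of each time window therefore contributes six values to the input vector of its neural network.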

Single Neural Networks
The neural network architectures used in this work are based on a classical Multilayer Perceptron (MLP) network, using an input layer (whose width depends on the number of features extracted from the signal), two hidden layers, and an output layer (with three neurons corresponding to the levels used for the affective state classification: low, medium, and high). A softmax activation function is applied to the output layer, converting the resulting vector into a vector of categorical probabilities. We choose the category with the highest probability as the output in order to obtain a unique classification.
Each network is trained for Valence and Arousal independently, using 90% of the data of each dataset for training and 10% for testing purposes. Moreover, all these neural networks are trained and tested using different sizes of the data time window, varying from 1 s to 10 s. The results are analyzed and the best time window is selected.
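The output stage described above (softmax followed by selection of the most probable class) can be sketched in a few lines. The input values stand for the raw outputs of the three output neurons; the names used are hypothetical.

```python
import math

CLASSES = ("low", "medium", "high")


def softmax(logits):
    """Convert raw output values into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_class(logits):
    """Return the label of the class with the highest probability."""
    probs = softmax(logits)
    return CLASSES[probs.index(max(probs))]


prediction = predict_class([0.1, 2.0, -1.0])
```

In this example the second output neuron has the largest value, so the sample is classified as 'medium'.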
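The evaluation protocol (a 90%/10% split repeated ten times, reporting the mean) can be sketched as follows. The random shuffling, the fixed seed, and the `evaluate` callback, which stands for training a network on the training subset and returning its test accuracy, are assumptions introduced for illustration.

```python
import random


def split_train_test(samples, test_fraction=0.10, rng=None):
    """Shuffle the samples and return (train, test) with a 90%/10% split."""
    if rng is None:
        rng = random.Random()
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def mean_over_repetitions(samples, evaluate, repetitions=10, seed=0):
    """Average the score returned by `evaluate(train, test)` over several splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repetitions):
        train, test = split_train_test(samples, rng=rng)
        scores.append(evaluate(train, test))
    return sum(scores) / len(scores)
```

Averaging over several random splits reduces the dependence of the reported accuracy on one particular distribution of training and testing samples.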

Full Classification System
The work done in this investigation is not limited to evaluating the classification confidence for each affective state using each physiological sensor independently. After the single evaluation detailed before, the two physiological signals with the best results are selected and their coefficients are combined in order to obtain a full classifier that mixes the information from both of them.
Thus, a final MLP neural network (which represents the final classification system) is trained (90% of the data) and tested (10% of the data) using the coefficients from both physiological signals, and, finally, the results for Arousal and Valence are analyzed in detail. The architecture of this neural network is similar to the ones detailed before, with one input layer, two hidden layers, and one output layer. The width of the input and hidden layers depends on the number of coefficients (which in turn depends on the signals selected); so, in order not to anticipate the results of the first part of this work, they will be indicated in the "Results and Discussion" section.

Results and Discussion
First, the results obtained after training each single neural network are shown. As detailed before, the time window evaluated for each physiological signal and affective state varies between 1 s and 10 s in order to determine the best time window for the full classification system.
After presenting the single results, the two best single systems will be combined and tested, obtaining the final results of this work. In this case, the size of the time window will not be evaluated, since it will be chosen from the evaluation of the single neural networks (4 s).
Finally, the results obtained with the full classification system will be compared with the results obtained by similar works in the last few years.

Single Neural Networks Results
As detailed before, each neural network is tested using time windows from 1 s to 10 s wide. The results of training on GSR data for Arousal and Valence are detailed in Tables 2 and 3, respectively. With the GSR neural network, ≈83% accuracy is obtained for Arousal with a loss below 0.10, and ≈76% accuracy is obtained for Valence with a loss below 0.13. Thus, GSR obtains better results when classifying Arousal. Moreover, the best time windows for both classifications are 4 s and 5 s (marked in red). It is important to note that the difference in accuracy and error between training and testing is almost nil, so the network should perform well with new data.
The same process is carried out for the ECG dataset, and the results after training that network are detailed in Tables 4 and 5.
According to the ECG classification results, ≈81% accuracy is obtained for Arousal with a loss around 0.10, and ≈80% accuracy is obtained for Valence, also with a loss around 0.10. Thus, ECG obtains better results when classifying Arousal, but those results are slightly worse than the ones obtained with GSR. However, the results obtained for Valence are significantly better than the ones obtained with GSR. In this case, the best time windows for both classifications are 3 s and 4 s.
Finally, the same process is repeated for the EEG neural network, and the results obtained are detailed in Tables 6 and 7. With the EEG network, ≈75% accuracy is obtained for Arousal with a loss around 0.12, and ≈80% accuracy is obtained for Valence with a loss below 0.10. The results obtained for Arousal are the worst of all physiological signals, so EEG is discarded for that classification. In the case of Valence classification, the training results are similar to the ones obtained with ECG. However, there are two main problems with using the EEG signal for Valence classification:

• The time window needed to obtain such results is very long compared with the ones used for GSR and ECG. Thus, if the full system uses EEG, the data rate of the system will be reduced significantly.
• The difference between the accuracy of the training set and that of the testing set is too high (as is the difference in error). This may mean that the system is overtrained and not ready for new data, so testing it with other users could produce poor results.
Thus, according to the results obtained for each neural network independently, GSR and ECG obtain better results than EEG when classifying Arousal and Valence. In fact, GSR obtains the best results for Arousal and ECG obtains the best results for Valence. Regarding the optimum time window, if we overlap the results from Tables 2-5, the best results are distributed between the 4 s and 5 s windows for GSR and between the 3 s and 4 s windows for ECG; thus, the common time window selected is 4 s.
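The selection of the common time window amounts to intersecting the best windows reported for each signal. A minimal sketch, using only the values stated above (the variable names are hypothetical):

```python
# Best time windows (in seconds) reported for each single network.
best_windows_s = {"GSR": {4, 5}, "ECG": {3, 4}}

# The common window is the one shared by every selected signal.
common_windows_s = set.intersection(*best_windows_s.values())
print(sorted(common_windows_s))
```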
Next, the full classification system is detailed and evaluated.

Full Classification System
According to the results obtained previously, features from GSR and ECG are combined for the full classification system, obtaining an MLP neural network with these characteristics: input layer (6 features × 3 channels = 18 nodes), hidden layer 1 (36 nodes), hidden layer 2 (12 nodes), and output layer (three nodes).
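To make the size of this network concrete, the following sketch builds an MLP with the layer widths given above (18, 36, 12, 3), counts its trainable parameters, and runs one forward pass. It is not the authors' implementation: the hidden-layer activation (ReLU), the random weight initialization, and the input values are assumptions made only for illustration.

```python
import math
import random
from typing import List, Sequence, Tuple

LAYER_SIZES = [18, 36, 12, 3]  # input, hidden 1, hidden 2, output

Layer = Tuple[List[List[float]], List[float]]  # (weights, biases)


def count_parameters(sizes: Sequence[int]) -> int:
    """Weights plus biases of a fully connected network."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


def init_layers(sizes: Sequence[int], seed: int = 0) -> List[Layer]:
    """Random small weights and zero biases (illustrative initialization)."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def forward(layers: Sequence[Layer], features: Sequence[float]) -> List[float]:
    """Forward pass: ReLU on hidden layers (assumed), softmax on the output."""
    activations = list(features)
    for index, (weights, biases) in enumerate(layers):
        z = [sum(w * a for w, a in zip(row, activations)) + b
             for row, b in zip(weights, biases)]
        if index < len(layers) - 1:
            activations = [max(0.0, v) for v in z]
        else:
            shift = max(z)
            exps = [math.exp(v - shift) for v in z]
            total = sum(exps)
            activations = [e / total for e in exps]
    return activations


print("trainable parameters:", count_parameters(LAYER_SIZES))
probs = forward(init_layers(LAYER_SIZES), [0.5] * LAYER_SIZES[0])
print("class probabilities:", [round(p, 3) for p in probs])
```

With these widths the network has only 1167 trainable parameters, which is consistent with the goal of running the classifier on an embedded system.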
Using a 4-s time window, the results obtained are detailed in Table 8. It is important to note that, in order to obtain reliable results, samples from both sensors are not combined randomly: we use the time stamps of each video clip, which are stored with the sensor data, to make sure that each GSR sample is combined with the ECG sample collected in parallel. Moreover, in order to verify that EEG is not suitable for this task, we have created a system that uses the features from the three sensors (EEG, ECG, and GSR). The results obtained after training and testing this system using a 4-s time window are detailed in Table 9. Comparing Table 9 with Table 8 shows that, for the goal of this work, the combination ECG + GSR gives more information about the affective state of the user than the combination of the three sensors (EEG + ECG + GSR).
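The time-stamp pairing described above can be sketched as a join on a shared key. The key used here, a pair (video-clip identifier, window index within the clip), is a hypothetical representation of the stored time stamps, and the feature values are made up; windows present for only one sensor are dropped.

```python
from typing import Dict, List, Tuple

Key = Tuple[str, int]  # hypothetical key: (video-clip id, window index)


def pair_by_timestamp(gsr: Dict[Key, List[float]],
                      ecg: Dict[Key, List[float]]) -> List[Tuple[Key, List[float]]]:
    """Concatenate GSR and ECG features recorded in the same time window."""
    shared = sorted(gsr.keys() & ecg.keys())
    return [(key, gsr[key] + ecg[key]) for key in shared]


# Made-up feature vectors, shortened for readability.
gsr_features = {("clip01", 0): [1.0, 2.0],
                ("clip01", 1): [3.0, 4.0],
                ("clip02", 0): [5.0, 6.0]}
ecg_features = {("clip01", 0): [0.1],
                ("clip01", 1): [0.2],
                ("clip03", 0): [0.3]}

pairs = pair_by_timestamp(gsr_features, ecg_features)
for key, combined in pairs:
    print(key, combined)
```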
After the final training with ECG and GSR, the results show that the classification system obtained by combining GSR and ECG coefficients significantly improves on the results obtained individually by each sensor. The training process shows a growing trend for both Arousal and Valence, as can be observed in Figure 3. Moreover, the error obtained for both affective states is much lower than the ones obtained individually. It is important to note that there is a small difference between the train and test subsets (≈3-4%), but it is not significant, as the results obtained for the test subset are above 90-91%. In order to evaluate the errors made by the classifier for both affective states, confusion matrices for the train and test subsets are shown in Figure 4. As shown there, most of the errors occur between the extreme values (low and high) and the medium value. There are only a few cases where a high value is classified as low or vice versa.
To evaluate our system in more depth, we consider other metrics commonly used for neural network systems: sensitivity, specificity, precision, and F1-score. These metrics have been calculated for the final system composed of ECG + GSR, and the results are shown in Tables 10-13.
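For reference, these four metrics can be derived per class from a three-class confusion matrix in a one-vs-rest fashion, as in the sketch below. The counts in the example matrix are illustrative only and are not results from this work; rows are assumed to be true classes and columns predicted classes.

```python
from typing import Dict, List, Sequence


def per_class_metrics(confusion: Sequence[Sequence[int]],
                      labels: Sequence[str]) -> Dict[str, Dict[str, float]]:
    """One-vs-rest sensitivity, specificity, precision, and F1 per class."""
    total = sum(sum(row) for row in confusion)
    metrics = {}
    for k, label in enumerate(labels):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                 # true k, predicted other
        fp = sum(row[k] for row in confusion) - tp  # true other, predicted k
        tn = total - tp - fn - fp
        sensitivity = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        denom = precision + sensitivity
        f1 = 2 * precision * sensitivity / denom if denom else 0.0
        metrics[label] = {"sensitivity": sensitivity,
                          "specificity": specificity,
                          "precision": precision,
                          "f1": f1}
    return metrics


# Illustrative counts (NOT taken from the paper).
example_confusion: List[List[int]] = [[90, 10, 0],
                                      [8, 84, 8],
                                      [0, 12, 88]]
example_metrics = per_class_metrics(example_confusion, ("low", "medium", "high"))
for label, values in example_metrics.items():
    print(label, {name: round(v, 3) for name, v in values.items()})
```

In this made-up example the medium class scores lowest, because it can be confused with both neighboring classes while each extreme class has only one neighbor.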
As can be observed in the previous tables, the results obtained for the training set are, indeed, better than the ones obtained for the testing set. Even so, the F1-score reaches ≈95% for Valence and Arousal on the training set, and ≈90% on the testing set. Some important aspects that can be observed with those final metrics are detailed next:
The analysis carried out previously is very interesting, since it shows that, in most of the metrics, the MIDDLE class obtains worse results than the others. This may be because this class is not located at either extreme of the scale, so some of its occurrences lie close to the LOW and HIGH classes and can be confused with them. In contrast, the classes located at the extremes of the scale (LOW and HIGH) have only one border where their occurrences can be confused. Moreover, occurrences with high or low values of Valence and/or Arousal usually tend to be extreme cases with extreme values (practically at the limits of their scale), which is why they are commonly classified better than the occurrences of the MIDDLE class.
In order to achieve our final goal, we need the best results we can obtain. It is true that 100% accuracy is not obtained, but a perfect result is very difficult to achieve when working with analog signals (which carry a lot of noise) from multiple users and with multiple samples taken from each of them. Moreover, when working with affective states, the variability between users is huge. For example, for the same video clip, two users can report completely opposite affective states (since the response is completely subjective), and, in addition, the same affective state reported by the same user can fluctuate in intensity. This is why the results obtained can be considered sufficient for our final purpose.

Comparison
Finally, after presenting the results obtained with the full classification system, they are compared with results obtained by similar works in recent years. This comparison is summarized in Table 14. The results of our system are detailed for the training and testing subsets independently, although the other works do not distinguish between them.
One important improvement in the present work is the data resolution: whereas most works use only a two-level classification for Arousal and Valence, we increase it to three levels: LOW, MEDIUM, and HIGH. The works that use a three-level resolution obtain a lower accuracy than our system (maximum values of ≈88% for Valence and ≈91% for Arousal). On the other hand, the work with the best results is the one presented by JeeEun Lee and Sun K. Yoo in 2020 [54], with ≈98% accuracy, but it presents some deficiencies compared with our system that are worth commenting on:
• Data resolution: it uses only two levels for Arousal and Valence (negative and neutral). We use three (low, medium, and high).
• Time window: it uses a 30-s time window, obtaining a maximum processing rate of 0.033 samples per second. We use a 4-s time window, so our system has around a 7.5× data rate (0.25 samples per second).
• Training epochs: the results are provided after a 5000-epoch training. We use only 500 epochs.
• Architecture complexity: it uses four hidden layers and a Recurrent Neural Network (RNN). We use a classical MLP neural network with only two hidden layers, so it is easier to implement in embedded systems.
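The data-rate figures quoted in the time-window comparison follow directly from the window lengths, since one classification is produced per window. A short check of that arithmetic (the function name is hypothetical):

```python
def classifications_per_second(window_s: float) -> float:
    """One classification is produced per time window."""
    return 1.0 / window_s


rate_30s = classifications_per_second(30)  # window used in the compared work
rate_4s = classifications_per_second(4)    # window used in this work
speedup = rate_4s / rate_30s

print(f"30-s window: {rate_30s:.3f} samples per second")
print(f" 4-s window: {rate_4s:.3f} samples per second")
print(f"ratio: {speedup:.1f}x")
```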
Because of all these differences, we can affirm that our system is an improvement over other works with similar purposes. It should be noted that several works have been excluded from this comparison for the following reasons: they do not use physiological signals, they do not classify Arousal and Valence, or they do not provide accuracy results.

Conclusions
After analyzing the information stored from the various physiological sensors that appear in the databases, and after implementing and testing an independent classification system for each one, it is concluded that the GSR and ECG sensors provide more information about the two main affective states: Arousal and Valence. Therefore, in future works, the use of EEG can be discarded, which is an advantage because this sensor is the most complex to use, the one that requires the most preparation, the most expensive, the one with the greatest number of channels (and therefore the most computationally expensive), and the one with the greatest variability between users.
The accuracy results obtained for Arousal and Valence from the system composed of GSR and ECG show that the pre-processing step carried out has been correct for these types of problems, and that the study of the time window for data processing has been decisive to obtain positive results.
Comparing this work with other similar works from recent years, we can find notable improvements in all areas. The architecture used in our system has a low complexity compared with the architectures used in previous works, allowing our system to have a higher data rate than all of them and to be integrated with relative ease into an embedded system. The accuracy obtained by our system clearly surpasses that of most previous works, except for the work presented by JeeEun Lee and Sun K. Yoo in 2020; however, that work has a much more complex architecture and its results are discretized into only two levels, while our work performs a three-level classification: low, medium, and high.
These improvements not only allow for integrating our classifier into an embedded system, as mentioned above, but also for extending the classification of the affective state to high-level emotions like sadness, happiness, nervousness, etc. (according to the classification indicated in the Introduction section and shown in Figure 1) thanks to the improvement in the resolution of the classification.
As indicated at the beginning, our long-term objective is to develop a portable or wearable system for the detection of emotions that can be used as a non-verbal communication learning aid for users with cognitive disabilities. To achieve this goal, our next step is to reduce the device to a more manageable size so that it is more comfortable to use. Furthermore, as can be seen from the results, the EEG sensor is discarded because it obtains the worst classification results. Thus, when designing the device, it is not necessary to mount the EEG helmet and, therefore, the other sensors can be located in a more comfortable area, such as the arm or hand.
Thanks to the advances presented in this work, it is demonstrated that this objective is feasible.

Figure 1. Classification of the different affective states according to Arousal and Valence. Information obtained from Posner et al. [22].

Figure 2. Full system implemented in this work.

Figure 4. Confusion matrices: (top-left) train set with Arousal; (top-right) test set with Arousal; (bottom-left) train set with Valence; and (bottom-right) test set with Valence.

Table 1. Datasets summary (superscripts indicate logging frequency for the used datasets).

Table 2. Results obtained with GSR network classifying Arousal (best time windows are marked).

Table 3. Results obtained with GSR network classifying Valence (best time windows are marked).

Table 4. Results obtained with ECG network classifying Arousal (best time windows are marked).

Table 5. Results obtained with ECG network classifying Valence (best time windows are marked).

Table 6. Results obtained with EEG network classifying Arousal (best time windows are marked).

Table 7. Results obtained with EEG network classifying Valence (best time windows are marked).

Table 8. Full classification system results: GSR and ECG combined.

Table 9. Classification system results with GSR, ECG, and EEG combined, showing that the inclusion of EEG reduces the accuracy.

Table 10. Additional metrics obtained for Arousal (Train) classification with the GSR + ECG system.

Table 11. Additional metrics obtained for Arousal (Test) classification with the GSR + ECG system.

Table 12. Additional metrics obtained for Valence (Train) classification with the GSR + ECG system.

Table 13. Additional metrics obtained for Valence (Test) classification with the GSR + ECG system.

Table 14. Results comparison with previous works.