eSEE-d: Emotional State Estimation Based on Eye-Tracking Dataset

Affective state estimation is a research field that has gained increased attention from the research community in the last decade. Two of the main catalysts for this are the advancement of data analysis using artificial intelligence and the availability of high-quality video. Unfortunately, benchmarks and public datasets are limited, thus making the development of new methodologies and the implementation of comparative studies essential. The current work presents the eSEE-d database, a resource for emotional State Estimation based on Eye-tracking data. Eye movements of 48 participants were recorded as they watched 10 emotion-evoking videos, each of them followed by a neutral video. Participants rated four emotions (tenderness, anger, disgust, sadness) on a scale from 0 to 10; these ratings were later translated into emotional arousal and valence levels. Furthermore, each participant filled in three self-assessment questionnaires. An extensive analysis of the participants' self-assessment scores on the questionnaires, as well as of their ratings during the experiments, is presented. Moreover, eye and gaze features were extracted from the low-level eye-recorded metrics, and their correlations with the participants' ratings are investigated. Finally, we take on the challenge of classifying arousal and valence levels based solely on eye and gaze features, with promising results. In particular, the Deep Multilayer Perceptron (DMLP) network we developed achieved an accuracy of 92% in distinguishing positive valence from non-positive and 81% in distinguishing low arousal from medium arousal. The dataset is made publicly available.


Introduction
Emotions are psychological, cognitive and behavioral states associated with feelings and thoughts. In the literature, we can find various types of models for the quantification of emotions, from classification using basic emotions such as [1], to coding models based on facial movements such as the Facial Action Coding System (FACS) [2], to dimensional models. Among the various ways proposed to categorize emotions, researchers have mainly focused on dimensional scales of emotions [3]. Russell's Circumplex Model of Affect [4] is a two-dimensional space with emotional arousal (EA) and emotional valence (EV) being the two dimensions. EA describes how calming or exciting an emotion is, while EV is the level of pleasantness. The idea behind this two-dimensional space is that every emotion can be represented by its arousal and valence levels.

The remainder of the paper is organized as follows. In Section 3, the set of questionnaires used for the assessment of affect and personality traits is outlined. In Section 3.7, the low-level eye metrics analysis and the algorithmic process for the extraction of eye and gaze features are explained. Afterwards, in Section 4, statistical correlations of personality and eye features with affective responses are investigated in detail. Additionally, the strategy and structure of a machine learning approach for the identification of emotional states based on neural networks is described, and the results from the methods developed for arousal and valence recognition are presented. Furthermore, in Section 5, we interpret our results regarding the statistical and machine learning analyses and benchmark our findings. Finally, the conclusions drawn from the study are recapitulated and discussed.

Eye-Tracking Databases for Emotion Recognition
Interest in eye movement research dates back to the nineteenth century. Eye-tracking research is applied in areas such as cognitive psychology, neuropsychology, usability testing, and marketing [21]. One of the primary goals of machine learning in affective computing is the ability to recognize users' emotions for the purpose of emotion engineering. Methods based on electroencephalography (EEG), face image processing, and voice analysis are among the most prevalent techniques for emotion identification. Even though eye-tracking is rapidly becoming one of the most widely used sensor modalities in affective computing, it is still a relatively new method for emotion detection, particularly when employed exclusively.
The establishment of emotional databases that can be linked to multiple modality signals, stimulus materials, and experimental paradigms is crucial to emotion recognition research. Multiple human senses can be stimulated to develop emotions through the use of audio-visual information employed in multisensory media studies. The examination of facial expressions or neuro-physiological signals has been the primary focus of databases for the research of affect recognition based on visual modalities [22][23][24][25][26]. Yet, despite the fact that eye movements have been shown to be valuable indicators of affective response [18], few researchers have concentrated on the creation of relevant databases. The Eye-Tracking Movie Database (ETMD) [27] is a video-oriented database comprised of 10 participants and annotated with continuous arousal and valence ratings. The twelve (12) movie clips (approximately 3-3.5 min each) used as stimuli were collected from the COGNIMUSE database [27] to elicit various levels of arousal and valence based on six basic emotions, namely happiness, sadness, disgust, fear, surprise, and anger. In addition, the dataset is made available to the public and includes eye-tracking parameters pertaining to gaze and fixation positions as well as pupil size. Nevertheless, the aforementioned database lacks blink-related measurements. Moreover, the video clips comprise random dialogue scenes or sometimes a mix of different scenes from the movie, thus distracting the viewer and weakening their evoked emotions.
The EMOtional attention dataset (EMOd) [28] is a diversified collection of 1019 images that trigger a range of emotions with eye-tracking data taken from 16 people in an effort to investigate the relationship between image sentiment and human attention. In addition, the EMOd contains high-level perceptual qualities, such as elicited emotions, as well as intensive image context labels, including object shapes, object attitudes, and object semantic category. Regarding the eye and gaze tracking metrics, the dataset provides fixation sites, duration, and fixation maps, which are accessible to the public. However, no raw eye and gaze data are provided, thus introducing limitations regarding the management of data for different purposes.
In addition, the NUSEF [29] and CAT2000 [30] datasets are also relevant. NUSEF is a collection of 751 emotive photographs, primarily depicting faces, nudity, and human movement. The CAT2000 training set contains 2000 images depicting various settings, including emotive imagery and cartoons. Yet, these two datasets lack emotion and object labels. In addition, the eye-tracking data acquired from the participants in these two studies is not available to the general public.
Despite their substantial contribution to the scientific community, the databases described have certain limitations. The primary restriction is the limited number of participants and available eye and gaze metrics. This is especially crucial when examining relationships between eye movements and emotions, as some measures, such as blinks, are useful indications of emotional arousal [18]. A second limitation of the datasets mentioned is that they are appropriate for studying the influence of emotional charge on the shifting of attention to the visual stimuli based on fixation saliency maps, but they are not appropriate for studying the correlation of emotional states with gaze patterns and pupil characteristics. Finally, the datasets suffer from a lack of available eye and gaze metrics, which in turn restrains the variety of computational and algorithmic approaches that can be applied.
The limitations of each of the aforementioned datasets demonstrate the need for the development of a new eye-tracking dataset intended for emotion recognition. To this goal, the eSEE-d database presented in this work is a one-of-a-kind resource that can support new eye-tracking analysis for emotion identification research. It is, to the best of our knowledge, the first publicly accessible eye-tracking-based dataset that integrates eye and gaze movement signals with self-assessments of the users while viewing 10 emotion-eliciting video clips (duration: 1-2 min), enabling the evaluation of the effect and relationship of eye movements with emotion and personality. eSEE-d is also the largest eye-tracking database in terms of the number of participants and the quantity of eye and gaze measurements, thus providing the opportunity for many different types of experimentation, both in terms of data management and the development of a variety of machine and deep learning techniques.

Methods
In this section, we detail the methodology used to generate the dataset and the materials utilized in the research.

Participants
The experimental protocol (110/12-02-2021) was submitted and approved by the Ethical Committee of the Foundation for Research and Technology Hellas (FORTH).
There were a total of fifty-six (56) participants in the study. Seven were eliminated because they did not fulfill the inclusion criteria: six had CES-D scores above the threshold and one had binocular visual acuity worse than 0.10 logMAR. One participant was disqualified due to poor quality recordings. All analyses were conducted on the remaining forty-eight (48) subjects (27 female, 21 male). Their average binocular visual acuity at 80 cm was −0.11 ± 0.08 logMAR (range: 0.10 to −0.29 logMAR), their mean age was 32 ± 8 years (range: 18-47 years), and their average education level was 17 ± 2 years (range: 12-21 years).
Exclusion criteria for all subjects included any known ocular disease, spectacle-corrected binocular visual acuity at 80 cm worse than 0.10 logMAR (0.8 decimal acuity equivalent), clinically significant aberrant phorias, any known cardiovascular disease, and a CES-D score of 19 or higher (see Section 3.4).

Video Set
In this work, we decided to use videos as emotion-evoking stimuli. Although picture-based elicitation approaches have certain advantages, such as the ease of acquiring materials and constructing the experimental paradigm, they are incapable of providing adequate continuous stimulation, resulting in suboptimal emotion induction [25]. In contrast, a video-based elicitation method can evoke emotion continuously due to the prolonged stimulus. Hence, the video-based elicitation method can compensate for the shortcomings of picture stimulation, and it has gained considerable attention [31,32].
Ten (10) videos with sound were used in the study, which were obtained from the public database FilmStim [33] and modified to meet the needs of the study. They were cropped so that their duration was shorter than 2 minutes, and no important dialogues were included, since the participants were native Greek speakers and the videos were in English or French.
The dataset consisted of two videos for each of the four chosen emotions (anger, disgust, sadness and tenderness) and two more videos which served as emotionally neutral ones. Table 1 shows the videos used, their duration and their emotion annotation. According to Russell's arousal-valence space (Figure 1), each emotion corresponds to a level of arousal and valence. Anger and disgust are High Arousal-Negative Valence (HANV), sadness is Low Arousal-Negative Valence (LANV) and tenderness is Low Arousal-Positive Valence (LAPV). Neutral is Medium Arousal-Medium Valence (MAMV). The first three emotions were chosen because they are commonly considered to be among the basic emotions [34], they are widely studied in emotion research [35] and their essence is easy to understand [33]. Tenderness, on the other hand, is not considered to be one of the basic emotions, but it has been widely used in recent years in emotion research [36][37][38]. Tenderness and amusement were the only positive emotions available in the FilmStim database, in place of the more generic "happiness". Tenderness is also an attachment-related emotion, and thus, it belongs to another, underrepresented group of emotions. Finally, it is an emotion that can easily be evoked by films [33].

In addition, ten (10) neutral videos were used, one after each emotion-evoking video, with the objective of inducing a relaxing state prior to the next emotion-evoking video and, in parallel, of allowing the previously evoked emotion to fade away. The neutral videos had a duration of approximately 60 seconds, which enabled us to sample the induced emotion before a user's emotional response returns to zero or to a baseline level for the first time (see Section 1 for details).

Emotion Annotations
The videos used were already annotated by [33] regarding the emotion that they evoke. The emotions were mapped by the study group to the emotional arousal and valence levels of the valence-arousal space. This annotation will be referred to as "Objective annotation" from now on.
On the other hand, the "Subjective annotation" was based on each participant's self-assessment after each video. The self-assessment consisted of a 4-item differential emotions scale (DES) covering anger, disgust, sadness and tenderness, each rated on an 11-point scale.
Similar to [39], only self-assessment scores equal to or higher than 4 were accepted as a significant indication of the presence of a specific emotion. A self-assessment score lower than 4 was treated as emotionally neutral. An emotion was selected as "Subjective annotation" if it received a higher rating (at least 1 point) than the other three emotions [37].
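The annotation rule above can be summarized in a short sketch. The function name and input format are illustrative, not taken from the study's code; ratings that do not satisfy both conditions are treated here as neutral, which is an assumption for the tie case.

```python
# Illustrative sketch of the "Subjective annotation" rule: the winning emotion
# must score >= 4 and exceed each of the other three emotions by >= 1 point.
def subjective_annotation(ratings: dict) -> str:
    """Return the annotated emotion, or 'neutral' if no emotion qualifies."""
    top_emotion, top_score = max(ratings.items(), key=lambda kv: kv[1])
    if top_score < 4:            # scores below 4 are treated as emotionally neutral
        return "neutral"
    others = [s for e, s in ratings.items() if e != top_emotion]
    if all(top_score - s >= 1 for s in others):
        return top_emotion
    return "neutral"             # assumption: ties/ambiguous cases count as neutral
```

For example, ratings of anger = 7, disgust = 3, sadness = 2, tenderness = 0 yield "anger", while a 7-7 tie between two emotions yields "neutral" under this sketch.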

Demographics And Psychoemotional Scales
All participants were asked to fill in the following set of questionnaires, in this order:

1. Clinical cardiovascular record: a questionnaire that consisted of two questions: whether the participant had a cardiovascular record and whether he or she was taking any related medication.

2. CES-D scale: a short self-report scale designed to measure depressive symptomatology [40,41]. It consists of 20 items that ask the participant to rate how often over the past week they experienced symptoms associated with depression, such as restless sleep, poor appetite, and feeling lonely. A score greater than 19 was used as the cutoff score in order to identify individuals at risk for clinical depression.

4. STAI-trait test: the State-Trait Anxiety Inventory (trait) is a commonly used measure of trait anxiety [42,43]. It consists of 20 items that are rated on a 4-point scale. It evaluates the anxiety of the participant during the last six months.

6. STAI-state test: the State-Trait Anxiety Inventory (state) is a commonly used measure of state anxiety [42,43]. It consists of 20 items that are rated on a 4-point scale. It evaluates the anxiety that the participant feels at the moment of the assessment.

Materials and Setup
The videos were presented on a computer screen (DELL, 24 in., 1280 × 720) at 80 cm distance from the participant, as shown in Figure 2. Wireless headphones were used, and the sound volume was set to a predetermined level; however, participants were asked whether they were comfortable with the sound volume, and it was adjusted if necessary.
Eye-tracking measures were captured using the Pupil Labs "Pupil Core" eye tracker [46]. The binocular recordings had a sample rate of 240 Hz, an accuracy of 0.60 deg., and a precision of 0.02 deg. To minimize head movements, all measures were taken with the subjects seated in a chair with their heads supported by a chin and head rest. Visual acuity was determined using European-wide standardized logMAR charts [47]. Stereopsis was assessed with the cover test.
With the room lights on, controlled photopic lighting conditions were created for recording purposes. The corneal illuminance was 400 lux when the screen was off and 450 lux when the screen was blank.

Experimental Procedure
All participants read and signed an Informed Consent Form before the trial. The participants were then escorted to the laboratory. Subsequently, tests of binocular visual acuity at 80 cm and stereopsis were administered.
Subsequently, participants were asked to complete the above-mentioned questionnaires on the computer screen.
In case there was a cardiovascular record or a CES-D score over 19 (risk for clinical depression), the procedure ended and the participant was excluded. Otherwise, the procedure continued.
Following the questionnaire part, the experimental part commenced. After the gaze-tracker calibration, the guidelines were presented and the participants were informed that they could stop the video at any time, just by pressing the Space button, if they did not feel comfortable. Next, the emotion-evoking videos were presented in a randomized order. After each emotion-evoking video, participants were presented with a neutral video of about 1 min, so that the evoked emotion faded away before the next emotion-evoking video. After the neutral video, the questionnaire for the emotion self-assessment was presented to the participants. We purposely introduced this one-minute period before participants rated their feelings in response to the previous emotion-evoking clip under a self-assessment protocol, aiming to simulate real-world scenarios. This time period enables us to sample the emotion before a user's emotional response returns to zero or to a baseline level for the first time. The design of the study is shown in Figure 3.
Throughout the video-viewing procedure, a member of the research team monitored the gaze-tracker's output on a second monitor, in case any anomalies appeared in the recordings or the participant needed additional assistance. Every precaution was taken to safeguard the participants and the study team during the SARS-CoV-2 pandemic and to prevent the spread of the virus.

Data Analysis Methodology
In this subsection, the algorithm used to calculate and interpret eye and gaze-related features from the raw data captured by the gaze-tracking device is described.

Raw Eye-Tracking Data
The raw gaze points from the Pupil Core are processed and analyzed for each recording sequence to confirm that participants viewed the entire video scene each time and did not glance away from the screen or close their eyes for a period longer than the average blink time in order to avoid viewing. Specifically, we excluded every recording during which the participant closed his/her eyes for a duration longer than 10% of the total clip duration, as blinking causes around 5-10% data loss during a recording [48]. Additionally, we considered faulty those recordings in which the participants' attention was decoupled from the screen stimuli (mind-wandering phenomenon) for a length greater than 20% of the overall duration of the corresponding video clip [49]. The 20% cutoff was selected based on the evidence provided by a range of studies indicating that mind wandering occurs at least 20% and up to 50% of the time, even during tasks that are not designed to induce it [50,51]. Additionally, if a participant abruptly paused the video clip, the relevant recording was deleted. As a result, the final dataset consisted of 476 recordings, which include those obtained for each of the valid emotion-evoking videos watched by the 48 study participants.
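The screening criteria above can be condensed into a small validity check. The function and its inputs are hypothetical; only the thresholds (10% eye closure, 20% off-screen attention, manual pause) come from the text.

```python
# Sketch of the recording-validity screening: a recording is kept only if the
# participant did not pause the clip, kept their eyes closed for <= 10% of the
# clip, and looked away (mind-wandering) for <= 20% of the clip.
def recording_is_valid(eyes_closed_s: float, off_screen_s: float,
                       clip_duration_s: float, was_paused: bool) -> bool:
    if was_paused:                                # clip aborted by the participant
        return False
    if eyes_closed_s > 0.10 * clip_duration_s:    # > 10% eye closure
        return False
    if off_screen_s > 0.20 * clip_duration_s:     # > 20% off-screen attention
        return False
    return True
```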
The output of the gaze tracker used in this study includes the gaze positions (x,y coordinates), the blink timings (start and end times), and the pupil diameter in millimeters. These metrics comprise various sorts of noise originating from both the eye tracker and the participants, as it is widely known that when collecting gaze-related data, there is typically some noise owing to eye blinking and an inability to capture corneal reflections [52]. Thus, filtering and denoising must be applied to the eye movement data in order to eliminate this undesired variance.
The raw gaze coordinates, which are in the form of normalized pixels, are converted into degrees of visual angle, and the instantaneous sample-to-sample gaze movement between two consecutive gaze points is calculated, leading to the calculation of the angular velocity, given the sampling frequency F_s. To reduce velocity noise, we utilized the 5-tap velocity filter provided by [53], which smooths the velocity signal while preserving large velocity peak values (i.e., during a saccadic movement).
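A minimal sketch of this conversion and velocity computation follows. The screen dimensions (a 24 in. 16:9 display at 80 cm, per the setup) and the per-axis arctangent conversion are assumptions; the study's exact conversion may differ, and the 5-tap velocity filter is omitted here.

```python
import numpy as np

FS = 240.0           # eye-tracker sampling frequency (Hz), per the setup
SCREEN_W_CM = 53.1   # assumed width of a 24 in. 16:9 display
SCREEN_H_CM = 29.9   # assumed height of a 24 in. 16:9 display
DIST_CM = 80.0       # viewing distance, per the setup

def norm_to_deg(x_norm, y_norm):
    """Convert normalized screen coordinates (0-1) to degrees of visual
    angle relative to the screen centre, per axis."""
    x_cm = (np.asarray(x_norm, dtype=float) - 0.5) * SCREEN_W_CM
    y_cm = (np.asarray(y_norm, dtype=float) - 0.5) * SCREEN_H_CM
    return np.degrees(np.arctan2(x_cm, DIST_CM)), np.degrees(np.arctan2(y_cm, DIST_CM))

def angular_velocity(x_norm, y_norm, fs=FS):
    """Sample-to-sample angular velocity in deg/s."""
    x_deg, y_deg = norm_to_deg(x_norm, y_norm)
    dtheta = np.hypot(np.diff(x_deg), np.diff(y_deg))  # angular displacement per sample
    return dtheta * fs
```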

Fixation And Saccade Detection
Fixation detection algorithms categorize gaze data according to dispersion, velocity, and acceleration parameters (or combinations thereof) [53,54]. Fixations and saccades are recognized based on the Velocity-Threshold Identification (I-VT) algorithm proposed by [54], due to its superiority when sample-by-sample comparisons are taken into account [55]. Furthermore, to determine the length of the fixations, we added an extra minimum duration threshold. The steps of the algorithm are as follows:

1. Compute point-to-point velocities for every protocol point.

2. Mark every point below the velocity threshold as a fixation point and every other point as a saccade point.

3. Collapse consecutive fixation points into fixation groups based on fixation time, omitting saccade points.

4. Map each fixation group to a fixation at the centroid of its points.

5. Return the detected fixations.

In the I-VT algorithm, the velocity threshold for saccade detection was set to 45 deg./s, as in [55]. In addition, the minimum fixation duration threshold was set at 55 ms [56].
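The thresholding and grouping steps above can be sketched as follows. This is a simplified instance of I-VT with the study's thresholds (45 deg/s, 55 ms); the input format and function name are assumptions, and the centroid-mapping step is reduced to returning sample index ranges.

```python
import numpy as np

VT = 45.0          # velocity threshold for saccade detection (deg/s)
MIN_FIX_S = 0.055  # minimum fixation duration (55 ms)
FS = 240.0         # sampling frequency (Hz)

def ivt_fixations(velocity_deg_s):
    """Return (start, end) sample-index pairs of detected fixations."""
    is_fix = np.asarray(velocity_deg_s) < VT    # below threshold -> fixation point
    fixations, start = [], None
    for i, f in enumerate(is_fix):
        if f and start is None:
            start = i                            # a fixation group begins
        elif not f and start is not None:
            if (i - start) / FS >= MIN_FIX_S:    # enforce minimum duration
                fixations.append((start, i - 1))
            start = None
    # close a fixation group that runs to the end of the recording
    if start is not None and (len(is_fix) - start) / FS >= MIN_FIX_S:
        fixations.append((start, len(is_fix) - 1))
    return fixations
```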

Pupil and Blink Detection
The pupil diameter and blink timings determined by the eye tracker contribute to the extraction of additional pupil- and blink-related information. The pupil recognition algorithm locates the dark pupil in the infrared-illuminated eye camera frame [46]. The algorithm is not affected by corneal reflection and can be used by participants wearing contact lenses or spectacles. Based on input from user-submitted eye camera footage, the pupil recognition algorithm is constantly being improved. Blink start and end times are determined based on a confidence threshold corresponding to the effective detection of the pupil area. In total, 28 eye and gaze features are extracted based on fixation, saccade, blink and pupil characteristics; they are presented in Table 2.

Pupil Estimation Implications
The effect of emotional arousal and valence on pupil size is complex due to the fact that pupil diameter and its variation are highly reliant on various factors, including lighting conditions [5,57], the luminance of the movie [14] and the adapting field size [58,59]. There have been attempts to remove the movie luminance effect on pupil diameter [60]. Recent research involves deriving the estimated pupil diameter from the measured pupil diameter using the V component of the HSV color space [14].
In the present work, the experimental setup was designed to minimize this effect as much as feasible. First, the room's lighting settings were adjusted to be photopic so that any brightness variations in the film would be insignificant. Second, for the same reason, the participant's distance from the screen was quite large (80 cm). Corneal illuminance during films was measured and varied between 390 and 411 lux, showing little variance among the films.
To determine if the influence of the film's luminance was sufficiently low, a linear regression analysis was performed between pupil diameter and the V component of the HSV color space for each video. Among the 48 participants, 24 showed very weak correlation (r < 0.20), 20 showed weak correlation (0.20 < r < 0.40) and 4 showed moderate correlation (0.41 < r < 0.47). The levels of correlation were set based on [61].
The results of the linear regression analysis were considered satisfactory, since no strong correlation between pupil diameter and the V component was found for any of the participants. Thus, all pupil diameter analyses were carried out on the pupil diameter acquired directly from the gaze tracker.
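The luminance-control check described above can be sketched as a per-video regression of pupil diameter on the V component. The function is illustrative, and the grading boundaries are a simplified version of the cutoffs in the text (very weak r < 0.20, weak 0.20-0.40, moderate above that).

```python
import numpy as np
from scipy.stats import linregress

def luminance_correlation(pupil_mm, v_component):
    """Regress pupil diameter against the per-frame V (HSV) component and
    grade the correlation strength (simplified cutoffs per [61])."""
    r = abs(linregress(v_component, pupil_mm).rvalue)
    if r < 0.20:
        return r, "very weak"
    elif r <= 0.40:
        return r, "weak"
    else:
        return r, "moderate"
```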

Results
In this section, the study's findings are provided. First, a statistical analysis is presented, followed by a machine learning analysis to detect any correlation between ocular characteristics and arousal and valence levels.

The participants' questionnaire scores are summarized in Table 3. An independent samples t-test showed that only the emotional empathy score was statistically significantly different between men and women (t(47) = 3.538, p = 0.001), with women showing a higher score by 0.58. No correlation was found between any of the scales and age or education level (p > 0.092).

Table 4 shows the Hit Rate and the Mean Rating for each target emotion. Hit Rate is the percentage of the videos for which the participants had indicated that they felt the target emotion (objective annotation) at least one point more intensely than any of the other three untargeted emotions [39]. Mean Rating is the mean value of the scores for each target emotion. Its scale is from 0 to 10, as decided based on methodological and theoretical criteria [62].

The relation between objective and subjective arousal is presented in Figure 4, where the subjective annotations are displayed with reference to the objective ones. Each pie chart shows the distribution of the subjective annotations corresponding to the target emotion class of the valence-arousal space. For this analysis, the valence-arousal space is divided into four quadrants, and each emotion is placed in the space according to its arousal and valence levels. The quadrants are "High Arousal-Positive Valence (HAPV)", "High Arousal-Negative Valence (HANV)", "Low Arousal-Negative Valence (LANV)" and "Low Arousal-Positive Valence (LAPV)". The origin of the system is "Medium Arousal-Medium Valence". Based on [39], we evaluated the "Discreteness" of the videos, i.e., whether the rating of the emotion that was identified as the "subjective annotation" was statistically significantly greater than the ratings of the rest of the emotions.
We used a t-test to make pairwise comparisons between the subjective annotation and each of the remaining emotions. In all comparisons, the rating of the emotion that was identified as the "subjective annotation" was statistically significantly greater than the ratings of the rest of the emotions (p < 0.001). Table 5 shows the mean ratings of each emotion in each subjective annotation.

As our dataset contains several samples of the same subject at the same level of arousal or valence, it comprises dependent observations. Hence, an analysis based on the assumption of independent observations within or between groups, such as ANOVA, is rejected. In order to discover which ocular characteristics are affected by the arousal and valence levels, a Mixed Linear Model (MLM) evaluation was conducted on each feature individually. Mixed Linear Models are an extension of simple linear models that evaluate both fixed and random effects; they provide more accurate estimates of the effects, better statistical power and non-inflated Type I errors compared to traditional analyses [63]. In the MLM analysis, the arousal or valence level was selected as the fixed factor, while the participant ID was selected as the random factor. In addition, a Bonferroni post hoc test was performed to compare the classes.
In Tables 6 and 7, only the features that were affected in a statistically significant manner by the arousal and valence levels are presented. The Bonferroni post hoc test showed that five features were statistically significantly different between low and medium arousal level, six features were different between low and high, and nine features were different between medium and high. As far as the valence level is concerned, seven features were statistically significantly different between negative and medium valence level, three features were different between negative and positive, and six were different between medium and positive.
In order to evaluate the simultaneous effect of the arousal and valence levels on each eye feature, the subjective emotion class of the valence-arousal space was selected as fixed factor in the Mixed Linear Model (MLM) analysis. Again, a Bonferroni post hoc test for evaluating features among the classes was performed. Table 8 shows the features that were affected in a statistically significant manner by the emotion class. One feature was statistically significantly different between LAPV and HANV, six features were statistically significantly different between MAMV and HANV and between MAMV and LAPV, five were statistically significantly different between LANV and HANV, two were statistically significantly different between LANV and MAMV, and four were statistically significantly different between LANV and LAPV. Estimated mean values of the Mixed Linear Model and standard errors of all eye features for three classes (low, medium, high) of emotional arousal and three classes of emotional valence (negative, medium, positive) are presented in Table 2.
When the self-assessment scores were added to the Mixed Linear Model, no statistically significant effect on any of the eye features was found.
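The per-feature MLM described above can be sketched with statsmodels: the eye feature is the response, the arousal (or valence) level is the fixed factor, and the participant ID is the random factor (a random intercept). The column names and the synthetic data below are hypothetical, used only to illustrate the model form.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def mlm_fixed_effect_pvalues(df: pd.DataFrame, feature: str,
                             level_col: str = "arousal_level"):
    """Fit feature ~ C(level) with a per-participant random intercept and
    return the p-values of the fitted terms."""
    model = smf.mixedlm(f"{feature} ~ C({level_col})", df,
                        groups=df["participant_id"])
    return model.fit().pvalues

# Synthetic illustration: 20 participants, three arousal levels, a feature
# with a genuine level effect plus a per-participant random offset.
rng = np.random.default_rng(0)
rows = [{"participant_id": pid, "arousal_level": lvl,
         "blink_freq": 3.0 + eff + p_off + rng.normal(0, 0.1)}
        for pid in range(20)
        for p_off in [rng.normal(0, 0.2)]
        for lvl, eff in [("low", 0.0), ("medium", 0.5), ("high", 1.0)]]
pvals = mlm_fixed_effect_pvalues(pd.DataFrame(rows), "blink_freq")
```

With the baseline level chosen alphabetically ("high"), the fitted terms compare the other levels against it, and a Bonferroni correction would then be applied across features as in the text.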

Feature Selection Process
Even though deep learning-based models enable a feature extraction process, it can be beneficial to remove irrelevant features before training the model. This may reduce memory and time consumption, since deep learning procedures usually require a large amount of data. Moreover, feature selection can enhance the model's ability to learn the most significant features, allowing the least significant ones to be excluded from future data collection and resulting in improved performance [64].
For the machine learning analysis presented in Section 4.3, we developed a model including specific eye-tracking features for the estimation of the levels of arousal and valence, as presented in Section 4.1. To this goal, we created three different feature sets for our predictive models. Specifically, we produced a correlation matrix of the features that were statistically significantly different among the arousal and valence levels and the emotion classes (Tables 6 and 7), and from the pairs that were highly correlated (r > 0.3), we kept the features that better distinguish among the levels of arousal and valence. Thus, for the estimation of the levels of EA, the feature set includes the metrics of Fixation and Blink frequency, Pupil diameter, Fixation duration kurtosis, Saccade duration variation and PD variation. For the identification of EV levels, our feature set comprises eight eye-tracking metrics: Fixation and Blink frequency, Saccade amplitude and duration, Pupil diameter, Fixation duration kurtosis, Saccade duration variation and Pupil diameter kurtosis. Finally, for the synchronous estimation of EA and EV levels, the selected feature set includes the metrics of Saccade and Blink frequency, Saccade amplitude, Pupil diameter, Fixation duration kurtosis, Saccade duration variation and Pupil diameter kurtosis.
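The correlation-filtering step above can be sketched as follows. This is a simplified greedy variant: from each pair with |r| > 0.3 it keeps the first feature encountered, whereas the study kept the feature that better separated the classes; the function name is illustrative.

```python
import pandas as pd

def drop_correlated(df: pd.DataFrame, threshold: float = 0.3):
    """Greedily keep features whose absolute pairwise correlation with all
    already-kept features stays at or below the threshold."""
    corr = df.corr().abs()
    kept = []
    for col in df.columns:
        if all(corr.loc[col, k] <= threshold for k in kept):
            kept.append(col)
    return kept
```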

Machine Learning Analysis
This section examines the relationships of fixation, saccade, blink, and pupil-related ocular characteristics with arousal and valence levels. The EV outcome measure can take on three values: negative, neutral, or positive, while the EA response variable is set to high, medium, or low. Specifically, the EA instances low, medium and high are marked as classes LA, MA and HA, respectively. In the same manner, negative, neutral and positive EV examples are denoted as classes NV, MV and PV, respectively.
In our work, we used a set of DMLP neural networks written in the Python 3.8 environment for our models and generated a partition with 80% of the samples for training the networks and the remaining samples for testing, according to the Pareto principle [65]. Figure 5 illustrates the basic structure of our neural networks. The first layer contains N ∈ {6, 7, 8, 28} neurons, based on the size of the data vector per experiment, followed by m hidden layers, where m ∈ Z ∩ [2, 3]. Additionally, each hidden layer contains n neurons, where n ∈ Z ∩ [8, 256]. All hidden layers are followed by a dropout layer with a 0.25 rate and use the Rectified Linear Unit (ReLU) as their activation function due to its computational efficiency [66]. Moreover, for the last layer, we used the Sigmoid function for the binary classification procedure and the Softmax function for the multi-class problems [67]. The optimizer used to accelerate the learning process is "Adam", with binary cross-entropy as the loss function. Each network is characterized by a single output y.
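The architecture above can be illustrated with a plain NumPy forward pass: ReLU hidden layers with inverted dropout (rate 0.25, active only during training) and a sigmoid output for the binary case. The weights below are random and untrained; this sketches the network structure only, not the study's Adam/binary cross-entropy training loop.

```python
import numpy as np

def dmlp_forward(x, weights, biases, training=False, drop=0.25, rng=None):
    """Forward pass through an MLP with ReLU hidden layers, optional
    inverted dropout, and a sigmoid output layer."""
    rng = rng or np.random.default_rng(0)
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(0.0, h @ W + b)          # ReLU hidden layer
        if training:                            # inverted dropout (rate 0.25)
            mask = rng.random(h.shape) >= drop
            h = h * mask / (1.0 - drop)
    logits = h @ weights[-1] + biases[-1]
    return 1.0 / (1.0 + np.exp(-logits))        # sigmoid output

# Example: N = 6 input features (the EA feature set) and two hidden layers
# of 8 neurons, i.e., the smallest configuration described in the text.
rng = np.random.default_rng(42)
shapes = [(6, 8), (8, 8), (8, 1)]
weights = [rng.normal(0, 0.1, s) for s in shapes]
biases = [np.zeros(s[1]) for s in shapes]
y = dmlp_forward(rng.normal(size=(5, 6)), weights, biases)
```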
In this study, several neural network topologies were compared. Different numbers of hidden neurons were used for training and testing in order to discover the optimal network layout for each classification attempt. The network with the smallest number of hidden neurons and the smallest testing error was selected as the optimal one. The resulting neural network of reduced complexity was then trained and evaluated again. We next conducted a 10-fold cross-validation and evaluated the performance of the neural networks using the average F1-score, Area Under the Curve (AUC), and accuracy (rate of correctly classified testing data). The results achieved by the best-performing neural network architectures are presented in Tables 9-11. Table 9 presents the results of identifying the various levels of EA, Table 10 presents the results of discriminating between the different EV levels, while Table 11 provides the results of the synchronous identification of arousal and valence levels. From Table 9, we observe that the selected neural network is capable of detecting the presence of a high emotional arousal level with an accuracy of 74%. However, without the presence of the medium class, it is more challenging for the neural network to discern between high and low arousal. Such behavior is not observed when HA is separated from MA or when MA is separated from LA. Additionally, in the multi-class categorization of the three classes HA, MA, and LA, the network's percentage of correct predictions is maintained at 74% by adding a third hidden layer. Regarding the objective of identifying and categorizing the different levels of emotional valence, the PV class is sufficiently distinct from the MV and NV classes. In particular, class PV is differentiated from NV and MV at rates of 82% and 90%, respectively. In comparison, NV is only weakly discriminated from MV.
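The 10-fold evaluation loop used to compare topologies can be outlined generically. This is a hedged sketch of the procedure, not the paper's exact pipeline: `train_eval_fn` stands in for whatever model is trained per fold, and only accuracy is shown (the paper additionally averages F1-score and AUC).

```python
import numpy as np

def kfold_accuracy(X, y, train_eval_fn, k=10, seed=0):
    """Average test accuracy over k folds.
    train_eval_fn(X_train, y_train, X_test) -> predicted labels for X_test."""
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        y_pred = train_eval_fn(X[train], y[train], X[test])
        scores.append(np.mean(y_pred == y[test]))  # fold accuracy
    return float(np.mean(scores))
```

The reported per-topology score is then this average over folds, and the smallest network within testing-error tolerance is retained.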
The addition of an extra hidden layer appears to affect the success rate of predictions when categorizing the three classes. An additional approach was examined in order to investigate the potential of creating a combined model to synchronously identify the levels of EA and EV. Based on the annotation given by the participants, we split the response variable into four classes depending on the combined arousal and valence levels. The respective classes, the structure of the network, as well as the results are presented in Table 11. It is worth noting that a different method was also explored in relation to the models constructed and discussed in the preceding paragraphs. This strategy entailed the inclusion of the scores from the five questionnaires in the input layer of the networks. We executed the neural networks with each of the five questionnaires' scores added to the input layer, both separately and in aggregate, to determine their effect on the models' performance. The results indicated that adding the scores had no discernible effect on the models' efficiency, neither increasing nor decreasing it.

Discussion
In this work, we presented an eye-tracking dataset for the assessment of emotional arousal and valence levels based solely on eye and gaze characteristics. The dataset consists of eye and gaze-recording signals from 48 participants who viewed 10 emotionally evocative videos. The participants scored each emotional clip according to four basic emotions, which were then translated into arousal and valence levels.
Our statistical analysis revealed that numerous characteristics can be used to distinguish between arousal and valence levels (Tables 6 and 7). Based on the existing research, we anticipated that pupil diameter, blink frequency, and fixation duration would be the characteristics that best predict arousal and valence levels. This hypothesis was validated, although other variables, such as fixation frequency and saccade amplitude, also contributed considerably to our prediction model. As hypothesized, our attempt to discriminate emotional states revealed that High Arousal-Negative Valence (HANV) and Low Arousal-Positive Valence (LAPV) could be distinguished with excellent accuracy. Beyond this comparison, additional characteristics such as fixation frequency, fixation duration, fixation duration kurtosis, saccade frequency, saccade amplitude, saccade duration variation and skewness, blink frequency, and pupil diameter can be used to distinguish HANV from other emotional states. Low Arousal-Negative Valence (LANV) can be discriminated from other emotional states by saccade amplitude, saccade velocity skewness and kurtosis, saccade duration, and pupil diameter (Table 8). In general, it appears that pupil diameter, blink frequency, fixation frequency, and saccade amplitude were strongly influenced by arousal, valence, and emotional states. These results demonstrate the ability of the eye-tracking data collected during the designed experimental protocol to indicate the level of EA and EV.
Regarding our neural network analysis, the highest success rate was observed during the binary classification between not-positive and positive emotional valence levels, achieving 92% accuracy, with PV effectively distinguished from the other two levels. However, integrating the "neutral" class proved challenging, resulting in a significant decrease in network performance, especially when differentiating NV from MV. With an additional hidden layer added to the shallow neural network model, the network was still able to reliably predict the three levels of EV despite a decline in performance. In terms of predicting emotional arousal levels, both binary and multi-class identification tests provided positive results, with up to 81% accurate predictions; nevertheless, the LA class had a significant impact on the networks' performance. Positive results also demonstrated that a combined model is able to predict synchronously the levels of EA and EV with a 72% success rate.
Overall, we tested a wide variety of eye features related to fixations, saccades, blinks and pupil, and investigated the potential of several neural network models on different classification scenarios. In agreement with prior research, our findings reveal the effect of emotional charge on eye movements and pupillary responses. Specifically, in line with the ideas of [11,12,16], high arousal and positive valence levels can be successfully identified based solely on eye-tracking features. Furthermore, when comparing our work to those of [13][14][15], our results demonstrate significant improvements in terms of accuracy, which can be attributed in part to the significantly larger size of the training database. In addition, we have verified that using neural networks for emotion level estimation produces encouraging results, superior to [5,17], thus indicating potential toward the development of an emotion identification system with high discretization ability.
Although the videos used in this study were obtained from a public database, they are extracted from well-known film productions. Hence, it could be argued that some participants' familiarity with a certain video could result in them not being as emotionally charged as they would have been upon first viewing. One more potential limitation of our study is that the participants rated the videos based on the four eliciting emotions rather than rating directly the levels of arousal and valence, which were set later in accordance with the emotion ratings. Moreover, the inclusion of tenderness as a target emotion at the expense of the basic emotion of happiness means that the set did not contain any high-intensity positive emotions and thus may result in a relative imbalance. However, tenderness and amusement were the only positive valence emotions in the FilmStim database. Furthermore, it is important to take into consideration that evoking high intensity-positive valence emotions in the subjects within a lab setting, and solely based on movie clip content, is extremely difficult and would probably lead to eliciting neutral valence emotions labeled wrongfully as high valence ones (i.e., happiness), thus diminishing the integrity of eSEE-d.
We used Hit Rate as a variable that indicates the relation between the objective and the subjective annotation, and we saw that only 53% of the videos that were objectively annotated as anger were also subjectively annotated as anger. Another 20% were annotated as sadness and another 12% as disgust. This result confirms a rather old finding that anger is difficult to discretely elicit with brief films because it tends to co-exist with other negative emotions [39]. This limitation does not directly affect our further results because all analysis was performed with the subjective annotation as a basis, not the objective one. This co-existence of emotions, however, may have an impact on the subjective annotation as well. The DES self-assessment questionnaire gives the participants the choice to rate more than one emotion, but we defined subjective annotation as the emotion that received a higher rating (by at least 1 point) than the other three emotions. In order to evaluate our definition of subjective annotation, we performed pairwise comparisons between the subjective annotation and each of the other emotions. The t-test shows that although there is co-existence of the negative emotions, the rating of the emotion set as subjective annotation was statistically significantly greater than those of the other emotions, confirming that the subjective annotation reflects mainly the corresponding emotion.
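The subjective-annotation rule described above (the winning emotion must exceed every other rating by at least 1 point) can be expressed as a small helper. This is a sketch of the stated rule only; the function name and the handling of ties (returning no dominant emotion) are our own illustrative choices.

```python
def subjective_annotation(ratings, margin=1):
    """Return the emotion whose rating exceeds all others by at least
    `margin` points, or None when no single emotion dominates.
    ratings: dict mapping emotion name -> 0-10 DES rating."""
    best = max(ratings, key=ratings.get)
    others = [v for k, v in ratings.items() if k != best]
    if all(ratings[best] >= v + margin for v in others):
        return best
    return None  # co-existing emotions, no clear subjective annotation

label = subjective_annotation(
    {"anger": 8, "sadness": 6, "disgust": 3, "tenderness": 0})
```

With the example ratings above, anger dominates by at least 1 point and is returned as the subjective annotation; a tie between two negative emotions would yield no dominant label.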
The FilmStim database consists of both color and black-and-white videos, and videos from both categories were selected for the present study. Pupil size and pupil responses are known to be greatly dependent on luminance contrast, which is why the study was set up so that the luminance effect would be minimized; this is confirmed in Section 3.7.4. Although some studies have shown that pupil size depends on color as well as luminance [68], it is accepted that pupillary responses are affected mainly by luminance contrast and not color contrast [69]. Thus, since the luminance effect was shown to be insignificant in our study, the color effect would be even lower. It would nevertheless be interesting to have an exact evaluation of this effect in our study.
A final issue that needs to be pointed out is the fact that there was a 1-min time interval between the emotion-evoking video and the self-assessment questionnaire. This approach was selected to imitate, as closely as feasible, a real-world setting in which a person would be unable to appraise their emotions immediately after their manifestation. Emotions are dynamic, complicated processes that develop over time, so a complete theoretical understanding of how they operate cannot be attained until their time-related characteristics are well understood [70]. A recent review article [71] presents research on the determinants of emotion duration, building on a previous preliminary review [72]. One central temporal characteristic of emotions is the duration of emotional experience, which has been defined as the amount of time that elapses between the beginning and end point of an emotional episode. In contrast to moods, emotions begin with the occurrence of an external or internal event [73], although the beginning of the feeling does not always correspond with the commencement of the event. An emotional episode concludes when the intensity of the emotional response returns to zero or to a baseline level, either for the first time [74], for several consecutive times [75], or permanently [73].
Although duration definitions and associated metrics vary across the available studies, the first definition (i.e., initial return to zero or baseline) is most frequently employed. Regardless of the temporal conceptualization adopted, duration has been found to be highly variable, with emotions ranging from a few seconds to several hours or even longer. In this sense, not all emotions are created equal: some have been found to last a long time, while others tend to fade rapidly. However, according to modern neurology, the average duration of an emotion in the human brain is 90 s, and we therefore assume that the 1-min interval does not affect the self-assessment.

Conclusions
In this study, we presented a dataset consisting of eye movements recorded while each subject watched emotion-eliciting films and rated four emotions, which were then converted into emotional arousal and valence levels. Despite the significant contribution of the available public databases to advancing the relevant research to date, the eSEE-d dataset comprises a significantly larger number of participants and includes a plethora of eye and gaze metrics. It thus provides the necessary depth and breadth of information for experimenting with novel machine learning methods in the attempt to develop optimal models for predicting emotional state using eye-tracking features.
The results of this study indicate the potential of neural networks for differentiating between different emotional states and emphasize the need for further research. To this end, we are currently investigating new models as well as the application of advanced deep learning methods in order to develop a model for simultaneously assessing valence and arousal levels with high efficiency. In addition, we intend to experiment with alternative, more robust feature selection methods as opposed to the statistical methods presented in this paper. Deep feature extraction architectures will be utilized for this task, and a different data-handling approach will be tested in order to predict the broad variety of different emotional states. Lastly, we intend to compare our findings with those of other studies that apply multimodal techniques, i.e., employ additional biosignals, and to study the necessity and potential of merging eye and gaze data with other biometrics, weighing improved performance against computational expense.
This dataset is made accessible to the public, and we strongly encourage other researchers and academics to test their methods and algorithmic approaches on this immensely challenging database.

Funding:
This project has received funding from the European Union's Horizon 2020 research and innovation program under grant agreement No 826429 (Project: SeeFar). This paper reflects only the author's view, and the Commission is not responsible for any use that may be made of the information it contains.
Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Data Availability Statement:
The resulting eSEE-d is available to the academic community (DOI: 10.5281/zenodo.5775674).

Conflicts of Interest:
The authors declare no conflict of interest.