Emotion Identification in Movies through Facial Expression Recognition

Understanding how acting bridges the emotional bond between spectators and films is essential to understanding how humans interact with this rapidly growing digital medium. In recent decades, the research community has made promising progress in developing facial expression recognition (FER) methods. However, little emphasis has been put on cinematographic content, which is complex by nature due to the visual techniques used to convey the desired emotions. Our work represents a step towards emotion identification in cinema through facial expression analysis. We present a comprehensive overview of the most relevant datasets used for FER, highlighting problems caused by their heterogeneity and by the absence of a universal model of emotions. Building upon this understanding, we evaluate these datasets with standard image classification models to analyze the feasibility of using facial expressions to determine the emotional charge of a film. To cope with the lack of datasets for the scope under analysis, we demonstrate the feasibility of using a generic dataset for the training process and propose a new way to look at emotions by creating clusters of emotions based on the evidence obtained in the experiments.


Introduction
Films are rich means of communication produced for cultural and entertainment purposes. Audio, text, and image work together to tell a story, trying to transmit emotional experiences to the audience. The emotion dimension in movies is influenced by the filmmakers' decisions in film production, but it is especially through acting that emotions are directly transmitted to the viewer. Characters transmit their emotions through the actors' facial expressions, and the audience experiences an emotional response.
Understanding how this bond between represented emotion and perceived emotion is created can give us concrete information on human interaction with this rapidly growing digital medium. This can be integrated into large film streaming platforms and be used for information retrieval concerning viewer experience, quality review, and for the improvement of state-of-the-art recommendation systems. Additionally, this matter falls into the field of affective computing, an interdisciplinary field that studies and develops systems that can recognize, interpret, process and simulate human affect. Therefore, emotional film perception could also be a contributing factor for creating affective movie streaming platforms.
Specifically, the challenge lies in answering the following question: "What emotion does this particular content convey?" This is studied in detail in the subfield of emotion and sentiment analysis by analyzing different modalities of content. More specifically, text-based sentiment analysis has been the reference in this area, with the use of natural language processing (NLP) and text analysis techniques for the extraction of the sentiment that a text conveys. A common application of these techniques is social network text analysis and e-commerce online review analysis, due to the proven added value to companies and organizations. Advances in computer vision (CV) and machine learning (ML) have, however, shifted the focus of this field by starting to leverage visual and aural content instead of only considering unimodal text-based approaches. The advantage of analyzing the three media present in movies over only assessing text is the possibility of taking into account the character behavior context: it is possible to combine visual and sound cues to better identify the true affective state represented in a film.
When analyzing movies, other stylistic characteristics can be used to improve the accuracy of emotion recognition. For instance, it is common practice to use camera close-ups to evoke intense emotions in the audience. Although the research community has made promising progress in developing facial expression recognition methods, the application of current approaches to the complex nature of a film, where there is strong variation in lighting and pose, is a problem far from being solved.
This work aimed to investigate the applicability of current automatic emotion identification solutions in the movie domain. We intended to gather a solid understanding of how emotions are addressed in the social and human sciences and discuss how emotional theories are adapted by classification models with deep learning (DL) and machine learning (ML). Taking into account the relevant available datasets, we selected two datasets for our experiments: one containing both posed (i.e., in controlled environments) and spontaneous (i.e., unplanned settings) image web samples, and another that contains images sampled from movies (i.e., with posed and spontaneous expressions). We benchmarked existing CNN architectures with both datasets, initializing them with pre-trained weights from ImageNet. Due to the inclusion of images in uncontrolled environments, the obtained results fall below what would be expected for this task. Hence, we discuss the reliability of multi-class classification models, their limitations, and possible adjustments to achieve improved outcomes. Based on the findings obtained in other multi-media domains that also explore affective analysis, we propose to reduce the number of discrete emotions based on the observation that overlap between classes exists and that clusters can be identified.
The remainder of this article is structured as follows: Section 2 defines the problem this work intended to tackle and presents the related work; Section 3 provides a synthesis of the conducted study, including a detailed definition of the evaluation and analysis methodology with a description of the methods and datasets used; Section 4 depicts and discusses the obtained results; Section 5 concludes by pointing out future paths to be pursued for automatic emotion identification.

Emotion Description and Representation
In his exploratory work, Paul Ekman argued [1] that facial expressions are universal and provide sufficient information to predict emotions. His studies suggest that emotions evolved through natural selection into a limited and discrete set of basic emotions: anger, disgust, fear, happiness, sadness, and surprise. Each emotion is independent of the others in its behavioral, psychological and physiological manifestations, and each is born from the activation of unique areas in the central nervous system. The criterion used was the assumption that each primary emotion has a distinct facial expression that is recognized even between different cultures [2]. This proposal laid the groundwork for other studies that tried to expand the set of emotions to non-basic ones, such as fatigue, anxiety, satisfaction, confusion, or frustration (e.g., Ortony, Clore and Collins's Model of Emotion (OCC)) [3][4][5].
To bridge emotion theory and visual observations from facial expressions, Ekman also proposed a Facial Action Coding System (FACS) [6]. FACS is an anatomically based system used to describe all visually discernible movement of face muscles, from which it is possible to objectively measure the frequency and intensity of facial expressions using a scale based on Action Unit (AU), i.e., the smallest distinguishable unit of measurable facial movement, such as brow lowering, eyes blinking or jaw dropping. The system has a total of 46 action units, each with a five-point ordinal scale, used to measure the degree of contraction. FACS is strictly descriptive and does not include an emotion correspondence. Therefore, the same authors proposed an Emotional Facial Action Coding System (EMFACS) [7] based on the six-basic discrete emotion model, thus making a connection between emotions and facial expressions. Recent studies have proposed a new classification system based on simple and compound emotions [8].
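An EMFACS-style correspondence between AUs and emotions can be sketched as a simple lookup. The AU combinations below are commonly cited prototypes reproduced here for illustration only, not the authoritative EMFACS table, and the matching rule is a naive overlap score of our own choosing:

```python
# Illustrative sketch of an EMFACS-style lookup: mapping prototypical
# Action Unit (AU) combinations to Ekman's six basic emotions.
# The AU sets below are commonly cited prototypes, given for illustration
# only; the real EMFACS tables are richer (intensities, variants).
PROTOTYPES = {
    "happiness": {6, 12},            # cheek raiser + lip corner puller
    "sadness":   {1, 4, 15},         # inner brow raiser + brow lowerer + lip corner depressor
    "surprise":  {1, 2, 5, 26},      # brow raisers + upper lid raiser + jaw drop
    "fear":      {1, 2, 4, 5, 20, 26},
    "anger":     {4, 5, 7, 23},
    "disgust":   {9, 15},            # nose wrinkler + lip corner depressor
}

def match_emotion(active_aus):
    """Return the emotion whose AU prototype best overlaps the detected AUs
    (Jaccard overlap; a naive matching rule chosen for this sketch)."""
    active = set(active_aus)
    best, best_score = None, 0.0
    for emotion, proto in PROTOTYPES.items():
        score = len(active & proto) / len(active | proto)
        if score > best_score:
            best, best_score = emotion, score
    return best

print(match_emotion([6, 12]))  # -> happiness
```

In practice, an AU detector would supply `active_aus`; the sketch only shows how a descriptive coding (FACS) can be bridged to an emotional labeling (EMFACS).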
Despite being the dominant theory in psychology and neuroscience research, recent studies have pointed out some limitations in the six-basic emotion's model. Certain facial expressions are associated with more than one emotion, which suggests that the initial proposed taxonomy is not adequate [9]. Other studies suggest that there is no correlation between the basic emotions and the automatic activation of facial muscles [10], while other claims suggest that this model is culture-specific and not universal [11]. These drawbacks caused the emergence of additional methods that intend to be more exhaustive and universally accepted regarding emotion classification.
Some studies have assessed people's difficulty in evaluating and describing their own emotions, which points out that emotions are not discrete and isolated entities, but rather ambiguous and overlapping experiences [12]. This line of thought reinforced a dimensional model of emotions, which describes them as a continuum of highly interrelated and often ambiguous states. The model that gathered the most consensus among researchers, the Circumplex Model of Emotion, argues that there are two fundamental dimensions: valence, which represents the hedonic aspect of emotion (that is, how pleasurable it is for the human being), and arousal, which represents an enthusiasm or tension dimension (i.e., the energy level) [13]. Hence, each emotion is represented using coordinates in a multi-dimensional space.
Other approaches propose either bi-dimensional [14] or tri-dimensional (e.g., Pleasure-Arousal (PA) and Pleasure-Arousal-Dominance (PAD) [15]) models for representing emotions. The utility of a third dimension remains unclear, as several studies revealed that the valence and arousal axes are sufficient to model emotions, particularly when handling emotions induced by videos [16]. However, Fontaine, after proposing a model with four dimensions, concluded that the optimal number of dimensions depends on the specificity of the targeted application/study [17].
The advantages of a dimensional model compared with a discrete model are the accuracy in describing emotions, by not being limited to a closed set of classes, and a better description of emotion variations over time, since they are not realistically discrete, but rather continuous.
Motivated by the dispersion of classification methods across emotional datasets, some studies have investigated the potential mapping between discrete/categorical and dimensional theories. In 2011, a first linear mapping between PAD and OCC emotion models [18] was proposed [19]. Nevertheless, it was based on theoretical assumptions, instead of using evidence-based studies. In 2018, a new study elaborated a mapping between Ekman's six basic emotions and the PAD model [20] by cross-referencing information of lexicons (i.e., Affective Norms for English Words (ANEW) [21] and Synesketch [22] lexicons) annotated in both models. Furthermore, they also derived a PA mapping using Nencki Affective Word List (NAWL) [23,24].
Using these lexicon datasets (ANEW, NAWL), an exploratory data analysis indicated the apparent formation of emotion clusters in the PA model: emotions with a negative connotation have high overlap, especially between anger and sadness, while happy and neutral form individual clusters in the high-valence/medium-arousal and low-arousal/low-valence regions, respectively. A similar analysis in the aural domain [25] concluded that similar cluster regions exist, particularly for the happiness emotion and the overlap of "negative" emotions.
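The cluster structure described above lends itself to a nearest-centroid assignment in the PA plane. The sketch below uses illustrative centroid coordinates consistent with the clusters just described (they are NOT the values derived from the ANEW/NAWL lexicons), and note how close the negative-emotion centroids sit to one another:

```python
import math

# Sketch: assigning a discrete emotion to a (valence, arousal) point by
# nearest cluster centroid. Coordinates are illustrative placeholders,
# not lexicon-derived values; the proximity of anger/fear/sadness mirrors
# the overlap of negative emotions reported in the text.
CENTROIDS = {
    "happy":   ( 0.8,  0.5),
    "neutral": ( 0.0,  0.0),
    "anger":   (-0.6,  0.7),
    "fear":    (-0.6,  0.6),
    "sadness": (-0.7, -0.3),
}

def nearest_emotion(valence, arousal):
    """Label a PA-space point with the closest emotion centroid."""
    return min(CENTROIDS,
               key=lambda e: math.dist((valence, arousal), CENTROIDS[e]))
```

Because anger and fear are nearly co-located, small annotation noise flips labels between them, which is one way to read the cluster evidence as an argument for merging negative emotions.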

Facial Expression Recognition
Facial expression recognition (FER) systems use biometric markers to detect emotion in human faces. Since 2013, international competitions (such as FER2013 [26] and EmotiW [27]) have changed the facial expression recognition paradigm by providing a significant increase in training data. These competitions introduced more unconstrained datasets, which led the field to transition from controlled laboratory environments to more unrestrained settings.

Datasets for FER
Datasets are the fundamental piece in any machine learning application and several relevant datasets have been made available and used in most of the FER experiments.

1.
Acted Facial Expressions In The Wild (AFEW) [28]: Dataset consists of 1809 video segments extracted from movies. It is labeled with Ekman's discrete emotional model plus a neutral emotion class. The labeling process uses a recommendation system to suggest video clips to a human labeler through their subtitles. The annotations contain the perceived emotions and information regarding the actors present in the clip, such as their name, head-pose and age.

2.
AFEW-VA [29]: AFEW-VA is an extension of the AFEW dataset, from which 600 videos were selected and annotated for every frame using the dimensional emotion model (valence and arousal) for every facial region, which is described using 68 facial landmarks.

3.
AffectNet [30]: AffectNet contains more than 1 million facial images collected from the web by making queries using 1250 keywords related to emotions in six different languages. The entire database was annotated in the dimensional model (valence and arousal), and half of the database was manually annotated in both the categorical model (with eleven labels: the eight emotions neutral, happy, sad, surprise, fear, disgust, anger and contempt, plus none, uncertain and non-face) and the dimensional model.

4.
Aff-Wild2 [31]: The extended Aff-Wild database contains 558 videos annotated in continuous emotions (dimensional model-valence and arousal), using different AUs, and a set of 18 discrete FER classes, which also contain the six basic emotions.

5.
AM-FED+ [32]: The Extended Dataset of Naturalistic and Spontaneous Facial Expressions Collected in Everyday Settings (AM-FED+) consists of 1044 facial videos recorded in real-world conditions. All the videos have automatically detected facial landmark locations for every frame and 545 of the videos were manually FACS coded. A self-report of "liking" and "familiarity" responses from the viewers is also provided.

6.

CK+ [33]: The Extended Cohn-Kanade (CK+) is the most widely adopted laboratory-controlled dataset. The database is composed of 593 FACS-coded videos, 327 of which are labeled with the six basic expression labels (anger, disgust, fear, happiness, sadness and surprise) and contempt. CK+ does not provide specific training, validation and test sets.

7.

EmotioNet [8]: The EmotioNet database includes 950,000 images collected from the Web, annotated with AUs, AU intensity, basic and compound emotion category, and WordNet concept. The emotion category is a set of classes extended from the discrete emotion model. Emotion categories and AUs were annotated using the algorithm described in [8].

8.

FER2013 [26]: FER2013 was introduced in the ICML 2013 Challenges in Representation Learning, and consists of 48 × 48 pixel grayscale images of faces. The images were collected using Google's image search Application Programming Interface (API), in which the facial region is centered, resized and cropped to roughly occupy the same amount of space in each image. The database is composed of 28,709 training, 3589 validation and 3589 test images with seven emotion labels: anger, disgust, fear, happiness, sadness, surprise and neutral.

9.

JAFFE [34]: The Japanese Female Facial Expression (JAFFE) is one of the first facial expression datasets. It contains seven facial expressions (i.e., the labels from the discrete emotion model and a neutral label). The database is composed of 253 grayscale images with a resolution of 256 × 256 px.

10.

KDEF [35]: The Karolinska Directed Emotional Faces (KDEF) is a set of 4900 pictures annotated using a model with seven facial expression classes (happy, angry, afraid, disgusted, sad, surprised and neutral). The set of pictures registers 70 subjects (35 men and 35 women), viewed from five different angles.

11.

MMI [36,37]: MMI Facial Expression is a laboratory-controlled dataset with over 2900 videos of 75 subjects. Each video was annotated for the presence of AUs and the six basic expressions plus neutral. It contains recordings of the full temporal pattern of a facial expression, from the neutral state to the peak expression, and back to neutral.

12.

OULU-CASIA [38]: Contains 2880 videos categorized into the six basic expressions: happiness, sadness, surprise, anger, fear and disgust. The videos were recorded in a laboratory environment, using two different cameras (near-infrared and visible light) under three different illumination conditions (normal, weak and dark). The first eight frames of each video correspond to the neutral class, while the last frame contains the peak expression.

13.

RAF-DB [39,40]: The Real-world Affective Faces Database (RAF-DB) contains 29,672 facial images downloaded from the Internet. The dataset has crowdsourcing-based annotations with the six basic emotions, a neutral label, and twelve compound emotions. For each image, facial landmarks, bounding box, race, age range and gender attributes are also available.

14.

SFEW [41]: Static Facial Expressions in the Wild (SFEW) contains frames selected from AFEW. The dataset was labeled using the discrete emotion model plus the neutral class. It contains 958 training, 372 testing and 436 validation samples. The authors also made available a pre-processed version of the dataset with the faces aligned in the image. SFEW was built following a Strictly Person Independent (SPI) protocol, therefore the train and test datasets contain different subjects.

Table 1 provides an overview of the FER databases. At the moment, to the best of our knowledge, AFEW [28] (and its extensions SFEW [41] and AFEW-VA [29]) is the only facial expression dataset in the movie domain, which poses a considerable obstacle for data-based methods given its very limited size. An alternative would be joining datasets from other domains, but there is some evidence that increasing the size of training databases results only in small increases in cross-domain performance [42]. Additionally, there is huge variability in annotations between datasets, which complicates generalization across domains.
FER datasets share several properties, namely the shooting environment and the elicitation method. The shooting environment is closely related to the data quality and thus to the performance of deep FER systems. Laboratory-controlled shooting environments provide high-quality image data where illumination, background and head poses are strictly imposed. However, building these datasets is a time-consuming process and consequently, they are limited in the number of samples. In-the-wild settings, on the other hand, are easier to collect but prove to be challenging when attempting to achieve high-performance deep learning models.
The elicitation method refers to the way the person pictured in an image portrayed the supposed emotion. Posed expression datasets, in which facial behavior is deliberately performed, are often exaggerated, increasing the differences between classes and making the images easier to classify. Spontaneous expression datasets are collected under the guarantee of containing natural responses to emotion inductions, better reflecting a real-world scenario. Datasets collected from the Web or from movies normally include both posed and spontaneous facial behavior. Additionally, the discrete model of emotions predominates in FER datasets. Table 2 summarizes State of the Art (SoA) approaches and results on the most widely evaluated categorical datasets. SoA approaches achieve over 90% accuracy on CK+ [33] and JAFFE [34], which is expected since these datasets offer laboratory-controlled, ideal conditions. However, datasets with subjects who perform spontaneous expressions under "in-the-wild" conditions, such as FER2013 [26] and SFEW [41], show less satisfactory results. As shown in Table 2, CNN-based approaches are the foundation of SoA results and can be applied to FER tasks to achieve consistent performances. These SoA methods/models are derived from traditional DL architectures, which use well-known backbones for feature extraction (e.g., VGG, ResNet).
Directly using these standard feature extractors and fine-tuning the softmax layer can help mitigate FER's small-dataset problem. However, it creates a bottleneck because it relies on a predefined feature space. This issue is commonly tackled by using multistage fine-tuning strategies based on different combinations of the training dataset to enhance performance [54], or by using facial recognition feature extractors and regularizing them with facial expression information [55].
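Freezing the backbone and fine-tuning only the softmax layer reduces, mathematically, to softmax regression on fixed features. The numpy sketch below makes that explicit; the features are synthetic stand-ins for backbone embeddings, and the dimensions are arbitrary choices for illustration:

```python
import numpy as np

# Minimal sketch of "fine-tune only the softmax layer": the backbone
# (e.g., VGG/ResNet) is frozen, so training is softmax regression on its
# output features. X here is a synthetic stand-in for those embeddings.
rng = np.random.default_rng(0)
n, d, k = 300, 64, 7                  # samples, feature dim, emotion classes
X = rng.normal(size=(n, d))           # "frozen" backbone features
y = rng.integers(0, k, size=n)        # emotion labels

W = np.zeros((d, k))                  # the only trainable parameters
b = np.zeros(k)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(p, y):
    return -np.log(p[np.arange(len(y)), y] + 1e-12).mean()

lr = 0.1
losses = []
for _ in range(100):
    p = softmax(X @ W + b)
    losses.append(cross_entropy(p, y))
    grad = p.copy()
    grad[np.arange(n), y] -= 1.0      # d(loss)/d(logits) for softmax + CE
    grad /= n
    W -= lr * (X.T @ grad)            # gradient descent on the head only
    b -= lr * grad.sum(axis=0)
```

The "predefined feature space" bottleneck mentioned above is visible here: no update ever touches `X`, so the head can only separate classes as well as the frozen features allow.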
To increase the power of representations for FER, several works have proposed novel architectures that increase the depth of multi-scale features [56] or the level of supervision in embedded representations [57]. Additionally, common limitations associated with softmax are caused by inter-class similarity, and are tackled with novel loss functions that drive the extractor towards more separable representations [39,40,[58][59][60]. Effectively training these novel architectures is not possible with insufficient amounts of data, and, as observed in the previous analysis, FER datasets have a reduced size. These limitations therefore opened a new research direction in the context of FER, based on network ensembles and on the fusion of different face-related tasks (e.g., facial landmark location and face recognition) [61].
In conclusion, the current datasets in the movie domain are still not large enough to allow traditional feature extractors to obtain the desired results. Additionally, physiological variations (such as age, gender, cultural context or levels of expressiveness) and technical inconsistencies (such as people's pose or lighting) are other challenges currently being addressed [61].

Proposed Methodology
Based on the evidence discussed in Section 2.2, it becomes clear that there are no sufficiently large-scale movie datasets with face-derived emotion annotations. As a direct consequence, there are not many studies that validate the use of FER deep learning models specifically for the movie domain. Therefore, the problem we investigate can be defined through the following research questions: Can current FER datasets and Deep Learning (DL) models for image classification lead to meaningful results? What are the main challenges and limitations of FER in the movie domain? How can current results on affective/emotional analysis with other media be translated to FER in the cinema domain? Are the current emotional models adequate to the cinema domain, where expressions are more complex and rehearsed?
Based on these research questions, we defined the following experimental design:

1.

From the list of available datasets provided in Section 2.2.1, we analyzed and selected a dataset for training the DL models and evaluated them in the movie domain;

2.

We pre-processed the selected datasets through a facial detector to extract more refined (tightly cropped) facial regions;

3.

We tested and benchmarked CNN architectures using accuracy as a performance metric. This first evaluation also tackles the imbalance of the training dataset;

4.

Following the findings reported in Section 2.1, we studied an approach for dimensionality reduction, which allows comparing our findings with other domains (e.g., audio and text). This final step is divided into two approaches: (a) using only the top-N performing classes; (b) clustering the classes using the emotion clusters found in other studies from the SoA.
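The facial-region extraction in the pre-processing step can be sketched as a pure-Python helper that turns a detector's bounding box into a square, margin-expanded crop clamped to the image bounds. The detector itself (any off-the-shelf face detector would do) is assumed, and the helper name and margin value are illustrative choices, not the paper's exact procedure:

```python
def square_crop_box(x, y, w, h, img_w, img_h, margin=0.1):
    """Given a detected face bounding box (x, y, w, h), return a square
    crop box expanded by `margin` on each side and clamped to the image.
    The detection itself would come from an off-the-shelf face detector."""
    side = int(max(w, h) * (1 + 2 * margin))   # square side with margin
    cx, cy = x + w // 2, y + h // 2            # face centre
    left = max(0, cx - side // 2)
    top = max(0, cy - side // 2)
    # Shift the box back inside the image if it overflows the borders.
    left = min(left, max(0, img_w - side))
    top = min(top, max(0, img_h - side))
    side = min(side, img_w, img_h)             # degenerate case: tiny images
    return left, top, side, side
```

A square crop keeps the aspect ratio intact when the region is later resized to the network's input size (e.g., 48 × 48 for FER2013-style inputs).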
Within the datasets introduced in Section 2.2, none perfectly fits the requirements, since there is no large-scale FER database in the film domain. Thus, we propose using a cross-database scenario involving two in-the-wild settings, uniting the benefits of a large database with those of a film-based database.
For that purpose, FER2013 [26] was selected based on its size and on the fact that it includes both posed and spontaneous samples. This dataset was created using the Google image search API with 184 different keywords related to emotions, collecting 1000 images for each search query. Images were then cropped to the face region and a face-alignment post-processing phase was conducted. Prior to the experiments, images were grouped by their corresponding emotions. Each image is represented as a 48 × 48 pixel matrix and labeled with an encoded emotion.
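In the commonly distributed CSV form of FER2013, each row holds an integer emotion code and a string of 48 × 48 = 2304 space-separated grayscale pixel values; decoding one sample can be sketched as follows (the helper name is ours, but the label ordering shown is the one used in the FER2013 release):

```python
# Sketch of decoding one FER2013 sample from its common CSV distribution:
# an integer label (0-6) and a space-separated string of 2304 pixel values.
EMOTIONS = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]

def decode_row(emotion_field, pixels_field, size=48):
    values = [int(v) for v in pixels_field.split()]
    assert len(values) == size * size, "malformed pixel string"
    # Reshape the flat list into a size x size grayscale matrix.
    image = [values[r * size:(r + 1) * size] for r in range(size)]
    return EMOTIONS[int(emotion_field)], image

label, img = decode_row("3", " ".join(["128"] * 48 * 48))
```

Grouping decoded rows by `label` reproduces the per-emotion organization described above.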
The number of samples per class of the dataset is presented in Table 3. The imbalance of the dataset is fairly evident, especially between the disgust (with only 547 samples) and happy (with 8989 samples) classes. This imbalance is justifiable, as it is relatively easy to classify a smile as happiness, while perceiving anger, fear or sadness is a more complicated task for the annotator. SFEW [41] was also chosen for this analysis, since its images were directly collected from film frames. Furthermore, the labels of SFEW are consistent with the FER2013 dataset, making the aforementioned cross-database study possible. The original version of the dataset only contained movie stills, while the second version comes with pre-processed and aligned faces, and with LPQ (Local Phase Quantization) and PHOG (Pyramid Histogram of Oriented Gradients) feature descriptors used for image feature extraction. Table 4 presents the distribution of the images in the dataset. SFEW was built following a strictly person independent (SPI) protocol, meaning that the train and test datasets do not contain images of the same person.

Results
Following the experimental design referred to in Section 3, to set a baseline for our work, we benchmarked several SoA CNN architectures initialized with pre-trained weights from ImageNet. The selected backbones were MobileNetV2, Xception, VGG16, VGG19, ResNetV2, InceptionV3 and DenseNet. These models were selected based on their solid performance in other image challenges, under the premise that they could also be applied to FER tasks.
FER2013 was separated into training and testing sets. The baseline models were optimized using cross-entropy loss, with accuracy used for validation, during 25 epochs with a mini-batch size of 128. The initial learning rate was set to 0.1, being decreased by a factor of 10% if the validation accuracy did not improve for three epochs. Moreover, the dataset was also extended by applying data augmentation with a probability of 50% to every instance. The selected augmentation methods were horizontal flip and width/height shift (min 10%). Table 5 presents these results for each baseline architecture, while Figures 1-6 illustrate their corresponding confusion matrices.
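The augmentation operations above (horizontal flip, width/height shift) can be sketched as deterministic helpers on a 2D pixel grid; in training they would be applied at random with the stated 50% probability, and the helper names and zero-padding choice are illustrative assumptions:

```python
# Sketch of the two augmentation operations on a grayscale image stored
# as a list of rows. In training, each would be applied with probability
# 0.5; the zero-fill for shifted-in pixels is an illustrative choice.
def hflip(img):
    """Mirror the image horizontally."""
    return [row[::-1] for row in img]

def shift_right(img, frac=0.1):
    """Shift the image right by `frac` of its width, zero-filling the gap."""
    n = int(len(img[0]) * frac)
    if n == 0:                      # image too narrow for this fraction
        return [row[:] for row in img]
    return [[0] * n + row[:-n] for row in img]
```

Vertical (height) shifts follow the same pattern on rows instead of columns.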
From the results, it is clear that none of the vanilla models achieved SoA results. Nevertheless, Xception performed well in inference time, achieved the second fastest training time of our tests, and obtained the best accuracy. Taking this preliminary analysis into account, Xception was selected as the baseline model for the study conducted next.
Since SFEW has few samples, FER2013, a large in-the-wild database of facial expressions, was used to train the selected model. The trained model was then tested with SFEW, since it contains faces of actors directly extracted from film frames. This enables understanding whether the developed model is robust enough to adapt to a new context. Results are shown in Table 6 and Figures 7 and 8. From the presented numbers, we can conclude that Xception achieved an overall accuracy of 68% on FER2013, which is within state-of-the-art values. Additionally, since FER2013 was collected from the Web rather than from films, these experiments allow us to analyze whether a network trained in these conditions has the ability to generalize to the film domain, by testing it with SFEW.
Having achieved the first objective, the next step was to simulate a real testing scenario by submitting the network to images taken from films. From a pool of 891 images, results were not satisfactory, reaching an overall accuracy of only 38%. Given this result, the next step was to address an already identified problem: the imbalance of FER2013.

[Table 6, flattened during extraction, reported per-class accuracy values (%): Angry 60, 62, 53, 40; Disgust 56, 55, 29, 10; Fear 58, 45, 36, 31; Happy 87, 87, 63, 82; Sad 60, 54, 6, 1; Surprise 79, 80, 13, 14; Neutral 56, 71, 22, 49; overall SFEW accuracy: 38%.]

FER2013 Dataset Balancing
To deal with the class imbalance issue, the model was retrained with different class weights, which cause the model to "pay more attention" to the examples from an under-represented class. The values used were anger (1.026); disgust (9.407); fear (1.001); happy (0.568); sad (0.849); surprise (1.293); neutral (0.826). Results are illustrated in Table 7. Despite the overfit reduction, this approach did not lead to better accuracy results. When tested with the SFEW dataset, the obtained results were similar to those already reported.
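A standard way to derive such weights is inverse class frequency, w_c = N / (K · n_c), where N is the total sample count and K the number of classes. In the sketch below, only the disgust (547) and happy (8989) counts come from the text; the remaining counts are hypothetical placeholders, so the resulting weights follow the same pattern as, but do not exactly reproduce, the values listed above:

```python
# Inverse-frequency class weighting: w_c = N / (K * n_c).
# Only the disgust and happy counts are taken from the text; the other
# counts are hypothetical placeholders for illustration.
counts = {
    "angry": 3995, "disgust": 547, "fear": 4097, "happy": 8989,
    "sad": 4830, "surprise": 3171, "neutral": 4965,
}
total = sum(counts.values())
k = len(counts)
weights = {c: total / (k * n) for c, n in counts.items()}
```

Rare classes (disgust) receive weights well above 1, frequent classes (happy) well below 1, so their loss contributions are rebalanced during retraining.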

Reducing Dimensionality
The evidence gathered in Section 2 and the confusion matrices from the baseline results indicate that emotions overlap in the affective space. Thus, we propose reducing the dimensionality of the problem by reducing the number of emotions considered in affective analyses. We demonstrate the effectiveness of this approach firstly by selecting the top-four performing emotions in the previous experiments, and secondly by selecting the clusters of emotions most clearly demarcated in the studies previously addressed.

Selecting the Top-Four Performing Emotions
The emotions that stood out in the previous tests were happy, surprise, neutral and angry, achieving accuracy scores of 87%, 80%, 71% and 62%, respectively. When trained solely with these emotions, the model was able to achieve an accuracy of 83%, as shown in Table 8. The confusion matrix for this testing scenario is shown in Figure 9.
After analyzing each emotion, we can conclude that by decreasing the size of the problem, the network's performance was improved. When applied to SFEW (Table 8 and Figure 10), the model also demonstrated some improvements with the reduction in dimensionality, going from 38% to 47% accuracy.

Clustered Emotions
Based on the evidence collected in Section 2.1, there are three clearly demarcated emotional clusters: happy (hereafter titled positive), neutral, and a third one composed of angry, sad, fear and disgust (the emotions with a negative connotation, hereafter titled negative). Therefore, another test involving these three clusters was performed. By concentrating only on these three classes, the network achieved an accuracy of 85%, as illustrated in Table 8. For this methodology, the confusion matrices for the training and testing sets are illustrated, respectively, in Figures 11 and 12.
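The three-cluster relabelling just described amounts to a fixed mapping over the original labels; a minimal sketch (surprise is absent from the three clusters described above, so in this sketch it would simply have no mapping and be dropped):

```python
# Sketch of the three-cluster relabelling: negative-connotation emotions
# merge into one class, yielding positive / neutral / negative.
# Surprise is not part of any of the three clusters and is omitted here.
CLUSTERS = {
    "happy":   "positive",
    "neutral": "neutral",
    "angry":   "negative",
    "sad":     "negative",
    "fear":    "negative",
    "disgust": "negative",
}

def relabel(labels):
    """Map fine-grained emotion labels onto the three clusters,
    discarding labels outside the mapping."""
    return [CLUSTERS[l] for l in labels if l in CLUSTERS]
```

Retraining on the relabelled data is what turns the seven-way problem into the three-way one evaluated next.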
Testing the three-emotion network with the SFEW dataset, a score of 64% was achieved, as illustrated in Table 8. Unlike in the validation set of FER2013, the emotion with the best performance in SFEW was negative, reaching an accuracy of 90%.
The best results were obtained when the dimensionality reduction took place, so this may be a suitable solution for emotional analysis systems, at the cost of losing granularity within the emotions of negative connotation. These results also reveal emotion clusters similar to those discussed in Section 2 for other domains, which can be observed in the confusion matrices presented throughout this section. In particular, they show intersections between the "negative" classes and the neutral class (Figures 11 and 12), and within the negative-connotation classes (Figures 7 and 8).

Conclusions
The work described in this paper had as its main objective the definition of an approach for the automatic identification of video-induced emotions using actors' facial expressions. It discussed the main models and theories for representing emotions, discrete and dimensional, along with their respective advantages and limitations. We then explored a theoretical modeling approach from facial expressions to emotions, and discussed a possible approximation between these two very distinct theories. The contextualization from the human and social sciences allowed us to foresee that the lack of unanimity in the classification of emotions would naturally have repercussions both in the databases and in the classification models, one of the major bottlenecks of affective analysis.
A systematic validation and benchmark analysis of state-of-the-art FER approaches applied to the movie domain was performed. After the initial benchmarks, we fine-tuned the chosen model with FER2013 and evaluated it with the movie-related dataset, SFEW. During this phase, we noticed several flaws and limitations in these datasets, ranging from class imbalance to blank images that contain no faces. Additionally, we studied, through dimensionality reduction, the hypothesis that emotion clusters observed in the valence-arousal space in other domains are transferable to this approach.
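One simple form of dataset cleaning suggested by the blank-image problem is to flag near-uniform images before training. The heuristic below is a sketch under stated assumptions, not the authors' pipeline: it uses pixel standard deviation with an assumed threshold, and a proper face detector would be a stronger filter for images that contain no faces:

```python
import numpy as np

def is_blank(image, std_threshold=5.0):
    """Flag near-uniform (blank) images by their pixel standard deviation.
    The threshold is an illustrative assumption; images with no texture
    at all (e.g. a solid gray frame) have a standard deviation near zero."""
    return float(np.asarray(image, dtype=float).std()) < std_threshold

rng = np.random.default_rng(0)
blank = np.full((48, 48), 128, dtype=np.uint8)              # uniform gray frame
noisy = rng.integers(0, 256, (48, 48), dtype=np.uint8)      # textured frame
print(is_blank(blank), is_blank(noisy))
```

A variance filter only removes the most degenerate samples; faceless but textured frames still require a face-detection pass.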
The obtained results show that, even though many open challenges remain, among them the lack of data in the film domain and the subjectiveness of emotions, the proposed methodology is capable of achieving relevant accuracy levels.
From the work developed and described in this article, several conclusions can be drawn. Firstly, there is a lack of training data in both quantity and quality: no publicly available dataset is large enough for current deep learning standards. Additionally, the available databases contain several inconsistencies in their annotation processes (using different models of emotion, or diverging even within the same theory of emotion) and in their image collection processes (illumination variation, occlusions, head-pose variation) that hinder progress in the FER field. Furthermore, the notion of ground truth in this context must be taken with a grain of salt, since classifying emotion is intrinsically biased by the annotator's own perception of the emotional experience being observed.

Paul Ekman's basic emotions model is commonly used in current facial expression classification systems, since it tackles the definition of universal emotions and is widely accepted in the social sciences community. The model was built from empirical studies with people from different geographical areas, aiming to understand whether the same facial expressions translate into a single emotion, free of cultural variation. Hence, Ekman defined the seven basic emotions used nowadays in technological fields to identify emotion through facial expressions. Current solutions are quite accurate at this task across a variety of applications, with recent commercial uses, namely in social networks. However, in the cinema field specifically, analyzing characters' emotions with existing frameworks proved to be an unsatisfying approach. On the one hand, actors rehearse the facial expressions of a character in a given context: emotional representation in this field is acted, so Ekman's model might not be a valid basis for the analysis of cinematographic content.
For example, applying current FER approaches to a comedy movie could yield flawed results, because acted emotions in this context should not be translated literally into the emotion apparent in the facial expression. We could obtain a distribution of emotions mostly concentrated on sadness and surprise although, in the comedy context, the character's facial expressions should not be read literally. Should we then consider, beyond the basic emotions, a more complex system that can distinguish the ironic sadness of a comedy from the genuine sadness of a drama movie? This could be a line of work for future implementations. On the other hand, the images captured in movies are cinematographic, i.e., taken in uncontrolled settings where color, light exposure, and camera angle vary. This content variety poses a clear challenge for the classification task and, in the cinema field in particular, can have a large impact on research results.
Apart from facial expressions, there are other characteristics in films that can be used to estimate their emotional charge, as discussed in Section 2. Therefore, as future work, we expect to use facial landmarks to obtain facial masks and, alongside the original image, feed them to the model. This information might act as an embedded regularization, weighting face information in the classification of the emotions conveyed by movies. Furthermore, temporal information on the evolution of visual features might also be worth exploring, since it is commonly used to convey emotions in cinematographic pieces. Regarding annotation subjectiveness, designing intuitive user interfaces that help the annotator perceive the differences between discrete emotion classes is another future path to enhance the annotation process and quality, and to reduce the amount of noise in the construction of new datasets for the field.

Funding: This research was partially financed by the ERDF-European Regional Development Fund-through the Norte Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 Partnership Agreement and through the Portuguese National Innovation Agency (ANI) as part of project CHIC: NORTE-01-0247-FEDER-0224498; and by National Funds through the Portuguese funding agency, FCT-Fundação para a Ciência e a Tecnologia, within project UIDB/50014/2020.

Data Availability Statement: Not applicable.