End-to-End Modeling and Transfer Learning for Audiovisual Emotion Recognition in-the-Wild

: As emotions play a central role in human communication, automatic emotion recognition has attracted increasing attention in the last two decades. While multimodal systems enjoy high performances on lab-controlled data, they are still far from providing ecological validity on non-lab-controlled, namely “in-the-wild” data. This work investigates audiovisual deep learning approaches to emotion recognition in in-the-wild problem. Inspired by the outstanding performance of end-to-end and transfer learning techniques, we explored the effectiveness of architectures in which a modality-speciﬁc Convolutional Neural Network (CNN) is followed by a Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) using the AffWild2 dataset under the Affective Behavior Analysis in-the-Wild (ABAW) challenge protocol. We deployed unimodal end-to-end and transfer learning approaches within a multimodal fusion system, which generated ﬁnal predictions using a weighted score fusion scheme. Exploiting the proposed deep-learning-based multimodal system, we reached a test set challenge performance measure of 48.1% on the ABAW 2020 Facial Expressions challenge, which advances the ﬁrst-runner-up performance.


Introduction
Emotions play a vital role in daily human-human interactions [1].Automated recognition of emotions from multimodal signals has attracted increasing attention in the last two decades with applications in domains ranging from intelligent call centers [2,3] to intelligent tutoring systems [4][5][6].Emotion recognition is studied in the broader affective computing field, where the research of natural emotions is the focal point.Research in this domain is shifting to "in-the-wild" conditions, namely away from lab-controlled studies.This is due to the availability of new and challenging datasets collected and introduced in competitions such as Affective Facial Expressions in-the-Wild (AFEW) [7,8] and Affective Behavior Analysis in-the-Wild (ABAW) [9][10][11][12][13].Considering the challenging nature of the data, e.g., background noise in audio, cluttered background, and pose variations in video, benefiting from multiple modalities including, but not limited to acoustics, vision (face and body pose), physiological signals, and linguistics is essential [14].
Maturing over the last decade, earlier approaches to audiovisual emotion recognition included hand-crafted acoustic and visual features, which are then fed to classifiers that can handle high-dimensional feature vectors such as Support Vector Machines (SVMs).In acoustic emotion recognition, extracting Low-Level Descriptors (LLD) and summarizing them over short (1-4 s) chunks of audio using statistical functionals have been proven to be successful in a range of paralinguistics tasks [15,16] and have been popularly used in the INTERSPEECH Computational Paralinguistics Challenge series since 2009 [17][18][19].This scheme was later followed by clustering-based LLD summarization approaches such as Bag-of-Audio-Words (BoAW) [20,21] and Fisher-Vectors (FVs) [22][23][24][25].The BoAW approach was inspired by its counterpart in the linguistics domain, where the set of all words in the training set after preprocessing (e.g., stemming) is first used to form a "bag" of Nwords, and then, each document is represented as an N-dimensional histogram of word occurrences.In audio and video BoW representation, however, the LLDs from the training set are first clustered using K-means or Gaussian Mixture Models (GMMs), and then, the LLDs are assigned to the nearest cluster for a fixed-length, suprasegmental representation.While the BoAW approach computes the zeroth-order statistics, the FV approach, which was originally introduced in the vision domain [26], calculates the change in the underlying model (usually the GMM) parameters with respect to new coming data, thus also including the first-and the second-order statistics.With the popularity of Deep Learning (DL) and transfer learning, state-of-the-art systems in speech emotion recognition and paralinguistics benefit from extracting features from pretrained models [27,28] and deploying end-to-end models [29][30][31][32].
Motivated by these developments in multimodal emotion recognition and the recent outstanding performance of deep learning in the audio [48,49] and video [50,51] domains, as well as the performance of deep transfer learning to alleviate data scarcity in the target problem [40,52,53], in this study, we employed both deep end-to-end learning and deep transfer learning for both audio and video modalities, fusing the scores of the uni-modal subsystems for multimodal affect recognition in out-of-lab conditions.We experimented with and used the official challenge protocol for the ABAW challenge, the Facial Expressions Sub-challenge (ABAW-FER Challenge), originally run for Face and Gesture 2020, but later extended until October 2020.This sub-challenge includes Ekman's six basic emotions plus neutral, thus featuring a seven-class classification task.
We conducted extensive experiments with our system (and its components) on the AffWild2 dataset, adhering to the challenge protocol [54].The contributions of this paper include: (1) a novel multimodal framework that leverages deep and transfer learning in the audio and video modalities; (2) analysis of the components of our proposed system; (3) comparative results on the official ABAW-FER Challenge.Here, we present, extend, and advance our contribution to the challenge, which has been ranked third in the competition and was published in arXiv [55].
The remainder of this article is organized as follows: We analyze the current state of the multimodal emotion recognition domain and one of the challenges devoted to it in Section 2. Section 3 presents a developed multimodal emotion recognition system and describes its parts.Next, in Section 4, we provide the setting used in this work, including the data utilized, the training hyperparameters, and the preprocessing features.Section 5 presents the results obtained with the proposed multimodal emotion recognition system and its unimodal parts.In Section 6, we discuss the features of the developed system and present interesting findings during the research.Lastly, Section 7 summarizes the performed work and considers the directions of future research in multimodal emotion recognition in-the-wild.

Related Work
Having exhibited outstanding performance in a range of audio and image recognition tasks, such as speech and object recognition, deep learning has recently become the most dominant approach also in affect recognition.Training deep models has become increasingly popular and relatively easier, as the hunger of the deep models for massive data is met by the production of more and larger datasets related to affect recognition tasks.Moreover, deep learning models are developing not only in "depth", but also in "breadth": models process different modalities (e.g., audio, visual) at lower layers with modality-specific filters and at different sampling frequencies, which are subsequently combined at a higher abstraction layer via different fusion techniques.
Initial research studies about the application of different machine learning techniques to the FER started to appear at the beginning of the second millennium [56,57].The second explosion of emotion research started with the investigation of deep learning, thanks to the availability of pretrained deep CNNs such as VGG16 [58] and ResNet50 [59].While new developments in the machine learning and deep learning fields have led to significant improvements, still many research problems remain open including, but not limited to the alignment of heterogeneous signals (audio, video), handling small and imbalanced datasets, ensuring the reliability of subjective annotations, and handling data recorded in naturalistic conditions [60,61].

Multimodal Emotion Recognition
Multimodal emotion recognition involves taking advantage of multiple modalities (audio, video, text, physiological, and others) that, through fusion techniques, provide a single, final prediction.The primary fusion strategies from previous studies for multimodal emotion recognition can be classified into feature-level (early) fusion, decision-level (late) fusion, and model-level fusion [62].In addition, there is a hybrid fusion that involves the combining of the feature-and decision-level fusions [60].In a recent review paper on multimodal affect recognition [60], the authors mentioned that the increase in the feature set may decrease the classification accuracy if the training set is not large enough, pointing to the curse of dimensionality problem.Another problem is that the features from different modalities are collected in varied time scales and hence need to be synchronized appropriately.
In the last decade, researchers have focused more on the deep neural network in the multimodal emotion recognition domain.In [63], the authors applied supervised and unsupervised techniques for feature selection, investigated early fusion, and experimented with Convolutional Deep Belief Networks (CDBNs).In [64], LSTM was used to capture the correlation between different modalities and within the modalities.Finally, the results of LSTM were used as the input of the classifier LIBSVM to make the final prediction.A similar approach based on two LSTMs instead of LIBSVM was proposed by Tzirakis P. et al. [65].Another approach is to replace the softmax layer of each unimodal classifier with a new layer that will combine the deep embeddings of all modalities [66].
There have also been new investigations in the direction of new fusion techniques for modality merging.In [67], the authors discussed the role of speaker-exclusive models, the importance of different modalities (audio, video, text), and the generalizability (between datasets).In terms of ternary modalities, Majumder et al. [68] presented a novel feature fusion strategy that proceeds in a hierarchical fashion, first fusing the modalities one by one and only then fusing all three modalities.In [69], novel attention-based methods with LSTMs were proposed to fuse the modality-specific features.
Multimodal emotion recognition is still a developing area, and the task itself is far from maturity.Although there have been insights that deep learning models will be efficient for this domain, they require more diversified data for training.Challenges are organized to advance the state-of-the-art in multimodal emotion recognition in real-world environments by providing novel data and a comparable protocol.The ABAW-FER Challenge [9][10][11][12][13] is one such competition, whose publicly available data and common evaluation protocol were used in this paper.

ABAW-FER Challenge
In the context of the ABAW-FER Challenge, Kuhnke et al. [70] proposed a multimodal system that exploits Mel-spectrogram, appearance-based visual and facial-landmarksbased features.The last two are combined and used by a pretrained 3-D CNN, while Mel-spectrograms are fed into a slightly modified ResNet-18 Deep Neural Network (DNN).Fusing the output from the models by an additional fully connected layer, the authors reached a 0.509 Challenge Performance Measure (CPM) on the test set, taking first place in the Facial Expressions Sub-challenge in the ABAW challenge [70].
Gera, Darshan, and S. Balasubramanian implemented an attention-based framework [71], which fuses local and global attentive features and complementary context information to obtain features robust to non-trivial face positions and occlusions.The fusion process was performed on both the feature and decision levels (the loss function was calculated based on predictions from different "branches" of the framework).Thus, the authors obtained a 0.441 CPM and took second place in the ABAW-FER Challenge.
Liu et al. [72] combined ResNet and a Bidirectional Long Short-Term Memory Network (BLSTM), achieving a CPM of 0.408 on the test set.In [73], the authors explored data balancing techniques and their applications to multitask emotion recognition and found that data balancing is beneficial for classification, but not necessarily for regression.The proposed approach yielded a 0.405 CPM.Nhu-Tai Do et al. [74] proposed a temporal and statistical module to exploit and then fine-tune the face feature extraction model, obtaining a CPM of 0.389.Sachihiro Youoku et al. [75] used multiple optimized time windows (short-, middle-, and long-term) for extracting features from video data.In addition, the authors applied data balancing and fused the single-task models to further improve the prediction accuracy (CPM = 0.369).

Materials and Methods
Since emotion recognition is a complex multimodal paralinguistics task, we used both audio and video modalities.The final fusion of all systems was implemented as modeland class-based weighted decision-level fusion, which weights the class probabilities from each model separately.The pipeline of the fusion system is presented in Figure 1.
We hypothesized that the multimodal system for in-the-wild emotion recognition can benefit from alternative feature representations and pipelines for visual processing.The constituents of the fusion system and their optimal hyperparameters may vary depending on the target dataset.The design settings of the respective components are elaborated in Sections 3 and 5.In the visual emotion recognition domain, one of the most informative body parts is the human face.Multimodal emotion recognition systems often contain a facial expression recognition model, which catches much information reflected via changes in mimics ofhuman faces.Moreover, while the audio modality can be absent (the user is silent), the face of the user is generally available for facial analysis.Therefore, the development of a robust and efficient facial expression recognition model is one of the important tasks in the construction of multimodal emotion recognition systems.
A typical choice for facial emotion recognition systems is a deep convolutional neural network, such as ResNet50.In the emotion recognition domain, it is common to use transfer learning, namely training the deep neural network on a relevant task/domain, where the annotated data are sufficiently diverse and rich.While the main idea is using the knowledge distilled in the pretrained deep network, a popular trend is to use model embeddings (deep features), which can be extracted from different layers of the deep neural network.Usually, embeddings from the last convolutional layer are exploited.
To take advantage of transfer learning, as a base model, we used the VGGFace2-model, which is the ResNet50 model pretrained on the VGGFace2 dataset [76].The VGGFace2 dataset is intended for training models to recognize identities by faces.It contains 8631 identities in the training set, each with 363 images on average.
Based on the VGGFace2-model, we trained three facial expression recognition models, which differed in terms of data on which they were fine-tuned.We should note that for the fine-tuning, we took the VGGFace2-model, removed the last layer that provides class probabilities over the identities, and stacked two dense layers with 1024 and 7 neurons above it accordingly (hereinafter, we call this model the VGGFace2-model).The last layer with the softmax activation function allows the model to predict the probability of every emotional category.Thus, we obtained the following models: • VGGFace2-EE.The VGGFace2-model fine-tuned on the AffWild2 dataset; • AffectNet-EE.The VGGFace2-model fine-tuned on the AffectNet [77] and FER2013 [78] datasets (static frames); • AffWild2-EE.AffectNet-EE, which was further fine-tuned on the AffWild2 dataset (dynamic frames).

Temporal Modeling
The expression of emotions is inherently dynamic.Some emotions cannot be successfully recognized without exploiting the dynamics [40].Therefore, current emotion recognition systems effectively exploit different techniques to take into consideration the temporal context.These techniques include calculating the statistics of frame-level features over a time window and employing algorithms specifically developed for working with time series, such as recurrent neural networks.We implemented both techniques to investigate their recognition performance for the facial expression recognition task: • EE + SVM system.First of all, embeddings for every frame were extracted by the Embedding Extractor (EE), which is one of the fine-tuned CNNs presented earlier.
Next, we grouped the embeddings according to the chosen window in a sequence and applied statistical functionals, namely the mean and Standard Deviation (STD), on each sequence.This process summarizes the LLD matrix having dimensions of N × M (where N denotes the number of frames in a sequence and M is the size of the embeddings vector obtained from the EE for one image frame) into a suprasegmental feature vector with dimensions of 1 × M * 2 (since this vector contains the M mean and M STD statistics).Finally, an SVM classifier was trained on the obtained vectors to predict one emotion category for the whole window (sequence).The target emotion to train the SVM model was calculated using the mode (i.e., voting) of the emotion annotations in the sequence.This approach works best when the window size is close to the average duration of emotions (2-4 s based on previous research); • LSTM-based systems.These systems are based on recurrent neural networks, namely on LSTM networks.As an EE subsystem, AffectNet-EE was used.For the LSTM network's training, we also grouped embeddings using a time window.
Here, we experimented with two alternative training schemes: -EE + LSTM system.The extraction of the embeddings from the EE subsystem was separated from the LSTM network.Thus, the EE subsystem did not take part in the fine-tuning (training) process, and we exploited it solely for the embedding extraction; -E2E system.The EE subsystem was combined with the LSTM network during the fine-tuning (training) process.Thus, the system was trained as a whole, making it an End-to-End (E2E) deep neural network system.Moreover, training the EE subsystem as a part of the E2E system allowed us to generalize the EE subsystem more since it had "seen" more data including the AffWild2 dataset.
The description of the datasets used for training, training the hyperparameters, and the preprocessing procedures are elaborated in Section 4.

Audio Separation
It is well known that extraneous sounds in audio recordings (e.g., background noises, music) may significantly decrease the effectiveness of the training process, which results in poorer emotion recognition performance [79].To alleviate this, we applied Blind Source Separation (BSS) using the open-source library Spleeter [80], which contains a set of pretrained DNN models.We used a model that allowed us to separate audio into vocals and accompaniment (all other sounds including music).Spleeter models were trained with data having 11 kHz and 16 kHz sampling rates.Therefore, we downsampled audio files to 16 kHz and applied BSS.The obtained audio (vocals) was used in all further experiments.

Synchronization of Labels
Usually, videos from the datasets are annotated per frame, while various videos can differ in terms of the frame rate.Thus, the annotations also had different sampling rates.To align them, we downsampled all labels to a sampling rate of 5 Hz.We think that this frequency is sufficient to detect changes in emotions, because the emotional category switches rarely.

1D CNN + LSTM-Based Deep Network
Since there is no publicly available pretrained 1D CNN on raw audio, we constructed our own.To grasp the temporal information from 1D CNN embeddings more efficiently, we stacked two LSTM layers on top of it.The final layer had seven softmax neurons to match the number of classes.To represent window temporal modeling, the sequence-to-one modeling scheme was implemented.It maps one portion of the input acoustic raw data (for example, 4 s of audio) into emotional category probabilities.The number of model parameters was around 4.5 M. The architecture of the developed sequence-to-one model is presented in Figure 2. We would like to note that such a model is also an E2E model since it directly maps the raw audio to one of the emotion categories (or class probabilities).

Fusion Techniques
We employed "Model-and Class-based Weighed Fusion" (MCWF), where we had a fusion matrix of L × K, where L and K denote the number of models and classes, respectively.That is, we had an importance weight for each class of each model, separately.The fusion weights can be optimized for any measure of interest from a pool of matrices that are generated randomly using a Dirichlet distribution for each class, such that the weights for each class over models sum up to unity.This approach has been successfully applied in former video-based affect recognition in-the-wild challenges [40].

Experimental Setting
In this section, we provide the experimental setting.This includes the data description, the features of dataset preprocessing, the set of hyperparameters used, and the way they were optimized during the training of the models.The code for reproducing the results is available at https://github.com/DresvyanskiyDenis/ABAW-SIU,accessed on 30 November 2021.

Datasets Used in the Study
AffectNet [77] is a large dataset of human emotional expressions, collected by web querying in six different languages and annotated in terms of seven emotional categories, namely Ekman's six basic emotions [81] plus neutral.Overall, the database contains around 450 K images of facial expressions, partially in an in-the-wild manner.The AffectNet class distribution is strongly shifted to the neutral and happiness classes (these two classes comprise 73.7% of all dataset instances), which introduces a class imbalance issue [82] that needs to be solved.
To relax the AffectNet class imbalance, we mixed it (excluding major neutral and happiness classes) with the FER2013 [78] dataset, increasing the examples of minority classes to around 15% on average.The FER2013 database contains approximately 34 K grayscale face images, annotated in terms of 7 emotional categories, and has a class imbalance problem as well.
The AffWild2 [11] dataset was used to evaluate the effectiveness of the proposed approaches.The database was collected from YouTube and consists of about three million annotated frames with different qualities.The annotation process for seven basic emotions was performed in a frame-by-frame manner, eliminating the widely known problem of evaluator lagging during time-continuous annotation.The class distribution of the AffWild2 dataset is presented in Table 1.Thus, it has an even higher class imbalance (the top two classes account for 79.7% of all instances and the top three for 90.6%) in comparison with AffectNet.Moreover, the challenging conditions of the data include different occlusions, pose variations, the absence of a person/face, and in some cases, the existence of multiple people in a frame, while each frame is annotated and should be predicted properly.The typical video sequence from the AffWild2 database is presented in Figure 3.For example, the first row of images represents the arising occlusions and unusual colors, while the second one the arising second human in the frame (in one frame, the face detection algorithm erroneously detected him as the main person) and a high amount of movements from the participant (wiggling, nodding, occluding the head with the arms).Moreover, the AffWild2 database contains several videos with two annotated persons occurring in different timings.Such difficulties make this database essentially in-the-wild and difficult to process, yet interesting and challenging to the research community.

Preprocessing of the AffWild2 Dataset
As a helpful supplement, the authors of the AffWild2 dataset provided the cropped faces for all frames in the video files.They localized them utilizing the HeadHunter detector [83].However, we found several problems while working with these localized faces: (1) the target face is confused with secondary faces (see Figures 3 and 4); (2) in case the face is covered by hands or other objects (obstacles), the face detector does not work properly, as exemplified in Figure 4.The noticed problems highly contributed to the effectiveness of the model training due to the loss of important information, consequently affecting the efficacy of the temporal aggregation and recognition.To eliminate the second problem, we exploited the RetinaFace detector [84], which works more accurately, including the cases when the face is covered by obstacles by more than 50% [85].We tackled the first problem as well by utilizing the pure VGGFace2-model: we extracted the facial embeddings from the preceding and the current frames and compared the embeddings, using the cosine similarity.In case the cosine similarity is more than 0.5, the current face is considered as a correctly detected area.We can observe the higher reliability and accuracy of the obtained face detection approach in Figure 5.In comparison with the HeadHunter face detector, no faces were missed, and all of them were identified correctly.

Models Setup 4.3.1. Audio Emotion Recognition Models
To obtain more data during the training, we set the shift of the window with a step equal to 2/5 the size of the window so that every audio chunk had an overlap of 3/5 with the respective former chunk.Thus, we approximately doubled the amount of training data.Since the audio after the noise cleaning (see Section 3.2.1)had a sampling rate of 16 kHz, the temporal context of 4 s in this case, for example, equals 48,000 samples in the waveform and 20 labels.As we applied the sequence-to-one modeling, we should have retained only one label per window.This was performed by selecting the mode of the whole label sequence, which is the most frequent emotion category.
The training process was conducted using the Adam optimizer with learning rate = 0.00025 and the categorical cross-entropy loss function.To regularize the model, a dropout with a rate of 0.3 after every convolutional layer was applied as well.

Visual Emotion Recognition Models
VGGFace2-EE.To construct an embedding extractor, we used the VGGFace2-model.To smooth the class imbalance, while training the VGGFace2-model on the AffWild2 database, we downsampled every class category in the training set: for the neutral class, we considered every tenth frame in every video, for the anger, disgust, fear, and surprise classes every second frame, and for the happiness and sadness classes every fifth frame.Subsequently, to make the model more robust to interference, we used data augmentation techniques such as rotation, horizontal flip, and brightness variation.Moreover, we applied logarithmic weighting to the loss function.According to the logarithmic weighting, the weight of class i is calculated as: where ln() denotes the natural logarithm, r is a regularization parameter, M is the number of samples in the training set, and n i is the number of samples in class i.We experimented with various values of the hyperparameter r in the [0, 1] range and optimized it as 0.47.

AffectNet-EE.
For the AffectNet-EE training, we removed the VGGFace2-model's last layer and added above it Gaussian noise and one fully connected layer with 512 neurons and l2-regularization.Lastly, a softmax layer with seven neurons was stacked on top of the obtained model.The main difference of the AffectNet-EE model is that it was trained on the AffectNet dataset (and not on AffWild2).Moreover, to increase the proportion of the minority classes, the samples from the FER2013 dataset were added to the overall training set.
In the training process, such augmentation techniques as rotation, shear, horizontal flip, shifting, and changing the image contrast were used.In addition, to "dilute" the majority classes, we used the mixup [86] approach, which mixes two images and their labels, applying the weights generated by the Beta-distribution.Moreover, inversely proportional class weighting was applied to the categorical loss function.
During training, we exploited the cosine annealing with cold restart [87] as follows: the initial learning rate was set to 0.0001, which steadily descended to a minimum learning rate of 0.00001 within 6 epochs (=1 annealing cycle) and then sharply recovered to the initial learning rate.This procedure was repeated 5 times, resulting in the model training within 30 epochs.
AffWild2-EE.Structurally, Affwild2-EE is the same as AffectNet-EE; however, it was further fine-tuned on the AffWild2 corpus.To train the embedding extractor, we downsampled AffWild2 as follows: in every video, for category neutral, we took every tenth frame, for categories anger, disgust, fear, and surprise every second frame, and for happiness and sadness every fifth frame.All the other parameters, as well as the augmentation techniques, were chosen the same as for the AffectNet-EE training.

Temporal Modeling Techniques
As mentioned earlier, to handle the varying Frames Per Second (FPS) over the videos and correctly the implement subsystems using LSTM, we aligned all videos to have 5 FPS.Then, using AffectNet-EE, CNN embeddings were extracted and formed in a sequence of different lengths according to the window length and fed to the LSTM input.
As we describe below, we applied two different techniques of modeling temporal dependencies: calculating statistics from features within one window or using specially designed methods such as RNNs.Both methods have their pros and cons, and therefore, we employed both in our fusion framework.In addition, we should note that in most cases (except the fully end-to-end approach described in Section 5.1.2),we prepared the features for temporal modeling in advance for computational efficiency.For every video frame, we extracted deep embeddings exploiting either the VGGFace2-EE, AffectNet-EE, or AffWild2-EE models.Thus, every frame was represented as 512 deep features obtained from the penultimate fully connected layer.EE + SVM system.For statistical functional-based summarization of the frame-level features, we used two approaches: • SVM.Means and STDs were calculated over each fixed-sized window, resulting in a vector with a length of 1024.Next, a meta-parameter search for the SVM using calculated suprasegmental features was carried out.We conducted extensive experiments with various kernels including linear, polynomial, RBF (γ optimized in [0.001, 0.1]), and regularization parameter C (in [1,25]).The best result was obtained with following settings: kernel -polynomial, γ = 0.1, C = 3; • L-SVM.We calculated the means, STDs, and leading coefficients for polynomials of the first and the second orders over each fixed-size window, resulting in a suprasegmental feature vector of dimensionality 2048.Next, we consistently applied cascaded normalization, in the form of z-, power-, and l2-normalization, respectively, and trained the Linear Support Vector Machine (L-SVM) on the obtained normalized vectors, as suggested in [88].

EE + LSTM system.
As the second approach, we used "raw" deep embeddings as the input in the constructed LSTM neural network.It consisted of two layers with 512 and 256 neurons with l2-regularization and dropout between them (with a dropout rate of 0.5).On top of them, the fully connected softmax layer with seven neurons was placed.
In addition, for the training of the LSTM systems, we decimated the FPS rate in all AffWild2 dataset videos to 5.This had to be done, since all videos had different FPS (the lowest FPS was 7.5; the maximum FPS was 30), while the similarity in the rate of temporal context is a crucial point for LSTM networks.Then, using AffectNet-EE or AffWild2-EE (depending on the model), CNN embeddings were extracted and formed in a sequence of different lengths according to the window length and fed to the LSTM input.
Thus, both temporal modeling approaches performed sequence-to-one modeling, mapping frame-level deep embeddings within a fixed window into seven class probabilities.However, originally, one window contained several emotional labels, the number of which depended on the video frame rate.Therefore, after prediction, we expanded the vector of predicted probabilities with size 1 × 7 to n × 7, where n is the original number of emotional labels in the particular fixed window.Since we had an intersection between windows, to obtain the final decision on each concrete video frame, the intersected class probabilities were averaged.

Performance Measure
During our research, we considered two measures for model evaluation: Unweighted Average Recall (UAR) and the official ABAW-FER CPM, which is defined as [54]: where F1 is the weighted average of the Recall and the precision (also known as the F-measure) and the Accuracy is the fraction of predictions that are correctly classified.We utilized UAR mostly during choosing the models (see the experiments described in Section 5.1) because of its ability to elicit and neglect overfitted models, while the CPM was used in order to be able to compare our models with other participants of the ABAW-FER Challenge.

Results
In this section, we describe the results obtained during the extensive experiments we conducted on the AffWild2 dataset.Moreover, the adoption of a variant of the proposed multimodal system within the ABAW-FER Challenge is presented.

EE + SVM Subsystem
As we noted before, different techniques for temporal aggregation may be used.In this work, we focused on such methods as SVM and LSTM applied to the CNN embeddings.However, the first question to address is: What is the optimal context length?To find this out, we conducted experiments, fixing all the parameters except the temporal window length.For simplicity, we fixed the following parameters: (1) the face detector: HeadHunter; (2) AffWild2 as a training dataset (that is, VGGFace2-EE); (3) the temporal aggregation technique: EE + SVM system; (4) the logarithmic class weighting technique.We varied the length of the window from 2-8 with exponential steps, with an additional probe of 3 s.The experimental results are presented in Table 2.We chose two performance measures for the models' evaluation: Unweighted Average Recall (UAR) was chosen since it is known to be an unbiased measure of the reliability of the model (how much attention the model pays to every class), while the CPM was also chosen for comparison in terms of the ABAW-FER Challenge.As we can see from Table 2, a temporal context of 4 s was the best option for functional-based temporal aggregation (EE + SVM system), and therefore, all the subsequent experiments with this technique were performed using a temporal window of 4 s.Fixing the length of the temporal context, we continued optimizing the model via experiments on the variation of the face detector, embeddings extractors, and class weighting schemes.The results of the experiments are presented in Table 3.Here, we can highlight the models with the SysID 1 and 2: they had the highest UAR and CPM over all the other models.We discuss the results of the ABAW-FER Challenge in Section 5.3; however, we should note that one of these two models (namely SysID 2 in Table 3) was submitted during the ABAW competition.

EE + LSTM Subsystem
Although the systems based on SVM with suprasegmental features reached a decent performance on the validation set, relatively low results were obtained on the test set.This may be caused by an insufficient ability to generalize data expressed via overfitting to the training set.Therefore, we decided to utilize a more sophisticated temporal aggregation method such as LSTM-NN, which was originally developed to process time series and has a memory mechanism to capture long temporal dependencies.Fixing the choice of the face detector and training data (pretraining the model on the AffectNet dataset, that is AffectNet-EE), we conducted an experiment, varying the weighting scheme and length of the temporal window, as was done in the previous subsection.We used only context window lengths of 2 s and 4 s due to the computational complexity faced during the investigation.The results of the experiments are presented in Table 4. First of all, we should note that we chose the best model according to the UAR score, since it reflects the model's generalization ability more adequately than the CPM (this was observed during overfitting to the development set in the previous subsection).Thus, the model with the balanced class weighting technique and a temporal window size of 2 s was chosen for the further experiments.
Apart from that, we decided to train the AffWild2-EE + LSTM model using the best parameters found during the experiments with the AffectNet-EE + LSTM model.Despite the relatively low performance (UAR = 43.85% and CPM = 52.84%)obtained on the AffWild2 validation data, we found that AffWild2-EE + LSTM achieved a Recall performance gain of 22.6% on the neutral class and 3.23% on the happiness class.This makes the AffWild2-EE + LSTM model a good contributor in the fusion systems (comprising several models); therefore, we used it in our further experiments.Additionally, we carried out the experiment with the number of frozen layers while training the model.To do so, we fixed the meta-parameters of training as in the previous experiment and tried to conduct experiments with different options: training the whole CNN + LSTM model (AffectNet-E2E) simultaneously and training only the LSTM part of the model, while the CNN part was frozen (AffectNet-EE + LSTM).The results are presented in Table 5.We can clearly see that E2E training demonstrated higher performance in terms of the CPM, while using the main convolutional layers as embedding extractors, the AffectNet-EE + LSTM approach gave a more balanced Recall over classes expressed via a higher UAR.This means that E2E models sacrifice the Recall of minority classes in favor of higher accuracy.In addition, we can observe that our previous findings (see Table 4) were reinforced with the E2E learning: the temporal window of 2 s turned out to be the best one in terms of both performance measures.Thus, since we believe that UAR is more informative and adequate in comparison with the CPM, we chose the AffectNet-EE + LSTM model with a window size of 2 s for future fusion experiments.

Audio Emotion Recognition Sub-System
For the audio model (1D CNN + LSTM), we also optimized the length of the temporal context (window size).We conducted experiments using the AffWild2 validation set with different lengths of windows (from 2-12 s with steps of 2 s).The best performance was observed with a temporal context of 4 s, which is partially in line with our findings from the video emotion recognition subsystem.
We also tried to use Pretrained Audio Neural Networks (PANNs) [89] that have demonstrated state-of-the-art performance in audio pattern recognition.These models extract features from raw waveforms, then process those data and return predictions in real time.In this study, we used the CNN-14 model, which consists of one layer for extracting features and six convolutional blocks, inspired by VGG-like CNNs [58].The CNN-14 model was pretrained on the AudioSet dataset [90].We fine-tuned the PANN model for the Expression Challenge.In this pipeline, we used 2D Mel-spectrograms as the features.However, our 1D CNN + LSTM model showed better results.

Multimodal Emotion Recognition System
As already mentioned, we took part in the ABAW competition.However, our results are not limited only by this challenge: we subsequently continued experimenting and improving our multimodal system after the challenge as well.Therefore, to be consistent and to have the possibility to compare our results with other works, we continued working on the AffWild2 dataset, using the same performance measure as before: the CPM.In this section, we present our results in chronological order to explain the research progress during and after the challenge.
During the ABAW-FER Challenge, first of all, we examined the efficacy of the unimodal subsystems individually.We chose as the simplest the following models: 1D CNN + LSTM, VGGFace2-EE, and VGGFace2-EE + SVM.Each of them represents one of the possible usages of a modality, namely audio, visual-static, and visual-temporal.Since multimodal systems usually perform better than unimodal models, we decided to submit a multimodal system as well.Exploiting the visual-temporal model turned out to be more accurate over using just the visual-static model (framewise prediction); therefore, we constructed the multimodal system from the 1D CNN + LSTM and VGGFace2-EE + SVM approaches.It should be noted that the fusion of these two unimodal systems was performed via a decision-level fusion.The best results obtained within the ABAW competition are presented in the first part of Table 6.Using the decision fusion of the audio-and video-based systems, we reached a CPM of 42.10% and ranked third place in the ABAW-FER Challenge (the results of the first five participants are shown in Table 7).
Upon the analysis of the state-of-the-art works in the multimodal emotion recognition domain, we extended our system with new subsystems that collectively can catch the emotional states more accurately.To adjust the VGGFace2-model more to the motion recognition task, we applied a multi-state fine-tuning of the VGGFace2-model, obtaining the AffectNet-EE model after the first stage and AffWild2-EE after the second stage, as described in Section 4.3.2.In addition, since the SVM hyperparameters need to be adjusted every time the training set is changed, we shifted towards an end-to-end approach such as combining the CNN and LSTM.The results of the proposed multimodal fusion system and its subsystems are presented in the second part of Table 6.We see that the fine-tuned AffWild2-EE + LSTM alone was able to outperform our best multimodal system in the challenge by 2.1% absolute.This partially confirms that LSTM can operate better with time series than just estimating statistics and making decisions based on them by SVM.Combining the two fine-tuned visual EE + LSTM subsystems (AffectNet-EE + LSTM and AffWild2-EE + LSTM) further improved the CPM by 2.1% absolute.The addition of the end-to-end audio model, namely the 1D CNN-LSTM model, to these two EE + LSTM visual subsystems in a weighted fusion framework further contributed, reaching a CPM of 47.6%.Finally, in the proposed system (System 8 in Table 6), we fused all our best performing subsystems and, therefore, further extended this fully endto-end multimodal system with the use of a Linear SVM (L-SVM) on the functional features obtained from the AffWild2-EE.This final one advances the first runner-up performance on this challenge corpus (see Table 7).

Analysis and Discussion
When analyzing the classwise F1-scores of our proposed multimodal system (see Figure 6), we observed that the visual modalities had both higher and more balanced F1-scores over classes compared to the audio model, which eventually contributed to the final fusion system (see Figure 7).From the weights figure, we observed that the most significant contribution for neutral, anger, and surprised classes was made by L-SVM, for fear and sadness by AffWild2-EE + LSTM, for happiness by AffectNet-EE + LSTM and L-SVM (almost equally).Interestingly, the highest contribution of the audio modality was observed on the disgust emotion, which weighed evenly with the contribution from AffectNet-EE + LSTM.Thanks to the MCWF, which combines the classwise strengths of each model, the fusion system's F1-scores were better than or on par with the best unimodal counterpart.Although the found optimal fusion weights and the corresponding F1-scores were highly correlated, the models with the highest F1-score on a class did not always obtain the highest weight.Alternatively, the genetic-algorithm-based informed search can be investigated to improve the MCWF.
)XVLRQ We further analyzed our best embedding extractor (CNN model), namely AffWild2-EE, using GradCAM [92] to check where it attended and why it failed.In Figure 8, we provide sample saliency maps overlayed on the images.For the images in the first row, the CNN attended to eye region and was observed not to be effected by the partial occlusion of the mouth area.The second raw embeddingshowed that the CNN performed well on pose and occlusion variations.The examples show that the subtlety of facial expressions is an important factor for misclassifications.We also investigated the performance of our system in the neighborhood of an emotion change point t with respect to the ground truth annotations.We provide a breakdown of the performance of the proposed system (Sys.8) in the temporal neighborhood of the emotion change, namely in (A) [t − 4, t − 2), (B) [t − 2, t), (C) [t, t + 2), and (D) [t + 2, t + 4) s.The analysis window step of 2 s was based on our former window size optimization (2-4 s with different models).As shown in Table 8, we observed a drop in recognition performance around the emotion change point; however, even after the drop, the performance was around a 50% CPM and thus was at an acceptable level.The performance outside the ±2 s relative to the emotion change points was almost the same as the overall development set CPM.We note that even though on average there were 11.6 emotion change points per video clip, considering the number of frames, this happened very rarely (the proportion of emotion change points to the total number of development set frames was 0.25%).Sample sequences of frames taken from the temporal neighborhood of emotional change points are illustrated in Figure 9. Here, in each row, the frame in the center corresponds to the reference emotional change point t.From left to right, the relative time points (in seconds) are t − 4, t − 2, t, t + 2, and t + 4. Despite the challenges such as facial occlusions and pose variations, we observed a decent performance of capturing the emotions before and after the expression change points.To sum up, our results are in line with the former works that reported higher unimodal predictive performance for and a contribution to the multimodal system by the visual modality over the acoustic modality [40,70,71,73,91].Unlike former studies that reported only the visual modality contributing to the recognition of disgust [40], in our study, we observed the contribution of the acoustic system to this class.This contribution may stem from extralinguistic vocalizations, which deserves future investigation.

Inference Time Analysis
Additionally, we would like to note that the final system (SysID 8) presented in Table 6 (with a test set CPM of 48.07%) can be considered as a real-time method.To demonstrate this, we calculated the time needed for each processing step, namely from data preprocessing to the prediction generation step.
In Table 9, we provide the inference time based on the consecutive execution of all operations.However, some of them could be parallelized.For example, all the feature extraction models can be run in parallel on different CPUs.The same can be applied for the prediction generation process.Considering parallelization, the overall inference time would be 66.76 s for this video clip (0.56 SFI).This evidence allows stating that our method is a real-time inference method.Table 9.The inference time for different sub-processes of the final fusion system.For the system's testing, the video file "134.avi" with a duration of 119 s was chosen.SFI means "Seconds For Inference", denoting the seconds needed to process one second of video.

Conclusions
This article investigated the efficacy of deep learning models in the in-the-wild audiovisual emotion recognition domain.We showed that the transfer learning performed via multi-stage fine-tuning of deep CNN model allows increasing the model performance significantly.We also observed that in both audio and video modalities, the deployed fully end-to-end 1D CNN + LSTM and 2D CNN-EE + LSTM neural network architectures can be successfully exploited for catching the spatio-temporal patterns for the audiovisual emotion recognition task.Moreover, we demonstrated that the MCWF fusion of different deep neural networks with correctly fitted weights is able to enhance the predictive performance of the fused system over the unimodal systems.
One of the interesting findings in this research is that, despite its low individual performance, the audio-based model significantly contributed to the multimodal fusion process, especially for the emotion disgust.Analyzing the training dataset, we can note that the low performance of the audio model can be partly attributed to the long silence durations and background music or noise accompanying the participants' speech.Although the cleaning of the audio helped to significantly separate the human voices from other sounds, the ambiguity reflected by several different voices in one audio file was not illuminated.Nevertheless, even such a "confusing" audio channel contributed to the overall system performance.This underlines once more the necessity of using the multimodal emotion recognition system due to the possibility of combining the various advantages of unimodal subsystems.
In our future work, we plan to try more flexible information fusion techniques such as cross-modal attention-based fusion.Such an approach has shown promising results in other domains [93][94][95] and can significantly increase the performance of the considered system.To enhance the multimodal fusion via additional information, which can be extracted from already existing information channels, we plan to exploit the linguistic modality, which has shown promising results as an additional modality for multimodal emotion recognition systems [25].

Figure 2 .
Figure 2. The architecture of the proposed 1D CNN + LSTM-based deep neural network.

Figure 3 .
Figure 3.The example frames of two videos in the AffWild2 dataset, which demonstrate such challenges as pose variations and various occlusions.

Figure 4 .
Figure 4.The examples of two video sequences, on which the HeadHunter face detector had many failures.

Figure 5 .
Figure 5.The face detection results on the same two video sequences presented in Figure 4 using RetinaFace.

Figure 9 .
Figure 9. Example frames in the temporal vicinity of emotion change points in video clips with the corresponding ground truth and predicted emotion classes.

Table 1 .
The class distribution of the AffWild2 dataset.

Table 2 .
The experimental results on the temporal window length selection on the AffWild2 validation set.VGGFace-EE was used as the embeddings extractor; SVM was used as a classifier on suprasegmental features.The best result within the column is highlighted in bold.

Table 3 .
The performance results (%) for different combinations of face detectors, embeddings extractors, and class weighting schemes.SVM was used as a classifier on the suprasegmental features.The best results within the column are highlighted in bold.

Table 4 .
The experimental results on a selection of temporal window length and class weighting scheme on the AffWild2 validation set.AffectNet-EE was used as an embeddings extractor; LSTM was used as a temporal aggregation technique.The best result within the column is highlighted in bold.

Table 5 .
The classification performance comparison of the fully end-to-end (AffectNet-E2E) and separate model training (AffectNet-EE + LSTM) approaches on the AffWild2 validation set.The best result within the system is highlighted in bold.

Table 6 .
The performances of the Audio (A) and Visual (V) systems on the ABAW-FER Challenge Validation (Val.) and the test sets.The first block (four lines) shows the results obtained in the scope of the competition, and the rest are the results of our extended systems obtained after the challenge.All results are in % CPM.Baselines: CPM-Val.= 36%, CPM-Test = 30%.The best result within the column is highlighted in bold.

Table 7 .
Top-5-performing systems at the ABAW-FER Challenge 2020 compared to the performance of our work.

Table 8 .
The statistics of emotion change points and CPM performances around the emotion change points.# means number of; ECP means Emotional Change Point.