Learning Better Representations for Audio-Visual Emotion Recognition with Common Information

Abstract: Audio-visual emotion recognition aims to distinguish human emotional states by integrating the audio and visual data acquired during the expression of emotions. It is crucial for affect-related human-machine interaction systems, as it enables machines to respond intelligently to human emotions. One challenge of this problem is how to efficiently extract feature representations from audio and visual modalities. Although progress has been made by previous works, most of them ignore common information between audio and visual data during the feature learning process, which may limit performance, since these two modalities are highly correlated in terms of their emotional information. To address this issue, we propose a deep learning approach that efficiently utilizes common information for audio-visual emotion recognition through correlation analysis. Specifically, we design an audio network and a visual network to extract feature representations from audio and visual data, respectively, and then employ a fusion network to combine the extracted features for emotion prediction. These neural networks are trained with a joint loss combining: (i) a correlation loss based on the Hirschfeld-Gebelein-Rényi (HGR) maximal correlation, which extracts common information between audio data, visual data, and the corresponding emotion labels, and (ii) a classification loss, which extracts discriminative information from each modality for emotion prediction. We further generalize our architecture to the semi-supervised learning scenario. The experimental results on the eNTERFACE'05, BAUM-1s, and RAVDESS datasets show that common information can significantly enhance the stability of features learned from different modalities and improve emotion recognition performance.


Introduction
Emotion recognition is an important component in affect-related human-machine interaction systems [1,2], as emotion provides implicit feedback about human experience and conditions that is not easily captured by explicit input. Audio-visual emotion recognition is a common type of emotion recognition [3,4]; comprehensive overviews can be found in the surveys [5-10]. Recent works have successfully applied it to many areas, such as disease diagnosis [11,12], affective tutoring systems [13,14], marketing [15,16], and entertainment [17,18]. One challenge of audio-visual emotion recognition is how to extract feature representations of an acceptable size from audio and visual data that are effective for emotion recognition. A number of previous works [19-27] have been proposed to tackle this challenge. Although they have made progress, they usually suffer from the following two limitations.
First and foremost, conventional strategies usually cannot efficiently utilize common information between different modalities through correlation analysis. For example, in [19-21], the common information is captured by concatenating the features learned from each modality into a single feature vector. This technique often fails to exploit the complex dependencies and interactions between the modalities. Motivated by this concern, approaches [22-24] based on canonical correlation analysis (CCA) and its variants, such as kernel probabilistic CCA [25], sparse local discriminative CCA [26], and low-rank representation [27], have been proposed. Although these methods have achieved some performance improvements, they may suffer from numerical issues [28,29]. This instability arises because they need to invert the empirical covariance matrix, which can easily become singular over some mini-batches. As a workaround, such methods often restrict the feature dimensionality to a relatively small value to ensure stability, which is undesirable for the complex audio-visual emotion recognition task.
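The rank argument behind this instability can be checked directly. The following sketch (ours, not from the paper) shows that the empirical covariance of a mini-batch of n samples has rank at most n - 1, so whenever the feature dimension k exceeds the batch size, the matrix that CCA-style whitening must invert is singular:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 32, 64                         # mini-batch smaller than feature dimension
features = rng.normal(size=(n, k))
cov = np.cov(features, rowvar=False)  # k x k empirical covariance matrix

rank = np.linalg.matrix_rank(cov)
print(rank, k)  # rank is at most n - 1 = 31, far below k = 64
```

This is why such methods keep k small relative to the batch size; the correlation loss used later in this paper avoids the matrix inverse entirely.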
Besides, some previous works use heuristic features [22-25,27] or features learned from shallow neural network structures [19-21,26] for emotion recognition. For example, widely used audio heuristic features include prosody features and voice quality features [9], while typical visual heuristic features include Gabor features and HOG-TOP features [30]. Recently, convolutional neural networks (CNNs) have become popular for extracting audio and visual features for emotion recognition [19-21]. However, given the high-dimensional emotional data, the representations learned by such shallow structures are not expressive enough to predict emotions, which may lead to limited performance.
To address the above two problems, we propose a deep learning framework for audio-visual emotion recognition that efficiently utilizes common information between audio data, visual data, and the corresponding emotion labels. Figure 1 presents the structure of our system, which satisfies two goals: (i) the highly non-linear correlation of the feature representations among audio data, visual data, and the corresponding labels should be fully analyzed in order to capture the common information, and (ii) the learned audio and visual features should be expressive enough to classify the emotions. By pursuing these two goals together, we can learn feature representations that are fully discriminative for the emotion recognition task.
Specifically, we design an audio network and a visual network to learn the feature representations from audio and visual data, respectively, and then adopt a fusion network to combine the extracted audio and visual features for emotion prediction. Our neural network is trained with a joint loss function that is a linear combination of a correlation loss and a classification loss. The correlation loss is used to extract common information between audio data, visual data, and the corresponding emotion labels. We use the term common information here to describe the effectiveness of combining emotional information from different modalities through correlation analysis. It is implemented by a deep learning version of the Hirschfeld-Gebelein-Rényi (HGR) maximal correlation [31-33], a well-known measure of dependence, to learn maximally correlated feature representations of the different modalities. The classification loss is used to extract discriminative information from each modality for emotion prediction. Further, we generalize our framework to the semi-supervised learning scenario. We conduct experiments on three public audio-visual datasets: eNTERFACE'05 [34], BAUM-1s [35], and RAVDESS [36]. The results demonstrate that, by capturing common information with HGR maximal correlation, our deep learning approach can significantly enhance the stability of features learned from different modalities and improve emotion recognition performance.

Figure 1. The audio network and visual network use ResNet-50 [37] as backbone architectures. The fusion network has several fully connected layers; different settings of the fusion network are considered in Section 5.2.2. The correlation loss is used to extract common information between different modalities, and the classification loss is used to capture discriminative information from each modality for emotion prediction. During training, emotion labels are used twice: once to compute the classification loss, and once as the third modality to compute the correlation loss with the audio and visual modalities. In this way, label information is fully used to improve the discrimination ability of the feature representations. During testing, audio and visual data are used to predict the corresponding emotion labels.
To summarize, our main contributions are as follows:

• We design a deep learning framework to learn discriminative feature representations from audio and visual data for emotion recognition.

• We propose a correlation loss based on HGR maximal correlation to capture common information between audio data, visual data, and the corresponding emotion labels.

• We generalize our framework to the semi-supervised learning scenario with common information.

• We conduct experiments on the eNTERFACE'05, BAUM-1s, and RAVDESS datasets to demonstrate the effectiveness of our system.
To the best of our knowledge, ours is the first work to utilize HGR maximal correlation to extract common information between audio data, visual data, and the corresponding emotion labels for audio-visual emotion recognition. The remainder of this paper is organized as follows. In Section 2, we describe related works. In Section 3, we explain HGR maximal correlation. In Section 4, we introduce our approach in detail. Subsequently, we perform extensive experiments on the eNTERFACE'05, BAUM-1s, and RAVDESS datasets in Section 5. Finally, we draw conclusions and point out future work in Section 6.

Related Works
Audio-visual emotion recognition is related to multimodal learning and to feature extraction with deep learning. In this section, we review these two lines of work.

Multimodal Learning
In the real world, human emotion expression is inherently multimodal [38,39]. Multimodal learning [40,41], also referred to as multiview learning [42], aims to build models that can better process and relate information from multiple modalities in order to efficiently use multimodal data such as emotional data [43]. One important research topic is extracting the information shared between different modalities to achieve better predictions, which is consistent with the common information in our paper.
The straightforward way is to concatenate the feature representations learned from each modality into a single feature vector. Although this strategy is widely used, it often ignores the complex dependencies and interactions between the features of each modality and may result in unsatisfactory performance [44]. CCA-based approaches have been further proposed to capture the complex correlation between different modalities; CCA extracts the linearly maximally correlated feature mappings of two random variables. Kernel CCA [45,46] and deep CCA [47] generalize CCA to the non-linear setting using the kernel method and deep neural networks (DNNs), respectively. Subsequently, Wang et al. [48] proposed the deep canonically correlated autoencoder (DCCAE), which combines the canonical correlation loss with the reconstruction loss of autoencoders.
HGR maximal correlation is another important measure of dependence [29]. It can extract the maximally non-linearly correlated features of different modalities, showing higher efficiency than CCA. Recently, refs. [28,49] implemented HGR maximal correlation with deep learning. HGR maximal correlation and its deep learning version have become widely used [29,50-52]. For example, ref. [52] adopted HGR maximal correlation to extract the correlation between different modalities, using hand-crafted features as the input for audio-visual emotion recognition, while [29] considered HGR maximal correlation in autoencoder architectures for multimodal emotion recognition.
Although HGR maximal correlation is used in the DNN methods [29,52] to compute the common information between the input audio and visual data, the emotion label information is not considered in their correlation losses, so they cannot guarantee that the learned features have sufficient discrimination ability for the emotion recognition task. Besides, the stability of common information in deep learning models is not investigated in these methods. Here, we thoroughly study the effectiveness of common information for emotion recognition.

Feature Extraction with Deep Learning
One key challenge in audio-visual emotion recognition is feature extraction [4,53], which decides what types of features, with acceptable sizes, are learned for emotion recognition. Since the success of deep learning, DNNs have gradually been used to extract features and achieve better performance than traditional heuristic methods [54,55]. Here, we focus on DNN approaches for emotion recognition, which can be divided into two categories according to the type of input data: raw-data-based and hand-crafted-feature-based.
For audio emotion recognition, some works use hand-crafted features as the input to DNN models. For example, Ma et al. [56] proposed a multi-task attention-based DNN model and fed hand-crafted features into it. Other works directly feed raw data to DNN models. For example, Chen et al. [20] designed a network with several convolutional layers to extract audio features. Tzirakis et al. [57] proposed a new CNN with Long Short-Term Memory (LSTM) for end-to-end audio emotion recognition. Fu et al. [26] used a sparse autoencoder to obtain hidden features of audio data. Dai et al. [58] presented an approach to learn discriminative features from audio data by integrating a center loss into the deep learning model.
For visual emotion recognition, most works directly feed visual data to DNN models. For example, Mollahosseini et al. [59] proposed a network consisting of two convolutional layers, each followed by max-pooling, and four Inception layers; the Inception layers increase the depth and width of the network while keeping the computational budget constant. Jain et al. [60] proposed a hybrid convolutional-recurrent neural network, which consists of convolutional layers followed by a recurrent neural network and can model the temporal dependencies that exist in facial image sequences. Hickson et al. [61] presented an algorithm that automatically infers facial expressions by analyzing only a partially occluded face while the user is engaged in a virtual reality experience. Zhang et al. [4] employed a 3D CNN pre-trained on large-scale video classification tasks to capture feature representations from visual data.
Besides, the representation of the input data also has a significant impact on the feature extraction process. For example, Li et al. [62] divided visual emotion recognition methods into two categories according to the representation: static-based and dynamic-based. In static-based methods [59,63,64], two-dimensional (2D) networks extract spatial information from a single facial expression image, whereas dynamic-based methods [65-67] use 3D networks to capture spatial and temporal information from a facial expression sequence. Although 3D networks capture more information than 2D networks, it is difficult to choose 3D architectures that effectively improve emotion recognition performance, and their large scale may lead to cumbersome training processes. Analogously, for audio emotion recognition, some works [29,54] directly feed time-domain audio data into DNN models, while others [58,68,69] first convert the audio data into spectrogram representations similar to RGB images and then feed the spectrograms into deep learning models. The latter approach is considered more effective [69].
Recently, some popular deep learning models have been widely used, such as VGGNet [70], GoogLeNet [71], and ResNet [37]. Compared with VGGNet and GoogLeNet, ResNet has some advantages: its residual learning framework eases the training of networks that are substantially deeper than those used previously, and it is shown to outperform both of them [37]. Besides, ResNet has been successfully applied to audio emotion recognition [72-75] and visual emotion recognition [76-80]. These factors inspire us to use ResNet as the backbone of our whole network for audio-visual emotion recognition.

Preliminary
HGR maximal correlation can be regarded as a generalization of Pearson's correlation [81]. For jointly distributed random variables X and Y with ranges $\mathcal{X}$ and $\mathcal{Y}$, the HGR maximal correlation with k features is defined as shown in Equation (1):

$$\rho^{(k)}(X, Y) = \sup_{\substack{f:\mathcal{X}\to\mathbb{R}^k,\ g:\mathcal{Y}\to\mathbb{R}^k \\ \mathbb{E}[f(X)]=\mathbb{E}[g(Y)]=\mathbf{0} \\ \operatorname{cov}(f(X))=\operatorname{cov}(g(Y))=\mathbf{I}}} \mathbb{E}\big[f^{\mathrm{T}}(X)\,g(Y)\big], \tag{1}$$

where the supremum is taken over all Borel measurable functions. Besides, $\rho^{(k)}(X, Y) \geq 0$, and $\rho^{(k)}(X, Y) = 0$ if and only if X and Y are independent. HGR maximal correlation can help us extract the non-linear feature representations f and g. From the perspective of information theory, f learned from X carries the maximum information towards some aspects of Y and vice versa, which can be used to extract the common information shared by X and Y [28,82]. In [28,49], an objective function based on HGR maximal correlation that can be used directly in deep learning is proposed, as shown in Equation (2):

$$\max_{f,\,g}\ \mathbb{E}\big[f^{\mathrm{T}}(X)\,g(Y)\big] - \frac{1}{2}\operatorname{tr}\big(\operatorname{cov}(f(X))\operatorname{cov}(g(Y))\big), \tag{2}$$

where f(X) and g(Y) are constrained to be zero-mean, cov(f(X)) and cov(g(Y)) represent the covariance matrices of f(X) and g(Y), respectively, and tr(·) is the matrix trace operator. It is shown in [28] that Equation (2) implements HGR maximal correlation equivalently. Inspired by this, we design the loss function of our system based on HGR maximal correlation in order to extract the common information between different modalities. HGR maximal correlation is appealing for multimodal learning. On the one hand, it can extract the maximally non-linearly correlated features of different modalities, which Pearson's correlation cannot. On the other hand, it is efficient in deep learning frameworks. These factors inspire us to integrate HGR maximal correlation into a deep learning model for audio-visual emotion recognition.
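A sample-based version of the Equation (2) objective can be sketched in a few lines of numpy (this is our illustrative implementation, not the paper's code; `F` and `G` stand for mini-batch feature matrices produced by the two networks):

```python
import numpy as np

def soft_hgr_objective(F, G):
    """Sample-based objective of Equation (2): E[f^T g] - 1/2 tr(cov(f) cov(g)).

    F, G: (n, k) feature matrices for the two modalities. Maximizing this
    value (i.e., minimizing its negative as a loss) extracts maximally
    correlated feature representations without any matrix inversion.
    """
    F = F - F.mean(axis=0)             # enforce the zero-mean constraint
    G = G - G.mean(axis=0)
    n = F.shape[0]
    inner = (F * G).sum() / n          # empirical E[f(X)^T g(Y)]
    cov_f = F.T @ F / (n - 1)          # empirical covariance of f(X)
    cov_g = G.T @ G / (n - 1)          # empirical covariance of g(Y)
    return inner - 0.5 * np.trace(cov_f @ cov_g)

rng = np.random.default_rng(0)
F = rng.normal(size=(500, 4))
# Perfectly dependent features score higher than independent ones:
print(soft_hgr_objective(F, F) > soft_hgr_objective(F, rng.normal(size=(500, 4))))  # True
```

Note that, unlike CCA-style objectives, nothing here requires inverting a covariance matrix, which is what makes this form stable over mini-batches.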

Methodology
Our goal is to efficiently learn feature representations from audio and visual data in order to improve emotion recognition performance with common information. To achieve this goal, we propose the system shown in Figure 1. In the following, we first formalize the audio-visual emotion recognition problem and then present how the whole model is trained.

Problem Formulation
Here, we present the specific definition of audio-visual emotion recognition. Suppose that, in the training stage, we are given the samples $(x^{(i)}, y^{(i)}, z^{(i)})$, where $x^{(i)}$, $y^{(i)}$, and $z^{(i)}$, respectively, represent the audio data, the visual data, and the emotion category label, such as anger, happiness, or sadness. Their corresponding random variables are denoted as X, Y, and Z. Subsequently, we use f, g, and h to represent the k-dimensional feature functions of X, Y, and Z, respectively. We capture the HGR maximal correlation between f, g, and h, as shown in Figure 1, in order to learn audio and visual features with common information to predict the corresponding emotions.

Model Learning
To jointly extract emotional features from audio and visual data, we propose the full loss function of the whole network, which is a linear combination of the classification loss and the correlation loss, defined as Equation (3):

$$L = L_{clf} + \alpha L_{corr}. \tag{3}$$

The classification loss $L_{clf}$ measures the classification performance. The correlation loss $L_{corr}$ measures the dependencies between audio data, visual data, and the corresponding emotion labels. The parameter $\alpha$ is a weight coefficient. By including the correlation loss in the training process, our model can extract non-linearly correlated feature representations with common information to improve the performance of audio-visual emotion recognition. When designing the correlation loss, most previous works [29,52] only compute the correlation between audio and visual data, ignoring the effect of label information, which may leave the learned features not directly related to emotion prediction. It is therefore important to incorporate emotion label information into the correlation loss to enhance the discrimination ability of the learned feature representations. We thus introduce a new form of correlation loss based on HGR maximal correlation, $L_{corr}$, as shown in Equation (4):
$$\begin{aligned} L_{corr} ={} & -\mathbb{E}\big[f^{\mathrm{T}}(X)\,g(Y)\big] + \tfrac{1}{2}\operatorname{tr}\big(\operatorname{cov}(f(X))\operatorname{cov}(g(Y))\big) \\ & -\mathbb{E}\big[f^{\mathrm{T}}(X)\,h(Z)\big] + \tfrac{1}{2}\operatorname{tr}\big(\operatorname{cov}(f(X))\operatorname{cov}(h(Z))\big) \\ & -\mathbb{E}\big[g^{\mathrm{T}}(Y)\,h(Z)\big] + \tfrac{1}{2}\operatorname{tr}\big(\operatorname{cov}(g(Y))\operatorname{cov}(h(Z))\big), \end{aligned} \tag{4}$$

where the first, second, and third rows of $L_{corr}$ compute the HGR maximal correlation between f and g, f and h, and g and h, respectively, which can be regarded as learning the common information between audio data, visual data, and the corresponding emotion labels. By extracting the correlation among the different modalities, $L_{corr}$ ensures that our model has sufficient discrimination ability for emotion recognition. We can regard the first row of $L_{corr}$ as a simplified version and call it $L_{corr\_simple}$; it only considers the HGR maximal correlation between the audio feature f and the visual feature g, as shown in Equation (5):

$$L_{corr\_simple} = -\mathbb{E}\big[f^{\mathrm{T}}(X)\,g(Y)\big] + \tfrac{1}{2}\operatorname{tr}\big(\operatorname{cov}(f(X))\operatorname{cov}(g(Y))\big). \tag{5}$$
Following [29], we use the cross-entropy loss as the classification loss for emotion classification, as shown in Equation (6):

$$L_{clf} = -\frac{1}{n}\sum_{i=1}^{n} \log P_{Z|X,Y}\big(z^{(i)} \mid x^{(i)}, y^{(i)}\big), \tag{6}$$

where the predicted posterior is given by the softmax layer of the fusion network, as shown in Equation (7):

$$P_{Z|X,Y}(j \mid x, y) = \frac{\exp\big(\theta_j^{\mathrm{T}}\,\Phi(f(x), g(y))\big)}{\sum_{j'=1}^{|\mathcal{Z}|}\exp\big(\theta_{j'}^{\mathrm{T}}\,\Phi(f(x), g(y))\big)}, \tag{7}$$

where $j = 1, \dots, |\mathcal{Z}|$, $\theta_j$ represents the j-th term of the weights in the last layer of the fusion network, and $\Phi$ represents the feature function of the fully connected layers in the fusion network; $\Phi(f(X), g(Y))$ denotes applying $\Phi$ to the concatenation of f(X) and g(Y). In Section 5.3.3, we conduct experiments with different forms of $\Phi$ to test the stability of common information in our framework.

We can further consider the semi-supervised learning scenario [83,84], where labeled data may be expensive or time-consuming to obtain. Semi-supervised learning allows us to use both labeled and unlabeled data for better emotion recognition. Suppose that, in the training process, we have labeled data $(X_l, Y_l, Z_l)$ and unlabeled data $(X_u, Y_u)$, where $X_l$, $Y_l$, and $Z_l$, respectively, represent the labeled audio modality, the labeled visual modality, and the corresponding emotion labels, and $X_u$ and $Y_u$ represent the unlabeled audio and visual modalities. We use Equation (4) to compute the correlation loss $L_{corr}$ on the labeled data and Equation (5) to compute the correlation loss $L_{corr\_simple}$ on the unlabeled data. The sum of the correlation losses on these two parts is defined as $L_{corr\_semi}$, as shown in Equation (8):

$$L_{corr\_semi} = L_{corr}(X_l, Y_l, Z_l) + L_{corr\_simple}(X_u, Y_u). \tag{8}$$

When the training data are 100% labeled, $L_{corr\_semi}$ reduces to $L_{corr}$. Besides, we use the labeled data $(X_l, Y_l, Z_l)$ to compute the classification loss $L_{clf}$. We then combine $L_{corr\_semi}$ and $L_{clf}$ in the form of Equation (3) to jointly train our network for semi-supervised learning. In this way, our model can make full use of both labeled and unlabeled data for emotion recognition.
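The joint objective of Equations (3), (4), and (8) can be summarized with the following hedged numpy sketch. All function names and the weight value `alpha` are our illustrative choices; in the actual system, `f`, `g`, `h`, and the logits come from the audio, visual, label, and fusion networks:

```python
import numpy as np

def corr_loss(*feats):
    """Sum of pairwise correlation-loss terms (the rows of Equation (4)).

    Called with (f, g, h) on labeled batches, or with (f, g) on unlabeled
    batches, which corresponds to Equation (5)."""
    centered = [X - X.mean(axis=0) for X in feats]
    n = centered[0].shape[0]
    total = 0.0
    for i in range(len(centered)):
        for j in range(i + 1, len(centered)):
            A, B = centered[i], centered[j]
            inner = (A * B).sum() / n
            cov_a = A.T @ A / (n - 1)
            cov_b = B.T @ B / (n - 1)
            total += -inner + 0.5 * np.trace(cov_a @ cov_b)
    return total

def cross_entropy(logits, labels):
    """Mean cross-entropy (Equation (6)) from raw fusion-network logits."""
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def joint_loss(logits, labels, f, g, h, alpha=0.1):
    """Equation (3); alpha = 0.1 is an arbitrary placeholder weight."""
    return cross_entropy(logits, labels) + alpha * corr_loss(f, g, h)
```

In the semi-supervised case, one would add `corr_loss(f_u, g_u)` computed on the unlabeled batch, which is exactly the sum in Equation (8).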

Experiments
In this section, we evaluate our system in the following aspects: (i) to show that our approach achieves higher performance than previous works, (ii) to show that our approach can enhance the stability of features learned from audio and visual data for emotion recognition, and (iii) to show that our approach can be easily generalized to the semi-supervised learning scenario.

Datasets
We perform experiments on three audio-visual emotional datasets, eNTERFACE'05, BAUM-1s, and RAVDESS, to evaluate the effectiveness of our approach, because they are available to the research community and widely used in audio-visual emotion recognition.
The eNTERFACE'05 dataset [34] has 1287 English video samples from 42 subjects of 14 different nationalities. Each subject is first told to listen to six different situations, each eliciting one of the following emotions: anger, disgust, fear, happiness, sadness, and surprise. They then react to each of the situations, and two human experts judge whether the reaction expresses the emotion in an unambiguous way. Happiness has 213 samples, and each of the other five emotions has 216 samples. The frame rate is 25 frames per second, and the audio sample rate is 48,000 Hz. Across all samples, the shortest duration is 1.12 s and the longest is 106.92 s; 95% of the samples are longer than 1.56 s.
The BAUM-1s dataset [35] is a spontaneous audio-visual Turkish database that contains expressions of affective as well as mental states. It has 1134 video samples from 31 subjects. The subjects are first shown a sequence of images and short video clips, carefully selected and timed to evoke a set of emotions and mental states. Subsequently, they express their feelings and ideas about the images and video clips they have watched in their own words, without using predetermined scripts. The subjects are not guided in any way regarding how to perform the emotion. The database contains recordings reflecting the six basic emotions (anger, disgust, fear, happiness, sadness, and surprise) as well as boredom and contempt. It also contains several mental states, namely unsure, thinking, concentrating, and bothered. Following [4], our work focuses on recognizing the six basic emotions, which cover 521 video clips. Specifically, anger, disgust, fear, happiness, sadness, and surprise have 56, 80, 37, 173, 134, and 41 samples, respectively. The frame rate is 29.97 frames per second, and the audio sample rate is 48,000 Hz. Across all samples, the shortest duration is 0.43 s and the longest is 29.2 s; 95% of the samples are longer than 1.03 s.
The RAVDESS dataset [36] is a validated multimodal database of emotional speech and song. It is gender-balanced, consisting of 24 professional actors (12 female, 12 male) vocalizing lexically-matched statements in a neutral North American accent. Emotional expressions are elicited using techniques the actors are trained in, including method acting and emotional memory techniques. Here, we consider the speech video clips, which are recorded in audio-visual format with 1440 samples, covering calm, happy, sad, angry, fearful, surprise, disgust, and neutral expressions. Each expression is produced at two levels of emotional intensity. Ratings were provided by 247 individuals who are characteristic of untrained research participants from North America. Analogously, we only consider recognizing the six basic emotions, as with the eNTERFACE'05 and BAUM-1s datasets; each of these emotions has 192 samples. The frame rate is 29.97 frames per second, and the audio sample rate is 48,000 Hz. Across all samples, the shortest duration is 2.99 s and the longest is 5.31 s; 95% of the samples are longer than 3.24 s.
In Figure 2, we show some cropped facial images to illustrate the visual information of the eNTERFACE'05, BAUM-1s, and RAVDESS datasets. We can also provide the audio information of each dataset; however, the spectrogram of an individual sample is visually indistinguishable. We therefore average the spectrograms of different samples belonging to the same class, as shown in Figure 3. It can be seen that both the visual information and the audio information can differentiate emotions on the eNTERFACE'05, BAUM-1s, and RAVDESS datasets.

Data Preprocessing
The audio and visual data need to be extracted from the emotional video samples, which vary in duration. We consider splitting each emotional video sample into several segments of the same length and extracting audio and visual data from them. Some previous works [4,29] use a moving window with a fixed length to segment the video samples, in which case the moving step of the window must also be carefully determined. Because the number of segments for each video sample depends on the length of that video sample, and the lengths of different video samples vary widely, the distribution of the segmented dataset may differ from that of the original video dataset. This phenomenon may affect the final performance of audio-visual emotion recognition.
Here, we propose to randomly extract a segment from a given video sample with a window and to repeat this operation 30 times, thus obtaining 30 segments belonging to the same video sample. The label of the given video sample is used as the label of these 30 segments. This technique effectively ensures that the distribution of the segmented dataset is consistent with that of the original dataset, and it also has a data augmentation effect. The best window size for emotion recognition is still unclear, but it is reported in [85] that a segment longer than 0.25 s contains sufficient emotional information. Inspired by this, we set the window length to 0.5 s. The experimental results show that this length is suitable for emotion recognition.
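The random-window segmentation described above can be sketched as follows (our illustrative code; the function name and seed handling are our choices, and the same start indices would also select the corresponding video frames):

```python
import numpy as np

def random_segments(signal, sample_rate, window_sec=0.5, num_segments=30, seed=0):
    """Draw num_segments random fixed-length windows from one sample's signal.

    Every segment inherits the label of its source video, so the segmented
    dataset keeps the class distribution of the original dataset."""
    rng = np.random.default_rng(seed)
    win = int(window_sec * sample_rate)
    starts = rng.integers(0, len(signal) - win + 1, size=num_segments)
    return np.stack([signal[s:s + win] for s in starts])

audio = np.random.randn(48_000 * 3)      # stand-in for a 3 s clip at 48 kHz
segments = random_segments(audio, 48_000)
print(segments.shape)                    # (30, 24000): 30 segments of 0.5 s
```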
Each segment usually contains several consecutive frames, which express the same emotion in a very similar way. Additionally, [86] observes that at the start and end of emotional videos, the subject usually prepares to express the emotion from the neutral state, or returns to the neutral state after the emotion is shown. These factors motivate us to choose the central frame of each segment as the key frame and take all key frames as the visual data. This effectively makes the visual data contain rich emotional information for emotion classification while avoiding redundancy. Subsequently, we use MTCNN (multi-task cascaded convolutional network), as proposed in [87], to detect the human face in the visual data, cropped to the squared size 160 × 160 × 3. After this preprocessing, we feed the visual data into the visual network.
In addition, we extract the speech signal from each segment and compute its log Mel-spectrogram as the audio data. For each speech signal, we adopt 94 Mel-filter banks to obtain the log Mel-spectrogram with a 40 ms Hanning window and a 10 ms overlap, resulting in a representation of size 94 × 94. Afterwards, we convert the representation into three channels by copying the original Mel-spectrogram to each channel. Finally, we feed the audio data into the audio network.
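The channel-replication step at the end can be sketched in one line (our illustration; the array below is a random stand-in for a real 94 × 94 log Mel-spectrogram):

```python
import numpy as np

log_mel = np.random.randn(94, 94)                 # stand-in for a spectrogram
# Copy the single-channel spectrogram into three identical channels so it
# matches the 3-channel RGB input expected by the ResNet-50 audio network.
audio_input = np.repeat(log_mel[np.newaxis], 3, axis=0)
print(audio_input.shape)                          # (3, 94, 94)
```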

Network Architecture
We use ResNet-50 [37] as the backbone architecture of both the audio network and the visual network. The fully connected layer before the softmax layer of ResNet-50 has 512 units, corresponding to the feature functions f and g of the audio network and visual network, respectively. Additionally, the weights of the two networks are initialized by copying the parameters of a ResNet-50 trained on the ImageNet dataset [88].
Inspired by [89], we first convert the emotion labels into one-hot form and then use a fully connected layer with 512 units as the label network to obtain the feature function h. We then feed the concatenation of f and g into the fusion network for emotion prediction. The fusion network has several fully connected layers, each followed by the ReLU function, and its last layer is the softmax layer. The fully connected layers before the softmax layer in the fusion network correspond to the feature function Φ. Because common information in different layers of our deep learning model may perform differently, we consider different forms of the feature function Φ. Specifically, we configure the fusion network with one, two, or four fully connected layers before the softmax layer, as shown in Figure 4. We report the performance of these settings in Section 5.3.3. In this way, we can test the stability of common information in our deep learning model.
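The three fusion-network settings can be sketched in PyTorch as follows. This is a hedged reconstruction from the description above: the 512-unit layer widths follow the text, while the class count of six and the helper name are our assumptions:

```python
import torch
import torch.nn as nn

def make_fusion_network(num_fc_layers, num_classes=6, feat_dim=512):
    """Fusion network with 1, 2, or 4 FC+ReLU layers before the output layer.

    Input is the 1024-d concatenation [f(X); g(Y)]; the output layer produces
    logits, with the softmax applied inside the classification loss."""
    layers, in_dim = [], 2 * feat_dim
    for _ in range(num_fc_layers):
        layers += [nn.Linear(in_dim, feat_dim), nn.ReLU()]
        in_dim = feat_dim
    layers.append(nn.Linear(in_dim, num_classes))
    return nn.Sequential(*layers)

# The three Figure 4 configurations all map a batch of fused features to logits.
fused = torch.cat([torch.randn(8, 512), torch.randn(8, 512)], dim=1)
for depth in (1, 2, 4):
    out = make_fusion_network(depth)(fused)
    print(depth, out.shape)
```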
In [4,90], it is reported that a pre-training strategy can effectively enhance the expressiveness of the learned features. Accordingly, we first train the audio and visual networks separately with the cross-entropy loss, so that each network captures sufficient discriminative information from its own modality. We then use their weights to initialize the whole model for joint training.
After the emotion probabilities of each segment are predicted, the results are averaged across all segments belonging to the same video sample to predict the video-level emotion label.
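As a concrete illustration, the segment-to-video aggregation described above can be sketched as follows (function and variable names are our own, not taken from the paper's code):

```python
import numpy as np

def video_level_prediction(segment_probs, segment_video_ids):
    """Average segment-level emotion probabilities over all segments
    of the same video, then take the argmax as the video-level label."""
    segment_video_ids = np.asarray(segment_video_ids)
    preds = {}
    for vid in np.unique(segment_video_ids):
        mean_probs = segment_probs[segment_video_ids == vid].mean(axis=0)
        preds[int(vid)] = int(np.argmax(mean_probs))
    return preds
```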

Experimental Settings
In our experiments, 70% of the samples are used as training data, 15% as validation data, and the remaining 15% as test data. Segments that belong to the same original video sample are assigned together to the training, validation, or test set. Each experiment is run five times, and the average recognition accuracy (%) is reported as the final result. We train our model using the Adam [91] optimizer with a learning rate of 0.001. The batch size is set to 90 and the number of epochs to 100. PyTorch [92] is used to implement the proposed model, and the experiments are run on an NVIDIA TITAN V GPU.
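The video-grouped split described above can be sketched as follows; the function name, random seed, and rounding behavior are illustrative assumptions rather than the authors' implementation:

```python
import numpy as np

def split_by_video(video_ids, ratios=(0.70, 0.15, 0.15), seed=0):
    """Assign each segment to train/val/test such that all segments of
    the same original video land in the same partition, as described in
    the experimental settings."""
    rng = np.random.default_rng(seed)
    videos = np.unique(np.asarray(video_ids))
    rng.shuffle(videos)
    n = len(videos)
    n_train = int(round(ratios[0] * n))
    n_val = int(round(ratios[1] * n))
    part = {v: "train" for v in videos[:n_train]}
    part.update({v: "val" for v in videos[n_train:n_train + n_val]})
    part.update({v: "test" for v in videos[n_train + n_val:]})
    return [part[v] for v in video_ids]
```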

Experimental Results
In this section, we first show the performance of unimodal emotion recognition, then the performance of audio-visual emotion recognition, then give a stability analysis of common information and a robustness analysis on missing modality, and finally describe how our approach can be used for semi-supervised learning.

Unimodal Performance
To show the effectiveness of the learned audio and visual feature representations, we report the performance of our method on audio emotion recognition and visual emotion recognition in Tables 1 and 2, respectively. In these two scenarios, we only use the corresponding audio or visual network to extract the feature representations for emotion prediction. We also compare our work with previous works, all of which are based on the original datasets; some use hand-crafted features and some use DNN-based features. From Table 1, we can see that our learned audio features are more discriminative for emotion recognition than the hand-crafted features [93,94] on the RAVDESS dataset and the features learned by DNN models with shallow structures [20,29] on the eNTERFACE'05 dataset. These results show that our network can effectively learn audio features for emotion recognition. It is also worth noting that audio emotion recognition performs better on the RAVDESS and eNTERFACE'05 datasets than on the BAUM-1s dataset, which indicates that the first two datasets may contain more emotional information. From Table 2, we can also observe that our visual network outperforms the methods of [20,95,96]. On the eNTERFACE'05 dataset, our method achieves higher emotion recognition accuracy than the method of [95] with hand-crafted features as input and the method of [20] with raw data as input. On the RAVDESS dataset, our learned visual features yield better performance than the method of [96], which uses shallow DNN models with raw data as input. These results also show that the visual features learned by our network have more discriminative power than those of previous works.

Multimodal Performance
We report the performance of audio-visual emotion recognition on the eNTERFACE'05, BAUM-1s, and RAVDESS datasets to further demonstrate the effectiveness of our system, and compare it with previous works that also conduct experiments on these original datasets. Table 3 summarizes the results. To implement our architecture, we set α = 1 with two fully connected layers in the fusion network on the eNTERFACE'05 dataset, α = 1 with one fully connected layer on the BAUM-1s dataset, and α = 0.1 with one fully connected layer on the RAVDESS dataset. From Table 3, we can find that our method is competitive with the compared works. Specifically, on the eNTERFACE'05 dataset, the performance of our method is at least 1% higher than that of the previous methods [20][21][22]27,29,95,97,98]. On the BAUM-1s dataset, we improve the performance of the method [27] from 60.05% to 67.59%. On the RAVDESS dataset, our method outperforms the methods [99,100] by more than 13%. It is worth noting that the method [22] combines audio and visual features based on CCA, and the method [27] uses a similar approach with low-rank representation. Our method performs better than both, which shows that our proposed correlation loss learns discriminative feature representations more effectively. Besides, most of the above methods are based on DNN models, yet our method achieves better performance, indicating that the features learned by our deep learning model are more expressive. To sum up, our deep learning method can efficiently extract discriminative feature representations with common information and achieves the highest accuracies among all the compared methods.
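The per-dataset settings above reduce to choosing the weight α in the joint objective. A minimal sketch of the linear combination of the classification and correlation losses follows; the exact normalization and sign conventions of Equation (3) are assumed here rather than taken from the paper:

```python
def joint_loss(clf_loss, corr_loss, alpha):
    """Linear combination of classification loss and correlation loss,
    weighted by alpha (a sketch of the form of Equation (3))."""
    return clf_loss + alpha * corr_loss
```

With this form, the reported settings correspond to alpha = 1 (eNTERFACE'05, BAUM-1s) and alpha = 0.1 (RAVDESS).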
Figure 5 shows the classification confusion matrices of our method on the eNTERFACE'05, BAUM-1s, and RAVDESS datasets. On the eNTERFACE'05 dataset, "fear" and "surprise" are harder to recognize than the other emotions, which indicates that the audio-visual cues of these two emotions contain less emotional information. On the BAUM-1s dataset, "happiness" achieves the highest recognition accuracy among the six emotions, while "anger" and "fear" have lower recognition accuracies. This may be due to the small number of "anger" and "fear" samples on the BAUM-1s dataset, which prevents our model from fully learning the features of these two emotions for classification. On the RAVDESS dataset, both "anger" and "happiness" are identified with 100% accuracy; similar to the eNTERFACE'05 dataset, "fear" and "surprise" are more difficult to recognize. In addition to the confusion matrices, we show some misclassification examples on the eNTERFACE'05, BAUM-1s, and RAVDESS datasets in Figure 6. For example, on the eNTERFACE'05 dataset, "disgust" is misclassified as "anger" and "fear" is misclassified as "sadness". On the BAUM-1s dataset, "anger" is misclassified as "disgust" and "disgust" is misclassified as "sadness". On the RAVDESS dataset, "disgust" is misclassified as "fear" and "fear" is misclassified as "sadness". This indicates that similar emotions on these three datasets may be difficult to distinguish. For instance, the image on the left of the first row is marked with "disgust → anger", which indicates that the true label of the input data is "disgust", but our model predicts "anger".

Stability Analysis of Common Information
From Figures 1 and 4, we can see that our proposed correlation loss makes the input features of the fusion network, f and g, maximally correlated. The more fully connected layers the fusion network has, the farther the common information is from the softmax output, which may lead to different emotion recognition performance. Besides, α determines how the correlation loss operates during training; different values of α lead to different operating mechanisms of the common information. Motivated by these two factors, we investigate the stability of common information in our deep learning model.
We conduct the following study with different settings on the eNTERFACE'05, BAUM-1s, and RAVDESS datasets, as shown in Tables 4-6. We respectively let the fusion network have one, two, and four fully connected layers before the softmax layer, as shown in the setting column. The baseline in the method column means that only the classification loss is used for training.
We compare the performance of two forms of the correlation loss: L corr, and L corr_simple, which only computes the HGR maximal correlation between the features f and g. We set α to 0.01, 0.1, 1, and 10, respectively, to investigate how the common information works. For each setting, we report the performance of the L corr_simple and L corr methods.
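For readers unfamiliar with HGR-based losses, the following is a minimal numpy sketch of a correlation term between the feature batches f and g in the spirit of L corr_simple. It assumes the common Soft-HGR formulation (maximize E[fᵀg] − ½ tr(Cov(f)Cov(g)) over zero-mean features); the paper's full L corr additionally involves the label features h and may differ in detail:

```python
import numpy as np

def hgr_corr_loss(F, G):
    """Soft-HGR-style correlation loss between two feature batches
    F, G of shape (n_samples, dim). Lower loss means the two feature
    sets are more correlated; this is a sketch, not the paper's exact
    L_corr (which also uses the label features)."""
    F = F - F.mean(axis=0, keepdims=True)    # zero-mean features
    G = G - G.mean(axis=0, keepdims=True)
    n = F.shape[0]
    inner = (F * G).sum(axis=1).mean()       # empirical E[f^T g]
    cov_f = F.T @ F / (n - 1)
    cov_g = G.T @ G / (n - 1)
    penalty = 0.5 * np.trace(cov_f @ cov_g)  # whitening-like penalty
    return -(inner - penalty)                # negate: we minimize the loss
```

Minimizing this loss during joint training drives f and g toward maximally correlated, approximately whitened representations.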
We have the following observations on the three datasets: (1) The accuracy of the baseline method is higher than that of the audio network and the visual network, which indicates that the baseline method can combine the information of audio and visual data to some extent to improve the emotion recognition performance. (2) Compared with the baseline method, L corr can significantly improve the emotion recognition performance for different weight coefficients and different numbers of fully connected layers in the fusion network. This shows that our proposed correlation loss, L corr, is very stable in improving audio-visual emotion recognition. (3) The L corr_simple method can also improve emotion recognition performance, but it is weaker than the L corr method. For example, on the eNTERFACE'05 dataset, when the fusion network has one fully connected layer, the L corr_simple method with a weight coefficient of 0.1 performs worse than the baseline method. On the RAVDESS dataset, when the fusion network has four fully connected layers, the L corr_simple method with a weight coefficient of 0.01 also performs worse than the baseline method. (4) On the eNTERFACE'05 dataset, the highest accuracy of 88.66% is achieved when we use L corr and set α = 1 with two fully connected layers in the fusion network. On the BAUM-1s dataset, the highest accuracy of 67.59% is achieved when we use L corr and set α = 1 with one fully connected layer. On the RAVDESS dataset, the highest accuracy of 97.57% is achieved when we use L corr and set α = 0.1 with one fully connected layer. These results show that our method can benefit from common information with appropriate network settings. (5) When we set the weight coefficient to 0.1 or 1 and let the fusion network have one or two fully connected layers, our model performs better on all three datasets than with other settings.

Robustness Analysis on Missing Modality
In the analysis above, we assume that both audio and visual data are available during the testing stage. However, when deploying trained models in the real world, we may encounter missing-modality scenarios; a good fusion model should perform well even if a modality is missing during testing. Motivated by this, we conduct the following study on the eNTERFACE'05 dataset to verify the robustness of our model during testing.
In our architecture, we concatenate the feature f of visual data and the feature g of audio data and feed them into the fusion network for emotion classification. To represent the missing-modality problem during the testing stage, we set the feature of one modality to 0 to indicate that this modality is missing, and concatenate it with the feature of the other modality, obtained with the trained model, to predict emotion. We consider three scenarios: only audio data are missing, only visual data are missing, and both audio and visual data are missing. In each scenario, we respectively set 20%, 50%, and 80% of the test data as missing. In the third scenario, audio and visual data are each missing at half the missing rate; for example, a missing rate of 20% means that audio and visual data are each missing at 10%. We compare three methods, L corr, L corr_simple, and baseline, all with the same network structure of two fully connected layers in the fusion network. For the L corr and L corr_simple methods, we set α = 1; the baseline method uses only the classification loss for training. For each setting, we report the emotion recognition accuracy, as shown in Figure 7.
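The zero-masking test protocol described above can be sketched as follows (function and argument names are illustrative):

```python
import numpy as np

def fuse_with_missing(f_visual, g_audio, missing=None):
    """Build the fusion-network input under the missing-modality test
    protocol: the feature of a missing modality is replaced by zeros
    before the two features are concatenated."""
    if missing == "audio":
        g_audio = np.zeros_like(g_audio)
    elif missing == "visual":
        f_visual = np.zeros_like(f_visual)
    return np.concatenate([f_visual, g_audio], axis=-1)
```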
From Figure 7, we have the following observations: (1) As the missing rate increases, the emotion recognition accuracies of the L corr, L corr_simple, and baseline methods all decrease. (2) In the three data-missing scenarios, the downward tendency of the L corr_simple method with increasing missing rate is basically consistent with that of the L corr method, but the performance of the L corr_simple method is always lower than that of the L corr method. (3) Compared with the L corr method, the performance of the baseline method decreases faster as the missing rate increases. The gap between the L corr method and the baseline method becomes more apparent as the missing rate increases, especially in the scenario where only audio data are missing. This shows that the L corr method is more robust to the missing-modality problem during the testing stage.

Semi-Supervised Audio-Visual Emotion Recognition
Semi-supervised learning is useful in many tasks when labeled data are scarce or difficult to obtain. It assumes that, during the training process, we have some labeled data and some unlabeled data. By using both, semi-supervised learning can achieve better classification than supervised learning on the labeled data alone. Here, we show that our method can be easily adapted to the semi-supervised audio-visual emotion recognition task. We conduct experiments on the RAVDESS dataset. Specifically, we mask the labels of some training data to treat them as unlabeled data. For the labeled data, the audio modality, the visual modality, and the corresponding emotion labels are available; for the unlabeled data, only the audio and visual modalities are available. We respectively set 20%, 50%, 80%, and 100% of the training data as labeled data. For each semi-supervised scenario, we compare the performance of different methods. The average accuracy of each method is reported in Table 7.
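The label-masking setup can be sketched as follows; the −1 masking convention and the function name are our own illustrative choices, not the paper's:

```python
import numpy as np

def mask_labels(labels, labeled_fraction, seed=0):
    """Simulate the semi-supervised setting: keep a fraction of the
    training labels and mask the rest with -1 (an illustrative
    convention). Masked samples still provide audio and visual data
    but contribute no classification loss."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels).copy()
    n = len(labels)
    n_unlabeled = n - int(round(labeled_fraction * n))
    unlabeled = rng.choice(n, size=n_unlabeled, replace=False)
    labels[unlabeled] = -1
    return labels
```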
In the method column of Table 7, audio means that only the audio modality in the labeled data is used for audio emotion recognition, and visual means that only the visual modality in the labeled data is used for visual emotion recognition. Baseline means that the audio and visual modalities in the labeled data are trained by our whole network with only the classification loss for audio-visual emotion recognition. L corr_simple means that, for both labeled and unlabeled data, we use the correlation loss L corr_simple to compute the HGR maximal correlation between the audio and visual modalities; in addition, for labeled data, we use Equation (6) to compute the classification loss L clf. L corr_semi means that we use Equation (8) to compute the correlation loss L corr_semi; similar to the L corr_simple method, the classification loss L clf is computed on labeled data. It is worth noting that, for the L corr_simple and L corr_semi methods, the classification loss and correlation loss are combined in the form of Equation (3) with the weight coefficient α to train our network jointly. Here, we let the fusion network have two fully connected layers to implement the baseline, L corr_simple, and L corr_semi methods for audio-visual emotion recognition. By comparing with the baseline method that only uses labeled data, we find that our proposed method L corr_semi can effectively combine unlabeled and labeled data for audio-visual emotion recognition. From Table 7, we have the following observations: (1) The performance of the baseline method is better than that of the audio and visual methods, which shows the importance of combining information from different modalities to improve the emotion recognition accuracy. (2) The L corr_semi method achieves the highest accuracy among all methods, showing that it can significantly improve the performance of audio-visual emotion recognition for different weight coefficients and different percentages of labels.
(3) The L corr_simple method also contributes to improving the performance of audio-visual emotion recognition, but it is weaker than the L corr_semi method, especially when labels are insufficient. To sum up, our method can effectively improve the performance of audio-visual emotion recognition, showing its potential for semi-supervised learning.

Conclusions
In this paper, we propose an efficient deep learning approach that exploits common information between audio data, visual data, and the corresponding emotion labels for emotion recognition on the eNTERFACE'05, BAUM-1s, and RAVDESS datasets. Specifically, we design an audio network and a visual network to learn the feature representations from audio and visual data, respectively, and then use a fusion network to combine the audio and visual features for emotion recognition. The full loss function of our neural network is a linear combination of a correlation loss and a classification loss. The former extracts common information between audio data, visual data, and the corresponding emotion labels via HGR maximal correlation; the latter extracts discriminative information from each modality. We further generalize our framework to the semi-supervised learning scenario. The experimental results demonstrate that, by exploiting common information with HGR maximal correlation, our deep learning approach can significantly enhance the stability of features learned from different modalities and improve the emotion recognition performance.
In the future, we will investigate the performance of our method on more emotion recognition datasets. It is worth noting that, in the real world, both audio and visual data may be noisy, so we will further consider audio-visual emotion recognition in noisy environments. Besides, in addition to audio and visual data, physiological signals [101][102][103] and text data [104] are important modalities for characterizing human emotions. We will therefore consider combining the information of these modalities for multimodal emotion recognition.

Figure 1 .
Figure 1. The structure of our proposed system for audio-visual emotion recognition. The full loss function of our framework is a linear combination of the correlation loss and the classification loss. The audio network and visual network use ResNet-50 [37] as the backbone architectures. The fusion network has several fully connected layers; different settings of the fusion network are considered in Section 5.2.2. The correlation loss is used to extract common information between different modalities, and the classification loss is used to capture discriminative information from each modality for emotion prediction. During the training process, the emotion labels are used twice: once to compute the classification loss, and once as the third modality to compute the correlation loss with the audio and visual modalities. In this way, the label information can be fully used to improve the discriminative ability of the feature representations. In the testing process, audio and visual data are used to predict the corresponding emotion labels.
a 1 J d e 2 q t 5 6 O v + l M x a o 9 M 7 k Z 3 t U t a c D u z 3 E u g m a 1 4 j o V 9 + q 4 V C u b U e d x g E O U a Z 6 n q O E C d T T I e 4 h H P O H Z u r S E N b H u P l O t n NH s 4 9 u y H j 4 A O Z K P q w = = < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " o 1 q 1 r e O Z d l b s b 9 4 z 7 a n u N q W/ b 7 z G x E o M i f 1 L N 8 / 8 r 0 7 V I t H H m a 4 h o J p i z a j q m H H J d F f U z e 0 v V U l y i I l T u E f x h D D T y n m f b a 1 J d e 2 q t 5 6 O v + l M x a o 9 M 7 k Z 3 t U t a c D u z 3 E u g m a 1 4 j o V 9 + q 4 V C u b U e d x g E O U a Z 6 n q O E C d T T I e 4 h H P O H Z u r S E N b H u P l O t n NH s 4 9 u y H j 4 A O Z K P q w = = < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " o 1 q 1 r e O Z d l b s b 9 4 z 7 a n u N q W/ b 7 z G x E o M i f 1 L N 8 / 8 r 0 7 V I t H H m a 4 h o J p i z a j q m H H J d F f U z e 0 v V U l y i I l T u E f x h D D T y n m f b a 1 J d e 2 q t 5 6 O v + l M x a o 9 M 7 k Z 3 t U t a c D u z 3 E u g m a 1 4 j o V 9 + q 4 V C u b U e d x g E O U a Z 6 n q O E C d T T I e 4 h H P O H Z u r S E N b H u P l O t n NH s 4 9 u y H j 4 A O Z K P q w = = < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " o 1 q 1 r e O Z d l b s b 9 4 z 7 a n u N q W / b 7z G x E o M i f 1 L N 8 / 8 r 0 7 V I t H H m a 4 h o J p i z a j q m H H J d F f U z e 0 v V U l y i I l T u E f x h D D T y n m f b a 1 J d e 2 q t 5 6 O v + l M x a o 9 M 7 k Z 3 t U t a c D u z 3 E u g m a 1 4 j o V 9 + q 4 V C u b U e d x g E O U a Z 6 n q O E C d T T I e 4 h H P O H Z u r S E N b H u P l O t n N H s 4 9 u y H j 4 A O Z K P q w = = < / l a t e x i t >1024< l a t e x i t s h a 1 _ b a s e 6 4 = " b u G I j 9 i / 5 2 w J k h r z o l 0 r 9 M / A P 9 C + + M U 1 C L 6 I Q k Z 8 6 9 5 8 z c e 4 M k F J l 0 n N e C t b S 8 s r p W X C 9 t b G 5 t 7 5 R 3 9 9 p Z n K e M e y w O 4 7 Q b+ B k P R c Q 9 K W T I u 0 n K / U k Q 8 k 4 w P l X x z i 1 P M x F H l 3 K a 8 P 7 E H 0 V i K J g v i f J c p 9 4 o X 
Z c r T s 3 R y 1 4 E r g E V m N W K y y + 4 w g A x G H J M w B F B E g 7 h I 6 O n B x c O E u L 6 m B G X E h I 6 z n G P E m l z y u K U 4R M 7 p u + I d j 3 D R r R X n p l W M z o l p D c l p Y 0 j 0 s S U l x J W p 9 k 6 n m t n x f 7 m P d O e 6 m 5 T U H F E 8 J M 6 2 c 9 9 n W m k z X r n r r 6 / i b z l S s 2 j O T m + N d 3 Z I G 7 P 4 c 5 y J o 1 2 u u U 3 M v G p V m 1 Y y 6 i A M c o k r z P E Y T Z 2 j B I 2 + B R z z h 2 T q 3 E u v O m n 6 m W g W j 2 c e 3 Z T 1 8 A M + x j + Q = < /l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " b u G I j 9 i / 5 2 w J k h r z o l 0 r 9 M / A P 9 C + + M U 1 C L 6 I Q k Z 8 6 9 5 8 z c e 4 M k F J l 0 n N e C t b S 8 s r p W X C 9 t b G 5 t 7 5 R 3 9 9 p Z n K e M e y w O 4 7 Q b+ B k P R c Q 9 K W T I u 0 n K / U k Q 8 k 4 w P l X x z i 1 P M x F H l 3 K a 8 P 7 E H 0 V i K J g v i f J c p 9 4 o X Z c r T s 3 R y 1 4 E r g E V m N W K y y + 4 w g A x G H J M w B F B E g 7 h I 6 O n B x c O E u L 6 m B G X E h I 6 z n G P E m l z y u K U 4R M 7 p u + I d j 3 D R r R X n p l W M z o l p D c l p Y 0 j 0 s S U l x J W p 9 k 6 n m t n x f 7 m P d O e 6 m 5 T U H F E 8 J M 6 2 c 9 9 n W m k z X r n r r 6 / i b z l S s 2 j O T m + N d 3 Z I G 7 P 4 c 5 y J o 1 2 u u U 3 M v G p V m 1 Y y 6 i A M c o k r z P E Y T Z 2 j B I 2 + B R z z h 2 T q 3 E u v O m n 6 m W g W j 2 c e 3 Z T 1 8 A M + x j + Q = < /l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " b u G I j 9 i / 5 2 w J k h r z o l 0 r 9 M / A P 9 C + + M U 1 C L 6 I Q k Z 8 6 9 5 8 z c e 4 M k F J l 0 n N e C t b S 8 s r p W X C 9 t b G 5 t 7 5 R 3 9 9 p Z n K e M e y w O 4 7 Q b+ B k P R c Q 9 K W T I u 0 n K / U k Q 8 k 4 w P l X x z i 1 P M x F H l 3 K a 8 P 7 E H 0 V i K J g v i f J c p 9 4 o X Z c r T s 3 R y 1 4 E r g E V m N W K y y + 4 w g A x G H J M w B F B E g 7 h I 6 O n B x c O E u L 6 m B G X E h I 6 z n G P E m l z y u K U 4R M 7 p u + I d j 3 D R r R X n p l W M z o l p D c l p Y 0 j 0 s S U l x J W p 9 k 6 n m t n x f 7 m P d 
O e 6 m 5 T U H F E 8 J M 6 2 c 9 9 n W m k z X r n r r 6 / i b z l S s 2 j O T m + N d 3 Z I G 7 P 4 c 5 y J o 1 2 u u U 3 M v G p V m 1 Y y 6 i A M c o k r z P E Y T Z 2 j B I 2 + B R z z h 2 T q 3 E u v O m n 6 m W g W j 2 c e 3 Z T 1 8 A M + x j + Q = < /l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " b u G I j 9 i / 5 2 w J k h r z o l 0 r 9 M / A P 9 C + + M U 1 C L 6 I Q k Z 8 6 9 5 8 z c e 4 M k F J l 0 n N e C t b S 8 s r p W X C 9 t b G 5 t 7 5 R 3 9 9 p Z n K e M e y w O 4 7 Q b+ B k P R c Q 9 K W T I u 0 n K / U k Q 8 k 4 w P l X x z i 1 P M x F H l 3 K a 8 P 7 E H 0 V i K J g v i f J c p 9 4 o X Z c r T s 3 R y 1 4 E r g E V m N W K y y + 4 w g A x G H J M w B F B E g 7 h I 6 O n B x c O E u L 6 m B G X E h I 6 z n G P E m l z y u K U 4R M 7 p u + I d j 3 D R r R X n p l W M z o l p D c l p Y 0 j 0 s S U l x J W p 9 k 6 n m t n x f 7 m P d O e 6 m 5 T g A D B K P T w = = < / l a t e x i t > (a) One layer (b) Two layers (c) Four layers

Figure 4. Three settings of the fusion network: one layer, two layers, and four layers. For example, "two layers" indicates that the fusion network has two fully connected layers, with 1024 and 128 ReLU units, respectively. The output of the fully connected layers is fed to the softmax layer, which has six units, one for each emotion.
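As an illustration of the two-layer setting in Figure 4, the following is a minimal PyTorch sketch of such a fusion head: two fully connected layers with 1024 and 128 ReLU units, followed by a six-way softmax. The module name and the input dimension (2048, for concatenated audio and visual features) are illustrative assumptions, not part of the original specification.

```python
import torch
import torch.nn as nn

class FusionNetwork(nn.Module):
    """Two-layer fusion head: FC(1024)+ReLU -> FC(128)+ReLU -> softmax over 6 emotions."""

    def __init__(self, in_dim=2048, num_emotions=6):  # in_dim is an assumed feature size
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(in_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 128), nn.ReLU(),
            nn.Linear(128, num_emotions),
        )

    def forward(self, x):
        # Softmax layer producing a probability distribution over the six emotions.
        return torch.softmax(self.layers(x), dim=-1)

# A batch of 4 concatenated audio-visual feature vectors (hypothetical input).
fused = torch.randn(4, 2048)
probs = FusionNetwork()(fused)  # shape: (4, 6), rows sum to 1
```

The one-layer and four-layer settings in Figure 4 would differ only in the number of `Linear`/`ReLU` pairs before the final classification layer.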

Figure 5. The confusion matrices of audio-visual emotion recognition using our method on the eNTERFACE'05, BAUM-1s, and RAVDESS datasets.

Figure 6. Some misclassified samples from the eNTERFACE'05, BAUM-1s, and RAVDESS datasets. For instance, the image on the left of the first row is marked "disgust → anger", indicating that the true label of the input is "disgust" but our model predicts "anger".

Figure 7. Three scenarios with a missing modality during testing: only audio data missing, only visual data missing, and both audio and visual data missing. The performance of the L_corr method, the L_corr_simple method, and the baseline method is compared. A missing rate of 0% means that no data are missing.

Table 1. The performance comparison of audio emotion recognition with previous works on the eNTERFACE'05, BAUM-1s, and RAVDESS datasets.

Table 2. The performance comparison of visual emotion recognition with previous works on the eNTERFACE'05, BAUM-1s, and RAVDESS datasets.

Table 3. The performance comparison of audio-visual emotion recognition with previous works on the eNTERFACE'05, BAUM-1s, and RAVDESS datasets.

Table 4. The performance of audio-visual emotion recognition with different settings on the eNTERFACE'05 dataset.

Table 5. The performance of audio-visual emotion recognition with different settings on the BAUM-1s dataset.

Table 6. The performance of audio-visual emotion recognition with different settings on the RAVDESS dataset.