# Multi-Modal Residual Perceptron Network for Audio–Video Emotion Recognition


## Abstract


## 1. Introduction

#### 1.1. Emotion Recognition from Face Expression and Voice Timbre

#### 1.2. Multi-Modal Emotion Recognition

#### 1.3. Paper Contribution and Structure

1. Multi-modal framework: We propose a novel within-modality Residual Perceptron (RP) enabling efficient gradient blending when the MRPN is optimized with a multi-term loss function. The sub-networks and their target loss functions produce superior parameterized multi-modal features while preserving the original knowledge of the uni-modalities, the loss of which would impede inter-modal learning. The within-modality RP components reduce the side effects of such multi-term loss functions. As a result, we obtain significantly better performance than direct strategies, including late fusion and end-to-end training without MRPN.
2. Time augmentation of input frames: We demonstrate that data augmentation in time, performed by randomly slicing the input frame sequences of both modalities, improves recognition performance to the state of the art, even without MRPN. We also show that time augmentation does not resolve the cases where uni-modal solutions outperform multi-modal ones, while MRPN does.

## 2. Related Work

#### 2.1. Superiority in Multi-Modal Approach

#### 2.2. Potential Failures in the Existing Solutions

## 3. Hypothesis

#### 3.1. Within-Modal Information Can Be Missing or Fuzzy

#### 3.2. End-to-End Modeling for Multi-Modal Data Can Be Distorted

#### 3.3. Late Fusion Modeling for Multi-Modal Data Can Be Insufficient

## 4. Proposed Methods

#### 4.1. Functional Description of Analyzed Networks

1. ${F}_{m}$: Feature extractor for the input temporal sequence ${x}_{m}$ of modality $m$, e.g., ${F}_{v}$ for video frames ${x}_{v}$ and ${F}_{a}$ for audio segments ${x}_{a}$.
2. ${A}_{m}$: Aggregation component (SAC) mapping the temporal feature sequence to a temporal feature vector ${f}_{m}$, e.g., ${A}_{v}$, ${A}_{a}$ for video and audio features, respectively:
$${f}_{m}\doteq {A}_{m}({F}_{m}({x}_{m})) \longrightarrow {f}_{v}\doteq {A}_{v}({F}_{v}({x}_{v})),\quad {f}_{a}\doteq {A}_{a}({F}_{a}({x}_{a}))$$
3. Standard computing units: `DenseUnit`—affine (a.k.a. dense, fully connected) layer; `Dropout`—random dropping of elements for model regularization; `FeatureNorm`—normalization for learned-data regularization (batch normalization in the current implementation); `Concatenate`—joining of feature maps; `ReLU`, `Sigmoid`—activation units.
4. `Scoring`—component mapping feature vectors to a vector of class scores, usually composed of the operations
$$\to DenseUnit\to ReLU\to FeatureNorm\to DenseUnit$$
$$m\in \{v,a\},\quad {\widehat{f}}_{m}\doteq FeatureNorm({f}_{m}) \longrightarrow {s}_{m}\doteq Scoring({\widehat{f}}_{m})$$
$${g}_{va}\doteq FeatureNorm(Concatenate({g}_{v},{g}_{a})) \longrightarrow {s}_{va}\doteq Scoring({g}_{va})$$
5. `FusionComponent`—concatenates its inputs ${g}_{v},{g}_{a}$, applies statistical normalization, and produces the vector of class scores:
$${s}_{va}\doteq FusionComponent({g}_{v},{g}_{a}) \longrightarrow {g}_{v},{g}_{a}\to Concatenate\to Scoring\to {s}_{va}$$
In our networks, ${g}_{v},{g}_{a}$ are the statistically normalized multi-modal features (${\widehat{f}}_{v},{\widehat{f}}_{a}$) or their residually updated form (${f}_{v}^{\prime}$, ${f}_{a}^{\prime}$)—cf. those symbols in Figure 6.
6. `SoftMax`—computing unit normalizing class scores to class probabilities:
$$m\in \{v,a\} \longrightarrow {p}_{m}\doteq SoftMax({s}_{m}),\quad {p}_{va}\doteq SoftMax({s}_{va})$$
7. `CrossEntropy`—a divergence of probability distributions used as the loss function. Let $p$ be the target probability distribution. Then the following loss functions are defined:
$$m\in \{v,a\},\quad {p}_{m}\doteq SoftMax({s}_{m}) \longrightarrow {\mathcal{L}}_{m}\doteq CrossEntropy(p,{p}_{m})$$
$${p}_{va}\doteq SoftMax({s}_{va}) \longrightarrow {\mathcal{L}}_{va}\doteq CrossEntropy(p,{p}_{va})$$
$$\mathcal{L}\doteq {\mathcal{L}}_{v}+{\mathcal{L}}_{a}+{\mathcal{L}}_{va}$$
8. `ResPerceptron` (Residual Perceptron)—component applying statistical normalization to a dense unit (perceptron) computing residuals for the normalized data. In our solution it transforms a modal feature vector ${f}_{m}$ into ${f}_{m}^{\prime}$ as follows:
$${\widehat{f}}_{m}\doteq FeatureNorm({f}_{m}) \longrightarrow {f}_{m}^{\prime}\doteq ResPerceptron({\widehat{f}}_{m})$$
$${f}_{m}^{\prime}\doteq {\widehat{f}}_{m}+FeatureNorm(Sigmoid(DenseUnit({\widehat{f}}_{m})))$$
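The ResPerceptron transform above can be sketched in a few lines of NumPy. This is only an illustrative stand-in, not the paper's implementation: per-vector standardization replaces batch normalization, and the weight matrix, bias, and feature dimension are arbitrary placeholders.

```python
import numpy as np

def feature_norm(x, eps=1e-5):
    # Per-vector standardization; the paper uses batch normalization here.
    return (x - x.mean()) / (x.std() + eps)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def res_perceptron(f_hat, W, b):
    # f'_m = f_hat_m + FeatureNorm(Sigmoid(DenseUnit(f_hat_m)))
    return f_hat + feature_norm(sigmoid(W @ f_hat + b))

rng = np.random.default_rng(0)
d = 8                                  # illustrative feature dimension
f_m = rng.normal(size=d)               # a modal feature vector f_m
W, b = 0.1 * rng.normal(size=(d, d)), np.zeros(d)

f_hat = feature_norm(f_m)              # normalized feature f_hat_m
f_prime = res_perceptron(f_hat, W, b)  # residually updated f'_m, same shape
```

Note that the skip connection keeps the output in the same space as the normalized input, so the fusion component can consume either form interchangeably.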

1. Network ${\mathcal{N}}_{0}({f}_{v},{f}_{a};p)$ with fusion component and loss function ${\mathcal{L}}_{va}$:
$${\widehat{f}}_{v}\doteq FeatureNorm({f}_{v}),\quad {\widehat{f}}_{a}\doteq FeatureNorm({f}_{a})$$
$${s}_{va}\doteq FusionComponent({\widehat{f}}_{v},{\widehat{f}}_{a})$$
$${p}_{va}\doteq SoftMax({s}_{va}) \longrightarrow {\mathcal{L}}_{va}\doteq CrossEntropy(p,{p}_{va})$$
2. Network ${\mathcal{N}}_{1}({f}_{v},{f}_{a};p)$ with fusion component and fused loss function $\mathcal{L}\doteq {\mathcal{L}}_{v}+{\mathcal{L}}_{a}+{\mathcal{L}}_{va}$:
$${\widehat{f}}_{v}\doteq FeatureNorm({f}_{v}),\quad {\widehat{f}}_{a}\doteq FeatureNorm({f}_{a})$$
$${s}_{v}\doteq DenseUnit({\widehat{f}}_{v}),\quad {s}_{a}\doteq DenseUnit({\widehat{f}}_{a}),\quad {s}_{va}\doteq FusionComponent({\widehat{f}}_{v},{\widehat{f}}_{a})$$
$${p}_{v}\doteq SoftMax({s}_{v}) \longrightarrow {\mathcal{L}}_{v}\doteq CrossEntropy(p,{p}_{v})$$
$${p}_{a}\doteq SoftMax({s}_{a}) \longrightarrow {\mathcal{L}}_{a}\doteq CrossEntropy(p,{p}_{a})$$
$${p}_{va}\doteq SoftMax({s}_{va}) \longrightarrow {\mathcal{L}}_{va}\doteq CrossEntropy(p,{p}_{va})$$
3. Network ${\mathcal{N}}_{2}({f}_{v},{f}_{a};p)$ with normalized residual perceptrons, fusion component, and fused loss function $\mathcal{L}\doteq {\mathcal{L}}_{v}+{\mathcal{L}}_{a}+{\mathcal{L}}_{va}$:
$${\widehat{f}}_{v}\doteq FeatureNorm({f}_{v}),\quad {\widehat{f}}_{a}\doteq FeatureNorm({f}_{a})$$
$${f}_{v}^{\prime}\doteq ResPerceptron({\widehat{f}}_{v}),\quad {f}_{a}^{\prime}\doteq ResPerceptron({\widehat{f}}_{a})$$
$${s}_{v}\doteq DenseUnit({\widehat{f}}_{v}),\quad {s}_{a}\doteq DenseUnit({\widehat{f}}_{a}),\quad {s}_{va}\doteq FusionComponent({f}_{v}^{\prime},{f}_{a}^{\prime})$$
$${p}_{v}\doteq SoftMax({s}_{v}) \longrightarrow {\mathcal{L}}_{v}\doteq CrossEntropy(p,{p}_{v})$$
$${p}_{a}\doteq SoftMax({s}_{a}) \longrightarrow {\mathcal{L}}_{a}\doteq CrossEntropy(p,{p}_{a})$$
$${p}_{va}\doteq SoftMax({s}_{va}) \longrightarrow {\mathcal{L}}_{va}\doteq CrossEntropy(p,{p}_{va})$$
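The fused loss of ${\mathcal{N}}_{1}$/${\mathcal{N}}_{2}$ is a plain sum of three cross-entropy terms. A minimal NumPy sketch, in which the branch score vectors are random placeholders for the outputs of the uni-modal and fusion branches:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())          # shift for numerical stability
    return e / e.sum()

def cross_entropy(p, q, eps=1e-12):
    # CE(p, q) = -sum_c p_c * log(q_c)
    return float(-(p * np.log(q + eps)).sum())

C = 8                                # 8 emotion categories (as in RAVDESS)
p = np.zeros(C); p[3] = 1.0          # one-hot target distribution

rng = np.random.default_rng(1)
s_v, s_a, s_va = rng.normal(size=(3, C))   # placeholder branch scores

L_v = cross_entropy(p, softmax(s_v))       # video-branch loss term
L_a = cross_entropy(p, softmax(s_a))       # audio-branch loss term
L_va = cross_entropy(p, softmax(s_va))     # fusion-branch loss term
L = L_v + L_a + L_va                       # fused multi-term loss
```

Because the terms are summed, the backward pass blends gradients from the uni-modal branches into the shared shallow layers alongside the fusion gradient.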

1. All instances of the FeatureNorm unit are implemented as batch normalization units.
2. In testing mode, only the central branch of networks ${\mathcal{N}}_{1},{\mathcal{N}}_{2}$ is active; the side branches are inactive, as they serve only to compute the extra terms of the extended loss function.
3. These facts make the network architectures ${\mathcal{N}}_{0},{\mathcal{N}}_{1}$ equivalent in testing mode. However, the models trained for those architectures are not the same, since their weights are optimized for different loss functions.
4. In testing mode, all Dropout units are inactive as well.
5. The architecture of the FusionComponent is identical in all three networks. The difference between the models of the ${\mathcal{N}}_{0}$ and ${\mathcal{N}}_{1}$ networks follows from their different loss functions, while the difference between the models of the ${\mathcal{N}}_{1}$ and ${\mathcal{N}}_{2}$ networks is implied by the ResPerceptron (RP) components used in the ${\mathcal{N}}_{2}$ network.
6. To control the range of the affine combinations computed by the Residual Perceptron (RP) component, we use Sigmoid activations instead of the ReLU activations exploited in the other components. The experiments confirm the advantage of this design decision.
7. The Residual Perceptron (RP) was introduced in network ${\mathcal{N}}_{2}$ to achieve a better parameterization of the within-modal features before their fusion.

#### 4.2. MRPN Components’ Role in Multi-Term Optimization

1. As discussed in the hypothesis section, the late fusion strategy has the advantage of preserving the best information of each uni-modality: each uni-modal network extracts generalized deep features that suffer little from the outliers of its own modality, i.e., a small amount of wrongly labeled data does not contribute to the generalized feature patterns; it is “filtered out” by the uni-modal neural network. The additional loss terms therefore blend the gradients in the shallow layers of each modality, helping to parameterize the features better before fusion while preserving the knowledge obtained when the uni-modalities are trained separately. These facts let the end-to-end strategy lose less inter-modal information than late fusion does.
2. However, multi-term optimization can result in inferior uni-modal features being extracted as the input to the fusion component; this problem was noted in the literature [23,24,25]. RP is introduced to produce modified uni-modal features: instead of storing all knowledge for uni-modal and multi-modal purposes in one unit, which causes a clash of losses converging from two directions, the uni-modal and multi-modal knowledge can be stored in the original uni-modal features and in the modified multi-modal features, respectively, creating a new path for gradient flow. RP can thus preserve the best of the uni-modal solution while the modified features from the shortcut still serve the integration of new multi-modal features.
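The “new path for the gradient flow” can already be seen in a scalar caricature of the RP: because of the skip connection, the derivative of the output with respect to the normalized feature equals the branch derivative plus an identity term, so the uni-modal signal is never fully overwritten by the fused branch. A sketch using finite differences, with an illustrative branch function of our choosing:

```python
import math

def branch(x):
    # Illustrative smooth branch (stands in for FeatureNorm(Sigmoid(DenseUnit(x)))).
    return 1.0 / (1.0 + math.exp(-0.5 * x))

def rp(x):
    # Scalar caricature of the Residual Perceptron: identity plus branch.
    return x + branch(x)

def num_grad(f, x, h=1e-6):
    # Central finite-difference approximation of df/dx.
    return (f(x + h) - f(x - h)) / (2.0 * h)

x = 0.7
g_branch = num_grad(branch, x)   # gradient through the branch alone
g_rp = num_grad(rp, x)           # gradient through the residual unit
# d(rp)/dx = 1 + d(branch)/dx: the skip connection preserves a unit path.
```

Whatever the branch contributes, the unit identity term guarantees that gradients reaching the fused layer also flow back undamped to the normalized uni-modal features.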

#### 4.3. MRPN in General Multi-Modal Applications

#### 4.4. Pre-Processing

1. Spatial data augmentation for visual frames: The facial area in the visual input frames is cropped using a CNN solution from the Dlib library [35]. Once the facial area is cropped, spatial video augmentation is applied during the training phase. The same random augmentation parameters are applied to all frames of a video source, as illustrated in Figure 8.
2. Time-dependent data augmentation for visual frames: Obviously, expressions of the same category do not all last the same duration. To make our system robust to the inconsistent duration of emotion events, we perform data augmentation in time by randomly slicing the original frame sequence, as Figure 9 illustrates. This operation should also avoid slices with too few input frames, which would miss information about the expression events. The training segments are therefore selected to last at least one second, unless the original duration of the file is shorter.
3. Spatial data augmentation for vocal frames: Raw audio inputs are resampled at 16 kHz and standardized by their mean and standard deviation, without any denoising or cutting, to remove the influence of the speaker's distance to the microphone or the speaker's subjective base volume. The standardized wave is then divided into one-second segments and converted to spectrograms using a Hann window of size 512 and a hop size of 64. These settings fix the size of the spectrogram at 256 × 250, approaching the required input shape of ResNet-18 [36], the CNN extractor used in our experiments. With such inputs, ResNet-18 operates close to its desired performance, utilizing the advantage of its middle deep features.
4. Time-dependent augmentation for vocal frames: Similar to the time augmentation of visual inputs, raw audio inputs are also randomly sliced. The raw data are further over-sampled, in both training and testing modes, by a hopping window of 0.2 s, i.e., 1/5 of the duration of the segments input to the CNN extractor. The oversampling further improved our results by increasing the number of deep features passed from the CNN output sequence to the SAC, granting the SAC the opportunity to investigate the temporal information in the deep feature vectors in more detail.
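The shape bookkeeping behind the 256 × 250 spectrogram and the 0.2 s hopping window can be checked with a few lines of arithmetic. The exact STFT padding convention and the overlap of the oversampling window are our assumptions, since the text does not pin them down:

```python
sr = 16_000                      # resampling rate (Hz)
seg = sr * 1                     # one-second segment, in samples
n_fft, hop = 512, 64             # Hann window size and hop size

freq_bins = n_fft // 2           # 256, keeping one half of the symmetric spectrum
time_frames = seg // hop         # 250 frames under a centered/padded framing
spectrogram_shape = (freq_bins, time_frames)   # (256, 250), as stated above

win = int(0.2 * sr)              # 0.2 s hopping window = 3200 samples
windows = seg // win             # 5 windows per segment if laid end to end
```

With any hop smaller than `win`, the same segment yields more than five windows, which is the oversampling effect that feeds extra deep feature vectors to the SAC.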

## 5. Computational Experiments and Their Discussion

#### 5.1. Datasets

1. The RAVDESS dataset includes both speech and song files. For the speech recognition task, we use only the speech files. They comprise 2880 files in which 24 actors (12 female, 12 male) state two lexically identical sentences. Speech includes calm, happy, sad, angry, fearful, surprised, and disgusted expressions, each produced at two levels of emotional intensity (normal, strong), plus an additional neutral expression, for a total of 8 categories. To the best of our knowledge, it is the most recent audio-video emotional dataset with the highest video quality in this research area.
2. The Crema-d dataset consists of visual and vocal emotional speech files covering a range of basic emotional states (happy, sad, anger, fear, disgust, and neutral). Its 7442 clips of 91 actors with diverse ethnic backgrounds were rated by multiple raters in three modalities: audio, visual, and audio-visual.

#### 5.2. Model Organization and Computational Setup

#### 5.3. Data Augmentation Cannot Generalize Multi-Modal Feature Patterns

#### 5.4. Discussion on Inferior Multi-Modal Cases

#### 5.5. Improvement of MRPN

#### 5.6. Comparing Baseline with SOTA

## 6. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## Abbreviations

| Abbreviation | Meaning |
|---|---|
| AVER | Audio-video emotion recognition |
| CNN | Convolutional Neural Network |
| Crema-d [33] | Crowd-sourced Emotional Multimodal Actors Dataset |
| DNN | Deep Neural Network |
| HCI | Human–Computer Interaction |
| FC | Fully connected layer |
| IEMOCAP [30] | Interactive Emotional Dyadic Motion Capture dataset |
| LSTM | Long Short-Term Memory |
| MRPN | Multi-modal Residual Perceptron Network |
| RAVDESS [34] | The Ryerson Audio–Visual Database of Emotional Speech and Song |
| RP | Residual Perceptron |
| SAC | Sequence Aggregation Component |
| SOTA | State of the art |
| STFT | Short-time Fourier transform |
| SVM | Support Vector Machine |
| VIT [38] | Vision Transformer |

## References

1. Belhumeur, P.N.; Hespanha, J.P.; Kriegman, D.J. Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection. IEEE Trans. Pattern Anal. Mach. Intell. **1997**, 19, 711–720.
2. Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-Based Learning Applied to Document Recognition. Proc. IEEE **1998**, 86, 2278–2324.
3. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. **1997**, 9, 1735–1780.
4. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv **2017**, arXiv:1706.03762.
5. Neverova, N.; Wolf, C.; Taylor, G.W.; Nebout, F. ModDrop: Adaptive multi-modal gesture recognition. arXiv **2015**, arXiv:1501.00102.
6. Vielzeuf, V.; Pateux, S.; Jurie, F. Temporal Multimodal Fusion for Video Emotion Classification in the Wild. arXiv **2017**, arXiv:1709.07200.
7. Beard, R.; Das, R.; Ng, R.W.M.; Gopalakrishnan, P.G.K.; Eerens, L.; Swietojanski, P.; Miksik, O. Multi-Modal Sequence Fusion via Recursive Attention for Emotion Recognition. In Proceedings of the 22nd Conference on Computational Natural Language Learning; Association for Computational Linguistics: Brussels, Belgium, 2018; pp. 251–259.
8. Ghaleb, E.; Popa, M.; Asteriadis, S. Multimodal and Temporal Perception of Audio-visual Cues for Emotion Recognition. In Proceedings of the 2019 8th International Conference on Affective Computing and Intelligent Interaction (ACII), Cambridge, UK, 3–6 September 2019; pp. 552–558.
9. Zadeh, A.; Liang, P.P.; Mazumder, N.; Poria, S.; Cambria, E.; Morency, L.P. Memory Fusion Network for Multi-view Sequential Learning. arXiv **2018**, arXiv:1802.00927.
10. Mansouri-Benssassi, E.; Ye, J. Speech Emotion Recognition With Early Visual Cross-modal Enhancement Using Spiking Neural Networks. In Proceedings of the 2019 International Joint Conference on Neural Networks (IJCNN), Budapest, Hungary, 14–19 July 2019; pp. 1–8.
11. Zhang, S.; Zhang, S.; Huang, T.; Gao, W.; Tian, Q. Learning Affective Features with a Hybrid Deep Model for Audio–Visual Emotion Recognition. IEEE Trans. Circuits Syst. Video Technol. **2018**, 28, 3030–3043.
12. Ristea, N.; Duţu, L.C.; Radoi, A. Emotion Recognition System from Speech and Visual Information based on Convolutional Neural Networks. In Proceedings of the 2019 International Conference on Speech Technology and Human–Computer Dialogue (SpeD), Timisoara, Romania, 10–12 October 2019; pp. 1–6.
13. Tzinis, E.; Wisdom, S.; Remez, T.; Hershey, J.R. Improving On-Screen Sound Separation for Open Domain Videos with Audio–Visual Self-attention. arXiv **2021**, arXiv:2106.09669.
14. Wu, Y.; Zhu, L.; Yan, Y.; Yang, Y. Dual Attention Matching for Audio–Visual Event Localization. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019.
15. Ghaleb, E.; Popa, M.; Asteriadis, S. Metric Learning-Based Multimodal Audio–Visual Emotion Recognition. IEEE MultiMedia **2020**, 27, 37–48.
16. Noroozi, F.; Marjanovic, M.; Njegus, A.; Escalera, S.; Anbarjafari, G. Audio–Visual Emotion Recognition in Video Clips. IEEE Trans. Affect. Comput. **2019**, 10, 60–75.
17. Hossain, M.S.; Muhammad, G. Emotion recognition using deep learning approach from audio–visual emotional big data. Inf. Fusion **2019**, 49, 69–78.
18. Ma, F.; Zhang, W.; Li, Y.; Huang, S.L.; Zhang, L. Learning Better Representations for Audio–Visual Emotion Recognition with Common Information. Appl. Sci. **2020**, 10, 7239.
19. Wang, L.; Xiong, Y.; Wang, Z.; Qiao, Y.; Lin, D.; Tang, X.; Gool, L.V. Temporal Segment Networks: Towards Good Practices for Deep Action Recognition. arXiv **2016**, arXiv:1608.00859.
20. Simonyan, K.; Zisserman, A. Two-Stream Convolutional Networks for Action Recognition in Videos. arXiv **2014**, arXiv:1406.2199.
21. Carreira, J.; Zisserman, A. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. arXiv **2018**, arXiv:1705.07750.
22. Wang, W.; Tran, D.; Feiszli, M. What Makes Training Multi-Modal Classification Networks Hard? In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–18 June 2020; pp. 12692–12702.
23. Standley, T.; Zamir, A.R.; Chen, D.; Guibas, L.; Malik, J.; Savarese, S. Which Tasks Should Be Learned Together in Multi-task Learning? arXiv **2020**, arXiv:1905.07553.
24. Caruana, R. Multitask Learning. Mach. Learn. **1997**, 28, 41–75.
25. Yu, T.; Kumar, S.; Gupta, A.; Levine, S.; Hausman, K.; Finn, C. Gradient Surgery for Multi-Task Learning. arXiv **2020**, arXiv:2001.06782.
26. Goodfellow, I.J.; Erhan, D.; Carrier, P.L.; Courville, A.; Mirza, M.; Hamner, B.; Cukierski, W.; Tang, Y.; Thaler, D.; Lee, D.H.; et al. Challenges in representation learning: A report on three machine learning contests. Neural Netw. **2015**, 64, 59–63.
27. Wang, W.; Fu, Y.; Sun, Q.; Chen, T.; Cao, C.; Zheng, Z.; Xu, G.; Qiu, H.; Jiang, Y.G.; Xue, X. Learning to Augment Expressions for Few-shot Fine-grained Facial Expression Recognition. arXiv **2020**, arXiv:2001.06144.
28. Ng, H.W.; Nguyen, V.D.; Vonikakis, V.; Winkler, S. Deep Learning for Emotion Recognition on Small Datasets Using Transfer Learning. In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction; Association for Computing Machinery: New York, NY, USA, 2015; pp. 443–449.
29. Dhall, A.; Kaur, A.; Goecke, R.; Gedeon, T. EmotiW 2018: Audio–Video, Student Engagement and Group-Level Affect Prediction. In Proceedings of the 20th ACM International Conference on Multimodal Interaction, Boulder, CO, USA, 16–20 October 2018; pp. 653–656.
30. Busso, C.; Bulut, M.; Lee, C.C.; Kazemzadeh, A.; Mower Provost, E.; Kim, S.; Chang, J.; Lee, S.; Narayanan, S. IEMOCAP: Interactive emotional dyadic motion capture database. Lang. Resour. Eval. **2008**, 42, 335–359.
31. Mustaqeem; Kwon, S. A CNN-Assisted Enhanced Audio Signal Processing for Speech Emotion Recognition. Sensors **2020**, 20, 183.
32. Latif, S.; Rana, R.; Qadir, J.; Epps, J. Variational Autoencoders for Learning Latent Representations of Speech Emotion. arXiv **2017**, arXiv:1712.08708.
33. Cao, H.; Cooper, D.G.; Keutmann, M.K.; Gur, R.C.; Nenkova, A.; Verma, R. CREMA-D: Crowd-Sourced Emotional Multimodal Actors Dataset. IEEE Trans. Affect. Comput. **2014**, 5, 377–390.
34. Livingstone, S.R.; Russo, F.A. The Ryerson Audio–Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE **2018**, 13, e0196391.
35. King, D.E. Dlib-Ml: A Machine Learning Toolkit. J. Mach. Learn. Res. **2009**, 10, 1755–1758.
36. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. arXiv **2015**, arXiv:1512.03385.
37. Nagrani, A.; Chung, J.S.; Zisserman, A. VoxCeleb: A Large-Scale Speaker Identification Dataset. 2017. Available online: https://arxiv.org/pdf/1706.08612.pdf (accessed on 12 August 2021).
38. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv **2020**, arXiv:2010.11929.
39. Mustaqeem; Sajjad, M.; Kwon, S. Clustering-Based Speech Emotion Recognition by Incorporating Learned Features and Deep BiLSTM. IEEE Access **2020**, 8, 79861–79875.

**Figure 1.** The proposed multi-modal emotion recognition system using a Deep Neural Network (DNN) approach.

**Upper part:** Video frames and audio spectral segments get independent temporal embeddings, to be fused by our multi-modal Residual Perceptron Network (MRPN).

**Lower part:** MRPN normalizes the features of each modality via the proposed residual perceptrons and then scores their concatenated outputs in the fusion component. The uni-modal prediction branches are only active in training mode.

**Figure 2.**Video frames of visual facial expressions selected from RAVDESS (Ryerson Audio–Visual Database of Emotional Speech and Song) dataset.

**Figure 3.**Mel spectrograms of vocal timbres selected from RAVDESS (Ryerson Audio–Visual Database of Emotional Speech and Song) dataset.

**Figure 4.** Visualization (t-SNE algorithm) of deep feature clustering from two different setups in which the train/validation sets are shuffled. The clustering with respect to the emotion classes is shown. 0: Neutral; 1: Calm; 2: Happy; 3: Sad; 4: Angry; 5: Fearful; 6: Disgust; 7: Surprised.

**Top part**: Clustering results from one setup of the uni-modalities and the multi-modality. Left part: image modality only. Middle part: audio modality only. Right part: multi-modality.

**Bottom part**: Clustering results from the other setup, with the train/validation sets shuffled differently.

**Figure 5.** Distorted gradient backpropagation in one modality: the gradients from the fused layer affect the gradient flow into the neural weights of both modalities.

**Figure 6.** Evolution of network design for multi-modal fusion (presented for training mode). ${\mathcal{N}}_{0}$: Fusion component (FC) only. ${\mathcal{N}}_{1}$ ([22]): Besides the FC, independent scoring of each modality is considered. ${\mathcal{N}}_{2}$: Extension of the ${\mathcal{N}}_{1}$ network by Residual Perceptrons (RP) in each modality branch.

**Figure 7.**Generalization of our MRPN fusion approach to many modalities. It could be used for either regression or classification applications.

**Figure 8.**Visual comparison of augmentation procedure for cropped video frames.

**Top part**: Original video frames.

**Middle part**: Applying random augmentation parameters—same for all frames.

**Bottom part**: Applying random augmentation parameters—different for each frame.

**Figure 9.** Examples of time-dependent augmentation for visual frames.

**Top part**: Original frames.

**Middle part**: Sliced frames starting at the beginning of the original sequence.

**Bottom part**: Sliced frames starting at the middle of the original sequence.

**Table 1.** Comparison of single-modality models with the ${\mathcal{N}}_{0}$ model (RAVDESS cases): VM—visual modality only; AM—audio modality only; JM—joint modalities (${\mathcal{N}}_{0}$ model); T—with time augmentation by random signal slicing; NT—without time augmentation.

| RAVDESS | A1,2 | A3,4 | A5,6 | A7,8 | A9,10 | A11,12 |
|---|---|---|---|---|---|---|
| AM (NT) | 70.8% | 55.0% | 57.5% | 74.1% | 43.5% | 65.8% |
| AM (T) | 71.6% | 77.5% | 71.6% | 90.0% | 55.8% | 69.1% |
| VM (NT) | 82.5% | 70.0% | 66.7% | 74.1% | 80.3% | 63.3% |
| VM (T) | 86.6% | 75.0% | 70.6% | 76.6% | 87.3% | 69.1% |
| JM (NT) | 90.8% | 89.1% | 85.2% | 89.3% | 78.5% | 85.5% |
| JM (T) | 97.5% | 90.3% | 87.5% | 97.5% | 86.5% | 87.5% |

| RAVDESS | A13,14 | A15,16 | A17,18 | A19,20 | A21,22 | A23,24 |
|---|---|---|---|---|---|---|
| AM (NT) | 59.8% | 57.5% | 51.6% | 55.5% | 55.8% | 63.3% |
| AM (T) | 70.0% | 69.1% | 57.5% | 63.3% | 68.3% | 68.3% |
| VM (NT) | 71.3% | 60.0% | 63.3% | 70.8% | 65.8% | 70.8% |
| VM (T) | 73.3% | 65.0% | 64.1% | 78.3% | 66.6% | 74.1% |
| JM (NT) | 77.5% | 75.5% | 76.3% | 85.2% | 82.8% | 80.0% |
| JM (T) | 82.4% | 79.6% | 83.2% | 89.0% | 85.5% | 84.2% |

**Table 2.** Comparison, on RAVDESS, of the MRPN approach (network ${\mathcal{N}}_{2}$) with the late fusion strategy (${\mathcal{N}}_{0}$), the end-to-end strategy (${\mathcal{N}}_{0}$), and the advanced end-to-end fusion strategy (${\mathcal{N}}_{1}$).

| RAVDESS | A1,2 | A3,4 | A5,6 | A7,8 | A9,10 | A11,12 |
|---|---|---|---|---|---|---|
| ${\mathcal{N}}_{0}$ (late fusion) | 61.6% | 92.1% | 87.5% | 96.6% | 66.6% | 87.5% |
| ${\mathcal{N}}_{0}$ (end-to-end) | 97.5% | 90.3% | 87.5% | 97.5% | 86.5% | 87.5% |
| ${\mathcal{N}}_{1}$ (end-to-end) | 97.5% | 89.1% | 88.3% | 97.5% | 90.0% | 90.0% |
| ${\mathcal{N}}_{2}$ (end-to-end) | 97.5% | 92.1% | 90.8% | 97.5% | 91.4% | 90.0% |

| RAVDESS | A13,14 | A15,16 | A17,18 | A19,20 | A21,22 | A23,24 |
|---|---|---|---|---|---|---|
| ${\mathcal{N}}_{0}$ (late fusion) | 80.8% | 85.0% | 81.6% | 87.5% | 86.6% | 65.8% |
| ${\mathcal{N}}_{0}$ (end-to-end) | 82.4% | 79.6% | 83.2% | 89.0% | 85.5% | 84.2% |
| ${\mathcal{N}}_{1}$ (end-to-end) | 77.5% | 89.1% | 86.6% | 92.5% | 89.1% | 90.6% |
| ${\mathcal{N}}_{2}$ (end-to-end) | 84.3% | 89.7% | 89.8% | 93.3% | 90.6% | 90.6% |

**Table 3.** Comparison, on Crema-d, of the MRPN approach (network ${\mathcal{N}}_{2}$) with the simple fusion strategy (${\mathcal{N}}_{0}$) and the advanced fusion strategy (${\mathcal{N}}_{1}$).

| Crema-d | S1 | S2 | S3 | S4 | S5 |
|---|---|---|---|---|---|
| ${\mathcal{N}}_{0}$ (late fusion) | 76.5% | 79.9% | 76.6% | 62.3% | 78.2% |
| ${\mathcal{N}}_{0}$ (end-to-end) | 77.3% | 81.3% | 79.2% | 74.8% | 78.6% |
| ${\mathcal{N}}_{1}$ (end-to-end) | 72.6% | 82.3% | 77.3% | 74.8% | 74.2% |
| ${\mathcal{N}}_{2}$ (end-to-end) | 79.5% | 83.0% | 83.0% | 76.8% | 81.9% |

| Crema-d | S6 | S7 | S8 | S9 |
|---|---|---|---|---|
| ${\mathcal{N}}_{0}$ (late fusion) | 81.8% | 78.8% | 80.0% | 77.5% |
| ${\mathcal{N}}_{0}$ (end-to-end) | 82.0% | 75.1% | 79.5% | 77.5% |
| ${\mathcal{N}}_{1}$ (end-to-end) | 82.0% | 74.8% | 79.3% | 75.8% |
| ${\mathcal{N}}_{2}$ (end-to-end) | 82.0% | 80.0% | 80.5% | 78.6% |

**Table 4.** Comparison of our fusion models with other recent solutions. Options used: IA—image augmentation; WO—without audio overlapping; VA—video frame augmentation; AO—audio overlapping. The X symbol means the authors report no result for the given dataset.

| Model | RAVDESS | Crema-d |
|---|---|---|
| ${\mathcal{N}}_{0}$ (end-to-end), Resnet18 + LSTM, IA (ours) | 83.20% | 77.25% |
| ${\mathcal{N}}_{0}$ (end-to-end), Resnet18 + LSTM, VA + WO (ours) | 85.20% | 79.25% |
| ${\mathcal{N}}_{0}$ (late fusion), Resnet18 + LSTM, VA + AO (ours) | 81.6% | 76.84% |
| ${\mathcal{N}}_{0}$ (end-to-end), Resnet18 + LSTM, VA + AO (ours) | 87.55% | 81.30% |
| ${\mathcal{N}}_{1}$ (end-to-end), Resnet18 + LSTM, VA + AO [22] | 89.8% | 77.0% |
| MRPN (end-to-end), Resnet18 + LSTM, VA + AO (ours) | 90.8% | 83.00% |
| MRPN (end-to-end), Resnet18 + Transformer(avg), VA + AO (ours) | 91.4% | 83.15% |
| (OpenFace/COVAREP features + LSTM) + Attention [7] | 58.33% | 65.00% |
| Dual Attention + LSTM [8] | 67.7% | 74.00% |
| Resnet101 + BiLSTM [39] | 77.02% | X |
| custom CNN [12] | X | 69.42% |
| Early Cross-modal + MFCC + MEL spectrogram [10] | 83.6% | X |
| CNN + Fisher vector + Metric learning [15] | X | 66.5% |
| custom CNN + Spectrogram [31] | 79.5% (Audio) | X |


© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Chang, X.; Skarbek, W.
Multi-Modal Residual Perceptron Network for Audio–Video Emotion Recognition. *Sensors* **2021**, *21*, 5452.
https://doi.org/10.3390/s21165452
