Multimodal Emotion Recognition from Art Using Sequential Co-Attention
Abstract
1. Introduction
- We propose a co-attention-based multimodal emotion recognition approach that uses information from the painting, title, and emotion category channels via weighted fusion to achieve more robust and accurate recognition (a minimal illustrative sketch of such a fusion follows this list);
- Experiments were carried out on a publicly available dataset collected and provided for emotion recognition;
- The results of the proposed approach were compared with the latest state-of-the-art approaches as well as with other deep-learning-based baselines.
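To make the weighted-fusion idea in the first contribution concrete, the following is a minimal sketch, written in PyTorch (not necessarily the authors' framework), of fusing attended painting, title, and emotion-category feature vectors with learnable, softmax-normalised modality weights. The class name, the 256-dimensional feature size, and the exact weighting scheme are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn


class WeightedModalityFusion(nn.Module):
    """Weighted sum of per-modality feature vectors (illustrative sketch)."""

    def __init__(self, num_modalities: int = 3):
        super().__init__()
        # One learnable scalar weight per modality (painting, title, category).
        self.weights = nn.Parameter(torch.ones(num_modalities))

    def forward(self, painting, title, category):
        # Stack the attended modality features: (batch, num_modalities, dim).
        feats = torch.stack([painting, title, category], dim=1)
        # Normalise the modality weights and take their weighted sum.
        alpha = torch.softmax(self.weights, dim=0).view(1, -1, 1)
        return (alpha * feats).sum(dim=1)  # (batch, dim)


# Dummy usage with hypothetical 256-dimensional attended features.
fusion = WeightedModalityFusion()
fused = fusion(torch.randn(4, 256), torch.randn(4, 256), torch.randn(4, 256))
print(fused.shape)  # torch.Size([4, 256])
```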
2. Related Work
2.1. Unimodal Approaches
2.2. Multimodal Approaches
2.3. Emotion Recognition from Art
3. The Proposed Sequential Multimodal Fusion Model
3.1. Image Feature Representation
3.2. Text Feature Representation
3.3. Emotion Category Feature Representation
3.4. Co-Attention Layer
3.5. Weighted Modality Fusion
3.6. Classification Layer
4. Experiment and Results
4.1. Dataset
4.2. Training Details
4.3. Baselines
- Bi-LSTM (Text Only): Bi-LSTM is one of the most popular methods for text classification. It uses a bidirectional LSTM network to learn text representations, followed by a classification layer that makes the prediction (see the first sketch after this list).
- CNN (Image Only): a CNN with six hidden layers was implemented. The first two convolutional layers contain 32 kernels and the following two contain 64 kernels; the second and fourth convolutional layers are each followed by a max-pooling layer with dropout. A fully connected layer with 256 neurons and dropout completes the network (see the second sketch after this list).
- Multimodal approaches (text and image): two multimodal approaches from previous work [3] on the same task, ResNet_GRU without attention and ResNet_GRU with attention, were also implemented.
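The following is a minimal sketch of a Bi-LSTM text-only baseline of the kind described above, assuming PyTorch. The vocabulary size and hidden size are illustrative; the 100-dimensional embeddings and the three polarity classes follow the tables below, but this is not the authors' exact implementation.

```python
import torch
import torch.nn as nn


class BiLSTMClassifier(nn.Module):
    """Bidirectional LSTM over title tokens, followed by a linear classifier."""

    def __init__(self, vocab_size: int, embed_dim: int = 100,
                 hidden_dim: int = 128, num_classes: int = 3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        embedded = self.embedding(token_ids)      # (batch, seq, embed_dim)
        _, (h_n, _) = self.bilstm(embedded)       # h_n: (2, batch, hidden_dim)
        # Concatenate the final forward and backward hidden states.
        sentence_repr = torch.cat([h_n[0], h_n[1]], dim=-1)
        return self.classifier(sentence_repr)     # class logits


# Dummy usage on a batch of padded title token ids (hypothetical vocab of 5000).
model = BiLSTMClassifier(vocab_size=5000)
logits = model(torch.randint(0, 5000, (8, 20)))
print(logits.shape)  # torch.Size([8, 3])
```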
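A corresponding sketch of the six-hidden-layer CNN image-only baseline follows. The kernel sizes, pooling sizes, dropout rates, and input resolution are not stated in this excerpt, so the 3x3 kernels, 2x2 pooling, 0.25/0.5 dropout, and 224x224 RGB input below are placeholder assumptions.

```python
import torch
import torch.nn as nn


class CNNBaseline(nn.Module):
    """Four conv layers (32, 32, 64, 64 kernels), pooling after the 2nd and 4th,
    then a 256-unit fully connected layer and a classification layer."""

    def __init__(self, num_classes: int = 3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2), nn.Dropout(0.25),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2), nn.Dropout(0.25),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(256), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(256, num_classes),
        )

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(images))


# Dummy usage on a batch of 224x224 RGB painting images.
model = CNNBaseline()
logits = model(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 3])
```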
4.4. Results and Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Mohammad, S.; Kiritchenko, S. WikiArt Emotions: An Annotated Dataset of Emotions Evoked by Art. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), Miyazaki, Japan, 7–12 May 2018.
- Tripathi, S.; Beigi, H.S.M. Multi-Modal Emotion Recognition on IEMOCAP Dataset Using Deep Learning. arXiv 2018, arXiv:1804.05788.
- Tashu, T.M.; Horváth, T. Attention-Based Multi-modal Emotion Recognition from Art. In Pattern Recognition. ICPR International Workshops and Challenges, Virtual Event, 10–15 January 2021; Part III; Springer International Publishing: Berlin/Heidelberg, Germany, 2021; pp. 604–612.
- Sreeshakthy, M.; Preethi, J. Classification of Human Emotion from Deap EEG Signal Using Hybrid Improved Neural Networks with Cuckoo Search. BRAIN Broad Res. Artif. Intell. Neurosci. 2016, 6, 60–73.
- Cowie, R.; Douglas-Cowie, E.; Tsapatsoulis, N.; Votsis, G.; Kollias, S.; Fellenz, W.; Taylor, J. Emotion recognition in human-computer interaction. IEEE Signal Process. Mag. 2001, 18, 32–80.
- Clavel, C.; Vasilescu, I.; Devillers, L.; Richard, G.; Ehrette, T. Fear-type emotion recognition for future audio-based surveillance systems. Speech Commun. 2008, 50, 487–503.
- Khalfallah, J.; Slama, J.B.H. Facial Expression Recognition for Intelligent Tutoring Systems in Remote Laboratories Platform. Procedia Comput. Sci. 2015, 73, 274–281.
- Knapp, R.B.; Kim, J.; André, E. Physiological Signals and Their Use in Augmenting Emotion Recognition for Human–Machine Interaction. In Emotion-Oriented Systems: The Humaine Handbook; Cowie, R., Pelachaud, C., Petta, P., Eds.; Springer: Berlin/Heidelberg, Germany, 2011; pp. 133–159.
- Shenoy, A.; Sardana, A. Multilogue-Net: A Context-Aware RNN for Multi-modal Emotion Detection and Sentiment Analysis in Conversation. In Proceedings of the Second Grand-Challenge and Workshop on Multimodal Language (Challenge-HML); Association for Computational Linguistics: Seattle, WA, USA, 2020; pp. 19–28.
- Yoon, S.; Dey, S.; Lee, H.; Jung, K. Attentive Modality Hopping Mechanism for Speech Emotion Recognition. In Proceedings of the ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 3362–3366.
- Liu, G.; Yan, Y.; Ricci, E.; Yang, Y.; Han, Y.; Winkler, S.; Sebe, N. Inferring Painting Style with Multi-Task Dictionary Learning; AAAI Press: Cambridge, MA, USA, 2015; pp. 2162–2168.
- Wang, Y.; Takatsuka, M. SOM based artistic styles visualization. In Proceedings of the 2013 IEEE International Conference on Multimedia and Expo (ICME), San Jose, CA, USA, 15–19 July 2013; pp. 1–6.
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1–9.
- Sartori, A.; Culibrk, D.; Yan, Y.; Sebe, N. Who’s Afraid of Itten: Using the Art Theory of Color Combination to Analyze Emotions in Abstract Paintings (MM ’15); Association for Computing Machinery: New York, NY, USA, 2015; pp. 311–320.
- Zhao, S.; Gao, Y.; Jiang, X.; Yao, H.; Chua, T.S.; Sun, X. Exploring Principles-of-Art Features for Image Emotion Recognition; Association for Computing Machinery: New York, NY, USA, 2014; pp. 47–56.
- Yanulevskaya, V.; Van Gemert, J.C.; Roth, K.; Herbold, A.K.; Sebe, N.; Geusebroek, J.M. Emotional valence categorization using holistic image features. In Proceedings of the 2008 15th IEEE International Conference on Image Processing, San Diego, CA, USA, 12–15 October 2008; pp. 101–104.
- Scherer, K.; Johnstone, T.; Klasmeyer, G. Handbook of Affective Sciences-Vocal Expression of Emotion; Oxford University Press: Oxford, UK, 2003; pp. 433–456.
- Navarretta, C. Individuality in Communicative Bodily Behaviours; Springer: Berlin/Heidelberg, Germany, 2012; pp. 417–423.
- Seyeditabari, A.; Tabari, N.; Gholizadeh, S.; Zadrozny, W. Emotion Detection in Text: Focusing on Latent Representation. arXiv 2019, arXiv:1907.09369.
- Yeh, S.L.; Lin, Y.S.; Lee, C.C. An Interaction-aware Attention Network for Speech Emotion Recognition in Spoken Dialogs. In Proceedings of the ICASSP 2019–2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 6685–6689.
- Castellano, G.; Kessous, L.; Caridakis, G. Emotion Recognition through Multiple Modalities: Face, Body Gesture, Speech. In Affect and Emotion in Human-Computer Interaction: From Theory to Applications; Peter, C., Beale, R., Eds.; Springer: Berlin/Heidelberg, Germany, 2008; pp. 92–103.
- Sikka, K.; Dykstra, K.; Sathyanarayana, S.; Littlewort, G.; Bartlett, M. Multiple Kernel Learning for Emotion Recognition in the Wild; Association for Computing Machinery: New York, NY, USA, 2013; pp. 517–524.
- Kim, Y.; Lee, H.; Provost, E.M. Deep learning for robust feature generation in audiovisual emotion recognition. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; pp. 3687–3691.
- Majumder, N.; Hazarika, D.; Gelbukh, A.; Cambria, E.; Poria, S. Multimodal Sentiment Analysis Using Hierarchical Fusion with Context Modeling. Knowl. Based Syst. 2018, 161, 124–133.
- Ren, M.; Nie, W.; Liu, A.; Su, Y. Multi-modal Correlated Network for emotion recognition in speech. Vis. Inform. 2019, 3, 150–155.
- Yoon, S.; Byun, S.; Dey, S.; Jung, K. Speech Emotion Recognition Using Multi-hop Attention Mechanism. In Proceedings of the ICASSP 2019–2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 2822–2826.
- Lian, Z.; Li, Y.; Tao, J.; Huang, J. Investigation of Multimodal Features, Classifiers and Fusion Methods for Emotion Recognition. arXiv 2018, arXiv:1809.06225.
- Pan, Z.; Luo, Z.; Yang, J.; Li, H. Multi-Modal Attention for Speech Emotion Recognition. 2020. Available online: http://xxx.lanl.gov/abs/2009.04107 (accessed on 16 August 2021).
- Siriwardhana, S.; Reis, A.; Weerasekera, R.; Nanayakkara, S. Jointly Fine-Tuning “BERT-like” Self Supervised Models to Improve Multimodal Speech Emotion Recognition. arXiv 2020, arXiv:2008.06682.
- Liu, G.; Tan, Z. Research on Multi-modal Music Emotion Classification Based on Audio and Lyric. In Proceedings of the 2020 IEEE 4th Information Technology, Networking, Electronic and Automation Control Conference (ITNEC), Chongqing, China, 12–14 June 2020; Volume 1, pp. 2331–2335.
- Machajdik, J.; Hanbury, A. Affective Image Classification Using Features Inspired by Psychology and Art Theory; Association for Computing Machinery: New York, NY, USA, 2010; pp. 83–92.
- Yanulevskaya, V.; Uijlings, J.; Bruni, E.; Sartori, A.; Zamboni, E.; Bacci, F.; Melcher, D.; Sebe, N. In the Eye of the Beholder: Employing Statistical Analysis and Eye Tracking for Analyzing Abstract Paintings; Association for Computing Machinery: New York, NY, USA, 2012; pp. 349–358.
- Sartori, A.; Yan, Y.; Özbal, G.; Almila, A.; Salah, A.; Salah, A.A.; Sebe, N. Looking at Mondrian’s Victory Boogie-Woogie: What Do I Feel; AAAI Press: Cambridge, MA, USA, 2015; pp. 2503–2509.
- Cai, Y.; Cai, H.; Wan, X. Multi-Modal Sarcasm Detection in Twitter with Hierarchical Fusion Model; Association for Computational Linguistics: Florence, Italy, 2019; pp. 2506–2515.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Wang, P.; Wu, Q.; Shen, C.; Van den Hengel, A. The VQA-Machine: Learning How to Use Existing Vision Algorithms to Answer New Questions. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 3909–3918.
- Lu, J.; Yang, J.; Batra, D.; Parikh, D. Hierarchical Question-Image Co-Attention for Visual Question Answering. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS’16); Curran Associates Inc.: Red Hook, NY, USA, 2016; pp. 289–297.
- Chung, J.; Gülçehre, Ç.; Cho, K.; Bengio, Y. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. arXiv 2014, arXiv:1412.3555.
- Tashu, T.M. Off-Topic Essay Detection Using C-BGRU Siamese. In Proceedings of the 2020 IEEE 14th International Conference on Semantic Computing (ICSC), San Diego, CA, USA, 3–5 February 2020; pp. 221–225.
- Gu, Y.; Yang, K.; Fu, S.; Chen, S.; Li, X.; Marsic, I. Hybrid Attention based Multimodal Network for Spoken Language Classification. In Proceedings of the 27th International Conference on Computational Linguistics; Association for Computational Linguistics: Santa Fe, NM, USA, 2018; pp. 2379–2390.
- Pennington, J.; Socher, R.; Manning, C. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP); Association for Computational Linguistics: Doha, Qatar, 2014; pp. 1532–1543.
| Polarity | Emotion Category | Instances |
|---|---|---|
| Positive | gratitude, happiness, humility, love, optimism, trust | 2578 |
| Negative | anger, arrogance, disgust, fear, pessimism, regret, sadness, shame | 838 |
| Other or Mixed | agreeableness, anticipation, disagreeableness, surprise, shyness, neutral | 689 |
| Hyper-Parameters | Values |
|---|---|
| ResNet FC size | 512 |
| Batch size | 32 |
| Number of BGRU hidden units | 128 |
| Dropout rate for GRU | 0.4 |
| Number of epochs | 40 |
| Learning rate | 0.001 |
| Word embedding dimensions | 100 |
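The hyper-parameters above could be wired into a training setup roughly as follows. This is a sketch assuming PyTorch and the Adam optimiser (the optimiser is not named in this excerpt), with a stand-in linear model and dummy data used only to show where the listed values plug in.

```python
import torch
import torch.nn.functional as F

# Hyper-parameter values taken from the table above.
config = {
    "resnet_fc_size": 512,
    "batch_size": 32,
    "bgru_hidden_units": 128,
    "gru_dropout": 0.4,
    "num_epochs": 40,
    "learning_rate": 0.001,
    "word_embedding_dim": 100,
}

# Stand-in linear model; the real model combines ResNet, BGRU and co-attention.
model = torch.nn.Linear(config["resnet_fc_size"], 3)
optimizer = torch.optim.Adam(model.parameters(), lr=config["learning_rate"])

for epoch in range(config["num_epochs"]):
    # One optimisation step per epoch on random data, just to show the loop shape.
    inputs = torch.randn(config["batch_size"], config["resnet_fc_size"])
    targets = torch.randint(0, 3, (config["batch_size"],))
    loss = F.cross_entropy(model(inputs), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```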
| Model | Channel | Accuracy | Loss |
|---|---|---|---|
| CNN | Image | 0.683 | 0.663 |
| Bi-LSTM | Title | 0.658 | 0.810 |
| FFNN | Category | 0.689 | 0.441 |
| ResNet_GRU without attention | Paint, title | 0.713 | 0.710 |
| ResNet_GRU with attention | Paint, title | 0.741 | 0.130 |
| Our new model with concatenation | Paint, title and category | 0.724 | 0.684 |
| Our new model | Paint, title and category | 0.773 | 0.143 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).