Exploring Emotional Stimuli Detection in Artworks: A Benchmark Dataset and Baselines Evaluation †
Abstract
1. Introduction
2. Related Work
3. Emotional Stimuli Detection
4. APOLO Dataset Curation
4.1. Data Selection
4.2. Annotation Process
4.3. Quality Control
4.4. Evaluation Dataset Analysis
5. Baselines
5.1. Baselines with Reference
5.2. Baselines without Reference
5.2.1. Object Detection
5.2.2. CASNet and CASNet II
5.2.3. Weakly-Supervised Emotional Stimuli Detector
6. Experiments
6.1. Quantitative Analysis
6.2. Qualitative Analysis
6.3. Emotion-Wise Analysis on Stimuli Detector
- Predictions focus on similar regions. Although WESD's predictions differ across emotions, most of them focus on similar regions within a single artwork (e.g., the house and the pool in the first example, and the people in the second example). This is reasonable, as some regions in an artwork may play an essential role in evoking multiple emotions. We observe that such regions also appear in the utterances.
- Awe and contentment tend to involve more regions. Compared with other emotions, awe and contentment usually involve more regions, such as the whole sky in the first example and the building and tree in the second example. This may be because awe and contentment are typically evoked by wider scenery in an artwork.
7. Emotional Stimuli and Deep Generative Models
8. Limitations and Ethical Concerns
9. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
Abbreviation | Meaning
---|---
APOLO | Artwork Provoked emOtion EvaLuatiOn
BCE | Binary cross-entropy
IoU | Intersection over union
Pr@τ | Precision with IoU threshold τ
WESD | Weakly-supervised Emotional Stimuli Detection
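As a reading aid for the metrics abbreviated above, the following minimal sketch (ours, not from the paper; function names are hypothetical) computes the IoU between a predicted and a ground-truth binary mask, and the precision at an IoU threshold τ, as used for Pr@25 and Pr@50 in the experiments:

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection over union of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 0.0
    return np.logical_and(pred, gt).sum() / union

def precision_at(preds, gts, tau: float) -> float:
    """Fraction of predictions whose IoU with the ground truth
    reaches the threshold tau (tau=0.25 -> Pr@25, tau=0.50 -> Pr@50)."""
    hits = [iou(p, g) >= tau for p, g in zip(preds, gts)]
    return sum(hits) / len(hits)
```

For example, a prediction covering two pixels of which one overlaps a one-pixel ground-truth mask has IoU 0.5, so it counts as a hit for Pr@25 and Pr@50 but not for a threshold of 0.75.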
Nomenclature
Symbol | Description
---|---
a | Artwork
 | Number of patches along the horizontal axis
 | Number of patches along the vertical axis
 | Training set
 | Validation set
 | Test set
e | Emotion
 | Set of emotions
 | Emotional stimuli detector
 | Fully-connected layer for embedding the whole artwork
 | Fully-connected layer to predict the emotional stimuli score
 | Binary emotion classifier for emotion e
i | Patch
K | Number of patches in artwork a
 | Map of region r
 | Emotional stimuli maps for utterance u
 | Emotional stimuli maps for emotion e
p | Point/location of segment s
 | Points related to phrase w
r | Region of artwork a
R | Regions in artwork a, a set of regions r
s | Segment of artwork a
 | Segment of phrase w
 | Similarity score
 | Segment predicted by the model
 | Threshold of IoU
u | Utterance
 | Feature of patch i
w | Phrase
 | Phrases in utterance u
 | Emotional stimuli score of patch i with emotion e
 | Predicted emotional stimuli score of patch i with emotion e
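To illustrate how the nomenclature fits together — an artwork split into K patches with features x_i, a fully-connected layer embedding the whole artwork, and a fully-connected layer producing a per-patch, per-emotion stimuli score — the following is a speculative NumPy sketch. The dimensions, the mean-pooling, and the concatenation scheme are our assumptions for illustration only, not the actual WESD architecture described in Section 5.2.3:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

K, d = 16, 512   # hypothetical: K patches per artwork, feature dimension d
E = 8            # number of emotions in the emotion set

x = rng.standard_normal((K, d))          # patch features x_i, i = 1..K

# fully-connected layer embedding the whole artwork (mean-pooled patches)
W_a = rng.standard_normal((d, d)) * 0.01
a_emb = np.tanh(x.mean(axis=0) @ W_a)

# fully-connected layer scoring each [patch; artwork] pair per emotion
W_y = rng.standard_normal((2 * d, E)) * 0.01
z = np.concatenate([x, np.tile(a_emb, (K, 1))], axis=1)
y_hat = sigmoid(z @ W_y)                 # predicted stimuli scores, shape (K, E)
```

Each entry of `y_hat` plays the role of the predicted emotional stimuli score of patch i with emotion e; reshaping a column over the patch grid would yield an emotional stimuli map for that emotion.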
References
- Wilber, M.J.; Fang, C.; Jin, H.; Hertzmann, A.; Collomosse, J.P.; Belongie, S.J. BAM! The Behance Artistic Media Dataset for Recognition Beyond Photography. In Proceedings of the ICCV, Venice, Italy, 22–29 October 2017; pp. 1211–1220. [Google Scholar]
- Bai, Z.; Nakashima, Y.; García, N. Explain Me the Painting: Multi-Topic Knowledgeable Art Description Generation. In Proceedings of the ICCV, Virtual Event, 11–17 October 2021; pp. 5402–5412. [Google Scholar]
- Crowley, E.J.; Zisserman, A. The State of the Art: Object Retrieval in Paintings using Discriminative Regions. In Proceedings of the BMVC, Nottingham, UK, 1–5 September 2014. [Google Scholar]
- Gonthier, N.; Gousseau, Y.; Ladjal, S.; Bonfait, O. Weakly supervised object detection in artworks. In Proceedings of the ECCV Workshops, Munich, Germany, 8–14 September 2018. [Google Scholar]
- Mensink, T.; van Gemert, J.C. The Rijksmuseum Challenge: Museum-Centered Visual Recognition. In Proceedings of the ICMR, Dallas, TX, USA, 2 December 2014; p. 451. [Google Scholar]
- Strezoski, G.; Worring, M. OmniArt: A Large-scale Artistic Benchmark. ACM Trans. Multim. Comput. Commun. Appl. 2018, 14, 88:1–88:21. [Google Scholar] [CrossRef]
- Tonkes, V.; Sabatelli, M. How Well Do Vision Transformers (VTs) Transfer to the Non-natural Image Domain? An Empirical Study Involving Art Classification. In Proceedings of the ECCV Workshop, Tel Aviv, Israel, 23–27 October 2022; Volume 13801, pp. 234–250. [Google Scholar]
- Reshetnikov, A.; Marinescu, M.V.; López, J.M. DEArt: Dataset of European Art. In Proceedings of the ECCV Workshop, Tel Aviv, Israel, 23–27 October 2022; Volume 13801, pp. 218–233. [Google Scholar]
- Garcia, N.; Vogiatzis, G. How to Read Paintings: Semantic Art Understanding with Multi-modal Retrieval. In Proceedings of the ECCV Workshops, Munich, Germany, 8–14 September 2018; Volume 11130, pp. 676–691. [Google Scholar]
- Garcia, N.; Ye, C.; Liu, Z.; Hu, Q.; Otani, M.; Chu, C.; Nakashima, Y.; Mitamura, T. A Dataset and Baselines for Visual Question Answering on Art. In Proceedings of the ECCV Workshops, Glasgow, UK, 23–28 August 2020; Volume 12536, pp. 92–108. [Google Scholar]
- Achlioptas, P.; Ovsjanikov, M.; Haydarov, K.; Elhoseiny, M.; Guibas, L.J. ArtEmis: Affective Language for Visual Art. In Proceedings of the CVPR, Nashville, TN, USA, 20–25 June 2021; pp. 11564–11574. [Google Scholar]
- Mohamed, Y.; Khan, F.F.; Haydarov, K.; Elhoseiny, M. It is Okay to Not Be Okay: Overcoming Emotional Bias in Affective Image Captioning by Contrastive Data Collection. In Proceedings of the CVPR, Nashville, TN, USA, 20–25 June 2022; pp. 21231–21240. [Google Scholar]
- Silvia, P.J. Emotional Responses to Art: From Collation and Arousal to Cognition and Emotion. Rev. Gen. Psychol. 2005, 9, 342–357. [Google Scholar] [CrossRef]
- Cooper, J.M.; Silvia, P.J. Opposing Art: Rejection as an Action Tendency of Hostile Aesthetic Emotions. Empir. Stud. Arts 2009, 27, 109–126. [Google Scholar] [CrossRef]
- Xenakis, I.; Arnellos, A.; Darzentas, J. The functional role of emotions in aesthetic judgment. New Ideas Psychol. 2012, 30, 212–226. [Google Scholar] [CrossRef]
- Pelowski, M.; Akiba, F. A model of art perception, evaluation and emotion in transformative aesthetic experience. New Ideas Psychol. 2011, 29, 80–97. [Google Scholar] [CrossRef]
- Yang, J.; Li, J.; Wang, X.; Ding, Y.; Gao, X. Stimuli-Aware Visual Emotion Analysis. IEEE Trans. Image Process. 2021, 30, 7432–7445. [Google Scholar] [CrossRef] [PubMed]
- Yang, J.; Gao, X.; Li, L.; Wang, X.; Ding, J. SOLVER: Scene-Object Interrelated Visual Emotion Reasoning Network. IEEE Trans. Image Process. 2021, 30, 8686–8701. [Google Scholar] [CrossRef] [PubMed]
- Yang, J.; She, D.; Lai, Y.K.; Rosin, P.L.; Yang, M.H. Weakly Supervised Coupled Networks for Visual Sentiment Analysis. In Proceedings of the CVPR, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7584–7592. [Google Scholar]
- Xu, L.; Wang, Z.; Wu, B.; Lui, S.S.Y. MDAN: Multi-level Dependent Attention Network for Visual Emotion Analysis. In Proceedings of the CVPR, New Orleans, LA, USA, 18–24 June 2022; pp. 9469–9478. [Google Scholar]
- Yang, J.; Huang, Q.; Ding, T.; Lischinski, D.; Cohen-Or, D.; Huang, H. EmoSet: A Large-scale Visual Emotion Dataset with Rich Attributes. In Proceedings of the ICCV, Paris, France, 2–6 October 2023; pp. 20326–20337. [Google Scholar]
- Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-Resolution Image Synthesis with Latent Diffusion Models. In Proceedings of the CVPR, New Orleans, LA, USA, 18–24 June 2022; pp. 10674–10685. [Google Scholar]
- Ramesh, A.; Dhariwal, P.; Nichol, A.; Chu, C.; Chen, M. Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv 2022, arXiv:2204.06125. [Google Scholar]
- Shen, X.; Efros, A.A.; Aubry, M. Discovering Visual Patterns in Art Collections with Spatially-Consistent Feature Learning. In Proceedings of the CVPR, Long Beach, CA, USA, 16–20 June 2019; pp. 9270–9279. [Google Scholar]
- Lin, H.; Jia, J.; Guo, Q.; Xue, Y.; Huang, J.; Cai, L.; Feng, L. Psychological stress detection from cross-media microblog data using Deep Sparse Neural Network. In Proceedings of the ICME, Chengdu, China, 14–18 July 2014; pp. 1–6. [Google Scholar]
- Wang, X.; Zhang, H.; Cao, L.; Feng, L. Leverage Social Media for Personalized Stress Detection. In Proceedings of the ACM MM, Seattle, WA, USA, 12–16 October 2020. [Google Scholar]
- Truong, Q.T.; Lauw, H.W. Visual Sentiment Analysis for Review Images with Item-Oriented and User-Oriented CNN. In Proceedings of the ACM MM, Mountain View, CA, USA, 23–27 October 2017. [Google Scholar]
- Truong, Q.T.; Lauw, H.W. VistaNet: Visual Aspect Attention Network for Multimodal Sentiment Analysis. Proc. Aaai Conf. Artif. Intell. 2019, 33, 305–312. [Google Scholar] [CrossRef]
- Zhao, S.; Yao, X.; Yang, J.; Jia, G.; Ding, G.; Chua, T.S.; Schuller, B.W.; Keutzer, K. Affective Image Content Analysis: Two Decades Review and New Perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 6729–6751. [Google Scholar] [CrossRef] [PubMed]
- Peng, K.C.; Sadovnik, A.; Gallagher, A.C.; Chen, T. Where do emotions come from? Predicting the Emotion Stimuli Map. In Proceedings of the ICIP, Phoenix, AZ, USA, 25–28 September 2016; pp. 614–618. [Google Scholar]
- Fan, S.; Shen, Z.; Jiang, M.; Koenig, B.L.; Xu, J.; Kankanhalli, M.; Zhao, Q. Emotional Attention: A Study of Image Sentiment and Visual Attention. In Proceedings of the CVPR, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7521–7531. [Google Scholar]
- Liu, G.; Yan, Y.; Ricci, E.; Yang, Y.; Han, Y.; Winkler, S.; Sebe, N. Inferring Painting Style with Multi-Task Dictionary Learning. In Proceedings of the IJCAI, Buenos Aires, Argentina, 25–31 July 2015. [Google Scholar]
- Ypsilantis, N.A.; Garcia, N.; Han, G.; Ibrahimi, S.; Van Noord, N.; Tolias, G. The Met Dataset: Instance-level recognition for artworks. In Proceedings of the NeurIPS Datasets and Benchmarks Track (Round 2), Virtual, 6–14 December 2021. [Google Scholar]
- Ruta, D.; Gilbert, A.; Aggarwal, P.; Marri, N.; Kale, A.; Briggs, J.; Speed, C.; Jin, H.; Faieta, B.; Filipkowski, A.; et al. StyleBabel: Artistic Style Tagging and Captioning. In Proceedings of the ECCV, Tel Aviv, Israel, 23–27 October 2022. [Google Scholar]
- Machajdik, J.; Hanbury, A. Affective image classification using features inspired by psychology and art theory. In Proceedings of the ACM MM, Firenze, Italy, 25–29 October 2010. [Google Scholar]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the ICML, Virtual, 18–24 July 2021. [Google Scholar]
- Kazemzadeh, S.; Ordonez, V.; Matten, M.A.; Berg, T.L. ReferItGame: Referring to Objects in Photographs of Natural Scenes. In Proceedings of the EMNLP, Doha, Qatar, 25–29 October 2014. [Google Scholar]
- Mao, J.; Huang, J.; Toshev, A.; Camburu, O.; Yuille, A.L.; Murphy, K.P. Generation and Comprehension of Unambiguous Object Descriptions. In Proceedings of the CVPR, Las Vegas, NV, USA, 27–30 June 2016; pp. 11–20. [Google Scholar]
- Lu, J.; Batra, D.; Parikh, D.; Lee, S. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. In Proceedings of the NeurIPS, Vancouver, BC, Canada, 8–14 December 2019. [Google Scholar]
- Lu, J.; Goswami, V.; Rohrbach, M.; Parikh, D.; Lee, S. 12-in-1: Multi-Task Vision and Language Representation Learning. In Proceedings of the CVPR, Seattle, WA, USA, 13–19 June 2020; pp. 10434–10443. [Google Scholar]
- Sharma, P.; Ding, N.; Goodman, S.; Soricut, R. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning. In Proceedings of the ACL, Melbourne, Australia, 15–20 July 2018. [Google Scholar]
- Zhang, P.; Li, X.; Hu, X.; Yang, J.; Zhang, L.; Wang, L.; Choi, Y.; Gao, J. VinVL: Revisiting Visual Representations in Vision-Language Models. In Proceedings of the CVPR, Nashville, TN, USA, 20–25 June 2021; pp. 5575–5584. [Google Scholar]
- Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; Zhang, L. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. In Proceedings of the CVPR, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6077–6086. [Google Scholar]
- Ren, S.; He, K.; Girshick, R.B.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
- Fan, S.; Shen, Z.; Jiang, M.; Koenig, B.L.; Kankanhalli, M.S.; Zhao, Q. Emotional Attention: From Eye Tracking to Computational Modeling. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 1682–1699. [Google Scholar] [CrossRef] [PubMed]
- Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.P.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 40, 834–848. [Google Scholar] [CrossRef] [PubMed]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the CVPR, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Yang, Z.; Wang, J.; Tang, Y.; Chen, K.; Zhao, H.; Torr, P.H.S. LAVT: Language-Aware Vision Transformer for Referring Image Segmentation. In Proceedings of the CVPR, New Orleans, LA, USA, 18–24 June 2022; pp. 18134–18144. [Google Scholar]
- Xu, L.; Huang, M.H.; Shang, X.; Yuan, Z.; Sun, Y.; Liu, J. Meta Compositional Referring Expression Segmentation. In Proceedings of the CVPR, Vancouver, BC, Canada, 18–22 June 2023; pp. 19478–19487. [Google Scholar]
- Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the ICLR, Toulon, France, 24–26 April 2017. [Google Scholar]
- Wu, Y.; Nakashima, Y.; Garcia, N. Not Only Generative Art: Stable Diffusion for Content-Style Disentanglement in Art Analysis. In Proceedings of the ICMR, Thessaloniki, Greece, 12–15 June 2023; pp. 199–208. [Google Scholar]
- Brooks, T.; Holynski, A.; Efros, A.A. InstructPix2Pix: Learning to Follow Image Editing Instructions. In Proceedings of the CVPR, Vancouver, BC, Canada, 18–22 June 2023; pp. 18392–18402. [Google Scholar]
- Tang, R.; Liu, L.; Pandey, A.; Jiang, Z.; Yang, G.; Kumar, K.; Stenetorp, P.; Lin, J.; Ture, F. What the DAAM: Interpreting Stable Diffusion Using Cross Attention. In Proceedings of the ACL, Toronto, ON, Canada, 9–14 July 2023; pp. 5644–5659. [Google Scholar]
Dataset | Samples | Images | Source | Emotions | ME
---|---|---|---|---|---
EmotionROI [30] | 1980 | 1980 | social | 6 | No
EMOd [31] | 1019 | 1019 | social | | No
APOLO | 6781 | 4178 | artwork | 8 | Yes
Painting | Emotion | Map_ID |
---|---|---|
a.y.-jackson_indian-home-1927 | sadness | 000000 |
aaron-siskind_new-york-24-1988 | anger | 000001 |
abdullah-suriosubroto_bamboo-forest | contentment | 000002 |
abdullah-suriosubroto_mountain-view | excitement | 000003 |
abraham-manievich_moscow-iii | excitement | 000004 |
abraham-manievich_moscow-iii | sadness | 000005 |
… | … | … |
# | Task | Baseline | Region Proposal | Input Text | Multiple Map | Bounding Box Pr@25 | Bounding Box Pr@50 | Segmentation Pr@25 | Segmentation Pr@50
---|---|---|---|---|---|---|---|---|---
1 | w/o reference | Entire artwork | - | - | - | 82.37 | 63.81 | 68.61 | 37.70
2 | | FasterRCNN | FasterRCNN | - | - | 84.03 | 66.40 | 74.43 | 43.67
3 | | VinVL | VinVL | - | - | 84.43 | 67.54 | 75.10 | 43.94
4 | | CASNet | - | - | - | 84.84 | 66.02 | 76.40 | 44.15
5 | | CASNet II | - | - | - | 84.84 | 63.59 | 76.24 | 40.18
6 | | WESD | - | - | ✓ | 84.30 | 66.66 | 75.89 | 44.97
7 | w/ reference | ViLBERT | FasterRCNN | emotion | ✓ | 82.17 | 63.08 | 72.14 | 39.64
8 | | 12-in-1 | FasterRCNN | emotion | ✓ | 72.51 | 50.71 | 63.90 | 31.87
9 | | CLIP + VinVL | VinVL | emotion | ✓ | 81.97 | 63.00 | 71.29 | 40.05
10 | | ViLBERT | FasterRCNN | utterance | ✓ | 84.10 | 65.41 | 75.26 | 42.18
11 | | 12-in-1 | FasterRCNN | utterance | ✓ | 80.52 | 59.16 | 72.52 | 37.99
12 | | CLIP + VinVL | VinVL | utterance | ✓ | 83.31 | 64.99 | 72.58 | 40.08
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Chen, T.; Garcia, N.; Li, L.; Nakashima, Y. Exploring Emotional Stimuli Detection in Artworks: A Benchmark Dataset and Baselines Evaluation. J. Imaging 2024, 10, 136. https://doi.org/10.3390/jimaging10060136