A Computational–Cognitive Model of Audio-Visual Attention in Dynamic Environments
Abstract
1. Introduction
- Modeling visual attention by integrating low-level and high-level visual cues, motion, and auditory information.
- Applying implicit memory principles to merge the generated maps and create the final audio-visual saliency map (a minimal fusion sketch follows this list).
- Assessing the significance of each information source in the proposed model through an ablation study.
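The final map is produced by combining the spatial, face, temporal, and audio saliency maps. The snippet below is a minimal fusion sketch, assuming per-map min-max normalization followed by a weighted sum with hypothetical fixed weights; the weighting actually used in the proposed model follows the implicit-memory principles described in Section 3.5.

```python
import numpy as np

def fuse_saliency_maps(spatial, face, temporal, audio, weights=(0.25, 0.25, 0.25, 0.25)):
    """Fuse per-frame saliency maps into one audio-visual map.

    All inputs are 2-D arrays of the same shape; `weights` is a purely
    illustrative static weighting, not the paper's memory-based scheme.
    """
    maps = [spatial, face, temporal, audio]
    # Normalize each map to [0, 1] so no single cue dominates by scale alone.
    normed = []
    for m in maps:
        m = m.astype(np.float64)
        rng = m.max() - m.min()
        normed.append((m - m.min()) / rng if rng > 0 else np.zeros_like(m))
    fused = sum(w * m for w, m in zip(weights, normed))
    # Renormalize the fused map to [0, 1] for visualization and evaluation.
    rng = fused.max() - fused.min()
    return (fused - fused.min()) / rng if rng > 0 else fused

# Example: fuse random stand-in maps for a 90x160 frame.
h, w = 90, 160
final_map = fuse_saliency_maps(*(np.random.rand(h, w) for _ in range(4)))
```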
2. Literature Review
2.1. Visual Attention in Static Environments
2.2. Visual Attention in Dynamic Environments
2.3. Multimodal Visual Attention
- Performance in motion-heavy scenes: In highly dynamic environments, excessive background motion can interfere with sound localization, reducing accuracy.
- Lack of face saliency modeling as high-level visual features: Prior research [30] has shown that observers consistently focus on faces in images, even when instructed to focus on competing objects. This effect is particularly evident in scenes like conversations. The MMS model does not explicitly consider these features, leading to potential mispredictions in face-dominant scenes.
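To make the face-saliency cue concrete, the sketch below renders detected face boxes as Gaussian blobs on an otherwise empty map, which can then be fused with the other cues. The box-to-blob rendering and the `sigma_scale` parameter are illustrative assumptions rather than the paper's exact formulation, and the face detector (e.g., MTCNN) is only one possible choice.

```python
import numpy as np

def face_saliency_map(frame_shape, face_boxes, sigma_scale=0.5):
    """Render detected face boxes as Gaussian blobs on an empty map.

    `face_boxes` are (x, y, w, h) rectangles, e.g. from an off-the-shelf
    detector such as MTCNN; the Gaussian rendering is an illustrative choice.
    """
    H, W = frame_shape
    ys, xs = np.mgrid[0:H, 0:W]
    sal = np.zeros((H, W), dtype=np.float64)
    for (x, y, w, h) in face_boxes:
        # Center a blob on each face; spread scales with the box size.
        cx, cy = x + w / 2.0, y + h / 2.0
        sx, sy = max(w * sigma_scale, 1.0), max(h * sigma_scale, 1.0)
        sal += np.exp(-(((xs - cx) ** 2) / (2 * sx ** 2) + ((ys - cy) ** 2) / (2 * sy ** 2)))
    return sal / sal.max() if sal.max() > 0 else sal

# Example: one detected face in a 360x640 frame.
face_map = face_saliency_map((360, 640), [(300, 120, 80, 100)])
```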
3. Materials and Methods
3.1. Detection of Spatial Saliency
3.2. Detection of Face Saliency
3.3. Detection of Temporal Saliency
3.4. Audio Saliency Detection
3.5. Creating the Final Audio-Visual Saliency Map
4. Results and Discussion
4.1. Comparison with Visual Attention Models
- Static saliency models: The image saliency models designed for static scenes are as follows: IT, a saliency model derived from the neural structure of the primate early visual system [13]; GBVS, graph-based visual saliency [14]; SR, a Fourier transform-based spectral residual saliency model [16] (a minimal sketch of SR follows this list); SUN, saliency using natural statistics [17]; FES, a visual saliency detection model based on free-energy theory [36]; HFT, a saliency model based on the hypercomplex Fourier transform [42]; BMS, a Boolean map-based saliency model [43]; Judd, a supervised learning model of saliency incorporating bottom-up saliency cues and top-down semantic-dependent cues [44]; and SMVJ, a saliency model combining low-level saliency with face detection [45].
- Dynamic saliency models: Video saliency models developed for dynamic scenes include the following: RWRV, saliency detection via random walk with restart [46]; SER, a model for static and space–time visual saliency detection using self-resemblance [47]; ICL, a dynamic visual attention model based on feature rarity (incremental coding length) [48]; and PQFT, a spatio-temporal saliency model that employs the phase spectrum of the quaternion Fourier transform [41].
- Multimodal model: MMS, the multimodal baseline introduced in [5], an audio-visual attention model designed for predicting eye fixations.
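As a concrete reference point for the baselines above, the following is a minimal sketch of the SR model [16]: saliency is recovered from the spectral residual of the log-amplitude spectrum while keeping the original phase. The working resolution, averaging-filter size, and smoothing sigma below are common defaults, not values taken from the compared implementation.

```python
import numpy as np
from scipy.ndimage import uniform_filter, gaussian_filter

def spectral_residual_saliency(gray, size=64):
    """Spectral residual saliency (Hou & Zhang [16]), minimal sketch.

    `gray` is a 2-D float array; the map is returned at the coarse
    working resolution used by the SR formulation.
    """
    h, w = gray.shape
    # Crude decimation to a coarse working resolution (~`size` pixels wide).
    step = max(int(max(h, w) / size), 1)
    small = gray[::step, ::step]

    spectrum = np.fft.fft2(small)
    log_amp = np.log(np.abs(spectrum) + 1e-8)
    phase = np.angle(spectrum)

    # Spectral residual = log amplitude minus its local average.
    residual = log_amp - uniform_filter(log_amp, size=3)

    sal = np.abs(np.fft.ifft2(np.exp(residual + 1j * phase))) ** 2
    sal = gaussian_filter(sal, sigma=2.5)
    return (sal - sal.min()) / (sal.max() - sal.min() + 1e-8)

saliency = spectral_residual_saliency(np.random.rand(360, 640))
```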
4.2. Ablation Studies
4.3. Discussion
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Borji, A.; Itti, L. State-of-the-Art in Visual Attention Modeling. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 185–207.
- Borji, A.; Ahmadabadi, M.N.; Araabi, B.N.; Hamidi, M. Online learning of task-driven object-based visual attention control. Image Vis. Comput. 2010, 28, 1130–1145.
- Liu, Y.; Qiao, M.; Xu, M.; Li, B.; Hu, W.; Borji, A. Learning to predict salient faces: A novel visual-audio saliency model. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XX 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 413–429.
- Borji, A.; Cheng, M.M.; Jiang, H.; Li, J. Salient Object Detection: A Benchmark. IEEE Trans. Image Process. 2015, 24, 5706–5722.
- Min, X.; Zhai, G.; Zhou, J.; Zhang, X.P.; Yang, X.; Guan, X. A Multimodal Saliency Model for Videos with High Audio-Visual Correspondence. IEEE Trans. Image Process. 2020, 29, 3805–3819.
- Tsiami, A.; Koutras, P.; Maragos, P. STAViS: Spatio-Temporal AudioVisual Saliency Network. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 4765–4775.
- Coutrot, A.; Guyader, N.; Ionescu, G.; Caplier, A. Influence of soundtrack on eye movements during video exploration. J. Eye Mov. Res. 2012, 5, 2.
- Song, G.; Pellerin, D.; Granjon, L. Different types of sounds influence gaze differently in videos. J. Eye Mov. Res. 2013, 6, 1–13.
- Min, X.; Zhai, G.; Gao, Z.; Hu, C.; Yang, X. Sound influences visual attention discriminately in videos. In Proceedings of the 2014 Sixth International Workshop on Quality of Multimedia Experience (QoMEX), Singapore, 18–20 September 2014; pp. 153–158.
- Tavakoli, H.R.; Borji, A.; Rahtu, E.; Kannala, J. DAVE: A Deep Audio-Visual Embedding for Dynamic Saliency Prediction. arXiv 2019, arXiv:1905.10693.
- Treisman, A.M.; Gelade, G. A feature-integration theory of attention. Cogn. Psychol. 1980, 12, 97–136.
- Koch, C.; Ullman, S. Chapter 1—Shifts in selective visual attention: Towards the underlying neural circuitry. In Matters of Intelligence: Conceptual Structures in Cognitive Neuroscience; Springer: Dordrecht, The Netherlands, 1987; pp. 115–141.
- Itti, L.; Koch, C.; Niebur, E. A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. Pattern Anal. Mach. Intell. 1998, 20, 1254–1259.
- Harel, J.; Koch, C.; Perona, P. Graph-Based Visual Saliency. In Proceedings of the 19th International Conference on Neural Information Processing Systems (NIPS’06), Cambridge, MA, USA, 4 December 2006; pp. 545–552.
- Riche, N.; Mancas, M. Bottom-Up Saliency Models for Still Images: A Practical Review. In From Human Attention to Computational Attention: A Multidisciplinary Approach; Springer: New York, NY, USA, 2016; pp. 141–175.
- Hou, X.; Zhang, L. Saliency Detection: A Spectral Residual Approach. In Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, USA, 17–22 June 2007; pp. 1–8.
- Zhang, L.; Tong, M.H.; Marks, T.K.; Shan, H.; Cottrell, G.W. SUN: A Bayesian framework for saliency using natural statistics. J. Vis. 2008, 8, 32.
- Farkish, A.; Bosaghzadeh, A.; Amiri, S.H.; Ebrahimpour, R. Evaluating the Effects of Educational Multimedia Design Principles on Cognitive Load Using EEG Signal Analysis. Educ. Inf. Technol. 2023, 28, 2827–2843.
- Wang, Y.; Liu, Z.; Xia, Y.; Zhu, C.; Zhao, D. Spatiotemporal module for video saliency prediction based on self-attention. Image Vis. Comput. 2021, 112, 104216.
- Fang, Y.; Wang, Z.; Lin, W.; Fang, Z. Video Saliency Incorporating Spatiotemporal Cues and Uncertainty Weighting. IEEE Trans. Image Process. 2014, 23, 3910–3921.
- Wang, W.; Shen, J.; Shao, L. Consistent Video Saliency Using Local Gradient Flow Optimization and Global Refinement. IEEE Trans. Image Process. 2015, 24, 4185–4196.
- Liu, Z.; Li, J.; Ye, L.; Sun, G.; Shen, L. Saliency Detection for Unconstrained Videos Using Superpixel-Level Graph and Spatiotemporal Propagation. IEEE Trans. Circuits Syst. Video Technol. 2017, 27, 2527–2542.
- Rogalska, A.; Napieralski, P. The visual attention saliency map for movie retrospection. Open Phys. 2018, 16, 188–192.
- Bosaghzadeh, A.; Shabani, M.; Ebrahimpour, R. A Computational-Cognitive model of Visual Attention in Dynamic Environments. J. Electr. Comput. Eng. Innov. 2021, 10, 163–174.
- Wang, W.; Shen, J.; Porikli, F. Saliency-aware geodesic video object segmentation. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3395–3402.
- Wang, W.; Shen, J.; Yang, R.; Porikli, F. Saliency-Aware Video Object Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 20–33.
- Koutras, P.; Katsamanis, A.; Maragos, P. Predicting Eyes’ Fixations in Movie Videos: Visual Saliency Experiments on a New Eye-Tracking Database. In Engineering Psychology and Cognitive Ergonomics, Proceedings of the 11th International Conference, EPCE 2014, Held as Part of HCI International 2014, Heraklion, Crete, Greece, 22–27 June 2014; Harris, D., Ed.; Springer International Publishing: Cham, Switzerland, 2014; pp. 183–194.
- Wang, G.; Chen, C.; Fan, D.; Hao, A.; Qin, H. From Semantic Categories to Fixations: A Novel Weakly-supervised Visual-auditory Saliency Detection Approach. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Los Alamitos, CA, USA, 19–25 June 2021; pp. 15114–15123.
- Chen, C.; Song, M.; Song, W.; Guo, L.; Jian, M. A Comprehensive Survey on Video Saliency Detection with Auditory Information: The Audio-visual Consistency Perceptual is the Key! IEEE Trans. Circuits Syst. Video Technol. 2022, 33, 457–477.
- Ashwini, P.; Ananya, K.; Nayaka, B.S. A Review on Different Feature Recognition Techniques for Speech Process in Automatic Speech Recognition. Int. J. Sci. Technol. Res. 2019, 8, 1953–1957.
- Kayser, C.; Petkov, C.I.; Lippert, M.; Logothetis, N.K. Mechanisms for Allocating Auditory Attention: An Auditory Saliency Map. Curr. Biol. 2005, 15, 1943–1947.
- Izadinia, H.; Saleemi, I.; Shah, M. Multimodal Analysis for Identification and Segmentation of Moving-Sounding Objects. IEEE Trans. Multimed. 2013, 15, 378–390.
- Hardoon, D.R.; Szedmak, S.; Shawe-Taylor, J. Canonical correlation analysis: An overview with application to learning methods. Neural Comput. 2004, 16, 2639–2664.
- Zhai, G.; Wu, X.; Yang, X.; Lin, W.; Zhang, W. A Psychovisual Quality Metric in Free-Energy Principle. IEEE Trans. Image Process. 2012, 21, 41–52.
- Friston, K. The free-energy principle: A unified brain theory? Nat. Rev. Neurosci. 2010, 11, 127–138.
- Gu, K.; Zhai, G.; Lin, W.; Yang, X.; Zhang, W. Visual Saliency Detection With Free Energy Theory. IEEE Signal Process. Lett. 2015, 22, 1552–1555.
- Coutrot, A.; Guyader, N. How saliency, faces, and sound influence gaze in dynamic social scenes. J. Vis. 2014, 14, 5.
- Zhang, K.; Zhang, Z.; Li, Z.; Qiao, Y. Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks. IEEE Signal Process. Lett. 2016, 23, 1499–1503.
- Min, X.; Zhai, G.; Gu, K.; Yang, X. Fixation Prediction through Multimodal Analysis. ACM Trans. Multimed. Comput. Commun. Appl. 2016, 13, 1–23.
- Fang, Y.; Lin, W.; Chen, Z.; Tsai, C.M.; Lin, C.W. A Video Saliency Detection Model in Compressed Domain. IEEE Trans. Circuits Syst. Video Technol. 2014, 24, 27–38.
- Guo, C.; Ma, Q.; Zhang, L. Spatio-temporal Saliency detection using phase spectrum of quaternion fourier transform. In Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, 23–28 June 2008; pp. 1–8.
- Li, J.; Levine, M.D.; An, X.; Xu, X.; He, H. Visual Saliency Based on Scale-Space Analysis in the Frequency Domain. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 996–1010.
- Zhang, J.; Sclaroff, S. Saliency Detection: A Boolean Map Approach. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Sydney, Australia, 1–8 December 2013.
- Judd, T.; Ehinger, K.; Durand, F.; Torralba, A. Learning to predict where humans look. In Proceedings of the 2009 IEEE 12th International Conference on Computer Vision, Kyoto, Japan, 29 September–2 October 2009; pp. 2106–2113.
- Cerf, M.; Harel, J.; Einhaeuser, W.; Koch, C. Predicting human gaze using low-level saliency combined with face detection. In Advances in Neural Information Processing Systems; Platt, J., Koller, D., Singer, Y., Roweis, S., Eds.; Curran Associates, Inc.: New York, NY, USA, 2007; Volume 20.
- Kim, H.; Kim, Y.; Sim, J.Y.; Kim, C.S. Spatiotemporal Saliency Detection for Video Sequences Based on Random Walk With Restart. IEEE Trans. Image Process. 2015, 24, 2552–2564.
- Seo, H.J.; Milanfar, P. Static and space-time visual saliency detection by self-resemblance. J. Vis. 2009, 9, 15.
- Hou, X.; Zhang, L. Dynamic visual attention: Searching for coding length increments. In Advances in Neural Information Processing Systems; Koller, D., Schuurmans, D., Bengio, Y., Bottou, L., Eds.; Curran Associates, Inc.: New York, NY, USA, 2008; Volume 21.
- Riche, N.; Duvinage, M.; Mancas, M.; Gosselin, B.; Dutoit, T. Saliency and Human Fixations: State-of-the-Art and Study of Comparison Metrics. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Sydney, Australia, 1–8 December 2013.
- Emami, M.; Hoberock, L.L. Selection of a best metric and evaluation of bottom-up visual saliency models. Image Vis. Comput. 2013, 31, 796–808.
- Bylinskii, Z.; Judd, T.; Oliva, A.; Torralba, A.; Durand, F. What Do Different Evaluation Metrics Tell Us About Saliency Models? IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 740–757.
| Notation | Description |
|---|---|
|  | Spatial saliency map |
|  | Temporal saliency map |
|  | Audio saliency map |
|  | Face saliency map |
|  | Integration of spatial saliency map and face saliency map |
|  | Methods | AUC-J ↑ | AUC-b ↑ | NSS ↑ |
|---|---|---|---|---|
| Static Model | IT | 0.8211 | 0.8096 | 1.2642 |
|  | GBVS | 0.8467 | 0.8350 | 1.4536 |
|  | SR | 0.7566 | 0.7408 | 1.0079 |
|  | SUN | 0.7000 | 0.6858 | 0.6953 |
|  | SMVJ | 0.8537 | 0.8436 | 1.5247 |
|  | Judd | 0.8465 | 0.8226 | 1.3513 |
|  | BMS | 0.8016 | 0.7797 | 1.2733 |
|  | HFT | 0.8259 | 0.7981 | 1.4043 |
|  | FES | 0.7695 | 0.7583 | 1.0026 |
| Dynamic Model | PQFT | 0.7768 | 0.7428 | 1.1160 |
|  | ICL | 0.7509 | 0.6424 | 0.8451 |
|  | SER | 0.7712 | 0.6312 | 0.9676 |
|  | RWRV | 0.7606 | 0.7426 | 0.9812 |
| Audio-Visual Model | MMS | 0.8654 | 0.8498 | 1.8969 |
|  | Our method | 0.8885 | 0.8722 | 2.3235 |
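The scores above use standard fixation-prediction metrics. The sketch below shows how NSS and AUC-Judd (AUC-J) can be computed for a single frame from a predicted saliency map and a binary fixation map; the synthetic inputs are placeholders for real eye-tracking data.

```python
import numpy as np

def nss(saliency, fixations):
    """Normalized Scanpath Saliency: mean z-scored saliency value
    at the fixated pixels (`fixations` is a binary map)."""
    z = (saliency - saliency.mean()) / (saliency.std() + 1e-8)
    return z[fixations > 0].mean()

def auc_judd(saliency, fixations):
    """AUC-Judd: fixated pixels are positives, all other pixels negatives;
    sweep a threshold over the saliency values of the positives."""
    pos = saliency[fixations > 0]
    neg = saliency[fixations == 0]
    thresholds = np.sort(pos)[::-1]
    tp = np.array([(pos >= t).mean() for t in thresholds])
    fp = np.array([(neg >= t).mean() for t in thresholds])
    # Add the (0, 0) and (1, 1) ROC endpoints and integrate with the
    # trapezoidal rule.
    tp = np.concatenate(([0.0], tp, [1.0]))
    fp = np.concatenate(([0.0], fp, [1.0]))
    return float(np.sum((fp[1:] - fp[:-1]) * (tp[1:] + tp[:-1]) / 2.0))

# Synthetic single-frame example; real evaluation averages over all frames.
sal = np.random.rand(90, 160)
fix = np.zeros((90, 160), dtype=int)
fix[[20, 45, 70], [30, 80, 130]] = 1   # three stand-in fixation points
print(f"NSS = {nss(sal, fix):.3f}, AUC-J = {auc_judd(sal, fix):.3f}")
```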
Evaluation Metrics | p-Value |
---|---|
AUC-J | 5.03 |
AUC-b | 9.79 |
NSS | 4.57 |
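The table reports p-values for the improvement on each evaluation metric. The exact statistical test is not specified in this excerpt; a paired test over per-video scores, as sketched below with synthetic stand-in values, is one common way such p-values are obtained.

```python
import numpy as np
from scipy import stats

# Per-video NSS scores for two models (synthetic stand-in values; treating
# the comparison as a paired t-test is an assumption made for illustration).
rng = np.random.default_rng(0)
nss_ours = rng.normal(2.3, 0.4, size=60)
nss_baseline = rng.normal(1.9, 0.4, size=60)

# Paired test: the same videos are scored by both models.
t_stat, p_value = stats.ttest_rel(nss_ours, nss_baseline)
print(f"t = {t_stat:.2f}, p = {p_value:.2e}")
```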
Variation | ||||
---|---|---|---|---|
No | Yes | Yes | Yes | |
Yes | No | Yes | Yes | |
Yes | Yes | No | Yes | |
Yes | Yes | Yes | No | |
Yes | Yes | Yes | Yes |
Variation | S | |||||
---|---|---|---|---|---|---|
0.84964 | 0.80634 | 1.44909 | 0.00011 | 0.00999 | 9.24674 | |
0.82911 | 0.85542 | 1.82284 | 0.00018 | 0.01262 | 9.04489 | |
0.86836 | 0.88170 | 2.26551 | 0.00019 | 0.01576 | 8.92385 | |
0.86184 | 0.87758 | 2.01621 | 0.00018 | 0.01399 | 8.96247 | |
0.87216 | 0.88846 | 2.32348 | 0.00021 | 0.01616 | 8.83972 |
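Each ablation variation removes one component map before fusion and re-evaluates the resulting predictions. The loop below illustrates that procedure on a single synthetic frame with simple unweighted averaging; the actual study uses the model's own fusion scheme and the full eye-tracking dataset.

```python
import numpy as np

def zscore(m):
    """Zero-mean, unit-variance normalization of a saliency map."""
    return (m - m.mean()) / (m.std() + 1e-8)

# Stand-in component maps and fixations for one frame; a real ablation would
# iterate over every frame of every video in the eye-tracking dataset.
components = ["spatial", "face", "temporal", "audio"]
rng = np.random.default_rng(1)
maps = {c: rng.random((90, 160)) for c in components}
fix = np.zeros((90, 160), dtype=int)
fix[[20, 45, 70], [30, 80, 130]] = 1

for dropped in components + [None]:
    kept = [maps[c] for c in components if c != dropped]
    fused = np.mean(kept, axis=0)                  # simple unweighted fusion
    score = zscore(fused)[fix > 0].mean()          # NSS at fixated pixels
    label = f"without {dropped}" if dropped else "full model"
    print(f"{label:>16}: NSS = {score:.3f}")
```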