From Cues to Engagement: A Comprehensive Survey and Holistic Architecture for Computer Vision-Based Audience Analysis in Live Events
Abstract
1. Introduction
2. Selective Review of Recent Advancements in AI-Driven Audience Engagement Monitoring
2.1. Review Methodology
- (i) Related to real-world engagement applications (not virtual/online applications);
- (ii) Published in a journal or a conference;
- (iii) Written in English;
- (iv) Accessible online;
- (v) Provide systematic information on computing methods;
- (vi) Published between 2023 and October 2025.
- (i) Crowd engagement detection;
- (ii) Automatic engagement detection AND (computer vision OR sound analysis) AND (non-intrusive OR contactless);
- (iii) Crowd engagement AND (real-world OR in-the-wild) AND (facial expression OR gaze tracking OR acoustic sensing);
- (iv) Audience attention monitoring AND (vision-based OR audio-based) NOT (wearable OR biosensor);
- (v) Engagement recognition in crowds AND (dataset OR benchmark).
2.2. Audience/Crowd Engagement Detection Systems
2.3. Crowd Engagement Datasets
2.4. Discussion
3. Audience Engagement Model’s Architectural Elements
3.1. Emotion
3.2. Attention
3.3. Body Language, Scene Dynamics, and Behaviours
3.4. Discussion and Audience Engagement Architecture Constructs
- (a) Attention Direction and Level. (a.1) Focus: track each individual or the crowd and determine whether their head orientation (gaze) is directed toward the event’s primary focus, object, or person. (a.2) Heatmaps: (a.2.1) the attention aggregated over a period of time and (a.2.2) the cumulative number of persons in a specific region close to, or related to, the object/person being engaged with (see the sketch after this list, which illustrates (a.1), (a.2), and (b.2)).
- (b) Emotion Level and Sentiment State. (b.1) Emotion quantifies the degree and level of emotion by computing arousal and valence, which is combined with (b.2) the Sentiment State, namely negative, neutral, or positive.
- (c) Body Language. (c.1) Actions and gestures convey intent and emotion, bridging language barriers to foster genuine connection (e.g., clapping hands). (c.2) Pose communicates openness, confidence, and cultural respect, directly influencing rapport and the perceived sincerity of engagement. Together, they are the non-verbal foundation of trust and collaborative spirit in any event.
- (d) Scene Dynamics. The computation includes the following: (d.1) Density: the density of persons per section of the event. (d.2) Count: the number of persons attending the event. (d.3) Motion patterns: different motion patterns are tracked during the event, and the model also builds a heatmap of group motion patterns, together with a chronology of how groups move inside the event.
- (e) Behaviours represent observable actions and patterns that indicate a person’s level of interest, attention, and interaction with a product, service, content, or another person. They can be divided into the following: (e.1) Commitment: the architecture monitors how many times a group or the crowd interacts with a product, service, or event. (e.2) Conversion: it monitors how many groups complete (or whether the crowd completes) a pre-defined action/path. (e.3) Retention: it monitors how many groups return (or whether the crowd returns) to a product, service, or event sector after their first visit. (e.4) Feedback: it monitors how many times a type of feedback is presented by a group or the crowd (e.g., waving or clapping hands). (e.5) Social: it monitors how many different groups interact with each other. (e.6) Odd Behaviours/Alerts: it can detect a subset or group of attendees whose behaviour deviates significantly from other groups or event sectors or indicates abnormal crowd activity.
3.5. Ethical and Data Protection Implications
4. Discussion
Future Research Directions
- Development of standardised benchmarks: The highest priority should be the creation and public release of comprehensive datasets filmed in real-world event settings, annotated with ground truth for different levels (and types) of engagement, valence, arousal, head poses, behavioural metrics (e.g., clapping, cheering, or leaving), etc. Group- and crowd-based datasets, as shown in Table 1, provide shared benchmarks that facilitate reproducibility and promote adoption within industry.
- Advanced occlusion handling and high-density analysis: Research must focus on novel Computer Vision techniques (e.g., 3D reconstruction, transformer-based models, neuromorphic vision as in Berlincioni et al. [46]) to make the architecture robust in highly occluded, dense crowd scenarios, which are common in large events. Techniques for dense, occluded crowds are directly relevant for large festivals or sporting events, where visibility is poor but safety monitoring is critical.
- Adaptive camera setups and smart placement: Work should explore adaptive camera configurations, such as PTZ systems or multi-camera arrays, that can dynamically adjust placement, angle, and zoom to improve coverage in complex venues. Combined with 3D cameras (e.g., stereo rigs, structured light, or LiDAR), these setups can provide distance-aware information, helping to disambiguate overlapping individuals and estimate the distance between the audience and the target (stage, speaker, or point of interest). This additional spatial context not only mitigates occlusion but also enables richer engagement metrics by linking gaze and body orientation to the audience’s physical relation to the event focus.
- Multimodal fusion architectures: Future work should move beyond simple feature concatenation to explore sophisticated late-fusion and attention-based fusion models that can dynamically weight the importance of each construct (see the sketch after this list). In conferences, fusing gaze and applause detection could inform real-time adjustments to presentations; in concerts, body language may outweigh facial cues due to lighting or distance.
- Context-aware and personalised models: Future systems should adapt to the contexts of events and learn personalised baselines for audience segments to improve interpretation. A jazz concert audience expresses engagement differently than a tech keynote audience; tailoring models to these contexts supports both cultural event organisers and corporate planners.
- Ethical AI and privacy-preserving techniques: As these systems deploy, research must integrate privacy-by-design principles. This includes exploring federated learning, on-device processing, and techniques that use low-resolution or abstracted feature data to protect individual identities while still extracting useful crowd-level insights.
- Integration with subjective measures: To validate automated metrics, future work should develop methods to seamlessly integrate sparse subjective feedback with continuous AI-derived data, creating a hybrid validation model. Hybrid models could combine real-time analytics with feedback buttons at conferences or mobile surveys at festivals, enhancing actionable insights for organisers.
- Scaling to denser crowds: Future work should benchmark computational load across hardware tiers and explore neuromorphic vision sensors and transformer-based backbones as promising directions for handling denser crowds.
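As an illustration of the attention-based late fusion mentioned above, the following Python sketch combines per-construct engagement scores (attention, emotion, body language, scene dynamics) with softmax weights derived from context-dependent logits, for example down-weighting facial emotion at a dimly lit concert. The construct names, score ranges, and weighting scheme are assumptions made for illustration, not a prescribed design.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = np.asarray(z, float) - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def fuse_constructs(scores, context_logits):
    """Attention-style late fusion: per-construct engagement scores in [0, 1]
    are combined with context-dependent weights."""
    names = sorted(scores)
    s = np.array([scores[n] for n in names])
    logits = np.array([context_logits.get(n, 0.0) for n in names])
    w = softmax(logits)
    return float(np.dot(w, s)), dict(zip(names, w))

# Example: a concert context that trusts body language more than facial emotion.
scores = {"attention": 0.7, "emotion": 0.4, "body_language": 0.9, "scene_dynamics": 0.6}
context = {"attention": 1.0, "emotion": 0.2, "body_language": 1.5, "scene_dynamics": 0.8}
engagement, weights = fuse_constructs(scores, context)
```

In practice, the context logits could themselves be produced by a learned attention module conditioned on event type, lighting, and camera distance, rather than set manually as in this example.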
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| Abbreviation | Meaning |
|---|---|
| AI | Artificial Intelligence |
| AU | Action Unit |
| AUC | Area Under the Curve |
| BEM | Binary Engagement Model |
| BP | Back Propagation |
| CI | Concentration Index |
| CNN | Convolutional Neural Network |
| DCNN | Deep Convolutional Neural Network |
| DNN | Deep Neural Network |
| EI | Engagement Index |
| FER | Facial Expression Recognition |
| GIS | Geographic Information System |
| GRU | Gated Recurrent Unit |
| HAD | Histogram of Angular Deviations |
| HPE | Head Pose Estimation |
| HRI | Human–Robot Interaction |
| IE | Instantaneous Engagement |
| MAE | Mean Absolute Error |
| PE | Period Engagement |
| PoI | Point of Interest |
| R&D | Research and Development |
| RMSE | Root Mean Square Error |
| VA | Valence–Arousal |
| VLM | Vision Language Model |
| YOLO | You Only Look Once |
References
- Sánchez, F.L.; Hupont, I.; Tabik, S.; Herrera, F. Revisiting crowd behaviour analysis through deep learning: Taxonomy, anomaly detection, crowd emotions, datasets, opportunities and prospects. Inf. Fusion 2020, 64, 318–335. [Google Scholar] [CrossRef] [PubMed]
- Varghese, E.B.; Thampi, S.M. Towards the cognitive and psychological perspectives of crowd behaviour: A vision-based analysis. Connect. Sci. 2021, 33, 380–405. [Google Scholar] [CrossRef]
- Lemos, M.; Cardoso, P.J.S.; Rodrigues, J.M.F. Microscopic Binary Engagement Model. In Proceedings of the Computational Science—ICCS 2025, Singapore, 7–9 July 2025; Springer Nature: Cham, Switzerland, 2025; pp. 119–134. [Google Scholar] [CrossRef]
- Booth, B.M.; Bosch, N.; D’Mello, S.K. Engagement Detection and Its Applications in Learning: A Tutorial and Selective Review. Proc. IEEE 2023, 111, 1398–1422. [Google Scholar] [CrossRef]
- Lasri, I.; Riadsolh, A.; Elbelkacemi, M. Facial emotion recognition of deaf and hard-of-hearing students for engagement detection using deep learning. Educ. Inf. Technol. 2023, 28, 4069–4092. [Google Scholar] [CrossRef]
- Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks For Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar] [CrossRef]
- Xu, Q.; Wei, Y.; Gao, J.; Yao, H.; Liu, Q. ICAPD Framework and simAM-YOLOv8n for Student Cognitive Engagement Detection in Classroom. IEEE Access 2023, 11, 136063–136076. [Google Scholar] [CrossRef]
- Sumer, O.; Goldberg, P.; D’Mello, S.; Gerjets, P.; Trautwein, U.; Kasneci, E. Multimodal Engagement Analysis From Facial Videos in the Classroom. IEEE Trans. Affect. Comput. 2023, 14, 1012–1027. [Google Scholar] [CrossRef]
- Zhao, Z.; Li, Y.; Yang, J.; Ma, Y. A lightweight facial expression recognition model for automated engagement detection. Signal Image Video Process. 2024, 18, 3553–3563. [Google Scholar] [CrossRef]
- Vrochidis, A.; Dimitriou, N.; Krinidis, S.; Panagiotidis, S.; Parcharidis, S.; Tzovaras, D. A Deep Learning Framework for Monitoring Audience Engagement in Online Video Events. Int. J. Comput. Intell. Syst. 2024, 17, 124. [Google Scholar] [CrossRef]
- Ruiz, N.; Chong, E.; Rehg, J.M. Fine-grained head pose estimation without keypoints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2074–2083. [Google Scholar] [CrossRef]
- Shao, Z.; Liu, Z.; Cai, J.; Ma, L. JAA-Net: Joint facial action unit detection and face alignment via adaptive attention. Int. J. Comput. Vis. 2021, 129, 321–340. [Google Scholar] [CrossRef]
- Deng, J.; Guo, J.; Zhou, Y.; Yu, J.; Kotsia, I.; Zafeiriou, S. Retinaface: Single-stage dense face localisation in the wild. arXiv 2019, arXiv:1905.00641. [Google Scholar] [CrossRef]
- Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar] [CrossRef]
- habib Albohamood, A.; Shaker Alqattan, M.; Padua Vizcarra, C. Real-time Student Engagement Monitoring in Classroom Environments using Machine Learning and Computer Vision. In Proceedings of the 2025 4th International Conference on Computing and Information Technology (ICCIT), Tabuk, Saudi Arabia, 13–14 April 2025; pp. 420–424. [Google Scholar] [CrossRef]
- Qarbal, I.; Sael, N.; Ouahabi, S. Student Engagement Detection Based on Head Pose Estimation and Facial Expressions Using Transfer Learning. In Proceedings of the International Conference on Smart City Applications, Tangier, Morocco, 1–3 October 2024; Springer: Berlin/Heidelberg, Germany, 2025; pp. 246–255. [Google Scholar] [CrossRef]
- Teotia, J.; Zhang, X.; Mao, R.; Cambria, E. Evaluating Vision Language Models in Detecting Learning Engagement. In Proceedings of the 2024 IEEE International Conference on Data Mining Workshops (ICDMW), Abu Dhabi, United Arab Emirates, 9–12 December 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 496–502. [Google Scholar] [CrossRef]
- Sorrentino, A.; Fiorini, L.; Cavallo, F. From the definition to the automatic assessment of engagement in human–robot interaction: A systematic review. Int. J. Soc. Robot. 2024, 16, 1641–1663. [Google Scholar] [CrossRef]
- Ravandi, B.S.; Khan, I.; Markelius, A.; Bergström, M.; Gander, P.; Erzin, E.; Lowe, R. Exploring task and social engagement in companion social robots: A comparative analysis of feedback types. Adv. Robot. 2025, 39, 884–899. [Google Scholar] [CrossRef]
- Qarbal, I.; Sael, N.; Ouahabi, S. Students Engagement Detection Based on Computer Vision: A Systematic Literature Review. IEEE Access 2025, 13, 140519–140545. [Google Scholar] [CrossRef]
- Bei, Y.; Guo, S.; Gao, K.; Feng, Z. Behavior capture guided engagement recognition. Pattern Recognit. 2025, 164, 111534. [Google Scholar] [CrossRef]
- Wang, J.; Yuan, S.; Lu, T.; Zhao, H.; Zhao, Y. Video-based real-time monitoring of engagement in E-learning using MediaPipe through multi-feature analysis. Expert Syst. Appl. 2025, 288, 128239. [Google Scholar] [CrossRef]
- Lu, W.; Yang, Y.; Song, R.; Chen, Y.; Wang, T.; Bian, C. A Video Dataset for Classroom Group Engagement Recognition. Sci. Data 2025, 12, 644. [Google Scholar] [CrossRef]
- HAPPEI Dataset. Available online: https://users.cecs.anu.edu.au/~few_group/Group.htm (accessed on 28 August 2025).
- MED: Multimodal Event Dataset. Available online: https://github.com/hosseinm/med?tab=readme-ov-file (accessed on 28 August 2025).
- Vaz, P.J.; Rodrigues, J.M.F.; Cardoso, P.J.S. Affective Computing Databases: In-Depth Analysis of Systematic Reviews and Surveys. IEEE Trans. Affect. Comput. 2025, 16, 537–554. [Google Scholar] [CrossRef]
- UCSD Anomaly Detection Dataset. Available online: http://www.svcl.ucsd.edu/projects/anomaly/dataset.htm (accessed on 28 August 2025).
- CUHK Avenue Dataset. Available online: https://www.cse.cuhk.edu.hk/leojia/projects/detectabnormal/dataset.html (accessed on 28 August 2025).
- UMN Crowd Dataset. Available online: https://mha.cs.umn.edu/proj_events.shtml (accessed on 28 August 2025).
- ShanghaiTech Campus Dataset. Available online: https://svip-lab.github.io/dataset/campus_dataset.html (accessed on 28 August 2025).
- UT-Interaction Dataset. Available online: https://cvrc.ece.utexas.edu/SDHA2010/Human_Interaction.html (accessed on 28 August 2025).
- UCF-Crime Dataset. Available online: https://www.crcv.ucf.edu/research/real-world-anomaly-detection-in-surveillance-videos/ (accessed on 28 August 2025).
- Gao, Y.; Liu, H.; Wu, P.; Wang, C. A new descriptor of gradients self-similarity for smile detection in unconstrained scenarios. Neurocomputing 2016, 174, 1077–1086. [Google Scholar] [CrossRef]
- Lingenfelter, B.; Davis, S.R.; Hand, E.M. A quantitative analysis of labeling issues in the celeba dataset. In Proceedings of the International Symposium on Visual Computing, San Diego, CA, USA, 3–5 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 129–141. [Google Scholar] [CrossRef]
- Hwooi, S.K.W.; Othmani, A.; Sabri, A.Q.M. Deep learning-based approach for continuous affect prediction from facial expression images in valence-arousal space. IEEE Access 2022, 10, 96053–96065. [Google Scholar] [CrossRef]
- Savchenko, A.V. Frame-level prediction of facial expressions, valence, arousal and action units for mobile devices. arXiv 2022, arXiv:2203.13436. [Google Scholar] [CrossRef]
- Cao, Q.; Shen, L.; Xie, W.; Parkhi, O.M.; Zisserman, A. Vggface2: A dataset for recognising faces across pose and age. In Proceedings of the 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), Xi’an, China, 15–19 May 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 67–74. [Google Scholar] [CrossRef]
- Nguyen, H.H.; Huynh, V.T.; Kim, S.H. An ensemble approach for facial expression analysis in video. arXiv 2022, arXiv:2203.12891. [Google Scholar] [CrossRef]
- Radosavovic, I.; Kosaraju, R.P.; Girshick, R.; He, K.; Dollár, P. Designing network design spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10428–10436. [Google Scholar] [CrossRef]
- Cho, K. On the properties of neural machine translation: Encoder-decoder approaches. arXiv 2014, arXiv:1409.1259. [Google Scholar] [CrossRef]
- Devlin, J. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar] [CrossRef]
- Gupta, S.; Kumar, P.; Tekchandani, R.K. Facial emotion recognition based real-time learner engagement detection system in online learning context using deep learning models. Multimed. Tools Appl. 2023, 82, 11365–11394. [Google Scholar] [CrossRef]
- Hai, L.; Guo, H. Face detection with improved face r-CNN training method. In Proceedings of the 3rd International Conference on Control and Computer Vision, Macau, China, 23–25 August 2020; pp. 22–25. [Google Scholar] [CrossRef]
- Bruin, J.; Stuldreher, I.V.; Perone, P.; Hogenelst, K.; Naber, M.; Kamphuis, W.; Brouwer, A.M. Detection of arousal and valence from facial expressions and physiological responses evoked by different types of stressors. Front. Neuroergonomics 2024, 5, 1338243. [Google Scholar] [CrossRef]
- Ding, C.; Peng, H. Minimum redundancy feature selection from microarray gene expression data. J. Bioinform. Comput. Biol. 2005, 3, 185–205. [Google Scholar] [CrossRef] [PubMed]
- Berlincioni, L.; Cultrera, L.; Becattini, F.; Bimbo, A.D. Neuromorphic valence and arousal estimation. J. Ambient Intell. Humaniz. Comput. 2024, 1–11. [Google Scholar] [CrossRef]
- Kossaifi, J.; Tzimiropoulos, G.; Todorovic, S.; Pantic, M. AFEW-VA database for valence and arousal estimation in-the-wild. Image Vis. Comput. 2017, 65, 23–36. [Google Scholar] [CrossRef]
- Hu, Y.; Liu, S.C.; Delbruck, T. v2e: From video frames to realistic DVS events. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 1312–1321. [Google Scholar] [CrossRef]
- Manojkumar, K.; Helen, L.S. Monitoring the crowd emotion using valence and arousal of crowd based on prominent features of crowd. Signal Image Video Process. 2025, 19, 519. [Google Scholar] [CrossRef]
- Hempel, T.; Abdelrahman, A.A.; Al-Hamadi, A. 6d rotation representation for unconstrained head pose estimation. In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP), Bordeaux, France, 16–19 October 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 2496–2500. [Google Scholar] [CrossRef]
- Hempel, T.; Abdelrahman, A.A.; Al-Hamadi, A. Toward Robust and Unconstrained Full Range of Rotation Head Pose Estimation. IEEE Trans. Image Process. 2024, 33, 2377–2387. [Google Scholar] [CrossRef]
- Reich, A.; Wuensche, H.J. Monocular 3d multi-object tracking with an ekf approach for long-term stable tracks. In Proceedings of the 2021 IEEE 24th International Conference on Information Fusion (FUSION), Sun City, South Africa, 1–4 November 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1–7. [Google Scholar] [CrossRef]
- Zhou, X.; Koltun, V.; Krähenbühl, P. Tracking objects as points. In Proceedings of the European Conference on Computer Vision, Online, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 474–490. [Google Scholar] [CrossRef]
- Hossain, M.R.; Rahman, M.M.; Karim, M.R.; Al Amin, M.J.; Bepery, C. Determination of 3D Coordinates of Objects from Image with Deep Learning Model. In Proceedings of the 2022 IEEE 12th Annual Computing and Communication Workshop and Conference (CCWC), Las Vegas, NV, USA, 26–29 January 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 25–30. [Google Scholar] [CrossRef]
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision—ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755. [Google Scholar] [CrossRef]
- Yang, L.; Kang, B.; Huang, Z.; Zhao, Z.; Xu, X.; Feng, J.; Zhao, H. Depth anything v2. Adv. Neural Inf. Process. Syst. 2024, 37, 21875–21911. [Google Scholar] [CrossRef]
- Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A.; et al. DINOv2: Learning Robust Visual Features without Supervision. arXiv 2024, arXiv:2304.07193v2. [Google Scholar] [CrossRef]
- Sundararaman, R.; De Almeida Braga, C.; Marchand, E.; Pettre, J. Tracking pedestrian heads in dense crowd. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 3865–3875. [Google Scholar] [CrossRef]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
- Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar] [CrossRef]
- Zhu, C.; Tao, R.; Luu, K.; Savvides, M. Seeing small faces from robust anchor’s perspective. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5127–5136. [Google Scholar] [CrossRef]
- Tang, X.; Du, D.K.; He, Z.; Liu, J. Pyramidbox: A context-assisted single shot face detector. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 797–813. [Google Scholar] [CrossRef]
- Deb, T.; Rahmun, M.; Bijoy, S.A.; Raha, M.H.; Khan, M.A. UUCT-HyMP: Towards tracking dispersed crowd groups from UAVs. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, 18–22 July 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1–8. [Google Scholar] [CrossRef]
- Shah, S.; Dey, D.; Lovett, C.; Kapoor, A. Airsim: High-fidelity visual and physical simulation for autonomous vehicles. In Field and Service Robotics: Results of the 11th International Conference; Springer: Berlin/Heidelberg, Germany, 2018; pp. 621–635. [Google Scholar] [CrossRef]
- Bhat, G.; Danelljan, M.; Gool, L.V.; Timofte, R. Learning discriminative model prediction for tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6182–6191. [Google Scholar] [CrossRef]
- Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907. [Google Scholar] [CrossRef]
- Yeh, K.H.; Hsu, I.C.; Chou, Y.Z.; Chen, G.Y.; Tsai, Y.S. An aerial crowd-flow analyzing system for drone under YOLOv5 and StrongSort. In Proceedings of the 2022 International Automatic Control Conference (CACS), Kaohsiung, Taiwan, 3–6 November 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–6. [Google Scholar] [CrossRef]
- Jocher, G.; Stoken, A.; Borovec, J.; Changyu, L.; Hogan, A.; Diaconu, L.; Ingham, F.; Poznanski, J.; Fang, J.; Yu, L.; et al. ultralytics/yolov5: v3.1 - Bug Fixes and Performance Improvements. Zenodo 2020. [Google Scholar] [CrossRef]
- Du, Y.; Zhao, Z.; Song, Y.; Zhao, Y.; Su, F.; Gong, T.; Meng, H. Strongsort: Make deepsort great again. IEEE Trans. Multimed. 2023, 25, 8725–8737. [Google Scholar] [CrossRef]
- Zhou, K.; Yang, Y.; Cavallaro, A.; Xiang, T. Omni-scale feature learning for person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3702–3712. [Google Scholar] [CrossRef]
- Real-Time Multi-Camera Multi-Object Tracker Using YOLOv5 and StrongSORT with OSNet. Available online: https://github.com/mikel-brostrom/Yolov5_StrongSORT_OSNet (accessed on 28 August 2025).
- Li, H.; Liu, L.; Yang, K.; Liu, S.; Gao, J.; Zhao, B.; Zhang, R.; Hou, J. Video crowd localization with multifocus gaussian neighborhood attention and a large-scale benchmark. IEEE Trans. Image Process. 2022, 31, 6032–6047. [Google Scholar] [CrossRef]
- Ekanayake, E.; Lei, Y.; Li, C. Crowd density level estimation and anomaly detection using multicolumn multistage bilinear convolution attention network (MCMS-BCNN-Attention). Appl. Sci. 2022, 13, 248. [Google Scholar] [CrossRef]
- Tan, M.; Le, Q. Efficientnetv2: Smaller models and faster training. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; PMLR: New York, NY, USA, 2021; pp. 10096–10106. [Google Scholar] [CrossRef]
- Zhang, Y.; Zhou, D.; Chen, S.; Gao, S.; Ma, Y. Single-image crowd counting via multi-column convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26–30 June 2016; pp. 589–597. [Google Scholar] [CrossRef]
- Ferryman, J.; Shahrokni, A. Pets2009: Dataset and challenge. In Proceedings of the 2009 Twelfth IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, Snowbird, UT, USA, 7–9 December 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 1–6. [Google Scholar] [CrossRef]
- Mehran, R.; Oyama, A.; Shah, M. Abnormal crowd behavior detection using social force model. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 935–942. [Google Scholar] [CrossRef]
- Liu, Y.; Cao, G.; Ge, Z.; Hu, Y. Crowd counting method via a dynamic-refined density map network. Neurocomputing 2022, 497, 191–203. [Google Scholar] [CrossRef]
- Lin, H.; Ma, Z.; Ji, R.; Wang, Y.; Hong, X. Boosting crowd counting via multifaceted attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 19628–19637. [Google Scholar] [CrossRef]
- Wang, F.; Sang, J.; Wu, Z.; Liu, Q.; Sang, N. Hybrid attention network based on progressive embedding scale-context for crowd counting. Inf. Sci. 2022, 591, 306–318. [Google Scholar] [CrossRef]
- Qi, Z.; Zhou, M.; Zhu, G.; Xue, Y. Multiple pedestrian tracking in dense crowds combined with head tracking. Appl. Sci. 2022, 13, 440. [Google Scholar] [CrossRef]
- Pan, Z. Multi-Scale Occluded Pedestrian Detection Based on Deep Learning. In Proceedings of the 2023 International Conference on Evolutionary Algorithms and Soft Computing Techniques (EASCT), Bengaluru, India, 20–21 October 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–6. [Google Scholar] [CrossRef]
- Pai, A.K.; Chandrahasan, P.; Raghavendra, U.; Karunakar, A.K. Motion pattern-based crowd scene classification using histogram of angular deviations of trajectories. Vis. Comput. 2023, 39, 557–567. [Google Scholar] [CrossRef]
- Zhou, B.; Tang, X.; Wang, X. Measuring crowd collectiveness. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 3049–3056. [Google Scholar] [CrossRef]
- Zhang, X.; Sun, Y.; Li, Q.; Li, X.; Shi, X. Crowd density estimation and mapping method based on surveillance video and GIS. ISPRS Int. J. Geo-Inf. 2023, 12, 56. [Google Scholar] [CrossRef]
- Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar] [CrossRef]
- Qiu, W.; Wen, G.; Liu, H. A back-propagation neural network model based on genetic algorithm for prediction of build-up rate in drilling process. Arab. J. Sci. Eng. 2022, 47, 11089–11099. [Google Scholar] [CrossRef]
- Zhang, Y.; Chen, H.; Lai, Z.; Zhang, Z.; Yuan, D. Handling heavy occlusion in dense crowd tracking by focusing on the heads. In Proceedings of the Australasian Joint Conference on Artificial Intelligence, Brisbane, Australia, 18 November–1 December 2023; Springer: Berlin/Heidelberg, Germany, 2023; pp. 79–90. [Google Scholar] [CrossRef]
- Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. Yolox: Exceeding yolo series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar] [CrossRef]
- Shao, S.; Zhao, Z.; Li, B.; Xiao, T.; Yu, G.; Zhang, X.; Sun, J. CrowdHuman: A Benchmark for Detecting Human in a Crowd. arXiv 2018, arXiv:1805.00123. [Google Scholar] [CrossRef]
- Dendorfer, P.; Rezatofighi, H.; Milan, A.; Shi, J.; Cremers, D.; Reid, I.; Roth, S.; Schindler, K.; Leal-Taixé, L. Mot20: A benchmark for multi object tracking in crowded scenes. arXiv 2020, arXiv:2003.09003. [Google Scholar] [CrossRef]
- Mei, L.; Yu, M.; Jia, L.; Fu, M. Crowd Density Estimation via Global Crowd Collectiveness Metric. Drones 2024, 8, 616. [Google Scholar] [CrossRef]
- Ranasinghe, Y.; Nair, N.G.; Bandara, W.G.C.; Patel, V.M. CrowdDiff: Multi-hypothesis crowd density estimation using diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 12809–12819. [Google Scholar] [CrossRef]
- Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851. [Google Scholar] [CrossRef]
- Nichol, A.Q.; Dhariwal, P. Improved denoising diffusion probabilistic models. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; PMLR: New York, NY, USA, 2021; pp. 8162–8171. [Google Scholar] [CrossRef]
- Akpulat, M.; Ekinci, M. Anomaly detection in crowd scenes via cross trajectories. Appl. Intell. 2025, 55, 525. [Google Scholar] [CrossRef]
- Badauraudine, M.F.M.; Noor, M.N.M.M.; Othman, M.S.; Nasir, H.B.M. Detection and Tracking of People in a Dense Crowd through Deep Learning Approach-A Systematic Literature Review. Inf. Res. Commun. 2025, 1, 65–73. [Google Scholar] [CrossRef]
- Martins, P.V.C.; Cardoso, P.J.; Rodrigues, J.M. Affective Computing Emotional Body Gesture Recognition: Evolution and the Cream of the Crop. IEEE Access 2025, 13, 192871–192890. [Google Scholar] [CrossRef]
- Rodrigues, J.M.F.; Cardoso, P.J.S.; Lemos, M.; Cherniavska, O.; Bica, P. Engagement Monitorization in Crowded Environments: A Conceptual Framework. In Proceedings of the 11th International Conference on Software Development and Technologies for Enhancing Accessibility and Fighting Info-Exclusion, New York, NY, USA, 13–15 November 2025; DSAI ’24. p. 815. [Google Scholar] [CrossRef]

| Dataset | Domain/Scenes | Size | Labels/Annotations | Evaluation Protocol and Metrics | Notes |
|---|---|---|---|---|---|
| UCSD Anomaly Detection [27] | Pedestrian walkway videos | 34 train/36 test | Frame-level anomaly labels; pixel-level masks (subset) | Frame-level ROC/AUC; pixel-level AUC; EER | Classic small-scale benchmark; anomalies include bikes, skaters, vehicles |
| CUHK Avenue [28] | Campus avenue videos | 16 train/21 test; 47 abnormal events | Frame labels + bounding boxes | Frame-level AUC; IoU-based localisation (VOC) | Provides rectangles for localisation tasks |
| UMN Unusual Crowd Activity [29] | Indoor/outdoor crowd panic scenario videos | 3 sequences (1453, 4144, 2144 frames) | Frame-level normal to abnormal segments | Frame-level ROC/AUC; per-scene AUC | Sudden panic run abnormality |
| ShanghaiTech Campus [30] | Campus surveillance videos | 270k training frames; 130 events | Pixel-level masks + frame labels | Frame-level AUC; pixel-level AUC | Large modern benchmark with diverse scenes |
| UT-Interaction [31] | Human–human interaction indoor/outdoor videos | 20 sequences (1 min each), 6 classes | Temporal intervals and bounding boxes | Classification accuracy with LOSO CV | Interaction recognition dataset |
| UCF-Crime [32] | Real-world surveillance videos | 1900 videos; 13 anomaly types | Video-level anomaly labels; some frame-level GT | Frame-level ROC/AUC; PR curves | Weakly labelled large-scale anomaly dataset |
| GENKI-4K [33] | Natural, unconstrained face images | 4000 images | Binary smile label (smiling vs. non-smiling) | 5-fold cross-validation | Smaller dataset, focused specifically on smile detection (collected from real-world scenarios) |
| CelebA [34] | Celebrity face images (web-sourced) | 200k images of 10k identities; 162k train/20k val/20k test | 40 binary attributes (e.g., smiling, glasses, hair colour), identity labels, bounding boxes, 5 landmark locations | Often used for attribute classification, face detection, landmark localisation | Large-scale, diverse, multiple tasks possible (attributes, recognition, landmark detection) |

| Study/Authors | Methods | Overview | Applied to |
|---|---|---|---|
| Booth et al. [4] | Defines engagement as a multicomponent construct involving affective, cognitive, and behavioural components | Tutorial for engagement detection and its applications in learning | Engagement detection |
| Lasri et al. [5] | Facial expression recognition (FER) using a fine-tuned VGG-16 [6] | Concentration index (CI) computation | Engagement detection |
| Xu et al. [7] | YOLOv8n with SimAM attention | Analyse and adjust instruction based on real-time engagement data | Engagement detection |
| Sumer et al. [8] | Head pose and FER (vision + audio) | Investigate student engagement using audiovisual recordings in real classroom settings | Engagement detection |
| Zhao et al. [9] | Lightweight CNN, residual blocks, and CBAM/CAM | Real-time lightweight FER model for classroom use | Engagement detection |
| Vrochidis et al. [10] | HopeNet [11] (pose), JAA-Net [12] (AU), RetinaFace [13], DenseNet [14] | Six-layer deep framework fusing audio and video for online audience engagement | Engagement detection |
| Albohamood et al. [15] | YOLO-based pose and visual cues | Classifies students into Engaged, Not Engaged, and Partially Engaged | Engagement detection |
| Qarbal et al. [16] | Head pose estimation and facial expression analysis | Two-level verification of head pose and facial expression to strengthen engagement classification | Engagement detection |
| Teotia et al. [17] | Vision Language Models (VLMs) | Detection of classroom-specific emotions like engagement and distraction | Classroom-specific emotion detection |
| Sorrentino et al. [18] | - | Systematic review of engagement in HRI | Systematic review |
| Ravandi et al. [19] | Deep learning, primarily CNNs | Detects engagement via visual cues in HRI | Engagement detection |
| Qarbal et al. [20] | - | Systematic review of Computer Vision-based engagement detection in learning environments | Systematic review |
| Bei et al. [21] | Region-focused behaviour capture transformer | Transformer model uses regional body features and disengagement behaviours to recognise video engagement | Engagement detection |
| Wang et al. [22] | Lightweight models that extract 3D facial landmarks using the MediaPipe library | Low-cost learner engagement detection framework for E-learning, combining behavioural (head posture, gaze, blinks) and emotional (smile detection) cues | Engagement detection |
| Hwooi et al. [35] | Regression model prediction, CNN feature extraction, and image scaling | Transfers continuous valence–arousal dimensional space to discrete labels | Valence and arousal estimation |
| Savchenko [36] | Facial expression recognition, action unit (AU) analysis, and valence–arousal prediction | Combines FER, AU, and VA for fine-grained facial emotion analysis | Emotion analysis |
| Nguyen et al. [38] | RegNet [39], GRU [40], and transformers [41] | Multimodal temporal fusion for continuous behaviour analysis | Behaviour analysis |
| Gupta et al. [42] | Faster R-CNN [43] and CNN FER pipeline | Online FER-based engagement index (EI) for E-learning | Engagement detection |
| Bruin et al. [44] | SMOTE, MRMR [45], and grid search over multiple classifiers | Valence/arousal from facial and physiological responses across tasks | Valence and arousal estimation |
| Berlincioni et al. [46] | Neuromorphic event-based VA estimation | Uses event cameras (micro-movements) for valence/arousal estimation | Valence and arousal estimation |
| Manojkumar and Helen [49] | Fuzzy inference on crowd features, YOLOv8, DarkNet | Fuzzy inference model analyses crowd video to detect collective emotions via valence and arousal | Valence and arousal estimation |
| Hempel et al. [50] | Continuous 6D rotation matrix representation | Head pose estimation from single images | Head pose estimation |
| Hempel et al. [51] | Unconstrained HPE with geodesic loss (SO(3)) | Robust and unconstrained head pose estimation, addressing full-range rotation challenges | Head pose estimation |
| Reich [52] | Extended Kalman Filter | Recovers 3D trajectories from mono input for physically plausible tracking | 3D multi-object tracking |
| Hossain et al. [54] | YOLOv3 and triangulation | Stereo angles and trigonometry to estimate 3D object positions | 3D coordinate estimation |
| Yang et al. [56] | Depth Anything v2 (DINOv2-G [57] teacher; ViT S/B/L/G students) | Synthetic-to-real distillation for monocular depth | Monocular depth estimation |
| Sundararaman et al. [58] | Novel head detector named HeadHunter, which leverages ResNet-50 [59], Feature Pyramid Network [60,61], and context [62] | Head detection/tracking in dense crowds | Pedestrian tracking in dense crowds |
| Deb et al. [63] | HyMP: DiMP [65], GCNs [66], and bilinear pooling | Aerial crowd group tracking via hybrid motion pooling | Tracking dispersed human crowds |
| Yeh et al. [67] | YOLOv5 [68], StrongSort [69], and OSNet [70] | Aerial crowd flow: Trajectories, hotspots, flow maps from drones | Aerial crowd-flow analysis |
| Li et al. [72] | Gaussian Neighbourhood Attention and context cross-attention | Video crowd localisation via multi-focus attention (head centres) | Model spatial–temporal dependencies in surveillance footage |
| Ekanayake et al. [73] | DenseNet121 [14]/EfficientNetV2 [74], multi-column CNN [75], and multistage CNN | Classifies crowd density and performs binary classification for anomaly detection | Crowd density estimation |
| Liu et al. [78] | Refine Net and Counting Net | Crowd counting approach which addresses inaccuracies in traditional density map generation | Crowd counting |
| Lin et al. [79] | VGG-19 backbone [6], transformer encoder, and LRA/LAR/IAL | Multifaceted attention for large scale variation counting | Crowd counting |
| Wang et al. [80] | VGG16-BN [6] and Hybrid Attention Module | Addresses both background noise suppression and scale variation adaptation simultaneously for crowd counting | Crowd counting |
| Qi et al. [81] | Head-first tracking and enhanced Faster R-CNN [43] body detector | Robust multi-object tracking under severe occlusion (head to body) | Multi-object tracking system |
| Pan [82] | Multi-scale feature and occlusion detection module | Pedestrian detection under heavy occlusion via dynamic part modelling | Pedestrian detection |
| Pai et al. [83] | gKLT trajectories [84] and Histogram of Angular Deviations | Classifies crowd scenes by motion coherence | Classify crowded scenes |
| Zhang et al. [85] | DeepLabv3+ [86] CSSM, CNN denoising, and GA-optimised BP | Crowd density estimation and geographic mapping by integrating surveillance video with GIS technologies | Crowd density estimation |
| Zhang et al. [88] | Anchor-free joint head–body (YOLOX [89]) and Joint SimOTA | Joint head/body detection improves tracking under occlusion | Head–body detection |
| Mei et al. [92] | Global collectiveness via optical flow (illum.-invariant) | Crowd collectiveness metric across intra-/inter-crowd coherence | Measure crowd collectiveness |
| Ranasinghe et al. [93] | Diffusion model [94,95], narrow kernels, and SSIM-based fusion | Denoising diffusion for high-fidelity density maps | Crowd counting |
| Akpulat and Ekinci [96] | Trajectories, finite-time braid entropy, and DNN classifier | Topological complexity (braid entropy) for local/global anomalies | Anomaly detection |
| Badauraudine et al. [97] | - | Systematic survey of dense-crowd detection/tracking methods | Systematic review |
| Martins et al. [98] | - | Emotional body gesture recognition | Systematic review |
| Binary Engagement Model [3] | YOLOv4 (heads), FaceNet, WHENet (pose), EVAm (VA) | Identity-aware binary engagement model (engaged/not, positive/negative); supports multi-cam, real-world use | Engagement detection |