Multimodal Deep Learning for Group Activity Recognition in Smart Office Environments
Abstract
1. Introduction
2. Background
2.1. Activity Recognition in Smart Buildings
2.2. Deep Learning for Activity Recognition
2.3. Multimodal Learning
3. Deep Multimodal Architecture
3.1. Visual Network
3.2. Audio Network
3.3. Temporal Audio–Visual Fusion Model
4. Empirical Results
- Classification accuracy, which is the ratio of correct predictions to the total number of input samples.
- F1-score, which measures both the precision of the classifier and its robustness, defined as the harmonic mean of precision and recall.
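The formulas for these two metrics follow directly from the definitions above (written here in standard form, since the original equations did not survive extraction):

```latex
\text{Accuracy} = \frac{\text{number of correct predictions}}{\text{total number of input samples}}
```

```latex
F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
```

As the harmonic mean, the F1-score is high only when precision and recall are both high, which makes it a stricter summary than accuracy on imbalanced classes.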
4.1. Dataset
4.2. Unimodal Results Analysis
4.2.1. Visual Modality
4.2.2. Deeper Networks
4.2.3. Transfer Learning
4.2.4. Audio Modality
4.3. Temporal Multimodal Results Analysis
4.4. Discussion
5. Conclusions and Future Work
Author Contributions
Funding
Conflicts of Interest
References
- Islam, S.M.R.; Kwak, D.; Kabir, M.H.; Hossain, M.; Kwak, K. The Internet of Things for Health Care: A Comprehensive Survey. IEEE Access 2015, 3, 678–708. [Google Scholar] [CrossRef]
- Chernbumroong, S.; Cang, S.; Atkins, A.; Yu, H. Elderly activities recognition and classification for applications in assisted living. Expert Syst. Appl. 2013, 40, 1662–1674. [Google Scholar] [CrossRef]
- Minoli, D.; Sohraby, K.; Occhiogrosso, B. IoT Considerations, Requirements, and Architectures for Smart Buildings—Energy Optimization and Next-Generation Building Management Systems. IEEE Internet Things J. 2017, 4, 269–283. [Google Scholar] [CrossRef]
- Lim, B.; Van Den Briel, M.; Thiébaux, S.; Backhaus, S.; Bent, R. HVAC-Aware Occupancy Scheduling. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI’15, Austin, TX, USA, 25–30 January 2015; AAAI Press: Palo Alto, CA, USA, 2015; pp. 679–686. [Google Scholar]
- Carletta, J.; Ashby, S.; Bourban, S.; Flynn, M.; Guillemot, M.; Hain, T.; Kadlec, J.; Karaiskos, V.; Kraaij, W.; Kronenthal, M.; et al. The AMI Meeting Corpus: A Pre-announcement. In Machine Learning for Multimodal Interaction; Renals, S., Bengio, S., Eds.; Springer: Berlin/Heidelberg, Germany, 2006; pp. 28–39. [Google Scholar]
- Truong, N.C.; Baarslag, T.; Ramchurn, G.; Tran-Thanh, L. Interactive scheduling of appliance usage in the home. In Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI-16), New York, NY, USA, 9–11 July 2016. [Google Scholar]
- Yang, Y.; Hao, J.; Zheng, Y.; Yu, C. Large-Scale Home Energy Management Using Entropy-Based Collective Multiagent Deep Reinforcement Learning Framework. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, 10–16 August 2019; pp. 630–636. [Google Scholar]
- Ahmadi-Karvigh, S.; Ghahramani, A.; Becerik-Gerber, B.; Soibelman, L. Real-time activity recognition for energy efficiency in buildings. Appl. Energy 2018, 211, 146–160. [Google Scholar] [CrossRef]
- Ye, H.; Gu, T.; Zhu, X.; Xu, J.; Tao, X.; Lu, J.; Jin, N. FTrack: Infrastructure-free floor localization via mobile phone sensing. In Proceedings of the 2012 IEEE International Conference on Pervasive Computing and Communications, Lugano, Switzerland, 19–23 March 2012; pp. 2–10. [Google Scholar] [CrossRef] [Green Version]
- Sarker, K.; Masoud, M.; Belkasim, S.; Ji, S. Towards Robust Human Activity Recognition from RGB Video Stream with Limited Labeled Data. In Proceedings of the 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), Orlando, FL, USA, 17–20 December 2018. [Google Scholar] [CrossRef] [Green Version]
- Haubrick, P.; Ye, J. Robust Audio Sensing with Multi-Sound Classification. In Proceedings of the 2019 IEEE International Conference on Pervasive Computing and Communications, Kyoto, Japan, 11–15 March 2019; pp. 1–7. [Google Scholar]
- Mihailescu, R.C.; Persson, J.; Davidsson, P.; Eklund, U. Towards Collaborative Sensing using Dynamic Intelligent Virtual Sensors. In Intelligent Distributed Computing; Badica, C., El Fallah Seghrouchni, A., Beynier, A., Camacho, D., Herpson, C., Hindriks, K., Novais, P., Eds.; Springer International Publishing: Berlin/Heidelberg, Germany, 2017; pp. 217–226. [Google Scholar]
- Wu, Z.; Jiang, Y.G.; Wang, X.; Ye, H.; Xue, X. Multi-Stream Multi-Class Fusion of Deep Networks for Video Classification. In Proceedings of the 24th ACM International Conference on Multimedia, MM ’16, Amsterdam, The Netherlands, 15–19 October 2016; Association for Computing Machinery: New York, NY, USA, 2016; pp. 791–800. [Google Scholar]
- Arabacı, M.A.; Özkan, F.; Surer, E.; Jančovič, P.; Temizel, A. Multi-modal egocentric activity recognition using multi-kernel learning. Multimed. Tools Appl. 2020. [Google Scholar] [CrossRef]
- Kazakos, E.; Nagrani, A.; Zisserman, A.; Damen, D. EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 5491–5500. [Google Scholar]
- Wang, L.; Xiong, Y.; Wang, Z.; Qiao, Y.; Lin, D.; Tang, X.; Van Gool, L. Temporal Segment Networks: Towards Good Practices for Deep Action Recognition. In Computer Vision—ECCV 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer International Publishing: Cham, Switzerland, 2016; pp. 20–36. [Google Scholar]
- Casserfelt, K.; Mihailescu, R. An investigation of transfer learning for deep architectures in group activity recognition. In Proceedings of the IEEE International Conference on Pervasive Computing and Communications Workshops, PerCom Workshops 2019, Kyoto, Japan, 11–15 March 2019; pp. 58–64. [Google Scholar] [CrossRef]
- Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef] [Green Version]
- Springenberg, J.; Dosovitskiy, A.; Brox, T.; Riedmiller, M. Striving for Simplicity: The All Convolutional Net. arXiv 2014, arXiv:1412.6806. [Google Scholar]
- Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2261–2269. [Google Scholar]
- Larsson, G.; Maire, M.; Shakhnarovich, G. FractalNet: Ultra-Deep Neural Networks without Residuals. In Proceedings of the 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, 24–26 April 2017. [Google Scholar]
- Srivastava, R.K.; Greff, K.; Schmidhuber, J. Training Very Deep Networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems—Volume 2; MIT Press: Cambridge, MA, USA, 2015; pp. 2377–2385. [Google Scholar]
- Sapru, A.; Valente, F. Automatic speaker role labeling in AMI meetings: Recognition of formal and social roles. In Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan, 25–30 March 2012; pp. 5057–5060. [Google Scholar]
- Zhao, Z.; Pan, H.; Fan, C.; Liu, Y.; Li, L.; Yang, M.; Cai, D. Abstractive Meeting Summarization via Hierarchical Adaptive Segmental Network Learning. In Proceedings of the World Wide Web Conference, WWW ’19, San Francisco, CA, USA, 13–17 May 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 3455–3461. [Google Scholar] [CrossRef]
- Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
- Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
- Lopes, J.; Singh, S. Audio and Video Feature Fusion for Activity Recognition in Unconstrained Videos. In Intelligent Data Engineering and Automated Learning—IDEAL 2006; Corchado, E., Yin, H., Botti, V., Fyfe, C., Eds.; Springer: Berlin/Heidelberg, Germany, 2006; pp. 823–831. [Google Scholar]
- Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. 2015, 115, 211–252. [Google Scholar] [CrossRef] [Green Version]
| Visual Spatial Features | Accuracy | Audio Spatial Features | Accuracy | Audio Spatial Features | Accuracy | Spatio-Temporal Features | Accuracy |
|---|---|---|---|---|---|---|---|
| Corner | 98% | VGG-16 | 50% | ResNet-50 | 78% | Audio features | 59% |
| Overhead | 99% | Inception V3 | 66% | ResNet-101 | 69% | Visual features | 100% |
| Combined | 99% | ResNet-50 | 78% | ResNet-152 | 66% | Fused features | 100% |
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Florea, G.A.; Mihailescu, R.-C. Multimodal Deep Learning for Group Activity Recognition in Smart Office Environments. Future Internet 2020, 12, 133. https://doi.org/10.3390/fi12080133