Intent-Bert and Universal Context Encoders: A Framework for Workload and Sensor Agnostic Human Intention Prediction
Abstract
1. Introduction
1. A framework, Intent-BERT, using Universal Context Encoders (UCEs) that support encoding a diverse range of data modalities and formats into a single latent space through word-embedding similarity (a toy sketch of this idea follows the list);
2. The use of Intent-BERT to passively recover, from a single time point in a scene, an English-language description of the next activity a human will perform;
3. The further use of Intent-BERT to predict the time until the human's current activity will end and the time until the next activity begins;
4. A demonstration and evaluation of this approach on the InHARD Dataset [14].
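To make contribution 1 concrete, the following is a minimal sketch of the shared-latent-space idea: lightweight per-modality encoders project heterogeneous sensor features into one embedding space, and candidate action phrases are ranked by cosine similarity against their word embeddings. The encoder shapes, fusion rule, action phrases, and embedding table are illustrative assumptions, not the architecture, vocabulary, or training procedure used by Intent-BERT.

```python
# Toy sketch of the Universal Context Encoder idea: per-modality encoders map
# heterogeneous sensor features into one embedding space, and candidate action
# phrases are scored by cosine similarity against their (placeholder) word
# embeddings.  All shapes and phrases below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 64

def make_linear_encoder(in_dim: int, out_dim: int = EMBED_DIM):
    """Return a toy linear projection standing in for a learned per-modality encoder."""
    W = rng.normal(scale=in_dim ** -0.5, size=(in_dim, out_dim))
    return lambda x: x @ W

# One encoder per modality: 3-D skeleton data and 2-D multi-view keypoints.
skeleton_encoder = make_linear_encoder(in_dim=17 * 6)  # 17 joints x (xyz position + xyz rotation)
keypoint_encoder = make_linear_encoder(in_dim=17 * 3)  # 17 joints x (x, y, confidence)

def encode_context(skeleton_feats, keypoint_feats):
    """Fuse per-modality embeddings into one unit-norm context vector (simple sum fusion)."""
    z = skeleton_encoder(skeleton_feats) + keypoint_encoder(keypoint_feats)
    return z / np.linalg.norm(z)

# Hypothetical candidate action phrases with placeholder "word" embeddings.
actions = ["pick up part", "tighten screw", "place part on rail"]
action_embeddings = rng.normal(size=(len(actions), EMBED_DIM))
action_embeddings /= np.linalg.norm(action_embeddings, axis=1, keepdims=True)

# Score each phrase by cosine similarity to the fused context embedding.
context = encode_context(rng.normal(size=17 * 6), rng.normal(size=17 * 3))
scores = action_embeddings @ context  # all vectors are unit-norm
print("predicted next action:", actions[int(np.argmax(scores))])
```

In the full framework, the word embeddings would come from a pretrained language model and the encoders would be trained so that matching contexts and phrases land close together in the shared space.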
2. Background
2.1. Human-Robot Collaboration
2.2. Human-Robot Communication
2.3. Multi-Modal Techniques in Robotics
2.4. Human Intention Prediction
3. Methodology
3.1. Intuitive Rationale
3.2. Proposed Framework
3.3. Losses
4. Results
4.1. Experimental Setup
4.1.1. Data Cleaning and Preparation
4.1.2. General Hyper-Parameters
4.1.3. UCE/Intent-BERT Hyper-Parameters
4.2. Next Task/Action Prediction
4.3. Activity Timing Prediction
4.4. Timing Evaluation
4.5. Comparison of UCEs Against GPTs
The prompt given to the GPT models in this comparison was as follows:

"The user will give you data about a person's body parts and positioning. This data is given from a 3D perspective, where you receive each body part's position in an x, y, z format and its rotation in an x, y, z format. You also get each body part's position from a 2D perspective, from multiple camera angles, in an x, y format, along with a confidence level. I then need you to tell the user what the person observed is doing and to predict what they are about to do. You are to match each action to a specific meta-action and action found in the file you were given. Take your time; there is absolutely no rush. I want the answer to be correct, not quick."
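For illustration, the sketch below shows one way per-frame pose data of the kind described in the prompt could be serialized into text before being sent to a GPT-style model. The `Joint` structure, joint names, label-file name, and formatting are assumptions made for this sketch and are not taken from the paper's pipeline.

```python
# Illustrative serialization of one frame of pose data into an LLM prompt.
# The data layout and strings below are assumptions, not the paper's schema.
from dataclasses import dataclass

@dataclass
class Joint:
    """One body part in a single frame (hypothetical structure)."""
    name: str
    pos3d: tuple   # (x, y, z) position
    rot3d: tuple   # (x, y, z) rotation
    views2d: dict  # camera name -> (x, y, confidence)

def build_prompt(joints, label_file):
    """Turn one frame of pose data into a text prompt for a GPT-style model."""
    lines = [f"Match the observed motion to a meta-action and action from {label_file}."]
    for j in joints:
        lines.append(f"{j.name}: pos={j.pos3d}, rot={j.rot3d}")
        for cam, (x, y, conf) in j.views2d.items():
            lines.append(f"  {cam}: ({x}, {y}), confidence={conf:.2f}")
    lines.append("State what the person is doing and predict what they will do next.")
    return "\n".join(lines)

# Example with a single hypothetical joint from one frame.
hand = Joint("RightHand", (0.41, 1.02, 0.35), (12.0, -4.5, 88.0),
             {"cam_front": (512, 300, 0.93), "cam_side": (221, 410, 0.71)})
print(build_prompt([hand], "action_labels.txt"))
```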
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. Li, S.; Wang, R.; Zheng, P.; Wang, L. Towards Proactive Human-Robot Collaboration: A Foreseeable Cognitive Manufacturing Paradigm. J. Manuf. Syst. 2021, 60, 547–552.
2. Bonarini, A. Communication in human-robot interaction. Curr. Robot. Rep. 2020, 1, 279–285.
3. Matsas, E.; Vosniakos, G.C.; Batras, D. Prototyping proactive and adaptive techniques for human-robot collaboration in manufacturing using virtual reality. Robot. Comput.-Integr. Manuf. 2018, 50, 168–180.
4. Zhou, Z.; Li, R.; Xu, W.; Yao, B.; Ji, Z. Context-aware assistance guidance via augmented reality for industrial human-robot collaboration. In Proceedings of the 2022 IEEE 17th Conference on Industrial Electronics and Applications (ICIEA), Chengdu, China, 16–19 December 2022; pp. 1516–1521.
5. Heinzmann, J.; Zelinsky, A. Quantitative Safety Guarantees for Physical Human-Robot Interaction. Int. J. Robot. Res. 2003, 22, 479–504.
6. Strazdas, D.; Hintz, J.; Felßberg, A.M.; Al-Hamadi, A. Robots and Wizards: An Investigation Into Natural Human–Robot Interaction. IEEE Access 2020, 8, 207635–207642.
7. Bucker, A.; Figueredo, L.; Haddadin, S.; Kapoor, A.; Ma, S.; Vemprala, S.; Bonatti, R. LATTE: LAnguage Trajectory TransformEr. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 7287–7294.
8. Choi, S.H.; Park, K.B.; Roh, D.H.; Lee, J.Y.; Mohammed, M.; Ghasemi, Y.; Jeong, H. An integrated mixed reality system for safety-aware human-robot collaboration using deep learning and digital twin generation. Robot. Comput.-Integr. Manuf. 2022, 73, 102258.
9. Wang, L.; Gao, R.; Váncza, J.; Krüger, J.; Wang, X.; Makris, S.; Chryssolouris, G. Symbiotic human-robot collaborative assembly. CIRP Ann. 2019, 68, 701–726.
10. Liu, H.; Wang, L. Human motion prediction for human-robot collaboration. J. Manuf. Syst. 2017, 44, 287–294.
11. Orsag, L.; Stipancic, T.; Koren, L. Towards a Safe Human–Robot Collaboration Using Information on Human Worker Activity. Sensors 2023, 23, 1283.
12. Zheng, P.; Li, S.; Xia, L.; Wang, L.; Nassehi, A. A visual reasoning-based approach for mutual-cognitive human-robot collaboration. CIRP Ann. 2022, 71, 377–380.
13. Wu, J.; Li, X.; Xu, S.; Yuan, H.; Ding, H.; Yang, Y.; Li, X.; Zhang, J.; Tong, Y.; Jiang, X.; et al. Towards open vocabulary learning: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 5092–5113.
14. Dallel, M.; Havard, V.; Baudry, D.; Savatier, X. InHARD—Industrial Human Action Recognition Dataset in the Context of Industrial Collaborative Robotics. In Proceedings of the 2020 IEEE International Conference on Human-Machine Systems (ICHMS), Rome, Italy, 7–9 September 2020; pp. 1–6.
15. Sharkawy, A.N.; Koustoumpardis, P.N. Human–robot interaction: A review and analysis on variable admittance control, safety, and perspectives. Machines 2022, 10, 591.
16. Wang, P.; Liu, H.; Wang, L.; Gao, R.X. Deep learning-based human motion recognition for predictive context-aware human-robot collaboration. CIRP Ann. 2018, 67, 17–20.
17. Mendes, N.; Safeea, M.; Neto, P. Flexible programming and orchestration of collaborative robotic manufacturing systems. In Proceedings of the 2018 IEEE 16th International Conference on Industrial Informatics (INDIN), Porto, Portugal, 18–20 July 2018; pp. 913–918.
18. Park, K.B.; Choi, S.H.; Lee, J.Y.; Ghasemi, Y.; Mohammed, M.; Jeong, H. Hands-Free Human–Robot Interaction Using Multimodal Gestures and Deep Learning in Wearable Mixed Reality. IEEE Access 2021, 9, 55448–55464.
19. Liu, H.; Fang, T.; Zhou, T.; Wang, L. Towards Robust Human-Robot Collaborative Manufacturing: Multimodal Fusion. IEEE Access 2018, 6, 74762–74771.
20. Papanastasiou, S.; Kousi, N.; Karagiannis, P.; Gkournelos, C.; Papavasileiou, A.; Dimoulas, K.; Baris, K.; Koukas, S.; Michalos, G.; Makris, S. Towards seamless human robot collaboration: Integrating multimodal interaction. Int. J. Adv. Manuf. Technol. 2019, 105, 1–17.
21. Prakash, A.; Chitta, K.; Geiger, A. Multi-Modal Fusion Transformer for End-to-End Autonomous Driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 7077–7087.
22. Huang, Z.; Zeng, Z.; Liu, B.; Fu, D.; Fu, J. Pixel-BERT: Aligning image pixels with text by deep multi-modal transformers. arXiv 2020, arXiv:2004.00849.
23. Zhang, P.; Jin, P.; Du, G.; Liu, X. Ensuring safety in human-robot coexisting environment based on two-level protection. Ind. Robot 2016, 43, 264–273.
24. Unhelkar, V.V.; Lasota, P.A.; Tyroller, Q.; Buhai, R.D.; Marceau, L.; Deml, B.; Shah, J.A. Human-Aware Robotic Assistant for Collaborative Assembly: Integrating Human Motion Prediction With Planning in Time. IEEE Robot. Autom. Lett. 2018, 3, 2394–2401.
25. Ragaglia, M.; Zanchettin, A.M.; Rocco, P. Trajectory generation algorithm for safe human-robot collaboration based on multiple depth sensor measurements. Mechatronics 2018, 55, 267–281.
26. Zhang, Y.; Ding, K.; Hui, J.; Lv, J.; Zhou, X.; Zheng, P. Human-object integrated assembly intention recognition for context-aware human-robot collaborative assembly. Adv. Eng. Inform. 2022, 54, 101792.
27. Li, S.; Zheng, P.; Wang, Z.; Fan, J.; Wang, L. Dynamic Scene Graph for Mutual-Cognition Generation in Proactive Human-Robot Collaboration. Procedia CIRP 2022, 107, 943–948.
28. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30.
29. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the NAACL-HLT, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186.
30. Watson, M.; Chollet, F.; Sreepathihalli, D.; Saadat, S.; Sampath, R.; Rasskin, G.; Zhu, S.; Singh, V.; Wood, L.; Tan, Z.; et al. KerasNLP. 2022. Available online: https://github.com/keras-team/keras-nlp (accessed on 23 October 2024).
31. Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLOv8. Version 8.0.178. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 17 January 2025).
32. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
33. Zhang, Z.; Sabuncu, M. Generalized cross entropy loss for training deep neural networks with noisy labels. Adv. Neural Inf. Process. Syst. 2018, 31.
34. McKenzie, J. Mean absolute percentage error and bias in economic forecasting. Econ. Lett. 2011, 113, 259–262.
35. Rahutomo, F.; Kitasuka, T.; Aritsugi, M. Semantic cosine similarity. In Proceedings of the 7th International Student Conference on Advanced Science and Technology (ICAST), University of Seoul, Seoul, Republic of Korea, 29–30 October 2012; Volume 4, p. 1.
36. Tsokos, C.P.; Welch, R. Bayes discrimination with mean square error loss. Pattern Recognit. 1978, 10, 113–123.
37. Li, B.; Han, L. Distance weighted cosine similarity measure for text classification. In Proceedings of the Intelligent Data Engineering and Automated Learning—IDEAL 2013: 14th International Conference, IDEAL 2013, Hefei, China, 20–23 October 2013; Proceedings 14; Springer: Berlin/Heidelberg, Germany, 2013; pp. 611–618.
38. Zhang, Z. Improved Adam optimizer for deep neural networks. In Proceedings of the 2018 IEEE/ACM 26th International Symposium on Quality of Service (IWQoS), Banff, AB, Canada, 4–6 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1–2.
39. Chen, S.F.; Beeferman, D.; Rosenfeld, R. Evaluation Metrics for Language Models; Carnegie Mellon University: Pittsburgh, PA, USA, 1998.
40. NVIDIA. TensorRT—BERT. 2019. Available online: https://developer.nvidia.com/blog/real-time-nlp-with-bert-using-tensorrt-updated/ (accessed on 5 January 2020).
41. Sener, F.; Chatterjee, D.; Shelepov, D.; He, K.; Singhania, D.; Wang, R.; Yao, A. Assembly101: A Large-Scale Multi-View Video Dataset for Understanding Procedural Activities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 21096–21106.
42. Damen, D.; Doughty, H.; Farinella, G.M.; Furnari, A.; Ma, J.; Kazakos, E.; Moltisanti, D.; Munro, J.; Perrett, T.; Price, W.; et al. Rescaling Egocentric Vision: Collection, Pipeline and Challenges for EPIC-KITCHENS-100. Int. J. Comput. Vis. 2021, 130, 33–35.
| Area | Description |
|---|---|
| Human-Robot Collaboration | Humans and Robots working jointly (but potentially spatially or temporally separated) towards the same goals [1] |
| Human-Robot Interaction | Humans and Robots directly interacting to exchange information or complete a task |
| Human-Robot Communication | Humans and Robots exchanging information |
| Human Intention Prediction * | Predicting the next long-term action to be performed by a human |
| Activity | Perplexity | PTA | Top-3 PTA | Phrase Accuracy |
|---|---|---|---|---|
| Task | 3.3594 | 60.08% | 80.14% | 14.63% |
| Action | 3.3593 | 60.08% | 95.78% | 41.69% |
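For reference, the block below gives the standard definition of perplexity over a predicted token sequence, along with the readings assumed here for the other two metrics; the paper's exact formulations are not reproduced in this excerpt, so PTA (taken as per-token accuracy) and phrase accuracy (taken as exact match of the full phrase) are labeled assumptions.

```latex
% Perplexity over N predicted tokens w_1, ..., w_N:
\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p\!\left(w_i \mid \text{context}\right)\right)
% Assumed reading of per-token accuracy (PTA): fraction of tokens predicted correctly,
\mathrm{PTA} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{1}\!\left[\hat{w}_i = w_i\right]
% with phrase accuracy taken as exact match of the entire predicted phrase.
```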
| Activity | LGL | MSE | CS | MAPE |
|---|---|---|---|---|
| Task | 6.924 | 0.6490 | 0.4377 | 927.71% |
| Action | 4.867 | 0.2990 | 0.4728 | 598.1% |
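The MSE, CS (cosine similarity), and MAPE columns are assumed here to follow their standard element-wise definitions over a predicted embedding $\hat{e}$ and ground-truth embedding $e$ with $d$ components; the LGL column is left as reported, since its definition does not appear in this excerpt.

```latex
\mathrm{MSE}(\hat{e}, e) = \frac{1}{d}\sum_{j=1}^{d}\left(\hat{e}_j - e_j\right)^2, \qquad
\mathrm{CS}(\hat{e}, e) = \frac{\hat{e} \cdot e}{\lVert \hat{e} \rVert \, \lVert e \rVert}, \qquad
\mathrm{MAPE}(\hat{e}, e) = \frac{100\%}{d}\sum_{j=1}^{d}\left|\frac{e_j - \hat{e}_j}{e_j}\right|
```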
| Prediction Type | MSLE | MAE | MAPE |
|---|---|---|---|
| Time to finish task | 0.2765 | 1.263 s | 432.6% |
| Time to next task | 0.4094 | 2.186 s | 702.3% |
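The timing metrics above are assumed to follow their standard definitions for a predicted time $\hat{t}_i$ and ground-truth time $t_i$ (in seconds) over $n$ test samples:

```latex
\mathrm{MSLE} = \frac{1}{n}\sum_{i=1}^{n}\left(\ln(1+\hat{t}_i) - \ln(1+t_i)\right)^2, \qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{t}_i - t_i\right|, \qquad
\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{t_i - \hat{t}_i}{t_i}\right|
```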
| Model | Task EmbedLoss | Task PTA | Task PA | Action EmbedLoss | Action PTA | Action PA |
|---|---|---|---|---|---|---|
| GPT | 7.56 | 12.1% | 0.19% | 7.73 | 10.9% | 0.19% |
| Intent-BERT | 3.47 | 60.1% | 14.63% | 2.78 | 60.1% | 41.69% |