Dynamic Attention Analysis of Body Parts in Transformer-Based Human–Robot Imitation Learning with the Embodiment Gap
Abstract
1. Introduction
1. Movement-level imitation: This involves directly imitating movements, such as replicating the motions of hands or feet.
2. Result-level imitation: This focuses on replicating the outcomes of actions, such as moving an object or opening a door.
3. Intention-level imitation: This involves understanding and imitating the goal behind an action, for example, understanding the goal of cleaning to achieve the result of picking up trash and then using a broom to sweep.
2. Related Works
2.1. Learning from Demonstration Under Embodiment Differences
2.2. Transformer-Based Imitation Learning
2.3. Cross-Embodiment Imitation Learning
2.3.1. Inverse Reinforcement Learning and Goal Inference
2.3.2. Adversarial and Embedding-Based Methods
2.3.3. Transformers and Diffusion-Based Models for Generalization
2.3.4. Imitation from Observation (IfO)
3. Dynamic Attention Mechanisms in Transformer Models for Human–Robot Imitation
3.1. Posture Estimation
3.2. Estimating the Embodiment Gap
| Algorithm 1 Estimate ,. |
|
3.3. Feature Extraction of Imitation Movement
| Algorithm 2 k-means++ method. |
|
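The body of Algorithm 2 did not survive in this copy, so the following is only a minimal sketch of standard k-means++ seeding followed by Lloyd iterations, not the authors' exact procedure. The function names (`kmeans_pp_seed`, `kmeans`), the tuple-based point representation, and the parameters `n_iter` and `seed` are assumptions introduced for illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_seed(points, k, rng):
    """Choose k initial centers by k-means++ (D^2-weighted) sampling."""
    centers = [rng.choice(points)]  # first center: uniform at random
    # d2[i] = squared distance from points[i] to its nearest chosen center
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        if sum(d2) == 0:
            # every point coincides with a center; fall back to uniform choice
            nxt = rng.choice(points)
        else:
            # sample the next center with probability proportional to d2
            nxt = rng.choices(points, weights=d2, k=1)[0]
        centers.append(nxt)
        d2 = [min(d, sq_dist(p, nxt)) for d, p in zip(d2, points)]
    return centers


def assign(points, centers):
    """Index of the nearest center for each point."""
    return [
        min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
        for p in points
    ]


def kmeans(points, k, n_iter=50, seed=0):
    """Lloyd's algorithm started from k-means++ seeds (hypothetical helper)."""
    rng = random.Random(seed)
    centers = kmeans_pp_seed(points, k, rng)
    for _ in range(n_iter):
        labels = assign(points, centers)
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                # move the center to the mean of its assigned points
                new_centers.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                # keep an empty cluster's center where it was
                new_centers.append(centers[j])
        if new_centers == centers:
            break  # centers stopped moving
        centers = new_centers
    return centers, assign(points, centers)
```

Calling `kmeans(features, k)` on a list of equal-length feature tuples returns the final centers and one cluster index per input. Sampling each new seed in proportion to its squared distance from the existing seeds is what distinguishes k-means++ from uniformly random initialization.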
3.4. Learning the Imitation Policy and Estimating Actions
3.5. Dimensionality Reduction
4. Experimental Results of Dynamic Attention Mechanism in Transformer Models
4.1. Experimental Settings
4.2. Imitation Learning Results and Analysis of Dynamic Attention to Body Parts
5. Conclusions
Author Contributions
Funding
Informed Consent Statement
Data Availability Statement
Conflicts of Interest

| Joint | Human [deg] | NAO [deg] |
|---|---|---|
| Shoulder Pitch | −157.0 to | −119.5 to |
| Shoulder Roll | −63.0 to | - |
| Shoulder Yaw | −134.0 to | −76.0 to |
| Elbow Yaw | to | to |
| Elbow Roll | −85.0 to | −119.5 to |
| Wrist Pitch | −88.0 to | - |
| Wrist Roll | - | −104.5 to |
| Wrist Yaw | −33.0 to | - |
| Movement Patterns | Joints | Spearman | Kendall | Pearson r | Mutual Information (bits) |
|---|---|---|---|---|---|
| Whole-arm body movement | Shoulder pitch | 0.5935 | 0.3074 | 0.6055 | 1.453 |
| Whole-arm body movement | Elbow roll | 0.8465 | 0.7085 | 0.8162 | 1.606 |
| Elbow-weighted body movements | Shoulder pitch | 0.6769 | 0.5917 | 0.8231 | 1.351 |
| Elbow-weighted body movements | Elbow roll | 0.7320 | 0.6109 | 0.5867 | 1.772 |
| Wrist-weighted body movements | Elbow roll | 0.8779 | 0.7027 | 0.9488 | 1.073 |
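To make the column headers concrete, the sketch below computes the three correlation coefficients named in the table (Pearson r, Spearman, Kendall) for two paired series. It is a plain-Python illustration under stated assumptions: the inputs `x` and `y` are hypothetical placeholders, the Kendall variant shown is tau-b, and this copy of the article does not say which series or which variants the authors used. Mutual information is left out because its estimator needs more machinery than fits a short example.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall tau-b, counting concordant and discordant pairs (O(n^2))."""
    conc = disc = ties_x = ties_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both series: excluded from both factors
            if dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                conc += 1
            else:
                disc += 1
    denom = math.sqrt((conc + disc + ties_x) * (conc + disc + ties_y))
    return (conc - disc) / denom
```

Pearson r measures linear association, while Spearman and Kendall depend only on rank order, so a monotone but nonlinear relation gives rank coefficients of 1 with a smaller Pearson r. That difference is one reason a table might report all three side by side.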
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Tsunekawa, Y.; Sekiyama, K. Dynamic Attention Analysis of Body Parts in Transformer-Based Human–Robot Imitation Learning with the Embodiment Gap. Machines 2025, 13, 1133. https://doi.org/10.3390/machines13121133

