Article

LLaVA-Pose: Keypoint-Integrated Instruction Tuning for Human Pose and Action Understanding †

1 Department of Informatics, Graduate School of Informatics and Engineering, The University of Electro-Communications, Tokyo 182-8585, Japan
2 TikTok Inc., 1199 Coleman Ave, San Jose, CA 95110, USA
* Authors to whom correspondence should be addressed.
This article is a revised and expanded version of a paper entitled “Keypoint-Integrated Instruction-Following Data Generation for Enhanced Human Pose and Action Understanding in Multimodal Models”, which was presented at the International Conference on Advanced Concepts for Intelligent Vision Systems (ACIVS), Tokyo, Japan, 28–30 July 2025.
Sensors 2025, 25(16), 5213; https://doi.org/10.3390/s25165213
Submission received: 5 August 2025 / Revised: 18 August 2025 / Accepted: 19 August 2025 / Published: 21 August 2025
(This article belongs to the Section Intelligent Sensors)

Abstract

Current vision–language models (VLMs) are well suited to general visual understanding tasks, but they perform poorly on complex visual tasks involving human poses and actions because specialized vision–language instruction-following data are scarce. We introduce a method for generating such data by integrating human keypoints with traditional visual features such as captions and bounding boxes, enabling a more precise understanding of human-centric scenes. Using this approach, we construct a dataset of 200,328 samples tailored to fine-tuning models for human-centric tasks, focusing on three areas: conversation, detailed description, and complex reasoning. We also establish an Extended Human Pose and Action Understanding Benchmark (E-HPAUB) to assess model performance on human pose and action understanding. We fine-tune the LLaVA-1.5-7B model on this dataset and evaluate the resulting LLaVA-Pose model on the benchmark, where it achieves an overall improvement of 33.2% over the original LLaVA-1.5-7B model. These findings highlight the effectiveness of keypoint-integrated data in enhancing multimodal models for human-centric visual understanding.
Keywords: human pose and action understanding; keypoint-integrated data generation; instruction-following data; vision–language models; multimodal instruction tuning
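To make the keypoint-integration step concrete, the following is a minimal Python sketch, not the authors' released pipeline, of how COCO-style person keypoints could be serialized together with a caption and bounding box into a textual context from which a language model can generate instruction-following question–answer pairs. The function name build_symbolic_context and all example values are illustrative assumptions, not part of LLaVA-Pose.

# Minimal sketch (assumption): turn COCO-style person annotations into a
# plain-text context for prompting a language model to produce
# instruction-following data about human poses and actions.

# COCO keypoint order for the "person" category.
COCO_KEYPOINT_NAMES = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

def build_symbolic_context(caption, bbox, keypoints):
    """Serialize one person's annotations into plain text.

    caption   : image-level caption string
    bbox      : [x, y, width, height] in pixels
    keypoints : flat COCO list [x1, y1, v1, x2, y2, v2, ...], where v > 0
                means the keypoint is labeled
    """
    lines = [f"Caption: {caption}",
             f"Person bounding box (x, y, w, h): {bbox}"]
    for i, name in enumerate(COCO_KEYPOINT_NAMES):
        x, y, v = keypoints[3 * i: 3 * i + 3]
        if v > 0:  # keep only annotated keypoints
            lines.append(f"{name}: ({x:.0f}, {y:.0f})")
    return "\n".join(lines)

# Toy usage; a real pipeline would read these values from COCO-style
# annotation files and pair each context with task-specific prompts
# (conversation, detailed description, complex reasoning).
context = build_symbolic_context(
    caption="A man swings a tennis racket on a court.",
    bbox=[120, 60, 180, 320],
    keypoints=[150, 80, 2] + [0, 0, 0] * 16,
)
print(context)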

Share and Cite

Zhang, D.; Hussain, T.; An, W.; Shouno, H. LLaVA-Pose: Keypoint-Integrated Instruction Tuning for Human Pose and Action Understanding. Sensors 2025, 25, 5213. https://doi.org/10.3390/s25165213
