Expert Comment Generation Considering Sports Skill Level Using a Large Multimodal Model with Video and Spatial-Temporal Motion Features
Abstract
1. Introduction
- To summarize, the contributions of our method are as follows:
- Utilization of skill level classification results as inputs to the LMM: Our method incorporates the skill level classification results from the STA-GCN into the LMM, ensuring that the generated expert comments are appropriately tailored to the learner’s skill level.
- Incorporation of motion features with spatial-temporal information into the LMM: By integrating motion features that capture spatial-temporal information as inputs to the LMM, we enhance the model’s ability to generate detailed and context-specific feedback based on the players’ movements.
2. Skill-Level-Aware Expert Comment Generation
Algorithm 1: Skill-Level-Aware Expert Comment Generation

Require: Input video V, skeletal motion data S, pretrained STA-GCN, pretrained LMM
Ensure: Generated expert comments C
1: Visual Feature Extraction:
2:   Extract visual features from V
3:   Tokenize visual features
4: Motion Feature Extraction:
5:   Extract motion features from S with the STA-GCN
6:   Classify skill level
7:   Tokenize motion features
8:   Tokenize skill level
9: Multimodal Integration:
10:   Combine tokens
11: Expert Comment Generation:
12:   Generate expert comments with the LMM
13: return C
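As a reading aid, the following Python sketch mirrors the steps of Algorithm 1. The module names (video_tokenizer, sta_gcn, motion_tokenizer, skill_tokenizer, lmm) and the tensor shapes in the comments are hypothetical placeholders for the components described in Sections 2.1–2.3, not the authors’ released implementation.

```python
# Minimal sketch of Algorithm 1. All callables below are hypothetical
# stand-ins for the pretrained components described in the text.
import torch

def generate_expert_comments(video, skeleton, video_tokenizer, sta_gcn,
                             motion_tokenizer, skill_tokenizer, lmm, prompt):
    # 1. Visual feature extraction and tokenization
    visual_tokens = video_tokenizer(video)               # e.g., (1, N_v, d)

    # 2. Motion feature extraction and skill level classification with the STA-GCN
    motion_features, skill_logits = sta_gcn(skeleton)    # perception / attention branches
    skill_level = skill_logits.argmax(dim=-1)            # predicted skill class
    motion_tokens = motion_tokenizer(motion_features)    # e.g., (1, N_m, d)
    skill_tokens = skill_tokenizer(skill_level)           # e.g., (1, N_s, d)

    # 3. Multimodal integration: concatenate prompt, video, motion, and skill tokens
    prompt_tokens = lmm.embed_prompt(prompt)              # e.g., (1, N_p, d)
    inputs = torch.cat([prompt_tokens, visual_tokens,
                        motion_tokens, skill_tokens], dim=1)

    # 4. Expert comment generation by the LMM
    return lmm.generate(inputs_embeds=inputs)
```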
2.1. Video Data Tokenization
2.2. Skill Level Classification and Motion Feature Extraction
2.3. Expert Comment Generation with LMM
- Tokenized Video Features: Encoded representations of the visual content, capturing relevant aspects of the performance.
- Motion Features: Detailed motion information extracted from the perception branch of the STA-GCN, highlighting fundamental movements and joints.
- Skill Levels: Outputs from the attention branch of the STA-GCN, providing insights into the learner’s skill level.
- Prompts: Predefined text inputs or instructions designed to guide the generation of expert comments. An example prompt is shown below.
Please generate a commentary that improves the player’s play and technique based on the video, skill level, and motion features below. In parentheses (), specify the body part that needs improvement.
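For illustration, the following is a minimal sketch of how such a prompt string could be assembled around the learner’s predicted skill level. The helper name build_prompt is hypothetical, and the markers <video> and <motion> merely indicate where the tokenized video and motion features are injected; the exact template used by the authors may differ.

```python
# Hypothetical prompt-assembly helper, shown for illustration only.
def build_prompt(skill_label: str) -> str:
    return (
        "Please generate a commentary that improves the player's play and "
        "technique based on the video, skill level, and motion features below. "
        "In parentheses (), specify the body part that needs improvement.\n"
        f"Skill level: {skill_label}\n"
        "Video: <video>\n"
        "Motion: <motion>"
    )

print(build_prompt("novice"))
```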
2.4. Training Procedure
3. Experimental Results
3.1. Experimental Settings
- BLEU-4 [31]: Measures the n-gram overlap between the generated comments and the ground truth, focusing on 4-gram matches. BLEU-4 is defined as follows:

  $$ \text{BLEU-4} = BP \cdot \exp\left( \sum_{n=1}^{4} w_n \log p_n \right), \qquad BP = \begin{cases} 1 & \text{if } c > r, \\ e^{\,1 - r/c} & \text{if } c \le r, \end{cases} $$

  where $p_n$ denotes the modified $n$-gram precision, $w_n = 1/4$ is the uniform weight, $BP$ is the brevity penalty, and $c$ and $r$ are the lengths of the generated and reference comments, respectively.
- METEOR [32]: Considers the precision and recall of unigrams, incorporating synonym matches and stemming. The METEOR score is defined as follows:

  $$ \text{METEOR} = F_{\text{mean}} \cdot \left(1 - \text{Penalty}\right), \qquad F_{\text{mean}} = \frac{10\,P\,R}{R + 9P}, $$

  where $P$ and $R$ are the unigram precision and recall, respectively. The weights for the harmonic mean are based on the conventional literature [32]. In addition, Penalty represents the fragmentation penalty, which penalizes disordered matches. It is computed as

  $$ \text{Penalty} = 0.5 \left( \frac{\#\text{chunks}}{\#\text{matched unigrams}} \right)^{3}, $$

  where $\#\text{chunks}$ is the number of contiguous matched chunks and $\#\text{matched unigrams}$ is the total number of matched unigrams.
- ROUGE-L [33]: The longest common subsequence (LCS) between the generated comments and the reference is evaluated, capturing the sentence-level structure. The score is defined as follows:

  $$ \text{ROUGE-L} = \frac{\left(1 + \beta^{2}\right) R_{\text{lcs}}\, P_{\text{lcs}}}{R_{\text{lcs}} + \beta^{2} P_{\text{lcs}}}, $$

  where the parameter $\beta$ controls the trade-off between precision and recall. $P_{\text{lcs}}$ is defined as the length of the LCS divided by the length of the generated sequence, whereas $R_{\text{lcs}}$ is the length of the LCS divided by the length of the ground truth. They are computed as

  $$ P_{\text{lcs}} = \frac{\text{LCS}(X, Y)}{|Y|}, \qquad R_{\text{lcs}} = \frac{\text{LCS}(X, Y)}{|X|}, $$

  where $X$ denotes the ground-truth comment, $Y$ denotes the generated comment, and $\text{LCS}(X, Y)$ is the length of their longest common subsequence.
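As a concrete illustration of the definitions above, the following minimal Python sketch computes sentence-level BLEU-4 and ROUGE-L for a single candidate/reference pair. It is not the evaluation code used in the experiments; the absence of smoothing for BLEU-4, the choice of β = 1.2 for ROUGE-L, and the example sentences are illustrative assumptions.

```python
# Illustrative reimplementation of the BLEU-4 and ROUGE-L formulas above
# for a single sentence pair (tokens are whitespace-split words).
import math
from collections import Counter

def _ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu4(candidate, reference):
    """Sentence-level BLEU-4 with uniform weights and brevity penalty."""
    precisions = []
    for n in range(1, 5):
        cand = Counter(_ngrams(candidate, n))
        ref = Counter(_ngrams(reference, n))
        overlap = sum(min(c, ref[g]) for g, c in cand.items())  # clipped counts
        total = sum(cand.values())
        precisions.append(overlap / total if total else 0.0)
    if min(precisions) == 0.0:
        return 0.0  # no smoothing: any zero n-gram precision yields BLEU-4 = 0
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1.0 - r / c)  # brevity penalty
    return bp * math.exp(sum(0.25 * math.log(p) for p in precisions))

def _lcs_length(x, y):
    """Length of the longest common subsequence via dynamic programming."""
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(candidate, reference, beta=1.2):
    """ROUGE-L F-measure; beta trades off precision against recall."""
    lcs = _lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    p_lcs = lcs / len(candidate)  # LCS length over generated length
    r_lcs = lcs / len(reference)  # LCS length over reference length
    return (1 + beta ** 2) * p_lcs * r_lcs / (r_lcs + beta ** 2 * p_lcs)

hyp = "keep your elbow closer to your body during the swing".split()
ref = "keep the elbow close to the body when you swing".split()
print(f"BLEU-4: {bleu4(hyp, ref):.3f}, ROUGE-L: {rouge_l(hyp, ref):.3f}")
```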
3.2. Quantitative Evaluation of Generated Expert Comments
3.3. Qualitative Evaluation Results
4. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Ashutosh, K.; Nagarajan, T.; Pavlakos, G.; Kitani, K.; Grauman, K. ExpertAF: Expert actionable feedback from video. arXiv 2024, arXiv:2408.00672.
2. Losch, S.; Traut-Mattausch, E.; Mühlberger, M.D.; Jonas, E. Comparing the effectiveness of individual coaching, self-coaching, and group training: How leadership makes the difference. Front. Psychol. 2016, 7, 175595.
3. Hu, X.; Gan, Z.; Wang, J.; Yang, Z.; Liu, Z.; Lu, Y.; Wang, L. Scaling up vision-language pre-training for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 17980–17989.
4. Li, X.; Yin, X.; Li, C.; Zhang, P.; Hu, X.; Zhang, L.; Wang, L.; Hu, H.; Dong, L.; Wei, F.; et al. Oscar: Object-semantics aligned pre-training for vision-language tasks. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 121–137.
5. Lin, B.; Zhu, B.; Ye, Y.; Ning, M.; Jin, P.; Yuan, L. Video-LLaVA: Learning united visual representation by alignment before projection. arXiv 2023, arXiv:2311.10122.
6. Zhou, H.; Martín-Martín, R.; Kapadia, M.; Savarese, S.; Niebles, J.C. Procedure-aware pretraining for instructional video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–23 June 2023; pp. 10727–10738.
7. Delmas, G.; Weinzaepfel, P.; Lucas, T.; Moreno-Noguer, F.; Rogez, G. PoseScript: 3D human poses from natural language. In Computer Vision—ECCV 2022, Proceedings of the 17th European Conference, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 346–362.
8. Jiang, B.; Chen, X.; Liu, W.; Yu, J.; Yu, G.; Chen, T. MotionGPT: Human motion as a foreign language. Adv. Neural Inf. Process. Syst. 2023, 36, 20067–20079.
9. Chen, L.-H.; Lu, S.; Zeng, A.; Zhang, H.; Wang, B.; Zhang, R.; Zhang, L. MotionLLM: Understanding human behaviors from human motions and videos. arXiv 2024, arXiv:2405.20340.
10. Wu, Q.; Zhao, Y.; Wang, Y.; Tai, Y.-W.; Tang, C.-K. MotionLLM: Multimodal motion-language learning with large language models. arXiv 2024, arXiv:2405.17013.
11. Bertasius, G.; Chan, A.; Shi, J. Egocentric basketball motion planning from a single first-person image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5889–5898.
12. Yu, X.; Rao, Y.; Zhao, W.; Lu, J.; Zhou, J. Group-aware contrastive regression for action quality assessment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 7919–7928.
13. Tang, Y.; Ni, Z.; Zhou, J.; Zhang, D.; Lu, J.; Wu, Y.; Zhou, J. Uncertainty-aware score distribution learning for action quality assessment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9839–9848.
14. Zhang, S.; Dai, W.; Wang, S.; Shen, X.; Lu, J.; Zhou, J.; Tang, Y. LOGO: A long-form video dataset for group action quality assessment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–23 June 2023; pp. 2405–2414.
15. Zhang, B.; Chen, J.; Xu, Y.; Zhang, H.; Yang, X.; Geng, X. Auto-encoding score distribution regression for action quality assessment. Neural Comput. Appl. 2024, 36, 929–942.
16. Li, Y.M.; Zeng, L.A.; Meng, J.K.; Zheng, W.S. Continual action assessment via task-consistent score-discriminative feature distribution modeling. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 9112–9124.
17. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907.
18. Seino, T.; Saito, N.; Ogawa, T.; Asamizu, S.; Haseyama, M. Expert–Novice level classification using graph convolutional network introducing confidence-aware node-level attention mechanism. Sensors 2024, 24, 3033.
19. Shiraki, K.; Hirakawa, T.; Yamashita, T.; Fujiyoshi, H. Spatial temporal attention graph convolutional networks with mechanics-stream for skeleton-based action recognition. In Computer Vision—ACCV 2020, Proceedings of the 15th Asian Conference on Computer Vision, Kyoto, Japan, 20–24 November 2020; Springer: Cham, Switzerland, 2021; pp. 240–257.
20. Zhu, B.; Lin, B.; Ning, M.; Yan, Y.; Cui, J.; Wang, H.; Pang, Y.; Jiang, W.; Zhang, J.; Li, Z.; et al. LanguageBind: Extending video-language pretraining to n-modality by language-based semantic alignment. arXiv 2023, arXiv:2310.01852.
21. Yan, S.; Xiong, Y.; Lin, D. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the Association for the Advancement of Artificial Intelligence Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32, pp. 3482–3489.
22. Si, C.; Jing, Y.; Wang, W.; Wang, L.; Tan, T. Skeleton-based action recognition with spatial reasoning and temporal stack learning. In Computer Vision—ECCV 2018, Proceedings of the 15th European Conference, Munich, Germany, 8–14 September 2018; Springer: Cham, Switzerland, 2018; pp. 103–118.
23. Li, M.; Chen, S.; Chen, X.; Zhang, Y.; Wang, Y.; Tian, Q. Actional-structural graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3595–3603.
24. Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Skeleton-based action recognition with directed graph neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7912–7921.
25. Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12026–12035.
26. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 448–456.
27. Glorot, X.; Bordes, A.; Bengio, Y. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 11–13 April 2011; pp. 315–323.
28. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958.
29. Lin, M.; Chen, Q.; Yan, S. Network in network. arXiv 2013, arXiv:1312.4400.
30. Grauman, K.; Westbury, A.; Torresani, L.; Kitani, K.; Malik, J.; Afouras, T.; Ashutosh, K.; Baiyya, V.; Bansal, S.; Boote, B.; et al. Ego-Exo4D: Understanding skilled human activity from first- and third-person perspectives. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–21 June 2024; pp. 19383–19400.
31. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.-J. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 7–12 July 2002; pp. 311–318.
32. Banerjee, S.; Lavie, A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, MI, USA, 29 June 2005; pp. 65–72.
33. Lin, C.-Y. ROUGE: A package for automatic evaluation of summaries. In Proceedings of the Text Summarization Branches Out, Barcelona, Spain, 25–26 July 2004; pp. 74–81.
| Method | Skill Level | Motion | Video |
|---|---|---|---|
PM | ✓ | ✓ | ✓ |
CM1 | ✓ | ✓ | |
CM2 | ✓ | ✓ | |
CM3 | ✓ | ||
CM4 | ✓ | ✓ | |
CM5 | ✓ |
| Method | BLEU-4 | METEOR | ROUGE-L |
|---|---|---|---|
| PM | 1.75 × 10⁻² | 0.256 | 0.156 |
CM1 | |||
CM2 | |||
CM3 | |||
CM4 | |||
CM5 | |||
CM6 | |||
CM7 |