MLA-LSTM: A Local and Global Location Attention LSTM Learning Model for Scoring Figure Skating
Abstract
1. Introduction
- (1) Since the most suitable feature extraction method and sampling clip length for the video scoring task have not yet been established, this paper systematically compares different feature extraction methods and clip lengths for figure skating scoring, providing an effective reference for future research on feature extraction from long videos;
- (2) To accurately identify the relevant frames, we propose a novel position aggregation network, MS-LAM, which captures athletes' position information while losing as little of the athletes' clip-level features as possible. The skaters' locations are computed automatically and are unaffected by the size of the skaters against the background (a conceptual sketch of this idea follows this list).
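The full design of MS-LAM is given in Section 3.2. As a rough, hypothetical illustration of the idea stated above (multi-scale location attention that re-weights features without discarding clip-level information), the following PyTorch sketch is ours, not the paper's exact architecture; the kernel sizes, fusion layer, and residual connection are assumptions.

```python
import torch
import torch.nn as nn

class MultiScaleLocationAttention(nn.Module):
    """Hypothetical sketch of a multi-scale location (spatial) attention
    block: parallel convolutions with different receptive fields estimate
    where the athlete is, and the fused map re-weights the features."""

    def __init__(self, channels: int, scales=(1, 3, 5)):
        super().__init__()
        # One attention branch per spatial scale (assumed kernel sizes).
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, 1, kernel_size=k, padding=k // 2)
            for k in scales
        )
        self.fuse = nn.Conv2d(len(scales), 1, kernel_size=1)

    def forward(self, x):                      # x: (B, C, H, W) clip feature map
        maps = torch.cat([b(x) for b in self.branches], dim=1)
        attn = torch.sigmoid(self.fuse(maps))  # (B, 1, H, W) location attention
        return x * attn + x                    # residual keeps clip-level features
```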
2. Related Work
2.1. Video Understanding
2.2. Feature Fusion
3. Approach
3.1. Video Features
3.2. Multi-Scale Location Attention Module (MS-LAM)
3.3. MLA-LSTM
4. Experiments
4.1. Settings and Evaluation
4.1.1. Datasets
4.1.2. Evaluation Metric
4.1.3. Experimental Settings
4.1.4. Competitors
- Input features: We use clip-level features as input. To verify the effectiveness of different feature extraction methods, we extract them as described in Section 3.1.
- We use max and average pooling to obtain video-level embeddings of the clip features, and a linear-kernel SVR to regress the predicted scores (see the pooling + SVR sketch after this list).
- We emulate the C3D-LSTM architecture of Parmar et al. [11] to generate video descriptions directly, and we also employ a bidirectional LSTM (Bi-LSTM) to address issues that long video sequences may cause and to keep the comparison fair. The hidden size of the LSTM/Bi-LSTM is set to 256/128, with the same regression head as our model. The method is also extended to I3D and R3D features (see the Bi-LSTM sketch after this list).
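A minimal sketch of the pooling + SVR baselines described above, assuming pre-extracted clip-level features stacked into a single NumPy array; the function name and array layout are ours.

```python
import numpy as np
from sklearn.svm import SVR

def svr_baseline(features: np.ndarray, scores: np.ndarray, pooling: str = "max"):
    """features: (num_videos, num_clips, feat_dim) clip-level C3D/I3D/R3D features.
    scores:   (num_videos,) ground-truth TES or PCS scores."""
    # Collapse the clip axis into one video-level embedding (max or average pooling).
    video_emb = features.max(axis=1) if pooling == "max" else features.mean(axis=1)
    reg = SVR(kernel="linear")          # linear-kernel support vector regression
    reg.fit(video_emb, scores)
    return reg
```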
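Likewise, a minimal sketch of the Bi-LSTM competitor with the hidden sizes quoted above (128 per direction, matching the 256-unit unidirectional variant); the linear head on the final time step is our assumption, standing in for "the same regression as our model".

```python
import torch
import torch.nn as nn

class BiLSTMRegressor(nn.Module):
    """Sketch of the Bi-LSTM competitor: hidden size 128 per direction,
    followed by a linear regression head on the final time step."""

    def __init__(self, feat_dim: int, hidden: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)   # regress a single score

    def forward(self, clips):                  # clips: (B, T, feat_dim)
        out, _ = self.lstm(clips)              # out: (B, T, 2 * hidden)
        return self.head(out[:, -1]).squeeze(-1)
```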
4.2. Results
4.2.1. Results of Different Variants
Spearman rank correlation on MIT-Skate and Fis-V (higher is better):

| Features | Method | MIT-Skate | Fis-V TES | Fis-V PCS | Fis-V Avg. |
|---|---|---|---|---|---|
| | Pose + DCT [10] | 0.350 ** | - | - | - |
| | C3D + LSTM [11] | 0.530 ** | - | - | - |
| | ConvISA [12] | 0.450 ** | - | - | - |
| | MSE + Ranking Loss [3] | 0.575 ** | - | - | - |
| | MS-LSTM [13] | 0.590 ** | 0.650 ** | 0.780 ** | 0.715 ** |
| | ACTION-NET [4] | 0.615 ** | 0.580 * | 0.794 * | 0.697 |
| C3D | Max + SVR | 0.480 | 0.470 | 0.610 | 0.540 |
| C3D | Avg + SVR | 0.420 | 0.400 | 0.590 | 0.495 |
| C3D | LSTM | 0.370 | 0.590 | 0.770 | 0.680 |
| C3D | Bi-LSTM | 0.580 | 0.560 | 0.730 | 0.645 |
| C3D | MLA-LSTM (ours) | 0.616 | 0.604 | 0.808 | 0.706 |
| I3D | Max + SVR | 0.442 | 0.547 | 0.695 | 0.621 |
| I3D | Avg + SVR | 0.531 | 0.558 | 0.703 | 0.631 |
| I3D | LSTM | 0.472 | 0.520 | 0.742 | 0.631 |
| I3D | Bi-LSTM | 0.587 | 0.629 | 0.705 | 0.667 |
| I3D | MS-LSTM | 0.628 | 0.656 | 0.809 | 0.733 |
| I3D | MLA-LSTM (ours) | 0.684 | 0.673 | 0.857 | 0.765 |
| R3D | Max + SVR | 0.560 | 0.557 | 0.669 | 0.613 |
| R3D | Avg + SVR | 0.526 | 0.482 | 0.606 | 0.544 |
| R3D | LSTM | 0.552 | 0.537 | 0.765 | 0.651 |
| R3D | Bi-LSTM | 0.593 | 0.554 | 0.745 | 0.650 |
| R3D | MS-LSTM | 0.621 | 0.648 | 0.808 | 0.728 |
| R3D | MLA-LSTM (ours) | 0.652 | 0.640 | 0.830 | 0.735 |
4.2.2. Results of the Spearman’s Rank Correlation
4.2.3. Results of the Mean Squared Error
4.2.4. Impact of Clip Length
4.2.5. Ablation Study on Multi-Scale Location Attention Strategy
4.2.6. Ablation Study on Attention Strategy
5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
1. Parmar, P.; Morris, B.T. What and how well you performed? A multitask learning approach to action quality assessment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 304–313.
2. Tang, Y.; Ni, Z.; Zhou, J.; Zhang, D.; Lu, J.; Wu, Y.; Zhou, J. Uncertainty-aware score distribution learning for action quality assessment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9839–9848.
3. Li, Y.; Chai, X.; Chen, X. End-to-end learning for action quality assessment. In Pacific Rim Conference on Multimedia; Springer: Cham, Switzerland, 2018; pp. 125–134.
4. Zeng, L.A.; Hong, F.T.; Zheng, W.S.; Yu, Q.Z.; Zeng, W.; Wang, Y.W.; Lai, J.H. Hybrid dynamic-static context-aware attention network for action assessment in long videos. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12 October 2020; pp. 2526–2534.
5. Chen, X.; Pang, A.; Yang, W.; Ma, Y.; Xu, L.; Yu, J. SportsCap: Monocular 3D human motion capture and fine-grained understanding in challenging sports videos. Int. J. Comput. Vis. 2021, 129, 2846–2864.
6. Zuo, K.; Su, X. Three-dimensional action recognition for basketball teaching coupled with deep neural network. Electronics 2022, 11, 3797.
7. Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4489–4497.
8. Carreira, J.; Zisserman, A. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6299–6308.
9. Hara, K.; Kataoka, H.; Satoh, Y. Learning spatio-temporal features with 3D residual networks for action recognition. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Venice, Italy, 22–29 October 2017; pp. 3154–3160.
10. Pirsiavash, H.; Vondrick, C.; Torralba, A. Assessing the quality of actions. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2014; pp. 556–571.
11. Parmar, P.; Tran Morris, B. Learning to score Olympic events. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 20–28.
12. Le, Q.V.; Zou, W.Y.; Yeung, S.Y.; Ng, A.Y. Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In Proceedings of the CVPR 2011, Colorado Springs, CO, USA, 20–25 June 2011; pp. 3361–3368.
13. Xu, C.; Fu, Y.; Zhang, B.; Chen, Z.; Jiang, Y.G.; Xue, X. Learning to score figure skating sport videos. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 4578–4590.
14. Feichtenhofer, C.; Fan, H.; Malik, J.; He, K. SlowFast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6202–6211.
15. Wang, L.; Tong, Z.; Ji, B.; Wu, G. TDN: Temporal difference networks for efficient action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 1895–1904.
16. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
17. Tran, D.; Wang, H.; Torresani, L.; Ray, J.; LeCun, Y.; Paluri, M. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6450–6459.
18. Roy, A.M.; Bhaduri, J.; Kumar, T.; Raj, K. WilDect-YOLO: An efficient and robust computer vision-based accurate object localization model for automated endangered wildlife detection. Ecol. Inform. 2022, 101919.
19. Sun, W.; Dai, L.; Zhang, X.; Chang, P.; He, X. RSOD: Real-time small object detection algorithm in UAV-based traffic monitoring. Appl. Intell. 2022, 52, 8448–8463.
20. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
21. Doughty, H.; Mayol-Cuevas, W.; Damen, D. The pros and cons: Rank-aware temporal attention for skill determination in long videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7862–7871.
22. Nakano, T.; Sakata, A.; Kishimoto, A. Estimating blink probability for highlight detection in figure skating videos. arXiv 2020, arXiv:2007.01089.
23. Lei, Q.; Zhang, H.; Du, J. Temporal attention learning for action quality assessment in sports video. Signal Image Video Process. 2021, 15, 1575–1583.
24. Xu, A.; Zeng, L.A.; Zheng, W.S. Likert scoring with grade decoupling for long-term action assessment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–20 June 2022; pp. 3232–3241.
25. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning, PMLR, Lille, France, 7–9 July 2015; pp. 448–456.
26. Karpathy, A.; Toderici, G.; Shetty, S.; Leung, T.; Sukthankar, R.; Fei-Fei, L. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1725–1732.
27. Kay, W.; Carreira, J.; Simonyan, K.; Zhang, B.; Hillier, C.; Vijayanarasimhan, S.; Viola, F.; Green, T.; Back, T.; Natsev, P.; et al. The Kinetics human action video dataset. arXiv 2017, arXiv:1705.06950.
28. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626.
29. Sahoo, J.P.; Prakash, A.J.; Pławiak, P.; Samantray, S. Real-time hand gesture recognition using fine-tuned convolutional neural network. Sensors 2022, 22, 706.
30. Shao, D.; Zhao, Y.; Dai, B.; Lin, D. FineGym: A hierarchical video dataset for fine-grained action understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2616–2625.
Mean squared error on Fis-V (lower is better):

| Features | Method | TES | PCS |
|---|---|---|---|
| | MS-LSTM [13] | 19.91 | 8.35 |
| | ACTION-NET [4] | 26.35 * | 7.62 * |
| C3D | Max + SVR | 27.42 | 13.98 |
| C3D | Avg + SVR | 30.25 | 15.96 |
| C3D | LSTM | 22.96 | 8.70 |
| C3D | Bi-LSTM | 23.80 | 10.36 |
| C3D | MLA-LSTM | 24.09 | 9.17 |
| I3D | Max + SVR | 24.55 | 13.02 |
| I3D | Avg + SVR | 25.46 | 12.78 |
| I3D | LSTM | 30.626 | 14.62 |
| I3D | Bi-LSTM | 24.99 | 13.54 |
| I3D | MS-LSTM | 23.87 | 10.70 |
| I3D | MLA-LSTM | 19.07 | 6.63 |
| R3D | Max + SVR | 28.06 | 14.25 |
| R3D | Avg + SVR | 31.22 | 16.82 |
| R3D | LSTM | 30.01 | 12.18 |
| R3D | Bi-LSTM | 27.51 | 12.55 |
| R3D | MS-LSTM | 24.05 | 9.32 |
| R3D | MLA-LSTM | 21.46 | 8.62 |
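For reference, the two evaluation metrics used throughout Section 4.2, Spearman's rank correlation (Section 4.2.2) and mean squared error (Section 4.2.3), can be computed as in the following minimal sketch; the helper name and array layout are ours.

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate(pred: np.ndarray, gt: np.ndarray):
    """Spearman rank correlation (higher is better) and mean squared
    error (lower is better) between predicted and judge scores."""
    rho, _ = spearmanr(pred, gt)               # rank correlation, ignores p-value
    mse = float(np.mean((pred - gt) ** 2))     # mean squared error
    return rho, mse
```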
Clip Length\Model | ACTION-NET * | MLA-LSTM |
---|---|---|
5 | 0.603 | 0.612 |
8 | 0.606 | 0.625 |
16 | 0.612 | 0.654 |
32 | 0.630 | 0.671 |
| Model | Correlation (TES) | Correlation (PCS) | MSE (TES) | MSE (PCS) |
|---|---|---|---|---|
| MS-LSTM | 0.626 | 0.809 | 23.87 | 10.70 |
| MLA-LSTM | 0.673 | 0.857 | 22.07 | 6.63 |
| LA-M-LSTM | 0.646 | 0.824 | 24.86 | 9.39 |
| IAFF-MS-LSTM | 0.586 | 0.807 | 27.13 | 12.62 |
| IAFF-M-LSTM | 0.601 | 0.803 | 26.35 | 12.10 |
Model | Correlation | MSE |
---|---|---|
MS-LSTM | 0.618 | 127.42 |
MLA-LSTM | 0.684 | 112.46 |
LA-M-LSTM | 0.625 | 124.84 |
IAFF-MS-LSTM | 0.660 | 118.66 |
IAFF-M-LSTM | 0.642 | 120.81 |