An Attention-Refined Light-Weight High-Resolution Network for Macaque Monkey Pose Estimation
Abstract
1. Introduction
- A multi-branch parallel architecture maintains high-resolution representations throughout the network and fuses rich semantic information across branches of different resolutions (a minimal fusion sketch follows this list).
- A new basic block for the backbone is designed, combining a transformer-style structure with polarized self-attention (PSA), which captures global attention with few parameters.
- To strike a balance between model weight and performance, we propose an attention-refined module (ARM) based on asymmetric convolutions and triplet attention, which adds few parameters.
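As a rough illustration of the first bullet, the sketch below shows HRNet-style multi-resolution fusion, where every output branch aggregates all input branches via upsampling or strided convolution. This is a hedged sketch, not the paper's implementation; `FuseLayer` and its arguments are illustrative names.

```python
# Hedged sketch (not the paper's code): HRNet-style multi-resolution fusion.
# Each output branch sums contributions from every input branch, upsampling
# lower-resolution inputs and downsampling higher-resolution ones with
# strided convolutions.
import torch.nn as nn
import torch.nn.functional as F

class FuseLayer(nn.Module):
    def __init__(self, channels):
        super().__init__()
        n = len(channels)
        self.proj = nn.ModuleList()
        for i in range(n):                      # target branch i
            row = nn.ModuleList()
            for j in range(n):                  # source branch j
                if j == i:
                    row.append(nn.Identity())
                elif j > i:
                    # lower resolution -> higher: 1x1 conv, upsampled later
                    row.append(nn.Conv2d(channels[j], channels[i], 1, bias=False))
                else:
                    # higher resolution -> lower: one strided 3x3 conv per halving
                    convs, c = [], channels[j]
                    for _ in range(i - j):
                        convs.append(nn.Conv2d(c, channels[i], 3, stride=2,
                                               padding=1, bias=False))
                        c = channels[i]
                    row.append(nn.Sequential(*convs))
            self.proj.append(row)

    def forward(self, xs):
        outs = []
        for i in range(len(xs)):
            acc = 0
            for j, x in enumerate(xs):
                y = self.proj[i][j](x)
                if j > i:                       # match target spatial size
                    y = F.interpolate(y, size=xs[i].shape[-2:], mode="nearest")
                acc = acc + y
            outs.append(F.relu(acc))
        return outs
```

For example, `FuseLayer([32, 64, 128])` fuses three branches whose resolutions differ by successive factors of two.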
2. Related Work
2.1. Two-Stage Paradigm
2.2. Vision Transformer
2.3. Light-Weight Network
3. Methods
3.1. Unbiased Data Processing
3.2. High-Resolution Network
3.3. Polarized Self-Attention
Algorithm 1: Polarized self-attention (pseudocode box not recovered in extraction; the input X is processed by a channel-only and a spatial-only attention branch).
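Since the pseudocode above did not survive extraction, here is a minimal PyTorch sketch of parallel polarized self-attention in the spirit of Liu et al.'s PSA: a channel-only branch that collapses the spatial axis and a spatial-only branch that collapses the channel axis, composed by summation. Layer names and the internal reduction ratio of 2 are assumptions, not taken from this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolarizedSelfAttention(nn.Module):
    """Sketch of parallel PSA: channel-only + spatial-only attention."""
    def __init__(self, channels: int):
        super().__init__()
        mid = channels // 2                       # assumed reduction ratio of 2
        # channel-only branch
        self.ch_q = nn.Conv2d(channels, 1, 1)     # query: one spatial map
        self.ch_v = nn.Conv2d(channels, mid, 1)   # value
        self.ch_z = nn.Conv2d(mid, channels, 1)   # re-expand to C
        self.ch_norm = nn.LayerNorm(channels)
        # spatial-only branch
        self.sp_q = nn.Conv2d(channels, mid, 1)
        self.sp_v = nn.Conv2d(channels, mid, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        mid = c // 2
        # channel-only attention: softmax over space, sigmoid over channels
        q = F.softmax(self.ch_q(x).view(b, 1, h * w), dim=-1)
        v = self.ch_v(x).view(b, mid, h * w)
        z = torch.bmm(v, q.transpose(1, 2)).view(b, mid, 1, 1)
        z = self.ch_z(z).view(b, c)
        x_ch = x * torch.sigmoid(self.ch_norm(z)).view(b, c, 1, 1)
        # spatial-only attention: softmax over channels, sigmoid over space
        q = F.softmax(self.sp_q(x).mean(dim=(2, 3)), dim=1).unsqueeze(1)
        v = self.sp_v(x).view(b, mid, h * w)
        x_sp = x * torch.sigmoid(torch.bmm(q, v)).view(b, 1, h, w)
        return x_ch + x_sp                        # parallel composition
```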
3.4. Attention Refined Block
3.4.1. Asymmetric Convolution
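This subsection names asymmetric convolution, presumably in the style of ACNet: at training time, parallel 3×3, 1×3, and 3×1 branches (each with its own batch norm) are summed, and the three kernels can later be fused into a single 3×3 for inference at no extra cost. A minimal sketch under that assumption, with illustrative names:

```python
import torch.nn as nn

def _branch(in_ch, out_ch, kernel, pad):
    # conv + per-branch batch norm, as in ACNet's training-time form
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel, padding=pad, bias=False),
        nn.BatchNorm2d(out_ch),
    )

class AsymmetricConvBlock(nn.Module):
    """Sum of square, horizontal, and vertical convolution branches."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.square = _branch(in_ch, out_ch, (3, 3), (1, 1))
        self.hor = _branch(in_ch, out_ch, (1, 3), (0, 1))
        self.ver = _branch(in_ch, out_ch, (3, 1), (1, 0))

    def forward(self, x):
        return self.square(x) + self.hor(x) + self.ver(x)
```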
3.4.2. Triplet Attention
Algorithm 2: Triplet unit (pseudocode box not recovered in extraction).
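The unit's steps were lost in extraction. As a hedged substitute, a triplet unit in the published triplet-attention design is a Z-pool (channel-wise max and mean, concatenated) followed by a 7×7 convolution, batch norm, and a sigmoid gate applied back to the input. A sketch with assumed names:

```python
import torch
import torch.nn as nn

class ZPool(nn.Module):
    """Concatenate max- and mean-pooling along dim 1 into two maps."""
    def forward(self, x):
        return torch.cat([x.max(dim=1, keepdim=True).values,
                          x.mean(dim=1, keepdim=True)], dim=1)

class TripletUnit(nn.Module):
    """Z-pool -> 7x7 conv -> BN -> sigmoid gate on the input."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.pool = ZPool()
        self.conv = nn.Conv2d(2, 1, kernel_size,
                              padding=kernel_size // 2, bias=False)
        self.bn = nn.BatchNorm2d(1)

    def forward(self, x):
        gate = torch.sigmoid(self.bn(self.conv(self.pool(x))))
        return x * gate
```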
Algorithm 3: Triplet attention (pseudocode box not recovered in extraction).
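Continuing the sketch above (again standing in for the unrecovered pseudocode), full triplet attention runs three such units: two on rotated views of the tensor, capturing cross-dimension (C, W) and (H, C) interactions, and one on the plain (H, W) view, averaging the three outputs:

```python
# Continues the TripletUnit sketch above.
class TripletAttention(nn.Module):
    """Three TripletUnit branches over rotated views, averaged."""
    def __init__(self):
        super().__init__()
        self.branch_cw = TripletUnit()   # attends over the (C, W) plane
        self.branch_hc = TripletUnit()   # attends over the (H, C) plane
        self.branch_hw = TripletUnit()   # ordinary spatial attention

    def forward(self, x):
        # swap C and H, gate, swap back
        a = self.branch_cw(x.permute(0, 2, 1, 3).contiguous()).permute(0, 2, 1, 3)
        # swap C and W, gate, swap back
        b = self.branch_hc(x.permute(0, 3, 2, 1).contiguous()).permute(0, 3, 2, 1)
        c = self.branch_hw(x)            # no rotation
        return (a + b + c) / 3.0         # average the three branches
```

Averaging rather than summing keeps the output on the same scale as a single gated branch.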
4. Experiments
4.1. Dataset
4.2. Evaluation Metrics
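OKS and KS in the abbreviations refer to the standard COCO-style keypoint metric, restated here for reference (the per-keypoint falloff constants k_i are dataset-specific and not reproduced from the paper):

```latex
% Object keypoint similarity: d_i is the distance between predicted and
% ground-truth keypoint i, s the object scale, k_i a per-keypoint constant,
% and v_i the visibility flag; the sums run over labeled keypoints only.
\mathrm{OKS} = \frac{\sum_i \exp\!\left(-d_i^2 / (2 s^2 k_i^2)\right)\,\delta(v_i > 0)}
                    {\sum_i \delta(v_i > 0)}
```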
4.3. Experimental Settings
5. Results and Discussion
5.1. Experimental Results and Discussion
5.2. Ablation Experiment
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
UDP | unbiased data processing |
W-MSA | window-based multi-head self-attention |
PVT | pyramid vision transformer |
FFN-DW | feed-forward network with a depth-wise convolution |
PSA | polarized self-attention |
ARM | attention-refined module |
NLP | natural language processing |
HRFormer | high-resolution transformer |
HRNet | high-resolution network |
ViT | vision transformer |
OKS | object keypoint similarity |
KS | keypoint similarity |
Table: network configuration by resolution (Res.) across Stage 1, Stage 2, Stage 3, and the ARM (table body not recovered in extraction).
Keypoint | Definition | Keypoint | Definition |
---|---|---|---|
1 | Nose | 10 | Left wrist |
2 | Left eye | 11 | Right wrist |
3 | Right eye | 12 | Left hip |
4 | Left ear | 13 | Right hip |
5 | Right ear | 14 | Left knee |
6 | Left shoulder | 15 | Right knee |
7 | Right shoulder | 16 | Left ankle |
8 | Left elbow | 17 | Right ankle |
9 | Right elbow | | |
Parameters | Setting |
---|---|
Max epoch | 210 |
Base learning rate | |
Warm-up steps | |
Optimizer | Adam |
Batch size | 64 |
Loss function | Joint MSE Loss |
Methods | Input Size | #param. | GFLOPs | AP | AP^50 | AP^75 | AP^M | AR |
---|---|---|---|---|---|---|---|---|
ShuffleNetV2 | 256 × 192 | 1.25M | 0.14 | 69.2 | 89.5 | 80.5 | 69.7 | 73.0 |
MobileNetV2 | 256 × 192 | 2.22M | 0.31 | 71.6 | 89.5 | 82.0 | 71.9 | 75.0 |
SimpleBaseline-18 | 256 × 192 | 11.17M | 1.78 | 72.9 | 89.6 | 82.9 | 73.4 | 76.3 |
PVT | 256 × 192 | 0.86M | 0.10 | 72.6 | 89.4 | 83.9 | 72.9 | 76.1 |
Lite-HRNet | 256 × 192 | 1.76M | 0.31 | 74.2 | 89.5 | 84.9 | 74.3 | 77.4 |
HRFormer-Tiny | 256 × 192 | 2.49M | 1.39 | 75.2 | 89.4 | 85.8 | 75.7 | 78.9 |
HR-MPE (ours) | 256 × 192 | 2.40M | 1.92 | 77.0 | 89.5 | 86.2 | 77.2 | 80.0 |
No. | Baseline | Attention Refined Block | PSA | UDP | #param. (M) | GFLOPs | AP | AR |
---|---|---|---|---|---|---|---|---|
1 | √ | | | | 2.197 | 1.803 | 74.6 | 78.1 |
2 | √ | √ | | | 2.223 | 1.884 | 75.1 | 78.5 |
3 | √ | √ | √ | | 2.407 | 1.928 | 76.1 | 79.4 |
4 | √ | √ | √ | √ | 2.407 | 1.928 | 77.0 | 80.0 |