MPE-HRNetL: A Lightweight High-Resolution Network for Multispecies Animal Pose Estimation
Abstract
1. Introduction
- We propose SPPF+. Compared with SPPF, SPPF+ achieves stronger feature extraction at a lower computational load (a minimal sketch of the SPPF baseline follows this list).
- We propose the MPM and DSCA modules. Based on MPM and DSCA, we design the MFE block, which possesses strong feature extraction capabilities.
- We design the FE stage and use it as the final stage of MPE-HRNet. This stage emphasizes the semantic features in the output feature maps of MPE-HRNet.
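To ground the first contribution, below is a minimal PyTorch sketch of the standard SPPF module (Spatial Pyramid Pooling-Fast, introduced in YOLOv5) that SPPF+ extends. The hidden-channel ratio, normalization, and activation here are illustrative assumptions; the actual SPPF+ modifications are described in Section 3.2 and are not reproduced here.

```python
import torch
import torch.nn as nn


class SPPF(nn.Module):
    """Spatial Pyramid Pooling-Fast (the YOLOv5 baseline that SPPF+ builds on).

    Three serial 5x5 max-pools reproduce the receptive fields of the original
    SPP's parallel 5/9/13 pools at a lower cost; the pooled maps are
    concatenated and fused by a 1x1 convolution.
    """

    def __init__(self, in_channels: int, out_channels: int, pool_size: int = 5):
        super().__init__()
        hidden = in_channels // 2  # assumed reduction ratio
        self.reduce = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
        )
        self.pool = nn.MaxPool2d(pool_size, stride=1, padding=pool_size // 2)
        self.fuse = nn.Sequential(
            nn.Conv2d(hidden * 4, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.reduce(x)
        y1 = self.pool(x)   # 5x5 receptive field
        y2 = self.pool(y1)  # ~9x9 receptive field
        y3 = self.pool(y2)  # ~13x13 receptive field
        return self.fuse(torch.cat([x, y1, y2, y3], dim=1))
```

Because the pools are stride-1 with matching padding, the module preserves spatial resolution, so it can sit in front of a stage without disturbing the multi-branch layout described in Section 3.1.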
2. Related Works
2.1. Animal Pose Estimation
2.2. Attention Mechanism
2.3. Spatial Pyramid Pooling
3. Proposed Algorithm
3.1. Architecture of MPE-HRNet
3.2. SPPF+
3.3. Mixed Feature Extraction Block
3.3.1. Weight Generation Module
3.3.2. Dual Spatial and Channel Attention Module
3.3.3. Mixed Feature Extraction Block
3.4. Feature Enhancement Stage
4. Experiments
4.1. Experimental Settings
4.2. Ablation Experiments
4.2.1. Selection of SPPF+ and SPPF
4.2.2. Effectiveness Study of All the Improvements
4.3. Performance Comparison
4.3.1. Comparison on AP-10K Dataset
4.3.2. Comparison on Animal Pose Dataset
4.4. Limitation Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Harding, E.J.; Paul, E.S.; Mendl, M. Cognitive bias and affective state. Nature 2004, 427, 312.
- Zuffi, S.; Kanazawa, A.; Berger-Wolf, T.; Black, M.J. Three-D Safari: Learning to Estimate Zebra Pose, Shape, and Texture from Images “In the Wild”. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5359–5368.
- Anderson, T.L.; Donath, M. Animal behavior as a paradigm for developing robot autonomy. Robot. Auton. Syst. 1990, 6, 145–168.
- Jiang, L.; Lee, C.; Teotia, D.; Ostadabbas, S. Animal pose estimation: A closer look at the state-of-the-art, existing gaps and opportunities. Comput. Vis. Image Underst. 2022, 222, 103483.
- Cheng, B.; Xiao, B.; Wang, J.; Shi, H.; Huang, T.S.; Zhang, L. HigherHRNet: Scale-aware representation learning for bottom-up human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 5386–5395.
- Chao, W.; Duan, F.; Du, P.; Zhu, W.; Jia, T.; Li, D. DEKRV2: More accurate or fast than DEKR. In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP), Bordeaux, France, 16–19 October 2022; pp. 1451–1455.
- Xiao, B.; Wu, H.; Wei, Y. Simple baselines for human pose estimation and tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 466–481.
- Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 5693–5703.
- Yu, C.; Xiao, B.; Gao, C.; Yuan, L.; Zhang, L.; Sang, N.; Wang, J. Lite-HRNet: A lightweight high-resolution network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 10440–10450.
- Li, C.; Lee, G.H. From synthetic to real: Unsupervised domain adaptation for animal pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 1482–1491.
- Cao, Z.; Simon, T.; Wei, S.E.; Sheikh, Y. Realtime multi-person 2D pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7291–7299.
- Newell, A.; Yang, K.; Deng, J. Stacked hourglass networks for human pose estimation. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; pp. 483–499.
- Wang, Z.; Zhou, S.; Yin, P.; Xu, A.; Ye, J. GANPose: Pose estimation of grouped pigs using a generative adversarial network. Comput. Electron. Agric. 2023, 212, 108119.
- Fan, Q.; Liu, S.; Li, S.; Zhao, C. Bottom-up cattle pose estimation via concise multi-branch network. Comput. Electron. Agric. 2023, 211, 107945.
- He, R.; Wang, X.; Chen, H.; Liu, C. VHR-BirdPose: Vision Transformer-Based HRNet for Bird Pose Estimation with Attention Mechanism. Electronics 2023, 12, 3643.
- Zhou, F.; Jiang, Z.; Liu, Z.; Chen, F.; Chen, L.; Tong, L.; Yang, Z.; Wang, H.; Fei, M.; Li, L.; et al. Structured context enhancement network for mouse pose estimation. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 2787–2801.
- Graving, J.M.; Chae, D.; Naik, H.; Li, L.; Koger, B.; Costelloe, B.R.; Couzin, I.D. DeepPoseKit, a software toolkit for fast and robust animal pose estimation using deep learning. eLife 2019, 8, e47994.
- Zhao, S.; Bai, Z.; Meng, L.; Han, G.; Duan, E. Pose Estimation and Behavior Classification of Jinling White Duck Based on Improved HRNet. Animals 2023, 13, 2878.
- Gong, Z.; Zhang, Y.; Lu, D.; Wu, T. Vision-Based Quadruped Pose Estimation and Gait Parameter Extraction Method. Electronics 2022, 11, 3702.
- Zhang, X.; Zhou, X.; Lin, M.; Sun, J. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6848–6856.
- Maselyne, J.; Adriaens, I.; Huybrechts, T.; De Ketelaere, B.; Millet, S.; Vangeyte, J.; Van Nuffel, A.; Saeys, W. Measuring the drinking behaviour of individual pigs housed in group using radio frequency identification (RFID). Animal 2016, 10, 1557–1566.
- Liu, S.; Fan, Q.; Liu, S.; Zhao, C. DepthFormer: A High-Resolution Depth-Wise Transformer for Animal Pose Estimation. Agriculture 2022, 12, 1280.
- Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708.
- Sun, K.; Zhao, Y.; Jiang, B.; Cheng, T.; Xiao, B.; Liu, D.; Mu, Y.; Wang, X.; Liu, W.; Wang, J. High-resolution representations for labeling pixels and regions. arXiv 2019, arXiv:1904.04514.
- Xu, Y.; Zhang, J.; Zhang, Q.; Tao, D. ViTPose: Simple vision transformer baselines for human pose estimation. Adv. Neural Inf. Process. Syst. 2022, 35, 38571–38584.
- Liao, J.; Xu, J.; Shen, Y.; Lin, S. THANet: Transferring Human Pose Estimation to Animal Pose Estimation. Electronics 2023, 12, 4210.
- Hu, X.; Liu, C. Animal Pose Estimation Based on Contrastive Learning with Dynamic Conditional Prompts. Animals 2024, 14, 1712.
- Zeng, X.; Zhang, J.; Zhu, Z.; Guo, D. MVCRNet: A Semi-Supervised Multi-View Framework for Robust Animal Pose Estimation with Minimal Labeled Data. 2024; preprint.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141.
- Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542.
- Qin, Z.; Zhang, P.; Wu, F.; Li, X. FcaNet: Frequency channel attention networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 783–792.
- Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1097–1105.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916.
- Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848.
- Jocher, G. YOLOv5 by Ultralytics. 2020. Available online: https://zenodo.org/records/7347926 (accessed on 11 April 2024).
- Wang, J.; Chen, K.; Xu, R.; Liu, Z.; Loy, C.C.; Lin, D. CARAFE: Content-aware reassembly of features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3007–3016.
- Yu, H.; Xu, Y.; Zhang, J.; Zhao, W.; Guan, Z.; Tao, D. AP-10K: A benchmark for animal pose estimation in the wild. arXiv 2021, arXiv:2108.12617.
- Cao, J.; Tang, H.; Fang, H.S.; Shen, X.; Lu, C.; Tai, Y.W. Cross-Domain Adaptation for Animal Pose Estimation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019.
- Jiang, T.; Lu, P.; Zhang, L.; Ma, N.; Han, R.; Lyu, C.; Li, Y.; Chen, K. RTMPose: Real-time multi-person pose estimation based on MMPose. arXiv 2023, arXiv:2303.07399.
- Ma, N.; Zhang, X.; Zheng, H.T.; Sun, J. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 116–131.
- Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4510–4520.
- Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for MobileNetV3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324.
- Li, Q.; Zhang, Z.; Xiao, F.; Zhang, F.; Bhanu, B. Dite-HRNet: Dynamic lightweight high-resolution network for human pose estimation. arXiv 2022, arXiv:2204.10762.
- Zhang, F.; Zhu, X.; Dai, H.; Ye, M.; Zhu, C. Distribution-aware coordinate representation for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 7093–7102.
| Stages | Output Size | Operations | Resolution Branches | Output Channels | Repeat | Num. Modules |
|---|---|---|---|---|---|---|
| stem | 64 × 64 | conv2d | 2× | 32 | 1 | 1 |
| | | shuffle block | 4× | 32 | 1 | 1 |
| 2 | 64 × 64 | SPPF | 4×, 8× | 40, 80 | 1 | 1 |
| | | MFE block | 4×, 8× | 40, 80 | 2 | 2 |
| | | fusion block | 4×, 8× | 40, 80 | 1 | |
| 3 | 64 × 64 | SPPF | 4×, 8×, 16× | 40, 80, 160 | 1 | 1 |
| | | MFE block | 4×, 8×, 16× | 40, 80, 160 | 2 | 2 |
| | | fusion block | 4×, 8×, 16× | 40, 80, 160 | 1 | |
| 4 | 64 × 64 | SPPF | 4×, 8×, 16×, 32× | 40, 80, 160, 320 | 1 | 1 |
| | | MFE block | 4×, 8×, 16×, 32× | 40, 80, 160, 320 | 2 | 2 |
| | | fusion block | 4×, 8×, 16×, 32× | 40, 80, 160, 320 | 1 | |
| FE | 64 × 64 | upsampling | 4× | 80 | 1 | 1 |
| | | shuffle block | 4× | 80 | 1 | |
| | | DConv | 4× | 40 | 1 | |
| | | ECA | 4× | 40 | 1 | |
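The FE rows at the bottom of the table list upsampling, a shuffle block, a depthwise convolution (DConv), and ECA. As a rough, hypothetical rendering of that pipeline (the shuffle block is omitted for brevity, and all layer hyperparameters are assumptions rather than values taken from the paper):

```python
import torch
import torch.nn as nn


class ECA(nn.Module):
    """Efficient Channel Attention (Wang et al.): global average pooling,
    a 1-D convolution over the channel descriptor, and a sigmoid gate;
    adds almost no parameters."""

    def __init__(self, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = x.mean(dim=(2, 3))                    # (N, C) channel descriptor
        w = self.conv(w.unsqueeze(1)).squeeze(1)  # local cross-channel interaction
        return x * torch.sigmoid(w)[:, :, None, None]


class FEStageSketch(nn.Module):
    """Hypothetical FE-stage pipeline: upsample a lower-resolution branch to
    the 4x branch, reduce 80 -> 40 channels with a depthwise-separable
    convolution, then reweight channels with ECA."""

    def __init__(self, in_channels: int = 80, out_channels: int = 40):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.dconv = nn.Sequential(
            # depthwise 3x3
            nn.Conv2d(in_channels, in_channels, 3, padding=1,
                      groups=in_channels, bias=False),
            nn.BatchNorm2d(in_channels),
            # pointwise 1x1 to the output width
            nn.Conv2d(in_channels, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )
        self.eca = ECA()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.eca(self.dconv(self.up(x)))
```

The intent mirrors Section 3.4: a cheap depthwise convolution plus near-parameter-free channel attention lets the final stage sharpen semantic content in the high-resolution map without undoing the lightweight design.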
| Keypoint | Definition | Keypoint | Definition |
|---|---|---|---|
| 0 | Left Eye | 9 | Right Elbow |
| 1 | Right Eye | 10 | Right Front Paw |
| 2 | Nose | 11 | Left Hip |
| 3 | Neck | 12 | Left Knee |
| 4 | Root of Tail | 13 | Left Back Paw |
| 5 | Left Shoulder | 14 | Right Hip |
| 6 | Left Elbow | 15 | Right Knee |
| 7 | Left Front Paw | 16 | Right Back Paw |
| 8 | Right Shoulder | | |
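The AP and AR figures in the tables below follow the COCO-style protocol used by AP-10K: predictions are scored against these 17 keypoints with Object Keypoint Similarity (OKS). A minimal sketch, where the per-keypoint falloff constants k are a dataset-specific assumption rather than values from the paper:

```python
import numpy as np


def oks(pred: np.ndarray, gt: np.ndarray, visible: np.ndarray,
        area: float, k: np.ndarray) -> float:
    """Object Keypoint Similarity for one animal instance.

    pred, gt : (17, 2) predicted / ground-truth keypoint coordinates
    visible  : (17,) boolean mask of labeled keypoints (assumes >= 1 True)
    area     : object area in pixels^2 (the scale term s^2)
    k        : (17,) per-keypoint falloff constants
    """
    d2 = np.sum((pred - gt) ** 2, axis=1)  # squared pixel distances
    e = d2 / (2.0 * area * k ** 2 + np.finfo(float).eps)
    return float(np.mean(np.exp(-e)[visible]))


# Hypothetical usage: uniform falloff constants, a 120x120-pixel animal.
# k = np.full(17, 0.05)
# score = oks(pred, gt, vis, area=120.0 ** 2, k=k)
```

In the comparison tables, AP averages precision over OKS thresholds 0.50:0.05:0.95, AP50 and AP75 are precision at OKS thresholds of 0.50 and 0.75, and APM/APL restrict evaluation to medium and large instances.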
| Models | AP | AR | Params (M) | FLOPs (G) |
|---|---|---|---|---|
| Lite-HRNet | 59.8 | 64.9 | 1.13 | 0.35 |
| +SPPF | 61.2 | 65.9 | 1.60 | 0.63 |
| +SPPF & SPPF+ | 61.3 | 66.0 | 1.58 | 0.63 |
| +SPPF+ & SPPF | 62.0 | 66.3 | 1.56 | 0.62 |
| +SPPF+ | 61.2 | 66.0 | 1.46 | 0.62 |
| Models | MFE-L (all) | MFE-L & MFE-H | FE Stage | SPPF+ & SPPF | AP | AR | Params (M) | FLOPs (G) |
|---|---|---|---|---|---|---|---|---|
| model0 | | | | | 59.8 | 64.9 | 1.13 | 0.35 |
| model1 | ✓ | | | | 60.1 | 65.0 | 1.13 | 0.36 |
| model2 | | ✓ | | | 60.4 | 65.2 | 1.17 | 0.44 |
| model3 | | ✓ | ✓ | | 61.0 | 65.6 | 1.25 | 0.58 |
| model4 | | ✓ | ✓ | ✓ | 62.0 | 66.3 | 1.56 | 0.62 |
| Methods | AP | AP50 | AP75 | APM | APL | AR | Inference Time (ms) | Params (M) | FLOPs (G) |
|---|---|---|---|---|---|---|---|---|---|
| CSPNeXt-t | 56.7 | 87.1 | 58.2 | 39.2 | 57.0 | 61.5 | 9.74 | 6.02 | 1.91 |
| CSPNeXt-s | 60.5 | 88.7 | 63.2 | 47.3 | 60.7 | 64.8 | 11.23 | 8.58 | 2.36 |
| DARK | 60.0 | 89.1 | 63.2 | 46.2 | 60.3 | 64.7 | 27.27 | 1.13 | 0.35 |
| Dite-HRNet | 59.7 | 89.1 | 61.9 | 48.0 | 60.0 | 64.8 | 33.61 | 1.19 | 0.33 |
| Lite-HRNet | 59.8 | 88.9 | 63.2 | 51.5 | 60.1 | 64.9 | 27.39 | 1.13 | 0.35 |
| ShuffleNetV1 | 53.5 | 84.9 | 54.0 | 40.9 | 53.8 | 58.9 | 13.61 | 6.94 | 1.80 |
| ShuffleNetV2 | 53.2 | 83.9 | 53.0 | 36.3 | 53.6 | 58.7 | 12.10 | 7.55 | 1.83 |
| MobileNetV2 | 55.6 | 86.8 | 57.5 | 41.7 | 55.9 | 61.2 | 12.27 | 9.57 | 2.12 |
| MobileNetV3 | 56.5 | 87.3 | 58.9 | 48.8 | 56.8 | 61.3 | 13.93 | 5.24 | 1.73 |
| MPE-HRNet | 62.0 | 90.1 | 65.6 | 53.9 | 62.3 | 66.3 | 29.01 | 1.56 | 0.62 |
| Methods | AP | AP50 | AP75 | APM | APL | AR | Inference Time (ms) | Params (M) | FLOPs (G) |
|---|---|---|---|---|---|---|---|---|---|
| CSPNeXt-t | 55.4 | 86.2 | 58.4 | 46.0 | 55.9 | 60.9 | 10.29 | 6.02 | 1.91 |
| CSPNeXt-s | 58.9 | 87.7 | 62.5 | 49.2 | 59.5 | 63.8 | 11.84 | 8.58 | 2.36 |
| DARK | 58.6 | 87.2 | 61.0 | 52.1 | 59.1 | 64.0 | 30.02 | 1.13 | 0.35 |
| Dite-HRNet | 58.6 | 87.2 | 62.1 | 51.3 | 59.1 | 64.2 | 36.17 | 1.19 | 0.33 |
| Lite-HRNet | 58.7 | 87.4 | 61.9 | 48.3 | 59.4 | 64.0 | 28.14 | 1.13 | 0.35 |
| ShuffleNetV1 | 52.1 | 83.6 | 52.5 | 42.0 | 52.7 | 57.9 | 13.24 | 6.94 | 1.80 |
| ShuffleNetV2 | 51.6 | 83.4 | 51.0 | 45.0 | 52.2 | 57.6 | 12.21 | 7.55 | 1.83 |
| MobileNetV2 | 55.1 | 85.8 | 56.1 | 41.6 | 55.7 | 60.7 | 12.17 | 9.57 | 2.12 |
| MobileNetV3 | 55.1 | 86.1 | 57.3 | 47.6 | 55.6 | 60.5 | 13.56 | 5.24 | 1.73 |
| MPE-HRNet | 60.0 | 88.7 | 62.9 | 51.4 | 60.5 | 65.0 | 28.03 | 1.56 | 0.62 |
| Methods | AP | AP50 | AR | Params (M) | FLOPs (G) |
|---|---|---|---|---|---|
| ResNet-101 | 68.1 | 93.8 | 77.4 | 52.99 | 12.13 |
| HRNet-W32 | 72.2 | 94.2 | 77.0 | 28.54 | 10.25 |
| RFB-HRNet | 75.0 | 95.8 | 78.1 | 63.97 | 22.61 |
| Ours | 62.0 | 90.1 | 66.3 | 1.56 | 0.62 |
| Methods | AP | AP50 | AP75 | APM | APL | AR | Inference Time (ms) | Params (M) | FLOPs (G) |
|---|---|---|---|---|---|---|---|---|---|
| CSPNeXt-s | 61.1 | 90.3 | 68.4 | 61.2 | 61.3 | 65.9 | 8.67 | 8.58 | 2.36 |
| DARK | 63.3 | 90.5 | 70.7 | 65.0 | 63.2 | 68.5 | 27.47 | 1.13 | 0.35 |
| Dite-HRNet | 62.0 | 90.1 | 67.0 | 63.3 | 62.0 | 67.3 | 30.10 | 1.19 | 0.33 |
| Lite-HRNet | 62.2 | 89.4 | 68.3 | 64.4 | 62.1 | 67.5 | 26.47 | 1.13 | 0.35 |
| ShuffleNetV2 | 56.5 | 86.4 | 61.8 | 59.5 | 56.0 | 61.9 | 12.55 | 7.55 | 1.83 |
| MobileNetV2 | 58.7 | 88.1 | 64.6 | 63.1 | 58.3 | 64.0 | 12.07 | 9.57 | 2.12 |
| MPE-HRNet | 64.9 | 90.5 | 72.3 | 65.9 | 64.9 | 69.8 | 28.17 | 1.56 | 0.62 |