# Lightweight Architecture for Real-Time Hand Pose Estimation with Deep Supervision


## Abstract


## 1. Introduction

- We investigate the less-studied problem of hourglass network efficiency. Unlike most previous research, we focus on the model's inference cost as well as its accuracy.
- We treat the hourglass network as a variant of U-Net and propose a new training strategy based on deep supervision. It yields better generalization and enables finer-grained pruning.
- We apply the two-level pruning strategy on the public NYU and ICVL datasets and achieve accuracy competitive with previous state-of-the-art approaches. We also extensively examine the redundancy of hand pose estimation architectures and find significant benefits from deep supervision. Finally, we deploy our model with the OpenVINO toolkit to accelerate inference on CPUs, where it runs faster than on GPUs.
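To make the deep-supervision idea in the second contribution concrete, here is a minimal sketch of a training loss with a supervised term at every decoder level. The function names, uniform level weights, and toy heat maps are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def mse(pred, target):
    """Mean squared error between a predicted and a ground-truth heat map."""
    return float(np.mean((pred - target) ** 2))

def deeply_supervised_loss(decoder_outputs, target, weights=None):
    """Total loss with a supervision term at every decoder level, so shallow
    decoders receive a direct gradient signal instead of relying only on the
    final output. `decoder_outputs` lists one prediction per decoder level."""
    if weights is None:
        weights = [1.0] * len(decoder_outputs)
    return sum(w * mse(out, target) for w, out in zip(weights, decoder_outputs))

# Toy example: four decoder levels regressing the same 8x8 target heat map.
rng = np.random.default_rng(0)
target = rng.random((8, 8))
decoder_outputs = [target + rng.normal(scale=0.1, size=(8, 8)) for _ in range(4)]
loss = deeply_supervised_loss(decoder_outputs, target)
```

Because each level carries its own loss term, any decoder can later serve as the network output, which is what makes level pruning possible.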

## 2. Related Work

#### 2.1. Network Pruning

#### 2.2. Hand Pose Estimation

#### 2.3. Hourglass and U-Net

#### 2.4. Deep Supervision

## 3. Method

#### 3.1. Compact Network Architecture

#### 3.2. Hourglass with Deep Supervision

#### 3.3. Inference Optimization with OpenVINO

#### 3.4. Implementation Details

## 4. Experiments

#### 4.1. Baseline

#### 4.1.1. What Dominates the Accuracy of the Stacked Hourglass?

#### 4.1.2. Impact of Deep Supervision

#### 4.2. Comparison with State-of-the-Art

#### 4.3. OpenVINO Optimization

#### 4.4. Exploration Studies

## 5. Conclusions and Discussion

## Author Contributions

## Funding

## Conflicts of Interest

## References

1. Tompson, J.; Stein, M.; LeCun, Y.; Perlin, K. Real-time continuous pose recovery of human hands using convolutional networks. ACM Trans. Graph. (TOG) **2014**, 33, 169.
2. Oberweger, M.; Wohlhart, P.; Lepetit, V. Hands deep in deep learning for hand pose estimation. 2015. Available online: https://arxiv.org/pdf/1502.06807 (accessed on 4 April 2019).
3. Ye, Q.; Yuan, S.; Kim, T.K. Spatial attention deep net with partial PSO for hierarchical hybrid hand pose estimation. In Lecture Notes in Computer Science; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer: Cham, Switzerland, 2016.
4. Ge, L.H.; Liang, H.; Yuan, J.; Thalmann, D. Robust 3D hand pose estimation in single depth images: From single-view CNN to multi-view CNNs. 2016. Available online: https://arxiv.org/pdf/1606.07253 (accessed on 4 April 2019).
5. Wan, C.; Probst, T.; Van Gool, L.; Yao, A. Dense 3D regression for hand pose. Available online: https://arxiv.org/pdf/1711.08996 (accessed on 4 April 2019).
6. Newell, A.; Yang, K.; Deng, J. Stacked hourglass networks for human pose estimation. In Computer Vision – ECCV 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer: Cham, Switzerland, 2016; Volume 9912, pp. 483–499.
7. Courbariaux, M.; Hubara, I.; Soudry, D.; El-Yaniv, R.; Bengio, Y. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or −1. 2016. Available online: https://arxiv.org/pdf/1602.02830 (accessed on 4 April 2019).
8. Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. 2015. Available online: https://arxiv.org/pdf/1503.02531?context=cs (accessed on 4 April 2019).
9. Han, S.; Mao, H.Z.; Dally, W.J. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. 2015. Available online: https://arxiv.org/pdf/1510.00149 (accessed on 4 April 2019).
10. Castellano, G.; Fanelli, A.M.; Pelillo, M. An iterative pruning algorithm for feedforward neural networks. IEEE Trans. Neural Netw. **1997**, 8, 519.
11. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE **1998**, 86, 2278–2324.
12. Han, S.; Pool, J.; Tran, J.; Dally, W.J. Learning both weights and connections for efficient neural networks. In Advances in Neural Information Processing Systems. 2015. Available online: https://arxiv.org/pdf/1506.02626 (accessed on 4 April 2019).
13. Tang, D.; Chang, H.J.; Tejani, A.; Kim, T.K. Latent regression forest: Structured estimation of 3D articulated hand posture. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014.
14. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM **2012**, 60, 84–90.
15. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proceedings of the 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR), Kuala Lumpur, Malaysia, 3–6 November 2015.
16. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
17. Anwar, S.; Hwang, K.; Sung, W.Y. Structured pruning of deep convolutional neural networks. 2015. Available online: https://arxiv.org/pdf/1512.08571 (accessed on 4 April 2019).
18. Polyak, A.; Wolf, L. Channel-level acceleration of deep face representations. IEEE Access **2015**, 3, 2163–2175.
19. Li, H.; Kadav, A.; Durdanovic, I.; Samet, H.; Graf, H.P. Pruning filters for efficient ConvNets. 2016. Available online: https://arxiv.org/pdf/1608.08710 (accessed on 4 April 2019).
20. Zhang, F.; Zhu, X.T.; Ye, M. Fast human pose estimation. 2019. Available online: https://arxiv.org/pdf/1811.05419 (accessed on 4 April 2019).
21. Oberweger, M.; Lepetit, V. DeepPrior++: Improving fast and accurate 3D hand pose estimation. In Proceedings of the IEEE International Conference on Computer Vision Workshops (ICCVW), Venice, Italy, 22–29 October 2017.
22. Chen, X.; Wang, G.; Guo, H.; Zhang, C. Pose guided structured region ensemble network for cascaded hand pose estimation. Available online: https://arxiv.org/pdf/1708.03416 (accessed on 4 April 2019).
23. Guo, H.K.; Wang, G.J.; Chen, X.H.; Zhang, C.R.; Qiao, F.; Yang, H.Z. Region ensemble network: Improving convolutional network for hand pose estimation. 2017. Available online: https://arxiv.org/pdf/1702.02447 (accessed on 4 April 2019).
24. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. 2015. Available online: https://arxiv.org/pdf/1505.04597 (accessed on 4 April 2019).
25. Huang, G.; Liu, Z.; van der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. 2016. Available online: https://arxiv.org/pdf/1608.06993 (accessed on 4 April 2019).
26. Li, X.; Chen, H.; Qi, X.; Dou, Q.; Fu, C.W.; Heng, P.A. H-DenseUNet: Hybrid densely connected UNet for liver and liver tumor segmentation from CT volumes. IEEE Trans. Med. Imaging **2018**, 37, 2663–2674.
27. Drozdzal, M.; Vorontsov, E.; Chartrand, G.; Kadoury, S.; Pal, C. The importance of skip connections in biomedical image segmentation. In Deep Learning and Data Labeling for Medical Applications; Carneiro, G., Ed.; Springer: Cham, Switzerland, 2016.
28. Zhou, Z.W.; Siddiquee, M.R.; Tajbakhsh, N.; Liang, J.M. UNet++: A nested U-Net architecture for medical image segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support; Springer: Cham, Switzerland, 2018; pp. 3–11.
29. Wang, L.; Lee, C.Y.; Tu, Z.; Lazebnik, S. Training deeper convolutional networks with deep supervision. 2015. Available online: https://arxiv.org/pdf/1505.02496 (accessed on 4 April 2019).
30. Bengio, Y.; Lamblin, P.; Popovici, D.; Larochelle, H. Greedy layer-wise training of deep networks. In Proceedings of the 19th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 4–7 December 2006; MIT Press: Cambridge, MA, USA, 2006; pp. 153–160.
31. Lee, C.Y.; Xie, S.; Gallagher, P.; Zhang, Z.Y.; Tu, Z. Deeply-supervised nets. 2015. Available online: https://arxiv.org/pdf/1409.5185 (accessed on 4 April 2019).
32. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015.
33. Carreira, J.; Agrawal, P.; Fragkiadaki, K.; Malik, J. Human pose estimation with iterative error feedback. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
34. Wei, S.E.; Ramakrishna, V.; Kanade, T.; Sheikh, Y. Convolutional pose machines. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
35. Chu, X.; Yang, W.; Ouyang, W.; Ma, C.; Yuille, A.L.; Wang, X. Multi-context attention for human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
36. Yang, W.; Li, S.; Ouyang, W.; Li, H.; Wang, X. Learning feature pyramids for human pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017.

^{1} Demo video: https://youtu.be/yFd6WrxmK3s.

**Figure 1.** Compact network architecture. (**a**) Layer pruning: the second stack is pruned. The post-processing module reconstructs joint coordinates from three heat maps (yellow blocks). Blue and green blocks stand for series of convolutional and pooling operations, omitted here for space (refer to Reference [5] for details). (**b**) Level pruning: different modes (${\mathrm{L}}_{1}$–${\mathrm{L}}_{4}$) can be selected for different complexity budgets. Supervision modules are added to every decoder (red lines).
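Level pruning works because deep supervision attaches an output head to every decoder level, so inference can stop at any level from ${\mathrm{L}}_{1}$ (cheapest) to ${\mathrm{L}}_{4}$ (full depth). The toy class below is an illustrative stand-in for this behavior, not the paper's actual architecture; subsampling and nearest-neighbour upsampling stand in for the real convolutional blocks:

```python
import numpy as np

class ToyHourglass:
    """Illustrative stand-in for one hourglass stack: `depth` encoder steps
    that downsample and matching decoder steps that upsample. Deep supervision
    trains a head at every decoder level, so inference can exit at any level."""

    def __init__(self, depth=4):
        self.depth = depth

    def predict(self, x, exit_level):
        assert 1 <= exit_level <= self.depth
        feats = x
        # Encoder: go only as deep as the selected mode requires.
        for _ in range(exit_level):
            feats = feats[::2, ::2]                  # 2x2 subsampling stands in for conv+pool
        # Decoder: upsample back to the input resolution, skipping the
        # deeper (pruned) levels entirely.
        for _ in range(exit_level):
            feats = np.kron(feats, np.ones((2, 2)))  # nearest-neighbour upsampling
        return feats

x = np.arange(256.0).reshape(16, 16)
net = ToyHourglass(depth=4)
cheap = net.predict(x, exit_level=1)   # L1: cheapest mode
full = net.predict(x, exit_level=4)    # L4: full-depth mode
```

Every exit level returns a map at the input resolution, which is why the pruned modes in Table 1 trade accuracy for speed without changing the output format.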

**Figure 2.** Workflow of the OpenVINO toolkit: the trained model is first converted by the Model Optimizer into the Intermediate Representation (IR) and then executed by the Inference Engine.

**Figure 4.**Loss descent comparison of w/ and w/o deep supervision (NYU dataset [1]).

**Figure 5.** Comparison with the original network on NYU [1]. The percentage of frames in which all joints are below a given error threshold is plotted.

**Figure 6.** Qualitative results: hand pose estimation on the NYU dataset [1]. (**a**) Successful samples (top row); (**b**) failed samples (top row) and the corresponding ground truth (bottom row).

**Figure 7.** Qualitative results: hand pose estimation on the ICVL dataset [13]. (**a**) Successful samples (top row); (**b**) failed samples (top row) and the corresponding ground truth (bottom row).

**Figure 8.** Comparison with state-of-the-art methods on ICVL [13]. The percentage of frames in which all joints are below a given error threshold is plotted.

**Table 1.** Comparison with the original network on NYU [1].

Architecture | Average 3D Error | Params | Inference Time (Per Frame) |
---|---|---|---|
Wan et al. [5] (dense 3D) | 10.2339 mm | 5.83 M | 44 ms |
1-stack-${\mathrm{L}}_{4}$ w/o DS | 10.9989 mm | 2.92 M | 35 ms |
1-stack-${\mathrm{L}}_{4}$ w/DS | 10.6243 mm | 2.92 M | 35 ms |
1-stack-${\mathrm{L}}_{3}$ w/DS | 11.9070 mm | 2.86 M | 32 ms |
1-stack-${\mathrm{L}}_{2}$ w/DS | 15.5249 mm | 2.80 M | 26 ms |
1-stack-${\mathrm{L}}_{1}$ w/DS | 43.2613 mm | 2.75 M | 25 ms |

**DS**: deep supervision; ${\mathrm{L}}_{i}$: $i$-th level pruning; **M**: million.

# Stack | # Feature Number | Mean Joint Error (from Scratch) | Mean Joint Error (Knowledge Distillation) |
---|---|---|---|
1 | 128 | 14.0880 mm | 16.5343 mm |
1 | 64 | 14.0503 mm | 16.7141 mm |
2 | 128 | 10.2339 mm | / |

Method | Pruned Network | Train from Scratch w/o DS | Retrain w/o DS | Retrain w/DS |
---|---|---|---|---|
Mean joint error | 10.999 mm | 12.204 mm | 10.562 mm | 10.624 mm |
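The retrain columns above follow the standard prune-then-retrain pipeline: rank and remove weak filters, then fine-tune. As one hedged illustration of the ranking step, in the spirit of the L1-norm filter pruning of Li et al. (cited in the references), with function name and tensor shapes chosen for the example rather than taken from the paper:

```python
import numpy as np

def prune_filters_l1(weights, keep_ratio):
    """Rank convolution filters by the L1 norm of their weights and keep only
    the strongest fraction. `weights` has shape (out_channels, in_channels, k, k);
    returns the pruned tensor and the indices of the filters that were kept."""
    norms = np.abs(weights).reshape(weights.shape[0], -1).sum(axis=1)
    n_keep = max(1, int(round(weights.shape[0] * keep_ratio)))
    keep = np.sort(np.argsort(norms)[::-1][:n_keep])  # strongest filters, original order
    return weights[keep], keep

# Toy layer: 64 filters, keep half; a real pipeline would then retrain.
rng = np.random.default_rng(1)
w = rng.normal(size=(64, 32, 3, 3))
pruned, kept = prune_filters_l1(w, keep_ratio=0.5)
```

After pruning, retraining recovers most of the lost accuracy, which is exactly the gap between the "Pruned Network" and "Retrain" columns.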

**Table 4.** Comparison with the original network on ICVL [13].

Architecture | Average 3D Error | Inference Time |
---|---|---|
Wan et al. [5] (dense 3D) | 7.25 mm | 47 ms |
1-stack-${\mathrm{L}}_{4}$ w/o DS | 7.42 mm | 38 ms |

Framework | Device | Preprocessing | Regression Network | Mean-Shift Algorithm | Total |
---|---|---|---|---|---|
TensorFlow | GPU (TITAN X) | 1.5 ms | 3.6 ms | 17.6 ms | 22.7 ms (44 FPS) |
TensorFlow | CPU (i9) | 1.3 ms | 20.8 ms | 4.8 ms | 26.9 ms (37 FPS) |
OpenVINO | CPU (i9) | 1.4 ms | 11.4 ms | 5.0 ms | 17.8 ms (56 FPS) |

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Wu, Y.; Ruan, X.; Zhang, Y.; Zhou, H.; Du, S.; Wu, G.
Lightweight Architecture for Real-Time Hand Pose Estimation with Deep Supervision. *Symmetry* **2019**, *11*, 585.
https://doi.org/10.3390/sym11040585

**AMA Style**

Wu Y, Ruan X, Zhang Y, Zhou H, Du S, Wu G.
Lightweight Architecture for Real-Time Hand Pose Estimation with Deep Supervision. *Symmetry*. 2019; 11(4):585.
https://doi.org/10.3390/sym11040585

**Chicago/Turabian Style**

Wu, Yufei, Xiaofei Ruan, Yu Zhang, Huang Zhou, Shengyu Du, and Gang Wu.
2019. "Lightweight Architecture for Real-Time Hand Pose Estimation with Deep Supervision" *Symmetry* 11, no. 4: 585.
https://doi.org/10.3390/sym11040585