# Real-Time Energy Efficient Hand Pose Estimation: A Case Study


## Abstract


## 1. Introduction

- A compressed, fixed-point version of the hand pose estimation CNN that is 5.3× smaller than the uncompressed, floating-point CNN. This squeezed version of the GPU-based CNN adopts customized bitwidths for weights and activations, and uses depthwise separable convolution, which minimizes the amount of computation in the convolutional layers.
- A hardware architecture that is 4.2× faster and 577.3× more energy efficient than the original implementation of the hand pose estimation CNN on an NVIDIA GeForce GTX 1070. This architecture exploits the parallelism available in the FPGA to speed up the inference of hand joint coordinates on a Zynq UltraScale+ XCZU9EG MPSoC using High-Level Synthesis (HLS).
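The parameter savings behind the first contribution can be illustrated with simple counting. The sketch below compares a standard convolution against a depthwise separable one (depthwise followed by a 1×1 pointwise convolution); the layer shape (8 input/output channels, 5×5 kernels) is an illustrative assumption, not necessarily the paper's exact configuration.

```python
# Parameter comparison: standard convolution vs. depthwise separable
# convolution (depthwise + 1x1 pointwise). Shapes are illustrative.

def standard_conv_params(c_in, c_out, k):
    return c_out * c_in * k * k + c_out          # kernels + biases

def depthwise_separable_params(c_in, c_out, k):
    depthwise = c_in * k * k + c_in              # one k x k kernel per channel
    pointwise = c_out * c_in + c_out             # 1x1 channel-mixing kernels
    return depthwise + pointwise

std = standard_conv_params(8, 8, 5)
sep = depthwise_separable_params(8, 8, 5)
print(std, sep, round(std / sep, 2))             # 1608 280 5.74
```

Even at this small channel count the separable form needs several times fewer parameters, and the gap widens as the channel count grows.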

## 2. Related Work

**Hand Pose Estimation:** Our FPGA implementation of the hand pose estimation CNN is based on, and inspired by, the work of Malik et al. [10], who demonstrated improved precision over the state of the art using a unified dataset. Oberweger et al. [11] evaluated different CNN topologies for 3D hand pose estimation, and the basic CNN in [10] is similar to one of the CNNs in [11], in which only the NYU and ICVL datasets were used to train and test the CNN.

**Network Quantization:** Krishnamoorthi [12] discussed different quantization techniques, namely quantization-aware training and post-training quantization (fine-tuning), and showed that quantization-aware training can provide higher accuracy than post-training quantization schemes. In [13], Hubara et al. introduced a training method for quantized neural networks and reported the classification performance of different network topologies such as AlexNet and GoogleNet.
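A minimal sketch of the kind of linear quantization these schemes rely on (Section 4.3 names a linear quantization function; the exact formula and the scale value below are assumptions): real values are mapped to a signed fixed-point integer grid and clamped to the representable range.

```python
# Symmetric linear quantization sketch: map a real value to a signed
# `bits`-bit integer with step size `scale`, then recover an approximation.

def quantize(x, scale, bits=8):
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    q = round(x / scale)
    return max(qmin, min(qmax, q))               # clamp to the integer range

def dequantize(q, scale):
    return q * scale

scale = 0.0625                                   # assumed step size (2^-4)
q = quantize(0.4, scale)
print(q, dequantize(q, scale))                   # 6 0.375
```

The rounding step introduces the quantization error that quantization-aware training lets the network adapt to during training, rather than suffering it only at inference time.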

**CNN Implementation on FPGA:** Venieris et al. showed in their survey [14] that hardware architectures for CNN implementation on FPGA fall into two main categories: streaming architectures and single computation engines. In a streaming architecture [15,16,17,18,19,20,21,22], each layer is mapped to a dedicated hardware block, and these blocks are connected to each other via stream pipes. Our architecture in this work is a simplified version of the FINN architecture [22].

## 3. Overview of The Proposed Approach

#### 3.1. CNN-Based Hand Pose Estimation Algorithm

#### 3.2. Design Process Overview

## 4. Software Design Process

#### 4.1. Full-Precision CNN Training and Testing

#### 4.1.1. Network Training

#### 4.1.2. Network Testing

#### 4.2. Quantization-Aware Training (QAT)

#### 4.3. Linear Quantization Function

#### Quantization-Aware Training Process

## 5. Hardware Design Process

#### 5.1. Hardware Streaming Architecture Design (HSA)

#### 5.1.1. Convolutional and Pooling Layer Architecture

#### 5.1.2. Zero Padding Layer Architecture

#### 5.1.3. Fully Connected Layer Architecture

#### 5.1.4. Inter-Layer Packing

#### 5.1.5. Activation Quantization

- Convolutional Layer 1 and Pooling Layer 1 data type.
- Convolutional Layer 2 and Pooling Layer 2 data type.
- Convolutional Layer 3 data type.
- Fully connected Layer 1 data type.
- Fully connected Layer 2 data type.
- Fully connected Layer 3 data type.

#### 5.2. System Integration (SI)

#### 5.2.1. Platform Selection

The Xilinx® Zynq® UltraScale+™ XCZU9EG-2FFVB1156E MPSoC (multiprocessor System-on-Chip) is the core of this general-purpose platform. In this platform, the PS consists of an ARM® Cortex®-A53 64-bit quad-core processor and a Cortex-R5 dual-core real-time processor, along with high-speed DDR4 SODIMM and component memory interfaces. The PL is the Field Programmable Gate Array (FPGA), which features 912 block RAMs (BRAMs), 2520 DSP48 slices, 274k look-up tables (LUTs) and 548k flip-flops (FFs). Figure 7 shows an overview of the board and its components.

#### 5.2.2. Hardware System Integration and On-Chip Software

## 6. Results

#### 6.1. Software-Related Experimental Setup and Results

#### Training and Testing the Quantized CNN

#### 6.2. Hardware-Related Experimental Setup and Results

#### 6.2.1. Comparison with NVIDIA GeForce GTX 1070 GPU

#### 6.2.2. Comparison with Other Embedded Platforms

## 7. Discussion

#### 7.1. Dilation

#### 7.2. Skip and Merge Connections

## 8. Conclusions

## 9. Future Work

## Author Contributions

## Funding

## Acknowledgments

## Conflicts of Interest

## Appendix A

#### Appendix A.1. Stride Algorithm

Algorithm A1: Stride-Aware Convolution and Max-Pooling

#### Appendix A.2. Zero Padding Algorithm

Algorithm A2: Zero Padding

#### Appendix A.3. Group Convolution

**Figure A1.** Group convolution for 8 output feature maps. This figure shows 4 different examples of obtaining 8 output feature maps. (**a**) The 2D convolution for a single input feature map. In this case, the input consists of a single group and 8 kernels; each kernel is convolved with the input feature map to produce an output feature map. (**b**) The case of 8 input feature maps and a single input group, where 64 kernels are split into 8 groups and each group is convolved with the input. The summation is necessary to obtain the desired number of output feature maps. (**c**) Depthwise convolution. The input is split into 8 groups and each output feature map is computed from a single input feature map. (**d**) Another example of group convolution in which the input is split into 4 groups. In this case, a total of 16 kernels are required and each input group is convolved with a set of 4 kernels. We omitted the summation in (**a**,**c**) since the result of each convolution is a single output feature map.

## References

- Chen, Y.; Tu, Z.; Ge, L.; Zhang, D.; Chen, R.; Yuan, J. So-handnet: Self-organizing network for 3d hand pose estimation with semi-supervised learning. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 6961–6970. [Google Scholar]
- Li, S.; Lee, D. Point-to-pose voting based hand pose estimation using residual permutation equivariant layer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 11927–11936. [Google Scholar]
- Zimmermann, C.; Brox, T. Learning to estimate 3d hand pose from single rgb images. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4903–4911. [Google Scholar]
- Wang, R.; Paris, S.; Popović, J. 6D hands: Markerless hand-tracking for computer aided design. In Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology, Santa Barbara, CA, USA, 16–19 October 2011; pp. 549–558. [Google Scholar]
- Li, S.; Ma, X.; Liang, H.; Görner, M.; Ruppel, P.; Fang, B.; Sun, F.; Zhang, J. Vision-based teleoperation of shadow dexterous hand using end-to-end deep neural network. arXiv **2018**, arXiv:1809.06268. [Google Scholar]
- Isaacs, J.; Foo, S. Hand pose estimation for American sign language recognition. In Proceedings of the Thirty-Sixth Southeastern Symposium on System Theory, Atlanta, GA, USA, 14–16 March 2004; pp. 132–136. [Google Scholar]
- Malik, J.; Elhayek, A.; Ahmed, S.; Shafait, F.; Malik, M.; Stricker, D. 3DAirSig: A Framework for Enabling In-Air Signatures Using a Multi-Modal Depth Sensor. Sensors **2018**, 18, 3872. [Google Scholar] [CrossRef] [PubMed][Green Version]
- Yuan, S.; Ye, Q.; Stenger, B.; Jain, S.; Kim, T.K. Bighand2.2m benchmark: Hand pose dataset and state of the art analysis. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2605–2613. [Google Scholar]
- Vansteenkiste, E. New FPGA Design Tools and Architectures. Ph.D. Thesis, Ghent University, Gent, Belgium, 2016. [Google Scholar]
- Malik, J.; Elhayek, A.; Stricker, D. Simultaneous Hand Pose and Skeleton Bone-Lengths Estimation from a Single Depth Image. In Proceedings of the 2017 International Conference on 3D Vision (3DV), Qingdao, China, 10–12 October 2017; pp. 557–565. [Google Scholar]
- Oberweger, M.; Wohlhart, P.; Lepetit, V. Hands deep in deep learning for hand pose estimation. arXiv **2015**, arXiv:1502.06807. [Google Scholar]
- Krishnamoorthi, R. Quantizing deep convolutional networks for efficient inference: A whitepaper. arXiv **2018**, arXiv:1806.08342. [Google Scholar]
- Hubara, I.; Courbariaux, M.; Soudry, D.; El-Yaniv, R.; Bengio, Y. Quantized neural networks: Training neural networks with low precision weights and activations. J. Mach. Learn. Res. **2017**, 18, 6869–6898. [Google Scholar]
- Venieris, S.I.; Kouris, A.; Bouganis, C.S. Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions. arXiv **2018**, arXiv:1803.05900. [Google Scholar] [CrossRef][Green Version]
- Liu, Z.; Dou, Y.; Jiang, J.; Xu, J. Automatic code generation of convolutional neural networks in FPGA implementation. In Proceedings of the 2016 International Conference on Field-Programmable Technology (FPT), Xi’an, China, 7–9 December 2016; pp. 61–68. [Google Scholar]
- Abdelouahab, K.; Pelcat, M.; Serot, J.; Bourrasset, C.; Berry, F. Tactics to directly map CNN graphs on embedded FPGAs. IEEE Embed. Syst. Lett. **2017**, 9, 113–116. [Google Scholar] [CrossRef][Green Version]
- Abdelouahab, K.; Bourrasset, C.; Pelcat, M.; Berry, F.; Quinton, J.C.; Serot, J. A Holistic Approach for Optimizing DSP Block Utilization of a CNN implementation on FPGA. In Proceedings of the 10th International Conference on Distributed Smart Camera, Paris, France, 12–15 September 2016; pp. 69–75. [Google Scholar]
- Wang, Y.; Xu, J.; Han, Y.; Li, H.; Li, X. DeepBurning: Automatic generation of FPGA-based learning accelerators for the neural network family. In Proceedings of the 53rd Annual Design Automation Conference, Austin, TX, USA, 5–9 June 2016; p. 110. [Google Scholar]
- Venieris, S.I.; Bouganis, C.S. Latency-driven design for FPGA-based convolutional neural networks. In Proceedings of the 2017 27th International Conference on Field Programmable Logic and Applications (FPL), Ghent, Belgium, 4–8 September 2017; pp. 1–8. [Google Scholar]
- Venieris, S.I.; Bouganis, C.S. fpgaConvNet: A toolflow for mapping diverse convolutional neural networks on embedded FPGAs. arXiv **2017**, arXiv:1711.08740. [Google Scholar]
- Venieris, S.I.; Bouganis, C.S. fpgaConvNet: A framework for mapping convolutional neural networks on FPGAs. In Proceedings of the 2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), Washington, DC, USA, 1–3 May 2016; pp. 40–47. [Google Scholar]
- Umuroglu, Y.; Fraser, N.J.; Gambardella, G.; Blott, M.; Leong, P.; Jahre, M.; Vissers, K. Finn: A framework for fast, scalable binarized neural network inference. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 22–24 February 2017; pp. 65–74. [Google Scholar]
- Mamalet, F.; Garcia, C. Simplifying convnets for fast learning. In International Conference on Artificial Neural Networks; Springer: Berlin, Germany, 2012; pp. 58–65. [Google Scholar]
- Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; Lerer, A. Automatic Differentiation in PyTorch. In Proceedings of the NIPS Autodiff Workshop, Long Beach, CA, USA, 9 December 2017. [Google Scholar]
- Zhou, X.; Wan, Q.; Zhang, W.; Xue, X.; Wei, Y. Model-based deep hand pose estimation. arXiv **2016**, arXiv:1606.06854. [Google Scholar]
- Tompson, J.; Stein, M.; Lecun, Y.; Perlin, K. Real-time continuous pose recovery of human hands using convolutional networks. ACM Trans. Graph. (ToG) **2014**, 33, 169. [Google Scholar] [CrossRef]
- Miyashita, D.; Lee, E.H.; Murmann, B. Convolutional neural networks using logarithmic data representation. arXiv **2016**, arXiv:1603.01025. [Google Scholar]
- Matai, J.; Richmond, D.; Lee, D.; Kastner, R. Enabling FPGAs for the masses. arXiv **2014**, arXiv:1408.5870. [Google Scholar]
- Vallina, F.M. Implementing Memory Structures for Video Processing in the Vivado HLS Tool; XAPP793 (v1.0), 20 September; Xilinx, Inc.: Santa Clara, CA, USA, 2012. [Google Scholar]
- Xilinx. Vivado Design Suite User Guide: High-Level Synthesis (UG902); Xilinx, Inc.: Santa Clara, CA, USA, 2019. [Google Scholar]
- Xilinx. ZCU102 Evaluation Board (UG1182); Xilinx, Inc.: Santa Clara, CA, USA, 2018. [Google Scholar]
- ONNX Runtime. Available online: https://github.com/microsoft/onnxruntime (accessed on 7 May 2020).
- Oberweger, M.; Lepetit, V. Deepprior++: Improving fast and accurate 3d hand pose estimation. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Venice, Italy, 22–29 October 2017; pp. 585–594. [Google Scholar]
- Malik, J.; Elhayek, A.; Nunnari, F.; Varanasi, K.; Tamaddon, K.; Heloir, A.; Stricker, D. Deephps: End-to-end estimation of 3d hand pose and shape by learning from synthetic depth. In Proceedings of the 2018 International Conference on 3D Vision (3DV), Verona, Italy, 5–8 September 2018; pp. 110–119. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
- Chidananda, P.; Sinha, A.; Rao, A.; Lee, D.; Rabinovich, A. Efficient 2.5D Hand Pose Estimation via Auxiliary Multi-Task Training for Embedded Devices. arXiv **2019**, arXiv:1909.05897. [Google Scholar]

**Figure 1.** The architecture of the Convolutional Neural Network (CNN)-based hand pose estimation algorithm. The CNN takes a 128 × 128 preprocessed input image. It consists of 3 convolutional layers (Conv), each followed by a ReLU activation and max pooling, and then 2 fully connected layers (FC) with ReLU activation. A third fully connected layer (the joint regression layer) regresses the 3D joint positions.

**Figure 2.** Design process overview; the first box illustrates the software-level design phase, while the other boxes illustrate the hardware-related design phase. The first stage is quantization-aware training (QAT), in which we decrease the CNN memory demand as well as the computation time on the hardware. The second stage is hardware streaming architecture (HSA), where the underlying hardware structure is designed for the CNN. In the system integration (SI) stage, the programmable logic (PL) and the processing system (PS) are brought together, and the interface with the memory is configured through the hardware system integration sub-stage. Furthermore, the on-chip software is developed for preprocessing and interfacing with the peripherals.

**Figure 3.** Streaming architecture. Each CNN layer is mapped to a hardware block, and the hardware blocks are connected to each other via stream channels. The bitwidth of each stream is shown in the figure.

**Figure 4.** Line buffer and window buffer architecture. This figure illustrates an example of a single line buffer and a single window buffer for a single input feature map. In this example, the 4th line of the input feature map is being streamed value by value to the line buffer. Specifically, the input value 31 is streamed from the input feature map to the appropriate location in the line buffer. The old values (23 and 15) in the line buffer are shifted up to the appropriate places in their original lines, while the old value 7 is no longer needed. The window buffer copies the set of line buffer values that correspond to the current operation (convolution or pooling).
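The line-buffer/window-buffer scheme of Figure 4 can be modeled in software to make the data movement concrete: the feature map arrives one value per cycle, K−1 full rows are cached in the line buffer, and a K×K window shifts by one column per input. The 3×3 window and 4×4 map below are illustrative assumptions, not the paper's exact sizes.

```python
# Software model of a line buffer + window buffer for streamed 2D windows.

def sliding_windows(feature_map, k=3):
    """Yield ((row, col), window) for each complete k x k window."""
    width = len(feature_map[0])
    line_buffer = [[0] * width for _ in range(k - 1)]   # last k-1 rows
    window = [[0] * k for _ in range(k)]
    for r, row in enumerate(feature_map):
        for c, value in enumerate(row):
            # shift the window left by one column
            for i in range(k):
                for j in range(k - 1):
                    window[i][j] = window[i][j + 1]
            # new right-hand column: k-1 buffered values + the fresh input
            for i in range(k - 1):
                window[i][k - 1] = line_buffer[i][c]
            window[k - 1][k - 1] = value
            # update the line buffer: shift this column up, store new value
            for i in range(k - 2):
                line_buffer[i][c] = line_buffer[i + 1][c]
            line_buffer[k - 2][c] = value
            if r >= k - 1 and c >= k - 1:
                yield (r, c), [row_[:] for row_ in window]

fm = [[r * 4 + c for c in range(4)] for r in range(4)]
pos, win = next(sliding_windows(fm))
print(pos, win)   # (2, 2) [[0, 1, 2], [4, 5, 6], [8, 9, 10]]
```

Each input value is read from external memory exactly once; the buffers supply the K×K neighborhood locally, which is the point of the hardware structure.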

**Figure 5.** Zero padding layer streaming architecture. In this architecture, the input values are streamed in, and the padding decision maker controls the multiplexer based on the current row and column indices. If the padding should be performed, a zero value is streamed out and the input stream stalls. Otherwise, the input value is directly pushed as is to the output stream.
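The decision logic of Figure 5 is easy to model as a generator: the padder inspects the current (row, column) index and either emits a zero (stalling the input) or forwards the next input value. The pad width of 1 on every border is an illustrative assumption.

```python
# Software model of the streaming zero-padding block.

def pad_stream(values, height, width, pad=1):
    """values: iterator over a height x width map in row-major order."""
    it = iter(values)
    for r in range(height + 2 * pad):
        for c in range(width + 2 * pad):
            on_border = (r < pad or r >= height + pad or
                         c < pad or c >= width + pad)
            yield 0 if on_border else next(it)

out = list(pad_stream([1, 2, 3, 4], height=2, width=2))
print(out)  # [0, 0, 0, 0, 0, 1, 2, 0, 0, 3, 4, 0, 0, 0, 0, 0]
```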

**Figure 6.** Fully connected layer streaming architecture. In this architecture, the input values are streamed in, and the corresponding weights are fetched from the local BRAM memories for the MAC operation that takes place in the multiplier and the adder. The result is accumulated with the help of the accumulator and the demultiplexer. Once the MAC operations are done, the demultiplexer pushes the result value to the output streaming buffer.
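The dataflow of Figure 6 can be sketched in a few lines: one input value arrives per "cycle", its weight row is fetched from local memory for the MACs, and per-output accumulators are released to the output stream once all inputs are consumed. The toy shapes below are assumptions.

```python
# Software model of the streaming fully connected layer.

def fc_stream(inputs, weights, biases):
    """weights[i][j]: weight from input i to output j."""
    acc = list(biases)                       # accumulators, seeded with biases
    for i, x in enumerate(inputs):           # one streamed input per "cycle"
        for j, w in enumerate(weights[i]):
            acc[j] += x * w                  # multiply-accumulate
    yield from acc                           # demux pushes results downstream

out = list(fc_stream([1.0, 2.0],
                     [[1.0, 0.5], [0.25, -1.0]],
                     [0.0, 0.25]))
print(out)  # [1.5, -1.25]
```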

**Figure 7.** ZCU102 evaluation board [31].

**Figure 8.** Hardware system integration; the AXI-Lite interface provides the interconnection between the PS and the PL. The DMA module is integrated in the PL. This module is responsible for converting the memory-mapped input to the AXI stream CNN input, as well as converting the AXI stream CNN output to a memory-mapped output.

**Figure 9.** Performance analysis for different batch sizes; FPGA 1 denotes the FPGA implementation performance for batch size 1. Similarly, GPU1, GPU128 and GPU256 denote the GPU implementation performance for batch sizes 1, 128 and 256, respectively. For comparison purposes, we have shown the factors by which the performance differs for different implementations (the numbers on the arrows).

**Figure 10.** Performance analysis for different embedded platforms; FPGA, JX GPU, JX CPU, RPI3B+ and RPi3B+ ONNX denote the performance of our Xilinx UltraScale+ FPGA, Jetson Xavier embedded GPU, Jetson Xavier embedded CPU, Raspberry Pi 3B+ simple CPU and Raspberry Pi 3B+ ONNX-Runtime-based CPU implementations, respectively. As in Figure 9, the numbers on the arrows illustrate the factors by which the performance differs for different implementations.

**Figure 11.** Relative latency and energy consumption per image for different embedded platforms and batch sizes; FPGA, Xavier GPU, Xavier CPU, Rpi and Rpi ONNX stand for our Xilinx UltraScale+ FPGA, Jetson Xavier embedded GPU, Jetson Xavier embedded CPU, Raspberry Pi 3B+ simple CPU and Raspberry Pi 3B+ ONNX-Runtime-based CPU implementations, respectively. The number above each column represents the relative latency or energy consumption with respect to our FPGA implementation.

| Hyper Parameters | Value |
|---|---|
| No. Kernels per Conv Layer | 8 |
| No. Epochs | 500 |
| Optimizer | SGD |
| Loss Function Criterion | MSE |
| Batch Size | 256 |
| Learning Rate | 0.005 |
| Momentum | 0.9 |

| Layer | Total No. Parameters | Full-Precision Size | Quantized Size |
|---|---|---|---|
| Convolutional 1 | 208 | 6.5 Kb | 2.4 Kb |
| Pooling 1 | - | - | - |
| Convolutional 2 | 208 | 6.5 Kb | 2.4 Kb |
| Pooling 2 | - | - | - |
| Convolutional 3 | 80 | 2.5 Kb | 0.9 Kb |
| Fully Connected 1 | 1,180,672 | 36 Mb | 6.8 Mb |
| Fully Connected 2 | 1,049,600 | 32 Mb | 6.0 Mb |
| Fully Connected 3 | 95,325 | 2.9 Mb | 559 Kb |
| Total | 2,326,093 | 71 Mb | 13.3 Mb |
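The per-layer parameter counts can be re-derived. The conv layers use one k×k kernel per channel (group/depthwise convolution), which matches 208, 208 and 80 parameters; the inferred kernel sizes (5, 5, 3), the flattened size 1152 = 8 × 12 × 12, and the layer widths 1024/1024/93 are assumptions consistent with the listed counts rather than dimensions stated here.

```python
# Re-deriving the parameter counts of the CNN layers.

def depthwise_conv_params(channels, k):
    return channels * k * k + channels           # kernels + biases

def fc_params(n_in, n_out):
    return n_in * n_out + n_out                  # weights + biases

conv1 = depthwise_conv_params(8, 5)              # 208
conv2 = depthwise_conv_params(8, 5)              # 208
conv3 = depthwise_conv_params(8, 3)              # 80
fc1 = fc_params(1152, 1024)                      # 1,180,672
fc2 = fc_params(1024, 1024)                      # 1,049,600
fc3 = fc_params(1024, 93)                        # 95,325 (31 joints x 3)
total = conv1 + conv2 + conv3 + fc1 + fc2 + fc3
print(total)                                     # 2326093
print(round(71 / 13.3, 1))                       # ~5.3x size reduction
```

The 71 Mb to 13.3 Mb shrinkage is exactly the 5.3× compression factor claimed in the introduction.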

| Platform | Batch Size | Run-Time [ms] | Energy [mJ] |
|---|---|---|---|
| FPGA (ours) | 1 | 1.669 | 0.6676 |
| GPU | 1 | 7.01 | 385.41 |
| | 32 | 0.53 | 30.55 |
| | 64 | 0.35 | 21.86 |
| | 128 | 0.30 | 20.20 |
| | 256 | 0.29 | 21.99 |
| | 512 | 0.30 | 22.10 |
| | 1024 | 0.31 | 21.80 |
| | 2048 | 0.30 | 35.80 |
| | 4096 | 0.30 | 35.86 |
| | 8192 | 0.30 | 40.49 |
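The headline factors from the introduction follow directly from the batch-size-1 rows of the table above:

```python
# Deriving the speedup and energy-efficiency factors from the
# batch-size-1 measurements (FPGA vs. GTX 1070 GPU).

fpga_ms, fpga_mj = 1.669, 0.6676
gpu_ms, gpu_mj = 7.01, 385.41

print(round(gpu_ms / fpga_ms, 1))   # 4.2x faster
print(round(gpu_mj / fpga_mj, 1))   # 577.3x more energy efficient
```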

| Platform | Batch Size | Run-Time (ms) | Energy (mJ) |
|---|---|---|---|
| FPGA (ours) | 1 | 1.669 | 0.6676 |
| NVIDIA Jetson Xavier (GPU) | 1 | 2.21 | 9.06 |
| | 16 | 0.94 | 22.51 |
| | 32 | 0.84 | 20.05 |
| | 64 | 0.73 | 19.64 |
| | 128 | 0.70 | 18.94 |
| NVIDIA Jetson Xavier (CPU) | 1 | 9.54 | 64.87 |
| | 16 | 26.84 | 315.36 |
| | 32 | 14.28 | 169.20 |
| | 64 | 7.51 | 88.27 |
| | 128 | 4.36 | 48.33 |
| RaspberryPi 3B+ (ONNX-Runtime) | 1 | 16.04 | 39.30 |
| | 16 | 6.01 | 13.82 |
| | 32 | 5.66 | 13.03 |
| | 64 | 5.56 | 12.67 |
| | 128 | 5.53 | 12.84 |
| RaspberryPi 3B+ | 1 | 151.97 | 395.12 |
| | 16 | 27.00 | 51.30 |
| | 32 | 19.95 | 45.89 |
| | 64 | 15.98 | 38.36 |
| | 128 | 15.45 | 37.07 |

| Resources | Used | Available | Utilization |
|---|---|---|---|
| CLB LUTs | 21,432 | 274,080 | 8% |
| CLB Flip-Flops | 15,104 | 548,160 | 3% |
| Block RAM Tiles | 772.5 | 912 | 85% |
| DSPs | 20 | 2520 | 1% |

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Al Koutayni, M.R.; Rybalkin, V.; Malik, J.; Elhayek, A.; Weis, C.; Reis, G.; Wehn, N.; Stricker, D. Real-Time Energy Efficient Hand Pose Estimation: A Case Study. *Sensors* **2020**, *20*, 2828.
https://doi.org/10.3390/s20102828
