We performed a rigorous evaluation of our method using qualitative and quantitative analysis on both the 3D hand shape and the 3D pose estimation tasks. We provide comparisons with the state of the art and self-comparisons on both synthetic and real-world datasets.

#### 6.1. Datasets, Baselines and Evaluation Metrics

None of the existing real hand pose datasets provide ground-truth hand shape information. Therefore, we qualitatively evaluated the recovered 3D real hand mesh using two datasets: NYU [34] and BigHand2.2M [46]. NYU provides a train set (${\mathcal{T}}_{\mathcal{N}}$) and a test set, which contain 72,757 and 8,252 RGBD images, respectively. The dataset covers a wide range of complex poses, but it was collected from only one subject. It contains 36 annotated joint positions, out of which a subset of 14 joints is used for public comparisons [34]. BigHand2.2M is the largest real dataset, providing 956 K training depth frames captured from 10 different subjects. The test set for the pose estimation task contains 296 K images; however, its annotations are not available. Hence, for completeness, we selected 90% of the 956 K frames (i.e., 860 K) as the train set (${\mathcal{T}}_{\mathcal{B}}$) and the remaining frames (i.e., 96 K) as the test set. Joint annotations of the BigHand2.2M dataset are shown in Figure 3 (right). We computed the hand palm center by taking the mean of the metacarpal joints and the wrist joint. SynHand5M [21], on the other hand, is the largest synthetic hand pose dataset, containing 5 million depth images with 21 3D joints (see Figure 3, left) and 1193 3D hand mesh vertices as ground-truth annotations. Its train set (${\mathcal{T}}_{\mathcal{S}}$) and test set contain $4.5$ M and 500 K images, respectively.
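The palm-center computation described above can be sketched as follows; note that the joint indices used here are purely illustrative assumptions, since the actual index layout depends on the dataset's joint numbering:

```python
import numpy as np

# Hypothetical joint indices for the 21-joint annotation layout
# (the actual order depends on the dataset's joint numbering).
WRIST = 0
METACARPALS = [1, 2, 3, 4, 5]  # one metacarpal joint per finger

def palm_center(joints):
    """Palm center as the mean of the wrist and metacarpal joints.

    joints: (21, 3) array of 3D joint positions in mm.
    """
    idx = [WRIST] + METACARPALS
    return joints[idx].mean(axis=0)
```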

To study the impact of individual modules on the accuracy of the 3D hand pose estimation task, we compared our **Full** model, which is the complete pipeline (see Figure 2), with three baselines. **Baseline 1** directly regresses $\mathcal{J}$ (using Module 1 without the bone-to-joint layer). **Baseline 2** comprises the complete Module 1, while **Baseline 3** consists of the first two modules of our pipeline (see Section 3). We used four error metrics [21] to evaluate the accuracy of the estimated pose and hand mesh: (i) **3D $\mathcal{J}$ Err.**, the mean 3D joint position error over all test frames; (ii) **3D $\mathcal{B}$ Err.**, the average 3D bone location error; (iii) **3D $\mathcal{V}$ Err.**, the mean 3D vertex location error; and (iv) the percentage of successful frames within given error thresholds. All error distances and thresholds are reported in mm.
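A minimal sketch of metrics (i)/(iii) and (iv) is given below. Array names and shapes are illustrative, and the success-frame criterion here uses the common convention of thresholding the worst per-frame joint error, which is an assumption rather than a definition taken from the text:

```python
import numpy as np

# `pred` and `gt` are hypothetical arrays of shape (num_frames, num_points, 3),
# holding predicted and ground-truth 3D positions in mm. The same routine
# covers joints (metric i) and mesh vertices (metric iii).

def mean_3d_error(pred, gt):
    """Mean Euclidean distance per point, averaged over all test frames."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def success_rate(pred, gt, threshold):
    """Fraction of frames whose worst point error is below `threshold` (mm).

    Assumed convention: a frame succeeds only if *all* points fall
    within the threshold.
    """
    worst = np.linalg.norm(pred - gt, axis=-1).max(axis=1)
    return (worst < threshold).mean()
```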

#### 6.2. Evaluation of 3D Hand Shape Estimation

This subsection gives the experimental details of the 3D hand mesh estimation task using the SynHand5M [21], NYU [34] and BigHand2.2M [46] datasets.

**Synthetic hand mesh recovery**: As SynHand5M [21] is fully labeled for pose and shape, we trained Baseline 3 and our Full model in a fully supervised manner using the training strategy explained in Section 5. Quantitative results are summarized in Table 1. Our Baseline 3 (without the 2D depth image synthesizer) outperforms the state-of-the-art DeepHPS method [21]. Our Full model further improves shape estimation accuracy over Baseline 3 by $19.6$%. Figure 6 shows qualitative results on several challenging hand poses from the SynHand5M dataset.

**Real hand mesh recovery**: To learn real hand shapes effectively, Module 3 acts as an important source of weak supervision during training. To recover the hand shapes of the NYU dataset, we combined the train sets of the SynHand5M and NYU datasets, i.e., ${\mathcal{T}}_{\mathcal{SN}}$ = ${\mathcal{T}}_{\mathcal{S}}$ + ${\mathcal{T}}_{\mathcal{N}}$, into one unified format and shuffled them. NYU contains a larger set of joint annotations (i.e., 36 joints) than SynHand5M; therefore, we selected 16 closely matching joints that are common to both datasets [21]. Our Full model was trained end-to-end on ${\mathcal{T}}_{\mathcal{SN}}$ with the total network loss given by Equation (9). The mesh loss of Module 2 was computed by implementing the indicator function (Equation (8)). Qualitative results of hand pose and shape recovery on the NYU test set are shown in Figure 7. Our algorithm successfully reconstructs reasonable hand shapes for complex poses. Clearly, the quality of the shape reconstruction depends on the accuracy of the estimated 3D pose. Examples of synthesized depth images from Module 3 are shown in the Supplementary Materials. Similarly, we jointly trained on the real BigHand2.2M and synthetic SynHand5M datasets using a mixed train set, i.e., ${\mathcal{T}}_{\mathcal{BS}}$ = ${\mathcal{T}}_{\mathcal{B}}$ + ${\mathcal{T}}_{\mathcal{S}}$. Both datasets have the same joint annotations, as shown in Figure 3. Qualitative results of BigHand2.2M shape recovery are shown in Figure 7 and demonstrate successful hand shape reconstruction even in cases of missing depth information and heavy occlusion, such as egocentric viewpoint images. More qualitative results from a live depth camera stream are presented in the Supplementary Materials.
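The construction of a mixed train set from a mesh-annotated synthetic dataset and a pose-only real dataset can be sketched as follows. All names, joint indices, and the sample format are illustrative assumptions, not the actual data pipeline:

```python
import random

# Hypothetical indices of the 16 joints shared by both annotation
# layouts (the true correspondences depend on each dataset's numbering).
COMMON_JOINTS_SYNTH = list(range(16))  # subset of the synthetic 21 joints
COMMON_JOINTS_REAL = list(range(16))   # subset of the real 36 joints

def unify(samples, joint_idx, has_mesh_gt):
    """Map (depth_image, joints) pairs to a unified format.

    Only synthetic samples carry ground-truth mesh vertices; the flag
    plays the role of the indicator function used by the mesh loss.
    """
    return [(img, [joints[i] for i in joint_idx], has_mesh_gt)
            for img, joints in samples]

def mixed_train_set(synth_samples, real_samples, seed=0):
    """Merge both datasets into one shuffled train set."""
    data = (unify(synth_samples, COMMON_JOINTS_SYNTH, True)
            + unify(real_samples, COMMON_JOINTS_REAL, False))
    random.Random(seed).shuffle(data)
    return data
```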

For a more rigorous evaluation of our approach for real hand shape recovery, we built a new model inspired by the recent work of Ge et al. [24]. In this model, the hand mesh is first estimated using the CNN of Module 1, which directly regresses the mesh vertices $\mathcal{V}$ from the input depth image ${\mathcal{D}}_{\mathcal{I}}$; a 3D hand pose regressor then estimates the 3D pose $\mathcal{J}$ from the reconstructed $\mathcal{V}$. Finally, the depth image synthesizer produces the depth image ${\mathcal{D}}_{\mathcal{R}}$ from $\mathcal{J}$. For notational simplicity, we refer to this model as **Model 1** and compare its performance with our Full model on the NYU dataset (Table 2 shows the pipelines using these notations). Figure 8 shows the qualitative comparison on sample test images of NYU. Evidently, direct hand shape regression from a single depth image is difficult and may lead to highly inaccurate shape estimation. The pipeline of Model 1 is given in the Supplementary Materials.
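The stage ordering of Model 1 (mesh first, then pose, then synthesized depth) can be summarized in a small sketch; the three callables are hypothetical stand-ins for the learned components, not actual network code:

```python
# Sketch of the Model 1 pipeline: D_I -> V -> J -> D_R.
# `mesh_cnn`, `pose_regressor` and `depth_synthesizer` are assumed
# callables standing in for the three learned components.

def model1_forward(depth_image, mesh_cnn, pose_regressor, depth_synthesizer):
    vertices = mesh_cnn(depth_image)          # V: mesh regressed directly from depth
    joints = pose_regressor(vertices)         # J: 3D pose estimated from the mesh
    synthesized = depth_synthesizer(joints)   # D_R: depth image rendered from the pose
    return vertices, joints, synthesized
```

Because the pose is derived from the regressed mesh, any error in the first stage propagates to all later stages, which matches the inaccurate shape estimates observed for Model 1.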

**Comparison with the state of the art**: To qualitatively compare our recovered real hand shapes with the state-of-the-art DeepHPS method [21], we implemented this method and trained it on ${\mathcal{T}}_{\mathcal{BS}}$. Results on sample test images of the BigHand2.2M dataset are shown in Figure 9. Artifacts are clearly visible with the DeepHPS method due to its fixed linear bases (see Section 2) and the difficulty of learning complex hand shape and scale parameters in a deep network. In our case, we learn shape from pose, which results in plausible hand shape recovery. We also examined the effect of Module 3 during training by comparing the real shape recovery results of our Baseline 3. The last column in Figure 9 shows the shape estimation results of Baseline 3, i.e., without the depth synthesizer. The inaccurate mesh reconstruction of Baseline 3 demonstrates that the weak supervision from Module 3 is necessary for reasonable real hand shape reconstruction.

**Discussion**: Notably, our algorithm learns to reconstruct hand shapes from real depth images by learning from synthetic depth. Therefore, consistency in depth and joint annotations between real and synthetic images is important for recovering plausible real hand shapes and poses. Thus, our approach is unlikely to produce correct and plausible hand shapes for older real hand pose datasets such as ICVL [50] and MSRA2015 [51], which are not fully consistent in depth and joint annotations with the synthetic SynHand5M [21] dataset.

#### 6.3. Evaluation of 3D Hand Pose Estimation

This subsection provides quantitative and qualitative evaluations of our approach on the task of 3D hand pose estimation. We provide self-comparisons and comparisons with state-of-the-art methods on the NYU [34] and SynHand5M [21] datasets. For completeness, we also provide 3D pose estimation results on the BigHand2.2M [46] dataset.

**SynHand5M synthetic dataset**: We trained our Baseline 3 and Full model on the SynHand5M dataset. Quantitative results for joint position and bone vector estimation are provided in Table 1. Our algorithm outperforms the state-of-the-art methods, which demonstrates the effectiveness of our weakly supervised algorithm and its superior performance compared to the state-of-the-art LBS method [21].

**BigHand2.2M real dataset**: We evaluated 3D pose estimation accuracy on our test set created from the BigHand2.2M dataset [46]. We trained our Full model on the mixed train set ${\mathcal{T}}_{\mathcal{BS}}$. Qualitative results are shown in Figure 7 and demonstrate successful 3D pose recovery for complex hand poses even in cases of missing depth and large occlusions. Quantitatively, the 3D joint error on our created test set (see Section 6.1) is $11.84$ mm.

**Self-comparisons**: To rigorously evaluate our algorithm, we performed self-comparisons of our baseline architectures and Full model on the real NYU dataset. The networks were jointly trained on the combined NYU, BigHand2.2M and synthetic SynHand5M datasets and optimized for the loss given by Equation (9). We used the hand model of Zhou et al. [10] to implement the bone-to-joint layer. Baseline 1 is similar to the CNN architecture proposed in [31], which we use to directly regress $\mathcal{J}$. Table 3 shows the joint estimation accuracy of Baseline 1. Baseline 2, which incorporates the hand skeleton structure (see Section 4.1), achieves a $9.6$% increase in pose estimation accuracy. Since ${\mathcal{L}}_{\mathcal{B}}$ is included in Baseline 2, its 3D bone error is also reported in Table 3. Baseline 3 includes hand mesh learning, which marginally improves pose estimation accuracy by $2.8$% and bone estimation accuracy by $1.9$% over Baseline 2. Our Full model achieves the best accuracy on joint position and bone vector estimation by including Module 3 in training. Figure 10 (left and middle) illustrates the quantitative results of the self-comparisons; the curves covering the largest area correspond to the highest accuracy. Qualitative comparisons of Baseline 1, Baseline 2 and the Full model are shown in Figure 11. Furthermore, we quantitatively evaluated Model 1 (see Section 6.2), which shows lower 3D pose estimation accuracy due to inaccurate hand mesh estimation; we compare its performance with our Full model in Table 2.

**Comparison with the state of the art**: We compared the 3D hand pose estimation accuracy of our Full model (WHSP-Net) with state-of-the-art approaches. Figure 10 (right) and Table 4 show the quantitative comparisons. Notably, discriminative methods such as V2V-PoseNet [4] and FeatureMapping [3] achieve better accuracy than our method, but they generalize poorly to unseen data [6]. Moreover, V2V-PoseNet is not real-time because of the time-consuming conversion of the grayscale depth input to voxels and its complex 3D-CNN architecture. Furthermore, our method is not purely discriminative; rather, it respects the structure of the hand skeleton and additionally produces a full 3D hand mesh. Therefore, our approach falls in the category of methods that output more than joint positions. In addition to the 3D pose, DeepModel [10] outputs joint angles; HandScales [11] produces joint angles and bone lengths; and DeepHPS [21] generates joint angles, bone lengths, complex shape parameters and a full 3D hand shape. Our method outperforms these methods, as shown in Table 4. It also shows competitive performance against the state-of-the-art methods that do not explicitly consider the hand structure and produce only the 3D pose [3,4,5]. Our algorithm is real-time, producing the 3D pose and shape in $2.9$ ms per frame.