CFAM: Estimating 3D Hand Poses from a Single RGB Image with Attention

Abstract: Precise 3D hand pose estimation can be used to improve the performance of human–computer interaction (HCI). Specifically, computer-vision-based hand pose estimation can make this process more natural. Most traditional computer-vision-based hand pose estimation methods use depth images as the input, which requires complicated and expensive acquisition equipment. Estimation from a single RGB image is more convenient and less expensive. Previous methods based on RGB images utilize only 2D keypoint score maps to recover 3D hand poses but ignore the hand texture features and the underlying spatial information in the RGB image, which leads to a relatively low accuracy. To address this issue, we propose a channel fusion attention mechanism that combines 2D keypoint features and RGB image features at the channel level. In particular, the proposed method reassigns weights to the concatenated RGB image and 2D keypoint features, which enables the various features to be weighted and used sensibly. Moreover, our method improves the fusion performance of different types of feature maps. Multiple comparative experiments on public datasets demonstrate that the accuracy of our proposed method is comparable to the state-of-the-art accuracy.


Introduction
Gesture estimation plays a significant role in computer science, and related tasks aim toward understanding human gestures through algorithms. Gesture-based interaction is robust to environmental changes such as varying shooting distance and glare. Human–computer interaction (HCI) based on gestures can be implemented wherever and whenever, has fewer constraints, and enables computers to efficiently and precisely understand user commands without any mechanical assistance. Gestures for HCI are quick, vivid, intuitive, flexible, and visual; they can enable soundless interactions and bridge the gap between the real world and virtual worlds.
To recognize gestures, 3D hand poses are required. Computer-vision-based hand pose estimation enables people to communicate with machines more naturally. With the development of computer vision, pose estimation no longer relies on traditional wearable devices in specific scenes but can be directly implemented based on image recognition. The research on pose estimation in computer vision includes three main categories: depth images, multivision RGB images, and single RGB images. Many studies have estimated hand poses through depth images [1][2][3][4][5][6][7][8][9] and achieved good results. However, depth images must be obtained using indoor depth cameras and are thus not as convenient as RGB images. Multivision has successfully achieved hand tracking and hand pose estimation through RGB images [10] but there are still some constraints on users due to the requirements of multivision.
The goal of hand pose estimation based on computer vision is to free users from the constraints of depth equipment and multivision images and to facilitate HCI through mobile phones and other devices. Therefore, it is of great significance to be able to estimate hand poses based on a single RGB image, which allows full 3D hand poses to be learned from a single RGB image and does not rely on any special equipment or environment.
At present, estimating 3D hand poses from 2D score maps is the most commonly used pose estimation method based on single RGB images. Although some methods for human body posture estimation can turn RGB images into 3D postures, they cannot be directly applied to hand pose estimation. Moreover, hands have a more serious self-occlusion problem than other parts of the human body, as the inside of each hand is asymmetrical while the human body is symmetrical. A recent hand pose estimation method based on a single RGB image first estimates the hand state and the rotation angle relative to the camera, and then calculates the 3D coordinates. Unfortunately, for this method, the estimation is based on only 2D keypoints, and the texture features of the RGB image are ignored. In this paper, a channel fusion attention method is introduced to combine a 2D pose and an RGB image to estimate a 3D pose, which effectively handles the fusion of different types of input data. The input is an RGB image of a human hand. After applying the end-to-end neural network, we obtain a 3D array that contains the spatial locations of the 21 keypoints of the hand in the input image, as shown in Figure 1.
Appl. Sci. 2020, 10, x FOR PEER REVIEW 2 of 16

Figure 1. Our task. The 3D hand pose is estimated by an end-to-end convolutional neural network (CNN). The input is an RGB image, and the output is the 3D coordinate of each keypoint on the hand. (Panels: RGB image; 3D keypoint coordinates.)

Related Work
A 3D hand pose can be estimated from an RGB image or a depth image by using computer vision. We introduce estimation methods based on depth images and RGB images in this section.

3D Hand Pose Estimation Based on Depth Images
Traditional computer-vision-based 3D hand pose estimation uses depth images. Depth images include depth information, which is helpful for obtaining the distances between keypoints.
Markus et al. [1] proposed HandDeep, a preliminary positioning and optimization method based on a convolutional neural network. It can accurately locate a hand in a single depth image after training with multiple labeled depth images. In 2016, Ayan et al. [2] proposed an acceleration method using matrix completion that can be applied to large-scale, real-time hand pose estimation without relying on a GPU. In 2017, Overweger et al. [3] optimized several aspects of the network and the training process and improved the accuracy; the improvements included data augmentation, dropout, the addition of residual modules, and better hand segmentation. In 2014, Tomsons et al. [4] proposed a hand pose recognition method that combines the generative method and the data-driven method. In 2016, Sun et al. [5] proposed an algorithm for matching depth images with models; by constructing a hand model, this method matched the keypoints from the palm to the fingertips. Wan et al. [11] proposed a dense pixelwise method that aggregated local estimates using nonparametric mean shift, explicitly enforcing consistency between the estimated 3D joint coordinates and the 2D and 3D local estimations. This approach provided a better fusion between 2D detection and 3D regression than prior mechanisms and various baselines. In 2018, Aisha et al. [12] addressed gesture segmentation from the first-person perspective and in the presence of occlusion with the aid of a conditional random field (CRF). This was the first method to perform hand segmentation and detection from an egocentric perspective and under occlusion, and the hand pose estimation accuracy was improved by improving the segmentation accuracy. However, this method still did not solve the problem of occluding objects or the similarity between background objects and hands in RGB images.
Motivated by CycleGAN [13], Baek et al. [14] proposed a method for expanding datasets. This method can actively generate keypoint data from the training datasets and restore depth images through a generative adversarial network (GAN) after CycleGAN training. To some extent, this alleviated the lack of training data for some viewpoints. However, a complicated cyclical relationship was used, which made the training process cumbersome and the network complicated. Wan et al. [15] matched depth images to bone images based on a hidden space transformation. The method mapped paired depth and bone images to the same position in the hidden space and restored the original image from the hidden space via deconvolution. Although the accuracy was mediocre, the method could achieve a speed of 90 frames per second (FPS) on a CPU, improving the efficiency of image-based hand pose estimation.
The depth-image-based pose estimation method has gradually matured, but the depth acquisition device, which is sensitive to illumination, jitter, and distance, imposes constraints on the user and is expensive.

3D Hand Pose Estimation Based on RGB Images
Due to the lack of depth information, hand pose estimation based on RGB images, especially single RGB images, developed relatively late. The accuracy of the RGB-based method is not as good as that of the depth-based method. However, an RGB image is easier to obtain, and the equipment is cheaper. Thus, an increasing amount of research focusses on the RGB-based method.
Zhang [10] proposed estimating poses based on multivision, using binocular vision to restore the exact distance information, and thereby realized RGB-based hand pose estimation. However, this method still places a large number of constraints on users. In 2017, Zimmermann [16] realized 3D hand pose estimation from a single RGB image based on deep learning; the method used deep networks to learn reasonable prior information from the data to resolve the ambiguity without relying on any special equipment. It provided a feasible network framework for deriving 3D keypoints from 2D keypoints. The method consisted of three deep networks: the first network performed hand segmentation to locate the hand in the image, the second network estimated the 2D keypoint score maps from the output of the first network using convolutional pose machines (CPMs) [17], and the third network derived the 3D keypoints from the 2D keypoints. Furthermore, the method proposed a normalized coordinate system and treated the hand position in the camera coordinate system as a rotation of the hand position in the normalized coordinate system; the hand position was calculated in the normalized coordinate system, and the rotation angle was estimated by a neural network to restore the positions of the 3D keypoints. This method was the first to achieve 3D hand pose estimation with a single RGB image. Spur [18] used a variational encoder hand pose estimation method to project the image and keypoint information onto the hidden space and optimized the accuracy by minimizing the distance between the image and the keypoint information in the hidden space. Dibra [19] used weakly supervised learning to estimate hand poses. This method does not directly supervise training with the 3D hand keypoints, but rather generates depth images of the estimated 3D hand pose through a GAN. Muller [20] restored occluded hand areas through a GAN, which solved the problem of hand area occlusion to a certain extent.
Of the computer-vision-based hand pose estimation methods, the depth-based method requires more expensive equipment, while the multivision-based method still places certain constraints on the user. Most methods based on a single RGB image use a 2D score map but ignore the information contained in the RGB image. For 3D hand pose estimation from a single RGB image, we propose a method that uses the attention mechanism to fuse the 2D score map and the RGB image channel features. In Section 3, we introduce the prior methods and our method. In Section 4, we introduce our experimental dataset and compare our method with the baseline on the dataset and with the state-of-the-art methods. In Section 5, we summarize our paper.

Method
As shown in Figure 2, the task is divided into three steps: first, the hand bounding box is cropped from the input image; second, 2D keypoints are calculated from the hand bounding box; third, the 3D hand pose is estimated from the 2D keypoints and the hand bounding box. The main step is the third one. We introduce the first and second steps in Section 3.1, and we introduce 3D hand pose estimation from 2D keypoints in Section 3.2. The fusion of the RGB image and the 2D keypoints is described in Section 3.3.
Figure 2. The framework of our method. The process is divided into three steps. First, the RGB image (a) is fed into the network as the input. The hand area (b) is cropped from the whole RGB image using HandSegNet [16]. Then, the 2D score map (c) is estimated according to the hand area. Finally, 3D coordinates (d) are estimated from the 2D score map and cropped hand area by PosePrior. (Panels: cropped hand image; 2D score map; 3D coordinates.)

2D Keypoint Calculation
In this section, we introduce HandSegNet to obtain the bounding box of the hand from the input image and PoseNet [16] to get the 2D keypoints.
We use $J$ to index the hand keypoints; $J \in \{1, \dots, 21\}$, since we use 21 keypoints per hand. $W = \{w_J = (x, y, z),\ J \in [1, 21]\}$ represents the 3D position of each hand keypoint. The input $I \in \mathbb{R}^{w \times h \times 3}$ is the RGB image. The cropped hand image is $I_{mask} \in \mathbb{R}^{w_m \times h_m \times 3}$, which is smaller than the whole input $I$ and includes only the hand mask. $R = (R_x, R_y, R_z)$ represents the rotation angle of the camera coordinate system relative to the world coordinate system. We use 2D Gaussian keypoint score maps $P = \{p_J(u, v),\ J \in [1, 21]\}$ to represent the 2D keypoints, where each score map corresponds to one keypoint and $(u, v)$ is the position at which the Gaussian score map $p_J(u, v)$ is centered. This representation helps the network learn the possible positions of the keypoints during the training process.
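The score-map representation described above can be illustrated with a minimal NumPy sketch. The function name `gaussian_score_map`, the 32 × 32 map size, the value of `sigma`, and the random keypoints are all assumptions made for illustration; the text does not specify them.

```python
import numpy as np

def gaussian_score_map(center_uv, height, width, sigma=2.0):
    """Return a (height, width) map with a Gaussian bump centred at (u, v).

    sigma is a hypothetical spread; the paper does not state its value.
    """
    u0, v0 = center_uv
    v, u = np.mgrid[0:height, 0:width]  # v: row index, u: column index
    return np.exp(-((u - u0) ** 2 + (v - v0) ** 2) / (2.0 * sigma ** 2))

# Hypothetical 2D keypoints (u, v) for the 21 hand joints in a 32x32 crop.
rng = np.random.default_rng(0)
keypoints_uv = rng.integers(4, 28, size=(21, 2))

# One score map per keypoint, stacked into a (21, 32, 32) array.
score_maps = np.stack([gaussian_score_map(kp, 32, 32) for kp in keypoints_uv])
```

Each map peaks at its keypoint, so the keypoint location can be read back as the position of the maximum of the corresponding map.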
2D keypoint calculation is key to 3D keypoint estimation. To calculate the 2D keypoints, HandSegNet is first used to estimate the region of the hands, namely, the handmask $I_{mask}$, from the original image $I \in \mathbb{R}^{w \times h \times 3}$.
The first neural network used in this method is HandSegNet, whose task is to crop the hand region from the image. Directly predicting the horizontal and vertical coordinates of a rectangular block around the hand region is a regression problem. Neural networks are not as good at regression tasks as they are at classification tasks [2]. Thus, we first obtain the mask of the hand and treat the goal as computing a binary image, i.e., we determine whether each pixel in the image belongs to the area of the hand. For each pixel, we calculate the probability $P_i$ that it belongs to the hand mask. When the probability is greater than a threshold, the pixel is considered to belong to the hand mask.
Then, the center of mass of the hand mask is calculated. The hand area is cropped around the center of mass.
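The thresholding, center-of-mass, and cropping steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `crop_hand`, the threshold of 0.5, and the fixed crop size are assumptions, and the per-pixel probability map is a toy array rather than a HandSegNet output.

```python
import numpy as np

def crop_hand(image, prob_map, threshold=0.5, crop_size=8):
    """Crop a square window around the centre of mass of the hand mask.

    threshold and crop_size are hypothetical values; crop_size is assumed
    to be no larger than either image dimension.
    """
    mask = prob_map > threshold            # per-pixel binary decision
    if not mask.any():
        raise ValueError("no pixel exceeds the threshold")
    rows, cols = np.nonzero(mask)
    cy, cx = int(rows.mean()), int(cols.mean())   # centre of mass of the mask
    h, w = mask.shape
    # Keep the window inside the image borders.
    top = int(np.clip(cy - crop_size // 2, 0, h - crop_size))
    left = int(np.clip(cx - crop_size // 2, 0, w - crop_size))
    crop = image[top:top + crop_size, left:left + crop_size]
    return crop, (cy, cx)

# Toy input: a 16x16 RGB image whose "hand" occupies rows 4-9, columns 6-11.
image = np.zeros((16, 16, 3))
prob_map = np.full((16, 16), 0.1)
prob_map[4:10, 6:12] = 0.9
crop, centre = crop_hand(image, prob_map)
```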
Then, $I_{mask}$ is fed into PoseNet to estimate the score maps of the different 2D keypoints from the cropped hand image. Traditional methods directly predict the x and y values of each keypoint, which ignores the connection between the fingers and is a regression task at which neural networks are not good. Instead of estimating only the x and y values of the keypoints, we obtain a score map for each keypoint, such as that shown in Figure 2c. The score map also represents the location of the keypoint, and it is easier for the neural network to learn.

3D Hand Pose Estimation
The 3D hand pose can be estimated by PosePrior by using 2D keypoints. After PoseNet, the score map is sent to the PosePrior network to estimate the 3D hand pose.
The PosePrior network is trained to estimate coordinates within a canonical frame rather than to directly estimate absolute 3D coordinates. In a parallel stream, it also estimates the transformation between the relative 3D coordinates and the canonical frame, which is a 3D rotation matrix called the viewpoint. Two similar streams are used to estimate the viewpoint and the canonical coordinates. In the end, the two estimates are combined to obtain the 3D coordinates.
The 3D coordinates $W$ are expressed in a world coordinate system, $W_{world}$, and in a camera coordinate system, $W_{camera}$. The camera rotation angle $R = (R_x, R_y, R_z)$ is introduced to convert between the two coordinate systems (Equation (1)). 3D hand pose estimation can therefore be divided into two tasks. One is to estimate the hand pose in the camera coordinate system, $W_{camera}$, and the other is to estimate the angle of view, i.e., the camera rotation angle $R$. Finally, the two results are fused to obtain the final coordinates $W_{world}$. Thus, the 3D coordinate transformation network is divided into two subnetworks with the same architecture, as shown in Figure 3: an upper network and a lower network. The upper network estimates the coordinates of the 3D hand keypoints $W_{camera}$ in the standard coordinate system. Meanwhile, the rotation angle $R$ of the standard coordinate system relative to the camera coordinate system is estimated by the lower network.
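To make the fusion of the two estimates concrete, the sketch below rotates a set of keypoints by an estimated rotation. It rests on two assumptions that the text does not confirm: that $R = (R_x, R_y, R_z)$ is an axis-angle vector, and that the conversion is a plain rotation applied to every keypoint. The exact form of the paper's Equation (1) may differ, and the function names are hypothetical.

```python
import numpy as np

def rotation_matrix(axis_angle):
    """Rodrigues' formula: axis-angle vector -> 3x3 rotation matrix.

    Treating R as an axis-angle vector is an assumption of this sketch.
    """
    axis_angle = np.asarray(axis_angle, dtype=float)
    theta = np.linalg.norm(axis_angle)
    if theta < 1e-12:
        return np.eye(3)
    k = axis_angle / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def apply_viewpoint(w_camera, axis_angle):
    """Rotate every keypoint (one per row) by the estimated rotation."""
    return w_camera @ rotation_matrix(axis_angle).T

# 21 hypothetical keypoints; the first one lies on the x axis.
w_camera = np.zeros((21, 3))
w_camera[0] = [1.0, 0.0, 0.0]
# A quarter turn about the z axis.
w_world = apply_viewpoint(w_camera, [0.0, 0.0, np.pi / 2])
```

With this convention, a quarter turn about the z axis moves the point on the x axis onto the y axis, while keypoints at the origin are unchanged.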
Figure 3. The framework of our channel fusion attention mechanism (CFAM), where C represents six convolution operations, F and O represent data, and FC represents fully connected operations. The framework is divided into three parts: frontend, middle, and backend. Our proposed CFAM is highlighted in the figure. In the frontend, we use a cropped hand region to estimate the 2D score map. The features of the 2D score map and the cropped hand are extracted using a CNN. Then, in the middle, we concatenate the feature maps to obtain $F_i$ and process $F_i$ using a channel attention mechanism. Finally, in the backend, the fully connected layer is used to estimate the camera rotation angle and the 3D hand pose in the camera coordinate system, and the 3D hand pose in the world coordinate system is calculated.
Although the above method performs well, it uses only 2D keypoints, and the RGB information is lost. The RGB image contains texture information and implicit spatial information, which are not included in 2D keypoints. The information in the RGB image is essential for ensuring the accuracy of 3D hand pose estimation. To overcome this drawback, we propose a CFAM (channel fusion attention mechanism; see Section 3.3 for details) which makes full use of the RGB image and takes the spatial information into consideration.

Channel Fusion Attention Mechanism (CFAM)
In this section, we introduce CFAM, which fuses the information contained in the RGB image with the score map. If the two are merged directly, the RGB image, which carries less pose information, gains more influence than the amount of information it contains warrants. As shown in Figure 4b, while adding the RGB image can improve the result to a certain degree, CFAM obtains a better result.
Although the score map already contains the vital keypoint position information, the RGB image has information that is not included in the score map, such as the implicit spatial information and the local texture information. The texture features are represented by the gray-level distribution of each pixel and its surrounding space, and they are rotation invariant and strongly resistant to noise. Texture features are not pixel-based features; they require statistical calculations over regions that contain multiple pixels. In pattern matching, such regional features have a clear advantage, since matching does not fail because of local deviations. In addition, the local texture information is repeated to varying degrees, forming the global texture information. The texture feature in the handmask reflects the nature of the global feature and describes the surface properties of the hand corresponding to the image.
We show the architecture of Figure 3 in Tables 1 and 2. Table 1 shows the structure of the frontend, while Table 2 shows the structures of the middle and backend architectures.
The supplementary information from the RGB image can provide strong guidance for restoring the 3D coordinates when used together with the score map. However, if the RGB image is given the same importance as the score map, then the guiding effect of the RGB image could become too strong, ultimately affecting the accuracy of the model. To make full use of the RGB image, we introduce a channel attention mechanism to constrain the influence of each input on the final result. The framework of our proposed method is provided in Figure 3, where the CFAM consists of two components: the frontend and the middle.

Table 1. The frontend architecture, i.e., networks $C_{11}$, $C_{12}$, $C_{21}$, and $C_{22}$, is shown below. The networks have the same structure but differ in terms of the number of input channels. The number of channels is 3 when the input is an RGB image, while it is 21 when the input is a 2D score map. In addition, the networks have different weights. Conv: convolution, ReLU: rectified linear units.

The Frontend: A Fusion Model of the Handmask and the Score Maps
We first propose a fusion model of the handmask and the score maps to consider the implicit spatial information in the RGB image in our CFAM.
There are four parallel processing streams (called $C_{ij}$, $i, j = 1, 2$ in Figure 3) in the frontend of the network. They have almost identical architectures, each consisting of six convolutions with rectified linear unit (ReLU) nonlinearities, but their parameters are not shared. We set the handmask $I_{mask}$ as the input of the first two streams $C_{1j}$ and the score maps $p_J$ as the input of the latter two streams $C_{2j}$. The outputs of the first stream $C_{11}$ and the third stream $C_{21}$ are concatenated to estimate the camera coordinate, while the outputs of the second stream $C_{12}$ and the fourth stream $C_{22}$ are concatenated to estimate the camera rotation angle. The procedure is illustrated by the following formulas:
$$F_{1j} = C_{1j} * I_{mask}, \qquad F_{2j} = C_{2j} * p_J, \qquad F_j = F_{1j} \oplus F_{2j}, \qquad j = 1, 2,$$
where $F_{ij}$ is the output of the convolution process, $*$ represents the operation on the feature maps that $C_{ij}$ acts on, and $\oplus$ is the concatenation of $F_{1j}$ and $F_{2j}$, which gives the fused feature map of branch $j$ (indexed as $F_k$, $k = 1, 2$, in the following). By making full use of $I_{mask}$, the implicit spatial information and the texture information are utilized in 3D hand pose estimation, which remedies the problem of insufficient context: the network obtains more spatial and context information.
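The data flow of the frontend can be sketched in a few lines of NumPy. This is only a shape-level illustration, not the trained network: `stand_in_stream` is a hypothetical placeholder for a stream $C_{ij}$ (pooling plus a random 1 × 1 projection instead of six learned convolutions), and the 128 channels per stream and the 32 × 32 input size are assumptions chosen so that each fused map has the 4 × 4 × 256 shape used by the attention block.

```python
import numpy as np

def stand_in_stream(x, out_channels, seed):
    """Hypothetical placeholder for one frontend stream C_ij.

    The real streams are six conv + ReLU layers. Here the input is only
    average-pooled to 4 x 4 and passed through a random 1 x 1 projection
    followed by ReLU, so that tensor shapes can be checked.
    """
    h, w, c = x.shape  # h and w must be multiples of 4
    pooled = x.reshape(4, h // 4, 4, w // 4, c).mean(axis=(1, 3))  # (4, 4, c)
    weights = np.random.default_rng(seed).standard_normal((c, out_channels))
    return np.maximum(pooled @ weights, 0.0)  # (4, 4, out_channels)

def frontend(image, score_maps, channels_per_stream=128):
    """Return the fused maps [F_1, F_2], where F_j = F_1j (+) F_2j."""
    fused = []
    for j in (1, 2):
        # Separate seeds mimic the fact that the four streams share no weights.
        f_1j = stand_in_stream(image, channels_per_stream, seed=10 + j)       # C_1j
        f_2j = stand_in_stream(score_maps, channels_per_stream, seed=20 + j)  # C_2j
        fused.append(np.concatenate([f_1j, f_2j], axis=-1))  # channel-level fusion
    return fused

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))        # 3-channel image input
score_maps = rng.random((32, 32, 21))  # one score map per keypoint
f_1, f_2 = frontend(image, score_maps)
print(f_1.shape, f_2.shape)
```

In this sketch `f_1` corresponds to the branch that estimates the camera coordinate and `f_2` to the branch that estimates the rotation.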

The Middle: A Channel Attention Block on the Fused Model
Before the two feature maps are further processed by fully connected layers, an attention mechanism is added. Attention mechanisms are widely used in computer vision tasks such as image classification, segmentation, and object detection, where their benefits have been demonstrated. Generally, an attention mechanism biases the allocation of available processing resources toward the most informative components of the input. Hu et al. [21] proposed a squeeze-and-excitation (SE) block to enhance the representational power of basic modules throughout the network. Inspired by this attention model, we use a channel attention block after the convolutional layers. The dimensions of the features from $C$ are 4 × 4 × 256. The feature maps $F_k$ are first passed through a squeeze operation: global average pooling aggregates the feature maps across the 4 × 4 spatial dimensions to produce a channel descriptor. A statistic $L_k$ is generated by shrinking $F_k$ through the 4 × 4 spatial dimensions, where the $i$th ($i \in [1, 256]$) element of $L_k$ is calculated using:
$$L_k(i) = \frac{1}{4 \times 4} \sum_{m=1}^{4} \sum_{n=1}^{4} F_k(m, n, i).$$
This descriptor embeds the global distribution of channelwise feature responses, enabling information from the global receptive field of the network to be leveraged by its lower layers. Then, an excitation operation acts on the descriptor. The operation is given by:
$$R_k = \sigma\left(U_2 \, \delta\left(U_1 L_k\right)\right),$$
where $\delta$ refers to the ReLU function, $\sigma$ refers to the sigmoid function, $U_1 \in \mathbb{R}^{\frac{256}{h} \times 256}$, and $U_2 \in \mathbb{R}^{256 \times \frac{256}{h}}$. To limit model complexity and aid generalization, we first feed the descriptor $L_k$ to a fully connected layer $U_1$, which acts as a dimensionality reduction layer with a reduction ratio $h$, followed by a ReLU. Then, a fully connected layer $U_2$ is used to restore the dimensionality, followed by a sigmoid activation. After the excitation operation, $R_k$ is obtained to describe the weight of each feature map from $F_k$.
Finally, the feature maps $F_k$ from $C$ are reweighted using channelwise multiplication (represented by $\otimes$) between $F_k$ and $R_k$ to generate the output of the channel attention block $O_k$. The activation is given by:
$$O_k = F_k \otimes R_k.$$
The frontend and the middle constitute our CFAM.
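The squeeze, excitation, and reweighting steps described above can be written as a short NumPy sketch. It assumes a channels-last 4 × 4 × 256 layout, omits bias terms, and uses random matrices in place of the learned layers $U_1$ and $U_2$. The reduction ratio `ratio = 16` is an assumption (the value of $h$ is not stated in this section).

```python
import numpy as np

def channel_attention(f_k, u1, u2):
    """Channel attention on one fused feature map.

    f_k: (H, W, C) fused feature map
    u1:  (C // h, C) dimensionality-reducing fully connected layer
    u2:  (C, C // h) dimensionality-restoring fully connected layer
    """
    # Squeeze: global average pooling over the spatial dimensions.
    l_k = f_k.mean(axis=(0, 1))                    # (C,)
    # Excitation: FC -> ReLU -> FC -> sigmoid.
    hidden = np.maximum(u1 @ l_k, 0.0)             # (C // h,)
    r_k = 1.0 / (1.0 + np.exp(-(u2 @ hidden)))     # (C,), one weight per channel
    # Reweighting: r_k is broadcast over the 4 x 4 spatial grid.
    o_k = f_k * r_k                                # (H, W, C)
    return o_k, r_k

rng = np.random.default_rng(0)
channels, ratio = 256, 16  # ratio is an assumed value of h
f_k = rng.standard_normal((4, 4, channels))
u1 = rng.standard_normal((channels // ratio, channels)) * 0.1  # untrained stand-in
u2 = rng.standard_normal((channels, channels // ratio)) * 0.1  # untrained stand-in
o_k, r_k = channel_attention(f_k, u1, u2)
print(o_k.shape, r_k.shape)
```

Because every entry of `r_k` lies strictly between 0 and 1, the block can only attenuate channels relative to one another; it never changes the sign or the spatial layout of a feature map.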

The Backend: Calculate the World Coordinates of the Keypoints
Through the activation above, the network can recalibrate features and learn to use global information to selectively emphasize informative features and suppress less useful ones. The output feature maps $O_k$ from the channel attention block are concatenated with the information that indicates whether the hand is a left or a right hand and are processed further using two fully connected layers. Then, the two parallel streams are fed directly into fully connected layers to estimate the camera coordinate $W_{camera}$ and the camera rotation angle $R$. The final process is as follows, where $FC_k$ is the fully connected operation of stream $k$ and $s$ denotes the hand-side information:
$$W_{camera} = FC_1(O_1 \oplus s), \qquad R = FC_2(O_2 \oplus s).$$
Both estimations are then combined to obtain an estimation of the world coordinate $W_{world}$.
Appl. Sci. 2020, 10, 618
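A rough sketch of the backend under explicit assumptions may help to fix ideas. The text does not spell out how $W_{camera}$ and $R$ are combined, so the code below assumes that $R$ is a three-parameter axis-angle vector converted to a rotation matrix and that $W_{world} = W_{camera} R^{T}$, in the style of Zimmermann's framework [16]. The hidden width of 512, the one-hot hand-side encoding, and the random untrained weights are also illustrative assumptions.

```python
import numpy as np

def dense(x, weights, bias, relu=True):
    """One fully connected layer, optionally followed by ReLU."""
    y = x @ weights + bias
    return np.maximum(y, 0.0) if relu else y

def axis_angle_to_matrix(v):
    """Rodrigues' formula: axis-angle vector (3,) -> rotation matrix (3, 3)."""
    theta = np.linalg.norm(v)
    if theta < 1e-8:
        return np.eye(3)
    x, y, z = v / theta
    k = np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])
    return np.eye(3) + np.sin(theta) * k + (1.0 - np.cos(theta)) * (k @ k)

def head(o_k, hand_side, out_dim, rng, hidden=512):
    """Flatten O_k, append the hand side, apply two hidden FC layers and a linear output."""
    x = np.concatenate([o_k.ravel(), hand_side])
    sizes = [x.size, hidden, hidden, out_dim]
    for i, (n_in, n_out) in enumerate(zip(sizes[:-1], sizes[1:])):
        w = rng.standard_normal((n_in, n_out)) * 0.01  # untrained stand-in weights
        x = dense(x, w, np.zeros(n_out), relu=(i < len(sizes) - 2))
    return x

rng = np.random.default_rng(0)
o_1 = rng.random((4, 4, 256))     # attention output of the coordinate branch
o_2 = rng.random((4, 4, 256))     # attention output of the rotation branch
hand_side = np.array([1.0, 0.0])  # assumed one-hot encoding of left/right

w_camera = head(o_1, hand_side, 21 * 3, rng).reshape(21, 3)    # 21 keypoints
rotation = axis_angle_to_matrix(head(o_2, hand_side, 3, rng))  # assumed axis-angle
w_world = w_camera @ rotation.T   # assumed combination rule
print(w_world.shape)
```

Under this assumed combination, the rotation changes only the orientation of the predicted hand, so the distance of each keypoint from the origin is the same in `w_camera` and `w_world`.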

Experiments
We conducted experiments to verify the proposed model. Our method was implemented in TensorFlow [22]. All experiments were conducted on a Linux computer with one NVIDIA GeForce GTX 1080 Ti GPU (with 11 GB of memory). The batch size is the number of samples processed in one training step, and we set it to 8. We trained the model with the Adam optimizer until the loss stopped decreasing. The learning rate was $1 \times 10^{-5}$ initially and was reduced to $1 \times 10^{-6}$ after 30,000 steps and to $1 \times 10^{-7}$ after 60,000 steps. Both the improved score map detection model and the 3D hand pose estimation model were tested. We highlight the results of the best method in each experiment in bold. In the tables, some wrist prediction errors appear as 0 because we kept only two decimal places, so errors smaller than 0.01 are shown as zero; such values indicate accurate wrist predictions.
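The learning rate schedule described above is piecewise constant and can be expressed compactly. The following sketch is framework independent; the function name `learning_rate` is hypothetical, and treating step 30,000 itself as the first step with the lower rate is an assumption about the exact boundary.

```python
import bisect

BOUNDARIES = [30_000, 60_000]  # steps at which the learning rate changes
RATES = [1e-5, 1e-6, 1e-7]     # one rate per interval

def learning_rate(step):
    """Return the piecewise-constant learning rate for a training step."""
    # bisect_right counts how many boundaries are <= step.
    return RATES[bisect.bisect_right(BOUNDARIES, step)]

for step in (0, 29_999, 30_000, 59_999, 60_000):
    print(step, learning_rate(step))
```

With a batch size of 8, each step consumes 8 samples, so the first rate change occurs after 240,000 training samples have been processed.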

Dataset
Our proposed method is based on a single RGB image whose labels are required for supervision. Traditional depth-based hand pose estimation datasets, such as MSRA [5] and the NYU Hand Pose Dataset by Tompson et al. [4], are not suitable for our method. Thus, we chose two open datasets: a real-world dataset from the Stereo Hand Pose Tracking Benchmark (STB) [10] and a generated dataset called the Rendered Hand Pose (RHD) [16] dataset, both of which contain RGB images of human hands and the position coordinates of 3D keypoints. Twenty-one hand keypoints are included in both datasets, including the palm center (not the wrist or hand center) and four keypoints per finger. Each sample in both datasets includes an RGB image, a handmask image, the rotation of the camera, and the 2D and 3D coordinates of each keypoint.
The RHD dataset is a generated dataset that consists of 39 different actions performed by 20 different persons. The dataset contains 41,258 training samples and 2728 testing samples. Each image has a resolution of 320 × 320 pixels. The STB dataset is a real-image dataset collected with different cameras. It contains 36,000 images and can be divided into six scenarios. Each scene contains two RGB images and a depth image of the same action in different positions. There are 30,000 training images and 6000 testing images, each with a resolution of 640 × 480 pixels. The two datasets include images of the hands of different people.

Assessment Criteria
The error and the area under the curve (AUC) were used to evaluate the experimental results. The error of each keypoint was calculated as follows:
$$E_J = \lVert gt_J - pre_J \rVert_2,$$
where $gt_J$ is the coordinate of the ground truth for keypoint $J$ and $pre_J$ is the estimated coordinate of keypoint $J$. The AUC was based on the percentage of correct keypoints (PCK), i.e., the fraction of keypoints whose error does not exceed a threshold $\tau$:
$$PCK(\tau) = \frac{1}{N} \sum_{J=1}^{N} \mathbb{1}\left(E_J \le \tau\right),$$
where $N$ is the number of evaluated keypoints and $\mathbb{1}(\cdot)$ is the indicator function; the AUC is the area under the PCK curve over a range of thresholds. In addition to evaluating the average error and the average AUC, we used the 21 keypoints to calculate the error and accuracy for each finger. For convenience, in the following subsections, we use "wrist" to represent the palm and "thumb," "index," "middle," "ring," and "little" to represent each finger. We use "GT" to represent the ground truth in the following experiments.
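As an illustration of these criteria, the sketch below computes the per-keypoint error, the PCK at a threshold, and a normalized AUC on synthetic data. The function names are hypothetical, the 20 to 50 mm threshold range is an assumption borrowed from common practice on these datasets, and the AUC is normalized by the width of the threshold range so that it lies in [0, 1].

```python
import numpy as np

def keypoint_errors(gt, pred):
    """Euclidean distance between ground truth and prediction, per keypoint."""
    return np.linalg.norm(gt - pred, axis=-1)

def pck(errors, threshold):
    """Fraction of keypoints whose error does not exceed the threshold."""
    return float(np.mean(errors <= threshold))

def auc(errors, thresholds):
    """Area under the PCK curve (trapezoidal rule), normalized to [0, 1]."""
    values = np.array([pck(errors, t) for t in thresholds])
    area = np.sum((values[1:] + values[:-1]) / 2.0 * np.diff(thresholds))
    return float(area / (thresholds[-1] - thresholds[0]))

# Synthetic example: 21 keypoints with errors of 0, 2, 4, ..., 40 mm.
gt = np.zeros((21, 3))
pred = gt.copy()
pred[:, 0] = np.arange(21) * 2.0

errors = keypoint_errors(gt, pred)
thresholds = np.linspace(20.0, 50.0, 100)  # assumed threshold range in mm
print("mean error:", errors.mean())
print("PCK at 20 mm:", pck(errors, 20.0))  # 11 of the 21 errors are <= 20 mm
print("AUC:", auc(errors, thresholds))
```

The same functions apply to 2D keypoints measured in pixels; only the units of the thresholds change.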

2D Score Map Detection
3D hand pose estimation largely depends on 2D pose estimation, so it can be effectively improved by enhancing the 2D score map estimation. Motivated by Zimmermann et al. [16], who used a convolutional pose machine (CPM) [17] to locate 2D keypoints, we improved the CPM and enhanced the localization accuracy, as Figure 4a and Table 3 show. Table 4 shows the result for each finger. The original RGB images and the RGB images of the cropped hand region are provided in the dataset. We resized the original RGB images to 240 × 320 and the images of the segmented hand region to 256 × 256 during training. To obtain better feature maps and improve the score map estimation accuracy, we added a channel attention block to the CPM. CPMAtt denotes the CPM with the channel attention mechanism added. CPMAtt_gt and CPM_gt are the models applied to the ground-truth-cropped hand images. CPMAtt and CPM are the models applied to the original images, which needed to be cropped by HandSegNet.
Table 3. The mean error and AUC of the 2D keypoints estimation results on the RHD dataset. The best results are highlighted in bold.
By adding the channel attention mechanism, CPMAtt was superior to the CPM in terms of the AUC and error. Even on the HandSegNet-cropped images, the AUC of our model was better than that of the CPM on the GT-cropped images. Regardless of whether the image being tested was cropped using the GT or HandSegNet, our model was an improvement. The AUC was increased by nearly 9 percentage points, and the mean error was reduced by nearly 3 pixels.

3D Hand Pose Estimation with CFAM
To better estimate the 3D hand pose from the previously estimated 2D pose, we propose the CFAM module, which includes the attention mechanism and the fusion of RGB images and 2D score map information. To show that every step of our network design is effective, we used three strategies for comparison:
Strategy 1: adding the channel attention mechanism only.
Strategy 2: adding the fusion of RGB images and 2D score maps without adding the channel attention mechanism.
Strategy 3: adding the full CFAM.
Table 5 and Figure 4b show the effect of our approach on the RHD dataset, while Table 6 shows the effect on each finger. The channel attention mechanism has a certain auxiliary effect on feature acquisition. Therefore, the result of strategy 1 was slightly better than that of Zimmermann's method, but the improvement was not significant: the AUC increased by approximately one percent. Strategy 2 added RGB image-assisted training, and the improvement was significant. Our CFAM (strategy 3) combined the features of RGB images and the 2D score map, reducing the error from strategy 2 by more than 1 mm and reducing the error from Zimmermann's framework by more than 4 mm.
Table 5. 3D hand pose estimation on the RHD dataset from a GT 2D score map and a GT-cropped RGB image. The best results are highlighted in bold.
Strategy 3 was better than strategy 2, and strategy 1 was better than Zimmermann's framework. In both comparisons, the difference was the addition of channel attention, but the gain of strategy 3 was greater than that of strategy 1, even though it is more difficult to improve the accuracy when the accuracy is already high. This indicates that the attention mechanism in our CFAM was effective: it not only played a role in channel attention but also blended the characteristics of the RGB images and 2D score maps, and only then could the results be greatly improved. Among the different methods, strategy 3 (CFAM) achieved the highest AUC. Since the CFAM method had the highest accuracy, we tested it on the STB dataset. Table 7 and the left graph of Figure 5 show the results of our CFAM on the STB dataset from the GT 2D score maps; our CFAM outperformed Zimmermann's framework in terms of both the error and AUC. Table 8 shows the result for each finger; for most hand keypoints, our CFAM outperformed Zimmermann's framework.

Estimating 3D Hand Poses from a Single RGB Image
When estimating the 3D hand pose from the GT 2D score map, our method was superior to Zimmermann's framework. To show that our method was feasible throughout the whole process, we estimated 3D keypoints from a single RGB image, and we also verified the results on the original RGB image, which needed to be cropped by HandSegNet. The GT-cropped RGB image is the RGB image cropped using the ground truth, and the RGB image is the image without cropping, which needs HandSegNet to crop the hand region. The method called "Ours" used CPMAtt to estimate the 2D keypoints and CFAM to estimate the 3D keypoints. Due to the lack of depth information, estimating a 3D hand pose from a single RGB image is challenging. The hand-side information was used in the processing step: the image is flipped if the hand side changes.
The right graph of Figure 5 shows the result using the STB dataset from GT-cropped images, while the left graph of Figure 6 shows the result using the RHD dataset. Tables 9-12 show the results using both datasets and the details for each finger. Our method obtained a better result on the GT-cropped RGB image. We also tested it on RGB images without GT cropping. These images were first cropped by HandSegNet, which may have caused a larger error because of errors in the segmentation. Under all conditions, our method attained better results; therefore, the method was robust and useful for solving this kind of task.
Figure 6. AUC of the 3D hand pose estimation on the RHD dataset from the GT-cropped RGB images (left) and the RGB images (right).
Images cropped using HandSegNet had a certain degree of error; we experimented with the images cropped by HandSegNet to show that our CFAM was not sensitive to these errors. As the right graph of Figure 6 and Tables 13 and 14 show, our method was still better than Zimmermann's framework, which showed that our CFAM can be used in end-to-end 3D hand pose estimation and that our method was superior to the original method in all modules. Many hand pose estimation methods are based on segmented hand images, indicating that they are sensitive to errors in the segmentation process. Our method can estimate hand poses more accurately than other methods in the presence of segmentation errors, and it can be used in tracking and on unsegmented hand images.
Although our method performed better on most fingers, in some experiments, Zimmermann's method obtained a better result for the thumb. Our method focuses more on the global optimum, while Zimmermann's method pays more attention to the accuracy of single fingers. In general, our method worked best, but Zimmermann's method performed better on the thumb.

Comparison with the State-of-the-Art Methods
To demonstrate the advantage of our approach, we compared it with the state-of-the-art methods. Since many methods operate on segmented hand images and most are evaluated on the STB dataset, we also compared them on the segmented hand images of the STB dataset. Table 15 shows that, among all methods, our method achieved the best AUC value. Dibra's method performs weakly supervised learning by reducing keypoints into a depth image and can thereby learn some implicit depth information, but the learned depth information is still less useful than that implied in the original RGB image. Zimmermann, Panteleris, and Spurr use only 2D information to restore the 3D positions and lose some information from the RGB image. Mueller's method restores the occluded hand areas through a GAN, but errors in the image restored by the GAN are magnified as they propagate through the intermediate stages. Since the proposed CFAM module takes the information in both the 2D score map and the RGB image of the hand into account, our method achieved the best result.

Conclusions
We proposed CFAM for estimating the 3D hand pose from a single RGB image. As far as we know, we are the first to use this attention mechanism as a block in 3D hand pose estimation, and the accuracy was a clear improvement over other commonly used methods. By combining a 2D score map and an RGB image, we exploited information in the color image that is missing from the score map. In addition, we used an attention mechanism as a weighting scheme to balance the guiding effect of the two kinds of features, those of the color image and those of the 2D joint points, on the 3D joint point estimation. We validated our method on the RHD and STB datasets. Multiple contrast experiments on public datasets demonstrated that our proposed method could achieve state-of-the-art accuracy, and an ablation experiment showed that an RGB image and a 2D score map could be combined to improve the result of 3D hand pose estimation, which means that the information in the RGB image was also very important. In future research, we will improve the efficiency of the program and simplify the model. Our method can be used in virtual reality equipment to accurately locate joint points.

Conflicts of Interest:
The authors declare no conflicts of interest.