Self-Attention Network for Human Pose Estimation

Abstract: Estimating the positions of human joints from monocular single RGB images has been a challenging task in recent years. Despite great progress in human pose estimation with convolutional neural networks (CNNs), a central problem still exists: the relationships and constraints, such as symmetric relations of human structures, are not well exploited in previous CNN-based methods. Considering the effectiveness of combining local and nonlocal consistencies, we propose an end-to-end self-attention network (SAN) to alleviate this issue. In SANs, attention-driven and long-range dependency modeling are adopted between joints to compensate for local content and mine details from all feature locations. To enable an SAN for both 2D and 3D pose estimation, we also design a compatible, effective and general joint learning framework that mixes data of different dimensions. We evaluate the proposed network on challenging benchmark datasets. The experimental results show that our method achieves competitive results on the Human3.6M, MPII and COCO datasets.


Introduction
Human pose estimation from monocular single images is a fundamental problem in computer vision that provides informative knowledge for numerous applications, including action/activity recognition [1][2][3], action detection [4], human tracking [5], video gaming, surveillance, etc. It is a challenging problem in the presence of self-occlusions and rare poses caused by complex independent joints and high degree-of-freedom limbs, foreground occlusions caused by complex environments, etc. [6]. However, with the development of convolutional neural networks (CNNs) [7], significant progress has been made in recent years. The layers of a CNN generate heat maps that represent the likelihood of each joint location; these heat maps are then regressed to 2D or 3D key-point locations.
Despite their good performance, we find that convolutional methods have difficulty accounting for anatomical relations and constraints. For example, when estimating human poses, CNNs summarize human body shapes more by texture than by geometry, and fail to capture geometrical properties of human bodies such as joint-location limits (for example, elbows lie between hands and shoulders) and left-right symmetry. One possible explanation is that CNN-based approaches rely heavily on convolution operators to model joints across the whole body shape. However, convolution operators have limited receptive fields: long-distance information can only be received after passing through several convolution layers. This can prevent models from learning long-range dependencies, because it is difficult to summarize long-distance information with a small network. As a result, parameters are too sensitive to unseen features. This drops high-level semantic information, which can keep the optimization from reaching a better performance. Although a larger kernel size and more convolution blocks would mitigate this problem, they also increase computation and reduce time efficiency.
Appl. Sci. 2021, 11, 1826
To deal with this problem simply, effectively and efficiently, we introduce a self-attention mechanism [8] into our model, which we call the Self-Attention Network (SAN). It models nonlocal relationships between feature maps and combines long-distance information into the original feature maps, which can significantly increase the performance and efficiency of the model. It is integral, so the training procedure is end-to-end. SANs produce attention masks to reweight the original features. These masks make the model focus more on nonlocal information. The SAN complements convolution operators and learns contextual features and multilevel semantic information across feature maps.
This approach is general and can be used for both 2D and 3D pose estimation without distinction. We propose a joint learning framework to enable the mixed usage of 2D and 3D data. Richly annotated 2D data can complement small-scale 3D data. This framework supports end-to-end training and improves the generalization of the model.
The main contribution of this work is three-fold: (1) We propose a simple yet surprisingly effective self-attention approach (SAN) which exploits long-range dependency between feature maps. It can increase the representation power and performance of convolution operators. (2) We design a joint learning framework to enable the usage of mixed 2D and 3D data such that the model can output both 2D and 3D poses and generalizes better. As a by-product, our approach generates high-quality 3D poses for images in the wild. (3) In experiments, SAN improves on competitive results on the 3D Human3.6M dataset [9] by a large margin and achieves a mean per joint position error (MPJPE) of 48.6 mm. On 2D datasets, SAN achieves competitive results: 91.7% (PCKh@0.5) on MPII [10] and 71.8 (AP) on COCO [11].

Related Work
Human pose estimation has been a widely discussed topic in the past. It is divided into 2D and 3D human pose estimations. In this section, we focus on recent learning-based methods that are most relevant to our work. We will also discuss related works on visual attention mechanisms for complementing convolution operators.
2D human pose estimation: In recent years, significant progress has been made in 2D pose estimation due to the development of deep learning and richly annotated datasets. The authors of [12] stacked bottom-up and top-down processing with intermediate supervision to improve the performance. These methods are used by many researchers for 2D detection in 3D pose estimation tasks. The authors of [13] incorporated a stacked hourglass model with a multicontext attention mechanism to refine the prediction. The authors of [14] learned to focus on specific regions of different input features by combining a novel attention model. Different from these, our approach adopts a self-attention mechanism to increase the receptive field in an efficient way to learn more semantic information.
3D human pose estimation: There are two main ways to recover 3D skeleton information. The first one divides the 3D pose estimation task into two sub-tasks: 2D pose estimation and inference of a 3D pose from a 2D pose [15,16]. This method combines a 2D pose detector [12] and a depth regression step to estimate 3D poses. In this method, the 2D and 3D poses are separated so that 3D pose estimation generalizes to in-the-wild images. The second one directly infers the 3D pose from RGB images [7,17,18]. In this way, the training procedure is end-to-end. We adopted this method for our approach. The authors of [17] proposed a volumetric representation for 3D poses and adopted a coarse-to-fine strategy to refine the prediction. The authors of [7] combined the benefits of regression and heat maps. We also adopted this method to make the training process differentiable and to reduce quantization error to improve network efficiency.
Mixed 2D/3D data training: Although the end-to-end training process is concise, it has a disadvantage: the small scale of annotated in-the-wild 3D data limits performance and accuracy under domain shift. To address this problem, the authors of [19] proposed a network architecture that comprises a hidden space to encode 2D/3D features. The authors of [20] proposed a weakly supervised approach to make use of large-scale 2D data. The authors of [7] used soft argmax to regress 2D/3D poses directly from images. The authors of [21] proposed a method that combines 2D/3D data to compensate for the lack of 3D data. In our approach, we propose a joint learning method that separates x, y, z heat maps to mix 2D and 3D data.
Self-attention mechanism: When humans look at global images, they pay more attention to important areas and suppress other unnecessary information. Attention mechanisms simulate human vision and have achieved great success in computer vision, for example, in scene segmentation, style transfer, image classification and action recognition. In particular, self-attention has been proposed [8] to calculate the response at a position in a sequence by attending to all positions within the same sequence. The authors of [22] proposed a cross-modal self-attention module that captures the information between linguistic and visual features. The authors of [23] assembled a self-attention mechanism into a style-agnostic framework to catch salient characteristics within images. We propose a network that extends self-attention mechanisms to feature maps in the human pose estimation task, to learn anatomical relationships and constraints for better recognition in nonlocal regions.

Model Architecture
In this section, we first describe the problem formally and give an overview of our approach. Then, we introduce the basic idea of our approach.

Overview
Human pose estimation is the following problem: given a single RGB image or a series of RGB images $I = \{I_1, I_2, \ldots, I_i\}$, the pose estimation process aims to localize 2D (or 3D) human body joints in Euclidean space, denoted as $Y = \{y_1, y_2, \ldots, y_k\}$, $y_k \in \mathbb{R}$ (k is the number of key-points).
As mentioned before, the input contains occlusions, including occlusion of the body by objects in the scene and self-occlusion, and ambiguities, including appearance diversity and lighting conditions. During the pose estimation stage, these weak points severely limit prediction ability. To solve this problem in an effective way, we adopt a new architecture, as shown in Figure 1. This is an end-to-end framework including a backbone block, a self-attention network and an upsampling block. Chief among them is the self-attention network, which picks up effective features from input images and generates self-attention masks to reweight the original feature maps in order to learn the long-range dependency between global features. A self-attention layer can capture the relatedness between feature maps and simulate long-distance multilevel associations across joints. The design of self-attention networks will be explained in Section 3.3. Backbone blocks are mainly used to extract features from image batches and upsampling blocks are used to regress feature maps to higher resolutions to refine joint locations. Backbone and upsampling block designs will be explained in Section 3.2.
Driven by the lack of 3D annotated datasets, we adopt a joint learning framework which separates x, y, z location regression in the training process, as explained in Section 3.4. This method enables using mixed 2D and 3D data. It also improves the generalization of the model to real-world scenarios and refines the performance.

Backbone and Upsampling Block Design
In the backbone block, we adopted ResNet [24] to extract features from input images. ResNet replaces the traditional convolution + pooling layers of a deep neural network, which sweep across the image in both horizontal and vertical directions. It adds skip connections to ensure that higher layers perform at least as well as lower layers. Our model preserves conv1, conv2_x, conv3_x, conv4_x and conv5_x and removes the Fully Connected (FC) layer in ResNet. Because we use ResNet-50 and ResNet-152 as our backbones, the kernel sizes and strides differ according to network depth. In the upsampling block, we implemented deconvolution layers to regress the obtained feature maps to a higher resolution. This block refines the joint locations.
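The relationship between input size and heat-map resolution can be illustrated with a small calculation. The sketch below is not the authors' code: it assumes a total backbone stride of 32 (as in a standard ResNet from conv1 to conv5_x), three deconvolution layers, and a padding of 1 for each 4 × 4, stride-2 deconvolution (the kernel and stride are taken from the training section; the padding, the layer count and the example input sides of 256 and 288 pixels are assumptions chosen for illustration).

```python
def deconv_out(size, kernel=4, stride=2, padding=1, output_padding=0):
    """Output length of one transposed convolution along one axis (dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


def heatmap_side(input_side, backbone_stride=32, num_deconv=3):
    """Side length of the heat map after the backbone and the upsampling block.

    backbone_stride and num_deconv are assumptions, not values from the paper.
    """
    side = input_side // backbone_stride
    for _ in range(num_deconv):
        side = deconv_out(side)  # each layer doubles the side under these settings
    return side


if __name__ == "__main__":
    for side in (256, 288):  # hypothetical input side lengths
        print(side, "->", heatmap_side(side))
```

Under these assumptions each deconvolution layer exactly doubles the spatial size, so the upsampling block recovers a factor of 8 of the resolution lost in the backbone.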

Self-Attention Network Design
When observing batches of input images including humans, we find that the relationships and constraints between joints provide additional useful information. Many human pose estimation methods use convolutional neural networks (CNNs), the performance of which is limited by the effective receptive field: they can only process adjacent content in feature maps and cannot process long-range relations or grasp high-level semantic information. To compensate for this drawback, we propose a nonlocal approach called a Self-Attention Network (SAN). SANs not only receive effective features in a local region, but also perceive contextual information over a wide range. The details are shown in Figure 2.
Figure 1 (caption): (b) a self-attention network, which learns long-range dependency to compensate for features lost from the original images and reweights the obtained feature maps. With increasing δ, the model depends more on nonlocal information than on local content. (c) An upsampling block that regresses the feature maps to higher resolutions to refine joint locations.

Feature maps from the previous hidden layer $X \in \mathbb{R}^{C \times B \times H \times W}$ (C is the channel number, B is the batch size and H × W is the number of pixels) are first transformed into three feature spaces (Equation (1)), where t indicates the index of the target feature map. All three space vectors come from the same input. $W_q^t$ is a query space vector and $W_k^t$ is a key space vector; $W_q^t$ and $W_k^t$ are used to calculate the weights that represent the similarity between feature maps. $W_v^t$ is a value space vector and is an output from the original feature maps. Reweighting the long-distance information on $W_v^t$ enables the network to capture joint relationships easily. $\beta_q$ is the weight matrix of the query space vector; $\beta_q$ needs to be transposed for the following operations. $\beta_k$ is the weight matrix of the key space vector. $\beta_v$ is the weight matrix of the value space vector, which maps the input matrix of B × C × W × H dimensions to B × C × W × H dimensions. $\beta_q$, $\beta_k$ and $\beta_v$ are all trainable weight matrices that transform feature maps to the corresponding vector spaces and were implemented as 1 × 1 convolutions in our experiment.
$W_q^t$, as a search vector of the feature maps for one image $x \in \mathbb{R}^{C \times B \times H \times W}$, is matched against the key vector $W_k^t$ of all feature maps in the batch to calculate the positional encoding result, which represents the similarity and relevance of features in the image (Equation (2), written with the ⊙ operator).
$A^t$ is the self-attention distribution and is one element of the self-attention matrix $A \in \mathbb{R}^{C}$. m and k indicate feature map indices. $A^t$ represents the degree of influence of feature map m on feature map k, whereby the model obtains the dependency between any two elements of the global context. $P_m$ is an element of the corresponding feature map, and the dimension of $P_m$ is the same as that of $W_q^m$. In the early stage of training, the feature extraction module is not yet fully trained because of its weight matrices and biases, and it picks up few helpful features, which leads to small values of $A^t$. So, we dropped the $1/\sqrt{d_k}$ factor mentioned in [8] to retain more adjacent information.
Then, we applied the softmax function to the attention mask matrix $A \in \mathbb{R}^{C}$ to acquire the cross-feature probability; this normalizes each row so that it sums to 1. The value space vector $W_v^t$ was then reweighted by the cross-feature probability to obtain $W^t$. $W^t$ is a self-attention mask that captures long-distance multilevel relationships and effectively accounts for the constraints and symmetry relationships between joints.
We added these self-attention masks to the original feature maps, weighted by a trainable variable δ (Equation (3)). δ controls the ratio of local to nonlocal features. For example, at the start of training, the network relies more on local information since it is easier to learn. As training proceeds, the network assigns more weight to long-distance features to refine the prediction. Inspired by [24], this skip connection also passes on more information and mitigates the vanishing gradient problem.
We added batch normalization and ReLU after almost every convolution layer to speed up the training process. We used the mean absolute error (L1 loss) as the criterion.
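The query/key/value projection, softmax normalization and δ-weighted skip connection described in this section can be sketched in a few lines. This is a minimal illustration in the general style of nonlocal self-attention, not a reproduction of the paper's Equations (1)-(3): each row of `x` stands for one flattened feature map, the projections are plain matrices rather than 1 × 1 convolutions, and the names `self_attention`, `beta_q`, `beta_k`, `beta_v` and `delta` are chosen here for illustration. Following the text, the scores are left unscaled (no 1/√d_k factor).

```python
import math


def matvec(w, x):
    """Multiply matrix w (list of rows) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def softmax(row):
    """Numerically stable softmax; the result sums to 1."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, beta_q, beta_k, beta_v, delta):
    """x: list of N feature vectors of length C. Returns (output, attention)."""
    q = [matvec(beta_q, xi) for xi in x]
    k = [matvec(beta_k, xi) for xi in x]
    v = [matvec(beta_v, xi) for xi in x]  # beta_v must map C -> C (assumption)
    # Unscaled dot-product similarity between every query and every key.
    scores = [[sum(a * b for a, b in zip(qi, kj)) for kj in k] for qi in q]
    attn = [softmax(row) for row in scores]  # each row sums to 1
    out = []
    for i, xi in enumerate(x):
        # Attention mask: value vectors reweighted by the attention row.
        mask = [sum(attn[i][j] * v[j][c] for j in range(len(x)))
                for c in range(len(xi))]
        # Skip connection: delta balances nonlocal (mask) and local (xi) content.
        out.append([delta * m + xc for m, xc in zip(mask, xi)])
    return out, attn


if __name__ == "__main__":
    eye = [[1.0, 0.0], [0.0, 1.0]]
    feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(self_attention(feats, eye, eye, eye, delta=0.0)[0])
    print(self_attention(feats, eye, eye, eye, delta=0.8)[0])
```

With `delta = 0.0` the output equals the input, which matches the statement that the network starts from local information only; larger values of `delta` mix in progressively more of the attention mask.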

Joint Learning for 2D and 3D Data
Because Equations (1)-(3) (Section 3.3) are applicable for all x, y, z coordinates in the same way, the output dimension is either 2D or 3D. So, joint learning with mixed 2D and 3D data is straightforward: separating space part x, y from depth part z. 2D data are mainly used to supervise the space part and 3D data for the depth part.
For the acquired 3D heat maps $H_k \in \mathbb{R}^{W \times H \times D}$ of the k joints, x corresponds to width (W), y to height (H) and z to depth (D). The spatial part (H, W) is always required for both 2D and 3D samples. The depth part D is only computed for 3D samples and is set to 0 for 2D samples; for those samples, no gradient is back-propagated from depth (D).
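One way to picture the mixed supervision is a loss that simply skips the depth term for 2D samples. The sketch below is an assumption-laden illustration rather than the authors' implementation: it uses the L1 criterion mentioned earlier, a hypothetical per-sample flag `has_depth`, and averages over the supervised coordinates only (the normalization is a choice made here).

```python
def joint_l1_loss(pred, target, has_depth):
    """L1 loss over joints for one sample.

    pred, target: lists of (x, y, z) tuples, one per joint.
    has_depth: True for 3D samples; False for 2D samples, in which case the
    z term is left out entirely and contributes no gradient.
    """
    total, count = 0.0, 0
    for (px, py, pz), (tx, ty, tz) in zip(pred, target):
        total += abs(px - tx) + abs(py - ty)  # spatial part, always supervised
        count += 2
        if has_depth:
            total += abs(pz - tz)  # depth part, 3D samples only
            count += 1
    return total / count


if __name__ == "__main__":
    pred = [(1.0, 2.0, 3.0)]
    target = [(0.0, 0.0, 0.0)]
    print(joint_l1_loss(pred, target, has_depth=True))
    print(joint_l1_loss(pred, target, has_depth=False))
```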
Taking the x coordinate (width) as an example, we first reduced the 3D heat map to a 1D vector along the width axis and then regressed this 1D vector to the x joint location $J_k^x$. Following the same steps, $J_k^y$ and $J_k^z$ can be inferred. In this way, the locations of x, y, z are separated so we can output 2D and 3D pose estimation results systematically.
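The two-step regression for the x coordinate can be sketched as follows. This is an illustration in the style of the soft-argmax (integral) regression attributed to [7], not the paper's exact equations: the softmax normalization over the whole heat map, the `heatmap[z][y][x]` layout and the function name `soft_argmax_x` are assumptions made for this example.

```python
import math


def soft_argmax_x(heatmap):
    """Expected x location for one joint.

    heatmap[z][y][x] holds raw scores. The heat map is normalized with a
    softmax, summed over y and z into a 1D vector along x, and the expected
    index of that vector is returned.
    """
    flat = [v for plane in heatmap for row in plane for v in row]
    m = max(flat)
    total = sum(math.exp(v - m) for v in flat)
    width = len(heatmap[0][0])
    vx = [0.0] * width  # step 1: 1D vector along the width axis
    for plane in heatmap:
        for row in plane:
            for x, v in enumerate(row):
                vx[x] += math.exp(v - m) / total
    # Step 2: expectation of the index under the 1D distribution.
    return sum(x * p for x, p in enumerate(vx))


if __name__ == "__main__":
    peaked = [[[0.0, 0.0, 50.0, 0.0]]]  # depth 1, height 1, width 4
    print(soft_argmax_x(peaked))
```

Because the result is an expectation rather than a hard maximum, it is differentiable and not restricted to integer grid positions; the y and z coordinates follow by summing over the other two axes instead.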

Training and Data Processing
We used ResNet-50 and ResNet-152 as the backbone networks in our experiments. The models were pretrained on the ImageNet classification dataset. δ was initialized as 0. The upsampling block for the heat maps is fully convolutional. It first uses deconvolution layers (4 × 4 kernel, stride 2) to upsample the feature map to the required resolution (72 × 72 for ResNet-152 and 64 × 64 for ResNet-50). Then, a 1 × 1 convolution layer produces the k heat maps. Two Tesla M40 GPUs and a batch size of 32 were used. Training ran for 200 epochs. The learning rate was 0.0001 and was dropped twice, at the 170th and 190th epochs, with a decay factor of 0.1. The Adam optimizer was used.
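The learning-rate schedule stated above can be written as a small step function. This is a sketch of the schedule only; the function name and the convention that the drop takes effect at the start of the listed epoch are assumptions.

```python
def learning_rate(epoch, base_lr=1e-4, milestones=(170, 190), decay=0.1):
    """Step schedule: multiply base_lr by decay once per milestone reached."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * decay ** passed


if __name__ == "__main__":
    for epoch in (0, 169, 170, 190, 199):
        print(epoch, learning_rate(epoch))
```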
In data processing, the input image was normalized to 288 × 384. Data augmentation included random flip, rotation (±30°), scale (±30%) and translation (±2% of the image size) of the original image. The samples were randomly sampled and shuffled.
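The augmentation ranges can be made concrete by sampling one set of parameters per image. The sketch below only draws the parameters and does not transform any pixels; the uniform distributions, the 0.5 flip probability and the dictionary keys are assumptions, since the text gives only the ranges.

```python
import random


def sample_augmentation(width, height, rng=random):
    """Draw one set of augmentation parameters within the stated ranges."""
    return {
        "flip": rng.random() < 0.5,                       # assumed probability
        "rotation_deg": rng.uniform(-30.0, 30.0),         # rotation of +/-30 degrees
        "scale": 1.0 + rng.uniform(-0.3, 0.3),            # scale of +/-30%
        "shift_x": rng.uniform(-0.02, 0.02) * width,      # +/-2% of image width
        "shift_y": rng.uniform(-0.02, 0.02) * height,     # +/-2% of image height
    }


if __name__ == "__main__":
    print(sample_augmentation(288, 384, random.Random(0)))
```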

Dataset and Evaluation Metrics
MPII: The MPII dataset [10] is the standard benchmark for 2D human pose estimations. The images are collected from online videos covering a wide range of activities and annotated by humans for J = 16 2D joints. It contains 25,000 training images. The evaluation metric is Percentage of Correct Keypoints (PCK).
Human3.6M: The Human3.6M dataset [9] is a widely used dataset for 3D human pose estimations. This dataset contains 3.6 million RGB images captured by the MoCap System featuring 11 actors performing 15 daily activities, such as eating, sitting, walking and taking a photo, from 4 camera views. The evaluation metric is the mean per joint position error (MPJPE), in millimeters, between the ground truth and the prediction across all cameras and joints after aligning the depth of the root joints.
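As a concrete reading of the metric, the sketch below computes MPJPE for a single pose after shifting the prediction so that its root-joint depth matches the ground truth. It is an illustration under assumptions: the root joint index, the (x, y, z) tuple layout in millimeters and the function name are chosen here, and averaging over cameras and frames is left out.

```python
import math


def mpjpe(pred, gt, root=0):
    """Mean per joint position error (mm) for one pose.

    pred, gt: lists of (x, y, z) joint positions in millimeters.
    The prediction is shifted along z so that its root depth equals the
    ground-truth root depth before the Euclidean errors are averaged.
    """
    dz = gt[root][2] - pred[root][2]
    errors = [math.dist((px, py, pz + dz), g)
              for (px, py, pz), g in zip(pred, gt)]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    pred = [(0.0, 0.0, 10.0), (3.0, 4.0, 10.0)]
    gt = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
    print(mpjpe(pred, gt))
```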
COCO: The COCO dataset [11] presents imagery data with various human poses, different body scales and occlusion patterns. The training, validation and test sets contain more than 200,000 images and 250,000 labeled in-the-wild person instances. In total, 150,000 instances are publicly available for training and validation.

Experiments on 3D Pose of Human3.6M
Following the standard protocol in [25], there are two widely used evaluation protocols with different training and testing data: Protocol#1: Five subjects (S1, S5, S6, S7, S8) for training and two subjects (S9, S11) for testing. Mean per joint position error (MPJPE) is used for evaluation.

Ablation Study
An ablation study was conducted on the Human3.6M test set using Protocol#1. The self-attention network, extra 2D data, network depth and computation complexity were considered, as shown in Tables 1 and 2.
Self-Attention Network: Long-range dependency is important for articulated relations in human poses. Comparing methods {b,c} and {f,g} in Table 1, MPJPE improved by 4.3 mm using ResNet-50 and by 1 mm using ResNet-152 when the SAN was included. SANs offer more long-distance information. δ = 0.0 at the beginning and increases during training. This means that long-distance information is more important in higher-level decision-making processes. Considering methods {d,e} in Table 1, the larger image resolution in our approach decreased MPJPE by 1.2 mm.
Figure 3 shows the relationships between human joints as δ changes. δ = 0.0 means that the network has not yet introduced the SAN, because proximity information is easier to obtain at the beginning. Therefore, the numbers representing the feature gain between joints are small. In comparison, the diagonal numbers are the largest, which means the current network depends more on neighboring features. As δ increases, the network introduces more anatomical relationships and reweights the original feature maps. When δ = 0.8, the larger numbers mean that joints have strong correlations with other nearby and symmetrical joints. Nearby joints also have larger similarities than remote joints; for example, joint 0 has larger constraints with joint 7 than with joint 15. Information is transmitted joint by joint such that the model perceives more useful features. By adding this module, long-distance information is transferred between joints and compensates for local content.

Extra 2D data: As shown in Table 1, training with both 2D and 3D data provides a significant performance gain: MPJPE dropped by 5.3 mm when adding the MPII dataset and by 4.8 mm when adopting the COCO dataset. This verifies the effectiveness of the joint learning framework in our training process.
Network depth: From methods {e,f} in Table 1, the performance is enhanced by a deeper ResNet network: MPJPE drops by 3.3 mm when going from ResNet-50 to ResNet-152.

Discuss
Computation complexity: The computation complexity is reported in Table 2.

Quantitative Results
The evaluation results in Table 3 show that SAN achieved good results under all protocols. Note that many leading methods have complex frameworks or learning strategies. Some methods aim at using in-the-wild images [19,20,32] or exploiting temporal information [28,30,33]. These methods have different research targets; we nevertheless included some of them in the evaluation for completeness. There are three main findings: (1) Introducing a self-attention mechanism is effective, and the proposed SAN outperforms many different types of methods, including end-to-end methods [7,17] and two-stage methods [16,19]. (2) Joint learning frameworks for 2D and 3D data are helpful [16,20]. They increase the robustness of our model on in-the-wild images. (3) Our approach showed a competitive performance on average: 48.6 mm (MPJPE) and 40.6 mm (PA-MPJPE). We improved on previous methods by a large margin for actions such as phoning and posing. The results demonstrate the effectiveness of our approach.
Table 3. Comparison of mean per joint position error (mm) in Human3.6M between the estimated pose and the ground truth. Lower values are better, with the best in bold, and the second best underlined.

Figure 4 shows the qualitative results of 3D human poses. Input images are from the Human3.6M and MPII datasets. The estimated poses are accurate in both constrained and in-the-wild environments, which shows the robustness and generalization of our model. Figure 5 shows failure cases. In Table 3, the sitting down activity has the worst results among all activities in many studies. A possible reason is that such images contain serious self-occlusions, which cause joints to overlap. The prediction accuracy for these activities could be improved by adding a block that mines difficult cases.

Experiment on 2D Pose of MPII and COCO
A joint learning framework enabled our model to produce high-quality 2D key-point results. We carried out experiments on the MPII and COCO datasets to evaluate these results. Our results were first evaluated on an MPII validation set of about 3000 images, held out from the training set, and the evaluation metric was PCK at a normalized distance of 0.5 (PCKh@0.5). Then, our results were evaluated on COCO test-dev, and the evaluation metrics were AP, AP50, AP75, APM and APL.
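To make the PCKh@0.5 metric concrete, the sketch below counts a joint as correct when its distance to the ground truth is at most 0.5 times the head size. It is a simplified illustration for one person: the function name, the use of a single precomputed `head_size` value and the inclusive comparison are assumptions, and details of the official MPII evaluation (such as handling of unannotated joints) are left out.

```python
import math


def pckh(pred, gt, head_size, threshold=0.5):
    """Percentage of joints within threshold * head_size of the ground truth.

    pred, gt: lists of (x, y) joint positions in pixels for one person.
    head_size: head segment length in pixels for that person.
    """
    limit = threshold * head_size
    correct = sum(1 for p, g in zip(pred, gt) if math.dist(p, g) <= limit)
    return 100.0 * correct / len(gt)


if __name__ == "__main__":
    pred = [(0.0, 0.0), (10.0, 0.0)]
    gt = [(0.0, 0.0), (0.0, 0.0)]
    print(pckh(pred, gt, head_size=10.0))
```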

Quantitative Results
Tables 4 and 5 report the comparison results on MPII and COCO, respectively. Our model achieves 91.7% (PCKh@0.5) on the MPII dataset and 71.8 (AP) on the COCO dataset, which is competitive with the compared methods. Combining the results on 3D and 2D data, we can conclude that: (1) The joint learning framework is effective. It handles 2D and 3D data in a simple way during training. (2) 2D data, with their rich annotations, also increase 3D performance and help the network produce high-quality 3D poses for in-the-wild images.

Qualitative Results
With the help of the joint learning framework, our approach outputs both 2D and 3D poses from in-the-wild images at the same time. We visualize example 2D prediction results in Figure 6. Our method is robust in extremely difficult cases. The proposed SAN performs well and generalizes to unconstrained images. This also shows that rich 2D annotated data increase 3D performance.

Conclusions
In this paper, we propose a simple yet surprisingly effective self-attention network (SAN) for human pose estimation. SANs not only address the drawbacks of convolution operators, which perceive only local information and enlarge receptive fields in a computationally inefficient way, but also combine long-range dependency and multilevel information with convolution operators to enhance representation power and performance. We also introduce a joint learning framework for 2D and 3D data in the training procedure, so that our network can output both 2D and 3D poses. Experimental results show that introducing the SAN significantly improves performance. Our complete pipeline achieves competitive results on the Human3.6M 3D benchmark and on the MPII and COCO 2D benchmarks. As a by-product, our approach generates high-quality 3D poses for images in the wild.
Author Contributions: H.X., project administration, conceptualization, writing-review and editing; T.Z., investigation, methodology, writing-original draft. All authors have read and agreed to the published version of the manuscript.


Data Availability Statement:
The data presented in this study are available on request from the corresponding author.