Pointless Pose: Part Affinity Field-Based 3D Pose Estimation without Detecting Keypoints

Abstract: Human pose estimation finds applications in an extremely wide range of domains and is therefore never pointless. We propose in this paper a new approach that, unlike any prior one that we are aware of, bypasses the 2D keypoint detection step on which 3D pose estimation is usually based, and is thus pointless. Our motivation is rather straightforward: 2D keypoint detection is vulnerable to occlusions and out-of-image absences, in which case the 2D errors propagate to the 3D recovery and deteriorate the results. To this end, we resort to explicitly estimating the human body regions of interest (ROIs) and their 3D orientations. Even if a portion of the human body, like the lower arm, is partially absent, the predicted orientation vector pointing from the upper arm will take advantage of the local image evidence and recover the 3D pose. This is achieved, specifically, by deforming a skeleton-shaped puppet template to fit the estimated orientation vectors. Despite its simple nature, the proposed approach yields robust, state-of-the-art results on several benchmarks and on in-the-wild data.


Introduction
Human pose estimation aims at recovering the coordinates of a human body captured from one or multiple images, and therefore plays a vital role in an exceptionally broad spectrum of applications. Thanks to the recent development of deep learning, 2D human pose estimation has witnessed unprecedented advances [1][2][3]. Despite the encouraging progress, estimating 3D poses from a single image, being an ill-posed problem by nature, remains a challenging task.
Traditional 3D pose estimation methods first detect 2D body keypoints in the input image and then map the 2D detections back to the 3D world. The advantage of building 3D pose estimation on 2D keypoint detection is that the latter is a mature technique with high generalization capacity, so a 3D pose estimation method that only requires 2D keypoint locations as input automatically inherits this generalization capacity. However, depth information, which is crucial to 3D pose estimation, is completely lost in the 2D keypoint estimation process, making the subsequent 2D-to-3D regression ill-posed and thus ambiguous. Estimating the depth along with the 2D keypoints can solve this problem in theory, but this task is almost as difficult as 3D pose estimation itself. Some methods [4][5][6][7][8][9][10] tried to estimate the relative depth simultaneously with the 2D keypoints. However, these methods require camera parameters in post-processing, which greatly limits their range of application. In addition, 2D keypoints can in most cases be robustly detected only if they are present in the image. Such a prerequisite is unfortunately too strong for real-world application scenarios, where heavy occlusions and out-of-image absences of body joints frequently occur and thus collapse the 3D estimation results.
We propose in this paper an end-to-end pointless 3D pose estimation approach that bypasses the 2D keypoint detection step to avoid the problems mentioned above. Our method directly estimates from images 1D, 2D and 3D part affinity fields (PAFs), which represent the regions, 2D orientations and 3D orientations (encoded in color) of different body parts, respectively. This fully PAF-based estimation proves to be robust to keypoint absence. In addition, the proposed pipeline achieves outstanding generalization ability to in-the-wild images with a simple semi-supervised approach. Our method does not require any camera parameters, which means it can be easily applied to other testing images.
An overview of the proposed method is illustrated in Figure 2. Specifically, we use a fully convolutional neural network (FCNN) to simultaneously predict 1D, 2D and 3D PAFs, which represent the regions, 2D orientations and 3D orientations (encoded in color) of different limbs, respectively. To improve the generalization ability of the network, we train the FCNN on a mixture of the 2D pose dataset MPII [11] and the 3D pose dataset Human3.6M [12] in a semi-supervised manner. Once the PAFs are predicted by the FCNN, we first refine the 3D PAFs to remove the noise in them. The estimated PAFs are then aggregated to form the 3D orientation vectors over the human body, which are further aligned with a skeleton-shaped puppet template to produce the final 3D pose estimation result. The puppet adopted here features limbs of fixed sizes and adjustable body joints and is, in this process, deformed in a way that exactly fits the estimated 3D orientation vectors. We choose to freeze the limb lengths of the puppet because absolute length estimation from a single color image is in many cases unreliable, given that persons of different heights could have identical 2D projections. It is worth noting that all the post-processing steps are differentiable, which makes end-to-end training possible.
Despite its very simple nature, the proposed approach yields encouraging performance on lab-monitored benchmarks as well as on in-the-wild images where large portions of human bodies are absent from the scene. To gain deeper insight into the behavior of the approach, we also introduce a new orientation-based evaluation metric for 3D pose estimation, which explicitly accounts for the angle between the estimated and the ground truth 3D limb vectors. Under both the keypoint-based and the orientation-based metrics, the proposed approach accomplishes state-of-the-art 3D pose estimation results. In summary, our contribution is an end-to-end pointless approach towards the never-pointless 3D human pose estimation. Our method bypasses the error-prone 2D keypoint detection step by substituting it with an orientation-based estimation that recovers the 3D orientations of a subject, and then deforms a fixed-size puppet template to fit the predicted directional vectors so as to produce the final 3D pose estimation. Such orientation-based estimation allows us, in many cases, to remedy the partially absent body parts that occur frequently in real-world scenarios. Experiments on several benchmarks and in-the-wild data show that the proposed approach, despite being simple, achieves state-of-the-art results in terms of both the conventional keypoint-based and the newly proposed orientation-based evaluation metrics.

Related Work
In this section, we briefly review approaches related to ours. We categorize them into two overlapping groups: methods relying on 2D keypoint detection and those explicitly using part affinity fields, where all methods in the latter group, in fact, utilize 2D keypoint detections as well.

2D Keypoint Estimation Based Methods
The two most widely used pipelines are shown in Figure 1a,b. Methods that follow pipeline (a) divide the 3D pose estimation task into two steps, 2D pose estimation and 3D pose inference. These methods comprise a 2D keypoint detector and a subsequent optimization [13][14][15] or regression [16][17][18][19][20][21][22][23][24][25][26] step to estimate the 3D pose. Early efforts on 3D pose estimation used dictionary learning, with the assumption that a 3D pose can be represented by a linear combination of a set of base poses [13][14][15]. Recently, many researchers have begun to use neural networks for 3D pose regression. For instance, Moreno-Noguer [21] used a Convolutional Neural Network (CNN) to regress the 3D joint distance matrices instead of 3D poses. Sun et al. [19] proposed to regress the bones instead of the joints by re-parameterizing the pose representation. Lee et al. [24] proposed a long short-term memory (LSTM) architecture to reconstruct 3D depth from the centroid to the edge joints through learning the joint inter-dependencies. Chen et al. [27] proposed to improve 3D human pose estimation by synthesizing human images [28,29]. Hossain et al. [30] designed an LSTM-based sequence-to-sequence network to estimate a sequence of 3D poses from a sequence of 2D poses. Given the fact that 2D-to-3D inference is an ill-posed problem, methods along this line are prone to ambiguities in the 2D-to-3D regression at the second stage of this pipeline if no additional image evidence is utilized.
The major difference between pipelines (a) and (b) is that the latter learns additional image evidence, such as the depths of joints, to help the 3D inference. Pons-Moll et al. [31] proposed an extensive set of posebits representing the boolean geometric relationships between body parts, and designed an algorithm to select useful posebits for 3D pose inference. Nie et al. [22] used the 2D keypoints and the corresponding local image patches to predict the depth of human joints. Zhou et al. [32] proposed to learn the 2D keypoint locations and the corresponding depth using a weakly supervised approach. Pavlakos et al. [33] predicted the depth of human joints using manually annotated ordinal depth supervision with a ranking loss. Wang et al. [34] defined pose attributes as intermediate image cues to reduce the ambiguity in lifting a 2D pose into 3D space. These methods require that all the human keypoints are present in the image, which is unfortunately too strong a requirement for real-world application scenarios, where out-of-image absences of body joints frequently occur and thus collapse the 3D estimation results.

Part Affinity Fields Based Methods
The part affinity field was originally proposed by Cao et al. [3]. In their work, 2D PAFs were used to help link the keypoints of a person in the multi-person 2D pose detection problem. After that, several early attempts were made to use 3D PAFs for 3D pose estimation. Luo et al. [35] and Xiang et al. [36] followed the idea of Cao et al. to predict 2D keypoint heatmaps and 3D PAFs. In their methods, the 3D orientations were extracted according to the predicted 2D keypoint locations. Unfortunately, this step is non-differentiable, making end-to-end training infeasible. In the work of Liu et al. [37], 3D PAFs are used as additional image evidence to improve the 2D-to-3D regression. All these methods still rely heavily on 2D keypoint detection and, as a result, remain fragile to keypoint absence.

Our Approach
Our pointless 3D pose estimation method substitutes the detection of the only intermittently visible 2D points, with the estimation of 1D PAFs, i.e., the regions of limbs. This replacement not only makes our method robust to the cases with visually absent keypoints, but also provides us a simple and differentiable way to extract 3D vectors from PAFs, making end-to-end training possible. Lastly, we introduce an auxiliary task, the 2D PAFs estimation, which enables us to train the network on 2D pose dataset for better generalization ability. Compared to prior PAF-based methods, our method is end-to-end, robust to partial absence of body parts from the image, and achieves excellent generalization capacity to in-the-wild images.

Method
Our method takes as input a 256 × 256 color image and predicts three groups of part affinity fields simultaneously. Then the predicted 3D PAFs, which could be noisy, are refined by the 1D and 2D PAFs through a parameter-free process. The 3D limb orientations are obtained by averaging and unitizing each PAF. Then we inject the 3D orientations into a puppet with fixed limb lengths to obtain the final 3D pose output. The first step, predicting the PAFs, is the only step with learnable parameters in our method. All the remaining steps are differentiable, enabling us to train the whole model end-to-end. In the following sections, we explain these steps in detail.

Consistent 1D/2D/3D PAFs Representations
A part affinity field is designed to represent vectorised information, such as directional vectors, in a specific region of interest. In this paper, the 1D PAF is taken to be a binary map that indicates whether each pixel is in the region of interest or not. In human pose estimation, the exact human limb region is unavailable, so we use two adjacent keypoints and a fixed width d to define a rectangular region of a limb. When the Euclidean distance between two adjacent keypoints is smaller than 2d, we define the region as a circle centered at the mid-point of the two keypoints with a radius of r = √2·d, to avoid too small a region. Figure 3 shows two example ROIs. Here a limb is defined as the body part that connects two adjacent keypoints. For the 2D and 3D PAFs, the region is defined in the same way as for the 1D PAF. The only difference is that in an N-D PAF, each pixel represents an N-D vector. In this paper, we use PAFs to represent 2D and 3D limb directions, so the vectors in the PAFs are unitized.
In prior works that adopt PAFs [3,35,36,37], a two-branch architecture is used to learn two inconsistent sub-tasks, PAF estimation and 2D keypoint heatmap estimation. This inconsistency leads to a non-differentiable post-processing step, so that end-to-end training is infeasible. However, end-to-end training is especially important in orientation-based 3D pose estimation methods [19], because a long-term objective is needed to achieve an overall better pose estimation; otherwise the errors in each part accumulate and result in large errors at the far-end body joints.
In this paper, we propose a three-branch architecture to learn three different but consistent sub-tasks, 1D, 2D and 3D PAF estimation, respectively. This design not only simplifies the post-processing, but also makes it differentiable, so that our method is end-to-end. In addition, in previous works, the networks have to learn two totally different representations, i.e., Gaussian kernel based keypoint heatmaps and limb region based PAFs. In our method, the three sub-tasks are more consistent than in the previous keypoint-based design, because they are all limb region based and orientation based. The consistency among the three branches makes it easier for the network to learn domain-independent features, which is crucial to improving the generalization capacity of the network.
Figure 4 illustrates the core architecture of our FCNN. In stage t, the feature map F_t is first fed into an hourglass [2] block, and then flows through three branches simultaneously to predict the 1D/2D/3D PAFs. After that, all the PAFs are transformed by 1 × 1 convolutional layers into feature maps with the same number of channels as F_t. Then the feature maps, including F_t, are added up to generate a new feature map F_{t+1} as the input for the next stage. In the training phase, each stage produces a group of PAFs and a 3D pose. In the inference phase, only the outputs of the last stage are used as the final predictions.
The 1D and 2D branches are fully supervised, because both datasets provide 2D annotations, from which we can generate the ground truth to fully supervise the learning of the 1D and 2D PAFs. For the 3D branch, on average only half of the training examples, that is, those that come from the 3D dataset, have 3D PAF supervision, so it is semi-supervised. We use the average gradient of the 3D training examples in each mini-batch to approximate the gradient of the whole mini-batch. When there are no 3D examples in a mini-batch, we simply set the gradients of the parameters in this branch to zero, so that its weights are not updated in that backward pass. In other words, the 3D branch only sees training examples from Human3.6M. Surprisingly, the automatically learnt features F_t, which are shared by the three branches, are sufficiently domain-independent. As a result, although the 3D branch only receives supervision obtained from a monitored indoor environment, the network generalizes well to in-the-wild images.
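The region definition above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names `limb_roi_mask` and `limb_paf` are hypothetical, d is assumed to be the half-width of the rectangle measured perpendicular to the limb axis, and the radius is read as √2 times d.

```python
import numpy as np

def limb_roi_mask(p_a, p_b, d, height, width):
    """Binary 1D PAF for the limb between 2D keypoints p_a and p_b, given as (x, y).

    Assumption: d is the half-width of the rectangle across the limb axis.
    """
    p_a = np.asarray(p_a, dtype=float)
    p_b = np.asarray(p_b, dtype=float)
    ys, xs = np.mgrid[0:height, 0:width]
    pix = np.stack([xs, ys], axis=-1).astype(float)  # (H, W, 2) pixel coordinates

    length = np.linalg.norm(p_b - p_a)
    if length < 2.0 * d:
        # Short limb: circle around the mid-point with radius sqrt(2) * d.
        mid = 0.5 * (p_a + p_b)
        return np.linalg.norm(pix - mid, axis=-1) <= np.sqrt(2.0) * d

    axis = (p_b - p_a) / length
    rel = pix - p_a
    along = rel @ axis  # projection onto the limb axis
    across = rel[..., 0] * axis[1] - rel[..., 1] * axis[0]  # signed distance to the axis
    return (along >= 0) & (along <= length) & (np.abs(across) <= d)

def limb_paf(mask, direction):
    """N-D PAF: the unitized direction inside the ROI, zeros elsewhere."""
    v = np.asarray(direction, dtype=float)
    v = v / np.linalg.norm(v)
    return mask[..., None] * v
```

Because the 1D, 2D and 3D PAFs share the same mask, only the `direction` argument changes between the 2D and 3D ground truth of a limb.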
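The mini-batch handling of the 3D branch described above can be illustrated as a masked mean over per-example losses. The sketch below is framework-agnostic NumPy and only shows the reduction; the name `semi_supervised_mean` and the boolean flag `has_3d` are hypothetical, and in an automatic-differentiation framework the constant zero returned for a batch without 3D examples would correspond to zero gradients for the branch.

```python
import numpy as np

def semi_supervised_mean(per_example_loss, has_3d):
    """Average the 3D loss over the examples that carry 3D supervision.

    Returns 0.0 when the mini-batch contains no 3D example, so the
    3D branch receives no update from that batch.
    """
    per_example_loss = np.asarray(per_example_loss, dtype=float)
    has_3d = np.asarray(has_3d, dtype=bool)
    if has_3d.sum() == 0:
        return 0.0
    return float(per_example_loss[has_3d].mean())
```

For a mixed batch, the values of the 2D-only examples are ignored regardless of what the 3D branch outputs for them.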
It is worth noting that our method achieves a better generalization ability (See the results in Section 4.4), without applying any weakly supervised loss or GAN loss to the 2D training examples [32,33,38,39]. This greatly simplifies the training process and makes it easier to re-implement.

The Loss Functions
In this paper, the learning of 1D PAFs is taken to be a pixel-wise binary classification problem. The value at each pixel in the 1D PAF indicates the probability that this pixel lies inside the region of interest. In other words, the predicted 1D PAFs are also the limb region confidence maps, which can simplify the post-processing and make it differentiable. We use the Binary Cross Entropy (BCE) loss for 1D PAF learning, computed between p_n^i and q_n^i, the predicted and ground truth probabilities at pixel n in training example i.
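For reference, the standard pixel-wise BCE loss written with the symbols of this section takes the following form. This is a sketch of the textbook definition, not necessarily the exact normalization used in the paper: the number of training examples I and the number of pixels M are symbols introduced here for illustration, and the stage index is omitted.

```latex
\mathcal{L}_{1} = -\frac{1}{I M} \sum_{i=1}^{I} \sum_{n=1}^{M}
  \left[ q_n^i \log p_n^i + \left(1 - q_n^i\right) \log\left(1 - p_n^i\right) \right]
```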
The 2D/3D PAF learning is taken to be a regression problem. As we mentioned above, the 2D/3D PAFs represent both limb regions and 2D/3D limb orientations. In orientation learning based 3D pose estimation, the major target is to learn the directions, rather than the regions, which means a PAF with correct orientation estimation but inaccurate region detection is absolutely acceptable. However, the Mean Squared Error (MSE) loss, used by previous PAF-related works, will give a large penalty to these acceptable prediction cases.
To this end, we propose a boundary-insensitive loss function for N-D PAF learning. The basic idea is that we assign small weights to the pixels where the region prediction is incorrect, to reduce the impact of wrong region predictions. This is achieved by utilising the 1D PAF predictions, which represent region prediction confidence maps, to define a pixel-wise weight map from the predicted probability p_n^i and the ground truth probability q_n^i, in which w_0 is the lower bound of the weights and σ is the standard deviation of the Gaussian distribution. We set w_0 = 0.2 in our experiments. The weight function can be taken as a soft exclusive-NOR (XNOR) function. That is, an output weight of 1 is obtained only if both of its inputs are at the same probability level, either high or low. Otherwise, if they are at different probability levels, the weight approaches the lower bound w_0. The ground truth probability q_n^i is binarized, so that the case in which the prediction is correct but the weight is small can never happen. The proposed boundary-insensitive loss is then an MSE loss masked with the weight defined above, where N = 2, 3 and x_n^i and y_n^i represent the predicted and ground truth N-D directions at pixel n in training example i. The proposed loss function gives a small penalty to a pixel if its location prediction is wrong, no matter whether its direction prediction is correct or not, while it still gives a large penalty to a pixel if its location prediction is correct but its direction prediction is wrong. The lower bound w_0 prevents the loss function L_N^t from being trivially minimized by taking p = 1 − q. We want to make it clear that the proposed boundary-insensitive loss does not boost the 3D pose prediction accuracy, but it can suppress the oscillation in training and thus speed up the convergence.
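The behaviour of the weight map and the masked MSE can be illustrated with a small NumPy sketch. The exact functional form is an assumption made here for illustration: a Gaussian of the difference p − q, rescaled to the range from w_0 to 1, which matches the soft XNOR description above. The value σ = 0.3, the mean reduction and the names `soft_xnor_weight` and `boundary_insensitive_loss` are likewise hypothetical.

```python
import numpy as np

def soft_xnor_weight(p, q, w0=0.2, sigma=0.3):
    """Pixel-wise weight: 1 when p agrees with the binarized q, near w0 otherwise.

    Assumed form: w = w0 + (1 - w0) * exp(-(p - q)^2 / (2 * sigma^2)).
    """
    q_bin = (np.asarray(q, dtype=float) > 0.5).astype(float)
    diff = np.asarray(p, dtype=float) - q_bin
    return w0 + (1.0 - w0) * np.exp(-diff ** 2 / (2.0 * sigma ** 2))

def boundary_insensitive_loss(x, y, p, q, w0=0.2, sigma=0.3):
    """MSE between N-D PAFs x and y of shape (H, W, N), masked by the weight map."""
    w = soft_xnor_weight(p, q, w0, sigma)  # (H, W)
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    sq_err = np.sum(diff ** 2, axis=-1)    # (H, W) squared direction error
    return float(np.mean(w * sq_err))
```

With these defaults, a pixel whose region prediction is confidently wrong contributes roughly one fifth of the penalty of a pixel with the same direction error but a correct region prediction.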

Differentiable Post-Processing
In this section, we introduce how we extract the 3D directions from the noisy PAF predictions in a differentiable way. Our post-processing consists of two parameter-free steps, 3D PAF refinement and 3D orientation injection.

3D PAF Refinement
The 3D PAF refinement is further divided into two steps: denoising and 2D/3D PAF ensemble. As shown in Figure 2b, the predicted 1D and 2D PAFs are usually clean, but the 3D PAFs can be very noisy, typically in pixels beyond the body region. This is expected, since the 3D branch has never been trained on in-the-wild images with 3D supervision. We use the much cleaner 1D PAF predictions to filter out the noise in the 3D PAFs by a masking operation. This is one of our motivations for designing the 1D PAF branch.
As we mentioned in Section 3.1, the directional vector represented by the first two dimensions of the 3D PAFs is approximately parallel to that of the 2D PAFs, which can be used to further refine the 3D PAFs. Specifically, we rescale each 2D limb directional vector so that its L2 norm equals that of the first two dimensions of the corresponding 3D PAF vector. Then we replace the first two dimensions of the 3D directional vector with the average of the two 2D vectors. The rationale behind this is that we take the 2D PAFs and the first two dimensions of the 3D PAFs as predictions produced by two separate models. The final prediction is the ensemble of the two, which often improves the prediction by voting.
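The two refinement steps, followed by the averaging and unitizing that yields one orientation per limb, can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the masking is taken to be an element-wise multiplication by the 1D PAF, the small constant `eps` is added only for numerical safety, and the function names are hypothetical.

```python
import numpy as np

def refine_3d_paf(paf1d, paf2d, paf3d, eps=1e-8):
    """Denoise a 3D PAF of shape (H, W, 3) and fuse its xy-part with the 2D PAF (H, W, 2)."""
    paf1d = np.asarray(paf1d, dtype=float)
    paf2d = np.asarray(paf2d, dtype=float)
    paf3d = np.asarray(paf3d, dtype=float)
    # Denoising: suppress 3D vectors outside the predicted limb region.
    masked = paf3d * paf1d[..., None]
    xy = masked[..., :2]
    # Rescale each 2D vector to the length of the xy-part of the 3D vector.
    norm_xy = np.linalg.norm(xy, axis=-1, keepdims=True)
    norm_2d = np.linalg.norm(paf2d, axis=-1, keepdims=True)
    scaled_2d = paf2d * norm_xy / (norm_2d + eps)
    # Ensemble: average the two 2D estimates and keep the z-component.
    fused_xy = 0.5 * (xy + scaled_2d)
    return np.concatenate([fused_xy, masked[..., 2:]], axis=-1)

def limb_orientation(paf3d, eps=1e-8):
    """Average all vectors of a refined 3D PAF and unitize the result.

    Pixels outside the ROI are close to zero after masking, so they only
    change the scale of the mean, which the normalization removes.
    """
    mean_vec = np.asarray(paf3d, dtype=float).reshape(-1, 3).mean(axis=0)
    return mean_vec / (np.linalg.norm(mean_vec) + eps)
```

Every operation here is a product, sum, norm or division, which is consistent with the statement that the post-processing is differentiable.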

3D Orientation Injection
As discussed in Section 1, our method uses a skeleton-shaped puppet for producing the 3D pose output. The process of combining the predicted 3D directional vectors and the puppet makes a real 3D pose. We term this process as 3D orientation injection.
The human model in this paper is an articulated object that consists of several limbs and joints. A limb is a segment of fixed length, and a joint is the end point of a limb. Limbs can rotate about the joint they share (see Figure 2d). In this way, the human skeleton forms a tree structure. Usually the root of the tree is taken to be the pelvis, and is fixed at the origin. Following Human3.6M [12], the human model consists of 17 joints with 16 limbs. A limb has 0 degrees of freedom (DOFs) as its length is fixed, and a joint has 2 DOFs except for the root, so there are 32 DOFs in our skeleton-shaped human model. Unlike Zhou et al. [40], who use a CNN to predict the rotation angles of joints, we estimate the 3D orientations directly. From the view of limb orientations, each 3D limb direction vector has 2 DOFs, which sums up to 32 DOFs as well.
3D orientation injection is a step that combines the predicted 3D limb orientations with the human model to generate the final 3D pose estimation. This process works like twisting the limbs of the human model to fit the predicted 3D direction vectors. For an arbitrary child node k in the skeleton tree, its 3D location prediction Y_k is determined by its parent node's 3D location prediction X_k, the predicted orientation v_k, and the limb length L_k, in a recursive way as follows:

$$Y_k = X_k + L_k v_k, \tag{4}$$

where the root is fixed at X_0 = 0. The limb lengths L_k are constants, obtained by averaging the limb lengths of subjects S1, S5, S6, S7 and S8 in the training set of Human3.6M. This process stops when the locations of all the leaf nodes are determined. During 3D orientation injection, the orientation errors do not accumulate, because the 3D direction vectors remain unchanged in this process. This is shown in Equation (4), where the 3D direction vector v_k is only scaled by a factor of L_k and translated by X_k, neither of which alters the direction of the vector.
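The recursion of 3D orientation injection can be sketched as a single pass over the skeleton tree. The snippet is an illustration only: the function name `inject_orientations` and the `parents` array encoding are hypothetical, and joints are assumed to be ordered so that every parent precedes its children.

```python
import numpy as np

def inject_orientations(parents, lengths, directions):
    """Place joints recursively: child = parent + limb_length * unit_direction.

    parents[k] is the index of the parent of joint k, or -1 for the root.
    lengths[k] and directions[k] describe the limb that ends at joint k;
    the entries of the root are ignored.
    """
    n = len(parents)
    joints = np.zeros((n, 3))
    for k in range(n):
        if parents[k] < 0:
            continue  # the root stays at the origin
        v = np.asarray(directions[k], dtype=float)
        v = v / np.linalg.norm(v)
        joints[k] = joints[parents[k]] + lengths[k] * v
    return joints

# Toy chain for illustration (not the 17-joint Human3.6M skeleton):
# pelvis -> spine -> shoulder -> elbow -> wrist
parents = [-1, 0, 1, 2, 3]
lengths = [0.0, 0.5, 0.2, 0.3, 0.25]
directions = [[0, 0, 0], [0, 0, 1], [1, 0, 0], [0, 0, -1], [0, 1, 0]]
pose = inject_orientations(parents, lengths, directions)
```

In this toy chain, changing the direction of the shoulder limb moves the elbow and the wrist by the same offset while leaving their own limb directions untouched.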

End-to-End training with 3D Pose Loss
In Section 3.2, we discussed that the 3D PAFs are learnt by minimizing the loss function in Equation (3). In this loss function, each 3D limb orientation is estimated independently. Although limb orientation errors do not accumulate, joint location errors could still propagate along the skeleton tree and possibly accumulate into large errors for joints at the leaf nodes. For example, a location shift in the left shoulder would lead to the same amount of location shift in both the left elbow and the left wrist.
To solve this problem, long-term objectives should be considered so that the 3D orientations are jointly optimized. In our method, since all the steps are designed to be differentiable, we can directly use the 3D pose loss as a long-term objective and train the model end-to-end. Here we use the L1 loss between Y_k^i and Ŷ_k^i, the predicted and ground truth (GT) 3D locations of joint k in training example i. In our experiments, we find that end-to-end training can speed up the convergence and improve the accuracy of the estimation as well. In all, for a T-stage model, the overall loss function combines the objectives of all stages, where λ_1, λ_2 and λ_3 control the relative importance of each objective. We set λ_1 = 0.1, λ_2 = 1 and λ_3 = 1 in our experiments.

Experiments
We provide here details on our experiments, including the datasets and protocols used, training details, quantitative and qualitative results, a robustness analysis and ablation studies.

Datasets and Protocols
We evaluate our method on the following three popular human pose benchmarks.
Human3.6M [12] is a large-scale indoor 3D human pose dataset that comprises 3.6 million images and the corresponding 2D pose and 3D pose annotations. It features 7 subjects performing 15 everyday activities. We follow the standard protocol on Human3.6M and use S1, S5, S6, S7 and S8 for training, and S9 and S11 for evaluation. Following [4,13,32,33], we down-sample the original videos from 50 fps to 10 fps to remove redundancy in both training and evaluation. We report quantitative results on this dataset in terms of three evaluation metrics, i.e., the mean per joint position error (MPJPE), the MPJPE after Procrustes alignment with the ground truth (PA-MPJPE), and the mean per limb orientation error (MPLORE), a metric that evaluates the 3D orientation prediction error from the angle between x_l and y_l, the predicted and GT direction vectors of limb l, averaged over the L limbs.
MPII [11] is the most widely used benchmark for 2D human pose estimation. It contains 25K in-the-wild images with 2D annotations but no 3D ground truth. As a result, direct image-to-3D training is not a practical option with this dataset. We adopt this dataset for the learning of 1D and 2D PAFs, and also use it for the qualitative evaluation of our 3D pose estimation.
MPI-INF-3DHP [41] is a smaller 3D pose dataset constructed by the Mocap system with both constrained indoor scenes and complex outdoor scenes. We only use the test split of this dataset to evaluate the generalization capacity of our method quantitatively, as done in many prior works.
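The keypoint-based and orientation-based metrics used on Human3.6M can be sketched as follows. MPJPE is the usual mean Euclidean joint distance; for MPLORE, the sketch assumes the error of a limb is the angle between the predicted and GT direction vectors and reports it in degrees, which is an assumption about the unit. The function names are hypothetical, and the Procrustes alignment needed for PA-MPJPE is omitted.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per joint position error for joint arrays of shape (J, 3)."""
    diff = np.asarray(pred, dtype=float) - np.asarray(gt, dtype=float)
    return float(np.mean(np.linalg.norm(diff, axis=-1)))

def mplore(pred_dirs, gt_dirs):
    """Mean angle (in degrees, by assumption) between limb directions of shape (L, 3)."""
    x = np.asarray(pred_dirs, dtype=float)
    y = np.asarray(gt_dirs, dtype=float)
    cos = np.sum(x * y, axis=-1) / (np.linalg.norm(x, axis=-1) * np.linalg.norm(y, axis=-1))
    # Clip to guard against rounding slightly outside [-1, 1].
    angles = np.arccos(np.clip(cos, -1.0, 1.0))
    return float(np.degrees(angles).mean())
```

Since `mplore` normalizes both vectors, it is unaffected by limb length, which is the property that motivates the metric.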

Implementation Details
The training of our network is straightforward and stable. We use a pre-trained Stacked Hourglass [2] model to initialize the modules that our network shares with the stacked hourglass, including the first 7 × 7 convolutional layer, the following 3 residual blocks, and the hourglass sub-modules (see Figure 4). Then the network is trained for 40 epochs with RMSprop. The initial learning rate is 5 × 10^−4 and is decayed by 0.25 at epochs 20 and 30. The training examples are randomly sampled from Human3.6M and MPII with equal probability. Augmentations of random scale (1 ± 0.25), random color jitter (1 ± 0.2), random rotation (±30°, p = 0.6) and random horizontal flipping (p = 0.5) are used for both datasets. For fair comparison, we do not use multi-crop or flip testing to obtain a possibly better performance score. The whole training procedure takes about 20 h on a single Tesla V100 GPU. The inference speed is about 70 fps with a batch size of 6 on the same GPU.
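The step schedule described above can be written as a small helper. The sketch assumes that "decayed by 0.25" means the learning rate is multiplied by 0.25 at each milestone and that epochs are counted from 0; both are interpretations rather than details confirmed by the text, and the function name is hypothetical.

```python
def learning_rate(epoch, base_lr=5e-4, factor=0.25, milestones=(20, 30)):
    """Step schedule: multiply base_lr by `factor` once per milestone reached."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** passed

# Learning rate for each of the 40 training epochs under these assumptions.
schedule = [learning_rate(e) for e in range(40)]
```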

Quantitative Results on Human3.6M
We first evaluate our method on Human3.6M using the MPJPE and PA-MPJPE metrics in order to compare it with state-of-the-art methods. The results are shown in Table 1. Our method slightly outperforms the state-of-the-art methods in terms of the average MPJPE and PA-MPJPE over all 15 activities, even though it is not designed to optimize these metrics.
Table 1. Detailed results on Human3.6M under the metrics of MPJPE and PA-MPJPE. Methods marked with * use ground truth camera parameters in post-processing. The results of all approaches are taken from the original papers, except for [5], which is taken from [42]. We also provide the results evaluated with ground truth limb lengths. Best results are marked in bold.
Since objects with totally different sizes could project into similar 2D images, the absolute length estimation from a single color image is usually not reliable. To this end, we propose the mean per limb orientation error (MPLORE), defined in Equation (7), to evaluate the 3D pose estimation performance with the limb lengths decoupled.

The MPLORE results are shown in Table 2. We compare our method with three state-of-the-art methods [18,32,34]. Our method achieves the best results on 14 of the 15 activities, and the best averaged result. The MPLORE results indicate that our method predicts much better 3D orientations of limbs.
Table 2. Comparison with the state of the art on Human3.6M in terms of MPLORE (lower is better). The best score is marked in bold. We achieve the best results across all the activities except for Walking, where our result is only slightly worse than that of [34].

Quantitative Results on MPI-INF-3DHP
This dataset is collected in both indoor and outdoor scenes with a multi-camera marker-less MoCap system. Because of this, the ground truth 3D annotations contain some noise. To quantitatively show the generalization capacity of our method, we evaluate the 3D extension of the Percentage of Correct Keypoints (3DPCK) and the Area Under Curve (AUC) score on MPI-INF-3DHP without training on this dataset, as done in many previous works. The results are shown in Table 3. Our method achieves the second best score in terms of 3DPCK, and the best score in terms of AUC, demonstrating its good generalization capacity to unseen testing images.
Table 3. 3DPCK and AUC on the MPI-INF-3DHP dataset. Higher is better. The results for all approaches are taken from the original papers. • represents our method without the auxiliary 2D orientation task, and r30 and r90 represent using random rotation augmentation of 30 and 90 degrees in training. No training data from this dataset have been used. Our method achieves the best score in terms of AUC, and the second best score in terms of 3DPCK.
Since our method learns the 3D limb orientations, of which the first two dimensions represent the 2D orientations, using large-angle random rotation augmentation on the image should help train a network with better generalization capacity. This is in fact validated by the experimental results in the last two columns of Table 3, in which the model trained with 90-degree random rotation augmentation shows a considerable improvement over the one trained with 30-degree augmentation.
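For readers unfamiliar with the two scores, a minimal sketch is given here. It assumes the convention commonly used with MPI-INF-3DHP, namely a joint counts as correct when its 3D error is at most 150 mm and the AUC is the mean 3DPCK over evenly spaced thresholds from 0 to 150 mm; the threshold, the number of steps and the function names are assumptions, not values stated in this paper.

```python
import numpy as np

def pck3d(pred, gt, threshold=150.0):
    """Percentage of joints whose 3D error (in mm) is within `threshold`."""
    diff = np.asarray(pred, dtype=float) - np.asarray(gt, dtype=float)
    err = np.linalg.norm(diff, axis=-1)
    return float(np.mean(err <= threshold) * 100.0)

def auc3d(pred, gt, max_threshold=150.0, steps=31):
    """Mean 3DPCK over evenly spaced thresholds in [0, max_threshold]."""
    thresholds = np.linspace(0.0, max_threshold, steps)
    return float(np.mean([pck3d(pred, gt, t) for t in thresholds]))
```

Unlike 3DPCK at a single threshold, the AUC also rewards predictions that are well inside the threshold, which is why the two scores can rank methods differently.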

Qualitative Results on MPII
MPII is the most widely used 2D pose dataset, and it does not contain 3D annotations. In this section, we provide some qualitative results on this dataset in Figure 5, especially for challenging scenes such as images with missing body parts. In Figure 5, the images in the first two rows are truncated and those in the last two rows are occluded. Our method can produce visually appealing results even in the presence of incomplete body parts, demonstrating the robustness of the proposed method. These examples on MPII also demonstrate our method's generalization capacity on various in-the-wild images.

[Figure panel columns: Image, 3D PAFs, Prediction.]
In Figure 6, we show four failure examples. In Figure 6a, the pose is rarely seen and also includes self-occlusions. In Figure 6b, the left lower leg is out of the image and there is a bag of a similar color next to it, so the network takes the bag as the absent lower leg. In Figure 6c, the subject wears a black helmet that covers the face, in which case the network confuses the left and right sides of the body. In Figure 6d, the subject is occluded by another person, in which case the network takes the arm of the person in the front as that of the subject in the back.

Robustness Analysis: A Case Study
In Figure 7, we show the qualitative and quantitative results on a testing image under 5 synthetic disturbances, including edge-erasing, rectangle-erasing, circle-erasing, partial-blurring and a composition of the above. To explain how the performance deteriorates, we also visualize the predicted 2D keypoint heatmaps and the 3D PAFs. Here only the four limbs are included for better visualization. In this case, our method is much less sensitive to these disturbances than the methods based on 2D keypoint detection. The synthetic disturbances are generated at random. Our method achieves consistently better performance, which indicates that it has the potential to improve the robustness of 3D human pose estimation.
The robustness of our method can be attributed to two aspects. First and foremost, the pointless method design enables us to predict the ROI and 3D orientation of a limb even when the limb is partially out of the image or occluded. Second, the final 3D orientation of a limb is extracted by averaging all the predicted vectors in the ROI. The averaging operation in this step can be treated as an average filter, which suppresses noise and disturbances in the predicted vectors, making the prediction more stable.
Figure 7. A case study on images with various geometric occlusions. We compare the results with three state-of-the-art methods: (a) [18], (b) [32], and (c) [34]. Our method is robust to missing keypoints, rectangular and circular occlusions, as well as partial blurring.

Ablation Study
To analyze the effectiveness of the different steps in our method, we conduct an ablation study on Human3.6M in terms of MPJPE. The results are reported in Table 4. Baseline refers to the approach that uses the original 3D PAFs without refinement. Denoising refers to using the 1D/2D PAFs to remove the noise in the predicted 3D PAFs. Flip refers to using horizontal flip testing. The performance gain from denoising might seem minor on Human3.6M. The reason is that the 3D branch is trained in a fully supervised manner on Human3.6M, so that there is little noise in the predicted 3D PAFs on images from this dataset. However, when testing on in-the-wild images, there could be a lot of noise in the 3D PAF predictions (see Figure 2), making denoising an indispensable step.

Conclusions
We propose in this paper a simple and effective 3D human pose estimation method, termed pointless 3D pose estimation. Unlike prior methods that rely on 2D keypoint detection, which is prone to errors in the absence of body parts and joints, the proposed approach bypasses this stage and substitutes it with estimations that explicitly account for both the ROIs and the 3D orientations. This allows us to robustly recover the poses, by taking advantage of the estimated 3D vector pointing from a neighboring body part, even when some 2D keypoints are out of the scene or occluded. State-of-the-art results, in terms of both keypoint-based and angle-based evaluation metrics, have been achieved on standard benchmarks as well as on in-the-wild data. Possible future work includes multi-person 3D pose estimation and the estimation of a person's limb lengths.