Multi-View Pose Generator Based on Deep Learning for Monocular 3D Human Pose Estimation

Abstract: In this paper, we study the problem of monocular 3D human pose estimation based on deep learning. Due to the limitations of a single viewpoint, monocular human pose estimation cannot avoid the inherent occlusion problem. A common remedy is to use multi-view 3D pose estimation methods. However, single-view images cannot be used directly in multi-view methods, which greatly limits practical applications. To address these issues, we propose a novel end-to-end network for monocular 3D pose estimation. First, we propose a multi-view pose generator to predict multi-view 2D poses from the 2D pose in a single view. Secondly, we propose a simple but effective data augmentation method for generating multi-view 2D pose annotations, since existing datasets (e.g., Human3.6M) do not contain a large number of 2D pose annotations in different views. Thirdly, we employ a graph convolutional network to infer a 3D pose from the multi-view 2D poses. Experiments conducted on public datasets verify the effectiveness of our method. Furthermore, the ablation studies show that our method improves the performance of existing 3D pose estimation networks.

In recent years, research on 3D pose estimation has mainly focused on three directions, namely 2D-to-3D pose estimation [10,13], monocular image-based 3D pose estimation [8,10,14,15], and multi-view image-based 3D pose estimation [16-19]. These methods were mainly evaluated on the Human3.6M dataset [20], which was collected in a highly constrained environment with limited subjects and background variations. Current methods still suffer from problems such as insufficient fitting, self-occlusion, limited representation ability, and difficulty in training.
Multi-view 3D pose estimation methods have proven effective for improving 3D pose estimation [17,19,21]. Compared to using a single image, these methods avoid partial occlusion, have easier access to more information, and achieve better performance. However, they need multi-view datasets during training, and such datasets are more difficult to obtain.
Human pose data are graph-like, composed of joint points and skeleton edges. Zhao et al. [15] improved GCNs and proposed a novel graph neural network architecture for regression, called Semantic Graph Convolutional Networks (SemGCN), which takes full advantage of local and global relationships among nodes. Ci et al. [22] overcame the limited representation power of GCNs by introducing a Locally Connected Network (LCN). In summary, GCNs have been demonstrated to be an effective approach for 3D pose estimation, with fewer parameters, higher precision, and easier training.
In this work, we propose a method that achieves multi-view 3D pose estimation from single-view input. As shown in Figure 1, our framework includes two stages: (i) a Multi-view Pose Generator (MvPG) and (ii) GCNs for multi-view 2D to 3D pose regression. Our experiments show that MvPG can significantly improve the overall performance of the 3D pose estimation model. In summary, our method is general and effectively improves 3D pose estimation. Our contributions can be summarized as follows:
• We introduce an end-to-end network that implements a multi-view 3D pose estimation framework with a single-view 2D pose as input;
• We establish a strong MvPG to predict the 2D poses of multiple views from the 2D pose in a single view;
• We present a simple and effective method for generating multi-view 2D pose datasets;
• We propose a novel loss function that constrains both joint positions and bone lengths.

Related Work
There are two distinct categories of human pose estimation: single-view methods and multi-view methods. Since our method involves both of these elements as well as GCNs, we briefly summarize past approaches for single-view methods, multi-view methods, and GCNs. Most of these approaches train models on the large-scale Human3.6M dataset [20] to regress 3D human joint locations.

Multi-View 3D Pose Estimation
These methods usually consist of two steps: (1) estimating the 2D poses in multi-view images and (2) recovering the 3D pose from the multi-view poses. It is easy to envision that increasing the number of views could solve the self-occlusion problem inherent in pose estimation. However, the lack of datasets is a major problem for multi-view methods. To alleviate this problem, most research has focused on using weakly or self-supervised training to harvest annotations from different perspectives [12,23], or on fusing features to achieve better results with as few perspectives as possible [17,19,21,24,25]: for example, fusing Inertial Measurement Unit (IMU) data with vision data [21,24], using a multi-camera setup as an additional training source and fusing it with 3D models generated by individual cameras [25], or cross-view fusion [17].
This paper proposes an effective and efficient approach that directly uses 2D poses to predict 3D poses. Specifically, we design a module named MvPG that generates multi-view 2D poses from a monocular image; the generated 2D poses are then used to estimate the 3D pose. Our model avoids the dependence on multi-view datasets while alleviating the self-occlusion problem.

Single-View 3D Pose Estimation
Inspired by Martinez et al., most current solutions for monocular 3D pose estimation focus on two-stage methods. They established a simple baseline for 2D-to-3D human pose estimation by using neural networks to effectively learn the 2D-to-3D mapping. The inevitable depth ambiguity in 3D pose estimation from single-view images limits the estimation accuracy. Extensive research has exploited extra information to constrain the training process [15,26-29]. The most common extra information is temporal. For example, Yan et al. [26] introduced Spatial-Temporal Graph Convolutional Networks (ST-GCN) to automatically learn both spatial and temporal patterns from data. Cheng et al. [29] exploit estimated 2D confidence heatmaps of keypoints and an optical-flow consistency constraint to filter out unreliable estimates of occluded keypoints. Lin et al. [28] utilize matrix factorization (such as singular value decomposition or the discrete cosine transform) to process all input frames simultaneously, avoiding sensitivity and drift issues. In addition, Sharma et al. [27] employ a Deep Conditional Variational Autoencoder (CVAE) [30] to learn anatomical constraints and sample 3D pose candidates.
We also add extra information during estimation, namely poses in other views predicted from the monocular input. Our method remains computationally inexpensive while still improving performance. To the best of our knowledge, no previous work generates multi-view 2D keypoints from a monocular image to estimate the 3D pose.

GCNs for 3D Pose Estimation
GCNs generalize convolutions to graph-structured data and perform well on irregular data structures. In recent years, a number of researchers have introduced GCNs to the study of action recognition [26,31] and 3D human pose estimation [15,32-34]. GCNs can be constructed to learn both spatial and temporal features for action recognition, as in Spatial Temporal Graph Convolutional Networks (ST-GCN) [26] and the Actional-Structural Graph Convolution Network (AS-GCN) [31]; these harness the locality of graph convolution together with temporal dynamics. For pose estimation, spatio-temporal information has also been fully exploited in GCNs [32]. Beyond that, Liu et al. [34] encode the strength of the relationships among joints with a graph attention block, and Zhang et al. [33] invented a 4D association graph for real-time multi-person motion capture. In this paper, we use SemGCN [15] as the 2D to 3D regression network. It has the advantage of capturing local and global semantic relations, and can be easily extended with few additional parameters. Therefore, it is well suited to our proposed multi-view pose generator.

Framework
The framework is illustrated in Figure 1. The whole model is formulated as an end-to-end network, which consists of two modules: (1) the MvPG and (2) the 2D to 3D regression network. The MvPG predicts 2D poses of multiple views from a single view. The 2D to 3D regression network predicts an accurate 3D pose from the multi-view 2D poses. During training, we first pre-train the MvPG on the human pose dataset; then, we train the entire network in an end-to-end manner. Finally, accurate 3D pose data can be obtained.

Multi-View Pose Generator
Previous research has shown that multi-view methods [17,21] can effectively improve the performance of 3D pose estimation; however, multi-view poses are not easily available in real scenes. Accordingly, we obtain multi-view 2D poses from a single view so that multi-view methods can be utilized.
A 2D human pose is defined as a skeleton with N = 16 joints that can fully describe various postures of the human body, parameterized by a 2N-dimensional vector (q_1, ..., q_N) (see Figure 2a). The i-th 2D joint is denoted as q_i = (x_i, y_i). Inspired by [35], which predicts the right view from the left view, we propose the MvPG, which aims to predict multi-view 2D poses from a single-view 2D pose. As shown in Figure 2b, given the single-view 2D pose keypoints q_i, the goal of the MvPG is to learn a mapping function f: R^{2N} → R^{M×2N} that predicts a set of multi-view 2D poses f(q_i) from q_i, where M is the number of views. Each group of networks in the multi-view pose generator learns its own parameters to predict one view of the multi-view 2D pose (Figure 2c). To train the M pose generators, we need 2D pose data of M views q_i^1, q_i^2, ..., q_i^M for supervision, as in Figure 2d. These data can be obtained by projecting the 3D pose with the camera parameters (Figure 2e). The model aims to learn a regression function F_g that minimizes the error between f_m(q_i) and q_i^m:

F_g = min_f Σ_{m=1}^{M} ||f_m(q_i) − q_i^m||^2,  (1)

where M is the number of 2D poses generated by the network, f_m(q_i) is the prediction of the m-th view, and q_i^m is the 2D pose annotation of the m-th view.
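As an illustration, the mapping f can be realized as a set of M small fully connected branches, one per target view. The following is a minimal PyTorch sketch; the hidden width and branch architecture are illustrative assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn

N_JOINTS = 16  # joints per skeleton, as defined in the paper
M_VIEWS = 4    # number of generated views (the paper evaluates 4, 8, and 16)

class ViewGenerator(nn.Module):
    """One branch: maps a single-view 2D pose (2N values) to one target view."""
    def __init__(self, n_joints=N_JOINTS, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * n_joints, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2 * n_joints),
        )

    def forward(self, q):
        return self.net(q)

class MvPG(nn.Module):
    """Multi-view pose generator: M parallel branches, f: R^{2N} -> R^{M x 2N}."""
    def __init__(self, n_joints=N_JOINTS, m_views=M_VIEWS):
        super().__init__()
        self.branches = nn.ModuleList(ViewGenerator(n_joints) for _ in range(m_views))

    def forward(self, q):  # q: (batch, 2N)
        # Each branch predicts one view; stack to (batch, M, 2N).
        return torch.stack([b(q) for b in self.branches], dim=1)

out = MvPG()(torch.randn(8, 2 * N_JOINTS))
print(out.shape)  # torch.Size([8, 4, 32])
```

Each branch is supervised with the 2D annotation of its own view, matching the per-view loss in Equation (1).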
To train our MvPG, the datasets provide a series of 3D pose data, which contain the 3D coordinates of the joint points, skeleton information, and the camera coordinates in space, among other data. We use the 3D coordinates of the human body and the camera coordinates in space to generate the 2D pose corresponding to each camera view. We then use the 3D pose coordinates and the corresponding 2D pose coordinates of each view to train the MvPG based on Equation (1).
Before using the MvPG for a 3D pose estimation task, we need to pre-train it. Multi-view 2D pose annotations are the labels used for supervised learning. Our MvPG model is trained on the Human3.6M dataset [20], but the dataset only provides a limited number of camera angles. Therefore, we need to augment the training data.

2D Pose Data Augmentation
A 3D pose is defined as a skeleton with N = 16 joints, parameterized by a 3N-dimensional vector (P_1, ..., P_N), with keypoints denoted as P_i = (x_i, y_i, z_i). Existing 3D datasets such as Human3.6M [20] provide the 3D coordinates of human joints and camera parameters from which 2D poses in four perspectives can be generated. However, this limited number of cameras cannot meet the data requirements for training the MvPG. Therefore, we introduce a rotation operation [29] to generate multi-view 2D pose annotations.
Figure 3 illustrates our rotation operation. First, we obtain the ground-truth 3D pose from the dataset and extract the coordinates P_i = (x_i, y_i, z_i) of each keypoint. Second, we fix the Y-axis coordinate y_i in the three-dimensional coordinate system and consider only the rotation of x_i and z_i. The rotation angle sweeps [−π, π] with a sampling step of 2π/M. The coordinates after rotation are (x_i^m, z_i^m), which can be described as:

x_i^m = x_i cos θ_m − z_i sin θ_m,
z_i^m = x_i sin θ_m + z_i cos θ_m,  θ_m = −π + 2πm/M, m = 1, ..., M.  (2)
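The rotation about the Y-axis can be sketched as follows; the angle convention θ_m = −π + 2πm/M is our reading of the sampling step described above:

```python
import numpy as np

def rotate_pose_y(P, m, M):
    """Rotate a 3D pose P (N x 3 array) about the Y-axis by theta_m.

    y_i is kept fixed; only x_i and z_i are rotated, matching the
    augmentation described above. theta_m = -pi + 2*pi*m/M is assumed.
    """
    theta = -np.pi + 2.0 * np.pi * m / M
    c, s = np.cos(theta), np.sin(theta)
    x, y, z = P[:, 0], P[:, 1], P[:, 2]
    return np.stack([c * x - s * z, y, s * x + c * z], axis=1)

pose = np.random.randn(16, 3)
# m = M/2 gives theta_m = 0, i.e., the identity rotation.
assert np.allclose(rotate_pose_y(pose, 8, 16), pose)
```

Projecting each rotated 3D pose with the camera model then yields the 2D pose annotation for that synthetic view.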

2D to 3D Pose Regression Network
The goal of our method is to estimate body joint locations in 3D space. Formally, we are given a series of 2D keypoints of the monocular view q_i = (x_i, y_i) and their corresponding 3D keypoints P_i = (x_i, y_i, z_i). The 2D to 3D pose regression network takes q_i as input and predicts the corresponding coordinates P̃_i = (x̃_i, ỹ_i, z̃_i) in 3D space. Our model can be described as a function F*:

F* = min_F Σ_{i=1}^{N} ||F(q_i) − P_i||^2,  (3)

i.e., the model aims to learn a regression function F* that minimizes the error between P̃_i = (x̃_i, ỹ_i, z̃_i) and P_i = (x_i, y_i, z_i).

Network Design
Firstly, we use the method described in Section 3.1 to build the MvPG. As shown in the upper part of Figure 5, in order to generate 2D poses from M perspectives, we combine the view generators of the MvPG in a symmetric manner. The generated views are symmetric to each other at intervals of π, thereby alleviating the occlusion problem and the front-back ambiguity of the limbs in the single view. The multi-view pose data are then concatenated, which enables them to contain more hidden information than single-view data. Each pose is represented by a 16 × 2 matrix. We simply combine the M poses into a 16 × 2M matrix, where every two columns represent a 2D pose in a single view and each row contains the coordinates of one keypoint across the M views. We then take the 16 × 2M matrix as the input of the 2D to 3D network.
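The concatenation of the M generated views into a 16 × 2M input matrix can be sketched in a few lines (the random arrays stand in for the MvPG outputs):

```python
import numpy as np

N, M = 16, 4  # joints per pose, number of generated views (M illustrative)

# One (N, 2) array of 2D keypoints per generated view.
views = [np.random.randn(N, 2) for _ in range(M)]

# Column-wise concatenation: every two columns hold the (x, y) coordinates
# of one view, giving the 16 x 2M input described above.
multi_view_input = np.concatenate(views, axis=1)
print(multi_view_input.shape)  # (16, 8)
```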
Secondly, we use SemGCN [15] as the 2D to 3D pose regression network. In order to obtain higher-level features and better performance, we deepen the SemGCN [15] network; in our experiments, we double the depth of the original SemGCN, as shown in the lower part of Figure 5.
Finally, the network ends with a 1024-way fully-connected layer. This step is added to alleviate redundancy and prevent the network from overfitting.
In previous studies [8,15,22], the models show a large difference in the Mean Per Joint Position Error (MPJPE) across different poses, which indicates instability in model training. To alleviate this problem, we use the Mish [36] activation function instead of ReLU [37], defined as f(x) = x · tanh(σ(x)), where σ(x) = ln(1 + e^x) is the softplus activation function [38].
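A minimal implementation of Mish (recent PyTorch versions also ship it as `torch.nn.Mish`):

```python
import torch
import torch.nn.functional as F

def mish(x):
    """Mish activation: f(x) = x * tanh(softplus(x)), softplus(x) = ln(1 + e^x)."""
    return x * torch.tanh(F.softplus(x))

x = torch.tensor([-2.0, 0.0, 2.0])
out = mish(x)
# Unlike ReLU, Mish is smooth and lets small negative values pass through.
```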

Loss Function
Most previous studies have minimized the Mean Square Error (MSE), which has proven to be a simple and efficient objective that performs well on this task. On the basis of the MSE, we add a bone-length consistency loss to constrain the bone lengths. After the MvPG obtains the multi-view 2D keypoints, we feed them into our improved GCN, which outputs the estimated 3D coordinates of all keypoints. The 2D to 3D network employs an MSE loss on the 3D joints together with the bone-length term:

L = Σ_{i=1}^{N} ||P̃_i − P_i||^2 + Σ_{j=1}^{B} ||b̃_j − b_j||^2,  (4)

where P_i is the ground-truth 3D joint, P̃_i is the corresponding 3D joint predicted by our model, and B is the number of bones of one skeleton. The bone lengths b̃_j and b_j are calculated from the predicted and ground-truth 3D joints, respectively. In this way, we construct an end-to-end deep neural network for 2D to 3D pose estimation. We use the Adam optimizer to pre-train the MvPG based on Equation (1) with the augmented data. Afterward, we train the whole network on the dataset to achieve the best results.
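The combined objective can be sketched as follows; the bone list and the relative weighting of the two terms are illustrative assumptions, not values from the paper:

```python
import torch

# Illustrative bone list (pairs of joint indices); the real skeleton
# defines B bones over all 16 joints.
BONES = [(0, 1), (1, 2), (2, 3), (3, 4)]
LAMBDA = 0.1  # relative weight of the bone term (an assumption)

def bone_lengths(P, bones):
    """P: (batch, N, 3) joint coordinates -> (batch, B) Euclidean bone lengths."""
    return torch.stack([(P[:, i] - P[:, j]).norm(dim=-1) for i, j in bones], dim=-1)

def pose_loss(pred, gt, bones=BONES, lam=LAMBDA):
    """MSE over 3D joints plus a bone-length consistency term."""
    joint_loss = torch.mean((pred - gt) ** 2)
    bone_loss = torch.mean((bone_lengths(pred, bones) - bone_lengths(gt, bones)) ** 2)
    return joint_loss + lam * bone_loss

pred, gt = torch.randn(2, 16, 3), torch.randn(2, 16, 3)
loss = pose_loss(pred, gt)  # scalar tensor, differentiable w.r.t. pred
```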

Experiments
In this section, we first introduce the Human3.6M dataset [20] used to evaluate network performance, along with the evaluation protocols. Second, following Section 3.3.1, we design the network and conduct ablation studies on the components of our method. Finally, we report the results of our evaluation on the public datasets and compare them with state-of-the-art methods.

Setting
Datasets: The Human3.6M [20] dataset is one of the largest and most widely used datasets for 3D human pose estimation. It provides 3.6 million images of human poses with pose labels, covering various actions, such as discussion, eating, sitting, and smoking, captured from four cameras. The ground-truth 3D poses are captured by a motion capture (Mocap) system, while the 2D poses can be obtained by projection with the known intrinsic and extrinsic camera parameters.
Evaluation Protocols: We follow the standard protocol on Human3.6M, using subjects 1, 5, 6, 7, and 8 for training and subjects 9 and 11 for evaluation. The evaluation metric is the Mean Per Joint Position Error (MPJPE) in millimetres between the ground truth and the prediction across all cameras and joints after aligning the root joints. We refer to this as Protocol #1. In a second protocol, only the frontal view is considered for testing, i.e., testing is performed on every 5th frame of the sequences from the frontal camera (cam-3) from trial 1 of each activity with ground-truth cropping; the training data includes all actions and perspectives. This protocol is named Protocol #2.
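The MPJPE metric described above can be computed as:

```python
import torch

def mpjpe(pred, gt):
    """Mean Per Joint Position Error: mean Euclidean distance over all joints.

    pred, gt: (batch, N, 3) tensors in millimetres, root-aligned beforehand.
    """
    return (pred - gt).norm(dim=-1).mean()

gt = torch.zeros(1, 16, 3)
pred = torch.zeros(1, 16, 3)
pred[..., 0] = 3.0  # every joint displaced by 3 mm along x
print(mpjpe(pred, gt))  # tensor(3.)
```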
Experimental Settings: The model is implemented in PyTorch. To benefit from efficient parallel tensor computation, all experiments are conducted with an RTX 2080S GPU on Ubuntu. Furthermore, to verify the effectiveness and efficiency of our method, we designed two sets of experiments: (1) the effect of different scales on MvPG results and (2) the influence of MvPG on the performance of 3D pose estimation with different networks.
Implementation Details: We use the ground-truth 2D and 3D joint locations provided in the dataset as input to the MvPG for pre-training, applying the data augmentation of Section 3.2 during training. Before training the entire network, we load the pre-trained parameters into the MvPG part. In this stage, the loss function is defined by Equation (4). We train our model for 15 epochs using the Adam optimizer, with a learning rate of 0.008 under exponential decay and a mini-batch size of 256. At test time, one epoch takes about 15 minutes in batch mode (256 samples per batch) on a single RTX 2080S GPU. It is worth noting that different random seeds lead to different training results; after extensive experiments, we obtained the best parameters on our device.
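The optimizer setup described above can be sketched as follows; the exponential decay rate `gamma` is an assumption (the paper only states "exponential decay"), and `model` is a placeholder for the full MvPG plus 2D-to-3D network:

```python
import torch

model = torch.nn.Linear(32, 48)  # placeholder module
optimizer = torch.optim.Adam(model.parameters(), lr=0.008)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.96)  # gamma assumed

for epoch in range(15):
    # ... iterate mini-batches of size 256, backprop the Equation (4) loss ...
    optimizer.step()  # placeholder step (no real gradients in this sketch)
    scheduler.step()

print(optimizer.param_groups[0]["lr"])  # 0.008 decayed over 15 epochs
```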

Ablation Study
In this section, we designed two sets of experiments. Firstly, in Section 4.2.1, to verify the effect of the number of views generated by MvPG, we configure MvPG to generate different numbers of views for ablation experiments. Secondly, we apply MvPG to different 2D to 3D networks to verify the generality of our method. Figure 5 shows the network architecture of SemGCN [15] with MvPG, which comprises two main modules: (1) a basic version, i.e., SemGCN [15] without MvPG, and (2) the MvPG at different scales. To evaluate the efficacy of MvPG, we conduct an ablation study on Human3.6M [20] under Protocol #1. Table 1 lists the average error over all joints. The notations are as follows:

Performance Analysis of the Number of Views Generated by MvPG
Basic version: Refers to the pose estimator without the MvPG. The mean error of our basic version is 40.81 mm, which is very close to the 40.78 mm error reported for SemGCN [15] with non-local [39].
MvPG: Refers to the model with the MvPG. MvPG-4, MvPG-8, and MvPG-16 generate 4, 8, and 16 poses, respectively. We compare different scales of the MvPG, with the results shown in Table 1. The first line shows the results of the basic version with only SemGCN [15] and non-local [39]; unsurprisingly, the performance without the MvPG module is the worst. The second and third lines show the results of integrating the MvPG-4 and MvPG-8 modules, which improved performance by 9.38% and 4.2%, respectively. When MvPG-16 was introduced, as shown in the last line, our model achieved a mean estimation error of 35.8 mm. Based on these experimental results, we set the number of views generated by MvPG to 16.
We analyze the results of this experiment as follows: (1) Although different module scales contribute differently to the mean error, the final mean error can be further improved by selecting the appropriate module scale. (2) In the 3D pose estimation task, more views proved more informative than fewer views, which demonstrates that the MvPG-16 module effectively extracts multi-view features and confirms its effectiveness. (3) While increasing the number of views, it is necessary to also increase the depth of the subsequent 2D to 3D network to match the MvPG and learn more features (see Table 1).

Impact of MvPG on 3D Pose Estimation Network
To analyze the impact of MvPG with different 2D to 3D networks in the entire pose estimation task, we use SemGCN [15] and FCN [10], respectively, as the 2D to 3D network and conduct ablation analysis on Human3.6M [20] under Protocol #1. As shown in Table 2, after adding the MvPG, FCN [10] gains a 5.67% improvement and SemGCN [15] gains a 5.22% improvement. This experiment shows that our MvPG is generally applicable to various 2D to 3D pose estimation networks. Table 2. After using MvPG, the performance of the previous methods improved. The top 2 ranked values are highlighted in bold, with the first and second shown in red and blue, respectively.

Comparison with the State of the Art
We performed quantitative comparisons against state-of-the-art methods for single-view 3D pose estimation; all models were trained and tested on ground-truth 2D poses. The results are shown in Table 3. Using only 2D joints as input and SemGCN [15] with the non-local layer [39] as the 2D to 3D network, our method matches state-of-the-art performance. In particular, reviewing previous methods on the actions Directions, Greeting, Posing, Waiting, Walking, Walking Dog, and Walking Together, we note that these actions involve serious self-occlusion, which our MvPG compensates for by predicting the pose in multiple views. Under Protocol #1, our method (GT) obtained state-of-the-art results with an error of 35.8 mm, a 12% improvement over the SemGCN architecture [15]. Compared to the recent best result [22], our method still achieves a 1.3% improvement. Table 3. Quantitative evaluation on Human3.6M [20] under Protocol #1. GT indicates that the network was trained and tested on ground-truth 2D poses. Non-local indicates our 2D to 3D network architecture, SemGCN [15] with non-local [39]. The top-ranked values are highlighted in bold.

Compared with the latest single-view models, our model combines the advantages of the multi-view model. The experiments showed that our model can effectively improve the accuracy of single-view 3D pose estimation. Additionally, our model can be used directly in real scenes because it needs only one view to achieve high-precision 3D pose estimation. Our approach also has a drawback: it increases the network size, resulting in longer training time, which we will address in future work. Figure 6 shows the visualization results of our approach compared with the 3D ground truth on Human3.6M. Using a single-view 2D pose as input, our approach generates multi-view 2D pose data and mines hidden occlusion information for reconstructing the 3D pose. As the figure shows, our method accurately estimates the 3D pose, which indicates that the MvPG handles self-occlusion effectively.

Conclusions
In this paper, we proposed a Multi-view Pose Generator (MvPG) for 3D pose estimation from a novel perspective. Our method predicts a set of symmetric multi-view poses from a single-view 2D pose, which are then used in a 2D to 3D regression network to alleviate the self-occlusion problem in pose estimation. Combined with the advanced SemGCN model, the performance of 3D human pose estimation is further improved. The results of training and testing with ground-truth 2D poses as input show that our method improved by 1.3% over the state of the art. Compared with true multi-view 3D pose estimation, our method still has deficiencies. Our method can be applied to 3D human pose estimation tasks where only a single view is available, for example in surveillance video, clinical research, and interactive games. In future work, we plan to design multi-view pose generators with more advanced networks to achieve higher performance at a smaller network scale.