Towards Single 2D Image-Level Self-Supervision for 3D Human Pose and Shape Estimation

Three-dimensional human pose and shape estimation is an important problem in the computer vision community, with numerous applications such as augmented reality, virtual reality, human-computer interaction, and so on. However, training accurate 3D human pose and shape estimators based on deep learning approaches requires a large number of images and corresponding 3D ground-truth pose pairs, which are costly to collect. To relieve this constraint, various types of weakly or self-supervised pose estimation approaches have been proposed. Nevertheless, these methods still involve supervision signals, which require effort to collect, such as unpaired large-scale 3D ground truth data, a small subset of 3D labeled data, video priors, and so on. Often, they require installing equipment such as a calibrated multi-camera system to acquire strong multi-view priors. In this paper, we propose a self-supervised learning framework for 3D human pose and shape estimation that does not require other forms of supervision signals while using only single 2D images. Our framework inputs single 2D images, estimates human 3D meshes in the intermediate layers, and is trained to solve four types of self-supervision tasks (i.e., three image manipulation tasks and one neural rendering task) whose ground-truths are all based on the single 2D images themselves. Through experiments, we demonstrate the effectiveness of our approach on 3D human pose benchmark datasets (i.e., Human3.6M, 3DPW, and LSP), where we set a new state-of-the-art among weakly/self-supervised methods.

One of the fundamental issues in constructing pose estimation frameworks is that they consume large numbers of 2D or 3D ground-truth poses (i.e., x, y, and z coordinate values of the 3D human bodies) for a given 2D RGB (i.e., red, green, blue) or 2.5D depth input image to secure good accuracy in the mesh estimation task. Many researchers have proposed million-scale data pairs to properly train such frameworks [14]. However, it is challenging to acquire large-scale datasets containing diverse variations and high-quality 3D annotations. Manually annotating such 3D coordinate values is non-trivial, and it takes a great deal of time and manual effort. Some works attempted to relieve this issue by using synthetic datasets based on graphics engines [15,16]. However, the appearance of images obtained from graphics engines shows an observable gap from real samples [17], and it is well known that models trained on purely synthetic datasets do not generalize well to real testing datasets [18].
In this paper, we attempted to relieve the issue of insufficient data by proposing self-supervision losses to train the 3D human pose estimation framework without explicit 2D/3D skeletal ground-truths. Our self-supervision losses consist of four types of supervision whose ground-truths are defined based on single 2D images: jigsaw puzzling [19], rotation prediction [20], image in-painting [21], and image projection [22] tasks (as illustrated in Figure 1). In the jigsaw puzzling task, the image is divided into D × D tiles and shuffled; then, the jigsaw puzzle solver is trained to estimate the order of the tiles. In the rotation prediction task, the image is rotated and the rotation predictor is trained to estimate the rotation degrees. In the image in-painting task, sub-patches of the image are erased and the image in-painting decoder is trained to generate the erased patches with the aid of the adversarial loss [23]. Lastly, in the image projection task, the neural 3D mesh renderer is used to differentiably generate the 2D images by projecting the estimated 3D meshes. This task directly provides gradients to the estimated 3D meshes and updates the parameters of the 3D mesh estimator, while the former three tasks indirectly enrich the feature vector extracted from the 2D images via the feature extractor. Via the combination of the proposed four losses, the 3D mesh estimator produces more accurate 3D human meshes.
Via a series of experiments, we have observed that the proposed self-supervised losses with additional image databases are able to enrich the pre-trained 3D human pose estimator and achieve state-of-the-art accuracy among self/weakly/semi-supervised works. The major contributions of this paper are summarized as follows:
• We construct a 3D human pose and shape estimation framework that can be trained by 2D single image-level self-supervision without the use of other forms of supervision signals, such as explicit 2D/3D skeletons, video-level, or multi-view priors;
• We propose four types of self-supervised losses based on the 2D single images themselves and introduce a method to effectively train the entire network. In particular, we investigate which combinations of losses are the most promising for achieving 2D single image-level self-supervision for 3D human mesh reconstruction;
• The proposed method outperforms competitive 3D human pose estimation algorithms, demonstrating that single 2D images can provide strong supervision for training networks for the 3D mesh reconstruction task.
We have made our code and data publicly available at https://github.com/JunukCha/SSPSE (accessed on October 13, 2021).

Figure 1.
Example image-level self-supervision: We applied three types of image manipulation and one image projection task to train our human mesh estimation network. Each row shows examples from the Human3.6M dataset, and each column corresponds to (a) input RGB images, (b) shuffled patches used for the jigsaw puzzling task, (c) rotated RGB images used for the rotation prediction task, (d) output images from the part in-painting task, and (e) neurally rendered images from estimated 3D meshes and textures, respectively.

Related Works
In this section, we review the recent literature on 3D human pose estimation and 3D human pose and shape estimation works. Then, we analyze recent weak/semi-supervised approaches designed for 3D human pose and shape estimation that are closely related to ours.

Three-Dimensional Human Pose Estimation
Many recent studies have focused on estimating 2D [11] or 3D keypoint locations on the human body [12]. Normally, these keypoint locations include major body joints such as the wrists, ankles, elbows, neck, shoulders, and knees. The architectures of 2D pose detectors have been designed to map images into the 2D pose vector more accurately. Wei et al. [2] proposed a sequential architecture composed of convolutional neural networks (CNNs) with multiple sizes of receptive fields. Zhou et al. [3] proposed a method using both a 3D geometric prior and a temporal smoothness prior to treat considerable uncertainties in 2D joint locations. Newell et al. [1] proposed stacked hourglass networks based on a successive architecture composed of pooling and upsampling layers. As we live in 3D space, understanding poses in 3D space is the natural extension of 2D pose estimation. Three-dimensional pose estimation aims to locate the key 3D joints of the human body from an image or a video, in either a single-view or multi-view setting. Most approaches are two-stage: they first predict 2D joint locations using 2D pose detectors [1-3,24] and then predict 3D joint locations from the 2D joints or features by performing regression [4,25-27] or model fitting [3,28-32]. Xu et al. [24] proposed a graph stacked hourglass network for 3D human pose estimation, which consists of four stacked hourglasses. Estimated 2D joints are fed into the network, and the 3D pose is predicted. Martinez et al. [4] proposed a simple framework composed of fully connected layers to lift a 2D pose to a 3D angular pose. Moreno et al. [25] proposed a framework composed of a CNN-based 2D joint detector and inferred 3D poses from the detected 2D joints. These approaches have made great progress in improving the performance of 3D human pose estimation.

Three-Dimensional Human Pose and Shape Estimation
Following progress in the field of 3D human pose estimation, simultaneously estimating the human body pose and shape has become a recent trend [10,33-41]. There are two main approaches for estimating 3D human poses and shapes (i.e., meshes): optimization-based methods [9,42,43] and regression-based methods [5-8,36,37]. SMPLify [9] is one of the representative optimization-based methods. It first estimates the 2D skeletons from the images and fits the graphical 3D body model (i.e., SMPL [13]) to them using an optimization method. Kanazawa et al. [5] proposed a deep learning-based regression framework, an end-to-end framework for reconstructing a 3D human mesh from a single RGB image. Omran et al. [36] proposed neural body fitting. This framework generates 12 semantic human body parts with a semantic segmentation CNN. From these semantic human body parts, the pose and shape parameters of SMPL are estimated and a 3D human mesh is reconstructed. Pavlakos et al. [37] proposed a framework that simultaneously estimates heatmaps related to the pose parameters of SMPL and silhouettes related to the shape parameters of SMPL. Using these heatmaps and silhouettes, a 3D human mesh is reconstructed. A method [44] has also been introduced that exploits the merits of both optimization-based and regression-based methods for 3D mesh estimation. The temporal dynamics of the human body and shape have also been incorporated in the framework of [6]. A framework for adapting a pre-trained human mesh estimation model to out-of-domain data was proposed in [8]. In addition, non-parametric body mesh reconstruction methods have been proposed [38,45]. Tan et al. [38] proposed an encoder-decoder network: first, a decoder is trained to predict a body silhouette from the SMPL parameter input; second, an encoder is trained on real images and corresponding silhouette pairs while the decoder is fixed.
Lin et al. [45] proposed end-to-end human pose and mesh reconstruction with a transformer [46].

Weakly/Semi-Supervised Learning in 3D Human Mesh Estimation
For most human pose estimation methods [36,37,43,47-50], supervised learning prevails; however, securing the 3D mesh ground truth is non-trivial. Weakly or semi-supervised methods are designed to solve the issue of the lack of quality annotation by using more easily available annotations. In the context of the 3D human mesh estimation task, semi-supervised learning uses 3D skeletons that are coarser than 3D meshes, while weakly-supervised learning uses either 2D annotation [36,51,52] or pseudo-3D annotation [53-56]. In [36], 2D keypoint annotations are exploited to estimate SMPL body model parameters from CNNs to recover human 3D meshes following predicted body part segmentation masks. In [51], two anatomically inspired losses are proposed and used with a weakly supervised learning framework to jointly learn from large-scale in-the-wild 2D and indoor/synthetic 3D data. The authors of [52] suggest weakly-supervised solutions by adopting a multi-view consistency loss in 2.5D pose representations leveraged by independent 2D annotated samples. In [53-56], multi-view geometry is used to resolve 3D depth ambiguity. We developed our original method in the self-supervised setting, only using input 2D images; however, to compare with these weakly or semi-supervised approaches, we extended our model to additionally learn from 2D or 3D skeletons when they are available. In this manner, we compared our method to the state-of-the-art self/weakly/semi-supervised approaches.

Method
Our self-supervised 3D human mesh estimation framework is detailed in this section and illustrated in Figure 2. In this work, we used the network architecture of [5,6] as our baseline 3D mesh reconstruction network, which uses the ResNet-50 [57] architecture as the feature extractor and predicts SMPL [13] parameters from it. Our aim is to increase the accuracy of the baseline 3D mesh reconstruction network using the single 2D image-level self-supervised losses and their training strategy. In the remainder of this section, we explain the process at a more detailed level, beginning with our network architectures.

Figure 2. Overview of our framework. In the pre-training stage, we pre-trained the networks in orange boxes (i.e., f Jigsaw , f Inp , f Rot , and f Tex ) and a network in a purple box (i.e., f Disc ) using five corresponding losses (i.e., L jigsaw , L inp , L rot , L RGB , and L adv , respectively). Two networks f NMR and f r-cnn with pink boxes were fixed after employing the implementation of [22] and initializing weights from [58], respectively. Finally, in the mesh training stage, we trained f Feat and f Mesh in blue boxes, which were used to infer the 3D human meshes from 2D images. The discriminator f Disc in the purple box was also further trained during the mesh training stage. Blue, black, and green arrows denote routes used for testing, training, and supervision signals, respectively. Optional loss L joints was not used for the self-supervised setting (i.e., Ours (self-sup)); it was used via 2D skeletons for the weakly-supervised setting (i.e., Ours (weakly-sup)) and was used via both 2D and 3D skeletons for the semi-supervised setting (i.e., Ours (semi-sup)).

Network Architectures
Our feature extractor f Feat and mesh estimator f Mesh were taken from the previous mesh reconstruction method [5], and they were responsible for extracting features and estimating the 3D human meshes. Three networks (jigsaw puzzle solver f Jigsaw , rotation predictor f Rot , part in-painter f Inp ) were additionally involved to develop the self-supervision loss solving three different image manipulation tasks, and two additional networks (texture network f Tex , neural mesh renderer f NMR ) were involved to develop the self-supervision loss solving the image projection task. Finally, the discriminator f Disc was involved to capture the distribution of real 2D images containing human images. More details on individual networks are described in the remainder of this subsection.
Feature extractor f Feat : X → F ⊂ R 2048×1 . Similar to recent human mesh recovery works [5,6] that estimate 3D human mesh model (i.e., SMPL [13]) parameters from the RGB images x ∈ X, we involved the ResNet-50 [57] architecture as our feature extractor. It generates a 2048-dimensional feature vector f ∈ F from the input image x, which is resized to a 224 × 224 × 3 dimensional array.
Mesh estimator f Mesh : F → M ⊂ R 6890×13,776 . After extracting the feature vector f ∈ F, we mapped it to the corresponding 85-dimensional SMPL [13] parameters h ∈ H ⊂ R 85×1 (we used 3, 10, and 72-dimensional vectors as camera, shape, and pose parameters, respectively, in the same manner as in [5]). These parameters were used to differentiably generate the corresponding human body meshes m ∈ M with 6890 vertices and 13,776 faces using the SMPL layer (refer to [5,13] for more details).
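The 85-dimensional parameter layout can be sketched as follows. This is a minimal illustration of the 3 + 10 + 72 split described above; the exact ordering of the three groups inside the vector is an assumption of this sketch, not a statement about the released code.

```python
import numpy as np

def split_smpl_params(h):
    """Split the 85-dim regressor output into camera, shape, and pose
    parameters, following the 3/10/72 split described in the text.
    The ordering inside the vector is an assumption of this sketch."""
    assert h.shape[-1] == 85
    cam = h[..., :3]      # weak-perspective camera: scale + 2D translation
    shape = h[..., 3:13]  # 10 SMPL shape (beta) coefficients
    pose = h[..., 13:]    # 24 joints x 3 axis-angle values = 72
    return cam, shape, pose
```

The SMPL layer then consumes the shape and pose groups to produce the 6890 mesh vertices, while the camera group projects the mesh into the image.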
Jigsaw puzzle solver f Jigsaw : The jigsaw puzzle solver was used to solve the first self-supervision task, called the jigsaw puzzling task. In this task, we divided the input 2D image x into D × D tiles and permuted them to generate a jigsaw puzzling image x jigsaw . Then, we input the D² tiles combined with the original image into the feature extractor f Feat by resizing all images into 224 × 224 × 3 dimensional images. The resultant features from the five images were concatenated, retaining the permuted orders, and mapped into the vector o ∈ O, whose entries denote the probabilities of the permutation orders of the tiles. The class number C = (D × D)! is equivalent to the number of permutations for D × D tiles. We found that D = 2 best encodes the human bodies, balancing them against the background regions (see Figure 1); thus, C = (2 × 2)! = 24 was used throughout the experiment.
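The mapping between tile permutations and the 24 classification labels can be sketched as below; the fixed-list encoding is one simple way to realize the C = 24 classes and is an assumption of this sketch.

```python
from itertools import permutations

# With D = 2 there are (2*2)! = 24 tile orderings, so the jigsaw puzzle
# solver is a 24-way classifier over this fixed list of permutations.
PERMS = list(permutations(range(4)))

def perm_to_class(perm):
    """Map a tile permutation to its class index in {0, ..., 23}."""
    return PERMS.index(tuple(perm))

def class_to_perm(c):
    """Inverse mapping: class index back to the tile permutation."""
    return PERMS[c]
```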

Rotation predictor f Rot : The rotation predictor was used to solve the second self-supervision task, called the rotation prediction task. In this task, we rotated the input 2D image x by R fixed angles θ to generate a new image x rot . Humans appear upright in images rotated by 0 degrees. We first applied f Feat on the input image x rot to obtain the feature vector f. Then, the rotation predictor f Rot was applied on f to predict the vector r ∈ R, which contained the probability for the R possible angles. In this work, we set R = 4 by setting the rotation angles to 0, 90, 180, and 270 degrees for simplicity; nevertheless, we were able to achieve a meaningful accuracy improvement using this simple setting.
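The rotation task's input/label construction can be sketched as below, assuming 90-degree steps as described above; the counter-clockwise direction of `np.rot90` is an arbitrary convention of this sketch.

```python
import numpy as np

def rotate_image(img, k):
    """Rotate an (H, W, 3) image by k * 90 degrees counter-clockwise;
    the index k in {0, 1, 2, 3} doubles as the classification label
    for the R = 4 rotation classes."""
    assert k in (0, 1, 2, 3)
    return np.rot90(img, k).copy()
```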
Part in-painter f Inp : F → P ⊂ R P×P×3 . The part in-painter was used to solve the third self-supervision task, called the part in-painting task. In this task, we erased a P × P × 3-sized patch p from the center of an input image x, resulting in x hole . Then, the part in-painter f Inp was responsible for predicting p̂ resembling p. We set P = 64 considering the patch size compared to the size of the human bodies in images (Figure 1d describes example in-painted patches).
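The patch-erasure step can be sketched as follows, a minimal version of the center-crop removal described above (zero-filling the hole is an assumption of this sketch).

```python
import numpy as np

P = 64  # patch size from the text, relative to 224 x 224 input crops

def erase_center_patch(img):
    """Erase the central P x P patch from img, returning (x_hole, p):
    the holed image fed to the feature extractor and the target patch
    the in-painter must reconstruct."""
    H, W, _ = img.shape
    top, left = (H - P) // 2, (W - P) // 2
    p = img[top:top + P, left:left + P].copy()
    x_hole = img.copy()
    x_hole[top:top + P, left:left + P] = 0.0
    return x_hole, p
```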
Neural mesh renderer f NMR : The neural mesh renderer [22] was employed to solve the last self-supervision task, called the image projection task. In this task, 3D meshes were projected to the 2D images by an operation similar to graphics rendering [22]. Either the RGB human images x ∈ X or a binary segmentation mask s ∈ S could be generated from the 3D meshes m. To render RGB images, an additional texture array t ∈ T was required that described RGB values for faces of the 3D meshes. To infer the texture array, we additionally involved the texture network f Tex , as explained below.
Texture network f Tex : F → T ⊂ R 13,776×3 . As there are 13,776 faces in the SMPL [13] model, the dimension of the texture array t is 13,776 × 3, representing the RGB values corresponding to each mesh face. We first inferred the 13,776 × 3 dimensional array t from f Tex and differentiably reshaped it into a 13,776 × Z × Z × Z × 3 dimensional array. Z = 2 was used, as the visual quality at this minimum dimension was sufficient for our purpose.

Discriminator f Disc : X → [0, 1]. The discriminator was involved to capture the distribution of real human images and give gradients to the networks to make realistic images, in a similar manner to [23]. The discriminator f Disc played a crucial role, especially when training the part in-painter f Inp and texture network f Tex , as these tasks require training networks to generate realistic patches or textured images.

Tables 1-3 present the detailed structure of our network architectures. For f r-cnn , f NMR , and f Feat , we employed the implementations and weights from [58], [22], and [5], respectively. For the mesh estimator f Mesh , as explained in Table 4, we involved the SMPL layer of [5], which can output the SMPL human body meshes from its estimated pose (72), shape (10), and camera (3) parameters.
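The per-face texture expansion described above can be sketched as follows: each face color is repeated over its Z³ texel cube to match the tensor shape consumed by the neural mesh renderer. Broadcasting as the repetition mechanism is an assumption of this sketch.

```python
import numpy as np

N_FACES, Z = 13776, 2  # SMPL face count and texel resolution from the text

def expand_texture(t):
    """Expand a per-face RGB array (13776 x 3) into the
    13776 x Z x Z x Z x 3 texture tensor expected by the neural mesh
    renderer, repeating each face colour over its Z^3 texel cube
    (a sketch of the differentiable reshape step)."""
    assert t.shape == (N_FACES, 3)
    return np.broadcast_to(
        t[:, None, None, None, :], (N_FACES, Z, Z, Z, 3)).copy()
```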

Training Method
Our aim during the training stage was to enrich both the feature extractor f Feat and the mesh estimator f Mesh , which were responsible for estimating 3D meshes. Our losses were proposed to improve these two networks (i.e., the feature extractor f Feat and mesh estimator f Mesh ) based on the 2D single image-level self-supervised losses, which were designed to solve three image manipulation tasks (i.e., jigsaw puzzle, rotation prediction, and part in-painting) and one image projection task. In this subsection, we explain the entire training strategy and introduce the individual losses that we used.

Summary of the Entire Training Process.
The entire training process was divided into two stages: (1) pre-training stage and (2) mesh training stage.
At the (1) pre-training stage, five networks (i.e., jigsaw puzzle solver f Jigsaw , rotation predictor f Rot , part in-painter f Inp , discriminator f Disc , and texture network f Tex ) were pre-trained using the corresponding losses defined in Equations (2)-(6), respectively. We pre-trained these networks to constitute the self-supervision losses used in the subsequent mesh training stage.
Then, at the (2) mesh training stage, we trained the feature extractor f Feat and mesh estimator f Mesh , which were responsible for the 3D mesh estimation, using the loss L (Equation (1)), which combines the image manipulation loss L IM and the neural rendering loss L NR defined in Equations (7) and (9), respectively. The actual improvement of the 3D mesh reconstruction was obtained during this stage. Furthermore, the discriminator f Disc was further trained by discriminating between real and generated images to provide richer supervision. The entire training process is summarized in Algorithm 1.

Pre-Training Stage
The aim of this stage was to train five different networks ( f Jigsaw , f Rot , f Inp , f Disc , and f Tex ) that were used to constitute the losses in the subsequent "mesh training" stage. We trained the networks by running T 1 = 10 epochs of training with the Adam optimizer on the losses L jigsaw , L rot , L inp , L adv , and L RGB (Equations (2)-(6)) with learning rates of 0.01, 5 × 10 −4 , 1 × 10 −4 , 1 × 10 −4 , and 0.01, respectively.
Jigsaw puzzle loss L jigsaw . As in [19], the jigsaw puzzling task could be formulated as a standard classification task. We applied the cross-entropy loss L jigsaw to the jigsaw puzzle solver f Jigsaw to make its output close to the permutation order:

L jigsaw = − ∑ c y c,jigsaw log ô c , (2)

where y c,jigsaw is the c-th dimensional element of the one-hot encoded vector of the permutation ground-truth, and ô c is the c-th dimensional response of the f Jigsaw network's output.
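The cross-entropy computation above can be sketched as below; the same form is reused for the rotation loss with C = 4. The log-softmax formulation is a standard, numerically stable way to realize Equation (2).

```python
import numpy as np

def cross_entropy(logits, y):
    """C-way cross-entropy as used for L_jigsaw: logits come from the
    jigsaw puzzle solver and y is the ground-truth permutation class."""
    z = logits - logits.max()                 # numerically stable softmax
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[y]
```

For uniform logits over the 24 permutation classes, the loss equals log 24, the entropy of a random guess.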
Rotation prediction loss L rot . The rotation prediction could also be formulated as a standard classification task, as in [20]. Images were rotated by four possible angles θ (0, 90, 180, and 270 degrees) and then mapped to the classification label set {1, 2, 3, 4}. To achieve this, the cross-entropy loss L rot was applied to the output of the rotation predictor f Rot :

L rot = − ∑ c y c,rot log r̂ c , (3)

where y c,rot is the c-th dimensional element of the one-hot encoded vector of the rotation ground-truth, and r̂ c is the c-th dimensional response of the f Rot network's output.

Algorithm 1: Entire training process.
Outputs: jigsaw puzzle solver output ô c ; rotation predictor output r̂ c ; part in-painter output p̂ ; rendered RGB image x̂ render .
Initialization: pre-train f Feat and f Mesh based on [5]; initialize f r-cnn from [58]; implement f NMR from [22].
For each epoch:
  For each mini-batch D n :
    For each RGB image x in the mini-batch D n , output ô c , r̂ c , p̂ , and generate x̂ render ;
    Calculate the gradient ∇L (Equation (1)) with respect to (the weights of) f Feat , f Mesh , and f Disc on D n , and update f Feat , f Mesh , and f Disc .

Part in-painting loss L inp . The part in-painting task could be framed as an image generation task, as in [21]. We erased a patch p from the center of the RGB image x to obtain an image with a hole, x hole . The image x hole was input to f Inp ( f Feat (x hole )), and the part in-painter f Inp was trained to reconstruct the output p̂ = f Inp ( f Feat (x hole )). We trained the network using both the mean square error (MSE) loss and a GAN-type adversarial loss given by the discriminator f Disc to reconstruct a realistic patch p̂ . The MSE loss was enforced to make the reconstructed patch p̂ look similar to the original patch p, and the GAN-type adversarial loss was enforced to make the reconstructed image x̂ inp , which combines x hole and p̂ , look more realistic. f Inp was pre-trained using these two terms weighted by λ inp and λ disc (Equation (4)), where x̂ inp is the in-painted image that fills the hole in x hole with the estimated p̂ , and λ inp and λ disc were set to 0.999 and 0.001, respectively.
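The weighted combination of the two in-painting terms can be sketched as below. Using the LSGAN generator form (score − 1)² for the adversarial term is an assumption of this sketch; only the MSE term and the 0.999/0.001 weights are stated in the text.

```python
import numpy as np

LAMBDA_INP, LAMBDA_DISC = 0.999, 0.001  # weights quoted in the text

def inpainting_loss(p_hat, p, disc_score):
    """Pre-training loss for the part in-painter: heavily weighted MSE
    between reconstructed and original patch, plus a small adversarial
    term. disc_score is the discriminator's output on the in-painted
    image; the LSGAN form (disc_score - 1)^2 is assumed here."""
    mse = float(np.mean((p_hat - p) ** 2))
    adv = (disc_score - 1.0) ** 2
    return LAMBDA_INP * mse + LAMBDA_DISC * adv
```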
Adversarial loss L adv . The discriminator f Disc was trained to discriminate original images x from images x̂ inp (that is, the combination of x hole and p̂ ) output from f Inp , using the LSGAN [23] objective (Equation (5)).

RGB rendering loss L RGB . When projecting 3D meshes into RGB images, RGB values for each face of the 3D mesh were required. The texture network f Tex was pre-trained to infer such RGB values using the loss L RGB (Equation (6)), where x render and x̂ render denote the pseudo ground-truth obtained from f r-cnn (x) ⊙ x and the rendered meshes f NMR ( f Mesh ( f Feat (x)), f Tex (x)), respectively. The operation ⊙ denotes pixel-wise multiplication, and f r-cnn denotes the pre-trained Mask-RCNN [58] network. The rationale behind the use of Mask-RCNN [58] for obtaining the pseudo ground-truth x render is that 3D mesh reconstruction is more challenging than 2D segmentation mask prediction.
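The masked pseudo-ground-truth construction can be sketched as follows; the MSE distance between the masked image and the rendered image is an assumption of this sketch (only the pixel-wise masking by the Mask-RCNN output is stated in the text).

```python
import numpy as np

def rgb_rendering_loss(x, person_mask, x_render_hat):
    """L_RGB sketch: the pseudo ground-truth is the pixel-wise product of
    the Mask-RCNN person mask and the input image, compared against the
    neurally rendered image (MSE assumed here)."""
    x_render = person_mask[..., None] * x  # keep only the person pixels
    return float(np.mean((x_render - x_render_hat) ** 2))
```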

Mesh Training Stage
At this stage, we trained the feature extractor f Feat , mesh estimator f Mesh , and the discriminator f Disc with the aid of the pre-trained networks obtained from the pre-training stage. We ran the training for T 2 = 10 epochs using the loss L (Equation (1)) with a learning rate of 5 × 10 −6 . The learning rate was decreased using exponential learning rate decay, whose decay rate was set to 0.99 for each epoch. In the remainder of this subsection, we elaborate on the sub-losses of L: L IM and L NR .
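The decay schedule above amounts to the following, multiplying the base rate of 5 × 10⁻⁶ by 0.99 after every epoch:

```python
def lr_at_epoch(epoch, base_lr=5e-6, decay=0.99):
    """Exponential learning rate decay used in the mesh training stage:
    the rate is multiplied by 0.99 after every epoch."""
    return base_lr * decay ** epoch
```

Over the T₂ = 10 epochs this reduces the rate by under 10 percent, so the schedule is a gentle anneal rather than an aggressive one.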
Image manipulation loss L IM . This loss consisted of sub-losses reflecting three self-supervision tasks (i.e., solving the jigsaw puzzling, rotation prediction, and part in-painting tasks). To enrich the feature extractor f Feat using these tasks, we used the loss L IM , which combines the losses defined in Equations (2)-(4). We used the same form of losses as Equations (2)-(4) while changing the target training network to f Feat :

L IM = L jigsaw ( f Feat |x jigsaw ) + L rot ( f Feat |x rot ) + L inp ( f Feat |x hole ). (7)

Neural rendering loss L NR . This loss reflected the last self-supervision task (i.e., solving the image projection task). Given a 2D image x, 3D human meshes are estimated by the sequential combination of f Mesh and f Feat : f Mesh ( f Feat (x)). Then, we can render the estimated 3D meshes m to images x̂ render and segmentation masks x̂ mask using the neural mesh renderer f NMR . L RGB and L mask are responsible for making the two rendered images (i.e., x̂ render and x̂ mask ) close to their original inputs, respectively. Additional losses L disc and L real are defined to train the discriminator and obtain more realistic images. The combination of the four sub-losses is called the neural rendering loss L NR (Equation (9)), where x̂ render,bg is the image that combines the rendered image x̂ render and a random background image x bg . L RGB ( f Feat , f Mesh |x) has the same form as L RGB ( f Tex |x) in Equation (6); however, the target training networks are changed to f Feat and f Mesh . L mask ( f Feat , f Mesh |x), L real ( f Feat , f Mesh |x), and L disc ( f Disc |x, x̂ render , x̂ inp ) are newly defined in Equations (10) and (11), where x mask is the binary mask estimated from Mask-RCNN [58]. Using Equation (11), the discriminator f Disc is trained to discriminate the original images x as real, while the rendered images x̂ render and in-painted images x̂ inp are determined to be fake.
Joint loss L joints . When 2D or 3D skeleton joints are available, this loss is used to bring the keypoints regressed from the 3D mesh vertices close to their ground-truth locations (the SMPL [13] model is accompanied by a regressor that geometrically computes 3D skeletons from mesh vertices). This is an optional loss that is not used for the self-supervised setting (i.e., Ours (self-sup)); however, we used this loss via 2D skeletons for the weakly-supervised setting (i.e., Ours (weakly-sup)) and via both 2D and 3D skeletons for the semi-supervised setting (i.e., Ours (semi-sup)).

Incorporation of different human images.
As we constituted our framework with a self-supervised loss that did not require any 2D/3D skeleton annotations, we were able to train our network with any RGB human images outside of our training dataset. We additionally involved 2D images from the MPI-INF-3DHP [59], MPII [60], COCO [61], and LSP [62] datasets, which have been used in weakly/semi-supervised settings, plus our own collection of wild YouTube data. When detecting humans in the images, we used the Mask-RCNN network [58] to check whether there was a human bounding box. The results in Table 5 using the "Full w/o Y" (row 4) and "Full" (row 5) datasets can be compared to see the performance improvement from the incorporation of human images, where "Full" and "Full w/o Y" denote the model trained by datasets with and without our own collection of wild YouTube images, respectively. We could observe that the accuracy was improved by the incorporation of human images.

Table 5. Ablation study on Human3.6M to analyze the effectiveness of the objective functions (first 2 rows) and results of weakly-supervised and semi-supervised approaches (last 2 rows). We also conducted an ablation study depending on the different training dataset compositions: "Full" denotes the full training dataset that we described in Section 4.1. Rows 3-4 show the ablation results using different training datasets, where "H36M" and "Full w/o Y" denote the Human3.6M dataset and the full datasets without the collection of wild YouTube data, respectively. We use protocol-2 for "H36M".

Experiments
In this section, we explain the datasets and evaluation method and then analyze the obtained results in a qualitative and quantitative manner. We also provide ablation results by varying the parameters of our framework.
The Human3.6M dataset is a widely used dataset with paired images and 2D/3D pose annotations. This is an in-studio dataset captured from four camera views; however, we did not use multi-view images for training. Following [5], we used subjects S1, S5, S6, S7, and S8 for training and S9 and S11 for testing. We computed the mean per joint position error (MPJPE) and Procrustes analysis-MPJPE (PA-MPJPE) on the Human3.6M test dataset to evaluate the 3D pose estimation results.
MPI-INF-3DHP has 2D images and corresponding 2D/3D skeleton annotation pairs. This is an in-studio dataset captured from eight camera views; however, we did not use multi-view images for training. The MPII dataset and COCO dataset consist of indoor and outdoor images with only 2D annotations.
The LSP dataset is an in-the-wild athletic dataset with 2D annotations. This dataset is used for evaluation, with its (part-)segmentation ground truths secured by [43]. The 3DPW dataset is an in-the-wild outdoor dataset and is used only for evaluation purposes. We further collected images from YouTube of jogging or dancing motions, with diverse backgrounds and viewpoints (the dataset is available in our Github repository).

Evaluation Method
We used two approaches to evaluate the estimated 3D poses: MPJPE and PA-MPJPE. MPJPE is the mean Euclidean distance between the ground truth and prediction for a joint. PA-MPJPE is Procrustes analysis-MPJPE, which aligns the estimated 3D poses to their ground-truths by a similarity transformation called Procrustes analysis before computing the MPJPE. We also used two approaches to evaluate the estimated 3D shapes: accuracy and F1 score for FG-BG and part-segmentation masks. We measured their accuracy and F1 score by a pixel-wise comparison between estimated masks and their ground truths.
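The two pose metrics described above can be sketched as follows: MPJPE is a direct per-joint distance, while PA-MPJPE first solves for the best similarity transform (scale, rotation, translation) via Procrustes analysis before measuring the same distance.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error: mean Euclidean distance per joint
    between predicted and ground-truth (J, 3) joint arrays."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))

def pa_mpjpe(pred, gt):
    """MPJPE after aligning pred to gt with the similarity transform
    found by Procrustes analysis (scale s, rotation R, translation)."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    P, G = pred - mu_p, gt - mu_g           # center both point sets
    U, S, Vt = np.linalg.svd(P.T @ G)       # orthogonal Procrustes via SVD
    d = np.sign(np.linalg.det(U @ Vt))      # force a proper rotation
    S[-1] *= d
    Vt[-1] *= d
    R = U @ Vt
    s = S.sum() / (P ** 2).sum()            # optimal isotropic scale
    aligned = s * P @ R + mu_g
    return mpjpe(aligned, gt)
```

A prediction that differs from the ground truth only by a similarity transform has PA-MPJPE near zero while its raw MPJPE stays large, which is exactly the gap the aligned metric is meant to remove.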

Results
For the Human3.6M dataset, we compare our method to recent fully/semi/weakly/self-supervised methods [5,36,37,43,47-50,64] that output the 3D human poses under protocol-2 in Table 6. Our method outperformed previous methods in all three types of supervision settings (i.e., self/weakly/semi-supervised). Kundu et al. [64]'s method is a competitive self-supervised learning framework using video priors; however, our method outperformed it even without any video priors. Li et al. [54]'s method, which uses the self-supervised setting, is the only baseline that outperforms our method. However, that work is not directly comparable to ours, as its setting is easier: it estimates only skeletal joints and uses multi-view priors. We provide full 3D meshes, inferring both pose and shape, and use only 2D images as the supervision signals.
To show the generalization ability of our method, we evaluated it on 3DPW, whose data were never seen during training. For the 3DPW dataset, we compared the accuracy of our method to those of recent state-of-the-art approaches [4,5,9,47,51,64] in Table 7. For the LSP dataset, as it does not contain 3D pose annotations, we only measured the accuracy and F1 score of FG-BG segmentation and six-part segmentation, respectively, for comparison with other algorithms. We compared our method to state-of-the-art optimization-based methods [9,37,65] and regression-based methods [5,49,50,64] in Table 8. Optimization-based methods tend to perform better on segmentation tasks than regression-based methods; however, our method outperformed most of them even in the self-supervised setting (i.e., Ours (self-sup)), and our approach also recorded the best results when involving more supervision (i.e., Ours (weakly-sup) and Ours (semi-sup)).

Table 6. Evaluation on Human3.6M (Protocol-2). Methods in the first 10 rows use equivalent 2D and 3D pose supervision. Methods in rows 11-14 and rows 15-18 are weakly-supervised and self-supervised approaches, respectively. * indicates methods that output only 3D joints, not 3D meshes.

Ablation Study
To analyze the effectiveness of our entire training objective (Equation (1)), we conducted ablation experiments on our losses, as presented in Table 5. In our initial experiment, we used the combination of the three image manipulation task losses (i.e., jigsaw puzzling, rotation prediction, and part in-painting) and the RGB rendering loss L_RGB for the "mesh training stage". However, given only these four losses, we observed that the 3D meshes were trained in the wrong way, inferring very small camera scale parameters, as this might be an easier way to reduce the RGB rendering loss L_RGB. To relieve this, we added further constraints to our loss via the segmentation mask loss L_mask, defined in Equation (10); we observed that both L_mask and the discriminator loss L_adv are helpful for preventing this phenomenon. In Table 5, by comparing the results of "Ours (self-sup)-L_real-L_mask", "Ours (self-sup)-L_real", and "Ours (self-sup)", we conclude that results improve as more supervision is used. The corresponding qualitative results are shown in Figure 4. Furthermore, we experimented with different dataset configurations for training in the same table (Table 5): rows 3-4 of Table 5 show the accuracy when training with fewer datasets, where "H36M" and "Full w/o Y" denote the Human3.6M dataset and the full datasets without the collection of wild YouTube data, respectively. "Full" denotes the full training data described in Section 4.1, including the collected wild YouTube data. We can see that involving more image data results in improved accuracy.
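As a rough illustration of how the individual terms discussed above combine into the overall objective (Equation (1)), the sketch below sums the three image-manipulation losses, the two rendering losses, and the adversarial loss. The loss names follow the text, but the weighting scheme and lambda values are hypothetical placeholders, not the paper's actual settings.

```python
# Hypothetical sketch of the overall training objective (Equation (1)).
# The loss names follow the text; the weights are illustrative placeholders.
DEFAULT_WEIGHTS = {
    "jigsaw": 1.0,  # jigsaw puzzling            (L_jigsaw)
    "rot":    1.0,  # rotation prediction        (L_rot)
    "inp":    1.0,  # part in-painting           (L_inp)
    "rgb":    1.0,  # RGB rendering              (L_RGB)
    "mask":   1.0,  # segmentation-mask rendering (L_mask)
    "adv":    0.1,  # discriminator/adversarial  (L_adv)
}

def total_loss(l_jigsaw, l_rot, l_inp, l_rgb, l_mask, l_adv, w=DEFAULT_WEIGHTS):
    """Weighted sum of the three image-manipulation losses,
    the two rendering losses, and the adversarial loss."""
    return (w["jigsaw"] * l_jigsaw + w["rot"] * l_rot + w["inp"] * l_inp
            + w["rgb"] * l_rgb + w["mask"] * l_mask + w["adv"] * l_adv)
```

Zeroing an individual weight corresponds to dropping the matching term in the ablation rows of Table 5; for example, removing the L_mask term reproduces the small-camera-scale degeneracy described above.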
We further analyzed the effectiveness of the individual image-level supervisions (i.e., L_jigsaw, L_rot, and L_inp) in Table 9. We conducted this experiment by applying one of our self-supervision losses at a time to our self-, weakly-, and semi-supervised baselines. For example, "Ours (self-sup)-L_rot-L_inp", "Ours (self-sup)-L_jigsaw-L_inp", and "Ours (self-sup)-L_rot-L_jigsaw" denote, for our self-supervision setting, the network trained without the rotation prediction loss L_rot and the part in-painting loss L_inp, the network trained without the jigsaw puzzle loss L_jigsaw and the part in-painting loss L_inp, and the network trained without the rotation prediction loss L_rot and the jigsaw puzzle loss L_jigsaw, respectively. From these experiments, we can see that (1) none of the three losses clearly prevails over the remaining two, and (2) applying all three losses consistently outperforms applying only one of them. Thus, we combine the three image-level self-supervisions (i.e., L_jigsaw, L_rot, and L_inp) in our main experiments.

Table 9. Ablation study on Human3.6M, 3DPW, and LSP to analyze the effectiveness of the image manipulation task losses.

Conclusions
In this paper, we presented a self-supervised learning framework for recovering 3D human meshes from a single RGB image. We proposed combining three image manipulation task losses and one neural rendering loss to enrich the feature space and boost the mesh reconstruction accuracy. We also found that the combination of RGB rendering, mask rendering, and adversarial losses is essential to properly achieve image-level self-supervision without video or multi-view priors. Experiments were conducted on three popular benchmarks, and our algorithm achieved the best accuracy among competitive self-, weakly-, semi-, and fully-supervised algorithms. From these results, we conclude that single 2D image-level supervision can be a strong supervision signal for training 3D human pose and shape estimators. While we showed promising results using the proposed single 2D image-level self-supervision losses, avoiding the effort of collecting other types of supervision signals, a small inconvenience remains in our framework: its hyper-parameters must be set manually. Future work should explore a fully automatic pipeline that resolves this inconvenience by including schemes to select the optimal hyper-parameters for our losses (e.g., the tile number in the jigsaw puzzle loss, the rotation angles in the rotation prediction loss, etc.).