Structure from Articulated Motion: Accurate and Stable Monocular 3D Reconstruction without Training Data

Recovery of articulated 3D structure from 2D observations is a challenging computer vision problem with many applications. Current learning-based approaches achieve state-of-the-art accuracy on public benchmarks but are restricted to specific types of objects and motions covered by the training datasets. Model-based approaches do not rely on training data but show lower accuracy on these datasets. In this paper, we introduce a model-based method called Structure from Articulated Motion (SfAM), which can recover multiple object and motion types without training on extensive data collections. At the same time, it performs on par with learning-based state-of-the-art approaches on public benchmarks and outperforms previous non-rigid structure from motion (NRSfM) methods. SfAM is built upon a general-purpose NRSfM technique while integrating a soft spatio-temporal constraint on the bone lengths. We use an alternating optimization strategy to recover optimal geometry (i.e., bone proportions) together with 3D joint positions by enforcing bone length consistency over a series of frames. SfAM is highly robust to noisy 2D annotations, generalizes to arbitrary objects and does not rely on training data, as shown in extensive experiments on public benchmarks and real video sequences. We believe that it brings a new perspective on the domain of monocular 3D recovery of articulated structures, including human motion capture.


Introduction
3D structure recovery of articulated objects (i.e., comprising multiple connected rigid parts) from a set of 2D point tracks through multiple monocular images is a challenging computer vision problem [1][2][3][4]. Articulated structure recovery is ill-posed due to missing information about the third dimension [5]. Its applications include gesture and activity recognition, character animation in movies and games, and motion analysis in sport and robotics.
Recently, multiple learning-based approaches that recover 3D structures from 2D landmarks have been introduced [6][7][8][9]. These methods show state-of-the-art accuracy across public benchmarks. However, they are restricted to a specific kind of structure (e.g., human skeleton) and require extensive datasets for training. Moreover, they often fail to recover poses that are different from the training examples (see Section 4.2.5). When a scene includes different types of articulated objects, different methods have to be applied to reconstruct the whole scene.
In this paper, we introduce a general approach for accurate recovery of 3D poses of any articulated structure from 2D observations that does not rely on training data (see Figure 1). We build upon the recent progress in non-rigid structure from motion (NRSfM), which is a general technique for non-rigid 3D reconstruction from 2D point tracks. However, when an articulated object is treated as a general non-rigid one, reconstructions can exhibit significant variations in the distances between connected joints (see Section 4.2.3). These distances have to remain nearly constant across all articulated poses. Our method relies on this assumption and imposes a spatio-temporal constraint on the bone lengths. We call our approach Structure from Articulated Motion (SfAM). We apply an articulated structure term as a soft constraint on top of the classic optimization problem of NRSfM [10]. This term enforces the bone lengths, though not known in advance, to remain constant across all frames. Our optimization strategy alternates between the classic NRSfM problem and our articulated structure term until both converge. This allows us to recover the geometry together with the 3D joint positions without relying on known bone lengths. Starting from a rough initialization of the articulated structure (e.g., a human leg is longer than an arm), SfAM still converges to the correct structure proportions (see Section 4.2.3). Figure 2 illustrates the significant difference between results produced by a general-purpose NRSfM technique [11] and our SfAM.
Figure 2. Side-by-side comparison of the non-rigid structure from motion (NRSfM) method [11] and our SfAM. Reconstruction results of [11] violate anthropometric properties of the human skeleton due to changing bone lengths from frame to frame.
To summarise, our contributions are:
• A generic framework for articulated structure recovery which achieves state-of-the-art accuracy among non-learning-based methods across public datasets. Moreover, it shows performance close to state-of-the-art learning-based methods but at the same time is not restricted to specific objects (see Section 4) and does not require training data.
• SfAM recovers sequence-specific bone proportions together with 3D joints (see Section 3). Thus, it does not need known bone lengths.
• The articulated prior energy term makes our approach robust to noisy 2D observations (see Section 4.2.2) by imposing additional constraints on the 3D structure.
In this paper, we show that a non-learning-based approach can perform on par with state-of-the-art learning-based methods and even outperform some of them in real-world scenes (see Section 4.2.5). We demonstrate the effectiveness of SfAM for the recovery of different articulated structures through extensive quantitative and qualitative evaluation on different datasets [12][13][14] and real-world scenes (see Section 4). To the best of our knowledge, SfAM is the first NRSfM approach evaluated on such comprehensive datasets as Human 3.6m [12] and NYU hand pose [14]. As a side effect, our method can be used for precise articulated model estimation, e.g., to generate personalized human skeleton rigs (see Section 4.2.3). This contrasts sharply with most recent supervised learning approaches, which require extensive labeled databases for training and still often fail when unfamiliar poses are observed (see Section 4.2.5). Moreover, for learning-based methods, minor changes in the inputs lead to significant variations in the poses, which makes their results very difficult or impossible to reproduce.

Related Work
Rigid and Non-Rigid Structure from Motion. Factorization-based Structure from Motion (SfM) is a general technique for 3D structure recovery from 2D point tracks. An SfM problem is well-posed for rigid objects due to the rigidity constraint [15]. Early extensions of Tomasi and Kanade's method [15] to the non-rigid case rely on rank and orthonormality constraints [16,17]. Subsequent methods investigated shape basis priors [18], temporal smoothness priors [19], trajectory space constraints [20] as well as such fundamental questions as shape basis uniqueness [21,22]. More recent methods combine priors in the metric and trajectory spaces [23]. To improve the reconstruction of stronger nonlinear deformations, Zhu et al. [24] introduce unions of linear subspaces. Dai et al. [10] propose an NRSfM method with as few additional constraints as possible. Lately, the focus of NRSfM research has shifted to the problem of scalability [11,25], i.e., consistent performance across different scenarios and linear computational complexity in the number of points. Our SfAM is a scalable approach which builds upon the work of Ansari et al. [11]. In contrast to [11], we recover articulated structures with higher accuracy.
Articulated and Multibody Structure from Motion. Over the last few years, several SfM approaches for articulated motion recovery were proposed. Some of them relax the global rigidity constraint for multiple parts [26,27] so that each of the parts is constrained to be rigid. They can handle relatively simple articulated motions, as the segmentation and the structure composition are assumed to be unknown [26]. As a result, these methods are hardly applicable to such complicated scenarios as human and hand pose recovery. Tresadern and Reid [28], Yan and Pollefeys [29] and Paladini et al. [26] address the articulated case with two rigid body parts and detect a hinge joint. Later, an approach with spatial smoothness and segmentation dealing with an arbitrary number of rigid parts was proposed by Fayad et al. [30]. Park and Sheikh [31] reconstruct trajectories given parent trajectories and known bone length, known camera, and root motion for each frame. Their objective is highly nonlinear and requires good initialization of trajectory parameters. In contrast, our method recovers sequence-specific bone proportions and does not rely on given bone lengths. Next, Valmadre et al. [32] propose a dynamic-programming approach for the reconstruction of articulated 3D trees from input 2D joint positions operating in linear time. Multibody SfM methods reconstruct multiple independent rigid body transformations and non-rigid deformations in the same scene [27,33]. In contrast, our approach is more general as it imposes a soft constraint of articulated motion on top of classic NRSfM.
Piecewise and Locally Rigid Structure from Motion. Piecewise rigid approaches interpret the structure as locally rigid in the spatial domain [34,35]. Several methods divide the structure into patches, each of which can deform non-rigidly [36,37]. The high granularity of operation allows these methods to reconstruct large deformations, as opposed to methods relying on linear low-rank subspace models [36]. Rehan et al. [38] penalize deviations of the bone lengths from the average distances between the joints over the whole sequence. This form of constraint does not guarantee a realistic reconstruction though, as it struggles to compensate for inaccurate 2D estimations or 3D inaccuracies in short time intervals.
Monocular 3D Human Body and Hand Pose Estimation. Bone length constraints are widely used in the single-view regression of 3D human poses. One of the early works in this domain operates on single uncalibrated images and imposes constraints on the relative bone lengths [39]. It is capable of reconstructing a human pose up to scale. Later, an enhancement for multiple frames with bone symmetry and rigidity constraints (joints representing the same bone move rigidly relative to each other) was introduced by Wei and Chai [40]. Akhter and Black [41] use a pose prior that captures pose-dependent joint angle limits. Ramakrishna et al. [1] use a sum of squared bone lengths term that can still lead to unrealistic poses. Wandt et al. [2] constrain the bone lengths to be invariant. Their trilinear factorization approach relies on pre-trained body poses serving as a shape prior and transcendental functions modeling periodic motion peculiar to the human gait. An adaptation of this approach to hand gestures would require the acquisition of a new shape prior. Wandt et al. [42] constrain the sum of squared bone lengths of the articulated structure to be invariant throughout the image sequence. However, the length of each bone can still vary. One of the modern methods for human pose and appearance estimation is MonoPerfCap of Xu et al. [43]. It imposes implicit bone length constraints through a dense template tailored to a specific person and captured in an external acquisition process.
Recently, many learning-based approaches for human pose and hand pose estimation have been presented in the literature [9,44-51]. In [7], weak supervision constrains the output of the network with fixed bone proportions taken from the training dataset. Sun et al. [52] exploit a joint connection structure and use bones instead of joints for pose representation. Wandt and Rosenhahn [53] use a kinematic chain representation and include bone length information in their loss function during training. In contrast to our SfAM, [53] is not as robust to noisy 2D input (see Section 4.2.2). All these methods are highly specialized and rely on extensive collections of training data. In contrast, our SfAM is a general approach that can cope with different articulated structures, with no need for labeled datasets.

The Proposed SfAM Approach
Figure 3 shows a high-level overview of our approach. Following factorization-based NRSfM [10], we first recover the camera pose using 2D landmarks (Section 3.2). For 3D structure recovery, we extend the target energy function of the classic NRSfM problem [10,11] by our articulated prior term (Section 3.3.1). We assume that sparse 2D correspondences are given. In Section 3.3.2, we show how our new energy is efficiently optimized by alternating between the fixed-point continuation algorithm [54] and Levenberg-Marquardt [55,56]. This leads to an accurate reconstruction of articulated motions of different structures.
Figure 3. The pipeline of the proposed SfAM approach. Following factorization-based NRSfM, we first recover the camera pose using 2D position observations. Then, we recover 3D articulated structure by optimizing our new energy functional accounting for articulated priors.

Factorization Model
The input to SfAM is the measurement matrix $W = [W_1, W_2, \dots, W_T]^T \in \mathbb{R}^{2T \times N}$ with $N$ 2D joints tracked over $T$ frames. Every $W_t$, $t \in \{1, \dots, T\}$, is registered to the centroid of the observed structure and the translation is resolved in advance. Most NRSfM methods assume orthographic projection, as the intrinsic camera model is usually not known. Even though some benchmarks (e.g., [12]) provide camera parameters, we develop a general approach for uncalibrated settings. Following standard SfM approaches, we assume that every 2D projection $W_t$ can be factorized into a camera pose-projection matrix $R_t \in \mathbb{R}^{2 \times 3}$ and a 3D structure $S_t \in \mathbb{R}^{3 \times N}$ so that $W_t = R_t S_t$. We assume that the articulated structure deforms under the low-rank shape model [11,16]. Thus, $S = [S_1, S_2, \dots, S_T]^T \in \mathbb{R}^{3T \times N}$ can be parametrized by the set of unknown basis shapes $B \in \mathbb{R}^{3K \times N}$ of cardinality $K$ and the coefficient matrix $C \in \mathbb{R}^{T \times K}$:
$$W = RS = R\,(C \otimes I_3)\,B = MB, \tag{1}$$
where $R = \mathrm{bkdiag}(R_1, R_2, \dots, R_T)$ is the joint camera pose-projection matrix, $M = R\,(C \otimes I_3) \in \mathbb{R}^{2T \times 3K}$, $I_3$ is the $3 \times 3$ identity matrix and $\otimes$ denotes the Kronecker product.
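The factorization model can be made concrete with a small synthetic example. The following NumPy sketch composes a measurement matrix from block-diagonal orthographic cameras, a coefficient matrix, and a set of basis shapes. The helper names (`random_orthographic_camera`, `build_measurements`) and the random test data are illustrative assumptions, not part of the SfAM implementation.

```python
import numpy as np

def random_orthographic_camera(rng):
    """Return a 2x3 matrix with orthonormal rows (two rows of a random orthogonal matrix)."""
    q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    return q[:2, :]

def build_measurements(cameras, coeffs, basis):
    """Compose W = R (C kron I_3) B for T frames, K basis shapes and N joints."""
    T, _ = coeffs.shape
    R = np.zeros((2 * T, 3 * T))
    for t, R_t in enumerate(cameras):
        # Place every 2x3 camera on the block diagonal of R.
        R[2 * t:2 * t + 2, 3 * t:3 * t + 3] = R_t
    S = np.kron(coeffs, np.eye(3)) @ basis  # stacked 3D shapes, shape (3T, N)
    return R @ S, S

# Hypothetical toy dimensions and random data.
rng = np.random.default_rng(0)
T, K, N = 6, 2, 10
cameras = [random_orthographic_camera(rng) for _ in range(T)]
C = rng.standard_normal((T, K))
B = rng.standard_normal((3 * K, N))
W, S = build_measurements(cameras, C, B)
print(W.shape, np.linalg.matrix_rank(W))
```

Because every frame is a linear combination of K basis shapes, the rank of the synthetic measurement matrix cannot exceed 3K, which is the property exploited by factorization-based methods.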

Recovery of Camera Poses
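Applying singular value decomposition to $W$, we obtain initial estimates $\hat{M}$ and $\hat{B}$ of $M$ and $B$ from Equation (1) up to an invertible corrective transformation $Q \in \mathbb{R}^{3K \times 3K}$:
$$W = \hat{M}\hat{B} = (\hat{M} Q)(Q^{-1} \hat{B}) = MB. \tag{2}$$
In the following, we use the shortcuts $\hat{M}_{2t-1:2t} \in \mathbb{R}^{2 \times 3K}$ for every $t$-th pair of rows of $\hat{M}$ (with individual rows $\hat{M}_{2t-1}$ and $\hat{M}_{2t}$) and $Q_k \in \mathbb{R}^{3K \times 3}$ for the $k$-th column triplet of $Q$, $k \in \{1, \dots, K\}$. Considering (1) and (2), for every $t \in \{1, \dots, T\}$ and $k \in \{1, \dots, K\}$, we have:
$$\hat{M}_{2t-1:2t}\, Q_k = c_{tk} R_t, \tag{3}$$
where $c_{tk}$ is the $(t, k)$-th entry of $C$. Using the orthonormality constraints $R_t R_t^T = I_2$ and denoting $F_k = Q_k Q_k^T$, we obtain:
$$\hat{M}_{2t-1:2t}\, F_k\, \hat{M}_{2t-1:2t}^T = c_{tk}^2 I_2. \tag{4}$$
Therefore, the following system of equations can be written for every $t$ and $k$:
$$G_t\, \mathrm{vec}(F_k) = \begin{bmatrix} \hat{M}_{2t-1} \otimes \hat{M}_{2t-1} - \hat{M}_{2t} \otimes \hat{M}_{2t} \\ \hat{M}_{2t-1} \otimes \hat{M}_{2t} \end{bmatrix} \mathrm{vec}(F_k) = 0, \tag{5}$$
where $\mathrm{vec}(\cdot)$ is the vectorization operator which rearranges an $m \times n$ matrix into an $mn$ column vector. Stacking all $G_t$ vertically, we obtain:
$$G\, \mathrm{vec}(F_k) = 0, \quad \text{where} \quad G = [G_1^T, G_2^T, \dots, G_T^T]^T. \tag{6}$$
Finding an optimal $F_k$ can be performed by solving the optimization problem:
$$\min_{F_k} \|G\, \mathrm{vec}(F_k)\|_2^2 \quad \text{s.t.} \quad \mathrm{rank}(F_k) = 3. \tag{7}$$
Due to the rank-3 constraint on every $F_k$, this problem is solved by the iterative shrinkage-thresholding (IST) method [57]. Once an optimal $F_k$ is found, the corrective transformation $Q$ is recovered by Cholesky decomposition. Using $Q$, $R$ is recovered from Equations (1)-(4).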
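The initial factorization step and its inherent ambiguity can be sketched in a few lines of NumPy. The sketch below truncates the SVD of a synthetic measurement matrix to rank 3K and checks that any invertible corrective transformation leaves the product unchanged. The symmetric square-root split of the singular values and the function name `initial_factorization` are assumptions made for illustration; the sketch does not implement the recovery of the corrective transformation itself.

```python
import numpy as np

def initial_factorization(W, K):
    """Rank-3K factorization of W via truncated SVD (symmetric square-root split assumed)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    r = 3 * K
    root = np.sqrt(s[:r])
    M_hat = U[:, :r] * root            # scale the columns of U, shape (2T, 3K)
    B_hat = root[:, None] * Vt[:r, :]  # scale the rows of V^T, shape (3K, N)
    return M_hat, B_hat

# Hypothetical low-rank measurements: T frames, K basis shapes, N joints.
rng = np.random.default_rng(0)
T, K, N = 8, 2, 12
W = rng.standard_normal((2 * T, 3 * K)) @ rng.standard_normal((3 * K, N))

M_hat, B_hat = initial_factorization(W, K)

# Any invertible Q yields an equally valid factorization of W.
Q = np.eye(3 * K) + 0.1 * rng.standard_normal((3 * K, 3 * K))
W_again = (M_hat @ Q) @ (np.linalg.inv(Q) @ B_hat)
print(np.allclose(M_hat @ B_hat, W), np.allclose(W_again, W))
```

This is why the orthonormality constraints on the cameras are needed: the measurements alone cannot distinguish between the infinitely many factorizations that differ by an invertible transformation.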

Articulated Structure Representation
Having found $R$, we recover $S$. Note that we optionally rely on an updated $W$ after the smooth shape trajectory step, which imposes additional constraints on point trajectories and reduces the overall number of unknowns; please refer to [11] for more details. We rearrange the shape matrix $S$ to
$$S^\# = \begin{bmatrix} X_{11} & \cdots & X_{1N} & Y_{11} & \cdots & Y_{1N} & Z_{11} & \cdots & Z_{1N} \\ \vdots & & \vdots & \vdots & & \vdots & \vdots & & \vdots \\ X_{T1} & \cdots & X_{TN} & Y_{T1} & \cdots & Y_{TN} & Z_{T1} & \cdots & Z_{TN} \end{bmatrix} \in \mathbb{R}^{T \times 3N}, \tag{8}$$
where $(X_{tn}, Y_{tn}, Z_{tn})$, $n \in \{1, \dots, N\}$, is the 3D coordinate of joint $n$ in frame $t$. $S^\#$ can be represented as:
$$S^\# = [P_x \; P_y \; P_z]\,(I_3 \otimes S), \tag{9}$$
where $P_x, P_y, P_z \in \mathbb{R}^{T \times 3T}$ are binary row selectors. We follow [10,11] and represent the optimal non-rigid structure by:
$$\min_{S} \|\Pi S^\#\|_* \quad \text{s.t.} \quad W = RS, \tag{10}$$
where $\Pi = (I - \frac{1}{T} \mathbf{1}\mathbf{1}^T)$ ($\mathbf{1}$ is a vector of ones) and $\|\cdot\|_*$ denotes the nuclear norm. Note that $\mathrm{rank}(S^\#) \le K$, and the mean 3D component is removed from $S^\#$. As shown in Figure 2, non-rigid structures recovered by the optimization of (10) can have significant variations in bone lengths. This often leads to unrealistic poses and body proportions. Unlike general non-rigid structures, in articulated structures, individual rigid parts or bones have constant lengths throughout the whole sequence. Moreover, all the bones follow constant proportions. These constraints are called articulated priors. We incorporate the articulated priors into the objective function (10) in the form of the following energy term:
$$E_{BL}(S) = \sum_{t=1}^{T} \sum_{b=1}^{B} e_{tb}(S), \tag{11}$$
where $e_{tb}(S) = (D^t_b - L_b)^2$ is an energy term for bone $b$ and frame $t$, and $L_b$ is the initial normalized length of bone $b$. The normalization is done with respect to the sum of all initial bone lengths. $D^t_b = \|X^t_{a_b} - X^t_{c_b}\|_2$ is the Euclidean distance between joints $X^t_{a_b}$ and $X^t_{c_b}$ connected by bone $b$; $B$ is the number of bones of the articulated structure. Vectors $a = [X_{a_1}, X_{a_2}, \dots, X_{a_B}]$ and $c = [X_{c_1}, X_{c_2}, \dots, X_{c_B}]$ define the parent and child joints of bones, respectively. Unlike some previous works [7,41,58,59], we do not require predefined bone lengths or proportions. SfAM recovers the optimal articulated structure that minimizes the total energy:
$$\min_{S} \|\Pi S^\#\|_* + \beta E_{BL}(S) \quad \text{s.t.} \quad W = RS, \tag{12}$$
where $\beta$ is a scalar weight. Implementing the articulated prior (11) as a soft constraint makes the overall method robust to incorrect initialization of bone lengths.
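The articulated prior can be illustrated by evaluating the bone-length energy for a toy kinematic chain. The following NumPy sketch sums the squared differences between per-frame bone lengths and normalized initial lengths over all frames and bones. The array layout `(T, N, 3)`, the function names, and the assumption that the joints are already expressed in a scale where the bone lengths sum to one are illustrative choices, not details taken from the SfAM implementation.

```python
import numpy as np

def bone_lengths(joints, parents, children):
    """Per-frame Euclidean bone lengths; joints has shape (T, N, 3), result has shape (T, B)."""
    return np.linalg.norm(joints[:, parents, :] - joints[:, children, :], axis=2)

def bone_length_energy(joints, parents, children, init_lengths):
    """Sum over frames and bones of (D_tb - L_b)^2 with L normalized to unit sum."""
    L = np.asarray(init_lengths, dtype=float)
    L = L / L.sum()  # normalization with respect to the sum of all initial bone lengths
    D = bone_lengths(joints, parents, children)
    return float(np.sum((D - L) ** 2))

# Toy chain with three joints and two bones (0 -> 1 and 1 -> 2).
parents, children = [0, 1], [1, 2]
frame_a = [[0.0, 0.0, 0.0], [0.4, 0.0, 0.0], [0.4, 0.6, 0.0]]
frame_b = [[0.0, 0.0, 0.0], [0.0, 0.4, 0.0], [-0.6, 0.4, 0.0]]  # frame_a rotated by 90 degrees
stretched = [[0.0, 0.0, 0.0], [0.5, 0.0, 0.0], [0.5, 0.6, 0.0]]  # first bone too long by 0.1

energy_rigid = bone_length_energy(np.array([frame_a, frame_b]), parents, children, [2.0, 3.0])
energy_stretched = bone_length_energy(np.array([stretched]), parents, children, [2.0, 3.0])
print(energy_rigid, energy_stretched)
```

A rigidly moving chain with the right proportions incurs no penalty, whereas a frame in which one bone is stretched contributes the squared deviation of that bone.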

Energy Optimization
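Since (12) contains a nonlinear term $E_{BL}(S)$, we introduce an auxiliary variable $A$ and obtain the following optimization problem which is linear with respect to $S$:
$$\min_{S, A} \|\Pi S^\#\|_* + \beta E_{BL}(A) \quad \text{s.t.} \quad W = RS, \; S = A. \tag{13}$$
We rewrite (13) in the Lagrangian form:
$$\mathcal{L}(S, A, \mu) = \mu \|\Pi S^\#\|_* + \frac{1}{2}\|W - RS\|_F^2 + \frac{1}{2}\|S - A\|_F^2 + \beta E_{BL}(A), \tag{14}$$
where $\|\cdot\|_F$ denotes the Frobenius norm and $\mu$ is a parameter. We split (14) into two subproblems:
$$\min_{S} \; \mu \|\Pi S^\#\|_* + \frac{1}{2}\|W - RS\|_F^2 + \frac{1}{2}\|S - A\|_F^2 \tag{15}$$
and
$$\min_{A} \; \frac{1}{2}\|S - A\|_F^2 + \beta E_{BL}(A). \tag{16}$$
We alternate between the subproblems (15) and (16) and iterate until convergence. $A$ remains fixed in (15) and $S$ remains fixed in (16). In every optimization step, the subproblem (15) updates the 3D structure so that it more accurately projects to the observed 2D landmarks. The subproblem (16) penalizes the difference in bone lengths among all frames while recovering the sequence-specific bone proportions. The bone lengths of the recovered optimal 3D structures are almost constant throughout the whole image sequence but different from the initial $L_b$.
The subproblem (15) is linear and solved by the fixed-point continuation (FPC) method [54]. First, we obtain the gradient of $\frac{1}{2}\left(\|W - RS\|_F^2 + \|S - A\|_F^2\right)$:
$$g(S) = R^T (RS - W) + (S - A). \tag{17}$$
Next, FPC for $\min_S \mathcal{L}(S, \mu)$ instantiates as:
$$Y^{(j)} = S^{(j)} - \tau\, g(S^{(j)}), \qquad S^{(j+1)} = \mathcal{S}_{\nu}(Y^{(j)}), \quad \nu = \tau\mu, \tag{18}$$
where $\mathcal{S}_{\nu}(\cdot)$ is the matrix shrinkage operator [54] and $\tau > 0$ is a free parameter. The second subproblem (16) is nonlinear and is optimized for each iteration (18) using the Levenberg-Marquardt implementation of Ceres [60]. Let $r_l$, $l \in \{1, \dots, TN\}$, denote the residuals of $\frac{1}{2}\|S - A\|_F^2$. We aggregate all residuals $e_{tb}(A)$ from (11) (note that $S$ in (11) is replaced by $A$) and all $r_l$ into a single vector:
$$F(A) = [e_{11}(A), \dots, e_{TB}(A), r_1, \dots, r_{TN}]^T. \tag{19}$$
Next, the objective function (16) can be compactly written in terms of $A$ as:
$$E(A) = \|F(A)\|_2^2. \tag{20}$$
The target nonlinear energy optimization problem consists of finding an optimal parameter set $A^*$ so that:
$$A^* = \arg\min_{A} \|F(A)\|_2^2. \tag{21}$$
We solve (21) iteratively. In every optimization step $k$, the objective is linearized in the vicinity of the current solution $A_k$ by the first-order Taylor expansion:
$$F(A_k + \Delta A) \approx F(A_k) + J(A_k)\, \Delta A, \tag{22}$$
with $J(A_k) \in \mathbb{R}^{(BT + TN) \times 3TN}$ being the Jacobian of $F(A_k)$. For every iteration, the objective for $\Delta A$ reads:
$$\min_{\Delta A} \|J(A_k)\, \Delta A + F(A_k)\|_2^2. \tag{23}$$
In Ceres [60], the optimum is computed in the least-squares sense with the Levenberg-Marquardt method:
$$\left(J(A_k)^T J(A_k) + \lambda_k I\right) \Delta A = -J(A_k)^T F(A_k), \tag{24}$$
where $\lambda_k > 0$ is a parameter and $I$ is an identity matrix. The algorithm is summarized in Algorithm 1.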
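The two numerical building blocks named above, matrix shrinkage inside a fixed-point iteration and a damped least-squares update, can be sketched on toy problems. The NumPy code below is a minimal illustration under stated assumptions: the first toy problem is plain nuclear-norm denoising (no cameras), the second is a single bone with two joints, the bone residual is taken as the unsquared length difference weighted by the square root of a hypothetical weight `beta`, the Jacobian is computed numerically, and the damping schedule is a simple hand-written one rather than the trust-region strategy of Ceres.

```python
import numpy as np

def shrink(Y, nu):
    """Matrix shrinkage: soft-threshold the singular values of Y by nu."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U * np.maximum(s - nu, 0.0)) @ Vt

def fpc_denoise(Y, mu, tau=0.5, iters=200):
    """Fixed-point iteration for min_X mu*||X||_* + 0.5*||X - Y||_F^2."""
    X = np.zeros_like(Y)
    for _ in range(iters):
        grad = X - Y                          # gradient of the smooth data term
        X = shrink(X - tau * grad, tau * mu)  # gradient step followed by shrinkage
    return X

def residuals(a, s, length, beta):
    """Stack one bone-length residual and the coupling residuals a - s (two joints, 6 values)."""
    bone = np.linalg.norm(a[:3] - a[3:]) - length
    return np.concatenate(([np.sqrt(beta) * bone], a - s))

def numerical_jacobian(fun, a, eps=1e-6):
    """Central-difference Jacobian of fun at a."""
    J = np.zeros((fun(a).size, a.size))
    for i in range(a.size):
        step = np.zeros_like(a)
        step[i] = eps
        J[:, i] = (fun(a + step) - fun(a - step)) / (2 * eps)
    return J

def levenberg_marquardt(fun, a0, lam=1e-2, iters=50):
    """Damped Gauss-Newton: solve (J^T J + lam*I) delta = -J^T F and adapt lam."""
    a = a0.astype(float).copy()
    cost = 0.5 * np.sum(fun(a) ** 2)
    for _ in range(iters):
        F, J = fun(a), numerical_jacobian(fun, a)
        delta = np.linalg.solve(J.T @ J + lam * np.eye(a.size), -J.T @ F)
        new_cost = 0.5 * np.sum(fun(a + delta) ** 2)
        if new_cost < cost:   # accept the step and trust the linearization more
            a, cost, lam = a + delta, new_cost, lam * 0.5
        else:                 # reject the step and increase the damping
            lam *= 10.0
    return a

rng = np.random.default_rng(0)
Y = rng.standard_normal((8, 6))
X_fpc = fpc_denoise(Y, mu=1.0)

s = np.array([0.0, 0.0, 0.0, 1.3, 0.0, 0.0])  # two joints that are 1.3 apart
a_opt = levenberg_marquardt(lambda a: residuals(a, s, length=1.0, beta=10.0), s)
final_length = np.linalg.norm(a_opt[:3] - a_opt[3:])
print(final_length)
```

In the bone example the solution is a compromise: the bone is pulled from 1.3 towards the target length 1.0 while the joints stay close to their input positions, which mirrors the role of the soft constraint.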

Experiments and Results
We extensively evaluate our SfAM on several datasets including Human 3.6m [12], synthetic sequences of Akhter et al. [13] and NYU hand pose [14] dataset. Moreover, we demonstrate qualitative results on challenging community videos. In total, our SfAM is compared to over thirty state-of-the-art model-based and learning-based methods (see Tables 1 and 2). We also implement SMSR of Ansari et al. [11], which is the most related approach to our SfAM and evaluate it on [12,14] as well as community videos. Moreover, we extend SMSR [11] with the local rigidity constraint of Rehan et al. [38] and include it into our comparison.
In Section 4.2.2, we evaluate the robustness of our approach to inaccuracies in 2D landmarks. The proposed SfAM recovers correct articulated structures given highly inaccurate initial bone lengths in Section 4.2.3. Finally, in Section 4.2.5, we highlight the numerous cases when our method performs better than state-of-the-art learning-based approaches in real-world scenes.
In all experiments, we use a sliding time window of 200 frames. For sequences shorter than 200 frames, we run our method on the whole sequence at once. All experiments are performed on a system with 32 GB RAM and twelve-core Intel Xeon CPU running at 3.6 GHz. Our framework is implemented in C++. Average processing time for a single frame from the Human 3.6m dataset [12] with given 2D annotations amounts to 140 ms.

Evaluation Methodology
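We follow the established evaluation methodology in the area of NRSfM and rigidly align our 3D reconstructions to the ground truth. We report the reconstruction error $E_{3D}$ in mm between ground truth joint positions $\bar{S}^t_n$ and aligned 3D reconstructions $G(S^t_n)$:
$$E_{3D} = \frac{1}{TN} \sum_{t=1}^{T} \sum_{n=1}^{N} \left\| \bar{S}^t_n - G(S^t_n) \right\|_2, \tag{25}$$
where $n \in \{1, \dots, N\}$, $t \in \{1, \dots, T\}$, $T$ is the number of frames in the sequence and $N$ is the number of joints of the articulated object. For some datasets, we report the normalized mean 3D error:
$$e_{3D} = \frac{1}{\sigma T N} \sum_{t=1}^{T} \sum_{n=1}^{N} \left\| \bar{S}^t_n - G(S^t_n) \right\|_2, \qquad \sigma = \frac{1}{3T} \sum_{t=1}^{T} (\sigma_{tx} + \sigma_{ty} + \sigma_{tz}), \tag{26}$$
where $\sigma_{tx}$, $\sigma_{ty}$ and $\sigma_{tz}$ denote normalized variances of reconstructions $G(S^t_n)$ along the $x$-, $y$- and $z$-axes, respectively.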
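The evaluation procedure can be sketched as a rigid alignment followed by a mean per-joint distance. The NumPy code below uses the Kabsch algorithm (rotation and translation, no scaling) and aligns every frame independently; both choices, as well as the function names, are assumptions made for this illustration, since the exact alignment used in the experiments is not specified here.

```python
import numpy as np

def rigid_align(pred, gt):
    """Align pred (N, 3) to gt (N, 3) with the best rotation and translation (Kabsch)."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    P, G = pred - mu_p, gt - mu_g
    U, _, Vt = np.linalg.svd(P.T @ G)
    d = np.sign(np.linalg.det(U @ Vt))        # avoid reflections
    rot = U @ np.diag([1.0, 1.0, d]) @ Vt     # minimizes ||P @ rot - G||_F
    return P @ rot + mu_g

def mean_joint_error(pred_seq, gt_seq):
    """Mean Euclidean distance over all frames and joints after per-frame rigid alignment."""
    errors = [np.linalg.norm(rigid_align(p, g) - g, axis=1) for p, g in zip(pred_seq, gt_seq)]
    return float(np.mean(errors))

# Hypothetical data: the prediction is a rotated and shifted copy of the ground truth.
rng = np.random.default_rng(1)
gt_seq = rng.standard_normal((4, 7, 3))
theta = 0.7
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
pred_seq = gt_seq @ Rz.T + np.array([0.5, -1.0, 2.0])
error_rigid = mean_joint_error(pred_seq, gt_seq)

pred_perturbed = pred_seq.copy()
pred_perturbed[0, 0] += 1.0                   # displace one joint in one frame
error_perturbed = mean_joint_error(pred_perturbed, gt_seq)
print(error_rigid, error_perturbed)
```

A prediction that differs from the ground truth only by a rigid motion yields an error of numerically zero, so the metric measures shape errors rather than the choice of coordinate frame.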

Human 3.6m Dataset
Human 3.6m [12] is currently the largest dataset for monocular 3D human pose sensing. It is widely used for the evaluation of learning-based human pose estimation methods. Table 1 gives an overview of the quantitative results on Human 3.6m [12]. We highlight approaches that are trained on Human 3.6m [12] with "*". We follow three common evaluation protocols. In Protocol #1, we compare the methods on two subjects (S9 and S11). The original frame rate of 50 fps is reduced to 10 fps. The learning-based approaches marked with "*" use subjects S1, S5, S6, S7, S8 and all camera views for training. Testing is done for all cameras. For Protocol #2, only the frontal view ("camera3") is used for evaluation. For Protocol #3, evaluation is done on every 64th frame of subject S11 for all cameras. The learning-based approaches marked with "*" use subjects S1, S5, S6, S7, S8 and S9 for training.
For all methods and under all evaluation protocols, we report the reconstruction error $E_{3D}$ after the rigid alignment of the recovered structures with ground truth. In our method, the bone lengths are initialized with the average values for all the subjects from the dataset.
As Table 1 shows, SfAM achieves accuracy competitive with the best-performing learning-based approaches that are trained on Human 3.6m [12]. In Section 4.2.5, we demonstrate that our approach works better in real-world scenes which are different from this dataset. In Figure 4, we visualize several reconstructions of highly challenging scenes by SMSR [11] and the proposed SfAM. See Figure A1 for additional visualizations.
Table 1. The reconstruction error $E_{3D}$ of SfAM and previous methods on the Human 3.6m dataset. "*" indicates learning-based methods which are trained on Human 3.6m [12]. We outperform all model-based approaches and come very close to the tuned supervised learning techniques.
Figure 4. Reconstructions of SfAM and SMSR [11] on Human 3.6m [12]. NRSfM considers humans as general non-rigid objects and changes bone lengths from frame to frame.

Robustness to Inaccurate 2D Point Tracks
We validate the robustness of our approach to inaccuracies in 2D landmarks on Human 3.6m [12]. We compare our SfAM to state-of-the-art learning-based methods [9,47,53] trained on ground truth 2D data. We add Gaussian noise with increasing standard deviation to the 2D ground truth point tracks. The reconstruction error as a function of the standard deviation of the noise is plotted in Figure 5a. SfAM is more robust than the compared methods for moderate and high perturbations, and its error grows very slowly with the increasing noise level. In contrast, the errors of [9,47,53] grow very fast even at low noise levels. Note that we evaluate our method at higher noise levels than [9,47,53]. The average error of the currently best performing 2D detectors is between 10 and 15 pixels [79,80]. In this range, SfAM has an error comparable to the most accurate learning-based approaches while not relying on training data and generalizing to different object classes.

Robustness to Incorrectly Initialized Bone Lengths and Real Bone Length Recovery
We study the accuracy of SfAM in recovering articulated structures given incorrectly initialized bone proportions (normalized bone lengths) on subject S11 from Human 3.6m [12]. Starting from the ground truth initialization of bone lengths (obtained from the dataset), we perturb every bone length by adding Gaussian noise with increasing standard deviations in the range [0; 70] mm. This allows us to analyze the recovered bone lengths and the robustness of SfAM to noise in a controlled and well-defined setting. The results of the experiment are plotted in Figure 5b. If the structure is initialized with anthropometric priors from [81], the error increases by only 3%. Note that our error in bone length estimation is only slightly affected by the increasing levels of noise. It is equal to 54 mm with ground truth initialization and grows just to 66 mm with σ = 70 mm. Note that the anthropometric prior corresponds to σ ≈ 15 mm. Given incorrect initial bone lengths, SfAM recovers not only correct poses, but also accurate sequence-specific bone lengths. We calculate the average difference between the ground truth bone lengths of subject S11 and the initial ones provided to our method. We do the same for the recovered structures. The results are shown in Figure 5c. Thus, SfAM can be used for precise skeleton estimation.
We also calculate standard deviations of bone lengths of the reconstructed objects for SMSR [11] and SfAM. Figure 5d shows that the standard deviation of bone lengths is very high for SMSR [11], as it considers a human as a general non-rigid object and changes the bone lengths from frame to frame. SfAM reduces the average standard deviation by 514% leading to a more accurate pose reconstruction and structure recovery. In Figure 5d, "Upper Legs" and "Lower Legs" denote bones between the hip/knee and knee/ankle, respectively; "Upper Arms" and "Lower Arms" denote bones between shoulder/elbow and elbow/wrist, respectively.
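The bone-length stability measure discussed above can be sketched as the per-bone standard deviation of bone lengths over all frames of a sequence. The NumPy code below compares a rigidly rotating toy skeleton with a jittered copy; the skeleton, the noise level, and the function name `bone_length_std` are hypothetical and serve only to illustrate the statistic, not to reproduce the reported numbers.

```python
import numpy as np

def bone_length_std(joints, parents, children):
    """Standard deviation of every bone length over time; joints has shape (T, N, 3)."""
    lengths = np.linalg.norm(joints[:, parents] - joints[:, children], axis=2)  # (T, B)
    return lengths.std(axis=0)

def rotation_z(theta):
    """Rotation matrix about the z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Toy skeleton with four joints and three bones, rotated rigidly over ten frames.
skeleton = np.array([[0.0, 0.0, 0.0], [0.0, 0.5, 0.0], [0.3, 0.9, 0.1], [0.6, 1.2, 0.0]])
parents, children = [0, 1, 2], [1, 2, 3]
rigid_seq = np.stack([skeleton @ rotation_z(t).T for t in np.linspace(0.0, 1.0, 10)])

rng = np.random.default_rng(0)
noisy_seq = rigid_seq + 0.05 * rng.standard_normal(rigid_seq.shape)  # frame-wise jitter

std_rigid = bone_length_std(rigid_seq, parents, children)
std_noisy = bone_length_std(noisy_seq, parents, children)
print(std_rigid, std_noisy)
```

For a truly articulated reconstruction the statistic is numerically zero for every bone, while frame-to-frame changes in bone length show up as a positive standard deviation.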

Synthetic NRSfM Datasets
Synthetic sequences of Akhter et al. [13] are commonly used for the evaluation of sparse NRSfM. We compare our approach with previous SfM methods on challenging synthetic sequences with a large variety of human motions (Drink, Pickup, Stretch, and Yoga) [20]. Some pairs of joints remain locally rigid in these sequences. We activate the articulated constraint for those points and evaluate our method. Table 2 shows the results of SfAM and previous SfM methods. The errors $e_{3D}$ for the other listed methods are taken from PPTA [78] and SMSR [11]. Only PPTA [78] outperforms SfAM on Drink, whereas CSF2 [23] achieves a comparable $e_{3D}$. SfAM achieves the most consistent performance among all compared algorithms.

Real-World Videos
Our algorithm is capable of recovering human motion from challenging real-world videos. We compare our results with the state-of-the-art learning-based approach of Martinez et al. [9] and one of the best performing general-purpose NRSfM methods, SMSR [11]. Since ground truth 2D annotations are not available, we use OpenPose [82] for 2D human body landmark extraction. Bone lengths are initialized with the values from anthropometric data tables [81]. As Figure 6 shows, [9] fails to correctly recover poses that are different from the training dataset [12]. SMSR [11] produces unrealistic human body structures. In contrast to [9,11], our method successfully recovers 3D human poses in real-world scenes.
Figure 6. Comparison of our SfAM, NRSfM [11], and the learning-based method of Martinez et al. [9] on challenging real-world videos.

Hand Pose Estimation
We also evaluate SfAM on the NYU hand pose dataset [14], which provides 2D and 3D ground truth annotations for 8252 different hand poses. The hand model consists of 30 bones. Hand pose recovery is a challenging problem due to occlusion and many degrees of freedom. We compare the performance of our approach with SMSR [11] and its modification with the local rigidity constraint from Rehan et al. [38]. Quantitatively, SfAM achieves an $E_{3D}$ of 14.2 mm. In contrast, the $E_{3D}$ of SMSR [11] is 22.2 mm, and SMSR with articulated body constraints [38] shows an $E_{3D}$ of 19.4 mm. Hence, the inclusion of our articulated prior term in [11] achieves an error improvement of 56% relative to the SfAM error ((22.2 − 14.2)/14.2 ≈ 0.56). The qualitative results are shown in Figure 7. As with human bodies, SfAM achieves a lower error by keeping bone lengths constant between frames. When SMSR [11] fails to reconstruct the correct 3D pose, SfAM still outputs plausible results.

Conclusions
We present a new method for 3D articulated structure recovery from 2D landmarks. The proposed approach is general and not restricted to specific structures or motions. Integrating our soft articulated prior term into a general-purpose NRSfM approach and using alternating optimization yields accurate and stable results.
In contrast to the vast majority of state-of-the-art approaches, SfAM does not require training data or known bone lengths. By ensuring consistency of bone lengths throughout the whole sequence, it optimizes sequence-specific bone proportions and recovers 3D structures. In extensive experiments, it proves its generalizability and shows accuracy close to state-of-the-art on public benchmarks. It also shows a remarkable improvement in accuracy compared to other model-based approaches. Moreover, our method outperforms learning-based approaches in complicated real-world videos. All in all, we show that high accuracy on benchmarks can be achieved without the need for training and parameter tuning for specific datasets.
In future work, we plan to apply SfAM to animal shape estimation and recovery of personalized human skeletons. We also believe it can boost the development of methods for human and hand pose estimation with semi-supervision.