Deep Human Pose Estimation: A Conceptual Review of Paradigms, Progress, and Frontiers
Abstract
1. Introduction
- Representation: Tracing the shift from direct coordinates to heatmaps, volumetric encodings, implicit functions, and probabilistic distributions.
- Architecture: Covering the architectural evolution from multi-stage CNNs to graph convolutional networks (GCNs), transformers, and linear-complexity state space models (SSMs).
- Ambiguity and Generalization: Examining strategies for addressing depth ambiguity, occlusion, and domain gaps through self-supervision, uncertainty modeling, and synthetic data.
- Contextual Extension: Encompassing temporal dynamics, multi-view geometry, and multi-sensor fusion (RGB+IMU, LiDAR, event cameras).
- Applications and Frontiers: Linking algorithms to downstream tasks (biomechanics, XR) and constraints such as efficiency, privacy, and fairness.
1.1. Survey Methodology
1.2. Relevance to Image and Video Processing
- Organization. The remainder of this paper is organized as follows: Section 2 positions our work relative to existing reviews; Section 3 details key datasets and metrics; Sections 4 through 8 develop our five-axis taxonomy; Section 9 offers a critical discussion; and Section 10 outlines future directions.
2. Related Work
2.1. Studies on 2D Human Pose Estimation
2.2. Studies on 3D HPE and Mesh Recovery
2.3. Systematic, Structural, and System-Level Studies
2.4. Specialized and Application-Oriented Studies
2.5. Scope and Contributions of This Review
- Unified Taxonomy: Most studies treat 2D, 3D, and multi-view pose as distinct problems. Instead, we propose a unified taxonomy based on representation, architecture, and context that applies consistently across all dimensions.
- Focus on Paradigm Shifts (2018–2025): Rather than enumerating methods, we trace the intellectual lineage from CNNs to transformers to SSMs and from regression to heatmaps to distributions to diffusion, explaining why these shifts occurred.
- Critical Analysis: We combine quantitative reporting with critical discussion of metric limitations, data saturation, and reproducibility.
- Frontier Directions: We synthesize emerging trends in foundation models, efficiency, and human-centered AI (privacy, fairness) in order to define the next steps for research.
- Conceptual evolution refers to whether a survey explicitly traces the intellectual lineage of paradigm shifts such as the progression from coordinate regression to heatmaps to probabilistic distributions, from CNNs to transformers to SSMs, from deterministic to diffusion-based estimation, or from in-the-wild to scene-aware reasoning, rather than merely cataloging different methods. This final column most clearly distinguishes the present review from prior work.
3. Datasets and Benchmarks: Engines of Innovation in Human Pose Estimation
3.1. Laboratory Motion Capture Datasets
3.1.1. Human3.6M (H3.6M)
3.1.2. SURREAL (Synthetic Humans for REAL Tasks)
3.2. In-the-Wild 2D Benchmarks
3.2.1. MPII Human Pose
3.2.2. COCO Keypoints (COCO)
3.3. In-the-Wild 3D and Bridging the Domain Gap
3.3.1. MPI-INF-3DHP
3.3.2. 3DPW (3D Poses in the Wild)
- Real-world capture. Indoor and outdoor scenes with everyday clothing, natural lighting, partial occlusions, and a freely moving camera reproduce many of the conditions under which models trained mainly on controlled laboratory data tend to degrade.
- Accurate 3D reference poses. By fusing video and IMU information, 3DPW provides accurate 3D reference poses in unconstrained environments. The original work validated the reconstruction method on TotalCapture and reported an accuracy of 26 mm, making the annotations sufficiently accurate for benchmarking in-the-wild 3D HPE.
- SMPL-aligned annotations. Ground truth is provided through SMPL body model parameters, allowing the same dataset to support evaluation of 3D skeletons using metrics such as MPJPE and PA-MPJPE as well as 3D body meshes using surface-based metrics such as PVE or MPVE.
- Temporal video supervision. Because annotations are temporally consistent across video frames, 3DPW is particularly useful for evaluating temporal models, including TCN-, transformer-, and SSM-based approaches, under realistic camera motion and appearance variation.
3.3.3. FreeMan
3.4. Holistic, Application-Oriented, and Scene-Aware Benchmarks
3.4.1. COCO-WholeBody
3.4.2. AthletePose3D
3.4.3. LDPose
3.4.4. SLOPER4D
3.4.5. MoviCam
3.5. Evaluation Metrics and Quantitative Protocols
3.5.1. 2D Pose Metrics: PCK, PCKh, and AP/OKS
3.5.2. 3D Pose Metrics: MPJPE and Variants
- PA-MPJPE (Procrustes-Aligned MPJPE) computes MPJPE after a similarity (Procrustes) alignment, that is, an optimal rotation, translation, and uniform scaling, between predicted and ground-truth poses. It largely reflects errors in relative pose configuration rather than global position.
- N-MPJPE (Normalized MPJPE) applies only a scale alignment before calculating the error, which factors out global scale errors while preserving overall orientation.
- PCK3D and AUC respectively measure the percentage of joints within a given 3D distance threshold and the area under the PCK3D curve as the threshold varies. These metrics are less sensitive to outliers than MPJPE, and are popular on benchmarks such as MPI-INF-3DHP.
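The three error metrics above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming poses are arrays of shape (J, 3) in the same units; the function names are our own and do not correspond to any specific benchmark toolkit, and N-MPJPE is implemented here with a least-squares global scale, which is one common convention.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint Euclidean distance; pred and gt have shape (J, 3)."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def n_mpjpe(pred, gt):
    """MPJPE after applying the least-squares optimal global scale to pred."""
    scale = np.sum(pred * gt) / np.sum(pred * pred)
    return mpjpe(scale * pred, gt)

def pa_mpjpe(pred, gt):
    """MPJPE after similarity alignment (rotation, translation, uniform scale)."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    # Orthogonal Procrustes: SVD of the 3x3 cross-covariance matrix.
    u, s, vt = np.linalg.svd(p.T @ g)
    # Flip the last singular direction if needed so the result is a proper rotation.
    d = np.sign(np.linalg.det(u @ vt))
    diag = np.array([1.0, 1.0, d])
    rot = u @ np.diag(diag) @ vt
    scale = (s * diag).sum() / np.sum(p * p)
    aligned = scale * p @ rot + mu_g
    return mpjpe(aligned, gt)
```

Because PA-MPJPE removes rotation, translation, and scale, a prediction that differs from the ground truth only by such a transform scores close to zero under PA-MPJPE while its plain MPJPE can be large, which is why the two numbers should be read together.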
3.5.3. How We Use These Metrics in This Survey
- PCKh@0.5 on MPII.
- AP according to OKS, with AP50/AP75 when these values are specified, on COCO, CrowdPose, and OCHuman.
- OKS-based AP on COCO-WholeBody, with breakdowns by body/feet/face/hands when provided by the original paper.
- MPJPE under protocol 1 and PA-MPJPE under protocol 2 on Human3.6M.
- PCK3D and AUC on MPI-INF-3DHP.
- PA-MPJPE on 3DPW, optionally with MPJPE and MPVE/PVE when mesh reconstruction is specified.
- Dataset-specific global or object-centered variants (global MPJPE, O-MPJPE) on FreeMan, SLOPER4D, and MoviCam.
3.6. Datasets as Drivers of Paradigm Shifts
4. Pose Representation: From Coordinates to Distributions
4.1. The Direct-Regression Paradigm
4.2. The Heatmap Paradigm: Robustness Through Spatial Probability
- The Differentiability Problem and Integral Solutions. Because the standard argmax method for extracting coordinates is not differentiable, gradients cannot backpropagate from the final coordinates to the network. Sun et al. [41] suggested integral regression as a solution to this issue. By normalizing the heatmap with a softmax and computing the expectation of grid positions, the process becomes a completely differentiable weighted sum (soft-argmax). This method has been widely used for 2D pose [42,43] and has been expanded to 3D voxel representations [44] to differentiably estimate depth, making it fundamental in the field.
- Improving Heatmap Precision and Debiasing. Beyond differentiability, the quality of the target heatmap (the label) has become a major research topic. Standard approaches generate targets using a Gaussian kernel of fixed size. Luo et al. [45] argued that this method was not optimal, and proposed scale-adaptive heatmap regression (SAHR) to dynamically adjust kernel size according to the scale of the individual as well as weight-adaptive heatmap regression loss (WAHR) for hard joints. Jiang et al. [46] similarly proposed a heatmap refinement method that adjusts Gaussian coverage using geometric priors. Quantization bias is a more subtle but omnipresent problem identified by Huang et al. [47]. Heatmaps are generated and stored on a discrete grid with a resolution that is much coarser than the original image. During target encoding, the continuous ground-truth joint coordinate is rounded to the nearest integer cell; during decoding, the predicted argmax (or soft-argmax) is then upscaled back to image space. Both steps introduce a systematic asymmetric error: rounding is biased toward the cell center rather than uniformly distributed, the upscaling factor compounds sub-pixel offsets, and standard data augmentation transforms (flip, rotation, scale) are computed in pixel space and then re-discretized, accumulating the bias at every epoch. Across a typical training pipeline this manifests as a persistent offset of roughly half a pixel to one pixel on the predicted heatmap peak, which corresponds to a non-negligible drop in OKS-based AP at high thresholds (AP75 and above), where the OKS tolerance is small enough for half a pixel to matter.
Unbiased data processing (UDP) addresses the problem theoretically rather than empirically: it (i) treats coordinates as continuous quantities throughout the pipeline, (ii) re-derives the affine transformations used by augmentation in the continuous domain so that no intermediate rounding occurs, and (iii) replaces the standard Gaussian on an integer grid target with a continuous Gaussian, the center of which is the exact floating-point coordinate. As popularized by DARK (Zhang et al. [48]), UDP combined with a distribution-aware decoder and Taylor-expansion peak refinement yields a measurable AP gain (+2 to +3 points on COCO for comparable backbones) at zero additional inference cost. This clearly demonstrates that representation choices rather than architecture alone can drive substantial improvements in 2D pose estimation. Recent work continues to refine this representation. According to Gu et al. [49], heatmap confidence scores are frequently calibrated inadequately; as an alternative, they suggested Calibrated ConfidenceNet (CCNet). Liu et al. [50] integrated anatomical cues through anisotropic Gaussian coding by stretching the kernel along the direction of the bone. Finally, Purkrabek and Matas [51] tackled the problem of out-of-frame joints, explicitly handling occlusion and truncation by predicting both a calibrated probability map and a discrete probability of existence.
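To make the contrast between hard argmax and integral (soft-argmax) decoding concrete, the following sketch decodes a heatmap whose Gaussian target is centered on a continuous, sub-pixel location. It is an illustration under stated assumptions rather than the implementation of any cited method: the 64 × 48 grid size, the `beta` temperature, and the helper names are hypothetical choices, and NumPy is used only to show the forward computation (a deep learning framework would supply the gradients).

```python
import numpy as np

def soft_argmax_2d(logits, beta=1.0):
    """Integral decoding: expectation of grid positions under a softmax."""
    h, w = logits.shape
    z = beta * (logits - logits.max())   # subtract the max for numerical stability
    prob = np.exp(z)
    prob /= prob.sum()
    ys, xs = np.mgrid[0:h, 0:w]          # row (y) and column (x) index grids
    return float((prob * xs).sum()), float((prob * ys).sum())

def gaussian_logits(cx, cy, h=64, w=48, sigma=2.0):
    """Log of an unnormalized Gaussian centered on a continuous location."""
    ys, xs = np.mgrid[0:h, 0:w]
    return -((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2)

# Continuous ground-truth location that does not fall on a grid cell.
true_x, true_y = 20.3, 31.7
logits = gaussian_logits(true_x, true_y)

# Hard argmax can only return integer cells, so the sub-pixel part is lost.
hard_y, hard_x = np.unravel_index(np.argmax(logits), logits.shape)

# Soft-argmax is a weighted sum of positions and recovers the sub-pixel center.
soft_x, soft_y = soft_argmax_2d(logits)
```

In this toy setting the hard argmax lands on the nearest integer cell, whereas the soft-argmax estimate stays very close to the continuous center because the target itself was encoded without rounding, which is the intuition behind treating coordinates as continuous throughout the pipeline.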
4.3. Encoding Structure for Multi-Person Grouping (Bottom-Up Methods)
- Part Affinity Fields. OpenPose by Cao et al. [29] made bottom-up pose estimation a mainstream paradigm by introducing part affinity fields (PAFs), two-dimensional vector fields that encode the location and orientation of limbs. Instead of grouping joints by spatial proximity alone, OpenPose assigns a connection score to each candidate limb and uses these scores in bipartite matching to assemble full skeletons. For a candidate start joint $\mathbf{d}_{j_1}$ and candidate end joint $\mathbf{d}_{j_2}$, the connection confidence is computed by line-integrating the predicted PAF $\mathbf{L}_c$ along the straight segment between them:
$$E = \int_{0}^{1} \mathbf{L}_c\big(\mathbf{p}(u)\big) \cdot \frac{\mathbf{d}_{j_2} - \mathbf{d}_{j_1}}{\lVert \mathbf{d}_{j_2} - \mathbf{d}_{j_1} \rVert_2}\, du, \qquad \mathbf{p}(u) = (1-u)\,\mathbf{d}_{j_1} + u\,\mathbf{d}_{j_2}.$$
- Composite and Geometric Fields for Advanced Grouping. Later works produced more detailed groupings. Kreiss et al. [53] introduced composite fields (PifPaf), predicting a part intensity field for precise localization and a part association field for grouping. Papandreou et al. [31] proposed PersonLab, which predicts short-range offsets for refinement and mid-range offsets to traverse the kinematic graph.
- Alternative Grouping: Centers and Object-Centric Methods. Vector fields can be substituted with anatomical centers as anchors. After seeing that bottom-up models often fail on lower-body joints, Zhang et al. [56] proposed regressing offsets to two centers (upper/lower body). To deal with scale variations, Cheng et al. [57] employed dual centers (head and hip). Wang et al. [58] adopted a decentralized approach in which every joint predicts the relative position of all other joints in the instance. Li et al. [59] simplified association by predicting limb centers alongside joints.
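The PAF connection score described above is usually approximated by sampling the field at evenly spaced points along the candidate limb. The sketch below is a minimal version of that idea, assuming a single limb's field stored as an array of shape (2, H, W) and nearest-neighbor sampling; the function name and the `num_samples` default are illustrative assumptions, not taken from the OpenPose code base.

```python
import numpy as np

def paf_connection_score(paf, start, end, num_samples=10):
    """Approximate the PAF line integral between two candidate joints.

    paf   : array of shape (2, H, W) holding the x and y field components.
    start : (x, y) of the candidate start joint.
    end   : (x, y) of the candidate end joint.
    """
    start = np.asarray(start, dtype=float)
    end = np.asarray(end, dtype=float)
    limb = end - start
    length = np.linalg.norm(limb)
    if length < 1e-8:
        return 0.0                       # degenerate limb, no direction to test
    unit = limb / length
    score = 0.0
    for u in np.linspace(0.0, 1.0, num_samples):
        x, y = start + u * limb
        # Nearest-neighbor lookup, clipped to stay inside the field.
        xi = int(np.clip(np.rint(x), 0, paf.shape[2] - 1))
        yi = int(np.clip(np.rint(y), 0, paf.shape[1] - 1))
        # Dot product between the predicted field and the limb direction.
        score += paf[0, yi, xi] * unit[0] + paf[1, yi, xi] * unit[1]
    return score / num_samples
```

A candidate pair aligned with the field scores near 1, a pair pointing against it scores near -1, and a perpendicular pair scores near 0; these scores are what the bipartite matching step consumes.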
4.4. Three-Dimensional Representation
- Multi-View and Volumetric Heatmaps. The 3D volumetric grid is a logical extension of 2D heatmaps. To address high computational cost, Nibali et al. [64] proposed marginal heatmaps for predicting projections on the xy, xz, and yz planes. Choi et al. [65] targeted mobile applications with a discretized volume and soft-argmax. Sárándi et al. [66] identified the scale-dependency of voxel grids as a weakness and introduced MeTRAbs, in which heatmaps are defined in a metric 3D space around the person to enable direct metric scale regression.
- Ordinal and Ranking-Based Depth. Some techniques anticipate depth relations instead of regressing absolute depth. In the depth ranking framework introduced by Wang et al. [67], a first network predicts a pairwise ranking matrix, then a second network uses this constraint to regress 3D pose. Similarly, Pavlakos et al. [68] presented ordinal depth supervision, which offers flexible constraints for 3D geometry by classifying joint pairs as closer, further, or at the same depth.
- Directly Encoding Structure and Kinematics. The final approach abandons indirect spatial representations for kinematic properties. Kundu et al. [69] introduced an unsupervised method in which the encoder predicts a “kinematics” vector (limb orientation). A non-learnable forward kinematics layer is then used to generate the 3D pose. Chen et al. [70] separated the task into prediction of bone direction (local) and bone length (global/constant).
- Synthesis. Reading Section 4 as a whole reveals a single conceptual trajectory in which each successive representation moves further from a point estimate and closer to an explicit structured representation of spatial uncertainty. Coordinate regression returns one number per joint; heatmaps return a probability map; integral and distribution-aware heatmaps return a calibrated probability map with sub-pixel precision; volumetric heatmaps extend the map into depth; and ordinal/kinematic encodings replace raw coordinates with relational and structural quantities that respect anatomy. Two implications follow. First, the representation choice now matters as much as the backbone; for instance, UDP and DARK deliver gains comparable to swapping ResNet for HRNet at near zero extra inference cost. Second, the trajectory has not converged; each representation trades one form of inductive bias for another (standard heatmap decoding with argmax is non-differentiable, kinematic encodings lose flexibility), and the most recent probabilistic and diffusion-based representations of Section 6.1 can be read as the logical next step of replacing a single calibrated distribution with a full posterior over plausible poses.
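As an illustration of the kinematic encodings discussed above, the sketch below rebuilds 3D joints from per-bone directions and lengths with a non-learnable forward-kinematics pass. It is a simplified example under our own assumptions: the five-joint `PARENTS` tree is hypothetical, and directions are expressed directly in the global frame rather than as local joint rotations.

```python
import numpy as np

# Hypothetical 5-joint tree: parent index per joint, -1 marks the root.
PARENTS = [-1, 0, 1, 0, 3]

def forward_kinematics(root, directions, lengths, parents=PARENTS):
    """Compose joint positions from bone directions and bone lengths.

    root       : (3,) position of the root joint.
    directions : (J, 3) direction of the bone from parent to joint j
                 (the root's row is ignored).
    lengths    : (J,) length of the bone ending at joint j.
    """
    norms = np.linalg.norm(directions, axis=1, keepdims=True)
    unit = directions / np.maximum(norms, 1e-8)   # only orientation is used
    joints = np.zeros((len(parents), 3))
    # Parents must appear before their children in the joint ordering.
    for j, p in enumerate(parents):
        joints[j] = root if p < 0 else joints[p] + lengths[j] * unit[j]
    return joints
```

Because lengths enter the computation separately from directions, every pose produced this way has exactly the prescribed bone lengths, which is the structural guarantee that coordinate regression lacks.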
5. Architectures for Spatial and Global Context
5.1. Mastering Spatial Context with Convolutional Networks
5.1.1. The Multi-Stage Refinement Paradigm
5.1.2. The Revolution in High Resolution
5.1.3. Hybrid CNNs and Specialized Modules
- Postprocessing and Refinement. Recognizing that even top-performing estimators produce systematic errors, another line of research focuses on post hoc refinement. Fieraru et al. [92] proposed an explicit refinement network trained on synthetic errors (e.g., swapping limbs), while Moon et al. [93] formalized this with PoseFix, a model-agnostic network that learns to correct the output of any estimator by training on a distribution of realistic pose distortions.
5.2. The Global Context Revolution: Graphs and Transformers
5.2.1. GCNs: The Skeleton as a Graph
- Strong inductive bias. The graph structure explicitly encodes anatomical constraints, such as which joints are directly connected by a bone; this biases the model to learn relationships that are physically valid, reducing the hypothesis space and improving sample efficiency compared to a fully-connected network that must learn these connections from scratch.
- Locality of information. A joint’s 3D location is most directly influenced by its immediate kinematic neighbors (e.g., the elbow constrains the wrist); graph convolutions operate naturally on this local neighborhood, unlike standard convolutions that use a fixed grid-based receptive field.
- Structure-aware reasoning. Graph networks can be designed to respect the hierarchical and directed nature of the skeleton (e.g., from the hip to the knee to the ankle). As shown by conditional directed graph convolutions, using directed edges allows the model to explicitly represent the flow of influence from parent to child joints, thereby mirroring real biomechanics.
- Flexibility. The graph representation is not limited to a fixed skeleton. It can, in principle, be adapted to different skeleton definitions (e.g., with more or fewer joints) or even to non-human articulated objects, making it a more general tool for structured prediction.
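The locality and inductive-bias arguments above can be seen in a single graph-convolution layer over a skeleton. The following sketch uses the common symmetric normalization of the adjacency matrix with self-loops; the edge list, feature sizes, and function names are illustrative assumptions and do not reproduce any particular cited architecture.

```python
import numpy as np

def normalized_adjacency(edges, num_joints):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} of the skeleton graph."""
    a = np.eye(num_joints)               # self-loops keep each joint's own feature
    for i, j in edges:
        a[i, j] = a[j, i] = 1.0          # undirected bone between joints i and j
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    return a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def graph_conv(x, a_hat, weight):
    """One layer: ReLU(A_hat @ X @ W), with x of shape (J, C_in)."""
    return np.maximum(a_hat @ x @ weight, 0.0)
```

With one layer, a feature placed on a joint reaches only that joint and its direct kinematic neighbors; stacking layers widens the receptive field one bone at a time, which is the graph analogue of growing a convolutional receptive field.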
5.2.2. Takeover by Transformers
- Transformers for Heatmap Prediction.
- Direct Regression Revisited. The most significant contribution of transformers has been the revival of the direct regression paradigm. DETR-style set prediction and query-based decoding helped to reduce some of the alignment issues that limited earlier CNN-based regression models.
5.2.3. Context-Aware 2D-to-3D Lifting
- Why visual context helps. The depth ambiguity of monocular 3D lifting is fundamentally nonlocal: the absolute depth of a joint cannot be determined from the joint itself, only from its relation to other parts of the scene that have known geometric properties. CNNs, with their limited and slowly growing receptive fields, can only reason about such relations indirectly; depth has to be inferred from local cues (foreshortening, shading) and then propagated stage-by-stage through the network, with information loss occurring at each downsampling step. Self-attention removes this propagation bottleneck by allowing every token to attend to every other token in a single layer, which is exactly what is required to bring distant evidence to bear on a local depth decision.
- Heatmap Refinement (e.g., TransPose, HRFormer). A CNN backbone first extracts visual features; the transformer encoder then processes these features using self-attention to capture long-range spatial relationships across the entire feature map. The output is used to predict final heatmaps. Here, the transformer acts as a powerful global context aggregator that enhances the CNN’s local features.
- Direct Regression (e.g., TFPose, Poseur). The image is passed through a CNN backbone to produce a feature map. The transformer decoder uses a set of learnable keypoint queries to directly regress the (x,y) or (x,y,z) coordinates for each joint. Each query attends to the most relevant image features via cross-attention. This treats pose estimation as a set prediction problem (like DETR for objects), elegantly avoiding both heatmaps and their postprocessing requirements.
- Lifting from 2D to 3D (e.g., PoseFormer, MixSTE). Here, the input is a sequence of 2D pose coordinates from a video. The transformer is used as a spatiotemporal model that first applies self-attention along the spatial dimension (joints within a frame), then along the temporal dimension (same joint across frames). This allows the model to learn complex joint correlations and motion dynamics directly from the 2D pose sequence without needing the original image. This is a purely sequence-to-sequence paradigm.
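The spatial-then-temporal factorization used by lifting transformers can be sketched with plain scaled dot-product attention. This is a deliberately stripped-down illustration: it uses a single head with identity query, key, and value projections (a real model learns these), omits positional embeddings and feed-forward layers, and assumes joint features are already embedded as an array of shape (T, J, C).

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    """Single-head attention with identity projections; tokens: (..., N, C)."""
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(tokens.shape[-1])
    return softmax(scores, axis=-1) @ tokens

def spatial_then_temporal(x):
    """Factorized attention over a pose sequence x of shape (T, J, C)."""
    # Spatial step: within each frame, the J joints attend to one another.
    x = self_attention(x)
    # Temporal step: for each joint, its T time steps attend to one another.
    x = np.swapaxes(self_attention(np.swapaxes(x, 0, 1)), 0, 1)
    return x
```

The factorization costs on the order of T·J² + J·T² attention scores instead of (T·J)² for joint attention over all tokens, which is one reason this design is popular for long 2D pose sequences.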
6. Ambiguity, Generalization and Occlusion
6.1. From Deterministic Prediction to Probabilistic Estimation
6.1.1. Ambiguity Modeling with Multiple Hypothesis Generation
6.1.2. The Diffusion Model Era
- Joint-wise reprojection error. For each joint, JPMA reprojects every candidate 3D hypothesis to the 2D image plane and selects the hypothesis for which the projection is closest to the observed 2D keypoint in Euclidean distance.
- Joint-level aggregation. Rather than choosing or averaging entire poses, JPMA assembles the final prediction by combining the best joint from each hypothesis, which allows different joints to come from different candidates.
- Use of 2D priors. The 2D keypoints act as geometric priors that guide hypothesis selection; this method does not introduce an additional heatmap-likelihood term or a learned anatomical plausibility head.
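The joint-wise selection described above can be sketched as follows. This is our own minimal reading of the procedure, assuming a simple pinhole camera with hypothetical `focal` and `center` intrinsics and hypotheses expressed in camera coordinates; it is not the reference implementation of JPMA.

```python
import numpy as np

def project(points_3d, focal, center):
    """Pinhole projection of camera-space points (..., 3) to pixels (..., 2)."""
    return focal * points_3d[..., :2] / points_3d[..., 2:3] + center

def joint_wise_aggregate(hypotheses, keypoints_2d, focal, center):
    """Pick, for every joint, the hypothesis with the lowest reprojection error.

    hypotheses   : (H, J, 3) candidate 3D poses.
    keypoints_2d : (J, 2) observed 2D keypoints.
    Returns a (J, 3) pose whose joints may come from different hypotheses.
    """
    reprojected = project(hypotheses, focal, center)               # (H, J, 2)
    err = np.linalg.norm(reprojected - keypoints_2d, axis=-1)      # (H, J)
    best = err.argmin(axis=0)                                      # (J,)
    return hypotheses[best, np.arange(hypotheses.shape[1])]
```

Note that reprojection error cannot distinguish hypotheses that differ only along the viewing ray, so this criterion selects among candidates by their 2D consistency and leaves the depth quality to the hypothesis generator.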
6.2. Bridging the Domain Gap “In the Wild”
6.2.1. Self-Supervised and Weakly Supervised Learning
6.2.2. Data-Centric Solutions: Augmentation and Adaptation
- Adversarial and Learned Augmentation. Traditional augmentation strategies such as random scaling, rotation, flipping, and occlusion improve robustness, but remain fixed and hand-crafted. Yang et al. [135] proposed an adversarial data augmentation framework for human pose estimation in which augmentation and pose network training are jointly optimized. Instead of relying on predefined transformations, this method introduces an augmentation network that learns to generate challenging transformations conditioned on the current state of the pose estimator.
- Generalization and Invariance. To improve cross-dataset performance, Wang et al. [139] proposed a method to generate synthetic 3D pose labels for in-the-wild images. They used a stereo-inspired neural network to lift 2D joint detections to 3D, then applied a geometric refinement step, producing a large dataset (400,000 images) with pseudo-3D ground truth. Doersch and Zisserman [140] exploited optical flow and synthetic humans (SURREAL) for simulated-to-real transfer. Chai et al. [141] introduced PoseDA, separating global position alignment from local pose deformation. Focusing on invariant representations, Wang et al. [142] used auxiliary viewpoint prediction to reduce camera bias. Cai et al. [143] proposed PoseIRM, which applies invariant risk minimization across synthetically generated camera settings to learn pose estimation models with features that are invariant to camera parameters, thereby avoiding reliance on spurious correlations tied to specific views. Zeng et al. [144] proposed SRNet to deal with rare poses via split and recombination strategies. Finally, methods that utilize privileged information or auxiliary signals have shown improved performance: Wang et al. [145] developed TMT, which uses 3D joint velocities during training to enhance monocular 3D pose estimation; Lee et al. [146] employed multi-model guidance to provide richer supervisory signals; Taketsugu and Ukita [147] studied active learning to efficiently adapt models to new video sequences; and Hu et al. [148] utilized meta-optimization with self-supervised tasks to enable rapid adaptation. Together, these techniques demonstrate how training with richer or structured data can improve inference on conventional inputs.
6.3. Reasoning About Occlusion and Crowds
6.3.1. Robustness to Occlusion
6.3.2. Crowded Scenes
7. Contextual Extension: Time, Space, and Modality
7.1. Exploiting Temporal Dynamics in Video
7.1.1. Early Temporal Models: RNN and Convolutional Approaches
7.1.2. Spatiotemporal Transformers and the State-Space Era
- Space–Time Transformer. Before transformers were fully adopted, Lin and Lee [169] proposed factorizing the problem in trajectory space using the discrete cosine transform (DCT). However, the direct application of self-attention proved transformative. Zheng et al. [170] established a new baseline with PoseFormer, which treats video as a sequence of tokens (joint × frame). By sequentially applying spatial and temporal attention, it significantly outperforms TCN-based baselines. Subsequent architectures reduced computation to cope with the quadratic complexity of attention across extended sequences. In order to shorten sequences, Li et al. [171] introduced Strided Transformer, which replaces the fully-connected feed-forward layers in transformers with strided temporal convolutions that downsample 2D pose sequences and efficiently aggregate temporal context. Tang et al. [172] proposed STCFormer, which uses a spatiotemporal criss-cross attention block. By decomposing attention into separate spatial (within-frame joint interactions) and temporal (across-frame joint trajectories) components, it can efficiently model spatiotemporal correlations for 3D pose estimation in videos. Zhang et al. [173] introduced MixSTE, which alternates spatial and temporal transformer blocks to separately encode inter-joint spatial correlations and joint-wise temporal motion. Hassanin et al. [174] proposed CrossFormer, which uses dedicated modules for inter-joint and inter-frame interactions, enabling richer spatiotemporal modeling of articulation dynamics across video frames.
- Structural Enrichment and Hybrids. Researchers soon found that pure transformers often ignored anatomical structure. This prior was re-injected through hierarchical designs. Relationships at the local (joint), regional (limb), and global (body) scales have been explicitly modeled by Wei et al. [175] (PGFormer) and Qian et al. [176] (HSTFormer). Through directed graphs, Chen et al. [177] expanded this to high-level dependencies such as joint–hyperbone. Leap clustering was used on the skeleton graph by Zhai et al. [178]. Furthermore, multi-level structures have been suggested to improve temporal features: RTPCA-Transformer [179] uses a pyramidal compression and amplification structure, while DC-GCT [180] incorporates double-chain constraints to concurrently describe local and global dependencies.
- State Space Models (Mamba): An Emerging Alternative. For modeling long sequences, state space models (SSMs) such as Mamba have recently emerged as a linear-complexity alternative to transformers. Feng et al. [189] proposed GLSMamba, using selective 6D scanning to decouple learning into a global spatiotemporal Mamba and a local refinement Mamba. Similarly, Lu et al. [190] presented SAMA, a structure-aware SSM that integrates skeleton topology via learnable adjacency matrices, demonstrating promising results for long-term temporal modeling.
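The linear-complexity claim for SSMs follows from their recurrent form: each time step updates a fixed-size hidden state once, so cost grows linearly with sequence length. The sketch below shows a plain diagonal linear state space recurrence with fixed parameters. It is an assumption-laden simplification and is not Mamba itself, whose parameters are input-dependent (selective) and whose scan is implemented with a hardware-aware parallel algorithm.

```python
import numpy as np

def ssm_scan(u, a, b, c):
    """Diagonal linear SSM applied to a scalar input sequence.

    Recurrence: h_t = a * h_{t-1} + b * u_t,  y_t = c . h_t
    u : (T,) input sequence; a, b, c : (N,) per-state parameters.
    """
    h = np.zeros_like(a, dtype=float)    # hidden state of fixed size N
    y = np.empty(len(u))
    for t, u_t in enumerate(u):          # one O(N) update per time step
        h = a * h + b * u_t
        y[t] = c @ h
    return y
```

The loop performs T updates of size N, giving O(T·N) work, whereas full self-attention over the same sequence forms a T × T score matrix; the trade-off is that all history must be compressed into the N-dimensional state.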
7.1.3. Motion-Centric and Kinematic-Aware Models
7.2. Resolving Ambiguity with Multiple Views and Sensors
7.2.1. Multi-View Geometry: From Triangulation to End-to-End Fusion
- Learnable Triangulation and Fusion. Early deep learning methods treated fusion as postprocessing (algebraic triangulation). Iskakov et al. [128] proposed learnable triangulation, replacing algebraic operations with differentiable volumetric back-projection. Others merged features earlier in the pipeline: Qiu et al. [200] produced per-view 2D heatmaps and then fused these across views before performing 3D pose reconstruction, while Zhang et al. [201] (AdaFuse) used an adaptive weighting scheme based on epipolar geometry to manage occlusion. Xie et al. [202] used meta-learning to adapt fusion weights to new camera configurations. Transformers have also been adapted to this domain. Moliner et al. [203] injected epipolar constraints directly into the attention mechanism. Liao et al. [204] combined transformers with classical geometry modules to enable backpropagation of 3D errors to 2D detectors. In pursuit of efficiency, Remelli et al. [205] proposed a canonical fusion in an untangled camera space. Recently, Chharia et al. [206] proposed MV-SSM, a multi-view state space modeling framework that applies state space modeling for efficient and robust multi-view fusion, explicitly modeling the joint spatial arrangement across views to improve generalization across camera setups. Meanwhile, Luvizon et al. [207] investigated a consensus-based optimization strategy for multi-view pose estimation. By combining per-view 3D predictions (depth + 2D joints) and optimizing for a globally consistent 3D pose in camera coordinates, their method refines multi-view estimates without relying on explicit volumetric grids.
- Uncalibrated and Parameter-Free Approaches. The requirement for accurate calibration is a major bottleneck. To address it, research has turned to uncalibrated approaches. Davoodnia et al. [208] (UPose3D) and Jiang et al. [209] jointly estimated pose and camera parameters and refined them iteratively. Gordon et al. [210] (FLEX) and Xu and Kitani [211] proposed predicting view-invariant quantities (e.g., bone length) in order to reconstruct motion without extrinsic parameters. By processing an arbitrary number of uncalibrated views, Adaptive Multi-View Transformer [212] learns relative geometry through attention. Li et al. [213] expanded uncalibrated methods to depth cameras by guiding camera pose estimation with point clouds. Self-supervised techniques [214,215] train on unlabeled data by utilizing multi-view consistency.
- Tracking and stereo. Specialized solutions such as dual-Diffusion [216] focus on stereoscopic configurations: this method jointly denoises 2D keypoint uncertainty and 3D pose uncertainty under a binocular (two-view) setup to improve the robustness of 3D human pose estimation from noisy 2D detections. For multi-view tracking, Reddy et al. [217] (TesseTrack) and Zhang et al. [218] (VoxelTrack) used 4D volumes to connect poses in space and time. Efficient recurrent models such as TEMPO [219] can preserve temporal hidden states for every subject.
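For reference, the algebraic triangulation that learnable methods build on can be written as a small linear least-squares problem (the direct linear transform). The sketch below assumes calibrated 3 × 4 projection matrices and noise-free or lightly noisy 2D detections; the function name is our own, and no per-view confidence weighting is included.

```python
import numpy as np

def triangulate_dlt(proj_mats, points_2d):
    """Algebraic (DLT) triangulation of one joint from V calibrated views.

    proj_mats : (V, 3, 4) camera projection matrices.
    points_2d : (V, 2) pixel observations of the same joint.
    Returns the 3D point as an array of shape (3,).
    """
    rows = []
    for p, (x, y) in zip(proj_mats, points_2d):
        # Each view contributes two linear constraints on the homogeneous point.
        rows.append(x * p[2] - p[0])
        rows.append(y * p[2] - p[1])
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(np.stack(rows))
    x_h = vt[-1]
    return x_h[:3] / x_h[3]
```

Confidence-weighted variants scale each view's two rows before the SVD so that occluded or unreliable views contribute less, which is one way the fusion step can be made learnable.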
7.2.2. Leveraging Depth and Alternative Sensors
- RGB-D Approaches. Depth sensors eliminate scale ambiguity, but introduce noise and blending problems. Zimmermann et al. [220] proposed projecting 2D heatmaps into a 3D volume fused with depth occupancy grids. Zhang et al. [221] transferred depth knowledge to RGB grids via cross-modality distillation. Other research has developed decoupled architectures such as PoP-Net [222] and Residual Pose [223] to enhance 3D predictions using depth maps. To get around hardware limitations, Szczuko [224] focused on very low-resolution depth sensors, building massive synthetic datasets to train efficient MobileNet backbones for degraded inputs.
- Event Cameras and Specialized Geometries. Fisheye cameras introduce significant distortion in exchange for their wide fields of view.
7.3. Biophysical Constraints: From Temporal Smoothness to Physical Plausibility
- 1. Statistical priors. This approach penalizes deviations in bone length, joint angle range, or per-frame velocity from the distribution observed in MoCap data. This covers most TCN- and transformer-based temporal models, which typically improve temporal smoothness and reduce implausible jitter but do not guarantee physical validity.
- 2. Kinematic constraints. Exact bone length and joint angle limits can be enforced through forward-kinematics layers, bone direction-plus-length decompositions [70], or inverse-kinematics postprocessing [230]. This rules out anatomically impossible configurations, but can still permit physically impossible motions such as floating above the ground plane, foot skating, or passing through scene geometry.
- 3. Dynamic constraints. This approach forces physical plausibility at the motion level by considering gravity, ground contact, friction, balance, and rigid-body dynamics. Instead of only asking whether each predicted skeleton is anatomically valid or temporally smooth, these methods ask whether the full motion sequence could be physically executed in the scene. In this setting, foot contact is not merely a visual cue but a physical constraint: when a foot is predicted to be in contact, its position should remain on the local support surface, should not penetrate the scene, and should not slide unrealistically. For non-flat terrain, this requires replacing the flat-ground assumption with scene geometry, for example by querying a height map that gives the support-surface height at each horizontal location. PhysDynPose [39] is a recent example of this direction. It combines a kinematic pose estimator with camera-motion estimation, then refines the resulting motion using a scene-aware physics optimizer with contact, friction-cone, no-sliding, and root-drift constraints.
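The contact conditions listed above (stay on the support surface, do not penetrate, do not slide) can be expressed as simple penalty terms over a foot trajectory and a height map. The sketch below is a hypothetical illustration of that idea, assuming a z-up world frame, a regular height-map grid with cell size `cell_size`, and given per-frame contact labels; it is not the optimizer used by PhysDynPose and omits friction-cone and root-drift terms.

```python
import numpy as np

def contact_penalties(foot_xyz, in_contact, height_map, cell_size=0.1):
    """Penetration, floating, and sliding penalties for one foot.

    foot_xyz   : (T, 3) foot positions, z pointing up, with T >= 2.
    in_contact : (T,) boolean contact labels.
    height_map : (Ny, Nx) support-surface height per horizontal grid cell.
    """
    foot_xyz = np.asarray(foot_xyz, dtype=float)
    in_contact = np.asarray(in_contact, dtype=bool)
    # Look up the ground height under the foot (nearest cell, clipped to the map).
    ix = np.clip(np.floor(foot_xyz[:, 0] / cell_size).astype(int), 0, height_map.shape[1] - 1)
    iy = np.clip(np.floor(foot_xyz[:, 1] / cell_size).astype(int), 0, height_map.shape[0] - 1)
    gap = foot_xyz[:, 2] - height_map[iy, ix]
    # Penetration: the foot is below the surface (penalized in every frame).
    penetration = np.minimum(gap, 0.0) ** 2
    # Floating: the foot is labeled in contact but hovers above the surface.
    floating = np.where(in_contact, np.maximum(gap, 0.0) ** 2, 0.0)
    # Sliding: horizontal motion between two consecutive contact frames.
    step = np.sum(np.diff(foot_xyz[:, :2], axis=0) ** 2, axis=1)
    sliding = np.where(in_contact[1:] & in_contact[:-1], step, 0.0)
    return float(penetration.mean()), float(floating.mean()), float(sliding.mean())
```

Replacing `height_map` with an array of zeros recovers the flat-ground assumption, which makes explicit what scene-aware methods add: the same penalties, evaluated against measured terrain instead of a plane.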
8. Efficiency, Unification, and the Expanding Frontier
8.1. The Push for Efficiency and Real-World Deployment
8.1.1. Lightweight Structures
8.1.2. Distillation, Quantization, and NAS
8.1.3. Acceleration via Compressed Domain and Pruning
8.2. The Unification Era
8.2.1. Joint Learning with Related Tasks
8.2.2. Training with Diverse Datasets
8.2.3. Foundation Models
8.3. The Human-Centric Frontier
8.3.1. Whole-Body and Fine-Grained Pose
8.3.2. Application-Driven Estimation
- Biomechanics and Sports. Sports analytics demands high precision under rapid motion. Early studies [264] highlighted the domain gap in sports data. Proposed remedies include real-time filtering for simulators [265] and geometry-based association for multi-view tracking [266]. Jiang and Xia [267] (PCNet) addressed extremity localization in fast motion, whereas Baumgartner and Klatt [268] employed field registration for uncalibrated broadcast video. AthletePose3D [36] showed that fine-tuning on domain-specific data significantly reduces error. Bridging computer vision and biomechanics, Koleini et al. [230] (BioPose) combined mesh recovery with inverse kinematics to ensure anatomical correctness.
- Inclusivity and Privacy. Ying et al. [37] introduced LDPose, a benchmark for individuals with limb deficiencies, and proposed metrics that accommodate a range of body morphologies in order to avoid discrimination. For privacy, Huang et al. [269] developed a recoverable anonymization framework for pose estimation. A privacy-enhancing module, a pose estimator, and a recovery module are learned jointly, enabling accurate pose estimation on anonymized images (with identity obscured) while still allowing authorized recovery of the original images. Akada et al. [270] addressed the severe self-occlusion in egocentric (VR) views by augmenting head-mounted displays with rear-facing cameras, and introduced a transformer-based multi-view fusion method that refines 2D joint heatmaps using both front and rear views (with heatmap uncertainty). This mitigates self-occlusion and improves 3D pose estimation for egocentric VR.
8.3.3. Combining Absolute Positioning with Physics
8.3.4. New Benchmarks and Mesh Recovery
9. Critical Analysis and Discussion
9.1. A Strategic Roadmap: Three Views on the Field
- Theoretical view. The field’s conceptual center has moved from point estimation to distribution estimation: from a single coordinate, to a calibrated heatmap, to a multi-hypothesis ensemble, to a full denoising posterior. Each step is a different answer to the same question: how should a model represent the spatial uncertainty inherent in projecting a 3D body onto a 2D image? The rise of diffusion and self-supervised models is best understood not as another architectural fashion but as the natural completion of this trajectory: it reframes 3D pose estimation as an inverse problem with an explicit prior rather than as supervised regression.
- Practical view. Accuracy on Human3.6M and COCO is no longer the binding constraint for deployment; latency, energy, robustness to compression and motion blur, behavior under occlusion and crowding, and graceful degradation on non-standard bodies (children, elderly, individuals with limb differences) now dominate the gap between benchmark numbers and field-ready systems. Therefore, the efficiency literature reviewed in Section 8.1—lightweight backbones, distillation, quantization, token and frame pruning, compressed-domain inference—is not a side topic but the central concern for most real applications.
- Design view. Practitioners choosing an architecture today face a small number of recurring decisions: heatmap versus regression, CNN versus transformer versus SSM, single-view versus multi-view, deterministic versus generative, RGB-only versus multimodal. Table 18 distills the tradeoffs and the conditions under which each option is preferable. The same tradeoffs explain why XR, sports biomechanics, and robotics have converged on different architectural stacks despite drawing from the same algorithmic literature.
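The first steps of the trajectory described in the theoretical view, from a single coordinate to a distribution over coordinates, can be made concrete with a few lines of NumPy. The sketch below is an illustration under stated assumptions rather than any particular published method: the function name `heatmap_to_distribution` is hypothetical, and the heatmap is assumed to hold unnormalized scores that a softmax converts into a probability map. It returns the hard argmax (a point estimate), the soft-argmax expectation, and the spatial covariance as a simple uncertainty summary.

```python
import numpy as np

def heatmap_to_distribution(logits):
    """Summarize a 2D heatmap of unnormalized scores, shape (H, W).

    Returns (argmax_xy, mean_xy, cov), all in pixel units with x = column, y = row.
    """
    h, w = logits.shape
    # Softmax over all pixels (shifted by the maximum for numerical stability).
    p = np.exp(logits - logits.max())
    p /= p.sum()
    ys, xs = np.mgrid[0:h, 0:w]
    # Soft-argmax: the expected coordinate under the heatmap distribution.
    mean = np.array([(p * xs).sum(), (p * ys).sum()])
    dx, dy = xs - mean[0], ys - mean[1]
    # Second central moments: a simple measure of spatial uncertainty.
    cov = np.array([[(p * dx * dx).sum(), (p * dx * dy).sum()],
                    [(p * dx * dy).sum(), (p * dy * dy).sum()]])
    # Hard argmax: the single-coordinate point estimate.
    iy, ix = np.unravel_index(np.argmax(logits), logits.shape)
    return np.array([ix, iy]), mean, cov
```

For a heatmap with two equally strong peaks at x = 2 and x = 10 on the same row, the expectation falls at x = 6, a location where the heatmap itself assigns almost no probability, while the covariance grows large. This is one way to see why a single summary statistic is inadequate under ambiguity and why multi-hypothesis and generative formulations keep the full distribution.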
9.2. Representation and Architecture: Benefits and Hidden Costs
9.3. Efficiency and Real-World Deployability
9.4. Generalization, Robustness, and Failure Modes
9.5. Limitations of Current Evaluation Metrics
9.6. Reproducibility and the “Foundation” Era
9.7. Case Study: Real-Time Deployment in Sports Biomechanics
10. Future Research Directions
10.1. Scalable Sequence Models: Beyond Transformers
10.2. From Isolated Skeletons to Scene-Aware Humans
10.3. Responsible Deployment, Privacy, and Fairness
10.4. Foundation and Multimodal Models
10.5. Next-Generation Benchmarks
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
| --- | --- |
| HPE | Human Pose Estimation |
| CNN | Convolutional Neural Network |
| SSM | State Space Model |
| GCN | Graph Convolutional Network |
| ViT | Vision Transformer |
| RNN | Recurrent Neural Network |
| LSTM | Long Short-Term Memory |
| TCN | Temporal Convolutional Network |
| DCT | Discrete Cosine Transform |
| MDN | Mixture Density Network |
| CVAE | Conditional Variational Autoencoder |
| DETR | Detection Transformer |
| RLE | Residual Log-likelihood Estimation |
| PAF | Part Affinity Field |
| OKS | Object Keypoint Similarity |
| AP | Average Precision |
| PCK | Percentage of Correct Keypoints |
| PCKh | Percentage of Correct Keypoints (head-normalized) |
| MPJPE | Mean Per-Joint Position Error |
| PA-MPJPE | Procrustes-Aligned Mean Per-Joint Position Error |
| N-MPJPE | Normalized Mean Per-Joint Position Error |
| AUC | Area Under the Curve |
| O-MPJPE | Object-centric Mean Per-Joint Position Error |
| MoCap | Motion Capture |
| XR | Extended Reality |
| AR | Augmented Reality |
| VR | Virtual Reality |
| IMU | Inertial Measurement Unit |
| LiDAR | Light Detection and Ranging |
| RGB | Red Green Blue |
| RGB-D | Red Green Blue—Depth |
| HMR | Human Mesh Recovery |
| NAS | Neural Architecture Search |
| KD | Knowledge Distillation |
| BNN | Binary Neural Network |
| FLOPs | Floating Point Operations |
| GAN | Generative Adversarial Network |
| LLM | Large Language Model |
| CLIP | Contrastive Language–Image Pretraining |
| VQ-VAE | Vector Quantized Variational Autoencoder |
| MLP | Multi-Layer Perceptron |
| IoU | Intersection over Union |
| OHKM | Online Hard Keypoint Mining |
| WASP | Waterfall Atrous Spatial Pooling |
| UDP | Unbiased Data Processing |
| DARK | Distribution-Aware Coordinate Representation |
| SAHR | Scale-Adaptive Heatmap Regression |
| WAHR | Weight-Adaptive Heatmap Regression Loss |
| CPN | Cascaded Pyramid Network |
| HRNet | High-Resolution Network |
| MSPN | Multi-Stage Pose Network |
| RSN | Residual Steps Network |
| LCN | Locally Connected Network |
| PifPaf | Part Intensity Field and Part Association Field |
| ORPM | Occlusion-Robust Pose Map |
| MeTRAbs | Metric-Scale Truncation-Robust Heatmaps |
| COCO | Common Objects in Context |
| MPII | Max Planck Institute for Informatics |
| H3.6M | Human3.6M |
| 3DPW | 3D Poses in the Wild |
| SURREAL | Synthetic Humans for REAL tasks |
References
- Dang, Q.; Yin, J.; Wang, B.; Zheng, W. Deep learning based 2d human pose estimation: A survey. Tsinghua Sci. Technol. 2019, 24, 663–676.
- Munea, T.L.; Jembre, Y.Z.; Weldegebriel, H.T.; Chen, L.; Huang, C.; Yang, C. The progress of human pose estimation: A survey and taxonomy of models applied in 2D human pose estimation. IEEE Access 2020, 8, 133330–133348.
- Zhang, Z.; Shin, S.Y. Two-Dimensional Human Pose Estimation with Deep Learning: A Review. Appl. Sci. 2025, 15, 7344.
- Ji, X.; Fang, Q.; Dong, J.; Shuai, Q.; Jiang, W.; Zhou, X. A survey on monocular 3D human pose estimation. Virtual Real. Intell. Hardw. 2020, 2, 471–500.
- Chen, Y.; Tian, Y.; He, M. Monocular human pose estimation: A survey of deep learning-based methods. Comput. Vis. Image Underst. 2020, 192, 102897.
- Liu, W.; Bao, Q.; Sun, Y.; Mei, T. Recent advances of monocular 2D and 3D human pose estimation: A deep learning perspective. ACM Comput. Surv. 2022, 55, 1–41.
- Guo, Y.; Gao, T.; Dong, A.; Jiang, X.; Zhu, Z.; Wang, F. A Survey of the State of the Art in Monocular 3D Human Pose Estimation: Methods, Benchmarks, and Challenges. Sensors 2025, 25, 2409.
- Udayan, D.J.; Jayakumar, T.V.; Raman, R.; Kim, H.S.; Nedungadi, P. Deep Learning in Monocular 3D Human Pose Estimation: Systematic Review of Contemporary Techniques and Applications. Multimed. Tools Appl. 2025, 84, 36985–37021.
- Wang, J.; Tan, S.; Zhen, X.; Xu, S.; Zheng, F.; He, Z.; Shao, L. Deep 3D human pose estimation: A review. Comput. Vis. Image Underst. 2021, 210, 103225.
- Neupane, R.B.; Li, K.; Boka, T.F. A survey on deep 3D human pose estimation. Artif. Intell. Rev. 2024, 58, 24.
- Liu, Y.; Qiu, C.; Zhang, Z. Deep learning for 3D human pose estimation and mesh recovery: A survey. Neurocomputing 2024, 596, 128049.
- Dubey, S.; Dixit, M. A comprehensive survey on human pose estimation approaches. Multimed. Syst. 2023, 29, 167–195.
- Sun, R.; Lin, Z.; Leng, S.; Wang, A.; Zhao, L. An In-Depth Analysis of 2D and 3D Pose Estimation Techniques in Deep Learning: Methodologies and Advances. Electronics 2025, 14, 1307.
- Lan, G.; Wu, Y.; Hu, F.; Hao, Q. Vision-Based Human Pose Estimation via Deep Learning: A Survey. IEEE Trans. Hum.-Mach. Syst. 2023, 53, 253–268.
- Gao, Z.; Chen, J.; Liu, Y.; Jin, Y.; Tian, D. A systematic survey on human pose estimation: Upstream and downstream tasks, approaches, lightweight models, and prospects. Artif. Intell. Rev. 2025, 58, 68.
- Salisu, S.; Danyaro, K.U.; Nasser, M.; Hayder, I.M.; Younis, H.A. Review of Models for Estimating 3D Human Pose Using Deep Learning. PeerJ Comput. Sci. 2025, 11, e2574.
- Jayaswal, R.; Ansari, M.A.; Mewada, A.; Pareek, P.; Ahmad, S. An in-depth exploration of structural pose estimation strategies and datasets. Discov. Comput. 2025, 28, 222.
- Hou, Y.; Li, J.; Liao, S.; Xue, N. Research Advanced in Human Pose Estimation based on Deep Learning. Highlights Sci. Eng. Technol. 2024, 119, 444–453.
- Nogueira, A.F.R.; Oliveira, H.P.; Teixeira, L.F. Markerless multi-view 3D human pose estimation: A survey. Image Vis. Comput. 2025, 155, 105437.
- Azam, M.M.; Desai, K. A Survey on 3D Egocentric Human Pose Estimation. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 16–22 June 2024; pp. 1643–1654.
- Algabri, R.; Abdu, A.; Lee, S. Deep learning and machine learning techniques for head pose estimation: A survey. Artif. Intell. Rev. 2024, 57, 288.
- Suo, X.; Tang, W.; Li, Z. Motion Capture Technology in Sports Scenarios: A Survey. Sensors 2024, 24, 2947.
- Song, L.; Yu, G.; Yuan, J.; Liu, Z. Human pose estimation and its application to action recognition: A survey. J. Vis. Commun. Image Represent. 2021, 76, 103055.
- Ben Gamra, M.; Akhloufi, M.A. A review of deep learning techniques for 2D and 3D human pose estimation. Image Vis. Comput. 2021, 114, 104282.
- Ionescu, C.; Papava, D.; Olaru, V.; Sminchisescu, C. Human3.6M: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 36, 1325–1339.
- Varol, G.; Romero, J.; Martin, X.; Mahmood, N.; Black, M.J.; Laptev, I.; Schmid, C. Learning from Synthetic Humans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
- Andriluka, M.; Pishchulin, L.; Gehler, P.; Schiele, B. 2D Human Pose Estimation: New Benchmark and State of the Art Analysis. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 3686–3693.
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the Computer Vision – ECCV 2014, Zurich, Switzerland, 6–12 September 2014; Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Springer: Cham, Switzerland, 2014; pp. 740–755.
- Cao, Z.; Simon, T.; Wei, S.E.; Sheikh, Y. Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7291–7299.
- Newell, A.; Huang, Z.; Deng, J. Associative embedding: End-to-end learning for joint detection and grouping. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, Long Beach, CA, USA, 4–9 December 2017; pp. 2274–2284.
- Papandreou, G.; Zhu, T.; Chen, L.C.; Gidaris, S.; Tompson, J.; Murphy, K. PersonLab: Person Pose Estimation and Instance Segmentation with a Bottom-Up, Part-Based, Geometric Embedding Model. In Proceedings of the Computer Vision – ECCV 2018, Munich, Germany, 8–14 September 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer: Cham, Switzerland, 2018; pp. 282–299.
- Mehta, D.; Rhodin, H.; Casas, D.; Fua, P.; Sotnychenko, O.; Xu, W.; Theobalt, C. Monocular 3D Human Pose Estimation In The Wild Using Improved CNN Supervision. In Proceedings of the 2017 International Conference on 3D Vision (3DV), Qingdao, China, 10–12 October 2017.
- Von Marcard, T.; Henschel, R.; Black, M.J.; Rosenhahn, B.; Pons-Moll, G. Recovering accurate 3d human pose in the wild using imus and a moving camera. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 601–617.
- Wang, J.; Yang, F.; Li, B.; Gou, W.; Yan, D.; Zeng, A.; Gao, Y.; Wang, J.; Jing, Y.; Zhang, R. Freeman: Towards benchmarking 3d human pose estimation under real-world conditions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 21978–21988.
- Jin, S.; Xu, L.; Xu, J.; Wang, C.; Liu, W.; Qian, C.; Ouyang, W.; Luo, P. Whole-Body Human Pose Estimation in the Wild. In Proceedings of the Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part IX; Springer: Cham, Switzerland, 2020; pp. 196–214.
- Yeung, C.; Suzuki, T.; Tanaka, R.; Yin, Z.; Fujii, K. AthletePose3D: A benchmark dataset for 3D human pose estimation and kinematic validation in athletic movements. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 5945–5956.
- Ying, J.; Du, H.; Zhang, K.; Li, L.; Yu, X. LDPose: Towards Inclusive Human Pose Estimation for Limb-Deficient Individuals in the Wild. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Honolulu, HI, USA, 19–20 October 2025; pp. 9865–9875.
- Dai, Y.; Lin, Y.; Lin, X.; Wen, C.; Xu, L.; Yi, H.; Shen, S.; Ma, Y.; Wang, C. SLOPER4D: A Scene-Aware Dataset for Global 4D Human Pose Estimation in Urban Environments. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 682–692.
- Aytekin, A.I.; Li, C.; Luvizon, D.; Dabral, R.; Oswald, M.; Habermann, M.; Theobalt, C. Physics-based Human Pose Estimation from a Single Moving RGB Camera. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 3891–3900.
- Toshev, A.; Szegedy, C. DeepPose: Human Pose Estimation via Deep Neural Networks. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1653–1660.
- Sun, X.; Xiao, B.; Wei, F.; Liang, S.; Wei, Y. Integral Human Pose Regression. In Proceedings of the Computer Vision – ECCV 2018, Munich, Germany, 8–14 September 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer: Cham, Switzerland, 2018; pp. 536–553.
- Li, S.; Ke, L.; Pratama, K.; Tai, Y.W.; Tang, C.K.; Cheng, K.T. Cascaded Deep Monocular 3D Human Pose Estimation With Evolutionary Training Data. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 6172–6182.
- Kim, Y.; Kim, D. A CNN-based 3D human pose estimation based on projection of depth and ridge data. Pattern Recognit. 2020, 106, 107462.
- Zhou, K.; Han, X.; Jiang, N.; Jia, K.; Lu, J. HEMlets Pose: Learning Part-Centric Heatmap Triplets for Accurate 3D Human Pose Estimation. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 2344–2353.
- Luo, Z.; Wang, Z.; Huang, Y.; Wang, L.; Tan, T.; Zhou, E. Rethinking the Heatmap Regression for Bottom-up Human Pose Estimation. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 13259–13268.
- Jiang, L.; Liu, Z.; Li, K.; Wu, W. Boosting Human Pose Estimation via Heatmap Refinement. In Proceedings of the MultiMedia Modeling; Ide, I., Kompatsiaris, I., Xu, C., Yanai, K., Chu, W.T., Nitta, N., Riegler, M., Yamasaki, T., Eds.; Springer: Singapore, 2025; pp. 153–167.
- Huang, J.; Zhu, Z.; Guo, F.; Huang, G. The Devil Is in the Details: Delving Into Unbiased Data Processing for Human Pose Estimation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 5699–5708.
- Zhang, F.; Zhu, X.; Dai, H.; Ye, M.; Zhu, C. Distribution-Aware Coordinate Representation for Human Pose Estimation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 7091–7100.
- Gu, K.; Chen, R.; Yu, X.; Yao, A. On the Calibration of Human Pose Estimation. In Proceedings of the 41st International Conference on Machine Learning, PMLR, Vienna, Austria, 21–27 July 2024; Salakhutdinov, R., Kolter, Z., Heller, K., Weller, A., Oliver, N., Scarlett, J., Berkenkamp, F., Eds.; ACM: New York, NY, USA, 2024; Volume 235, pp. 16530–16547.
- Liu, H.; Liu, T.; Chen, Y.; Zhang, Z.; Li, Y.F. EHPE: Skeleton cues-based gaussian coordinate encoding for efficient human pose estimation. IEEE Trans. Multimed. 2022, 26, 8464–8475.
- Purkrabek, M.; Matas, J. ProbPose: A Probabilistic Approach to 2D Human Pose Estimation. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 27124–27133.
- Osokin, D. Real-time 2D Multi-Person Pose Estimation on CPU: Lightweight. arXiv 2018, arXiv:1811.12004.
- Kreiss, S.; Bertoni, L.; Alahi, A. PifPaf: Composite Fields for Human Pose Estimation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 11969–11978.
- Benzine, A.; Luvison, B.; Pham, Q.C.; Achard, C. Deep, Robust and Single Shot 3D Multi-Person Human Pose Estimation from Monocular Images. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 584–588.
- Zhen, J.; Fang, Q.; Sun, J.; Liu, W.; Jiang, W.; Bao, H.; Zhou, X. SMAP: Single-Shot Multi-person Absolute 3D Pose Estimation. In Proceedings of the Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XV; Springer: Cham, Switzerland, 2020; pp. 550–566.
- Zhang, Z.; Luo, Y.; Gou, J. Double anchor embedding for accurate multi-person 2D pose estimation. Image Vis. Comput. 2021, 111, 104198.
- Cheng, Y.; Ai, Y.; Wang, B.; Wang, X.; Tan, R.T. Bottom-up 2D pose estimation via dual anatomical centers for small-scale persons. Pattern Recognit. 2023, 139, 109403.
- Wang, T.; Jin, L.; Wang, Z.; Fan, X.; Cheng, Y.; Teng, Y.; Xing, J.; Zhao, J. DecenterNet: Bottom-up human pose estimation via decentralized pose representation. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 1798–1808.
- Li, J.; Su, W.; Wang, Z. Simple Pose: Rethinking and Improving a Bottom-up Approach for Multi-Person Pose Estimation. In Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, 7–12 February 2020; AAAI Press: Menlo Park, CA, USA, 2020; pp. 11354–11361.
- McNally, W.; Vats, K.; Wong, A.; McPhee, J. Rethinking Keypoint Representations: Modeling Keypoints and Poses as Objects for Multi-person Human Pose Estimation. In Proceedings of the Computer Vision – ECCV 2022, Tel Aviv, Israel, 23–27 October 2022; Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T., Eds.; Springer: Cham, Switzerland, 2022; pp. 37–54.
- Maji, D.; Nagori, S.; Mathew, M.; Poddar, D. Yolo-pose: Enhancing yolo for multi person pose estimation using object keypoint similarity loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 21–24 June 2022; pp. 2637–2646.
- Zauss, D.; Kreiss, S.; Alahi, A. Keypoint Communities. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 11057–11066.
- Qu, H.; Cai, Y.; Foo, L.G.; Kumar, A.; Liu, J. A Characteristic Function-Based Method for Bottom-Up Human Pose Estimation. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 13009–13018.
- Nibali, A.; He, Z.; Morgan, S.; Prendergast, L. 3D Human Pose Estimation With 2D Marginal Heatmaps. In Proceedings of the 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa Village, HI, USA, 7–11 January 2019; pp. 1477–1485.
- Choi, S.; Choi, S.; Kim, C. MobileHumanPose: Toward real-time 3D human pose estimation in mobile devices. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Nashville, TN, USA, 20–25 June 2021; pp. 2328–2338.
- Sárándi, I.; Linder, T.; Arras, K.O.; Leibe, B. MeTRAbs: Metric-Scale Truncation-Robust Heatmaps for Absolute 3D Human Pose Estimation. IEEE Trans. Biom. Behav. Identity Sci. 2021, 3, 16–30.
- Wang, M.; Chen, X.; Liu, W.; Qian, C.; Lin, L.; Ma, L. DRPose3D: Depth ranking in 3D human pose estimation. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, IJCAI’18; AAAI Press: Menlo Park, CA, USA, 2018; pp. 978–984.
- Pavlakos, G.; Zhou, X.; Daniilidis, K. Ordinal Depth Supervision for 3D Human Pose Estimation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7307–7316.
- Kundu, J.; Seth, S.; M V, R.; Rakesh, M.; Babu, R.; Chakraborty, A. Kinematic-Structure-Preserved Representation for Unsupervised 3D Human Pose Estimation. Proc. AAAI Conf. Artif. Intell. 2020, 34, 11312–11319.
- Chen, T.; Fang, C.; Shen, X.; Zhu, Y.; Chen, Z.; Luo, J. Anatomy-Aware 3D Human Pose Estimation With Bone-Based Pose Decomposition. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 198–209.
- Geng, Z.; Wang, C.; Wei, Y.; Liu, Z.; Li, H.; Hu, H. Human Pose as Compositional Tokens. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 660–671.
- Fang, H.S.; Xu, Y.; Wang, W.; Liu, X.; Zhu, S.C. Learning pose grammar to encode human body configuration for 3d pose estimation. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32.
- Marin-Jimenez, M.J.; Romero-Ramirez, F.J.; Munoz-Salinas, R.; Medina-Carnicer, R. 3D human pose estimation from depth maps using a deep combination of poses. J. Vis. Commun. Image Represent. 2018, 55, 627–639.
- Chen, Y.; Wang, Z.; Peng, Y.; Zhang, Z.; Yu, G.; Sun, J. Cascaded Pyramid Network for Multi-person Pose Estimation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7103–7112.
- Li, W.; Wang, Z.; Yin, B.; Peng, Q.; Du, Y.; Xiao, T.; Yu, G.; Lu, H.; Wei, Y.; Sun, J. Rethinking on multi-stage networks for human pose estimation. arXiv 2019, arXiv:1901.00148.
- Su, Z.; Ye, M.; Zhang, G.; Dai, L.; Sheng, J. Cascade feature aggregation for human pose estimation. arXiv 2019, arXiv:1902.07837.
- Zhang, H.; Ouyang, H.; Liu, S.; Qi, X.; Shen, X.; Yang, R.; Jia, J. Human pose estimation with spatial contextual information. arXiv 2019, arXiv:1901.01760.
- Xiao, B.; Wu, H.; Wei, Y. Simple Baselines for Human Pose Estimation and Tracking. In Proceedings of the Computer Vision – ECCV 2018: 15th European Conference, Munich, Germany, 8–14 September 2018; Proceedings, Part VI; Springer: Berlin/Heidelberg, Germany, 2018; pp. 472–487.
- Bulat, A.; Kossaifi, J.; Tzimiropoulos, G.; Pantic, M. Toward fast and accurate human pose estimation via soft-gated skip connections. In Proceedings of the 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), Buenos Aires, Argentina, 16–20 November 2020; pp. 8–15.
- Tang, Z.; Peng, X.; Geng, S.; Wu, L.; Zhang, S.; Metaxas, D. Quantized Densely Connected U-Nets for Efficient Landmark Localization. In Proceedings of the Computer Vision – ECCV 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer: Cham, Switzerland, 2018; pp. 348–364.
- Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep High-Resolution Representation Learning for Human Pose Estimation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 5686–5696.
- Cheng, B.; Xiao, B.; Wang, J.; Shi, H.; Huang, T.S.; Zhang, L. HigherHRNet: Scale-Aware Representation Learning for Bottom-Up Human Pose Estimation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 5385–5394.
- Artacho, B.; Savakis, A. UniPose: Unified Human Pose Estimation in Single Images and Videos. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 7033–7042.
- Artacho, B.; Savakis, A. OmniPose: A Multi-Scale Framework for Multi-Person Pose Estimation. arXiv 2021, arXiv:2103.10180.
- Ke, L.; Chang, M.C.; Qi, H.; Lyu, S. Multi-Scale Structure-Aware Network for Human Pose Estimation. In Proceedings of the Computer Vision – ECCV 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer: Cham, Switzerland, 2018; pp. 731–746.
- Zhang, J.; Chen, Z.; Tao, D. Towards High Performance Human Keypoint Detection. Int. J. Comput. Vis. 2021, 129, 2639–2662.
- Cai, Y.; Wang, Z.; Luo, Z.; Yin, B.; Du, A.; Wang, H.; Zhang, X.; Zhou, X.; Zhou, E.; Sun, J. Learning Delicate Local Representations for Multi-person Pose Estimation. In Proceedings of the Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part III; Springer: Berlin/Heidelberg, Germany, 2020; pp. 455–472.
- Groos, D.; Ramampiaro, H.; Ihlen, E.A. EfficientPose: Scalable single-person pose estimation. Appl. Intell. 2021, 51, 2518–2533.
- Papaioannidis, C.; Mademlis, I.; Pitas, I. Fast single-person 2D human pose estimation using multi-task Convolutional Neural Networks. In Proceedings of the 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: Piscataway, NJ, USA, 2023; pp. 1–5.
- Khirodkar, R.; Chari, V.; Agrawal, A.; Tyagi, A. Multi-Instance Pose Networks: Rethinking Top-Down Pose Estimation. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 3102–3111.
- Munea, T.L.; Yang, C.; Huang, C.; Elhassan, M.A.; Zhen, Q. SimpleCut: A simple and strong 2D model for multi-person pose estimation. Comput. Vis. Image Underst. 2022, 222, 103509.
- Fieraru, M.; Khoreva, A.; Pishchulin, L.; Schiele, B. Learning to Refine Human Pose Estimation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Salt Lake City, UT, USA, 18–23 June 2018; pp. 318–31809.
- Moon, G.; Chang, J.Y.; Lee, K.M. PoseFix: Model-Agnostic General Human Pose Refinement Network. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 7765–7773.
- Xu, T.; Takano, W. Graph Stacked Hourglass Networks for 3D Human Pose Estimation. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 16100–16109.
- Ci, H.; Wang, C.; Ma, X.; Wang, Y. Optimizing Network Structure for 3D Human Pose Estimation. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 2262–2271.
- Hu, W.; Zhang, C.; Zhan, F.; Zhang, L.; Wong, T.T. Conditional Directed Graph Convolution for 3D Human Pose Estimation. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual Event, China, 20–24 October 2021; MM’21; Association for Computing Machinery: New York, NY, USA, 2021; pp. 602–611.
- Azizi, N.; Possegger, H.; Rodolà, E.; Bischof, H. 3D Human Pose Estimation Using Möbius Graph Convolutional Networks. In Proceedings of the Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, 23–27 October 2022; Proceedings, Part I; Springer: Berlin/Heidelberg, Germany, 2022; pp. 160–178.
- Liu, J.; Rojas, J.; Li, Y.; Liang, Z.; Guan, Y.; Xi, N.; Zhu, H. A Graph Attention Spatio-temporal Convolutional Network for 3D Human Pose Estimation in Video. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 3374–3380.
- Li, W.; Liu, M.; Liu, H.; Guo, T.; Wang, T.; Tang, H.; Sebe, N. GraphMLP: A graph MLP-like architecture for 3D human pose estimation. Pattern Recognit. 2025, 158, 110925.
- Yang, S.; Quan, Z.; Nie, M.; Yang, W. TransPose: Keypoint Localization via Transformer. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 11782–11792.
- Yuan, Y.; Fu, R.; Huang, L.; Lin, W.; Zhang, C.; Chen, X.; Wang, J. HRFormer: High-resolution transformer for dense prediction. In Proceedings of the 35th International Conference on Neural Information Processing Systems, NIPS’21, Virtual, 6–14 December 2021.
- Liu, H.; Liu, F.; Fan, X.; Huang, D. Polarized self-attention: Towards high-quality pixel-wise mapping. Neurocomputing 2022, 506, 158–167. [Google Scholar] [CrossRef]
- Li, R.; Li, Q.; Yang, S.; Zeng, X.; Yan, A. An efficient and accurate 2D human pose estimation method using VTTransPose network. Sci. Rep. 2024, 14, 7608. [Google Scholar] [CrossRef]
- Zeng, W.; Jin, S.; Liu, W.; Qian, C.; Luo, P.; Ouyang, W.; Wang, X. Not All Tokens Are Equal: Human-centric Visual Analysis via Token Clustering Transformer. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 11091–11101. [Google Scholar] [CrossRef]
- Mao, W.; Ge, Y.; Shen, C.; Tian, Z.; Wang, X.; Wang, Z. TFPose: Direct Human Pose Estimation with Transformers. arXiv 2021, arXiv:2103.15320. [Google Scholar] [CrossRef]
- Panteleris, P.; Argyros, A. PE-former: Pose Estimation Transformer. In Proceedings of the Pattern Recognition and Artificial Intelligence: Third International Conference, ICPRAI 2022, Paris, France, 1–3 June 2022; Proceedings, Part II; Springer: Berlin/Heidelberg, Germany, 2022; pp. 3–14. [Google Scholar] [CrossRef]
- Li, K.; Wang, S.; Zhang, X.; Xu, Y.; Xu, W.; Tu, Z. Pose Recognition With Cascade Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 1944–1953. [Google Scholar]
- Mao, W.; Ge, Y.; Shen, C.; Tian, Z.; Wang, X.; Wang, Z.; van den Hengel, A. Poseur: Direct Human Pose Regression with Transformers. In Proceedings of the Computer Vision—ECCV 2022: 17th European Conference, Tel Aviv, Israel, 23–27 October 2022; Proceedings, Part VI; Springer: Berlin/Heidelberg, Germany, 2022; pp. 72–88. [Google Scholar] [CrossRef]
- Tian, Z.; Chen, H.; Shen, C. DirectPose: Direct End-to-End Multi-Person Pose Estimation. arXiv 2019, arXiv:1911.07451. [Google Scholar] [CrossRef]
- Liu, H.; Chen, Q.; Tan, Z.; Liu, J.J.; Wang, J.; Su, X.; Li, X.; Yao, K.; Han, J.; Ding, E.; et al. Group Pose: A Simple Baseline for End-to-End Multi-person Pose Estimation. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 14983–14992. [Google Scholar] [CrossRef]
- Geng, Z.; Sun, K.; Xiao, B.; Zhang, Z.; Wang, J. Bottom-Up Human Pose Estimation Via Disentangled Keypoint Regression. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 14671–14681. [Google Scholar] [CrossRef]
- Ma, X.; Su, J.; Wang, C.; Ci, H.; Wang, Y. Context Modeling in 3D Human Pose Estimation: A Unified Perspective. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 6234–6243. [Google Scholar] [CrossRef]
- Zhao, Q.; Zheng, C.; Liu, M.; Chen, C. A single 2D pose with context is worth hundreds for 3D human pose estimation. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS’23, New Orleans, LA, USA, 10–16 December 2023. [Google Scholar]
- Li, C.; Lee, G.H. Generating Multiple Hypotheses for 3D Human Pose Estimation With Mixture Density Network. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 9879–9887. [Google Scholar] [CrossRef]
- Sharma, S.; Varigonda, P.T.; Bindal, P.; Sharma, A.; Jain, A. Monocular 3D Human Pose Estimation by Generation and Ordinal Ranking. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 2325–2334. [Google Scholar] [CrossRef]
- Han, C.; Yu, X.; Gao, C.; Sang, N.; Yang, Y. Single image based 3D human pose estimation via uncertainty learning. Pattern Recognit. 2022, 132, 108934. [Google Scholar] [CrossRef]
- Rommel, C.; Letzelter, V.; Samet, N.; Marlet, R.; Cord, M.; Pérez, P.; Valle, E. ManiPose: Manifold-constrained multi-hypothesis 3D human pose estimation. In Proceedings of the 38th International Conference on Neural Information Processing Systems, NIPS’24, Vancouver, BC, Canada, 10–15 December 2024. [Google Scholar]
- Holmquist, K.; Wandt, B. DiffPose: Multi-hypothesis Human Pose Estimation using Diffusion Models. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 15931–15941. [Google Scholar] [CrossRef]
- Gong, J.; Foo, L.G.; Fan, Z.; Ke, Q.; Rahmani, H.; Liu, J. DiffPose: Toward More Reliable 3D Pose Estimation. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 13041–13051. [Google Scholar] [CrossRef]
- Shan, W.; Liu, Z.; Zhang, X.; Wang, Z.; Han, K.; Wang, S.; Ma, S.; Gao, W. Diffusion-Based 3D Human Pose Estimation with Multi-Hypothesis Aggregation. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 14715–14725. [Google Scholar] [CrossRef]
- Xu, J.; Guo, Y.; Peng, Y. FinePOSE: Fine-Grained Prompt-Driven 3D Human Pose Estimation via Diffusion Models. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 561–570. [Google Scholar] [CrossRef]
- Wang, W.; Xiao, J.; Wang, C.; Liu, W.; Wang, Z.; Chen, L. Di2Pose: Discrete diffusion model for occluded 3D human pose estimation. In Proceedings of the 38th International Conference on Neural Information Processing Systems, NIPS’24, Vancouver, BC, Canada, 10–15 December 2024. [Google Scholar]
- Feng, R.; Gao, Y.; Tse, T.H.E.; Ma, X.; Chang, H.J. DiffPose: SpatioTemporal Diffusion Model for Video-Based Human Pose Estimation. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 14815–14826. [Google Scholar] [CrossRef]
- Doering, A.; Chen, D.; Zhang, S.; Schiele, B.; Gall, J. PoseTrack21: A Dataset for Person Search, Multi-Object Tracking and Multi-Person Pose Tracking. In Proceedings of the CVPR, New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
- Jiang, Z.; Zhou, Z.; Li, L.; Chai, W.; Yang, C.Y.; Hwang, J.N. Back to Optimization: Diffusion-based Zero-Shot 3D Human Pose Estimation. In Proceedings of the 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2024; pp. 6130–6140. [Google Scholar] [CrossRef]
- Rhodin, H.; Meyer, F.; Spörri, J.; Müller, E.; Constantin, V.; Fua, P.; Katircioglu, I.; Salzmann, M. Learning Monocular 3D Human Pose Estimation from Multi-view Images. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8437–8446. [Google Scholar] [CrossRef]
- Rhodin, H.; Salzmann, M.; Fua, P. Unsupervised Geometry-Aware Representation for 3D Human Pose Estimation. In Proceedings of the Computer Vision—ECCV 2018: 15th European Conference, Munich, Germany, 8–14 September 2018; Proceedings, Part X; Springer: Berlin/Heidelberg, Germany, 2018; pp. 765–782. [Google Scholar] [CrossRef]
- Iskakov, K.; Burkov, E.; Lempitsky, V.; Malkov, Y. Learnable triangulation of human pose. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7718–7727. [Google Scholar]
- Yang, C.Y.; Luo, J.; Xia, L.; Sun, Y.; Qiao, N.; Zhang, K.; Jiang, Z.; Hwang, J.N.; Kuo, C.H. CameraPose: Weakly-Supervised Monocular 3D Human Pose Estimation by Leveraging In-the-wild 2D Annotations. In Proceedings of the 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2–7 January 2023; pp. 2923–2932. [Google Scholar] [CrossRef]
- Yu, Z.; Wang, M.; Chen, Y.; Favaro, P.; Modolo, D. Denoising and Selecting Pseudo-Heatmaps for Semi-Supervised Human Pose Estimation. In Proceedings of the 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2024; pp. 6268–6277. [Google Scholar] [CrossRef]
- Nakatsuka, T.; Yoshii, K.; Koyama, Y.; Fukayama, S.; Goto, M.; Morishima, S. MirrorNet: A Deep Reflective Approach to 2D Pose Estimation for Single-Person Images. J. Inf. Process. 2021, 29, 406–423. [Google Scholar] [CrossRef]
- Kundu, J.N.; Seth, S.; Jampani, V.; Rakesh, M.; Venkatesh Babu, R.; Chakraborty, A. Self-Supervised 3D Human Pose Estimation via Part Guided Novel Image Synthesis. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 6151–6161. [Google Scholar] [CrossRef]
- Sosa, J.; Hogg, D. Self-supervised 3D Human Pose Estimation from a Single Image. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Vancouver, BC, Canada, 17–24 June 2023; pp. 4788–4797. [Google Scholar] [CrossRef]
- Kundu, J.N.; Seth, S.; YM, P.; Jampani, V.; Chakraborty, A.; Babu, R.V. Uncertainty-Aware Adaptation for Self-Supervised 3D Human Pose Estimation. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 20416–20427. [Google Scholar] [CrossRef]
- Yang, W.; Ouyang, W.; Wang, X.; Ren, J.; Li, H.; Wang, X. 3D Human Pose Estimation in the Wild by Adversarial Learning. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5255–5264. [Google Scholar] [CrossRef]
- Peng, X.; Tang, Z.; Yang, F.; Feris, R.S.; Metaxas, D. Jointly Optimize Data Augmentation and Network Training: Adversarial Data Augmentation in Human Pose Estimation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2226–2234. [Google Scholar] [CrossRef]
- Gong, K.; Zhang, J.; Feng, J. PoseAug: A Differentiable Pose Augmentation Framework for 3D Human Pose Estimation. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 8571–8580. [Google Scholar] [CrossRef]
- Peng, Q.; Zheng, C.; Chen, C. A Dual-Augmentor Framework for Domain Generalization in 3D Human Pose Estimation. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 2240–2249. [Google Scholar] [CrossRef]
- Wang, L.; Chen, Y.; Guo, Z.; Qian, K.; Lin, M.; Li, H.; Ren, J.S. Generalizing Monocular 3D Human Pose Estimation in the Wild. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4024–4033. [Google Scholar] [CrossRef]
- Doersch, C.; Zisserman, A. Sim2real transfer learning for 3D human pose estimation: Motion to the rescue. In Advances in Neural Information Processing Systems; NeurIPS: San Diego, CA, USA, 2019; Volume 32. [Google Scholar]
- Chai, W.; Jiang, Z.; Hwang, J.N.; Wang, G. Global Adaptation meets Local Generalization: Unsupervised Domain Adaptation for 3D Human Pose Estimation. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 14609–14619. [Google Scholar] [CrossRef]
- Wang, Z.; Shin, D.; Fowlkes, C.C. Predicting Camera Viewpoint Improves Cross-Dataset Generalization for 3D Human Pose Estimation. In Proceedings of the Computer Vision—ECCV 2020 Workshops; Bartoli, A., Fusiello, A., Eds.; Springer: Cham, Switzerland, 2020; pp. 523–540. [Google Scholar]
- Cai, Y.; Zhang, W.; Wu, Y.; Jin, C. PoseIRM: Enhance 3D Human Pose Estimation on Unseen Camera Settings via Invariant Risk Minimization. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 2124–2133. [Google Scholar] [CrossRef]
- Zeng, A.; Sun, X.; Huang, F.; Liu, M.; Xu, Q.; Lin, S. SRNet: Improving Generalization in 3D Human Pose Estimation with a Split-and-Recombine Approach. In Proceedings of the Computer Vision—ECCV 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M., Eds.; Springer: Cham, Switzerland, 2020; pp. 507–523. [Google Scholar]
- Wang, Y.; Wang, Z.; Li, M.; Yan, H. 3D Human Pose Estimation with Two-step Mixed-Training Strategy. In Proceedings of the 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2024; pp. 3320–3329. [Google Scholar] [CrossRef]
- Lee, S.; Hwang, Y.; Lee, J.T. Learning 2D Human Poses for Better 3D Lifting via Multi-model 3D-Guidance. In Proceedings of the Computer Vision—ACCV 2024; Cho, M., Laptev, I., Tran, D., Yao, A., Zha, H., Eds.; Springer: Singapore, 2025; pp. 185–202. [Google Scholar]
- Taketsugu, H.; Ukita, N. Active Transfer Learning for Efficient Video-Specific Human Pose Estimation. In Proceedings of the 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2024; pp. 1869–1879. [Google Scholar] [CrossRef]
- Hu, S.; Sun, H.; Li, B.; Wei, D.; Li, W.; Lu, J. Fast Adaptation for Human Pose Estimation via Meta-Optimization. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 1792–1801. [Google Scholar] [CrossRef]
- Vosoughi, S.; Amer, M.A. Deep 3D Human Pose Estimation Under Partial Body Presence. In Proceedings of the 2018 25th IEEE International Conference on Image Processing (ICIP), Athens, Greece, 7–10 October 2018; pp. 569–573. [Google Scholar] [CrossRef]
- Cheng, Y.; Yang, B.; Wang, B.; Yan, W.; Tan, R.T. Occlusion-Aware Networks for 3D Human Pose Estimation in Video. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 723–732. [Google Scholar] [CrossRef]
- Cheng, Y.; Yang, B.; Wang, B.; Tan, R.T. 3D Human Pose Estimation Using Spatio-Temporal Networks with Explicit Occlusion Training. Proc. AAAI Conf. Artif. Intell. 2020, 34, 10631–10638. [Google Scholar] [CrossRef]
- Das, A.; Das, S.; Sistu, G.; Horgan, J.; Bhattacharya, U.; Jones, E.; Glavin, M.; Eising, C. Deep Multi-Task Networks For Occluded Pedestrian Pose Estimation. arXiv 2022, arXiv:2206.07510. [Google Scholar] [CrossRef]
- Hardy, P.; Kim, H. LInKs “Lifting Independent Keypoints”—Partial Pose Lifting for Occlusion Handling with Improved Accuracy in 2D-3D Human Pose Estimation. In Proceedings of the 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2024; pp. 3414–3423. [Google Scholar] [CrossRef]
- Zheng, H.; Li, H.; Dai, W.; Zheng, Z.; Li, C.; Zou, J.; Xiong, H. HiPART: Hierarchical Pose AutoRegressive Transformer for Occluded 3D Human Pose Estimation. In Proceedings of the 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 10–17 June 2025; pp. 16807–16817. [Google Scholar] [CrossRef]
- Zhang, Y.; Ji, P.; Wang, A.; Mei, J.; Kortylewski, A.; Yuille, A. 3D-Aware Neural Body Fitting for Occlusion Robust 3D Human Pose Estimation. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 9365–9376. [Google Scholar] [CrossRef]
- Sun, P.; Gu, K.; Wang, Y.; Yang, L.; Yao, A. Rethinking Visibility in Human Pose Estimation: Occluded Pose Reasoning via Transformers. In Proceedings of the 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2024; pp. 5891–5900. [Google Scholar] [CrossRef]
- Ning, G.; Liu, P.; Fan, X.; Zhang, C. A Top-Down Approach to Articulated Human Pose Estimation and Tracking. In Proceedings of the Computer Vision—ECCV 2018 Workshops, Munich, Germany, 8–14 September 2018; Proceedings, Part II; Springer: Berlin/Heidelberg, Germany, 2018; pp. 227–234. [Google Scholar] [CrossRef]
- Zhou, M.; Stoffl, L.; Mathis, M.W.; Mathis, A. Rethinking pose estimation in crowds: Overcoming the detection information bottleneck and ambiguity. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 14643–14653. [Google Scholar] [CrossRef]
- Cheng, Y.; Wang, B.; Yang, B.; Tan, R.T. Monocular 3D Multi-Person Pose Estimation by Integrating Top-Down and Bottom-Up Networks. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 7645–7655. [Google Scholar] [CrossRef]
- Zhao, S.; Liu, K.; Huang, Y.; Bao, Q.; Zeng, D.; Liu, W. DPIT: Dual-Pipeline Integrated Transformer for Human Pose Estimation. In Proceedings of the Artificial Intelligence: Second CAAI International Conference, CICAI 2022, Beijing, China, 27–28 August 2022; Revised Selected Papers, Part II; Springer: Berlin/Heidelberg, Germany, 2022; pp. 559–576. [Google Scholar] [CrossRef]
- Dabral, R.; Gundavarapu, N.B.; Mitra, R.; Sharma, A.; Ramakrishnan, G.; Jain, A. Multi-Person 3D Human Pose Estimation from Monocular Images. In Proceedings of the 2019 International Conference on 3D Vision (3DV), Quebec City, QC, Canada, 16–19 September 2019; pp. 405–414. [Google Scholar] [CrossRef]
- Ding, Y.; Deng, W.; Zheng, Y.; Liu, P.; Wang, M.; Cheng, X.; Bao, J.; Chen, D.; Zeng, M. I2R-Net: Intra- and Inter-Human Relation Network for Multi-Person Pose Estimation. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22; De Raedt, L., Ed.; International Joint Conferences on Artificial Intelligence Organization: Vienna, Austria, 2022; Volume 7, pp. 855–862, Main Track. [Google Scholar] [CrossRef]
- Qiu, Z.; Yang, Q.; Wang, J.; Fu, D. Dynamic Graph Reasoning for Multi-person 3D Pose Estimation. In Proceedings of the 30th ACM International Conference on Multimedia, New York, NY, USA, 10–14 October 2022; MM’22; pp. 3521–3529. [Google Scholar] [CrossRef]
- Luo, Y.; Ren, J.; Wang, Z.; Sun, W.; Pan, J.; Liu, J.; Pang, J.; Lin, L. LSTM Pose Machines. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5207–5215. [Google Scholar] [CrossRef]
- Liu, S.; Li, Y.; Hua, G. Human Pose Estimation in Video via Structured Space Learning and Halfway Temporal Evaluation. IEEE Trans. Circuits Syst. Video Technol. 2019, 29, 2029–2038. [Google Scholar] [CrossRef]
- Pavllo, D.; Feichtenhofer, C.; Grangier, D.; Auli, M. 3D Human Pose Estimation in Video With Temporal Convolutions and Semi-Supervised Training. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 7745–7754. [Google Scholar] [CrossRef]
- Li, Y.; Li, K.; Wang, X.; Xu, R.Y.D. Exploring temporal consistency for human pose estimation in videos. Pattern Recognit. 2020, 103, 107258. [Google Scholar] [CrossRef]
- Liu, R.; Shen, J.; Wang, H.; Chen, C.; Cheung, S.-c.; Asari, V. Attention Mechanism Exploits Temporal Contexts: Real-Time 3D Human Pose Reconstruction. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 5063–5072. [Google Scholar] [CrossRef]
- Lin, J.; Lee, G.H. Trajectory Space Factorization for Deep Video-Based 3D Human Pose Estimation. In Proceedings of the BMVC, Cardiff, UK, 9–12 September 2019. [Google Scholar]
- Zheng, C.; Zhu, S.; Mendieta, M.; Yang, T.; Chen, C.; Ding, Z. 3D Human Pose Estimation with Spatial and Temporal Transformers. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021. [Google Scholar]
- Li, W.; Liu, H.; Ding, R.; Liu, M.; Wang, P.; Yang, W. Exploiting Temporal Contexts With Strided Transformer for 3D Human Pose Estimation. IEEE Trans. Multimed. 2023, 25, 1282–1293. [Google Scholar] [CrossRef]
- Tang, Z.; Qiu, Z.; Hao, Y.; Hong, R.; Yao, T. 3D Human Pose Estimation with Spatio-Temporal Criss-Cross Attention. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 4790–4799. [Google Scholar] [CrossRef]
- Zhang, J.; Tu, Z.; Yang, J.; Chen, Y.; Yuan, J. MixSTE: Seq2seq Mixed Spatio-Temporal Encoder for 3D Human Pose Estimation in Video. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 13222–13232. [Google Scholar] [CrossRef]
- Hassanin, M.; Khamiss, A.; Bennamoun, M.; Boussaid, F.; Radwan, I. CrossFormer: Cross Spatio-Temporal Transformer for 3D Human Pose Estimation. arXiv 2022, arXiv:2203.13387. [Google Scholar] [CrossRef]
- Wei, M.; Xie, X.; Zhong, Y.; Shi, G. Learning Pyramid-Structured Long-Range Dependencies for 3D Human Pose Estimation. IEEE Trans. Multimed. 2025, 27, 4684–4697. [Google Scholar] [CrossRef]
- Qian, X.; Tang, Y.; Zhang, N.; Han, M.; Xiao, J.; Huang, M.C.; Lin, R.S. HSTFormer: Hierarchical Spatial-Temporal Transformers for 3D Human Pose Estimation. arXiv 2023, arXiv:2301.07322. [Google Scholar] [CrossRef]
- Chen, H.; He, J.Y.; Xiang, W.; Cheng, Z.Q.; Liu, W.; Liu, H.; Luo, B.; Geng, Y.; Xie, X. HDFormer: High-order Directed Transformer for 3D Human Pose Estimation. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI-23; Elkind, E., Ed.; ACM: New York, NY, USA, 2023; pp. 581–589, Main Track. [Google Scholar] [CrossRef]
- Zhai, K.; Nie, Q.; Ouyang, B.; Li, X.; Yang, S. HopFIR: Hop-wise GraphFormer with Intragroup Joint Refinement for 3D Human Pose Estimation. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 14939–14949. [Google Scholar] [CrossRef]
- Liu, H.; Cheng, Z.Q.; Xiang, W.; He, J.Y.; Luo, B.; Geng, Y.; Xie, X. Refined Temporal Pyramidal Compression-and-Amplification Transformer for 3D Human Pose Estimation. In Proceedings of the 2025 IEEE International Conference on Multimedia and Expo (ICME), Nantes, France, 30 June–4 July 2025; pp. 1–6. [Google Scholar] [CrossRef]
- Kang, H.; Wang, Y.; Liu, M.; Wu, D.; Liu, P.; Yang, W. Double-chain Constraints for 3D Human Pose Estimation in Images and Videos. arXiv 2023, arXiv:2308.05298. [Google Scholar] [CrossRef]
- Mehraban, S.; Adeli, V.; Taati, B. MotionAGFormer: Enhancing 3D human pose estimation with a transformer-GCNFormer network. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 6920–6930. [Google Scholar]
- Yu, B.X.; Zhang, Z.; Liu, Y.; Zhong, S.H.; Liu, Y.; Chen, C.W. GLA-GCN: Global-local Adaptive Graph Convolutional Network for 3D Human Pose Estimation from Monocular Video. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 8784–8795. [Google Scholar] [CrossRef]
- Peng, J.; Zhou, Y.; Mok, P. KTPFormer: Kinematics and trajectory prior knowledge-enhanced transformer for 3D human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 1123–1132. [Google Scholar]
- Li, C.; Liu, S.; Yao, L.; Zou, S. Video-based body geometric aware network for 3D human pose estimation. Optoelectron. Lett. 2022, 18, 313–320. [Google Scholar] [CrossRef]
- Li, W.; Liu, H.; Tang, H.; Wang, P.; Van Gool, L. MHFormer: Multi-Hypothesis Transformer for 3D Human Pose Estimation. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 13137–13146. [Google Scholar] [CrossRef]
- Liu, J.; Liu, M.; Liu, H.; Li, W. TCPFormer: Learning temporal correlation with implicit pose proxy for 3D human pose estimation. Proc. AAAI Conf. Artif. Intell. 2025, 39, 5478–5486. [Google Scholar] [CrossRef]
- Lutz, S.; Blythman, R.; Ghosal, K.; Moynihan, M.; Simms, C.; Smolic, A. Jointformer: Single-frame lifting transformer with error prediction and refinement for 3D human pose estimation. In Proceedings of the 2022 26th International Conference on Pattern Recognition (ICPR), Montréal, QC, Canada, 21–25 August 2022; pp. 1156–1163. [Google Scholar]
- Qiu, Z.; Yang, Q.; Wang, J.; Fu, D. IVT: An End-to-End Instance-guided Video Transformer for 3D Pose Estimation. In Proceedings of the 30th ACM International Conference on Multimedia, MM’22, New York, NY, USA, 10–14 October 2022; pp. 6174–6182. [Google Scholar] [CrossRef]
- Feng, R.; Chang, H.J.; Tse, T.H.E.; Kim, B.; Chang, Y.; Gao, Y. High-Resolution Spatiotemporal Modeling with Global-Local State Space Models for Video-Based Human Pose Estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Honolulu, HI, USA, 19–23 October 2025; pp. 8929–8938. [Google Scholar]
- Lu, Y.; Wang, J.; Gao, J.; Gong, R.; Cai, C.; Yap, K.H. A Structure-aware and Motion-adaptive Framework for 3D Human Pose Estimation with Mamba. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Honolulu, HI, USA, 19–23 October 2025; pp. 7958–7968. [Google Scholar]
- Dabral, R.; Mundhada, A.; Kusupati, U.; Afaque, S.; Sharma, A.; Jain, A. Learning 3D human pose from structure and motion. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 668–683. [Google Scholar]
- Gupta, V. Back to the future: Joint aware temporal deep learning 3D human pose estimation. arXiv 2020, arXiv:2002.11251. [Google Scholar] [CrossRef]
- Wang, G.; Zeng, H.; Wang, Z.; Liu, Z.; Wang, H. Motion projection consistency-based 3-D human pose estimation with virtual bones from monocular videos. IEEE Trans. Cogn. Dev. Syst. 2022, 15, 784–793. [Google Scholar] [CrossRef]
- Wang, J.; Yan, S.; Xiong, Y.; Lin, D. Motion guided 3D pose estimation from videos. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2020; pp. 764–780. [Google Scholar]
- Xu, J.; Yu, Z.; Ni, B.; Yang, J.; Yang, X.; Zhang, W. Deep kinematics analysis for monocular 3D human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 899–908. [Google Scholar]
- Jin, K.M.; Lim, B.S.; Lee, G.H.; Kang, T.K.; Lee, S.W. Kinematic-aware hierarchical attention network for human pose estimation in videos. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 5725–5734. [Google Scholar]
- Li, Z.; Xu, B.; Huang, H.; Lu, C.; Guo, Y. Deep two-stream video inference for human body pose and shape estimation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 430–439. [Google Scholar]
- Jeong, D.C.; Liu, H.; Salazar, S.; Jiang, J.; Kitts, C.A. SoloPose: One-Shot Kinematic 3D Human Pose Estimation with Video Data Augmentation. arXiv 2023, arXiv:2312.10195. [Google Scholar]
- Zhang, J.; Wang, Y.; Zhou, Z.; Luan, T.; Wang, Z.; Qiao, Y. Learning dynamical human-joint affinity for 3D pose estimation in videos. IEEE Trans. Image Process. 2021, 30, 7914–7925. [Google Scholar] [CrossRef] [PubMed]
- Qiu, H.; Wang, C.; Wang, J.; Wang, N.; Zeng, W. Cross view fusion for 3D human pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4342–4351. [Google Scholar]
- Zhang, Z.; Wang, C.; Qiu, W.; Qin, W.; Zeng, W. AdaFuse: Adaptive multiview fusion for accurate human pose estimation in the wild. Int. J. Comput. Vis. 2021, 129, 703–718. [Google Scholar] [CrossRef]
- Xie, R.; Wang, C.; Wang, Y. MetaFuse: A pre-trained fusion model for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 13686–13695. [Google Scholar]
- Moliner, O.; Huang, S.; Åström, K. Geometry-biased transformer for robust multi-view 3D human pose reconstruction. In Proceedings of the 2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition (FG), Istanbul, Turkey, 27–31 May 2024; pp. 1–8. [Google Scholar]
- Liao, Z.; Zhu, J.; Wang, C.; Hu, H.; Waslander, S.L. Multiple view geometry transformers for 3D human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 708–717. [Google Scholar]
- Remelli, E.; Han, S.; Honari, S.; Fua, P.; Wang, R. Lightweight multi-view 3D pose estimation through camera-disentangled representation. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6040–6049. [Google Scholar]
- Chharia, A.; Gou, W.; Dong, H. MV-SSM: Multi-View State Space Modeling for 3D Human Pose Estimation. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 10–17 June 2025; pp. 11590–11599. [Google Scholar]
- Luvizon, D.C.; Picard, D.; Tabia, H. Consensus-based optimization for 3D human pose estimation in camera coordinates. Int. J. Comput. Vis. 2022, 130, 869–882. [Google Scholar] [CrossRef]
- Davoodnia, V.; Ghorbani, S.; Carbonneau, M.A.; Messier, A.; Etemad, A. UPose3D: Uncertainty-aware 3D human pose estimation with cross-view and temporal cues. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2024; pp. 19–38. [Google Scholar]
- Jiang, B.; Hu, L.; Xia, S. Probabilistic triangulation for uncalibrated multi-view 3D human pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 14850–14860. [Google Scholar]
- Gordon, B.; Raab, S.; Azov, G.; Giryes, R.; Cohen-Or, D. FLEX: Extrinsic parameters-free multi-view 3D human motion reconstruction. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2022; pp. 176–196. [Google Scholar]
- Xu, Y.; Kitani, K. Multi-view multi-person 3D pose estimation with uncalibrated camera networks. In Proceedings of the BMVC, London, UK, 21–24 November 2022. [Google Scholar]
- Shuai, H.; Wu, L.; Liu, Q. Adaptive multi-view and temporal fusing transformer for 3D human pose estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 4122–4135. [Google Scholar] [CrossRef] [PubMed]
- Li, Y.J.; Xu, Y.; Khirodkar, R.; Park, J.; Kitani, K. Multi-person 3D pose estimation from multi-view uncalibrated depth cameras. arXiv 2024, arXiv:2401.15616. [Google Scholar]
- Chang, I.; Park, M.G.; Kim, J.; Yoon, J.H. Multi-view 3D human pose estimation with self-supervised learning. In Proceedings of the 2021 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), Jeju Island, Republic of Korea, 20–23 April 2021; pp. 255–257. [Google Scholar]
- Rodriguez-Criado, D.; Bachiller-Burgos, P.; Vogiatzis, G.; Manso, L.J. Multi-person 3D pose estimation from unlabelled data. Mach. Vis. Appl. 2024, 35, 46. [Google Scholar] [CrossRef]
- Wan, X.; Chen, Z.; Duan, B.; Zhao, X. Dual-diffusion for binocular 3D human pose estimation. Adv. Neural Inf. Process. Syst. 2024, 37, 78079–78103. [Google Scholar]
- Reddy, N.D.; Guigues, L.; Pishchulin, L.; Eledath, J.; Narasimhan, S.G. TesseTrack: End-to-end learnable multi-person articulated 3D pose tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 15190–15200. [Google Scholar]
- Zhang, Y.; Wang, C.; Wang, X.; Liu, W.; Zeng, W. VoxelTrack: Multi-person 3D human pose estimation and tracking in the wild. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 2613–2626. [Google Scholar] [CrossRef] [PubMed]
- Choudhury, R.; Kitani, K.M.; Jeni, L.A. Tempo: Efficient multi-view pose estimation, tracking, and forecasting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 4–6 October 2023; pp. 14750–14760. [Google Scholar]
- Zimmermann, C.; Welschehold, T.; Dornhege, C.; Burgard, W.; Brox, T. 3d human pose estimation in rgbd images for robotic task learning. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 21–25 May 2018; pp. 1986–1992. [Google Scholar]
- Zhang, B.; Xiao, Y.; Xiong, F.; Wu, C.; Cao, Z.; Liu, P.; Zhou, J.T. 3D human pose estimation with cross-modality training and multi-scale local refinement. Appl. Soft Comput. 2022, 122, 108950. [Google Scholar] [CrossRef]
- Guo, Y.; Li, Z.; Li, Z.; Du, X.; Quan, S.; Xu, Y. PoP-Net: Pose Over Parts Network for Multi-Person 3D Pose Estimation From a Depth Image. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2022; pp. 1240–1249. [Google Scholar]
- Martínez-González, A.; Villamizar, M.; Canévet, O.; Odobez, J.M. Residual pose: A decoupled approach for depth-based 3D human pose estimation. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020–24 January 2021; pp. 10313–10318. [Google Scholar]
- Szczuko, P. Deep neural networks for human pose estimation from a very low resolution depth image. Multimed. Tools Appl. 2019, 78, 29357–29377. [Google Scholar] [CrossRef]
- Aso, K.; Hwang, D.H.; Koike, H. Portable 3D human pose estimation for human-human interaction using a chest-mounted fisheye camera. In Proceedings of the Augmented Humans International Conference 2021, Rovaniemi, Finland, 22–24 February 2021; pp. 116–120. [Google Scholar]
- Zhang, Y.; You, S.; Karaoglu, S.; Gevers, T. Multi-person 3D pose estimation from a single image captured by a fisheye camera. Comput. Vis. Image Underst. 2022, 222, 103505. [Google Scholar] [CrossRef]
- Goyal, G.; Di Pietro, F.; Carissimi, N.; Glover, A.; Bartolozzi, C. Moveenet: Online high-frequency human pose estimation with an event camera. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 4024–4033. [Google Scholar]
- Lang, B.; Chuah, M.C. Event-Guided Video Transformer for End-to-End 3D Human Pose Estimation. In Proceedings of the 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Tucson, AZ, USA, 26 February–6 March 2025; pp. 5114–5124. [Google Scholar]
- Lang, B.; Chuah, M.C. Event-Guided Fusion-Mamba for Context-Aware 3D Human Pose Estimation. In Proceedings of the 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Tucson, AZ, USA, 26 February–6 March 2025; pp. 950–960. [Google Scholar]
- Koleini, F.; Saleem, M.U.; Wang, P.; Xue, H.; Helmy, A.; Fenwick, A. BioPose: Biomechanically-accurate 3D Pose Estimation from Monocular Videos. In Proceedings of the 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Tucson, AZ, USA, 26 February–6 March 2025; pp. 6330–6339. [Google Scholar]
- Yu, C.; Xiao, B.; Gao, C.; Yuan, L.; Zhang, L.; Sang, N.; Wang, J. Lite-hrnet: A lightweight high-resolution network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 10440–10450. [Google Scholar]
- Han, J.; Wang, Y. Greit-HRNet: Grouped Lightweight High-Resolution Network for Human Pose Estimation. In Proceedings of the Asian Conference on Computer Vision, Hanoi, Vietnam, 8–12 December 2024; pp. 3771–3787. [Google Scholar]
- Li, Q.; Zhang, Z.; Xiao, F.; Zhang, F.; Bhanu, B. Dite-HRNet: Dynamic lightweight high-resolution network for human pose estimation. arXiv 2022, arXiv:2204.10762. [Google Scholar]
- Zhang, F.; Zhu, X.; Ye, M. Fast human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3517–3526.
- Diaz-Arias, A.; Shin, D. ConvFormer: Parameter reduction in transformer models for 3D human pose estimation by leveraging dynamic multi-headed convolutional attention. Vis. Comput. 2024, 40, 2555–2569.
- Sun, Y.; Dougherty, A.W.; Zhang, Z.; Choi, Y.K.; Wu, C. Mixsynthformer: A transformer encoder-like structure with mixed synthetic self-attention for efficient human pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 4–6 October 2023; pp. 14884–14893.
- Jiang, T.; Lu, P.; Zhang, L.; Ma, N.; Han, R.; Lyu, C.; Li, Y.; Chen, K. Rtmpose: Real-time multi-person pose estimation based on mmpose. arXiv 2023, arXiv:2303.07399.
- Zeng, A.; Ju, X.; Yang, L.; Gao, R.; Zhu, X.; Dai, B.; Xu, Q. Deciwatch: A simple baseline for 10× efficient 2d and 3d pose estimation. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2022; pp. 607–624.
- Xu, Y.; Zhao, L.; Gong, C.; Li, G.; Wang, D.; Wang, N. DynPose: Largely Improving the Efficiency of Human Pose Estimation by a Simple Dynamic Framework. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 1160–1169.
- Zhang, Y.; Wang, Y.; Camps, O.; Sznaier, M. Key frame proposal network for efficient pose estimation in videos. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2020; pp. 609–625.
- Hwang, D.H.; Kim, S.; Monet, N.; Koike, H.; Bae, S. Lightweight 3d human pose estimation network training using teacher-student learning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; pp. 479–488.
- Bulat, A.; Tzimiropoulos, G.; Kossaifi, J.; Pantic, M. Improved training of binary networks for human pose estimation and image recognition. arXiv 2019, arXiv:1904.05868.
- Xu, L.; Guan, Y.; Jin, S.; Liu, W.; Qian, C.; Luo, P.; Ouyang, W.; Wang, X. Vipnas: Efficient video pose estimation via neural architecture search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 22–25 June 2021; pp. 16072–16081.
- Xu, L.; Jin, S.; Liu, W.; Qian, C.; Ouyang, W.; Luo, P.; Wang, X. Zoomnas: Searching for whole-body human pose estimation in the wild. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 5296–5313.
- Liu, H.; Liu, W.; Chi, Z.; Wang, Y.; Yu, Y.; Chen, J.; Tang, J. Fast human pose estimation in compressed videos. IEEE Trans. Multimed. 2022, 25, 1390–1400.
- Ma, H.; Wang, Z.; Chen, Y.; Kong, D.; Chen, L.; Liu, X.; Yan, X.; Tang, H.; Xie, X. Ppt: Token-pruned pose transformer for monocular and multi-view human pose estimation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 424–442.
- Li, W.; Liu, M.; Liu, H.; Wang, P.; Cai, J.; Sebe, N. Hourglass tokenizer for efficient transformer-based 3D human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 604–613.
- Pham, H.H.; Salmane, H.; Khoudour, L.; Crouzil, A.; Velastin, S.A.; Zegers, P. A unified deep framework for joint 3d pose estimation and action recognition from a single rgb camera. Sensors 2020, 20, 1825.
- Luvizon, D.C.; Picard, D.; Tabia, H. 2d/3d pose estimation and action recognition using multitask deep learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5137–5146.
- Luvizon, D.C.; Picard, D.; Tabia, H. Multi-task deep learning for real-time 3D human pose estimation and action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 2752–2764.
- Ahmad, N.; Khan, J.; Kim, J.Y.; Lee, Y. Joint human pose estimation and instance segmentation with PosePlusSeg. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022; Volume 36, pp. 69–76.
- Sárándi, I.; Hermans, A.; Leibe, B. Learning 3d human pose estimation from dozens of datasets using a geometry-aware autoencoder to bridge between skeleton formats. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–7 January 2023; pp. 2956–2966.
- Jeong, U.; Freer, J.; Baek, S.; Chang, H.J.; Kim, K.I. PoseBH: Prototypical Multi-Dataset Training Beyond Human Pose Estimation. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 12278–12288.
- Shan, W.; Liu, Z.; Zhang, X.; Wang, S.; Ma, S.; Gao, W. P-stmo: Pre-trained spatial temporal many-to-one model for 3d human pose estimation. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2022; pp. 461–478.
- Zhu, W.; Ma, X.; Liu, Z.; Liu, L.; Wu, W.; Wang, Y. Motionbert: A unified perspective on learning human motion representations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 4–6 October 2023; pp. 15085–15099.
- Wang, Y.; Wu, Y.; He, W.; Guo, X.; Zhu, F.; Bai, L.; Zhao, R.; Wu, J.; He, T.; Ouyang, W.; et al. Hulk: A universal knowledge translator for human-centric tasks. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 5672–5689.
- Xu, Y.; Zhang, J.; Zhang, Q.; Tao, D. Vitpose++: Vision transformer for generic body pose estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 46, 1212–1230.
- Ci, Y.; Wang, Y.; Chen, M.; Tang, S.; Bai, L.; Zhu, F.; Zhao, R.; Yu, F.; Qi, D.; Ouyang, W. Unihcp: A unified model for human-centric perceptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 17840–17852.
- Dabhi, M.; Jeni, L.A.; Lucey, S. 3d-lfm: Lifting foundation model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 10466–10475.
- Jiang, Z.; Chai, W.; Li, L.; Zhou, Z.; Yang, C.Y.; Hwang, J.N. Unihpe: Towards unified human pose estimation via contrastive learning. arXiv 2023, arXiv:2311.16477.
- Fang, H.S.; Li, J.; Tang, H.; Xu, C.; Zhu, H.; Xiu, Y.; Li, Y.L.; Lu, C. Alphapose: Whole-body regional multi-person pose estimation and tracking in real-time. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 7157–7173.
- Jiang, T.; Xie, X.; Li, Y. RTMW: Real-time multi-person 2D and 3D whole-body pose estimation. arXiv 2024, arXiv:2407.08634.
- Samet, N.; Akbas, E. HPRNet: Hierarchical point regression for whole-body human pose estimation. Image Vis. Comput. 2021, 115, 104285.
- Rey, R. Monocular 3D Human Pose Estimation. Master’s Thesis, KTH, School of Electrical Engineering and Computer Science (EECS), Stockholm, Sweden, 2023.
- Giulietti, N.; Todesca, D.; Carnevale, M.; Giberti, H. A Real-Time Human Pose Measurement System for Human-In-The-Loop Dynamic Simulators. IEEE Access 2025, 13, 24954–24969.
- Bridgeman, L.; Volino, M.; Guillemaut, J.Y.; Hilton, A. Multi-person 3d pose estimation and tracking in sports. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 15–20 June 2019.
- Jiang, J.H.; Xia, N. PCNet: A human pose compensation network based on incremental learning for sports actions estimation. Complex Intell. Syst. 2025, 11, 17.
- Baumgartner, T.; Klatt, S. Monocular 3d human pose estimation for sports broadcasts using partial sports field registration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 5109–5118.
- Huang, W.; Ni, Y.; Rezvani, A.; Jeong, S.; Chen, H.; Liu, Y.; Wen, F.; Imani, M. Recoverable anonymization for pose estimation: A privacy-enhancing approach. In Proceedings of the 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Tucson, AZ, USA, 26 February–6 March 2025; pp. 5239–5249.
- Akada, H.; Wang, J.; Golyanik, V.; Theobalt, C. Bring Your Rear Cameras for Egocentric 3D Human Pose Estimation. In Proceedings of the International Conference on Computer Vision (ICCV), Honolulu, HI, USA, 19–23 October 2025.
- Matsune, A.; Hu, S.; Li, G.; Wen, S.; Zhu, X.; Tan, Z. A geometry loss combination for 3d human pose estimation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 1–6 January 2024; pp. 3272–3281.
- Hsu, C.H.; Jang, J.S.R. Enhancing 3D Human Pose Estimation with Bone Length Adjustment. In Proceedings of the Asian Conference on Computer Vision, Hanoi, Vietnam, 8–12 December 2024; pp. 3723–3738.
- Joo, H.; Neverova, N.; Vedaldi, A. Exemplar fine-tuning for 3d human model fitting towards in-the-wild 3d human pose estimation. In Proceedings of the 2021 International Conference on 3D Vision (3DV), Virtual, 1–3 December 2021; pp. 42–52.
- Chang, J.Y.; Moon, G.; Lee, K.M. Poselifter: Absolute 3D human pose lifting network from a single noisy 2D human pose. arXiv 2019, arXiv:1910.12029.
- Kim, J.H.; Han, J.; Lee, S.W. PoseAnchor: Robust Root Position Estimation for 3D Human Pose Estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Honolulu, HI, USA, 19–23 October 2025; pp. 7079–7088.
- Hao, X.; Li, H. PersPose: 3D Human Pose Estimation with Perspective Encoding and Perspective Rotation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Honolulu, HI, USA, 19–23 October 2025; pp. 8110–8119.
- Zhan, Y.; Li, F.; Weng, R.; Choi, W. Ray3d: Ray-based 3d human pose estimation for monocular absolute 3d localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 21–24 June 2022; pp. 13116–13125.
- Wang, Z.; Chen, R.; Liu, M.; Dong, G.; Basu, A. SPGNet: Spatial projection guided 3D human pose estimation in low dimensional space. In Proceedings of the International Conference on Smart Multimedia; Springer: Berlin/Heidelberg, Germany, 2022; pp. 41–55.
- Lee, G.H.; Lee, S.W. Uncertainty-aware human mesh recovery from video by learning part-based 3d dynamics. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 12375–12384.
- Kan, Z.; Chen, S.; Zhang, C.; Tang, Y.; He, Z. Self-correctable and adaptable inference for generalizable human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 5537–5546.

| Structural Pillar | Paradigm/Theme | Key Concepts & Intellectual Evolution |
|---|---|---|
| I. Representation (Section 4) | Coordinates to Distributions | Direct Regression → Heatmaps → Integral Regression → Debiasing (DARK/UDP) |
| | Multi-Person Grouping | Part Affinity Fields (PAFs) → Geometric Fields → Associative Embeddings → Centers & Anchor Points |
| | The Third Dimension | Volumetric Heatmaps → Ordinal/Ranking Depth → Kinematic & Structural Encoding |
| II. Architecture (Section 5) | Spatial Context (CNNs) | Stacked Hourglass → Multi-Stage Refinement → High-Resolution (HRNet) → Hybrid Designs |
| | Global Context (Transformers) | CNN–Transformer Hybrids → Pure Transformers (ViT) → Pose-as-Sequence (Tokenization) |
| | Graph & Structure | Graph Convolutional Networks (GCNs) → Directed Graphs → Kinematic Topology Modeling |
| III. Ambiguity & Generalization (Section 6) | Uncertainty Modeling | Deterministic Prediction → Multi-Hypothesis Generation → Probabilistic Distributions → Diffusion Models |
| | Domain Gap (In-the-Wild) | Weak Supervision → Self-Supervised Learning → Adversarial Adaptation → Synthetic Data Generators |
| | Robustness | Occlusion Reasoning (Visibility Tokens) → Crowd Modeling (Relational Graphs) |
| IV. Contextual Extension (Section 7) | Temporal Dynamics | RNNs/LSTMs → Temporal Convolutions (TCNs) → Spatiotemporal Transformers → State-Space Models (Mamba) |
| | Multi-View Geometry | Algebraic Triangulation → Learnable Fusion → Epipolar Transformers → Uncalibrated/Parameter-Free |
| | Sensors & Modalities | RGB-D Fusion → Event Cameras (High Speed) → LiDAR/IMU Integration |
| V. Efficiency & Frontier (Section 8) | Efficiency & Deployment | Lightweight Backbones → Knowledge Distillation → Quantization → Token Pruning |
| | Unification | Multi-Task Learning → Unified Datasets → Foundation Models (Large-Scale Pre-training) |
| | Human-Centric Tasks | Whole-Body Estimation → Sports Biomechanics → Physics-Awareness → Privacy & Fairness |

| Survey | Year | Scope | Representation | Architecture | Ambiguity | Context | Applications/Frontier | Conceptual Evolution |
|---|---|---|---|---|---|---|---|---|
| Dang et al. [1] | 2019 | 2D | P | P | – | – | – | – |
| Chen et al. [5] | 2020 | 3D mono. | P | P | P | P | – | – |
| Ji et al. [4] | 2020 | 3D mono. | P | P | P | – | – | – |
| Munea et al. [2] | 2020 | 2D | P | P | – | – | – | – |
| Ben Gamra & Akhloufi [24] | 2021 | 2D + 3D | ✓ | ✓ | P | P | – | P |
| Liu et al. [6] | 2021 | 2D + 3D | P | ✓ | P | P | – | P |
| Song et al. [23] | 2021 | Action recog. | P | P | – | P | P | – |
| Wang et al. [9] | 2021 | 3D | P | ✓ | P | ✓ | – | – |
| Dubey & Dixit [12] | 2023 | 2D + 3D | P | ✓ | P | P | – | – |
| Lan et al. [14] | 2023 | 2D + 3D | P | P | – | – | ✓ | – |
| Azam & Desai [20] | 2024 | Egocentric | P | ✓ | P | ✓ | P | – |
| Algabri et al. [21] | 2024 | Head pose | P | P | – | – | P | – |
| Hou et al. [18] | 2024 | Tutorial | P | P | – | – | – | – |
| Neupane et al. [10] | 2024 | 3D | ✓ | ✓ | ✓ | ✓ | P | P |
| Liu et al. [11] | 2024 | 3D + Mesh | P | ✓ | P | ✓ | P | P |
| Suo et al. [22] | 2024 | Sports MoCap | P | P | – | P | ✓ | – |
| Gao et al. [15] | 2025 | HPE+downstr. | P | ✓ | P | ✓ | P | P |
| Guo et al. [7] | 2025 | 3D mono. | ✓ | ✓ | ✓ | P | – | P |
| Jayaswal et al. [17] | 2025 | Structural | P | P | – | P | – | – |
| Nogueira et al. [19] | 2025 | Multi-view | P | ✓ | P | ✓ | – | – |
| Salisu et al. [16] | 2025 | 3D arch. | P | ✓ | P | – | – | – |
| Sun et al. [13] | 2025 | 2D + 3D | ✓ | ✓ | ✓ | P | P | P |
| Udayan et al. [8] | 2025 | 3D mono. | ✓ | ✓ | P | P | P | P |
| Zhang & Shin [3] | 2025 | 2D | ✓ | ✓ | P | – | P | P |
| This review | 2026 | 2D + 3D + 4D | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |

| Dataset | Year | Challenge Introduced | Paradigm/Method Enabled | Main Related Sections |
|---|---|---|---|---|
| Human3.6M [25] | 2014 | Large-scale, accurate, lab-clean 3D ground truth | Supervised 2D-to-3D lifting; volumetric heatmaps; protocol-based MPJPE evaluation | Section 4.4 and Section 5 |
| MPII Human Pose [27] | 2014 | In-the-wild 2D variety with daily activities | Stacked hourglass and multi-stage refinement; PCKh evaluation | Section 5.1 |
| COCO Keypoints [28] | 2014 | Crowded scenes, occlusion, scale variation | Bottom-up grouping (PAFs, embeddings, centers); AP/OKS protocol | Section 4.3 and Section 5.1 |
| SURREAL [26] | 2017 | Sim-to-real for full-body shape and mesh | Synthetic pre-training; sim2real adaptation; mesh recovery at scale | Section 3 and Section 6.2 |
| MPI-INF-3DHP [32] | 2017 | Bridging lab and in-the-wild 3D | Studio + real mixing; PCK3D/AUC evaluation; weakly supervised 3D | Section 3 and Section 6.2 |
| 3DPW [33] | 2018 | Real-world 3D ground truth via IMU fusion | Domain adaptation; mesh recovery in the wild; PA-MPJPE as SOTA proxy | Section 3 and Section 6.2 |
| COCO-WholeBody [35] | 2020 | Whole-body keypoints (body+face+hands+feet) | Whole-body estimation; coarse-to-fine architectures; per-part APwb | Section 8.3 |
| SLOPER4D [38] | 2023 | Global 4D pose in scanned urban scenes (LiDAR+IMU) | Scene-aware HPE; global-coordinate evaluation; human-scene interaction | Section 7.2 and Section 8.3 |
| FreeMan [34] | 2024 | Uncalibrated multi-view at scale, smartphone capture | Uncalibrated/parameter-free multi-view; O-MPJPE; robustness to clutter | Section 7.2 |
| AthletePose3D [36] | 2025 | High-speed, non-periodic athletic motion | Sports-biomechanics fine-tuning; physics-aware post-processing | Section 8.3 and Section 9 |
| LDPose [37] | 2025 | Limb-difference inclusivity; variable skeletons | Skeleton-agnostic/variable-topology models; inclusive evaluation | Section 8.3 and Section 9 |
| MoviCam [39] | 2025 | 3D pose from a moving RGB camera; physics annotations | Physics-aware optimization; gravity/ground-contact constraints | Section 7.3 and Section 8.3 |
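
The 2D protocols named in the dataset table above (PCKh for MPII, OKS-based AP for COCO) can be sketched in a few lines. This is a minimal illustration rather than the official evaluation code: the function names `oks` and `pckh` are hypothetical, the per-keypoint `sigmas` are assumed to be supplied by the caller (COCO publishes its own constants), and `head_size` is assumed to be the per-person head-segment length used for normalization.

```python
import numpy as np

def oks(pred, gt, visible, area, sigmas):
    """Object Keypoint Similarity for one person (sketch).

    pred, gt : (K, 2) keypoint coordinates in pixels
    visible  : (K,) boolean mask of labeled ground-truth keypoints
    area     : object area in pixels^2 (the scale term s^2)
    sigmas   : (K,) per-keypoint falloff constants (assumed given)
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    visible = np.asarray(visible, dtype=bool)
    if not visible.any():
        return 0.0
    d2 = ((pred - gt) ** 2).sum(axis=-1)              # squared pixel distance
    k2 = (2.0 * np.asarray(sigmas, dtype=float)) ** 2  # (2 * sigma_i)^2
    e = d2 / (2.0 * k2 * (area + np.finfo(float).eps))
    return float(np.exp(-e)[visible].mean())

def pckh(pred, gt, head_size, alpha=0.5):
    """Fraction of keypoints within alpha * head_size of the ground truth."""
    dist = np.linalg.norm(np.asarray(pred, float) - np.asarray(gt, float), axis=-1)
    return float((dist <= alpha * head_size).mean())
```

AP is then obtained by thresholding OKS (COCO averages over thresholds from 0.50 to 0.95) and computing average precision over detections, a step omitted here.
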

| Method (Concept) | Reference | Dataset | Metric | Score |
|---|---|---|---|---|
| Paradigm: Direct Regression ↑ (Section 4.1) | ||||
| DeepPose | Toshev & Szegedy [40] | LSP (test) | PCP (mean) | 61.0% |
| Paradigm: 2D Heatmaps & Distributions ↑ (Section 4.2) | ||||
| Integral Regression | Sun et al. [41] | COCO | AP (OKS) | 67.8 |
| Heatmap Refinement | Jiang et al. [46] | COCO | AP (OKS) | 70.3 |
| SWAHR | Luo et al. [45] | COCO | AP (OKS) | 71.6 |
| DARK | Zhang et al. [48] | COCO | AP (OKS) | 76.2 |
| UDP | Huang et al. [47] | COCO | AP (OKS) | 76.5 |
| Anisotropic Gaussian | Liu et al. [50] | COCO | AP (OKS) | 79.1 |
| ProbPose | Purkrabek et al. [51] | CropCOCO | mAP (OKS) | 81.7 |
| Paradigm: Application to 3D Coordinates ↓ | ||||
| Cascaded 3D Regression | Li et al. [42] | Human3.6M | MPJPE | 50.9 mm |
| HEMlets Pose (3D heatmaps) | Zhou et al. [44] | Human3.6M | MPJPE | 39.9 mm |
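
The step from heatmaps to integral regression listed in the table above replaces a hard argmax, which is limited to the pixel grid, with the expected coordinate under a softmax-normalized heatmap. The sketch below illustrates the idea in NumPy; the function names and the temperature `beta` are illustrative assumptions, not taken from any cited implementation.

```python
import numpy as np

def soft_argmax_2d(heatmap, beta=1.0):
    """Expected (x, y) location under a softmax-normalized heatmap."""
    h, w = heatmap.shape
    logits = beta * heatmap.reshape(-1)
    logits = logits - logits.max()        # subtract max for numerical stability
    prob = np.exp(logits)
    prob /= prob.sum()
    prob = prob.reshape(h, w)
    x = (prob.sum(axis=0) * np.arange(w)).sum()   # marginal over rows -> E[x]
    y = (prob.sum(axis=1) * np.arange(h)).sum()   # marginal over cols -> E[y]
    return float(x), float(y)

def hard_argmax_2d(heatmap):
    """Location of the maximum response, quantized to the pixel grid."""
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return float(x), float(y)

# Usage: a Gaussian-shaped response centered between pixels.
ys, xs = np.mgrid[0:64, 0:48]
demo = -((xs - 10.3) ** 2 + (ys - 20.7) ** 2) / (2 * 2.0 ** 2)
print(hard_argmax_2d(demo), soft_argmax_2d(demo))
```

Because the expectation is differentiable, it can be trained end to end with a coordinate loss, whereas the hard argmax cannot recover the sub-pixel offset.
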

| Grouping Family | Method | Dataset | Metric | Score |
|---|---|---|---|---|
| Part A: 2D Grouping (Primarily COCO) ↑ | ||||
| Pioneering PAFs | OpenPose [29] | COCO | AP | 61.8 |
| Geometric Embeddings | PersonLab [31] | COCO | AP | 66.5 |
| Composite Fields | PifPaf [53] | COCO | AP | 66.7 |
| Poses as Objects | KAPAO [60] | COCO | AP | 70.3 |
| Dual Anat. Centers | Dual Centers [57] | COCO | AP | 71.0 |
| Decentralized Centers | DecenterNet [58] | COCO | AP | 71.2 |
| Anchor Centers | Double Anchor [56] | CrowdPose | AP | 66.9 (+1.5) |
| Part B: Extension to 3D Grouping ↓ | ||||
| 3D Maps (ORPM) | ORPM [54] | CMU Panoptic | MPJPE | 68.5 mm (−3.6) |
| Absolute 3D Maps | SMAP [55] | CMU Panoptic | MPJPE | 61.8 mm |

| Method (Concept) | Reference | Dataset | Metric ↓ | Score |
|---|---|---|---|---|
| Paradigm: Volumetric/Heatmaps | ||||
| MobileHumanPose | Choi et al. [65] | Human3.6M | MPJPE | 79.6 mm |
| Marginal Heatmaps | Nibali et al. [64] | Human3.6M | MPJPE | 55.4 mm |
| MeTRAbs (Metric Maps) | Sárándi et al. [66] | Human3.6M | MPJPE | 49.3 mm |
| Paradigm: Ordinal | ||||
| Ordinal Supervision | Pavlakos et al. [68] | Human3.6M | MPJPE | 56.2 mm |
| DRPose3D (Ranking) | Wang et al. [67] | Human3.6M | MPJPE | 42.9 mm |
| Paradigm: Kinematic/Structured | ||||
| Kinematic Preservation | Kundu et al. [69] | Human3.6M | MPJPE | 56.1 mm |
| Compositional Tokens | Geng et al. [71] | Human3.6M | MPJPE | 47.8 mm |
| Pose Grammar | Fang et al. [72] | Human3.6M | MPJPE | 45.7 mm |
| Bone Decomposition | Chen et al. [70] | Human3.6M | MPJPE | 35.0 mm |

| Method | Reference | Backbone | Dataset | Score |
|---|---|---|---|---|
| Primary Benchmark: COCO test-dev (Metric: AP ↑) | ||||
| HigherHRNet | Cheng et al. [82] | HigherHRNet-W48 | COCO | 70.5 |
| CPN | Chen et al. [74] | ResNet-Inception | COCO | 72.1 |
| Simple Baselines | Xiao et al. [78] | ResNet-152 | COCO | 73.7 |
| HRNet | Sun et al. [81] | HRNet-W48 | COCO | 75.5 |
| MIPNet | Khirodkar et al. [90] | - | COCO | 75.7 |
| MSPN | Li et al. [75] | 4× Res-50 | COCO | 76.1 |
| RSN | Cai et al. [87] | 4× RSN-50 | COCO | 78.6 |
| Primary Benchmark: MPII test (Metric: PCKh@0.5 ↑) | ||||
| DU-Net | Tang et al. [80] | 16× U-Nets | MPII | 91.2 |
| Spatial Context | Zhang et al. [77] | 8× Hourglass | MPII | 92.5 |
| CFA | Su et al. [76] | R-101 + 4× R-50 | MPII | 93.9 |

| Method | Reference | Dataset | Metric | Score ↓ |
|---|---|---|---|---|
| MöbiusGCN | Azizi et al. [97] | Human3.6M | MPJPE | 52.1 mm |
| Graph Stacked Hourglass | Xu & Takano [94] | Human3.6M | MPJPE | 51.9 mm |
| GraphMLP | Li et al. [99] | Human3.6M | MPJPE | 48.0 mm |
| Conditional Directed GCN | Hu et al. [96] | Human3.6M | MPJPE | 41.1 mm |
| Optimizing Network Structure | Ci et al. [95] | Human3.6M | MPJPE | 36.3 mm |

| Method | Reference | Dataset | Metric | Score ↑ |
|---|---|---|---|---|
| Heatmap-based transformers | ||||
| VTTransPose | Li et al. [103] | COCO | AP | 73.6 |
| TransPose | Yang et al. [100] | COCO | AP | 75.0 |
| HRFormer | Yuan et al. [101] | COCO | AP | 76.2 |
| Polarized Self-Attn. | Liu et al. [102] | COCO | AP | 79.4 |
| Regression-based transformers | ||||
| DirectPose | Tian et al. [109] | COCO | AP | 64.8 |
| Cascade Transformers | Li et al. [107] | COCO | AP | 72.1 |
| TFPose | Mao et al. [105] | COCO | AP | 72.2 |
| Group Pose | Liu et al. [110] | COCO | AP | 72.8 |
| Poseur | Mao et al. [108] | COCO | AP | 78.3 |
| PE-former | Panteleris & Argyros [106] | COCO (Val) | AP | 72.6 |

| Method | Reference | Dataset | MPJPE ↓ | PA-MPJPE ↓ |
|---|---|---|---|---|
| ContextPose | Ma et al. [112] | Human3.6M | 43.4 mm | 34.6 mm |
| Single 2D + Context | Zhao et al. [113] | Human3.6M | 39.8 mm | 32.7 mm |
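
The two columns above differ only in whether the prediction is rigidly aligned to the ground truth before the error is measured. A minimal NumPy sketch follows, assuming a single pose given as a (J, 3) array; the function names are illustrative, and the alignment is the standard similarity Procrustes solution via SVD rather than the code of any particular benchmark toolkit.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance over joints."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def pa_mpjpe(pred, gt):
    """MPJPE after aligning pred to gt with scale, rotation, and translation."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g                 # center both poses
    u, s, vt = np.linalg.svd(p.T @ g)             # 3x3 cross-covariance
    d = np.sign(np.linalg.det(vt.T @ u.T))        # guard against reflections
    dmat = np.diag([1.0, 1.0, d])
    rot = vt.T @ dmat @ u.T                       # rotation mapping p onto g
    scale = (s * np.diag(dmat)).sum() / (p ** 2).sum()
    aligned = scale * p @ rot.T + mu_g
    return mpjpe(aligned, gt)
```

A prediction that is correct up to a global rotation, scale, and offset therefore scores near zero under PA-MPJPE while still incurring a large MPJPE, which is why the two numbers are reported side by side.
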

| Method | Reference | Paradigm | MPJPE ↓ |
|---|---|---|---|
| Multi-Hypothesis and Uncertainty | |||
| Uncertainty Learning | Han et al. [116] | Aleatoric | 66.7 mm |
| CVAE + Ordinal Ranking | Sharma et al. [115] | CVAE | 58.0 mm |
| Mixture Density Network | Li & Lee [114] | MDN | 52.7 mm |
| ManiPose | Rommel et al. [117] | Manifold | 39.1 mm |
| Diffusion Models | |||
| ZeDO (Zero-shot) | Jiang et al. [125] | Optimization | 51.4 mm |
| Di2Pose (Occlusion) | Wang et al. [122] | Discrete Diff. | 49.2 mm |
| DiffPose | Holmquist & Wandt [118] | Diffusion | 43.3 mm |
| Hypothesis Aggregation | Shan et al. [120] | Diffusion | 39.5 mm |
| DiffPose | Gong et al. [119] | Diffusion | 36.9 mm |
| FinePOSE | Xu et al. [121] | Prompt-Diffusion | 31.9 mm |

| Method | Reference | Dataset | Metric | Score |
|---|---|---|---|---|
| Unsupervised (Geo-Aware) | Rhodin et al. [127] | Human3.6M | MPJPE | 131.7 mm |
| Weak Sup. (Multi-view) | Rhodin et al. [126] | Human3.6M | MPJPE | 66.8 mm |
| Adversarial Learning | Yang et al. [135] | Human3.6M | MPJPE | 58.6 mm |
| Generalizing (2D→3D) | Wang et al. [139] | Human3.6M | MPJPE | 37.6 mm |
| Weak Sup. (Multi-view) | Iskakov et al. [128] | Human3.6M | MPJPE | 20.8 mm |

| Method | Reference | Human3.6M (MPJPE ↓) | 3DPW (PA-MPJPE ↓) |
|---|---|---|---|
| 3D-Guidance (Multi-model) | Lee et al. [146] | 50.6 mm | - |
| CameraPose (Weak Sup.) | Yang et al. [129] | 38.87 mm | 63.26 mm |
| PoseAug (Augmentation) | Gong et al. [137] | 38.2 mm | 81.6 mm |
| PoseIRM (Invariant Learning) | Cai et al. [143] | 25.6 mm | - |

| Method (Concept) | Reference | Benchmark | Metric | Score |
|---|---|---|---|---|
| Part A: Occlusion Robustness ↓ | ||||
| Partial Body Regression | Vosoughi & Amer [149] | H3.6M (Truncated) | MPJPE | 177.8 mm (−154.6) |
| LInKs (Lift-then-Fill) | Hardy & Kim [153] | H3.6M (Occlusion) | N-MPJPE | 61.6 mm (−2.4) |
| HiPART (Auto-regressive) | Zheng et al. [154] | H3.6M-Occluded | MPJPE | 28.3 mm |
| Part B: Crowded Scenes | ||||
| 2D CrowdPose (Metric: AP ↑) | ||||
| DPIT (Hybrid Transformer) | Zhao et al. [160] | COCO (Test) | AP | 74.6 |
| I2R-Net (Relational) | Ding et al. [162] | CrowdPose | AP | 77.4 |
| BUCTD (Hybrid BU-TD) | Zhou et al. [158] | CrowdPose | AP | 78.5 |
| 3D MuPoTS-3D (Metric: 3DPCK ↑) | ||||
| Multi-Person 3D | Dabral et al. [161] | MuPoTS-3D | 3DPCK | 74.3 |
| GR-M3D (Dynamic Graph) | Qiu et al. [163] | MuPoTS-3D | 3DPCK | 84.6 |
| Hybrid Top-down/Bottom-up | Cheng et al. [159] | MuPoTS-3D | 3DPCK | 88.9 |

| Paradigm | Method | Reference | Dataset | Frames | MPJPE ↓ |
|---|---|---|---|---|---|
| Temporal Conv. | TCN (Baseline) | Pavllo et al. [166] | Human3.6M | 243 | 46.8 mm |
| Transformer | PoseFormer | Zheng et al. [170] | Human3.6M | 81 | 44.3 mm |
| Transformer | HDFormer | Chen et al. [177] | Human3.6M | 96 | 40.3 mm |
| Transformer | MixSTE | Zhang et al. [173] | Human3.6M | 243 | 39.8 mm |
| State-Space (SSM) | SAMA | Lu et al. [190] | Human3.6M | 351 | 36.5 mm |

| Method | Reference | Type | MPJPE ↓ |
|---|---|---|---|
| Calibrated Fusion | |||
| Cross View Fusion | Qiu et al. [200] | Learnable | 31.17 mm |
| Learnable Triangulation | Iskakov et al. [128] | Volumetric | 20.8 mm |
| AdaFuse | Zhang et al. [201] | Adaptive | 19.5 mm |
| Geometry-Biased Trans. | Moliner et al. [203] | Transformer | 14.2 mm |
| Uncalibrated/Parameter-Free | |||
| Auto-supervision | Chang et al. [214] | Self-Sup. | 76.96 mm |
| FLEX | Gordon et al. [210] | Invariant | 30.2 mm |
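
The calibrated methods in the table above build on classical algebraic triangulation, which the learnable variants extend with confidence weights or volumetric aggregation. The sketch below shows only the unweighted direct linear transform (DLT) baseline, assuming known 3×4 projection matrices; `triangulate_dlt` and `project` are illustrative names, not functions from the cited works.

```python
import numpy as np

def project(proj_mat, point_3d):
    """Project a 3D point with a 3x4 camera matrix; returns pixel (x, y)."""
    x = proj_mat @ np.append(point_3d, 1.0)
    return x[:2] / x[2]

def triangulate_dlt(proj_mats, points_2d):
    """Least-squares 3D point from two or more calibrated views (DLT).

    Each view contributes two linear constraints on the homogeneous point X:
        x * (P[2] @ X) - P[0] @ X = 0
        y * (P[2] @ X) - P[1] @ X = 0
    """
    rows = []
    for P, (x, y) in zip(proj_mats, points_2d):
        rows.append(x * P[2] - P[0])
        rows.append(y * P[2] - P[1])
    A = np.stack(rows)
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]                      # right singular vector of smallest singular value
    return X[:3] / X[3]

# Usage: two cameras with a 1 m horizontal baseline observing one joint.
K = np.array([[1000.0, 0.0, 320.0], [0.0, 1000.0, 240.0], [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
joint = np.array([0.2, -0.1, 4.0])
print(triangulate_dlt([P1, P2], [project(P1, joint), project(P2, joint)]))
```

Uncalibrated and parameter-free approaches in the second half of the table avoid this step precisely because the projection matrices are not available.
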

| Method (Concept) | Reference | Dataset | Metric | Score |
|---|---|---|---|---|
| Part A: Efficiency and Deployment | ||||
| DeciWatch (Sampling) | Zeng et al. [238] | Human3.6M | MPJPE ↓ | 52.8 mm |
| RTMPose (Real-time) | Jiang et al. [237] | COCO val | AP ↑ | 74.8 |
| DynPose (Dynamic) | Xu et al. [239] | COCO val | AP ↑ | 78.0 |
| Part B: Unification and Foundation Models | ||||
| MotionBERT (Pretrained) | Zhu et al. [255] | Human3.6M | MPJPE ↓ | 37.5 mm |
| UniHPE (Unified Modality) | Jiang et al. [260] | Human3.6M | MPJPE ↓ | 50.5 mm |
| UniHCP (Unified Task) | Ci et al. [258] | Human3.6M | MPJPE ↓ | 75.6 mm |
| ViTPose++ (Foundation) | Xu et al. [257] | COCO test-dev | AP ↑ | 81.1 |
| Part C: Human-Centric Frontier | ||||
| COCO-WholeBody | Jin et al. [35] | WholeBody | APwb ↑ | 54.1 |
| LDPose (Inclusivity) | Ying et al. [37] | LDPose | APLD ↑ | 78.4 |
| U-HMR (Mesh Recovery) | Lee & Lee [279] | 3DPW | MPJPE ↓ | 92.8 mm |

| Design Paradigm | When to Choose | Dominant Trade-Off | Reviewed in Section |
|---|---|---|---|
| Heatmap (CNN, e.g., HRNet) | Sub-pixel precision matters; compute not bottleneck; 2D single/multi-person; well-defined keypoints. | High accuracy, high memory/latency; resolution caps precision. | Section 4.2 and Section 5.1 |
| Direct regression (transformer/DETR) | Real-time multi-person; low-latency edge; tolerant to slightly lower precision. | Simpler pipeline, faster inference; harder optimization, less spatial calibration. | Section 4.1 and Section 5.2.2 |
| GCN/structured graph | 2D-to-3D lifting where anatomical prior strong; small skeletons; relational reasoning under occlusion. | Strong inductive bias on fixed topology; limited generalization to new skeletons. | Section 5.2.1 and Section 6.3.2 |
| Spatiotemporal transformer | Short to medium video windows (e.g., 81 frames); pose lifting from 2D sequences; global temporal attention. | Quadratic time/memory; struggles on very long windows. | Section 7.1.2 |
| State space model (Mamba/SSM) | Very long temporal windows (hundreds of frames); continuous tracking; multi-view fusion at scale. | Linear complexity, hardware-friendly; less mature; recent literature only. | Section 7.1.2 and Section 7.2 |
| Diffusion-based estimator | Calibrated uncertainty needed (robotics, retargeting); occlusion-heavy scenes; multi-hypothesis output. | Iterative inference slow; needs aggregation/selection step. | Section 6.1 |
| Event-camera/multi-modal fusion | High-speed, low-light, high-dynamic-range; sports; AR/VR. | Hardware availability; sparse/asynchronous data need specialized models. | Section 7.2 |
| Physics-/biomechanics-aware | Output drives a simulator, controller, or biomechanical analysis; global trajectories matter. | Optimization in the loop is slow; depends on accurate camera/scene geometry. | Section 7.3 and Section 8.3 |
| Foundation model + adapter | Generalist deployment (pose+mesh+action); limited labeled data; transfer to new domains. | Large model size; partially open weights; black-box failure modes. | Section 8.2 |
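
The quadratic-versus-linear trade-off in the two temporal rows above can be illustrated with a rough count. The following is a minimal sketch, not a profile of any reviewed model: it assumes one full temporal self-attention layer touches every pair of frames (T × T) while one linear-time scan performs one state update per frame (T), and it ignores feature width, number of layers, and hardware constants. The function names are hypothetical, and the 351-frame window is only an illustrative stand-in for "hundreds of frames".

```python
def attention_pair_count(num_frames: int) -> int:
    """Frame-to-frame interactions in one full temporal self-attention layer (T * T)."""
    return num_frames * num_frames


def scan_step_count(num_frames: int) -> int:
    """Sequential state updates in one linear-time scan over the window (T)."""
    return num_frames


def growth_factor(cost_fn, short_window: int, long_window: int) -> float:
    """How much the idealized cost grows when the window is lengthened."""
    return cost_fn(long_window) / cost_fn(short_window)


if __name__ == "__main__":
    short_window, long_window = 81, 351  # 81 frames as in the table; 351 is illustrative
    attn = growth_factor(attention_pair_count, short_window, long_window)
    scan = growth_factor(scan_step_count, short_window, long_window)
    print(f"attention cost grows {attn:.1f}x, scan cost grows {scan:.1f}x")
```

Under these assumptions the attention cost grows with the square of the window ratio while the scan cost grows with the ratio itself, which is why the table steers very long windows toward state space models.
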

| Method | Paradigm | Benchmark & Score | Strength | Weakness |
|---|---|---|---|---|
| RTMPose [237] | 2D regression, real-time | COCO val: AP 74.8 @ 90+ FPS | Sub-pixel via SimCC; real-time on CPU/GPU. | Single-person top-down; no temporal modeling. |
| DynPose [239] | Dynamic routing | COCO val: AP 78.0 | Skips easy frames; large efficiency gain. | Routing overhead on short clips. |
| ProbPose [51] | 2D probabilistic heatmap | CropCOCO: mAP 81.7 | Calibrated confidence; handles truncation. | Heavier head; mostly single-person. |
| MotionAGFormer [181] | GCN + transformer, 3D lifting | H3.6M (P1): MPJPE 38.4 mm | Local + global modeling; strong on H3.6M. | Coordinate-only input; depth-ambiguity ceiling. |
| TCPFormer [186] | Temporal transformer + proxy | H3.6M (P1): MPJPE 37.9 mm | Implicit proxy compresses long sequences. | Quadratic attention still binds at very long horizons. |
| SAMA [190] | Structure-aware SSM, video | H3.6M (P1): MPJPE 36.5 mm | Linear-time; 351-frame windows; topology-aware. | Recent; limited cross-dataset evaluation. |
| FinePOSE [121] | Prompt-conditioned diffusion | H3.6M (P1): MPJPE 31.9 mm | Lowest H3.6M (P1) MPJPE among the methods listed here; CLIP-conditioned. | Multi-step sampling; expensive at inference. |
| ZeDO [125] | Zero-shot diffusion + opt. | H3.6M (P1): MPJPE 51.4 mm | No 3D training data; zero-shot generalization. | Per-instance optimization; not real-time. |
| Di2Pose [122] | Discrete diffusion, occlusion | H3.6M: MPJPE 49.2 mm | Robust under heavy occlusion (mask/replace). | Discrete tokens limit precision. |
| HiPART [154] | Hierarchical autoregressive | H3.6M-Occluded: MPJPE 28.3 mm | Excellent under truncation/occlusion. | Autoregressive decoding is slow. |
| GR-M3D [163] | Dynamic graph, 3D multi-person | MuPoTS-3D: 3DPCK 84.6 | Robust crowd reasoning; per-person graph. | Top-down detector dependency. |
| PhysDynPose [39] | Physics-aware optimization | MoviCam: best Global MPJPE 183.7 mm | Gravity + ground contact enforced. | Simulation in the loop; slow. |
| BioPose [230] | Mesh + inverse kinematics | 3DPW: competitive PA-MPJPE 39.5 mm | Biomechanically valid output. | Requires careful camera calibration. |
| ViTPose++ [257] | Foundation, ViT-based | COCO test-dev: AP 79.4 | Scales with data; transfers across tasks. | Very large model; expensive to fine-tune. |
| Hulk [256] | Multi-task foundation | Multi-dataset: SOTA on several tasks | Unified human-centric perception. | Partial open release; benchmark heterogeneity. |
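
Most 3D scores in the table above are MPJPE values in millimetres. As a reminder of what that number measures, here is a minimal pure-Python sketch, assuming each pose is a list of (x, y, z) joints in millimetres and that joint 0 is the root (pelvis); the function names and the root index are illustrative assumptions, not taken from any benchmark toolkit. It covers plain and root-aligned MPJPE only. PA-MPJPE additionally applies a Procrustes (similarity) alignment before averaging, which is not implemented here.

```python
import math
from typing import List, Sequence, Tuple

Joint = Sequence[float]


def mpjpe(pred: Sequence[Joint], gt: Sequence[Joint]) -> float:
    """Mean per-joint position error: average Euclidean distance over joints."""
    if len(pred) != len(gt) or len(pred) == 0:
        raise ValueError("pred and gt must be non-empty and have the same number of joints")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


def root_center(joints: Sequence[Joint], root_index: int = 0) -> List[Tuple[float, ...]]:
    """Express every joint relative to the root joint (assumed to be index 0)."""
    root = joints[root_index]
    return [tuple(c - r for c, r in zip(joint, root)) for joint in joints]


def root_aligned_mpjpe(pred: Sequence[Joint], gt: Sequence[Joint], root_index: int = 0) -> float:
    """MPJPE after translating both poses so that their root joints coincide."""
    return mpjpe(root_center(pred, root_index), root_center(gt, root_index))


if __name__ == "__main__":
    gt = [(0.0, 0.0, 0.0), (0.0, 100.0, 0.0), (0.0, 200.0, 0.0)]
    # Same pose, shifted rigidly by (30, 40, 0) mm.
    pred = [(x + 30.0, y + 40.0, z) for x, y, z in gt]
    print(mpjpe(pred, gt))               # 50.0
    print(root_aligned_mpjpe(pred, gt))  # 0.0
```

The toy example shows why the alignment protocol matters when comparing rows: a pure global offset costs 50 mm under plain MPJPE but nothing after root alignment, so scores reported under different protocols or benchmarks are not directly comparable.
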

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Diallo, K.B.; Akhloufi, M.A. Deep Human Pose Estimation: A Conceptual Review of Paradigms, Progress, and Frontiers. Computers 2026, 15, 366. https://doi.org/10.3390/computers15060366

