Deep Human Pose Estimation: A Conceptual Review of Paradigms, Progress, and Frontiers

Diallo, Kassim B.; Akhloufi, Moulay A.

doi:10.3390/computers15060366


by Kassim B. Diallo and Moulay A. Akhloufi *

Perception, Robotics and Intelligent Machines Research Group (PRIME), Department of Computer Science, Université de Moncton, Moncton, NB E1A 3E9, Canada

* Author to whom correspondence should be addressed.

Computers 2026, 15(6), 366; https://doi.org/10.3390/computers15060366

Submission received: 6 April 2026 / Revised: 28 May 2026 / Accepted: 30 May 2026 / Published: 4 June 2026

(This article belongs to the Special Issue Multimodal Pattern Recognition of Social Signals in HCI (Third Edition))


Abstract

Human pose estimation is a major problem in computer vision: it transforms an input image into a hierarchical representation of the human skeleton for applications such as virtual/augmented reality and human–machine interaction. Research in this field expanded rapidly between 2018 and 2025, and traditional taxonomies such as 2D versus 3D or top-down versus bottom-up are no longer sufficient to capture how its ideas have evolved. To address this gap, we propose a conceptual review of pose estimation that focuses on the intellectual evolution of methods and architectures rather than on a flat classification of papers. We divide recent advances into five structural pillars: Representation, which traces the evolution from pixel coordinate regression to heatmaps and probabilistic representations; Architecture, which analyzes the transition from multi-stage CNNs to transformers and state space models (SSMs); Ambiguity and Generalization, which analyzes how self-supervised, uncertainty-aware, and diffusion models address 3D depth ambiguity, occlusion, and domain gaps by modeling multiple plausible poses and reducing dependence on fully supervised in-the-wild 3D labels; Context Extension, which covers temporal dynamics, multi-view fusion, and additional sensors; and Applications, which links algorithms to efficiency, privacy, and foundation models. By examining these pillars in depth, we provide a unified view of the research paradigms that define human pose estimation and identify open problems and candidate solutions for pose estimation and human-centered tasks.

Keywords:

human pose estimation; deep learning; 3D pose estimation; transformers; diffusion models; state space models; foundation models; domain generalization

1. Introduction

Human pose estimation (HPE) in computer vision seeks to localize the configuration of body joints in images or video; however, this remains a challenging problem. Its significance extends far beyond algorithmic performance, as reliable pose estimation is essential for critical applications in sports biomechanics, clinical gait analysis, immersive extended reality (XR), human–robot collaboration, and large-scale video understanding. In these domains, pose serves as a compact and interpretable interface bridging raw pixels and semantic reasoning.

The period from 2018 to 2025 has experienced dramatic acceleration in HPE research. Deep variants of convolutional neural networks (CNNs) have matured into highly accurate 2D estimators, while volumetric and lifting strategies have rendered large-scale 3D estimation from monocular RGB feasible. More recently, the landscape has been reshaped by transformers, state space models (SSMs), and diffusion-based generative frameworks. Driven by large-scale benchmarks and expanding computational resources, reported state-of-the-art results on major HPE benchmarks have changed rapidly during this period.

In this rapidly evolving context, navigating the literature has become a significant challenge. Classical taxonomies such as 2D versus 3D, top-down versus bottom-up, and monocular versus multi-view remain useful but are no longer sufficient to capture the field’s conceptual progression. These categories describe what a method does, but fail to explain why specific paradigms have emerged and later been replaced. For instance, the transition from coordinate regression to heatmaps, and later to distribution-aware regression and diffusion, reflects a fundamental evolution in modeling spatial ambiguity and uncertainty.

This article offers a complementary perspective. Rather than providing a standard catalog of methods or tracing the evolution of each category pair separately, we propose a conceptual literature review tracing the intellectual genealogy of HPE from 2018 to 2025. We organize recent advances along five conceptual axes, as summarized in Table 1:

Representation: Tracing the shift from direct coordinates to heatmaps, volumetric encodings, implicit functions, and probabilistic distributions.
Architecture: Covering the architectural evolution from multi-stage CNNs to graph convolutional networks (GCNs), transformers, and linear-complexity SSMs.
Ambiguity and Generalization: Examining strategies for addressing depth ambiguity, occlusion, and domain gaps through self-supervision, uncertainty modeling, and synthetic data.
Contextual Extension: Encompassing temporal dynamics, multi-view geometry, and multi-sensor fusion (RGB+IMU, LiDAR, event cameras).
Applications and Frontiers: Linking algorithms to downstream tasks (biomechanics, XR) and constraints such as efficiency, privacy, and fairness.

Our objective is to provide a unified and pose-centric view of 2D, 3D, and emerging 4D/scene-aware tasks. We articulate major paradigm shifts, couple quantitative comparisons with critical analysis of metrics and reproducibility, and connect architectural trends to real-world deployment constraints.

1.1. Survey Methodology

Our review covers deep learning models applied to human pose estimation, focusing on work published between 2018 and 2025. We prioritize peer-reviewed publications from major conferences and journals (CVPR, ICCV, ECCV, NeurIPS, SIGGRAPH, IEEE TPAMI) and influential preprints (e.g., on arXiv) that have motivated subsequent research or report strong results on popular datasets.

Our selection process is selective rather than exhaustive. We combine keyword searches and citation chains to identify representative methods that illustrate key conceptual changes. We favor work evaluated on reference benchmarks, such as COCO, Human3.6M, MPI-INF-3DHP, and 3DPW, to ensure a reliable and consistent comparison.

1.2. Relevance to Image and Video Processing

Although often discussed in isolation, HPE is deeply connected with broader image and video processing challenges, and advances in the domain are generally adapted from solutions developed in neighboring areas. Pose estimators must operate on visual streams subject to compression artifacts, motion blur, and sensor noise. Furthermore, real-world deployment imposes strict constraints on latency and throughput.

In this survey, we highlight these relationships. We discuss architectures explicitly designed for efficiency (e.g., quantization, lightweight backbones) and examine methods that improve robustness under degraded input conditions. By framing HPE as a component of end-to-end video processing pipelines, we aim to provide guidance for practitioners who must design systems that perform reliably outside controlled laboratory environments.

Organization. The remainder of this paper is organized as follows: Section 2 positions our work relative to existing reviews, Section 3 details key datasets and metrics, and Sections 4–8 develop our five-axis taxonomy, followed by a critical discussion in Section 9 and future directions in Section 10.

2. Related Work

Since the emergence of deep learning, numerous studies have focused on human pose estimation. Here, we classify the relevant and recent reviews and explain how our conceptual approach differs from them.

2.1. Studies on 2D Human Pose Estimation

Early deep learning surveys on 2D pose estimation were provided by Dang et al. [1], who established the taxonomy of top-down versus bottom-up pipelines, and Munea et al. [2], who detailed learning strategies and evaluation protocols. More recently, Zhang and Shin [3] have updated this perspective with modern architectures and performance analyses. While comprehensive for 2D localization, these works do not address the transition to 3D or temporal modeling.

2.2. Studies on 3D HPE and Mesh Recovery

A second category targets the 3D domain. Early surveys by Ji et al. [4] and Chen et al. [5] focused on the ill-posed nature of monocular estimation, classifying methods by reconstruction strategy (direct regression vs. lifting). Liu et al. [6] adopted a “2D–3D” perspective, linking the two domains. A recent review by Guo et al. [7] provides comprehensive coverage of diffusion models and state-space models (SSMs) for 3D HPE, while Udayan et al. [8] mention generative paradigms as emerging trends.

Larger 3D studies such as those by Wang et al. [9] and Neupane et al. [10] extended the scope to multi-person and multi-view configurations. Liu et al. [11] focused on human mesh recovery (HMR), bridging keypoint estimation and surface modeling.

2.3. Systematic, Structural, and System-Level Studies

A third set of studies takes a global perspective. Dubey and Dixit [12] and Sun et al. [13] both analyzed 2D and 3D techniques jointly. Lan et al. [14] provided a comprehensive survey of vision-based HPE covering both 2D and 3D methods along with datasets, evaluation metrics, and a dedicated section on applications. Gao et al. [15] positioned HPE within the ecosystem of upstream (object detection, keypoint detection) and downstream (action recognition, gait recognition) tasks. Salisu et al. [16] reviewed specific 3D model architectures. On the structural side, Jayaswal et al. [17] proposed a taxonomy based on human body modeling types (kinematic, planar, volumetric), while Hou et al. [18] have offered a concise tutorial-style introduction.

2.4. Specialized and Application-Oriented Studies

Finally, some surveys have targeted specific modalities or applications. Nogueira et al. [19] focused exclusively on markerless multi-view 3D HPE. Azam and Desai [20] reviewed egocentric 3D pose estimation. Other specialized surveys have addressed head pose [21], sports motion capture [22], and action recognition [23].

2.5. Scope and Contributions of This Review

While existing surveys offer rich catalogs of methods, several important gaps remain:

Unified Taxonomy: Most studies treat 2D, 3D, and multi-view pose as distinct problems. Instead, we propose a unified taxonomy based on representation, architecture, and context that applies consistently across all dimensions.
Focus on Paradigm Shifts (2018–2025): Rather than enumerating methods, we trace the intellectual lineage from CNNs to transformers to SSMs and from regression to heatmaps to distributions to diffusion, explaining why these shifts occurred.
Critical Analysis: We combine quantitative reporting with critical discussion of metric limitations, data saturation, and reproducibility.
Frontier Directions: We synthesize emerging trends in foundation models, efficiency, and human-centered AI (privacy, fairness) in order to define the next steps for research.

To concretely illustrate the gap addressed by this review, Table 2 maps the principal recent surveys onto the five conceptual pillars of our taxonomy along with the emerging cross-cutting themes of human-centric AI (ethics, fairness, privacy) and foundation models. The pattern is clear: existing reviews are predominantly dimensional, focusing exclusively on 2D [1,2,3] or 3D [4,5,6,7,8,9,10,11] pose estimation. Others focus on specific configurations (multi-view [19]), egocentric vision [20], or application domains such as sports motion capture [22] and action recognition [23]. Still others are structural [17,18] or system-level [12,13,14,15,16]. However, to the best of our knowledge, few existing surveys trace the intellectual lineage of representations, architectures, ambiguity modeling, contextual extension, and human-centric frontiers as a single coupled narrative. This integrative perspective is timely, as recent HPE research is no longer driven only by keypoint accuracy but by uncertainty-aware estimation, state space architectures, foundation models, efficiency, privacy, fairness, and inclusive evaluation. Prior surveys often cover some of these elements, but usually in isolation rather than as mutually connected components of the field’s conceptual evolution.

Conceptual evolution refers to whether a survey explicitly traces the intellectual lineage of paradigm shifts such as the progression from coordinate regression to heatmaps to probabilistic distributions, from CNNs to transformers to SSMs, from deterministic to diffusion-based estimation, or from in-the-wild to scene-aware reasoning, rather than merely cataloging different methods. This final column most clearly distinguishes the present review from prior work.

3. Datasets and Benchmarks: Engines of Innovation in Human Pose Estimation

The progress made in human pose estimation is inseparable from the evolution of its benchmarks. Each new dataset has not only been used to measure performance but has actively defined new conceptual problems, taking the field from controlled laboratory environments to real-world complexity. In this section, we present a brief overview of the most influential datasets grouped according to their role in this progression.

3.1. Laboratory Motion Capture Datasets

Early HPE methods based on deep learning were largely dominated by controlled laboratory datasets, which established the modern task of supervised 3D “lifting” by providing accurate motion capture (MoCap) annotations.

3.1.1. Human3.6M (H3.6M)

Human3.6M [25] is the canonical reference for supervised 3D pose estimation. It consists of 3.6 million video frames captured by four calibrated cameras, representing 11 actors performing 15 scripted actions in a motion capture studio. Its cleanliness and precise 3D annotations make it ideal for developing and comparing 3D HPE methods, but also contribute to a pronounced domain gap with respect to real-world images.

3.1.2. SURREAL (Synthetic Humans for REAL Tasks)

SURREAL [26] is a large-scale synthetic dataset generated by rendering parametric 3D human models animated using MoCap data onto 2D image backgrounds. It has played a key role in sim-to-real approaches and full-body shape learning, providing more than 6 million synthetic frames with pose, depth, segmentation, optical flow, and surface-normal annotations.

3.2. In-the-Wild 2D Benchmarks

The next wave of progress was stimulated by real 2D benchmarks, which forced models to become robust to occlusion, clutter, and diverse appearances.

3.2.1. MPII Human Pose

MPII [27] was one of the first large-scale 2D benchmarks for real poses. The images are extracted from YouTube videos and cover a wide variety of daily activities. The dataset contains around 25,000 images with over 40,000 people annotated (16 keypoints), and has long served as a reference for 2D HPE.

3.2.2. COCO Keypoints (COCO)

The COCO [28] keypoint benchmark is one of the most influential 2D datasets for human pose estimation. It contains complex everyday scenes and includes over 200,000 labeled images with approximately 250,000 person instances annotated with 17 keypoints. By exposing the limitations of treating keypoints as independent local detections, COCO’s crowded and cluttered scenes have helped to motivate major architectural innovations in multi-person pose estimation. In such scenes, visible joints must be assigned to the correct person despite occlusion, nearby people, and background clutter. This has encouraged methods to incorporate stronger context and grouping mechanisms, including part affinity fields for body-part association, associative embeddings for joint detection and grouping, and bottom-up part-based models such as PersonLab [29,30,31].

3.3. In-the-Wild 3D and Bridging the Domain Gap

To measure the extent to which models trained on laboratory data can be generalized to the real world, a series of real-world or semi-controlled 3D benchmarks have emerged.

3.3.1. MPI-INF-3DHP

MPI-INF-3DHP [32] is a large-scale 3D benchmark that mixes studio recordings obtained through a markerless MoCap system with real scenes. It serves as a bridge between controlled laboratory environments and real-world images, with over 1.3 million frames and various viewpoints and activities.

3.3.2. 3DPW (3D Poses in the Wild)

3DPW [33] marks a decisive milestone in 3D evaluation because it provides accurate 3D human pose annotations for challenging real-world videos captured with a moving handheld camera. The dataset was introduced as an in-the-wild benchmark built by combining a single portable RGB camera with body-worn inertial measurement units (IMUs), followed by an optimization procedure that jointly estimates body pose, camera motion, and IMU-related drift. This process enables frame-level reconstruction of SMPL body parameters. 3DPW contains more than 51,000 frames across 60 video sequences captured in indoor and outdoor environments, including everyday activities, natural clothing, varying illumination, partial occlusions, and camera motion.

Taken together, four properties explain why 3DPW became highly influential for evaluating real-world monocular 3D pose and mesh recovery:

Real-world capture. Indoor and outdoor scenes with everyday clothing, natural lighting, partial occlusions, and a freely moving camera reproduce many of the conditions under which models trained mainly on controlled laboratory data tend to degrade.
Accurate 3D reference poses. By fusing video and IMU information, 3DPW provides accurate 3D reference poses in unconstrained environments. The original work validated the reconstruction method on TotalCapture and reported an accuracy of 26 mm, making the annotations sufficiently accurate for benchmarking in-the-wild 3D HPE.
SMPL-aligned annotations. Ground truth is provided through SMPL body model parameters, allowing the same dataset to support evaluation of 3D skeletons using metrics such as MPJPE and PA-MPJPE as well as 3D body meshes using surface-based metrics such as PVE or MPVE.
Temporal video supervision. Because annotations are temporally consistent across video frames, 3DPW is particularly useful for evaluating temporal models, including TCN-, transformer-, and SSM-based approaches, under realistic camera motion and appearance variation.

Compared with earlier benchmarks, 3DPW combines several properties that previously were rarely available together: in-the-wild capture, moving-camera video, IMU-assisted 3D reference poses, SMPL body parameters, and temporal sequence annotations. Human3.6M provides accurate laboratory MoCap but lacks comparable in-the-wild variability; MPI-INF-3DHP partly bridges controlled and real-world settings but is less focused on moving-camera outdoor capture; MuPoTS-3D is useful for multi-person outdoor evaluation but is smaller in scale; and synthetic datasets such as SURREAL remain affected by sim-to-real domain gaps. As a result, 3DPW has become one of the main benchmarks for assessing whether monocular 3D pose and mesh recovery methods generalize beyond controlled laboratory settings.

3.3.3. FreeMan

FreeMan [34] is a very large multi-view dataset designed explicitly to evaluate models in uncontrolled real-world conditions. Handheld smartphones are used as the capture devices. It contains around 11 million frames and introduces the O-MPJPE metric for object-centered 3D error, encouraging methods that are robust to camera motion, clutter, and calibration noise.

3.4. Holistic, Application-Oriented, and Scene-Aware Benchmarks

More recent datasets have shifted the focus from isolated skeletal localization to richer human-centric tasks, including whole-body pose, inclusivity, sports biomechanics, and scene-aware 4D pose.

3.4.1. COCO-WholeBody

COCO-WholeBody [35] is an annotation extension for COCO that does not introduce any new images, instead enriching existing person instances with 133 keypoints (body, feet, face, and hands). It has stimulated research into whole-body pose estimation, in which the aim is to jointly and accurately localize landmarks across the whole body.

3.4.2. AthletePose3D

AthletePose3D [36] exposes the failure modes of standard models by evaluating them on high-speed, high-acceleration athletic movements that are underrepresented in conventional datasets. Thus, the dataset highlights limitations in generalization, 3D lifting accuracy, and kinematic estimation, especially for complex sports movements involving rapid velocity changes and demanding biomechanical coordination. The original paper reports that models trained only on existing datasets perform poorly on athletic motions, while fine-tuning on AthletePose3D reduces error by over 69%.

3.4.3. LDPose

LDPose [37] directly challenges the fixed-skeleton or complete-limb assumption that underlies many standard HPE pipelines. Conventional datasets and models generally assume that all canonical body keypoints exist; in contrast, LDPose includes individuals with limb deficiencies and introduces residual-limb endpoint annotations. The task requires models not only to localize visible keypoints but also to determine whether downstream joints are anatomically absent and should be marked as non-existent. This makes LDPose fundamentally different from ordinary occlusion handling, since the missing joints reflect real anatomical variation rather than temporary visual obstruction.

3.4.4. SLOPER4D

SLOPER4D [38] is a next-generation benchmark focused on scene-aware 4D pose. Using LiDAR and IMU, it captures the global trajectories of people moving in 3D scanned urban environments, enabling evaluation of human-scene interactions and global 3D coherence over time. It includes more than 100K LiDAR frames, 300K video frames, and 500K IMU-based motion frames together with reconstructed scene point clouds and aligned global human motion annotations.

3.4.5. MoviCam

MoviCam [39] is the first non-synthetic dataset targeting 3D pose estimation from a moving RGB camera. It provides camera trajectories, scene geometry, and real human motion, making it particularly relevant for physics-aware models such as PhysDynPose. The dataset comprises around 22,000 images spread over seven sequences.

3.5. Evaluation Metrics and Quantitative Protocols

The datasets reviewed above are associated with a set of evaluation metrics that have become standard for human pose estimation. In this subsection we briefly recall the most commonly used measures for 2D and 3D HPE, then clarify how we use them in the quantitative tables presented throughout the survey.

3.5.1. 2D Pose Metrics: PCK, PCKh, and AP/OKS

Early deep learning-based 2D HPE works predominantly use the percentage of correct keypoints (PCK) measure. Given a set of ground-truth joint locations $\{p_{k}\}$ and predictions $\{\hat{p}_{k}\}$, PCK counts a keypoint as correct if the normalized distance between prediction and ground truth is below a threshold:

$$\mathrm{PCK}(\alpha) = \frac{1}{K} \sum_{k=1}^{K} \mathbb{1}\left(\lVert \hat{p}_{k} - p_{k} \rVert_{2} \leq \alpha L\right), \tag{1}$$

where $K$ is the number of annotated joints, $\alpha$ is a fixed tolerance (e.g., 0.2 or 0.5), and $L$ is a normalization length (often the torso length or diagonal of the person bounding box). The MPII benchmark popularized PCKh, where $L$ is the head segment length and results are typically reported as PCKh@0.5.
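As a minimal sketch of Equation (1), the snippet below computes PCK for one person with NumPy. The function name `pck`, the toy joint coordinates, and the 20-pixel head length are illustrative assumptions and do not come from any benchmark toolkit.

```python
import numpy as np

def pck(pred, gt, norm_length, alpha=0.5):
    """Fraction of joints whose error is at most alpha * norm_length (Eq. 1).

    pred, gt: arrays of shape (K, 2) with predicted / ground-truth joints.
    norm_length: normalization length L (head segment length for PCKh).
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    dist = np.linalg.norm(pred - gt, axis=-1)  # per-joint Euclidean error
    return float(np.mean(dist <= alpha * norm_length))

# Toy example (hypothetical values): 4 joints and a head length of 20 px,
# so the PCKh@0.5 acceptance radius is 10 px.
gt = np.array([[100.0, 100.0], [120.0, 150.0], [80.0, 150.0], [100.0, 200.0]])
offsets = np.array([[3.0, 4.0], [0.0, 9.0], [12.0, 0.0], [30.0, 40.0]])
pred = gt + offsets  # per-joint errors of 5, 9, 12, and 50 px
pckh = pck(pred, gt, norm_length=20.0, alpha=0.5)
print(pckh)  # two of the four joints fall inside the 10 px radius
```

In practice the score is averaged over all annotated joints in the test set, and only the choice of `norm_length` distinguishes PCK from PCKh.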

On COCO and related benchmarks, the dominant metric is average precision (AP), computed using object keypoint similarity (OKS). For a single person instance, OKS is defined as

$$\mathrm{OKS} = \frac{\sum_{k} \exp\left(-\frac{\lVert \hat{p}_{k} - p_{k} \rVert_{2}^{2}}{2 s^{2} \kappa_{k}^{2}}\right) \mathbb{1}(v_{k} > 0)}{\sum_{k} \mathbb{1}(v_{k} > 0)}, \tag{2}$$

where $s$ denotes the person scale, typically derived from the ground-truth person area, $\kappa_{k}$ is a per-keypoint falloff constant, and $v_{k}$ is a visibility flag. Intuitively, OKS acts as a keypoint analogue of intersection-over-union (IoU) for bounding boxes. In object detection, IoU measures the degree of spatial overlap between a predicted bounding box and a ground-truth bounding box, and detections are considered correct when this overlap exceeds a given threshold. For pose estimation, keypoints do not define an overlap area in the same way; instead, OKS measures the spatial agreement between a predicted pose and a ground-truth pose by comparing corresponding keypoint locations. Each annotated keypoint contributes a Gaussian-like similarity term based on its squared Euclidean localization error, so a perfectly aligned keypoint contributes a value of one, while increasingly mislocalized keypoints contribute smaller values. The normalization by the person scale $s$ makes the score scale-aware, so the same pixel error is penalized more strongly for small persons than for large persons. The keypoint-specific constant $\kappa_{k}$ further adjusts the tolerance for each joint, reflecting that some anatomical landmarks are harder to localize precisely than others. Thus, OKS summarizes pose similarity as an averaged and scale-normalized keypoint matching score in $[0, 1]$, which can be thresholded analogously to IoU when computing AP. COCO reports AP as the mean precision over a range of OKS thresholds (typically from 0.50 to 0.95 in steps of 0.05) along with AP at fixed thresholds (e.g., AP⁵⁰, AP⁷⁵).
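The following sketch evaluates Equation (2) for a single person instance. It is an illustration under stated assumptions rather than the official COCO evaluation code: the function name `oks`, the toy coordinates, and the `kappa` values are hypothetical, and `scale` stands for $s$ directly.

```python
import numpy as np

def oks(pred, gt, visibility, scale, kappa):
    """Object keypoint similarity for one person instance (Eq. 2).

    pred, gt:    (K, 2) predicted / ground-truth keypoints in pixels.
    visibility:  (K,) flags v_k; keypoints with v_k == 0 are ignored.
    scale:       person scale s.
    kappa:       (K,) per-keypoint falloff constants (hypothetical values here).
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    kappa = np.asarray(kappa, dtype=float)
    labeled = np.asarray(visibility) > 0
    if not labeled.any():
        return 0.0  # no annotated keypoints to compare against
    d2 = np.sum((pred - gt) ** 2, axis=-1)            # squared pixel error
    sim = np.exp(-d2 / (2.0 * scale**2 * kappa**2))   # Gaussian-like similarity
    return float(sim[labeled].mean())

# Toy instance with 3 keypoints; the third is unlabeled (v = 0) and is ignored
# even though its prediction is far off.
gt = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
pred = gt + np.array([[1.0, 0.0], [0.0, 0.0], [50.0, 50.0]])
vis = np.array([2, 1, 0])
kappa = np.array([0.1, 0.1, 0.1])

score_large = oks(pred, gt, vis, scale=10.0, kappa=kappa)
score_small = oks(pred, gt, vis, scale=5.0, kappa=kappa)
print(score_large, score_small)  # the same 1 px error costs more at small scale
```

Comparing `score_large` with `score_small` makes the scale-awareness concrete: the pixel errors are identical, but the smaller person receives the lower similarity.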

For whole-body benchmarks such as COCO-WholeBody, OKS is extended to a larger set of keypoints (body, face, hands, feet) with different $\kappa_{k}$ values per joint group. Results are often broken down by body part (e.g., AP^body, AP^hand, AP^face) in addition to a global score.

3.5.2. 3D Pose Metrics: MPJPE and Variants

In 3D HPE, the most widely used metric is the mean per-joint position error (MPJPE) in millimeters. Given $N$ pose samples, each with $K$ joints, MPJPE is defined as

$$\mathrm{MPJPE} = \frac{1}{N K} \sum_{n=1}^{N} \sum_{k=1}^{K} \lVert \hat{P}_{n,k} - P_{n,k} \rVert_{2}, \tag{3}$$

where $P_{n,k}, \hat{P}_{n,k} \in \mathbb{R}^{3}$ are the ground-truth and predicted 3D coordinates for joint $k$ in sample $n$. On Human3.6M, MPJPE is usually computed after placing the root joint of the skeleton at the origin, and results are reported under standard protocols (e.g., Protocol #1 and Protocol #2 differ in train/test splits and preprocessing).

There are several common variants of MPJPE:

PA-MPJPE (Procrustes-Aligned MPJPE) computes MPJPE after a Procrustes (similarity) alignment, i.e., rotation, translation, and uniform scaling, between predicted and ground-truth poses. It therefore largely reflects errors in relative pose configuration rather than in global position, orientation, or scale.
N-MPJPE (Normalized MPJPE) applies only a scale alignment before calculating the error, which factors out depth-related scale errors while preserving overall orientation.
PCK3D and AUC respectively measure the percentage of joints within a given 3D distance threshold and the area under the PCK3D curve as the threshold varies. These metrics are less sensitive to outliers than MPJPE, and are popular on benchmarks such as MPI-INF-3DHP.
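To make the difference between these variants concrete, the sketch below implements MPJPE (Equation (3)), N-MPJPE, and PA-MPJPE with NumPy. It assumes per-pose alignment and a least-squares scale factor for N-MPJPE; the function names and the synthetic test poses are illustrative, and individual benchmarks may differ in details such as root-centering.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error for arrays of shape (N, K, 3)."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def n_mpjpe(pred, gt):
    """MPJPE after a per-pose least-squares scale alignment (no rotation)."""
    num = np.sum(pred * gt, axis=(1, 2), keepdims=True)
    den = np.sum(pred * pred, axis=(1, 2), keepdims=True)
    return mpjpe((num / den) * pred, gt)

def procrustes_align(pred, gt):
    """Similarity-align one (K, 3) pose to its ground truth."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    P, G = pred - mu_p, gt - mu_g
    U, S, Vt = np.linalg.svd(P.T @ G)          # SVD of the 3x3 cross-covariance
    d = np.sign(np.linalg.det(U @ Vt))         # -1 if the best fit is a reflection
    D = np.diag([1.0, 1.0, d])
    R = U @ D @ Vt                             # proper rotation (row-vector form)
    s = (S * np.diag(D)).sum() / (P ** 2).sum()  # optimal uniform scale
    return s * P @ R + mu_g

def pa_mpjpe(pred, gt):
    """MPJPE after per-pose Procrustes alignment."""
    aligned = np.stack([procrustes_align(p, g) for p, g in zip(pred, gt)])
    return mpjpe(aligned, gt)

# Synthetic check: the prediction is the ground truth after a rotation about z,
# a scaling by 0.5, and a translation.
rng = np.random.default_rng(0)
gt = rng.normal(size=(2, 17, 3))
c, s_ = np.cos(0.7), np.sin(0.7)
Rz = np.array([[c, -s_, 0.0], [s_, c, 0.0], [0.0, 0.0, 1.0]])
pred = 0.5 * gt @ Rz + np.array([1.0, 2.0, 3.0])

print(mpjpe(pred, gt))          # large: global pose differs
print(pa_mpjpe(pred, gt))       # ~0: identical up to a similarity transform
print(n_mpjpe(2.0 * gt, gt))    # ~0: identical up to scale only
```

The synthetic prediction shows why the three numbers should be read together: a pose that is perfect up to a similarity transform scores near zero under PA-MPJPE while its plain MPJPE remains large.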

Recently introduced multi-view and scene-aware datasets (e.g., FreeMan, SLOPER4D, MoviCam) also consider global 3D consistency. They may report root-relative MPJPE alongside global MPJPE in world coordinates, or object-centric variants such as O-MPJPE that normalize by scene or object scale. These measures emphasize robustness to camera motion, calibration noise, and errors in global trajectory.

3.5.3. How We Use These Metrics in This Survey

In the quantitative tables that conclude each conceptual axis (Sections 4–8), we use the standard metric for the dataset rather than assigning a single overall score. Specifically:

PCKh@0.5 on MPII.
AP according to OKS, with AP⁵⁰/AP⁷⁵ when these values are specified, on COCO, CrowdPose, and OCHuman.
${AP}_{wb}$ on COCO-WholeBody, with breakdowns by body/feet/face/hands when provided by the original paper.
MPJPE under Protocol #1 and PA-MPJPE under Protocol #2 on Human3.6M.
PCK3D and AUC on MPI-INF-3DHP.
PA-MPJPE on 3DPW, optionally with MPJPE and MPVE/PVE when mesh reconstruction is specified.
Dataset-specific global or object-centered variants (global MPJPE, O-MPJPE) on FreeMan, SLOPER4D, and MoviCam.

In each table, methods are grouped according to the paradigm or sub-paradigm that they instantiate, and the values are drawn from the original publications or widely cited standardized re-evaluations. The protocol is specified in the table caption when it deviates from the dataset’s default.

The reader should view these tables as indications of trends within a single paradigm, for example the monotonic improvement in MPJPE as 2D-to-3D lifting evolves from pure coordinate sequences to visually conditioned transformers, rather than as a cross-dataset ranking. Since scores based on OKS, PCK, and MPJPE measure fundamentally different quantities (a scale-normalized keypoint similarity versus a length-normalized pixel distance versus absolute millimeters in 3D), no single metric is meaningful across datasets, and we deliberately avoid such comparisons.

3.6. Datasets as Drivers of Paradigm Shifts

To make explicit the role of data as an innovation engine, Table 3 traces the principal paradigm or representational shift enabled by each major dataset reviewed in this section. The pattern is one of co-evolution: each new dataset introduced a challenge that the prevailing methods could not handle, then the community responded with a new representation, architecture, or training regime that is now standard. Human3.6M standardized supervised 2D-to-3D lifting and protocol-based MPJPE evaluation; MPII drove the early multi-stage hourglass architectures; COCO’s crowded scenes forced the adoption of bottom-up grouping methods (PAFs, associative embeddings) and the modern AP/OKS evaluation protocol; 3DPW exposed the laboratory-to-wild domain gap and motivated many of the self-supervised, weakly supervised, domain adaptation, and synthetic augmentation methods discussed in Section 6.2; FreeMan and SLOPER4D moved the goalposts from root-relative to global-coordinate accuracy and from skeleton-only to scene-aware 4D; AthletePose3D opened up sports-biomechanics-specific evaluation; and LDPose directly challenged the fixed-skeleton assumption that underlies many standard HPE pipelines. Thus, to read the dataset list in this order is in effect to read the conceptual history of the field.

A second complementary view of this evolution is temporal. The benchmarks listed above arrive in roughly three waves, namely lab-MoCap (2014–2017), in-the-wild 2D and 3D (2014–2020), and scene-aware/inclusive/application-specific (2023–2025), with each wave driving a corresponding shift in the dominant paradigm over the following two to three years (see Table 3).

4. Pose Representation: From Coordinates to Distributions

4.1. The Direct-Regression Paradigm

Coordinate regression is the most straightforward approach and served as the foundation of early deep learning-based pose estimation. Given an input image, a neural network is trained to directly predict a vector of $(x, y)$ coordinates, one pair for every body joint. Toshev and Szegedy [40] pioneered this approach with DeepPose, using a convolutional neural network (CNN) to regress joint positions. Their method uses a multi-stage refinement strategy in which subsequent stages of the network receive zoomed-in images centered on previous predictions in order to refine accuracy.

Although innovative, direct coordinate regression faces fundamental limitations that explain why the community moved away from it. CNNs are built around translation-equivariant convolutions combined with hierarchical, increasingly coarse spatial pooling, so their features are aligned to image locations, not to arbitrary scalar outputs. Asking such a network to compress a feature map into a single $(x, y)$ value per joint via fully connected layers discards exactly the spatial structure that the convolutional stack just built, and forces the global head to learn the highly nonlinear many-to-one mapping from pixels to coordinates from scratch.

Three problems appear. First, the loss landscape is sensitive: a small change in the input, such as a slight occlusion of the wrist, a clothing change, or a different camera viewpoint, can shift the regressed coordinate by tens of pixels, producing large L2 gradients that destabilize training. Second, the model cannot represent uncertainty; there is no way to say ‘the elbow is either here or there’ with a single scalar pair, so ambiguous configurations collapse onto a blurry average that is correct for no actual sample. Third, multimodality cannot be expressed; when a joint is plausibly in two places (left/right confusion under occlusion, or the depth flip in monocular 3D), regression must pick one mode and be penalized for the other, which forces the network to interpolate between them in latent space. The next dominating paradigm, heatmaps (illustrated in Figure 1), was motivated by these very difficulties.
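The collapse onto a "blurry average" can be checked numerically. The following is a minimal sketch, assuming a made-up case in which the same input is labeled with an elbow at one of two equally likely positions; the coordinates and the helper name `l2_optimal_point` are illustrative and do not come from any benchmark or cited method.

```python
import numpy as np

def l2_optimal_point(targets):
    """Return the single prediction that minimizes mean squared error."""
    # The minimizer of sum_i ||p - t_i||^2 is the arithmetic mean of the t_i.
    return np.asarray(targets, dtype=float).mean(axis=0)

# Hypothetical ambiguous labels: the elbow sits at (40, 80) or at (120, 80).
modes = np.array([[40.0, 80.0], [120.0, 80.0]])
targets = np.repeat(modes, 50, axis=0)  # 50 samples per mode

pred = l2_optimal_point(targets)
dist_to_modes = np.linalg.norm(modes - pred, axis=1)
print("L2-optimal prediction:", pred)
print("distance to each mode:", dist_to_modes)
```

Under these assumptions the L2-optimal answer is the midpoint of the two modes, a location where the joint never actually appears, which is the failure that a spatial probability map avoids by keeping both peaks.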

Heatmaps mitigate these issues by preserving spatial distributions; however, standard argmax decoding is non-differentiable and collapses the distribution to a single location. Therefore, the shift from regression to heatmaps was not a stylistic preference but a structural realignment of the output representation with the geometry of the model.

4.2. The Heatmap Paradigm: Robustness Through Spatial Probability

To get around the drawbacks of direct regression, the community converged on a probabilistic heatmap representation. Instead of regressing a single coordinate, the network predicts a probability map of size $K \times H \times W$, where for each joint $k$ the map $H_{k} \in \mathbb{R}^{H \times W}$ represents the probability that the joint is at position $(x, y)$. This transforms regression into a dense prediction task, which aligns better with CNN architectures. However, this paradigm introduced a “gap” between training (heatmap prediction) and inference (coordinate extraction), leading to a series of innovations.

The Differentiability Problem and Integral Solutions. Because the standard argmax method for extracting coordinates is not differentiable, gradients cannot backpropagate from the final coordinates to the network. Sun et al. [41] suggested integral regression as a solution to this issue. By normalizing the heatmap with a softmax and computing the expectation of grid positions, the process becomes a completely differentiable weighted sum (soft-argmax). This method has been widely used for 2D pose [42,43] and has been expanded to 3D voxel representations [44] to differentiably estimate depth, making it fundamental in the field.
Improving Heatmap Precision and Debiasing. Beyond differentiability, the quality of the target heatmap (the label) has become a major research topic. Standard approaches generate targets using a Gaussian kernel of fixed size. Luo et al. [45] argued that this method was not optimal, and proposed scale-adaptive heatmap regression (SAHR) to dynamically adjust kernel size according to the scale of the individual as well as a weight-adaptive heatmap regression (WAHR) loss for hard joints. Jiang et al. [46] similarly proposed a heatmap refinement method that adjusts Gaussian coverage using geometric priors. Quantization bias is a more subtle but omnipresent problem identified by Huang et al. [47]. Heatmaps are generated and stored on a discrete grid with a resolution (typically $H / 4 \times W / 4$) that is much coarser than the original image. During target encoding, the continuous ground-truth joint coordinate $(x^{*}, y^{*})$ is rounded to the nearest integer cell; during decoding, the predicted argmax (or soft-argmax) is then upscaled back to image space. Both steps introduce a systematic asymmetric error: rounding is biased toward the cell center rather than uniformly distributed, the upscaling factor compounds sub-pixel offsets, and standard data augmentation transforms (flip, rotation, scale) are computed in pixel space and then re-discretized, accumulating the bias at every epoch. Across a typical training pipeline this manifests as a persistent offset on the order of half a pixel to one pixel on the predicted heatmap peak, which corresponds to a non-negligible drop in OKS-based AP at high thresholds (AP₇₅ and above), where the OKS tolerance is small enough for half a pixel to matter.
Unbiased data processing (UDP) addresses the problem theoretically rather than empirically: it (i) treats coordinates as continuous quantities throughout the pipeline, (ii) re-derives the affine transformations used by augmentation in the continuous domain so that no intermediate rounding occurs, and (iii) replaces the standard Gaussian target on an integer grid with a continuous Gaussian, the center of which is the exact floating-point coordinate. Combined with the distribution-aware decoding and Taylor-expansion peak refinement popularized by DARK (Zhang et al. [48]), UDP yields a measurable AP gain (+2 to +3 points on COCO for comparable backbones) at zero additional inference cost. This clearly demonstrates that representation choices, rather than architecture alone, can drive substantial improvements in 2D pose estimation. Recent work continues to refine this representation. According to Gu et al. [49], heatmap confidence scores are frequently poorly calibrated; as an alternative, they proposed Calibrated ConfidenceNet (CCNet). Liu et al. [50] integrated anatomical cues through anisotropic Gaussian coding by stretching the kernel along the direction of the bone. Finally, Purkrabek and Matas [51] tackled the problem of out-of-frame joints, explicitly handling occlusion and truncation by predicting both a calibrated probability map and a discrete probability of existence.
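The interplay between target encoding and decoding described above can be illustrated with a small sketch. It is not an implementation of integral regression, UDP, or DARK; it assumes an idealized network whose output equals the log-Gaussian target exactly, and the grid size, `sigma`, and the joint coordinate are arbitrary illustrative values.

```python
import numpy as np

def gaussian_logits(center_xy, size, sigma=2.0):
    """Unnormalized log-Gaussian scores on a size x size grid (float center)."""
    ys, xs = np.mgrid[0:size, 0:size]
    cx, cy = center_xy
    return -((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2)

def argmax_decode(scores):
    """Integer (x, y) of the peak; this step is not differentiable."""
    y, x = np.unravel_index(np.argmax(scores), scores.shape)
    return np.array([x, y], dtype=float)

def soft_argmax_decode(scores):
    """Softmax-normalize the scores, then take the expected grid position."""
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    ys, xs = np.mgrid[0:scores.shape[0], 0:scores.shape[1]]
    return np.array([(probs * xs).sum(), (probs * ys).sum()])

true_xy = np.array([20.3, 30.7])                        # hypothetical joint, heatmap units
rounded = gaussian_logits(np.round(true_xy), size=64)   # target centered on a rounded cell
continuous = gaussian_logits(true_xy, size=64)          # target centered on the exact float

err_rounded = np.abs(argmax_decode(rounded) - true_xy)
err_soft = np.abs(soft_argmax_decode(continuous) - true_xy)
print("rounded target + argmax error:", err_rounded)
print("continuous target + soft-argmax error:", err_soft)
```

In this idealized setting the rounded target combined with argmax decoding leaves a fixed sub-pixel error, whereas the continuous target combined with the expectation recovers the floating-point coordinate. Real networks output imperfect heatmaps, so the gap in practice is smaller and depends on the decoder.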

Table 4 summarizes the evolution of coordinate representations from direct regression to probability distributions.

4.3. Encoding Structure for Multi-Person Grouping (Bottom-Up Methods)

While heatmaps address joint localization, multi-person estimation introduces a second problem: association. Bottom-up approaches first detect all candidate joints in an image, then group them into person instances. This subsection reviews representations designed specifically for this matching challenge.

Part Affinity Fields. OpenPose by Cao et al. [29] made bottom-up pose estimation a mainstream paradigm by introducing part affinity fields (PAFs), two-dimensional vector fields that encode the location and orientation of limbs. Instead of grouping joints by spatial proximity alone, OpenPose assigns a connection score to each candidate limb and uses these scores in bipartite matching to assemble full skeletons. For a candidate start joint $d_{j_{1}}$ and candidate end joint $d_{j_{2}}$, the connection confidence is computed by line-integrating the predicted PAF $L_{c}$ along the straight segment between them:

E = \int_{0}^{1} L_{c}(p(u)) \cdot \frac{d_{j_{2}} - d_{j_{1}}}{\lVert d_{j_{2}} - d_{j_{1}} \rVert} \, du, \qquad p(u) = (1 - u) \, d_{j_{1}} + u \, d_{j_{2}},

where $p(u)$ parameterizes the segment between the two candidate joints. In practice, the integral is approximated by uniformly sampling $K$ points along the segment, typically $K = 10$, then averaging the dot products between the sampled PAF vectors and the unit direction $(d_{j_{2}} - d_{j_{1}}) / \lVert d_{j_{2}} - d_{j_{1}} \rVert$. Therefore, a high score requires the sampled locations to lie on a limb of type $c$ and the predicted local orientations to agree with the proposed start-to-end direction.
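The sampled approximation of the line integral can be sketched as follows. This is an illustrative NumPy version, not OpenPose's code: the function name `paf_connection_score`, the nearest-pixel lookup, and the synthetic field containing a single horizontal limb are all assumptions made for the example.

```python
import numpy as np

def paf_connection_score(paf, d1, d2, num_samples=10):
    """Approximate the PAF line integral between candidate joints d1 and d2.

    paf: array of shape (2, H, W) with the (x, y) components of one limb field.
    d1, d2: (x, y) candidate joint positions in heatmap coordinates.
    """
    d1 = np.asarray(d1, dtype=float)
    d2 = np.asarray(d2, dtype=float)
    length = np.linalg.norm(d2 - d1)
    if length < 1e-8:
        return 0.0
    direction = (d2 - d1) / length
    # Uniformly sample p(u) = (1 - u) d1 + u d2 for u in [0, 1].
    us = np.linspace(0.0, 1.0, num_samples)
    points = (1.0 - us)[:, None] * d1 + us[:, None] * d2
    # Nearest-pixel lookup (an assumption; interpolation is also possible).
    xs = np.clip(np.rint(points[:, 0]).astype(int), 0, paf.shape[2] - 1)
    ys = np.clip(np.rint(points[:, 1]).astype(int), 0, paf.shape[1] - 1)
    vectors = paf[:, ys, xs].T  # shape (num_samples, 2)
    return float(np.mean(vectors @ direction))

# Synthetic field: one horizontal limb along row y = 20, pointing in +x.
paf = np.zeros((2, 64, 64))
paf[0, 20, 10:41] = 1.0

score_true = paf_connection_score(paf, (10, 20), (40, 20))
score_reversed = paf_connection_score(paf, (40, 20), (10, 20))
score_wrong = paf_connection_score(paf, (10, 20), (40, 50))
print(score_true, score_reversed, score_wrong)
```

In this toy field, the correctly oriented pair receives the maximum score, the reversed pair is penalized because the field points against the proposed direction, and the unrelated pair scores near zero because most samples fall off the limb.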

This formulation gives the matching step a principled geometric meaning: it evaluates whether a continuous, correctly oriented limb connects two candidate joints rather than relying on Euclidean proximity alone. This makes PAFs robust to crowding, crossed limbs, and unusual poses where unrelated joints may appear close in the image. Osokin [52] later proposed a lightweight OpenPose variant that reduced computational cost by replacing the VGG backbone with MobileNet while using dilated convolutions to preserve the effective receptive field. The same line-integral principle also inspired the composite fields of PifPaf [53] and the kinematic-graph traversal of PersonLab [31], where short-range and mid-range offset fields support instance assembly along the predicted skeleton edges.

Composite and Geometric Fields for Advanced Grouping. Later works produced more detailed groupings. Kreiss et al. [53] introduced composite fields (PifPaf), predicting a part intensity field for precise localization and a part association field for grouping. Papandreou et al. [31] proposed PersonLab, which predicts short-range offsets for refinement and mid-range offsets to traverse the kinematic graph.

These concepts extended to 3D as well. The occlusion-robust pose map (ORPM) proposed by Benzine et al. [54] recovers occluded sections by duplicating joint positions in the map. According to Zhen et al. [55], 2.5D cues are required for 3D bottom-up grouping; their method predicts both affinity fields and relative depth in order to guide grouping during occlusions.

Alternative Grouping: Centers and Object-Centric Methods. Vector fields can be substituted with anatomical centers as anchors. After seeing that bottom-up models often fail on lower-body joints, Zhang et al. [56] proposed regressing offsets to two centers (upper/lower body). To deal with scale variations, Cheng et al. [57] employed dual centers (head and hip). Wang et al. [58] adopted a decentralized approach in which every joint predicts the relative position of all other joints in the instance. Li et al. [59] simplified association by predicting limb centers alongside joints.

Recently, methods have moved toward treating poses as objects. McNally et al. [60] proposed KAPAO, which simultaneously detects two classes of objects: joint objects, representing individual body joints, and pose objects, representing complete human poses. Maji et al. [61] directly regressed keypoint coordinates with a single-stage YOLO detector trained with a differentiable OKS loss. Other supervision-related advances include weighting joints by graph centrality [62] and using characteristic functions instead of the L2 loss [63].

Table 5 summarizes bottom-up multi-person grouping methods, including part affinity fields, geometric embeddings, pose-as-object approaches, and 3D grouping extensions.

4.4. Three-Dimensional Representation

The transition to 3D requires representations that can handle depth ambiguity. Three primary approaches have been investigated by researchers: kinematic encoding, ordinal relations, and volumetric grids.

Multi-View and Volumetric Heatmaps. The 3D volumetric grid ( $D \times H \times W$ ) is a logical extension of 2D heatmaps. To address high computational cost, Nibali et al. [64] proposed marginal heatmaps for predicting projections on the $x y$ , $x z$ , and $y z$ planes. Choi et al. [65] targeted mobile applications with a discretized $64^{3}$ volume and soft-argmax. Sárándi et al. [66] identified the scale-dependency of voxel grids as a weakness and introduced MeTRAbs, in which heatmaps are defined in a metric 3D space around the person to enable direct metric scale regression.
Ordinal and Ranking-Based Depth. Some techniques anticipate depth relations instead of regressing absolute depth. In the depth ranking framework introduced by Wang et al. [67], a first network predicts a pairwise ranking matrix, then a second network uses this constraint to regress 3D pose. Similarly, Pavlakos et al. [68] presented ordinal depth supervision, which offers flexible constraints for 3D geometry by classifying joint pairs as closer, further, or at the same depth.
Directly Encoding Structure and Kinematics. The final approach abandons indirect spatial representations for kinematic properties. Kundu et al. [69] introduced an unsupervised method in which the encoder predicts a “kinematics” vector (limb orientation). A non-learnable forward kinematics layer is then used to generate the 3D pose. Chen et al. [70] separated the task into prediction of bone direction (local) and bone length (global/constant).
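The idea of assembling a pose from bone directions and bone lengths through a non-learnable forward kinematics step can be sketched as below. This is a generic illustration, not the layer of [69] or the decomposition of [70]; the five-joint toy skeleton, the `parents` array, and all numeric values are hypothetical.

```python
import numpy as np

def forward_kinematics(parents, directions, lengths, root=None):
    """Assemble 3D joints from per-bone directions and lengths.

    parents[j] is the parent index of joint j (-1 for the root). Joints must
    be ordered so that every parent precedes its children.
    """
    directions = np.asarray(directions, dtype=float)
    norms = np.linalg.norm(directions, axis=1, keepdims=True)
    unit = directions / np.maximum(norms, 1e-8)  # normalize to unit vectors
    joints = np.zeros((len(parents), 3))
    for j, p in enumerate(parents):
        if p < 0:
            joints[j] = np.zeros(3) if root is None else root
        else:
            # Child position = parent position + bone length * bone direction.
            joints[j] = joints[p] + lengths[j] * unit[j]
    return joints

# Hypothetical toy skeleton: a root with two two-bone chains.
parents = [-1, 0, 1, 0, 3]
lengths = np.array([0.0, 0.5, 0.4, 0.5, 0.4])
directions = [[0, 0, 0], [1, 0, 0], [0, -1, 0], [-1, 0, 0], [0, -2, 0]]

joints = forward_kinematics(parents, directions, lengths)
bone_lengths = [np.linalg.norm(joints[j] - joints[p])
                for j, p in enumerate(parents) if p >= 0]
print(joints)
print(bone_lengths)
```

Because bone lengths enter only as fixed multipliers, every pose produced this way has the prescribed limb lengths by construction, which is the structural guarantee that raw coordinate outputs lack.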

High-level structural representations have also emerged. Geng et al. [71] represented pose as a sequence of discrete tokens via a VQ-VAE, assembling them into anatomically coherent skeletons. Fang et al. [72] used a “pose grammar” with bi-directional RNNs to model kinematics and symmetry. Alternatively, Marín-Jiménez et al. [73] modeled 3D pose as a linear combination of learned prototypes.

Synthesis. Reading Section 4 as a whole reveals a single conceptual trajectory in which each successive representation moves further from a point estimate and closer to an explicit structured representation of spatial uncertainty. Coordinate regression returns one number per joint; heatmaps return a probability map; integral and distribution-aware heatmaps return a calibrated probability map with sub-pixel precision; volumetric heatmaps extend the map into depth; and ordinal/kinematic encodings replace raw coordinates with relational and structural quantities that respect anatomy. Two implications follow. First, the representation choice now matters as much as the backbone; for instance, UDP and DARK deliver gains comparable to swapping ResNet for HRNet at near zero extra inference cost. Second, the trajectory has not converged; each representation trades one form of inductive bias for another (standard heatmap decoding with argmax is non-differentiable, kinematic encodings lose flexibility), and the most recent probabilistic and diffusion-based representations of Section 6.1 can be read as the logical next step of replacing a single calibrated distribution with a full posterior over plausible poses.

Table 6 summarizes the main three-dimensional representation paradigms, including volumetric heatmaps, ordinal depth relations, and kinematic or structured encodings.

5. Architectures for Spatial and Global Context

5.1. Mastering Spatial Context with Convolutional Networks

5.1.1. The Multi-Stage Refinement Paradigm

Prior to the emergence of high-resolution architectures, the dominant strategy for heatmap prediction was progressive refinement, popularized by the stacked hourglass architecture (Figure 2). The central premise is that a single encoder–decoder pass is insufficient to capture both low-level local features (essential for precision) and high-level semantic features (essential for structural consistency). Consequently, networks are stacked such that the output of stage $t$ serves as a contextual prior for stage $t + 1$.

Chen et al. [74] formalized this approach with the cascaded pyramid network (CPN). Their architecture employs a GlobalNet to localize distinct keypoints and a RefineNet to address “hard” keypoints (e.g., occluded joints) using an online hard keypoint mining (OHKM) loss. To improve information flow across stages, Li et al. [75] proposed the multi-stage pose network (MSPN), which incorporates cross-stage feature aggregation and a coarse-to-fine supervision scheme.

Cross-Stage Feature Aggregation. A fundamental weakness of deep multi-stage networks is the progressive loss of spatial detail during repeated down- and up-sampling cycles. This is addressed in MSPN by introducing a cross-stage feature aggregation (CSFA) strategy. For each spatial scale $s$, two separate feature flows are extracted from the immediately preceding stage: one from its down-sampling path, $F_{down}^{s}$, and one from its up-sampling path, $F_{up}^{s}$. Each flow is transformed by a lightweight $1 \times 1$ convolution before being fused with the current stage’s own down-sampled features $F_{cur}^{s}$:

\tilde{F}_{cur}^{s} = F_{cur}^{s} + W_{1}(F_{down}^{s}) + W_{2}(F_{up}^{s}), \qquad (4)

where $W_{1}$ and $W_{2}$ denote the two $1 \times 1$ convolutional projections. This cross-stage skip connection can be interpreted as an extended residual design spanning stages, alleviating gradient vanishing while preserving multi-scale context across the full pipeline. Ablation results confirm a consistent gain of $+0.3$ AP on COCO for MSPN and $+0.5$ AP for the hourglass baseline; the larger improvement for the latter reflects its greater susceptibility to inter-stage information loss, which stems from its use of equal channel widths.
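Equation (4) can be written out in a few lines. The sketch below treats a $1 \times 1$ convolution as a per-pixel linear map over channels; the channel counts, spatial size, and random weights are arbitrary assumptions, and normalization and activation layers that a real implementation would likely include are omitted.

```python
import numpy as np

def conv1x1(features, weight):
    """1x1 convolution as a per-pixel linear map over channels.

    features: (C_in, H, W); weight: (C_out, C_in).
    """
    return np.einsum("oc,chw->ohw", weight, features)

def cross_stage_aggregate(f_cur, f_down, f_up, w1, w2):
    """Equation (4): F_cur + W1(F_down) + W2(F_up) at one spatial scale."""
    return f_cur + conv1x1(f_down, w1) + conv1x1(f_up, w2)

rng = np.random.default_rng(0)
c_prev, c_cur, height, width = 8, 16, 12, 12   # illustrative sizes
f_cur = rng.standard_normal((c_cur, height, width))
f_down = rng.standard_normal((c_prev, height, width))
f_up = rng.standard_normal((c_prev, height, width))
w1 = 0.1 * rng.standard_normal((c_cur, c_prev))
w2 = 0.1 * rng.standard_normal((c_cur, c_prev))

fused = cross_stage_aggregate(f_cur, f_down, f_up, w1, w2)
print(fused.shape)
```

Setting both projections to zero returns `f_cur` unchanged, which makes the residual reading of the equation explicit: the previous stage only adds a correction on top of the current features.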

Coarse-to-Fine Supervision. Standard multi-stage supervision applies the same fixed-bandwidth Gaussian heatmap target at every stage, which is misaligned with the natural coarse-to-fine progression of predictions across stages. MSPN replaces this fixed scheme with a coarse-to-fine supervision (CTF) strategy: stage $t$ is supervised with a Gaussian of standard deviation $\sigma_{t}$, where $\sigma_{1} > \sigma_{2} > \dots > \sigma_{T}$. Concretely, for a ground-truth joint location $p_{k}^{*} \in \mathbb{R}^{2}$, the heatmap target at stage $t$ is

H_{k}^{t}(p) = \exp\left(- \frac{\lVert p - p_{k}^{*} \rVert_{2}^{2}}{2 \sigma_{t}^{2}}\right), \qquad (5)

where $k$ indexes the joint and $\sigma_{t}$ decreases with the stage index $t$. Within each stage, multi-scale intermediate supervision is applied at four spatial resolutions, and online hard keypoint mining (OHKM) [74] is applied at the finest scale to focus training on ambiguous joints. This strategy produced the largest single gain in the MSPN ablation ($+0.9$ AP, from $73.3$ to $74.2$ on COCO minival), confirming that the quality of the supervisory signal is at least as impactful as architectural connectivity. Taken together, CSFA and CTF are complementary: aggregation ensures richer features flow into each stage, while coarse-to-fine supervision ensures that each stage is trained with an appropriately calibrated target, pushing the multi-stage paradigm to 76.1 AP on COCO test-dev with a 4×Res-50 backbone.
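The stage-wise targets of Equation (5) can be generated as in the sketch below. The specific schedule of standard deviations, the grid size, and the joint location are assumptions chosen for illustration and are not the values used by MSPN.

```python
import numpy as np

def stage_targets(joint_xy, size, sigmas):
    """Build one Gaussian target per stage, following Equation (5).

    joint_xy: ground-truth (x, y) of one joint in heatmap coordinates.
    sigmas: per-stage standard deviations, expected to decrease with the stage.
    """
    ys, xs = np.mgrid[0:size, 0:size]
    sq_dist = (xs - joint_xy[0]) ** 2 + (ys - joint_xy[1]) ** 2
    return [np.exp(-sq_dist / (2.0 * s ** 2)) for s in sigmas]

sigmas = (3.0, 2.0, 1.5, 1.0)            # hypothetical coarse-to-fine schedule
targets = stage_targets((32, 24), size=64, sigmas=sigmas)
masses = [float(h.sum()) for h in targets]
print(masses)
```

The summed mass of each target shrinks from stage to stage, so early stages are rewarded for placing the joint in roughly the right region while later stages are only rewarded for tight localization.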

Similarly, Su et al. [76] introduced cascade feature aggregation to explicitly fuse low-, mid-, and high-level features from previous stages. Zhang et al. [77] further enhanced contextual modeling by appending a pose graph neural network (PGNN) at the end of the cascade for structural refinement.

However, Xiao et al. [78] questioned the need for such intricate multi-stage designs. Their seminal work on “Simple Baselines” showed that a deep ResNet backbone followed by straightforward deconvolution layers can outperform complicated multi-stage networks. This result implied that the refinement head’s complexity may matter less than the backbone’s representational ability. Nevertheless, work on efficient refinement continued: Bulat et al. [79] presented soft-gated skip connections to adaptively filter information flow, while Tang et al. [80] proposed densely connected U-Nets (DU-Net) to maximize feature reuse.

5.1.2. The Revolution in High Resolution

The loss of spatial information during downsampling is a basic drawback of hourglass and pyramidal structures. Resolution recovery through upsampling is intrinsically lossy. To resolve this, Sun et al. [81] introduced the high-resolution network (HRNet).

Unlike serial encoder–decoders, HRNet maintains a high-resolution representation throughout the entire forward pass. The architecture connects high-to-low resolution branches in parallel and employs repeated multi-scale fusion units. This allows the high-resolution branch to receive global semantic context from lower-resolution branches without sacrificing spatial precision. HRNet became a standard backbone for 2D pose estimation. Cheng et al. [82] later extended this concept to bottom-up estimation by adding a deconvolution module to generate super-resolution heatmaps, which are crucial for detecting small-scale person instances.
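The principle of exchanging information between parallel resolutions can be illustrated with a deliberately simplified two-branch sketch. It is not HRNet's fusion unit: here both branches are assumed to have the same number of channels, and average pooling and nearest-neighbor upsampling stand in for the learned strided and $1 \times 1$ convolutions that a real network would use.

```python
import numpy as np

def downsample2x(x):
    """Average-pool a (C, H, W) map by a factor of 2 (H and W must be even)."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def upsample2x(x):
    """Nearest-neighbor upsampling of a (C, H, W) map by a factor of 2."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def exchange(high, low):
    """Each branch keeps its own resolution and adds the other's resampled features."""
    new_high = high + upsample2x(low)
    new_low = low + downsample2x(high)
    return new_high, new_low

high = np.ones((4, 8, 8))        # high-resolution branch
low = 2.0 * np.ones((4, 4, 4))   # low-resolution branch
new_high, new_low = exchange(high, low)
print(new_high.shape, new_low.shape)
```

The point of the sketch is that the high-resolution map is never discarded and rebuilt; it is only enriched with context from the coarser branch, which is the property the paragraph above attributes to HRNet.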

5.1.3. Hybrid CNNs and Specialized Modules

Alongside backbone development, specialized modules that improve efficiency or enlarge the receptive field have also attracted attention. Artacho and Savakis [83] created the waterfall atrous spatial pooling (WASP) module in UniPose, which uses cascaded dilated convolutions to efficiently capture multi-scale context. Later, they expanded this to OmniPose [84], combining HRNet and WASP for single-pass multi-person estimation. Ke et al. [85] created multi-scale structure-aware networks (MSS-Net) with losses penalizing incorrect pairwise joint relations. Zhang et al. [86] proposed a cascaded context mixer (CCM) integrating squeeze-and-excitation mechanisms.

Local precision was the subject of other works. Since local details are frequently overpowered by global features, Cai et al. [87] introduced the residual steps network (RSN) to create densely fused features at the same scale. Groos et al. [88] introduced EfficientPose based on the EfficientNet backbone to address efficiency. Papaioannidis et al. [89] explored multi-task learning by creating semantic body maps that guide the principal heatmap regressor using a GAN-like collateral module.

To enable single-instance prediction within packed detection boxes, Khirodkar et al. [90] proposed multi-instance pose networks (MIPNet), which modifies the backbone to accept an instance selector input. Munea et al. [91] demonstrated the continued viability of hybrid architectures by proposing SimpleCut, a more straightforward U-Net that predicts both keypoints and grouping maps.

Table 7 summarizes CNN-based architectures, including multi-stage, pyramidal, hybrid, and high-resolution designs, on standard 2D pose-estimation benchmarks.

Postprocessing and Refinement. Recognizing that even top-performing estimators produce systematic errors, another line of research focuses on post hoc refinement. Fieraru et al. [92] proposed an explicit refinement network trained on synthetic errors (e.g., swapping limbs), while Moon et al. [93] formalized this with PoseFix, a model-agnostic network that learns to correct the output of any estimator by training on a distribution of realistic pose distortions.

5.2. The Global Context Revolution: Graphs and Transformers

CNNs are excellent at extracting local features, but struggle to capture long-range dependencies such as the relationship between a wrist and an ankle due to their limited receptive fields. To overcome this, the field has shifted its attention to global relational modeling architectures such as transformers and graph convolutional networks (GCNs).

5.2.1. GCNs: The Skeleton as a Graph

Models can explicitly take advantage of anatomical priors by treating the human skeleton as a graph in which the joints are nodes and the bones are edges. Lifting poses from 2D to 3D is a natural fit for this approach. Graph representations offer several concrete advantages for 3D pose estimation:

Strong inductive bias. The graph structure explicitly encodes anatomical constraints, such as which joints are directly connected by a bone; this biases the model to learn relationships that are physically valid, reducing the hypothesis space and improving sample efficiency compared to a fully-connected network that must learn these connections from scratch.
Locality of information. A joint’s 3D location is most directly influenced by its immediate kinematic neighbors (e.g., the elbow constrains the wrist); graph convolutions operate naturally on this local neighborhood, unlike standard convolutions that use a fixed grid-based receptive field.
Structure-aware reasoning. Graph networks can be designed to respect the hierarchical and directed nature of the skeleton (e.g., from the hip to the knee to the ankle). As shown by conditional directed graph convolutions, using directed edges allows the model to explicitly represent the flow of influence from parent to child joints, thereby mirroring real biomechanics.
Flexibility. The graph representation is not limited to a fixed skeleton. It can, in principle, be adapted to different skeleton definitions (e.g., with more or fewer joints) or even to non-human articulated objects, making it a more general tool for structured prediction.

GCNs were integrated into CNN backbones in early attempts such as the graph stacked hourglass [94]; however, regular GCNs sometimes perform badly due to strict weight sharing. This problem was solved by Ci et al. [95] by using locally connected networks (LCNs) to learn different weights for different joints.
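A vanilla graph convolution over a skeleton can be sketched as follows to make the locality and weight-sharing properties concrete. This is the generic symmetric-normalized GCN layer, not the architecture of any method cited here; the five-joint chain, the feature sizes, and the constant weight matrix are illustrative assumptions.

```python
import numpy as np

def normalized_adjacency(num_joints, bones):
    """Symmetric-normalized adjacency with self-loops: D^{-1/2} (A + I) D^{-1/2}."""
    a = np.eye(num_joints)
    for i, j in bones:
        a[i, j] = a[j, i] = 1.0
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    return a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(x, a_hat, w):
    """One layer: aggregate neighbor features, apply shared weights, then ReLU."""
    return np.maximum(a_hat @ x @ w, 0.0)

# Hypothetical chain skeleton: 0 - 1 - 2 - 3 - 4.
bones = [(0, 1), (1, 2), (2, 3), (3, 4)]
a_hat = normalized_adjacency(5, bones)

x = np.arange(10, dtype=float).reshape(5, 2) + 1.0  # 2D input per joint
w = np.full((2, 4), 0.1)                            # one weight matrix shared by all joints

out = gcn_layer(x, a_hat, w)
x_perturbed = x.copy()
x_perturbed[4] += 1.0                               # move only joint 4
out_perturbed = gcn_layer(x_perturbed, a_hat, w)
print(np.abs(out_perturbed - out).sum(axis=1))      # change per joint
```

After one layer, only joint 4 and its direct neighbor respond to the perturbation, which illustrates the locality described above. The single matrix `w` applied to every joint is the strict weight sharing that locally connected variants relax.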

Further refinements focused on graph topology. To address the claim that undirected graphs are unable to represent the kinematic hierarchy of the body, Hu et al. [96] proposed conditional directed graph convolutions. To enhance 3D rotation encoding, Azizi et al. [97] presented geometric operators based on Möbius transformations to preserve angles and translations between joints. Liu et al. [98] integrated spatial GCNs with temporal convolutional networks (TCNs) to simulate spatiotemporal context. More recently, Li et al. [99] proposed GraphMLP, a simpler architecture that uses MLP-mixers instead of explicit graph convolutions to accomplish global modeling.

Table 8 summarizes GCN-based 2D-to-3D lifting methods and reports their MPJPE scores on Human3.6M under Protocol 1.

5.2.2. Takeover by Transformers

A paradigm change occurred with the release of the Vision Transformer (ViT). The self-attention mechanism (Figure 3), which naturally models interactions between any pair of tokens, provides the global receptive field that CNNs lack.
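The global receptive field of self-attention can be seen in a minimal single-head sketch of scaled dot-product attention. The token count, embedding size, and random projection matrices are arbitrary assumptions; multi-head splitting, positional encodings, and the surrounding transformer block are omitted.

```python
import numpy as np

def self_attention(tokens, wq, wk, wv):
    """Single-head scaled dot-product self-attention over (N, D) tokens."""
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    return weights @ v, weights

rng = np.random.default_rng(0)
num_tokens, dim = 6, 8                       # illustrative sizes
tokens = rng.standard_normal((num_tokens, dim))
wq, wk, wv = (0.3 * rng.standard_normal((dim, dim)) for _ in range(3))

attended, weights = self_attention(tokens, wq, wk, wv)
print(weights.round(2))
```

Every row of `weights` is a distribution over all tokens with no structurally zero entries, so any token can draw on any other in a single layer, in contrast to the neighborhood-limited aggregation of a convolution.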

Transformers for Heatmap Prediction. Initially, transformers were introduced to augment CNNs. Yang et al. [100] proposed TransPose, which processes CNN features with a transformer encoder to capture global spatial relationships prior to heatmap prediction. Yuan et al. [101] took this one step further with HRFormer, which explicitly integrates self-attention into the high-resolution branches of HRNet. Specific methods were also developed to reduce the quadratic complexity of attention; for instance, polarized self-attention [102] divides attention into spatial and channel dimensions to reduce computational cost, VTTransPose [103] employs “twin attentions” to separate row/column computation, and TCFormer [104] clusters tokens to focus computation on informative regions.

Direct Regression Revisited. The most significant contribution of transformers has been the revival of the direct regression paradigm. DETR-style set prediction and query-based decoding helped to reduce some of the alignment issues that limited earlier CNN-based regression models.

Mao et al. [105] formulated pose estimation as a sequence prediction task in TFPose. Panteleris and Argyros [106] then proposed PE-Former, a pure transformer design that completely eliminates the CNN backbone. In a similar vein, Li et al. [107] introduced cascade transformers, which gradually improve keypoint locations over time.

To deal with regression uncertainty, methods such as Poseur [108] combined transformers with residual log-likelihood estimation (RLE). This pattern also extended to multi-person estimation: DirectPose [109] employs a keypoint alignment strategy to overcome the misalignment between convolutional features and the predicted keypoints in end-to-end regression, while Group Pose [110] employs efficient “group queries”, assigning each human instance a group of queries that yields its score and keypoints. Even bottom-up approaches have adopted this reasoning, with Geng et al. [111] showing that complex heatmap grouping can be replaced by regressing offsets from body centers.

5.2.3. Context-Aware 2D-to-3D Lifting

A basic question raised by the use of transformers for 2D-to-3D lifting is whether a sequence of 2D coordinates is adequate or whether visual context remains necessary. Ma et al. [112] combined pictorial and graph-based structure models, using attention to enforce bone-length constraints (ContextPose). Zhao et al. [113] demonstrated that a single 2D pose enhanced with local visual features derived from the image backbone can outperform sequence-based algorithms that depend solely on coordinates, highlighting the significance of visual context in resolving depth ambiguity.

Why visual context helps. The depth ambiguity of monocular 3D lifting is fundamentally nonlocal: the absolute depth of a joint cannot be determined from the joint itself, only from its relation to other parts of the scene that have known geometric properties. CNNs, with their limited and slowly growing receptive fields, can only reason about such relations indirectly; depth has to be inferred from local cues (foreshortening, shading) and then propagated stage-by-stage through the network, with information loss occurring at each downsampling step. Self-attention removes this propagation bottleneck by allowing every token to attend to every other token in a single layer, which is exactly what is required to bring distant evidence to bear on a local depth decision.

The canonical example is foot–torso reasoning. If visual or scene-aware reasoning supports the hypothesis that the subject’s foot is in contact with the support surface, then the foot depth becomes strongly constrained, meaning that the 2D foot observation can be back-projected onto the estimated support surface. More precisely, the model does not infer foot–ground contact from the 2D foot coordinate alone; rather, contact is inferred from visual and temporal evidence around the projected foot, such as the appearance of the support surface, orientation of the sole, local occlusion boundaries between the shoe and floor, shadows or contact patches, and the near-zero velocity of the foot during stance. In a context-aware transformer, the foot token can attend to nearby scene tokens corresponding to the floor or other support surfaces, thereby learning an implicit contact variable.

Once such a contact hypothesis is available, it provides a geometric constraint on depth. Given camera intrinsics, the 2D foot location defines a camera ray. If the foot is assumed to be in contact with a known support surface such as a ground plane or scene height map, then the 3D foot position can be estimated by intersecting the ray with the surface. For a height map $h(x, z)$, this corresponds to finding the point on the viewing ray for which the vertical coordinate satisfies $y = h(x, z)$. Thus, the foot depth is not estimated from the joint coordinate itself but from the conjunction of image evidence, contact reasoning, and scene geometry. The depth of the torso is then constrained by the kinematic chain and bone length priors connecting the foot to the rest of the body. While a CNN must learn this chain implicitly through many stacked layers, a transformer can encode it through direct attention between body and scene tokens, for example between torso, foot, and floor-region tokens. The same mechanism explains why context-aware lifting [113] outperforms coordinate-only lifting: the 2D keypoint sequence alone contains no information about which joints touch the floor, lean against a wall, or are occluded by a held object, whereas the visual feature tokens can provide such evidence. In this way, attention turns 3D lifting from a sequence-to-sequence regression problem to a scene-conditioned reasoning one, which forms the conceptual bridge to the probabilistic and physics-aware methods covered in Section 6.1 and Section 8.3.
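The geometric step, back-projecting a foot pixel onto a support surface, can be sketched for the simplest case of a flat floor, which corresponds to a constant height map. The sketch assumes a pinhole camera at the origin with x to the right, y pointing down, z forward, and no tilt relative to the floor; the intrinsics, the pixel location, and the 1.5 m camera height are made-up values, and `backproject_to_ground` is a hypothetical helper name.

```python
import numpy as np

def backproject_to_ground(pixel_uv, intrinsics, ground_y):
    """Intersect the viewing ray of a pixel with the plane y = ground_y.

    Assumes camera coordinates with y pointing down, so the floor lies at a
    positive y equal to the camera height above it.
    """
    u, v = pixel_uv
    ray = np.linalg.inv(intrinsics) @ np.array([u, v, 1.0])  # direction with z = 1
    if ray[1] <= 1e-8:
        raise ValueError("ray does not reach the ground plane in front of the camera")
    t = ground_y / ray[1]   # scale at which the ray's y equals ground_y
    return t * ray          # 3D point; its z component is the depth

intrinsics = np.array([[1000.0, 0.0, 640.0],
                       [0.0, 1000.0, 360.0],
                       [0.0, 0.0, 1.0]])
foot_pixel = (740.0, 860.0)   # hypothetical 2D foot detection
foot_3d = backproject_to_ground(foot_pixel, intrinsics, ground_y=1.5)
reprojected = intrinsics @ foot_3d / foot_3d[2]
print(foot_3d)
```

Under these assumptions the contact hypothesis alone fixes the foot depth, with no learned depth regression involved. A general height map $h(x, z)$ would require a numerical search along the ray instead of this closed-form intersection.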

To close this section, we emphasize that transformers are not a monolithic solution; rather, they are a flexible architecture that has been adapted in three distinct ways:

Heatmap Refinement (e.g., TransPose, HRFormer). A CNN backbone first extracts visual features; the transformer encoder then processes these features using self-attention to capture long-range spatial relationships across the entire feature map. The output is used to predict final heatmaps. Here, the transformer acts as a powerful global context aggregator that enhances the CNN’s local features.
Direct Regression (e.g., TFPose, Poseur). The image is passed through a CNN backbone to produce a feature map. The transformer decoder uses a set of learnable keypoint queries to directly regress the (x,y) or (x,y,z) coordinates for each joint. Each query attends to the most relevant image features via cross-attention. This treats pose estimation as a set prediction problem (like DETR for objects), elegantly avoiding both heatmaps and their postprocessing requirements.
Lifting from 2D to 3D (e.g., PoseFormer, MixSTE). Here, the input is a sequence of 2D pose coordinates from a video. The transformer is used as a spatiotemporal model that first applies self-attention along the spatial dimension (joints within a frame), then along the temporal dimension (the same joint across frames). This allows the model to learn complex joint correlations and motion dynamics directly from the 2D pose sequence without needing the original image. This is a purely sequence-to-sequence paradigm.
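The factorized attention used in the lifting paradigm can be illustrated with a minimal NumPy sketch. It assumes the 2D keypoints have already been embedded into C-dimensional tokens, uses a single attention head with hypothetical projection matrices, and omits layer normalization, positional embeddings, and feed-forward layers; it is not the implementation of PoseFormer or MixSTE.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv):
    # x: (..., n_tokens, C). Attention is computed over the second-to-last axis.
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def spatial_then_temporal(x, params_s, params_t):
    # x: (T, J, C) = frames x joints x channels (embedded 2D keypoints).
    # Spatial step: tokens are the J joints within each frame.
    x = x + self_attention(x, *params_s)
    # Temporal step: tokens are the T frames of each joint trajectory.
    xt = np.swapaxes(x, 0, 1)              # (J, T, C)
    xt = xt + self_attention(xt, *params_t)
    return np.swapaxes(xt, 0, 1)           # back to (T, J, C)
```

Factorizing attention this way means each step attends over J or T tokens rather than over all T × J tokens at once, which is one reason this design is attractive for long sequences.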

Table 9 summarizes transformer-based pose-estimation architectures, distinguishing heatmap-based transformer models from regression-based transformer models. Table 10 summarizes monocular 2D-to-3D lifting strategies and highlights the impact of visual context on lifting accuracy.

6. Ambiguity, Generalization and Occlusion

6.1. From Deterministic Prediction to Probabilistic Estimation

The architectures presented in the previous sections generally operate deterministically: each input produces a single output. However, pose estimation, in particular 2D-to-3D lifting, is inherently ambiguous (as shown in Figure 4); a single 2D pose may correspond to several valid 3D configurations. To remedy this, the field has evolved from the search for a single “correct” answer to modeling the space of plausible solutions.

6.1.1. Ambiguity Modeling with Multiple Hypothesis Generation

The initial strategy for dealing with ambiguity was to generate several distinct pose hypotheses (Figure 5). To sample diverse 3D hypotheses, Li and Lee [114] employed mixture density networks (MDNs) to predict the parameters of a mixture distribution instead of a single set of coordinates. Generative models provide a more systematic approach. Sharma et al. [115] proposed a conditional variational autoencoder (CVAE) that learns a latent space conditioned on the 2D pose. By sampling this space, various 3D poses can be generated. Crucially, they incorporated an “ordinal ranking” module to evaluate these hypotheses against depth constraints. Specifically, the ordinal ranking module evaluates each CVAE-generated 3D pose hypothesis by comparing its pairwise joint depth ordering with ordinal relations predicted from the image and the estimated 2D pose. OrdinalNet predicts less-than, greater-than, and equal-depth ordinal maps for joint pairs, which are then converted into a discrete joint–ordinal relation matrix. For each generated 3D candidate, an ordinal matrix is computed from its joint depths. The OrdinalScore is the number of pairwise ordinal relations that match the predicted ordinal matrix. These scores are passed through a temperature-scaled softmax to weight the generated samples, with the final 3D pose computed as the weighted average of the candidates. The same paper also reports an OracleScore upper bound, where the generated sample closest to the ground-truth 3D pose is selected; however, this requires ground truth and is not used as the normal inference procedure. Rather than generating complete hypotheses, Han et al. [116] focused on uncertainty at the joint level. Their model includes a variance parameter σ for each joint, which modulates the loss function to tolerate errors in ambiguous regions. Recently, Rommel et al. [117] proposed ManiPose, a multi-head architecture that avoids physically implausible solutions by generating multiple hypotheses constrained to the manifold of valid human poses.
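The OrdinalScore weighting described above for Sharma et al. [115] can be sketched as follows. This is an illustrative reconstruction from the description in the text only: the encoding of relations as -1/0/+1, the equal-depth tolerance `eps`, the temperature value, and the choice to count each unordered joint pair once are all assumptions.

```python
import numpy as np

def ordinal_matrix(depths, eps=0.05):
    # rel[i, j] = -1 if joint i is closer than joint j, +1 if farther,
    # and 0 if the two depths differ by less than eps (assumed tolerance).
    diff = depths[:, None] - depths[None, :]
    rel = np.sign(diff)
    rel[np.abs(diff) < eps] = 0
    return rel.astype(int)

def ordinal_weighted_pose(candidates, predicted_rel, temperature=1.0, eps=0.05):
    # candidates: (N, J, 3) sampled 3D poses; predicted_rel: (J, J) relation matrix.
    iu = np.triu_indices(candidates.shape[1], k=1)   # each joint pair counted once
    scores = np.array(
        [np.sum(ordinal_matrix(c[:, 2], eps)[iu] == predicted_rel[iu]) for c in candidates],
        dtype=float,
    )
    logits = scores / temperature
    logits -= logits.max()                           # stabilize the softmax
    weights = np.exp(logits)
    weights /= weights.sum()
    fused = np.tensordot(weights, candidates, axes=1)  # weighted average pose, (J, 3)
    return fused, scores, weights
```

A lower temperature concentrates the weights on the best-scoring candidates, whereas a higher temperature approaches a plain average of all samples.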

6.1.2. The Diffusion Model Era

Uncertainty modeling in HPE has been transformed by the emergence of diffusion models. In contrast to single-step generators, diffusion models (Figure 6) learn to iteratively denoise samples, which makes them well suited to modeling multimodal distributions.

Holmquist and Wandt [118] and Gong et al. [119] independently applied this paradigm to 3D lifting (DiffPose), conditioning the denoising process on 2D detections to generate various 3D candidates. To select a single prediction from a set of diffusion-generated 3D hypotheses, Shan et al. [120] proposed a joint-wise reprojection-based multi-hypothesis aggregation module (JPMA). The diffusion model generates $N$ plausible 3D pose hypotheses $\{\hat{P}_{1}, \dots, \hat{P}_{N}\}$ conditioned on the same input 2D keypoints; JPMA then aggregates them at the joint level rather than the pose level. For each joint, every hypothesis is reprojected into the image plane using the known or estimated camera parameters, and the hypothesis with the smallest reprojection error relative to the input 2D keypoint is selected. The final 3D pose is formed by combining the selected joints across hypotheses.

More explicitly, let $\hat{p}_{h,j} \in \mathbb{R}^{3}$ denote the 3D coordinate of joint $j$ in hypothesis $h$, and let $\pi(\cdot)$ be the camera reprojection function. For each joint $j$, JPMA computes the reprojection error

$$e_{h,j} = \lVert \pi(\hat{p}_{h,j}) - x_{j} \rVert_{2},$$

where $x_{j}$ is the observed 2D location of joint $j$. The selected hypothesis index for that joint is then

$$h_{j}^{*} = \arg\min_{h \in \{1, \dots, N\}} e_{h,j},$$

and the final 3D coordinate for joint $j$ is copied from the selected hypothesis:

$$\tilde{p}_{j} = \hat{p}_{h_{j}^{*},\, j}.$$

Thus, the final pose is

$$\tilde{P} = \{\tilde{p}_{1}, \tilde{p}_{2}, \dots, \tilde{p}_{J}\},$$

where different joints may originate from different diffusion hypotheses.

For example, suppose that three hypotheses are generated for a pose with four joints: hip, knee, ankle, and wrist. If the smallest reprojection errors are obtained from hypothesis 1 for the hip, hypothesis 3 for the knee, hypothesis 2 for the ankle, and hypothesis 1 for the wrist, then JPMA forms the final pose as

$$\tilde{P} = \{\hat{p}_{1,\mathrm{hip}},\ \hat{p}_{3,\mathrm{knee}},\ \hat{p}_{2,\mathrm{ankle}},\ \hat{p}_{1,\mathrm{wrist}}\}.$$

Therefore, JPMA does not average whole poses or choose one complete hypothesis; instead, it assembles a new 3D pose by independently selecting, for each joint, the 3D candidate whose reprojection is most consistent with the corresponding 2D detection.

Joint-wise reprojection error. For each joint, JPMA reprojects every candidate 3D hypothesis to the 2D image plane and selects the hypothesis for which the projection is closest to the observed 2D keypoint in Euclidean distance.
Joint-level aggregation. Rather than choosing or averaging entire poses, JPMA assembles the final prediction by taking each joint from its best-scoring hypothesis, which allows different joints to come from different candidates.
Use of 2D priors. The 2D keypoints act as geometric priors that guide hypothesis selection; this method does not introduce an additional heatmap-likelihood term or a learned anatomical plausibility head.
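The joint-wise selection rule summarized above can be written in a few lines of NumPy. The sketch assumes a simple pinhole camera with hypothetical intrinsics `f` (focal length) and `c` (principal point) and 3D joints expressed in camera coordinates; the actual projection model used in [120] may differ.

```python
import numpy as np

def project_pinhole(points_3d, f, c):
    # points_3d: (..., 3) in camera coordinates with z > 0 (assumed pinhole model).
    return f * points_3d[..., :2] / points_3d[..., 2:3] + c

def jpma_select(hypotheses, keypoints_2d, f, c):
    # hypotheses: (N, J, 3) candidate 3D poses; keypoints_2d: (J, 2) observed joints.
    proj = project_pinhole(hypotheses, f, c)                      # (N, J, 2)
    err = np.linalg.norm(proj - keypoints_2d[None], axis=-1)      # e_{h,j}, shape (N, J)
    best = err.argmin(axis=0)                                     # h_j^*, shape (J,)
    fused = hypotheses[best, np.arange(hypotheses.shape[1])]      # (J, 3)
    return fused, best
```

Because the selection is made per joint, `best` may contain different hypothesis indices for different joints, which is exactly the situation in the hip/knee/ankle/wrist example.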

Compared with pose-level averaging, this joint-level selection better exploits the diversity of diffusion-generated hypotheses and avoids collapsing them into a single mean pose. Joint-level aggregation is further reported to outperform pose-level aggregation, while the reprojection-based JPMA is reported to be more effective than average- or MLP-based fusion at the pose level. Diffusion models also enable rich multimodal conditioning. Xu et al. [121] presented FinePOSE, which conditions the diffusion process on textual prompts, encoded via CLIP, that describe action types or speeds. With regard to occlusion, Wang et al. [122] proposed Di²Pose, which uses discrete pose token quantization and a discrete diffusion process (with mask/replace strategies) to better reconstruct 3D poses when body parts are missing. Feng et al. [123] extended diffusion-based pose estimation from static images to videos by formulating multi-frame 2D pose estimation as a conditional generative process over keypoint heatmaps. The key idea is to leverage both spatial information (frame-level appearance) and temporal information (motion consistency across frames) through an explicit spatiotemporal representation.

Concretely, the method processes each frame using a Vision Transformer backbone and then aggregates features across a temporal window using a spatiotemporal representation learner (STRL) implemented as cascaded transformer layers operating along the time axis. The resulting spatiotemporal feature serves as the conditioning signal for the diffusion model. More specifically, DiffPose conditions the reverse denoising process through its Pose-Decoder, which can be written as $f_{\theta}(x_{t}, F_{t}^{i}, t)$, where $x_{t}$ is the noisy keypoint heatmap at diffusion step $t$ and $F_{t}^{i}$ is the spatiotemporal feature extracted from the input frame sequence. For a given temporal window, $F_{t}^{i}$ is kept fixed across the reverse diffusion trajectory and is provided to the decoder at each denoising step. Inside the decoder, the timestep embedding first modulates the noisy heatmap to obtain a step-adaptive heatmap $\bar{x}_{t}$. The decoder then constructs multi-scale size-matched pairs between $\bar{x}_{t}$ and $F_{t}^{i}$ and uses the noisy heatmap to form attention-like joint fields that look up keypoint-relevant regions from the spatiotemporal feature. These retrieved local joint features are fused with the original global spatiotemporal context to guide the prediction of the denoised heatmap. In this way, the conditioning feature steers the diffusion process toward spatially plausible and temporally consistent keypoint configurations. The Pose-Decoder is equipped with a lookup-based multiscale feature interaction (LMSFI) module; as inputs, it takes the noisy keypoint heatmap, diffusion timestep, and spatiotemporal feature, which are used to predict the denoising direction.

The forward (noising) process follows the standard diffusion formulation in which Gaussian noise is progressively added to the ground-truth heatmaps over T steps. The reverse process iteratively refines the noisy heatmaps conditioned on the spatiotemporal features, yielding increasingly accurate keypoint predictions.
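For reference, the standard forward process referred to above is usually written as follows in the DDPM parameterization. Here $x_{0}$ denotes the ground-truth heatmap and $x_{t}$ the noisy heatmap at step $t$, as in the text; the noise schedule $\beta_{t}$ and the shorthand $\bar{\alpha}_{t}$ are notational assumptions, and the specific schedule used in [123] is not stated here.

```latex
\begin{align}
q(x_{t} \mid x_{t-1}) &= \mathcal{N}\!\left(x_{t};\ \sqrt{1-\beta_{t}}\, x_{t-1},\ \beta_{t} I\right),
  \qquad t = 1, \dots, T, \\
x_{t} &= \sqrt{\bar{\alpha}_{t}}\, x_{0} + \sqrt{1-\bar{\alpha}_{t}}\, \epsilon,
  \qquad \epsilon \sim \mathcal{N}(0, I),
  \qquad \bar{\alpha}_{t} = \prod_{s=1}^{t} (1-\beta_{s}).
\end{align}
```

The second line is the closed form that allows a noisy heatmap at any step $t$ to be sampled directly from $x_{0}$ during training.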

This formulation enables the model to exploit temporal context in order to improve robustness in challenging scenarios such as occlusion and motion blur while also benefiting from the iterative refinement and ensemble properties of diffusion models. Empirically, DiffPose achieves state-of-the-art performance on PoseTrack [124]. Lastly, Jiang et al. [125] introduced ZeDO, a zero-shot method which does not require 2D–3D or image–3D training data; instead, it starts from a 2D pose (keypoints) and camera parameters, then performs case-by-case optimization to find a 3D pose that both projects correctly and is plausible under a pretrained human pose diffusion prior.

Table 11 summarizes probabilistic and generative approaches, including multi-hypothesis uncertainty models and diffusion-based methods, on Human3.6M under Protocol 1.

6.2. Bridging the Domain Gap “In the Wild”

One of the persistent challenges of 3D HPE is the domain gap between controlled laboratory datasets (e.g., Human3.6M) and “in-the-wild” imagery (see Figure 4). Models trained on clean MoCap data often fail when exposed to different lighting, clothing, and viewpoints.

6.2.1. Self-Supervised and Weakly Supervised Learning

To mitigate the scarcity of 3D labels in the wild, researchers exploit weaker signals such as multi-view consistency. The basic idea is that a predicted 3D pose should be geometrically consistent when projected onto different camera views. Rhodin et al. [126,127] and Iskakov et al. [128] were the first to do this, enforcing reprojection consistency between views without 3D reference labels. Recent methods exploit 2D data more aggressively. Yang et al. [129] proposed CameraPose, which jointly predicts 3D pose and camera parameters to enable supervision via 2D reprojection error on in-the-wild images. Yu et al. [130] used a semi-supervised strategy, expanding the training set by creating “pseudo-heatmaps” from unlabeled data using a two-student architecture. Nakatsuka et al. [131] (MirrorNet) used an analysis-by-synthesis approach: the network encodes images into a latent 2D pose representation and reconstructs the input image from it, forcing the latent representation to be anatomically plausible. This reduces the need for labeled data and allows for semi- or self-supervised learning. Kundu et al. [132] employed part-based latent factors and novel image synthesis to learn 3D poses without paired 2D–3D annotations. Sosa and Hogg [133] applied geometric transformations such as 3D rotation, using the consistency of the projected pose under rotation as supervision. Kundu et al. [134] used uncertainty awareness in a self-supervised domain adaptation setting. Their MRP-Net architecture jointly predicts pose and per-joint confidence; during adaptation to unlabeled real images, uncertainty is minimized on plausible human images and maximized on out-of-distribution data. This allows the method to filter out unreliable pseudolabels and improve robustness when training without target-domain annotations.

6.2.2. Data-Centric Solutions: Augmentation and Adaptation

An alternative to unsupervised learning is to enhance the training data.

Adversarial and Learned Augmentation. Traditional augmentation strategies such as random scaling, rotation, flipping, and occlusion improve robustness, but remain fixed and hand-crafted. Yang et al. [135] proposed an adversarial data augmentation framework for human pose estimation in which augmentation and pose network training are jointly optimized. Instead of relying on predefined transformations, this method introduces an augmentation network that learns to generate challenging transformations conditioned on the current state of the pose estimator.

The framework consists of two components: a pose network that predicts human joint locations, and an augmentation network that produces distributions over augmentation parameters such as scaling, rotation, and occlusion. The augmentation network samples transformations that increase the pose network’s training loss, thereby exposing its failure modes, while the pose network learns to minimize the loss under these increasingly difficult perturbations. In this sense, the method can be interpreted as a two-player optimization process in which the augmentation network seeks to maximize the pose-estimation loss while the pose network aims to minimize it.
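The two-player interpretation can be summarized as a min–max objective. The notation below is introduced here for illustration only and is not taken from [135]: $P_{\theta}$ denotes the pose network, $G_{\phi}$ the augmentation network that outputs a distribution over transformations $\tau$ (scaling, rotation, occlusion), $\mathcal{D}$ the training set of images $I$ with joint annotations $y$, and $\mathcal{L}$ the pose-estimation loss.

```latex
\begin{equation}
\min_{\theta}\ \max_{\phi}\
\mathbb{E}_{(I, y) \sim \mathcal{D}}\
\mathbb{E}_{\tau \sim G_{\phi}(\cdot \mid I)}
\left[ \mathcal{L}\!\left( P_{\theta}(\tau(I)),\ \tau(y) \right) \right]
\end{equation}
```

Under this reading, the inner maximization searches for transformations that expose the current failure modes of $P_{\theta}$, while the outer minimization trains the pose network to remain accurate under them.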

Subsequent work extended this idea in several directions. Peng et al. [136] proposed adversarial data augmentation to synthesize occlusions. Gong et al. [137] introduced PoseAug, a differentiable framework in which an augmentor network learns to generate challenging training samples such as extreme rotations in order to stress-test the pose estimator. Peng et al. [138] later refined this strategy with a dual augmentor framework that separates style and content augmentation.

Generalization and Invariance. To improve cross-dataset performance, Wang et al. [139] proposed a method to generate synthetic 3D pose labels for in-the-wild images. They used a stereo-inspired neural network to lift 2D joint detections to 3D, then applied a geometric refinement step, producing a large dataset (400,000 images) with pseudo-3D ground truth. Doersch and Zisserman [140] exploited optical flow and synthetic humans (SURREAL) for sim-to-real transfer. Chai et al. [141] introduced PoseDA, separating global position alignment from local pose deformation. Focusing on invariant representations, Wang et al. [142] used auxiliary viewpoint prediction to reduce camera bias. Cai et al. [143] proposed PoseIRM, which applies invariant risk minimization across synthetically generated camera settings to learn pose estimation models with features that are invariant to camera parameters, thereby avoiding reliance on spurious correlations tied to specific views. Zeng et al. [144] proposed SRNet to deal with rare poses via split and recombination strategies. Finally, methods that utilize privileged information or auxiliary signals have shown improved performance: Wang et al. [145] developed TMT, which uses 3D joint velocities during training to enhance monocular 3D pose estimation; Lee et al. [146] employed multi-model guidance to provide richer supervisory signals; Taketsugu and Ukita [147] studied active learning to efficiently adapt models to new video sequences; and Hu et al. [148] utilized meta-optimization with self-supervised tasks to enable rapid adaptation. Together, these techniques demonstrate how training with richer or structured data can improve inference on conventional inputs.

Table 12 summarizes representative weakly supervised, unsupervised, and adversarial generalization methods evaluated on Human3.6M. Table 13 further compares recent generalization and adaptation methods on Human3.6M and 3DPW.

6.3. Reasoning About Occlusion and Crowds

6.3.1. Robustness to Occlusion

Occlusion remains a fundamental failure mode. Early solutions such as those of Vosoughi and Amer [149] and Cheng et al. [150,151] focused on filtering unreliable joints or training on partial bodies. Das et al. [152] used a multi-task network for pedestrian analysis that jointly performs pedestrian detection, instance segmentation, and pose estimation. An instance-level domain adaptation strategy is used to handle occlusion and domain shift, yielding improved robustness and better pose estimates for occluded pedestrians in automotive scenes. Dedicated methods have also been developed for 2D-to-3D lifting. Hardy and Kim [153] proposed LInKs (“lift then fill”), a two-stage 2D-to-3D lifting method that first lifts reliably detected 2D joints independently into 3D, then fills in occluded or missing parts directly in 3D space. This approach improves robustness under occlusion and reduces dependence on full 2D skeleton detection. HiPART, introduced by Zheng et al. [154], employs a hierarchical autoregressive transformer to “densify” sparse 2D inputs prior to lifting. Zhang et al. [155] used an analysis-by-synthesis approach (3DNBF), fitting a neural body volume to image features in order to hallucinate occluded parts on the basis of anatomical plausibility. Transformers have enabled more explicit reasoning; for instance, Sun et al. [156] introduced “visibility tokens” that allow the network to identify occluded joints and infer their position based on global context rather than local visual evidence.

6.3.2. Crowded Scenes

In crowded scenes, top-down methods fail when detectors miss occluded individuals, while bottom-up methods struggle to group the joints of overlapping people. Ning et al. [157] integrated tracking to mitigate detection failures. Hybrid architectures such as BUCTD [158], Cheng et al. [159], and DPIT [160] fill this gap by using bottom-up proposals to condition top-down refinement.

To manage inter-person occlusion, models need to reason about interactions. Dabral et al. [161] adapted Faster R-CNN for 3D multi-person pose estimation. More recently, I²R-Net [162] and GR-M3D [163] address multi-person 3D pose estimation by introducing dynamic graph reasoning (DGR), in which the decoding graph is predicted from the input image rather than fixed in advance. This design explicitly targets occlusion and depth ambiguity by combining scale- and depth-aware refinement with multi-root pose decoding.

Concretely, a backbone network produces four maps: a heatmap encoding joint and body-center confidence, a scale map, a depth map, and a 3D offset map. These maps are refined by a scale- and depth-aware refinement (SDAR) module, which leverages scale and depth context to improve heatmap localization and offset accuracy. The refined heatmap and offset maps are then used to decode multiple root keypoints and dense decoding paths for each detected person.

Given these dense paths, DGR constructs a dynamic decoding graph by assigning a soft weight to each candidate path. Each weight depends on the confidence of the start point, the target point, and their bone-level consistency together with a dataset-level bone length prior. Unlike fixed star- or tree-structured decoders, this dynamic graph adapts to occlusion patterns and depth ambiguity at inference time. The final 3D pose is obtained by aggregating the weighted contributions from all candidate roots and decoding paths. Empirically, GR-M3D outperforms fixed-graph baselines such as star and tree decoding, and has achieved state-of-the-art results on Human3.6M, MuPoTS-3D, and CMU Panoptic.

Table 14 summarizes representative methods for robustness to occlusion and crowded scenes, covering both 2D crowd-pose estimation and multi-person 3D pose estimation.

7. Contextual Extension: Time, Space, and Modality

7.1. Exploiting Temporal Dynamics in Video

Thus far, our analysis has focused on static imagery; however, because individual images contain no motion cues, they are fundamentally ambiguous. Video sequences provide substantially richer information in the form of temporal dynamics. Temporal context and continuity of motion are essential for resolving depth ambiguity (e.g., in the transition from 2D to 3D), managing temporary occlusions, and ensuring anatomically plausible motion.

7.1.1. Early Temporal Models: RNN and Convolutional Approaches

Early attempts to model video sequences were based on sequential architectures. Recurrent neural networks (RNNs), in particular LSTMs, have been used to propagate hidden states between frames. Luo et al. [164] proposed the LSTM pose machine, adapting the concept of multi-stage refinement by replacing spatial stages with temporal LSTM units. Liu et al. [165] used an approach based on structured space learning, with dual streams for spatial features and temporal information. However, RNNs suffer from limited parallelization. A more efficient alternative has emerged in the form of temporal convolutional networks (TCNs), which use dilated 1D convolutions to obtain large temporal receptive fields. The seminal work of Pavllo et al. [166] (VideoPose3D) demonstrated that a simple TCN could outperform RNNs in 2D-to-3D lifting while also being significantly faster. Later work improved on this paradigm: temporal coherence exploration (TCE) blocks were introduced by Li et al. [167] to align adjacent frames, while Liu et al. [168] used TCNs with attention mechanisms to weigh the relevance of individual frames within the temporal window.
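To make the link between dilation and temporal receptive field concrete, the short helper below computes the receptive field of a stack of stride-1 dilated 1D convolutions. The kernel widths and dilation factors in the usage example are illustrative assumptions and not a configuration quoted from [166].

```python
def receptive_field(kernel_sizes, dilations):
    # Receptive field (in frames) of stacked stride-1 dilated 1D convolutions:
    # each layer with kernel k and dilation d adds (k - 1) * d frames.
    rf = 1
    for k, d in zip(kernel_sizes, dilations):
        rf += (k - 1) * d
    return rf

# Five layers of width 3 whose dilation grows by a factor of 3 per layer.
print(receptive_field([3] * 5, [1, 3, 9, 27, 81]))  # 243
```

With exponentially growing dilations, the receptive field grows exponentially with depth, whereas the parameter count grows only linearly.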

7.1.2. Spatiotemporal Transformers and the State-Space Era

While TCNs broaden the temporal receptive field, self-attention provides a natural way to model global dependencies across both space (joints) and time (frames).

Space–Time Transformer. Before transformers were fully adopted, Lin and Lee [169] proposed factorizing the problem in trajectory space using the discrete cosine transform (DCT). However, the direct application of self-attention proved transformative. Zheng et al. [170] established a new baseline with PoseFormer, which treats video as a sequence of tokens (joint × frame). By sequentially applying spatial and temporal attention, it significantly outperforms TCN baselines. Subsequent architectures optimized computation to handle the quadratic complexity of attention over long sequences. To shorten sequences, Li et al. [171] introduced Strided Transformer, which replaces the fully-connected feed-forward layers in transformers with strided temporal convolutions that downsample 2D pose sequences and efficiently aggregate temporal context. Tang et al. [172] proposed STCFormer, which uses a spatiotemporal criss-cross attention block. By decomposing attention into separate spatial (within-frame joint interactions) and temporal (across-frame joint trajectories) components, it can efficiently model spatiotemporal correlations for 3D pose estimation in videos. Zhang et al. [173] introduced MixSTE, which alternates spatial and temporal transformer blocks to separately encode inter-joint spatial correlations and joint-wise temporal motion. Hassanin et al. [174] proposed CrossFormer, which uses dedicated modules for inter-joint and inter-frame interactions, enabling richer spatiotemporal modeling of joint dynamics across video frames.
Structural Enrichment and Hybrids. Researchers soon found that pure transformers often ignored anatomical structure. This prior was reintroduced through hierarchical designs. Relationships at the local (joint), regional (limb), and global (body) scales have been explicitly modeled by Wei et al. [175] (PGFormer) and Qian et al. [176] (HSTFormer). Through directed graphs, Chen et al. [177] expanded this to high-level dependencies such as joint–hyperbone. Leap clustering was used on the skeleton graph by Zhai et al. [178]. Furthermore, multi-level structures have been proposed to improve temporal features: RTPCA-Transformer [179] uses a pyramidal compression and amplification structure, while DC-GCT [180] incorporates double-chain constraints to jointly model local and global dependencies.

Hybrid topologies that integrate GCNs and transformers have also gained popularity. Mehraban et al. [181] (MotionAGFormer) and Yu et al. [182] (GLA-GCN) merged local kinematic modeling via GCNs with global context via transformers. Others have integrated kinematic priors directly: Peng et al. [183] used trajectory-guided attention, while Li et al. [184] combined skeletal attention with MLP mixers. Innovations have also been extended to ambiguity management and efficiency: Li et al. [185] (MHFormer) adapted transformers to generate multiple pose hypotheses, Liu et al. [186] (TCPFormer) introduced implicit pose proxies to compress temporal correlations so that accuracy improves as the number of input frames grows, Lutz et al. [187] proposed a self-correction mechanism in which the network predicts its own error, and Qiu et al. [188] (IVT) proposed an end-to-end video transformer that samples tokens directly from image features, bypassing off-the-shelf 2D detectors.

State Space Models (Mamba): An Emerging Alternative. For modeling long sequences, state space models (SSMs) such as Mamba have recently emerged as a linear-complexity alternative to transformers. Feng et al. [189] proposed GLSMamba, using selective 6D scanning to decouple learning into a global spatiotemporal Mamba and a local refinement Mamba. Similarly, Lu et al. [190] presented SAMA, a structure-aware SSM that integrates skeleton topology via learnable adjacency matrices, demonstrating promising results for long-term temporal modeling.
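The linear complexity of SSMs follows from their recurrent form, which the sketch below illustrates with a plain discrete, time-invariant state space model. This is a simplification: Mamba-style models make the input and output maps input-dependent ("selective") and use a hardware-aware parallel scan, neither of which is reproduced here, and the matrices `A`, `B`, and `C` are generic placeholders.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    # x: (T, d_in) input sequence; A: (n, n); B: (n, d_in); C: (d_out, n).
    # Recurrence: h_t = A h_{t-1} + B x_t,  y_t = C h_t.
    # One state update per time step, so the cost is linear in T.
    h = np.zeros(A.shape[0])
    outputs = []
    for x_t in x:
        h = A @ h + B @ x_t
        outputs.append(C @ h)
    return np.stack(outputs)  # (T, d_out)
```

In contrast to self-attention, whose cost grows quadratically with the number of frames, the hidden state `h` summarizes the entire past in a fixed-size vector.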

Table 15 summarizes representative temporal 2D-to-3D lifting models, including temporal convolutional, transformer-based, and state-space approaches.

7.1.3. Motion-Centric and Kinematic-Aware Models

In parallel with architectural changes, a stream of research has focused on enforcing physical and kinematic plausibility. The first approaches integrated physics and kinematics as loss functions: Dabral et al. [191] penalized invalid joint angles, while Gupta [192] and Wang et al. [193] added constraints on the velocity, acceleration, and coherence of virtual bones. Later work integrated kinematics into the network design. Wang et al. [194] used pairwise motion coding as the main supervisory signal. Xu et al. [195] decomposed prediction into bone length and direction, using Bi-LSTMs to supplement unreliable trajectories. Jin et al. [196] introduced velocity and acceleration directly into the network. Li et al. [197] used a two-stream architecture merging RGB and optical flow. Finally, Jeong et al. [198] (SoloPose) and Zhang et al. [199] have designed dynamic graph networks that adapt graph topology to specific motion patterns.
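Loss terms of the kind described above can be sketched generically as follows. These functions are illustrative and do not reproduce the exact formulations of [191,192,193]; the skeleton given as a list of `(parent, child)` index pairs and the equal weighting of all joints are assumptions.

```python
import numpy as np

def bone_lengths(poses, bones):
    # poses: (T, J, 3) sequence of 3D poses; bones: list of (parent, child) joint indices.
    parents, children = zip(*bones)
    return np.linalg.norm(poses[:, list(children)] - poses[:, list(parents)], axis=-1)  # (T, B)

def bone_length_variance(poses, bones):
    # Penalizes bones whose length changes over time (zero for a rigid skeleton).
    lengths = bone_lengths(poses, bones)
    return np.mean((lengths - lengths.mean(axis=0, keepdims=True)) ** 2)

def velocity_error(pred, target):
    # Mean distance between predicted and reference per-joint frame-to-frame displacements.
    return np.mean(np.linalg.norm(np.diff(pred, axis=0) - np.diff(target, axis=0), axis=-1))

def acceleration_penalty(poses):
    # Mean squared second-order finite difference, discouraging jittery motion.
    acc = np.diff(poses, n=2, axis=0)
    return np.mean(np.sum(acc ** 2, axis=-1))
```

Such terms are typically added to the main position loss with small weights, which makes them soft constraints: they discourage implausible motion but do not rule it out.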

7.2. Resolving Ambiguity with Multiple Views and Sensors

While monocular estimation is inherently ambiguous, multi-view configurations and alternative sensors provide direct geometric cues to solve the depth problem.

7.2.1. Multi-View Geometry: From Triangulation to End-to-End Fusion

The use of multiple cameras is the gold standard for metric accuracy. The challenge is to merge the information efficiently, whether the cameras are calibrated or not.

Learnable Triangulation and Fusion. Early deep learning methods treated fusion as postprocessing (algebraic triangulation). Iskakov et al. [128] proposed learnable triangulation, replacing algebraic operations with differentiable volumetric back-projection. Others merged features earlier in the pipeline: Qiu et al. [200] produced per-view 2D heatmaps and then fused these across views before performing 3D pose reconstruction, while Zhang et al. [201] (AdaFuse) used an adaptive weighting scheme based on epipolar geometry to manage occlusion. Xie et al. [202] used meta-learning to adapt fusion weights to new camera configurations. Transformers have also been adapted to this domain. Moliner et al. [203] injected epipolar constraints directly into the attention mechanism. Liao et al. [204] combined transformers with classical geometry modules to enable backpropagation of 3D errors to 2D detectors. In pursuit of efficiency, Remelli et al. [205] proposed a canonical fusion in a disentangled camera space. Recently, Chharia et al. [206] proposed MV-SSM, a multi-view state space modeling framework for efficient and robust multi-view fusion that explicitly models the joint spatial arrangement across views to improve generalization across camera setups. Meanwhile, Luvizon et al. [207] investigated a consensus-based optimization strategy for multi-view pose estimation. By combining per-view 3D predictions (depth + 2D joints) and optimizing for a globally consistent 3D pose in camera coordinates, their method refines multi-view estimates without relying on explicit volumetric grids.
Uncalibrated and Parameter-Free Approaches. The requirement for accurate calibration is a major bottleneck. To address it, research has turned to uncalibrated settings. Davoodnia et al. [208] (UPose3D) and Jiang et al. [209] jointly estimated pose and camera parameters and refined them iteratively. Gordon et al. [210] (FLEX) and Xu and Kitani [211] proposed predicting view-invariant quantities (e.g., bone length) in order to reconstruct motion without extrinsic parameters. By processing an arbitrary number of uncalibrated views, Adaptive Multi-View Transformer [212] learns relative geometry through attention. Li et al. [213] expanded uncalibrated methods to depth cameras by guiding camera pose estimation with point clouds. Self-supervised techniques [214,215] train on unlabeled data by exploiting multi-view consistency.
Tracking and Stereo. Specialized solutions such as dual-Diffusion [216], which jointly denoises 2D keypoint uncertainty and 3D pose uncertainty under a binocular (two-view) setup, focus on stereoscopic configurations to improve the robustness of 3D human pose estimation from noisy 2D detections. For multi-view tracking, Reddy et al. [217] (TesseTrack) and Zhang et al. [218] (VoxelTrack) used 4D volumes to connect poses in space and time. Efficient recurrent models such as TEMPO [219] maintain temporal hidden states for every subject.
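The algebraic triangulation used as a postprocessing baseline in calibrated setups can be sketched with the classical direct linear transform (DLT). The optional per-view `weights` argument is an illustrative stand-in for confidence weighting; it is not a reproduction of the learned weighting in [128].

```python
import numpy as np

def triangulate_dlt(proj_mats, points_2d, weights=None):
    # proj_mats: (V, 3, 4) camera projection matrices; points_2d: (V, 2) detections
    # of the same joint in each of the V views. Returns the 3D point, shape (3,).
    n_views = len(proj_mats)
    w = np.ones(n_views) if weights is None else np.asarray(weights, dtype=float)
    rows = []
    for P, (u, v), w_i in zip(proj_mats, points_2d, w):
        # Each view contributes two linear constraints on the homogeneous point X.
        rows.append(w_i * (u * P[2] - P[0]))
        rows.append(w_i * (v * P[2] - P[1]))
    A = np.stack(rows)                     # (2V, 4)
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]                             # right singular vector of the smallest singular value
    return X[:3] / X[3]                    # dehomogenize
```

Because the solution is obtained from a singular value decomposition, the operation is differentiable almost everywhere, which is what allows such a step to be embedded in an end-to-end trainable pipeline.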

7.2.2. Leveraging Depth and Alternative Sensors

Sensors offering direct depth or high temporal resolution fundamentally change the problem landscape.

RGB-D Approaches. Depth sensors eliminate scale ambiguity, but introduce noise and blending problems. Zimmermann et al. [220] proposed projecting 2D heatmaps into a 3D volume fused with depth occupancy grids. Zhang et al. [221] transferred depth knowledge to RGB grids via cross-modality distillation. Other research has developed decoupled architectures such as PoP-Net [222] and Residual Pose [223] to enhance 3D predictions using depth maps. To get around hardware limitations, Szczuko [224] focused on very low-resolution depth sensors, building massive synthetic datasets to train efficient MobileNet backbones for degraded inputs.
Event Cameras and Specialized Geometries. Fisheye cameras introduce significant distortion despite their wide fields of view. Aso et al. [225] and Zhang et al. [226] addressed this problem by rectifying images or using dual-branch networks to separate root localization from relative pose estimation. Event-driven cameras, which capture changes in brightness asynchronously, excel in high-speed scenarios; Goyal et al. [227] (MoveEnet) trained event-based networks by distillation from standard RGB models, while Lang and Chuah [228,229] integrated events with RGB using cross-attention (EVT) and Mamba (CA-MambaPose) to improve 3D estimation under challenging lighting and motion conditions.

Table 16 summarizes multi-view geometry methods, including calibrated fusion and uncalibrated or parameter-free approaches, evaluated on Human3.6M.

7.3. Biophysical Constraints: From Temporal Smoothness to Physical Plausibility

The temporal models of Section 7.1 and the physics-aware methods of Section 8.3.3 are typically treated as separate strands, one concerned with smooth motion across frames and the other with gravity, contact, and anatomical consistency. However, they address the same underlying requirement, namely, that a predicted pose sequence should be biophysically plausible. Bone length consistency, in particular, is enforced as a soft loss in several temporal models [191,192,193] and as a hard kinematic constraint in physics-based methods [39,230], even though the two literatures rarely cite each other. Therefore, we read both lines of work as instances of a single biophysical constraint axis with three escalating levels of strictness.

1. Statistical priors. This approach penalizes deviations in bone length, joint angle range, or per-frame velocity from the distribution observed in MoCap data. This covers most TCN- and transformer-based temporal models, which typically improve temporal smoothness and reduce implausible jitter but do not guarantee physical validity.
2. Kinematic constraints. Exact bone length and joint angle limits can be enforced through forward-kinematics layers, bone direction-plus-length decompositions [70], or inverse-kinematics postprocessing [230]. This rules out anatomically impossible configurations, but can still permit physically impossible motions such as floating above the ground plane, foot skating, or passing through scene geometry.
3. Dynamic constraints. This approach enforces physical plausibility at the motion level by considering gravity, ground contact, friction, balance, and rigid-body dynamics. Instead of only asking whether each predicted skeleton is anatomically valid or temporally smooth, these methods ask whether the full motion sequence could be physically executed in the scene. In this setting, foot contact is not merely a visual cue but a physical constraint: when a foot is predicted to be in contact, its position should remain on the local support surface, should not penetrate the scene, and should not slide unrealistically. For non-flat terrain, this requires replacing the flat-ground assumption with scene geometry, for example by querying a height map $h (x, z)$ that gives the support-surface height at each horizontal location. PhysDynPose [39] is a recent example of this direction. It combines a kinematic pose estimator with camera-motion estimation, then refines the resulting motion using a scene-aware physics optimizer with contact, friction-cone, no-sliding, and root-drift constraints.

Level (3) provides the strongest form of constraint because it evaluates pose sequences as physical motions rather than as independent skeletons or merely smooth trajectories. A dynamically plausible sequence should preserve body structure, maintain realistic contact with the ground, obey gravity, and avoid artifacts such as floating, foot skating, or physically impossible accelerations. This does not replace image evidence or anatomical constraints; rather, it complements them by filtering out hypotheses that may have a low per-frame pose error while remaining physically implausible.
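To make the terrain-aware contact constraint of level (3) concrete, the sketch below evaluates penetration and floating residuals for a foot trajectory against a height map $h (x, z)$. It is an illustrative assumption rather than the formulation of PhysDynPose or any other cited method: the sloped `height_map`, the y-up axis convention, and the function names are hypothetical.

```python
import numpy as np

def height_map(x, z):
    """Hypothetical terrain h(x, z): a slope rising 0.1 m per metre along x."""
    return 0.1 * np.asarray(x, dtype=float) + 0.0 * np.asarray(z, dtype=float)

def contact_residuals(foot, in_contact, h=height_map):
    """Per-frame contact residuals for one foot.

    foot:       (T, 3) positions as (x, y, z), with y pointing up (assumed).
    in_contact: (T,) boolean contact labels.
    Returns (penetration, floating), each of shape (T,), in metres.
    """
    foot = np.asarray(foot, dtype=float)
    in_contact = np.asarray(in_contact, dtype=bool)
    # Signed height of the foot above the local support surface.
    gap = foot[:, 1] - h(foot[:, 0], foot[:, 2])
    # The foot must never be below the surface, in contact or not.
    penetration = np.maximum(-gap, 0.0)
    # A foot labelled as in contact must not hover above the surface.
    floating = np.where(in_contact, np.maximum(gap, 0.0), 0.0)
    return penetration, floating

foot = np.array([[0.0, 0.00, 0.0],
                 [1.0, 0.05, 0.0],
                 [1.0, 0.30, 0.0],
                 [2.0, 0.50, 0.0]])
in_contact = np.array([True, True, False, True])
penetration, floating = contact_residuals(foot, in_contact)
```

A refinement stage could add such residuals, squared and weighted, to its objective; with a flat-ground assumption the second frame would incorrectly count as valid, since its height is positive but lies below the slope.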

This reframing has practical consequences. Methods that operate at level (3) may reduce the need for manually designed smoothness losses, because temporal coherence can emerge from the physical constraints themselves. Conversely, methods that only enforce level (1) can improve visual smoothness, but by construction cannot eliminate impossible motion failures such as drifting trajectories, floating bodies, or sliding feet. This distinction is especially important on benchmarks with moving camera or scene geometry annotations such as MoviCam and SLOPER4D, where the per-frame MPJPE can remain low even when the global motion is physically unrealistic.

Therefore, we argue that future temporal models should be evaluated not only on per-frame MPJPE and temporal smoothness metrics but also on physically grounded measures such as ground contact accuracy, foot skate rate, centre-of-mass trajectory error, and bone length variance over time. Adopting such a unified evaluation would make the two literatures directly comparable, and in our view would accelerate the convergence of temporal and physics-aware HPE as a primary frontier.
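Two of the physically grounded measures named above are simple to compute from a predicted sequence. The sketch below is a minimal illustration, not a standardized protocol: the function names, the y-up axis convention, and the 1 cm skate tolerance `tol` are assumptions.

```python
import numpy as np

def bone_length_variance(seq, bones):
    """Variance over time of each bone length.

    seq:   (T, J, 3) joint positions.
    bones: list of (child, parent) joint index pairs.
    """
    child = seq[:, [c for c, _ in bones]]
    parent = seq[:, [p for _, p in bones]]
    lengths = np.linalg.norm(child - parent, axis=-1)  # (T, B)
    return lengths.var(axis=0)

def foot_skate_rate(foot, in_contact, tol=0.01):
    """Fraction of contact-to-contact transitions with horizontal motion > tol.

    foot:       (T, 3) positions as (x, y, z), with y pointing up (assumed).
    in_contact: (T,) boolean contact labels.
    """
    foot = np.asarray(foot, dtype=float)
    in_contact = np.asarray(in_contact, dtype=bool)
    step = np.linalg.norm(np.diff(foot[:, [0, 2]], axis=0), axis=1)
    both = in_contact[1:] & in_contact[:-1]
    if not both.any():
        return 0.0
    return float((step[both] > tol).mean())

# One bone whose length drifts from 1 m to 3 m over three frames.
seq = np.zeros((3, 2, 3))
seq[:, 1, 0] = [1.0, 2.0, 3.0]
variance = bone_length_variance(seq, bones=[(1, 0)])

# A planted foot that jumps 0.5 m sideways between two contact frames.
foot = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0],
                 [0.5, 0.0, 0.0], [0.5, 0.0, 0.0]])
skate = foot_skate_rate(foot, np.ones(4, dtype=bool))
```

A sequence with exact kinematic constraints yields zero bone length variance by construction, so this measure mainly separates level (1) methods from levels (2) and (3), while the skate rate separates level (3) from the other two.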

8. Efficiency, Unification, and the Expanding Frontier

8.1. The Push for Efficiency and Real-World Deployment

Whereas the earlier sections concentrated on accuracy and robustness, a parallel research track tackles the computational demands of deployment. Moving from heavy research models to resource-constrained devices (mobile, robotics, AR) requires fundamentally redesigning architectures.

8.1.1. Lightweight Structures

Redesigning backbones to maximize the tradeoff between accuracy and FLOPs is the first approach. Yu et al. [231] proposed Lite-HRNet, which adapts the high-resolution network paradigm for efficiency by replacing computationally expensive $1 \times 1$ convolutions with lightweight conditional channel weighting units. Han and Wang [232] (Greit-HRNet) used grouped channel weighting to further stabilize this idea, and Li et al. [233] (Dite-HRNet) used dynamic convolutions to adaptively capture global context.

Other works have built efficient architectures from scratch. Zhang et al. [234] developed fast human pose estimation, which distills knowledge from heavy teachers to lightweight students. In the transformer domain, Diaz-Arias and Shin [235] introduced ConvFormer, which substitutes dynamic multi-headed convolutional attention for conventional self-attention in order to reduce the number of parameters. In a similar vein, Sun et al. [236] presented MixSynthFormer, which uses MLP-based synthetic self-attention: it generates spatial and temporal attention matrices via lightweight linear layers to efficiently model inter-joint and inter-frame dependencies for human pose estimation.

Real-time constraints have also driven methodological shifts. Jiang et al. [237] achieved exceptional speed with RTMPose by adopting a classification-based regression paradigm (SimCC). For video, Zeng et al. [238] proposed DeciWatch, a “sample–denoise–recover” framework that runs heavy estimation on only $\sim 10\%$ of frames and interpolates the rest. Xu et al. [239] (DynPose) and Zhang et al. [240] pushed this further with dynamic routing, using lightweight modules to identify difficult frames that require full computation.
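The saving from sparse sampling can be illustrated with a deliberately simplified stand-in. The sketch below runs a pose estimator on every tenth frame and fills the gaps by linear interpolation; it is not DeciWatch, whose denoising and recovery stages are learned networks. The names `sample_and_recover` and `estimate_pose` and the toy estimator are hypothetical.

```python
import numpy as np

def sample_and_recover(estimate_pose, num_frames, stride=10):
    """Estimate poses on a sparse set of frames and interpolate the rest.

    estimate_pose: callable mapping a frame index to a (J, D) pose array.
    Returns (poses, key_frames), where poses has shape (num_frames, J, D).
    """
    # Visit every `stride`-th frame, plus the last one so the tail is covered.
    key = np.unique(np.append(np.arange(0, num_frames, stride), num_frames - 1))
    key_poses = np.stack([estimate_pose(int(t)) for t in key])  # (K, J, D)
    flat = key_poses.reshape(len(key), -1)
    frames = np.arange(num_frames)
    # Linear interpolation of each coordinate independently (a stand-in for
    # the learned recovery stage).
    full = np.stack(
        [np.interp(frames, key, flat[:, i]) for i in range(flat.shape[1])],
        axis=1,
    )
    return full.reshape(num_frames, *key_poses.shape[1:]), key

calls = []

def toy_estimator(t):
    """Hypothetical heavy estimator: two joints moving linearly in time."""
    calls.append(t)
    return np.full((2, 2), float(t))

poses, key_frames = sample_and_recover(toy_estimator, num_frames=25, stride=10)
```

Here the heavy estimator is called on 4 of 25 frames. Linear interpolation is exact only because the toy motion is linear; real motion and noisy detections are what motivate learned denoising and recovery.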

8.1.2. Distillation, Quantization, and NAS

Beyond manual design, model compression is a key strategy. Knowledge distillation (KD) is a widely used approach; Hwang et al. [241] trained MoVNect (based on MobileNetV2) by mimicking both the output heatmaps and intermediate features of a larger teacher network. Bulat et al. [242] investigated extreme compression via binary neural networks (BNNs), employing KD to stabilize the training of networks with single-bit weights and activations.
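A distillation objective of the kind described above, combining heatmap mimicry with feature mimicry, can be sketched as follows. This is a generic formulation under stated assumptions, not the loss of Hwang et al. [241]: the weights `alpha` and `beta` are hypothetical, and student and teacher features are assumed to share a shape (in practice a small projection layer is often needed to match them).

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two arrays of the same shape."""
    return float(np.mean((np.asarray(a) - np.asarray(b)) ** 2))

def distillation_loss(student_hm, teacher_hm, gt_hm,
                      student_feat, teacher_feat,
                      alpha=0.5, beta=0.1):
    """Weighted sum of task loss, heatmap mimicry, and feature mimicry.

    alpha and beta are illustrative weights, not values from the literature.
    """
    task = mse(student_hm, gt_hm)            # supervised heatmap loss
    mimic = mse(student_hm, teacher_hm)      # match the teacher's output
    feat = mse(student_feat, teacher_feat)   # match intermediate features
    return (1.0 - alpha) * task + alpha * mimic + beta * feat

# Toy tensors: 17 joints, 8x8 heatmaps, 32-channel features.
gt = np.ones((17, 8, 8))
teacher = np.ones((17, 8, 8))
student = np.zeros((17, 8, 8))
features = np.ones((32, 8, 8))
loss_untrained = distillation_loss(student, teacher, gt, features, features)
loss_perfect = distillation_loss(gt, teacher, gt, features, features)
```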

Neural architecture search (NAS) automates the design process. Xu et al. [243] applied NAS to video pose estimation (ViPNAS), simultaneously optimizing spatial and temporal modules. They later extended this to whole-body estimation with ZoomNAS [244], discovering a hierarchical structure that dynamically allocates capacity between the body, face, and hands.

8.1.3. Acceleration via Compressed Domain and Pruning

A third avenue exploits data redundancy. Liu et al. [245] proposed operating directly in the compressed video domain. By using motion vectors from the video stream to propagate poses, they were able to achieve $5 \times$ acceleration.
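The idea of propagating poses with codec motion vectors can be sketched in a few lines. This is an illustrative simplification, not the method of Liu et al. [245]: the 16-pixel block size, the array layout, and the forward (previous-to-current) displacement convention are assumptions, and real codecs typically store vectors pointing from the current block back to its reference.

```python
import numpy as np

def propagate_keypoints(keypoints, motion_vectors, block=16):
    """Shift keypoints by the motion vector of the block that contains them.

    keypoints:      (J, 2) pixel coordinates (x, y) in the previous frame.
    motion_vectors: (H_blocks, W_blocks, 2) displacement (dx, dy) per block,
                    assumed to point from the previous to the current frame.
    """
    kp = np.asarray(keypoints, dtype=float)
    rows = np.clip((kp[:, 1] // block).astype(int), 0, motion_vectors.shape[0] - 1)
    cols = np.clip((kp[:, 0] // block).astype(int), 0, motion_vectors.shape[1] - 1)
    return kp + motion_vectors[rows, cols]

# A 32x32 frame split into four 16x16 blocks; only the top-right block moves.
mv = np.zeros((2, 2, 2))
mv[0, 1] = [3.0, -2.0]
previous = np.array([[20.0, 5.0],    # inside the top-right block
                     [5.0, 20.0]])   # inside the bottom-left block
current = propagate_keypoints(previous, mv)
```

No network inference is needed on the propagated frame, which is where the speed-up comes from; drift accumulates over time, so a full estimate on periodic key frames is still required.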

For transformers, redundancy lies in tokens that carry little relevant information. Ma et al. [246] introduced the token-pruned pose transformer (PPT), which detects and eliminates background tokens before the expensive self-attention layers. Li et al. [247] extended this idea to the temporal domain with the hourglass tokenizer (HoT), which processes only a small number of informative frame tokens and then uses cross-attention to recover the whole sequence.
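Because self-attention cost grows quadratically with the number of tokens, discarding tokens early pays off disproportionately. The sketch below shows a generic top-k selection under stated assumptions; it is not the selection rule of PPT or HoT, and the importance `scores` (for example, attention received from keypoint tokens) are taken as given.

```python
import numpy as np

def prune_tokens(tokens, scores, keep_ratio=0.3):
    """Keep the highest-scoring tokens, preserving their original order.

    tokens: (N, C) token embeddings.
    scores: (N,) importance scores, assumed to be provided by the model.
    Returns (kept_tokens, kept_indices).
    """
    n_keep = max(1, int(round(keep_ratio * len(tokens))))
    keep = np.sort(np.argsort(scores)[-n_keep:])
    return tokens[keep], keep

def attention_cost_ratio(n_before, n_after):
    """Relative cost of pairwise attention after pruning (quadratic in N)."""
    return (n_after / n_before) ** 2

tokens = np.arange(20, dtype=float).reshape(10, 2)
scores = np.array([0.1, 0.9, 0.2, 0.8, 0.0, 0.7, 0.3, 0.05, 0.04, 0.03])
kept, kept_idx = prune_tokens(tokens, scores, keep_ratio=0.3)
ratio = attention_cost_ratio(len(tokens), len(kept))
```

Keeping 30% of the tokens reduces the pairwise attention term to roughly 9% of its original cost, ignoring the overhead of computing the scores.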

8.2. The Unification Era

Moving away from task specialization, a significant trend aims to unify pose estimation with related tasks across heterogeneous datasets, and ultimately with foundation models.

8.2.1. Joint Learning with Related Tasks

The synergy between pose and high-level understanding is leveraged in multitask learning. Pham et al. [248] and Luvizon et al. [249,250] combined pose estimation with action recognition, showing that shared representations can enhance both tasks. Similarly, Ahmad et al. [251] proposed PosePlusSeg, a bottom-up network that concurrently predicts keypoints and instance segmentation masks from a single backbone.

8.2.2. Training with Diverse Datasets

One of the biggest engineering challenges is unifying diverse datasets with distinct skeleton definitions. To overcome this challenge, Sarandi et al. [252] used a geometry-aware autoencoder to learn a common latent space that “translates” many skeleton formats into a single representation. Jeong et al. [253] proposed PoseBH, which uses nonparametric keypoint prototypes in a unified embedding space and matches pixels to these prototypes in a metric space. This approach enables a single network to support multiple skeleton types (human, animal, whole-body, etc.). Predictions are aligned to the prototype embeddings via cross-type self-supervision, allowing for multi-dataset training despite heterogeneous annotations.

8.2.3. Foundation Models

Motivated by LLMs, research is currently shifting towards large-scale foundation models (see Figure 7). P-STMO [254] and MotionBERT [255] introduced pretraining tasks in which the network must learn robust motion priors in order to recover masked joints or frames.
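The masked-recovery pretext task can be sketched as a corruption step plus a loss restricted to the hidden entries. This is a generic illustration rather than the recipe of P-STMO or MotionBERT: the 25% mask ratio, the zero-filling of masked joints (instead of a learned mask token), and the function names are assumptions.

```python
import numpy as np

def mask_joints(seq, mask_ratio=0.25, rng=None):
    """Randomly hide a fraction of (frame, joint) entries.

    seq: (T, J, D) pose sequence.
    Returns (corrupted, mask), where mask is (T, J) and True marks hidden entries.
    """
    rng = np.random.default_rng() if rng is None else rng
    T, J, _ = seq.shape
    n_mask = int(round(mask_ratio * T * J))
    hidden = rng.choice(T * J, size=n_mask, replace=False)
    mask = np.zeros(T * J, dtype=bool)
    mask[hidden] = True
    mask = mask.reshape(T, J)
    # Zero-filling is an assumption; a learned mask token is a common alternative.
    corrupted = np.where(mask[..., None], 0.0, seq)
    return corrupted, mask

def masked_reconstruction_loss(pred, target, mask):
    """Mean squared error computed on the hidden entries only."""
    return float(np.mean((pred[mask] - target[mask]) ** 2))

seq = np.ones((8, 4, 2))
corrupted, mask = mask_joints(seq, mask_ratio=0.25, rng=np.random.default_rng(0))
loss_identity = masked_reconstruction_loss(corrupted, seq, mask)
loss_perfect = masked_reconstruction_loss(seq, seq, mask)
```

A model that merely copies its corrupted input incurs the full loss on the hidden entries, so it can only lower the loss by inferring them from the visible joints and neighboring frames.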

These efforts build on Vision Transformers (ViT). Simple ViT encoders scale remarkably well with data, as shown by Xu et al. [257] (ViTPose++). To enable comprehensive human-centric perception, models such as Hulk [256] and UniHCP [258] are pretrained on large aggregations of datasets covering pose, detection, and mesh recovery. Dabhi et al. [259] presented 3D-LFM, a unified object-agnostic 2D-to-3D lifting model that uses a permutation-equivariant transformer and a Procrustean alignment scheme to infer 3D structure for diverse articulated objects (humans, animals, etc.) from 2D landmarks alone, without the need for object-specific templates. Jiang et al. [260] (UniHPE) unified the input modalities themselves, aligning RGB, 2D pose, and 3D pose in a common contrastive embedding space.

8.3. The Human-Centric Frontier

Current research is expanding the scope of HPE beyond the core skeleton to include whole-body understanding, specific applications, and physical grounding.

8.3.1. Whole-Body and Fine-Grained Pose

Capturing the hands, feet, and face (whole-body pose) introduces extreme scale variation. Jin et al. [35] proposed the COCO-WholeBody benchmark and the coarse-to-fine ZoomNet architecture. Fang et al. [261] (AlphaPose) integrated whole-body estimation into a real-time tracking pipeline using symmetric integral regression. To handle small parts such as fingers with minimal computational overhead, Jiang et al. [262] (RTMW) and Samet and Akbas [263] (HPRNet) used hierarchical regression techniques.

8.3.2. Application-Driven Estimation

Biomechanics and Sports. Sports analytics requires high precision in rapid motion. The sports data domain gap was highlighted by early studies [264]. Two examples of solutions are real-time filtering for simulators [265] and geometry-based association for multi-view tracking [266]. Jiang and Xia [267] (PCNet) addressed extremity localization in fast motion, whereas Baumgartner and Klatt [268] employed field registration for uncalibrated broadcast video. The publication of AthletePose3D [36] confirmed that fine-tuning on domain-specific data significantly reduces error. By connecting computer vision and biomechanics, Koleini et al. [230] (BioPose) combined mesh recovery with inverse kinematics in order to guarantee anatomical correctness.
Inclusivity and Privacy. Ying et al. [37] introduced LDPose, a benchmark for individuals with limb deficiencies, and proposed metrics that handle a range of morphologies in order to avoid discrimination. For privacy, Huang et al. [269] developed a recoverable anonymization framework for pose estimation in which a privacy-enhancing module, pose estimator, and recovery module are jointly learned, enabling accurate pose estimation on anonymized images (with identity obscured) while still allowing for authorized recovery of the original images. Akada et al. [270] addressed the excessive self-occlusion in egocentric (VR) views by augmenting head-mounted displays with rear-facing cameras, and introduced a transformer-based multi-view fusion method that refines 2D joint heatmaps using both front and rear views (with heatmap uncertainty). This mitigates self-occlusion and improves 3D pose estimation for egocentric VR.

8.3.3. Combining Absolute Positioning with Physics

Most 3D HPE methods estimate pose in a root-relative coordinate system, which is sufficient for evaluating body configuration but insufficient for scene-aware applications. Real-world deployment often requires recovering the person’s absolute position, scale, and depth with respect to the camera or surrounding scene. This has motivated methods that combine pose regression with camera geometry, reprojection consistency, root-depth estimation, and physical grounding.

Several works have introduced geometric or anatomical constraints as a way to improve physical plausibility. Matsune et al. [271] and Hsu and Jang [272] proposed losses that encourage valid orientation and bone length consistency. Other methods have combined regression with optimization. Joo et al. [273] proposed exemplar fine-tuning, in which a pretrained regression model is optimized at test time to minimize 2D reprojection error, thereby approximating a fitting process. Aytekin et al. [39] extended this direction with PhysDynPose, which incorporates gravity and ground contact constraints into the optimization loop.

Absolute positioning has also been addressed through explicit camera-aware formulations. Chang et al. [274] proposed PoseLifter, which lifts 2D joint detections to absolute 3D using a learned regression network. Kim et al. [275] introduced PoseAnchor, which refines predicted 3D poses through robust root-position estimation to recover absolute depth. Hao and Li [276] encoded camera intrinsic parameters as input maps, while Zhan et al. [277] proposed Ray3D, which replaces pixel coordinates with normalized 3D rays. Wang et al. [278] further enforced geometric consistency through a reprojection loop.
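The motivation for ray-based inputs is that a pixel coordinate only has geometric meaning together with the camera intrinsics, whereas a viewing ray is already expressed in the camera frame. The sketch below shows standard pinhole back-projection and projection; it is not the Ray3D implementation, and the intrinsic matrix `K` and the example points are hypothetical.

```python
import numpy as np

def pixels_to_rays(uv, K):
    """Back-project pixels to unit viewing rays in the camera frame.

    uv: (J, 2) pixel coordinates; K: (3, 3) intrinsic matrix.
    """
    uv1 = np.concatenate([uv, np.ones((len(uv), 1))], axis=1)  # homogeneous
    rays = uv1 @ np.linalg.inv(K).T
    return rays / np.linalg.norm(rays, axis=1, keepdims=True)

def project(points, K):
    """Pinhole projection of (J, 3) camera-frame points to pixel coordinates."""
    p = points @ K.T
    return p[:, :2] / p[:, 2:3]

# Hypothetical intrinsics: 1000 px focal length, principal point (640, 360).
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
points = np.array([[0.0, 0.0, 2.0],
                   [0.5, -0.2, 4.0]])
uv = project(points, K)
rays = pixels_to_rays(uv, K)
directions = points / np.linalg.norm(points, axis=1, keepdims=True)
```

Each recovered ray matches the direction of the original 3D point, so the remaining unknown per joint is a single depth along the ray. The same round trip underlies reprojection-consistency losses.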

Together, these methods extend HPE from root-relative skeleton recovery towards globally grounded human motion estimation. Their connection to temporal and physics-aware plausibility constraints is discussed in Section 7.3.

8.3.4. New Benchmarks and Mesh Recovery

Progress is driven by datasets such as FreeMan [34], which benchmarks 3D pose under realistic conditions, and SLOPER4D [38], which captures global 4D motion in large urban scenes using LiDAR/IMU.

Finally, these concepts extend to surface modeling through human mesh recovery (HMR). Lee and Lee [279] incorporated uncertainty to handle the ambiguity of lifting 2D video to 3D meshes. To adapt the model to new domains at test time, Kan et al. [280] developed a self-correctable inference framework (SCAI) that reduces internal consistency errors.

Table 17 summarizes representative methods across efficiency, unification, foundation models, and human-centric frontier directions.

9. Critical Analysis and Discussion

The conceptual taxonomy used in this study enables us to take a comprehensive look at current advancements in human pose estimation. In this section, we highlight cross-cutting limitations along our five axes (representation, architecture, ambiguity, contextual extension, and applications) and describe the key patterns.

9.1. A Strategic Roadmap: Three Views on the Field

Beyond its empirical findings, this review can be read as a strategic roadmap organized along three complementary views.

Theoretical view. The field’s conceptual center has moved from point estimation to distribution estimation: from a single coordinate, to a calibrated heatmap, to a multi-hypothesis ensemble, to a full denoising posterior. Each step is a different answer to the same question of how a model should represent the spatial uncertainty inherent in projecting a 3D body onto a 2D image. The rise of diffusion and self-supervised models is best understood not as another architectural fashion but as the natural completion of this trajectory, that is, it reframes 3D pose estimation as an inverse problem with an explicit prior rather than as supervised regression.
Practical view. Accuracy on Human3.6M and COCO is no longer the binding constraint for deployment; latency, energy, robustness to compression and motion blur, behavior under occlusion and crowding, and graceful degradation on non-standard bodies (children, elderly, individuals with limb differences) now dominate the gap between benchmark numbers and field-ready systems. Therefore, the efficiency literature reviewed in Section 8.1—lightweight backbones, distillation, quantization, token and frame pruning, compressed-domain inference—is not a side topic but the central concern for most real applications.
Design view. Practitioners choosing an architecture today face a small number of recurring decisions: heatmap versus regression, CNN versus transformer versus SSM, single-view versus multi-view, deterministic versus generative, RGB-only versus multimodal. Table 18 distills the tradeoffs and the conditions under which each option is preferable. The same tradeoffs explain why XR, sports biomechanics, and robotics have converged on different architectural stacks despite drawing from the same algorithmic literature.

The intended use of Table 18 is diagnostic rather than prescriptive: practitioners should identify the row for which the “when to choose” description matches their deployment context, then weigh the dominant tradeoff against their constraints. In most real systems, the final architecture is a stack drawn from multiple rows, for example a real-time regression-based 2D estimator feeding an SSM-based temporal lifter feeding a physics-aware refinement layer, and the rows of the table should be read as composable rather than mutually exclusive choices.

To complement this design guide, Table 19 summarizes representative recent methods across several paradigms. The purpose of this snapshot is not to rank all methods directly, since the reported scores use different datasets, metrics, and protocols, but instead to make visible the dominant accuracy–efficiency–generality tradeoffs.

Two patterns are visible in the snapshot. First, on canonical 3D benchmarks such as Human3.6M, several of the strongest recent results come from generative or distribution-aware methods, including diffusion-based and autoregressive approaches. This supports the theoretical view introduced above, namely, that modern HPE is increasingly moving beyond single-point prediction toward representations and models that capture uncertainty, ambiguity, and multiple plausible poses.

Second, the methods with the strongest deployment profiles, such as RTMPose, DynPose, and SAMA, occupy a different part of the design space; they prioritize real-time inference, adaptive computation, or long-window efficiency even when they do not always match the most expensive generative models in raw accuracy. This gap between accuracy-oriented and deployment-oriented methods motivates the efficiency and unification trends discussed in Sections 8.1 and 8.2.

Together, Table 18 and Table 19 translate the survey’s conceptual taxonomy into a practical deployment guide. Table 18 summarizes when each design paradigm is appropriate, while Table 19 illustrates representative recent methods and their main tradeoffs. These tables are not intended to replace the original papers but to help practitioners narrow the design space before selecting methods for a specific application.

9.2. Representation and Architecture: Benefits and Unspoken Expenses

The success of contemporary HPE has been largely attributed to the transition from direct coordinate regression to heatmaps, volumetric encodings, and distribution-aware regression. While volumetric and kinematic representations enhance 3D accuracy and physical plausibility, heatmap-based formulations allow for robust optimization and efficient use of convolutional structures.

However, these richer representations come with non-trivial costs. High-resolution heatmaps and volumetric grids significantly increase memory requirements and computational load. Furthermore, optimization may be difficult, since kinematic or mesh-based models often require carefully designed constraints and prior calibration.

The transition from multi-stage CNNs to graph neural networks, transformers, and state space models has also greatly improved HPE systems’ ability to capture long-range dependencies. However, this advancement is accompanied by rising training costs, hyperparameter sensitivity, and model complexity. In some regimes, we observe diminishing returns from larger backbones or more sophisticated attention mechanisms once data and label saturation are reached. This implies that algorithmic innovation in the future should be assessed not only for accuracy but also for efficiency, robustness, and interpretability.

9.3. Efficiency and Real-World Deployability

Our analysis of efficient designs reveals an increasing conflict between real-world limitations and benchmark performance. While many works report FLOPs or parameter counts, these proxies only partially reflect real-world efficiency. Critical metrics such as wall-clock latency, memory footprint, energy consumption, and resilience to input degradation (e.g., motion blur, compression artefacts) are seldom measured systematically.

Moreover, a significant fraction of the reported gains in “efficient” HPE arises from hardware-specific optimizations or proprietary deployment stacks, making it difficult to disentangle architectural contributions from engineering choices. For practical deployments, a more nuanced view of efficiency is needed. There is a demand for standardized evaluation protocols that measure accuracy under realistic throughput and energy budgets across diverse hardware platforms, from embedded devices to edge accelerators.
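Wall-clock latency is straightforward to measure alongside FLOPs. The harness below is a minimal sketch using only the Python standard library; the warm-up count, the number of runs, and the reported statistics are illustrative choices rather than a standardized protocol, and `fake_model` is a hypothetical stand-in for a pose estimator. For GPU inference, the device would additionally need to be synchronized before each timestamp, which this sketch does not do.

```python
import statistics
import time

def benchmark(fn, warmup=5, runs=50):
    """Time repeated calls to fn and report latency statistics.

    Returns a dict with median and 95th-percentile latency in milliseconds
    and the mean throughput in calls per second.
    """
    for _ in range(warmup):          # discard cold-start effects
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    samples.sort()
    p95 = samples[min(len(samples) - 1, int(0.95 * len(samples)))]
    mean = statistics.mean(samples)
    return {
        "median_ms": statistics.median(samples),
        "p95_ms": p95,
        "fps": 1e3 / mean if mean > 0 else float("inf"),
    }

def fake_model():
    """Hypothetical stand-in for a pose estimator's forward pass."""
    return sum(i * i for i in range(10_000))

stats = benchmark(fake_model, warmup=2, runs=20)
```

Reporting the tail (p95) next to the median matters for real-time systems, since a pipeline that meets its frame budget on average can still drop frames when latency spikes.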

9.4. Generalization, Robustness, and Failure Modes

Domain adaptation techniques and cross-dataset evaluations show that in-dataset accuracy provides only a partial picture of model quality. Despite the growth of datasets, most benchmarks continue to underrepresent non-standard body morphologies, children, the elderly, and people with impairments. As a result, the priors that models learn about clothing, movement patterns, and body shape are skewed. Extreme situations such as severe occlusion, truncation, and crowding continue to produce brittle behavior in state-of-the-art models, while cultural and professional diversity remains limited.

Ambiguity resolution remains a persistent challenge. While diffusion-based and multi-hypothesis models offer principled tools for representing multiple plausible poses, their success often depends on strong assumptions regarding camera calibration or scene structure. In unconstrained environments with unknown cameras, ambiguity resolution remains fragile. Furthermore, while temporal models can improve per-frame metrics via smoothing, their relevance for tracking and human–robot cooperation remains questionable, since few studies have fully evaluated long-term stability or error accumulation across extended sequences.

9.5. Limitations of Current Evaluation Metrics

Standard metrics have enabled systematic progress, but also exhibit significant blind spots. In 2D HPE, OKS depends on heuristic choices of keypoint variances, which implicitly encodes assumptions about the relative importance of different body parts. In 3D HPE, MPJPE treats all joints independently and is sensitive to outliers. Procrustes-aligned variants (PA-MPJPE), while useful for assessing structure, can mask errors in global position and orientation that are critical for downstream tasks such as robotics or AR/VR.
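The masking effect of Procrustes alignment is easy to reproduce. The sketch below implements MPJPE and PA-MPJPE with a standard SVD-based similarity alignment; the random 17-joint pose and the applied scale, rotation, and translation are arbitrary assumptions chosen for illustration.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error for (J, 3) arrays."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def procrustes_align(pred, gt):
    """Best similarity transform (scale, rotation, translation) of pred onto gt."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    P, G = pred - mu_p, gt - mu_g
    U, S, Vt = np.linalg.svd(P.T @ G)
    # Flip the last singular direction if needed so the result is a rotation.
    d = np.sign(np.linalg.det(U @ Vt))
    D = np.diag([1.0, 1.0, d])
    R = U @ D @ Vt
    scale = (S * np.diag(D)).sum() / (P ** 2).sum()
    return scale * P @ R + mu_g

def pa_mpjpe(pred, gt):
    """MPJPE after Procrustes alignment of pred onto gt."""
    return mpjpe(procrustes_align(pred, gt), gt)

rng = np.random.default_rng(0)
gt = rng.normal(size=(17, 3))

# A prediction with perfect body structure but wrong scale, heading, and position.
c, s = np.cos(0.5), np.sin(0.5)
Rz = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
pred = 1.2 * gt @ Rz + np.array([1.0, 2.0, 3.0])

raw_error = mpjpe(pred, gt)
aligned_error = pa_mpjpe(pred, gt)
```

The aligned error is zero up to floating-point precision, while the raw error is large: PA-MPJPE would rate this prediction as perfect even though a robot or AR system using it would place the person in the wrong location with the wrong heading and size.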

As benchmarks approach saturation, these limitations become increasingly problematic. One approach is to design composite measures that jointly account for geometric precision, physical plausibility, robustness to perturbations, and impact on downstream tasks.

9.6. Reproducibility and the “Foundation” Era

Finally, the broader ecosystem raises concerns about reproducibility. While many CNN and transformer-based methods release their code, full training pipelines often depend on proprietary datasets or complex distributed infrastructure. Emerging “foundation” models are sometimes available only as black-box APIs or partially documented checkpoints, complicating scientific comparison. Community-level initiatives towards open benchmarks with transparent training recipes, unambiguous licensing, and consistent compute resource reporting are necessary to address these problems.

9.7. Case Study: Real-Time Deployment in Sports Biomechanics

To make the tradeoffs discussed in this section concrete, we close with a short case study of one of the most demanding deployment scenarios for modern HPE: sports biomechanics in broadcast settings. Sports footage stresses every dimension of our taxonomy. Representations must be accurate enough that downstream biomechanical quantities such as joint angles and stride-related measures remain reliable, since small keypoint errors can propagate into larger analytical errors. Architectures must combine global context to track the full body in wide shots with fine local detail to localize extremities under fast motion. Ambiguity and occlusion are routine; limbs overlap during contact events, motion blur degrades visual evidence, and athletes frequently move partially out of frame. Moreover, real-time constraints demand that the entire pipeline runs at 30–60 FPS on edge or broadcast hardware, often with limited computational budgets.

The literature reviewed in this survey highlights how these challenges are addressed in practice. AthletePose3D [36] documents the failure modes of models trained on laboratory datasets when applied to high-speed athletic motion, showing that domain-specific fine-tuning substantially reduces error. This provides concrete evidence that the in-the-wild domain gap remains a dominant bottleneck in sports applications.

From a biomechanical perspective, BioPose [230] integrates mesh recovery with inverse kinematics to enforce physically and anatomically valid motion, including consistent bone lengths and joint limits over time. Extending this idea, PhysDynPose [39] incorporates explicit physical constraints such as gravity and ground contact within the optimization process, illustrating the emerging convergence between pose estimation and physics-based modeling.

Other works target failure modes specific to sports footage. PCNet [267] focuses on extremity localization under rapid motion, where standard detectors often degrade. Baumgartner and Klatt [268] address uncalibrated broadcast video by jointly estimating camera geometry through field registration and lifting 3D pose, providing a practical instance of uncalibrated reconstruction in a real deployment scenario.

Real-time efficiency has also been a focus. RTMPose [237] and DynPose [239] demonstrate that classification-based regression and dynamic per-frame routing can sustain over 60 FPS on commodity GPUs while remaining within 1–2 AP points of much heavier models. These efficient backbones are increasingly integrated into sports pipelines, as illustrated by Giulietti et al. [265], who reported a complete real-time human-in-the-loop system for dynamic simulators.

Taken together, these systems illustrate that no single component resolves the deployment challenge in isolation; instead, a practical sports biomechanics pipeline emerges as a layered system: a robust 2D pose backbone, a temporally stable 2D-to-3D lifting module, a biomechanics- or physics-aware refinement stage, and a domain adaptation step tailored to sport-specific data. Progress across these layers compounds, with improvements in one stage—for example, domain adaptation or physical consistency—directly enhancing the reliability of the full system.

More broadly, recent advances in large-scale foundation models (e.g., Hulk [256], ViTPose++ [257]) are likely to play an increasing role as backbones for such pipelines, with specialized modules handling the domain-specific challenges of high-speed motion, occlusion, and biomechanical validity.

10. Future Research Directions

Building on the aforementioned analysis, we identify a number of avenues that will influence human pose estimation research going forward.

10.1. Scalable Sequence Models: Beyond Transformers

Although transformers are widely used in spatiotemporal modeling, their scalability is constrained by their quadratic complexity. Recent state space models (SSMs) such as Mamba promise linear-time inference on long sequences while retaining expressive power. Applying these models to pose estimation, for instance for continuous 4D tracking or multi-camera fusion, is a promising avenue for improving scalability in video-based systems.
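The linear-time property comes from the recurrent form of a state space model, in which each time step updates a fixed-size hidden state. The sketch below is a toy linear time-invariant SSM with fixed matrices, shown only to illustrate the scaling argument; it is not Mamba, which makes the parameters input-dependent (selective) and uses a hardware-aware parallel scan. The matrices `A`, `B`, and `C` and the input are hypothetical.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Run the recurrence h_t = A h_{t-1} + B x_t, y_t = C h_t.

    x: (T, D_in) input sequence; A: (N, N); B: (N, D_in); C: (D_out, N).
    The loop performs constant work per step, so cost is linear in T,
    whereas full self-attention compares all T x T pairs of time steps.
    """
    h = np.zeros(A.shape[0])
    outputs = []
    for x_t in x:
        h = A @ h + B @ x_t
        outputs.append(C @ h)
    return np.stack(outputs)

# One-dimensional toy system: a leaky accumulator with decay 0.5.
A = np.array([[0.5]])
B = np.array([[1.0]])
C = np.array([[1.0]])
x = np.ones((4, 1))
y = ssm_scan(x, A, B, C)
```

The hidden state has a fixed size regardless of sequence length, which is what makes long windows and streaming inference attractive for continuous 4D tracking.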

10.2. From Isolated Skeletons to Scene-Aware Humans

Most current methods treat pose as an isolated skeleton that is decoupled from the environment. Future work must focus on scene-aware and physics-aware HPE in which global trajectories, contact patterns, and interactions with objects are explicitly modeled. Scene-centric datasets combining 3D motion with detailed geometry (e.g., urban scans) will be crucial to bridge the gap between computer vision, graphics, and robotics.

10.3. Responsible Deployment, Privacy, and Fairness

Privacy and fairness become major concerns when HPE systems are used in sensitive settings such as healthcare and surveillance. Techniques such as on-device inference and privacy-preserving learning can reduce the exposure of raw visual data. Fairness-aware training and inclusive datasets are likewise necessary to avoid disproportionate performance drops for underrepresented populations.

10.4. Foundation and Multimodal Models

The development of large multimodal models creates new opportunities. A deeper understanding of human behavior may be possible by combining pose modeling with language, sound, and appearance. Foundation models that treat pose as one of multiple outputs, for instance alongside segmentation or depth, may yield strong priors that adapt to a variety of tasks with little labeled data. However, realizing this vision will require careful study of the transferability and bias of such large-scale models.

10.5. Next-Generation Benchmarks

Progress will continue to be driven by data. Beyond scaling existing paradigms, there is a critical need for datasets that (i) capture diverse bodies and cultures, (ii) emphasize difficult conditions such as heavy occlusion and extreme motion, (iii) combine multiple modalities (RGB, depth, event, LiDAR), and (iv) explicitly link pose estimation to downstream tasks. Coupled with multi-faceted evaluation protocols, such datasets will ensure that benchmark improvements translate into meaningful real-world progress.

Author Contributions

Conceptualization, K.B.D. and M.A.A.; methodology, K.B.D. and M.A.A.; validation, K.B.D. and M.A.A.; formal analysis, K.B.D. and M.A.A.; investigation, K.B.D. and M.A.A.; writing—original draft preparation, K.B.D.; writing—review and editing, M.A.A.; supervision, M.A.A.; funding acquisition, M.A.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research was enabled in part by support provided by the Natural Sciences and Engineering Research Council of Canada (NSERC), funding reference number RGPIN-2024-05287, and by the AI in Health Research Chair at the Université de Moncton.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

HPE	Human Pose Estimation
CNN	Convolutional Neural Network
SSM	State Space Model
GCN	Graph Convolutional Network
ViT	Vision Transformer
RNN	Recurrent Neural Network
LSTM	Long Short-Term Memory
TCN	Temporal Convolutional Network
DCT	Discrete Cosine Transform
MDN	Mixture Density Network
CVAE	Conditional Variational Autoencoder
DETR	Detection Transformer
RLE	Residual Log-likelihood Estimation
PAF	Part Affinity Field
OKS	Object Keypoint Similarity
AP	Average Precision
PCK	Percentage of Correct Keypoints
PCKh	Percentage of Correct Keypoints (head-normalized)
MPJPE	Mean Per-Joint Position Error
PA-MPJPE	Procrustes-Aligned Mean Per-Joint Position Error
N-MPJPE	Normalized Mean Per-Joint Position Error
AUC	Area Under the Curve
O-MPJPE	Object-centric Mean Per-Joint Position Error
MoCap	Motion Capture
XR	Extended Reality
AR	Augmented Reality
VR	Virtual Reality
IMU	Inertial Measurement Unit
LiDAR	Light Detection and Ranging
RGB	Red Green Blue
RGB-D	Red Green Blue-Depth
HMR	Human Mesh Recovery
NAS	Neural Architecture Search
KD	Knowledge Distillation
BNN	Binary Neural Network
FLOPs	Floating Point Operations
GAN	Generative Adversarial Network
LLM	Large Language Model
CLIP	Contrastive Language–Image Pretraining
VQ-VAE	Vector Quantized Variational Autoencoder
MLP	Multi-Layer Perceptron
IoU	Intersection over Union
OHKM	Online Hard Keypoint Mining
WASP	Waterfall Atrous Spatial Pooling
UDP	Unbiased Data Processing
DARK	Distribution-Aware Coordinate Representation
SAHR	Scale-Adaptive Heatmap Regression
WAHR	Weight-Adaptive Heatmap Regression Loss
CPN	Cascaded Pyramid Network
HRNet	High-Resolution Network
MSPN	Multi-Stage Pose Network
RSN	Residual Steps Network
LCN	Locally Connected Network
PifPaf	Part Intensity Field and Part Association Field
ORPM	Occlusion-Robust Pose Map
MeTRAbs	Metric-Scale Truncation-Robust Heatmaps
COCO	Common Objects in Context
MPII	Max Planck Institute for Informatics
H3.6M	Human3.6M
3DPW	3D Poses in the Wild
SURREAL	Synthetic Humans for REAL tasks

References

Dang, Q.; Yin, J.; Wang, B.; Zheng, W. Deep learning based 2d human pose estimation: A survey. Tsinghua Sci. Technol. 2019, 24, 663–676. [Google Scholar] [CrossRef]
Munea, T.L.; Jembre, Y.Z.; Weldegebriel, H.T.; Chen, L.; Huang, C.; Yang, C. The progress of human pose estimation: A survey and taxonomy of models applied in 2D human pose estimation. IEEE Access 2020, 8, 133330–133348. [Google Scholar] [CrossRef]
Zhang, Z.; Shin, S.Y. Two-Dimensional Human Pose Estimation with Deep Learning: A Review. Appl. Sci. 2025, 15, 7344. [Google Scholar] [CrossRef]
Ji, X.; Fang, Q.; Dong, J.; Shuai, Q.; Jiang, W.; Zhou, X. A survey on monocular 3D human pose estimation. Virtual Real. Intell. Hardw. 2020, 2, 471–500. [Google Scholar] [CrossRef]
Chen, Y.; Tian, Y.; He, M. Monocular human pose estimation: A survey of deep learning-based methods. Comput. Vis. Image Underst. 2020, 192, 102897. [Google Scholar] [CrossRef]
Liu, W.; Bao, Q.; Sun, Y.; Mei, T. Recent advances of monocular 2D and 3D human pose estimation: A deep learning perspective. ACM Comput. Surv. 2022, 55, 1–41. [Google Scholar] [CrossRef]
Guo, Y.; Gao, T.; Dong, A.; Jiang, X.; Zhu, Z.; Wang, F. A Survey of the State of the Art in Monocular 3D Human Pose Estimation: Methods, Benchmarks, and Challenges. Sensors 2025, 25, 2409. [Google Scholar] [CrossRef] [PubMed]
Udayan, D.J.; Jayakumar, T.V.; Raman, R.; Kim, H.S.; Nedungadi, P. Deep Learning in Monocular 3D Human Pose Estimation: Systematic Review of Contemporary Techniques and Applications. Multimed. Tools Appl. 2025, 84, 36985–37021. [Google Scholar] [CrossRef]
Wang, J.; Tan, S.; Zhen, X.; Xu, S.; Zheng, F.; He, Z.; Shao, L. Deep 3D human pose estimation: A review. Comput. Vis. Image Underst. 2021, 210, 103225. [Google Scholar] [CrossRef]
Neupane, R.B.; Li, K.; Boka, T.F. A survey on deep 3D human pose estimation. Artif. Intell. Rev. 2024, 58, 24. [Google Scholar] [CrossRef]
Liu, Y.; Qiu, C.; Zhang, Z. Deep learning for 3D human pose estimation and mesh recovery: A survey. Neurocomputing 2024, 596, 128049. [Google Scholar] [CrossRef]
Dubey, S.; Dixit, M. A comprehensive survey on human pose estimation approaches. Multimed. Syst. 2023, 29, 167–195. [Google Scholar] [CrossRef]
Sun, R.; Lin, Z.; Leng, S.; Wang, A.; Zhao, L. An In-Depth Analysis of 2D and 3D Pose Estimation Techniques in Deep Learning: Methodologies and Advances. Electronics 2025, 14, 1307. [Google Scholar] [CrossRef]
Lan, G.; Wu, Y.; Hu, F.; Hao, Q. Vision-Based Human Pose Estimation via Deep Learning: A Survey. IEEE Trans. Hum.-Mach. Syst. 2023, 53, 253–268. [Google Scholar] [CrossRef]
Gao, Z.; Chen, J.; Liu, Y.; Jin, Y.; Tian, D. A systematic survey on human pose estimation: Upstream and downstream tasks, approaches, lightweight models, and prospects. Artif. Intell. Rev. 2025, 58, 68. [Google Scholar] [CrossRef]
Salisu, S.; Danyaro, K.U.; Nasser, M.; Hayder, I.M.; Younis, H.A. Review of Models for Estimating 3D Human Pose Using Deep Learning. PeerJ Comput. Sci. 2025, 11, e2574. [Google Scholar] [CrossRef] [PubMed]
Jayaswal, R.; Ansari, M.A.; Mewada, A.; Pareek, P.; Ahmad, S. An in-depth exploration of structural pose estimation strategies and datasets. Discov. Comput. 2025, 28, 222. [Google Scholar] [CrossRef]
Hou, Y.; Li, J.; Liao, S.; Xue, N. Research Advanced in Human Pose Estimation based on Deep Learning. Highlights Sci. Eng. Technol. 2024, 119, 444–453. [Google Scholar] [CrossRef]
Nogueira, A.F.R.; Oliveira, H.P.; Teixeira, L.F. Markerless multi-view 3D human pose estimation: A survey. Image Vis. Comput. 2025, 155, 105437. [Google Scholar] [CrossRef]
Azam, M.M.; Desai, K. A Survey on 3D Egocentric Human Pose Estimation. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 16–22 June 2024; pp. 1643–1654. [Google Scholar] [CrossRef]
Algabri, R.; Abdu, A.; Lee, S. Deep learning and machine learning techniques for head pose estimation: A survey. Artif. Intell. Rev. 2024, 57, 288. [Google Scholar] [CrossRef]
Suo, X.; Tang, W.; Li, Z. Motion Capture Technology in Sports Scenarios: A Survey. Sensors 2024, 24, 2947. [Google Scholar] [CrossRef]
Song, L.; Yu, G.; Yuan, J.; Liu, Z. Human pose estimation and its application to action recognition: A survey. J. Vis. Commun. Image Represent. 2021, 76, 103055. [Google Scholar] [CrossRef]
Ben Gamra, M.; Akhloufi, M.A. A review of deep learning techniques for 2D and 3D human pose estimation. Image Vis. Comput. 2021, 114, 104282. [Google Scholar] [CrossRef]
Ionescu, C.; Papava, D.; Olaru, V.; Sminchisescu, C. Human3. 6m: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 36, 1325–1339. [Google Scholar] [CrossRef] [PubMed]
Varol, G.; Romero, J.; Martin, X.; Mahmood, N.; Black, M.J.; Laptev, I.; Schmid, C. Learning from Synthetic Humans. In Proceedings of the CVPR, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
Andriluka, M.; Pishchulin, L.; Gehler, P.; Schiele, B. 2D Human Pose Estimation: New Benchmark and State of the Art Analysis. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 3686–3693. [Google Scholar] [CrossRef]
Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the Computer Vision—ECCV 2014, Zurich, Switzerland, 6–12 September 2014; Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Springer: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar]
Cao, Z.; Simon, T.; Wei, S.E.; Sheikh, Y. Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7291–7299. [Google Scholar]
Newell, A.; Huang, Z.; Deng, J. Associative embedding: End-to-end learning for joint detection and grouping. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, Red Hook, NY, USA, 4–9 December 2017; pp. 2274–2284. [Google Scholar]
Papandreou, G.; Zhu, T.; Chen, L.C.; Gidaris, S.; Tompson, J.; Murphy, K. PersonLab: Person Pose Estimation and Instance Segmentation with a Bottom-Up, Part-Based, Geometric Embedding Model. In Proceedings of the Computer Vision—ECCV 2018, Munich, Germany, 8–14 September 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer: Cham, Switzerland, 2018; pp. 282–299. [Google Scholar]
Mehta, D.; Rhodin, H.; Casas, D.; Fua, P.; Sotnychenko, O.; Xu, W.; Theobalt, C. Monocular 3D Human Pose Estimation In The Wild Using Improved CNN Supervision. In Proceedings of the 3D Vision (3DV), 2017 fifth International Conference IEEE, Qingdao, China, 10–12 October 2017. [Google Scholar] [CrossRef]
Von Marcard, T.; Henschel, R.; Black, M.J.; Rosenhahn, B.; Pons-Moll, G. Recovering accurate 3d human pose in the wild using imus and a moving camera. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 601–617. [Google Scholar]
Wang, J.; Yang, F.; Li, B.; Gou, W.; Yan, D.; Zeng, A.; Gao, Y.; Wang, J.; Jing, Y.; Zhang, R. Freeman: Towards benchmarking 3d human pose estimation under real-world conditions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 21978–21988. [Google Scholar]
Jin, S.; Xu, L.; Xu, J.; Wang, C.; Liu, W.; Qian, C.; Ouyang, W.; Luo, P. Whole-Body Human Pose Estimation in the Wild. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part IX; Springer: Cham, Switzerland, 2020; pp. 196–214. [Google Scholar] [CrossRef]
Yeung, C.; Suzuki, T.; Tanaka, R.; Yin, Z.; Fujii, K. AthletePose3D: A benchmark dataset for 3D human pose estimation and kinematic validation in athletic movements. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 5945–5956. [Google Scholar]
Ying, J.; Du, H.; Zhang, K.; Li, L.; Yu, X. LDPose: Towards Inclusive Human Pose Estimation for Limb-Deficient Individuals in the Wild. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Honolulu, HI, USA, 19–20 October 2025; pp. 9865–9875. [Google Scholar]
Dai, Y.; Lin, Y.; Lin, X.; Wen, C.; Xu, L.; Yi, H.; Shen, S.; Ma, Y.; Wang, C. SLOPER4D: A Scene-Aware Dataset for Global 4D Human Pose Estimation in Urban Environments. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 682–692. [Google Scholar]
Aytekin, A.I.; Li, C.; Luvizon, D.; Dabral, R.; Oswald, M.; Habermann, M.; Theobalt, C. Physics-based Human Pose Estimation from a Single Moving RGB Camera. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 15–18 October 2025; pp. 3891–3900. [Google Scholar]
Toshev, A.; Szegedy, C. DeepPose: Human Pose Estimation via Deep Neural Networks. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1653–1660. [Google Scholar] [CrossRef]
Sun, X.; Xiao, B.; Wei, F.; Liang, S.; Wei, Y. Integral Human Pose Regression. In Proceedings of the Computer Vision—ECCV 2018, Munich, Germany, 8–14 September 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer: Cham, Switzerland, 2018; pp. 536–553. [Google Scholar]
Li, S.; Ke, L.; Pratama, K.; Tai, Y.W.; Tang, C.K.; Cheng, K.T. Cascaded Deep Monocular 3D Human Pose Estimation With Evolutionary Training Data. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 6172–6182. [Google Scholar] [CrossRef]
Kim, Y.; Kim, D. A CNN-based 3D human pose estimation based on projection of depth and ridge data. Pattern Recognit. 2020, 106, 107462. [Google Scholar] [CrossRef]
Zhou, K.; Han, X.; Jiang, N.; Jia, K.; Lu, J. HEMlets Pose: Learning Part-Centric Heatmap Triplets for Accurate 3D Human Pose Estimation. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 2344–2353. [Google Scholar] [CrossRef]
Luo, Z.; Wang, Z.; Huang, Y.; Wang, L.; Tan, T.; Zhou, E. Rethinking the Heatmap Regression for Bottom-up Human Pose Estimation. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 13259–13268. [Google Scholar] [CrossRef]
Jiang, L.; Liu, Z.; Li, K.; Wu, W. Boosting Human Pose Estimation via Heatmap Refinement. In Proceedings of the MultiMedia Modeling; Ide, I., Kompatsiaris, I., Xu, C., Yanai, K., Chu, W.T., Nitta, N., Riegler, M., Yamasaki, T., Eds.; Springer: Singapore, 2025; pp. 153–167. [Google Scholar]
Huang, J.; Zhu, Z.; Guo, F.; Huang, G. The Devil Is in the Details: Delving Into Unbiased Data Processing for Human Pose Estimation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 5699–5708. [Google Scholar] [CrossRef]
Zhang, F.; Zhu, X.; Dai, H.; Ye, M.; Zhu, C. Distribution-Aware Coordinate Representation for Human Pose Estimation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 7091–7100. [Google Scholar] [CrossRef]
Gu, K.; Chen, R.; Yu, X.; Yao, A. On the Calibration of Human Pose Estimation. In Proceedings of the 41st International Conference on Machine Learning, PMLR, Vienna, Austria, 21–27 July 2024; Salakhutdinov, R., Kolter, Z., Heller, K., Weller, A., Oliver, N., Scarlett, J., Berkenkamp, F., Eds.; ACM: New York, NY, USA, 2024; Volume 235, pp. 16530–16547. [Google Scholar]
Liu, H.; Liu, T.; Chen, Y.; Zhang, Z.; Li, Y.F. EHPE: Skeleton cues-based gaussian coordinate encoding for efficient human pose estimation. IEEE Trans. Multimed. 2022, 26, 8464–8475. [Google Scholar] [CrossRef]
Purkrabek, M.; Matas, J. ProbPose: A Probabilistic Approach to 2D Human Pose Estimation. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 27124–27133. [Google Scholar]
Osokin, D. Real-time 2D Multi-Person Pose Estimation on CPU: Lightweight. arXiv 2018, arXiv:1811.12004. [Google Scholar] [CrossRef]
Kreiss, S.; Bertoni, L.; Alahi, A. PifPaf: Composite Fields for Human Pose Estimation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 11969–11978. [Google Scholar] [CrossRef]
Benzine, A.; Luvison, B.; Pham, Q.C.; Achard, C. Deep, Robust and Single Shot 3D Multi-Person Human Pose Estimation from Monocular Images. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 584–588. [Google Scholar] [CrossRef]
Zhen, J.; Fang, Q.; Sun, J.; Liu, W.; Jiang, W.; Bao, H.; Zhou, X. SMAP: Single-Shot Multi-person Absolute 3D Pose Estimation. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XV; Springer: Cham, Switzerland, 2020; pp. 550–566. [Google Scholar] [CrossRef]
Zhang, Z.; Luo, Y.; Gou, J. Double anchor embedding for accurate multi-person 2D pose estimation. Image Vis. Comput. 2021, 111, 104198. [Google Scholar] [CrossRef]
Cheng, Y.; Ai, Y.; Wang, B.; Wang, X.; Tan, R.T. Bottom-up 2D pose estimation via dual anatomical centers for small-scale persons. Pattern Recognit. 2023, 139, 109403. [Google Scholar] [CrossRef]
Wang, T.; Jin, L.; Wang, Z.; Fan, X.; Cheng, Y.; Teng, Y.; Xing, J.; Zhao, J. DecenterNet: Bottom-up human pose estimation via decentralized pose representation. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 1798–1808. [Google Scholar]
Li, J.; Su, W.; Wang, Z. Simple Pose: Rethinking and Improving a Bottom-up Approach for Multi-Person Pose Estimation. In Proceedings of the The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, 7–12 February 2020; AAAI Press: Menlo Park, CA, USA, 2020; pp. 11354–11361. [Google Scholar] [CrossRef]
McNally, W.; Vats, K.; Wong, A.; McPhee, J. Rethinking Keypoint Representations: Modeling Keypoints and Poses as Objects for Multi-person Human Pose Estimation. In Proceedings of the Computer Vision—ECCV 2022, Tel Aviv, Israel, 23–27 October 2022; Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T., Eds.; Springer: Cham, Switzerland, 2022; pp. 37–54. [Google Scholar]
Maji, D.; Nagori, S.; Mathew, M.; Poddar, D. Yolo-pose: Enhancing yolo for multi person pose estimation using object keypoint similarity loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 21–24 June 2022; pp. 2637–2646. [Google Scholar]
Zauss, D.; Kreiss, S.; Alahi, A. Keypoint Communities. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 11057–11066. [Google Scholar]
Qu, H.; Cai, Y.; Foo, L.G.; Kumar, A.; Liu, J. A Characteristic Function-Based Method for Bottom-Up Human Pose Estimation. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 13009–13018. [Google Scholar] [CrossRef]
Nibali, A.; He, Z.; Morgan, S.; Prendergast, L. 3D Human Pose Estimation With 2D Marginal Heatmaps. In Proceedings of the 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), Los Alamitos, CA, USA, 7–11 January 2019; pp. 1477–1485. [Google Scholar] [CrossRef]
Choi, S.; Choi, S.; Kim, C. MobileHumanPose: Toward real-time 3D human pose estimation in mobile devices. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Nashville, TN, USA, 20–25 June 2021; pp. 2328–2338. [Google Scholar] [CrossRef]
Sárándi, I.; Linder, T.; Arras, K.O.; Leibe, B. MeTRAbs: Metric-Scale Truncation-Robust Heatmaps for Absolute 3D Human Pose Estimation. IEEE Trans. Biom. Behav. Identity Sci. 2021, 3, 16–30. [Google Scholar] [CrossRef]
Wang, M.; Chen, X.; Liu, W.; Qian, C.; Lin, L.; Ma, L. DRPose3D: Depth ranking in 3D human pose estimation. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, IJCAI’18; AAAI Press: Menlo Park, CA, USA, 2018; pp. 978–984. [Google Scholar]
Pavlakos, G.; Zhou, X.; Daniilidis, K. Ordinal Depth Supervision for 3D Human Pose Estimation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7307–7316. [Google Scholar] [CrossRef]
Kundu, J.; Seth, S.; M V, R.; Rakesh, M.; Babu, R.; Chakraborty, A. Kinematic-Structure-Preserved Representation for Unsupervised 3D Human Pose Estimation. Proc. AAAI Conf. Artif. Intell. 2020, 34, 11312–11319. [Google Scholar] [CrossRef]
Chen, T.; Fang, C.; Shen, X.; Zhu, Y.; Chen, Z.; Luo, J. Anatomy-Aware 3D Human Pose Estimation With Bone-Based Pose Decomposition. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 198–209. [Google Scholar] [CrossRef]
Geng, Z.; Wang, C.; Wei, Y.; Liu, Z.; Li, H.; Hu, H. Human Pose as Compositional Tokens. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 660–671. [Google Scholar] [CrossRef]
Fang, H.S.; Xu, Y.; Wang, W.; Liu, X.; Zhu, S.C. Learning pose grammar to encode human body configuration for 3d pose estimation. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
Marin-Jimenez, M.J.; Romero-Ramirez, F.J.; Munoz-Salinas, R.; Medina-Carnicer, R. 3D human pose estimation from depth maps using a deep combination of poses. J. Vis. Commun. Image Represent. 2018, 55, 627–639. [Google Scholar] [CrossRef]
Chen, Y.; Wang, Z.; Peng, Y.; Zhang, Z.; Yu, G.; Sun, J. Cascaded Pyramid Network for Multi-person Pose Estimation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7103–7112. [Google Scholar] [CrossRef]
Li, W.; Wang, Z.; Yin, B.; Peng, Q.; Du, Y.; Xiao, T.; Yu, G.; Lu, H.; Wei, Y.; Sun, J. Rethinking on multi-stage networks for human pose estimation. arXiv 2019, arXiv:1901.00148. [Google Scholar] [CrossRef]
Su, Z.; Ye, M.; Zhang, G.; Dai, L.; Sheng, J. Cascade feature aggregation for human pose estimation. arXiv 2019, arXiv:1902.07837. [Google Scholar] [CrossRef]
Zhang, H.; Ouyang, H.; Liu, S.; Qi, X.; Shen, X.; Yang, R.; Jia, J. Human pose estimation with spatial contextual information. arXiv 2019, arXiv:1901.01760. [Google Scholar] [CrossRef]
Xiao, B.; Wu, H.; Wei, Y. Simple Baselines for Human Pose Estimation and Tracking. In Proceedings of the Computer Vision—ECCV 2018: 15th European Conference, Munich, Germany, 8–14 September 2018; Proceedings, Part VI, Berlin, Heidelberg, 2018; Springer: Berlin/Heidelberg, Germany; pp. 472–487. [CrossRef]
Bulat, A.; Kossaifi, J.; Tzimiropoulos, G.; Pantic, M. Toward fast and accurate human pose estimation via soft-gated skip connections. In Proceedings of the 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), Buenos Aires, Argentina, 16–20 November 2020; pp. 8–15. [Google Scholar] [CrossRef]
Tang, Z.; Peng, X.; Geng, S.; Wu, L.; Zhang, S.; Metaxas, D. Quantized Densely Connected U-Nets for Efficient Landmark Localization. In Proceedings of the Computer Vision—ECCV 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer: Cham, Switzerland, 2018; pp. 348–364. [Google Scholar]
Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep High-Resolution Representation Learning for Human Pose Estimation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 5686–5696. [Google Scholar] [CrossRef]
Cheng, B.; Xiao, B.; Wang, J.; Shi, H.; Huang, T.S.; Zhang, L. HigherHRNet: Scale-Aware Representation Learning for Bottom-Up Human Pose Estimation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 5385–5394. [Google Scholar] [CrossRef]
Artacho, B.; Savakis, A. UniPose: Unified Human Pose Estimation in Single Images and Videos. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 7033–7042. [Google Scholar] [CrossRef]
Artacho, B.; Savakis, A. OmniPose: A Multi-Scale Framework for Multi-Person Pose Estimation. arXiv 2021, arXiv:2103.10180. [Google Scholar] [CrossRef]
Ke, L.; Chang, M.C.; Qi, H.; Lyu, S. Multi-Scale Structure-Aware Network for Human Pose Estimation. In Proceedings of the Computer Vision—ECCV 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer: Cham, Switzerland, 2018; pp. 731–746. [Google Scholar]
Zhang, J.; Chen, Z.; Tao, D. Towards High Performance Human Keypoint Detection. Int. J. Comput. Vis. 2021, 129, 2639–2662. [Google Scholar] [CrossRef]
Cai, Y.; Wang, Z.; Luo, Z.; Yin, B.; Du, A.; Wang, H.; Zhang, X.; Zhou, X.; Zhou, E.; Sun, J. Learning Delicate Local Representations for Multi-person Pose Estimation. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part III; Springer: Berlin/Heidelberg, Germany; pp. 455–472. [CrossRef]
Groos, D.; Ramampiaro, H.; Ihlen, E.A. EfficientPose: Scalable single-person pose estimation. Appl. Intell. 2021, 51, 2518–2533. [Google Scholar] [CrossRef]
Papaioannidis, C.; Mademlis, I.; Pitas, I. Fast single-person 2D human pose estimation using multi-task Convolutional Neural Networks. In Proceedings of the ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: Piscataway, NJ, USA, 2023; pp. 1–5. [Google Scholar]
Khirodkar, R.; Chari, V.; Agrawal, A.; Tyagi, A. Multi-Instance Pose Networks: Rethinking Top-Down Pose Estimation. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 3102–3111. [Google Scholar] [CrossRef]
Munea, T.L.; Yang, C.; Huang, C.; Elhassan, M.A.; Zhen, Q. SimpleCut: A simple and strong 2D model for multi-person pose estimation. Comput. Vis. Image Underst. 2022, 222, 103509. [Google Scholar] [CrossRef]
Fieraru, M.; Khoreva, A.; Pishchulin, L.; Schiele, B. Learning to Refine Human Pose Estimation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Salt Lake City, UT, USA, 18–23 June 2018; pp. 318–31809. [Google Scholar] [CrossRef]
Moon, G.; Chang, J.Y.; Lee, K.M. PoseFix: Model-Agnostic General Human Pose Refinement Network. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 7765–7773. [Google Scholar] [CrossRef]
Xu, T.; Takano, W. Graph Stacked Hourglass Networks for 3D Human Pose Estimation. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 16100–16109. [Google Scholar] [CrossRef]
Ci, H.; Wang, C.; Ma, X.; Wang, Y. Optimizing Network Structure for 3D Human Pose Estimation. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 2262–2271. [Google Scholar] [CrossRef]
Hu, W.; Zhang, C.; Zhan, F.; Zhang, L.; Wong, T.T. Conditional Directed Graph Convolution for 3D Human Pose Estimation. In Proceedings of the 29th ACM International Conference on Multimedia, New York, NY, USA, 20–24 October 2021; MM’21; Association for Computing Machinery: New York, NY, USA, 2021; pp. 602–611. [Google Scholar] [CrossRef]
Azizi, N.; Possegger, H.; Rodolà, E.; Bischof, H. 3D Human Pose Estimation Using Möbius Graph Convolutional Networks. In Proceedings of the Computer Vision—ECCV 2022: 17th European Conference, Tel Aviv, Israel, 23–27 October 2022; Proceedings, Part I; Springer: Berlin/Heidelberg, Germany, 2022; pp. 160–178. [Google Scholar] [CrossRef]
Liu, J.; Rojas, J.; Li, Y.; Liang, Z.; Guan, Y.; Xi, N.; Zhu, H. A Graph Attention Spatio-temporal Convolutional Network for 3D Human Pose Estimation in Video. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 3374–3380. [Google Scholar] [CrossRef]
Li, W.; Liu, M.; Liu, H.; Guo, T.; Wang, T.; Tang, H.; Sebe, N. GraphMLP: A graph MLP-like architecture for 3D human pose estimation. Pattern Recognit. 2025, 158, 110925. [Google Scholar] [CrossRef]
Yang, S.; Quan, Z.; Nie, M.; Yang, W. TransPose: Keypoint Localization via Transformer. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 11782–11792. [Google Scholar] [CrossRef]
Yuan, Y.; Fu, R.; Huang, L.; Lin, W.; Zhang, C.; Chen, X.; Wang, J. HRFormer: High-resolution transformer for dense prediction. In Proceedings of the 35th International Conference on Neural Information Processing Systems, NIPS’21, Red Hook, NY, USA, 6–14 December 2021. [Google Scholar]
Liu, H.; Liu, F.; Fan, X.; Huang, D. Polarized self-attention: Towards high-quality pixel-wise mapping. Neurocomputing 2022, 506, 158–167. [Google Scholar] [CrossRef]
Li, R.; Li, Q.; Yang, S.; Zeng, X.; Yan, A. An efficient and accurate 2D human pose estimation method using VTTransPose network. Sci. Rep. 2024, 14, 7608. [Google Scholar] [CrossRef]
Zeng, W.; Jin, S.; Liu, W.; Qian, C.; Luo, P.; Ouyang, W.; Wang, X. Not All Tokens Are Equal: Human-centric Visual Analysis via Token Clustering Transformer. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 11091–11101. [Google Scholar] [CrossRef]
Mao, W.; Ge, Y.; Shen, C.; Tian, Z.; Wang, X.; Wang, Z. TFPose: Direct Human Pose Estimation with Transformers. arXiv 2021, arXiv:2103.15320. [Google Scholar] [CrossRef]
Panteleris, P.; Argyros, A. PE-former: Pose Estimation Transformer. In Proceedings of the Pattern Recognition and Artificial Intelligence: Third International Conference, ICPRAI 2022, Paris, France, 1–3 June 2022; Proceedings, Part II; Springer: Berlin/Heidelberg, Germany, 2022; pp. 3–14. [Google Scholar] [CrossRef]
Li, K.; Wang, S.; Zhang, X.; Xu, Y.; Xu, W.; Tu, Z. Pose Recognition With Cascade Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 1944–1953. [Google Scholar]
Mao, W.; Ge, Y.; Shen, C.; Tian, Z.; Wang, X.; Wang, Z.; den Hengel, A.v. Poseur: Direct Human Pose Regression with Transformers. In Proceedings of the Computer Vision—ECCV 2022: 17th European Conference, Tel Aviv, Israel, 23–27 October 2022; Proceedings, Part VI; Springer: Berlin/Heidelberg, Germany, 2022; pp. 72–88. [Google Scholar] [CrossRef]
Tian, Z.; Chen, H.; Shen, C. DirectPose: Direct End-to-End Multi-Person Pose Estimation. arXiv 2019, arXiv:1911.07451. [Google Scholar] [CrossRef]
Liu, H.; Chen, Q.; Tan, Z.; Liu, J.J.; Wang, J.; Su, X.; Li, X.; Yao, K.; Han, J.; Ding, E.; et al. Group Pose: A Simple Baseline for End-to-End Multi-person Pose Estimation. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 14983–14992. [Google Scholar] [CrossRef]
Geng, Z.; Sun, K.; Xiao, B.; Zhang, Z.; Wang, J. Bottom-Up Human Pose Estimation Via Disentangled Keypoint Regression. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 14671–14681. [Google Scholar] [CrossRef]
Ma, X.; Su, J.; Wang, C.; Ci, H.; Wang, Y. Context Modeling in 3D Human Pose Estimation: A Unified Perspective. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 6234–6243. [Google Scholar] [CrossRef]
Zhao, Q.; Zheng, C.; Liu, M.; Chen, C. A single 2D pose with context is worth hundreds for 3D human pose estimation. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS’23, New Orleans, LA, USA, 10–16 December 2023. [Google Scholar]
Li, C.; Lee, G.H. Generating Multiple Hypotheses for 3D Human Pose Estimation With Mixture Density Network. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 9879–9887. [Google Scholar] [CrossRef]
Sharma, S.; Varigonda, P.T.; Bindal, P.; Sharma, A.; Jain, A. Monocular 3D Human Pose Estimation by Generation and Ordinal Ranking. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 2325–2334. [Google Scholar] [CrossRef]
Han, C.; Yu, X.; Gao, C.; Sang, N.; Yang, Y. Single image based 3D human pose estimation via uncertainty learning. Pattern Recognit. 2022, 132, 108934. [Google Scholar] [CrossRef]
Rommel, C.; Letzelter, V.; Samet, N.; Marlet, R.; Cord, M.; Pérez, P.; Valle, E. ManiPose: Manifold-constrained multi-hypothesis 3D human pose estimation. In Proceedings of the 38th International Conference on Neural Information Processing Systems, NIPS’24, Vancouver, BC, Canada, 10–15 December 2024. [Google Scholar]
Holmquist, K.; Wandt, B. DiffPose: Multi-hypothesis Human Pose Estimation using Diffusion Models. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 15931–15941. [Google Scholar] [CrossRef]
Gong, J.; Foo, L.G.; Fan, Z.; Ke, Q.; Rahmani, H.; Liu, J. DiffPose: Toward More Reliable 3D Pose Estimation. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 13041–13051. [Google Scholar] [CrossRef]
Shan, W.; Liu, Z.; Zhang, X.; Wang, Z.; Han, K.; Wang, S.; Ma, S.; Gao, W. Diffusion-Based 3D Human Pose Estimation with Multi-Hypothesis Aggregation. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 14715–14725. [Google Scholar] [CrossRef]
Xu, J.; Guo, Y.; Peng, Y. FinePOSE: Fine-Grained Prompt-Driven 3D Human Pose Estimation via Diffusion Models. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 561–570. [Google Scholar] [CrossRef]
Wang, W.; Xiao, J.; Wang, C.; Liu, W.; Wang, Z.; Chen, L. Di2Pose: Discrete diffusion model for occluded 3D human pose estimation. In Proceedings of the 38th International Conference on Neural Information Processing Systems, NIPS’24, Vancouver, BC, Canada, 10–15 December 2024. [Google Scholar]
Feng, R.; Gao, Y.; Elden Tse, T.H.; Ma, X.; Chang, H.J. DiffPose: SpatioTemporal Diffusion Model for Video-Based Human Pose Estimation. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 14815–14826. [Google Scholar] [CrossRef]
Doering, A.; Chen, D.; Zhang, S.; Schiele, B.; Gall, J. PoseTrack21: A Dataset for Person Search, Multi-Object Tracking and Multi-Person Pose Tracking. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
Jiang, Z.; Zhou, Z.; Li, L.; Chai, W.; Yang, C.Y.; Hwang, J.N. Back to Optimization: Diffusion-based Zero-Shot 3D Human Pose Estimation. In Proceedings of the 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2024; pp. 6130–6140. [Google Scholar] [CrossRef]
Rhodin, H.; Meyer, F.; Spörri, J.; Müller, E.; Constantin, V.; Fua, P.; Katircioglu, I.; Salzmann, M. Learning Monocular 3D Human Pose Estimation from Multi-view Images. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8437–8446. [Google Scholar] [CrossRef]
Rhodin, H.; Salzmann, M.; Fua, P. Unsupervised Geometry-Aware Representation for 3D Human Pose Estimation. In Proceedings of the Computer Vision—ECCV 2018: 15th European Conference, Munich, Germany, 8–14 September 2018; Proceedings, Part X; Springer: Berlin/Heidelberg, Germany, 2018; pp. 765–782. [Google Scholar] [CrossRef]
Iskakov, K.; Burkov, E.; Lempitsky, V.; Malkov, Y. Learnable triangulation of human pose. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7718–7727. [Google Scholar]
Yang, C.Y.; Luo, J.; Xia, L.; Sun, Y.; Qiao, N.; Zhang, K.; Jiang, Z.; Hwang, J.N.; Kuo, C.H. CameraPose: Weakly-Supervised Monocular 3D Human Pose Estimation by Leveraging In-the-wild 2D Annotations. In Proceedings of the 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2–7 January 2023; pp. 2923–2932. [Google Scholar] [CrossRef]
Yu, Z.; Wang, M.; Chen, Y.; Favaro, P.; Modolo, D. Denoising and Selecting Pseudo-Heatmaps for Semi-Supervised Human Pose Estimation. In Proceedings of the 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2024; pp. 6268–6277. [Google Scholar] [CrossRef]
Nakatsuka, T.; Yoshii, K.; Koyama, Y.; Fukayama, S.; Goto, M.; Morishima, S. MirrorNet: A Deep Reflective Approach to 2D Pose Estimation for Single-Person Images. J. Inf. Process. 2021, 29, 406–423. [Google Scholar] [CrossRef]
Kundu, J.N.; Seth, S.; Jampani, V.; Rakesh, M.; Venkatesh Babu, R.; Chakraborty, A. Self-Supervised 3D Human Pose Estimation via Part Guided Novel Image Synthesis. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 6151–6161. [Google Scholar] [CrossRef]
Sosa, J.; Hogg, D. Self-supervised 3D Human Pose Estimation from a Single Image. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Vancouver, BC, Canada, 17–24 June 2023; pp. 4788–4797. [Google Scholar] [CrossRef]
Kundu, J.N.; Seth, S.; YM, P.; Jampani, V.; Chakraborty, A.; Babu, R.V. Uncertainty-Aware Adaptation for Self-Supervised 3D Human Pose Estimation. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 20416–20427. [Google Scholar] [CrossRef]
Yang, W.; Ouyang, W.; Wang, X.; Ren, J.; Li, H.; Wang, X. 3D Human Pose Estimation in the Wild by Adversarial Learning. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5255–5264. [Google Scholar] [CrossRef]
Peng, X.; Tang, Z.; Yang, F.; Feris, R.S.; Metaxas, D. Jointly Optimize Data Augmentation and Network Training: Adversarial Data Augmentation in Human Pose Estimation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2226–2234. [Google Scholar] [CrossRef]
Gong, K.; Zhang, J.; Feng, J. PoseAug: A Differentiable Pose Augmentation Framework for 3D Human Pose Estimation. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 8571–8580. [Google Scholar] [CrossRef]
Peng, Q.; Zheng, C.; Chen, C. A Dual-Augmentor Framework for Domain Generalization in 3D Human Pose Estimation. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 2240–2249. [Google Scholar] [CrossRef]
Wang, L.; Chen, Y.; Guo, Z.; Qian, K.; Lin, M.; Li, H.; Ren, J.S. Generalizing Monocular 3D Human Pose Estimation in the Wild. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4024–4033. [Google Scholar] [CrossRef]
Doersch, C.; Zisserman, A. Sim2real transfer learning for 3d human pose estimation: Motion to the rescue. In Advances in Neural Information Processing Systems; NeurIPS: San Diego, CA, USA, 2019; Volume 32. [Google Scholar]
Chai, W.; Jiang, Z.; Hwang, J.N.; Wang, G. Global Adaptation meets Local Generalization: Unsupervised Domain Adaptation for 3D Human Pose Estimation. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 14609–14619. [Google Scholar] [CrossRef]
Wang, Z.; Shin, D.; Fowlkes, C.C. Predicting Camera Viewpoint Improves Cross-Dataset Generalization for 3D Human Pose Estimation. In Proceedings of the Computer Vision—ECCV 2020 Workshops; Bartoli, A., Fusiello, A., Eds.; Springer: Cham, Switzerland, 2020; pp. 523–540. [Google Scholar]
Cai, Y.; Zhang, W.; Wu, Y.; Jin, C. PoseIRM: Enhance 3D Human Pose Estimation on Unseen Camera Settings via Invariant Risk Minimization. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 2124–2133. [Google Scholar] [CrossRef]
Zeng, A.; Sun, X.; Huang, F.; Liu, M.; Xu, Q.; Lin, S. SRNet: Improving Generalization in 3D Human Pose Estimation with a Split-and-Recombine Approach. In Proceedings of the Computer Vision—ECCV 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M., Eds.; Springer: Cham, Switzerland, 2020; pp. 507–523. [Google Scholar]
Wang, Y.; Wang, Z.; Li, M.; Yan, H. 3D Human Pose Estimation with Two-step Mixed-Training Strategy. In Proceedings of the 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2024; pp. 3320–3329. [Google Scholar] [CrossRef]
Lee, S.; Hwang, Y.; Lee, J.T. Learning 2D Human Poses for Better 3D Lifting via Multi-model 3D-Guidance. In Proceedings of the Computer Vision—ACCV 2024; Cho, M., Laptev, I., Tran, D., Yao, A., Zha, H., Eds.; Springer: Singapore, 2025; pp. 185–202. [Google Scholar]
Taketsugu, H.; Ukita, N. Active Transfer Learning for Efficient Video-Specific Human Pose Estimation. In Proceedings of the 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2024; pp. 1869–1879. [Google Scholar] [CrossRef]
Hu, S.; Sun, H.; Li, B.; Wei, D.; Li, W.; Lu, J. Fast Adaptation for Human Pose Estimation via Meta-Optimization. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 1792–1801. [Google Scholar] [CrossRef]
Vosoughi, S.; Amer, M.A. Deep 3D Human Pose Estimation Under Partial Body Presence. In Proceedings of the 2018 25th IEEE International Conference on Image Processing (ICIP), Athens, Greece, 7–10 October 2018; pp. 569–573. [Google Scholar] [CrossRef]
Cheng, Y.; Yang, B.; Wang, B.; Yan, W.; Tan, R.T. Occlusion-Aware Networks for 3D Human Pose Estimation in Video. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 723–732. [Google Scholar] [CrossRef]
Cheng, Y.; Yang, B.; Wang, B.; Tan, R.T. 3D Human Pose Estimation Using Spatio-Temporal Networks with Explicit Occlusion Training. Proc. AAAI Conf. Artif. Intell. 2020, 34, 10631–10638. [Google Scholar] [CrossRef]
Das, A.; Das, S.; Sistu, G.; Horgan, J.; Bhattacharya, U.; Jones, E.; Glavin, M.; Eising, C. Deep Multi-Task Networks For Occluded Pedestrian Pose Estimation. arXiv 2022, arXiv:2206.07510. [Google Scholar] [CrossRef]
Hardy, P.; Kim, H. LInKs “Lifting Independent Keypoints”—Partial Pose Lifting for Occlusion Handling with Improved Accuracy in 2D-3D Human Pose Estimation. In Proceedings of the 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2024; pp. 3414–3423. [Google Scholar] [CrossRef]
Zheng, H.; Li, H.; Dai, W.; Zheng, Z.; Li, C.; Zou, J.; Xiong, H. HiPART: Hierarchical Pose AutoRegressive Transformer for Occluded 3D Human Pose Estimation. In Proceedings of the 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 10–17 June 2025; pp. 16807–16817. [Google Scholar] [CrossRef]
Zhang, Y.; Ji, P.; Wang, A.; Mei, J.; Kortylewski, A.; Yuille, A. 3D-Aware Neural Body Fitting for Occlusion Robust 3D Human Pose Estimation. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 9365–9376. [Google Scholar] [CrossRef]
Sun, P.; Gu, K.; Wang, Y.; Yang, L.; Yao, A. Rethinking Visibility in Human Pose Estimation: Occluded Pose Reasoning via Transformers. In Proceedings of the 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2024; pp. 5891–5900. [Google Scholar] [CrossRef]
Ning, G.; Liu, P.; Fan, X.; Zhang, C. A Top-Down Approach to Articulated Human Pose Estimation and Tracking. In Proceedings of the Computer Vision—ECCV 2018 Workshops, Munich, Germany, 8–14 September 2018; Proceedings, Part II; Springer: Berlin/Heidelberg, Germany, 2018; pp. 227–234. [Google Scholar] [CrossRef]
Zhou, M.; Stoffl, L.; Mathis, M.W.; Mathis, A. Rethinking pose estimation in crowds: Overcoming the detection information bottleneck and ambiguity. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 14643–14653. [Google Scholar] [CrossRef]
Cheng, Y.; Wang, B.; Yang, B.; Tan, R.T. Monocular 3D Multi-Person Pose Estimation by Integrating Top-Down and Bottom-Up Networks. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 7645–7655. [Google Scholar] [CrossRef]
Zhao, S.; Liu, K.; Huang, Y.; Bao, Q.; Zeng, D.; Liu, W. DPIT: Dual-Pipeline Integrated Transformer for Human Pose Estimation. In Proceedings of the Artificial Intelligence: Second CAAI International Conference, CICAI 2022, Beijing, China, 27–28 August 2022; Revised Selected Papers, Part II; Springer: Berlin/Heidelberg, Germany, 2022; pp. 559–576. [Google Scholar] [CrossRef]
Dabral, R.; Gundavarapu, N.B.; Mitra, R.; Sharma, A.; Ramakrishnan, G.; Jain, A. Multi-Person 3D Human Pose Estimation from Monocular Images. In Proceedings of the 2019 International Conference on 3D Vision (3DV), Quebec City, QC, Canada, 16–19 September 2019; pp. 405–414. [Google Scholar] [CrossRef]
Ding, Y.; Deng, W.; Zheng, Y.; Liu, P.; Wang, M.; Cheng, X.; Bao, J.; Chen, D.; Zeng, M. I²R-Net: Intra- and Inter-Human Relation Network for Multi-Person Pose Estimation. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22; De Raedt, L., Ed.; International Joint Conferences on Artificial Intelligence Organization: Vienna, Austria, 2022; Volume 7, pp. 855–862, Main Track. [Google Scholar] [CrossRef]
Qiu, Z.; Yang, Q.; Wang, J.; Fu, D. Dynamic Graph Reasoning for Multi-person 3D Pose Estimation. In Proceedings of the 30th ACM International Conference on Multimedia, MM’22, Lisbon, Portugal, 10–14 October 2022; pp. 3521–3529. [Google Scholar] [CrossRef]
Luo, Y.; Ren, J.; Wang, Z.; Sun, W.; Pan, J.; Liu, J.; Pang, J.; Lin, L. LSTM Pose Machines. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5207–5215. [Google Scholar] [CrossRef]
Liu, S.; Li, Y.; Hua, G. Human Pose Estimation in Video via Structured Space Learning and Halfway Temporal Evaluation. IEEE Trans. Circuits Syst. Video Technol. 2019, 29, 2029–2038. [Google Scholar] [CrossRef]
Pavllo, D.; Feichtenhofer, C.; Grangier, D.; Auli, M. 3D Human Pose Estimation in Video With Temporal Convolutions and Semi-Supervised Training. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 7745–7754. [Google Scholar] [CrossRef]
Li, Y.; Li, K.; Wang, X.; Xu, R.Y.D. Exploring temporal consistency for human pose estimation in videos. Pattern Recognit. 2020, 103, 107258. [Google Scholar] [CrossRef]
Liu, R.; Shen, J.; Wang, H.; Chen, C.; Cheung, S.-c.; Asari, V. Attention Mechanism Exploits Temporal Contexts: Real-Time 3D Human Pose Reconstruction. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 5063–5072. [Google Scholar] [CrossRef]
Lin, J.; Lee, G.H. Trajectory Space Factorization for Deep Video-Based 3D Human Pose Estimation. In Proceedings of the British Machine Vision Conference (BMVC), Cardiff, UK, 9–12 September 2019. [Google Scholar]
Zheng, C.; Zhu, S.; Mendieta, M.; Yang, T.; Chen, C.; Ding, Z. 3D Human Pose Estimation with Spatial and Temporal Transformers. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021. [Google Scholar]
Li, W.; Liu, H.; Ding, R.; Liu, M.; Wang, P.; Yang, W. Exploiting Temporal Contexts With Strided Transformer for 3D Human Pose Estimation. IEEE Trans. Multimed. 2023, 25, 1282–1293. [Google Scholar] [CrossRef]
Tang, Z.; Qiu, Z.; Hao, Y.; Hong, R.; Yao, T. 3D Human Pose Estimation with Spatio-Temporal Criss-Cross Attention. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 4790–4799. [Google Scholar] [CrossRef]
Zhang, J.; Tu, Z.; Yang, J.; Chen, Y.; Yuan, J. MixSTE: Seq2seq Mixed Spatio-Temporal Encoder for 3D Human Pose Estimation in Video. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 13222–13232. [Google Scholar] [CrossRef]
Hassanin, M.; Khamiss, A.; Bennamoun, M.; Boussaid, F.; Radwan, I. CrossFormer: Cross Spatio-Temporal Transformer for 3D Human Pose Estimation. arXiv 2022, arXiv:2203.13387. [Google Scholar] [CrossRef]
Wei, M.; Xie, X.; Zhong, Y.; Shi, G. Learning Pyramid-Structured Long-Range Dependencies for 3D Human Pose Estimation. IEEE Trans. Multimed. 2025, 27, 4684–4697. [Google Scholar] [CrossRef]
Qian, X.; Tang, Y.; Zhang, N.; Han, M.; Xiao, J.; Huang, M.C.; Lin, R.S. HSTFormer: Hierarchical Spatial-Temporal Transformers for 3D Human Pose Estimation. arXiv 2023, arXiv:2301.07322. [Google Scholar] [CrossRef]
Chen, H.; He, J.Y.; Xiang, W.; Cheng, Z.Q.; Liu, W.; Liu, H.; Luo, B.; Geng, Y.; Xie, X. HDFormer: High-order Directed Transformer for 3D Human Pose Estimation. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI-23; Elkind, E., Ed.; ACM: New York, NY, USA, 2023; pp. 581–589, Main Track. [Google Scholar] [CrossRef]
Zhai, K.; Nie, Q.; Ouyang, B.; Li, X.; Yang, S. HopFIR: Hop-wise GraphFormer with Intragroup Joint Refinement for 3D Human Pose Estimation. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 14939–14949. [Google Scholar] [CrossRef]
Liu, H.; Cheng, Z.Q.; Xiang, W.; He, J.Y.; Luo, B.; Geng, Y.; Xie, X. Refined Temporal Pyramidal Compression-and-Amplification Transformer for 3D Human Pose Estimation. In Proceedings of the 2025 IEEE International Conference on Multimedia and Expo (ICME), Nantes, France, 30 June–4 July 2025; pp. 1–6. [Google Scholar] [CrossRef]
Kang, H.; Wang, Y.; Liu, M.; Wu, D.; Liu, P.; Yang, W. Double-chain Constraints for 3D Human Pose Estimation in Images and Videos. arXiv 2023, arXiv:2308.05298. [Google Scholar] [CrossRef]
Mehraban, S.; Adeli, V.; Taati, B. MotionAGFormer: Enhancing 3D human pose estimation with a Transformer-GCNFormer network. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 4–8 January 2024; pp. 6920–6930. [Google Scholar]
Yu, B.X.; Zhang, Z.; Liu, Y.; Zhong, S.H.; Liu, Y.; Chen, C.W. GLA-GCN: Global-local Adaptive Graph Convolutional Network for 3D Human Pose Estimation from Monocular Video. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 8784–8795. [Google Scholar] [CrossRef]
Peng, J.; Zhou, Y.; Mok, P. KTPFormer: Kinematics and trajectory prior knowledge-enhanced transformer for 3D human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 1123–1132. [Google Scholar]
Li, C.; Liu, S.; Yao, L.; Zou, S. Video-based body geometric aware network for 3D human pose estimation. Optoelectron. Lett. 2022, 18, 313–320. [Google Scholar] [CrossRef]
Li, W.; Liu, H.; Tang, H.; Wang, P.; Van Gool, L. MHFormer: Multi-Hypothesis Transformer for 3D Human Pose Estimation. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 13137–13146. [Google Scholar] [CrossRef]
Liu, J.; Liu, M.; Liu, H.; Li, W. TCPFormer: Learning temporal correlation with implicit pose proxy for 3D human pose estimation. Proc. AAAI Conf. Artif. Intell. 2025, 39, 5478–5486. [Google Scholar] [CrossRef]
Lutz, S.; Blythman, R.; Ghosal, K.; Moynihan, M.; Simms, C.; Smolic, A. Jointformer: Single-frame lifting transformer with error prediction and refinement for 3d human pose estimation. In Proceedings of the 2022 26th International Conference on Pattern Recognition (ICPR), Montréal, QC, Canada, 21–25 August 2022; pp. 1156–1163. [Google Scholar]
Qiu, Z.; Yang, Q.; Wang, J.; Fu, D. IVT: An End-to-End Instance-guided Video Transformer for 3D Pose Estimation. In Proceedings of the 30th ACM International Conference on Multimedia, MM’22, Lisbon, Portugal, 10–14 October 2022; pp. 6174–6182. [Google Scholar] [CrossRef]
Feng, R.; Chang, H.J.; Tse, T.H.E.; Kim, B.; Chang, Y.; Gao, Y. High-Resolution Spatiotemporal Modeling with Global-Local State Space Models for Video-Based Human Pose Estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Honolulu, HI, USA, 19–23 October 2025; pp. 8929–8938. [Google Scholar]
Lu, Y.; Wang, J.; Gao, J.; Gong, R.; Cai, C.; Yap, K.H. A Structure-aware and Motion-adaptive Framework for 3D Human Pose Estimation with Mamba. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Honolulu, HI, USA, 19–23 October 2025; pp. 7958–7968. [Google Scholar]
Dabral, R.; Mundhada, A.; Kusupati, U.; Afaque, S.; Sharma, A.; Jain, A. Learning 3d human pose from structure and motion. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 668–683. [Google Scholar]
Gupta, V. Back to the future: Joint aware temporal deep learning 3D human pose estimation. arXiv 2020, arXiv:2002.11251. [Google Scholar] [CrossRef]
Wang, G.; Zeng, H.; Wang, Z.; Liu, Z.; Wang, H. Motion projection consistency-based 3-D human pose estimation with virtual bones from monocular videos. IEEE Trans. Cogn. Dev. Syst. 2022, 15, 784–793. [Google Scholar] [CrossRef]
Wang, J.; Yan, S.; Xiong, Y.; Lin, D. Motion guided 3d pose estimation from videos. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2020; pp. 764–780. [Google Scholar]
Xu, J.; Yu, Z.; Ni, B.; Yang, J.; Yang, X.; Zhang, W. Deep kinematics analysis for monocular 3d human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 899–908. [Google Scholar]
Jin, K.M.; Lim, B.S.; Lee, G.H.; Kang, T.K.; Lee, S.W. Kinematic-aware hierarchical attention network for human pose estimation in videos. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–7 January 2023; pp. 5725–5734. [Google Scholar]
Li, Z.; Xu, B.; Huang, H.; Lu, C.; Guo, Y. Deep two-stream video inference for human body pose and shape estimation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 430–439. [Google Scholar]
Jeong, D.C.; Liu, H.; Salazar, S.; Jiang, J.; Kitts, C.A. SoloPose: One-Shot Kinematic 3D Human Pose Estimation with Video Data Augmentation. arXiv 2023, arXiv:2312.10195. [Google Scholar]
Zhang, J.; Wang, Y.; Zhou, Z.; Luan, T.; Wang, Z.; Qiao, Y. Learning dynamical human-joint affinity for 3d pose estimation in videos. IEEE Trans. Image Process. 2021, 30, 7914–7925. [Google Scholar] [CrossRef] [PubMed]
Qiu, H.; Wang, C.; Wang, J.; Wang, N.; Zeng, W. Cross view fusion for 3d human pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4342–4351. [Google Scholar]
Zhang, Z.; Wang, C.; Qiu, W.; Qin, W.; Zeng, W. AdaFuse: Adaptive multiview fusion for accurate human pose estimation in the wild. Int. J. Comput. Vis. 2021, 129, 703–718. [Google Scholar] [CrossRef]
Xie, R.; Wang, C.; Wang, Y. MetaFuse: A pre-trained fusion model for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 13686–13695. [Google Scholar]
Moliner, O.; Huang, S.; Åström, K. Geometry-biased transformer for robust multi-view 3d human pose reconstruction. In Proceedings of the 2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition (FG), Istanbul, Turkey, 27–31 May 2024; pp. 1–8. [Google Scholar]
Liao, Z.; Zhu, J.; Wang, C.; Hu, H.; Waslander, S.L. Multiple view geometry transformers for 3D human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 708–717. [Google Scholar]
Remelli, E.; Han, S.; Honari, S.; Fua, P.; Wang, R. Lightweight multi-view 3D pose estimation through camera-disentangled representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6040–6049. [Google Scholar]
Chharia, A.; Gou, W.; Dong, H. MV-SSM: Multi-View State Space Modeling for 3D Human Pose Estimation. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 11590–11599. [Google Scholar]
Luvizon, D.C.; Picard, D.; Tabia, H. Consensus-based optimization for 3D human pose estimation in camera coordinates. Int. J. Comput. Vis. 2022, 130, 869–882. [Google Scholar] [CrossRef]
Davoodnia, V.; Ghorbani, S.; Carbonneau, M.A.; Messier, A.; Etemad, A. UPose3D: Uncertainty-aware 3D human pose estimation with cross-view and temporal cues. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2024; pp. 19–38. [Google Scholar]
Jiang, B.; Hu, L.; Xia, S. Probabilistic triangulation for uncalibrated multi-view 3D human pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 14850–14860. [Google Scholar]
Gordon, B.; Raab, S.; Azov, G.; Giryes, R.; Cohen-Or, D. FLEX: Extrinsic parameters-free multi-view 3D human motion reconstruction. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2022; pp. 176–196. [Google Scholar]
Xu, Y.; Kitani, K. Multi-view multi-person 3D pose estimation with uncalibrated camera networks. In Proceedings of the British Machine Vision Conference (BMVC), London, UK, 21–24 November 2022. [Google Scholar]
Shuai, H.; Wu, L.; Liu, Q. Adaptive multi-view and temporal fusing transformer for 3d human pose estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 4122–4135. [Google Scholar] [CrossRef] [PubMed]
Li, Y.J.; Xu, Y.; Khirodkar, R.; Park, J.; Kitani, K. Multi-person 3d pose estimation from multi-view uncalibrated depth cameras. arXiv 2024, arXiv:2401.15616. [Google Scholar]
Chang, I.; Park, M.G.; Kim, J.; Yoon, J.H. Multi-view 3D human pose estimation with self-supervised learning. In Proceedings of the 2021 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), Jeju Island, Republic of Korea, 20–23 April 2021; pp. 255–257. [Google Scholar]
Rodriguez-Criado, D.; Bachiller-Burgos, P.; Vogiatzis, G.; Manso, L.J. Multi-person 3D pose estimation from unlabelled data. Mach. Vis. Appl. 2024, 35, 46. [Google Scholar] [CrossRef]
Wan, X.; Chen, Z.; Duan, B.; Zhao, X. Dual-diffusion for binocular 3D human pose estimation. Adv. Neural Inf. Process. Syst. 2024, 37, 78079–78103. [Google Scholar]
Reddy, N.D.; Guigues, L.; Pishchulin, L.; Eledath, J.; Narasimhan, S.G. TesseTrack: End-to-end learnable multi-person articulated 3D pose tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 15190–15200. [Google Scholar]
Zhang, Y.; Wang, C.; Wang, X.; Liu, W.; Zeng, W. VoxelTrack: Multi-person 3D human pose estimation and tracking in the wild. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 2613–2626. [Google Scholar] [CrossRef] [PubMed]
Choudhury, R.; Kitani, K.M.; Jeni, L.A. TEMPO: Efficient multi-view pose estimation, tracking, and forecasting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 14750–14760. [Google Scholar]
Zimmermann, C.; Welschehold, T.; Dornhege, C.; Burgard, W.; Brox, T. 3d human pose estimation in rgbd images for robotic task learning. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 21–25 May 2018; pp. 1986–1992. [Google Scholar]
Zhang, B.; Xiao, Y.; Xiong, F.; Wu, C.; Cao, Z.; Liu, P.; Zhou, J.T. 3D human pose estimation with cross-modality training and multi-scale local refinement. Appl. Soft Comput. 2022, 122, 108950. [Google Scholar] [CrossRef]
Guo, Y.; Li, Z.; Li, Z.; Du, X.; Quan, S.; Xu, Y. PoP-Net: Pose Over Parts Network for Multi-Person 3D Pose Estimation From a Depth Image. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2022; pp. 1240–1249. [Google Scholar]
Martínez-González, A.; Villamizar, M.; Canévet, O.; Odobez, J.M. Residual pose: A decoupled approach for depth-based 3D human pose estimation. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020–24 January 2021; pp. 10313–10318. [Google Scholar]
Szczuko, P. Deep neural networks for human pose estimation from a very low resolution depth image. Multimed. Tools Appl. 2019, 78, 29357–29377. [Google Scholar] [CrossRef]
Aso, K.; Hwang, D.H.; Koike, H. Portable 3D human pose estimation for human-human interaction using a chest-mounted fisheye camera. In Proceedings of the Augmented Humans International Conference 2021, Rovaniemi, Finland, 22–24 February 2021; pp. 116–120. [Google Scholar]
Zhang, Y.; You, S.; Karaoglu, S.; Gevers, T. Multi-person 3D pose estimation from a single image captured by a fisheye camera. Comput. Vis. Image Underst. 2022, 222, 103505. [Google Scholar] [CrossRef]
Goyal, G.; Di Pietro, F.; Carissimi, N.; Glover, A.; Bartolozzi, C. MoveEnet: Online high-frequency human pose estimation with an event camera. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 4024–4033. [Google Scholar]
Lang, B.; Chuah, M.C. Event-Guided Video Transformer for End-to-End 3D Human Pose Estimation. In Proceedings of the 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Tucson, AZ, USA, 26 February–6 March 2025; pp. 5114–5124. [Google Scholar]
Lang, B.; Chuah, M.C. Event-Guided Fusion-Mamba for Context-Aware 3D Human Pose Estimation. In Proceedings of the 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Tucson, AZ, USA, 26 February–6 March 2025; pp. 950–960. [Google Scholar]
Koleini, F.; Saleem, M.U.; Wang, P.; Xue, H.; Helmy, A.; Fenwick, A. BioPose: Biomechanically-accurate 3D Pose Estimation from Monocular Videos. In Proceedings of the 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Tucson, AZ, USA, 26 February–6 March 2025; pp. 6330–6339. [Google Scholar]
Yu, C.; Xiao, B.; Gao, C.; Yuan, L.; Zhang, L.; Sang, N.; Wang, J. Lite-hrnet: A lightweight high-resolution network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 10440–10450. [Google Scholar]
Han, J.; Wang, Y. Greit-HRNet: Grouped Lightweight High-Resolution Network for Human Pose Estimation. In Proceedings of the Asian Conference on Computer Vision, Hanoi, Vietnam, 8–12 December 2024; pp. 3771–3787. [Google Scholar]
Li, Q.; Zhang, Z.; Xiao, F.; Zhang, F.; Bhanu, B. Dite-HRNet: Dynamic lightweight high-resolution network for human pose estimation. arXiv 2022, arXiv:2204.10762. [Google Scholar]
Zhang, F.; Zhu, X.; Ye, M. Fast human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3517–3526. [Google Scholar]
Diaz-Arias, A.; Shin, D. ConvFormer: Parameter reduction in transformer models for 3D human pose estimation by leveraging dynamic multi-headed convolutional attention. Vis. Comput. 2024, 40, 2555–2569. [Google Scholar] [CrossRef]
Sun, Y.; Dougherty, A.W.; Zhang, Z.; Choi, Y.K.; Wu, C. Mixsynthformer: A transformer encoder-like structure with mixed synthetic self-attention for efficient human pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 4–6 October 2023; pp. 14884–14893. [Google Scholar]
Jiang, T.; Lu, P.; Zhang, L.; Ma, N.; Han, R.; Lyu, C.; Li, Y.; Chen, K. Rtmpose: Real-time multi-person pose estimation based on mmpose. arXiv 2023, arXiv:2303.07399. [Google Scholar]
Zeng, A.; Ju, X.; Yang, L.; Gao, R.; Zhu, X.; Dai, B.; Xu, Q. Deciwatch: A simple baseline for 10× efficient 2d and 3d pose estimation. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2022; pp. 607–624. [Google Scholar]
Xu, Y.; Zhao, L.; Gong, C.; Li, G.; Wang, D.; Wang, N. DynPose: Largely Improving the Efficiency of Human Pose Estimation by a Simple Dynamic Framework. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 1160–1169. [Google Scholar]
Zhang, Y.; Wang, Y.; Camps, O.; Sznaier, M. Key frame proposal network for efficient pose estimation in videos. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2020; pp. 609–625. [Google Scholar]
Hwang, D.H.; Kim, S.; Monet, N.; Koike, H.; Bae, S. Lightweight 3d human pose estimation network training using teacher-student learning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; pp. 479–488. [Google Scholar]
Bulat, A.; Tzimiropoulos, G.; Kossaifi, J.; Pantic, M. Improved training of binary networks for human pose estimation and image recognition. arXiv 2019, arXiv:1904.05868. [Google Scholar] [CrossRef]
Xu, L.; Guan, Y.; Jin, S.; Liu, W.; Qian, C.; Luo, P.; Ouyang, W.; Wang, X. Vipnas: Efficient video pose estimation via neural architecture search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 22–25 June 2021; pp. 16072–16081. [Google Scholar]
Xu, L.; Jin, S.; Liu, W.; Qian, C.; Ouyang, W.; Luo, P.; Wang, X. Zoomnas: Searching for whole-body human pose estimation in the wild. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 5296–5313. [Google Scholar] [CrossRef]
Liu, H.; Liu, W.; Chi, Z.; Wang, Y.; Yu, Y.; Chen, J.; Tang, J. Fast human pose estimation in compressed videos. IEEE Trans. Multimed. 2022, 25, 1390–1400. [Google Scholar] [CrossRef]
Ma, H.; Wang, Z.; Chen, Y.; Kong, D.; Chen, L.; Liu, X.; Yan, X.; Tang, H.; Xie, X. Ppt: Token-pruned pose transformer for monocular and multi-view human pose estimation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 424–442. [Google Scholar]
Li, W.; Liu, M.; Liu, H.; Wang, P.; Cai, J.; Sebe, N. Hourglass tokenizer for efficient transformer-based 3D human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 604–613. [Google Scholar]
Pham, H.H.; Salmane, H.; Khoudour, L.; Crouzil, A.; Velastin, S.A.; Zegers, P. A unified deep framework for joint 3d pose estimation and action recognition from a single rgb camera. Sensors 2020, 20, 1825. [Google Scholar] [CrossRef] [PubMed]
Luvizon, D.C.; Picard, D.; Tabia, H. 2d/3d pose estimation and action recognition using multitask deep learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5137–5146. [Google Scholar]
Luvizon, D.C.; Picard, D.; Tabia, H. Multi-task deep learning for real-time 3D human pose estimation and action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 2752–2764. [Google Scholar] [CrossRef] [PubMed]
Ahmad, N.; Khan, J.; Kim, J.Y.; Lee, Y. Joint human pose estimation and instance segmentation with PosePlusSeg. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022; Volume 36, pp. 69–76. [Google Scholar]
Sárándi, I.; Hermans, A.; Leibe, B. Learning 3d human pose estimation from dozens of datasets using a geometry-aware autoencoder to bridge between skeleton formats. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–7 January 2023; pp. 2956–2966. [Google Scholar]
Jeong, U.; Freer, J.; Baek, S.; Chang, H.J.; Kim, K.I. PoseBH: Prototypical Multi-Dataset Training Beyond Human Pose Estimation. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 12278–12288. [Google Scholar]
Shan, W.; Liu, Z.; Zhang, X.; Wang, S.; Ma, S.; Gao, W. P-stmo: Pre-trained spatial temporal many-to-one model for 3d human pose estimation. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2022; pp. 461–478. [Google Scholar]
Zhu, W.; Ma, X.; Liu, Z.; Liu, L.; Wu, W.; Wang, Y. Motionbert: A unified perspective on learning human motion representations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 4–6 October 2023; pp. 15085–15099. [Google Scholar]
Wang, Y.; Wu, Y.; He, W.; Guo, X.; Zhu, F.; Bai, L.; Zhao, R.; Wu, J.; He, T.; Ouyang, W.; et al. Hulk: A universal knowledge translator for human-centric tasks. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 7, 5672–5689. [Google Scholar] [CrossRef] [PubMed]
Xu, Y.; Zhang, J.; Zhang, Q.; Tao, D. Vitpose++: Vision transformer for generic body pose estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 46, 1212–1230. [Google Scholar] [CrossRef]
Ci, Y.; Wang, Y.; Chen, M.; Tang, S.; Bai, L.; Zhu, F.; Zhao, R.; Yu, F.; Qi, D.; Ouyang, W. Unihcp: A unified model for human-centric perceptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 17840–17852. [Google Scholar]
Dabhi, M.; Jeni, L.A.; Lucey, S. 3d-lfm: Lifting foundation model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 10466–10475. [Google Scholar]
Jiang, Z.; Chai, W.; Li, L.; Zhou, Z.; Yang, C.Y.; Hwang, J.N. Unihpe: Towards unified human pose estimation via contrastive learning. arXiv 2023, arXiv:2311.16477. [Google Scholar] [CrossRef]
Fang, H.S.; Li, J.; Tang, H.; Xu, C.; Zhu, H.; Xiu, Y.; Li, Y.L.; Lu, C. Alphapose: Whole-body regional multi-person pose estimation and tracking in real-time. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 7157–7173. [Google Scholar] [CrossRef] [PubMed]
Jiang, T.; Xie, X.; Li, Y. RTMW: Real-time multi-person 2D and 3D whole-body pose estimation. arXiv 2024, arXiv:2407.08634. [Google Scholar]
Samet, N.; Akbas, E. HPRNet: Hierarchical point regression for whole-body human pose estimation. Image Vis. Comput. 2021, 115, 104285. [Google Scholar] [CrossRef]
Rey, R. Monocular 3D Human Pose Estimation. Master’s Thesis, KTH, School of Electrical Engineering and Computer Science (EECS), Stockholm, Sweden, 2023. [Google Scholar]
Giulietti, N.; Todesca, D.; Carnevale, M.; Giberti, H. A Real-Time Human Pose Measurement System for Human-In-The-Loop Dynamic Simulators. IEEE Access 2025, 13, 24954–24969. [Google Scholar] [CrossRef]
Bridgeman, L.; Volino, M.; Guillemaut, J.Y.; Hilton, A. Multi-person 3d pose estimation and tracking in sports. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
Jiang, J.H.; Xia, N. PCNet: A human pose compensation network based on incremental learning for sports actions estimation. Complex Intell. Syst. 2025, 11, 17. [Google Scholar] [CrossRef]
Baumgartner, T.; Klatt, S. Monocular 3d human pose estimation for sports broadcasts using partial sports field registration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 5109–5118. [Google Scholar]
Huang, W.; Ni, Y.; Rezvani, A.; Jeong, S.; Chen, H.; Liu, Y.; Wen, F.; Imani, M. Recoverable anonymization for pose estimation: A privacy-enhancing approach. In Proceedings of the 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Tucson, AZ, USA, 26 February–6 March 2025; pp. 5239–5249. [Google Scholar]
Akada, H.; Wang, J.; Golyanik, V.; Theobalt, C. Bring Your Rear Cameras for Egocentric 3D Human Pose Estimation. In Proceedings of the International Conference on Computer Vision (ICCV), Honolulu, HI, USA, 19–23 October 2025. [Google Scholar]
Matsune, A.; Hu, S.; Li, G.; Wen, S.; Zhu, X.; Tan, Z. A geometry loss combination for 3d human pose estimation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 1–6 January 2024; pp. 3272–3281. [Google Scholar]
Hsu, C.H.; Jang, J.S.R. Enhancing 3D Human Pose Estimation with Bone Length Adjustment. In Proceedings of the Asian Conference on Computer Vision, Hanoi, Vietnam, 8–12 December 2024; pp. 3723–3738. [Google Scholar]
Joo, H.; Neverova, N.; Vedaldi, A. Exemplar fine-tuning for 3d human model fitting towards in-the-wild 3d human pose estimation. In Proceedings of the 2021 International Conference on 3D Vision (3DV), Virtual, 1–3 December 2021; pp. 42–52. [Google Scholar]
Chang, J.Y.; Moon, G.; Lee, K.M. Poselifter: Absolute 3D human pose lifting network from a single noisy 2D human pose. arXiv 2019, arXiv:1910.12029. [Google Scholar]
Kim, J.H.; Han, J.; Lee, S.W. PoseAnchor: Robust Root Position Estimation for 3D Human Pose Estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Honolulu, HI, USA, 19–23 October 2025; pp. 7079–7088. [Google Scholar]
Hao, X.; Li, H. PersPose: 3D Human Pose Estimation with Perspective Encoding and Perspective Rotation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Honolulu, HI, USA, 19–23 October 2025; pp. 8110–8119. [Google Scholar]
Zhan, Y.; Li, F.; Weng, R.; Choi, W. Ray3d: Ray-based 3d human pose estimation for monocular absolute 3d localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 21–24 June 2022; pp. 13116–13125. [Google Scholar]
Wang, Z.; Chen, R.; Liu, M.; Dong, G.; Basu, A. SPGNet: Spatial projection guided 3D human pose estimation in low dimensional space. In Proceedings of the International Conference on Smart Multimedia; Springer: Berlin/Heidelberg, Germany, 2022; pp. 41–55. [Google Scholar]
Lee, G.H.; Lee, S.W. Uncertainty-aware human mesh recovery from video by learning part-based 3d dynamics. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 12375–12384. [Google Scholar]
Kan, Z.; Chen, S.; Zhang, C.; Tang, Y.; He, Z. Self-correctable and adaptable inference for generalizable human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 5537–5546. [Google Scholar]

Figure 1. The evolution of output representations: (Left) coordinate regression directly maps pixels to (x,y) values; (Middle) 2D heatmaps represent joints as Gaussian peaks, preserving spatial context; (Right) 3D volumetric heatmaps extend this distribution concept into depth in order to handle spatial uncertainty in 3D space.
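
The contrast between the heatmap and coordinate views in Figure 1 can be made concrete with a minimal sketch. It is written under stated assumptions: the function names, the 24 × 16 grid, and the value of sigma are hypothetical choices made for illustration and are not taken from any reviewed method. It renders one joint as a Gaussian peak and decodes it twice, once by taking the highest-scoring pixel and once by taking the expectation over the normalized map (the idea behind integral-style regression).

```python
import math

def gaussian_heatmap(cx, cy, width, height, sigma=1.5):
    """Render a 2D Gaussian peak centered on (cx, cy) as a list of rows."""
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
             for x in range(width)] for y in range(height)]

def argmax_decode(heatmap):
    """Return the integer (x, y) of the highest-scoring pixel."""
    best = max((v, x, y) for y, row in enumerate(heatmap) for x, v in enumerate(row))
    return best[1], best[2]

def soft_argmax_decode(heatmap):
    """Return the expected (x, y) under the normalized heatmap."""
    total = sum(sum(row) for row in heatmap)
    ex = sum(x * v for row in heatmap for x, v in enumerate(row)) / total
    ey = sum(y * v for y, row in enumerate(heatmap) for v in row) / total
    return ex, ey

# Hypothetical joint at a sub-pixel location on a small grid.
hm = gaussian_heatmap(cx=10.3, cy=7.6, width=24, height=16)
print(argmax_decode(hm))       # (10, 8): the nearest grid cell
print(soft_argmax_decode(hm))  # sub-pixel estimate close to (10.3, 7.6)
```

The argmax result is quantized to the grid, which is one way to see why heatmap resolution limits precision and why sub-pixel decoding schemes were proposed.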

Figure 2. Receptive field in hourglass architectures. By repeatedly downsampling and upsampling features with skip connections, the network captures global context to resolve local ambiguities, allowing the receptive field to cover the entire body structure.

Figure 3. Overview of a transformer-based approach for pose estimation. By dividing the image into patches and applying self-attention, transformers model long-range dependencies and global context across the entire image at once, which can yield better performance than traditional CNNs with limited receptive fields. This figure illustrates the distinct use of transformers to solve pose estimation tasks.
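
The patch-and-attend mechanism described in the Figure 3 caption can be sketched in a few lines. This is a deliberately simplified illustration, not the architecture of any reviewed model: it uses a single head with identity query, key, and value projections, no positional encoding, and a hypothetical 8 × 8 toy image. The helper names are assumptions introduced here.

```python
import math

def patchify(image, patch):
    """Split an H x W image (list of rows) into flattened patch vectors."""
    h, w = len(image), len(image[0])
    return [[image[y + dy][x + dx] for dy in range(patch) for dx in range(patch)]
            for y in range(0, h, patch) for x in range(0, w, patch)]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def self_attention(tokens):
    """Single-head attention with identity projections (illustrative simplification)."""
    d = len(tokens[0])
    weights = [softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens])
               for q in tokens]
    mixed = [[sum(w * v[i] for w, v in zip(row, tokens)) for i in range(d)]
             for row in weights]
    return weights, mixed

# Hypothetical toy image; a real model would embed patches with learned weights.
image = [[(x * y) % 5 / 5 for x in range(8)] for y in range(8)]
tokens = patchify(image, patch=4)
weights, mixed = self_attention(tokens)
print(len(tokens), len(tokens[0]))        # 4 16
print([round(w, 3) for w in weights[0]])  # one weight per patch, summing to 1
```

Every patch receives a non-zero weight from every other patch in a single layer, which is the sense in which attention provides global context without stacking many local operations.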

Figure 4. Major challenges in 3D pose estimation. The top row illustrates the domain gap: (Left) a controlled laboratory environment with a clean background and (Middle) a complex “in-the-wild” scenario from the COCO dataset. The middle and right panels illustrate ambiguity and occlusion. (Middle) The red estimated skeletons show plausible 2D poses that are inaccurate in 3D because of self-occlusion or inherent depth ambiguity, since multiple 3D configurations project to the same 2D image. (Right) The ambiguity for human viewers (red by the author and cyan by the annotator).

Figure 5. Visualizing uncertainty through multi-hypothesis generation. (Left) A ground truth pose (red) is surrounded by multiple predicted hypotheses (blue), representing the distribution of possible poses. (Right) Two selected high-confidence hypotheses illustrating depth ambiguity, where the model is uncertain about the exact 3D depth of the legs despite similar 2D projections.
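
The depth ambiguity shown in Figure 5 has a simple geometric reading, sketched below under explicit assumptions: a pinhole camera with a hypothetical focal length of 1000 pixels, a known hip position, a known thigh length of 0.45 m, and an observed knee pixel. All of these values and function names are illustrative and do not come from the reviewed papers. Intersecting the viewing ray of the knee pixel with a sphere of radius equal to the bone length around the hip generally yields two 3D candidates that share the same 2D projection.

```python
import math

def project(point, focal=1000.0):
    """Pinhole projection of a camera-frame 3D point (x, y, z) to pixels."""
    x, y, z = point
    return focal * x / z, focal * y / z

def child_candidates(parent, pixel, bone_length, focal=1000.0):
    """3D points on the pixel's viewing ray at bone_length from the parent joint."""
    ray = (pixel[0] / focal, pixel[1] / focal, 1.0)  # a ray point is depth * ray
    a = sum(r * r for r in ray)
    b = -2.0 * sum(r * p for r, p in zip(ray, parent))
    c = sum(p * p for p in parent) - bone_length ** 2
    disc = b * b - 4 * a * c
    if disc < 0:
        return []  # the ray misses the sphere around the parent
    root = math.sqrt(disc)
    depths = [(-b - root) / (2 * a), (-b + root) / (2 * a)]
    return [tuple(d * r for r in ray) for d in depths if d > 0]

hip = (0.0, 0.0, 3.0)       # meters, camera frame (hypothetical)
knee_pixel = (50.0, 130.0)  # observed 2D knee location (hypothetical)
for knee in child_candidates(hip, knee_pixel, bone_length=0.45):
    print([round(v, 3) for v in knee], [round(p, 1) for p in project(knee)])
```

Both candidates satisfy the bone-length constraint and reproject to the same pixel, so a single deterministic output must discard one of them, whereas multi-hypothesis and probabilistic estimators can keep both.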

Figure 6. Diffusion models for pose estimation. Starting from random noise, the model iteratively denoises the latent representation conditioned on image features to resolve ambiguity and reconstruct a plausible 3D pose. This approach effectively handles occlusions, ambiguity, and complex articulations.

Figure 7. The paradigm of foundation models in human-centric tasks. In contrast to traditional task-specific training, a large-scale pretrained foundation model learns general human representations that can be adapted or specialized for various downstream tasks such as 3D pose estimation, mesh recovery, and action recognition. This figure represents the Hulk [256] architecture.

Table 1. Conceptual taxonomy of the review. This table outlines the intellectual evolution of the field, organizing the survey into five structural pillars (Column 1) and their corresponding paradigm shifts (Columns 2 & 3).

Structural Pillar	Paradigm/Theme	Key Concepts & Intellectual Evolution
I. Representation (Section 4)	Coordinates to Distributions	Direct Regression → Heatmaps → Integral Regression → Debiasing (Dark/UDP)
	Multi-Person Grouping	Part Affinity Fields (PAFs) → Geometric Fields → Associative Embeddings → Centers & Anchor Points
	The Third Dimension	Volumetric Heatmaps → Ordinal/Ranking Depth → Kinematic & Structural Encoding
II. Architecture (Section 5)	Spatial Context (CNNs)	Stacked Hourglass → Multi-Stage Refinement → High-Resolution (HRNet) → Hybrid Designs
	Global Context (Transformers)	CNN–Transformer Hybrids → Pure Transformers (ViT) → Pose-as-Sequence (Tokenization)
	Graph & Structure	Graph Convolutional Networks (GCNs) → Directed Graphs → Kinematic Topology Modeling
III. Ambiguity & Generalization (Section 6)	Uncertainty Modeling	Deterministic Prediction → Multi-Hypothesis Generation → Probabilistic Distributions → Diffusion Models
	Domain Gap (In-the-Wild)	Weak Supervision → Self-Supervised Learning → Adversarial Adaptation → Synthetic Data Generators
	Robustness	Occlusion Reasoning (Visibility Tokens) → Crowd Modeling (Relational Graphs)
IV. Contextual Extension (Section 7)	Temporal Dynamics	RNNs/LSTMs → Temporal Convolutions (TCNs) → Spatiotemporal Transformers → State-Space Models (Mamba)
	Multi-View Geometry	Algebraic Triangulation → Learnable Fusion → Epipolar Transformers → Uncalibrated/Parameter-Free
	Sensors & Modalities	RGB-D Fusion → Event Cameras (High Speed) → LiDAR/IMU Integration
V. Efficiency & Frontier (Section 8)	Efficiency & Deployment	Lightweight Backbones → Knowledge Distillation → Quantization → Token Pruning
	Unification	Multi-Task Learning → Unified Datasets → Foundation Models (Large-Scale Pre-training)
	Human-Centric Tasks	Whole-Body Estimation → Sports Biomechanics → Physics-Awareness → Privacy & Fairness

Table 2. Coverage of recent HPE surveys across the conceptual pillars of this review. ✓ = full coverage; P = partial; – = none. Rows are sorted by year of the most recent survey version.

Survey	Year	Scope	Represent.	Archit.	Ambiguity	Context	Apps/Frontier	Conceptual Evol.
Dang et al. [1]	2019	2D	P	P	–	–	–	–
Chen et al. [5]	2020	3D mono.	P	P	P	P	–	–
Ji et al. [4]	2020	3D mono.	P	P	P	–	–	–
Munea et al. [2]	2020	2D	P	P	–	–	–	–
Ben Gamra & Akhloufi [24]	2021	2D + 3D	✓	✓	P	P	–	P
Liu et al. [6]	2021	2D + 3D	P	✓	P	P	–	P
Song et al. [23]	2021	Action recog.	P	P	–	P	P	–
Wang et al. [9]	2021	3D	P	✓	P	✓	–	–
Dubey & Dixit [12]	2023	2D + 3D	P	✓	P	P	–	–
Lan et al. [14]	2023	2D + 3D	P	P	–	–	✓	–
Azam & Desai [20]	2024	Egocentric	P	✓	P	✓	P	–
Algabri et al. [21]	2024	Head pose	P	P	–	–	P	–
Hou et al. [18]	2024	Tutorial	P	P	–	–	–	–
Neupane et al. [10]	2024	3D	✓	✓	✓	✓	P	P
Liu et al. [11]	2024	3D + Mesh	P	✓	P	✓	P	P
Suo et al. [22]	2024	Sports MoCap	P	P	–	P	✓	–
Gao et al. [15]	2025	HPE+downstr.	P	✓	P	✓	P	P
Guo et al. [7]	2025	3D mono.	✓	✓	✓	P	–	P
Jayaswal et al. [17]	2025	Structural	P	P	–	P	–	–
Nogueira et al. [19]	2025	Multi-view	P	✓	P	✓	–	–
Salisu et al. [16]	2025	3D arch.	P	✓	P	–	–	–
Sun et al. [13]	2025	2D + 3D	✓	✓	✓	P	P	P
Udayan et al. [8]	2025	3D mono.	✓	✓	P	P	P	P
Zhang & Shin [3]	2025	2D	✓	✓	P	–	P	P
This review	2026	2D + 3D + 4D	✓	✓	✓	✓	✓	✓

Table 3. Datasets as drivers of paradigm shifts. Each row pairs a benchmark with the conceptual challenge it introduced and the paradigm or method family it enabled in the subsequent literature.

Dataset	Year	Challenge Introduced	Paradigm/Method Enabled	Main Related Sections
Human3.6M [25]	2014	Large-scale, accurate, lab-clean 3D ground truth	Supervised 2D-to-3D lifting; volumetric heatmaps; protocol-based MPJPE evaluation	Section 4.4 and Section 5
MPII Human Pose [27]	2014	In-the-wild 2D variety with daily activities	Stacked hourglass and multi-stage refinement; PCKh evaluation	Section 5.1
COCO Keypoints [28]	2014	Crowded scenes, occlusion, scale variation	Bottom-up grouping (PAFs, embeddings, centers); AP/OKS protocol	Section 4.3 and Section 5.1
SURREAL [26]	2017	Sim-to-real for full-body shape and mesh	Synthetic pre-training; sim2real adaptation; mesh recovery at scale	Section 3 and Section 6.2
MPI-INF-3DHP [32]	2017	Bridging lab and in-the-wild 3D	Studio + real mixing; PCK3D/AUC evaluation; weakly supervised 3D	Section 3 and Section 6.2
3DPW [33]	2018	Real-world 3D ground truth via IMU fusion	Domain adaptation; mesh recovery in the wild; PA-MPJPE as SOTA proxy	Section 3 and Section 6.2
COCO-WholeBody [35]	2020	Whole-body keypoints (body+face+hands+feet)	Whole-body estimation; coarse-to-fine architectures; per-part AP_wb	Section 8.3
SLOPER4D [38]	2023	Global 4D pose in scanned urban scenes (LiDAR+IMU)	Scene-aware HPE; global-coordinate evaluation; human-scene interaction	Section 7.2 and Section 8.3
FreeMan [34]	2024	Uncalibrated multi-view at scale, smartphone capture	Uncalibrated/parameter-free multi-view; O-MPJPE; robustness to clutter	Section 7.2
AthletePose3D [36]	2025	High-speed, non-periodic athletic motion	Sports-biomechanics fine-tuning; physics-aware post-processing	Section 8.3 and Section 9
LDPose [37]	2025	Limb-difference inclusivity; variable skeletons	Skeleton-agnostic/variable-topology models; inclusive evaluation	Section 8.3 and Section 9
MoviCam [39]	2025	3D pose from a moving RGB camera; physics annotations	Physics-aware optimization; gravity/ground-contact constraints	Section 7.3 and Section 8.3

Table 4. Coordinate representations: from direct regression to probability distributions (Section 4.1 and Section 4.2).

Method (Concept)	Reference	Dataset	Metric	Score
Paradigm: Direct Regression ↑ (Section 4.1)
DeepPose	Toshev & Szegedy [40]	LSP (test)	PCP (mean)	61.0%
Paradigm: 2D Heatmaps & Distributions ↑ (Section 4.2)
Integral Regression	Sun et al. [41]	COCO	AP (OKS)	67.8
Heatmap Refinement	Jiang et al. [46]	COCO	AP (OKS)	70.3
SWAHR	Luo et al. [45]	COCO	AP (OKS)	71.6
DARK	Zhang et al. [48]	COCO	AP (OKS)	76.2
UDP	Huang et al. [47]	COCO	AP (OKS)	76.5
Anisotropic Gaussian	Liu et al. [50]	COCO	AP (OKS)	79.1
ProbPose	Purkrabek et al. [51]	CropCOCO	mAP (OKS)	81.7
Paradigm: Application to 3D Coordinates ↓
Cascaded 3D Regression	Li et al. [42]	Human3.6M	MPJPE	50.9 mm
HEMlets Pose (3D heatmaps)	Zhou et al. [44]	Human3.6M	MPJPE	39.9 mm

Table 5. Bottom-up multi-person grouping (Section 4.3).

Grouping Family	Method	Dataset	Metric	Score
Part A: 2D Grouping (Primarily COCO) ↑
Pioneering PAFs	OpenPose [29]	COCO	AP	61.8
Geometric Embeddings	PersonLab [31]	COCO	AP	66.5
Composite Fields	PifPaf [53]	COCO	AP	66.7
Poses as Objects	KAPAO [60]	COCO	AP	70.3
Dual Anat. Centers	Dual Centers [57]	COCO	AP	71.0
Decentralized Centers	DecenterNet [58]	COCO	AP	71.2
Anchor Centers	Double Anchor [56]	CrowdPose	AP	66.9 (+1.5)
Part B: Extension to 3D Grouping ↓
3D Maps (ORPM)	ORPM [54]	CMU Panoptic	MPJPE	68.5 mm (−3.6)
Absolute 3D Maps	SMAP [55]	CMU Panoptic	MPJPE	61.8 mm

Table 6. Representations of the third dimension (Section 4.4) grouped by paradigm (volumetric, ordinal, kinematic).

Method (Concept)	Reference	Dataset	Metric ↓	Score
Paradigm: Volumetric/Heatmaps
MobileHumanPose	Choi et al. [65]	Human3.6M	MPJPE	79.6 mm
Marginal Heatmaps	Nibali et al. [64]	Human3.6M	MPJPE	55.4 mm
MeTRAbs (Metric Maps)	Sárándi et al. [66]	Human3.6M	MPJPE	49.3 mm
Paradigm: Ordinal
Ordinal Supervision	Pavlakos et al. [68]	Human3.6M	MPJPE	56.2 mm
DRPose3D (Ranking)	Wang et al. [67]	Human3.6M	MPJPE	42.9 mm
Paradigm: Kinematic/Structured
Kinematic Preservation	Kundu et al. [69]	Human3.6M	MPJPE	56.1 mm
Compositional Tokens	Geng et al. [71]	Human3.6M	MPJPE	47.8 mm
Pose Grammar	Fang et al. [72]	Human3.6M	MPJPE	45.7 mm
Bone Decomposition	Chen et al. [70]	Human3.6M	MPJPE	35.0 mm

Table 7. CNN architectures: multi-stage, pyramidal, and high-resolution (Section 5.1). Comparison on standard 2D benchmarks.

Method	Reference	Backbone	Dataset	Score
Primary Benchmark: COCO test-dev (Metric: AP ↑)
HigherHRNet	Cheng et al. [82]	HigherHRNet-W48	COCO	70.5
CPN	Chen et al. [74]	ResNet-Inception	COCO	72.1
Simple Baselines	Xiao et al. [78]	ResNet-152	COCO	73.7
HRNet	Sun et al. [81]	HRNet-W48	COCO	75.5
MIPNet	Khirodkar et al. [90]	-	COCO	75.7
MSPN	Li et al. [75]	4× Res-50	COCO	76.1
RSN	Cai et al. [87]	4× RSN-50	COCO	78.6
Primary Benchmark: MPII test (Metric: PCKh@0.5 ↑)
DU-Net	Tang et al. [80]	16× U-Nets	MPII	91.2
Spatial Context	Zhang et al. [77]	8× Hourglass	MPII	92.5
CFA	Su et al. [76]	R-101 + 4× R-50	MPII	93.9
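
The two 2D metrics behind Table 7 can be illustrated with a short sketch. PCKh@0.5 counts a joint as correct when its error is within half the head size, and COCO AP is built on the object keypoint similarity (OKS). The code below assumes hypothetical keypoints, a hypothetical object area, and hypothetical per-keypoint constants (`kappas`); the official COCO evaluation uses fixed constants per keypoint type and averages precision over several OKS thresholds, which is not reproduced here.

```python
import math

def pckh(pred, gt, head_size, alpha=0.5):
    """Fraction of joints whose error is within alpha * head_size."""
    hits = sum(math.dist(p, g) <= alpha * head_size for p, g in zip(pred, gt))
    return hits / len(gt)

def oks(pred, gt, visible, area, kappas):
    """Object keypoint similarity averaged over labeled keypoints."""
    terms = [math.exp(-math.dist(p, g) ** 2 / (2 * area * k ** 2))
             for p, g, v, k in zip(pred, gt, visible, kappas) if v > 0]
    return sum(terms) / len(terms)

# Hypothetical three-joint example in pixel coordinates.
gt = [(100.0, 100.0), (120.0, 150.0), (80.0, 150.0)]
pred = [(102.0, 101.0), (120.0, 165.0), (80.0, 150.0)]

print(round(pckh(pred, gt, head_size=20.0), 3))  # 0.667: two of three joints within 10 px
print(round(oks(pred, gt, visible=[1, 1, 1], area=5000.0, kappas=[0.05, 0.1, 0.1]), 3))
```

PCKh applies a hard threshold scaled by head size, while OKS decays smoothly with error and normalizes by object scale, which is why the two benchmark blocks in the table are not directly comparable.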

Table 8. GCN-based 2D-to-3D lifting (Section 5.2.1). Metric: MPJPE (mm) on Human3.6M (Protocol 1).

Method	Reference	Dataset	Metric	Score ↓
MöbiusGCN	Azizi et al. [97]	Human3.6M	MPJPE	52.1 mm
Graph Stacked Hourglass	Xu & Takano [94]	Human3.6M	MPJPE	51.9 mm
GraphMLP	Li et al. [99]	Human3.6M	MPJPE	48.0 mm
Conditional Directed GCN	Hu et al. [96]	Human3.6M	MPJPE	41.1 mm
Optimizing Network Structure	Ci et al. [95]	Human3.6M	MPJPE	36.3 mm

Table 9. Transformer architectures (Section 5.2.2). Comparison of heatmap-based hybrids vs. direct regression models on the COCO test set.

Method	Reference	Dataset	Metric	Score ↑
Heatmap-based transformers
VTTransPose	Li et al. [103]	COCO	AP	73.6
TransPose	Yang et al. [100]	COCO	AP	75.0
HRFormer	Yuan et al. [101]	COCO	AP	76.2
Polarized Self-Attn.	Liu et al. [102]	COCO	AP	79.4
Regression-based transformers
DirectPose	Tian et al. [109]	COCO	AP	64.8
Cascade Transformers	Li et al. [107]	COCO	AP	72.1
TFPose	Mao et al. [105]	COCO	AP	72.2
Group Pose	Liu et al. [110]	COCO	AP	72.8
Poseur	Mao et al. [108]	COCO	AP	78.3
PE-former	Panteleris & Argyros [106]	COCO (Val)	AP	72.6

Table 10. Monocular 2D-to-3D lifting strategies (Section 5.2.3). Impact of visual context on lifting accuracy (Human3.6M, Protocol 1 and Protocol 2).

Method	Reference	Dataset	MPJPE ↓	PA-MPJPE ↓
ContextPose	Ma et al. [112]	Human3.6M	43.4 mm	34.6 mm
Single 2D + Context	Zhao et al. [113]	Human3.6M	39.8 mm	32.7 mm
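
For reference, the two error measures reported in Table 10 are commonly defined as follows. The notation is introduced here for illustration and is not taken from the tables: J is the number of joints, \hat{\mathbf{p}}_j and \mathbf{p}_j are the predicted and ground-truth 3D positions of joint j, and (s, \mathbf{R}, \mathbf{t}) are the scale, rotation, and translation of a similarity transform. Evaluation protocols differ in details such as root alignment, so this should be read as a sketch of the usual convention rather than the exact protocol of each cited paper.

```latex
\begin{align}
\mathrm{MPJPE} &= \frac{1}{J}\sum_{j=1}^{J}
  \left\lVert \hat{\mathbf{p}}_j - \mathbf{p}_j \right\rVert_2 \\
(s^{*}, \mathbf{R}^{*}, \mathbf{t}^{*}) &= \arg\min_{s,\,\mathbf{R},\,\mathbf{t}}
  \sum_{j=1}^{J} \left\lVert s\,\mathbf{R}\,\hat{\mathbf{p}}_j + \mathbf{t} - \mathbf{p}_j \right\rVert_2^{2} \\
\mathrm{PA\text{-}MPJPE} &= \frac{1}{J}\sum_{j=1}^{J}
  \left\lVert s^{*}\mathbf{R}^{*}\hat{\mathbf{p}}_j + \mathbf{t}^{*} - \mathbf{p}_j \right\rVert_2
\end{align}
```

Because the Procrustes alignment removes global scale, rotation, and translation before the error is averaged, PA-MPJPE is never larger than MPJPE for the same prediction, which is consistent with the two columns of the table.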

Table 11. Probabilistic and generative approaches (Section 6.1). Comparison of multi-hypothesis and diffusion-based methods on Human3.6M (Protocol 1).

Method	Reference	Paradigm	MPJPE ↓
Multi-Hypothesis and Uncertainty
Uncertainty Learning	Han et al. [116]	Aleatoric	66.7 mm
CVAE + Ordinal Ranking	Sharma et al. [115]	CVAE	58.0 mm
Mixture Density Network	Li & Lee [114]	MDN	52.7 mm
ManiPose	Rommel et al. [117]	Manifold	39.1 mm
Diffusion Models
ZeDO (Zero-shot)	Jiang et al. [125]	Optimization	51.4 mm
Di²Pose (Occlusion)	Wang et al. [122]	Discrete Diff.	49.2 mm
DiffPose	Holmquist & Wandt [118]	Diffusion	43.3 mm
Hypothesis Aggregation	Shan et al. [120]	Diffusion	39.5 mm
DiffPose	Gong et al. [119]	Diffusion	36.9 mm
FinePOSE	Xu et al. [121]	Prompt-Diffusion	31.9 mm

Table 12. Generalization methods on Human3.6M (Section 6.2).

Method	Reference	Dataset	Metric	Score
Unsupervised (Geo-Aware)	Rhodin et al. [127]	Human3.6M	MPJPE	131.7 mm
Weak Sup. (Multi-view)	Rhodin et al. [126]	Human3.6M	MPJPE	66.8 mm
Adversarial Learning	Yang et al. [135]	Human3.6M	MPJPE	58.6 mm
Generalizing (2D→3D)	Wang et al. [139]	Human3.6M	MPJPE	37.6 mm
Weak Sup. (Multi-view)	Iskakov et al. [128]	Human3.6M	MPJPE	20.8 mm

Table 13. Generalization methods on Human3.6M and 3DPW (Section 6.2). Metrics: MPJPE and PA-MPJPE.

Method	Reference	Human3.6M (MPJPE ↓)	3DPW (PA-MPJPE ↓)
3D-Guidance (Multi-model)	Lee et al. [146]	50.6 mm	-
CameraPose (Weak Sup.)	Yang et al. [129]	38.87 mm	63.26 mm
PoseAug (Augmentation)	Gong et al. [137]	38.2 mm	81.6 mm
PoseIRM (Invariant Learning)	Cai et al. [143]	25.6 mm	-

Table 14. Robustness to occlusion (Section 6.3.1) and crowds (Section 6.3.2).

Method (Concept)	Reference	Benchmark	Metric	Score
Part A: Occlusion Robustness ↓
Partial Body Regression	Vosoughi & Amer [149]	H3.6M (Truncated)	MPJPE	177.8 (−154.6)
LInKs (Lift-then-Fill)	Hardy & Kim [153]	H3.6M (Occlusion)	N-MPJPE	61.6 (−2.4)
HiPART (Auto-regressive)	Zheng et al. [154]	H3.6M-Occluded	MPJPE	28.3
Part B: Crowded Scenes
2D CrowdPose (Metric: AP ↑)
DPIT (Hybrid Transformer)	Zhao et al. [160]	COCO (Test)	AP	74.6
I²R-Net (Relational)	Ding et al. [162]	CrowdPose	AP	77.4
BUCTD (Hybrid BU-TD)	Zhou et al. [158]	CrowdPose	AP	78.5
3D MuPoTS-3D (Metric: 3DPCK ↑)
Multi-Person 3D	Dabral et al. [161]	MuPoTS-3D	3DPCK	74.3
GR-M3D (Dynamic Graph)	Qiu et al. [163]	MuPoTS-3D	3DPCK	84.6
Hybrid Top-down/Bottom-up	Cheng et al. [159]	MuPoTS-3D	3DPCK	88.9

Table 15. Temporal 2D-to-3D lifting models (Section 7.1.1 and Section 7.1.2). Metric: MPJPE (mm) on Human3.6M (Protocol 1).

Paradigm	Method	Reference	Dataset	Frames	MPJPE ↓
Temporal Conv.	TCN (Baseline)	Pavllo et al. [166]	Human3.6M	243	46.8 mm
Transformer	PoseFormer	Zheng et al. [170]	Human3.6M	81	44.3 mm
Transformer	HDFormer	Chen et al. [177]	Human3.6M	96	40.3 mm
Transformer	MixSTE	Zhang et al. [173]	Human3.6M	243	39.8 mm
State-Space (SSM)	SAMA	Lu et al. [190]	Human3.6M	351	36.5 mm

Table 16. Multi-view geometry (Section 7.2). Metric: MPJPE (mm) on Human3.6M (Protocol 1).

Method	Reference	Type	MPJPE ↓
Calibrated Fusion
Cross View Fusion	Qiu et al. [200]	Learnable	31.17 mm
Learnable Triangulation	Iskakov et al. [128]	Volumetric	20.8 mm
AdaFuse	Zhang et al. [201]	Adaptive	19.5 mm
Geometry-Biased Trans.	Moliner et al. [203]	Transformer	14.2 mm
Uncalibrated/Parameter-Free
Auto-supervision	Chang et al. [214]	Self-Sup.	76.96 mm
FLEX	Gordon et al. [210]	Invariant	30.2 mm

Table 17. Efficiency (Section 8.1), unification (Section 8.2), and frontier (Section 8.3). Summary of key methods illustrating the tradeoffs between speed, generality, and task specificity.

Method (Concept)	Reference	Dataset	Metric	Score
Part A: Efficiency and Deployment
DeciWatch (Sampling)	Zeng et al. [238]	Human3.6M	MPJPE ↓	52.8 mm
RTMPose (Real-time)	Jiang et al. [237]	COCO val	AP ↑	74.8
DynPose (Dynamic)	Xu et al. [239]	COCO val	AP ↑	78.0
Part B: Unification and Foundation Models
MotionBERT (Pretrained)	Zhu et al. [255]	Human3.6M	MPJPE ↓	37.5 mm
UniHPE (Unified Modality)	Jiang et al. [260]	Human3.6M	MPJPE ↓	50.5 mm
UniHCP (Unified Task)	Ci et al. [258]	Human3.6M	MPJPE ↓	75.6 mm
ViTPose++ (Foundation)	Xu et al. [257]	COCO test-dev	AP ↑	81.1
Part C: Human-Centric Frontier
COCO-WholeBody	Jin et al. [35]	WholeBody	AP_wb ↑	54.1
LDPose (Inclusivity)	Ying et al. [37]	LDPose	AP_LD ↑	78.4
U-HMR (Mesh Recovery)	Lee & Lee [279]	3DPW	MPJPE ↓	92.8 mm

Table 18. Design decision guide for selecting an HPE architecture. Each row pairs an architectural choice with the conditions under which it is preferable and the dominant tradeoff that the practitioner accepts.

Design Paradigm	When to Choose	Dominant Trade-Off	Reviewed in Section
Heatmap (CNN, e.g., HRNet)	Sub-pixel precision matters; compute not bottleneck; 2D single/multi-person; well-defined keypoints.	High accuracy, high memory/latency; resolution caps precision.	Section 4.2 and Section 5.1
Direct regression (transformer/DETR)	Real-time multi-person; low-latency edge; tolerant to slightly lower precision.	Simpler pipeline, faster inference; harder optimization, less spatial calibration.	Section 4.1 and Section 5.2.2
GCN/structured graph	2D-to-3D lifting where anatomical prior strong; small skeletons; relational reasoning under occlusion.	Strong inductive bias on fixed topology; limited generalization to new skeletons.	Section 5.2.1 and Section 6.3.2
Spatiotemporal transformer	Short to medium video windows (e.g., 81 frames); pose lifting from 2D sequences; global temporal attention.	Quadratic time/memory; struggles on very long windows.	Section 7.1.2
State space model (Mamba/SSM)	Very long temporal windows (hundreds of frames); continuous tracking; multi-view fusion at scale.	Linear complexity, hardware-friendly; less mature; recent literature only.	Section 7.1.2 and Section 7.2
Diffusion-based estimator	Calibrated uncertainty needed (robotics, retargeting); occlusion-heavy scenes; multi-hypothesis output.	Iterative inference slow; needs aggregation/selection step.	Section 6.1
Event-camera/multi-modal fusion	High-speed, low-light, high-dynamic-range; sports; AR/VR.	Hardware availability; sparse/asynchronous data need specialized models.	Section 7.2
Physics-/biomechanics-aware	Output drives a simulator, controller, or biomechanical analysis; global trajectories matter.	Optimization in loop slow; depends on accurate camera/scene geometry.	Section 7.3 and Section 8.3
Foundation model + adapter	Generalist deployment (pose+mesh+action); limited labeled data; transfer to new domains.	Large model size; partially open weights; black-box failure modes.	Section 8.2
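
The quadratic versus linear trade-off listed in Table 18 for spatiotemporal transformers and state space models can be illustrated with simple arithmetic. The sketch below is an assumption-laden simplification: `attention_pairs` counts only query-key pairs for full self-attention over a temporal window, and `ssm_scan` is a scalar recurrence with fixed, hypothetical parameters, whereas Mamba-style models use learned, input-dependent parameters and vector-valued states.

```python
def attention_pairs(frames):
    """Number of query-key pairs scored by full self-attention over a window."""
    return frames * frames

def ssm_scan(inputs, a=0.9, b=0.1, c=1.0):
    """Scalar linear state-space recurrence: one state update per frame."""
    state, outputs = 0.0, []
    for x in inputs:
        state = a * state + b * x  # the state carries history forward
        outputs.append(c * state)
    return outputs

# Window lengths that appear in the temporal lifting comparison.
for frames in (81, 243, 351):
    steps = len(ssm_scan([1.0] * frames))
    print(frames, attention_pairs(frames), steps)
# 81 6561 81
# 243 59049 243
# 351 123201 351
```

Going from 81 to 351 frames multiplies the number of attention pairs by roughly 19 but the number of recurrent updates by only about 4.3, which is the practical reason long windows favor linear-time sequence models.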

Table 19. Cross-paradigm SOTA snapshot of representative methods (2024–2025).

Method	Paradigm	Benchmark & Score	Strength	Weakness
RTMPose [237]	2D regression, real-time	COCO val: AP 74.8 @ 90+ FPS	Sub-pixel via SimCC; real-time on CPU/GPU.	Single-person top-down; no temporal modeling.
DynPose [239]	Dynamic routing	COCO val: AP 78.0	Skips easy frames; large efficiency gain.	Routing overhead on short clips.
ProbPose [51]	2D probabilistic heatmap	CropCOCO: mAP 81.7	Calibrated confidence; handles truncation.	Heavier head; mostly single-person.
MotionAGFormer [181]	GCN + transformer, 3D lifting	H3.6M (P1): MPJPE 38.4 mm	Local + global modeling; strong on H3.6M.	Coordinate-only input; depth-ambiguity ceiling.
TCPFormer [186]	Temporal transformer + proxy	H3.6M (P1): MPJPE 37.9 mm	Implicit proxy compresses long sequences.	Quadratic attention still binds at very long horizons.
SAMA [190]	Structure-aware SSM, video	H3.6M (P1): MPJPE 36.5 mm	Linear-time; 351-frame windows; topology-aware.	Recent; limited cross-dataset evaluation.
FinePOSE [121]	Prompt-conditioned diffusion	H3.6M (P1): MPJPE 31.9 mm	Lowest reported MPJPE; CLIP-conditioned.	Multi-step sampling; expensive at inference.
ZeDO [125]	Zero-shot diffusion + opt.	H3.6M (P1): MPJPE 51.4 mm	No 3D training data; zero-shot generalization.	Per-instance optimization; not real-time.
Di²Pose [122]	Discrete diffusion, occlusion	H3.6M: MPJPE 49.2 mm	Robust under heavy occlusion (mask/replace).	Discrete tokens limit precision.
HiPART [154]	Hierarchical autoregressive	H3.6M-Occluded: MPJPE 28.3 mm	Excellent under truncation/occlusion.	Autoregressive decoding is slow.
GR-M3D [163]	Dynamic graph, 3D multi-person	MuPoTS-3D: 3DPCK 84.6	Robust crowd reasoning; per-person graph.	Top-down detector dependency.
PhysDynPose [39]	Physics-aware optimization	MoviCam: best Global MPJPE 183.7 mm	Gravity + ground contact enforced.	Simulation in the loop; slow.
BioPose [230]	Mesh + inverse kinematics	3DPW: competitive PA-MPJPE 39.5 mm	Biomechanically valid output.	Requires careful camera calibration.
ViTPose++ [257]	Foundation, ViT-based	COCO test-dev: AP 79.4	Scales with data; transfers across tasks.	Very large model; expensive to fine-tune.
Hulk [256]	Multi-task foundation	Multi-dataset: SOTA on several tasks	Unified human-centric perception.	Partial open release; benchmark heterogeneity.
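
Because most 3D rows in Table 19 are compared by MPJPE or PA-MPJPE, a minimal NumPy sketch of both metrics may help when reading the scores. It follows the standard definitions (mean per-joint Euclidean error, optionally after a Procrustes similarity alignment) and is not the evaluation code of any method listed. The function names, the 17-joint skeleton, and the synthetic poses are assumptions made for illustration, and root-joint alignment, which benchmark protocols usually apply before MPJPE, is assumed to have been done by the caller.

```python
import numpy as np


def mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean per-joint position error for (J, 3) arrays, in the input units."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())


def procrustes_align(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
    """Align pred to gt with the best similarity transform (scale, rotation, translation)."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    # Orthogonal Procrustes: SVD of the 3x3 cross-covariance matrix.
    u, s, vt = np.linalg.svd(p.T @ g)
    # Flip the last singular direction if needed so the result is a proper rotation.
    d = np.sign(np.linalg.det(u @ vt))
    flip = np.array([1.0, 1.0, d])
    rot = (u * flip) @ vt
    scale = (s * flip).sum() / (p ** 2).sum()
    return scale * p @ rot + mu_g


def pa_mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """MPJPE after Procrustes alignment of the prediction to the ground truth."""
    return mpjpe(procrustes_align(pred, gt), gt)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gt = rng.normal(size=(17, 3)) * 100.0  # synthetic 17-joint pose, in mm
    theta = np.deg2rad(30.0)
    rot_z = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                      [np.sin(theta), np.cos(theta), 0.0],
                      [0.0, 0.0, 1.0]])
    # A prediction that differs from gt only by a similarity transform.
    pred = 1.2 * gt @ rot_z + np.array([50.0, -20.0, 10.0])
    print("MPJPE:", mpjpe(pred, gt))
    print("PA-MPJPE:", pa_mpjpe(pred, gt))
```

In this synthetic case the prediction is a scaled, rotated, and shifted copy of the ground truth, so PA-MPJPE is numerically close to zero while MPJPE is large. This illustrates why PA-MPJPE scores (such as the 3DPW entry for BioPose) are not directly comparable with unaligned or global MPJPE scores (such as the MoviCam entry for PhysDynPose).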

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
