Article

STAG-Net: A Lightweight Spatial–Temporal Attention GCN for Real-Time 6D Human Pose Estimation in Human–Robot Collaboration Scenarios

1
Interfaculty Initiative in Information Studies, Graduate School of Interdisciplinary Information Studies, The University of Tokyo, Tokyo 113-0033, Japan
2
Graduate School of Engineering, The University of Tokyo, Tokyo 113-0033, Japan
3
Institute of Industrial Science, The University of Tokyo, Tokyo 153-8505, Japan
4
Graduate School of Interdisciplinary Information Studies, The University of Tokyo, Tokyo 113-0033, Japan
*
Author to whom correspondence should be addressed.
Robotics 2026, 15(3), 54; https://doi.org/10.3390/robotics15030054
Submission received: 28 January 2026 / Revised: 25 February 2026 / Accepted: 2 March 2026 / Published: 4 March 2026
(This article belongs to the Special Issue Human–Robot Collaboration in Industry 5.0)

Abstract

Most existing research in human pose estimation focuses on predicting joint positions, paying limited attention to recovering the full 6D human pose, which comprises both 3D joint positions and bone orientations. Position-only methods treat joints as independent points, often resulting in structurally implausible poses and increased sensitivity to depth ambiguities—cases where poses share nearly identical joint positions but differ significantly in limb orientations. Incorporating bone orientation information helps enforce geometric consistency, yielding more anatomically plausible skeletal structures. Additionally, many state-of-the-art methods rely on large, computationally expensive models, which limit their applicability in real-time scenarios, such as human–robot collaboration. In this work, we propose STAG-Net, a novel 2D-to-6D lifting network that integrates Graph Convolutional Networks (GCNs), attention mechanisms, and Temporal Convolutional Networks (TCNs). By simultaneously learning joint positions and bone orientations, STAG-Net promotes geometrically consistent skeletal structures while remaining lightweight and computationally efficient. On the Human3.6M benchmark, STAG-Net achieves an MPJPE of 41.8 mm using 243 input frames. In addition, we introduce a lightweight single-frame variant, STG-Net, which achieves 50.8 mm MPJPE while operating in real time at 60 FPS using a single RGB camera. Extensive experiments on multiple large-scale datasets demonstrate the effectiveness and efficiency of the proposed approach.

1. Introduction

Three-dimensional human pose estimation (3D HPE) is a crucial task that explores the spatial relationship between human joints. Recently, thanks to the rapid development of deep learning, 3D human pose estimation has significantly improved in terms of accuracy and performance.
Existing state-of-the-art 3D human pose estimation methods primarily focus on predicting only joint positions [1,2,3,4], overlooking the joint orientation information of the human body. Joint positions alone are insufficient for applications such as human–robot interaction and motion analysis. Additionally, much of the existing research suffers from high computational costs, making it unsuitable for real-time applications. A model that is lightweight, computationally efficient, and highly accurate is therefore essential.
To improve the accuracy of 3D human pose prediction, skeleton-based 3D pose estimation models [5,6,7,8,9,10] leverage structural information, making them highly effective for addressing 3D human pose tasks with significantly reduced computational costs. Current deep learning-based methods for 3D human pose estimation can be broadly categorized into two types: one-stage methods and 2D-to-3D lifting methods. One-stage methods, such as image-to-3D-keypoint approaches [11,12,13], directly predict the 3D coordinates to reconstruct the human pose from images in an end-to-end manner. In contrast, 2D-to-3D lifting methods [1,2,14] first localize the 2D keypoints on the image, and then lift 3D poses from the estimated 2D joint positions. A notable limitation of many existing approaches is their focus on either spatial or temporal information in isolation, neglecting the critical interplay and complementarity between spatial and temporal dimensions. Temporal-based methods [1,14,15] prove invaluable in addressing issues like depth ambiguity, which is a challenge for 2D-to-3D pose lifting methods.
While many studies address 3D human pose estimation, work on complete 6D human pose estimation (6D HPE) is still limited. Mesh recovery methods that leverage parametric body models such as SMPL (Skinned Multi-Person Linear) [16,17] can obtain complete information about the human body, including joint positions and orientations. However, mesh recovery methods typically incur high computational costs, making them less suitable for real-time applications. Among existing approaches, Fisch et al. [18] introduced a method that employs virtual markers to provide sufficient cues for accurately inferring joint rotations. However, their approach operates on a single frame, does not exploit temporal information that is crucial for video-based analysis, and relies on post-processing to recover joint orientations. Banik et al. [19] leverage both node and edge convolutions to incorporate joint and bone features for full human pose estimation, including joint positions and bone orientations. However, their method does not model temporal dependencies and achieves limited improvement on single-frame inputs despite its large model size.
Motivated by prior works that leverage spatial and temporal information for 3D pose estimation [3,20,21], as well as methods that explicitly incorporate rotation modeling [19,22], we propose the Spatio-Temporal Attention Graph Network (STAG-Net), a novel, efficient 2D-to-6D pose lifting network that jointly integrates spatial and temporal information to estimate both joint positions and orientations. The orientation of a parent bone affects not only the position of its corresponding joint but also the orientations of its child bones; therefore, explicitly modeling and predicting bone orientations leads to more accurate joint position estimation. Furthermore, since this design principle remains effective even when using a single frame as input, we introduce a lightweight variant, STG-Net, specifically designed to achieve high accuracy while maintaining real-time performance. STG-Net processes single RGB images and enables efficient 6D human pose estimation in real-world scenarios. Our model demonstrates performance comparable to state-of-the-art methods on the Human3.6M dataset across both positional and orientational metrics, under both single-frame and multi-frame input settings. In summary, our contributions are:
(1) We propose a novel Node–Edge Graph Attention and TCN hybrid framework that simultaneously predicts 3D joint positions and 3D orientations from 2D pose inputs.
(2) We introduce a lightweight and effective Skip-TCN architecture for human pose 2D-to-3D lifting, specifically designed to efficiently handle multi-frame inputs.
(3) We report both position and orientation evaluation on multiple benchmark datasets.

2. Related Work

TCN-based Methods. Recent one-stage methods such as [11,12,13] do not rely on 2D HPE and instead regress 3D keypoint coordinates directly from images. On the other hand, inspired by [2], numerous advanced 2D-to-3D lifting models have been developed to achieve better performance. 2D-to-3D lifting can be broadly classified into three main categories: TCN-based, GCN-based, and Transformer-based architectures. Among TCN-based methods, Pavllo et al. [1] made significant contributions by introducing a dilated temporal convolutional network designed to capture long-term information. Liu et al. [14] use an attention mechanism combined with multi-scale dilated convolutions to capture long-range dependencies across frames, enhancing accuracy.
GCN-based Methods. One of the key innovations in Graph Neural Networks (GNN) is the vanilla GCN introduced by Kipf et al. [5], which consists of a simple graph convolution operation that performs transformation and aggregation of graph-structured data. For further developments of GCN, the ST-GCN, as proposed in [21], effectively learns both spatial and temporal patterns for action recognition. In recent years, GCN-based approaches have led to significant advancements in 3D HPE. Xu et al. [23] proposed an architecture that consists of repeated encoder–decoders to process human skeletal representations across three different scales. For the 2D-to-3D lifting models, the 3D pose is highly dependent on the accuracy of the 2D pose. Yu et al. [6] proposed a Global–local Adaptive GCN to reconstruct the global representation of an intermediate 3D human pose sequence from its corresponding 2D sequence, which significantly enhanced the result by incorporating ground truth data as input.
Transformer-based Methods. Recently, Transformer-based methods that primarily rely on attention mechanisms have significantly advanced 3D human pose estimation, achieving remarkable accuracy that in many cases surpasses GCN-based approaches. Li et al. [3] introduced the Strided Transformer, designed to transform a long sequence of 2D joint positions into a single 3D pose. The primary challenge in 2D-to-3D pose lifting lies in depth ambiguity and self-occlusion. To address this, Li et al. [4] proposed the MHFormer, which generates multiple plausible 3D pose hypotheses. Islam et al. [24] propose a Multi-hop Graph Transformer Network that combines multi-head self-attention with multi-hop GCNs to capture spatio-temporal dependencies and long-range interactions. Aouaidjia et al. [25] introduce a Graph Order Attention module to adaptively weight joint orders and a Body-Aware Temporal Transformer to model both global body dynamics and local inter-joint dependencies.

3. Methodology

This section presents the overall structure of STAG-Net, a novel and lightweight 2D-to-6D lifting network designed to predict both 3D joint positions and joint orientations in real-time. The network integrates the outputs of an attention-enhanced GCN and a TCN to improve the accuracy of 3D joint position estimation, effectively leveraging the inherent geometric relationships between joint positions and joint orientations.
STAG-Net is a two-stage method that requires 2D human poses as input. The overall architecture of STAG-Net is illustrated in Figure 1. It comprises two branches: the GCN Branch and the TCN Branch. The input to STAG-Net includes the 2D keypoint coordinates extracted from a video and the 2D bone angles, represented as rotation matrices derived from these coordinates. Note that only the GCN Branch uses the bone angle inputs. The output of the network consists of the 3D positions of each joint and the 6D rotation representations of each bone. By jointly learning joint positions and bone orientations, the network is encouraged to produce geometrically consistent skeletal structures rather than independent joint predictions.

3.1. Graph Configuration

In GCN-based 3D HPE, the human skeleton is usually represented as an undirected graph $G = (V, E)$, where $V$ denotes the set of $N$ body joints, $|V| = N$, and the edges $(v_i, v_j) \in E$ denote the connections between the nodes. Figure 2 illustrates the complete graph structure, including nodes and edges, with their definitions based on the Human3.6M dataset format. The connections between the nodes are represented by an adjacency matrix $A_{Node} \in \{0, 1\}^{N \times N}$, where 0 indicates no connection and 1 indicates the presence of a connection between nodes. Similarly, the connections between the edges are represented by an adjacency matrix $A_{Edge} \in \{0, 1\}^{E \times E}$, where 0 indicates no connection and 1 indicates the presence of a connection between edges. The STAG-Net network fuses both attention-enhanced GCN and TCN branches. As shown in Figure 1, both branches take the frame number, denoted as F, as input. For real-time applications, a single frame is typically used, whereas the network also supports multi-frame inputs, which are commonly adopted for video-based processing.
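To make the graph construction concrete, the two adjacency matrices can be sketched as below; the 17-joint parent list is an illustrative Human3.6M-style assumption of ours, not taken verbatim from the paper.

```python
import numpy as np

# Hypothetical parent index for each of 17 joints (-1 marks the root).
PARENTS = [-1, 0, 1, 2, 0, 4, 5, 0, 7, 8, 9, 8, 11, 12, 8, 14, 15]

def node_adjacency(parents):
    n = len(parents)
    a = np.eye(n)                                # self-loops, standard GCN practice
    for child, parent in enumerate(parents):
        if parent >= 0:
            a[child, parent] = a[parent, child] = 1.0
    return a

def edge_adjacency(parents):
    bones = [(c, p) for c, p in enumerate(parents) if p >= 0]  # one bone per non-root joint
    e = len(bones)
    a = np.eye(e)
    for i, (ci, pi) in enumerate(bones):
        for j, (cj, pj) in enumerate(bones):
            if i != j and {ci, pi} & {cj, pj}:   # bones sharing a joint are connected
                a[i, j] = 1.0
    return a

A_node = node_adjacency(PARENTS)   # A_Node, shape (17, 17)
A_edge = edge_adjacency(PARENTS)   # A_Edge, shape (16, 16)
```

A 17-joint skeleton with one root yields 16 bones, hence the (17, 17) node and (16, 16) edge adjacency shapes.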

3.2. Propagation Rules

The GCN branch of STAG-Net incorporates weight modulation as introduced by [9]. Inspired by the propagation rules introduced in [19], which leverage edge-aware-node and node-aware-edge convolutional layers, we designed two specialized layers: the Modulated-Node layer and the Modulated-Edge layer. These layers enhance the Modulated GCN by enabling the co-embedding of nodes and edges. In the context of a hierarchically structured human body, both joint positions and orientations play a crucial role, as the position and orientation of a parent joint directly influence those of its child joints. The Modulated-Node layer transforms and aggregates the input node and edge features following the equation below:
$$h_i^{n} = \sigma \Big( \sum_{j \in \tilde{N}(i)} m_j \odot W^n \big( a^n h_j^{n} + (1 - a^n) h_j^{e\_aggre} \big) \tilde{a}_{ij}^{n} \Big) \quad (1)$$
The superscripts $n$ and $e$ indicate the node and edge layers, respectively. $W^n$ denotes the node weight matrix, while $a^n$ is a learnable parameter that adaptively balances the contributions of nodes and edges. $h_j^{e\_aggre}$ represents the aggregated edge feature, which is transformed and integrated into the node feature using the current edge feature and a transformation matrix $T \in \mathbb{R}^{N_n \times N_e}$, following the method proposed in [26], where $T_{ne} = 1$ if node $n$ is connected to edge $e$. The aggregation can be written as:
$$h_j^{e\_aggre} = \sum_{j \in N(i)} T_{ij} \sum_{k \in \varepsilon(j)} \tilde{a}_{ij}^{e} h_k^{e} \quad (2)$$
where $h_k^{e}$ denotes the current edge feature matrix, and $\tilde{a}_{ij}^{e}$ represents the $(i, j)$-th entry of the edge affinity matrix. Similar to Equation (1), the Modulated-Edge layer transforms and aggregates the input edge and node features according to the equation below:
$$h_i^{e} = \sigma \Big( \sum_{j \in \tilde{N}(i)} m_j \odot W^e \big( a^e h_j^{e} + (1 - a^e) h_j^{n\_aggre} \big) \tilde{a}_{ij}^{e} \Big) \quad (3)$$
where $h_j^{n\_aggre}$ represents the aggregated node feature using the transposed matrix $T^{T} \in \mathbb{R}^{N_e \times N_n}$, and can be written as:
$$h_j^{n\_aggre} = \sum_{j \in N(i)} T_{ij}^{T} \sum_{k \in \varepsilon(j)} \tilde{a}_{ij}^{n} h_k^{n} \quad (4)$$
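A minimal NumPy sketch of the Modulated-Node aggregation, with the incidence matrix $T$ mapping edge features into node space and a scalar $a^n$ blending node and aggregated edge features. The normalized adjacency and ReLU below stand in for the learned affinities and the activation $\sigma$, so this is our illustrative simplification rather than the exact layer.

```python
import numpy as np

# Hypothetical 17-joint parent list (illustrative, not from the paper).
PARENTS = [-1, 0, 1, 2, 0, 4, 5, 0, 7, 8, 9, 8, 11, 12, 8, 14, 15]

def incidence_matrix(parents):
    bones = [(c, p) for c, p in enumerate(parents) if p >= 0]
    t = np.zeros((len(parents), len(bones)))
    for e, (child, parent) in enumerate(bones):
        t[child, e] = t[parent, e] = 1.0   # T[n, e] = 1 if node n touches edge e
    return t

def modulated_node_step(h_node, h_edge, T, A, W, a_n=0.5):
    h_edge_aggre = T @ h_edge                          # edge features pulled into node space
    blended = a_n * h_node + (1.0 - a_n) * h_edge_aggre
    deg = A.sum(axis=1, keepdims=True)                 # degree-normalized adjacency stands in
    return np.maximum((A / deg) @ blended @ W, 0.0)    # ReLU in place of sigma

N = len(PARENTS)
A = np.eye(N)                                          # node adjacency with self-loops
for c, p in enumerate(PARENTS):
    if p >= 0:
        A[c, p] = A[p, c] = 1.0

rng = np.random.default_rng(0)
h_node = rng.standard_normal((N, 8))       # toy node features
h_edge = rng.standard_normal((N - 1, 8))   # toy edge (bone) features
W = rng.standard_normal((8, 8)) * 0.1
T = incidence_matrix(PARENTS)
out = modulated_node_step(h_node, h_edge, T, A, W)     # refined node features, shape (17, 8)
```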

3.3. 6D Rotation Representation

For each bone, the network predicts a continuous 6D rotation representation, consisting of two 3D vectors. These vectors are converted into a valid rotation matrix via Gram–Schmidt orthogonalization, ensuring a proper right-handed rotation. We construct a target rotation matrix from ground-truth 3D keypoints. The bone vector connecting parent and child joints defines the primary z-axis. A stable reference axis is selected to compute the remaining x- and y-axes via cross products, forming a right-handed orthonormal frame. This matrix serves as the target for supervising the predicted 6D rotation representation, ensuring that each predicted bone aligns with the anatomically correct orientation. Following the approach proposed by Zhou et al. [27], the 6D rotation representation is derived by omitting the last column $r_3$ of the orthogonal rotation matrix $R = [r_1 \; r_2 \; r_3]$, i.e., $R_{6D} = [r_1 \; r_2]$. This 6D representation is chosen for joint rotations due to its robustness and stable performance during training.
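The 6D-to-rotation conversion described above can be sketched as follows, following the Gram–Schmidt scheme of Zhou et al. [27]; variable names are our own.

```python
import numpy as np

def rotation_from_6d(r6d):
    """Recover a proper rotation matrix from a 6D representation [r1; r2]."""
    a1, a2 = r6d[:3], r6d[3:]
    b1 = a1 / np.linalg.norm(a1)             # first column: normalize a1
    a2_proj = a2 - np.dot(b1, a2) * b1       # remove component of a2 along b1
    b2 = a2_proj / np.linalg.norm(a2_proj)   # second column
    b3 = np.cross(b1, b2)                    # third column: completes a right-handed frame
    return np.stack([b1, b2, b3], axis=1)

# Example with slightly non-orthogonal input vectors.
R = rotation_from_6d(np.array([1.0, 0.1, 0.0, 0.0, 1.0, 0.2]))
```

The resulting `R` is orthonormal with determinant +1, which is why the 6D parameterization avoids the discontinuities of Euler angles or quaternions during training.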

3.4. Skip-TCN

The TCN branch of STAG-Net processes a sequence of 2D poses using temporal convolutions. To capture long-term dependencies efficiently, we utilize dilated convolutions [28]. Figure 3 illustrates the overall structure of a TCN Unit, which begins with a dilated convolution layer, followed by WeightNorm, PReLU, and Dropout. This sequence is repeated once to form a residual connection.
The TCN structure proposed by [1] focuses on predicting 3D joint positions from video sequences. However, its network remains too large to achieve real-time performance exceeding 25 FPS. To address this, we introduce a new TCN architecture module called Skip-TCN, inspired by [29]. Their findings indicate that skip-connection-based designs achieve approximately 1.4× faster inference than ResNet architectures while maintaining comparable performance. Skip-TCN employs selective short-range and long-range concatenative skip connections. Figure 4 illustrates the complete structure of the Skip-TCN module. Each block of the module consists of six TCN Units (TCN-U). Each TCN-U is defined by specific parameters: D represents the dilation factor, L denotes the number of hidden layers, and k indicates the number of intermediate features. In Skip-Block1, the input of TCN-U-6 is formed by concatenating the outputs of TCN-U-1, TCN-U-3, and TCN-U-5. Skip-Block2 uses the output of Skip-Block1 as its input. For the residual connection in each block, we use a consistent convolutional block composed of a 1 × 1 convolution, batch normalization (BN), Efficient Channel Attention (ECA), and PReLU activation. Skip-TCN has a model size of just 0.41M parameters, making it an extremely lightweight model.
The temporal receptive field of Skip-TCN is determined by the dilation parameter D and the depth of the TCN layers. For different input frames, we adjust the dilation configuration so that the effective receptive field fully covers the entire temporal window. This ensures that all frames within the input sequence contribute to the final prediction while maintaining computational efficiency.
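As a quick sanity check on the receptive-field claim, the coverage of a dilated convolution stack can be computed directly; the kernel size and dilation values below are hypothetical examples, not the paper's configuration.

```python
# For kernel size k and per-layer dilations d_i, the receptive field of a
# stack of dilated temporal convolutions is: 1 + sum_i (k - 1) * d_i.
def receptive_field(kernel_size, dilations):
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# Example: kernel size 3 with exponentially growing dilations covers a
# 243-frame window, matching the longest input setting.
dilations = [1, 3, 9, 27, 81]          # hypothetical configuration
rf = receptive_field(3, dilations)
print(rf)  # 1 + 2 * (1 + 3 + 9 + 27 + 81) = 243
```

Choosing dilations so the receptive field equals the window length ensures every input frame influences the prediction without adding unnecessary layers.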

3.5. Architecture

The STAG-Net architecture consists of two parallel branches: an attention-enhanced GCN branch and a TCN branch for modeling temporal dependencies in human pose sequences. The GCN branch begins with linear embedding of node and edge features, followed by GELU activation, and is then stacked with three residual blocks composed of Modulated Node–Edge–Attention (M-NEA) layers. Each residual block contains two M-NEA layers, as illustrated in Figure 1. An M-NEA layer comprises a Modulated Node–Edge (M-NE) module and an attention module. The M-NE module further consists of a Modulated Node layer and a Modulated Edge layer, each followed by batch normalization and a ReLU activation. Within each M-NEA layer, the M-NE module and the attention module iteratively refine their feature representations, and their outputs are fused via residual addition to form the final feature representation. This design enables effective modeling of the interactions between joint positions and joint orientations. We adopt the Spatio-Temporal Criss-cross (STC) Attention mechanism proposed by [30] as the attention module in our network. Unlike standard self-attention that computes global pairwise interactions, STC attention decomposes attention into spatial and temporal criss-cross patterns. Specifically, for each joint feature, spatial attention captures dependencies across all joints within the same frame, while temporal attention models correlations of the same joint across different frames. In addition, a residual connection links the input and output of each block, facilitating the modeling of long-term dependencies. The attention-enhanced GCN branch produces two outputs: joint positions and joint orientations.
The TCN Branch starts with a convolutional block consisting of a 3 × 3 convolution, batch normalization (BN), Efficient Channel Attention (ECA), and PReLU activation. This is followed by two Skip-TCN blocks to learn the temporal information between the human poses. Another convolutional block, with the same structure as the first, is then applied. Finally, the 3D position features from the two branches are concatenated into a unified representation, which is then fed into a fully connected (FC) layer to regress the final 3D joint positions.
To evaluate the accuracy of the predicted joint positions, we adopt the Mean Per Joint Position Error (MPJPE), as defined in Equation (5). MPJPE measures the mean Euclidean distance between the predicted joint coordinates and the corresponding ground-truth positions. To further enhance robustness during training, we extend the standard MPJPE formulation by introducing a weighted combination of L2 and L1 distance terms, as defined in Equation (6). The L2 component promotes smooth optimization and stable convergence, while the L1 component improves robustness to outliers by reducing the influence of large deviations.
$$\mathcal{L}_{MPJPE}(J^{pr}, J^{gt}) = \frac{1}{N} \sum_{i=1}^{N} \left\| J_i^{pr} - J_i^{gt} \right\|_2 \quad (5)$$
$$\mathcal{L}_{MPJPE\text{-}Norm}(J^{pr}, J^{gt}) = (1 - W_{norm}) \cdot \frac{1}{N} \sum_{i=1}^{N} \left\| J_i^{pr} - J_i^{gt} \right\|_2 + W_{norm} \cdot \frac{1}{N} \sum_{i=1}^{N} \left\| J_i^{pr} - J_i^{gt} \right\|_1 \quad (6)$$
where $J^{pr}$ and $J^{gt}$ are the predicted and ground-truth 3D joint positions, respectively, and $(1 - W_{norm})$ is the weight factor applied to the MPJPE (L2) term. If $W_{norm}$ is small, $(1 - W_{norm})$ is close to 1, meaning the MPJPE term has more influence on the overall loss. For joint orientations, we utilize the Identity Deviation (IDev) loss [31]. This loss measures the distance between the identity matrix and the product of the transposed ground-truth rotation matrix and the predicted rotation matrix. To compute it, the predicted 6D rotation representation is converted into a $3 \times 3$ rotation matrix through orthogonalization and normalization. Ideally, for two identical rotation matrices, this distance should be zero. The IDev loss is formally defined as:
$$\mathcal{L}_{IDev}(R^{pr}, R^{gt}) = \frac{1}{N} \sum_{i=1}^{N} \left\| I - R_{i,gt}^{T} R_{i,pr} \right\|_F \quad (7)$$
where $R^{pr}$ and $R^{gt}$ are the predicted and ground-truth rotation matrices. If the two rotations are identical, the relative rotation equals the identity matrix; the Frobenius norm therefore measures the deviation from perfect alignment. The total loss is defined as:
$$\mathcal{L}_{total} = W_p \cdot \mathcal{L}_{MPJPE} + W_r \cdot \mathcal{L}_{IDev} \quad (8)$$
where $W_p$ and $W_r$ are the weighting factors that balance the contributions of the joint position loss and the joint orientation loss in the overall loss function. These factors are dynamically computed during each iteration based on the current values of $\mathcal{L}_{MPJPE}$ and $\mathcal{L}_{IDev}$. Specifically, the weights are calculated as the inverse of the corresponding loss values and then normalized:
$$W_p = \frac{1 / (\mathcal{L}_{MPJPE} + \epsilon)}{1 / (\mathcal{L}_{MPJPE} + \epsilon) + 1 / (\mathcal{L}_{IDev} + \epsilon)}, \qquad W_r = \frac{1 / (\mathcal{L}_{IDev} + \epsilon)}{1 / (\mathcal{L}_{MPJPE} + \epsilon) + 1 / (\mathcal{L}_{IDev} + \epsilon)} \quad (9)$$
where $\epsilon$ is a small constant for numerical stability. This normalization ensures that $W_p + W_r = 1$. This adaptive inverse-loss weighting mechanism assigns relatively larger weights to smaller loss terms and reduces the influence of larger ones, thereby preventing one objective from dominating the optimization process.
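The position loss, the IDev loss, and the adaptive inverse-loss weighting can be sketched in a few lines of NumPy; the $W_{norm}$ and $\epsilon$ values below are illustrative defaults of ours, not the paper's settings.

```python
import numpy as np

def mpjpe_norm(j_pr, j_gt, w_norm=0.1):
    """Weighted L2 + L1 position loss over N joints, shapes (N, 3)."""
    diff = j_pr - j_gt
    l2 = np.linalg.norm(diff, axis=1).mean()      # per-joint Euclidean distance (MPJPE)
    l1 = np.abs(diff).sum(axis=1).mean()          # per-joint L1 distance
    return (1.0 - w_norm) * l2 + w_norm * l1

def idev(r_pr, r_gt):
    """Mean Frobenius deviation of the relative rotation from identity, shapes (N, 3, 3)."""
    rel = np.einsum('nij,nik->njk', r_gt, r_pr)   # per-bone R_gt^T @ R_pr
    return np.linalg.norm(np.eye(3) - rel, axis=(1, 2)).mean()

def adaptive_weights(l_pos, l_rot, eps=1e-8):
    """Inverse-loss weighting: the smaller loss receives the larger weight."""
    inv_p, inv_r = 1.0 / (l_pos + eps), 1.0 / (l_rot + eps)
    return inv_p / (inv_p + inv_r), inv_r / (inv_p + inv_r)

# Toy magnitudes: a position loss in mm dwarfs a unitless rotation loss,
# so the rotation term gets the larger weight and neither dominates.
wp, wr = adaptive_weights(40.0, 0.2)
```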
While joint positions alone provide weak structural constraints, predicting bone rotations enforces kinematic consistency between connected joints. The 6D rotation prediction, converted to a rotation matrix and supervised via IDev loss, ensures that each bone aligns with its anatomically correct orientation. This geometric constraint regularizes the 3D joint predictions, reducing implausible configurations such as flipped limbs or unnatural twists, and improves overall pose accuracy.

4. Experimental Results

4.1. Implementation Details

We primarily utilize three open-source datasets to evaluate the performance of our STAG-Net network:
Human3.6M (H36M) [32] contains 3.6 million frames recorded from 11 professional actors, of whom 7 subjects are annotated with 3D poses, covering 15 actions such as walking and sitting. Following the standard evaluation protocol, we use subjects S1, S5, S6, S7, and S8 for training, and S9 and S11 for testing.
MPI-INF-3DHP (3DHP) [33] consists of both constrained indoor and complex outdoor scenes. It records 8 actors performing 8 activities from 14 camera views. We use subjects S1–S8 for training and TS1–TS6 for testing.
HumanEva-I (HE) [34] contains 7 calibrated video sequences that are synchronized with 3D body poses obtained from a motion capture system. The database contains 4 subjects performing 6 common actions. We follow the official dataset split, using Train/S1–S3 for training and Validate/S1–S3 for evaluation.
Our evaluation protocols include two main metrics: MPJPE and P-MPJPE, also known as Protocol #1 and Protocol #2, respectively. For assessing joint orientation prediction, we use an angular metric, the Mean Per-Joint Angular Error (MPJAE) [35], defined as:
$$MPJAE = \frac{1}{N} \sum_{i=1}^{N} \theta_{sep}\left( R_{i,gt}, R_{i,pr} \right) \quad (10)$$
The goal is to compute the average geodesic distance $\theta_{sep}$ between the predicted and ground-truth joint orientations. For all three metrics (MPJPE, P-MPJPE, and MPJAE), a lower value indicates better performance.
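A minimal sketch of the geodesic distance underlying MPJAE, assuming the standard angle-from-trace formula for rotation matrices; function and variable names are our own.

```python
import numpy as np

def geodesic_angle(r_gt, r_pr):
    """Geodesic distance (radians) between two 3x3 rotation matrices:
    theta = arccos((trace(R_gt^T R_pr) - 1) / 2)."""
    rel = r_gt.T @ r_pr
    cos = (np.trace(rel) - 1.0) / 2.0
    return np.arccos(np.clip(cos, -1.0, 1.0))   # clip guards against round-off

# A 90-degree rotation about z, compared against identity, gives pi/2.
Rz = np.array([[0.0, -1.0, 0.0],
               [1.0,  0.0, 0.0],
               [0.0,  0.0, 1.0]])
angle = geodesic_angle(np.eye(3), Rz)   # ~1.5708 rad
```

Averaging this angle over all joints yields the MPJAE of Equation (10).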
We implemented our proposed method with PyTorch [36]; a single NVIDIA RTX 4080 GPU was used for training and testing. We chose different frame sequence lengths for our experiments, i.e., $F = 1$, $F = 27$, $F = 81$, and $F = 243$. Following the latest research [19], we normalize the 2D input and use camera coordinates for the 3D output. For 2D pose detection, we use CPN [37]. In terms of training settings, our model is trained on the H36M dataset [32] using the Adam optimizer for 20 epochs. The learning rate is set to 0.001 and decays by a factor of 0.9 every 5 epochs. We train two variants of the proposed network with different input channel sizes: model-S (128) and model-L (192). All models are optimized using the MPJPE and IDev losses.
We adopt different strategies for single-frame and multi-frame models to accommodate different application scenarios. For the single-frame setting, we employ only the basic M-NE layer (STG-Net), while for the multi-frame model, we use the full M-NEA layer (STAG-Net). This design choice is motivated by the fact that attention mechanisms introduce substantial computational overhead, which is less suitable for single-frame and real-time applications.

4.2. Comparison of Joint Position Estimation

In this section, we report on the position evaluation of our models on multiple datasets.
Human3.6M. Table 1 and Table 2 present a comparison of our STG-Net model with state-of-the-art (SOTA) single-frame 2D-to-3D lifting methods on the Human3.6M dataset under Protocol #1 and Protocol #2, respectively. For a fair comparison, all baseline results are obtained using single-frame input. Our model achieves a mean MPJPE of 50.8 mm, outperforming the most recent study [19] by 0.9 mm and securing the second-best score overall. Cai et al. [20] achieved slightly better results, but their evaluation relies on a post-refined model, whereas our model was not subjected to this refinement process.
Table 3 and Table 4 compare our STAG-Net with state-of-the-art 243-frame 2D-to-3D pose lifting methods on the Human3.6M dataset under Protocol #1 and Protocol #2, respectively. Our model-S achieves a mean MPJPE of 43.3 mm, while model-L further improves the performance to 41.8 mm. Notably, model-L outperforms the most recent methods, MSTFormer [38] and Perspose [39], by 1.2 mm, achieving the best overall result, while on Protocol #2, [39] achieves the best overall result.
Table 1. MPJPE (mm) of 2D-to-3D lifting methods on Human3.6M under Protocol #1 (P1) using single-frame input. The best and second-best results are highlighted in bold and underlined, respectively. We use § to highlight methods that use refinement module [20].
Method  Dir.  Disc.  Eat.  Greet  Phone  Photo  Pose  Purch.  Sit  SitD.  Smoke  Wait  WalkD  Walk  WalkT  Avg.
Martinez et al. [2] (2017)  51.8  56.2  58.1  59.0  69.5  78.4  55.2  58.1  74.0  94.6  62.3  59.1  65.1  65.1  52.4  62.9
Fang et al. [40] (2018)  50.1  54.3  57.0  57.1  66.6  73.3  53.4  55.7  72.8  88.6  60.3  57.7  62.7  47.5  50.6  60.4
Pavlakos et al. [41] (2018)  48.5  54.4  54.4  52.0  59.4  65.3  49.9  52.9  65.8  71.1  56.6  52.9  60.9  44.7  47.8  56.2
Lee et al. [42] (2018)  40.2  49.2  47.8  52.6  50.1  75.0  50.2  43.0  55.8  73.9  54.1  55.6  58.2  43.3  43.3  52.8
Zhao et al. [10] (2019)  57.3  60.7  51.4  60.5  61.1  49.9  47.3  68.1  86.2  55.0  67.8  61.0  42.1  60.6  45.3  57.6
Ci et al. [43] (2019)  46.8  52.3  44.7  50.4  52.9  68.9  49.6  46.4  60.2  78.9  51.2  50.0  54.8  40.4  43.3  52.7
Pavllo et al. [1] (2019)  47.1  50.6  49.0  51.8  53.6  61.4  49.4  47.4  59.3  67.4  52.4  49.5  55.3  39.5  42.7  51.8
Cai et al. [20] (2019) §  46.5  48.8  47.6  50.9  52.9  61.3  48.3  45.8  59.2  64.4  51.2  48.4  53.5  39.2  41.2  50.6
Xu et al. [23] (2021)  45.2  49.9  47.5  50.9  54.9  66.1  48.5  46.3  59.7  71.5  51.4  48.6  53.9  39.9  44.1  51.9
Zhao et al. [44] (2022)  45.2  50.8  48.0  50.0  54.9  65.0  48.2  47.1  60.2  70.0  51.6  48.7  54.1  39.7  43.1  51.8
Banik et al. [19] (2024)  48.9  50.1  46.7  50.4  54.6  63.0  48.8  47.9  64.1  68.6  50.5  48.7  53.9  39.3  42.2  51.7
STG-Net (Ours, F = 1)  44.6  50.9  48.0  49.3  52.0  61.2  48.3  46.5  58.8  69.8  51.4  47.6  52.9  38.6  41.6  50.8
Table 2. Reconstruction error after rigid alignment on Human3.6M under Protocol #2 (P2) using single-frame input. The best and second-best results are highlighted in bold and underlined, respectively.
Method  Dir.  Disc.  Eat.  Greet  Phone  Photo  Pose  Purch.  Sit  SitD.  Smoke  Wait  WalkD  Walk  WalkT  Avg.
Martinez et al. [2] (2017)  39.5  43.2  46.4  47.0  51.0  56.0  41.4  40.6  56.5  69.4  49.2  45.0  49.5  38.0  43.1  47.7
Fang et al. [40] (2018)  38.2  41.7  43.7  44.9  48.5  55.3  40.2  38.2  54.5  64.4  47.2  44.3  47.3  36.7  41.7  45.7
Pavlakos et al. [41] (2018)  34.7  39.8  41.8  38.6  42.5  47.5  38.0  36.6  50.7  56.8  42.6  39.6  43.9  32.1  36.5  41.8
Lee et al. [42] (2018)  34.9  35.2  43.2  42.6  46.2  55.0  37.6  38.8  50.9  67.3  48.9  35.2  31.0  50.7  34.6  43.4
Pavllo et al. [1] (2019)  36.0  38.7  38.0  41.7  40.1  45.9  37.1  35.4  46.8  53.4  41.4  36.9  43.1  30.3  34.8  40.0
Cai et al. [20] (2019)  36.8  38.7  38.2  41.7  40.7  46.8  37.9  35.6  47.6  51.7  41.3  36.8  42.7  31.0  34.7  40.2
Liu et al. [8] (2020)  35.9  40.0  38.0  41.5  42.5  51.4  37.8  36.0  48.6  56.6  41.8  38.3  42.7  31.7  36.2  41.2
Zou et al. [9] (2021)  35.7  38.6  36.3  40.5  39.2  44.5  37.0  35.4  46.4  51.2  40.5  35.6  41.7  30.7  33.9  39.1
STG-Net (Ours, F = 1)  35.5  39.6  38.2  40.7  40.3  46.2  36.2  35.7  48.4  54.9  42.0  36.1  42.7  31.1  34.4  40.1
Table 3. MPJPE (mm) of 2D-to-3D lifting methods on Human3.6M under Protocol #1 (P1) using 243-frame input. The best and second-best results are highlighted in bold and underlined, respectively.
Method  Dir.  Disc.  Eat.  Greet  Phone  Photo  Pose  Purch.  Sit  SitD.  Smoke  Wait  WalkD  Walk  WalkT  Avg.
Pavllo et al. [1] (2019)  45.2  46.7  43.3  45.6  48.1  55.1  44.6  44.3  57.3  65.8  47.1  44.0  49.0  32.8  33.9  46.8
Liu et al. [14] (2020)  41.8  44.8  41.1  44.9  47.4  54.1  43.4  42.2  56.2  63.6  45.3  43.5  45.3  31.3  32.2  45.1
Zeng et al. [45] (2020)  46.6  47.1  43.9  41.6  45.8  49.6  46.5  40.0  53.4  61.1  46.1  42.6  43.1  31.5  32.6  44.8
Chen et al. [46] (2021)  41.4  43.5  40.1  42.9  46.6  51.9  41.7  42.3  53.9  60.2  45.4  41.7  46.0  31.5  32.7  44.1
Li et al. [3] (2022)  40.3  43.3  40.2  42.3  45.6  52.3  41.8  40.5  55.9  60.6  44.2  43.0  44.2  30.0  30.2  43.7
Yu et al. [6] (2023)  41.3  44.3  40.8  41.8  45.9  54.1  42.1  41.5  57.8  62.9  45.0  42.8  45.9  29.4  29.9  44.4
Zhao et al. [47] (2023)  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  45.2
Islam et al. [24] (2024)  38.7  43.9  42.3  43.8  44.8  48.1  42.4  41.2  52.6  63.8  43.5  42.7  44.7  34.1  34.5  44.1
Song et al. [22] (2024)  41.1  43.3  40.4  41.3  44.9  53.2  41.7  41.1  54.9  65.2  43.5  41.3  42.7  29.1  29.2  43.5
Lin et al. [38] (2025)  39.0  42.5  40.7  41.1  45.9  51.3  41.1  40.5  54.1  59.9  44.1  41.3  43.4  29.5  29.8  43.0
Hao et al. [39] (2025)  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  43.0
STAG-Net (Ours, F = 243, S)  40.6  44.5  39.4  41.0  45.6  50.9  42.6  43.2  54.3  59.9  44.6  40.8  43.1  29.2  30.4  43.3
STAG-Net (Ours, F = 243, L)  37.8  43.5  38.9  41.0  42.4  52.6  41.3  38.8  51.6  57.2  42.0  40.2  42.4  27.0  29.8  41.8
Table 4. Reconstruction error after rigid alignment on Human3.6M under Protocol #2 (P2) using 243-frame input. The best and second-best results are highlighted in bold and underlined, respectively.
Method  Dir.  Disc.  Eat.  Greet  Phone  Photo  Pose  Purch  Sit  SitD.  Smoke  Wait  WalkD  Walk  WalkT.  Avg.
Pavllo et al. [1] (2019)  34.1  36.1  34.4  37.2  36.4  42.2  34.4  33.6  45.0  52.5  37.4  33.8  37.8  25.6  27.3  36.5
Liu et al. [14] (2020)  32.3  35.2  33.3  35.8  35.9  41.5  33.2  32.7  44.6  50.9  37.0  32.4  37.0  25.2  27.2  35.6
Zeng et al. [45] (2020)  34.8  32.1  28.5  30.7  31.4  36.9  35.6  30.5  38.9  40.5  32.5  31.0  29.9  22.5  24.5  32.0
Chen et al. [46] (2021)  32.6  35.1  32.8  35.4  36.3  40.4  32.4  32.3  42.7  49.0  36.8  32.4  36.0  24.9  26.5  35.0
Li et al. [3] (2022)  32.7  35.5  32.5  35.4  35.9  41.6  33.0  31.9  45.1  50.1  36.3  33.5  35.1  23.9  25.0  35.2
Yu et al. [6] (2023)  32.4  35.3  32.6  34.2  35.0  42.1  32.1  31.9  45.5  49.5  36.1  32.4  35.6  23.5  24.7  34.8
Zhao et al. [47] (2023)  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  35.6
Islam et al. [24] (2024)  33.0  36.1  34.1  37.4  36.2  40.4  33.6  32.4  44.1  54.4  36.5  34.5  36.2  26.4  27.4  36.2
Song et al. [22] (2024)  31.1  34.9  32.4  33.7  36.3  42.8  31.6  31.2  44.7  48.6  36.9  32.4  35.4  24.1  24.4  34.7
Lin et al. [38] (2025)  31.2  34.4  33.0  33.9  35.4  39.4  32.3  31.7  43.3  48.1  35.9  32.8  34.6  24.0  24.5  34.3
Hao et al. [39] (2025)  –  –  –  –  –  –  –  –  –  –  –  –  –  –  –  28.3
STAG-Net (Ours, F = 243, S)  31.3  35.1  31.0  32.7  34.6  39.1  32.1  32.4  42.7  47.7  35.3  31.0  33.4  22.5  23.7  33.6
STAG-Net (Ours, F = 243, L)  30.2  33.2  30.9  32.4  32.1  40.3  30.4  29.3  41.3  45.2  33.9  30.2  33.3  21.6  24.5  32.6
MPI-INF-3DHP. We evaluate our model on the official 3DHP test set and report the standard metrics, 3DPCK and AUC. Here, AUC denotes the area under the PCK curve computed over a predefined range of 3D distance thresholds, providing a threshold-independent measure of pose estimation accuracy, as shown in Table 5. We report the mean performance along with confidence intervals computed via bootstrap on the 3DHP test set. Using single-frame input, our model achieves a 3DPCK score of 91.0 ± 0.2 and an AUC of 54.1 ± 0.3. When extended to multi-frame input, our model further improves performance: with 81-frame input, it attains a 3DPCK score of 97.8 ± 0.1 and an AUC of 70.1 ± 0.1, demonstrating the effectiveness of temporal information for 3D pose estimation. Notably, the multi-frame setting also yields smaller confidence intervals than the single-frame variant, indicating improved stability due to the incorporation of temporal information.
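For reference, the 3DPCK and AUC metrics and the percentile-bootstrap confidence intervals used above can be sketched as follows. This is a simplified illustration over a flat list of per-joint errors in millimetres; the 150 mm 3DPCK threshold is the common convention, while the threshold grid, resample count, and function names are our own choices:

```python
import random
import statistics

def pck(errors_mm, threshold=150.0):
    """3DPCK: percentage of joints whose 3D position error is within the threshold."""
    return 100.0 * sum(e <= threshold for e in errors_mm) / len(errors_mm)

def auc(errors_mm, max_threshold=150.0, steps=31):
    """Area under the PCK curve: PCK averaged over evenly spaced thresholds in [0, max]."""
    thresholds = [max_threshold * i / (steps - 1) for i in range(steps)]
    return statistics.mean(pck(errors_mm, t) for t in thresholds)

def bootstrap_ci(errors_mm, metric, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for a metric over per-joint errors."""
    rng = random.Random(seed)
    stats = sorted(
        metric([rng.choice(errors_mm) for _ in errors_mm])
        for _ in range(n_resamples)
    )
    lo = stats[int(alpha / 2 * n_resamples)]
    hi = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi
```

A narrower bootstrap interval, as observed for the multi-frame models, indicates that the metric varies less under resampling of the test set.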
HumanEva-I. We train and evaluate our model on the HumanEva-I dataset for all three subjects (S1, S2, S3) and two actions (Walk, Jog), under Protocol #2. As shown in Table 6, our model with an input length of F = 3 achieves the best performance, obtaining a mean P-MPJPE of 8.5 mm. Moreover, our approach consistently outperforms recent methods [25] that rely on limited temporal information across all evaluated subsets.
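Protocol #2 rigidly aligns the prediction to the ground truth before measuring error (P-MPJPE). A minimal sketch of this Procrustes alignment, recovering the optimal scale, rotation, and translation via SVD (function name and array shapes are illustrative):

```python
import numpy as np

def p_mpjpe(pred, gt):
    """MPJPE after similarity alignment (Protocol #2).

    pred, gt: (J, 3) arrays of joint positions. Finds the scale, rotation,
    and translation that best map pred onto gt (orthogonal Procrustes),
    then returns the mean per-joint error of the aligned prediction.
    """
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    P, G = pred - mu_p, gt - mu_g          # center both point sets
    U, s, Vt = np.linalg.svd(P.T @ G)      # SVD of the cross-covariance
    R = Vt.T @ U.T                         # optimal rotation
    if np.linalg.det(R) < 0:               # fix a possible reflection
        Vt[-1] *= -1
        s[-1] *= -1
        R = Vt.T @ U.T
    scale = s.sum() / (P ** 2).sum()       # optimal uniform scale
    aligned = scale * P @ R.T + mu_g
    return np.linalg.norm(aligned - gt, axis=1).mean()
```

Because the alignment removes any global similarity transform, P-MPJPE isolates errors in the pose's internal structure.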
Table 5. 3DPCK and AUC on the 3DHP test set. The best and second-best scores are highlighted in bold and underlined, respectively.
Method  3DPCK  AUC
Luo et al. [48] (2018)  81.8  45.2
Wandt et al. [49] (2019)  82.5  58.5
Sárándi et al. [50] (2021)  90.6  56.2
Zheng et al. [51] (2021)  88.6  56.4
Gong et al. [52] (2022)  89.1  53.1
Oreshkin et al. [53] (2023)  88.6  48.9
Shetty et al. [54] (2023)  91.8  52.3
Qian et al. [55] (2023)  97.3  71.5
Hao et al. [39] (2025)  94.0  55.2
STG-Net (Ours, F = 1)  91.0 ± 0.2  54.1 ± 0.3
STAG-Net (Ours, F = 27)  97.4 ± 0.1  67.5 ± 0.2
STAG-Net (Ours, F = 81)  97.8 ± 0.1  70.1 ± 0.1
Table 6. Results on the HumanEva-I dataset under Protocol #2 using 2D ground truth (GT) joints as input. The best and second-best results are highlighted in bold and underlined, respectively. † indicates using temporal information.
Method  Walk S1  Walk S2  Walk S3  Jog S1  Jog S2  Jog S3  Avg.
Martinez et al. [2] (2017)  19.7  17.4  46.8  26.9  18.2  18.6  24.6
Lee et al. [42] (2018)  18.6  19.9  30.5  25.7  16.8  17.7  21.5
Pavllo et al. [1] (2019)  13.9  10.2  46.6  20.9  13.1  13.8  19.8
Zhang et al. [56] (2021)  13.7  9.5  47.1  21.0  12.6  13.4  19.5
Li et al. [3] (2022) †  9.7  7.6  15.8  12.3  9.4  11.2  11.0
Aouaidjia et al. [25] (2025) †  8.7  6.5  17.9  13.5  7.8  8.5  10.4
STAG-Net (Ours, F = 3) †  8.5  6.2  10.2  10.0  7.8  8.4  8.5

4.3. Qualitative Results

Figure 5 visualizes the results of our model on the Human3.6M dataset using 243-frame input sequences across 15 action categories. We evaluate the model using CPN-detected 2D keypoints as well as ground-truth 2D keypoints as inputs. The results demonstrate that our model consistently produces accurate 6D pose predictions for a wide range of human actions, with only minor discrepancies observed between the predicted poses and the ground truth.

4.4. Comparison of Joint Orientation Estimation

Few studies focus on predicting joint orientation; among non-mesh-based methods, OKPS [18] is the most relevant. However, a direct comparison with our method is not feasible, as it reports orientation errors only for a subset of joints. We therefore compare our joint orientation prediction with the method in [19]. As shown in Table 7, our model achieves higher accuracy in both position and orientation prediction.
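The orientation metric (MPJAE) measures the geodesic angle between predicted and ground-truth rotations, following the rotation metrics analyzed in [31]. A minimal sketch, assuming one 3×3 rotation matrix per joint:

```python
import numpy as np

def mpjae_deg(pred_rots, gt_rots):
    """Mean per-joint angular error in degrees.

    pred_rots, gt_rots: iterables of 3x3 rotation matrices. The geodesic
    angle between two rotations is arccos((trace(Rp Rg^T) - 1) / 2).
    """
    angles = []
    for Rp, Rg in zip(pred_rots, gt_rots):
        cos = (np.trace(Rp @ Rg.T) - 1.0) / 2.0
        # Clip guards against arccos domain errors from floating-point noise.
        angles.append(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))
    return float(np.mean(angles))
```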

4.5. Model Size and Computational Complexity

This section compares the performance, model size, and computational complexity of the proposed STG-Net and STAG-Net with state-of-the-art methods under single-frame and 243-frame settings.
In the single-frame setting, STG-Net stands out for its extremely lightweight design and low computational cost while maintaining competitive accuracy in joint position prediction. Figure 6a illustrates the trade-off between model size and performance on the Human3.6M dataset, where our method achieves the lowest values in both dimensions. As reported in Table 8, the single-frame STG-Net contains only 2.44 M parameters and requires merely 0.34 M FLOPs, demonstrating its suitability for real-time and resource-constrained applications.
For the 243-frame setting, Figure 6b compares model size and performance on the Human3.6M dataset. The STAG-Net-S variant achieves the smallest model size while still delivering competitive performance, whereas STAG-Net-L outperforms the recent MSTFormer [38] while maintaining a smaller model footprint. As shown in Table 8, the computational cost increases significantly with longer input sequences. Specifically, STAG-Net-S has 3.06 M parameters and requires 6.93 G FLOPs, while STAG-Net-L contains 6.26 M parameters with 13.88 G FLOPs. This substantial increase compared to the single-frame STG-Net is primarily due to the attention-enhanced GCN modules, which introduce considerable computational overhead. Among transformer-based methods, MotionAGFormer [58] and MotionBERT [59] achieve lower MPJPE; however, these methods rely on significantly larger models and incur more than ten times the computational cost of our approach. In contrast, our method offers a favorable balance between accuracy, model size, and efficiency, particularly for long-sequence inputs.
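Parameter and FLOP counts of the kind compared above can be estimated analytically for fully connected layers. The sketch below is a back-of-envelope illustration only; the sizes (17 joints, 243 frames, 192 channels) echo our setting, but the two-layer embedding itself is hypothetical and does not reproduce any model in Table 8:

```python
def linear_params(d_in, d_out, bias=True):
    """Parameter count of one fully connected layer: weights plus optional bias."""
    return d_in * d_out + (d_out if bias else 0)

def linear_flops(d_in, d_out, tokens=1):
    """FLOPs of one fully connected layer, counting a multiply-add as two FLOPs."""
    return 2 * tokens * d_in * d_out

# Illustrative: a per-joint 2D -> 192-dim embedding followed by one 192 -> 192
# layer, applied to every joint of every frame in a 243-frame sequence.
joints, frames, d = 17, 243, 192
params = linear_params(2, d) + linear_params(d, d)
flops = linear_flops(2, d, joints * frames) + linear_flops(d, d, joints * frames)
```

Counting this way makes the scaling visible: parameters are independent of sequence length, while FLOPs grow linearly with the number of frames processed, which is why the 243-frame models in Table 8 report FLOPs in the gigaflop range.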
Table 8. Comparison with the state of the art in terms of model size and computational complexity.
Method  Parameters  FLOPs  MPJPE  Frames
Hossain et al. [57] (2018)  16.96 M  33.88 M  58.3  1
Pavllo et al. [1] (2019)  –  –  51.8  1
Soubarna et al. [19] (2024)  4.6 M  –  51.7  1
STG-Net (Ours)  2.44 M  0.34 M  50.8  1
STAG-Net (Ours)  5.51 M  55.66 M  51.1  1
Pavllo et al. [1] (2019)  16.95 M  33.87 M  46.8  243
Li et al. [3] (2022)  4.23 M  1.37 G  44.0  243
Yu et al. [6] (2023)  1.3 M  1.5 G  44.4  243
Zhu et al. [59] (2023)  42.5 M  174.7 G  39.2  243
Tang et al. [30] (2023)  4.75 M  19.56 G  41.0  243
Mehraban et al. [58] (2024)  19.0 M  78.3 G  38.4  243
Lin et al. [38] (2025)  9.3 M  1.29 G  43.0  243
STAG-Net (Ours, model-S)  3.06 M  6.93 G  43.3  243
STAG-Net (Ours, model-L)  6.26 M  13.88 G  41.8  243

4.6. Ablation Study

The ablation studies were performed from four perspectives: (1) analyzing the contributions of the GCN, Attention, and TCN branches; (2) investigating the impact of the input dimensionality; (3) evaluating the effect of the number of inputs to the M-NEA residual block; and (4) examining the influence of incorporating node and edge layers.
Effect of GCN, Attention, and TCN branch: To evaluate the effectiveness of the GCN, Attention, and TCN branches individually, we compare their performance in position and orientation prediction using input frame numbers of F = 1, F = 27, and F = 81 for the large models (Dim = 192). Since the TCN branch does not predict orientation, its effectiveness is assessed solely based on position prediction. In this setting, the TCN branch outputs position estimates directly. In Table 9, we report the MPJPE and MPJAE results of our method for each component, as well as for different combinations of the GCN, attention, and TCN branches. The results demonstrate that integrating attention-enhanced GCN with TCN branches consistently yields superior performance.
Effect of input dimensionality: We investigate the impact of input dimensionality by conducting experiments with input frame numbers F = 27, F = 81, and F = 243. Under this setting, we compare input dimensions of 128, 192, 256, and 384, and evaluate their performance in terms of position and orientation prediction. Table 10 reports the MPJPE and MPJAE results of our method under different input dimensionalities. The results indicate that increasing the input dimension does not necessarily lead to improved performance. In particular, when F = 27, using a 256-dimensional input significantly degrades both position and orientation accuracy. Based on this analysis, an input dimension of 192 achieves the best overall performance.
Effect of M-NEA layer: We investigate the effect of the number of M-NEA residual blocks by conducting experiments with an input frame length of F = 27. Under this setting, we evaluate models with 2, 3, 4, and 5 M-NEA residual blocks and assess their performance in terms of both position and orientation accuracy. Table 11 reports the MPJPE and MPJAE results for different numbers of M-NEA residual blocks. The results indicate that increasing the number of M-NEA residual blocks does not necessarily improve performance, with three blocks achieving the best overall results.
Effect of Node and Edge layer: To evaluate the contribution of orientation supervision, we conduct ablation experiments with different temporal input lengths (F = 1, 27, and 81). Under these settings, we analyze whether incorporating bone orientation information improves joint position accuracy. Specifically, we use 2D joint positions as node features while keeping the attention mechanism unchanged to ensure a fair comparison. We compare two configurations: Node-Attention (without orientation supervision) and Node-Edge-Attention (with orientation supervision). Table 12 reports the corresponding MPJPE results. The experimental results demonstrate that introducing orientation supervision consistently reduces MPJPE across all temporal input lengths. This confirms that jointly modeling node (position) and edge (orientation) information helps alleviate depth ambiguity and enhances the accuracy of 3D joint position estimation.
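The Node–Edge configuration above pairs joint positions (node features) with bone vectors (edge features). A minimal sketch of deriving edge features from node positions, using a hypothetical parent table for a short kinematic chain rather than the full 17-joint Human3.6M skeleton:

```python
# Hypothetical parent table for a 4-joint chain; each entry maps a joint
# index to its parent's index. The paper's model uses the 17-joint,
# 16-edge graph of Figure 2 instead.
PARENTS = {1: 0, 2: 1, 3: 2}

def bone_vectors(joints, parents=PARENTS):
    """Edge features: the vector from each joint's parent to the joint itself."""
    return {j: [c - p for c, p in zip(joints[j], joints[par])]
            for j, par in parents.items()}
```

Because each bone vector is a difference of two node positions, supervising its orientation constrains the relative placement of adjacent joints, which is one way to see why the edge layer reduces depth ambiguity in Table 12.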
Based on the ablation study results, we conclude that STAG-Net achieves optimal performance when using attention-enhanced GCN combined with TCN branches, an input dimensionality of 192, and three M-NEA residual blocks. Under this configuration, the model delivers the best accuracy for both position and orientation prediction, outperforming the most recent state-of-the-art methods.

5. Real-Time Application

For real-time 6D human pose estimation, we deployed our model using RTMPose as the 2D pose detector, with the input camera resolution set to 1280×1024. The system was integrated with ROS 2 Humble, where the predicted 3D keypoints were published as ROS topics and visualized in RViz.
To demonstrate the applicability of our method in robotics, we further employed the predicted joint rotations to teleoperate a robotic arm. The estimated joint orientations were directly mapped to the robot's joints, enabling real-time motion control driven by human movement. Experimental results show that the complete pipeline, including the 2D pose detector, achieves a processing speed of 47–60 FPS, which comfortably exceeds the real-time requirement of 25 FPS. We submitted this work together with a demonstration video to highlight both the effectiveness of the proposed model and its potential for human–robot collaboration.
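Throughput figures such as the 47–60 FPS reported above can be measured with a sliding-window frame counter; the standard-library sketch below is our own illustration (class name and window size are arbitrary choices), independent of the actual deployment code:

```python
import time
from collections import deque

class FpsMeter:
    """Sliding-window FPS estimate over the most recent frames."""

    def __init__(self, window=30):
        # Only the last `window` timestamps are kept.
        self.times = deque(maxlen=window)

    def tick(self, now=None):
        """Record the completion time of one frame."""
        self.times.append(time.perf_counter() if now is None else now)

    def fps(self):
        """Frames per second over the retained window (0.0 until two ticks)."""
        if len(self.times) < 2:
            return 0.0
        span = self.times[-1] - self.times[0]
        return (len(self.times) - 1) / span if span > 0 else 0.0
```

In a capture loop, calling `tick()` once per processed frame and logging `fps()` gives a smoothed rate that tolerates per-frame jitter better than timing single frames.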

6. Discussion

Through extensive experimental validation, STAG-Net demonstrates strong performance by constructing hierarchical feature representations of joints and bones. Nevertheless, several limitations remain.
First, attention-based architectures inherently introduce considerable computational overhead. As the input sequence length increases, the cost of spatio-temporal attention grows significantly. As shown in Table 8, increasing the input from 81 to 243 frames results in nearly a threefold increase in computational cost. Although long temporal sequences improve accuracy, this trade-off limits scalability in resource-constrained scenarios.
Second, the benefit of attention mechanisms depends on the temporal context. As indicated in Table 9, under single-frame settings, the attention-enhanced GCN does not provide substantial improvements over a lightweight combination of basic GCN and TCN branches. This suggests that attention modules are more advantageous for long-sequence modeling, while simpler architectures are preferable for real-time single-frame applications.
Finally, like most lifting-based approaches, our method depends on the quality of 2D pose detections. Errors in 2D joint localization may propagate to the 3D estimation stage, particularly under occlusion or challenging viewpoints.

7. Conclusions

In this paper, we addressed key limitations of existing 3D human pose estimation methods. Most prior approaches primarily focused on predicting joint keypoints and paid limited attention to recovering the complete 6D human pose, which consists of both 3D joint positions and bone orientations. In addition, many existing methods were either computationally expensive and unable to operate in real time, or they achieved real-time performance at the cost of accuracy. To overcome these challenges, we proposed STAG-Net, a unified framework that simultaneously predicted 3D joint positions and orientations, enabling accurate and real-time 6D human pose estimation.
Our method was designed to deliver real-time performance without sacrificing accuracy, achieving results comparable to or exceeding those of state-of-the-art approaches. For single-frame prediction, STG-Net outperformed recent GCN-based methods. For multi-frame prediction, using a 243-frame temporal input, STAG-Net surpassed the latest methods on three public benchmarks while requiring fewer parameters. These results demonstrated that our method was effective for both single-frame and multi-frame 3D pose estimation.
In future work, we plan to further improve both the efficiency and robustness of the proposed framework. A promising direction is to replace or complement the current attention-based modules with more lightweight sequence modeling architectures, such as state space models (e.g., Mamba), which exhibit linear complexity with respect to sequence length. Such designs could significantly alleviate the computational overhead associated with spatio-temporal attention, particularly when processing long temporal sequences. For real-time applications, we plan to further improve single-frame accuracy and reduce the model’s computational cost (FLOPs), making it even more suitable for high-speed robotic applications.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/robotics15030054/s1, Video S1: Qualitative Results; Video S2: Real-Time Application Demonstration; Data S1: Source Code.

Author Contributions

Conceptualization, C.Y. and R.J.; methodology, C.Y., R.J. and X.S.; software, C.Y., Q.G. and X.S.; writing—original draft preparation, C.Y., Q.G. and R.J.; writing—review and editing, R.J., M.H. and Y.Y.; supervision, M.H. and Y.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The 3D human pose estimation results generated in this study are provided in the Supplementary Materials. The datasets used for training and evaluation are publicly available: Human3.6M (http://vision.imar.ro/human3.6m/description.php (accessed on 1 March 2026)), MPI-INF-3DHP (https://vcai.mpi-inf.mpg.de/3dhp-dataset/ (accessed on 1 March 2026)), and HumanEva-I (http://humaneva.is.tue.mpg.de/ (accessed on 1 March 2026)). These datasets can be accessed from their respective official websites, subject to their data usage policies.

Acknowledgments

During the preparation of this manuscript, the authors used OpenAI’s ChatGPT (GPT-4, OpenAI, San Francisco, CA, USA) for language editing and polishing, including improvements in grammar, structure, and clarity. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
STAG-Net: Spatial–Temporal Attention Graph network
STG-Net: Spatial–Temporal Graph network
GCN: Graph Convolutional Network
TCN: Temporal Convolutional Network
CPN: Cascaded Pyramid Network
BN: Batch Normalization
ECA: Efficient Channel Attention
GNN: Graph Neural Network
M-NE: Modulated Node–Edge
M-NEA: Modulated Node–Edge–Attention
FC: Fully Connected
MPJPE: Mean Per Joint Position Error
P-MPJPE: Procrustes-aligned Mean Per Joint Position Error
IDev: Identity Deviation
MPJAE: Mean Per Joint Angular Error
3DPCK: 3D Percentage of Correct Keypoints
AUC: Area Under the Curve

References

  1. Pavllo, D.; Feichtenhofer, C.; Grangier, D.; Auli, M. 3d human pose estimation in video with temporal convolutions and semi-supervised training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7753–7762. [Google Scholar]
  2. Martinez, J.; Hossain, R.; Romero, J.; Little, J.J. A simple yet effective baseline for 3d human pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2640–2649. [Google Scholar]
  3. Li, W.; Liu, H.; Ding, R.; Liu, M.; Wang, P.; Yang, W. Exploiting temporal contexts with strided transformer for 3d human pose estimation. IEEE Trans. Multimed. 2022, 25, 1282–1293. [Google Scholar] [CrossRef]
  4. Li, W.; Liu, H.; Tang, H.; Wang, P.; Van Gool, L. Mhformer: Multi-hypothesis transformer for 3d human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–22 June 2022; pp. 13147–13156. [Google Scholar]
  5. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907. [Google Scholar]
  6. Yu, B.X.; Zhang, Z.; Liu, Y.; Zhong, S.h.; Liu, Y.; Chen, C.W. Gla-gcn: Global-local adaptive graph convolutional network for 3d human pose estimation from monocular video. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 8818–8829. [Google Scholar]
  7. Manessi, F.; Rozza, A.; Manzo, M. Dynamic graph convolutional networks. Pattern Recognit. 2020, 97, 107000. [Google Scholar] [CrossRef]
  8. Liu, K.; Ding, R.; Zou, Z.; Wang, L.; Tang, W. A comprehensive study of weight sharing in graph networks for 3d human pose estimation. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part X 16; Springer: Berlin/Heidelberg, Germany, 2020; pp. 318–334. [Google Scholar]
  9. Zou, Z.; Tang, W. Modulated graph convolutional network for 3D human pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 11477–11487. [Google Scholar]
  10. Zhao, L.; Peng, X.; Tian, Y.; Kapadia, M.; Metaxas, D.N. Semantic graph convolutional networks for 3d human pose regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3425–3435. [Google Scholar]
  11. Pavlakos, G.; Zhou, X.; Derpanis, K.G.; Daniilidis, K. Coarse-to-fine volumetric prediction for single-image 3D human pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7025–7034. [Google Scholar]
  12. Sosa, J.; Hogg, D. Self-supervised 3d human pose estimation from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 4787–4796. [Google Scholar]
  13. Tome, D.; Russell, C.; Agapito, L. Lifting from the deep: Convolutional 3d pose estimation from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 June 2017; pp. 2500–2509. [Google Scholar]
  14. Liu, R.; Shen, J.; Wang, H.; Chen, C.; Cheung, S.C.; Asari, V. Attention mechanism exploits temporal contexts: Real-time 3d human pose reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 5064–5073. [Google Scholar]
  15. Lee, K.; Kim, W.; Lee, S. From human pose similarity metric to 3D human pose estimator: Temporal propagating LSTM networks. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 1781–1797. [Google Scholar] [CrossRef]
  16. Kanazawa, A.; Black, M.J.; Jacobs, D.W.; Malik, J. End-to-end recovery of human shape and pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7122–7131. [Google Scholar]
  17. Zhang, H.; Tian, Y.; Zhou, X.; Ouyang, W.; Liu, Y.; Wang, L.; Sun, Z. Pymaf: 3d human pose and shape regression with pyramidal mesh alignment feedback loop. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Nashville, TN, USA, 10–17 October 2021; pp. 11446–11456. [Google Scholar]
  18. Fisch, M.; Clark, R. Orientation keypoints for 6D human pose estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 10145–10158. [Google Scholar] [CrossRef] [PubMed]
  19. Banik, S.; Avagyan, E.; Auddy, S.; Gracia, A.M.; Knoll, A. PoseGraphNet++: Enriching 3D human pose with orientation estimation. arXiv 2023, arXiv:2308.11440. [Google Scholar]
  20. Cai, Y.; Ge, L.; Liu, J.; Cai, J.; Cham, T.J.; Yuan, J.; Thalmann, N.M. Exploiting spatial-temporal relationships for 3d pose estimation via graph convolutional networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 2272–2281. [Google Scholar]
  21. Yan, S.; Xiong, Y.; Lin, D. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
  22. Song, X.; Li, Z.; Chen, S.; Demachi, K. Quater-gcn: Enhancing 3d human pose estimation with orientation and semi-supervised training. arXiv 2024, arXiv:2404.19279. [Google Scholar]
  23. Xu, T.; Takano, W. Graph stacked hourglass networks for 3d human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 16105–16114. [Google Scholar]
  24. Islam, Z.; Hamza, A.B. Multi-hop graph transformer network for 3D human pose estimation. J. Vis. Commun. Image Represent. 2024, 101, 104174. [Google Scholar] [CrossRef]
  25. Aouaidjia, K.; Li, A.; Zhang, W.; Zhang, C. 3D Human Pose Estimation via Spatial Graph Order Attention and Temporal Body Aware Transformer. arXiv 2025, arXiv:2505.01003. [Google Scholar] [CrossRef]
  26. Jiang, X.; Zhu, R.; Li, S.; Ji, P. Co-embedding of nodes and edges with graph neural networks. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 45, 7075–7086. [Google Scholar] [CrossRef]
  27. Zhou, Y.; Barnes, C.; Lu, J.; Yang, J.; Li, H. On the continuity of rotation representations in neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5745–5753. [Google Scholar]
  28. Holschneider, M.; Kronland-Martinet, R.; Morlet, J.; Tchamitchian, P. A real-time algorithm for signal analysis with the help of the wavelet transform. In Proceedings of the Wavelets: Time-Frequency Methods and Phase Space Proceedings of the International Conference, Marseille, France, 14–18 December 1987; Springer: Berlin/Heidelberg, Germany, 1990; pp. 286–297. [Google Scholar]
  29. Mehta, D.; Sotnychenko, O.; Mueller, F.; Xu, W.; Elgharib, M.; Fua, P.; Seidel, H.P.; Rhodin, H.; Pons-Moll, G.; Theobalt, C. XNect: Real-time multi-person 3D motion capture with a single RGB camera. ACM Trans. Graph. (TOG) 2020, 39, 82-1. [Google Scholar] [CrossRef]
  30. Tang, Z.; Qiu, Z.; Hao, Y.; Hong, R.; Yao, T. 3d human pose estimation with spatio-temporal criss-cross attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 4790–4799. [Google Scholar]
  31. Huynh, D.Q. Metrics for 3D rotations: Comparison and analysis. J. Math. Imaging Vis. 2009, 35, 155–164. [Google Scholar] [CrossRef]
  32. Ionescu, C.; Papava, D.; Olaru, V.; Sminchisescu, C. Human3.6M: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 36, 1325–1339. [Google Scholar] [CrossRef]
  33. Mehta, D.; Rhodin, H.; Casas, D.; Fua, P.; Sotnychenko, O.; Xu, W.; Theobalt, C. Monocular 3d human pose estimation in the wild using improved cnn supervision. In Proceedings of the 2017 International Conference on 3D Vision (3DV), Qingdao, China, 10–12 October 2017; IEEE: New York, NY, USA, 2017; pp. 506–516. [Google Scholar]
  34. Sigal, L.; Balan, A.O.; Black, M.J. Humaneva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. Int. J. Comput. Vis. 2010, 87, 4–27. [Google Scholar] [CrossRef]
  35. Li, J.; Xu, C.; Chen, Z.; Bian, S.; Yang, L.; Lu, C. Hybrik: A hybrid analytical-neural inverse kinematics solution for 3d human pose and shape estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 3383–3393. [Google Scholar]
  36. Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; Lerer, A. Automatic Differentiation in Pytorch. 2017. Available online: https://pytorch.org (accessed on 1 March 2026).
  37. Chen, Y.; Wang, Z.; Peng, Y.; Zhang, Z.; Yu, G.; Sun, J. Cascaded pyramid network for multi-person pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7103–7112. [Google Scholar]
  38. Lin, H.; Xu, S.; Su, C. MSTFormer: Multi-granularity spatial-temporal transformers for 3D human pose estimation. J. King Saud Univ. Comput. Inf. Sci. 2025, 37, 1–19. [Google Scholar] [CrossRef]
  39. Hao, X.; Li, H. Perspose: 3d human pose estimation with perspective encoding and perspective rotation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Honolulu, HI, USA, 19–23 October 2025; pp. 8110–8119. [Google Scholar]
  40. Fang, H.S.; Xu, Y.; Wang, W.; Liu, X.; Zhu, S.C. Learning pose grammar to encode human body configuration for 3d pose estimation. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
  41. Pavlakos, G.; Zhou, X.; Daniilidis, K. Ordinal depth supervision for 3d human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7307–7316. [Google Scholar]
  42. Lee, K.; Lee, I.; Lee, S. Propagating lstm: 3d pose estimation based on joint interdependency. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 119–135. [Google Scholar]
  43. Ci, H.; Wang, C.; Ma, X.; Wang, Y. Optimizing network structure for 3d human pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 2262–2271. [Google Scholar]
  44. Zhao, W.; Wang, W.; Tian, Y. Graformer: Graph-oriented transformer for 3d pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 20438–20447. [Google Scholar]
  45. Zeng, A.; Sun, X.; Huang, F.; Liu, M.; Xu, Q.; Lin, S. Srnet: Improving generalization in 3d human pose estimation with a split-and-recombine approach. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 507–523. [Google Scholar]
  46. Chen, T.; Fang, C.; Shen, X.; Zhu, Y.; Chen, Z.; Luo, J. Anatomy-aware 3d human pose estimation with bone-based pose decomposition. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 198–209. [Google Scholar] [CrossRef]
  47. Zhao, Q.; Zheng, C.; Liu, M.; Wang, P.; Chen, C. Poseformerv2: Exploring frequency domain for efficient and robust 3d human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 8877–8886. [Google Scholar]
  48. Luo, C.; Chu, X.; Yuille, A. Orinet: A fully convolutional network for 3d human pose estimation. arXiv 2018, arXiv:1811.04989. [Google Scholar] [CrossRef]
  49. Wandt, B.; Rosenhahn, B. Repnet: Weakly supervised training of an adversarial reprojection network for 3d human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7782–7791. [Google Scholar]
  50. Sárándi, I.; Linder, T.; Arras, K.O.; Leibe, B. Metrabs: Metric-scale truncation-robust heatmaps for absolute 3d human pose estimation. IEEE Trans. Biom. Behav. Identity Sci. 2020, 3, 16–30. [Google Scholar] [CrossRef]
  51. Zheng, C.; Zhu, S.; Mendieta, M.; Yang, T.; Chen, C.; Ding, Z. 3d human pose estimation with spatial and temporal transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual Event, 11–17 October 2021; pp. 11656–11665. [Google Scholar]
  52. Gong, K.; Li, B.; Zhang, J.; Wang, T.; Huang, J.; Mi, M.B.; Feng, J.; Wang, X. PoseTriplet: Co-evolving 3D human pose estimation, imitation, and hallucination under self-supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11017–11027. [Google Scholar]
  53. Oreshkin, B.N. 3d human pose and shape estimation via hybrik-transformer. arXiv 2023, arXiv:2302.04774. [Google Scholar] [CrossRef]
  54. Shetty, K.; Birkhold, A.; Jaganathan, S.; Strobel, N.; Kowarschik, M.; Maier, A.; Egger, B. Pliks: A pseudo-linear inverse kinematic solver for 3d human body estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 574–584. [Google Scholar]
  55. Qian, X.; Tang, Y.; Zhang, N.; Han, M.; Xiao, J.; Huang, M.C.; Lin, R.S. Hstformer: Hierarchical spatial-temporal transformers for 3d human pose estimation. arXiv 2023, arXiv:2301.07322. [Google Scholar] [CrossRef]
  56. Zhang, J.; Wang, Y.; Zhou, Z.; Luan, T.; Wang, Z.; Qiao, Y. Learning dynamical human-joint affinity for 3d pose estimation in videos. IEEE Trans. Image Process. 2021, 30, 7914–7925. [Google Scholar] [CrossRef] [PubMed]
  57. Hossain, M.R.I.; Little, J.J. Exploiting temporal information for 3d human pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 68–84. [Google Scholar]
  58. Mehraban, S.; Adeli, V.; Taati, B. MotionAGFormer: Enhancing 3D Human Pose Estimation with a Transformer-GCNFormer Network. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2024; pp. 6920–6930. [Google Scholar]
  59. Zhu, W.; Ma, X.; Liu, Z.; Liu, L.; Wu, W.; Wang, Y. Motionbert: A unified perspective on learning human motion representations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 15085–15099. [Google Scholar]
Figure 1. Overview of the proposed STAG-Net architecture. The left panel illustrates the overall network framework, while the right panel shows the detailed structure of the proposed Modulated Node–Edge–Attention Layer.
Figure 2. Overview of the 2D-to-6D pose lifting framework. (a) Input 2D pose sequence. (b) Definition of the 17-node undirected human body graph, shown in red points. (c) Definition of the graph structure with 16 edges, shown in red lines. (d) Example of the predicted 6D human pose.
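The 17-node, 16-edge undirected body graph of Figure 2 is typically encoded as a symmetric adjacency matrix for the GCN branch. The sketch below assumes the standard Human3.6M 17-joint indexing with the pelvis as root (joint 0); the paper's exact joint ordering is not given in the caption, so the edge list here is illustrative.

```python
import numpy as np

# Assumed edge list: a standard Human3.6M 17-joint skeleton (root = 0).
# The paper's exact indexing may differ.
EDGES = [(0, 1), (1, 2), (2, 3),            # right leg
         (0, 4), (4, 5), (5, 6),            # left leg
         (0, 7), (7, 8), (8, 9), (9, 10),   # spine and head
         (8, 11), (11, 12), (12, 13),       # left arm
         (8, 14), (14, 15), (15, 16)]       # right arm

def adjacency(num_nodes=17, edges=EDGES, self_loops=True):
    """Symmetric adjacency matrix of the undirected body graph,
    optionally with self-loops (A + I), as commonly used in GCN layers."""
    A = np.zeros((num_nodes, num_nodes))
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    if self_loops:
        A += np.eye(num_nodes)
    return A
```

Note that the 16 edges match the 17 nodes of a tree-structured skeleton (one fewer edge than nodes), consistent with panels (b) and (c) of Figure 2.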
Figure 3. Structure of our TCN Unit (TCN-U).
Figure 4. Overall structure of the proposed Skip-TCN.
Figure 5. Qualitative results of our model on the Human3.6M dataset across 15 different actions. The local coordinate axes at each keypoint are shown with X-axis in red, Y-axis in green, and Z-axis in blue, corresponding to the columns of the rotation matrix.
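The per-joint axes drawn in Figure 5 are simply the columns of each predicted rotation matrix. A minimal sketch of how such axis segments could be generated for plotting; the helper name and the 50 mm segment length are illustrative choices, not from the paper:

```python
import numpy as np

def axis_segments(position, R, length=50.0):
    """Return the three (start, end) line segments that visualize a
    joint's local frame: column k of the rotation matrix R is the
    k-th axis direction (X = red, Y = green, Z = blue in Figure 5)."""
    position = np.asarray(position, dtype=float)
    return [(position, position + length * R[:, k]) for k in range(3)]

# Example: an identity rotation at the origin yields axis-aligned segments.
segs = axis_segments(np.zeros(3), np.eye(3))
```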
Figure 6. Comparison of performance and model size between the proposed methods and state-of-the-art approaches on the Human3.6M dataset using CPN 2D joints as input. (a) Single-frame setting: comparison of STG-Net with PoseGraphNet [19], VideoPose3D [1], and the sequence-to-sequence network [57]. (b) 243-frame setting: comparison of STAG-Net with MSTFormer [38], Multi-Hop [24], SR-Net [45], PoseFormer [51], Attention3D [14], and VideoPose3D [1]. Lower MPJPE indicates better performance, while fewer parameters correspond to a more lightweight model.
Table 7. Position and orientation performance of STG-Net on H36M. The best and second-best scores are highlighted in bold and underlined, respectively.

| Method | MPJPE (mm) | MPJAE (rad) |
|---|---|---|
| Soubarna et al. [19] (2024) | 51.7 | 0.26 |
| STG-Net (Ours, F = 1) | 50.8 | 0.23 |
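For reference, the two metrics in Table 7 are conventionally computed as follows: MPJPE is the mean Euclidean distance between predicted and ground-truth joint positions (in mm), and MPJAE is the mean geodesic angle between predicted and ground-truth rotations (in rad). A minimal NumPy sketch; the function names are ours, and the paper's protocol may include alignment steps not shown here:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per-Joint Position Error: mean Euclidean distance over
    joints (and frames), in the units of the input (here mm).
    pred, gt: arrays of shape (..., num_joints, 3)."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def mpjae(R_pred, R_gt):
    """Mean Per-Joint Angular Error: mean geodesic distance between
    rotation matrices, in radians. R_pred, R_gt: shape (N, 3, 3)."""
    # relative rotation between prediction and ground truth
    R_rel = np.matmul(np.transpose(R_pred, (0, 2, 1)), R_gt)
    # rotation angle from the trace: cos(theta) = (tr(R) - 1) / 2
    tr = np.trace(R_rel, axis1=-2, axis2=-1)
    theta = np.arccos(np.clip((tr - 1.0) / 2.0, -1.0, 1.0))
    return theta.mean()
```

The clip guards against trace values marginally outside [-1, 3] from floating-point error, which would otherwise make arccos return NaN.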
Table 9. Ablation studies on the GCN, Attention, and TCN branches.

| Model | MPJPE (mm) | MPJAE (rad) |
|---|---|---|
| TCN (F = 1) | 52.6 | – |
| GCN (F = 1) | 52.7 | 0.23 |
| STG-Net (F = 1) | 50.8 | 0.23 |
| STAG-Net (F = 1) | 51.1 | 0.23 |
| TCN (F = 27) | 51.4 | – |
| Attention (F = 27) | 47.0 | – |
| GCN+TCN (F = 27) | 49.4 | 0.22 |
| GCN+Attention (F = 27) | 47.6 | 0.21 |
| STAG-Net (F = 27) | 45.9 | 0.20 |
| TCN (F = 81) | 48.8 | – |
| Attention (F = 81) | 45.2 | – |
| GCN+TCN (F = 81) | 48.6 | 0.21 |
| GCN+Attention (F = 81) | 47.0 | 0.21 |
| STAG-Net (F = 81) | 45.1 | 0.20 |
Table 10. Ablation studies on the input dimensionality.

| Input Dimension | MPJPE (mm) | MPJAE (rad) |
|---|---|---|
| 128 (F = 27) | 47.1 | 0.21 |
| 192 (F = 27) | 45.9 | 0.20 |
| 256 (F = 27) | 48.1 | 0.22 |
| 384 (F = 27) | 45.9 | 0.20 |
| 128 (F = 81) | 46.0 | 0.21 |
| 192 (F = 81) | 45.1 | 0.20 |
| 256 (F = 81) | 45.9 | 0.20 |
| 384 (F = 81) | 45.5 | 0.19 |
| 128 (F = 243) | 43.3 | 0.19 |
| 192 (F = 243) | 41.8 | 0.19 |
| 256 (F = 243) | 43.7 | 0.20 |
| 384 (F = 243) | 42.4 | 0.19 |
Table 11. Ablation studies on the number of M-NEA Residual Blocks.

| Number of M-NEA Residual Blocks | MPJPE (mm) | MPJAE (rad) |
|---|---|---|
| 2 (F = 27) | 47.4 | 0.21 |
| 3 (F = 27) | 45.9 | 0.20 |
| 4 (F = 27) | 48.1 | 0.22 |
| 5 (F = 27) | 46.5 | 0.21 |
Table 12. Ablation studies on orientation supervision.

| Model | MPJPE (mm) |
|---|---|
| Node-Attention (F = 1) | 51.7 |
| Node-Edge-Attention (F = 1) | 51.1 |
| Node-Attention (F = 27) | 46.9 |
| Node-Edge-Attention (F = 27) | 45.9 |
| Node-Attention (F = 81) | 45.9 |
| Node-Edge-Attention (F = 81) | 45.1 |
Yang, C.; Jia, R.; Guo, Q.; Shi, X.; Hirano, M.; Yamakawa, Y. STAG-Net: A Lightweight Spatial–Temporal Attention GCN for Real-Time 6D Human Pose Estimation in Human–Robot Collaboration Scenarios. Robotics 2026, 15, 54. https://doi.org/10.3390/robotics15030054
