Graph U-Shaped Network with Mapping-Aware Local Enhancement for Single-Frame 3D Human Pose Estimation

Abstract: The development of 2D-to-3D approaches for 3D monocular single-frame human pose estimation faces challenges related to noisy input and failure to capture long-range joint correlations, leading to unreasonable predictions. To this end, we propose a straightforward but effective U-shaped network called the mapping-aware U-shaped graph convolutional network (M-UGCN) for single-frame applications. This network applies skeletal pooling/unpooling operations to expand the limited convolutional receptive field. For noisy inputs, as local nodes have direct access to the subtle discrepancies between poses, we define an additional mapping-aware local-enhancement mechanism to focus on local node interactions across multiple scales. We evaluated our proposed method on the benchmark datasets Human3.6M and MPI-INF-3DHP, and the experimental results demonstrated the robustness of the M-UGCN against noisy inputs. Notably, the average error of the proposed method was 4.1% lower than that of state-of-the-art methods adopting similar multi-scale learning approaches.


Introduction
Three-dimensional monocular human pose estimation involves the extraction of human poses from RGB images and plays a vital role in diverse domains. It assists in behavior recognition for surveillance and healthcare, captures motion in virtual or augmented environments, and finds utility in sports analysis, robotics, and biomechanics. The method proposed in this paper falls into the category of typical 2D-to-3D lifting problems [1][2][3][4][5][6][7][8][9][10][11][12][13][14][15][16][17], which decompose the task into a 2D and a 3D pose estimation stage. We focused on the second stage: lifting the 2D keypoints obtained by detectors to 3D positions in the view space. Compared with image-based approaches [19][20][21], which regress a parametric model's shape and pose parameters directly from RGB images, this reduces the actual overhead of the 3D pose estimation task and alleviates the feature domain gap. In other words, two-stage estimation offers a compromise between accuracy and computational cost.
In 2D-to-3D GCN-based pose estimation tasks, human joints are represented by nodes, and bones are regarded as edges. Bones connect joints to their associated parents, shaping a sparse topology that can be expressed as an adjacency matrix. Nodes interact according to this adjacency matrix, which can only depict the actual physical node correlations. However, when a person is moving, apparent relationships exist between non-local parts, such as legs and arms. Therefore, to link nodes that are not physically connected (shown in Figure 1a), some works [5,12,13] utilize high-order graph representations, while others [3,9] use learnable edges to turn the adjacency matrix into an affinity matrix. Although the GCN-based methods [1][2][3][4][5][6][7][8][9][10][11][12][13][14][15] successfully obtain joint correlations, they are highly dependent on 2D keypoint detectors and still suffer from unstable 2D inputs, which worsens their handling of depth ambiguity. To break through this bottleneck, most GCN-based works turn to exploiting temporal information. Works [6,8] that use short-sequence inputs still suffer from unstable 2D data because of the great pose similarity among short consecutive frames, while works [14,15] that use lengthy sequences restrict practical application. Therefore, our work focused on how to enhance the robustness against noisy inputs in single-frame cases. However, graph representations such as high-order or affinity matrices are static, which means that the model can only learn one fixed modality of the node relationships. We attempted to apply a temporal-based dynamic affinity matrix when noisy inputs are given. As shown in Figure 2a, the robustness against noise improves when the initial affinity is influenced by pre-embedded temporal information. Previous work [23] further confirmed our ideas and showed the same decreasing tendency for prediction errors as more adjacent frames are fed into the network. As illustrated in Figure 2b, the affinity visualization presents the weak relationships between non-local parts, such as legs and arms (marked as yellow rectangles). In contrast, using a single frame provides little information on which non-local nodes should be given more attention, thus limiting the receptive field for convolution.
Therefore, we designed a new strategy to cope with the invariance that frustrates non-local interaction. Although a node is allowed to collect features from multi-hop neighbors as the network becomes deeper, it has limited learning ability for non-local nodes, reducing the non-local interaction efficiency. The strategy first adopts pooling/unpooling operations to force the model to capture information between non-local nodes directly. Following the idea of multi-scale learning based on graph representation, we adopted a framework based on Graph U-Nets [24], but used skeletal pooling/unpooling [1] layers instead of gPool/gUnpool layers in order to maintain the basic body topology. The framework first samples gradual sub-structures of graph representations to enable high-level feature encoding and receptive field enlargement in the down-scaling stage, then progressively recovers the original scale in the up-scaling stage and decodes the features. The skeletal pooling/unpooling layers introduce multi-scale pose features. At lower scales, feature transformations enable interaction between non-local nodes. At higher scales, feature transformations help communication between nodes and their direct neighbors. In this process, the non-local relationship is considered, but the local node is ignored, because sub-optimal features are lost during pooling in the down-scaling stage and unpooling in the up-scaling stage. In light of these considerations, we designed a cross-scale interaction stage between the down-scaling and up-scaling stages. This stage contains several repeated mapping-aware fusion modules to achieve mapping-aware node interactions. In each module, nodes across different scales are connected through a particular type of edge, which we call directed mapping edges. As shown in Figure 1c, the multi-scale feature maps are linked by the mapping edges. These mapping edges simulate the directed mapping relationships and help to develop a novel learning-based sparse node update block, namely the mapping-aware node assignment block (NAB). The NABs pass features along the mapping edges in a node-to-node manner when updating graph nodes. In this way, the NABs allow for communication between the pooled nodes that carry non-local information and the unpooled nodes that retain detailed information from their direct neighbors, thus amplifying subtle differences between poses. The resulting features are mixed with the initial features in a channelwise manner. Subsequent experiments demonstrated that amplifying the discrepancy among nodes allows the network to learn particular node correlations under different poses, thus enabling more stable prediction results under noisy inputs.
In summary, our work makes three main contributions:
• We propose a mapping-aware U-shaped graph convolution network (M-UGCN) for 2D-to-3D human pose estimation based on Graph U-Nets [24], and apply skeletal pooling/unpooling operations [1] to perform multi-scale learning. This network, in cooperation with an existing convolutional receptive field expansion strategy (e.g., the affinity matrix), can further enhance the non-local information.
• We propose a cross-scale interaction stage in M-UGCN that bridges the up-scaling and down-scaling stages. In this stage, we establish directed mapping edges to simulate the node pooling and unpooling. Along the mapping edges, information is exchanged across multiple scales, helping the model learn detailed differences among poses.
• To the best of our knowledge, this is the first attempt to introduce directed graph representation to describe the mapping relationships among nodes across scales.
In the subsequent sections, we delve deeper into specific aspects of our research. In Section 2, we provide an overview of related works on GCN-based 3D human pose estimation. Section 3 presents the method and details the mapping-aware local enhancement. In Section 4, we conduct an ablation study to illustrate the effectiveness of our method. We also present a performance comparison with previous works and provide some visualizations. Finally, the paper ends with the conclusions and a discussion of promising directions.

GCN-Based Works
GCN-based approaches, which have a strong capacity for feature extraction in the non-Euclidean case, can be divided into spectral-based [32,33] and spatial-based approaches [34][35][36]. Spectral-based approaches utilize Chebyshev polynomials or rational polynomials to parameterize the convolutional kernel applied to the frequency domain graph signal. Our work pertains to spatial-based approaches, which instead define convolution operations on the spatial relationships between nodes and their neighbors. In addition, the graph representation itself can be divided into two types, depending on the connection mode between graph nodes.
In an undirected graph, the primary choice in most works, edges represent bidirectional connections between joints. GCN-based works using undirected graph representations face two main challenges: the weight-sharing problem and the limitation of the receptive field. Considering the first issue, Liu et al. [10] employed a weight matrix for each joint at the cost of a large number of parameters, Zhao et al. [7] extended the adjacency matrix to be channel-specific, and Zou and Tang [3] designed a modulation vector for each node used in the feature transformation matrix. As for the second issue, the convolution filter should be re-defined to take the non-direct neighbor nodes into account, for which there are two solutions. The first solution involves generating a high-order graph representation, either explicitly or implicitly. Cai et al. [8] proposed various ways to define one-hop neighbor classifications based on human semantic information. Zou et al. [5] applied higher-order polynomials of the graph adjacency matrix to the original convolution operation to obtain multi-hop convolution results and added these results for a larger receptive field. Quan and Hamza [12] also introduced a higher-order graph convolutional framework that concatenates feature representations from multi-hop neighborhoods to capture long-range dependencies between body joints. Zhou et al. [2] and Zhao and Wang [4] substituted Chebyshev polynomials for the traditional convolutional kernel to obtain an implicit higher-order structure representation. The second solution is to learn a free matrix to define node relationships in the vertex domain [3,9,13], which can reflect the relationships between long-range nodes.
Directed graph representations model the hierarchy of joints and bones. Cheng et al. [14] employed directed-graph-structured data weighted by 2D confidence scores to mitigate noise propagation, and Hu et al. [15] used conditional directed graph convolution to enhance the dependence on non-local nodes. These methods enable cooperation between directed graph convolution and temporal information. The time axis is treated as a distinct dimension, and the associated need for lengthy temporal sequences limits the practical application of such methods. However, directed graph representations allow for an effective interpretation of the directional relationships between nodes. Inspired by this, we do not use such a representation to model the human skeleton, but, instead, employ it as a solution to encode the mapping relationships between nodes across multiple scales.

Multi-Scale Learning
Many image-based tasks, such as object detection, image classification, and image enhancement, take advantage of multi-scale feature learning. As image down-sampling can expand the receptive field while reducing the model scale, multi-scale learning approaches [37][38][39][40] have become prevalent.
Multi-scale learning applied to graph-structured data is a novel idea for 3D human pose estimation. Graph-structured data can be treated as analogous to image data, but as the graph structure has stronger irregularity, traditional pooling/unpooling operations were considered unsuitable in our case. Gao and Ji [24] developed an encoder-decoder model called Graph U-Nets. The graph pooling operation that they proposed can adaptively select some nodes to form lower-scale graph-structured data, while the graph unpooling operation restores the lower-scale data to higher-scale data. For monocular 3D human pose estimation, given that the topological graphs expressing the skeleton are very sparse, Li et al. [11] proposed a new model that contains three parallel sparse-to-fine graph representations of poses to achieve multi-scale learning. Cai et al. [8] and Xu and Takano [1] re-defined the solution for graph down-sampling to maintain the basic shape of the human body. We extended the ideas of the above-mentioned researchers to design pre-defined sub-graphs of the human body and propose a local-enhancement strategy to raise the upper bound in single-frame monocular human pose estimation.

Method
Given the 2D keypoints $J_{2D} \in \mathbb{R}^{N \times 2}$ in the pixel space, the aim here was to obtain 3D joint positions $J_{3D} \in \mathbb{R}^{N \times 3}$ in the view space. To construct a model for this purpose, we propose a novel GCN-based framework with a U-shaped structure. As shown in Figure 3, the proposed mapping-aware U-shaped graph convolutional network (M-UGCN) consists of four stages: the down-scaling, cross-scale interaction, up-scaling, and output stages.

Preliminaries for GCN
The human skeletal topology can be defined as a graph $G = \{V, E\}$, where $V$ represents the joint nodes and $E$ represents the edges (bones) linking joints. The bones can be interpreted as an adjacency matrix $A \in \{0, 1\}^{N \times N}$, where $N$ denotes the number of joints. The index $(i, j)$ of an element in $A$ denotes the two ends of a bone linking nodes $i$ and $j$. If node $j$ is directly connected to node $i$, the element of $A$ at index $(i, j)$ is set to one; otherwise, it is set to zero. After adding the identity matrix, $A$ becomes $\tilde{A}$, which contains self-connection edges. Each node can be described as a $D$-dimensional feature vector, and the features of all nodes can be written as $X \in \mathbb{R}^{D \times N}$. A graph convolutional (GConv) layer [35] updates the node features according to the following equation:
$$X' = \sigma(W X \hat{A}), \quad (1)$$
where $\sigma(\cdot)$ is the activation function, $W \in \mathbb{R}^{D' \times D}$ is the learnable weight matrix for feature transformation, and $\hat{A}$ is the symmetrically normalized matrix of $\tilde{A}$. This vanilla GCN shares one weight matrix $W$ across all nodes, which limits its ability to capture diverse spatial relationships. We applied modulated graph convolution [3], which creates a free modulation vector for each node, to transform the shared weight matrix $W$. In this way, Equation (1) can be changed to:
$$X' = \sigma\big((M \odot (W X))\, \hat{A}\big), \quad (2)$$
where $M \in \mathbb{R}^{D' \times N}$ is the collection of modulation vectors $m_i$ (one per node) and the symbol $\odot$ denotes elementwise multiplication. In this work, ReLU [41] was used as the activation function for training.
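The two layer types described above can be sketched in NumPy. This is a minimal illustration, assuming the layers take the common forms X' = ReLU(W X Â) and X' = ReLU((M ⊙ W X) Â) with one modulation column per node; the 3-joint chain graph, the random weights, and all function names are hypothetical and not taken from the authors' code.

```python
import numpy as np

def normalize_adjacency(A):
    """Add self-loops, then apply symmetric normalization D^{-1/2} (A + I) D^{-1/2}."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gconv(X, W, A_hat):
    """Vanilla layer: X is (D, N), W is (D_out, D), output is (D_out, N)."""
    return np.maximum(W @ X @ A_hat, 0.0)

def modulated_gconv(X, W, M, A_hat):
    """Modulated layer: M is (D_out, N), one modulation vector per node."""
    return np.maximum((M * (W @ X)) @ A_hat, 0.0)

# Toy chain graph with 3 joints: 0 - 1 - 2 (hypothetical, not the real skeleton).
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
A_hat = normalize_adjacency(A)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))   # D = 4 input features, N = 3 nodes
W = rng.normal(size=(8, 4))   # D' = 8 output features
M = np.ones((8, 3))           # all-ones modulation reduces to the vanilla layer

out_vanilla = gconv(X, W, A_hat)
out_modulated = modulated_gconv(X, W, M, A_hat)
```

With an all-ones M the two outputs coincide, which makes explicit that the modulation vectors only rescale the shared transformation node by node.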
In addition, the initial adjacency matrix $A$ can be extended to an affinity matrix, as given in Equation (3). Specifically, a learnable mask $O \in \mathbb{R}^{N \times N}$ is added to the adjacency matrix $A$, which changes the weights of edges so that non-local nodes are linked. Although this operation helps to expand the receptive field for convolution, $O$ in each GConv layer is fixed once training has been completed, giving non-local nodes less chance to interact. Therefore, we propose a novel model based on multi-scale learning. We shortened the hop distance between non-local nodes by sampling down-scaled graph-structured nodes. Furthermore, we applied mapping-aware local enhancement in this network to remedy the negative effects of information loss and the static node correlation modality.

Network Architecture
As shown in Figure 3, we first applied a graph-embedding layer to convert 2D keypoints into a high-dimensional graph-based representation. The down-scaling stage gradually down-samples these graph data, allowing for the expansion of the receptive field, while the up-scaling stage gradually restores the low-scale graph data. The additional part between these two stages, the cross-scale interaction stage, implements cross-scale feature fusion through several stacked mapping-aware fusion modules. We focus on the module at this stage in Section 3.3. Finally, high-level features from the groups obtained in the up-scaling stage are fed into the output stage for the final decoding.

Pooling and Unpooling
In this part, we introduce the pooling and unpooling layers. The pooling layer performs a skeletal pooling operation [1], the same as the process shown in Figure 4b, allowing for a more compact physical structure. Unlike general graph pooling in Figure 4a, this operation maintains the physical connections between nodes. The original graph nodes are cut into several parts, where each is reconnected after a maximum pooling operation.
The unpooling layer performs a skeletal unpooling operation [1] to restore the pooled nodes, the same as the process shown in Figure 5b. The general graph unpooling operation in Figure 5a sets nodes discarded previously to null, while the skeletal unpooling operation duplicates the pooled nodes to the corresponding vacant location.
The skeletal pooling/unpooling operation depends on the body joints' pre-defined structure and sub-structures. The design is shown in Figure 6, which has three configurations. We divided the original $S_1$-structured nodes into six parts (Head, Body, Left/Right Arm, and Left/Right Leg). The $S_1$-structured nodes are then aggregated as new nodes and rewired to obtain the structure $S_2$ (Head, Body, Left/Right Arm, Left/Right Leg). The $S_3$-structured nodes are pooled from $S_2$. Note that the number of structure configurations determines the number of module groups stacked in the up-scaling and down-scaling stages, which means we used three module groups in each stage.
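The pooling and unpooling steps can be illustrated with a small NumPy sketch. It assumes max pooling within each body part and duplication on unpooling, as described above; the joint-to-part assignment `PART_OF_JOINT` is a hypothetical 16-joint layout chosen for illustration, not the grouping from Figure 6.

```python
import numpy as np

# Hypothetical assignment of 16 joints to 6 body parts
# (0 body, 1 head, 2 left arm, 3 right arm, 4 left leg, 5 right leg).
PART_OF_JOINT = np.array([0, 5, 5, 5, 4, 4, 4, 0, 0, 1, 2, 2, 2, 3, 3, 3])

def skeletal_pool(H, part_of_joint, n_parts):
    """Max-pool the columns of H (D, N_joints) within each body part."""
    return np.stack(
        [H[:, part_of_joint == p].max(axis=1) for p in range(n_parts)], axis=1
    )

def skeletal_unpool(H_pooled, part_of_joint):
    """Copy each part feature back to every joint that belongs to the part."""
    return H_pooled[:, part_of_joint]

rng = np.random.default_rng(0)
H1 = rng.normal(size=(8, 16))                      # S1 features with D = 8
H2 = skeletal_pool(H1, PART_OF_JOINT, 6)           # S2 features, shape (8, 6)
H1_restored = skeletal_unpool(H2, PART_OF_JOINT)   # back to shape (8, 16)
```

Unpooling restores the node count but not the per-joint detail: every joint in a part receives the same pooled vector, which is the information loss that the cross-scale interaction stage is meant to compensate.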

Down-Scaling Stage
The down-scaling stage comprises three stacked encoding module groups, with a skeletal pooling layer inserted between every two groups. It starts with the $S_1$-structured node features. The features pass through the residual graph convolutional (Res-GConv) layer and are then down-scaled into $S_2$-structured node features by a pooling layer. We repeat this process to generate $S_3$-structured node features and feed them into a normal 1 × 1 convolutional layer.
The scales of the $S_1$-, $S_2$-, and $S_3$-structured node features differ. Let these three differently scaled node features be collected as an atlas $\mathcal{H} = \{H_s \mid s \in S\}$, where $S = \{1, 2, 3\}$. For $s = 1$, $H_1 \in \mathbb{R}^{D \times 16}$ denotes the $S_1$-structured graph features. $H_2 \in \mathbb{R}^{D \times 6}$ and $H_3 \in \mathbb{R}^{D \times 3}$ are the progressive pooling results of different levels corresponding to $H_1$. The compression of graph nodes shortens the long-range distances, especially the distances between arms and legs. This provides an efficient way to aggregate information from higher-order neighbors.
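The effect of node compression on long-range distances can be checked with a breadth-first search. The sketch below is only illustrative: both edge lists are hypothetical layouts (a 16-joint tree and a 6-part star around the body node), not the exact structures of Figure 6.

```python
from collections import deque

def hops(edges, n, src, dst):
    """Breadth-first search hop distance between two nodes of an undirected graph."""
    nbrs = {i: [] for i in range(n)}
    for a, b in edges:
        nbrs[a].append(b)
        nbrs[b].append(a)
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in nbrs[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist[dst]

# Hypothetical 16-joint layout: 0 hip, 1-3 right leg, 4-6 left leg,
# 7 spine, 8 thorax, 9 head, 10-12 left arm, 13-15 right arm.
S1_EDGES = [(0, 1), (1, 2), (2, 3), (0, 4), (4, 5), (5, 6), (0, 7), (7, 8),
            (8, 9), (8, 10), (10, 11), (11, 12), (8, 13), (13, 14), (14, 15)]
# Hypothetical 6-part layout: 0 body, 1 head, 2 left arm, 3 right arm,
# 4 left leg, 5 right leg, all attached to the body node.
S2_EDGES = [(0, 1), (0, 2), (0, 3), (0, 4), (0, 5)]

wrist_to_foot_s1 = hops(S1_EDGES, 16, 15, 3)   # right wrist to right foot
arm_to_leg_s2 = hops(S2_EDGES, 6, 3, 5)        # right arm to right leg
```

Under these assumed layouts, a wrist and a foot that are eight hops apart at the full scale become two hops apart after one pooling step, so a single graph convolution at the lower scale already mixes information that would need many layers at the original scale.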

Up-Scaling Stage
The up-scaling stage uses three stacked module groups and starts from the lowest-scale feature $H_3$. The unpooling layer changes the features from the scale $s + 1$ to the new scale $s$. The unpooled features are fed into the Res-GConv layer to obtain the up-scaled features at scale $s$. If the cross-scale interaction stage is not utilized, these features are fused with $H_s$ from the down-scaling stage through skip connections. Once the features have returned to the highest scale, an additional non-local layer is added at the end of the module group. The non-local layer is utilized to recompute the attention weights for every pair of nodes when the potential node correlations have been explicitly reassigned and the current scale is the highest.
For each group in the up-scaling stage, the respective outputs are $F_3$, $F_2$, and $F_1$. The features at different levels in the decoding process can provide valuable information for the final prediction, and so $F_3$, $F_2$, and $F_1$ are all fed into the output stage.

Output Stage
The output stage is applied to process the multi-level features obtained in the up-scaling stage. We used the skeletal unpooling operation mentioned in Section 3.2.3 to expand the scale of intermediate features to the highest scale, concatenate all three features, and feed them into a squeeze-and-excitation block (SE block) [42]. The weights of each feature are re-calculated in this block. The final layer is a graph convolutional block, which is used for prediction.
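A rough NumPy sketch of the output stage is given below. Several details are assumptions made for illustration: the three feature maps are concatenated along the channel axis, the SE block uses a reduction ratio of 2, the weights are random, and both node-assignment arrays are hypothetical groupings rather than the ones in Figure 6.

```python
import numpy as np

def se_block(X, W1, W2):
    """Squeeze-and-excitation over channels; X has shape (C, N)."""
    z = X.mean(axis=1)                        # squeeze: one scalar per channel
    s = np.maximum(W1 @ z, 0.0)               # bottleneck with ReLU
    gate = 1.0 / (1.0 + np.exp(-(W2 @ s)))    # sigmoid gate in (0, 1)
    return X * gate[:, None]                  # re-weight every channel

rng = np.random.default_rng(0)
D = 4
part_of_joint = np.array([0, 5, 5, 5, 4, 4, 4, 0, 0, 1, 2, 2, 2, 3, 3, 3])  # 16 -> 6
group_of_part = np.array([0, 0, 1, 1, 2, 2])                                # 6 -> 3

F1 = rng.normal(size=(D, 16))
F2 = rng.normal(size=(D, 6))
F3 = rng.normal(size=(D, 3))

F2_up = F2[:, part_of_joint]                      # unpool 6 -> 16 nodes
F3_up = F3[:, group_of_part][:, part_of_joint]    # unpool 3 -> 6 -> 16 nodes
stacked = np.concatenate([F1, F2_up, F3_up], axis=0)   # shape (3D, 16)

W1 = rng.normal(size=(3 * D // 2, 3 * D))   # reduction ratio 2 is an assumption
W2 = rng.normal(size=(3 * D, 3 * D // 2))
reweighted = se_block(stacked, W1, W2)
```

The gate only rescales channels, so the block decides how much each of the three decoding levels contributes before the final graph convolutional block makes the prediction.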

Cross-Scale Interaction Stage
This stage lies between the down-scaling and up-scaling stages. It contains repeated mapping-aware fusion modules to target the local node interactions across adjacent scales. The number of fusion modules controls the node feature update and fusion level. In our experimental setup, a fusion level $M$ of 2 was found to be optimal. In the next section, we detail the mechanism of the mapping-aware fusion module.

Mapping-Aware Fusion Module
The naïve multi-scale learning method has been proven useful for receptive field expansion, but comes at the cost of sub-optimal feature loss. In other words, the original version improves the global level, but ignores the local individuality, which might make it difficult to capture the subtle differences among poses. In particular, in noisy cases, unstable distant neighbors transfer noise as the learning process goes deeper. We believe that these abandoned features contain the most direct and valuable semantics from their nearest neighbors, thus helping to resist noise. Therefore, in the cross-scale interaction stage, we designed mapping-aware fusion modules and channelwise cross-scale feature fusion to allow interaction between the pooled nodes that carry non-local information and the unpooled nodes that retain detailed information from their direct neighbors. To simplify the explanation, we take one mapping-aware fusion module as an example, which means that the fusion level $M$ is 1. The structure is depicted in Figure 3.
Each of the node-assignment blocks (NABs) in the mapping-aware fusion module has two input streams. In each stream, node features are organized into two types of node structures: target graph and source graph. For example, in a particular stream, the node features to be updated (which we treated as the target) are denoted by $H_2$, and the source can be one of the adjacent-scaled features ($H_1$ or $H_3$). The purpose of the NAB is to transmit information from the source to the target graph nodes where mapping edges participate. We refer to this node-to-node information interaction with explicit direction as mapping-aware interaction to achieve local enhancement.
The NAB focuses on the feature transformation across scales and updates the target graph nodes, which depends significantly on the pre-calculated mapping edges and the corresponding source graph nodes. The NAB can process two situations through the mapping edges: (i) as shown in Figure 7a, when the target graph is at a lower scale than the source, the NAB can pass the information along a mapping edge from the source feature map and re-assign the attention to the corresponding pooled node in the target feature map; (ii) as shown in Figure 7b, when the target graph is at a higher scale than the source, the NAB extracts information from the pooled node in the source feature map along a cluster of mapping edges to update the target feature map. The updated feature maps are then fed into the channelwise mixer block and fused with the initial target graph features in a channelwise manner.

Preparation for Mapping Edges
As the unpooling and pooling between lower- and higher-scale feature maps is directional, we interpreted these relationships as directed mapping edges that can be described by a 2D matrix. We first initialized a 2D zero matrix. Then, when node $i$ in a higher-scale feature map has a mapping relationship with node $j$ in a lower-scale feature map, the matrix element corresponding to the 2D index $(i, j)$ is set to 1. The mapping matrix $A_{state}$ is constructed following the above rule, where the subscript $state \in \{+, -\}$ represents the mapping direction between nodes across scales. Its elements are called mapping edges. The start and end of a mapping edge represent the node candidates that are about to interact.
Notably, $A_+$ is the transpose of $A_-$, and both are adjacency matrix representations for directed graphs. To avoid numerical instabilities, we followed the idea of [14] and normalized the directed adjacency matrix $A_{state}$ into $\tilde{A}_{state}$ (Equation (4)), where $D \in \mathbb{R}^{N \times N}$ is the degree matrix of $A_{state}$ and $N$ represents the total number of nodes in both the target- and source-scale feature maps. Next, the NAB concatenates the target graph features $H_{tar} \in \mathbb{R}^{D \times N_{tar}}$ with the source graph features $H_{src} \in \mathbb{R}^{D \times N_{src}}$ along the spatial axis to obtain $H$ (Equation (5)), where $N_{tar}$ and $N_{src}$ are the number of nodes in the target and source graphs, respectively, and $src$ denotes the adjacent scale index relative to $tar$.
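The construction of the mapping matrices and the concatenated feature map can be sketched as follows. The sketch makes several assumptions that the text does not fix: a toy pair of scales with 4 and 2 nodes, an index layout that places the higher-scale nodes first, and a row normalization of the form D^{-1} A (the exact normalization borrowed from [14] may differ). The names `A_fwd` and `A_bwd` are hypothetical and are not tied to a particular sign of the state subscript.

```python
import numpy as np

# Toy scales: higher-scale nodes 0, 1 pool into lower-scale node 0,
# higher-scale nodes 2, 3 pool into lower-scale node 1 (hypothetical).
pool_target = np.array([0, 0, 1, 1])
n_high, n_low = 4, 2
N = n_high + n_low   # both scales share one index space of N nodes

# Element (i, j) is 1 when higher-scale node i maps to lower-scale node j.
# Assumed layout: higher-scale nodes occupy indices 0..3, lower-scale 4..5.
A_fwd = np.zeros((N, N))
for i, j in enumerate(pool_target):
    A_fwd[i, n_high + j] = 1.0
A_bwd = A_fwd.T      # the opposite mapping direction is the transpose

def row_normalize(A):
    """Assumed normalization D^{-1} A; rows without edges stay zero."""
    deg = A.sum(axis=1, keepdims=True)
    return np.divide(A, deg, out=np.zeros_like(A), where=deg > 0)

A_fwd_norm = row_normalize(A_fwd)
A_bwd_norm = row_normalize(A_bwd)

# Concatenate the feature maps of both scales along the node (spatial) axis.
D = 3
H_high = np.ones((D, n_high))
H_low = np.zeros((D, n_low))
H = np.concatenate([H_high, H_low], axis=1)   # shape (D, N)
```

Each higher-scale node has exactly one outgoing mapping edge, so its row is unchanged by the normalization, while a lower-scale node with two mapped children gets weights of 0.5 in the reverse direction.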

Mapping-Aware Node-Assignment Block
The NAB utilizes $\tilde{A}_{state}$ as a filter to activate a source node, whose features will be passed from the start to the end of a directed mapping edge. We first introduced the nodewise learnable parameters $W_{ij} \in \mathbb{R}^{D \times D}$ between nodes $i$ and $j$ in a modulated manner [3] to enhance the node discrepancy. Then, we can obtain the updated node features using the mapping function $\mathcal{F}$ (Equation (6)), where $\mathcal{N}_i$ is a collection of all direct neighbors (including the self-node), $\tilde{a}^{state}_{ij}$ denotes a mapping edge between nodes $i$ and $j$, and $e$ is an element of the identity matrix. Note that the original mapping matrix $A_{state}$ is transposed depending on the target and source graph scales. If $tar < src$, $\mathcal{F}$ uses $\tilde{A}_+$ to represent nodes from the source to the target graph with the forward mapping relationships; otherwise, $\mathcal{F}$ uses $\tilde{A}_-$. To make this easier to understand, we show these two cases for the mapping function $\mathcal{F}$ in Figure 8. Finally, we processed the updated features $\mathcal{F}(H, \tilde{A}_{state})$ through a pointwise convolutional layer. The result is sliced into $H_{src\_tar} \in \mathbb{R}^{D \times N_{tar}}$, consistent with the scale of the initial target graph. The final process is given in Equation (7), where $(:, :tar)$ slices the generated feature maps into the size of the target. As the NAB only pays attention to both ends of the mapping edge, such a mapping-aware transformation enhances the local information across scales.
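The data flow through a node-assignment block can be sketched in NumPy. This is a deliberately simplified stand-in, not the mapping function itself: it uses one shared weight matrix instead of nodewise modulated parameters, averages over incoming edges as an assumed normalization, and orders the concatenated nodes with the target graph first so that slicing the first columns recovers the target. The function name `nab_update` and the toy mapping are hypothetical.

```python
import numpy as np

def nab_update(H_tar, H_src, A_map, W, W_point):
    """Simplified node-assignment step.

    H_tar: (D, N_tar) target features; H_src: (D, N_src) source features.
    A_map: (N, N) directed mapping matrix over [target nodes, source nodes];
           A_map[i, j] = 1 sends features from node i to node j.
    W, W_point: (D, D) channel transforms (a single shared W is a simplification).
    """
    n_tar = H_tar.shape[1]
    H = np.concatenate([H_tar, H_src], axis=1)            # (D, N_tar + N_src)
    A_self = A_map + np.eye(A_map.shape[0])               # keep each node's own feature
    A_norm = A_self / A_self.sum(axis=0, keepdims=True)   # average over incoming edges
    H_upd = np.maximum(W @ H @ A_norm, 0.0)               # pass features along edges
    H_upd = W_point @ H_upd                               # pointwise (1 x 1) convolution
    return H_upd[:, :n_tar]                               # keep only the target nodes

# Target: 2 pooled nodes. Source: 4 finer nodes (0, 1 -> 0 and 2, 3 -> 1).
n_tar, n_src = 2, 4
pool_target = [0, 0, 1, 1]
A_map = np.zeros((n_tar + n_src, n_tar + n_src))
for i, j in enumerate(pool_target):
    A_map[n_tar + i, j] = 1.0

H_tar = np.zeros((2, n_tar))
H_src = np.arange(8, dtype=float).reshape(2, n_src)
identity = np.eye(2)
H_src_tar = nab_update(H_tar, H_src, A_map, identity, identity)
```

With identity weights, each target node ends up with the average of its own feature and the features of the source nodes mapped to it, which shows that only the two ends of each mapping edge take part in the update.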

Channelwise Mixer Block
After the NABs update node features, we grouped the updated target and initial target node features into a set $\mathcal{H}_{tar} = \{H_{tar}, H_{src\_tar}\}$. Each element in the set $\mathcal{H}_{tar}$ is written as $E \in \mathbb{R}^{D \times N_{tar}}$. In the Mixer block, channelwise summation for fusion is applied to obtain the new fused features, after which $H_{tar}$ is overwritten as the input for the next mapping-aware fusion module. Specifically, we first concatenated the features in $\mathcal{H}_{tar}$ and applied global average pooling to obtain a channelwise item $b \in \mathbb{R}^{1 \times ND}$, where $N$ is the number of elements in $\mathcal{H}_{tar}$, equal to either 2 or 3. Then, we used $b$ to calculate the learnable blending vectors $\beta$ (Equation (8)), in which $k$ is an adaptable kernel size, as mentioned in [43]. Next, we reshaped $\beta$ from $1 \times ND$ to $N \times D$ and fused the features to overwrite $H_{tar}$ using:
$$H_{tar} = \sum_{n=1}^{N} \beta_n \odot E_n, \quad (9)$$
where $\beta_n \in \mathbb{R}^{D}$ is the $n$th learnable blending vector of $\beta$ and $E_n$ is the $n$th element in $\mathcal{H}_{tar}$. The symbol $\odot$ denotes elementwise multiplication with broadcasting.
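The channelwise mixing can be sketched as below. The sketch assumes that the blending vectors come from a 1D convolution of odd kernel size k over the pooled descriptor, followed by a softmax across the fused elements; the exact normalization used by the authors is not stated here, and the function name and kernels are hypothetical.

```python
import numpy as np

def channelwise_mix(elements, kernel):
    """Fuse a list of (D, N_tar) feature maps with per-channel blending weights."""
    n, d = len(elements), elements[0].shape[0]
    # Global average pooling over nodes, concatenated over elements: shape (n * d,).
    b = np.concatenate([E.mean(axis=1) for E in elements])
    k = kernel.shape[0]                      # odd kernel size assumed
    b_pad = np.pad(b, k // 2)                # zero padding keeps the length
    logits = np.array([b_pad[i:i + k] @ kernel for i in range(n * d)])
    logits = logits.reshape(n, d)            # one row of D weights per element
    beta = np.exp(logits - logits.max(axis=0))
    beta /= beta.sum(axis=0)                 # softmax across the n elements (assumed)
    fused = sum(beta[i][:, None] * elements[i] for i in range(n))
    return fused, beta

rng = np.random.default_rng(0)
E = rng.normal(size=(4, 6))

# With a zero kernel every element gets the same weight, so fusing two
# copies of E returns E itself.
fused_same, beta_same = channelwise_mix([E, E], np.zeros(3))

# With three different elements the weights still sum to one per channel.
fused_3, beta_3 = channelwise_mix(
    [E, rng.normal(size=(4, 6)), rng.normal(size=(4, 6))], rng.normal(size=3)
)
```

Each blending vector is broadcast over the node axis, so the weighting differs per channel but is shared by all nodes of an element.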
Compared with direct element addition, this solution fuses elements in a channelwise style, which was found to enhance the performance under our experimental settings.

Training
Most skeleton-based models aim to minimize the errors between the predicted poses $J_{3D} = \{j_i \mid i = 1, \ldots, N\}$ and the ground truth poses $\tilde{J}_{3D} = \{\tilde{j}_i \mid i = 1, \ldots, N\}$ (Equation (10)). We used a weighted combination of the mean-squared error (MSE) and the mean absolute error (MAE) as the loss function, following previous works [1,3,8].
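A weighted MSE and MAE loss of this kind can be written in a few lines. The weights `w_mse` and `w_mae` below are hypothetical placeholders, since the actual weighting is not given in this section.

```python
import numpy as np

def pose_loss(pred, target, w_mse=1.0, w_mae=1.0):
    """Weighted sum of mean-squared and mean absolute error over all coordinates."""
    diff = pred - target
    return w_mse * np.mean(diff ** 2) + w_mae * np.mean(np.abs(diff))

pred = np.zeros((16, 3))          # predicted root-relative joints
target = np.full((16, 3), 2.0)    # ground truth joints
loss_default = pose_loss(pred, target)                     # 4.0 + 2.0
loss_mse_only = pose_loss(pred, target, w_mse=0.5, w_mae=0.0)
```

The MSE term penalizes large deviations more strongly, while the MAE term keeps a constant gradient for small errors, which is a common motivation for combining the two.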
Before training, for $J_{2D} \in \mathbb{R}^{N \times 2}$, we normalized the keypoints in the pixel space to the range $[-1, 1]$ and retained the image scale for every skeleton. As some prior visual knowledge exists in the global translation, we did not remove it. For the paired 3D ground truth $\tilde{J}_{3D}$, we transformed the original positions from the world space to the view space and expressed every joint relative to the root. The training was then supervised with respect to $\tilde{J}_{3D} \in \mathbb{R}^{N \times 3}$ for the root-relative predictions $J_{3D} \in \mathbb{R}^{N \times 3}$ in the view space.
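The two pre-processing steps can be sketched as follows. The 2D normalization shown here is one common convention that maps the horizontal axis to [-1, 1] and scales the vertical axis by the same factor to keep the aspect ratio; whether the authors use exactly this form is an assumption, as is the choice of joint 0 as the root.

```python
import numpy as np

def normalize_screen_coordinates(kpts_2d, width, height):
    """Map x to [-1, 1] and scale y by the same factor to keep the aspect ratio."""
    return kpts_2d / width * 2.0 - np.array([1.0, height / width])

def to_root_relative(joints_3d, root=0):
    """Express every joint relative to the root joint (assumed to be index 0)."""
    return joints_3d - joints_3d[root:root + 1]

corners = np.array([[0.0, 0.0], [1000.0, 1000.0]])
corners_norm = normalize_screen_coordinates(corners, 1000, 1000)

center = np.array([[500.0, 250.0]])
center_norm = normalize_screen_coordinates(center, 1000, 500)

joints = np.array([[1.0, 2.0, 3.0], [2.0, 2.0, 5.0]])
joints_rel = to_root_relative(joints)
```

Note that the global translation of the 2D keypoints is kept, in line with the text, while the 3D targets are made root-relative.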

Experiments
In this section, we first introduce the experimental settings, including the datasets used, evaluation metrics, and implementation details, in Sections 4.1 and 4.2. The ablation study and performance comparison are detailed in Sections 4.3 and 4.4, respectively. Finally, some visualization results are provided in Section 4.5.

Datasets and Evaluation
We conducted evaluation on two datasets: (a) Human3.6M (H36M) [25] is one of the most widely used paired single-human pose datasets captured under laboratory conditions, with 3.6 million frames. Four cameras were simultaneously used to capture daily human postures from different perspectives. The motion capture system allowed for precise 3D joint positions to be obtained simultaneously. Following previous work [1][2][3][4][5], the data for Subjects S1, S2, S3, S5, and S7 were defined as the training dataset, while the data of Subjects S9 and S11 were used for further testing. Before training, we flipped the paired data to expand the coverage of the dataset. (b) MPI-INF-3DHP (MPII) [26] is a paired dataset with 1.3 million frames obtained from a multi-view markerless motion capture system in various scenario settings, including indoor scenes with green screens, indoor scenes without green screens, and outdoor scenes. We only used its validation set to test the generalizability of the proposed method. We followed the method of PoseAug [44] to complete data pre-processing. The redirection of the skeleton format followed the suggestion in [26].
Following previous works [1][2][3][4][5], we utilized two evaluation metrics: (a) MPJPE/PA-MPJPE, where MPJPE denotes the mean Euclidean distance between the ground truth and estimated 3D joints after aligning their roots, and PA-MPJPE is an additional metric that first employs a rigid Procrustes [45] alignment to remove the influence of rotation and scale and then calculates the error. We only used MPJPE, as it is stricter and more representative. (b) PCK/AUC, where the percentage of correct keypoints (PCK) is a metric typically used in 2D human pose estimation that can be extended to a 3D version, interpreted as the proportion of joints within a certain error threshold (typically set at 150 mm) relative to the total number of joints. The AUC score is interpreted as the area under the curve of the PCK. We used both of these metrics to validate the generalizability of the proposed model.
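The three metrics can be sketched in NumPy as below. The root joint index, the inclusive threshold comparison, and the AUC threshold grid (0 to 150 mm in 5 mm steps) are assumptions for illustration; evaluation code for a specific benchmark may differ in these details.

```python
import numpy as np

def joint_errors(pred, gt, root=0):
    """Per-joint Euclidean error after aligning the root joints; inputs are (N, 3)."""
    pred = pred - pred[root:root + 1]
    gt = gt - gt[root:root + 1]
    return np.linalg.norm(pred - gt, axis=1)

def mpjpe(pred, gt):
    """Mean per-joint position error in the units of the inputs."""
    return joint_errors(pred, gt).mean()

def pck(pred, gt, threshold=150.0):
    """Fraction of joints whose error is within the threshold."""
    return (joint_errors(pred, gt) <= threshold).mean()

def auc(pred, gt, thresholds=np.linspace(0.0, 150.0, 31)):
    """Average PCK over a grid of thresholds (assumed 0 to 150 mm, step 5 mm)."""
    return np.mean([pck(pred, gt, t) for t in thresholds])

gt = np.zeros((4, 3))
pred = gt.copy()
pred[1, 2] = 100.0   # joint 1 is off by 100 mm
pred[2, 2] = 200.0   # joint 2 is off by 200 mm

error_mm = mpjpe(pred, gt)
pck_150 = pck(pred, gt)
auc_score = auc(pred, gt)
```

In this toy case the mean error is 75 mm and three of the four joints fall within 150 mm, so the PCK is 0.75, while the AUC is lower because it also averages over the stricter thresholds.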

Implementation Details
Following previous works [1,3,4,5,11,12], the 2D detection results were obtained using the cascaded pyramid network (CPN) [31] pre-trained on the COCO [46] dataset and fine-tuned on Human3.6M [25]. All experiments were implemented in PyTorch with an AMD Ryzen 7 5800X 8-Core Processor and a single NVIDIA RTX 3090 GPU in Ubuntu 20.04 and optimized using Adam [47]. We applied a modulated graph convolutional layer [3] as our GConv block, and the weight initialization was the same as described in [3]. The GConv layer also employed the self-connection decoupling strategy to retain the self-loop information of nodes.
We trained our model in an end-to-end manner. The batch size was set to 256, and the feature dimension was set to 128. The learning rate was initially 0.002, and the learning rate decay value was 0.96, with the rate decayed every 4 epochs. Dropout [48] with a probability of 0.2 was applied to each GConv layer, in order to avoid over-fitting. The cross-scale fusion level $M$ was set to 2.
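The learning rate schedule described above can be written as a small function. The sketch assumes a step-wise decay applied at the start of every fourth epoch, counting epochs from zero; the exact epoch at which the authors apply the decay is not stated.

```python
def learning_rate(epoch, base_lr=0.002, decay=0.96, step=4):
    """Step-wise exponential decay: multiply by `decay` once every `step` epochs."""
    return base_lr * decay ** (epoch // step)

lr_start = learning_rate(0)    # 0.002
lr_epoch3 = learning_rate(3)   # still 0.002, no decay yet
lr_epoch4 = learning_rate(4)   # 0.002 * 0.96
lr_epoch8 = learning_rate(8)   # 0.002 * 0.96 ** 2
```

In PyTorch, the same behavior would typically be obtained with a step scheduler attached to the Adam optimizer.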

Ablation Study
We conducted an ablation study on the Human3.6M [25] validation set, employing MPJPE as the evaluation metric. Two basic configurations were considered. One was the Simple-UGCN (S-UGCN), which removes the cross-scale interaction stage but retains the U-shaped structure, and the other was the proposed M-UGCN. We chose the ground-truth keypoints as input in order to exclude disturbances derived from noise.

Cross-Scale Fusion Level
Multi-scale node features pass through the cross-scale interaction stage, where stacked mapping-aware fusion modules enhance the local information. We denote the number of stacked modules as the fusion level. To examine the impact of the number of mapping-aware fusion modules, we varied the fusion level from 0 to 3. The obtained mean errors are shown in Figure 9. With a fusion level of 2, the joint position errors decreased sharply compared to Levels 1 and 3. Level 0 (the same configuration as the Simple-UGCN) gave the closest results to Level 2. However, Fusion Level 0 (i.e., without local enhancement) is more likely to encounter bottlenecks under noisy inputs, as discussed in Section 4.3.

Components of the Mapping-Aware Fusion Module
This module achieves local information exchange from one side of a mapping edge to the other and contains two key components. The NAB passes information from the source graph nodes to the target graph nodes along the mapping edges, while the channel-wise Mixer block mixes the updated target graph nodes with the initial nodes in a learnable manner. To study the importance of both components in the mapping-aware fusion module, we designed four variants: (i) Strategy 1 is the Simple-UGCN, which removes the whole cross-scale interaction stage; (ii) Strategy 2 performs feature fusion with fixed blending factors at a weighting ratio of 1:3, taking the place of the channel-wise Mixer block; (iii) Strategy 3 adopts two spatial linear layers [49] with a hidden size of 512 to update the target graph node features, rather than the feature transformation and mapping function in Equation (6); (iv) Strategy 4 is the M-UGCN containing both components. All variants were constructed under Cross-Fusion Level 2, and the results are given in Table 1.
Table 1. Ablation study considering the components of the cross-scale interaction stage. In addition to the S-UGCN and M-UGCN, two other variants are added for comparison. The best are in bold.

We can see that the NAB and Mixer block cooperated reasonably. When either of them was absent, the errors obviously increased. The results obtained under Strategy 2 demonstrated the importance of the learnable fusion operations from the opposite perspective. Strategy 3, with spatial linear layers, caused every output node feature to depend on all other nodes, going against our mapping-aware local enhancement for explicit node-to-node exchange. This led to the poorest result.
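To make the node-to-node exchange along mapping edges concrete, the following is a minimal sketch of how directed edges can carry features between two scales. It is not the NAB itself: the feature transformation and the mapping function of Equation (6) are omitted, averaging is assumed for the pooling direction, copying is assumed for the unpooling direction, and the joint grouping is hypothetical.

```python
def pooling_edges(groups, n_fine):
    """Directed edges fine -> coarse: row t averages the fine nodes pooled into coarse node t."""
    matrix = []
    for members in groups:
        row = [0.0] * n_fine
        for s in members:
            row[s] = 1.0 / len(members)
        matrix.append(row)
    return matrix

def unpooling_edges(groups, n_fine):
    """Directed edges coarse -> fine: each fine node copies from its single coarse parent."""
    matrix = [[0.0] * len(groups) for _ in range(n_fine)]
    for t, members in enumerate(groups):
        for s in members:
            matrix[s][t] = 1.0
    return matrix

def propagate(edges, source_feats):
    """Aggregate source node features (n_source x C) into target nodes along the edges."""
    channels = len(source_feats[0])
    return [[sum(w * f[c] for w, f in zip(row, source_feats)) for c in range(channels)]
            for row in edges]

# Hypothetical grouping: 5 fine joints pooled into 2 coarse parts.
groups = [[0, 1], [2, 3, 4]]
fine = [[1.0], [3.0], [2.0], [4.0], [6.0]]
coarse = propagate(pooling_edges(groups, 5), fine)        # fine -> coarse
restored = propagate(unpooling_edges(groups, 5), coarse)  # coarse -> fine
```

Because each target node only receives features from the nodes it is explicitly mapped to, the update stays local, in contrast to a spatial linear layer that mixes all nodes.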

Mixer Type
The Mixer block fuses features in a learnable way. We compared two types of Mixer. The first, discussed in Section 3.3, learns the weights of each node feature map, and its final error was 34.51 mm. The second involved the common weighted summation of multiple features, with which the error was 1.02 mm higher. The results indicate that the channel-wise Mixer block has more flexibility to rank the importance of features, thus boosting the final performance.

Mapping Edge Setting
The mapping edges, represented by A_state, establish node-to-node interaction paths across different scales, playing an essential role in mapping-aware local enhancement. The participation of the mapping edges helps the NABs focus on nodes across scales that have explicit mapping relationships, generated by the pooling and unpooling processes. We compared four different settings to verify the effectiveness of the mapping edges: (i) CNN-like methods, in which standard convolutional blocks replace the whole NAB for node updating; (ii) free edges, where the weights of the mapping edges are learned freely from the training data to generate A_state; (iii) undirected edges, where the mapping relationships are considered to be bidirectional and A_state is symmetrically normalized before being fed into the NAB; (iv) directed edges, where the proposed mapping edges are interpreted as unidirectional links to simulate the pooling and unpooling processes. Figure 10 indicates that the CNN-like method cannot adapt to non-Euclidean data, for which GCN-based methods are suitable. The generation of free mapping edges was also not suited to our objective of capturing the node-to-node mapping relationships. In contrast, both undirected and directed edges succeeded in encouraging information exchange between local nodes across adjacent scales. However, the mapping direction offered further cues to inform the NAB about the state of the input feature (i.e., whether it was up-scaled or down-scaled from the source graph data).

Using the MPJPE metric, we compared the Simple-UGCN and M-UGCN with previous works on Human3.6M. We used two types of input: the 2D keypoints detected by the CPN [31] and the ground truth. Note that some works have applied a post-refinement module [8] to process the final outputs when considering noisy detections. The x and y components of the outputs can be derived in two ways. One involves obtaining predictions directly in a regressive manner, while the other involves using the projection model to transform the x and y components from the pixel space to the view space. The post-refinement module mixes the results of these two solutions for more stable predictions.
For the sake of fairness, we made respective comparisons depending on whether the post-refinement module was used or not, as the camera parameters corresponding to the images are not always available in practical application scenarios.
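The projection-based route mentioned above depends on the camera intrinsics, which is why it is unavailable when the camera parameters are unknown. A minimal sketch of the idea follows, assuming an ideal pinhole camera without lens distortion and a depth value z taken from the network's prediction; the function and parameter names are illustrative and not taken from [8].

```python
def pixel_to_view(u, v, z, fx, fy, cx, cy):
    """Back-project a pixel (u, v) with depth z into view-space x and y (pinhole model)."""
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return x, y

# Hypothetical intrinsics: focal length 1000 px, principal point (500, 500); depth in mm.
x, y = pixel_to_view(u=600.0, v=400.0, z=2000.0, fx=1000.0, fy=1000.0, cx=500.0, cy=500.0)
```

The focal lengths (fx, fy) and principal point (cx, cy) must come from camera calibration, so this route cannot be used on images for which those values were never recorded.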
The results in Table 2 suggest that the M-UGCN surpassed all other methods when considering noisy detections, even when the post-refinement module was excluded. Moreover, the use of mapping-aware fusion modules allowed the Simple-UGCN to overcome its bottleneck, especially for challenging but relatively static poses such as sitting down. This means that the optimized M-UGCN had a stronger capability to balance short- and long-range joint relationships. Table 3 shows the significant performance enhancement when all the methods reached their upper bounds for precise inputs. Compared to the method in [1] (with 3.70 M parameters), which also takes advantage of multi-scale feature learning, the Simple-UGCN reduced the average error by 1.6 mm with only 0.87 M parameters, while the M-UGCN reduced the error by 2.3 mm with 1.25 M parameters. Relative to the second-place method [54] (9.49 M parameters), our two models used 90.8% and 86.8% fewer parameters, respectively.
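The parameter-reduction percentages quoted above can be reproduced from the stated parameter counts. The quick check below assumes that the reduction is measured against the 9.49 M parameters of [54], which is the only baseline that yields the reported 90.8% and 86.8%; the same formula applied to the 3.70 M parameters of [1] is shown for comparison.

```python
def reduction_pct(ours_m, baseline_m):
    """Percentage reduction in parameter count relative to a baseline (both in millions)."""
    return (1.0 - ours_m / baseline_m) * 100.0

for name, params in (("Simple-UGCN", 0.87), ("M-UGCN", 1.25)):
    print(name,
          f"vs [54]: {reduction_pct(params, 9.49):.1f}%",
          f"vs [1]: {reduction_pct(params, 3.70):.1f}%")
```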

Robustness against Noise
In practical applications, data inputs are typically noisy and unstable. To highlight the capacity for noise resistance, we trained our models and the other state-of-the-art models using the ground-truth data with various levels of Gaussian noise added. Figure 11 clearly shows that the accuracy of the 2D detection directly affected the performance of the 2D-to-3D pose estimation tasks. Note that the works [3,5,9,11,12,13] adopted the traditional receptive field expansion idea for GCN-based methods, while our method utilizes multi-scale learning. Moreover, the M-UGCN learned the subtle differences between poses through the mapping-aware local enhancement, thus keeping the prediction errors at a much lower level.
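The noise-injection step can be sketched as follows, assuming independent zero-mean Gaussian noise with standard deviation σ added to each 2D coordinate. The unit of σ and the helper name are assumptions, since the text does not specify them.

```python
import random

def add_gaussian_noise(keypoints_2d, sigma, rng=None):
    """Return a copy of the 2D keypoints with zero-mean Gaussian noise on each coordinate."""
    rng = rng or random.Random()
    return [(u + rng.gauss(0.0, sigma), v + rng.gauss(0.0, sigma))
            for u, v in keypoints_2d]

# Perturb a toy ground-truth pose at several noise levels with a fixed seed.
clean = [(100.0, 200.0), (150.0, 220.0), (130.0, 300.0)]
noisy_by_sigma = {s: add_gaussian_noise(clean, s, random.Random(0)) for s in (0.0, 5.0, 10.0)}
```

Passing a seeded `random.Random` instance keeps the perturbation reproducible, which helps when the same noisy inputs must be fed to several models.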

Model Scale
In Table 4, we report the scales of our two model configurations in order to show that our solutions achieved competitive results while keeping the number of parameters as small as possible. Note that all methods considered in the comparison were under their full parameter configurations and used the ground truth as the input. The most robust and effective method was that of Zhao and Wang [4], as it had the lowest prediction errors and a relatively small model scale. However, the M-UGCN showed a stronger ability to deal with noisy inputs, as indicated in Table 2 and Figure 11.
We also compared the computational efficiency and inference time with the state-of-the-art methods in Table 4. With the cross-scale interaction stage, the MPJPE of the M-UGCN was 33.5 mm, but its FLOPs and inference time rose because several parallel branches run in the model. This is acceptable, because such a small difference in latency is barely perceptible to the human eye. When turning to a more lightweight configuration, the Simple-UGCN was more efficient in terms of both computation and memory.
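The caption of Table 4 describes the inference time as the average time per sample, measured with a batch size of 256 after warm-up. A minimal timing sketch under that description is given below. The numbers of warm-up and timed runs are assumptions, and `model_fn` is a placeholder for any callable that processes a batch.

```python
import time

def per_sample_inference_time(model_fn, batch, warmup=10, runs=50):
    """Average seconds per sample over timed runs, after untimed warm-up runs."""
    for _ in range(warmup):
        model_fn(batch)  # warm-up calls are excluded from the measurement
    start = time.perf_counter()
    for _ in range(runs):
        model_fn(batch)
    elapsed = time.perf_counter() - start
    return elapsed / (runs * len(batch))

# Toy stand-in for a model: sums the coordinates of each sample in a batch of 256.
batch = [[0.5] * 34 for _ in range(256)]
seconds = per_sample_inference_time(lambda b: [sum(s) for s in b], batch)
```

When timing a GPU model, the device should be synchronized before reading the clock (for example with `torch.cuda.synchronize()` in PyTorch), since kernel launches are asynchronous.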

Generalization Validation
We applied the proposed model trained on the Human3.6M [25] dataset to the MPI-INF-3DHP [26] validation set in order to validate the generalizability of our method. Compared to the Human3.6M dataset, the validation set of 3DHP contains more diverse motions and unseen outdoor environments. Although we used only the Human3.6M [25] dataset comprising laboratory scenarios for training, the results provided in Table 5 suggest that our method generalized well to unseen in-the-wild data.

Qualitative Results
Figure 12 shows some visualization results obtained using the M-UGCN on the Human3.6M dataset. The discrepancy between the predictions and ground truth can be seen to be negligible. Our method can reason properly for some challenging poses. We believe that the effectiveness of the skeleton-based method will benefit further applications, such as pose-driven digital character animation and action recognition.
Figure 13 shows qualitative comparisons between our method and a baseline method on challenging poses in the Human3.6M [25] dataset. The 2D keypoints detected by the CPN [31] were used as the input. In order to demonstrate robustness in noisy cases, we selected images with actors not facing the camera and with severe self-occlusion. Our method obtained more stable predictions when faced with challenging poses.
Figure 13. Qualitative comparisons between our proposed method and a baseline method [3] on challenging poses in the Human3.6M [25] dataset. The 2D keypoints detected by the CPN [31] were used as the input samples, and the brightness of points represents the scale of the error, amplifying the visual discrepancy from the ground truth.
We further compared our method with the baseline [3] on in-the-wild images with unseen poses, as illustrated in Figure 14. The same 2D pose detector, the CPN [31], was applied for fairness. Thanks to the local enhancement strategy in the cross-scale interaction stage, our method showed improved stability of local body parts and generated more plausible limbs for the poses.

Conclusions
We presented a lightweight 2D-to-3D method, the M-UGCN, for monocular 3D human pose estimation that reduced the number of parameters and reasoned more believable poses with only one frame. Skeletal pooling and unpooling operations were introduced to U-shaped nets to exploit global features. The mapping-aware interaction was able to capture subtle discrepancies in local joint correlation. As far as we know, our method is the first attempt to apply directional mapping relationships described as directed graphs to multi-scale feature fusion in the sparse graph case. We built mapping edges across feature scales to simulate the pooling and unpooling process of graph-structured nodes, thus contributing to precise information complementarity and exchange. We conducted ablation experiments to validate the effectiveness of the mapping-aware local enhancement.
Compared to temporal-based methods, the M-UGCN may show weaknesses in its ability to cope with challenging poses, but it still surpassed some methods that use short sequences of poses (shown in Table 6) and most of the SOTA single-frame methods (shown in Table 2). Despite the M-UGCN achieving promising performance, as a 2D-to-3D approach, it is highly dependent on 2D keypoint detectors. The M-UGCN may not be able to predict reasonable poses from poor 2D keypoint inputs. There are multiple ways to enhance the M-UGCN on poor single-frame inputs. We believe that, with the incorporation of pre-refinement [57] on 2D inputs or post-refinement [8] on 3D outputs, the performance of the M-UGCN on noisy inputs can be improved further. In addition, generating and sampling multiple 3D pose candidates [58] is another way to avoid unreasonable predictions.

Methods marked by * employed the post-refinement module [8]. ↓ means lower is better.
In the future, considering the restriction of lengthy sequences in practical applications, we aim to improve the accuracy and decrease the computation in order to extend our method to pose-based character driving or monocular motion capture. As an intermediate component, our model provides a reliable basis for further accuracy improvements and could be key to converting 3D poses into 3D human meshes.

Figure 1. Graph-structured node interaction: (a) a node links to another non-direct neighbor in a feature map; (b) a down-sampled node interacts with its direct neighbor in a feature map; (c) a node interacts across two differently scaled feature maps through directed mapping edges. These edges represent the node pooling (red arrows) or unpooling (green arrows) direction, guiding cross-scale information exchange.

Figure 2. A sequential residual architecture composed of graph convolutional blocks [3] is considered as a baseline for the comparison of static and dynamic affinity: (a) The diagram indicates that the mean per-joint position error (MPJPE) decreases significantly with the dynamic affinity matrix inferred through the use of a longer temporal sequence. Our method, while not using temporal information, alleviates the limitations associated with static affinity. All methods were tested on the Human3.6M dataset [25] with noisy inputs. (b) Static affinity visualization showing the weak relationships between non-local parts such as legs and arms (marked as yellow rectangles).

Figure 3. Overview of the proposed M-UGCN framework. The structure is roughly divided into four stages: the down-scaling, cross-scale interaction, up-scaling, and output stages. The cross-scale interaction stage, with M-repeated mapping-aware fusion modules, acts as a bridge between the down-scaling and up-scaling stages. The node-assignment blocks (NABs) and mixer blocks (Mixers) are critical components of this stage. The output stage performs the final multi-level feature fusion and outputs the predictions.

Figure 4. Illustrations of the pooling operation, where L denotes the level and N is the number of nodes in the current graph structure: (a) graph pooling [24]; (b) skeletal pooling [1].

Figure 5. Illustrations of the unpooling operation, where L denotes the level and N is the number of nodes in the current graph structure: (a) graph unpooling [24]; (b) skeletal unpooling [1].

Figure 6. Description of the structure of the body joints and its sub-structures. We designed these pre-defined structures to retain the physical human topology.

Figure 7. Illustration of mapping-aware interaction, where L denotes the level and N is the number of nodes in the current graph structure: the directed (a) or inverted directed (b) mapping edges (green arrows) from the source feature map (colored circles) to the target feature map (gray circles) guide cross-scale interaction in a node-to-node manner through the NAB.Then, the target graph nodes are updated.

Figure 8. Illustrations of the mapping function F in two cases, that is non-reversed (top) and reversed (bottom) mapping relationships between multi-scaled feature maps.The update of Node 1 is taken as an example.

Figure 9. Fusion level ablation study at the cross-scale interaction stage.

Figure 10. Ablation study of the mapping edges. The directed mapping edges are the most suitable representation of the mapping relationships across adjacent scales.

Figure 11. Noise tolerance comparisons with the GraFormer [2] and MGCN [3] models. Here, we applied Gaussian noise with zero mean and various values of the standard deviation σ to the 2D ground-truth data.

Figure 12. Qualitative results of our method on the Human3.6M [25] dataset. The discrepancy between the predictions and ground truth is negligible.

Figure 14. Qualitative comparisons on challenging in-the-wild images between our method and a baseline method [3]. The last row shows that our method can reason more believable poses in the keypoint-missing case.

Table 2. Quantitative evaluation results on Human3.6M under MPJPE. 2D keypoints detected by the CPN [31] were used as the input. The table is categorized into two groups. The bottom group methods employ the post-refinement module [8], while the top group methods do not. The best are in bold.

Table 3. Quantitative evaluation results on Human3.6M under MPJPE. Ground-truth keypoints were used as the input. The best results are indicated in bold.

Table 4. Model scale comparisons among methods under their full configurations. Inf_time is the abbreviation of inference time. We computed the inference time as the average time required to infer a sample with a batch size of 256 after the model is warmed up.

Table 6. Comparisons between the proposed method and previous temporal-based methods in terms of the number of parameters, the length of the pose sequence, and MPJPE.