Skeleton-Based Spatio-Temporal U-Network for 3D Human Pose Estimation in Video

Despite the great progress in 3D pose estimation from videos, there is still a lack of effective means to extract spatio-temporal features of different granularities from complex dynamic skeleton sequences. To tackle this problem, we propose a novel, skeleton-based spatio-temporal U-Net (STUNet) scheme to deal with spatio-temporal features at multiple scales for 3D human pose estimation in video. The proposed STUNet architecture consists of a cascade structure of semantic graph convolution layers and structural temporal dilated convolution layers, progressively extracting and fusing the spatio-temporal semantic features from fine-grained to coarse-grained. This U-shaped network achieves scale compression and feature squeezing by downscaling and upscaling, while abstracting multi-resolution spatio-temporal dependencies through skip connections. Experiments demonstrate that our model effectively captures comprehensive spatio-temporal features at multiple scales and achieves substantial improvements over mainstream methods on real-world datasets.


Introduction
Recently, 2D keypoint detection and 3D pose estimation have received increasing attention [1–8]. The difficulty with 3D pose estimation is that multiple 3D poses can map to the same 2D keypoints. Some studies address this ambiguity by modeling temporal information using recurrent neural networks (RNNs) [9,10] or graph convolutional networks (GCNs) [11–13]. These studies simply connect temporal-dimension features to form a spatio-temporal grid of keypoints for subsequent feature extraction and pose estimation. However, we observe two main problems with these existing methods:
• previous methods have difficulty handling complex, long-sequence action features;
• existing approaches neglect the compression and fusion of features in both the temporal and spatial dimensions for human pose estimation.
First, previous RNN-based and graph-based methods have difficulty handling complex, long-sequence action features. A typical approach is to connect keypoints into spatio-temporal sequences based on the skeletal structure and use a recurrent neural network [9] or graph convolutional network [11] for pose estimation. Recent studies have shown that temporal convolutional networks outperform traditional RNNs and other methods in modeling temporal information, for example in machine translation [14], language modeling [15], speech generation [16], and speech recognition [17]. Therefore, we employ temporal convolutions to capture long-term pose information for 3D pose estimation tasks.
Second, existing methods neglect the compression and fusion of features in both the temporal and spatial dimensions for human pose estimation.

Figure 1. Illustration of the proposed skeleton-based spatio-temporal U-Net. In the semantic pooling stage, spatio-temporal semantic features are gradually compressed and fused into different granularities. In the semantic upsampling stage, spatio-temporal features are decoded, and multi-resolution spatio-temporal dependencies are abstracted by the skip connections of the U-Net structure.
To the best of our knowledge, we are the first to leverage temporal convolution and graph convolution together to deal with spatio-temporal features of different granularities. Our main contributions are summarized as follows:
• we present a novel spatio-temporal U-Net architecture with a cascade structure of temporal convolution layers and semantic graph convolution layers to gradually integrate the semantic features of local time and space;
• the proposed structural temporal dilated convolution layer fuses long-time keypoint sequences in the temporal dimension to eliminate the jitter and blur of single-frame 3D pose estimation;
• the proposed semantic graph convolution layer fuses the semantic features of the human body in the spatial dimension with novel graph convolution, pooling, and unpooling layers.

3D Human Pose Estimation
The 3D human pose estimation task aims to infer 3D body keypoints from a single image. Prior to the success of deep learning, most work [19–22] used feature engineering and models of bone and joint mobility to estimate the 3D pose. Subsequently, convolutional neural network (CNN) methods were used for end-to-end 3D pose reconstruction [1,23–25]. Unlike previous model-based methods, they estimate the 3D pose directly from RGB images without intermediate supervision.
Two-step 3D pose estimation. 3D pose estimation is usually built on top of a 2D pose estimator: 2D pose estimation first predicts 2D joint positions in image space, which are then lifted to 3D [1–3,9,26,27]. Some work [3] shows that predicting 3D pose is relatively straightforward given real 2D keypoints, and that the quality of 2D keypoint estimates has a large impact on the final result. Some methods [1,2,28] use both image features and 2D keypoints for 3D pose estimation. Recent work [29] predicts 3D pose by predicting the depth of keypoints. There are also methods [30] for 3D pose estimation using prior knowledge about bone length and projection consistency. Some recent studies [31,32] apply transformer networks to human pose estimation tasks. A differentiable epipolar transformer network in a synchronized and calibrated multi-view setup was proposed [31], enabling the 2D detector to leverage 3D-aware features to improve 2D pose estimation. A spatial-temporal two-stream transformer network [32] models dependencies between joints using the Transformer self-attention operator. In addition, the human skeleton can be represented as a directed graph [33] to explicitly reflect the hierarchical relationships among the nodes and to leverage varying non-local dependence for different poses by conditioning the graph topology on input poses.
Skeletal-based keypoint feature fusion. GCNs have been introduced to learn high-level representations of the relationships between nodes based on skeletal graphs. A recent study [11] designed a semantic GCN capable of capturing local and global relationships between human joints for human pose estimation. GCNs have also been used to learn multi-scale representations [12] that encode human skeletal joints, thereby lifting 2D human joints to 3D. Most of these existing studies focus on the analysis of the spatial features of skeleton-based keypoints. Inspired by this, we extend the spatial-dimension fusion method to achieve the fusion of pose features in the spatio-temporal dimension.
Action recognition. Unlike pose estimation, which outputs 3D keypoint coordinates, action recognition directly classifies human behavior. Although the tasks differ slightly, much work in action recognition also focuses on the analysis of the human skeleton structure. The spatial-temporal two-stream transformer network mentioned above [32] likewise models dependencies between joints using the Transformer self-attention operator. Additionally, some work [34] explores and compares different ways of extracting human pose features, extending a TCN-like unit to extract the most relevant spatial and temporal characteristics of a sequence of frames.

Video Pose Estimation
Most previous work takes a single frame or a single image as input, while more recent studies use the temporal information of video to disambiguate pose estimation, producing more reliable and robust results. Earlier studies [35,36] used LSTMs to refine 3D poses predicted from single images. Additionally, an LSTM sequence-to-sequence learning model [37] was introduced to encode 2D pose sequences in videos into fixed-size vectors, which are then decoded into 3D pose sequences. Some RNN methods also consider prior information on body-part connectivity [10]. Other research [13,38–40] connects skeleton graphs into keypoint sequences and uses GCNs for action recognition. Further, ref. [41] uses a TCN to process pose-encoded sequences, but this method ignores the structural features of the skeleton.
Since none of the existing 3D pose estimation methods consider the representation of features from different granularities of time and space at the same time, we propose a spatio-temporal U-Net model scheme to learn the semantic fusion features of time and space, and perform 3D human pose estimation.

Skeleton-Based Spatio-Temporal U-Net
As shown in Figure 2, we propose a novel spatio-temporal U-Net scheme to deal with spatio-temporal features of different granularities for 3D human pose estimation in video. The STUNet architecture consists of a cascade structure of structural temporal convolution network (S-TCN) layers and semantic graph convolution network (S-GCN) layers that progressively integrate semantic features in local time and space. We improve upon the underlying U-Net [18] structure, using skip connections to pass spatio-temporal features from the encoding stage to each layer of the decoding stage. This structure gradually abstracts complex spatio-temporal information into high-level semantic pose features while preserving local spatio-temporal information through skip connections. Modeling the 3D pose estimation problem as a U-Net helps to predict more accurate 3D coordinates through high-level abstraction, and also helps to discover potential relationships between keypoints in the temporal and spatial dimensions. The spatio-temporal features of different granularities are extracted and fused in the U-Net structure, which ultimately improves the accuracy of 3D human pose estimation.

Structural Temporal Dilated Convolutional Layer
To improve long-range temporal perception while avoiding an excessive increase in training parameters, we employ time-dilated convolutional layers to fuse long-term keypoint sequences in the temporal dimension, alleviating jitter and ambiguity in 3D pose estimation. Dilated convolution is a sparsely structured convolution with uniformly spaced kernel points and zero padding in between. The dilated convolution of two signals f and h with lengths N and 2M + 1, respectively, can be calculated as

(f *_D h)(n) = Σ_{m=−M}^{M} f(n − D·m) h(m),

where D represents the dilation factor, and n and m represent the indices of the signals f and h, respectively. In Figure 3, we describe the structure of our model in the temporal dimension, whose receptive field grows with increasing horizontal layers. In the implementation, we use an approach similar to our previous work [41]. The difference is that our S-TCN retains the skeleton information and is able to fuse temporal-dimension features from graph structures of different granularities. The proposed S-TCN yields roughly the same computational cost as conventional convolution while increasing the receptive field.
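The dilated convolution above can be sketched directly from its definition. The following minimal Python implementation is illustrative only (the function name and zero-padding choice are our assumptions, not the paper's code); it shows how the dilation factor D spreads the kernel taps across the signal without adding parameters:

```python
def dilated_conv1d(f, h, D):
    """1-D dilated convolution of signal f with kernel h (length 2M + 1),
    dilation factor D, zero-padded outside the signal boundaries."""
    M = (len(h) - 1) // 2
    out = []
    for n in range(len(f)):
        acc = 0.0
        for m in range(-M, M + 1):
            idx = n - D * m
            if 0 <= idx < len(f):      # zero padding: skip out-of-range taps
                acc += f[idx] * h[m + M]
        out.append(acc)
    return out
```

With a kernel of length 2M + 1, one such layer covers a temporal span of 2·D·M + 1 frames, which is why stacking layers with growing D enlarges the receptive field while the per-layer cost stays that of an ordinary convolution.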

Semantic Graph Convolutional Network
The semantic graph convolution layer fuses the semantic features of the human body in the spatial dimension with novel graph convolution, pooling, and unpooling layers. The network consists of a skeletal-structure-based graph layer and a data-dependent non-local layer in series. The structure-based graph layer captures the spatial human skeletal structure information and progressively pools it into a high-level feature representation. The data-dependent non-local layer analyzes the features of long-range nodes, since graph convolutional networks do not handle long-range relationships well.

Structure-Based Graph Layer
In the classic GCN [13], the graph convolution operation on vertex v_i is expressed as

f_out(v_i) = Σ_{v_j ∈ B_i} w(v_i, v_j) · f_in(v_j),    (2)

where v represents a joint vertex of the skeletal graph and f is the feature map. B_i represents the convolution sampling region of v_i, defined as the neighboring vertices v_j of the target vertex v_i. w is a weighting function that processes the input values to provide a weight vector. Using a design similar to SemGCN [11], we introduce learnable matrices that transform the traditional GCN of Equation (2) as follows:

f_out = Σ_{k=1}^{k_v} W_k f_in Ã_k,

where k_v is the kernel size in the spatial dimension, Ã_k is the normalized adjacency matrix of the keypoint graph with self-connections (derived from A_k + I, where I is the identity matrix), and W_k is the trainable weight matrix.
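A single-kernel version of this propagation rule (k_v = 1) can be sketched in a few lines. This is a hedged illustration assuming row normalization of A + I, not the paper's exact implementation, which uses learnable adjacency weights:

```python
def matmul(A, B):
    """Plain-Python matrix product of two nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def gcn_layer(F, A, W):
    """One graph-convolution step: F_out = A_norm · F · W, where A_norm is
    the row-normalized adjacency with self-loops (A + I)."""
    n = len(A)
    A_hat = [[A[i][j] + (1 if i == j else 0) for j in range(n)]
             for i in range(n)]
    A_norm = [[v / sum(row) for v in row] for row in A_hat]
    return matmul(matmul(A_norm, F), W)
```

Each output vertex feature is thus a weighted average over the vertex itself and its skeletal neighbors, followed by a learned linear transform W.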

Graph Pooling and Upsampling
As shown in Figure 4, we divide the body keypoint nodes into five subsets according to the characteristics of the skeleton structure, and then perform a max-pooling operation on each subset. Next, the coarsened graph is further max-pooled into a single node that contains global information for the entire skeleton. In this process, the skeleton spatial structure is gradually fused and connected with the corresponding decoding layer through the skip connections of the U-Net. During upsampling, vertex features in the graph of the same granularity are assigned to the corresponding vertices to fully preserve local spatio-temporal features. Based on the skeletal structure, associated neighborhoods are established on the graph to perform the pooling operation, and semantically similar vertices are clustered together to learn key graph-based representations. In this work, we progressively cluster the entire skeleton at each frame according to the structure of the human limbs. For the bottom-up process, we use a simple upsampling procedure that copies the features of the vertices in the coarser graph to the corresponding vertices in the fine-grained graph. These higher-level features are concatenated with the lower-level features from the skip connections for subsequent processing. Furthermore, temporal connections remain unchanged across the different levels of spatio-temporal abstraction.
The fusion of the spatial skeleton contains only two graph pooling processes. In contrast, according to Figure 3, the fusion process of the input time series increases as the length of the series increases. For example, the perceptual domains of 27-frame and 81-frame are fused three and four times, respectively. Therefore, graph pooling layers are selectively inserted into TCN layers, which is related to the granularity of temporal and spatial fusion. We discuss this issue in detail in the subsequent ablation experiments.
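The pool-then-copy mechanism described above can be sketched as follows. The five-part grouping below is a hypothetical COCO-style assignment of 17 joints for illustration; the paper does not specify its exact subsets:

```python
# Hypothetical 5-part grouping of 17 joints (indices are illustrative).
PARTS = {
    "head":      [0, 1, 2, 3, 4],
    "left_arm":  [5, 7, 9],
    "right_arm": [6, 8, 10],
    "left_leg":  [11, 13, 15],
    "right_leg": [12, 14, 16],
}

def graph_max_pool(features, parts):
    """Max-pool per-joint feature vectors into one vector per body part."""
    return {name: [max(features[i][c] for i in idxs)
                   for c in range(len(features[0]))]
            for name, idxs in parts.items()}

def graph_upsample(pooled, parts, n_joints):
    """Copy each part's pooled feature back to its member joints."""
    out = [None] * n_joints
    for name, idxs in parts.items():
        for i in idxs:
            out[i] = list(pooled[name])
    return out
```

Pooling coarsens the skeleton graph for the encoder path; upsampling restores the fine-grained vertex set so the copied features can be concatenated with the skip-connection features.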

Data-Dependent Non-Local Layer
Since the basic GCN has difficulty handling long-distance relationships, we design a data-dependent non-local layer to capture the global and long-distance relationships between joints in the body skeletal graph. We follow the non-local [42] concept and define the operation as

y_i = (1/C(x)) Σ_{∀j} f(x_i, x_j) · g(x_j),

where W_x is the output weight applied to y_i, f denotes the pairwise function that calculates the affinity between node i and every other node j, and g is the function that computes the node representation. In the implementation, we instantiate f and g following [42].

Datasets and Metrics
Datasets. We use the Human3.6M [21] and HumanEva-I [43] datasets for our experiments. Human3.6M is the most widely used dataset for 3D pose estimation tasks. It contains 3.6 million images captured from different views by four synchronized cameras. The dataset consists of 11 human subjects performing 15 indoor daily activities, such as walking, talking on the phone, sitting, and participating in discussions. It uses a motion capture system to record precise 3D coordinates and then obtains 2D poses by projection through the internal and external camera parameters. HumanEva-I is an earlier dataset for 3D human pose estimation; it contains a small amount of data with relatively simple pose estimation scenarios. We use the same training and testing split as previous work [10,41,44,45], evaluating multiple subject actions including walking, jogging, and boxing. Following previous work [6–8], standard normalization is applied to the distributions of the 2D and 3D poses before input.
Metrics. There are two standard protocols for evaluating our model on Human3.6M. Following previous work [6–8], five subjects (S1, S5, S6, S7, and S8) are used as the training set and two subjects (S9 and S11) as the test set, evaluated under Protocol 1 and Protocol 2. As in previous work [3,5,7,12,46], we use two metrics on Human3.6M. Protocol 1 uses the mean per-joint position error (MPJPE), which measures the average Euclidean distance between the ground truth and the prediction after alignment of the root joint. Protocol 2 uses the mean per-joint position error after rigid alignment (P-MPJPE), so that it is unaffected by rotation and scaling and is more robust.
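The Protocol 1 metric reduces to a short computation. The sketch below assumes root-aligned inputs as lists of (x, y, z) joints; the rigid (Procrustes) alignment required for P-MPJPE is omitted:

```python
import math

def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance between
    predicted and ground-truth 3D joints (both root-aligned beforehand)."""
    assert len(pred) == len(gt)
    total = 0.0
    for p, g in zip(pred, gt):
        total += math.sqrt(sum((a - b) ** 2 for a, b in zip(p, g)))
    return total / len(pred)
```

Reported scores average this quantity over all frames and test sequences, typically in millimeters.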

Implementation Details
Following previous work [6,7], we use the predicted 2D keypoints released by [41] from the Cascaded Pyramid Network (CPN) as the input of our 3D pose model. Since there is a strong correlation between clips of the same video screen, we sample from different video clips to avoid biased statistics for batch normalization [47].
We train for 100 epochs using the AMSGrad optimizer [48]. An exponential learning-rate decay scheme is employed, starting at η = 0.001 and applying a shrink factor of α = 0.95 per epoch. We set the batch size and dropout rate to 1024 and 0.2, respectively. The pose data are augmented by horizontal flipping.
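The decay schedule above is simply η_e = η_0 · α^e; the helper name below is ours, not from the paper's code:

```python
def lr_at_epoch(epoch, base_lr=0.001, shrink=0.95):
    """Exponential learning-rate decay: eta_e = eta_0 * alpha^epoch."""
    return base_lr * shrink ** epoch
```

After 100 epochs this leaves the learning rate at roughly 0.6% of its initial value, a common setting for fine-grained convergence.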

Experimental Results
Learned weighting matrices. The proposed U-Net contains S-GCN layers at each level. Figure 5 visualizes the learned weighting matrices in the network, including the weights of the original graph with 17 keypoints and of the fused graph with five keypoints. The weights in the upper left are larger than those in the lower right, which means that central nodes have a higher impact than end nodes. In other words, keypoint information is passed through the S-GCN according to the skeleton structure, which shows that the skeleton structure information is fully utilized. In addition, we observe that the weights learn the structural features of the human skeleton. For example, the head, nose, and neck have relatively fixed structural relationships, and the connection weights obtained through training are relatively high. Figure 5 demonstrates that the S-GCN correctly resolves the structure of the keypoints of the human skeleton, thus improving the performance of 3D body pose estimation.
Comparison with the state of the art. As shown in Table 1, our model leads the previous best result [49] by an average of 0.6 mm. Compared to the baseline model [41], our 243-frame model is 6.8 mm and 5.7 mm ahead on the "Sitting" and "Sitting Down" actions, respectively, indicating that our model copes better with complex situations such as occlusion and overlap. For Protocol 2 in Table 2, our method achieves a minimum error of 35.4 mm, a reduction of 0.2 mm compared to the previous best result [49]. It is worth mentioning that our model has a great advantage on highly dynamic action sequences, especially the "Walk" and "Walk Together" sequences, which improve by more than 4 mm under Protocol 1 compared to the base model.
The test results on the HumanEva-I dataset are shown in Table 3, where "-" indicates that no corresponding result was reported in that work. The experiments show that we achieve the best results in seven of the nine sequence tasks across Walk, Jog, and Box. It is worth mentioning that, although HumanEva-I is a relatively easy task, our metrics still hold a slight lead over existing methods. We attribute this to the spatio-temporal semantic fusion, which smooths dynamic action prediction and thus reduces the error of 3D pose estimation. In general, our model performs well across various behavioral tasks and multiple evaluation protocols compared to existing models.
To visualize the output of the model, Figure 6 shows qualitative results of our model on multiple action sequences, namely the "Walk", "Wait", "Pose", and "Purch" sequences. We follow our previous work [41] and use the 2D keypoint estimation results of the Cascaded Pyramid Network (CPN) as input. The figure shows that our STUNet model produces stable and accurate results on multiple action sequences.

Computational Complexity
As shown in Table 4, we report the number of model parameters and floating-point operations (FLOPs) and compare them with previous work. The performance comparison under Protocol 1 is also shown in Table 4. For the 243-frame model, with a slight increase in parameters and FLOPs, the model achieves the best average error metric. In the 27-frame and 81-frame models, the number of parameters and computations is reduced substantially as the receptive field shrinks, while largely maintaining competitive performance. This is attributed to the fact that we retain the skeleton graph information when performing feature fusion to improve accuracy, while the U-Net structure compresses the spatio-temporal information to control the number of parameters.

Ablation Study and Analysis
Granularity of spatio-temporal features. We perform ablation experiments on our models using the Human3.6M dataset. To explore the impact of spatio-temporal fusion at different granularities, we test the 81-frame and 243-frame models and compare them using the MPJPE metric. The ablation studies investigate the effects of the order and granularity of temporal and spatial fusion. N1 and N2 denote the locations of the two spatial graph pooling operations, which determine the timing of spatio-temporal feature fusion in the framework. At higher levels of the spatio-temporal hierarchy, features of specific keypoints are grouped and fused, and some fine-grained information is lost; this fine-grained information is concatenated back in the subsequent upsampling process through skip connections. The larger the values of N1 and N2, the later the spatio-temporal features are fused, meaning that more layers of the model process features at the fine-grained spatio-temporal level. We achieve the best results at (1,3) and (2,4) for the 81-frame and 243-frame models, respectively; in both cases the best N1 and N2 take intermediate values. It is worth mentioning that Table 5 shows that our model is not sensitive to the hyper-parameter values of N1 and N2: their effect on the results is within 1 mm, indicating strong robustness.
U-Net architecture. This work presents a novel STUNet architecture with a cascade structure of temporal convolution layers and graph convolution layers to gradually integrate the semantic features of local time and space. Skip connections combine coarse-grained feature embeddings from the decoder sub-network with fine-grained spatiotemporal feature embeddings from the encoder sub-network to boost the final 3D pose estimation performance. Experiments demonstrate that skip connections are effective in recovering and fusing spatio-temporal features of different granularities.
Spatio-temporal feature fusion. Our approach performs feature fusion and compression in both the temporal and spatial dimensions. The proposed structural temporal dilated convolution layer can retain spatial structured graph information while fusing long-time key point sequences in the temporal dimension to eliminate jitter and blur in 3D pose estimation in the single frame case. The semantic features of the human body structure are processed by semantic graph convolution, which leads to a significant improvement in the accuracy of our model compared to the base model.
Optimization of multi-frame input. Models with multi-frame input are usually more difficult to deploy in practical scenarios than single-frame models. However, we apply a dilated TCN to handle temporal-dimension features, which makes application-level optimization possible. As in other dilated TCN-based approaches [41], since the temporal features are processed hierarchically, features at different moments can be reused once they have been computed at inference time. Therefore, the 243-frame input only affects the start and end of model inference, while runtime inference can be optimized for parallelism and reuse, maintaining good performance.

Conclusions
In this paper, we address the problem that existing methods lack effective means to extract dynamic, complex features from spatio-temporal structures. We propose a novel spatio-temporal U-Net scheme to deal with spatio-temporal features at multiple scales for 3D human pose estimation in video. The proposed STUNet architecture consists of a cascade structure of semantic-structural graph convolution layers and temporal dilated convolution layers, progressively extracting and fusing the spatio-temporal semantic features from fine-grained to coarse-grained. Our method achieves competitive performance on 3D body pose estimation benchmarks. The experiments demonstrate that the U-shaped network structure optimizes 3D pose estimation through the downscaling and upscaling of spatio-temporal fusion features. They also show the importance of representing spatio-temporal features at different granularities in the pose recognition task, which opens up many possible directions for future work, for example, replacing manual hard rules with trainable graph pooling methods [50,51] to automatically fuse spatio-temporal features of different granularities for pose estimation tasks.
Author Contributions: W.L., R.D. and S.C. were all responsible for the design of the research, the implementation of the approach, and the analysis of the results. All authors have read and agreed to the published version of the manuscript.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

GCN     Graph Convolutional Network
TCN     Temporal Convolutional Network
CNN     Convolutional Neural Network
ST-GCN  Spatial Temporal Graph Convolutional Network