Although an anatomically based Euclidean skeletal topology (e.g., the “shoulder–elbow–wrist” chain structure) ensures anatomical plausibility, its representational capacity is constrained by the physical connections of the skeleton. This limitation prevents KTPFormer from learning long-range semantic and functional relationships across joints. Therefore, this study constructs a hybrid topology for GCN–Transformer fusion models. The hybrid topology consists of the original physical connections together with designed semantic prior edges. Its design philosophy is to preserve the physical connections while systematically adding semantic prior edges, based on a distinction between core and non-core regions. In this way, the proposed topology achieves a unified representation of both physical constraints and semantic relationships. Consequently, when estimating poses for complex multi-class human actions, the proposed topology enables effective feature extraction from cross-joint motion in both core and non-core regions.
3.2.1. Semantic Prior Edges Design for Common Features in Core Regions
In multi-class complex human actions, core regions should avoid joints with excessive movement amplitude (e.g., lower limbs in Sitting (Sit) and upper limbs in Phoning (Pho), Appendix A), as these lack common features across actions. Instead, joints with moderate and consistent motion (square dashed line region in Figure 3) are selected, and their identification is supported by quantitative analysis using the cross-action joint velocity variance metric $\sigma_j^2$.
The calculation of $\sigma_j^2$ follows four sequential formulas, forming a rigorous quantitative chain to measure joint motion stability across actions:
Single-sequence average joint velocity:
$$\bar{v}_j^{(s)} = \frac{1}{T-1}\sum_{t=1}^{T-1}\frac{\lVert \mathbf{p}_j^{t+1}-\mathbf{p}_j^{t}\rVert_2}{\Delta t}$$
where $T$ is the sequence length, $\mathbf{p}_j^{t}$ denotes the 3D coordinate of joint $j$ at frame $t$, $\Delta t = 1/f$ ($f$: dataset frame rate), and the result reflects the average motion speed of the joint in a single sequence $s$.
Class-averaged joint velocity:
$$\bar{v}_j^{c} = \frac{1}{\lvert S_c\rvert}\sum_{s\in S_c}\bar{v}_j^{(s)}$$
where $S_c$ is the set of sequences for action class $c$, quantifying the typical motion speed of the joint within the class.
Global-averaged joint velocity:
$$\bar{v}_j = \frac{1}{\lvert C\rvert}\sum_{c\in C}\bar{v}_j^{c}$$
where $C$ is the set of all 15 action classes, serving as the baseline for cross-action comparison.
Cross-action joint velocity variance:
$$\sigma_j^2 = \frac{1}{\lvert C\rvert}\sum_{c\in C}\left(\bar{v}_j^{c}-\bar{v}_j\right)^2$$
As the core screening metric, a smaller $\sigma_j^2$ indicates more stable motion patterns across actions. It should be noted that a lower $\sigma_j^2$ does not imply higher task importance of a joint, but rather reflects greater motion consistency across heterogeneous action categories. Such stability is desirable for modeling common features shared among multiple actions.
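The four-step chain above (single-sequence average velocity, class average, global average, cross-action variance) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the nested-list data layout, and the population normalization (dividing by the number of classes) are assumptions, and the toy sequences are synthetic.

```python
import math


def sequence_mean_velocity(seq, fps):
    """Average speed of every joint over one sequence.

    seq: list of frames; each frame is a list of (x, y, z) joint coordinates.
    fps: frame rate, so the frame interval is 1 / fps.
    """
    n_frames, n_joints = len(seq), len(seq[0])
    totals = [0.0] * n_joints
    for t in range(n_frames - 1):
        for j in range(n_joints):
            # Displacement between consecutive frames divided by the interval.
            totals[j] += math.dist(seq[t + 1][j], seq[t][j]) * fps
    return [total / (n_frames - 1) for total in totals]


def cross_action_velocity_variance(sequences_by_class, fps):
    """Per-joint variance of the class-averaged velocity across classes."""
    class_means = []
    for seqs in sequences_by_class.values():
        per_seq = [sequence_mean_velocity(s, fps) for s in seqs]
        # Average over the sequences of this class, joint by joint.
        class_means.append([sum(col) / len(col) for col in zip(*per_seq)])
    n_classes = len(class_means)
    variances = []
    for per_class in zip(*class_means):
        global_mean = sum(per_class) / n_classes
        # Assumption: population variance (normalized by the class count).
        variances.append(
            sum((v - global_mean) ** 2 for v in per_class) / n_classes
        )
    return variances


# Synthetic toy data: 2 joints, 2 classes, 3 frames each, 1 frame per second.
# Joint 0 never moves; joint 1 moves 1 unit/frame in "A" and 3 in "B".
toy = {
    "A": [[[(0, 0, 0), (0, 0, 0)], [(0, 0, 0), (1, 0, 0)], [(0, 0, 0), (2, 0, 0)]]],
    "B": [[[(0, 0, 0), (0, 0, 0)], [(0, 0, 0), (3, 0, 0)], [(0, 0, 0), (6, 0, 0)]]],
}
variances = cross_action_velocity_variance(toy, fps=1)
print(variances)  # [0.0, 1.0]
```

In the toy data the static joint gets zero variance and would count as stable across actions, while the joint whose speed differs between classes gets a positive variance.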
Based on the 17-joint skeleton of the Human3.6M dataset, we calculated the cross-action velocity variance for each joint (see Figure 3) and ultimately determined the core joint set through statistical distribution analysis and kinematic function verification. By analyzing the overall distribution of cross-action velocity variance for all joints, we found that these variances do not change continuously or uniformly; instead, some joints exhibit significantly lower variance and thus possess stronger motion stability. While certain other joints also show relatively low variance, they are excluded from the core joint set due to mismatched functional positioning: the head and neck joints are connected to the torso via a single skeletal chain, functioning as distal joints. Their seemingly stable motion states mainly derive from the passive constraints imposed by the torso rather than from active participation in maintaining global pose stability, and therefore they cannot meet the structural support requirements of the core region. As analyzed above, the core joints identified in this dataset consist of six members: Spine, LHip, RHip, LShoulder, Thorax, and RShoulder (highlighted with black borders in Figure 3).
From a kinematic and structural perspective, these six core joints are all located in the central torso, participate in multi-chain skeletal coupling, and serve as key hubs for force transmission and information propagation within the human skeleton graph. They jointly support the global pose structure, coordinate upper and lower limb movements, and are particularly suitable as the structural backbone for pose representation—laying a stable foundation for the subsequent design of semantic prior edges.
Based on comprehensive considerations of statistical motion stability, multi-chain skeletal coupling characteristics, and functional roles in pose coordination, the core region is formally defined as the set of these six torso joints, which together form a stable and semantically meaningful structural backbone.
Based on the above analysis, we summarize a general modeling principle: joints with low cross-action motion variance exhibit stable and shared motion patterns across actions and are therefore suitable for learning common representations, whereas joints with higher variance tend to encode action-specific or personalized characteristics and are treated as non-core regions.
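The modeling principle can be illustrated with a small screening function: keep joints whose cross-action variance is low, then remove distal joints that fail the kinematic check. This is only a sketch. The text does not state a numeric threshold, so the `threshold` parameter, the function name, and all variance values below are hypothetical and are not results from the paper.

```python
def select_core_joints(variance_by_joint, threshold, excluded=()):
    """Return joints with variance below `threshold`, minus excluded ones."""
    return sorted(
        name
        for name, var in variance_by_joint.items()
        if var < threshold and name not in excluded
    )


# Made-up variance values for illustration only; NOT measurements.
toy_variance = {
    "Spine": 0.02,
    "Thorax": 0.03,
    "Head": 0.04,
    "RWrist": 0.60,
    "LAnkle": 0.90,
}

# Head and Neck are dropped as distal joints even when their variance is low.
core = select_core_joints(toy_variance, threshold=0.10, excluded={"Head", "Neck"})
print(core)  # ['Spine', 'Thorax']
```

The two-stage structure mirrors the text: a statistical filter first, followed by a functional exclusion that the variance alone cannot express.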
Combining quantitative screening and anatomical constraints, the core region consists of six joints: Spine, Thorax, LShoulder, RShoulder, LHip, and RHip (Figure 4, black dashed box). On this basis, this study designed three semantic prior edge schemes within the region (Figure 4), each enhancing the common feature representation of the core region along a different dimension. The schemes target two key dimensions: information transmission between the upper and lower body, and balance correlation between the left and right sides. The first scheme is the “chest–hip” connection (Figure 4a), which simultaneously enhances vertical information transmission and horizontal balance correlation. Vertically, it shortens the lengthy “chest–spine–hip” path, making information transmission more direct. Horizontally, it helps the left and right sides of the core region coordinate better, thereby improving the overall balance correlation of the skeleton. The second scheme is the “shoulder–hip” connection (Figure 4b), which focuses on vertical enhancement: a direct connection between the shoulder and hip shortens the lengthy “shoulder–chest–spine–hip” path of the original topology and enhances vertical information transmission. The third scheme is the “shoulder–shoulder” and “hip–hip” connection (Figure 4c), which focuses on horizontal enhancement by explicitly establishing bilateral symmetric connections, strengthening the balance correlation between left and right joints and improving the skeleton’s ability to represent the symmetry of human posture.
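A hybrid topology of this kind can be expressed as an adjacency matrix that contains the physical bones plus the semantic prior edges of one scheme. The sketch below is illustrative and rests on assumptions: the joint order and parent list follow the 17-joint Human3.6M layout commonly used in 3D HPE code, and the exact joint pairs per scheme (for example, same-side shoulder-to-hip pairs) are a reading of the scheme names, not confirmed by the text. Self-loops and normalization are left to the model.

```python
# Assumed joint order: a 17-joint Human3.6M layout common in 3D HPE code.
JOINTS = [
    "Hip", "RHip", "RKnee", "RAnkle", "LHip", "LKnee", "LAnkle",
    "Spine", "Thorax", "Neck", "Head", "LShoulder", "LElbow",
    "LWrist", "RShoulder", "RElbow", "RWrist",
]
# Parent of each joint in the physical skeleton (-1 marks the root).
PARENTS = [-1, 0, 1, 2, 0, 4, 5, 0, 7, 8, 9, 8, 11, 12, 8, 14, 15]

# Assumed joint pairs for the three core-region schemes.
CORE_SCHEMES = {
    "chest_hip": [("Thorax", "LHip"), ("Thorax", "RHip")],
    "shoulder_hip": [("LShoulder", "LHip"), ("RShoulder", "RHip")],
    "shoulder_shoulder_hip_hip": [("LShoulder", "RShoulder"), ("LHip", "RHip")],
}


def physical_edges():
    """Bone list derived from the parent array."""
    return [(child, parent) for child, parent in enumerate(PARENTS) if parent >= 0]


def hybrid_adjacency(semantic_pairs):
    """Symmetric 0/1 adjacency: physical bones plus semantic prior edges."""
    n = len(JOINTS)
    index = {name: i for i, name in enumerate(JOINTS)}
    adj = [[0] * n for _ in range(n)]
    edges = physical_edges() + [(index[a], index[b]) for a, b in semantic_pairs]
    for i, j in edges:
        adj[i][j] = adj[j][i] = 1
    return adj


adj = hybrid_adjacency(CORE_SCHEMES["shoulder_hip"])
# 16 physical bones plus 2 semantic prior edges.
print(sum(map(sum, adj)) // 2)  # 18
```

Keeping the schemes in a dictionary makes it straightforward to swap one scheme for another in an ablation while the physical edges stay fixed.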
To verify the generality of this design approach, this study applied it to the HumanEva-I dataset. Following the same principle of cross-action motion stability, we identified a compact torso-centered core region in HumanEva-I and designed semantic prior edges within this region accordingly, rather than introducing dataset-specific heuristic connections. Following the common validation paradigm in related 3D human pose estimation (3D HPE) research [4,20,21], this study selected the Walk and Jog actions to validate cross-dataset effectiveness. We constructed a core region encompassing the head, shoulders, neck, and hips, and added a “head–hip” semantic connection to form a hybrid topology structure (Figure 5b), which strengthens the common feature representation capability of the core region in this dataset. At this point, this study has completed the design of semantic prior edges for the common features of core regions across the two datasets.
3.2.2. Semantic Prior Edges Design for Personalized Features in Non-Core Regions
Introducing semantic prior edges for the common features of core regions enables better representation of the vast majority of action classes in the dataset. However, given the complexity of human actions, a small subset of action classes can only be expressed effectively by also enhancing personalized features in non-core regions. These features typically arise between limb joints and characterize long-range semantic dependencies among limb endpoints. Because such dependencies are prominent in only a limited number of action classes, we define them as personalized features of non-core regions. The core goal of designing semantic prior edges for non-core regions is therefore to build on the common framework of core regions and strengthen the connections between key nodes that carry action-specific characteristics. This yields a two-layer representation structure of “common foundation + personalized differentiation” and compensates for the insufficient representation of fine action details by core regions.
When constructing semantic prior edges in non-core regions, we first excluded the lower limb ankle joints (LAnkle, RAnkle). The reason is that their semantic relevance and feature value do not meet the selection criteria for key nodes in non-core regions; the exclusion is not a mere preference for certain action types. As shown in Figure 3, the ankle joints exhibit extremely high cross-action velocity variance, far higher than other non-core joints, with distinct action-specific characteristics. From the perspective of semantic logic and representation requirements, however, they have significant shortcomings: ankle movement is strongly bound to lower limb-dominated large-displacement actions (e.g., walking, jumping), and its variance contribution mostly stems from large-amplitude limb displacements rather than fine posture adjustments. In contrast, the core feature of upper limb functional actions (e.g., Phoning, Directions) lies precisely in fine, small-amplitude joint movements. Notably, Figure 3 shows that RWrist has the highest cross-action velocity variance among all upper limb non-core joints. This suggests that it carries the most delicate detail changes in actions, such as rotational pointing in Directions and stable device-holding in Phoning. It thus serves as a key carrier of action-specific details for upper limb movements, which aligns with the design goal of “enhancing personalized features in non-core regions”. In comparison, the left wrist (LWrist) exhibits significantly smaller variation across actions, indicating a limited fine-tuning range and a much lower ability to capture action-specific details than the right wrist. Therefore, it does not meet the selection criteria for key nodes in non-core regions.
This study designs three personalized semantic prior edge schemes (Figure 6) that establish coordination relationships between upper limb joints, enabling the model to better capture cross-joint long-range semantic dependencies in specific actions. The first scheme is the “right shoulder–right wrist” connection (Figure 6a), primarily designed to establish a direct information transmission path for action classes with larger upper limb movement ranges. The second scheme is the “left shoulder–left wrist” connection (Figure 6b), used to test whether such a connection has the same effect on the left side, where the action-specific motion is less pronounced. It serves as a comparative experiment to validate the rationale of the first scheme. The third scheme introduces both “shoulder–wrist” connections simultaneously (Figure 6c) to evaluate the impact of bilateral symmetry on pose representation capability. Comparing it with the first two schemes further validates the rationale of the first scheme.
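The effect of a shoulder–wrist edge on the graph can be checked by counting hops before and after adding it. In a standard graph convolution a node aggregates features from its direct neighbors, so a shorter hop distance means fewer layers are needed for two joints to exchange information. The sketch below is a hedged illustration: the edge list covers only an assumed right-arm chain, and `hop_distance` is a hypothetical helper, not part of any model code described in the text.

```python
from collections import deque

# Assumed physical chain of the right arm: thorax-shoulder-elbow-wrist.
PHYSICAL = [
    ("Thorax", "RShoulder"),
    ("RShoulder", "RElbow"),
    ("RElbow", "RWrist"),
]


def hop_distance(edges, source, target):
    """Breadth-first hop count between two joints (None if unreachable)."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    queue, seen = deque([(source, 0)]), {source}
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


before = hop_distance(PHYSICAL, "RShoulder", "RWrist")
# Add the "right shoulder-right wrist" semantic prior edge.
after = hop_distance(PHYSICAL + [("RShoulder", "RWrist")], "RShoulder", "RWrist")
print(before, after)  # 2 1
```

The same check applies to the left-side and bilateral schemes by swapping in the corresponding joint names.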
The design of personalized semantic prior edges in non-core regions is not independent of core regions, but forms a complementary synergy of “commonality-personalization” with them: the torso joints in core regions provide a stable benchmark for global posture, ensuring the consistency of basic representation across different actions, while the personalized shoulder–wrist connections in non-core regions focus on action-specific details, enabling accurate differentiation of similar actions. This two-layer design not only avoids the “over-generalization” of fine actions by the common representation of core regions but also solves the “representation confusion” of high-variance joints in non-core regions, laying a solid foundation for the subsequent performance improvement of the model in 3D HPE tasks.