Article

Geometry-Aware 3D Hand–Object Pose Estimation Under Occlusion via Hierarchical Feature Decoupling

School of Computer Science, Xi’an Polytechnic University, Xi’an 710600, China
* Authors to whom correspondence should be addressed.
Electronics 2025, 14(5), 1029; https://doi.org/10.3390/electronics14051029
Submission received: 19 February 2025 / Revised: 3 March 2025 / Accepted: 4 March 2025 / Published: 5 March 2025
(This article belongs to the Special Issue Deep Learning for Computer Vision Application)

Abstract

Hand–object occlusion poses a significant challenge in 3D pose estimation. During hand–object interactions, parts of the hand or object are frequently occluded by the other, making it difficult to extract discriminative features for accurate pose estimation. Traditional methods typically extract features for both the hand and object from a single image using a shared backbone network. However, this approach often results in feature contamination, where hand and object features are mixed, especially in occluded regions. To address these issues, we propose a novel 3D hand–object pose estimation framework that explicitly tackles the problem of occlusion through two key innovations. While existing methods rely on a single backbone for feature extraction, our framework introduces a feature decoupling strategy that shares low-level features (using ResNet-50) to capture interaction contexts, while separating high-level features into two independent branches. This design ensures that hand-specific features and object-specific features are processed separately, reducing feature contamination and improving pose estimation accuracy under occlusion. Recognizing the correlation between the hand’s occluded regions and the object’s geometry, we introduce the Hand–Object Cross-Attention Transformer (HOCAT) module. Unlike traditional attention mechanisms that focus solely on feature correlations, the HOCAT leverages the geometric stability of the object as prior knowledge to guide the reconstruction of occluded hand regions. Specifically, the object features (key/value) provide contextual information to enhance the hand features (query), enabling the model to infer the positions of occluded hand joints based on the object’s known structure. This approach significantly improves the model’s ability to handle complex occlusion scenarios. The experimental results demonstrate that our method achieves significant improvements in hand–object pose estimation tasks on publicly available datasets such as HO3D V2 and Dex-YCB. On the HO3D V2 dataset, the PAMPJPE reaches 9.1 mm, the PAMPVPE is 9.0 mm, and the F-score reaches 95.8%.

1. Introduction

Three-dimensional hand–object pose estimation holds significant value across various fields, particularly in virtual reality (VR), augmented reality (AR) [1,2], robotics [3,4], and human–computer interaction [5]. By accurately estimating the 3D poses of hands and objects, more natural interaction methods can be achieved, thereby enhancing user experience. For instance, in VR and AR applications, the precision of gesture control and object manipulation directly impacts user immersion and operational smoothness. In robotics, 3D hand–object pose estimation is crucial for robotic grasping and manipulation tasks. Especially in complex, dynamic environments, the interactions between hands and objects are highly intricate, making the accurate reconstruction of their relative positions and poses essential for practical applications.
In the field of 3D hand–object pose estimation, existing methods still face challenges in effectively addressing occlusion. Recently, combining the local and global information of hand–object features has become a common solution [6,7,8]. Many studies employ a single backbone network to simultaneously extract hand and object features within the same space [7,9,10]. The advantage of this approach lies in its ability to share contextual information between the hand and object, effectively capturing their interactions and reducing the complexity of feature alignment. Additionally, it leverages attention mechanisms to enhance important features [7]. However, in hand–object interaction scenarios where partial overlap often occurs, a single backbone network may confuse hand and object features with complex background information during feature extraction. This is particularly problematic in RGB images, where distinguishing between the hand, the object, and unrelated background areas becomes difficult, leading to increased errors. Moreover, current methods lack optimization in balancing feature sharing and separation, making it challenging to simultaneously capture the global contextual relationships between the hand and the object while performing fine-grained processing of their respective features. These limitations collectively restrict the accuracy of models in estimating hand–object poses under complex occlusion scenarios.
To address these issues, we propose a 3D hand–object pose estimation network. While multi-branch networks are widely used in pose estimation, our framework introduces a feature decoupling strategy built on a multi-branch feature pyramid: the backbone shares the lower-level feature extraction layers of ResNet-50, enabling early-stage feature sharing between the hand and object. At higher levels, the network introduces two parallel branches, each equipped with independent lateral connections and upsampling operations to separately refine the hand and object features. Each branch independently extracts and processes the features it requires. This design preserves the global contextual relationship between the hand and object while ensuring differentiated and detailed processing of their respective features.
Recognizing the correlation between the hand’s occluded regions and the object’s geometry, we introduce the Hand–Object Cross-Attention Transformer (HOCAT) module. Recent works have explored Transformer-based modules for 3D pose estimation. Huang et al. [11] proposed a Transformer-based network for estimating 3D hand poses from point clouds, demonstrating the effectiveness of attention mechanisms in capturing spatial relationships. Unlike generic attention mechanisms, the HOCAT explicitly incorporates the geometric stability of objects as prior knowledge, enabling more accurate reconstruction of occluded hand regions. While Liu et al. [7] used a Contextual Reasoning (CR) module to enhance object features using hand–object interaction regions, the HOCAT is designed to enhance both hand and object features by leveraging the geometric stability of objects as prior knowledge. Specifically, the HOCAT integrates minor features from the object feature map, which are relevant to the hand, into the main features of the hand through the Transformer structure. This compensates for the feature loss caused by the hand’s non-rigidity, flexibility, and occlusion. Particularly when the hand is partially occluded by the object, the HOCAT leverages the stable geometric structure of the object to provide additional contextual information, enabling the model to predict the hand’s 3D pose with greater accuracy. In the experimental section, our method achieves significant improvements in performance on public datasets such as HO3D V2 [12] and Dex-YCB [13], particularly in the 3D hand pose estimation task, where it shows higher accuracy and adaptability.
Our main contributions are summarized as follows:
  • We propose a novel 3D hand–object pose estimation network that estimates a 3D hand–object pose from a single RGB image. It incorporates a multi-branch feature pyramid, which shares low-level features while dividing high-level features into two branches to extract separate features for the hand and object. While multi-branch networks are widely used in pose estimation, our design explicitly decouples hand and object features based on their physical properties (non-rigid vs. rigid). The shared low-level layers capture interaction contexts, while independent high-level branches refine features tailored to hand articulation and object geometry. This strategy significantly reduces feature contamination under occlusion.
  • Because existing attention mechanisms lack geometric constraints for occlusion reasoning, we designed a Hand–Object Cross-Attention Transformer (HOCAT) module, which uniquely integrates the object’s geometric stability as a prior: the rigid object features (key/value) guide the reconstruction of occluded hand regions (query), ensuring physically plausible predictions. This module effectively integrates minor object features into the primary hand features, enhancing the hand features and improving the model’s adaptability to complex occlusion scenarios, as well as its overall task performance.
  • Extensive experiments demonstrate that our framework achieves superior performance compared to state-of-the-art 3D hand–object pose estimation methods on datasets of hand–object interactions with significant hand occlusions. On the HO3D V2 dataset, the PAMPJPE reaches 9.1 mm, the PAMPVPE is 9.0 mm, and the F-score reaches 95.8%.

2. Related Work

2.1. Unified Frameworks for Rigid Body and Hand Pose Estimation

Significant progress has been made in 3D pose estimation for rigid bodies (e.g., objects) and hands. Early methods relied heavily on multi-view images [14,15] or depth sensors [16,17], using geometric constraints to address occlusion. However, these approaches were hardware-intensive and difficult to generalize to monocular RGB scenarios. In recent years, monocular RGB methods have become mainstream:
  • Rigid Body Pose Estimation: Tekin et al. [18] proposed a single-shot network for joint 6D object pose estimation, but it lacks adaptability to non-rigid components.
  • Hand Pose Estimation: Methods based on parametric models (e.g., MANO [19]) achieve efficient reconstruction by regressing joint parameters [20,21,22], but they often produce geometrically implausible results in occluded scenarios.
  • Multi-Task Unified Frameworks: I2L-MeshNet [23] and METRO [24] attempt to handle multiple tasks using shared backbone networks, but the high-level features of hands (non-rigid) and objects (rigid) tend to mix, leading to error accumulation in occluded regions.
Unlike traditional multi-task frameworks, we propose a task-specific feature decoupling strategy. While low-level features are shared to capture interaction contexts, high-level features are processed independently to explicitly model the physical differences between hands (non-rigid) and objects (rigid), significantly reducing feature contamination (see Section 3.1).

2.2. Occlusion Challenges and Hand–Object Interaction Modeling

Occlusion is a core challenge in hand–object pose estimation. Existing methods primarily address this issue through the following approaches:
  • Attention-Based Local Reasoning: Lin et al. [25] used Transformers to capture the global context but did not leverage object geometric priors, leading to physically implausible predictions in occluded regions.
  • Multi-Modal Fusion: Hasson et al. [16] improved interaction accuracy by incorporating physical constraints, but their approach relies on precise contact point annotations, making it difficult to scale to real-world scenarios.
  • Implicit Representation Learning: Karunratanakul et al. [17] modeled hand–object interactions using signed distance fields, but this method suffers from high computational complexity and sensitivity to training data.
These approaches generally overlook two critical issues: (1) the conflict between the non-rigid deformation of hands and the rigid structure of objects and (2) the lack of object priors to guide geometric reasoning in occluded regions. For instance, Liu et al. [7] improved interaction consistency through temporal modeling, but their single-branch network cannot distinguish between hand and object features, leading to significant errors under severe occlusion. To address these issues, we propose a cross-modal attention mechanism (HOCAT), which integrates the geometric stability of objects as a prior into the feature fusion process (see Section 3.2). By using object features as the key/value to guide the attention allocation of the hand features (query), HOCAT can infer the positions of occluded hand joints.

3. Methods

Our proposed framework for 3D hand–object pose estimation consists of three main components: (1) a multi-branch feature pyramid that shares low-level ResNet-50 features while decoupling high-level features into two independent branches to extract hand and object features, respectively; (2) a Hand–Object Cross-Attention Transformer (HOCAT) module for deep fusion of hand and object features; and (3) decoders for estimating the 3D poses of the hand and object. The pipeline takes an RGB image as input, which is processed by the backbone network to extract multi-scale features. These features are then refined and fused through the HOCAT module and finally passed to the respective decoders for pose estimation. Below, we describe each component in detail.

3.1. Backbone

The framework is illustrated in Figure 1. In our approach, the backbone is built upon ResNet-50 [26], which efficiently extracts multi-scale features from the input image layer by layer. To fuse features across levels, we adopt the top-down path of the FPN [27] structure, progressively upsampling deep-level features and fusing them with the corresponding shallow-level features. To separate hand and object features, each stage of ResNet-50 [26] extracts features for both the hand and the object using the same set of convolution operations. Sharing the low-level feature extraction in this way enables the network to capture information common to the hand and object, reducing computational cost and improving training efficiency. In the top-down pathway, we design two branches for the hand and object features, each with its own lateral connection layers and upsampling operations: every feature type passes through its own lateral connection layer before being upsampled and fused with the corresponding feature maps. After fusion, we apply smoothing layers to both the hand and object features to reduce noise in the feature maps, which improves feature quality and distinguishability. The heatmaps of the feature maps generated by our backbone network are shown in Figure 2. With this design, the backbone can still capture and clearly highlight hand features under occlusion.
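For concreteness, the sketch below outlines one way to implement this decoupled top-down pathway in PyTorch: the torchvision ResNet-50 stages are shared, while each branch has its own lateral connections and smoothing convolutions. The class name, channel widths, and nearest-neighbor upsampling are assumptions for illustration, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50


class DecoupledFPN(nn.Module):
    """Shared ResNet-50 stages with two independent top-down branches
    (one for the hand, one for the object). Channel widths are illustrative."""

    def __init__(self, out_ch: int = 256):
        super().__init__()
        r = resnet50(weights=None)
        # Shared low-level feature extraction (stem + stages producing C2..C5).
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)
        self.stages = nn.ModuleList([r.layer1, r.layer2, r.layer3, r.layer4])
        in_chs = [256, 512, 1024, 2048]
        # Branch-specific lateral connections and post-fusion smoothing layers.
        self.lateral = nn.ModuleDict({
            b: nn.ModuleList([nn.Conv2d(c, out_ch, 1) for c in in_chs])
            for b in ("hand", "obj")
        })
        self.smooth = nn.ModuleDict({
            b: nn.ModuleList([nn.Conv2d(out_ch, out_ch, 3, padding=1) for _ in in_chs])
            for b in ("hand", "obj")
        })

    def _top_down(self, feats, branch):
        lat = [conv(f) for conv, f in zip(self.lateral[branch], feats)]
        merged = [lat[-1]]                                  # start from the deepest level
        for f in reversed(lat[:-1]):                        # upsample and add, coarse to fine
            up = F.interpolate(merged[-1], size=f.shape[-2:], mode="nearest")
            merged.append(f + up)
        merged = list(reversed(merged))
        return [s(m) for s, m in zip(self.smooth[branch], merged)]

    def forward(self, x):
        feats, f = [], self.stem(x)
        for stage in self.stages:                           # shared convolutional stages
            f = stage(f)
            feats.append(f)
        return self._top_down(feats, "hand"), self._top_down(feats, "obj")


# pyr_hand, pyr_obj = DecoupledFPN()(torch.randn(1, 3, 256, 256))
```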
To further enhance the interaction between features, we introduce the SE (Squeeze-and-Excitation) module [28]. This module applies channel-wise weighting to the hand and object features, adaptively enhancing important features while suppressing irrelevant ones. The SE module extracts global information through global pooling and generates weights via fully connected layers. These weights are then multiplied back into the original feature maps, enabling dynamic adjustment of the feature maps. This mechanism ensures that the features of the hand and object are sufficiently distinguished in representation, thereby improving the performance of subsequent tasks.
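A compact sketch of this channel re-weighting step is given below, following the standard Squeeze-and-Excitation formulation [28]; the reduction ratio of 16 and the exact point of insertion are assumptions.

```python
import torch
import torch.nn as nn


class SEBlock(nn.Module):
    """Squeeze (global average pooling) -> excitation (bottleneck MLP + sigmoid)
    -> channel-wise rescaling of the input feature map."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        weights = self.fc(x.mean(dim=(2, 3)))     # squeeze: global spatial pooling
        return x * weights.view(b, c, 1, 1)       # excite: rescale each channel


# Applied independently to the hand and object feature maps, e.g.:
# hand_feat = SEBlock(256)(hand_feat); obj_feat = SEBlock(256)(obj_feat)
```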

3.2. Hand–Object Cross-Attention Transformer Module

In the interaction between the hand and the object, complex and interdependent dynamic characteristics frequently emerge, particularly when the hand grasps or manipulates the object. The flexibility and non-rigidity of the hand contrast sharply with the rigid structure of the object. Based on this relationship, we propose the Hand–Object Cross-Attention Transformer (HOCAT) module, as shown in Figure 3. The module takes as input the features extracted from the hand feature map $F_{hand\ feature}$ and the object feature map $F_{obj\ feature}$ using ROIAlign [29] with the hand bounding box.
To enhance the representation ability of the hand features, the module uses a Transformer [30] structure to fuse the main hand features $F_{hand\ main}$ with the minor object features $F_{obj\ minor}$. First, $F_{hand\ main}$ and $F_{obj\ minor}$ are encoded separately using 1 × 1 convolutions to generate the query, key, and value: $F_{hand\ main}$ serves as the query, while $F_{obj\ minor}$ provides the key and value inputs to the Transformer [30] module. Through multi-head attention (MHA), $F_{hand\ main}$ can attend to $F_{obj\ minor}$, using the relatively stable structure of the object to supply information for hand features that may be degraded by hand bending or occlusion. The MHA output is added to the $F_{hand\ main}$ input and passed through layer normalization to maintain feature stability. The normalized features are then processed by a Feed-Forward Network (FFN), whose output is added to the previous result to produce the enhanced hand feature map. Finally, the fused hand features $F_{hand}$ are passed through the hand decoder to obtain the hand joints and 3D pose estimation. For the object, we extract the main object features $F_{obj\ main}$ using ROIAlign [29] with the object bounding box and feed them directly into the object decoder to obtain the 6D object pose.
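The sketch below illustrates this query/key/value fusion with PyTorch's MultiheadAttention; the feature dimension, number of heads, and token layout (flattening the ROI feature maps into sequences) are assumptions made for illustration, not the paper's released implementation.

```python
import torch
import torch.nn as nn


class HOCAT(nn.Module):
    """Hand features (query) attend to object features (key/value),
    followed by residual connections, LayerNorm, and an FFN."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.q_proj = nn.Conv2d(dim, dim, 1)   # 1x1 conv encodings for Q/K/V
        self.k_proj = nn.Conv2d(dim, dim, 1)
        self.v_proj = nn.Conv2d(dim, dim, 1)
        self.mha = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))

    def forward(self, f_hand_main: torch.Tensor, f_obj_minor: torch.Tensor) -> torch.Tensor:
        b, c, h, w = f_hand_main.shape
        tokens = lambda t: t.flatten(2).transpose(1, 2)        # (B, H*W, C) token sequences
        q = tokens(self.q_proj(f_hand_main))
        k = tokens(self.k_proj(f_obj_minor))
        v = tokens(self.v_proj(f_obj_minor))
        attn, _ = self.mha(q, k, v)                            # hand queries attend to object context
        x = self.norm1(tokens(f_hand_main) + attn)             # residual with F_hand_main + LayerNorm
        x = self.norm2(x + self.ffn(x))                        # FFN + residual + LayerNorm
        return x.transpose(1, 2).reshape(b, c, h, w)           # enhanced hand feature map


# f_hand = HOCAT()(torch.randn(2, 256, 32, 32), torch.randn(2, 256, 32, 32))
```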
HOCAT effectively fuses the main hand features with the minor object features, leveraging the cross-attention mechanism to enhance feature representation in hand–object interaction scenarios, providing stronger contextual information and expressiveness for 3D pose estimation.

3.3. Regressor

The hand decoder is composed of a single hourglass network [31], a parameter regression network, and the MANO regression network [32]. The hourglass network [31] takes the fused features $F_{hand}$ as input and produces joint heatmaps $H \in \mathbb{R}^{256 \times 32 \times 32}$. The MANO regression network [32] transforms these features into the MANO pose $\theta \in \mathbb{R}^{48}$ and MANO shape $\beta \in \mathbb{R}^{10}$. With these parameters, the MANO layer constructs the 3D hand mesh $V \in \mathbb{R}^{778 \times 3}$ and the 3D joints $J \in \mathbb{R}^{21 \times 3}$.
The object decoder uses a dual-stream architecture that receives the main object features $F_{obj\ main}$ as input. The first stream predicts the 2D positions of predefined 3D control points of the object through image grid proposals, and the second stream regresses a confidence score for each proposed control point. The decoder uses 21 control points in total (8 corner points, 12 edge midpoints, and 1 center point); shared convolutional layers extract common features, and subsequent independent convolutional layers refine the predictions. Using the PnP algorithm, the predicted 2D control points are matched to the corresponding 3D control points on the object mesh, from which the object’s 6-DoF pose is computed.
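As an illustration of this final step, the snippet below recovers the 6-DoF pose from the predicted control points with OpenCV's solvePnP; the confidence-based filtering and the EPnP flag are assumptions about implementation details the paper does not specify.

```python
import cv2
import numpy as np


def recover_object_pose(points_2d: np.ndarray, points_3d: np.ndarray,
                        conf: np.ndarray, K: np.ndarray, conf_thresh: float = 0.5):
    """Estimate the object's 6-DoF pose from predicted 2D control points.

    points_2d: (21, 2) predicted image coordinates of the control points
    points_3d: (21, 3) corresponding control points on the object mesh/CAD model
    conf:      (21,)   per-point confidence scores from the second stream
    K:         (3, 3)  camera intrinsic matrix
    """
    keep = conf > conf_thresh                          # drop low-confidence proposals (assumption)
    ok, rvec, tvec = cv2.solvePnP(
        points_3d[keep].astype(np.float64),
        points_2d[keep].astype(np.float64),
        K.astype(np.float64), None,                    # no lens distortion assumed
        flags=cv2.SOLVEPNP_EPNP,
    )
    if not ok:
        raise RuntimeError("PnP failed; not enough reliable control points")
    R, _ = cv2.Rodrigues(rvec)                         # axis-angle -> rotation matrix
    return R, tvec                                     # object rotation and translation
```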

3.4. Loss Function

The loss function for hand pose estimation combines multiple losses, including those for 2D keypoints, 3D keypoints, and parameter predictions based on the MANO model. 2D Keypoint Detection Loss ($L_H$): It measures the $L_2$ loss between the predicted 2D keypoint heatmap and the ground-truth heatmap, ensuring that the model accurately predicts the hand joint locations in the image plane.
$$L_H = \frac{1}{N} \sum_{i=1}^{N} \left\| \hat{H}_i - H_i \right\|_2^2$$
where $\hat{H}_i$ represents the 2D heatmap predicted by the model, $H_i$ is the ground-truth 2D heatmap, and $N$ is the number of keypoints.
3D Keypoint Loss ($L_{3d}$): To ensure that the model accurately predicts the 3D structure of the hand, an $L_2$ loss is computed between the predicted 3D keypoint coordinates and the ground-truth coordinates.
$$L_{3d} = \frac{1}{N} \sum_{i=1}^{N} \left\| \hat{J}_{3D,i} - J_{3D,i} \right\|_2^2$$
where $\hat{J}_{3D,i}$ represents the predicted 3D keypoint coordinates and $J_{3D,i}$ the ground-truth 3D keypoint coordinates.
MANO Model Parameter Loss ($L_{mano}$): The shape and pose parameters of the hand are predicted with the MANO model, and an $L_2$ loss is computed between the predicted pose parameters $\hat{\theta}$ and shape parameters $\hat{\beta}$ and their ground-truth counterparts.
$$L_{mano} = \left\| \hat{\theta} - \theta \right\|_2^2 + \left\| \hat{\beta} - \beta \right\|_2^2$$
where $\hat{\theta}$ denotes the predicted hand pose parameters, $\theta$ the ground-truth pose parameters, $\hat{\beta}$ the predicted shape parameters, and $\beta$ the ground-truth shape parameters.
The total loss function for hand pose estimation combines the above three losses, with weighting parameters $\alpha_h$, $\alpha_{3d}$, and $\alpha_{mano}$ controlling the relative importance of each part during optimization. The total hand loss is as follows:
$$L_{hand} = \alpha_h L_H + \alpha_{3d} L_{3d} + \alpha_{mano} L_{mano}$$
With this combined loss function, the hand decoder can optimize both 2D and 3D keypoint predictions while ensuring the accuracy of the hand pose and shape parameters based on the MANO model, thus enhancing the overall precision of hand 3D reconstruction.
The loss function for object pose estimation combines a keypoint coordinate loss and a confidence score loss. Keypoint Coordinate Loss ($L_{p2d}$): For object keypoint prediction, an $L_2$ loss measures the discrepancy between the predicted 2D keypoint coordinates and the ground-truth coordinates.
$$L_{p2d} = \frac{1}{N} \sum_{i=1}^{N} \left\| \hat{P}_i - P_i \right\|_2^2$$
where $\hat{P}_i$ represents the model-predicted keypoint coordinates, $P_i$ denotes the ground-truth keypoint coordinates, and $N$ is the number of keypoints.
Confidence Loss ($L_{conf}$): This loss measures the difference between the predicted confidence score for each keypoint and its ground-truth value, optimized with an $L_2$ loss. The confidence score reflects the model’s certainty in each keypoint prediction; by comparing against the actual keypoints, the model lowers the confidence of inaccurate predictions.
$$L_{conf} = \frac{1}{N} \sum_{i=1}^{N} \left\| \hat{C}_i - C_i \right\|_2^2$$
where $\hat{C}_i$ represents the predicted confidence score and $C_i$ is the ground-truth confidence label.
The total loss function for the object combines the keypoint coordinate loss and the confidence loss, with weight parameters $\alpha_{p2d}$ and $\alpha_{conf}$ adjusting their relative importance during optimization. The total object loss is as follows:
$$L_{obj} = \alpha_{p2d} L_{p2d} + \alpha_{conf} L_{conf}$$
By combining these two loss functions, the object decoder is not only able to accurately predict the coordinates of keypoints but also enhances confidence in keypoint predictions, ensuring greater reliability in the model’s keypoint estimations.
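A minimal sketch of how the weighted hand and object terms can be combined in PyTorch is shown below, using the weights reported in Section 4.1; the dictionary keys are illustrative, and the use of mean-squared error for every term simply follows the displayed equations.

```python
import torch
import torch.nn.functional as F


def hand_object_loss(pred: dict, gt: dict,
                     w_h: float = 100.0, w_3d: float = 10000.0, w_mano: float = 1.0,
                     w_p2d: float = 500.0, w_conf: float = 100.0) -> torch.Tensor:
    """Weighted sum of the hand and object losses; key names are illustrative."""
    l_h = F.mse_loss(pred["heatmap"], gt["heatmap"])            # L_H: 2D keypoint heatmaps
    l_3d = F.mse_loss(pred["joints3d"], gt["joints3d"])         # L_3d: 3D joint coordinates
    l_mano = (F.mse_loss(pred["theta"], gt["theta"])            # L_mano: pose + shape parameters
              + F.mse_loss(pred["beta"], gt["beta"]))
    l_p2d = F.mse_loss(pred["ctrl2d"], gt["ctrl2d"])            # L_p2d: object control points
    l_conf = F.mse_loss(pred["conf"], gt["conf"])               # L_conf: control-point confidence
    l_hand = w_h * l_h + w_3d * l_3d + w_mano * l_mano          # total hand loss
    l_obj = w_p2d * l_p2d + w_conf * l_conf                     # total object loss
    return l_hand + l_obj
```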

4. Experiments and Results

4.1. Implementation Details

All experiments were run on a virtual GPU (vGPU) with 32 GB of video memory. All implementations were completed in PyTorch (torch >= 1.0.1.post2) [33]. The network was trained with the Adam optimizer [34] using a weight decay of 5 × 10−4 and a mini-batch size of 32. All models were trained for 120 epochs. The initial learning rate was set to 1 × 10−4 and reduced every 10 epochs. Input images were resized to 256 × 256 and augmented with scaling, rotation, translation, and color adjustments. The loss weights $\alpha_h$, $\alpha_{3d}$, $\alpha_{mano}$, $\alpha_{p2d}$, and $\alpha_{conf}$ were set to 100, 10,000, 1, 500, and 100, respectively.
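The following sketch shows the optimizer and learning-rate schedule described above; the StepLR decay factor of 0.9 is an assumption (only the 10-epoch decay interval is stated), and `model`, `train_loader`, and `hand_object_loss` (from the sketch in Section 3.4) are assumed to exist.

```python
import torch
import torch.nn as nn


def build_optimizer(model: nn.Module):
    # Adam with weight decay 5e-4 and initial learning rate 1e-4, as stated above.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=5e-4)
    # Learning rate reduced every 10 epochs; the factor 0.9 is an assumption.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.9)
    return optimizer, scheduler


# Illustrative training skeleton:
# optimizer, scheduler = build_optimizer(model)
# for epoch in range(120):                      # 120 epochs, mini-batch size 32
#     for images, targets in train_loader:      # 256x256 augmented RGB crops
#         loss = hand_object_loss(model(images), targets)
#         optimizer.zero_grad(); loss.backward(); optimizer.step()
#     scheduler.step()
```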

4.2. Datasets and Evaluation Metrics

HO3D V2. The HO3D V2 dataset [12] is specifically designed for 3D vision tasks involving hand–object interaction, as shown in Figure 4a. It contains 66,034 training samples and 11,524 evaluation samples, and test results can be evaluated online via the official server. For hand pose estimation, we report the Procrustes-aligned mean per-joint position error (PAMPJPE) and mean per-vertex position error (PAMPVPE), both in millimeters, together with the F-scores computed after Procrustes alignment.
Dex-YCB. The Dex-YCB dataset [13] is a large-scale dataset specifically designed for studying hand–object interaction, as shown in Figure 4b. Dex-YCB provides hand-grasping-object data for tasks such as 2D object and keypoint detection, 6D object pose estimation, and 3D hand pose estimation. The data were captured with eight RGB-D cameras at 30 fps, recording RGB images and depth maps from eight viewpoints at a resolution of 640 × 480. Over 1000 sequences were recorded, totaling 582,000 RGB-D frames. For hand pose estimation, we report the Procrustes-aligned mean per-joint position error (PAMPJPE) and the mean per-joint position error without Procrustes alignment (MPJPE), both in millimeters.
For 6D object pose estimation, we report the percentage of objects with an average vertex error falling within 10% of the object’s diameter (ADD-0.1D).
In our quantitative evaluation, we use the following metrics:
PAMPJPE measures the average position error of each joint after applying Procrustes alignment. Procrustes alignment is a geometric transformation used to eliminate differences in rotation, scaling, and translation between the predicted and ground-truth poses, thereby facilitating a more equitable evaluation of hand pose estimation accuracy. A lower PAMPJPE indicates a higher accuracy of the model in predicting hand poses.
PAMPVPE measures the discrepancy between the predicted and actual positions of 3D vertices in a hand or object mesh. Specifically, PAMPVPE evaluates the average Euclidean distance between the predicted and ground-truth 3D vertex positions after global alignment. A smaller PAMPVPE indicates that the predicted 3D mesh closely matches the actual mesh, reflecting higher model prediction accuracy.
F-Score is the harmonic mean of precision and recall between the predicted mesh and the ground-truth mesh at a given distance threshold; F@5 and F@15 use thresholds of 5 mm and 15 mm, respectively.
ADD-0.1D calculates the mean distance between 3D model points transformed using the ground-truth pose and those transformed by the predicted pose. For asymmetric objects, pose estimation is deemed accurate if the mean distance of model points falls within 10% of the object’s diameter.
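For reference, the snippet below computes the Procrustes-aligned error used by PAMPJPE above for a single hand: a similarity transform (rotation, scale, translation) is fitted between prediction and ground truth before measuring the mean joint error. This follows the standard alignment procedure and is not taken from the authors' evaluation code.

```python
import numpy as np


def pa_mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """Procrustes-aligned mean per-joint position error.
    pred, gt: (21, 3) predicted and ground-truth 3D hand joints (same units, e.g. mm)."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g                      # remove translation
    K = p.T @ g                                        # 3x3 cross-covariance
    U, s, Vt = np.linalg.svd(K)
    Z = np.eye(3)
    Z[-1, -1] = np.sign(np.linalg.det(U @ Vt))         # avoid reflections
    R = Vt.T @ Z @ U.T                                 # optimal rotation
    scale = np.trace(R @ K) / (p ** 2).sum()           # optimal isotropic scale
    aligned = scale * p @ R.T + mu_g                   # aligned prediction
    return float(np.linalg.norm(aligned - gt, axis=1).mean())


# err_mm = pa_mpjpe(pred_joints, gt_joints)   # lower is better
```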

4.3. Experimental Results and Analysis

Our method demonstrates outstanding performance across multiple benchmarks, particularly in hand pose estimation. Our approach shows significant improvements in hand PAMPJPE, PAMPVPE, F@5, and F@15.
On the HO3D V2 dataset, our method is highly competitive with the current state-of-the-art methods in hand pose estimation, as shown in Table 1. In terms of key metrics, we achieve 9.1 mm PAMPJPE and 9.0 mm PAMPVPE, along with F@5 and F@15 scores of 56.6% and 95.8%. This indicates that our method reaches the current advanced level of accuracy in hand pose estimation and maintains small errors in both mesh reconstruction and keypoint localization.
Table 2 shows the comparison results of 6D object pose estimation on the HO3D V2 dataset. Liu et al. [7] achieved higher ADD-0.1D scores (67.7% vs. our 60.8%) on the HO3D V2 dataset. This is primarily because their framework employs a Contextual Reasoning (CR) module, which extracts features from hand–object interaction regions and uses them as context to enhance object features. This design effectively captures the spatial and semantic relationships between hands and objects, leading to improved object pose estimation. In contrast, the design of HOCAT focuses more on enhancing hand feature learning, while the independent modeling of objects is relatively weaker, which affects object pose estimation in certain categories but improves hand pose estimation (PAMPJPE 9.1 mm vs. Liu’s 10.1 mm).
Our method was also compared with existing methods on the Dex-YCB dataset, and the results show improvements in both hand pose estimation and 6D object pose estimation. Table 3 presents the comparison: our method achieves 13.7 mm MPJPE and 5.65 mm PAMPJPE, outperforming the existing methods.
Table 4 shows the comparison results of 6D object pose estimation on the Dex-YCB dataset, covering multiple common object categories. Our method achieves an average ADD-0.1D score of 31.8%, an improvement of 2 percentage points over Liu et al. [7] (29.8%), which further validates the effectiveness of our model for 3D object pose estimation.

4.4. Ablation Study

In the ablation experiments, we evaluated the performance contributions of the multi-branch feature pyramid and the Hand–Object Cross-Attention Transformer (HOCAT) module for hand pose estimation on the HO3D V2 dataset. Figure 5 presents qualitative results for each module, providing a more intuitive view of their respective effects. As shown in Table 5, the baseline model, which uses neither the multi-branch feature pyramid nor HOCAT and extracts features only through the ResNet-50 backbone, achieves a PAMPJPE of 11.0 mm, a PAMPVPE of 10.9 mm, and F@5 and F@15 scores of 48.3 and 93.5, respectively. These results serve as the reference for the model without any of the proposed enhancement modules. Without HOCAT, relying solely on the multi-branch feature pyramid to extract hand and object features, the model achieves a PAMPJPE of 10.1 mm, a PAMPVPE of 10.0 mm, and F@5 and F@15 scores of 51.9 and 94.5, respectively. When HOCAT is introduced, performance improves further and reaches its best results: the PAMPJPE and PAMPVPE drop to 9.1 mm and 9.0 mm, while the F@5 and F@15 scores increase to 56.6 and 95.8, respectively. In addition, the inference time decreases to 0.151 s (6.61 FPS), although the number of parameters increases to 142.5 MB. This demonstrates that HOCAT effectively integrates relevant minor features from the object feature map into the main hand features via the Transformer structure, compensating for the feature loss caused by the hand’s non-rigidity, flexibility, and occlusion, and significantly enhancing the representation of hand features in occluded scenarios.

4.5. Qualitative Analysis

The qualitative results of hand pose estimation on the HO3D v2 dataset are shown in Figure 6, and the qualitative results of object pose estimation are shown in Figure 7. It can be observed that even under severe occlusions, the 3D meshes we generate remain relatively complete. With the help of the HOCAT, our method successfully infers the occluded regions, thereby improving reconstruction performance.

5. Conclusions

In this paper, we address the challenge of occlusion in hand–object interactions by proposing a novel 3D pose estimation framework. This framework integrates a multi-branch feature pyramid and a cross-modal attention mechanism (HOCAT). By sharing low-level features and decoupling high-level features into independent branches for hands and objects, it effectively reduces feature contamination, thereby improving pose estimation accuracy in occlusion scenarios. The HOCAT module leverages the geometric stability of objects as prior knowledge to guide the reconstruction of occluded hand regions, achieving state-of-the-art performance in hand pose estimation (PAMPJPE: 9.1 mm, F-score: 95.8% on the HO3D V2 dataset). Our method holds significant potential for real-world applications, such as enhancing gesture-based interaction and object manipulation capabilities in immersive environments, as well as improving robotic grasping and manipulation tasks by providing accurate 3D hand–object pose estimates. However, while the framework excels in hand pose estimation, it still has limitations in object 6D pose estimation (ADD-0.1D: 60.8% on the HO3D V2 dataset). To address this issue, future research will focus on object pose optimization by integrating CAD model priors or category-level shape constraints to refine object pose estimation. Additionally, we will design lightweight network architectures and employ model compression techniques to enable real-time deployment on edge devices. By tackling these challenges, we aim to advance the state of the art in 3D hand–object pose estimation and promote its broader adoption in practical applications.

Author Contributions

Conceptualization, Y.C., X.W., and Q.G.; methodology, Y.C., H.P., and J.Y.; software, Y.C., Y.L., and X.W.; validation, J.Y., Q.G., Y.L., and X.W.; formal analysis, Y.C., X.W., and Q.G.; investigation, Y.C., X.W., and Q.G.; resources, Y.C., X.W., Y.L., and Q.G.; data curation, Y.C., Q.G., and H.P.; writing—original draft preparation, Y.C.; writing—review and editing, Q.G., J.Y., H.P., and X.W.; visualization, Y.C.; supervision, Q.G. and X.W.; project administration, Y.C., X.W., and Q.G.; funding acquisition, Q.G. and X.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Natural Science Foundation of China (Nos. 62072362 and 12101479), the Shaanxi Provincial Key Industry Innovation Chain Program (No. 2020ZDLGY07-05), the Natural Science Basic Research Plan in Shaanxi Province of China (Nos. 2021JQ-660 and 2024JC-YBMS-531), the Shaanxi Provincial Innovation Capacity Support Programme Project (No. 2024ZC-KJXX-034), and the Xi’an Major Scientific and Technological Achievements Transformation Industrialization Project (No. 23CGZHCYH0008).

Data Availability Statement

We will provide links to the two datasets used in our experiments. HO3D V2: https://github.com/shreyashampali/ho3d (accessed on 21 June 2024); Dex-YCB: https://dex-ycb.github.io/ (accessed on 8 August 2024).

Acknowledgments

We thank the Editor and anonymous Reviewers for their valuable suggestions to improve the quality of this paper.

Conflicts of Interest

The authors declare that they have no known conflicting financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

  1. Huerst, W.; Van Wezel, C. Gesture-based interaction via finger tracking for mobile augmented reality. Multimed. Tools Appl. 2013, 62, 233–258. [Google Scholar] [CrossRef]
  2. Piumsomboon, T.; Clark, A.; Billinghurst, M.; Cockburn, A. User-Defined Gestures for Augmented Reality. In CHI’13 Extended Abstracts on Human Factors in Computing Systems; 2013; Available online: https://dl.acm.org/doi/10.1145/2468356.2468527 (accessed on 3 March 2025).
  3. Chen, P.; Chen, Y.; Yang, D.; Wu, F.; Li, Q.; Xia, Q.; Tan, Y. I2UV-HandNet: Image-to-UV Prediction Network for Accurate and High-fidelity 3D Hand Mesh Modeling. In Proceedings of the IEEE/CVF International Conference on Computer Vision 2021, Montreal, QC, Canada, 17 October 2021. [Google Scholar]
  4. Zhang, B.; Wang, Y.; Deng, X.; Zhang, Y.; Tan, P.; Ma, C.; Wang, H. Interacting Two-Hand 3D Pose and Shape Reconstruction From Single Color Image. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 17 October 2021. [Google Scholar]
  5. Sridhar, S.; Feit, A.M.; Theobalt, C.; Oulasvirta, A. Investigating the Dexterity of Multi-Finger Input for Mid-Air Text Entry. In Proceedings of the ACM Conference on Human Factors in Computing Systems, Seoul, Republic of Korea, 18–23 April 2015. [Google Scholar]
  6. Chen, Y.; Tu, Z.; Kang, D.; Chen, R.; Yuan, J. Joint Hand-Object 3D Reconstruction from a Single Image with Cross-Branch Feature Fusion. IEEE Trans. Image Process. 2021, 30, 4008–4021. [Google Scholar] [CrossRef] [PubMed]
  7. Liu, S.; Jiang, H.; Xu, J.; Liu, S.; Wang, X. Semi-Supervised 3D Hand-Object Poses Estimation with Interactions in Time. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Montreal, QC, Canada, 17 October 2021. [Google Scholar]
  8. Tse, T.H.E.; Kim, K.I.; Leonardis, A.; Chang, H.J. Collaborative Learning for Hand and Object Reconstruction with Attention-guided Graph Convolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 24 June 2022. [Google Scholar]
  9. Hampali, S.; Sarkar, S.D.; Rad, M.; Lepetit, V. Keypoint Transformer: Solving Joint Identification in Challenging Hands and Object Interactions for Accurate 3D Pose Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 24 June 2022. [Google Scholar]
  10. Hasson, Y.; Tekin, B.; Bogo, F.; Laptev, I.; Pollefeys, M.; Schmid, C. Leveraging Photometric Consistency Over Time for Sparsely Supervised Hand-Object Reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 19 June 2020. [Google Scholar]
  11. Huang, L.; Tan, J.; Liu, J.; Yuan, J. Hand-Transformer: Non-Autoregressive Structured Modeling for 3D Hand Pose Estimation; Springer: Cham, Switzerland, 2020. [Google Scholar]
  12. Hampali, S.; Rad, M.; Oberweger, M.; Lepetit, V. HOnnotate: A Method for 3D Annotation of Hand and Object Poses. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 19 June 2020. [Google Scholar]
  13. Chao, Y.W.; Yang, W.; Xiang, Y.; Molchanov, P.; Fox, D. DexYCB: A Benchmark for Capturing Hand Grasping of Objects. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 17 October 2021. [Google Scholar]
  14. Ballan, L.; Taneja, A.; Gall, J.; Gool, L.V.; Pollefeys, M. Motion Capture of Hands in Action Using Discriminative Salient Points. In Proceedings of the Computer Vision–ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, 7–13 October 2012. [Google Scholar]
  15. Oikonomidis, I.; Kyriazis, N.; Argyros, A.A. Full DOF tracking of a hand interacting with an object by modeling occlusions and physical constraints. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2088–2095. [Google Scholar]
  16. Hasson, Y.; Varol, G.; Tzionas, D.; Kalevatykh, I.; Black, M.J.; Laptev, I.; Schmid, C. Learning Joint Reconstruction of Hands and Manipulated Objects. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 19 June 2020. [Google Scholar]
  17. Karunratanakul, K.; Yang, J.; Zhang, Y.; Black, M.; Tang, S. Grasping Field: Learning Implicit Representations for Human Grasps. In Proceedings of the 2020 International Conference on 3D Vision (3DV), Fukuoka, Japan, 25–28 November 2020. [Google Scholar]
  18. Tekin, B.; Bogo, F.; Pollefeys, M. H+O: Unified Egocentric Recognition of 3D Hand-Object Poses and Interactions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  19. Romero, J.; Tzionas, D.; Black, M.J. Embodied Hands: Modeling and Capturing Hands and Bodies Together. arXiv 2022, arXiv:2201.02610. [Google Scholar] [CrossRef]
  20. Boukhayma, A.; Bem, R.D.; Torr, P.H.S. 3D Hand Shape and Pose From Images in the Wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  21. Chen, Y.; Tu, Z.; Kang, D.; Bao, L.; Zhang, Y.; Zhe, X.; Chen, R.; Yuan, J. Model-based 3D Hand Reconstruction via Self-Supervised Learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 17 October 2021. [Google Scholar]
  22. Baek, S.; Kim, K.I.; Kim, T.K. Pushing the Envelope for RGB-Based Dense 3D Hand Pose Estimation via Neural Rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  23. Moon, G.; Lee, K.M. I2L-MeshNet: Image-to-Lixel Prediction Network for Accurate 3D Human Pose and Mesh Estimation from a Single RGB Image. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020. [Google Scholar]
  24. Lin, K.; Wang, L.; Liu, Z. End-to-End Human Pose and Mesh Reconstruction with Transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 17 October 2021. [Google Scholar]
  25. Lin, K.; Wang, L.; Liu, Z. Mesh Graphormer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 17 October 2021. [Google Scholar]
  26. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  27. Lin, T.Y.; Dollar, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  28. Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. Squeeze-and-Excitation Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 2011–2023. [Google Scholar] [CrossRef] [PubMed]
  29. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 42, 386–397. [Google Scholar] [CrossRef] [PubMed]
  30. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
  31. Newell, A.; Yang, K.; Deng, J. Stacked Hourglass Networks for Human Pose Estimation. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016. [Google Scholar]
  32. Zhang, X.; Li, Q.; Mo, H.; Zhang, W.; Zheng, W. End-to-end Hand Mesh Recovery from a Monocular RGB Image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  33. Paszke, A.; Lerer, A.; Killeen, T.; Antiga, L.; Yang, E.; Tejani, A.; Fang, L.; Gross, S.; Bradbury, J.; Lin, Z. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Proceedings of Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019. [Google Scholar]
  34. Kingma, D.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  35. Li, K.; Yang, L.; Zhan, X.; Lv, J.; Xu, W.; Li, J.; Lu, C. ArtiBoost: Boosting Articulated 3D Hand-Object Pose Estimation via Online Exploration and Synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 17 October 2021. [Google Scholar]
  36. Chen, X.; Liu, Y.; Dong, Y.; Zhang, X.; Ma, C.; Xiong, Y.; Zhang, Y.; Guo, X. MobRecon: Mobile-Friendly Hand Mesh Reconstruction from Monocular Image. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 17 October 2021. [Google Scholar]
  37. Aboukhadra, A.T.; Malik, J.N.; Elhayek, A.; Robertini, N.; Stricker, D. THOR-Net: End-to-end Graformer-based Realistic Two Hands and Object Reconstruction with Self-supervision. In Proceedings of the IEEE/cvf Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 8 January 2022; pp. 1001–1010. [Google Scholar]
  38. Fu, Q.; Liu, X.; Xu, R.; Niebles, J.C.; Kitani, K. Deformer: Dynamic Fusion Transformer for Robust Hand Pose Estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 23543–23554. [Google Scholar]
  39. Qi, H.; Zhao, C.; Salzmann, M.; Mathis, A. HOISDF: Constraining 3D Hand-Object Pose Estimation with Global Signed Distance Fields. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 10392–10402. [Google Scholar]
  40. Spurr, A.; Iqbal, U.; Molchanov, P.; Hilliges, O.; Kautz, J. Weakly Supervised 3D Hand Pose Estimation via Biomechanical Constraints. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020. [Google Scholar]
  41. Xu, H.; Wang, T.; Tang, X.; Fu, C. H2ONet: Hand-Occlusion-and-Orientation-Aware Network for Real-Time 3D Hand Mesh Reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 17048–17058. [Google Scholar]
Figure 1. Overview of the framework: The framework consists of backbone, HOCAT, and decoders for hand and object. Initially, the RGB image is processed by the backbone, which separates and optimizes the features of the hand and object. Subsequently, the hand features are fused using HOCAT. Finally, the enhanced hand and object features are input into their respective decoders to estimate the pose.
Figure 2. Heatmap illustration of the feature maps for the hand and object generated by our backbone network.
Figure 3. Structure of the Hand–Object Cross-Attention Transformer module.
Figure 4. Dataset image. (a) Image of HO3D V2 dataset; (b) Image of Dex-YCB dataset.
Figure 5. Qualitative analysis of the hand pose estimation results on HO3D v2. ‘w/o’ represents the removal of a specific module from the model.
Figure 6. Qualitative analysis of the hand pose estimation of the proposed method on HO3D v2.
Figure 7. Qualitative analysis of the object pose estimation of the proposed method on HO3D v2.
Table 1. Comparison of hand pose estimation on the HO3D V2 dataset.
Methods | PAMPJPE ↓ | PAMPVPE ↓ | F@5 ↑ | F@15 ↑ | Object
Hasson et al. [10] | 11.4 | 11.4 | 42.8 | 93.2 | Yes
I2L-MeshNet [23] | 11.2 | 13.9 | 40.9 | 93.2 | No
Hasson et al. [16] | 11.0 | 11.2 | 46.4 | 93.9 | Yes
Hampali et al. [12] | 10.7 | 10.6 | 50.6 | 94.2 | Yes
METRO [24] | 10.4 | 11.1 | 48.4 | 94.6 | No
Liu et al. [7] | 10.1 | 9.7 | 53.2 | 95.2 | Yes
ArtiBoost [35] | 11.4 | 10.9 | 48.8 | 94.4 | Yes
KeypointTrans [9] | 10.8 | - | - | - | Yes
MobRecon [36] | 9.2 | 9.4 | 53.8 | 95.7 | No
THOR-Net [37] | 11.3 | - | - | - | Yes
Deformer [38] | 9.4 | 9.1 | 54.6 | 96.3 | No
HOISDF [39] | 9.6 | - | - | - | Yes
Ours | 9.1 | 9.0 | 56.6 | 95.8 | Yes
Table 2. Comparison results of 6D object pose estimation on the HO3D V2 dataset.
Methods | Cleanser ↑ | Bottle ↑ | Can ↑ | Average ↑
Liu et al. [7] | 88.1 | 61.9 | 53.0 | 67.7
Ours | 93.4 | 40.6 | 48.5 | 60.8
Table 3. Comparison results of hand pose estimation on the Dex-YCB dataset.
Methods | MPJPE ↓ | PAMPJPE ↓ | Object
Spurr et al. [40] | 17.3 | 6.83 | No
METRO [24] | 15.2 | 6.99 | No
Liu et al. [7] | 15.2 | 6.58 | Yes
MobRecon [36] | 14.2 | 6.40 | No
Xu et al. [41] | 14.0 | 5.70 | No
Ours | 13.7 | 5.65 | Yes
Table 4. Comparison results of 6D object pose estimation on the Dex-YCB dataset.
Object (ADD-0.1D ↑) | Liu et al. [7] | Ours
master chef can | 34.2 | 26.4
cracker box | 56.4 | 73.2
sugar box | 42.4 | 44.8
tomato soup can | 17.1 | 10.3
mustard bottle | 44.3 | 53.6
tuna fish can | 11.9 | 7.4
pudding box | 36.4 | 32.7
gelatin box | 25.6 | 25.6
potted meat can | 21.9 | 23.9
banana | 16.4 | 21.2
pitcher base | 36.9 | 44.1
bleach cleanser | 46.9 | 48.0
bowl | 30.2 | 32.9
mug | 18.5 | 15.1
power drill | 36.6 | 47.8
wood block | 38.5 | 44.0
scissors | 12.9 | 13.2
large marker | 2.8 | 2.5
extra large clamp | 38.9 | 42.0
foam brick | 27.5 | 27.8
average | 29.8 | 31.8
Table 5. The results of the ablation study on the HO3D V2 dataset.
Architectures | PAMPJPE ↓ | PAMPVPE ↓ | F@5 ↑ | F@15 ↑ | Inference Time (s) | FPS | Number of Parameters (MB)
Baseline | 11.0 | 10.9 | 48.3 | 93.5 | 0.164 | 6.43 | 118.1
w/o HOCAT | 10.1 | 10.0 | 51.9 | 94.5 | 0.152 | 6.59 | 131.6
Ours | 9.1 | 9.0 | 56.6 | 95.8 | 0.151 | 6.61 | 142.5
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
