The proposed methodology is structured around a dual-stream architecture that integrates object detection, pose estimation, and action recognition for effective human–object interaction analysis. A YOLOv8 [7] model first performs object detection and segmentation on the RGB image to identify the key elements in the scene: the human body, the hands, and the objects. In the first stream, the detected human body is processed by a whole-body human pose estimation network [15] to extract the 3D body pose directly; this information is fed into an action recognition network to classify interaction states. In the second stream, the segmented hand and object regions are analyzed for spatial intersection in 2D and 3D using depth information (i.e., intersection over union and depth proximity are checked). If an object is detected within the hand region, the cropped hand–object region is processed by a unified hand–object pose estimation network, which regresses hand Model with Articulated and Non-rigid defOrmations (MANO) [41] hand parameters and object keypoints. The hand parameters are then refined to generate a 3D hand model using the MANO framework, while the object's 6D pose is estimated with the Perspective-n-Point (PnP) algorithm [29]. The objects detected by YOLOv8 [7] are tracked using the DeepSORT [42] algorithm to maintain object identity across frames. The overall process is structured as in Figure 1. Additionally, Figure 2 provides a flowchart of the complete model that illustrates each step of the process; each section of the flowchart is color-coded to make it easier to follow.
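To make the data flow concrete, the following sketch outlines one possible per-frame orchestration of the two streams; every callable (detect_and_track, wholebody_pose, hand_object_pose, hands_intersect_object, recognize_action) is a hypothetical placeholder for the corresponding component described above.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Minimal containers for the per-frame outputs described in the text.
@dataclass
class FrameResult:
    body_pose_3d: object                 # whole-body 3D pose (stream 1)
    hand_object_pose: Optional[object]   # MANO params + object 6D pose (stream 2), if any
    action: object                       # recognized interaction state

def process_frame(rgb, depth,
                  detect_and_track: Callable,       # YOLOv8 + DeepSORT (detection, segmentation, IDs)
                  wholebody_pose: Callable,         # OSX-style 3D whole-body pose estimator
                  hand_object_pose: Callable,       # unified MANO + object-keypoint regressor + PnP
                  hands_intersect_object: Callable, # 2D IoU + depth-proximity test
                  recognize_action: Callable) -> FrameResult:
    """One possible per-frame orchestration of the dual-stream pipeline (sketch only)."""
    detections = detect_and_track(rgb)                    # humans, hands, objects with persistent IDs
    body_pose = wholebody_pose(rgb, detections.person)    # stream 1: always available

    ho_pose = None
    if hands_intersect_object(detections.hands, detections.objects, depth):
        # Stream 2 only runs when an object lies within or near the hand region.
        ho_pose = hand_object_pose(rgb, depth, detections.hands, detections.objects)

    action = recognize_action(body_pose, ho_pose)         # late fusion handles a missing stream 2
    return FrameResult(body_pose, ho_pose, action)
```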
3.1. Object Detection, Segmentation, and Tracking
YOLOv8 [7] is a state-of-the-art object detection and segmentation model that builds upon previous YOLO [5] architectures while incorporating several key advancements in accuracy, speed, and efficiency. It features an improved CSPDarknet-based backbone, enhancing feature extraction through optimized convolutional layers and a more effective spatial pyramid pooling mechanism. The model adopts an anchor-free detection approach, reducing computational overhead while maintaining high detection precision, and its detection head predicts object bounding boxes and class probabilities directly from the feature maps, improving real-time performance. For segmentation, YOLOv8 extends the detection head with additional mask prediction layers, enabling instance segmentation by generating fine-grained object masks alongside bounding boxes. Because it performs both object localization and pixel-level segmentation, the model is well suited to tasks that require fine-grained scene understanding. YOLOv8 has been evaluated in diverse real-world domains, such as autonomous driving, robotic perception, and medical image analysis, demonstrating stronger generalization than previous YOLO architectures. Its efficiency and ability to handle complex scenes make it a robust choice for real-time vision-based applications.
Due to these advantages, we employ YOLOv8 for object, human, and hand detection and segmentation using a transfer learning approach. Starting from a pre-trained YOLOv8 model, we fine-tune it on a custom dataset containing annotated images of hands, objects, and humans. Transfer learning allows the model to retain the robust feature representations learned from large-scale datasets while adapting to the detection and segmentation of the elements relevant to human–robot interaction scenarios. We further refine the detection accuracy during training using data augmentation, an adaptive learning rate setting, and fine-grained annotations refined from model predictions, which help handle occlusions and adapt to hand–object interactions. Using multiple synchronized views obtained through active perception provides more complete information about the objects of interest in the scene, reducing misses caused by inter-frame motion and enabling accurate depth perception.
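As a sketch of this transfer-learning step using the Ultralytics API, fine-tuning might look as follows; the checkpoint choice, the hypothetical dataset file hands_objects.yaml, and the hyperparameter values are illustrative, not the exact settings used in this work.

```python
from ultralytics import YOLO

# Start from a pre-trained YOLOv8 segmentation checkpoint (transfer learning).
model = YOLO("yolov8m-seg.pt")

# Fine-tune on a custom dataset of annotated humans, hands, and objects.
# "hands_objects.yaml" is a hypothetical dataset description file (paths + class names).
model.train(
    data="hands_objects.yaml",
    epochs=100,
    imgsz=640,
    lr0=1e-3,        # initial learning rate; the built-in scheduler adapts it during training
)

# Run detection + instance segmentation on a new frame.
results = model("frame_0001.png")   # path is illustrative
boxes = results[0].boxes            # bounding boxes, class ids, confidences
masks = results[0].masks            # instance segmentation masks
```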
DeepSORT (Deep Simple Online and Real-time Tracker) [42] is an advanced object-tracking algorithm that enhances the original SORT (Simple Online and Real-time Tracker) by integrating deep-learning-based appearance descriptors. Unlike SORT, which relies solely on the Kalman filter and motion information for tracking, DeepSORT introduces a deep appearance feature extractor based on a Convolutional Neural Network (CNN). This addition improves object association across frames, reducing identity switches and improving robustness in challenging scenarios such as occlusions and rapid motion. The tracker utilizes Mahalanobis distance and cosine similarity between deep feature embeddings to effectively associate detections with existing tracks.
Tracking moving objects reliably and maintaining continuity across frames require information from previous detections; we therefore apply DeepSORT to the YOLOv8 detections of humans, hands, and objects, using appearance descriptors from previous detections. DeepSORT assigns an ID to each detected object so that previously identified objects keep consistent IDs in subsequent frames. This combination allows hands and objects to be tracked with spatial precision in human–robot interactions, which benefits a wide range of downstream tasks such as action recognition and handover state estimation.
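A minimal per-frame detection-and-tracking loop is sketched below, assuming the deep-sort-realtime package as a stand-in for DeepSORT; the video path, confidence threshold, and tracker settings are illustrative.

```python
import cv2
from ultralytics import YOLO
from deep_sort_realtime.deepsort_tracker import DeepSort  # assumed tracker package

detector = YOLO("yolov8m-seg.pt")   # fine-tuned human/hand/object detector
tracker = DeepSort(max_age=30)      # appearance-based association (DeepSORT)

cap = cv2.VideoCapture("handover_sequence.mp4")  # illustrative input
while True:
    ok, frame = cap.read()
    if not ok:
        break

    result = detector(frame, verbose=False)[0]
    detections = []
    for box in result.boxes:
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        conf = float(box.conf[0])
        cls = int(box.cls[0])
        if conf < 0.5:
            continue
        # DeepSORT expects (left, top, width, height), confidence, class.
        detections.append(([x1, y1, x2 - x1, y2 - y1], conf, cls))

    # Associate detections with existing tracks; IDs stay consistent across frames.
    tracks = tracker.update_tracks(detections, frame=frame)
    for track in tracks:
        if not track.is_confirmed():
            continue
        l, t, r, b = track.to_ltrb()
        cv2.rectangle(frame, (int(l), int(t)), (int(r), int(b)), (0, 255, 0), 2)
        cv2.putText(frame, f"id {track.track_id}", (int(l), int(t) - 5),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
cap.release()
```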
3.2. 3D Whole-Body Pose Tracking
To track the whole human body pose in our framework, we utilize an existing 3D whole-body pose estimation method, namely OSX [15]. This method reconstructs a detailed 3D representation of the human body, including body, hands, and face, in a unified manner. The network follows a one-stage pipeline, directly regressing a parametric human body model from an input image. Unlike multi-stage approaches that separately estimate body and hand poses, this model employs a component-aware transformer to capture fine-grained interactions between different body parts. The architecture integrates the SMPL-X model, which represents the whole human body using shape and pose parameters, allowing seamless recovery of body components in a single pass.
The framework comprises a feature extraction backbone, a transformer-based attention module, and a regression head. The input image is passed through a CNN, which extracts spatial features. These features are then passed to a component-aware transformer, which learns long-range dependencies and inter-component relationships between the body, hands, and face. Transformer attention is also applied to the coordinate-based representation, adding a further level of attention in the head-generation process and allowing more focused modeling. The predicted output consists of the SMPL-X [26] parameters, which define the global body pose, shape, hand articulation, and facial expression coefficients.
During inference, the model takes an RGB image as input and directly outputs the 3D human mesh along with articulated hand and face poses. The component-aware transformer processes the input efficiently to refine predictions, making the method robust to occlusions and changing body poses. We keep the existing pose estimation approach; the key difference in our implementation lies in the person detection module. The original framework uses YOLOv5 [43] to detect the human region of interest before estimating the pose. In contrast, we integrate YOLOv8 [7], which provides superior accuracy and faster inference. YOLOv8 has an improved backbone with deeper and more efficient convolutional layers, enabling better feature extraction and detection performance.
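Since our only change to this stage is the person detector, the integration reduces to cropping the YOLOv8 person boxes and passing them to the whole-body regressor, as in the sketch below; osx_regressor is a hypothetical callable standing in for the OSX network and is assumed to return the SMPL-X parameters.

```python
from typing import Callable, List
from ultralytics import YOLO

PERSON_CLASS_ID = 0  # assumed class index for "person" in the fine-tuned model

def wholebody_poses(frame, osx_regressor: Callable, detector: YOLO) -> List[dict]:
    """Detect people with YOLOv8 and regress SMPL-X parameters for each person crop."""
    result = detector(frame, verbose=False)[0]
    poses = []
    for box in result.boxes:
        if int(box.cls[0]) != PERSON_CLASS_ID:
            continue
        x1, y1, x2, y2 = map(int, box.xyxy[0].tolist())
        crop = frame[y1:y2, x1:x2]
        # osx_regressor is a placeholder for the OSX one-stage whole-body model;
        # it is assumed to return SMPL-X body pose, shape, hand, and face parameters.
        poses.append(osx_regressor(crop))
    return poses
```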
3.3. Unified Hand–Object Pose Estimation
To estimate the hand–object pose, we first detect the hand and the hand-held object using the YOLOv8 [7] architecture. After detection, we crop the corresponding regions from both the RGB image and the aligned depth image. These cropped hand–object regions are then processed by two distinct streams for feature extraction. In the first stream, the RGB crop is passed through the EfficientNet-B0 [44] architecture to extract hand–object features, while in the second stream, the aligned depth crop is processed in the same way. Both streams operate independently to capture the unique information in each modality. The output features from the RGB and depth streams are then forwarded to an adaptive attention fusion model, which fuses the features from both streams into a unified representation.
The adaptive attention fusion output is then fed to two independent deformable transformer encoders, which improve the feature maps by encoding spatial relationships and pose details. A cross-attention mechanism models the interaction between hand and object: it lets each feature set attend to the other, further encoding hand–object dependencies. Finally, the attended features are fed into their respective decoders. The hand decoder regresses the MANO [41] hand parameters, while the object decoder regresses the object keypoints, enabling accurate pose estimation for both the hand and the object. The overall process is illustrated in Figure 3.
3.3.1. EfficientNet-B0 Architecture
EfficientNet-B0 [44] is a highly efficient Convolutional Neural Network (CNN) that balances accuracy and computational cost. It is part of the EfficientNet family, which uses a compound scaling method to improve model performance without excessive computation, and it is designed to be lightweight while achieving state-of-the-art performance on various image classification tasks. The input RGB crop is passed through the network, and after six stages of convolutional and pooling layers a downsampled output feature map is obtained. Each stage in the architecture consists of a series of operations, including convolution, batch normalization, activation functions, and pooling. The use of depth-wise separable convolutions reduces the number of parameters and the computational cost compared to traditional convolutions. The layer information is given in Table 1. The depth crop is processed in the same manner as the RGB crop: the input depth image is forwarded through a separate EfficientNet-B0 network to extract meaningful features, independently of the RGB stream, so that the depth information contributes to a comprehensive understanding of the scene. Sample feature outputs from EfficientNet-B0 for the RGB and depth streams are illustrated in Figure 4 (the feature maps are resized to a common resolution for visualization); they show that the features extracted from the two inputs are different yet highly related to the hand–object region. Leveraging the depth image in this way enhances spatial understanding and contributes to more accurate hand–object pose estimation by providing additional context.
Table 1 summarizes the key layers of EfficientNet-B0, together with their properties and parameters:
Conv1: The initial convolutional layer reduces the spatial dimensions of the input image by a factor of 2 using a 3 × 3 kernel with stride 2.
MBConv Layers: These are the depthwise separable convolution blocks (MBConv), which help reduce the number of parameters and computations. Each MBConv block consists of a depthwise convolution followed by a pointwise convolution. The sizes of the output feature maps progressively decrease, with the last MBConv layer producing the final, most downsampled feature map.
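For illustration, truncated EfficientNet-B0 backbones for the two streams can be built with torchvision as sketched below; the truncation point is chosen so that the output has 80 channels (matching the description above), the 224 × 224 crop size is illustrative, and replicating the single-channel depth crop to three channels is an assumption.

```python
import torch
import torch.nn as nn
from torchvision.models import efficientnet_b0, EfficientNet_B0_Weights

def make_backbone() -> nn.Sequential:
    """EfficientNet-B0 truncated so that the output feature map has 80 channels."""
    full = efficientnet_b0(weights=EfficientNet_B0_Weights.DEFAULT)
    # features[0] is the stem conv; features[1:5] are the first MBConv stages.
    return nn.Sequential(*list(full.features.children())[:5])

rgb_backbone = make_backbone()    # stream 1: cropped RGB hand-object region
depth_backbone = make_backbone()  # stream 2: aligned depth crop (weights not shared)

rgb_crop = torch.randn(1, 3, 224, 224)                         # illustrative crop size
depth_crop = torch.randn(1, 1, 224, 224).repeat(1, 3, 1, 1)    # depth replicated to 3 channels

f_rgb = rgb_backbone(rgb_crop)        # (1, 80, 14, 14) for a 224 x 224 input
f_depth = depth_backbone(depth_crop)
print(f_rgb.shape, f_depth.shape)
```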
3.3.2. Adaptive Attention Fusion Mechanism
From Figure 4, we observe that the features extracted from the two input streams are distinct yet related. These features can be fused in various ways. A straightforward approach is to combine the RGB and depth feature maps directly; however, this may cause information loss or introduce redundancy. To address this, we learn adaptive attention parameters that allow the network to adjust dynamically and capture both feature sets effectively. The output feature maps from the two EfficientNet-B0 [44] streams, which have identical dimensions (one for the RGB input and one for the depth input), are forwarded to the Adaptive Attention Fusion (AAF) mechanism. The main task of AAF is to fuse the features of both streams in a meaningful way: attention is used to select and combine the features while preserving the spatial information that carries the relevant modality correlations. The proposed adaptive fusion mechanism learns the correlation between the RGB and depth modalities, assigning different attention weights to the RGB and depth inputs and thereby improving performance. This enables the system to prioritize the more informative parts of each feature map, resulting in a refined, complementary representation of the scene.
Given the feature maps $F_{\mathrm{RGB}}$ and $F_{\mathrm{D}}$ from the two streams, the fusion process can be mathematically formulated as:

$$F_{\mathrm{fused}} = A_{\mathrm{RGB}} \odot F_{\mathrm{RGB}} + A_{\mathrm{D}} \odot F_{\mathrm{D}},$$

where:
$F_{\mathrm{RGB}}$ and $F_{\mathrm{D}}$ are the feature maps from the RGB and depth streams, respectively;
$A_{\mathrm{RGB}}$ and $A_{\mathrm{D}}$ represent the attention maps for the RGB and depth feature maps, which are learned during the fusion process;
$F_{\mathrm{fused}}$ is the final fused feature map, which combines the information from both modalities.
The attention maps $A_{\mathrm{RGB}}$ and $A_{\mathrm{D}}$ are generated by applying a lightweight attention mechanism to the respective feature maps. They focus on the spatial locations and channels that carry the most essential information for the downstream tasks. We use a simple yet effective mechanism that combines channel and spatial attention: spatial attention learns to focus on critical spatial regions of the feature maps, while channel attention captures the most informative channels. The two are combined to produce a final attention map for each stream.
The channel attention map $A_c$ is obtained by performing a global average pooling operation on the feature map and passing the result through a lightweight fully connected layer, which allows the model to assign higher weights to more informative channels:

$$A_c = \sigma\!\left(W_{fc} \cdot \frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W} F(i,j)\right),$$

where:
$F(i,j)$ denotes the feature map at spatial location $(i,j)$,
$W_{fc}$ is the weight matrix of the fully connected layer,
$\sigma$ is the sigmoid activation function,
$H$ and $W$ are the height and width of the feature map.
The spatial attention map $A_s$ is computed by applying convolutional operations over the entire feature map, which allows the model to focus on important spatial regions:

$$A_s = \mathrm{Conv}(F),$$

where $\mathrm{Conv}$ refers to a convolutional operation applied to the feature map $F$. The final attention map for each modality is the product of the spatial and channel attention maps:

$$A = A_c \odot A_s,$$

where $\odot$ denotes element-wise multiplication.
Once the attention maps $A_{\mathrm{RGB}}$ and $A_{\mathrm{D}}$ are computed, the fusion process blends the features from both streams according to their importance, as defined by the attention weights. The final fused feature map $F_{\mathrm{fused}}$ is then used as the input for subsequent tasks, such as hand–object pose estimation or action recognition. The result is a feature map that incorporates the most relevant information from the RGB and depth streams, enabling better performance in downstream tasks.
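A minimal PyTorch sketch of the AAF block described above is given below; the channel count, the 7 × 7 spatial-attention kernel, and the use of a sigmoid on the spatial map are assumptions for illustration rather than the exact design.

```python
import torch
import torch.nn as nn

class AdaptiveAttentionFusion(nn.Module):
    """Fuses RGB and depth feature maps with per-stream channel and spatial attention."""

    def __init__(self, channels: int = 80):
        super().__init__()
        # Channel attention: global average pooling + lightweight fully connected layer.
        self.channel_fc_rgb = nn.Linear(channels, channels)
        self.channel_fc_depth = nn.Linear(channels, channels)
        # Spatial attention: a convolution over the feature map (kernel size is an assumption).
        self.spatial_conv_rgb = nn.Conv2d(channels, 1, kernel_size=7, padding=3)
        self.spatial_conv_depth = nn.Conv2d(channels, 1, kernel_size=7, padding=3)
        self.sigmoid = nn.Sigmoid()

    def _attention(self, feat, fc, conv):
        b, c, _, _ = feat.shape
        # Channel attention map A_c: (B, C, 1, 1)
        a_c = self.sigmoid(fc(feat.mean(dim=(2, 3)))).view(b, c, 1, 1)
        # Spatial attention map A_s: (B, 1, H, W); sigmoid here is an assumption.
        a_s = self.sigmoid(conv(feat))
        return a_c * a_s  # combined attention map, broadcast to (B, C, H, W)

    def forward(self, f_rgb, f_depth):
        a_rgb = self._attention(f_rgb, self.channel_fc_rgb, self.spatial_conv_rgb)
        a_depth = self._attention(f_depth, self.channel_fc_depth, self.spatial_conv_depth)
        # Element-wise weighted blend of the two modalities.
        return a_rgb * f_rgb + a_depth * f_depth

fusion = AdaptiveAttentionFusion(channels=80)
fused = fusion(torch.randn(1, 80, 14, 14), torch.randn(1, 80, 14, 14))
```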
3.3.3. Deformable Transformer Encoder Network (DTEN)
After these six stages, the output feature map from EfficientNet-B0 [44] has 80 channels at a reduced spatial resolution. This feature map is forwarded to a DTEN, which adopts the deformable transformer encoder idea from ref. [45] to enhance spatial adaptability and focus on relevant regions using learnable sampling offsets. Before the feature map is passed to the DTEN, it undergoes a set of preparation steps; the DTEN itself consists of multiple layers, each performing Deformable Multi-Head Attention (DMHA) [45] followed by a Feedforward Network (FFN).
3.3.4. Deformable Multi-Head Attention (DMHA)
Instead of attending to all pixels globally (as in a standard transformer), DMHA selectively samples relevant key locations using learned offsets.
For a given query feature $z_q$ at reference position $p_q$, attention over the feature map $x$ is computed as:

$$\mathrm{DMHA}(z_q, p_q, x) = \sum_{m=1}^{M} W_m \left[ \sum_{k=1}^{K} A_{mqk} \cdot W_m' \, x\big(p_q + \Delta p_{mqk}\big) \right],$$

where
$M$ is the number of attention heads,
$K$ is the number of sampled key points per head,
$\Delta p_{mqk}$ are the learnable offsets for adaptive spatial sampling, and
$A_{mqk}$ are the learned attention weights ($W_m$ and $W_m'$ are the per-head projection matrices). This mechanism ensures that the network focuses only on important spatial regions without the computational overhead of full self-attention. After deformable attention, a position-wise feedforward network is applied to enhance feature extraction:

$$\mathrm{FFN}(x) = W_2\,\sigma\big(W_1 x + b_1\big) + b_2,$$

where $W_1$ and $W_2$ are learnable weight matrices, $b_1$ and $b_2$ are biases, and $\sigma$ is a non-linear activation. After multiple layers of deformable attention and feedforward transformations, the final output retains the same spatial resolution but has a different feature dimension.
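For concreteness, a simplified single-scale version of this deformable attention can be sketched in PyTorch as below, using bilinear sampling (torch.nn.functional.grid_sample) at the predicted offset locations; the feature dimension, number of heads, and number of sampling points are illustrative and not the exact settings of our network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDeformableAttention(nn.Module):
    """Single-scale, simplified deformable attention in the spirit of DMHA.

    Each query predicts K sampling offsets and attention weights per head, samples the
    value map at those offset locations with bilinear interpolation, and takes a
    weighted sum.
    """

    def __init__(self, dim: int = 256, heads: int = 8, points: int = 4):
        super().__init__()
        self.heads, self.points, self.head_dim = heads, points, dim // heads
        self.offset_proj = nn.Linear(dim, heads * points * 2)   # learnable offsets
        self.weight_proj = nn.Linear(dim, heads * points)       # attention weights
        self.value_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, queries, ref_points, feat):
        # queries: (B, N, C); ref_points: (B, N, 2) normalized to [-1, 1]; feat: (B, C, H, W)
        b, n, _ = queries.shape
        value = self.value_proj(feat.flatten(2).transpose(1, 2))             # (B, HW, C)
        value = value.transpose(1, 2).reshape(b * self.heads, self.head_dim, *feat.shape[2:])

        offsets = self.offset_proj(queries).view(b, n, self.heads, self.points, 2)
        weights = self.weight_proj(queries).view(b, n, self.heads, self.points).softmax(dim=-1)

        # Sampling grid: reference point + learned offset, per head and sampling point.
        grid = ref_points[:, :, None, None, :] + offsets                     # (B, N, heads, K, 2)
        grid = grid.permute(0, 2, 1, 3, 4).reshape(b * self.heads, n, self.points, 2)

        sampled = F.grid_sample(value, grid, align_corners=False)            # (B*heads, D, N, K)
        weights = weights.permute(0, 2, 1, 3).reshape(b * self.heads, 1, n, self.points)
        out = (sampled * weights).sum(-1)                                    # (B*heads, D, N)
        out = out.reshape(b, self.heads * self.head_dim, n).transpose(1, 2)  # (B, N, C)
        return self.out_proj(out)

attn = SimpleDeformableAttention(dim=256, heads=8, points=4)
q = torch.randn(2, 100, 256)
ref = torch.rand(2, 100, 2) * 2 - 1
out = attn(q, ref, torch.randn(2, 256, 14, 14))   # (2, 100, 256)
```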
Table 2 summarizes the integration of EfficientNet-B0 with the deformable transformer.
Integrating the DTEN with EfficientNet-B0 [44] improves spatial adaptability while keeping feature extraction efficient. The learned sampling offsets allow the model to focus on significant regions, making it highly effective for human–object interaction tasks.
3.3.5. Cross Attention Mechanism for Hand–Object Interaction
After obtaining the refined hand and object features from their respective DTENs, we apply a cross-attention mechanism [46] to learn the dependencies and interactions between the hand and the object features. This step is crucial for accurately capturing the spatial and semantic relationships that define human–object interactions. To establish these interdependencies, both feature sets are projected into query ($Q$), key ($K$), and value ($V$) representations. The overall cross-attention process is illustrated in Figure 5. The cross-attention score between the hand ($H$) and object ($O$) features is computed using a scaled dot-product attention mechanism:

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$

where $d_k$ is the key dimension. The final attended feature maps are computed as follows:

$$F_H' = F_H + \mathrm{Attn}(Q_H, K_O, V_O), \qquad F_O' = F_O + \mathrm{Attn}(Q_O, K_H, V_H),$$

where $F_H'$ is the refined hand feature after attending to the object features and $F_O'$ is the refined object feature after attending to the hand features. The residual connections preserve the original features while enhancing them with contextual information from the cross-attention mechanism.
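A compact PyTorch sketch of this bidirectional cross-attention with residual connections is shown below; the token dimension and number of heads are illustrative.

```python
import torch
import torch.nn as nn

class HandObjectCrossAttention(nn.Module):
    """Bidirectional cross-attention between hand and object token features.

    Hand tokens attend to object tokens and vice versa, with residual connections
    preserving the original features.
    """

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.hand_to_obj = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.obj_to_hand = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, hand_feat, obj_feat):
        # hand_feat, obj_feat: (B, N, C) token sequences from the two DTENs.
        attended_hand, _ = self.hand_to_obj(query=hand_feat, key=obj_feat, value=obj_feat)
        attended_obj, _ = self.obj_to_hand(query=obj_feat, key=hand_feat, value=hand_feat)
        # Residual connections keep the original features while adding context.
        return hand_feat + attended_hand, obj_feat + attended_obj

xattn = HandObjectCrossAttention()
h, o = xattn(torch.randn(1, 196, 256), torch.randn(1, 196, 256))
```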
3.3.6. Hand Decoder
The output of the cross-attention module is forwarded to the hand decoder, which estimates 2D heatmaps of the finger joints and subsequently refines them with a mesh regression network similar to [35]. The complete pipeline consists of two networks: a lightweight Hourglass Network [20] that predicts 2D heatmaps of the finger joints, and a Mesh Regression Network that refines the hand pose and shape parameters using the MANO [41] model.
The Hourglass Network is a symmetric encoder–decoder architecture designed for keypoint detection. Given an input feature map, the network produces a set of 2D heatmaps in which each channel corresponds to a specific joint.
Table 3 describes the layers of the Hourglass Network.
After the 2D heatmaps are obtained from the Hourglass Network, the intermediate feature representations of the network are concatenated with the predicted heatmaps and forwarded to the Mesh Regression Network. The combined representation is processed through a series of fully connected layers to regress the hand pose and shape parameters. Specifically, the network predicts the pose parameters $\theta$, which define the joint rotations, and the shape parameters $\beta$, which capture the overall hand shape variations. Once estimated, these parameters are used as inputs to the MANO model, a differentiable parametric hand model that maps them to a set of 3D hand mesh vertices $V$ and corresponding 3D joint locations $J$:

$$(V, J) = \mathcal{M}(\theta, \beta),$$

where $\mathcal{M}$ denotes the differentiable function of the MANO model. By leveraging both learned deep features and a parametric hand model, the proposed architecture ensures robust and accurate estimation of the hand structure in 3D space.
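The mesh-regression stage can be sketched as below; mano_layer is a placeholder for a differentiable MANO layer mapping (θ, β) to mesh vertices and joints, and the hidden sizes and parameter dimensions (48-dimensional pose, 10-dimensional shape) are common MANO conventions assumed for illustration.

```python
import torch
import torch.nn as nn

class ManoRegressionHead(nn.Module):
    """Regresses MANO pose and shape parameters from heatmap + feature input (sketch)."""

    def __init__(self, in_dim: int, mano_layer: nn.Module,
                 pose_dim: int = 48, shape_dim: int = 10):
        super().__init__()
        self.regressor = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(inplace=True),
            nn.Linear(512, 256), nn.ReLU(inplace=True),
            nn.Linear(256, pose_dim + shape_dim),
        )
        self.pose_dim = pose_dim
        self.mano_layer = mano_layer  # differentiable: (theta, beta) -> (vertices, joints)

    def forward(self, heatmaps, features):
        # Concatenate predicted 2D joint heatmaps with intermediate hourglass features.
        x = torch.cat([heatmaps.flatten(1), features.flatten(1)], dim=1)
        params = self.regressor(x)
        theta, beta = params[:, :self.pose_dim], params[:, self.pose_dim:]
        vertices, joints = self.mano_layer(theta, beta)
        return theta, beta, vertices, joints
```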
3.3.7. Object Decoder
To estimate the 6D object pose, we employ a decoding architecture similar to the method proposed in [35]. The object decoder takes as input the attended feature representation obtained after cross-attention and forwards it through six convolutional layers to regress the object keypoints and their confidence scores. These predictions are then passed to the PnP [29] algorithm to obtain the 6D object pose.
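A minimal sketch of the final PnP step with OpenCV is given below, assuming the eight corners of the object's 3D bounding cuboid are known in the object frame and that the camera intrinsics are available.

```python
import cv2
import numpy as np

def cuboid_pnp(keypoints_2d: np.ndarray, cuboid_3d: np.ndarray,
               camera_matrix: np.ndarray):
    """Recovers the 6D object pose from predicted 2D cuboid keypoints via PnP.

    keypoints_2d:  (8, 2) predicted corner locations in the image (pixels).
    cuboid_3d:     (8, 3) corresponding corners in the object's local frame (metres).
    camera_matrix: (3, 3) intrinsics of the RGB camera.
    """
    dist_coeffs = np.zeros(5)  # assume an undistorted (rectified) image
    ok, rvec, tvec = cv2.solvePnP(
        cuboid_3d.astype(np.float64),
        keypoints_2d.astype(np.float64),
        camera_matrix.astype(np.float64),
        dist_coeffs,
        flags=cv2.SOLVEPNP_ITERATIVE,
    )
    rotation, _ = cv2.Rodrigues(rvec)  # 3x3 rotation matrix
    return ok, rotation, tvec
```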
3.3.8. Overall Loss Function
The total loss function, denoted $\mathcal{L}_{\mathrm{total}}$, is formulated as the sum of the hand pose estimation loss $\mathcal{L}_{\mathrm{hand}}$ and the object pose estimation loss $\mathcal{L}_{\mathrm{obj}}$:

$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{hand}} + \mathcal{L}_{\mathrm{obj}}.$$
The hand loss function $\mathcal{L}_{\mathrm{hand}}$ is defined as:

$$\mathcal{L}_{\mathrm{hand}} = \lambda_{2D}\,\mathcal{L}_{2D} + \lambda_{3D}\,\mathcal{L}_{3D} + \lambda_{\mathrm{MANO}}\,\mathcal{L}_{\mathrm{MANO}},$$

where $\mathcal{L}_{2D}$ represents the L2 loss applied to the predicted 2D hand joint locations; $\mathcal{L}_{3D}$ corresponds to the L2 loss imposed on the estimated 3D joint positions $J$ and hand mesh vertices $V$; $\mathcal{L}_{\mathrm{MANO}}$ denotes the L2 loss on the MANO [41] model parameters, specifically the shape parameter $\beta$ and the pose parameter $\theta$; and $\lambda_{2D}$, $\lambda_{3D}$, and $\lambda_{\mathrm{MANO}}$ are weighting coefficients that regulate the contribution of each loss term.
Similarly, the object loss function $\mathcal{L}_{\mathrm{obj}}$ is given by:

$$\mathcal{L}_{\mathrm{obj}} = \lambda_{\mathrm{kp}}\,\mathcal{L}_{\mathrm{kp}} + \lambda_{\mathrm{conf}}\,\mathcal{L}_{\mathrm{conf}},$$

where $\mathcal{L}_{\mathrm{kp}}$ is the L1 loss applied to the predicted 2D cuboid keypoints, $\mathcal{L}_{\mathrm{conf}}$ represents the confidence loss imposed on the cuboid predictions, and $\lambda_{\mathrm{kp}}$ and $\lambda_{\mathrm{conf}}$ are coefficients balancing the impact of the respective loss terms. These loss functions collectively ensure that the network learns both hand and object pose estimation while maintaining consistency between the 2D and 3D representations.
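A sketch of the combined objective in PyTorch is given below; the dictionary keys, the loss weights, and the use of a binary cross-entropy for the confidence term are assumptions for illustration, not the exact choices of the paper.

```python
import torch
import torch.nn.functional as F

def hand_object_loss(pred, target,
                     w_2d=1.0, w_3d=1.0, w_mano=0.1, w_kp=1.0, w_conf=0.5):
    """Combined hand + object pose loss mirroring the terms described above (sketch)."""
    # Hand terms: L2 on 2D joints, on 3D joints + mesh vertices, and on MANO parameters.
    l_2d = F.mse_loss(pred["joints_2d"], target["joints_2d"])
    l_3d = (F.mse_loss(pred["joints_3d"], target["joints_3d"]) +
            F.mse_loss(pred["vertices"], target["vertices"]))
    l_mano = (F.mse_loss(pred["theta"], target["theta"]) +
              F.mse_loss(pred["beta"], target["beta"]))
    loss_hand = w_2d * l_2d + w_3d * l_3d + w_mano * l_mano

    # Object terms: L1 on the 2D cuboid keypoints plus a confidence loss
    # (shown here as binary cross-entropy, which is an assumption).
    l_kp = F.l1_loss(pred["cuboid_2d"], target["cuboid_2d"])
    l_conf = F.binary_cross_entropy_with_logits(pred["conf"], target["conf"])
    loss_obj = w_kp * l_kp + w_conf * l_conf

    return loss_hand + loss_obj
```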
Figure 6 illustrates, as a flowchart, the outputs of the hand decoder and the object decoder.
3.4. Skeleton-Based Action Recognition
Building on the previous subsections, where we extracted 3D whole-body poses and unified hand–object poses, we now utilize this comprehensive information to recognize actions essential for human–robot interaction, such as object handover. In our earlier research [18], we developed a method for action recognition using skeletal data, a key component for seamless human–robot interaction. That study introduced a self-attention mechanism to enhance action recognition models, moving beyond traditional Recurrent Neural Networks (RNNs) to improve feature extraction and sequence modeling, resulting in more accurate and efficient predictions. Expanding on this initial work, we now refine our approach by incorporating hand and object pose information, leading to more precise action recognition for interactive robotic applications.
The whole-body pose estimation provides a continuous stream of data, whereas the hand–object pose estimation is intermittent, activated only when an object is near or held by the hand. To handle this variability in data availability, we employ a two-stream action recognition framework with late-stage fusion: the continuous body pose data from the first stream are combined with the intermittent hand–object pose data from the second stream at a later stage, enhancing action recognition accuracy while accommodating the varying availability of the inputs. The framework therefore uses two types of data. The first stream is always available and contains the full-body human skeleton of the person, giving an overview of how the person moves from one position to another. The second stream incorporates hand–object pose data, which are valuable for recognizing more complex interactions but may sometimes be missing or unavailable. Certain actions, such as waving or crossing hands, can be recognized effectively using only the whole-body skeleton data from Stream 1, whereas more complex actions, like handing over or placing objects, require the additional context provided by the hand–object pose data in Stream 2. This creates a dynamic dependency between the data streams, where one stream may be complete while the other may not be available.
To further enhance the hand–object pose estimation, we add Kalman filter tracking once a hand–object interaction is detected. The Kalman filter predicts and smooths the estimated 6D pose of the held object, improving the robustness and stability of tracking over time. Once the hand–object pose is initially detected, the filter continuously estimates the object's position and orientation, even in the presence of noise or missing data, by combining previous pose estimates with new measurements from the hand–object pose detection module. This enables real-time tracking that is resilient to intermittent occlusions and variations in the pose data.
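As an illustration, a constant-velocity Kalman filter for smoothing the object's translation can be set up as below (assuming the filterpy package); orientation smoothing would be handled analogously, and the noise magnitudes are illustrative.

```python
import numpy as np
from filterpy.kalman import KalmanFilter  # assumed filtering library

def make_position_filter(dt: float = 1.0 / 30.0) -> KalmanFilter:
    """Constant-velocity Kalman filter for smoothing the object's 3D translation (sketch)."""
    kf = KalmanFilter(dim_x=6, dim_z=3)   # state: [x, y, z, vx, vy, vz]
    kf.F = np.eye(6)
    kf.F[:3, 3:] = dt * np.eye(3)         # position integrates velocity
    kf.H = np.hstack([np.eye(3), np.zeros((3, 3))])  # we only measure position
    kf.R = np.eye(3) * 0.01               # measurement noise (pose-detector jitter)
    kf.Q = np.eye(6) * 1e-4               # process noise
    return kf

kf = make_position_filter()
for measurement in [np.array([0.40, 0.02, 0.65]), None, np.array([0.41, 0.03, 0.64])]:
    kf.predict()
    if measurement is not None:           # update only when a detection is available
        kf.update(measurement)
    smoothed_position = kf.x[:3].ravel()  # smoothed estimate, even for missed detections
```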
In this action recognition framework, we define two sets of actions based on the availability of the data streams. Stream 1 provides continuous whole-body human skeleton data, including hand poses, and is always available. Stream 2 provides hand–object pose data but is not always available, as it is only activated when an object is in or near the hand. If Stream 2 is unavailable, the system defaults to recognizing actions from the first set, which relies only on the human skeleton data; these actions include basic gestures and body movements, such as waving, walking towards the robot, and pointing at an object.
To address this challenge, we propose a two-stream network for action recognition. The first stream processes the 3D whole-body human skeleton data, including the hand joints, using the well-known Spatio-Temporal Graph Convolutional Network (ST-GCN) [47] architecture to extract spatial and temporal features. The second stream encodes the 3D hand pose and the 6D object pose using two graph convolutional layers, which capture the spatial relationships between the hand joints and the object. The encoded features from this second stream are then passed through a Transformer Encoder, as presented in our previous work [18], to model the temporal dependencies and refine the action representations. The corresponding block diagram is presented in Figure 7.
When the 3D hand and object pose data are available, we fuse the outputs from both streams and forward the combined feature representation to a series of three fully connected layers, interspersed with normalization and activation functions, to recognize the action. When the 3D hand–object pose estimation is missing, we inject zero vectors in place of the second stream's output and pass the resulting features through the same fully connected layers to detect the action, ensuring robust performance even when some input data are absent. To train the network, we use the cross-entropy loss given by:

$$\mathcal{L} = -\sum_{i=1}^{N} y_i \log(\hat{y}_i),$$

where $N$ is the number of classes, $y_i$ is the true label (one-hot encoded vector), $\hat{y}_i$ is the predicted probability for class $i$, and $\mathcal{L}$ is the total loss. This loss function measures the difference between the predicted class probabilities ($\hat{y}$) and the true class labels ($y$), and it is minimized during training to improve the model's performance.
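A minimal sketch of the late-fusion classification head, including the zero-vector injection when the hand–object stream is missing and cross-entropy training, is given below; the feature dimensions, layer widths, and number of classes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class TwoStreamActionHead(nn.Module):
    """Late-fusion classification head for the two-stream action recognizer (sketch).

    body_dim / ho_dim stand for the feature sizes produced by the ST-GCN stream and the
    hand-object (GCN + transformer encoder) stream; all layer widths are assumptions.
    """

    def __init__(self, body_dim: int = 256, ho_dim: int = 128, num_classes: int = 12):
        super().__init__()
        self.ho_dim = ho_dim
        self.classifier = nn.Sequential(
            nn.Linear(body_dim + ho_dim, 256), nn.BatchNorm1d(256), nn.ReLU(inplace=True),
            nn.Linear(256, 128), nn.BatchNorm1d(128), nn.ReLU(inplace=True),
            nn.Linear(128, num_classes),
        )

    def forward(self, body_feat, hand_object_feat=None):
        if hand_object_feat is None:
            # Stream 2 unavailable: inject a zero vector in its place.
            hand_object_feat = body_feat.new_zeros(body_feat.size(0), self.ho_dim)
        return self.classifier(torch.cat([body_feat, hand_object_feat], dim=1))

head = TwoStreamActionHead()
logits = head(torch.randn(4, 256), None)                           # hand-object stream missing
loss = nn.CrossEntropyLoss()(logits, torch.tensor([0, 3, 1, 5]))   # cross-entropy training
```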