1. Introduction
Bionic robot hands play an important role in human–machine collaboration and rehabilitation engineering, improving the quality of life of amputees. When a prosthetic hand grasps an object, the finger and wrist joints require dexterous control. Nevertheless, conventional myoelectric control, which relies on the nervous system of the amputation stump, is inadequate for the manipulation requirements of prostheses [1]. To reduce the burden on the user, machine intelligence is needed to assist with delicate control according to human intent. In recent years, there has been a growing trend towards shared control frameworks for multi-degree-of-freedom prosthetic hands. Many studies have attached a camera to the prosthetic hand to capture object and environmental information, assisting the user with pre-shape adjustment of the prosthetic hand or control of the wrist joint. The authors of [2,3] installed a depth camera on the wrist to capture object information and thereby adjust the wrist angle and select pre-grasp gestures. Other studies, such as [4,5,6,7], installed RGB cameras on prosthetic hands, relying only on RGB images to achieve the same results.
There are several shortcomings in current machine-vision-based shared control of prosthetic hands. First, current research does not consider the prediction of grasp pose. In [3,6], for example, the work essentially addresses an object–gesture classification problem, which requires the user to set up an appropriate approach posture during the pre-shape process. The aim is only to grasp, without considering whether the different contact areas and approach directions are compatible with everyday human habits. Additionally, the pre-shape performance depends mainly on the camera capture perspective. In [3,4], a predictive model learns directly from the captured image, making the application highly viewpoint-dependent and, thus, difficult to generalize to multi-view deployments.
To address the abovementioned requirements, we drew inspiration from several cross-cutting areas, including human motion analysis and robot grasp planning. In studies of human motion analysis, researchers have undertaken a considerable amount of work on data collection and analysis. In [8,9], human upper limbs were modeled with different degrees of freedom for daily activities, and clustering analysis was performed. In [10,11], whole-body motion data of humans during grasping were collected, and a generative model was trained to produce grasping motion sequences for new objects. Similarly, in [12], grasping behaviors were collected and released as a dataset. The authors of [13,14,15] collected and analyzed data on the contact area and hand posture during object grasping. In [16], different grasping poses of objects were generated for everyday scenes, while, in [17], future human motion was predicted from a known human posture and environment. Much work has also been carried out in robotics on planning and predicting grasping poses. The authors of [18] used a cVAE to generate various grasping poses of objects for a two-finger gripper. In [19,20], an object-centered implicit neural field was created that encodes the distance from a spatial point to the grasping pose, while the authors of [21,22] generated dexterous grasping poses and gestures using generative models.
However, a review of research advances in human motion analysis and robot grasp planning suggests that these two fields are not yet well integrated. Furthermore, grasp pose prediction that takes human behaviour analysis as its starting point is still lacking in the shared control of prosthetic hands. Therefore, we propose a grasp pose prediction method based on a motion prior model that predicts grasping poses for multiple grasping gestures and arbitrary approach directions, providing a theoretical basis for semi-autonomous control of the prosthetic hand and wrist. The core idea of this paper is to build an object-centred prior field of human grasping trajectories, to map each trajectory point to a final grasping pose, and thereby to produce a model that maps every point in space to a certain grasping pose.
To overcome these challenges, it is important to explore a predictive model oriented towards the shared control of prosthetic hands, capable of predicting pre-shape gestures and grasping postures in advance and laying the foundation for joint control. In summary, we propose a new grasp pose prediction framework for prosthetic hand shared control based on a motion prior field. We establish an object-centered motion prior field of grasp motion trajectories and create a prediction model that predicts grasp poses in advance, covering arbitrary approach directions and multiple pre-shape types. This avoids the dependence of similar work on the camera capture perspective. Note that we define the grasp pose as consisting of two parts: the 6-DOF wrist pose in the object coordinate system and the pre-shape type. The former is defined in the same way as the end-effector pose of a general robotic arm, while the latter is specific to grasping prosthetic hands. In the following sections, we explain the hardware and the testbed in Section 2. This is followed by a detailed introduction to the motion prior field in Section 2.2.1, Section 2.2.2, Section 2.2.3 and Section 2.2.4. Finally, we analyse the comparative results under different design parameters in Section 3, including the prediction accuracy of the motion prior field and the predicted grasp pose error over the sequence.
2. Materials and Methods
We established an object-centered motion prior field, and used each spatial point in this field to represent the pose of the hand relative to the object. Therefore, in order to integrate human motion habits as prior information into the prior field, we collected a large number of motion sequences of human hand approaches and grasps. The sequences obtained contained two pieces of information: the hand pose in the trajectory and the hand pose when grasping. We sought to produce a mapping model based on these collected data, which would map the hand pose in the trajectory to the hand pose when grasping. With this mapping relationship, any point in space can be regarded as a hand pose in the trajectory. Therefore, we can predict the pose of the hand when it finally grasps an object before it even touches it.
As shown in Figure 1, this work is divided into two main parts: hand–object localisation and grasp pose prediction. We first obtain the relative hand–object pose using a motion capture system, feed this pose into the learned prediction model MPFNet, and finally output the predicted grasp pose. This is performed while the hand is still some distance from the object, with the intention of predicting the possible grasping poses in advance and laying the foundation for later wrist joint control.
To achieve this, we capture motion sequence information with the motion capture device, establish the object-centred prior field of grasping motion trajectories, and design the prediction model MPFNet to map arbitrary hand poses in space to grasping poses.
2.1. Hardware and Test Bed
The core step is the construction of a prior field of motion trajectories, which requires accurate capture of hand and object poses throughout the approach–grasp process. As shown in Figure 2, we prepared 19 types of everyday objects and placed 8 motion capture devices evenly around the table. Eight OptiTrack (NaturalPoint, Inc., Corvallis, OR, USA) motion capture cameras were used, arranged around the four corners, with an acquisition period of 70 µs; four 9 mm diameter markers were installed on the hand and several 3 mm diameter markers were installed on each object. For hand pose tracking, we used a glove and installed two markers each at the wrist and palm sections to create a rigid body model of the hand. Note that we treated the wrist pose as the overall hand pose and ignored the specific posture of each finger, as the wrist pose is more important for predicting the grasp pose, while the finger configuration is determined by the pre-shape gesture.
2.2. Motion Prior Field
In this work, we define the grasp pose as consisting of two parts: the pre-shape type and the wrist 6D pose, as shown in Figure 3. Since our future work is mainly oriented towards shared control of prosthetic hands, and the fingers of prosthetic hands are far less dexterous than human fingers, we do not include the finger posture as part of the grasp pose, but use the more general pre-shape type instead of the joint configuration of the fingers. Specifically, our choice of pre-shape types is derived from the taxonomy criteria in [23]; two to three pre-shape types can be assigned manually according to the different grasping parts of each class of objects. The pre-shape type, denoted c, is a discrete variable. The wrist 6D pose g is a Euclidean transformation that represents the wrist pose, in the object coordinate system, at the moment the hand contacts the object; it is a continuous variable in Euclidean space.
2.2.1. Data Collection
In order to cover as many objects as possible that are used by ordinary people in daily life, while retaining a standard benchmark, we chose 19 objects from YCB-Video [24], all of which are standard parts of the YCB object set [25] (http://ycbbenchmarks.org/, accessed on 10 June 2023). For each object, subjects grasped it with two or three pre-shape types, covering as many approach directions and approach angles as possible.
To ensure that the acquisition process covers the whole space in which the human upper arm moves, we followed the object placement principle of [26], as shown in Figure 4. The rays start at 0° and go to 160° at 20° intervals. Along each ray, objects are placed at two distances, r1 = 30 mm and r2 = 60 mm. We discarded the leftmost ray because humans normally do not grasp objects in this angular range. Placing the object at two positions on each ray, one near and one far, covers as much of the grasping motion trajectory as possible. Eight subjects (six males and two females) participated in the data collection, performing an average of 500 grasps per person.
In addition to placing the objects in different positions, we also rotated the objects themselves. At each placement position, we rotated the objects at random angles for multiple grasps. For each object, the subjects covered as many approach directions and approach angles as possible; thus, we were able to construct a variety of grasping poses (including various pre-shape types and a large number of wrist 6D poses) for the object over a 360° spatial range. We performed a large number of grasps for 19 objects in all directions and for various pre-shape types, and obtained a total of about 4200 accurate grasp poses, of which each pre-shape type for each type of object corresponded to 100 wrist 6D poses.
After completing each “approach–grasp” motion sequence, we post-processed the motion trajectory corresponding to each grasp pose. The purpose of post-processing was as follows:
The initial hand pose was the same at the beginning of each motion capture, so we excluded the first 10% of the motion sequence;
The purpose of our modeling was to help the system complete the prediction when the hand was far away from the object, so trajectory points too close to the grasp position were less valuable; hence, we excluded the last 30% of the data for the motion sequence;
Since the hand did not move uniformly during the approach, and the acquisition frequency of the motion capture device was fixed, the collected trajectory points were not uniformly distributed in the sequence; thus, we interpolated the whole segment of data to ensure that the data tended to be uniformly distributed in space.
Finally, each grasping pose corresponded, on average, to 50 trajectory points.
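As an illustration of these three post-processing steps, the following minimal Python/NumPy sketch trims the first 10% and last 30% of a sequence and resamples the remainder so that points are approximately uniformly spaced along the path. The array layout and the helper name are assumptions for illustration, not taken from our implementation.

```python
import numpy as np

def postprocess_trajectory(positions: np.ndarray, n_out: int = 50) -> np.ndarray:
    """Trim and spatially resample one approach-grasp trajectory.

    positions: (T, 3) array of wrist positions in the object frame, ordered in time.
    n_out:     number of approximately uniformly spaced points to keep.
    """
    T = len(positions)
    # 1) Drop the first 10% (identical starting pose) and the last 30%
    #    (points too close to the grasp are less valuable for early prediction).
    kept = positions[int(0.1 * T): int(0.7 * T)]

    # 2) Resample so points are uniform in arc length rather than in time.
    seg = np.linalg.norm(np.diff(kept, axis=0), axis=1)
    arc = np.concatenate([[0.0], np.cumsum(seg)])      # cumulative path length
    targets = np.linspace(0.0, arc[-1], n_out)         # uniform arc-length targets
    resampled = np.stack(
        [np.interp(targets, arc, kept[:, d]) for d in range(3)], axis=1
    )
    return resampled
```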
2.2.2. Constructing the Motion Prior Field
In order to establish the object-centered motion trajectory prior field, it is necessary to transform the hand pose at each moment of the “approach–grasp” process into the object coordinate system. Let the poses of the hand and the object in the motion capture (world) coordinate system be $^{W}\mathbf{T}_{H}$ and $^{W}\mathbf{T}_{O}$, respectively; then the transformation relationship is

$$^{O}\mathbf{T}_{H} = \left(^{W}\mathbf{T}_{O}\right)^{-1}\,{}^{W}\mathbf{T}_{H}.$$
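In code, this transformation is a single matrix operation. The sketch below assumes poses are stored as 4×4 homogeneous matrices (a representational assumption for illustration; the storage format is not prescribed by the text):

```python
import numpy as np

def hand_pose_in_object_frame(T_world_hand: np.ndarray,
                              T_world_object: np.ndarray) -> np.ndarray:
    """Express the hand pose in the object coordinate system.

    Both inputs are 4x4 homogeneous transforms measured by the motion
    capture system in its world frame.
    """
    # T_object_hand = (T_world_object)^-1 @ T_world_hand
    return np.linalg.inv(T_world_object) @ T_world_hand
```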
Figure 5 shows the constructed prior field for a subset of the objects: each color represents a pre-shape type, and one to two samples of each pre-shape type are shown. In the scatterplot on the right, each point is a trajectory point of the hand motion corresponding to one specific grasp pose, and all the trajectory points together constitute the motion prior field.
As a result, we map each trajectory point in space to a specific grasp pose. In the motion prior field, each point can therefore be represented as a vector

$$\mathbf{p} = (\mathbf{q}, \mathbf{t}, c, g),$$

where $\mathbf{q}$ denotes the current rotation of the hand (as a quaternion) and $\mathbf{t}$ denotes its translation, while c and g describe the grasp pose to which this point is mapped, i.e., the pre-shape type c and the wrist pose g. Next, before training the prediction model, we need to consider how to pre-process such a large amount of sampled data. One idea is to cluster the wrist 6D poses.
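For concreteness, one possible in-memory representation of a prior-field point, pairing the hand pose on the trajectory with the grasp pose it maps to, might look as follows. The field names and shapes are hypothetical, chosen only to mirror the notation above.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class PriorFieldPoint:
    """One point of the motion prior field (illustrative layout)."""
    q: np.ndarray           # (4,) wrist rotation on the trajectory, as a quaternion
    t: np.ndarray           # (3,) wrist translation on the trajectory, object frame
    pre_shape: int          # c: discrete pre-shape type of the mapped grasp pose
    wrist_pose: np.ndarray  # g: (4, 4) wrist 6D pose at grasp, object frame
```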
Since each pre-shape type will correspond to a large number of wrist 6D poses, and these wrist poses are uniformly distributed in space in all directions, we have two ways to deal with the wrist poses: the first is to cluster them to obtain a number of averaged wrist poses; the second is not to cluster them and to treat each wrist posture as a separate individual.
Figure 6a shows the result with clustering; the middle layer corresponds to the averaged cluster manifolds. In the motion prior field, this means that a whole "patch" of trajectory points is mapped to one averaged pose, which makes the field more linearly separable in space. Figure 6b shows the result without clustering, which makes the field less linearly separable in space. It is worth noting that, as the number of cluster manifolds increases, Figure 6a converges to Figure 6b: if the number of cluster manifolds equals the number of individual grasp poses (100 in our data), there is effectively no clustering. We explore the effect of the number of cluster manifolds on the model's predictions in our experiments.
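As a rough sketch of the clustering option, the snippet below clusters only the translational part of the wrist poses of one pre-shape type with k-means and takes the sample nearest each centroid as the averaged representative. This is an assumption for illustration; the text does not prescribe a particular clustering algorithm or distance on SE(3).

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_wrist_poses(wrist_poses: np.ndarray, n_clusters: int) -> np.ndarray:
    """Cluster the wrist 6D poses of one pre-shape type into averaged manifolds.

    wrist_poses: (N, 4, 4) homogeneous wrist poses at grasp, object frame.
    Returns (n_clusters, 4, 4): the pose nearest each translation centroid,
    used here as the cluster's representative grasp pose.
    """
    translations = wrist_poses[:, :3, 3]
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(translations)
    representatives = []
    for k in range(n_clusters):
        dist = np.linalg.norm(translations - km.cluster_centers_[k], axis=1)
        representatives.append(wrist_poses[np.argmin(dist)])
    return np.stack(representatives)
```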
Since the online process inevitably has errors in the RGB-camera-based 6D pose estimation of the object, the reconstructed hand pose will deviate from the true value by roughly 1–5 cm in both rotation and translation. These deviations make it difficult for the prediction model to converge, so the robustness of the system to pose estimation errors needs to be improved by adding noise. At the sampling frequency of the motion capture system, the spatial separation of adjacent trajectory points during a normal approach of the human hand to the target is about 2 cm, so we add Gaussian noise to both the rotation and the translation of the wrist in the trajectory points.
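The following sketch shows one way to perturb a trajectory point with Gaussian rotation and translation noise using SciPy. The standard deviations are placeholders (the rotation value in particular is not taken from our setup); only the translation scale of roughly 2 cm is motivated by the inter-point spacing mentioned above.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def add_pose_noise(q: np.ndarray, t: np.ndarray,
                   rot_sigma_deg: float = 5.0,    # placeholder std, not from the paper
                   trans_sigma_m: float = 0.02):  # ~2 cm, the inter-point spacing scale
    """Perturb one trajectory point with Gaussian rotation and translation noise."""
    # Rotation noise: small random axis-angle perturbation composed with q.
    noise_rotvec = np.random.normal(0.0, np.deg2rad(rot_sigma_deg), size=3)
    q_noisy = (R.from_rotvec(noise_rotvec) * R.from_quat(q)).as_quat()
    # Translation noise: isotropic Gaussian.
    t_noisy = t + np.random.normal(0.0, trans_sigma_m, size=3)
    return q_noisy, t_noisy
```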
2.2.3. MPFNet
Based on the constructed prior field, we know the grasping pose (i.e., pre-shape type + wrist pose) corresponding to the existing trajectory points in space. However, for newly observed trajectory points, the system also needs to predict the corresponding grasping pose. Therefore, we design a classification model to learn a nonlinear mapping from trajectory points to grasp poses.
We used a simple five-layer multi-layer perceptron and undertook an experimental comparison of different input and output configurations, as Figure 7 shows. There are two choices of input: the 6D pose and the 3D position. The former is represented by a 7-dimensional vector (quaternion + translation) and the latter by a 3-dimensional vector (translation). The dimension of the output is determined by the cluster manifolds and has size $M = k \times C$, the total number of averaged grasp poses obtained by clustering, where $k$ is the number of cluster manifolds within each pre-shape type and $C$ is the number of pre-shape types. The labels are therefore one-hot encodings, each representing one of the averaged poses of the cluster manifolds. To help the model converge, and to make incorrect predictions as close as possible to the correct ones, we use a softmax-based cross-entropy loss:

$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{M} y_{i,c} \log \hat{y}_{i,c}, \qquad \hat{y}_{i,c} = \frac{e^{z_{i,c}}}{\sum_{j=1}^{M} e^{z_{i,j}}},$$

where M is the total number of clustered-manifold categories (the output dimension), $z$ is the output of the last fully connected layer, with $z_{i,c}$ the output of sample i for class c, N is the number of training samples, and $y$ and $\hat{y}$ are the ground-truth and predicted probabilities, respectively.
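A minimal PyTorch sketch of a five-layer MLP of the kind described, with the 7-D pose input and an M-way output trained with softmax cross-entropy, is shown below. The hidden width and the example batch are illustrative and not taken from the paper.

```python
import torch
import torch.nn as nn

class MPFNet(nn.Module):
    """Five-layer MLP mapping a trajectory point to a clustered grasp pose class."""
    def __init__(self, in_dim: int = 7, n_classes: int = 100, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes),          # raw logits z
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = MPFNet(in_dim=7, n_classes=100)
criterion = nn.CrossEntropyLoss()                  # softmax cross-entropy
x = torch.randn(32, 7)                             # batch of (quaternion + translation)
labels = torch.randint(0, 100, (32,))              # cluster-manifold indices
loss = criterion(model(x), labels)
loss.backward()
```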
The complete process of prediction is as follows: First, we place an object at a certain location. A second person standing in front of the object then moves their hand towards it. The positions and orientations of the hand and the object are measured, and, after coordinate transformation, we obtain the hand pose in the object coordinate system. Finally, this hand pose is fed into MPFNet to obtain the predicted grasp pose (c, g).
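Putting the pieces together, a hypothetical online prediction step could look like the sketch below; the function reuses the helpers and model defined in the earlier sketches, and the cluster lookup table is an assumed structure built offline from the clustered grasp poses.

```python
import numpy as np
import torch
from scipy.spatial.transform import Rotation as R

def predict_grasp(model, T_world_hand, T_world_object, cluster_table):
    """Predict the grasp pose (pre-shape type c, wrist pose g) for one frame.

    cluster_table: list mapping each output class index to its
                   (pre_shape, averaged wrist pose) pair, built offline.
    """
    T_obj_hand = np.linalg.inv(T_world_object) @ T_world_hand   # hand in object frame
    q = R.from_matrix(T_obj_hand[:3, :3]).as_quat()
    t = T_obj_hand[:3, 3]
    x = torch.from_numpy(np.concatenate([q, t])).float().unsqueeze(0)
    with torch.no_grad():
        cls = model(x).argmax(dim=1).item()
    return cluster_table[cls]                                    # (c, g)
```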
2.2.4. Evaluation Metrics
The benchmark used was based on human performance, because our evaluation criteria were all derived from the dataset of human hands grasping objects. We defined two evaluation metrics for the prediction model:
4. Discussion
For better understanding, we visualised and analysed the predictions of the model throughout the ‘hand approaching object’ process. As shown in Figure 12, the left column shows four key frames during the hand's approach to the object, and the three columns on the right show the model's predictions under different conditions. The flesh-coloured hand shows the reconstructed hand pose (historical data), green represents the real grasp pose, and grey represents an incorrectly predicted grasp pose. The second column is the result of the motion capture reconstruction: the hand pose is tracked very smoothly, and although the prediction at the second key frame is incorrect, it is spatially close to the true grasp pose; the correct prediction is made at the third key frame. In addition, we explored the robustness of the model to its inputs. We added Gaussian noise to the hand pose in the sequence, as shown in the third column; the hand motion sequence is then no longer smooth. However, by training the model on the motion prior field after adding noise, we were still able to obtain accurate predictions. At the third key frame the predicted grasp pose is wrong but very close to the real grasp pose (green hand), and at the fourth key frame the prediction is completely correct. Overall, the model was able to make accurate predictions within the first 50% of the sequence of the hand approaching the object.
Based on the experimental results, we found that, by encoding human motion trajectories into a field as prior information, we can obtain a mapping from any point in space to the final grasp pose, and, therefore, a prediction framework can be constructed. Comparing experiments on different objects, we believe that neither weight nor size plays a major role in grasp pose prediction; what matters is the shape and pre-shape category of the object, and as long as the shapes of objects are similar, the prediction model remains applicable. To the best of our knowledge, this paper is the first to use prior knowledge to predict the human hand's grasp pose. In the shared control of grasping with a prosthetic hand, the results of this study can serve as input information to the control end, i.e., the grasp pose is predicted in advance and used as a control target for the wrist joint and the hand pre-shape. In addition, the results of this paper can also be useful in scenarios such as human–robot handover, for example, by predicting the grasp area of the human hand in advance and thus helping the robot arm to plan its own grasp area. Finally, there are still shortcomings and improvements to be made. First, the variety of objects collected in this paper was limited and did not cover all scenes from everyday activities. Second, the prediction model is essentially a classifier over densely sampled data and does not generate a completely new grasp pose at each capture, so it has limited robustness and is difficult to transfer. Moreover, the purpose of the prediction model was to predict the final grasp pose as accurately as possible over the whole approach process, so we did not discuss which state in this process is the most suitable for prediction. We will investigate these issues in follow-up work.