Article

6-DoF Grasp Detection Method Based on Vision Language Guidance

1 Hubei Key Laboratory of Modern Manufacturing and Quality Engineering, Hubei University of Technology, Wuhan 430068, China
2 School of Mechanical Engineering, Hubei University of Technology, Wuhan 430068, China
* Author to whom correspondence should be addressed.
Processes 2025, 13(5), 1598; https://doi.org/10.3390/pr13051598
Submission received: 1 May 2025 / Revised: 16 May 2025 / Accepted: 19 May 2025 / Published: 21 May 2025
(This article belongs to the Special Issue Transfer Learning Methods in Equipment Reliability Management)

Abstract

Interactive robotic grasping allows a robot to grasp the object selected by the user. Most deep-learning-based interactive grasp methods combine a visual language model with a grasp detection model. However, in existing methods the trainability and generalization ability of the visual language model are weak, and the robot copes poorly with small target objects. This paper therefore proposes a 6-DoF grasp detection method guided by visual language, which takes text instructions and RGBD images of the scene to be grasped as input and outputs the 6-DoF grasp posture of the object described by the text. To improve the trainability and feature extraction ability of the visual language model, a multi-head attention mechanism combined with hybrid normalization is designed. At the same time, a local attention mechanism is introduced into the grasp detection model to strengthen the interaction between global and local information in the point cloud data, thereby improving the model's ability to grasp small target objects. The proposed method first uses the improved visual language model to predict the planar position of the target object, then uses the improved grasp detection model to predict all graspable postures in the scene, and finally uses the planar position information to select the grasp postures belonging to the target object. The proposed visual language model and grasp detection model achieve excellent performance across various scenes of public datasets while retaining good generalization ability. In addition, real grasp experiments show that the proposed vision-language-guided 6-DoF grasp detection method achieves a grasp success rate of 95%.

1. Introduction

Grasping has long been a central focus in robot research, as it is the primary way robots engage in production and interaction. With the advancement of deep learning techniques, particularly the robust feature mapping and learning capabilities of neural networks, more accurate predictions of grasp postures can be achieved, especially in unstructured environments [1,2,3,4]. Many deep-learning-based methods for grasp posture prediction aim to predict the graspable postures of all objects within a multi-object scene. In general, to determine the optimal grasp posture, only the one with the highest confidence is selected. This leads to a random object being grasped with each prediction, which means the robot may not grasp the object specified by the user. Consequently, many researchers have focused on interactive grasp tasks, where the user selects the object to be grasped [5,6,7,8]. Deep-learning-based interactive grasp methods generally unfold in two steps: the first step determines the specific location of the target object in the image from the text instructions [9,10], and the second step uses a planar grasp posture prediction network to predict the feasible grasp postures. However, this approach is limited by the camera's angle and the scene layout, making it challenging for the model to accurately locate the target object in the first step.
Additionally, the planar grasp method cannot process spatial information, which leads to overlooking potential grasp postures and, consequently, reduced grasp success rates in practical applications. To address these issues, Lu et al. [11] introduced the VL-Grasp model. This novel approach integrates a visual language model with a six degrees of freedom (6-DoF) grasp posture detection model for interactive grasp tasks, yielding favorable results in real grasp experiments. Other researchers have incorporated semantic knowledge as prior information in task-oriented grasp to help robots adapt to various situations [12,13]; however, these methods are limited to specific environments and lack generalization. In response, Tang et al. [14] proposed the GraspGPT model, which leverages text instructions and a large language model interface to recognize and locate multiple objects, achieving a successful grasp.
Although these studies have significantly contributed to interactive grasp tasks, they still have the following issues: first, identifying targets through large-scale language models can be time-consuming and may impose specific network requirements when deployed locally; second, the existing visual language models used for grasping tasks show limited robustness and poor generalization ability during training; and third, existing grasp models often perform inadequately on small objects. To address these challenges, this paper builds on Lu et al. [11] and proposes a new 6-DoF grasp method based on visual language guidance. The proposed method takes text instructions and RGBD images as input, utilizing a newly designed visual language model to detect the planar position of the target object. An improved 6-DoF grasp model then identifies all possible graspable postures within the scene, and the object position predicted by the visual language model is used to select the optimal grasp posture. To assess the performance of the proposed method, we compare the recognition and prediction capabilities of the visual language and grasp detection models across different datasets. Additionally, real grasp experiments are conducted to validate the effectiveness of the proposed approach, demonstrating promising results.
In summary, the contributions of this paper are as follows:
(1)
A novel visual language model is introduced, which enhances the multi-head attention mechanism to optimize the training process and improve the model’s recognition and detection capabilities.
(2)
A new 6-DoF grasp model is proposed to address the challenges of grasping small and multi-scale targets. This model incorporates a point cloud feature encoder and multiple sampling mechanisms to enhance its ability to grasp small objects.
(3)
The end-to-end interactive grasp of small target objects is achieved by integrating the visual language model with the 6-DoF grasp detection model. A real grasp experiment is conducted to validate the effectiveness of the proposed method.

2. Related Works

2.1. Visual Language Model

A visual language model's task is to determine a target object's position based on textual instructions. The model's output typically consists of detection boxes and segmentation masks, employed for target detection and image segmentation tasks, respectively. Traditional methods [15,16,17] utilize convolutional neural networks [18] and long short-term memory networks [19] to extract features. More recently, the Transformer mechanism [20] has gained widespread adoption in target detection tasks [21,22] due to its ability to capture global context and support parallel processing. Additionally, the Transformer excels at capturing long-range dependencies in sequences, making it well-suited for natural language processing and cross-modal learning tasks. In visual research, target detection and image segmentation are typically studied separately, but these tasks can share the same backbone network for early-stage feature extraction. Consequently, unified frameworks [23,24] have been proposed to perform specific tasks by leveraging a shared backbone network and distinct heads. For instance, the RefTR model proposed by Li et al. [23] extracts features from the input text and images separately and, by incorporating the Transformer mechanism, generates separate outputs for target detection and segmentation.

2.2. Grasp Detection Model

Grasp detection models can be divided into planar grasp and 6-DoF grasp according to their coverage ranges. In the study of planar grasp, Lenz et al. [25] used a five-dimensional grasp box to represent the grasping posture and generated the grasp results through a deep convolutional neural network; however, this method is highly dependent on the local features of the grasp scene. Kumra et al. [26] proposed the GR-ConvNet model to generate grasp postures via pixel-by-pixel detection. Morrison et al. [27] proposed a novel grasp label representation that can automatically generate pixel-level grasp labels from the grasp box, which improves grasp accuracy to a certain extent. However, planar grasp detection ignores many possible grasp results due to the lack of spatial information. In the study of 6-DoF grasp, Fang et al. [28] proposed the GraspNet-1Billion large-scale grasp dataset and an end-to-end grasp posture prediction network that takes a point cloud as input. Chen et al. [29] proposed the TransSC network to complete the point cloud shape and address the problem of incomplete sparse point cloud features, used the PointNetGPD [4] network to evaluate the grasp results, and finally cooperated with the MoveIt task builder for motion planning. To improve the model's sampling ability over the entire scene, Ma et al. [30] proposed a new grasp method to solve the feature learning problem when the grasp scene scale is unbalanced.
Although the above-mentioned visual language model and grasp detection model have achieved good results, the existing visual language model training process is unstable, and the encoding ability of images and texts is insufficient. Therefore, when facing scenes where multiple objects are in contact with each other, the recognition and detection results are not precise enough. At the same time, the existing grasp detection model has weak grasp performance for small objects in stacked scenes. In order to solve these problems, this paper proposes a 6-DoF grasp detection method based on visual language guidance. It constructs a new visual language model and grasp detection model to improve the interactive grasp ability of small target scenes.

3. Method

3.1. Overall Framework

The overall framework of the 6-DoF grasping method based on visual language guidance proposed in this paper is illustrated in Figure 1. Visual language-based grasping is achieved through the visual language model and the grasp detection model.
First, an RGB image of the scene and textual instructions describing the object to be grasped are provided as input. The textual instructions, written in English, should specify the object’s type, color, and position as accurately as possible. Next, the visual language model extracts and encodes the input image and text features separately. The corresponding segmentation mask and target detection maps are then generated through the decoding process via the visual language converter. Subsequently, the RGB image and depth map are fed into the grasp detection model to reconstruct the 3D scene point cloud. This point cloud is then used by the improved grasp detection model proposed in this paper to predict the optimal grasping posture. Finally, the target detection map generated by the visual language model is used to identify the target object region and select the best 6-DoF grasping posture, which is then output through collision detection and grasp confidence ranking.
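The following Python sketch summarizes this pipeline. The model interfaces (vl_model, grasp_model, the .center and .score attributes of a grasp) are hypothetical placeholders rather than the actual implementation, and collision detection is omitted for brevity:

```python
# Minimal sketch of the vision-language-guided grasp pipeline described above.
# vl_model and grasp_model are assumed callables (placeholders, not the actual API).
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth map (H, W) in metres into an (N, 3) camera-frame point cloud."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

def project_point(p, fx, fy, cx, cy):
    """Project a 3D camera-frame point onto the image plane (pixel coordinates)."""
    x, y, z = p
    return fx * x / z + cx, fy * y / z + cy

def select_grasp(rgb, depth, instruction, vl_model, grasp_model, intrinsics):
    fx, fy, cx, cy = intrinsics
    # 1. Locate the target: the visual language model returns a box (x0, y0, x1, y1) and a mask.
    box, mask = vl_model(rgb, instruction)
    # 2. Predict all graspable 6-DoF poses in the reconstructed scene point cloud.
    cloud = depth_to_point_cloud(depth, fx, fy, cx, cy)
    grasps = grasp_model(cloud, rgb)  # assumed: each grasp has .center (3,) and .score
    # 3. Keep grasps whose grasp point projects inside the detected box, rank by confidence.
    #    (Collision detection is omitted in this sketch.)
    x0, y0, x1, y1 = box
    candidates = []
    for g in grasps:
        u, v = project_point(g.center, fx, fy, cx, cy)
        if x0 <= u <= x1 and y0 <= v <= y1:
            candidates.append(g)
    return max(candidates, key=lambda g: g.score) if candidates else None
```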

3.2. Improved Visual Language Model

The visual language model is depicted in Figure 2. For the input image and text, features are extracted using the image feature extraction network and the text feature encoder, respectively. The image feature extraction network employs ResNet50 [31], while the text feature encoder utilizes BERT [32]. The visual language encoder integrates the multimodal features from the image and the text using six Transformer encoder layers based on the Transformer architecture [20]. Subsequently, the image–text pair is fed into the query encoder, which generates query phrase embeddings. The query decoder performs joint reasoning on these embeddings and decodes the multi-task features, which are then passed to the detection and segmentation heads. These heads produce a set of detection boxes and masks to guide the subsequent grasp configuration process.
After the model extracts the image and text features, a new multi-head attention mechanism is designed to stabilize the training process and improve the performance of the visual language encoder; its overall structure is shown in Figure 3. Figure 3a shows a post-normalized multi-head attention mechanism, which produces a stronger regularization effect and helps improve the model's performance. Figure 3b shows a pre-normalized multi-head attention mechanism, in which normalization is performed before the residual is added; this yields a more prominent identity path, faster convergence, and more stable gradients. To combine the advantages of the two schemes, we adopt a hybrid normalized multi-head attention mechanism [33] and replace the normalization layer with Dynamic Tanh (DyT), as shown in Figure 3c. DyT is shown in Figure 3d, where α is a learnable parameter. This operation simulates the behavior of a normalization layer by learning an appropriate scaling factor α and compressing extreme values through the bounded tanh function. DyT can improve the final performance of the model without re-tuning the training hyperparameters.
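Below is a minimal PyTorch sketch of a DyT layer and of one way to combine pre- and post-placement of DyT around multi-head attention. The per-channel affine parameters in DyT, the default value of α, and the exact placement of the two DyT layers are illustrative assumptions and do not reproduce the exact structure in Figure 3c.

```python
# Minimal PyTorch sketch of DyT and a hybrid-normalized attention block (assumptions noted above).
import torch
import torch.nn as nn

class DyT(nn.Module):
    """Dynamic Tanh: y = gamma * tanh(alpha * x) + beta, with a learnable scalar alpha."""
    def __init__(self, dim, alpha_init=0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(alpha_init))  # learnable scaling factor
        self.gamma = nn.Parameter(torch.ones(dim))            # per-channel scale (assumption)
        self.beta = nn.Parameter(torch.zeros(dim))            # per-channel shift (assumption)

    def forward(self, x):
        return self.gamma * torch.tanh(self.alpha * x) + self.beta

class HybridAttentionBlock(nn.Module):
    """Multi-head attention with DyT applied both before the attention (pre path)
    and after the residual addition (post path); one possible hybrid placement."""
    def __init__(self, dim, num_heads):
        super().__init__()
        self.pre = DyT(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.post = DyT(dim)

    def forward(self, x):
        y = self.pre(x)
        h, _ = self.attn(y, y, y)
        return self.post(x + h)
```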

3.3. Improved 6-DoF Grasp Detection Model

The overall structure of the grasp detection model proposed in this paper is illustrated in Figure 4. It consists of three stages: the point cloud feature encoding stage, the scene sampling stage, and the grasp configuration regression stage. The scene sampling stage is only used during the model test and is not involved in the model training.
In the point cloud feature encoding stage, the point cloud data is unordered, unstructured, and unevenly distributed between dense and sparse regions, which poses challenges during feature extraction. We employ the PointNetSA module from the PointNet++ [34] network to address these challenges. The PointNetSA module uses farthest-point sampling to select several feature center points from the input point cloud; these center points are then used to group the point cloud, and the PointNet [35] network performs feature extraction on each group. We introduce a Transformer block after the PointNetSA module, which improves the model's attention to relevant regions of the point cloud. Furthermore, the network's feature extraction capability for multiple targets is strengthened through a feature fusion structure that integrates multi-scale features.
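As a concrete illustration of the farthest-point sampling step used by the PointNetSA module, the following is a minimal NumPy sketch of the generic algorithm (not the authors' implementation):

```python
# Minimal NumPy sketch of farthest-point sampling (FPS), the center-point selection
# step used by the PointNetSA module; generic algorithm, not the authors' exact code.
import numpy as np

def farthest_point_sampling(points, num_samples):
    """points: (N, 3) array; returns indices of num_samples centers spread over the cloud."""
    n = points.shape[0]
    selected = np.zeros(num_samples, dtype=np.int64)
    dist = np.full(n, np.inf)                     # distance to the nearest selected center
    selected[0] = np.random.randint(n)            # arbitrary starting point
    for i in range(1, num_samples):
        diff = points - points[selected[i - 1]]
        dist = np.minimum(dist, np.einsum("ij,ij->i", diff, diff))
        selected[i] = int(np.argmax(dist))         # farthest from all chosen centers
    return selected
```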
To address the model’s poor robustness when grasping small objects due to the uneven label distribution in the dataset, we applied the Scale-Balanced-Grasp [30] uniform scene sampling method. In the scene sampling stage, the objects are first segmented using the DSN segmentation network. This is followed by interpolation and fusion of the point cloud features obtained in the feature encoding stage with the segmented features. The fused features are then used to perform balanced sampling on the original point cloud features.
The DSN segmentation network, shown in Figure 5 [30], takes the original scene point cloud as input and extracts point cloud features through the point cloud segmentation converter. The network structure of the segmentation converter mirrors that of the point cloud encoder, utilizing the PointNetSA module for feature processing. The Transformer block applies attention, but the number of feature fusion layers is reduced to two. After extracting the scene features, the foreground and center direction modules are employed for data refinement and dimensionality reduction. Finally, the point cloud segmentation mask and center offset are obtained through inverse distance weighted interpolation. This interpolation method transfers the features of known points to target points based on distance, with the calculation formulas provided in Equations (1) and (2).
$$f^{(j)}(x) = \frac{\sum_{i=1}^{k} w_i(x)\, f_i^{(j)}}{\sum_{i=1}^{k} w_i(x)} \qquad (1)$$
$$w_i(x) = \frac{1}{d(x, x_i)^{p}}, \qquad j = 1, \ldots, C \qquad (2)$$
In Equations (1) and (2), $w_i(x)$ is the weight of the known point $x_i$, which is inversely proportional to its distance from $x$: the closer the point, the greater its influence. $f^{(j)}(x)$ is the interpolated value at the unknown point $x$, $f_i^{(j)}$ is the value at the known point $x_i$, $k$ is the number of known points used in the interpolation, $p$ controls how strongly distance attenuates the weight, and $C$ is the number of feature channels.
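A minimal NumPy sketch of the inverse-distance-weighted interpolation in Equations (1) and (2) is given below; the choice of k and p (k = 3, p = 2, as commonly used in PointNet++-style feature propagation) is an assumption for illustration:

```python
# Minimal NumPy sketch of the inverse-distance-weighted interpolation in Eqs. (1)-(2).
import numpy as np

def idw_interpolate(query_xyz, known_xyz, known_feat, k=3, p=2):
    """query_xyz: (M, 3), known_xyz: (N, 3), known_feat: (N, C) -> (M, C) interpolated features."""
    # Pairwise distances between query points and known points.
    d = np.linalg.norm(query_xyz[:, None, :] - known_xyz[None, :, :], axis=-1)  # (M, N)
    idx = np.argsort(d, axis=1)[:, :k]                          # k nearest known points
    d_k = np.take_along_axis(d, idx, axis=1)                    # (M, k)
    w = 1.0 / np.maximum(d_k, 1e-8) ** p                        # Eq. (2): w_i = 1 / d^p
    w = w / w.sum(axis=1, keepdims=True)                        # normalization, Eq. (1)
    return np.einsum("mk,mkc->mc", w, known_feat[idx])          # weighted feature average
```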
When sampling point cloud features, fixed-position sampling methods often overlook the features of small targets. We employ a multi-scale cylindrical group network during the feature capture and regression stages to comprehensively capture a wide range of sampling points across the scene. This network samples point cloud features using cylindrical sampling methods at varying depths and scales. The network architecture is illustrated in Figure 6.
The original scene point cloud is input into the network, where four cylinders with different radii and depths are used to sample the point cloud. The cylindrical features from these four scales are stacked into a new dimensional space. Next, shared convolution is applied, allowing all point clouds to share the same parameters so that weights are shared during training, and a max pooling layer then extracts the key features. The features obtained from the four cylinders with different radii are concatenated to form the complete cylindrical sampling feature. The calculation is given in Equations (3) and (4), where $T_{out}$ is the complete cylindrical sampling feature, $T_i$ is the cylindrical feature at a given radius, $n$ is the number of cylinders with different radii (four), $C_{1D}$ is a one-dimensional convolution used for dimensionality reduction, $D_{input}$ is the input scene point cloud, $Cy_i(\cdot)$ denotes cylindrical point cloud extraction at a given depth, $k$ is the number of cylinders with different depths (four), $S_{mlp}(\cdot)$ is the shared convolution, and $Max(\cdot)$ is the max pooling layer.
$$T_{out} = C_{1D}\!\left( \operatorname{Concat}_{i=1}^{n} T_i \right) \qquad (3)$$
$$T = Max\!\left( S_{mlp}\!\left( \operatorname{Concat}_{i=1}^{k} Cy_i\!\left( D_{input} \right) \right) \right) \qquad (4)$$
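The sketch below illustrates Equations (3) and (4) in PyTorch. The radii, depths, number of sampled points per cylinder, and feature widths are illustrative assumptions rather than the authors' configuration, and each scale pairs one radius with one depth for simplicity:

```python
# Minimal PyTorch sketch of multi-scale cylindrical grouping (Eqs. (3) and (4)).
# Radii, depths, sample counts, and feature widths are illustrative assumptions.
import torch
import torch.nn as nn

def cylinder_group(cloud, centers, radius, depth, n_sample=64):
    """cloud: (N, 3), centers: (M, 3) -> (M, n_sample, 3) relative points inside each cylinder.
    A cylinder is z-aligned: radial xy distance <= radius and |dz| <= depth."""
    rel = cloud[None, :, :] - centers[:, None, :]                           # (M, N, 3)
    inside = (torch.linalg.norm(rel[..., :2], dim=-1) <= radius) & (rel[..., 2].abs() <= depth)
    groups = []
    for m in range(centers.shape[0]):
        pts = rel[m][inside[m]]
        if pts.shape[0] == 0:                                               # empty cylinder: pad with zeros
            pts = torch.zeros(1, 3)
        idx = torch.randint(pts.shape[0], (n_sample,))                       # sample with replacement
        groups.append(pts[idx])
    return torch.stack(groups)                                               # (M, n_sample, 3)

class MultiScaleCylinderFeature(nn.Module):
    def __init__(self, radii=(0.02, 0.04, 0.06, 0.08), depths=(0.01, 0.02, 0.03, 0.04), out_dim=64):
        super().__init__()
        self.radii, self.depths = radii, depths
        self.shared_mlp = nn.Conv1d(3, out_dim, 1)                           # shared convolution S_mlp
        self.reduce = nn.Conv1d(out_dim * len(radii), out_dim, 1)            # 1D convolution C_1D in Eq. (3)

    def forward(self, cloud, centers):
        feats = []
        for r, d in zip(self.radii, self.depths):
            grp = cylinder_group(cloud, centers, r, d)                       # (M, S, 3)
            f = self.shared_mlp(grp.permute(0, 2, 1))                        # (M, C, S)
            feats.append(f.max(dim=-1).values)                               # max pooling, Eq. (4)
        t = torch.cat(feats, dim=1)                                          # concatenate the scales
        return self.reduce(t.unsqueeze(-1)).squeeze(-1)                      # (M, out_dim), Eq. (3)
```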
To enhance the semantic features following feature extraction, we employ the ApproachNet [28] network to derive approximate vector features from the balanced sampling features described above. These vector features guide the fusion of local features at different scales. To further improve the interaction between local and global features, we introduce a local attention module, the structure of which is shown in Figure 7.
After incorporating local attention into both the cylindrical feature and the approach vector feature, these features are combined and input into the OperationNet [28] network to predict the in-plane rotation, approach distance, grasp width, and grasp confidence. Additionally, the data is fed into the ToleranceNet [28] network to assess the tolerance of the grasp point, specifically the robustness of the grasp operation to positional errors.
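Since the exact structure of the local attention module in Figure 7 is not reproduced here, the following generic k-nearest-neighbor attention over point features is only a rough stand-in that illustrates the idea of strengthening local and global feature interaction:

```python
# Generic k-nearest-neighbor local attention over point features; a rough stand-in
# for the module in Figure 7, not the authors' exact structure.
import torch
import torch.nn as nn

class LocalAttention(nn.Module):
    def __init__(self, dim, k=16):
        super().__init__()
        self.k = k
        self.q, self.kv = nn.Linear(dim, dim), nn.Linear(dim, 2 * dim)

    def forward(self, xyz, feat):
        """xyz: (N, 3) point coordinates, feat: (N, C) features -> (N, C) refined features."""
        d = torch.cdist(xyz, xyz)                                   # (N, N) pairwise distances
        idx = d.topk(self.k, largest=False).indices                 # indices of k nearest neighbors
        q = self.q(feat)                                            # (N, C)
        k_, v = self.kv(feat[idx]).chunk(2, dim=-1)                 # (N, k, C) each
        attn = torch.softmax((q.unsqueeze(1) * k_).sum(-1) / feat.shape[-1] ** 0.5, dim=-1)
        return feat + (attn.unsqueeze(-1) * v).sum(dim=1)           # residual local aggregation
```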

3.4. The Proposed Grasp Detection Method

This article proposes a 6-DoF grasp detection method based on visual language guidance. The overall grasp detection process is shown in Figure 1, and the specific implementation steps are as follows:
(1)
Obtain RGB and depth images to be grasped and describe the objects to be grasped in the scene using English text, including the target object’s type, color, and shape.
(2)
The improved visual language model uses RGB images and text as inputs to locate the position of the target object. The multi-head attention mechanism introduced in the model can improve the stability of model training and enhance the model’s feature extraction ability. The improved visual language model can predict segmentation masks and object detection images to guide the subsequent model’s grasp posture selection.
(3)
The improved grasp detection model reconstructs the scene point cloud using RGB and depth maps. After scene segmentation, cylindrical feature sampling, and point cloud feature extraction, it predicts all graspable poses in the scene.
(4)
We use the detection results of the visual language model to screen all graspable poses in the scene and obtain the optimal grasp pose through non-maximum suppression and collision detection.

4. Implementation Details

4.1. Dataset and Evaluation Metrics

The RoboRefIt [36] dataset is used for training the visual language model. This dataset, designed for visual language detection and segmentation, contains RGBD images of 66 objects across 50,758 scenes. It provides 2D detection bounding boxes and segmentation mask labels for each scene. The dataset is divided into a training set, test set A, and test set B. Test set A consists of scenes featuring objects that appear in the training set, while test set B contains scenes with objects not seen in the training set. The intersection-over-union (IoU) ratio, with a threshold of 0.5, is used as the evaluation metric for the detection bounding boxes, and the average IoU ratio is used to assess the segmentation masks.
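For reference, a minimal sketch of the box IoU computation underlying this metric (boxes given as (x0, y0, x1, y1) pixel coordinates) is shown below:

```python
# Minimal sketch of the IoU metric used for the detection boxes (threshold 0.5).
def box_iou(a, b):
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-8)

# A prediction counts as correct when box_iou(pred, gt) >= 0.5.
```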
We also employed the GraspNet-1Billion dataset for the grasp detection model and used the average grasp precision as the evaluation metric for the proposed model. Additionally, to better assess the model's performance in perceiving and grasping multi-scale objects, we use the categories $AP_S$, $AP_M$, and $AP_L$ to evaluate the model's grasp performance on small, medium, and large objects, respectively. The size divisions are defined as follows: 0–4 cm for small, 4–7 cm for medium, and 7 cm and above for large objects.

4.2. Loss Calculation and Training Details

The training loss function in the visual language model is the sum of the grasp detection box loss and the segmentation mask loss. The loss function for the grasp detection box is presented in Equations (5)–(7), where a weighted sum of the L1 loss and the generalized IoU loss is used. Here, $L_{loss1}$ denotes the L1 loss, $L_{iou}$ the generalized IoU loss, $b_0$ the ground-truth detection box, $b$ the predicted detection box, $C$ the minimum enclosing rectangle containing both $b_0$ and $b$, and $N$ the number of samples.
$$L_d = \lambda_{iou}\, L_{iou}(b_0, b) + \lambda_{L1}\, L_{loss1}(b_0, b) \qquad (5)$$
$$L_{iou} = \frac{\left| b_0 \cap b \right|}{\left| b_0 \cup b \right|} - \frac{\left| C \setminus (b_0 \cup b) \right|}{\left| C \right|} \qquad (6)$$
$$L_{loss1} = \frac{\sum_{i=1}^{N} \left| b_0 - b \right|}{N} \qquad (7)$$
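A minimal PyTorch sketch of this box loss is shown below; in the sketch the generalized IoU term is turned into a loss as 1 − GIoU, and the weights λ_iou and λ_L1 are placeholders rather than the authors' settings:

```python
# Minimal PyTorch sketch of the box regression loss in Eqs. (5)-(7): weighted L1 + GIoU terms.
import torch

def giou(b0, b, eps=1e-8):
    """b0, b: (N, 4) boxes as (x0, y0, x1, y1). Returns generalized IoU per box pair."""
    x0, y0 = torch.max(b0[:, 0], b[:, 0]), torch.max(b0[:, 1], b[:, 1])
    x1, y1 = torch.min(b0[:, 2], b[:, 2]), torch.min(b0[:, 3], b[:, 3])
    inter = (x1 - x0).clamp(min=0) * (y1 - y0).clamp(min=0)
    area0 = (b0[:, 2] - b0[:, 0]) * (b0[:, 3] - b0[:, 1])
    area1 = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    union = area0 + area1 - inter
    # Smallest enclosing box C.
    cx0, cy0 = torch.min(b0[:, 0], b[:, 0]), torch.min(b0[:, 1], b[:, 1])
    cx1, cy1 = torch.max(b0[:, 2], b[:, 2]), torch.max(b0[:, 3], b[:, 3])
    c_area = (cx1 - cx0) * (cy1 - cy0)
    return inter / (union + eps) - (c_area - union) / (c_area + eps)

def box_loss(b0, b, lam_iou=1.0, lam_l1=1.0):
    l_iou = (1.0 - giou(b0, b)).mean()          # GIoU-based loss term
    l_1 = torch.abs(b0 - b).mean()              # L1 term, Eq. (7)
    return lam_iou * l_iou + lam_l1 * l_1       # weighted sum, Eq. (5)
```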
The loss function for the segmentation mask is shown in Equations (8)–(12), where a weighted sum of the focal loss $L_{focal}$ [37] and the Dice loss $L_{dice}$ [38] is used. Here, $s_0$ is the ground-truth mask, $s$ is the predicted mask, $p_t$ represents the difference between the predicted and true values, and $\alpha_t$ is a weighting factor.
$$L_s = \lambda_{focal}\, L_{focal}(s_0, s) + \lambda_{dice}\, L_{dice}(s_0, s) \qquad (8)$$
$$L_{focal} = -\alpha_t (1 - p_t)^{\gamma} \log(p_t) \qquad (9)$$
$$\alpha_t = \begin{cases} \alpha, & \text{if } s_0 = 1 \\ 1 - \alpha, & \text{otherwise} \end{cases} \qquad (10)$$
$$p_t = \begin{cases} s, & \text{if } s_0 = 1 \\ 1 - s, & \text{otherwise} \end{cases} \qquad (11)$$
$$L_{dice} = 1 - \frac{2 \sum_{i}^{N} s_i\, s_{0i}}{\sum_{i}^{N} s_i^{2} + \sum_{i}^{N} s_{0i}^{2}} \qquad (12)$$
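A minimal PyTorch sketch of the mask loss in Equations (8)–(12) follows; α, γ, and the λ weights are placeholders rather than the authors' reported settings:

```python
# Minimal PyTorch sketch of the segmentation mask loss: focal loss + Dice loss (Eqs. (8)-(12)).
import torch

def focal_loss(s, s0, alpha=0.25, gamma=2.0, eps=1e-8):
    """s: predicted mask probabilities in [0, 1]; s0: binary ground-truth mask."""
    p_t = torch.where(s0 == 1, s, 1 - s)                                     # Eq. (11)
    alpha_t = torch.where(s0 == 1, torch.full_like(s, alpha), torch.full_like(s, 1 - alpha))  # Eq. (10)
    return (-alpha_t * (1 - p_t) ** gamma * torch.log(p_t + eps)).mean()     # Eq. (9)

def dice_loss(s, s0, eps=1e-8):
    inter = (s * s0).sum()
    return 1 - 2 * inter / (s.pow(2).sum() + s0.pow(2).sum() + eps)          # Eq. (12)

def mask_loss(s, s0, lam_focal=1.0, lam_dice=1.0):
    return lam_focal * focal_loss(s, s0) + lam_dice * dice_loss(s, s0)       # Eq. (8)
```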
We define the total grasp loss in the grasp detection model as shown in Equation (13), where $L_{approach}$ is the approach-vector loss, $L_{Rotation}$ is the rotation angle loss, $\alpha$ is a hyperparameter set to 0.2, $W$ is the sample weight in the grasp scene, and $L_{Graspable}$ is the grasp robustness loss. In Equation (14), $c_i$ is the binary prediction of whether point $i$ is graspable; $c_i^{*}$ is 1 if point $i$ is graspable and 0 otherwise. In addition, $s_{ij}$ is the predicted confidence score for view $j$ at point $i$, $s_{ij}^{*}$ is the corresponding ground-truth label, $(v_{ij}, v_{ij}^{*})$ measure the difference between the predicted and ground-truth views, and the indicator $\mathbf{1}(\cdot)$ restricts the loss to views whose difference from the ground truth is within 5 degrees. The classification loss $L_{cls}$ is defined using the softmax function, and the regression loss $L_{reg}$ is defined using the smooth L1 function.
$$L_{grasp} = W\, L_{approach} + \alpha\, L_{Rotation} + L_{Graspable} \qquad (13)$$
$$L_{approach}(c_i, s_{ij}) = \frac{1}{N_{cls}} \sum_{i} L_{cls}(c_i, c_i^{*}) + \lambda_1 \frac{1}{N_{reg}} \sum_{i} \sum_{j} c_i^{*}\, \mathbf{1}\!\left( \angle(v_{ij}, v_{ij}^{*}) < 5^{\circ} \right) L_{reg}(s_{ij}, s_{ij}^{*}) \qquad (14)$$
In Equation (15), $R_{ij}$ represents the grasp rotation, while $S_{ij}$, $W_{ij}$, and $d$ represent the grasp confidence, grasp width, and grasp approach depth, respectively. For $L_{cls}$, we use the sigmoid cross-entropy loss function. In Equation (16), $T_{ij}$ denotes the maximum perturbation the grasp posture can resist.
$$L_{Rotation}(R_{ij}, S_{ij}, W_{ij}) = \sum_{d=1}^{K} \left[ \frac{1}{N_{cls}} \sum_{i} \sum_{j} L_{cls}^{d}(R_{ij}, R_{ij}^{*}) + \lambda_2 \frac{1}{N_{reg}} \sum_{i} \sum_{j} L_{reg}^{d}(S_{ij}, S_{ij}^{*}) + \lambda_3 \frac{1}{N_{reg}} \sum_{i} \sum_{j} L_{reg}^{d}(W_{ij}, W_{ij}^{*}) \right] \qquad (15)$$
$$L_{Graspable} = \frac{1}{N_{reg}} \sum_{d=1}^{K} \sum_{i} \sum_{j} L_{reg}^{d}(T_{ij}, T_{ij}^{*}) \qquad (16)$$
During visual language model training, we froze the ResNet-50 weights and used pre-trained BERT weights. The model was trained for 90 epochs with a batch size of 16 and an initial learning rate of 0.0001. For the grasp detection model, each input scene consisted of 20,000 sampled points; the Adam optimizer was used for training, with 50 epochs, a batch size of 6, and an initial learning rate of 0.001. Both models were trained in a PyTorch 1.6 environment with CUDA 12.1, using an AMD EPYC 7302 processor and three NVIDIA GeForce RTX 3090 GPUs.
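The snippet below sketches this setup for the visual language model: the image backbone is frozen and the remaining parameters are optimized. The use of Adam here (the paper explicitly names Adam only for the grasp detection model) and the model.backbone attribute are assumptions for illustration:

```python
# Minimal sketch of freezing the ResNet-50 backbone and building an optimizer.
# `model` and its `.backbone` attribute are hypothetical placeholders; the optimizer
# choice for the visual language model is an assumption.
import torch

def build_optimizer(model, lr=1e-4):
    for p in model.backbone.parameters():      # freeze the ResNet-50 image backbone
        p.requires_grad = False
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=lr)
```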

5. Results

Our experiments are categorized into two types: dataset-based experiments and real-world grasp experiments.

5.1. Experiments Based on Dataset

Table 1 shows the recognition and detection performance, parameter count, and inference speed of the visual language model, comparing the model proposed in this paper with VL-Grasp [11].
On test set A, the visual language model proposed in this paper achieved the best results in the target detection (TD) and object segmentation (OS) tasks. When using ResNet50 as the backbone network, the proposed model outperforms the original model in feature extraction thanks to the data processing capability of the improved multi-head attention encoding layer. Since test set B contains scenes with objects that do not appear in the training set, it provides a stronger test of the model's generalization ability, and the proposed visual language model also achieved good results on it. Two scenes from test set A and test set B were selected for visualization; the detection and segmentation results are shown in Figure 8, which displays the input text and image along with the detection and segmentation results predicted by the model.
We used the GraspNet-1Billion dataset to verify the grasp performance of the proposed model and compared it with other mainstream grasp detection methods [39,40,41] that use point clouds as input. We recorded each model's average grasp values AP, AP0.4, and AP0.8 in seen, similar, and novel scenes; the results are shown in Table 2. As can be seen from the table, the grasp detection model proposed in this paper achieved the best results in the similar and novel scenes. Compared with the model proposed by Ma et al. [30], after adding the network feature fusion structure and the local attention mechanism, the model's overall performance improved in all scenes: the average grasp value AP increased by 1.30 in the seen scenes, by 1.41 in the similar scenes, and by 0.47 in the novel scenes.
To verify the grasp performance of the proposed model on objects of different sizes, we divide the grasp postures predicted by the model into small, medium, and large specifications according to the grasp width and calculate the corresponding AP values in the three scene types. The results are shown in Table 3. As can be seen from the table, thanks to the segmentation network's accurate sampling of small objects in the scene, the grasp detection accuracy for small targets is greatly improved for both the model proposed by Ma et al. and the model proposed in this paper. The model proposed in this paper introduces a structure that fuses features at different depths, which improves the model's local and global interaction ability, while the added local attention increases the model's focus on important features. Therefore, the accuracy of the proposed model when grasping objects of various sizes is improved to a certain extent.
We visualized the results of the proposed model and the comparison models [28,30] when grasping small target objects; the visualization results are shown in Figure 9. Grasp postures drawn closer to red have higher grasp confidence, while those closer to blue have lower confidence.
It can be seen from the figure that, when facing small target objects such as strawberries and toy models, the GraspNet model, which lacks small-target feature sampling, assigns very few grasp postures to these objects, and the assigned postures are very simple. In addition, the model does not adequately extract local features, so the final generated grasp postures suffer from serious collisions. After introducing the scale feature collection strategy and the segmentation network, Ma's model and the model proposed in this paper pay more attention to small target objects and generate a richer set of grasp postures. Compared with the model proposed by Ma, the proposed model introduces a more flexible local attention mechanism, which makes the predicted grasp postures more uniformly distributed and better fitted to the object.

5.2. Real Grasp Experiment

We first built a real grasp detection platform, as shown in Figure 10. We used the KR5 R1400 robot and the Dahuan two-finger gripper to grasp objects, and used the Realsense-D435i depth camera, mounted in an eye-in-hand configuration, to obtain the RGBD image of the grasping scene. We selected six objects for the real grasp experiments, as shown in Figure 10b: an apple, a water cup, an orange, a Coke can, a banana, and a mouse, all common in everyday life. These objects appear in the RoboRefIt dataset and have different shapes and colors, so the model's recognition and segmentation capabilities, as well as its grasp detection capability, can be effectively tested on multiple objects.
For each group of experiments, we select one of the six objects in Figure 10b as the target object and cyclically select three of the remaining objects as distractors to add to the grasping scene. Each experiment therefore involves four objects, and each group consists of 20 repetitions. Six groups of experiments were conducted, for a total of 120 grasp trials. The experimental results are shown in Table 4, where Scenes 1–6 correspond to the six target objects: apple, cup, orange, cola, banana, and mouse. As can be seen from the table, the grasp detection method proposed in this paper achieved excellent results across the various grasp scenes and target objects, with an overall grasp success rate of 95.00%.
Some grasp samples of the six target objects are shown in Figure 11, which displays the text description, RGB image, target detection result, target segmentation result, and grasp result for the different objects.
Figure 12 shows the visualization results of some failed grasp experiments. The failures were mainly caused by grasping the wrong object, colliding with other objects during the grasp, or an imperfect grasp posture. Therefore, in future research, the robot's motion trajectory should be planned and the grasp posture angle optimized to improve the grasp success rate.

6. Conclusions

To meet the needs of intelligent grasping, improve the trainability and generalization ability of the visual language model, and raise the recognition and grasp success rate for small target objects, this paper proposes a 6-DoF grasp method based on visual language guidance. We verified the visual language model and the grasp detection model on different datasets. The results show that the proposed visual language model achieves high accuracy in both target detection and segmentation and that, compared with other grasp models, grasp accuracy is also improved when grasping small target objects. Real grasp experiments show that the grasp success rate of the proposed method reaches 95.00%.
Although the proposed method achieves the grasping of specific objects in multi-object scenes, its grasp detection ability is still insufficient when dealing with other interference factors, such as heavily stacked objects and lighting variation. Moreover, during the grasp experiments, the robot was prone to colliding with other objects, changing the positions of objects in the scene and ultimately affecting the grasp operation. Therefore, future research should add a corresponding occluded-object segmentation algorithm and design a robot path planning algorithm to improve the overall grasp success rate.

Author Contributions

Conceptualization, X.L. and J.C.; methodology, J.C.; software, X.L. and J.C.; validation, X.L., J.C., R.W. and T.L.; formal analysis, R.W.; investigation, J.C. and T.L.; resources, R.W.; data curation, J.C.; writing—original draft preparation, J.C.; writing—review and editing, X.L. and R.W.; visualization, J.C.; supervision, X.L., J.C. and R.W.; project administration, X.L.; funding acquisition, X.L. and R.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Natural Science Foundation of China (51805152), the Zhejiang Province Postdoctoral Optimal Funding Project (ZJ2024148), and the Doctoral Scientific Research Foundation of Hubei University of Technology (BSQD2020007).

Data Availability Statement

The data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
6-DoF	Six degrees of freedom
DyT	Dynamic Tanh
IoU	Intersection over union
TD	Target detection
OS	Object segmentation

References

  1. Fu, K.; Dang, X. Light-weight convolutional neural networks for generative robotic grasping. IEEE Trans. Ind. Inform. 2024, 20, 6696–6707. [Google Scholar] [CrossRef]
  2. Chu, F.; Xu, R.; Vela, P. Real-world multiobject, multigrasp detection. IEEE Robot. Autom. Lett. 2018, 3, 3355–3362. [Google Scholar] [CrossRef]
  3. Ibarz, J.; Tan, J.; Finn, C.; Kalakrishnan, M.; Pastor, P.; Levine, S. How to train your robot with deep reinforcement learning: Lessons we have learned. Int. J. Robot. Res. 2021, 40, 698–721. [Google Scholar] [CrossRef]
  4. Liang, H.; Ma, X.; Li, S.; Michael, G.; Song, T.; Bin, F. Pointnetgpd: Detecting grasp configurations from point sets. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 3629–3635. [Google Scholar]
  5. Dobrev, Y.; Flores, S.; Vossiek, M. Multi-modal sensor fusion for indoor mobile robot pose estimation. In Proceedings of the 2016 IEEE/ION Position, Location and Navigation Symposium (PLANS), Savannah, GA, USA, 11–14 April 2016; pp. 553–556. [Google Scholar]
  6. Liu, X.; Liu, X.; Guo, D.; Huaping, L.; Fuchun, S.; Min, H. Self-supervised learning for alignment of objects and sound. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 1588–1594. [Google Scholar]
  7. Noda, K.; Arie, H.; Suga, Y.; Ogata, T. Multimodal integration learning of robot behavior using deep neural networks. Robot. Auton. Syst. 2014, 62, 721–736. [Google Scholar] [CrossRef]
  8. Shridhar, M.; Mittal, D.; Hsu, D. Ingress: Interactive visual grounding of referring expressions. Int. J. Robot. Res. 2020, 39, 217–232. [Google Scholar] [CrossRef]
  9. Chen, Y.; Xu, R.; Lin, Y.; Vela, P. A joint network for grasp detection conditioned on natural language commands. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 4576–4582. [Google Scholar]
  10. Ding, M.; Liu, Y.; Yang, C.; Lan, X. Visual manipulation relationship detection based on gated graph neural network for robotic grasping. In Proceedings of the 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Kyoto, Japan, 23–27 October 2022; pp. 1404–1410. [Google Scholar]
  11. Lu, Y.; Fan, Y.; Deng, B.; Liu, F.; Li, Y.; Wang, S. Vl-grasp: A 6-dof interactive grasp policy for language-oriented objects in cluttered indoor scenes. In Proceedings of the 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Detroit, MI, USA, 1–5 October 2023; pp. 976–983. [Google Scholar]
  12. Song, D.; Huebner, K.; Kyrki, V.; Kragic, D. Learning task constraints for robot grasping using graphical models. In Proceedings of the 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, Taipei, Taiwan, 18–22 October 2010; pp. 1579–1585. [Google Scholar]
  13. Murali, A.; Liu, W.; Marino, K.; Chernova, S.; Gupta, A. Same object, different grasps: Data and semantic knowledge for task-oriented grasping. arXiv 2020, arXiv:2011.06431. [Google Scholar]
  14. Tang, C.; Huang, D.; Ge, W.; Liu, W.; Zhang, H. Graspgpt: Leveraging semantic knowledge from a large language model for task-oriented grasping. IEEE Robot. Autom. Lett. 2023, 8, 7551–7558. [Google Scholar] [CrossRef]
  15. Yu, L.; Lin, Z.; Shen, X.; Yang, J.; Lu, X.; Bansal, M.; Berg, T. Mattnet: Modular attention network for referring expression comprehension. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 1307–1315. [Google Scholar]
  16. Nagaraja, V.; Morariu, V.; Davis, L. Modeling context between objects for referring expression understanding. arXiv 2016, arXiv:1608.00525. [Google Scholar]
  17. Hu, R.; Xu, H.; Rohrbach, M.; Feng, J.; Saenko, K.; Darrell, T. Natural language object retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 4555–4564. [Google Scholar]
  18. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  19. Greff, K.; Srivastava, R.; Koutník, J.; Steunebrink, B.; Schmidhuber, J. LSTM: A search space odyssey. IEEE Trans. Neural Netw. Learn. Syst. 2016, 28, 2222–2232. [Google Scholar] [CrossRef] [PubMed]
  20. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2023, arXiv:1706.03762. [Google Scholar]
  21. Yang, L.; Xu, Y.; Yuan, C.; Liu, W.; Li, B.; Hu, W. Improving visual grounding with visual-linguistic verification and iterative reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 9499–9508. [Google Scholar]
  22. Feng, G.; Hu, Z.; Zhang, L.; Lu, H. Encoder fusion network with co-attention embedding for referring image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 15506–15515. [Google Scholar]
  23. Li, M.; Sigal, L. Referring transformer: A one-step approach to multi-task visual grounding. Adv. Neural Inf. Process. Syst. 2021, 34, 19652–19664. [Google Scholar]
  24. Luo, G.; Zhou, Y.; Sun, X.; Cao, L.; Wu, C.; Deng, C.; Ji, R. Multi-task collaborative network for joint referring expression comprehension and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 10034–10043. [Google Scholar]
  25. Lenz, I.; Lee, H.; Saxena, A. Deep learning for detecting robotic grasps. Int. J. Robot. Res. 2015, 34, 705–724. [Google Scholar] [CrossRef]
  26. Kumra, S.; Joshi, S.; Sahin, F. Antipodal robotic grasping using generative residual convolutional neural network. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 25–29 October 2020; pp. 9626–9633. [Google Scholar]
  27. Morrison, D.; Corke, P.; Leitner, J. Learning robust, real-time, reactive robotic grasping. Int. J. Robot. Res. 2020, 39, 183–201. [Google Scholar] [CrossRef]
  28. Fang, H.; Wang, C.; Gou, M.; Lu, C. Graspnet-1billion: A large-scale benchmark for general object grasping. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 11444–11453. [Google Scholar]
  29. Chen, W.; Liang, H.; Chen, Z.; Sun, F.; Zhang, J. Improving object grasp performance via transformer-based sparse shape completion. J. Intell. Robot. Syst. 2022, 104, 45. [Google Scholar] [CrossRef]
  30. Ma, H.; Huang, D. Towards scale balanced 6-dof grasp detection in cluttered scenes. In Proceedings of the 6th Conference on Robot Learning, PMLR, Atlanta, GA, USA, 6–9 November 2023; pp. 2004–2013. [Google Scholar]
  31. Shafiq, M.; Gu, Z. Deep residual learning for image recognition: A survey. Appl. Sci. 2022, 12, 8972. [Google Scholar] [CrossRef]
  32. Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
  33. Zhuo, Z.; Zeng, Y.; Wang, Y.; Zhang, S.; Yang, J.; Li, X.; Zhou, X.; Ma, J. HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization. arXiv 2025, arXiv:2503.04598. [Google Scholar]
  34. Qi, C.; Yi, L.; Su, H.; Guibas, L. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. arXiv 2017, arXiv:1706.02413. [Google Scholar]
  35. Qi, C.; Su, H.; Mo, K.; Guibas, L. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 652–660. [Google Scholar]
  36. Oliveira, D.; Conceicao, A. A Fast 6DOF Visual Selective Grasping System Using Point Clouds. Machines 2023, 11, 540. [Google Scholar] [CrossRef]
  37. Lin, T.; Goyal, P.; Girshick, R.; He, K.; Dollar, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  38. Milletari, F.; Navab, N.; Ahmadi, S. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; pp. 565–571. [Google Scholar]
  39. Wang, C.; Fang, H.S.; Gou, M.; Fang, H.; Gao, J.; Lu, C. Graspness discovery in clutters for fast and accurate grasp detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 15964–15973. [Google Scholar]
  40. Ten, A.; Gualtieri, M.; Saenko, K.; Platt, R. Grasp pose detection in point clouds. Int. J. Robot. Res. 2017, 36, 1455–1473. [Google Scholar]
  41. Li, Y.; Kong, T.; Chu, R.; Li, Y.; Wang, P.; Li, L. Simultaneous semantic and collision learning for 6-dof grasp pose estimation. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021; pp. 3571–3578. [Google Scholar]
Figure 1. The overall framework of the 6-DoF grasping method based on visual language guidance.
Figure 2. Visual language model structure.
Figure 3. Comparison of multi-head attention mechanism structures. (a) Post-normalization structure; (b) pre-normalization structure; (c) hybrid normalized multi-head attention mechanism structure; (d) DyT structure.
Figure 4. The 6-DoF grasp detection model structure proposed in this paper.
Figure 5. DSN segmentation network structure.
Figure 6. Cylindrical sampling network structure.
Figure 7. Local attention module structure.
Figure 8. Visualization of some scenes of the visual language model.
Figure 9. Visualization of some scenes of the grasp detection model [28,30].
Figure 10. Real grasp experimental environment.
Figure 11. Visualization of some results of real grasp experiments.
Figure 12. Visualization of some results of failed grasp experiments.
Table 1. Visual language model performance comparison.

Method          Backbone Network   Task   Test Set A   Test Set B   Params/MB   Time/ms
VL-Grasp [11]   ResNet50           TD     86.92        54.12        127.16      44.37
                                   OS     85.46        61.49
                ResNet101          TD     85.62        55.97        --          --
                                   OS     83.89        60.72
Ours            ResNet50           TD     87.96        55.19        128.36      65.53
                                   OS     88.23        62.44
Table 2. Comparison of grasp detection model performance.

Method             Seen                      Similar                   Novel                     Params/MB   Time/ms
                   AP     AP0.8   AP0.4     AP     AP0.8   AP0.4     AP     AP0.8   AP0.4
GPD [40]           22.87  28.53   12.84     21.33  27.83   9.64      8.24   8.89    2.67      --          --
PointnetGPD [4]    25.96  33.01   15.37     22.68  29.15   10.76     9.23   9.89    2.74      --          --
GraspNet-1B [28]   27.56  33.43   16.95     26.11  34.18   14.23     10.55  11.25   3.98      1.03        296
Li et al. [41]     36.55  47.22   19.24     28.36  36.11   10.85     14.01  16.56   4.82      --          --
GSNet [39]         67.12  78.46   60.90     54.81  66.72   46.17     24.31  30.52   14.23     --          --
Ma et al. [30]     63.83  74.25   58.66     58.46  70.05   51.32     24.63  31.05   12.85     1.14        43.29
Ours               65.13  76.15   60.24     59.87  70.90   52.29     25.10  31.67   13.17     1.58        87.83
Table 3. Grasp performance of the grasp detection model on multi-scale objects.

Method       Seen                       Similar                    Novel
             AP_S    AP_M    AP_L      AP_S    AP_M    AP_L      AP_S    AP_M    AP_L
Baseline     9.44    45.99   54.13     5.15    35.54   47.82     4.91    15.26   19.83
Ma et al.    18.29   52.60   64.34     10.03   42.77   57.09     9.29    18.74   24.36
Ours         19.93   55.39   70.65     10.65   45.84   64.38     10.05   20.53   27.05
Table 4. Real grasp experiment results.

Scene                1    2    3    4    5    6
Grasp times          20   20   20   20   20   20
Successful grasps    18   19   17   20   20   20
Grasp success rate   114/120 = 95.00%