Article

Robot Grasp Detection with Loss-Guided Collaborative Attention Mechanism and Multi-Scale Feature Fusion

1 College of Electrical Engineering, Northwest Minzu University, Lanzhou 730030, China
2 Gansu Engineering Research Center for Eco-Environmental Intelligent Networking, Lanzhou 730030, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(12), 5193; https://doi.org/10.3390/app14125193
Submission received: 25 April 2024 / Revised: 4 June 2024 / Accepted: 10 June 2024 / Published: 14 June 2024

Abstract

Grasp detection serves as the fundamental element for achieving successful grasping in robotic systems. The encoder–decoder structure has become widely adopted as the foundational architecture for grasp detection networks due to its inherent advantages of speed and accuracy. However, traditional network structures fail to effectively extract the essential features required for accurate grasping poses and neglect to eliminate the checkerboard artifacts caused by transposed convolution during decoding. To overcome these challenges, we propose a novel generative grasp detection network (LGAR-Net2). A bilinear interpolation layer is employed in place of the transposed convolution layer in the decoder to remove the uneven overlap and consequently eliminate checkerboard artifacts. In addition, a loss-guided collaborative attention block (LGCA), which combines attention blocks with spatial pyramid blocks to enhance the attention to important regions of the image, is constructed to improve the accuracy of information extraction. Validated on the Cornell public dataset using RGB images as the input, LGAR-Net2 achieves an accuracy of 97.7%, an improvement of 1.1% over the baseline network, and processes a single RGB image in just 15 ms.

1. Introduction

With the widespread implementation of robots in various fields, grasping technology, which plays a crucial role in controlling robots, has gained significant prominence. Currently, the majority of grasp detection methods still rely heavily on established object models and contextual information, rendering them inadequate for the challenging task of recognizing complex unknown objects. With the continuous development of deep learning technology, deep learning-based methods [1] have made significant progress in grasp detection in recent years. Jiang et al. [2] first proposed a grasp representation based on 2D-oriented grasp boxes, optimizing the expression of grasp poses and enabling direct grasp detection through object detection. Since then, researchers have focused on improving the accuracy and efficiency of grasp detection. Lenz et al. [3] employed a two-stage cascaded network for grasp pose detection, achieving an accuracy of 73.9% on the Cornell grasp dataset; however, owing to the extensive number of network parameters, real-time performance was compromised and generalization was limited. Redmon et al. [4] trained an end-to-end network on RGD (red, green, depth) input with a deeper backbone to obtain grasp detection results directly, achieving 88.0% accuracy on the Cornell dataset. Morrison et al. [5] proposed a generative grasp convolutional neural network that utilizes depth images as input and directly outputs the precise grasp location, angle, and jaw width, thereby enabling real-time object grasping by robotic arms. Kumra et al. [6] proposed GRCNN, a convolutional neural network capable of processing multi-modal inputs such as RGB, RGD, and RGB-D data; by employing regression techniques for grasp pose prediction, the model achieved an accuracy of 96.6% on the Cornell grasp dataset.
To enhance real-time performance and accuracy in the grasp detection process, this paper proposes LGAR-Net2, a grasp detection network based on a loss-guided collaborative attention mechanism. RGB input is used to generate grasp detection results with the same dimensions as the input image, achieving pixel-level detection precision. In the encoding stage, multi-layer strided convolution is employed to prepare for deep semantic extraction. The semantic extraction stage incorporates a convolution structure that combines a residual architecture with the loss-guided collaborative attention mechanism to enhance network robustness in detecting irregular objects. In the decoding stage, a bilinear interpolation layer replaces the transposed convolution layer to ensure smoother pixel representation.
The contributions of this paper are summarized as follows:
  • We propose a novel grasp detection network that integrates a loss-guided collaborative attention mechanism with multi-scale feature fusion to enhance the network’s focus on object grasp regions.
  • In order to address the checkerboard artifacts in the up-sampling layer and improve the grasping accuracy, an effective convolutional decoding block is proposed.
  • The grasping network LGAR-Net2 built with the proposed modules achieves 97.7% accuracy on the public dataset. Furthermore, an ROS simulation environment is established for the real-time prediction of grasping poses using LGAR-Net2, achieving a processing speed of 15 ms per image with a grasping accuracy of 93.3%.

2. Related Works

2.1. Grasp Detection Using Deep Learning

The empirical method and the analytical method are commonly employed for grasp detection. The analytical approach relies on a known object model and utilizes precise mathematical modeling to calculate the grasping pose [7]. Although it has a rich theoretical foundation, in real scenes the object model and the mathematical model relating the manipulator to the object are intricate, leading to extensive calculations. The empirical method, on the other hand, determines the grasping pose from the approximate position and shape of the object, thus circumventing the complex modeling process. The method in [8], which is based on target recognition, first identifies the region of interest (ROI) and then determines the grasping pose using a regression approach. Lenz et al. [3] employed a neural network as a classifier for detecting grasp rectangles, achieving a detection accuracy of 75.6% on the Cornell dataset and thereby enhancing the precision of grasping pose estimation. Pinto et al. [9] utilized convolutional neural networks (CNNs) to predict the grasping posture of objects, leading to a significant improvement in the network's overall generalization performance.

2.2. Encoder–Decoder Regression

The encoder–decoder architecture utilizes the encoder module to extract features and the decoder module to reconstruct the learned feature information into an output of the original image size. This design is employed in U-Net [10], Seg-Net [11], and other networks to achieve automatic and efficient pixel-level output. Mahajan et al. [12] proposed a vector-quantized variational autoencoder architecture for feature extraction in a semi-supervised learning setting, thereby improving the generalization capability of GG-CNN. Additionally, Yu et al. [13] replaced traditional transposed convolution with separable convolution to reduce model parameters and improve speed, and employed target segmentation to effectively distinguish the target from the background, resulting in higher accuracy. However, checkerboard artifacts remain challenging, leading to feature-map distortion in the decoder and degrading grasp prediction performance. In contrast to the above approaches, this study presents a decoder based on bilinear interpolation to address the checkerboard artifact issue.

2.3. Attention Mechanisms for Grasp Detection

The attention mechanism [14] has shown strong performance in natural language processing (NLP) tasks, and researchers have subsequently applied this framework to grasping pose detection. Recent research on grasp detection [15] has demonstrated that incorporating an attention mechanism into the network can effectively enhance the network's focus on grasp regions. Hu et al. [16] integrated a channel attention mechanism with a residual structure to establish inter-channel connections by encoding the weight of each channel; however, they did not consider encoding spatial feature information. Building on SE attention, the work in [17] introduced CBAM attention, which comprehensively considers information in both the spatial and channel dimensions. The CoA module [18] employs a self-attention mechanism to enhance the recognition of feature information in the spatial dimension: the evaluation score of a region is obtained by summing the correlation between each region in the feature image and its adjacent regions, and a one-dimensional convolution operation determines the spatial and channel weights. AMSC-Net [19] effectively improves the network's attention to the grasping area by fusing attention mechanisms in two dimensions, space and channel; nonetheless, its attention mechanism is realized through simple convolution, which does not fully exploit contextual information. DSC-GraspNet [20] introduces a collaborative attention (CA) mechanism that converts the feature tensor into a single feature vector through 2D global pooling to enhance the relevant features.
In contrast to the aforementioned, this article proposes an attention mechanism that employs a loss-guided approach. By utilizing a loss function to adjust attention, the recognition capability of the regions of interest is enhanced. It modifies feature weights in multiple dimensions to improve focus on crucial feature information.

3. Problem Statement

3.1. Grasp Pose Definition

During the process of grasp pose generation, precise grasping can be achieved by establishing a mathematical model that encompasses the robot arm, object, and camera. Typically, the grasp pose in the camera coordinate system is represented as a five-dimensional grasp representation [6], which can be visualized as depicted in Figure 1b.
The grasp pose of a pixel point in the camera coordinate system is initially defined as follows:
$g = (p, \theta, w, q)$ (1)
The grasping point is denoted by the coordinate $p = (u, v)$ in the image plane, while the angle $\theta$ represents the orientation of the width edge of the grasping box with respect to the horizontal X-axis; $\theta$ ranges over $[-\frac{1}{2}\pi, \frac{1}{2}\pi]$. $w$ denotes the gripper's opening and closing width, and $q$ quantifies the grasping quality of a point on a scale from 0 to 1.
The grasping pose set of each pixel in the image is defined as follows:
$\hat{G}_i = (\hat{P}_i, \hat{\theta}_i, \hat{W}_i, \hat{Q}_i)$ (2)
The grasping posture is defined within the world coordinate system as follows:
$G_i = (P_i, \theta_i, W_i, Q_i)$ (3)
$P_i = (X_i, Y_i, Z_i)$ represents the point in the world coordinate system; since the spatial relationship between the camera and the object is fixed, the elevation remains constant, ensuring a fixed value for $Z_i$ within the global coordinate system.
Finally, the transformation $T_{camera}^{base}$ between the robotic arm base and the camera is determined by employing DH modeling [21] for coordinate system conversion.
The grasping pose in the camera coordinate system is transformed into the grasping pose in the world coordinate system, which can be achieved through Equation (4):
$G_i = T_{camera}^{base}(\hat{G}_i)$ (4)
The grasp detection network takes the camera image $I \in \mathbb{R}^{n \times h \times w}$ as input and computes the set of grasp poses $\hat{G}_i$ for each pixel point in the camera coordinate system, followed by determining the optimal grasp pose $\hat{G}^{*} = \max_{\hat{Q}} \hat{G}$, i.e., the grasp with the highest grasp quality. Consequently, this paper focuses on investigating the set of grasp poses $\hat{G}_i$ for every pixel point in the camera coordinate system.
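To make the selection of the optimal grasp concrete, the following is a minimal sketch (not the authors' implementation) of decoding the best grasp from the pixel-wise output maps; it assumes the angle is regressed as a (cos 2θ, sin 2θ) pair, as the loss terms in Section 4.5 suggest.

```python
import torch

def best_grasp(q_map, cos_map, sin_map, w_map):
    """Pick the highest-quality pixel and decode its grasp pose.

    q_map, cos_map, sin_map, w_map: (H, W) tensors predicted by the network.
    Returns (u, v), angle theta in [-pi/2, pi/2], width w, and quality q.
    """
    # Flatten and locate the pixel with the maximum grasp quality Q.
    idx = torch.argmax(q_map)
    v, u = divmod(idx.item(), q_map.shape[1])  # v = row, u = column

    # Assumption: the angle is regressed as (cos 2θ, sin 2θ); recover θ via atan2 / 2.
    theta = 0.5 * torch.atan2(sin_map[v, u], cos_map[v, u])

    return (u, v), theta.item(), w_map[v, u].item(), q_map[v, u].item()
```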

3.2. The Checkerboard Artifact Problem Description

The output feature map of the transposed convolution process in a detection network with a decoding structure exhibits distinct grid-like patterns due to the “uneven overlap” resulting from the deconvolution process, as illustrated in Figure 2.
Checkerboard artifacts can significantly degrade the quality and accuracy of the generated images, reducing their clarity and realism. The artifacts can impede visual perception of the image and may even cause recognition to fail. Therefore, effectively mitigating or eliminating checkerboard artifacts is crucial for ensuring the high quality of the generated images.
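The uneven-overlap effect can be reproduced in a few lines. The following sketch (ours, for illustration only) contrasts a stride-2 transposed convolution, whose 3 × 3 kernel size is not divisible by the stride, with bilinear up-sampling followed by an ordinary convolution.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.ones(1, 1, 8, 8)  # constant input: any unevenness comes from the operator itself

# Transposed convolution whose kernel size (3) is not divisible by its stride (2):
# kernel footprints overlap unevenly, producing a periodic checkerboard pattern.
deconv = nn.ConvTranspose2d(1, 1, kernel_size=3, stride=2,
                            padding=1, output_padding=1, bias=False)
nn.init.constant_(deconv.weight, 1.0)
print(deconv(x)[0, 0])  # interior values alternate in a grid pattern -> artifacts

# Bilinear up-sampling followed by an ordinary convolution: every interior output
# pixel receives the same number of contributions, so the response stays flat.
smooth = nn.Sequential(
    nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
    nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False),
)
nn.init.constant_(smooth[1].weight, 1.0)
print(smooth(x)[0, 0])  # uniform interior values -> no checkerboard
```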

4. Method

This section presents a comprehensive description of LGAR-Net2, the proposed deep learning network for pixel-level grasp detection. The input to LGAR-Net2 is an RGB image, and the output consists of four grasp pose maps with the same dimensions, where each pixel in the output corresponds to a pixel in the input image. Figure 3 illustrates the architecture of LGAR-Net2, which comprises three main modules: encoder, bottleneck, and decoder. The encoder module primarily incorporates multiple convolutional layers to decompose the multi-channel input image and extract relevant features. The bottleneck module leverages the fusion of the residual blocks and the LGCA block to capture multi-dimensional signals at different scales. Finally, the decoder module employs parallel standard convolutional processing to generate grasp predictions.
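The overall data flow can be summarized with a short structural sketch (ours, not the authors' code); the module objects and the channel width are placeholders for the components detailed in the following subsections.

```python
import torch.nn as nn

class LGARNet2Skeleton(nn.Module):
    """Structural sketch: encoder -> bottleneck (RSE + LGCA + fusion) -> decoder -> 4 heads."""
    def __init__(self, encoder, bottleneck, decoder, channels=32):  # channels is an assumption
        super().__init__()
        self.encoder = encoder        # strided CBR blocks (Section 4.1)
        self.bottleneck = bottleneck  # RSE blocks + LGCA block + feature fusion (Sections 4.2-4.3)
        self.decoder = decoder        # two BIC blocks (Section 4.4)
        # Four parallel standard convolutions produce the pixel-wise grasp maps.
        self.q_head = nn.Conv2d(channels, 1, kernel_size=3, padding=1)
        self.cos_head = nn.Conv2d(channels, 1, kernel_size=3, padding=1)
        self.sin_head = nn.Conv2d(channels, 1, kernel_size=3, padding=1)
        self.w_head = nn.Conv2d(channels, 1, kernel_size=3, padding=1)

    def forward(self, rgb):
        f = self.decoder(self.bottleneck(self.encoder(rgb)))
        return self.q_head(f), self.cos_head(f), self.sin_head(f), self.w_head(f)
```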

4.1. Encoder

In order to extract additional feature information, this paper uses a feature extraction module that extracts the detailed features of the input images through multiple convolution layers, as shown in Figure 4. Specifically, it includes one CBR Block*4, one CBR Block*2, and two average pooling layers (stride = 2). We set the convolution kernel size in the module to 3 × 3 to achieve stronger feature description capability. Through the feature extraction module, the input RGB data are converted into the feature map $F_e \in \mathbb{R}^{64 \times h/2 \times w/2}$.
The CBR Block*k can be characterized as follows:
$T_{out} = f_{avgpool}(f_{conv_k}(f_{conv_{k-1}}(\cdots f_{conv_1}(T_{in}))))$ (5)
The series structure of a 2D convolution layer, a normalization layer, and a ReLU layer is denoted by $f_{conv_k}$, while $f_{avgpool}$ represents the down-sampling layer implemented through average pooling.
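A minimal PyTorch sketch of the CBR Block*k described by Eq. (5); the assumption that each unit is Conv3x3 + BatchNorm + ReLU and the channel widths shown are ours, for illustration.

```python
import torch.nn as nn

def cbr_block(in_ch, out_ch, k):
    """CBR Block*k as in Eq. (5): k stacked (Conv3x3 -> BatchNorm -> ReLU) units
    followed by a stride-2 average pooling layer."""
    layers = []
    for i in range(k):
        layers += [
            nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        ]
    layers.append(nn.AvgPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

# Illustrative encoder: one CBR Block*4 and one CBR Block*2 (channel widths are assumptions).
encoder = nn.Sequential(cbr_block(3, 32, k=4), cbr_block(32, 64, k=2))
```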

4.2. Bottleneck Part

At the bottleneck section, the network must extract a substantial amount of semantic information to provide the dependable feature images for the decoding phase. However, with an increasing number of network layers, the issue of gradient vanishing becomes more pronounced. Residual layers [16] offer an effective network design approach that circumvents the problem of gradient vanishing by backpropagating the output from the preceding layer while maintaining consistent input dimensions. Consequently, in this stage, we employ multiple residual modules to extract profound features from the image.

4.2.1. Residual Squeeze Excitation Block (RSE Block)

In this study, we propose a novel residual squeeze excitation network (RSE) with six locally connected skip connections to enhance the semantic features, as shown in Figure 5. The RSE architecture consists of two 2D convolutional layers followed by a squeeze block and an excitation block in series. The calculation formula for RSE is as follows:
$T_1 = f_{RSE}(F_e), \quad T_2 = f_{RSE}(T_1), \quad \ldots, \quad T_n = f_{RSE}(T_{n-1})$
The input feature image $F_e$ is processed by $n$ RSE blocks in this study to produce $T_1$–$T_n$. Specifically, six RSE blocks ($n$ = 6) are employed to generate output feature images, which are subsequently fused to yield multi-layer feature images with enhanced richness.
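The RSE block can be sketched as follows; the reduction ratio, the use of batch normalization, and the 128-channel width are assumptions on our part rather than values stated in the paper.

```python
import torch.nn as nn

class RSEBlock(nn.Module):
    """Residual squeeze-excitation block: two 3x3 convolutions followed by an SE gate,
    with a skip connection so that input and output shapes match."""
    def __init__(self, ch=128, reduction=16):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1, bias=False), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1, bias=False), nn.BatchNorm2d(ch),
        )
        # Squeeze: global average pooling; Excitation: bottleneck MLP + sigmoid channel weights.
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // reduction, ch, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        y = self.body(x)
        return x + y * self.se(y)   # residual connection preserves the input dimensions

# Six cascaded RSE blocks produce T1..T6 from the encoder feature F_e.
rse_blocks = nn.ModuleList([RSEBlock(128) for _ in range(6)])
```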

4.2.2. Loss-Guided Collaborative Attention Block (LGCA Block)

During the forward propagation of features, semantic features are enhanced while detailed information is gradually suppressed, thereby compromising the discriminative capability of the features. In this study, we propose a novel LGCA block, shown in Figure 6, that combines a cross-entropy auxiliary loss function computed between the ground truth and the deep features of the grasping region with a self-attention mechanism and the MUS module. This module guides the network to focus on contextual information within graspable regions in order to enhance their discriminative ability. Compared with existing multi-branch and multi-scale context strategies based on atrous convolution, which only consider spatial relationships between pixels, our LGCA block enhances each pixel's contribution to the grasp representation by considering associations between all graspable object regions and individual pixel points. It also facilitates the selective learning of graspable regions while suppressing interference from non-graspable regions, thus improving the generalization capability of grasp detection algorithms.
The LGCA block consists of a loss-guided layer, a collaborative attention block based on local jump, and a multi-scale module. Each of these components will be individually introduced in the following sections.
1. Loss guide block
During network training, the loss-guided layer up-samples $T_6 \in \mathbb{R}^{128 \times h/4 \times w/4}$ to $F_l \in \mathbb{R}^{1 \times h \times w}$, the same size as the original input, through a 1 × 1 convolution layer followed by an up-sampling layer based on bilinear interpolation. The difference between $F_l$ and the ground-truth grasp pose map generated from the data label is then taken as an auxiliary loss. This loss is added to the total loss to guide the network toward the best pose of the object.
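A hedged sketch of such a loss-guided layer (ours, not the released code), assuming the auxiliary loss is the Smooth L1 term described in Section 4.5:

```python
import torch.nn as nn
import torch.nn.functional as F

class LossGuide(nn.Module):
    """Auxiliary loss-guided layer: project T6 to a single-channel map with a 1x1
    convolution, up-sample it bilinearly to the input resolution, and penalise its
    deviation from the ground-truth grasp-region map."""
    def __init__(self, ch=128):
        super().__init__()
        self.proj = nn.Conv2d(ch, 1, kernel_size=1)

    def forward(self, t6, target):
        # t6: (B, 128, h/4, w/4); target: (B, 1, h, w) ground-truth grasp-region map.
        f_l = F.interpolate(self.proj(t6), size=target.shape[-2:],
                            mode="bilinear", align_corners=False)
        aux_loss = F.smooth_l1_loss(f_l, target)  # added to the total loss during training
        return f_l, aux_loss
```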
2. Collaborative attention block
Accurately understanding object feature information in complex backgrounds is crucial, but traditional convolutional calculations with sliding windows can only establish connections between adjacent pixels and cannot establish effective connections between pixels in different channels or positions. Moreover, the semantic information in the deep layers of the network is expressed across different channels, yet traditional 2D convolution operates within a single channel without establishing connections between channels. Therefore, in the spatial dimension we establish connections between pixels by means of cutting, transposition, and assembly, and across channels we establish connections by applying average-pooling down-sampling to each channel and then concatenating the convolution results. By adding channel-wise attention blocks, we can more accurately capture the semantic features of objects, as shown in the structure of the collaborative attention block in Figure 7.
The collaborative attention block comprises a spatial-correlation attention block and a channel-correlation attention block. The input of the channel-correlation attention module is $T_6$, and its output has an identical shape. Initially, the vector $S_c$ is obtained by average-pooling each channel; a 1 × 3 convolution kernel (stride 1) is then employed for the one-dimensional convolution of $S_c$, fusing the feature information of adjacent channels. The length of the vector is then recovered using a deconvolution operation, and finally the weights are mapped to [0, 1] using a Sigmoid function. The channel-correlation attention block makes full use of the feature relationship between adjacent channels, allocates the weights between channels, and strengthens the connection between channels. Its structure can be expressed as follows:
$S_c = \frac{1}{H \times W} \sum_{h=1}^{H} \sum_{w=1}^{W} F_{c,h,w}, \quad S = \{S_1, S_2, \ldots, S_C\}, \quad \hat{S} = \mathrm{Sig}(\mathrm{DConv}(\mathrm{Conv}(S))), \quad F_s = F_{c,h,w} \cdot \hat{S}_{c,1,1}$
where $S_c$ is the average pooling value of channel $c$, and $F_{c,h,w}$ is a pixel in the feature image.
The spatial attention module is depicted in Figure 7. Similarly, taking $T_6$ as the input, the feature map of each channel is flattened into a vector and the vectors are concatenated to obtain $E \in \mathbb{R}^{C \times HW}$. Subsequently, $E$ is multiplied by its transpose and passed through a softmax layer to derive the output matrix $S$ according to the following formula:
$E = \mathrm{reshape}(F)$
$S_{ij} = \frac{\exp\left((E^{T}E)_{ij}\right)}{\sum_{j=1}^{HW} \exp\left((E^{T}E)_{ij}\right)}$
where $S \in \mathbb{R}^{HW \times HW}$ and $*$ denotes matrix multiplication.
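The two branches can be sketched as follows. The text specifies how the channel weights and the affinity matrix $S$ are computed but not exactly how $S$ is applied to the features, so the aggregation and residual step in the spatial branch below is our assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelCorrelationAttention(nn.Module):
    """Channel branch: average-pool each channel to a scalar, fuse neighbouring channels
    with a 1D convolution, restore the length with a 1D transposed convolution, and map
    the result to [0, 1] channel weights with a sigmoid."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=3, stride=1)              # shrinks length by 2
        self.deconv = nn.ConvTranspose1d(1, 1, kernel_size=3, stride=1)   # restores the length

    def forward(self, x):                        # x: (B, C, H, W)
        s = x.mean(dim=(2, 3)).unsqueeze(1)      # S_c: (B, 1, C) per-channel averages
        w = torch.sigmoid(self.deconv(self.conv(s)))     # (B, 1, C) channel weights
        return x * w.squeeze(1)[:, :, None, None]        # re-weight every channel

class SpatialCorrelationAttention(nn.Module):
    """Spatial branch: flatten each channel, compute pixel-pixel affinities softmax(E^T E),
    and use them to aggregate context for every position (application step is our assumption)."""
    def forward(self, x):                        # x: (B, C, H, W)
        b, c, h, w = x.shape
        e = x.flatten(2)                         # E: (B, C, HW)
        s = F.softmax(e.transpose(1, 2) @ e, dim=-1)     # (B, HW, HW) affinity map
        out = (e @ s.transpose(1, 2)).view(b, c, h, w)   # context-aggregated features
        return out + x                           # residual keeps the original signal
```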
3. Multi-Scale Module
Atrous convolutions offer the ability to precisely control the resolution of features computed by deep convolutional neural networks and to adjust the filter's field of view so as to capture information at multiple scales. In this study, we establish a connection between the deep layers of the network and the decoding module using the MUS module [22], into which we feed the feature map $F_A$ extracted by the attention module. Within the MUS module, atrous convolutions with different dilation rates (6, 12, and 18) are applied in parallel to generate feature maps, which are subsequently restored to the size of the input feature map through 2D convolution layers, as sketched below.
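A minimal sketch of such a multi-scale module, modeled on the atrous-convolution design of [22]; the use of batch normalization and the 1 × 1 projection are assumptions.

```python
import torch
import torch.nn as nn

class MUSModule(nn.Module):
    """Multi-scale module: parallel atrous convolutions with dilation rates 6/12/18 capture
    context at several receptive-field sizes; their outputs are concatenated and projected
    back to the input channel count with an ordinary convolution."""
    def __init__(self, ch=128, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(ch, ch, kernel_size=3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(ch),
                nn.ReLU(inplace=True),
            )
            for r in rates
        ])
        self.project = nn.Conv2d(ch * len(rates), ch, kernel_size=1)

    def forward(self, f_a):                      # f_a: attention-refined feature map
        return self.project(torch.cat([b(f_a) for b in self.branches], dim=1))
```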

4.3. Feature Fusion Module

The pre-stage input is first processed by the six residual squeeze excitation blocks to generate six sets of feature images ($T_{1 \sim 6} \in \mathbb{R}^{128 \times h/4 \times w/4}$). The output $T_6$ is then sent to the loss-guided collaborative attention block, which adjusts the attention and generates a further set of feature images ($T_7 \in \mathbb{R}^{128 \times h/4 \times w/4}$). The seven groups of feature images are fused by channel concatenation to give $T_{1 \sim 7} \in \mathbb{R}^{(128 \cdot 7) \times h/4 \times w/4}$, and finally 128 convolution kernels are used to restore the feature image to the same size as the input, $F_s = \mathrm{conv}(\mathrm{cat}(T_1, \ldots, T_7))$.
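A sketch of the fusion step, under the assumption that a single convolution layer with 128 kernels performs the restoration (kernel size is our choice; the paper only states the number of kernels).

```python
import torch
import torch.nn as nn

def fuse_features(feature_maps, fusion_conv):
    """Concatenate the seven 128-channel feature maps T1..T7 along the channel axis
    (128*7 channels) and restore 128 channels with a bank of convolution kernels."""
    return fusion_conv(torch.cat(feature_maps, dim=1))

# 128 kernels map the concatenated 128*7 channels back to the decoder's expected width.
fusion_conv = nn.Conv2d(128 * 7, 128, kernel_size=3, padding=1)
```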

4.4. Decoder

Despite the successful recovery of feature images in the decoding stage by traditional deconvolution layers, they often lead to checkerboard artifacts between adjacent pixels. This issue significantly impacts performance, particularly for pixel-level grasp prediction. To address the problem of checkerboard artifacts, conventional methods typically modify the stride of the convolution kernel to ensure divisibility between kernel size and stride. However, since convolution weights change during learning, this approach fails to fundamentally resolve the checkerboard artifacts problem. Considering that bilinear interpolation can minimize differences in pixel values between interpolated and original input pixels, we employed a combination of bilinear interpolation layers for feature recovery. This approach effectively enhances pixel smoothness and successfully resolves the checkerboard artifacts problem.
The BIC-Block [23] (as illustrated in Figure 8) consists of a bilinear interpolation layer, a 2D convolution layer, and a ReLU layer. Its operation can be described as follows:
$F_b = f_{BIC}(f_{BIC}(F_s))$
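A minimal sketch of the BIC block and the two-stage decoder it forms; the channel widths are assumptions.

```python
import torch.nn as nn

def bic_block(in_ch, out_ch):
    """BIC block: bilinear 2x up-sampling followed by a 2D convolution and ReLU.
    Two such blocks restore the h/4 x w/4 bottleneck feature map to the input size."""
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
    )

decoder = nn.Sequential(bic_block(128, 64), bic_block(64, 32))  # channel widths are assumptions
```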

4.5. Loss Function

The loss function of LGAR-Net2 comprises the auxiliary LGCA loss and losses for the regression of the graspable region’s confidence, angle, and grasp width.
(1) LGCA auxiliary loss: the function $loss_s = \mathrm{SmoothL1}(\hat{y}_i^g, y_i^g)$ is defined, where $\hat{y}_i^g$ represents the predicted map of the network's graspable region and $y_i^g$ denotes the label of the graspable region.
(2) Regression losses for the confidence, angle, and width of the graspable region: $loss_{\cos}$, $loss_{\sin}$, $loss_{\omega}$, and $loss_{uv}$ are defined analogously as $\mathrm{SmoothL1}(\hat{y}_i^g, y_i^g)$ evaluated on the corresponding cos, sin, width, and position maps, where $y_i^g \in \mathbb{R}^{H \times W}$ represents the ground-truth map of the graspable region for the i-th sample and $\hat{y}_i^g \in \mathbb{R}^{H \times W}$ denotes the predicted map of the graspable region for the i-th sample.
The definition of the total network loss is as follows:
$loss = \lambda_1 \left( \frac{1}{n} \sum_{i=1}^{n} loss_{uv} \right) + \lambda_2 \left( \frac{1}{n} \sum_{i=1}^{n} loss_{\cos} + \frac{1}{n} \sum_{i=1}^{n} loss_{\sin} \right) + \lambda_3 \left( \frac{1}{n} \sum_{i=1}^{n} loss_{\omega} \right) + \lambda_4 \left( \frac{1}{n} \sum_{i=1}^{n} loss_{s} \right)$
where $n$ represents the number of training samples, and $y_i^g$ and $\hat{y}_i^g$ denote, respectively, the true value and the estimated value of the graspable-region confidence and the pose information.
The definition of the Smooth L1 loss is as follows:
$\mathrm{SmoothL1}(x) = \begin{cases} 0.5x^2, & \text{if } |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}$
Given the substantial influence of the grasp position and angle on grasping success, the local feature range indicated by the gripper opening width is considered less significant than these factors.
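A hedged sketch of the total loss computation; the dictionary layout and the equal default weights are assumptions, and `lgca_aux` stands for the auxiliary loss produced by the LGCA block.

```python
import torch.nn.functional as F

def total_loss(pred, target, lambdas=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the Smooth-L1 regression losses for the grasp-quality (uv),
    angle (cos/sin), and width maps, plus the LGCA auxiliary loss.
    `pred`/`target` are dicts of same-sized maps; the lambda weights are assumptions."""
    l_uv  = F.smooth_l1_loss(pred["q"],   target["q"])
    l_cos = F.smooth_l1_loss(pred["cos"], target["cos"])
    l_sin = F.smooth_l1_loss(pred["sin"], target["sin"])
    l_w   = F.smooth_l1_loss(pred["w"],   target["w"])
    l_aux = pred["lgca_aux"]                 # auxiliary loss returned by the LGCA block
    lam1, lam2, lam3, lam4 = lambdas
    return lam1 * l_uv + lam2 * (l_cos + l_sin) + lam3 * l_w + lam4 * l_aux
```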

5. Results

In this work, we conducted extensive comparative experiments to verify the LGAR-Net2 method. We performed trials on the publicly available Cornell and Jacquard [24] grasp datasets. To mitigate the risk of overfitting caused by the limited size of the Cornell dataset, this study applies data augmentation measures such as cropping, random rotation, and random scaling. The augmented images are resized to 224 × 224 to conform to the LGAR-Net2 network's input requirements. To test the model's discrimination and generalization abilities, the Cornell dataset is divided into two sub-datasets based on image-wise splitting and object-wise splitting. The Jacquard grasp dataset, a large-scale dataset created from CAD models, eliminates the need for manual data collection and label sampling. The training and test sets of Jacquard are divided using five-fold cross-validation.
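For illustration only, a possible augmentation sketch using torchvision; the rotation and scaling ranges are our assumptions, and in practice the grasp-rectangle labels must be transformed with the same parameters as the image.

```python
import random
import torchvision.transforms.functional as TF

def augment(img):
    """Illustrative augmentation: random rotation and scaling, then a fixed resize to the
    224x224 network input. The same rotation/scale/crop parameters must also be applied
    to the grasp-rectangle labels (omitted here)."""
    angle = random.uniform(-30, 30)      # rotation range is an assumption
    scale = random.uniform(0.9, 1.1)     # scaling range is an assumption
    img = TF.rotate(img, angle)
    img = TF.affine(img, angle=0, translate=(0, 0), scale=scale, shear=0)
    return TF.resize(img, [224, 224])
```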

5.1. Experimental Details

In this paper, the graphics card is an NVIDIA GeForce RTX 2080 Ti (11 GB), and Ubuntu 16.04 combined with the PyTorch 2.1.1 deep learning framework is used as the software platform. All experiments and verification in this paper were completed on this platform.
  • Model training implementation: We employ the Xavier normal distribution for initializing network parameters and utilize the Adam optimization algorithm for network optimization. The number of epochs is 50, 1000 batches of data are processed in each epoch, and the batch size is set to eight.
  • Evaluation metrics: The angle difference and Jaccard index are two commonly employed criteria for assessing grasping performance. According to the angle difference criterion, a successful grasp is defined as having a deviation of less than 30° between the predicted grasping area and its corresponding label. Under the Jaccard index, a grasp is deemed successful if the overlap between the predicted and labeled grasping areas exceeds 25%. The mathematical expression of the Jaccard index is as follows:
    $J(g_p, g_t) = \frac{|g_p \cap g_t|}{|g_p \cup g_t|} > 0.25$
  • In the formula, $g_p$ and $g_t$ denote the predicted grasping area and the ground-truth grasping area, respectively. The intersection of the predicted grasping area and the grasping area label is represented by $g_p \cap g_t$, while their union is denoted by $g_p \cup g_t$. A minimal check combining both criteria is sketched below.
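The following is our own helper, not the paper's evaluation script, combining the Jaccard and angle criteria as is common practice on the Cornell benchmark.

```python
import numpy as np

def is_successful_grasp(pred_mask, gt_mask, pred_angle, gt_angle):
    """A grasp counts as correct if the rectangle IoU (Jaccard index) between the predicted
    and labelled grasp areas exceeds 0.25 and the angle error is below 30 degrees.
    pred_mask/gt_mask: boolean pixel masks of the two grasp rectangles; angles in radians."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    iou = inter / union if union > 0 else 0.0
    angle_ok = abs(pred_angle - gt_angle) < np.deg2rad(30)
    return iou > 0.25 and angle_ok
```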

5.2. Comparative Study

5.2.1. Attention Mechanism Experiment

In order to verify that the attention mechanism in the deep semantic feature fusion strategy attends to the object area, an attention comparison experiment was conducted against previous work; the results are shown in Figure 9.
As depicted in the figure, the basic network without a feature fusion strategy exhibits issues such as semantic ambiguity and feature confusion in the object regions, which leads to inadequate attention towards the graspable regions of the objects. As shown in Figure 9, the cup in the first row is expected to have its optimal grasping area on its body. The basic network without the attention mechanism fails to allocate sufficient focus to the cup's body, while the LGAR-Net2 network proposed in this paper effectively directs attention towards the central region of the cup's body and suppresses excessive emphasis on the handle, thereby enhancing overall grasping quality. The optimal grasping position for the scraper in the third row is at the handle. The basic network without an attention mechanism distributes its attention evenly over the whole scraper, failing to accurately identify the best grasping area at the handle, whereas LGAR-Net2 effectively focuses on the grasping area of the handle and suppresses the non-graspable blade area. Compared to the other networks, LGAR-Net2 with the added attention mechanism demonstrates heightened focus on graspable object areas.

5.2.2. Checkerboard Artifact Experiment

To demonstrate the efficacy of the feature recovery module in LGAR-Net2, we visually depicted the feature images obtained from the BIC block output and provided a comparative analysis with the results achieved by GRCNN. As illustrated in Figure 10, it is evident that the basic network’s output exhibits noticeable checkerboard artifacts. However, through bilinear interpolation within the BIC block, as exemplified in columns 3 and 6 of the output images, adjacent pixel values are effectively smoothed out while eliminating these artifacts. This substantiates our proposed BIC block’s ability to successfully eradicate checkerboard artifacts and significantly enhance decoder performance.

5.2.3. Experimental Verification on Datasets

In this section, we conducted comparative experiments using the LGAR-Net2, LGAR-Net [23], and GRCNN networks on the Cornell dataset and the Jacquard dataset, as depicted in Figure 11. Rectangular boxes are employed to represent the grasp poses output by the networks.
As depicted in Figure 11, the baseline network exhibits semantic ambiguity and feature confusion when processing graspable-region features, leading to inadequate attention towards these regions. For instance, in the comparison on the razor (4), interference from the non-graspable area can affect the results. The absence of an attention mechanism in the baseline network prevents proper emphasis on the central position; LGAR-Net2 addresses this problem by considering the graspable areas on both sides of the razor handle while avoiding the non-graspable region at its top. Furthermore, when the color of the cylinder (1) is similar to its background, the baseline network may allocate excessive attention to the background, whereas LGAR-Net2 not only suppresses background features but also clearly prioritizes graspable areas. In comparison with its counterpart networks, LGAR-Net2 with the attention mechanism demonstrates an exceptional level of focus on pixels within graspable areas.
The findings from Table 1 indicate that the grasping detection algorithm introduced in this paper achieves a better balance between accuracy and speed without relying on depth information, leading to improvements in real-time performance and accuracy.
The proposed network was also used to conduct experiments on the Jacquard dataset and compared with existing research methods. The results are presented in Table 2, indicating that when using RGB data, the method proposed in this paper achieves a grasp detection accuracy of 94.4% on the Jacquard dataset.

5.3. Simulated Grasp Detection Experiment

The UR5's 3D model was imported into the ROS control system "MoveIt" to establish a robotic arm control system, with configuration parameters automatically generated by algorithms. Subsequently, the Gazebo simulation environment incorporated 3D models of objects and cameras to create a grasping simulation test platform. Additionally, the grasp detection model was integrated into the ROS system for conducting simulation experiments. Figure 12 illustrates the simulated grasping and detection scenarios.
The method for robot grasping detection proposed in this paper is utilized for the purpose of grasping and detecting objects within a simulated environment. The simulation results for the detection of various objects are presented in Table 3.
The analysis of the experimental results indicates that the pose detection accuracy of the network in this paper is up to 93.3%, which is slightly lower than the maximum accuracy of the model training, but it also demonstrates good detection performance. Utilizing the grasping pose detection algorithm designed in this paper, the robot can basically complete the grasping task in the simulated scene. Across various target objects and diverse object placements, the machine consistently achieves successful grasping actions, underscoring the algorithm’s robust adaptability to novel scenes.

6. Conclusions

We propose LGAR-Net2, an efficient grasp detection network for detecting the grasp pose of unknown objects. To address the checkerboard artifact problem commonly encountered with transposed convolution, we replace the deconvolution layers in the decoder with bilinear interpolation, and we introduce the LGCA block to enhance the feature extraction ability for graspable regions. As a result, our approach significantly improves both the accuracy and the generalization of the network. Experimental verification on the Cornell and Jacquard datasets confirms its effectiveness. Future research will focus on further enhancing grasp detection in complex backgrounds by fusing multi-modal inputs and on deploying the grasp detection algorithm to real robotic arms to achieve real-time grasping tasks.

Author Contributions

Conceptualization, H.F.; methodology, H.F., C.W. and Y.C.; software, H.F. and Y.C.; writing—original draft preparation, H.F.; writing—review and editing, H.F. and C.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was financially supported by the Fundamental Research Funds for the Central Universities of Northwest Minzu University (Grant No. 31920220048), the Gansu Province Demonstration Major of Innovation and Entrepreneurship Education for Colleges and Universities (Grant No. 2023SJCXCYSFZY01), and the National Natural Science Foundation of China (Grant No. 12205241).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The Cornell and Jacquard experimental results for LGAR-Net2 have been available at https://github.com/123fang456/LGAR-Net since 4 February 2024. The code presented in this study is available on request from the corresponding author.

Acknowledgments

The authors are very grateful to Zhiluo Huang for help with the writing of this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Ni, P.; Zhang, W.; Zhu, X.; Cao, Q. PointNet++ grasping: Learning an end-to-end spatial grasp generation algorithm from sparse point clouds. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 3619–3625. [Google Scholar]
  2. Jiang, Y.; Moseson, S.; Saxena, A. Efficient grasping from RGBD images: Learning using a new rectangle representation. In Proceedings of the 2011 IEEE International Conference on Robotics and Automation, Shanghai, China, 9–13 May 2011; pp. 3304–3311. [Google Scholar]
  3. Lenz, I.; Saxena, A. Deep learning for detecting robotic grasps. Int. J. Robot. Res. 2015, 34, 705–724. [Google Scholar] [CrossRef]
  4. Redmon, J.; Angelova, A. Real-time grasp detection using convolutional neural networks. In Proceedings of the 2015 IEEE International Conference on Robotics and Automation (ICRA), Seattle, WA, USA, 26–30 May 2015; pp. 1316–1322. [Google Scholar]
  5. Morrison, D.; Corke, P.; Leitner, J. Learning robust, real-time, reactive robotic grasping. Int. J. Robot. Res. 2020, 39, 183–201. [Google Scholar] [CrossRef]
  6. Kumra, S.; Joshi, S.; Sahin, F. Antipodal robotic grasping using generative residual convolutional neural network. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October–24 January 2020; pp. 9626–9633. [Google Scholar]
  7. Bicchi, A.; Kumar, V. Robotic grasping and contact: A review. In Proceedings of the 2000 ICRA. Millennium Conference. IEEE International Conference on Robotics and Automation. Symposia Proceedings (Cat.No. 00CH37065), San Francisco, CA, USA, 24–28 April 2000; IEEE: New York, NY, USA, 2000; Volume 1, pp. 348–353. [Google Scholar]
  8. Guo, D.; Sun, F.; Liu, H.; Kong, T.; Fang, B.; Xi, N. A hybrid deep architecture for robotic grasp detection. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 1609–1614. [Google Scholar]
  9. Pinto, L.; Gupta, A. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. In Proceedings of the 2016 IEEE International Conference on Robotics and Automation (ICRA), Stockholm, Sweden, 16–21 May 2016; IEEE: New York, NY, USA, 2016; pp. 3406–3413. [Google Scholar]
  10. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. arXiv 2015, arXiv:1505.04597. [Google Scholar] [CrossRef]
  11. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef] [PubMed]
  12. Mahajan, M.; Bhattacharjee, T.; Krishnan, A.; Shukla, P.; Nandi, G.C. Robotic grasp detection by learning representation in a vector quantized manifold. In Proceedings of the 2020 International Conference on Signal Processing and Communications (SPCOM), Bangalore, India, 19–24 July 2020; pp. 1–5. [Google Scholar]
  13. Yu, Y.; Cao, Z.; Liu, Z.; Geng, W.; Yu, J.; Zhang, W. A two-stream CNN with simultaneous detection and segmentation for robotic grasping. IEEE Trans. Syst. Man Cybern. Syst. 2022, 52, 1167–1181. [Google Scholar] [CrossRef]
  14. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
  15. Zhou, Z.; Zhu, X.; Cao, Q. AAGDN: Attention-Augmented Grasp Detection Network Based on Coordinate Attention and Effective Feature Fusion Method. IEEE Robot. Autom. Lett. 2023, 8, 3462–3469. [Google Scholar] [CrossRef]
  16. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  17. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  18. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803. [Google Scholar]
  19. Luo, C.; Shi, C.; Zhang, X.; Peng, J.; Li, X.; Chen, Y. AMCNet: Attention-Based Multiscale Convolutional Network for DCM MRI Segmentation. In Proceedings of the 2019 IEEE 43rd Annual Computer Software and Applications Conference (COMPSAC), Milwaukee, WI, USA, 15–19 July 2019; pp. 434–439. [Google Scholar] [CrossRef]
  20. Zhou, Z.; Zhang, X.; Ran, L.; Han, Y.; Chu, H. DSC-GraspNet: A Lightweight Convolutional Neural Network for Robotic Grasp Detection. In Proceedings of the 2023 9th International Conference on Virtual Reality (ICVR), Xianyang, China, 12–14 May 2023. [Google Scholar]
  21. Denavit, J.; Hartenberg, R.S. A kinematic notation for lower-pair mechanisms based on matrices. ASME J. Appl. Mech. 1955, 22, 215–221. [Google Scholar] [CrossRef]
  22. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef] [PubMed]
  23. Fang, H.; Wang, C.; Chen, Y. Grasping Pose Detection Based on Loss-Guided Attention Mechanism and Residual Network. In Proceedings of the 2023 IEEE 5th International Conference on Civil Aviation Safety and Information Technology (ICCASIT), Dali, China, 11–13 October 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 921–924. [Google Scholar]
  24. Depierre, A.; Dellandrea, E.; Chen, L. Jacquard: A large scale dataset for robotic grasp detection. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, Madrid, Spain, 1–5 October 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 3511–3516. [Google Scholar]
  25. Karaoguz, H.; Jensfelt, P. Object Detection Approach for Robot Grasp Detection. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 4953–4959. [Google Scholar]
Figure 1. Definition of grasp pose: (a) the network-derived grasp pose, encompassing the quality score, angle, and width of each pixel point. (b) A visual representation depicting a point grasp pose.
Figure 2. Checkerboard artifacts: (a) original image; (b) feature image of the intermediate layer; (c) feature image generated after transposed convolution.
Figure 3. LGAR-Net2 network architecture. Firstly, the RGB image ($I \in \mathbb{R}^{n \times h \times w}$) is fed into the feature extraction module, yielding the output feature image $F_e \in \mathbb{R}^{64 \times h/2 \times w/2}$. Subsequently, it is passed through a bottleneck section comprising residual networks and the LGCA block, where the LGCA block directs the network's attention towards an optimal grasping pose for the target, resulting in the output feature image $F_s \in \mathbb{R}^{128 \times h/4 \times w/4}$. The feature image size is then restored to its original dimensions using two-stage BIC blocks. Finally, the output of the feature recovery module undergoes four parallel standard convolutional processes to obtain the grasp prediction $\hat{G} = (\hat{P}, \hat{\theta}, \hat{W}, \hat{Q})$.
Figure 4. CBR Block*k structure diagram. $T_{in} \in \mathbb{R}^{n_{in} \times h \times w}$ and $T_{out} \in \mathbb{R}^{n_{out} \times h/2 \times w/2}$ represent the input feature map and output feature map, respectively, where $n_{in}$ and $n_{out}$ represent the number of channels in the input and output, and $h$ and $w$ represent the height and width of the feature image.
Figure 5. RSE block structure. $T \in \mathbb{R}^{128 \times h/4 \times w/4}$ is used as the input image and passes through two convolutional layers and one SE block, while the input and output image sizes remain consistent.
Figure 6. Loss-guided collaborative attention block (LGCA block). The input feature $T_6$ is initially processed by the loss-guided layer to obtain $F_l$, followed by subsequent processing through the collaborative attention module and the MUS module, resulting in a refined feature map $T_7 \in \mathbb{R}^{128 \times h/4 \times w/4}$.
Figure 7. The collaborative attention block; (a,b) depict the intricate architecture diagrams of the spatial-correlation attention block and the channel-correlation attention block.
Figure 8. BIC-Block structure. $F_{in} \in \mathbb{R}^{n_{in} \times h/2 \times w/2}$ and $F_{out} \in \mathbb{R}^{n_{out} \times h \times w}$ represent the input feature map and output feature map.
Figure 9. Results of the attention comparison experiment: (a) image of the target object; (b) feature image of the intermediate layer of the GRCNN network; (c) feature image of the LGAR-Net network; (d) feature image of the LGAR-Net2 network after applying LGCA.
Figure 10. Comparison of the checkerboard artifacts: columns 2 and 5 show feature images obtained by ordinary DW convolution, columns 3 and 6 show feature images obtained by BIC block convolution.
Figure 11. (1)–(2): comparison results on the Jacquard dataset; (3)–(4): comparison results on the Cornell dataset.
Figure 12. The simulated grasping scene in the ROS system; the first row shows the lateral and top views, and the second row shows the depth image and RGB image taken by the depth camera.
Table 1. Cornell dataset experimental results.
Author | Method | Input | Accuracy (%) | Time (ms)
Jiang [2] | Fast Search | RGB-D | 60.5 | 5000
Lenz [3] | SAE, struct. reg. | RGB-D | 73.9 | 1350
Wang [18] | Two-stage closed-loop | RGB-D | 85.3 | 140
Redmon [4] | AlexNet, MultiGrasp | RGB-D | 88.0 | 76
Morrison [5] | GG-CNN | D | 73.0 | -
Karaoguz [25] | GRPN | RGB | 88.7 | 200
Kumra [6] | GRCNN | RGB | 96.6 | 19
Ours | LGAR-Net2 | RGB | 97.7 | 15
Table 2. Jacquard dataset experimental results.
Author | Method | Input | Accuracy (%) | Time (ms)
Depierre [24] | Jacquard | RGD | 93.6 | -
Morrison [5] | GG-CNN2 | D | 84 | 19
Kumra [6] | GRCNN | RGB | 91.8 | 19
Ours | LGAR-Net2 | RGB | 94.4 | 17
Table 3. Results of simulated grasping experiments.
Target Object | Grasp Pose Detection Accuracy (%) | Success/Attempt | Accuracy (%)
Coke Bottle | 87.5 | 13/15 | 86.7
Wood Block | 88.1 | 13/15 | 86.7
Screwdriver | 93.5 | 14/15 | 93.3
Remote | 90.1 | 13/15 | 86.7
Hexagon Nut | 93.2 | 13/15 | 86.7
Weight | 88.7 | 13/15 | 86.7
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
