Secure Grasping Detection of Objects in Stacked Scenes Based on Single-Frame RGB Images

Secure grasping of objects in complex scenes is the foundation of many tasks. It is important for robots to autonomously determine the optimal grasp based on visual information, which requires reasoning about the stacking relationship of objects and detecting the grasp position. This paper proposes a multi-task secure grasping detection model, which consists of the grasping relationship network (GrRN) and CSL-YOLO, an oriented rectangle detection network based on the circular smooth label (CSL). GrRN uses DETR to solve set prediction problems in object detection, enabling end-to-end detection of grasping relationships. CSL-YOLO uses classification to predict the angle of oriented rectangles and solves the angle distance problem caused by classification. Experiments on the Visual Manipulation Relationship Dataset (VMRD) and the Cornell grasping detection dataset demonstrate that our method outperforms existing methods and exhibits good applicability on robot platforms.


Introduction
Robot grasping is a fundamental task in robot operation and lays the groundwork for completing complicated tasks. In real grasping scenarios, complex scenes are common, and objects are frequently stacked, as seen in material handling and fruit sorting. If the object being grasped is covered by other objects, removing it can destabilize the stack, and rigid objects may shatter. While it is intuitive for humans to select a stable object from a stack of objects, this poses a significant challenge for robots, since they rely solely on vision. Therefore, it is crucial for robots to autonomously determine a secure grasping position that maintains the stability of the entire object stack.
The development of deep learning has led to two categories of vision-based robot grasping methods: six degrees of freedom (6DoF) grasping and 2D plane grasping [1]. Most 6DoF grasping methods require point clouds and intrinsic camera parameters to determine an object's position, estimate the pose, and match the original object using templates [2,3], offering high precision but requiring significant computational resources. Some methods use local point clouds to accelerate computation, but this may lead to a loss of object edge features and incorrect candidate grasping positions [4]. Recent approaches have achieved positive results by optimizing the decision-making process and reducing interfaces to accelerate grasping position generation under 6DoF [5,6]. In scenarios where objects are on a plane and can only be grasped from one direction, 2D plane grasping is preferable. The main approach is rotated object detection, which generates potential grasping positions in an image using data-driven convolutional neural networks [7]. However, the safety of the resulting grasp positions is not immediately apparent, and a scoring system is often utilized as a supplement to determine each grasping box's security score [8]. This approach works well in specific scenarios but requires a large amount of data and lacks strong generalization capabilities. A solution to this problem is to assess the stacking relationship between objects before grasping and to verify the grasped object's z-axis position in the final grasp using only the depth image, which reduces the computational cost. Traditional object stacking reasoning uses object pairwise pooling. However, this process is time-consuming, and it cannot consider global image information when multiple objects are in the image. Recently, transformers have been used to process images [9], allowing object detection to be transformed into an unordered set problem, providing the foundation for the object stacking relationship reasoning method proposed in this paper.
Sensors 2023, 23, 8054
We propose a data-driven, multi-task secure grasping detection model that uses a single RGB frame to obtain global information: it detects object stacking relationships and grasping positions, and then obtains the final secure grasping position via post-processing. The gripper used in this paper is a parallel gripper. To preserve visual information within the image, we incorporate residual modules [10] into our Grasping Relationship Network (GrRN) for object stacking relationship detection, inspired by the network design of Adj-Net [11] and Deformable DETR [12]. Furthermore, we created a rotation-based object detection model called CSL-YOLO, using one-hot encoding, which is inspired by YOLOv5 6.0 [13] and circular smooth label (CSL) [14]. Our experiments, conducted using the Visual Manipulation Relationship Dataset (VMRD) [15] and Cornell [16], demonstrate that our proposed object stacking relationship detection and grasping position detection methods perform well. The primary contributions of this paper are as follows:
(1) Analyzing how to use an adjacency matrix to represent an object stack. We used the mathematical properties of the adjacency matrix and post-processing to obtain a secure grasp.
(2) Using the Hungarian algorithm of Deformable DETR [12] to generate predictions for object queries and corresponding relationships between objects, and then using this relationship and the visual features learned by Encoder to generate an adjacency matrix. We analyzed the impact of multi-scale features and deformable self-attention mechanisms on overall model performance. Adding residual modules between the original feature map and the output of Encoder provides adequate visual features for the input of the MLP that generates the adjacency matrix.
(3) Combining the CSL [14] idea with the one-stage object detection model YOLOv5 [13]. We demonstrated that angle prediction can be transformed from a regression problem to a classification problem using one-hot encoding, with a Gaussian window function making the loss calculation more reasonable.
This paper is organized as follows: Section 2 provides an overview of the research status of secure robot grasping. Section 3 details the use of the adjacency matrix to determine the optimal grasping object, the principles of predicting the adjacency matrix, and how to generate rotating grasping boxes. Section 4 demonstrates the performance of our method on a dataset, including testing its capabilities and presenting experimental results. Finally, Section 5 presents this paper's conclusion.

Object Detection
The accurate identification of object location and category within an image is crucial for successful stacking relationship detection. Predicting rotating rectangular boxes is a fundamental aspect of grasping detection and a part of object detection. Therefore, it is crucial to select an appropriate object detector. Recent advances in deep learning have led to the development of highly competent object detectors such as the two-stage RCNN [17], Fast RCNN [18], and Faster RCNN [19], as well as the one-stage SSD series [20] and YOLO series [13,21]. One-stage methods are faster than two-stage methods, but they have slightly lower accuracy. In recent years, the transformer-based object detector DETR [22] has established a new paradigm. DETR regards object detection as a set prediction problem, achieving end-to-end object detection and removing the artificially defined parts of traditional methods, which allows the adjacency matrix prediction problem to be implemented with an end-to-end network. The weak performance on small objects and slow convergence of DETR are resolved by Deformable DETR [12], which we therefore select as the backbone network for stacking relationship detection. To enhance accuracy while maintaining real-time detection speed, YOLOv5 [13] employs mosaic augmentation, feature pyramid, and path aggregation methods, making it a suitable backbone network for grasp box detection.

Stacking Relationship Detection
Stacking relationships are crucial in identifying the optimal secure grasping method. VMRN [23], the first use of convolutional neural networks in stacking relationship detection, was introduced by Zhang et al., who also published VMRD [15]. VMRN detects objects first and then uses convolutional operations on each object pair to predict the relationship between them. To avoid the time-consuming convolution on each object pair, Park et al. [24] expanded the grasping information to 15 dimensions and utilized an optimized cross-scale YOLOv3 network FCNN to directly forecast object subcategories, significantly enhancing detection speed. Additionally, Chi et al. [25] affirmed the significance of spatial and semantic information of objects in inferring the stacking relationship and proposed the VSE model, which improves the accuracy of stacking relationship detection by using encoded spatial and semantic information output by the bag-of-words model for object pair pooling. Furthermore, Tchuiev et al. [11] solved the adjacency matrix prediction problem posed by the stacking challenge by leveraging end-to-end object detectors and proposed Adj-Net, which significantly improved the accuracy of detecting stacking relationships. This paper adopts Adj-Net and modifies its object detection and adjacency matrix prediction parts to improve the detection performance for stacking relationships.

Grasping Detection
Traditional grasping methods typically utilize object texture, geometric shapes, and the tactile information of robotic hands for grasping detection [26,27]. In recent years, convolutional neural network-based grasping detection has grown increasingly popular. Guo et al. [28] introduced a hybrid depth structure that incorporates both visual and tactile sensors, leveraging tactile data to enhance visual information for more effective learning and ultimately improve grasping detection success rates. Similarly, Chu et al. [29] utilized Faster RCNN and a region proposal network to generate grasping boxes while converting the angle problem into a classification challenge with null hypotheses competition, resulting in significantly improved grasping box generation accuracy. Additionally, Dong et al. [30] proposed a two-stage method that first acquires image mask features and subsequently generates grasping detection results by leveraging these mask features to mitigate the impact of cluttered background information on grasping detection accuracy. In recent years, one-stage object detection and rotation box detection methods have developed rapidly, and the proposed CSL [14] provides a good solution for angle classification problems and can adapt to different object detectors.

The Method of Grasping in Stacked Scenes
Our proposed multi-task model comprises two components: the Grasping Relationship Network (GrRN) and the CSL-YOLO network. GrRN employs a multi-scale transformer to detect grasp sequences, while CSL-YOLO is an improved YOLOv5 network that utilizes CSL. The outputs of both tasks are then subjected to a post-processing operation to determine the suggested grasping positions. The input of the model is an RGB image, and the output is the secure grasping position in a single RGB frame. Figure 1 provides an overview of the overall model structure.

Initialization with Adjacency Matrix
In complex scenes, objects are frequently stacked. We represent each object as a node and the relationship between two stacked objects as a weighted edge. Thus, any object stack can be represented by a weighted directed graph $G \triangleq (V, E, W)$ with $N_V$ nodes $v \in V$ and $N_E$ edges $\epsilon \in E$, where each edge has a weight $\omega \in W$. For two objects $o_1$ and $o_2$, if $o_1$ directly overlaps $o_2$, an edge $\epsilon_{o_1 \to o_2}$ is formed, with the weight $\omega$ representing the probability of its existence. In the dataset, $\omega = 1$, whereas during prediction, the value of $\omega$ ranges between 0 and 1.

Our primary objective is to predict the weighted directed graph $G$, which can be represented by an adjacency matrix $A$:

$$A = \begin{bmatrix} \omega_{11} & \cdots & \omega_{1 N_V} \\ \vdots & \ddots & \vdots \\ \omega_{N_V 1} & \cdots & \omega_{N_V N_V} \end{bmatrix} \tag{1}$$

The adjacency matrix $A$ represents the stacking relationship between objects in the object stack, and its size is $N_V \times N_V$. Every diagonal element of $A$ must be 0, since an object cannot overlap itself. The element $\omega_{ij}$ in row $i$ and column $j$ of $A$ represents the probability of the existence of edge $\epsilon_{o_i \to o_j}$. Since the order of the object detection results may be uncertain (i.e., $index_{pre}$ and $index_{origin}$ may not correspond), the ground-truth adjacency matrix $A_{gt}$ is not unique and is determined by the actual order of the object detection results. We can calculate $A_{gt}$ using a unit matrix $E_{change}$ whose rows have been reordered according to the relationship between $index_{pre}$ and $index_{origin}$, which applies the same transformation to the rows and columns of $A_{origin}$:

$$A_{gt} = E_{change} \, A_{origin} \, E_{change}^{T} \tag{2}$$

The dataset predefines $index_{origin}$ and $A_{origin}$, while $index_{pre}$ is determined through the Hungarian algorithm and post-processing during object detection. To predict the adjacency matrix $A$, we multiply a matrix $adj_1$ with $N_V$ rows and a matrix $adj_2$ with $N_V$ columns, resulting in the predicted value of matrix $A$, denoted as $A_m$.

To achieve secure grasping, the n-th power of the adjacency matrix $A$ can be used. The matrix power calculation can determine whether there are still objects between two objects, thus obtaining the uncovered objects in the object stack. As demonstrated in Figure 2, we consider an object stack with object $o_1$ covering object $o_2$ and object $o_2$ covering object $o_3$. We can obtain the adjacency matrix $A$ for this object stack. For elements $\omega_{ij}$ in the n-th power matrix $A^n$ of $A$ where $\omega_{ij} = 1$, there are $(n - 1)$ objects between object $o_i$ and object $o_j$. When $A^n$ ($n \neq 1$) is a matrix of all zeros, $\omega_{ij}$ values equal to 1 in $A^{n-1}$ signify that object $o_i$ can be grasped safely. When $A$ consists entirely of zeros, every object can be grasped safely.

Figure 2. The left-hand side of the figure presents a stack of objects and its directed graph, while the right-hand side shows the corresponding adjacency matrix and its power. To calculate the secure grasping, we utilize the n-th power of the adjacency matrix. Elements of the matrix's i-th row and j-th column denote the probability of covering.
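The reordering of the ground-truth matrix and the matrix-power rule described above can be illustrated with a short sketch. It is only one possible reading of the text, in pure Python: the function names, the 0.5 binarization threshold, and the convention that `index_pre[r]` holds the dataset index of the object matched to detection `r` are assumptions made for illustration, not details taken from the paper.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def reorder_adjacency(a_origin, index_pre):
    """Apply the same permutation to rows and columns of a_origin.

    Assumption: index_pre[r] is the dataset index of the object matched to
    detection r, so entry (r, c) of the result describes detections r and c.
    This equals E * A_origin * E^T for the row-permuted identity matrix E.
    """
    return [[a_origin[i][j] for j in index_pre] for i in index_pre]


def safe_objects(a, threshold=0.5):
    """Return indices of objects that the matrix-power rule marks as safe.

    The probabilities are binarized with an assumed threshold. If A is all
    zeros, every object is safe. Otherwise we search for the first power
    A^n that is all zeros and report the rows with nonzero entries in
    A^(n-1). An empty list means no such power exists (a cyclic relation).
    """
    n = len(a)
    binary = [[1 if a[i][j] > threshold else 0 for j in range(n)]
              for i in range(n)]
    if not any(any(row) for row in binary):
        return list(range(n))
    prev = binary
    for _ in range(n):
        power = matmul(prev, binary)
        if not any(any(row) for row in power):
            return [i for i in range(n) if any(prev[i])]
        prev = power
    return []


# Stack from the text: o1 covers o2, and o2 covers o3 (0-based indices).
chain = [[0, 1, 0],
         [0, 0, 1],
         [0, 0, 0]]
print(safe_objects(chain))
```

Note that this rule reports the top of the longest covering chain. With several independent stacks of different heights, checking for all-zero columns of the binarized matrix would be a complementary way to list every uncovered object.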

GrRN
After observing the impressive capabilities of end-to-end object detection models such as DETR [22] in resolving matrix prediction problems, notably the inspiring results of Adj-Net [11], we aimed to incorporate these findings into our research. Traditional solutions to the stacking prediction problem involve multi-stage methods requiring object detection to establish the node set V of a directed graph, which is then matched to obtain the edge set E and probability set W for the existence of edges. Consequently, the adjacency matrix prediction problem is categorized as a set prediction problem. DETR [22] regards object detection as a set prediction problem, which can directly obtain the node set V of the directed graph without requiring post-processing operations, providing great convenience for predicting the weighted edge set E in subsequent steps. We based our experiments on Deformable DETR [12], which resolves the issues of sluggish convergence and poor performance on small objects found in DETR. The GrRN is presented in Figure 3.
GrRN takes RGB images as its input and outputs predictions for object detection and the corresponding adjacency matrix. The model first extracts multi-scale features $I_e$ (the input of Encoder) from the image using a feature extractor (ResNet50 in this paper). The number of scales is 4, consistent with Deformable DETR [12]. The dimensions of $I_e$ are $e \times h$. Six multi-head self-attention modules use $I_e$ to generate $O_e$ (the output of Encoder), also with dimensions $e \times h$. Decoder takes the object queries and $O_e$ as inputs. The dimensions of the object queries are $q \times h$, and the output of Decoder, $O_d$, has dimensions $q \times h$. Feeding $O_d$ through a feedforward network generates the detection results for bounding boxes ($O'_d$) and class detections. The dimensions of $O'_d$ are $q \times 4$, while the dimensions of the class detections are $q \times (N_{class} + 1)$, where the 1 denotes the absence of an object. To enhance the visual information of the features, the model connects $I_e$ residually with $O_e$ and reshapes the result into $h \times 1 \times e$. A convolution operation then alters the depth to obtain the feature map $I_a$, with dimensions $h \times 1 \times q$, which is subsequently reshaped to $q \times h$. Concatenating $O'_d$ and $I_a$ yields $I'_a$, with dimensions $q \times (h + 4)$. The model processes $I'_a$ through two independent MLPs that do not alter its dimensions, yielding two matrices, $adj_1$ and $adj_2$, each with dimensions $q \times (h + 4)$. These matrices are then used to calculate the adjacency matrix: the model multiplies $adj_1$ by $adj_2^T$, and the result goes through a sigmoid operation to yield the preliminary prediction of the adjacency matrix, $A_p$, of size $q \times q$. After Hungarian matching, the indices $i_1, i_2, \ldots, i_m$ of the matched objects among the $q$ queries are obtained. The corresponding rows and columns are then extracted from $A_p$ to obtain the final adjacency matrix, $A_m$.
We also attempted to use Decoder's output, $O_d$, to predict the adjacency matrix. However, using $O_e$ produced much better results. DETR suggests that Decoder learns more about the object's boundary information, while Encoder retains more visual information about the object. Given the importance of visual information in determining whether objects are stacked, we postulate that using Encoder's output to predict the adjacency matrix is more appropriate.
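The final step of GrRN described above (multiplying $adj_1$ by $adj_2^T$, applying a sigmoid, and keeping only the matched rows and columns) can be sketched in a few lines of pure Python. This is an illustrative sketch only: the function name and the toy two-dimensional embeddings are hypothetical, and the real model operates on $q \times (h + 4)$ tensors.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_adjacency(adj1, adj2, matched):
    """Sketch of the adjacency head.

    adj1, adj2: q x d matrices (lists of lists) produced by the two MLPs.
    matched:    query indices i_1..i_m kept after Hungarian matching.
    Returns (A_p, A_m): the q x q preliminary prediction and the m x m
    matrix restricted to the matched queries.
    """
    q = len(adj1)
    # A_p[i][j] = sigmoid(<adj1[i], adj2[j]>), i.e. sigmoid(adj1 @ adj2^T)
    a_p = [[sigmoid(sum(x * y for x, y in zip(adj1[i], adj2[j])))
            for j in range(q)]
           for i in range(q)]
    # Keep only the rows and columns of the matched object queries.
    a_m = [[a_p[i][j] for j in matched] for i in matched]
    return a_p, a_m


# Toy example with q = 3 queries and d = 2 features (hypothetical values).
adj1 = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
adj2 = [[0.0, 0.0], [3.0, 0.0], [0.0, 0.0]]
a_p, a_m = predict_adjacency(adj1, adj2, matched=[2, 1])
print(len(a_p), len(a_m))
```

Because two separate projections are used, the product is not symmetric, which matches the directed nature of the "covers" relation.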
Because the model additionally predicts the adjacency matrix, the loss of adjacency matrix prediction must be considered when calculating the loss. The loss of the entire model can be divided into two parts: the bipartite matching loss and the model optimization loss. Since the adjacency matrix is predicted after bipartite matching, the bipartite matching loss is left unmodified, as in DETR. For the model optimization loss, we consider the following terms.
The first term is the classification loss, which we evaluate using the cross-entropy loss:

$$L_{cls} = -\sum_{p \in P} \sum_{c=1}^{N_{class}+1} w_c \, y_{gt}^{c}(p) \log y_{pre}^{c}(p) \tag{4}$$

where $p \in P$ ranges over all proposed boxes obtained through bipartite graph matching, $N_{class}$ is the number of classes in the dataset, and the additional 1 represents the "no object" class. Since the "no object" class occurs far more often than the object classes in practical detection tasks, we assign a weight $w_c$ to each class when calculating the classification loss. The weight assigned to the "no object" class is 0.01, compared to 1 for the other classes. We use $y_{gt}(p)$ and $y_{pre}(p)$ to represent the true and predicted class values of the ground truth box corresponding to the predicted box $p$, respectively.

For the bounding boxes, we use the l1 loss and the GIoU loss based on the recommendation of DETR. Because the l1 loss is sensitive to the size of the bounding box, it does not always precisely represent the distance between the predicted and ground truth boxes. Therefore, we use the GIoU loss as an auxiliary measure. The two losses are:

$$L_{l1} = \sum_{p \in P} \sum_{k \in \{cx, cy, w, h\}} \left| k_{gt}(p) - k_{pre}(p) \right| \tag{5}$$

$$L_{GIoU} = \sum_{p \in P} \left[ 1 - \left( \frac{S_{\cap}}{S_{\cup}} - \frac{S_c - S_{\cup}}{S_c} \right) \right], \quad S_{\cup} = S_{gt} + S_{pre} - S_{\cap} \tag{6}$$

When calculating the l1 loss, we measure the distances between the predicted and actual values of cx, cy, w, and h independently. $S_{gt}$ and $S_{pre}$ represent the areas of the ground truth box and the predicted box, respectively, $S_{\cap}$ is the area of their intersection, and $S_c$ is the area of $c$, the minimum bounding box that encompasses both the ground truth and predicted boxes.

The adjacency matrix $A_m$ is mostly sparse, with the majority of the values being 0. We adopt the binary cross-entropy loss function from Adj-Net to calculate the loss. In comparison with the l1 and l2 losses, the binary cross-entropy loss can effectively penalize incorrect 0 values, resulting in faster model convergence:

$$L_{adj} = -\frac{1}{N_V^2} \sum_{i=1}^{N_V} \sum_{j=1}^{N_V} \left[ A_{gt}^{ij} \log A_{m}^{ij} + \left(1 - A_{gt}^{ij}\right) \log\left(1 - A_{m}^{ij}\right) \right] \tag{7}$$

The ultimate loss for the GrRN model is a weighted sum of all losses mentioned above:

$$L_{GrRN} = \lambda_{cls} L_{cls} + \lambda_{l1} L_{l1} + \lambda_{GIoU} L_{GIoU} + \lambda_{adj} L_{adj} \tag{8}$$

where all $\lambda$ values are hyperparameters.
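To make the box and adjacency terms concrete, the following sketch evaluates the l1 loss, the standard GIoU loss, and a binary cross-entropy over an adjacency matrix for single examples. It is a minimal illustration under stated assumptions: boxes are axis-aligned, given as (cx, cy, w, h) with positive area; the binary cross-entropy is averaged over all entries; and the function names and the clipping constant `eps` are hypothetical choices, not taken from the paper.

```python
import math


def l1_loss(box_pre, box_gt):
    """Sum of absolute differences over (cx, cy, w, h)."""
    return sum(abs(p - g) for p, g in zip(box_pre, box_gt))


def _corners(box):
    """Convert (cx, cy, w, h) to (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2


def giou_loss(box_pre, box_gt):
    """1 - GIoU for two axis-aligned boxes with positive area."""
    px1, py1, px2, py2 = _corners(box_pre)
    gx1, gy1, gx2, gy2 = _corners(box_gt)
    inter_w = max(0.0, min(px2, gx2) - max(px1, gx1))
    inter_h = max(0.0, min(py2, gy2) - max(py1, gy1))
    s_inter = inter_w * inter_h
    s_pre = (px2 - px1) * (py2 - py1)
    s_gt = (gx2 - gx1) * (gy2 - gy1)
    s_union = s_pre + s_gt - s_inter
    # Area of the smallest box c enclosing both boxes.
    s_c = (max(px2, gx2) - min(px1, gx1)) * (max(py2, gy2) - min(py1, gy1))
    giou = s_inter / s_union - (s_c - s_union) / s_c
    return 1.0 - giou


def adjacency_bce(a_m, a_gt, eps=1e-7):
    """Mean binary cross-entropy between predicted and true adjacency."""
    total, count = 0.0, 0
    for row_m, row_gt in zip(a_m, a_gt):
        for p, t in zip(row_m, row_gt):
            p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
            total -= t * math.log(p) + (1 - t) * math.log(1 - p)
            count += 1
    return total / count


# Two unit boxes separated by a gap of one unit along x.
print(giou_loss((0.5, 0.5, 1.0, 1.0), (2.5, 0.5, 1.0, 1.0)))
```

For disjoint boxes the IoU is zero, but the GIoU loss still grows with the gap between the boxes, which is why it complements the l1 term.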

CSL-YOLO
In the context of 2D robotic grasping, rotated rectangles are commonly used to represent the area in which the robotic arm should grasp. We modified the long-side representation method to suit the field of robotic grasping, resulting in the grasp-side representation method. This approach is denoted by (x, y, h, w, θ), where x and y denote the central coordinates of the rectangle, h indicates the length of the grasping side, w refers to the opening distance between the robotic fingers, and θ has the range [−90°, 90°). Due to the limitations of annotation tools, the available angle values in the dataset are {−90°, −89°, ..., 88°, 89°}.
To predict the grasp boxes, we based our work on YOLOv5 and developed CSL-YOLO, which is built upon the CSL. The input of CSL-YOLO is an RGB image, and the output of the model is all potential grasp boxes in the image. Like YOLOv5, CSL-YOLO consists of a backbone, neck, and head. The structure of the model is shown in Figure 4.
RGB images are first zero-padded so that their width and height are equal, then resized to h × h. The backbone uses these resized images to extract visual features, reducing the image's width and height by half as it passes through successive feature layers. The lower convolutional layers learn visual features related to object contours, while higher layers extract more semantic features. The Feature Pyramid Network (FPN) is used to transmit strong semantic features from the higher layers to the lower layers, while the Path Aggregation Network (PAN) transmits positional features from the lower layers to the higher layers. The head generates the final three output feature maps, which predict objects at three different scales. The high-resolution feature map is best suited for small objects, whereas the low-resolution feature map is better for larger objects. During training, the object's center point position is used to calculate the loss. Non-Maximum Suppression (NMS) is used to avoid the over-representation of objects in the output.
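As a small illustration of the preprocessing and the three output scales described above, the sketch below computes the zero padding needed to make an image square and the grid sizes of the three head outputs. The symmetric split of the padding and the strides of 8, 16, and 32 (the values used by the standard YOLOv5 head) are assumptions here, as are the function names; the paper does not state them.

```python
def square_padding(width, height):
    """Return (side, (left, right), (top, bottom)) zero padding.

    Assumption: padding is split as evenly as possible between both sides.
    """
    side = max(width, height)
    pad_w = side - width
    pad_h = side - height
    return (side,
            (pad_w // 2, pad_w - pad_w // 2),
            (pad_h // 2, pad_h - pad_h // 2))


def head_grid_sizes(input_size, strides=(8, 16, 32)):
    """Grid size of each output feature map for a square input.

    The finest grid (smallest stride) targets small objects, and the
    coarsest grid targets large objects.
    """
    return [input_size // s for s in strides]


print(square_padding(640, 480))
print(head_grid_sizes(640))
```

With a 640 × 640 input, the three grids contain 80 × 80, 40 × 40, and 20 × 20 cells under these assumed strides.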
To facilitate angle prediction in YOLOv5, we followed CSL and treated angle prediction as a classification problem instead of a regression one. Unlike regression, classification can address the boundary problem. Angles are periodic, so −90° and 89° describe nearly the same orientation. The loss between these angles ought to be minimal, but regression yields a high loss value. Classification treats every angle as a discrete class, which eliminates the boundary problem. Nonetheless, plain classification provides no information about the distance between two angles: every wrong prediction is penalized equally. In fact, angles close to the true angle are admissible, and the model should incur a smaller loss for such angles. CSL replaces the true label in the cross-entropy loss function with CSL(x). This replacement allows the model to penalize predictions closer to the true angle less, improving the accuracy of angle prediction. The formula to compute CSL(x) is:

$$\mathrm{CSL}(x) = \begin{cases} g(x), & \theta - r < x < \theta + r \\ 0, & \text{otherwise} \end{cases}$$

where x represents the angle predicted by the model, θ represents the actual angle of the grasping box, g(x) is the window function, and r is the window radius. The penalty decreases as the predicted angle approaches θ within the window radius. Based on the results of our ablation experiments, we set r to 6. After the true label is replaced, the angle loss is the cross-entropy between the predicted angle distribution and CSL(x). Since there are no categories for grasp boxes in this study, category loss is not necessary. The other loss functions remain unmodified, and thus the final loss function of the CSL-YOLO model is the weighted sum of these remaining terms and the angle loss, where all weights λ are hyperparameters.
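The label construction described above can be illustrated with a minimal sketch. It assumes one class per degree over [−90°, 90°) (180 bins), a Gaussian window whose standard deviation equals the radius r, and a plain soft-label cross-entropy; the function names and these choices are illustrative assumptions, not the authors' implementation.

```python
import math

NUM_BINS = 180  # assumption: one class per degree over [-90, 90)


def circular_distance(a, b, period=NUM_BINS):
    """Shortest distance between two bin indices on a circle."""
    d = abs(a - b) % period
    return min(d, period - d)


def gaussian_window(d, radius):
    """Gaussian window value at circular distance d (assumed sigma = radius)."""
    return math.exp(-(d ** 2) / (2.0 * radius ** 2))


def csl_label(theta_deg, radius=6, num_bins=NUM_BINS):
    """Soft label: g(x) inside the window around theta, 0 elsewhere."""
    center = (int(round(theta_deg)) + 90) % num_bins
    label = []
    for k in range(num_bins):
        d = circular_distance(k, center, num_bins)
        label.append(gaussian_window(d, radius) if d < radius else 0.0)
    return label


def cross_entropy(pred_probs, soft_label, eps=1e-12):
    """Cross-entropy of predicted bin probabilities against a soft label."""
    return -sum(t * math.log(p + eps) for p, t in zip(pred_probs, soft_label))
```

With this construction, the label for θ = −90° assigns the same weight to the bins for −89° and 89°, so a prediction just across the boundary is penalized as lightly as one on the near side.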

Experiment and Result Analysis
This section presents experimental results for GrRN and CSL-YOLO, along with an investigation of grasping performance in a real-world scenario. The proposed models were implemented in the PyTorch 1.12.1 framework and trained and tested on an NVIDIA Tesla V100 with 16 GB of memory. To verify the grasping algorithm in a real-world stacking scenario, we used a 4DoF Kinova Gen2 robotic arm and an Intel RealSense2 depth camera.

Experimental Setup for GrRN
The proposed grasp relationship detection method was trained and validated on the VMRD [15]: 4233 images were split in a 9:1 ratio into training and validation sets, and a further 450 images formed the test set. Due to the high computational expense of the multi-task secure grasping method, we employed ResNet50 as the feature extractor, which has relatively few parameters and a low computation cost. The model specifications were set as follows: h = 256 hidden dimensions, eight heads in the deformable transformer module, four reference points in the deformable self-attention, six modules in each of the Encoder and Decoder, and 300 object queries. The convolution kernel that changed dimensions had a size of 1 × 1 × 300. The two MLPs that predicted the adjacency matrix had the following specifications: the number of input dimensions was h + 4 = 260, the number of hidden dimensions was 260, and the number of output dimensions was 260. Each had three hidden layers. The AdamW optimizer was used to train the network. During training, the adjacency matrix prediction part was frozen at first, and the object detection part was trained for 300 epochs on the COCO dataset at a learning rate of 0.001. Subsequently, the whole network was trained on VMRD for 500 epochs at a learning rate of 0.0001.

Experimental Results of GrRN
Our method's effectiveness was evaluated on the VMRD, and its performance was compared with three established stacked object detection algorithms: VMRN, VSE, and Adj-Net. Following the criteria used for Adj-Net, we considered a detection result accurate under the following circumstances:

• For objects i and j where i is placed on j, $P^{\exists}_{i \to j} > 0.5$ and $P^{\exists}_{i \to j} > P^{\exists}_{j \to i}$.

• For objects i and j that have no direct relationship, $P^{\exists}_{i \to j} < 0.5$ and $P^{\exists}_{j \to i} < 0.5$.
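The two conditions above can be read as a decoding rule for each object pair. The following is a minimal sketch of that rule; the function name, the return strings, and the extra "ambiguous" outcome (for ties or values exactly at the threshold, which the conditions do not cover) are assumptions introduced for illustration.

```python
def decode_relationship(p_ij, p_ji, threshold=0.5):
    """Decode a pairwise stacking relationship from two probabilities.

    p_ij is the predicted probability that object i is placed on object j,
    and p_ji is the probability of the reverse relationship.
    """
    if p_ij > threshold and p_ij > p_ji:
        return "i_on_j"
    if p_ji > threshold and p_ji > p_ij:
        return "j_on_i"
    if p_ij < threshold and p_ji < threshold:
        return "none"
    # Not covered by the stated conditions (hypothetical fallback).
    return "ambiguous"
```

A detection then counts as accurate when the decoded relationship matches the ground-truth relationship for that pair.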
In object detection, predictions are categorized as true positives (TP), false positives (FP, incorrect predictions), true negatives (TN), and false negatives (FN, missed detections). Our evaluation of the model's object detection performance is based on two metrics, Object Recall (OR) and Object Precision (OP), calculated as:

$$\mathrm{OR} = \frac{TP}{TP + FN} \tag{12}$$

$$\mathrm{OP} = \frac{TP}{TP + FP} \tag{13}$$

When detecting grasping relationships, we apply the same definitions of TP, FP, TN, and FN, following the practice in object detection. To evaluate our model's performance, we use three metrics:

• Relationship Recall (RR): The number of correctly detected relationships divided by the total number of correct stacking relationships.

• Relationship Precision (RP): The number of correctly predicted relationships divided by the total number of detected relationships. A detected relationship is considered correct if the tuple $(o_i, R_{ij}, o_j)$ is correct, where $o_i$ represents the i-th object and $R_{ij}$ represents the relationship between objects i and j.

• Image Accuracy (IA): The proportion of test-set images in which RR and RP are both 100% for all objects in the image. The notation IA-x denotes the image accuracy for images containing x objects.
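The three metrics above can be computed from sets of relationship tuples. The sketch below is a minimal illustration, assuming that predicted objects have already been matched to ground-truth objects so that tuples can be compared directly; the function name and the tuple encoding are hypothetical.

```python
def relationship_metrics(images):
    """Compute (RR, RP, IA) over a list of images.

    `images` is a list of (predicted, ground_truth) pairs, where each
    element is a set of (i, relation, j) tuples for one image.
    """
    correct = n_pred = n_true = perfect = 0
    for predicted, truth in images:
        correct += len(predicted & truth)   # correctly detected tuples
        n_pred += len(predicted)
        n_true += len(truth)
        # RR and RP are both 100% for an image only if the sets match.
        if predicted == truth:
            perfect += 1
    rr = correct / n_true if n_true else 0.0
    rp = correct / n_pred if n_pred else 0.0
    ia = perfect / len(images) if images else 0.0
    return rr, rp, ia
```

Restricting `images` to those that contain exactly x objects would give IA-x under the same assumptions.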
Figure 5 shows some detection results of our method on VMRD. One image was chosen from each of IA-2 to IA-5 for display. The top row displays the original images, while the second row displays the results of object detection, including bounding boxes, categories, confidence scores, and object indexes. The bottom row shows the predicted adjacency matrices, with dark squares indicating the value of 0 and light squares indicating the value of 1.
The comparison of the object detection results with other models is shown in Table 1. Because we employed ResNet50 as the feature extractor, Adj-Net was run with the same feature extractor. Our method was more effective than current state-of-the-art approaches. As deep learning advances, object detectors perform better and produce fewer false positives and false negatives, which aids the inference of object stacking relationships. The comparison of the grasping relationship detection results with other models is shown in Table 2. Our method exhibits superior performance compared with the current best method. Improved object detection makes the objects in the image easier to detect, and consequently the adjacency matrix detection also becomes more effective.
Existing techniques for predicting object stacking relationships require pooling and convolution operations between object pairs, so they can predict the relationship of only two objects at a time. This process becomes time-consuming as the number of objects in the input image increases. End-to-end object detection, by contrast, allows the stacking relationships of all objects to be predicted simultaneously. The current study focuses on VMRD images that contain between two and five objects. We assessed the various models on images with different numbers of objects, as presented in Table 3. Our method outperformed all the other considered techniques overall. Notably, precision decreases significantly as the number of objects in the image increases and the object relationships become more complex.
Table 4 compares GrRN-DETR (with DETR as the backbone network) and GrRN-Decoder (using the Decoder output) in predicting the adjacency matrix. As a backbone network, DETR struggles to identify smaller objects, converges slowly, and delivers inferior object detection performance. As a result, the ability of the DETR-based model to predict the adjacency matrix is also compromised. The GrRN-Decoder model, on the other hand, lacks visual information, which impedes the convergence of the adjacency matrix prediction component.

Experimental Setup for CSL-YOLO
For this study, we utilized the VMRD and the Cornell datasets, with a total of 5568 images split in an 8:1:1 ratio into training, validation, and test sets, respectively. The effectiveness of different window sizes {2, 4, 6, 8} was tested using the Gaussian function as the window function. Training used a warm-up strategy with mosaic data augmentation disabled, and the Adam optimizer at a learning rate of 0.0001.

Experimental Results for CSL-YOLO
To assess the efficacy of grasping detection, the rectangle metric was employed in this study. A predicted grasp was considered valid under two conditions: (1) the rotation angle of the predicted grasping box differs by no more than 30 degrees from that of the true box, and (2) the Jaccard index J(A, B) = |A ∩ B|/|A ∪ B| between the predicted grasping box A and the true box B is greater than 25%.
We use Image-wise (IW) and Object-wise (OW) splits to evaluate the performance of the model. The definitions of IW and OW are as follows:

• IW: The entire dataset is shuffled and randomly divided into training and test sets to test the model's generalization ability for previously seen objects when they appear at new positions and rotation angles.

• OW: The dataset is divided by object instance, and the objects in the test set have not appeared in the training set before, to test the model's generalization ability for unseen objects.

Our method's grasping detection results on the VMRD and Cornell datasets are presented in Figure 6. The ground truth data from the original datasets are displayed in the first row, with our detection results in the second row.
The study began by evaluating the model's efficacy under different window sizes relative to traditional approaches. The outcomes, summarized in Table 5, indicate that the model detects grasps best with a window size of six. Notably, the window size directly affects the model's grasp detection ability: undersized windows may exclude some grasping boxes that should be identified, impairing the model's ability to attain local optima, whereas oversized windows may admit partially accurate outputs that affect the model's judgments. The IW value surpassed the OW value because the model's error rate rises when it evaluates objects that are not represented in the training set.
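The rectangle metric used in this section can be sketched in plain Python. The example below is an illustration under stated assumptions rather than the evaluation code used in the paper: a grasp is encoded as a hypothetical tuple (cx, cy, w, h, angle_deg) with positive w and h, angles are compared modulo 180°, and the intersection of the two rotated rectangles is obtained with Sutherland-Hodgman polygon clipping.

```python
import math


def rect_corners(cx, cy, w, h, angle_deg):
    """Corners of a rotated rectangle, ordered so the interior is on the left."""
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)
    half = ((-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2))
    return [(cx + dx * c - dy * s, cy + dx * s + dy * c) for dx, dy in half]


def polygon_area(pts):
    """Shoelace formula; returns 0 for fewer than three vertices."""
    if len(pts) < 3:
        return 0.0
    total = 0.0
    for (x1, y1), (x2, y2) in zip(pts, pts[1:] + pts[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def _side(a, b, p):
    """Positive when p lies to the left of the directed edge a -> b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def clip_convex(subject, clipper):
    """Sutherland-Hodgman clipping of `subject` by a convex `clipper`."""
    output = list(subject)
    for a, b in zip(clipper, clipper[1:] + clipper[:1]):
        current, output = output, []
        for p, q in zip(current, current[1:] + current[:1]):
            sp, sq = _side(a, b, p), _side(a, b, q)
            if sp >= 0:                      # p is inside this edge
                output.append(p)
            if sp * sq < 0:                  # edge p -> q crosses the clip line
                t = sp / (sp - sq)
                output.append((p[0] + t * (q[0] - p[0]),
                               p[1] + t * (q[1] - p[1])))
    return output


def jaccard(pred, truth):
    """Jaccard index |A ∩ B| / |A ∪ B| of two rotated rectangles."""
    a, b = rect_corners(*pred), rect_corners(*truth)
    inter = polygon_area(clip_convex(a, b))
    union = polygon_area(a) + polygon_area(b) - inter
    return inter / union if union > 0 else 0.0


def is_valid_grasp(pred, truth, max_angle=30.0, min_jaccard=0.25):
    """Rectangle metric: angle within 30 degrees and Jaccard index above 25%."""
    diff = abs(pred[4] - truth[4]) % 180.0
    diff = min(diff, 180.0 - diff)           # grasp angles repeat every 180 degrees
    return diff <= max_angle and jaccard(pred, truth) > min_jaccard
```

For instance, two identical 4 × 2 boxes rotated by −89° and 89° differ by only 2° under the periodic comparison, so the pair passes the angle condition.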

Experiments in Real-World Scenarios
This study utilized various objects in real-world scenarios to form distinct object stacks. RGB images captured by the depth camera underwent object detection, adjacency matrix prediction, and grasping detection. Grasping boxes were retained when the coefficient of overlap, $K(o, g) = (S_o \cap S_g)/S_g$, was greater than 0.5, where o refers to the object box, g to the grasping box, and S to the box area. The grasping box closest to the center point of the object box was selected as the final grasp for the robot arm. Grasping was then performed using the depth image information. Figure 7 depicts a specific grasping experiment in which the robotic arm needs to move the objects in the right stack to the designated position on the left. The grasping process of the robotic arm is shown in the first row, while the adjacency matrix predicted before each grasp is shown in the second row.
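The grasp selection rule in this section can be sketched as follows. For simplicity, the example assumes that both the object box and each grasping box are given as axis-aligned (x1, y1, x2, y2) tuples, for example the bounding box of a rotated grasp; the function names are hypothetical and this simplification is not part of the original method.

```python
import math


def box_area(box):
    """Area of an axis-aligned box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def intersection_area(a, b):
    """Area of the overlap between two axis-aligned boxes."""
    return box_area((max(a[0], b[0]), max(a[1], b[1]),
                     min(a[2], b[2]), min(a[3], b[3])))


def overlap_coefficient(obj_box, grasp_box):
    """K(o, g): overlap area divided by the grasping box area."""
    s_g = box_area(grasp_box)
    return intersection_area(obj_box, grasp_box) / s_g if s_g > 0 else 0.0


def box_center(box):
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)


def select_grasp(obj_box, grasp_boxes, min_overlap=0.5):
    """Keep grasps with K > min_overlap, then pick the one nearest the object center."""
    candidates = [g for g in grasp_boxes
                  if overlap_coefficient(obj_box, g) > min_overlap]
    if not candidates:
        return None
    ox, oy = box_center(obj_box)

    def distance(g):
        gx, gy = box_center(g)
        return math.hypot(gx - ox, gy - oy)

    return min(candidates, key=distance)
```

Normalizing by the grasping box area rather than the union means a small grasp lying entirely inside a large object box still scores K = 1.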

Conclusions
This paper proposes a multi-task deep neural network framework as a solution to the challenge of secure grasping in stacking scenarios. The framework first executes two pre-tasks, stacking relationship detection and grasping detection, and then completes the secure grasping task through post-processing. The stacking relationship detection model first detects objects within the RGB images, then predicts the object stack's adjacency matrix by merging visual information and object detection information. The adjacency matrix is then used to select the object to grasp next in the grasp sequence. A visual information enhancement module was employed to boost model efficiency. The grasping detection model uses a one-stage object detection model to predict the grasping box, classification to solve the angle prediction problem, and CSL to improve the model's ability to judge angle distance. On the VMRD and the Cornell datasets, our approach outperformed traditional methods and achieved secure grasping in real-world scenarios. Future work will aim to improve the model's prediction accuracy and speed.

Figure 1. The model's overall structure comprises the proposed grasping relationship detection network at the top, which employs Deformable DETR for object detection and generates the adjacency matrix by multiplying two feature matrices. The bottom part is the proposed rotation box detection method. Subsequently, the final grasping results are obtained via post-processing.

Figure 2. The left-hand side of the figure presents a stack of objects and its directed graph, while the right-hand side shows the corresponding adjacency matrix and its power. To calculate the secure grasping, we utilize the n-th power of the adjacency matrix. Elements of the matrix's i-th row and j-th column denote the probability of covering.

Figure 3. The network architecture of GrRN. The image generates multi-scale features after going through a feature extractor, and then obtains object detection results through Deformable DETR. After visual enhancement, the adjacency matrix is predicted; the dark portion in the matrix represents 0, while the bright portion represents 1.

Figure 4. The network architecture of CSL-YOLO. The input of the network is an RGB image, and the output is a rotated grasping box.

Figure 5. Stacking relationship detection results of our method on the Visual Manipulation Relationship Dataset. The first row of images contains stacks of objects with varying numbers. The second row of images displays the results of the object detection. The third row of images shows the predicted results of the adjacency matrix.


Figure 6. Grasping detection on the Visual Manipulation Relationship Dataset and Cornell. (a) is the ground truth, and (b) is the result detected by our method.

Figure 7. Robotic arm grasping in a real-world scenario. In the matrix, the dark portion represents 0, while the light portion represents 1.

Table 1. Results of object detection from different models.

Table 2. Results of grasp relationship from different models.

Table 3. Results of grasp relationship IA-x from different models.

Table 4. Results of different ways to calculate the adjacency matrix.

Table 5. Results of grasping detection from different models and window sizes.