Robotic Grasp Detection Network Based on Improved Deformable Convolution and Spatial Feature Center Mechanism
Abstract
1. Introduction
- We propose a grasping system that directly infers the grasp rectangles of multiple objects by generating pixel-wise grasp quality, angle, and width maps (a decoding sketch follows this list);
- Our model introduces an improved deformable convolution in the feature extraction module to adaptively adjust the receptive field, and then uses an SFC layer to capture global long-range dependencies through a lightweight MLP architecture. Compared with transformer encoders based on multihead attention, the lightweight MLP architecture is simpler, lighter, and more computationally efficient. Additionally, to preserve local corner regions, we propose an LFC mechanism that gathers local regional features within the layer. A lightweight CARAFE operator completes the upsampling process; compared with transposed convolution, it has lower computational complexity and achieves better performance;
- We evaluated our model on publicly available grasp datasets, achieving accuracies of up to 99.3% on the Cornell grasp dataset and 96.1% on the Jacquard grasp dataset;
- We deployed the proposed model on a physical robot arm and conducted real-time grasping experiments, which demonstrated the feasibility of the model.
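To make the pipeline in the first bullet concrete, the following is a minimal sketch of how grasp rectangles can be decoded from the predicted quality, angle, and width maps. It is illustrative only: the function name, the fixed jaw height, and the peak-picking parameters are our assumptions, not the authors' implementation.

```python
import numpy as np
from skimage.feature import peak_local_max

def decode_grasps(quality, angle, width, num_grasps=5, gripper_height=20):
    """Decode grasp rectangles from pixel-wise quality/angle/width maps.

    quality, angle, width: (H, W) arrays predicted by the network;
    angle is in radians in [-pi/2, pi/2), width is in pixels.
    """
    # Local maxima of the quality map are grasp-center candidates.
    centers = peak_local_max(quality, min_distance=10, num_peaks=num_grasps)
    grasps = []
    for (y, x) in centers:
        grasps.append({
            "center": (x, y),
            "angle": float(angle[y, x]),
            "width": float(width[y, x]),
            "height": gripper_height,  # fixed jaw height: our assumption
            "quality": float(quality[y, x]),
        })
    return grasps
```

Each decoded record maps directly to one oriented grasp rectangle, so multiple objects in one image yield multiple rectangles without a separate proposal stage.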
2. Related Studies
2.1. Grasp Detection with Deep Learning
2.2. Deformable Convolutional Networks
2.3. MLP in Computer Vision
3. Improved Deformable Convolution and Spatial Feature Center Mechanism
3.1. Problem Statement
3.2. Network Architecture
3.2.1. Improved Deformable Convolution-Based Feature Extraction Module
3.2.2. Spatial Feature Center Module
3.3. Loss Function
4. Experimental Validation
4.1. Datasets and Experimental Setup
- The intersection-over-union (IoU, Jaccard) score between the predicted grasping rectangle $G$ and the ground-truth grasping rectangle $\hat{G}$ is greater than 0.25, that is, $\mathrm{IoU}(G,\hat{G}) = \frac{|G \cap \hat{G}|}{|G \cup \hat{G}|} > 0.25$;
- The offset $|\theta - \hat{\theta}|$ between the predicted azimuth $\theta$ of the grasping rectangle and the ground-truth azimuth $\hat{\theta}$ is less than $30^{\circ}$, that is, $|\theta - \hat{\theta}| < 30^{\circ}$ (see the sketch after this list).
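The rectangle metric above can be computed by rasterizing both oriented rectangles and taking their Jaccard index, as in the sketch below. This is a minimal reference implementation under our own conventions (corner ordering, a fixed image canvas, and the dict-based grasp records from the earlier decoding sketch); it is not the authors' evaluation code.

```python
import numpy as np
from skimage.draw import polygon

def rect_corners(center, angle, width, height):
    """Four (x, y) corners of an oriented grasp rectangle, shape (4, 2)."""
    cx, cy = center
    c, s = np.cos(angle), np.sin(angle)
    dx, dy = width / 2.0, height / 2.0
    local = np.array([[-dx, -dy], [dx, -dy], [dx, dy], [-dx, dy]])
    rot = np.array([[c, -s], [s, c]])
    return local @ rot.T + np.array([cx, cy])

def rect_iou(rect_a, rect_b, shape=(480, 640)):
    """Rasterize both rectangles on a canvas and compute their IoU."""
    canvas_a = np.zeros(shape, dtype=bool)
    canvas_b = np.zeros(shape, dtype=bool)
    ra, ca = polygon(rect_a[:, 1], rect_a[:, 0], shape)  # rows = y, cols = x
    rb, cb = polygon(rect_b[:, 1], rect_b[:, 0], shape)
    canvas_a[ra, ca] = True
    canvas_b[rb, cb] = True
    inter = np.logical_and(canvas_a, canvas_b).sum()
    union = np.logical_or(canvas_a, canvas_b).sum()
    return inter / union if union else 0.0

def is_correct(pred, gt, iou_thresh=0.25, angle_thresh=np.deg2rad(30)):
    """Rectangle metric: IoU > 0.25 and azimuth offset < 30 degrees."""
    # Grasp azimuths are symmetric under a 180-degree rotation, so wrap
    # the difference into [-pi/2, pi/2) before thresholding.
    d_theta = abs((pred["angle"] - gt["angle"] + np.pi / 2) % np.pi - np.pi / 2)
    iou = rect_iou(
        rect_corners(pred["center"], pred["angle"], pred["width"], pred["height"]),
        rect_corners(gt["center"], gt["angle"], gt["width"], gt["height"]),
    )
    return d_theta < angle_thresh and iou > iou_thresh
```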
4.2. Experimental Results and Analysis
4.3. Comparison of Multiobject Grasp
4.4. Ablation Study
4.4.1. Improved DCN vs. Convolutional Layer
4.4.2. Effectiveness of the SFC Module
4.5. Grasping in Real-World Scenarios
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Kumra, S.; Kanan, C. Robotic grasp detection using deep convolutional neural networks. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017; pp. 769–776. [Google Scholar]
- Redmon, J.; Angelova, A. Real-time grasp detection using convolutional neural networks. In Proceedings of the 2015 IEEE International Conference on Robotics and Automation (ICRA), Seattle, WA, USA, 26–30 May 2015; pp. 1316–1322. [Google Scholar]
- Vaswani, A.; Shazeer, N.M.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. In Proceedings of the NIPS, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 9992–10002. [Google Scholar]
- Huang, Z.; Zhu, Z.; Wang, Z.; Shi, Y.; Fang, H.; Zhang, Y. DGDNet: Deep Gradient Descent Network for Remotely Sensed Image Denoising. IEEE Geosci. Remote Sens. Lett. 2023, 20, 1–5. [Google Scholar] [CrossRef]
- Huang, Z.; Wang, Z.; Zhu, Z.; Zhang, Y.; Fang, H.; Shi, Y.; Zhang, T. DLRP: Learning Deep Low-Rank Prior for Remotely Sensed Image Denoising. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
- Sahbani, A.; El-Khoury, S.; Bidaud, P. An overview of 3D object grasp synthesis algorithms. Robot. Auton. Syst. 2012, 60, 326–336. [Google Scholar] [CrossRef]
- Shimoga, K.B. Robot Grasp Synthesis Algorithms: A Survey. Int. J. Robot. Res. 1996, 15, 230–266. [Google Scholar] [CrossRef]
- Huang, Z.; Wang, L.; An, Q.; Zhou, Q.; Hong, H. Learning a Contrast Enhancer for Intensity Correction of Remotely Sensed Images. IEEE Signal Process. Lett. 2022, 29, 394–398. [Google Scholar] [CrossRef]
- Jiang, Y.; Moseson, S.; Saxena, A. Efficient grasping from RGBD images: Learning using a new rectangle representation. In Proceedings of the 2011 IEEE International Conference on Robotics and Automation, Shanghai, China, 9–13 May 2011; pp. 3304–3311. [Google Scholar]
- Lenz, I.; Lee, H.; Saxena, A. Deep learning for detecting robotic grasps. Int. J. Robot. Res. 2015, 34, 705–724. [Google Scholar] [CrossRef]
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Guo, D.; Sun, F.; Liu, H.; Kong, T.; Fang, B.; Xi, N. A hybrid deep architecture for robotic grasp detection. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 1609–1614. [Google Scholar]
- Ren, S.; He, K.; Girshick, R.B.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
- Chu, F.J.; Xu, R.; Vela, P.A. Real-World Multiobject, Multigrasp Detection. IEEE Robot. Autom. Lett. 2018, 3, 3355–3362. [Google Scholar] [CrossRef]
- Zhou, X.; Lan, X.; Zhang, H.; Tian, Z.; Zhang, Y.; Zheng, N. Fully Convolutional Grasp Detection Network with Oriented Anchor Box. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 7223–7230. [Google Scholar]
- Asif, U.; Tang, J.; Harrer, S. GraspNet: An Efficient Convolutional Neural Network for Real-time Grasp Detection for Low-powered Devices. In Proceedings of the International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 13–19 July 2018. [Google Scholar]
- Kumra, S.; Joshi, S.; Sahin, F. Antipodal Robotic Grasping using Generative Residual Convolutional Neural Network. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020–24 January 2021; pp. 9626–9633. [Google Scholar]
- Wang, S.; Zhou, Z.; Kan, Z. When Transformer Meets Robotic Grasping: Exploits Context for Efficient Grasp Detection. IEEE Robot. Autom. Lett. 2022, 7, 8170–8177. [Google Scholar] [CrossRef]
- Tran, V.L.; Lin, H. BiLuNetICP: A Deep Neural Network for Object Semantic Segmentation and 6D Pose Recognition. IEEE Sens. J. 2021, 21, 11748–11757. [Google Scholar] [CrossRef]
- Tian, F.H.; Zhang, J.; Zhong, Y.; Liu, H.; Duan, Q. A method for estimating an unknown target grasping pose based on keypoint detection. In Proceedings of the 2022 2nd International Conference on Computer Science, Electronic Information Engineering and Intelligent Control Technology (CEI), Fuzhou, China, 23–25 September 2022; pp. 267–271. [Google Scholar]
- Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable Convolutional Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 764–773. [Google Scholar]
- Huang, Z.; Zhu, Z.; An, Q.; Wang, Z.; Zhou, Q.; Zhang, T.; Alshomrani, A.S. Luminance Learning for Remotely Sensed Image Enhancement Guided by Weighted Least Squares. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
- Zhu, X.; Hu, H.; Lin, S.; Dai, J. Deformable ConvNets V2: More Deformable, Better Results. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 9300–9308. [Google Scholar]
- Chen, Y.; Kalantidis, Y.; Li, J.; Yan, S.; Feng, J. A2-Nets: Double Attention Networks. In Proceedings of the Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018. [Google Scholar]
- Ramachandran, P.; Parmar, N.; Vaswani, A.; Bello, I.; Levskaya, A.; Shlens, J. Stand-Alone Self-Attention in Vision Models. arXiv 2019, arXiv:1906.05909. [Google Scholar]
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. arXiv 2020, arXiv:2005.12872. [Google Scholar]
- Vaswani, A.; Ramachandran, P.; Srinivas, A.; Parmar, N.; Hechtman, B.A.; Shlens, J. Scaling Local Self-Attention for Parameter Efficient Visual Backbones. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 12889–12899. [Google Scholar]
- Tolstikhin, I.O.; Houlsby, N.; Kolesnikov, A.; Beyer, L.; Zhai, X.; Unterthiner, T.; Yung, J.; Keysers, D.; Uszkoreit, J.; Lucic, M.; et al. MLP-Mixer: An all-MLP Architecture for Vision. In Proceedings of the Neural Information Processing Systems, Virtual, 6–14 December 2021. [Google Scholar]
- Liu, H.; Dai, Z.; So, D.R.; Le, Q.V. Pay Attention to MLPs. In Proceedings of the Neural Information Processing Systems, Virtual, 6–14 December 2021. [Google Scholar]
- Yu, W.; Luo, M.; Zhou, P.; Si, C.; Zhou, Y.; Wang, X.; Feng, J.; Yan, S. MetaFormer is Actually What You Need for Vision. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 10809–10819. [Google Scholar]
- Hou, Q.; Jiang, Z.; Yuan, L.; Cheng, M.M.; Yan, S.; Feng, J. Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 1328–1334. [Google Scholar] [CrossRef]
- Huang, Z.; Zhang, Y.; Li, Q.; Li, X.; Zhang, T.; Sang, N.; Hong, H. Joint Analysis and Weighted Synthesis Sparsity Priors for Simultaneous Denoising and Destriping Optical Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2020, 58, 6958–6982. [Google Scholar] [CrossRef]
- Morrison, D.; Corke, P.; Leitner, J. Learning robust, real-time, reactive robotic grasping. Int. J. Robot. Res. 2020, 39, 183–201. [Google Scholar] [CrossRef]
- Wang, W.; Dai, J.; Chen, Z.; Huang, Z.; Li, Z.; Zhu, X.; Hu, X.; Lu, T.; Lu, L.; Li, H.; et al. InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions. arXiv 2022, arXiv:2211.05778. [Google Scholar]
- Quan, Y.; Zhang, D.; Zhang, L.; Tang, J. Centralized Feature Pyramid for Object Detection. arXiv 2022, arXiv:2210.02093. [Google Scholar] [CrossRef]
- Gillani, I.S.; Munawar, M.R.; Talha, M.; Azhar, S.; Mashkoor, Y.; Uddin, M.S.; Zafar, U. YOLOV5, YOLO-X, YOLO-R, YOLOV7 Performance Comparison: A Survey. 2022. Available online: https://aircconline.com/csit/papers/vol12/csit121602.pdf (accessed on 5 February 2023).
- Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
- Larsson, G.; Maire, M.; Shakhnarovich, G. FractalNet: Ultra-Deep Neural Networks without Residuals. arXiv 2016, arXiv:1605.07648. [Google Scholar]
- Depierre, A.; Dellandréa, E.; Chen, L. Jacquard: A Large Scale Dataset for Robotic Grasp Detection. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 3511–3516. [Google Scholar]
- Huang, Z.; Zhang, Y.; Yue, X.; Li, X.; Fang, H.; Hong, H.; Zhang, T. Joint horizontal-vertical enhancement and tracking scheme for robust contact-point detection from pantograph-catenary infrared images. Infrared Phys. Technol. 2020, 105, 103156. [Google Scholar] [CrossRef]
- Morrison, D.; Corke, P.; Leitner, J. Closing the Loop for Robotic Grasping: A Real-time, Generative Grasp Synthesis Approach. arXiv 2018, arXiv:1804.05172. [Google Scholar]
- Karaoğuz, H.; Jensfelt, P. Object Detection Approach for Robot Grasp Detection. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 4953–4959. [Google Scholar]
- Ainetter, S.; Fraundorfer, F. End-to-end Trainable Deep Neural Network for Robotic Grasp Detection and Semantic Segmentation from RGB. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 13452–13458. [Google Scholar]
- Yu, S.; Zhai, D.; Xia, Y.; Wu, H.; Liao, J.J. SE-ResUNet: A Novel Robotic Grasp Detection Method. IEEE Robot. Autom. Lett. 2022, 7, 5238–5245. [Google Scholar] [CrossRef]
- Cao, H.; Chen, G.; Li, Z.; Lin, J.; Knoll, A. Residual Squeeze-and-Excitation Network with Multi-scale Spatial Pyramid Module for Fast Robotic Grasping Detection. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 13445–13451. [Google Scholar]
Table: Performance comparison of grasp detection methods on the Cornell grasp dataset (IW = image-wise split; OW = object-wise split).

| Method | Author | IW Accuracy (%) | OW Accuracy (%) | Time (ms) |
|---|---|---|---|---|
| SAE [12] | Lenz | 73.9 | 75.6 | 1350 |
| AlexNet, MultiGrasp [2] | Redmon | 88.0 | 87.1 | 76 |
| GG-CNN [44] | Morrison | 73.0 | 69.0 | 19 |
| GRPN [45] | Karaoğuz | 88.7 | - | 200 |
| ResNet-50x2 [1] | Kumra | 89.2 | 88.9 | 103 |
| GR-ConvNet-RGB-D [20] | Kumra | 97.7 | 96.6 | 20 |
| E2E-net-RGB [46] | Ainetter | 98.2 | - | 63 |
| TF-Grasp [21] | Wang | 97.99 | 96.7 | 42 |
| SE-ResUNet [47] | Yu | 98.2 | 97.1 | 25 |
| CenterNet-SPF [23] | Tian | 98.31 | 97.6 | 24 |
| DCSFC-Grasp | Ours | 99.3 | 98.5 | 22 |
Table: Accuracy (%) on the Cornell grasp dataset under varying Jaccard-index (IoU) thresholds and azimuth-offset thresholds.

| Method | Split | IoU 20% | IoU 25% | IoU 30% | IoU 35% | IoU 40% | 10° | 15° | 20° | 25° | 30° |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GR-ConvNet [20] | IW | 98.1 | 97.7 | 96.8 | 94.1 | 88.7 | 86.4 | 94.4 | 97.2 | 97.7 | 97.7 |
| TF-Grasp [21] | IW | 98.4 | 97.99 | 97.2 | 94.9 | 90.3 | 89.21 | 95.44 | 97.52 | 97.98 | 97.99 |
| DCSFC-Grasp | IW | 99.5 | 99.3 | 99.2 | 99.0 | 98.5 | 93.6 | 94.8 | 98.3 | 99.2 | 99.3 |
| GR-ConvNet [20] | OW | 97.1 | 96.6 | 93.2 | 90.5 | 84.8 | 84.7 | 90.9 | 95.1 | 95.8 | 96.6 |
| TF-Grasp [21] | OW | 97.4 | 96.7 | 93.9 | 91.7 | 85.2 | 85.2 | 91.6 | 95.4 | 96.0 | 96.7 |
| DCSFC-Grasp | OW | 98.6 | 98.5 | 98.3 | 98.1 | 97.4 | 93.1 | 93.9 | 97.4 | 98.5 | 98.5 |
Table: Performance comparison of grasp detection methods on the Jacquard grasp dataset.

| Method | Author | Year | Accuracy (%) |
|---|---|---|---|
| Jacquard [42] | Depierre | 2018 | 74.2 |
| GG-CNN [44] | Morrison | 2018 | 84.0 |
| FGGN, ResNet-101 [18] | Zhou | 2018 | 91.8 |
| GR-ConvNet [20] | Kumra | 2020 | 94.6 |
| RSEN [48] | Cao | 2021 | 94.8 |
| TF-Grasp [21] | Wang | 2022 | 94.6 |
| SE-ResUNet [47] | Yu | 2022 | 95.7 |
| CenterNet-SPF [23] | Tian | 2022 | 95.5 |
| DCSFC-Grasp | Ours | 2023 | 96.1 |
Table: Improved DCN vs. standard convolutional layer (IW = image-wise split; OW = object-wise split).

| Network | IW Accuracy (%) | OW Accuracy (%) |
|---|---|---|
| GG-CNN (convolutional layer) | 73.0 | 69.0 |
| GG-CNN (improved DCN) | 79.6 | 77.2 |
| GR-ConvNet (convolutional layer) | 97.7 | 96.6 |
| GR-ConvNet (improved DCN) | 98.1 | 97.2 |
Table: Ablation of the SFC module components on the Cornell dataset (image-wise split).

| Lightweight MLP | Learnable Feature Center Block | Stem Block | Accuracy (IW, %) |
|---|---|---|---|
|  |  |  | 97.8 |
| ✓ |  |  | 98.0 |
|  | ✓ |  | 98.7 |
|  |  | ✓ | 97.9 |
| ✓ | ✓ |  | 99.1 |
| ✓ |  | ✓ | 98.8 |
|  | ✓ | ✓ | 98.2 |
| ✓ | ✓ | ✓ | 99.3 |