A Multi-Scale Cross-Layer Fusion Method for Robotic Grasping Detection
Abstract
1. Introduction
- (1) This study proposes a real-time encoder–decoder grasp detection framework that fuses multi-scale and cross-layer features to produce pixel-level grasp rectangles from RGB-D inputs, ensuring high efficiency (27 FPS) and fine spatial detail preservation.
- (2) A multi-scale spatial feature enhancement module (MSFEM) is introduced at the bottleneck to alleviate detail loss in traditional pyramids by using channel partitioning and parallel convolutions with diverse receptive fields.
- (3) A cascaded fusion attention module (CFAM) is designed to resolve feature fusion conflicts through a dual-path spatial-channel attention scheme, enhancing semantic alignment and grasp region perception.
- (4) Extensive experiments on the Cornell and Jacquard datasets, along with real-world deployment on an AUBO i5 robot, demonstrate state-of-the-art accuracy and robust performance in unstructured environments.
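The idea in contribution (2), channel partitioning followed by parallel convolutions with different receptive fields, can be illustrated with a minimal NumPy sketch. This is not the authors' MSFEM: the use of dilated 3 × 3 kernels, the four dilation rates, the residual connection, and the names `dilated_conv3x3` and `multi_scale_block` are all assumptions made for illustration.

```python
import numpy as np


def dilated_conv3x3(x, weight, dilation):
    """Naive 'same'-padded 3x3 dilated convolution (cross-correlation).

    x: (C_in, H, W) feature map; weight: (C_out, C_in, 3, 3).
    """
    _, h, w = x.shape
    pad = dilation  # keeps the spatial size unchanged for a 3x3 kernel
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((weight.shape[0], h, w), dtype=x.dtype)
    for ky in range(3):
        for kx in range(3):
            # Shifted window of the padded input for this kernel tap.
            patch = xp[:, ky * dilation:ky * dilation + h,
                       kx * dilation:kx * dilation + w]
            # Contract over input channels: (C_out, C_in) x (C_in, H, W).
            out += np.einsum("oc,chw->ohw", weight[:, :, ky, kx], patch)
    return out


def multi_scale_block(x, weights, dilations=(1, 2, 3, 4)):
    """Split channels into groups, convolve each group with its own
    dilation rate (assumed), concatenate, and add a residual (assumed).

    The channel count of x must be divisible by len(dilations).
    """
    groups = np.split(x, len(dilations), axis=0)
    branches = [dilated_conv3x3(g, w, d)
                for g, w, d in zip(groups, weights, dilations)]
    return np.concatenate(branches, axis=0) + x


# Usage: 8 channels split into 4 groups of 2 channels each.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16, 16))
weights = [rng.standard_normal((2, 2, 3, 3)) * 0.1 for _ in range(4)]
y = multi_scale_block(x, weights)
print(y.shape)  # (8, 16, 16)
```

Each group sees a different effective receptive field (3, 5, 7, and 9 pixels for dilations 1 to 4) while the total channel count and spatial resolution stay unchanged, which is the property the contribution statement emphasizes.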
2. Related Works
2.1. Multi-Scale Feature Learning
2.2. Encoder–Decoder Architectures
3. Problem Formulation and MCFG-Net Architecture
3.1. Problem Statement
3.2. Model Architecture
3.3. Multi-Scale Spatial Feature Enhancement Module
3.4. Cascade Fusion Attention Module
3.5. Loss Function
4. Dataset and Robot Grasping Experiment
4.1. Datasets and Metrics
- (1) Image-wise (IW): evaluates how well the network generalizes when objects appear in varied poses, giving insight into the model’s adaptability to changes in object orientation.
- (2) Object-wise (OW): evaluates how well the network generalizes to entirely novel objects, assessing its capacity to grasp previously unseen items in real-world applications.
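The image-wise (IW) and object-wise (OW) protocols differ only in what is shuffled before the split: individual images or whole objects. The following is a minimal sketch, assuming each sample is a hypothetical `(image_id, object_id)` pair and a 10% test fraction by default; the split ratio and the function names are assumptions and are not stated in this section.

```python
import random
from collections import defaultdict


def image_wise_split(samples, test_ratio=0.1, seed=0):
    """Shuffle individual images, so the same object may appear in both
    the training and the test set (in different poses)."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


def object_wise_split(samples, test_ratio=0.1, seed=0):
    """Shuffle object IDs, so every image of a held-out object goes to
    the test set and test objects are never seen during training."""
    by_object = defaultdict(list)
    for image_id, object_id in samples:
        by_object[object_id].append((image_id, object_id))
    object_ids = sorted(by_object)
    random.Random(seed).shuffle(object_ids)
    n_test = max(1, round(len(object_ids) * test_ratio))
    train = [s for oid in object_ids[n_test:] for s in by_object[oid]]
    test = [s for oid in object_ids[:n_test] for s in by_object[oid]]
    return train, test


# Usage: 20 hypothetical objects, 4 poses each.
samples = [(f"img_{obj}_{pose}", obj) for obj in range(20) for pose in range(4)]
iw_train, iw_test = image_wise_split(samples, test_ratio=0.2)
ow_train, ow_test = object_wise_split(samples, test_ratio=0.2)
```

Under the OW split, the sets of object IDs in `ow_train` and `ow_test` are disjoint, which is why OW accuracy is usually the harder number to achieve.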
4.2. Implementation Details
4.3. Results on Cornell
4.4. Results on Jacquard
4.5. Real-Scene Evaluation
5. Discussion
5.1. Ablation Study
5.2. Failure Case Analysis
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Newbury, R.; Gu, M.; Chumbley, L.; Inaba, M. Deep learning approaches to grasp synthesis: A review. IEEE Trans. Robot. 2023, 39, 3994–4015.
2. Jiang, Y.; Fang, Y.; Deng, L. PDCNet: A Lightweight and Efficient Robotic Grasp Detection Framework via Partial Convolution and Knowledge Distillation. Comput. Vis. Image Underst. 2025, 259, 104441.
3. Peng, Y.; Yang, X.; Li, D.; Ma, Z.; Liu, Z.; Bai, X.; Mao, Z. Predicting flow status of a flexible rectifier using cognitive computing. Expert Syst. Appl. 2025, 264, 125878.
4. Mao, Z.; Suzuki, S.; Nabae, H.; Miyagawa, S.; Suzumori, K.; Maeda, S. Machine learning-enhanced soft robotic system inspired by rectal functions for investigating fecal incontinence. arXiv 2024, arXiv:2404.10999.
5. Deng, Y.; Guo, X.; Wei, Y.; Zhang, F. Deep reinforcement learning for robotic pushing and picking in cluttered environment. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; pp. 619–626.
6. Lenz, I.; Lee, H.; Saxena, A. Deep learning for detecting robotic grasps. Int. J. Robot. Res. 2015, 34, 705–724.
7. Redmon, J.; Angelova, A. Real-time grasp detection using convolutional neural networks. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Seattle, WA, USA, 26–30 May 2015; pp. 1316–1322.
8. Levine, S.; Pastor, P.; Krizhevsky, A.; Quillen, D. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. Int. J. Robot. Res. 2018, 37, 421–436.
9. Kumra, S.; Joshi, S.; Sahin, F. Antipodal robotic grasping using generative residual convolutional neural network. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 25–29 October 2020; pp. 9626–9633.
10. Zhang, Z.; Zhou, Z.; Wang, H.; Wu, Y. Grasp stability assessment through attention-guided cross-modality fusion and transfer learning. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Detroit, MI, USA, 1–5 October 2023; pp. 9472–9479.
11. Yu, S.; Zhai, D.-H.; Xia, Y. SKGNet: Robotic grasp detection with selective kernel convolution. IEEE Trans. Autom. Sci. Eng. 2023, 20, 2241–2252.
12. Qin, R.; Ma, H.; Gao, B.; Wang, X. RGB-D grasp detection via depth guided learning with cross-modal attention. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 8003–8009.
13. Yu, F.; Koltun, V. Multi-scale context aggregation by dilated convolutions. In Proceedings of the International Conference on Learning Representations (ICLR), San Juan, Puerto Rico, 2–4 May 2016.
14. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848.
15. Zhao, H.; Shi, J.; Qi, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890.
16. Li, Y.; Zhang, X.; Chen, D. DFANet: Deep feature aggregation for real-time semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 9522–9531.
17. Pang, Y.; Zhao, X.; Zhang, L.; Lu, H. Multi-scale interactive network for salient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 9413–9422.
18. Kumra, S.; Kanan, C. Robotic grasp detection using deep convolutional neural networks. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017; pp. 769–776.
19. Mahler, J.; Liang, J.; Niyaz, S.; Laskey, M.; Doan, R.; Liu, X.; Lin, J.; Vuong, A.; Goldberg, K. Dex-Net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics. In Proceedings of the Robotics: Science and Systems (RSS), Cambridge, MA, USA, 12–16 July 2017.
20. Zhou, X.; Lan, X.; Zhang, H.; Zhu, Y. Fully convolutional grasp detection network with oriented anchor box. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 7223–7230.
21. Xiong, J.; Kasaei, S. HMT-Grasp: A hybrid Mamba-Transformer approach for robot grasping in cluttered environments. arXiv 2024, arXiv:2410.03522.
22. Morrison, D.; Corke, P.; Leitner, J. Learning robust, real-time, reactive robotic grasping. Int. J. Robot. Res. 2020, 39, 183–201.
23. Jiang, Y.; Moseson, S.; Saxena, A. Efficient grasp detection from RGB-D images: Learning using a new rectangle representation. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Shanghai, China, 9–13 May 2011; pp. 3304–3311.
24. Hendrycks, D.; Gimpel, K. Gaussian error linear units (GELUs). arXiv 2016, arXiv:1606.08415.
25. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803.
26. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
27. Sun, J.; Chen, L.; Li, Y. Dilated Convolution with Learnable Spacings. arXiv 2022, arXiv:2201.12345.
28. Zhao, K.; Tang, M.; Wu, S.; Li, H. Dynamic Dilated Convolutions (D2Conv3D) for Object Segmentation in Videos. arXiv 2023, arXiv:2302.09876.
29. Depierre, A.; Dellandréa, E.; Chen, L. Jacquard: A large scale dataset for robotic grasp detection. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 3511–3516.
30. Morrison, D.; Corke, P.; Leitner, J. Closing the loop for robotic grasping: A real-time, generative grasp synthesis approach. In Proceedings of the Robotics: Science and Systems (RSS), Pittsburgh, PA, USA, 26–30 June 2018.
31. Tian, H.; Song, K.; Li, S.; Ma, S.; Yan, Y. Lightweight pixel-wise generative robot grasping detection based on RGB-D dense fusion. IEEE Trans. Instrum. Meas. 2022, 71, 5017912.
32. Yu, S.; Zhai, D.-H.; Xia, Y.; Wu, H.; Liao, J. SE-ResUNet: A novel robotic grasp detection method. IEEE Robot. Autom. Lett. 2022, 7, 5238–5245.
33. Song, Y.; Wen, J.; Liu, D.; Zhu, L. Deep robotic grasp prediction with hierarchical RGB-D fusion. Int. J. Control Autom. Syst. 2022, 20, 243–254.
34. Wang, S.; Zhou, Z.; Kan, Z. When transformer meets robotic grasping: Exploits context for efficient grasp detection. IEEE Robot. Autom. Lett. 2022, 7, 8170–8177.
35. Fu, K.; Dang, X. Light-weight convolutional neural networks for generative robotic grasping. IEEE Trans. Ind. Inform. 2024, 20, 6696–6707.
36. Zhai, D.-H.; Yu, S.; Xia, Y. FANet: Fast and accurate robotic grasp detection based on keypoints. IEEE Trans. Autom. Sci. Eng. 2024, 21, 2974–2986.
37. Kuang, X.; Tao, B. ODGNet: Robotic grasp detection network based on omni-dimensional dynamic convolution. Appl. Sci. 2024, 14, 4653.
38. Deng, S.; Pei, R.; Zhou, L.; Qin, H.; Sun, W.; Liang, Q. An Efficient Generative Intelligent Multiobjective Grasping Model for Kitchen Waste Sorting. IEEE Trans. Instrum. Meas. 2025, 74, 2522810.
39. Zhang, H.; Lan, X.; Zhou, X.; Huang, Q. ROI-based robotic grasp detection for object overlapping scenes. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; pp. 4768–4775.
40. Ainetter, S.; Fraundorfer, F. End-to-end trainable deep neural network for robotic grasp detection and semantic segmentation from RGB. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 13452–13458.
41. Liu, D.; Tao, X.; Yuan, L.; Zhang, Y.; Wang, H. Robotic objects detection and grasp in clutter based on cascaded deep convolutional neural network. IEEE Trans. Instrum. Meas. 2021, 71, 5004210.
42. Yang, L.; Zhang, C.; Liu, G.; Zhong, Z.; Li, Y. A model for robot grasping: Integrating transformer and CNN with RGB-D fusion. IEEE Trans. Consum. Electron. 2024, 70, 4673–4684.

| Authors | Input Size | Speed (ms) | IW Accuracy (%) | OW Accuracy (%) |
|---|---|---|---|---|
| ★Morrison [30] | 300 × 300 | 24 | 76.4 | 74.1 |
| ★Kumra [9] | 224 × 224 | 14 | 96.6 | 94.3 |
| Tian [31] | 224 × 224 | 15 | - | 98.9 |
| Yu [32] | 224 × 224 | 25 | 98.2 | 97.1 |
| Song [33] | 336 × 336 | 18 | 92.3 | 86.8 |
| Wang [34] | 224 × 224 | 41.6 | 98.0 | 96.7 |
| Fu [35] | 300 × 300 | 14.7 | 97.7 | 98.9 |
| Zhai [36] | - | 23 | 98.5 | 97.8 |
| Kuang [37] | 224 × 224 | 21 | 98.4 | 97.8 |
| Deng [38] | 224 × 224 | - | 98.3 | 97.1 |
| ★This study | 224 × 224 | 27 | 99.62 ± 0.11 | 98.87 ± 0.15 |

| Authors | Method | Modality | Accuracy (%) |
|---|---|---|---|
| Morrison [22] | GGCNN | D | 84 |
| Zhou [20] | FCGN, ResNet-101 | RGB | 91.8 |
| Zhang [39] | ROI-GD | RG-D | 93.6 |
| Ainetter [40] | ResNet-101+FPN | RGB | 92.95 |
| Liu [41] | Y-Net | RGB | 92.1 |
| Yang [42] | RCrossFormer | RGB | 93.81 |
| Wang [34] | TF-Grasp | RGB-D | 94.6 |
| This study | MCFG-Net | D | 94.58 ± 0.18 |
| | | RGB | 94.46 ± 0.22 |
| | | RGB-D | 95.48 ± 0.13 |

| Scene | Number of Experiments | Success Rate (%) | 
|---|---|---|
| Single | 200 | 98.5 | 
| Multi | 100 | 95 | 

| Method | MSFEM | CFAM | Parameters | Cornell Accuracy (%) |
|---|---|---|---|---|
| MCFG-Net-I | — | — | 2.93 M | 93.28 ± 0.49 | 
| MCFG-Net-II | ✓ | — | 3.14 M | 97.78 ± 0.14 | 
| MCFG-Net-III | — | ✓ | 2.99 M | 98.88 ± 0.14 | 
| MCFG-Net | ✓ | ✓ | 3.2 M | 99.62 ± 0.11 | 
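
The contribution of each module in the ablation table can be read off as an accuracy gain and a parameter overhead relative to the baseline MCFG-Net-I. The short script below does that arithmetic on the mean values only; it ignores the reported ± spread, and the helper name `gain_over_baseline` is hypothetical.

```python
# Rows copied from the ablation table:
# variant -> (parameters in millions, mean Cornell accuracy in %).
ablation = {
    "MCFG-Net-I": (2.93, 93.28),    # no MSFEM, no CFAM
    "MCFG-Net-II": (3.14, 97.78),   # MSFEM only
    "MCFG-Net-III": (2.99, 98.88),  # CFAM only
    "MCFG-Net": (3.20, 99.62),      # MSFEM + CFAM
}


def gain_over_baseline(name, baseline="MCFG-Net-I"):
    """Return (accuracy gain in percentage points, parameter overhead in %)."""
    base_params, base_acc = ablation[baseline]
    params, acc = ablation[name]
    acc_gain = round(acc - base_acc, 2)
    param_overhead = round(100 * (params - base_params) / base_params, 1)
    return acc_gain, param_overhead


for name in ablation:
    print(name, gain_over_baseline(name))
```

On these figures, MSFEM alone adds 4.50 points for about 7.2% more parameters, CFAM alone adds 5.60 points for about 2.0% more, and the full model adds 6.34 points for about 9.2% more.
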

| Failure Type | Number of Cases | Typical Causes |
|---|---|---|
| Recognition Failure | 6 | Occlusion, background clutter, and low visual contrast | 
| Execution Failure (Drop) | 2 | Grasp pose deviation, slippage, and unstable contact | 
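
The failure counts can be cross-checked against the real-scene success rates. The sketch below assumes that the failure table covers all 300 real-scene trials (200 single-object and 100 multi-object), which this section does not state explicitly.

```python
# Rows copied from the real-scene table: scene -> (trials, success rate in %).
trials = {"Single": (200, 98.5), "Multi": (100, 95.0)}
# Rows copied from the failure table: failure type -> number of cases.
failure_cases = {"Recognition Failure": 6, "Execution Failure (Drop)": 2}

# Failed grasps implied by each success rate.
failed = {scene: round(n * (100 - rate) / 100)
          for scene, (n, rate) in trials.items()}

total_trials = sum(n for n, _ in trials.values())
total_failed = sum(failed.values())
overall_success = 100 * (total_trials - total_failed) / total_trials

print(failed)                      # {'Single': 3, 'Multi': 5}
print(total_failed)                # 8
print(round(overall_success, 2))   # 97.33
```

The implied 3 + 5 = 8 failed grasps match the 6 + 2 = 8 cases in the failure table, giving an overall success rate of about 97.33% across the 300 trials under the stated assumption.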
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Huang, C.; Xu, J.; Cai, X.; Shen, S. A Multi-Scale Cross-Layer Fusion Method for Robotic Grasping Detection. Technologies 2025, 13, 357. https://doi.org/10.3390/technologies13080357