Article

Research on a Fusion Technique of YOLOv8-URE-Based 2D Vision and Point Cloud for Robotic Grasping in Stacked Scenarios

1 School of Mechanical Engineering, Hubei University of Technology, Wuhan 430068, China
2 Hubei Key Laboratory of Modern Manufacturing Quality Engineering, Wuhan 430068, China
3 School of Mechanical and Electrical Engineering, Wuhan Donghu University, Wuhan 430212, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(12), 6583; https://doi.org/10.3390/app15126583
Submission received: 14 April 2025 / Revised: 1 June 2025 / Accepted: 10 June 2025 / Published: 11 June 2025

Abstract

In industrial robotic grasping tasks, traditional 3D point cloud registration and pose estimation methods often struggle with low efficiency and limited accuracy in stacked and cluttered environments. To address these challenges, this paper proposes a grasp pose estimation algorithm that integrates 2D object detection based on YOLOv8-URE with 3D point cloud registration. In the detection stage, the method enhances object feature perception and localization by optimizing the receptive field structure and introducing attention mechanisms. It also employs an efficient multi-scale feature fusion strategy to improve bounding box regression accuracy. During point cloud processing, target centers predicted by the detector guide rapid segmentation, followed by robust registration techniques to estimate precise object poses. Experimental results demonstrate that YOLOv8-URE improves detection accuracy by 9.21% compared to YOLOv8n, reduces registration time by 60.5%, and significantly increases grasp success rates, proving its reliability and effectiveness in industrial scenarios.

1. Introduction

With the continuous advancement of automation technology, vision-based robots have been increasingly applied in various domains such as industrial manufacturing [1], logistics and packaging [2], and intelligent agricultural harvesting [3]. Among these applications, object grasping and sorting are critical steps, as their efficiency and accuracy directly impact the overall productivity and quality of the workflow.
Numerous studies have focused on enhancing the grasping capabilities of vision-based robots, yielding a range of significant advancements. For example, Lu Zhiliang et al. proposed an operational relationship reasoning algorithm based on the VLMRN neural network model [4] and a real-time grasp region detection algorithm using the SE-RetinaGrasp network. They developed a grasping method specifically designed for stacked object scenarios by integrating the two approaches. However, the grasping success rate and target recognition accuracy of this method in complex stacking scenarios still need to be improved. Li Xiuzhi et al. proposed an optimal grasping posture detection algorithm based on a dual-network architecture [5], which improves the detection speed and enhances the recognition performance for small target objects by modifying the YOLOv3 detection model. However, the algorithm relies on two deep learning networks, resulting in a more complex structure and higher computational costs, making it difficult to meet real-time application requirements. Guo et al. proposed a grasping detection method tailored for various fruit stacking scenarios [6], which can determine the grasping parameters of the currently targeted object. However, it cannot perceive global scene information and grasp specific types of objects. Moreover, the relatively small size of the dataset limits its generalizability to broader applications. Geng et al. combined the object detection algorithm (YOLOv5) with the fully convolutional grasp detection network (GDFCN) to propose a real-time grasp detection algorithm for robotic arms to handle unfamiliar objects [7]. This method effectively achieves object classification under real-time constraints and improves the stability and accuracy of grasp detection. Nevertheless, its performance under complex stacked scenes still has room for improvement.
Another line of research focuses on end-to-end object pose estimation methods that learn to directly map input data to pose parameters, thereby eliminating the need for manually designed features and complex multi-stage processing. Xiaoxin Zhao [8], Yajun Zhang [9], Guan Qi [10], and Gu Wang [11] investigated the use of convolutional neural networks (CNNs) to directly regress object poses from RGB images. By leveraging end-to-end training on large-scale datasets, these networks can automatically learn the mapping between image features and object pose. Such methods simplify the pose estimation pipeline and demonstrate strong performance under ideal conditions. However, end-to-end approaches still face several challenges, including a high dependency on the quality and diversity of training data, limited generalization ability in complex or cluttered environments, and suboptimal pose estimation accuracy. Additionally, the black-box nature of these models makes it difficult to interpret and optimize their internal decision-making processes.
In summary, existing algorithms often struggle to meet real-time requirements while maintaining high detection accuracy. These limitations become even more pronounced in complex stacked scenes, where increased algorithmic complexity and restricted computational resources hinder performance. Furthermore, current methods frequently suffer from low grasp-success rates and inaccurate target recognition in such environments. To address the inability of current robotic grasping algorithms to balance detection accuracy and real-time performance, this paper proposes a novel approach that integrates deep learning-based object detection with 3D point cloud processing. In the detection stage, the accuracy and real-time performance of the object detector are enhanced through improvements in the deep learning network architecture while also extracting the target’s center coordinates. In the point cloud processing stage, 3D coordinate transformation and KD-Tree search are employed to rapidly isolate the target point cloud. Subsequently, point cloud alignment is refined using the sample consensus and normal distributions transform (NDT) algorithms, thereby improving both registration accuracy and computational efficiency. This integrated method offers a real-time and accurate solution for object recognition and grasping in vision-guided robotic systems.

2. Methodology

Upon initialization, the depth camera module captures both RGB images and 3D point cloud data, which are then forwarded to the deep learning and point cloud processing modules, respectively.
As shown in Figure 1, in the deep learning module, the input RGB image is passed through the enhanced YOLOv8n network for object detection, whereby the object is identified and classified into four pose-based categories. During this process, a bounding box is generated around the object, and the coordinates of the bounding box center are extracted. These coordinates are then normalized and converted into 3D coordinates, including horizontal and vertical positions along with the corresponding depth value. The resulting 3D coordinates are subsequently passed to the point cloud processing module for further analysis.
In the point cloud processing module, the point cloud data captured by the depth camera are first preprocessed, including voxel grid downsampling to reduce point cloud density and background point removal through segmentation. Then, based on the 3D coordinates obtained from the deep learning module, a KD-Tree is used to efficiently locate and extract the target point cloud. Next, the target point cloud is aligned with a template point cloud selected based on the preliminary pose classification result, yielding a homogeneous transformation matrix. This matrix is applied to the corresponding ideal grasp pose of the template point cloud, resulting in the final grasp pose for the target object in the camera coordinate system.
In the grasp control module, the transformation matrix obtained from hand–eye calibration is used to perform coordinate transformation on the grasp pose, resulting in the final grasp pose. Then, inverse kinematics is applied to calculate the corresponding joint angles, which are sent to the robot controller to execute the target grasp.

2.1. YOLOv8-URE

2.1.1. Overview

YOLOv8 is a recent one-stage object detection algorithm in the YOLO series that offers higher speed and accuracy than many other mainstream detectors. As a state-of-the-art (SOTA) model, YOLOv8 is released in five versions—YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x—with the number of parameters and floating-point operations (FLOPs) increasing progressively across the versions. Considering the requirements of real-time performance, computational efficiency, and detection accuracy in industrial environments, we selected YOLOv8n as the baseline model for our application. As illustrated in Figure 2, the YOLOv8-URE algorithm mainly comprises two key components: the backbone module (Figure 2a), responsible for extracting feature information, and the neck module (Figure 2b), which further enhances the feature maps through multi-scale feature fusion, thereby improving the model’s capability to perceive and localize targets accurately.
In the backbone module, to enhance the feature extraction capability for stacked objects, the feature extraction network incorporates C2f_UniRepLKNetBlock. This structure employs a small number of large convolutional kernels to ensure a broad and effective receptive field while leveraging small kernels to efficiently capture more complex spatial patterns. Additionally, multiple lightweight blocks are used to increase network depth, thereby further enhancing its representational capacity. Efficient local attention (ELA) [12] is introduced as an attention mechanism that addresses the limitations of traditional coordinate attention methods. Specifically, it highlights the poor generalization capability caused by batch normalization, the adverse effects of dimensionality reduction on channel attention, and the high computational complexity of conventional attention generation. By combining 1D convolution with group normalization-based feature enhancement, ELA efficiently encodes two 1D positional feature maps without dimensionality reduction. This enables precise localization of regions of interest while maintaining a lightweight architecture.
In the neck module, to enhance the feature fusion capability for stacked objects, this paper redesigns the neck by integrating CSPStage and DySample mechanisms into a newly designed CSPStage-Dy module. CSPStage employs different channel numbers for features at varying scales, allowing flexible control over the representation capacity of both high-level and low-level features under lightweight computational constraints. Additionally, the redundant upsampling operations in the original queen-fusion structure are removed, which significantly reduces inference latency while maintaining only a minimal drop in accuracy. Furthermore, the original convolution-based fusion method is optimized using CSPNet connections, and the concepts of re-parameterization and the ELAN structure are introduced to further enhance accuracy without significantly increasing the computational load. Moreover, DySample learns pixel-wise weights, enabling the network to adopt adaptive sampling strategies across different regions, especially around object boundaries, thus improving edge clarity and detection precision.
Finally, Shape-IoU [13] is used to replace the original CIoU loss function to accelerate model convergence and improve regression speed.

2.1.2. Backbone Module

UniRepLKNet (unified re-parameterized large-kernel network), proposed by Xiaohan Ding et al. [14], is a universal perception large-kernel convolutional neural network. Compared to simply increasing the model depth, large convolutional kernels can more efficiently expand the effective receptive field, enabling the network to capture broader contextual information and extract more complex features. To balance both local and global feature extraction, this module combines large-kernel convolutions with small-kernel convolutions: the small kernels help capture local patterns and improve training stability, while the large kernels are more effective at modeling long-range dependencies [15]. The outputs of these two convolutions are then added after each passes through its own batch normalization (BN) layer [16]. During inference, structural re-parameterization [17] is applied to merge the BN layers and small-kernel convolutions equivalently into the large-kernel convolution. This approach maintains the functionality of the small-kernel convolutions while optimizing inference efficiency and reducing computational cost. As a result, the large kernels are better able to capture sparse patterns, allowing a pixel in the feature map to be more strongly related to distant pixels than to its immediate neighbors, thereby generating higher-quality feature representations.
The model architecture is shown in Figure 3. The DW Conv module incorporates both large-kernel and parallel dilated convolutions. Through structural re-parameterization, the block is equivalently transformed into a single large-kernel convolution for efficient inference.
This paper proposes C2f_UniRepLKNetBlock, which integrates the lark block concept into the C2f structure, as shown in Figure 4. The core idea of the C2f structure lies in progressive cross-layer fusion combined with a partial gradient transmission mechanism, which effectively enhances gradient flow and optimizes feature representation. This design enables the model to achieve more efficient gradient propagation and faster convergence, all while maintaining a low parameter count. The SE block module within UniRepLKNetBlock employs global average pooling and channel attention mechanisms to dynamically adjust the importance of feature channels, thereby improving the model’s focus on salient features. By combining C2f with UniRepLKNetBlock, the network not only retains its lightweight and efficient training advantages but also significantly strengthens its multi-scale perception and representation capabilities.
Specifically, the C2f_UniRepLKNetBlock module leverages large-kernel convolutions to expand the global receptive field, facilitating a comprehensive understanding of the overall stacked structure in cluttered scenes. In addition, the use of parallel convolutions with varying dilation rates enables the network to penetrate occlusions and better capture object edge details. Furthermore, the structural re-parameterization design allows the network to learn rich features through multi-branch structures during training, which are then fused into a single branch during inference, ensuring high computational efficiency without sacrificing performance.
Subsequently, the ELA (efficient local attention) mechanism was introduced. ELA applies strip pooling along the spatial dimensions to extract horizontal and vertical feature vectors, maintaining elongated kernel shapes to effectively capture long-range dependencies while avoiding interference from irrelevant regions during label prediction. The mechanism independently processes the directional feature vectors to generate attention weights, which are then fused through multiplicative operations, thereby enabling precise localization of the region of interest (ROI). To visualize the improvement in the model’s recognition ability more intuitively, HiResCAM [18] (high-resolution class activation mapping) was used to generate heat maps that show how the network attends to different targets. As can be seen in Figure 5b, without the attention mechanism the network focuses more on the edges of the objects and its attention is relatively dispersed, whereas in Figure 5c, after the introduction of the ELA mechanism, the model’s perception of the correct target is strengthened and its attention is concentrated rather than dispersed, allowing the model to focus on the target features more accurately.

2.1.3. Improved Neck Module

To redesign the neck, this paper combines the CSPStage (cross stage partial stage) from Efficient RepGFPN [19] with DySample (Dynamic Sampling) [20], proposing the CSPStage-Dy module.
As illustrated in Figure 6b, in the CSPStage module, the fused image features are divided into two branches, each containing a standard Conv-BN-Act structure. One branch incorporates the BasicBlock_Reverse module composed of a 3 × 3 RepConv and BatchNorm2d layers, which are stacked repeatedly to enhance feature representation. When Shortcut = True, a residual connection is applied to alleviate gradient vanishing and improve training stability; otherwise, with Shortcut = False, the module simplifies to a pure convolutional block. Moreover, as illustrated in Figure 6c, a simplified RepConv design is adopted that retains the original mechanism: multi-branch structures are used during training to extract features at multiple scales, while during inference, structural re-parameterization merges these branches into a single convolution. At the same time, the branches have been streamlined and lightweight activation functions along with optimized normalization strategies have been introduced, making CSPStage better suited for real-time detection tasks.
DySample is an ultralightweight dynamic upsampling module that primarily consists of two key steps: feature resampling and dynamic sampling point generation. As illustrated in Figure 7, DySample performs accurate upsampling through grid sampling operations combined with dynamic weight control, enabling it to retain critical region information. Compared to traditional methods, it offers higher boundary sampling precision, making it particularly suitable for addressing object boundary ambiguity in stacked scenes. Its lightweight design significantly reduces computational overhead and eliminates the need for additional sub-networks to generate dynamic convolution kernels.
In summary, CSPStage enhances feature interaction and multi-scale representation, while DySample improves upsampling quality with efficient resource utilization. When combined, these modules enable efficient feature fusion within the feature pyramid at lower latency, offering strong support for real-time object detection tasks.

2.2. Point Cloud Preprocessing

2.2.1. Point Cloud Filtering

The raw point cloud contains a very large number of points, which slows down subsequent processing. Therefore, downsampling is applied first to remove redundant points. Three-dimensional voxel grid filtering significantly reduces the number of data points by dividing the point cloud into voxels and replacing the points within each voxel with a single representative point. Voxel grid filtering also effectively removes isolated noise points. Through voxelization, the point cloud is resampled to a consistent spatial resolution, which facilitates subsequent segmentation and surface reconstruction. The filtering effect is shown in Figure 8.
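The paper does not include implementation code for this step. The following minimal sketch shows how voxel-grid downsampling of the kind described above is commonly performed with the Point Cloud Library (PCL); the input file name and the 5 mm leaf size are illustrative assumptions rather than values reported in the paper.

```cpp
#include <pcl/io/pcd_io.h>
#include <pcl/point_types.h>
#include <pcl/filters/voxel_grid.h>

// Voxel-grid downsampling sketch: one representative point is kept per voxel.
// The file name and the 5 mm leaf size are assumed values for illustration.
int main() {
    pcl::PointCloud<pcl::PointXYZ>::Ptr cloud(new pcl::PointCloud<pcl::PointXYZ>);
    pcl::PointCloud<pcl::PointXYZ>::Ptr filtered(new pcl::PointCloud<pcl::PointXYZ>);
    pcl::io::loadPCDFile<pcl::PointXYZ>("scene.pcd", *cloud);  // hypothetical input file

    pcl::VoxelGrid<pcl::PointXYZ> voxel;
    voxel.setInputCloud(cloud);
    voxel.setLeafSize(0.005f, 0.005f, 0.005f);  // 5 mm voxels
    voxel.filter(*filtered);                    // sparser cloud that preserves the geometry
    return 0;
}
```

The leaf size controls the trade-off: larger voxels remove more points (and more fine detail), while smaller voxels retain detail at the cost of the speedups reported in Table 1 and Table 2.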
To verify the effectiveness of the filtering process, a comparison of the number of points and point cloud loading time was conducted. As shown in Table 1, while preserving the essential geometric features, the point cloud becomes sparser after filtering, resulting in a significant reduction in the total number of points and a noticeable improvement in loading speed, which is beneficial for subsequent processing.
To further evaluate the acceleration effect of voxel filtering in point cloud processing, we measured the total processing time for ground segmentation and ICP registration under two conditions: without filtering and with 3D voxel grid filtering. As shown in Table 2, although the filtering step introduces a slight computational overhead, it significantly reduces the time required for subsequent processing (segmentation + registration). This demonstrates that voxel filtering effectively improves the overall efficiency of the algorithm.

2.2.2. Point Cloud Image Segmentation

As shown in Figure 9a, a large number of points—including the ground and background—are captured by the depth camera during image acquisition. These irrelevant points have little correlation with the target object and may interfere with the point cloud registration process, negatively affecting the accuracy of subsequent processing. To address this issue, sample consensus segmentation (SAC) is employed to segment the target point cloud from the background. As illustrated in Figure 9b, after segmentation, the cluttered background points in the original scene are successfully removed, leaving only the point cloud of the target object. This significantly improves the quality of the input data and facilitates accurate and efficient point cloud registration and pose estimation.
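As a rough sketch of this segmentation step, the code below uses PCL's RANSAC-based plane model (one common form of sample consensus segmentation) to detect the dominant plane and keep only the remaining object points; the 1 cm inlier threshold is an assumed value, not a setting taken from the paper.

```cpp
#include <pcl/ModelCoefficients.h>
#include <pcl/PointIndices.h>
#include <pcl/point_types.h>
#include <pcl/filters/extract_indices.h>
#include <pcl/segmentation/sac_segmentation.h>

// RANSAC plane segmentation sketch: detect the dominant plane (desk/ground)
// and keep only the remaining object points.
void removeDominantPlane(const pcl::PointCloud<pcl::PointXYZ>::Ptr& cloud,
                         const pcl::PointCloud<pcl::PointXYZ>::Ptr& objects) {
    pcl::ModelCoefficients::Ptr coefficients(new pcl::ModelCoefficients);
    pcl::PointIndices::Ptr inliers(new pcl::PointIndices);

    pcl::SACSegmentation<pcl::PointXYZ> seg;
    seg.setModelType(pcl::SACMODEL_PLANE);  // fit a plane model by sample consensus
    seg.setMethodType(pcl::SAC_RANSAC);
    seg.setDistanceThreshold(0.01);         // points within 1 cm count as plane inliers
    seg.setInputCloud(cloud);
    seg.segment(*inliers, *coefficients);

    pcl::ExtractIndices<pcl::PointXYZ> extract;
    extract.setInputCloud(cloud);
    extract.setIndices(inliers);
    extract.setNegative(true);              // discard the plane, keep everything else
    extract.filter(*objects);
}
```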

2.3. Fusion Processing

The point cloud captured by the depth camera is large and, even after preprocessing, remains dense. Using it directly for registration would consume considerable time and memory, making it difficult to meet real-time requirements, so the scene point cloud must be further segmented. After the deep learning module locates the target object, the center coordinates of the detection bounding box are converted into 3D coordinates, which are then used to perform a KD-Tree search to locate and segment the target point cloud.

2.3.1. 3D Coordinate Transformation

After the object detection is completed, the coordinate information obtained from the detection can be used to quickly locate the target area, optimize the KD-Tree search process, and improve the efficiency of target point cloud retrieval. To achieve this, the 2D center coordinates need to be transformed into 3D coordinates. Since the coordinates of the object are identified in the RGB image and the resolutions of the RGB and depth images are different, the RGB image must first be normalized. This process requires the use of the depth camera’s intrinsic matrix:
$\text{intrinsic matrix} = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$, (1)
where $f_x$ and $f_y$ are the focal lengths in the x and y directions of the image plane (measured in pixels), which determine the image scaling, and $c_x$ and $c_y$ represent the coordinates of the principal point, which is typically located at the center of the image.
Then, the 3D coordinate transformation is performed using the following equation:
$\mathrm{TargetPoint}_x = \dfrac{\mathrm{Depth} \cdot (C_x - c_x)}{f_x \cdot \mathrm{Depth}_{scale}}$, (2)
$\mathrm{TargetPoint}_y = \dfrac{\mathrm{Depth} \cdot (C_y - c_y)}{f_y \cdot \mathrm{Depth}_{scale}}$, (3)
$\mathrm{TargetPoint}_z = \dfrac{\mathrm{Depth}}{\mathrm{Depth}_{scale}}$, (4)
where $\mathrm{TargetPoint}_x$, $\mathrm{TargetPoint}_y$, and $\mathrm{TargetPoint}_z$ denote the 3D coordinates of the target point after coordinate transformation, corresponding to the horizontal, vertical, and depth positions in the camera coordinate system, respectively; $\mathrm{Depth}$ denotes the raw depth value acquired by the depth camera, typically measured in millimeters (mm); $\mathrm{Depth}_{scale}$ takes a value of 1000 and is used to convert the depth unit from millimeters to meters to ensure consistency with the camera coordinate system; and $C_x$ and $C_y$ represent the horizontal and vertical coordinates (in pixels) of the center point of the detected bounding box after alignment with the RGB image. The target point coordinates obtained from the 2D RGB image and the corresponding depth map can thus be accurately converted to 3D coordinates using the intrinsic parameters of the depth camera.
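A minimal sketch of Equations (2)–(4) is given below; the function and the example intrinsic values are hypothetical and serve only to illustrate the back-projection of the detected bounding-box center into camera coordinates.

```cpp
// Sketch of Equations (2)-(4): back-project the detected bounding-box center
// (Cx, Cy) and its depth value into camera-frame 3D coordinates (in meters).
struct Point3D { double x, y, z; };

Point3D pixelToCameraPoint(double Cx, double Cy, double depth_mm,
                           double fx, double fy, double cx, double cy,
                           double depth_scale = 1000.0) {
    Point3D p;
    p.z = depth_mm / depth_scale;                     // Equation (4): mm -> m
    p.x = depth_mm * (Cx - cx) / (fx * depth_scale);  // Equation (2)
    p.y = depth_mm * (Cy - cy) / (fy * depth_scale);  // Equation (3)
    return p;
}

// Example call with placeholder intrinsics (fx, fy, cx, cy in pixels):
// Point3D target = pixelToCameraPoint(412.0, 305.0, 735.0, 615.0, 615.0, 320.0, 240.0);
```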

2.3.2. KD-Tree Neighbor Search

Based on the data structuring algorithm in NNL [21], KD-Tree (k-dimensional tree) serves as an efficient spatial indexing structure, widely adopted for its simplicity, low memory overhead, and high computational efficiency. The core idea of the KD-Tree algorithm is to select a dimension and split the set of points based on the median value along that dimension, ensuring that the resulting subsets are approximately equal in size. This binary partitioning process is repeated until the dataset can no longer be subdivided.
Assuming a two-dimensional dataset, as illustrated in Figure 10, the construction begins by computing the median of all points along the X-axis, which serves as the initial splitting reference. This divides the dataset into left and right subsets. The median of the Y-axis is then computed within each subset to further split them into upper and lower partitions. This recursive partitioning continues, alternating between axes until each point is individually isolated. The positions of the sample points are then visualized within the 2D coordinate system.
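To make the median-split recursion concrete, the following sketch builds a small 2D KD-Tree exactly as described above. It is a simplified, hypothetical implementation for illustration; in practice an existing KD-Tree implementation is used (see the PCL-based sketch later in this subsection).

```cpp
#include <algorithm>
#include <memory>
#include <vector>

// Hypothetical 2D KD-Tree construction following the median-split rule described above.
struct Point2D { double x, y; };

struct KdNode {
    Point2D point;                        // median point chosen as the splitting reference
    int axis;                             // 0: split along X, 1: split along Y
    std::unique_ptr<KdNode> left, right;  // subsets on either side of the median
};

std::unique_ptr<KdNode> buildKdTree(std::vector<Point2D> pts, int depth = 0) {
    if (pts.empty()) return nullptr;
    const int axis = depth % 2;                  // alternate X/Y at each level
    const std::size_t mid = pts.size() / 2;
    // Partially sort so that the median along the current axis lands at index mid.
    std::nth_element(pts.begin(), pts.begin() + mid, pts.end(),
                     [axis](const Point2D& a, const Point2D& b) {
                         return axis == 0 ? a.x < b.x : a.y < b.y;
                     });
    auto node = std::make_unique<KdNode>();
    node->point = pts[mid];
    node->axis = axis;
    // Recurse on the two roughly equal halves until every point is isolated.
    node->left  = buildKdTree(std::vector<Point2D>(pts.begin(), pts.begin() + mid), depth + 1);
    node->right = buildKdTree(std::vector<Point2D>(pts.begin() + mid + 1, pts.end()), depth + 1);
    return node;
}
```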
Based on this structure, KD-Tree enables efficient search operations. Compared with a traditional linear search, its average time complexity is $O(\log n)$, which is significantly better than the $O(n)$ of a linear search. To validate this conclusion, we performed radius-based neighbor searches on point clouds with varying numbers of points and recorded the search time. As shown in Table 3, for small-scale point clouds, the linear search performs comparably to or even better than KD-Tree. However, as the point cloud size increases, KD-Tree demonstrates significantly higher search efficiency, confirming its acceleration advantage in large-scale data scenarios.
As shown in Figure 11, this paper utilizes the neighbor search method of KD-Tree to rapidly extract the region of interest around the 3D target point, enabling efficient localization and segmentation of the target point cloud. By this method, the search range can be reduced and the initialization accuracy of point cloud matching can be effectively improved, thus accelerating the convergence speed of the alignment algorithm and improving the robustness.
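The sketch below illustrates this neighborhood-based extraction with PCL's FLANN-backed KD-Tree, assuming the back-projected detection center from Section 2.3.1 as the query point; the 0.1 m search radius is an assumed value.

```cpp
#include <vector>
#include <pcl/point_types.h>
#include <pcl/kdtree/kdtree_flann.h>

// KD-Tree radius search sketch: extract the neighborhood of the back-projected
// detection center as the target point cloud. The 0.1 m radius is an assumed value.
pcl::PointCloud<pcl::PointXYZ>::Ptr
extractTargetCloud(const pcl::PointCloud<pcl::PointXYZ>::Ptr& scene,
                   const pcl::PointXYZ& center, float radius = 0.1f) {
    pcl::KdTreeFLANN<pcl::PointXYZ> kdtree;
    kdtree.setInputCloud(scene);

    std::vector<int> indices;
    std::vector<float> sqr_distances;
    kdtree.radiusSearch(center, radius, indices, sqr_distances);

    pcl::PointCloud<pcl::PointXYZ>::Ptr target(new pcl::PointCloud<pcl::PointXYZ>);
    target->reserve(indices.size());
    for (int idx : indices)                 // copy the neighborhood into the target cloud
        target->push_back((*scene)[idx]);
    return target;
}
```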

2.4. Point Cloud Registration

After locating and segmenting the target point cloud, the corresponding template point cloud is selected based on the pose classification results from YOLOv8-URE, and registration is then performed between the template and target point clouds.
Point cloud registration can be divided into two stages: coarse registration and fine registration. Coarse registration is used when the relative positions of point clouds are completely unknown. Its goal is to estimate an approximate rotation and translation matrix to align the target point cloud into a common coordinate system, thereby providing a reliable initial estimate for fine registration. Fine registration then refines this alignment by minimizing the spatial discrepancy between point clouds to compute a more accurate transformation matrix for precise alignment.
In this study, the sample consensus algorithm is selected for coarse registration. This algorithm handles outliers through random sampling, and its procedure includes the following steps:
1. Random Sampling—Randomly select a subset of points from the source point cloud.
2. Transformation Estimation—Compute a rigid transformation matrix based on the sampled points.
3. Model Verification—Apply the transformation to the entire point cloud and evaluate its consistency.
As can be seen in Figure 12, both the registration result of the sample and that of the actual workpiece demonstrate that accurate alignment can still be achieved despite the presence of surrounding outlier points, indicating the strong robustness of the algorithm, which makes it more suitable for registration in industrial environments.
To further improve point cloud alignment accuracy while keeping the computation efficient, this paper selects the normal distributions transform (NDT) algorithm for the fine registration stage. As shown in Figure 13, taking the workpiece point cloud used in this paper as an example, the proposed registration algorithm is compared with commonly used methods such as ICP and GICP in a single-object scenario; the results show that the proposed registration pipeline is more accurate and exhibits strong robustness to noise and outliers.
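The following sketch outlines such a coarse-to-fine pipeline in PCL: sample consensus initial alignment (SAC-IA, which requires local descriptors such as FPFH) provides the coarse transform, and NDT refines it. The paper does not specify its implementation details, so all search radii and NDT parameters below are assumptions.

```cpp
#include <Eigen/Core>
#include <pcl/point_types.h>
#include <pcl/features/normal_3d.h>
#include <pcl/features/fpfh.h>
#include <pcl/registration/ia_ransac.h>
#include <pcl/registration/ndt.h>

using CloudT = pcl::PointCloud<pcl::PointXYZ>;

// FPFH descriptors are needed by SAC-IA; the 2 cm / 5 cm radii are assumed values.
pcl::PointCloud<pcl::FPFHSignature33>::Ptr computeFPFH(const CloudT::Ptr& cloud) {
    pcl::PointCloud<pcl::Normal>::Ptr normals(new pcl::PointCloud<pcl::Normal>);
    pcl::NormalEstimation<pcl::PointXYZ, pcl::Normal> ne;
    ne.setInputCloud(cloud);
    ne.setRadiusSearch(0.02);
    ne.compute(*normals);

    pcl::PointCloud<pcl::FPFHSignature33>::Ptr features(new pcl::PointCloud<pcl::FPFHSignature33>);
    pcl::FPFHEstimation<pcl::PointXYZ, pcl::Normal, pcl::FPFHSignature33> fpfh;
    fpfh.setInputCloud(cloud);
    fpfh.setInputNormals(normals);
    fpfh.setRadiusSearch(0.05);
    fpfh.compute(*features);
    return features;
}

// Coarse-to-fine registration of the template cloud onto the segmented target cloud.
Eigen::Matrix4f registerTemplateToTarget(const CloudT::Ptr& tmpl, const CloudT::Ptr& target) {
    // Coarse stage: SAC-IA gives a rough initial transformation despite outliers.
    pcl::SampleConsensusInitialAlignment<pcl::PointXYZ, pcl::PointXYZ, pcl::FPFHSignature33> sac_ia;
    sac_ia.setInputSource(tmpl);
    sac_ia.setSourceFeatures(computeFPFH(tmpl));
    sac_ia.setInputTarget(target);
    sac_ia.setTargetFeatures(computeFPFH(target));
    CloudT coarse_aligned;
    sac_ia.align(coarse_aligned);
    const Eigen::Matrix4f coarse = sac_ia.getFinalTransformation();

    // Fine stage: NDT refines the coarse estimate.
    pcl::NormalDistributionsTransform<pcl::PointXYZ, pcl::PointXYZ> ndt;
    ndt.setInputSource(tmpl);
    ndt.setInputTarget(target);
    ndt.setResolution(0.05f);          // assumed NDT voxel resolution
    ndt.setStepSize(0.05);
    ndt.setTransformationEpsilon(1e-4);
    CloudT fine_aligned;
    ndt.align(fine_aligned, coarse);   // start from the coarse result
    return ndt.getFinalTransformation();  // homogeneous matrix T (template -> target)
}
```

The returned matrix plays the role of the transformation T used in the grasp pose estimation step described in Section 2.5.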

2.5. Grasp Pose Estimation

After completing the registration between the target point cloud and the template point cloud, a homogeneous transformation matrix T is obtained. This matrix represents the rigid transformation between the two point clouds and is used to align the template point cloud with the target point cloud. Subsequently, a homogeneous transformation matrix is applied to the ideal grasp pose defined on the template, thereby deriving the corresponding grasp pose for the target point cloud. This process enables the transfer of known grasping strategies from the template to the real object. The transformation matrix is a standard 4 × 4 homogeneous matrix, and is expressed as follows:
$T = \begin{bmatrix} R & t \\ \mathbf{0}^{T} & 1 \end{bmatrix}$, (5)
where $R = (r_{ij}) \in \mathbb{R}^{3 \times 3}$ is a rotation matrix describing the orientation of the object, and $t = (t_x, t_y, t_z)^{T}$ is a translation vector describing the position of the object in 3D space.
Using this matrix, the predefined ideal grasp pose $T_{grasp}^{template}$ of the template point cloud can be mapped to the actual grasp pose $T_{grasp}^{target}$ of the target point cloud. The computation is as follows:
$T_{grasp}^{target} = T \cdot T_{grasp}^{template}$. (6)
The resulting grasp pose is still expressed in the camera coordinate system. Subsequently, it can be further transformed into the robot base coordinate system using the known hand–eye calibration matrix $T_{cam2base}$:
$T_{grasp}^{base} = T_{cam2base} \cdot T_{grasp}^{target}$. (7)
Finally, the grasp pose is processed through inverse kinematics based on the robot’s DH model to solve for the joint angles. These computed joint values are subsequently used to generate control commands for executing the grasping operation.
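Expressed in code, the pose chain of Equations (6) and (7) is a simple composition of homogeneous matrices; the sketch below uses Eigen, and the argument names are hypothetical.

```cpp
#include <Eigen/Dense>

// Sketch of Equations (6) and (7): compose the registration result T with the
// template grasp pose, then map into the robot base frame via hand-eye calibration.
// All matrices are 4x4 homogeneous transforms.
Eigen::Matrix4f graspPoseInBase(const Eigen::Matrix4f& T_registration,   // template -> target (camera frame)
                                const Eigen::Matrix4f& T_grasp_template, // ideal grasp pose on the template
                                const Eigen::Matrix4f& T_cam2base) {     // camera frame -> robot base frame
    const Eigen::Matrix4f T_grasp_target = T_registration * T_grasp_template; // Equation (6)
    return T_cam2base * T_grasp_target;                                        // Equation (7)
}
```

The resulting base-frame pose is then passed to the inverse kinematics solver to obtain the joint angles, as described above.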

3. Experimental Results and Analysis

3.1. Experimental Setup for Object Detection

The experimental setup for object detection in this study was as follows: an Nvidia GeForce RTX 2080 Ti GPU with 16 GB of memory running on the Windows 10 operating system. The programming language used was Python 3.8.18, with CUDA version 11.3. The YOLOv8n model was implemented using the Ultralytics library (version 8.1.9). The initial learning rate was set to 0.01, with three warm-up epochs. Data augmentation was disabled during the final 10 training epochs. The detailed training parameters are summarized in Table 4.

3.2. Dataset Description

3.2.1. Dataset of Reducing Tee Pipes

The dataset is a self-constructed collection of a reducing tee pipe, designed for robotic arm grasping of stacked objects in industrial environments. A total of 3000 images were captured, featuring the target object in various poses, quantities, and simulated stacking scenarios. The dataset was divided into a training set and a validation set in a 9:1 ratio.

3.2.2. WiderPerson Dataset

The WiderPerson dataset is a publicly available dataset designed for outdoor pedestrian detection. It is characterized by high target density and severe occlusion, especially in crowded scenes with substantial target overlaps. These characteristics closely resemble industrial environments, which often involve stacking, occlusions, and densely arranged objects. Therefore, this dataset was selected in this study for training and evaluating the object detection module. Considering that some images lack annotation information, the original dataset was filtered and preprocessed, resulting in a total of 13,381 fully annotated image samples. Among them, 7999 images were used for training, 1000 for validation, and 4385 for testing. This dataset provides strong support for evaluating detection algorithms under complex scenarios.

3.3. Description of the Indicator Parameters

In this paper, precision (P), recall (R), mean average precision (mAP), number of network parameters, and floating-point operations (GFLOPs) are used as evaluation metrics to assess the performance of the model. The definitions of precision, recall, and mAP are as follows:
$P = \dfrac{T_P}{T_P + F_P} \times 100\%$, (8)
$R = \dfrac{T_P}{T_P + F_N} \times 100\%$, (9)
$AP = \displaystyle\int_{0}^{1} P(r)\,\mathrm{d}r$, (10)
$mAP = \dfrac{\sum_{i=1}^{k} AP_i}{k}$. (11)
In Equations (8) and (9), $T_P$ (true positive) refers to correctly predicted positive samples, i.e., samples that are actual targets and are correctly predicted as such; $F_P$ (false positive) refers to non-target samples that are incorrectly predicted as targets; and $F_N$ (false negative) refers to target samples that are incorrectly predicted as non-targets. In Equation (10), $AP$ (average precision) represents the accuracy for a single category and is obtained by integrating the area under the precision–recall curve. In Equation (11), $mAP$ (mean average precision) is the average of $AP$ over all $k$ categories.

3.4. Ablation Experiment

To evaluate the impact of each proposed improvement module on overall performance, systematic ablation experiments were conducted on a self-constructed dataset featuring occlusion and stacking scenarios. The baseline was the basic version of the network without any improvement modules, i.e., the original YOLOv8n (denoted Base). Building upon this baseline, each module was introduced individually: C2f_UniRepLKNetBlock (Base-1), the ELA mechanism (Base-2), the CSPStage-Dy enhanced neck (Base-3), and the ShapeIoU loss (Base-4). Furthermore, additional configurations—Base-5, Base-6, and Base-7—were constructed by progressively combining ELA, the improved neck, and the ShapeIoU loss on top of Base-1.
The experimental results, as shown in Table 5, indicate that the standalone introduction of the UniRepLKNetBlock significantly reduces the model parameters (by approximately 34.3%) and computational complexity (FLOPs decreased by 2.3%), albeit at the cost of some accuracy loss. After incorporating the ELA mechanism, both recall and mAP@0.5:0.95 show marked improvements, demonstrating its strong capability to enhance key regions in shallow features. The improved neck structure (including DySample) significantly boosts the detection accuracy of primary targets (mAP@0.5 increased to 97.3%), highlighting its advantages in multi-scale fusion and feature consistency.
Notably, when both ELA and the CSPStage-Dy enhanced neck are introduced simultaneously in Base-6, the mAP improvement is no longer dramatically higher compared to individual modules. However, this combination maintains a lightweight architecture while achieving a more balanced and stable performance in recall and overall metrics. The underlying mechanism is that ELA strengthens the front-end network’s perception of critical target regions, making target contour features more prominent, whereas DySample enhances spatial consistency and contextual modeling in the neck’s multi-scale fusion. Acting at different network stages, these two modules complement each other under a decoupled structure, improving overall perception and representation capabilities, thereby exhibiting stronger robustness and detection stability in complex occlusion and dense target environments.
Finally, with the introduction of ShapeIoU loss (Base-7), the model’s convergence speed further accelerates, bounding box regression becomes more precise, and detection accuracy reaches its highest (mAP@0.5 of 98.3%). This validates the complementary value and optimization potential of each module within the overall architecture.

3.5. Analysis of Object Detection Results

3.5.1. Comparison of Model Experimental Results

To verify the detection accuracy of YOLOv8-URE in complex scenarios, a large number of workpieces were placed in the original environment, introducing complex stacking relationships and dense occlusions. As shown in Figure 14, for a clear comparison of detection results, 20 workpieces were used as an example. The YOLOv8-URE network was able to detect all the workpieces and classify them correctly, while YOLOv8n missed one target.
To further demonstrate the detection accuracy of YOLOv8-URE, the number of workpieces was gradually increased to evaluate its performance under varying conditions. As shown in Table 6, both YOLOv8-URE and YOLOv8n perform well with a small number of targets. However, as the number of targets increases, YOLOv8n shows a clear rise in missed and false detections. In contrast, YOLOv8-URE consistently maintains more accurate detection results, demonstrating better robustness and overall performance in complex, high-density, and occluded scenarios.
To evaluate the robustness of the YOLOv8-URE model under conditions resembling real-world industrial environments, this study processed the original scenarios by adding random Gaussian noise and random black screen occlusion, so as to generate blurred scenarios for testing and comparing with the YOLOv8n model.
As shown in Figure 15, taking the detection of 20 workpieces as an example, both models experience increased missed and false detections in blurred scenarios. However, YOLOv8-URE consistently achieves a higher detection success rate than YOLOv8n. As shown in Table 7, as the number of targets increases, detection performance in the blurred scene declines more noticeably, especially for YOLOv8n. In contrast, YOLOv8-URE maintains results closer to the actual target count with fewer errors, demonstrating stronger robustness and accuracy under visually degraded conditions.
In summary, the detection performance of YOLOv8-URE is similar to that of YOLOv8n when the number of targets is small, but with an increase in the number of targets, the misses and false detection of YOLOv8n rise significantly, especially in blurred scenes. In contrast, YOLOv8-URE exhibits higher accuracy and robustness under higher target density and occlusion conditions, and the overall detection performance is better than YOLOv8n.

3.5.2. Generalization Performance Comparison

To verify the detection performance and generalization capability of YOLOv8-URE on external datasets, the model was trained on the publicly available WiderPerson dataset and compared with other mainstream algorithms. The experimental results show that YOLOv8-URE achieves a recall of 80.4% on the test set, higher than the other mainstream algorithms. Moreover, YOLOv8-URE has the smallest model size, at only 4.46 MB, making it well suited for lightweight deployment. The experimental results are shown in Table 8.

3.6. Grasping Algorithm Experiments

The experiments were conducted on the Visual Studio 2019 platform with an R7-5800H CPU. The robotic arm used was the SD-700E model from Xinshida. The grasping environment is illustrated in Figure 16.

3.6.1. Registration Algorithm Experiment

To verify the superiority of the proposed algorithm and evaluate the acceleration effect of integrating KD-Tree search on the overall process of point cloud processing, this paper presents a comparative experiment. The experiment takes the entire point cloud processing module as the evaluation object, with the version excluding KD-Tree-based segmentation used as the control group. To comprehensively evaluate the performance of the algorithms, three metrics are used for comparison:
1. Program running time—to measure computational efficiency;
2. Root mean square error (RMSE)—to reflect the overall level of point cloud registration error (see Equation (12));
3. Mean absolute error (MAE)—to quantify the absolute deviation of registration errors (see Equation (13)).
$RMSE = \sqrt{\dfrac{1}{n} \sum_{i=1}^{n} \lVert T(p_i) - q_i \rVert^{2}}$ (12)
$MAE = \dfrac{1}{n} \sum_{i=1}^{n} \lVert T(p_i) - q_i \rVert$ (13)
In these equations, $n$ denotes the number of point pairs from the template point cloud involved in the registration, $p_i$ represents the i-th point in the template point cloud, $T(p_i)$ is the transformed position of $p_i$ after registration, and $q_i$ is the corresponding point in the target point cloud matched to $p_i$.
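A small sketch of how Equations (12) and (13) can be evaluated over a given set of matched point pairs is shown below; the correspondence search itself is omitted and assumed to be provided by the registration step.

```cpp
#include <cmath>
#include <vector>
#include <Eigen/Dense>

// Sketch of Equations (12) and (13): registration error over matched point pairs,
// where each template point p_i is transformed by T and compared with its matched
// target point q_i (correspondences are assumed to be given).
struct RegistrationError { double rmse; double mae; };

RegistrationError evaluate(const Eigen::Matrix4d& T,
                           const std::vector<Eigen::Vector3d>& p,   // template points
                           const std::vector<Eigen::Vector3d>& q) { // matched target points
    double sum_sq = 0.0, sum_abs = 0.0;
    const std::size_t n = p.size();
    for (std::size_t i = 0; i < n; ++i) {
        const Eigen::Vector3d Tp = (T * p[i].homogeneous()).head<3>(); // T(p_i)
        const double d = (Tp - q[i]).norm();                           // ||T(p_i) - q_i||
        sum_sq += d * d;
        sum_abs += d;
    }
    return { std::sqrt(sum_sq / n), sum_abs / n };
}
```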
As shown in Table 9, the proposed algorithm rapidly locates the target point cloud within a limited search space through KD-Tree search. This significantly improves the processing speed and real-time performance, while also reducing RMSE and MAE to a certain extent, thereby satisfying the requirements for point cloud registration.

3.6.2. Workpiece Grasping Experiment

Figure 17 shows the main process of grasping the target object.
To evaluate the grasping performance of the proposed algorithm and verify its robustness, four distinct object poses were defined, as illustrated in Figure 18. For each pose, 25 grasping trials were conducted under consistent experimental conditions.
The experimental results are shown in Table 10. Postures 1 and 3 achieved higher success rates, while Postures 2 and 4 showed a noticeable drop in grasp success rate. An analysis of the failed cases reveals two main causes:
  • In Postures 2 and 4, occlusions in the point cloud led to insufficient feature points, resulting in reduced pose estimation accuracy.
  • Under these postures, the gripper was more likely to collide with the object, which caused grasp failures.

4. Conclusions and Future Work

4.1. Conclusions

(1) In this study, the YOLOv8n algorithm was improved and redesigned by enhancing feature extraction and reconstructing the feature fusion network. In the feature extraction stage, large-kernel convolutions are introduced to expand the receptive field and enhance object perception, while a lightweight ELA mechanism is integrated to effectively reduce the number of parameters and floating-point operations. In the feature fusion stage, the network is redesigned to improve detection accuracy without significantly increasing computational cost, thus reducing both false positives and false negatives. Ablation experiments show that compared to the original YOLOv8n, the improved YOLOv8-URE reduces the number of parameters by 27.3% and increases precision by 1.1%, recall by 5.1%, and mAP@0.5 by 1.8%, while reducing FLOPs by 2.7 GFLOPs and improving inference speed by 25 FPS. In generalization comparison experiments, YOLOv8-URE achieved 0.7% higher recall than YOLOv8n, with a nearly identical AP value, and the overall model size was reduced by 31.1%. These results demonstrate that YOLOv8-URE achieves improved accuracy while maintaining a lightweight design and strong generalization, making it a practical and versatile solution.
(2) In the point cloud processing module, this paper proposes a point cloud segmentation algorithm that integrates 3D coordinate transformation and KD-Tree search. By transforming coordinates to provide initial target points for KD-Tree, the method effectively enables fast localization and segmentation of the target point cloud, reduces the number of points involved in computation, and accelerates subsequent alignment. At the alignment stage, combining sample consensus with the NDT algorithm achieves higher alignment accuracy and robustness in complex environments compared to traditional methods, providing a more reliable positional basis for subsequent workpiece grasping.
(3) Grasping experiments were conducted on the target workpiece under various poses. The results demonstrate that the system maintains a high success rate across different orientations, indicating strong robustness and reliable performance in completing the grasping tasks.

4.2. Future Work

(1) Since the object detection module relies on deep learning algorithms, its performance is to some extent constrained by the size of the dataset. Additionally, when switching to new target objects, the process of recollecting and annotating data remains labor-intensive. Therefore, future work should further explore methods such as adaptive fine-tuning, incremental learning, and transfer learning to enhance the model’s performance and generalization capabilities.
(2) In the process of feature fusion between RGB information and point cloud data, issues such as viewpoint differences, resolution mismatch, and occlusion may arise during data acquisition, affecting the quality of fusion. Therefore, future research should consider incorporating transformer-based feature alignment mechanisms to improve the accuracy and robustness of multi-modal feature fusion.
(3) In practical grasping tasks, the complex geometry of certain workpiece poses may cause collisions between the gripper and the object, leading to grasp failures. Consequently, future research will focus on optimizing the grasp path to further improve grasp success rate and stability, and we will conduct a quantitative comparison with the state-of-the-art GraspNet framework under the same experimental conditions to comprehensively evaluate the proposed improvements.

Author Contributions

Conceptualization, X.Y. and X.Q.; methodology, X.Y. and X.Q.; software, X.Y.; validation, X.Q. and L.Z.; formal analysis, L.Z.; investigation, J.W.; data curation, J.W.; writing—original draft preparation, X.Q.; writing—review and editing, X.Y. and Y.C.; supervision, Y.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of Hubei Province (Youth Program) (2023AFB381), the Hubei Agricultural Machinery Equipment Shortcomings Tackling Project “Research, Development, and Promotion of Key Technologies and Equipment for Aquatic Product Processing” (HBSNYT202221), and the Youth Talent Project of the Science and Technology Research Program of the Hubei Provincial Department of Education (Project Number Q20231412).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available from the corresponding author upon reasonable request. The data are not publicly available due to privacy restrictions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Liu, C.; Yang, T. Application and industrial development of machine vision in intelligent manufacturing. Mach. Tool Hydraul. 2021, 49, 172–178. (In Chinese) [Google Scholar]
  2. Li, C.; Wei, X.; Zhou, Y.; Li, H. Research on control of intelligent logistics sorting robot based on laser vision guidance. Laser J. 2022, 43, 217–222. (In Chinese) [Google Scholar]
  3. Xue, L.; Zhou, J. Visual servo control of agricultural robot parallel picking arm. Sens. Microsyst. 2017, 36, 123–126. (In Chinese) [Google Scholar] [CrossRef]
  4. Lu, Z. Research on stacking object grasping method based on deep learning. Master’s Thesis, Guangdong University of Technology, Guangzhou, China, 2020. (In Chinese). [Google Scholar] [CrossRef]
  5. Li, X.; Li, J.; Zhang, X.; Peng, X. Optimal grasp posture detection method for robots based on deep learning. Chin. J. Sci. Instrum. 2020, 41, 108–117. (In Chinese) [Google Scholar] [CrossRef]
  6. Guo, D.; Kong, T.; Sun, F.; Liu, H. Object discovery and grasp detection with a shared convolutional neural network. In Proceedings of the 2016 IEEE International Conference on Robotics and Automation, Stockholm, Sweden, 16–21 May 2016; IEEE: New York, NY, USA, 2016; pp. 1234–1239. [Google Scholar]
  7. Geng, Z.; Chen, G. A novel real-time grasping method combined with YOLO and GDFCN. In Proceedings of the 2022 IEEE 10th Joint International Information Technology and Artificial Intelligence Conference (ITAIC), Beijing, China, 27–29 October 2022; IEEE: New York, NY, USA, 2022; pp. 500–505. [Google Scholar]
  8. Xiao, X.; Zheng, Y.; Sai, Q.; Fu, D. Lightweight model-based 6D pose estimation of drones. Control Eng. 2025, 24, 1–10. (In Chinese) [Google Scholar] [CrossRef]
  9. Zhang, Y.; Yi, J.; Chen, Y.; Dai, Z.; Han, F.; Cao, S. Pose estimation for workpieces in complex stacking industrial scene based on RGB images. Appl. Intell. 2022, 1, 1–3. [Google Scholar] [CrossRef]
  10. Guan, Q.; Sheng, Z.; Xue, S. HRPose: Real-time high-resolution 6D pose estimation network using knowledge distillation. Chin. J. Electron. 2023, 32, 189–198. [Google Scholar] [CrossRef]
  11. Wang, G.; Manhardt, F.; Tombari, F.; Ji, X. GDR-Net: Geometry-guided direct regression network for monocular 6D object pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; IEEE: Los Alamitos, CA, USA, 2021; pp. 16611–16621. [Google Scholar]
  12. Xu, W.; Wan, Y. ELA: Efficient local attention for deep convolutional neural networks. arXiv 2024, arXiv:2403.01123. [Google Scholar]
  13. Zhang, H.; Zhang, S. Shape-IoU: More accurate metric considering bounding box shape and scale. arXiv 2023, arXiv:2312.17663. [Google Scholar]
  14. Ding, X.; Zhang, Y.; Ge, Y.; Zhao, S.; Song, L.; Yue, X.; Shan, Y. UniRepLKNet: A universal perception large-kernel convNet for audio, video, point cloud, time-series, and image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; IEEE: Los Alamitos, CA, USA, 2024; pp. 5513–5524. [Google Scholar]
  15. Ding, X.; Zhang, X.; Ma, N.; Han, J.; Ding, G.; Sun, J. RepVGG: Making VGG-style ConvNets great again. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; IEEE: Los Alamitos, CA, USA, 2021; pp. 13733–13742. [Google Scholar]
  16. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML), Lille, France, 6–11 July 2015; PMLR: Cambridge, MA, USA, 2015; pp. 448–456. [Google Scholar]
  17. Ding, X.; Zhang, X.; Han, J.; Ding, G. Scaling up your kernels to 31×31: Revisiting large kernel design in CNNs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; IEEE: Los Alamitos, CA, USA, 2022; pp. 11963–11975. [Google Scholar]
  18. Draelos, R.L.; Carin, L. Use HiResCAM instead of Grad-CAM for faithful explanations of convolutional neural networks. arXiv 2020, arXiv:2011.08891. [Google Scholar]
  19. Xu, X.; Jiang, Y.; Chen, W.; Huang, Y.; Zhang, Y.; Sun, X. Damo-YOLO: A report on real-time object detection design. arXiv 2022, arXiv:2211.15444. [Google Scholar]
  20. Liu, W.; Lu, H.; Fu, H.; Cao, Z. Learning to upsample by learning to sample. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–7 October 2023; IEEE: Los Alamitos, CA, USA, 2023; pp. 6027–6037. [Google Scholar]
  21. Baranchuk, D.; Babenko, A.; Malkov, Y. Revisiting the inverted indices for billion-scale approximate nearest neighbors. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; Springer: Cham, Switzerland, 2018; pp. 202–216. [Google Scholar]
Figure 1. Overall system framework of object detection and grasping.
Figure 2. Structure of the YOLOv8-URE network: (a) backbone module; (b) neck module; (c) head module; (d) SPPF module; (e) detection layer.
Figure 3. Architecture of UniRepLKNet: (a) overall structure of UniRepLKNet; (b) lark block module.
Figure 4. Architecture of C2f_UniRepLKNetBlock: (a) overall structure of C2f_UniRepLKNetBlock; (b) UniRepLKNetBlock module.
Figure 5. Heatmap comparison: (a) input image; (b) attention heatmap of the model without ELA; (c) attention heatmap of the model with ELA. Color legend: Red denotes high attention, yellow denotes medium attention, and blue denotes low attention.
Figure 6. Architecture of RepGFPN: (a) overall structure of RepGFPN; (b) CSPStage module; (c) simplified RepConv module.
Figure 7. Architecture of DySample.
Figure 8. Comparison of point clouds before and after filtering: (a) original point cloud of the desk; (b) filtered point cloud of the desk; (c) original point cloud of the workpiece; (d) filtered point cloud of the workpiece.
Figure 9. Comparison of scene point cloud before and after segmentation: (a) original point cloud; (b) point cloud after SAC segmentation.
Figure 10. Binary tree diagram.
Figure 11. Target point cloud segmentation: (a) RGB target detection result; (b) segmented target point cloud (the green part indicates the segmented target).
Figure 12. Sample consensus alignment results: (a) front view of the sample point cloud; (b) side view of the sample point cloud; (c) front view of the workpiece point cloud; (d) side view of the workpiece point cloud.
Figure 13. Comparison of ICP, GICP, and SAC + NDT registration results: (a) ICP registration front view; (b) ICP registration side view; (c) GICP registration front view; (d) GICP registration side view; (e) SAC + NDT registration front view; (f) SAC + NDT registration side view.
Figure 14. YOLOv8-URE and YOLOv8n detection results in the original scene: (a) YOLOv8-URE detection results in the original scene; (b) YOLOv8n detection results in the original scene. Note: red bounding boxes are generated by the algorithm; green boxes indicate missed detections; yellow boxes indicate false detections.
Figure 15. Comparison of detection results in blurred scenes between YOLOv8-URE and YOLOv8n: (a) detection results of YOLOv8-URE in blurred scenes; (b) detection results of YOLOv8n in blurred scenes. Note: red bounding boxes are generated by the algorithm, green boxes indicate missed detections, and yellow boxes indicate false detections.
Figure 16. Grasping experimental environment: 1. STEP-700E; 2. pneumatic gripper; 3. Tuyang depth camera module; 4. workpiece to be grasped.
Figure 17. Grasping process diagram: (a) original point cloud; (b) YOLOv8-URE detection result; (c) illustration of target point cloud segmentation; (d) illustration of pose estimation for the registered target point cloud; (e) illustration of the grasping pose. The green part represents the segmented target point cloud.
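Steps (d) and (e) of Figure 17 turn the registered 4 × 4 transformation into a grasp pose for the robot controller. A small, generic sketch of that conversion is given below; the Euler convention, hand-eye transform, and tool offset are assumptions and would need to match the actual robot and gripper configuration.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def transform_to_grasp_pose(T_cam_obj, T_base_cam, z_offset=0.10):
    """Convert an estimated object pose (camera frame) into a grasp target
    in the robot base frame, approaching along the object z-axis.

    T_cam_obj  : 4x4 object pose from point cloud registration
    T_base_cam : 4x4 hand-eye calibration result (assumed available)
    z_offset   : approach offset along the object z-axis in metres (assumed)
    """
    T_base_obj = T_base_cam @ T_cam_obj
    # Offset the grasp point along the object's z-axis for the approach.
    offset = np.eye(4)
    offset[2, 3] = -z_offset
    T_grasp = T_base_obj @ offset
    xyz = T_grasp[:3, 3]
    rpy = R.from_matrix(T_grasp[:3, :3]).as_euler("xyz", degrees=True)
    return xyz, rpy

# Example with identity placeholders:
# xyz, rpy = transform_to_grasp_pose(np.eye(4), np.eye(4))
```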
Figure 18. Workpiece placement pose for grasping: (a) Pose①; (b) Pose②; (c) Pose③; (d) Pose④.
Table 1. Comparison of point cloud size and loading time before and after filtering.

| Comparison Parameter | Desk Point Cloud | Workpiece Point Cloud |
|---|---|---|
| Number of Original Points | 460,400 | 7197 |
| Number of Filtered Points | 41,049 | 2126 |
| Original Point Cloud Load Time (s) | 0.562537 | 0.0231155 |
| Filtered Point Cloud Load Time (s) | 0.052796 | 0.0105605 |
Table 2. Runtime comparison before and after voxel filtering.

| Method | Original Points | Points After Filtering | Filtering Time | Total Time |
|---|---|---|---|---|
| No Filtering | 626,910 | – | – | 6152 ms |
| After Filtering | 626,910 | 233,783 | 1681 ms | 3677 ms |
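Tables 1 and 2 quantify the effect of voxel down-sampling on cloud size and runtime. The snippet below is a minimal illustration of that preprocessing step with Open3D; the leaf size and file names are placeholders rather than the values used in the experiments.

```python
import time
import open3d as o3d

pcd = o3d.io.read_point_cloud("desk.pcd")            # placeholder file
print("points before filtering:", len(pcd.points))

t0 = time.perf_counter()
filtered = pcd.voxel_down_sample(voxel_size=0.005)   # 5 mm leaf size (assumed)
t1 = time.perf_counter()

print("points after filtering:", len(filtered.points))
print(f"filtering time: {(t1 - t0) * 1000:.1f} ms")
o3d.io.write_point_cloud("desk_filtered.pcd", filtered)
```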
Table 3. Comparison of radius neighborhood search time.

| Number of Points | Search Radius (m) | KD-Tree Radius Search Time (ms) | Linear Search Time (ms) |
|---|---|---|---|
| 1000 | 0.02 | 74 | 47 |
| 10,000 | 0.02 | 184 | 501 |
| 50,000 | 0.02 | 774 | 2387 |
| 100,000 | 0.02 | 1752 | 5141 |
| 500,000 | 0.02 | 7383 | 19,378 |
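Table 3 contrasts KD-tree radius search with brute-force linear search (see also the binary-tree structure in Figure 10). A small benchmark sketch follows; it uses Open3D's KD-tree and a NumPy brute-force pass, with a random synthetic cloud standing in for the real data, so its timings are only illustrative.

```python
import time
import numpy as np
import open3d as o3d

n, radius = 100_000, 0.02
pts = np.random.rand(n, 3).astype(np.float64)        # synthetic stand-in cloud
pcd = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(pts))
query = pts[0]

# KD-tree radius search (tree build included in the timing here).
t0 = time.perf_counter()
tree = o3d.geometry.KDTreeFlann(pcd)
k, idx, _ = tree.search_radius_vector_3d(query, radius)
t_kd = (time.perf_counter() - t0) * 1000

# Brute-force linear search over all points.
t0 = time.perf_counter()
mask = np.linalg.norm(pts - query, axis=1) <= radius
t_lin = (time.perf_counter() - t0) * 1000

print(f"KD-tree: {k} neighbours in {t_kd:.2f} ms")
print(f"Linear : {int(mask.sum())} neighbours in {t_lin:.2f} ms")
```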
Table 4. Network training experiment parameters.

| Training Parameter | Value |
|---|---|
| Input Image Size | 640 |
| Number of Epochs | 100 |
| Batch Size | 16 |
| Optimizer | SGD |
| Optimizer Momentum | 0.937 |
| Optimizer Weight Decay Factor | 0.0005 |
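Table 4 lists the hyperparameters used for network training. For reference, they map onto the Ultralytics training interface roughly as sketched below; this configures a baseline YOLOv8n run, does not reproduce the YOLOv8-URE structural modifications, and the dataset path is a placeholder.

```python
from ultralytics import YOLO

# Baseline YOLOv8n; the YOLOv8-URE module changes are not reproduced here.
model = YOLO("yolov8n.pt")

model.train(
    data="workpieces.yaml",   # placeholder dataset config
    imgsz=640,                # input image size
    epochs=100,
    batch=16,
    optimizer="SGD",
    momentum=0.937,
    weight_decay=0.0005,
)
```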
Table 5. Results of ablation experiments.

| Algorithm | Uni | ELA | Improved Neck | ShapeIoU | P (%) | R (%) | mAP@0.5 (%) | Parameters (M) | GFLOPs | FPS |
|---|---|---|---|---|---|---|---|---|---|---|
| Base | × | × | × | × | 96.5 | 90.3 | 96.5 | 3.15 | 8.9 | 614 |
| Base-1 | ✓ | × | × | × | 94.8 | 92.0 | 96.5 | 2.16 | 5.9 | 660 |
| Base-2 | × | ✓ | × | × | 96.8 | 94.0 | 95.9 | 3.07 | 8.2 | 640 |
| Base-3 | × | × | ✓ | × | 95.8 | 95.0 | 97.3 | 3.26 | 8.2 | 605 |
| Base-4 | × | × | × | ✓ | 94.9 | 94.2 | 96.3 | 3.0 | 8.0 | 650 |
| Base-5 | ✓ | ✓ | × | × | 91.5 | 94.8 | 96.7 | 2.09 | 5.9 | 662 |
| Base-6 | ✓ | ✓ | ✓ | × | 95.9 | 95.0 | 96.7 | 2.28 | 6.1 | 632 |
| Base-7 | ✓ | ✓ | ✓ | ✓ | 97.6 | 95.4 | 98.3 | 2.29 | 6.1 | 639 |
The best results for each metric are highlighted in bold; FPS was evaluated with a batch size of 16.
Table 6. Comparison of original scene detection.

| Number of Targets | Missed Detections | False Detections | Correct Detections | Detection Accuracy |
|---|---|---|---|---|
| 15 | 0 ± 0/0 ± 0 | 0 ± 0/0 ± 0 | 15 ± 0/15 ± 0 | 100% ± 0.00%/100% ± 0.00% |
| 20 | 0 ± 0/0.4 ± 0.49 | 0 ± 0/0.2 ± 0.4 | 20 ± 0/19.4 ± 0.49 | 100% ± 0.00%/97% ± 2.50% |
| 25 | 0.2 ± 0.4/3.2 ± 0.75 | 0.4 ± 0.49/0.6 ± 0.49 | 24.4 ± 0.49/21.2 ± 0.75 | 97.6% ± 1.96%/84.8% ± 2.99% |
| 30 | 2 ± 0.63/4 ± 0.89 | 0.8 ± 0.75/1.8 ± 0.4 | 27.2 ± 0.75/24.2 ± 0.75 | 90.67% ± 2.49%/80.67% ± 2.49% |
| 35 | 5.4 ± 0.49/8 ± 0.63 | 2.4 ± 0.5/3.2 ± 0.75 | 27.2 ± 0.8/23.8 ± 0.98 | 77.71% ± 2.14%/68% ± 2.8% |
| 40 | 8 ± 1.17/11.8 ± 0.75 | 3.8 ± 0.75/4.4 ± 0.8 | 27.4 ± 1.02/23.8 ± 0.4 | 68.5% ± 2.25%/59.5% ± 1.00% |
All values are presented in the format of YOLOv8-URE/YOLOv8n; all results are reported as means ± standard deviation over five independent experiments.
Table 7. Comparison of blurred scene detection.

| Number of Targets | Missed Detections | False Detections | Correct Detections | Detection Accuracy |
|---|---|---|---|---|
| 15 | 0 ± 0/0 ± 0 | 0 ± 0/2 ± 0.5 | 15 ± 0/15 ± 0 | 100% ± 0.00%/100% ± 0.00% |
| 20 | 0.4 ± 0.49/2.6 ± 0.8 | 0.2 ± 0.4/1.6 ± 0.49 | 19.4 ± 0.49/15.8 ± 0.75 | 97% ± 2.45%/79% ± 3.70% |
| 25 | 1.4 ± 0.8/4 ± 0.89 | 2.4 ± 1.02/2.6 ± 0.49 | 21.2 ± 1.33/18.4 ± 1.02 | 84.8% ± 5.31%/73.6% ± 4.08% |
| 30 | 1.8 ± 0.75/5.2 ± 0.75 | 3.2 ± 0.75/2.8 ± 0.75 | 25 ± 1.1/22 ± 0.63 | 83.33% ± 3.65%/73.33% ± 2.11% |
| 35 | 5.2 ± 0.75/8 ± 0.63 | 3.4 ± 1.02/3.6 ± 0.49 | 26.4 ± 0.6/23.4 ± 0.8 | 75.43% ± 2.91%/66.86% ± 2.29% |
| 40 | 9 ± 0.75/13 ± 0.63 | 3.2 ± 0.75/4.2 ± 0.75 | 27 ± 0.63/22.8 ± 1.17 | 67.50% ± 1.5%/57% ± 2.92% |
All values are presented in the format of YOLOv8-URE/YOLOv8n; all results are reported as means ± standard deviation over five independent experiments.
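Tables 6 and 7 report each quantity as a mean ± standard deviation over five repeated runs, with the two detectors separated by a slash. A short sketch of how such entries are computed is given below, using made-up counts purely for illustration.

```python
import numpy as np

# Correct detections for one scene size over five runs (illustrative numbers).
correct = np.array([24, 25, 24, 25, 24])
targets = 25

acc = correct / targets * 100
print(f"correct detections: {correct.mean():.1f} ± {correct.std():.2f}")
print(f"detection accuracy: {acc.mean():.1f}% ± {acc.std():.2f}%")
```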
Table 8. Comparison of algorithm generalization.

| Algorithm | R (%) | mAP@0.5 (%) | mAP@0.5:0.95 (%) | Size (MB) |
|---|---|---|---|---|
| YOLOv8n | 79.7 ± 0.8 | 88.4 ± 0.6 | 62.3 ± 1.0 | 6.5 |
| YOLOv7-tiny | 80.2 ± 0.7 | 86.9 ± 0.8 | 54.3 ± 1.2 | 11.7 |
| YOLOv4 | 75.2 ± 0.9 | 84.9 ± 0.7 | 51.9 ± 1.3 | 17.76 |
| YOLOv3 | 70.3 ± 1.1 | 82.0 ± 1.0 | 47.6 ± 1.5 | 117.7 |
| YOLOv5s | 74.1 ± 0.6 | 86.3 ± 0.7 | 55.2 ± 1.1 | 13.78 |
| Faster R-CNN | 71.3 ± 0.8 | 87.2 ± 0.6 | 56.9 ± 1.3 | 108 |
| SSD | 64.8 ± 1.2 | 75.9 ± 1.5 | 47.8 ± 1.4 | 92.6 |
| YOLOv8-URE | 80.4 ± 0.5 | 88.3 ± 0.6 | 62.4 ± 0.8 | 4.46 |
The best results for each metric are highlighted in bold; all results are reported as means ± standard deviation over five independent experiments.
Table 9. Algorithm comparison results.

| Metric | Control Algorithm | Proposed Algorithm |
|---|---|---|
| Running Time (s) | 10.5949 | 4.18396 |
| RMSE | 0.317391 | 0.052965 |
| MAE | 0.267475 | 0.051005 |
The best results for each metric are highlighted in bold.
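The RMSE and MAE in Table 9 characterize the residual point-to-point error after registration. A minimal way to compute such metrics, using nearest-neighbour distances between the transformed sample cloud and the target cloud, is sketched below; SciPy's cKDTree is used here for brevity and is an assumption, not the paper's implementation.

```python
import numpy as np
from scipy.spatial import cKDTree

def registration_errors(source_pts, target_pts, T):
    """RMSE / MAE of nearest-neighbour residuals after applying pose T."""
    src_h = np.hstack([source_pts, np.ones((len(source_pts), 1))])
    aligned = (T @ src_h.T).T[:, :3]
    dists, _ = cKDTree(target_pts).query(aligned)
    rmse = float(np.sqrt(np.mean(dists ** 2)))
    mae = float(np.mean(np.abs(dists)))
    return rmse, mae

# Example with random stand-in data and an identity pose:
# rmse, mae = registration_errors(np.random.rand(500, 3),
#                                 np.random.rand(500, 3), np.eye(4))
```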
Table 10. Results of the grasping experiment.

| Object Pose | Grasp Attempts | Failures | Success Rate |
|---|---|---|---|
| Pose① | 25 | 3 | 88% |
| Pose② | 25 | 4 | 84% |
| Pose③ | 25 | 2 | 92% |
| Pose④ | 25 | 5 | 80% |