Optimization-Oriented Vision-Guided Robotic Grasping for Bolt Handling in Intelligent Manufacturing

Fu, Pengzhan; Zhang, Zhenlin; Liu, Long; Xi, Yingze; Zhao, Xingwei; Wang, Xuan

doi:10.3390/math14122133

by Pengzhan Fu ¹, Zhenlin Zhang ², Long Liu ², Yingze Xi ², Xingwei Zhao ¹ and Xuan Wang ¹,*

¹ School of Mechanical Science and Engineering, Huazhong University of Science and Technology, Wuhan 430074, China

² Dongfeng Motor Corporation Research and Development Institute, Wuhan 430108, China

* Author to whom correspondence should be addressed.

Mathematics 2026, 14(12), 2133; https://doi.org/10.3390/math14122133 (registering DOI)

Submission received: 29 April 2026 / Revised: 29 May 2026 / Accepted: 8 June 2026 / Published: 15 June 2026

(This article belongs to the Special Issue Intelligent Systems and Dynamic Scheduling: Optimization and Management, 2nd Edition)

Abstract

Accurate detection and reliable grasping of small bolts are essential for intelligent manufacturing and automated assembly. However, this remains a challenge due to the small size, slender geometry, and metallic reflective surfaces of bolts. In this paper, we propose a vision-guided robotic bolt handling framework that integrates lightweight object detection, optimization-oriented grasp execution, and collision-aware trajectory planning. The lightweight YOLOv8n-BoltLite detector, improved with E-C2f, LCA, SA-PAN, and WD-IoU loss, enhances localization accuracy and feature representation for small and slender bolts. A robotic grasping framework is designed to transform detection results into executable robotic actions through 3D pose estimation, mid-shank grasp point generation, and optimization-oriented execution formulation. Additionally, a five-segment trajectory planning strategy ensures safe and efficient robot motion. Experimental results show that YOLOv8n-BoltLite achieves a five-run average mAP of 99.64 ± 0.05% with 198 FPS, and 3.02 M parameters. On an additional challenging external test set involving illumination variation, clutter, partial occlusion, reflection, and clustered bolts, the proposed detector achieves 94.62 ± 0.18%, outperforming recent lightweight detectors under the same training protocol. Robotic experiments involving 1000 controlled grasping trials and 300 multi-target grasping attempts demonstrate a controlled-condition success rate of 97.0% and improved target-selection reliability in multi-bolt scenes. These results suggest that the proposed framework offers a practical and efficient solution for automated bolt handling in intelligent manufacturing environments.

Keywords: machine vision; robotic grasping; optimization-oriented execution; trajectory planning; intelligent manufacturing; small-object detection

MSC: 68T45

1. Introduction

The automated handling of small bolts is a crucial task in intelligent manufacturing, where it is often required for assembly and disassembly processes. However, small object size, irregular shapes, weak textures, and reflective surfaces complicate both detection and grasping. Bolt sorting in industrial production currently relies mainly on three methods: manual visual sorting, mechanical vibration screening, and simple photoelectric sensor detection. None of these methods meets the needs of multi-variety, small-batch, flexible intelligent manufacturing. Therefore, accurate identification and rapid grasping of bolts with different specifications are crucial for improving production efficiency and ensuring assembly quality.

Object detection based on deep learning has become increasingly mature. As typical single-stage algorithms, the YOLO detectors offer fast speed, high accuracy, and flexible deployment, and have been widely applied in industrial online detection scenarios. Because the appearance differences between bolts of different specifications are small in the identification task [1], the algorithm needs to distinguish fine-grained and dimensional features; in addition, the postures of scattered or stacked bolts are random, so the algorithm must be robust to posture changes.

In this paper, we address these challenges by proposing an integrated vision-guided robotic handling framework. This framework includes not only a lightweight detection model, YOLOv8n-BoltLite, enhanced with the E-C2f, LCA, and SA-PAN modules and the WD-IoU loss, but also an optimization-oriented grasp execution formulation that accounts for detection reliability, localization uncertainty, reachability, motion distance, and collision risk. As illustrated in Figure 1, to simulate practical application scenarios involving bolts with varying diameters and lengths, experiments were conducted on five representative categories of bolts. These categories cover a range of geometric scales and aspect ratios commonly encountered in industrial assembly tasks. After bolt identification, a robotic arm performs the grasp, realizing a fully automated “identification-positioning-grasping-placement” process that improves the production efficiency of industrial production lines and reduces labor costs.

Thus, the research significance of this paper is reflected in the following aspects: (1) A lightweight bolt detection model, YOLOv8n-BoltLite, is developed by integrating E-C2f, LCA, SA-PAN, and WD-IoU loss to improve the detection accuracy and localization reliability of small and slender bolt objects. (2) A complete perception-to-grasping pipeline is constructed, including depth-based 3D pose recovery, hand–eye transformation, mid-shank grasp point generation, and hybrid trajectory planning. (3) An optimization-oriented target-selection strategy is introduced for multi-bolt scenes. Unlike a purely confidence-based or nearest-first strategy, the proposed formulation jointly considers detection reliability, localization uncertainty, motion distance, execution time, local interference risk, and feasibility constraints.

2. Related Works

2.1. Object Recognition

Object detection is a core task in computer vision, aiming to jointly perform object classification and localization by predicting labels and corresponding bounding boxes [2].

Traditional methods do not rely on automatic feature learning; instead, they typically follow three separate stages: hand-crafted feature extraction, sliding-window-based region proposal generation, and classification using machine learning models [3]. These approaches depend heavily on prior knowledge for feature design, which restricts their applicability to objects with fixed shapes, simple backgrounds, and limited pose variations, resulting in poor generalization and robustness. Dalal et al. [4] proposed the HOG feature, which represents object contours by capturing local gradient distributions. Combined with an SVM classifier, it achieved notable performance in pedestrian detection. However, HOG remains sensitive to object rotation, deformation, and background clutter. Local feature-based methods, such as SIFT [5] and SURF [6], construct scale-space representations to extract invariant keypoint descriptors, enabling robustness to scale and rotation. Nevertheless, their high computational cost limits real-time applicability in high-speed industrial inspection scenarios.

Today, deep-learning-driven object detection methods [2] have significantly improved accuracy and efficiency. Two-stage detectors divide the process into region proposal generation and region-wise classification with bounding box regression. R-CNN [7] first introduced CNNs into detection by using selective search for proposals, AlexNet [8] for feature extraction, and separate classifiers and regressors, but its multi-stage design leads to redundant computation and low efficiency. Fast R-CNN [9] improves this by introducing RoI Pooling and sharing feature maps, enabling end-to-end training with better speed and accuracy. Faster R-CNN [10] further replaces selective search with a region proposal network (RPN), which shares convolutional features with the backbone and reduces computational cost while supporting multi-scale detection. Despite these improvements, two-stage methods still suffer from complex architectures, high computational cost, and large model sizes, limiting deployment on industrial embedded systems.

Single-stage methods eliminate the region proposal stage and directly formulate detection as an end-to-end regression problem, jointly predicting bounding box coordinates and class probabilities. Redmon et al. [11] proposed YOLO, the first single-stage end-to-end detector, which divides the input image into grid cells and predicts bounding boxes, confidence scores, and class probabilities in a unified framework, achieving real-time performance. YOLOv3 [12] further improves the framework by adopting the Darknet-53 backbone and incorporating a feature pyramid network (FPN) [13] for multi-scale feature fusion, thereby enhancing the detection of small and multi-scale objects. Lin et al. [14] addressed the issue of class imbalance between positive and negative samples in single-stage detectors by proposing the Focal Loss. This loss down-weights easy negatives and focuses training on hard examples, which effectively improves detection performance and enables single-stage methods to achieve accuracy comparable to two-stage detectors.

2.2. YOLOs

The YOLO series, first proposed by Redmon et al. in 2016, represents a class of single-stage object detection algorithms. Due to its end-to-end training framework, simple architecture, real-time performance, and ease of deployment, YOLO has been widely applied in industrial vision, robotic perception, and intelligent assembly.

YOLOv1 pioneered single-stage detection by formulating object detection as a regression problem, significantly simplifying traditional multi-stage pipelines. YOLOv2 [15] introduced anchor boxes and batch normalization, improving localization accuracy and training stability, but its ability to capture fine-grained features remained limited. YOLOv3 uses Darknet-53 as its backbone network. By incorporating residual connections and the FPN, it enhances feature extraction and fuses high-resolution features with semantically deep features, thereby significantly improving detection performance on small objects. YOLOv4 [16] incorporated CSPDarknet53, mosaic data augmentation, and the CIoU loss, improving feature representation and reducing redundancy, although its model size limits deployment on resource-constrained devices. YOLOv5 introduced model scaling, adaptive anchor clustering, and lightweight modules such as the Focus layer and C3 blocks, achieving a balance between efficiency and accuracy, making it widely used in industrial applications. Subsequent versions, including YOLOv6, YOLOv7 [17], and YOLOv8, further focus on efficient architectures, re-parameterization, and anchor-free detection. Among them, YOLOv8 improves the PANet-based feature fusion with a more lightweight design, enhancing multi-scale feature aggregation and small-object representation. It also adopts a dynamic label assignment strategy based on classification confidence and IoU, eliminating anchor boxes and improving localization accuracy. These advancements make YOLOv8 particularly suitable for industrial small-object detection tasks. More recently, YOLOv9 introduces programmable gradient information and GELAN to improve feature learning efficiency. YOLOv10 proposes consistent dual assignments and an NMS-free end-to-end detection pipeline to reduce inference latency. RT-DETR provides a real-time transformer-based detector by combining an efficient hybrid encoder and uncertainty-minimal query selection. Gold-YOLO improves multi-scale feature fusion through a gather-and-distribute mechanism.

2.3. Vision-Based Robotic Grasping Strategies

Traditional robotic grasping methods [18] determine grasp poses based on known object geometry and mechanical analysis, typically based on force closure and form closure principles. For objects with available CAD models, grasp planning generally involves: (1) sampling candidate contact point pairs on the object surface; (2) evaluating their force-closure quality; and (3) selecting the optimal pair as the final grasp configuration. These methods can achieve stable and reliable grasping. For bolts, the cylindrical shank provides a suitable grasping region. A parallel-jaw gripper can stably grasp the bolt by symmetrically clamping the midsection along a direction perpendicular to its axis. Given the regular geometry and well-defined grasping strategy of the target bolts in this study, a geometry-based method is adopted. This approach avoids reliance on complex grasp pose estimation networks [19] and is more consistent with the requirements of industrial deployment.

In recent years, deep learning-based robotic grasping [20,21] has made remarkable progress. Early approaches [22] predicted five-dimensional grasp parameters from RGB-D images, while later single-stage regression methods [20] enabled real-time grasp detection from full images. Models such as GG-CNN [23] achieve real-time closed-loop grasping with improved robustness. In the area of 6-DoF grasp pose estimation, 6-DoF GraspNet directly predicts six-degree-of-freedom grasp poses from point cloud data [24], which makes it effective for handling object stacking, partial occlusion, and other conditions that complicate grasp identification. Contact-GraspNet [25] further improves grasp generation by learning the distribution of feasible contact points and producing high-quality grasp candidates accordingly. Although these methods perform well for unknown objects and unstructured environments, they usually involve high computational cost in both training and inference, greatly limiting efficiency and practicality.

In this study, the target bolts are characterized by regular geometry, limited and enumerable categories, and a known operating plane. Under these conditions, a grasping strategy based on visual detection and localization followed by geometric rule-based grasp pose determination is more suitable than a deep learning-based approach.

3. Methods

3.1. Architecture of Improved YOLO Algorithm for Bolt Recognition

3.1.1. Overall Architecture

The overall architecture of the proposed system is designed to achieve accurate bolt recognition and provide reliable positional information for subsequent robotic grasping. As illustrated in Figure 2, the framework follows a standard detection pipeline consisting of an input layer, a backbone for hierarchical feature extraction, a neck for multi-scale feature aggregation, and a detection head for category prediction and bounding box regression. Given an input image, the backbone first extracts multi-level visual features. A Lightweight Channel Attention (LCA) module is embedded into the feature extraction stage to recalibrate channel-wise responses and enhance structurally informative features while suppressing redundant background and noise-related activations. This design improves feature representation quality without introducing significant computational complexity.

The enhanced features are then delivered to the neck network for cross-scale fusion. In conventional feature pyramid aggregation, direct bidirectional fusion may introduce redundant information exchange. Therefore, a Scale-Adaptive Path Aggregation Network (SA-PAN) is adopted to perform more efficient multi-scale fusion. By assigning learnable importance to feature maps of different resolutions, SA-PAN enables the network to selectively emphasize scale levels that are more relevant to bolt instances.

Based on the fused feature maps, the detection head performs bolt category classification and bounding box regression in an anchor-free manner. Since the downstream grasping task requires highly accurate center localization, the detection branch is further optimized through scale-aware dynamic label assignment and the proposed weighted distance IoU loss (WD-IoU Loss). Instead of introducing predefined anchor boxes, the scale-aware assignment strategy selects positive samples according to classification confidence, spatial consistency, and IoU-related localization quality, which is consistent with the anchor-free detection paradigm of YOLOv8. Meanwhile, WD-IoU strengthens the center-distance constraint during bounding box regression, especially for small and slender bolt objects. As a result, the proposed detector can provide more accurate pixel-level localization, which forms the basis for precise robotic grasping in subsequent manipulation.

In summary, the proposed overall architecture retains the efficient one-stage detection paradigm while introducing task-specific enhancements at the feature attention, neck fusion, and localization loss levels. These improvements collectively strengthen the network’s ability to detect bolts with subtle texture variations and strict localization requirements, thus providing stable and accurate visual input for subsequent robotic grasping.

3.1.2. Efficient C2f for Feature Extraction

In this study, the detector is required to recognize five different specifications of bolts. The YOLOv8n backbone adopts the C2f module. Benefiting from the efficient layer aggregation of ELAN, C2f enhances feature reuse and improves gradient flow. Nevertheless, when dealing with bolts, which are typically characterized by small object scale, subtle texture cues, and elongated shapes with high aspect ratios, the standard convolution may suffer from a mismatch between its receptive field and the scale of the target features. Consequently, redundant background information can be incorporated, and task-relevant features such as local contours may not be sufficiently emphasized.

Therefore, an Efficient C2f (E-C2f) module is proposed in this work. While retaining the residual structure and efficient gradient propagation of the original C2f module, it replaces standard convolution with grouped depthwise separable convolution to achieve lightweight feature extraction, enhance fine-texture representation, and suppress redundant background information. Structurally, the E-C2f module comprises four components: a 1 × 1 convolution for channel reduction, grouped depthwise separable convolution, residual branch concatenation, and a final 1 × 1 convolution for channel expansion. In this process, the conventional convolution operation is factorized into depthwise convolution and pointwise convolution. Owing to this decomposition, E-C2f significantly reduces model complexity, while the depthwise convolution helps preserve fine-grained bolt features, including thread details and slender boundary structures. The retained shortcut connection further ensures stable gradient backpropagation and mitigates the optimization difficulties that may be introduced by lightweight redesign. As a result, E-C2f achieves a better balance between efficiency and representational power, which is especially beneficial for accurate detection of small bolt targets in robotic grasping scenarios.
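As a rough illustration of why this factorization reduces model complexity, the weight count of a standard k × k convolution can be compared with that of a depthwise convolution followed by a pointwise convolution. The sketch below covers plain (ungrouped) depthwise separable convolution only; the channel counts and kernel size are hypothetical values chosen for illustration, not the E-C2f configuration, and biases and normalization layers are ignored.

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights of a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Weights of a depthwise k x k conv plus a 1 x 1 pointwise conv (bias ignored)."""
    depthwise = c_in * k * k   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer size, chosen only for illustration.
c_in, c_out, k = 64, 64, 3
std = standard_conv_params(c_in, c_out, k)
sep = depthwise_separable_params(c_in, c_out, k)
ratio = sep / std  # equals 1 / c_out + 1 / k**2
```

For this hypothetical layer the factorized form needs roughly one eighth of the weights, and the ratio approaches 1/k² as the number of output channels grows.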

3.1.3. Lightweight Channel Attention and SA-PAN Scale Adaptation

Although the background in this study is relatively simple, weak metallic reflections and subtle thread textures may still result in weak responses in some channels. Conventional attention mechanisms usually introduce additional learnable parameters and computational overhead. To overcome this issue, a Lightweight Channel Attention (LCA) module is further introduced.

The proposed LCA learns channel-wise importance through global average pooling followed by sigmoid activation, without using fully connected layers or other heavy operations. For an intermediate feature map $F \in \mathbb{R}^{C \times H \times W}$, the channel attention weights are computed as

$s_{c} = \frac{1}{H \times W} \sum_{i = 1}^{H} \sum_{j = 1}^{W} F_{c} (i, j), \quad c = 1, 2, \dots, C$

(1)

and the reweighted feature map is obtained by

$F_{c}^{\prime} (i, j) = w_{c} \cdot F_{c} (i, j) = σ (s_{c}) \cdot F_{c} (i, j)$

(2)

where $s_{c}$ denotes the global descriptor of the c-th channel, $σ (\cdot)$ is the sigmoid function, and $w_{c}$ is the corresponding channel weight.

Through this lightweight reweighting mechanism, informative channels related to threads, contours, and structural boundaries are strengthened, while low-response or noisy channels are suppressed. Consequently, the network can generate more discriminative feature representations for bolt recognition with negligible additional computational cost.
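The reweighting in Eqs. (1) and (2) can be sketched in a few lines. This is a minimal pure-Python illustration on nested lists, assuming a feature map of shape [C][H][W]; the function name and the toy values are hypothetical, and a real implementation would operate on framework tensors.

```python
import math


def lightweight_channel_attention(feature):
    """Apply Eqs. (1)-(2) to a feature map given as nested lists of shape [C][H][W]."""
    reweighted = []
    for channel in feature:
        h, w = len(channel), len(channel[0])
        s_c = sum(sum(row) for row in channel) / (h * w)  # global average pooling, Eq. (1)
        w_c = 1.0 / (1.0 + math.exp(-s_c))                # sigmoid gate
        reweighted.append([[w_c * v for v in row] for row in channel])  # Eq. (2)
    return reweighted


# Toy 2-channel, 2 x 2 feature map (made-up values).
F = [
    [[0.0, 0.0], [0.0, 0.0]],  # mean 0 -> weight sigmoid(0) = 0.5
    [[2.0, 2.0], [2.0, 2.0]],  # mean 2 -> weight sigmoid(2)
]
F_prime = lightweight_channel_attention(F)
```

Because the gate has no learnable parameters, the only added cost is one average and one sigmoid per channel.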

Next, this article proposes the SA-PAN (Scale Adaptive PAN) feature pyramid as the neck network. The backbone outputs three feature levels $F_{P3} \in \mathbb{R}^{C \times 80 \times 80}$, $F_{P4} \in \mathbb{R}^{2C \times 40 \times 40}$, and $F_{P5} \in \mathbb{R}^{4C \times 20 \times 20}$, corresponding to different spatial resolutions. Instead of treating all fusion inputs equally, SA-PAN introduces learnable scale-adaptive coefficients to modulate the contribution of each branch. By performing weighted fusion based on adaptive weights, the final fused features are obtained as

$F_{fusion} = α \cdot U (F_{P4}) + β \cdot U (F_{P3}) + γ \cdot D (F_{P5})$

(3)

To ensure stable optimization, the coefficients can be normalized as

$\begin{matrix} α = \frac{e^{δ_{4}}}{\sum_{i = 3}^{5} e^{δ_{i}}}, \quad β = \frac{e^{δ_{3}}}{\sum_{i = 3}^{5} e^{δ_{i}}}, \quad γ = \frac{e^{δ_{5}}}{\sum_{i = 3}^{5} e^{δ_{i}}}, \\ α + β + γ = 1, \quad α ≫ β, γ \end{matrix}$

(4)

In this way, SA-PAN preserves the bidirectional aggregation advantage of PANet while enabling the network to adaptively emphasize the most relevant scale information and suppress redundant cross-scale transmission. This design improves feature utilization efficiency and enhances the detection of small fasteners with fine structural details.
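The normalization in Eq. (4) is a softmax over three learnable logits, and Eq. (3) is then a weighted sum of the resampled branches. The sketch below is a minimal illustration under the assumption that the three branches have already been resampled and channel-matched to a common shape (flattened here to plain lists); the function names and numeric values are hypothetical.

```python
import math


def scale_adaptive_weights(delta3, delta4, delta5):
    """Softmax normalization of Eq. (4); returns (alpha, beta, gamma)."""
    m = max(delta3, delta4, delta5)  # subtract the maximum for numerical stability
    e3, e4, e5 = (math.exp(d - m) for d in (delta3, delta4, delta5))
    total = e3 + e4 + e5
    # alpha belongs to the P4 branch, beta to P3, gamma to P5.
    return e4 / total, e3 / total, e5 / total


def fuse(f_p4, f_p3, f_p5, alpha, beta, gamma):
    """Weighted sum of Eq. (3) for branches already brought to a common shape."""
    return [alpha * a + beta * b + gamma * c for a, b, c in zip(f_p4, f_p3, f_p5)]


# Equal logits give equal weights of 1/3 (made-up inputs).
alpha, beta, gamma = scale_adaptive_weights(0.0, 0.0, 0.0)
fused = fuse([3.0, 6.0], [0.0, 0.0], [0.0, 3.0], alpha, beta, gamma)

# A larger logit on the P4 branch makes alpha dominate.
a2, b2, g2 = scale_adaptive_weights(0.0, 2.0, 0.0)
```

The softmax keeps the three coefficients positive and summing to one, so the fused feature stays on the same scale as its inputs regardless of how the logits drift during training.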

3.1.4. Anchor-Free Detection Head and WD-IoU Loss

The detection head performs object classification and bounding box regression on the fused multi-scale features, and its optimization objective directly affects localization accuracy and convergence stability. In this work, the original anchor-free detection paradigm of YOLOv8 is retained. This design avoids the dependence on manually predefined anchor boxes and is more suitable for deployment scenarios involving bolts with different scales and aspect ratios.

Since the five categories of bolts considered in this study are mostly small, slender, and visually similar, accurate center localization is particularly important for the subsequent robotic grasping task. Therefore, the detection head is optimized from two aspects: scale-aware dynamic sample assignment and regression loss design. During training, positive samples are assigned according to the joint quality of classification confidence and localization consistency, rather than by matching predefined anchor boxes. In this way, the model can adaptively select more reliable training samples for bolt instances of different sizes while maintaining the anchor-free structure.

To further reduce localization deviation, especially for small-scale and elongated bolt objects, a weighted distance IoU loss (WD-IoU Loss) is introduced. WD-IoU extends DIoU loss by incorporating a scale-aware weighting coefficient into the center-distance penalty. The loss function is defined as

$L_{WD\text{-}IoU} = 1 - IoU + \frac{ω \cdot ρ^{2} (b, b^{gt})}{c^{2}}$

(5)

where $b$ and $b^{gt}$ denote the predicted bounding box and the ground-truth bounding box, respectively; $ρ (b, b^{gt})$ is the Euclidean distance between their center points; $c$ is the diagonal length of the smallest enclosing box; and $ω$ is a scale-aware weighting coefficient. For smaller bolt targets, a larger value of $ω$ is assigned to impose a stronger penalty on center deviation. For larger bolt targets, a moderate value of $ω$ is used to avoid over-penalizing small localization fluctuations. The coefficient $ω$ is defined as

$ω_{i} = 1 + μ \exp \left( - \frac{A_{i}}{\bar{A} + ϵ} \right)$

(6)

where $A_{i}$ denotes the area of the ground-truth bounding box, $\bar{A}$ is the average bounding-box area in the training set, $μ$ controls the strength of scale-aware weighting, and $ϵ$ is a small positive constant. In this design, smaller targets receive larger center-distance penalties, which is consistent with the high localization accuracy required by robotic grasping.

The total training loss is formulated as

$L_{total} = L_{cls} + η \cdot L_{WD\text{-}IoU}$

(7)

where $L_{cls}$ denotes the classification loss and $η$ is the weighting coefficient used to balance classification and localization objectives.

In this study, the localization term is emphasized because the center position of the detected bolt directly affects the subsequent 3D pose recovery and grasp point generation. Through the combination of anchor-free detection, scale-aware sample assignment, and WD-IoU regression optimization, the proposed detection head improves localization accuracy while preserving the lightweight and real-time characteristics of YOLOv8n.
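To make the effect of the scale-aware weight concrete, Eqs. (5) and (6) can be sketched for a single pair of axis-aligned boxes. This is a minimal illustration, not the training implementation: the box format (x1, y1, x2, y2), the default μ = 1.0, the value of ϵ, and the example boxes are all assumptions made here.

```python
import math


def wd_iou_loss(pred, gt, mean_area, mu=1.0, eps=1e-7):
    """WD-IoU of Eqs. (5)-(6) for axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Intersection over union.
    iw = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    ih = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = iw * ih
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    iou = inter / (area_p + area_g - inter + eps)

    # Squared distance rho^2 between the two box centers.
    dx = (pred[0] + pred[2]) / 2 - (gt[0] + gt[2]) / 2
    dy = (pred[1] + pred[3]) / 2 - (gt[1] + gt[3]) / 2
    rho2 = dx ** 2 + dy ** 2

    # Squared diagonal c^2 of the smallest enclosing box.
    cw = max(pred[2], gt[2]) - min(pred[0], gt[0])
    ch = max(pred[3], gt[3]) - min(pred[1], gt[1])
    c2 = cw ** 2 + ch ** 2 + eps

    # Scale-aware weight of Eq. (6): smaller ground-truth area gives larger omega.
    omega = 1.0 + mu * math.exp(-area_g / (mean_area + eps))
    return 1.0 - iou + omega * rho2 / c2


# Same relative shift applied to a small and a large box (made-up values).
loss_perfect = wd_iou_loss((0.0, 0.0, 4.0, 2.0), (0.0, 0.0, 4.0, 2.0), mean_area=100.0)
loss_small = wd_iou_loss((1.0, 0.0, 5.0, 2.0), (0.0, 0.0, 4.0, 2.0), mean_area=100.0)
loss_large = wd_iou_loss((10.0, 0.0, 50.0, 20.0), (0.0, 0.0, 40.0, 20.0), mean_area=100.0)
```

In this example both shifted predictions have the same IoU and the same normalized center distance, so the difference between `loss_small` and `loss_large` comes entirely from ω. The total loss of Eq. (7) would then add this term, scaled by η, to the classification loss.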

3.2. Robotic Arm Grasping

3.2.1. 3D Pose Recovery and Hand–Eye Transformation

As described in the previous subsection, the neural network outputs the two-dimensional detection results of the target bolt in the image plane, including the center pixel coordinates $(u_{c}, v_{c})$, the bounding-box width and height, and the category label. For robotic grasping, the detector output must be further converted into the three-dimensional grasp pose of the target, including its spatial position and the rotation angle around the vertical axis, so that the manipulator can execute a grasp in physical space. In the present work, an Intel RealSense D435i depth camera (Intel Corporation, Santa Clara, CA, USA) is employed to obtain per-pixel depth measurements. This configuration is more suitable for practical manipulation scenarios involving variable plane height, local unevenness, or partial stacking of bolts. After depth acquisition, the depth map is preprocessed by denoising and invalid-value suppression. The depth corresponding to the detected center pixel $(u_{c}, v_{c})$ is denoted as $Z_{c}$ and taken as the depth of the target. Once $Z_{c}$ is available, the pixel coordinates can be back-projected into the camera coordinate system by means of the intrinsic calibration matrix:

$\begin{matrix} X_{c} = (u_{c} - c_{x}) \, Z_{c} / f_{x} \\ Y_{c} = (v_{c} - c_{y}) \, Z_{c} / f_{y} \end{matrix}$

(8)

where $f_{x}$ and $f_{y}$ are the focal lengths in pixels and $(c_{x}, c_{y})$ is the principal point of the camera.

After the 3D position has been determined, the bolt region inside the detection box is segmented from the foreground. The contour of the segmented region is then extracted, and a minimum-area rotated bounding rectangle is fitted to the contour. The rotation angle of this rectangle is taken as the refined orientation of the bolt major axis and is denoted by $ψ$. In this way, the 2D detection result is upgraded to a grasp-oriented pose description composed of the 3D position $(X_{c}, Y_{c}, Z_{c})$ and the in-plane orientation $ψ$.
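The back-projection of Eq. (8) is the standard pinhole relation and can be sketched directly. The intrinsic values below are placeholders rather than calibration results for the camera used in this work, and the depth is assumed to be in metres.

```python
def back_project(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of Eq. (8): pixel (u, v) with depth -> (X_c, Y_c, Z_c)."""
    x_c = (u - cx) * depth / fx
    y_c = (v - cy) * depth / fy
    return x_c, y_c, depth


# Placeholder intrinsics (hypothetical); depth in metres.
fx, fy, cx, cy = 600.0, 600.0, 320.0, 240.0
point = back_project(u=380.0, v=210.0, depth=0.5, fx=fx, fy=fy, cx=cx, cy=cy)
```

A pixel 60 columns to the right of the principal point and 30 rows above it, at a depth of 0.5 m, maps to a point 5 cm to the right and 2.5 cm above the optical axis in the camera frame.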

The proposed system adopts an Eye-in-Hand configuration, in which the camera is rigidly mounted on the robot end-effector. This arrangement allows the vision sensor to move together with the manipulator, enabling both coarse workspace scanning and close-range observation of the target. The relationship between the camera coordinate system {C} and the end-effector coordinate system {E} is described by the constant rigid transformation

$X = {}^{E}T_{C}$

(9)

In this work, a checkerboard-based calibration method is used. The robot is moved to multiple poses, and for each pose $j$, the end-effector pose relative to the robot base, ${}^{B}T_{E, j}$, and the checkerboard pose relative to the camera, ${}^{C}T_{board, j}$, are recorded. For any two poses $i$ and $j$, the relative motions are defined as

$A_{ij} = {({}^{B}T_{E, j})}^{-1} \, {}^{B}T_{E, i}, \quad B_{ij} = {}^{C}T_{board, j} \, {({}^{C}T_{board, i})}^{-1}$

(10)

The hand–eye relationship is then given by the classical equation

$A_{ij} X = X B_{ij}$

(11)

By solving this equation over multiple pose pairs, the hand–eye transformation ${}^{E}T_{C}$ is obtained. Accordingly, the target point in the robot base frame can be computed as

$p_{b} = {}^{B}T_{E} \, {}^{E}T_{C} \, p_{c}$

(12)

where ${}^{B}T_{E}$ is the current end-effector pose from robot kinematics. This transformation establishes the geometric basis for subsequent grasp pose generation and robotic execution.
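The consistency of Eqs. (10) to (12) can be checked numerically on synthetic data. The sketch below does not solve for X; it assumes a known hand–eye transform and a checkerboard fixed in the base frame, generates the measurements a calibration run would record, and verifies that the relative motions satisfy Eq. (11). All poses are made-up numbers and the helper names are hypothetical.

```python
import math


def matmul(a, b):
    """Product of two 4 x 4 homogeneous transforms stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)] for i in range(4)]


def inverse(t):
    """Inverse of a rigid transform: [R p; 0 1]^-1 = [R^T, -R^T p; 0 1]."""
    rt = [[t[j][i] for j in range(3)] for i in range(3)]
    p = [t[i][3] for i in range(3)]
    q = [-sum(rt[i][k] * p[k] for k in range(3)) for i in range(3)]
    return [rt[i] + [q[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def rot_z(theta, x, y, z):
    """Rotation about the z axis by theta combined with translation (x, y, z)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0, x], [s, c, 0.0, y], [0.0, 0.0, 1.0, z], [0.0, 0.0, 0.0, 1.0]]


def rot_x(theta, x, y, z):
    """Rotation about the x axis by theta combined with translation (x, y, z)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[1.0, 0.0, 0.0, x], [0.0, c, -s, y], [0.0, s, c, z], [0.0, 0.0, 0.0, 1.0]]


# Synthetic ground truth (made-up numbers, not calibration results).
X = rot_z(0.3, 0.02, -0.01, 0.08)          # E_T_C: camera in the end-effector frame
T_base_board = rot_x(0.1, 0.5, 0.0, 0.0)   # checkerboard fixed in the base frame
T_BE_i = rot_z(0.2, 0.4, 0.1, 0.3)         # end-effector pose i
T_BE_j = rot_x(-0.4, 0.45, -0.05, 0.35)    # end-effector pose j


def board_in_camera(T_BE):
    """C_T_board = (B_T_E * E_T_C)^-1 * B_T_board for an eye-in-hand setup."""
    return matmul(inverse(matmul(T_BE, X)), T_base_board)


A_ij = matmul(inverse(T_BE_j), T_BE_i)                                    # Eq. (10)
B_ij = matmul(board_in_camera(T_BE_j), inverse(board_in_camera(T_BE_i)))  # Eq. (10)

lhs, rhs = matmul(A_ij, X), matmul(X, B_ij)                               # Eq. (11)
residual = max(abs(lhs[r][c] - rhs[r][c]) for r in range(4) for c in range(4))

# Eq. (12): map a camera-frame point (homogeneous coordinates) into the base frame.
p_c = [0.05, -0.025, 0.5, 1.0]
T_BC = matmul(T_BE_i, X)
p_b = [sum(T_BC[r][k] * p_c[k] for k in range(4)) for r in range(4)]
```

With real measurements the residual is not zero, and X is estimated by a least-squares style solver over many pose pairs. OpenCV's `cv2.calibrateHandEye` is one commonly used option; the text does not state which solver was used here.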

3.2.2. Grasp Point Generation

A bolt is a typical elongated workpiece mainly composed of a head and a shank. One intuitive strategy is to grasp the bolt laterally from both sides of the head. Although this strategy generally provides relatively high tolerance to visual localization errors, it is not ideal for robust manipulation. The bolts used in this study vary considerably across different specifications, which may lead to uneven force distribution at the gripper fingertips. When the bolt is clamped at the head, the shank remains suspended below the contact region. For bolts with a large length-to-diameter ratio, this cantilever-like configuration is prone to oscillation during high-speed robot motion, thereby degrading grasp stability.

Considering the above factors, the proposed system adopts a mid-shank grasping strategy. First, the grasp point is closer to the physical center of mass of the bolt, which reduces the gravitational moment acting on the contact interface after lifting. Second, after being grasped near the middle of the shank, the bolt naturally tends to hang vertically under gravity, thereby forming a self-stabilizing posture during transportation.

For grasp-point generation, the geometric center of the detected bolt region is first calculated and used as the primary grasp point. For the parallel gripper, this point is directly used as the mid-shank grasp point. For the five-finger dexterous hand, however, grasping a horizontally placed bolt using only the centroid may not provide sufficient grasp stability. Therefore, an additional grasp point is generated along the estimated bolt axial direction based on the centroid position and the fitted bolt orientation. This two-point grasping configuration provides a more stable contact arrangement, particularly for horizontally placed bolts.
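The two-point configuration can be sketched as a simple offset along the fitted axis. The text does not specify the spacing between the two points or the direction of the offset, so the `offset` parameter and the choice of the positive axis direction below are assumptions made for illustration only.

```python
import math


def two_point_grasp(centroid, psi, offset):
    """Return the primary grasp point and a second point shifted along the bolt axis.

    centroid: (x, y) of the detected bolt region in the working plane
    psi:      in-plane orientation of the bolt major axis, in radians
    offset:   distance between the two grasp points (hypothetical parameter)
    """
    x, y = centroid
    second = (x + offset * math.cos(psi), y + offset * math.sin(psi))
    return (x, y), second


# Bolt axis aligned with the y direction; points 1 cm apart (made-up values).
primary, secondary = two_point_grasp(centroid=(0.30, 0.10), psi=math.pi / 2, offset=0.01)
```

For the parallel gripper only `primary` would be used, which corresponds to the mid-shank grasp point described above.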

3.2.3. Robot Arm Trajectory Planning

After the optimal joint configurations corresponding to the start pose and target pose are obtained through the above geometric derivations and inverse kinematics, trajectory planning is further performed to generate a sequence of discrete position, velocity, and acceleration commands for each joint, thereby enabling smooth motor motion and stable task execution. To improve grasping efficiency, a complete pick-and-place cycle is divided into five logical motion segments:

Camera pose to pre-grasp pose.
The pre-grasp pose is defined at a safe height directly above the target bolt. The manipulator moves from the image acquisition pose to this pre-grasp pose using quintic polynomial interpolation in joint space, allowing all joints to start and stop smoothly while reaching the target synchronously. Quintic planning ensures continuity of position, velocity, and acceleration, which is beneficial for reducing vibration and mechanical impact.
For a joint variable q(t), the quintic trajectory is expressed as

$q (t) = a_{0} + a_{1} t + a_{2} t^{2} + a_{3} t^{3} + a_{4} t^{4} + a_{5} t^{5}$

(13)

where the coefficients $a_{0}, \dots, a_{5}$ are determined by the boundary conditions of position, velocity, and acceleration at the start and end instants.
Pre-grasp pose to grasp pose.
From the safe height, the end-effector descends vertically to the grasp pose. Since lateral deviation in this phase may cause interference with neighboring bolts, this segment is planned by straight-line interpolation in Cartesian space, ensuring that the end-effector moves strictly along the vertical direction. Let the Cartesian position of the end-effector be $p(t)$. The linear interpolation between two poses $p_{s}$ and $p_{e}$ can be written as

$p(t) = (1 - \lambda(t))\, p_{s} + \lambda(t)\, p_{e}, \quad \lambda(t) \in [0, 1]$

(14)

For this segment, $p_{s}$ and $p_{e}$ differ only in the vertical coordinate, thereby guaranteeing a pure downward motion.
Grasp pose to lifting pose.
After the gripper closes and the bolt is grasped, the end-effector moves vertically upward to a safe lifting height. This segment also adopts Cartesian straight-line interpolation, so as to quickly separate the bolt from surrounding objects while maintaining a predictable and collision-safe motion.
Lifting pose to bin-above pose.
Once the bolt has been lifted to a safe height, it is transported to the region above the corresponding sorting bin. Since this segment mainly involves long-distance transfer in free space, joint-space quintic polynomial planning is used again to improve motion smoothness and execution efficiency.
Bin-above pose to release pose.
Finally, the end-effector descends vertically from the safe transfer height to the release height inside the bin. This segment is planned by Cartesian straight-line interpolation. After reaching the release pose, the gripper opens to complete the placement operation.
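The two interpolation schemes used across the five segments can be sketched as below. The quintic coefficients follow from the six boundary conditions of Equation (13) in closed form, and the straight-line segment follows Equation (14). Reusing a rest-to-rest quintic as the time scaling λ(t) is an assumption made here for illustration; the text does not specify the profile of λ(t). All function names and numerical values are hypothetical.

```python
def quintic_coefficients(q0, qf, T, v0=0.0, vf=0.0, acc0=0.0, accf=0.0):
    """Coefficients a0..a5 of q(t) = sum(a_k * t**k) on [0, T].

    Boundary conditions: position, velocity, and acceleration at t = 0
    (q0, v0, acc0) and at t = T (qf, vf, accf).
    """
    a0 = q0
    a1 = v0
    a2 = acc0 / 2.0
    a3 = (20.0 * (qf - q0) - (8.0 * vf + 12.0 * v0) * T
          - (3.0 * acc0 - accf) * T**2) / (2.0 * T**3)
    a4 = (30.0 * (q0 - qf) + (14.0 * vf + 16.0 * v0) * T
          + (3.0 * acc0 - 2.0 * accf) * T**2) / (2.0 * T**4)
    a5 = (12.0 * (qf - q0) - 6.0 * (vf + v0) * T
          - (acc0 - accf) * T**2) / (2.0 * T**5)
    return [a0, a1, a2, a3, a4, a5]


def evaluate(coeffs, t):
    """Evaluate the polynomial at time t."""
    return sum(a * t**k for k, a in enumerate(coeffs))


def cartesian_line(p_s, p_e, lam):
    """Equation (14): point on the straight line from p_s to p_e."""
    return tuple((1.0 - lam) * s + lam * e for s, e in zip(p_s, p_e))


# Joint-space segment: one joint moving from 0 rad to 1 rad in 2 s, rest to rest.
coeffs = quintic_coefficients(q0=0.0, qf=1.0, T=2.0)

# Cartesian segment: vertical descent, with lambda(t) taken from the same
# quintic profile (assumption). Only the z coordinate differs.
lam_half = evaluate(coeffs, 1.0)
p_mid = cartesian_line((0.30, 0.10, 0.25), (0.30, 0.10, 0.05), lam_half)
```

Because the start and end poses of the vertical segments share the same x and y coordinates, every interpolated point keeps those coordinates and only the height changes.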

The above trajectory planning strategy provides an executable motion for a given target bolt. However, in practical visual grasping scenarios, the robot execution process should also consider whether the detected target is reliable, reachable, and safe to grasp. Therefore, an optimization-oriented grasp execution formulation is introduced to describe the relationship among perception confidence, grasp feasibility, motion efficiency, and collision safety.

3.3. Optimization-Oriented Grasp Execution Formulation

The previous subsections describe how the visual detection results are converted into grasp-oriented robot motion commands. Specifically, the detector provides the category label, bounding box, and center pixel coordinates of a target bolt. The depth camera is then used to recover the corresponding 3D position, and the hand–eye transformation maps the target position from the camera coordinate system to the robot base coordinate system. Based on the recovered position and in-plane orientation, a mid-shank grasp point is generated, followed by hybrid trajectory planning for pick-and-place execution. In this subsection, the above process is further formulated from an optimization-oriented perspective to explicitly describe the feasibility, reliability, and motion-efficiency considerations involved in grasp execution.

Let the set of valid bolt detections in the current image be denoted as

$B = \{1, 2, \dots, N\},$

(15)

where $N$ is the number of detected bolt candidates that satisfy the detection confidence threshold. For each candidate $i \in B$, the visual detector outputs the category label $c_{i}$, the confidence score $s_{i}$, the bounding box $b_{i} = (u_{i}, v_{i}, w_{i}, h_{i})$, and the center pixel coordinate $(u_{i}, v_{i})$. With the depth value $Z_{i}$ obtained, the center point of the bolt can be back-projected into the camera coordinate system. The corresponding position in the robot base coordinate system is obtained by

$p_{i}^{B} = {}^{B}T_{E}\, {}^{E}T_{C}\, p_{i}^{C},$

(16)

where $p_{i}^{C} = [X_{i}^{C}, Y_{i}^{C}, Z_{i}^{C}, 1]^{T}$ denotes the homogeneous coordinate of the target point in the camera coordinate system, ${}^{E}T_{C}$ is the hand–eye transformation matrix, and ${}^{B}T_{E}$ is the current end-effector pose obtained from robot kinematics.
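The back-projection and the chained transformation of Equation (16) can be sketched as follows. The pinhole camera model and the intrinsic parameters `fx`, `fy`, `cx`, `cy` are assumptions, since the text only states that the center point is back-projected using the depth value. The transformation matrices and all numbers in the example are hypothetical.

```python
def back_project(u, v, z, fx, fy, cx, cy):
    """Pixel (u, v) with depth z -> homogeneous point in the camera frame.

    Assumes an ideal pinhole model without lens distortion.
    """
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return [x, y, z, 1.0]


def mat_vec(T, p):
    """Multiply a 4x4 homogeneous transform by a homogeneous point."""
    return [sum(T[r][c] * p[c] for c in range(4)) for r in range(4)]


def camera_to_base(T_base_ee, T_ee_cam, p_cam):
    """Equation (16): p^B = (B_T_E)(E_T_C) p^C."""
    return mat_vec(T_base_ee, mat_vec(T_ee_cam, p_cam))


# Hypothetical intrinsics and pure-translation transforms, in metres.
p_cam = back_project(u=380, v=240, z=0.5, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
T_ee_cam = [[1, 0, 0, 0.0], [0, 1, 0, 0.0], [0, 0, 1, 0.1], [0, 0, 0, 1]]
T_base_ee = [[1, 0, 0, 0.4], [0, 1, 0, 0.0], [0, 0, 1, 0.3], [0, 0, 0, 1]]
p_base = camera_to_base(T_base_ee, T_ee_cam, p_cam)
```

In an eye-in-hand setup the matrix `T_base_ee` changes with every image acquisition pose, whereas `T_ee_cam` is fixed once the hand–eye calibration is done.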

For each feasible candidate, the grasp pose is generated according to the mid-shank grasping strategy. The grasp pose is expressed as

$G_{i} = (p_{i}^{g}, R_{i}^{g}),$

(17)

where $p_{i}^{g}$ is the generated grasp point near the middle part of the bolt shank, and $R_{i}^{g}$ is the corresponding end-effector orientation determined by the bolt axis direction and the gripper approach configuration. Compared with grasping at the bolt head, the mid-shank strategy reduces the gravitational moment after lifting and improves transportation stability, which is consistent with the geometric characteristics of the target bolts.

To ensure safe and stable execution, the pick-and-place motion is represented by a sequence of key poses:

$K_{i} = \{G_{i}^{\mathrm{cam}}, G_{i}^{\mathrm{pre}}, G_{i}^{g}, G_{i}^{\mathrm{lift}}, G_{i}^{\mathrm{bin}}, G_{i}^{\mathrm{rel}}\},$

(18)

where $G_{i}^{\mathrm{cam}}$ is the image acquisition pose, $G_{i}^{\mathrm{pre}}$ is the pre-grasp pose above the target bolt, $G_{i}^{g}$ is the grasp pose, $G_{i}^{\mathrm{lift}}$ is the lifting pose, $G_{i}^{\mathrm{bin}}$ is the pose above the corresponding sorting bin, and $G_{i}^{\mathrm{rel}}$ is the release pose. The vertical approach and lifting segments are planned in Cartesian space to reduce lateral deviation near the workbench, while long-distance free-space transfer is planned in joint space using quintic polynomial interpolation.

For candidate $i$, the execution cost is defined as

$J_{i} = \lambda_{d} \hat{D}_{i} + \lambda_{t} \hat{T}_{i} + \lambda_{u} \hat{U}_{i} + \lambda_{r} \hat{R}_{i} - \lambda_{s} \hat{s}_{i},$

(19)

where $\hat{D}_{i}$, $\hat{T}_{i}$, $\hat{U}_{i}$, $\hat{R}_{i}$, and $\hat{s}_{i}$ denote the normalized motion distance, estimated execution time, localization uncertainty, local interference risk, and detection confidence, respectively. The coefficients $\lambda_{d}$, $\lambda_{t}$, $\lambda_{u}$, $\lambda_{r}$, and $\lambda_{s}$ are non-negative weighting factors used to reflect different engineering priorities. When calculating $J_{i}$, all terms are normalized to [0, 1] using the observed minimum and maximum values in the current candidate set. The weighting coefficients are selected on a validation set containing 60 multi-bolt scenes. A grid search with a step size of 0.05 is performed under the constraint $\lambda_{d} + \lambda_{t} + \lambda_{u} + \lambda_{r} + \lambda_{s} = 1$. The selected weights are $\lambda_{d} = 0.20$, $\lambda_{t} = 0.15$, $\lambda_{u} = 0.30$, $\lambda_{r} = 0.25$, and $\lambda_{s} = 0.10$. This setting gives higher priority to localization reliability and local interference avoidance, while still considering motion efficiency and detection confidence.
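A minimal sketch of Equation (19) with per-scene min-max normalization and the reported weights is given below. How a term is normalized when all candidates share the same value (for example, a scene with a single candidate) is not stated in the text; mapping it to zero is an assumption made here. The candidate values in the example are hypothetical.

```python
# Weights reported in the text for Equation (19).
WEIGHTS = {"d": 0.20, "t": 0.15, "u": 0.30, "r": 0.25, "s": 0.10}


def min_max(values):
    """Normalize values to [0, 1] over the current candidate set."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a term with no spread contributes zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def execution_costs(D, T, U, R, s, w=WEIGHTS):
    """Return J_i for every candidate; lower is better."""
    Dn, Tn, Un, Rn, sn = (min_max(x) for x in (D, T, U, R, s))
    return [
        w["d"] * Dn[i] + w["t"] * Tn[i] + w["u"] * Un[i]
        + w["r"] * Rn[i] - w["s"] * sn[i]
        for i in range(len(D))
    ]


# Three hypothetical candidates in one scene.
J = execution_costs(
    D=[1.0, 1.2, 1.4],   # motion distance
    T=[8.0, 9.0, 10.0],  # estimated execution time
    U=[0.2, 0.6, 0.4],   # localization uncertainty
    R=[0.0, 1.0, 0.5],   # local interference risk
    s=[0.9, 0.7, 0.8],   # detection confidence
)
```

Because the confidence term enters with a negative sign, a candidate that is best on every term receives a negative cost, which is harmless since only the ordering of the costs matters for selection.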

The motion distance term is estimated from the planned key poses:

$D_{i} = \|p_{\mathrm{cam}} - p_{i}^{\mathrm{pre}}\|_{2} + \|p_{i}^{\mathrm{pre}} - p_{i}^{g}\|_{2} + \|p_{i}^{g} - p_{i}^{\mathrm{lift}}\|_{2} + \|p_{i}^{\mathrm{lift}} - p_{i}^{\mathrm{bin}}\|_{2} + \|p_{i}^{\mathrm{bin}} - p_{i}^{\mathrm{rel}}\|_{2},$

(20)

where $p_{\mathrm{cam}}$, $p_{i}^{\mathrm{pre}}$, $p_{i}^{g}$, $p_{i}^{\mathrm{lift}}$, $p_{i}^{\mathrm{bin}}$, and $p_{i}^{\mathrm{rel}}$ are the Cartesian positions of the corresponding key poses. This term describes the spatial length of the pick-and-place motion and is related to task efficiency.
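Equation (20) is simply the length of the polyline through the six key positions, which can be sketched as follows. The key positions in the example are hypothetical values in metres.

```python
import math


def path_length(key_positions):
    """Sum of Euclidean distances between consecutive key positions."""
    return sum(
        math.dist(a, b) for a, b in zip(key_positions, key_positions[1:])
    )


# Hypothetical key positions: cam, pre, grasp, lift, bin-above, release.
key_positions = [
    (0.0, 0.0, 0.5),
    (0.3, 0.4, 0.5),
    (0.3, 0.4, 0.3),
    (0.3, 0.4, 0.5),
    (0.3, -0.4, 0.5),
    (0.3, -0.4, 0.4),
]
D_i = path_length(key_positions)
```

Note that this is a Cartesian proxy: for the joint-space segments the end-effector does not travel along a straight line, so the true path is at least as long as the corresponding term.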

The execution time term is estimated as the sum of the durations of the five motion segments:

$T_{i} = T_{i}^{\mathrm{cam} \to \mathrm{pre}} + T_{i}^{\mathrm{pre} \to \mathrm{grasp}} + T_{i}^{\mathrm{grasp} \to \mathrm{lift}} + T_{i}^{\mathrm{lift} \to \mathrm{bin}} + T_{i}^{\mathrm{bin} \to \mathrm{release}}.$

(21)

For joint-space transfer segments, the time is constrained by the velocity and acceleration limits of the robot joints. For Cartesian vertical approach, lifting, and release segments, the time is constrained by the allowable end-effector velocity and the required vertical travel distance.
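One way to turn these limits into segment durations is sketched below. It is an illustrative estimate under stated assumptions, not the authors' timing model. For a rest-to-rest quintic over a displacement Δq and duration T, the peak velocity is 15Δq/(8T) and the peak acceleration is 10Δq/(√3 T²), which gives a minimum feasible T per joint. The Cartesian estimate assumes constant speed and ignores acceleration and deceleration phases, so it is a lower bound. All limits in the example are hypothetical.

```python
import math


def quintic_min_duration(delta_q, v_max, a_max):
    """Shortest rest-to-rest quintic duration that respects both limits."""
    dq = abs(delta_q)
    t_vel = 15.0 * dq / (8.0 * v_max)                        # peak velocity bound
    t_acc = math.sqrt(10.0 * dq / (math.sqrt(3.0) * a_max))  # peak acceleration bound
    return max(t_vel, t_acc)


def joint_segment_time(q_start, q_end, v_max, a_max):
    """All joints move synchronously, so the slowest joint sets the time."""
    return max(
        quintic_min_duration(qe - qs, v, a)
        for qs, qe, v, a in zip(q_start, q_end, v_max, a_max)
    )


def cartesian_segment_time(distance, v_ee_max):
    """Constant-speed lower bound for a straight-line segment."""
    return distance / v_ee_max


# Hypothetical two-joint transfer and a 0.2 m vertical descent at 0.1 m/s.
t_joint = joint_segment_time(
    q_start=[0.0, 0.0], q_end=[1.6, 0.4],
    v_max=[1.5, 1.5], a_max=[100.0, 100.0],
)
t_descent = cartesian_segment_time(0.2, 0.1)
```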

The localization uncertainty term is introduced to reflect the reliability of the visually estimated grasp pose:

$U_{i} = (1 - s_{i}) + \frac{\kappa}{w_{i} h_{i} + \varepsilon},$

(22)

where $s_{i}$ is the detection confidence score, $w_{i} h_{i}$ is the image-plane area of the detected bounding box, $\kappa$ is a scaling coefficient, and $\varepsilon$ is a small positive constant used to avoid division by zero. This term assigns a larger uncertainty penalty to low-confidence detections and small image regions. Such a design is reasonable because small and slender bolts are more sensitive to center localization errors during the subsequent grasping process. In this study, $\kappa$ was set to 1000 $\mathrm{pixel}^{2}$ and $\varepsilon$ was set to $1 \times 10^{-6}$, so that the bounding-box-area-related uncertainty term remained comparable to the confidence-related term before normalization.
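Equation (22) with the reported constants can be evaluated directly, as in the sketch below. The two bounding boxes in the example are hypothetical and chosen only to show how the area term behaves.

```python
KAPPA = 1000.0   # scaling coefficient in pixel^2, as reported in the text
EPSILON = 1e-6   # guards against division by zero, as reported in the text


def localization_uncertainty(score, width, height, kappa=KAPPA, eps=EPSILON):
    """Equation (22): confidence term plus inverse bounding-box-area term."""
    return (1.0 - score) + kappa / (width * height + eps)


# Same confidence, different box sizes (in pixels).
u_large = localization_uncertainty(score=0.9, width=40, height=25)
u_small = localization_uncertainty(score=0.9, width=20, height=10)
```

With a 40 × 25 pixel box the area term is close to 1, while a 20 × 10 pixel box raises it to about 5, so the smaller detection is penalized much more heavily before normalization.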

The local interference risk term is used to describe the spatial proximity between the target candidate and other detected bolts:

$R_{i} = \sum_{j \in B,\, j \neq i} \exp\left(- \frac{\|p_{i}^{B} - p_{j}^{B}\|_{2}^{2}}{2 \sigma_{r}^{2}}\right),$

(23)

where $p_{i}^{B}$ and $p_{j}^{B}$ are the positions of candidates $i$ and $j$ in the robot base coordinate system, and $\sigma_{r}$ is a parameter controlling the influence range of neighboring objects. A larger value of $R_{i}$ indicates that the target is closer to other detected objects, which may increase the risk of accidental contact during the approach or lifting phase. In scenes with only one valid target, this term naturally becomes zero. In this study, $\sigma_{r}$ was set to 25 mm to describe the local neighborhood range in which adjacent bolts may interfere with the vertical grasping motion.
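The Gaussian proximity sum of Equation (23) can be sketched as follows, using the reported value of 25 mm for the influence range. The bolt positions in the example are hypothetical and given in millimetres.

```python
import math

SIGMA_R_MM = 25.0  # influence range reported in the text


def interference_risk(positions_mm, sigma=SIGMA_R_MM):
    """Equation (23): one risk value per candidate."""
    risks = []
    for i, p_i in enumerate(positions_mm):
        total = 0.0
        for j, p_j in enumerate(positions_mm):
            if j != i:
                d2 = math.dist(p_i, p_j) ** 2
                total += math.exp(-d2 / (2.0 * sigma**2))
        risks.append(total)
    return risks


# Two bolts 25 mm apart and one isolated bolt 200 mm away.
risks = interference_risk([(0.0, 0.0, 0.0), (25.0, 0.0, 0.0), (200.0, 0.0, 0.0)])
single = interference_risk([(0.0, 0.0, 0.0)])
```

A neighbor at exactly one σ contributes exp(-0.5), roughly 0.61, whereas the bolt 200 mm away contributes essentially nothing, and a scene with a single bolt yields a risk of zero.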

A feasible grasp candidate should satisfy the following constraints:

$\begin{aligned} & s_{i} \geq s_{\min}, \\ & p_{i}^{B} \in W, \\ & q_{\min} \leq q_{k} \leq q_{\max}, \quad \forall\, G_{k} \in K_{i}, \\ & d(P_{i}, O) \geq d_{\min}, \end{aligned}$

(24)

where $s_{\min}$ is the minimum confidence threshold, $W$ denotes the reachable workspace of the robot, $q_{k}$ is the inverse-kinematic solution of each key pose, $q_{\min}$ and $q_{\max}$ are the joint limits, $P_{i}$ denotes the planned motion path, $O$ represents obstacles or non-target objects in the workspace, and $d_{\min}$ is the minimum allowable safety distance. These constraints respectively ensure perception reliability, reachability, kinematic feasibility, and collision safety.

When multiple feasible candidates are present in the camera view, the next grasp target can be selected by

$i^{*} = \arg\min_{i \in B_{f}} J_{i},$

(25)

where $B_{f}$ is the set of candidates satisfying the feasibility constraints. When only one valid bolt is present, the formulation reduces to a feasibility and trajectory-cost evaluation for that target.
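The feasibility filter of Equation (24) followed by the selection rule of Equation (25) can be sketched as below. The `Candidate` record is a hypothetical container: in practice the workspace, joint-limit, and clearance flags would come from the inverse-kinematics solver and the collision checker. The thresholds `s_min = 0.25` and `d_min_mm = 10.0` are illustrative assumptions, as are all candidate values.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    index: int
    score: float                 # detection confidence s_i
    in_workspace: bool           # p_i^B lies in W
    joints_within_limits: bool   # all key-pose IK solutions are within limits
    clearance_mm: float          # d(P_i, O) along the planned path
    cost: float                  # execution cost J_i


def select_target(candidates, s_min, d_min_mm):
    """Return the index of the feasible candidate with the lowest cost."""
    feasible = [
        c for c in candidates
        if c.score >= s_min
        and c.in_workspace
        and c.joints_within_limits
        and c.clearance_mm >= d_min_mm
    ]
    if not feasible:
        return None  # nothing safe to grasp in this view
    return min(feasible, key=lambda c: c.cost).index


candidates = [
    Candidate(0, 0.92, True, True, 4.0, 0.12),    # cheapest, but too little clearance
    Candidate(1, 0.88, True, True, 30.0, 0.31),
    Candidate(2, 0.20, True, True, 40.0, 0.05),   # below the confidence threshold
    Candidate(3, 0.81, False, True, 35.0, 0.20),  # outside the workspace
]
best = select_target(candidates, s_min=0.25, d_min_mm=10.0)
```

Filtering before minimizing keeps hard safety requirements separate from the weighted trade-off, so a low cost can never compensate for a violated constraint.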

Overall, this optimization-oriented formulation connects the perception output, grasp pose generation, and trajectory execution in a unified framework. It does not replace the detection network or the trajectory planner. Instead, it provides an explicit decision-level representation for selecting feasible grasp targets and evaluating execution cost, which is beneficial for extending the current system to more complex industrial scenarios involving multiple visible bolts, cluttered placement, or mixed-object sorting. From an optimization perspective, the proposed formulation can be regarded as a constrained discrete decision problem over feasible grasp candidates, in which the uncertainty term and the local interference term play roles similar to risk penalties in uncertainty-aware optimization. Although the present formulation does not solve a full distributionally robust optimization problem, it follows the same principle of incorporating uncertain perception quality into decision making. Recent studies on Wasserstein distributionally robust chance-constrained optimization [26] also demonstrate the importance of modeling uncertainty through ambiguity sets and feasibility constraints when exact distributions are unavailable.

4. Experiment

4.1. Experimental Setup and Result Analysis

To validate the effectiveness of the proposed components, including the E-C2f efficient feature extraction module, LCA, SA-PAN, and the WD-IoU loss function, a series of controlled experiments were conducted on a self-constructed dataset of standard bolts. The dataset consists of five categories of commonly used bolts. For each category, 400 images were collected under controlled conditions without stacking or complex background interference, as illustrated in Figure 3, giving 2000 images in total. Of these, 1400 images were used for training, 300 for validation, and 300 for testing, so that each category contributes 280 training images, 60 validation images, and 60 test images. Images collected from the same acquisition sequence were assigned to the same subset to avoid data leakage. To evaluate generalization under more realistic industrial conditions, an additional challenging external test set containing 800 images was constructed. This set includes varying illumination, clustered bolts, partial occlusion, reflective surfaces, and cluttered backgrounds, and it was not used for training or hyperparameter tuning.

All experiments were performed under a unified hardware and software environment to ensure fair comparison across model configurations: a workstation equipped with an NVIDIA RTX 3060 GPU, running Ubuntu 20.04, PyTorch 2.0.1, and CUDA 11.7. All detectors were trained with an input resolution of 640 × 640, a batch size of 16, and 200 epochs. The initial learning rate was 0.01, the momentum was 0.937, and the weight decay was 0.0005. The confidence threshold was set to 0.25 and the NMS IoU threshold to 0.70 unless otherwise specified. FPS was measured on the same GPU with a batch size of 1, averaged over 1000 inference iterations that followed 100 warm-up iterations.
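The FPS protocol (warm-up iterations followed by timed iterations at batch size 1) can be sketched with a generic timing harness. The function `measure_fps` and the `dummy_infer` workload are hypothetical stand-ins; the actual measurement script is not given in the text. When timing a GPU model, the device would typically need to be synchronized before the timer is read (for example with `torch.cuda.synchronize()` in PyTorch), because kernel launches are asynchronous.

```python
import time


def measure_fps(infer, warmup=100, iterations=1000):
    """Run `infer` for warm-up, then time `iterations` calls and return FPS."""
    for _ in range(warmup):
        infer()  # warm-up calls are excluded from the timing
    start = time.perf_counter()
    for _ in range(iterations):
        infer()
    elapsed = time.perf_counter() - start
    return iterations / elapsed


# Stand-in workload; a real test would run one forward pass on one image.
calls = {"n": 0}


def dummy_infer():
    calls["n"] += 1
    return sum(range(100))


fps = measure_fps(dummy_infer)
```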

For a comprehensive performance evaluation of bolt detection, several widely used quantitative metrics in visual detection tasks are adopted, including mAP, precision, parameter count, and FPS. The proposed YOLOv8n-BoltLite is compared with both classical and recent object detection models, including YOLOv8n, YOLOv9-t, YOLOv10n, RT-DETR-R18, and Gold-YOLO-N. All models are trained and evaluated under the same experimental conditions to ensure a fair comparison.

As shown in Table 1, the proposed YOLOv8n-BoltLite achieves an mAP of 99.64 ± 0.05% on the controlled test set, which is relatively clean and free of complex backgrounds. Compared with the original YOLOv8n, YOLOv8n-BoltLite improves mAP from 98.9% to 99.64% while reducing the parameter count from 3.20 M to 3.02 M. The inference speed reaches 198 FPS, which is higher than that of YOLOv8n, YOLOv9-t, RT-DETR-R18, and Gold-YOLO-N, although YOLOv10n achieves a higher FPS owing to its more aggressive real-time detection design. Representative detection results of the YOLOv8n-BoltLite network are shown in Figure 4. Taken together, these results indicate that the proposed model provides a favorable balance among detection accuracy, model compactness, and inference speed. Compared with larger or more complex models, YOLOv8n-BoltLite is better suited to real-time robotic bolt-handling scenarios where both localization accuracy and computational efficiency are required.

To further evaluate robustness, the trained models were also tested on the challenging external test set, with the results shown in Table 2. Compared with the controlled test set, the performance of all models decreased on the challenging external test set. This reduction is expected, because illumination variation, partial occlusion, metallic reflection, bolt clustering, and background clutter increase the difficulty of both classification and localization. Nevertheless, YOLOv8n-BoltLite still achieved the best overall performance, with 94.62 ± 0.18% mAP. Compared with YOLOv8n, the proposed model improved mAP by 2.8 percentage points on the challenging test set.

This improvement, which is larger than the 0.74-percentage-point gain on the controlled test set, indicates that the proposed E-C2f, LCA, SA-PAN, and WD-IoU components are more beneficial under difficult visual conditions than under the clean controlled setting. In particular, the challenging test results show that the proposed method does not merely benefit from the simplicity of the original dataset, but also improves robustness against practical disturbances such as reflections, occlusions, and clustered arrangements.

4.2. Ablation Experiment

To isolate the contribution of each module, a removal-based ablation study is conducted. Starting from the full YOLOv8n-BoltLite model, E-C2f, LCA, SA-PAN, and WD-IoU are removed independently while all other settings remain unchanged. All ablation experiments are conducted using the same training, validation, and test splits described in Section 4.1.

Table 3 reports the ablation results in terms of detection accuracy, model size, and inference speed, evaluated on both the controlled test set (simple scenes) and the challenging test set (complex scenes). The original YOLOv8n model achieves the lowest detection accuracy in both settings and has a larger parameter count, indicating that it is not well adapted to detecting small, slender bolts. When the E-C2f lightweight feature extraction module is removed, the detection accuracy declines on both test sets, the parameter count increases slightly, and the inference frame rate drops. This indicates that E-C2f streamlines the basic convolution structure and extracts effective shallow bolt features, balancing detection precision against lightweight design. Removing the LCA channel attention module also causes a notable drop in performance. This module selects informative feature channels and suppresses interference from irrelevant background, which enhances feature discrimination for visually similar bolt targets and improves detection stability under complex industrial interference. Without the SA-PAN scale-adaptive feature fusion module, the model cannot make full use of multi-scale feature information: shallow detail features and deep semantic features are not fused effectively, which reduces adaptability to bolts of different sizes and shapes and lowers the overall detection accuracy. In the variant without the WD-IoU loss function, the bounding-box regression accuracy decreases. With its scale-aware penalty mechanism, WD-IoU is designed for the structural characteristics of bolts; it reduces the positioning deviation caused by dense arrangement and partial occlusion and improves the localization reliability of small targets. The complete YOLOv8n-BoltLite model, which integrates all of the modules, obtains the best overall performance, with a controlled-set mAP of 99.64 ± 0.05% and a challenging-set mAP of 94.62 ± 0.18%, while maintaining a low parameter count of 3.02 M and a real-time inference speed of 198 FPS.

In summary, each core module contributes to detection accuracy, robustness to interference, model compactness, or inference efficiency, and the modules complement one another. By removing one component at a time, the ablation scheme quantifies the contribution of each component individually and avoids the confounding inherent in a cumulative ablation, which supports the rationality and necessity of the module design for bolt detection.

To further verify the effectiveness of the proposed WD-IoU loss, an additional comparison was conducted with commonly used bounding-box regression losses, including CIoU, EIoU, and Wise-IoU. In this experiment, only the regression loss was changed, while the backbone, neck, detection head, training settings, and dataset split remained unchanged. As shown in Table 4, WD-IoU achieved the best performance on both the controlled and challenging test sets. The improvement on the controlled test set was relatively small because the detection accuracy was already close to saturation, whereas the improvement on the challenging test set was more evident. This result indicates that the scale-aware center-distance penalty is beneficial for small and slender bolt localization, which is important for subsequent robotic grasping.

4.3. Robot Arm Grasping Experiment

To evaluate the practical feasibility of the proposed vision-guided robotic grasping framework, robotic grasping experiments were conducted using the five bolt categories considered in this study. The purpose of this section is to verify whether the proposed perception-to-manipulation pipeline can reliably transform detection results into executable grasping actions, and whether the optimization-oriented target-selection strategy is effective when multiple feasible bolts are present in the workspace. The experimental platform consisted of an industrial robotic arm, an Intel RealSense D435i RGB-D camera mounted in an eye-in-hand configuration, and two representative end-effectors: an electric parallel gripper and a five-finger dexterous hand. The overall hand–eye calibration error was controlled within 0.217 mm, and the repeatability of the robotic manipulator was ±0.05 mm. In each trial, the detector first output the bolt category, bounding box, confidence score, and center position. The depth camera was then used to recover the 3D position of the detected bolt, and the grasp pose was generated according to the mid-shank grasping strategy described in Section 3.2. Finally, the robot executed the planned pick-and-place trajectory.

In order to closely simulate real industrial feeding scenarios, two representative bolt placement configurations were designed, as illustrated in Figure 5.

Upright placement: As shown in Figure 5a, the bolt is placed vertically, with the head in contact with the workbench surface and the shank pointing upward. The central axis of the bolt is perpendicular to the workbench plane.
Horizontal placement: As shown in Figure 5b, the bolt is placed flat on the workbench, with the shank lying on the surface and the head oriented laterally. In this case, the central axis of the bolt is parallel to the workbench plane.

To ensure statistical reliability, each bolt category, placement configuration, and end-effector combination was tested independently for 50 repeated trials, resulting in a total of 1000 valid grasping experiments. Two evaluation metrics were defined:

Grasp success rate, which reflects system reliability. A grasp is considered successful if the bolt remains stable throughout the entire process, including grasping, lifting, transportation, and placement, without dropping, slipping, or exhibiting visible damage such as thread deformation.
Task execution time, defined as the total duration from the completion of visual pose estimation to the stable placement of the bolt at the designated location, which reflects system real-time performance.

Figure 6 illustrates the robotic arm automatically grasping bolts in the two placement configurations described above.

As shown in Table 5, the proposed system achieved an overall grasping success rate of 97.0% under controlled conditions, with an average cycle time of 8.73 s. The parallel gripper achieved slightly higher efficiency than the five-finger dexterous hand because of its simpler closing motion and shorter mechanical response time. For both end-effectors, the success rate under horizontal placement was lower than that under upright placement, because horizontally placed bolts provide a smaller effective contact region and are more likely to roll or slip during lifting.

In addition, the grasping performance was further analyzed across the five bolt categories. Each category included 200 trials, covering both placement configurations and both end-effectors. The success rates were 98.5% (197/200), 98.0% (196/200), 97.0% (194/200), 96.5% (193/200), and 95.0% (190/200), respectively. A chi-square test on the success/failure counts showed no significant difference among the five categories (p = 0.272), suggesting that the grasping performance was not strongly category-dependent under the tested controlled conditions. The slightly lower success rate for smaller or more slender bolts was mainly attributed to the reduced contact area and increased slipping risk during horizontal grasping.
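The reported test can be reproduced from the success and failure counts alone, as sketched below. The sketch assumes a standard Pearson chi-square test of homogeneity on the 5 × 2 table (4 degrees of freedom) without continuity correction; the text does not state which variant was used. The survival function uses the closed form that holds for an even number of degrees of freedom, so no external statistics library is required.

```python
import math

successes = [197, 196, 194, 193, 190]
trials = [200] * 5


def chi_square_homogeneity(successes, trials):
    """Pearson chi-square statistic and degrees of freedom for a k x 2 table."""
    total_s, total_n = sum(successes), sum(trials)
    p_success = total_s / total_n
    stat = 0.0
    for s, n in zip(successes, trials):
        f = n - s
        exp_s = n * p_success          # expected successes under homogeneity
        exp_f = n * (1.0 - p_success)  # expected failures under homogeneity
        stat += (s - exp_s) ** 2 / exp_s + (f - exp_f) ** 2 / exp_f
    return stat, len(trials) - 1


def chi2_sf_even_dof(x, dof):
    """P(X > x) for a chi-square variable with an even number of dof."""
    half = x / 2.0
    return math.exp(-half) * sum(
        half**j / math.factorial(j) for j in range(dof // 2)
    )


stat, dof = chi_square_homogeneity(successes, trials)
p_value = chi2_sf_even_dof(stat, dof)
```

Under these assumptions the statistic is about 5.15 with 4 degrees of freedom, and the p-value rounds to the reported 0.272. The pooled counts also give 970 successes in 1000 trials, which matches the overall success rate of 97.0%.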

To further validate the optimization-oriented target-selection method, an additional multi-target experiment was conducted using scenes containing three to six visible bolts. In total, 100 multi-bolt scenes and 300 grasping attempts were tested. As shown in Table 6, the proposed strategy was compared with three simple target-selection baselines: random feasible selection, highest-confidence-first selection, and nearest-target-first selection.

Although the aggregate success rate of the proposed strategy in the multi-target experiment was also 97.0%, this result was obtained under a different protocol from the single-target controlled experiment. The multi-target experiment focused on target selection among multiple feasible candidates, while the single-target experiment evaluated the overall perception-to-grasping pipeline across bolt categories, placement configurations, and end-effectors. To further examine the effect of scene density, the multi-target results were analyzed according to the number of visible bolts. The 100 multi-bolt scenes were evenly divided into four groups, with 25 scenes each containing 3, 4, 5, and 6 visible bolts, respectively. Since each scene was tested with three grasping attempts, each group contained 75 attempts. The success rates for the 3-, 4-, 5-, and 6-bolt scenes were 98.7% (74/75), 97.3% (73/75), 96.0% (72/75), and 96.0% (72/75), respectively.
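As a consistency check, the per-density counts can be pooled to recover the aggregate multi-target success rate. The short sketch below uses only the counts reported in the text.

```python
# (successes, attempts) for scenes with 3, 4, 5, and 6 visible bolts.
groups = {3: (74, 75), 4: (73, 75), 5: (72, 75), 6: (72, 75)}

total_success = sum(s for s, _ in groups.values())
total_attempts = sum(n for _, n in groups.values())
overall_rate = 100.0 * total_success / total_attempts

# Per-group success rates in percent, rounded to one decimal place.
per_group = {k: round(100.0 * s / n, 1) for k, (s, n) in groups.items()}
```

The pooled result is 291 successes in 300 attempts, that is, 97.0%, which agrees with the aggregate figure, and the nine failures are spread across all four density groups.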

In addition, as shown in Table 6, the nearest-target-first strategy achieved the shortest average cycle time and travel distance because it prioritized spatial proximity. However, it tended to select bolts located in locally crowded regions, which increased the risk of interference during the vertical approach and lifting stages. The highest-confidence-first strategy improved perception reliability but did not consider motion efficiency or neighboring-object interference. In comparison, the proposed optimization-oriented selection strategy jointly considered detection confidence, localization uncertainty, motion distance, estimated execution time, and local interference risk. As a result, it achieved the highest grasping success rate while maintaining a competitive execution time. These results indicate that the proposed optimization-oriented formulation is practically useful for multi-target bolt selection.

The most common failure mode in the 1000 controlled grasping trials was slip or unstable clamping, which mainly occurred with small or horizontally placed bolts. This is because the effective contact area between the gripper and the bolt shank becomes limited when the bolt diameter is small or when the bolt lies flat on the workbench. Another important source of failure was depth error caused by metallic reflection. In several cases, the reflective surface of the bolt led to missing or unstable depth values, which affected the accuracy of 3D pose recovery. A smaller number of failures were caused by detection or localization errors, especially when the bolt edge was blurred or the visual feature response was weak. In multi-bolt scenes, local interference occasionally occurred when the selected target was too close to neighboring bolts, causing accidental contact during the approach or lifting stage. A few placement failures were also observed, mainly due to slight release-pose deviation or gripper rebound. These failure cases suggest that future improvements should focus on more robust depth filtering, force-feedback-based grasp verification, and more detailed collision checking in densely arranged bolt scenes.

Overall, the robotic experiments demonstrate that the proposed framework can effectively integrate visual detection, depth-based 3D pose recovery, mid-shank grasp-point generation, optimization-oriented target selection, and trajectory execution. The controlled experiments verify the feasibility and reliability of the perception-to-manipulation pipeline, while the multi-target experiments further show that the proposed selection formulation can reduce local interference risk compared with simpler target-selection strategies. However, the current experiments were mainly conducted in structured or semi-structured bolt-handling scenarios. Strong occlusion, severe clutter, highly reflective surfaces, and mixed-object manipulation remain challenging and require further investigation in future work.

5. Conclusions

This paper addresses the demand for high-precision recognition and automated grasping of bolts in industrial automation. Considering the challenges associated with bolts, such as small object size and random placement orientations, as well as the limitations of traditional vision-based methods, an integrated system is developed that combines high-accuracy visual detection, precise coordinate mapping, and adaptive robotic grasping.

In the perception stage, we propose a lightweight detection framework, YOLOv8n-BoltLite, which introduces several improvements to enhance performance. First, the E-C2f module is designed to enable efficient extraction of fine-grained features while reducing redundant computation. Second, an LCA mechanism is incorporated to enhance discriminative feature channels and suppress noise without introducing additional parameters. Third, an SA-PAN is developed to improve multi-scale feature fusion by emphasizing informative feature scales and reducing redundancy. Finally, the WD-IoU loss function is employed to strengthen center-distance constraints during bounding box regression, thereby improving localization accuracy for small and slender bolt targets. Experimental results show that YOLOv8n-BoltLite achieves 99.64 ± 0.05% mAP on the controlled test set and 94.62 ± 0.18% mAP on the challenging external test set. Robotic experiments show a 97.0% success rate under controlled conditions. These results indicate that the proposed framework is effective for structured and semi-structured bolt-handling scenarios.

In the robotic grasping stage, high-precision hand–eye calibration and coordinate transformation are performed, with the overall calibration error controlled within 0.217 mm. A vision-guided grasping framework is established to handle two representative placement configurations and multiple bolt types. Two types of end-effectors, namely a parallel gripper and a five-finger dexterous hand, are employed to validate system adaptability. Combined with adaptive grasp strategies and collision-aware trajectory planning, a complete pipeline from perception to manipulation is realized. The experiments show that the proposed system achieves an overall grasp success rate of 97.0% under controlled conditions, with per-category success rates of at least 95.0%, an average pick-and-place cycle time of 8.73 s, and a positioning accuracy better than 0.3 mm. These results indicate that the system can meet the practical requirements of precision assembly tasks under the tested conditions.

Overall, this work achieves a tightly integrated optimization of visual perception and robotic manipulation for small fasteners. The proposed system effectively addresses key challenges in industrial applications, including low detection accuracy for small objects, large localization errors, limited pose adaptability, and insufficient grasp reliability. The experimental results verify that the system meets the performance requirements of industrial automated assembly. Future work will focus on extending the system to more complex scenarios involving cluttered environments, occlusions, and mixed object categories, as well as integrating multi-view perception and force feedback to further improve robustness.

Author Contributions

Methodology, P.F.; Validation, Y.X.; Investigation, L.L.; Resources, Z.Z.; Data curation, X.W.; Writing—original draft, P.F.; Writing—review and editing, X.Z.; Visualization, X.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Program of China under Grant 2023YFB3408603, Huazhong University of Science and Technology-Wuxi Research Institute, and the National Natural Science Foundation of China, grant numbers 52422501 and 52188102; and the Fundamental and Interdisciplinary Disciplines Breakthrough Plan in Humanoid Robotics of the Ministry of Education of China.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available because the dataset is a self-built industrial bolt image dataset with customized annotations intended for research use.

Conflicts of Interest

Authors Zhenlin Zhang, Long Liu and Yingze Xi were employed by the company Dongfeng Motor Corporation Research and Development Institute, Dongfeng Motor Group (China). The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

Wang, L.; Fan, F.; Gu, Y. Using GHT-ICP2D machine vision to improve automated robotic sorting in industrial applications. Int. J. Adv. Manuf. Technol. 2025, 141, 3869–3884.
Zheng, X.; Wang, H.; Shang, Y.; Chen, G.; Zou, S.; Yuan, Q. Starting from the structure: A review of small object detection based on deep learning. Image Vis. Comput. 2024, 146, 105054.
Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector; Springer: Cham, Switzerland, 2016.
Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; IEEE: New York, NY, USA, 2005.
Lowe, D.G. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vis. 2004, 60, 91–110.
Bay, H.; Tuytelaars, T.; Van Gool, L. SURF: Speeded up robust features. In Proceedings of the 9th European Conference on Computer Vision-Volume Part I, Graz, Austria, 7–13 May 2006.
Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014; IEEE: New York, NY, USA, 2014.
Krizhevsky, A.; Sutskever, I.; Hinton, G. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the NIPS, Lake Tahoe, NV, USA, 3–8 December 2012.
Girshick, R. Fast R-CNN. arXiv 2015, arXiv:1504.08083.
Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: New York, NY, USA, 2016.
Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
Lin, T.Y.; Dollar, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: New York, NY, USA, 2017.
Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; IEEE: New York, NY, USA, 2017; pp. 2980–2988.
Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; IEEE: New York, NY, USA, 2017; pp. 7263–7271.
Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. Scaled-YOLOv4: Scaling cross stage partial network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; IEEE: New York, NY, USA, 2021; pp. 13029–13038.
Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; IEEE: New York, NY, USA, 2023; pp. 7464–7475.
Du, G.; Wang, K.; Lian, S.; Zhao, K. Vision-based robotic grasping from object localization, object pose estimation to grasp estimation for parallel grippers: A review. Artif. Intell. Rev. Int. Sci. Eng. J. 2021, 54, 1677–1734.
Tian, X.; Ke, J.; Wu, W.; Teng, J. Design of a Pill-Sorting and Pill-Grasping Robot System Based on Machine Vision. Future Internet 2025, 17, 501.
Shukla, P.; Kushwaha, V.; Nandi, G.C. Vision-Based Intelligent Robot Grasping Using Sparse Neural Network. Int. J. Intell. Robot. Appl. 2023, 9, 1214–1227.
Arkhipov, M.V.; Malkarov, A.Y.; Matrosova, V.V. Sorting System Based on Machine Vision and Manipulation Robot. In Advances in Automation V (RusAutoCon 2023); Lecture Notes in Electrical Engineering; Springer: Cham, Switzerland, 2024; pp. 409–420.
Luca, B.; Mario, S.; Margherita, P.; Roberto, V.; Marcello, P. Deep Learning-Based Method for Vision-Guided Robotic Grasping of Unknown Objects. Adv. Transdiscipl. Eng. 2020, 44, 101052.
Morrison, D.; Corke, P.; Leitner, J. Closing the Loop for Robotic Grasping: A Real-time, Generative Grasp Synthesis Approach. arXiv 2018, arXiv:1804.05172.
Lyu, X.; Hajjaj, S.S.H. Vision-Based Robotic Grasping: Integrating YOLOv8 and GRCNN for Dynamic Object Handling. In Proceedings of the International Conference on Intelligent Manufacturing and Robotics, Selangor, Malaysia, 11–12 November 2025.
Sundermeyer, M.; Mousavian, A.; Triebel, R.; Fox, D. Contact-GraspNet: Efficient 6-DoF Grasp Generation in Cluttered Scenes. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; IEEE: New York, NY, USA, 2021.
Seyedi, I.; Candelieri, A.; Messina, E.; Archetti, F. Wasserstein Distributionally Robust Optimization for Chance Constrained Facility Location Under Uncertain Demand. Mathematics 2025, 13, 2144.

Figure 1. Example samples of the five bolt types used in the experiments.

Figure 2. Overall architecture of the proposed bolt detection and grasping-oriented perception network.

Figure 3. Sample images from the constructed bolt dataset. The dataset includes five categories under different orientations and controlled conditions without clutter or background interference.

Figure 4. Schematic diagram of YOLOv8n-BoltLite detection results.

Figure 5. Illustration of grasp points for two representative bolt placement configurations: (a) upright placement, in which the bolt is placed vertically with the head in contact with the workbench surface and the shank pointing upward; and (b) horizontal placement, in which the bolt is placed flat on the workbench with the shank lying on the surface. The red rectangular boxes indicate the detected bolt regions. The red circles indicate the calculated target centroids, which are used as the grasp points for the parallel gripper. For the five-finger dexterous hand, grasping a horizontally placed bolt using only the centroid may not ensure sufficient stability; therefore, a second grasp point is generated according to the centroid position and the bolt axial direction, as indicated by the yellow circle in (b).
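The second grasp point described in the caption of Figure 5 can be pictured as an offset from the centroid along the bolt axis. The Python sketch below is illustrative only: the function name `second_grasp_point`, the `offset_mm` parameter, and the planar (workbench-plane) formulation are assumptions made for this example and are not taken from the authors' implementation.

```python
import math


def second_grasp_point(centroid, axis_angle_rad, offset_mm):
    """Return a point shifted from the centroid along the bolt axis.

    centroid:        (x, y) of the bolt centroid in the workbench plane, in mm.
    axis_angle_rad:  orientation of the bolt axis in that plane, in radians.
    offset_mm:       hypothetical signed distance along the axis, in mm.
    """
    cx, cy = centroid
    # Unit vector pointing along the bolt axis.
    ux, uy = math.cos(axis_angle_rad), math.sin(axis_angle_rad)
    return (cx + offset_mm * ux, cy + offset_mm * uy)


# Example: a bolt lying along the x-axis, second point 5 mm from the centroid.
print(second_grasp_point((10.0, 20.0), 0.0, 5.0))
```

In practice the sign and magnitude of the offset would depend on the bolt length and on which side of the centroid the head lies; both are left as inputs here.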

Figure 6. Examples of robotic bolt grasping using a parallel gripper: (a) grasping an upright bolt on the workbench; (b) grasping a horizontally placed bolt on the workbench.

Table 1. Comparison of detection performance among different models on controlled tests.

Model	mAP (%)	Precision (%)	Recall (%)	Parameters (M)	FPS
YOLOv8n	98.9 ± 0.08	98.3	98.5	3.20	186
YOLOv9-t	99.3 ± 0.06	98.9	99.0	2.05	173
YOLOv10n	99.2 ± 0.07	98.8	99.0	2.30	212
RT-DETR-R18	99.0 ± 0.09	98.7	98.9	20.0	82
Gold-YOLO-N	99.1 ± 0.07	98.8	98.8	5.60	161
YOLOv8n-BoltLite	99.64 ± 0.05	99.3	99.5	3.02	198

Table 2. Comparison of detection performance among different models on challenging tests.

Model	mAP (%)	Precision (%)	Recall (%)
YOLOv8n	91.8 ± 0.23	91.2	90.9
YOLOv9-t	93.9 ± 0.20	93.4	92.8
YOLOv10n	93.5 ± 0.22	93.1	92.6
RT-DETR-R18	94.0 ± 0.18	93.6	93.2
Gold-YOLO-N	93.7 ± 0.21	93.2	92.7
YOLOv8n-BoltLite	94.6 ± 0.18	94.1	93.9

Table 3. Ablation study of the proposed modules.

Model	Controlled mAP (%)	Challenging mAP (%)	Parameters (M)	FPS
YOLOv8n	98.92 ± 0.08	91.82 ± 0.24	3.20	186
Full model w/o E-C2f	99.44 ± 0.06	93.90 ± 0.21	3.08	191
Full model w/o LCA	99.31 ± 0.07	93.58 ± 0.25	3.08	191
Full model w/o SA-PAN	99.48 ± 0.05	94.01 ± 0.20	3.04	195
Full model w/o WD-IoU	99.37 ± 0.06	93.86 ± 0.23	3.02	198
Full YOLOv8n-BoltLite	99.64 ± 0.05	94.62 ± 0.18	3.02	198

Table 4. Comparison of different bounding-box regression losses.

Loss Function	Controlled mAP (%)	Challenging mAP (%)
CIoU	99.39 ± 0.07	93.92 ± 0.22
EIoU	99.43 ± 0.06	94.04 ± 0.21
Wise-IoU	99.50 ± 0.05	94.18 ± 0.20
WD-IoU	99.64 ± 0.05	94.62 ± 0.18

Table 5. Grasping performance under controlled conditions.

End-Effector	Placement	Success Rate (%)	Cycle Time (s)
Parallel gripper	Upright	98.4	8.23 ± 0.51
Parallel gripper	Horizontal	96.8	8.68 ± 0.64
Dexterous hand	Upright	97.2	8.76 ± 0.58
Dexterous hand	Horizontal	95.6	9.23 ± 0.72
Overall	-	97.0	8.73 ± 0.70
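The overall success rate in Table 5 can be checked as a trial-weighted average of the four configurations. The sketch below assumes, hypothetically, that the 1000 controlled trials are split equally into 250 per configuration; the per-configuration counts are not given in the table, and the function name `overall_success_rate` is introduced only for this example.

```python
def overall_success_rate(rates_percent, trials):
    """Trial-weighted mean success rate, in percent."""
    # Convert each percentage into an expected number of successful grasps.
    successes = sum(r / 100.0 * n for r, n in zip(rates_percent, trials))
    return 100.0 * successes / sum(trials)


# Success rates from Table 5 (parallel/upright, parallel/horizontal,
# dexterous/upright, dexterous/horizontal).
rates = [98.4, 96.8, 97.2, 95.6]
# Assumed equal split of the 1000 trials; not stated in the table.
trials = [250, 250, 250, 250]

print(round(overall_success_rate(rates, trials), 1))
```

Under this equal-split assumption the result agrees with the reported overall rate of 97.0%. The equal-weight mean of the four cycle times is likewise 8.725 s, which matches the reported 8.73 s after rounding.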

Table 6. Comparison of target-selection strategies in multi-bolt scenes.

Target-Selection Strategy	Success Rate (%)	Mean Cycle Time (s)
Random feasible selection	89.7	8.92 ± 1.21
Highest-confidence-first	92.7	8.77 ± 1.08
Nearest-target-first	93.7	8.14 ± 0.91
Proposed optimization-oriented selection	97.0	8.32 ± 0.84
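To make the difference between the strategies in Table 6 concrete, the sketch below implements the two simple baselines and a generic weighted-cost rule. The weighted-cost rule is only a stand-in for an optimization-oriented selection: the cost terms, the weights `w_dist` and `w_conf`, and the `Candidate` type are all assumptions for illustration and do not reproduce the authors' formulation.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    x: float           # grasp point x in the robot base frame (mm)
    y: float           # grasp point y in the robot base frame (mm)
    confidence: float  # detector confidence in [0, 1]


def _distance(c, tool_xy):
    return math.hypot(c.x - tool_xy[0], c.y - tool_xy[1])


def nearest_target_first(cands, tool_xy):
    """Baseline: pick the candidate closest to the current tool position."""
    return min(cands, key=lambda c: _distance(c, tool_xy))


def highest_confidence_first(cands):
    """Baseline: pick the candidate with the highest detection confidence."""
    return max(cands, key=lambda c: c.confidence)


def weighted_cost_selection(cands, tool_xy, w_dist=1.0, w_conf=100.0):
    """Hypothetical rule: trade travel distance against detection uncertainty."""
    def cost(c):
        return w_dist * _distance(c, tool_xy) + w_conf * (1.0 - c.confidence)
    return min(cands, key=cost)


cands = [
    Candidate(10.0, 0.0, 0.60),  # close but uncertain
    Candidate(30.0, 0.0, 0.99),  # far but very confident
    Candidate(20.0, 0.0, 0.95),  # moderate on both counts
]
tool = (0.0, 0.0)
print(nearest_target_first(cands, tool))
print(highest_confidence_first(cands))
print(weighted_cost_selection(cands, tool))
```

With these example values the three rules pick three different candidates, which illustrates why a combined criterion can behave differently from either single-criterion baseline.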
