3.1. Architecture of Improved YOLO Algorithm for Bolt Recognition
3.1.1. Overall Architecture
The overall architecture of the proposed system is designed to achieve accurate bolt recognition and provide reliable positional information for subsequent robotic grasping. As illustrated in Figure 2, the framework follows a standard detection pipeline consisting of an input layer, a backbone for hierarchical feature extraction, a neck for multi-scale feature aggregation, and a detection head for category prediction and bounding box regression. Given an input image, the backbone first extracts multi-level visual features. A Lightweight Channel Attention (LCA) module is embedded into the feature extraction stage to recalibrate channel-wise responses and enhance structurally informative features while suppressing redundant background and noise-related activations. This design improves feature representation quality without introducing significant computational complexity.
The enhanced features are then delivered to the neck network for cross-scale fusion. In conventional feature pyramid aggregation, direct bidirectional fusion may introduce redundant information exchange. Therefore, a Scale-Adaptive Path Aggregation Network (SA-PAN) is adopted to perform more efficient multi-scale fusion. By assigning learnable importance to feature maps of different resolutions, SA-PAN enables the network to selectively emphasize scale levels that are more relevant to bolt instances.
Based on the fused feature maps, the detection head performs bolt category classification and bounding box regression in an anchor-free manner. Since the downstream grasping task requires highly accurate center localization, the detection branch is further optimized through scale-aware dynamic label assignment and the proposed weighted distance IoU loss (WD-IoU Loss). Instead of introducing predefined anchor boxes, the scale-aware assignment strategy selects positive samples according to classification confidence, spatial consistency, and IoU-related localization quality, which is consistent with the anchor-free detection paradigm of YOLOv8. Meanwhile, WD-IoU strengthens the center-distance constraint during bounding box regression, especially for small and slender bolt objects. As a result, the proposed detector can provide more accurate pixel-level localization, which forms the basis for precise robotic grasping in subsequent manipulation.
In summary, the proposed overall architecture retains the efficient one-stage detection paradigm while introducing task-specific enhancements at the feature attention, neck fusion, and localization loss levels. These improvements collectively strengthen the network’s ability to detect bolts with subtle texture variations and strict localization requirements, thus providing stable and accurate visual input for subsequent robotic grasping.
3.1.2. Efficient C2f for Feature Extraction
In this study, the detector must recognize bolts of five different specifications. The YOLOv8n backbone adopts the C2f module, which benefits from the efficient layer aggregation of ELAN to enhance feature reuse and improve gradient flow. Nevertheless, bolts are typically small, carry subtle texture cues, and have elongated shapes with high aspect ratios, so the receptive field of a standard convolution may not match the scale of the target features. Consequently, redundant background information can be incorporated, and task-relevant features, such as local contours, may not be sufficiently emphasized.
Therefore, an Efficient C2f (E-C2f) module is proposed in this work. While retaining the residual structure and efficient gradient propagation of the original C2f module, E-C2f replaces standard convolution with grouped depthwise separable convolution to achieve lightweight feature extraction, enhance fine-texture representation, and suppress redundant background information. Structurally, the E-C2f module comprises four components: a 1 × 1 convolution for channel reduction, grouped depthwise separable convolution, residual branch concatenation, and a final 1 × 1 convolution for channel expansion. In this process, the conventional convolution operation is factorized into a depthwise convolution and a pointwise convolution. Owing to this decomposition, E-C2f reduces model complexity, while the depthwise convolution helps preserve fine-grained bolt features, including thread details and slender boundary structures. The retained shortcut connection further ensures stable gradient backpropagation and mitigates the optimization difficulties that may be introduced by the lightweight redesign. As a result, E-C2f achieves a better balance between efficiency and representational power, which is especially beneficial for accurate detection of small bolt targets in robotic grasping scenarios.
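To make the complexity argument concrete, the following minimal sketch compares the weight counts of a standard k × k convolution and its depthwise separable factorization. The channel widths are illustrative assumptions rather than values from the proposed network, bias terms are omitted, and the additional grouping used in E-C2f is not modeled.

```python
def standard_conv_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Weights of a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Weights of a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Illustrative channel widths (assumed, not taken from the proposed network).
c_in, c_out, k = 64, 64, 3
std = standard_conv_params(c_in, c_out, k)
dws = depthwise_separable_params(c_in, c_out, k)
ratio = dws / std  # analytically 1 / c_out + 1 / k**2
print(std, dws, round(ratio, 4))
```

For these assumed widths the factorized layer keeps roughly one eighth of the weights, which matches the analytic ratio 1/c_out + 1/k².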
3.1.3. Lightweight Channel Attention and SA-PAN Scale Adaptation
Although the background in this study is relatively simple, weak metallic reflections and subtle thread textures may still result in weak responses in some channels. Conventional attention mechanisms usually introduce additional learnable parameters and computational overhead. To overcome this issue, a Lightweight Channel Attention (LCA) module is further introduced.
The proposed LCA learns channel-wise importance through global average pooling followed by sigmoid activation, without using fully connected layers or other heavy operations. For an intermediate feature map $X \in \mathbb{R}^{C \times H \times W}$, the channel attention weights are computed as
$$z_c = \frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W} X_c(i, j), \qquad w_c = \sigma(z_c)$$
and the reweighted feature map is obtained by
$$\tilde{X}_c = w_c \cdot X_c$$
where $z_c$ denotes the global descriptor of the c-th channel, $\sigma(\cdot)$ is the sigmoid function, and $w_c$ is the corresponding channel weight.
Through this lightweight reweighting mechanism, informative channels related to threads, contours, and structural boundaries are strengthened, while low-response or noisy channels are suppressed. Consequently, the network can generate more discriminative feature representations for bolt recognition with negligible additional computational cost.
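The reweighting described above can be sketched in a few lines. This is a minimal illustration in plain Python, assuming the feature map is given as nested lists (channels, rows, columns); the function name and the toy input are hypothetical and not part of the original implementation.

```python
import math


def lightweight_channel_attention(x):
    """Reweight each channel by sigmoid(global average of that channel).

    x: list of C channels, each an H x W list of lists of floats.
    Returns the reweighted feature map and the per-channel weights.
    """
    reweighted, weights = [], []
    for channel in x:
        n = sum(len(row) for row in channel)
        z = sum(sum(row) for row in channel) / n   # global average pooling
        w = 1.0 / (1.0 + math.exp(-z))             # sigmoid, no FC layers
        weights.append(w)
        reweighted.append([[w * v for v in row] for row in channel])
    return reweighted, weights


# Toy input: one strongly responding channel and one weakly responding channel.
features = [
    [[2.0, 2.0], [2.0, 2.0]],
    [[-2.0, -2.0], [-2.0, -2.0]],
]
out, w = lightweight_channel_attention(features)
```

In this toy case the first channel receives a weight close to 0.88 and the second a weight close to 0.12, so the low-response channel is attenuated without any learnable parameters.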
Next, a Scale-Adaptive PAN (SA-PAN) feature pyramid is adopted as the neck network. The backbone outputs three feature levels, $P_3$, $P_4$, and $P_5$, corresponding to different spatial resolutions. Instead of treating all fusion inputs equally, SA-PAN introduces learnable scale-adaptive coefficients to modulate the contribution of each branch. The final fused features are obtained by weighted fusion based on these adaptive weights:
$$F_{\mathrm{fused}} = \sum_{k} \alpha_k \hat{P}_k$$
where $\hat{P}_k$ is the $k$-th input branch resampled to a common resolution and $\alpha_k$ is its scale-adaptive coefficient. To ensure stable optimization, the coefficients are normalized before fusion.
In this way, SA-PAN preserves the bidirectional aggregation advantage of PANet while enabling the network to adaptively emphasize the most relevant scale information and suppress redundant cross-scale transmission. This design improves feature utilization efficiency and enhances the detection of small fasteners with fine structural details.
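A minimal sketch of scale-adaptive weighted fusion is given below. It assumes that the branches have already been resampled to a common resolution and flattened, and that the learnable coefficients are normalized with a softmax; the softmax is an assumption made for illustration, since other normalizations (for example dividing non-negative weights by their sum) would serve the same purpose.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scalars."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def scale_adaptive_fuse(branches, logits):
    """Fuse equally sized feature vectors with normalized learnable weights.

    branches: list of flat feature vectors (already resampled to one size).
    logits:   one learnable scalar per branch (hypothetical parameters).
    """
    alphas = softmax(logits)
    size = len(branches[0])
    fused = [sum(a * b[i] for a, b in zip(alphas, branches)) for i in range(size)]
    return fused, alphas


# Three toy branches standing in for resampled multi-scale features.
branches = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
fused_equal, alphas_equal = scale_adaptive_fuse(branches, [0.0, 0.0, 0.0])
fused_fine, alphas_fine = scale_adaptive_fuse(branches, [4.0, 0.0, 0.0])
```

With equal logits the fusion reduces to a plain average, while a larger logit on the first branch pulls the fused feature toward that scale, which is the behavior the adaptive coefficients are meant to provide.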
3.1.4. Anchor-Free Detection Head and WD-IoU Loss
The detection head performs object classification and bounding box regression on the fused multi-scale features, and its optimization objective directly affects localization accuracy and convergence stability. In this work, the original anchor-free detection paradigm of YOLOv8 is retained. This design avoids the dependence on manually predefined anchor boxes and is more suitable for deployment scenarios involving bolts with different scales and aspect ratios.
Since the five categories of bolts considered in this study are mostly small, slender, and visually similar, accurate center localization is particularly important for the subsequent robotic grasping task. Therefore, the detection head is optimized from two aspects: scale-aware dynamic sample assignment and regression loss design. During training, positive samples are assigned according to the joint quality of classification confidence and localization consistency, rather than by matching predefined anchor boxes. In this way, the model can adaptively select more reliable training samples for bolt instances of different sizes while maintaining the anchor-free structure.
To further reduce localization deviation, especially for small-scale and elongated bolt objects, a weighted distance IoU loss (WD-IoU Loss) is introduced. WD-IoU extends DIoU loss by incorporating a scale-aware weighting coefficient into the center-distance penalty. The loss function is defined as
$$\mathcal{L}_{\mathrm{WD\text{-}IoU}} = 1 - \mathrm{IoU}(b, b^{gt}) + \lambda_s \frac{\rho^2(b, b^{gt})}{c^2}$$
where $b$ and $b^{gt}$ denote the predicted bounding box and the ground-truth bounding box, respectively; $\rho(b, b^{gt})$ is the Euclidean distance between their center points; $c$ is the diagonal length of the smallest enclosing box; and $\lambda_s$ is a scale-aware weighting coefficient. For smaller bolt targets, a larger value of $\lambda_s$ is assigned to impose a stronger penalty on center deviation. For larger bolt targets, a moderate value of $\lambda_s$ is used to avoid over-penalizing small localization fluctuations. The coefficient $\lambda_s$ is computed from the area of the ground-truth bounding box, the average bounding-box area in the training set, a parameter that controls the strength of scale-aware weighting, and a small positive constant. In this design, smaller targets receive larger center-distance penalties, which is consistent with the high localization accuracy required by robotic grasping.
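The total training loss is formulated as
$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{cls}} + \lambda_{\mathrm{box}} \mathcal{L}_{\mathrm{WD\text{-}IoU}}$$
where $\mathcal{L}_{\mathrm{cls}}$ denotes the classification loss, $\mathcal{L}_{\mathrm{WD\text{-}IoU}}$ is the WD-IoU regression loss, and $\lambda_{\mathrm{box}}$ is the weighting coefficient used to balance classification and localization objectives.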
In this study, the localization term is emphasized because the center position of the detected bolt directly affects the subsequent 3D pose recovery and grasp point generation. Through the combination of anchor-free detection, scale-aware sample assignment, and WD-IoU regression optimization, the proposed detection head improves localization accuracy while preserving the lightweight and real-time characteristics of YOLOv8n.
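The following sketch illustrates how a scale-weighted center-distance penalty behaves for boxes of different sizes. Boxes are given as (x1, y1, x2, y2). The specific weighting form 1 + gamma * mean_area / (gt_area + eps), the parameter name `gamma`, and its default value are assumptions chosen only to satisfy the stated property that smaller targets receive larger penalties; they should not be read as the exact definition used in this work.

```python
def wd_iou_loss(pred, gt, mean_area, gamma=0.5, eps=1e-7):
    """DIoU-style loss with a scale-aware weight on the center-distance term."""
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt

    # Intersection over union.
    inter_w = max(0.0, min(px2, gx2) - max(px1, gx1))
    inter_h = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = inter_w * inter_h
    area_p = (px2 - px1) * (py2 - py1)
    area_g = (gx2 - gx1) * (gy2 - gy1)
    iou = inter / (area_p + area_g - inter + eps)

    # Squared distance between box centers (rho^2).
    rho2 = ((px1 + px2) / 2 - (gx1 + gx2) / 2) ** 2 \
         + ((py1 + py2) / 2 - (gy1 + gy2) / 2) ** 2
    # Squared diagonal of the smallest enclosing box (c^2).
    c2 = (max(px2, gx2) - min(px1, gx1)) ** 2 \
       + (max(py2, gy2) - min(py1, gy1)) ** 2

    # Assumed scale-aware weight: grows as the ground-truth box gets smaller.
    scale_weight = 1.0 + gamma * mean_area / (area_g + eps)
    return 1.0 - iou + scale_weight * rho2 / (c2 + eps)


# Same relative center offset applied to a small and a large ground-truth box.
mean_area = 1000.0
small_loss = wd_iou_loss((2, 0, 12, 10), (0, 0, 10, 10), mean_area)
large_loss = wd_iou_loss((20, 0, 120, 100), (0, 0, 100, 100), mean_area)
perfect_loss = wd_iou_loss((0, 0, 10, 10), (0, 0, 10, 10), mean_area)
```

Both shifted predictions have the same IoU and the same normalized center distance, so an unweighted DIoU loss would treat them identically; under the assumed weighting the small box is penalized more strongly.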
3.2. Robotic Arm Grasping
3.2.1. 3D Pose Recovery and Hand–Eye Transformation
In the previous subsection, the neural network outputs the two-dimensional detection results of the target bolt in the image plane, including the center pixel coordinates $(u, v)$, the bounding-box width and height, and the category label. For robotic grasping, the detector output must be further converted into the three-dimensional grasp pose of the target, including its spatial position and the rotation angle around the vertical axis, so that the manipulator can execute a grasp in physical space. In the present work, an Intel RealSense D435i depth camera (Intel Corporation, Santa Clara, CA, USA) is employed to obtain per-pixel depth measurements. This configuration is more suitable for practical manipulation scenarios involving variable plane height, local unevenness, or partial stacking of bolts. After depth acquisition, the depth map is preprocessed by denoising and invalid-value suppression. The depth corresponding to the detected center pixel $(u, v)$ is denoted as $Z_c$ and taken as the depth of the target. Once $Z_c$ is available, the pixel coordinates can be back-projected into the camera coordinate system by means of the intrinsic calibration matrix $K$:
$$\begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix} = Z_c K^{-1} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}, \qquad K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$$
where $f_x$ and $f_y$ are the focal lengths in pixels and $(c_x, c_y)$ is the principal point.
After the 3D position has been determined, the bolt region inside the detection box is segmented from the foreground. The contour of the segmented region is then extracted, and a minimum-area rotated bounding rectangle is fitted to the contour. The rotation angle of this rectangle is taken as the refined orientation of the bolt major axis and is denoted by $\theta$. In this way, the 2D detection result is upgraded to a grasp-oriented pose description composed of the 3D position $(X_c, Y_c, Z_c)$ and the in-plane orientation $\theta$.
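The depth lookup and back-projection steps can be sketched as follows. The intrinsic values, the 3 × 3 depth patch, and the use of a median over valid pixels as the invalid-value suppression step are illustrative assumptions; a real system would take the intrinsics from the camera calibration and may use a different depth filter.

```python
from statistics import median


def robust_depth(depth_patch, min_valid=1e-6):
    """Median of the valid (non-zero) depths around the detected center pixel."""
    valid = [d for row in depth_patch for d in row if d > min_valid]
    if not valid:
        raise ValueError("no valid depth measurement in the patch")
    return median(valid)


def back_project(u, v, z, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) at depth z into the camera frame."""
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return (x, y, z)


# Hypothetical depth patch in meters; zeros mark invalid measurements.
patch = [
    [0.00, 0.50, 0.51],
    [0.50, 0.50, 0.00],
    [0.49, 0.50, 0.50],
]
z_c = robust_depth(patch)

# Hypothetical intrinsics (pixels), not the calibrated values of the D435i.
point_cam = back_project(380.0, 240.0, z_c, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
```

Here the pixel lies 60 px to the right of the principal point at a depth of 0.5 m, so the recovered point is 0.05 m to the right of the optical axis.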
The proposed system adopts an Eye-in-Hand configuration, in which the camera is rigidly mounted on the robot end-effector. This arrangement allows the vision sensor to move together with the manipulator, enabling both coarse workspace scanning and close-range observation of the target. To relate the camera coordinate system {C} to the end-effector coordinate system {E}, we use the constant rigid transformation $T_C^E$.
In this work, a checkerboard-based calibration method is used. The robot is moved to multiple poses, and for each pose $i$, the end-effector pose relative to the robot base, $T_{E,i}^{B}$, and the checkerboard pose relative to the camera, $T_{O,i}^{C}$, are recorded. For any two poses $i$ and $j$, the relative motions are defined as
$$A_{ij} = \left(T_{E,j}^{B}\right)^{-1} T_{E,i}^{B}, \qquad B_{ij} = T_{O,j}^{C} \left(T_{O,i}^{C}\right)^{-1}$$
The hand–eye relationship is then given by the classical equation
$$A_{ij} T_C^E = T_C^E B_{ij}$$
By solving this equation over multiple pose pairs, the hand–eye transformation $T_C^E$ is obtained. Accordingly, the target point in the robot base frame can be computed as
$$\tilde{\mathbf{P}}^{B} = T_E^B T_C^E \tilde{\mathbf{P}}^{C}$$
where $\tilde{\mathbf{P}}^{C}$ is the homogeneous coordinate of the target point in the camera frame and $T_E^B$ is the current end-effector pose from robot kinematics. This transformation establishes the geometric basis for subsequent grasp pose generation and robotic execution.
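As a sanity check on the frame conventions, the sketch below builds synthetic poses from a known camera-to-end-effector transform, verifies that the resulting relative motions satisfy the classical hand–eye equation, and maps a camera-frame point into the base frame. All poses and the helper names are hypothetical, and rotations are restricted to the z-axis only to keep the example short; solving for the unknown transform from real data would require a dedicated hand–eye solver.

```python
import math


def rigid_z(theta, t):
    """4 x 4 rigid transform: rotation by theta about z, then translation t."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0, t[0]], [s, c, 0.0, t[1]],
            [0.0, 0.0, 1.0, t[2]], [0.0, 0.0, 0.0, 1.0]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def inv_rigid(T):
    """Inverse of a rigid transform: [R t]^-1 = [R^T, -R^T t]."""
    r_t = [[T[j][i] for j in range(3)] for i in range(3)]
    t = [T[i][3] for i in range(3)]
    new_t = [-sum(r_t[i][k] * t[k] for k in range(3)) for i in range(3)]
    return [r_t[i] + [new_t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def camera_to_base(T_be, T_ec, p_cam):
    """Map a 3D point from the camera frame to the robot base frame."""
    T_bc = matmul(T_be, T_ec)
    ph = list(p_cam) + [1.0]
    return [sum(T_bc[i][k] * ph[k] for k in range(4)) for i in range(3)]


# Synthetic ground truth: hand-eye transform X and a fixed checkerboard pose.
X = rigid_z(0.3, (0.02, 0.0, 0.05))
T_bo = rigid_z(1.0, (0.5, 0.1, 0.0))
T_be_i = rigid_z(0.2, (0.30, 0.00, 0.40))
T_be_j = rigid_z(-0.5, (0.35, 0.10, 0.45))

# Checkerboard poses seen by the camera at the two robot poses.
T_co_i = matmul(inv_rigid(matmul(T_be_i, X)), T_bo)
T_co_j = matmul(inv_rigid(matmul(T_be_j, X)), T_bo)

A = matmul(inv_rigid(T_be_j), T_be_i)
B = matmul(T_co_j, inv_rigid(T_co_i))
AX, XB = matmul(A, X), matmul(X, B)
residual = max(abs(AX[i][j] - XB[i][j]) for i in range(4) for j in range(4))

p_base = camera_to_base(rigid_z(0.0, (0.3, 0.0, 0.4)),
                        rigid_z(0.0, (0.0, 0.0, 0.1)),
                        (0.05, 0.0, 0.5))
```

The residual is zero up to floating-point error, confirming that the definitions of the two relative motions are consistent with an eye-in-hand setup in which the checkerboard is fixed in the base frame.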
3.2.2. Capture Point Generation
A bolt is a typical elongated workpiece mainly composed of a head and a shank. One intuitive strategy is to grasp the bolt laterally from both sides of the head. Although this strategy generally provides relatively high tolerance to visual localization errors, it is not ideal for robust manipulation. The bolts used in this study vary considerably across different specifications, which may lead to uneven force distribution at the gripper fingertips. When the bolt is clamped at the head, the shank remains suspended below the contact region. For bolts with a large length-to-diameter ratio, this cantilever-like configuration is prone to oscillation during high-speed robot motion, thereby degrading grasp stability.
Considering the above factors, the proposed system adopts a mid-shank grasping strategy. First, the grasp point is closer to the physical center of mass of the bolt, which reduces the gravitational moment acting on the contact interface after lifting. Second, after being grasped near the middle of the shank, the bolt naturally tends to hang vertically under gravity, thereby forming a self-stabilizing posture during transportation.
For grasp-point generation, the geometric center of the detected bolt region is first calculated and used as the primary grasp point. For the parallel gripper, this point is directly used as the mid-shank grasp point. For the five-finger dexterous hand, however, grasping a horizontally placed bolt using only the centroid may not provide sufficient grasp stability. Therefore, an additional grasp point is generated along the estimated bolt axial direction based on the centroid position and the fitted bolt orientation. This two-point grasping configuration provides a more stable contact arrangement, particularly for horizontally placed bolts.
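The two-point configuration can be sketched as follows, assuming the centroid and the fitted orientation are both expressed in the same planar frame. The function name and the `offset` parameter (the spacing between the two grasp points along the bolt axis) are hypothetical, since the spacing used for the dexterous hand is not specified here.

```python
import math


def grasp_points(centroid, theta, offset):
    """Primary grasp point at the centroid plus a second point along the bolt axis.

    centroid: (x, y) geometric center of the detected bolt region.
    theta:    bolt major-axis orientation in radians.
    offset:   assumed spacing between the two points, in the same units as centroid.
    """
    cx, cy = centroid
    primary = (cx, cy)
    secondary = (cx + offset * math.cos(theta), cy + offset * math.sin(theta))
    return primary, secondary


# Bolt axis aligned with the y direction, second point 20 units along the axis.
primary, secondary = grasp_points((100.0, 50.0), math.pi / 2, 20.0)
```

For the parallel gripper only `primary` would be used, whereas the five-finger hand would use both points.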
3.2.3. Robot Arm Trajectory Planning
After the optimal joint configurations corresponding to the start pose and target pose are obtained through the above geometric derivations and inverse kinematics, trajectory planning is further performed to generate a sequence of discrete position, velocity, and acceleration commands for each joint, thereby enabling smooth motor motion and stable task execution. To improve grasping efficiency, a complete pick-and-place cycle is divided into five logical motion segments:
Camera pose to pre-grasp pose.
The pre-grasp pose is defined at a safe height directly above the target bolt. The manipulator moves from the image acquisition pose to this pre-grasp pose using quintic polynomial interpolation in joint space, allowing all joints to start and stop smoothly while reaching the target synchronously. Quintic planning ensures continuity of position, velocity, and acceleration, which is beneficial for reducing vibration and mechanical impact.
For a joint variable $q(t)$, the quintic trajectory is expressed as
$$q(t) = a_0 + a_1 t + a_2 t^2 + a_3 t^3 + a_4 t^4 + a_5 t^5$$
where the coefficients $a_0, \ldots, a_5$ are determined by the boundary conditions of position, velocity, and acceleration at the start and end instants.
Pre-grasp pose to grasp pose.
From the safe height, the end-effector descends vertically to the grasp pose. Since lateral deviation in this phase may cause interference with neighboring bolts, this segment is planned by straight-line interpolation in Cartesian space, ensuring that the end-effector moves strictly along the vertical direction. Let the Cartesian position of the end-effector be $\mathbf{p}(s)$. The linear interpolation between two poses $\mathbf{p}_0$ and $\mathbf{p}_1$ can be written as
$$\mathbf{p}(s) = \mathbf{p}_0 + s\,(\mathbf{p}_1 - \mathbf{p}_0), \qquad s \in [0, 1]$$
For this segment, $\mathbf{p}_0$ and $\mathbf{p}_1$ differ only in the vertical coordinate, thereby guaranteeing a pure downward motion.
Grasp pose to lifting pose.
After the gripper closes and the bolt is grasped, the end-effector moves vertically upward to a safe lifting height. This segment also adopts Cartesian straight-line interpolation, so as to quickly separate the bolt from surrounding objects while maintaining a predictable and collision-safe motion.
Lifting pose to bin-above pose.
Once the bolt has been lifted to a safe height, it is transported to the region above the corresponding sorting bin. Since this segment mainly involves long-distance transfer in free space, joint-space quintic polynomial planning is used again to improve motion smoothness and execution efficiency.
Bin-above pose to release pose.
Finally, the end-effector descends vertically from the safe transfer height to the release height inside the bin. This segment is planned by Cartesian straight-line interpolation. After reaching the release pose, the gripper opens to complete the placement operation.
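The two interpolation schemes used across the five segments can be sketched as follows. The closed-form quintic coefficients follow from the six boundary conditions on position, velocity, and acceleration; the durations, joint values, and Cartesian waypoints in the usage lines are illustrative assumptions.

```python
def quintic_coeffs(q0, q1, T, v0=0.0, v1=0.0, acc0=0.0, acc1=0.0):
    """Coefficients a0..a5 of q(t) meeting the boundary conditions at t=0 and t=T."""
    d = q1 - q0
    a0, a1, a2 = q0, v0, acc0 / 2.0
    a3 = (20 * d - (8 * v1 + 12 * v0) * T - (3 * acc0 - acc1) * T**2) / (2 * T**3)
    a4 = (-30 * d + (14 * v1 + 16 * v0) * T + (3 * acc0 - 2 * acc1) * T**2) / (2 * T**4)
    a5 = (12 * d - 6 * (v1 + v0) * T - (acc0 - acc1) * T**2) / (2 * T**5)
    return [a0, a1, a2, a3, a4, a5]


def quintic_eval(coeffs, t):
    """Position, velocity, and acceleration of the quintic at time t."""
    q = sum(a * t**k for k, a in enumerate(coeffs))
    qd = sum(k * a * t**(k - 1) for k, a in enumerate(coeffs) if k >= 1)
    qdd = sum(k * (k - 1) * a * t**(k - 2) for k, a in enumerate(coeffs) if k >= 2)
    return q, qd, qdd


def cartesian_line(p_start, p_end, steps):
    """Evenly spaced waypoints on the straight line from p_start to p_end."""
    return [tuple(a + (b - a) * i / steps for a, b in zip(p_start, p_end))
            for i in range(steps + 1)]


# Joint-space transfer: one joint moving from 0 rad to 1 rad in 2 s, rest to rest.
coeffs = quintic_coeffs(0.0, 1.0, 2.0)
q_mid, qd_mid, qdd_mid = quintic_eval(coeffs, 1.0)
q_end, qd_end, qdd_end = quintic_eval(coeffs, 2.0)

# Cartesian descent: only the vertical coordinate changes (meters, assumed values).
descent = cartesian_line((0.40, 0.10, 0.30), (0.40, 0.10, 0.12), 6)
```

The rest-to-rest quintic starts and ends with zero velocity and acceleration, which is the property that reduces vibration, while the Cartesian segment keeps the horizontal coordinates fixed at every waypoint.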
The above trajectory planning strategy provides an executable motion for a given target bolt. However, in practical visual grasping scenarios, the robot execution process should also consider whether the detected target is reliable, reachable, and safe to grasp. Therefore, an optimization-oriented grasp execution formulation is introduced to describe the relationship among perception confidence, grasp feasibility, motion efficiency, and collision safety.
3.3. Optimization-Oriented Grasp Execution Formulation
The previous subsections describe how the visual detection results are converted into grasp-oriented robot motion commands. Specifically, the detector provides the category label, bounding box, and center pixel coordinates of a target bolt. The depth camera is then used to recover the corresponding 3D position, and the hand–eye transformation maps the target position from the camera coordinate system to the robot base coordinate system. Based on the recovered position and in-plane orientation, a mid-shank grasp point is generated, followed by hybrid trajectory planning for pick-and-place execution. In this subsection, the above process is further formulated from an optimization-oriented perspective to explicitly describe the feasibility, reliability, and motion-efficiency considerations involved in grasp execution.
Let the set of valid bolt detections in the current image be denoted as
$$\mathcal{D} = \{d_1, d_2, \ldots, d_N\}$$
where $N$ is the number of detected bolt candidates that satisfy the detection confidence threshold. For each candidate $d_i$, the visual detector outputs the category label $c_i$, the confidence score $s_i$, the bounding box $b_i$, and the center pixel coordinate $(u_i, v_i)$. With the depth value $Z_i$ obtained, the center point of the bolt can be back-projected into the camera coordinate system. The corresponding position in the robot base coordinate system is obtained by
$$\tilde{\mathbf{P}}_i^{B} = T_E^B T_C^E \tilde{\mathbf{P}}_i^{C}$$
where $\tilde{\mathbf{P}}_i^{C}$ denotes the homogeneous coordinate of the target point in the camera coordinate system, $T_C^E$ is the hand–eye transformation matrix, and $T_E^B$ is the current end-effector pose obtained from robot kinematics.
For each feasible candidate, the grasp pose is generated according to the mid-shank grasping strategy. The grasp pose is expressed as
$$G_i = \left(\mathbf{p}_{g,i}, \mathbf{R}_{g,i}\right)$$
where $\mathbf{p}_{g,i}$ is the generated grasp point near the middle part of the bolt shank, and $\mathbf{R}_{g,i}$ is the corresponding end-effector orientation determined by the bolt axis direction and the gripper approach configuration. Compared with grasping at the bolt head, the mid-shank strategy reduces the gravitational moment after lifting and improves transportation stability, which is consistent with the geometric characteristics of the target bolts.
To ensure safe and stable execution, the pick-and-place motion is represented by a sequence of key poses:
$$\mathcal{P}_i = \left\{P_{\mathrm{cam}}, P_{\mathrm{pre}}, P_{\mathrm{grasp}}, P_{\mathrm{lift}}, P_{\mathrm{bin}}, P_{\mathrm{rel}}\right\}$$
where $P_{\mathrm{cam}}$ is the image acquisition pose, $P_{\mathrm{pre}}$ is the pre-grasp pose above the target bolt, $P_{\mathrm{grasp}}$ is the grasp pose, $P_{\mathrm{lift}}$ is the lifting pose, $P_{\mathrm{bin}}$ is the pose above the corresponding sorting bin, and $P_{\mathrm{rel}}$ is the release pose. The vertical approach and lifting segments are planned in Cartesian space to reduce lateral deviation near the workbench, while long-distance free-space transfer is planned in joint space using quintic polynomial interpolation.
For candidate $i$, the execution cost is defined as
$$J_i = w_D \tilde{D}_i + w_\tau \tilde{\tau}_i + w_U \tilde{U}_i + w_I \tilde{I}_i - w_s \tilde{s}_i$$
where $\tilde{D}_i$, $\tilde{\tau}_i$, $\tilde{U}_i$, $\tilde{I}_i$, and $\tilde{s}_i$ denote the normalized motion distance, estimated execution time, localization uncertainty, local interference risk, and detection confidence, respectively. The coefficients $w_D$, $w_\tau$, $w_U$, $w_I$, and $w_s$ are non-negative weighting factors used to reflect different engineering priorities. When calculating $J_i$, all terms are normalized to [0, 1] using the observed minimum and maximum values in the current candidate set. The weighting coefficients are selected on a validation set containing 60 multi-bolt scenes. A grid search with a step size of 0.05 is performed under the constraint $w_D + w_\tau + w_U + w_I + w_s = 1$. The selected weights give higher priority to localization reliability and local interference avoidance, while still considering motion efficiency and detection confidence.
The motion distance term is estimated from the planned key poses:
$$D_i = \sum_{k=0}^{4} \left\lVert \mathbf{p}_{k+1} - \mathbf{p}_k \right\rVert_2$$
where $\mathbf{p}_k$ and $\mathbf{p}_{k+1}$ are the Cartesian positions of consecutive key poses. This term describes the spatial length of the pick-and-place motion and is related to task efficiency.
The execution time term is estimated as the sum of the durations of the five motion segments:
$$\tau_i = \sum_{k=1}^{5} \tau_{i,k}$$
where $\tau_{i,k}$ is the duration of the $k$-th segment.
For joint-space transfer segments, the time is constrained by the velocity and acceleration limits of the robot joints. For Cartesian vertical approach, lifting, and release segments, the time is constrained by the allowable end-effector velocity and the required vertical travel distance.
The localization uncertainty term is introduced to reflect the reliability of the visually estimated grasp pose:
$$U_i = (1 - s_i) + \frac{\kappa}{A_i + \epsilon}$$
where $s_i$ is the detection confidence score, $A_i$ is the image-plane area of the detected bounding box, $\kappa$ is a scaling coefficient, and $\epsilon$ is a small positive constant used to avoid division by zero. This term assigns a larger uncertainty penalty to low-confidence detections and small image regions. Such a design is reasonable because small and slender bolts are more sensitive to center localization errors during the subsequent grasping process. In this study, $\kappa$ was set to 1000, so that the bounding-box-area-related uncertainty term remained comparable to the confidence-related term before normalization.
The local interference risk term $I_i$ is used to describe the spatial proximity between the target candidate and other detected bolts. It is computed from the distances between $\mathbf{p}_i^{B}$ and $\mathbf{p}_j^{B}$, the positions of candidates $i$ and $j$ in the robot base coordinate system, together with $d_r$, a parameter controlling the influence range of neighboring objects. A larger value of $I_i$ indicates that the target is closer to other detected objects, which may increase the risk of accidental contact during the approach or lifting phase. In scenes with only one valid target, this term naturally becomes zero. In this study, $d_r$ was set to 25 mm to describe the local neighborhood range in which adjacent bolts may interfere with the vertical grasping motion.
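A feasible grasp candidate should satisfy the following constraints:
$$s_i \ge s_{\min}, \qquad \mathbf{p}_{g,i} \in \mathcal{W}, \qquad \mathbf{q}_{\min} \le \mathbf{q}_k \le \mathbf{q}_{\max}, \qquad \mathrm{dist}(\Gamma_i, \mathcal{O}) \ge d_{\mathrm{safe}}$$
where $s_{\min}$ is the minimum confidence threshold, $\mathcal{W}$ denotes the reachable workspace of the robot, $\mathbf{q}_k$ is the inverse-kinematic solution of each key pose, $\mathbf{q}_{\min}$ and $\mathbf{q}_{\max}$ are the joint limits, $\Gamma_i$ denotes the planned motion path, $\mathcal{O}$ represents obstacles or non-target objects in the workspace, and $d_{\mathrm{safe}}$ is the minimum allowable safety distance. These constraints respectively ensure perception reliability, reachability, kinematic feasibility, and collision safety.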
When multiple feasible candidates are present in the camera view, the next grasp target can be selected by
$$i^{*} = \arg\min_{i \in \mathcal{F}} J_i$$
where $\mathcal{F}$ is the set of candidates satisfying the feasibility constraints. When only one valid bolt is present, the formulation reduces to a feasibility and trajectory-cost evaluation for that target.
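The decision rule can be illustrated with a small sketch that filters candidates, normalizes each term over the feasible set, and picks the minimum-cost target. Several elements are assumptions made for illustration only: the weight values, the additive uncertainty form, the Gaussian proximity kernel used for the interference term, the sign convention in which confidence lowers the cost, and the precomputed `feasible` flag standing in for the reachability, joint-limit, and collision checks.

```python
import math


def localization_uncertainty(score, box_area, kappa=1000.0, eps=1e-6):
    """Assumed form: low confidence and small boxes both raise uncertainty."""
    return (1.0 - score) + kappa / (box_area + eps)


def interference_risk(i, positions, d_r=25.0):
    """Assumed Gaussian proximity kernel over all other detections (mm)."""
    return sum(math.exp(-math.dist(positions[i], p) ** 2 / (2.0 * d_r ** 2))
               for j, p in enumerate(positions) if j != i)


def min_max(values):
    """Min-max normalization to [0, 1]; constant inputs map to 0."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]


def select_target(cands, w, s_min=0.5):
    """Return the index of the minimum-cost feasible candidate, or None."""
    positions = [c["pos"] for c in cands]
    feasible = [i for i, c in enumerate(cands)
                if c["feasible"] and c["score"] >= s_min]
    if not feasible:
        return None
    d = min_max([cands[i]["dist"] for i in feasible])
    t = min_max([cands[i]["time"] for i in feasible])
    u = min_max([localization_uncertainty(cands[i]["score"], cands[i]["area"])
                 for i in feasible])
    r = min_max([interference_risk(i, positions) for i in feasible])
    s = min_max([cands[i]["score"] for i in feasible])
    cost = [w["D"] * d[k] + w["tau"] * t[k] + w["U"] * u[k]
            + w["I"] * r[k] - w["s"] * s[k] for k in range(len(feasible))]
    return feasible[min(range(len(feasible)), key=cost.__getitem__)]


# Hypothetical weights that sum to one (not the values selected in this study).
weights = {"D": 0.15, "tau": 0.15, "U": 0.30, "I": 0.30, "s": 0.10}
candidates = [
    {"dist": 0.9, "time": 4.0, "score": 0.95, "area": 2500.0,
     "pos": (0.0, 0.0, 0.0), "feasible": True},
    {"dist": 0.8, "time": 3.8, "score": 0.60, "area": 900.0,
     "pos": (200.0, 0.0, 0.0), "feasible": True},
    {"dist": 1.0, "time": 4.2, "score": 0.90, "area": 2400.0,
     "pos": (210.0, 0.0, 0.0), "feasible": False},
]
best = select_target(candidates, weights)
```

In this toy scene the second candidate is slightly closer and faster to reach, but it is less reliably localized and sits 10 mm from a neighboring bolt, so the first candidate is selected under the assumed weights.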
Overall, this optimization-oriented formulation connects the perception output, grasp pose generation, and trajectory execution in a unified framework. It does not replace the detection network or the trajectory planner. Instead, it provides an explicit decision-level representation for selecting feasible grasp targets and evaluating execution cost. This is beneficial for extending the current system to more complex industrial scenarios involving multiple visible bolts, cluttered placement, or mixed-object sorting. From an optimization perspective, the proposed formulation can be regarded as a constrained discrete decision problem over feasible grasp candidates. The uncertainty term and local interference term play roles similar to risk penalties in uncertainty-aware optimization. Although the present formulation does not solve a full distributionally robust optimization problem, it follows the same principle of incorporating uncertain perception quality into decision making. Recent studies on Wasserstein distributionally robust chance-constrained optimization [26] also demonstrate the importance of modeling uncertainty through ambiguity sets and feasibility constraints when exact distributions are unavailable.