The recognition and localization of rebar binding points are performed in two main stages. First, based on machine vision and deep learning techniques, the rebar binding points and their pose-defining keypoints are detected, and their three-dimensional spatial coordinates are obtained in the camera coordinate frame. Subsequently, using the hand–eye calibration results, the spatial coordinates of these keypoints are transformed from the camera coordinate frame to the manipulator base coordinate frame. Second, after acquiring the three-dimensional coordinates of all keypoints in the manipulator base frame, vector and matrix operations are employed to further compute the pose of the target rebar intersection with respect to the manipulator base coordinate frame. This provides accurate and reliable target pose information for subsequent manipulator path planning and rebar binding execution.
3.1.1. Rebar Binding Point Recognition
To obtain the three-dimensional pose information of rebar binding points, the recognition results are defined as follows. As illustrated in
Figure 2, the output includes the binding state of each rebar binding point, as well as the positions of five keypoints denoted as c, u, l, d, and r. These keypoints characterize the geometric structure of the rebar intersection region and provide essential spatial constraints for subsequent pose estimation of the target binding point.
Deep learning-based object detection methods exhibit strong robustness and good adaptability in complex environments. The YOLO family of algorithms [
39], as representative single-stage object detectors, achieves favorable real-time performance while maintaining competitive detection accuracy. Among them, YOLOv8 is a mature version that integrates pose and keypoint detection capabilities. Its pose model can directly output target keypoints, which effectively meets the requirements of binding state determination and spatial pose recognition for rebar binding points. Therefore, the YOLOv8n-pose model is selected as the baseline model for rebar binding point recognition in this study. As the smallest-scale network in the YOLOv8 family, YOLOv8n-pose offers relatively low computational complexity and fast inference speed while preserving high detection accuracy. However, in the rebar binding application scenario—characterized by limited onboard computing resources and complex working environments—there remains room for further improvement in real-time performance. To address this issue, this paper introduces targeted improvements to the network structure and loss function of the YOLOv8n-pose model, as detailed below.
The Ghost module decomposes a standard convolution into two components: intrinsic feature generation via primary convolution and redundant feature expansion through linear transformations. By employing computationally efficient channel-wise linear operations to generate redundant feature maps, the Ghost module significantly reduces model parameters and computational complexity while preserving the output feature dimensions. Inspired by the design principle of the Ghost module, this paper constructs a multi-scale extended Ghost convolution module, termed Multi-Scale Ghost Convolution (MSGConv), to enhance multi-scale feature representation capability. Furthermore, C2fMSGR and MSGELAN structures based on the proposed MSGConv module are designed and employed to replace selected standard convolution layers and C2f modules in the YOLOv8n-pose network. This strategy enables effective model lightweighting while maintaining strong feature representation capability and improving overall computational efficiency.
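To make the lightweighting effect concrete, the following sketch compares the parameter count of a standard convolution with a Ghost-style decomposition (primary convolution plus cheap depthwise expansion). The layer widths and kernel sizes are illustrative assumptions, not values taken from the YOLOv8n-pose network:

```python
# Parameter-count comparison: standard conv vs. a Ghost-style decomposition.
# Illustrative sketch only; layer shapes are assumed, not from the paper.

def conv_params(c_in, c_out, k):
    """Parameters of a standard k x k convolution (bias omitted)."""
    return c_in * c_out * k * k

def ghost_params(c_in, c_out, k_primary=1, k_cheap=3):
    """Ghost module: a primary conv produces c_out/2 intrinsic channels,
    then a cheap depthwise conv expands them to the remaining c_out/2."""
    primary = c_in * (c_out // 2) * k_primary * k_primary
    cheap = (c_out // 2) * k_cheap * k_cheap  # depthwise: one filter per channel
    return primary + cheap

c_in, c_out = 128, 256
std = conv_params(c_in, c_out, 3)   # 128*256*9 = 294,912
ghost = ghost_params(c_in, c_out)   # 128*128 + 128*9 = 17,536
print(std, ghost, round(std / ghost, 1))
```

Even this rough count shows an order-of-magnitude parameter reduction for the same output width, which is the mechanism the lightweight design exploits.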
The MSGConv module extends the design philosophy of GhostConv—namely, “primary feature generation plus lightweight convolutional expansion”—by introducing a parallel multi-scale branch structure to further enhance feature representation capability. Specifically, a standard convolution layer is first employed to project the input feature map $X \in \mathbb{R}^{C_{in} \times H \times W}$ from $C_{in}$ channels to an intermediate channel dimension $C_{out}/2$, yielding the main-branch feature map $Y_0$. Subsequently, $Y_0$ is partitioned along the channel dimension into $s$ sub-feature maps, denoted as $Y_1, \dots, Y_s$, where each sub-feature map contains $C_{out}/(2s)$ channels. For different sub-feature maps, depthwise convolutions or group convolutions with kernel size $k_i \times k_i$ are applied, thereby constructing multi-scale feature representations with diverse receptive fields. Finally, the main-branch feature $Y_0$ and the outputs of all scale-specific branches are concatenated along the channel dimension and fused through a 1 × 1 convolution layer to restore the target output channel dimension $C_{out}$. By leveraging group and depthwise convolutions within a parallel multi-scale architecture, the proposed MSGConv module enhances feature representation capability while significantly reducing the number of parameters and computational complexity, thus preserving the lightweight nature of the module. The forward computation of MSGConv can be formally expressed as:

$$Y = \mathrm{Conv}_{1\times1}\big(\mathrm{Concat}\big[\,Y_0,\ \phi_{k_1}(Y_1),\ \dots,\ \phi_{k_s}(Y_s)\,\big]\big),$$

where $\phi_{k_i}(\cdot)$ denotes the depthwise (or group) convolution with kernel size $k_i \times k_i$ applied to the $i$-th sub-feature map.
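The forward computation can be sketched at shape level. This is a minimal numpy mock-up, assuming a 1 × 1 primary convolution, three branches with kernel sizes 3/5/7, and random weights; it verifies only the channel bookkeeping, not any trained behavior:

```python
import numpy as np

# Shape-level sketch of the MSGConv forward pass (numpy, CHW layout).
# Kernel sizes, branch count, and weight shapes are illustrative assumptions.

rng = np.random.default_rng(0)

def conv1x1(x, w):
    """Pointwise convolution: x is (C_in, H, W), w is (C_out, C_in)."""
    c, h, wdt = x.shape
    return (w @ x.reshape(c, -1)).reshape(w.shape[0], h, wdt)

def dwconv(x, k):
    """Depthwise k x k convolution with same padding (random per-channel filters)."""
    c, h, wdt = x.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    filt = rng.standard_normal((c, k, k))
    out = np.zeros_like(x)
    for i in range(k):
        for j in range(k):
            out += filt[:, i, j][:, None, None] * xp[:, i:i + h, j:j + wdt]
    return out

def msgconv(x, c_out, kernels=(3, 5, 7)):
    s = len(kernels)
    # primary conv: C_in -> C_out/2 intrinsic channels (main branch Y0)
    y0 = conv1x1(x, rng.standard_normal((c_out // 2, x.shape[0])))
    # split Y0 into s sub-maps; apply a different kernel size to each branch
    subs = np.split(y0, s, axis=0)
    branches = [dwconv(sub, k) for sub, k in zip(subs, kernels)]
    # concatenate main branch + multi-scale branches, fuse back to C_out
    cat = np.concatenate([y0] + branches, axis=0)
    return conv1x1(cat, rng.standard_normal((c_out, cat.shape[0])))

x = rng.standard_normal((64, 16, 16))   # C_in = 64, H = W = 16
y = msgconv(x, c_out=96)                # C_out/2 = 48 splits evenly into 3 branches
print(y.shape)  # (96, 16, 16)
```

The concatenation of the main branch (C_out/2) with the s branch outputs (C_out/2 in total) restores exactly C_out channels before the fusing 1 × 1 convolution, as in the formula above.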
Based on the above design, a C2fMSGR module is proposed, which integrates the information splitting and efficient feature aggregation mechanism of the C2f structure with multi-scale Ghost units equipped with residual connections (Multi-Scale Ghost with Residual, MSGA). Specifically, the input feature map is first projected to $2c$ channels via the convolution layer $\mathrm{Conv}_1$ and then split along the channel dimension into two parts, denoted as $X_1$ and $X_2$, each containing $c$ channels. Subsequently, $X_2$ is sequentially fed into $n$ MSGA Bottleneck units to extract multi-scale features and generate multiple intermediate feature maps. Finally, $X_1$ and all intermediate features are concatenated along the channel dimension and fused through a 1 × 1 convolution layer $\mathrm{Conv}_2$, producing an output feature map with $C_{out}$ channels. By inheriting the advantages of efficient information flow and residual feature fusion from the C2f structure while introducing multi-scale Ghost-based feature extraction, the proposed C2fMSGR module enhances feature representation capability under relatively low computational complexity.
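The split/aggregate channel flow of this C2f-style pattern can be traced with a small sketch. The hidden width c, the unit count n, and the stubbed residual bottleneck are assumptions for illustration; following the standard C2f pattern, both split halves plus every bottleneck output are retained for the final concatenation:

```python
import numpy as np

# Channel-dimension bookkeeping for the C2f-style split/aggregate pattern.
# Bottlenecks are stubbed as channel-preserving residual ops; the hidden
# width c and unit count n are illustrative assumptions.

def c2f_flow(x, c, n):
    """x: (C_in, H, W). Returns the concatenated pre-fusion tensor."""
    rng = np.random.default_rng(1)
    # 1x1 projection to 2c channels, then split into X1, X2 (c channels each)
    w = rng.standard_normal((2 * c, x.shape[0]))
    y = (w @ x.reshape(x.shape[0], -1)).reshape(2 * c, *x.shape[1:])
    x1, x2 = y[:c], y[c:]
    feats = [x1, x2]            # X2 counts as the first intermediate feature
    h = x2
    for _ in range(n):          # n bottleneck units chained on X2
        h = h + 0.1 * h         # stub for an MSGA bottleneck (residual form)
        feats.append(h)
    return np.concatenate(feats, axis=0)   # (2 + n) * c channels before fusion

x = np.ones((32, 8, 8))
cat = c2f_flow(x, c=16, n=3)
print(cat.shape)   # (2 + 3) * 16 = 80 channels
```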
In addition, an MSGELAN structure is designed for feature fusion. This structure first employs two parallel 1 × 1 convolution branches to project the input feature into a low-rank channel space, where branch $B_1$ includes an activation function and branch $B_2$ does not. Both branches output $C/4$ channels. Based on the output of branch $B_1$, an MSGBottleneck unit (denoted as $M$) is introduced to expand the channel dimension to $C/2$. Subsequently, the feature maps from $B_1$, $B_2$, and $M$ are concatenated along the channel dimension and fused via a 1 × 1 convolution layer to generate the final output feature. By preserving multi-scale semantic information while effectively reducing computational complexity, the MSGELAN structure is particularly well suited for feature fusion in the Feature Pyramid Network (FPN) or Neck stage.
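A quick check of the MSGELAN channel arithmetic, under the assumed branch widths above (two C/4 branches plus a C/2 bottleneck output):

```python
# Channel arithmetic of the MSGELAN fusion block: two 1x1 branches at C/4
# each, one MSG bottleneck expanding to C/2, concatenated back to C before
# the final 1x1 fusion. Widths are assumptions consistent with the text.

def msgelan_channels(c):
    b1 = c // 4          # 1x1 branch with activation
    b2 = c // 4          # 1x1 branch without activation
    m = c // 2           # MSGBottleneck output built on top of b1
    return b1 + b2 + m   # channels entering the final 1x1 fusion layer

print(msgelan_channels(256))  # concatenation restores the full width: 256
```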
The overall network architecture of the lightweight YOLOv8n-pose model after the proposed modifications is illustrated in
Figure 3.
After applying the lightweight structural optimization, the number of model parameters is reduced from 3.3 M to 1.97 M, and the computational complexity decreases from 8.3 GFLOPs to 5.7 GFLOPs. Moreover, real-time inference performance exceeding 48 FPS is achieved on the onboard Jetson AGX Orin platform (Nvidia Corporation, Santa Clara, CA, USA). Although the lightweight design significantly reduces computational overhead, it may also weaken the feature representation capability for keypoints to some extent, thereby imposing higher requirements on convergence stability and hard-sample learning. Consequently, it is necessary to introduce a more stable loss function that is more sensitive to hard samples to improve the overall convergence behavior of keypoint prediction.
The original YOLOv8 model employs the BCEWithLogitsLoss, which exhibits limited adaptability under complex background interference and dynamically changing sample distributions, making it difficult to effectively balance the learning of high-confidence samples and low-confidence hard samples during training. To address this issue, an adaptive threshold focal loss function is introduced in this paper. By dynamically adjusting the modulation factor, the proposed loss function enhances the model’s focus on hard samples and improves the convergence stability and robustness of keypoint prediction.
In the standard binary cross-entropy (BCE) loss, let the network output be $z$, the predicted probability after the Sigmoid function be $p = \sigma(z)$, and the ground-truth label be $y \in \{0, 1\}$. The BCE loss is defined as:

$$L_{BCE} = -\big[\, y \log p + (1 - y)\log(1 - p) \,\big].$$
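The numerically stable BCE-with-logits form (as used by PyTorch's BCEWithLogitsLoss) can be checked against this definition; the sample values below are arbitrary:

```python
import math

# Numerically stable BCE-with-logits, matching the BCE definition:
# L = -[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(z).

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_with_logits(z, y):
    # log-sum-exp form avoids overflow for large |z|
    return max(z, 0.0) - z * y + math.log(1.0 + math.exp(-abs(z)))

z, y = 0.7, 1.0
p = sigmoid(z)
direct = -(y * math.log(p) + (1 - y) * math.log(1 - p))
print(abs(direct - bce_with_logits(z, y)) < 1e-12)  # True
```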
Based on this formulation, the proposed loss function retains the BCE loss as the core term while introducing three key enhancements. First, to unify the confidence representation of positive and negative samples, a dynamic prediction confidence $p_t$ is defined as:

$$p_t = \begin{cases} p, & y = 1 \\ 1 - p, & y = 0 \end{cases}$$

where $p_t = p$ for positive samples and $p_t = 1 - p$ for negative samples, reflecting the model’s confidence in the current prediction. For each training batch, the mean confidence is computed as:

$$\bar{p}_t = \frac{1}{N}\sum_{i=1}^{N} p_{t,i},$$

and an exponential moving average is employed to update the dynamic threshold $\tau$:

$$\tau \leftarrow \beta\,\tau + (1 - \beta)\,\bar{p}_t,$$

where $\beta$ is the momentum coefficient. This threshold characterizes the overall prediction difficulty of the network at the current training stage. Subsequently, an adaptive focal modulation factor $\gamma$ is introduced and defined as:

$$\gamma = -\log \tau.$$

As training progresses, when the model exhibits unstable predictions and higher sample difficulty (i.e., lower $\tau$), the value of $\gamma$ increases adaptively, thereby overcoming the limitation of a fixed $\gamma$ that cannot accommodate different training phases. Finally, samples are partitioned according to their confidence levels, and different modulation strategies are applied. For high-confidence samples ($p_t \ge \tau$), the modulation term is defined as $(1 - p_t)^{\gamma}$; for low-confidence samples ($p_t < \tau$), the modulation term is defined as $1/p_t$. When $p_t$ is small, the term $1/p_t$ becomes large, effectively enhancing hard-sample learning and alleviating gradient vanishing. The final modulation factor and the proposed loss function are given by:

$$M(p_t) = \begin{cases} (1 - p_t)^{\gamma}, & p_t \ge \tau \\ 1/p_t, & p_t < \tau \end{cases}, \qquad L_{ATF} = -\,M(p_t)\,\log(p_t).$$
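A runnable sketch of one consistent instantiation of this adaptive scheme follows. The EMA momentum beta, the gamma = -log(tau) mapping, and the 1/p_t low-confidence term follow the description above but are assumptions, not the paper's exact hyperparameters:

```python
import math

# Sketch of an adaptive threshold focal loss consistent with the text.
# The EMA momentum, the gamma = -log(tau) mapping, and the piecewise
# modulation are assumptions, not the paper's exact formulation.

def atf_loss(probs, labels, tau, beta=0.9, eps=1e-7):
    """probs: sigmoid outputs; labels: 0/1; tau: dynamic threshold (updated per batch)."""
    # dynamic prediction confidence p_t for each sample
    p_t = [p if y == 1 else 1.0 - p for p, y in zip(probs, labels)]
    # batch mean confidence and EMA update of the dynamic threshold
    mean_pt = sum(p_t) / len(p_t)
    tau = beta * tau + (1.0 - beta) * mean_pt
    gamma = -math.log(max(tau, eps))      # harder batches -> larger gamma
    losses = []
    for pt in p_t:
        pt = min(max(pt, eps), 1.0 - eps)
        if pt >= tau:                     # high-confidence: focal down-weighting
            mod = (1.0 - pt) ** gamma
        else:                             # low-confidence: amplify hard samples
            mod = 1.0 / pt
        losses.append(-mod * math.log(pt))
    return sum(losses) / len(losses), tau

loss, tau = atf_loss([0.9, 0.6, 0.2], [1, 1, 0], tau=0.5)
print(round(loss, 4), round(tau, 4))
```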
3.1.2. Rebar Binding Point Localization
Using the proposed rebar binding point recognition model, the pixel coordinates of five keypoints corresponding to each binding point can be obtained in the image plane. Based on the camera imaging model, the three-dimensional coordinates $P_C = (X_C, Y_C, Z_C)^{T}$ of each keypoint in the camera coordinate frame $\{C\}$ can be computed through the following matrix transformation:

$$\begin{bmatrix} X_C \\ Y_C \\ Z_C \end{bmatrix} = Z_C\, K^{-1} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix},$$

where $(u, v)$ denotes the pixel coordinates of the keypoint in the image plane, $Z_C$ is the depth value measured by the depth camera at the corresponding pixel, and $K$ represents the camera intrinsic matrix, given by:

$$K = \begin{bmatrix} f_x & 0 & u_0 \\ 0 & f_y & v_0 \\ 0 & 0 & 1 \end{bmatrix}.$$
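Back-projection with these quantities reduces to a few lines; the intrinsic values below are illustrative placeholders, not calibrated parameters:

```python
import numpy as np

# Back-projecting a pixel (u, v) with measured depth Zc into the camera
# frame via the intrinsic matrix K. Intrinsic values are illustrative.

K = np.array([[600.0,   0.0, 320.0],
              [  0.0, 600.0, 240.0],
              [  0.0,   0.0,   1.0]])

def pixel_to_camera(u, v, z_c, K):
    """P_C = Zc * K^{-1} [u, v, 1]^T."""
    return z_c * (np.linalg.inv(K) @ np.array([u, v, 1.0]))

p_c = pixel_to_camera(400.0, 300.0, 0.8, K)
print(p_c)  # X = 0.8*(400-320)/600, Y = 0.8*(300-240)/600, Z = 0.8
```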
After camera calibration and hand–eye calibration, the homogeneous transformation matrix ${}^{B}T_{C}$, which describes the pose of the camera coordinate frame $\{C\}$ with respect to the manipulator base coordinate frame $\{B\}$, can be obtained. Accordingly, the keypoints can be transformed from the camera coordinate frame to the manipulator base coordinate frame as follows:

$$\begin{bmatrix} P_B \\ 1 \end{bmatrix} = {}^{B}T_{C} \begin{bmatrix} P_C \\ 1 \end{bmatrix}.$$
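Applied in code, the transformation is a single homogeneous matrix product; the rotation and translation below are placeholder values, not an actual hand–eye calibration result:

```python
import numpy as np

# Transforming a keypoint from the camera frame to the manipulator base
# frame with a homogeneous matrix. Rotation/translation are placeholders.

def make_transform(R, t):
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

# example: camera rotated 180 deg about x relative to base, offset by t
R = np.array([[1.0,  0.0,  0.0],
              [0.0, -1.0,  0.0],
              [0.0,  0.0, -1.0]])
B_T_C = make_transform(R, t=np.array([0.2, 0.0, 0.5]))

def camera_to_base(p_c, B_T_C):
    """P_B = B_T_C @ [P_C; 1], then drop the homogeneous coordinate."""
    return (B_T_C @ np.append(p_c, 1.0))[:3]

p_b = camera_to_base(np.array([0.1, 0.05, 0.8]), B_T_C)
print(p_b)  # [0.3, -0.05, -0.3]
```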
Since the five keypoints are not coplanar in three-dimensional space, a dedicated target coordinate frame $\{T\}$
is constructed to accurately describe the spatial pose of the rebar binding point, as illustrated in
Figure 4. Specifically, the vector from keypoint u to d is defined as the x-axis direction, while the vector from keypoint r to l is defined as an auxiliary direction (referred to as the left axis). The positive direction of the y-axis is then obtained by the cross product of the x-axis and the left axis, and the positive direction of the z-axis is determined by the cross product of the x-axis and the y-axis. The keypoint c is selected as the origin of the coordinate frame, thereby fully defining the spatial pose of the target rebar binding point.
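This frame construction can be verified numerically; the keypoint coordinates below are illustrative, not measured data:

```python
import numpy as np

# Constructing the target frame from the five keypoints as described:
# x = u->d, left = r->l, y = x x left, z = x x y, origin at c.
# Keypoint coordinates are illustrative assumptions.

def binding_point_frame(c, u, d, l, r):
    x = d - u
    x = x / np.linalg.norm(x)
    left = l - r
    y = np.cross(x, left)
    y = y / np.linalg.norm(y)
    z = np.cross(x, y)                 # unit length by construction (x ⟂ y)
    R = np.column_stack([x, y, z])     # frame axes expressed in the base frame
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = c
    return T

c = np.array([0.5, 0.2, 0.1])
T = binding_point_frame(c,
                        u=np.array([0.50, 0.25, 0.10]),
                        d=np.array([0.50, 0.15, 0.10]),
                        l=np.array([0.45, 0.20, 0.12]),
                        r=np.array([0.55, 0.20, 0.12]))
R = T[:3, :3]
print(np.allclose(R @ R.T, np.eye(3)))  # True: axes form an orthonormal frame
```

Note that the auxiliary left axis need not be exactly perpendicular to x; the two cross products guarantee an orthonormal, right-handed frame regardless.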
On this basis, as illustrated in Figure 5, the target coordinate frame $\{T\}$ of the rebar binding point is first rotated by 45° about its y-axis to obtain an intermediate coordinate frame $\{T'\}$. Subsequently, $\{T'\}$ is rotated by −45° about its z-axis, resulting in the final coordinate frame $\{G\}$. Based on $\{G\}$, the target pose matrix of the binding tool, denoted as ${}^{B}T_{G}$, is constructed. This matrix represents the desired pose of the manipulator end-effector in the manipulator base coordinate frame. By controlling the manipulator to drive the binding tool to this target pose, the automatic binding operation at the corresponding rebar intersection point can be accomplished.
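Because both rotations are about axes of the moving frame, they compose by right-multiplication. A sketch with an identity placeholder for the pose of the target frame in the base frame (the actual pose comes from the keypoint-based construction above):

```python
import numpy as np

# Composing the tool target pose: rotate the target frame by 45 deg about
# its y-axis, then -45 deg about the resulting z-axis. The base pose of the
# target frame is an illustrative identity placeholder.

def rot_y(deg):
    a = np.radians(deg)
    return np.array([[ np.cos(a), 0.0, np.sin(a)],
                     [ 0.0,       1.0, 0.0      ],
                     [-np.sin(a), 0.0, np.cos(a)]])

def rot_z(deg):
    a = np.radians(deg)
    return np.array([[np.cos(a), -np.sin(a), 0.0],
                     [np.sin(a),  np.cos(a), 0.0],
                     [0.0,        0.0,       1.0]])

def to_homogeneous(R):
    T = np.eye(4)
    T[:3, :3] = R
    return T

B_T_T = np.eye(4)   # placeholder pose of the target frame in the base frame
# intrinsic (moving-axis) rotations compose by right-multiplication
B_T_G = B_T_T @ to_homogeneous(rot_y(45.0)) @ to_homogeneous(rot_z(-45.0))
print(np.allclose(B_T_G[:3, :3] @ B_T_G[:3, :3].T, np.eye(3)))  # True
```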