In this study, we present a weld recognition framework that integrates the YOLO algorithm and an enhanced U-Net deep learning model. The methodology comprises three critical stages: First, upon acquiring weld images with complex backgrounds, YOLO is employed to accurately localize the weld plate region of interest (ROI), thereby mitigating computational redundancy and misidentification risks arising from whole-image feature extraction. Second, to address surface interferences such as occlusion and oxidation on the parent material, an attention mechanism is integrated into the U-Net architecture, enabling pixel-level precise segmentation of weld regions. Finally, a rigid calibration-based cross-modal information association method is designed, which performs dynamic equidistant sampling on segmentation results, establishes bidirectional mappings between 2D pixel coordinates and 3D point cloud coordinates, enhances path smoothness and spatial alignment accuracy, and provides foundational support for multi-layer, multi-bead welding path planning.
2.1.1. YOLOv8 Weld Area Recognition Preprocessing
When the system captures a weld image with a complex background, the initial stage in generating the robot welding trajectory entails autonomous recognition and localization of the weld region. This study employs a YOLO-based target detection algorithm to accurately locate the weld plate area, ensuring efficient spatial confinement for subsequent processing.
YOLOv8, a state-of-the-art target detection framework grounded in convolutional neural networks [14,15,16,17], involves several key stages during training: dataset compilation, image annotation, dataset partitioning, hyperparameter configuration, and iterative validation of model performance. The compilation and annotation of datasets serve as foundational pillars for training the YOLOv8 weld plate recognition model, as the quantity and quality of these datasets directly influence the precision and reliability of downstream models.
In this research, 1060 images of weld plates with varying dimensions, acquired under diverse lighting conditions via a 3D visual sensor, were systematically collected. Through the application of geometric transformations (rotation and mirroring), noise perturbations (Gaussian/salt-and-pepper noise and brightness adjustments), and blur processing (motion blur and out-of-focus blur), alongside additional augmentation strategies, the dataset was expanded to 10,600 images. The dataset was partitioned into training, validation, and test subsets at a ratio of 70%, 15%, and 15%, respectively, to ensure robust model generalization.
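As an illustration of this augmentation step, the following minimal sketch applies the listed transformations with OpenCV; the specific parameter ranges and the helper name `augment_image` are assumptions rather than the exact settings used in this study.

```python
import cv2
import numpy as np

def augment_image(img, rng=np.random.default_rng()):
    """Return augmented variants of a weld-plate image: rotation/mirroring,
    Gaussian and salt-and-pepper noise, brightness shift, motion and
    out-of-focus blur. Parameter values are assumed for illustration."""
    h, w = img.shape[:2]
    out = []

    # Geometric transformations: rotation and mirroring.
    M = cv2.getRotationMatrix2D((w / 2, h / 2), float(rng.uniform(-15, 15)), 1.0)
    out.append(cv2.warpAffine(img, M, (w, h)))
    out.append(cv2.flip(img, 1))                               # horizontal mirror

    # Noise perturbations: Gaussian noise, salt-and-pepper noise, brightness change.
    gauss = rng.normal(0, 10, img.shape).astype(np.float32)
    out.append(np.clip(img.astype(np.float32) + gauss, 0, 255).astype(np.uint8))
    sp = img.copy()
    mask = rng.random(img.shape[:2])
    sp[mask < 0.01] = 0
    sp[mask > 0.99] = 255
    out.append(sp)
    out.append(np.clip(img.astype(np.int16) + int(rng.uniform(-40, 40)), 0, 255).astype(np.uint8))

    # Blur processing: motion blur and out-of-focus (Gaussian) blur.
    k = np.zeros((9, 9), np.float32)
    k[4, :] = 1.0 / 9.0                                        # horizontal motion-blur kernel
    out.append(cv2.filter2D(img, -1, k))
    out.append(cv2.GaussianBlur(img, (11, 11), 0))
    return out
```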
The YOLOv8 model was trained utilizing the PyTorch 1.10.1 deep learning framework. The host system, equipped with an NVIDIA RTX 3070 GPU, leveraged the NVIDIA CUDA toolkit to accelerate both training and inference processes, optimizing computational efficiency. The trained model is capable of automatically localizing the weld plate within an input image, as illustrated in
Figure 2. Phase contrast imaging techniques, as explored by Yang et al. [18], offer enhanced resolution for micro-gap weld seam detection, which could further refine the precision of the proposed stereo vision system.
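For reference, a training setup of this kind can be sketched with the Ultralytics YOLOv8 API as follows; the use of the Ultralytics package, the dataset YAML name, and the hyperparameter values shown are assumptions, as they are not specified in the text.

```python
from ultralytics import YOLO

# Train a YOLOv8 detector on the weld-plate dataset (sketch; hyperparameters assumed).
model = YOLO("yolov8s.pt")            # pretrained YOLOv8s weights as a starting point
model.train(
    data="weld_plate.yaml",           # hypothetical dataset config (splits, class names)
    epochs=100,
    imgsz=640,
    device=0,                         # CUDA device, e.g., the RTX 3070 used in this work
)

# Inference: localize the weld plate ROI in a new image.
results = model("weld_image.png")
boxes = results[0].boxes.xyxy         # bounding boxes as (x1, y1, x2, y2) tensors
```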
As depicted in
Figure 3, the model training curve illustrates the convergence of key performance metrics. The trained model achieved 100% accuracy and recall on the test set, alongside a bounding box loss (box_loss) of 0.26%, demonstrating exceptional detection precision. These results indicate the model’s capability to accurately localize and determine the 2D coordinates of weld plates within complex background images. By integrating camera calibration parameters, the corresponding 3D coordinates of 2D pixel points are systematically calculated, enabling precise extraction of the 3D point cloud region of interest (ROI) for weldments. As demonstrated in
Figure 4, the outcomes of weldment recognition and point cloud extraction validate the framework’s effectiveness in bridging 2D image semantics and 3D spatial information.
2.1.2. Improvement and Training of U-Net Model for Weld Position Recognition
Following the extraction of the weld plate region of interest (ROI) via the YOLO algorithm, this paper presents an enhanced U-Net-based weld recognition approach to address the challenge of diminished image recognition accuracy caused by surface interferences—such as occlusion and oxidation—on the base material.
U-Net is a well-established semantic segmentation network [19], and its core architecture is an encoder–decoder structure that balances both global semantics and local details. Although U-Net performs well in common applications such as medical image segmentation, it exhibits the following limitations in industrial scene image segmentation:
First, the original U-Net employs a shallow encoder without a robust backbone network, resulting in inadequate feature representation for high-resolution images with complex weld textures. This limitation hinders the accurate identification of fuzzy boundaries and high-frequency details critical for precise segmentation. Backbone networks such as VGG16, which demonstrate proficiency in semantic extraction and multi-layer feature expression, address this gap by enhancing hierarchical feature learning and recognition accuracy. Second, the absence of an attention mechanism in the standard U-Net architecture may lead to blurred object boundaries and target misclassification, particularly in noisy industrial environments. By integrating the DAM module [20], which incorporates channel and spatial attention mechanisms, the model’s capability to prioritize discriminative local features is strengthened, enabling more accurate delineation of weld regions. Additionally, reliance on a single loss function in conventional U-Net implementations renders it ineffective in addressing challenges such as category imbalance, edge localization errors, and inconsistent regional segmentation. To mitigate these issues, this study combines Dice Loss [21], Focal Loss [22], and boundary refinement. These modifications enhance both segmentation accuracy and the model’s generalization capacity across diverse industrial welding scenarios.
To address these issues, based on the U-Net framework, this paper proposes an improved model. This model employs VGG16 as the encoder’s backbone network, incorporates the DAM attention module, and combines Dice Loss, Focal Loss, and Weighted IoU Loss [23] for joint training (as shown in
Figure 5). The model is developed with the aim of enhancing the accuracy and robustness of weld segmentation.
In the backbone network, VGG16 substitutes the original encoder module to enhance the feature extraction capabilities and model generalization performance. The structure is composed of five convolution blocks that form a multi-scale feature pyramid, with the number of channels increasing from 64 to 512. The input image is downsampled from 512 × 512 × 3 to 32 × 32 × 512, enabling the extraction of features ranging from shallow-level texture details to deep-level structural information. The shallow layers are dedicated to capturing texture features, the middle layers concentrate on detecting directional cues, and the deep layers emphasize the extraction of edge and topological features, which enhances the recognition of complex welds. The multi-layer fusion refines edge and spatial details, thereby boosting the accuracy of weld stripe recognition.
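The encoder substitution can be sketched as follows using torchvision’s VGG16; the block split points reproduce the channel and resolution progression described above (64–512 channels, 512 × 512 down to 32 × 32), while the class name and the decoder wiring are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class VGG16Encoder(nn.Module):
    """VGG16-based encoder producing five feature maps (64 to 512 channels)
    for U-Net skip connections. A sketch; the split points are assumed."""
    def __init__(self, pretrained=True):
        super().__init__()
        features = vgg16(pretrained=pretrained).features   # torchvision 0.11-style argument
        # Five convolution blocks of VGG16, split before each pooling stage.
        self.block1 = features[:4]     # 512x512x3  -> 512x512x64
        self.block2 = features[4:9]    # -> 256x256x128
        self.block3 = features[9:16]   # -> 128x128x256
        self.block4 = features[16:23]  # -> 64x64x512
        self.block5 = features[23:30]  # -> 32x32x512

    def forward(self, x):
        f1 = self.block1(x)
        f2 = self.block2(f1)
        f3 = self.block3(f2)
        f4 = self.block4(f3)
        f5 = self.block5(f4)
        return f1, f2, f3, f4, f5      # multi-scale features for the decoder skips

feats = VGG16Encoder(pretrained=False)(torch.randn(1, 3, 512, 512))
```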
During the model enhancement phase, this study introduces a dual attention module (DAM) into the enhanced U-Net architecture to improve the robustness of weld recognition under complex backgrounds. The DAM module incorporates both channel and spatial attention mechanisms to strengthen feature representation capabilities, enabling the network to better handle hierarchical visual information. Specifically, the module is embedded within cross-connection layers to dynamically enhance features during high-level semantic and low-level detail feature extraction stages. Channel attention enhances the activation of semantically relevant channel features, thereby boosting discriminative power for distinguishing between different weld types. Spatial attention, conversely, enhances the spatial consistency of target regions and mitigates background interference, improving the localization of weld boundaries. Through weighted fusion of these two attention mechanisms, the proposed approach effectively enhances both weld segmentation accuracy and boundary positioning precision without imposing substantial increases in model computational overhead. Experimental results indicate that this method exhibits superior robustness, particularly in challenging conditions involving occlusion, scale variations, and noise interference.
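Since the internal layout of the DAM module is not detailed here, the sketch below shows a generic channel-plus-spatial attention block with learnable weighted fusion that matches the description above; the reduction ratio, kernel size, and fusion scheme are assumptions.

```python
import torch
import torch.nn as nn

class DAM(nn.Module):
    """Dual attention module: channel and spatial attention branches fused with
    the input by learnable weights. A sketch; the details are assumed."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        # Channel attention: squeeze spatial dims, re-weight informative channels.
        self.channel_mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial attention: highlight weld regions, suppress background clutter.
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )
        # Learnable fusion weights for the two attention branches.
        self.alpha = nn.Parameter(torch.tensor(0.5))
        self.beta = nn.Parameter(torch.tensor(0.5))

    def forward(self, x):
        ca = x * self.channel_mlp(x)                        # channel-refined features
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.amax(dim=1, keepdim=True)], dim=1)
        sa = x * self.spatial_conv(pooled)                  # spatially-refined features
        return x + self.alpha * ca + self.beta * sa         # weighted fusion with identity
```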
In the loss function design stage, based on the task characteristics of weld identification and positioning, a joint loss strategy (Dice–Focal–Weighted IoU Loss) is proposed, integrating Dice Loss, Focal Loss, and Weighted IoU Loss. This strategy aims to enhance model performance in three aspects: regional consistency, category balance, and edge accuracy.
(1) Dice Loss to improve the overall consistency of the weld area
Dice Loss effectively alleviates the problem of imbalance between the weld area and the background and improves the segmentation accuracy by maximizing the overlap between the prediction and the true label. It is defined as follows:

$$ L_{\mathrm{Dice}} = 1 - \frac{2\sum_{i} p_i g_i + \varepsilon}{\sum_{i} p_i + \sum_{i} g_i + \varepsilon} $$

where $p_i$ and $g_i$ are the predicted probability and true label of the $i$-th pixel, respectively, and $\varepsilon$ is a smoothing factor (usually $10^{-6}$, to prevent the denominator from being zero). By maximizing the Dice coefficient, the model can enhance the overall modeling ability of the weld contour while maintaining the integrity of the region.
(2) Focal Loss to solve the problem of imbalance between weld and background categories
Since the weld area accounts for a small proportion of the image, it is easily affected by category imbalance during training, resulting in incomplete recognition. Focal Loss increases the loss weight of difficult-to-classify samples, improving the model’s ability to learn weld edges and complex shapes and thereby its extraction of weld areas under complex backgrounds. Its definition is as follows:

$$ L_{\mathrm{Focal}} = -\alpha \left(1 - p_t\right)^{\gamma} \log\left(p_t\right) $$

where $p_t$ represents the predicted probability of the positive sample, $\alpha$ is the balance factor (usually 0.25), and $\gamma$ is the adjustment factor (usually 2).
(3) Weighted IoU Loss to enhance the positioning accuracy of weld edges
In weld recognition, edge positioning accuracy directly affects the path planning effect. Because traditional IoU Loss has shortcomings in edge modeling, this paper introduces Weighted IoU Loss. By assigning higher weights to edge pixels and enhancing edge feature learning, the model can more accurately identify and locate weld boundaries and improve segmentation accuracy under complex weld morphologies. Its definition is as follows:

$$ L_{\mathrm{WIoU}} = 1 - \frac{\sum_{i} w_i\, p_i g_i}{\sum_{i} w_i \left(p_i + g_i - p_i g_i\right)} $$

where $w_i$ is the dynamic weight for each pixel, which is adaptively adjusted during training based on the importance of the weld edge.
This paper adopts a weighted fusion strategy to combine the three loss functions into the final loss function:

$$ L_{\mathrm{total}} = \lambda_1 L_{\mathrm{Dice}} + \lambda_2 L_{\mathrm{Focal}} + \lambda_3 L_{\mathrm{WIoU}} $$

where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are weight coefficients that control the contributions of the respective loss terms. Through experimental adjustment, this paper sets the weight coefficients to $\lambda_1 = 0.5$, $\lambda_2 = 0.25$, and $\lambda_3 = 0.25$, ensuring that the detailed modeling ability of the weld boundary is enhanced while maintaining the global feature expression.
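A compact PyTorch sketch of the joint Dice–Focal–Weighted IoU loss is given below; the construction of the edge-weight map (a dilation-minus-erosion band around the mask boundary) is an assumed scheme, since the text does not specify how the per-pixel weights $w_i$ are computed.

```python
import torch
import torch.nn.functional as F

def joint_loss(pred_logits, target, lam=(0.5, 0.25, 0.25),
               alpha=0.25, gamma=2.0, eps=1e-6):
    """Dice + Focal + Weighted IoU loss for binary weld segmentation.
    pred_logits, target: float tensors of shape (N, 1, H, W); target in {0, 1}."""
    p = torch.sigmoid(pred_logits)

    # Dice Loss: maximize overlap between prediction and ground truth.
    inter = (p * target).sum(dim=(1, 2, 3))
    dice = 1 - (2 * inter + eps) / (p.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3)) + eps)

    # Focal Loss: up-weight hard-to-classify pixels.
    bce = F.binary_cross_entropy_with_logits(pred_logits, target, reduction="none")
    p_t = p * target + (1 - p) * (1 - target)
    focal = (alpha * (1 - p_t) ** gamma * bce).mean(dim=(1, 2, 3))

    # Weighted IoU Loss: emphasize pixels near the weld boundary
    # (edge band via dilation minus erosion of the mask; assumed weighting scheme).
    dil = F.max_pool2d(target, 5, stride=1, padding=2)
    ero = -F.max_pool2d(-target, 5, stride=1, padding=2)
    w = 1 + 4 * (dil - ero)                                   # higher weight on the edge band
    w_inter = (w * p * target).sum(dim=(1, 2, 3))
    w_union = (w * (p + target - p * target)).sum(dim=(1, 2, 3))
    wiou = 1 - (w_inter + eps) / (w_union + eps)

    l1, l2, l3 = lam
    return (l1 * dice + l2 * focal + l3 * wiou).mean()
```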
In dataset construction, to enhance the segmentation accuracy of the weld area for U-Net, this study introduces a model-driven preprocessing approach: the weld plate region is extracted via the trained YOLOv8s model (mAP@0.5 = 99.5%), detection boxes with confidence scores exceeding 0.85 are identified, and a 5% boundary expansion is applied to the bounding boxes before cropping. This process retains spatial contextual details (approximately 3–4 mm) around the weld, generating high-purity input data for subsequent segmentation (as shown in
Figure 6). By leveraging cascaded model processing, this method effectively mitigates background interference, increases the proportion of effective pixels in the weld region, and significantly enhances the model’s capability to recognize weld features under complex background conditions.
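This cascaded preprocessing can be sketched as follows; the model filename and helper name are illustrative, while the 0.85 confidence threshold and 5% boundary expansion follow the values stated above (expansion per side is an assumption).

```python
import cv2
from ultralytics import YOLO

def crop_weld_roi(image_path, model_path="yolov8s_weld.pt",
                  conf_thresh=0.85, expand=0.05):
    """Detect the weld plate with YOLOv8 and crop an expanded ROI for U-Net.
    A sketch; the model path and function name are illustrative."""
    img = cv2.imread(image_path)
    h, w = img.shape[:2]
    det = YOLO(model_path)(img)[0]

    crops = []
    for box, conf in zip(det.boxes.xyxy.tolist(), det.boxes.conf.tolist()):
        if conf < conf_thresh:                     # keep only confident detections
            continue
        x1, y1, x2, y2 = box
        dx, dy = expand * (x2 - x1), expand * (y2 - y1)
        # Expand the box by 5% per side and clamp to the image bounds,
        # preserving spatial context around the weld.
        x1, y1 = max(0, int(x1 - dx)), max(0, int(y1 - dy))
        x2, y2 = min(w, int(x2 + dx)), min(h, int(y2 + dy))
        crops.append(img[y1:y2, x1:x2])
    return crops
```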
For the labeling process, the Labelme tool is employed to conduct precise pixel-level annotation of weld groove feature lines, with annotation errors maintained within ±2 pixels. The labeled data are then converted into VOC format mask images, which serve as training inputs for the semantic segmentation model (as illustrated in
Figure 7).
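For completeness, one way to rasterize the Labelme annotations into VOC-style masks is sketched below; the JSON fields follow Labelme’s standard output format, while the label-to-index mapping and line width are assumptions.

```python
import json
from PIL import Image, ImageDraw

def labelme_json_to_mask(json_path, out_path, class_ids={"weld": 1}):
    """Convert a Labelme annotation file into a single-channel VOC-style mask.
    A sketch; the label-to-index mapping is assumed."""
    with open(json_path) as f:
        ann = json.load(f)

    mask = Image.new("L", (ann["imageWidth"], ann["imageHeight"]), 0)
    draw = ImageDraw.Draw(mask)
    for shape in ann["shapes"]:
        cls = class_ids.get(shape["label"], 0)
        pts = [tuple(p) for p in shape["points"]]
        if shape.get("shape_type", "polygon") == "polygon":
            draw.polygon(pts, fill=cls)
        else:                                   # e.g., polyline groove feature lines
            draw.line(pts, fill=cls, width=3)
    mask.save(out_path)
```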
The improved U-Net model was compared against the original model. The Adam optimizer was employed in the training process to enhance the convergence rate. The training results are presented in
Figure 8. The loss and accuracy curves of the two models exhibited favorable convergence behavior, without discernible signs of overfitting or underfitting. Moreover, the training process was stable and reliable.
Comparative analysis demonstrates that the enhanced U-Net outperforms the original model in both loss function performance and mAP metrics. The improved model’s loss value decreased to 0.09 by the fifth epoch and eventually stabilized at 0.032, a 46.7% reduction compared to the original model’s stable loss of approximately 0.06. In terms of the mAP, the enhanced model stabilized at 0.88 with a peak of 0.908, significantly surpassing the original model’s stable mAP of 0.78 (peak: 0.803). Additionally, the training curve of the improved model exhibits smoother trends with less pronounced oscillations, indicating stronger feature extraction capabilities and training stability. These results highlight that the U-Net following architectural optimizations is better adapted to local feature segmentation in weld images, achieving faster convergence and higher recognition accuracy compared to its baseline counterpart.
To comprehensively assess the improved model’s performance in weld feature recognition, this study conducts a comparative analysis between the original U-Net and the optimized model using two key metrics: mean intersection over union (mIoU) and F1-Score.
Figure 9 illustrates the evolution curves of key performance indicators. The results demonstrate that the optimized model exhibits substantial improvements across all metrics:
Segmentation accuracy: The mean intersection over union (mIoU) increased from 0.796 to 0.887, indicating that segmented regions align more closely with ground-truth annotations and effectively capture fine-grained weld features.
Recognition performance: Precision and recall rose to 0.937 and 0.944, respectively, significantly reducing instances of false positives (incorrect detections) and false negatives (missed detections), which highlights the model’s enhanced ability to distinguish weld regions from complex backgrounds.
Comprehensive performance: The F1-Score, a balanced measure of accuracy and robustness, improved to 0.940, reflecting the model’s superior capability to handle challenges such as occlusion and texture variation in industrial welding scenarios.
In conclusion, the optimized U-Net model outperforms the baseline counterpart in segmentation boundary definition, detail reconstruction capability, and overall recognition accuracy, thereby validating the efficacy of incorporating the attention mechanism and implementing architectural optimizations.
Figure 10 presents the model’s recognition performance in real-world industrial welding scenarios, demonstrating its practical applicability and robustness.
In addition, we compared our model with PSPNet and DeepLab V3+, two mainstream semantic segmentation frameworks. The comparison results for mIoU, MPA (mean pixel accuracy), and inference time are shown in
Table 1. After data augmentation, there are 1245 training images, which are divided into training and validation sets at a ratio of 9:1. It is worth noting that, since our strategy first uses the YOLOv8 model to identify the image ROI and then applies the improved U-Net model on the ROI to identify the weld seam, the training set for the improved U-Net is first processed by the YOLOv8 model for target recognition and cropping, whereas PSPNet and DeepLab V3+ are trained directly on the uncropped images. In addition, the backbone feature extraction networks of U-Net, PSPNet, and DeepLab V3+ are VGG16, ResNet50, and Xception, respectively.
From
Table 1, we can see that the mIoU and MPA values of the three models show that the improved U-Net proposed in this paper achieves a significantly better result. However, it is worth noting that model performance is affected by many factors, such as the dataset, the choice of backbone network and loss function, the number of training epochs, and the learning rate. Moreover, our improved U-Net is trained on cropped images, so it is difficult to conclude from this comparison alone that the improved U-Net itself performs better. Nevertheless, for cases where the foreground pixel ratio is small and the number of training images is limited, our strategy of combining the YOLOv8 and U-Net models is better in terms of algorithm stability and detection effect. As shown in
Figure 11 and
Figure 12, which present detection comparisons for typical simple and complex scenes, respectively, all three models can locate the weld edge in simple scenes, but for objects or scenes that did not appear in the training images, the methods without the YOLO pre-selection box are prone to misidentification.
2.1.3. Extraction of Weld Skeleton Feature Points Based on Depth Image
Building on the enhanced U-Net weld segmentation results, this study presents a cross-modal weld skeleton feature extraction methodology. First, the Zhang–Suen thinning algorithm processes the segmented weld region to generate a single-pixel-width central skeleton, preserving geometric topology while reducing dimensional complexity. Next, path-length-based, equidistant, uniform interpolation samples feature points along the skeleton, ensuring consistent spatial discretization for subsequent trajectory planning. Finally, bidirectional mapping between 2D pixel coordinates (u and v) and 3D spatial coordinates (X, Y, and Z) is established to align semantic segmentation results with 3D geometric information, enabling precise weld trajectory planning by integrating visual semantics and spatial geometry.
To derive a precise weld centerline from the U-Net segmentation mask, the classical Zhang–Suen skeleton thinning algorithm is employed. This iterative technique systematically removes non-skeleton edge pixels while maintaining the image’s topological structure and connectivity, ensuring the preservation of critical geometric features. The result is a topologically simplified, single-pixel-width central skeleton that retains the complete structural integrity of the weld region. This representation is well suited for feature extraction in slender structures such as weld seams, providing an ideal foundation for subsequent geometric analysis, as demonstrated in
Figure 13.
To extract uniformly distributed feature points along the weld skeleton, this study employs a path-length-based uniform interpolation algorithm. The algorithm first computes the total traversal length of the skeleton, then identifies sampling positions at equal intervals along this path, and finally generates feature points via linear interpolation, thereby ensuring consistent spatial distribution. To further enhance the precision and smoothness of feature point placement, a polynomial fitting approach is incorporated. Specifically, the skeleton contour is approximated using a third-order polynomial, and the fitting curve is refined via the least squares method to minimize approximation errors, yielding a smoother and more continuous sequence of feature points. As illustrated in
Figure 14 and
Figure 15, feature points sampled from the complexly shaped weld skeleton (depicted as green dots) closely match the original structural geometry while maintaining nearly uniform spacing. This dual-step approach—combining path-length interpolation with polynomial optimization—ensures reliable extraction of evenly spaced, geometrically consistent feature points, which are critical for subsequent trajectory planning and robotic guidance in welding applications.
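A sketch of the skeletonization, equidistant sampling, and cubic fitting steps is given below; it uses scikit-image’s Zhang-style thinning and NumPy least-squares polynomial fitting, and the left-to-right ordering of skeleton pixels is a simplifying assumption for roughly horizontal welds.

```python
import numpy as np
from skimage.morphology import skeletonize

def sample_skeleton_points(mask, n_points=20):
    """Thin a binary weld mask to a 1-pixel skeleton, order its pixels along the
    path, and sample n_points at equal arc-length intervals, smoothed by a
    third-order polynomial fit. A sketch; the pixel-ordering step is simplified."""
    skel = skeletonize(mask > 0)
    ys, xs = np.nonzero(skel)
    order = np.argsort(xs)                     # assume the weld runs roughly left to right
    xs, ys = xs[order].astype(float), ys[order].astype(float)

    # Smooth the skeleton with a least-squares cubic fit y = f(x).
    coeffs = np.polyfit(xs, ys, deg=3)
    ys_fit = np.polyval(coeffs, xs)

    # Cumulative path length along the fitted curve.
    seg = np.hypot(np.diff(xs), np.diff(ys_fit))
    s = np.concatenate([[0.0], np.cumsum(seg)])

    # Equidistant arc-length positions, mapped back to pixel coordinates
    # by linear interpolation.
    s_samples = np.linspace(0, s[-1], n_points)
    u = np.interp(s_samples, s, xs)
    v = np.interp(s_samples, s, ys_fit)
    return np.stack([u, v], axis=1)            # (n_points, 2) pixel coordinates (u, v)
```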
Following the extraction of 2D weld skeleton feature points and the refinement of polynomial fitting, the precise mapping of 2D image semantic information to a 3D point cloud space is critical. To address this, this study employs a depth-aware mapping framework grounded in the Pinhole Camera Model, integrating camera calibration parameters to achieve coordinate conversion with sub-millimeter accuracy.
For the region of the depth map containing weld skeleton feature points, a 3 × 3-pixel neighborhood is sampled to extract local depth information. Median filtering is applied to eliminate depth discontinuity noise and isolate abnormal data points, ensuring reliable depth values for subsequent processing. Leveraging the intrinsic parameters from camera calibration, the framework utilizes OpenCV and PCL (Point Cloud Library) to transform the 2D skeleton feature points’ depth information into 3D coordinates within the camera coordinate system. The accuracy of this mapping is validated through a 3D point cloud visualization interface, as demonstrated in
Figure 16.
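The back-projection of the sampled (u, v) feature points into the camera frame can be sketched as follows using the pinhole model; fx, fy, cx, cy denote the calibrated intrinsics, the depth map is assumed to be in millimeters, and the 3 × 3 median filtering follows the description above.

```python
import numpy as np

def pixels_to_camera_points(points_uv, depth_mm, fx, fy, cx, cy):
    """Map 2D weld feature points (u, v) to 3D camera-frame coordinates using
    the pinhole model, with a 3x3 median filter on the local depth neighborhood
    to reject depth noise. A sketch; depth units (mm) are assumed."""
    h, w = depth_mm.shape
    pts_3d = []
    for u, v in points_uv:
        ui, vi = int(round(u)), int(round(v))
        # 3x3 neighborhood around the feature point, clamped to the image bounds.
        patch = depth_mm[max(0, vi - 1):min(h, vi + 2),
                         max(0, ui - 1):min(w, ui + 2)]
        valid = patch[patch > 0]                 # ignore missing depth readings
        if valid.size == 0:
            continue
        Z = float(np.median(valid))              # robust local depth estimate
        # Pinhole back-projection: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy.
        X = (u - cx) * Z / fx
        Y = (v - cy) * Z / fy
        pts_3d.append((X, Y, Z))
    return np.asarray(pts_3d)
```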
Following cross-modal mapping, the 2D weld skeleton feature points are accurately reconstructed as 3D spatial trajectories within the point cloud, generating a localized, semantic-labeled point cloud region. This mapping mechanism effectively aligns 2D semantic information with 3D spatial coordinates through the integrated optimization of depth perception and geometric constraints. By ensuring millimeter-level positional precision, it establishes a reliable spatial reference for subsequent welding path planning, bridging visual semantics and robotic kinematic requirements seamlessly.