1. Introduction
With the wide application of robots in industrial manufacturing, a welding method based on teaching and reproduction has emerged. Traditional robot welding relies on workers to manually define an a priori torch position trajectory, which is inefficient and easily affected by time-varying dimensional errors of workpieces, such as thermal deformation [1,2]. To track the welding seam in real time and enable automatic adjustment of welding robots, researchers have designed advanced sensors to obtain the spatial position and category information of the welding seam. With the rapid development of machine vision, structured light vision has found extensive application in intelligent robots [3,4] due to its high precision [5] and the rich characteristic information it provides about welding processes [6].
As shown in Figure 1a,c, the characteristics of single-line structured light projection stripes vary significantly across different welding seams. This variability is valuable for acquiring the positions of welding seam feature points and facilitates convenient classification of the stripes, which is crucial for adjusting welding process parameters effectively. However, as shown in Figure 1b, common noise sources in the welding process, including splash, smog, and arc reflection light, can be confused with the stripes captured by industrial cameras. Several morphology-based methods have been proposed to extract welding seam features. Li et al. [7] used a Kalman filter to track the meaningful laser stripes on the image through a window to reduce the influence of image noise during welding, and decomposed the laser stripes into line–junction–line combination fragments to obtain the weld type and the positions of feature points on the image. Yang et al. [8] applied a kernelized correlation filters algorithm to realize seam tracking with feature point marks, which can adapt to different types of weld seams. Two common challenges exist in morphology-based research. One is that feature point (or region) extraction and weld classification are usually performed in two stages. The other is that, although these models possess a degree of resistance to noise, their tracking accuracy degrades under sustained high-noise conditions [9].
The evolution of deep learning technology in computer vision has led to the refinement of object detection and semantic segmentation, enabled by advances in computing hardware and the development of convolutional neural networks (CNNs). These tasks are frequently transferred to seam feature extraction tasks utilizing line structured light, contributing to the development of an anti-noise model for effective feature extraction, as evidenced by existing research.
Gao et al. [10] built the YOLO-WELD model based on YOLOv5 for detecting weld feature points. RepVGG is used as the backbone network, and the NAM attention mechanism and a lightweight head layer, RD-Head, are introduced to improve detection performance. Deng et al. [11] improved CenterNet, used DenseNet as the backbone network, and adjusted the head layer to separate the feature point position regression task from the weld classification task. This prevented the general feature detection network from associating different categories with feature points on a single laser stripe image. Liu et al. [12] reported a method to extract the feature points of multi-layer and multi-pass welding seams by using a conditional generative adversarial network (CGAN) and an improved CNN model. Carion et al. [13] presented DETR (detection transformer), a detection model without NMS (non-maximum suppression), which uses a CNN as a feature extractor and an improved Transformer [14] to further mine the global information of the input picture. Because of the Transformer's self-attention mechanism, DETR incurs considerable computational complexity (in this paper, DETR is used as a typical algorithm to compare model inference time).
However, these methods based on neural networks require considerable computing resources; in turn, they necessitate higher computing performance from the central control equipment integrated into welding robot systems. Notably, such systems commonly lack GPU acceleration support. To address this challenge, lightweight neural network techniques have been extensively explored in computer vision. Typical approaches include pruning, knowledge distillation, weight quantization, and especially structural simplification. A variety of effective lightweight backbones have emerged, ranging from MobileNet [15,16,17,18] and ShuffleNet [19,20] to newer designs such as Shvit [21] and LSNet [22]. Alongside these developments, many studies have begun integrating attention modules [23,24,25] into lightweight architectures to counterbalance the performance degradation often induced by aggressive model simplification.
Liu et al. [26] extended the concept of depthwise separable convolution from MobileNet to 3-D convolution. They combined the 2-D input image with temporal information from image sequences to assess the current welding state. Additionally, three attention mechanisms were introduced to enhance the model's robustness against welding noise. Ma et al. [27] developed the WeldNet model specifically for welding tasks, encompassing starting point detection and seam classification. The model's architecture employs ShuffleNet as the backbone network. However, although these methods adopt lightweight network architectures and demonstrate real-time performance on GPUs, their CPU performance remains limited, and they require large-scale training datasets, thereby hindering the cost-effective deployment of neural networks in industrial applications.
There are also some feature extraction methods based on semantic segmentation models. Zou et al. [28] developed a lightweight laser stripe image segmentation model by replacing the backbone with ShuffleNetv2. They achieved laser stripe segmentation and improved the model through a pruning operation based on the trainable parameters of the Batch Normalization (BN) layer, combined with a welding seam tracking algorithm. Despite maintaining a high inference speed on the CPU, this method does not directly provide the specific position of the welding seam in a one-stage manner.
Additionally, an efficient ViT (Vision Transformer) [29] was proposed that uses cascaded group attention as a substitute for multi-head self-attention (MHSA). This attention structure has fewer parameters and achieves higher inference speed because it is memory-efficient.
Utilizing a lightweight model in welding tasks can decrease the number of network parameters and floating-point operations (FLOPs), making it more suitable for deployment on edge computing devices with limited computational capabilities. Nevertheless, this comes at the cost of diminished performance in abstract feature extraction, potentially weakening the model’s noise robustness.
To address the issues above, the objectives of this paper are as follows:
To propose a one-stage model for (1) extracting the positions of light stripe feature points of weld structures and (2) performing weld classification (e.g., identifying groove types) by analyzing distinctive geometric patterns in the light stripe.
To enhance the model's robustness to image noise during welding without affecting its real-time performance.
To reduce the model's parameters and computational cost so that it can be deployed on computing platforms where GPU acceleration is unavailable, such as an embedded industrial computer (EIC), while maintaining a high inference speed to meet the real-time requirements of welding seam tracking systems in the industrial field.
To test and validate, in comparison with typical one-stage target detection models, the feature point positioning performance under different noise conditions, the weld classification performance, and the real-time performance of the proposed model.
The paper comprises five sections. Section 2 details the hardware devices and datasets employed in the study, while Section 3 provides a structural description of WeldLight. Following this, Section 4 outlines the specific configuration of the experiments and the results obtained from the comparative verification of various models. Section 5 presents the conclusions.
3. WeldLight Structure
3.1. Overall Structure
As shown in Figure 6, WeldLight can be divided into three parts: the backbone layer, the neck layer, and the head layer.
The backbone layer is responsible for extracting features and outputting feature map branches with different downsample factors, which are then delivered to the neck layer. MobileNetV3-Small is chosen in the proposed work. In the MobileNetV3 block, a sequence consisting of a depthwise convolution (DWC) with a BN layer and different nonlinearities (i.e., h-swish and ReLU), followed by a channel-fusing pointwise convolution (PWC) with a BN layer, is proposed to replace one standard 3 × 3 convolution with BN and ReLU. Notably, a similar DP block (Figure 6) is used in the neck and head layers of WeldLight, employing the ReLU6 function as the nonlinearity to better suit feature-point localization. In terms of convolution computation, the quantitative relation between the DP block and one standard 3 × 3 convolution is given by (4):

$$\frac{k^{2} C_{in} H W + C_{in} C_{out} H W}{k^{2} C_{in} C_{out} H W} = \frac{1}{C_{out}} + \frac{1}{k^{2}} \quad (4)$$

where the numerator and denominator of the fraction represent the computations of the DP block and the standard 3 × 3 convolution, respectively, $k$ denotes the kernel size, and $C_{in}$, $C_{out}$, and $H \times W$ denote the input channels, output channels, and the spatial dimensions of the input feature map. Generally, the number of output channels of a convolution operation, especially in the deep layers, is much larger than the size of the convolution kernel. Consequently, the computation can be reduced to approximately $1/k^{2}$ (i.e., about $1/9$ for a 3 × 3 kernel).
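For illustration, the following is a minimal PyTorch sketch of such a DP block (DWC + BN + nonlinearity, followed by PWC + BN, with ReLU6 as used in the neck and head layers), together with the FLOP ratio of Equation (4); the exact layer ordering and hyperparameters of WeldLight may differ.

```python
import torch
import torch.nn as nn

class DPBlock(nn.Module):
    """Depthwise + pointwise convolution pair replacing one standard k x k convolution.
    A minimal sketch following the description in the text, not the exact WeldLight code."""
    def __init__(self, in_ch: int, out_ch: int, k: int = 3, stride: int = 1):
        super().__init__()
        self.dwc = nn.Conv2d(in_ch, in_ch, k, stride, padding=k // 2,
                             groups=in_ch, bias=False)      # per-channel spatial filtering
        self.bn1 = nn.BatchNorm2d(in_ch)
        self.act = nn.ReLU6(inplace=True)
        self.pwc = nn.Conv2d(in_ch, out_ch, 1, bias=False)  # 1 x 1 channel fusion
        self.bn2 = nn.BatchNorm2d(out_ch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.bn2(self.pwc(self.act(self.bn1(self.dwc(x)))))

def flop_ratio(k: int, c_out: int) -> float:
    """Computation ratio of the DP block to a standard k x k convolution, Eq. (4)."""
    return 1 / c_out + 1 / (k * k)

print(flop_ratio(3, 128))  # ~0.12, i.e., roughly 1/k^2 when c_out >> k^2
```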
In the neck layer, a feature pyramid network (FPN) [34] construction is applied. Since the stages of the backbone produce feature maps of different resolutions with different semantic levels, the deepest three branches produced by the backbone, each with a different downsampling stride with respect to the input image, are fused by the FPN, as shown in Figure 6. Although merging more branches in the neck layer allows the model to integrate deep abstract semantic information with shallow coarse-grained feature information and thus achieve better small-scale feature detection, the number of branches should not increase without limit, because more branches seriously slow down the inference speed [20]. Through experimentation, the optimal branch number of three was determined to minimize computational complexity and latency while preserving a certain level of positioning accuracy.
As depicted in Figure 7b, the improved top-down pathway and lateral connection employ concatenation instead of the addition used in the original approach shown in Figure 7a. Channel adjustment facilitates merging the feature map from the PWC with the upsampled output of the coarser-resolution feature map, which contains stronger semantic information. A DP block is also utilized to substitute the standard convolution. Ultimately, the neck layer generates a fused feature map that is subsequently input into the head layer.
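A minimal sketch of one concatenation-based top-down fusion step of Figure 7b is given below, reusing the DPBlock sketch above; the placement of the channel adjustment and the channel widths are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConcatFPNStage(nn.Module):
    """One top-down fusion step using concatenation instead of addition (cf. Figure 7b).
    Channel numbers are illustrative, not the exact WeldLight values."""
    def __init__(self, lateral_ch: int, top_ch: int, out_ch: int):
        super().__init__()
        self.lateral_pwc = nn.Conv2d(lateral_ch, out_ch, 1, bias=False)  # channel adjustment
        self.fuse = DPBlock(out_ch + top_ch, out_ch)  # DP block replaces a standard conv

    def forward(self, lateral: torch.Tensor, top: torch.Tensor) -> torch.Tensor:
        # Upsample the coarser, semantically stronger map to the lateral resolution.
        top_up = F.interpolate(top, size=lateral.shape[-2:], mode="nearest")
        return self.fuse(torch.cat([self.lateral_pwc(lateral), top_up], dim=1))
```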
Additionally, within the neck layer, the network attaches a proposed attention module to the most downsampled feature map output by the backbone, aiming to enhance the model's robustness against noise, as detailed in Section 3.2.
In the head layer, it is necessary to predict the position of feature points, which is realized by a heatmap tensor and an offset tensor. The task also requires classifying the laser stripes to determine the weld type, which requires the head layer to output a one-hot tensor. The standard convolution block for decoupling the input tensor of the head layer is also replaced by a DP block. At this stage, the PWC not only fuses features across channels but also adjusts the number of channels. Since WeldLight does not classify each feature point separately but instead predicts a category for the entire laser stripe image, the categorical output of the head layer mainly derives from the lower-resolution feature map in the deepest backbone layer, enabling large-scale perception and producing a global category prediction rather than point-specific classification.
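The following sketch illustrates a plausible head layout producing the three outputs described above, with the image-level classification branch pooled from the deepest feature map; the structure and channel widths are assumptions rather than the exact WeldLight head, and DPBlock refers to the earlier sketch.

```python
class WeldHead(nn.Module):
    """Sketch of a decoupled head: heatmap and offset from the fused neck feature,
    one image-level class prediction from the deepest (lowest-resolution) feature map."""
    def __init__(self, neck_ch: int, deep_ch: int, num_classes: int):
        super().__init__()
        self.heatmap = nn.Sequential(DPBlock(neck_ch, neck_ch), nn.Conv2d(neck_ch, 1, 1))
        self.offset = nn.Sequential(DPBlock(neck_ch, neck_ch), nn.Conv2d(neck_ch, 2, 1))
        self.classifier = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                        nn.Linear(deep_ch, num_classes))

    def forward(self, fused: torch.Tensor, deepest: torch.Tensor):
        # Heatmap scores in (0, 1); offsets are unconstrained regression values;
        # classification logits describe the whole laser stripe image, not single points.
        return torch.sigmoid(self.heatmap(fused)), self.offset(fused), self.classifier(deepest)
```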
3.2. Cascaded Channel Attention Module (CCAM)
The deepest output feature map of the backbone is delivered to the FPN. It should be noted that a significant portion of its features are related to noise. These features will be merged in the neck layer and will ultimately affect the outputs of the head layer. In order to weaken the influence of noise, CCAM is proposed to filter the noise.
As shown in Figure 8, the core of CCAM is a channel attention (CA) block, which serves as a feature selector. Initially, a global average pooling layer reduces the input feature map to a single vector, which is then fed to the first fully connected (FC) layer to reduce its dimensionality. A second FC layer restores the feature dimension. The resulting vector is passed through a softmax function to map it to the range [0, 1], which enables channel-wise weighting of the input feature map through element-wise multiplication. Channels with higher weights are considered more significant.
In the CCAM design, we adopt softmax rather than sigmoid for channel weighting. While sigmoid is commonly used, it treats channels independently and may assign similar weights across them, limiting discrimination. Softmax instead normalizes the weights to sum to one, emphasizing the relative differences among channels and thereby supporting more effective suppression of noise and highlighting of informative features. This choice is also consistent with established deep learning practices, where softmax is widely employed for normalized weighting in multi-channel scenarios [35]. To summarize, the refined feature map $\tilde{X}$ produced by the CA block is computed as

$$\tilde{X} = X \odot \mathrm{Softmax}\left(W_{2}\left(W_{1}\,\mathrm{GAP}(X)\right)\right)$$

where $X$ denotes the input feature, $\mathrm{GAP}(\cdot)$ denotes global average pooling, and $W_{1} \in \mathbb{R}^{\frac{C}{gr} \times \frac{C}{g}}$ and $W_{2} \in \mathbb{R}^{\frac{C}{g} \times \frac{C}{gr}}$ denote the parameters of the two FC stages, in which $r$, as shown in Figure 8, stands for the reduction ratio (16 in WeldLight), $C$ denotes the number of input channels of CCAM, and $g$ stands for the number of groups (splits) of the input tensor.
In addition, CCAM adopts a cascaded group structure to further reduce the computational overhead. As illustrated in Figure 6 and Figure 8, the deepest output feature map of the backbone is split into $g$ parts, which are then fed into the CCAM. In the CCAM, the output from each CA block is merged with the next split by element-wise addition to enhance the information fusion between channels. The process is repeated across all $g$ groups.
For CCAM, the computation generated by the FC layers is approximately $2C^{2}/(gr)$. Without the cascaded group structure, direct application of channel attention to all $C$ channels leads to an FC computation of approximately $2C^{2}/r$. The computation of the CA module, when employing the cascaded group structure, is therefore approximately $1/g$ of that of the ungrouped structure. Moreover, this structure enables multi-layer attention effects, allowing the model to refine its feature selection at each stage.
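A minimal PyTorch sketch of the CA block and the cascaded group structure described above is shown next; details such as the hidden FC dimension and the assumption that the channel count is divisible by g are illustrative.

```python
class CABlock(nn.Module):
    """Channel attention with softmax-normalized channel weights (cf. Figure 8)."""
    def __init__(self, ch: int, r: int = 16):
        super().__init__()
        hidden = max(ch // r, 1)
        self.fc1 = nn.Conv2d(ch, hidden, 1)   # dimensionality reduction (1x1 conv as FC)
        self.fc2 = nn.Conv2d(hidden, ch, 1)   # dimensionality restoration

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = F.adaptive_avg_pool2d(x, 1)       # global average pooling -> (B, C, 1, 1)
        w = torch.softmax(self.fc2(self.fc1(w)), dim=1)  # normalize across channels
        return x * w                          # channel-wise re-weighting

class CCAM(nn.Module):
    """Cascaded group structure: split channels into g groups, cascade CA outputs by addition."""
    def __init__(self, ch: int, g: int = 4, r: int = 16):
        super().__init__()
        self.g = g  # ch is assumed to be divisible by g
        self.blocks = nn.ModuleList([CABlock(ch // g, r) for _ in range(g)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        splits = torch.chunk(x, self.g, dim=1)
        outs, prev = [], None
        for s, ca in zip(splits, self.blocks):
            if prev is not None:
                s = s + prev          # merge the previous refined split with the next one
            prev = ca(s)
            outs.append(prev)
        return torch.cat(outs, dim=1)
```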
3.3. Loss Function
The head layer produces three outputs: a heatmap tensor, an offset tensor, and a classification tensor (i.e., a one-hot encoding), where $W$ and $H$ are the input image dimensions of the model and $c$ is the number of categories. Throughout the training process, the loss is determined by comparing the predictions generated by the head layer with the ground truth label values, whose shapes must match those of the predictions. This comparison aims to optimize the neural network parameters to achieve an optimal score or regression value prediction for the model's output. Consequently, it is essential to address the generation of labels in this context. The coordinate of a feature point, after being mapped to the corresponding bin of the label heatmap tensor, is $\tilde{p}_{k} = \lfloor p_{k} / R \rfloor$, where $p_{k}$ is the coordinate of the $k$-th feature point with respect to the input image size of the model (i.e., 512 × 512 for WeldLight), and $R$ denotes the downsampling stride factor. On the label heatmap, positive sample points are represented using Gaussian circles

$$Y_{xy} = \exp\left(-\frac{(x - \tilde{p}_{x})^{2} + (y - \tilde{p}_{y})^{2}}{2\sigma_{p}^{2}}\right)$$

to accelerate the convergence of the model [36,37], where $\sigma_{p}$ is a factor that varies with the size of the object [37], and $(\tilde{p}_{x}, \tilde{p}_{y})$ stands for the center of the Gaussian circle, which is the ideal positive location of the feature point mapped onto the heatmap.
In light of the inherent rounding effect of floor division, relying solely on the heatmap to restore the position information of feature points results in an accuracy loss. Offsets are introduced to address this issue. The label value of the offset of every feature point is denoted by

$$o_{k} = \frac{p_{k}}{R} - \left\lfloor \frac{p_{k}}{R} \right\rfloor$$

where $o_{k}$ is a 2-D coordinate representing the relative position proportion of a feature point within the corresponding bin of the label heatmap. Therefore, the offset is represented by a two-channel tensor. Additionally, the offset tensor bins are activated only at the corresponding positive locations within the heatmap tensor.
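A small NumPy sketch of how a feature-point coordinate maps to its heatmap bin and offset label under the definitions above; the stride R = 4 used in the example is an illustrative value, not necessarily the stride used in WeldLight.

```python
import numpy as np

def point_labels(p_xy: np.ndarray, R: int):
    """Map an image-space feature point to its heatmap bin and offset label.
    p_xy: (x, y) in input-image coordinates; R: downsampling stride factor."""
    bin_xy = np.floor(p_xy / R).astype(int)   # heatmap bin: floor(p / R)
    offset = p_xy / R - bin_xy                # sub-bin offset in [0, 1)
    return bin_xy, offset

bin_xy, offset = point_labels(np.array([317.0, 254.0]), R=4)
print(bin_xy, offset)   # [79 63] [0.25 0.5]
```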
Let $\hat{Y}_{xy}$ be the predicted score at location $(x, y)$ of the heatmap produced by WeldLight, and let $Y_{xy}$ be the value at the same location on the label heatmap with Gaussian circles; a variant of focal loss [38] is used to optimize the heatmap prediction, denoted by

$$L_{hm} = -\frac{1}{N} \sum_{x=1}^{W} \sum_{y=1}^{H} \begin{cases} \left(1 - \hat{Y}_{xy}\right)^{\alpha} \log\left(\hat{Y}_{xy}\right) & \text{if } Y_{xy} = 1 \\ \left(1 - Y_{xy}\right)^{\beta} \left(\hat{Y}_{xy}\right)^{\alpha} \log\left(1 - \hat{Y}_{xy}\right) & \text{otherwise} \end{cases}$$

where $N$ is the number of feature points in a structured light stripe image, $H$ and $W$ stand for the spatial size of the heatmap, $\alpha$ is utilized to fine-tune the weights for challenging and easily locatable points, and $\beta$ governs the weighting of non-central values within the Gaussian circle. Importantly, the loss weight assigned to a predicted score in the non-central term increases as the distance from the Gaussian center in the label grows.
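A sketch of this focal-loss variant in PyTorch follows, matching the reconstructed equation above; the values α = 2 and β = 4 are common choices and are assumptions here, not values confirmed by the paper.

```python
def heatmap_focal_loss(pred, gt, alpha: float = 2.0, beta: float = 4.0, eps: float = 1e-6):
    """Variant of focal loss over a Gaussian-encoded label heatmap.
    pred, gt: tensors of shape (B, 1, H', W'); pred scores are expected in (0, 1)."""
    pos = gt.eq(1).float()
    neg = 1.0 - pos
    pred = pred.clamp(eps, 1 - eps)
    pos_loss = ((1 - pred) ** alpha) * torch.log(pred) * pos
    neg_loss = ((1 - gt) ** beta) * (pred ** alpha) * torch.log(1 - pred) * neg
    n = pos.sum().clamp(min=1)    # number of feature points
    return -(pos_loss.sum() + neg_loss.sum()) / n
```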
The essence of offset prediction is a regression task. In order to realize accurate offset prediction, a smooth L1 loss function [39] is utilized as follows:

$$L_{off} = \frac{1}{N} \sum_{k=1}^{N} \mathrm{Smooth}_{L1}\left(\hat{o}_{k} - o_{k}\right), \qquad \mathrm{Smooth}_{L1}(x) = \begin{cases} 0.5 x^{2} & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$$

where $N$ is, again, the number of feature points in a structured light stripe image, and $\hat{o}_{k}$ denotes the predicted offset of the $k$-th feature point. The smooth L1 loss leverages the advantages of both the L1 and L2 loss functions. It prevents gradient explosion in the initial stages of training and provides a more tempered gradient during back-propagation toward the end of training, which contributes significantly to improved model convergence.
Cross-entropy loss (CE loss) is applied to the weld classification task. Let $\hat{K}_{i}$ represent the predicted score of the structured light stripe image for class $i$, predicted by WeldLight, and let the Boolean variable $K_{i}$ denote the label value of class $i$ in the one-hot encoding $K$; hence, the loss produced by classification is denoted by

$$L_{cls} = -\sum_{i=1}^{c} K_{i} \log\left(\hat{K}_{i}\right)$$

where $c$ stands for the total number of welding seam types. Finally, the total loss $L$ is the overall training objective, which is derived as follows:

$$L = L_{hm} + \lambda_{off} L_{off} + \lambda_{cls} L_{cls}$$

where $\lambda_{off}$ and $\lambda_{cls}$ represent the constant weights assigned to the offset and classification losses, both configured to a value of 1.
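Combining the three terms, the sketch below assembles the overall training objective with both weights set to 1, reusing the heatmap_focal_loss sketch above; the offset mask restricting supervision to positive heatmap locations is shown explicitly and is an implementation assumption.

```python
def total_loss(pred_hm, gt_hm, pred_off, gt_off, off_mask, pred_cls, gt_cls,
               lambda_off: float = 1.0, lambda_cls: float = 1.0):
    """L = L_hm + lambda_off * L_off + lambda_cls * L_cls, with both weights set to 1."""
    l_hm = heatmap_focal_loss(pred_hm, gt_hm)
    # Offsets are supervised only at positive heatmap locations (off_mask in {0, 1}).
    n = off_mask.sum().clamp(min=1)
    l_off = (torch.nn.functional.smooth_l1_loss(pred_off, gt_off, reduction="none")
             * off_mask).sum() / n
    l_cls = torch.nn.functional.cross_entropy(pred_cls, gt_cls)  # gt_cls: class indices
    return l_hm + lambda_off * l_off + lambda_cls * l_cls
```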
3.4. Post-Processing
A correlation exists between the prediction of point numbers and laser stripe classification, as discussed in [
11]. Specifically, if the classification result is predicted as “lap-seam,” the model identifies the first two bins on the heatmap tensor with the highest predicted scores. These are then combined with the corresponding bins on the offset tensor, leading to the identification of the final predicted positions. In this scenario, there is no need to consider the suppression by predicted score thresholds. However, as depicted in
Figure 9, the heatmap tensor undergoes resizing to match the uniform shape of the original image, covering it entirely. The colors in the heatmap indicate the predicted scores assigned to the potential feature points.
Notably, the figure illustrates a broken laser stripe, where both ends are confidently identified as potential feature points by WeldLight, despite their proximity. An empirical solution to this issue involves applying a max-pooling operation to the heatmap. Bins with unchanged values correspond to effective potential feature points by comparing the heatmap tensor before and after pooling. In cases of closely located potential feature points, only the one with the highest predicted score is retained, effectively resolving the issue.
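A sketch of this max-pooling-based peak selection: bins left unchanged by local max-pooling are treated as candidate feature points, and only the top-scoring bins are kept. The kernel size and top-k value are illustrative choices.

```python
def heatmap_peaks(heatmap, k: int = 3, top_k: int = 2):
    """Keep local maxima of the predicted heatmap and return the top_k bins.
    heatmap: (1, 1, H', W') tensor of predicted scores."""
    pooled = torch.nn.functional.max_pool2d(heatmap, k, stride=1, padding=k // 2)
    peaks = heatmap * (pooled == heatmap).float()     # unchanged bins = local maxima
    scores, idx = torch.topk(peaks.flatten(), top_k)  # e.g., top_k = 2 for "lap-seam"
    w = heatmap.shape[-1]
    ys = torch.div(idx, w, rounding_mode="floor")
    xs = idx % w
    return list(zip(xs.tolist(), ys.tolist(), scores.tolist()))
```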
4. Verification
4.1. Test Environment
The network was constructed using the PyTorch 2.0.0 deep learning framework based on Python 3.8. The dataset utilized for training and testing the model was obtained from the welding seam positioning and tracking sensor mentioned earlier. The model was trained on a computer with an Nvidia 2080Ti GPU and an Intel Xeon E5-2683 CPU. The model was then converted to the ONNX format, and OpenVINO was used for DNN inference on an EIC with an Intel Core i7-8650U CPU. The network initializes its backbone with the weights of MobileNetV3 pretrained on ImageNet to expedite training and achieve rapid convergence. The FC layers were implemented using standard 1 × 1 convolutions, and the convolution layer parameters were initialized using the He initialization method. As for the BN layer parameters, the weights were initialized to one and the biases to zero.
Adam was used as the optimizer (momentum = 0.9, weight decay = 0), and a cosine annealing strategy was adopted for learning rate scheduling starting from the initial learning rate. The batch size of the training process was set to 32. The maximum number of training epochs was set to 500, and an early stopping strategy was applied such that training was terminated at the 50th epoch once the validation loss had stabilized. These hyperparameter values were determined empirically to balance convergence speed and training stability.
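A sketch of this training configuration in PyTorch; the model placeholder and the learning-rate value are illustrative, since the exact initial learning rate is not reproduced here.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 8, 3)  # placeholder module standing in for WeldLight
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,   # lr value is a placeholder
                             betas=(0.9, 0.999), weight_decay=0)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=500)
# Train with batch size 32 for up to 500 epochs, calling scheduler.step() once per epoch,
# and stop early once the validation loss has stabilized.
```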
Upon observing convergence of the validation loss during the training stage, the model was used to evaluate both the positioning performance of the feature points and the classification performance of the welding seams. The comparison models, YOLOv5n and DETR, were initialized with weights trained on the COCO dataset and subsequently fine-tuned using the default training configurations provided by the respective open-source projects.
4.2. Weld Classification Performance
For welding tasks, the baseline network used in WeldLight is CenterNet, which, similar to DETR and YOLOv5n employed for comparison, belongs to the family of object detection networks. Since a welding image should contain only a single class, previous studies have commonly adopted the class of the point with the highest classification confidence as the image-level label. WeldLight, however, incorporates a dedicated prediction branch to directly infer the class of the entire image rather than distinguishing the classes of individual feature points, thereby better adapting to the requirements of weld seam classification in welding operations. In contrast, DETR, through its end-to-end prediction mechanism, is able to model the relationship between the global image context and the classification of feature points, thereby reducing the risk of assigning multiple classes to different points within the same image. By design, YOLOv5n assigns a class label to each detected point, which necessitates determining the overall image category based on the point with the highest classification confidence.
Notably, all three models accurately classified the 150 images in the test set. In pursuit of a more granular comparison, the images underwent adjustments, including random cropping, translation, deformation, and brightness modification. A selection of the test set images utilized in the classification performance evaluation is presented in Figure 10. The adjusted test set was used to evaluate the three models, and the classification performance was assessed using the confusion matrix obtained for each model, as presented in Figure 11.
The core of weld classification is a multi-classification task. To streamline comparisons and address discrepancies in sample numbers, the evaluation of the three models in welding seam classification employed the following three metrics:

$$P_{w} = \sum_{i=1}^{c} w_{i} P_{i}, \qquad R_{w} = \sum_{i=1}^{c} w_{i} R_{i}, \qquad F1_{w} = \sum_{i=1}^{c} w_{i} F1_{i}$$

$$P_{i} = \frac{TP_{i}}{TP_{i} + FP_{i}}, \qquad R_{i} = \frac{TP_{i}}{TP_{i} + FN_{i}}, \qquad F1_{i} = \frac{2 P_{i} R_{i}}{P_{i} + R_{i}}$$

where $P_{w}$, $R_{w}$, and $F1_{w}$ stand for the weighted-average precision, recall, and F1-score, respectively; $P_{i}$, $R_{i}$, and $F1_{i}$ are the precision, recall, and F1-score of class $i$, respectively; and $w_{i}$ represents the ratio of the sample size of category $i$ to the total sample size. True positives (TP), false negatives (FN), and false positives (FP) can be determined from the confusion matrix [40].
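A NumPy sketch computing the weighted-average precision, recall, and F1-score directly from a confusion matrix, following the definitions above:

```python
import numpy as np

def weighted_prf(cm: np.ndarray):
    """cm[i, j] = number of samples of true class i predicted as class j."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    precision = np.divide(tp, tp + fp, out=np.zeros_like(tp), where=(tp + fp) > 0)
    recall = np.divide(tp, tp + fn, out=np.zeros_like(tp), where=(tp + fn) > 0)
    f1 = np.divide(2 * precision * recall, precision + recall,
                   out=np.zeros_like(tp), where=(precision + recall) > 0)
    w = cm.sum(axis=1) / cm.sum()   # w_i: share of class i in the test set
    return (w * precision).sum(), (w * recall).sum(), (w * f1).sum()
```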
Table 2 displays the metric values of the three models on the adjusted test set. Notably, WeldLight exhibited the highest values across the three key metrics assessing classification performance, with a weighted-average precision of 0.9674, a weighted-average recall of 0.9666, and a weighted-average F1-score of 0.9668.
4.3. Feature Point Positioning Performance
Welding-related noise, such as splash, reflection, and smog, can compromise the quality of images captured by industrial cameras. Figure 12 demonstrates that all three models reliably distinguish laser stripes from noise and precisely localize feature points, maintaining robust performance even under noise conditions.
To evaluate robustness under different noise levels, the test dataset was divided into high-noise and low-noise subsets, each accounting for 50% of the samples. After post-processing, the three models were evaluated for their feature point positioning performance on both subsets. The predictions were generated as coordinates of the feature points with respect to the original image resolution. The purpose of this evaluation is to assess the models' feature point positioning accuracy under varying noise conditions and to analyze their noise robustness. The performance metrics include the mean absolute error (MAE), root mean square error (RMSE), standard deviation ($\sigma$), and the mean Euclidean distance ($E_{d}$) obtained by projecting the predicted and label coordinates into the CCF. The positioning errors of the feature points were calculated along both the X-axis and Y-axis of the image, as follows:

$$\Delta x_{k} = x_{k} - \hat{x}_{k}, \qquad \Delta y_{k} = y_{k} - \hat{y}_{k}$$

where $(x_{k}, y_{k})$ and $(\hat{x}_{k}, \hat{y}_{k})$ stand for the label coordinate and the predicted coordinate of the $k$-th feature point, respectively.
The MAE was used to evaluate the static precision of the positioning and is derived as follows:

$$\mathrm{MAE}_{x} = \frac{1}{N} \sum_{k=1}^{N} \left| \Delta x_{k} \right|, \qquad \mathrm{MAE}_{y} = \frac{1}{N} \sum_{k=1}^{N} \left| \Delta y_{k} \right|$$

where $N$ denotes the total number of feature points in the designated test set and keeps the same definition in the following formulas.
The RMSE highlights the impact of predicted outliers and is derived as follows:

$$\mathrm{RMSE}_{x} = \sqrt{\frac{1}{N} \sum_{k=1}^{N} \Delta x_{k}^{2}}, \qquad \mathrm{RMSE}_{y} = \sqrt{\frac{1}{N} \sum_{k=1}^{N} \Delta y_{k}^{2}}$$
$\sigma$ was used to evaluate the stability/robustness of the model [7] and is derived as follows:

$$\sigma_{x} = \sqrt{\frac{1}{N} \sum_{k=1}^{N} \left( \Delta x_{k} - \overline{\Delta x} \right)^{2}}, \qquad \sigma_{y} = \sqrt{\frac{1}{N} \sum_{k=1}^{N} \left( \Delta y_{k} - \overline{\Delta y} \right)^{2}}$$

where $\overline{\Delta x}$ ($\overline{\Delta y}$) is the average value of $\Delta x_{k}$ ($\Delta y_{k}$) over all feature points.
The metric $E_{d}$ was employed to assess the error level of the three models after projecting the 2-D positioning results into 3-D space. It is derived as follows:

$$E_{d} = \frac{1}{N} \sum_{k=1}^{N} \sqrt{\left( X_{k} - \hat{X}_{k} \right)^{2} + \left( Y_{k} - \hat{Y}_{k} \right)^{2} + \left( Z_{k} - \hat{Z}_{k} \right)^{2}}$$

where $(X_{k}, Y_{k}, Z_{k})$ and $(\hat{X}_{k}, \hat{Y}_{k}, \hat{Z}_{k})$, which stand for the label and predicted coordinates of the feature points with respect to the CCF, can be calculated by Equation (3) based on $(x_{k}, y_{k})$ and $(\hat{x}_{k}, \hat{y}_{k})$. An a priori assumption is made that the calibration result of the line structured light plane equation and the intrinsic matrix of the camera is ideal, attributing any error solely to inaccuracies in the feature points' locations within the PCF, as predicted by the positioning models.
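A NumPy sketch of the positioning metrics defined above, assuming the label and predicted coordinates are available as arrays and that the 3-D points have already been obtained via Equation (3):

```python
import numpy as np

def axis_metrics(gt: np.ndarray, pred: np.ndarray):
    """MAE, RMSE, and standard deviation of the error along one image axis (pixels)."""
    err = gt - pred
    mae = np.abs(err).mean()
    rmse = np.sqrt((err ** 2).mean())
    sigma = np.sqrt(((err - err.mean()) ** 2).mean())
    return mae, rmse, sigma

def mean_euclidean_ccf(gt_3d: np.ndarray, pred_3d: np.ndarray) -> float:
    """E_d: mean Euclidean distance between label and predicted points in the CCF (mm).
    gt_3d, pred_3d: (N, 3) arrays of points projected into the CCF."""
    return float(np.linalg.norm(gt_3d - pred_3d, axis=1).mean())
```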
The absolute error curves representing the test outcomes of the three models are depicted in Figure 13, Figure 14 and Figure 15, respectively. The green curve corresponds to the absolute error measured on the low-noise test set, the blue curve corresponds to the absolute error measured on the high-noise test set, and the red horizontal line denotes the MAE. All metrics related to the positioning errors of the feature points were calculated and arranged in Table 3 and Table 4; the MAE, RMSE, and $\sigma$ were assessed along the X and Y directions of the image, and the average values along the two directions were calculated, represented as the average-MAE, average-RMSE, and average-$\sigma$. The outcomes of WeldLight without CCAM are also included in Table 3 and Table 4. Notably, there was no specialized positioning performance analysis for different weld types in the test set, due to the inherent uncertainty in weld types during welding processes.
The outcomes of the positioning performance evaluation for the weld feature points on the low-noise test set are listed in Table 3. Within the low-noise scenario, the three models are basically at the same level in terms of MAE and RMSE. Notably, regarding the average-$\sigma$, WeldLight with and without CCAM exhibited outstanding performance compared to the other two models, achieving values of 1.943 pixels and 1.922 pixels, respectively. In terms of $E_{d}$, WeldLight showed the best performance, reaching 0.197 mm. The utilization of CCAM did not yield a significant difference under the low-noise test set condition.
The experimental findings from the positioning performance assessment of the welding seam feature points on the high-noise test set are presented in Table 4. Within the high-noise environment, WeldLight exhibited an average MAE, average RMSE, and $E_{d}$ of 1.736 pixels, 2.407 pixels, and 0.205 mm, respectively. These metrics outperformed both YOLOv5n and DETR. Moreover, compared to WeldLight without CCAM, the model integrating CCAM showed enhanced performance across nearly all positioning metrics, reflecting more stable and accurate localization results. This improvement suggests that CCAM contributes to filtering out noise interference and enhancing the robustness of feature extraction, which is particularly beneficial in high-noise welding scenarios. This contrasts with the measurement results observed on the low-noise test set. WeldLight also excels in average-$\sigma$, registering a value of 2.217 pixels, underscoring its superior stability in accurately localizing weld feature points within high-noise welding images.
4.4. Lightweight Level
The model's lightweight characteristics were evaluated based on multiple metrics, including the total number of model parameters (Params), floating point operations (FLOPs), mean latency, and frames per second (FPS). Params serve as an indirect measure of computational complexity and memory utilization, while FLOPs represent the computational cost of the model [41]. A lower mean latency (or a higher FPS) implies reduced inference time, a critical factor for optimal real-time performance in the welding seam tracking system employing this model. This aspect is particularly crucial for complex seam tracking applications, welding speed, and welding process control. The mean latency of each of the three models was determined through 100 tests using a single image, and the FPS value was calculated by dividing 1000 ms by the mean latency.
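A sketch of how the mean latency and FPS could be measured over 100 single-image runs; `infer` is a placeholder for the compiled OpenVINO inference call and is not part of the original work.

```python
import time
import numpy as np

def benchmark(infer, image: np.ndarray, runs: int = 100, warmup: int = 10):
    """Mean latency (ms) and FPS over repeated single-image inference.
    `infer` is a placeholder callable wrapping the compiled OpenVINO model."""
    for _ in range(warmup):
        infer(image)                      # warm-up runs excluded from timing
    t0 = time.perf_counter()
    for _ in range(runs):
        infer(image)
    mean_latency_ms = (time.perf_counter() - t0) * 1000 / runs
    return mean_latency_ms, 1000.0 / mean_latency_ms   # FPS = 1000 ms / mean latency
```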
It is worth noting that, despite a consistent evaluation criterion, DETR was not designed for real-time tasks and thus lacks advantages in inference speed. Moreover, Transformer-based architectures such as DETR face inherent challenges in learning low-level features from scratch on limited datasets, making large-scale pretraining essential to achieve performance competitive with CNNs [42]. In this study, these limitations were mitigated by pretraining on the COCO dataset, followed by local fine-tuning until full convergence. Nevertheless, such architectural characteristics remain important considerations when interpreting the results. Naturally, these factors were also taken into account during the initial design of WeldLight, to avoid similar limitations.
Table 5 provides an overview of the lightweight metrics for the three models. WeldLight outperformed the others across all metrics, with 1.3 M Params, corresponding to 72% of YOLOv5n and 3.5% of DETR, and 0.87 GFLOPs, representing 21% of YOLOv5n and 1.5% of DETR. With a mean latency of 29.32 ms and an FPS of 34.11 Hz, WeldLight fulfills the stringent requirements for real-time weld seam tracking.