1. Introduction
Welding is one of the most critical joining technologies in modern manufacturing, with widespread applications in industries such as automotive, aerospace, shipbuilding, and energy equipment. The quality of weld seams directly affects the mechanical performance and structural safety of products. As a prerequisite for robotic welding, accurate seam detection is essential for automated path planning and high-precision welding operations. Therefore, developing reliable and efficient weld seam detection methods holds great significance for advancing intelligent manufacturing.
Currently, industrial seam detection primarily relies on structured light scanning or laser vision sensors. These methods project gratings or laser stripes onto the workpiece surface, capture the reflected signals with cameras, and reconstruct the geometric profile of the seam [1]. While structured-light-based methods achieve high precision, they face several limitations in practical applications: (1) High hardware costs: structured light sensors are expensive and demand strict calibration and maintenance, resulting in high overall costs [2,3]; (2) Limited robustness: variations in illumination, strong metal surface reflections, and smoke interference in industrial environments can easily disrupt light projection and signal acquisition [4]; (3) Restricted flexibility: structured light devices are often bulky and difficult to integrate into compact welding systems or adapt to complex workpiece geometries [4].
With the rapid development of computer vision and deep learning, image-based detection and recognition techniques have shown remarkable success in fields such as industrial quality inspection, defect detection, and object recognition [5,6,7]. However, in weld seam detection, both academic research and industrial applications remain largely dominated by structured-light or laser-based methods [1,8,9]. To date, there has been no systematic exploration of pure vision approaches that rely solely on conventional industrial cameras [1,10]. As a result, seam detection remains heavily dependent on costly optical devices, limiting scalability and adaptability in real-world applications [1,9].
To address this gap, we propose a pure vision-based seam detection framework termed SANet (Strip-Aware Network). Unlike structured light methods, SANet requires only standard industrial cameras, eliminating reliance on expensive optical sensors and reducing deployment and maintenance costs. Moreover, the absence of active projection improves robustness under complex industrial environments. SANet is lightweight, enabling easy integration into robotic welding systems or compact manufacturing platforms. Most importantly, to the best of our knowledge, SANet is the first attempt to introduce a pure vision approach to seam detection, thereby opening a new research direction and offering a promising pathway for low-cost, high-flexibility intelligent welding.
The proposed SANet is built upon the HCFNet [11] architecture, as illustrated in Figure 1. To enhance feature representation, we design three key modules. First, the Paralleled Strip and Spatial Context-Aware (PSSCA) module captures elongated seam structures and spatial context in parallel, improving fine-grained feature extraction. Second, the StripPooling attention mechanism, introduced from prior studies [12], expands the receptive field along strip directions to strengthen discriminative features of seam regions. Third, the Multistage Fusion (MF) module integrates features from different stages of the encoder and decoder, ensuring comprehensive interaction between global and local information and improving robustness and generalization under complex backgrounds. Together, these components enable SANet to effectively capture seam geometry and achieve superior detection performance.
Extensive experiments on a self-built weld seam dataset containing over 4000 images demonstrate the effectiveness of SANet. Results show that PSSCA improves strip-aware and context modeling, StripPooling enhances seam region discrimination under challenging backgrounds, and MF ensures robust cross-stage feature integration. Compared to structured-light-based methods, SANet offers clear advantages in terms of cost efficiency and deployment flexibility while maintaining competitive detection accuracy. Thus, SANet provides a practical and scalable pure vision solution for seam detection, offering both theoretical and application value for intelligent welding.
The main contributions of this work can be summarized as follows:
We formulate weld seam detection as an image segmentation problem and propose SANet (Strip-Aware Network), a U-shaped deep neural network with strip-aware and attention mechanisms, enabling accurate seam detection under pure vision conditions.
We design two novel modules: the PSSCA module, which enhances fine-grained strip feature extraction and contextual modeling; and the MF module, which integrates multistage features across the encoder and decoder to improve robustness and generalization.
We incorporate the StripPooling attention mechanism into SANet, leveraging its ability to strengthen strip-direction feature modeling and improve discriminative representation under complex backgrounds.
We construct a seven-channel input representation, combining RGB images with grayscale-blurred, thresholded, edge-detected, and morphologically processed images, thereby enriching texture, boundary, and salient features.
We build a self-collected weld seam dataset with over 4000 images across diverse industrial scenarios (e.g., pipe joints, trusses, elbows, furnaces) and conduct extensive evaluations. Experimental results demonstrate that SANet significantly outperforms baseline methods in IoU, Dice, and other metrics, providing a cost-effective and reliable pure vision solution for intelligent welding.
2. Related Work
2.1. Structured Light Vision Methods
Structured light vision methods represent one of the most mature and widely applied techniques for weld seam detection and recognition. The fundamental principle is to project gratings or laser stripes onto the seam surface, capture the deformations using cameras, and then reconstruct the geometric profile of the weld seam for precise localization and tracking. In recent years, researchers have proposed a variety of improvements tailored to different seam types.
For example, Yang et al. developed a cross-mark structured light method for seam detection, which combines line fitting with template matching within the region of interest (ROI) to extract seam edges [13]. This method performs well on V-type seams and narrow gaps but is highly sensitive to illumination and projection quality. For narrow butt seams, structured light stripe sensors have been used to capture seam cross-sections, followed by geometric calculations and curve fitting for accurate seam tracking [13]. Although highly precise, such methods often suffer from instability in environments with strong reflections or interfering light.
To reduce hardware complexity, Pham et al. proposed a seam tracking approach that combines a single laser line with a fixed-view RGB camera [14]. While this method simplifies the hardware setup, it remains fundamentally dependent on laser projection. In more complex industrial environments, Ali et al. introduced neural networks to extract and segment features from structured-light images, enabling high-precision seam detection under real-world conditions [15]. Similarly, Li et al. developed a multi-seam recognition method using structured light, integrating both line and curve feature extraction to adapt to diverse seam geometries [16]. Other researchers have also explored active coaxial light sources for seam detection and automatic calibration in digital twin environments [17], emphasizing system consistency at the expense of greater complexity and deployment cost.
More recently, Spruck et al. combined laser triangulation imaging with deep neural networks for weld seam quality inspection and defect classification, achieving classification accuracies as high as 96.9% [18]. This highlights the potential of combining structured light imaging with deep learning to improve robustness and recognition performance. Nevertheless, such approaches remain reliant on external optical projection systems.
In summary, structured light vision methods have demonstrated high accuracy and strong engineering applicability in weld seam detection. However, their dependence on expensive optical hardware, limited robustness under challenging environments, and high system complexity remain critical drawbacks. These limitations underscore the motivation to explore lightweight, low-cost alternatives that eliminate the reliance on optical projection, paving the way for the development of pure vision-based weld seam detection methods.
2.2. Other Weld Seam Detection Methods
In addition to structured light vision, researchers have explored various alternative approaches for weld seam detection, including laser vision, ultrasonic inspection, and conventional image processing-based methods.
In laser vision methods, laser stripes are commonly used to extract seam geometry [4]. By leveraging the relative pose between the laser source and the camera, triangulation is applied to reconstruct seam positions. These methods offer high accuracy and fast response, but their robustness is limited in the presence of highly reflective metallic surfaces, welding fumes, or complex curved workpieces.
Ultrasonic and electromagnetic techniques [19,20] employ non-contact or contact sensors to capture surface or subsurface variations in physical properties, assisting in seam or defect identification. While such methods can detect certain internal seam defects and are valuable in specific applications, their high equipment cost, operational complexity, and limited applicability hinder their widespread use in general seam detection scenarios.
Conventional image processing-based methods have also been investigated for weld seam detection [8]. Techniques such as edge detection, thresholding, and Hough transform have been used to extract seam regions. These approaches are simple to implement and computationally efficient, yet they struggle in environments with complex lighting, noisy backgrounds, or highly variable seam geometries, leading to poor robustness and limited generalization.
Overall, while these methods have advanced seam detection to some extent, they suffer from shortcomings such as heavy reliance on sensor quality, limited adaptability, or insufficient feature representation capability. These limitations highlight the pressing need for novel approaches that achieve high robustness, low cost, and strong adaptability in complex industrial environments.
2.3. Vision-Based Deep Learning Methods
Recent advances in deep learning have demonstrated remarkable progress in computer vision, showcasing strong capabilities in feature learning and representation. In industrial inspection, deep learning has been widely applied to defect recognition, object detection, and semantic segmentation, effectively addressing the limitations of traditional methods under challenging conditions such as variable illumination, background clutter, and irregular textures. Its end-to-end feature learning paradigm enables automatic extraction of multi-level discriminative features, avoiding the constraints of handcrafted designs and ensuring superior robustness in complex scenarios.
Deep learning is particularly effective in small-object detection. For instance, state-of-the-art architectures such as HCFNet [11] introduce multi-scale feature fusion and attention mechanisms to enhance fine-grained feature modeling. These advances significantly improve the detectability and discriminability of small objects in complex environments, underscoring the capability of deep learning to achieve breakthroughs in highly challenging vision tasks. In parallel, recent work has also explored the incorporation of explicit geometric structures into CNN design, such as leveraging the truncated icosahedron in Football Net [21], which demonstrates the potential of geometry-aware architectures to strengthen feature representation.
Recent studies have also explored the integration of Transformer architectures into weld seam detection and industrial inspection. For instance, Ali et al. (2023) [15] presented a neural network framework that integrates Transformer-based components for weld seam detection in real industrial environments. Their hybrid CNN–Transformer design demonstrates that attention mechanisms can significantly enhance contextual modeling and robustness under complex illumination and reflection conditions. In light of current research, the best segmentation performance in industrial and inspection tasks is frequently achieved using Vision Transformer (ViT) architectures or hybrid CNN–Transformer models, which effectively balance local feature extraction with global contextual reasoning. Nevertheless, SANet differs from these methods by maintaining a pure CNN-based and camera-only framework, specifically optimized for strip-like weld seams in complex industrial scenes. Rather than relying on Transformer blocks, SANet focuses on lightweight strip-aware contextual modeling and multistage feature fusion, offering a more efficient and deployable solution under hardware-constrained environments.
Building upon this evidence, deep learning also holds great potential for detecting elongated strip-like objects. Weld seams, as typical elongated structures, are difficult to recognize due to their slender geometry, blurred boundaries, and susceptibility to background interference. By incorporating structural designs that explicitly strengthen strip-aware feature representation, the detectability and robustness of seam detection can be substantially improved. Therefore, combining the powerful feature learning capability of deep learning with strip-aware mechanisms provides a promising and valuable research pathway for developing pure vision-based weld seam detection models.
2.4. Motivation for SANet
Although existing weld seam detection methods have contributed to the advancement of welding automation, they still exhibit notable limitations. Structured light and laser vision methods achieve high accuracy but suffer from high hardware costs, environmental sensitivity, and complex integration, limiting their scalability. Ultrasonic and electromagnetic sensing techniques can aid defect detection, but their high equipment costs and narrow applicability prevent them from serving as general-purpose solutions. Traditional image-processing approaches are simple to implement but lack robustness when faced with complex backgrounds or diverse seam geometries.
In contrast, deep learning has demonstrated superior performance in a wide range of challenging vision tasks, particularly in small-object detection. By employing multi-scale feature fusion and attention mechanisms, deep learning models overcome the limitations of weak and easily lost features. Representative works such as HCFNet have shown that deep learning can precisely model fine-grained object features and achieve high detection accuracy [11]. These advances indicate that deep learning is not only effective for small-object detection but also holds great potential for addressing other difficult vision tasks.
Weld seams, as typical elongated strip-like targets, present unique challenges: slender geometry, blurred boundaries, and susceptibility to illumination, reflection, and interference in industrial environments. Inspired by the success of deep learning in small-object detection, we hypothesize that introducing strip-aware mechanisms into network design, combined with contextual feature modeling, can significantly enhance seam feature representation. This approach is expected to achieve accurate, low-cost, and robust weld seam detection. Guided by this motivation, we propose SANet (Strip-Aware Network), a pure vision deep learning framework specifically designed for weld seam detection, providing a new technical pathway and innovative perspective for intelligent welding.
Compared with the baseline HCFNet, the proposed SANet introduces a series of targeted improvements to better fit the geometric and contextual properties of weld seams. Specifically, SANet incorporates the StripPooling attention mechanism, which was not included in HCFNet, to explicitly model elongated contextual dependencies along the seam direction. Furthermore, the PSSCA (Paralleled Strip and Spatial Context-Aware) module replaces the Patch-Aware attention unit in HCFNet with a Strip-Aware strategy. This modification is driven by the observation that weld seams differ fundamentally from the small, isolated objects handled by HCFNet—they are long, continuous, and structurally coherent. The Strip-Aware mechanism enables more efficient feature aggregation along horizontal and vertical strip orientations, effectively expanding the receptive field while preserving spatial resolution. Consequently, SANet achieves more precise and context-rich feature extraction for strip-shaped targets, demonstrating a structural evolution from HCFNet toward a design specialized for weld seam detection tasks.
In addition to the aforementioned improvements, SANet also integrates the Spatial Context-Aware Module (SCAM) [22], whose structure is schematically illustrated as the attention part in Figure 2. The detailed architecture of SCAM was originally presented in its source paper, and thus, for copyright and authorization reasons, we refrain from reproducing its exact diagram here. Instead, a simplified schematic is provided in Figure 2, which adequately demonstrates the computational process and operational flow of the module.
The primary role of SCAM is to enhance the extraction of fine-grained and small-object features by jointly modeling spatial and channel dependencies. During the adaptation from HCFNet to SANet, the original Patch-Aware mechanism of HCFNet was removed to make room for the proposed Strip-Aware attention, which focuses on elongated and continuous regions. However, this removal inevitably weakened the model’s sensitivity to subtle feature variations. To address this issue, SCAM was incorporated as a compensatory mechanism—not only restoring the original fine-detail extraction capability of HCFNet, but also further reinforcing the performance of the strip-aware attention design. Through extensive experiments, this integration was found to effectively balance global contextual perception and local fine-detail enhancement, thereby improving the model’s overall representation capacity for weld seam detection.
3. Materials and Methods
3.1. Dataset Description
Currently, there is no publicly available dataset specifically designed for pure vision-based weld seam detection. To support the training and evaluation of the proposed method, we constructed a self-collected dataset comprising more than 4000 weld seam images. All images were captured using conventional industrial cameras in real production environments, ensuring both scene diversity and application representativeness.
During data acquisition, we deliberately included a wide range of typical industrial welding scenarios, such as the following:
T-joint elbows—seams located at complex pipe junctions characterized by multi-directional intersections and curved surface transitions;
Pipe trusses—large-scale structural weld seams under cluttered backgrounds and varying illumination conditions;
Industrial furnaces—weld seams on high-temperature equipment surfaces, where imaging is often affected by strong reflections and noise interference;
Pipe butt joints—common linear or circular seams exhibiting representative elongated geometric structures.
These scenarios encompass the major types of welding processes in manufacturing and capture the diversity and complexity of seam geometries. Unlike datasets that rely on structured-light or laser-based imaging, this dataset is constructed entirely from pure vision images, thereby providing reliable support for the exploration of cost-effective and flexible seam detection approaches.
To ensure annotation accuracy, all images were carefully reviewed by domain experts, and weld seam regions were precisely labeled. This guarantees the scientific rigor and reproducibility of subsequent model training and evaluation. Beyond meeting the needs of this study, the dataset also serves as a valuable resource for future research on pure vision-based weld seam detection.
3.2. Data Preprocessing
To enhance the model’s ability to perceive seam-related features, we applied a series of preprocessing operations to the raw images, constructing multi-channel input data that enrich both seam morphology and boundary representation. The overall workflow is illustrated in Figure 3.
The original input consists of RGB images with three channels, providing rich texture and color information as the baseline input to the network. However, weld seams in industrial environments often exhibit uneven illumination, blurred edges, and background clutter. To address these challenges, the following preprocessing steps were sequentially performed, generating additional feature channels (Figure 3):
Grayscale and Gaussian blurring: The RGB image was converted into grayscale and subsequently smoothed using a Gaussian filter. This reduces noise and irrelevant textures, thereby highlighting the global structural characteristics of weld seams. One additional channel was produced.
Thresholding: Adaptive thresholding was applied to the grayscale image to distinguish seams from their surrounding background regions. One additional channel was produced.
Edge detection: The Canny operator was employed to extract seam boundaries and contour information. One additional channel was produced.
Morphological closing: A closing operation was applied to the edge detection result to fill small gaps along the seam contours and preserve structural continuity. One additional channel was produced.
In total, a 7-channel input representation was constructed, consisting of the original RGB channels and four preprocessed feature maps. This design allows the network to jointly leverage global texture information from raw images and salient seam characteristics from preprocessing outputs, thereby improving seam discriminability and enhancing the robustness and generalization of the model under complex industrial conditions.
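For concreteness, the following sketch illustrates how such a 7-channel input can be assembled with OpenCV; the kernel sizes, adaptive-threshold block size, and Canny thresholds shown here are illustrative assumptions rather than the exact settings used in our experiments.

```python
# Minimal sketch of the 7-channel input construction (parameters are illustrative).
import cv2
import numpy as np

def build_seven_channel_input(bgr: np.ndarray) -> np.ndarray:
    """Stack RGB with four preprocessed maps into a (7, H, W) float array."""
    rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)

    # 1) Grayscale + Gaussian blurring: suppress noise and irrelevant textures.
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)

    # 2) Adaptive thresholding: separate seam-like regions from the background.
    thresh = cv2.adaptiveThreshold(blurred, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY, 11, 2)

    # 3) Canny edge detection: extract boundary and contour cues.
    edges = cv2.Canny(blurred, 50, 150)

    # 4) Morphological closing on the edge map: fill small gaps along contours.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 5))
    closed = cv2.morphologyEx(edges, cv2.MORPH_CLOSE, kernel)

    channels = [rgb[..., 0], rgb[..., 1], rgb[..., 2], blurred, thresh, edges, closed]
    return np.stack(channels, axis=0).astype(np.float32) / 255.0
```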
3.3. Network Architecture (SANet)
3.3.1. Overall Architecture
The proposed SANet (Strip-Aware Network) adopts a U-shaped architecture to effectively integrate global semantic information with local fine-grained details, thereby meeting the dual requirements of accuracy and robustness in weld seam detection. The network input consists of a 7-channel representation, including the original RGB channels and 4 additional single-channel maps obtained through grayscale–blur processing, thresholding, edge detection, and morphological closing. This design enables the integration of texture, boundary, and saliency information at the input stage, providing a solid foundation for subsequent feature extraction.
In the encoder, SANet improves upon the feature extraction units of HCFNet. While HCFNet employs a combination of PPA and max pooling, SANet replaces this structure with the proposed PSSCA (Paralleled Strip and Spatial Context-Aware) module, which transforms the original patch-aware design into a strip-aware mechanism. In addition, the StripPooling module is incorporated. Although named as a pooling operation, StripPooling essentially functions as an attention mechanism, enhancing strip-aware features through directional contextual modeling. Ultimately, the encoder’s feature extraction units are replaced with PSSCA, significantly strengthening the network’s ability to represent elongated seam structures.
In the decoder, SANet largely retains the HCFNet design, where feature maps are progressively upsampled and refined. Specifically, the PPA units in HCFNet are replaced with PSSCA, while StripPooling is not applied in the decoder in order to maintain lightweight design. This ensures that the decoder can fuse multi-scale semantic information while preserving computational efficiency during feature enhancement.
Furthermore, SANet introduces a Multistage Fusion (MF) module that spans the entire network. The MF module aggregates features across the channel dimension by combining (i) the max pooling outputs from the encoder, (ii) the decoder features before PSSCA, and (iii) the decoder features after PSSCA. This design ensures that features from different stages interact and complement one another at the same scale, thereby enhancing seam boundary representation and fine-grained structural modeling.
In summary, SANet achieves comprehensive strip-aware feature modeling through PSSCA and StripPooling in the encoder, PSSCA combined with upsampling in the decoder, and MF-based feature integration across the network. Together with the 7-channel input design, SANet demonstrates high accuracy and robustness in weld seam detection under pure vision conditions.
3.3.2. Paralleled Strip and Spatial Context-Aware (PSSCA) Module
Traditional U-shaped networks exhibit limitations when dealing with elongated objects. Although convolutional operations are effective for capturing local features, they lack sufficient capacity for modeling long-range dependencies along strip-like structures. This often leads to weakened representations and blurred boundaries of elongated targets such as weld seams, thereby reducing detection accuracy.
To address this issue, we propose the Paralleled Strip and Spatial Context-Aware (PSSCA) module, as illustrated in Figure 2. Serving as the core component of SANet, the PSSCA module is designed to enhance the feature extraction capability for elongated strip-like objects while improving network robustness in complex scenarios.
We let the input feature map be $X \in \mathbb{R}^{C \times H \times W}$, where $C$, $H$, and $W$ denote the number of channels, height, and width, respectively.
First, the channel dimension is adjusted using a $1 \times 1$ convolution, $X' = \mathrm{Conv}_{1 \times 1}(X) \in \mathbb{R}^{C' \times H \times W}$, where $C'$ is the updated number of channels. The resulting feature $X'$ is then divided into three parallel branches:
Residual branch: the input feature $X'$ is directly preserved as a residual connection, $F_{\mathrm{res}} = X'$.
Strip partition branch (4 strips): the feature map $X'$ is partitioned into 4 strips along either the row or column direction. Each strip is flattened and passed through a feed-forward network (FFN), yielding $F_{4} = \mathrm{FFN}\big(\mathrm{Strip}_{4}(X')\big)$.
Strip partition branch (8 strips): similarly, $X'$ is partitioned into 8 strips. To achieve this, the feature map is reshaped into a square, duplicated, and then divided along both row and column directions. The resulting strip vectors are processed by an FFN, yielding $F_{8} = \mathrm{FFN}\big(\mathrm{Strip}_{8}(X')\big)$.
All three outputs are aligned to the same shape, allowing channel-wise concatenation: $F_{\mathrm{cat}} = \mathrm{Concat}(F_{\mathrm{res}}, F_{4}, F_{8})$.
Finally, the concatenated features are fed into an attention mechanism. Unlike the original design in HCFNet, we replace the attention module with the Spatial Context-Aware Module (SCAM) [22], which jointly enhances spatial and channel representations: $F_{\mathrm{out}} = \mathrm{SCAM}(F_{\mathrm{cat}})$.
In summary, the PSSCA module employs a parallel design that integrates residual connections, multi-scale strip partitioning, and contextual modeling. This explicitly introduces directional dependencies along strip-like structures while retaining the local modeling capability of convolution. Combined with SCAM attention, the module achieves globally enhanced feature representations, thereby significantly improving the detection robustness and accuracy of weld seams.
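The following PyTorch sketch illustrates the structure of the PSSCA module under simplifying assumptions: the strip partitioning scheme, FFN widths, and the lightweight channel-gating attention standing in for SCAM [22] are placeholders for illustration, not the exact implementation.

```python
# Simplified PSSCA sketch: 1x1 channel adjustment, residual + 4-strip + 8-strip
# branches, channel-wise fusion, and a placeholder attention in place of SCAM.
import torch
import torch.nn as nn

class StripBranch(nn.Module):
    """Partition the feature map into strips along one axis, flatten each strip,
    process it with a small FFN, and restore the original layout."""
    def __init__(self, channels, height, width, n_strips, axis, hidden=256):
        super().__init__()
        self.n_strips, self.axis = n_strips, axis  # axis=2 -> row strips, axis=3 -> column strips
        strip_len = channels * height * width // n_strips
        self.ffn = nn.Sequential(nn.Linear(strip_len, hidden), nn.GELU(),
                                 nn.Linear(hidden, strip_len))

    def forward(self, x):
        b, c, h, w = x.shape
        strips = torch.chunk(x, self.n_strips, dim=self.axis)          # split into strips
        flat = torch.stack([s.reshape(b, -1) for s in strips], dim=1)  # (B, n_strips, strip_len)
        out = self.ffn(flat)                                           # per-strip FFN
        out = torch.chunk(out, self.n_strips, dim=1)
        out = [o.reshape(strips[i].shape) for i, o in enumerate(out)]
        return torch.cat(out, dim=self.axis)                           # back to (B, C, H, W)

class PSSCA(nn.Module):
    def __init__(self, in_ch, out_ch, height, width):
        super().__init__()
        self.reduce = nn.Conv2d(in_ch, out_ch, kernel_size=1)          # channel adjustment
        self.strip4 = StripBranch(out_ch, height, width, n_strips=4, axis=2)
        # Simplification: 8 column strips stand in for the row/column 8-strip scheme.
        self.strip8 = StripBranch(out_ch, height, width, n_strips=8, axis=3)
        self.fuse = nn.Conv2d(3 * out_ch, out_ch, kernel_size=1)
        # Placeholder attention standing in for SCAM: channel gating by global context.
        self.attn = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(out_ch, out_ch, 1), nn.Sigmoid())

    def forward(self, x):
        x = self.reduce(x)
        residual = x                                    # residual branch
        f4 = self.strip4(x)                             # 4-strip branch
        f8 = self.strip8(x)                             # 8-strip branch (simplified)
        fused = self.fuse(torch.cat([residual, f4, f8], dim=1))
        return fused * self.attn(fused)                 # context-aware re-weighting

# Usage: PSSCA(16, 16, height=32, width=32)(torch.randn(1, 16, 32, 32))
```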
3.3.3. Multistage Fusion (MF)
Although U-shaped networks employ skip connections to enable information flow between the encoder and decoder, the interactions across different stages remain insufficient. This limitation is particularly critical for weld seam detection, where fine-grained features and high-level semantic features exhibit strong complementarity. To address this issue, we propose the Multistage Fusion (MF) module, which enhances cross-level feature interaction and improves the modeling of seam boundaries and structural details. The architecture of the MF module is illustrated in Figure 4.
In the MF module, three types of features are integrated:
- Feature A: the outputs from max pooling layers in the encoder, providing multi-scale semantic features from the downsampling stage;
- Feature B: the decoder features prior to PSSCA processing, which retain original spatial details;
- Feature C: the decoder features after PSSCA, containing strip-enhanced and context-aware representations.
We let these features be denoted as $F_{A}, F_{B}, F_{C} \in \mathbb{R}^{C \times H \times W}$, where $C$, $H$, and $W$ represent the number of channels, height, and width, respectively. The MF module concatenates these features along the channel dimension to obtain the fused representation $F_{\mathrm{cat}} = \mathrm{Concat}(F_{A}, F_{B}, F_{C}) \in \mathbb{R}^{3C \times H \times W}$. To ensure that the fused features can be effectively utilized by subsequent layers, a $1 \times 1$ convolution is applied for channel compression and nonlinear transformation: $F_{\mathrm{MF}} = \delta\big(\mathrm{Conv}_{1 \times 1}(F_{\mathrm{cat}})\big)$, where $\delta(\cdot)$ denotes a nonlinear activation function.
Through this design, the MF module enables feature integration across multiple stages of the encoder and decoder, thereby facilitating the interaction between low-level fine-grained details and high-level semantic information. As a result, the network achieves improved accuracy and robustness in weld seam detection.
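A minimal PyTorch sketch of the MF module is given below; the channel counts are illustrative assumptions, and batch normalization with ReLU stands in for the nonlinear transformation.

```python
# MF sketch: channel-wise concatenation of the three feature sources followed by
# a 1x1 convolution for compression and nonlinearity.
import torch
import torch.nn as nn

class MultistageFusion(nn.Module):
    def __init__(self, ch_a, ch_b, ch_c, out_ch):
        super().__init__()
        self.project = nn.Sequential(
            nn.Conv2d(ch_a + ch_b + ch_c, out_ch, kernel_size=1),  # channel compression
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),                                  # nonlinear transformation
        )

    def forward(self, feat_a, feat_b, feat_c):
        # feat_a: encoder max-pooling output; feat_b: decoder features before PSSCA;
        # feat_c: decoder features after PSSCA. All share the same spatial size.
        fused = torch.cat([feat_a, feat_b, feat_c], dim=1)
        return self.project(fused)

# Usage: MultistageFusion(64, 64, 64, 64)(a, b, c) with a, b, c of shape (B, 64, H, W)
```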
3.3.4. StripPooling Attention
To further enhance the discriminative features of weld seam regions, SANet incorporates the StripPooling attention mechanism. Originally introduced by Hou et al. [12], StripPooling aims to enlarge the receptive field and emphasize strip-like structures by applying pooling operations along strip directions. This allows the model to capture long-range dependencies aligned with elongated objects and strengthens its capability for strip-aware feature modeling.
In the context of weld seam detection, where seams inherently exhibit elongated geometries, traditional pooling strategies fail to adequately capture global contextual information along the strip direction. Embedding the StripPooling module into SANet effectively addresses this limitation: (i) pooling along strip directions increases the contrast between weld seams and their background, and (ii) when combined with convolutional features, it enhances boundary sharpness and overall recognition accuracy.
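The following sketch illustrates strip pooling attention in the spirit of Hou et al. [12]: features are pooled along horizontal and vertical strips, refined by directional convolutions, and used to re-weight the input. The specific layer choices are illustrative, not those of the original implementation.

```python
# Simplified strip pooling attention: directional pooling, directional convs,
# fusion, and sigmoid gating of the input feature map.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StripPooling(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # average each row -> (B, C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # average each column -> (B, C, 1, W)
        self.conv_h = nn.Conv2d(channels, channels, kernel_size=(3, 1), padding=(1, 0))
        self.conv_w = nn.Conv2d(channels, channels, kernel_size=(1, 3), padding=(0, 1))
        self.fuse = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        _, _, h, w = x.shape
        sh = F.interpolate(self.conv_h(self.pool_h(x)), size=(h, w), mode="nearest")
        sw = F.interpolate(self.conv_w(self.pool_w(x)), size=(h, w), mode="nearest")
        attn = torch.sigmoid(self.fuse(sh + sw))        # directional attention map
        return x * attn                                  # strip-aware re-weighting
```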
It should be emphasized that StripPooling is not an original contribution of this work. Rather, it is integrated into SANet as a mature attention mechanism to strengthen strip-specific feature representations. The core innovations of this paper lie in the design of the PSSCA module and the multistage fusion strategy, while StripPooling serves as a complementary component to further improve feature modeling effectiveness.
3.3.5. Loss Function
Unlike conventional semantic segmentation networks that apply supervision only to the final output, SANet introduces supervision at multiple stages of the decoder. This design leverages the hierarchical nature of the U-shaped architecture: during upsampling, intermediate decoder outputs progressively restore spatial resolution, and their accuracy directly affects higher-resolution feature reconstruction. By injecting supervision at each stage, the model is constrained to reconstruct features more accurately and stably throughout the decoding process.
Formally, we let the decoder produce $L$ outputs at different stages. The prediction at stage $i$ is denoted as $\hat{Y}_{i}$, and its corresponding ground truth $Y_{i}$ is obtained by downsampling the original annotation $Y$ to match the resolution of $\hat{Y}_{i}$. The stage-wise loss $\mathcal{L}_{i}$ is defined as
$\mathcal{L}_{i} = \mathcal{L}_{\mathrm{CE}}(\hat{Y}_{i}, Y_{i}) + \mathcal{L}_{\mathrm{IoU}}(\hat{Y}_{i}, Y_{i})$,
where $\mathcal{L}_{\mathrm{CE}}$ represents the cross-entropy loss and $\mathcal{L}_{\mathrm{IoU}}$ denotes the Intersection-over-Union loss.
The total loss is obtained as the weighted sum of stage-wise losses:
$\mathcal{L}_{\mathrm{total}} = \sum_{i=1}^{L} \lambda_{i} \, \mathcal{L}_{i}$,
where $\lambda_{i}$ denotes the weight assigned to stage $i$. In this work, the weights are set progressively from low to high resolution. This weighting strategy ensures that the final output plays a dominant role, while intermediate outputs still provide valuable supervisory signals.
By introducing supervision across multiple stages, SANet not only optimizes the final prediction but also encourages effective learning of intermediate feature representations. This multistage supervision alleviates gradient vanishing, stabilizes convergence, and improves detection performance under complex background conditions.
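A hedged sketch of this multistage supervision is given below; binary cross-entropy stands in for the cross-entropy term, and the stage weights are illustrative placeholders rather than the values used in our experiments.

```python
# Multistage supervision sketch: cross-entropy plus a soft IoU term at each
# decoder stage, weighted from low to high resolution and summed.
import torch
import torch.nn.functional as F

def soft_iou_loss(pred, target, eps=1e-6):
    """Soft IoU loss on probability maps in [0, 1]."""
    inter = (pred * target).sum(dim=(1, 2, 3))
    union = (pred + target - pred * target).sum(dim=(1, 2, 3))
    return 1.0 - ((inter + eps) / (union + eps)).mean()

def multistage_loss(stage_logits, gt_mask, weights=(0.25, 0.5, 0.75, 1.0)):
    """stage_logits: list of (B, 1, H_i, W_i) logits from low to high resolution;
    gt_mask: full-resolution float mask of shape (B, 1, H, W)."""
    total = 0.0
    for logits, w in zip(stage_logits, weights):
        # Downsample the full-resolution ground truth to the stage resolution.
        gt_i = F.interpolate(gt_mask, size=logits.shape[-2:], mode="nearest")
        prob = torch.sigmoid(logits)
        ce = F.binary_cross_entropy_with_logits(logits, gt_i)
        total = total + w * (ce + soft_iou_loss(prob, gt_i))
    return total
```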
4. Results
4.1. Evaluation Metrics and Model Training
To comprehensively evaluate the performance of the proposed SANet on the self-constructed weld seam dataset, we adopted four widely used metrics: Intersection over Union (IoU), Sørensen–Dice coefficient (Dice), Sensitivity (Sens), and Specificity (Spec). These metrics provide an objective evaluation of the model’s overall performance from two complementary perspectives: region-level segmentation accuracy and boundary localization precision.
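For reference, a minimal sketch of these metrics is given below, computed directly from prediction and ground-truth maps in [0, 1], which covers both hard masks and the soft formulations used for Sens and Spec in Section 4.2.

```python
# Soft/hard segmentation metrics: IoU, Dice, Sensitivity, Specificity.
import torch

def segmentation_metrics(pred, target, eps=1e-6):
    """pred, target: tensors in [0, 1] of shape (B, 1, H, W)."""
    tp = (pred * target).sum()
    fp = (pred * (1 - target)).sum()
    fn = ((1 - pred) * target).sum()
    tn = ((1 - pred) * (1 - target)).sum()
    iou = (tp + eps) / (tp + fp + fn + eps)
    dice = (2 * tp + eps) / (2 * tp + fp + fn + eps)
    sens = (tp + eps) / (tp + fn + eps)   # recall over the (sparse) seam pixels
    spec = (tn + eps) / (tn + fp + eps)   # accuracy over the background pixels
    return {"IoU": iou.item(), "Dice": dice.item(), "Sens": sens.item(), "Spec": spec.item()}
```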
Model training and testing were conducted on four NVIDIA Tesla V100-SXM2-32 GB GPUs. Each input image had a fixed resolution and was represented as a seven-channel input consisting of RGB channels and four preprocessing channels. The total number of parameters in SANet is approximately 20.5 million. The dataset was randomly shuffled and then divided into training, validation, and test sets in a ratio of 7:2:1.
During training, the Stochastic Gradient Descent (SGD) optimizer was employed with a batch size of 6 for a total of 170 epochs. As shown in Figure 5, both the IoU and the total loss exhibited stable convergence in the later training stages, demonstrating the effectiveness and reliability of the chosen training parameters. To ensure training stability and reproducibility, the validation set was used for convergence monitoring rather than early stopping, and the final reported results correspond to the model obtained after the full 170 training epochs. Moreover, all quantitative results presented in this paper are reported as the average of five independent experiments, which effectively mitigates the influence of random initialization and ensures the robustness of the conclusions.
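The following sketch summarizes this training configuration; the learning rate, momentum, and the dataset and model objects are assumptions for illustration, and the loss criterion corresponds to the multistage supervision of Section 3.3.5.

```python
# Training configuration sketch: 7:2:1 split, SGD, batch size 6, 170 epochs.
import torch
from torch.utils.data import DataLoader, random_split

def train(model, dataset, criterion, epochs=170, batch_size=6, lr=1e-2):
    n = len(dataset)
    n_train, n_val = int(0.7 * n), int(0.2 * n)
    train_set, val_set, test_set = random_split(dataset, [n_train, n_val, n - n_train - n_val])
    loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)  # lr/momentum assumed
    for epoch in range(epochs):
        model.train()
        for images, masks in loader:            # images: (B, 7, H, W) seven-channel inputs
            optimizer.zero_grad()
            stage_outputs = model(images)       # list of multi-stage logits (Section 3.3.5)
            loss = criterion(stage_outputs, masks)
            loss.backward()
            optimizer.step()
        # val_set is used only to monitor convergence; no early stopping is applied.
    return model
```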
4.2. Comparison
To validate the effectiveness of SANet in weld seam detection, we conducted comparative experiments on the self-constructed weld seam dataset. SANet was compared against several widely adopted semantic segmentation networks, including U-Net [23], UIUNet (U-Net in U-Net) [24], HintUNet (a U-Net variant incorporating hint supervision), and HintHCFNet (an HCFNet variant incorporating hint supervision) [25]. These architectures have demonstrated strong performance across various segmentation tasks, thereby providing valuable benchmarks for assessing the performance of SANet. For a more intuitive comparison, Figure 6 presents visual examples of segmentation outputs, highlighting the superior seam localization ability of SANet over competing methods.
As shown in Table 1, SANet consistently achieves the best overall performance on the self-constructed weld seam dataset. SANet obtains an IoU of 0.9623 and a Dice coefficient of 0.9807, which are among the highest across all models. More importantly, SANet attains the lowest total loss (0.0435), indicating more stable convergence and superior optimization efficiency.
Additionally, the Sensitivity (Sens) and Specificity (Spec) metrics in this work were computed using soft differentiable formulations without applying a fixed threshold. Given that weld seam pixels are far fewer than background pixels, the absolute values of Sens appear relatively low, which is expected in this task setting. Therefore, these two metrics should be interpreted based on their relative rather than absolute values. Under this interpretation, SANet still outperforms the baseline models, achieving higher Sens (18.33%) and Spec (99.90%) simultaneously. This demonstrates that SANet not only captures the sparse seam pixels effectively but also maintains high accuracy on the background. These results clearly confirm the benefits of the proposed PSSCA and MF modules in enhancing elongated feature extraction and cross-stage feature fusion.
By contrast, HintUNet, although achieving similar IoU and Dice scores (0.9623 and 0.9807), was less stable during training, frequently encountering gradient explosion. Only after applying gradient clipping could HintUNet converge, which inevitably altered its learning dynamics and made the optimization process less reliable. The instability was even more severe for HintHCFNet: despite extensive use of gradient clipping, the model still failed to converge properly, reflected in its very high total loss (0.7843) and markedly inferior IoU (0.3782) and Dice (0.5281). This clearly shows that the heavy modifications in HintHCFNet exacerbate training instability on this task.
In contrast, SANet converged stably under standard hyperparameter settings without the need for aggressive stabilization strategies such as strong gradient clipping. This robustness, together with its superior accuracy metrics, highlights SANet’s adaptability and suitability for complex industrial seam detection scenarios.
4.3. Ablation
To further evaluate the effectiveness of the key components in SANet, we performed ablation experiments focusing on the PSSCA and MF modules. IoU and Dice metrics were computed on the self-constructed dataset, and the results are summarized in Table 2.
Table 2 reports the ablation study of SANet on the self-constructed dataset. Starting from the baseline UNet-like structure, we gradually introduce the proposed modules and evaluate their contributions.
The baseline model achieves an IoU of 0.9258 and a Dice coefficient of 0.9606, which provides a solid foundation but leaves room for improvement. When the PSSCA module is incorporated, both IoU and Dice increase to 0.9316 and 0.9651, respectively. This improvement verifies the effectiveness of parallel strip and spatial context awareness in enhancing feature extraction of elongated weld seams. Sensitivity and specificity also improve slightly, indicating more accurate recognition of the sparse seam pixels while maintaining high background accuracy.
When the MF module is further added on top of PSSCA, the full SANet achieves the best performance across all metrics, with an IoU of 0.9667, a Dice coefficient of 0.9829, Sensitivity of 0.1796, and Specificity of 0.9991. These results demonstrate that the MF module effectively strengthens cross-stage feature interactions, leading to better overall segmentation quality.
In summary, both PSSCA and MF contribute positively to model performance, and their combination yields the highest accuracy and robustness. This validates the design choice of integrating strip-aware feature modeling with multistage fusion in SANet.
5. Discussion
5.1. Innovations and Limitations
The proposed SANet (Strip-Aware Network) introduces a novel pure vision-based approach tailored for weld seams, which represent a typical elongated structure, and demonstrates several methodological innovations. First, the PSSCA (Paralleled Strip and Spatial Context-Aware) module extends the original PPA design in HCFNet and, combined with the StripPooling attention mechanism, explicitly strengthens strip-direction feature modeling, thereby improving weld seam recognition. Second, the Multistage Fusion (MF) module facilitates effective cross-level feature interaction, preserving global semantic information while enhancing local detail expression, which significantly improves boundary segmentation accuracy. Third, at the input level, SANet adopts an innovative seven-channel input design that integrates raw RGB images with multiple preprocessing results, providing richer texture, boundary, and saliency cues for the network. Collectively, these innovations enable SANet to achieve robust and accurate performance even under complex industrial conditions.
Despite the encouraging results, certain limitations remain. First, the parameter scale of the model is relatively large, which may limit deployment in resource-constrained environments and necessitates further optimization. Second, although the self-constructed dataset covers multiple typical industrial scenarios, its scale and diversity are still limited compared to larger public datasets. These constraints highlight directions for future improvement.
It is important to clarify that in some of the presented images, only partial welding has been performed along the detected seam. This is because the weld seam detection task in this study focuses on locating potential weld paths rather than evaluating completed welds. The actual welding process, including parameters such as coverage pattern (spot or continuous), torch trajectory, and motion speed, is defined by an independent “welding process package” used for robotic operation tuning. These parameters vary according to specific industrial requirements and are beyond the scope of this paper. The preliminary morphological results shown in Figure 3 further illustrate the necessity of the proposed deep learning framework, as conventional methods alone cannot provide reliable seam localization under complex industrial conditions.
5.2. Experimental Observations and Interpretations
During the training of comparative models, we observed distinct stability differences among network architectures. In particular, HintHCFNet exhibited frequent gradient explosion, which can be attributed to its recursive multi-scale hint feedback structure. This recursive coupling amplifies gradient variance when dealing with high-resolution weld textures and significant class imbalance between seam and background pixels. Even with gradient clipping and reduced learning rates, the model remained unstable, suggesting that its feedback-based mechanism is highly sensitive to noisy gradients in industrial data.
In contrast, UNet did not suffer from convergence issues but instead underfitted the task. Its relatively simple encoder–decoder design lacks sufficient contextual modeling and strip-aware feature perception, leading to weak activation in elongated or low-contrast weld regions. Consequently, the output masks appeared nearly blank rather than divergent, reflecting limited representational power rather than instability.
Although HintHCFNet failed to produce meaningful outputs, its inclusion in Figure 6 was intentional. It demonstrates that network complexity does not guarantee robustness and highlights SANet’s stable convergence and adaptability under the same training conditions.
Furthermore, considering the dataset’s diversity, SANet was evaluated across multiple weld categories, including pipe joints, trusses, elbows, and furnace seams. The consistent performance across these distinct geometries underscores its generalization capability. While the segmentation masks may still contain small discontinuities or voids, these artifacts are mainly confined to highly reflective surfaces and can be mitigated through post-processing techniques such as morphological closing and curve fitting.
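As an illustration of such post-processing, the following sketch applies morphological closing followed by a simple polynomial fit to the predicted seam pixels; the kernel size and polynomial degree are assumptions, and the fit presumes a seam that can be approximated as a function of the horizontal coordinate.

```python
# Post-processing sketch: close small voids in the predicted mask, then fit and
# rasterize a smooth seam curve over the remaining pixels.
import cv2
import numpy as np

def postprocess_mask(mask: np.ndarray, degree: int = 3) -> np.ndarray:
    """mask: binary uint8 (H, W) prediction with values in {0, 255}."""
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (7, 7))
    closed = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)   # fill small discontinuities

    ys, xs = np.nonzero(closed)
    if len(xs) < degree + 1:
        return closed                                          # too few pixels to fit a curve
    coeffs = np.polyfit(xs, ys, degree)                        # fit seam centerline y = f(x)
    fitted = np.zeros_like(closed)
    x_line = np.arange(xs.min(), xs.max() + 1)
    y_line = np.clip(np.polyval(coeffs, x_line), 0, closed.shape[0] - 1).astype(int)
    fitted[y_line, x_line] = 255                               # rasterize the fitted seam curve
    return cv2.bitwise_or(closed, fitted)
```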
Finally, we also compared SANet with traditional image-processing methods such as edge detection and adaptive thresholding. These methods can approximate seam contours under uniform illumination but fail in the presence of reflection, noise, or complex backgrounds. In contrast, SANet maintains high continuity and localization accuracy under such conditions, making it a more reliable and interpretable approach for robotic welding automation.
5.3. Future Directions
Building upon the contributions and limitations of this study, several promising directions for future research can be identified.
First, in terms of model efficiency and deployment, the current SANet has a relatively high parameter count. Future work could explore model compression techniques such as pruning, parameter sharing, and knowledge distillation to reduce model complexity and inference cost, enabling deployment on embedded or resource-limited industrial devices.
Second, regarding dataset expansion and diversity, although the proposed dataset encompasses multiple industrial scenarios, it remains insufficient to capture the full range of welding conditions. Future research will aim to enlarge the dataset scale, particularly by incorporating weld samples under different materials, illumination, and noisy environments, thereby improving generalization in real-world applications. Synthetic data generation and domain adaptation techniques may also be leveraged to alleviate the high cost of data collection and annotation.
Third, for adaptive modules and feature modeling, the current StripPooling design relies on fixed strip partitioning strategies. Future extensions may introduce dynamic strip partitioning or learnable geometric priors, allowing the model to adaptively adjust feature extraction to varying seam shapes and scales. Additionally, integrating multimodal information such as infrared or depth images may further enhance robustness under challenging industrial conditions.
Finally, in terms of practical applications and system integration, SANet can be incorporated into robotic welding systems to enable real-time seam detection and path planning, validating its value in actual production lines. By collaborating with industrial control systems, SANet holds promise as a foundational technology for advancing intelligent and automated welding.