Article

Automated Weld Defect Classification Enhanced by Synthetic Data Augmentation in Industrial Ultrasonic Images

1 Phillip M. Drayer Electrical and Computer Engineering Department, Lamar University, Beaumont, TX 77705, USA
2 CRC-Evans, Houston, TX 77066, USA
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(23), 12811; https://doi.org/10.3390/app152312811
Submission received: 1 November 2025 / Revised: 27 November 2025 / Accepted: 1 December 2025 / Published: 3 December 2025

Abstract

Automated ultrasonic testing (AUT) serves as a vital method for evaluating critical infrastructure in industries such as oil and gas. However, a significant challenge in deploying artificial intelligence (AI)-based interpretation methods for AUT data lies in improving their reliability and effectiveness, particularly due to the inherent scarcity of real-world defective data. This study directly addresses data scarcity in a weld defect classification task, specifically the detection of lack of fusion (LOF) defects in weld inspections, using a proprietary industrial ultrasonic B-scan image dataset. This paper leverages state-of-the-art generative models spanning Generative Adversarial Networks (GANs) and Denoising Diffusion Probabilistic Models (DDPMs), namely StyleGAN3, VQGAN with an unconditional transformer, and Stable Diffusion, to produce realistic B-scan images depicting LOF defects. Transformer-based classifiers (ViT-Base, Swin-Tiny, and MobileViT-Small), fine-tuned on the original B-scan image dataset, are then applied to retain only high-confidence positive synthetic samples from each method. The impact of these synthetic images on the classification performance of a ResNet-50 model is evaluated by fine-tuning it with cumulative additions of synthetic images, ranging from 10 to 200 images. Its accuracy on the test set increases by 38.9% relative to the baseline with the addition of either 80 synthetic images from VQGAN with an unconditional transformer or 200 synthetic images from StyleGAN3 to the training set, and by 36.8% with the addition of 150 synthetic images from Stable Diffusion. This also outperforms Transformer-based vision models trained on the original training data. Concurrently, knowledge distillation experiments train ResNet-50 as a student model, leveraging the expertise of ViT-Base and Swin-Tiny as teacher models, to benchmark the effectiveness of adding synthetic data to the training set; the greatest enhancement from distillation is 34.7% relative to the baseline. This work contributes to advancing robust, AI-assisted tools for critical infrastructure inspection and offers practical pathways for enhancing available models in resource-constrained industrial environments.

1. Introduction

Non-destructive testing (NDT) methods, such as X-radiography, Ultrasonic testing (UT), Magnetic particle testing, and Eddy current testing, are critical for maintaining the safety and integrity of components across safety-critical industries (e.g., aerospace, power generation, oil and gas) by enabling the early detection of defects and preventing economic losses and environmental disasters [1]. The labor-intensive nature of NDT and the high risk of human error necessitate the development of automated assistive inspection tools. Deep learning-based solutions for automated defect detection in digital X-radiography [2,3,4] and UT data [5,6] are crucial for mitigating the risks and costs associated with maintaining aging infrastructure. Specifically, UT is favored for its ability to detect small subsurface flaws and generate internal structural images [7,8,9], and is often utilized in high-precision Automated Ultrasonic Testing (AUT) systems, where data is presented in formats like A-scans, B-scans (preferred for detailed visual representation), and C-scans. While Artificial Intelligence, particularly deep learning, offers powerful tools for transforming AUT through automatic feature extraction and pattern recognition, its real-world application is significantly challenged by the limited availability of defective data from industrial settings [10].
To address data scarcity (not limited to AUT) in automated defect detection generally, the authors in [10] present a two-phase approach for generating synthetic, self-labeled defective industrial images. Using generative models like Stable Diffusion, this method aims to overcome data scarcity in automated defect detection. First, the model learns the concept of industrial images. Second, it learns to control the image generation process to produce images with specific defect characteristics. This synthetic data significantly improves crack segmentation model performance, particularly when real annotated data is limited. In another study, ref. [11] proposed the “In&Out” approach, which leverages diffusion models for realistic in-distribution data augmentation in surface defect detection. This method combines diffusion-generated samples with traditional out-of-distribution samples, significantly improving classification performance, especially when positive samples are scarce. DIAG (Diffusion-based In-distribution Anomaly Generation) [12], a training-free pipeline leveraging latent diffusion models for data augmentation in surface defect detection, implements a human-in-the-loop approach in which domain experts provide multimodal guidance through text descriptions and region localization of anomalies, enhancing interpretability and enabling plausible anomaly generation without real positive data.
In AUT, specifically for extending B-scan image datasets, data augmentation using virtual flaws [13] enables the training of modern deep convolutional neural networks for flaw detection in phased-array ultrasonic data, achieving human-level performance in defect classification. DetectionGAN [14], a novel deep learning Generative Adversarial Network (GAN) [15], experimentally demonstrates that expanding training datasets by generating synthetic B-scan images with defects at distinct locations significantly improves the performance of deep convolutional object detection networks for defect detection. A key contribution is the use of a pre-trained object detector as an additional discriminator within the GAN, which enhances the quality of generated images and ensures accurate defect placement. DiffUT [16] addresses the challenge of limited sample availability in ultrasonic B-scan wheel defect data for high-speed rail by using a novel diffusion model, which learns probability and noise distributions through the diffusion process and generates synthetic data that significantly improves defect detection performance.
In this study, the scarcity of data for weld defect classification as a binary task, along with the poor performance of a deep convolutional neural network model, ResNet-50, on the defect class in industrial B-scan images, is addressed by leveraging synthetic data generation to enhance classification performance. In previous work [17], the current authors introduced a real-world ultrasonic B-scan image dataset collected by UT experts during automated girth weld inspection in oil and gas pipelines, focusing on the detection of lack of fusion (LOF) defects. They also evaluated the baseline performance of state-of-the-art deep learning models. In their subsequent study, ref. [18] leveraged a foundational vision model for weld defect detection in industrial ultrasonic B-scan images, demonstrating improved performance with an F1-Score of nearly 0.940 on the same dataset and reducing implementation complexity compared to previously studied methods.
The main contributions of this paper are as follows:
  • State-of-the-art generative models, including StyleGAN3, VQGAN with an unconditional transformer, and Stable Diffusion, are fine-tuned to synthesize realistic B-scan images of LOF defects under limited data conditions.
  • High-confidence synthetic defect images are selected through a filtering process using fine-tuned Transformer-based classifiers.
  • The impact of synthetic augmentation on ResNet-50 is demonstrated, showing substantial performance improvement where the baseline model fails.
  • A comparison between synthetic augmentation and knowledge distillation is provided, demonstrating that synthetic data yields greater performance gains for this task.
The remainder of this paper is organized as follows. Section 2 explains the theoretical background and key terms, along with details about the image generation methods, the deep learning models used, and the evaluation metrics. Section 3 describes the experimental setup, including the dataset, the implementation of the experiments, and the hyperparameter configurations. Section 4 presents a performance comparison of ResNet-50 models fine-tuned on the B-scan dataset, incorporating synthetic images generated by the different image generation methods. Finally, Section 5 concludes the paper.

2. Materials and Methods

2.1. Lack of Fusion (LOF) Detection via Ultrasonic B-Scan

A B-scan in ultrasonic testing provides a two-dimensional image by compiling A-scan waveforms collected along an inspection path, mapping acoustic data to a coordinate system where one axis is the probe’s position and the other is time-of-flight (depth) [19]. Pixel intensities represent the amplitude of reflected echoes, which are used to identify material boundaries and internal flaws. Specifically, Lack of Fusion (LOF) is a common, critical weld defect appearing in a B-scan as a distinct reflector echo, often elongated, at the non-fused interface. Interpreting these B-scan indications requires expert knowledge to distinguish subtle defect signals from background noise and geometry echoes, a task that is challenging to automate. Figure 1A summarizes the general weld inspection and B-scan acquisition process. This study focuses on detecting LOF in B-scans as a binary classification problem (defect vs. non-defect), addressing the difficulty of classifying these low-contrast or noise-masked indications. Figure 2 illustrates sample B-scan images from the dataset containing LOF defects.

2.2. Synthetic Image Generation

Synthetic Image Generation refers to the production of artificial yet realistic images via generative modeling. In this study, additional B-scan images that simulate LOF defects are created using advanced generative models. By learning the statistical features of real B-scan defect images, these models can create new examples that mimic real defect appearances. The motivation is to augment the limited real dataset with diverse synthetic examples of defects, addressing class imbalance (far fewer defect images than non-defect images) and improving the performance of deep learning-based image classifiers. In this paper, three generative approaches are utilized:

2.2.1. StyleGAN3

StyleGAN3 is a state-of-the-art Generative Adversarial Network (GAN) developed by NVIDIA (NVIDIA Corporation, Santa Clara, CA, USA) [20], renowned for its ability to generate high-quality, natural-looking images. A key innovation in StyleGAN3 is its alias-free architecture, which addresses aliasing artifacts common in previous GANs by ensuring every layer of the synthesis network produces a continuous signal, leading to more natural transformations. This is achieved by carefully redesigning the up-sampling and down-sampling layers with proper anti-aliasing filters. Furthermore, StyleGAN3 interprets all signals within the network as continuous, rather than discrete pixel values, which allows for true sub-pixel equivariance to translation and rotation. This fundamental change prevents “texture sticking,” where details appear fixed to pixel coordinates instead of moving naturally with the depicted object. The internal representations of StyleGAN3 develop their own coordinate systems, ensuring that fine details are correctly attached to underlying surfaces.

2.2.2. Denoising Diffusion Probabilistic Models (DDPMs)

DDPMs [21] are a class of generative models that learn to reverse a gradual noising process. They progressively transform random noise into structured data, such as images, over a sequence of steps, achieving high-quality sample generation. Stable Diffusion is a prominent latent diffusion model that builds upon these DDPM principles, operating in a compressed latent space for efficiency. In this study, to fine-tune Stable Diffusion [22] on a limited amount of data, two techniques are utilized: DreamBooth [23] and LoRA [24]. DreamBooth is a fine-tuning technique that enables text-to-image diffusion models to generate personalized images of a specific subject using only a few reference photos. It maintains the subject’s consistent appearance by employing a unique identifier token alongside a class noun, such as “a photo of [V] cat.” This approach allows for diverse image generation while effectively preserving the model’s pre-trained knowledge. Meanwhile, LoRA (Low-Rank Adaptation) is an efficient fine-tuning technique that adds small, trainable low-rank matrices to the model’s layers, which drastically reduces memory and computational demands compared to full fine-tuning.
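To make the LoRA idea concrete, the following minimal PyTorch sketch wraps a frozen linear layer with a trainable low-rank update; the class name, initialization, and scaling follow the common LoRA formulation and are illustrative rather than the exact sd-scripts implementation used later in this paper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank residual: W x + (alpha / r) * B A x."""
    def __init__(self, base: nn.Linear, rank: int = 32, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # pre-trained weights stay frozen
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # low-rank residual added to the frozen projection
        return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)
```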

2.2.3. VQGAN with an Unconditional Transformer

Vector Quantized Generative Adversarial Network (VQGAN) [25] combined with an unconditional transformer is a two-stage generative model. It operates by encoding images into a compact latent space using a vector quantized autoencoder, which discretizes the latent representations into a finite codebook of embeddings. This latent space significantly reduces the dimensionality of the data while preserving semantic structure, allowing for efficient manipulation and high-quality reconstruction of images. The GAN component ensures that generated images from the latent codes are photorealistic and detailed, overcoming the blurriness often seen in traditional autoencoders. Subsequently, an autoregressive transformer models the composition of these visual tokens, enabling high-resolution image synthesis. The synthetic image generation process is illustrated in Figure 1B.
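As a brief illustration of the vector quantization step, the sketch below maps continuous encoder outputs to their nearest codebook entries; the tensor shapes and names are assumptions chosen for clarity, not the taming-transformers implementation.

```python
import torch

def quantize(latents: torch.Tensor, codebook: torch.Tensor):
    """Replace each latent vector with its nearest codebook embedding.

    latents:  (N, D) encoder outputs flattened over spatial positions
    codebook: (K, D) learned embedding vectors
    Returns the quantized vectors and their discrete token indices,
    which the autoregressive transformer later models as a sequence.
    """
    dists = torch.cdist(latents, codebook)   # (N, K) pairwise Euclidean distances
    indices = dists.argmin(dim=1)            # index of the closest code for each latent
    return codebook[indices], indices
```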

2.3. Deep Learning Architectures for Image Classification

Deep learning, a subset of machine learning, employs multi-layered artificial neural networks to learn hierarchical representations from data. In the context of image classification, these models automatically extract and learn relevant features from images, enabling them to identify and categorize visual data with high accuracy. The deep learning models utilized in this study are briefly introduced below:

2.3.1. ResNet-50

ResNet-50 [26] is a prominent Convolutional Neural Network (CNN) architecture, known for its deep structure facilitated by residual connections or skip connections. These connections enable the training of very deep networks by mitigating the vanishing gradient problem, allowing for effective feature learning. In this study, ResNet-50 was selected as the primary classification model to observe the influence of synthetic image inclusion in the training set. The choice of ResNet-50 as the target model for performance improvement is strategic; it is a well-established and computationally less demanding CNN compared to more recent Transformer architectures. Improving its performance via synthetic data makes it a more practical and deployable choice for real-world industrial settings, such as edge devices or mobile inspection units, where computational resources are often limited.
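A minimal sketch of the classifier setup is given below, assuming the timm checkpoint referenced in Section 3.2.2; replacing the classification head with a two-class output is the only change required for the binary LOF task.

```python
import timm
import torch.nn as nn

# Load an ImageNet-pretrained ResNet-50 and swap the classification head
# for the two-class (defect / no-defect) problem.
model = timm.create_model("resnet50.a1_in1k", pretrained=True, num_classes=2)
criterion = nn.CrossEntropyLoss()  # standard hard-label loss used for fine-tuning
```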

2.3.2. Vision Transformers

In recent years, transformer architectures [27], originally developed for sequence modeling in natural language processing, have been adapted for image recognition tasks. Vision Transformers (ViT) split an image into patches and employ self-attention mechanisms to learn global contextual relationships, in contrast to CNNs that learn local spatial filters. ViT-Base [28] directly applies a standard transformer encoder to sequences of image patches. It is pre-trained on vast amounts of image data, enabling strong generalization capabilities. Swin-Tiny [29] is a hierarchical Vision Transformer that computes self-attention within non-overlapping windows, offering improved efficiency and scalability compared to global attention. It addresses the computational cost of ViTs while maintaining strong performance. MobileViT [30] integrates efficient Transformer layers into a lightweight CNN for mobile-friendly deployment. In this study, all three Transformer-based models are utilized for filtering the generated synthetic images, while only ViT-Base and Swin-Tiny are used for knowledge distillation experiments. Figure 1C illustrates the synthetic image quality assessment step, where three Transformer-based classifiers filter generated samples based on high-confidence LOF predictions. Figure 1D depicts the incremental dataset extension process in which the filtered synthetic LOF images are added to the training set in controlled quantities.

2.4. Knowledge Distillation

Knowledge distillation (KD) [31] is a machine learning technique designed to transfer learned representations and knowledge from a large, often complex and high-performing teacher model to a smaller, more computationally efficient student model. Unlike traditional training, where a model learns from hard labels, KD often involves the student learning from soft targets (pseudo-probabilities) or intermediate representations generated by the teacher model, thereby mimicking the teacher’s decision-making process and often achieving comparable performance. In this paper, KD is employed to provide a comparative benchmark against the synthetic data augmentation strategy, demonstrating the effectiveness of the proposed method in enhancing the performance of a weaker model (ResNet-50).

2.5. Evaluation Metrics

To quantitatively evaluate defect classification performance, standard classification metrics are employed. Accuracy is calculated as the sum of true positives (TPs) and true negatives (TNs) divided by the total number of test samples, providing the proportion of correctly classified defect and non-defect instances. However, accuracy can be misleading in the presence of class imbalance; for example, if defects are rare, a model that always predicts “no defect” may still achieve high accuracy. Therefore, greater emphasis is placed on precision, recall, and the F1-Score, particularly for the defect class.
Precision (P) is computed as the number of TPs divided by the sum of TPs and false positives (FPs). This metric reflects the proportion of predicted defects that are actual defects, thus indicating the model’s ability to control false alarms (FPs). Recall (R) is determined by dividing the number of TPs by the sum of TPs and false negatives (FNs), capturing the model’s ability to identify actual defects. The F1-Score, representing the harmonic mean of precision and recall, provides a balanced measure of performance. A perfect F1-Score of 1.0 implies complete detection of defects with no FPs or FNs.
These metrics are reported on a held-out test set of real B-scan images. In practical applications, high precision is valued to avoid unnecessary follow-up actions on FPs, while high recall is crucial to minimize missed defect cases. The F1-Score serves as a concise indicator of the model’s overall effectiveness in detecting the rare defect class. AUC (Area Under the Receiver Operating Characteristic Curve) is a comprehensive metric that evaluates the model’s ability to distinguish between positive and negative classes across all possible classification thresholds. It is robust to class imbalance and provides an aggregate measure of performance. Figure 1E shows the final stage of the workflow, where the augmented dataset is used to fine-tune ResNet-50 and evaluate its performance.
$$\text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}$$
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{Recall} = \frac{TP}{TP + FN}$$
$$\text{F1-Score} = \frac{2 \times P \times R}{P + R}$$
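For reference, these metrics can be computed with scikit-learn as in the short sketch below; the helper name and the use of predicted defect-class probabilities for the AUC are assumptions.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

def evaluate_defect_class(y_true, y_pred, y_score):
    """y_true/y_pred: 0/1 labels (1 = LOF defect); y_score: predicted defect probability."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
        "auc": roc_auc_score(y_true, y_score),
    }
```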

3. Experimental Setup

3.1. Dataset

The dataset utilized in this study is a proprietary collection of ultrasonic B-scan images, compiled from 87 distinct weld inspection records. These records were acquired by experienced UT experts using an automated girth weld inspection system, which employs phased array technology for the inspection of onshore oil and gas pipelines. It is important to note that the authors of this study were not involved in the initial inspection process; rather, the B-scan images were collected from pre-existing proprietary weld inspection data provided by these UT experts. This utilization of real-world B-scan images with genuine defects contrasts with the prevalent reliance in research on in-laboratory images or artificially flawed data, highlighting the data acquisition challenges faced by industry in training robust deep learning-based models.
The recorded weld data represents J-bevel weld types, and the primary weld defect class identified and targeted for detection is lack of fusion (LOF), recognized as the most common defect encountered in automated girth welding within this domain. Each recorded weld originally consisted of a 1920 × 1080 image containing multiple strip charts, which represent the output from the channels of the phased array probes used in AUT. These strip charts also provided views from both downstream and upstream phased array probes, along with detailed indications of the weld zones represented by each channel.
From these weld inspection records, a total of 359 B-scan strip chart images were collected by the authors. The dataset exhibits a class imbalance, comprising 234 negative images (no LOF defect) and 95 positive images (containing the lack of fusion defect). The distribution of the B-scan dataset includes images predominantly from the Fill 1, Hot Pass 1, and Hot Pass 2 zones of the weld. Figure 3 illustrates these weld zones and conveys a general idea of how the weld is built up in different regions (Root, Body, and Cap) using multiple passes. The number of passes shown (Root, LCP, Hot-Pass 1, etc.) is only an example; in practice, the number of weld passes varies with factors such as the pipe’s wall thickness, and thicker pipes typically require more fill and cap passes. The core task addressed in this study is image classification, specifically identifying the presence or absence of LOF defects. For robust experimental evaluation, the dataset was split into training, validation, and test sets using a stratified sampling approach. This stratification ensured that the proportion of positive and negative samples was maintained across the splits, with a ratio of 0.64 for training, 0.16 for validation, and 0.20 for testing. For transparency, it is acknowledged that not using a weld-ID–based grouped split may introduce a potential limitation; however, because each weld is represented by strip charts originating from different weld zones, the risk of information leakage is considered to be substantially reduced. For more information about the dataset, please refer to [17]. Table 1 shows the number of negative and positive B-scan images in each split.
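A minimal sketch of the stratified 0.64/0.16/0.20 split is shown below, assuming lists `paths` and `labels` built from the strip-chart images; the two-step use of train_test_split and the seed value are illustrative.

```python
from sklearn.model_selection import train_test_split

# Hold out 20% for testing, stratified on the LOF label.
train_val_x, test_x, train_val_y, test_y = train_test_split(
    paths, labels, test_size=0.20, stratify=labels, random_state=42)

# 20% of the remaining 80% equals 16% of the full dataset for validation.
train_x, val_x, train_y, val_y = train_test_split(
    train_val_x, train_val_y, test_size=0.20, stratify=train_val_y, random_state=42)
```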

3.2. Experiments Implementation Details

3.2.1. Fine-Tuning Procedures for Synthetic Image Generation Models

To address the scarcity of positive (LOF defect) samples in the proprietary dataset, three advanced generative models, VQGAN with an unconditional transformer, Stable Diffusion, and StyleGAN3, are employed to generate synthetic B-scan images. These models are fine-tuned exclusively on the positive images from the training set of the original dataset. Specifically, 73 positive images, which are resized to 256 × 256 pixels, are oversampled through duplication to a total of 146 images to provide a sufficient base for fine-tuning the generative models. A random seed of 42 is consistently used across all experiments to ensure reproducibility. All experiments are implemented in PyTorch 2.0.2 (StyleGAN3), 1.7.1 (VQGAN), 2.1.2 (Stable Diffusion) [32] and executed on a single NVIDIA A100 (40 GB) GPU (NVIDIA Corporation, Santa Clara, CA, USA).
  • VQGAN With An Unconditional Transformer: For the VQGAN with an unconditional transformer model, the instructions and default configurations from the main repository (https://github.com/CompVis/taming-transformers, accessed on 10 May 2025) are followed. This involves a two-stage fine-tuning process. First, the VQGAN model is fine-tuned on the positive B-scan images for 30 epochs using pre-trained ImageNet weights, with a batch size of 8. The model with the lowest validation loss, which occurred at epoch 24, is selected. Subsequently, an unconditional transformer model, architecturally similar to GPT-2 [33] and configured with the following parameters: vocab_size of 1024, block_size of 512, 24 layers, 16 heads, and an embedding size of 1024, is trained for 20 epochs with a batch size of 32. The best-performing model (based on validation loss), selected from epoch 16, is used for subsequent analysis.
  • StyleGAN3: The StyleGAN3 model is fine-tuned using the stylegan3-transform configuration and the checkpoint (stylegan3-t-ffhqu-256x256.pkl) pre-trained on the FFHQ-U 256 × 256 dataset, adhering to the default settings provided in the official repository (https://github.com/NVlabs/stylegan3, accessed on 17 May 2025). Training is carried out for 100,000 iterations with a batch size of 32.
  • Stable Diffusion: The fine-tuning procedure for Stable Diffusion follows the guidelines outlined in the open-source repository (https://github.com/kohya-ss/sd-scripts, accessed on 3 May 2025). The sd-v1-5-pruned-noema-fp16 checkpoint (https://huggingface.co/hollowstrawberry/stable-diffusion-guide/blob/main/models/sd-v1-5-pruned-noema-fp16.safetensors, accessed on 3 May 2025) is used as the base model for fine-tuning on the positive B-scan images. DreamBooth fine-tuning, a technique designed to teach a new concept to a diffusion model with fewer iterations and without overwriting its extensive prior knowledge, is applied. For this purpose, 200 regularization images generated using the prompt “background defected” are placed alongside the positive images from the dataset’s training set, which are associated with the prompt “skt background defected” (where “skt” serves as a token identifier without semantic meaning). A parameter-efficient fine-tuning (PEFT) approach, specifically LoRA (LoRA-c3Lier), is employed following the configuration of an extended LoRA variant that adapts both linear layers and convolutional layers. In this setup, LoRA adapters are applied to the linear and 1 × 1 convolution projection layers with a rank (network_dim) of 32 and an alpha value of 16. In addition, 3 × 3 convolution layers are also adapted using convolutional LoRA components configured with rank (conv_dim) 4 and alpha 1. Training is conducted for 25 epochs with a learning rate of 1 × 10−3 and a batch size of 256. Figure 4 presents positive sample B-scan images from the training set alongside images generated by each different method.
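Once fine-tuned, each generator is sampled offline to produce candidate defective B-scans. The sketch below illustrates this step for the Stable Diffusion LoRA using the diffusers library; the base-model identifier, LoRA file path, and sampling parameters are assumptions, since the actual training and sampling in this work were performed with the kohya-ss scripts.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a Stable Diffusion 1.5 base model and attach the trained LoRA adapter.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
pipe.load_lora_weights("lora/bscan_lof_lora.safetensors")  # hypothetical path

# Sample candidate defective B-scans with the identifier token used for DreamBooth.
candidates = []
for _ in range(500):
    image = pipe("skt background defected",
                 num_inference_steps=30, guidance_scale=7.5).images[0]
    candidates.append(image)
```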

3.2.2. Fine-Tuning Procedures for Image Classification Models

Four different image classification models are fine-tuned on the B-scan image dataset, representing both CNN and Transformer architectures: ResNet-50 (https://huggingface.co/timm/resnet50.a1_in1k, accessed on 30 May 2025), ViT-Base (https://huggingface.co/google/vit-base-patch16-224, accessed on 30 May 2025), Swin Transformer (Tiny) (https://huggingface.co/microsoft/swin-tiny-patch4-window7-224, accessed on 30 May 2025), and MobileViT (Small) (https://huggingface.co/apple/mobilevit-small, accessed on 30 May 2025). All models are initialized with ImageNet-pretrained weights to provide a strong starting point, given the limited data available. The three Transformer-based image classification models are fine-tuned on the original dataset without any synthetic images; they are chosen for their strong baseline performance on the original dataset and are used as a filtering mechanism for synthetic data. From each fine-tuned synthetic image generation model, 500 synthetic images are generated, and only those confidently classified as positive samples by all three high-performing Transformer models are retained for inclusion in the training set. This filtering step is considered crucial to ensure that only high-quality, truly representative synthetic positive images are added to the training set, preventing the inclusion of noisy or ambiguous samples that could negatively impact the classifier’s performance. The specific counts of positively classified synthetic images retained from each method are as follows: 430 images from Stable Diffusion, 211 images from StyleGAN3, and 207 images from VQGAN with an unconditional transformer. The differences in retained samples across the generative methods reflect the quality of the generated images, since all three filters were trained identically and a sample is kept only when all three models agree it is positive.
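A minimal sketch of this consensus filtering step is given below; each classifier is assumed to be a callable returning two-class logits for a preprocessed image tensor, and the 0.5 probability threshold is an assumption, as the paper only specifies that samples must be confidently classified as positive by all three models.

```python
import torch

@torch.no_grad()
def consensus_filter(images, classifiers, threshold=0.5):
    """Return indices of synthetic images that every classifier labels as the LOF class."""
    kept = []
    for idx, img in enumerate(images):
        # probability of the defect class (index 1) from each fine-tuned classifier
        probs = [m(img.unsqueeze(0)).softmax(dim=-1)[0, 1] for m in classifiers]
        if all(p > threshold for p in probs):   # unanimous agreement required
            kept.append(idx)
    return kept
```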
To evaluate the effect of synthetic image augmentation, ResNet-50 is selected as the primary classification model. The filtered, positively classified synthetic images are randomly added to the original training set in cumulative steps of 10, 20, 50, 80, 100, 120, 150, and 200 images. This incremental approach enables a detailed assessment of the impact of adding synthetic data on model performance, identifying the optimal amount of synthetic data and the point at which additional data no longer improves, or may even harm, performance. At each cumulative step, ResNet-50 is fine-tuned on the augmented training set. Figure 5 depicts the complete workflow, encompassing the generation of synthetic images, their selection, and the subsequent fine-tuning of the image generation models and ResNet-50.
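The incremental augmentation experiment can be expressed as the following outer loop, where `filtered_synthetic` holds the consensus-filtered positive images, `real_train` is the original training set, and `fine_tune`/`evaluate` are hypothetical helpers wrapping the training and evaluation procedures described in this section.

```python
import random

random.seed(42)
random.shuffle(filtered_synthetic)                 # fix one random order so steps are cumulative
steps = [10, 20, 50, 80, 100, 120, 150, 200]
results = {}
for n in steps:
    augmented_train = real_train + filtered_synthetic[:n]       # original data plus n synthetic positives
    model = fine_tune("resnet50.a1_in1k", augmented_train, val_set)
    results[n] = evaluate(model, test_set)                       # precision/recall/F1/AUC on real test data
```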
To provide a comparative analysis and demonstrate the effectiveness of the proposed approach, the knowledge distillation technique is employed. This involves using ViT-Base and Swin-Tiny as teacher models due to their superior performance on the original dataset, while ResNet-50 is designated as the student model. To perform knowledge distillation, the logits (raw prediction scores) from both the teacher model (T) and the student model (S) are first obtained. These logits are then scaled by a temperature parameter, which controls the softness of the teacher’s predictions. A higher temperature yields softer probability distributions, encouraging the student to learn the general shape of the teacher’s output rather than focusing on exact probabilities. The overall influence of this knowledge distillation process on the student’s training is then adjusted by a lambda parameter, which weighs the importance of the distillation loss. In the experiments, the temperature is set to 5 and lambda to 0.5.
The core of this distillation relies on the Kullback–Leibler (KL) Divergence loss [34] to quantify the difference between the probability distribution of the student (S) and the teacher (T). Given two data distributions, T and S, the KL Divergence measures the additional information required to represent distribution T using distribution S. If T and S are identical, their KL divergence is zero, as no extra information is needed to describe T from S. This characteristic makes KL divergence particularly useful in knowledge distillation, as it helps the student model (S) to minimize its divergence from the teacher model (T), thereby learning the teacher’s knowledge. For discrete probability distributions T and S, the KL Divergence of T from S, denoted as $D_{KL}(T \parallel S)$, is given by:
$$D_{KL}(T \parallel S) = \sum_{x \in X} T(x) \ln \frac{T(x)}{S(x)}$$
where X is the set of all possible events and x is an event.
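In code, the distillation objective can be written as below; the weighting of the two terms and the conventional T² rescaling of the KL term follow the standard Hinton formulation and are assumptions about the exact implementation.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 5.0, lam: float = 0.5):
    """KL-divergence distillation term plus hard-label cross-entropy (T = 5, lambda = 0.5)."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(log_soft_student, soft_teacher,
                  reduction="batchmean") * temperature ** 2   # D_KL(teacher || student)
    ce = F.cross_entropy(student_logits, labels)
    return lam * kd + (1.0 - lam) * ce
```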
All image classification models involved in the study, including the three Transformer-based models and ResNet-50, are fine-tuned for up to 100 epochs with a batch size of 128. A learning rate of 1 × 10−4 is used, coupled with a cosine learning rate scheduler without weight decay, and 10 warm-up epochs are incorporated to stabilize the initial training. For reproducibility, a fixed random seed of 42 is used across all experiments. Regarding image transformations for each model, in the training set, each image is first randomly cropped and resized to the size specified by the model’s input requirements, then randomly flipped horizontally, which only alters visual orientation and is physically valid, and finally normalized using the mean and standard deviation values of the ImageNet dataset. For the validation and test sets, the transformations are deterministic to ensure consistent evaluation: each image is resized, then center-cropped to the model’s input size, and finally normalized in the same manner as in the training set. All computational tasks are performed using PyTorch on a single NVIDIA A100 (40 GB) GPU. The checkpoint of the model achieving the lowest evaluation loss on the validation set during fine-tuning is selected and subsequently evaluated on the test set to compare classification metrics. The implementation of these experiments is inspired by the Hugging Face image-classification example (https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification, accessed on 30 May 2025) and the knowledge distillation tutorial (https://github.com/huggingface/transformers/blob/main/docs/source/en/tasks/knowledge_distillation_for_image_classification.md, accessed on 10 June 2025).
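The preprocessing described above corresponds to the following torchvision pipelines; the 224-pixel input size matches the checkpoints used here, while the 256-pixel resize before center-cropping is an assumed intermediate value.

```python
from torchvision import transforms

IMAGENET_MEAN, IMAGENET_STD = [0.485, 0.456, 0.406], [0.229, 0.224, 0.225]

train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224),          # random crop and resize to the model input size
    transforms.RandomHorizontalFlip(),          # physically valid orientation change for B-scans
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])

eval_tf = transforms.Compose([                  # deterministic pipeline for validation/test
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])
```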

4. Results and Discussion

The results of our experiments are summarized in Figure 6 and Figure 7 and Table 2, Table 3, Table 4 and Table 5. These illustrate the impact of synthetic data augmentation (using each synthetic image generation method), the effect of incremental synthetic additions, and the outcomes of knowledge distillation. All percentage improvements reported in this work refer to relative increases over the baseline in Table 2 unless otherwise specified. Moreover, these tables include AUC and overall accuracy for the binary classification task, along with precision, recall, and F1-Score reported specifically for the defect class.

4.1. Qualitative and Feature-Space Analysis of Synthetic Images

To qualitatively evaluate the similarity between real B-scan images and the filtered synthetic defective B-scan images generated by each model, a comparative t-SNE analysis was performed. Feature embeddings were extracted from ViT-Base, Swin-Tiny, and MobileViT-Small for all real positive and negative samples as well as for the filtered synthetic images generated by VQGAN with an unconditional transformer, StyleGAN3, and Stable Diffusion. The resulting feature sets were standardized and projected into two dimensions using t-SNE, enabling a visual comparison of how the synthetic samples align with the distribution of real B-scan images in the learned feature spaces of the classifiers.
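The projection itself follows the standard scikit-learn workflow sketched below, where `features` is an (N, D) array of embeddings from one of the three classifiers and `origins` records whether each row comes from a real positive, real negative, or one of the synthetic sets; the perplexity value and seed are assumptions.

```python
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE

features_std = StandardScaler().fit_transform(features)        # standardize each embedding dimension
embedding_2d = TSNE(n_components=2, perplexity=30,
                    random_state=42).fit_transform(features_std)
# `embedding_2d` can then be scatter-plotted, colored by `origins`,
# to reproduce the qualitative comparison shown in Figure 6.
```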
The t-SNE projections in Figure 6A,B show that VQGAN with an unconditional transformer images formed a distinct, compact cluster within the broader region occupied by the real positive class, consistent with their more abstract visual appearance in Figure 4. In contrast, synthetic samples from StyleGAN3 and Stable Diffusion clustered closely with the real defective B-scan images, reflecting their stronger visual resemblance. Importantly, all three synthetic sets remained largely separated from the real negative class, indicating that the generative models did not introduce substantial misleading artifacts.
These results indicate that, even though visually different, VQGAN with an unconditional transformer images still encode salient low-level characteristics of defective B-scan images, making them useful for model training. The more overlapped appearance of MobileViT-Small embeddings in Figure 6C arises from its lightweight architecture, which uses compressed intermediate representations that capture less global structure compared to ViT-Base and Swin-Tiny. This behavior is expected for mobile-oriented models and does not contradict the strong classification performance reported in Table 2, because t-SNE often produces more overlapping clusters for compact models and should be interpreted only as a qualitative visualization rather than as direct evidence of poor class separability. In this study, no formal domain-expert evaluation was performed, as the objective was to assess how the synthetic image sets influence the defect classification performance of ResNet-50; accordingly, their representation in the feature spaces of the three Transformer-based models serves as the qualitative basis for comparison rather than human visual assessment.

4.2. ResNet-50 Baseline Performance vs. Transformer-Based Models

Table 2 presents the baseline classification performance of each model (ResNet-50, ViT-Base, Swin-Tiny, MobileViT-Small) trained only on the original dataset. Without augmentation, the ResNet-50 model struggles, with zero precision, recall, and F1-Score on the defect class: trained solely on the limited and imbalanced original dataset, it fails to correctly classify any positive samples and effectively performs no better than random guessing or simply predicting the majority (negative) class. This dramatic baseline failure of ResNet-50 is not just a data point; it is a powerful justification for the entire research endeavor, demonstrating the critical need for synthetic data to make a standard, deployable CNN like ResNet-50 effective for detecting rare, critical defects in real-world industrial settings.
In contrast, the Transformer-based models demonstrate significantly higher baseline performance on the original dataset. MobileViT-Small, despite being the most lightweight among them (5.6M parameters), achieves a remarkable F1-Score of 0.875 and an AUC of 0.969, highlighting its efficiency and strong generalization capabilities. ViT-Base and Swin-Tiny also exhibit robust performance, establishing them as strong baselines and suitable candidates for teacher models in knowledge distillation.

4.3. Improving ResNet-50 for LOF Defect Classification Through Synthetic Data

To analyze the relationship between the quantity of synthetic data and model performance, the number of synthetic images in the training set was incrementally increased for fine-tuning ResNet-50. As shown in Figure 7, the effect of adding synthetic images (generated via VQGAN with unconditional transformer, StyleGAN3, and Stable Diffusion) on classification metrics is illustrated by the plots.
Focusing on the F1-Score plot (bottom right), a steep rise in performance is observed with an initial small addition of synthetic images. For all three methods, the F1-Score climbs sharply when moving from 0 to approximately 50 synthetic images. This indicates that even a modest injection of high-quality synthetic data significantly improves the model’s learning capability. As more synthetic images are added, the F1-Score continues to improve, but with a tapering slope, eventually plateauing around 100–150 synthetic images. Beyond this point, the improvements become smaller, indicating diminishing returns from additional examples. Moreover, no significant degradation in accuracy is observed when synthetic data are added, up to 200 images. This suggests that the filtering and quality control measures for synthetic images are effective in preventing the classifier from being misled.
A robust and consistent trend is observed across all metrics with images generated by StyleGAN3 and Stable Diffusion. Instability is observed with VQGAN with an unconditional transformer when fewer than 50 images are used, as reflected by the oscillations in its curves. However, when synthetic images generated by this model are utilized, ResNet-50 achieves higher performance than with the other methods using only 80 images. In Table 3, the best performance achieved by each method and the corresponding number of added images are reported. It is important to note that synthetic images generated by the VQGAN with an unconditional transformer method do not visually resemble those generated by the other methods or the original images. Nevertheless, these images enhance the performance of a deep learning model because they contain low-level features of LOF defects that are helpful for the model, though not for training human experts. If synthetic images are intended to be used for training human experts, only those generated by StyleGAN3 and Stable Diffusion are considered helpful, as their appearance closely resembles that of the original images. Excluding VQGAN with an unconditional transformer, the addition of StyleGAN3-generated images improves the performance of ResNet-50 beyond that of the Transformer-based models when 200 synthetic positive samples are used.
The similar best-case performance across the three generative methods in Table 3 occurs because each method effectively enriches the positive class with informative synthetic samples. StyleGAN3 and Stable Diffusion generate visually realistic LOF patterns that closely resemble real defects, while VQGAN with an unconditional transformer produces less realistic but still meaningful representations that occupy a complementary region of the feature space. After filtering by the three Transformer-based classifiers, only high-confidence synthetic images are retained, ensuring that all generators provide useful training signals for ResNet-50. Consequently, the model reaches comparable peak performance with each augmentation strategy, even though the number of synthetic samples required to achieve this level differs among the methods.
The diversity analysis in Table 4 provides further insight into why StyleGAN3 requires more synthetic images to reach its optimal performance. StyleGAN3 exhibits consistently higher mean pairwise distances (MPD) than VQGAN with an unconditional transformer across all three feature extractors, indicating greater intra-class variability in the generated samples. Its convex hull area (CHA) is also substantially larger than that of VQGAN for Swin-Tiny and MobileViT-Small, reflecting broader coverage of the feature space. This higher variability means that ResNet-50 benefits from a larger number of StyleGAN3 samples before fully internalizing the diversity of the synthesized defect appearances. In contrast, VQGAN-generated samples form tighter, more homogeneous clusters, allowing the model to achieve peak performance with fewer examples, despite their reduced visual realism.
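For reference, both diversity measures can be computed as in the sketch below, given the feature-space coordinates of a synthetic set; whether the paper derives them from 2-D projections or higher-dimensional embeddings is not specified, so the 2-D input assumed here is illustrative.

```python
import numpy as np
from scipy.spatial import ConvexHull
from scipy.spatial.distance import pdist

def diversity_metrics(points: np.ndarray):
    """Mean pairwise distance (MPD) and convex hull area (CHA) for a set of 2-D points."""
    mpd = pdist(points).mean()             # average Euclidean distance over all pairs
    cha = ConvexHull(points).volume        # for 2-D input, `volume` is the enclosed area
    return mpd, cha
```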

4.4. Synthetic Data vs. Knowledge Distillation

In Table 5, the performance of ResNet-50 after fine-tuning as a student model is provided. Although the enhancement is significant, it does not reach the performance achieved by adding synthetic images generated with the image generation methods to the training set. With knowledge distillation, the highest F1-Score is achieved when Swin-Tiny is used as the teacher model, which is 6.63% lower than the performance achieved with the addition of 80 or 200 synthetic images generated via VQGAN with an unconditional transformer or StyleGAN3, respectively. The reason MobileViT-Small is not utilized for the knowledge distillation experiments is that it has significantly fewer parameters as a teacher compared to the ResNet-50 student model, and it produces compressed representations due to its optimization for inference speed and memory efficiency, whereas ResNet-50 can better absorb complex representations from ViT-Base and Swin-Tiny, which have high-dimensional feature representations. Indeed, when MobileViT-Small was tested as the teacher, no improvement over ResNet-50’s baseline performance was observed. To avoid misinterpretation, it is clarified that the comparison between synthetic data augmentation and knowledge distillation in this study applies only to the ResNet-50 architecture and is not intended to be generalized across other model families.

5. Conclusions

This study addressed the critical issue of data scarcity in automated ultrasonic testing (AUT) for weld defect classification, focusing on the detection of rare and severe lack of fusion (LOF) defects. By leveraging synthetic B-scan image generation, the study demonstrated a substantial improvement in deep learning model performance, particularly for the ResNet-50 architecture. Initially, ResNet-50 exhibits weak performance on the imbalanced real-world dataset for detecting defective B-scan images; its accuracy of 0.671 reflects majority-class accuracy due to the dominance of normal images, resulting in an F1-Score, precision, and recall of 0.000 and an AUC of 0.500. However, when supplemented with synthetic images, specifically 80 generated by VQGAN with an unconditional transformer or 200 generated by StyleGAN3, the model achieved an F1-Score of 0.884 and perfect precision of 1.000, representing up to a 38.9% increase in accuracy on the test set. The study also evaluated different generative models, finding that while VQGAN with an unconditional transformer exhibited some instability with a smaller number of synthetic images, it still enabled high performance with limited synthetic input. StyleGAN3 and Stable Diffusion consistently improved results, with StyleGAN3 achieving similar peak performance using 200 synthetic samples. These findings underscore that high-quality synthetic data, even when visually different from real images, can contribute critical low-level features for effective training. In comparison, knowledge distillation improved baseline performance but remained 6.63% lower relative to the best results achieved using synthetic data. This highlights the superior utility of synthetic data augmentation for training efficient models like ResNet-50, making them practical for deployment in resource-constrained industrial settings such as mobile inspection units or edge devices. Overall, this work advances AI-assisted infrastructure inspection by offering scalable and high-performance solutions using synthetic data.
Although industrial ultrasonic data are used in this study, a complete assessment of operational efficiency would require deployment within an actual AUT inspection workflow. The synthetic image generation steps are carried out entirely offline, and only the final ResNet-50 classifier is employed during inference. As this lightweight model is capable of real-time operation on standard industrial hardware, the proposed approach is considered suitable for integration into existing inspection systems used by UT experts. It is expected that future field deployment will further confirm its practicality and effectiveness in industrial environments.

Author Contributions

Conceptualization, A.-M.N.-S. and H.Z.; methodology, A.-M.N.-S. and H.Z.; software, A.-M.N.-S.; validation, A.-M.N.-S. and H.Z.; formal analysis, A.-M.N.-S. and H.Z.; investigation, A.-M.N.-S. and H.Z.; resources, H.Z., V.S.B. and Z.B.-M.; data curation, A.-M.N.-S. and V.S.B.; writing—original draft preparation, A.-M.N.-S.; writing—review and editing, A.-M.N.-S., H.Z. and V.S.B.; visualization, A.-M.N.-S.; supervision, H.Z.; project administration, H.Z.; funding acquisition, Z.B.-M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All open-source implementations used in this paper are referenced in the main body of the article. However, the dataset is proprietary to CRC-Evans, and the authors are not authorized to publish it.

Acknowledgments

This paper is part of the first author’s Doctor of Engineering dissertation [35] which is being carried out in the Department of Electrical and Computer Engineering at Lamar University.

Conflicts of Interest

Author Vinay S. Baburao is employed by the company CRC-Evans. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Adegboye, M.A.; Fung, W.-K.; Karnik, A. Recent Advances in Pipeline Monitoring and Oil Leakage Detection Technologies: Principles and Approaches. Sensors 2019, 19, 2548. [Google Scholar] [CrossRef]
  2. Ajmi, C.; Zapata, J.; Elferchichi, S.; Laabidi, K. Advanced Faster-RCNN Model for Automated Recognition and Detection of Weld Defects on Limited X-Ray Image Dataset. J. Nondestruct. Eval. 2024, 43, 14. [Google Scholar] [CrossRef]
  3. Naddaf-Sh, M.M.; Naddaf-Sh, S.; Zargarzadeh, H.; Zahiri, S.M.; Dalton, M.; Elpers, G.; Kashani, A.R. 9—Defect Detection and Classification in Welding Using Deep Learning and Digital Radiography. In Fault Diagnosis and Prognosis Techniques for Complex Engineering Systems; Karimi, H., Ed.; Academic Press: Cambridge, MA, USA, 2021; pp. 327–352. [Google Scholar]
  4. Naddaf-Sh, S.; Naddaf-Sh, M.M.; Zargarzadeh, H.; Dalton, M.; Ramezani, S.; Elpers, G.; Baburao, V.S.; Kashani, A.R. Real-Time Explainable Multiclass Object Detection for Quality Assessment in 2-Dimensional Radiography Images. Complexity 2022, 2022, 4637939. [Google Scholar] [CrossRef]
  5. Medak, D.; Posilović, L.; Subašić, M.; Budimir, M.; Lončarić, S. Automated Defect Detection from Ultrasonic Images Using Deep Learning. IEEE Trans. Ultrason. Ferroelectr. Freq. Control 2021, 68, 3126–3134. [Google Scholar] [CrossRef] [PubMed]
  6. Ye, J.; Ito, S.; Toyama, N. Computerized Ultrasonic Imaging Inspection: From Shallow to Deep Learning. Sensors 2018, 18, 3820. [Google Scholar] [CrossRef] [PubMed]
  7. Dwivedi, S.K.; Vishwakarma, M.; Soni, A. Advances and Researches on Non Destructive Testing: A Review. Mater. Today Proc. 2018, 5, 3690–3698. [Google Scholar] [CrossRef]
  8. Swornowski, P.J. Scanning of the Internal Structure Part with Laser Ultrasonic in Aviation Industry. Scanning 2011, 33, 378–385. [Google Scholar] [CrossRef]
  9. Tu, X.L.; Zhang, J.; Gambaruto, A.M.; Wilcox, P.D. A Framework for Computing Directivities for Ultrasonic Sources in Generally Anisotropic, Multi-Layered Media. Wave Motion 2024, 128, 103299. [Google Scholar] [CrossRef]
  10. Valvano, G.; Agostino, A.; De Magistris, G.; Graziano, A.; Veneri, G. Controllable Image Synthesis of Industrial Data Using Stable Diffusion. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 5354–5363. [Google Scholar]
  11. Capogrosso, L.; Girella, F.; Taioli, F.; Dalla Chiara, M.; Aqeel, M.; Fummi, F.; Setti, F.; Cristani, M. Diffusion-Based Image Generation for In-Distribution Data Augmentation in Surface Defect Detection. arXiv 2024, arXiv:2406.00501. [Google Scholar]
  12. Girella, F.; Liu, Z.; Fummi, F.; Setti, F.; Cristani, M.; Capogrosso, L. Leveraging Latent Diffusion Models for Training-Free In-Distribution Data Augmentation for Surface Defect Detection. In Proceedings of the 2024 International Conference on Content-Based Multimedia Indexing (CBMI), Reims, France, 4–6 September 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–7. [Google Scholar]
  13. Virkkunen, I.; Koskinen, T.; Jessen-Juhler, O.; Rinta-Aho, J. Augmented Ultrasonic Data for Machine Learning. J. Nondestruct. Eval. 2021, 40, 4. [Google Scholar] [CrossRef]
  14. Posilović, L.; Medak, D.; Subašić, M.; Budimir, M.; Lončarić, S. Generative Adversarial Network with Object Detector Discriminator for Enhanced Defect Detection on Ultrasonic B-Scans. Neurocomputing 2021, 459, 361–369. [Google Scholar] [CrossRef]
  15. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Nets. Adv. Neural Inf. Process. Syst. 2014, 27, 2672–2680. [Google Scholar]
  16. Zhang, Q.; Tian, K.; Zhang, F.; Li, J.; Yang, K.; Luo, L.; Gao, X.; Peng, J. DiffUT: Diffusion-Based Augmentation for Limited Ultrasonic Testing Defects in High-Speed Rail. NDT E Int. 2025, 154, 103388. [Google Scholar] [CrossRef]
  17. Naddaf-Sh, A.M.; Baburao, V.S.; Zargarzadeh, H. Automated Weld Defect Detection in Industrial Ultrasonic B-Scan Images Using Deep Learning. NDT 2024, 2, 108–127. [Google Scholar] [CrossRef]
  18. Naddaf-Sh, A.M.; Baburao, V.S.; Zargarzadeh, H. Leveraging Segment Anything Model (SAM) for Weld Defect Detection in Industrial Ultrasonic B-Scan Images. Sensors 2025, 25, 277. [Google Scholar] [CrossRef]
  19. Krautkrämer, J.; Krautkrämer, H. Ultrasonic Testing of Materials; Springer Science and Business Media: Berlin/Heidelberg, Germany, 2013. [Google Scholar]
  20. Karras, T.; Aittala, M.; Laine, S.; Härkönen, E.; Hellsten, J.; Lehtinen, J.; Aila, T. Alias-Free Generative Adversarial Networks. Adv. Neural Inf. Process. Syst. 2021, 34, 852–863. [Google Scholar]
  21. Ho, J.; Jain, A.; Abbeel, P. Denoising Diffusion Probabilistic Models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851. [Google Scholar]
  22. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-Resolution Image Synthesis with Latent Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695. [Google Scholar]
  23. Ruiz, N.; Li, Y.; Jampani, V.; Pritch, Y.; Rubinstein, M.; Aberman, K. DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 22500–22510. [Google Scholar]
  24. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. In Proceedings of the ICLR 2022, Online, 25–29 April 2022; Volume 1, p. 3. [Google Scholar]
  25. Esser, P.; Rombach, R.; Ommer, B. Taming Transformers for High-Resolution Image Synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 12873–12883. [Google Scholar]
  26. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  27. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
  28. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16×16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  29. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  30. Mehta, S.; Rastegari, M. MobileViT: Light-Weight, General-Purpose, and Mobile-Friendly Vision Transformer. arXiv 2021, arXiv:2110.02178. [Google Scholar]
  31. Hinton, G.; Vinyals, O.; Dean, J. Distilling the Knowledge in a Neural Network. arXiv 2015, arXiv:1503.02531. [Google Scholar] [CrossRef]
  32. Paszke, A. PyTorch: An Imperative Style, High-Performance Deep Learning Library. arXiv 2019, arXiv:1912.01703. [Google Scholar]
  33. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models Are Few-Shot Learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
  34. Habib, G.; Kaleem, S.M.; Rouf, T.; Lall, B. A Comprehensive Review of Knowledge Distillation in Computer Vision. arXiv 2024, arXiv:2404.00936. [Google Scholar] [CrossRef]
  35. Naddaf Shargh, A. Deep Learning Methods for Defect Analysis in Industrial Ultrasonic Images. Ph.D. Thesis, Lamar University, Beaumont, TX, USA, 2025. Available online: https://www.proquest.com/openview/2f717dbb11a339f3c9747c35937880cc/1?pq-origsite=gscholar&cbl=18750&diss=y (accessed on 31 October 2025).
Figure 1. (A) Acquisition of weld inspection data and extraction of B-scan strip charts to form the original dataset. (B) Training of three synthetic image generation models using the positive LOF samples. (C) Quality assessment of synthetic images using three Transformer-based classifiers to retain only high-confidence samples. (D) Progressive extension of the training dataset by adding filtered synthetic LOF images in controlled increments. (E) Fine-tuning of the ResNet-50 using the extended dataset and comparison of performance across all experimental conditions.
Figure 2. Samples of B-scan images with LOF defects from the dataset.
Figure 3. Weld Zones.
Figure 4. Sixteen samples displayed in a 4 × 4 grid for each method: (a) B-scan images with LOF defects from the training set; (b) defective B-scan images generated using VQGAN with an unconditional transformer; (c) defective B-scan images generated using StyleGAN3; and (d) defective B-scan images generated using Stable Diffusion.
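As a rough illustration of how samples like those in panel (d) might be drawn, the sketch below generates images from a fine-tuned Stable Diffusion checkpoint through the Hugging Face diffusers API. The checkpoint path, prompt wording, and sampling settings are illustrative assumptions, not the exact configuration used in this study.

```python
# Sketch: sampling synthetic LOF B-scan images from a fine-tuned Stable
# Diffusion checkpoint via the Hugging Face diffusers API. The checkpoint
# path, prompt, and sampling settings below are placeholders.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "./sd-lof-bscan-finetuned",  # hypothetical locally fine-tuned weights
    torch_dtype=torch.float16,
).to("cuda")

prompt = "ultrasonic B-scan image of a weld with a lack-of-fusion defect"
for i in range(16):
    image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
    image.save(f"synthetic_lof_{i:03d}.png")
```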
Figure 5. The pipeline used for augmenting a limited B-scan image dataset: AI-based generative models and a selection process create a high-confidence synthetic image set for fine-tuning ResNet-50. The orange cylinders show original defective B-scan images in the training set, the blue cylinders show synthetic defective B-scan images, and the purple cylinder shows both defective and non-defective B-scan images in the training set.
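The selection step in this pipeline can be illustrated with a minimal sketch: a fine-tuned Transformer classifier scores each synthetic B-scan, and only images assigned to the LOF class above a confidence threshold are retained. The checkpoint path, the 0.9 threshold, the positive-class index, and the preprocessing are assumptions for illustration, not the exact settings of this study.

```python
# Sketch of the high-confidence filtering step: a fine-tuned classifier scores
# each synthetic B-scan and only confident LOF-positive images are kept.
# The checkpoint path, threshold, class index, and preprocessing are placeholders.
from pathlib import Path

import torch
from PIL import Image
from torchvision import transforms
from transformers import ViTForImageClassification

device = "cuda" if torch.cuda.is_available() else "cpu"
model = ViTForImageClassification.from_pretrained("./vit-base-lof").to(device).eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5] * 3, std=[0.5] * 3),
])

THRESHOLD = 0.9      # assumed confidence cut-off
POSITIVE_INDEX = 1   # assumed index of the LOF class

kept = []
for path in Path("synthetic_lof").glob("*.png"):
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0).to(device)
    with torch.no_grad():
        probs = model(x).logits.softmax(dim=-1)[0]
    if probs[POSITIVE_INDEX].item() >= THRESHOLD:
        kept.append(path)

print(f"Retained {len(kept)} high-confidence synthetic images")
```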
Figure 6. t-SNE projections comparing real and synthetic B-scan images in the feature spaces of three Transformer models. Real positive and negative samples form mostly distinct clusters in feature spaces of (A) ViT-Base and (B) Swin-Tiny. StyleGAN3 and Stable Diffusion synthetic samples overlap substantially with the real positive class, while VQGAN with an unconditional transformer forms a separate cluster, demonstrating a distinct feature representation despite effective defect encoding.
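A projection of this kind can be reproduced, in outline, by extracting classifier embeddings for real and synthetic images and mapping them to two dimensions with scikit-learn's t-SNE. The sketch below uses placeholder embeddings and group labels in place of the actual feature vectors.

```python
# Sketch of a t-SNE projection over classifier embeddings; `features` and
# `labels` are placeholders for real/synthetic embeddings and group ids.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
features = rng.normal(size=(400, 768))   # placeholder (N, D) embeddings
labels = rng.integers(0, 5, size=400)    # placeholder group ids

embedded = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)

for group in np.unique(labels):
    mask = labels == group
    plt.scatter(embedded[mask, 0], embedded[mask, 1], s=8, label=f"group {group}")
plt.legend()
plt.title("t-SNE of real vs. synthetic B-scan embeddings")
plt.savefig("tsne_projection.png", dpi=200)
```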
Figure 7. Comparison of Accuracy, Precision, Recall, and F1-Score metrics on the test set after adding synthetic images (generated via three different methods) to the training set for fine-tuning ResNet-50.
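For reference, the four metrics compared in Figure 7 can be computed directly from test-set labels and predictions with scikit-learn, as in the short sketch below; the label and prediction vectors are placeholders.

```python
# Sketch: computing Accuracy, Precision, Recall, and F1-Score with scikit-learn.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # placeholder ground-truth labels
y_pred = [1, 0, 1, 0, 0, 0, 1, 0]   # placeholder model predictions

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-Score :", f1_score(y_true, y_pred))
```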
Table 1. Number of negative (no LOF defect) and positive (with LOF defect) B-scan images in each subset of the dataset (train, validation, and test sets).

Type of Image | Train | Validation | Test
Negative      | 155   | 39         | 49
Positive      | 73    | 19         | 24
Total         | 228   | 58         | 73
Table 2. Performance of the models on the test set without the addition of any synthetic images. All models were fine-tuned on the training set with the same hyperparameters for a fair comparison.

Model       | Params | AUC   | Accuracy | Precision | Recall
ResNet-50   | 25.6 M | 0.500 | 0.671    | 0.000     | 0.000
ViT-Base    | 86.6 M | 0.886 | 0.904    | 0.870     | 0.833
Swin-Tiny   | 28.3 M | 0.866 | 0.877    | 0.800     | 0.833
MobileViT-S | 5.6 M  | 0.969 | 0.918    | 0.875     | 0.875
Table 3. Best performance achieved by ResNet-50 after adding a given number (#) of synthetic images, generated with each of the three methods, to the training set. Fine-tuning uses the same hyperparameters as the baseline training set without synthetic images, ensuring a fair comparison.

Method              | # Images | AUC   | Accuracy | Precision | Recall | F1-Score
VQGAN + Transformer | 80       | 0.896 | 0.932    | 1.000     | 0.792  | 0.884
StyleGAN3           | 200      | 0.896 | 0.932    | 1.000     | 0.792  | 0.884
Stable Diffusion    | 150      | 0.875 | 0.918    | 1.000     | 0.750  | 0.857
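A minimal sketch of this incremental protocol is given below: for each increment size, the training set is extended with that many filtered synthetic LOF images and a fresh ResNet-50 is fine-tuned under identical hyperparameters. The increment sizes, ImageNet initialization, dataset handling, and training helpers are placeholders, not the authors' exact code.

```python
# Sketch of the incremental-augmentation experiment: for each increment size,
# the training set gains that many filtered synthetic LOF images and a fresh
# ResNet-50 is fine-tuned with the same hyperparameters. All specifics below
# (increment sizes, initialization, helpers) are illustrative placeholders.
from torch import nn
from torchvision import models

INCREMENTS = [10, 20, 40, 80, 150, 200]   # example increment sizes

def build_resnet50(num_classes: int = 2) -> nn.Module:
    # Assumed ImageNet-pretrained initialization with a new 2-class head.
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model

for n_synth in INCREMENTS:
    model = build_resnet50()
    # train_set = original_train_set + first n_synth filtered synthetic images
    # fine_tune(model, train_set, val_set)   # same hyperparameters for all runs
    # metrics = evaluate(model, test_set)
    print(f"Would fine-tune ResNet-50 with {n_synth} synthetic images added")
```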
Table 4. Diversity metrics for synthetic images generated by the three methods, computed using mean pairwise distance (MPD) and convex hull area (CHA) in the 2-D t-SNE feature space. Values are reported for 200 synthetic samples per generator using embeddings from ViT-Base, Swin-Tiny, and MobileViT-S.

Method              | ViT-Base (MPD / CHA) | Swin-Tiny (MPD / CHA) | MobileViT-S (MPD / CHA)
VQGAN + Transformer | 9.95 / 569.53        | 8.42 / 266.25         | 15.49 / 905.52
Stable Diffusion    | 12.32 / 862.94       | 10.59 / 495.74        | 17.89 / 1307.95
StyleGAN3           | 13.60 / 488.90       | 11.07 / 536.71        | 24.82 / 2074.90
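Both metrics can be computed directly from the 2-D t-SNE coordinates: MPD as the mean Euclidean distance over all sample pairs, and CHA as the area of the convex hull enclosing the points. The sketch below, using placeholder coordinates, shows one way to do this with SciPy.

```python
# Sketch: mean pairwise distance (MPD) and convex hull area (CHA) over 2-D
# t-SNE coordinates. `points` is a placeholder (200, 2) coordinate array.
import numpy as np
from scipy.spatial import ConvexHull
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
points = rng.normal(size=(200, 2))    # placeholder 2-D t-SNE coordinates

mpd = pdist(points).mean()            # mean pairwise Euclidean distance
cha = ConvexHull(points).volume       # in 2-D, .volume gives the hull area

print(f"MPD = {mpd:.2f}, CHA = {cha:.2f}")
```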
Table 5. Performance of ResNet-50 as a student model trained with knowledge distillation. Fine-tuning uses the same hyperparameters as the baseline training set without synthetic images, ensuring a fair comparison.

Teacher   | AUC   | Accuracy | Precision | Recall | F1-Score
ViT-Base  | 0.856 | 0.836    | 0.688     | 0.917  | 0.786
Swin-Tiny | 0.854 | 0.904    | 1.000     | 0.708  | 0.829
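The distillation setup can be sketched with a standard temperature-scaled loss that blends the teacher's softened predictions with the ground-truth labels; the temperature and weighting below are assumed values for illustration, not necessarily those used in the reported experiments.

```python
# Sketch of a temperature-scaled knowledge-distillation loss for training a
# ResNet-50 student against a Transformer teacher. T and alpha are assumed.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Soft targets: KL divergence between temperature-softened distributions,
    # scaled by T^2 to keep gradient magnitudes comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: standard cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Example usage with random tensors:
s = torch.randn(8, 2)             # student logits (batch of 8, 2 classes)
t = torch.randn(8, 2)             # teacher logits
y = torch.randint(0, 2, (8,))     # ground-truth labels
print(distillation_loss(s, t, y).item())
```

Setting alpha to zero reduces this objective to ordinary supervised fine-tuning, which makes it straightforward to compare distilled and non-distilled students under the same training schedule.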