In this section, we propose a pipeline that uses EfficientNet for feature extraction, YOLOv8 for real-time detection, and multi-head attention for feature refinement to detect welding defects during ultrasonic welding in vehicle LIB manufacturing. While more recent YOLO versions offer potential enhancements, at the time of model development and experimentation for this research, YOLOv8 represented a thoroughly validated and stable detection framework, extensively documented and benchmarked across diverse industrial applications. YOLOv8’s stability ensured reliable integration with our custom modules (such as the EfficientNet backbone and multi-head attention mechanisms), significantly reducing experimental risk in industrial environments. The proposed model is a defect detection algorithm grounded in deep neural networks. Its most prominent attribute is its high-speed operation, rendering it particularly suited to real-time systems. Subsequent sections provide step-by-step details of the method’s deployment.
3.1. Model Overview
In this study we propose an EfficientNet with multi-head attention-based YOLOv8 defect detection model for detecting defects in vehicle LIB ultrasonic welds, a task that presents unique challenges, including the need to identify small-scale anomalies (e.g., micro-cracks or folds) as well as larger structural defects such as shear and porosity. Achieving this balance of fine-grained detection and computational efficiency requires a model that can generalize across varying defect types while remaining adaptable to new defect scenarios introduced during production. In the first phase, the model uses EfficientNet, which efficiently captures multi-scale features and ensures that the system remains reliable under diverse manufacturing conditions. EfficientNet-B0 [30] is utilized as the backbone network: its lightweight design yields strong feature representations, enables deployment on edge devices, and provides real-time quality assurance without loss of accuracy, making it well suited as a backbone for real-time detection. EfficientNet’s architecture combines depthwise convolutions, inverted residual bottlenecks, and channel-wise attention (Squeeze-and-Excitation) modules, effectively capturing hierarchical representations from weld images. Additionally, to address the multi-scale and fractal-like complexity of real-world defects, our model integrates multi-scale dilated convolution blocks within the feature extraction process. Each of these blocks runs parallel convolutional filters with dilation rates of 1, 2, and 5, aggregating their outputs to emulate the behavior of a fractional integral operator. While our multi-branch convolutional design exhibits self-similarity and multi-scale dilation patterns, we do not claim a formal fractal construct. Instead, we leverage architectural intuition from prior works like FractalNet to efficiently capture spatial hierarchies across multiple receptive fields. This fractal-inspired design expands the effective receptive field and enriches features across scales without increasing the parameter count, supporting both computational efficiency and superior sensitivity to diverse defect morphologies. Grounded in fractional-order systems theory and aligned with recent advances in fractal neural networks, the feature extraction flow is as follows:
Low-Level Features (initial layers): capture fundamental image attributes such as edges, textures, and subtle gradients, which are crucial for recognizing simple yet critical defect markers, such as edges of cracks or porosity boundaries.
Mid-Level Features (intermediate layers): identify structured features like shapes, sizes, and continuity disruptions critical for differentiating complex structural defects (e.g., fold and crease).
High-Level Features (final layers before attention module): extract semantic information, enabling the model to understand defect-specific patterns, distinguishing nuanced defect types, and generalizing across diverse defect representations.
To represent the structure of EfficientNet-B0, we use the short form ‘Conv, MBConv1, MBConv6’, where Conv is the first convolutional layer. The upper and lower layers of our EfficientNet were trained with the RIAWELC dataset and the GC10-DET dataset, respectively [31,32,33,34]. MBConv1 and MBConv6 denote mobile inverted bottleneck convolution blocks with varying kernel sizes and block counts, AvgPool is the average pooling layer, and Fully Connect is the fully connected linear layer, as shown in Figure 2.
The multi-head attention mechanism, situated after feature extraction, significantly enhances the interpretability and discrimination capability of the model. Unlike conventional convolutional methods that primarily operate locally, multi-head attention explicitly calculates global dependencies between spatial regions, effectively highlighting the image areas that contribute most to defect classification. The multi-head attention layer is located between the EfficientNet-B0 pooling layer and the Softmax layer. Adding multi-head attention draws the model’s focus toward high-risk areas and increases sensitivity to defects that traditional techniques might leave undetected. This attention mechanism improves the model’s accuracy by enhancing focus on the regions most likely to contain defects, thereby reducing false negatives in challenging inspection scenarios. The use of multiple attention heads allows the model to focus on both fine-grained details (e.g., micro-cracks) and larger-scale patterns (e.g., cracks and porosity), providing a comprehensive understanding of the defect landscape. Weld defects, such as cracks or porosity, can occur in irregular and unpredictable patterns; multi-head attention effectively models such spatial relationships, ensuring that critical regions receive greater emphasis. Each attention head independently computes a refined representation of the feature map, focusing on a specific subset of the feature space. Learnable weights project the input features into lower-dimensional query, key, and value spaces, keeping the attention mechanism computationally efficient. The parallel processing of multiple attention heads improves the robustness of the model, particularly in scenarios where defects exhibit high variability in size, shape, or texture. The refined feature map generated by the multi-head attention mechanism is seamlessly integrated with the input of YOLOv8, creating a unified framework for ultrasonic weld defect detection.
The model utilizes YOLOv8 as its foundational architecture due to its efficient single-pass detection capabilities, which allow the real-time operation essential for high-speed LIB production environments. A pivotal aspect of our methodology is the incorporation of the multi-head attention mechanism to enhance feature discrimination and defect localization. YOLOv8 provides a simple, well-documented, and modular architecture that facilitated straightforward integration with attention modules, ensuring rapid and error-free model customization. Its architecture processes the entire image in a single forward pass, balancing accuracy and speed. Each image is divided into an $S \times S$ grid, and for each cell, bounding boxes are predicted alongside confidence scores and class probabilities, enabling real-time analysis. Vanilla YOLOv8 incorporates CSPDarknet53 [35] as the backbone network, but we used EfficientNet as the backbone network with the Mish activation function, as shown in Figure 2. Based on this configuration, we adopted the path aggregation network (PANet) in the neck module of YOLOv8. To ensure optimal feature aggregation and activation dynamics, we evaluated three neck-activation configurations: FPN + ReLU (baseline), BiFPN + SiLU, and PANet + Mish. Our final design, PANet with Mish, demonstrated superior multi-scale fusion and smoother gradient propagation, yielding the best balance between accuracy (78.4% mAP, 79.4% F1, 79.8% recall) and inference speed (57.5 FPS), making it the most effective choice for our weld defect detection pipeline.
In this configuration, a universal network structure is built that coordinates top-down and bottom-up modules; shallow location information and deep semantic information are combined through feature fusion to increase feature breadth and depth. A decoupled head is used in the main structure of YOLOv8, and distribution focal loss (DFL) [36] is used for bounding box regression and object classification prediction. The single-stage architecture, which does not employ an RPN, allows YOLO to achieve faster inference with a simpler architecture than two-stage detectors, making it suitable for applications requiring real-time or near-real-time object detection. This YOLO variant optimizes the loss function by using the varifocal loss (VFL) [37] for classification and the CIoU (Complete Intersection over Union) loss [38] together with the DFL loss for regression, each of which has specific characteristics. The YOLOv8 model is further enhanced by a tailored transfer learning strategy, utilizing pre-trained weights from welding defect datasets to improve detection accuracy for LIB-specific defects. By pre-training on these domain-relevant datasets, the model acquires foundational features that are then fine-tuned on LIB-specific data, reducing training time and increasing detection robustness. A hedged sketch of this fine-tuning step is shown below.
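As a concrete illustration of the transfer learning step, the sketch below loads pre-trained backbone weights, freezes the early feature blocks, and fine-tunes the remainder. This is a minimal sketch, not our exact implementation: torchvision’s EfficientNet-B0 with ImageNet weights stands in for our welding-dataset checkpoint, and the layer split and optimizer settings are illustrative.

```python
import torch
from torchvision.models import efficientnet_b0

# ImageNet weights stand in here for the welding-dataset checkpoint.
backbone = efficientnet_b0(weights="IMAGENET1K_V1")

# Freeze the early blocks that encode generic edges/textures;
# fine-tune the deeper, defect-specific layers on LIB data.
for param in backbone.features[:4].parameters():
    param.requires_grad = False

trainable = [p for p in backbone.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=1e-3, momentum=0.9)
```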
3.2. Dataset Preparation and Processing
In constructing a robust defect detection model, a well-curated and representative dataset is essential. Our industrial dataset was carefully assembled to capture the range of defects caused during ultrasonic welding, typically encountered in LIB production, while ensuring the model’s ability to generalize across real-world conditions. The dataset includes both real-world defect images and synthetic images generated through a Generative Adversarial Network (GAN), which extends the dataset’s diversity and enhances the model’s defect recognition capabilities. The dataset exhibits moderate imbalance, with shear and clean classes being more frequent than fold and crease. To address this, we applied class-balanced augmentation and GAN-based synthesis for underrepresented classes, ensuring more uniform training distributions, as detailed in Table 1. To maximize model robustness and ensure adaptability across varying real-world conditions, extensive data preprocessing and augmentation steps were applied to each image in the dataset. We also employed the RIAWELC and GC10-DET datasets in the transfer learning process, which allows the model to generalize across a range of defect types and shapes. The preprocessing pipeline consisted of resizing, normalization, and augmentation transformations tailored to the nature of LIB welding defect detection, as discussed in subsequent sections. For the industrial LIB dataset, we adopted an 80/20 stratified train–test split. To further confirm the robustness of our results, a 5-fold cross-validation protocol was applied during ablation and hyperparameter tuning. Public datasets RIAWELC and GC10-DET followed their respective official splits. A minimal sketch of this evaluation protocol appears below.
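The following sketch illustrates the stratified 80/20 split and the 5-fold protocol using scikit-learn; the label list is a dummy stand-in for the industrial annotations.

```python
from sklearn.model_selection import StratifiedKFold, train_test_split

# Dummy class labels standing in for the industrial annotations.
labels = ["shear"] * 40 + ["clean"] * 40 + ["fold"] * 10 + ["crease"] * 10

# 80/20 stratified train-test split.
train_idx, test_idx = train_test_split(
    list(range(len(labels))), test_size=0.2, stratify=labels, random_state=42)

# 5-fold cross-validation over the training portion for ablations/tuning.
train_labels = [labels[i] for i in train_idx]
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (tr, va) in enumerate(skf.split(train_idx, train_labels)):
    pass  # train on `tr`, validate on `va` for this fold
```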
To address class imbalance and expand the representation of rare defect categories, we employed a Deep Convolutional GAN (DCGAN). The generator comprised four transposed convolutional layers with batch normalization and LeakyReLU activations, while the discriminator was structured symmetrically with convolutional layers and dropout regularization. Training was conducted for 200 epochs using the Adam optimizer (learning rate = 2 × 10⁻⁴, batch size = 64). Convergence was monitored through loss stabilization and qualitative inspection of generated images. Synthetic images were generated using DCGAN and integrated only into the training set. A minimal sketch of this DCGAN follows.
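Below is a minimal PyTorch sketch of such a DCGAN, assuming the four-layer generator and symmetric discriminator described above; the latent dimension, channel widths, and 32 × 32 output size are illustrative rather than the production configuration.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Four transposed-conv layers with batch norm and LeakyReLU."""
    def __init__(self, latent_dim=100, channels=3):
        super().__init__()
        self.net = nn.Sequential(
            # latent vector z of shape (B, latent_dim, 1, 1) -> 4x4 map
            nn.ConvTranspose2d(latent_dim, 256, 4, 1, 0, bias=False),
            nn.BatchNorm2d(256), nn.LeakyReLU(0.2, inplace=True),
            nn.ConvTranspose2d(256, 128, 4, 2, 1, bias=False),
            nn.BatchNorm2d(128), nn.LeakyReLU(0.2, inplace=True),
            nn.ConvTranspose2d(128, 64, 4, 2, 1, bias=False),
            nn.BatchNorm2d(64), nn.LeakyReLU(0.2, inplace=True),
            nn.ConvTranspose2d(64, channels, 4, 2, 1, bias=False),
            nn.Tanh(),  # 32x32 output in [-1, 1]
        )
    def forward(self, z):
        return self.net(z)

class Discriminator(nn.Module):
    """Symmetric discriminator: strided convs with dropout regularization."""
    def __init__(self, channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 64, 4, 2, 1), nn.LeakyReLU(0.2, inplace=True),
            nn.Dropout2d(0.3),
            nn.Conv2d(64, 128, 4, 2, 1), nn.LeakyReLU(0.2, inplace=True),
            nn.Dropout2d(0.3),
            nn.Conv2d(128, 256, 4, 2, 1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(256, 1, 4, 1, 0),  # real/fake logit
        )
    def forward(self, x):
        return self.net(x).view(-1)

# Adam with lr = 2e-4 and batch size 64, as stated above.
G, D = Generator(), Discriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
```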
3.2.1. Data Collection Methods and Labeling Standards
The industrial dataset was systematically collected from an operational industrial setting during the actual ultrasonic welding process for lithium-ion battery (LIB) production. High-resolution images were captured using industrial-grade vision systems strategically positioned along the automated manufacturing lines. The imaging devices maintained consistent lighting conditions and fixed positions relative to the welding apparatus, ensuring uniformity and minimizing variability unrelated to weld defects. Captured images underwent immediate preliminary quality checks to filter out unusable or unclear captures, thus ensuring that all dataset images were of consistently high clarity and relevant to defect detection tasks. Domain experts from the production team, who were thoroughly trained in recognizing various ultrasonic weld defects, conducted the labeling process. The labeling strictly adhered to clearly defined guidelines that were specifically established for this research. The defect labeling standards were based on internationally recognized weld defect criteria, modified slightly to accommodate LIB-specific features. Each expert annotator was required to do the following:
Defect type identification: clearly classify each defect into predefined categories (cracks, porosity, shear, fold, crease).
Precise bounding box labeling: accurately mark defect boundaries using bounding boxes to ensure consistency in training YOLO-based detection models.
Cross-validation: Implement double-blind labeling, where two independent experts annotate each image. Discrepancies were resolved through consensus discussions facilitated by a senior quality assurance engineer.
Verification and Validation: 15% of the labeled dataset underwent rigorous random audits by an independent senior engineer to verify annotation accuracy and consistency, achieving an agreement rate exceeding 95%.
Additionally, synthetic data augmentation was performed using Generative Adversarial Networks (GANs) to simulate rare or hard-to-capture defects. The GAN-generated images were carefully validated by domain experts to ensure realism and relevance to actual defect scenarios, thereby enhancing dataset diversity and robustness. To verify the realism of synthetic images, we computed the Fréchet Inception Distance (FID) and Inception Score (IS) [39,40] using the official PyTorch implementation from the TorchMetrics library. All synthetic images generated by the GAN augmentation pipeline were resized to match the input expectations of the Inception v3 network [41], which was pre-trained on ImageNet-1k [42].
For FID computation, we extracted 2048-dimensional features from the pool3 layer of Inception v3 and calculated the distance between real and synthetic distributions using 64-bit double precision for stable covariance estimation. A sample size of 2000 synthetic images and an equal number of real images from the training split were used. For IS computation, the same 2000 synthetic images were evaluated. Each image was passed through Inception v3, and the Softmax class probabilities were used to compute the KL divergence between conditional and marginal label distributions, averaged over 10 splits. The synthetic samples achieved an FID of 14.72 and an IS of 2.85, indicating close alignment with real defect distributions. These evaluations, together with visual confirmation, demonstrate that the augmented set enhances diversity without introducing significant domain shift.
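This metric computation can be reproduced with TorchMetrics roughly as follows; the `real_images` and `fake_images` tensors are placeholders for the 2000-sample batches described above (dummy random data is used here so the sketch runs standalone).

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.inception import InceptionScore

# FID over 2048-d pool3 features; double precision stabilizes the
# covariance estimation, as noted above.
fid = FrechetInceptionDistance(feature=2048)
fid.set_dtype(torch.float64)
inception = InceptionScore(splits=10)  # IS averaged over 10 splits

# Placeholders: uint8 tensors of shape (N, 3, 299, 299), N = 2000 in the
# paper, resized to Inception v3's expected input.
real_images = torch.randint(0, 255, (16, 3, 299, 299), dtype=torch.uint8)
fake_images = torch.randint(0, 255, (16, 3, 299, 299), dtype=torch.uint8)

fid.update(real_images, real=True)
fid.update(fake_images, real=False)
print("FID:", fid.compute().item())

inception.update(fake_images)
is_mean, is_std = inception.compute()
print("IS:", is_mean.item(), "+/-", is_std.item())
```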
3.2.2. RIAWELC Dataset
This is a radiographic image dataset for weld defect classification. The RIAWELC dataset [31] collects 24,407 8-bit radiographic images of 224 × 224 pixels, digitized in the .png format, with four classes of weld defects: lack of penetration (LP), porosity (PO), cracks (CRs), and no defect (ND), as shown in Figure 3. It is used for initial training to familiarize the model with defect types such as cracks and porosity. While it does not contain ultrasonic-specific data, its common defect types of cracks and porosity resemble those found in ultrasonic welds. Moreover, the dataset’s diverse defect types are essential for pre-training the model’s general structural anomaly recognition.
3.2.3. GC10-DET Dataset
The GC10-DET dataset [33] was collected under actual industrial settings for extensive metal surface defect identification. It includes a total of 2300 images with a resolution of 2048 × 1000 pixels. The dataset covers ten types of defects found on the surface of steel plates, from which we chose punching hole, weld line, inclusion, and waist folding for transfer learning, as these align with surface defects commonly encountered in ultrasonic welding of cell TABs.
Figure 4 displays some defect sample images with annotations. With strong inter-class similarity and unbalanced sample distribution, the GC10-DET dataset shows a substantial variance in the number of images for each type of defect. Also, there could be multiple defect types in the same image, posing a challenge to defect detection algorithms due to the unbalanced data distribution. Together, the RIAWELC dataset and GC10-DET dataset allow the model to generalize across a range of defect types and shapes.
3.2.4. Industrial Dataset for Industrial LIB Weld Images
This collection of industrial data consists of 1500 high-resolution images (native resolution ≈ 2048 × 1000 pixels), acquired using production-line inspection cameras directly from LIB manufacturing environments, as shown in Figure 5. These images cover defect types inherent to ultrasonic welding of TAB and busbar joints, including cracks, fold, porosity, crease, and shear. Each image was meticulously annotated by production-line domain experts to accurately label defect locations and types, ensuring high-quality labels for supervised learning.
To address class imbalance, we applied a GAN-based augmentation pipeline. The generator $G$ receives a latent input vector $z \sim p_z(z)$ and produces a candidate weld-defect image $G(z)$. The discriminator $D$ outputs a probability estimating whether an input $x$ is real or synthetic. Training follows the minimax formulation:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$
The loss stabilizes adversarial updates by alternating between gradient steps on $G$ and $D$. For normalization, all samples are resized to $224 \times 224$ pixels, pixel values are scaled to [0, 1], and each channel is standardized to zero mean and unit variance:
$$\hat{x} = \frac{x - \mu}{\sigma}$$
where $\mu$ and $\sigma$ are the per-channel mean and standard deviation.
Hyperparameters include a batch size of 64, the Adam optimizer with a learning rate of 2 × 10⁻⁴, and 200 training epochs. The generator uses transposed convolutions with ReLU activations except at the output (Tanh), while the discriminator applies strided convolutions with LeakyReLU activations. Dropout (0.3) and spectral normalization were added to improve convergence stability. Augmentation transformations (rotation ± 15°, scaling 0.9–1.1×, horizontal flip, contrast ± 20%) were applied to both synthetic and real samples, as sketched below. Synthetic data expanded the industrial dataset from 3500 to ~6000 images, balancing defect categories: shear (2000), porosity (1500), crease (1200), fold (800), crack (500), and clean (5000). The dataset was split 80/20 with stratification to preserve per-class balance.
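A torchvision version of this augmentation pipeline might look as follows; the normalization statistics at the end are illustrative placeholders for the per-channel dataset statistics.

```python
import torchvision.transforms as T

# Augmentations matching the stated ranges, applied to real and synthetic
# samples alike (input assumed to be PIL images).
augment = T.Compose([
    T.RandomRotation(degrees=15),                 # rotation +/- 15 degrees
    T.RandomAffine(degrees=0, scale=(0.9, 1.1)),  # scaling 0.9-1.1x
    T.RandomHorizontalFlip(p=0.5),                # horizontal flip
    T.ColorJitter(contrast=0.2),                  # contrast +/- 20%
    T.Resize((224, 224)),
    T.ToTensor(),                                 # scales pixels to [0, 1]
    # Placeholder statistics; the real pipeline uses per-channel dataset stats.
    T.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])
```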
3.3. Feature Extraction Using EfficientNet
EfficientNet is used as the feature extraction network in the classification experiment. EfficientNet uses a compound scaling mechanism to maintain a balance between resolution, depth, and width, making the extracted features rich and computationally efficient. The EfficientNet series comprises eight CNNs, labeled EfficientNet-B0 to EfficientNet-B7. In this study, EfficientNet-B0 is used for feature extraction, which provides a balanced level of computational efficiency and accuracy. We select EfficientNet-B0 because its compound scaling (width, depth, resolution) and MBConv blocks with Squeeze-and-Excitation provide high-quality features at low parameter cost. On our defect images, B0 offered the best trade-off between small-object sensitivity and stability during transfer learning, while avoiding the heavier footprint of larger backbones. This choice keeps the detector compact without sacrificing the hierarchical detail needed for micro-crack, porosity, fold, and crease discrimination. By inserting a multi-head attention layer between the pooling layer of EfficientNet-B0 and the Softmax layer, EfficientNet-B0 can outperform a number of feature extractors with fewer parameters at the same input resolution [42,43,44,45]. The high scalability of EfficientNet-B0 allows it to effectively extract meaningful features from ultrasonic weld images whose complex structure contains defects. The specific structure of EfficientNet-B0 is shown in Figure 6. It can be divided into seven blocks according to channel range, pass speed, and filter size. To capture scale variation, we aggregate parallel dilated convolutions (dilation rates {1, 2, 5}) and sum the activations. This constructs receptive fields that function like a fractional-order spatial operator, broadening context while retaining local sensitivity. The result is a scale-spanning representation that enriches fine weld textures and larger geometric irregularities without materially increasing parameters. A sketch of this block appears below.
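The sketch below shows one way to realize this multi-scale dilated block in PyTorch, assuming summation aggregation and the Mish activation described earlier; the channel count and placement within the backbone are illustrative.

```python
import torch
import torch.nn as nn

class MultiScaleDilatedBlock(nn.Module):
    """Parallel 3x3 convolutions with dilation rates {1, 2, 5}, summed to
    widen the effective receptive field without a per-scale parameter cost
    beyond the three branches."""
    def __init__(self, channels):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=3,
                      padding=d, dilation=d, bias=False)
            for d in (1, 2, 5)
        ])
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.Mish()

    def forward(self, x):
        # Aggregate parallel dilated responses by summation.
        out = sum(branch(x) for branch in self.branches)
        return self.act(self.bn(out))

# Usage: insert after a backbone stage, e.g., a 64-channel feature map.
feats = torch.randn(1, 64, 56, 56)
block = MultiScaleDilatedBlock(64)
print(block(feats).shape)  # torch.Size([1, 64, 56, 56])
```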
The initial input to the model is a high-resolution ultrasonic weld image $I \in \mathbb{R}^{H \times W \times C}$, where H, W, and C denote the image’s height, width, and channel count, respectively. These images are captured under constant lighting and environmental conditions on the fast-moving conveyor belt of the production line. To standardize the input, the images are resized to a fixed resolution of 224 × 224 pixels and normalized to a [0, 1] range. Resizing ensures compatibility with EfficientNet’s pre-trained weights, while normalization reduces the influence of intensity variations. The process is expressed as follows:
$$I_{\text{norm}} = \frac{I - \mu}{\sigma}$$
where $\mu$ and $\sigma$ are the mean and standard deviation of the pixel intensities. Based on MobileNet [46,47], the mobile inverted bottleneck (MBConv) is a key component of EfficientNet-B0. EfficientNet processes the input images through a series of convolutional layers, capturing hierarchical features ranging from low-level edges and textures to high-level defect patterns. Each convolutional block produces a feature map $F_l$ as follows:
$$F_l = \sigma(W_l * F_{l-1} + b_l)$$
where $W_l$ and $b_l$ represent the weights and biases of the $l$-th layer, “*” denotes convolution, and $\sigma$ is the activation function. As shown in Figure 6, MBConv consists of two 1 × 1 convolutional layers, a depthwise convolutional layer, a Squeeze-and-Excitation (SE) [48,49] module, and a dropout layer. To improve the quality of the features, an SE module is added to each convolutional block. It rescales the feature maps and reweights the channels that are most informative for defect detection. Channel expansion is performed by the first 1 × 1 convolution layer. Depthwise convolution reduces the number of parameters. SE blocks focus specifically on the relationships between channels, assigning variable weights to the channels instead of treating them uniformly. Channel compression is completed via a 1 × 1 convolution layer. The recalibration is achieved through global average pooling, followed by a non-linear transformation:
$$s = \sigma\left(W_2 \, \delta\left(W_1 \, z\right)\right), \quad z = \text{GAP}(F)$$
where $W_1$ and $W_2$ are learnable parameters, $\delta$ denotes the ReLU activation, $\sigma$ the sigmoid gate, and GAP global average pooling. This process ensures that defect-related features are amplified, improving downstream detection accuracy. The final feature map $F$ serves as a rich representation of the input image, encapsulating spatial and semantic details critical for identifying weld defects. These features are passed to the neck, followed by average pooling, before being fed into the multi-head attention module for further processing. A sketch of the SE module follows.
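A compact PyTorch sketch of the SE recalibration above; the reduction ratio of 4 is an illustrative choice.

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """Squeeze-and-Excitation: global average pooling followed by a
    two-layer gate that reweights channels."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)              # squeeze: B x C x 1 x 1
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # W1
            nn.ReLU(inplace=True),                       # delta
            nn.Linear(channels // reduction, channels),  # W2
            nn.Sigmoid(),                                # sigma
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        s = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * s  # excitation: per-channel recalibration

se = SqueezeExcite(64)
print(se(torch.randn(1, 64, 28, 28)).shape)  # torch.Size([1, 64, 28, 28])
```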
3.4. Multi-Head Attention
The multi-head attention mechanism is a powerful extension of the self-attention mechanism, designed to simultaneously focus on different aspects of the input feature map. In the context of ultrasonic welding defect detection, this mechanism is essential for capturing the diverse spatial and contextual relationships that characterize various types of defects. In this work, we use a multi-headed self-attention mechanism based on scaled dot-product attention [50]. In EfficientNet-B0, a multi-headed self-attention layer is inserted between the pooling and Softmax layers. The multi-headed self-attention mechanism allows the network to focus on important information in the image, giving the network many representation subspaces. The self-attention mechanism is able to evaluate the different influences of the respective pixel positions and assign them corresponding weights for classification. Thus, it is possible to evaluate the relationship of a region to its surroundings and determine its influence across many regions based on correlation. In tasks such as defect detection in ultrasonic welding, a region’s influence often depends on its relationship to the surrounding area. We use an L × N matrix Y to represent a set of L objects of dimension N; Y is the output of the pooling layer, and each row of Y is a separate object vector, as shown in Figure 2.
The feature map $F$, provided as input to multi-head self-attention, undergoes a transformation into three distinct matrices: query $Q$, key $K$, and value $V$:
$$Q = F W^Q, \quad K = F W^K, \quad V = F W^V$$
where $W^Q$, $W^K$, and $W^V$ are learnable projection matrices that enable the model to focus on specific feature subspaces. These vectors serve as an abstraction for the attention computation. The attention weights are computed from the similarity of the query and key vectors via the scaled dot product:
$$A = \text{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right)$$
where $Q$ and $K$ represent the query and key matrices derived from the feature map $F$, and $d_k$ is the dimension of the keys. The Softmax function ensures that the attention weights are normalized across all areas of the image, highlighting regions of high relevance. By applying these computed attention weights to the value matrix $V$, the model recalibrates its learned feature maps, effectively emphasizing crucial defect-specific regions (e.g., edges of cracks, porosity clusters) and suppressing irrelevant or noisy features (e.g., non-defective background textures). The attention weights $A$ are applied to the value matrix $V$ to produce a refined attention-weighted feature map $F'$:
$$F' = A V$$
Consequently, the recalibrated feature map is enriched with explicit defect-location and defect-type-specific contextual information, thereby directly enhancing interpretability and classification performance. This allows the model to attend to critical defect features while suppressing irrelevant noise, such as background textures or other variations in the terminal interface. A compact implementation sketch of these operations is given below.
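The following minimal PyTorch sketch implements these equations, including the multi-head form discussed next; the per-head dimension of 64 is an illustrative assumption, while the paper’s h = 3 heads and 512-dimensional output are kept.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """A = softmax(Q K^T / sqrt(d_k)); returns F' = A V."""
    d_k = Q.size(-1)
    A = F.softmax(Q @ K.transpose(-2, -1) / d_k ** 0.5, dim=-1)
    return A @ V

class MultiHeadSelfAttention(nn.Module):
    """h parallel heads over learnable Q/K/V projections, concatenated and
    mixed by W^O. head_dim = 64 is an illustrative choice."""
    def __init__(self, in_dim=512, head_dim=64, num_heads=3):
        super().__init__()
        self.h, self.d = num_heads, head_dim
        self.q = nn.Linear(in_dim, head_dim * num_heads)
        self.k = nn.Linear(in_dim, head_dim * num_heads)
        self.v = nn.Linear(in_dim, head_dim * num_heads)
        self.out = nn.Linear(head_dim * num_heads, in_dim)  # W^O

    def forward(self, x):  # x: (B, L, in_dim), rows as object vectors of Y
        B, L, _ = x.shape
        split = lambda t: t.view(B, L, self.h, self.d).transpose(1, 2)
        heads = scaled_dot_product_attention(
            split(self.q(x)), split(self.k(x)), split(self.v(x)))
        heads = heads.transpose(1, 2).reshape(B, L, self.h * self.d)
        return self.out(heads)  # F_multi-head

x = torch.randn(2, 49, 512)               # e.g., flattened pooled feature map
print(MultiHeadSelfAttention()(x).shape)  # torch.Size([2, 49, 512])
```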
Self-attention enables the model to dynamically adapt its focus based on the unique characteristics of each image. For example, it can prioritize features associated with a crack’s initiation point while also capturing the progression of the crack across the weld. In multi-head self-attention, $Q$, $K$, and $V$ are linearly projected multiple times via different weight matrices, and the input features are processed through multiple parallel attention heads, each focusing on a different aspect of the feature space, as shown in Figure 7. Initially, for parallel attention computations, the input feature map $F$ is split into $h$ subspaces, with each attention head independently computing a refined representation:
$$\text{head}_i = \text{Attention}\left(Q W_i^Q, K W_i^K, V W_i^V\right)$$
where $Q W_i^Q$, $K W_i^K$, and $V W_i^V$ are the query, key, and value matrices specific to the $i$-th head. This allows the model to observe the features from different attention heads, each learning a different aspect of the image. The output from each attention head is concatenated and linearly transformed to generate the final attention-enhanced feature map $F_{\text{multi-head}}$:
$$F_{\text{multi-head}} = \text{Concat}\left(\text{head}_1, \ldots, \text{head}_h\right) W^O$$
where $W^O$ is a learnable weight matrix. Multi-head attention provides a comprehensive representation of the input features, capturing both local details (e.g., micro-cracks) and global context (e.g., uneven weld lines). Our attention layer has an output dimension of 512 and h = 3 heads. This integration ensures that the contextual enhancements provided by attention are effectively utilized in the final detection by the YOLOv8 model. Multi-head attention in our proposed model employs multiple parallel attention heads, each independently focusing on distinct spatial or semantic aspects of weld defects. This explicit differentiation is crucial for interpreting the model’s ability to discriminate defect types, as detailed below:
Fine-Grained Attention Heads: Certain attention heads specifically target small-scale defects (e.g., micro-cracks and pores), capturing localized and subtle structural discontinuities. These fine-grained attention heads effectively distinguish defects that are visually minimal yet structurally critical.
Contextual Attention Heads: Other attention heads explicitly identify larger-scale spatial patterns indicative of defects like shear and crease. By focusing on extended spatial correlations and irregularities in larger regions, these heads provide interpretability into the model’s decision process regarding extensive, structurally significant defects.
Cross-Scale Attention Integration: Attention heads dynamically integrate features across different scales, thereby comprehensively capturing defects exhibiting variable size or texture patterns. This multi-scale interpretability ensures robust performance even in highly varied industrial conditions.
To empirically support our selection of three attention heads with a 512-dimensional embedding, we conducted an ablation study across different configurations. Using a single attention head with a 256-dimensional output resulted in an mAP of 76.0%, precision of 75.1%, and recall of 76.8%, while maintaining high throughput at 62.5 FPS on an RTX 4090. Increasing the number of heads to three and the embedding dimension to 512 boosted the mAP to 78.4%, precision to 77.0%, and recall to 79.2%, with only a moderate drop in FPS to 57.5 and a manageable parameter count of 30.1 million. Further increasing to six heads and 768 dimensions yielded only marginal gains (mAP: 78.9%, recall: 79.5%), while significantly degrading runtime speed to 52.1 FPS and increasing parameter count to 33.7 million. FLOPs grew from 92.4 GF (1-head) to 98.6 GF (3-head) and 108.9 GF (6-head), respectively. This trade-off analysis confirms that the 3-head, 512-dim configuration offers the optimal balance between accuracy and runtime performance, justifying its adoption in our final model.
3.5. Detection Head Using YOLOv8
Following feature extraction with the multi-head mechanism, YOLOv8 is employed as the detection head to locate and classify defects. YOLOv8 is a one-stage object detection framework that achieves a balance between speed and accuracy, making it particularly suited for high-throughput manufacturing environments. To ensure quality and safety, defects such as cracks, porosity, folds, shears, and creases must be detected in real time. For the vehicle LIB production pipeline, a preliminary experimental evaluation indicated that YOLOv8, with our EfficientNet-based feature extraction backbone and multi-head attention integration, offered superior robustness and accuracy on our defect detection datasets (private LIB ultrasonic welding dataset, GC10-DET, and RIAWELC). Although YOLOv10 showed promising results in general object detection benchmarks, its accuracy improvements were marginal (less than a 1–2% increase in mAP), specifically for subtle defect classes such as crease and fold. Thus, YOLOv8 provided comparable practical accuracy without additional computational overhead, offering excellent inference speed (~65 FPS on an RTX 4090 GPU) and reliable detection accuracy, making it a strong industrial baseline for high-speed LIB manufacturing lines. However, our deployment prioritizes real-time CPU-based inference. In this environment, our model achieves 45 FPS on a standard Intel i5 CPU with 512 MB RAM, significantly outperforming YOLOv8’s CPU performance (≈22 FPS from Ultralytics benchmarks). Conversely, although YOLOv10 can achieve modest GPU speed gains (5–8%) over YOLOv8 on certain tasks, it increases model complexity (2.3 M vs. 1.7 M parameters, 6.7 G vs. 3.2 G FLOPs). For inline industrial use, our model’s CPU-level performance and reduced resource footprint offer decisive practical benefits.
The feature map $F'$ from the EfficientNet backbone with attention is fed into YOLOv8’s detection module. This module refines the features and predicts bounding boxes, confidence scores, and class probabilities for each defect. YOLOv8 divides the input feature map into a grid of $S \times S$ cells. Each cell predicts bounding boxes for objects potentially located within its region, along with associated confidence scores and class probabilities. The output for a single grid cell is represented as
$$\hat{y} = (x, y, w, h, c, p_1, \ldots, p_K)$$
where $(x, y)$ are the normalized center coordinates of the bounding box, $(w, h)$ are the width and height of the bounding box, $c$ is the confidence score for the presence of a defect, and $p_k$ is the probability of the defect belonging to class $k$. Unlike earlier YOLO models, YOLOv8 adopts an anchor-free approach, simplifying the architecture and improving inference speed. Instead of predefined anchor boxes, it predicts box centers and offsets directly:
$$(x, y, w, h) = f_{\theta}(F')$$
where $f_{\theta}$ represents the prediction function. YOLOv8’s training process minimizes a combined loss function, incorporating the following:
Complete Intersection over Union (CIoU) loss for precise bounding box regression.
Objectness loss (confidence score) based on Binary Cross-Entropy to distinguish between defect and background effectively.
Classification loss based on Binary Cross-Entropy with logits for accurate defect classification.
Formally, the combined loss function is represented as
$$\mathcal{L}_{\text{total}} = \lambda_{\text{box}} \mathcal{L}_{\text{CIoU}} + \lambda_{\text{obj}} \mathcal{L}_{\text{obj}} + \lambda_{\text{cls}} \mathcal{L}_{\text{cls}}$$
where $\lambda_{\text{box}}$, $\lambda_{\text{obj}}$, and $\lambda_{\text{cls}}$ are weighting coefficients, and the localization loss $\mathcal{L}_{\text{CIoU}}$ measures the accuracy of the bounding box predictions using the CIoU (Complete Intersection over Union) metric:
$$\mathcal{L}_{\text{CIoU}} = 1 - \text{IoU}(B, B^{gt}) + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v$$
where $B$ and $B^{gt}$ are the predicted and ground truth bounding boxes, $\rho$ is the Euclidean distance between their centers $b$ and $b^{gt}$, $c$ is the diagonal length of the smallest enclosing box, and $\alpha v$ penalizes aspect-ratio inconsistency. YOLOv8 generates predictions at multiple scales to handle defects of varying sizes, from small crease patterns to large cracks. This multi-scale capability ensures comprehensive detection across all defect types. Lastly, the optimized YOLOv8 model with transfer learning undergoes further training with a combination of real and augmented defect images. Stochastic gradient descent (SGD) with momentum is used, improving training convergence and model stability, as explained in a later section. Momentum $m$ helps accelerate updates and reduce oscillations in gradient descent as follows:
$$\Delta w_t = m \, \Delta w_{t-1} - \eta \nabla \mathcal{L}(w_t), \qquad w_{t+1} = w_t + \Delta w_t$$
where $\Delta w_t$ represents the weight update at iteration $t$ and $\eta$ is the learning rate. A brief implementation sketch follows.
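As a hedged sketch of this training step, the snippet below combines torchvision’s CIoU loss with PyTorch’s momentum SGD update; the tiny linear model, box construction, and hyperparameter values are dummies for illustration only, not the production detection head.

```python
import torch
import torch.nn as nn
from torchvision.ops import complete_box_iou_loss

# Dummy regressor standing in for the detection head.
model = nn.Linear(10, 4)
# PyTorch's SGD applies v_t = m * v_{t-1} + g_t and w_{t+1} = w_t - lr * v_t,
# matching the momentum update above (lr and momentum are illustrative).
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

x = torch.randn(8, 10)
gt_boxes = torch.tensor([[10.0, 10.0, 50.0, 50.0]]).repeat(8, 1)

optimizer.zero_grad()
p = model(x)
# Build valid (x1, y1, x2, y2) boxes: positive width/height offsets.
pred_boxes = torch.cat([p[:, :2], p[:, :2] + p[:, 2:].abs() + 1.0], dim=-1)
loss = complete_box_iou_loss(pred_boxes, gt_boxes, reduction="mean")
loss.backward()
optimizer.step()
```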