Explainable Hybrid Deep Learning for Microscopic Dust Defect Inspection on Voice Coil Motor Assembly Components

Phunpeng, Veena; Chaiyasin, Kreetiwat; Khodcharad, Kitsana; Boransan, Wipada; Patangtalo, Watcharapong; Chaimanatsakun, Attaphon

doi:10.3390/asi9060120


by Veena Phunpeng ¹,*, Kreetiwat Chaiyasin ², Kitsana Khodcharad ¹, Wipada Boransan ³, Watcharapong Patangtalo ¹ and Attaphon Chaimanatsakun ⁴,*

¹ Mechanical Engineering, Suranaree University of Technology, Nakhon Ratchasima 30000, Thailand
² Mechatronics Engineering, Suranaree University of Technology, Nakhon Ratchasima 30000, Thailand
³ Institute of Research and Development, Suranaree University of Technology, Nakhon Ratchasima 30000, Thailand
⁴ Mechanical Engineering, Faculty of Engineering at Sriracha, Kasetsart University Sriracha Campus, Chonburi 20230, Thailand
* Authors to whom correspondence should be addressed.

Appl. Syst. Innov. 2026, 9(6), 120; https://doi.org/10.3390/asi9060120

Submission received: 17 April 2026 / Revised: 27 May 2026 / Accepted: 28 May 2026 / Published: 2 June 2026

(This article belongs to the Special Issue Feature Papers in the ‘Industrial and Manufacturing Engineering’ Section)


Abstract

Ensuring the cleanliness of precision components is critical in Hard Disk Drive (HDD) manufacturing, where microscopic dust contamination on the Voice Coil Motor Assembly (VCMA) can lead to positioning errors, unstable head movement, and long-term reliability failures. However, automated inspection of such contamination remains challenging because dust particles are extremely small, visually irregular, and often appear under complex microscopic backgrounds. This study presents an explainable hybrid deep learning framework for microscopic dust inspection by integrating object detection for precise localization and image classification for defect confirmation. Three YOLO architectures, namely YOLOv5, YOLOv8, and YOLOv11, were comparatively evaluated for dust detection, while three convolutional neural network (CNN) models, ResNet50, EfficientNetB0, and MobileNetV2, were implemented using transfer learning with frozen feature extraction layers for Good (G) and Not Good (NG) image-level classification. The experimental dataset consisted of annotated microscopic VCMA images, with data augmentation applied to the training subset to mitigate limited sample size and class imbalance. Experimental results showed that YOLOv8 achieved the strongest overall aggregate detection performance, whereas YOLOv5 was selected as the preferred detector for subsequent hybrid integration because it produced fewer false positives under reflective and textured microscopic backgrounds. YOLOv11 exhibited lower detection performance in the present setting, likely due to its architectural characteristics being less suited to the limited-data and high-background-complexity conditions of this study. In the present experimental setting, YOLOv5 achieved mAP@0.5 = 0.62, precision = 0.75, and recall = 0.69. For image-level classification, EfficientNetB0 achieved the highest classification accuracy of 93.10%, with F1-score = 0.932 and AUC = 0.986. 
In addition, Grad-CAM visualizations demonstrated that EfficientNetB0 consistently focused on physically meaningful dust-contaminated regions, thereby enhancing the interpretability of the classification results. Overall, the proposed hybrid framework integrating YOLOv5-based localization with EfficientNetB0-based defect confirmation showed promising potential for improving inspection reliability, false-alarm control, and explainability in automated VCMA quality inspection. These findings support the feasibility of explainable deep learning for microscopic defect inspection in HDD manufacturing and suggest its potential applicability to other precision manufacturing environments.

Keywords: Voice Coil Motor Assembly (VCMA); microscopic dust detection; YOLO object detection; Convolutional Neural Network (CNN); explainable AI; Grad-CAM; hybrid deep learning; hard disk drive manufacturing

1. Introduction

Hard disk drives (HDDs) remain essential for large-scale data storage due to their cost-effective, high-capacity storage and continued relevance in data center and cloud infrastructure applications [1]. As HDD components are manufactured with increasingly tight mechanical tolerances, particulate cleanliness becomes a critical determinant of assembly yield and long-term reliability. In particular, microscopic dust contamination on Voice Coil Motor Assembly (VCMA) components can induce positioning errors, unstable actuator motion, and subsequent performance degradation. Reliable inspection of such contamination is therefore a key requirement in HDD manufacturing.

Although VCMA assembly is conducted under controlled cleanroom conditions, microscopic particulate contamination can still arise during component handling, transfer between workstations, and fixture contact—stages where full hermetic isolation is impractical [2]. Existing quality control practices in HDD manufacturing rely primarily on manual optical inspection or conventional automated optical inspection systems, which are generally calibrated for macro-scale defects and lack sensitivity to sub-100 μm dust particles characteristic of VCMA contamination [3,4]. These limitations motivate the development of automated deep learning-based inspection capable of detecting microscopic dust under realistic industrial imaging conditions.

Automated inspection of microscopic dust defects is challenging because target particles are extremely small, morphologically irregular, and often embedded in reflective or textured microscopic backgrounds. Contemporary surveys on deep learning for automated visual inspection report that learning-based approaches offer improved robustness and adaptability, yet they also highlight persistent deployment barriers such as limited defect samples, dataset bias, and distribution shift [5,6,7]. Broader surveys on deep-learning-based surface defect detection likewise stress that data quality, scale variation, and background complexity remain central obstacles for robust industrial deployment [8].

Deep learning has become the dominant paradigm for representation learning in computer vision, enabling models to learn hierarchical features directly from data [9,10]. In industrial inspection, convolutional neural networks (CNNs) are frequently used for image-level classification tasks because they can capture subtle texture cues and discriminative patterns. Residual networks have become standard backbones in many recognition pipelines [11], while EfficientNet provides systematic scaling rules and strong accuracy-efficiency trade-offs suitable for industrial applications constrained by inference time [12]. MobileNets are widely used when computational budgets are tight, enabling deployment in resource-limited environments [13]. More recently, vision transformers have expanded the design space for visual recognition, although CNNs remain common in industrial inspection due to their maturity and practical efficiency [14].

In industrial inspection, two problem formulations are central to practical deployment: image-level classification for rapid screening and object detection for defect localization. Among real-time detection frameworks, the YOLO family of one-stage detectors has been widely adopted due to its end-to-end design and favorable speed-accuracy trade-off [15,16,17]. The YOLO lineage has evolved through improvements in feature extraction, multiscale feature fusion, and training strategies. YOLOv4 introduced practical architectural and training refinements to improve real-time detection [18,19], while YOLOv7 further advanced performance through trainable bag-of-freebies strategies [20]. The Ultralytics ecosystem provides widely used implementations and documentation for YOLOv5, YOLOv8, and YOLOv11, enabling consistent and reproducible workflows for training, validation, and deployment of YOLO-based detectors [21,22,23].

Microscopic dust detection introduces specific difficulties that challenge general-purpose detectors. Dust particles are often a small fraction of the image area, reflective metallic surfaces can generate false positives, and scale variation requires fine-grained multiscale feature maps. Feature Pyramid Networks (FPN) are a foundational approach for multiscale feature representation and have influenced many modern detectors [24]. Beyond YOLO, alternative detection families including SSD [25], Faster R-CNN [26], RetinaNet [27], and Mask R-CNN [28] provide useful reference points for understanding trade-offs between accuracy, speed, and localization granularity.

Classification models raise interpretability concerns because a correct label does not guarantee the model attended to the true defect region. Attention mechanisms such as Squeeze-and-Excitation (SE) blocks [29] and Convolutional Block Attention Module (CBAM) [30] can improve representational sensitivity. Class Activation Mapping (CAM) introduced a method to visualize discriminative regions using global average pooling [31], and Grad-CAM generalized this to broader architectures using gradient-based heatmaps [32]. Broader XAI literature emphasizes that interpretability supports accountability, debugging, and responsible deployment in high-stakes decision contexts [33,34]. In manufacturing, explainability helps verify that the model focuses on true dust regions rather than spurious reflections or texture artifacts.

Another practical dimension is data availability. In real factories, defect data can be scarce, and distributions can change as processes drift. Data augmentation is commonly used to improve generalization under limited data conditions [35]. Adam is widely used for deep learning optimization due to adaptive learning rates [36], while AdamW improves generalization by decoupling weight decay from the gradient update [37]. Modern frameworks such as PyTorch support rapid experimentation and have become a standard foundation for implementing CNN and detection models [38]. Detection models are commonly benchmarked on MS COCO, which defines standard evaluation protocols that help contextualize detector capability relative to industrial domains [39].

When labeled defect data are extremely limited, anomaly detection methods offer an alternative formulation by learning normal appearance and flagging deviations. Representative approaches include PaDiM [40], PatchCore [41], STFPM [42], Deep SVDD [43], and generative methods such as AnoGAN and GANomaly [44,45], benchmarked on datasets such as MVTec AD [46]. Although anomaly detection is outside the primary scope of this study, it motivates future directions for settings where defect labels are scarce.

Despite recent progress in deep learning for industrial inspection, microscopic dust detection on VCMA components remains insufficiently studied, particularly in settings requiring both reliable defect localization and image-level confirmation under limited-data conditions. To the best of the authors’ knowledge, no prior deep-learning study has specifically addressed microscopic dust inspection on HDD VCMA components. This gap is attributable to the rarity of naturally occurring, visually identifiable dust contamination in controlled HDD production, the cost of expert annotation, the highly reflective surface texture, and the limited availability of industrial samples. Existing approaches in related domains often focus on either classification or detection alone [47], while the interpretability of model decisions is rarely examined in the context of precision manufacturing contamination inspection.

Microscopic dust inspection on VCMA components is also closely related to the problem of tiny object detection (TOD), in which target objects occupy only a small fraction of the image area. Specialized strategies such as slicing-based inference (e.g., SAHI—Slicing Aided Hyper Inference [48]) and scale-enhanced feature extraction have been proposed to improve detection sensitivity for micro-scale features. However, such methods were outside the scope of the present study, which focused on comparing representative YOLO generations and CNN classifiers under the same constrained dataset and imaging conditions; these strategies are identified as a priority for future work.

To address these gaps, this study proposes an explainable hybrid inspection framework integrating YOLO-based localization, CNN-based Good/NG confirmation, and Grad-CAM-based visual interpretation for practical VCMA quality inspection. The principal scientific contributions are as follows:

(1): A systematic comparative evaluation of representative YOLO-based detectors (YOLOv5, YOLOv8, and YOLOv11) and CNN-based classifiers (ResNet50, EfficientNetB0, and MobileNetV2) under identical experimental conditions on microscopic VCMA images, providing a rigorous basis for deployment-oriented model selection.
(2): A sequential hybrid inspection framework that integrates CNN-based image-level screening with YOLO-based defect localization, demonstrating improved false-positive suppression and selective localization for practical VCMA quality inspection.
(3): An explainability analysis using Grad-CAM to verify that selected models attend to physically meaningful dust-contaminated regions, supporting interpretability and engineering trust in automated inspection decisions.

2. Experimental Design and Methodology

This section describes the experimental design used to develop and evaluate deep learning-based inspection methods for microscopic dust detection on VCMA components. Two complementary inspection paradigms were investigated: YOLO-based object detection for dust localization and CNN-based image classification for Good/NG decision making. To ensure fair comparison, all models were evaluated under a unified imaging setup, dataset partition, preprocessing pipeline, and evaluation protocol. The workflow comprised image acquisition, dataset preparation, model training, explainability analysis, and performance evaluation under conditions representative of industrial inspection practice.

An HDD is a magnetic data storage device in which digital information is stored on rotating platters and accessed by a read/write head operating with extremely high positional accuracy. The motion of the read/write head is controlled by an actuator system that must function reliably at the micrometer scale to ensure stable data reading and writing. Due to the extremely small clearance between the read/write head and the disk surface, HDD components are highly sensitive to particulate contamination. Even microscopic dust particles can disturb head positioning, induce mechanical vibration, or lead to long-term degradation of operational reliability. Consequently, strict cleanliness control is required throughout HDD manufacturing, particularly during actuator-related assembly processes.

In the actuator system, the VCMA plays a critical role in controlling the precise motion of the read/write head. The VCMA operates on electromagnetic principles, converting electrical current into controlled mechanical displacement to enable rapid and accurate head positioning. Because of its direct influence on actuator precision and system stability, the VCMA is considered one of the most critical components affecting HDD performance. Foreign particles present on VCMA surfaces can significantly affect positioning accuracy and may cause unstable read/write operations or reliability failures over time. For this reason, VCMA components, shown in Figure 1, were selected as the primary inspection target in this study.

The contamination inspection in this study covers selected regions of the VCMA, chosen as a first step toward a complete imaging method for full-scale VCMA inspection. These regions are difficult to inspect visually and were selected based on practical manufacturing constraints such as surface geometry and the presence of reflective metallic surfaces that can obscure small dust particles. These properties make the inspection task in this study representative of the real-world challenges encountered in industrial HDD production lines.

2.1. Image Capture

This case study focuses on microscopic dust inspection on VCMA components collected from an actual hard disk drive manufacturing environment. Image acquisition was conducted using a high-resolution digital microscope camera mounted above the inspection platform. VCMA samples were placed on a customized fixture to maintain consistent positioning and imaging angles throughout the capture process.

High-resolution images of the VCMA components were captured in a real industrial production environment using a 5-megapixel digital microscope camera equipped with a variable-zoom optical lens. The optical magnification (0.8× to 10×) was selected according to the area to be inspected and the particle size. A ring-shaped LED lighting system provided adequate illumination and minimized glare from the reflective metallic surface. All images were taken under tightly controlled lighting to ensure comparable brightness and contrast throughout the dataset, as shown in Figure 2. The key imaging parameters used throughout data collection are summarized in Table 1.

Several imaging parameters were standardized throughout data collection. The camera-to-subject distance was fixed at 100 mm, with an optical magnification of 10× and a native image resolution of 640 × 480 pixels. Illumination intensity was adjusted empirically to maximize contrast between dust particles and the reflective metallic surface.

To reduce background complexity and improve defect visibility, a white planar background was employed beneath the components. A ring-shaped LED illumination system was integrated around the camera lens to provide uniform lighting coverage across the metallic surface of the VCMA. The illumination intensity was carefully adjusted to mitigate surface reflections while enhancing the contrast of dust particles against the reflective substrate.

Due to the complex geometry of VCMA components, surface protrusions caused unstable placement and severe glare during initial image acquisition. To address this challenge, a custom fixture was designed and fabricated using 3D printing technology, as illustrated in Figure 3. The fixture enabled the VCMA samples to remain level and mechanically stable during imaging, significantly reducing reflection artifacts and motion-induced blur.

Furthermore, because defects can occur at different locations during the assembly process, images were captured from both the front and rear surfaces of the VCMA components. Figure 4 compares imaging conditions with and without the customized fixture, highlighting the improved surface alignment and consistent focus achieved through fixture-assisted positioning.

2.2. Challenges in Microscopic Dust Inspection

Inspecting microscopic dust on VCMA components presents several technical challenges for reliable defect detection. One main issue is specular reflection from the metallic VCMA surface, which can mask dust particles or create bright artifacts that distort their visual interpretation (Figure 5a). Another problem is image blur, especially for dust particles lying outside the focal plane or when high magnification is required. Blurred regions reduce edge sharpness and contrast, so that small dust particles become indistinguishable from the background (Figure 5b). Excessive illumination can also create overexposed areas, as shown in Figure 5c, which obscure fine surface detail and can produce artifacts that resemble contamination. Finally, dust particles located away from the focused examination area pose a further obstacle: as shown in Figure 5d, such particles appear faint or indistinct because of limited resolution and depth of field, which increases missed detections. These factors limit the effectiveness of classical rule-based or manual inspection methods and motivate deep learning approaches that can learn complex visual patterns across varied imaging conditions.

2.3. Data Acquisition and Dataset Preparation

Two related but distinct datasets were prepared for this study. For image-level classification, the original dataset consisted of 47 microscope images, including 30 Good samples and 17 Not Good (NG) samples. Data augmentation was applied only to the training subset, resulting in 192 images for model development while preserving untouched validation and test images for unbiased evaluation. For object detection, the same 47 original images were manually annotated with dust bounding boxes, yielding 109 labeled defect instances. To avoid information leakage, all augmented samples derived from a given original image were kept exclusively within the training set, and no augmented variants of validation or test images were used during model selection or final evaluation. This strategy ensured that the reported performance reflected generalization to previously unseen images rather than memorization of transformed samples.

After dataset preparation, the data were divided into training, validation, and testing subsets using a 70:15:15 ratio with balanced class representation where applicable. For the classification task, the split was performed before augmentation so that only the training images were augmented. For the detection task, the annotated images were partitioned according to the same protocol to support consistent comparison across YOLO-based models.
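The split-before-augmentation protocol described above can be illustrated with a minimal sketch. It assumes a per-class shuffle followed by a 70:15:15 cut; the file names, the `stratified_split` helper, the rounding rule, and the seed are hypothetical and are not taken from the study. Only the class counts (30 Good, 17 NG) and the ratio come from the text.

```python
import random


def stratified_split(items_by_class, ratios=(0.70, 0.15, 0.15), seed=0):
    """Split each class separately so every subset keeps both classes."""
    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for label, items in items_by_class.items():
        items = list(items)
        rng.shuffle(items)
        n_train = round(len(items) * ratios[0])
        n_val = round(len(items) * ratios[1])
        splits["train"] += [(x, label) for x in items[:n_train]]
        splits["val"] += [(x, label) for x in items[n_train:n_train + n_val]]
        # Whatever remains after the train and validation cuts is the test set.
        splits["test"] += [(x, label) for x in items[n_train + n_val:]]
    return splits


# Hypothetical file names; the class counts follow the text.
images = {
    "Good": [f"good_{i:02d}.png" for i in range(30)],
    "NG": [f"ng_{i:02d}.png" for i in range(17)],
}
splits = stratified_split(images)

# Augmentation would be applied to splits["train"] only, after this point,
# so no transformed copy of a validation or test image can reach training.
```

Because the partition is fixed before any augmentation is generated, every augmented sample inherits the subset of its original image, which is the leakage safeguard the text describes.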

All images were then cropped to contain only the region of interest (ROI), defined as the relevant VCMA surface area where dust contamination may appear. ROI extraction was performed to remove non-informative background regions and to focus model learning on the most critical inspection area. Representative examples of ROI selection and preprocessing are shown in Figure 6.

Following ROI extraction, images were resized to match the input dimensions required by each model architecture and normalized by scaling pixel intensity values to the range [0, 1]. To improve model robustness and reduce overfitting under limited data conditions, data augmentation techniques were applied during training only, including random rotation within ±15°, random horizontal flipping, random zoom within ±10%, and brightness adjustment within ±20%. These preprocessing and augmentation steps were applied consistently to both YOLO-based detection models and CNN-based classification models, while validation and test datasets were kept unchanged. This strategy was adopted to enhance generalization across variations in lighting conditions and surface textures while ensuring fair and reproducible performance evaluation. All images in the dataset were acquired under these standardized conditions. No deliberate variation in brightness, contrast, blur, or lens state was introduced during data collection; consequently, the dataset does not capture the full range of imaging disturbances that may occur in long-term production use.
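As a rough illustration of the augmentation ranges listed above, the sketch below samples one set of augmentation parameters and applies the two transforms that need no interpolation (horizontal flip and brightness scaling) to a small grayscale image stored as nested lists. The dictionary layout, the 0.5 flip probability, and the multiplicative reading of the ±20% brightness range are assumptions; rotation and zoom are sampled but not applied here because they require an image library.

```python
import random

# Ranges taken from the text; the dictionary layout is illustrative only.
AUG_RANGES = {
    "rotation_deg": (-15.0, 15.0),
    "zoom_factor": (0.90, 1.10),
    "brightness_factor": (0.80, 1.20),  # assumes a multiplicative +/-20% change
}


def sample_augmentation(rng):
    """Draw one random parameter set within the stated ranges."""
    params = {name: rng.uniform(lo, hi) for name, (lo, hi) in AUG_RANGES.items()}
    params["horizontal_flip"] = rng.random() < 0.5  # flip probability is assumed
    return params


def normalize(pixels):
    """Scale 8-bit intensities to the range [0, 1]."""
    return [[v / 255.0 for v in row] for row in pixels]


def apply_flip_and_brightness(image, params):
    """Apply the flip and brightness transforms to a [0, 1] grayscale image."""
    if params["horizontal_flip"]:
        rows = [row[::-1] for row in image]
    else:
        rows = [row[:] for row in image]
    factor = params["brightness_factor"]
    # Clip at 1.0 so brightened pixels stay inside the normalized range.
    return [[min(1.0, v * factor) for v in row] for row in rows]
```

In this sketch the sampler would be called once per generated training sample, and never for validation or test images.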

The limited dataset size reflects practical constraints of acquiring annotated microscopic contamination samples in controlled HDD manufacturing, where dust events are infrequent and expert annotation is costly. To support reliable model development under these conditions, four safeguards were adopted: (1) CNN models were initialized with ImageNet-pretrained weights and trained with frozen feature extractors, a transfer learning strategy shown to enable effective adaptation under limited labeled data [49,50]; (2) augmentation was restricted strictly to the training subset to prevent information leakage; (3) 5-fold cross-validation was applied to reduce sensitivity to any single train/test partition [51]; and (4) dataset expansion through the acquisition of additional real samples was recognized as a practical strategy for improving model generalizability. Increasing the number of labeled training images reduces the risk of overfitting and allows models to learn more diverse defect morphologies and background variations, which has been consistently reported to improve model stability and transferability in industrial inspection under limited-data conditions [35]. The study is accordingly framed as a proof-of-feasibility investigation, and dataset expansion is identified as a priority for future work in Section 4.3.
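The 5-fold cross-validation safeguard can be sketched as follows. The text does not say whether the folds were stratified or how they relate to the 70:15:15 hold-out split, so the class-wise round-robin assignment below is an assumption, and `stratified_kfold` is a hypothetical helper rather than the routine used in the study.

```python
import random


def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_indices, val_indices) pairs with classes spread over folds."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin across the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)

    for i in range(k):
        val = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, val


# Class counts follow the text: 30 Good and 17 NG original images.
labels = ["Good"] * 30 + ["NG"] * 17
cv_folds = list(stratified_kfold(labels))
```

Each original image appears in exactly one validation fold, so averaging a metric over the five folds reduces dependence on any single partition.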

2.4. Workflow of YOLO-Based VCMA Defect Detection System

In this study, a YOLO-based object detection framework was employed to detect and localize microscopic dust contamination on VCMA surfaces under realistic industrial imaging conditions. The YOLO family was selected because its single-stage detection paradigm enables object localization and classification within a unified network, making it suitable for time-sensitive inspection tasks involving small and visually subtle defects. A conceptual overview of the YOLO-based detection principle adopted in this work is presented in Figure 7, while the complete experimental workflow is illustrated in Figure 8.

The detection workflow began with image acquisition and preprocessing, following the procedures described in Section 2.1 and Section 2.3. The annotated detection dataset consisted of 47 original VCMA images, in which dust-contaminated regions were manually labeled using bounding boxes to provide ground-truth supervision for model training. Each bounding box corresponded to a visually identifiable dust-contaminated region on the VCMA surface. Regions were annotated as dust only when the particles were visually distinguishable from the surrounding VCMA surface and exhibited morphological characteristics consistent with contamination under microscopic observation. These annotations served as the basis for supervised learning and enabled the models to learn both the spatial location and visual characteristics of microscopic dust particles.

Three YOLO variants, namely YOLOv5, YOLOv8, and YOLOv11, were comparatively evaluated to examine differences in detection behavior across model generations under identical experimental conditions. All models were trained using the same annotated dataset, preprocessing pipeline, and evaluation protocol to ensure fair comparison. During training, the YOLO framework jointly learned feature extraction, object classification, and bounding-box regression in an end-to-end manner, thereby enabling simultaneous dust recognition and localization on VCMA images.

All YOLO models were implemented in Python 3.12.13 using the PyTorch 2.10.0-based Ultralytics 8.4.37 framework and trained in a Google Colab environment equipped with an NVIDIA Tesla T4 GPU (16 GB VRAM), an x86_64 CPU, and 12.7 GB of system memory under Linux 6.6. To ensure consistency across experiments, the same training hyperparameters were used for all three YOLO models, including an input image size of 640 × 640 pixels, a batch size of 16, a learning rate of 0.001, and 150 training epochs. These unified settings were adopted to support reproducibility and to isolate architectural differences as the primary factor influencing comparative performance.
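A minimal sketch of how the shared training settings might be passed to the Ultralytics Python API is given below. The hyperparameter values come from the text; the checkpoint names (nano-sized variants), the dataset file `vcma_dust.yaml`, and the mapping of the stated learning rate onto the `lr0` argument are assumptions, since the text does not specify the model size or configuration files.

```python
# Shared hyperparameters quoted in the text.
TRAIN_ARGS = {"imgsz": 640, "batch": 16, "lr0": 0.001, "epochs": 150}

# Assumed checkpoints: the text does not state which model size was used.
CHECKPOINTS = {
    "YOLOv5": "yolov5nu.pt",
    "YOLOv8": "yolov8n.pt",
    "YOLOv11": "yolo11n.pt",
}


def train_all(data_yaml="vcma_dust.yaml"):
    """Train each detector with identical settings (hypothetical dataset file)."""
    from ultralytics import YOLO  # imported lazily so the config is usable alone

    results = {}
    for name, weights in CHECKPOINTS.items():
        model = YOLO(weights)
        # Identical arguments for every model, so architecture is the only variable.
        results[name] = model.train(data=data_yaml, **TRAIN_ARGS)
    return results
```

Depending on the optimizer setting, Ultralytics may choose its own learning rate, so an explicit optimizer choice may be needed for `lr0` to take effect; the text does not say which optimizer was used.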

The output of the YOLO-based detection system consisted of predicted bounding boxes indicating the location of dust particles on the VCMA surface together with their associated confidence scores. These predictions were subsequently evaluated using standard object detection metrics, including Precision, Recall, F1-score, mAP@0.5, and mAP@0.5:0.95, as described in Section 2.10. Representative examples of the YOLO-based detection workflow and output visualization are shown in Figure 8.
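To make the detection metrics concrete, the sketch below computes the intersection over union (IoU) of two boxes, matches predictions to ground truth at an IoU threshold of 0.5, and derives precision, recall, and F1-score from the resulting counts. The greedy confidence-ordered matching rule is a common convention and an assumption here; the exact matching procedure used by the evaluation tooling in the study may differ.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_detections(preds, truths, iou_thr=0.5):
    """Greedy matching of (box, confidence) predictions; returns (tp, fp, fn)."""
    matched = set()
    tp = fp = 0
    for box, _conf in sorted(preds, key=lambda p: p[1], reverse=True):
        best_j, best_iou = None, iou_thr
        for j, gt in enumerate(truths):
            if j in matched:
                continue
            value = iou(box, gt)
            if value >= best_iou:
                best_j, best_iou = j, value
        if best_j is None:
            fp += 1  # no unmatched ground-truth box overlaps enough
        else:
            matched.add(best_j)
            tp += 1
    fn = len(truths) - len(matched)
    return tp, fp, fn


def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and their harmonic mean from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

As a worked check, the harmonic mean of the precision (0.75) and recall (0.69) quoted in the abstract for YOLOv5 is 2 × 0.75 × 0.69 / 1.44 ≈ 0.72, although an F1-score reported at a different confidence threshold need not equal this value.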

2.5. Workflow of CNN-Based VCMA Defect Classification System

In addition to object detection, a CNN-based classification framework was investigated to perform image-level inspection of VCMA components. In this approach, each VCMA image was classified into one of two categories: Good, representing dust-free components, and Not Good (NG), representing components with visible dust contamination. The Good class included images without visible dust contamination, whereas the NG class comprised images containing visually identifiable dust or foreign particles on the VCMA surface. This formulation reflects practical industrial inspection scenarios in which rapid pass/fail screening is required at the image level without explicit defect localization.

A conceptual overview of the CNN-based classification principle adopted in this study is shown in Figure 9, while the detailed experimental workflow is illustrated in Figure 10. The classification workflow began with image acquisition, labeling, and preprocessing, following the procedures described in Section 2.1 and Section 2.3. The classification dataset was derived from the original set of microscope images and organized into Good and NG classes for supervised learning.

Three CNN architectures, namely ResNet50, EfficientNetB0, and MobileNetV2, were evaluated to assess their suitability for VCMA dust classification under identical experimental conditions. These models were selected to represent different design trade-offs in terms of feature extraction capability, model complexity, and computational efficiency. Transfer learning was applied by initializing each network with ImageNet-pretrained weights and retaining the pretrained backbone as a fixed feature extractor, while only the classification head was trained for the binary defect classification task.

To adapt the selected CNN architectures to this application, the original classification layers were replaced with a task-specific classifier head consisting of global average pooling, fully connected layers, dropout regularization with a rate of 0.1, and a softmax output layer for binary prediction. Input images were resized to 224 × 224 pixels to match the input requirements of the CNN models. All preprocessing steps and training conditions were kept consistent across the three CNN architectures to ensure fair comparison and reproducibility.
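The frozen-backbone classifier described above might be assembled as in the following sketch, which assumes a Keras/TensorFlow implementation with EfficientNetB0 as the example backbone. The input size, dropout rate, and two-class softmax output follow the text; the single hidden layer of 128 units is an assumption, because the width and number of fully connected layers are not stated.

```python
HEAD_CONFIG = {
    "input_size": 224,    # from the text
    "dropout_rate": 0.1,  # from the text
    "num_classes": 2,     # Good vs. NG
    "hidden_units": 128,  # assumption: layer width is not given in the text
}


def build_classifier(config=HEAD_CONFIG):
    """Frozen ImageNet backbone with a small trainable classification head."""
    import tensorflow as tf  # imported lazily so the config is usable alone

    size = config["input_size"]
    backbone = tf.keras.applications.EfficientNetB0(
        include_top=False, weights="imagenet", input_shape=(size, size, 3)
    )
    backbone.trainable = False  # keep the pretrained feature extractor fixed

    inputs = tf.keras.Input(shape=(size, size, 3))
    x = backbone(inputs, training=False)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    x = tf.keras.layers.Dense(config["hidden_units"], activation="relu")(x)
    x = tf.keras.layers.Dropout(config["dropout_rate"])(x)
    outputs = tf.keras.layers.Dense(config["num_classes"], activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)
```

Note that the Keras EfficientNet models include their own input rescaling and expect pixel values in [0, 255]; how that interacts with the [0, 1] scaling described in Section 2.3 is not specified in the text, so the preprocessing step is left out of this sketch.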

The CNN-based classification system was trained to learn discriminative image-level representations that distinguish defect-free VCMA surfaces from dust-contaminated samples. During inference, the output of the classifier was a predicted class label together with its associated confidence score for each input image. These predictions were subsequently evaluated using standard classification metrics, including accuracy, precision, recall, F1-score, and confusion-matrix analysis, as described in Section 2.10. Representative examples of the CNN-based classification workflow are presented in Figure 10 and Figure 11. To further examine whether the CNN models relied on physically meaningful dust regions during classification, Grad-CAM visualization was subsequently applied, as described in Section 2.6.

2.6. Gradient-Weighted Class Activation Mapping (Grad-CAM)

To enhance the interpretability of the CNN-based classification models, Grad-CAM was applied to visualize the discriminative image regions that contributed most strongly to each model’s prediction. Grad-CAM computes the gradient of the target class score with respect to the feature maps of the final convolutional layer and uses these gradients as weights to generate a class-discriminative heatmap. This technique enables spatial localization of the regions that the model attends to when making a classification decision, without requiring architectural modifications or additional supervision.

In this study, Grad-CAM was applied independently to ResNet50, EfficientNetB0, and MobileNetV2 on VCMA test images to qualitatively assess whether each model’s attention was directed toward physically meaningful dust regions rather than irrelevant background textures or surface reflections.

Grad-CAM was applied to the final convolutional layer of each model, specifically conv5_block3_out for ResNet50, block7a_project_conv for EfficientNetB0, and Conv_1 for MobileNetV2, which are the deepest convolutional layers retaining spatial feature maps prior to global average pooling.

The activation maps were upsampled to the original input image resolution using bilinear interpolation and subsequently normalized to the range [0, 1] via min-max scaling to ensure consistent visual representation across images. The normalized heatmaps were then overlaid on the corresponding input images at an opacity of 0.5 using a jet colormap, with values ranging from blue (low activation) to red (high activation), enabling direct visual comparison of model attention across architectures.
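The normalization and overlay steps can be sketched as follows. This is a simplified illustration rather than the authors' implementation: bilinear upsampling and the jet colormap lookup are omitted, the helper names are hypothetical, and the blend is shown for a single RGB pixel.

```python
def min_max_normalize(heatmap):
    """Scale a 2D heatmap to [0, 1]; a constant map becomes all zeros."""
    flat = [v for row in heatmap for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in heatmap]
    return [[(v - lo) / (hi - lo) for v in row] for row in heatmap]


def blend(image_px, heat_px, opacity=0.5):
    """Alpha-blend one colormapped heatmap pixel over one image pixel (RGB)."""
    return tuple(
        (1.0 - opacity) * i + opacity * h for i, h in zip(image_px, heat_px)
    )


normalized = min_max_normalize([[2.0, 4.0], [6.0, 10.0]])
# A gray image pixel blended with a fully red (high-activation) heatmap pixel.
pixel = blend((100.0, 100.0, 100.0), (255.0, 0.0, 0.0), opacity=0.5)
```

Because each heatmap is rescaled by its own minimum and maximum, the colors indicate relative attention within one image and are not directly comparable as absolute activation strengths between images.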

This visualization serves as a complementary evaluation tool alongside the quantitative metrics reported in Section 3, providing deeper insight into model behavior and supporting confidence in the reliability of the proposed inspection system.

2.7. Hybrid Inspection Framework for VCMA Defect Detection

Microscopic dust contamination on VCMA components presents significant challenges for automated inspection due to its small size, irregular appearance, and surface reflections. CNN-based classification models are effective in identifying contaminated samples at the image level; however, they do not provide information regarding defect locations. To enable precise spatial inspection and improve practical applicability, a YOLO-based object detection module was integrated into a hybrid inspection framework.

In the proposed pipeline, each VCMA image is first analyzed using the CNN classifier to determine the presence of dust defects. Defect-free images are directly accepted, while contaminated samples are forwarded to the YOLO detector for localized inspection. This sequential strategy limits detection to relevant samples, improving computational efficiency and reducing unnecessary false detections.
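The routing logic of this sequential strategy can be sketched as below. The classifier and detector are passed in as plain callables, and the stand-in models, the dictionary-based "images", and the function names are hypothetical placeholders rather than the trained networks used in the study.

```python
from typing import Callable, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x_center, y_center, w, h)


def inspect(image, classify: Callable, detect: Callable) -> Tuple[str, Optional[List[Box]]]:
    """Screen with the classifier; run the detector only on NG images."""
    label = classify(image)
    if label == "Good":
        return "Good", None       # accepted without localization
    return "NG", detect(image)    # forwarded for bounding-box localization


def forwarded_fraction(results) -> float:
    """Share of images that reached the detector stage."""
    forwarded = sum(1 for label, _ in results if label == "NG")
    return forwarded / len(results)


# Stand-in models for illustration: each "image" is just a dict here.
def fake_classifier(img):
    return "NG" if img["dusty"] else "Good"


def fake_detector(img):
    return [(0.5, 0.5, 0.1, 0.1)]


images = [{"dusty": True}, {"dusty": False}, {"dusty": False}, {"dusty": True}]
results = [inspect(img, fake_classifier, fake_detector) for img in images]
share = forwarded_fraction(results)
```

One consequence of this design is visible in the control flow: an NG image that the classifier labels as Good returns early and never reaches the detector, so screening errors cannot be recovered downstream.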

The YOLO detector localizes microscopic dust regions by generating bounding boxes corresponding to defect areas on the VCMA surface. This localization supports visualization and further defect analysis while complementing the global discrimination capability of CNN classification. By integrating classification and detection in a coarse-to-fine manner, the hybrid framework enhances inspection robustness for industrial VCMA quality control, as illustrated in Figure 12.

2.8. Rationale for Model Selection

YOLO-based object detection and CNN-based image classification architectures were selected in this study due to their complementary roles in industrial visual inspection tasks. Specifically, three YOLO variants, namely YOLOv5, YOLOv8 [19], and YOLOv11 [20], were evaluated for real-time dust localization, while ResNet50 [9], EfficientNetB0 [10], and MobileNetV2 [11] were employed for image-level defect classification. YOLO-based detection enables simultaneous localization and classification of dust particles on VCMA surfaces, making it suitable for inspection scenarios in which spatial information about defect locations is required. However, detection performance may be influenced by surface reflections, illumination variations, and background noise commonly encountered in industrial imaging environments.

In contrast, CNN-based image classification focuses on learning global feature representations from visual data to provide stable image-level decision making. This approach is well suited for binary quality assessment tasks, where a pass/fail decision is sufficient, and can serve as an alternative or secondary verification strategy to reduce misclassification arising from localization-based methods.

By evaluating YOLO-based detection and CNN-based classification independently under identical experimental conditions, this study aims to systematically examine the strengths and limitations of each approach for VCMA dust inspection. This comparative design establishes a methodological basis for analyzing inspection performance and provides insights that may support the future development of hybrid inspection frameworks combining localization capability with robust image-level classification.

2.9. Model Training and Implementation

Both YOLO-based detection models and CNN-based classification models were implemented in Python 3.10 under consistent computational environments to ensure fair and reproducible performance comparison. All experiments were conducted using the same software framework and hardware configuration whenever applicable. The key training parameters for both model categories are summarized in Table 2.

The small-scale (s) variants were used for all three YOLO architectures (YOLOv5s, YOLOv8s, YOLOv11s), selected to balance detection capability with computational efficiency under the limited-data conditions of this study. All models were initialized with COCO-pretrained weights provided through the Ultralytics framework (version 8.4.37) and trained using SGD with a fixed learning rate of 0.001. The composite YOLO training loss comprised objectness loss, classification loss, and bounding-box regression loss, optimized jointly in an end-to-end manner.

For the YOLO-based detection models, ground-truth annotations were provided in YOLO-format text files specifying bounding-box coordinates and corresponding class labels. Training aimed to minimize a composite detection loss integrating object, classification, and localization terms, thereby enabling simultaneous learning of dust presence and spatial localization. To improve convergence stability and reduce overfitting, early stopping and learning-rate scheduling were applied during training.

For the CNN-based classification models, each input image was assigned a binary categorical label (Good or NG), and training aimed to minimize categorical cross-entropy loss. The CNN models were trained using batch size 32, learning rate 0.005, and dropout regularization of 0.1 under the selected final configuration. To support stable convergence and fair comparison across architectures, early stopping (patience = 5) and ReduceLROnPlateau scheduling (factor = 0.2, patience = 3) were employed, with the best model weights restored after training.

Training configurations such as input image resolution, number of training epochs, batch size, optimizer, and learning rate were kept consistent within each model category to support fair comparison across YOLO variants and across CNN architectures. This unified training and implementation strategy enables objective evaluation of YOLO-based detection and CNN-based classification approaches for VCMA defect inspection.

2.10. Evaluation Metrics and Visualization

Model performance was quantitatively evaluated using task-appropriate metrics for object detection, image classification, and system-level hybrid inspection. To ensure consistency and comparability, all evaluations were conducted using the same dataset partitions and preprocessing procedures described in Section 2.

For the YOLO-based detection models, detection performance was assessed using Precision, Recall, and F1-score, which are widely adopted metrics for evaluating object detection performance. Precision measures the proportion of correctly detected dust particles among all detected instances, while Recall represents the proportion of correctly detected dust particles relative to all ground-truth instances. The F1-score provides a balanced measure between Precision and Recall. These standard metrics are defined as follows:

\mathrm{Precision} = \frac{TP}{TP + FP}

(1)

\mathrm{Recall} = \frac{TP}{TP + FN}

(2)

\mathrm{F1\text{-}score} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}

(3)

\mathrm{IoU} = \frac{\text{Area of Intersection}}{\text{Area of Union}}

(4)

\mathrm{Box} = (x_{\mathrm{center}}, y_{\mathrm{center}}, w, h)

(5)

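Here, TP, FP, and FN denote true positives, false positives, and false negatives, respectively. In Equation (4), the intersection and union are taken between a predicted bounding box and a ground-truth bounding box. In Equation (5), each box is parameterized by its center coordinates together with its width w and height h, as in the YOLO annotation format.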

In addition to these measures, the mean Average Precision (mAP) at Intersection over Union (IoU) thresholds of 0.5 (mAP@0.5) and 0.5–0.95 (mAP@0.5:0.95) were calculated to provide a more comprehensive assessment of detection robustness and localization accuracy across the evaluated YOLO variants.
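For concreteness, Equations (1)–(4) can be evaluated as in the following sketch. The counts and box coordinates are hypothetical, the function names are illustrative, and boxes follow the center-based parameterization of Equation (5).

```python
def precision_recall_f1(tp, fp, fn):
    """Equations (1)-(3); returns 0.0 where a denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def iou(box_a, box_b):
    """Equation (4) for two boxes given as (x_center, y_center, w, h)."""
    def corners(box):
        xc, yc, w, h = box
        return xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2

    ax1, ay1, ax2, ay2 = corners(box_a)
    bx1, by1, bx2, by2 = corners(box_b)
    # Overlap extent along each axis, clipped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - intersection
    return intersection / union if union else 0.0


# Hypothetical counts: 9 correct detections, 3 spurious, 1 missed.
p, r, f1 = precision_recall_f1(tp=9, fp=3, fn=1)
# Two equally sized boxes shifted by half a box width.
overlap = iou((0.5, 0.5, 0.2, 0.2), (0.6, 0.5, 0.2, 0.2))
```

In the second example the boxes share half of their area, yet the IoU is only about one third. A detection with this overlap would therefore not count as a true positive at the 0.5 threshold used for mAP@0.5, which illustrates how demanding the metric is for very small targets.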

For the CNN-based classification models, performance was evaluated using image-level accuracy, precision, recall, F1-score, and area under the receiver operating characteristic curve (AUC). Confusion matrices were additionally used to analyze the distribution of correct and incorrect predictions between the Good and NG classes. Because false-negative predictions are particularly critical in industrial quality inspection, recall was treated as an important metric alongside AUC, F1-score, and overall classification accuracy during model interpretation.

For the proposed sequential hybrid framework, additional system-level evaluation was conducted to assess the combined behavior of the screening and localization stages. The hybrid pipeline was compared with the standalone CNN and YOLO configurations in terms of image-level decision quality, false-positive and false-negative behavior, and the proportion of images forwarded to the YOLO-based localization stage. Failure modes of the sequential pipeline were also analyzed to identify the dominant sources of residual error.

To enhance interpretability, Grad-CAM was applied to the CNN-based classification models to generate class-specific heatmaps indicating the image regions most influential to each prediction. These visualizations were used to assess whether the models attended to physically meaningful dust regions rather than irrelevant background patterns or reflective artifacts. For the YOLO-based detection models, predicted bounding boxes were overlaid on VCMA images to qualitatively assess localization accuracy and detection consistency. Together, these visualization methods complemented the quantitative metrics by providing additional insight into model behavior and inspection reliability.

3. Results

This section presents the experimental results and comparative analysis of the YOLO-based detection and CNN-based classification models for microscopic dust inspection on VCMA components. All models were evaluated using the same dataset partitions and performance criteria described in Section 2 to ensure fair and reproducible comparison. The results are presented in five parts: YOLO-based detection performance, CNN-based classification performance, Grad-CAM-based interpretability analysis, evaluation of the sequential hybrid inspection framework, and a comparative discussion of the findings in relation to industrial deployment.

3.1. Performance of YOLO-Based Detection Models

Three YOLO architectures, namely YOLOv5, YOLOv8, and YOLOv11, were evaluated for their ability to detect microscopic dust particles on VCMA components. Table 3 summarizes the quantitative performance comparison among the three YOLO models in terms of Precision, Recall, F1-score, mAP@0.5, mAP@0.5:0.95 and number of false positives. Overall, the results indicate distinct detection behaviors across model variants, particularly with respect to the trade-off between detection performance and false-positive control.

Among the evaluated models, YOLOv8 achieved the highest Precision (0.81), F1-score (0.73), and mAP@0.5 (0.66), indicating the strongest overall detection performance in terms of these aggregate metrics. YOLOv5, however, achieved the highest Recall (0.69) and produced the lowest number of false positives (2), while maintaining the same mAP@0.5:0.95 value (0.26) as YOLOv8. In contrast, YOLOv11 showed substantially lower Recall (0.38), F1-score (0.49), and mAP values, indicating weaker localization performance under the present experimental setting. This performance gap is likely attributable to YOLOv11’s anchor-free detection architecture, which, while effective on large-scale datasets, may exhibit optimization instability under the limited-data conditions of the present study. With only 47 original training images, the available data may have been insufficient for the model to converge fully, resulting in inconsistent bounding-box predictions and reduced sensitivity to small dust targets. The higher false-positive rate observed for YOLOv8 is likely related to its more sensitive feature extraction design, which improves aggregate detection performance on diverse datasets but also increases susceptibility to spurious activations on the highly reflective metallic surfaces characteristic of VCMA components. In contrast, YOLOv5’s anchor-based architecture tends to produce more conservative predictions, which reduces background sensitivity and is advantageous in this task, where the operational cost of false alarms outweighs marginal gains in aggregate sensitivity. A formal ablation study isolating the contribution of architectural differences across YOLO generations remains a direction for future work.

These findings suggest that detector selection in the present VCMA inspection task should be guided not only by aggregate detection metrics, but also by the operational cost of false alarms. Although YOLOv8 yielded the strongest overall quantitative performance, it generated the highest number of false positives, particularly in visually complex or reflective regions. By comparison, YOLOv5 provided a more favorable balance between recall and false-positive suppression, making it more suitable for deployment-oriented inspection scenarios in which unnecessary alarms may increase manual verification effort and interrupt production flow. Therefore, YOLOv5 was selected as the preferred detection model for subsequent integration into the hybrid inspection framework.

Figure 13 presents representative qualitative detection outputs from YOLOv5, YOLOv8, and YOLOv11. The examples illustrate clear differences in bounding-box placement and background sensitivity across the three models. YOLOv5 generally produced more conservative predictions with less background interference, whereas YOLOv8 showed stronger sensitivity but also more spurious responses in visually complex regions. YOLOv11, in turn, showed weaker localization consistency under the present imaging conditions. These qualitative observations are consistent with the quantitative results reported in Table 3 and provide further insight into the practical trade-offs among the evaluated YOLO variants.

3.2. Performance of CNN-Based Classification Models

For the image-level classification task, three pre-trained CNN architectures, namely ResNet50, EfficientNetB0, and MobileNetV2, were evaluated as fixed feature extractors for classifying VCMA images into Good and Not Good (NG) categories. Model performance was assessed using standard classification metrics, including Accuracy, Precision, Recall, F1-score, and area under the receiver operating characteristic curve (AUC). To examine convergence behavior and training stability, the training and validation loss and accuracy curves are shown in Figure 14, while the confusion matrices are presented in Figure 15.

As illustrated in Figure 14, all three CNN models exhibited stable learning behavior without severe divergence during training. Among them, EfficientNetB0 showed relatively smoother validation curves and faster convergence, suggesting more stable generalization under the present dataset conditions. ResNet50 also converged consistently, whereas MobileNetV2 showed slightly greater fluctuation in validation behavior, indicating potentially higher sensitivity to image variation and class ambiguity.

The confusion matrices in Figure 15 further reveal differences in misclassification behavior across models. EfficientNetB0 produced the lowest number of misclassified samples overall, indicating a more balanced separation between Good and NG images. ResNet50 achieved comparable performance but showed slightly more classification errors, whereas MobileNetV2 demonstrated a greater tendency to incorrectly classify defect-free samples as defective, consistent with its lower precision.

To assess threshold-independent discriminative performance, receiver operating characteristic (ROC) curves were generated for the three CNN models, as shown in Figure 16. The ROC analysis complements the confusion-matrix results by illustrating the trade-off between true positive rate and false positive rate across different classification thresholds. Although MobileNetV2 achieved the highest AUC, EfficientNetB0 showed the most favorable overall balance between discriminative performance and validation stability under the present experimental setting. The corresponding AUC values were 0.957 for ResNet50, 0.986 for EfficientNetB0, and 1.000 for MobileNetV2.

The ROC analysis in Figure 16 thus complements the confusion-matrix analysis shown in Figure 15. Table 4 summarizes the corresponding classification metrics, including Accuracy, Precision, Recall, F1-score, and AUC, for the three evaluated CNN architectures.

Table 4 shows that EfficientNetB0 achieved the most favorable overall balance among the evaluated CNN architectures. It combined high Accuracy (93.10%), Precision (0.91), F1-score (0.90), and AUC (0.986), indicating strong discriminative capability for distinguishing dust-free and contaminated VCMA samples. ResNet50 yielded comparable Precision (0.91) and Recall (0.90), but with substantially lower Accuracy (82.76%) and slightly lower F1-score (0.89). In contrast, MobileNetV2 achieved the highest AUC (1.000), but its lower Precision (0.87) and F1-score (0.88) indicate a greater tendency to generate false alarms on good samples. These results suggest that EfficientNetB0 provided the best overall trade-off between classification performance and practical reliability at the image level.
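A perfect AUC combined with imperfect precision is not contradictory, because AUC depends only on how samples are ranked, whereas precision depends on where the decision threshold falls. The sketch below illustrates this with hypothetical scores. The 0.5 threshold and all values are assumptions made for illustration and are not taken from the study.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


def precision_at(labels, scores, threshold=0.5):
    """Precision of the positive class when scores >= threshold are flagged."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for y, flag in zip(labels, predicted) if flag and y == 1)
    fp = sum(1 for y, flag in zip(labels, predicted) if flag and y == 0)
    return tp / (tp + fp) if tp + fp else 0.0


# Hypothetical NG-class scores: 1 = NG, 0 = Good.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.20]
auc = roc_auc(labels, scores)
prec = precision_at(labels, scores)
```

Here every NG sample scores above every Good sample, so the ranking is perfect, but two Good samples still exceed the threshold and are flagged as NG. A higher threshold would remove these false alarms without changing the AUC.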

To further examine model reliability beyond the single-run classification metrics reported in Table 4, the best valid configuration of each CNN architecture was identified from the design-of-experiments analysis and is summarized in Table 5. Notably, all three models were compared under the same optimized hyperparameter setting, namely batch size = 32, learning rate = 0.005, dropout = 0.1, and no data augmentation, thereby enabling a fair assessment of architecture-dependent performance.

Under these identical settings, EfficientNetB0 achieved the most reliable overall performance, with a test accuracy of 93.10%, test loss of 0.226, F1-score of 0.930, AUC of 0.986, and the highest 5-fold cross-validated accuracy of 93.24% with the lowest standard deviation (±3.57). Although MobileNetV2 achieved the same test accuracy (93.10%) and F1-score (0.930), and yielded a perfect AUC of 1.000, its substantially larger fold-to-fold variation (±15.12) and lower cross-validated accuracy (81.01%) suggest reduced robustness under the present limited-data condition. ResNet50 showed comparatively lower predictive performance overall, with a test accuracy of 82.76% and a 5-fold cross-validated accuracy of 79.17% ± 5.14. These findings further support the selection of EfficientNetB0 as the primary CNN model for subsequent integration into the hybrid inspection framework.

The contrast between single-run and cross-validated results is most evident for MobileNetV2, which matched EfficientNetB0 in single-run test accuracy yet varied far more from fold to fold. These results suggest that, under small-sample conditions, 5-fold cross-validation provides a more reliable basis for model selection than single-run evaluation alone.

Representative qualitative examples of classification results are shown in Figure 17. The visual comparison indicates that EfficientNetB0 produced more consistent predictions under varying illumination and reflective surface conditions, correctly identifying contaminated samples that were occasionally misclassified by the other models. These qualitative observations are consistent with the quantitative trends reported in Table 4 and Table 5.

Overall, the classification results indicate that EfficientNetB0 provides the most suitable CNN architecture for image-level VCMA dust classification in the present study. Its favorable balance between predictive performance, classification consistency, and validation stability makes it well suited for subsequent use in the proposed hybrid inspection framework. To further examine whether the CNN models relied on physically meaningful dust regions during prediction, Grad-CAM visualization was subsequently applied, as presented in the next subsection.

Evaluation on Extended Dataset

To further assess the generalizability of the proposed CNN-based classification framework, EfficientNetB0 was additionally evaluated on an extended dataset comprising 123 original images acquired from the same VCMA inspection environment. The extended dataset was augmented to 388 training images using the same augmentation pipeline described in Section 2.3, and model training followed the same hyperparameter configuration as the original study.

The evaluation on the extended dataset yielded a test accuracy of 76.00%, F1-score of 0.432, and AUC of 0.605 (Table 6), which are notably lower than the results obtained on the original 47-image dataset. Inspection of the classification report revealed that the model predicted all test samples as NG, resulting in a recall of 1.00 for the NG class but a recall of 0.00 for the Good class, since no sample was predicted as Good. The reported accuracy therefore simply reflects the severe class imbalance in the test subset (NG: 19, Good: 6), and the result suggests that the model did not successfully generalize across the two datasets. It should be noted that while the original dataset was collected from a single fixed inspection position to ensure imaging consistency, the extended dataset was collected from multiple different positions across the VCMA surface in order to capture additional NG samples that were not present at the original position. This difference in acquisition strategy inevitably introduced substantial variation in surface geometry, illumination conditions, and background texture, making direct model transfer between the two datasets challenging.
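The reported accuracy and F1-score can be reproduced from the test-subset composition alone, as the following worked example shows. It assumes that the reported F1-score is macro-averaged over the two classes, which the text does not state explicitly, and the helper function is illustrative.

```python
def f1_for_class(y_true, y_pred, cls):
    """Per-class F1 from label lists; 0.0 when the class has no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


y_true = ["NG"] * 19 + ["Good"] * 6   # test-subset composition from the text
y_pred = ["NG"] * 25                  # every sample predicted as NG

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1_ng = f1_for_class(y_true, y_pred, "NG")
f1_good = f1_for_class(y_true, y_pred, "Good")
macro_f1 = (f1_ng + f1_good) / 2      # assumed macro averaging
```

Under this assumption, the accuracy is 19/25 = 0.76 and the macro-averaged F1-score is approximately 0.432, matching the reported values. Both figures are therefore what a classifier that always outputs NG would obtain on this test subset.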

The performance degradation observed on the extended dataset is primarily attributable to distribution shift between the original and extended image sets, where differences in surface texture, illumination conditions, and dust morphology between the two acquisition batches reduced the transferability of learned features. These findings highlight that dataset consistency and controlled acquisition conditions are critical prerequisites for reliable model generalization in microscopic dust inspection. Addressing distribution shift through domain adaptation or unified data collection protocols is identified as a priority direction for future work.

To further illustrate the visual differences between the two datasets, representative microscopic images from the original dataset and the extended dataset are shown in Figure 18.

As shown in Figure 18, the two datasets differ substantially in surface geometry, illumination conditions, and component orientation, which explains the observed degradation in model performance when transferring across acquisition batches.

3.3. Visualization Analysis Using Grad-CAM

To enhance the interpretability of the CNN-based VCMA defect classification models, Gradient-weighted Class Activation Mapping (Grad-CAM) was employed to visualize the image regions that contributed most strongly to model predictions. Grad-CAM provides class-discriminative localization maps by projecting the gradients of the target class onto the final convolutional feature maps, thereby enabling qualitative assessment of model attention and decision-making behavior. For all models, Grad-CAM heatmaps were generated from the final convolutional layer of each respective architecture (conv5_block3_out for ResNet50, block7a_project_conv for EfficientNetB0, and Conv_1 for MobileNetV2) and overlaid on the input images using a jet colormap (opacity 0.5), with activation values normalized to [0, 1] via min-max scaling, as described in Section 2.6. Representative Grad-CAM visualizations are shown in Figure 19.

As shown in Figure 19a, the original VCMA image contains a localized dust-contaminated region on the reflective metallic surface, providing a reference for interpreting the corresponding attention maps generated by the three CNN models. Among them, Figure 19c indicates that EfficientNetB0 produced the most spatially concentrated activation pattern, with the highlighted region closely aligned with the actual dust-contaminated area. This behavior suggests that EfficientNetB0 relied more consistently on defect-relevant local features rather than on surrounding textures or reflective artifacts. These observations are qualitatively consistent with the superior classification performance of EfficientNetB0 reported in Section 3.2.

In contrast, Figure 19b shows that ResNet50 occasionally activated peripheral or non-defective regions in addition to the true dust area, particularly under the influence of strong surface reflections. These broader and less localized activation patterns suggest higher sensitivity to illumination-induced variation, which may have contributed to less robust classification behavior. Similarly, Figure 19d demonstrates that MobileNetV2 produced comparatively more diffuse activation maps and, in some cases, appeared to emphasize general surface characteristics rather than the fine dust cluster itself. This behavior is consistent with its lower precision and suggests reduced sensitivity to subtle defect morphology under challenging imaging conditions.

To complement the qualitative interpretation, the Grad-CAM outputs were further reviewed across all 29 test images, and the attention patterns were categorized according to whether the highlighted regions were aligned with visible dust, concentrated on background or reflection artifacts, or spatially diffuse. The corresponding semi-quantitative summary is presented in Table 7.

As shown in Table 7, EfficientNetB0 achieved the highest number of correctly localized attention maps, with 22 of 29 cases aligned with physically meaningful dust regions. In contrast, ResNet50 showed more misaligned attention patterns, whereas MobileNetV2 more frequently produced diffuse activations. Taken together, the qualitative and semi-quantitative Grad-CAM analyses indicate that EfficientNetB0 not only achieved strong classification performance but also relied more consistently on defect-relevant visual evidence. This strengthens confidence in its suitability for deployment in explainable industrial inspection workflows.

Overall, the Grad-CAM visualizations provide qualitative support for the classification results and indicate that EfficientNetB0 more consistently attends to physically meaningful dust regions on the VCMA surface. The comparison across Figure 19b–d highlights clear differences in model attention behavior and strengthens confidence in the interpretability of the selected CNN model for industrial inspection applications.

3.4. Performance of the Sequential Hybrid Inspection Framework

The sequential hybrid inspection framework was evaluated to assess system-level performance for microscopic dust inspection on VCMA components. In this pipeline, EfficientNetB0 was used as the initial image-level screening model to distinguish between defect-free and contaminated samples, and only images predicted as Not Good (NG) were subsequently forwarded to the YOLOv5 detector for localized dust inspection. This sequential design was intended to combine the strong image-level discrimination capability of the CNN classifier with the spatial localization capability of the object detector.

To isolate the structural contribution of the hybrid framework, the same dataset and pre-trained models used in the standalone CNN and YOLO evaluations were retained without additional retraining. System-level performance was then assessed by comparing the standalone and hybrid configurations in terms of image-level decision quality, false-positive and false-negative behavior, and the proportion of images forwarded to the detection stage.

As summarized in Table 8, the hybrid framework reduced the proportion of images entering the localization stage by forwarding only CNN-screened NG samples to YOLOv5. This reduced detector workload from 100.00% of images in the standalone YOLOv5 pipeline to 57.45% in the hybrid configuration. Under the present experimental setting, the hybrid strategy achieved an overall Precision of 1.00, Recall of 0.900, and F1-score of 0.947, while reducing the false-positive rate from 24% in the standalone YOLOv5 pipeline to 0%. These results indicate that the sequential design can substantially suppress false alarms while preserving high image-level classification reliability and selective localization capability.

As shown in Figure 20, the proposed hybrid framework reduced the proportion of images forwarded to the YOLOv5 localization stage by restricting detection to CNN-screened NG samples. Compared with the standalone YOLOv5 pipeline, which processes all input images, the proportion of images entering the localization stage decreased from 100.00% to 57.45%. This reduction supports the practical relevance of the hybrid design by limiting unnecessary localization operations on apparently defect-free VCMA samples.

To further examine the limitations of the hybrid framework, the major failure modes of the sequential pipeline were analyzed and are summarized in Table 9.

Table 9 provides further insight into the limitations of the sequential design by identifying the major failure modes of the hybrid pipeline. In particular, three contaminated samples were incorrectly classified as Good during the initial CNN screening stage and were therefore not forwarded to YOLOv5 for localization. In addition, one NG sample was correctly screened but still missed during the localization stage. No Good samples were unnecessarily forwarded to the detector, and no false-positive YOLO detections were observed after screening. These results indicate that, although the hybrid strategy substantially reduces false detections, its overall performance remains dependent on the reliability of both the screening and localization stages.

The three NG samples missed at the CNN screening stage contained dust particles with low contrast, weak visual saliency, or partial occlusion by surface reflections, reducing the discriminability of image-level features relative to the dominant background texture. The single NG sample missed by YOLOv5 after screening exhibited ambiguous dust morphology and low bounding-box confidence, likely due to boundary overlap with the reflective metallic surface. These observations suggest that future improvements should target hard-negative mining and threshold tuning at the screening stage, and small-object detection refinement at the localization stage.

It should be noted that the sequential design introduces a structural vulnerability: any NG sample misclassified as Good during the CNN screening stage will not reach the detector and will remain undetected. In the present study, this corresponds to a false-negative rate (FNR) of 14.3% for standalone EfficientNetB0 and 10.0% for the hybrid framework (Table 8). The residual FNR remains a meaningful safety concern for mass production, and future work should explore confidence-aware routing and adaptive threshold adjustment to further reduce missed detections.

Figure 21 presents representative outputs of the proposed hybrid framework. As shown in Figure 21a,c, contaminated VCMA samples were first identified by the CNN screening stage as NG. The corresponding images were then forwarded to the YOLOv5 detector, which localized the dust-contaminated regions with bounding boxes, as illustrated in Figure 21b,d. These examples demonstrate the intended coarse-to-fine inspection behavior of the hybrid system, in which global image-level discrimination is followed by localized defect analysis only when needed.

Overall, the sequential hybrid inspection framework combines the complementary strengths of CNN-based classification and YOLO-based localization in a unified inspection strategy. The quantitative and qualitative results suggest that this approach is promising for practical VCMA inspection, particularly in scenarios where false-alarm control and selective localization are more important than applying object detection to every image. At the same time, the failure mode analysis indicates that the robustness of the pipeline can be further improved through enhanced screening sensitivity, detector refinement, or joint optimization of the two stages.

To further characterize the spatial distribution of contamination identified by the hybrid framework, centroid positions of the detected dust instances were analyzed across a normalized 3 × 3 zone grid, as illustrated in Figure 22. The results indicate that dust particles were predominantly concentrated in the left-center region of the VCMA surface, with the Mid-Left zone accounting for the highest share of detected particles (30%), followed by the Top-Left (20%) and Center (20%) zones. The Top-Right, Bottom-Left, and Bottom-Right regions showed no detectable contamination. This uneven distribution may reflect preferential dust accumulation related to airflow patterns or component handling during manufacturing. These spatial findings provide additional context for the failure mode analysis reported in Table 9 and may inform targeted inspection strategies in future work.
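A zone analysis of this kind can be sketched in a few lines. The example below assumes that bounding-box centroids are already normalized to [0, 1] with the origin at the top-left corner of the image, which is a common image-coordinate convention but is not stated in the text; the zone names for cells not mentioned above (for example, Top-Center) and the demo centroids are hypothetical.

```python
from collections import Counter

ROWS = ("Top", "Mid", "Bottom")
COLS = ("Left", "Center", "Right")

def zone_of(cx, cy):
    """Map a normalized centroid (cx, cy in [0, 1]) to a 3 x 3 zone name."""
    col = min(int(cx * 3), 2)  # clamp so cx == 1.0 falls in the last column
    row = min(int(cy * 3), 2)  # clamp so cy == 1.0 falls in the last row
    if row == 1 and col == 1:
        return "Center"
    return f"{ROWS[row]}-{COLS[col]}"

def zone_shares(centroids):
    """Return the fraction of centroids that fall in each occupied zone."""
    counts = Counter(zone_of(cx, cy) for cx, cy in centroids)
    total = sum(counts.values())
    return {zone: n / total for zone, n in counts.items()}

# Hypothetical centroids for illustration only
demo = [(0.10, 0.50), (0.20, 0.45), (0.15, 0.10), (0.50, 0.50)]
print(zone_shares(demo))
```

Zones with no detections are simply absent from the returned dictionary, which corresponds to the "no detectable contamination" cells in a grid of this type.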

3.5. Comparative Discussion and Industrial Implications

The comparative results demonstrate that YOLO-based object detection and CNN-based image classification provide complementary strengths for microscopic dust inspection on VCMA components, rather than representing competing alternatives. Each model family addresses a different operational requirement within the inspection workflow. YOLO-based detectors provide explicit spatial localization of dust particles, which is valuable for defect visualization, process feedback, and localized inspection analysis. In contrast, CNN-based classifiers provide more stable image-level discrimination, making them suitable for rapid screening and decision support in scenarios where precise localization is not always required. To summarize the practical implications of the comparative results, a deployment-oriented comparison of the evaluated inspection approaches is provided in Table 10. The categorizations in Table 10 are based on the quantitative results and qualitative observations reported in Sections 3.1–3.4.

As shown in Table 10, the evaluated approaches differ not only in predictive behavior but also in their practical suitability for industrial inspection tasks. YOLO-based models are advantageous when explicit defect localization is required, whereas CNN-based models are more suitable for stable image-level screening. The proposed hybrid framework combines these complementary strengths and therefore provides the most balanced solution for deployment-oriented VCMA dust inspection.

Among the evaluated object detectors, YOLOv8 achieved the strongest overall aggregate detection performance in terms of Precision, F1-score, and mAP@0.5. However, YOLOv5 was selected as the preferred localization model for subsequent integration into the hybrid framework because it produced the lowest number of false positives while maintaining competitive Recall and the same mAP@0.5:0.95 as YOLOv8 under the present experimental setting. This distinction is practically important because reflective metallic surfaces and textured microscopic backgrounds can easily trigger spurious detections. In such an environment, a detector with stronger false-positive control may be more desirable than a detector with slightly stronger aggregate metrics but substantially higher false-alarm cost. YOLOv11 exhibited the weakest overall detection performance among the three evaluated models, likely because its architectural characteristics are less suited to the limited-data and high-background-complexity conditions of the present study.

For image-level classification, EfficientNetB0 emerged as the most reliable CNN architecture. In addition to achieving the strongest overall balance in the single-run classification metrics, it also demonstrated the most stable behavior across the design-of-experiments evaluation and cross-validation analysis. Compared with ResNet50, EfficientNetB0 provided better predictive performance under the present limited-data setting, while compared with MobileNetV2, it showed stronger reliability and lower fold-to-fold variation. The Grad-CAM analysis further reinforced this finding by showing that EfficientNetB0 more consistently focused on physically meaningful dust-contaminated regions rather than reflective artifacts or irrelevant background patterns. Taken together, these results indicate that model selection for industrial microscopic inspection should consider not only predictive accuracy, but also stability and interpretability.

The evaluation of the sequential hybrid framework further highlights the practical value of combining these two model families in a coarse-to-fine inspection strategy. By using EfficientNetB0 as the initial screening stage and YOLOv5 as the subsequent localization stage, the proposed pipeline suppresses false alarms at the system level while preserving the ability to localize contamination when it is most relevant. The framework also reduces the number of images forwarded to the detector, thereby limiting unnecessary localization operations on apparently defect-free samples. This design is particularly suitable for industrial inspection environments in which false-alarm control, selective localization, and interpretable decision-making are more important than applying localization to every image. The hybrid results therefore suggest that integrating classification and detection can improve the practical usability of automated VCMA inspection systems.

From an industrial perspective, the findings of this study have three important implications. First, detector selection should be guided by operational priorities rather than by a single summary metric. In the present application, false-positive suppression was more critical than maximizing sensitivity because unnecessary alarms can interrupt production flow and increase manual verification effort. Second, image-level classifiers can serve as effective front-end screening tools when supported by reliable interpretability analysis, thereby improving trust in deployment. Third, hybrid inspection architectures offer a realistic compromise between throughput and diagnostic capability by reserving localized analysis for images most likely to contain contamination. These considerations are directly relevant to practical VCMA inspection and may also extend to other precision-manufacturing settings involving subtle microscopic defects.

Despite these promising findings, the results should be interpreted in light of several limitations. The dataset size remained relatively limited, and the experiments were conducted under controlled imaging conditions designed to reflect a practical but still bounded inspection environment. As a result, the reported performance may not fully capture the range of variation encountered across broader production drift, defect morphology diversity, and illumination change. In addition, the sequential hybrid design depends on the reliability of both the screening and localization stages; defects misclassified as Good during the initial CNN stage will not be forwarded to the detector for subsequent localization. The failure mode analysis further indicates that missed defects at the screening stage remain a non-negligible risk in the present pipeline. Future work should therefore consider larger and more diverse datasets, more extensive system-level validation, and possibly joint optimization or confidence-aware interaction between the classification and detection stages.

Overall, the comparative analysis indicates that EfficientNetB0 and YOLOv5 form the most suitable model combination for the present VCMA dust inspection task, balancing classification reliability, localization practicality, and explainability. The proposed sequential hybrid framework is therefore a promising direction for automated microscopic contamination inspection in HDD manufacturing and related precision assembly applications.

4. Discussion

This study investigated microscopic dust inspection on VCMA components from three complementary perspectives: object detection for dust localization, image-level classification for Good/NG screening, and a sequential hybrid framework integrating both capabilities. The findings indicate that model suitability in this application cannot be judged by a single metric alone but must instead be interpreted in relation to reflective microscopic backgrounds, limited-data conditions, false-alarm cost, and the operational priorities of industrial inspection.

4.1. Summary of Key Findings

For object detection, the comparative results showed that the evaluated YOLO models exhibited different trade-offs between aggregate detection performance and false-positive control. YOLOv8 achieved the strongest overall aggregate detection metrics, whereas YOLOv5 was selected as the preferred detector for subsequent hybrid integration because it produced fewer false positives under reflective and textured microscopic backgrounds. This distinction is important in VCMA inspection, where unnecessary alarms may increase manual verification effort and interrupt production flow. YOLOv11, in contrast, showed the weakest performance among the three evaluated models, likely because its architectural characteristics are less suited to the limited-data and high-background-complexity conditions of the present study, which may have reduced convergence stability and detection sensitivity for small dust targets. These findings suggest that, under the present limited-data industrial setting, deployment-oriented detector selection should prioritize operational robustness alongside conventional performance metrics.

For image-level classification, EfficientNetB0 emerged as the most reliable CNN architecture. Its advantage was not limited to strong classification performance but also extended to cross-validation stability and interpretability. Compared with ResNet50, EfficientNetB0 provided better predictive performance under the present limited-data condition, while compared with MobileNetV2, it showed lower fold-to-fold variation despite MobileNetV2 achieving a higher AUC. The Grad-CAM results further strengthened this conclusion by showing that EfficientNetB0 more consistently attended to physically meaningful dust-contaminated regions rather than reflective or irrelevant background structures. These observations indicate that robust deployment-oriented model selection should consider not only predictive accuracy, but also stability and interpretability.
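The notion of fold-to-fold variation used above can be made concrete by summarizing per-fold scores with a mean, a sample standard deviation, and a coefficient of variation. The sketch below uses only the Python standard library; the model labels and per-fold accuracies are hypothetical placeholders chosen to show how a model with a higher mean can still be the less stable one, and they are not values from this study.

```python
from statistics import mean, stdev

def fold_summary(scores):
    """Return (mean, sample standard deviation, coefficient of variation) across folds."""
    m = mean(scores)
    s = stdev(scores)
    return m, s, s / m

# Hypothetical per-fold accuracies, NOT values from this study
folds = {
    "model_a": [0.92, 0.93, 0.92, 0.94, 0.93],
    "model_b": [0.97, 0.89, 0.98, 0.88, 0.96],
}

for name, scores in folds.items():
    m, s, cv = fold_summary(scores)
    print(f"{name}: mean={m:.3f} std={s:.3f} cv={cv:.3f}")
```

Reporting the spread alongside the mean is one way to express the stability criterion that the comparison above relies on, in addition to single-run accuracy or AUC.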

Building upon the complementary strengths of YOLO-based localization and CNN-based global discrimination, the sequential hybrid inspection framework showed promising system-level behavior. By filtering defect-free samples before localization, the hybrid pipeline reduced unnecessary detector usage and strongly suppressed false alarms while preserving localized dust inspection for suspected NG samples. At the same time, the failure mode analysis indicates that the reliability of the screening stage remains critical, since contaminated samples incorrectly rejected during the initial CNN stage cannot be recovered during subsequent localization.

4.2. Industrial and Scientific Implications

From an industrial perspective, the present findings emphasize that automated microscopic inspection systems should be designed according to operational priorities rather than a single summary metric. In the current VCMA application, false-positive suppression was especially important because unnecessary alarms may slow production and increase verification burden. This explains why YOLOv5 was favored for hybrid deployment despite YOLOv8 achieving stronger aggregate detection metrics. Likewise, image-level CNN classifiers are valuable not only because they enable rapid screening, but also because they can act as front-end filters that reduce detector workload when incorporated into a hybrid framework. These considerations are directly relevant to VCMA cleanliness inspection and may also extend to other precision-manufacturing settings involving subtle microscopic contaminants or low-contrast defects.

Scientifically, this study reinforces several important insights. First, model ranking in industrial micro-defect inspection is strongly influenced by dataset scale, visual complexity, and deployment priorities; newer architectures do not necessarily provide the most suitable practical solution under limited-data conditions. Second, explainability is not merely an auxiliary feature, but an important validation tool when the visual distinction between clean and contaminated samples is subtle. Third, the results suggest that structurally integrating image-level discrimination with selective localization offers a promising direction for inspection systems that require both efficiency and diagnostic transparency.

4.3. Industrial Deployment Considerations and Comparison with Alternative Inspection Approaches

The VCMA samples used in this study were collected directly from the VCMA inspection station on an actual HDD manufacturing production line, as described in Section 2.1. Image acquisition was performed under the same physical conditions present during routine production operations, including reflective metallic surfaces, controlled LED ring illumination, and fixture-assisted positioning. The experimental dataset therefore reflects the visual complexity and operational constraints encountered in real-world industrial inspection rather than idealized laboratory conditions. These properties indicate that the proposed framework was developed and evaluated under imaging conditions representative of actual HDD production practice.

Although the proposed framework was evaluated offline rather than directly integrated into an in-line production system, the imaging setup, fixture-assisted positioning, and controlled acquisition procedure were intentionally designed to emulate the realistic inspection conditions of HDD manufacturing environments. The proposed framework is intended as a foundational model to be integrated into a dedicated inspection machine for deployment in actual VCMA production lines in future work. The use of lightweight architectures such as EfficientNetB0 and YOLOv5s supports the feasibility of future real-time deployment due to their favorable balance between detection performance and computational efficiency. In the present study, the hybrid framework achieved an average inference time of 11.18 ms per image, which is favorable for time-sensitive industrial inspection scenarios.
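One simple way to reason about the average latency of a screen-then-localize pipeline is to treat it as the screening time plus the detector time weighted by the fraction of images forwarded. The sketch below assumes this additive model and ignores batching, I/O, and pre-processing; the function names and the stage timings in the first example are hypothetical and are not measurements from this study. The final line only converts the reported 11.18 ms average into an equivalent single-stream throughput.

```python
def expected_latency_ms(t_screen_ms, t_detect_ms, forward_rate):
    """Mean per-image latency when every image is screened and only a
    fraction (forward_rate) is forwarded to the detector."""
    if not 0.0 <= forward_rate <= 1.0:
        raise ValueError("forward_rate must lie in [0, 1]")
    return t_screen_ms + forward_rate * t_detect_ms

def throughput_fps(latency_ms):
    """Convert a per-image latency in milliseconds to images per second."""
    return 1000.0 / latency_ms

# Hypothetical stage timings (assumptions for illustration only)
print(expected_latency_ms(5.0, 12.0, 0.5))   # 11.0

# Single-stream throughput implied by the reported 11.18 ms average
print(round(throughput_fps(11.18), 1))       # 89.4
```

Under this model, the benefit of screening grows as the share of Good images increases, because fewer images incur the detector cost.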

To contextualize the proposed hybrid framework within the broader landscape of inspection strategies, a conceptual comparison with alternative approaches is provided in Table 11. Conventional methods rely on fixed geometric templates, color thresholds, and pixel comparison algorithms that require manual parameter tuning by domain experts [4]. These methods are highly sensitive to illumination variation, surface reflection, and focus inconsistency, making them difficult to generalize for microscopic VCMA inspection where reflective metallic surfaces and sub-100 μm dust particles present persistent challenges [52,53]. In contrast, deep learning-based approaches can learn discriminative features directly from data without requiring explicit rule design, offering improved robustness under complex and variable imaging conditions [4,52,53].

4.4. Limitations and Future Work

The present findings should be interpreted in light of several limitations. First, the dataset remained relatively limited, particularly for the detection task, which may restrict generalizability across broader production conditions and increase sensitivity to data partition effects. Second, the experiments were conducted under a controlled imaging setup designed to reflect representative industrial conditions, but not necessarily the full range of process drift, contamination morphology variation, and illumination changes that may occur in long-term manufacturing use. Although brightness adjustment within ±20% was incorporated into the training augmentation pipeline, no systematic sensitivity analysis was performed to evaluate model performance under deliberate variations in image brightness, contrast, blur, or lens contamination. The degree to which the reported accuracy degrades under such real-world disturbances therefore remains an open question. Third, the hybrid framework was implemented as a sequential combination of independently evaluated modules rather than as a jointly optimized end-to-end system. As a result, stage-to-stage error propagation remains a meaningful constraint, especially when contamination is missed during the initial screening stage. These limitations suggest that the current study should be interpreted as a strong proof of feasibility rather than a final generalized deployment validation.

Future research may extend the proposed framework through larger and more diverse datasets, more extensive system-level validation under production drift, and tighter integration between screening and localization stages. Promising directions include confidence-guided routing, adaptive thresholds, multi-defect inspection, and real-time deployment optimization using hardware acceleration or robotic inspection platforms. Systematic environmental robustness evaluation, including controlled perturbations of brightness, contrast, blur, and lens contamination, is also identified as a priority direction, with brightness normalization, adaptive histogram equalization, and defocus-aware filtering as candidate mitigation strategies. Furthermore, the integration of specialized tiny object detection (TOD) strategies, such as SAHI (Slicing Aided Hyper Inference), represents a promising direction for improving detection sensitivity on microscopic dust particles. Applying slicing-based inference or scale-enhanced architectures in future work may further improve localization performance under the limited-data and high-magnification imaging conditions characteristic of VCMA inspection. Continued incorporation of explainable AI techniques may further improve transparency, operator trust, and fault diagnosis capability in automated microscopic inspection systems.
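The core idea behind slicing-based inference can be sketched without reference to any particular library: the image is divided into overlapping tiles, a detector is run on each tile, and tile-local boxes are shifted back into full-image coordinates. The sketch below covers only the tiling and coordinate mapping; it does not use the SAHI package's API, it omits the merging of duplicate detections in overlap regions, and the function names, tile size, and overlap ratio are illustrative assumptions.

```python
def slice_windows(width, height, tile, overlap):
    """Yield (x0, y0, x1, y1) tiles covering a width x height image
    with a fractional overlap between neighbouring tiles."""
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must lie in [0, 1)")
    stride = max(1, int(tile * (1.0 - overlap)))

    def starts(size):
        if size <= tile:
            return [0]
        pos = list(range(0, size - tile, stride))
        pos.append(size - tile)  # final tile is flush with the image edge
        return pos

    for y0 in starts(height):
        for x0 in starts(width):
            yield (x0, y0, min(x0 + tile, width), min(y0 + tile, height))

def to_full_image(box, x0, y0):
    """Shift a tile-local box (x_min, y_min, x_max, y_max) into full-image coordinates."""
    bx0, by0, bx1, by1 = box
    return (bx0 + x0, by0 + y0, bx1 + x0, by1 + y0)

# Example: a 1024 x 1024 image, 512-pixel tiles, 25% overlap (assumed values)
windows = list(slice_windows(1024, 1024, 512, 0.25))
print(len(windows), windows[0], windows[-1])
```

Because each tile is processed at a higher effective resolution relative to the particle size, small objects occupy more pixels per inference pass, which is the motivation for considering this strategy for microscopic dust.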

Overall, this study indicates that EfficientNetB0 and YOLOv5 form the most suitable model combination for the present VCMA dust inspection task, balancing classification reliability, localization practicality, and explainability. The proposed sequential hybrid framework therefore represents a promising direction for automated microscopic contamination inspection in HDD manufacturing and related precision assembly applications.

5. Conclusions

This study investigated deep-learning-based inspection methods for detecting and classifying microscopic dust contamination on VCMA components and further evaluated a sequential hybrid inspection framework integrating CNN-based image classification with YOLO-based object detection. The results demonstrate that both standalone models and the hybrid strategy are promising for industrial microscopic inspection under limited-data conditions.

For object detection, the evaluated YOLO models exhibited different trade-offs between aggregate detection performance and false-positive control. Although YOLOv8 achieved the strongest overall detection metrics, YOLOv5 was selected as the preferred detector for hybrid integration because it produced fewer false positives under reflective and textured microscopic backgrounds while maintaining competitive localization performance. YOLOv11 showed the weakest detection performance under the present limited-data setting, likely reflecting optimization instability under the constrained sample scale and background complexity of VCMA microscopic inspection. In the present experimental setting, YOLOv5 achieved an inference time of 12.9 ms per image, with a precision of 0.75 and mAP@0.5 of 0.62, supporting its suitability for real-time or in-line inspection scenarios.

For image-level classification, EfficientNetB0 emerged as the most reliable CNN architecture. It provided strong classification performance together with better validation stability and more physically meaningful Grad-CAM attention patterns than ResNet50 and MobileNetV2. These findings indicate that EfficientNetB0 offers a favorable balance between predictive accuracy, robustness, and interpretability for Good/NG classification of VCMA images.

By structurally integrating global image-level screening with localized defect detection, the proposed hybrid inspection framework reduced detector workload and suppressed false alarms while preserving selective dust localization for suspected contaminated samples. This coarse-to-fine strategy reflects practical industrial deployment requirements in which efficiency, reliability, and interpretability are equally important.

Overall, this work highlights the complementary strengths of detection and classification models and shows that their integration within a sequential hybrid framework provides a practical direction for microscopic defect inspection in precision manufacturing. The findings contribute useful guidance for intelligent quality inspection system design and provide a foundation for future developments toward adaptive, robust, and high-throughput industrial inspection platforms.

Author Contributions

Conceptualization, V.P. and A.C.; methodology, K.C. and A.C.; software and model development, K.C., W.P. and V.P.; investigation, K.C. and K.K.; data curation, K.C. and K.K.; formal analysis, V.P., K.C., W.B. and A.C.; resources, V.P. and W.P.; project administration, K.C. and W.B.; writing (original draft preparation), V.P., K.C. and A.C.; writing (review and editing), V.P., W.B., W.P. and A.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by (i) Suranaree University of Technology (SUT), (ii) Thailand Science Research and Innovation (TSRI), and (iii) the National Science, Research and Innovation Fund (NSRF), NRIIS number 215648.

Data Availability Statement

The data presented in this study are not publicly available due to confidentiality restrictions from the industrial partner. The data may be available from the corresponding authors upon reasonable request.

Acknowledgments

This work was supported by (i) Suranaree University of Technology (SUT), (ii) Thailand Science Research and Innovation (TSRI), and (iii) the National Science, Research and Innovation Fund (NSRF). Generative AI tools were used during manuscript preparation for language refinement, structural editing, and the creation of the graphic abstract. The authors reviewed, revised, and approved all AI-assisted outputs and take full responsibility for the final content.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

VCMA	Voice Coil Motor Assembly
HDD	Hard Disk Drive
AI	Artificial Intelligence
CNN	Convolutional Neural Network
YOLO	You Only Look Once
ROI	Region of Interest
mAP	Mean Average Precision
TP	True Positive
FP	False Positive
FN	False Negative
IoU	Intersection over Union
FPS	Frames per Second
XAI	Explainable Artificial Intelligence
Grad-CAM	Gradient-weighted Class Activation Mapping

References

Mordor Intelligence. Hard Disk Drive (HDD) Market Size & Share Analysis—Growth Trends and Forecast (2025–2030). 2024. Available online: https://www.mordorintelligence.com/industry-reports/hard-disk-drive-market (accessed on 11 September 2025).
Thongsri, J. A Problem of Particulate Contamination in an Automated Assembly Machine Successfully Solved by CFD and Simple Experiments. Math. Probl. Eng. 2017, 2017, 6859852.
Lee, S.-J. Few-Shot Adaptation of Foundation Vision Models for PCB Defect Inspection. J. Imaging 2025, 11, 415.
Czimmermann, T.; Ciuti, G.; Milazzo, M.; Chiurazzi, M.; Roccella, S.; Oddo, C.M.; Dario, P. Visual-Based Defect Detection and Classification Approaches for Industrial Applications—A Survey. Sensors 2020, 20, 1459.
Weihua, Y. A Survey of Surface Defect Detection Based on Deep Learning. In Proceedings of the 2022 7th International Conference on Modern Management and Education Technology (MMET 2022), Qingdao, China, 14–16 October 2022; Atlantis Press: Amsterdam, The Netherlands, 2022.
Hütten, N.; Tercan, H.; Schneider, D.; Meisen, T. Deep Learning for Automated Visual Inspection in Manufacturing and Maintenance: A Survey of Open-Access Papers. Appl. Syst. Innov. 2024, 7, 11.
Li, Z.; Zhong, X.; Fu, Y.; Jiang, X.; Shu, T.; Feng, W.; Xu, D. A Survey of Deep Learning for Industrial Visual Anomaly Detection. Artif. Intell. Rev. 2025, 58, 279.
He, Y.; Song, K.; Meng, Q.; Yan, Y. A Survey on Surface Defect Inspection Based on Generative Models in Manufacturing. Appl. Sci. 2024, 14, 6774.
LeCun, Y.; Bengio, Y.; Hinton, G. Deep Learning. Nature 2015, 521, 436–444.
Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1097–1105.
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: New York, NY, USA, 2016; pp. 770–778.
Tan, M.; Le, Q.V. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the 36th International Conference on Machine Learning (ICML), Long Beach, CA, USA, 9–15 June 2019; PMLR: New York, NY, USA, 2019; pp. 6105–6114.
Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861.
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 4–8 May 2021.
Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: New York, NY, USA, 2016; pp. 779–788.
Chen, L.; Shi, W.; Deng, D. Improved YOLOv3 Based on Attention Mechanism for Fast and Accurate Ship Detection in Optical Remote Sensing Images. Remote Sens. 2021, 13, 660.
He, L.; Li, Y.; Wang, X.; Zhou, Y. Research and Application of YOLOv11-Based Object Segmentation in Intelligent Recognition at Construction Sites. Buildings 2024, 14, 3777.
Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934.
Zeng, W.; Liu, H.; Li, Z.; Zhao, X. Deep Learning-Based Object Detection: A Comprehensive Review of YOLO, RCNN, and SSD Series. Electron. Res. Arch. 2026, 34, 2674–2731.
Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; IEEE: New York, NY, USA, 2023; pp. 7464–7475.
Ultralytics. YOLOv5—Ultralytics Documentation. Available online: https://docs.ultralytics.com/models/yolov5/ (accessed on 15 March 2025).
Ultralytics. YOLOv8—Ultralytics Documentation. Available online: https://docs.ultralytics.com/models/yolov8/ (accessed on 15 March 2026).
Ultralytics. YOLOv11—Ultralytics Documentation. Available online: https://docs.ultralytics.com/models/yolo11/ (accessed on 15 March 2026).
Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: New York, NY, USA, 2017; pp. 2117–2125.
Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016; pp. 21–37.
Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. Adv. Neural Inf. Process. Syst. 2015, 28, 91–99.
Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; IEEE: New York, NY, USA, 2017; pp. 2980–2988.
He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; IEEE: New York, NY, USA, 2017; pp. 2961–2969.
Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; IEEE: New York, NY, USA, 2018; pp. 7132–7141.
Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; Springer: Cham, Switzerland, 2018; pp. 3–19.
Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; Torralba, A. Learning Deep Features for Discriminative Localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: New York, NY, USA, 2016; pp. 2921–2929.
Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; IEEE: New York, NY, USA, 2017; pp. 618–626.
Barredo Arrieta, A.; Díaz-Rodríguez, N.; Del Ser, J.; Bennetot, A.; Tabik, S.; Barbado, A.; Garcia, S.; Gil-Lopez, S.; Molina, D.; Benjamins, R.; et al. Explainable Artificial Intelligence (XAI): Concepts, Taxonomies, Opportunities and Challenges toward Responsible AI. Inf. Fusion 2020, 58, 82–115.
Samek, W.; Wiegand, T.; Müller, K.-R. Explainable Artificial Intelligence: Understanding, Visualizing and Interpreting Deep Learning Models. ITU J. ICT Discov. 2018, 1, 39–48.
Shorten, C.; Khoshgoftaar, T.M. A Survey on Image Data Augmentation for Deep Learning. J. Big Data 2019, 6, 60.
Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015; Available online: https://openreview.net/forum?id=8gmWwjFyLj (accessed on 15 March 2026).
Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019; Available online: https://openreview.net/forum?id=Bkg6RiCqY7 (accessed on 15 March 2026).
Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. Adv. Neural Inf. Process. Syst. 2019, 32, 8024–8035.
Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014; Springer: Cham, Switzerland, 2014; pp. 740–755.
Defard, T.; Setkov, A.; Loesch, A.; Audigier, R. PaDiM: A Patch Distribution Modeling Framework for Anomaly Detection and Localization. In Proceedings of the International Conference on Pattern Recognition (ICPR), Virtual, 10–15 January 2021; Springer: Cham, Switzerland, 2021; pp. 475–489.
Roth, K.; Pemula, L.; Zepeda, J.; Schölkopf, B.; Brox, T.; Gehler, P. Towards Total Recall in Industrial Anomaly Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; IEEE: New York, NY, USA, 2022; pp. 14298–14308. [Google Scholar] [CrossRef]
Wang, G.; Han, S.; Ding, E.; Huang, D. Student-Teacher Feature Pyramid Matching for Anomaly Detection. In Proceedings of the 32nd British Machine Vision Conference (BMVC), Online, 22–25 November 2021; Available online: https://www.bmva-archive.org.uk/bmvc/2021/conference/papers/paper_1273.html (accessed on 15 March 2026).
Ruff, L.; Vandermeulen, R.; Görnitz, N.; Deecke, L.; Siddiqui, S.A.; Binder, A.; Müller, E.; Kloft, M. Deep One-Class Classification. In Proceedings of the 35th International Conference on Machine Learning (ICML), Stockholm, Sweden, 10–15 July 2018; PMLR: New York, NY, USA, 2018; pp. 4393–4402. Available online: http://proceedings.mlr.press/v80/ruff18a.html (accessed on 15 March 2026).
Akcay, S.; Atapour-Abarghouei, A.; Breckon, T.P. GANomaly: Semi-Supervised Anomaly Detection via Adversarial Training. In Proceedings of the Asian Conference on Computer Vision (ACCV), Perth, Australia, 2–6 December 2018; Spinger: Cham, Switzerland, 2018; pp. 622–637. [Google Scholar] [CrossRef]
Schlegl, T.; Seeböck, P.; Waldstein, S.M.; Schmidt-Erfurth, U.; Langs, G. f-AnoGAN: Fast Unsupervised Anomaly Detection with Generative Adversarial Networks. Med. Image Anal. 2019, 54, 30–44. [Google Scholar] [CrossRef]
Bergmann, P.; Fauser, M.; Sattlegger, D.; Steger, C. MVTec AD: A Comprehensive Real-World Dataset for Unsupervised Anomaly Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; IEEE: New York, NY, USA, 2019; pp. 9592–9600. [Google Scholar] [CrossRef]
Chousangsuntorn, C.; Tongloy, T.; Chuwongin, S.; Boonsang, S. A Deep Learning System for Recognizing and Recovering Contaminated Slider Serial Numbers in Hard Disk Manufacturing Processes. Sensors 2021, 21, 6261. [Google Scholar] [CrossRef]
Akyon, F.C.; Altinuc, S.O.; Temizel, A. Slicing Aided Hyper Inference and Fine-Tuning for Small Object Detection. In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP), Bordeaux, France, 16–19 October 2022; IEEE: New York, NY, USA, 2022; pp. 966–970. [Google Scholar] [CrossRef]
Farmanesh, A.; Ramírez Sanchis, G.; Ordieres-Meré, J. Comparison of Deep Transfer Learning Against Contrastive Learning in Industrial Quality Applications for Heavily Unbalanced Data Scenarios When Data Augmentation Is Limited. Sensors 2025, 25, 3048. [Google Scholar] [CrossRef]
Hridoy, M.W.; Rahman, M.M.; Sakib, S. A Framework for Industrial Inspection System Using Deep Learning. Ann. Data Sci. 2024, 11, 445–478. [Google Scholar] [CrossRef]
Abedin, T.; Xu, H.; Uddin, S. The Impact of K Selection in K-Fold Cross-Validation on Bias and Variance in Supervised Learning Models. Sci. Rep. 2026, 16, 6084. [Google Scholar] [CrossRef]
Saberironaghi, A.; Ren, J.; El-Gindy, M. Defect detection methods for industrial products using deep learning techniques: A review. Algorithms 2023, 16, 95. [Google Scholar] [CrossRef]
Lv, X.; Duan, F.; Jiang, J.-J.; Fu, X.; Gan, L. Deep metallic surface defect detection: The new benchmark and detection network. Sensors 2020, 20, 1562. [Google Scholar] [CrossRef]

Figure 1. Overview of the HDD structure and actuator mechanism. Annotated schematic highlighting key HDD components, with emphasis on the actuator system and the VCMA.

Figure 2. VCMA components and designated inspection regions. (a) CAD representations highlighting the overall geometry and selected inspection areas, and (b,c) representative microscope images showing typical surface texture, reflectivity, and microscopic dust particles under the experimental imaging setup.

Figure 3. Experimental image acquisition setup for VCMA inspection using a digital microscope and ring-shaped LED illumination.

Figure 4. Effect of the customized fixture on VCMA positioning during image acquisition: (a) top view in the correct orientation, showing proper alignment and illumination; (b) top view in the inverted orientation, showing misalignment and unsuitable illumination; (c) right-side view of the correctly oriented component, showing full visibility of the inspection region; and (d) right-side view in the inverted orientation, showing incomplete visibility and unstable inspection conditions.

Figure 5. Examples of challenges in microscopic dust inspection on VCMA components: (a) surface reflections, (b) image blur, (c) overexposure due to excessive illumination, and (d) distant dust particles.

Figure 6. Representative examples of preprocessing applied to VCMA images prior to model training. (a) ROI cropping. (b) Random rotation (±15°) used for data augmentation.
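The random rotation in panel (b) also affects any bounding-box annotations, because a rotated image needs labels rotated with it. The following is a minimal sketch of that geometry, assuming a uniformly sampled angle, rotation about the image centre, and corner-format boxes; the function names are hypothetical and the authors' actual augmentation tooling is not stated in this section.

```python
import math
import random

MAX_ANGLE_DEG = 15.0  # rotation range stated in the Figure 6 caption


def sample_angle(rng, max_angle=MAX_ANGLE_DEG):
    """Draw an angle in degrees uniformly from [-max_angle, +max_angle].

    Uniform sampling is an assumption made for this sketch.
    """
    return rng.uniform(-max_angle, max_angle)


def rotate_point(x, y, angle_deg, cx, cy):
    """Rotate (x, y) about (cx, cy) with the standard 2D rotation matrix."""
    theta = math.radians(angle_deg)
    dx, dy = x - cx, y - cy
    return (cx + dx * math.cos(theta) - dy * math.sin(theta),
            cy + dx * math.sin(theta) + dy * math.cos(theta))


def rotate_box(box, angle_deg, width, height):
    """Rotate a box (x1, y1, x2, y2) about the image centre.

    Returns the axis-aligned box that encloses the four rotated corners,
    clipped to the image bounds.
    """
    x1, y1, x2, y2 = box
    cx, cy = width / 2.0, height / 2.0
    corners = [rotate_point(x, y, angle_deg, cx, cy)
               for x, y in ((x1, y1), (x2, y1), (x2, y2), (x1, y2))]
    xs = [p[0] for p in corners]
    ys = [p[1] for p in corners]
    return (max(0.0, min(xs)), max(0.0, min(ys)),
            min(float(width), max(xs)), min(float(height), max(ys)))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
angle = sample_angle(rng)
new_box = rotate_box((300, 220, 340, 260), angle, width=640, height=480)
```

The enclosing box grows slightly with the rotation angle, which is one reason small rotation ranges are attractive when the annotated objects are tiny.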

Figure 7. Conceptual YOLO diagram.

Figure 8. Workflow of YOLO-based VCMA defect detection system, including image acquisition, preprocessing, annotation, feature extraction, and defect localization.

Figure 9. Conceptual CNN diagram illustrating image-level classification of VCMA samples into Good and NG categories.

Figure 10. Workflow of the CNN-based VCMA defect classification system, including image acquisition, preprocessing, feature extraction, and image-level classification.

Figure 11. Training and testing pipeline of the CNN-based VCMA defect classification system, illustrating the feature extraction process during training and the image-level defect prediction during inference.

Figure 12. Hybrid inspection workflow combining CNN-based classification with YOLO-based localization for VCMA defect detection.
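The sequential routing in this workflow can be sketched as a simple gate: the classifier screens every image, and only images judged NG are passed to the detector. This is an illustrative outline, not the authors' implementation; `classify`, `detect`, `InspectionResult`, and the 0.5 decision threshold are all assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


@dataclass
class InspectionResult:
    label: str                      # "G" or "NG"
    ng_probability: float           # classifier score for the NG class
    boxes: List[Box] = field(default_factory=list)
    detector_called: bool = False   # True only when the image was forwarded


def inspect(image,
            classify: Callable[[object], float],
            detect: Callable[[object], List[Box]],
            ng_threshold: float = 0.5) -> InspectionResult:
    """Screen with the classifier first; localize only when the image is NG.

    classify(image) is assumed to return the NG probability and
    detect(image) a list of dust bounding boxes. Both are hypothetical
    stand-ins for the trained CNN and YOLO models.
    """
    p_ng = classify(image)
    if p_ng < ng_threshold:
        # Good images skip the detector, which is what reduces its workload.
        return InspectionResult("G", p_ng)
    return InspectionResult("NG", p_ng, boxes=detect(image),
                            detector_called=True)
```

A caller would wrap the trained models in two small functions and pass them in, which keeps the gate itself independent of any particular framework.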

Figure 13. Representative dust localization results of (a) YOLOv5, (b) YOLOv8, and (c) YOLOv11 on VCMA samples, illustrating differences in bounding-box placement and background sensitivity.

Figure 14. Training and validation loss and accuracy curves of the evaluated CNN architectures, including (a,b) ResNet50, (c,d) EfficientNetB0, and (e,f) MobileNetV2, showing their convergence behavior during Good/NG classification training on VCMA images.

Figure 15. Confusion matrices of the evaluated CNN models: (a) ResNet50, (b) EfficientNetB0, and (c) MobileNetV2, showing the distribution of correct and incorrect predictions for Good and Not Good (NG) VCMA samples.

Figure 16. Receiver operating characteristic (ROC) curves of ResNet50, EfficientNetB0, and MobileNetV2 for Good/NG classification of VCMA images.

Figure 17. Representative qualitative classification results generated by (a) ResNet50, (b) EfficientNetB0, and (c) MobileNetV2 for Good/NG prediction on VCMA test images under varying illumination and reflective surface conditions.

Figure 18. Representative microscopic images from (a) the original dataset collected at a fixed position and (b) the extended dataset collected from multiple positions across the VCMA surface.

Figure 19. Grad-CAM visualizations for Good/NG classification: (a) original image, (b) ResNet50, (c) EfficientNetB0, and (d) MobileNetV2, illustrating model-specific attention patterns. Grad-CAM highlights class-discriminative regions by weighting the feature maps of the last convolutional layer with the gradients of the class score. Red regions mark the areas that contribute most to the prediction.
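The gradient weighting described in the caption can be illustrated with a small framework-free sketch. It follows the standard Grad-CAM formulation (channel weights are the spatial mean of the gradients, and the map is the ReLU of the weighted sum of feature maps). The nested-list inputs and the function name are assumptions for illustration; in practice the activations and gradients would be captured from the network during a backward pass.

```python
def grad_cam(activations, gradients):
    """Compute a Grad-CAM heatmap from K feature maps of size H x W.

    activations[k][i][j]: feature map k of the last convolutional layer.
    gradients[k][i][j]:   gradient of the class score w.r.t. that value.
    Returns an H x W map scaled to [0, 1] (all zeros if nothing is positive).
    """
    n_maps = len(activations)
    height = len(activations[0])
    width = len(activations[0][0])

    # Channel weight = global average of the gradients over space.
    weights = [sum(sum(row) for row in gradients[k]) / (height * width)
               for k in range(n_maps)]

    # Weighted sum over channels, then ReLU to keep positive evidence only.
    cam = [[max(0.0, sum(weights[k] * activations[k][i][j]
                         for k in range(n_maps)))
            for j in range(width)]
           for i in range(height)]

    # Normalise by the peak so maps from different images are comparable.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam


# Toy example: channel 0 supports the class, channel 1 opposes it.
acts = [[[1.0, 0.0], [0.0, 0.0]],
        [[0.0, 0.0], [0.0, 1.0]]]
grads = [[[1.0, 1.0], [1.0, 1.0]],
         [[-1.0, -1.0], [-1.0, -1.0]]]
heatmap = grad_cam(acts, grads)
```

In the toy example only the top-left cell stays active, because the ReLU discards the location whose evidence comes from a negatively weighted channel.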

Figure 20. Reduction in detector workload achieved by the proposed hybrid framework, measured by the proportion of images forwarded to the YOLOv5 localization stage.

Figure 21. Representative outputs of the sequential hybrid inspection framework. Panels (a,c) show CNN-based screening results identifying contaminated VCMA samples as NG, while panels (b,d) show YOLOv5-based localization of microscopic dust regions using bounding box predictions.

Figure 22. Spatial distribution analysis of annotated dust particles across the VCMA surface derived from the hybrid inspection framework.

Table 1. Experimental imaging environment configuration.

Parameter	Configuration
Camera type	5 MP digital microscope
Optical magnification	0.8×–10×
Working distance	100 mm
Image resolution	640 × 480 pixels
Illumination	Ring-shaped LED illuminator
Ambient illumination	776–887 lux (measured range)
Fixture	Custom 3D-printed fixture
Capture environment	HDD manufacturing environment
Cleanroom classification	Class 100 (ISO Class 5)
Production line type	Manual inspection line
Surface type	Reflective metallic VCMA

Table 2. Main training parameters used for YOLO-based detection and CNN-based classification models.

Parameter	YOLO	CNN
Input size	640 × 640 pixels	224 × 224 pixels
Epochs	150	20
Batch size	16	32
Optimizer	SGD	Adam
Learning rate	0.001	0.005
Model variants	YOLOv5s/YOLOv8s/YOLOv11s	ResNet50/EfficientNetB0/MobileNetV2
Pretrained weights	COCO (Ultralytics 8.4.37)	ImageNet
Loss function	Objectness + Classification + Box regression	Categorical cross-entropy
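For reproducibility, the settings in Table 2 can be kept in one typed configuration object instead of being scattered across scripts. The sketch below is one possible arrangement under stated assumptions: `TrainConfig` and `to_ultralytics_kwargs` are hypothetical names, and the mapping targets the keyword names commonly passed to the Ultralytics training call (`imgsz`, `epochs`, `batch`, `optimizer`, `lr0`). The dataset path and weight files are deliberately omitted because they are not given here.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class TrainConfig:
    input_size: int        # square input side length in pixels
    epochs: int
    batch_size: int
    optimizer: str
    learning_rate: float
    pretrained: str        # source of the pretrained weights


# Values copied from Table 2.
YOLO_CFG = TrainConfig(640, 150, 16, "SGD", 0.001, "COCO")
CNN_CFG = TrainConfig(224, 20, 32, "Adam", 0.005, "ImageNet")


def to_ultralytics_kwargs(cfg: TrainConfig) -> dict:
    """Translate the config into keyword arguments for a detector run.

    Only the hyperparameters listed in Table 2 are mapped; everything else
    would fall back to the library defaults.
    """
    return {
        "imgsz": cfg.input_size,
        "epochs": cfg.epochs,
        "batch": cfg.batch_size,
        "optimizer": cfg.optimizer,
        "lr0": cfg.learning_rate,
    }


yolo_kwargs = to_ultralytics_kwargs(YOLO_CFG)
cnn_settings = asdict(CNN_CFG)
```

Freezing the dataclass prevents a training script from silently changing a value after the experiment has been logged.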

Table 3. Quantitative comparison of YOLO-based detection models for microscopic dust localization on VCMA components.

YOLO Model	Precision	Recall	F1-Score	mAP@0.5	mAP@0.5:0.95	False Positive (n)
YOLOv5	0.75	0.69	0.72	0.62	0.26	2
YOLOv8	0.81	0.67	0.73	0.66	0.26	7
YOLOv11	0.68	0.38	0.49	0.34	0.19	3
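The F1-scores in Table 3 follow from the listed precision and recall through the harmonic mean, F1 = 2PR / (P + R). A short check, using only the values in the table (the helper name is hypothetical), is sketched below.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as listed in Table 3.
TABLE_3 = {
    "YOLOv5": (0.75, 0.69, 0.72),
    "YOLOv8": (0.81, 0.67, 0.73),
    "YOLOv11": (0.68, 0.38, 0.49),
}

recomputed = {name: round(f1_score(p, r), 2)
              for name, (p, r, _) in TABLE_3.items()}
```

The recomputed values agree with the reported ones to two decimals, and they make visible how the low recall of YOLOv11 pulls its F1-score down despite a precision close to that of YOLOv5.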

Table 4. Quantitative comparison of CNN-based classification models for Good/NG classification of VCMA images.

CNN Architecture	Precision	Accuracy (%)	Recall	F1-Score	AUC
ResNet50	0.91	82.76	0.90	0.89	0.957
EfficientNetB0	0.91	93.10	0.89	0.90	0.986
MobileNetV2	0.87	93.10	0.90	0.88	1.000

Table 5. Best valid configuration of each CNN model.

Model	Test Accuracy (%)	AUC	Test Loss	F1-Score	5-Fold CV Accuracy (%)	CV Std.	Inference Time (ms)
ResNet50	82.76	0.957	0.371	0.827	79.17	5.14	95.2
EfficientNetB0	93.10	0.986	0.226	0.930	93.24	3.57	88.1
MobileNetV2	93.10	1.000	0.195	0.930	81.01	15.12	91.0

Table 6. Comparison of EfficientNetB0 classification performance between the original dataset (47 images) and the extended dataset (123 images).

Metric	Original Dataset (47 Images)	Extended Dataset (123 Images)
Test Accuracy (%)	93.10	76.00
Precision	0.91	0.38
Recall	0.89	0.50
F1-score	0.93	0.43
AUC	0.98	0.60
5-Fold CV Accuracy (%)	93.24 ± 3.57	87.11 ± 3.57
Inference Time (ms)	88.10	89.58

Table 7. Semi-quantitative assessment of Grad-CAM attention patterns for the evaluated CNN models.

Model	Correctly Localized Attention (n)	Misaligned Attention (n)	Diffuse Attention (n)
ResNet50	14	7	6
EfficientNetB0	22	5	2
MobileNetV2	17	9	3

Table 8. System-level performance comparison of standalone and hybrid inspection frameworks.

Framework	EfficientNetB0 Only	YOLOv5 Only	Hybrid (EfficientNetB0 → YOLOv5)
Accuracy (%)	93.10	-	93.62
Precision	0.91	0.75	1.00
Recall	0.89	0.69	0.90
F1-score	0.93	0.72	0.97
Classification:
False Positive Rate (%)	0	-	0
False Negative Rate (%)	14.30	-	10
Localization:
Images Sent to YOLO (%)	-	100.00	57.45
Speed:
Avg. Inference Time (ms/image)	88.10	12.90	11.18
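Several hybrid figures in Table 8 can be related to one another through a single set of image-level counts. The sketch below assumes a 47-image evaluation with 30 NG and 17 Good images, of which 27 NG images are forwarded, 3 NG images are rejected by the CNN, and no Good image is forwarded. These counts are an inference chosen because they reproduce the reported percentages (and the 3 screening misses in Table 9); they are not stated in the table itself.

```python
def screening_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Image-level metrics for the NG class from confusion-matrix counts."""
    total = tp + fn + fp + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy_pct": 100.0 * (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "false_negative_rate_pct": 100.0 * fn / (tp + fn) if (tp + fn) else 0.0,
        "false_positive_rate_pct": 100.0 * fp / (fp + tn) if (fp + tn) else 0.0,
        # Every image predicted NG is forwarded to the detector.
        "forwarded_pct": 100.0 * (tp + fp) / total,
    }


# ASSUMED counts (not from the source): 30 NG and 17 Good images.
ASSUMED_COUNTS = dict(tp=27, fn=3, fp=0, tn=17)
hybrid = screening_metrics(**ASSUMED_COUNTS)
```

Under these assumed counts the accuracy (93.62%), precision (1.00), recall (0.90), false negative rate (10%), and forwarded share (57.45%) match Table 8. The F1-score works out to about 0.95 rather than the tabulated 0.97, so the assumed counts may not correspond exactly to the authors' evaluation protocol.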

Table 9. Failure mode analysis of the hybrid inspection framework.

Failure Mode	Count
NG rejected by CNN screening	3
NG forwarded to YOLO but missed by detector	1
Good incorrectly forwarded to YOLO	0
YOLO false positive after CNN screening	0

Table 10. Deployment-oriented comparison of the evaluated inspection approaches.

Criterion	YOLOv5	YOLOv8	YOLOv11	ResNet50	EfficientNetB0	MobileNetV2	Hybrid
Localization capability	Yes	Yes	Yes	No	No	No	Yes
False-positive control	High	Low	Medium	–	–	–	High
Sensitivity to subtle dust	Medium	High	Medium	–	–	–	High
Image-level screening suitability	–	–	–	Medium	High	Medium	High
Interpretability support	Medium	Medium	Medium	Medium	High	Medium	High
Deployment suitability	High	Medium	Medium	Medium	High	Medium	High

Table 11. Conceptual comparison between inspection strategies for microscopic VCMA dust inspection.

Method	Localization	Image-Level Decision	Explainability	Reflection Robustness
Conventional [4,52,53]	Limited	Moderate	Low	Low
CNN only	No	High	Moderate	Moderate
YOLO only	Yes	Moderate	Moderate	Sensitive
Proposed Hybrid	Yes	High	High	Improved

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2026 by the authors. Published by MDPI on behalf of the International Institute of Knowledge Innovation and Invention. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.

Share and Cite

MDPI and ACS Style

Phunpeng, V.; Chaiyasin, K.; Khodcharad, K.; Boransan, W.; Patangtalo, W.; Chaimanatsakun, A. Explainable Hybrid Deep Learning for Microscopic Dust Defect Inspection on Voice Coil Motor Assembly Components. Appl. Syst. Innov. 2026, 9, 120. https://doi.org/10.3390/asi9060120

