1. Introduction
Hard disk drives (HDDs) remain essential for large-scale data storage due to their cost-effective, high-capacity storage and continued relevance in data center and cloud infrastructure applications [1]. As HDD components are manufactured with increasingly tight mechanical tolerances, particulate cleanliness becomes a critical determinant of assembly yield and long-term reliability. In particular, microscopic dust contamination on Voice Coil Motor Assembly (VCMA) components can induce positioning errors, unstable actuator motion, and subsequent performance degradation. Reliable inspection of such contamination is therefore a key requirement in HDD manufacturing.
Although VCMA assembly is conducted under controlled cleanroom conditions, microscopic particulate contamination can still arise during component handling, transfer between workstations, and fixture contact, all stages where full hermetic isolation is impractical [2]. Existing quality control practices in HDD manufacturing rely primarily on manual optical inspection or conventional automated optical inspection systems, which are generally calibrated for macro-scale defects and lack sensitivity to the sub-100 μm dust particles characteristic of VCMA contamination [3,4]. These limitations motivate the development of automated deep learning-based inspection capable of detecting microscopic dust under realistic industrial imaging conditions.
Automated inspection of microscopic dust defects is challenging because target particles are extremely small, morphologically irregular, and often embedded in reflective or textured microscopic backgrounds. Contemporary surveys on deep learning for automated visual inspection report that learning-based approaches offer improved robustness and adaptability, yet they also highlight persistent deployment barriers such as limited defect samples, dataset bias, and distribution shift [5,6,7]. Broader surveys on deep-learning-based surface defect detection likewise stress that data quality, scale variation, and background complexity remain central obstacles for robust industrial deployment [8].
Deep learning has become the dominant paradigm for representation learning in computer vision, enabling models to learn hierarchical features directly from data [9,10]. In industrial inspection, convolutional neural networks (CNNs) are frequently used for image-level classification tasks because they can capture subtle texture cues and discriminative patterns. Residual networks have become standard backbones in many recognition pipelines [11], while EfficientNet provides systematic scaling rules and strong accuracy-efficiency trade-offs suitable for industrial applications constrained by inference time [12]. MobileNets are widely used when computational budgets are tight, enabling deployment in resource-limited environments [13]. More recently, vision transformers have expanded the design space for visual recognition, although CNNs remain common in industrial inspection due to their maturity and practical efficiency [14].
In industrial inspection, two problem formulations are central to practical deployment: image-level classification for rapid screening and object detection for defect localization. Among real-time detection frameworks, the YOLO family of one-stage detectors has been widely adopted due to its end-to-end design and favorable speed-accuracy trade-off [15,16,17]. The YOLO lineage has evolved through improvements in feature extraction, multiscale feature fusion, and training strategies. YOLOv4 introduced practical architectural and training refinements to improve real-time detection [18,19], while YOLOv7 further advanced performance through trainable bag-of-freebies strategies [20]. The Ultralytics ecosystem provides widely used implementations and documentation for YOLOv5, YOLOv8, and YOLOv11, enabling consistent and reproducible workflows for training, validation, and deployment of YOLO-based detectors [21,22,23].
Microscopic dust detection introduces specific difficulties that challenge general-purpose detectors. Dust particles are often a small fraction of the image area, reflective metallic surfaces can generate false positives, and scale variation requires fine-grained multiscale feature maps. Feature Pyramid Networks (FPN) are a foundational approach for multiscale feature representation and have influenced many modern detectors [24]. Beyond YOLO, alternative detection families including SSD [25], Faster R-CNN [26], RetinaNet [27], and Mask R-CNN [28] provide useful reference points for understanding trade-offs between accuracy, speed, and localization granularity.
Classification models raise interpretability concerns because a correct label does not guarantee the model attended to the true defect region. Attention mechanisms such as Squeeze-and-Excitation (SE) blocks [29] and the Convolutional Block Attention Module (CBAM) [30] can improve representational sensitivity. Class Activation Mapping (CAM) introduced a method to visualize discriminative regions using global average pooling [31], and Grad-CAM generalized this to broader architectures using gradient-based heatmaps [32]. Broader XAI literature emphasizes that interpretability supports accountability, debugging, and responsible deployment in high-stakes decision contexts [33,34]. In manufacturing, explainability helps verify that the model focuses on true dust regions rather than spurious reflections or texture artifacts.
Another practical dimension is data availability. In real factories, defect data can be scarce, and distributions can change as processes drift. Data augmentation is commonly used to improve generalization under limited data conditions [35]. Adam is widely used for deep learning optimization due to adaptive learning rates [36], while AdamW improves generalization by decoupling weight decay from the gradient update [37]. Modern frameworks such as PyTorch support rapid experimentation and have become a standard foundation for implementing CNN and detection models [38]. Detection models are commonly benchmarked on MS COCO, which defines standard evaluation protocols that help contextualize detector capability relative to industrial domains [39].
When labeled defect data are extremely limited, anomaly detection methods offer an alternative formulation by learning normal appearance and flagging deviations. Representative approaches include PaDiM [40], PatchCore [41], STFPM [42], Deep SVDD [43], and generative methods such as AnoGAN and GANomaly [44,45], benchmarked on datasets such as MVTec AD [46]. Although anomaly detection is outside the primary scope of this study, it motivates future directions for settings where defect labels are scarce.
Despite recent progress in deep learning for industrial inspection, microscopic dust detection on VCMA components remains insufficiently studied, particularly in settings requiring both reliable defect localization and image-level confirmation under limited-data conditions. To the best of the authors’ knowledge, no prior deep-learning study has specifically addressed microscopic dust inspection on HDD VCMA components. This gap is attributable to the rarity of naturally occurring, visually identifiable dust contamination in controlled HDD production, the cost of expert annotation, the highly reflective surface texture, and the limited availability of industrial samples. Existing approaches in related domains often focus on either classification or detection alone [47], while the interpretability of model decisions is rarely examined in the context of precision manufacturing contamination inspection.
Microscopic dust inspection on VCMA components is also closely related to the problem of tiny object detection (TOD), in which target objects occupy only a small fraction of the image area. Specialized strategies such as slicing-based inference (e.g., Slicing Aided Hyper Inference, SAHI [48]) and scale-enhanced feature extraction have been proposed to improve detection sensitivity for micro-scale features. However, such methods were outside the scope of the present study, which focused on comparing representative YOLO generations and CNN classifiers under the same constrained dataset and imaging conditions; these strategies are identified as a priority for future work.
To address these gaps, this study proposes an explainable hybrid inspection framework integrating YOLO-based localization, CNN-based Good/NG confirmation, and Grad-CAM-based visual interpretation for practical VCMA quality inspection. The principal scientific contributions are as follows:
- (1) A systematic comparative evaluation of representative YOLO-based detectors (YOLOv5, YOLOv8, and YOLOv11) and CNN-based classifiers (ResNet50, EfficientNetB0, and MobileNetV2) under identical experimental conditions on microscopic VCMA images, providing a rigorous basis for deployment-oriented model selection.
- (2) A sequential hybrid inspection framework that integrates CNN-based image-level screening with YOLO-based defect localization, demonstrating improved false-positive suppression and selective localization for practical VCMA quality inspection.
- (3) An explainability analysis using Grad-CAM to verify that selected models attend to physically meaningful dust-contaminated regions, supporting interpretability and engineering trust in automated inspection decisions.
2. Experimental Design and Methodology
This section describes the experimental design used to develop and evaluate deep learning-based inspection methods for microscopic dust detection on VCMA components. Two complementary inspection paradigms were investigated: YOLO-based object detection for dust localization and CNN-based image classification for Good/NG decision making. To ensure fair comparison, all models were evaluated under a unified imaging setup, dataset partition, preprocessing pipeline, and evaluation protocol. The workflow comprised image acquisition, dataset preparation, model training, explainability analysis, and performance evaluation under conditions representative of industrial inspection practice.
An HDD is a magnetic data storage device in which digital information is stored on rotating platters and accessed by a read/write head operating with extremely high positional accuracy. The motion of the read/write head is controlled by an actuator system that must function reliably at the micrometer scale to ensure stable data reading and writing. Because the clearance between the read/write head and the disk surface is extremely small, HDD components are highly sensitive to particulate contamination. Even microscopic dust particles can disturb head positioning, induce mechanical vibration, or lead to long-term degradation of operational reliability. Consequently, strict cleanliness control is required throughout HDD manufacturing, particularly during actuator-related assembly processes.
In the actuator system, the VCMA plays a critical role in controlling the precise motion of the read/write head. The VCMA operates on electromagnetic principles, converting electrical current into controlled mechanical displacement to enable rapid and accurate head positioning. Because of its direct influence on actuator precision and system stability, the VCMA is considered one of the most critical components affecting HDD performance. Foreign particles present on VCMA surfaces can significantly affect positioning accuracy and may cause unstable read/write operations or reliability failures over time. For this reason, VCMA components were selected as the primary inspection target in this study, as shown in Figure 1.
The contamination inspection in this study covers selected regions of the VCMA, chosen as a first step toward a complete imaging method for full-scale VCMA inspection. These regions are difficult to inspect visually and were selected based on practical manufacturing constraints such as surface geometry and the presence of reflective metallic surfaces that can obscure small dust particles. The inspection task is therefore representative of the real-world challenges encountered on industrial HDD production lines.
2.1. Image Capture
This case study focuses on microscopic dust inspection on VCMA components collected from an actual hard disk drive manufacturing environment. Image acquisition was conducted using a high-resolution digital microscope camera mounted above the inspection platform. VCMA samples were placed on a customized fixture to maintain consistent positioning and imaging angles throughout the capture process.
High-resolution images of the VCMA components were acquired in a real industrial production environment using a 5-megapixel digital microscope camera equipped with a variable-zoom optical lens. The optical magnification (0.8× to 10×) was selected according to the area to be inspected and the particle size. A ring-shaped LED lighting system provided adequate illumination and minimized glare from the reflective metallic surface. All images were taken under controlled lighting to ensure comparable brightness and contrast throughout the dataset, as shown in Figure 2. The key imaging parameters used throughout data collection are summarized in Table 1.
Several imaging parameters were standardized throughout data collection. The camera-to-subject distance was fixed at 100 mm, with an optical magnification of 10× and a native image resolution of 640 × 480 pixels. Illumination intensity was adjusted empirically to maximize contrast between dust particles and the reflective metallic surface.
To reduce background complexity and improve defect visibility, a white planar background was employed beneath the components. A ring-shaped LED illumination system was integrated around the camera lens to provide uniform lighting coverage across the metallic surface of the VCMA. The illumination intensity was carefully adjusted to mitigate surface reflections while enhancing the contrast of dust particles against the reflective substrate.
Due to the complex geometry of VCMA components, surface protrusions caused unstable placement and severe glare during initial image acquisition. To address this challenge, a custom fixture was designed and fabricated using 3D printing technology, as illustrated in
Figure 3. The fixture enabled the VCMA samples to remain level and mechanically stable during imaging, significantly reducing reflection artifacts and motion-induced blur.
Furthermore, because defects can occur at different locations during the assembly process, images were captured from both the front and rear surfaces of the VCMA components. Figure 4 compares imaging conditions with and without the customized fixture, highlighting the improved surface alignment and consistent focus achieved through fixture-assisted positioning.
2.2. Challenges in Microscopic Dust Inspection
The inspection of microscopic dust on VCMA components presents several technical challenges for reliable defect detection. One main issue is specular reflection from the metallic VCMA surface, which can hide dust particles or produce bright artifacts that distort their appearance (Figure 5a). Another is blur, which arises when dust particles lie outside the focal plane or when high magnification is required. Blurred regions lose edge sharpness and contrast, so small dust particles become indistinguishable from the background (Figure 5b). Excessive illumination can also create overexposed areas, as shown in Figure 5c, which obscure fine surface detail and can produce artifacts that resemble contamination. Finally, dust particles located away from the in-focus inspection area present additional obstacles. As shown in Figure 5d, such particles appear pale or indistinct because of limited resolution and depth of field, which increases the number of missed detections. These difficulties limit the effectiveness of classical rule-based or manual inspection methods and motivate deep-learning approaches that can learn complex visual patterns across varied imaging conditions.
2.3. Data Acquisition and Dataset Preparation
Two related but distinct datasets were prepared for this study. For image-level classification, the original dataset consisted of 47 microscope images, including 30 Good samples and 17 Not Good (NG) samples. Data augmentation was applied only to the training subset, resulting in 192 images for model development while preserving untouched validation and test images for unbiased evaluation. For object detection, the same 47 original images were manually annotated with dust bounding boxes, yielding 109 labeled defect instances. To avoid information leakage, all augmented samples derived from a given original image were kept exclusively within the training set, and no augmented variants of validation or test images were used during model selection or final evaluation. This strategy ensured that the reported performance reflected generalization to previously unseen images rather than memorization of transformed samples.
After dataset preparation, the data were divided into training, validation, and testing subsets using a 70:15:15 ratio with balanced class representation where applicable. For the classification task, the split was performed before augmentation so that only the training images were augmented. For the detection task, the annotated images were partitioned according to the same protocol to support consistent comparison across YOLO-based models.
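The split-before-augmentation protocol described above can be illustrated with a minimal sketch. It assumes a per-class (stratified) 70:15:15 split with simple rounding; the file names, the seed, and the rounding rule are hypothetical and are not taken from the study.

```python
import random


def stratified_split(items, labels, ratios=(0.70, 0.15, 0.15), seed=0):
    """Split items into train/val/test per class, before any augmentation."""
    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for cls in sorted(set(labels)):
        members = [x for x, y in zip(items, labels) if y == cls]
        rng.shuffle(members)
        n_train = round(ratios[0] * len(members))
        n_val = round(ratios[1] * len(members))
        splits["train"] += members[:n_train]
        splits["val"] += members[n_train:n_train + n_val]
        # Whatever remains after train and val goes to the test subset.
        splits["test"] += members[n_train + n_val:]
    return splits


# Hypothetical file names standing in for the 30 Good and 17 NG originals.
names = [f"good_{i:02d}.png" for i in range(30)] + [f"ng_{i:02d}.png" for i in range(17)]
labels = ["Good"] * 30 + ["NG"] * 17
splits = stratified_split(names, labels)
```

Only the files in `splits["train"]` would then be passed to the augmentation step, so no transformed copy of a validation or test image can enter training.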
All images were then cropped to contain only the region of interest (ROI), defined as the relevant VCMA surface area where dust contamination may appear. ROI extraction was performed to remove non-informative background regions and to focus model learning on the most critical inspection area. Representative examples of ROI selection and preprocessing are shown in
Figure 6.
Following ROI extraction, images were resized to match the input dimensions required by each model architecture and normalized by scaling pixel intensity values to the range [0, 1]. To improve model robustness and reduce overfitting under limited data conditions, data augmentation techniques were applied during training only, including random rotation within ±15°, random horizontal flipping, random zoom within ±10%, and brightness adjustment within ±20%. These preprocessing and augmentation steps were applied consistently to both YOLO-based detection models and CNN-based classification models, while validation and test datasets were kept unchanged. This strategy was adopted to enhance generalization across variations in lighting conditions and surface textures while ensuring fair and reproducible performance evaluation. All images in the dataset were acquired under these standardized conditions. No deliberate variation in brightness, contrast, blur, or lens state was introduced during data collection; consequently, the dataset does not capture the full range of imaging disturbances that may occur in long-term production use.
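As a rough illustration of the normalization and augmentation ranges listed above, the following sketch samples one set of augmentation parameters and scales pixel values to [0, 1]. It assumes 8-bit input images and a flip probability of 0.5, neither of which is stated in the text; the function names are hypothetical.

```python
import random


def normalize(pixels):
    """Scale 8-bit intensities (assumed input depth) to the range [0, 1]."""
    return [p / 255.0 for p in pixels]


def sample_augmentation(rng):
    """Draw one set of training-time augmentation parameters."""
    return {
        "rotation_deg": rng.uniform(-15.0, 15.0),  # random rotation within +/-15 degrees
        "hflip": rng.random() < 0.5,                # flip probability is an assumption
        "zoom": rng.uniform(0.90, 1.10),            # random zoom within +/-10%
        "brightness": rng.uniform(0.80, 1.20),      # brightness adjustment within +/-20%
    }


rng = random.Random(0)
params = [sample_augmentation(rng) for _ in range(100)]
scaled = normalize([0, 128, 255])
```

For the detection task, any geometric transform drawn this way would also have to be applied to the bounding-box annotations so that boxes stay aligned with the dust particles.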
The limited dataset size reflects practical constraints of acquiring annotated microscopic contamination samples in controlled HDD manufacturing, where dust events are infrequent and expert annotation is costly. To support reliable model development under these conditions, four safeguards were adopted: (1) CNN models were initialized with ImageNet-pretrained weights and trained with frozen feature extractors, a transfer learning strategy shown to enable effective adaptation under limited labeled data [49,50]; (2) augmentation was restricted strictly to the training subset to prevent information leakage; (3) 5-fold cross-validation was applied to reduce sensitivity to any single train/test partition [51]; and (4) dataset expansion through the acquisition of additional real samples is recognized as a practical route to better generalizability, because more labeled training images reduce the risk of overfitting and expose models to more diverse defect morphologies and background variations, which has been reported to improve model stability and transferability in industrial inspection under limited-data conditions [35]. The study is accordingly framed as a proof-of-feasibility investigation, and dataset expansion is identified as a priority for future work in Section 4.3.
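The 5-fold cross-validation safeguard can be sketched as follows. This is a generic, non-stratified index split; how the folds were combined with the held-out test subset in the study is not specified here, so the sample count, seed, and function name are illustrative assumptions.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, val_indices) pairs for k-fold cross-validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    # Interleaved slicing gives k folds whose sizes differ by at most one.
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, val


folds = list(k_fold_indices(47, k=5))
```

Each image serves as validation data exactly once, which is what reduces the dependence of the reported metrics on a single partition.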
2.4. Workflow of YOLO-Based VCMA Defect Detection System
In this study, a YOLO-based object detection framework was employed to detect and localize microscopic dust contamination on VCMA surfaces under realistic industrial imaging conditions. The YOLO family was selected because its single-stage detection paradigm enables object localization and classification within a unified network, making it suitable for time-sensitive inspection tasks involving small and visually subtle defects. A conceptual overview of the YOLO-based detection principle adopted in this work is presented in
Figure 7, while the complete experimental workflow is illustrated in
Figure 8.
The detection workflow began with image acquisition and preprocessing, following the procedures described in
Section 2.1 and
Section 2.3. The annotated detection dataset consisted of 47 original VCMA images, in which dust-contaminated regions were manually labeled using bounding boxes to provide ground-truth supervision for model training. Each bounding box corresponded to a visually identifiable dust-contaminated region on the VCMA surface. Regions were annotated as dust only when the particles were visually distinguishable from the surrounding VCMA surface and exhibited morphological characteristics consistent with contamination under microscopic observation. These annotations served as the basis for supervised learning and enabled the models to learn both the spatial location and visual characteristics of microscopic dust particles.
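Because the annotations are later supplied as YOLO-format text files (Section 2.9), a small sketch of that label format may be useful. It converts a pixel-space box into the normalized `class x_center y_center width height` line used by YOLO tooling; the function name, the class index 0 for dust, and the example coordinates are hypothetical.

```python
def to_yolo_line(class_id, x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert a pixel-space box to one normalized YOLO-format label line."""
    x_c = (x_min + x_max) / 2.0 / img_w   # box center, as a fraction of image width
    y_c = (y_min + y_max) / 2.0 / img_h   # box center, as a fraction of image height
    w = (x_max - x_min) / img_w           # box width, normalized
    h = (y_max - y_min) / img_h           # box height, normalized
    return f"{class_id} {x_c:.6f} {y_c:.6f} {w:.6f} {h:.6f}"


# Hypothetical 40 x 40 pixel dust box centered in a 640 x 480 image.
line = to_yolo_line(0, 300, 220, 340, 260, 640, 480)
```

Each image has one such text file, with one line per annotated dust region.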
Three YOLO variants, namely YOLOv5, YOLOv8, and YOLOv11, were comparatively evaluated to examine differences in detection behavior across model generations under identical experimental conditions. All models were trained using the same annotated dataset, preprocessing pipeline, and evaluation protocol to ensure fair comparison. During training, the YOLO framework jointly learned feature extraction, object classification, and bounding-box regression in an end-to-end manner, thereby enabling simultaneous dust recognition and localization on VCMA images.
All YOLO models were implemented in Python 3.12.13 using the PyTorch 2.10.0-based Ultralytics 8.4.37 framework and trained in a Google Colab environment equipped with an NVIDIA Tesla T4 GPU (16 GB VRAM), an x86_64 CPU, and 12.7 GB of system memory under Linux 6.6. To ensure consistency across experiments, the same training hyperparameters were used for all three YOLO models, including an input image size of 640 × 640 pixels, a batch size of 16, a learning rate of 0.001, and 150 training epochs. These unified settings were adopted to support reproducibility and to isolate architectural differences as the primary factor influencing comparative performance.
The output of the YOLO-based detection system consisted of predicted bounding boxes indicating the location of dust particles on the VCMA surface together with their associated confidence scores. These predictions were subsequently evaluated using standard object detection metrics, including Precision, Recall, F1-score, mAP@0.5, and mAP@0.5:0.95, as described in
Section 2.10. Representative examples of the YOLO-based detection workflow and output visualization are shown in
Figure 8.
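To make the detection metrics concrete, the sketch below computes intersection over union (IoU) and derives Precision, Recall, and F1-score at a single IoU threshold of 0.5. It is a simplified illustration using greedy one-to-one matching in descending confidence order; it is not the Ultralytics evaluation code, and it does not compute mAP, which additionally averages precision over recall levels.

```python
def iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall_f1(predictions, ground_truth, iou_thr=0.5):
    """predictions: list of (box, confidence); ground_truth: list of boxes."""
    matched = set()
    tp = 0
    for box, _ in sorted(predictions, key=lambda p: p[1], reverse=True):
        best_j, best_iou = None, iou_thr
        for j, gt in enumerate(ground_truth):
            if j in matched:
                continue  # each ground-truth box can be matched only once
            value = iou(box, gt)
            if value >= best_iou:
                best_j, best_iou = j, value
        if best_j is not None:
            matched.add(best_j)
            tp += 1
    fp = len(predictions) - tp
    fn = len(ground_truth) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gt_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [((0, 0, 10, 10), 0.9), ((50, 50, 60, 60), 0.8)]
metrics = precision_recall_f1(preds, gt_boxes)
```

In this toy case one prediction matches a ground-truth box, one is a false positive, and one ground-truth box is missed.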
2.5. Workflow of CNN-Based VCMA Defect Classification System
In addition to object detection, a CNN-based classification framework was investigated to perform image-level inspection of VCMA components. In this approach, each VCMA image was classified into one of two categories: Good, representing dust-free components, and Not Good (NG), representing components with visible dust contamination. The Good class included images without visible dust contamination, whereas the NG class comprised images containing visually identifiable dust or foreign particles on the VCMA surface. This formulation reflects practical industrial inspection scenarios in which rapid pass/fail screening is required at the image level without explicit defect localization.
A conceptual overview of the CNN-based classification principle adopted in this study is shown in
Figure 9, while the detailed experimental workflow is illustrated in
Figure 10. The classification workflow began with image acquisition, labeling, and preprocessing, following the procedures described in
Section 2.1 and
Section 2.3. The classification dataset was derived from the original set of microscope images and organized into Good and NG classes for supervised learning.
Three CNN architectures, namely ResNet50, EfficientNetB0, and MobileNetV2, were evaluated to assess their suitability for VCMA dust classification under identical experimental conditions. These models were selected to represent different design trade-offs in terms of feature extraction capability, model complexity, and computational efficiency. Transfer learning was applied by initializing each network with ImageNet-pretrained weights and retaining the pretrained backbone as a fixed feature extractor, while only the classification head was trained for the binary defect classification task.
To adapt the selected CNN architectures to this application, the original classification layers were replaced with a task-specific classifier head consisting of global average pooling, fully connected layers, dropout regularization with a rate of 0.1, and a softmax output layer for binary prediction. Input images were resized to 224 × 224 pixels to match the input requirements of the CNN models. All preprocessing steps and training conditions were kept consistent across the three CNN architectures to ensure fair comparison and reproducibility.
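A classifier head of this kind could be assembled roughly as shown below. The sketch assumes the Keras API in TensorFlow (suggested by the layer names used later for Grad-CAM, but not confirmed by the text), a single fully connected layer of 128 units (the number and width of the fully connected layers are not stated), and a hypothetical function name.

```python
INPUT_SHAPE = (224, 224, 3)
DROPOUT_RATE = 0.1
NUM_CLASSES = 2       # Good / NG
HIDDEN_UNITS = 128    # assumption: the fully connected layer width is not stated


def build_classifier(backbone_name="ResNet50"):
    """Frozen ImageNet backbone plus a small trainable head (illustrative only)."""
    # Imported lazily so the constants above can be inspected without TensorFlow.
    import tensorflow as tf

    # backbone_name may be "ResNet50", "EfficientNetB0", or "MobileNetV2".
    backbone_cls = getattr(tf.keras.applications, backbone_name)
    backbone = backbone_cls(include_top=False, weights="imagenet", input_shape=INPUT_SHAPE)
    backbone.trainable = False  # keep the pretrained feature extractor fixed

    inputs = tf.keras.Input(shape=INPUT_SHAPE)
    x = backbone(inputs, training=False)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    x = tf.keras.layers.Dense(HIDDEN_UNITS, activation="relu")(x)
    x = tf.keras.layers.Dropout(DROPOUT_RATE)(x)
    outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)
```

With the backbone frozen, only the dense layers of the head contribute trainable parameters, which keeps the number of fitted weights small relative to the dataset size.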
The CNN-based classification system was trained to learn discriminative image-level representations that distinguish defect-free VCMA surfaces from dust-contaminated samples. During inference, the output of the classifier was a predicted class label together with its associated confidence score for each input image. These predictions were subsequently evaluated using standard classification metrics, including accuracy, precision, recall, F1-score, and confusion-matrix analysis, as described in
Section 2.10. Representative examples of the CNN-based classification workflow are presented in
Figure 10 and
Figure 11. To further examine whether the CNN models relied on physically meaningful dust regions during classification, Grad-CAM visualization was subsequently applied, as described in Section 2.6.
2.6. Gradient-Weighted Class Activation Mapping (Grad-CAM)
To enhance the interpretability of the CNN-based classification models, Grad-CAM was applied to visualize the discriminative image regions that contributed most strongly to each model’s prediction. Grad-CAM computes the gradient of the target class score with respect to the feature maps of the final convolutional layer and uses these gradients as weights to generate a class-discriminative heatmap. This technique enables spatial localization of the regions that the model attends to when making a classification decision, without requiring architectural modifications or additional supervision.
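In the notation of the original Grad-CAM formulation [32], the weighting described above can be written as follows, where $A^k$ is the $k$-th feature map of the chosen convolutional layer, $y^c$ is the score of class $c$, and $Z$ is the number of spatial positions in the feature map:

```latex
\alpha_k^c = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial y^c}{\partial A^k_{ij}},
\qquad
L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\!\left( \sum_{k} \alpha_k^c \, A^k \right)
```

The ReLU keeps only the features with a positive influence on the class of interest, so the heatmap highlights regions that support the predicted label.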
In this study, Grad-CAM was applied independently to ResNet50, EfficientNetB0, and MobileNetV2 on VCMA test images to qualitatively assess whether each model’s attention was directed toward physically meaningful dust regions rather than irrelevant background textures or surface reflections.
Grad-CAM was applied to the final convolutional layer of each model, specifically conv5_block3_out for ResNet50, block7a_project_conv for EfficientNetB0, and Conv_1 for MobileNetV2, which are the deepest convolutional layers retaining spatial feature maps prior to global average pooling.
The activation maps were upsampled to the original input image resolution using bilinear interpolation and subsequently normalized to the range [0, 1] via min-max scaling to ensure consistent visual representation across images. The normalized heatmaps were then overlaid on the corresponding input images at an opacity of 0.5 using a jet colormap, with values ranging from blue (low activation) to red (high activation), enabling direct visual comparison of model attention across architectures.
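The numerical core of this procedure can be sketched in plain Python. The example computes the Grad-CAM map from given feature maps and gradients, applies min-max scaling to [0, 1], and blends a heat value with an image value at an opacity of 0.5. Bilinear upsampling and the jet colormap are omitted for brevity, and all function names are hypothetical.

```python
def grad_cam(feature_maps, gradients):
    """feature_maps, gradients: lists of K maps, each an H x W nested list."""
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    # Channel weights: global average of the gradients over spatial positions.
    weights = [sum(sum(row) for row in g) / (h * w) for g in gradients]
    # Weighted sum over channels followed by ReLU.
    return [
        [max(0.0, sum(wk * fm[i][j] for wk, fm in zip(weights, feature_maps)))
         for j in range(w)]
        for i in range(h)
    ]


def min_max_normalize(cam):
    """Scale a heatmap to [0, 1]; a constant map is mapped to all zeros."""
    flat = [v for row in cam for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in cam]
    return [[(v - lo) / (hi - lo) for v in row] for row in cam]


def blend(image_value, heat_value, alpha=0.5):
    """Overlay one heat value on one image value at the given opacity."""
    return (1.0 - alpha) * image_value + alpha * heat_value


# Toy example with a single 2 x 2 feature map and uniform gradients.
cam = grad_cam([[[1, 2], [3, 4]]], [[[1, 1], [1, 1]]])
heat = min_max_normalize(cam)
```

In practice the same operations would be applied to tensors taken from the layers named above, with the colormap applied to the normalized map before blending.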
This visualization serves as a complementary evaluation tool alongside the quantitative metrics reported in
Section 3, providing deeper insight into model behavior and supporting confidence in the reliability of the proposed inspection system.
2.7. Hybrid Inspection Framework for VCMA Defect Detection
Microscopic dust contamination on VCMA components presents significant challenges for automated inspection due to its small size, irregular appearance, and surface reflections. CNN-based classification models are effective in identifying contaminated samples at the image level; however, they do not provide information regarding defect locations. To enable precise spatial inspection and improve practical applicability, a YOLO-based object detection module was integrated into a hybrid inspection framework.
In the proposed pipeline, each VCMA image is first analyzed using the CNN classifier to determine the presence of dust defects. Defect-free images are directly accepted, while contaminated samples are forwarded to the YOLO detector for localized inspection. This sequential strategy limits detection to relevant samples, improving computational efficiency and reducing unnecessary false detections.
The YOLO detector localizes microscopic dust regions by generating bounding boxes corresponding to defect areas on the VCMA surface. This localization supports visualization and further defect analysis while complementing the global discrimination capability of CNN classification. By integrating classification and detection in a coarse-to-fine manner, the hybrid framework enhances inspection robustness for industrial VCMA quality control, as illustrated in
Figure 12.
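The sequential decision logic of the hybrid framework can be summarized in a short sketch. The classifier and detector are passed in as plain callables, and the 0.5 decision threshold on the NG probability is an assumption made for illustration; the actual models and threshold used in the study are not reproduced here.

```python
def inspect(image, classify, detect, ng_threshold=0.5):
    """Run CNN screening first and call the detector only for NG images.

    classify(image) -> probability that the image is NG
    detect(image)   -> list of (box, confidence) predictions
    """
    p_ng = classify(image)
    if p_ng < ng_threshold:
        # Defect-free images are accepted without running the detector.
        return {"decision": "Good", "p_ng": p_ng, "boxes": []}
    return {"decision": "NG", "p_ng": p_ng, "boxes": detect(image)}


# Stub models standing in for the trained CNN and YOLO detector.
calls = []


def stub_detect(image):
    calls.append(image)
    return [((10, 10, 20, 20), 0.8)]


good_result = inspect("clean.png", lambda img: 0.1, stub_detect)
ng_result = inspect("dusty.png", lambda img: 0.9, stub_detect)
```

Because the detector runs only on images flagged by the classifier, detections on clean images cannot occur, which is the mechanism behind the false-positive suppression described above.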
2.8. Rationale for Model Selection
YOLO-based object detection and CNN-based image classification architectures were selected in this study due to their complementary roles in industrial visual inspection tasks. Specifically, three YOLO variants (YOLOv5, YOLOv8 [19], and YOLOv11 [20]) were evaluated for real-time dust localization, while ResNet50 [9], EfficientNetB0 [10], and MobileNetV2 [11] were employed for image-level defect classification. YOLO-based detection enables simultaneous localization and classification of dust particles on VCMA surfaces, making it suitable for inspection scenarios in which spatial information about defect locations is required. However, detection performance may be influenced by surface reflections, illumination variations, and background noise commonly encountered in industrial imaging environments.
In contrast, CNN-based image classification focuses on learning global feature representations from visual data to provide stable image-level decision making. This approach is well suited for binary quality assessment tasks, where a pass/fail decision is sufficient, and can serve as an alternative or secondary verification strategy to reduce misclassification arising from localization-based methods.
By evaluating YOLO-based detection and CNN-based classification independently under identical experimental conditions, this study aims to systematically examine the strengths and limitations of each approach for VCMA dust inspection. This comparative design establishes a methodological basis for analyzing inspection performance and provides insights that may support the future development of hybrid inspection frameworks combining localization capability with robust image-level classification.
2.9. Model Training and Implementation
Both YOLO-based detection models and CNN-based classification models were implemented in Python 3.10 under consistent computational environments to ensure fair and reproducible performance comparison. All experiments were conducted using the same software framework and hardware configuration whenever applicable. The key training parameters for both model categories are summarized in
Table 2.
The small-scale (s) variants were used for all three YOLO architectures (YOLOv5s, YOLOv8s, YOLOv11s), selected to balance detection capability with computational efficiency under the limited-data conditions of this study. All models were initialized with COCO-pretrained weights provided through the Ultralytics framework (version 8.4.37) and trained using SGD with a fixed learning rate of 0.001. The composite YOLO training loss comprised objectness loss, classification loss, and bounding-box regression loss, optimized jointly in an end-to-end manner.
For the YOLO-based detection models, ground-truth annotations were provided in YOLO-format text files specifying bounding-box coordinates and corresponding class labels, so that minimizing the composite detection loss trained the models to predict dust presence and spatial location simultaneously. To improve convergence stability and reduce overfitting, early stopping and learning-rate scheduling were applied during training.
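As a minimal sketch of how such an annotation line can be read, the snippet below assumes the common YOLO text convention of one object per line, `class x_center y_center width height`, with coordinates normalized to [0, 1]. The function name and the image size in the usage line are hypothetical and are not taken from the study's code.

```python
from typing import Tuple


def parse_yolo_label_line(line: str, img_w: int, img_h: int) -> Tuple[int, float, float, float, float]:
    """Convert one YOLO-format label line to (class_id, x_min, y_min, x_max, y_max) in pixels.

    Assumes the line holds: class_id x_center y_center width height,
    with the four box values normalized by image width/height.
    """
    parts = line.split()
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}: {line!r}")
    class_id = int(parts[0])
    xc, yc, w, h = (float(v) for v in parts[1:])
    # Scale the normalized center/size representation to pixel corner coordinates.
    x_min = (xc - w / 2) * img_w
    y_min = (yc - h / 2) * img_h
    x_max = (xc + w / 2) * img_w
    y_max = (yc + h / 2) * img_h
    return class_id, x_min, y_min, x_max, y_max


# Hypothetical example: a single dust box centered in a 640 x 480 image.
box = parse_yolo_label_line("0 0.5 0.5 0.2 0.1", img_w=640, img_h=480)
```

Converting to corner coordinates in pixels is convenient for overlaying boxes on the image and for computing overlap with predicted boxes.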
For the CNN-based classification models, each input image was assigned a binary categorical label (Good or NG), and training aimed to minimize categorical cross-entropy loss. The CNN models were trained using batch size 32, learning rate 0.005, and dropout regularization of 0.1 under the selected final configuration. To support stable convergence and fair comparison across architectures, early stopping (patience = 5) and ReduceLROnPlateau scheduling (factor = 0.2, patience = 3) were employed, with the best model weights restored after training.
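The interaction between the two callbacks can be illustrated with a simplified, framework-free simulation. This is a sketch under stated assumptions, not the training code used in the study: it tracks validation loss only, ignores options such as minimum-delta and cooldown that real callback implementations expose, and uses a made-up loss sequence. The function name `simulate_schedule` is hypothetical.

```python
def simulate_schedule(val_losses, lr=0.005, es_patience=5, lr_patience=3, lr_factor=0.2):
    """Simulate early stopping and plateau-based learning-rate reduction.

    Returns the epoch index with the best validation loss (whose weights
    would be restored) and the learning rate in effect after each epoch.
    """
    best_loss = float("inf")
    best_epoch = -1
    es_wait = 0  # epochs since last improvement (early stopping counter)
    lr_wait = 0  # epochs since last improvement or last LR reduction
    lr_history = []
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            es_wait = 0
            lr_wait = 0
        else:
            es_wait += 1
            lr_wait += 1
            if lr_wait >= lr_patience:
                lr *= lr_factor  # e.g. 0.005 -> 0.001 with factor 0.2
                lr_wait = 0
        lr_history.append(lr)
        if es_wait >= es_patience:
            break  # stop training; best weights come from best_epoch
    return best_epoch, lr_history


# Hypothetical validation-loss curve that plateaus after epoch 2.
losses = [0.90, 0.60, 0.50, 0.52, 0.51, 0.53, 0.55, 0.54, 0.56, 0.50]
best_epoch, lr_history = simulate_schedule(losses)
```

With a reduction patience (3) shorter than the stopping patience (5), the learning rate is lowered at least once before training is halted, giving the model a chance to escape a plateau before the run ends.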
Training configurations such as input image resolution, number of training epochs, batch size, optimizer, and learning rate were kept consistent within each model category to support fair comparison across YOLO variants and across CNN architectures. This unified training and implementation strategy enables objective evaluation of YOLO-based detection and CNN-based classification approaches for VCMA defect inspection.
2.10. Evaluation Metrics and Visualization
Model performance was quantitatively evaluated using task-appropriate metrics for object detection, image classification, and system-level hybrid inspection. To ensure consistency and comparability, all evaluations were conducted using the same dataset partitions and preprocessing procedures described in
Section 2.
For the YOLO-based detection models, detection performance was assessed using Precision, Recall, and F1-score, which are widely adopted metrics for evaluating object detection performance. Precision measures the proportion of correctly detected dust particles among all detected instances, while Recall represents the proportion of correctly detected dust particles relative to all ground-truth instances. The F1-score provides a balanced measure between Precision and Recall. These standard metrics are defined as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$\mathrm{F1\text{-}score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
Here, TP, FP, and FN denote true positives, false positives, and false negatives, respectively.
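The standard definitions of Precision, Recall, and F1-score in terms of TP, FP, and FN can be computed as in the short sketch below. The counts in the usage line are hypothetical and serve only to illustrate the arithmetic; they are not results from this study.

```python
def detection_metrics(tp: int, fp: int, fn: int):
    """Return (precision, recall, f1) from true-positive, false-positive, and false-negative counts."""
    # Guard against empty denominators (no detections or no ground-truth objects).
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1


# Hypothetical counts: 9 correct detections, 1 spurious detection, 3 missed particles.
precision, recall, f1 = detection_metrics(tp=9, fp=1, fn=3)
```

Because F1 is the harmonic mean of Precision and Recall, it always lies between the two and is pulled toward the smaller value.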
In addition to these measures, the mean Average Precision (mAP) at Intersection over Union (IoU) thresholds of 0.5 (mAP@0.5) and 0.5–0.95 (mAP@0.5:0.95) were calculated to provide a more comprehensive assessment of detection robustness and localization accuracy across the evaluated YOLO variants.
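The IoU criterion underlying both mAP variants can be sketched as follows. Boxes are assumed to be given as `(x_min, y_min, x_max, y_max)` corner coordinates; the helper names are hypothetical, and the matching rule shown is a simplification of the full mAP procedure, which also ranks detections by confidence.

```python
def iou(box_a, box_b) -> float:
    """Intersection over Union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region, clipped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def matches_at(pred_box, gt_box, threshold: float = 0.5) -> bool:
    """A prediction counts as a match when its IoU with the ground truth reaches the threshold."""
    return iou(pred_box, gt_box) >= threshold


# Two 10 x 10 boxes shifted by half their width overlap with IoU = 50 / 150.
example = iou((0, 0, 10, 10), (5, 0, 15, 10))
```

mAP@0.5 applies the single threshold 0.5, whereas mAP@0.5:0.95 averages over thresholds from 0.5 to 0.95, so it rewards tighter boxes. For very small targets such as dust particles, a shift of a few pixels can change IoU considerably, which is one reason the stricter metric tends to be much lower.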
For the CNN-based classification models, performance was evaluated using image-level accuracy, precision, recall, F1-score, and area under the receiver operating characteristic curve (AUC). Confusion matrices were additionally used to analyze the distribution of correct and incorrect predictions between the Good and NG classes. Because false-negative predictions are particularly critical in industrial quality inspection, recall was treated as an important metric alongside AUC, F1-score, and overall classification accuracy during model interpretation.
For the proposed sequential hybrid framework, additional system-level evaluation was conducted to assess the combined behavior of the screening and localization stages. The hybrid pipeline was compared with the standalone CNN and YOLO configurations in terms of image-level decision quality, false-positive and false-negative behavior, and the proportion of images forwarded to the YOLO-based localization stage. Failure modes of the sequential pipeline were also analyzed to identify the dominant sources of residual error.
To enhance interpretability, Grad-CAM was applied to the CNN-based classification models to generate class-specific heatmaps indicating the image regions most influential to each prediction. These visualizations were used to assess whether the models attended to physically meaningful dust regions rather than irrelevant background patterns or reflective artifacts. For the YOLO-based detection models, predicted bounding boxes were overlaid on VCMA images to qualitatively assess localization accuracy and detection consistency. Together, these visualization methods complemented the quantitative metrics by providing additional insight into model behavior and inspection reliability.
3. Results
This section presents the experimental results and comparative analysis of the YOLO-based detection and CNN-based classification models for microscopic dust inspection on VCMA components. All models were evaluated using the same dataset partitions and performance criteria described in
Section 2 to ensure fair and reproducible comparison. The results are presented in five parts: YOLO-based detection performance, CNN-based classification performance, Grad-CAM-based interpretability analysis, evaluation of the sequential hybrid inspection framework, and a comparative discussion of the findings in relation to industrial deployment.
3.1. Performance of YOLO-Based Detection Models
Three YOLO architectures, namely YOLOv5, YOLOv8, and YOLOv11, were evaluated for their ability to detect microscopic dust particles on VCMA components.
Table 3 summarizes the quantitative performance comparison among the three YOLO models in terms of Precision, Recall, F1-score, mAP@0.5, mAP@0.5:0.95 and number of false positives. Overall, the results indicate distinct detection behaviors across model variants, particularly with respect to the trade-off between detection performance and false-positive control.
Among the evaluated models, YOLOv8 achieved the highest Precision (0.81), F1-score (0.73), and mAP@0.5 (0.66), indicating the strongest overall detection performance in terms of these aggregate metrics. YOLOv5, however, achieved the highest Recall (0.69) and produced the lowest number of false positives (2), while maintaining the same mAP@0.5:0.95 value (0.26) as YOLOv8. In contrast, YOLOv11 showed substantially lower Recall (0.38), F1-score (0.49), and mAP values, indicating weaker localization performance under the present experimental setting. This performance gap is likely attributable to YOLOv11’s anchor-free detection architecture, which, while effective on large-scale datasets, may exhibit optimization instability under the limited-data conditions of the present study. With only 47 original training images, the available data may have been insufficient for the model to converge fully, resulting in inconsistent bounding-box predictions and reduced sensitivity to small dust targets. The higher false-positive count observed for YOLOv8 is likely related to its more sensitive feature extraction design, which improves aggregate detection performance on diverse datasets but also increases susceptibility to spurious activations on the highly reflective metallic surfaces characteristic of VCMA components. In contrast, YOLOv5’s anchor-based architecture tends to produce more conservative predictions, which reduces background sensitivity and is advantageous in this task, where the operational cost of false alarms outweighs marginal gains in aggregate sensitivity. A formal ablation study isolating the contribution of architectural differences across YOLO generations remains a direction for future work.
These findings suggest that detector selection in the present VCMA inspection task should be guided not only by aggregate detection metrics, but also by the operational cost of false alarms. Although YOLOv8 yielded the strongest overall quantitative performance, it generated the highest number of false positives, particularly in visually complex or reflective regions. By comparison, YOLOv5 provided a more favorable balance between recall and false-positive suppression, making it more suitable for deployment-oriented inspection scenarios in which unnecessary alarms may increase manual verification effort and interrupt production flow. Therefore, YOLOv5 was selected as the preferred detection model for subsequent integration into the hybrid inspection framework.
Figure 13 presents representative qualitative detection outputs from YOLOv5, YOLOv8, and YOLOv11. The examples illustrate clear differences in bounding-box placement and background sensitivity across the three models. YOLOv5 generally produced more conservative predictions with less background interference, whereas YOLOv8 showed stronger sensitivity but also more spurious responses in visually complex regions. YOLOv11, in turn, showed weaker localization consistency under the present imaging conditions. These qualitative observations are consistent with the quantitative results reported in
Table 3 and provide further insight into the practical trade-offs among the evaluated YOLO variants.
3.2. Performance of CNN-Based Classification Models
For the image-level classification task, three pre-trained CNN architectures, namely ResNet50, EfficientNetB0, and MobileNetV2, were evaluated as fixed feature extractors for classifying VCMA images into Good and Not Good (NG) categories. Model performance was assessed using standard classification metrics, including Accuracy, Precision, Recall, F1-score, and area under the receiver operating characteristic curve (AUC). To examine convergence behavior and training stability, the training and validation loss and accuracy curves are shown in
Figure 14, while the confusion matrices are presented in
Figure 15.
As illustrated in
Figure 14, all three CNN models exhibited stable learning behavior without severe divergence during training. Among them, EfficientNetB0 showed relatively smoother validation curves and faster convergence, suggesting more stable generalization under the present dataset conditions. ResNet50 also converged consistently, whereas MobileNetV2 showed slightly greater fluctuation in validation behavior, indicating potentially higher sensitivity to image variation and class ambiguity.
The confusion matrices in
Figure 15 further reveal differences in misclassification behavior across models. EfficientNetB0 produced the lowest number of misclassified samples overall, indicating a more balanced separation between Good and NG images. ResNet50 achieved comparable performance but showed slightly more classification errors, whereas MobileNetV2 demonstrated a greater tendency to incorrectly classify defect-free samples as defective, consistent with its lower precision.
To assess threshold-independent discriminative performance, receiver operating characteristic (ROC) curves were generated for the three CNN models, as shown in
Figure 16. The ROC analysis complements the confusion-matrix results by illustrating the trade-off between true positive rate and false positive rate across different classification thresholds. Although MobileNetV2 achieved the highest AUC, EfficientNetB0 showed the most favorable overall balance between discriminative performance and validation stability under the present experimental setting. The corresponding AUC values were 0.957 for ResNet50, 0.986 for EfficientNetB0, and 1.000 for MobileNetV2.
The ROC analysis thus provides a threshold-independent view of classification performance and complements the confusion-matrix analysis shown in
Figure 15.
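Why AUC is threshold-independent can be seen from its rank-based interpretation: it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes AUC this way on made-up scores; the function name and the data are hypothetical and unrelated to the study's predictions.

```python
def roc_auc(labels, scores) -> float:
    """Rank-based AUC: fraction of (positive, negative) pairs ordered correctly, ties counted as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy_at(labels, scores, threshold: float = 0.5) -> float:
    """Accuracy when scores at or above the threshold are predicted as the positive class."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)


# Hypothetical scores: positives always rank above negatives, but one positive falls below 0.5.
labels = [1, 1, 0, 0]
scores = [0.90, 0.45, 0.30, 0.10]
auc = roc_auc(labels, scores)            # perfect ranking
acc = accuracy_at(labels, scores, 0.5)   # yet one error at the fixed threshold
```

The example shows that a perfect AUC can coexist with classification errors at a fixed decision threshold, so AUC and threshold-dependent metrics such as Accuracy or Precision need to be read together.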
Table 4 summarizes the corresponding classification metrics, including Accuracy, Precision, Recall, F1-score, and AUC, for the three evaluated CNN architectures.
Table 4 shows that EfficientNetB0 achieved the most favorable overall balance among the evaluated CNN architectures. It combined high Accuracy (93.10%), Precision (0.91), F1-score (0.90), and AUC (0.986), indicating strong discriminative capability for distinguishing dust-free and contaminated VCMA samples. ResNet50 yielded comparable Precision (0.91) and Recall (0.90), but with substantially lower Accuracy (82.76%) and slightly lower F1-score (0.89). In contrast, MobileNetV2 achieved the highest AUC (1.000), but its lower Precision (0.87) and F1-score (0.88) indicate a greater tendency to generate false alarms on good samples. These results suggest that EfficientNetB0 provided the best overall trade-off between classification performance and practical reliability at the image level.
To further examine model reliability beyond the single-run classification metrics reported in
Table 4, the best valid configuration of each CNN architecture was identified from the design-of-experiments analysis and is summarized in
Table 5. Notably, all three models were compared under the same optimized hyperparameter setting, namely batch size = 32, learning rate = 0.005, dropout = 0.1, and no data augmentation, thereby enabling a fair assessment of architecture-dependent performance.
Under these identical settings, EfficientNetB0 achieved the most reliable overall performance, with a test accuracy of 93.10%, test loss of 0.226, F1-score of 0.930, AUC of 0.986, and the highest 5-fold cross-validated accuracy of 93.24% with the lowest standard deviation (±3.57). Although MobileNetV2 achieved the same test accuracy (93.10%) and F1-score (0.930), and yielded a perfect AUC of 1.000, its substantially larger fold-to-fold variation (±15.12) and lower cross-validated accuracy (81.01%) suggest reduced robustness under the present limited-data condition. ResNet50 showed comparatively lower predictive performance overall, with a test accuracy of 82.76% and a 5-fold cross-validated accuracy of 79.17% ± 5.14. These findings further support the selection of EfficientNetB0 as the primary CNN model for subsequent integration into the hybrid inspection framework.
The contrast between MobileNetV2's single-run test accuracy and its unstable cross-validated behavior indicates that, under small-sample conditions, 5-fold cross-validation provides a more reliable basis for model selection than single-run evaluation alone.
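A cross-validated summary of the form "mean ± standard deviation" can be obtained from per-fold accuracies as sketched below. The fold values are hypothetical, and the sketch assumes the sample standard deviation; the source does not state whether the sample or population form was used, and the two differ noticeably when only five folds are available.

```python
import statistics


def summarize_folds(fold_accuracies):
    """Return (mean, sample standard deviation) of per-fold accuracies."""
    if len(fold_accuracies) < 2:
        raise ValueError("need at least two folds to estimate variability")
    mean = statistics.mean(fold_accuracies)
    std = statistics.stdev(fold_accuracies)  # sample form (n - 1 in the denominator)
    return mean, std


# Hypothetical per-fold accuracies (in percent) for a 5-fold run.
hypothetical_folds = [90.0, 95.0, 100.0, 85.0, 95.0]
mean_acc, std_acc = summarize_folds(hypothetical_folds)
print(f"{mean_acc:.2f}% ± {std_acc:.2f}")
```

With small test folds, a single misclassified image shifts a fold's accuracy by several percentage points, which is why the spread across folds is as informative as the mean.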
Representative qualitative examples of classification results are shown in
Figure 17. The visual comparison indicates that EfficientNetB0 produced more consistent predictions under varying illumination and reflective surface conditions, correctly identifying contaminated samples that were occasionally misclassified by the other models. These qualitative observations are consistent with the quantitative trends reported in
Table 4 and
Table 5.
Overall, the classification results indicate that EfficientNetB0 provides the most suitable CNN architecture for image-level VCMA dust classification in the present study. Its favorable balance between predictive performance, classification consistency, and validation stability makes it well suited for subsequent use in the proposed hybrid inspection framework. To further examine whether the CNN models relied on physically meaningful dust regions during prediction, Grad-CAM visualization was subsequently applied, as presented in Section 3.3.
To further assess the generalizability of the proposed CNN-based classification framework, EfficientNetB0 was additionally evaluated on an extended dataset comprising 123 original images acquired from the same VCMA inspection environment. The extended dataset was augmented to 388 training images using the same augmentation pipeline described in
Section 2.3, and model training followed the same hyperparameter configuration as the original study.
The evaluation on the extended dataset yielded a test accuracy of 76.00%, F1-score of 0.432, and AUC of 0.605 (
Table 6), which are notably lower than the results obtained on the original 47-image dataset. Inspection of the classification report revealed that the model predicted all test samples as NG, resulting in a recall of 1.00 for the NG class but a recall of 0.00 for the Good class, whose precision was likewise reported as 0.00 because no sample was predicted as Good. The 76.00% accuracy therefore simply reflects the class imbalance in the test subset (NG: 19, Good: 6) and indicates that the model did not successfully generalize across the two datasets. It should be noted that while the original dataset was collected from a single fixed inspection position to ensure imaging consistency, the extended dataset was collected from multiple different positions across the VCMA surface in order to capture additional NG samples that were not present at the original position. This difference in acquisition strategy inevitably introduced substantial variation in surface geometry, illumination conditions, and background texture, making direct model transfer between the two datasets challenging.
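The reported accuracy and F1-score are what an all-NG predictor yields on a 19 NG / 6 Good test subset, as the short check below illustrates. It assumes that the reported F1-score is the macro average over the two classes, which the source does not state explicitly; under that assumption the arithmetic reproduces both figures.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Per-class F1 from counts, returning 0.0 when the class is never predicted correctly."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0


n_ng, n_good = 19, 6  # test-subset class counts stated in the text

# Every sample is predicted as NG.
f1_ng = f1_from_counts(tp=n_ng, fp=n_good, fn=0)   # all NG found, Good samples are false alarms
f1_good = f1_from_counts(tp=0, fp=0, fn=n_good)    # no Good sample is ever predicted

accuracy = n_ng / (n_ng + n_good)
macro_f1 = (f1_ng + f1_good) / 2  # assumption: macro averaging over the two classes

print(f"accuracy = {accuracy:.2%}, macro F1 = {macro_f1:.3f}")
```

In other words, these values carry no evidence of discrimination between the classes; they follow directly from the class proportions of the test subset.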
The performance degradation observed on the extended dataset is primarily attributable to distribution shift between the original and extended image sets, where differences in surface texture, illumination conditions, and dust morphology between the two acquisition batches reduced the transferability of learned features. These findings highlight that dataset consistency and controlled acquisition conditions are critical prerequisites for reliable model generalization in microscopic dust inspection. Addressing distribution shift through domain adaptation or unified data collection protocols is identified as a priority direction for future work.
To further illustrate the visual differences between the two datasets, representative microscopic images from the original dataset and the extended dataset are shown in
Figure 18.
As shown in
Figure 18, the two datasets differ substantially in surface geometry, illumination conditions, and component orientation, which explains the observed degradation in model performance when transferring across acquisition batches.
3.3. Visualization Analysis Using Grad-CAM
To enhance the interpretability of the CNN-based VCMA defect classification models, Gradient-weighted Class Activation Mapping (Grad-CAM) was employed to visualize the image regions that contributed most strongly to model predictions. Grad-CAM provides class-discriminative localization maps by projecting the gradients of the target class onto the final convolutional feature maps, thereby enabling qualitative assessment of model attention and decision-making behavior. For all models, Grad-CAM heatmaps were generated from the final convolutional layer of each respective architecture—specifically conv5_block3_out for ResNet50, block7a_project_conv for EfficientNetB0, and Conv_1 for MobileNetV2—and overlaid on the input images using a jet colormap (opacity 0.5), with activation values normalized to [0, 1] via min-max scaling, as described in
Section 2.5. Representative Grad-CAM visualizations are shown in
Figure 19.
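The Grad-CAM computation and the normalization and overlay steps described above can be sketched without any deep learning framework. The code below is illustrative only: it assumes that the feature maps and the gradients of the target class score with respect to them have already been extracted from the chosen layer, represents them as nested lists, and uses hypothetical function names and toy values.

```python
def grad_cam(feature_maps, gradients):
    """Grad-CAM heatmap normalized to [0, 1].

    feature_maps, gradients: lists of K channels, each an H x W nested list.
    Channel weights are the spatially averaged gradients; the map is the
    ReLU of the weighted channel sum, followed by min-max scaling.
    """
    k, h, w = len(feature_maps), len(feature_maps[0]), len(feature_maps[0][0])
    weights = [sum(sum(row) for row in g) / (h * w) for g in gradients]
    cam = [
        [max(0.0, sum(weights[c] * feature_maps[c][i][j] for c in range(k))) for j in range(w)]
        for i in range(h)
    ]
    lo = min(min(row) for row in cam)
    hi = max(max(row) for row in cam)
    if hi == lo:  # flat map: nothing to highlight
        return [[0.0] * w for _ in range(h)]
    return [[(v - lo) / (hi - lo) for v in row] for row in cam]


def blend(pixel, heat_color, opacity: float = 0.5):
    """Alpha-blend one RGB heatmap color over one RGB image pixel."""
    return tuple((1 - opacity) * p + opacity * c for p, c in zip(pixel, heat_color))


# Toy example: channel 0 supports the class (positive gradients), channel 1 opposes it.
features = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 2.0]]]
grads = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
heatmap = grad_cam(features, grads)
```

In practice the low-resolution heatmap is resized to the input image size and mapped through a colormap before blending; those steps are omitted here for brevity.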
As shown in
Figure 19a, the original VCMA image contains a localized dust-contaminated region on the reflective metallic surface, providing a reference for interpreting the corresponding attention maps generated by the three CNN models. Among them,
Figure 19c indicates that EfficientNetB0 produced the most spatially concentrated activation pattern, with the highlighted region closely aligned with the actual dust-contaminated area. This behavior suggests that EfficientNetB0 relied more consistently on defect-relevant local features rather than on surrounding textures or reflective artifacts. These observations are qualitatively consistent with the superior classification performance of EfficientNetB0 reported in
Section 3.2.
In contrast,
Figure 19b shows that ResNet50 occasionally activated peripheral or non-defective regions in addition to the true dust area, particularly under the influence of strong surface reflections. These broader and less localized activation patterns suggest higher sensitivity to illumination-induced variation, which may have contributed to less robust classification behavior. Similarly,
Figure 19d demonstrates that MobileNetV2 produced comparatively more diffuse activation maps and, in some cases, appeared to emphasize general surface characteristics rather than the fine dust cluster itself. This behavior is consistent with its lower precision and suggests reduced sensitivity to subtle defect morphology under challenging imaging conditions.
To complement the qualitative interpretation, the Grad-CAM outputs were further reviewed across all 29 test images, and the attention patterns were categorized according to whether the highlighted regions were aligned with visible dust, concentrated on background or reflection artifacts, or spatially diffuse. The corresponding semi-quantitative summary is presented in
Table 7.
As shown in
Table 7, EfficientNetB0 achieved the highest number of correctly localized attention maps, with 22 of 29 cases aligned with physically meaningful dust regions. In contrast, ResNet50 showed more misaligned attention patterns, whereas MobileNetV2 more frequently produced diffuse activations. Taken together, the qualitative and semi-quantitative Grad-CAM analyses indicate that EfficientNetB0 not only achieved strong classification performance but also relied more consistently on defect-relevant visual evidence. This strengthens confidence in its suitability for deployment in explainable industrial inspection workflows.
Overall, the Grad-CAM visualizations provide qualitative support for the classification results and indicate that EfficientNetB0 more consistently attends to physically meaningful dust regions on the VCMA surface. The comparison across
Figure 19b–d highlights clear differences in model attention behavior and strengthens confidence in the interpretability of the selected CNN model for industrial inspection applications.
3.4. Performance of the Sequential Hybrid Inspection Framework
The sequential hybrid inspection framework was evaluated to assess system-level performance for microscopic dust inspection on VCMA components. In this pipeline, EfficientNetB0 was used as the initial image-level screening model to distinguish between defect-free and contaminated samples, and only images predicted as Not Good (NG) were subsequently forwarded to the YOLOv5 detector for localized dust inspection. This sequential design was intended to combine the strong image-level discrimination capability of the CNN classifier with the spatial localization capability of the object detector.
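The routing logic of such a two-stage pipeline can be sketched as follows. The `classify` and `detect` callables stand in for the trained CNN and detector, and all names and the toy stubs are hypothetical; the sketch shows only the control flow, not the study's implementation.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float]


def hybrid_inspect(
    images: Sequence[object],
    classify: Callable[[object], str],
    detect: Callable[[object], List[Box]],
) -> Tuple[List[Dict[str, object]], float]:
    """Screen every image with the classifier and run the detector only on NG predictions.

    Returns per-image results and the fraction of images forwarded to the detector.
    """
    results = []
    forwarded = 0
    for image in images:
        label = classify(image)
        boxes: List[Box] = []
        if label == "NG":
            forwarded += 1
            boxes = detect(image)  # localization is skipped for images screened as Good
        results.append({"label": label, "boxes": boxes})
    fraction = forwarded / len(images) if images else 0.0
    return results, fraction


# Toy stubs in place of real models.
toy_images = ["a", "b", "c", "d"]
toy_classify = lambda img: "NG" if img in {"a", "c"} else "Good"
toy_detect = lambda img: [(0.1, 0.1, 0.2, 0.2)]
toy_results, toy_fraction = hybrid_inspect(toy_images, toy_classify, toy_detect)
```

The control flow makes the trade-off explicit: the detector never sees an image that the classifier labels Good, which saves computation and suppresses detector false alarms on clean parts, but it also means a contaminated image missed at screening cannot be recovered downstream.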
To isolate the structural contribution of the hybrid framework, the same dataset and pre-trained models used in the standalone CNN and YOLO evaluations were retained without additional retraining. System-level performance was then assessed by comparing the standalone and hybrid configurations in terms of image-level decision quality, false-positive and false-negative behavior, and the proportion of images forwarded to the detection stage.
As summarized in
Table 8, the hybrid framework reduced the proportion of images entering the localization stage by forwarding only CNN-screened NG samples to YOLOv5. This reduced detector workload from 100.00% of images in the standalone YOLOv5 pipeline to 57.45% in the hybrid configuration. Under the present experimental setting, the hybrid strategy achieved an overall Precision of 1.00, Recall of 0.900, and F1-score of 0.947, while reducing the false-positive rate from 24% in the standalone YOLOv5 pipeline to 0%. These results indicate that the sequential design can substantially suppress false alarms while preserving high image-level classification reliability and selective localization capability.
As shown in
Figure 20, the proposed hybrid framework reduced the proportion of images forwarded to the YOLOv5 localization stage by restricting detection to CNN-screened NG samples. Compared with the standalone YOLOv5 pipeline, which processes all input images, the proportion of images entering the localization stage decreased from 100.00% to 57.45%. This reduction supports the practical relevance of the hybrid design by limiting unnecessary localization operations on apparently defect-free VCMA samples.
To further examine the limitations of the hybrid framework, the major failure modes of the sequential pipeline were analyzed and are summarized in
Table 9.
Table 9 provides further insight into the limitations of the sequential design by identifying the major failure modes of the hybrid pipeline. In particular, three contaminated samples were incorrectly classified as Good during the initial CNN screening stage and were therefore not forwarded to YOLOv5 for localization. In addition, one NG sample was correctly screened but still missed during the localization stage. No Good samples were unnecessarily forwarded to the detector, and no false-positive YOLO detections were observed after screening. These results indicate that, although the hybrid strategy substantially reduces false detections, its overall performance remains dependent on the reliability of both the screening and localization stages.
The three NG samples missed at the CNN screening stage contained dust particles with low contrast, weak visual saliency, or partial occlusion by surface reflections, reducing the discriminability of image-level features relative to the dominant background texture. The single NG sample missed by YOLOv5 after screening exhibited ambiguous dust morphology and low bounding-box confidence, likely due to boundary overlap with the reflective metallic surface. These observations suggest that future improvements should target hard-example mining and threshold tuning at the screening stage, and small-object detection refinement at the localization stage.
It should be noted that the sequential design introduces a structural vulnerability: any NG sample misclassified as Good during the CNN screening stage will not reach the detector and will remain undetected. In the present study, this corresponds to an FNR of 14.3% for standalone EfficientNetB0 and 10.0% for the hybrid framework (
Table 8). The residual FNR remains a meaningful safety concern for mass production, and future work should explore confidence-aware routing and adaptive threshold adjustment to further reduce missed detections.
Figure 21 presents representative outputs of the proposed hybrid framework. As shown in
Figure 21a,c, contaminated VCMA samples were first identified by the CNN screening stage as NG. The corresponding images were then forwarded to the YOLOv5 detector, which localized the dust-contaminated regions with bounding boxes, as illustrated in
Figure 21b,d. These examples demonstrate the intended coarse-to-fine inspection behavior of the hybrid system, in which global image-level discrimination is followed by localized defect analysis only when needed.
Overall, the sequential hybrid inspection framework combines the complementary strengths of CNN-based classification and YOLO-based localization in a unified inspection strategy. The quantitative and qualitative results suggest that this approach is promising for practical VCMA inspection, particularly in scenarios where false-alarm control and selective localization are more important than applying object detection to every image. At the same time, the failure mode analysis indicates that the robustness of the pipeline can be further improved through enhanced screening sensitivity, detector refinement, or joint optimization of the two stages.
To further characterize the spatial distribution of contamination identified by the hybrid framework, centroid positions of the detected dust instances were analyzed across a normalized 3 × 3 zone grid, as illustrated in
Figure 22. The results indicate that dust particles were predominantly concentrated in the left-center region of the VCMA surface, with the Mid-Left zone accounting for the highest particle density (30%), followed by the Top-Left (20%) and Center (20%) zones. The Top-Right, Bottom-Left, and Bottom-Right regions showed no detectable contamination, which may reflect preferential dust accumulation driven by airflow patterns or component handling during manufacturing. These spatial findings provide additional context for the failure mode analysis reported in
Table 9 and may inform targeted inspection strategies in future work.
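Assigning detection centroids to a normalized 3 × 3 grid can be sketched as below. The sketch assumes image coordinates normalized to [0, 1] with the origin at the top-left corner and y increasing downward; the zone names not mentioned in the text (for example "Top-Center") and the sample centroids are hypothetical.

```python
from collections import Counter

ROWS = ("Top", "Mid", "Bottom")
COLS = ("Left", "Center", "Right")


def zone_of(cx: float, cy: float) -> str:
    """Map a normalized centroid (cx, cy) to one of nine zone names."""
    if not (0.0 <= cx <= 1.0 and 0.0 <= cy <= 1.0):
        raise ValueError("centroid must be normalized to [0, 1]")
    col = min(int(cx * 3), 2)  # clamp so that cx == 1.0 falls in the last column
    row = min(int(cy * 3), 2)
    if row == 1 and col == 1:
        return "Center"
    return f"{ROWS[row]}-{COLS[col]}"


def zone_shares(centroids):
    """Return the share of centroids per zone as a dict of fractions."""
    counts = Counter(zone_of(cx, cy) for cx, cy in centroids)
    total = sum(counts.values())
    return {zone: n / total for zone, n in counts.items()}


# Hypothetical centroids of detected dust boxes.
shares = zone_shares([(0.10, 0.50), (0.20, 0.45), (0.15, 0.10), (0.50, 0.50)])
```

A box centroid in YOLO-normalized coordinates is simply its `(x_center, y_center)` pair, so the mapping can be applied directly to detector outputs.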
3.5. Comparative Discussion and Industrial Implications
The comparative results demonstrate that YOLO-based object detection and CNN-based image classification provide complementary strengths for microscopic dust inspection on VCMA components, rather than representing competing alternatives. Each model family addresses a different operational requirement within the inspection workflow. YOLO-based detectors provide explicit spatial localization of dust particles, which is valuable for defect visualization, process feedback, and localized inspection analysis. In contrast, CNN-based classifiers provide more stable image-level discrimination, making them suitable for rapid screening and decision support in scenarios where precise localization is not always required. To summarize the practical implications of the comparative results, a deployment-oriented comparison of the evaluated inspection approaches is provided in
Table 10. The categorizations in
Table 10 are based on the quantitative results and qualitative observations reported in
Section 3.1,
Section 3.2,
Section 3.3 and
Section 3.4.
As shown in
Table 10, the evaluated approaches differ not only in predictive behavior but also in their practical suitability for industrial inspection tasks. YOLO-based models are advantageous when explicit defect localization is required, whereas CNN-based models are more suitable for stable image-level screening. The proposed hybrid framework combines these complementary strengths and therefore provides the most balanced solution for deployment-oriented VCMA dust inspection.
Among the evaluated object detectors, YOLOv8 achieved the strongest overall aggregate detection performance in terms of Precision, F1-score, and mAP@0.5. However, YOLOv5 was selected as the preferred localization model for subsequent integration into the hybrid framework because it produced the lowest number of false positives while maintaining competitive Recall and the same mAP@0.5:0.95 as YOLOv8 under the present experimental setting. This distinction is practically important because reflective metallic surfaces and textured microscopic backgrounds can easily trigger spurious detections. In such an environment, a detector with stronger false-positive control may be more desirable than a detector with slightly stronger aggregate metrics but substantially higher false-alarm cost. YOLOv11 exhibited the weakest overall detection performance among the three evaluated models, likely due to its anchor-free architecture being less suited to the limited-data and high-background-complexity conditions of the present study.
For image-level classification, EfficientNetB0 emerged as the most reliable CNN architecture. In addition to achieving the strongest overall balance in the single-run classification metrics, it also demonstrated the most stable behavior across the design-of-experiments evaluation and cross-validation analysis. Compared with ResNet50, EfficientNetB0 provided better predictive performance under the present limited-data setting, while compared with MobileNetV2, it showed stronger reliability and lower fold-to-fold variation. The Grad-CAM analysis further reinforced this finding by showing that EfficientNetB0 more consistently focused on physically meaningful dust-contaminated regions rather than reflective artifacts or irrelevant background patterns. Taken together, these results indicate that model selection for industrial microscopic inspection should consider not only predictive accuracy, but also stability and interpretability.
The evaluation of the sequential hybrid framework further highlights the practical value of combining these two model families in a coarse-to-fine inspection strategy. By using EfficientNetB0 as the initial screening stage and YOLOv5 as the subsequent localization stage, the proposed pipeline suppresses false alarms at the system level while preserving the ability to localize contamination when it is most relevant. The framework also reduces the number of images forwarded to the detector, thereby limiting unnecessary localization operations on apparently defect-free samples. This design is particularly suitable for industrial inspection environments in which false-alarm control, selective localization, and interpretable decision-making are more important than applying localization to every image. The hybrid results therefore suggest that integrating classification and detection can improve the practical usability of automated VCMA inspection systems.
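The coarse-to-fine routing described above can be summarized in a few lines of code. The following is a minimal sketch rather than the implementation used in this study: the `classify` and `detect` callables, the `InspectionResult` container, and the 0.5 decision threshold are all hypothetical placeholders standing in for the trained EfficientNetB0 and YOLOv5 models.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical box format: x1, y1, x2, y2, confidence.
Box = Tuple[float, float, float, float, float]


@dataclass
class InspectionResult:
    label: str                      # "Good" or "NG"
    ng_probability: float           # screening score for the NG class
    boxes: List[Box] = field(default_factory=list)
    detector_called: bool = False


def inspect(image,
            classify: Callable[[object], float],
            detect: Callable[[object], List[Box]],
            ng_threshold: float = 0.5) -> InspectionResult:
    """Screen the image first; run the detector only on suspected NG images."""
    p_ng = classify(image)
    if p_ng < ng_threshold:
        # Screened as Good: the detector is skipped, so a miss here is final.
        return InspectionResult("Good", p_ng)
    boxes = detect(image)
    return InspectionResult("NG", p_ng, boxes, detector_called=True)


if __name__ == "__main__":
    # Stand-in models so the sketch runs without trained weights.
    def fake_classifier(img):
        return img["p_ng"]

    def fake_detector(img):
        return [(10.0, 12.0, 18.0, 20.0, 0.81)]

    for sample in ({"p_ng": 0.08}, {"p_ng": 0.93}):
        print(inspect(sample, fake_classifier, fake_detector))
```

How an image that is screened as NG but yields no detector boxes should be reported is a policy decision that this sketch leaves open; here it simply remains NG with an empty box list.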
From an industrial perspective, the findings of this study have three important implications. First, detector selection should be guided by operational priorities rather than by a single summary metric. In the present application, false-positive suppression was more critical than maximizing sensitivity because unnecessary alarms can interrupt production flow and increase manual verification effort. Second, image-level classifiers can serve as effective front-end screening tools when supported by reliable interpretability analysis, thereby improving trust in deployment. Third, hybrid inspection architectures offer a realistic compromise between throughput and diagnostic capability by reserving localized analysis for images most likely to contain contamination. These considerations are directly relevant to practical VCMA inspection and may also extend to other precision-manufacturing settings involving subtle microscopic defects.
Despite these promising findings, the results should be interpreted in light of several limitations. The dataset size remained relatively limited, and the experiments were conducted under controlled imaging conditions designed to reflect a practical but still bounded inspection environment. As a result, the reported performance may not fully capture the variation that arises from production drift, defect morphology diversity, and illumination change. In addition, the sequential hybrid design depends on the reliability of both the screening and localization stages; defects misclassified as Good during the initial CNN stage will not be forwarded to the detector for subsequent localization. The failure mode analysis further indicates that missed defects at the screening stage remain a non-negligible risk in the present pipeline. Future work should therefore consider larger and more diverse datasets, more extensive system-level validation, and possibly joint optimization or confidence-aware interaction between the classification and detection stages.
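One way such confidence-aware interaction could look is a three-way routing rule in which borderline screening scores are still sent to the detector instead of being passed as Good. The sketch below is illustrative only: the `Route` names and the `low` and `high` thresholds are assumptions, not values or mechanisms evaluated in this study.

```python
from enum import Enum


class Route(Enum):
    PASS_AS_GOOD = "pass_as_good"                # confident Good: skip detector
    LOCALIZE = "localize"                        # confident NG: run detector
    LOCALIZE_AND_REVIEW = "localize_and_review"  # uncertain: run detector, flag


def route(p_ng: float, low: float = 0.2, high: float = 0.8) -> Route:
    """Map the screening score for the NG class to a routing decision."""
    if not 0.0 <= low <= high <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= low <= high <= 1")
    if p_ng < low:
        return Route.PASS_AS_GOOD
    if p_ng >= high:
        return Route.LOCALIZE
    # Scores between the thresholds are treated as uncertain.
    return Route.LOCALIZE_AND_REVIEW


if __name__ == "__main__":
    for score in (0.05, 0.50, 0.95):
        print(score, route(score).value)
```

Lowering `low` trades detector workload for a smaller chance that a contaminated image is dropped at the screening stage.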
Overall, the comparative analysis indicates that EfficientNetB0 and YOLOv5 form the most suitable model combination for the present VCMA dust inspection task, balancing classification reliability, localization practicality, and explainability. The proposed sequential hybrid framework is therefore a promising direction for automated microscopic contamination inspection in HDD manufacturing and related precision assembly applications.
4. Discussion
This study investigated microscopic dust inspection on VCMA components from three complementary perspectives: object detection for dust localization, image-level classification for Good/NG screening, and a sequential hybrid framework integrating both capabilities. The findings indicate that model suitability in this application cannot be judged by a single metric alone but must instead be interpreted in relation to reflective microscopic backgrounds, limited-data conditions, false-alarm cost, and the operational priorities of industrial inspection.
4.1. Summary of Key Findings
For object detection, the comparative results showed that the evaluated YOLO models exhibited different trade-offs between aggregate detection performance and false-positive control. YOLOv8 achieved the strongest overall aggregate detection metrics, whereas YOLOv5 was selected as the preferred detector for subsequent hybrid integration because it produced fewer false positives under reflective and textured microscopic backgrounds. This distinction is important in VCMA inspection, where unnecessary alarms may increase manual verification effort and interrupt production flow. YOLOv11, in contrast, showed the weakest performance among the three evaluated models. This may be partly attributable to its anchor-free architecture being less suited to the limited-data and high-background-complexity conditions of the present study, resulting in reduced convergence stability and lower detection sensitivity for small dust targets, although YOLOv8, which is also anchor-free, did not show the same weakness. These findings suggest that, under the present limited-data industrial setting, deployment-oriented detector selection should prioritize operational robustness alongside conventional performance metrics.
For image-level classification, EfficientNetB0 emerged as the most reliable CNN architecture. Its advantage was not limited to strong classification performance but also extended to cross-validation stability and interpretability. Compared with ResNet50, EfficientNetB0 provided better predictive performance under the present limited-data condition, while compared with MobileNetV2, it showed lower fold-to-fold variation despite MobileNetV2 achieving a higher AUC. The Grad-CAM results further strengthened this conclusion by showing that EfficientNetB0 more consistently attended to physically meaningful dust-contaminated regions rather than reflective or irrelevant background structures. These observations indicate that robust deployment-oriented model selection should consider not only predictive accuracy, but also stability and interpretability.
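Fold-to-fold variation can be quantified by reporting the spread of a metric across cross-validation folds alongside its mean. The sketch below assumes five folds and uses invented per-fold accuracies for two hypothetical models; none of these numbers are results from this study.

```python
from statistics import mean, stdev
from typing import Dict, Sequence


def summarize_folds(scores: Sequence[float]) -> Dict[str, float]:
    """Mean, sample standard deviation, and range of a per-fold metric."""
    return {
        "mean": mean(scores),
        "std": stdev(scores),              # sample std (n - 1 denominator)
        "range": max(scores) - min(scores),
    }


if __name__ == "__main__":
    # Invented accuracies: same average, very different stability.
    folds = {
        "model_x": [0.90, 0.91, 0.89, 0.90, 0.90],
        "model_y": [0.95, 0.84, 0.93, 0.86, 0.92],
    }
    for name, scores in folds.items():
        s = summarize_folds(scores)
        print(f"{name}: mean={s['mean']:.3f} std={s['std']:.3f} range={s['range']:.3f}")
```

Two models with the same average score can differ substantially in spread, which is why stability is treated here as a selection criterion in its own right.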
Building upon the complementary strengths of YOLO-based localization and CNN-based global discrimination, the sequential hybrid inspection framework showed promising system-level behavior. By filtering defect-free samples before localization, the hybrid pipeline reduced unnecessary detector usage and strongly suppressed false alarms while preserving localized dust inspection for suspected NG samples. At the same time, the failure mode analysis indicates that the reliability of the screening stage remains critical, since contaminated samples incorrectly classified as Good during the initial CNN stage are never forwarded to the detector and therefore cannot be recovered during subsequent localization.
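The dependence of the system on the screening stage can be made explicit with a simple image-level model. The notation below is introduced here for illustration and is not taken from the study; it assumes that the detector runs only on forwarded images and that each stage can be described by a single rate.

```latex
% Assumed notation (not defined in the paper):
%   \pi               : fraction of inspected images that are truly contaminated
%   R_{\mathrm{cls}}  : screening recall, P(forwarded | contaminated)
%   F_{\mathrm{cls}}  : screening false-positive rate, P(forwarded | clean)
%   R_{\mathrm{det}}  : P(dust localized | contaminated and forwarded)
%   q_{\mathrm{det}}  : P(at least one box | clean and forwarded)
\begin{align}
R_{\mathrm{sys}} &= R_{\mathrm{cls}}\, R_{\mathrm{det}}
  \;\le\; \min\!\left(R_{\mathrm{cls}},\, R_{\mathrm{det}}\right), \\
F_{\mathrm{sys}} &= F_{\mathrm{cls}}\, q_{\mathrm{det}}
  \;\le\; \min\!\left(F_{\mathrm{cls}},\, q_{\mathrm{det}}\right), \\
f_{\mathrm{fwd}} &= \pi\, R_{\mathrm{cls}} + (1-\pi)\, F_{\mathrm{cls}}.
\end{align}
```

Under these assumptions the same multiplicative structure that suppresses false alarms also caps sensitivity: with hypothetical values $R_{\mathrm{cls}} = 0.95$ and $R_{\mathrm{det}} = 0.90$, the system-level recall is $0.855$, lower than either stage alone. The forwarded fraction $f_{\mathrm{fwd}}$ determines how often the detector is invoked.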
4.2. Industrial and Scientific Implications
From an industrial perspective, the present findings emphasize that automated microscopic inspection systems should be designed according to operational priorities rather than a single summary metric. In the current VCMA application, false-positive suppression was especially important because unnecessary alarms may slow production and increase verification burden. This explains why YOLOv5 was favored for hybrid deployment despite YOLOv8 achieving stronger aggregate detection metrics. Likewise, image-level CNN classifiers are valuable not only because they enable rapid screening, but also because they can act as front-end filters that reduce detector workload when incorporated into a hybrid framework. These considerations are directly relevant to VCMA cleanliness inspection and may also extend to other precision-manufacturing settings involving subtle microscopic contaminants or low-contrast defects.
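The effect of operational priorities on detector ranking can be illustrated with a small cost calculation. The sketch below uses invented confusion counts for two hypothetical detectors, `detector_a` and `detector_b`, and invented per-error costs; it is not a re-analysis of the YOLO results reported in this study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    tp: int  # true positives
    fp: int  # false positives (false alarms)
    fn: int  # false negatives (missed defects)

    @property
    def precision(self) -> float:
        return self.tp / (self.tp + self.fp)

    @property
    def recall(self) -> float:
        return self.tp / (self.tp + self.fn)

    @property
    def f1(self) -> float:
        return 2 * self.tp / (2 * self.tp + self.fp + self.fn)


def operating_cost(c: Counts, cost_fp: float, cost_fn: float) -> float:
    """Total cost when each false alarm and each miss has a fixed price."""
    return cost_fp * c.fp + cost_fn * c.fn


if __name__ == "__main__":
    # Invented counts: A has the higher F1, B raises fewer false alarms.
    detector_a = Counts(tp=90, fp=30, fn=10)
    detector_b = Counts(tp=70, fp=8, fn=30)
    for name, c in (("A", detector_a), ("B", detector_b)):
        print(name, f"F1={c.f1:.3f}",
              "cost(fp-heavy)=", operating_cost(c, cost_fp=3, cost_fn=1),
              "cost(fn-heavy)=", operating_cost(c, cost_fp=1, cost_fn=3))
```

With these invented numbers the detector with the higher F1-score is the more expensive one when false alarms dominate the cost, and the cheaper one when misses dominate, which is the sense in which a single summary metric cannot settle the choice.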
Scientifically, this study reinforces several important insights. First, model ranking in industrial micro-defect inspection is strongly influenced by dataset scale, visual complexity, and deployment priorities; newer architectures do not necessarily provide the most suitable practical solution under limited-data conditions. Second, explainability is not merely an auxiliary feature, but an important validation tool when the visual distinction between clean and contaminated samples is subtle. Third, the results suggest that structurally integrating image-level discrimination with selective localization offers a promising direction for inspection systems that require both efficiency and diagnostic transparency.
4.3. Industrial Deployment Considerations and Comparison with Alternative Inspection Approaches
The VCMA samples used in this study were collected directly from the VCMA inspection station on an actual HDD manufacturing production line, as described in Section 2.1. Image acquisition was performed under the same physical conditions present during routine production operations, including reflective metallic surfaces, controlled LED ring illumination, and fixture-assisted positioning. The experimental dataset therefore reflects the visual complexity and operational constraints encountered in real-world industrial inspection rather than idealized laboratory conditions. These properties confirm that the proposed framework was developed and evaluated under imaging conditions representative of actual HDD production practice.
Although the proposed framework was evaluated offline rather than directly integrated into an in-line production system, the imaging setup, fixture-assisted positioning, and controlled acquisition procedure were intentionally designed to emulate the realistic inspection conditions of HDD manufacturing environments. The proposed framework is intended as a foundational model to be integrated into a dedicated inspection machine for deployment in actual VCMA production lines in future work. The use of lightweight architectures such as EfficientNetB0 and YOLOv5s supports the feasibility of future real-time deployment due to their favorable balance between detection performance and computational efficiency. In the present study, the hybrid framework achieved an average inference time of 11.18 ms per image, which is favorable for time-sensitive industrial inspection scenarios.
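To put the reported latency in throughput terms, the short calculation below converts milliseconds per image into images per second and shows how the average latency of a sequential pipeline depends on the fraction of images forwarded to the detector. Only the 11.18 ms figure comes from this study; the function names and the per-stage timings in the second example are hypothetical.

```python
def images_per_second(latency_ms: float) -> float:
    """Serial throughput implied by a per-image latency."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms


def expected_latency_ms(t_screen_ms: float,
                        t_detect_ms: float,
                        forwarded_fraction: float) -> float:
    """Average latency when every image is screened and only some are localized."""
    if not 0.0 <= forwarded_fraction <= 1.0:
        raise ValueError("forwarded_fraction must lie in [0, 1]")
    return t_screen_ms + forwarded_fraction * t_detect_ms


if __name__ == "__main__":
    # Reported average hybrid latency: 11.18 ms per image.
    print(f"{images_per_second(11.18):.1f} images/s")  # 89.4 images/s

    # Hypothetical stage timings, for illustration only.
    print(expected_latency_ms(t_screen_ms=4.0, t_detect_ms=12.0,
                              forwarded_fraction=0.5))
```

The throughput figure covers model inference on a single serial stream only; image acquisition, part handling, and data transfer in an in-line machine would add to the cycle time.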
To contextualize the proposed hybrid framework within the broader landscape of inspection strategies, a conceptual comparison with alternative approaches is provided in Table 11. Conventional methods rely on fixed geometric templates, color thresholds, and pixel comparison algorithms that require manual parameter tuning by domain experts [4]. These methods are highly sensitive to illumination variation, surface reflection, and focus inconsistency, making them difficult to generalize for microscopic VCMA inspection where reflective metallic surfaces and sub-100 μm dust particles present persistent challenges [52,53]. In contrast, deep learning-based approaches can learn discriminative features directly from data without requiring explicit rule design, offering improved robustness under complex and variable imaging conditions [4,52,53].
4.4. Limitations and Future Work
The present findings should be interpreted in light of several limitations. First, the dataset remained relatively limited, particularly for the detection task, which may restrict generalizability across broader production conditions and increase sensitivity to data partition effects. Second, the experiments were conducted under a controlled imaging setup designed to reflect representative industrial conditions, but not necessarily the full range of process drift, contamination morphology variation, and illumination changes that may occur in long-term manufacturing use. Although brightness adjustment within ±20% was incorporated into the training augmentation pipeline, no systematic sensitivity analysis was performed to evaluate model performance under deliberate variations in image brightness, contrast, blur, or lens contamination. The degree to which the reported accuracy degrades under such real-world disturbances therefore remains an open question. Third, the hybrid framework was implemented as a sequential combination of independently evaluated modules rather than as a jointly optimized end-to-end system. As a result, stage-to-stage error propagation remains a meaningful constraint, especially when contamination is missed during the initial screening stage. These limitations suggest that the current study should be interpreted as a strong proof of feasibility rather than a final generalized deployment validation.
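A systematic sensitivity analysis of the kind noted as missing could be organized as a sweep over controlled perturbations, re-scoring the same labelled images after each one. The following is a dependency-free sketch under simplifying assumptions: images are small grayscale grids with intensities in [0, 1], the `predict` callable is a toy stand-in for a trained classifier, and all function names are hypothetical. It does not reproduce any experiment from this study.

```python
from typing import Callable, Dict, List, Sequence

Image = List[List[float]]  # grayscale intensities in [0, 1]


def _clip(v: float) -> float:
    return min(1.0, max(0.0, v))


def scale_brightness(img: Image, factor: float) -> Image:
    return [[_clip(p * factor) for p in row] for row in img]


def scale_contrast(img: Image, factor: float) -> Image:
    # Stretch or compress intensities around the image mean.
    m = sum(sum(row) for row in img) / sum(len(row) for row in img)
    return [[_clip(m + factor * (p - m)) for p in row] for row in img]


def box_blur(img: Image, radius: int) -> Image:
    # Mean filter over a (2 * radius + 1) window, truncated at the borders.
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [img[j][i]
                      for j in range(max(0, y - radius), min(h, y + radius + 1))
                      for i in range(max(0, x - radius), min(w, x + radius + 1))]
            row.append(sum(window) / len(window))
        out.append(row)
    return out


def sensitivity_sweep(images: Sequence[Image],
                      labels: Sequence[str],
                      predict: Callable[[Image], str],
                      perturbations: Dict[str, Callable[[Image], Image]]
                      ) -> Dict[str, float]:
    """Accuracy of `predict` on the same images under each named perturbation."""
    results = {}
    for name, perturb in perturbations.items():
        correct = sum(predict(perturb(img)) == y for img, y in zip(images, labels))
        results[name] = correct / len(images)
    return results


if __name__ == "__main__":
    clean = [[0.2] * 3 for _ in range(3)]
    dusty = [[0.2, 0.2, 0.2], [0.2, 0.9, 0.2], [0.2, 0.2, 0.2]]

    def toy_predict(img: Image) -> str:
        # Toy rule: any very bright pixel counts as dust.
        return "NG" if max(max(r) for r in img) > 0.8 else "Good"

    sweep = {
        "identity": lambda im: im,
        "brightness_x0.8": lambda im: scale_brightness(im, 0.8),
        "brightness_x1.2": lambda im: scale_brightness(im, 1.2),
        "contrast_x0.5": lambda im: scale_contrast(im, 0.5),
        "blur_r1": lambda im: box_blur(im, 1),
    }
    for name, acc in sensitivity_sweep([clean, dusty], ["Good", "NG"],
                                       toy_predict, sweep).items():
        print(f"{name}: accuracy={acc:.2f}")
```

In a real evaluation the nested lists would be replaced by image arrays and the toy rule by the trained models, but the structure of the sweep (fixed test set, one perturbation at a time, one score per setting) would stay the same.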
Future research may extend the proposed framework through larger and more diverse datasets, more extensive system-level validation under production drift, and tighter integration between screening and localization stages. Promising directions include confidence-guided routing, adaptive thresholds, multi-defect inspection, and real-time deployment optimization using hardware acceleration or robotic inspection platforms. Systematic environmental robustness evaluation, including controlled perturbations of brightness, contrast, blur, and lens contamination, is also identified as a priority direction, with brightness normalization, adaptive histogram equalization, and defocus-aware filtering as candidate mitigation strategies. Furthermore, the integration of specialized tiny object detection (TOD) strategies, such as SAHI (Slicing Aided Hyper Inference), represents a promising direction for improving detection sensitivity on microscopic dust particles. Applying slicing-based inference or scale-enhanced architectures in future work may further improve localization performance under the limited-data and high-magnification imaging conditions characteristic of VCMA inspection. Continued incorporation of explainable AI techniques may further improve transparency, operator trust, and fault diagnosis capability in automated microscopic inspection systems.
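The core idea behind slicing-based inference is to run the detector on overlapping tiles so that small particles occupy more pixels relative to the network input, and then to map tile-level boxes back to full-image coordinates. The sketch below shows only this tiling geometry in plain Python. It does not use the SAHI library's API; the function names, tile size, and overlap ratio are hypothetical.

```python
from typing import List, Tuple

Tile = Tuple[int, int, int, int]        # x1, y1, x2, y2 in full-image pixels
Box = Tuple[float, float, float, float]  # x1, y1, x2, y2


def tile_starts(length: int, tile: int, overlap: float) -> List[int]:
    """Start offsets of overlapping tiles that cover [0, length)."""
    if tile >= length:
        return [0]
    step = max(1, int(tile * (1.0 - overlap)))
    starts = list(range(0, length - tile, step))
    starts.append(length - tile)  # final tile sits flush with the border
    return starts


def make_tiles(width: int, height: int, tile: int, overlap: float) -> List[Tile]:
    """Row-major list of tile rectangles for a width x height image."""
    return [(x, y, min(x + tile, width), min(y + tile, height))
            for y in tile_starts(height, tile, overlap)
            for x in tile_starts(width, tile, overlap)]


def to_full_image(box: Box, tile_origin: Tuple[int, int]) -> Box:
    """Shift a box predicted in tile coordinates into full-image coordinates."""
    ox, oy = tile_origin
    x1, y1, x2, y2 = box
    return (x1 + ox, y1 + oy, x2 + ox, y2 + oy)


if __name__ == "__main__":
    tiles = make_tiles(width=1000, height=700, tile=400, overlap=0.25)
    print(len(tiles), "tiles")
    for t in tiles:
        print(t)
    # A box found in the last tile, expressed in full-image coordinates.
    print(to_full_image((10, 20, 30, 40), tile_origin=tiles[-1][:2]))
```

Because neighbouring tiles overlap, the same particle can be detected more than once, so a merging step such as non-maximum suppression on the shifted boxes would still be needed; it is omitted here.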
Overall, this study indicates that EfficientNetB0 and YOLOv5 form the most suitable model combination for the present VCMA dust inspection task, balancing classification reliability, localization practicality, and explainability. The proposed sequential hybrid framework therefore represents a promising direction for automated microscopic contamination inspection in HDD manufacturing and related precision assembly applications.
5. Conclusions
This study investigated deep-learning-based inspection methods for detecting and classifying microscopic dust contamination on VCMA components and further evaluated a sequential hybrid inspection framework integrating CNN-based image classification with YOLO-based object detection. The results demonstrate that both standalone models and the hybrid strategy are promising for industrial microscopic inspection under limited-data conditions.
For object detection, the evaluated YOLO models exhibited different trade-offs between aggregate detection performance and false-positive control. Although YOLOv8 achieved the strongest overall detection metrics, YOLOv5 was selected as the preferred detector for hybrid integration because it produced fewer false positives under reflective and textured microscopic backgrounds while maintaining competitive localization performance. YOLOv11 showed the weakest detection performance under the present limited-data setting, likely reflecting optimization instability under the constrained sample scale and background complexity of VCMA microscopic inspection. In the present experimental setting, YOLOv5 achieved an inference time of 12.9 ms per image, with a precision of 0.75 and mAP@0.5 of 0.62, supporting its suitability for real-time or in-line inspection scenarios.
For image-level classification, EfficientNetB0 emerged as the most reliable CNN architecture. It provided strong classification performance together with better validation stability and more physically meaningful Grad-CAM attention patterns than ResNet50 and MobileNetV2. These findings indicate that EfficientNetB0 offers a favorable balance between predictive accuracy, robustness, and interpretability for Good/NG classification of VCMA images.
By structurally integrating global image-level screening with localized defect detection, the proposed hybrid inspection framework reduced detector workload and suppressed false alarms while preserving selective dust localization for suspected contaminated samples. This coarse-to-fine strategy reflects practical industrial deployment requirements in which efficiency, reliability, and interpretability are equally important.
Overall, this work highlights the complementary strengths of detection and classification models and shows that their integration within a sequential hybrid framework provides a practical direction for microscopic defect inspection in precision manufacturing. The findings contribute useful guidance for intelligent quality inspection system design and provide a foundation for future developments toward adaptive, robust, and high-throughput industrial inspection platforms.