A Semi-Automatic Labeling Framework for PCB Defects via Deep Embeddings and Density-Aware Clustering
Highlights
- A production-oriented, semi-automatic labeling pipeline reliably converts defect ROIs into consistent class labels by coupling margin-aware cropping, pretrained embeddings, and clustering, achieving cluster-level label quality without dense, pixel-wise annotation.
- Cluster-level decisions concentrate human effort where it matters—on ambiguous, low-consistency clusters—thereby reducing labeling latency while maintaining label fidelity.
Abstract
1. Introduction
- (i) An end-to-end, production-oriented workflow that connects ROI proposals to batch labels via margin-aware cropping; backbone-agnostic embeddings (Histogram of Oriented Gradients (HOG) [14,15], ResNet-50, or ViT-B/16); auto-configurable clustering (k-means [16,17], Gaussian Mixture Model (GMM) [18,19], or Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) [20,21]); and cluster-level visualization (top-K montages) for rapid human confirmation.
- (ii) Operational mechanisms, including data-scale heuristics for HDBSCAN and memory-safe incremental PCA for HOG (a minimal sketch follows this list).
- (iii) A comprehensive evaluation on a real PCB corpus of 9354 crops across 10 categories (long-tailed distribution), reporting agreement metrics alongside t-SNE projections and montage-based qualitative analyses.
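The operational mechanisms in (ii) can be sketched as follows. This is a minimal illustration, not the authors' code: the HOG parameters shown are one standard parameterization that yields the 26,244-D descriptor listed in the encoder table (9 orientations, 8 × 8 cells, 2 × 2 blocks on 224 × 224 crops), and the data-scale rule for HDBSCAN's `min_cluster_size` is an assumption, since the paper only states that such heuristics are used.

```python
"""Sketch: memory-safe HOG embeddings via IncrementalPCA, plus a data-scale
heuristic for HDBSCAN. Illustrative only; parameter rules are assumptions."""
import numpy as np
from skimage.feature import hog
from sklearn.decomposition import IncrementalPCA
import hdbscan


def hog_embeddings_incremental(crops, n_components=128, batch=256):
    """crops: list of 224x224 grayscale arrays.
    Fits PCA batch by batch so the full 26,244-D HOG matrix never lives in memory."""
    def descriptor(img):
        return hog(img, orientations=9, pixels_per_cell=(8, 8),
                   cells_per_block=(2, 2), feature_vector=True)

    ipca = IncrementalPCA(n_components=n_components)
    buf = []
    for img in crops:                      # pass 1: incremental fit
        buf.append(descriptor(img))
        if len(buf) == batch:
            ipca.partial_fit(np.asarray(buf))
            buf = []
    if len(buf) >= n_components:           # partial_fit needs >= n_components rows
        ipca.partial_fit(np.asarray(buf))
    # pass 2: transform one crop at a time
    return np.vstack([ipca.transform(descriptor(img)[None, :])[0] for img in crops])


def cluster_with_scale_heuristic(X):
    """Assumed data-scale rule: min_cluster_size grows with the corpus size."""
    n = len(X)
    min_cluster_size = max(10, int(0.005 * n))
    min_samples = max(5, min_cluster_size // 2)
    return hdbscan.HDBSCAN(min_cluster_size=min_cluster_size,
                           min_samples=min_samples).fit_predict(X)
```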
Related Works
2. Materials and Methods
2.1. Pipeline Overview
2.2. Dataset Preparation
2.3. Embedding Extraction
2.4. Clustering Algorithms and Implementation
2.5. Accuracy Evaluation
3. Results
3.1. Comparative Analysis of Embedding Backbones and Clustering Algorithms
3.2. Macro- vs. Micro-Cluster Purity
3.3. Qualitative Analyses
- (i) Complex backgrounds. In some patches, high-frequency textures or soldering residues were visually dominant, causing defect signals, such as pinholes or scratches, to be absorbed into background patterns.
- (ii) Lighting and contrast variations. Tiny or low-contrast defects (e.g., hairline scratches, faint pollution marks) were frequently misassigned due to uneven illumination or a low signal-to-noise ratio.
- (iii) Overlapping class definitions. Semantic ambiguity between classes, such as Pollution vs. Foreign Particles or Spur vs. Spurious Copper, often led to mixed clusters, even under otherwise stable conditions.
3.4. Practical Labeling Impact
3.5. Inter-Rater Agreement
3.6. Generalization Across Imbalance Ratios
4. Discussion
- First, resource fit. HOG (+ incremental PCA) supports CPU-only scenarios; ResNet-50 and ViT run on either a GPU or a CPU.
- Second, scale and safety. The pipeline handles thousands of crops with memory-safe routines and exports artifacts for traceability, aiding audits and compliance.
- Third, human factors. Cluster montages enable at-a-glance verification of within-cluster consistency and borderline cases, boosting review speed without degrading label quality (a montage-building sketch follows the recommendations below).
- Bootstrapping at scale. ResNet-50 (or ViT) + HDBSCAN is used to surface high-purity micro-clusters for rapid initial labeling.
- Stable taxonomies/periodic re-builds. Use k-means with K set to the expected number of classes to maximize ARI and partition stability.
- Edge/CPU-only checks. HOG (+ incremental PCA) with HDBSCAN is a lightweight screening path (lower precision but operationally simple).
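The following sketch shows one way the top-K review montages could be assembled for human confirmation. It is an assumption of the mechanism, not the authors' visualization code: for each cluster, the K crops closest to the cluster centroid in embedding space are tiled into a single image.

```python
"""Sketch: per-cluster top-K montages for rapid human review (assumed layout)."""
from pathlib import Path
import numpy as np
from PIL import Image


def save_topk_montages(crop_paths, embeddings, labels, out_dir, k=25, tile=96):
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    E = np.asarray(embeddings)
    y = np.asarray(labels)
    for cid in sorted(set(y.tolist()) - {-1}):          # -1 = HDBSCAN noise
        idx = np.where(y == cid)[0]
        center = E[idx].mean(axis=0)
        # K members closest to the cluster centroid
        order = idx[np.argsort(np.linalg.norm(E[idx] - center, axis=1))][:k]
        cols = int(np.ceil(np.sqrt(len(order))))
        rows = int(np.ceil(len(order) / cols))
        canvas = Image.new("RGB", (cols * tile, rows * tile), "white")
        for i, j in enumerate(order):
            thumb = Image.open(crop_paths[j]).convert("RGB").resize((tile, tile))
            canvas.paste(thumb, ((i % cols) * tile, (i // cols) * tile))
        canvas.save(out / f"cluster_{cid:03d}_top{len(order)}.png")
```

A reviewer then confirms or rejects each montage as a whole, which is what concentrates effort on ambiguous, low-consistency clusters.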
- (i) Density retuning for HDBSCAN (e.g., co-scheduling minimum cluster size and minimum samples and using scale-normalized distances); a sketch of (i) and (ii) follows this list.
- (ii) Class-conditional auto-K or evidence-based model selection for partitioners to avoid over- or under-partitioning.
- (iii) Guided merge/split operations with cluster-level label propagation.
- (iv) Targeted encoder refinement (lightweight contrastive fine-tuning on hard clusters, with a coarse-to-fine curriculum).
- (v) Data-centric normalization (illumination/background control, crop margin standardization) to reduce spurious boundaries.
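The sketch below illustrates items (i) and (ii) under stated assumptions: scale-normalized distances are approximated by standardizing embeddings, the HDBSCAN parameter grid and the noise-fraction objective are illustrative choices rather than the paper's values, and auto-K is implemented as BIC-based model selection for a GMM partitioner.

```python
"""Sketch: HDBSCAN density retuning and BIC-based auto-K (assumed grids/objective)."""
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.mixture import GaussianMixture
import hdbscan


def retuned_hdbscan(X, sizes=(10, 25, 50), ratios=(0.25, 0.5, 1.0)):
    Xs = StandardScaler().fit_transform(X)          # scale-normalized distances
    best = None
    for mcs in sizes:
        for r in ratios:                            # co-schedule min_samples with size
            labels = hdbscan.HDBSCAN(min_cluster_size=mcs,
                                     min_samples=max(1, int(r * mcs))).fit_predict(Xs)
            score = -np.mean(labels == -1)          # simple surrogate: low noise fraction
            if best is None or score > best[0]:
                best = (score, labels)
    return best[1]


def auto_k_gmm(X, k_range=range(2, 21), seed=42):
    """Evidence-based model selection: pick K with the lowest BIC."""
    bics = {k: GaussianMixture(n_components=k, random_state=seed).fit(X).bic(X)
            for k in k_range}
    k_best = min(bics, key=bics.get)
    return GaussianMixture(n_components=k_best, random_state=seed).fit_predict(X)
```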
- (i) Encoder learning. Evaluate self-supervised or domain-adaptive encoders and metric-learning objectives to sharpen separability for rare/tiny defects, potentially improving both HDBSCAN purity and k-means ARI.
- (ii) Interactive loops. Integrate active selection (hard clusters first), human feedback (merge/split logs), and label propagation to iteratively refine clusters and train a supervised detector in a curriculum from coarse to fine.
- (iii) Taxonomy and imbalance handling. Explore cost-sensitive clustering objectives, auto-K with model evidence, and uncertainty-aware consolidation of micro-clusters to stabilize labels under shifting class definitions; continue reporting IR and Gini as standard imbalance diagnostics (a sketch of both diagnostics follows this list).
- (iv) Material robustness. Extend evaluation across heterogeneous PCB materials (e.g., different solder mask colors and surface finishes) to quantify cross-material stability and investigate adaptive normalization or domain-adaptive encoder fine-tuning for improved generalization (see Appendix B).
- (v) Downstream detection. Use cluster-confirmed labels to train tiny-object detectors and few-shot variants, and explicitly compare detector accuracy when trained on cluster-labeled versus manually labeled data. Closing this loop with pseudo-label repair could ultimately yield a continuous learning pipeline across product revisions.
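The imbalance diagnostics referenced in (iii) can be computed as follows. The definitions assumed here are the common ones (IR as the max/min class-count ratio; Gini as the mean absolute pairwise difference normalized by twice the mean), not formulas quoted from the paper; the example counts come from the class-distribution table, labeled classes only.

```python
"""Sketch: standard imbalance diagnostics (IR and Gini) over class counts."""
import numpy as np


def imbalance_ratio(counts):
    c = np.asarray(counts, dtype=float)
    return c.max() / c.min()


def gini_index(counts):
    c = np.asarray(counts, dtype=float)
    # mean absolute difference between all count pairs, normalized by 2 * mean
    return np.abs(c[:, None] - c[None, :]).mean() / (2.0 * c.mean())


# Counts from the class-distribution table (excluding the Unknown bucket):
counts = [4625, 1302, 982, 632, 329, 57, 49, 38, 3]
print(round(imbalance_ratio(counts)))   # ~1542, matching the largest IR setting reported
print(round(gini_index(counts), 3))
```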
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| Abbreviation | Full Term |
|---|---|
| VI | Visual Inspection |
| PCB | Printed Circuit Board |
| HOG | Histogram of Oriented Gradients |
| ViT | Vision Transformer |
| GMM | Gaussian Mixture Model |
| HDBSCAN | Hierarchical Density-Based Spatial Clustering of Applications with Noise |
| t-SNE | t-distributed Stochastic Neighbor Embedding |
| IR | Imbalance Ratio |
| NMI | Normalized Mutual Information |
| AMI | Adjusted Mutual Information |
| ARI | Adjusted Rand Index |
| BIC | Bayesian Information Criterion |
Appendix A
Algorithm A1. Semi-automatic labeling pipeline.

Input: Raw PCB images I
1: Crop defect ROIs → C = {c1, …, cn}
2: Extract embeddings E = fθ(C), θ ∈ {HOG, ResNet-50, ViT-B/16}
3: Reduce dimensionality (PCA → 128–256 D)
4: Cluster E using k-means, GMM, or HDBSCAN
5: Assign cluster IDs → cluster-level annotation
6: Optionally merge/split clusters via human feedback
Output: Label set L for training detectors
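A minimal, runnable sketch of steps 2–5 of Algorithm A1 is shown below, assuming the defect crops of step 1 have already been saved as image files. The backbone, preprocessing, PCA target dimension, and k-means configuration follow the paper's stated ranges, but the code itself is illustrative rather than the authors' implementation.

```python
"""Sketch of Algorithm A1 (steps 2-5): ResNet-50 embeddings -> PCA -> k-means."""
import numpy as np
import torch
from PIL import Image
from torchvision import models, transforms
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

device = "cuda" if torch.cuda.is_available() else "cpu"
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
resnet.fc = torch.nn.Identity()                       # expose 2048-D penultimate features
resnet.eval().to(device)

prep = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])


@torch.no_grad()
def embed(paths, batch=64):
    feats = []
    for i in range(0, len(paths), batch):
        imgs = torch.stack([prep(Image.open(p).convert("RGB"))
                            for p in paths[i:i + batch]]).to(device)
        feats.append(resnet(imgs).cpu().numpy())
    return np.vstack(feats)


def label_crops(paths, n_classes=10, pca_dim=128, seed=42):
    E = embed(paths)                                            # step 2: embeddings
    E = PCA(n_components=pca_dim, random_state=seed).fit_transform(E)  # step 3
    cluster_ids = KMeans(n_clusters=n_classes, random_state=seed,
                         n_init=10).fit_predict(E)              # step 4
    return dict(zip(paths, cluster_ids))                        # step 5: cluster-level labels
```

Steps 6 and beyond (merge/split feedback and label propagation) would operate on the returned cluster IDs.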
Appendix B. Practical Parameter Tuning Guide
- Crop margin: 8–12 px (larger for reflective surfaces).
- Contrast normalization (γ): 0.9–1.2, depending on illumination.
- HDBSCAN min_samples: 5–15 for low-contrast defects (a minimal sketch applying these defaults follows the list).
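The helper functions below apply the defaults above; the function names, the exact gamma handling, and the min_cluster_size value are assumptions for illustration only.

```python
"""Sketch: applying the Appendix B tuning defaults (margin, gamma, min_samples)."""
import numpy as np
import hdbscan


def crop_with_margin(image, box, margin=10):
    """box = (x1, y1, x2, y2); margin of 8-12 px, clipped to image bounds."""
    x1, y1, x2, y2 = box
    h, w = image.shape[:2]
    return image[max(0, y1 - margin):min(h, y2 + margin),
                 max(0, x1 - margin):min(w, x2 + margin)]


def gamma_normalize(image, gamma=1.1):
    """Contrast normalization with gamma in the 0.9-1.2 range (uint8 input)."""
    x = image.astype(np.float32) / 255.0
    return (np.power(x, gamma) * 255.0).astype(np.uint8)


def cluster_low_contrast(X, min_samples=10):
    """min_samples of 5-15 trades noise rejection against recall on faint defects."""
    return hdbscan.HDBSCAN(min_cluster_size=25,
                           min_samples=min_samples).fit_predict(X)
```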
References
- Fonseca, L.A.L.O.; Iano, Y.; Oliveira, G.G. Automatic Printed Circuit Board Inspection: A Comprehensible Survey. Discov. Artif. Intell. 2024, 4, 10. [Google Scholar] [CrossRef]
- Zhou, Y.; Yuan, M.; Zhang, J.; Qin, S.; Ding, G. Review of Vision-Based Defect Detection Research and Its Perspectives for Printed Circuit Board. J. Manuf. Syst. 2023, 70, 557–578. [Google Scholar] [CrossRef]
- Malge, P.S.; Nadaf, R.S. A Survey: Automated Visual PCB Inspection Algorithm. Int. J. Eng. Res. Technol. 2014, 3, 223–229. [Google Scholar]
- Mirzaei, B.; Nezamabadi-Pour, H.; Raoof, A.; Derakhshani, R. Small Object Detection and Tracking: A Comprehensive Review. Sensors 2023, 23, 6887. [Google Scholar] [CrossRef] [PubMed]
- Xiuling, Z.; Huijuan, W.; Yu, S.; Gang, C.; Quanbo, Y. Starting from the Structure: A Review of Small Object Detection Based on Deep Learning. Image Vis. Comput. 2024, 146, 105054. [Google Scholar] [CrossRef]
- Nikouei, M.; Baroutian, B.; Nabavi, S.; Taraghi, F.; Aghaei, A.; Sajedi, A.; Moghaddam, M.E. Small Object Detection: A Comprehensive Survey on Challenges, Techniques and Real-World Applications. Intell. Syst. Appl. 2025, 27, 200561. [Google Scholar] [CrossRef]
- Ruff, L.; Vandermeulen, R.; Goernitz, N.; Deecke, L.; Siddiqui, S.A.; Binder, A.; Müller, E.; Kloft, M. Deep One-Class Classification. In Proceedings of the 35th International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018. [Google Scholar]
- Defard, T.; Setkov, A.; Loesch, A.; Audigier, R. PaDiM: A Patch Distribution Modeling Framework for Anomaly Detection and Localization. In Proceedings of the Pattern Recognition. ICPR International Workshops and Challenges, ICPRW, Virtual Event, 10–15 January 2021. [Google Scholar]
- Roth, K.; Pemula, L.; Zepeda, J.; Schölkopf, B.; Brox, T.; Gehler, P. Towards Total Recall in Industrial Anomaly Detection (PatchCore). In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
- Zavrtanik, V.; Kristan, M.; Skočaj, D. DRAEM: A Discriminatively Trained Reconstruction Embedding for Surface Anomaly Detection. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Online, 10–17 October 2021. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
- Dosovitskiy, A. An Image Is Worth 16×16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations (ICLR), Online, 3–7 May 2021. [Google Scholar]
- Lee, S.-J.; Seo, S.-B.; Bae, Y.-S. Anomaly Detection Model-Based Visual Inspection Method for PCB Board Manufacturing Process. Trans. Korean Inst. Electr. Eng. 2024, 73, 2024–2029. [Google Scholar] [CrossRef]
- Dalal, N.; Triggs, B. Histograms of Oriented Gradients for Human Detection. In Proceedings of the Computer Vision and Pattern Recognition CVPR, San Diego, CA, USA, 20–26 June 2005; pp. 886–893. [Google Scholar]
- Felzenszwalb, P.F.; Girshick, R.B.; McAllester, D.; Ramanan, D. Object Detection with Discriminatively Trained Part-Based Models. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 32, 1627–1645. [Google Scholar] [CrossRef]
- Lloyd, S. Least Squares Quantization in PCM. IEEE Trans. Inf. Theory 1982, 28, 129–137. [Google Scholar] [CrossRef]
- MacQueen, J. Some Methods for Classification and Analysis of Multivariate Observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability; University of California Press: Oakland, CA, USA, 1967; pp. 281–297. [Google Scholar]
- Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum Likelihood from Incomplete Data via the EM Algorithm. J. Roy. Stat. Soc. B 1977, 39, 1–38. [Google Scholar] [CrossRef]
- Reynolds, D.A.; Rose, R.C. Robust Text-Independent Speaker Identification Using Gaussian Mixture Speaker Models. IEEE Trans. Speech Audio Process. 1995, 3, 72–83. [Google Scholar] [CrossRef]
- Campello, R.J.G.B.; Moulavi, D.; Sander, J.S. Density-Based Clustering Based on Hierarchical Density Estimates. In Proceedings of the Advances in Knowledge Discovery and Data Mining 17th Pacific-Asia Conference, PAKDD, Gold Coast, Australia, 14–17 April 2013; pp. 160–172. [Google Scholar]
- McInnes, L.; Healy, J.; Astels, S. HDBSCAN: Hierarchical Density Based Clustering. J. Open Source Softw. 2017, 2, 205. [Google Scholar] [CrossRef]
- He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum Contrast for Unsupervised Visual Representation Learning. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 9729–9738. [Google Scholar] [CrossRef]
- Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A Simple Framework for Contrastive Learning of Visual Representations. In Proceedings of the 37th International Conference on Machine Learning ICML, Online, 13–18 July 2020; pp. 1597–1607. [Google Scholar]
- Caron, M.; Touvron, H.; Misra, I.; Jégou, H.; Mairal, J.; Bojanowski, P.; Joulin, A. Emerging Properties in Self-Supervised Vision Transformers. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Online, 10–17 October 2021; pp. 9630–9640. [Google Scholar] [CrossRef]
- Tang, S.; He, F.; Huang, X.; Yang, J. Online PCB Defect Detector on a New PCB Defect Dataset (DeepPCB). arXiv 2019, arXiv:1902.06197. [Google Scholar] [CrossRef]
- Huang, W.; Wei, P.; Zhang, M.; Liu, H. HRIPCB: A Challenging Dataset for PCB Defects Detection and Classification. J. Eng. 2020, 2020, 303–309. [Google Scholar] [CrossRef]
- Kaggle Community. PCB Defects (HRIPCB-Style, 6 Classes, 1386 Images). Dataset. Available online: https://www.kaggle.com/datasets/akhatova/pcb-defects (accessed on 10 October 2025).
- Zhu, X.; Ghahramani, Z. Learning from Labeled and Unlabeled Data with Label Propagation; Technical Report CMU-CALD-02-107; Carnegie Mellon University: Pittsburgh, PA, USA, 2002. [Google Scholar]
- Zhou, D.; Bousquet, O.; Lal, T.N.; Weston, J.; Schölkopf, B. Learning with Local and Global Consistency. In Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2003. [Google Scholar]
- Iscen, A.; Tolias, G.; Avrithis, Y.; Chum, O. Label Propagation for Deep Semi-Supervised Learning. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 5070–5079. [Google Scholar] [CrossRef]
- Wan, Y.; Gao, L.; Li, X.; Gao, Y. Semi-Supervised Defect Detection Method with Data Augmentation for PCB. Sensors 2022, 22, 7971. [Google Scholar] [CrossRef]
- He, H.; Garcia, E.A. Learning from Imbalanced Data. IEEE Trans. Knowl. Data Eng. 2009, 21, 1263–1284. [Google Scholar] [CrossRef]
- Gini, C. Variabilità e Mutabilità. Studi Economico-Giuridici Univ. Cagliari 1912, 3, 1–158. [Google Scholar]
- Cohen, J. A Coefficient of Agreement for Nominal Scales. Educ. Psychol. Meas. 1960, 20, 37–46. [Google Scholar] [CrossRef]
- Artač, M.; Jogan, M.; Leonardis, A. Incremental PCA for On-Line Visual Learning and Recognition. Pattern Recognit. 2003, 36, 301–309. [Google Scholar]
- Vinh, N.; Epps, J.; Bailey, J. Information Theoretic Measures for Clusterings Comparison. J. Mach. Learn. Res. 2010, 11, 2837–2854. [Google Scholar]
- Hubert, L.; Arabie, P. Comparing Partitions. J. Classif. 1985, 2, 193–218. [Google Scholar] [CrossRef]
- van der Maaten, L.; Hinton, G. Visualizing Data Using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
- Landis, J.R.; Koch, G.G. The Measurement of Observer Agreement for Categorical Data. Biometrics 1977, 33, 159–174. [Google Scholar] [CrossRef]




| Class | Count | Proportion (%) | Average Size (px) | Local Contrast | Labeling Difficulty |
|---|---|---|---|---|---|
| PSR Peel-Off | 4625 | 49.44 | 79 ± 18 | 0.83 | Low |
| Unknown | 1337 | 14.29 | — | — | Not applicable |
| Pollution | 1302 | 13.92 | 36 ± 8 | 0.52 | High |
| Foreign Particles | 982 | 10.5 | 43 ± 10 | 0.59 | Medium |
| PSR Skip | 632 | 6.76 | 54 ± 11 | 0.66 | Medium |
| Scratch | 329 | 3.52 | 41 ± 9 | 0.63 | Medium |
| Missing of Silk | 57 | 0.61 | 28 ± 6 | 0.48 | High |
| Spurious Copper | 49 | 0.52 | 32 ± 7 | 0.51 | High |
| Pin-Hole | 38 | 0.41 | 24 ± 6 | 0.44 | High |
| Mouse-bite | 3 | 0.03 | 17 ± 5 | 0.4 | High |
| Encoder | Resize | Embedding Dimension | Batch Size |
|---|---|---|---|
| HOG | 224 | 26,244 | - |
| ResNet-50 | 224 | 2048 | 64 |
| ViT-B/16 | 224 | 768 | 64 |
| Algorithm | Target Clusters | Distance Metric | Init/Seed |
|---|---|---|---|
| K-means | 10 | Euclidean | 42 |
| GMM | 10 | Gaussian likelihood | 42 |
| HDBSCAN | Variable | Euclidean | N/A |
| Encoder | Clustering Algorithm | Purity (95% CI) | NMI (95% CI) | AMI (95% CI) | ARI (95% CI) | p vs. Best |
|---|---|---|---|---|---|---|
| HOG | K-means | 0.535 [0.523–0.547] | 0.149 [0.142–0.156] | 0.147 [0.139–0.154] | 0.060 [0.056–0.064] | <0.001 * |
| HOG | GMM | 0.558 [0.545–0.571] | 0.157 [0.149–0.165] | 0.155 [0.147–0.163] | 0.034 [0.031–0.037] | <0.001 * |
| HOG | HDBSCAN | 0.501 [0.487–0.515] | 0.022 [0.018–0.026] | 0.021 [0.017–0.025] | 0.044 [0.041–0.047] | <0.001 * |
| ResNet-50 | K-means | 0.617 [0.604–0.630] | 0.255 [0.247–0.263] | 0.253 [0.244–0.262] | 0.169 [0.162–0.176] | 0.043 * |
| ResNet-50 | GMM | 0.616 [0.603–0.629] | 0.248 [0.240–0.256] | 0.246 [0.237–0.255] | 0.153 [0.146–0.160] | 0.018 * |
| ResNet-50 | HDBSCAN | 0.624 [0.610–0.638] | 0.290 [0.282–0.299] | 0.283 [0.274–0.292] | 0.178 [0.170–0.186] | — |
| ViT-B/16 | K-means | 0.604 [0.592–0.616] | 0.227 [0.219–0.235] | 0.225 [0.216–0.234] | 0.158 [0.150–0.166] | 0.271 |
| ViT-B/16 | GMM | 0.599 [0.586–0.612] | 0.232 [0.224–0.240] | 0.230 [0.221–0.239] | 0.129 [0.122–0.136] | 0.042 * |
| ViT-B/16 | HDBSCAN | 0.606 [0.592–0.620] | 0.281 [0.273–0.289] | 0.274 [0.265–0.283] | 0.174 [0.166–0.182] | 0.27 |
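The agreement metrics in the table above can be reproduced with standard scikit-learn scorers plus a small purity helper; the sketch below is an illustration under the assumption that the 95% confidence intervals come from a case bootstrap over crops, which the table itself does not specify.

```python
"""Sketch: purity, NMI, AMI, ARI with bootstrap 95% CIs (assumed resampling scheme)."""
import numpy as np
from sklearn.metrics import (adjusted_mutual_info_score, adjusted_rand_score,
                             normalized_mutual_info_score)


def purity(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    hits = 0
    for c in np.unique(y_pred):
        _, counts = np.unique(y_true[y_pred == c], return_counts=True)
        hits += counts.max()                      # majority class per cluster
    return hits / len(y_true)


def bootstrap_ci(metric, y_true, y_pred, n_boot=1000, seed=42):
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = len(y_true)
    vals = [metric(y_true[idx], y_pred[idx])
            for idx in (rng.integers(0, n, n) for _ in range(n_boot))]
    return np.percentile(vals, [2.5, 97.5])


# Usage: given ground-truth labels y and cluster assignments c,
#   nmi = normalized_mutual_info_score(y, c)
#   ami = adjusted_mutual_info_score(y, c)
#   ari = adjusted_rand_score(y, c)
#   lo, hi = bootstrap_ci(purity, y, c)
```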
| Encoder | Clustering Algorithm | Best Purity | Best Majority Class | Best Size | Worst Purity | Worst Majority Class | Worst Size |
|---|---|---|---|---|---|---|---|
| HOG | k-means | 0.96 | PSR Peel-Off | 929 | 0.27 | PSR Peel-Off | 763 |
| HOG | GMM | 0.98 | PSR Peel-Off | 906 | 0.28 | Pollution | 1631 |
| HOG | HDBSCAN | 1 | PSR Peel-Off | 20 | 0.34 | Pollution | 902 |
| ResNet-50 | k-means | 0.97 | PSR Peel-Off | 1122 | 0.23 | Foreign Particles | 1015 |
| ResNet-50 | GMM | 0.97 | PSR Peel-Off | 1187 | 0.26 | Pollution | 768 |
| ResNet-50 | HDBSCAN | 1 | Pollution | 35 | 0.23 | Pollution | 3445 |
| ViT-B/16 | k-means | 0.97 | PSR Peel-Off | 1226 | 0.28 | PSR Peel-Off | 887 |
| ViT-B/16 | GMM | 0.97 | PSR Peel-Off | 1171 | 0.32 | PSR Peel-Off | 910 |
| ViT-B/16 | HDBSCAN | 1 | Pollution | 14 | 0.26 | PSR Peel-Off | 4114 |
| IR | Purity | NMI | AMI | ARI | Clusters |
|---|---|---|---|---|---|
| 500 | 0.642 | 0.222 | 0.238 | 0.120 | 32 |
| 1000 | 0.596 | 0.250 | 0.301 | 0.176 | 31 |
| 1542 | 0.658 | 0.233 | 0.279 | 0.162 | 46 |

