Zero-Shot Industrial Anomaly Detection via CLIP-DINOv2 Multimodal Fusion and Stabilized Attention Pooling
Abstract
1. Introduction
- We propose a hierarchical multimodal feature refinement network. Its Dual-Modality Attention (DMA) mechanism deeply integrates CLIP’s global semantic representations with DINOv2’s multi-scale local structural features, addressing a difficulty that has long limited existing methods: capturing macroscopic structural anomalies and microscopic textural deviations at the same time (an illustrative fusion sketch follows this list).
- We design a Stabilized Attention-based Pooling (SAP) module that uses anomaly heatmaps as prior guidance for feature aggregation. Combined with a lightweight attention mechanism, SAP refines contextual information and strengthens the representation of anomalous regions, mitigating the feature blurring that standard pooling operations typically introduce in anomaly detection (see the pooling sketch after this list).
- Extensive experiments on seven industrial anomaly detection benchmarks demonstrate consistent gains: an average AUROC of 93.4% and an average AP of 94.3% for image-level detection, and an average AUROC of 96.9% and an average AUPRO of 92.4% for pixel-level segmentation. On average, our method outperforms existing state-of-the-art (SOTA) methods on every metric, setting a new benchmark for zero-shot anomaly detection.
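The paper’s exact DMA formulation is not reproduced in this outline, so the following PyTorch sketch only illustrates the general idea behind the first contribution: patch tokens from a frozen CLIP image encoder and a frozen DINOv2 backbone are fused through bidirectional cross-attention so that global semantics and local structure inform each other. The class name, the feature dimensions, the use of `nn.MultiheadAttention`, and the residual fusion are illustrative assumptions, not the authors’ implementation.

```python
import torch
import torch.nn as nn

class DualModalityAttention(nn.Module):
    """Hedged sketch of dual-modality fusion: CLIP patch tokens attend to
    DINOv2 patch tokens and vice versa, then both streams are merged into
    a shared feature space."""

    def __init__(self, clip_dim=768, dino_dim=1024, fused_dim=768, heads=8):
        super().__init__()
        self.clip_proj = nn.Linear(clip_dim, fused_dim)
        self.dino_proj = nn.Linear(dino_dim, fused_dim)
        # Cross-attention in both directions (illustrative design choice).
        self.clip_to_dino = nn.MultiheadAttention(fused_dim, heads, batch_first=True)
        self.dino_to_clip = nn.MultiheadAttention(fused_dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(fused_dim)

    def forward(self, clip_tokens, dino_tokens):
        # clip_tokens: (B, N, clip_dim); dino_tokens: (B, M, dino_dim)
        c = self.clip_proj(clip_tokens)
        d = self.dino_proj(dino_tokens)
        # CLIP queries enriched with DINOv2 local structure.
        c_fused, _ = self.clip_to_dino(query=c, key=d, value=d)
        # DINOv2 queries enriched with CLIP global semantics.
        d_fused, _ = self.dino_to_clip(query=d, key=c, value=c)
        # Residual fusion; assumes N == M (same patch grid) so the sum is valid.
        return self.norm(c + c_fused + d_fused)

# Toy usage with random tokens on a 16x16 patch grid.
if __name__ == "__main__":
    dma = DualModalityAttention()
    clip_tok = torch.randn(2, 256, 768)
    dino_tok = torch.randn(2, 256, 1024)
    print(dma(clip_tok, dino_tok).shape)  # torch.Size([2, 256, 768])
```

In practice the fused tokens would feed the anomaly-map head; the sketch assumes both backbones produce the same patch grid so their token sequences can be combined directly.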
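Similarly, here is a minimal sketch of heatmap-guided pooling in the spirit of SAP: an anomaly heatmap acts as a spatial prior that re-weights patch features before aggregation, so the pooled image descriptor emphasizes suspected anomalous regions rather than averaging them away. The lightweight scoring MLP, the additive combination with the prior, and the temperature-scaled softmax are assumptions made for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StabilizedAttentionPooling(nn.Module):
    """Hedged sketch: pool patch features with weights derived from an
    anomaly-heatmap prior plus a learned lightweight attention score."""

    def __init__(self, dim=768, temperature=0.1):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim // 4), nn.GELU(),
                                   nn.Linear(dim // 4, 1))
        self.temperature = temperature

    def forward(self, tokens, heatmap):
        # tokens: (B, N, dim); heatmap: (B, N) patch anomaly scores in [0, 1].
        learned = self.score(tokens).squeeze(-1)           # (B, N)
        # Combine the learned attention with the heatmap prior, then
        # normalize with a temperature-scaled softmax for stability.
        weights = F.softmax((learned + heatmap) / self.temperature, dim=-1)
        return torch.einsum("bn,bnd->bd", weights, tokens)  # (B, dim)

if __name__ == "__main__":
    sap = StabilizedAttentionPooling()
    feats = torch.randn(2, 256, 768)
    prior = torch.rand(2, 256)
    print(sap(feats, prior).shape)  # torch.Size([2, 768])
```

A lower temperature sharpens the weights toward the most suspicious patches; the exact stabilization strategy used in the paper is not reproduced here.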
2. Related Work
3. Proposed Method
3.1. Problem Definition
3.2. Overall Architecture
3.3. Image Encoder
3.3.1. Dual-Modality Attention Mechanism (DMA)
3.3.2. Text Encoder
3.3.3. Stabilized Attention-Based Pooling Module (SAP)
3.3.4. Loss Function
3.3.5. Segmentation Loss
3.3.6. Classification Loss
4. Experiments
4.1. Experimental Setup
4.1.1. Datasets
4.1.2. Evaluation Metrics
4.1.3. Implementation Details
4.2. Comparative Experiments
4.3. Qualitative Analysis
4.4. Ablation Study
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Chen, S.Z.; Fu, T.T.; Song, J.; Wang, X.Y.; Qi, M.H.; Hua, C.C.; Sun, J. A lightweight semi-supervised distillation framework for hard-to-detect surface defects in the steel industry. Expert Syst. Appl. 2026, 297, 129489.
- Weng, W.; He, Z.; Jiang, J.; Zheng, G.; Wan, A.; Cheng, X. Enhancing multi-scale learning with dual-path adaptive feature fusion for mixed-supervised industrial defect detection. Eng. Res. Express 2025, 7, 025438.
- Sheng, F.Q.; Zhu, Y.; Jin, L.J.; Yin, J.J. Semi-supervised semantic segmentation with confidence-driven consistency learning. Expert Syst. Appl. 2026, 296, 128965.
- Sun, Q.; Xu, K.; Zhao, D.L.; Li, H.J.; Jin, L.; Liu, C.N.; Xu, P.J. PNG: An adaptive local-global hybrid framework for unsupervised material surface defect detection. Expert Syst. Appl. 2025, 293, 128711.
- Shi, H.; Pan, Y.F.; Gao, R.X.; Guo, Z.C.; Zhang, C.Q.; Zhao, P. DAE-SWnet: Unsupervised internal defect segmentation through infrared thermography with scarce samples. J. Manuf. Syst. 2025, 82, 766–785.
- Wang, E.R.; Chen, S.Y.; Peng, L.F.; Zhang, X.Q.; Ou, Y.C.; Peng, J.W. Contrastive self-supervised subspace clustering via KAN-based multi-view fusion. Expert Syst. Appl. 2026, 296, 128995.
- Li, Y.; Yang, J.; Wang, W.; Gao, T. A joint collaborative adaptation network for fault diagnosis of rolling bearing under class imbalance and variable operating conditions. Adv. Eng. Inform. 2026, 69 Pt B, 103931.
- Jeong, J.; Zou, Y.; Kim, T.; Zhang, D.; Ravichandran, A.; Dabeer, O. WinCLIP: Zero-/Few-Shot Anomaly Classification and Segmentation. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023.
- Zhou, Q.; Pang, G.; Tian, Y.; He, S.; Chen, J. Anomalyclip: Object-agnostic prompt learning for zero-shot anomaly detection. In Proceedings of the 12th International Conference on Learning Representations, ICLR 2024, Vienna, Austria, 7 May 2024; Available online: https://openreview.net/forum?id=buC4E91xZE (accessed on 27 October 2025).
- Salehi, A.; Salehi, M.; Hosseini, R.; Snoek, C.G.M.; Yamada, M.; Sabokrou, M. Crane: Context-Guided Prompt Learning and Attention Refinement for Zero-Shot Anomaly Detections. arXiv 2025.
- Xie, S.; Wu, X.J.; Wang, M.Y. Semi-Patchcore: A Novel Two-Staged Method for Semi-Supervised Anomaly Detection and Localization. IEEE Trans. Instrum. Meas. 2025, 74, 3506012.
- Xu, Y.G.; Wang, H.; Liu, Z.L.; Zuo, M.J. Self-Supervised Defect Representation Learning for Label-Limited Rail Surface Defect Detection. IEEE Sens. J. 2023, 23, 29235–29246.
- Tailanian, M.; Pardo, Á.; Musé, P. U-Flow: A U-Shaped Normalizing Flow for Anomaly Detection with Unsupervised Threshold. J. Math. Imaging Vis. 2024, 66, 678–696.
- Gudovskiy, D.; Ishizaka, S.; Kozuka, K. CFLOW-AD: Real-Time Unsupervised Anomaly Detection with Localization via Conditional Normalizing Flows. In Proceedings of the 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2022.
- Deng, H.; Li, X. Anomaly Detection via Reverse Distillation from One-Class Embedding. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022.
- Batzner, K.; Heckler, L.; König, R. EfficientAD: Accurate Visual Anomaly Detection at Millisecond-Level Latencies. In Proceedings of the 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2024.
- Salehi, M.; Sadjadi, N.; Baselizadeh, S.; Rohban, M.H.; Rabiee, H.R. Multiresolution Knowledge Distillation for Anomaly Detection. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021.
- Huang, C.; Guan, H.; Jiang, A.; Zhang, Y.; Spratling, M.; Wang, Y.F. Registration Based Few-Shot Anomaly Detection. In Computer Vision—ECCV 2022, Proceedings of the European Conference on Computer Vision 2022, Tel Aviv, Israel, 23–27 October 2022; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2022.
- Wan, Q.; Gao, L.; Li, X.; Wen, L. Industrial Image Anomaly Localization Based on Gaussian Clustering of Pretrained Feature. IEEE Trans. Ind. Electron. 2022, 69, 6182–6192.
- Lee, S.; Lee, S.; Song, B.C. CFA: Coupled-Hypersphere-Based Feature Adaptation for Target-Oriented Anomaly Localization. IEEE Access 2022, 10, 78446–78454.
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; Available online: https://proceedings.mlr.press/v139/radford21a.html (accessed on 27 October 2025).
- Chen, X.; Han, Y.; Zhang, J. A Zero-/Few-Shot Anomaly Classification and Segmentation Method for CVPR 2023 VAND Workshop Challenge Tracks 1&2: 1st Place on Zero-shot AD and 4th Place on Few-shot AD. arXiv 2023.
- Deng, H.; Zhang, Z.; Bao, J.; Li, X. AnoVL: Adapting Vision-Language Models for Unified Zero-shot Anomaly Localization. arXiv 2023, arXiv:2308.15939.
- Cao, Y.; Xu, X.; Sun, C.; Cheng, Y.; Du, Z.; Gao, L.; Shen, W. Segment Any Anomaly without Training via Hybrid Prompt Regularization. arXiv 2023.
- Derakhshani, M.M.; Sanchez, E.; Bulat, A.; Da Costa, V.G.T.; Snoek, C.G.M.; Tzimiropoulos, G.; Martinez, B. Bayesian Prompt Learning for Image-Language Model Generalization. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023.
- Zhou, K.; Yang, J.; Loy, C.C.; Liu, Z. Learning to Prompt for Vision-Language Models. Int. J. Comput. Vis. 2022, 130, 2337–2348.
- Roy, S.; Etemad, A. Consistency-guided Prompt Learning for Vision-Language Models. arXiv 2023, arXiv:2306.01195.
- Chen, X.; Zhang, J.; Tian, G.; He, H.; Zhang, W.; Wang, Y.; Wang, C.; Liu, Y. CLIP-AD: A Language-Guided Staged Dual-Path Model for Zero-Shot Anomaly Detection. In Proceedings of the International Joint Conference on Artificial Intelligence, Jeju, Republic of Korea, 3–9 August 2024.
- Li, S.; Cao, J.; Ye, P.; Ding, Y.; Tu, C.; Chen, T. ClipSAM: CLIP and SAM collaboration for zero-shot anomaly segmentation. Neurocomputing 2025, 618, 129122.
- Qu, Z.; Tao, X.; Prasad, M.; Shen, F.; Zhang, Z.; Gong, X.; Ding, G. VCP-CLIP: A Visual Context Prompting Model for Zero-Shot Anomaly Segmentation. In Computer Vision—ECCV 2024, Proceedings of the European Conference on Computer Vision 2024, Milan, Italy, 29 September–4 October 2024; Springer Nature: Cham, Switzerland, 2025.
- Zhu, J.; Pang, G. Toward Generalist Anomaly Detection via In-Context Residual Learning with Few-Shot Sample Prompts. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024.
- Li, X.; Zhang, Z.; Tan, X.; Chen, C.; Qu, Y.; Xie, Y.; Ma, L. PromptAD: Learning Prompts with only Normal Samples for Few-Shot Anomaly Detection. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024.
- Cao, Y.; Zhang, J.; Frittoli, L.; Cheng, Y.; Shen, W.; Boracchi, G. AdaCLIP: Adapting CLIP with Hybrid Learnable Prompts for Zero-Shot Anomaly Detection. In Computer Vision—ECCV 2024, Proceedings of the European Conference on Computer Vision 2024, Milan, Italy, 29 September–4 October 2024; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2024.
- Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017.
- Sudre, C.H.; Li, W.; Vercauteren, T.; Ourselin, S.; Jorge Cardoso, M. Generalised Dice Overlap as a Deep Learning Loss Function for Highly Unbalanced Segmentations. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, Proceedings of the DLMIA ML-CDS 2017, Québec City, QC, Canada, 14 September 2017; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2017.
- Bergmann, P.; Fauser, M.; Sattlegger, D.; Steger, C. MVTec AD—A Comprehensive Real-World Dataset for Unsupervised Anomaly Detection. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019.
- Zou, Y.; Jeong, J.; Pemula, L.; Zhang, D.; Dabeer, O. SPot-the-Difference Self-supervised Pre-training for Anomaly Detection and Segmentation. In Proceedings of the European Conference on Computer Vision 2022, Tel Aviv, Israel, 23–27 October 2022.
- Mishra, P.; Verk, R.; Fornasier, D.; Piciarelli, C.; Foresti, G.L. VT-ADL: A Vision Transformer Network for Image Anomaly Detection and Localization. In Proceedings of the 2021 IEEE 30th International Symposium on Industrial Electronics (ISIE), Kyoto, Japan, 20–23 June 2021.
- Jezek, S.; Jonak, M.; Burget, R.; Dvorak, P.; Skotak, M. Deep learning-based defect detection of metal parts: Evaluating current methods in complex conditions. In Proceedings of the 2021 13th International Congress on Ultra Modern Telecommunications and Control Systems and Workshops (ICUMT), Brno, Czech Republic, 25–27 October 2021.
- Tabernik, D.; Sela, S.; Skvarc, J.; Skočaj, D. Segmentation-based deep-learning approach for surface-defect detection. J. Intell. Manuf. 2019, 31, 759–776.
- Wieler, M.; Hahn, T. Weakly supervised learning for industrial optical inspection. In Proceedings of the 29th Annual Symposium of the German Association for Pattern Recognition (DAGM 2007), Heidelberg, Germany, 12–14 September 2007.
- Aota, T.; Tong, L.T.T.; Okatani, T. Zero-shot versus Many-shot: Unsupervised Texture Anomaly Detection. In Proceedings of the 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2–7 January 2023.

| Metric | Dataset | WinCLIP | AnoVL | AnomalyCLIP | AdaCLIP | Crane | Ours |
|---|---|---|---|---|---|---|---|
| Image-level (AUROC) | MVTec | 91.8 | 92.5 | 91.5 | 89.2 | 93.9 | 94.8 |
| | VisA | 78.1 | 79.2 | 82.1 | 85.8 | 83.6 | 84.6 |
| | MPDD | 63.6 | 72.7 | 77.0 | 76.0 | 81.0 | 84.1 |
| | BTAD | 68.2 | 80.3 | 88.3 | 88.6 | 96.3 | 96.4 |
| | KSDD | 84.3 | 94.4 | 84.7 | 97.1 | 97.8 | 98.0 |
| | DAGM | 91.8 | 89.7 | 97.5 | 99.1 | 98.9 | 99.0 |
| | DTD | 93.2 | 94.9 | 93.5 | 95.5 | 95.8 | 97.0 |
| | Average | 81.6 | 86.2 | 87.8 | 90.2 | 92.5 | 93.4 |

| Metric | Dataset | WinCLIP | AnoVL | AnomalyCLIP | AdaCLIP | Crane | Ours |
|---|---|---|---|---|---|---|---|
| Image-level (AP) | MVTec | 96.5 | 95.1 | 96.2 | 95.7 | 97.6 | 97.8 |
| | VisA | 81.1 | 80.2 | 85.4 | 79.0 | 86.7 | 87.6 |
| | MPDD | 69.9 | 83.6 | 82.0 | 80.2 | 84.1 | 86.4 |
| | BTAD | 70.9 | 72.8 | 87.3 | 93.8 | 97.0 | 98.0 |
| | KSDD | 77.4 | 90.8 | 80.0 | 89.6 | 94.5 | 94.6 |
| | DAGM | 79.5 | 76.1 | 92.3 | 88.5 | 96.1 | 96.9 |
| | DTD | 92.6 | 93.3 | 97.0 | 97.3 | 98.2 | 98.9 |
| | Average | 81.1 | 84.6 | 88.6 | 89.2 | 93.5 | 94.3 |

| Metric | Dataset | WinCLIP | AnoVL | AnomalyCLIP | AdaCLIP | Crane | Ours |
|---|---|---|---|---|---|---|---|
| Image-level (F1-max) | MVTec | 92.9 | 93.2 | 92.7 | 90.6 | 93.6 | 94.2 |
| | VisA | 80.7 | 79.7 | 80.4 | 83.1 | 81.2 | 82.0 |
| | MPDD | 77.5 | 88.3 | 80.4 | 82.5 | 83.0 | 83.7 |
| | BTAD | 67.6 | 73.0 | 83.8 | 88.2 | 93.7 | 95.6 |
| | KSDD | 79.0 | 88.0 | 82.7 | 90.7 | 89.7 | 89.9 |
| | DAGM | 87.6 | 74.7 | 90.1 | 97.5 | 94.7 | 95.2 |
| | DTD | 94.1 | 97.3 | 93.6 | 94.7 | 94.6 | 95.9 |
| | Average | 82.8 | 84.9 | 87.2 | 89.6 | 90.1 | 90.9 |

| Metric | Dataset | WinCLIP | AnoVL | AnomalyCLIP | AdaCLIP | Crane | Ours |
|---|---|---|---|---|---|---|---|
| Pixel-level (AUROC) | MVTec | 85.1 | 90.6 | 91.1 | 88.7 | 91.2 | 92.5 |
| | VisA | 79.6 | 85.2 | 95.5 | 95.5 | 95.3 | 96.0 |
| | MPDD | 76.4 | 62.3 | 96.5 | 96.1 | 97.6 | 97.8 |
| | BTAD | 72.7 | 75.2 | 94.2 | 92.1 | 96.7 | 97.2 |
| | KSDD | 68.8 | 97.1 | 90.6 | 97.7 | 99.2 | 99.0 |
| | DAGM | 87.6 | 79.7 | 95.6 | 91.5 | 96.2 | 97.0 |
| | DTD | 83.9 | 97.7 | 97.9 | 97.9 | 98.8 | 98.9 |
| | Average | 79.2 | 84.0 | 94.5 | 94.2 | 96.4 | 96.9 |

| Metric | Dataset | WinCLIP | AnoVL | AnomalyCLIP | AdaCLIP | Crane | Ours |
|---|---|---|---|---|---|---|---|
| Pixel-level (AUPRO) | MVTec | 64.6 | 77.8 | 81.4 | 37.8 | 88.1 | 87.0 |
| | VisA | 56.8 | 60.5 | 87.0 | 72.9 | 90.6 | 90.6 |
| | MPDD | 48.9 | 38.3 | 88.7 | 62.8 | 93.2 | 93.2 |
| | BTAD | 27.3 | 40.9 | 74.8 | 20.3 | 86.8 | 88.9 |
| | KSDD | 24.2 | 82.6 | 67.8 | 33.8 | 97.4 | 98.5 |
| | DAGM | 65.7 | 56.0 | 91.0 | 50.6 | 93.8 | 93.8 |
| | DTD | 57.8 | 90.5 | 92.3 | 72.9 | 96.0 | 95.1 |
| | Average | 49.3 | 63.8 | 83.3 | 50.1 | 92.3 | 92.4 |

| Metric | Dataset | WinCLIP | AnoVL | AnomalyCLIP | AdaCLIP | Crane | Ours |
|---|---|---|---|---|---|---|---|
| Pixel-level (F1-max) | MVTec | 31.6 | 36.5 | 39.1 | 43.4 | 43.8 | 46.3 |
| | VisA | 14.8 | 14.6 | 28.3 | 37.7 | 30.2 | 32.6 |
| | MPDD | 15.4 | 15.6 | 34.2 | 34.9 | 42.0 | 44.2 |
| | BTAD | 18.5 | 23.4 | 49.7 | 51.7 | 61.1 | 60.4 |
| | KSDD | 21.3 | 23.1 | 51.3 | 54.5 | 62.4 | 64.4 |
| | DAGM | 13.9 | 12.8 | 58.9 | 57.5 | 66.8 | 68.4 |
| | DTD | 16.1 | 6.8 | 62.2 | 71.6 | 71.8 | 73.9 |
| | Average | 18.8 | 24.7 | 46.2 | 50.2 | 54.0 | 55.7 |

| Model Configuration | MVTec (Image AUROC, AP) | BTAD (Image AUROC, AP) | Average (Image AUROC, AP) |
|---|---|---|---|
| A. Baseline | (91.7, 96.4) | (95.8, 95.3) | (93.8, 95.9) |
| B. A + hierarchical fusion | (93.7, 97.4) | (95.8, 96.6) | (94.8, 97.0) |
| C. B + DMA | (94.4, 97.7) | (96.1, 97.3) | (95.3, 97.5) |
| D. C + SAP | (94.7, 97.8) | (96.4, 98.0) | (95.6, 97.9) |