DCS: A Zero-Shot Anomaly Detection Framework with DINO-CLIP-SAM Integration
Abstract
1. Introduction
- To enhance CLIP's anomaly semantic modeling capability, we propose adaptive learned fine-grained text prompts (FinePrompt), which replace generic "abnormal" descriptions with a library of fine-grained anomaly descriptions. We further introduce learnable text embeddings and adaptive prompt weighting, which alleviate semantic ambiguity and improve accuracy.
- To strengthen cross-modal interaction in CLIP and improve segmentation quality, we propose an Adaptive Dual-path Cross-modal Interaction (ADCI) module. By fusing a patch-level path with a multi-scale path, ADCI enables effective interaction between semantic and visual features, capturing both the local and the global semantics of anomaly regions and significantly improving the accuracy and stability of the CLIP-based coarse segmentation.
- To make SAM segmentation finer-grained and more accurate, we introduce the Box-Point Prompt Combiner (BPPC), which combines Grounding DINO's box prior with CLIP-guided positive/negative point prompts to supply SAM with a more reliable prompt combination.
- Experiments on multiple datasets show that our method achieves state-of-the-art performance in zero-shot anomaly detection. On the MVTec-AD dataset [12] in particular, it surpasses the best CLIP-based methods by 3.5%, 9.3%, and 13.7% in AUROC, F1-max, and AP, respectively, and the best SAM-based methods by 21.4%, 10.6%, and 19.4% on the same metrics.
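The adaptive prompt weighting in FinePrompt can be illustrated with a minimal sketch. This is not the authors' implementation; the embedding dimension, the softmax temperature `tau`, and the function name are illustrative assumptions. The idea is to score each fine-grained anomaly description against the image embedding and combine the descriptions with similarity-derived weights, rather than relying on a single generic "abnormal" prompt:

```python
import numpy as np

def adaptive_prompt_weighting(image_emb, prompt_embs, tau=0.07):
    """Combine fine-grained anomaly prompt embeddings into one text
    embedding, weighted by their cosine similarity to the image.

    image_emb:   (d,)   L2-normalized image embedding
    prompt_embs: (k, d) L2-normalized embeddings of k fine-grained
                 anomaly descriptions (e.g. "a scratch", "a crack")
    tau:         softmax temperature (illustrative value)
    """
    sims = prompt_embs @ image_emb           # (k,) cosine similarities
    weights = np.exp(sims / tau)
    weights /= weights.sum()                 # softmax over the k prompts
    combined = weights @ prompt_embs         # (d,) weighted text embedding
    combined /= np.linalg.norm(combined)     # renormalize to unit length
    return combined, weights

# Toy example with random unit vectors standing in for CLIP embeddings.
rng = np.random.default_rng(0)
img = rng.normal(size=512)
img /= np.linalg.norm(img)
prompts = rng.normal(size=(5, 512))
prompts /= np.linalg.norm(prompts, axis=1, keepdims=True)
emb, w = adaptive_prompt_weighting(img, prompts)
```

In practice the weights would be produced per image at inference time, so descriptions that match the observed defect type dominate the combined text embedding.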
2. Related Works
2.1. Foundation Models
2.2. Zero-Shot Anomaly Detection
2.3. Cross-Modal Interaction
3. Methods
3.1. Overall Architecture
3.2. Fine-Grained Text Templates for Adaptive Learning
3.2.1. Fine-Grained Anomaly Description Library
3.2.2. Learning Text Embeddings
3.2.3. Adaptive Prompt Weighting
3.3. Adaptive Dual-Path Cross-Modal Interaction
3.3.1. Strip Path Based on Attention-Weighted Pooling
3.3.2. Scale Path
3.3.3. Gated Dual Path Fusion
3.3.4. Multi-Stage Weighted Integration of Quality Perception
3.4. Box-Point Prompt Combiner
3.4.1. Box Prompt Positioning
3.4.2. Positive-Point Positioning
3.4.3. Negative Point Positioning
3.4.4. Box-Point Prompt Combination
4. Experiments
4.1. Datasets
4.2. Evaluation Metrics
4.3. Implementation Details
4.4. Main Results
4.5. Ablation Study
4.5.1. Main Ablation Study
4.5.2. Ablation Study of the FinePrompt Module
4.5.3. Ablation Study of ADCI Modules
4.5.4. An Ablation Study of the BPPC Module
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Defard, T.; Setkov, A.; Loesch, A.; Audigier, R. Padim: A patch distribution modeling framework for anomaly detection and localization. In Proceedings of the International Conference on Pattern Recognition, Virtual Event, 10–15 January 2021; Springer International Publishing: Cham, Switzerland, 2021; pp. 475–489. [Google Scholar]
- Deng, H.; Li, X. Anomaly detection via reverse distillation from one-class embedding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 9737–9746. [Google Scholar]
- Roth, K.; Pemula, L.; Zepeda, J.; Schölkopf, B.; Brox, T.; Gehler, P. Towards total recall in industrial anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 14318–14328. [Google Scholar]
- You, Z.; Cui, L.; Shen, Y.; Yang, K.; Lu, X.; Zheng, Y.; Le, X. A unified model for multi-class anomaly detection. Adv. Neural Inf. Process. Syst. 2022, 35, 4571–4584. [Google Scholar]
- Jeong, J.; Zou, Y.; Kim, T.; Zhang, D.; Ravichandran, A.; Dabeer, O. Winclip: Zero-/few-shot anomaly classification and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 19606–19616. [Google Scholar]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
- Chen, X.; Han, Y.; Zhang, J. A zero-/few-shot anomaly classification and segmentation method for CVPR 2023 (VAND) workshop challenge tracks 1 & 2. 1st Place on Zero-shot AD and 4th Place on Few-shot AD. arXiv 2023, arXiv:2305.17382. [Google Scholar]
- Gu, Z.; Zhu, B.; Zhu, G.; Chen, Y.; Tang, M.; Wang, J. Anomalygpt: Detecting industrial anomalies using large vision-language models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; AAAI Press: Menlo Park, CA, USA, 2024; Volume 38, pp. 1932–1940. [Google Scholar]
- Li, S.; Cao, J.; Ye, P.; Ding, Y.; Tu, C.; Chen, T. ClipSAM: CLIP and SAM collaboration for zero-shot anomaly segmentation. Neurocomputing 2025, 618, 129122. [Google Scholar] [CrossRef]
- Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.-Y.; et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 4015–4026. [Google Scholar]
- Liu, S.; Zeng, Z.; Ren, T.; Li, F.; Zhang, H.; Yang, J.; Jiang, Q.; Li, C.; Yang, J.; Su, H.; et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer Nature: Cham, Switzerland, 2024; pp. 38–55. [Google Scholar]
- Bergmann, P.; Fauser, M.; Sattlegger, D.; Steger, C. MVTec AD–A comprehensive real-world dataset for unsupervised anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9592–9600. [Google Scholar]
- Zhou, Q.; Pang, G.; Tian, Y.; He, S.; Chen, J. Anomalyclip: Object-agnostic prompt learning for zero-shot anomaly detection. arXiv 2023, arXiv:2310.18961. [Google Scholar]
- Cao, Y.; Xu, X.; Sun, C.; Cheng, Y.; Du, Z.; Gao, L.; Shen, W. Segment any anomaly without training via hybrid prompt regularization. arXiv 2023, arXiv:2305.10724. [Google Scholar] [CrossRef]
- Hu, R.; Rohrbach, M.; Darrell, T. Segmentation from natural language expressions. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; Springer International Publishing: Cham, Switzerland, 2016; pp. 108–124. [Google Scholar]
- Chen, D.J.; Jia, S.; Lo, Y.C.; Chen, H.T.; Liu, T.L. See-through-text grouping for referring image segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7454–7463. [Google Scholar]
- Feng, G.; Hu, Z.; Zhang, L.; Sun, J.; Lu, H. Bidirectional relationship inferring network for referring image localization and segmentation. IEEE Trans. Neural Netw. Learn. Syst. 2021, 34, 2246–2258. [Google Scholar] [CrossRef]
- Kumar, D.; Pawar, P.P.; Addula, S.R.; Meesala, M.K.; Oni, O.; Cheema, Q.N.; Haq, A.U.; Sajja, G.S. AI-Powered security for IoT ecosystems: A hybrid deep learning approach to anomaly detection. J. Cybersecur. Priv. 2025, 5, 90. [Google Scholar] [CrossRef]
- Deng, H.; Zhang, Z.; Bao, J.; Li, X.A. Adapting Vision-Language Models for Unified Zero-Shot Anomaly Localization. arXiv 2023, arXiv:2308.15939. [Google Scholar]
- Zhu, J.; Cai, S.; Deng, F.; Ooi, B.C.; Wu, J. Do LLMs Understand Visual Anomalies? Uncovering LLM’s Capabilities in Zero-shot Anomaly Detection. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, VIC, Australia, 28 October–1 November 2024; pp. 48–57. [Google Scholar]
- Zou, Y.; Jeong, J.; Pemula, L.; Zhang, D.; Dabeer, O. Spot-the-difference self-supervised pre-training for anomaly detection and segmentation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer Nature: Cham, Switzerland, 2022; pp. 392–408. [Google Scholar]
- Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2024, arXiv:1711.05101. [Google Scholar]
- Chen, X.; Zhang, J.; Tian, G.; He, H.; Zhang, W.; Wang, Y.; Wang, C.; Liu, Y. Clip-ad: A language-guided staged dual-path model for zero-shot anomaly detection. In Proceedings of the International Joint Conference on Artificial Intelligence, Jeju, Republic of Korea, 2–8 August 2024; Springer Nature: Singapore, 2024; pp. 17–33. [Google Scholar]
- Cao, Y.; Zhang, J.; Frittoli, L.; Cheng, Y.; Shen, W.; Boracchi, G. Adaclip: Adapting clip with hybrid learnable prompts for zero-shot anomaly detection. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer Nature: Cham, Switzerland, 2024; pp. 55–72. [Google Scholar]


| Base Model | Method | Anomaly Description | VisA AUROC | VisA F1-Max | VisA AP | MVTec-AD AUROC | MVTec-AD F1-Max | MVTec-AD AP |
|---|---|---|---|---|---|---|---|---|
| CLIP-based | WinCLIP | state ensemble | 79.6 | 14.8 | – | 85.1 | 31.7 | – |
| CLIP-based | APRIL-GAN | state ensemble | 94.2 | 32.3 | 25.7 | 87.6 | 43.3 | 40.8 |
| CLIP-based | SDP | normal/anomalous | 84.1 | 16.0 | 9.6 | 88.7 | 35.3 | 28.5 |
| CLIP-based | SDP+ | normal/anomalous | 94.8 | 26.5 | 20.3 | 91.2 | 41.9 | 39.4 |
| CLIP-based | AnomalyCLIP | normal/damaged | 95.5 | 28.3 | 21.3 | 91.1 | 39.1 | 34.5 |
| CLIP-based | AdaCLIP | normal/damaged | 95.5 | 32.9 | 25.8 | 88.7 | 45.2 | 42.7 |
| SAM-based | SAA | normal/defective | 83.7 | 12.8 | 5.5 | 67.7 | 23.8 | 15.2 |
| SAM-based | SAA+ | normal/defective | 74.0 | 27.1 | 22.4 | 73.2 | 37.8 | 28.8 |
| CLIP & SAM | ClipSAM | normal/anomalous | 95.6 | 33.1 | 26.0 | 92.3 | 47.8 | 45.9 |
| CLIP & DINO & SAM | Ours | fine-grained description | 97.2 | 35.6 | 29.1 | 94.6 | 48.4 | 48.2 |
(A) Method-level inference efficiency comparison

| Method | FPS ↑ | Latency (ms/img) ↓ |
|---|---|---|
| WinCLIP | 40.0 | 25.0 |
| SAA | 4.0 | 250.0 |
| DCS (Ours) | 2.2 | 455.0 |

(B) Decomposition of DCS inference-time overhead (latency)

| Component (Inference) | Latency (ms/img) | Share (%) |
|---|---|---|
| Grounding DINO (box + pos) | 140.0 | 30.8 |
| FinePrompt (pos-aware prompt assembly) | 3.0 | 0.7 |
| CLIP image encoding (ViT-L/14-336) | 25.0 | 5.5 |
| CLIP text (cached lookup + similarity) | 10.0 | 2.2 |
| ADCI (cross-modal interaction head) | 40.0 | 8.8 |
| BPPC (point sampling + prompt compose) | 12.0 | 2.6 |
| SAM (ViT-H prompt encoder + mask decoder) | 225.0 | 49.5 |
| Total (DCS) | 455.0 | 100 |
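The per-component latencies and share percentages in Table (B) are internally consistent, as a quick recomputation shows (component names here are shortened labels, not identifiers from the paper):

```python
# Recompute the Share (%) column of Table (B) from the per-component
# latencies; values match the table after rounding to one decimal.
latencies = {
    "Grounding DINO": 140.0,
    "FinePrompt": 3.0,
    "CLIP image encoding": 25.0,
    "CLIP text": 10.0,
    "ADCI": 40.0,
    "BPPC": 12.0,
    "SAM": 225.0,
}
total = sum(latencies.values())                       # 455.0 ms/img
shares = {k: round(100 * v / total, 1) for k, v in latencies.items()}
fps = round(1000.0 / total, 1)                        # matches Table (A)
```

The components sum exactly to the 455 ms total, and 1000/455 reproduces the 2.2 FPS reported for DCS in Table (A); rounding makes the share column sum to 100.1% rather than exactly 100%.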
| FinePrompt | ADCI | BPPC | AUROC | F1-Max | AP |
|---|---|---|---|---|---|
| ✓ | | | 93.6 | 44.5 | 44.2 |
| | ✓ | | 92.8 | 45.9 | 44.6 |
| | | ✓ | 92.7 | 46.4 | 45.4 |
| ✓ | ✓ | | 94.1 | 46.9 | 46.0 |
| ✓ | | ✓ | 94.2 | 47.6 | 47.0 |
| | ✓ | ✓ | 93.9 | 47.8 | 47.1 |
| ✓ | ✓ | ✓ | 94.6 | 48.4 | 48.2 |
| FG-desc | Pos | LearnTok | ReWeight | AUROC | F1-Max | AP |
|---|---|---|---|---|---|---|
| | ✓ | ✓ | ✓ | 93.4 | 46.3 | 46.0 |
| ✓ | | ✓ | ✓ | 93.9 | 47.1 | 46.8 |
| ✓ | ✓ | | ✓ | 94.1 | 47.4 | 47.1 |
| ✓ | ✓ | ✓ | | 94.3 | 47.8 | 47.6 |
| ✓ | ✓ | ✓ | ✓ | 94.6 | 48.4 | 48.2 |
| SoftPool | DualPath | Gate | StageW | AUROC | F1-Max | AP |
|---|---|---|---|---|---|---|
| | ✓ | ✓ | ✓ | 94.0 | 47.2 | 47.0 |
| ✓ | | ✓ | ✓ | 94.1 | 47.4 | 47.2 |
| ✓ | ✓ | | ✓ | 94.3 | 47.8 | 47.6 |
| ✓ | ✓ | ✓ | | 94.4 | 48.0 | 47.8 |
| ✓ | ✓ | ✓ | ✓ | 94.6 | 48.4 | 48.2 |
| Box | PosPts | NegPts | AUROC | F1-Max | AP |
|---|---|---|---|---|---|
| ✓ | | | 94.0 | 47.1 | 46.7 |
| ✓ | ✓ | | 93.8 | 47.0 | 46.6 |
| ✓ | | ✓ | 94.3 | 47.8 | 47.6 |
| ✓ | ✓ | ✓ | 94.6 | 48.4 | 48.2 |
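The box-and-point prompt assembly ablated above can be sketched as follows. This is a hypothetical illustration, not the authors' code: the function name, the sampling rule (top-scoring pixels inside the detector box as positive points, lowest-scoring pixels outside it as negative points), and the point counts are all assumptions. The output dict mirrors the kind of box/point prompt combination that would be handed to SAM:

```python
import numpy as np

def combine_box_point_prompts(score_map, box, n_pos=3, n_neg=3):
    """Hypothetical BPPC-style prompt assembly sketch.

    score_map: (H, W) coarse anomaly scores (e.g. from the CLIP branch)
    box:       (x0, y0, x1, y1) detector box (e.g. from Grounding DINO)
    Returns the box, positive points (high-score pixels inside the box),
    and negative points (low-score pixels outside it) as (x, y) pairs.
    """
    H, W = score_map.shape
    x0, y0, x1, y1 = box
    inside = np.zeros((H, W), dtype=bool)
    inside[y0:y1, x0:x1] = True

    # Positive points: highest anomaly scores inside the box.
    in_scores = np.where(inside, score_map, -np.inf)
    pos_idx = np.argsort(in_scores.ravel())[::-1][:n_pos]
    pos_pts = np.stack(np.unravel_index(pos_idx, (H, W)), axis=1)[:, ::-1]

    # Negative points: lowest scores outside the box (likely background).
    out_scores = np.where(inside, np.inf, score_map)
    neg_idx = np.argsort(out_scores.ravel())[:n_neg]
    neg_pts = np.stack(np.unravel_index(neg_idx, (H, W)), axis=1)[:, ::-1]

    return {"box": box, "pos_points": pos_pts, "neg_points": neg_pts}

# Toy example: a 10x10 score map with two hot pixels inside the box.
demo_map = np.zeros((10, 10))
demo_map[4, 4], demo_map[5, 5] = 1.0, 0.9
demo = combine_box_point_prompts(demo_map, (3, 3, 7, 7), n_pos=2, n_neg=2)
```

Negative points constrain SAM away from background regions that the box alone cannot exclude, which is consistent with the ablation: adding negative points to the box improves all three metrics, while positive points help mainly in combination with negatives.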
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Wan, Y.; Lang, Y.; Yao, L. DCS: A Zero-Shot Anomaly Detection Framework with DINO-CLIP-SAM Integration. Appl. Sci. 2026, 16, 1836. https://doi.org/10.3390/app16041836
