Abstract
With the rapid development of semiconductor manufacturing technology, effective control of the production process, reduction of process variation, and improvement of yield have become key competitive factors for wafer fabrication plants. Wafer bin maps, which characterize the spatial distribution of defects on a wafer, provide valuable information that allows engineers to quickly identify potential root causes through accurate pattern recognition. Vision-based deep learning approaches rely on visual patterns to achieve robust performance. However, they rarely exploit the rich semantic information embedded in defect descriptions, which limits interpretability and generalization. To address this gap, we propose YOLO-LA, a lightweight prototype-based vision–language alignment framework that integrates a pretrained frozen YOLO backbone with a frozen text encoder to enhance wafer defect recognition. A learnable projection head maps visual features into a shared embedding space, enabling classification through cosine similarity. Experimental results on the WM-811K dataset demonstrate that YOLO-LA consistently improves classification accuracy across different backbones while introducing minimal additional parameters. In particular, YOLOv12 achieves the fastest inference speed while maintaining competitive accuracy, whereas YOLOv10 benefits most from semantic prototype alignment. The proposed framework is lightweight and suitable for real-time industrial wafer inspection systems.
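As a rough illustration of the alignment mechanism described above (not the authors' implementation), the following PyTorch sketch shows a learnable projection head mapping frozen visual features into a shared embedding space, with classification by cosine similarity against class prototype embeddings. All dimensions, and the random tensors standing in for the frozen YOLO backbone and frozen text encoder outputs, are hypothetical placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProjectionHead(nn.Module):
    """Learnable head mapping frozen visual features into the text embedding space."""

    def __init__(self, vis_dim: int, embed_dim: int):
        super().__init__()
        self.proj = nn.Linear(vis_dim, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L2-normalize so that dot products equal cosine similarities.
        return F.normalize(self.proj(x), dim=-1)


# Hypothetical dimensions (not taken from the paper).
vis_dim, embed_dim, num_classes, batch = 512, 256, 9, 4

# Placeholders: in the actual framework, these would come from the frozen
# pretrained YOLO backbone and the frozen text encoder, respectively.
visual_feats = torch.randn(batch, vis_dim)
text_prototypes = F.normalize(torch.randn(num_classes, embed_dim), dim=-1)

head = ProjectionHead(vis_dim, embed_dim)  # the only trainable component
logits = head(visual_feats) @ text_prototypes.t()  # cosine-similarity logits
pred = logits.argmax(dim=-1)  # predicted defect class per wafer map
print(pred.shape)  # torch.Size([4])
```

Because only the projection head carries trainable parameters, training and inference costs beyond the frozen backbones stay small, which is consistent with the lightweight, real-time claim of the framework.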