1. Introduction
Semiconductor components require high manufacturing quality with minimal surface defects [1,2]. To overcome the brittleness and hardness of semiconductor materials, complex manufacturing processes have been proposed [3,4,5], and the intricacy of manufacturing defects in wafers produced during these processes has increased. A wafer bin map (WBM) [6] visually represents the test outcomes for each chip on a wafer, based on the chip’s probe test failure mode and its position (die). The chip probe test, an essential final assessment after the entire manufacturing sequence, evaluates the performance and functionality of each chip. During this test, each die is subjected to multiple probe test modes, with the first failure mode being recorded as the bin result. Throughout the wafer fabrication process, various manufacturing issues can lead to multiple defective dies on a wafer. These defects often cluster in specific areas on the wafer, forming spatial patterns known as gross failure areas (GFAs). These patterns, which include common types such as ring, scratch, loc, and center, are indicative of process-related issues and provide valuable data for improving yield and quality [7]. Classifying GFAs is crucial for engineers to identify and address problems in the production process, thereby reducing costs and enhancing yield. As production environments become more intricate, the need for automated WBM GFA classification has become increasingly important.
Deep learning-based object detection outperforms traditional algorithms and classification networks in terms of generalization ability and localization precision, leading to superior overall performance. Object detection algorithms can be categorized into two types: single-stage networks, exemplified by YOLO [8], SSD [9], and RetinaNet [10], and two-stage networks, represented by Faster R-CNN [11]. While two-stage networks prioritize accuracy over speed by separating localization and recognition, single-stage networks perform both tasks simultaneously, resulting in faster detection speeds. Although these vision-based models achieve competitive performance, they rely only on pixel-level information and ignore the rich semantic knowledge inherent in defect pattern descriptions.
Meanwhile, recent progress in vision–language models (VLMs), such as CLIP-style architectures [12], demonstrates that aligning visual representations with natural-language semantics significantly improves generalization and interpretability. VLM-based methods have been successfully applied to industrial defect detection. AnomalyGPT was proposed by Gu et al. [13] to address industrial anomaly detection by integrating fine-grained visual decoding and multi-turn dialog capability. It achieved state-of-the-art image- and pixel-level performance on MVTec-AD with strong few-shot generalization from only one normal example. Qian et al. [14] proposed a contrastive cross-modal training framework named CLAD that adopts large vision–language models to align visual and textual representations for industrial anomaly detection and localization. By jointly improving image-level detection and pixel-level localization on benchmarks such as MVTec-AD and VisA, it demonstrates that cross-modal contrastive learning can enhance both performance and interpretability in industrial inspection tasks. Cao et al. [15] developed a model named AnomalyVLM to address zero-shot industrial anomaly detection. In AnomalyVLM, product standards were regarded as a substitute for reference images, enabling vision–language models to reason about normality and abnormality without category-specific training data.
However, integrating VLM concepts into industrial wafer map analysis remains largely unexplored. In particular, wafer defect classes naturally correspond to concise textual descriptions, suggesting an opportunity to leverage language as a source of domain knowledge. Therefore, in this paper, we introduce YOLO-LA, a prototype-based vision–language alignment framework, in which a trained YOLO backbone serves as a visual encoder, while a frozen text encoder generates semantic prototypes from defect descriptions. A lightweight learnable projection head aligns the YOLO visual embedding with the text embedding space, enabling wafer classification via cosine similarity to the textual prototypes.
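The core classification step of the framework described above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: the embedding dimensions, class count, and random weights are all hypothetical stand-ins, and the projection head would be trained rather than random.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a 256-d YOLO visual embedding, a 512-d text space, 8 defect classes.
D_VIS, D_TXT, N_CLS = 256, 512, 8

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

# Frozen semantic prototypes: one text embedding per defect description
# (e.g. "a ring-shaped cluster of failing dies"). Random here for illustration.
text_prototypes = l2_normalize(rng.standard_normal((N_CLS, D_TXT)))

# Lightweight learnable projection head (random here; learned in the real framework).
W = rng.standard_normal((D_VIS, D_TXT)) * 0.02

def classify(visual_embedding):
    """Project the visual embedding into the text space, then score each class
    by cosine similarity to its textual prototype."""
    z = l2_normalize(visual_embedding @ W)   # projected, unit-norm visual feature
    sims = text_prototypes @ z               # cosine similarities (prototypes are unit-norm)
    return int(np.argmax(sims)), sims

v = rng.standard_normal(D_VIS)               # stand-in for a YOLO backbone feature vector
pred, sims = classify(v)
```

Because both encoders stay frozen, only `W` carries trainable parameters, which is what keeps the alignment head lightweight.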
2. YOLO Architectures and Baseline Model
YOLO, which stands for “You Only Look Once,” is a state-of-the-art, real-time object detection system that has revolutionized the field of computer vision. Developed by Joseph Redmon and his collaborators [8], YOLO is designed to detect and classify objects within images or video streams with remarkable speed and accuracy. Unlike traditional object detection methods that apply a classifier to various parts of an image, YOLO reframes object detection as a single regression problem, directly predicting bounding boxes and class probabilities from full images in one evaluation. This approach allows YOLO to process images in real time, making it highly efficient for applications that require fast and accurate object detection, such as autonomous driving, surveillance, and augmented reality. YOLO’s architecture is based on a convolutional neural network (CNN) that divides the input image into a grid and predicts bounding boxes and probabilities for each grid cell, enabling it to detect multiple objects of different classes simultaneously. Over the years, YOLO has undergone several iterations, each improving upon the previous in terms of speed, accuracy, and robustness, solidifying its position as a leading framework in the object detection domain.
2.1. YOLOv8
YOLOv8 was proposed by Jocher et al. [16]. In YOLOv8, the backbone is responsible for extracting hierarchical features from the input image. It consists of several sequential layers, including CBS, C2f, and SPPF. CBS stands for Convolution + BatchNorm + SiLU activation and is used to process feature maps efficiently. C2f is a modified bottleneck structure designed for feature extraction with fewer parameters. SPPF (Spatial Pyramid Pooling-Fast) enhances the receptive field and captures multi-scale features at the end of the backbone. The neck is designed to aggregate and fuse features from different stages of the backbone, helping detect objects at various scales. It includes C2f blocks for feature processing, Upsample operations to increase spatial resolution, and Concat operations to fuse feature maps from different layers. This part resembles a Feature Pyramid Network (FPN) combined with a Path Aggregation Network (PAN), enabling both top–down and bottom–up feature fusion. The head is responsible for final predictions.
2.2. YOLOv10
YOLOv10 [17] is one of the most representative models in the YOLO series. YOLOv10’s architecture leverages the strengths of its predecessors while incorporating several groundbreaking innovations. In YOLOv10, the backbone utilizes an advanced version of C2f (cross-stage partial with two-fusion layers), called C2fCIB, which optimizes gradient flow and minimizes computational redundancy [18]. The neck module is designed to aggregate features from various scales; it employs PANet (Path Aggregation Network) layers to facilitate effective multi-scale feature fusion and passes the fused features to the head.
2.3. YOLOv11
YOLOv11 [19] builds upon the foundation of YOLOv10, retaining its core design principles such as the modular backbone–neck–head architecture and the use of efficient components like CBS and SPPF blocks. Both versions adopt a one-to-many prediction strategy in the head, enabling each spatial location to generate multiple object detections. However, YOLOv11 introduces several key improvements over YOLOv10. First, it adds a novel C3K2 (cross-stage partial with kernel size 2) module, an improved version of C2f that combines the CSP concept with a smaller convolution kernel. The feature map is split into two parts: one is passed through directly, while the other undergoes multiple bottleneck convolutions before being fused with the first part. This preserves rich features while reducing computation and parameter count, improving computational efficiency and reducing inference latency. In addition, YOLOv11 modifies the PSA module into C2PSA. In the C2PSA module, a convolutional layer with a kernel size of 1 splits the feature map into two parts. One part enters the PSA module to highlight the response of key spatial regions, while the other part retains the original features. Finally, the two parts are concatenated and fused through another convolutional layer with a kernel size of 1.
2.4. YOLOv12
The YOLOv12 [20] architecture is composed of three key components, a backbone module, a neck module, and a head module, where two techniques are applied to reduce computational cost. First, a residual efficient layer aggregation network (R-ELAN) in the backbone module integrates features from different scales while reducing computation cost and memory usage. In addition, area attention is used in the neck module to reduce computation by narrowing the attention from global to local. The head is a simple classification network that uses a convolutional layer followed by an adaptive average pooling layer; the final linear layer outputs one score per defect type.
5. Conclusions
In this paper, we propose YOLO-LA, a lightweight prototype-based vision–language alignment framework that integrates a frozen YOLO classifier backbone with a frozen text encoder for wafer pattern detection. Semantic text prototypes are generated from natural-language defect descriptions using a text encoder, enabling interpretable classification via cosine similarity in a shared embedding space. In addition, an efficient learnable projection head is designed to achieve minimal extra computation. Extensive experiments on wafer map defect classification demonstrate that semantic alignment consistently improves classification robustness, particularly under low-resolution conditions where purely visual cues are insufficient to distinguish structurally similar defect patterns. The results further show that the proposed framework is robust to prompt formulation, benefits from a frozen visual backbone, and achieves an effective performance–efficiency trade-off with a lightweight projection head. In future work, we will explore few-shot and zero-shot defect recognition through dynamic prototype updating, as well as validation on real-world production data from semiconductor fabrication lines.