2.1. HybridCNN-ViT Architecture
HybridCNN-ViT combines a CNN backbone for hierarchical local feature extraction with a lightweight Vision Transformer (ViT) encoder for long-range dependency modeling, aiming to improve discrimination for spatially distributed and visually similar defect patterns. Unlike heavier hybrid architectures that mainly rely on scaling Transformer capacity, the proposed HybridCNN-ViT adopts a lightweight CNN–Transformer design for wafer defect classification. In our framework, this compact hybrid encoder is further combined with a systematically calibrated semi-supervised pseudo-label selection strategy to improve pseudo-label reliability under limited labeled data.
Given an input image $x$, the CNN module extracts and progressively downsamples feature maps. The first stage applies a convolutional block:
$$F_1 = \mathrm{MaxPool}_{2 \times 2}\big(\mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}_{3 \times 3}(x)))\big).$$
Here, the convolution uses a stride of 1 and a padding of 1, so the spatial resolution is preserved before pooling. The subsequent max-pooling layer uses a kernel size of 2 and a stride of 2, halving the spatial resolution.
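For illustration, a minimal PyTorch sketch of this first stage is given below. The $3 \times 3$ kernel is implied by the stride-1, padding-1, resolution-preserving setting; the single input channel, the 32-channel width, and the BN–ReLU ordering are illustrative assumptions not fixed by the text.

```python
import torch
import torch.nn as nn

# Sketch of the first CNN stage: resolution-preserving 3x3 convolution
# (stride 1, padding 1) followed by 2x2 max pooling with stride 2.
stem = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3, stride=1, padding=1),
    nn.BatchNorm2d(32),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=2, stride=2),  # halves the spatial resolution
)

x = torch.randn(1, 1, 64, 64)  # hypothetical single-channel wafer map
print(stem(x).shape)           # torch.Size([1, 32, 32, 32])
```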
A residual block is then used:
$$F_2 = \mathrm{ReLU}\big(\mathcal{F}(F_1) + \mathcal{S}(F_1)\big),$$
where $\mathcal{F}(\cdot)$ is the main convolutional branch and $\mathcal{S}(\cdot)$ denotes the shortcut branch. When the input and output dimensions are unchanged, $\mathcal{S}$ is an identity mapping; otherwise, it is implemented as a projection shortcut using a $1 \times 1$ convolution followed by batch normalization to match the channel number and spatial resolution. In our setting, the residual block performs a channel expansion with a stride of 2, so the projection shortcut is used:
$$\mathcal{S}(F_1) = \mathrm{BN}\big(\mathrm{Conv}_{1 \times 1}(F_1)\big).$$
Thus, the block halves the spatial resolution of $F_1$ while expanding its channel dimension, and the size of the resulting feature map follows directly from the classification input size used in this work.
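The residual stage can be sketched as follows. The $1 \times 1$ projection shortcut with batch normalization follows the description above, while the two-convolution main branch and the 32-to-64 channel expansion are assumptions chosen only to make the example concrete.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, c_in=32, c_out=64, stride=2):
        super().__init__()
        # Main branch: layout assumed (two 3x3 convolutions with BN).
        self.main = nn.Sequential(
            nn.Conv2d(c_in, c_out, kernel_size=3, stride=stride, padding=1),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
            nn.Conv2d(c_out, c_out, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(c_out),
        )
        # Projection shortcut: 1x1 convolution + BN to match the channel
        # number and spatial resolution, as described in the text.
        self.shortcut = nn.Sequential(
            nn.Conv2d(c_in, c_out, kernel_size=1, stride=stride),
            nn.BatchNorm2d(c_out),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.main(x) + self.shortcut(x))

f1 = torch.randn(1, 32, 32, 32)
print(ResidualBlock()(f1).shape)  # torch.Size([1, 64, 16, 16])
```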
Adaptive average pooling is then applied to enforce a fixed spatial size under our experimental setting. The pooled feature map is flattened into tokens and linearly projected into a $D$-dimensional embedding space with $D = 128$. Learnable positional embeddings are added, and a learnable class token is prepended to the token sequence. A dropout layer is applied before the Transformer encoder.
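The tokenization step can be sketched as below. The embedding dimension $D = 128$ matches the encoder configuration; the pooled grid size, the CNN channel count, and the dropout rate are placeholders, since their values are not fixed in this excerpt.

```python
import torch
import torch.nn as nn

B, C, D, S = 4, 64, 128, 8       # batch, CNN channels (assumed), embed dim, pooled size (assumed)
pool = nn.AdaptiveAvgPool2d(S)   # fixed SxS spatial size regardless of input resolution
proj = nn.Linear(C, D)           # linear projection of each spatial token
cls_token = nn.Parameter(torch.zeros(1, 1, D))
pos_embed = nn.Parameter(torch.zeros(1, S * S + 1, D))
drop = nn.Dropout(p=0.1)         # rate is a placeholder

feat = torch.randn(B, C, 16, 16)                  # CNN feature map
tokens = pool(feat).flatten(2).transpose(1, 2)    # (B, S*S, C): one token per location
tokens = proj(tokens)                             # (B, S*S, D)
tokens = torch.cat([cls_token.expand(B, -1, -1), tokens], dim=1)  # prepend class token
tokens = drop(tokens + pos_embed)                 # add positions, then dropout
```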
The Transformer encoder consists of $L$ layers of multi-head self-attention and feed-forward blocks with residual connections, following standard ViT practice [14]. In the implementation used in this work, each encoder layer uses 8 attention heads, an embedding dimension of 128, a feed-forward dimension of 256, GELU activation, and dropout. We set the depth $L$ based on validation results to balance accuracy and computation; increasing it further yields only a small accuracy gain (0.37%) while increasing computation by 48% under our setting. The final class token is fed to a linear head:
$$\hat{y} = W z_{\mathrm{cls}} + b,$$
where $z_{\mathrm{cls}}$ is the final class-token representation.
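Under the stated configuration, the encoder and classification head reduce to a few lines of PyTorch; the depth, dropout rate, and number of defect classes below are placeholders rather than values taken from the text.

```python
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(
    d_model=128, nhead=8, dim_feedforward=256,
    activation="gelu", dropout=0.1, batch_first=True,  # dropout rate assumed
)
encoder = nn.TransformerEncoder(layer, num_layers=2)   # depth L = 2 assumed
head = nn.Linear(128, 9)                               # 9 defect classes assumed

tokens = torch.randn(4, 65, 128)      # token sequence (B, 1 + S*S, D) from the sketch above
logits = head(encoder(tokens)[:, 0])  # classify from the final class token
```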
2.2. Semi-Supervised Training Strategy
To leverage unlabeled data while reducing confirmation bias, we adopt a three-stage progressive pseudo-labeling strategy that combines class-adaptive confidence thresholding with uncertainty-aware filtering. The overall design aims to improve pseudo-label reliability in a reproducible manner rather than relying on a fixed global threshold.
We denote the training stage by $s \in \{1, 2, 3\}$ and the number of Monte Carlo stochastic forward passes by $T$. Stage 1 trains the classifier using only the labeled set $\mathcal{D}_L$, producing an initial teacher model. Stages 2 and 3 progressively expand the training data with pseudo-labeled samples selected from the unlabeled set $\mathcal{D}_U$. Specifically, Stage 2 uses the best teacher from Stage 1 to generate the initial pseudo-labeled subset, while Stage 3 uses the updated teacher from Stage 2 to refresh pseudo-labels and further refine the accepted sample set.
For each unlabeled sample $x \in \mathcal{D}_U$, we perform $T$ stochastic forward passes with dropout enabled and obtain predictive probabilities $p_c^{(t)}(x)$, $t = 1, \dots, T$, for class $c$. The mean predictive probability is
$$\bar{p}_c(x) = \frac{1}{T} \sum_{t=1}^{T} p_c^{(t)}(x),$$
and the candidate pseudo-label together with its confidence score is defined as
$$\hat{y}(x) = \arg\max_{c} \bar{p}_c(x), \qquad \mathrm{conf}(x) = \max_{c} \bar{p}_c(x).$$
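A minimal sketch of this sampling step is shown below; the dummy model is a hypothetical stand-in for the trained hybrid classifier, and $T = 10$ is a placeholder.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def mc_predict(model, x_u, T=10):
    """T stochastic forward passes with dropout enabled (T = 10 is a placeholder)."""
    model.train()  # keeps dropout active; any BatchNorm should stay in eval mode
    return torch.stack([model(x_u).softmax(dim=-1) for _ in range(T)])  # (T, B, C)

# Dummy stand-in for the trained hybrid classifier, for illustration only.
model = nn.Sequential(nn.Flatten(), nn.Dropout(0.1), nn.Linear(64 * 64, 9))
probs = mc_predict(model, torch.randn(4, 1, 64, 64))  # (T, B, C)
p_bar = probs.mean(dim=0)                             # mean predictive probability
conf, y_hat = p_bar.max(dim=-1)                       # confidence and candidate pseudo-label
```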
To characterize stage-wise class reliability, we define the candidate set for class $c$ at stage $s$ as
$$\mathcal{C}_c^{(s)} = \big\{\, x \in \mathcal{D}_U : \hat{y}(x) = c \,\big\}.$$
On this candidate set, the mean and standard deviation of the confidence scores are computed as
$$\mu_c^{(s)} = \frac{1}{|\mathcal{C}_c^{(s)}|} \sum_{x \in \mathcal{C}_c^{(s)}} \mathrm{conf}(x), \qquad \sigma_c^{(s)} = \Bigg( \frac{1}{|\mathcal{C}_c^{(s)}|} \sum_{x \in \mathcal{C}_c^{(s)}} \big( \mathrm{conf}(x) - \mu_c^{(s)} \big)^2 \Bigg)^{1/2}.$$
Here, $\mu_c^{(s)}$ reflects the average confidence of class-$c$ candidates at stage $s$, whereas $\sigma_c^{(s)}$ measures the dispersion of their confidence scores. The ratio $\sigma_c^{(s)} / \mu_c^{(s)}$ is used as a stage-wise indicator of class reliability and is recomputed whenever pseudo-labels are refreshed.
Based on these class-wise statistics, we define a class-adaptive confidence threshold:
$$\tau_c^{(s)}(x) = \tau_0 + \lambda \cdot \frac{\sigma_c^{(s)}}{\mu_c^{(s)}} + \gamma \cdot H\big[\bar{p}(x)\big],$$
where $\tau_0$, $\lambda$, and $\gamma$ control the baseline confidence level, the class-reliability adjustment, and the entropy-based sample-level adjustment, respectively, and $H[\bar{p}(x)]$ is the predictive entropy defined below. The threshold is clipped to the interval $[\tau_{\min}, \tau_{\max}]$ after computation. In this formulation, the term $\lambda \, \sigma_c^{(s)} / \mu_c^{(s)}$ makes the threshold responsive to stage-wise class reliability, while the entropy-dependent term $\gamma \, H[\bar{p}(x)]$ provides a soft correction according to the concentration of the current predictive distribution.
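A sketch of the statistics and threshold computation follows. The additive combination and the use of $\sigma / \mu$ as the class-reliability indicator match the reading above but are not uniquely determined by it, and all numeric values are placeholders for the calibrated configuration.

```python
import torch

def adaptive_thresholds(p_bar, y_hat, conf,
                        tau0=0.9, lam=0.1, gamma=0.05, t_min=0.6, t_max=0.99):
    eps = 1e-8
    # Sample-level predictive entropy H[p_bar(x)].
    entropy = -(p_bar * (p_bar + eps).log()).sum(dim=-1)
    tau = torch.full_like(conf, tau0)
    for c in range(p_bar.shape[-1]):
        mask = y_hat == c
        if mask.sum() > 1:
            mu, sigma = conf[mask].mean(), conf[mask].std()
            tau[mask] += lam * sigma / (mu + eps)  # class-reliability adjustment
    tau += gamma * entropy                         # entropy-based sample-level adjustment
    return tau.clamp(t_min, t_max)                 # clip to [tau_min, tau_max]

tau = adaptive_thresholds(p_bar, y_hat, conf)      # arrays from the previous sketch
```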
We estimate predictive uncertainty using Monte Carlo Dropout [27]. Dropout remains activated in the ViT branch during pseudo-label generation, both on the token sequence before the Transformer encoder and within the Transformer encoder layers, so that stochastic forward sampling can be performed without changing the deterministic training objective. The predictive entropy and mutual information are computed as
$$H\big[\bar{p}(x)\big] = -\sum_{c} \bar{p}_c(x) \log \bar{p}_c(x), \qquad I(x) = H\big[\bar{p}(x)\big] - \frac{1}{T} \sum_{t=1}^{T} H\big[p^{(t)}(x)\big],$$
where $H\big[p^{(t)}(x)\big]$ is the entropy of the $t$-th stochastic prediction.
Predictive entropy measures the overall uncertainty of the averaged predictive distribution, whereas mutual information captures the inconsistency of predictions under stochastic model perturbations. Using both quantities allows us to reject samples with either diffuse predictions or unstable stochastic behavior.
A pseudo-label is accepted only if it satisfies both the adaptive confidence condition and the uncertainty constraints:
$$\mathrm{conf}(x) \ge \tau_{\hat{y}(x)}^{(s)}(x), \qquad H\big[\bar{p}(x)\big] \le \tau_H, \qquad I(x) \le \tau_I,$$
where $\tau_H$ and $\tau_I$ are fixed entropy and mutual-information thresholds. Here, the adaptive confidence threshold provides a soft selection mechanism, while the entropy and mutual information constraints serve as hard rejection gates for uncertain predictions.
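Continuing the sketch, the two uncertainty quantities and the acceptance mask can be computed as follows; the gate values tau_H and tau_I are placeholders for the calibrated thresholds.

```python
import torch

def accept_pseudo_labels(probs, tau, tau_H=1.0, tau_I=0.1):
    eps = 1e-8
    p_bar = probs.mean(dim=0)
    # Predictive entropy of the averaged distribution.
    H_bar = -(p_bar * (p_bar + eps).log()).sum(dim=-1)
    # Mutual information: H[p_bar] minus the mean per-pass entropy (BALD-style).
    H_mean = -(probs * (probs + eps).log()).sum(dim=-1).mean(dim=0)
    mi = H_bar - H_mean
    conf, y_hat = p_bar.max(dim=-1)
    # Soft adaptive-confidence condition plus hard entropy/MI rejection gates.
    accept = (conf >= tau) & (H_bar <= tau_H) & (mi <= tau_I)
    return y_hat, accept

y_hat, accept = accept_pseudo_labels(probs, tau)  # probs, tau from the sketches above
```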
The five hyperparameters involved in pseudo-label selection, namely $\tau_0$, $\lambda$, $\gamma$, $\tau_H$, and $\tau_I$, are systematically calibrated on a held-out pseudo-evaluation subset that is not used to fit the teacher or student models. The final configuration is selected by jointly considering pseudo-label macro-F1, sample acceptance rate, and uncertainty filtering effectiveness, and is kept fixed across all experiments.
The three training stages are defined as follows. Stage 1 performs supervised warm-up using only $\mathcal{D}_L$. Stage 2 uses the best teacher from Stage 1 to infer pseudo-labels on $\mathcal{D}_U$, computes the corresponding class-wise statistics $\mu_c^{(2)}$ and $\sigma_c^{(2)}$, applies adaptive thresholding and uncertainty-aware filtering, and forms the accepted pseudo-labeled subset $\mathcal{P}_2$. The classifier is then trained on $\mathcal{D}_L \cup \mathcal{P}_2$. Stage 3 repeats this procedure using the updated teacher from Stage 2. Pseudo-labels are regenerated on $\mathcal{D}_U$, the class-wise statistics and adaptive thresholds are recomputed, the accepted pseudo-labeled subset is refreshed to obtain $\mathcal{P}_3$, and the final classifier is trained on $\mathcal{D}_L \cup \mathcal{P}_3$. Therefore, Stage 3 is not merely an additional round of training; the teacher model, class-wise statistics, adaptive thresholds, and accepted pseudo-label set are all updated, as summarized in the skeleton below.
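The stage schedule can be summarized by the following skeleton, reusing mc_predict, adaptive_thresholds, and accept_pseudo_labels from the earlier sketches. The train and build_model routines are stubs standing in for the supervised loop and model construction, and whether each stage re-initializes or fine-tunes the classifier is not specified in this excerpt.

```python
import torch

def train(model, images, labels):
    ...  # placeholder for the standard supervised training loop
    return model

teacher = train(build_model(), x_l, y_l)              # Stage 1: warm-up on D_L only

for stage in (2, 3):                                  # Stages 2 and 3
    probs = mc_predict(teacher, x_u)                  # regenerate pseudo-labels on D_U
    p_bar = probs.mean(dim=0)
    conf, y_hat = p_bar.max(dim=-1)
    tau = adaptive_thresholds(p_bar, y_hat, conf)     # recompute class statistics and thresholds
    y_hat, accept = accept_pseudo_labels(probs, tau)  # refresh the accepted subset P_s
    teacher = train(build_model(),                    # train on D_L united with P_s
                    torch.cat([x_l, x_u[accept]]),
                    torch.cat([y_l, y_hat[accept]]))
```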
We found three stages to be a practical trade-off, as adding more refresh stages yielded only marginal improvement while substantially increasing training cost.