Our method consists of two components: the patch-feature stream, which stores patch features in a patch-feature memory bank, and the segmentation-map stream, which stores segmentation maps in a segmentation-map memory bank. As illustrated in Figure 2, during the training phase, normal images are used to generate patch features and segmentation maps, which are then stored in their respective memory banks. Specifically, (1) the patch-feature memory bank stores patch-level features that capture fine-grained details of the image, and (2) the segmentation-map memory bank assists in determining the validity of various component combinations to identify logical anomalies. During the test phase, the differences between the patch features and segmentation maps of the test image and those of the training images are computed separately. Finally, the distances are combined to derive the anomaly score. The pseudo-code of the training and test procedures is shown in Algorithm 1.
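For orientation, the sketch below outlines the test-time flow through the two streams. It is an illustrative reading of the pipeline, not the exact Algorithm 1; all helper names (`extract_patches`, `patch_anomaly_score`, `build_segmentation_map`, `seg_stream_score`, `aggregate`) are hypothetical stand-ins for the modules detailed in the rest of this section:

```python
def test_image_score(image, model, patch_bank, seg_bank, stats):
    """Two-stream anomaly scoring (illustrative sketch of the test phase)."""
    patches = extract_patches(model, image)           # patch-feature stream
    s_p = patch_anomaly_score(patches, patch_bank)    # distance to stored normal patches
    seg_map = build_segmentation_map(model, image)    # segmentation-map stream
    s_seg = seg_stream_score(seg_map, seg_bank)       # component-relationship distance
    return aggregate(s_p, s_seg, stats)               # normalized sum (Section 3.3)
```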
3.2. Segmentation-Map Stream
To effectively detect logical anomalies in images, it is critical to explicitly model the intrinsic constraints among components, such as spatial layouts and topological dependencies. This is achieved by first segmenting the image into semantically meaningful component maps and then systematically analyzing pairwise relationships between components through statistical metrics. By quantifying deviations from normal constraints, the framework can pinpoint violations of logical consistency, such as misassembled parts or missing components, thereby enabling precise logical anomaly detection.
Feature Distillation Guidance. Our objective is to construct a unified framework for anomaly detection. The architecture of ResNet [30], comprising stacked convolutional layers, is well suited to extracting fine-grained features. However, it is less effective at generating segmentation maps directly, as evidenced by our ablation study on different teacher models. Our work is inspired by the knowledge distillation approach [38], in which a teacher model guides the student model to learn features enriched with global context, a prerequisite for constructing discriminative segmentation maps that detect logical anomalies.
Specifically, we employ a self-supervised Vision Transformer, DINO [39], as the teacher model, which excels at capturing holistic semantic relationships through its attention mechanism. Since the outputs of these two models differ, the student model needs to align with the output of the teacher model. Inspired by RD [13], we design a multi-scale feature fusion module to align their feature representations. The specific steps are as follows: for an image $x$ from the training set $\mathcal{X}_{\mathrm{train}}$, the multi-layer feature outputs $f_l^S$ ($l = 1, 2, \ldots, m$) of the student model are extracted by the feature extractor. We employ a transformation function $\phi_l$ to synchronize these features with the corresponding features of the teacher:

$$\hat{f}_l^S = \phi_l\big(f_l^S\big), \qquad \phi_l : \mathbb{R}^{d_l^S} \rightarrow \mathbb{R}^{d_l^T},$$

where $d_l^S$ and $d_l^T$ denote the feature dimensions of layer $l$ in the student model and the teacher model, respectively. Finally, we obtain $m$ dimensionally consistent features, which are concatenated to form a new feature $f^S$ as follows:

$$f^S = \mathrm{concat}\big(\hat{f}_1^S, \hat{f}_2^S, \ldots, \hat{f}_m^S\big).$$
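A minimal PyTorch sketch of such a fusion module is given below. The use of 1 × 1 convolutions for $\phi_l$ and bilinear interpolation for spatial alignment are our assumptions, since the text only specifies that the transformation synchronizes student and teacher feature dimensions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusion(nn.Module):
    """Project each student layer to the teacher dimension (the role of phi_l),
    align spatial sizes, and concatenate into the fused feature f^S."""
    def __init__(self, student_dims, teacher_dim, out_hw):
        super().__init__()
        self.out_hw = out_hw  # common spatial size (H, W)
        self.proj = nn.ModuleList(
            nn.Conv2d(d, teacher_dim, kernel_size=1) for d in student_dims)

    def forward(self, feats):  # feats: list of (B, C_l, H_l, W_l) tensors
        aligned = [
            F.interpolate(proj(f), size=self.out_hw,
                          mode="bilinear", align_corners=False)
            for proj, f in zip(self.proj, feats)
        ]
        return torch.cat(aligned, dim=1)  # fused student feature f^S
```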
After aligning the fused feature $f^S$ with the feature $f^T$ extracted by the teacher network through a two-dimensional convolution, we define the distillation loss $\mathcal{L}_{dis}$ between them. This alignment is achieved using a mean squared error (MSE) loss during training, expressed as:

$$\mathcal{L}_{dis} = \frac{1}{N} \sum_{i=1}^{N} \big\| f_i^S - f_i^T \big\|_2^2,$$

where $N$ denotes the total number of training samples, while $f_i^S$ and $f_i^T$ represent the features $f^S$ and $f^T$ of the $i$th sample, respectively.
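Continuing the sketch, the loss reduces to a single `mse_loss` call; we assume the aligning 2D convolution is a learnable layer trained jointly with the fusion module:

```python
import torch.nn.functional as F

def distillation_loss(fused_student, teacher_feat, align_conv):
    """MSE distillation loss L_dis between the conv-aligned fused student
    feature and the (frozen) teacher feature."""
    f_s = align_conv(fused_student)        # 2D conv: match teacher channels
    return F.mse_loss(f_s, teacher_feat)   # averaged over the batch
```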
Segmentation Module. After the feature distillation process, the features extracted by ResNet can be utilized to construct segmentation maps. As illustrated in the segmentation module in Figure 3, K-means clustering is applied to $f^S$ to obtain $K$ clusters. The cosine similarity between each cluster center and the original feature $f^S$ is then calculated to generate an initial segmentation map. Subsequently, the segmentation map is resized to match the original image dimensions through interpolation. Finally, a fully connected Gaussian Conditional Random Field (CRF) [40] is applied as a post-processing step to refine the segmentation results. To remove noise components, we further filter each refined map using an 11 × 11 mean filter, discarding regions with maximum values below 0.5 (normalized to [0, 1]) [14], thereby enhancing foreground–background separation and yielding the final segmentation map.
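The following sketch reproduces the clustering, similarity, and upsampling steps with scikit-learn and PyTorch; the default of four clusters is our assumption, and the CRF refinement [40] and 11 × 11 mean filtering [14] are left as comments since their exact configurations are implementation details we do not reproduce here:

```python
import numpy as np
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def build_segmentation_map(feat, num_clusters=4, image_size=(256, 256)):
    """K-means over pixel features, cosine similarity to cluster centers,
    and upsampling to image resolution. feat: (C, H, W) numpy array."""
    C, H, W = feat.shape
    pixels = feat.reshape(C, -1).T                    # (H*W, C) pixel features
    centers = KMeans(n_clusters=num_clusters,
                     n_init=10).fit(pixels).cluster_centers_
    # cosine similarity of every pixel feature to every cluster center
    pn = pixels / (np.linalg.norm(pixels, axis=1, keepdims=True) + 1e-8)
    cn = centers / (np.linalg.norm(centers, axis=1, keepdims=True) + 1e-8)
    seg = (pn @ cn.T).T.reshape(num_clusters, H, W)   # initial segmentation map
    seg = F.interpolate(torch.from_numpy(seg).float()[None],
                        size=image_size, mode="bilinear",
                        align_corners=False)[0].numpy()
    # a fully connected Gaussian CRF [40] would refine seg here, then an 11x11
    # mean filter discards components whose maximum response is below 0.5 [14]
    return seg
```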
3.3. Anomaly Score Computation
The proposed method stores the patch features of normal images and their corresponding segmentation maps in separate memory banks. Patch feature-based methods have been shown to effectively detect structural anomalies by focusing on local features. In contrast, logical anomalies require modeling and analyzing the relationships between components in the segmentation maps. The core of our anomaly scoring mechanism relies on comparing the test sample against the representations of normal samples stored in the two memory banks to quantify deviations from normality. During testing, for each stream, we compute the discrepancy between the test sample and its nearest neighbors within the corresponding memory bank. This discrepancy serves as the foundational anomaly score for that stream. To remain robust to outliers, we use KNN-based anomaly scores, which naturally smooth out the influence of minor outliers. The following describes how anomaly scores are calculated from the two memory banks.
Patch-Feature Memory Bank. This memory bank $\mathcal{M}_p$ is constructed by storing patch features to detect structural defects, following established approaches [23,29]: for each training sample $x_i$, its patch features $p_i$ are extracted and used to build the memory bank $\mathcal{M}_p$:

$$\mathcal{M}_p = \bigcup_{i=1}^{N} \{\, p_i \,\},$$

where $p_i$ represents the patch features of the $i$th training sample. The anomaly score $s_p$ for the test sample $x^{test}$ is predicted as:

$$s_p = \max_{p \in p^{test}} \min_{m \in \mathcal{M}_p} \big\| p - m \big\|_2,$$

where $p^{test}$ denotes the patch features of the test sample.
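A compact sketch of this scoring rule, assuming the memory bank is held as a tensor of stored patch features and following the max–min formulation above:

```python
import torch

def patch_anomaly_score(test_patches, memory_bank):
    """s_p = max over test patches of the L2 distance to the nearest
    stored normal patch. test_patches: (P, C); memory_bank: (M, C)."""
    dists = torch.cdist(test_patches, memory_bank)  # (P, M) pairwise L2
    nearest = dists.min(dim=1).values               # per-patch NN distance
    return nearest.max().item()                     # image-level score s_p
```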
Segmentation-Map Memory Bank. This memory bank $\mathcal{M}_s$ is built by storing segmentation maps containing four components to detect logical errors; the same feature computation is applied during the test phase. Specifically, the method includes the following steps. (1) Area Feature: the area feature is computed by summing the total number of pixels within each segmented region. (2) Color Feature: the image is converted from RGB space to CIELAB space, which consists of three components: $L$ (luminance), $a$ (green–red), and $b$ (blue–yellow). For each pixel, luminance is ignored and the ratio $a/b$ is calculated; the average value over the entire region is then taken as the color feature. (3) Quantity Feature: it is derived by grouping regions using DBSCAN [41] and calculating their density. Combining these three features, the anomaly score $s_{seg}$ is obtained by calculating the average $\ell_1$ distance between the test image $x^{test}$ and its five nearest neighbors $\{x_1, \ldots, x_5\}$ in $\mathcal{M}_s$, defined as:

$$s_{seg} = \frac{1}{5} \sum_{j=1}^{5} \big\| h(x^{test}) - h(x_j) \big\|_1,$$

where $h(\cdot)$ denotes the concatenated area, color, and quantity features of an image.
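The three statistics can be sketched as follows; the DBSCAN radius `eps`, the use of scikit-image for the CIELAB conversion, and the exact reading of the "density" grouping are our assumptions:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from skimage.color import rgb2lab

def region_features(image_rgb, mask, eps=5.0):
    """Area, color (mean a/b in CIELAB, luminance ignored), and quantity
    (number of DBSCAN groups) for one segmented component mask."""
    area = float(mask.sum())                           # (1) pixel count
    lab = rgb2lab(image_rgb)                           # (2) RGB -> CIELAB
    a, b = lab[..., 1][mask], lab[..., 2][mask]
    color = float(np.mean(a / (b + 1e-8)))             # mean a/b ratio
    coords = np.stack(np.nonzero(mask), axis=1)        # (3) group region pixels
    labels = DBSCAN(eps=eps).fit(coords).labels_
    quantity = float(len(set(labels)) - (1 if -1 in labels else 0))
    return np.array([area, color, quantity])

def seg_anomaly_score(test_feat, train_feats, k=5):
    """Average L1 distance to the k nearest normal images."""
    d = np.abs(train_feats - test_feat).sum(axis=1)    # L1 to each normal image
    return float(np.sort(d)[:k].mean())
```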
However, we find that the computation of distances between normal and test images based on the aforementioned features is relatively independent, leading to insufficient anomaly scoring and limited generalization across complex scenarios. To address this problem, and based on the observation that anomalies often exhibit specific spatial–morphological patterns, we additionally design a new distance calculation method that incorporates spatial–morphological features, namely the centroid $\mu$, the area ratio $r$ (the ratio of the number of pixels in the segmented region to the total number of pixels in the image), and the variance $v$, to model inter-component relationships from a shape-based perspective. The centroid $\mu$ captures positional consistency, $r$ quantifies size conformity relative to the entire image, and $v$ measures shape regularity or distribution dispersion. The computing steps are as follows:
Step 1: Relationship Calculation: For each segmented component $c$ in an image $x$ in $\mathcal{X}_{\mathrm{train}}$, we extract three key spatial–morphological features: centroid $\mu_c$, area ratio $r_c$, and variance $v_c$. For every unique pair of components $(c_i, c_j)$ where $i < j$, we compute the absolute difference for each feature type between the two components:

$$\Delta \mu_{ij} = \big| \mu_{c_i} - \mu_{c_j} \big|, \qquad \Delta r_{ij} = \big| r_{c_i} - r_{c_j} \big|, \qquad \Delta v_{ij} = \big| v_{c_i} - v_{c_j} \big|.$$

Each pair $(c_i, c_j)$ thus contributes a 4-dimensional difference vector (the centroid difference has two coordinates):

$$d_{ij} = \big[ \Delta \mu_{ij}^{x},\; \Delta \mu_{ij}^{y},\; \Delta r_{ij},\; \Delta v_{ij} \big].$$

The complete inter-component relationship representation for each image is the set of all pairwise difference vectors:

$$R(x) = \big\{\, d_{ij} \mid 1 \le i < j \le K \,\big\},$$

where $K$ denotes the number of clusters.
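A sketch of Step 1 follows, computing per-component statistics and the pairwise difference vectors; taking the variance over the component's pixel coordinates is one plausible reading of $v$:

```python
import numpy as np
from itertools import combinations

def component_stats(mask):
    """Centroid, area ratio, and coordinate variance of one component mask."""
    ys, xs = np.nonzero(mask)
    centroid = np.array([xs.mean(), ys.mean()])        # mu_c (2-D)
    area_ratio = float(mask.sum()) / mask.size         # r_c
    variance = float(np.concatenate([xs, ys]).var())   # v_c (dispersion)
    return centroid, area_ratio, variance

def relationship_set(masks):
    """R(x): the 4-D absolute-difference vector d_ij for every component pair."""
    stats = [component_stats(m) for m in masks]
    return {
        (i, j): np.concatenate([
            np.abs(stats[i][0] - stats[j][0]),         # centroid diff (2-D)
            [abs(stats[i][1] - stats[j][1])],          # area-ratio diff
            [abs(stats[i][2] - stats[j][2])],          # variance diff
        ])
        for i, j in combinations(range(len(masks)), 2)
    }
```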
Step 2: Threshold Determination: Using the training set of normal images $\mathcal{X}_{\mathrm{train}}$, we learn the acceptable range for each type of pairwise difference feature. For each difference feature type $k$ and each possible component pair $(i, j)$, we collect all values $d_{ij}^{k}(x)$ across all training images $x \in \mathcal{X}_{\mathrm{train}}$, and the maximum and minimum values of $d_{ij}^{k}$ are obtained. The deviation threshold $T_{ij}^{k}$, which defines the normal range of deviations, is the width of that range:

$$T_{ij}^{k} = \max_{x \in \mathcal{X}_{\mathrm{train}}} d_{ij}^{k}(x) - \min_{x \in \mathcal{X}_{\mathrm{train}}} d_{ij}^{k}(x).$$
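Step 2 then reduces to a max-minus-min over the training images for each pair and feature type, as sketched below (assuming the `relationship_set` output from the previous sketch):

```python
import numpy as np

def learn_thresholds(train_relationships):
    """T_ij^k = max - min of difference feature k over all training images.
    train_relationships: list of dicts mapping (i, j) -> 4-D vector."""
    thresholds = {}
    for pair in train_relationships[0]:
        vals = np.stack([rel[pair] for rel in train_relationships])  # (N, 4)
        thresholds[pair] = vals.max(axis=0) - vals.min(axis=0)       # range width
    return thresholds
```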
Step 3: Distance Calculation: As shown in Figure 4, for a test image $x^{test}$, its inter-component relationship representation $R(x^{test})$ is extracted. The nearest neighbors $\{x_1, \ldots, x_n\}$ of the test image are retrieved from $\mathcal{M}_s$, and their relationships $R(x_j)$ are identified as the reference relationships. For each feature type $k$ within each pairwise difference vector $d_{ij}$, the deviation between the reference relationships and the relationships of the test image is computed as:

$$\delta_{ij}^{k} = \big| d_{ij}^{k}(x^{test}) - d_{ij}^{k}(x_j) \big|.$$

If the difference $\delta_{ij}^{k}$ exceeds the threshold $T_{ij}^{k}$, it is considered an anomaly and its $\ell_1$-distance is added as part of the anomaly score $s_{sm}$ for the segmentation map. The process is defined as:

$$s_{sm} = \sum_{i < j} \sum_{k} \mathbb{1}\big[ \delta_{ij}^{k} > T_{ij}^{k} \big] \cdot \delta_{ij}^{k}.$$
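Step 3 can then be sketched as a thresholded accumulation of deviations against one retrieved reference relationship; extending to $n$ neighbors by averaging the per-neighbor scores is straightforward:

```python
import numpy as np

def spatial_morph_score(test_rel, ref_rel, thresholds):
    """s_sm: sum of deviations delta_ij^k that exceed their thresholds T_ij^k."""
    score = 0.0
    for pair, T in thresholds.items():
        delta = np.abs(test_rel[pair] - ref_rel[pair])  # per-feature deviation
        score += float(delta[delta > T].sum())          # only violations count
    return score
```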
Aggregation of Anomaly Scores. The anomaly scores derived from the two memory banks originate from different feature spaces and calculation methods, leading to inherent differences in their numerical scales and distributions. Directly summing these uncalibrated scores would cause the score with the larger inherent scale to dominate the aggregated result, potentially masking the signal from the other anomaly type. This imbalance would compromise the accuracy of the final anomaly score. To ensure that both scores contribute proportionally to the final result based on their respective deviation magnitudes, we first normalize both scores, and the final aggregated anomaly score
$s$ is obtained after normalization, i.e.:

$$s = \mathcal{N}(s_p) + \mathcal{N}(s_{seg} + s_{sm}),$$

where $\mathcal{N}(\cdot)$ denotes the normalizing operation. Specifically, it is derived from $\mathcal{N}(S) = (S - \mu)/\sigma$, where $S$ represents the original anomaly score, while $\mu$ and $\sigma$ denote the mean and standard deviation, respectively.
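A minimal sketch of the aggregation, assuming each stream's mean and standard deviation are estimated from scores on normal (training or validation) data:

```python
def aggregate(s_p, s_seg, stats):
    """Final score s: z-score-normalize each stream's score, then sum.
    stats maps a stream name to its (mean, std) estimated on normal data."""
    def normalize(s, key):
        mu, sigma = stats[key]
        return (s - mu) / sigma
    return normalize(s_p, "patch") + normalize(s_seg, "segmentation")
```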