1. Introduction
In the contemporary globalized industrial production environment, as the scale of production continues to expand and technology becomes increasingly complex, quality control of industrial components has emerged as a crucial element in ensuring the stable operation of the entire production system. The quality of components directly affects the performance and reliability of products, which in turn influences the market competitiveness and economic benefits of enterprises. Therefore, efficient and accurate anomaly detection technologies are of paramount importance to modern industrial manufacturing [
1]. Traditional component inspection methods primarily rely on manual visual inspection and basic machine vision techniques. These methods, when confronted with the growing production demands and the increasingly complex and diverse defects of components, gradually reveal their limitations.
Manual visual inspection is the earliest and most widespread method for component inspection [
2,
3]. Operators visually examine components to identify obvious defects. However, this method is inefficient and susceptible to operator fatigue, subjective judgment, and differences in experience, leading to unstable and inaccurate inspection results. Moreover, manual visual inspection often fails to detect minute or concealed defects. The development of machine vision technology has brought certain improvements to component inspection. By employing cameras and image processing algorithms, machine vision systems are capable of automatically identifying and classifying defects in components [
4,
5]. Nevertheless, traditional machine vision systems still face challenges when dealing with complex defect features [
6]. Additionally, machine vision systems typically require cumbersome parameter adjustments and algorithm optimizations for specific types of defects, lacking flexibility and generalization ability [
7].
In recent years, deep-learning technologies have achieved remarkable success in the fields of image recognition, speech recognition, and natural language processing, providing new ideas and methods for solving complex problems. Deep-learning models, particularly Convolutional Neural Networks (CNNs), have demonstrated powerful capabilities in image feature extraction and classification [
8]. Deep-learning models can automatically learn features from large amounts of image data without the need for manually designing complex feature extraction algorithms. This not only improves the efficiency of feature extraction but also enables the discovery of subtle features that are difficult to detect using traditional methods [
9]. Through extensive training data, deep-learning models can learn the feature variations of components under different lighting conditions, angles, and backgrounds, thereby enhancing the model’s generalization ability and enabling more accurate identification of various anomalies in actual production environments [
10]. Deep-learning models are capable of handling a variety of defect types, including minute cracks, surface scratches, and internal defects. Even for rare or complex defects, high detection accuracy can be achieved through appropriate data augmentation and model optimization [
11]. With the advancement in hardware technologies, the inference speed of deep-learning models has been continuously improving, enabling real-time or near-real-time component inspection to meet the efficiency requirements of modern industrial production [
12]. Beyond convolutional neural networks, recent studies have explored Transformer architectures—exemplified by Vision Transformer (ViT) [
13] and its hierarchical variant Swin Transformer [
14]—for anomaly detection in high-resolution industrial imagery. Their intrinsic global receptive fields facilitate the capture of long-range spatial dependencies, yet this advantage is often counterbalanced by a marked increase in computational demand. Concurrently, classical machine-learning paradigms, including support vector machines, random forests, and the isolation-forest algorithm [
15], continue to demonstrate viability under low-sample regimes, particularly when feature dimensionality is modest and computational resources are constrained. Nevertheless, reliance on handcrafted feature engineering inherently limits their adaptability to the nuanced variability of complex industrial defects.
While deep learning has brought significant advancement to anomaly detection, several domain-specific challenges remain. First, high-quality defect annotations are scarce, especially for rare or subtle anomalies, resulting in severe data imbalance [
16]. Second, the large intra-class variation in industrial textures and the small spatial extent of defects make it difficult for standard CNNs to capture discriminative features [
17]. Third, most state-of-the-art methods rely on massive computational resources, which hinders real-time deployment on resource-constrained shop-floor devices [
18]. Finally, the black-box nature of deep networks complicates root-cause analysis and operator trust [
19]. Recent works attempt to alleviate these issues via few-shot learning [
20], knowledge distillation [
21], and attention-based architectures [
22], yet a comprehensive solution that simultaneously addresses data scarcity, computational efficiency, and interpretability in industrial anomaly detection is still lacking.
Deep-learning-based anomaly detection techniques for industrial components hold broad application prospects but also face numerous challenges. With the advancement of Industry 4.0, the development of intelligent manufacturing and the Industrial Internet of Things (IIoT) has imposed higher requirements on the quality control of components. Traditional inspection methods are no longer sufficient to meet the demands of modern industrial production, while deep-learning technologies offer new opportunities to address this issue. Research on deep-learning-based anomaly detection for industrial components can enhance detection efficiency and accuracy, reduce product defect rates, improve production efficiency, lower production costs, and strengthen the market competitiveness of enterprises. This paper proposes a deep-learning-based anomaly detection method for industrial components, with the following main contributions:
Small-Sample Semantic Segmentation Model Based on Industrial Dataset: We employ deep-learning methods to achieve a comprehensive understanding of image content by integrating visual features across different scales and incorporating attention computation strategies. Specifically, the method utilizes hierarchical feature extraction techniques to capture diverse information ranging from local details to global structures, while establishing correlations among distant pixels through an attention weight allocation mechanism. This design not only enhances the model’s ability to represent multi-granular semantic features but also effectively suppresses the impact of local interference factors. To verify the effectiveness of the proposed method, a semantic segmentation benchmark dataset containing complex scenes was specifically constructed. Experimental results demonstrate that this approach significantly improves segmentation accuracy.
Anomaly Detection Method Based on Knowledge Distillation: We design a cascaded knowledge transfer framework specifically for defect identification tasks in industrial images. The scheme consists of three key stages: First, a multi-level feature analysis of the input image is conducted, followed by optimization, integration, and dimensionality reduction of the features, and finally, high-quality reconstruction of the target image is achieved. In terms of implementation, a pre-trained teacher model serves as the backbone for feature extraction. The extracted visual features are then transformed into low-dimensional representations through a feature refinement module, which serves as the input signal for the reconstruction network. This architecture fully leverages the powerful feature representation capabilities of the pre-trained model, while the feature compression mechanism effectively enhances the efficiency and accuracy of anomaly detection.
Two-Stage Anomaly Detection Method: We propose a two-stage detection method. In the first stage, a semantic segmentation algorithm is used to extract the target region from high-resolution input images. In the second stage, the cropped local images are fed into the defect identification model. This approach significantly reduces the texture complexity of the region to be inspected, thereby effectively improving the accuracy of defect identification. Specifically, the preliminary region segmentation step not only minimizes background interference but also enables the anomaly detection model to focus on subtle feature differences in the key areas, ultimately achieving a significant optimization of detection performance.
3. Method
3.1. Small-Sample Semantic Segmentation Model Based on Industrial Dataset
At present, mainstream semantic segmentation techniques primarily rely on large-scale datasets for training to achieve precise recognition and segmentation of specific semantic features. However, when applied to industrial components with only a small number of defects, these methods often struggle to distinguish normal textures and structural features from localized irregular anomalies (such as bent, color, and scratch defect regions), and their segmentation performance is highly variable and unreliable.
The deep-learning model proposed in this paper draws on the core design philosophy of U-Net, employing a symmetric encoder–decoder framework with skip connections across layers. This architectural design is particularly well-suited for small-sample learning scenarios and can significantly enhance the model’s performance in edge detection and detail preservation. Moreover, the model incorporates a Multi-Scale Feature Fusion module to focus on the extraction of high-dimensional features. This not only expands the network’s receptive field but also effectively suppresses noise interference caused by abnormal features.
We design an innovative Adaptive Multi-Scale Attention Module (AMAM) to enhance image segmentation outcomes by optimizing the feature extraction process. The core innovations of this module are twofold: First, it employs a parallel multi-level depthwise separable convolution structure, enabling the network to process feature information at different scales simultaneously. Second, it introduces self-attention computation units to effectively capture long-range dependencies, thereby improving global context understanding. This dual-strategy design not only significantly reduces the model’s parameter count and enhances computational efficiency but also strengthens the network’s ability to integrate multi-level features.
The small-sample learning segmentation method we proposed, tailored for industrial scenarios, demonstrates exceptional segmentation accuracy and robustness in addressing common abnormal feature issues in industrial images. Unlike traditional methods, this model is specifically optimized for the characteristics of industrial data and can maintain stable segmentation performance even with limited sample sizes.
3.1.1. Semantic Segmentation Model Architecture
Figure 1 illustrates the small-sample semantic segmentation model proposed in this section, which primarily comprises three modules: First, the encoder module, which functions to extract semantic information from the input data and optimize these features; second, the AMAM, which enhances the expressive capability of feature encoding by fusing features across different scales and utilizes the optimized features for the segmentation task; and third, the decoder module, which generates segmentation results that match the input image. The encoder network consists of the four residual stages of the ResNet50 architecture [
41], with dilated convolutions introduced into the backbone network to effectively expand the local receptive field and reduce the model’s parameter count. The decoder framework employs a multi-layer upsampling module to enhance the detail representation in semantic segmentation. Additionally, to improve segmentation performance, the model incorporates attention mechanisms and skip connection techniques. The AMAM extracts high-level semantic information from multi-scale features and integrates these features through a self-attention mechanism, learning more adaptive latent features to support the segmentation results during the decoding process.
For the encoder, a pre-trained ResNet50 is utilized as the backbone network for multi-scale feature extraction. To achieve a larger receptive field and enhance the effectiveness of semantic segmentation, dilated convolutions are incorporated into the encoder, as demonstrated in a series of studies on DeepLab [
42]. The output stride is defined as the ratio of the spatial resolution of the input image to that of the final output. To attain satisfactory segmentation performance on industrial data, this paper sets the output stride to 16 to extract denser features, a value determined empirically to be optimal for industrial applications [
43]. The decoder fuses multi-scale features and performs upsampling across four sets of convolutional layers. Skip connections are introduced between corresponding layers of the encoder and decoder to effectively capture and fuse low-level and high-level features for detailed prediction. Moreover, an attention mechanism is added to capture broader context.
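As a concrete illustration, the dilated backbone can be sketched in PyTorch. This is a minimal sketch, assuming torchvision's ResNet50 with its replace_stride_with_dilation option as one way to realize the output stride of 16; the input size is arbitrary, and pre-trained weights would be loaded in practice.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

# Dilate the last stage instead of striding it, so the deepest feature map
# stays at 1/16 of the input resolution (output stride 16).
backbone = resnet50(replace_stride_with_dilation=[False, False, True])
encoder = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool and fc

x = torch.randn(1, 3, 512, 512)  # dummy industrial image
print(encoder(x).shape)  # torch.Size([1, 2048, 32, 32]); 512 / 32 = 16
```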
The semantic segmentation model is designed to process high-resolution industrial images efficiently. As shown in
Table 1, the encoder reduces the spatial dimensions while increasing the number of feature channels. The AMAM module maintains the spatial dimensions while enhancing the feature representation through multi-scale attention. The decoder then upsamples the feature maps to restore the original image dimensions, resulting in a high-resolution segmentation map. This detailed dimension change is crucial for understanding the model’s ability to capture both global and local features effectively.
3.1.2. Adaptive Multi-Scale Attention Module
The AMAM architecture processes six groups of feature information at different scales, which are then concatenated into a single feature representation. The corresponding mathematical description is as follows:

$$F_1 = \mathrm{BN}\left(\mathrm{Conv}_{1 \times 1}(X)\right), \qquad F_i = \mathrm{BN}\left(\mathrm{DSConv}_{3 \times 3}^{r_i}(X)\right), \quad r_i \in \{4, 8, 12, 16\}, \ i = 2, \ldots, 5$$

$$F_6 = \mathrm{Up}\left(\mathrm{BN}\left(\mathrm{Conv}_{1 \times 1}\left(\mathrm{AP}(X)\right)\right)\right), \qquad F = F_1 \oplus F_2 \oplus \cdots \oplus F_6$$

$$F_{\mathrm{out}} = \sigma\left(Q K^{\top}\right) \otimes V, \qquad Q, K, V = \mathrm{Conv}_{1 \times 1}(F)$$

where $\mathrm{BN}$ denotes Batch Normalization, $\mathrm{AP}$ denotes Average Pooling, $\sigma$ denotes the Sigmoid function, $\otimes$ represents Matrix Multiplication, and $\oplus$ represents Channel Concatenation.
As illustrated in
Figure 2, the AMAM processes the feature data extracted by the encoder through six distinct pathways: 1 × 1 convolution, 3 × 3 dilated convolutions with dilation rates of 4, 8, 12, and 16, and a global average pooling layer. The feature maps resulting from these six pathways are subsequently concatenated into a single feature representation. To enhance the extraction of global feature information, in addition to fusing multi-scale features, we incorporate a self-attention mechanism to further integrate multi-scale feature information.
Given that the feature data extracted by the encoder represent high-level, semantically dense information, excessively high dilation rates can easily lead to the loss of feature information. Therefore, we employ multiple layers of dilated convolutions with dilation rates set to 4, 8, 12, and 16 to obtain feature information across a broader range of scales. Keeping the dilation rates moderate refines the processing of high-level features, while increasing the number of dilated convolution layers yields a richer set of feature information. Together, these choices enhance the model's ability to extract feature information effectively.
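To make the pathway design concrete, the following PyTorch sketch assembles the six parallel branches. It is a minimal sketch under stated assumptions: the branch width of 256 channels and the 1 × 1 fusion layer are illustrative, and the self-attention unit applied after concatenation is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DSConv(nn.Module):
    """3x3 depthwise separable convolution with a configurable dilation rate."""
    def __init__(self, c_in, c_out, dilation):
        super().__init__()
        self.dw = nn.Conv2d(c_in, c_in, 3, padding=dilation,
                            dilation=dilation, groups=c_in, bias=False)
        self.pw = nn.Conv2d(c_in, c_out, 1, bias=False)
        self.bn = nn.BatchNorm2d(c_out)

    def forward(self, x):
        return F.relu(self.bn(self.pw(self.dw(x))))

class AMAMBranches(nn.Module):
    """Six parallel pathways whose outputs are concatenated along channels."""
    def __init__(self, c_in, c_branch=256):
        super().__init__()
        self.conv1x1 = nn.Sequential(
            nn.Conv2d(c_in, c_branch, 1, bias=False),
            nn.BatchNorm2d(c_branch), nn.ReLU())
        self.dilated = nn.ModuleList(
            [DSConv(c_in, c_branch, r) for r in (4, 8, 12, 16)])
        self.pool = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(c_in, c_branch, 1, bias=False), nn.ReLU())
        self.fuse = nn.Conv2d(6 * c_branch, c_branch, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        pooled = F.interpolate(self.pool(x), size=(h, w))
        branches = [self.conv1x1(x)] + [m(x) for m in self.dilated] + [pooled]
        return self.fuse(torch.cat(branches, dim=1))  # channel concatenation
```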
Depthwise separable convolutions [
44] decompose the standard convolution into a depthwise step and a pointwise step, reducing the parameter count to roughly one-ninth that of a standard 3 × 3 convolution along with the corresponding computational cost. Consequently, in the AMAM, standard convolutions are replaced with depthwise separable convolutions to improve the training performance and efficiency of the system.
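A quick parameter count verifies the reduction; the channel width of 256 used here is an illustrative assumption.

```python
import torch.nn as nn

def n_params(module):
    return sum(p.numel() for p in module.parameters())

c = 256
standard = nn.Conv2d(c, c, 3, padding=1, bias=False)
depthwise_separable = nn.Sequential(
    nn.Conv2d(c, c, 3, padding=1, groups=c, bias=False),  # depthwise step
    nn.Conv2d(c, c, 1, bias=False),                       # pointwise step
)
print(n_params(standard))             # 589824
print(n_params(depthwise_separable))  # 67840, roughly 1/9 of the standard conv
```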
To validate the effectiveness of these dilation rates, we conducted experiments on the MVTec dataset, comparing the performance of the model with different sets of dilation rates. The results are summarized in
Table 2. The results show that the dilation rates of 4, 8, 12, and 16 achieve the highest mIoU of 96.4% and an accuracy of 98.3%, while maintaining a reasonable inference time of 35.2 ms. Lower dilation rates (2, 4, 6, 8) result in slightly lower performance metrics, while higher dilation rates (6, 10, 14, 18 and 8, 12, 16, 20) increase the computational load without significant gains in performance. Therefore, the dilation rates of 4, 8, 12, and 16 were chosen as they provide the best trade-off between performance and efficiency.
3.1.3. Gate Attention Module
To achieve a larger receptive field while analyzing semantic positional relationships in context, feature data in the standard U-Net architecture are progressively downsampled on a grid, which facilitates the extraction of positional and relational information from low-level feature maps. However, small-area features carrying distinct semantic information are easily lost in the high-level feature maps. To enhance segmentation accuracy, we introduce a gate attention module into the model, which decomposes the task into separate localization and segmentation steps. As shown in
Figure 3, the use of the ReLU function in the gate attention module effectively suppresses the responses of irrelevant features without compromising the expression of semantic information. Moreover, owing to its simple structure, it avoids the need to train multiple models or add a large number of extra model parameters.
In practice, the gate attention module is integrated into the skip connections that convey salient features. Prior to the skip connection, information extracted from coarse-scale features is used for gating, with fine-scale features serving as a reference, to eliminate ambiguities related to irrelevant and noisy responses in the skip connections. Additionally, the gate attention module is incorporated into the first three layers, allowing for layer-by-layer weighting during forward propagation and thereby preventing the potential loss of small-area semantic information during prediction.
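The gating computation can be sketched as follows. This mirrors the additive attention-gate formulation popularized by Attention U-Net; the channel sizes and the assumption that the coarse gating signal has already been upsampled to the skip feature's resolution are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GateAttention(nn.Module):
    """Additive attention gate applied on a skip connection.

    g: coarse-scale gating features from the deeper layer.
    x: fine-scale skip features to be re-weighted.
    """
    def __init__(self, c_g, c_x, c_mid):
        super().__init__()
        self.w_g = nn.Conv2d(c_g, c_mid, 1, bias=False)
        self.w_x = nn.Conv2d(c_x, c_mid, 1, bias=False)
        self.psi = nn.Sequential(nn.Conv2d(c_mid, 1, 1), nn.Sigmoid())

    def forward(self, g, x):
        # ReLU suppresses irrelevant responses; the sigmoid then yields
        # per-pixel gating coefficients in [0, 1] that re-weight the skip.
        a = self.psi(F.relu(self.w_g(g) + self.w_x(x)))
        return x * a
```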
During the experimental process, it was observed that applying the gate attention module at every scale of the model did not yield the best results. Better outcomes were obtained by applying the gate attention module in the first three layers while using only a 1 × 1 convolution to adjust the spatial structure after the Adaptive Multi-Scale Attention Module at the lowest layer. Specifically, this configuration achieved a Mean Intersection over Union (mIoU) of 96.4% on the industrial dataset, which is 1.5% higher than the model using the gate attention module throughout all layers. We hypothesize that after passing through the Adaptive Multi-Scale Attention Module, the structural features become more complex; although attention mechanisms can capture a sufficiently large receptive field to obtain semantic contextual information, they may also cause information loss, thereby reducing segmentation accuracy. To verify this hypothesis, we conducted ablation studies comparing the performance of different configurations. The results showed that using a 1 × 1 convolution in the lowest layer while applying gate attention in the first three layers yielded a 2.1% improvement in mIoU compared to using gate attention alone. This suggests that the combination of 1 × 1 convolution and gate attention effectively balances the trade-off between capturing detailed features and maintaining semantic context, thus enhancing segmentation accuracy.
3.2. Anomaly Detection Method Based on Knowledge Distillation
The knowledge distillation framework model proposed in this paper is illustrated in
Figure 4. The model consists of a fixed pre-trained teacher network, a trainable feature aggregation and filtering module, and a student network. During the training process, the teacher network is first utilized to extract multi-scale features from the input normal samples. Subsequently, a student network is trained to reconstruct these multi-scale features from the compact feature data extracted by the feature aggregation and filtering module. In the prediction phase, the representations extracted by the pre-trained teacher network can still capture the abnormal features present in the samples. However, the student network, which has never been trained on anomalies, is unable to reconstruct these abnormal features from the corresponding aggregated features. Therefore, in the proposed teacher–student model, lower similarity between corresponding representations indicates a higher anomaly score. The teacher network employs a pre-trained WideResNet50 [
45], while the student network utilizes a custom multi-level residual network. The heterogeneity between the teacher and student networks effectively reduces the possibility of complete reconstruction of abnormal features in knowledge distillation. Moreover, the trainable feature aggregation and filtering module further compresses the multi-scale patterns into an extremely low-dimensional space for serial downstream reconstruction. This not only reduces redundant upstream data but also focuses the features more on global information rather than details.
In common teacher–student models for anomaly detection, normal samples are fed simultaneously into both the pre-trained teacher network and the student network during training, and the parameters of the student network are continuously updated to make the features extracted by the two networks as consistent as possible. During prediction, because the student network has only been trained to match the teacher's features on normal samples, the features output by the two networks remain similar for normal samples but differ significantly for abnormal ones. However, since the student and teacher share similar architectures and receive the same inputs, the student can easily generalize and produce features similar to the teacher's even on abnormal samples. The proposed model instead adopts a serial approach, using the teacher network as the upstream component. Normal samples are fed only into the teacher model for multi-scale feature extraction; after compression and filtering, the multi-scale features are passed to the downstream student network, which reconstructs the multi-scale data extracted by the teacher. This serial structure enhances the diversity of the student network's representation and the randomness of its reconstructions of abnormal samples, thereby improving detection performance. In terms of model selection, this paper employs a backbone network pre-trained on the ImageNet dataset [
46] as the teacher network, with all parameters of the teacher network frozen during the knowledge distillation process. Additionally, ablation studies in this paper demonstrate that ResNet [
41] and WideResNet [
45] are both effective feature extraction networks, as they can extract rich features from images.
To match the multi-scale features of the teacher network, the student network is designed as a mirror image of the teacher, with the order of its stages reversed. The reverse design facilitates the filtering of abnormal features by the student network, while the mirrored structure ensures that the student network produces features with the same dimensions as the teacher network. In reverse distillation, the objective of the student network is to simulate the behavior of the teacher network. In anomaly detection, the shallow layers of neural networks extract local descriptors of low-level information (such as color, edges, and textures), while the deeper layers have a broader receptive field and reflect global semantic and structural information; low teacher–student similarity at low levels and at high levels therefore indicates local anomalies and global structural outliers, respectively. Accordingly, we employ a distillation technique based on multi-scale features.
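A minimal sketch of the upstream half of this pipeline, assuming torchvision's WideResNet50 as the frozen teacher; the aggregation module and student network referenced in the final comment are stand-ins for the components described above.

```python
import torch
from torchvision.models import wide_resnet50_2

# Frozen teacher; ImageNet-pre-trained weights would be loaded in practice.
teacher = wide_resnet50_2()
teacher.eval()
for p in teacher.parameters():
    p.requires_grad = False

def teacher_features(x):
    """Multi-scale features from the first three residual stages."""
    x = teacher.maxpool(teacher.relu(teacher.bn1(teacher.conv1(x))))
    f1 = teacher.layer1(x)   # low level: color, edges, textures
    f2 = teacher.layer2(f1)  # mid level
    f3 = teacher.layer3(f2)  # high level: semantics and structure
    return [f1, f2, f3]

feats = teacher_features(torch.randn(1, 3, 256, 256))
# Serial flow: feats -> aggregation/filtering module -> student reconstruction.
```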
To provide a clear understanding of the changes in image dimensions throughout the anomaly detection model, we have included
Table 3, which shows the specific input and output dimensions for each layer. The anomaly detection model processes high-resolution industrial images efficiently. The teacher network, a pre-trained WideResNet50, extracts multi-scale features from the input image, reducing the spatial dimensions while increasing the number of feature channels. The feature aggregation and filtering module further compresses and filters these features, resulting in a compact representation. The student network then attempts to reconstruct these features, with the difference between the reconstructed and original features indicating anomalies. This detailed dimension change is crucial for understanding the model’s ability to capture both global and local features effectively.
3.2.1. Feature Aggregation and Filtering Module
As shown in
Figure 5, the feature aggregation and filtering module does not adopt a pure Transformer architecture but instead introduces a CNN-Transformer hybrid model. Specifically, the CNN first functions as a feature extractor, responsible for generating feature mappings of the input data. Subsequently, a 14 × 14 embedding is extracted from the feature maps generated by the CNN and used as the input to the Transformer module, rather than being taken directly from the original image. This design choice is based on the following considerations: First, it fully utilizes the intermediate high-resolution CNN feature maps in the decoding path. Second, experimental results indicate that the hybrid CNN-Transformer encoder outperforms the pure Transformer encoder. Finally, by reducing the size of the embedding, the number of parameters in the Transformer is effectively decreased, thereby reducing the computational burden. In the proposed reverse knowledge distillation framework, the objective of the student network is to reconstruct the multi-scale features of the teacher network. Therefore, the output features of the last encoding block in the backbone network can be directly passed to the student network. However, this direct connection poses two main issues. On one hand, although the high capacity of the teacher network enables it to extract rich features, these high-dimensional feature descriptions may contain a large amount of detail and noise, which can interfere with the student network's decoding of normal features. On the other hand, the output of the last encoder block in the backbone network typically reflects the semantic and structural information of the input data. Given that the order of knowledge distillation is reversed, directly passing this high-level feature representation to the student network poses a significant challenge for the reconstruction of low-level features.
In previous studies, data reconstruction has often been achieved by introducing skip connections to link the encoder and decoder. However, this method is not feasible in the context of knowledge distillation, as skip connections would leak abnormal information to the student network during inference. To address the difficulties the student network faces in reconstructing images from high-level features, we employ a Multi-Scale Feature Fusion (MFF) block to integrate multi-scale features before passing them to the student network. To achieve representation alignment in feature connections, we downsample the shallow features through one or more 3 × 3 convolutional layers, each followed by batch normalization and ReLU activation. Subsequently, a 1 × 1 convolutional layer with stride 1, followed by batch normalization and ReLU activation, is used to generate rich and compact feature representations. The MFF block aggregates low-level and high-level features into a fused feature dataset. A ViT block then performs context-based analysis of the multi-layer feature information, filtering out the fundamental information beneficial to the student network. Finally, two residual network modules are employed to further reduce feature noise and optimize the feature representation.
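The fusion step can be sketched as follows; this is a minimal sketch in which the channel widths follow the teacher's first three stages, and the downstream ViT filtering block and residual refinement modules are omitted.

```python
import torch
import torch.nn as nn

def down_block(c, n_stride2):
    """n_stride2 strided 3x3 conv + BN + ReLU blocks for spatial alignment."""
    layers = []
    for _ in range(n_stride2):
        layers += [nn.Conv2d(c, c, 3, stride=2, padding=1, bias=False),
                   nn.BatchNorm2d(c), nn.ReLU()]
    return nn.Sequential(*layers)

class MFFSketch(nn.Module):
    """Fuse three teacher feature scales into one compact representation."""
    def __init__(self, c1=256, c2=512, c3=1024, c_out=1024):
        super().__init__()
        self.down1 = down_block(c1, 2)  # stride 4 -> stride 16
        self.down2 = down_block(c2, 1)  # stride 8 -> stride 16
        self.fuse = nn.Sequential(      # 1x1 conv compacts the concatenation
            nn.Conv2d(c1 + c2 + c3, c_out, 1, stride=1, bias=False),
            nn.BatchNorm2d(c_out), nn.ReLU())

    def forward(self, f1, f2, f3):
        fused = torch.cat([self.down1(f1), self.down2(f2), f3], dim=1)
        return self.fuse(fused)  # handed to the ViT filtering block next
```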
3.2.2. Loss Function and Anomaly Scoring
During the training process, the primary objective is to enable the student network to accurately reconstruct the multi-scale feature information extracted by the teacher network from normal images and to minimize the discrepancy between the multi-scale feature information reconstructed by the student network and that extracted by the teacher network. In the prediction phase, the pre-trained teacher network remains capable of effectively extracting features from abnormal images. However, since the student network has not been trained to reconstruct the multi-scale features of abnormal images, the difference between the multi-scale feature information reconstructed by the student network and that extracted by the teacher network for abnormal images may significantly increase. Based on this difference in feature information, we define the method for calculating abnormal features as follows.
Given the necessity to compute the differences between multi-scale feature information, it is essential to clearly define the feature information extracted by the teacher network (denoted as Advisor, A) and the feature information reconstructed by the student network (denoted as Student, S) at each scale. The specific definitions are as follows:
In the context of the knowledge distillation model, the feature reconstruction and extraction at each scale are defined as

$$f_S^k = D^k\left(f_S^{k-1}\right), \qquad f_A^k = E^k\left(f_A^{k-1}\right)$$

where $D^k$ and $E^k$ denote the k-th feature reconstruction layer within the student network and the k-th feature extraction layer within the teacher network, respectively; $f_S^{k-1}$ and $f_A^{k-1}$ represent the feature reconstruction and feature extraction results from the preceding layer; and $f_S^k, f_A^k \in \mathbb{R}^{H_k \times W_k \times C_k}$, where $H_k$, $W_k$, and $C_k$ correspond to the height, width, and number of channels of the k-th layer feature data. In the image reconstruction task of knowledge distillation models, cosine similarity is commonly employed as the loss function, as it more accurately captures the relationships between high-dimensional and low-dimensional information. Based on this, we calculate the differences among multi-scale feature information to identify anomalous features:

$$M^k(h, w) = 1 - \frac{f_A^k(h, w)^{\top} f_S^k(h, w)}{\left\| f_A^k(h, w) \right\| \left\| f_S^k(h, w) \right\|}$$

When the value of $M^k(h, w)$ is relatively large, it indicates a higher degree of anomaly at that particular location. By integrating the multi-scale knowledge distillation approach and aggregating the multi-scale anomaly maps, we derive a scalar loss function for the optimization of the student network:

$$\mathcal{L} = \sum_{k=1}^{3} \frac{1}{H_k W_k} \sum_{h=1}^{H_k} \sum_{w=1}^{W_k} M^k(h, w)$$
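A hedged PyTorch rendering of the anomaly maps and training loss defined above; teacher_feats and student_feats stand for the three corresponding feature lists produced by the teacher and student networks.

```python
import torch
import torch.nn.functional as F

def anomaly_maps(teacher_feats, student_feats):
    """Per-scale anomaly maps M^k: 1 - cosine similarity along channels."""
    return [1.0 - F.cosine_similarity(t, s, dim=1)  # each map: (B, H_k, W_k)
            for t, s in zip(teacher_feats, student_feats)]

def distillation_loss(teacher_feats, student_feats):
    """Scalar training loss: sum over scales of the mean anomaly map value."""
    maps = anomaly_maps(teacher_feats, student_feats)
    return torch.stack([m.mean() for m in maps]).sum()
```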
To comprehensively evaluate the performance of anomaly detection and anomaly localization, we select anomaly scores at two levels for comparative analysis. The first is the pixel-level anomaly localization score (AL). Specifically, the cosine similarity among the three sets of features generated by the three encoder–decoder pairs is computed using the aforementioned formula to obtain the anomalous features, which are then upsampled to the image size via bilinear interpolation. In the initial computation of the anomaly score, we directly summed the three sets of anomalous feature information to derive the pixel-level score; however, this straightforward summation fails to achieve the optimal feature combination. Therefore, we employ convolutional operations to adjust the weights of the anomalous feature information. During the experimental phase, to learn the optimal weights for the multi-scale features, a convolutional layer was applied to the feature information prior to the anomaly-score calculation, serving as an adaptive weight trained alongside the model. To determine the appropriate kernel size for this convolution, we conducted a series of experiments; the results indicate that a 1 × 1 convolutional kernel as the adaptive weight yields the best anomaly detection performance. The relevant formula is as follows:
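One formulation consistent with this description, where $\mathrm{Up}$ denotes bilinear upsampling to the input resolution, $\oplus$ channel concatenation, and the 1 × 1 convolution supplies the learned adaptive weights:

$$AL = \mathrm{Conv}_{1 \times 1}\left( \bigoplus_{k=1}^{3} \mathrm{Up}\left( M^{k} \right) \right)$$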
The other score is the sample-level anomaly score (AD), used for the anomaly detection task. Some approaches take the average of the pixel-level anomaly scores over all pixel locations as the anomaly score of a sample. However, this becomes inappropriate when the anomalous region is small: although the pixel-level scores within the anomalous region are high, the vast majority of the image consists of normal regions, so the average score of the entire image may fall within the threshold range of normal images and lead to misjudgment. To avoid this, we use the maximum of the pixel-level anomaly scores as the sample-level anomaly score: if the score at any single pixel location exceeds the threshold, the image is deemed anomalous. The calculation formula is as follows:
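$$AD = \max_{(h, w)} AL(h, w)$$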
5. Conclusions
Confronted with complex industrial components, we employ a two-stage detection approach to enhance the accuracy of anomaly detection. In the semantic segmentation phase, a U-shaped architecture with skip connections is proposed as the fundamental framework for the semantic segmentation model, thereby achieving satisfactory performance in terms of edges and details when dealing with limited industrial component data. Additionally, it increases the focus on high-resolution features, enabling the capture of a sufficiently large receptive field and effectively mitigating interference caused by anomalous features. Moreover, to reduce the number of network parameters for improved efficiency and to better extract global information from high-level features, an Adaptive Multi-Scale Attention Module (AMAM) is proposed to enhance segmentation performance. Within this module, parallel multi-level depthwise separable convolutions are employed, allowing feature information to better adapt to multiple scales. To obtain more comprehensive contextual information, a self-attention mechanism is designed to complement it.
In the anomaly detection component, a novel Transformer-based serial knowledge distillation anomaly detection model is proposed. A pre-trained WideResNet50 network is utilized as the teacher network, while the student network employs a customized multi-level residual network. The heterogeneity between the teacher and student networks effectively reduces the likelihood of complete reconstruction of anomalous features during knowledge distillation. A Transformer-based multi-scale feature aggregation and filtering module is used to further compress feature data and reduce noise. Ultimately, the normality of input images is determined from the cosine-similarity differences between corresponding features of the teacher and student networks.
Experiments conducted on the MVTec dataset demonstrate that, compared to existing mainstream image segmentation models, the proposed small-sample semantic segmentation model achieves superior segmentation performance. Experiments demonstrate that AMAM maintains an mIoU above 94% when only 10% of the annotations are available, substantially alleviating the scarcity of defective samples in industrial scenarios. In comparison with SOTA anomaly detection methods, the proposed knowledge distillation-based anomaly detection model exhibits better performance in detecting anomalies in both texture and object images on the MVTec dataset. Additionally, through experimental comparisons, the proposed two-stage anomaly detection method is shown to more accurately localize anomalous regions and achieve better detection results than direct one-stage anomaly detection.