4.1. Dataset
Our experiments predominantly utilized the widely used MVTec Anomaly Detection dataset [38] and the VisA dataset [39] for both anomaly detection and localization tasks. As is standard in anomaly detection, the training sets contain only normal samples, while the test sets include both normal and anomalous samples, along with their corresponding segmentation masks.
MVTec AD [38] is a widely recognized anomaly detection dataset consisting of 5354 images, of which 3629 normal samples form the training set. The test set includes both normal samples and a variety of anomalies, ranging from scratches, dents, colored spots, and cracks to combined defects. The dataset covers 15 categories: 10 object categories (bottle, cable, capsule, hazelnut, metal_nut, pill, screw, toothbrush, transistor, zipper) and 5 texture categories (carpet, grid, leather, tile, wood).
VisA [39] is a recent industrial anomaly detection dataset comprising 12 categories and 10,821 high-resolution images. Of these, 9621 normal samples form the training set, while the remaining 1200 images constitute the test set. The 12 subsets fall into three broad groups based on the characteristics of the objects: the first group contains a single instance per image (Cashew, Chewing Gum, Fryum, and Pipe Fryum); the second contains multiple instances per image (Capsules, Candles, Macaroni1, and Macaroni2); and the remaining four subsets contain printed circuit boards (PCBs) with intricate designs. Anomalous images include surface defects, such as scratches, dents, colored spots, and cracks, as well as structural defects, such as misplaced or missing components.
Both the MVTec AD and VisA datasets are designed with real anomalies reserved for testing, while the training data include only normal samples. This setup inherently evaluates a model’s ability to generalize to unseen real-world defects, as the training phase does not expose the model to actual anomaly patterns.
4.4. Main Results
Quantitative results. We report the image-wise AUROC for the image-level anomaly detection task in Table 2. Our approach achieves 100% discrimination accuracy across multiple categories and attains the best average performance among current state-of-the-art methods, demonstrating its ability to accurately distinguish defective items of various materials and appearances.
For pixel-level outcomes, we report the results in Table 3 and Table 4. On average, our method excels in both pixel-level AUROC and pixel-level AP, achieving top scores across multiple categories. Specifically, our method achieves improvements of 0.4% and 1.2% on the AUROC and AP metrics, respectively.
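As a concrete reference, the image- and pixel-level metrics reported here can be computed as sketched below with scikit-learn, assuming per-image anomaly scores, per-pixel score maps, and binary ground-truth masks (the variable names and toy data are illustrative, not from our implementation):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def image_auroc(scores, labels):
    # scores: one anomaly score per image; labels: 1 = anomalous, 0 = normal
    return roc_auc_score(labels, scores)

def pixel_auroc_ap(score_maps, gt_masks):
    # Flatten all per-pixel scores and ground-truth masks across the test set
    s = np.concatenate([m.ravel() for m in score_maps])
    g = np.concatenate([m.ravel() for m in gt_masks])
    return roc_auc_score(g, s), average_precision_score(g, s)

# Toy example: two 8x8 score maps, one containing a labeled defect region
rng = np.random.default_rng(0)
maps = [rng.random((8, 8)) * 0.2 for _ in range(2)]
masks = [np.zeros((8, 8), dtype=int) for _ in range(2)]
masks[0][2:4, 2:4] = 1
maps[0][2:4, 2:4] = 0.9          # defect pixels score above all normal pixels
p_auroc, p_ap = pixel_auroc_ap(maps, masks)
```

Because pixel-level AP is computed over the highly imbalanced set of all test pixels, it penalizes diffuse false positives more harshly than pixel AUROC, which is why we report both.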
Regarding region-wise performance, we compare using the AUPRO metric, as detailed in Table 5. Since AUPRO considers regional overlaps rather than pixel-level comparisons, it treats anomalies of any size equally. Our method surpasses state-of-the-art performance in over half of the categories, showcasing consistent performance across different sizes and shapes of anomalous regions.
In Table 6 and Table 7, we provide a quantitative comparison of our FFC-AD method against recent state-of-the-art anomaly detection methods using the AUROC metric at both the image and pixel levels. Our method achieves comparable performance on the image-wise AUROC metric, primarily because other methods, such as those in [17,18], employ a WideResNet-50 as their feature extractor, whereas our approach uses a ResNet-18 with FFC layers. Despite this significant disparity in parameter count, our method remains competitive. On the pixel-wise AUROC metric, our method significantly outperforms the other approaches, demonstrating its superior localization capability.
Table 8 shows that our method consistently achieves the best localization accuracy, surpassing the previous best-performing method by 3.3% in terms of average AP.
Table 9 focuses on the AUPRO metric, which evaluates region-wise overlaps rather than pixel-level comparisons, thereby treating anomalies of any size equally. The consistent gains on this metric highlight the capability of our FFC-AD method to accurately identify and localize defect areas, regardless of their complexity or scale.
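For reference, the per-region overlap (PRO) that AUPRO integrates (over false positive rates up to a cutoff, commonly 0.3) can be sketched as below. This is an illustrative simplification, assuming binary thresholding of the score maps and SciPy connected-component labeling; the function name is hypothetical:

```python
import numpy as np
from scipy import ndimage

def pro_at_threshold(score_maps, gt_masks, thr):
    """Mean overlap between the thresholded prediction and each connected
    ground-truth region; AUPRO integrates this quantity over thresholds
    (parameterized by the false positive rate)."""
    overlaps = []
    for scores, gt in zip(score_maps, gt_masks):
        pred = scores >= thr
        labeled, n_regions = ndimage.label(gt)   # connected anomalous regions
        for k in range(1, n_regions + 1):
            region = labeled == k
            overlaps.append((pred & region).sum() / region.sum())
    return float(np.mean(overlaps)) if overlaps else 0.0

# Toy example: a small and a large defect region in one ground-truth mask.
# A prediction that fully covers both gives PRO = 1.0 regardless of region
# size, which is exactly the size-invariance property discussed above.
gt = np.zeros((16, 16), dtype=bool)
gt[1:3, 1:3] = True      # small region (4 px)
gt[8:14, 8:14] = True    # large region (36 px)
scores = gt.astype(float)
pro = pro_at_threshold([scores], [gt], 0.5)
```

Averaging the overlap per region, rather than per pixel, prevents one large defect from dominating the score the way it would in a pixel-pooled metric.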
Our method achieves state-of-the-art performance across different metrics on multiple datasets, spanning diverse defect types and object/texture classes. RD [17], RD++ [18], and ViTAD [24] adopt a feature reconstruction paradigm that directly compares features from the input and reconstructed images. While effective at capturing high-level discrepancies, the high-dimensional nature of the feature space and the lack of explicit spatial cues limit their performance in fine-grained localization, leading to lower AP and AUPRO scores.
Diffusion-based approaches such as DiAD [20] and DDAD [22] introduce pretrained feature extractors to guide the reverse denoising process. However, the pixel-level fidelity of diffusion-generated outputs remains a challenge: these methods often struggle to accurately reconstruct small or fine-grained anomalies, which affects both classification and localization precision. DRAEM [10] and DeSTSeg [19] leverage pixel-level pseudo-anomaly generation and segmentation-specific architectural designs. These techniques yield competitive localization accuracy but rely heavily on carefully crafted training augmentations and specialized modules. SimpleNet [11] focuses on efficient anomaly simulation in the feature space; although it performs well at distinguishing normal from anomalous samples, its lack of explicit pixel-level reconstruction or segmentation mechanisms limits its localization capability. RealNet [21] simulates anomalies at the image level, providing a more intuitive modeling of visual defects, but the absence of a dedicated segmentation network restricts its ability to localize defects precisely. The consistent results of our FFC-AD across these categories demonstrate strong cross-category robustness and generalization to unseen anomalies.
Qualitative results. To further evaluate the effectiveness of our proposed method, we conduct qualitative experiments to demonstrate the anomaly localization performance.
Figure 3 and Figure 4 illustrate the visualization results of our approach on the MVTec AD and VisA datasets, respectively. As can be seen from the figures, our method accurately distinguishes various defects across different categories. For large defects, our method provides sharp and compact edge localization. For small defects, our method achieves precise defect localization with no extraneous false detections. The consistent performance across different datasets demonstrates the generalization capability and versatility of our approach.
Figure 5 presents a visual comparison of results on the same samples across different methods. DRAEM [10] performs poorly on low-resolution images, failing to effectively identify abnormal regions. In contrast, RD [17] detects various types of anomalies more capably, with abnormal areas localized more accurately; however, it often suffers from a high false positive rate. For instance, in the “bottle” and “tile” samples, the significant number of false alarms could trigger unnecessary alerts in practical applications. SimpleNet [11] enhances generalization by adding noise at the feature level. Nevertheless, due to the lack of explicit segmentation constraints, it struggles to precisely locate the boundaries of small anomalies, as seen in the “capsule” and “screw” samples, indicating room for improvement in its segmentation performance. Our method achieves the best segmentation for both texture and object categories, producing accurate, shape-preserving outputs for anomalies of all sizes. Notably, in the “grid” and “screw” samples, it significantly reduces false positives while accurately localizing anomalies. In summary, our method excels across diverse anomaly detection scenarios and holds great potential for practical application.
Efficiency results. Our proposed method achieves a remarkable balance between computational efficiency and performance, as evidenced by the results in Table 10. With only 30 M parameters and 40 G FLOPs, FFC-AD outperforms state-of-the-art methods in both detection and localization. The lightweight design of FFC-AD is attributed to its frequency-domain feature transformation, which reduces redundancy in spatial feature learning while maintaining discriminative power. This makes it particularly suitable for real-world industrial applications where both accuracy and resource constraints are critical considerations.
False detection. As illustrated in Table 11, the F1-score achieved by our method reflects a balanced improvement in both precision and recall, outperforming all competing approaches. This dual enhancement indicates that FFC-AD effectively addresses two critical challenges in anomaly detection: suppressing false positives (FPs) while keeping false negatives (FNs) low. The high precision demonstrates the model’s ability to suppress FPs, which in conventional methods are often caused by texture ambiguities or reconstruction artifacts. Simultaneously, the superior recall highlights the model’s robustness against FNs, particularly when detecting subtle or low-contrast anomalies. The balanced precision–recall trade-off further underscores the effectiveness of HSAS in mitigating overgeneralization, a common issue in reconstruction-based models: by simulating anomalies in the hidden space, HSAS prevents the model from reconstructing defects as normal patterns, thereby reducing FNs without compromising precision.
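For clarity, the interplay between precision, recall, and F1 discussed above can be shown with a small sketch (the confusion counts are made up purely for illustration):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Suppressing false positives at equal recall raises precision,
# and hence the F1-score
p_base, r_base, f1_base = precision_recall_f1(tp=90, fp=30, fn=10)
p_ours, r_ours, f1_ours = precision_recall_f1(tp=90, fp=10, fn=10)
```

Since F1 is the harmonic mean of precision and recall, it rewards exactly the balanced improvement on both sides that Table 11 reports, rather than a gain on one side bought at the expense of the other.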
4.5. Ablation Studies
Main architecture. As shown in Table 12, we conducted ablation studies on the contribution of each component of FFC-AD. Compared to experiment 1, introducing the FFC encoder brings slight improvements to both metrics: the detection AUROC increases to 98.7%, indicating that the FFC encoder effectively enhances feature extraction and contributes to better overall performance. In experiment 3, further incorporating the FFC denoising autoencoder (FFC DAE) yields additional gains in both detection and localization, demonstrating its role in refining the anomaly detection process by reducing noise and enhancing robustness. The final configuration integrates all three components and achieves the highest performance, with a detection AUROC of 98.9% and a localization AP of 77.0%. The introduction of Hidden Space Anomaly Simulation (HSAS) notably improves the model’s ability to localize anomalies, suggesting its effectiveness in simulating and identifying anomalies in the hidden space. This hierarchical approach captures more nuanced spatial information, leading to enhanced anomaly detection and localization. In summary, the ablation study confirms that each component contributes positively to the overall performance of FFC-AD.
FFC layer architecture. To determine the optimal number of FFC layers within each FFC block, we conducted an ablation study; the results are summarized in Table 13. Incorporating two layers leads to significant improvements, achieving the highest performance metrics. This configuration is marked as the default entry due to its superior performance, suggesting that two layers provide an optimal balance between complexity and effectiveness. Increasing the number of layers to three slightly decreases performance compared to two layers, with drops of 0.2% in detection AUROC and 2.4% in localization AP, suggesting that layers beyond two may introduce unnecessary complexity without corresponding benefits.
As presented in Table 14, we further analyzed the impact of varying the ratio of the global branch within the FFC block. With a global ratio of 0.25, the model achieves a detection AUROC of 98.3% and a localization AP of 75.3%. In defect detection, although local details are crucial, a certain degree of global context helps the model understand the overall scene surrounding a defect, such as the relationship between the defect and the neighboring normal structure; an overly low global ratio therefore restricts the model’s capacity to capture these broader patterns and makes scene-dependent defects harder to identify. In contrast, increasing the global ratio to 0.75 also leads to a slight decline in performance. Most defects occupy only a small number of pixels, so an excessive global ratio diverts computing resources and attention toward global information, leaving the subtle local features of defects insufficiently extracted and analyzed, which in turn hurts detection and localization accuracy. In summary, the ablation study on the global-to-local ratio highlights that a balanced ratio of 0.5 provides the best overall performance, ensuring that the model effectively captures both local and global features.
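To make the role of the global ratio concrete, below is a toy PyTorch sketch of an FFC-style channel split: a fraction `ratio_g` of the channels is routed through a spectral (Fourier-domain) branch with global receptive field, and the rest through an ordinary local convolution. This is a simplified illustration under our own naming (`SpectralBranch`, `SimpleFFC` are hypothetical), not our exact block, which also includes cross-branch connections and normalization:

```python
import torch
import torch.nn as nn

class SpectralBranch(nn.Module):
    """Global branch: a 1x1 conv applied in the Fourier domain, giving
    every output position an image-wide receptive field."""
    def __init__(self, ch):
        super().__init__()
        # operate on concatenated real/imaginary parts of the spectrum
        self.conv = nn.Conv2d(ch * 2, ch * 2, kernel_size=1)

    def forward(self, x):
        _, _, h, w = x.shape
        f = torch.fft.rfft2(x, norm="ortho")
        f = torch.cat([f.real, f.imag], dim=1)
        f = self.conv(f)
        real, imag = f.chunk(2, dim=1)
        return torch.fft.irfft2(torch.complex(real, imag), s=(h, w), norm="ortho")

class SimpleFFC(nn.Module):
    """Toy FFC-style block: split channels by the global ratio, process each
    part in its own branch, and concatenate the results."""
    def __init__(self, ch, ratio_g=0.5):
        super().__init__()
        self.cg = int(ch * ratio_g)       # global (spectral) channels
        self.cl = ch - self.cg            # local (spatial) channels
        self.local = nn.Conv2d(self.cl, self.cl, kernel_size=3, padding=1)
        self.global_ = SpectralBranch(self.cg)

    def forward(self, x):
        xl, xg = x[:, :self.cl], x[:, self.cl:]
        return torch.cat([self.local(xl), self.global_(xg)], dim=1)
```

In this framing, the ablation in Table 14 is simply sweeping `ratio_g`: at 0.25 too few channels see global context, while at 0.75 too few channels are left for the local branch that resolves small defects.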
HSAS architecture. We conducted ablation studies on the impact of different dropout probabilities and dropout types on the Hidden Space Anomaly Simulation (HSAS) module; the results are summarized in Table 15. We first evaluated varying dropout probabilities within the 2D dropout configuration. Increasing the dropout probability from 0.01 to 0.04 slightly improves the localization AP to 76.2%: mild neuron deactivation begins to disrupt overreliance on specific pathways but remains limited in promoting distributed feature learning, as the sparse deactivation leaves most original patterns intact and hinders the decoder’s ability to distinguish simulated defects. Setting the dropout probability to 0.1 yields the highest detection AUROC of 98.9%, suggesting that a moderate level of dropout effectively enhances feature robustness without overfitting, consistent with the method’s goal of repurposing dropout to disrupt co-adaptation between normal and anomalous pattern reconstruction. Further increasing the dropout probability to 0.2 leads to a slight decrease in both detection AUROC and localization AP: excessive neuron deactivation impairs the decoder’s access to meaningful latent-space features, as the aggressive randomness undermines the integrity of the simulated defective patterns. Dropout probabilities must therefore be calibrated to preserve the structural integrity of simulated defects while achieving the desired regularization effect.
We also compared the performance using 1D dropout with a probability of 0.1. While this configuration matches the detection performance of the optimal 2D dropout setting, it falls slightly short in localization accuracy. This discrepancy can be attributed to the inherent structural differences between the two dropout mechanisms: 2D dropout operates on spatial feature maps, preserving local contextual relationships while introducing randomness, which aligns with the spatial nature of anomalous patterns in our task. In contrast, 1D dropout, which acts along the feature channel axis, disrupts inter-channel dependencies but fails to adequately model the spatial continuity critical for precise localization. This result reinforces that 2D dropout is more effective at capturing spatial dependencies in anomaly detection, particularly when precise localization of defective regions is required.
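The behavioral difference between the two dropout types can be illustrated with PyTorch’s `nn.Dropout2d` (which zeroes entire feature channels, leaving the spatial layout of surviving channels intact) versus element-wise `nn.Dropout`. This is a minimal illustration of the mechanism, not the HSAS module itself, and the feature tensor is synthetic:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
feat = torch.randn(2, 16, 8, 8)        # synthetic latent feature map (N, C, H, W)

drop2d = nn.Dropout2d(p=0.1).train()   # zeroes whole channels
drop1d = nn.Dropout(p=0.1).train()     # zeroes individual elements

sim2d = drop2d(feat)
sim1d = drop1d(feat)

# Under Dropout2d every channel is either entirely zeroed or entirely kept
# (rescaled by 1/(1-p)), so the spatial structure inside surviving channels
# is preserved; Dropout instead punches independent holes everywhere.
flat = sim2d.flatten(2)
channel_all_zero = flat.eq(0).all(dim=2)
channel_no_zero = flat.ne(0).all(dim=2)
```

This channel-wise behavior is what the discussion above refers to: the spatial coherence retained by 2D dropout better matches the spatially contiguous nature of real defects than the scattered element-wise corruption of 1D dropout.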
Optimizer choice. To evaluate the impact of the optimizer on anomaly detection performance, we conducted an ablation study comparing AdamW and SGD (stochastic gradient descent); the results are summarized in Table 16. Although AdamW is known for its adaptive per-parameter learning rates, which can speed convergence and handle sparse gradients well, the model reaches higher performance with SGD. SGD’s simplicity and its effectiveness in navigating the loss landscape with momentum often lead to better generalization and higher accuracy, especially in tasks requiring precise localization.