SeRNet: Segmentation Helps Reconstruction for Anomaly Detection

Cui, Yan; Sun, Jinkai; Liu, Xiying; Wei, Shun; Jiang, Jielin

doi:10.3390/app16041670

Open AccessArticle

SeRNet: Segmentation Helps Reconstruction for Anomaly Detection

by

Yan Cui

¹

,

Jinkai Sun

²

,

Xiying Liu

²

,

Shun Wei

²

and

Jielin Jiang

^2,*

¹

School of Artificial Intelligence, Nanjing Normal University of Special Education, Nanjing 210038, China

²

School of Software, Nanjing University of Information Science and Technology, Nanjing 210044, China

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(4), 1670; https://doi.org/10.3390/app16041670

Submission received: 25 December 2025 / Revised: 2 February 2026 / Accepted: 4 February 2026 / Published: 7 February 2026

(This article belongs to the Section Computing and Artificial Intelligence)

Download

Browse Figures

Versions Notes

Abstract

With production growth and improvements in production speed in modern industries, accurate anomaly detection is becoming increasingly important to improve quality inspection efficiency and help minimize production costs. Existing reconstruction-based methods achieve promising anomaly detection results in some scenarios. However, when large-scale anomalies exist, the generalization ability of these methods is limited, and it is difficult to reconstruct the anomalies effectively into normal areas, which may lead to unsatisfactory results. To address this issue, this paper proposes a novel network architecture called SeRNet, which comprises three components: a segmentation sub-network, a reconstruction sub-network, and a repair module. SeRNet addresses the challenge of large-scale anomaly reconstruction by utilizing the segmentation sub-network for pre-segmentation, the repair module for repairing the large-scale anomalies using normal images similar to the test images, and the reconstruction sub-network for processing small-scale anomalies and anomalies in the repaired splices. Additionally, this paper introduces two methods for generating pseudo-anomalies at different scales. SeRNet leverages the advantages of both the segmentation sub-network, which is effective at segmenting large-scale anomalies, and the reconstruction sub-network, which can effectively reconstruct small-scale anomalies. Experiments on the MVTec AD industrial dataset demonstrate that SeRNet delivers outstanding performance, achieving an image-level AUROC score of 99.6.

Keywords:

anomaly detection; reconstruction; pseudo-anomaly

1. Introduction

Anomaly detection is a crucial research field in computer vision, with widespread applications across a range of industries [1,2,3,4]. Traditional anomaly detection methods typically rely on supervised learning approaches [5,6] and have demonstrated promising detection results. However, the scarcity of anomalous samples and the diversity of anomaly types limit the generalizability of these approaches. Thus, with the development of deep learning, unsupervised anomaly detection methods [7,8,9,10] have attracted considerable attention. Unsupervised anomaly detection methods can be categorized into two main classes: embedding-based and reconstruction-based methods.

Embedding-based methods [11,12,13] typically utilize a pre-trained model to extract features from samples and model the feature space to detect anomalies. Although these methods achieve higher detection accuracy, they also lack interpretability. In contrast, reconstruction-based methods [14,15,16,17] rely on restoration techniques to detect anomalies. During the training phase, reconstruction-based methods employ a pseudo-anomaly generation strategy to obtain anomalous images, and the reconstruction model then learns how to reconstruct these pseudo-anomalous samples as normal patterns. The resulting model then attempts to reconstruct real anomalous images during testing. Anomaly detection is performed by comparing the differences between the input and reconstructed images. However, these models typically have limited generalization ability, which prevents them from fully reconstructing all anomalies, particularly in the case of large-scale anomalies. The convolutional layer in the reconstruction model has a constrained receptive field, resulting in poor reconstruction ability and affecting the final detection accuracy. To address the challenges faced by reconstruction networks, previous research has alleviated this issue by introducing more powerful reconstruction networks. For example, anomaly diffusion [18] incorporates a diffusion model, while MUTAD [19] utilizes a multi-scale Transform model as the reconstruction network. Although these methods enhance reconstruction capabilities, they also increase the complexity of the detection algorithms, resulting in decreased efficiency.

To address the challenges described above, this paper proposes a network architecture called SeRNet, which comprises three parts: a segmentation sub-network, a reconstruction sub-network, and a repair module. Previous work often simply combines segmentation and reconstruction, using the difference images from the reconstruction output of the segmentation network for anomaly detection. This approach does not fully leverage the potential of the segmentation network, treating it merely as a discriminator for differences before and after image reconstruction, and it does not address the reconstruction difficulties faced by the reconstruction network when large-scale anomalies exist. The segmentation sub-network is responsible for detecting suspected large-scale anomalies, whereas the reconstruction sub-network focuses on reconstructing small-scale anomalies and repairing splices of anomalies. The repair module, guided by the prediction masks, uses similar images to effectively repair large-scale anomalies, thus helping to reconstruct anomalous regions. The above structure allows SeRNet to reconstruct anomalous regions of different scales more effectively. Two pseudo-anomaly generation methods, namely Perlin Pseudo-Anomaly (PPA) and Diffusion Pseudo-Anomaly (DPA), are employed to guide the learning of each sub-network. The PPA approach generates large-scale anomalies to guide the segmentation sub-network, whereas DPA generates small-scale anomalies to enable the reconstruction sub-network to focus on reconstructing the details of identified anomalous regions. This design enables each sub-network to focus on learning the specific anomaly patterns or features, thereby enhancing the model’s overall detection performance. Figure 1c illustrates the reconstructed image of SeRNet. Compared with Draem, the classical reconstruction method [9], SeRNet can more effectively handle large-scale anomalies. SeRNet does not adopt conventional innovative methods often found in previous work, such as replacing the reconstruction network model. Instead, SeRNet approaches the problem at its root, recognizing that the reconstruction network encounters receptive field limitations when handling large-scale anomalies. Therefore, SeRNet first locates these large-scale anomalies using the segmentation network and then utilizes a repair module for reconstruction. This way, we effectively avoid the receptive field limitations faced by the reconstruction network during the rebuilding process, thereby enhancing reconstruction quality and improving anomaly detection accuracy. Overall, this paper’s contributions can be summarized as follows:

(1) This paper presents SeRNet, an innovative network architecture that addresses the reconstruction challenge in large-scale anomaly detection through the collaborative design of the segmentation sub-network and the repair module, providing a new theoretical framework for industrial anomaly detection.

(2) This paper introduces two novel pseudo-anomaly generation methods, PPA and DPA, which optimize pseudo-anomaly generation strategies for segmentation and reconstruction tasks, significantly enhancing the model’s ability to capture anomalous features across varying scales.

(3) Extensive experiments validate the accuracy and robustness of SeRNet. The results indicate that SeRNet performs exceptionally well on the MVTec AD dataset, achieving an image-level AUROC score of up to 99.6. SeRNet demonstrates high performance across multiple benchmarks, showcasing superior generalization in complex industrial environments with a robust and efficient framework for anomaly detection.

The rest of this paper is organized as follows: Section 2 reviews the related work. Section 3 introduces the proposed SeRNet. Section 4 presents comparative and ablation experiments. Conclusions are provided in Section 5.

2. Related Work

2.1. Embedding-Based Methods

Embedding-based methods compress normal features into a lower-dimensional space in which the anomalous features are noticeably distant from the clusters formed by the normal features. PatchCore [20] utilizes a pre-trained model to extract features and construct a memory bank of patch features. This approach then calculates the difference between the test samples and the feature bank using the nearest neighbor method. The incorporation of a memory bank boosts algorithmic performance but also raises space and time complexity. Padim [21] introduced a patch distribution modeling framework that utilizes a pre-trained patch feature embedding model and a multivariate Gaussian distribution, enabling a probabilistic representation of normal samples to be obtained. This method models the fixed positions of feature maps, and it is only adapted to aligned datasets. Tsai et al. [22] proposed a representation learning method that extracts features from normal samples based on multi-scale patches and also evaluated feature similarity among the patches to achieve superior anomaly segmentation results. Through multi-scale feature extraction, the model can capture anomalies at different scales, thereby achieving satisfactory results in anomaly detection. The PyramidFlow [12] methods employ flow models to learn the spatial distribution of features in normal samples. Both multi-scale and feature pyramid tricks are often used to extract image features and perceive multi-scale anomalies. Compared to traditional convolutional neural networks, flow models require more computational resources, which, to some extent, limits their applicability. RD4AD [13] uses knowledge distillation to compare the differences between anomalous and normal samples in the feature space. RD4AD [13] differs from standard knowledge distillation methods [23,24] by adopting an encoder-decoder structure, where knowledge from the pre-trained teacher model is conveyed to the student model in an inverse direction. This approach effectively filters out anomalous features, reducing their interference with the student network’s input and enabling it to learn normal feature representations more accurately. Recently, several hybrid methods have emerged. For example, MemSeg [25] combines an embedding-based feature repository with segmentation, introducing pseudo-anomalies, multi-scale feature fusion modules, and spatial attention modules. This approach simplifies semi-supervised anomaly detection into an end-to-end semantic segmentation task, making semi-supervised surface defect detection more flexible.

2.2. Reconstruction-Based Methods

Reconstruction-based methods are applied in anomaly detection by training a model to reconstruct anomalous regions as normal regions. Due to the limited availability of anomaly samples, many current reconstruction-based methods employ pseudo-anomaly generation strategies. These strategies generate synthetic anomaly samples that are then used for model training. Draem [9] uses Perlin noise [26] to generate pseudo-anomaly masks. This approach reconstructs anomalous images into normal images using a reconstruction network and utilizes a discriminative network to distinguish anomalous regions from the input and reconstructed images. However, since Draem [9] introduces an additional dataset as the source of anomalous data, this leads to a discrepancy between the distribution of pseudo-anomalies and that of real anomalies, which hinders the model’s ability to effectively learn real anomalous features. RIAD [16] uses a random mask strategy to generate pseudo-anomalies and utilizes the Structural Similarity Index Measure (SSIM) and the Gradient Magnitude Similarity (GMS) as reconstruction loss functions, thereby improving detection accuracy compared to traditional Mean Square Error (MSE) loss functions. SSIM and GMS effectively guide the image reconstruction task. However, the introduction of the mask strategy increases the number of reconstructions, potentially leading to higher algorithmic time complexity. EdgRec [15] transforms images with pseudo-anomalies into edge images and feeds them into the reconstruction network. This model discards the discriminative network and uses a new anomaly evaluation function to identify differences between the input and reconstructed images. The function takes into account both color and gradient differences to detect anomalies. Employing an anomaly evaluation function to identify image differences not only enhances the model’s interpretability but also effectively mitigates the risk of network overfitting. Xian et al. [27] proposed a new framework called the Dual-Siamese network. The first Siamese architecture captures distinguishing features from normal samples and anomalous samples generated by the pseudo-defect generation module. The subsequent second Siamese architecture reconstructs and maps the dual-density features. This framework avoids the inaccuracies present in existing image reconstruction-based methods and addresses issues found in discriminative embedding-based approaches. Behrendt et al. [28] introduced diffusion models into anomaly detection and proposed a conditioning mechanism that incorporates additional feature representations of the input image into the denoising process of DDPMs. This effectively addresses the challenges faced by DDPMs in capturing precise intensity characteristics and domain adaptation. As a result, this method has achieved remarkable success in medical anomaly detection. CFRDC [29] employs a reconstruction-based approach, effectively addressing the issue of class increment in anomaly detection scenarios. It demonstrates the ability to learn features of new classes while retaining knowledge of previously learned classes, underscoring the strong adaptability and practicality of reconstruction-based methods.

2.3. Pseudo-Anomaly-Based Methods

The pseudo-anomaly-based approach generates simulated anomaly samples during the training phase, enabling the model to learn the differences between normal and anomalous samples. CutPaste [30] and AnoSeg [31] simulate anomalous samples by randomly copying small rectangular regions from the input image and pasting them elsewhere within the image. While this approach is simple and efficient, the generated anomalies often differ significantly from real anomalies. GLASS [32] integrates global anomalies with local anomalies, employing gradient ascent and truncated projection to generate anomaly samples that closely resemble the normal distribution, thereby significantly improving anomaly detection and localization performance. Hu et al. [18] proposed a pseudo-anomaly generation strategy based on diffusion models. This method leverages a pre-trained latent diffusion model and incorporates spatial anomaly embedding and adaptive attention reweighting mechanisms, enabling the generation of realistic and diverse anomaly samples even with limited anomalous data.

3. Methods

Figure 2 shows the overall structure of the proposed SeRNet, which comprises a segmentation sub-network, a reconstruction sub-network, and a repair module that connects the two sub-networks. To facilitate the introduction of SeRNet, Table 1 lists the key symbols along with their definitions.

3.1. Pseudo-Anomaly Generation Strategy

SeRNet’s segmentation sub-network is responsible for segmenting large-scale anomalies, while its reconstruction sub-network is responsible for reconstructing small-scale anomalies. In this study, two pseudo-anomaly generation methods are used. The PPA method is used to generate large-scale pseudo-anomalies for the segmentation sub-network, while the DPA method is used to generate small-scale pseudo-anomalies for the reconstruction sub-network.

Inspired by [9], PPA uses Perlin noise [26] to generate anomaly masks. However, unlike [9], the PPA method restricts the generation of small-scale pseudo-anomalies by filtering the pseudo-anomaly masks such that the scale of the generated pseudo-anomalies is consistent with the segmentation sub-network’s large-scale anomaly detection task. Furthermore, the PPA approach improves upon the choice of pseudo-anomalous data sources compared to Draem [9]. Draem [9] utilizes the DTD dataset [33] as a source for pseudo-anomalies. However, the data distribution of the DTD dataset differs significantly from that of industrial products in practice as the distribution of real defects, such as breaks, scratches, and misalignment, is similar to that of normal areas. PPA addresses this issue by employing images from the same class as the anomaly data source. This approach allows pseudo-anomalies to be generated that better align with the true distribution of anomalies. To ensure diversity among the pseudo-anomaly data sources, images of the same class are first divided into four pieces and randomly combined to create spatial differences. The resulting images are then randomly enhanced with brightness, contrast, color, and affine transformations. After these processing steps, images from the same class can be used as a data source for pseudo-anomaly generation. Figure 3a shows the process of generating pseudo-anomalies using PPA, which can be expressed as:

I_{a} = (1 - β) (M_{a} ⊙ I) + β (M_{a} ⊙ I_{A}) + {\bar{M}}_{a} ⊙ I

(1)

where

M_{a}

denotes the pseudo-anomaly mask generated by Perlin noise,

{\bar{M}}_{a}

denotes the inverse result of

M_{a}

, I represents the normal image,

I_{A}

represents the (enhanced) image of the same class, ⊙ indicates the element-by-element multiplication operation, and

β

represents a randomized weight in the range [0, 0.8].

In contrast, the DPA method creates one or two rectangular masks that cover less than 2.5% of the image area. It then generates two to four randomly proportioned rectangular masks immediately adjacent to these rectangles to form an irregular region. These irregular regions represent pseudo-anomaly masks. The process of gradually changing from a single rectangular mask to an irregular mask can be vividly summarized as diffusion, which is the origin of the name DPA.

DPA then randomly adjusts the color and lightness of these image regions to generate pseudo-anomalies. To train the reconstruction sub-network, large-scale anomalies can be generated using PPA to ensure the creation of splicing anomalies during the repair phase. Additionally, the DPA method can be used to generate small-scale pseudo-anomalies, thus enabling the reconstruction sub-network to prioritize image details.

DPA anomaly generation process is shown in Figure 3b and can be expressed as:

I_{a} = {\bar{M}}_{d} ⊙ I + [(M_{d} ⊙ I) + Δ b r i + Δ c o l]

(2)

where

M_{d}

denotes the pseudo-anomaly mask randomly generated by DPA,

{\bar{M}}_{d}

represents the inverse result of

M_{d}

, I is the normal image,

Δ b r i

denotes the random brightness variation value,

Δ c o l

denotes the random color variation value, and ⊙ signifies the elementwise multiplication operation.

3.2. Segmentation Sub-Network and Reconstruction Sub-Network

As shown in Figure 4, both the segmentation sub-network and the reconstruction sub-network of SeRNet adopt the classical U-net architecture, which comprises an encoder and a decoder. The encoder comprises convolutional and pooling layers that extract image features and reduce the image’s scale to create a latent feature space. The decoder consists of convolutional and upsampling layers that decode the image’s spatial dimensions and details from the latent feature space.

The reconstruction sub-network suppresses the expression of anomalous features through the downsampling layers of the encoder and then reconstructs the normal image from the latent feature space via the upsampling layers of the decoder. However, large-scale anomalies still behave as anomalous features in the latent feature space. During the decoder’s upsampling process, the restricted receptive field causes the current pixel values to be influenced by surrounding pixels, making large-scale anomalies more likely to be reconstructed as anomalies. Consequently, the reconstruction sub-network performs well in reconstructing small-scale anomalies but struggles with large-scale anomalies. To assist in reconstructing large-scale anomalies, the segmentation sub-network is responsible for localizing them. This is attributed to the fact that large-scale anomalies exhibit more distinctive features and stronger boundaries, making them easier for the segmentation sub-network to identify.

The segmentation sub-network of SeRNet only produces an anomaly mask as its output. In addition, to reduce the total number of parameters in SeRNet, the encoder in the segmentation sub-network uses only two max-pooling layers and three convolutional layers, while the decoder contains two upsampling layers, two convolutional layers, and one output convolutional layer.

Unlike the segmentation sub-network, the reconstruction sub-network must output reconstructed three-channel color images, which is a relatively complex task. Thus, the reconstruction sub-network’s encoder comprises five convolutional layers and four max-pooling layers, while its decoder consists of four convolutional layers, four upsampling layers, and one output convolutional layer. However, increasing the number of layers may also result in a loss of image information during downsampling, leading to distortion of the reconstructed image. To address this issue, this study introduces skip connections, which link the feature maps of each layer of the encoder directly to the feature maps of the corresponding layer of the decoder. This enables the decoder to reconstruct fine-grained features from the feature maps of lower layers via the skip connection path. Reconstruction sub-networks with skip connections are correspondingly more effective at capturing image details.

The segmentation sub-network and the reconstruction sub-network are optimized separately. The segmentation sub-network is trained first, and after the training is complete, it infers and outputs the anomaly segmentation map, which assists in training the reconstruction sub-network. To ensure the consistency of SeRNet during the training phase, the same loss function is used to assess both sub-networks. The total loss comprises two components: the MSE loss and the SSIM loss [16]. These components can be expressed as the following:

L o s s (y_{p}, y_{t}) = S S I M (y_{p}, y_{t}) + M S E (y_{p}, y_{t})

(3)

where

y_{p}

denotes the reconstructed image or predicted mask, and

y_{t}

denotes ground truth image or mask.

3.3. Repair Module

The repair module acts as a crucial link that connects both sub-networks. This module has multiple functions, including prediction mask binarization, filtering, and large-scale anomaly repair. The pseudo code for the repair module is provided in Algorithm 1. The inputs are the prediction mask

M_{p}

, the anomalous image (or pseudo-anomalous image)

I_{a}

, and the similar image

I_{s}

of

I_{a}

. Firstly,

M_{p}

is binarized in steps 1–5. Subsequently, in steps 6–10, prediction masks with small-scale anomalies are filtered and the anomalies are eliminated. Finally, in steps 11–17,

I_{a}

is repaired utilizing

I_{s}

under the guidance of

M_{p}

, and the resulting repaired image

I_{b}

is returned.

Algorithm 1 Repair strategy

Require: $M_{p}$ : the prediction mask. $I_{a}$ : the (pseudo) anomalous image. $I_{s}$ : the similar image of $I_{a}$ .
Ensure: $I_{b}$ : the repaired image of $I_{a}$ .
# Prediction mask filtering

1:

if

M_{p} [i, j] < t h_{p}

then

2:

M_{p} [i, j] \leftarrow 0

3:

else

4:

M_{p} [i, j] \leftarrow 1

5:

end if

# Prediction mask filtering
# $S_{w}$ denotes the connected region in $M_{p}$
# $A R E A (S_{w})$ means calculate the area of $S_{w}$

6:

for

S_{w}

in

M_{p}

do

7:

if

A R E A (S_{w}) < t h_{a}

then

8:

M_{p} [S_{w}] \leftarrow 0

9:

end if

10:

end for

11:

for

i t e m

in

M_{p}

do

12:

if

M_{p} [i t e m] = = 0

then

13:

I_{b} [i t e m] \leftarrow I_{a} [i t e m]

14:

else

15:

I_{b} [i t e m] \leftarrow I_{s} [i t e m]

16:

end if

17:

end for

The pixels in the prediction mask

M_{p}

output from the segmentation sub-network all have values in the range [0, 1], which indicate the probability of a pixel belonging to an anomalous region. This paper utilizes binarization to handle the continuous distribution of pixel values in

M_{p}

. Specifically, a threshold

t h_{p}

is introduced, where pixel values below the threshold are set to 0, and those above it are set to 1. This treatment prevents anomalous regions with low probability from introducing additional anomalies during the repair phase. The binarized version of

M_{p}

must also be filtered. To do so, if the area of an anomalous region in

M_{P}

’s is below a set threshold

t h_{a}

, the corresponding pixel values of the connected area should be set to 0, thus indicating that these small-scale anomalies should be excluded. Since the reconstruction sub-network can reconstruct these small-scale anomalies into normal regions, it is unnecessary to repair them, thus avoiding potentially introducing additional anomalies. Figure 5 shows a comparison of the original and processed prediction masks. As shown, the non-binarized and unfiltered masks contain very high noise levels, while the processed masks contain only large-scale masks representing suspected anomalous regions.

The repair operation replaces anomalous regions of

I_{a}

with corresponding regions from similar image

I_{s}

under the guidance of

M_{p}

. The sources of

I_{s}

are explained in Section 3.4. The repair result is shown in Figure 6. Repairing anomalous regions using the similar images approach effectively assists in the reconstruction of large-scale anomalies.

The prediction masks assist in repairing large-scale anomalies by providing location information about these anomalies, thereby supporting the reconstruction network in transforming anomalous areas into normal ones. Without the aid of the prediction masks, the reconstruction network would struggle to effectively achieve this transformation, potentially leading to errors in anomaly detection. Therefore, the prediction masks play a crucial role in the repair module. The repair module consists of two main operations: first, processing the prediction masks generated by the segmentation sub-network, and second, using the processed prediction masks to replace anomalous regions with normal ones to repair large-scale anomalies. The objective of the repair module is to minimize the introduction of additional anomalies while effectively repairing large-scale anomalies.

3.4. Inference

The inference phase of SeRNet includes the following additional operations.

3.4.1. Similar Image Search

As noted in Section 3.3, SeRNet needs to repair large-scale anomalies using normal regions from similar images. During the training phase, the image

I_{t}

generates the pseudo-anomalous image

I_{a}

; the corresponding region of

I_{t}

is then used to repair the large-scale anomalies in

I_{a}

.

During the inference phase, SeRNet searches for the training image that is the most similar to the test image. To achieve this, a perceptual hash algorithm [34] is used. The algorithm generates a unique perceptual hash value for each image based on the appearance and location of the target object, ensuring that the test image can accurately retrieve similar images from the training set. In the experiments, perceptual hashing for encoding images is implemented using the default settings of the version 4.3.1 of ImageHash (ImageHash library: https://pypi.org/project/ImageHash/ (accessed on 1 February 2026)) library.

For the MVTec AD [35] dataset, which contains 3416 training images, only 217 KB of storage is needed for its hash values, equating to an average of 0.06 KB per image. This is because the perceptual hash algorithm can convert an image into a 16-bit hash string, and computers are highly efficient at storing strings, allowing for significant reductions in storage space.

Algorithm 2 outlines the steps for executing a similar image search algorithm. This algorithm takes the set of training samples

D_{t}

and the anomalous image

I_{a}

as inputs. Before the inference process, steps 1–4 are executed to compute the perceptual hash value of each training image and save it in the corresponding file, which avoids the need for repeated computation during the inference stage. During inference, Step 5 is executed to calculate the hash value of the current test image. Step 6 is then performed to search the file containing the stored hash value to find the image with the smallest Hamming distance from the

I_{a}

hash value, i.e., the most similar image

I_{s}

. Figure 7 illustrates the result of this search. A similar image search algorithm can find similar images for screws in the first column and hazelnuts in the fourth column of Figure 7 when both have different orientations, thus playing a crucial role in ensuring accurate repairs for large-scale anomalies.

Algorithm 2 Similar image search

Require: $D_{t}$ : the set of training samples in the same category. $I_{a}$ : the anomalous image.
Ensure: $F_{h}$ : perceptual hash value storage files. $I_{s}$ : the similar image of $I_{a}$ .
# Compute the perceptual hash of the training samples in advance and store it.

1:

for I in

D_{t}

do

2:

h a s h_{I} \leftarrow C a l c u l a t e H a s h (I)

3:

S a v e F i l e (h a s h_{I}, N a m e (I), F_{h})

4:

end for

# At the point of inference, the perceptual hash of the present image $I_{a}$ is calculated.

5:

h a s h_{I_{a}} \leftarrow C a l c u l a t e H a s h (I_{a})

# $M i n H a s h D i s ()$ means to find the minimum Hamming distance

6:

I_{s} \leftarrow M i n H a s h D i s (h a s h_{I_{a}}, F_{h})

3.4.2. Anomaly Evaluation

SeRNet reconstructs the test image during the inference phase and then uses an anomaly evaluation function [15] to compare the differences between the original and reconstructed images. This comparison will determine the presence of any anomalies. The anomaly evaluation function measures the color differences and structural differences between the test image and the reconstructed image. To represent the color difference, the anomaly evaluation function calculates the difference between the a-channel and b-channel of the two images in the CIELAB color space, which can be expressed as:

D_{c o l o r} = [{(I_{p}^{a} - I_{t}^{a})}^{2} + {(I_{p}^{b} - I_{t}^{b})}^{2}] * F_{m e a n}

(4)

where

F_{m e a n}

denotes the filtering process using a convolutional kernel of size 21.

The anomaly evaluation function calculates the image’s structural differences based on the image’s multi-scale gradient magnitude similarity (MSGMS) [16], which can be expressed as:

D_{s t r u c t u r e} = 1_{H \times W} - MSGMS (I_{p}, I_{t}) * F_{m e a n}

(5)

where

M S G M S (I_{p}, I_{t})

denotes the computation of the multiscale gradient magnitude similarity of

I_{p}

and

I_{t}

, with the convolutional kernel size of

F_{m e a n}

set to 21.

The anomaly evaluation function can be expressed as:

D_{s c o r e} = c D_{c o l o r} + D_{s t r u c t u r e}

(6)

where c is taken as 0.001 to balance the difference between the orders of magnitude of

D_{c o l o r}

and

D_{s t r u c t u r e}

. The maximum value of

D_{s c o r e}

represents the anomaly score of the test image.

4. Experiments

4.1. Experimental Details

Datasets: MVTec AD [35] is widely used for detecting anomalies in industrial products and consists of fifteen classes. Its training set comprises 3629 normal images, while its test set comprises 1725 images, including multiple anomalies for each class.

VisA [36] comprises a total of 10,821 images across twelve industrial classes. The training set includes 8659 normal images, while the test set consists of 962 normal images and 1200 anomalous images.

MPDD [37] contains 1346 images of metal parts categorized into six groups. The training set includes 888 normal images, and the test set contains 176 normal images and 282 anomalous images.

BTAD [38] contains three classes of industrial products. These three classes contain 400, 1000, and 399 normal images, respectively. The scale of anomalies in the same class varies widely, creating a challenge for anomaly detection.

The types of anomalies in the four datasets mainly include scratches, wear, cracks, and missing parts, with the ground truth represented by mask images indicating the actual anomalies.

Implementation Details: The SeRNet model was implemented using Python 3.8 and the PyTorch 1.11 frameworks on an Intel(R) Core i7-9700K CPU and an Nvidia RTX 2080 Ti GPU. The training process was conducted for 600 epochs, with an initial learning rate of 0.0001. The experimental setup has a batch size of four, and the optimizer used is Adam. The generation of pseudo-anomalies has been enhanced using methods from the imgaug library, with specific methods, including GammaContrast, MultiplyAndAddToBrightness, AddToHueAndSaturation, Autocontrast, and Affine Transformation. The input training images were resized to a resolution of

256 \times 256

pixels. Section 3.3 employs a binarization threshold of 0.8 and an area filtering threshold of 7000.

4.2. Experimental Results

4.2.1. Comparison with State-of-the-Art Methods

Table 2 shows the experimental results of SeRNet on the MVTec AD dataset. Notably, SeRNet achieves a perfect I-AUROC score of 100 across seven different item classes, representing a significant performance improvement over reconstruction-based methods [9,18]. Specifically, SeRNet surpasses MUTAD [19] by 3.1 in I-AUROC on average, exceeds DTAM [39] by 1.8, outperforms AnomalyDiffusion [18] by 1.1 on average in texture categories, and surpasses SuperSimpleNet [40] by 1.7 on average in object categories. The AP scores shown in Table 3 also demonstrate the strong anomaly localization capability of SeRNet. Compared to the knowledge distillation-based STPM [23] and ST [41], SeRNet surpassed their scores by 20.5 and 29.3, respectively. In comparison to the reconstruction-based RIAD [16], SeRNet showed an AP improvement of 26.6.

As shown in Table 4, SeRNet performs well on the VisA dataset. In terms of the average I-AUROC, SeRNet outperforms LightFlow [42] and FastFlow [43], which are based on the flow model, by 2.8 and 15.8, respectively. Table 5 shows the experimental results for the MPDD dataset, where SeRNet obtains an I-AUROC of 100.0 for both connector and metal plate classes. Its average I-AUROC and P-AUROC are 12.5 and 1.2 higher than those of EffAD [44], and 10.8 and 0.8 higher than those of CFlow [45], respectively.

Table 2. Results of anomaly detection for different methods on the MVTec AD dataset (I-AUROC), with the best results highlighted in bold.

Class	Draem [9]	DTAM [39]	AnomalyDiffusion [18]	MUTAD [19]	MSPBA [22]	FastRecon [46]	DBKD [47]	SuperSimpleNet [40]	SeRNet (Ours)
Carpet	97.0	98.0	96.7	97.7	93.4	99.9	97.1	98.4	98.8
Grid	99.9	96.0	98.4	98.3	100.0	99.7	88.8	99.3	100.0
Leather	100.0	100.0	100.0	100.0	100.0	100.0	100.0	100.0	100.0
Tile	99.6	100.0	100.0	99.9	96.2	100.0	99.5	99.7	100.0
Wood	99.1	100.0	98.4	99.9	99.7	99.3	99.1	99.3	100.0
Avg. Text.	99.1	98.8	98.7	99.2	97.7	97.6	98.0	99.3	99.8
Bottle	99.2	100.0	99.8	100.0	100.0	99.4	100.0	100.0	100.0
Capsule	98.5	85.4	99.7	96.5	97.2	90.1	98.0	98.1	98.4
Pill	98.9	96.8	98.0	95.7	97.7	93.5	98.3	98.1	98.7
Transistor	93.1	96.8	100.0	95.1	98.9	97.7	99.8	99.9	99.9
Zipper	100.0	99.5	99.9	98.5	99.5	97.0	98.6	99.6	99.2
Cable	91.8	95.3	100.0	98.3	98.8	93.8	98.0	98.1	99.7
Hazelnut	100.0	100.0	99.8	99.3	99.6	99.3	100.0	99.8	99.9
Metal_nut	98.7	100.0	100.0	97.4	97.8	99.1	100.0	99.5	100.0
Screw	93.9	99.8	96.8	87.2	94.1	62.5	95.5	92.9	99.9
Toothbrush	100.0	100.0	100.0	83.8	100.0	93.6	92.5	92.2	100.0
Avg. Obj.	97.4	97.4	99.4	95.2	98.4	92.6	98.1	97.9	99.6
Average	98.0	97.8	99.2	96.5	98.2	94.2	98.1	98.4	99.6

Table 3. Results of anomaly location for different methods on the MVTec AD dataset (AP), with the best results highlighted in bold.

Method	RIAD [16]	Padim [21]	ST [41]	UTRAD [48]	STPM [23]	SeRNet
Carpet	61.4	60.7	52.2	17.2	65.3	75.7
Grid	36.4	35.7	10.1	15.4	45.4	58.5
Leather	49.1	53.5	40.9	11.9	42.9	77.4
Tile	52.6	52.4	65.3	52.0	61.7	87.7
Wood	38.2	46.3	53.3	18.3	47.0	76.2
Avg. Text.	47.5	49.7	44.4	23.0	52.5	75.1
Bottle	76.4	77.3	74.2	37.8	80.6	89.1
Capsule	38.2	46.7	25.9	25.2	35.9	55.6
Pill	51.6	61.2	62.0	52.5	63.3	80.8
Transistor	39.2	72.0	27.1	49.7	44.4	75.9
Zipper	63.4	58.2	36.1	35.6	54.9	72.4
Cable	24.4	45.4	48.2	62.4	58.0	74.6
Hazelnut	33.8	61.1	57.8	45.3	60.3	85.9
Metal_nut	64.3	77.4	83.5	65.7	79.3	89.6
Screw	43.9	21.7	7.8	1.7	26.9	47.8
Toothbrush	50.6	54.7	37.7	30.6	48.8	75.1
Avg. Obj	48.6	57.6	46.0	40.7	55.2	74.7
Average	48.2	55.0	45.5	34.8	54.3	74.8

Table 4. Results of anomaly detection and location for different methods on the VisA (I-AUROC/P-AUROC), with the best results highlighted in bold.

Method	Candle	Capsules	Cashew	Chewing Gum	Fryum	Macaroni1	Macaroni2	PCB1	PCB2	PCB3	PCB4	Pipe Fryum	Average
SPD [36]	89.1/97.3	68.1/86.3	90.5/86.1	99.3/96.9	89.8/88.0	85.7/98.8	70.8/96.0	92.7/97.7	87.9/97.2	85.4/96.7	99.1/89.2	95.6/95.4	87.8/93.8
OminAL [49]	85.1/90.5	87.9/98.6	97.1/98.9	94.9/98.7	97.0/89.3	96.9/98.9	89.9/99.1	96.6/98.7	99.4/83.2	96.9/98.4	97.4/98.5	91.4/99.1	94.2/96.0
FastFlow [43]	92.8/94.9	71.2/75.3	91.0/91.4	91.4/98.6	88.6/97.3	98.3/97.3	86.3/89.2	77.4/75.2	61.9/67.3	74.3/94.8	80.9/89.9	72.0/87.3	82.2/88.2
SimpleNet [50]	96.9/98.6	89.5/99.2	94.8/99.0	100.0/98.5	96.6/94.5	97.6/99.6	83.4/96.4	99.2/99.8	99.2/98.8	98.6/99.2	98.9/98.6	99.2/99.3	96.2/98.5
LightFlow [42]	96.2/98.9	93.6/99.5	93.5/99.1	98.3/98.9	95.8/95.1	97.4/99.5	92.5/99.1	95.1/95.4	94.1/98.5	91.1/98.2	96.5/97.1	97.8/98.6	95.2/98.5
PatchCore [20]	98.7/99.2	68.8/96.5	97.7/99.2	99.1/98.9	91.6/95.9	90.1/98.5	63.4/93.5	96.0/99.8	95.1/98.4	93.0/98.9	99.5/98.3	99.0/99.3	91.0/98.1
SeRNet (Ours)	97.0/98.8	96.4/99.8	99.0/98.8	99.6/99.0	98.3/95.9	98.5/99.8	94.2/99.6	98.4/99.7	98.5/98.9	99.4/99.2	98.8/99.2	98.2/99.6	98.0/99.0

Table 5. Results of anomaly location and location for different methods on the MPDD (I-AUROC/P-AUROC), with the best results highlighted in bold.

Class	RD4AD [13]	CFLOW [45]	EffAD [44]	SeRNet
Bracket black	50.7/96.4	96.5/97.6	85.5/97.2	94.1/98.1
Bracket brown	68.7/97.3	49.7/95.0	85.3/92.5	99.1/97.2
Bracket white	83.7/98.1	79.9/96.7	96.5/97.7	94.6/96.3
Connector	93.1/98.6	99.5/97.4	50.0/99.3	100.0/99.5
Metal plate	91.9/97.5	97.4/98.6	100.0/96.5	100.0/98.6
Tubes	89.7/99.1	99.6/99.2	95.3/99.0	99.5/99.6
Average	79.6/97.8	87.1/97.4	85.4/97.0	97.9/98.2

SeRNet also performs well on BTAD. The experimental results are shown in Table 6, where SeRNet achieves an excellent score of 99.9 for I-AUROC and 99.7 for P-AUROC on class 03. For both anomaly detection and anomaly localization, SeRNet achieves the highest average accuracy. This demonstrates that SeRNet is still effective for detecting anomalies with large-scale variations.

4.2.2. Qualitative Analysis

Although existing reconstruction-based methods perform well in many categories, they struggle in categories with large-scale anomalies, such as cables, bottles, and pills. This is mainly due to the limitations of auto-encoders. Figure 8 compares the reconstructed images and detection results of Draem, the reconstruction-based method [9], and SeRNet. Compared to the Draem method, SeRNet shows better reconstruction quality, leading to higher accuracy in anomaly localization. This improvement is attributed to the segmentation sub-network’s capability to locate large-scale anomalies and the repair module’s effectiveness. Furthermore, thanks to the reconstruction ability of the reconstruction sub-network, SeRNet still achieves satisfactory performance in categories containing many small-scale anomalies, such as carpets and capsules. This is illustrated in Figure 9, which presents the specific reconstruction detection results of SeRNet across four datasets, encompassing both large-scale and small-scale anomalies.

4.2.3. Complexity Analysis

Table 7 shows the test time for a single image using SeRNet. Compared to CSFlow [55], PatchCore [20], AST [24], SPADE [56] and Padim [21], SeRNet has a significantly reduced inference time. This is because CSFlow [55] and AST [24] use complex flow models, whereas Patchcore [20], Padim [21], and SPADE [56] rely on image patches, leading to multiple computations per image. Additionally, since Padim [21] is a modification of SPADE [56], both share the same backbone network and parameters. Draem [9] uses an end-to-end anomaly detection algorithm where the discriminative network directly outputs anomaly masks, thereby reducing the inference time. The similar image search is a key step in SeRNet. As the scale of the dataset increases, similar image search does not significantly increase the inference time. The specific reasons are as follows. Firstly, the perceptual hash values of the training images have been pre-computed, so only the perceptual hash value of the test image needs to be computed during the inference process, and the time to compute the hash value of a single image is only 1 to 2 ms. In addition, SeRNet performs image search by calculating the Hamming distance of the perceptual hash values between images, and the Hamming distance can be calculated using Numpy’s broadcast mechanism instead of a loop operation, which makes the image search operation very efficient. Secondly, when working with large datasets, although each category may contain a large number of images, these images typically display only a limited number of standardized appearances. This is because industrial products tend to have uniform designs, resulting in a high degree of visual consistency. Therefore, we can create a representative sample set by selecting a small number of images with typical appearances from the training set, using either hand-picking or cluster analysis. During the testing phase, we only need to search for similar images within this representative sample set, eliminating the need for a comprehensive search across the entire dataset.

4.3. Ablation Study

4.3.1. Model Structure

SeRNet leverages pre-segmentation results to repair large-scale anomalies, which in turn helps the image reconstruction process. This section presents an ablation study to demonstrate the efficacy of SeRNet’s segmentation and reconstruction sub-network structure. The segmentation sub-network, reconstruction sub-network, and SeRNet were tested on the MVTec AD dataset, the results of which are shown in Table 8. The experimental results show that the segmentation network alone achieves poor performance. This likely relates to the simple structure of SeRNet’s segmentation sub-network, which is designed for large-scale anomaly detection and may struggle to detect small-scale anomalies. In comparison, the performance of the reconstruction sub-network when used alone is better than that of the segmentation sub-network when used alone, as the reconstruction sub-network has a more complex structure and more parameters, resulting in better generalization performance. SeRNet, which combines both the segmentation and reconstruction sub-networks, achieved an I-AUROC that is 1.5 higher and a P-AUROC that is 2.5 higher than the reconstruction sub-network alone, demonstrating the significant guiding role of the segmentation sub-network.

4.3.2. Repair Module

This section explores the role of SeRNet’s repair module, conducting ablation experiments on the MVTecAD dataset and the BTAD dataset, with results shown in Table 9 and Table 10. When similar image searching is not utilized, random normal images from the same class are used as substitutes. As shown, SeRNet performs worst when the prediction mask is not processed, and random image regions are used for anomaly repair. This mainly occurs because the absence of prediction masks introduces abundant fine-scale noise that hampers the image reconstruction process. Both using only similar images and processing the prediction masks individually led to improved performance. However, neither approach was able to fully eliminate the extra noise. The optimal performance of SeRNet is achieved only when both components are used together, thus demonstrating the effectiveness of the repair module. Connecting the two sub-networks without processing would not significantly improve the model’s overall performance.

Additionally, the results of the ablation experiments for the threshold values

t h_{p}

and

t h_{a}

in Algorithm 1 are shown in Table 11 and Table 12. Table 11 indicates that when

t h_{p}

and

t h_{a}

are set to 0.8 and 7000, the image-level AUROC increases by 2.9, while the detection accuracy significantly declines when

t h_{a}

is set to less than 1000 or greater than 8000. The best performance is achieved when

t h_{a}

is set to 7000. This is because relatively larger values of

t h_{p}

and

t h_{a}

effectively repair larger anomalies, while reducing the additional anomalies introduced by the segmentation sub-network. However, if

t h_{a}

is too large, it may ignore the masks generated by the segmentation network, leading to a decrease in accuracy. The ablation experiment results on the BTAD dataset shown in Table 12 are similar, indicating that the choice of thresholds is not sensitive and is closely related to both the segmentation sub-network and the reconstruction sub-network, with less dependence on the dataset. The threshold

t h_{p}

is used to filter out segmentation regions with low confidence, so its setting is related to the segmentation sub-network and varies with changes in it. The threshold

t h_{a}

is associated with the size of the input image and the reconstruction sub-network. When the input image size is fixed, and the reconstruction sub-network remains unchanged, the threshold

t h_{a}

also remains constant.

4.3.3. Pseudo-Anomaly Generation Strategy

The SeRNet model training is performed using normal samples that contain pseudo-anomalies, and the generation process of these pseudo-anomalies is crucial to the model’s performance. This section explores the impact of the classical pseudo-anomaly generation methods CutPaste [30], PPA, and DPA on the detection accuracy of SeRNet. The basic idea of CutPaste involves cropping and pasting a random area of the original image into another area, generating anomalies that are obvious but insufficiently realistic. The experimental results are presented in Table 13. The results indicate that the strategies proposed in this paper are highly effective for producing appropriate pseudo-anomalies that can be used to train the various sub-networks of SeRNet.

PPA can generate large-scale pseudo-anomalies, which are not lost during the downsampling process of the segmentation sub-network. As a result, the segmentation sub-network can better learn and recognize anomalous regions, thus improving the performance of the model. While DPA focuses on the generation of small-scale pseudo-anomalies, which enables the upsampling process of the reconstruction sub-network to focus on local features and better reconstruct image details. This training combination achieves the highest image-level and pixel-level AUROC. Using only CutPaste [30], PPA, or DPA methods failed to achieve optimal results with SeRNet. This finding reveals that when formulating a pseudo-anomaly generation strategy, it is essential to consider both the similarity between pseudo-anomalies and real anomalies and the characteristics of the algorithm being trained.

4.3.4. Anomaly Evaluation Function

The anomaly evaluation function consists of

D_{s t r u c t u r e}

and

D_{c o l o r}

. We conducted ablation experiments on these two components, with results shown in Table 14. The experimental results demonstrate that utilizing

D_{s t r u c t u r e}

and

D_{c o l o r}

individually results in a notable decline in the algorithm’s accuracy. Specifically, image-level AUROCs exhibited a reduction of 1.6 and 17.3, while pixel-level AUROCs demonstrated a decrease of 0.9 and 7.7, respectively. The combination of

D_{s t r u c t u r e}

and

D_{c o l o r}

is most effective as it enables the comprehensive coverage of diverse anomaly types, leading to enhanced detection performance.

5. Conclusions

This paper proposes an unsupervised image anomaly detection model called SeRNet to address the challenge of reconstructing large-scale anomalies in current reconstruction sub-networks. The key contribution of this paper is to combine the advantages of segmentation and reconstruction in order to reconstruct large-scale anomalies. SeRNet employs a serial fusion approach of segmentation followed by reconstruction. The segmentation sub-network performs pre-segmentation of anomalous images to detect suspected anomalous regions, while the repair module employs prediction masks and similar images to repair the suspected large-scale anomalies. The reconstruction sub-network then reconstructs the repaired image. In addition, two methods for generating pseudo-anomalies at different scales are introduced for training SeRNet. The extensive experiments performed in this work demonstrate that SeRNet surpasses other state-of-the-art anomaly detection methods in terms of both performance and robustness.

However, SeRNet also faces some limitations and challenges. While the introduction of segmentation does enhance the performance of the reconstruction network, it also brings the risk of over-reliance. Additionally, when working with non-uniform textures, the detection accuracy of SeRNet may decline, which is a common challenge faced by most industrial anomaly detection methods. To address the limitations of SeRNet, we will leverage diffusion models [57] or transform networks in our future work to enhance reconstruction capabilities and tackle complex anomaly detection problems.

Author Contributions

Conceptualization, J.S. and X.L.; methodology, X.L.; software, J.S. and X.L.; validation, Y.C. and J.J.; formal analysis, J.S. and J.J.; investigation, S.W.; resources, Y.C.; data curation, J.S.; writing—original draft preparation, J.S.; writing—review and editing, Y.C., J.J., S.W. and J.S.; visualization, J.S.; supervision, Y.C., S.W. and J.J.; project administration, J.S.; funding acquisition, Y.C. and J.J. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the National Natural Science Foundation of China under Grant 62001236.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study can be requested from the corresponding author, as some of the code is being used in other unpublished research projects that are crucial for our further research. Therefore, we aim to manage the data appropriately while ensuring its integrity and confidentiality.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Wolleb, J.; Bieder, F.; Sandkühler, R.; Cattin, P.C. Diffusion models for medical anomaly detection. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Berlin/Heidelberg, Germany, 2022; pp. 35–45. [Google Scholar]
Nandakumar, N.; Eberhardt, J. A Synthetic Image Generation Pipeline for Vision-Based AI in Industrial Applications. Appl. Sci. 2025, 15, 12600. [Google Scholar] [CrossRef]
Shang, Q.; Yang, J.; Fu, F.; Ma, J. Deep support vector data description based on correntropy for few-shot anomaly detection. Digit. Signal Process. 2025, 160, 105074. [Google Scholar] [CrossRef]
Li, J.; Da, F.; Yu, Y. PCBSSD: Self-supervised symmetry-aware detector for PCB displacement and orientation inspection. Measurement 2025, 243, 116342. [Google Scholar] [CrossRef]
Pang, G.; Yan, C.; Shen, C.; Hengel, A.v.d.; Bai, X. Self-trained deep ordinal regression for end-to-end video anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 12173–12182. [Google Scholar]
Feng, J.C.; Hong, F.T.; Zheng, W.S. Mist: Multiple instance self-training framework for video anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 14009–14018. [Google Scholar]
Guo, X.; Zhao, S.; Xue, J.; Liu, D.; Han, X.; Zhang, S.; Zhang, Y. Multi-Granularity Content-Aware Network with Semantic Integration for Unsupervised Anomaly Detection. Appl. Sci. 2025, 15, 11842. [Google Scholar] [CrossRef]
Wang, H.; He, X.; Zhang, C.; Liang, X.; Zhu, P.; Wang, X.; Gui, W.; Li, X.; Qian, X. Accelerating surface defect detection using normal data with an attention-guided feature distillation reconstruction network. Measurement 2025, 246, 116702. [Google Scholar] [CrossRef]
Zavrtanik, V.; Kristan, M.; Skočaj, D. Draem—A discriminatively trained reconstruction embedding for surface anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 10–17 October 2021; pp. 8330–8339. [Google Scholar]
Jiang, X.Q.; Huang, K.; Zhou, S.; Hu, W.; Peng, H.; Jin, J.; Fang, Z. Dual Flow Reverse Distillation for Unsupervised Anomaly Detection. Digit. Signal Process. 2025, 164, 105258. [Google Scholar] [CrossRef]
Liu, X.; Zhu, J.; Zhu, Z.; He, J. DAFSF: A Defect-Aware Fine Segmentation Framework Based on Hybrid Encoder and Adaptive Optimization for Image Analysis. Appl. Sci. 2025, 15, 11351. [Google Scholar] [CrossRef]
Lei, J.; Hu, X.; Wang, Y.; Liu, D. PyramidFlow: High-Resolution Defect Contrastive Localization using Pyramid Normalizing Flow. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 14143–14152. [Google Scholar]
Deng, H.; Li, X. Anomaly detection via reverse distillation from one-class embedding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 9737–9746. [Google Scholar]
Jiang, J.; Zhu, J.; Bilal, M.; Cui, Y.; Kumar, N.; Dou, R.; Su, F.; Xu, X. Masked swin transformer unet for industrial anomaly detection. IEEE Trans. Ind. Inform. 2022, 19, 2200–2209. [Google Scholar] [CrossRef]
Liu, T.; Li, B.; Zhao, Z.; Du, X.; Jiang, B.; Geng, L. Reconstruction from edge image combined with color and gradient difference for industrial surface anomaly detection. arXiv 2022, arXiv:2210.14485. [Google Scholar] [CrossRef]
Zavrtanik, V.; Kristan, M.; Skočaj, D. Reconstruction by inpainting for visual anomaly detection. Pattern Recognit. 2021, 112, 107706. [Google Scholar] [CrossRef]
Yamada, S.; Kamiya, S.; Hotta, K. Reconstructed student-teacher and discriminative networks for anomaly detection. In Proceedings of the 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Kyoto, Japan, 23–27 October 2022; IEEE: Piscataway, USA, 2022; pp. 2725–2732. [Google Scholar]
Hu, T.; Zhang, J.; Yi, R.; Du, Y.; Chen, X.; Liu, L.; Wang, Y.; Wang, C. Anomalydiffusion: Few-shot anomaly image generation with diffusion model. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 26–27 February 2024; Volume 38, pp. 8526–8534. [Google Scholar]
Zou, Q.; Kang, C.; Liu, Q. Multi-scale u-shape transformer network for unsupervised image anomaly detection. In Proceedings of the 2025 28th International Conference on Computer Supported Cooperative Work in Design (CSCWD), Compiegne, France, 5–7 May 2025; pp. 891–896. [Google Scholar]
Roth, K.; Pemula, L.; Zepeda, J.; Schölkopf, B.; Brox, T.; Gehler, P. Towards total recall in industrial anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 14318–14328. [Google Scholar]
Defard, T.; Setkov, A.; Loesch, A.; Audigier, R. Padim: A patch distribution modeling framework for anomaly detection and localization. In Proceedings of the International Conference on Pattern Recognition; Springer: Berlin/Heidelberg, Germany, 2021; pp. 475–489. [Google Scholar]
Tsai, C.C.; Wu, T.H.; Lai, S.H. Multi-scale patch-based representation learning for image anomaly detection and segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 3992–4000. [Google Scholar]
Wang, G.; Han, S.; Ding, E.; Huang, D. Student-teacher feature pyramid matching for anomaly detection. arXiv 2021, arXiv:2103.04257. [Google Scholar] [CrossRef]
Rudolph, M.; Wehrbein, T.; Rosenhahn, B.; Wandt, B. Asymmetric student-teacher networks for industrial anomaly detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 2592–2602. [Google Scholar]
Yang, M.; Wu, P.; Feng, H. Memseg: A semi-supervised method for image surface defect detection using differences and commonalities. Eng. Appl. Artif. Intell. 2023, 119, 105835. [Google Scholar] [CrossRef]
Perlin, K. An image synthesizer. ACM Siggraph Comput. Graph. 1985, 19, 287–296. [Google Scholar] [CrossRef]
Tao, X.; Zhang, D.; Ma, W.; Hou, Z.; Lu, Z.; Adak, C. Unsupervised Anomaly Detection for Surface Defects With Dual-Siamese Network. IEEE Trans. Ind. Inform. 2022, 18, 7707–7717. [Google Scholar] [CrossRef]
Behrendt, F.; Bhattacharya, D.; Mieling, R.; Maack, L.; Krüger, J.; Opfer, R.; Schlaefer, A. Guided reconstruction with conditioned diffusion models for unsupervised anomaly detection in brain mris. Comput. Biol. Med. 2025, 186, 109660. [Google Scholar] [CrossRef]
Pang, J.; Li, C. Context-aware feature reconstruction for class-incremental anomaly detection and localization. Neural Netw. 2025, 181, 106788. [Google Scholar] [CrossRef]
Li, C.L.; Sohn, K.; Yoon, J.; Pfister, T. Cutpaste: Self-supervised learning for anomaly detection and localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 9664–9674. [Google Scholar]
Song, J.; Kong, K.; Park, Y.I.; Kim, S.G.; Kang, S.J. AnoSeg: Anomaly segmentation network using self-supervised learning. arXiv 2021, arXiv:2110.03396. [Google Scholar]
Chen, Q.; Luo, H.; Lv, C.; Zhang, Z. A unified anomaly synthesis strategy with gradient ascent for industrial anomaly detection and localization. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2024; pp. 37–54. [Google Scholar]
Cimpoi, M.; Maji, S.; Kokkinos, I.; Mohamed, S.; Vedaldi, A. Describing textures in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 3606–3613. [Google Scholar]
Niu, X.M.; Jiao, Y.H. An overview of perceptual hashing. Acta Electonica Sin. 2008, 36, 1405. [Google Scholar]
Bergmann, P.; Fauser, M.; Sattlegger, D.; Steger, C. MVTec AD–A comprehensive real-world dataset for unsupervised anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9592–9600. [Google Scholar]
Zou, Y.; Jeong, J.; Pemula, L.; Zhang, D.; Dabeer, O. Spot-the-difference self-supervised pre-training for anomaly detection and segmentation. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2022; pp. 392–408. [Google Scholar]
Jezek, S.; Jonak, M.; Burget, R.; Dvorak, P.; Skotak, M. Deep learning-based defect detection of metal parts: Evaluating current methods in complex conditions. In Proceedings of the 2021 13th International Congress on Ultra Modern Telecommunications and Control Systems and Workshops (ICUMT), Brno, Czech Republic, 25–27 October 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 66–71. [Google Scholar]
Mishra, P.; Verk, R.; Fornasier, D.; Piciarelli, C.; Foresti, G.L. VT-ADL: A vision transformer network for image anomaly detection and localization. In Proceedings of the 2021 IEEE 30th International Symposium on Industrial Electronics (ISIE), Kyoto, Japan, 20–23 January 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1–6. [Google Scholar]
Liu, C.; Wang, J.; Han, Z. DTAM: A difference-trainable adaptive memory anomaly detection and location method. Measurement 2025, 256, 118423. [Google Scholar] [CrossRef]
Rolih, B.; Fučka, M.; Skočaj, D. SuperSimpleNet: Unifying Unsupervised and Supervised Learning for Fast and Reliable Surface Defect Detection. arXiv 2024, arXiv:2408.03143. [Google Scholar]
Bergmann, P.; Fauser, M.; Sattlegger, D.; Steger, C. Uninformed students: Student-teacher anomaly detection with discriminative latent embeddings. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4183–4192. [Google Scholar]
Peng, C.; Zhao, L.; Wang, S.; Abbas, Z.; Liang, F.; Islam, M.S. LightFlow: Lightweight unsupervised defect detection based on 2D Flow. IEEE Trans. Instrum. Meas. 2024, 73, 2521912. [Google Scholar] [CrossRef]
Yu, J.; Zheng, Y.; Wang, X.; Li, W.; Wu, Y.; Zhao, R.; Wu, L. Fastflow: Unsupervised anomaly detection and localization via 2d normalizing flows. arXiv 2021, arXiv:2111.07677. [Google Scholar] [CrossRef]
Batzner, K.; Heckler, L.; König, R. Efficientad: Accurate visual anomaly detection at millisecond-level latencies. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 128–138. [Google Scholar]
Gudovskiy, D.; Ishizaka, S.; Kozuka, K. Cflow-ad: Real-time unsupervised anomaly detection with localization via conditional normalizing flows. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 98–107. [Google Scholar]
Fang, Z.; Wang, X.; Li, H.; Liu, J.; Hu, Q.; Xiao, J. FastRecon: Few-shot Industrial Anomaly Detection via Fast Feature Reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 17481–17490. [Google Scholar]
Zhou, Y.; Huang, Z.; Zeng, D.; Qu, Y.; Wu, Z. Dual-Branch Knowledge Distillation via Residual Features Aggregation Module for Anomaly Segmentation. IEEE Trans. Instrum. Meas. 2024, 74, 3503411. [Google Scholar] [CrossRef]
Chen, L.; You, Z.; Zhang, N.; Xi, J.; Le, X. Utrad: Anomaly detection and localization with u-transformer. Neural Netw. 2022, 147, 53–62. [Google Scholar] [CrossRef] [PubMed]
Zhao, Y. Omnial: A unified cnn framework for unsupervised anomaly localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 3924–3933. [Google Scholar]
Liu, Z.; Zhou, Y.; Xu, Y.; Wang, Z. Simplenet: A simple network for image anomaly detection and localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 20402–20411. [Google Scholar]
Ding, C.; Pang, G.; Shen, C. Catching both gray and black swans: Open-set supervised anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 7388–7398. [Google Scholar]
Salehi, M.; Sadjadi, N.; Baselizadeh, S.; Rohban, M.H.; Rabiee, H.R. Multiresolution knowledge distillation for anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 14902–14912. [Google Scholar]
Lee, S.; Lee, S.; Song, B.C. Cfa: Coupled-hypersphere-based feature adaptation for target-oriented anomaly localization. IEEE Access 2022, 10, 78446–78454. [Google Scholar] [CrossRef]
Yi, J.; Yoon, S. Patch svdd: Patch-level svdd for anomaly detection and segmentation. In Proceedings of the Asian Conference on Computer Vision, Kyoto, Japan, 30 November–4 December 2020. [Google Scholar]
Rudolph, M.; Wehrbein, T.; Rosenhahn, B.; Wandt, B. Fully convolutional cross-scale-flows for image-based defect detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 1088–1097. [Google Scholar]
Cohen, N.; Hoshen, Y. Sub-image anomaly detection with deep pyramid correspondences. arXiv 2020, arXiv:2005.02357. [Google Scholar]
Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851. [Google Scholar]

Figure 1. Comparison of reconstructed images: (a) test images, (b) Draem reconstructed images, and (c) SeRNet reconstructed images.

Figure 2. The structure of SeRNet.

Figure 3. The pseudo-anomaly generation process. (a) PPA and (b) DPA methods.

Figure 4. Network architecture. (a) Segmentation sub-network and (b) reconstruction sub-network.

Figure 5. Prediction mask processing results. (a) Original prediction masks and (b) processed prediction masks.

Figure 6. Comparison of original test images and patched test images. (a) Original test images and (b) repaired test images.

Figure 7. Search for similar image results: (a) test images and (b) the most similar images.

Figure 8. Qualitative comparison of detection results between SeRNet and Draem on the MVTec AD dataset. (a) Anomaly image, (b) ground truth mask, (c) reconstruction image from Draem, (d) anomaly detection heatmap from Draem, (e) reconstruction image from SeRNet, and (f) anomaly detection heatmap from SeRNet.

Figure 9. Qualitative detection results of SeRNet on the MVTec AD, VisA MPDD and BTAD. (a) Test images, (b) reconstructed images, (c) ground truth, and (d) predicted anomalous regions.

Table 1. Symbol descriptions.

Symbol	Definition
$I_{t}$ , $I_{a}$	The input image, the (pseudo) anomalous image,
$I_{s}$ , $I_{b}$	The similar image, the repaired image
$I_{p}$ , I	The reconstructed image, the normal image
$M_{t}$ , $M_{p}$	The ground truth mask, the prediction mask
$M_{a}$ , $M_{d}$	The pseudo-anomaly mask of PPA and DPA
$t h_{p}$ , $t h_{a}$	Binarization threshold, filtering threshold
$I_{p}^{a}$ , $I_{p}^{b}$	The a-channel and b-channel of $I_{p}$ in CLELAB color space.

Table 6. Results of anomaly detection and location for different methods on the BTAD dataset (I-AUROC/P-AUROC), with the best results highlighted in bold.

Method	Class 01	Class 02	Class 03	Average
Fastflow [43]	99.4/97.1	82.4/93.6	91.1/98.3	91.0/96.3
PatchCore [20]	90.9/95.5	79.3/94.7	99.8/99.3	90.0/96.5
DRA [51]	-/-	-/-	-/-	94.2/75.4
MKD [52]	-/-	-/-	-/-	93.5/96.5
CFA [53]	98.1/95.9	85.5/96.0	99.0/98.6	94.2/96.8
VT-ADL [38]	97.6/99.0	71.0/94.0	82.6/77.0	83.7/90.0
PatchSVDD [54]	95.7/91.6	72.1/93.6	82.1/91.0	83.3/92.1
SeRNet (ours)	98.1/95.7	85.2/95.2	99.9/99.7	94.4/96.9

Table 7. Comparative results of model testing time (s), parameters (M), and performance (I-AUROC/P-AUROC) on the MVTec AD dataset, with the best results highlighted in bold.

Method	Testing Time	Parameters	Performance
Draem [9]	0.072	138.1	98.0/96.0
CSFlow [55]	0.116	275.5	98.7/-
PatchCore [20]	0.261	116.4	99.0/98.1
AST [24]	0.823	223.6	96.0/95.2
SPADE [56]	1.256	68.9	85.5/96.5
Padim [21]	1.625	68.9	95.5/97.5
SeRNet (ours)	0.085	81.0	99.6/97.6

Table 8. Results of ablation experiments on the MVTec AD dataset with the SeRNet structure (Image/Piexl-level AUROC). The checkmark (✓) denotes settings included in the experiment, and bold formatting indicates the best performance.

Segmentation Sub-Network	Reconstruction Sub-Network	Image/Pixel-Level
✓		77.6/75.3
	✓	98.1/95.1
✓	✓	99.6/97.6

Table 9. Results of ablation experiments on the MVTec AD dataset with the repair module (Image/Piexl-level AUROC). When the similar images search approach is not used, random images from the same class are used instead of similar images. The checkmark (✓) denotes settings included in the experiment, and bold formatting indicates the best performance.

Similar Images Search	Prediction Masks Processing	Image/Pixel-Level
		93.0/96.5
✓		94.1/96.9
	✓	98.3/97.1
✓	✓	99.6/97.6

Table 10. Results of ablation experiments on the BTAD dataset with the repair module (Image/Piexl-level AUROC). When the similar images search approach is not used, random images from the same class are used instead of similar images. The checkmark (✓) denotes settings included in the experiment, and bold formatting indicates the best performance.

Similar Images Search	Prediction Masks Processing	Image/Pixel-Level
		87.7/92.5
✓		90.3/94.5
	✓	92.1/95.7
✓	✓	94.4/96.9

Table 11. Results of ablation experiments on the MVTec AD dataset with the thresholds during prediction masks processing (Image/Piexl-level AUROC). Bold formatting indicates the best performance.

	0.3	0.6	0.8
${th}_{a}$	0.3	0.6	0.8
1000	96.7/97.5	98.3/97.4	99.1/97.4
3000	99.5/97.4	99.5/97.5	99.4/97.3
5000	99.5/97.5	99.5/97.5	99.5/97.5
7000	99.6/97.5	99.6/97.5	99.6/97.6
8000	99.6/97.4	99.6/97.4	99.5/97.5

Table 12. Results of ablation experiments on the BTAD dataset with the thresholds during prediction masks processing (Image/Piexl-level AUROC). Bold formatting indicates the best performance.

	0.3	0.6	0.8
${th}_{a}$	0.3	0.6	0.8
1000	91.2/92.1	92.6/92.5	93.3/93.9
3000	92.3/94.3	93.5/95.6	93.7/96.2
5000	93.7/94.7	94.0/97.5	94.4/96.3
7000	94.0/96.2	94.2/96.9	94.4/96.9
8000	94.2/95.4	93.8/96.4	94.2/96.6

Table 13. Results of ablation experiments on the MVTec AD dataset with different pseudo-anomaly generation strategies (Image/Piexl-level AUROC). The checkmark (✓) denotes settings included in the experiment, and bold formatting indicates the best performance.

CutPaste [30]	PPA	DPA	Image/Pixel-Level
✓			96.7/95.8
	✓		98.1/95.2
		✓	97.9/96.3
	✓	✓	99.6/97.6

Table 14. Results of ablation experiments on the MVTec AD dataset with different anomaly evaluation functions (Image/Piexl-level AUROC). Bold formatting indicates the best performance.

	$D_{structure}$	$D_{color}$	$D_{structure} + {cD}_{color}$
Image-level	98.0	82.3	99.6
Pixel-level	96.7	89.9	97.6

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Cui, Y.; Sun, J.; Liu, X.; Wei, S.; Jiang, J. SeRNet: Segmentation Helps Reconstruction for Anomaly Detection. Appl. Sci. 2026, 16, 1670. https://doi.org/10.3390/app16041670

AMA Style

Cui Y, Sun J, Liu X, Wei S, Jiang J. SeRNet: Segmentation Helps Reconstruction for Anomaly Detection. Applied Sciences. 2026; 16(4):1670. https://doi.org/10.3390/app16041670

Chicago/Turabian Style

Cui, Yan, Jinkai Sun, Xiying Liu, Shun Wei, and Jielin Jiang. 2026. "SeRNet: Segmentation Helps Reconstruction for Anomaly Detection" Applied Sciences 16, no. 4: 1670. https://doi.org/10.3390/app16041670

APA Style

Cui, Y., Sun, J., Liu, X., Wei, S., & Jiang, J. (2026). SeRNet: Segmentation Helps Reconstruction for Anomaly Detection. Applied Sciences, 16(4), 1670. https://doi.org/10.3390/app16041670

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

SeRNet: Segmentation Helps Reconstruction for Anomaly Detection

Abstract

1. Introduction

2. Related Work

2.1. Embedding-Based Methods

2.2. Reconstruction-Based Methods

2.3. Pseudo-Anomaly-Based Methods

3. Methods

3.1. Pseudo-Anomaly Generation Strategy

3.2. Segmentation Sub-Network and Reconstruction Sub-Network

3.3. Repair Module

3.4. Inference

3.4.1. Similar Image Search

3.4.2. Anomaly Evaluation

4. Experiments

4.1. Experimental Details

4.2. Experimental Results

4.2.1. Comparison with State-of-the-Art Methods

4.2.2. Qualitative Analysis

4.2.3. Complexity Analysis

4.3. Ablation Study

4.3.1. Model Structure

4.3.2. Repair Module

4.3.3. Pseudo-Anomaly Generation Strategy

4.3.4. Anomaly Evaluation Function

5. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI