1. Introduction
Anomaly detection is a crucial research field in computer vision, with widespread applications across a range of industries [1,2,3,4]. Traditional anomaly detection methods typically rely on supervised learning approaches [5,6] and have demonstrated promising detection results. However, the scarcity of anomalous samples and the diversity of anomaly types limit the generalizability of these approaches. Thus, with the development of deep learning, unsupervised anomaly detection methods [7,8,9,10] have attracted considerable attention. Unsupervised anomaly detection methods can be categorized into two main classes: embedding-based and reconstruction-based methods.
Embedding-based methods [11,12,13] typically utilize a pre-trained model to extract features from samples and model the feature space to detect anomalies. Although these methods achieve high detection accuracy, they lack interpretability. In contrast, reconstruction-based methods [14,15,16,17] rely on restoration techniques to detect anomalies. During the training phase, reconstruction-based methods employ a pseudo-anomaly generation strategy to obtain anomalous images, and the reconstruction model then learns to reconstruct these pseudo-anomalous samples as normal patterns. The resulting model then attempts to reconstruct real anomalous images during testing, and anomaly detection is performed by comparing the differences between the input and reconstructed images. However, these models typically have limited generalization ability, which prevents them from fully reconstructing all anomalies, particularly large-scale anomalies: the convolutional layers in the reconstruction model have a constrained receptive field, resulting in poor reconstruction ability and reduced detection accuracy. To address the challenges faced by reconstruction networks, previous research has alleviated this issue by introducing more powerful reconstruction networks. For example, anomaly diffusion [18] incorporates a diffusion model, while MUTAD [19] utilizes a multi-scale Transformer model as the reconstruction network. Although these methods enhance reconstruction capabilities, they also increase the complexity of the detection algorithms, resulting in decreased efficiency.
To address the challenges described above, this paper proposes a network architecture called SeRNet, which comprises three parts: a segmentation sub-network, a reconstruction sub-network, and a repair module. Previous work often simply combines segmentation and reconstruction, using the segmentation network to evaluate difference images between the input and the reconstruction output for anomaly detection. This approach does not fully leverage the potential of the segmentation network, treating it merely as a discriminator for differences before and after image reconstruction, and it does not address the difficulties the reconstruction network faces when large-scale anomalies exist. In SeRNet, the segmentation sub-network is responsible for detecting suspected large-scale anomalies, whereas the reconstruction sub-network focuses on reconstructing small-scale anomalies and repairing splices of anomalies. The repair module, guided by the prediction masks, uses similar images to effectively repair large-scale anomalies, thus helping to reconstruct anomalous regions. This structure allows SeRNet to reconstruct anomalous regions of different scales more effectively. Two pseudo-anomaly generation methods, namely Perlin Pseudo-Anomaly (PPA) and Diffusion Pseudo-Anomaly (DPA), are employed to guide the learning of each sub-network. The PPA approach generates large-scale anomalies to guide the segmentation sub-network, whereas DPA generates small-scale anomalies to enable the reconstruction sub-network to focus on reconstructing the details of identified anomalous regions. This design enables each sub-network to focus on learning specific anomaly patterns, thereby enhancing the model's overall detection performance.
Figure 1c illustrates the reconstructed image of SeRNet. Compared with DRAEM [9], a classical reconstruction method, SeRNet can more effectively handle large-scale anomalies. SeRNet does not adopt the conventional remedies often found in previous work, such as replacing the reconstruction network with a more powerful model. Instead, SeRNet approaches the problem at its root, recognizing that the reconstruction network encounters receptive field limitations when handling large-scale anomalies. Therefore, SeRNet first locates these large-scale anomalies using the segmentation network and then utilizes a repair module for reconstruction. In this way, SeRNet effectively avoids the receptive field limitations faced by the reconstruction network during the rebuilding process, thereby enhancing reconstruction quality and improving anomaly detection accuracy. Overall, this paper's contributions can be summarized as follows:
(1) This paper presents SeRNet, an innovative network architecture that addresses the reconstruction challenge in large-scale anomaly detection through the collaborative design of the segmentation sub-network and the repair module, providing a new theoretical framework for industrial anomaly detection.
(2) This paper introduces two novel pseudo-anomaly generation methods, PPA and DPA, which optimize pseudo-anomaly generation strategies for segmentation and reconstruction tasks, significantly enhancing the model’s ability to capture anomalous features across varying scales.
(3) Extensive experiments validate the accuracy and robustness of SeRNet. The results indicate that SeRNet performs exceptionally well on the MVTec AD dataset, achieving an image-level AUROC score of up to 99.6%. SeRNet demonstrates high performance across multiple benchmarks, showcasing superior generalization in complex industrial environments and providing a robust, efficient framework for anomaly detection.
The rest of this paper is organized as follows: Section 2 reviews the related work; Section 3 introduces the proposed SeRNet; Section 4 presents comparative and ablation experiments; and conclusions are provided in Section 5.
3. Methods
Figure 2 shows the overall structure of the proposed SeRNet, which comprises a segmentation sub-network, a reconstruction sub-network, and a repair module that connects the two sub-networks. To facilitate the introduction of SeRNet,
Table 1 lists the key symbols along with their definitions.
3.1. Pseudo-Anomaly Generation Strategy
SeRNet’s segmentation sub-network is responsible for segmenting large-scale anomalies, while its reconstruction sub-network is responsible for reconstructing small-scale anomalies. In this study, two pseudo-anomaly generation methods are used. The PPA method is used to generate large-scale pseudo-anomalies for the segmentation sub-network, while the DPA method is used to generate small-scale pseudo-anomalies for the reconstruction sub-network.
Inspired by [9], PPA uses Perlin noise [26] to generate anomaly masks. However, unlike [9], the PPA method filters the pseudo-anomaly masks to restrict the generation of small-scale pseudo-anomalies, so that the scale of the generated pseudo-anomalies is consistent with the segmentation sub-network's large-scale anomaly detection task. Furthermore, the PPA approach improves upon the choice of pseudo-anomalous data sources compared to DRAEM [9]. DRAEM [9] utilizes the DTD dataset [33] as a source for pseudo-anomalies. However, the data distribution of the DTD dataset differs significantly from that of industrial products in practice, since real defects, such as breaks, scratches, and misalignment, follow a distribution similar to that of normal areas. PPA addresses this issue by employing images from the same class as the anomaly data source, which allows pseudo-anomalies to be generated that better align with the true distribution of anomalies. To ensure diversity among the pseudo-anomaly data sources, images of the same class are first divided into four pieces and randomly recombined to create spatial differences. The resulting images are then randomly enhanced with brightness, contrast, color, and affine transformations. After these processing steps, images from the same class can serve as a data source for pseudo-anomaly generation.
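The four-piece shuffle and photometric enhancement described above can be sketched as follows. This is a minimal NumPy illustration: the function name and jitter ranges are assumptions, and the color and affine transforms mentioned in the text are omitted for brevity.

```python
import numpy as np

def make_anomaly_source(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Build a PPA anomaly data source from a same-class image (H, W, C in [0, 1]).

    The image is cut into four quadrants that are randomly recombined to create
    spatial differences, then randomly jittered in brightness and contrast.
    """
    h, w = img.shape[0] // 2, img.shape[1] // 2
    pieces = [img[:h, :w], img[:h, w:2*w], img[h:2*h, :w], img[h:2*h, w:2*w]]
    order = rng.permutation(4)
    top = np.concatenate([pieces[order[0]], pieces[order[1]]], axis=1)
    bottom = np.concatenate([pieces[order[2]], pieces[order[3]]], axis=1)
    out = np.concatenate([top, bottom], axis=0)
    # Random contrast (multiplicative) and brightness (additive) jitter;
    # ranges are illustrative assumptions.
    out = out * rng.uniform(0.8, 1.2) + rng.uniform(-0.1, 0.1)
    return np.clip(out, 0.0, 1.0)
```

In practice the recombined image would also pass through color and affine augmentations before being used as the anomaly source.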
Figure 3a shows the process of generating pseudo-anomalies using PPA, which can be expressed as:

I_A = M̄_P ⊙ I + β(M_P ⊙ I) + (1 − β)(M_P ⊙ I_s)

where M_P denotes the pseudo-anomaly mask generated by Perlin noise, M̄_P denotes the inverse result of M_P, I represents the normal image, I_s represents the (enhanced) image of the same class, ⊙ indicates the element-by-element multiplication operation, and β represents a randomized weight in the range [0, 0.8].
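Under the definitions above (a binary Perlin mask, a normal image, a same-class source, and a random blending weight), the DRAEM-style blending step can be sketched in NumPy. The function name is an assumption for illustration.

```python
import numpy as np

def ppa_blend(img: np.ndarray, src: np.ndarray, mask: np.ndarray,
              rng: np.random.Generator) -> np.ndarray:
    """Blend a same-class source into a normal image under a Perlin mask.

    Implements the DRAEM-style rule I_A = M̄⊙I + β(M⊙I) + (1−β)(M⊙I_s),
    with β drawn from [0, 0.8]. `mask` is a binary (H, W) array; images
    are (H, W, C) arrays.
    """
    beta = rng.uniform(0.0, 0.8)            # randomized blending weight
    m = mask[..., None].astype(img.dtype)   # broadcast mask over channels
    return (1 - m) * img + beta * (m * img) + (1 - beta) * (m * src)
```

Outside the mask the image is untouched; inside it, the source texture is mixed with the original content, so larger β values yield subtler pseudo-anomalies.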
In contrast, the DPA method creates one or two rectangular masks that cover less than 2.5% of the image area. It then generates two to four randomly proportioned rectangular masks immediately adjacent to these rectangles to form an irregular region. These irregular regions represent pseudo-anomaly masks. The process of gradually changing from a single rectangular mask to an irregular mask can be vividly summarized as diffusion, which is the origin of the name DPA.
DPA then randomly adjusts the color and lightness of these image regions to generate pseudo-anomalies. To train the reconstruction sub-network, large-scale anomalies can be generated using PPA to ensure the creation of splicing anomalies during the repair phase. Additionally, the DPA method can be used to generate small-scale pseudo-anomalies, thus enabling the reconstruction sub-network to prioritize image details.
The DPA anomaly generation process is shown in Figure 3b and can be expressed as:

I_A = M̄_D ⊙ I + M_D ⊙ (I + δ_b + δ_c)

where M_D denotes the pseudo-anomaly mask randomly generated by DPA, M̄_D represents the inverse result of M_D, I is the normal image, δ_b denotes the random brightness variation value, δ_c denotes the random color variation value, and ⊙ signifies the elementwise multiplication operation.
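The mask-growing ("diffusion") procedure described above can be sketched as follows. This is an illustrative NumPy sketch: the rectangle sizes, counts, and attachment rule are assumptions consistent with the text (seed rectangles under 2.5% of the image area, two to four adjacent rectangles each).

```python
import numpy as np

def dpa_mask(h: int, w: int, rng: np.random.Generator) -> np.ndarray:
    """Generate a small irregular DPA pseudo-anomaly mask.

    Seeds one or two small rectangles (each under 2.5% of the image area),
    then 'diffuses' two to four randomly proportioned rectangles adjacent
    to each seed, forming an irregular region.
    """
    mask = np.zeros((h, w), dtype=np.uint8)
    max_area = 0.025 * h * w
    for _ in range(rng.integers(1, 3)):                # one or two seeds
        rh = int(rng.integers(2, max(3, h // 8)))
        rw = max(1, min(int(max_area / rh), w // 8))   # keep area under 2.5%
        y, x = int(rng.integers(0, h - rh)), int(rng.integers(0, w - rw))
        mask[y:y+rh, x:x+rw] = 1
        for _ in range(rng.integers(2, 5)):            # two to four neighbors
            dh = int(rng.integers(1, rh + 1))
            dw = int(rng.integers(1, rw + 1))
            # Attach the new rectangle directly above or below the seed.
            yy = int(np.clip(y + rng.choice([-dh, rh]), 0, h - dh))
            xx = int(np.clip(x + rng.integers(0, rw), 0, w - dw))
            mask[yy:yy+dh, xx:xx+dw] = 1
    return mask
```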
3.2. Segmentation Sub-Network and Reconstruction Sub-Network
As shown in
Figure 4, both the segmentation sub-network and the reconstruction sub-network of SeRNet adopt the classical U-Net architecture, which comprises an encoder and a decoder. The encoder consists of convolutional and pooling layers that extract image features and reduce the image's scale to create a latent feature space. The decoder consists of convolutional and upsampling layers that restore the image's spatial dimensions and details from the latent feature space.
The reconstruction sub-network suppresses the expression of anomalous features through the downsampling layers of the encoder and then reconstructs the normal image from the latent feature space via the upsampling layers of the decoder. However, large-scale anomalies still behave as anomalous features in the latent feature space. During the decoder’s upsampling process, the restricted receptive field causes the current pixel values to be influenced by surrounding pixels, making large-scale anomalies more likely to be reconstructed as anomalies. Consequently, the reconstruction sub-network performs well in reconstructing small-scale anomalies but struggles with large-scale anomalies. To assist in reconstructing large-scale anomalies, the segmentation sub-network is responsible for localizing them. This is attributed to the fact that large-scale anomalies exhibit more distinctive features and stronger boundaries, making them easier for the segmentation sub-network to identify.
The segmentation sub-network of SeRNet only produces an anomaly mask as its output. In addition, to reduce the total number of parameters in SeRNet, the encoder in the segmentation sub-network uses only two max-pooling layers and three convolutional layers, while the decoder contains two upsampling layers, two convolutional layers, and one output convolutional layer.
Unlike the segmentation sub-network, the reconstruction sub-network must output reconstructed three-channel color images, which is a relatively complex task. Thus, the reconstruction sub-network’s encoder comprises five convolutional layers and four max-pooling layers, while its decoder consists of four convolutional layers, four upsampling layers, and one output convolutional layer. However, increasing the number of layers may also result in a loss of image information during downsampling, leading to distortion of the reconstructed image. To address this issue, this study introduces skip connections, which link the feature maps of each layer of the encoder directly to the feature maps of the corresponding layer of the decoder. This enables the decoder to reconstruct fine-grained features from the feature maps of lower layers via the skip connection path. Reconstruction sub-networks with skip connections are correspondingly more effective at capturing image details.
The segmentation sub-network and the reconstruction sub-network are optimized separately. The segmentation sub-network is trained first; after its training is complete, it infers and outputs the anomaly segmentation map, which assists in training the reconstruction sub-network. To ensure the consistency of SeRNet during the training phase, the same loss function is used for both sub-networks. The total loss comprises two components, the MSE loss and the SSIM loss [16], and can be expressed as follows:

L(Î, I*) = L_MSE(Î, I*) + L_SSIM(Î, I*)

where Î denotes the reconstructed image or predicted mask, and I* denotes the ground truth image or mask.
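The combined objective can be sketched in NumPy as follows. Note that the SSIM loss used in the paper is windowed as in [16]; this sketch computes SSIM from global image statistics for brevity, so it is an approximation rather than the exact training loss.

```python
import numpy as np

def total_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """Total training loss L = L_MSE + L_SSIM (global-statistics sketch).

    Inputs are arrays with values in [0, 1]; L_SSIM is taken as 1 − SSIM.
    """
    mse = float(np.mean((pred - target) ** 2))
    c1, c2 = 0.01 ** 2, 0.03 ** 2                  # standard SSIM constants
    mu_x, mu_y = pred.mean(), target.mean()
    var_x, var_y = pred.var(), target.var()
    cov = ((pred - mu_x) * (target - mu_y)).mean()
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    return mse + (1.0 - float(ssim))               # L_MSE + L_SSIM
```

A perfect reconstruction drives both terms to zero, while the SSIM term penalizes structural discrepancies that a pure MSE loss tends to blur over.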
3.3. Repair Module
The repair module acts as a crucial link that connects both sub-networks. This module has multiple functions, including prediction mask binarization, filtering, and large-scale anomaly repair. The pseudocode for the repair module is provided in Algorithm 1. The inputs are the prediction mask M, the anomalous image (or pseudo-anomalous image) I_A, and the similar image I_S of I_A. Firstly, M is binarized in steps 1–5. Subsequently, in steps 6–10, prediction masks with small-scale anomalies are filtered and those anomalies are eliminated. Finally, in steps 11–17, I_A is repaired utilizing I_S under the guidance of M, and the resulting repaired image I_R is returned.
| Algorithm 1 Repair strategy |
Require: M: the prediction mask. I_A: the (pseudo) anomalous image. I_S: the similar image of I_A.
Ensure: I_R: the repaired image of I_A.
# Prediction mask binarization
1: if M(i, j) ≥ T_b then
2:   M(i, j) ← 1
3: else
4:   M(i, j) ← 0
5: end if
# Prediction mask filtering; R denotes a connected region in M; Area(R) means calculate the area of R
6: for R in M do
7:   if Area(R) < T_a then
8:     M(R) ← 0
9:   end if
10: end for
# Large-scale anomaly repair
11: for each pixel (i, j) do
12:   if M(i, j) = 1 then
13:     I_R(i, j) ← I_S(i, j)
14:   else
15:     I_R(i, j) ← I_A(i, j)
16:   end if
17: end for
The pixels in the prediction mask M output from the segmentation sub-network all have values in the range [0, 1], which indicate the probability of a pixel belonging to an anomalous region. This paper utilizes binarization to handle the continuous distribution of pixel values in M. Specifically, a threshold T_b is introduced, where pixel values below the threshold are set to 0, and those above it are set to 1. This treatment prevents anomalous regions with low probability from introducing additional anomalies during the repair phase. The binarized version of M must also be filtered. To do so, if the area of a connected anomalous region in M is below a set threshold T_a, the corresponding pixel values of that connected region are set to 0, thus indicating that these small-scale anomalies should be excluded. Since the reconstruction sub-network can reconstruct these small-scale anomalies into normal regions, it is unnecessary to repair them, thus avoiding potentially introducing additional anomalies.
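The binarization and small-region filtering described above can be sketched as follows. This is an illustrative Python/NumPy sketch with a simple 4-connected flood fill; the two thresholds are hypothetical default values, not the paper's tuned settings.

```python
import numpy as np
from collections import deque

def process_mask(prob: np.ndarray, t_bin: float = 0.5, t_area: int = 32) -> np.ndarray:
    """Binarize a prediction mask and drop small connected regions.

    Pixels with probability >= t_bin become 1; 4-connected regions whose
    area is below t_area are set back to 0, since the reconstruction
    sub-network handles such small-scale anomalies on its own.
    """
    mask = (prob >= t_bin).astype(np.uint8)
    h, w = mask.shape
    seen = np.zeros_like(mask, dtype=bool)
    for sy in range(h):
        for sx in range(w):
            if mask[sy, sx] and not seen[sy, sx]:
                # BFS flood fill collecting one connected region.
                region, queue = [], deque([(sy, sx)])
                seen[sy, sx] = True
                while queue:
                    y, x = queue.popleft()
                    region.append((y, x))
                    for ny, nx in ((y-1, x), (y+1, x), (y, x-1), (y, x+1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            queue.append((ny, nx))
                if len(region) < t_area:     # filter small-scale regions
                    for y, x in region:
                        mask[y, x] = 0
    return mask
```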
Figure 5 shows a comparison of the original and processed prediction masks. As shown, the non-binarized and unfiltered masks contain very high noise levels, while the processed masks contain only large-scale masks representing suspected anomalous regions.
The repair operation replaces anomalous regions of I_A with the corresponding regions from the similar image I_S under the guidance of M. The sources of I_S are explained in Section 3.4. The repair result is shown in Figure 6. Repairing anomalous regions using similar images effectively assists in the reconstruction of large-scale anomalies.
The prediction masks assist in repairing large-scale anomalies by providing location information about these anomalies, thereby supporting the reconstruction network in transforming anomalous areas into normal ones. Without the aid of the prediction masks, the reconstruction network would struggle to effectively achieve this transformation, potentially leading to errors in anomaly detection. Therefore, the prediction masks play a crucial role in the repair module. The repair module consists of two main operations: first, processing the prediction masks generated by the segmentation sub-network, and second, using the processed prediction masks to replace anomalous regions with normal ones to repair large-scale anomalies. The objective of the repair module is to minimize the introduction of additional anomalies while effectively repairing large-scale anomalies.
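The mask-guided replacement itself reduces to an elementwise select between the two images, as this short NumPy sketch shows (the function name is illustrative):

```python
import numpy as np

def repair(anom: np.ndarray, similar: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Repair large-scale anomalies under the guidance of a processed mask.

    Where the binary mask is 1, pixels are copied from the similar image;
    elsewhere the (pseudo) anomalous image is kept unchanged.
    """
    m = mask[..., None].astype(bool)     # broadcast over color channels
    return np.where(m, similar, anom)
```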
3.4. Inference
The inference phase of SeRNet includes the following additional operations.
3.4.1. Similar Image Search
As noted in Section 3.3, SeRNet needs to repair large-scale anomalies using normal regions from similar images. During the training phase, the normal image I generates the pseudo-anomalous image I_A; the corresponding region of I is then used to repair the large-scale anomalies in I_A.
During the inference phase, SeRNet searches for the training image that is most similar to the test image. To achieve this, a perceptual hash algorithm [34] is used. The algorithm generates a unique perceptual hash value for each image based on the appearance and location of the target object, ensuring that the test image can accurately retrieve similar images from the training set. In the experiments, perceptual hashing is implemented using the default settings of version 4.3.1 of the ImageHash library (https://pypi.org/project/ImageHash/ (accessed on 1 February 2026)).
For the MVTec AD [35] dataset, which contains 3416 training images, only 217 KB of storage is needed for its hash values, equating to an average of 0.06 KB per image. This is because the perceptual hash algorithm converts each image into a 16-character hash string, and computers are highly efficient at storing strings, allowing for significant reductions in storage space.
Algorithm 2 outlines the steps of the similar image search. This algorithm takes the set of training samples S and the anomalous image I_A as inputs. Before the inference process, steps 1–4 are executed to compute the perceptual hash value of each training image and save it in the corresponding file, which avoids repeated computation during the inference stage. During inference, step 5 calculates the hash value of the current test image. Step 6 then searches the stored hash values to find the image with the smallest Hamming distance from the hash value of I_A, i.e., the most similar image I_S.
Figure 7 illustrates the result of this search. The similar image search algorithm can find similar images for the screws in the first column and the hazelnuts in the fourth column of Figure 7 even when they have different orientations, thus playing a crucial role in ensuring accurate repairs for large-scale anomalies.
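The hash-then-search scheme can be sketched as follows. To keep the sketch self-contained it uses a simplified average hash as a stand-in for the pHash of the ImageHash library; the function names and the 8 × 8 signature size are illustrative assumptions.

```python
import numpy as np

def ahash(img: np.ndarray, size: int = 8) -> np.ndarray:
    """Average hash: a simplified stand-in for the perceptual hash (pHash).

    Downsamples a grayscale image to size x size by block averaging and
    thresholds each cell against the global mean, giving a 64-bit signature.
    """
    h, w = img.shape
    small = img[:h - h % size, :w - w % size]
    small = small.reshape(size, h // size, size, w // size).mean(axis=(1, 3))
    return (small > small.mean()).flatten()

def most_similar(train: list, query: np.ndarray) -> int:
    """Index of the training image whose hash has the smallest Hamming
    distance to the query hash (Algorithm 2, steps 5-6)."""
    hashes = [ahash(im) for im in train]   # precomputed before inference
    q = ahash(query)
    return int(np.argmin([np.count_nonzero(hs != q) for hs in hashes]))
```

In the actual system the training hashes are written to a file once, so inference only hashes the test image and scans the stored signatures.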
| Algorithm 2 Similar image search |
Require: S: the set of training samples in the same category. I_A: the anomalous image.
Ensure: F: the perceptual hash value storage file. I_S: the similar image of I_A.
# Compute the perceptual hash of the training samples in advance and store it
1: for I in S do
2:   h ← pHash(I)
3:   F.write(h)
4: end for
5: h_A ← pHash(I_A)
6: I_S ← arg min_{I ∈ S} Hamming(h_A, F(I))
3.4.2. Anomaly Evaluation
SeRNet reconstructs the test image during the inference phase and then uses an anomaly evaluation function [15] to compare the differences between the original and reconstructed images. This comparison determines the presence of any anomalies. The anomaly evaluation function measures the color differences and structural differences between the test image I and the reconstructed image Î. To represent the color difference, the anomaly evaluation function calculates the difference between the a-channel and b-channel of the two images in the CIELAB color space, which can be expressed as:

D_c(I, Î) = f_{21}(|I_a − Î_a|) + f_{21}(|I_b − Î_b|)

where f_{21} denotes the filtering process using a convolutional kernel of size 21, and I_a, I_b (resp. Î_a, Î_b) denote the a-channel and b-channel of I (resp. Î).
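The a/b-channel comparison can be sketched in NumPy as follows. The sketch assumes both inputs are already converted to CIELAB (the RGB-to-LAB step is done upstream) and uses a summed-area table to realize the 21 × 21 mean filter.

```python
import numpy as np

def color_difference(lab_i: np.ndarray, lab_r: np.ndarray, k: int = 21) -> np.ndarray:
    """Per-pixel color difference on the a/b channels of two CIELAB images.

    Computes f(|I_a − Î_a|) + f(|I_b − Î_b|), where f is a k x k mean
    filter (k = 21 in the paper). Inputs are (H, W, 3) CIELAB arrays.
    """
    def mean_filter(x: np.ndarray) -> np.ndarray:
        pad = k // 2
        xp = np.pad(x, pad, mode="edge")
        # Summed-area table gives an O(1)-per-pixel box filter.
        s = np.cumsum(np.cumsum(np.pad(xp, ((1, 0), (1, 0))), axis=0), axis=1)
        h, w = x.shape
        return (s[k:k+h, k:k+w] - s[:h, k:k+w] - s[k:k+h, :w] + s[:h, :w]) / (k * k)
    da = mean_filter(np.abs(lab_i[..., 1] - lab_r[..., 1]))  # a-channel
    db = mean_filter(np.abs(lab_i[..., 2] - lab_r[..., 2]))  # b-channel
    return da + db
```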
The anomaly evaluation function calculates the image's structural differences based on the multi-scale gradient magnitude similarity (MSGMS) [16], which can be expressed as:

D_s(I, Î) = f_{21}(1 − MSGMS(I, Î))

where MSGMS(I, Î) denotes the computation of the multi-scale gradient magnitude similarity of I and Î, with the convolutional kernel size of f_{21} set to 21.
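A single pyramid level of this measure can be sketched as follows. This NumPy sketch uses central differences in place of the Prewitt filters of [16], and the stability constant is an illustrative value; the full MSGMS averages the similarity map over several image scales.

```python
import numpy as np

def gms_map(img: np.ndarray, rec: np.ndarray, c: float = 0.0026) -> np.ndarray:
    """Single-scale gradient magnitude similarity map (one MSGMS level).

    GMS = (2 g_i g_r + c) / (g_i^2 + g_r^2 + c), where g denotes the
    gradient magnitude and c is a small stability constant.
    """
    def grad_mag(x: np.ndarray) -> np.ndarray:
        gy, gx = np.gradient(x)              # central-difference gradients
        return np.sqrt(gx ** 2 + gy ** 2)
    g_i, g_r = grad_mag(img), grad_mag(rec)
    return (2 * g_i * g_r + c) / (g_i ** 2 + g_r ** 2 + c)

def structural_anomaly(img: np.ndarray, rec: np.ndarray) -> np.ndarray:
    """Single-scale structural difference 1 − GMS (before mean filtering)."""
    return 1.0 - gms_map(img, rec)
```

Identical images yield a similarity of 1 everywhere, so the structural difference vanishes; gradient mismatches around unreconstructed anomalies push the map toward 1.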
The anomaly evaluation function can be expressed as:

D(I, Î) = D_s(I, Î) + c · D_c(I, Î)

where c is taken as 0.001 to balance the difference between the orders of magnitude of D_s(I, Î) and D_c(I, Î). The maximum value of D(I, Î) represents the anomaly score of the test image.
5. Conclusions
This paper proposes an unsupervised image anomaly detection model called SeRNet to address the challenge of reconstructing large-scale anomalies in current reconstruction sub-networks. The key contribution of this paper is to combine the advantages of segmentation and reconstruction in order to reconstruct large-scale anomalies. SeRNet employs a serial fusion approach of segmentation followed by reconstruction. The segmentation sub-network performs pre-segmentation of anomalous images to detect suspected anomalous regions, while the repair module employs prediction masks and similar images to repair the suspected large-scale anomalies. The reconstruction sub-network then reconstructs the repaired image. In addition, two methods for generating pseudo-anomalies at different scales are introduced for training SeRNet. The extensive experiments performed in this work demonstrate that SeRNet surpasses other state-of-the-art anomaly detection methods in terms of both performance and robustness.
However, SeRNet also faces some limitations and challenges. While the introduction of segmentation does enhance the performance of the reconstruction network, it also brings the risk of over-reliance on the segmentation results. Additionally, when working with non-uniform textures, the detection accuracy of SeRNet may decline, which is a common challenge faced by most industrial anomaly detection methods. To address these limitations, our future work will leverage diffusion models [57] or Transformer networks to enhance reconstruction capabilities and tackle complex anomaly detection problems.