Image Anomaly Detection Using Normal Data Only by Latent Space Resampling

Abstract: Detecting image anomalies automatically in industrial scenarios can improve economic efficiency, but the scarcity of anomalous samples makes the task more challenging. Recently, autoencoders have been widely used for image anomaly detection without using anomalous images during training. However, it is hard to determine the proper dimensionality of the latent space, and an improper choice often leads to unwanted reconstruction of the anomalous parts. To solve this problem, we propose a novel method based on the autoencoder. In this method, the latent space of the autoencoder is estimated using a discrete probability model. With the estimated probability model, the anomalous components in the latent space can be well excluded and undesirable reconstruction of the anomalous parts can be avoided. Specifically, we first adopt VQ-VAE as the reconstruction model to obtain a discrete latent space of normal samples. Then, PixelSNAIL, a deep autoregressive model, is used to estimate the probability model of the discrete latent space. In the detection stage, the autoregressive model determines the parts of the input latent space that deviate from the normal distribution. The deviating codes are then resampled from the normal distribution and decoded to yield a restored image, which is closest to the anomalous input. The anomaly is detected by comparing the difference between the restored image and the anomalous image. Our proposed method is evaluated on the high-resolution industrial inspection image dataset MVTec AD, which consists of 15 categories. The results show that the AUROC of the model improves by 15% over the autoencoder and also yields competitive performance compared with state-of-the-art methods.


Introduction
One of the key factors in optimizing the manufacturing process is automatic anomaly detection, which makes it possible to prevent production errors, thereby improving quality and bringing economic benefits to the plant. The common practice of anomaly detection in industry is for a machine to make judgments on images acquired through digital cameras or sensors. This is essentially an image anomaly detection problem, i.e., looking for patterns that differ from normal images [1]. Humans can easily handle this task through awareness of normal patterns, but it is relatively difficult for machines. Unlike other computer vision tasks, image anomaly detection suffers from the following inevitable challenges: class imbalance, variety of anomaly types, and unknown anomalies [2,3]. Anomalous instances are generally rare, whereas normal instances account for a significant proportion. In addition, anomaly types are distinct and varied, and unseen types may appear at test time. The main contributions of this work are as follows:
1. We propose a novel method using only normal data for image anomaly detection. It effectively excludes the anomalous components in the latent space and avoids the unwanted reconstruction of the anomalous part, which achieves better detection results.
2. We propose a new method for computing the anomaly score. High anomaly scores are concentrated in the regions where anomalies are present, which reduces the noise introduced by the reconstruction and improves precision.
The remainder of this paper is organized as follows: Section 2 reviews the related work of image anomaly detection. A detailed description of our proposed method is given in Section 3. Section 4 presents experimental setups and comparisons. The conclusion is finally summarized in Section 5.

Related Work
There exists an abundance of work on image anomaly detection over the last several decades, and we outline several types of methods related to the proposed approach.

Feature Extraction Based Method
Feature extraction-based methods generally map images to an appropriate feature space and detect anomalies based on distance. Generally, feature extraction and anomaly detection are decoupled. Abdel-Qader et al. [16] proposed a PCA-based method for concrete bridge deck crack detection. They first use PCA to get the dominant eigenvectors of tested image patches and then calculate the Euclidean distance to the dominant eigenvectors of normal image patches. Liu et al. [17] used SVDD to detect defects in TFT-LCD array images. They described an image patch with four features, including entropy, energy, contrast, and homogeneity, and trained an SVDD model using normal image patches. If a feature vector lies outside the hypersphere found by SVDD during testing, the image patch corresponding to this feature vector is considered anomalous. Similar works can be found in [18][19][20]. Compared to these traditional dimensionality reduction models, a convolutional neural network (CNN) provides nonlinear mapping and is better at extracting semantic information. There are two common practices for applying a CNN to feature extraction: one is to use a pre-trained network, such as VGG or ResNet, and the other is to develop a deep feature extraction model specifically for the purpose. For example, Napoletano et al. [21] used a pre-trained ResNet-18 to extract feature vectors from scanning electron microscope (SEM) images to construct a dictionary. In the prediction phase, a tested image is considered anomalous if the average Euclidean distance between its feature and its m nearest neighbors in the dictionary is higher than a threshold. An example of the latter is the Deep SVDD model developed by Ruff et al. [22], who combine CNN with SVDD. They trained the deep neural network by minimizing the volume of the hypersphere containing the feature vectors of the data.

Probability Based Method
These methods assume that anomalies occur in low-probability regions of the normal data, and the main principle is as follows: (i) establish the probability density function (PDF) of normal data; (ii) evaluate the test samples with the PDF; samples with low probability density values are most likely to be abnormal. There are various methods depending on the distribution assumptions, such as Gaussian [23], Gaussian Mixture Model (GMM) [24], or Markov random fields (MRF) [25]. For instance, Böttger et al. [26] improved on the framework of [23]: they applied CS theory [27] to compress defect-free texture patches into texture features and used a GMM to estimate their probability distribution. In the detection stage, a pixel is considered defective if the likelihood of the local patch corresponding to it is less than a threshold value. Recently, deep learning methods have improved PDF estimation performance. Due to the powerful image generation capability of deep autoregressive (DAR) models, they are also used for anomaly detection.
Richter et al. [28] applied PixelCNN [29], which predicts per-pixel likelihood, to anomaly detection on circuit board images, but it may not work well on more complex natural images. One advantage of the DAR model is that it can give the likelihood for each pixel. However, Shafaei et al. [30] recently compared several anomaly detection methods whose anomaly scores are given as softmax output, Euclidean distance, likelihood, etc. The experimental results show that PixelCNN++ using likelihood as the anomaly score has below-average accuracy. This shows that it is unsuitable to use low likelihood values as the anomaly score directly. Our method combines the advantages of DAR, using the likelihood for resampling instead of scoring directly. On the other hand, we use DAR to estimate the density of a low-dimensional discrete latent space, which avoids the "curse of dimensionality" and improves modeling efficiency.

Reconstruction Based Method
The assumption of this kind of method is that normal images can be reconstructed from the latent space better than anomalous images. Early defect detection works utilized sparse dictionaries [31,32] for reconstruction. Deep learning techniques have extended the toolbox of reconstruction-based methods, with autoencoder-like (AE-like) models being the most commonly used. Generally, AE-like methods can be summarized as (i) training an AE on only normal samples; and (ii) anomaly segmentation based on the reconstruction error of input samples, which may contain anomalies. Baur et al. [10] applied AE directly to detect pathologies in brain MR images. In particular, VAE attempts to detect anomalies from the generative perspective. Some works [8,11] assume that a VAE trained on normal images will only be able to reconstruct normal images, thus producing larger reconstruction errors in anomalous regions. Since the essential dimensionality of the data is not known, information may be lost after the bottleneck, and even normal parts may not be reconstructed correctly, resulting in false detections. Several approaches have been proposed to address this issue. Bergmann et al. [9] argued that the ℓp-distance amplifies slight reconstruction errors and replaced it with SSIM. Nair et al. [33] were the first to use an uncertainty estimate based on Monte Carlo dropout for lesion detection. They perform the dropout operation on an AE and then take the variance of the results as the anomaly score. Both Venkataramanan et al. [34] and Liu et al. [35] incorporated Grad-CAM [36] into AE, aiming to replace reconstruction errors with an attention mechanism, and achieved outstanding performance. Haselmann et al. [37] treated anomaly detection as an inpainting problem, using an AE trained on the normal dataset to regenerate patches clipped from the image. Dehaene et al. [38] argued that learning only on the normal manifold does not guarantee generalization outside of the manifold, which leads to unreliable reconstruction.
They used iterative energy-based projection to sample the normal image on the manifold closest to the anomalous input. Another line of work applies Generative Adversarial Networks (GAN): Schlegl et al. [39] used a GAN to model the distribution of normal retinal OCT images. At prediction time, the latent space is searched for a suitable latent code that makes the generator yield the normal sample closest to the anomalous input, and the ℓ1-distance is then used as the anomaly score. Soon after, Schlegl et al. [40] optimized the search process before the generator. Similarly, GANomaly [41] adds a GAN discriminator to an AE and encodes the reconstructed image, detecting anomalies in X-ray images by summing the errors of the latent code, the reconstruction error, and the adversarial loss as the anomaly score. However, GANs suffer from mode collapse, which may yield plausible samples and cause false alarms.

Method
This section describes the principles of our proposed method. The first step is to train a VQ-VAE using anomaly-free images as training data. Once the VQ-VAE can embed anomaly-free images into a compact discrete latent space and reconstruct high-quality outputs, we perform the second step. All anomaly-free images are first encoded using the trained VQ-VAE to collect a latent code set, and then the probability distribution of this latent code set is estimated using PixelSNAIL. At the prediction stage, when the latent code of an input image is out of the distribution learned in the second step, PixelSNAIL conducts resampling operations on it. The resampled latent code is decoded into a restored image, which is used for anomaly detection by calculating the error with respect to the directly reconstructed image. The overall workflow of the proposed method is shown in Figure 1; in the following, we describe it in detail.

Structuring Latent Space
VQ-VAE was originally proposed as a compression model with good results, and we use it as a reconstruction model to construct the latent space. It differs from VAE: VAE assumes that the latent space satisfies a Gaussian prior and reconstructs using the re-parameterization trick, which implies that the process of obtaining the latent representation is nondeterministic. In contrast, VQ-VAE encodes the input to the latent representation with a deterministic mapping and reconstructs data from quantized vectors. VQ-VAE provides a sufficiently large latent space, resulting in far better reconstruction performance than VAE and AE. Since the latent space is discrete, it can be modeled by deep autoregressive models, and the estimated probability distribution is the basis for the proposed resampling operation.
VQ-VAE consists of an encoder, a codebook with K embedding vectors e ∈ R^{K×D}, and a decoder. The method learns discrete latent variables z ∈ R^{M×N} using the codebook combined with a nearest-neighbor search, which is performed on the encoder output z_e ∈ R^{M×N×D} using the ℓ2-distance. Then, the vectors in the codebook replace z_e to yield z_q ∈ R^{M×N×D} as input to the decoder, where z is used as the index table for the transformation of z_e to z_q. The model optimizes the parameters of the encoder, decoder, and codebook with the goal of minimizing reconstruction errors.
Specifically, as shown in step 1 in Figure 1, given an anomaly-free input image x, the encoder first encodes x into M × N D-dimensional vectors z_e that maintain the two-dimensional spatial structure of the image. These vectors are then quantized based on their nearest distance to the embedding vectors e_1, e_2, ..., e_K in the codebook; the process can be defined by Equation (1):

z_{m,n} = argmin_k || z_e(x)_{m,n} − e_k ||_2    (1)

This quantization process is essentially a lookup, resulting in an index table z. The index table maintains the same spatial structure as z_e, and the value of each component is the sequence number of an embedding vector in the codebook. Each vector in z_e is replaced by the embedding vector nearest to it before decoding, where that vector is selected directly from the codebook based on the value of the corresponding position in z, as shown in Equation (2):

z_q(x)_{m,n} = e_{z_{m,n}}    (2)

Finally, the decoder maps the replaced vectors z_q back to the image x_d. With the quantization operation, the discrete z can represent the whole latent space, and thus changing the value of z at the prediction stage can generate the desired reconstructed images.
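To make the quantization of Equations (1) and (2) concrete, the nearest-neighbor lookup can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation; `quantize` is a hypothetical helper name.

```python
import numpy as np

def quantize(z_e, codebook):
    """Nearest-neighbor quantization of the encoder output z_e (M, N, D)
    against a codebook (K, D). Returns the index table z (M, N) and the
    quantized latent z_q (M, N, D)."""
    M, N, D = z_e.shape
    flat = z_e.reshape(-1, D)                                     # (M*N, D)
    # Squared l2 distance between every vector and every codebook entry
    d = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (M*N, K)
    z = d.argmin(axis=1)                   # Equation (1): index table lookup
    z_q = codebook[z].reshape(M, N, D)     # Equation (2): replace by e_z
    return z.reshape(M, N), z_q
```

The index table `z` is exactly what PixelSNAIL later models, while `z_q` is what the decoder consumes.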
To learn these mappings, we randomly initialize e_1, e_2, ..., e_K by sampling from the Gaussian distribution N(0, 1) before training and train the model with the loss function described in Equation (3), where the sg[·] operation represents the identity in forward propagation and has zero derivative in backward propagation:

L = || x − x_d ||_2^2 + || sg[z_e(x)] − e ||_2^2 + β || z_e(x) − sg[e] ||_2^2    (3)

The loss function has three components. The first term is the reconstruction loss, which encourages the output to be as close to the input as possible. The second term is used to constrain the embedding vectors in the codebook, minimizing the loss of information caused by replacing z_e with z_q. Before computing this term, the indices of the embedding vectors nearest to z_e are computed, and then z_q is composed based on the index lookup. To prevent z_e from fluctuating too frequently, the final term regularizes z_e. During training, z_e is required to guarantee the quality of the reconstruction, while z_q is relatively free. The weight factor β is set with the expectation of "letting z_q go closer to z_e" more than "letting z_e go closer to z_q".
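Since sg[·] only changes gradients, the forward value of the three-term loss is plain squared error. The following sketch computes that value under the stated assumptions; `vqvae_loss` is a hypothetical helper, and the stop-gradient behavior itself would come from the autodiff framework (e.g., `detach()` in PyTorch).

```python
import numpy as np

def vqvae_loss(x, x_d, z_e, z_q, beta=0.25):
    """Forward value of the VQ-VAE loss of Equation (3). sg[.] affects only
    backpropagation, so both codebook and commitment terms evaluate to the
    same MSE between z_e and z_q in the forward pass."""
    recon = ((x - x_d) ** 2).mean()       # reconstruction term
    codebook = ((z_e - z_q) ** 2).mean()  # ||sg[z_e] - e||^2: pulls e toward z_e
    commit = ((z_e - z_q) ** 2).mean()    # ||z_e - sg[e]||^2: commitment term
    return recon + codebook + beta * commit
```

With β = 0.25 < 1, the commitment term is weighted less, which matches the paper's preference of "letting z_q go closer to z_e".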

Probabilistic Modeling for Latent Space
In order to constrain the latent space, a distribution of the latent space of normal images needs to be estimated. Deep autoregressive models allow for stable, parallel, and easy-to-optimize density estimation of sequence data. Another advantage is the ability to provide data likelihood, which makes our anomaly detection possible. PixelCNN [29] was one of the first autoregressive models implemented using convolutional neural networks applied to image generation, and many improved models have been developed based on it. Here, we use PixelSNAIL [14], an enhanced version of PixelCNN that adds the Self-Attention mechanism [42] to model long-term dependencies.
To estimate the probability distribution of an image, PixelSNAIL factorizes the joint distribution of all pixels into a product of conditional probabilities via the chain rule, as shown in Equation (4). Each pixel is modeled by a conditional distribution, indicating that the current pixel value is determined by all pixels preceding it:

p(x) = ∏_{i=1}^{n} p(x_i | x_1, ..., x_{i−1})    (4)

The conditional distribution p(x_i | x_{<i}) is parameterized by a convolutional neural network and finally connected to a softmax layer to estimate the probability of 256 values. Masked convolution is used to guarantee the conditional relationship, preventing the model from reading subsequent pixels. The whole training process can be parallelized on the GPU.
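The chain-rule factorization of Equation (4) turns the joint likelihood into a sum of per-position log-probabilities. A minimal sketch, assuming the conditional distributions have already been produced by some model (a real PixelSNAIL would compute them with masked convolutions so that position i never sees positions ≥ i); `sequence_log_likelihood` is a hypothetical helper name.

```python
import numpy as np

def sequence_log_likelihood(probs, seq):
    """log p(x) = sum_i log p(x_i | x_<i) per Equation (4).
    probs[i] is the conditional distribution over values at position i
    (already conditioned on the preceding values); seq[i] is the observed
    value at position i."""
    return float(sum(np.log(p[v]) for p, v in zip(probs, seq)))
```

The same quantity, evaluated per position instead of summed, is what the detection stage later compares against the threshold η.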
When the VQ-VAE training is completed, as shown in step 2 in Figure 1, we extract the index table set {z^(1), z^(2), ..., z^(N)} by encoding all N normal images. Since any z^(i) in the index table set records only the indices of the embedding vectors in the codebook, one can treat it as a single-channel image. Each "pixel" in the index table has K possible values, depending on the size of the codebook.
The process of modeling the probability distribution of the index table by PixelSNAIL is analogous to the process for a low-resolution image as described above. The network is trained using the cross-entropy loss function with the expectation that the inference of the network will be identical to the true index. After training, the network can capture the pattern of normal images in latent space, predicting the most likely current index number based on the preceding index order.

Resampling Operation
VQ-VAE's powerful reconstruction capability implies that it also has strong generalization capability. During the prediction stage, unseen anomalous images are encoded into latent codes that deviate from the distribution of normal images, and this causes values in the index tables to change as well. Our intention is to reconstruct the normal image that most closely matches the corresponding anomalous image. "Closest" here means that only the areas where anomalies exist are restored, while the normal areas remain unchanged. Since the index arrangement of the index table determines the final latent code to be decoded, the restored image can be reconstructed by updating the values in the index table that do not satisfy the normal image pattern.
Specifically, as shown in step 3 of Figure 1, an anomalous image x̃ is pushed into the encoder network and an M × N index table z̃ is extracted. Then, the trained PixelSNAIL model infers the likelihood of each component of z̃ in parallel. In this process, PixelSNAIL estimates the conditional distribution of the current component of z̃ based on the arrangement characteristics of the preceding ones. If a component of z̃ is assigned an extremely low likelihood, the index does not conform to the pattern of normal images. We set a hyperparameter threshold η to identify these anomalous patterns; the setting of this threshold is described in detail in the experimental section. When the likelihood of the current component is less than the threshold η, we perform a resampling operation on the conditional distribution of that component, generating the new index that is most likely to occur under the semantics of the preceding normal pattern. The entire resampling process is carried out in raster order, ensuring that the preceding arrangement follows the normal images' pattern.
The resampling operation can be defined by Equation (5), where I(m, n) denotes the minimum conditional probability over the bottom half of the 8-neighborhood of z̃_{m,n}, 1 ≤ m ≤ M, 1 ≤ n ≤ N:

z̃_{m,n} ~ p(· | z̃_{<(m,n)}),  if p(z̃_{m,n} | z̃_{<(m,n)}) < η and I(m, n) < η    (5)

We assume that anomalous regions usually have local continuity and, based on this assumption, we add the local constraint I(m, n) < η to increase the stability of the resampling.
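The raster-order resampling loop can be sketched as follows. This is an illustrative sketch only: `cond_dist` is a hypothetical stand-in for PixelSNAIL's conditional distribution, and the local constraint I(m, n) < η is omitted for brevity.

```python
import numpy as np

def resample_index_table(z, cond_dist, eta, rng=None):
    """Raster-order resampling of an index table z (M, N).
    cond_dist(z, m, n) returns the conditional distribution over the K
    codebook indices at (m, n), given the (already corrected) preceding
    indices. Components whose likelihood falls below eta are redrawn from
    that distribution; everything else is kept, so normal regions remain
    intact."""
    rng = rng or np.random.default_rng(0)
    z = z.copy()
    M, N = z.shape
    for m in range(M):
        for n in range(N):
            p = cond_dist(z, m, n)                 # distribution over K indices
            if p[z[m, n]] < eta:                   # deviates from normal pattern
                z[m, n] = rng.choice(len(p), p=p)  # resample from conditional
    return z
```

Because corrected values feed back into later conditionals, the preceding arrangement always follows the normal pattern, as the method requires.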

Detection of Anomalies
Based on the indices recorded in the resampled index table, the corresponding embedding vectors can be extracted from the codebook to yield a normal quantized latent code. After that, the decoder maps this quantized latent code back to image space to reconstruct the restored image x̃_r. Reconstruction-based methods typically calculate the reconstruction error between the input and the output as the anomaly score, with higher scores representing more likely anomalies. However, we chose the ℓ2-distance between the image x̃_d reconstructed directly by VQ-VAE and the restored image x̃_r, defined in Equation (6), as the anomaly score:

s(i, j) = || x̃_d(i, j) − x̃_r(i, j) ||_2    (6)

We argue that, since the VQ-VAE model is designed for compression, the reconstruction may lose only some details while the semantics of the raw image are unchanged. In addition, anomalies are typically small in both size and proportion to the image being processed [1], and VQ-VAE has enough generalization capability to reconstruct these unseen anomalies. As the resampling operation is applied only to components with very low likelihood values in the index table, the results calculated using Equation (6) will have high anomaly scores only in certain regions, reducing the possibility of false alarms. We verify the proposed anomaly score in the experimental section, and the results show that it is better than traditional detection methods.
After getting the residual map, we apply a smoothing post-processing using bilateral filtering. The smoothed anomaly score map shows the anomaly score for each pixel, and the final segmentation can be obtained by binarization operation to identify the location of the anomaly.
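The score-and-segment step can be sketched in NumPy. This is a minimal sketch under stated assumptions: `anomaly_map` is a hypothetical helper, and the paper's bilateral-filter smoothing is omitted to keep the sketch dependency-free (in practice it would run between scoring and binarization).

```python
import numpy as np

def anomaly_map(x_d, x_r, threshold):
    """Per-pixel anomaly score per Equation (6): l2 distance over channels
    between the directly reconstructed image x_d and the restored image x_r,
    followed by binarization into a segmentation mask. Both inputs have
    shape (H, W, C)."""
    score = np.sqrt(((x_d - x_r) ** 2).sum(axis=-1))  # per-pixel l2
    seg = (score > threshold).astype(np.uint8)        # final segmentation
    return score, seg
```

Because x_d and x_r differ only where indices were resampled, the score map is near zero everywhere except the restored regions.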
Some visual examples of the above process are given in Figure 3. Since images are mapped to a unified latent space that is quantized with a discrete and fixed codebook, PixelSNAIL can easily determine whether to resample from the distribution of normal images based on a fixed threshold, after which the decoder generates the anomaly-free sample closest to the input. Figure 3c shows restored images decoded from resampled index tables. Compared with the original images in Figure 3a, one can see that the decoder regenerates only the parts that contain anomalies, not the whole image. As shown in Figure 3d, VQ-VAE can still reconstruct unseen anomalies, maintaining the structure of the original images. By comparing these two reconstructed images, it is possible to obtain the areas that are considered anomalous, as shown in Figure 3b.

Experiment
Our experiments were designed to detect anomalies in images. We used the MVTec AD dataset for evaluation. In this section, we first describe the composition of the dataset and some training settings. Secondly, the detection performance of the proposed model is verified by comparing it with other methods. Thirdly, we performed ablation experiments to demonstrate the effects of the proposed anomaly score and the local constraint.

Dataset
We evaluate the performance of the proposed method on a comprehensive natural image anomaly detection dataset: the MVTec AD dataset [15]. The MVTec AD dataset provides over 5000 high-resolution images divided into five texture and 10 object categories. Texture types cover both regular (carpet, grid) and random (leather, tile, wood) cases, while the 10 object categories contain rigid, fixed-appearance (bottle, metal nut), deformable (cable), and naturally varying (hazelnut) cases. The dataset configures a scenario in which only normal images are provided during training. For each class, the training set is composed only of anomaly-free images, and the test set consists of anomaly-free images as well as images containing 73 different types of fine-grained anomalies, such as defects on the objects' surface like scratches or dents, structural defects like distortion of object parts, or defects due to missing parts of certain objects. Pixel-precise ground truth labels are provided for each anomalous image region. The dataset overview is shown in Figure 4. The MVTec AD dataset can be downloaded at https://www.mvtec.com/company/research/datasets/mvtec-ad/.

Evaluation Metric
We use the same evaluation criteria defined by Bergmann et al. [15] to test the proposed method's performance. First, a minimum area of connected components is set for each category. Then, the method under evaluation predicts anomaly score maps on a validation set containing only normal images. After that, binary decisions are made on these anomaly score maps with incrementally increasing thresholds. When the area of the largest connected component in the binary images equals the defined minimum area, that threshold is chosen as the final binary threshold. Based on this threshold, for image-level detection we calculate the average percentage of images correctly classified as anomaly-free and anomalous. For pixel-level detection, we evaluate the per-region overlap (PRO) and the area under the ROC curve (AUROC). PRO is the relative per-region overlap of the predicted segmentation map S_p with the ground truth S_g, and it gives greater weight to connected components that contain fewer pixels:

PRO = (1/n) Σ_{k=1}^{n} |S_p ∩ s_g^k| / |s_g^k|,

where s_g^k and n are the connected components of S_g and their number, respectively. In addition, the image-level and pixel-level F1 scores are also evaluated, where the F1 score is computed as:

F1 = 2TP / (2TP + FP + FN).

We define TP as the count of images or pixels correctly classified as anomalous, FP as the count of images or pixels incorrectly classified as anomalous, and FN as the count of images or pixels incorrectly classified as normal.
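The PRO metric can be sketched directly from its definition: average, over ground-truth connected components, the fraction of each component covered by the prediction. A minimal sketch with a hand-rolled 4-connectivity labeling to stay self-contained; `connected_components` and `pro_score` are hypothetical helper names.

```python
import numpy as np
from collections import deque

def connected_components(mask):
    """4-connectivity labeling of a binary mask via BFS.
    Returns (labels, count) with labels in 1..count."""
    labels = np.zeros(mask.shape, dtype=int)
    cur = 0
    for i, j in zip(*np.nonzero(mask)):
        if labels[i, j]:
            continue
        cur += 1
        q = deque([(i, j)])
        labels[i, j] = cur
        while q:
            a, b = q.popleft()
            for da, db in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                na, nb = a + da, b + db
                if (0 <= na < mask.shape[0] and 0 <= nb < mask.shape[1]
                        and mask[na, nb] and not labels[na, nb]):
                    labels[na, nb] = cur
                    q.append((na, nb))
    return labels, cur

def pro_score(pred, gt):
    """Per-region overlap: mean over ground-truth components of
    |pred ∩ component| / |component|, so small regions weigh as much
    as large ones."""
    labels, n = connected_components(gt)
    if n == 0:
        return 1.0
    return float(np.mean([(pred & (labels == k)).sum() / (labels == k).sum()
                          for k in range(1, n + 1)]))
```

Note how a prediction that misses one tiny component loses as much PRO as one that misses a huge component, which is exactly why PRO is so demanding on segmentation quality.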

Experimental Setup
The experimental environment is a computer with Intel Xeon E5-2667 CPU, 64 GB of RAM, Nvidia 1080ti GPU, running Ubuntu 18.04, and we use the Pytorch library to implement our architecture.

Data Augmentation
To diversify the training set and make the model more generalizable, we use random transforms and rotations on the MVTec AD dataset to augment the training data. Specifically, we apply a random rotation selected from the set {0°, 90°, 180°, 270°} and a random flip to the texture categories and some object categories (bottle, hazelnut, metal nut, and screw). The remaining object categories are randomly rotated within the range [−10°, 10°]. Finally, all categories are rescaled to 256 × 256, where the texture categories are obtained by random cropping.
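The 90-degree rotation plus flip augmentation for the texture and axis-symmetric object categories can be sketched in NumPy; `augment` is a hypothetical helper name, and small-angle rotation for the remaining categories would need an interpolating rotation (e.g., from torchvision or SciPy), which is omitted here.

```python
import numpy as np

def augment(img, rng):
    """Random rotation from {0, 90, 180, 270} degrees plus a random
    horizontal flip, as applied to the texture categories and the
    bottle/hazelnut/metal nut/screw object categories."""
    img = np.rot90(img, k=int(rng.integers(0, 4)))  # random quarter-turn
    if rng.integers(0, 2):                          # flip with probability 1/2
        img = np.flip(img, axis=1)
    return np.ascontiguousarray(img)
```

Restricting rotations to multiples of 90° keeps the pixel grid intact, so no interpolation artifacts are introduced for these categories.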

Network Setup
For the VQ-VAE parameter configuration, the encoder is implemented as six convolutional layers with kernel size 4, stride 2, and padding 1, each followed by ReLU, then one convolutional layer with kernel size 3, stride 1, and padding 1, followed by eight residual blocks, each implemented as ReLU, 3 × 3 conv, ReLU, 1 × 1 conv. Images are encoded as z_e with shape 32 × 32 × 32. The decoder is a symmetric structure of the encoder using transposed convolutions. The dimensionality of the codebook and z are designed as 512 × 32 (K = 512) and 32 × 32, respectively. The weight factor β is set to 0.25. We use the ADAM optimizer with learning rate 3e-4 and train for 100 epochs with batch size 256. For the PixelSNAIL parameter configuration, the model consists of one residual block and one attention block, both repeated three times due to GPU memory limitations; it is trained for 150 epochs with batch size 64 using the ADAM optimizer with learning rate 3e-4. We set the parameters d, color sigma, and space sigma of the bilateral filter to (20, 75, 75).

Hyperparameter Setup
To determine the hyperparameter η, we build a validation set containing negative samples, shown in Figure 5, and perform a grid search on this validation set to maximize the AUROC. We randomly mask the anomaly-free images with black or gray rectangles that are 1% of the size of the original image to generate negative samples. In the experiments, we set η to 0.0005.
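The grid search over η can be sketched with a rank-based AUROC as the selection criterion. This is a minimal sketch: `auroc` and `grid_search_eta` are hypothetical helper names, and `evaluate` stands in for running the full detection pipeline on the synthetic validation set.

```python
import numpy as np

def auroc(scores, labels):
    """Rank-based AUROC: probability that a random anomalous sample scores
    higher than a random normal one (ties count half)."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    equal = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * equal) / (len(pos) * len(neg)))

def grid_search_eta(candidates, evaluate):
    """Return the candidate threshold with the highest validation AUROC.
    evaluate(eta) runs detection with that threshold and returns AUROC."""
    return max(candidates, key=evaluate)
```

Because the negative samples are synthetic (masked rectangles), no real anomalous data is needed to tune η, keeping the method normal-data-only.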

Comparison Results
We present the results computed according to the three different evaluation metrics separately and compare them with the baseline methods listed in Bergmann et al. [15] as well as with other recent related methods. Table 1 shows the average accuracy of correctly classified anomalous and normal images, which demonstrates the ability of the method in the image-level classification task. Our proposed method yields better classification results in 10 of the 15 categories, with improvements ranging from 1% to 35%. It is not as good as CAVGA-Du [34] in the pill and capsule categories because some anomalies, such as spots, resemble the normal random fine-grained pattern, which makes the distinction more difficult, but the results are still superior to the baseline methods. Tables 2 and 3 show the results of the comparison at the pixel level. Our proposed method is better than the other methods as a whole on all metrics. PRO is highly demanding on the segmentation performance of the model under test, and even small areas of mispredicted anomalous regions can reduce the value of this metric. We achieve the best results on grid, leather, and zipper, and leading results in the other seven categories as well. This shows that it is feasible to restore the anomalous portion of the image by resampling in a low-dimensional latent space. Since Venkataramanan et al. [34] do not list specific AUROC values, we selected VAE with Attention [35] for comparison. As Table 3 shows, our AUROC results were better than the second-best, AE-SSIM [9], with an 8% improvement. Benefiting from the strong reconstruction capability of VQ-VAE and the constraint of the resampling operations, the scores of the anomalous portions of the image are much larger than those of the normal portions. This leads to better results than other reconstruction-based methods, which mostly suffer from false detections due to insufficient reconstruction of the normal portions.
We further calculate the F1 score at the image level and pixel level to evaluate the classification and segmentation performance of the model. As shown in Table 4, the average result at the image level is 0.89, showing that the proposed model is basically able to correctly distinguish between anomalous and normal images. For pixel-level classification, since the number of anomalous pixels is much smaller than that of normal pixels, precision, and thus indirectly the F1 score, can be greatly affected. For example, a screw may have anomalies only in the head, and these anomalous pixels account for only 1-2% of the image. When the predicted anomaly area is enlarged a bit, the precision may drop by more than 50%, even though the prediction is intuitively acceptable in terms of pixel continuity. Although the metric assigns equal weights to the positive and negative classes, we think the proposed model gives relatively good results, with a mean value 0.17 higher than AE-L2. Furthermore, we present the distribution of anomaly scores for leather/screw/pill in Figure 6. The proposed model tends to assign high anomaly scores to anomalous pixels and low anomaly scores to normal pixels.

Figure 6. The distribution of the anomaly score for pixels. The subplots from left to right are leather, screw, and pill, where the x-axis represents the log of the anomaly score, red represents anomalous pixels, and green represents normal pixels.

Moreover, we present some visual comparison results with the baseline methods in Figure 7. These test images contain various anomalies such as missing parts, distorted or scratched parts, misalignment, etc. Our proposed method can correctly detect these different anomalies of varying sizes. In the first row, the print on the capsule is scratched, and AE fails to reconstruct these prints accurately, resulting in high anomaly scores for all prints.
The second and fifth rows show a cable with a missing part and a misplaced transistor, respectively; these two segmentation results illustrate that the proposed method can reconstruct the normal image closest to the anomalous input and thus precisely locate the anomaly. For all of these anomaly types, the segmentation results of the proposed method closely match the ground truth, while the other methods can only detect parts of them.

Ablation Experiment
To verify the validity of the proposed anomaly score and the local constraint on the resampling operation, we performed separate ablation experiments. We use the ℓ2-distance between input and output to get an anomaly score map and give comparison results with the proposed method. We also compare with the model without the local constraint. The averages of the above three metrics are presented in Figure 8. The results show that, when the binarized anomaly score map is obtained from input-output residuals, the threshold needs to be raised to overcome the high anomaly scores in normal regions caused by reconstruction noise, but this also lowers the predicted recall. The proposed method only has residuals in regions with anomalies, thus providing a higher signal-to-noise ratio and precision.

Figure 8. Results of the ablation experiment. Blue is the proposed method, orange is the condition with input-output residuals as anomaly scores, and grey is the condition without the local constraint.
With the removal of the local constraint, the resampling operation may be affected by noise, which can even affect subsequent judgments and lead to lower results. However, the model without the local constraint is still better on average than the AE model and is competitive with the state-of-the-art.

Conclusions
Due to the lack of sufficient anomaly data relative to normal data, we introduce a novel method for anomaly detection using only normal data. The work can be summarized as follows:

• Using VQ-VAE to construct a discrete latent space. The latent space distribution of normal images is then modeled using PixelSNAIL.

• During anomaly detection, discrete latent codes outside the normal distribution are resampled by PixelSNAIL. After this resampling, the index table is decoded into a restored image by the decoder. The greater the distance between the restored image and the image reconstructed directly by VQ-VAE, the more likely the region is anomalous.

• The method is evaluated on the industrial inspection dataset MVTec AD, which contains 10 object and five texture categories with 73 anomaly types. The results show that the proposed method achieves better performance compared to other methods.
Our motivation can be intuitively explained as keeping the normal portions intact while restoring the abnormal portions of images. Since the time required for resampling and inference is related to the dimensionality of the latent space, this may constrain the real-time performance of our method. In addition, it is possible to collect a small number of anomalous images in practice, and introducing this anomalous information may improve the performance of the model. Future work will focus on real-time performance and semi-supervised learning.