Hierarchical Image Transformation and Multi-Level Features for Anomaly Defect Detection

Anomalies are a set of samples that do not follow the normal behavior of the majority of data. In an industrial dataset, anomalies appear in a very small number of samples. Currently, deep learning-based models have achieved important advances in image anomaly detection. However, with general models, real-world application data consisting of non-ideal images, also known as poison images, become a challenge. When the work environment is not conducive to consistently acquiring good or ideal samples, an additional adaptive learning model is needed. In this work, we design a potential methodology to tackle poison or non-ideal images that commonly appear in industrial production lines by enhancing the existing training data. We propose Hierarchical Image Transformation and Multi-level Features (HIT-MiLF) modules for an anomaly detection network to adapt to perturbations from novelties in testing images. This approach provides a hierarchical process for image transformation during pre-processing and explores the most efficient layers of extracted features from a CNN backbone. The model generates new transformations of training samples that simulate non-ideal conditions and learns the normality in high-dimensional features before applying a Gaussian mixture model to detect anomalies in new data that it has never seen before. Our experimental results show that hierarchical transformation and multi-level feature exploration improve the baseline performance on industrial metal datasets.


Introduction
Anomalies are data that stand out amongst other data in a dataset and do not adhere to the normal behavior of the other data points. Anomaly detection thus refers to the process of detecting data that lie significantly outside the majority of the data. The detection of anomalies and deviant patterns has been an active research area, since various industries and business organizations strive to develop systems that are not only robust against deviant data but can also detect them appropriately [1]. In prior decades, inspection methods for anomaly detection in industrial production lines mainly consisted of collecting images that experts would manually review for defects. Manual quality inspection is very inefficient in terms of time and labor for a company, but modern computer vision and deep learning techniques can address these issues. The drawback is that computer vision models with deep learning require a huge amount of data. However, the limitation in the industrial world is the availability of data: collecting more images is a particularly large challenge due to safety and security reasons. Moreover, in some cases, the percentage of anomalies in the dataset is extremely low, usually less than 1%. Since anomaly images are scarce and unknown to the user, researchers are seeking solutions for modeling the unsupervised normal or anomaly-free data distribution and defining a measurement in this normal data.

•
We introduce a novel hierarchical transformation module for anomaly detection. With this approach, the anomaly detection model not only learns robust representations of normal data but also becomes more resistant to poison and non-ideal image variation.

•
We introduce a method that combines the hierarchical transformation process and multi-level feature selection for anomaly defect detection. Our method can easily be extended to the few-shot or zero-shot anomaly detection problem.

•
We demonstrate consistent gains in testing on several non-ideal image simulations and exceed the baseline performance.
Our paper is organized as follows. In Section 2, we give an overview of other works and corresponding methods. In Section 3, we present our method, provide a comprehensive workflow of hierarchical concepts, and discuss the multi-level feature process. We then analyze the experimental results on the Metal Casting (MC) dataset and MVTec Metal-nut dataset in Section 4. We finish with our conclusions in Section 5.

Related Work
In this section, we primarily discuss relevant work on anomaly defect detection with deep learning-based approaches, image transformation, and feature representation of normality and industrial anomaly detection. In recent years, deep learning-based models have shown tremendous capabilities in learning expressive representations of complex data, such as high-dimensional data, sequential data, and image data. Based on the handling of data variations, anomaly detection approaches can be classified into distribution-based methods and reconstruction-based methods.

Distribution-Based Methods
There is a tendency for anomalous data to fall into low-probability regions distributed throughout the normal data. In this scenario, distribution-based methods try to predict whether a new sample lies in a high-probability region or not. The most straightforward version of anomaly detection uses a simple statistical approach, wherein statistical techniques such as the mean, median, and quantiles can be used to detect univariate feature values in the dataset [12]. However, simple statistical rules are prone to producing more false negatives and false positives. Conventional distribution-based methods for anomaly detection, such as SVM [13,14], one-class SVM [15], and kernel density estimation [16][17][18], are fragile when dealing with high-dimensional data. The drawbacks of distribution-based techniques spawned considerably more robust methods using deep learning for anomaly detection. Anomaly detection with deep learning is a classic yet challenging task that has numerous use cases across various domains, such as fraud detection [19][20][21], cyber security [22,23], time series analysis [24], and medical applications [25][26][27]. The challenge in anomaly detection with deep learning comes mainly from the fact that the task is data-scarce by definition.

Reconstruction-Based Methods
Anomaly detection approaches can also be classified as reconstruction-based. In this method, autoencoders can learn shared patterns of normal images and restore them correctly. In [28,29], the models estimate pixel-level reconstruction errors as anomaly scores. PCA-based [30] and autoencoder-based methods [31] rely on the perceptual loss, where the models trained only on normal data cannot accurately reconstruct anomalies. Apart from autoencoders, recent models [32][33][34] have used a GAN-based architecture as a detection method. In GAN-based anomaly detection models, GAN is applied to generate samples from scratch according to training data. Given test data, GAN-based models try to find the point in a generator's latent space that generates the sample closest to the considered input. Intuitively, if the GAN is able to capture a good representation of the test image, then the image is normal, and vice versa. Other generative models [35] learn distributions of anomaly-free data and estimate the reconstruction error metrics for unseen images with anomalies. Similar to autoencoders, a major difficulty with generative-based models lies in how to regularize the generator for compactness [36][37][38].

State-of-the-Art Anomaly Detection
Numerous deep learning-based methods have emerged in anomaly detection, as discussed in the surveys [8,11,23,39]. In industrial domains, research on big data presented in [40] proposed a variational long short-term memory (LSTM) learning model for anomaly detection on reconstructed feature representations. A variation of a self-supervised pretrained model, Patch SVDD [41], proposed combining multi-scale scoring masks into the final anomaly map. In [42], the proposed deep invertible network showed that large feature representations from ImageNet [43] can be more representative for a pretrained model compared to a small, specific dataset, e.g., the public industrial MVTec dataset [44] or a medical image dataset [45,46]. Adopting the benefit of a huge ImageNet pretrained model, PaDIM [47] proposed patch distribution modeling, which uses patch embeddings from a pretrained CNN and captures the probability with a multivariate Gaussian distribution. To estimate the feature vector of each sample from pooled feature maps, [48] and some other popular models [49] use the Mahalanobis distance metric [50,51]. GAN-based anomaly detection has become a popular deep learning approach since its introduction in [52]. This approach generally aims to learn abnormal inferences using adversarial learning of representations of the samples [53,54]. GAN-based models in anomaly detection are designed as reconstruction-based methods, where, in general terms, the simplest approach is to take advantage of the reconstruction error as an anomaly score [55].
Inspired by these state-of-the-art anomaly detection works, we aim to explore the variations of normal images before distributing the new samples into the feature extractor. Image transformation in anomaly detection has been presented in [56], where geometric transformation is implemented to discriminate between many types of transformation and normal images to detect anomaly samples. Similarly, the latest geometric transformation in [57] was designed for few-shot learning. In contrast, HIT-MiLF utilizes the image generator to produce new normal samples from a pixel-wise transformation in batches and keeps the original label for new samples. In this way, we can say that HIT-MiLF is not costly in terms of labeling.

Method
In this section, we discuss the anomaly detection settings followed by the combination of hierarchical transformation and multi-level features. These modules are part of the data enhancement for the CNN to learn the invariance of normal training data. We first describe the anomaly detection setting that covers the whole structure of this work in Section 3.1. We explain the hierarchical transformation module that we add to the anomaly detection model in Section 3.2. We then explain Multi-level Features (MiLF) in Section 3.3, and end by discussing multi-level feature representation of the new generated samples.

Anomaly Detection Setting
In this paper, we consider the problem of anomaly detection, specifically in industrial images. Given a dataset D, the deep anomaly detection model aims to learn the feature representation mapping function F : D_x → D_y, where D_x is the training data (one-class normal samples) and D_y is the output prediction (Figure 1). In the testing phase, the predicted samples D_y can be represented as D_y = {D_n ∨ D_a}, where D_y contains either normal data D_n or anomaly data D_a. We adopt ResNet18 [4] as the backbone of our network, which extracts the features of D_n before exploring the feature vectors from the different blocks.
In accordance with typical anomaly detection settings, we train a network with a given sample of all-normal images D_n. In an ideal condition, D_n is trained with network M to capture high-dimensional feature vectors. In anomaly detection, the anomaly-free data distribution is commonly estimated using a multivariate Gaussian distribution N(μ, Σ), where μ is the mean and Σ is the covariance. We follow PaDIM [47] to learn the anomaly-free samples at a specific patch position (i, j), learning the normality from the set of patch embedding vectors at (i, j). At the specific position (i, j), let X_ij = {x^k_ij, k ∈ [1, N]} be the embeddings from the N normal images, from which we estimate the multivariate Gaussian distribution N(μ_ij, Σ_ij). The covariance of the normality characterization at position (i, j) is estimated as follows:

Σ_ij = (1 / (N − 1)) ∑_{k=1}^{N} (x^k_ij − μ_ij)(x^k_ij − μ_ij)^T + ε I

where the regularization term ε I makes the sample covariance matrix Σ_ij invertible. Finally, each patch position (i, j) is associated with its multivariate Gaussian parameters.
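The per-patch estimation above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation; the array layout and the `fit_patch_gaussians` helper name are our own:

```python
import numpy as np

def fit_patch_gaussians(embeddings, eps=0.01):
    """Estimate a multivariate Gaussian N(mu_ij, Sigma_ij) at every
    patch position (i, j) from N normal images.

    embeddings: array of shape (N, C, H, W) holding the patch embedding
                vectors x^k_ij for k in [1, N].
    Returns (mu, sigma) with shapes (H, W, C) and (H, W, C, C).
    """
    n, c, h, w = embeddings.shape
    # Mean embedding mu_ij at each patch position.
    mu = embeddings.mean(axis=0).transpose(1, 2, 0)        # (H, W, C)
    sigma = np.empty((h, w, c, c))
    for i in range(h):
        for j in range(w):
            x = embeddings[:, :, i, j]                     # (N, C)
            # Sample covariance plus eps*I so Sigma_ij stays invertible.
            sigma[i, j] = np.cov(x, rowvar=False) + eps * np.eye(c)
    return mu, sigma
```

The ε I term is what allows the inverse covariance to be computed later when scoring test patches.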

Hierarchical Transformation
The anomaly classification system should be trained with as many variations of the considered objects as possible. One major problem lies in industrial dataset availability: real public defect images are difficult to obtain, since anomalous images are extremely rare and, in some cases, the defects in production lines involve sensitive data that are not easy to access. In [8], the authors presented a publicly available real-world dataset. Although publicly accessible datasets exist, most of them contain sequential data. In this work, we mainly focus on industrial images and implement our approach on this type of data.
Anomaly defect detection faces a challenge in that the training data contain only one-class normal data. Our proposed HIT module consists of two main parts: the sample generator and the sample collector. These parts are assembled into one module to process all possible normal data. The output of this module is distributed to the CNN via multiple training batches.
Let T = {T_1, T_2, . . . , T_n} be a set of pixel transformations, where T_n : D → D_T(n) and D_n denotes the initial (identity) samples. The set T in the image generator is based on the intuition of pixel-level transformation properties in anomaly detection: pixel-level transformations keep the spatial structures and maintain the detailed artifacts of normal samples. In this work, the transformations T include hue-saturation shifts, noise injection, shadow effects, and brightness and contrast adjustments. In the first iteration of the hierarchical process, the original images D_n are distributed directly to batch_1 without any transformation. In the next iteration, the images of D_n are processed by the transformation module T_1, producing new transformed samples of normal images D_T1. This hierarchical process repeats over the available set T, applying each transformation to all original samples in the generator. The generator with T_(n+1) generates new samples D_T(n+1) from the combination of D_n and D_T(n−1) and passes them to the next iteration.
As shown in Figure 2, the class of new samples D T(n) will be the same as the normal image after transformation. Here, the new samples represent the normal image in different pixel-level conditions. This modification is what we want to achieve through this approach: we assume that the diversity of images from the original will unlock more informative features that represent the anomaly-free data. The new samples D T(n) will be distributed in multiple batches and directly forwarded to the CNN as new input training data.
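The hierarchical generation described above can be sketched as follows. This is a simplified NumPy illustration for grayscale images with hypothetical helper names; the paper's full set T also includes hue-saturation and shadow transformations, omitted here for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

def adjust_brightness(img, delta=30):
    """Pixel-level brightness shift; spatial structure is untouched."""
    return np.clip(img.astype(np.int16) + delta, 0, 255).astype(np.uint8)

def adjust_contrast(img, gain=1.3):
    """Stretch pixel values around the mean intensity."""
    mean = img.mean()
    return np.clip((img - mean) * gain + mean, 0, 255).astype(np.uint8)

def inject_noise(img, std=8.0):
    """Additive Gaussian noise injection."""
    return np.clip(img + rng.normal(0.0, std, img.shape), 0, 255).astype(np.uint8)

def hit_batches(normal_images, transforms):
    """Hierarchical sample generator: batch_1 holds the untouched identity
    samples; each later iteration applies T_k to everything produced so far,
    and every new sample keeps the 'normal' label (no extra labeling cost)."""
    batches = [list(normal_images)]          # batch_1: identity samples D_n
    for t in transforms:
        pool = [img for batch in batches for img in batch]
        batches.append([t(img) for img in pool])
    return batches
```

Each batch is then forwarded to the CNN as additional training data, exactly as ordinary normal samples would be.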

Learning Multi-Level Feature Representation
Compared with traditional feature extraction methods, CNN-based feature extraction is more capable of capturing feature distribution information. Moreover, this ability to extract high-level semantic information enables the model to be trained end-to-end. Various backbone networks have been used in previous work, such as [37,38,50]. In this work, we adopt ResNet18 to capture high-dimensional features of the input training images. The details of the ResNet18 block structure are presented in Figure 3 and Table 1. As shown in Figure 4, the backbone includes four blocks, and these blocks extract appearance information from the low level (block_1) and middle level (block_2 and block_3) to the high level (block_4). With the exception of the last block, each block consists of convolutional layers, a rectified linear unit (ReLU) activation function, batch normalization, and a max-pooling layer. These different blocks are fused by leading input and posterior output features to enrich the feature map. Since every feature output of these blocks can be retrieved as a high-dimensional feature vector, we exploit this advantage to collect the feature outputs from each block.
However, under complex conditions of normal samples, high-dimensional features from a deep neural network cannot fully describe the normality features of the training data. This is because there is a lack of variation among the limited anomaly-free training data. Therefore, it is crucial to enlarge the data in order to enrich the variation and strengthen the data complexity. To address this issue, apart from generating new samples in the single HIT module, we propose a combined model that jointly uses HIT and multi-level features from ResNet18 to extract variations of anomaly-free images. We utilize the multi-level features from different blocks of ResNet18 to capture the different relational and semantic features of normal samples.
As shown in Figure 4, we collect the extracted features from specific blocks and concatenate the activation vectors. The idea behind this approach is that different layers of a deep CNN encode different levels and shapes of information. Low-layer features contain more detailed information and have higher resolution; in other words, the first block of the CNN contains features encoded with less context. In the high-level blocks, however, the features encode more contextual or semantic information at low spatial resolution. Directly combining the low-level and high-level features may cause semantic ambiguity in the concatenated features due to the introduction of highly detailed information. To address this concern, we exploit the middle level, which acts as an intermediary feature representation between the low and high levels and provides transitional information. We show the effect of the feature-level block selection on the final anomaly prediction in Section 4.3.
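The multi-level concatenation can be illustrated with a minimal NumPy sketch, using random arrays as stand-ins for the ResNet18 block outputs (in practice the maps come from the CNN, and the helper names here are our own):

```python
import numpy as np

def upsample_nn(feat, out_h, out_w):
    """Nearest-neighbour upsampling of a (C, H, W) feature map."""
    c, h, w = feat.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return feat[:, rows][:, :, cols]

def concat_multilevel(low, mid, high):
    """Bring block_1 (low), block_2/3 (mid), and block_4 (high) feature maps
    to the low-level spatial resolution, then stack them along the channel
    axis so every patch position carries information from all levels."""
    _, h, w = low.shape
    parts = [low, upsample_nn(mid, h, w), upsample_nn(high, h, w)]
    return np.concatenate(parts, axis=0)  # (C_low + C_mid + C_high, H, W)
```

Resizing the coarser maps to a common resolution is what makes the per-patch embedding vectors from different levels comparable before the Gaussian estimation step.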
After multi-level feature concatenation, the embedding vectors carry information from different semantic levels. We estimate the multivariate Gaussian distribution N(μ, Σ) of the feature vectors from the three levels. In this model, we partition the input image into patches and estimate the distribution at each patch position before fitting the multivariate Gaussian. We distribute all the combined features provided by MiLF to ensure that both the D_n and D_T(n) images are treated consistently.
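At test time, each patch embedding can be scored against its fitted Gaussian. The sketch below uses the Mahalanobis distance, following the PaDIM-style scoring [47] that this work builds on; the exact scoring rule used in HIT-MiLF may differ, and the function name is ours:

```python
import numpy as np

def patch_anomaly_map(embedding, mu, sigma_inv):
    """Score a test image as the Mahalanobis distance between each patch
    embedding and its fitted Gaussian N(mu_ij, Sigma_ij).

    embedding: (C, H, W); mu: (H, W, C); sigma_inv: (H, W, C, C),
    the precomputed inverses of the regularized covariances.
    """
    c, h, w = embedding.shape
    scores = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            d = embedding[:, i, j] - mu[i, j]
            scores[i, j] = np.sqrt(d @ sigma_inv[i, j] @ d)
    return scores  # an image-level score can be taken as scores.max()
```

Patches whose embeddings lie far from the normal distribution receive high scores, which is how anomalies in unseen data are flagged.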

Experimental Setup
Dataset. In this experiment, we use the Metal Casting (MC) dataset [58] and the Metal-nut category from the MVTec dataset [44] to detect anomalous defects for visual quality inspection. Metal casting is a manufacturing process in which a material is poured into a mold that contains a hollow cavity of the desired shape. There are many types of defects in metal images, such as blow holes, mold material defects, shrinkage defects, pinholes, scratches, etc. However, the objective of this work is mainly to detect anomalous images from only the available normal images. The original MC dataset contains 1300 images of 512 × 512 pixels, with 781 defect images and 519 non-defect images. The Metal-nut dataset consists of 335 images, with 242 non-defect images and 93 defect images. To examine the model on the MC dataset, we use 500 of the 519 non-defect images, split into 400 for training and 100 for validation. We select 100 of the 781 defect images for testing across five different poison levels. In this setting, each poison level contains 400 non-defect images for training and 100 non-defect and 100 defect images for testing. For the Metal-nut dataset, we use 220 non-defect images for training. For the testing set, we use 22 non-defect and 93 defect samples to perform the poison-level tests. Sample images from our datasets are shown in Figure 5. Metrics. In a common classification model, accuracy is an acceptable metric that measures the number of correct predictions as a percentage of the total number of predictions. Accuracy as a prediction metric is suitable only when an equal distribution of classes exists in the testing set. However, in anomaly detection, we need to control the sensitivity of the model because the model may classify all testing data as anomalous even when they are not (false positives). Thus, in the field of anomaly detection, the most suitable metrics, used in many works, are the F1 score and the area under the curve (AUC) score.
The F1 score is defined as the harmonic mean of precision and sensitivity (recall) and is often useful when computing an average rate. The formula for the F1 score is as follows:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

The AUC score is the second metric in the field of anomaly detection; it measures the area underneath the receiver operating characteristic (ROC) curve. The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) at various threshold values, so the AUC integrates the classification performance between the normal image and defect image over all decision thresholds. Since the AUC represents the degree or measure of separability, it is suitable as a performance measurement in various settings. This metric indicates how well a model can distinguish between two classes. The AUC ranges in value from 0 to 1, where the higher the AUC score, the better the model is at predicting normal and anomalous samples. An illustration of a perfect AUC score is shown in Figure 6.
True Positive Rate = True Positives / (True Positives + False Negatives)    (5)

False Positive Rate = False Positives / (False Positives + True Negatives)    (6)

Figure 6. Illustration of the ideal AUC score of 1, where the false positive rate is zero and the true positive rate is one; a larger area under the curve is better.
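Both metrics above can be computed directly from counts and scores. The following self-contained sketch implements the F1 score from Equations (5)–(6) and the AUC via its rank-statistic interpretation (the probability that a random anomaly scores higher than a random normal sample); it is an illustration, not the evaluation code used in the experiments:

```python
import numpy as np

def f1_score(tp, fp, fn):
    """F1 from raw counts: harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # recall equals the true positive rate, Eq. (5)
    return 2 * precision * recall / (precision + recall)

def auc_score(labels, scores):
    """AUC as a rank statistic: P(score of anomaly > score of normal),
    with ties counted as one half."""
    labels = np.asarray(labels)
    scores = np.asarray(scores)
    pos, neg = scores[labels == 1], scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```

An AUC of 1.0 corresponds to the ideal case in Figure 6, where every anomaly outscores every normal sample.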

Implementation Details
We conducted anomaly detection experiments using two settings: an ideal or uniform sample test, and a poison and non-ideal sample test. We defined our baseline as a standard anomaly detection model that uses ideal training data and poison-free testing data. The result of our baseline is presented in Table 2. We ran our models on a computer with a single Nvidia 1080i GPU card and used a PyTorch-based framework [59]. As shown in Table 2, the anomaly detection baseline reached 0.973 (AUC score) and 0.917 (F1 score) on the MC dataset, and 0.934 (AUC score) and 0.928 (F1 score) on the Metal-nut dataset, respectively. Similar to the baseline model, we ran our model on ideal images and produced similar scores, which indicates that although our model was designed for non-ideal data, it can still be used on ideal data with an acceptable level of accuracy that is competitive with the original model. We then simulated poison test images to validate our proposed method on poison and non-ideal images, for instance, with noise injection, blurring, and image sharpening. In this setting, the testing data contain poison and non-ideal samples, with the number of poisoned samples set randomly for both the normal and anomaly classes. We rank five levels of poison samples in the testing data, from Level 1 to Level 5, where the number of poison images at each level increases by 10% relative to the original testing data. We re-ran our baseline with this approach to show how poison and non-ideal samples severely weaken the baseline.
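The poison-level simulation described above can be sketched as follows for grayscale images. This is a hypothetical illustration of the scheme (noise injection, blurring, sharpening, with each level poisoning a further 10% of the test set); the actual transformation parameters used in the experiments are not specified in the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(img, std=10.0):
    """Noise injection: additive Gaussian noise."""
    return np.clip(img + rng.normal(0, std, img.shape), 0, 255).astype(np.uint8)

def box_blur(img, k=3):
    """Blurring: mean over a k x k window (edge-padded)."""
    pad = k // 2
    padded = np.pad(img.astype(np.float64), pad, mode="edge")
    out = np.zeros(img.shape, dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return (out / (k * k)).astype(np.uint8)

def sharpen(img, amount=1.0):
    """Image sharpening via unsharp masking: add back the detail
    removed by the blur."""
    blurred = box_blur(img).astype(np.float64)
    return np.clip(img + amount * (img - blurred), 0, 255).astype(np.uint8)

def poison_testset(images, level, transforms=(add_noise, box_blur, sharpen)):
    """Replace level * 10% of the test images with poisoned versions,
    chosen at random regardless of class."""
    images = list(images)
    n_poison = int(len(images) * level * 0.10)
    idx = rng.choice(len(images), size=n_poison, replace=False)
    for i in idx:
        t = transforms[rng.integers(len(transforms))]
        images[i] = t(images[i])
    return images
```

Re-running the baseline on the output of `poison_testset` at increasing levels reproduces the kind of degradation analysed in the following experiments.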

Experiment with the HIT Module
In the first experiment, we assessed and compared the effectiveness of the single HIT module with poison and non-ideal images on the MC and Metal-nut datasets. From the experimental results in Table 3, we observed that our baseline results dropped significantly when poison and non-ideal testing data were added. This phenomenon also occurs in several subsequent experiments with an increasing number of poison and non-ideal samples across the two datasets. This indicates that the baseline model is confused by the new poison and non-ideal data, which are comparatively different from the ideal data. The poison images cause the model to fail to retain the informative features of normal images. To work around this issue, we then attach the HIT module to the baseline model while maintaining all settings and the detailed structure of our CNN model. The main goal of the HIT module is to generate new additional training samples as a means of enhancing the CNN to automatically extract more informative features from different image transformations. Table 3 presents our experimental results on poison and non-ideal testing data for both the baseline and the baseline with the additional HIT module on the MC and Metal-nut datasets. We analyze the results for every percentage level of poisoned testing data and plot the metric scores for all levels of poison samples. At all levels of poison samples, we observe that both the AUC and F1 scores gradually drop. This phenomenon appears not only in the baseline results but also with our HIT module. However, the baseline AUC score drops strikingly at Level 1 (10% poison samples). In contrast, our HIT model experienced a decrease of only 0.05 at the same level on the MC dataset. We hypothesize that poison samples strongly affect the whole feature distribution. This condition indicates that when a large number of poison samples appear in the testing data, the robustness of the model significantly decreases.
On the other hand, even though HIT experienced a similar weakening, it maintained a competitive score and consistently performed above the baseline. In Table 3, we also noticed that the F1 score of our single HIT module on the MC dataset was slightly low at the first increase in poison samples. We assume that the HIT module is not sufficiently stable with a small number of poison images.

Experiment with the MiLF Module
In the second experiment, we investigated the influence of the multi-level features on our backbone, ResNet18, and analyzed the model's performance for every poison and non-ideal level of testing data. The results in Table 4 show that both the AUC and F1 scores consistently meet or outperform the baseline. Notably, the average AUC score stays above 0.9 across all poison data levels for the MC dataset. We observed that a larger number of poison images gradually affects all metrics and methods, but our proposed method still outperforms the previous method. During the experiments, we inspected the layers by automatically selecting the best combination of feature levels. From this experiment, we learned that the multi-level implementation is needed to give a better feature representation of normality that adapts to poison samples. The experimental results in Table 4 show the scores when multi-level features are included. We utilized the same combination of three main levels of feature representation (low-level, middle-level, and high-level). However, using the highest feature level from block_4 alone was not always the best option, as doing so resulted in a low prediction score compared to the lower-level features from block_1. The benefit of the MiLF method lies in how we can combine the semantic information from the high-level (block_4) features with the high-resolution, basic information from the low-level (block_1) features, where the semantic-level information helps improve the accuracy.
Additionally, we explored the benefit of feature levels from different blocks to investigate the effect of implementing several blocks of ResNet18 on the MC and Metal-nut datasets. We ran the experiment in the normal anomaly detection setting, where all features from the different levels of ResNet18 were collected before being fed into the anomaly model. In this experiment, we manually selected the feature levels from the ResNet18 blocks and applied the anomaly detection model to each level. First, we ran the model on the high-level features (block_4) and then combined them with other, lower blocks. As shown in Table 5, combining the different levels yields higher results than using high-level features alone. This result is in line with the MiLF concept, in which we want to explore the optimal level combination from the deep feature extractor.

Experimental Results with the HIT and MiLF Combined
In the third experiment, we extended our ideas to demonstrate the effectiveness of our proposed models by combining the HIT module and MiLF into a single process and discussing several implicit factors that can influence anomaly detection. In the previous experiment (Section 4.3.1.), we briefly discussed what kind of transformations we can apply in the HIT module. In the first stage, we prepared the HIT module to produce new transformation samples and store the new sample in batches before distribution to the CNN. To perform multi-level feature selection, we used the same ResNet18 structure as in previous experiments, where the basic difference was the number of training data after adding the HIT samples. To inspect the variation of new samples, we ran the HIT + MiLF structure with ResNet18 in five different combinations according to the scalability of poison images. ResNet18 extracted the features of input images, including original samples, and generated new samples from each batch. Inside the feature extractor, we successively collected the extracted features from three different levels (low-middle-high level), where we used the highest validation score of each level before selecting a specific layer. The combination of these three layers returned better performance with this approach. As shown in Figure 7, combining HIT and multiple layers in MiLF outperformed the baseline on all poison levels. These results indicate that the new transformed samples and the optimum multi-level features can effectively improve normality learning from high-dimensional normal features. From these results, we notice that selecting the feature level from a single block of ResNet18 does not change the results significantly. However, the major effect of the combination multi-level model is to make the features from different levels uniform in resolution and dimensionality.

As shown in Figure 7, the red line represents our baseline scores, the gray is the baseline with the HIT module, the orange is the baseline model with MiLF, and the green line represents the combination module. The results clearly show that each proposed model suffers from a decrease in performance as the poison data increase. However, what we perceive from these experiments is that the diversity of normal images makes the features more diverse, which is useful for capturing anomalies as outliers of the sample distribution. This phenomenon leverages the final distribution mapping onto the Gaussian models; as a result, our combination model is resistant to poison data and can maintain competitive results. Overall, these experiments prove that our proposed model can handle not only high-level features but poison images as well. An additional interesting aspect of our proposed idea is that the combined structure is relatively light and easy to implement for all ResNet variants.
In Table 6, we show the experimental results for anomaly detection on the MC and Metal-nut datasets for various existing anomaly detection methods. We re-implemented the same ResNet18 backbone as the feature extractor to compare the benefit of adopting the hierarchical transformation and multi-level feature combination. We notice that probabilistic modeling-based methods outperform our approach at low levels of poison samples. However, the scores of the existing methods drop as the number of poison samples increases. The results presented in Table 6 show that our combination approach is relatively consistent compared with two popular modeling-based methods across all poison levels. This confirms that hierarchical image transformation and multi-level layer selection for features are crucial to handling poison samples in anomaly detection.

Limitations
Here, we want to discuss some limitations of this approach based on our experiments. Since the HIT module is designed to produce new samples in a hierarchical way, it increases the number of iterations and consumes more time; this cost is heavily dependent on the number of original images. In a scenario where the normal training images are extremely limited, the search process easily stagnates when searching for variations: we found that HIT baselines with various transformation methods quickly saturate when the original images are extremely limited.

Conclusions
In this empirical study, we have seen that informative features of normality create a strong foundation for an anomaly detection model to detect anomalous samples. Robust training data are important for teaching an anomaly model to be more sensitive to any perturbations from unseen samples. With hierarchically transformed samples, the CNN backbone is able to extract more informative features that yield a surprisingly better representation of normality. From the multi-level feature combination, we observe that combining low- and high-level features with the help of the middle level can produce very competitive scores compared to a non-hierarchical model. This leads to the conclusion that a fully trained model combining hierarchical and multi-level features can be made robust to random poison images. Since this approach is lightweight, our proposed model can be implemented on production lines.
Future work. Image transformation and multi-level feature combination pave the way to numerous extensions. In future work, we will study the application of our approach to other domains, e.g., medical images, natural images, or non-industrial datasets, where the anomalous data scarcity remains the bottleneck.