Fast and Efficient Image Novelty Detection Based on Mean-Shifts

Image novelty detection is a recurring task in computer vision and describes the detection of anomalous images based on a training dataset consisting solely of normal reference data. It has been found that, in particular, neural networks are well-suited for the task. Our approach first transforms the training and test images into ensembles of patches, which enables the assessment of mean-shifts between normal data and outliers. As mean-shifts are only detectable when the outlier ensemble and inlier distribution are spatially separate from each other, a rich feature space, such as a pre-trained neural network, needs to be chosen to represent the extracted patches. For mean-shift estimation, the Hotelling T2 test is used. The size of the patches turned out to be a crucial hyperparameter that needs additional domain knowledge about the spatial size of the expected anomalies (local vs. global). This also affects model selection and the chosen feature space, as commonly used Convolutional Neural Networks or Vision Image Transformers have very different receptive field sizes. To showcase the state-of-the-art capabilities of our approach, we compare results with classical and deep learning methods on the popular CIFAR-10 dataset, and demonstrate its real-world applicability in a large-scale industrial inspection scenario using the MVTec dataset. Because of the inexpensive design, our method can be implemented by a single additional 2D-convolution and pooling layer, which allows particularly fast prediction times while being very data-efficient.


Introduction
The ability to detect unusual patterns in images is an important capability of the human vision system. Humans can differentiate between expected variance in the data and outliers after having only seen examples of normal instances. In this work, we address the computer vision approach to this problem, usually known as image novelty detection. Novelty detection is related to outlier detection in the sense that both methods try to detect anomalies. However, while the latter is totally unsupervised, novelty detection has access to a training dataset consisting of clean normal reference data and, hence, is an instance of weakly-supervised learning. The output of such an algorithm is a scoring function (anomaly score) that can be used to grade test data from inlier (normal) to outlier (novel) (e.g., [1]). Since the anomaly score is computed for a single input example, it can also be used for binary classification tasks. The major difficulty of such a model is that the decision boundary is not robust against overlap between the inlier and outlier distributions. This motivates the main idea of the ensemble approach to novelty detection: representing both training and test images as ensembles of image patches [2]. Instead of scoring a single test example with respect to the normal distribution, the ensemble approach first transforms the test example into an ensemble of patches and checks the test and training ensemble against each other, which improves the robustness of the decision process. There is a wide range of methods for testing if two samples originate from the same distribution. Here, we follow our previous work [2] and use the Hotelling T2 test [3] for assessing the mean-shift. We distinguish two types of novelty:
• Global novelty is spread across the entire image, e.g., when separating dog images from cat images.
• Local novelty appears only in some parts of the image whereas the other parts of the image are totally normal, e.g., detecting tiny manufacturing defects in industrial visual inspection systems.
Figure 1. Detection of (a) locally concentrated novelty and localization on the MVTec dataset and (b) global anomalies on the CIFAR-10 dataset. All shown examples are from the outlier test set and the red color highlights the location of the novelty in the images. The overlayed red score map is computed using the µshift anomaly score (cf. Equation (11)) without applying the spatial max-operator, such that the model output is a 2D grid of anomaly scores. These scores are mapped to a red heat map and resized to match the input resolution using bilinear interpolation. Hence, red areas correspond to potentially anomalous regions.
We identified the following practical principles for successful image novelty detection using mean-shifts: (1) First, as anomalies mostly consist of patterns not available in the normal class, a rich feature space, such as a pre-trained neural network, needs to be used. (2) No dimension reduction based on the inlier data should be applied, as the inlier data occupies only a small portion of the feature space, and projecting onto its subspaces causes the anomalies to overlap with the normal data. Additionally, (3), the spatial size of the expected anomalies needs to be correctly expressed in terms of the hyperparameters, i.e., patch size and local mean-shift region, as a small local novelty cannot influence the mean-shift sufficiently in too large averaging areas, mainly because the inlier and outlier distributions then differ only in an insufficiently small region.

Contributions
In this work, we propose an inexpensive algorithm (https://github.com/matherm/deep-mean-shift, 4 September 2022) based on the Hotelling T2 test for image novelty detection that is stacked on top of a standard pre-trained neural network, such as EfficientNet [5] or the Vision Image Transformer (ViT) [6]. Using an upstream pre-trained neural network induces a rich feature space with a diverse set of pre-learnt patterns and accommodates the previously mentioned principle (1). In contrast to our previous work [2], we follow principle (2) and use a full-rank covariance matrix for modelling the neural network features instead of relying on a compressed low-rank approximation, which improves performance significantly. Further, to fulfil principle (3), we generalize the existing ensemble approach to novelty localization and add a hyperparameter that controls the expected spatial size of the anomalies, which has a strong impact on overall performance in practical applications. We show in extensive experiments that our approach not only achieves results comparable to existing state-of-the-art approaches, but is also applicable to a large-scale industrial inspection scenario. Further, due to its simple architecture, the model has faster prediction times than existing approaches. Lastly, because we only need to estimate the mean of the training dataset, our method is very data-efficient and reaches 90% AUC with only 10 non-defective examples of the MVTec dataset [7]. Figure 1 shows examples from the evaluated datasets.

Related Work
The use of limited supervision for image classification has been studied extensively [8,9]. Some approaches (e.g., [10]) consider the unbalanced setting where a small number of anomalous examples is given, but many examples are given from the normal class. However, these approaches use additional supervision that is not used in our method. Our work relates more closely to anomaly detection approaches that use limited or weak supervision [1]. During training, we only use examples from the normal class and, therefore, consider our method an instance of novelty detection, a semi-supervised version of anomaly detection, sometimes also referred to as one-class classification [11]. There are different approaches to the problem in general and we, therefore, group the related methods into the categories reconstruction-, classification-, distribution-based, and self-supervised methods.
Reconstruction-based methods. These methods derive a data-driven encoder and decoder from the reference data and expect the anomalous data to have a higher reconstruction error compared to normal data. However, such models are mostly based on unconstrained compression and, therefore, often overlook novel patterns, resulting in poor performance in practice [12].
Classification-based methods. These methods attempt to model a discriminating hyperplane between data regions of normal data and those of anomalous data [13], without necessarily using compression. Such methods often perform well in practice. However, their main limitation arises from the fact that the hyperplane can only be estimated accurately in regions occupied by the training examples [14]. The recently proposed Mahalanobis method [12] mitigates the problem by inverting the estimation process and scoring with the null space of a pre-trained neural network feature space instead.
Distribution-based methods. These methods are another branch of novelty detection that model the distribution of the normal data. Such methods are often built around autoencoders [15] or normalizing flows [16]. However, it has been argued and empirically found that distribution-based methods that fit a flexible parametric distribution with the maximum likelihood objective may not be well-suited for detecting out-of-distribution data [17].
Self-supervised methods. These methods try to improve distribution-based methods by replacing the data likelihood with a proxy classification objective, such that classifying normal data based on that objective allows for a good separation of normal and anomalous data. These techniques are related to non-linear Independent Component Analysis (ICA) using an auxiliary variable, such as a time segment, a generalized non-stationary variable, or synthetic labels [18,19]. A successful application of this theory to images is to predict image rotations [20,21]. The proxy objective is given by first rotating the image by an arbitrary angle and then trying to predict that angle using a deep convolutional neural network. However, this strategy only works well for aligned objects with a natural orientation, where the rotation dependence is strong enough to learn a good rotation predictor.
In the literature, there are mostly specialized algorithms for either global or local novelty detection and, hence, different methods are superior within each scenario. For global novelty, in particular rotation prediction [20], Deep SVDD [13], and the Deep Robust One Class Classifier (DROCC) [14] excel. The rotation prediction method is a self-supervised scheme that solves a proxy classification problem for feature learning and uses a softmax-based anomaly score. While Deep SVDD is a deep learning-based version of the Support Vector Data Description, the DROCC method uses a nearest neighbor approach on pre-trained neural network features. For local defect detection, a recent method named PatchCore [22] achieves almost total recall on the MVTec challenge [7] using a greedy algorithm for dataset reduction based on coreset theory [23]. It is also based on an underlying nearest neighbor search in the feature space of a pre-trained EfficientNet-B4, but uses a modified distance as anomaly score. CutPaste [24] is a self-supervised method specially designed for local novelties. It is similar to rotation prediction [20], as it also solves a self-supervised surrogate classification problem. However, instead of predicting the rotation of the input example, it cuts out small patches and pastes them to another image location to create the contrastive dataset. The anomaly score is the class probability of being an altered image.
Our work is most related to recent works that model the internal distribution of images [16]. However, unlike these approaches, our model benefits from estimating only low-order cumulants of image patches, i.e., mean and covariance, instead of a parametric model of the full density, which is a simpler task in general. By choosing non-linear basis functions for representing the patches, here a pre-trained deep neural network, our method adds a relatively small computational overhead compared to the method based on raw pixel mean-shifts [2], but improves the detection and localization performance effectively.
There are two similar approaches, named PatchSVDD [25] and PaDiM [26], that we want to briefly relate to ours. PatchSVDD optimizes a deep spherical embedding for extracted image patches. As this is based solely on the reference data, it implicitly reduces dimensionality, and novelties with orthogonal patterns are projected onto the null space of the normal distribution. This decreases the performance of the method compared to our approach, which does not involve any dimension reduction. PaDiM is similar to ours and also computes full-rank Mahalanobis distances. However, it does not benefit from extracted patch ensembles, which turned out to be performance-critical in our tests. Therefore, it is a special case of ours where the size of the extracted patch ensemble is one and the covariance matrix is not shared across locations.

Method
The central part of our algorithm is the µshift(x) anomaly score [2] that is based on the Hotelling T2 test [3]. This test is formulated on the basis of two samples from two distributions and measures the mean-shift between them. It is, therefore, a multivariate extension of the well-known Student's t-test. Here, we use ensembles of image patches as samples and measure their mean-shift in some specified feature space Φ. In this section, we first describe the data representation and then the mean-shift detection in its classical form.
On a high level, the first step is the extraction of patches {I_0(s), . . . , I_N(s)} from the normal training images {I_0, . . . , I_N}. These patches are transformed by a feature map Φ, typically parametrized by a pre-trained neural network. For indexing the ensemble where necessary, we introduce the indexing variable s. Using this notation, a single extracted patch of the i-th example in feature space is denoted by x_i and the corresponding patch ensemble by x_i(s). Based on all available transformed training patches X, the required statistics, i.e., mean vector and covariance matrix, are computed. For a given test image I*, the same preparatory steps are applied: we extract an ensemble of patches I*(s) and compute the features x*(s). We then evaluate the mean-shift of the test example by comparing the test ensemble mean µ(x*(s)) with the mean of the entire training dataset µ(X).

Data Representation
The input examples I_i ∈ [0, 1]^(3×H×W), with i = 1, . . . , N, are square-sized RGB images, i.e., H = W. The distinctive property of our algorithm is to generate patch ensembles instead of processing the full image. For patch extraction, we tested several sampling strategies without noticing performance-critical differences. Therefore, we extract all valid patches of size R inside the image. The term valid is used in accordance with the neural network literature and means that all extracted patches must be entirely contained within the image borders. This is equivalent to cropping patches by a sliding window without applying image padding or crossing the border. As the input images are potentially large, the horizontal and vertical stride τ of the sliding window allows limiting the total number of cropped image patches S. We fix this parameter to τ = 2 for small images and τ = 16 for larger ones. Hence, the maximum number S of distinct image patches per input image depends only on the size of the image and the patch size R, i.e., the larger the image relative to the patch size, the more patches can be extracted. We do not apply any pre-processing and compute a feature representation Φ for the extracted patches I_i(s) using a pre-trained neural network, given by

x_i(s) = Φ(I_i(s)) ∈ R^D, (1)

where D is the number of features after flattening the computed feature map. Flattening is needed since some feature maps Φ, e.g., Convolutional Neural Networks (CNN), retain the spatial dimensions of the input patches. We organize the flattened feature vectors of all available extracted normal training patches in a long concatenated design matrix X ∈ R^(NS×D).
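The valid sliding-window extraction described above can be sketched in a few lines of NumPy; the `extract_patches` helper and its loop-based form are illustrative, not the released implementation:

```python
import numpy as np

def extract_patches(image, R, tau):
    """Extract all 'valid' R x R patches from a C x H x W image with stride tau.

    Valid means every patch lies entirely inside the image borders
    (sliding window, no padding)."""
    C, H, W = image.shape
    patches = []
    for y in range(0, H - R + 1, tau):
        for x in range(0, W - R + 1, tau):
            patches.append(image[:, y:y + R, x:x + R])
    return np.stack(patches)  # shape: (S, C, R, R)

# Toy example: a 3x32x32 image, patch size R = 16, stride tau = 2
img = np.random.rand(3, 32, 32)
P = extract_patches(img, R=16, tau=2)
# (32 - 16) / 2 + 1 = 9 window positions per axis -> S = 81 patches
assert P.shape == (81, 3, 16, 16)
```

For R = H (the full image), a single patch is extracted, which recovers the special case discussed later.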

Mean-Shift Detection
We follow [2] and also perform mean-shift detection with the Hotelling T2 test [3]. Since this test is a generalization of Student's t-test, it estimates the significance of mean-shifts between two populations. In this section, we introduce the Hotelling T2 test with its required statistics in the original form. In the second part of the paper, in Section 4, we derive a generalized version that is able to smoothly transition between global and local population mean-shifts. We discuss relevant hyperparameters that are needed for model selection in Sections 3.4 and 4.1.
For detecting anomalies, we first compute the feature-wise mean over all extracted and flattened feature maps of the training dataset X,

µ = (1/(NS)) Σ_i Σ_s x_i(s). (2)

Note that µ has the same dimension as x_i. We then compare this reference mean with the mean of the transformed patches x*(s) extracted from a single test example I*,

µ* = (1/S) Σ_s x*(s). (3)

Given the two estimated mean vectors µ* and µ, the unnormalized Hotelling T2 test statistic for a dependent test sample is computed by

T̃2(x*(s)) = (µ* − µ)^T Σ^(−1) (µ* − µ), (4)

where

Σ = (1/(NS − 1)) Σ_i Σ_s (x_i(s) − µ)(x_i(s) − µ)^T (5)

is the empirical covariance matrix of the training dataset X. There is an intuitive geometric interpretation of the T̃2 statistic: it is the squared Mahalanobis distance [27] between the two estimated mean vectors. For completeness, we want to highlight that we discarded the constant normalization factor NS²/(NS + S) that appears in the original formula and, hence, denote our unnormalized version of the statistic by T̃2 instead.
In principle, there are several options for defining the mean µ of the reference data, e.g., by clustering or partitioning. Here, we choose the simplest option and compute the feature-wise mean over all patches of the training examples. This gives a single µ-vector for the entire dataset as denoted in Equation (2). A naive global anomaly score is simply defined as the unnormalized T̃2 test statistic over the entire image, given by

score(I*) = T̃2(x*(s)). (6)

The entire pipeline of this global method is illustrated at the top of Figure 2. For an illustration of the mean-shift, Figure 3 shows a scatter plot of test examples in feature space.
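A minimal sketch of the global score, assuming the patch features have already been computed; `t2_score` is a hypothetical helper, and synthetic Gaussian features stand in for the neural network feature space:

```python
import numpy as np

def t2_score(test_patches, mu_train, cov_inv):
    """Unnormalized Hotelling T~2: squared Mahalanobis distance between
    the test ensemble mean and the training mean."""
    mu_star = test_patches.mean(axis=0)          # ensemble mean of the test image
    d = mu_star - mu_train
    return float(d @ cov_inv @ d)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))                    # N*S training patches in feature space
mu = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))

inlier = rng.normal(size=(64, 8))                # test ensemble from the same distribution
outlier = rng.normal(loc=2.0, size=(64, 8))      # mean-shifted test ensemble
assert t2_score(outlier, mu, cov_inv) > t2_score(inlier, mu, cov_inv)
```

Because the score compares ensemble means rather than single points, even moderate shifts become clearly detectable once averaged over many patches.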
Figure 2. Schematic illustration of our mean-shift method. (a) First, the feature map Φ is computed per extracted image patch, then the mean statistics is computed over all training patches. Together with the empirical covariance matrix of the normal data, the Hotelling T2 test is used as anomaly score. (b) The local mean-shift variant applies the global mean-shift method to a local region A of the image, and, hence, yields a field of mean vectors µ(x, y). The final score is computed by taking the maximum over the field of local scores. The covariance matrix is shared across all local regions.

Covariance Shrinkage
The empirical covariance matrix in Equation (5) cannot be robustly estimated for high dimensional data, as most of the eigenvalues are close to zero and, hence, the estimates are very unstable. This is especially an issue for small datasets, where the number of patches is equal to or smaller than the covariance matrix dimension D, but is potentially also a problem for highly redundant datasets that occupy only a small subspace. To mitigate this, we use the Ledoit-Wolf shrinkage estimator [4]. This estimator is given by a convex combination of a scaled identity matrix and the empirical covariance matrix,

Σ̃ = (1 − α) Σ + α (tr(Σ)/D) I, (7)

where the so-called shrinkage factor α ∈ [0, 1] is given analytically by minimizing the quadratic loss between the true and estimated covariance matrix. The exact formula for α is a bit cumbersome; we, therefore, refer the reader to Equation (5) in the original paper [4]. Loosely speaking, the shrinkage factor α is an analytic function of the empirical covariance matrix and the number of data points. A useful property of the estimator is that the shrinkage factor is close to one for small numbers of data points and reduces to zero with increasing dataset size. Therefore, in the limit of an infinite number of data points, the shrunk covariance matrix converges to the true covariance matrix. Our experiments show that the chosen estimator is crucial and responsible for almost 5% of the overall performance. Figure 4 shows the impact of the shrinkage factor on novelty detection performance for different values of α and different dataset sizes. For the experiment, we used 64 × 64 patches and the EfficientNet-B4 feature space, which has dimensionality D = 6800 after the flattening operation (cf. Equation (1)). Therefore, the estimation of a covariance matrix with shape 6800 × 6800 is required.
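The effect of shrinkage can be illustrated with a few lines of NumPy; here α is set by hand purely for illustration, whereas the Ledoit-Wolf estimator (e.g., `sklearn.covariance.LedoitWolf`) computes it analytically:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 300                      # fewer samples than dimensions
X = rng.normal(size=(N, D))

emp = np.cov(X, rowvar=False)        # rank <= N-1 < D, hence singular
target = np.trace(emp) / D * np.eye(D)

alpha = 0.1                          # illustrative value; Ledoit-Wolf derives it analytically
shrunk = (1 - alpha) * emp + alpha * target

# The empirical matrix is not invertible, the shrunk one is positive definite
assert np.linalg.matrix_rank(emp) < D
assert np.linalg.eigvalsh(shrunk).min() > 0
```

The scaled-identity target lifts all eigenvalues away from zero, which is exactly what makes the Mahalanobis distance in Equation (4) computable in the full-rank setting.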
As already noted, outliers are projected onto the subspace spanned by the normal data, which makes a robust estimation of the full covariance matrix necessary, without the possibility of using low-rank approximations [12]. This is especially important when the expected anomalies are small and characterized by patterns that are not present in the training set.

Hyperparameter Selection
There are two main hyperparameters that need to be set for the global method: first, selecting the feature space Φ, and second, choosing the patch size R. In the following, we first discuss the feature spaces and how the chosen neural network architectures impact the model. We consider feature maps Φ_L of standard pre-trained architectures, such as VGG-19 [29], where L indicates the feature block (layer) of the deep neural network. The commonly used block convention wraps several adjacent layers of a deep neural network into blocks, such that the architectures become handier and easier to compare (e.g., [29]). We follow this convention and the hyperparameter L indexes entire blocks of the architecture. The superscript indicates the used network architecture, e.g., Φ^vgg_3 for the third feature block of VGG-19. Additionally, we analyzed the recently presented Vision Image Transformer (ViT) Φ^vit_L [6]. The required pre-training of the networks is always done using the well-known ImageNet dataset, where all architectures reach a test set accuracy of around 85%. Note that we concatenated two adjacent layers [L, L + 1] of EfficientNet-B4, as the layers are relatively low-dimensional, which improves the performance slightly (see, e.g., [22]). Due to the pooling layers, the spatial resolution of the feature map decreases with the depth of the network, i.e., deeper layers have lower spatial resolution. To enable channel-wise concatenation of differently sized feature maps in the first place, we match the spatial resolutions by bilinearly resizing the smaller downstream feature map to the size of its larger predecessor feature map.

Patch Size R
The most crucial hyperparameter is the chosen image patch size R. Generally, the deeper the convolutional neural network, the smaller the spatial resolution of the resulting feature map, which is mainly caused by 2D-pooling operations [30]. While the receptive field grows, more global information is carried by the pixels of the feature maps. Importantly, for successful feature extraction, one needs to choose a layer that retains enough spatial resolution for the problem at hand and an appropriate receptive field to capture the anomalies. Hence, to gain sufficiently separated inlier and outlier distributions, the layer selection L and patch size R depend on the size of the input images and the expected size of the anomalies. Note that the Vision Image Transformer (ViT) differs, as its receptive field size is implicitly learnt by the model, is in principle equal to the size of the entire input, and is hence independent of the layer L. We did not notice a performance-critical impact of the stride parameter and keep it fixed to τ = 2 for smaller inputs and τ = 16 for larger ones.

Evaluating Global Novelty Detection
We compare our method with methods that particularly excel in global novelty detection. As a baseline, we use the well-known OC-SVM [31] with an RBF-kernel and flattened CNN feature maps. We reproduced all experiments by either using implementations provided by the authors or re-implementing the models using the available information and hyperparameters. For testing global novelty detection, we use the CIFAR-10 dataset [32], which consists of 32 × 32 RGB images, and test the methods in a one-vs-all procedure. This means we use the 5000 available training examples of a single class as the normal class and afterwards classify the entire test dataset, consisting of 10 classes with 1000 examples each. We use the area under the ROC curve (AUC) as performance measure (see, e.g., [12,20]). The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) at various thresholds and hence measures the overall discrimination performance of a binary classifier. In terms of hyperparameters for µshift, we selected θ = {L = 5, R = 32, τ = 2}. This is the maximum patch size possible and a special case of the method. However, it is also optimal for the chosen scenario: changing the parameter L decreases the AUC significantly for most of the network architectures (see Figure 5). The same applies to reducing the patch size R, which we verified in Figure 6.

Figure 6. Varying the patch size R and the averaging region A impacts the detection rate significantly. Tests were conducted using the critical pill, capsule, and screw classes from the MVTec dataset, and the entire CIFAR-10 dataset. The MVTec tests used the EfficientNet-B4 architecture. For CIFAR-10, we tested several popular architectures.

Table 1 shows the results averaged across five folds of cross-validation using varying training and test splits. It is interesting to note how the different CNN architectures have a strong influence on the performance and how the Vision Image Transformer (ViT), with its large receptive field, is able to separate the inliers from the outliers almost entirely. Using EfficientNet-B4 is particularly problematic in the special case of global novelties, as it possesses the smallest receptive field among the tested architectures and is not able to capture the entire image context in a single feature variable. We also evaluated the original mean-shift method [2] based on raw pixel values, i.e., without using pre-trained features or transfer-learning. For completeness, we report its low average AUC of only 0.67 in Figure 6. This lack of performance emphasizes the requirement for a rich feature space, such that the inlier distribution does not overlap with novelties through its null space, which would cause a large blind spot for novelty detection. Such an overlap happens naturally when the anomalous patterns are projected onto the subspace of the normal data and the corresponding features are not present in the given training data. Consequently, anomalous patterns cannot be detected, as they are mapped to the null space. A rich feature space with a diverse set of pre-learnt patterns mitigates that effect.
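The AUC measure used throughout the experiments can be computed from the Mann-Whitney rank statistic; the following `auc` helper is an illustrative sketch (it ignores score ties), not the evaluation code used in the paper:

```python
import numpy as np

def auc(scores_in, scores_out):
    """Area under the ROC curve via the rank statistic: the probability
    that a random outlier receives a higher anomaly score than a random
    inlier. Ties are not handled in this sketch."""
    s = np.concatenate([scores_in, scores_out])
    labels = np.concatenate([np.zeros(len(scores_in)), np.ones(len(scores_out))])
    order = np.argsort(s)
    ranks = np.empty_like(order, dtype=float)
    ranks[order] = np.arange(1, len(s) + 1)
    n_out, n_in = len(scores_out), len(scores_in)
    # Mann-Whitney U statistic of the outlier group, normalized to [0, 1]
    return (ranks[labels == 1].sum() - n_out * (n_out + 1) / 2) / (n_out * n_in)

# Perfect separation: every outlier scores above every inlier -> AUC = 1.0
assert auc(np.array([0.1, 0.2, 0.3]), np.array([0.8, 0.9])) == 1.0
```

An AUC of 0.5 corresponds to random guessing, which is why the raw-pixel baseline of 0.67 indicates only weak separation.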
With RotationNet, we could achieve the reported AUC of 0.86 only when the internal network was pre-trained on ImageNet [20], but not when it was initialized randomly. Despite this, for general unsupervised feature learning, the method remains extremely powerful on CIFAR-10.

Local Anomaly Score
Due to global averaging, the naive global mean-shift anomaly score [2] is not flexible enough to localize anomalies properly. To improve this, a generalization of the mean-shift detection capabilities to local mean-shifts is required and, hence, a modified test statistic. To this end, we define a local version by computing an entire field of µ-vectors uniformly distributed across the image instead of a single vector.
As shown on the right-hand side of Figure 2, this is equivalent to computing the global anomaly score only for local parts of the image with a shared covariance matrix across all locations. To leverage the spatial structure, we organize the S extracted patches x(s) as a √S × √S feature map x̄(x, y), where the positions (x, y) correspond to their relative locations in the input image, i.e., the order of the patches and their relative spatial position is unchanged. Note that S is a square number, because we assume that the input images are square-sized RGB images, i.e., H = W. Next, we compute the µ-vectors by averaging across a local neighborhood whose size is given by A and which covers ρ × ρ cells of the feature map. The resulting field of local mean vectors for a test example is

µ*(x, y) = (1/ρ²) Σ_(u,v) ∈ N(x,y) x̄*(u, v), (8)

where N(x, y) denotes the ρ × ρ neighborhood of position (x, y). As in the global case, the µ(x, y) of the training data is computed by averaging over all available training examples,

µ(x, y) = (1/N) Σ_i x̃_i(x, y), (9)

where x̃_i is the pooled feature map of Equation (8) and x̄_i(x, y) ∈ R^(D×√S×√S) is the reshaped version of x_i(s) ∈ R^(S×D). The generalized anomaly score is then computed by taking the maximum over the field of local mean-shifts, given by

T̃2(x, y) = (µ*(x, y) − µ(x, y))^T Σ^(−1) (µ*(x, y) − µ(x, y)), (10)

µshift(I*) = max_(x,y) T̃2(x, y). (11)

Note that the covariance matrix Σ is exactly the same as in the global case and only the mean estimates are computed differently. In fact, for ρ = √S, the global case appears as a special case. A second special case appears when R = H and, hence, ρ = 1. Here, the extracted patches represent entire images.
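A naive loop-based sketch of the local score may help; `local_mushift`, the grid size G, and the pooling window `rho` are illustrative stand-ins for the actual pooled implementation, and the training mean field and precision matrix are idealized:

```python
import numpy as np

def local_mushift(feat_test, mu_field, cov_inv, rho):
    """Local mean-shift score: average the D x G x G test feature map over
    rho x rho windows and return the maximum squared Mahalanobis distance
    to the corresponding training mean vector (shared precision matrix)."""
    D, G, _ = feat_test.shape
    best = 0.0
    for y in range(G - rho + 1):
        for x in range(G - rho + 1):
            mu_star = feat_test[:, y:y + rho, x:x + rho].mean(axis=(1, 2))
            d = mu_star - mu_field[:, y, x]
            best = max(best, float(d @ cov_inv @ d))
    return best

rng = np.random.default_rng(0)
D, G, rho = 4, 8, 3
mu_field = np.zeros((D, G - rho + 1, G - rho + 1))   # idealized training mean field
cov_inv = np.eye(D)                                  # shared (identity) precision matrix

clean = rng.normal(size=(D, G, G)) * 0.1
defect = clean.copy()
defect[:, 2:4, 2:4] += 3.0                           # small, locally concentrated anomaly
assert local_mushift(defect, mu_field, cov_inv, rho) > local_mushift(clean, mu_field, cov_inv, rho)
```

Because the averaging window is small, the local anomaly dominates the window mean instead of being diluted by the rest of the image, which is the core motivation for the local variant.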

Local Mean-Shift Region A
Generally, we differentiate between two extreme cases for novelty detection in this work: (1) global novelty and (2) local novelty. To cover both cases, we found that it is important to first adjust the patch size R to match the desired anomaly fraction of the given input resolution, such that the anomaly falls into the receptive field of the computed feature map. In simple terms, for globally distributed novelties, select a large patch size; for local anomalies, a small one. As already noticed by others, the EfficientNet-B4 works very well for local novelty detection [12], whereas the VGG-19 is better suited for the global case [20]. We argue that the main reason for this is that the EfficientNet-B4 uses almost solely 1 × 1 convolutions, which retain less blurry, more local features. To validate this assumption, we computed the receptive field sizes for selected CNN architectures. Table 2 shows the results. Note that the receptive field can be computed by varying the input size R until the feature map Φ_L of the desired block L has size 1 × 1. (Note that for some architectures, analytical formulas are available, e.g., [33].) The stride of the receptive field τ is then the remaining input size H divided by the remaining feature map size H̃_L, given by

τ = H / H̃_L. (12)

It can be seen that the EfficientNet-B4 has a significantly smaller and non-overlapping receptive field compared to, e.g., the VGG-19.
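The receptive field and stride of a stack of convolution and pooling layers can also be obtained with the standard analytic recurrence (cf. [33]); the sketch below assumes simple (kernel, stride) layers without dilation, which is a simplification of real architectures:

```python
def receptive_field(layers):
    """Analytic receptive field size and stride of a stack of layers,
    each given as a (kernel_size, stride) pair."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump   # each layer widens the field by (k-1) input-space steps
        jump *= s              # strides multiply along the stack
    return rf, jump

# VGG-style block: two 3x3 convs (stride 1) followed by 2x2 max pooling (stride 2)
block = [(3, 1), (3, 1), (2, 2)]
assert receptive_field(block) == (6, 2)
assert receptive_field(block * 2) == (16, 4)   # stacking two blocks
```

This recurrence reproduces the empirical probe described above: the returned stride is exactly the ratio of input size to feature map size from Equation (12).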
Finally, the last remaining hyperparameter is the local averaging region A for computing the local mean-shift statistics. Note that the parameter A is similar to the patch size R, as it also impacts the effective receptive field of the entire model and hence the sensitivity for globally distributed and locally concentrated novelty. Our final architecture-independent parameter vector is denoted by θ = {L, R, τ, A}. We visualized the local mean-shift for the bottle class in Figure 3 as an example.

Efficient Feature Computation
The computation of the feature map per extracted image patch is quite expensive in practice. For mitigation, we propose computing the features of all patches simultaneously by computing the feature map of the entire image in a single forward pass through the neural network. However, one needs to be careful, as the patch size R and stride τ are now restricted by the internal details of the selected architecture and its corresponding receptive fields, as shown in Table 2. For example, for block 5 of EfficientNet-B4, a single pixel in the feature map corresponds to 16 pixels in the input space and the stride is fixed to τ = 16. This also limits the freedom of the local averaging region A to a multiple of 16.

Table 3 shows the results of the experiment averaged across five folds of cross-validation, again using varying training and test splits. Because there are no defective images in the training set of MVTec, we only swapped the non-defective training and test data during cross-validation. In other words, the defective examples of the test dataset were kept constant and only the non-defective examples of the test and training datasets were varied. In terms of hyperparameters for µshift, we selected θ = {L = 5, R = 64, τ = 16, A = 80}. In order to find the best hyperparameters, we performed a grid-search over L ∈ [2, 6], R ∈ [48, 144], and A ∈ [64, 164], evaluating the average performance across all classes (cf. Figures 5 and 6). The sensitivity of the effective mean-shift region A on the detection performance can be seen in Figure 6. For the EfficientNet-B4, for instance, A = 96 gives an AUC of 98.5 and A = 64 an AUC of 98.3. We also tested the method with the commonly used WideResnet-50 features and achieved 98.1 AUC on average for the same set of hyperparameters. Generally, the average AUC depends strongly on the average anomaly sizes: for instance, while the pill class benefits from a small averaging region, the screw class performs significantly better with a larger one.
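The equivalence of patch-wise feature computation and a single full-image forward pass can be illustrated with a toy feature map; `phi` below is a stand-in (non-overlapping 4 × 4 average pooling, i.e., receptive field 4 and stride 4), not a real CNN block:

```python
import numpy as np

def phi(img):
    """Toy stand-in for a CNN block: non-overlapping 4x4 average pooling,
    i.e., receptive field 4 and stride tau = 4."""
    C, H, W = img.shape
    return img.reshape(C, H // 4, 4, W // 4, 4).mean(axis=(2, 4))

img = np.random.rand(3, 16, 16)
full = phi(img)                      # one pass over the whole image

# Patch-wise route: R = 8, tau = 4 -> each patch yields a 2x2 feature map.
# The patch corner must be a multiple of the stride, which is exactly the
# restriction on R and tau discussed above.
y, x = 4, 8
patch = img[:, y:y + 8, x:x + 8]
assert np.allclose(phi(patch), full[:, y // 4:y // 4 + 2, x // 4:x // 4 + 2])
```

Slicing the full feature map thus yields the same values as running each patch separately, at a fraction of the cost.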

Evaluating Local Novelty Detection
Note that we could reach the reported 99.0 AUC of PatchCore only in a single fold of cross-validation, but not on average over different folds, and we therefore report slightly lower average scores than in [22]. The same appears to be the case for CutPaste, for which we could only approach the reported 90.9 AUC. However, this is still impressive for a self-supervised scheme that does not rely on pre-training or transfer learning. We also tested the related methods PatchSVDD [25] and PaDiM [26]: PatchSVDD reached 92.1 AUC on average, while PaDiM scored 97.9 AUC using the EfficientNet-B5 feature space.

Complexity, Runtime, and Data Efficiency
The complexity and runtime differ heavily between training and test time. For training, the most expensive part is computing the D × D covariance matrix in Equation (5). With an efficient estimation algorithm, this can be achieved in O(min{(NS)²D, (NS)D²}). The mean estimation itself is linear, O(NSD). At test time, the most expensive computation is evaluating the T² statistic in Equation (4), which requires S matrix-vector multiplications with the D × D precision matrix, i.e., O(SD²).
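The training- and test-time computations above can be sketched in a few lines of NumPy. The regularization constant eps and the array shapes are illustrative assumptions; the statistic follows the standard Hotelling T² form, with the ensemble size S entering as a scaling factor.

```python
import numpy as np

def fit_reference(train_feats, eps=1e-6):
    """Training: estimate mean and (regularized) inverse covariance of the
    normal patch features. train_feats: (N*S, D) stacked training ensemble."""
    mu = train_feats.mean(axis=0)                      # O(NSD)
    cov = np.cov(train_feats, rowvar=False)            # D x D covariance
    prec = np.linalg.inv(cov + eps * np.eye(cov.shape[0]))
    return mu, prec

def t2_statistic(ensemble, mu, prec):
    """Test: Hotelling T^2 mean-shift statistic of a test ensemble (S, D)
    against the reference distribution."""
    S = ensemble.shape[0]
    d = ensemble.mean(axis=0) - mu
    return S * d @ prec @ d

rng = np.random.default_rng(0)
mu, prec = fit_reference(rng.normal(size=(500, 4)))
inlier_score = t2_statistic(rng.normal(size=(50, 4)), mu, prec)
outlier_score = t2_statistic(rng.normal(size=(50, 4)) + 3.0, mu, prec)
```

A shifted test ensemble yields a much larger statistic than an inlier ensemble, which is exactly the mean-shift signal thresholded for novelty detection.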
We noticed that, depending on the dimensionality D and the chosen feature space, the computational cost of computing the feature maps quickly exceeds the cost of our algorithm. There is a fixed overhead that depends on the size of the input, O(HW). For example, for EfficientNet-B4, in our experiments the computation of a single MVTec example took 36 ms for the feature map and 30 ms for the anomaly score. The estimation of the covariance matrix took 20 s for a single class. The runtime was measured on standard CPU hardware (Intel i7-6700) without GPU acceleration. Using a single GPU (RTX 3080 Ti), the runtime could be reduced to 1 ms for the feature map and 1 ms for the anomaly score; computing the covariance matrix took 3 s. Hence, by implementing the mean-shift detection as an additional CNN block consisting of a 2D-convolution for the Mahalanobis distance and 2D-pooling for the averaging region, the model reached about 500 FPS for the entire pipeline on our hardware. Note that this is much faster than, e.g., the 7 FPS of the PatchCore GPU model (using the default 0.1 sampling ratio) [22]. The increased frame rate is mainly due to avoiding the k-nearest-neighbor search across the entire patch database for every prediction.

As already mentioned, we also evaluated the data efficiency of the models with respect to performance in Figure 7. Again, PatchCore and µshift perform similarly with respect to the number of training examples needed to reach a particular performance level. For example, 90% AUC could be achieved with only 10 non-defective examples of the MVTec dataset [7]. In this scenario, we also tested the recently proposed hierarchical method for few-shot anomaly detection (HTDGM) [34].

Figure 7. Data efficiency on the MVTec dataset across several analyzed architectures (compared methods: µshift eff, HTDGM [34], Mahalanobis [12], OC-SVM).
With more than 10 non-defective training examples, an average AUC above 90% could be achieved using the EfficientNet-B4 or the Vision Transformer Network (ViT) as feature space.
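The "2D-convolution plus 2D-pooling" implementation mentioned above can be emulated in NumPy. This is a minimal sketch under one assumption not spelled out in the text: since whitening and averaging are both linear, the per-pixel 1 × 1 convolution (whitening with a Cholesky factor of the precision matrix) can be applied before the pooling, and the squared norm of the pooled result then gives the T² score per region. All names and shapes are illustrative.

```python
import numpy as np

def anomaly_map(fmap, mu, L, k):
    """fmap: (H', W', D) feature map; mu: (D,) reference mean;
    L: factor with L @ L.T equal to the precision matrix Sigma^{-1};
    k: mean-shift region size in feature-map cells (k = A / tau).
    Step 1 (the 1x1 '2D-convolution'): whiten each feature vector.
    Step 2 (the '2D-pooling'): average whitened features over each k x k region.
    Step 3: scaled squared norm gives the T^2 score per region (S = k*k)."""
    z = (fmap - mu) @ L                               # per-pixel linear map
    Hp, Wp, D = z.shape
    pooled = np.empty((Hp - k + 1, Wp - k + 1, D))
    for i in range(pooled.shape[0]):
        for j in range(pooled.shape[1]):
            pooled[i, j] = z[i:i + k, j:j + k].reshape(-1, D).mean(axis=0)
    return (k * k) * (pooled ** 2).sum(axis=-1)       # one score per region

# Sanity check with identity statistics: every score equals k*k * D here.
scores = anomaly_map(np.ones((6, 6, 2)), np.zeros(2), np.eye(2), 2)
```

In a deployed model, the whitening step maps to a 1 × 1 Conv2D layer with weights L and the averaging to an AvgPool2D layer, which is what enables the reported frame rates.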

Discussion
We found that the success of our approach critically depends on the details of how the patch ensemble is extracted from the input images. The most important parameters are the number and size of the patches and the feature space in which the patches are represented. Since a mean-shift can only be detected when the outlier ensemble is sufficiently separated from the inlier distribution, any overlap acts as a blind spot for novelty detection. In an earlier version of the algorithm [2], we proposed a hyperparameter selection rule based on a negentropy approximation [35] to minimize the overlap of the distributions. However, further experiments showed that such an approach does not generalize well to arbitrary datasets and deep features. One reason is that the method prefers larger patch sizes, which is beneficial for global novelty but not for industry-relevant local novelty detection.
Another interesting point appears when comparing our method to PaDiM [26]. As already mentioned, this method is similar to ours when the mean-shift region equals the patch size, i.e., A = R, and hence the ensemble size S is one. The advantage of our ensemble approach is manifested by the zigzag pattern in Figure 6: increasing the local mean-shift area A just slightly beyond the patch size R improves the detection performance significantly, regardless of the actual patch size chosen.

Conclusions
For the task of novelty detection, we proposed a method that detects novelties effectively using deep mean-shifts. By attaching our method on top of a pre-trained neural network, we were able to achieve state-of-the-art performance on standard benchmarks, such as the MVTec defect detection and CIFAR-10 one-class classification challenges. Because of the simple design, the method is easy to implement and provides fast execution times; using a GPU, we reached 500 FPS in our tests. Additionally, because the model relies only on low-order statistics, it is very data-efficient and achieves 90% AUC on the MVTec challenge with only 10 non-defective examples.
The main drawback of the method is that the model accuracy depends heavily on the specific problem at hand and the available knowledge about the expected anomalies and their sizes. As shown in Table 1, swapping the feature space can cause a significant change in performance. Furthermore, an incorrectly chosen patch size quickly degrades performance. However, the same limitations also appear in other methods, such as RotationNet or PatchCore. For practitioners, it is of great importance to use domain knowledge and to set hyperparameters accordingly. A central open question is how to derive those hyperparameters directly from the data, which we leave for future work.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: