A Probabilistic Transformation of Distance-Based Outliers

The scores of distance-based outlier detection methods are difficult to interpret, making it challenging to determine a cut-off threshold between normal and outlier data points without additional context. We describe a generic transformation of distance-based outlier scores into interpretable, probabilistic estimates. The transformation is ranking-stable and increases the contrast between normal and outlier data points. Determining distance relationships between data points is necessary to identify the nearest-neighbor relationships in the data, yet most of the computed distances are typically discarded. We show that the distances to other data points can be used to model distance probability distributions and, subsequently, use the distributions to turn distance-based outlier scores into outlier probabilities. Our experiments show that the probabilistic transformation does not impact detection performance over numerous tabular and image benchmark datasets but results in interpretable outlier scores with increased contrast between normal and outlier samples. Our work generalizes to a wide range of distance-based outlier detection methods, and because existing distance computations are used, it adds no significant computational overhead.


INTRODUCTION
We propose a generic method to transform distance-based outlier detection models into interpretable, probabilistic models. An outlier is often described as "an observation (or subset of observations) which appears to be inconsistent with the remainder of that set of data" Barnett and Lewis (1978). The definition of an "inconsistent" observation is not uniform and varies depending on the application and algorithm used. Inconsistency can mean that the outlier object stems from a different distribution than the model describing the data, which reflects the classical definition of outliers by Hawkins (1980): "An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism." An outlier is also referred to as an anomaly or novelty, sometimes interchangeably. Therefore, outlier detection is also referred to as anomaly detection or novelty detection. Because the methods used to detect outliers, anomalies, and novelties are mostly the same, we make no distinction between these terms and refer to inconsistent instances as outliers.
In a distance-based setting, we can define outliers as objects located far away from the remaining objects. Specifically, given a metric space (M, d) with metric d, each object x ∈ M receives a real-valued outlier score s := q(x) via a function q : M → ℝ, where the function depends on the distances to the other objects in the dataset. To determine if an observation is considered an outlier, it is necessary to establish a threshold value converting outlier scores into binary labels of normal and outlier data points.
A major challenge in distance-based outlier detection is the interpretation of the resulting scores. The scores provided by distance-based methods differ widely in their scale, range, and meaning. Even when considering only a single outlier detection method, the same outlier score can describe different degrees of outlierness depending on the kind of data. These challenges make the interpretation and comparison of outlier scores difficult. Distance-based outlier detection scores are typically derived from some neighborhood representation given a distance matrix. We propose that the information contained in the distance matrix can be used to derive a probabilistic normalization of outlier scores such that they become interpretable. Based on a large number of benchmark datasets, we test our approach in terms of detection performance and interpretability and show that it is possible to achieve interpretable, probabilistic outlier scores with no detriment to the resulting detection performance.
The rest of this paper is organized as follows. Section 2 provides an overview of distance-based outlier detection methods. In Section 3, we show score normalization schemes and their application to distance-based methods. In Section 4, we describe our proposed probabilistic normalization scheme, and in Section 5, we describe the results of applying our scheme on benchmark datasets. Finally, in Section 6, we derive conclusions and provide opportunities for future research.

DISTANCE-BASED OUTLIER DETECTION
Each distance-based method described in this section assigns a real-valued outlier score to an observation. We further differentiate between the closed-world and open-world outlier detection setting, an often disregarded yet highly relevant aspect of distance-based outlier detection. The following outlier detection methods are formulated in a closed-world setting, such that the observations in a dataset X are assigned an outlier score. Often, however, it is necessary to assign an outlier score to unseen data, such that a model of normality is determined based on a dataset X, and the outlier score is determined on unseen observations in a dataset X_test. At the end of this section, we provide a simple approach to transfer said closed-world outlier detection methods into an open-world setting.
Knorr and Ng (1997, 1998); Knorr et al. (2000) first formalized a distance-based notion of outliers in which an object x ∈ X is said to be a DB-outlier in a dataset of n objects if |{x′ ∈ X | d(x, x′) > δ}| ≥ αn, where α, δ ∈ ℝ are parameters to be specified by the user and 0 ≤ α ≤ 1. In this specification, at least a fraction α of all objects have a distance from x that is larger than δ. Chandola et al. (2009) point out that this method can be viewed as global density estimation for each instance since it involves counting the number of neighbors in a hypersphere of radius δ. However, a major drawback of this definition is that it is difficult to determine a distance threshold δ and that the results do not determine a ranking of scores.
Ramaswamy et al. (2000) build upon the ideas presented in DB-outliers. To determine the outlier score of an instance, they propose to use the distance to its k-th nearest neighbor as a score; thus, we refer to the method as kthNN. Compared to DB-outliers, the main benefit of this approach is that it does not require the user to specify a distance δ. The kthNN outlier score of an observation x is defined as
$$q_{k\mathrm{thNN}}(x) := d_k(x, X).$$

k-th Nearest Neighbors
In the kthNN score, x ∈ X and d_k(x, X) is the distance between x and its k-th nearest neighbor in X. Angiulli and Pizzuti (2002) adapt the kthNN approach to use the average distance to the k-nearest neighbors of a point x instead of the k-th distance, which can also be interpreted as the maximum distance. We refer to this method as kNN and define it as follows
$$q_{k\mathrm{NN}}(x) := \frac{1}{k} \sum_{i=1}^{k} d_{(i)}(x, X),$$
where x ∈ X and d_(i)(x, X) is the distance between x and its i-th nearest neighbor in X.
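The two global scores described above can be illustrated with a minimal sketch. It assumes Euclidean distance, a brute-force neighbor search, and a closed-world setting in which a point is excluded from its own neighborhood; the function names and the toy data are hypothetical and not part of the original work.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def neighbor_distances(index, data, k):
    """Sorted distances from data[index] to its k nearest neighbors (self excluded)."""
    dists = sorted(
        euclidean(data[index], other)
        for j, other in enumerate(data)
        if j != index
    )
    return dists[:k]


def kth_nn_score(index, data, k):
    """kthNN score: distance to the k-th nearest neighbor."""
    return neighbor_distances(index, data, k)[-1]


def knn_score(index, data, k):
    """kNN score: average distance to the k nearest neighbors."""
    d = neighbor_distances(index, data, k)
    return sum(d) / len(d)


# Toy data (assumption): four points on a unit square and one distant point.
data = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (5.0, 5.0)]
kth_scores = [kth_nn_score(i, data, k=2) for i in range(len(data))]
mean_scores = [knn_score(i, data, k=2) for i in range(len(data))]
```

In this toy example the distant point receives the largest score under both definitions, while the raw values themselves carry no obvious scale.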
We propose to generalize kthNN and kNN as specific instances of weighting schemes for distance-based outlier detection. Weighting schemes are commonly used in k-nearest neighbors classification, where the schemes traditionally emphasize close neighbors and disregard neighbors farther away Geler et al. (2016). However, as evident in kthNN-based outlier detection, where only the farthest neighbor is considered, we propose to emphasize the neighbors farther away. A further difference between weighted-neighbors classification and outlier detection is the predicted result, which corresponds to class votes or outlier scores, respectively. To keep the resulting outlier scores in the same range, we propose to sum-normalize the weights such that the resulting weight vectors sum to one. The resulting outlier scores can subsequently be interpreted as a (smoothened) distance or distance probability. We adapt three of the weighting measures investigated in Geler et al. (2016) to the outlier detection task and describe kthNN as max-weighted and kNN as mean-weighted outlier detection. The distance and rank schemes are adapted from Dudani's weighted nearest neighbor classification Dudani (1976), the exponential scheme from Zavrel (1997), and the linear scheme from Macleod et al. (1987). In all cases, we reverse the schemes such that the farthest neighbor gets the largest weight. We define the schemes for a vector of k-nearest neighbor distances d in Equation 3 and the equations that follow it, where s, a and b are hyperparameters of the respective weighting schemes. We show how the weights influence the determination of an outlier score based on a three-nearest-neighbors example in Figure 1. Because outlier scores are assumed to be positive values derived from distances, sum-normalization is possible by dividing each element in the weight vector by its sum as defined in Equation 10. Sum-normalization ensures that the weight vector sums to one and the weighted outlier score can be interpreted as a weighted distance. We further use the proposed weighting scheme to define a generic weighted k-nearest neighbor approach as kNNW, which serves as a basis for our tabular outlier detection experiments in Section 5:
$$q_{k\mathrm{NNW}}(x) := \mathbf{d} \cdot \mathbf{w},$$
where d is the vector of k-nearest neighbor distances, w is the k-dimensional weight vector, and · denotes the dot product between the distance and weight vectors.
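The weighted score can be sketched as a dot product between sorted neighbor distances and a sum-normalized weight vector. The max and mean weights below correspond to kthNN and kNN as described above; the reversed rank weights are only an illustrative assumption (weights proportional to the neighbor rank) and are not the exact schemes defined in this paper.

```python
def sum_normalize(weights):
    """Divide each weight by the total so that the weights sum to one."""
    total = sum(weights)
    return [w / total for w in weights]


def max_weights(k):
    """All weight on the farthest neighbor (kthNN as max-weighted detection)."""
    return [0.0] * (k - 1) + [1.0]


def mean_weights(k):
    """Uniform weights (kNN as mean-weighted detection)."""
    return sum_normalize([1.0] * k)


def reversed_rank_weights(k):
    """Hypothetical scheme: weight grows with neighbor rank, farthest is largest."""
    return sum_normalize([float(i) for i in range(1, k + 1)])


def knnw_score(distances, weights):
    """Dot product of ascending neighbor distances and the weight vector."""
    ordered = sorted(distances)
    return sum(d * w for d, w in zip(ordered, weights))


# Three-nearest-neighbors toy example (assumed distances).
d = [1.0, 2.0, 4.0]
score_max = knnw_score(d, max_weights(3))             # farthest distance only
score_mean = knnw_score(d, mean_weights(3))           # plain average
score_rank = knnw_score(d, reversed_rank_weights(3))  # between mean and max
```

Because every weight vector sums to one, each score stays within the range of the neighbor distances and can be read as a smoothed distance.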
More recently, authors proposed various sampling schemes to improve the efficiency of the described techniques. Wu and Jermaine (2006) propose an iterative sampling scheme to approximate the kthNN score, which we designate as the k-th iteratively sampled nearest neighbor, kthISNN:
$$q_{k\mathrm{thISNN}}(x) := d_k(x, S_x(X)),$$
where S_x(X) is a randomly sampled subset of X excluding x. The subsampling is determined individually for each point x processed with q_kthISNN(x); therefore, it is referred to as iterative sampling. Sugiyama and Borgwardt (2013) show that a simplification of kthISNN leads to better detection performance over 16 different datasets. The authors propose to remove the iterative aspect of kthISNN and, instead, sample only once for all data points and identify the first nearest neighbor, which we describe as the sampled nearest neighbor or SNN:
$$q_{\mathrm{SNN}}(x) := \min_{x' \in S(X)} d(x, x'),$$
where S(X) is an independent random subset of the data that is determined once. In other words, for a point x, this method uses the distance to its closest point x′ in a fixed sample S(X) as an outlier score. Pang et al. (2015) extend the SNN approach by repeatedly sampling random subsets of the data, which we term repeatedly sampled nearest neighbor, RSNN:
$$q_{\mathrm{RSNN}}(x) := \frac{1}{r} \sum_{i=1}^{r} \min_{x' \in S_i(X)} d(x, x'),$$
where r is the number of random subsets to sample and S_i(X) is the i-th random sample. This method essentially represents an ensemble of nearest neighbor outlier detection models and, therefore, expectedly improves upon SNN, which the authors empirically show using 11 datasets. It can be argued that k-nearest neighbors ensembles with data subsampling, which are well-known techniques to improve neighbor-based outlier detection, are a generalization of RSNN Zimek et al. (2013); Aggarwal and Sathe (2015); Muhr and Affenzeller (2022b).
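A minimal sketch of the sampled variants follows. It assumes Euclidean distance, sampling without replacement, query points that are not themselves part of the samples, and an average over the r subsamples for RSNN; the helper names, sample sizes, and toy data are hypothetical.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def snn_score(x, sample):
    """SNN: distance from x to its nearest neighbor in one fixed sample."""
    return min(euclidean(x, s) for s in sample)


def rsnn_score(x, samples):
    """RSNN: SNN score averaged over several independently drawn samples."""
    return sum(snn_score(x, s) for s in samples) / len(samples)


def draw_samples(data, sample_size, r, seed=0):
    """Draw r random subsets of the data once, reused for every query."""
    rng = random.Random(seed)
    return [rng.sample(data, sample_size) for _ in range(r)]


# Toy data (assumption): a 5 x 5 grid of reference points.
data = [(float(i), float(j)) for i in range(5) for j in range(5)]
samples = draw_samples(data, sample_size=8, r=10)

inlier_score = rsnn_score((2.2, 2.3), samples)
outlier_score = rsnn_score((10.0, 10.0), samples)
```

Each query only touches r times sample_size reference points instead of the full dataset, which is the efficiency argument behind these schemes.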
In addition to data sampling techniques, other authors use randomized sampling to determine feature subspaces as initially motivated by Aggarwal and Yu (2002). Kriegel et al. (2009b) define a set of reference points based on the concept of shared nearest neighbors. The reference points characterize a subspace hyperplane, and the outlier scores are determined by the Euclidean distance of a point x to the subspace hyperplane, weighted by an indicator function that determines the relevance of a dimension. Agrawal (2009) proposes a very similar distance-based subspacing approach. Zhang et al. (2015) also use a shared nearest neighbor reference set to determine subspaces, using an angle-based approach to compute the outlier scores. Trittenbach and Böhm (2019) propose a method to determine subspaces that considers the relationship between subspaces. Keller et al. (2012) propose to determine high-contrast subspaces for outlier detection as a form of data pre-processing. Cabero et al. (2021) also determine the subspaces as a data pre-processing step based on archetypal analysis followed by a kthNN approach.
Preprint, please refer to the published version https://www.mdpi.com/2504-4990/5/3/42
Some authors combine distance-based outlier detection with dimensionality reduction techniques such as principal component analysis Dang et al. (2015) for high-dimensional data. In image-based outlier detection, authors use neural networks to evaluate the neighborhood search in latent spaces describing entire images Bergman et al. (2020), image sub-features Cohen and Hoshen (2021), or image patch-features Roth et al. (2022).
Another option to model distance-based outliers is to use reverse nearest neighbors or natural neighbors relationships. For example, Outlier Detection using Indegree Number (ODIN) Hautamaki et al. (2004) models the nearest neighbor relationships as a directed graph and defines the outlier score as the in-degree number in the graph such that a low in-degree number defines an outlier. Radovanović et al. (2015) analyze the concept of hubness, which appears in reverse nearest neighbors outlier detection, and propose an outlier detection method based on anti-hubs, that is, points that do not occur in the nearest neighbors of any other points. Natural neighbors approaches discard the k-nearest neighbors parameter and instead perform a search over λ rounds to identify an appropriate number of neighbors such that a shared neighbor relationship is found Zhu et al. (2016); Wahid and Annavarapu (2021). A further extension is described by extended nearest neighbor approaches, which combine the nearest neighbors with reverse nearest neighbors and shared nearest neighbors Tang and He (2015, 2017).

Local Outlier Factor
In contrast to the previously described techniques, referred to as global outlier detection techniques, the Local Outlier Factor (LOF) model Breunig et al. (2000) introduces the concept of local outliers. Schubert et al. (2014b) formalize distance-based outlier detection models such that an outlier score is determined based on some context set, typically the k-nearest neighbors of a point x. To compare the resulting outlier scores, another set of points is used, which is referred to as the reference set. Global methods compare the resulting score from the context set to all other points in the reference set, the dataset X. Because the comparison of scores is global, those methods ignore differences in the local densities of the data. Local methods use a different reference set to compare the scores to, typically the k-nearest neighbors as in the context set. Local methods convert the distance information from the local neighborhood into some form of density; therefore, the methods are sometimes also referred to as density-based. LOF can be defined as a scoring function
$$q_{\mathrm{LOF}}(x) := \frac{1}{|N_k(x)|} \sum_{o \in N_k(x)} \frac{p(o)}{p(x)},$$
where N_k(x) is the set of k-nearest neighbors of x and p is the local reachability density of x defined as
$$p(x) := \left( \frac{1}{|N_k(x)|} \sum_{o \in N_k(x)} \max\{ d_k(o, X),\, d(x, o) \} \right)^{-1}.$$
The set N_k(x) includes all objects inside the k-th distance, which can, in the case of a "tie", be more than k objects. The local reachability density can be seen as an average inverse distance of a point x normalized such that the distance cannot be smaller than the k-th distance. According to the authors, the local reachability density stabilizes and prevents statistical fluctuations, a fact later analyzed in more detail by Schubert et al. (2014a). The local outlier factor then compares p(x), the density of the context set, to the average reachability density of the points in the reference set. If the average reachability density in the reference set is higher than the point density obtained from the context set, then the score of the local outlier factor is above one and the point is considered less normal.
Schubert et al. (2014b) propose a simplified version of the local outlier factor where p is the inverse k-th distance p(x) := d_k(x, X)^{-1}, which represents a simpler density estimate as compared to the local reachability density in LOF. To better illustrate the general concept of local outlier detection, the simplified local outlier factor (SLOF) can be stated as follows
$$q_{\mathrm{SLOF}}(x) := \frac{1}{|N_k(x)|} \sum_{o \in N_k(x)} \frac{d_k(x, X)}{d_k(o, X)}.$$
The authors show that many local outlier models can be considered variations of the simplified local outlier factor model. For example, LDOF proposed by Zhang et al. (2009) is a variation of the simplified LOF model using an average distance as in kNN instead of the k-th distance. Influence Outlierness (INFLO) Jin et al. (2006) is another variation of the simplified LOF, which diverges by using a different context set that includes reverse nearest neighbors. Another method that can be seen as an extension of the simplified LOF is Local Outlier Probabilities (LoOP) Kriegel et al. (2009a), which adds a probabilistic normalization to SLOF. Many more local outlier detection methods have been described in the literature, enough to fill entire literature reviews Alghushairy et al. (2021). Schubert et al. (2014a) note that local outlier detection methods can be differentiated using their order of locality, and Goodge et al. (2022) show that the methods can be generalized as message-passing algorithms on a nearest neighbors graph.
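The local scoring functions discussed in this section can be sketched as follows. The sketch assumes Euclidean distance, a closed-world neighbor search, and neighborhoods cut at exactly k points (ties at the k-th distance are not expanded, which is a simplification of the definition above); all function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def knn(i, data, k):
    """(distance, index) pairs of the k nearest neighbors of data[i], self excluded."""
    pairs = sorted(
        (euclidean(data[i], data[j]), j) for j in range(len(data)) if j != i
    )
    return pairs[:k]  # simplification: ties at the k-th distance are cut off


def k_distance(i, data, k):
    """Distance from data[i] to its k-th nearest neighbor."""
    return knn(i, data, k)[-1][0]


def lrd(i, data, k):
    """Local reachability density: inverse of the mean reachability distance."""
    reach = [max(k_distance(j, data, k), dist) for dist, j in knn(i, data, k)]
    return len(reach) / sum(reach)


def lof(i, data, k):
    """LOF: mean neighbor density divided by the density of the point itself."""
    neighbors = knn(i, data, k)
    mean_neighbor_density = sum(lrd(j, data, k) for _, j in neighbors) / len(neighbors)
    return mean_neighbor_density / lrd(i, data, k)


def slof(i, data, k):
    """Simplified LOF: the same ratio with the inverse k-distance as density."""
    neighbors = knn(i, data, k)
    own = k_distance(i, data, k)
    return sum(own / k_distance(j, data, k) for _, j in neighbors) / len(neighbors)


# Toy data (assumption): a unit square plus one distant point.
data = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (5.0, 5.0)]
lof_scores = [lof(i, data, k=2) for i in range(len(data))]
slof_scores = [slof(i, data, k=2) for i in range(len(data))]
```

Points inside the square obtain a score of one under both variants, whereas the distant point scores well above one because its neighbors are much denser than it is.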

Closed-world and Open-world
Distance-based outlier detection methods are typically defined in a closed-world setting; however, there is an important difference between the closed-world and open-world specification such that, for two equal points x ∈ X and x′ ∈ X_test with x = x′, the k-th nearest neighbor in X is different. In the closed-world or transductive setting, the search for the k-nearest neighbors does not include the searched-for point x; in other words, the nearest neighbors graph does not include self-loops. In the open-world setting, it is not known if x is contained in the reference set X, and therefore all points in X are included in the k-nearest neighbors search. All of the referenced methods are described in a closed-world setting and do not explicitly state how to perform inductive outlier detection, yet commonly used toolkits for (distance-based) outlier detection focus on the open-world setting Zhao et al. (2019); Muhr et al. (2022), with no clear guidelines on how to transfer the transductive tasks to inductive tasks. In Section 4, we describe our method in both the transductive, closed-world setting and the inductive, open-world setting.
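The difference between the two settings can be made concrete with a small sketch. It assumes one-dimensional data with absolute differences as the distance and no duplicate points in the reference set; the function names are hypothetical.

```python
def kth_nn_closed(i, data, k):
    """Closed-world: score data[i] against all other points (no self-loop)."""
    dists = sorted(abs(data[i] - other) for j, other in enumerate(data) if j != i)
    return dists[k - 1]


def kth_nn_open(x, reference, k):
    """Open-world: score a query against every reference point."""
    return sorted(abs(x - r) for r in reference)[k - 1]


reference = [0.0, 1.0, 3.0, 7.0]

closed = kth_nn_closed(1, reference, k=1)        # the point 1.0 scored transductively
open_same_k = kth_nn_open(1.0, reference, k=1)   # an equal query matches itself
open_shifted = kth_nn_open(1.0, reference, k=2)  # k + 1 recovers the closed-world value
```

For a query that coincides with a reference point, the open-world search returns a distance of zero for k = 1, so the same k yields different scores in the two settings.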

OUTLIER SCORE NORMALIZATION
As mentioned in the introduction, the outlier scores resulting from distance-based approaches differ widely in their meaning and are challenging to interpret. In some cases, even within a data set, the scores for two different observations can denote different degrees of outlierness, depending on different local data distributions, a core motivation for local outlier detection methods. Some distance-based methods provide probability estimates, for example, Kriegel et al. (2009a, 2012); Janssens et al. (2012); van Stein et al. (2016), but these probabilistic interpretations are a core part of the underlying algorithms and cannot easily be transferred to other algorithmic approaches. Latecki et al. (2007) discuss outlier detection in terms of kernel density estimation and claim that local outlier detection methods use heuristics to determine, perhaps coincidentally, something similar to a statistical kernel for density estimation.
Instead of algorithm-specific probabilistic interpretations, some authors propose outlier score normalization schemes independent of the underlying algorithm. A simple way of bringing outlier scores to a common scale is to apply a linear transformation as defined in Equation 9, such that the minimum score is mapped to 0 and the maximum score is mapped to 1. However, such a min-max scaling approach does not yield a useful probabilistic interpretation. Gao and Tan (2006) propose two approaches to model outlier scores as probabilities. In the first approach, they assume that the posterior probabilities follow a logistic sigmoid function. In the second approach, they assume that the outlier scores follow a mixture of exponential and Gaussian distributions. In both cases, the authors propose to use an expectation maximization approach to learn the parameters. Kriegel et al. (2011) point out that expectation maximization approaches to score normalization often converge to a "no outliers" or "all outliers" solution and, instead, propose to use the cumulative distribution function of a Gaussian or Gamma distribution to normalize the scores. Additionally, the authors show the usefulness of post-processing techniques to ensure an expected value of 0 for normal data points or to increase the contrast between normal and outlier data points. Schubert et al. (2012) note that a rank-based normalization can be useful if little knowledge is available about the actual scores and score distributions.
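The algorithm-independent schemes can be sketched on a list of raw scores. The sketch below is an illustration only: the Gaussian variant is a plain CDF fitted to the scores themselves, without the post-processing steps the cited authors describe, and the rank variant is one simple choice among several. All function names are hypothetical, and at least two distinct scores are assumed.

```python
from statistics import NormalDist, mean, pstdev


def min_max_scaling(scores):
    """Linear map sending the minimum score to 0 and the maximum score to 1."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def gaussian_cdf_scaling(scores):
    """CDF of a Gaussian fitted to the scores (mean and population std)."""
    cdf = NormalDist(mean(scores), pstdev(scores)).cdf
    return [cdf(s) for s in scores]


def rank_scaling(scores):
    """Rank of each score divided by n - 1 (ties share the lowest rank)."""
    ordered = sorted(scores)
    n = len(scores)
    return [ordered.index(s) / (n - 1) for s in scores]


scores = [1.0, 2.0, 5.0]
linear = min_max_scaling(scores)
gaussian = gaussian_cdf_scaling(scores)
ranked = rank_scaling(scores)
```

All three variants use only the scores themselves, which is the property that the normalization proposed in Section 4 relaxes.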

Interpretability, Explanation, and Trustworthiness
The interpretability of outlier predictions should not be confused with the explanation of outlier detection models or the trustworthiness of predictions; therefore, for the ongoing discussion, we differentiate the terms as follows and describe them in detail following our differentiation.
• Interpretability: The ability to judge the relevance of a prediction.
• Explanation: The ability to explain the reasoning behind a prediction.
• Trustworthiness: The ability to describe the confidence behind a prediction.
Explanation is sometimes also referred to as interpretation, but this kind of interpretation is separate from interpretability. Explanation algorithms reveal how models make decisions, but interpretability refers to the degree to which an inference result is intrinsically understandable to human beings Li et al. (2022). There is a growing interest in methods for deriving explanations of outliers, that is, "[...] to give the users of some outlier detection method further aid in understanding and evaluating the result with respect to their domain." Zimek and Filzmoser (2018). Explanations highlight why a specific outlier detection model reaches a particular prediction. A common approach to explain outlier predictions is to compare normal data points and outliers in attribute subspaces in which the given outliers show separability from the normal data Micenková et al. (2013); Vinh et al. (2016); Macha and Akoglu (2018). Other authors derive explanations from statistical models of the normal and outlier data using minimum distance estimation Angiulli et al. (2013). The explanation of learning methods and outlier detection methods is discussed extensively in a research field known as Explainable Artificial Intelligence, or XAI Samek et al. (2019). Explanations can uncover hidden weaknesses of a model, also known as "Clever Hanses" Lapuschkin et al. (2019). The Clever Hans Effect occurs when the learned model produces correct predictions based on the "wrong" features, which appears to be a widespread problem in outlier detection Kauffmann et al. (2020).
Another critical aspect of outlier detection predictions is trustworthiness. Trustworthiness describes an understanding of when a prediction should or should not be trusted Lee and See (2004). Other authors propose to use a Bayesian approach to add probabilistic uncertainty estimates to outlier scores, enabling the detector to assign a confidence score to each prediction, which captures its uncertainty in that prediction.

PROBABILISTIC OUTLIER SCORES
In this section, we derive a generic scheme to transform distance-based outlier detection scores into interpretable outlier scores based on distance probabilities. In the generic score normalization approaches mentioned in Section 3.1, only the actual scores are used for normalization. Conversely, the algorithm-specific normalization schemes are generally not easily transferable to other algorithms. A common theme across a vast majority of distance-based outlier detection methods is the determination of nearest neighbors relationships between data points. Determining exact nearest neighbors relationships typically utilizes the computation of all distance relationships between data points, resulting in a distance matrix M. Additionally, it has been shown that brute-force distance computation is preferable to index methods except for low-dimensional similarity search problems Muhr and Affenzeller (2022a). We note that approximate nearest neighbors approaches are also used for distance-based outlier detection strategies Kirner et al. (2017), but this represents a small minority of methods and is not the focus of our study.
In the closed-world setting, the distance matrix corresponds to a square matrix in ℝ^{n×n} for n points in the dataset. In the open-world setting, the distance matrix between n reference points in ℝ^{n×n} has to be differentiated from the distance matrix of m query points to n reference points in ℝ^{m×n}. For a point x_i and a point x_j, a value v_{i,j} in the distance matrix corresponds to the distance d(x_i, x_j). Most distance-based approaches use the k-nearest neighbors as a context set to determine the outlier score Schubert et al. (2014b), and any probabilistic estimate of those scores would be based on the limited information present in the context set. In contrast to previous approaches, we assume that the additional information contained in the distance matrix is useful for normalization. More concretely, we hypothesize that the additional information can be used to transform outlier scores into interpretable probabilistic estimates.
Based on a distance matrix of reference points, we define the concept of a normalization set. A normalization set describes a subset of the distance matrix used for the probabilistic score normalization. In the simplest case, the entire distance matrix is used as a baseline normalization set; hence, the normalization set is defined as the distances contained in the distance matrix M between all reference points, excluding self-loops in the matrix diagonal. Additionally, if the distance measure is symmetric, the normalization set from the distance matrix can be reduced to its upper or lower triangular set of values. We propose using the normalization set to determine a distance probability distribution. For example, in the parametric case, we estimate the parameters of a distribution P based on the distances in the normalization set. The distribution of distances has been investigated in the context of feature similarity Burghouts et al. (2007), hubness reduction Schnitzer et al. (2012), local intrinsic dimensionality Houle (2013) or compact sets Lellouche and Souris (2020). Pekalska and Duin (2000) show, based on the central limit theorem, that distances are approximately normally distributed for independent and identically distributed feature vectors.
Under the assumption that the distances in the normalization set follow an unknown continuous probability distribution, we define a random variable r ∼ P that describes the normalization set. Given a probability density function p on r, any distance d(x, x′) can be interpreted as the distance density between x and x′, denoted p(d(x, x′)) or, in short, p(x, x′). The cumulative distance distribution f(x, x′) := P(r ≤ d(x, x′)) describes the probability of a distance in the normalization set being smaller than or equal to d(x, x′). A query point with some distance-based outlier score can be directly interpreted using its distance distribution, where a probability of 99% means that the distance is in the Top-1% of distances observed in the normalization set.
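The transformation described above can be sketched end to end. The sketch assumes a symmetric Euclidean distance (so the upper triangle of the distance matrix suffices), an open-world kthNN score, and either a fitted normal distribution or the empirical distribution of the normalization set; the function names and toy data are hypothetical rather than taken from our implementation.

```python
import math
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def normalization_set(reference):
    """Upper-triangular pairwise distances: no self-loops, no duplicated pairs."""
    n = len(reference)
    return [
        euclidean(reference[i], reference[j])
        for i in range(n)
        for j in range(i + 1, n)
    ]


def normal_cdf(norm_set):
    """Parametric case: CDF of a normal distribution fitted to the distances."""
    return NormalDist(mean(norm_set), pstdev(norm_set)).cdf


def empirical_cdf(norm_set):
    """Non-parametric case: fraction of distances smaller than or equal to d."""
    ordered = sorted(norm_set)
    return lambda d: bisect_right(ordered, d) / len(ordered)


def kth_nn_score(x, reference, k):
    """Open-world kthNN score of a query against the reference points."""
    return sorted(euclidean(x, r) for r in reference)[k - 1]


# Toy reference data (assumption): a 3 x 3 grid.
reference = [(float(i), float(j)) for i in range(3) for j in range(3)]
norm_set = normalization_set(reference)
cdf = empirical_cdf(norm_set)

near = cdf(kth_nn_score((1.1, 1.0), reference, k=1))  # close to a grid point
far = cdf(kth_nn_score((6.0, 6.0), reference, k=1))   # far from the grid
```

The distances are the ones already required for the neighbor search, so the only additional work is fitting the distribution and evaluating its CDF.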
In summary, we hypothesize that it is possible to transform distances to interpretable distance distributions without adverse effects on detection performance. We use the Receiver Operating Characteristic (ROC) Area Under the Curve (AUC) to measure outlier detection performance. A perfect ranking results in a ROC AUC value of 1, whereas an inverted perfect ranking would result in a value of 0. A value of 1/2 can be interpreted as random guessing Campos et al. (2016). To investigate our assumption, we test our approach on tabular datasets Campos et al. (2016); Muhr and Affenzeller (2022c), and a common image-based outlier detection dataset Bergmann et al. (2019, 2021).

RESULTS
To evaluate the probabilistic transformation on tabular data, we use the proposed weighted k-nearest neighbor approach (kNNW). The datasets used stem from the DAMI library Campos et al. (2016) and the UTSD single-concept benchmark Muhr and Affenzeller (2022c). For both dataset collections, we only use the datasets with five percent of outliers, each consisting of ten randomly sampled variants, resulting in the dataset list shown in Table 1. For DAMI, we use the normalized and deduplicated variants, and for UTSD, we pre-process the data points with min-max scaling. We use two-fold, stratified cross-validation to determine a ROC AUC estimate of the resulting distance-based and probabilistic estimates. We use Euclidean distance for all evaluations and fix the hyperparameters for the weighting schemes as s = a = b = 1, as shown in Figure 1. For the number of neighbors k of our tested kNNW method, we pick the best parameter from all possible values of k ∈ {1, 2, ..., 100}.
In our first analysis, we compare the performance of different weighting schemes over all described datasets. We find that, over all datasets, there is no difference in predictive performance between the different weighting schemes. There can be more considerable weighting scheme differences for individual datasets; however, those are dataset-specific and must be investigated case-by-case, as shown in Figure 3. We also note that weighting-scheme hyperparameter optimization might yield additional improvements, which we did not address in our analysis.
In our second analysis, we investigate the impact of score normalization using different probability distributions. We examine a normal, exponential, and empirical distribution and compare them to the base case where no distribution is used for normalization. Because the cumulative distribution functions are monotonically non-decreasing, the ranking should be stable after the transformation, but due to the limited precision of the computations, it is not guaranteed that the transformation is ranking-stable. In Figure 4, we show that the ROC AUC result after the transformation matches the original result and the transformation is indeed ranking-stable. From an interpretability perspective, there are datasets where using the entire distance matrix as a normalization set results in useful probabilistic estimates, for example, the Crop dataset shown in Figure 5.
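The ranking-stability argument can be checked with a small sketch. It assumes a pairwise (Mann-Whitney style) computation of the ROC AUC and a normal CDF with arbitrarily chosen parameters; the scores and labels are toy values, not results from our experiments.

```python
from statistics import NormalDist


def roc_auc(scores, labels):
    """Probability that a random outlier outranks a random normal point (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy scores (assumption); label 1 marks an outlier.
scores = [0.2, 0.4, 0.5, 0.9, 1.5, 3.0]
labels = [0, 0, 1, 0, 0, 1]

cdf = NormalDist(mu=1.0, sigma=1.0).cdf
probabilities = [cdf(s) for s in scores]

auc_before = roc_auc(scores, labels)
auc_after = roc_auc(probabilities, labels)  # unchanged by the monotone transformation

# Limited precision: two distinct, very large scores can saturate to the same
# probability, which is how a monotone CDF may still introduce ties.
saturated = NormalDist(mu=0.0, sigma=1.0).cdf
tie = saturated(10.0) == saturated(20.0)
```

As long as the transformed values remain distinct, the AUC is identical before and after the transformation; saturation in the tails is the case where floating-point ties can appear.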
However, using the entire distance matrix for normalization often leads to low probabilistic estimates for normal and outlier data points. The reason is that the resulting weighted neighbor distance is consistently low compared to all other distances in the dataset, even for outliers. In this case, a different normalization set has to be extracted from the distance matrix; for example, the m-neighborhood consisting of the distances to the m-closest reference points.
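One possible reading of the m-neighborhood is sketched below: for every reference point, only the distances to its m closest reference points are kept, and these distances are pooled into a single normalization set. This pooling is an assumption made for illustration, as are the empirical distribution, the function names, and the toy data.

```python
import math
from bisect import bisect_right


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def full_normalization_set(reference):
    """Baseline: all pairwise distances without self-loops (upper triangle)."""
    n = len(reference)
    return [
        euclidean(reference[i], reference[j])
        for i in range(n)
        for j in range(i + 1, n)
    ]


def m_neighborhood_set(reference, m):
    """Pooled distances from each reference point to its m closest reference points."""
    pooled = []
    for i, x in enumerate(reference):
        row = sorted(euclidean(x, r) for j, r in enumerate(reference) if j != i)
        pooled.extend(row[:m])
    return pooled


def empirical_cdf(norm_set):
    """Fraction of distances in the set smaller than or equal to d."""
    ordered = sorted(norm_set)
    return lambda d: bisect_right(ordered, d) / len(ordered)


# Toy reference data (assumption): a 3 x 3 grid; 1.5 is a hypothetical outlier score.
reference = [(float(i), float(j)) for i in range(3) for j in range(3)]
score = 1.5

p_full = empirical_cdf(full_normalization_set(reference))(score)
p_local = empirical_cdf(m_neighborhood_set(reference, m=2))(score)
```

In this toy case the same score looks unremarkable against all pairwise distances but extreme against the local neighborhood distances, which mirrors the low estimates described above.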
It is possible to analyze multiple normalization sets for a single prediction to provide more context for interpretation. For example, a prediction can be interpreted using different neighborhood probabilities from m ∈ {1, 2, ..., 200} to determine an appropriate cut-off threshold. The cut-off decision always relies on the data characteristics and domain knowledge. In Figure 8, we plot the optimal cut-off threshold for different neighborhood sizes, but note that this optimal threshold is often difficult to determine. To evaluate an optimal cut-off threshold, it is necessary to evaluate it against a performance metric such as the F1-score requiring normal and outlier labeled data, which is often unavailable. However, using a probabilistic neighborhood analysis as shown in Figure 8 drastically simplifies the identification of a cut-off value, even when labels are unavailable. Thus, in addition to the improved interpretability, choosing an appropriate normalization set allows for a flexible definition of a cut-off threshold to transform outlier scores into class labels.
Furthermore, it is possible to increase the contrast between normal and outlier data points using the right normalization set and distribution. Using statistical distances as a measure of contrast between normal and outlier scores, we can identify an optimal normalization set size. To give an example for the TwoLeadECG dataset, the statistical measures of contrast result in a contrast-optimal neighborhood size of m = 90 or m = 99 depending on the statistical distance used, with a cut-off threshold of approximately 95%, as shown in Figure 8. In Figure 6 and Figure 7, we compare the initial probabilistic estimates using the entire distance matrix to the smaller, local normalization set identified in Figure 8 and clearly demonstrate the increased contrast using a smaller normalization set.
For image-based datasets, we extend the PatchCore methodology Roth et al. (2022) to ProbabilisticPatchCore. We evaluate the model on the datasets provided by MVTecAD, as shown in Table 2. A major difference between tabular k-nearest neighbors outlier detection and image outlier detection is that the image models may result in pixel-wise and image-wise outlier scores. In the pixel-wise case, each pixel of an observation is scored, and in the image-wise case, a single score is obtained for the entire image.
The PatchCore model is similar to kth NN, but uses a core-set-sampled memory bank of patch-wise feature vectors that are generated using a pre-trained neural network. For PatchCore, the pixel-wise scores are determined through interpolation of the patch-wise scores; thus, it is not necessary to estimate a distance distribution per pixel, but only one distribution per patch. Like the authors, we use the second and third layer of a WideResNet50 trained on ImageNet Deng et al. (2009) to determine 28 × 28 patches. To transform the pixel-wise scores into probabilistic estimates, we learn patch-wise distributions and transform the scores accordingly. In Figure 9, we highlight the challenge of interpretability based on a normal data sample; without additional context, it is not clear how to interpret the resulting distance-based scores. In addition to the improved interpretability, we find that the probabilistic normalization greatly increases the contrast between normal and outlier data points in the image detection tasks, as visible in Figure 10.
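The patch-wise transformation can be sketched as follows. This is a simplified illustration under stated assumptions rather than the ProbabilisticPatchCore implementation: the normalization distances per patch are synthetic, an exponential distribution is fitted per patch by its mean, the input is assumed to be 224 × 224 pixels (an upsampling factor of 8 for 28 × 28 patches), and nearest-neighbor upsampling stands in for the interpolation step. The function names are hypothetical.

```python
import numpy as np

def patch_probabilities(patch_scores, patch_normalization):
    """Transform patch-wise scores into probabilities, one distribution per patch.

    patch_scores:        array of shape (H, W) with one distance score per patch.
    patch_normalization: array of shape (H, W, n) with n normalization
                         distances per patch (assumed to be given).
    """
    scale = patch_normalization.mean(axis=-1)  # exponential scale per patch
    return 1.0 - np.exp(-patch_scores / scale)

def upsample_nearest(patch_map, factor):
    """Repeat every patch value `factor` times along both image axes."""
    return np.repeat(np.repeat(patch_map, factor, axis=0), factor, axis=1)

# Synthetic data: 28 x 28 patches with 50 normalization distances each.
rng = np.random.default_rng(0)
patch_normalization = rng.exponential(scale=1.0, size=(28, 28, 50))
patch_scores = rng.exponential(scale=1.0, size=(28, 28))
patch_scores[10:14, 10:14] += 5.0  # simulated defective region

patch_probs = patch_probabilities(patch_scores, patch_normalization)
pixel_probs = upsample_nearest(patch_probs, factor=8)  # shape (224, 224)
```

Fitting one distribution per patch keeps the number of fitted distributions at 28 × 28 instead of one per pixel, which matches the motivation given above.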

CONCLUSION
We show that it is possible to transform distance-based outlier scores into interpretable probabilistic estimates. To demonstrate the viability of our approach, we derive and test a generalized, weighted k-nearest neighbors outlier detection model on several tabular datasets and a probabilistic PatchCore model on image datasets. We show that the resulting probabilistic scores increase the contrast between normal and outlier data points and can easily be added to existing distance-based outlier detection methods. In contrast to previous score normalization techniques, which use solely the information contained in the outlier scores to derive a normalization, we make use of the distances to other data points as an additional source of information for normalization.
Another interesting aspect of our analysis is that the probabilistic transformation increases the contrast between normal and outlier points, which should be further explored. Specifically, we find that there might be an optimal normalization set that maximally increases the contrast between normal and outlier points, and future research is necessary to define measures of contrast and methods to identify an optimal normalization set for a given contrast measure. Because distance-based outlier detection techniques rely on distance computations for nearest neighbors search, our approach can be applied to a wide range of detection techniques. In our experiments, we use the common Euclidean distance metric, but other, possibly non-metric, distance measures are also used for outlier detection and should also be investigated using our probabilistic score transformation.
We investigated only the most apparent normalization sets, but there may be various other useful normalization sets hidden in the distances between points. Another limitation of our examination is the usage of real-world datasets, which limits the theoretical analysis of our approach, such as the normalization behaviour under specific dataset distributions. Our proposed normalization approach should be investigated more thoroughly in a theoretical setting to identify the limits of our approach and potentially prove some of the properties observed in our evaluation. Our proposed generalization of weighted nearest neighbors outlier detection should be analysed in more detail to thoroughly compare weighting strategies and weighting hyperparameters.
. . ., n}: The set of all integers between 0 and n
f(x) : A → B: A function of x with domain A and range B
developed an outlier detection model based on local kernel density estimates. Schubert et al. (2014a) more generally analyze the connection of density-based outlier detection algorithms, such as the local distance-based methods, to kernel density estimation. The authors show that distance-based density estimation is closely related to kernel density
Preprint, please refer to the published version https://www.mdpi.com/2504-4990/5/3/427 Muhr et al.

Figure 2. Visualization of the reference points as a normalization set, where the dotted blue lines indicate the connection of the query point to its nearest neighbors in the reference set, and the gray lines indicate the distance relationships between the reference points.

Figure 3. ROC AUC results of the different weighting schemes for each dataset over all examined distributions.

Figure 5. Transformation of distances to an exponential cumulative distance distribution for the first variant of the Crop dataset.

Figure 9. Patch-wise scores of a normal sample of the Bottle dataset showing interpretability differences.
A large body of research investigates sampling and subspacing techniques for distance-based outlier detection, and future research should evaluate the usefulness of probabilistic interpretations for such models. Another important area of outlier detection research is how to combine different detection models into ensembles that improve upon the individual models, which typically necessitates score normalization and, therefore, could benefit from probabilistic normalization. We further highlight the importance of a distinction between the open-world and closed-world setting for distance-based outlier detection and propose such a distinction for future distance-based methods.

Table 1. The datasets used for k-nearest neighbors evaluation, where N denotes the number of samples, O the number of outliers, and d the dimensionality of the dataset.