Article

Distribution-Aware Outlier Detection in High Dimensions: A Scalable Parametric Approach

by Jie Zhou 1,*, Karson Hodge 1, Weiqiang Dong 1 and Emmanuel Tamakloe 2
1 Department of Mathematics and Computer Science, Southern Arkansas University, 100 East University, Magnolia, AR 71753, USA
2 Department of Mathematics and Natural Sciences, MCPHS University, 179 Longwood Avenue, Boston, MA 02115, USA
* Author to whom correspondence should be addressed.
Mathematics 2026, 14(1), 77; https://doi.org/10.3390/math14010077
Submission received: 30 October 2025 / Revised: 9 December 2025 / Accepted: 16 December 2025 / Published: 25 December 2025

Abstract

We propose a distribution-aware framework for unsupervised outlier detection that transforms multivariate data into one-dimensional neighborhood statistics and identifies anomalies through fitted parametric distributions. This directly addresses central difficulties of high-dimensional data—including sparsity of observations, the concentration of pairwise distances, hubness phenomena in nearest-neighbor graphs, and general effects of the curse of dimensionality that degrade classical distance-based scoring. Supported by the Cumulative Distribution Function (CDF) Superiority Theorem and validated through Monte Carlo simulations, the method connects distributional modeling with Receiver Operating Characteristic–Area Under the Curve (ROC–AUC) consistency and produces interpretable, probabilistically calibrated scores. Across 23 real-world datasets, the proposed parametric models demonstrate competitive or superior detection accuracy with strong stability and minimal tuning compared with baseline non-parametric approaches. The framework is computationally lightweight and robust across diverse domains, offering clear probabilistic interpretability and substantially lower computational cost than conventional non-parametric detectors. These findings establish a principled and scalable approach to outlier detection, showing that statistical modeling of neighborhood distances can achieve high accuracy, transparency, and efficiency within a unified parametric framework.

1. Introduction

Outlier detection plays a critical role in statistical analysis and data-driven decision making because extreme observations can bias estimates, corrupt model fitting, and obscure genuine rare signals [1,2]. It supports multiple objectives: preserving statistical validity by preventing distortion of summary statistics [1], ensuring model robustness [2], enhancing data quality by identifying measurement or entry errors [3], uncovering novel insights from rare events such as fraud or equipment failures [4], and enabling timely decision processes in domains such as finance, cybersecurity, and healthcare [2].
Although classical techniques perform well in low-dimensional settings, they often deteriorate as dimensionality increases. In high-dimensional spaces, the “curse of dimensionality’’ leads to distance concentration and sparsity, which undermine the reliability of proximity- and density-based approaches [5,6]. Additionally, irrelevant or noisy features can mask true anomalies and dramatically increase computational cost [2,6]. Nevertheless, accurate anomaly detection remains essential in fraud detection [7], network intrusion analysis, genetics, image processing, and sensor networks, where rare deviations can signal security breaches, biological abnormalities, or critical system failures.
However, existing approaches typically suffer from at least one of the following limitations: (1) non-parametric methods scale poorly due to reliance on local neighborhood computation; (2) many high-dimensional methods depend on heuristics or dimensionality reduction and lack interpretability; and (3) parametric models rarely come with theoretical guarantees on error control. This motivates the need for a scalable, distribution-grounded framework that produces interpretable anomaly scores with provable statistical properties in high-dimensional settings.
To address this need, we propose a parametric outlier detection framework that applies a uni-dimensional distance transformation capturing each point’s “degree of outlier-ness’’ while remaining computationally efficient regardless of ambient dimension. Specifically, our research objectives are as follows:
1.
Algorithmic efficiency: Develop a method whose computational cost scales linearly with sample size and is independent of feature dimension after transformation;
2.
Statistical interpretability: Model transformed distances with flexible parametric families, enabling distribution-based threshold selection and diagnostic inference;
3.
Provable detection performance: Establish theoretical guarantees showing that the method controls false alarm rates and maximizes statistical power under mild assumptions on the underlying data distribution.
By representing the dataset with a single distance vector, our method avoids the combinatorial cost of high-dimensional operations and enables interpretability through a compact set of distributional parameters. We fit a flexible parametric model—using positively skewed or log-transformed normal families—on these transformed distances, deriving closed-form thresholds and showing that our estimator behaves optimally in terms of false positive rate minimization and true detection rate maximization. Empirical evaluations across multiple benchmark datasets demonstrate that our approach consistently outperforms state-of-the-art non-parametric methods in mean ROC–AUC, validating both its practical utility and theoretical promises.
The proposed and existing algorithms have been benchmarked using the widely adopted ROC–AUC framework. ROC–AUC is a standard, threshold-independent metric used in outlier detection and classification. The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) across all possible score thresholds. The AUC then represents the probability that a randomly sampled outlier receives a higher anomaly score than a randomly sampled inlier [8,9]. A comprehensive conceptual introduction is provided by Fawcett [10], who discusses ROC interpretation, model comparison, and use in anomaly detection more broadly. Under the standard definition, the AUC is computed as the integral of TPR from 0 to 1 with respect to FPR as shown in Equation (1):
\mathrm{AUC} = \int_{0}^{1} \mathrm{TPR}(\mathrm{FPR}) \, d(\mathrm{FPR}) .
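As a small illustration of this definition, the following Python sketch (with hypothetical scores drawn from two Gaussians) checks that the pairwise-probability interpretation of the AUC agrees with the threshold-sweep computation in scikit-learn.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Hypothetical anomaly scores: inliers tend to score lower than outliers.
inlier_scores = rng.normal(loc=0.0, scale=1.0, size=200)
outlier_scores = rng.normal(loc=2.0, scale=1.0, size=20)

# Pairwise-probability definition: Pr(score(outlier) > score(inlier)), ties counted as 1/2.
greater = (outlier_scores[:, None] > inlier_scores[None, :]).mean()
ties = (outlier_scores[:, None] == inlier_scores[None, :]).mean()
auc_pairwise = greater + 0.5 * ties

# Threshold-sweep definition via scikit-learn.
labels = np.concatenate([np.zeros(200), np.ones(20)])
scores = np.concatenate([inlier_scores, outlier_scores])
auc_sklearn = roc_auc_score(labels, scores)

print(auc_pairwise, auc_sklearn)  # the two values agree
```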
To place this work in context, we now review the most relevant existing approaches in both non-parametric and parametric outlier detection.

2. Literature Review

Outlier detection has long been studied from both data-driven and model-based perspectives. A substantial body of non-parametric research leverages the geometry or local density of data. For example, the k-nearest neighbor (KNN) distance method identifies outliers as points whose average distance to their k nearest neighbors is unusually large [11,12], while density- or local-structure-based methods such as Local Outlier Factor (LOF), Connectivity-based Outlier Factor (COF), and Angle-Based Outlier Detection (ABOD) compare each point’s local density and structural relationships with those of its neighborhood to identify sparse regions [13,14,15]. These techniques make few assumptions about the underlying distribution and adapt well to complex, nonlinear structure. However, their performance tends to degrade in high dimensions—distances concentrate, noise dimensions mask true anomalies, and the computational cost of neighborhood or density estimation grows prohibitively with feature count [16,17]. Comprehensive evaluations and benchmark suites further document these effects and provide standardized comparisons across algorithms [17].
Recent large-scale benchmark studies reinforce these observations by systematically comparing dozens of unsupervised anomaly detection algorithms on real-world multivariate tabular data, showing that classical neighbor-based methods such as KNN and LOF remain strong baselines but exhibit unstable behavior across high-dimensional scenarios [18].
A second strand of non-parametric work refines these ideas. For example, Rehman and Belhaouari [16] propose KNN-based dimensionality reduction (D-KNN) to collapse multivariate data into a one-dimensional distance space, then apply box-plot adjustments and joint probability estimation to better separate outliers. Classical exploratory tools still inform practice: Tukey’s 1.5 × Interquartile Range (IQR) rule remains a widely used heuristic for flagging extreme points [19]. Distance choice is also critical in high-dimensional settings: Aggarwal et al. [12] showed that the L 1 (Manhattan) distance preserves greater contrast than the L 2 (Euclidean) distance as dimensionality increases, thereby enhancing the effectiveness of nearest-neighbor-based detection methods.
In parallel, recent work has revisited neighbor-based anomaly detection from a theoretical perspective. Dimensionality-Aware Outlier Detection (DAO) introduces a principled framework based on local intrinsic dimensionality (LID), demonstrating when traditional LOF and KNN scores fail and providing improved consistency guarantees in high-dimensional settings [20]. Other studies enhance neighborhood models via random subspace variational inference [21] or re-examine the construction of KNN-based anomaly scores to improve robustness and calibration in complex feature spaces [22].
Parametric approaches offer an alternative by imposing distributional structure, yielding interpretable tests and often lower computational burden. Early methods include Grubbs’ test and standardized residuals under normality [23], with extensions such as Rosner’s generalized Extreme Studentized Deviate (ESD) procedure and the Davies–Gather and Hawkins tests to detect multiple outliers when the number of anomalies is known a priori [3,24,25]. More recent work develops robust tests for broader location-scale and shape-scale families (e.g., exponential, Weibull, logistic) that avoid pre–specifying the number of outliers [26]. In time-series and regression contexts, parametric residual-based techniques using exponential or gamma error models are used to identify anomalous behavior and heavy-tailed departures [27,28].
Contemporary parametric research also explores distribution-aware anomaly detection in high-dimensional settings, such as density-ratio estimation based on scaled Bregman divergences for high-dimensional time series, which explicitly models deviation from learned probabilistic structure to improve stability and interpretability [29].

2.1. State of the Art

A wide range of state-of-the-art methods has been proposed for unsupervised outlier detection, drawing from distance, density, local deviation, and distribution-based measurement of anomaly scores. Among distance-based approaches, the k-nearest neighbor (KNN) method identifies anomalies as points whose average distance to their nearest neighbors is unusually large [11,12], while Outlier Detection using Indegree Number (ODIN) [30] and Local Distance-based Outlier Factor (LDOF) [31] refine this idea by examining relative neighborhood distances and distance ratios. Other approaches, such as Local Outlier Probability (LoOP) [32] and FastABOD [15], further integrate probabilistic normalization or angular relationships within local reference sets.
Density- and locality-based methods such as Local Outlier Factor (LOF) [13], Connectivity-based Outlier Factor (COF) [14], Local Density Factor (LDF) [33], and Influenced Outlierness (INFLO) [34] detect outliers by comparing each point’s local density or neighborhood structure to that of its surrounding region. Kernel-based models such as KDEOS [35] estimate density based on distance-weighted neighborhood contributions, while marginal and projection-based scoring methods such as HBOS [36] and LODA [37] analyze projected or marginal distributions. More recent work such as COPOD [38] leverages copula-based modeling to derive empirical tail probabilities for anomaly scoring.
Although these approaches are highly influential and widely applicable, they often encounter fundamental difficulties in high-dimensional settings. As dimensionality increases, pairwise distances tend to concentrate [5], making distance-based approaches such as KNN, ODIN, LDOF, LoOP, and FastABOD produce weakened score contrast between normal and anomalous points. Density-based methods (LOF, COF, LDF, INFLO) depend on reliable local density estimates, but sparsity in high dimensions produces noisy, unstable estimation [6]. Kernel-based approaches such as KDEOS suffer from exponential smoothness degradation, while HBOS and LODA may fail when projections no longer preserve informative structure. Additionally, hubness effects—where certain points become nearest neighbors of many others—distort the local neighborhood graph and bias KNN-dependent scoring.
These limitations motivate the core principle of our approach. Rather than depending on geometric or density estimates in the original multivariate space, we transform each point into a univariate nearest-neighbor distance statistic and analyze its distribution parametrically. This reduces the influence of distance concentration, hubness, sparsity, and unstable density estimation, allowing for stable rank-ordering of anomalies using a fitted distributional model.
In summary, while the above state-of-the-art detection techniques have proven successful in low- and moderate-dimensional spaces, their performance degrades as dimensionality increases. By shifting from multivariate geometric comparisons to parametric modeling of univariate neighborhood distances, our method offers a statistically well-grounded and dimension-robust solution for anomaly detection.
Modern anomaly detection systems span a broad spectrum, ranging from classical locality and density-based algorithms to modern representation-learning approaches. On the classical side, widely used methods include KNN [39], LOF [13], SimplifiedLOF [40], Local Outlier Probability (LoOP) [32], Local Distance-based Outlier Factor (LDOF) [31], Outlier Detection using Indegree Number (ODIN) [30], FastABOD [15], Kernel Density Estimation Outlier Score (KDEOS) [35], Local Density Factor (LDF) [33], Influenced Outlierness (INFLO) [34], and COF [14]. These methods remain widely adopted, computationally efficient, and assumption-light, thereby constituting strong state-of-the-art baselines for tabular anomaly detection. Beyond these, the Isolation Forest (iForest) isolates anomalies via random partitioning [41], while the one-class Support Vector Machine (SVM) provides a large-margin boundary in high-dimensional feature spaces [42]. For learned representations, autoencoders, Deep Support Vector Data Description (SVDD), and probabilistic hybrids such as Deep Autoencoding Gaussian Mixture Model (DAGMM) often achieve leading results on image and complex tabular benchmarks [43,44]. Lightweight projection-based schemes such as Histogram-Based Outlier Score (HBOS) and Lightweight Online Detection of Anomalies (LODA) deliver excellent speed–accuracy trade-offs [36,37], while the copula-based Copula-Based Outlier Detection (COPOD) provides fully unsupervised, distribution-free scoring with competitive accuracy [38]. Alongside these established paradigms, representation-learning and generative approaches continue to evolve. In particular, diffusion-based generative models have recently been adapted for anomaly detection and are increasingly recognized for their capacity to model complex high-dimensional distributions, though often at the expense of computational efficiency and interpretability [45]. Together with classical and hybrid methods, these techniques define the contemporary landscape of anomaly detection—from interpretable, efficient heuristics to highly expressive but resource-intensive deep models.

2.2. Our Contribution in Context

Although grounded in different philosophies, both research lines ultimately aim to balance sensitivity to genuine anomalies with robustness against noise. Non-parametric methods perform well when no clear distributional form is present, yet they often suffer from the curse of dimensionality. Parametric tests, by contrast, regain efficiency and offer finite-sample guarantees under correct model specification, but are vulnerable to misspecification. In this paper, we unify these paradigms by introducing a uni-dimensional distance transformation that maps any dataset—regardless of its original dimension—into a single distance vector, which is then modeled with a flexible parametric distribution. This hybrid approach preserves interpretability and scalability, enables closed-form inference, and delivers provable performance under mild assumptions.
Building on the gaps identified above, the paper proceeds as follows. We first formalize a CDF Superiority Theorem, establishing that a parametric CDF–based score achieves strictly higher ROC–AUC than the KNN distance under mild conditions. We further outline the proof that this parametric score also outperforms any other non-parametric method. We then validate this theoretical advantage through Monte Carlo experiments, demonstrating that the mean ROC–AUC of the CDF-based score across 500 simulation paths under the gamma distribution exceeds that of four established non-parametric methods: KNN, LOF, ABOD, and COF. We then introduce our practical framework: reduce high-dimensional data to a 1-D KNN (Manhattan) distance vector; fit either positively skewed families (e.g., gamma/Weibull) or—after a log transform—normal-like families (normal/t/skew-normal); and score observations by their fitted CDFs. Next, we benchmark our approach against several state-of-the-art nonparametric baselines using 23 publicly available datasets. These 23 datasets are separated into two distinct categories: the literature set and the semantic set. The literature set includes datasets commonly used in previous papers that may lack real-world labels and might be synthetic or have outliers defined by prior papers. The semantic set defines outliers based on semantic or domain meaning; they are not arbitrary or synthetic but reflect real-world deviations, e.g., errors in manufacturing.
We report performance in terms of ROC–AUC, together with goodness-of-fit ($R^2$) values derived from Quantile-Quantile (QQ) plots of the proposed probability distributions fitted to the 1-D KNN distance vector. We then examine the relationship between fit quality and detection accuracy, highlighting the conditions under which the parametric approach is most effective. Finally, we conclude with key implications and directions for future research.

3. Method and Theoretical Results

3.1. Positively-Skewed Distributions

Suppose that we originally have a data set in an N-dimensional space. According to Rehman and Belhaouari [16], this dataset can be effectively transformed into a one-dimensional distance space by employing a suitable metric such as Manhattan distance or Euclidean distance. Specifically, for each observation in the original N-dimensional space, the distance to its k-th nearest neighbor is computed. This process generates a new dataset consisting solely of these distances, denoted as $d_k \in \mathbb{R}$. Formally, this transformation can be represented as follows:
d_k : \mathbb{R}^N \to \mathbb{R}
Each d k represents the distance from a point to its k-th nearest neighbor, corresponding to the maximum distance within its k-neighbor set. We use the Manhattan distance, computed as the absolute sum of coordinate-wise differences. The rationale for using Manhattan distance is grounded in the work of Aggarwal et al. [12] on distance metrics in high-dimensional spaces. Compared to Euclidean distance, Manhattan distance lowers the density peak while spreading values more broadly, resulting in a longer-tailed distribution. This reduces the likelihood of misclassifying inliers as outliers. In high-dimensional settings, this effect becomes more pronounced, as certain data points—sometimes termed “hubs”—tend to emerge as nearest neighbors for many other points. Such uneven neighbor distribution contributes to the skewness observed in the k-th nearest neighbor distances [46].
As dimensionality increases, L 2 (Euclidean) distances tend to concentrate due to an exaggeration effect that distorts the relative positioning of outliers. In contrast, L 1 (Manhattan) distance is more robust to this effect and better captures the skewness and variability inherent in the data [12]. This distinction is particularly important under the curse of dimensionality, where KNN distances become increasingly equidistant. This equidistance causes distances to shrink and induces a positively skewed distribution [12]. As shown in Equation (3),
\frac{\max(d_k) - \min(d_k)}{\min(d_k)} \to 0 \quad \text{as } n \to \infty \quad (\text{for } L_2)
and discussed in Aggarwal et al. [12], substituting L 2 with L 1 preserves a broader spread of distances, slowing the convergence toward uniformity and mitigating the equidistant effect.
In addition to these empirical observations, the skewness of the distance distribution has a theoretical justification. The Manhattan distance between two independent random points $X, X' \in \mathbb{R}^p$ decomposes as
D = \| X - X' \|_1 = \sum_{j=1}^{p} Z_j , \qquad Z_j = | X_j - X'_j | \ge 0 .
If each coordinate difference is symmetric with finite moments, then each $Z_j$ has strictly positive third central moment. Since cumulants are additive, we obtain
\mathbb{E}[D] = p\, \mathbb{E}[Z_1], \qquad \operatorname{Var}(D) = p\, \operatorname{Var}(Z_1), \qquad \kappa_3(D) = p\, \kappa_3(Z_1),
which implies positive skewness for any finite dimension:
\gamma_1(D) = \frac{\kappa_3(D)}{\operatorname{Var}(D)^{3/2}} = \frac{\kappa_3(Z_1)}{\operatorname{Var}(Z_1)^{3/2}} \cdot \frac{1}{\sqrt{p}} > 0 .
Thus, even if standardized distances may approach normality under the Central Limit Theorem, the raw distances necessarily remain nonnegative and right-skewed [47,48].
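The additivity argument above can be checked numerically; the sketch below assumes standard-normal coordinates and p = 50 (illustrative values, not tied to any dataset in this paper) and compares the empirical skewness of D with the 1/√p prediction.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(1)
p, n_pairs = 50, 100_000  # assumed dimension and Monte Carlo size

# Manhattan distance between independent standard-normal points.
X = rng.standard_normal((n_pairs, p))
Xp = rng.standard_normal((n_pairs, p))
D = np.abs(X - Xp).sum(axis=1)

# Per-coordinate term Z = |X_j - X'_j| and the 1/sqrt(p) scaling prediction.
Z = np.abs(rng.standard_normal(n_pairs) - rng.standard_normal(n_pairs))
print("empirical skewness of D :", skew(D))
print("gamma_1(Z) / sqrt(p)    :", skew(Z) / np.sqrt(p))  # the two values are close
```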
A second source of skewness arises from the order-statistic properties of $d_k$. For a fixed reference point $x$, the $k$-th neighbor distance has density
f_{(k)}(r) = \frac{(n-1)!}{(k-1)!\,(n-1-k)!}\, f_R(r)\, F_R(r)^{\,k-1} \big( 1 - F_R(r) \big)^{\,n-1-k},
supported on $[0, \infty)$ [49,50]. When aggregated over all $x_i$, the resulting set $\{ d_k(x_i) \}$ forms a mixture of such order-statistic distributions [51], naturally generating a long right tail associated with sparse or anomalous regions.
Given these theoretical foundations, our modeling choice naturally follows. Parametric methods offer several advantages over non-parametric approaches, including clearer interpretability, greater accuracy, and more efficient computation. Parametric analysis assumes that data arise from a specific underlying distribution, and in our case, the set of d k values meets the structural requirements for such modeling by being nonnegative, right-tailed, and exhibiting positive skewness consistent with established parametric families.
Because the majority of observations fall within concentrated regions of the feature space, while a smaller number of samples lie in sparser or anomalous regions, the resulting density of d k displays a steep rise near the modal distance followed by a long gradual decay. This behavior aligns with the skewed distance distributions observed in high-dimensional settings and with the distance concentration phenomenon, described as a declining ratio between spread and magnitude of distances [46]. The resulting one-dimensional data thus exhibit a persistent positive skew, with an extended right tail reflecting values that deviate from both the mean and median.
Accordingly, to capture and characterize this structure, we fit a family of positively skewed distributions to the transformed one-dimensional distance data. This enables us to use parametric scoring rules based on calibrated tail probabilities, rather than raw distances alone.

3.1.1. Assumptions and Limitations of Parametric Modeling

The parametric fitting strategy implicitly assumes that the empirical distribution of d k can be reasonably approximated by a known parametric family. This assumption may fail in datasets where the d k statistics exhibit multimodality, strong heterogeneity across local subregions, or heavy contamination, in which case a global one-family parametric fit may be inadequate. Under such circumstances, purely non-parametric methods (e.g., LOF, COF, or KNN) may perform either comparably or even favorably despite their susceptibility to distance concentration effects. Our empirical findings in Section 3.3 and Section 4 suggest that when the parametric fit is statistically well-aligned with the empirical distance distribution, the CDF-based scoring exhibits a clear ranking advantage; however, this advantage may diminish or disappear in regimes of strong model misspecification.

3.1.2. From Positioning to Theory

The discussion above motivates a hybrid scoring rule: reduce high-dimensional data to a one-dimensional summary and then apply a parametric score that aligns with the inlier distribution. We now provide a theoretical justification for this choice by comparing a distribution-aware score to a purely geometric one. Specifically, we consider two outlier scores for a point x: (i) the CDF score F ( x ) of the inlier distribution, which is a monotone transform of the optimal likelihood ratio, and (ii) the KNN distance score d k ( x ) , a standard nonparametric baseline. Using ROC–AUC as our comparison criterion,
\mathrm{AUC}(T) = \Pr\big( T(X_{\text{out}}) > T(X_{\text{in}}) \big),
we show that, under mild regularity conditions (continuous and strictly positive densities), the CDF-based score strictly dominates the KNN distance: it yields fewer pairwise misorderings between outliers and inliers and, therefore, achieves a larger AUC. This result formalizes why a univariate, distribution-aligned score can outperform distance-based heuristics, particularly in regimes where distances lose contrast.
To rigorously substantiate this intuition, we now present a formal result that characterizes when and why CDF-based scoring functions outperform KNN distances in ranking performance. We state the result next.

3.2. Behavior of Continuous Density Function Versus Non-Parametric for ROC-AUC Scores

This section presents the CDF Superiority Theorem and supports it through both numerical examples and simulation-based validation. We begin by examining the mathematical relationship between non-parametric raw distances and their CDF-transformed counterparts.

3.2.1. Comparing CDF-Based Scores and Raw KNN Distances

To clearly establish the validity of comparing CDF-based anomaly scores with non-parametric raw $d_k$ distances, we first state the assumptions under which this comparison holds. Specifically, we assume the following: (1) The transformed distances $d_k$ are nonnegative random variables with a continuous and strictly increasing cumulative distribution function $F_{d_k}(r)$ on $[0, \infty)$; (2) The anomaly score is defined as a monotone transformation $s(x) = 1 - F_{d_k}(d_k(x))$; (3) Anomalies are characterized by extreme (large) distance values relative to the bulk of the distribution.
Under these assumptions, the CDF-based score preserves the rank ordering of raw distances, i.e.,
d_k(x_i) > d_k(x_j) \;\Longleftrightarrow\; s(x_i) < s(x_j),
ensuring that any threshold-based anomaly detection using either $d_k$ or its CDF-derived score is equivalent in the sense of order-preserving decision boundaries.
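A minimal numerical check of this order-preserving equivalence, assuming (for illustration only) a gamma model for the transformed distances:

```python
import numpy as np
from scipy.stats import gamma
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)

# Assumed setting: inlier distances ~ Gamma(2, scale=1), outlier distances stretched.
d_in = gamma.rvs(a=2.0, scale=1.0, size=300, random_state=rng)
d_out = gamma.rvs(a=2.0, scale=3.0, size=30, random_state=rng)
d = np.concatenate([d_in, d_out])
y = np.concatenate([np.zeros(300), np.ones(30)])

# CDF-based score s(x) = 1 - F(d_k(x)); small s flags outliers, so we rank by -s.
F = gamma(a=2.0, scale=1.0).cdf
s = 1.0 - F(d)

print(roc_auc_score(y, d))    # ranking by raw distance
print(roc_auc_score(y, -s))   # ranking by the CDF-derived score: identical AUC
```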
Having established the ranking equivalence relationship, we now formalize the theoretical advantage of the CDF-based approach.

3.2.2. The CDF Superiority Theorem

Theorem 1. 
Let $X_{\text{in}} \sim f$ and $X_{\text{out}} \sim g$ be independent draws from two continuous densities $f, g$ on $\mathbb{R}$, each strictly positive everywhere. We compare two outlier-scoring rules:
  • CDF score:
    F(x) = \int_{-\infty}^{x} f(t) \, dt .
  • KNN distance score:
    d_k(x) = \text{distance from } x \text{ to its } k\text{th nearest neighbor in an i.i.d. sample } X_1, \ldots, X_n \sim f .
We use the standard ROC–AUC definition
\mathrm{AUC}(T) = \Pr\big( T(X_{\text{out}}) > T(X_{\text{in}}) \big) .
Then
\mathrm{AUC}(F) > \mathrm{AUC}(d_k) .
Proof. 
Let $X_{\text{in}} \sim f$ and $X_{\text{out}} \sim g$ be independent draws from continuous, strictly positive densities on $\mathbb{R}$. For any scoring rule T, define its mis-ordering set
E_T = \{ (x_0, x_1) \in \mathbb{R}^2 : T(x_1) \le T(x_0) \} .
Then
\mathrm{AUC}(T) = \Pr\big( T(X_{\text{out}}) > T(X_{\text{in}}) \big) = 1 - \Pr\big( (X_{\text{in}}, X_{\text{out}}) \in E_T \big) = 1 - \iint_{E_T} f(x_0)\, g(x_1) \, dx_0 \, dx_1 .
(1) CDF Score.
For the CDF score $F(x) = \int_{-\infty}^{x} f(t)\, dt$, F is strictly increasing, hence $E_F = \{ (x_0, x_1) : x_1 \le x_0 \}$. Setting
\mu_F = \iint_{x_1 \le x_0} f(x_0)\, g(x_1) \, dx_0 \, dx_1 = \Pr( X_{\text{out}} \le X_{\text{in}} ) ,
Equation (4) yields $\mathrm{AUC}(F) = 1 - \mu_F$.
(2) KNN Distance Score.
Let $d_k(x)$ be the distance from x to its kth nearest neighbor within an i.i.d. sample $X_1, \ldots, X_n \sim f$. Its misordering set is $E_{d_k} = \{ (x_0, x_1) : d_k(x_1) \le d_k(x_0) \}$ and
\mathrm{AUC}(d_k) = 1 - \iint_{E_{d_k}} f g .
Split
\iint_{E_{d_k}} f g = \underbrace{\iint_{x_1 \le x_0} f g}_{\mu_F} + \underbrace{\iint_{x_1 > x_0,\; d_k(x_1) \le d_k(x_0)} f g}_{\delta} .
We claim $\delta > 0$. Fix $x_0 < x_1$. For $j \in \{0, 1\}$ let $Y_i^{(j)} = | X_i - x_j |$ ($i = 1, \ldots, n$). Each $Y_i^{(j)}$ has a continuous, strictly positive density on $(0, \infty)$; the kth nearest-neighbor distance is the kth order statistic $d_k(x_j) = Y_{(k)}^{(j)}$. The vector $( Y_1^{(0)}, \ldots, Y_n^{(0)}, Y_1^{(1)}, \ldots, Y_n^{(1)} )$ has a positive joint density on $(0, \infty)^{2n}$, and the smooth, one-to-one a.e. mapping to $( Y_{(k)}^{(0)}, Y_{(k)}^{(1)} ) = ( d_k(x_0), d_k(x_1) )$ implies that the pair $( d_k(x_0), d_k(x_1) )$ has a continuous joint density h that is positive on $(0, \infty)^2$. Therefore,
\Pr\big( d_k(x_1) \le d_k(x_0) \big) = \iint_{y_1 \le y_0} h(y_0, y_1) \, dy_1 \, dy_0 > 0 .
Since $f(x_0)\, g(x_1)$ is strictly positive for all $x_0 < x_1$, integrating this strictly positive probability over the set $\{ x_1 > x_0 \}$ yields $\delta > 0$.
(3) Conclusion.
We have
\mathrm{AUC}(d_k) = 1 - ( \mu_F + \delta ) < 1 - \mu_F = \mathrm{AUC}(F) .
Hence, the CDF score attains a strictly larger ROC–AUC than the KNN distance score. □

3.2.3. Extension to Other Nonparametric Methods

The same argument applies to any other non-parametric outlier score. Here is an outline of the proof.
  • ROC–AUC cares only about pairwise ordering.
    \mathrm{AUC}(T) = \Pr\big( T(X_{\text{out}}) > T(X_{\text{in}}) \big) .
  • The CDF score is strictly monotonic in x.
    $F(x) = \Pr_f(X \le x)$ increases strictly, so it never misorders any $x_0 < x_1$.
  • Any non-parametric method must misorder a positive-measure set of pairs.
    Estimated from finite data (LOF, isolation forest, etc.), it cannot perfectly reproduce the CDF ordering, so there exists $x_0 < x_1$ with $T(x_1) \le T(x_0)$ with positive probability.
  • Strict AUC gap follows.
    Let $\mu_F = \Pr( X_{\text{out}} \le X_{\text{in}} )$ and $\mu_{\text{np}} > \mu_F$ be the misorder probability of the non-parametric score. Then,
    \mathrm{AUC}(F) = 1 - \mu_F , \qquad \mathrm{AUC}(T_{\text{np}}) = 1 - \mu_{\text{np}} ,
    so $\mathrm{AUC}(F) > \mathrm{AUC}(T_{\text{np}})$.
Remark 1
Because any non-parametric rule must misorder some inlier–outlier pairs with positive probability, its ROC–AUC is strictly lower than the ideal CDF rule’s.

3.2.4. Significance of the CDF Superiority Theorem

Under mild regularity conditions, assuming continuous and strictly positive densities, the CDF Superiority Theorem provides a theoretical guarantee for distribution-aware scoring in anomaly detection. Ranking observations by the inlier CDF $F(x)$—for example, using the tail score $1 - F(x)$—achieves superior anomaly ranking performance relative to raw KNN distances. The key insight follows from the probability integral transform: if $X \sim f$, then $U = F(X)$ is uniformly distributed on $[0, 1]$. Since ROC analysis depends only on the ordering of scores and is invariant under strictly monotonic transformations [10,52], any monotone function of $F(x)$ preserves the same ROC curve. Consequently, mapping data to a one-dimensional statistic aligned with the inlier distribution enables a scoring method that is theoretically matched to the underlying data distribution. When the model is reasonably well specified, such CDF-based scoring methods yield consistently improved anomaly ranking capability.
To evaluate this theoretical advantage in practice, we next consider the appropriateness of ROC–AUC as the ranking performance measure.

3.2.5. Justification for ROC AUC as Evaluation Metric

The ROC framework has been widely established as a threshold-independent evaluation measure for classification and ranking tasks [9,53,54]. ROC–AUC is especially meaningful in anomaly detection because it evaluates performance across all possible decision thresholds, avoiding the bias associated with any fixed cutoff.
Moreover, ROC–AUC admits a probabilistic interpretation: it represents the probability that a randomly selected anomalous point receives a higher anomaly score than a randomly selected nominal point [9]. Since our method produces a scalar ordering of data points—whether via raw distances $d_k$ or their CDF-based transformation $1 - F_{d_k}$—ROC–AUC naturally quantifies the quality of the ranking induced by these scores.
Thus, given (i) the order-preserving relationship between $d_k$ and $1 - F_{d_k}$ under monotonic transformations, and (ii) the well-established theoretical interpretation of ROC–AUC as a measure of ranking quality [9,53], ROC–AUC provides an appropriate and theoretically grounded metric for evaluating anomaly scoring performance in our study.

3.2.6. Theoretical Support for Parametric Tests

The CDF Superiority Theorem in this paper shows that, under mild regularity and a correctly (or well) specified inlier model F, ranking observations by the inlier CDF—equivalently, by the tail score $p(x) = 1 - F(x)$—achieves a strictly higher ROC–AUC than geometric KNN distance scores. This result provides a principled foundation for parametric outlier procedures that base decisions on model-derived tail probabilities or residuals. In particular, it theoretically supports the multiple-outlier tests of Bagdonavičius and Petkevičius [55], which assume a parametric family for the inlier distribution and identify extreme observations via distribution-aware statistics on orderings of the sample. When the assumed family approximates the true inlier law, our theorem predicts that CDF-based rankings (and the associated p-value thresholds) are optimal in the ranking sense, explaining the empirical effectiveness of model-based multi-outlier tests and motivating their use over purely distance-based heuristics.

3.3. Worked Examples

In this subsection, we illustrate how disordering arises in distance-based anomaly scoring through explicit worked examples. These examples demonstrate that purely geometric scoring can assign identical or nearly identical anomaly values to points that should be distinct in ranking. We first present transparent one-dimensional cases, and then show that the same ambiguity persists in higher-dimensional settings.

3.3.1. 1-D KNN Disordering Example

Dataset (1-D): Inliers $\{1, 2, 3\}$; Outliers $\{8, 9\}$. Choose $k = 2$. Compute 2-NN distances among $\{1, 2, 3, 8, 9\}$:
d_2(1) = 2, \quad d_2(2) = 1, \quad d_2(3) = 2, \quad d_2(8) = 5, \quad d_2(9) = 6 .
CDF ordering demands $x_0 < x_1 \Rightarrow F(x_0) < F(x_1)$. Pick $(x_0, x_1) = (1, 3)$: since $1 < 3$, $F(1) < F(3)$, yet
d_2(1) = d_2(3) = 2 \;\Rightarrow\; d_2(3) \le d_2(1) ,
so the KNN score misorders that pair relative to the CDF ordering.
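The 2-NN distances above can be reproduced with a short brute-force computation (no assumptions beyond the five listed values):

```python
import numpy as np

points = np.array([1.0, 2.0, 3.0, 8.0, 9.0])
k = 2

for x in points:
    dists = np.sort(np.abs(points - x))[1:]  # drop the zero self-distance
    print(f"d_{k}({x:g}) = {dists[k - 1]:g}")
# d_2(1) = 2, d_2(2) = 1, d_2(3) = 2, d_2(8) = 5, d_2(9) = 6
```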

3.3.2. 1-D LOF Disordering Example ( k = 2 )

Dataset: $\{0, 1, 4\}$ with 0, 1 as inliers and 4 as outlier.
Reachability distances:
\text{reach-dist}_2(0, 1) = \max(|0 - 1|, 3) = 3, \quad \text{reach-dist}_2(0, 4) = \max(4, 4) = 4, \quad \text{reach-dist}_2(1, 0) = \max(1, 4) = 4, \quad \text{reach-dist}_2(1, 4) = \max(3, 4) = 4, \quad \text{reach-dist}_2(4, 1) = \max(3, 3) = 3, \quad \text{reach-dist}_2(4, 0) = \max(4, 4) = 4 .
Local reachability densities:
\mathrm{lrd}_2(0) = \frac{1}{(3 + 4)/2} = \frac{2}{7} \approx 0.2857, \quad \mathrm{lrd}_2(1) = \frac{1}{(4 + 4)/2} = 0.25, \quad \mathrm{lrd}_2(4) = \frac{2}{7} \approx 0.2857 .
LOF scores:
\mathrm{lof}_2(0) = \frac{1}{2} \left( \frac{0.25}{0.2857} + \frac{0.2857}{0.2857} \right) = 0.9375, \quad \mathrm{lof}_2(1) = \frac{1}{2} \left( \frac{0.2857}{0.25} + \frac{0.2857}{0.25} \right) \approx 1.1428, \quad \mathrm{lof}_2(4) = 0.9375 .
Pick $(x_0, x_1) = (0, 4)$: although $0 < 4 \Rightarrow F(0) < F(4)$, we have $\mathrm{lof}_2(0) = \mathrm{lof}_2(4)$, so LOF misorders that inlier–outlier pair.
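For completeness, the following sketch recomputes these LOF values directly from the reachability-distance definitions (a plain re-implementation for this three-point example, not a library call):

```python
import numpy as np

pts = np.array([0.0, 1.0, 4.0])
k = 2
n = len(pts)

dist = np.abs(pts[:, None] - pts[None, :])
# k-distance of each point: the k-th smallest nonzero pairwise distance.
k_dist = np.sort(dist, axis=1)[:, k]
# With three points, every point's 2-neighborhood is simply the other two points.
neighbors = [[j for j in range(n) if j != i] for i in range(n)]

def reach_dist(i, j):
    """Reachability distance of point i from neighbor j."""
    return max(dist[i, j], k_dist[j])

lrd = np.array([
    1.0 / np.mean([reach_dist(i, j) for j in neighbors[i]]) for i in range(n)
])
lof = np.array([
    np.mean([lrd[j] for j in neighbors[i]]) / lrd[i] for i in range(n)
])
print(dict(zip(pts, np.round(lof, 4))))
# {0.0: 0.9375, 1.0: 1.1429, 4.0: 0.9375}
```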
In the 1-D setting, the mechanism of disordering is completely exposed. When using KNN distances or LOF scores on scalar samples, ties or near-ties readily occur simply because different points may have identical local spacing along the line. These 1-D constructions make the core issue mathematically clear and easily traceable. The essential insight from these 1-D examples is that when distances collapse onto a small set of discrete values, rank ambiguity becomes inevitable. While this effect may seem attributable to the simplicity of the geometry, we next show that it persists even after moving away from one-dimensional space.

3.3.3. Extension to 3-D Case

While the 1-D case makes the disordering mechanism entirely transparent, we now extend the analysis to a 3-D dataset to demonstrate that the same misranking behavior emerges even in higher-dimensional settings where geometric distance structures are more complex.
Consider the following set of points in $\mathbb{R}^3$: inliers $A = (0, 0, 0)$, $B = (2, 2, 0)$, $C = (2, 2, 2)$ and outliers $D = (4, 4, 4)$, $E = (7, 6, 5)$. Using the Manhattan distance with $k = 2$, we compute the 2-nearest-neighbor distances for each point. Explicitly evaluating and sorting the pairwise distances, we obtain:
d_2(A) = 6, \quad d_2(B) = 4, \quad d_2(C) = 6, \quad d_2(D) = 6, \quad d_2(E) = 12 .
We observe that the outlier D receives the same score as two inliers, A and C, i.e.,
d_2(D) = d_2(A) = d_2(C) = 6 .
Thus, KNN distance—a purely non-parametric statistic—fails to distinguish D from points lying clearly deeper within the inlier cluster.
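The 2-NN Manhattan distances above can be verified directly, for example with the brute-force sketch below:

```python
import numpy as np

pts = np.array([[0, 0, 0],   # A (inlier)
                [2, 2, 0],   # B (inlier)
                [2, 2, 2],   # C (inlier)
                [4, 4, 4],   # D (outlier)
                [7, 6, 5]])  # E (outlier)
k = 2

# Pairwise Manhattan (L1) distances, then the k-th nearest-neighbor distance per point.
dist = np.abs(pts[:, None, :] - pts[None, :, :]).sum(axis=2)
d_k = np.sort(dist, axis=1)[:, k]  # column 0 is the zero self-distance
print(dict(zip("ABCDE", d_k)))     # {'A': 6, 'B': 4, 'C': 6, 'D': 6, 'E': 12}
```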

3.3.4. 3-D LOF Scoring

We also evaluate LOF on this dataset to analyze whether local-density comparison can resolve this ambiguity. Using the standard reachability and local reachability density definitions, we compute:
\mathrm{LRD}(A) = 0.20, \quad \mathrm{LRD}(B) = 0.1667, \quad \mathrm{LRD}(C) = 0.20, \quad \mathrm{LRD}(D) = 0.1111, \quad \mathrm{LRD}(E) = 0.1111 ,
yielding the corresponding LOF scores:
\mathrm{LOF}(A) = 0.92, \quad \mathrm{LOF}(B) = 1.20, \quad \mathrm{LOF}(C) = 0.69, \quad \mathrm{LOF}(D) = 1.40, \quad \mathrm{LOF}(E) = 1.40 .
Here, LOF successfully elevates the anomaly scores for D and E relative to the inliers. However, the improvement is only partial: LOF relies on local neighborhood densities and still lacks a global distributional reference, causing sensitivity to local sampling irregularities and neighborhood selection.

3.3.5. Connection to CDF-Based Scoring

To contrast these geometric scores with our CDF-based approach, we consider an inlier density whose probability mass is centered at the origin. Using the Euclidean radius,
R(x) = \sqrt{x_1^2 + x_2^2 + x_3^2} ,
we obtain the ordering:
R(A) < R(B) < R(C) < R(D) < R(E) .
Applying the tail score $s(x) = 1 - F_R(R(x))$ yields strictly ordered anomaly scores:
s(E) > s(D) > s(C) > s(B) > s(A) ,
correctly ranking outliers above inliers. Unlike KNN distances (which tied A, C, and D) and unlike LOF (which is constrained by local density effects), the CDF-based score uses the entire fitted inlier distribution and thereby eliminates rank ambiguity arising from local geometric coincidence.
These examples collectively illustrate that disordering is not an artifact of low-dimensional illustrations, but a fundamental consequence of geometric distance concentration and neighborhood symmetry. Distribution-aware scoring using the fitted CDF provides a principled way to break ties and establish a globally consistent anomaly ranking.

3.4. Remark

KNN distance can assign identical scores to $x_0 < x_1$ even though $F(x_0) < F(x_1)$. LOF can assign the same score to an inlier and an outlier, violating the true CDF ranking.
Thus, any nonparametric method like KNN or LOF must strictly underperform the CDF-based score in ROC–AUC: it misorders some positive-probability inlier–outlier pairs.
In practice we do not compute the integral
\Pr\big( d_k(x_1) \le d_k(x_0) \big) = \iint_{y_1 \le y_0} h(y_0, y_1) \, dy_1 \, dy_0
directly, but rather approximate it by the fraction of misordered pairs in the finite data set. Concretely, if we have N inliers and M outliers, we form all $N \times M$ pairs $(x_{\text{in}}, x_{\text{out}})$ and compute
\hat{p} = \frac{1}{NM} \sum_{i=1}^{N} \sum_{j=1}^{M} \mathbf{1}\big\{ d_k(x_{\text{out}}^{(j)}) \le d_k(x_{\text{in}}^{(i)}) \big\} .
Even though the true probability $p > 0$, it is quite possible, especially when N and M are small or the scores contain many ties, to observe zero misordered pairs in the sample, i.e., $\hat{p} = 0$. In that case the empirical AUC attains its maximum value of 1.0.
The theorem guarantees that p > 0 in the population limit, that is, as the number of data points approaches infinity. In finite samples, however, random fluctuations may cause the empirical estimate p ^ to be zero simply because, by chance, no misordered pairs are observed within the Monte Carlo sample.
As N and M grow larger, or as the experiment is repeated, the chance that p ^ = 0 becomes smaller, roughly at an exponential rate in N M . However, this probability does not disappear entirely until the sample size tends to infinity.
In conclusion, the sample size must be sufficiently large to mitigate random misorderings arising from sampling variability. Since the CDF is estimated probabilistically, finite-sample fluctuations can cause certain points to be overestimated and thus mistakenly classified as outliers. In practice, a larger sample reduces this Monte Carlo noise and yields a ranking that better reflects the true ordering implied by the underlying distributions.
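A minimal sketch of this finite-sample estimate, using hypothetical gamma-distributed distance scores for the N inliers and M outliers (the parameter values are illustrative only):

```python
import numpy as np

def misorder_fraction(d_in, d_out):
    """Fraction of (inlier, outlier) pairs with d_k(x_out) <= d_k(x_in)."""
    d_in = np.asarray(d_in)[:, None]    # shape (N, 1)
    d_out = np.asarray(d_out)[None, :]  # shape (1, M)
    return np.mean(d_out <= d_in)

rng = np.random.default_rng(3)
# Hypothetical distance scores; outliers only mildly separated from inliers.
d_in = rng.gamma(shape=2.0, scale=1.0, size=30)
d_out = rng.gamma(shape=2.0, scale=2.0, size=5)

p_hat = misorder_fraction(d_in, d_out)
print("p_hat =", p_hat, " empirical AUC =", 1.0 - p_hat)
```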

3.5. Monte Carlo Simulation of CDF Versus Non-Parametric ROC-AUC Scores

To evaluate the empirical performance of the CDF-based scoring rule and connect it to the theoretical result in Section 3.2, we ran Monte Carlo experiments using three data–generating models: Gamma, Inverse Gaussian, and Skew–Normal. For each distribution, 200 inlier samples were generated and MLE was applied to estimate the model parameters. These fitted parameters defined the reference inlier CDF used for scoring.
Each simulation run then produced 400 evaluation samples consisting of inliers mixed with injected outliers drawn from the fixed parameter settings in Table 1a–c. This procedure was repeated 500 times, and in each run, we computed ROC–AUC values for KNN, LOF, ABOD, COF, and the CDF-based approach. The results appear in Table 2a–c.
Across all settings, the CDF-based method gives the strongest average ROC–AUC performance and, in several cases, shows reduced variability across repetitions. With the Gamma model (Table 2a), the improvement over KNN and the other neighbor-based methods is consistent and substantial. With the Inverse Gaussian model (Table 2b), CDF performs slightly better than KNN and the other neighbor-based methods. With the Skew–Normal model (Table 2c), where asymmetry in cluster geometry is more pronounced, CDF again provides the most effective ranking of inliers and outliers. These results suggest that the advantage of the CDF transformation persists across a range of underlying distributions.
The benefit of the CDF approach comes from transforming distances onto a probability scale, which reduces artifacts that can arise from raw neighbor distances. Even with imperfect parametric fits due to finite sample size, this transformation produces a more stable and interpretable ordering of observations.
It is worth noting that the parameter configurations in Table 1a–c are not chosen to engineer specific theoretical tail behaviors. Rather, they create differences in the local neighborhood structure: inliers occupy tight regions of the space, whereas outliers are positioned to yield larger KNN distances. In this setting, anomalies naturally occur in the upper tail of the empirical distance distribution, and the CDF transformation emphasizes this separation.
In terms of computational efficiency, the CDF-based scoring rule incurs relatively low cost: after fitting the inlier distribution, each test observation is evaluated through a direct CDF computation. In contrast, KNN and the other neighbor-based methods require repeated pairwise distance calculations against the training data, resulting in higher computational burden as the dataset increases. This difference in cost is particularly relevant for large-scale or streaming applications. Furthermore, the consistent improvement in the CDF-based method across all three simulation settings (Gamma, Inverse Gaussian, and Skew–Normal) indicates that the ordering advantage is not tied to a specific parametric model and persists across differing underlying distributional structures.
Finally, the theoretical ordering guarantee of Section 3.2 applies under idealized distributional assumptions. The empirical results in this section show that the CDF-based method maintains its ranking advantage even when those assumptions are relaxed and the data include sampling variability and model mismatch. The complete Python code used for the Gamma, Inverse Gaussian, and Skew–Normal experiments, including parameter fitting and ROC–AUC evaluation, is provided in Appendix A to ensure reproducibility.
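While the full experimental code appears in Appendix A, a single simulation path under the Gamma setting can be sketched as follows; the parameter values below are illustrative placeholders rather than the settings of Table 1.

```python
import numpy as np
from scipy.stats import gamma
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)

# Step 1: fit the reference inlier model on a training sample (MLE).
train = gamma.rvs(a=2.0, scale=1.0, size=200, random_state=rng)
a_hat, loc_hat, scale_hat = gamma.fit(train, floc=0)

# Step 2: build an evaluation set of inliers mixed with injected outliers.
x_in = gamma.rvs(a=2.0, scale=1.0, size=360, random_state=rng)
x_out = gamma.rvs(a=6.0, scale=2.0, size=40, random_state=rng)
x = np.concatenate([x_in, x_out]).reshape(-1, 1)
y = np.concatenate([np.zeros(360), np.ones(40)])

# CDF-based score: right-tail probability under the fitted model (small = anomalous).
tail = 1.0 - gamma.cdf(x.ravel(), a_hat, loc=loc_hat, scale=scale_hat)

# KNN baseline: distance to the k-th nearest neighbor within the evaluation set.
k = 5
nn = NearestNeighbors(n_neighbors=k + 1, metric="manhattan").fit(x)
d_k = nn.kneighbors(x)[0][:, k]

print("CDF  AUC:", roc_auc_score(y, -tail))
print("KNN  AUC:", roc_auc_score(y, d_k))
```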

4. Our Parametric Outlier-Detection Framework

We propose a two-stage pipeline: (i) reduce the data to a one-dimensional distance statistic that preserves the degree of “outlier-ness’’ even in high dimension, and (ii) fit a parametric family to that statistic and score points by calibrated tail probabilities. This design keeps computation light, retains interpretability, and—by working with a 1-D summary—avoids the distance–concentration pitfalls of high-dimensional spaces [5].

4.1. Dimensionality Reduction via KNN–Manhattan

For each observation $x \in \mathbb{R}^n$, we compute the distance to its $k$th nearest neighbor under the $L_1$ metric,
d_k(x) = \min_{x^{(k)} \in N_k(x)} \big\| x - x^{(k)} \big\|_1 .
Using $L_1$ (Manhattan) rather than $L_2$ (Euclidean) helps retain spread and ranking contrast as $n$ grows [12]; it also mitigates hubness, where a few points become nearest neighbors of many others and distort scores [46]. Empirically, the distribution of $\{ d_k(x_i) \}$ is typically right-skewed, which motivates the parametric fits below.
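A minimal sketch of this reduction step, using scikit-learn's NearestNeighbors with the Manhattan metric (the data and the choice k = 5 are placeholders):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_manhattan_distances(X, k):
    """Reduce each row of X to its distance to the k-th nearest neighbor (L1 metric)."""
    nn = NearestNeighbors(n_neighbors=k + 1, metric="manhattan").fit(X)
    dists, _ = nn.kneighbors(X)  # column 0 is the zero distance to the point itself
    return dists[:, k]

# Example: 500 points in 50 dimensions reduced to a single distance vector.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 50))
d_k = knn_manhattan_distances(X, k=5)
print(d_k.shape)  # (500,)
```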

4.2. Fitting Positively Skewed Distributions

Let $D = \{ d_k(x_i) \}_{i=1}^{n}$ denote the dataset of one-dimensional distances. We fit D using a family of positively skewed distributions via the maximum likelihood estimation (MLE) method. This family includes Normal-like distributions such as the log-normal, log-Student-t, log-Laplace, log-logistic, and log-skew-normal, as well as other positively skewed distributions including the exponential, chi-squared ($\chi^2$), gamma, Weibull-minimum, inverse Gaussian, Rayleigh, Wald, Pareto, Nakagami, logistic, power-law, and skew-normal distributions. Denoting a generic density by $p(\cdot\,; \eta)$ with parameter $\eta$, we maximize
\hat{\eta} = \arg\max_{\eta} \sum_{i=1}^{n} \log p\big( d_k(x_i); \eta \big) .
For example, the gamma PDF is $p(d; \kappa, \theta) = \frac{1}{\Gamma(\kappa)\,\theta^{\kappa}}\, d^{\kappa - 1} e^{-d/\theta}$, and the Weibull PDF is $p(d; \lambda, \beta) = \frac{\beta}{\lambda} \left( \frac{d}{\lambda} \right)^{\beta - 1} e^{-(d/\lambda)^{\beta}}$. After fitting, we score a point by its right-tail probability under the fitted CDF $\hat{F}$,
s(x) = 1 - \hat{F}\big( d_k(x) \big) ,
which is a calibrated, distribution-aligned p-value. Rankings are invariant to monotone transforms, so we can equivalently use $\log s(x)$.
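A minimal sketch of the fit-and-score step for one candidate family (the gamma distribution via SciPy; the family choice and the synthetic distance vector are assumptions of the example):

```python
import numpy as np
from scipy.stats import gamma

def gamma_tail_scores(d_k):
    """Fit a gamma model to the distance vector and return right-tail scores 1 - F_hat."""
    a_hat, loc_hat, scale_hat = gamma.fit(d_k, floc=0)  # MLE with location fixed at 0
    return 1.0 - gamma.cdf(d_k, a_hat, loc=loc_hat, scale=scale_hat)

rng = np.random.default_rng(1)
d_k = rng.gamma(shape=3.0, scale=1.5, size=1000)  # stand-in for real KNN distances
scores = gamma_tail_scores(d_k)
# Small tail probabilities indicate the most anomalous observations.
print(np.argsort(scores)[:5])
```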
As shown in Table A1, Table A2, Table A3 and Table A4, the vast majority of fitted distributions achieve an average $R^2$ above 90%, and many exceed 95%. In our implementation, we set an empirical cutoff of 85% for the average $R^2$. For the two datasets where the best-fitting distribution yields an average $R^2$ below 85%, we intentionally retain these cases to evaluate the robustness of our parametric scoring method under imperfect fits. Even in these cases, the tail ordering of distances—the key requirement of the CDF Superiority Theorem—remains valid, indicating that the fitted models still provide correct anomaly ranking in the distribution tails. Consistent with this, the average ROC–AUC on both the literature datasets and the semantic datasets remains high, demonstrating that our proposed approach is robust even when the global $R^2$ of the QQ fit is not ideal.

Log–Transform and Normal-like Fits

We follow the ladder-of-powers guideline that lower-power transforms (log, square-root) reduce positive skew [19]. When the distance sample $\{ d_k(x_i) \}$ is strictly positive and right-skewed, we set $y_i = \log d_k(x_i)$ and fit location-scale families on $\{ y_i \}$: normal $N(\mu, \sigma^2)$, Student-$t(\mu, \sigma, \nu)$ for heavier tails, and logistic; to absorb any residual asymmetry we also include the skew-normal with shape parameter $\alpha$ [56]. Outlier scores are computed on the original scale via the fitted CDF of $Y = \log d_k(X)$,
s(x) = 1 - \hat{F}_Y\big( \log d_k(x) \big) ,
which is equivalent to using right-tail z-scores for normal-like fits. This “log-transform branch” complements the positive-skew families and improves robustness whenever the log transform approximately symmetrizes the distance distribution.
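The log-transform branch can be sketched analogously, here with a skew-normal fit on the log distances (again an illustrative family choice applied to synthetic data):

```python
import numpy as np
from scipy.stats import skewnorm

def log_skewnorm_tail_scores(d_k):
    """Fit a skew-normal model to log(d_k) and return right-tail scores on the original scale."""
    y = np.log(d_k)                      # requires strictly positive distances
    a_hat, loc_hat, scale_hat = skewnorm.fit(y)
    return 1.0 - skewnorm.cdf(y, a_hat, loc=loc_hat, scale=scale_hat)

rng = np.random.default_rng(2)
d_k = rng.lognormal(mean=0.5, sigma=0.4, size=1000)  # stand-in for real KNN distances
scores = log_skewnorm_tail_scores(d_k)
print(scores.min(), scores.max())
```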

4.3. Baseline Non-Parametric Methods

We implement a set of standard baselines widely used in the outlier detection literature, including KNN, LOF, SimplifiedLOF, LoOP, LDOF, ODIN, FastABOD, KDEOS, LDF, INFLO, and COF. For these 11 baseline methods, we report results under both $L_1$ and $L_2$ metrics. All methods score by decreasing density (or increasing distance) and are evaluated by ROC–AUC. In the real-data experiments, the parameter k for the KNN distance is not chosen arbitrarily or fixed a priori. Instead, it is selected in a data-driven way by scanning values of k from 2 to 69 and choosing the value that maximizes the empirical ROC–AUC on the given dataset. This procedure prevents arbitrary selection of k and ensures that KNN is evaluated under its best achievable configuration for each dataset.
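A sketch of this data-driven scan for the KNN baseline, assuming labeled data X, y are available (scikit-learn calls only; the toy data are placeholders):

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import NearestNeighbors

def best_k_knn_auc(X, y, k_min=2, k_max=69, metric="manhattan"):
    """Scan k and return the value maximizing the empirical ROC-AUC of the k-NN distance score."""
    nn = NearestNeighbors(n_neighbors=k_max + 1, metric=metric).fit(X)
    dists, _ = nn.kneighbors(X)  # column j holds the distance to the j-th neighbor
    aucs = {k: roc_auc_score(y, dists[:, k]) for k in range(k_min, k_max + 1)}
    best_k = max(aucs, key=aucs.get)
    return best_k, aucs[best_k]

# Toy illustration: a Gaussian cluster plus a small group of shifted outliers.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(size=(200, 10)), rng.normal(loc=4.0, size=(10, 10))])
y = np.r_[np.zeros(200), np.ones(10)]
print(best_k_knn_auc(X, y))
```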

4.4. Datasets

As mentioned in the last subsection of Section 2 (Literature Review), we evaluate both the baseline methods and our proposed approaches on 23 datasets, including 11 literature datasets and 12 semantic datasets. A descriptive summary of the two dataset types is provided in Table 3. All datasets were obtained from Campos et al. [17]. We work directly with these benchmark datasets in their original form as provided by Campos et al. [17] and hosted in the DAMI repository [57], and apply the same z-score standardization to each feature (zero mean and unit variance) as used in their evaluation framework. This ensures full consistency with the established preprocessing treatment in the prior study and enables fair comparison across methods. The semantic datasets, in particular, have been modified to better reflect real-world occurrences of outliers. Each dataset varies in the number of outliers included, ranging from as low as 0.2% to as high as 75%. Campos et al. [17] provided results for multiple levels of outlier percentages for most datasets; we therefore chose to focus on the highest outlier levels because they contain all observations rather than a subset.

5. Empirical Real-World Data Results

Before examining the outlier-detection performance summarized in Table 4, it is instructive to first assess how well the proposed parametric families capture the underlying data distributions. To this end, we analyze the goodness-of-fit results based on the R 2 values from QQ plots, as summarized in Table A1, Table A2, Table A3, Table A4.

5.1. Analysis of Goodness-of-Fit R 2 Across Literature and Semantic Datasets

Table A1, Table A2, Table A3, Table A4 summarize the R 2 values across both log-transformed and untransformed settings. Log-transformed models consistently yield higher R 2 values (94–99%), confirming their stability and closer alignment with theoretical quantiles. For the literature datasets, the log-transformed normal, Student-t, and skew-normal distributions perform best; for the semantic datasets, the skew-normal and Student-t distributions remain the strongest. Two representative examples of fitted distributions are shown in Figure 1 and Figure 2: the Arrhythmia dataset (Figure 1) achieves an R 2 = 0.9745 under a gamma distribution, while the Parkinson dataset (Figure 2) attains an R 2 = 0.9692 under a skew-normal distribution after log transformation. Both figures show strong agreement between theoretical and empirical quantiles, reinforcing that log transformation enhances fit quality and that the proposed parametric framework remains robust across diverse data types while supporting high detection accuracy.

5.2. Real Data Analysis Results

After evaluating 17 parametric distributions—12 positively skewed and 5 approximately symmetric—across 23 datasets, the proposed parametric fits on the one-dimensional distance function d k ( · ) (optionally after a log transform) achieve KNN-level or higher accuracy while consistently outperforming other baseline detectors. To illustrate this behavior concretely, Figure 3 and Figure 4 present two examples of ROC comparisons on two representative datasets, demonstrating that the best-fit parametric distribution produces detection performance comparable to, and often slightly better than, the KNN baseline.
In the literature datasets, the inverse Gaussian (without log transformation) distribution achieves the highest average ROC–AUC of 87.56%, matching or slightly exceeding KNN– L 1 / L 2 (87.53–87.66%) and clearly outperforming LOF, COF, KDEOS, and FastABOD, which frequently fall below 85%. Per-dataset analyses (Table A5, Table A6, Table A7, Table A8) show stable wins or ties for the parametric models, with notable advantages in moderately skewed datasets such as PIMA (73.7% vs. KNN–L1 67%) and strong robustness in highly skewed ones like KDDCup99 and WDBC, where fitted distributions maintain near-perfect detection accuracy (>96%). On the semantic datasets (Table A9, Table A10, Table A11, Table A12), the best-performing parametric distributions—the skew-normal under log transformation and the inverse Gaussian without log transformation—achieve an average ROC–AUC of 72.38%, essentially matching and slightly exceeding KNN– L 1 (72.33%), while outperforming LOF (≈69%), KDEOS (≈65%), COF (≈60%), and ABOD (≈63%). Certain baseline methods, including LDOF and FastABOD, were computationally infeasible for several large datasets (as indicated in Table A5, Table A6, Table A7, Table A8, Table A9, Table A10,Table A11, Table A12), underscoring the practical advantage of the lightweight parametric approach. These results confirm that a small and interpretable family of fitted distributions, once paired with a simple neighborhood scale k, provides competitive accuracy with far less parameter tuning.
When comparing the average ROC–AUCs for the literature datasets and semantic datasets in Table 4, both domains show a similar drop in absolute performance, yet the parametric methods remain remarkably uniform across transformations and dataset types. Their average ROC–AUC stays within a narrow band (≈87% → 72%), indicating strong distributional adaptability and low sensitivity to distance-metric choice. In contrast, baseline methods exhibit wider fluctuations and sharper degradation. Several factors explain the superiority of the parametric framework: (1) performance consistency, as it maintains nearly identical rankings across datasets, highlighting reliable generalization; (2) statistical interpretability, since each fitted distribution (e.g., t, inverse-Gaussian, skew-normal) conveys explicit probabilistic semantics—tail behavior, variance, and skewness—that yield explainable anomaly thresholds; (3) computational efficiency, because once parameters are estimated, new-sample scoring becomes lightweight compared with K-neighbor searches; and (4) practical robustness, since these models attain equal or higher ROC–AUC than KNN or ODIN without heavy hyperparameter tuning. When performance levels are close, interpretability becomes decisive—the parametric models provide transparent probabilistic reasoning while achieving comparable or better accuracy. Overall, across both literature and semantic datasets, these results establish the proposed parametric approach as a simple, interpretable, and high-performing alternative to traditional distance-based outlier detectors.

6. Conclusions

We proposed a distribution-aware framework for unsupervised outlier detection that reduces multivariate data to one-dimensional neighborhood statistics and identifies anomalies through fitted parametric distributions. Supported by the CDF Superiority Theorem, this approach connects statistical distribution modeling with ROC–AUC consistency and produces interpretable, probabilistically calibrated scores for anomaly ranking.
Empirically, our results highlight three main observations. First, across both the literature and semantic benchmark datasets, the empirical kNN distance distributions are typically right-skewed, and a broad class of positively skewed distributions provides good one-dimensional fits. In Table A1, Table A2, Table A3 and Table A4, most log-transformed models (normal, t, Laplace, logistic, skew-normal) attain average QQ-plot R² values between approximately 91% and 98%, while several non-transformed families (gamma, inverse Gaussian, Weibull (minimum), chi-square, Pareto) also produce high-quality fits on many datasets. No single distribution dominates universally, supporting the view that one-dimensional neighborhood statistics are best described by a flexible family of tail models rather than by a single canonical law.
Second, when these models are used to form CDF-based anomaly scores, the resulting ROC–AUC accuracy is competitive with, or superior to, strong non-parametric baselines. Across the 23 datasets, the proposed parametric families deliver average ROC–AUC performance that matches or exceeds the strongest kNN-based methods, while clearly outperforming density-, angle-, and distance-based competitors such as LOF, KDEOS, COF, and LDOF. This holds both for the classical literature datasets (≈87.4% average ROC–AUC) and for the semantically complex datasets (≈72.3% average ROC–AUC), demonstrating robust performance across diverse regimes.
Third, because our modeling occurs in one dimension, the framework remains computationally lightweight and requires little hyperparameter tuning. Relative to methods of comparable accuracy, our parametric scoring rule offers clear probabilistic interpretability and lower computational cost, avoiding the heavy machinery and sensitivity associated with density estimation, high-dimensional kernels, and local geometric heuristics.
Overall, these results highlight a principled and interpretable pathway for outlier detection, showing that statistical modeling of neighborhood distances can achieve strong, stable performance without relying on complex non-parametric procedures. In summary, the experiments demonstrate that parametric CDF modeling of KNN distance statistics yields consistent performance gains, or parity, relative to established methods on real datasets, with strong empirical support from QQ-plot fits (Table A1, Table A2, Table A3, Table A4) and ROC–AUC benchmarks (Table 4). The more speculative extensions, including adaptive model selection, multivariate dependence modeling, and hierarchical tail calibration, are presented as natural future directions; our conclusions thus distinguish clearly between what has been established experimentally and what is suggested conceptually for further methodological work.

Author Contributions

Conceptualization, J.Z.; Formal Analysis, J.Z., W.D., E.T. and K.H.; Methodology, J.Z., W.D., E.T. and K.H.; Project Administration, J.Z.; Software, J.Z. and K.H.; Supervision, J.Z., W.D. and E.T.; Validation, J.Z., W.D., E.T. and K.H.; Visualization, J.Z. and K.H.; Writing—original draft, J.Z. and K.H.; Writing—review and editing, J.Z., W.D., E.T. and K.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available in Outlier-Detection at https://github.com/hodge-py/Outlier-Detection (accessed on 30 November 2025). These data were derived from the following resource available in the public domain: https://www.dbs.ifi.lmu.de/research/outlier-evaluation/DAMI/ (accessed on 30 November 2025).

Acknowledgments

During the preparation of this manuscript/study, the authors used ChatGPT 5.0 and 5.1 for the purposes of editing statements and correcting grammatical errors. The authors have reviewed and edited the output and take full responsibility for the content of this publication. This research was partially supported by an internal research grant from Southern Arkansas University awarded to Jie Zhou.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
KNN: k-Nearest Neighbors
LOF: Local Outlier Factor
COF: Connectivity-Based Outlier Factor
ABOD: Angle-Based Outlier Detection
KDE: Kernel Density Estimation
CDF: Cumulative Distribution Function
TPR: True Positive Rate
FPR: False Positive Rate
ROC: Receiver Operating Characteristic
AUC: Area Under the Curve
ESD: Extreme Studentized Deviate
LoOP: Local Outlier Probabilities
LDOF: Local Distance-Based Outlier Factor
ODIN: Outlier Detection using Indegree Number
KDEOS: Kernel Density Estimation Outlier Score
SVM: Support Vector Machine
SVDD: Support Vector Data Description
DAGMM: Deep Autoencoding Gaussian Mixture Model
HBOS: Histogram-Based Outlier Score
LODA: Lightweight On-line Detector of Anomalies
COPOD: Copula-Based Outlier Detection
INFLO: Influenced Outlierness

Appendix A

Listing A1 contains the Python 3.12.0 (64-bit) script used for the Gamma-based Monte Carlo experiments. After importing the required libraries and setting a fixed random seed (2026) for reproducibility, the code defines the set of detectors (KNN, LOF, SLOF, LoOP, LDOF, ABOD, LDF, INFLO, COF) and stores their respective training routines in a dictionary. The simulation parameters (runs, sample sizes, and inlier/outlier distribution parameters) are then specified. In each run, the script (i) samples Gamma inliers for training, (ii) fits the inlier model by maximum likelihood with fixed location, (iii) generates Gamma inlier and outlier test points, (iv) computes anomaly scores via the fitted CDF or each PyOD detector, and (v) records the resulting ROC–AUC. Finally, it summarizes the AUC distribution using pandas. The code for the Inverse Gaussian and Skew–Normal experiments follows the same structure, differing only in the distributional generator and corresponding CDF.
The code and additional experimental results can be found in the Hodge-py outlier detection repositories [58,59] and the DAMI outlier evaluation collection [57].
Listing A1. Monte Carlo simulation example for positively skewed data.
(Listing A1 is included as an image in the published version of the article.)
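Because the published listing appears as an image, a condensed sketch of the Gamma-based experiment is reproduced below for convenience. It follows the structure described above and the parameters of Table 1(a), but, as assumptions made for brevity, it restricts the baseline detectors to the PyOD models reported in Table 2 (KNN, LOF, ABOD, COF) and uses illustrative sample sizes; the full script in the repositories [58,59] additionally covers SLOF, LoOP, LDOF, LDF, and INFLO.

# Condensed sketch of the Gamma-based Monte Carlo experiment in Listing A1.
# Assumptions: only the PyOD detectors reported in Table 2 are included,
# and the sample sizes below are illustrative.
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.metrics import roc_auc_score
from pyod.models.knn import KNN
from pyod.models.lof import LOF
from pyod.models.abod import ABOD
from pyod.models.cof import COF

rng = np.random.default_rng(2026)                # fixed seed for reproducibility
runs, n_train, n_in, n_out = 500, 1000, 450, 50
detectors = {"KNN": KNN, "LOF": LOF, "ABOD": ABOD, "COF": COF}
records = []

for _ in range(runs):
    # (i) sample Gamma inliers for training (shape 2.0, scale 2.0; Table 1(a))
    x_train = stats.gamma.rvs(a=2.0, scale=2.0, size=n_train, random_state=rng)
    # (ii) fit the inlier model by maximum likelihood with fixed location
    a_hat, loc_hat, scale_hat = stats.gamma.fit(x_train, floc=0)
    # (iii) generate Gamma inlier and outlier test points (outliers: shape 5.0, scale 2.0)
    x_in = stats.gamma.rvs(a=2.0, scale=2.0, size=n_in, random_state=rng)
    x_out = stats.gamma.rvs(a=5.0, scale=2.0, size=n_out, random_state=rng)
    x_test = np.concatenate([x_in, x_out]).reshape(-1, 1)
    y_test = np.concatenate([np.zeros(n_in), np.ones(n_out)])
    # (iv) score via the fitted CDF and via each PyOD detector, (v) record the ROC-AUC
    cdf_scores = stats.gamma.cdf(x_test.ravel(), a_hat, loc=loc_hat, scale=scale_hat)
    row = {"CDF": roc_auc_score(y_test, cdf_scores)}
    for name, cls in detectors.items():
        model = cls().fit(x_train.reshape(-1, 1))
        row[name] = roc_auc_score(y_test, model.decision_function(x_test))
    records.append(row)

# summarize the AUC distribution, as in Table 2(a)
print(pd.DataFrame(records).describe())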
Table A1. R2 for QQ plots of literature datasets—part 1.
Distribution | ALOI | Glass | Ionosphere | KDDCup99 | Lymphography | PenDigits
Log Transform
norm | 98.81% | 93.36% | 98.36% | 90.25% | 92.44% | 97.88%
t | 98.86% | 91.38% | 98.36% | 88.38% | 96.71% | 97.66%
laplace | 95.66% | 90.36% | 91.60% | 88.98% | 90.75% | 93.12%
logistic | 98.42% | 92.25% | 96.12% | 90.08% | 94.47% | 96.69%
skewnorm | 99.71% | 98.72% | 98.30% | 95.60% | 97.85% | 99.86%
No Transform
expon | 89.28% | 93.64% | 97.14% | 76.67% | 96.17% | 99.36%
chi2 | 91.27% | 94.33% | 96.29% | 92.60% | 97.28% | 98.23%
gamma | 93.91% | 93.87% | 97.12% | 97.96% | 92.45% | 97.86%
weibull_min | 93.93% | 97.77% | 97.33% | 97.73% | 91.00% | 95.70%
invgauss | 99.13% | 95.99% | 93.88% | 99.18% | 94.22% | 96.39%
rayleigh | 69.15% | 75.77% | 90.79% | 52.58% | 79.50% | 93.74%
wald | 96.14% | 97.45% | 93.01% | 87.44% | 98.24% | 96.70%
pareto | 98.36% | 95.84% | 97.14% | 12.84% | 96.17% | 99.36%
nakagami | 82.07% | 84.00% | 97.45% | 76.86% | 87.64% | 94.71%
logistic | 61.71% | 71.96% | 80.61% | 46.46% | 79.73% | 87.25%
powerlaw | 73.65% | 86.12% | 96.79% | 63.82% | 76.63% | 87.89%
skewnorm | 75.15% | 81.96% | 95.23% | 58.90% | 88.14% | 96.78%
Table A2. R2 for QQ plots of literature datasets—part 2.
Distribution | Shuttle | Waveform | WBC | WDBC | WPBC | Average
Log Transform
norm | 99.10% | 99.14% | 87.33% | 90.96% | 92.60% | 94.57%
t | 99.29% | 99.12% | 83.70% | 85.60% | 89.97% | 93.55%
laplace | 97.02% | 95.62% | 82.65% | 87.74% | 89.02% | 91.14%
logistic | 99.09% | 98.39% | 83.39% | 91.31% | 90.83% | 93.73%
skewnorm | 99.25% | 99.96% | 93.28% | 97.87% | 98.95% | 98.12%
No Transform
expon | 73.73% | 91.97% | 91.40% | 95.57% | 99.17% | 91.28%
chi2 | 73.78% | 99.96% | 89.83% | 97.32% | 98.60% | 93.59%
gamma | 72.50% | 99.96% | 98.67% | 97.35% | 98.60% | 94.57%
weibull_min | 79.88% | 99.31% | 92.12% | 92.68% | 97.12% | 94.05%
invgauss | 73.43% | 99.97% | 89.77% | 97.18% | 99.11% | 94.39%
rayleigh | 62.81% | 99.71% | 64.23% | 79.35% | 92.52% | 78.19%
wald | 78.22% | 84.60% | 92.95% | 98.75% | 97.24% | 92.79%
pareto | 73.73% | 91.97% | 64.07% | 96.02% | 99.24% | 84.07%
nakagami | 64.77% | 99.66% | 88.02% | 86.87% | 94.98% | 87.00%
logistic | 59.31% | 97.13% | 58.20% | 72.55% | 83.44% | 72.58%
powerlaw | 56.20% | 93.50% | 72.40% | 76.42% | 89.13% | 79.32%
skewnorm | 65.54% | 99.90% | 71.21% | 85.54% | 95.16% | 83.04%
Table A3. R2 for QQ plots of semantic datasets—part 1.
Distribution | Annthyroid | Arrhythmia | Cardiotocography | HeartDisease | Hepatitis | InternetAds
Log Transform
norm | 97.32% | 92.53% | 98.33% | 98.37% | 96.42% | 97.22%
t | 99.12% | 91.14% | 98.23% | 98.37% | 96.42% | 97.22%
laplace | 98.76% | 90.17% | 94.26% | 93.94% | 89.60% | 93.41%
logistic | 98.97% | 92.01% | 97.23% | 97.15% | 93.92% | 96.18%
skewnorm | 97.48% | 99.49% | 99.44% | 99.31% | 96.45% | 98.11%
No Transform
expon | 76.81% | 99.08% | 97.85% | 94.44% | 87.87% | 97.61%
chi2 | 85.52% | 99.12% | 97.06% | 99.39% | 89.40% | 98.09%
gamma | 85.93% | 97.45% | 99.23% | 99.39% | 93.61% | 98.09%
weibull_min | 77.03% | 96.76% | 98.14% | 99.40% | 96.15% | 97.32%
invgauss | 83.37% | 99.20% | 99.46% | 99.34% | 93.76% | 98.64%
rayleigh | 56.43% | 89.74% | 96.29% | 99.22% | 97.84% | 95.73%
wald | 85.90% | 98.33% | 93.95% | 87.79% | 80.24% | 93.98%
pareto | 84.19% | 99.08% | 97.85% | 94.44% | 87.87% | 97.62%
nakagami | 64.82% | 93.25% | 97.09% | 99.31% | 97.33% | 90.16%
logistic | 51.34% | 81.56% | 90.29% | 94.63% | 93.75% | 87.67%
powerlaw | 53.06% | 84.56% | 88.02% | 94.70% | 98.76% | 85.66%
skewnorm | 61.61% | 93.83% | 97.90% | 99.51% | 96.93% | 95.73%
Table A4. R2 for QQ plots of semantic datasets—part 2.
Distribution | PageBlocks | Parkinson | Pima | SpamBase | Stamps | Wilt | Average
Log Transform
norm | 92.65% | 94.42% | 97.87% | 74.47% | 97.44% | 96.23% | 94.44%
t | 91.75% | 97.80% | 97.87% | 80.92% | 97.26% | 91.67% | 94.82%
laplace | 89.74% | 96.46% | 92.28% | 83.68% | 93.34% | 97.20% | 92.74%
logistic | 92.16% | 96.06% | 96.27% | 78.35% | 96.42% | 97.54% | 94.36%
skewnorm | 99.62% | 96.92% | 99.06% | 83.94% | 99.27% | 97.95% | 97.25%
No Transform
expon | 79.86% | 91.84% | 97.62% | 92.80% | 97.74% | 64.13% | 89.80%
chi2 | 79.83% | 85.41% | 94.45% | 92.24% | 97.88% | 67.36% | 90.48%
gamma | 93.53% | 85.41% | 99.70% | 92.22% | 97.88% | 72.96% | 92.95%
weibull_min | 82.47% | 94.74% | 99.47% | 89.26% | 97.08% | 86.30% | 92.84%
invgauss | 92.85% | 88.89% | 99.36% | 92.53% | 98.47% | 57.91% | 91.98%
rayleigh | 58.19% | 78.23% | 97.66% | 87.46% | 92.07% | 46.62% | 82.79%
wald | 89.31% | 95.89% | 92.66% | 93.98% | 96.88% | 64.81% | 89.59%
pareto | 95.46% | 91.84% | 97.62% | 93.47% | 97.74% | 67.46% | 92.05%
nakagami | 69.50% | 84.59% | 98.74% | 87.67% | 94.98% | 52.37% | 86.26%
logistic | 52.31% | 72.85% | 91.07% | 83.92% | 85.41% | 38.07% | 77.01%
powerlaw | 59.33% | 68.75% | 92.67% | 77.26% | 85.86% | 40.43% | 77.42%
skewnorm | 63.91% | 81.93% | 99.20% | 89.06% | 94.75% | 45.16% | 84.96%
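As a guide to how the goodness-of-fit values in Table A1, Table A2, Table A3 and Table A4 can be reproduced, the brief Python sketch below computes a QQ-plot R² for a candidate family using SciPy's probability-plot routine; the specific fitting and transformation conventions are assumptions of this sketch and may differ in detail from those used to build the tables.

# Sketch: QQ-plot R^2 for a candidate distribution (conventions assumed, not necessarily the paper's exact procedure).
import numpy as np
from scipy import stats

def qq_r2(x, dist, log_transform=False):
    """Squared correlation coefficient of the probability (QQ) plot of `dist` fitted to x."""
    data = np.log(x) if log_transform else x
    shapes = dist.fit(data)[:-2]               # shape parameters only; loc/scale are absorbed by the plot's line fit
    _, (_, _, r) = stats.probplot(data, dist=dist, sparams=shapes)
    return r ** 2

rng = np.random.default_rng(0)
sample = stats.gamma.rvs(a=2.0, scale=2.0, size=1000, random_state=rng)
print(f"gamma, no transform : {qq_r2(sample, stats.gamma):.2%}")
print(f"skew-normal, log    : {qq_r2(sample, stats.skewnorm, log_transform=True):.2%}")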
Table A5. Literature datasets ROC-AUC part 1.
ALOI | Glass | Ionosphere
ROC AUC | k | ROC AUC | k | ROC AUC | k (each entry below gives the ROC AUC followed by the corresponding k)
Log Transform
norm74.50%387.20%1090.90%2
t74.60%387.60%290.90%2
laplace74.50%288.00%290.70%2
logistic74.50%387.60%290.70%2
skewnorm74.50%387.60%1090.80%2
No Transform
expon74.30%387.10%290.10%2
chi274.40%387.80%290.10%2
gamma74.50%288.00%290.10%2
weibull_min74.40%287.50%1090.10%2
invgauss74.60%388.50%290.10%2
rayleigh74.30%387.50%290.10%2
wald74.40%387.80%290.10%2
pareto74.50%387.90%290.10%2
nakagami74.50%388.00%290.10%2
logistic73.80%387.20%1090.20%2
powerlaw74.50%387.40%1089.90%2
skewnorm74.30%387.70%290.10%2
Baseline Manhattan
KNN74.60%287.40%1089.60%4
LOF81.40%786.70%1387.10%10
SimplifiedLOF74.86%387.99%290.04%2
LoOP83.45%1085.09%2086.38%16
LDOF75.24%978.10%2683.22%50
ODIN74.62%387.99%290.04%2
FastABOD76.66%1450.00%292.07%69
KDEOS52.26%6283.96%1986.25%70
LDF74.86%387.99%290.04%2
INFLO83.60%1083.79%1886.06%16
COF76.84%3089.86%6288.02%13
Baseline Euclidean
KNN74.06%187.48%892.74%1
LOF78.23%986.67%1190.43%83
SimplifiedLOF79.57%1686.50%1690.50%10
LoOP80.08%1283.96%1890.21%11
LDOF 77.89%2789.61%14
ODIN80.50%1172.93%1885.22%13
FastABOD 85.80%9891.33%3
KDEOS77.26%9974.20%2883.40%71
LDF74.62%990.35%991.67%50
INFLO79.87%980.38%1890.38%10
COF80.17%1389.54%7696.03%100
Missing LDOF and FastABOD results are attributed to computational cost.
Table A6. Literature datasets ROC-AUC part 2.
KDDCup99 | Lymphography | PenDigits
ROC AUC | k | ROC AUC | k | ROC AUC | k (each entry below gives the ROC AUC followed by the corresponding k)
Log Transform
norm96.80%69100.00%1998.20%9
t96.80%6899.90%698.30%10
laplace96.70%6999.80%3198.40%14
logistic96.70%69100.00%1598.40%11
skewnorm96.70%69100.00%898.70%15
No Transform
expon95.00%6999.30%1399.10%12
chi296.90%69100.00%3898.40%6
gamma96.70%69100.00%898.30%12
weibull_min96.80%69100.00%898.20%9
invgauss96.50%69100.00%899.10%12
rayleigh95.70%6999.90%497.60%6
wald94.90%69100.00%2699.10%9
pareto96.40%69100.00%1399.10%12
nakagami96.50%69100.00%897.90%10
logistic94.30%69100.00%896.80%8
powerlaw96.90%69100.00%899.10%9
skewnorm95.90%69100.00%898.30%9
Baseline Manhattan
KNN97.00%69100.00%799.10%11
LOF67.90%45100.00%4797.10%55
SimplifiedLOF95.40%70100.00%399.13%21
LoOP66.52%6199.88%5996.24%70
LDOF77.09%7099.65%4472.92%70
ODIN97.01%70100.00%899.12%12
FastABOD58.97%7099.18%6050.00%2
KDEOS50.00%282.75%3386.69%59
LDF95.40%70100.00%399.13%21
INFLO66.46%5499.88%5996.95%70
COF60.57%6996.48%1498.29%69
Baseline Euclidean
KNN98.97%89100.00%1499.21%12
LOF84.89%100100.00%6296.58%73
SimplifiedLOF66.80%62100.00%9896.68%67
LoOP70.31%6599.77%4796.23%98
LDOF 99.77%8675.03%91
ODIN80.77%10099.88%5596.43%100
FastABOD 99.77%2597.98%100
KDEOS60.51%6898.12%9982.21%98
LDF87.70%90100.00%1397.79%12
INFLO70.33%5699.88%6295.71%98
COF67.01%67100.00%4096.70%95
Missing LDOF and FastABOD results are attributed to computational cost.
Table A7. Literature datasets ROC-AUC part 3.
Shuttle | Waveform | WBC
ROC AUC | k | ROC AUC | k | ROC AUC | k (each entry below gives the ROC AUC followed by the corresponding k)
Log Transform
norm84.60%578.30%6899.40%9
t84.66%578.50%6499.70%23
laplace84.60%578.50%6199.70%32
logistic84.50%578.60%6899.30%10
skewnorm84.60%578.60%6999.60%60
No Transform
expon84.20%578.50%6298.80%10
chi284.50%578.60%6699.80%30
gamma82.00%578.60%6699.90%40
weibull_min83.90%578.50%6799.80%26
invgauss84.50%578.60%6699.50%4
rayleigh84.80%578.80%6699.00%33
wald84.50%578.60%6999.80%69
pareto84.30%578.60%6299.20%4
nakagami84.60%578.40%6899.80%17
logistic84.20%578.40%5896.70%23
powerlaw82.80%578.50%5999.80%28
skewnorm84.60%578.50%6799.20%61
Baseline Manhattan
KNN84.68%478.60%6599.70%24
LOF84.10%776.50%6999.70%65
SimplifiedLOF78.00%1477.77%7099.72%22
LoOP82.08%1172.59%7097.28%70
LDOF77.98%2269.59%6794.37%70
ODIN84.68%578.57%6699.74%25
FastABOD50.00%252.31%576.29%13
KDEOS77.30%4865.14%7097.54%11
LDF78.00%1477.77%7099.72%22
INFLO77.84%1071.50%7099.48%67
COF63.02%6476.25%5898.97%58
Baseline Euclidean
KNN81.76%377.55%7799.72%19
LOF78.21%675.60%9699.67%98
SimplifiedLOF76.61%9972.95%10099.39%99
LoOP76.40%9972.37%10098.03%99
LDOF84.75%1568.82%10096.53%99
ODIN78.90%869.68%10096.74%100
FastABOD95.46%667.31%4099.48%49
KDEOS66.55%9459.24%9964.79%5
LDF71.59%478.89%1699.72%71
INFLO79.89%9870.92%9499.39%99
COF63.97%7177.59%9999.44%74
Table A8. Literature datasets ROC-AUC part 4.
WDBC | WPBC
ROC AUC | k | ROC AUC | k (each entry below gives the ROC AUC followed by the corresponding k)
Log Transform
norm97.70%953.10%12
t98.40%4653.40%20
laplace98.30%4253.10%26
logistic97.30%953.20%20
skewnorm98.70%4353.20%26
No Transform
expon98.50%4253.20%14
chi299.00%5653.20%26
gamma98.90%6853.20%26
weibull_min98.90%6453.30%12
invgauss98.70%2553.10%19
rayleigh97.50%2053.20%12
wald98.70%5353.20%12
pareto98.80%4253.30%19
nakagami99.00%6353.20%12
logistic96.90%3953.20%19
powerlaw98.90%6453.10%20
skewnorm98.30%4152.90%21
Baseline Manhattan
KNN99.00%6953.10%18
LOF99.10%6952.70%34
SimplifiedLOF98.71%5752.70%29
LoOP98.38%6949.61%61
LDOF97.96%7050.18%61
ODIN98.96%7053.10%19
FastABOD50.00%254.52%4
KDEOS90.08%6957.13%34
LDF98.71%5752.70%29
INFLO98.91%7049.27%57
COF97.70%6450.64%47
Baseline Euclidean
KNN98.63%9054.09%12
LOF98.91%8952.54%24
SimplifiedLOF98.68%9050.18%1
LoOP98.40%10050.18%1
LDOF98.18%9956.56%7
ODIN97.23%9350.73%1
FastABOD98.26%9753.42%40
KDEOS86.11%8051.85%2
LDF98.54%3358.29%8
INFLO98.49%9549.57%20
COF98.07%5555.69%97
Table A9. Semantic datasets ROC-AUC part 1.
Annthyroid | Arrhythmia | Cardiotocography
ROC AUC | k | ROC AUC | k | ROC AUC | k (each entry below gives the ROC AUC followed by the corresponding k)
Log Transform
norm67.60%276.20%4455.70%69
t67.70%276.10%4755.60%69
laplace67.60%276.20%3455.70%68
logistic67.70%276.10%4755.60%69
skewnorm67.70%276.00%3555.70%69
No Transform
expon67.61%275.66%3555.65%69
chi267.64%276.11%4655.69%69
gamma67.65%276.11%4155.72%69
weibull_min67.61%275.97%4455.79%68
invgauss67.69%276.03%4555.64%67
rayleigh67.59%275.99%4555.76%69
wald67.64%276.05%4455.76%69
pareto67.60%276.07%3555.65%69
nakagami67.62%276.02%3855.70%69
logistic67.36%276.15%2955.69%68
powerlaw67.46%276.09%4555.77%69
skewnorm67.52%276.20%4555.71%69
Baseline Manhattan
KNN67.28%276.10%4355.80%69
LOF70.20%1175.50%4860.20%69
SimplifiedLOF67.74%375.81%5153.78%70
LoOP72.09%3875.76%7056.84%21
LDOF78.92%2875.18%656.17%50
ODIN67.67%276.06%4455.76%70
FastABOD71.34%4667.53%7050.00%2
KDEOS50.00%250.00%250.32%36
LDF67.74%375.81%5153.78%70
INFLO71.31%3175.30%7057.98%69
COF62.62%5575.52%4156.92%70
Baseline Euclidean
KNN64.90%175.21%6066.67%100
LOF66.76%974.42%9464.70%100
SimplifiedLOF66.53%2173.81%6559.79%100
LoOP67.72%2373.84%7759.50%100
LDOF69.21%3073.45%10057.69%100
ODIN69.33%572.67%9862.12%100
FastABOD62.39%474.18%9855.74%100
KDEOS67.81%3966.10%2154.74%22
LDF65.93%872.29%6767.71%100
INFLO66.46%4773.15%9159.84%100
COF69.21%3073.39%3956.83%20
Table A10. Semantic datasets ROC-AUC part 2.
HeartDisease | Hepatitis | InternetAds
ROC AUC | k | ROC AUC | k | ROC AUC | k (each entry below gives the ROC AUC followed by the corresponding k)
Log Transform
norm70.10%6978.80%2672.20%14
t70.10%6978.80%2672.20%14
laplace69.70%6979.00%2572.20%14
logistic69.80%6879.00%2672.20%14
skewnorm70.10%6878.80%2672.20%14
No Transform
expon69.51%6977.04%2572.12%14
chi270.02%6678.87%4072.23%14
gamma70.02%6678.47%2672.23%14
weibull_min70.16%6878.99%2670.36%6
invgauss69.91%6979.22%2672.20%14
rayleigh69.82%6879.05%2672.20%14
wald69.91%6878.59%2672.16%14
pareto70.18%6978.53%2572.12%14
nakagami69.99%6878.70%2672.23%14
logistic69.78%6978.53%2572.16%14
powerlaw69.63%6978.76%2672.18%14
skewnorm69.89%6678.53%2672.21%14
Baseline Manhattan
KNN70.00%6879.00%2572.20%13
LOF64.00%6980.40%5070.30%69
SimplifiedLOF66.97%7075.89%5174.21%18
LoOP55.55%7074.17%6565.28%70
LDOF54.32%572.90%6964.68%41
ODIN69.99%6978.99%2672.21%14
FastABOD60.11%6668.08%2854.84%14
KDEOS65.43%5370.75%3650.00%2
LDF66.97%7075.89%5174.21%18
INFLO56.32%6874.63%6468.03%70
COF56.47%7073.02%5168.49%32
Baseline Euclidean
KNN68.38%8178.59%2172.23%12
LOF65.58%10080.37%4874.09%98
SimplifiedLOF56.93%10073.82%7874.31%98
LoOP56.14%6072.27%7870.07%100
LDOF56.91%1473.82%7969.36%98
ODIN60.59%8274.97%5860.54%7
FastABOD75.57%10070.95%5973.39%24
KDEOS55.69%10071.18%7957.78%35
LDF72.06%8382.89%4668.50%100
INFLO55.97%1560.28%5572.96%98
COF71.68%10082.72%7859.88%10
Table A11. Semantic datasets ROC-AUC part 3.
PageBlocks | Parkinson | Pima
ROC AUC | k | ROC AUC | k | ROC AUC | k (each entry below gives the ROC AUC followed by the corresponding k)
Log Transform
norm87.10%6973.70%673.70%68
t87.10%6973.90%473.70%68
laplace87.20%6973.90%673.60%69
logistic87.00%6973.80%673.70%67
skewnorm87.10%6973.90%673.70%67
No Transform
expon87.04%6971.69%673.38%64
chi287.06%6973.82%673.65%66
gamma86.97%6873.82%673.59%69
weibull_min87.08%6874.15%473.57%68
invgauss87.07%6873.75%673.56%63
rayleigh86.94%6973.76%673.57%69
wald87.06%6973.94%673.62%68
pareto87.19%6973.77%673.58%64
nakagami87.18%6973.63%473.67%68
logistic86.72%6973.65%673.69%67
powerlaw86.95%6973.60%673.70%69
skewnorm87.04%6973.97%673.53%68
Baseline Manhattan
KNN87.30%6973.70%573.60%67
LOF81.10%6963.90%567.20%69
SimplifiedLOF84.75%7072.11%673.05%70
LoOP77.43%7057.56%1961.53%69
LDOF80.21%7052.98%2358.50%65
ODIN87.28%7073.74%673.60%68
FastABOD50.78%7058.19%851.13%23
KDEOS64.99%7076.94%5766.79%70
LDF84.75%7072.11%673.05%70
INFLO75.35%7052.51%1161.62%70
COF69.91%7070.95%7066.15%70
Baseline Euclidean
KNN84.08%10065.24%473.22%85
LOF81.87%6061.20%668.96%100
SimplifiedLOF80.47%9860.73%1462.13%100
LoOP79.38%8658.31%1360.92%99
LDOF82.98%8255.32%1657.00%98
ODIN73.06%10052.61%363.64%100
FastABOD73.39%2466.99%1576.08%99
KDEOS69.51%9158.67%2855.62%2
LDF83.02%4260.22%672.89%100
INFLO76.80%8058.39%1061.73%92
COF77.02%7164.97%9870.12%100
Table A12. Semantic datasets ROC-AUC part 4.
SpamBase | Stamps | Wilt
ROC AUC | k | ROC AUC | k | ROC AUC | k (each entry below gives the ROC AUC followed by the corresponding k)
Log Transform
norm65.10%4191.70%6156.20%2
t65.00%4091.70%6856.10%3
laplace65.00%4691.90%6256.20%2
logistic65.00%4191.80%6856.20%2
skewnorm65.00%4692.20%6556.10%3
No Transform
expon64.94%5191.67%6356.03%3
chi265.03%4991.93%6755.14%2
gamma65.00%4991.93%6756.07%3
weibull_min65.04%4191.86%6855.59%3
invgauss65.08%4092.19%6456.17%2
rayleigh64.98%4091.72%6756.09%3
wald65.04%4091.91%5756.21%2
pareto64.98%4091.99%6356.15%3
nakagami65.05%4491.83%6656.24%3
logistic65.01%4091.86%6356.21%2
powerlaw64.99%3991.83%6156.33%2
skewnorm65.07%4191.77%6656.23%2
Baseline Manhattan
KNN65.00%3991.90%6356.10%2
LOF47.80%282.30%6967.00%5
SimplifiedLOF64.03%7091.04%7056.68%3
LoOP47.21%377.32%7068.35%14
LDOF50.00%270.70%6969.84%16
ODIN65.05%4091.91%6456.18%2
FastABOD54.71%7076.48%6985.03%29
KDEOS50.00%278.51%7070.95%62
LDF64.03%7091.04%7056.68%3
INFLO50.69%273.68%7070.21%6
COF48.71%363.50%7059.82%2
Baseline Euclidean
KNN57.35%6390.11%1555.20%1
LOF47.38%283.32%10063.09%6
SimplifiedLOF50.12%274.35%10067.68%7
LoOP49.66%275.28%10067.92%10
LDOF47.96%575.26%10071.22%13
ODIN51.91%4775.34%10067.46%10
FastABOD43.72%376.22%9755.43%6
KDEOS47.67%10069.13%9971.32%33
LDF53.64%10089.55%10061.27%4
INFLO47.38%378.92%10063.21%7
COF49.95%281.87%10064.83%9

References

  1. Barnett, V.; Lewis, T. Outliers in Statistical Data, 3rd ed.; John Wiley & Sons: Chichester, UK, 1994. [Google Scholar]
  2. Chandola, V.; Banerjee, A.; Kumar, V. Anomaly Detection: A Survey. ACM Comput. Surv. 2009, 41, 1–58. [Google Scholar] [CrossRef]
  3. Hawkins, D.M. Identification of Outliers; Chapman and Hall: London, UK, 1980. [Google Scholar]
  4. Aggarwal, C.C. Outlier Analysis, 2nd ed.; Springer: Cham, Switzerland, 2017. [Google Scholar]
  5. Beyer, K.; Goldstein, J.; Ramakrishnan, R.; Shaft, U. When Is “Nearest Neighbor” Meaningful? In Database Theory-ICDT’99, Proceedings of the 7th International Conference, Jerusalem, Israel, 10–12 January 1999; Beeri, C., Buneman, P., Eds.; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 1999; Volume 1540, pp. 217–235. [Google Scholar]
  6. Zimek, A.; Schubert, E.; Kriegel, H.P. A Survey on Unsupervised Anomaly Detection in High-Dimensional Numerical Data. Stat. Anal. Data Min. 2012, 5, 363–387. [Google Scholar] [CrossRef]
  7. Bolton, R.J.; Hand, D.J. Statistical Fraud Detection: A Review. Stat. Sci. 2002, 17, 235–255. [Google Scholar] [CrossRef]
  8. Hanley, J.A.; McNeil, B.J. The meaning and use of the area under a Receiver Operating Characteristic (ROC) curve. Radiology 1982, 143, 29–36. [Google Scholar] [CrossRef]
  9. Hanley, J.A.; McNeil, B.J. A method of comparing the areas under Receiver Operating Characteristic curves derived from the same cases. Radiology 1983, 148, 839–843. [Google Scholar] [CrossRef] [PubMed]
  10. Fawcett, T. An Introduction to ROC Analysis. Pattern Recognit. Lett. 2006, 27, 861–874. [Google Scholar] [CrossRef]
  11. Fix, E.; Hodges, J.L. Discriminatory Analysis—Nonparametric Discrimination: Consistency Properties; Technical Report Technical Report 4; University of California: Berkeley, CA, USA, 1951. [Google Scholar]
  12. Aggarwal, C.C.; Hinneburg, A.; Keim, D.A. On the Surprising Behavior of Distance Metrics in High Dimensional Space. In Proceedings of the Database Theory—ICDT 2001, Proceedings of the 8th International Conference, London, UK, 4–6 January 2001; Van den Bussche, J., Vianu, V., Eds.; Springer: Berlin/Heidelberg, Germany, 2001; pp. 420–434. [Google Scholar]
  13. Breunig, M.M.; Kriegel, H.P.; Ng, R.T.; Sander, J. LOF: Identifying Density-Based Local Outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, TX, USA, 15–18 May 2000; pp. 93–104. [Google Scholar] [CrossRef]
  14. Tang, J.; Chen, Z.; Fu, A.W.C.; Cheung, D.W.L. Enhancing Effectiveness of Outlier Detections for Low Density Patterns. In Proceedings of the PAKDD, Taipei, Taiwan, 6–8 May 2002; pp. 535–548. [Google Scholar]
  15. Kriegel, H.P.; Schubert, M.; Zimek, A. Angle-Based Outlier Detection in High-Dimensional Data. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, NV, USA, 24–27 August 2008; pp. 444–452. [Google Scholar] [CrossRef]
  16. Rehman, Y.; Belhaouari, S. Unsupervised outlier detection in multidimensional data. J. Big Data 2021, 8, 80. [Google Scholar] [CrossRef]
  17. Campos, G.O.; Zimek, A.; Sander, J.; Campello, R.J.G.B.; Micenková, B.; Schubert, E.; Assent, I.; Houle, M.E. On the Evaluation of Unsupervised Outlier Detection: Measures, Datasets, and an Empirical Study. Data Min. Knowl. Discov. 2016, 30, 891–927. [Google Scholar] [CrossRef]
  18. Bouman, R.; Bukhsh, Z.; Heskes, T. Unsupervised anomaly detection algorithms on real-world multivariate tabular data sets. ACM Comput. Surv. 2024. [Google Scholar]
  19. Tukey, J.W. Exploratory Data Analysis; Addison-Wesley: Reading, MA, USA, 1977. [Google Scholar]
  20. Anderberg, A.; Bailey, J.; Campello, R.J.G.B. Dimensionality-Aware Outlier Detection: Theoretical and Experimental Analysis. In Proceedings of the SIAM International Conference on Data Mining, Houston, TX, USA, 18–20 April 2024. [Google Scholar]
  21. Kim, D.; Park, J.; Chung, H.C.; Jeong, S. Unsupervised outlier detection using random subspace and subsampling ensembles of Dirichlet process mixtures. Pattern Recognit. 2024, 156, 110846. [Google Scholar] [CrossRef]
  22. Chen, X.; Yuan, Z.; Feng, S. Anomaly Detection Based on Improved k-Nearest Neighbor Rough Sets. Int. J. Approx. Reason. 2025, 176, 109323. [Google Scholar] [CrossRef]
  23. Grubbs, F.E. Procedures for Detecting Outlying Observations in Samples. Technometrics 1969, 11, 1–21. [Google Scholar] [CrossRef]
  24. Rosner, B. Percentage Points for a Generalized ESD Many-Outlier Procedure. Technometrics 1983, 25, 165–172. [Google Scholar] [CrossRef]
  25. Davies, L.; Gather, U. The Identification of Multiple Outliers. J. Am. Stat. Assoc. 1993, 88, 782–792. [Google Scholar] [CrossRef]
  26. Bagdonavičius, V.; Petkevičius, G. New Tests for the Detection of Outliers from Location–Scale and Shape–Scale Families. Mathematics 2020, 8, 2156. [Google Scholar]
  27. Amin, M.; Afzal, S.; Akram, M.N.; Muse, A.H.; Tolba, A.H.; Abushal, T.A. Outlier Detection in Gamma Regression Using Pearson Residuals: Simulation and an Application. AIMS Math. 2022, 7, 15331–15347. [Google Scholar] [CrossRef]
  28. A Model-Based Approach to Outlier Detection in Financial Time Series; IFC Bulletin 37; BIS: Basel, Switzerland, 2014.
  29. Wang, Y.; Zhang, L.; Si, T.; Bishop, G.; Gong, H. Anomaly Detection in High-Dimensional Time Series with Scaled Bregman Divergence. Algorithms 2025, 18, 62. [Google Scholar] [CrossRef]
  30. Angiulli, F.; Pizzuti, C. Fast outlier detection in high dimensional spaces. In Proceedings of the European Conference on Principles of Data Mining and Knowledge Discovery (PKDD), Helsinki, Finland, 19–23 August 2002; pp. 15–27. [Google Scholar]
  31. Zhang, K.; Hutter, M.; Jin, H. A local distance-based outlier detection method. In Proceedings of the 20th International Conference on Advances in Database Technology (EDBT), Saint Petersburg, Russia, 24–26 March 2009; pp. 394–405. [Google Scholar]
  32. Kriegel, H.P.; Kröger, P.; Schubert, E.; Zimek, A. LoOP: Local outlier probabilities. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM), Hong Kong, China, 2–6 November 2009; pp. 1649–1652. [Google Scholar]
  33. Latecki, L.J.; Lazarevic, A.; Pokrajac, D. Outlier detection with local and global consistency. In Proceedings of the 2007 SIAM International Conference on Data Mining, Minneapolis, MN, USA, 26–28 April 2007; pp. 597–602. [Google Scholar]
  34. Jin, W.; Tung, A.K.; Han, J.; Wang, W. Ranking outliers using symmetric neighborhood relationship. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Singapore, 9–12 April 2006; pp. 577–593. [Google Scholar]
  35. Schubert, E.; Zimek, A.; Kriegel, H.P. Local outlier detection reconsidered: A generalized view on locality with applications to spatial, video, and network outlier detection. Data Min. Knowl. Discov. 2014, 28, 190–237. [Google Scholar] [CrossRef]
  36. Goldstein, M.; Dengel, A. Histogram-Based Outlier Score (HBOS): A Fast Unsupervised Anomaly Detection Algorithm. In Proceedings of the LWA 2012—Lernen, Wissen, Adaptivität, Dortmund, Germany, 8–10 October 2012. [Google Scholar]
  37. Pevnỳ, T. Loda: Lightweight On-line Detector of Anomalies. Mach. Learn. 2016, 102, 275–304. [Google Scholar] [CrossRef]
  38. Li, Z.; Zhao, Y.; Botta, N.; Ionescu, C.; Hu, X. COPOD: Copula-Based Outlier Detection. In Proceedings of the 2020 IEEE International Conference on Data Mining (ICDM), Sorrento, Italy, 17–20 November 2020. [Google Scholar]
  39. Ramaswamy, S.; Rastogi, R.; Shim, K. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, TX, USA, 16–18 May 2000; pp. 427–438. [Google Scholar]
  40. Papadimitriou, S.; Kitagawa, H.; Gibbons, P.B.; Faloutsos, C. LOCI: Fast outlier detection using the local correlation integral. In Proceedings of the 19th International Conference on Data Engineering (ICDE), Bangalore, India, 5–8 March 2003; pp. 315–326. [Google Scholar]
  41. Liu, F.T.; Ting, K.M.; Zhou, Z.H. Isolation Forest. In Proceedings of the ICDM, Pisa, Italy, 15–19 December 2008; pp. 413–422. [Google Scholar]
  42. Schölkopf, B.; Platt, J.; Shawe-Taylor, J.; Smola, A.J.; Williamson, R.C. Estimating the Support of a High-Dimensional Distribution. Neural Comput. 2001, 13, 1443–1471. [Google Scholar] [CrossRef]
  43. Ruff, L.; Vandermeulen, R.; Goernitz, N.; Deecke, L.; Siddiqui, S.A.; Binder, A.; Müller, E.; Kloft, M. Deep One-Class Classification. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), Stockholm, Sweden, 10–15 July 2018; 80, pp. 4393–4402. [Google Scholar]
  44. Zong, B.; Song, Q.; Min, M.R.; Cheng, W.; Lumezanu, C.; Cho, D.; Chen, H. Deep Autoencoding Gaussian Mixture Model for Unsupervised Anomaly Detection. In Proceedings of the ICLR, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  45. Liu, J.; Ma, Z.; Wang, Z.; Liu, Y.; Wang, Z.; Sun, P.; Song, L.; Hu, B.; Boukerche, A.; Leung, V.C.M. A Survey on Diffusion Models for Anomaly Detection. arXiv 2025, arXiv:2501.11430. [Google Scholar] [CrossRef]
  46. Radovanović, M.; Nanopoulos, A.; Ivanović, M. Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data. J. Mach. Learn. Res. 2010, 11, 2487–2531. [Google Scholar]
  47. Kotz, S.; Kozubowski, T.; Podgórski, K. The Laplace Distribution and Generalizations; Birkhäuser: Boston, MA, USA, 2001. [Google Scholar]
  48. Johnson, N.L.; Kotz, S.; Balakrishnan, N. Continuous Univariate Distributions, Volume I & II; Wiley: New York, NY, USA, 1994. [Google Scholar]
  49. David, H.A.; Nagaraja, H.N. Order Statistics; Wiley: Hoboken, NJ, USA, 2003. [Google Scholar]
  50. Biau, G.; Devroye, L. Lectures on the Nearest Neighbor Method; Springer: Berlin/Heidelberg, Germany, 2015. [Google Scholar]
  51. Titterington, D.; Smith, A.F.M.; Makov, U. Statistical Analysis of Finite Mixture Distributions; Wiley: New York, NY, USA, 1985. [Google Scholar]
  52. Rosenblatt, M. Remarks on a Multivariate Transformation. Ann. Math. Stat. 1952, 23, 470–472. [Google Scholar] [CrossRef]
  53. Swets, J.A. Measuring the accuracy of diagnostic systems. Science 1988, 240, 1285–1293. [Google Scholar] [CrossRef]
  54. Hajian-Tilaki, K. Receiver Operating Characteristic (ROC) curve analysis for medical diagnostic test evaluation. Casp. J. Intern. Med. 2013, 4, 627–635. [Google Scholar]
  55. Bagdonavičius, V.; Petkevičius, L. Multiple Outlier Detection Tests for Parametric Models. Mathematics 2020, 8, 2156. [Google Scholar] [CrossRef]
  56. Azzalini, A. A Class of Distributions Which Includes the Normal Ones. Scand. J. Stat. 1985, 12, 171–178. [Google Scholar]
  57. DAMI: Outlier Evaluation Benchmark. Available online: https://www.dbs.ifi.lmu.de/research/outlier-evaluation/DAMI/ (accessed on 16 February 2025).
  58. Hodge-py Project. Outlier Detection: Literature Resources. Available online: https://github.com/hodge-py/Outlier-Detection/tree/Final/literature (accessed on 16 February 2025).
  59. Hodge-py Project. Outlier Detection: Semantic Resources. Available online: https://github.com/hodge-py/Outlier-Detection/tree/Final/semantic (accessed on 16 February 2025).
Figure 1. Probability plot and histogram of the Arrhythmia dataset. Theoretical distribution for probability plot is set to gamma distribution.
Figure 2. Probability plot and histogram of the Parkinson dataset. Theoretical distribution for probability plot is set to skew-normal distribution after logarithmic transformation.
Figure 3. ROC comparison between the kNN baseline and the best-fit log Student-t distribution on the Shuttle dataset. The parametric fit closely tracks or exceeds the kNN ROC curve, illustrating how a 1D fitted distribution can replicate neighborhood-based anomaly scoring behavior.
Figure 4. ROC comparison between the kNN baseline and the best-fit log Student-t distribution on the Annthyroid dataset. The parametric fit closely tracks or exceeds the kNN ROC curve, illustrating how a 1D fitted distribution can replicate neighborhood-based anomaly scoring behavior.
Table 1. (a) Shape and scale parameters for training and testing sets (Gamma). (b) Shape and scale parameters for training and testing sets (Inverse Gaussian). (c) Shape, location, and scale parameters for training and testing sets (Skew-Normal).
Parameter | Train | Test Inlier | Test Outlier
(a)
Shape | 2.0 | 2.0 | 5.0
Scale | 2.0 | 2.0 | 2.0
(b)
Shape | 2.0 | 2.0 | 5.0
Scale | 2.0 | 2.0 | 2.0
(c)
Shape (a) | 4.0 | 4.0 | −4.0
Location | 0.0 | 0.0 | 2.0
Scale | 1.0 | 1.0 | 1.0
Table 2. (a) ROC-AUC Scores as per each method for Monte Carlo simulation (500 runs) with Gamma distribution. (b) ROC-AUC Scores as per each method for Monte Carlo simulation (500 runs) with Inverse Gaussian distribution. (c) ROC-AUC Scores as per each method for Monte Carlo simulation (500 runs) with Skew-Normal distribution.
Method | Mean | Std | Min | 25% | 50% | 75% | Max
(a)
KNN | 84.81% | 2.22% | 76.58% | 83.42% | 84.90% | 86.39% | 91.38%
LOF | 62.67% | 6.00% | 46.90% | 58.73% | 62.45% | 67.06% | 77.70%
ABOD | 80.66% | 2.48% | 72.47% | 79.06% | 80.66% | 82.48% | 87.09%
COF | 50.39% | 2.59% | 43.55% | 48.66% | 50.43% | 52.01% | 57.70%
CDF | 89.07% | 1.59% | 83.60% | 87.95% | 89.17% | 90.22% | 94.03%
(b)
KNN | 59.06% | 2.89% | 50.46% | 57.09% | 59.09% | 61.12% | 66.87%
LOF | 54.50% | 3.05% | 46.04% | 52.50% | 54.54% | 56.49% | 62.83%
ABOD | 58.61% | 2.95% | 50.48% | 56.72% | 58.61% | 60.58% | 66.79%
COF | 50.86% | 2.83% | 42.65% | 48.97% | 50.97% | 52.77% | 58.46%
CDF | 59.30% | 2.90% | 51.18% | 57.49% | 59.34% | 61.21% | 68.10%
(c)
KNN | 65.07% | 3.35% | 53.76% | 63.01% | 65.26% | 67.33% | 74.90%
LOF | 50.84% | 4.35% | 38.28% | 47.64% | 50.94% | 53.73% | 63.98%
ABOD | 61.01% | 3.26% | 48.70% | 58.91% | 61.15% | 63.35% | 71.20%
COF | 50.17% | 2.74% | 40.60% | 48.24% | 50.24% | 51.99% | 56.75%
CDF | 71.81% | 2.67% | 62.95% | 70.02% | 71.99% | 73.59% | 79.12%
Table 3. Details of datasets used for comparison.
Name | Type | Instances | Outliers | Attributes
ALOI | Literature | 50,000 | 1508 | 27
Glass | Literature | 214 | 9 | 7
Ionosphere | Literature | 351 | 126 | 32
KDDCup99 | Literature | 60,632 | 246 | 38 + 3
Lymphography | Literature | 148 | 6 | 3 + 16
PenDigits | Literature | 9868 | 20 | 16
Shuttle | Literature | 1013 | 13 | 9
Waveform | Literature | 3443 | 100 | 21
WBC | Literature | 454 | 10 | 9
WDBC | Literature | 367 | 10 | 30
WPBC | Literature | 198 | 47 | 33
Annthyroid | Semantic | 7200 | 534 | 21
Arrhythmia | Semantic | 450 | 206 | 259
Cardiotocography | Semantic | 2126 | 471 | 21
HeartDisease | Semantic | 270 | 120 | 13
Hepatitis | Semantic | 80 | 13 | 19
InternetAds | Semantic | 3264 | 454 | 1555
PageBlocks | Semantic | 5473 | 560 | 10
Parkinson | Semantic | 195 | 147 | 22
Pima | Semantic | 768 | 268 | 8
SpamBase | Semantic | 4601 | 1813 | 57
Stamps | Semantic | 340 | 31 | 9
Wilt | Semantic | 4839 | 261 | 5
Datasets are available from the Hodge-py outlier detection repositories [58,59] and the DAMI outlier evaluation collection [57].
Table 4. Comparison of average ROC-AUC across literature and semantic datasets.
Method | Literature Avg. | Semantic Avg.
Log Transform Models
norm | 87.34% | 72.34%
t | 87.52% | 72.33%
laplace | 87.48% | 72.35%
logistic | 87.35% | 72.33%
skewnorm | 87.55% | 72.38%
No Transform Models
expon | 87.10% | 71.86%
chi2 | 87.52% | 72.26%
gamma | 87.29% | 72.30%
weibull_min | 87.40% | 72.18%
invgauss | 87.56% | 72.38%
rayleigh | 87.13% | 72.29%
wald | 87.37% | 72.32%
pareto | 87.47% | 72.32%
nakagami | 87.45% | 72.32%
logistic | 86.52% | 72.23%
powerlaw | 87.35% | 72.27%
skewnorm | 87.25% | 72.30%
Baseline—Manhattan Distance
KNN | 87.53% | 72.33%
LOF | 84.75% | 69.16%
SimplifiedLOF | 86.76% | 71.34%
LoOP | 83.41% | 65.76%
LDOF | 79.66% | 65.37%
ODIN | 87.62% | 72.37%
FastABOD | 64.55% | 62.35%
KDEOS | 75.37% | 62.06%
LDF | 86.76% | 71.34%
INFLO | 83.07% | 65.64%
COF | 81.51% | 64.34%
Baseline—Euclidean Distance
KNN | 87.66% | 70.93%
LOF | 85.61% | 69.31%
SimplifiedLOF | 83.44% | 66.72%
LoOP | 83.27% | 65.92%
LDOF | 83.02% | 65.85%
ODIN | 82.64% | 65.35%
FastABOD | 87.65% | 67.00%
KDEOS | 73.11% | 62.10%
LDF | 86.29% | 70.83%
INFLO | 83.16% | 64.59%
COF | 84.02% | 68.54%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

