Article

Distribution-Aware Outlier Detection in High Dimensions: A Scalable Parametric Approach

by Jie Zhou 1,*, Karson Hodge 1, Weiqiang Dong 1 and Emmanuel Tamakloe 2
1 Department of Mathematics and Computer Science, Southern Arkansas University, 100 East University, Magnolia, AR 71753, USA
2 Department of Mathematics and Natural Sciences, MCPHS University, 179 Longwood Avenue, Boston, MA 02115, USA
* Author to whom correspondence should be addressed.
Mathematics 2026, 14(1), 77; https://doi.org/10.3390/math14010077
Submission received: 30 October 2025 / Revised: 9 December 2025 / Accepted: 16 December 2025 / Published: 25 December 2025

Abstract

We propose a distribution-aware framework for unsupervised outlier detection that transforms multivariate data into one-dimensional neighborhood statistics and identifies anomalies through fitted parametric distributions. This directly addresses central difficulties of high-dimensional data—including sparsity of observations, the concentration of pairwise distances, hubness phenomena in nearest-neighbor graphs, and general effects of the curse of dimensionality that degrade classical distance-based scoring. Supported by the Cumulative Distribution Function (CDF) Superiority Theorem and validated through Monte Carlo simulations, the method connects distributional modeling with Receiver Operating Characteristic–Area Under the Curve (ROC–AUC) consistency and produces interpretable, probabilistically calibrated scores. Across 23 real-world datasets, the proposed parametric models demonstrate competitive or superior detection accuracy with strong stability and minimal tuning compared with baseline non-parametric approaches. The framework is computationally lightweight and robust across diverse domains, offering clear probabilistic interpretability and substantially lower computational cost than conventional non-parametric detectors. These findings establish a principled and scalable approach to outlier detection, showing that statistical modeling of neighborhood distances can achieve high accuracy, transparency, and efficiency within a unified parametric framework.

1. Introduction

Outlier detection plays a critical role in statistical analysis and data-driven decision making because extreme observations can bias estimates, corrupt model fitting, and obscure genuine rare signals [1,2]. It supports multiple objectives: preserving statistical validity by preventing distortion of summary statistics [1], ensuring model robustness [2], enhancing data quality by identifying measurement or entry errors [3], uncovering novel insights from rare events such as fraud or equipment failures [4], and enabling timely decision processes in domains such as finance, cybersecurity, and healthcare [2].
Although classical techniques perform well in low-dimensional settings, they often deteriorate as dimensionality increases. In high-dimensional spaces, the “curse of dimensionality’’ leads to distance concentration and sparsity, which undermine the reliability of proximity- and density-based approaches [5,6]. Additionally, irrelevant or noisy features can mask true anomalies and dramatically increase computational cost [2,6]. Nevertheless, accurate anomaly detection remains essential in fraud detection [7], network intrusion analysis, genetics, image processing, and sensor networks, where rare deviations can signal security breaches, biological abnormalities, or critical system failures.
However, existing approaches typically suffer from at least one of the following limitations: (1) non-parametric methods scale poorly due to reliance on local neighborhood computation; (2) many high-dimensional methods depend on heuristics or dimensionality reduction and lack interpretability; and (3) parametric models rarely come with theoretical guarantees on error control. This motivates the need for a scalable, distribution-grounded framework that produces interpretable anomaly scores with provable statistical properties in high-dimensional settings.
To address this need, we propose a parametric outlier detection framework that applies a uni-dimensional distance transformation capturing each point’s “degree of outlier-ness’’ while remaining computationally efficient regardless of ambient dimension. Specifically, our research objectives are as follows:
1.
Algorithmic efficiency: Develop a method whose computational cost scales linearly with sample size and is independent of feature dimension after transformation;
2.
Statistical interpretability: Model transformed distances with flexible parametric families, enabling distribution-based threshold selection and diagnostic inference;
3.
Provable detection performance: Establish theoretical guarantees showing that the method controls false alarm rates and maximizes statistical power under mild assumptions on the underlying data distribution.
By representing the dataset with a single distance vector, our method avoids the combinatorial cost of high-dimensional operations and enables interpretability through a compact set of distributional parameters. We fit a flexible parametric model—using positively skewed or log-transformed normal families—on these transformed distances, deriving closed-form thresholds and showing that our estimator behaves optimally in terms of false positive rate minimization and true detection rate maximization. Empirical evaluations across multiple benchmark datasets demonstrate that our approach consistently outperforms state-of-the-art non-parametric methods in mean ROC–AUC, validating both its practical utility and theoretical promises.
The proposed and existing algorithms have been benchmarked using the widely adopted ROC–AUC framework. ROC–AUC is a standard, threshold-independent metric used in outlier detection and classification. The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) across all possible score thresholds. The AUC then represents the probability that a randomly sampled outlier receives a higher anomaly score than a randomly sampled inlier [8,9]. A comprehensive conceptual introduction is provided by Fawcett [10], who discusses ROC interpretation, model comparison, and use in anomaly detection more broadly. Under the standard definition, the AUC is computed as the integral of TPR from 0 to 1 with respect to FPR as shown in Equation (1):
\mathrm{AUC} = \int_{0}^{1} \mathrm{TPR}(\mathrm{FPR}) \, d(\mathrm{FPR}) .
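As a small illustration of this definition, the following Python sketch (with hypothetical scores drawn from two Gaussians) checks that the pairwise-probability interpretation of the AUC agrees with the threshold-sweep computation in scikit-learn.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Hypothetical anomaly scores: inliers tend to score lower than outliers.
inlier_scores = rng.normal(loc=0.0, scale=1.0, size=200)
outlier_scores = rng.normal(loc=2.0, scale=1.0, size=20)

# Pairwise-probability definition: Pr(score(outlier) > score(inlier)), ties counted as 1/2.
greater = (outlier_scores[:, None] > inlier_scores[None, :]).mean()
ties = (outlier_scores[:, None] == inlier_scores[None, :]).mean()
auc_pairwise = greater + 0.5 * ties

# Threshold-sweep definition via scikit-learn.
labels = np.concatenate([np.zeros(200), np.ones(20)])
scores = np.concatenate([inlier_scores, outlier_scores])
auc_sklearn = roc_auc_score(labels, scores)

print(auc_pairwise, auc_sklearn)  # the two values agree
```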
To place this work in context, we now review the most relevant existing approaches in both non-parametric and parametric outlier detection.

2. Literature Review

Outlier detection has long been studied from both data-driven and model-based perspectives. A substantial body of non-parametric research leverages the geometry or local density of data. For example, the k-nearest neighbor (KNN) distance method identifies outliers as points whose average distance to their k nearest neighbors is unusually large [11,12], while density- or local-structure-based methods such as Local Outlier Factor (LOF), Connectivity-based Outlier Factor (COF), and Angle-Based Outlier Detection (ABOD) compare each point’s local density and structural relationships with those of its neighborhood to identify sparse regions [13,14,15]. These techniques make few assumptions about the underlying distribution and adapt well to complex, nonlinear structure. However, their performance tends to degrade in high dimensions—distances concentrate, noise dimensions mask true anomalies, and the computational cost of neighborhood or density estimation grows prohibitively with feature count [16,17]. Comprehensive evaluations and benchmark suites further document these effects and provide standardized comparisons across algorithms [17].
Recent large-scale benchmark studies reinforce these observations by systematically comparing dozens of unsupervised anomaly detection algorithms on real-world multivariate tabular data, showing that classical neighbor-based methods such as KNN and LOF remain strong baselines but exhibit unstable behavior across high-dimensional scenarios [18].
A second strand of non-parametric work refines these ideas. For example, Rehman and Belhaouari [16] propose KNN-based dimensionality reduction (D-KNN) to collapse multivariate data into a one-dimensional distance space, then apply box-plot adjustments and joint probability estimation to better separate outliers. Classical exploratory tools still inform practice: Tukey’s 1.5 × Interquartile Range (IQR) rule remains a widely used heuristic for flagging extreme points [19]. Distance choice is also critical in high-dimensional settings: Aggarwal et al. [12] showed that the L 1 (Manhattan) distance preserves greater contrast than the L 2 (Euclidean) distance as dimensionality increases, thereby enhancing the effectiveness of nearest-neighbor-based detection methods.
In parallel, recent work has revisited neighbor-based anomaly detection from a theoretical perspective. Dimensionality-Aware Outlier Detection (DAO) introduces a principled framework based on local intrinsic dimensionality (LID), demonstrating when traditional LOF and KNN scores fail and providing improved consistency guarantees in high-dimensional settings [20]. Other studies enhance neighborhood models via random subspace variational inference [21] or re-examine the construction of KNN-based anomaly scores to improve robustness and calibration in complex feature spaces [22].
Parametric approaches offer an alternative by imposing distributional structure, yielding interpretable tests and often lower computational burden. Early methods include Grubbs’ test and standardized residuals under normality [23], with extensions such as Rosner’s generalized Extreme Studentized Deviate (ESD) procedure and the Davies–Gather and Hawkins tests to detect multiple outliers when the number of anomalies is known a priori [3,24,25]. More recent work develops robust tests for broader location-scale and shape-scale families (e.g., exponential, Weibull, logistic) that avoid pre–specifying the number of outliers [26]. In time-series and regression contexts, parametric residual-based techniques using exponential or gamma error models are used to identify anomalous behavior and heavy-tailed departures [27,28].
Contemporary parametric research also explores distribution-aware anomaly detection in high-dimensional settings, such as density-ratio estimation based on scaled Bregman divergences for high-dimensional time series, which explicitly models deviation from learned probabilistic structure to improve stability and interpretability [29].

2.1. State of the Art

A wide range of state-of-the-art methods has been proposed for unsupervised outlier detection, drawing from distance, density, local deviation, and distribution-based measurement of anomaly scores. Among distance-based approaches, the k-nearest neighbor (KNN) method identifies anomalies as points whose average distance to their nearest neighbors is unusually large [11,12], while Outlier Detection using Indegree Number (ODIN) [30] and Local Distance-based Outlier Factor (LDOF) [31] refine this idea by examining relative neighborhood distances and distance ratios. Other approaches, such as Local Outlier Probability (LoOP) [32] and FastABOD [15], further integrate probabilistic normalization or angular relationships within local reference sets.
Density- and locality-based methods such as Local Outlier Factor (LOF) [13], Connectivity-based Outlier Factor (COF) [14], Local Density Factor (LDF) [33], and Influenced Outlierness (INFLO) [34] detect outliers by comparing each point’s local density or neighborhood structure to that of its surrounding region. Kernel-based models such as KDEOS [35] estimate density based on distance-weighted neighborhood contributions, while marginal and projection-based scoring methods such as HBOS [36] and LODA [37] analyze projected or marginal distributions. More recent work such as COPOD [38] leverages copula-based modeling to derive empirical tail probabilities for anomaly scoring.
Although these approaches are highly influential and widely applicable, they often encounter fundamental difficulties in high-dimensional settings. As dimensionality increases, pairwise distances tend to concentrate [5], making distance-based approaches such as KNN, ODIN, LDOF, LoOP, and FastABOD produce weakened score contrast between normal and anomalous points. Density-based methods (LOF, COF, LDF, INFLO) depend on reliable local density estimates, but sparsity in high dimensions produces noisy, unstable estimation [6]. Kernel-based approaches such as KDEOS suffer from exponential smoothness degradation, while HBOS and LODA may fail when projections no longer preserve informative structure. Additionally, hubness effects—where certain points become nearest neighbors of many others—distort the local neighborhood graph and bias KNN-dependent scoring.
These limitations motivate the core principle of our approach. Rather than depending on geometric or density estimates in the original multivariate space, we transform each point into a univariate nearest-neighbor distance statistic and analyze its distribution parametrically. This reduces the influence of distance concentration, hubness, sparsity, and unstable density estimation, allowing for stable rank-ordering of anomalies using a fitted distributional model.
In summary, while the above state-of-the-art detection techniques have proven successful in low- and moderate-dimensional spaces, their performance degrades as dimensionality increases. By shifting from multivariate geometric comparisons to parametric modeling of univariate neighborhood distances, our method offers a statistically well-grounded and dimension-robust solution for anomaly detection.
Modern anomaly detection systems span a broad spectrum, ranging from classical locality and density-based algorithms to modern representation-learning approaches. On the classical side, widely used methods include KNN [39], LOF [13], SimplifiedLOF [40], Local Outlier Probability (LoOP) [32], Local Distance-based Outlier Factor (LDOF) [31], Outlier Detection using Indegree Number (ODIN) [30], FastABOD [15], Kernel Density Estimation Outlier Score (KDEOS) [35], Local Density Factor (LDF) [33], Influenced Outlierness (INFLO) [34], and COF [14]. These methods remain widely adopted, computationally efficient, and assumption-light, thereby constituting strong state-of-the-art baselines for tabular anomaly detection. Beyond these, the Isolation Forest (iForest) isolates anomalies via random partitioning [41], while the one-class Support Vector Machine (SVM) provides a large-margin boundary in high-dimensional feature spaces [42]. For learned representations, autoencoders, Deep Support Vector Data Description (SVDD), and probabilistic hybrids such as Deep Autoencoding Gaussian Mixture Model (DAGMM) often achieve leading results on image and complex tabular benchmarks [43,44]. Lightweight projection-based schemes such as Histogram-Based Outlier Score (HBOS) and Lightweight Online Detection of Anomalies (LODA) deliver excellent speed–accuracy trade-offs [36,37], while the copula-based Copula-Based Outlier Detection (COPOD) provides fully unsupervised, distribution-free scoring with competitive accuracy [38]. Alongside these established paradigms, representation-learning and generative approaches continue to evolve. In particular, diffusion-based generative models have recently been adapted for anomaly detection and are increasingly recognized for their capacity to model complex high-dimensional distributions, though often at the expense of computational efficiency and interpretability [45]. Together with classical and hybrid methods, these techniques define the contemporary landscape of anomaly detection—from interpretable, efficient heuristics to highly expressive but resource-intensive deep models.

2.2. Our Contribution in Context

Although grounded in different philosophies, both research lines ultimately aim to balance sensitivity to genuine anomalies with robustness against noise. Non-parametric methods perform well when no clear distributional form is present, yet they often suffer from the curse of dimensionality. Parametric tests, by contrast, regain efficiency and offer finite-sample guarantees under correct model specification, but are vulnerable to misspecification. In this paper, we unify these paradigms by introducing a uni-dimensional distance transformation that maps any dataset—regardless of its original dimension—into a single distance vector, which is then modeled with a flexible parametric distribution. This hybrid approach preserves interpretability and scalability, enables closed-form inference, and delivers provable performance under mild assumptions.
Building on the gaps identified above, the paper proceeds as follows. We first formalize a CDF Superiority Theorem, establishing that a parametric CDF–based score achieves strictly higher ROC–AUC than the KNN distance under mild conditions. We further outline the proof that this parametric score also outperforms any other non-parametric method. We then validate this theoretical advantage through Monte Carlo experiments, demonstrating that the mean ROC–AUC of the CDF-based score across 500 simulation paths under the gamma distribution exceeds that of four established non-parametric methods: KNN, LOF, ABOD, and COF. We then introduce our practical framework: reduce high-dimensional data to a 1-D KNN (Manhattan) distance vector; fit either positively skewed families (e.g., gamma/Weibull) or—after a log transform—normal-like families (normal/t/skew-normal); and score observations by their fitted CDFs. Next, we benchmark our approach against several state-of-the-art nonparametric baselines using 23 publicly available datasets. These 23 datasets are separated into two distinct categories: the literature set and the semantic set. The literature set includes datasets commonly used in previous papers that may lack real-world labels and might be synthetic or have outliers defined by prior papers. The semantic set defines outliers based on semantic or domain meaning; they are not arbitrary or synthetic but reflect real-world deviations, e.g., errors in manufacturing.
We report performance in terms of ROC–AUC, together with goodness-of-fit ($R^2$) values derived from Quantile-Quantile (QQ) plots of the proposed probability distributions fitted to the 1-D KNN distance vector. We then examine the relationship between fit quality and detection accuracy, highlighting the conditions under which the parametric approach is most effective. Finally, we conclude with key implications and directions for future research.

3. Method and Theoretical Results

3.1. Positively-Skewed Distributions

Suppose that we originally have a data set in an N-dimensional space. According to Rehman and Belhaouari [16], this dataset can be effectively transformed into a one-dimensional distance space by employing a suitable metric such as Manhattan distance or Euclidean distance. Specifically, for each observation in the original N-dimensional space, the distance to its k-th nearest neighbor is computed. This process generates a new dataset consisting solely of these distances, denoted as $d_k \in \mathbb{R}$. Formally, this transformation can be represented as follows:
d_k : \mathbb{R}^N \to \mathbb{R}
Each d k represents the distance from a point to its k-th nearest neighbor, corresponding to the maximum distance within its k-neighbor set. We use the Manhattan distance, computed as the absolute sum of coordinate-wise differences. The rationale for using Manhattan distance is grounded in the work of Aggarwal et al. [12] on distance metrics in high-dimensional spaces. Compared to Euclidean distance, Manhattan distance lowers the density peak while spreading values more broadly, resulting in a longer-tailed distribution. This reduces the likelihood of misclassifying inliers as outliers. In high-dimensional settings, this effect becomes more pronounced, as certain data points—sometimes termed “hubs”—tend to emerge as nearest neighbors for many other points. Such uneven neighbor distribution contributes to the skewness observed in the k-th nearest neighbor distances [46].
As dimensionality increases, L 2 (Euclidean) distances tend to concentrate due to an exaggeration effect that distorts the relative positioning of outliers. In contrast, L 1 (Manhattan) distance is more robust to this effect and better captures the skewness and variability inherent in the data [12]. This distinction is particularly important under the curse of dimensionality, where KNN distances become increasingly equidistant. This equidistance causes distances to shrink and induces a positively skewed distribution [12]. As shown in Equation (3),
\frac{\max(d_k) - \min(d_k)}{\min(d_k)} \to 0 \quad \text{as } n \to \infty \quad (\text{for } L_2)
and discussed in Aggarwal et al. [12], substituting L 2 with L 1 preserves a broader spread of distances, slowing the convergence toward uniformity and mitigating the equidistant effect.
In addition to these empirical observations, the skewness of the distance distribution has a theoretical justification. The Manhattan distance between two independent random points $X, X' \in \mathbb{R}^p$ decomposes as
D = \| X - X' \|_1 = \sum_{j=1}^{p} Z_j , \qquad Z_j = | X_j - X'_j | \ge 0 .
If each coordinate difference is symmetric with finite moments, then each $Z_j$ has strictly positive third central moment. Since cumulants are additive, we obtain
\mathbb{E}[D] = p\, \mathbb{E}[Z_1], \qquad \operatorname{Var}(D) = p\, \operatorname{Var}(Z_1), \qquad \kappa_3(D) = p\, \kappa_3(Z_1),
which implies positive skewness for any finite dimension:
\gamma_1(D) = \frac{\kappa_3(D)}{\operatorname{Var}(D)^{3/2}} = \frac{\kappa_3(Z_1)}{\operatorname{Var}(Z_1)^{3/2}} \cdot \frac{1}{\sqrt{p}} > 0 .
Thus, even if standardized distances may approach normality under the Central Limit Theorem, the raw distances necessarily remain nonnegative and right-skewed [47,48].
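The additivity argument above can be checked numerically; the sketch below assumes standard-normal coordinates and p = 50 (illustrative values, not tied to any dataset in this paper) and compares the empirical skewness of D with the 1/√p prediction.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(1)
p, n_pairs = 50, 100_000  # assumed dimension and Monte Carlo size

# Manhattan distance between independent standard-normal points.
X = rng.standard_normal((n_pairs, p))
Xp = rng.standard_normal((n_pairs, p))
D = np.abs(X - Xp).sum(axis=1)

# Per-coordinate term Z = |X_j - X'_j| and the 1/sqrt(p) scaling prediction.
Z = np.abs(rng.standard_normal(n_pairs) - rng.standard_normal(n_pairs))
print("empirical skewness of D :", skew(D))
print("gamma_1(Z) / sqrt(p)    :", skew(Z) / np.sqrt(p))  # the two values are close
```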
A second source of skewness arises from the order-statistic properties of $d_k$. For a fixed reference point $x$, the $k$-th neighbor distance has density
f_{(k)}(r) = \frac{(n-1)!}{(k-1)!\,(n-1-k)!}\, f_R(r)\, F_R(r)^{\,k-1} \big( 1 - F_R(r) \big)^{\,n-1-k},
supported on $[0, \infty)$ [49,50]. When aggregated over all $x_i$, the resulting set $\{ d_k(x_i) \}$ forms a mixture of such order-statistic distributions [51], naturally generating a long right tail associated with sparse or anomalous regions.
Given these theoretical foundations, our modeling choice naturally follows. Parametric methods offer several advantages over non-parametric approaches, including clearer interpretability, greater accuracy, and more efficient computation. Parametric analysis assumes that data arise from a specific underlying distribution, and in our case, the set of d k values meets the structural requirements for such modeling by being nonnegative, right-tailed, and exhibiting positive skewness consistent with established parametric families.
Because the majority of observations fall within concentrated regions of the feature space, while a smaller number of samples lie in sparser or anomalous regions, the resulting density of d k displays a steep rise near the modal distance followed by a long gradual decay. This behavior aligns with the skewed distance distributions observed in high-dimensional settings and with the distance concentration phenomenon, described as a declining ratio between spread and magnitude of distances [46]. The resulting one-dimensional data thus exhibit a persistent positive skew, with an extended right tail reflecting values that deviate from both the mean and median.
Accordingly, to capture and characterize this structure, we fit a family of positively skewed distributions to the transformed one-dimensional distance data. This enables us to use parametric scoring rules based on calibrated tail probabilities, rather than raw distances alone.

3.1.1. Assumptions and Limitations of Parametric Modeling

The parametric fitting strategy implicitly assumes that the empirical distribution of d k can be reasonably approximated by a known parametric family. This assumption may fail in datasets where the d k statistics exhibit multimodality, strong heterogeneity across local subregions, or heavy contamination, in which case a global one-family parametric fit may be inadequate. Under such circumstances, purely non-parametric methods (e.g., LOF, COF, or KNN) may perform either comparably or even favorably despite their susceptibility to distance concentration effects. Our empirical findings in Section 3.3 and Section 4 suggest that when the parametric fit is statistically well-aligned with the empirical distance distribution, the CDF-based scoring exhibits a clear ranking advantage; however, this advantage may diminish or disappear in regimes of strong model misspecification.

3.1.2. From Positioning to Theory

The discussion above motivates a hybrid scoring rule: reduce high-dimensional data to a one-dimensional summary and then apply a parametric score that aligns with the inlier distribution. We now provide a theoretical justification for this choice by comparing a distribution-aware score to a purely geometric one. Specifically, we consider two outlier scores for a point x: (i) the CDF score F ( x ) of the inlier distribution, which is a monotone transform of the optimal likelihood ratio, and (ii) the KNN distance score d k ( x ) , a standard nonparametric baseline. Using ROC–AUC as our comparison criterion,
\mathrm{AUC}(T) = \Pr\big( T(X_{\text{out}}) > T(X_{\text{in}}) \big),
we show that, under mild regularity conditions (continuous and strictly positive densities), the CDF-based score strictly dominates the KNN distance: it yields fewer pairwise misorderings between outliers and inliers and, therefore, achieves a larger AUC. This result formalizes why a univariate, distribution-aligned score can outperform distance-based heuristics, particularly in regimes where distances lose contrast.
To rigorously substantiate this intuition, we now present a formal result that characterizes when and why CDF-based scoring functions outperform KNN distances in ranking performance. We state the result next.

3.2. Behavior of Continuous Density Function Versus Non-Parametric for ROC-AUC Scores

This section presents the CDF Superiority Theorem and supports it through both numerical examples and simulation-based validation. We begin by examining the mathematical relationship between non-parametric raw distances and their CDF-transformed counterparts.

3.2.1. Comparing CDF-Based Scores and Raw KNN Distances

To clearly establish the validity of comparing CDF-based anomaly scores with non-parametric raw $d_k$ distances, we first state the assumptions under which this comparison holds. Specifically, we assume the following: (1) The transformed distances $d_k$ are nonnegative random variables with a continuous and strictly increasing cumulative distribution function $F_{d_k}(r)$ on $[0, \infty)$; (2) The anomaly score is defined as a monotone transformation $s(x) = 1 - F_{d_k}(d_k(x))$; (3) Anomalies are characterized by extreme (large) distance values relative to the bulk of the distribution.
Under these assumptions, the CDF-based score preserves the rank ordering of raw distances, i.e.,
d_k(x_i) > d_k(x_j) \;\Longleftrightarrow\; s(x_i) < s(x_j),
ensuring that any threshold-based anomaly detection using either $d_k$ or its CDF-derived score is equivalent in the sense of order-preserving decision boundaries.
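A minimal numerical check of this order-preserving equivalence, assuming (for illustration only) a gamma model for the transformed distances:

```python
import numpy as np
from scipy.stats import gamma
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)

# Assumed setting: inlier distances ~ Gamma(2, scale=1), outlier distances stretched.
d_in = gamma.rvs(a=2.0, scale=1.0, size=300, random_state=rng)
d_out = gamma.rvs(a=2.0, scale=3.0, size=30, random_state=rng)
d = np.concatenate([d_in, d_out])
y = np.concatenate([np.zeros(300), np.ones(30)])

# CDF-based score s(x) = 1 - F(d_k(x)); small s flags outliers, so we rank by -s.
F = gamma(a=2.0, scale=1.0).cdf
s = 1.0 - F(d)

print(roc_auc_score(y, d))    # ranking by raw distance
print(roc_auc_score(y, -s))   # ranking by the CDF-derived score: identical AUC
```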
Having established the ranking equivalence relationship, we now formalize the theoretical advantage of the CDF-based approach.

3.2.2. The CDF Superiority Theorem

Theorem 1. 
Let $X_{\text{in}} \sim f$ and $X_{\text{out}} \sim g$ be independent draws from two continuous densities $f, g$ on $\mathbb{R}$, each strictly positive everywhere. We compare two outlier-scoring rules:
  • CDF score:
    F(x) = \int_{-\infty}^{x} f(t) \, dt .
  • KNN distance score:
    d_k(x) = \text{distance from } x \text{ to its } k\text{th nearest neighbor in an i.i.d. sample } X_1, \ldots, X_n \sim f .
We use the standard ROC–AUC definition
\mathrm{AUC}(T) = \Pr\big( T(X_{\text{out}}) > T(X_{\text{in}}) \big) .
Then
\mathrm{AUC}(F) > \mathrm{AUC}(d_k) .
Proof. 
Let $X_{\text{in}} \sim f$ and $X_{\text{out}} \sim g$ be independent draws from continuous, strictly positive densities on $\mathbb{R}$. For any scoring rule T, define its mis-ordering set
E_T = \{ (x_0, x_1) \in \mathbb{R}^2 : T(x_1) \le T(x_0) \} .
Then
\mathrm{AUC}(T) = \Pr\big( T(X_{\text{out}}) > T(X_{\text{in}}) \big) = 1 - \Pr\big( (X_{\text{in}}, X_{\text{out}}) \in E_T \big) = 1 - \iint_{E_T} f(x_0)\, g(x_1) \, dx_0 \, dx_1 .
(1) CDF Score.
For the CDF score $F(x) = \int_{-\infty}^{x} f(t)\, dt$, F is strictly increasing, hence $E_F = \{ (x_0, x_1) : x_1 \le x_0 \}$. Setting
\mu_F = \iint_{x_1 \le x_0} f(x_0)\, g(x_1) \, dx_0 \, dx_1 = \Pr( X_{\text{out}} \le X_{\text{in}} ) ,
Equation (4) yields $\mathrm{AUC}(F) = 1 - \mu_F$.
(2) KNN Distance Score.
Let $d_k(x)$ be the distance from x to its kth nearest neighbor within an i.i.d. sample $X_1, \ldots, X_n \sim f$. Its misordering set is $E_{d_k} = \{ (x_0, x_1) : d_k(x_1) \le d_k(x_0) \}$ and
\mathrm{AUC}(d_k) = 1 - \iint_{E_{d_k}} f g .
Split
\iint_{E_{d_k}} f g = \underbrace{\iint_{x_1 \le x_0} f g}_{\mu_F} + \underbrace{\iint_{x_1 > x_0,\; d_k(x_1) \le d_k(x_0)} f g}_{\delta} .
We claim $\delta > 0$. Fix $x_0 < x_1$. For $j \in \{0, 1\}$ let $Y_i^{(j)} = | X_i - x_j |$ ($i = 1, \ldots, n$). Each $Y_i^{(j)}$ has a continuous, strictly positive density on $(0, \infty)$; the kth nearest-neighbor distance is the kth order statistic $d_k(x_j) = Y_{(k)}^{(j)}$. The vector $( Y_1^{(0)}, \ldots, Y_n^{(0)}, Y_1^{(1)}, \ldots, Y_n^{(1)} )$ has a positive joint density on $(0, \infty)^{2n}$, and the smooth, one-to-one a.e. mapping to $( Y_{(k)}^{(0)}, Y_{(k)}^{(1)} ) = ( d_k(x_0), d_k(x_1) )$ implies that the pair $( d_k(x_0), d_k(x_1) )$ has a continuous joint density h that is positive on $(0, \infty)^2$. Therefore,
\Pr\big( d_k(x_1) \le d_k(x_0) \big) = \iint_{y_1 \le y_0} h(y_0, y_1) \, dy_1 \, dy_0 > 0 .
Since $f(x_0)\, g(x_1)$ is strictly positive for all $x_0 < x_1$, integrating this strictly positive probability over the set $\{ x_1 > x_0 \}$ yields $\delta > 0$.
(3) Conclusion.
We have
\mathrm{AUC}(d_k) = 1 - ( \mu_F + \delta ) < 1 - \mu_F = \mathrm{AUC}(F) .
Hence, the CDF score attains a strictly larger ROC–AUC than the KNN distance score. □

3.2.3. Extension to Other Nonparametric Methods

The same argument applies to any other non-parametric outlier score. Here is an outline of the proof.
  • ROC–AUC cares only about pairwise ordering.
    \mathrm{AUC}(T) = \Pr\big( T(X_{\text{out}}) > T(X_{\text{in}}) \big) .
  • The CDF score is strictly monotonic in x.
    $F(x) = \Pr_f(X \le x)$ increases strictly, so it never misorders any $x_0 < x_1$.
  • Any non-parametric method must misorder a positive-measure set of pairs.
    Estimated from finite data (LOF, isolation forest, etc.), it cannot perfectly reproduce the CDF ordering, so there exists $x_0 < x_1$ with $T(x_1) \le T(x_0)$ with positive probability.
  • Strict AUC gap follows.
    Let $\mu_F = \Pr( X_{\text{out}} \le X_{\text{in}} )$ and $\mu_{\text{np}} > \mu_F$ be the misorder probability of the non-parametric score. Then,
    \mathrm{AUC}(F) = 1 - \mu_F , \qquad \mathrm{AUC}(T_{\text{np}}) = 1 - \mu_{\text{np}} ,
    so $\mathrm{AUC}(F) > \mathrm{AUC}(T_{\text{np}})$.
Remark 1
Because any non-parametric rule must misorder some inlier–outlier pairs with positive probability, its ROC–AUC is strictly lower than the ideal CDF rule’s.

3.2.4. Significance of the CDF Superiority Theorem

Under mild regularity conditions, assuming continuous and strictly positive densities, the CDF Superiority Theorem provides a theoretical guarantee for distribution-aware scoring in anomaly detection. Ranking observations by the inlier CDF $F(x)$—for example, using the tail score $1 - F(x)$—achieves superior anomaly ranking performance relative to raw KNN distances. The key insight follows from the probability integral transform: if $X \sim f$, then $U = F(X)$ is uniformly distributed on $[0, 1]$. Since ROC analysis depends only on the ordering of scores and is invariant under strictly monotonic transformations [10,52], any monotone function of $F(x)$ preserves the same ROC curve. Consequently, mapping data to a one-dimensional statistic aligned with the inlier distribution enables a scoring method that is theoretically matched to the underlying data distribution. When the model is reasonably well specified, such CDF-based scoring methods yield consistently improved anomaly ranking capability.
To evaluate this theoretical advantage in practice, we next consider the appropriateness of ROC–AUC as the ranking performance measure.

3.2.5. Justification for ROC AUC as Evaluation Metric

The ROC framework has been widely established as a threshold-independent evaluation measure for classification and ranking tasks [9,53,54]. ROC–AUC is especially meaningful in anomaly detection because it evaluates performance across all possible decision thresholds, avoiding the bias associated with any fixed cutoff.
Moreover, ROC–AUC admits a probabilistic interpretation: it represents the probability that a randomly selected anomalous point receives a higher anomaly score than a randomly selected nominal point [9]. Since our method produces a scalar ordering of data points—whether via raw distances $d_k$ or their CDF-based transformation $1 - F_{d_k}$—ROC–AUC naturally quantifies the quality of the ranking induced by these scores.
Thus, given (i) the order-preserving relationship between $d_k$ and $1 - F_{d_k}$ under monotonic transformations, and (ii) the well-established theoretical interpretation of ROC–AUC as a measure of ranking quality [9,53], ROC–AUC provides an appropriate and theoretically grounded metric for evaluating anomaly scoring performance in our study.

3.2.6. Theoretical Support for Parametric Tests

The CDF Superiority Theorem in this paper shows that, under mild regularity and a correctly (or well) specified inlier model F, ranking observations by the inlier CDF—equivalently, by the tail score $p(x) = 1 - F(x)$—achieves a strictly higher ROC–AUC than geometric KNN distance scores. This result provides a principled foundation for parametric outlier procedures that base decisions on model-derived tail probabilities or residuals. In particular, it theoretically supports the multiple-outlier tests of Bagdonavičius and Petkevičius [55], which assume a parametric family for the inlier distribution and identify extreme observations via distribution-aware statistics on orderings of the sample. When the assumed family approximates the true inlier law, our theorem predicts that CDF-based rankings (and the associated p-value thresholds) are optimal in the ranking sense, explaining the empirical effectiveness of model-based multi-outlier tests and motivating their use over purely distance-based heuristics.

3.3. Worked Examples

In this subsection, we illustrate how disordering arises in distance-based anomaly scoring through explicit worked examples. These examples demonstrate that purely geometric scoring can assign identical or nearly identical anomaly values to points that should be distinct in ranking. We first present transparent one-dimensional cases, and then show that the same ambiguity persists in higher-dimensional settings.

3.3.1. 1-D KNN Disordering Example

Dataset (1-D): Inliers $\{1, 2, 3\}$; Outliers $\{8, 9\}$. Choose $k = 2$. Compute 2-NN distances among $\{1, 2, 3, 8, 9\}$:
d_2(1) = 2, \quad d_2(2) = 1, \quad d_2(3) = 2, \quad d_2(8) = 5, \quad d_2(9) = 6 .
CDF ordering demands $x_0 < x_1 \Rightarrow F(x_0) < F(x_1)$. Pick $(x_0, x_1) = (1, 3)$: since $1 < 3$, $F(1) < F(3)$, yet
d_2(1) = d_2(3) = 2 \;\Rightarrow\; d_2(3) \le d_2(1) ,
so the KNN score misorders that pair relative to the CDF ordering.
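The 2-NN distances above can be reproduced with a short brute-force computation (no assumptions beyond the five listed values):

```python
import numpy as np

points = np.array([1.0, 2.0, 3.0, 8.0, 9.0])
k = 2

for x in points:
    dists = np.sort(np.abs(points - x))[1:]  # drop the zero self-distance
    print(f"d_{k}({x:g}) = {dists[k - 1]:g}")
# d_2(1) = 2, d_2(2) = 1, d_2(3) = 2, d_2(8) = 5, d_2(9) = 6
```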

3.3.2. 1-D LOF Disordering Example ( k = 2 )

Dataset: $\{0, 1, 4\}$ with 0, 1 as inliers and 4 as outlier.
Reachability distances:
\text{reach-dist}_2(0, 1) = \max(|0 - 1|, 3) = 3, \quad \text{reach-dist}_2(0, 4) = \max(4, 4) = 4, \quad \text{reach-dist}_2(1, 0) = \max(1, 4) = 4, \quad \text{reach-dist}_2(1, 4) = \max(3, 4) = 4, \quad \text{reach-dist}_2(4, 1) = \max(3, 3) = 3, \quad \text{reach-dist}_2(4, 0) = \max(4, 4) = 4 .
Local reachability densities:
\mathrm{lrd}_2(0) = \frac{1}{(3 + 4)/2} = \frac{2}{7} \approx 0.2857, \quad \mathrm{lrd}_2(1) = \frac{1}{(4 + 4)/2} = 0.25, \quad \mathrm{lrd}_2(4) = \frac{2}{7} \approx 0.2857 .
LOF scores:
\mathrm{lof}_2(0) = \frac{1}{2} \left( \frac{0.25}{0.2857} + \frac{0.2857}{0.2857} \right) = 0.9375, \quad \mathrm{lof}_2(1) = \frac{1}{2} \left( \frac{0.2857}{0.25} + \frac{0.2857}{0.25} \right) \approx 1.1428, \quad \mathrm{lof}_2(4) = 0.9375 .
Pick $(x_0, x_1) = (0, 4)$: although $0 < 4 \Rightarrow F(0) < F(4)$, we have $\mathrm{lof}_2(0) = \mathrm{lof}_2(4)$, so LOF misorders that inlier–outlier pair.
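For completeness, the following sketch recomputes these LOF values directly from the reachability-distance definitions (a plain re-implementation for this three-point example, not a library call):

```python
import numpy as np

pts = np.array([0.0, 1.0, 4.0])
k = 2
n = len(pts)

dist = np.abs(pts[:, None] - pts[None, :])
# k-distance of each point: the k-th smallest nonzero pairwise distance.
k_dist = np.sort(dist, axis=1)[:, k]
# With three points, every point's 2-neighborhood is simply the other two points.
neighbors = [[j for j in range(n) if j != i] for i in range(n)]

def reach_dist(i, j):
    """Reachability distance of point i from neighbor j."""
    return max(dist[i, j], k_dist[j])

lrd = np.array([
    1.0 / np.mean([reach_dist(i, j) for j in neighbors[i]]) for i in range(n)
])
lof = np.array([
    np.mean([lrd[j] for j in neighbors[i]]) / lrd[i] for i in range(n)
])
print(dict(zip(pts, np.round(lof, 4))))
# {0.0: 0.9375, 1.0: 1.1429, 4.0: 0.9375}
```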
In the 1-D setting, the mechanism of disordering is completely exposed. When using KNN distances or LOF scores on scalar samples, ties or near-ties readily occur simply because different points may have identical local spacing along the line. These 1-D constructions make the core issue mathematically clear and easily traceable. The essential insight from these 1-D examples is that when distances collapse onto a small set of discrete values, rank ambiguity becomes inevitable. While this effect may seem attributable to the simplicity of the geometry, we next show that it persists even after moving away from one-dimensional space.

3.3.3. Extension to 3-D Case

While the 1-D case makes the disordering mechanism entirely transparent, we now extend the analysis to a 3-D dataset to demonstrate that the same misranking behavior emerges even in higher-dimensional settings where geometric distance structures are more complex.
Consider the following set of points in $\mathbb{R}^3$: inliers $A = (0, 0, 0)$, $B = (2, 2, 0)$, $C = (2, 2, 2)$ and outliers $D = (4, 4, 4)$, $E = (7, 6, 5)$. Using the Manhattan distance with $k = 2$, we compute the 2-nearest-neighbor distances for each point. Explicitly evaluating and sorting the pairwise distances, we obtain:
d_2(A) = 6, \quad d_2(B) = 4, \quad d_2(C) = 6, \quad d_2(D) = 6, \quad d_2(E) = 12 .
We observe that the outlier D receives the same score as two inliers, A and C, i.e.,
d_2(D) = d_2(A) = d_2(C) = 6 .
Thus, KNN distance—a purely non-parametric statistic—fails to distinguish D from points lying clearly deeper within the inlier cluster.
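The 2-NN Manhattan distances above can be verified directly, for example with the brute-force sketch below:

```python
import numpy as np

pts = np.array([[0, 0, 0],   # A (inlier)
                [2, 2, 0],   # B (inlier)
                [2, 2, 2],   # C (inlier)
                [4, 4, 4],   # D (outlier)
                [7, 6, 5]])  # E (outlier)
k = 2

# Pairwise Manhattan (L1) distances, then the k-th nearest-neighbor distance per point.
dist = np.abs(pts[:, None, :] - pts[None, :, :]).sum(axis=2)
d_k = np.sort(dist, axis=1)[:, k]  # column 0 is the zero self-distance
print(dict(zip("ABCDE", d_k)))     # {'A': 6, 'B': 4, 'C': 6, 'D': 6, 'E': 12}
```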

3.3.4. 3-D LOF Scoring

We also evaluate LOF on this dataset to analyze whether local-density comparison can resolve this ambiguity. Using the standard reachability and local reachability density definitions, we compute:
\mathrm{LRD}(A) = 0.20, \quad \mathrm{LRD}(B) = 0.1667, \quad \mathrm{LRD}(C) = 0.20, \quad \mathrm{LRD}(D) = 0.1111, \quad \mathrm{LRD}(E) = 0.1111 ,
yielding the corresponding LOF scores:
\mathrm{LOF}(A) = 0.92, \quad \mathrm{LOF}(B) = 1.20, \quad \mathrm{LOF}(C) = 0.69, \quad \mathrm{LOF}(D) = 1.40, \quad \mathrm{LOF}(E) = 1.40 .
Here, LOF successfully elevates the anomaly scores for D and E relative to the inliers. However, the improvement is only partial: LOF relies on local neighborhood densities and still lacks a global distributional reference, causing sensitivity to local sampling irregularities and neighborhood selection.

3.3.5. Connection to CDF-Based Scoring

To contrast these geometric scores with our CDF-based approach, we consider an inlier density whose probability mass is centered at the origin. Using the Euclidean radius,
R(x) = \sqrt{x_1^2 + x_2^2 + x_3^2} ,
we obtain the ordering:
R(A) < R(B) < R(C) < R(D) < R(E) .
Applying the tail score $s(x) = 1 - F_R(R(x))$ yields strictly ordered anomaly scores:
s(E) > s(D) > s(C) > s(B) > s(A) ,
correctly ranking outliers above inliers. Unlike KNN distances (which tied A, C, and D) and unlike LOF (which is constrained by local density effects), the CDF-based score uses the entire fitted inlier distribution and thereby eliminates rank ambiguity arising from local geometric coincidence.
These examples collectively illustrate that disordering is not an artifact of low-dimensional illustrations, but a fundamental consequence of geometric distance concentration and neighborhood symmetry. Distribution-aware scoring using the fitted CDF provides a principled way to break ties and establish a globally consistent anomaly ranking.

3.4. Remark

KNN distance can assign identical scores to $x_0 < x_1$ even though $F(x_0) < F(x_1)$. LOF can assign the same score to an inlier and an outlier, violating the true CDF ranking.
Thus, any nonparametric method like KNN or LOF must strictly underperform the CDF-based score in ROC–AUC: it misorders some positive-probability inlier–outlier pairs.
In practice we do not compute the integral
\Pr\big( d_k(x_1) \le d_k(x_0) \big) = \iint_{y_1 \le y_0} h(y_0, y_1) \, dy_1 \, dy_0
directly, but rather approximate it by the fraction of misordered pairs in the finite data set. Concretely, if we have N inliers and M outliers, we form all $N \times M$ pairs $(x_{\text{in}}, x_{\text{out}})$ and compute
\hat{p} = \frac{1}{NM} \sum_{i=1}^{N} \sum_{j=1}^{M} \mathbf{1}\big\{ d_k(x_{\text{out}}^{(j)}) \le d_k(x_{\text{in}}^{(i)}) \big\} .
Even though the true probability $p > 0$, it is quite possible, especially when N and M are small or the scores contain many ties, to observe zero misordered pairs in the sample, i.e., $\hat{p} = 0$. In that case the empirical AUC attains its maximum value of 1.0.
The theorem guarantees that p > 0 in the population limit, that is, as the number of data points approaches infinity. In finite samples, however, random fluctuations may cause the empirical estimate p ^ to be zero simply because, by chance, no misordered pairs are observed within the Monte Carlo sample.
As N and M grow larger, or as the experiment is repeated, the chance that p ^ = 0 becomes smaller, roughly at an exponential rate in N M . However, this probability does not disappear entirely until the sample size tends to infinity.
In conclusion, the sample size must be sufficiently large to mitigate random misorderings arising from sampling variability. Since the CDF is estimated probabilistically, finite-sample fluctuations can cause certain points to be overestimated and thus mistakenly classified as outliers. In practice, a larger sample reduces this Monte Carlo noise and yields a ranking that better reflects the true ordering implied by the underlying distributions.
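A minimal sketch of this finite-sample estimate, using hypothetical gamma-distributed distance scores for the N inliers and M outliers (the parameter values are illustrative only):

```python
import numpy as np

def misorder_fraction(d_in, d_out):
    """Fraction of (inlier, outlier) pairs with d_k(x_out) <= d_k(x_in)."""
    d_in = np.asarray(d_in)[:, None]    # shape (N, 1)
    d_out = np.asarray(d_out)[None, :]  # shape (1, M)
    return np.mean(d_out <= d_in)

rng = np.random.default_rng(3)
# Hypothetical distance scores; outliers only mildly separated from inliers.
d_in = rng.gamma(shape=2.0, scale=1.0, size=30)
d_out = rng.gamma(shape=2.0, scale=2.0, size=5)

p_hat = misorder_fraction(d_in, d_out)
print("p_hat =", p_hat, " empirical AUC =", 1.0 - p_hat)
```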

3.5. Monte Carlo Simulation of CDF Versus Non-Parametric ROC-AUC Scores

To evaluate the empirical performance of the CDF-based scoring rule and connect it to the theoretical result in Section 3.2, we ran Monte Carlo experiments using three data–generating models: Gamma, Inverse Gaussian, and Skew–Normal. For each distribution, 200 inlier samples were generated and MLE was applied to estimate the model parameters. These fitted parameters defined the reference inlier CDF used for scoring.
Each simulation run then produced 400 evaluation samples consisting of inliers mixed with injected outliers drawn from the fixed parameter settings in Table 1a–c. This procedure was repeated 500 times, and in each run, we computed ROC–AUC values for KNN, LOF, ABOD, COF, and the CDF-based approach. The results appear in Table 2a–c.
Across all settings, the CDF-based method gives the strongest average ROC–AUC performance and, in several cases, shows reduced variability across repetitions. With the Gamma model (Table 2a), the improvement over KNN and the other neighbor-based methods is consistent and substantial. With the Inverse Gaussian model (Table 2b), CDF performs slightly better than KNN and the other neighbor-based methods. With the Skew–Normal model (Table 2c), where asymmetry in cluster geometry is more pronounced, CDF again provides the most effective ranking of inliers and outliers. These results suggest that the advantage of the CDF transformation persists across a range of underlying distributions.
The benefit of the CDF approach comes from transforming distances onto a probability scale, which reduces artifacts that can arise from raw neighbor distances. Even with imperfect parametric fits due to finite sample size, this transformation produces a more stable and interpretable ordering of observations.
It is worth noting that the parameter configurations in Table 1a–c are not chosen to engineer specific theoretical tail behaviors. Rather, they create differences in the local neighborhood structure: inliers occupy tight regions of the space, whereas outliers are positioned to yield larger KNN distances. In this setting, anomalies naturally occur in the upper tail of the empirical distance distribution, and the CDF transformation emphasizes this separation.
In terms of computational efficiency, the CDF-based scoring rule incurs relatively low cost: after fitting the inlier distribution, each test observation is evaluated through a direct CDF computation. In contrast, KNN and the other neighbor-based methods require repeated pairwise distance calculations against the training data, resulting in higher computational burden as the dataset increases. This difference in cost is particularly relevant for large-scale or streaming applications. Furthermore, the consistent improvement in the CDF-based method across all three simulation settings (Gamma, Inverse Gaussian, and Skew–Normal) indicates that the ordering advantage is not tied to a specific parametric model and persists across differing underlying distributional structures.
Finally, the theoretical ordering guarantee of Section 3.2 applies under idealized distributional assumptions. The empirical results in this section show that the CDF-based method maintains its ranking advantage even when those assumptions are relaxed and the data include sampling variability and model mismatch. The complete Python code used for the Gamma, Inverse Gaussian, and Skew–Normal experiments, including parameter fitting and ROC–AUC evaluation, is provided in Appendix A to ensure reproducibility.
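While the full experimental code appears in Appendix A, a single simulation path under the Gamma setting can be sketched as follows; the parameter values below are illustrative placeholders rather than the settings of Table 1.

```python
import numpy as np
from scipy.stats import gamma
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)

# Step 1: fit the reference inlier model on a training sample (MLE).
train = gamma.rvs(a=2.0, scale=1.0, size=200, random_state=rng)
a_hat, loc_hat, scale_hat = gamma.fit(train, floc=0)

# Step 2: build an evaluation set of inliers mixed with injected outliers.
x_in = gamma.rvs(a=2.0, scale=1.0, size=360, random_state=rng)
x_out = gamma.rvs(a=6.0, scale=2.0, size=40, random_state=rng)
x = np.concatenate([x_in, x_out]).reshape(-1, 1)
y = np.concatenate([np.zeros(360), np.ones(40)])

# CDF-based score: right-tail probability under the fitted model (small = anomalous).
tail = 1.0 - gamma.cdf(x.ravel(), a_hat, loc=loc_hat, scale=scale_hat)

# KNN baseline: distance to the k-th nearest neighbor within the evaluation set.
k = 5
nn = NearestNeighbors(n_neighbors=k + 1, metric="manhattan").fit(x)
d_k = nn.kneighbors(x)[0][:, k]

print("CDF  AUC:", roc_auc_score(y, -tail))
print("KNN  AUC:", roc_auc_score(y, d_k))
```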

4. Our Parametric Outlier-Detection Framework

We propose a two-stage pipeline: (i) reduce the data to a one-dimensional distance statistic that preserves the degree of “outlier-ness’’ even in high dimension, and (ii) fit a parametric family to that statistic and score points by calibrated tail probabilities. This design keeps computation light, retains interpretability, and—by working with a 1-D summary—avoids the distance–concentration pitfalls of high-dimensional spaces [5].

4.1. Dimensionality Reduction via KNN–Manhattan

For each observation $x \in \mathbb{R}^n$, we compute the distance to its $k$th nearest neighbor under the $L_1$ metric,
d_k(x) = \min_{x^{(k)} \in N_k(x)} \big\| x - x^{(k)} \big\|_1 .
Using $L_1$ (Manhattan) rather than $L_2$ (Euclidean) helps retain spread and ranking contrast as $n$ grows [12]; it also mitigates hubness, where a few points become nearest neighbors of many others and distort scores [46]. Empirically, the distribution of $\{ d_k(x_i) \}$ is typically right-skewed, which motivates the parametric fits below.
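A minimal sketch of this reduction step, using scikit-learn's NearestNeighbors with the Manhattan metric (the data and the choice k = 5 are placeholders):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_manhattan_distances(X, k):
    """Reduce each row of X to its distance to the k-th nearest neighbor (L1 metric)."""
    nn = NearestNeighbors(n_neighbors=k + 1, metric="manhattan").fit(X)
    dists, _ = nn.kneighbors(X)  # column 0 is the zero distance to the point itself
    return dists[:, k]

# Example: 500 points in 50 dimensions reduced to a single distance vector.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 50))
d_k = knn_manhattan_distances(X, k=5)
print(d_k.shape)  # (500,)
```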

4.2. Fitting Positively Skewed Distributions

Let $D = \{ d_k(x_i) \}_{i=1}^{n}$ denote the dataset of one-dimensional distances. We fit D using a family of positively skewed distributions via the maximum likelihood estimation (MLE) method. This family includes Normal-like distributions such as the log-normal, log-Student-t, log-Laplace, log-logistic, and log-skew-normal, as well as other positively skewed distributions including the exponential, chi-squared ($\chi^2$), gamma, Weibull-minimum, inverse Gaussian, Rayleigh, Wald, Pareto, Nakagami, logistic, power-law, and skew-normal distributions. Denoting a generic density by $p(\cdot\,; \eta)$ with parameter $\eta$, we maximize
\hat{\eta} = \arg\max_{\eta} \sum_{i=1}^{n} \log p\big( d_k(x_i); \eta \big) .
For example, the gamma PDF is $p(d; \kappa, \theta) = \frac{1}{\Gamma(\kappa)\,\theta^{\kappa}}\, d^{\kappa - 1} e^{-d/\theta}$, and the Weibull PDF is $p(d; \lambda, \beta) = \frac{\beta}{\lambda} \left( \frac{d}{\lambda} \right)^{\beta - 1} e^{-(d/\lambda)^{\beta}}$. After fitting, we score a point by its right-tail probability under the fitted CDF $\hat{F}$,
s(x) = 1 - \hat{F}\big( d_k(x) \big) ,
which is a calibrated, distribution-aligned p-value. Rankings are invariant to monotone transforms, so we can equivalently use $\log s(x)$.
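A minimal sketch of the fit-and-score step for one candidate family (the gamma distribution via SciPy; the family choice and the synthetic distance vector are assumptions of the example):

```python
import numpy as np
from scipy.stats import gamma

def gamma_tail_scores(d_k):
    """Fit a gamma model to the distance vector and return right-tail scores 1 - F_hat."""
    a_hat, loc_hat, scale_hat = gamma.fit(d_k, floc=0)  # MLE with location fixed at 0
    return 1.0 - gamma.cdf(d_k, a_hat, loc=loc_hat, scale=scale_hat)

rng = np.random.default_rng(1)
d_k = rng.gamma(shape=3.0, scale=1.5, size=1000)  # stand-in for real KNN distances
scores = gamma_tail_scores(d_k)
# Small tail probabilities indicate the most anomalous observations.
print(np.argsort(scores)[:5])
```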
As shown in Table A1, Table A2, Table A3 and Table A4, the vast majority of fitted distributions achieve an average $R^2$ above 90%, and many exceed 95%. In our implementation, we set an empirical cutoff of 85% for the average $R^2$. For the two datasets where the best-fitting distribution yields an average $R^2$ below 85%, we intentionally retain these cases to evaluate the robustness of our parametric scoring method under imperfect fits. Even in these cases, the tail ordering of distances—the key requirement of the CDF Superiority Theorem—remains valid, indicating that the fitted models still provide correct anomaly ranking in the distribution tails. Consistent with this, the average ROC–AUC on both the literature datasets and the semantic datasets remains high, demonstrating that our proposed approach is robust even when the global $R^2$ of the QQ fit is not ideal.

Log–Transform and Normal-like Fits

We follow the ladder-of-powers guideline that lower-power transforms (log, square-root) reduce positive skew [19]. When the distance sample $\{ d_k(x_i) \}$ is strictly positive and right-skewed, we set $y_i = \log d_k(x_i)$ and fit location-scale families on $\{ y_i \}$: normal $N(\mu, \sigma^2)$, Student-$t(\mu, \sigma, \nu)$ for heavier tails, and logistic; to absorb any residual asymmetry we also include the skew-normal with shape parameter $\alpha$ [56]. Outlier scores are computed on the original scale via the fitted CDF of $Y = \log d_k(X)$,
s(x) = 1 - \hat{F}_Y\big( \log d_k(x) \big) ,
which is equivalent to using right-tail z-scores for normal-like fits. This “log-transform branch” complements the positive-skew families and improves robustness whenever the log transform approximately symmetrizes the distance distribution.
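The log-transform branch can be sketched analogously, here with a skew-normal fit on the log distances (again an illustrative family choice applied to synthetic data):

```python
import numpy as np
from scipy.stats import skewnorm

def log_skewnorm_tail_scores(d_k):
    """Fit a skew-normal model to log(d_k) and return right-tail scores on the original scale."""
    y = np.log(d_k)                      # requires strictly positive distances
    a_hat, loc_hat, scale_hat = skewnorm.fit(y)
    return 1.0 - skewnorm.cdf(y, a_hat, loc=loc_hat, scale=scale_hat)

rng = np.random.default_rng(2)
d_k = rng.lognormal(mean=0.5, sigma=0.4, size=1000)  # stand-in for real KNN distances
scores = log_skewnorm_tail_scores(d_k)
print(scores.min(), scores.max())
```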

4.3. Baseline Non-Parametric Methods

We implement a set of standard baselines widely used in the outlier detection literature, including KNN, LOF, SimplifiedLOF, LoOP, LDOF, ODIN, FastABOD, KDEOS, LDF, INFLO, and COF. For these 11 baseline methods, we report results under both $L_1$ and $L_2$ metrics. All methods score by decreasing density (or increasing distance) and are evaluated by ROC–AUC. In the real-data experiments, the parameter k for the KNN distance is not chosen arbitrarily or fixed a priori. Instead, it is selected in a data-driven way by scanning values of k from 2 to 69 and choosing the value that maximizes the empirical ROC–AUC on the given dataset. This procedure prevents arbitrary selection of k and ensures that KNN is evaluated under its best achievable configuration for each dataset.
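A sketch of this data-driven scan for the KNN baseline, assuming labeled data X, y are available (scikit-learn calls only; the toy data are placeholders):

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import NearestNeighbors

def best_k_knn_auc(X, y, k_min=2, k_max=69, metric="manhattan"):
    """Scan k and return the value maximizing the empirical ROC-AUC of the k-NN distance score."""
    nn = NearestNeighbors(n_neighbors=k_max + 1, metric=metric).fit(X)
    dists, _ = nn.kneighbors(X)  # column j holds the distance to the j-th neighbor
    aucs = {k: roc_auc_score(y, dists[:, k]) for k in range(k_min, k_max + 1)}
    best_k = max(aucs, key=aucs.get)
    return best_k, aucs[best_k]

# Toy illustration: a Gaussian cluster plus a small group of shifted outliers.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(size=(200, 10)), rng.normal(loc=4.0, size=(10, 10))])
y = np.r_[np.zeros(200), np.ones(10)]
print(best_k_knn_auc(X, y))
```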

4.4. Datasets

As mentioned in the last subsection of Section 2 (Literature Review), we evaluate both the baseline methods and our proposed approaches on 23 datasets, including 11 literature datasets and 12 semantic datasets. A descriptive summary of the two dataset types is provided in Table 3. All datasets were obtained from Campos et al. [17]. We work directly with these benchmark datasets in their original form as provided by Campos et al. [17] and hosted in the DAMI repository [57], and apply the same z-score standardization to each feature (zero mean and unit variance) as used in their evaluation framework. This ensures full consistency with the established preprocessing treatment in the prior study and enables fair comparison across methods. The semantic datasets, in particular, have been modified to better reflect real-world occurrences of outliers. Each dataset varies in the number of outliers included, ranging from as low as 0.2% to as high as 75%. Campos et al. [17] provided results for multiple levels of outlier percentages for most datasets; we therefore chose to focus on the highest outlier levels because they contain all observations rather than a subset.

5. Empirical Real-World Data Results

Before examining the outlier-detection performance summarized in Table 4, it is instructive to first assess how well the proposed parametric families capture the underlying data distributions. To this end, we analyze the goodness-of-fit results based on the R 2 values from QQ plots, as summarized in Table A1, Table A2, Table A3, Table A4.

5.1. Analysis of Goodness-of-Fit R 2 Across Literature and Semantic Datasets

Table A1, Table A2, Table A3, Table A4 summarize the R 2 values across both log-transformed and untransformed settings. Log-transformed models consistently yield higher R 2 values (94–99%), confirming their stability and closer alignment with theoretical quantiles. For the literature datasets, the log-transformed normal, Student-t, and skew-normal distributions perform best; for the semantic datasets, the skew-normal and Student-t distributions remain the strongest. Two representative examples of fitted distributions are shown in Figure 1 and Figure 2: the Arrhythmia dataset (Figure 1) achieves an R 2 = 0.9745 under a gamma distribution, while the Parkinson dataset (Figure 2) attains an R 2 = 0.9692 under a skew-normal distribution after log transformation. Both figures show strong agreement between theoretical and empirical quantiles, reinforcing that log transformation enhances fit quality and that the proposed parametric framework remains robust across diverse data types while supporting high detection accuracy.

5.2. Real Data Analysis Results

After evaluating 17 parametric distributions—12 positively skewed and 5 approximately symmetric—across 23 datasets, the proposed parametric fits on the one-dimensional distance function d k ( · ) (optionally after a log transform) achieve KNN-level or higher accuracy while consistently outperforming other baseline detectors. To illustrate this behavior concretely, Figure 3 and Figure 4 present two examples of ROC comparisons on two representative datasets, demonstrating that the best-fit parametric distribution produces detection performance comparable to, and often slightly better than, the KNN baseline.
In the literature datasets, the inverse Gaussian (without log transformation) distribution achieves the highest average ROC–AUC of 87.56%, matching or slightly exceeding KNN– L 1 / L 2 (87.53–87.66%) and clearly outperforming LOF, COF, KDEOS, and FastABOD, which frequently fall below 85%. Per-dataset analyses (Table A5, Table A6, Table A7, Table A8) show stable wins or ties for the parametric models, with notable advantages in moderately skewed datasets such as PIMA (73.7% vs. KNN–L1 67%) and strong robustness in highly skewed ones like KDDCup99 and WDBC, where fitted distributions maintain near-perfect detection accuracy (>96%). On the semantic datasets (Table A9, Table A10, Table A11, Table A12), the best-performing parametric distributions—the skew-normal under log transformation and the inverse Gaussian without log transformation—achieve an average ROC–AUC of 72.38%, essentially matching and slightly exceeding KNN– L 1 (72.33%), while outperforming LOF (≈69%), KDEOS (≈65%), COF (≈60%), and ABOD (≈63%). Certain baseline methods, including LDOF and FastABOD, were computationally infeasible for several large datasets (as indicated in Table A5, Table A6, Table A7, Table A8, Table A9, Table A10,Table A11, Table A12), underscoring the practical advantage of the lightweight parametric approach. These results confirm that a small and interpretable family of fitted distributions, once paired with a simple neighborhood scale k, provides competitive accuracy with far less parameter tuning.
When comparing the average ROC–AUCs for the literature datasets and semantic datasets in Table 4, both domains show a similar drop in absolute performance, yet the parametric methods remain remarkably uniform across transformations and dataset types. Their average ROC–AUC stays within a narrow band (≈87% → 72%), indicating strong distributional adaptability and low sensitivity to distance-metric choice. In contrast, baseline methods exhibit wider fluctuations and sharper degradation. Several factors explain the superiority of the parametric framework: (1) performance consistency, as it maintains nearly identical rankings across datasets, highlighting reliable generalization; (2) statistical interpretability, since each fitted distribution (e.g., t, inverse-Gaussian, skew-normal) conveys explicit probabilistic semantics—tail behavior, variance, and skewness—that yield explainable anomaly thresholds; (3) computational efficiency, because once parameters are estimated, new-sample scoring becomes lightweight compared with K-neighbor searches; and (4) practical robustness, since these models attain equal or higher ROC–AUC than KNN or ODIN without heavy hyperparameter tuning. When performance levels are close, interpretability becomes decisive—the parametric models provide transparent probabilistic reasoning while achieving comparable or better accuracy. Overall, across both literature and semantic datasets, these results establish the proposed parametric approach as a simple, interpretable, and high-performing alternative to traditional distance-based outlier detectors.

6. Conclusions

We proposed a distribution-aware framework for unsupervised outlier detection that reduces multivariate data to one-dimensional neighborhood statistics and identifies anomalies through fitted parametric distributions. Supported by the CDF Superiority Theorem, this approach connects statistical distribution modeling with ROC–AUC consistency and produces interpretable, probabilistically calibrated scores for anomaly ranking.
Empirically, our results highlight three main observations. First, across both the literature and semantic benchmark datasets, the empirical kNN distance distributions are typically right-skewed, and a broad class of positively skewed distributions provides good one-dimensional fits. In Table A1, Table A2, Table A3 and Table A4, most log-transformed models (normal, t, Laplace, logistic, skew-normal) attain average QQ-plot R² values between approximately 91% and 98%, while several non-transformed families (gamma, inverse Gaussian, Weibull (minimum), chi-square, Pareto) also produce high-quality fits on many datasets. No single distribution dominates universally, supporting the view that one-dimensional neighborhood statistics are best described by a flexible family of tail models rather than by a single canonical law.
Second, when these models are used to form CDF-based anomaly scores, the resulting ROC–AUC accuracy is competitive with, or superior to, strong non-parametric baselines. Across the 23 datasets, the proposed parametric families deliver average ROC–AUC performance that matches or exceeds the strongest kNN-based methods, while clearly outperforming density-, angle-, and distance-based competitors such as LOF, KDEOS, COF, and LDOF. This holds both for the classical literature datasets (≈87.4% average ROC–AUC) and for the semantically complex datasets (≈72.3% average ROC–AUC), demonstrating robust performance across diverse regimes.
Third, because our modeling occurs in one dimension, the framework remains computationally lightweight and requires little hyperparameter tuning. Relative to methods of comparable accuracy, our parametric scoring rule offers clear probabilistic interpretability and lower computational cost, avoiding the heavy machinery and sensitivity associated with density estimation, high-dimensional kernels, and local geometric heuristics.
Overall, these results highlight a principled and interpretable pathway for outlier detection, showing that statistical modeling of neighborhood distances can achieve strong, stable performance without relying on complex non-parametric procedures. In summary, the experiments demonstrate that parametric CDF modeling of KNN distance statistics yields consistent performance gains, or parity, relative to established methods on real datasets, with strong empirical support from QQ-plot fits (Table A1, Table A2, Table A3, Table A4) and ROC–AUC benchmarks (Table 4). The more speculative extensions, including adaptive model selection, multivariate dependence modeling, and hierarchical tail calibration, are presented as natural future directions; our conclusions thus distinguish clearly between what has been established experimentally and what is suggested conceptually for further methodological work.

Author Contributions

Conceptualization, J.Z.; Formal Analysis, J.Z., W.D., E.T. and K.H.; Methodology, J.Z., W.D., E.T. and K.H.; Project Administration, J.Z.; Software, J.Z. and K.H.; Supervision, J.Z., W.D. and E.T.; Validation, J.Z., W.D., E.T. and K.H.; Visualization, J.Z. and K.H.; Writing—original draft, J.Z. and K.H.; Writing—review and editing, J.Z., W.D., E.T. and K.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available in Outlier-Detection at https://github.com/hodge-py/Outlier-Detection (accessed on 30 November 2025). These data were derived from the following resource available in the public domain: https://www.dbs.ifi.lmu.de/research/outlier-evaluation/DAMI/ (accessed on 30 November 2025).

Acknowledgments

During the preparation of this manuscript/study, the authors used ChatGPT 5.0 and 5.1 for the purposes of editing statements and correcting grammatical errors. The authors have reviewed and edited the output and take full responsibility for the content of this publication. This research was partially supported by an internal research grant from Southern Arkansas University awarded to Jie Zhou.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
KNN: k-Nearest Neighbors
LOF: Local Outlier Factor
COF: Connectivity-Based Outlier Factor
ABOD: Angle-Based Outlier Detection
KDE: Kernel Density Estimation
CDF: Cumulative Distribution Function
TPR: True Positive Rate
FPR: False Positive Rate
ROC: Receiver Operating Characteristic
AUC: Area Under the Curve
ESD: Extreme Studentized Deviate
LoOP: Local Outlier Probabilities
LDOF: Local Distance-Based Outlier Factor
ODIN: Outlier Detection using Indegree Number
KDEOS: Kernel Density Estimation Outlier Score
SVM: Support Vector Machine
SVDD: Support Vector Data Description
DAGMM: Deep Autoencoding Gaussian Mixture Model
HBOS: Histogram-Based Outlier Score
LODA: Lightweight On-line Detector of Anomalies
COPOD: Copula-Based Outlier Detection
INFLO: Influenced Outlierness

Appendix A

Listing A1 contains the Python 3.12.0 (64-bit) script used for the Gamma-based Monte Carlo experiments. After importing the required libraries and setting a fixed random seed (2026) for reproducibility, the code defines the set of detectors (KNN, LOF, SLOF, LoOP, LDOF, ABOD, LDF, INFLO, COF) and stores their respective training routines in a dictionary. The simulation parameters (runs, sample sizes, and inlier/outlier distribution parameters) are then specified. In each run, the script (i) samples Gamma inliers for training, (ii) fits the inlier model by maximum likelihood with fixed location, (iii) generates Gamma inlier and outlier test points, (iv) computes anomaly scores via the fitted CDF or each PyOD detector, and (v) records the resulting ROC–AUC. Finally, it summarizes the AUC distribution using pandas. The code for the Inverse Gaussian and Skew–Normal experiments follows the same structure, differing only in the distributional generator and corresponding CDF.
The code and additional experimental results can be found in the Hodge-py outlier detection repositories [58,59] and the DAMI outlier evaluation collection [57].
Listing A1. Monte Carlo simulation example for positively skewed data.
(Listing A1 is included as an image in the published version of the article.)
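Because the published listing appears as an image, a condensed sketch of the Gamma-based experiment is reproduced below for convenience. It follows the structure described above and the parameters of Table 1(a), but, as assumptions made for brevity, it restricts the baseline detectors to the PyOD models reported in Table 2 (KNN, LOF, ABOD, COF) and uses illustrative sample sizes; the full script in the repositories [58,59] additionally covers SLOF, LoOP, LDOF, LDF, and INFLO.

# Condensed sketch of the Gamma-based Monte Carlo experiment in Listing A1.
# Assumptions: only the PyOD detectors reported in Table 2 are included,
# and the sample sizes below are illustrative.
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.metrics import roc_auc_score
from pyod.models.knn import KNN
from pyod.models.lof import LOF
from pyod.models.abod import ABOD
from pyod.models.cof import COF

rng = np.random.default_rng(2026)                # fixed seed for reproducibility
runs, n_train, n_in, n_out = 500, 1000, 450, 50
detectors = {"KNN": KNN, "LOF": LOF, "ABOD": ABOD, "COF": COF}
records = []

for _ in range(runs):
    # (i) sample Gamma inliers for training (shape 2.0, scale 2.0; Table 1(a))
    x_train = stats.gamma.rvs(a=2.0, scale=2.0, size=n_train, random_state=rng)
    # (ii) fit the inlier model by maximum likelihood with fixed location
    a_hat, loc_hat, scale_hat = stats.gamma.fit(x_train, floc=0)
    # (iii) generate Gamma inlier and outlier test points (outliers: shape 5.0, scale 2.0)
    x_in = stats.gamma.rvs(a=2.0, scale=2.0, size=n_in, random_state=rng)
    x_out = stats.gamma.rvs(a=5.0, scale=2.0, size=n_out, random_state=rng)
    x_test = np.concatenate([x_in, x_out]).reshape(-1, 1)
    y_test = np.concatenate([np.zeros(n_in), np.ones(n_out)])
    # (iv) score via the fitted CDF and via each PyOD detector, (v) record the ROC-AUC
    cdf_scores = stats.gamma.cdf(x_test.ravel(), a_hat, loc=loc_hat, scale=scale_hat)
    row = {"CDF": roc_auc_score(y_test, cdf_scores)}
    for name, cls in detectors.items():
        model = cls().fit(x_train.reshape(-1, 1))
        row[name] = roc_auc_score(y_test, model.decision_function(x_test))
    records.append(row)

# summarize the AUC distribution, as in Table 2(a)
print(pd.DataFrame(records).describe())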
Table A1. R2 for QQ plots of literature datasets—part 1.
Distribution | ALOI | Glass | Ionosphere | KDDCup99 | Lymphography | PenDigits
Log Transform
norm | 98.81% | 93.36% | 98.36% | 90.25% | 92.44% | 97.88%
t | 98.86% | 91.38% | 98.36% | 88.38% | 96.71% | 97.66%
laplace | 95.66% | 90.36% | 91.60% | 88.98% | 90.75% | 93.12%
logistic | 98.42% | 92.25% | 96.12% | 90.08% | 94.47% | 96.69%
skewnorm | 99.71% | 98.72% | 98.30% | 95.60% | 97.85% | 99.86%
No Transform
expon | 89.28% | 93.64% | 97.14% | 76.67% | 96.17% | 99.36%
chi2 | 91.27% | 94.33% | 96.29% | 92.60% | 97.28% | 98.23%
gamma | 93.91% | 93.87% | 97.12% | 97.96% | 92.45% | 97.86%
weibull_min | 93.93% | 97.77% | 97.33% | 97.73% | 91.00% | 95.70%
invgauss | 99.13% | 95.99% | 93.88% | 99.18% | 94.22% | 96.39%
rayleigh | 69.15% | 75.77% | 90.79% | 52.58% | 79.50% | 93.74%
wald | 96.14% | 97.45% | 93.01% | 87.44% | 98.24% | 96.70%
pareto | 98.36% | 95.84% | 97.14% | 12.84% | 96.17% | 99.36%
nakagami | 82.07% | 84.00% | 97.45% | 76.86% | 87.64% | 94.71%
logistic | 61.71% | 71.96% | 80.61% | 46.46% | 79.73% | 87.25%
powerlaw | 73.65% | 86.12% | 96.79% | 63.82% | 76.63% | 87.89%
skewnorm | 75.15% | 81.96% | 95.23% | 58.90% | 88.14% | 96.78%
Table A2. R2 for QQ plots of literature datasets—part 2.
Distribution | Shuttle | Waveform | WBC | WDBC | WPBC | Average
Log Transform
norm | 99.10% | 99.14% | 87.33% | 90.96% | 92.60% | 94.57%
t | 99.29% | 99.12% | 83.70% | 85.60% | 89.97% | 93.55%
laplace | 97.02% | 95.62% | 82.65% | 87.74% | 89.02% | 91.14%
logistic | 99.09% | 98.39% | 83.39% | 91.31% | 90.83% | 93.73%
skewnorm | 99.25% | 99.96% | 93.28% | 97.87% | 98.95% | 98.12%
No Transform
expon | 73.73% | 91.97% | 91.40% | 95.57% | 99.17% | 91.28%
chi2 | 73.78% | 99.96% | 89.83% | 97.32% | 98.60% | 93.59%
gamma | 72.50% | 99.96% | 98.67% | 97.35% | 98.60% | 94.57%
weibull_min | 79.88% | 99.31% | 92.12% | 92.68% | 97.12% | 94.05%
invgauss | 73.43% | 99.97% | 89.77% | 97.18% | 99.11% | 94.39%
rayleigh | 62.81% | 99.71% | 64.23% | 79.35% | 92.52% | 78.19%
wald | 78.22% | 84.60% | 92.95% | 98.75% | 97.24% | 92.79%
pareto | 73.73% | 91.97% | 64.07% | 96.02% | 99.24% | 84.07%
nakagami | 64.77% | 99.66% | 88.02% | 86.87% | 94.98% | 87.00%
logistic | 59.31% | 97.13% | 58.20% | 72.55% | 83.44% | 72.58%
powerlaw | 56.20% | 93.50% | 72.40% | 76.42% | 89.13% | 79.32%
skewnorm | 65.54% | 99.90% | 71.21% | 85.54% | 95.16% | 83.04%
Table A3. R2 for QQ plots of semantic datasets—part 1.
Distribution | Annthyroid | Arrhythmia | Cardiotocography | HeartDisease | Hepatitis | InternetAds
Log Transform
norm | 97.32% | 92.53% | 98.33% | 98.37% | 96.42% | 97.22%
t | 99.12% | 91.14% | 98.23% | 98.37% | 96.42% | 97.22%
laplace | 98.76% | 90.17% | 94.26% | 93.94% | 89.60% | 93.41%
logistic | 98.97% | 92.01% | 97.23% | 97.15% | 93.92% | 96.18%
skewnorm | 97.48% | 99.49% | 99.44% | 99.31% | 96.45% | 98.11%
No Transform
expon | 76.81% | 99.08% | 97.85% | 94.44% | 87.87% | 97.61%
chi2 | 85.52% | 99.12% | 97.06% | 99.39% | 89.40% | 98.09%
gamma | 85.93% | 97.45% | 99.23% | 99.39% | 93.61% | 98.09%
weibull_min | 77.03% | 96.76% | 98.14% | 99.40% | 96.15% | 97.32%
invgauss | 83.37% | 99.20% | 99.46% | 99.34% | 93.76% | 98.64%
rayleigh | 56.43% | 89.74% | 96.29% | 99.22% | 97.84% | 95.73%
wald | 85.90% | 98.33% | 93.95% | 87.79% | 80.24% | 93.98%
pareto | 84.19% | 99.08% | 97.85% | 94.44% | 87.87% | 97.62%
nakagami | 64.82% | 93.25% | 97.09% | 99.31% | 97.33% | 90.16%
logistic | 51.34% | 81.56% | 90.29% | 94.63% | 93.75% | 87.67%
powerlaw | 53.06% | 84.56% | 88.02% | 94.70% | 98.76% | 85.66%
skewnorm | 61.61% | 93.83% | 97.90% | 99.51% | 96.93% | 95.73%
Table A4. R2 for QQ plots of semantic datasets—part 2.
Distribution | PageBlocks | Parkinson | Pima | SpamBase | Stamps | Wilt | Average
Log Transform
norm | 92.65% | 94.42% | 97.87% | 74.47% | 97.44% | 96.23% | 94.44%
t | 91.75% | 97.80% | 97.87% | 80.92% | 97.26% | 91.67% | 94.82%
laplace | 89.74% | 96.46% | 92.28% | 83.68% | 93.34% | 97.20% | 92.74%
logistic | 92.16% | 96.06% | 96.27% | 78.35% | 96.42% | 97.54% | 94.36%
skewnorm | 99.62% | 96.92% | 99.06% | 83.94% | 99.27% | 97.95% | 97.25%
No Transform
expon | 79.86% | 91.84% | 97.62% | 92.80% | 97.74% | 64.13% | 89.80%
chi2 | 79.83% | 85.41% | 94.45% | 92.24% | 97.88% | 67.36% | 90.48%
gamma | 93.53% | 85.41% | 99.70% | 92.22% | 97.88% | 72.96% | 92.95%
weibull_min | 82.47% | 94.74% | 99.47% | 89.26% | 97.08% | 86.30% | 92.84%
invgauss | 92.85% | 88.89% | 99.36% | 92.53% | 98.47% | 57.91% | 91.98%
rayleigh | 58.19% | 78.23% | 97.66% | 87.46% | 92.07% | 46.62% | 82.79%
wald | 89.31% | 95.89% | 92.66% | 93.98% | 96.88% | 64.81% | 89.59%
pareto | 95.46% | 91.84% | 97.62% | 93.47% | 97.74% | 67.46% | 92.05%
nakagami | 69.50% | 84.59% | 98.74% | 87.67% | 94.98% | 52.37% | 86.26%
logistic | 52.31% | 72.85% | 91.07% | 83.92% | 85.41% | 38.07% | 77.01%
powerlaw | 59.33% | 68.75% | 92.67% | 77.26% | 85.86% | 40.43% | 77.42%
skewnorm | 63.91% | 81.93% | 99.20% | 89.06% | 94.75% | 45.16% | 84.96%
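As a guide to how the goodness-of-fit values in Table A1, Table A2, Table A3 and Table A4 can be reproduced, the brief Python sketch below computes a QQ-plot R² for a candidate family using SciPy's probability-plot routine; the specific fitting and transformation conventions are assumptions of this sketch and may differ in detail from those used to build the tables.

# Sketch: QQ-plot R^2 for a candidate distribution (conventions assumed, not necessarily the paper's exact procedure).
import numpy as np
from scipy import stats

def qq_r2(x, dist, log_transform=False):
    """Squared correlation coefficient of the probability (QQ) plot of `dist` fitted to x."""
    data = np.log(x) if log_transform else x
    shapes = dist.fit(data)[:-2]               # shape parameters only; loc/scale are absorbed by the plot's line fit
    _, (_, _, r) = stats.probplot(data, dist=dist, sparams=shapes)
    return r ** 2

rng = np.random.default_rng(0)
sample = stats.gamma.rvs(a=2.0, scale=2.0, size=1000, random_state=rng)
print(f"gamma, no transform : {qq_r2(sample, stats.gamma):.2%}")
print(f"skew-normal, log    : {qq_r2(sample, stats.skewnorm, log_transform=True):.2%}")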
Table A5. Literature datasets ROC-AUC part 1.
ALOI | Glass | Ionosphere
ROC AUC | k | ROC AUC | k | ROC AUC | k (each entry below gives the ROC AUC followed by the corresponding k)
Log Transform
norm74.50%387.20%1090.90%2
t74.60%387.60%290.90%2
laplace74.50%288.00%290.70%2
logistic74.50%387.60%290.70%2
skewnorm74.50%387.60%1090.80%2
No Transform
expon74.30%387.10%290.10%2
chi274.40%387.80%290.10%2
gamma74.50%288.00%290.10%2
weibull_min74.40%287.50%1090.10%2
invgauss74.60%388.50%290.10%2
rayleigh74.30%387.50%290.10%2
wald74.40%387.80%290.10%2
pareto74.50%387.90%290.10%2
nakagami74.50%388.00%290.10%2
logistic73.80%387.20%1090.20%2
powerlaw74.50%387.40%1089.90%2
skewnorm74.30%387.70%290.10%2
Baseline Manhattan
KNN74.60%287.40%1089.60%4
LOF81.40%786.70%1387.10%10
SimplifiedLOF74.86%387.99%290.04%2
LoOP83.45%1085.09%2086.38%16
LDOF75.24%978.10%2683.22%50
ODIN74.62%387.99%290.04%2
FastABOD76.66%1450.00%292.07%69
KDEOS52.26%6283.96%1986.25%70
LDF74.86%387.99%290.04%2
INFLO83.60%1083.79%1886.06%16
COF76.84%3089.86%6288.02%13
Baseline Euclidean
KNN74.06%187.48%892.74%1
LOF78.23%986.67%1190.43%83
SimplifiedLOF79.57%1686.50%1690.50%10
LoOP80.08%1283.96%1890.21%11
LDOF 77.89%2789.61%14
ODIN80.50%1172.93%1885.22%13
FastABOD 85.80%9891.33%3
KDEOS77.26%9974.20%2883.40%71
LDF74.62%990.35%991.67%50
INFLO79.87%980.38%1890.38%10
COF80.17%1389.54%7696.03%100
Missing LDOF and FastABOD results are attributed to computational cost.
Table A6. Literature datasets ROC-AUC part 2.
KDDCup99 | Lymphography | PenDigits
ROC AUC | k | ROC AUC | k | ROC AUC | k (each entry below gives the ROC AUC followed by the corresponding k)
Log Transform
norm96.80%69100.00%1998.20%9
t96.80%6899.90%698.30%10
laplace96.70%6999.80%3198.40%14
logistic96.70%69100.00%1598.40%11
skewnorm96.70%69100.00%898.70%15
No Transform
expon95.00%6999.30%1399.10%12
chi296.90%69100.00%3898.40%6
gamma96.70%69100.00%898.30%12
weibull_min96.80%69100.00%898.20%9
invgauss96.50%69100.00%899.10%12
rayleigh95.70%6999.90%497.60%6
wald94.90%69100.00%2699.10%9
pareto96.40%69100.00%1399.10%12
nakagami96.50%69100.00%897.90%10
logistic94.30%69100.00%896.80%8
powerlaw96.90%69100.00%899.10%9
skewnorm95.90%69100.00%898.30%9
Baseline Manhattan
KNN97.00%69100.00%799.10%11
LOF67.90%45100.00%4797.10%55
SimplifiedLOF95.40%70100.00%399.13%21
LoOP66.52%6199.88%5996.24%70
LDOF77.09%7099.65%4472.92%70
ODIN97.01%70100.00%899.12%12
FastABOD58.97%7099.18%6050.00%2
KDEOS50.00%282.75%3386.69%59
LDF95.40%70100.00%399.13%21
INFLO66.46%5499.88%5996.95%70
COF60.57%6996.48%1498.29%69
Baseline Euclidean
KNN98.97%89100.00%1499.21%12
LOF84.89%100100.00%6296.58%73
SimplifiedLOF66.80%62100.00%9896.68%67
LoOP70.31%6599.77%4796.23%98
LDOF 99.77%8675.03%91
ODIN80.77%10099.88%5596.43%100
FastABOD 99.77%2597.98%100
KDEOS60.51%6898.12%9982.21%98
LDF87.70%90100.00%1397.79%12
INFLO70.33%5699.88%6295.71%98
COF67.01%67100.00%4096.70%95
Missing LDOF and FastABOD results are attributed to computational cost.
Table A7. Literature datasets ROC-AUC part 3.
Shuttle | Waveform | WBC
ROC AUC | k | ROC AUC | k | ROC AUC | k (each entry below gives the ROC AUC followed by the corresponding k)
Log Transform
norm84.60%578.30%6899.40%9
t84.66%578.50%6499.70%23
laplace84.60%578.50%6199.70%32
logistic84.50%578.60%6899.30%10
skewnorm84.60%578.60%6999.60%60
No Transform
expon84.20%578.50%6298.80%10
chi284.50%578.60%6699.80%30
gamma82.00%578.60%6699.90%40
weibull_min83.90%578.50%6799.80%26
invgauss84.50%578.60%6699.50%4
rayleigh84.80%578.80%6699.00%33
wald84.50%578.60%6999.80%69
pareto84.30%578.60%6299.20%4
nakagami84.60%578.40%6899.80%17
logistic84.20%578.40%5896.70%23
powerlaw82.80%578.50%5999.80%28
skewnorm84.60%578.50%6799.20%61
Baseline Manhattan
KNN84.68%478.60%6599.70%24
LOF84.10%776.50%6999.70%65
SimplifiedLOF78.00%1477.77%7099.72%22
LoOP82.08%1172.59%7097.28%70
LDOF77.98%2269.59%6794.37%70
ODIN84.68%578.57%6699.74%25
FastABOD50.00%252.31%576.29%13
KDEOS77.30%4865.14%7097.54%11
LDF78.00%1477.77%7099.72%22
INFLO77.84%1071.50%7099.48%67
COF63.02%6476.25%5898.97%58
Baseline Euclidean
KNN81.76%377.55%7799.72%19
LOF78.21%675.60%9699.67%98
SimplifiedLOF76.61%9972.95%10099.39%99
LoOP76.40%9972.37%10098.03%99
LDOF84.75%1568.82%10096.53%99
ODIN78.90%869.68%10096.74%100
FastABOD95.46%667.31%4099.48%49
KDEOS66.55%9459.24%9964.79%5
LDF71.59%478.89%1699.72%71
INFLO79.89%9870.92%9499.39%99
COF63.97%7177.59%9999.44%74
Table A8. Literature datasets ROC-AUC part 4.
WDBC | WPBC
ROC AUC | k | ROC AUC | k (each entry below gives the ROC AUC followed by the corresponding k)
Log Transform
norm97.70%953.10%12
t98.40%4653.40%20
laplace98.30%4253.10%26
logistic97.30%953.20%20
skewnorm98.70%4353.20%26
No Transform
expon98.50%4253.20%14
chi299.00%5653.20%26
gamma98.90%6853.20%26
weibull_min98.90%6453.30%12
invgauss98.70%2553.10%19
rayleigh97.50%2053.20%12
wald98.70%5353.20%12
pareto98.80%4253.30%19
nakagami99.00%6353.20%12
logistic96.90%3953.20%19
powerlaw98.90%6453.10%20
skewnorm98.30%4152.90%21
Baseline Manhattan
KNN99.00%6953.10%18
LOF99.10%6952.70%34
SimplifiedLOF98.71%5752.70%29
LoOP98.38%6949.61%61
LDOF97.96%7050.18%61
ODIN98.96%7053.10%19
FastABOD50.00%254.52%4
KDEOS90.08%6957.13%34
LDF98.71%5752.70%29
INFLO98.91%7049.27%57
COF97.70%6450.64%47
Baseline Euclidean
KNN98.63%9054.09%12
LOF98.91%8952.54%24
SimplifiedLOF98.68%9050.18%1
LoOP98.40%10050.18%1
LDOF98.18%9956.56%7
ODIN97.23%9350.73%1
FastABOD98.26%9753.42%40
KDEOS86.11%8051.85%2
LDF98.54%3358.29%8
INFLO98.49%9549.57%20
COF98.07%5555.69%97
Table A9. Semantic datasets ROC-AUC part 1.
Annthyroid | Arrhythmia | Cardiotocography
ROC AUC | k | ROC AUC | k | ROC AUC | k (each entry below gives the ROC AUC followed by the corresponding k)
Log Transform
norm67.60%276.20%4455.70%69
t67.70%276.10%4755.60%69
laplace67.60%276.20%3455.70%68
logistic67.70%276.10%4755.60%69
skewnorm67.70%276.00%3555.70%69
No Transform
expon67.61%275.66%3555.65%69
chi267.64%276.11%4655.69%69
gamma67.65%276.11%4155.72%69
weibull_min67.61%275.97%4455.79%68
invgauss67.69%276.03%4555.64%67
rayleigh67.59%275.99%4555.76%69
wald67.64%276.05%4455.76%69
pareto67.60%276.07%3555.65%69
nakagami67.62%276.02%3855.70%69
logistic67.36%276.15%2955.69%68
powerlaw67.46%276.09%4555.77%69
skewnorm67.52%276.20%4555.71%69
Baseline Manhattan
KNN67.28%276.10%4355.80%69
LOF70.20%1175.50%4860.20%69
SimplifiedLOF67.74%375.81%5153.78%70
LoOP72.09%3875.76%7056.84%21
LDOF78.92%2875.18%656.17%50
ODIN67.67%276.06%4455.76%70
FastABOD71.34%4667.53%7050.00%2
KDEOS50.00%250.00%250.32%36
LDF67.74%375.81%5153.78%70
INFLO71.31%3175.30%7057.98%69
COF62.62%5575.52%4156.92%70
Baseline Euclidean
KNN64.90%175.21%6066.67%100
LOF66.76%974.42%9464.70%100
SimplifiedLOF66.53%2173.81%6559.79%100
LoOP67.72%2373.84%7759.50%100
LDOF69.21%3073.45%10057.69%100
ODIN69.33%572.67%9862.12%100
FastABOD62.39%474.18%9855.74%100
KDEOS67.81%3966.10%2154.74%22
LDF65.93%872.29%6767.71%100
INFLO66.46%4773.15%9159.84%100
COF69.21%3073.39%3956.83%20
Table A10. Semantic datasets ROC-AUC part 2.
HeartDisease | Hepatitis | InternetAds
ROC AUC | k | ROC AUC | k | ROC AUC | k (each entry below gives the ROC AUC followed by the corresponding k)
Log Transform
norm70.10%6978.80%2672.20%14
t70.10%6978.80%2672.20%14
laplace69.70%6979.00%2572.20%14
logistic69.80%6879.00%2672.20%14
skewnorm70.10%6878.80%2672.20%14
No Transform
expon69.51%6977.04%2572.12%14
chi270.02%6678.87%4072.23%14
gamma70.02%6678.47%2672.23%14
weibull_min70.16%6878.99%2670.36%6
invgauss69.91%6979.22%2672.20%14
rayleigh69.82%6879.05%2672.20%14
wald69.91%6878.59%2672.16%14
pareto70.18%6978.53%2572.12%14
nakagami69.99%6878.70%2672.23%14
logistic69.78%6978.53%2572.16%14
powerlaw69.63%6978.76%2672.18%14
skewnorm69.89%6678.53%2672.21%14
Baseline Manhattan
KNN70.00%6879.00%2572.20%13
LOF64.00%6980.40%5070.30%69
SimplifiedLOF66.97%7075.89%5174.21%18
LoOP55.55%7074.17%6565.28%70
LDOF54.32%572.90%6964.68%41
ODIN69.99%6978.99%2672.21%14
FastABOD60.11%6668.08%2854.84%14
KDEOS65.43%5370.75%3650.00%2
LDF66.97%7075.89%5174.21%18
INFLO56.32%6874.63%6468.03%70
COF56.47%7073.02%5168.49%32
Baseline Euclidean
KNN68.38%8178.59%2172.23%12
LOF65.58%10080.37%4874.09%98
SimplifiedLOF56.93%10073.82%7874.31%98
LoOP56.14%6072.27%7870.07%100
LDOF56.91%1473.82%7969.36%98
ODIN60.59%8274.97%5860.54%7
FastABOD75.57%10070.95%5973.39%24
KDEOS55.69%10071.18%7957.78%35
LDF72.06%8382.89%4668.50%100
INFLO55.97%1560.28%5572.96%98
COF71.68%10082.72%7859.88%10
Table A11. Semantic datasets ROC-AUC part 3.
PageBlocks | Parkinson | Pima
ROC AUC | k | ROC AUC | k | ROC AUC | k (each entry below gives the ROC AUC followed by the corresponding k)
Log Transform
norm87.10%6973.70%673.70%68
t87.10%6973.90%473.70%68
laplace87.20%6973.90%673.60%69
logistic87.00%6973.80%673.70%67
skewnorm87.10%6973.90%673.70%67
No Transform
expon87.04%6971.69%673.38%64
chi287.06%6973.82%673.65%66
gamma86.97%6873.82%673.59%69
weibull_min87.08%6874.15%473.57%68
invgauss87.07%6873.75%673.56%63
rayleigh86.94%6973.76%673.57%69
wald87.06%6973.94%673.62%68
pareto87.19%6973.77%673.58%64
nakagami87.18%6973.63%473.67%68
logistic86.72%6973.65%673.69%67
powerlaw86.95%6973.60%673.70%69
skewnorm87.04%6973.97%673.53%68
Baseline Manhattan
KNN87.30%6973.70%573.60%67
LOF81.10%6963.90%567.20%69
SimplifiedLOF84.75%7072.11%673.05%70
LoOP77.43%7057.56%1961.53%69
LDOF80.21%7052.98%2358.50%65
ODIN87.28%7073.74%673.60%68
FastABOD50.78%7058.19%851.13%23
KDEOS64.99%7076.94%5766.79%70
LDF84.75%7072.11%673.05%70
INFLO75.35%7052.51%1161.62%70
COF69.91%7070.95%7066.15%70
Baseline Euclidean
KNN84.08%10065.24%473.22%85
LOF81.87%6061.20%668.96%100
SimplifiedLOF80.47%9860.73%1462.13%100
LoOP79.38%8658.31%1360.92%99
LDOF82.98%8255.32%1657.00%98
ODIN73.06%10052.61%363.64%100
FastABOD73.39%2466.99%1576.08%99
KDEOS69.51%9158.67%2855.62%2
LDF83.02%4260.22%672.89%100
INFLO76.80%8058.39%1061.73%92
COF77.02%7164.97%9870.12%100
Table A12. Semantic datasets ROC-AUC part 4.
SpamBase | Stamps | Wilt
ROC AUC | k | ROC AUC | k | ROC AUC | k (each entry below gives the ROC AUC followed by the corresponding k)
Log Transform
norm65.10%4191.70%6156.20%2
t65.00%4091.70%6856.10%3
laplace65.00%4691.90%6256.20%2
logistic65.00%4191.80%6856.20%2
skewnorm65.00%4692.20%6556.10%3
No Transform
expon64.94%5191.67%6356.03%3
chi265.03%4991.93%6755.14%2
gamma65.00%4991.93%6756.07%3
weibull_min65.04%4191.86%6855.59%3
invgauss65.08%4092.19%6456.17%2
rayleigh64.98%4091.72%6756.09%3
wald65.04%4091.91%5756.21%2
pareto64.98%4091.99%6356.15%3
nakagami65.05%4491.83%6656.24%3
logistic65.01%4091.86%6356.21%2
powerlaw64.99%3991.83%6156.33%2
skewnorm65.07%4191.77%6656.23%2
Baseline Manhattan
KNN65.00%3991.90%6356.10%2
LOF47.80%282.30%6967.00%5
SimplifiedLOF64.03%7091.04%7056.68%3
LoOP47.21%377.32%7068.35%14
LDOF50.00%270.70%6969.84%16
ODIN65.05%4091.91%6456.18%2
FastABOD54.71%7076.48%6985.03%29
KDEOS50.00%278.51%7070.95%62
LDF64.03%7091.04%7056.68%3
INFLO50.69%273.68%7070.21%6
COF48.71%363.50%7059.82%2
Baseline Euclidean
KNN57.35%6390.11%1555.20%1
LOF47.38%283.32%10063.09%6
SimplifiedLOF50.12%274.35%10067.68%7
LoOP49.66%275.28%10067.92%10
LDOF47.96%575.26%10071.22%13
ODIN51.91%4775.34%10067.46%10
FastABOD43.72%376.22%9755.43%6
KDEOS47.67%10069.13%9971.32%33
LDF53.64%10089.55%10061.27%4
INFLO47.38%378.92%10063.21%7
COF49.95%281.87%10064.83%9

References

  1. Barnett, V.; Lewis, T. Outliers in Statistical Data, 3rd ed.; John Wiley & Sons: Chichester, UK, 1994. [Google Scholar]
  2. Chandola, V.; Banerjee, A.; Kumar, V. Anomaly Detection: A Survey. ACM Comput. Surv. 2009, 41, 1–58. [Google Scholar] [CrossRef]
  3. Hawkins, D.M. Identification of Outliers; Chapman and Hall: London, UK, 1980. [Google Scholar]
  4. Aggarwal, C.C. Outlier Analysis, 2nd ed.; Springer: Cham, Switzerland, 2017. [Google Scholar]
  5. Beyer, K.; Goldstein, J.; Ramakrishnan, R.; Shaft, U. When Is “Nearest Neighbor” Meaningful? In Database Theory-ICDT’99, Proceedings of the 7th International Conference, Jerusalem, Israel, 10–12 January 1999; Beeri, C., Buneman, P., Eds.; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 1999; Volume 1540, pp. 217–235. [Google Scholar]
  6. Zimek, A.; Schubert, E.; Kriegel, H.P. A Survey on Unsupervised Anomaly Detection in High-Dimensional Numerical Data. Stat. Anal. Data Min. 2012, 5, 363–387. [Google Scholar] [CrossRef]
  7. Bolton, R.J.; Hand, D.J. Statistical Fraud Detection: A Review. Stat. Sci. 2002, 17, 235–255. [Google Scholar] [CrossRef]
  8. Hanley, J.A.; McNeil, B.J. The meaning and use of the area under a Receiver Operating Characteristic (ROC) curve. Radiology 1982, 143, 29–36. [Google Scholar] [CrossRef]
  9. Hanley, J.A.; McNeil, B.J. A method of comparing the areas under Receiver Operating Characteristic curves derived from the same cases. Radiology 1983, 148, 839–843. [Google Scholar] [CrossRef] [PubMed]
  10. Fawcett, T. An Introduction to ROC Analysis. Pattern Recognit. Lett. 2006, 27, 861–874. [Google Scholar] [CrossRef]
  11. Fix, E.; Hodges, J.L. Discriminatory Analysis—Nonparametric Discrimination: Consistency Properties; Technical Report Technical Report 4; University of California: Berkeley, CA, USA, 1951. [Google Scholar]
  12. Aggarwal, C.C.; Hinneburg, A.; Keim, D.A. On the Surprising Behavior of Distance Metrics in High Dimensional Space. In Proceedings of the Database Theory—ICDT 2001, Proceedings of the 8th International Conference, London, UK, 4–6 January 2001; Van den Bussche, J., Vianu, V., Eds.; Springer: Berlin/Heidelberg, Germany, 2001; pp. 420–434. [Google Scholar]
  13. Breunig, M.M.; Kriegel, H.P.; Ng, R.T.; Sander, J. LOF: Identifying Density-Based Local Outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, TX, USA, 15–18 May 2000; pp. 93–104. [Google Scholar] [CrossRef]
  14. Tang, J.; Chen, Z.; Fu, A.W.C.; Cheung, D.W.L. Enhancing Effectiveness of Outlier Detections for Low Density Patterns. In Proceedings of the PAKDD, Taipei, Taiwan, 6–8 May 2002; pp. 535–548. [Google Scholar]
  15. Kriegel, H.P.; Schubert, M.; Zimek, A. Angle-Based Outlier Detection in High-Dimensional Data. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, NV, USA, 24–27 August 2008; pp. 444–452. [Google Scholar] [CrossRef]
  16. Rehman, Y.; Belhaouari, S. Unsupervised outlier detection in multidimensional data. J. Big Data 2021, 8, 80. [Google Scholar] [CrossRef]
  17. Campos, G.O.; Zimek, A.; Sander, J.; Campello, R.J.G.B.; Micenková, B.; Schubert, E.; Assent, I.; Houle, M.E. On the Evaluation of Unsupervised Outlier Detection: Measures, Datasets, and an Empirical Study. Data Min. Knowl. Discov. 2016, 30, 891–927. [Google Scholar] [CrossRef]
  18. Bouman, R.; Bukhsh, Z.; Heskes, T. Unsupervised anomaly detection algorithms on real-world multivariate tabular data sets. ACM Comput. Surv. 2024. [Google Scholar]
  19. Tukey, J.W. Exploratory Data Analysis; Addison-Wesley: Reading, MA, USA, 1977. [Google Scholar]
  20. Anderberg, A.; Bailey, J.; Campello, R.J.G.B. Dimensionality-Aware Outlier Detection: Theoretical and Experimental Analysis. In Proceedings of the SIAM International Conference on Data Mining, Houston, TX, USA, 18–20 April 2024. [Google Scholar]
  21. Kim, D.; Park, J.; Chung, H.C.; Jeong, S. Unsupervised outlier detection using random subspace and subsampling ensembles of Dirichlet process mixtures. Pattern Recognit. 2024, 156, 110846. [Google Scholar] [CrossRef]
  22. Chen, X.; Yuan, Z.; Feng, S. Anomaly Detection Based on Improved k-Nearest Neighbor Rough Sets. Int. J. Approx. Reason. 2025, 176, 109323. [Google Scholar] [CrossRef]
  23. Grubbs, F.E. Procedures for Detecting Outlying Observations in Samples. Technometrics 1969, 11, 1–21. [Google Scholar] [CrossRef]
  24. Rosner, B. Percentage Points for a Generalized ESD Many-Outlier Procedure. Technometrics 1983, 25, 165–172. [Google Scholar] [CrossRef]
  25. Davies, L.; Gather, U. The Identification of Multiple Outliers. J. Am. Stat. Assoc. 1993, 88, 782–792. [Google Scholar] [CrossRef]
  26. Bagdonavičius, V.; Petkevičius, G. New Tests for the Detection of Outliers from Location–Scale and Shape–Scale Families. Mathematics 2020, 8, 2156. [Google Scholar]
  27. Amin, M.; Afzal, S.; Akram, M.N.; Muse, A.H.; Tolba, A.H.; Abushal, T.A. Outlier Detection in Gamma Regression Using Pearson Residuals: Simulation and an Application. AIMS Math. 2022, 7, 15331–15347. [Google Scholar] [CrossRef]
  28. A Model-Based Approach to Outlier Detection in Financial Time Series; IFC Bulletin 37; BIS: Basel, Switzerland, 2014.
  29. Wang, Y.; Zhang, L.; Si, T.; Bishop, G.; Gong, H. Anomaly Detection in High-Dimensional Time Series with Scaled Bregman Divergence. Algorithms 2025, 18, 62. [Google Scholar] [CrossRef]
  30. Angiulli, F.; Pizzuti, C. Fast outlier detection in high dimensional spaces. In Proceedings of the European Conference on Principles of Data Mining and Knowledge Discovery (PKDD), Helsinki, Finland, 19–23 August 2002; pp. 15–27. [Google Scholar]
  31. Zhang, K.; Hutter, M.; Jin, H. A local distance-based outlier detection method. In Proceedings of the 20th International Conference on Advances in Database Technology (EDBT), Saint Petersburg, Russia, 24–26 March 2009; pp. 394–405. [Google Scholar]
  32. Kriegel, H.P.; Kröger, P.; Schubert, E.; Zimek, A. LoOP: Local outlier probabilities. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM), Hong Kong, China, 2–6 November 2009; pp. 1649–1652. [Google Scholar]
  33. Latecki, L.J.; Lazarevic, A.; Pokrajac, D. Outlier detection with local and global consistency. In Proceedings of the 2007 SIAM International Conference on Data Mining, Minneapolis, MN, USA, 26–28 April 2007; pp. 597–602. [Google Scholar]
  34. Jin, W.; Tung, A.K.; Han, J.; Wang, W. Ranking outliers using symmetric neighborhood relationship. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Singapore, 9–12 April 2006; pp. 577–593. [Google Scholar]
  35. Schubert, E.; Zimek, A.; Kriegel, H.P. Local outlier detection reconsidered: A generalized view on locality with applications to spatial, video, and network outlier detection. Data Min. Knowl. Discov. 2014, 28, 190–237. [Google Scholar] [CrossRef]
  36. Goldstein, M.; Dengel, A. Histogram-Based Outlier Score (HBOS): A Fast Unsupervised Anomaly Detection Algorithm. In Proceedings of the LWA 2012—Lernen, Wissen, Adaptivität, Dortmund, Germany, 8–10 October 2012. [Google Scholar]
  37. Pevnỳ, T. Loda: Lightweight On-line Detector of Anomalies. Mach. Learn. 2016, 102, 275–304. [Google Scholar] [CrossRef]
  38. Li, Z.; Zhao, Y.; Botta, N.; Ionescu, C.; Hu, X. COPOD: Copula-Based Outlier Detection. In Proceedings of the 2020 IEEE International Conference on Data Mining (ICDM), Sorrento, Italy, 17–20 November 2020. [Google Scholar]
  39. Ramaswamy, S.; Rastogi, R.; Shim, K. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, TX, USA, 16–18 May 2000; pp. 427–438. [Google Scholar]
  40. Papadimitriou, S.; Kitagawa, H.; Gibbons, P.B.; Faloutsos, C. LOCI: Fast outlier detection using the local correlation integral. In Proceedings of the 19th International Conference on Data Engineering (ICDE), Bangalore, India, 5–8 March 2003; pp. 315–326. [Google Scholar]
  41. Liu, F.T.; Ting, K.M.; Zhou, Z.H. Isolation Forest. In Proceedings of the ICDM, Pisa, Italy, 15–19 December 2008; pp. 413–422. [Google Scholar]
  42. Schölkopf, B.; Platt, J.; Shawe-Taylor, J.; Smola, A.J.; Williamson, R.C. Estimating the Support of a High-Dimensional Distribution. Neural Comput. 2001, 13, 1443–1471. [Google Scholar] [CrossRef]
  43. Ruff, L.; Vandermeulen, R.; Goernitz, N.; Deecke, L.; Siddiqui, S.A.; Binder, A.; Müller, E.; Kloft, M. Deep One-Class Classification. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), Stockholm, Sweden, 10–15 July 2018; 80, pp. 4393–4402. [Google Scholar]
  44. Zong, B.; Song, Q.; Min, M.R.; Cheng, W.; Lumezanu, C.; Cho, D.; Chen, H. Deep Autoencoding Gaussian Mixture Model for Unsupervised Anomaly Detection. In Proceedings of the ICLR, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  45. Liu, J.; Ma, Z.; Wang, Z.; Liu, Y.; Wang, Z.; Sun, P.; Song, L.; Hu, B.; Boukerche, A.; Leung, V.C.M. A Survey on Diffusion Models for Anomaly Detection. arXiv 2025, arXiv:2501.11430. [Google Scholar] [CrossRef]
  46. Radovanović, M.; Nanopoulos, A.; Ivanović, M. Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data. J. Mach. Learn. Res. 2010, 11, 2487–2531. [Google Scholar]
  47. Kotz, S.; Kozubowski, T.; Podgórski, K. The Laplace Distribution and Generalizations; Birkhäuser: Boston, MA, USA, 2001. [Google Scholar]
  48. Johnson, N.L.; Kotz, S.; Balakrishnan, N. Continuous Univariate Distributions, Volume I & II; Wiley: New York, NY, USA, 1994. [Google Scholar]
  49. David, H.A.; Nagaraja, H.N. Order Statistics; Wiley: Hoboken, NJ, USA, 2003. [Google Scholar]
  50. Biau, G.; Devroye, L. Lectures on the Nearest Neighbor Method; Springer: Berlin/Heidelberg, Germany, 2015. [Google Scholar]
  51. Titterington, D.; Smith, A.F.M.; Makov, U. Statistical Analysis of Finite Mixture Distributions; Wiley: New York, NY, USA, 1985. [Google Scholar]
  52. Rosenblatt, M. Remarks on a Multivariate Transformation. Ann. Math. Stat. 1952, 23, 470–472. [Google Scholar] [CrossRef]
  53. Swets, J.A. Measuring the accuracy of diagnostic systems. Science 1988, 240, 1285–1293. [Google Scholar] [CrossRef]
  54. Hajian-Tilaki, K. Receiver Operating Characteristic (ROC) curve analysis for medical diagnostic test evaluation. Casp. J. Intern. Med. 2013, 4, 627–635. [Google Scholar]
  55. Bagdonavičius, V.; Petkevičius, L. Multiple Outlier Detection Tests for Parametric Models. Mathematics 2020, 8, 2156. [Google Scholar] [CrossRef]
  56. Azzalini, A. A Class of Distributions Which Includes the Normal Ones. Scand. J. Stat. 1985, 12, 171–178. [Google Scholar]
  57. DAMI: Outlier Evaluation Benchmark. Available online: https://www.dbs.ifi.lmu.de/research/outlier-evaluation/DAMI/ (accessed on 16 February 2025).
  58. Hodge-py Project. Outlier Detection: Literature Resources. Available online: https://github.com/hodge-py/Outlier-Detection/tree/Final/literature (accessed on 16 February 2025).
  59. Hodge-py Project. Outlier Detection: Semantic Resources. Available online: https://github.com/hodge-py/Outlier-Detection/tree/Final/semantic (accessed on 16 February 2025).
Figure 1. Probability plot and histogram of the Arrhythmia dataset. Theoretical distribution for probability plot is set to gamma distribution.
Figure 2. Probability plot and histogram of the Parkinson dataset. Theoretical distribution for probability plot is set to skew-normal distribution after logarithmic transformation.
Figure 3. ROC comparison between the kNN baseline and the best-fit log Student-t distribution on the Shuttle dataset. The parametric fit closely tracks or exceeds the kNN ROC curve, illustrating how a 1D fitted distribution can replicate neighborhood-based anomaly scoring behavior.
Figure 4. ROC comparison between the kNN baseline and the best-fit log Student-t distribution on the Annthyroid dataset. The parametric fit closely tracks or exceeds the kNN ROC curve, illustrating how a 1D fitted distribution can replicate neighborhood-based anomaly scoring behavior.
Table 1. (a) Shape and scale parameters for training and testing sets (Gamma). (b) Shape and scale parameters for training and testing sets (Inverse Gaussian). (c) Shape, location, and scale parameters for training and testing sets (Skew-Normal).
Parameter | Train | Test Inlier | Test Outlier
(a)
Shape | 2.0 | 2.0 | 5.0
Scale | 2.0 | 2.0 | 2.0
(b)
Shape | 2.0 | 2.0 | 5.0
Scale | 2.0 | 2.0 | 2.0
(c)
Shape (a) | 4.0 | 4.0 | −4.0
Location | 0.0 | 0.0 | 2.0
Scale | 1.0 | 1.0 | 1.0
Table 2. (a) ROC-AUC Scores as per each method for Monte Carlo simulation (500 runs) with Gamma distribution. (b) ROC-AUC Scores as per each method for Monte Carlo simulation (500 runs) with Inverse Gaussian distribution. (c) ROC-AUC Scores as per each method for Monte Carlo simulation (500 runs) with Skew-Normal distribution.
Method | Mean | Std | Min | 25% | 50% | 75% | Max
(a)
KNN | 84.81% | 2.22% | 76.58% | 83.42% | 84.90% | 86.39% | 91.38%
LOF | 62.67% | 6.00% | 46.90% | 58.73% | 62.45% | 67.06% | 77.70%
ABOD | 80.66% | 2.48% | 72.47% | 79.06% | 80.66% | 82.48% | 87.09%
COF | 50.39% | 2.59% | 43.55% | 48.66% | 50.43% | 52.01% | 57.70%
CDF | 89.07% | 1.59% | 83.60% | 87.95% | 89.17% | 90.22% | 94.03%
(b)
KNN | 59.06% | 2.89% | 50.46% | 57.09% | 59.09% | 61.12% | 66.87%
LOF | 54.50% | 3.05% | 46.04% | 52.50% | 54.54% | 56.49% | 62.83%
ABOD | 58.61% | 2.95% | 50.48% | 56.72% | 58.61% | 60.58% | 66.79%
COF | 50.86% | 2.83% | 42.65% | 48.97% | 50.97% | 52.77% | 58.46%
CDF | 59.30% | 2.90% | 51.18% | 57.49% | 59.34% | 61.21% | 68.10%
(c)
KNN | 65.07% | 3.35% | 53.76% | 63.01% | 65.26% | 67.33% | 74.90%
LOF | 50.84% | 4.35% | 38.28% | 47.64% | 50.94% | 53.73% | 63.98%
ABOD | 61.01% | 3.26% | 48.70% | 58.91% | 61.15% | 63.35% | 71.20%
COF | 50.17% | 2.74% | 40.60% | 48.24% | 50.24% | 51.99% | 56.75%
CDF | 71.81% | 2.67% | 62.95% | 70.02% | 71.99% | 73.59% | 79.12%
Table 3. Details of datasets used for comparison.
Name | Type | Instances | Outliers | Attributes
ALOI | Literature | 50,000 | 1508 | 27
Glass | Literature | 214 | 9 | 7
Ionosphere | Literature | 351 | 126 | 32
KDDCup99 | Literature | 60,632 | 246 | 38 + 3
Lymphography | Literature | 148 | 6 | 3 + 16
PenDigits | Literature | 9868 | 20 | 16
Shuttle | Literature | 1013 | 13 | 9
Waveform | Literature | 3443 | 100 | 21
WBC | Literature | 454 | 10 | 9
WDBC | Literature | 367 | 10 | 30
WPBC | Literature | 198 | 47 | 33
Annthyroid | Semantic | 7200 | 534 | 21
Arrhythmia | Semantic | 450 | 206 | 259
Cardiotocography | Semantic | 2126 | 471 | 21
HeartDisease | Semantic | 270 | 120 | 13
Hepatitis | Semantic | 80 | 13 | 19
InternetAds | Semantic | 3264 | 454 | 1555
PageBlocks | Semantic | 5473 | 560 | 10
Parkinson | Semantic | 195 | 147 | 22
Pima | Semantic | 768 | 268 | 8
SpamBase | Semantic | 4601 | 1813 | 57
Stamps | Semantic | 340 | 31 | 9
Wilt | Semantic | 4839 | 261 | 5
Datasets are available from the Hodge-py outlier detection repositories [58,59] and the DAMI outlier evaluation collection [57].
Table 4. Comparison of average ROC-AUC across literature and semantic datasets.
Method | Literature Avg. | Semantic Avg.
Log Transform Models
norm | 87.34% | 72.34%
t | 87.52% | 72.33%
laplace | 87.48% | 72.35%
logistic | 87.35% | 72.33%
skewnorm | 87.55% | 72.38%
No Transform Models
expon | 87.10% | 71.86%
chi2 | 87.52% | 72.26%
gamma | 87.29% | 72.30%
weibull_min | 87.40% | 72.18%
invgauss | 87.56% | 72.38%
rayleigh | 87.13% | 72.29%
wald | 87.37% | 72.32%
pareto | 87.47% | 72.32%
nakagami | 87.45% | 72.32%
logistic | 86.52% | 72.23%
powerlaw | 87.35% | 72.27%
skewnorm | 87.25% | 72.30%
Baseline—Manhattan Distance
KNN | 87.53% | 72.33%
LOF | 84.75% | 69.16%
SimplifiedLOF | 86.76% | 71.34%
LoOP | 83.41% | 65.76%
LDOF | 79.66% | 65.37%
ODIN | 87.62% | 72.37%
FastABOD | 64.55% | 62.35%
KDEOS | 75.37% | 62.06%
LDF | 86.76% | 71.34%
INFLO | 83.07% | 65.64%
COF | 81.51% | 64.34%
Baseline—Euclidean Distance
KNN | 87.66% | 70.93%
LOF | 85.61% | 69.31%
SimplifiedLOF | 83.44% | 66.72%
LoOP | 83.27% | 65.92%
LDOF | 83.02% | 65.85%
ODIN | 82.64% | 65.35%
FastABOD | 87.65% | 67.00%
KDEOS | 73.11% | 62.10%
LDF | 86.29% | 70.83%
INFLO | 83.16% | 64.59%
COF | 84.02% | 68.54%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

