Distribution-Aware Outlier Detection in High Dimensions: A Scalable Parametric Approach
Abstract
1. Introduction
- 1. Algorithmic efficiency: Develop a method whose computational cost scales linearly with sample size and is independent of feature dimension after transformation;
- 2. Statistical interpretability: Model transformed distances with flexible parametric families, enabling distribution-based threshold selection and diagnostic inference;
- 3. Provable detection performance: Establish theoretical guarantees showing that the method controls false alarm rates and maximizes statistical power under mild assumptions on the underlying data distribution.
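The first two objectives above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of the mean distance to the k nearest neighbors, the normal fit to log-distances, and the 0.99 threshold are all assumptions chosen for illustration. The brute-force distance matrix is quadratic in sample size and is used only to keep the example short.

```python
import numpy as np
from scipy import stats
from scipy.spatial.distance import cdist


def knn_manhattan_distance(X, k=5):
    """Mean Manhattan distance from each row of X to its k nearest neighbors."""
    D = cdist(X, X, metric="cityblock")   # pairwise L1 distances (brute force)
    np.fill_diagonal(D, np.inf)           # a point is not its own neighbor
    return np.sort(D, axis=1)[:, :k].mean(axis=1)


def cdf_outlier_scores(X, k=5):
    """Fit a normal law to log-distances and return the fitted CDF as the score."""
    d = knn_manhattan_distance(X, k)
    log_d = np.log(d + 1e-12)             # small epsilon guards against zero distances
    mu, sigma = stats.norm.fit(log_d)
    return stats.norm.cdf(log_d, loc=mu, scale=sigma)


# Synthetic data (assumed): 300 inliers and 5 widely scattered outliers in 10-D.
rng = np.random.default_rng(0)
inliers = rng.normal(size=(300, 10))
outliers = rng.normal(scale=6.0, size=(5, 10))
X = np.vstack([inliers, outliers])

scores = cdf_outlier_scores(X, k=5)
flagged = np.flatnonzero(scores > 0.99)   # distribution-based threshold (assumed level)
print("flagged indices:", flagged)
```

Each point is reduced to a single positive distance, so the fitting and thresholding steps operate on a one-dimensional sample regardless of the original feature dimension.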
2. Literature Review
2.1. State of the Art
2.2. Our Contribution in Context
3. Method and Theoretical Results
3.1. Positively-Skewed Distributions
3.1.1. Assumptions and Limitations of Parametric Modeling
3.1.2. From Positioning to Theory
3.2. Behavior of Cumulative Distribution Function Versus Non-Parametric Scores for ROC-AUC
3.2.1. Comparing CDF-Based Scores and Raw KNN Distances
3.2.2. The CDF Superiority Theorem
- CDF score:
- KNN distance score:
- (1) CDF Score.
- (2) KNN Distance Score.
- (3) Conclusion.
3.2.3. Extension to Other Nonparametric Methods
- ROC–AUC depends only on the pairwise ordering of scores.
- The CDF score is strictly monotonic in x. It increases strictly, so it never misorders any pair of points.
- Any non-parametric method must misorder a positive-measure set of pairs. Estimated from finite data (LOF, isolation forest, etc.), it cannot perfectly reproduce the CDF ordering, so with positive probability there exists a pair that it ranks in the opposite order.
- Strict AUC gap follows. If the non-parametric score misorders pairs with positive probability, its AUC is strictly smaller than that of the CDF score.
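The ordering argument in the steps above can be checked numerically with a small sketch. Everything here is an assumption made for illustration: the gamma-distributed toy distances, the ground truth defined as the largest 5% of values (so that the CDF ordering is exact by construction), and additive Gaussian noise as a stand-in for finite-sample estimation error in a non-parametric score.

```python
import numpy as np
from scipy import stats


def pairwise_auc(scores, labels):
    """AUC as the fraction of (outlier, inlier) pairs ranked correctly; ties count 1/2."""
    pos = scores[labels == 1][:, None]
    neg = scores[labels == 0][None, :]
    return float(np.mean((pos > neg) + 0.5 * (pos == neg)))


rng = np.random.default_rng(1)
x = rng.gamma(shape=2.0, scale=1.0, size=500)          # positively skewed toy distances
labels = (x > np.quantile(x, 0.95)).astype(int)        # toy ground truth: largest 5%

cdf_score = stats.gamma.cdf(x, a=2.0, scale=1.0)       # strictly increasing in x
rescaled = np.sqrt(cdf_score)                          # another strictly increasing map
noisy_score = x + rng.normal(scale=1.0, size=x.size)   # stand-in for estimation error

auc_cdf = pairwise_auc(cdf_score, labels)
auc_rescaled = pairwise_auc(rescaled, labels)
auc_noisy = pairwise_auc(noisy_score, labels)

print(f"CDF score AUC:      {auc_cdf:.3f}")
print(f"rescaled score AUC: {auc_rescaled:.3f}")
print(f"noisy score AUC:    {auc_noisy:.3f}")
```

The two strictly increasing scores produce the same AUC because they rank every pair identically, while the perturbed score loses AUC exactly through the pairs it swaps. The example shows the mechanism only; it does not establish how large the gap is for any particular non-parametric detector.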
3.2.4. Significance of the CDF Superiority Theorem
3.2.5. Justification for ROC AUC as Evaluation Metric
3.2.6. Theoretical Support for Parametric Tests
3.3. Worked Examples
3.3.1. 1-D KNN Disordering Example
3.3.2. 1-D LOF Disordering Example
3.3.3. Extension to 3-D Case
3.3.4. 3-D LOF Scoring
3.3.5. Connection to CDF-Based Scoring
3.4. Remark
3.5. Monte Carlo Simulation of CDF Versus Non-Parametric ROC-AUC Scores
4. Our Parametric Outlier-Detection Framework
4.1. Dimensionality Reduction via KNN–Manhattan
4.2. Fitting Positively Skewed Distributions
Log–Transform and Normal-like Fits
4.3. Baseline Non-Parametric Methods
4.4. Datasets
5. Empirical Real-World Data Results
5.1. Analysis of Goodness-of-Fit Across Literature and Semantic Datasets
5.2. Real Data Analysis Results
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| KNN | k-Nearest Neighbors |
| LOF | Local Outlier Factor |
| COF | Connectivity-Based Outlier Factor |
| ABOD | Angle-Based Outlier Detection |
| KDE | Kernel Density Estimation |
| CDF | Cumulative Distribution Function |
| TPR | True Positive Rate |
| FPR | False Positive Rate |
| ROC | Receiver Operating Characteristic |
| AUC | Area Under Curve |
| ESD | Extreme Studentized Deviate |
| LoOP | Local Outlier Probabilities |
| LDOF | Local Distance-Based Outlier Factor |
| ODIN | Outlier Detection using Indegree Number |
| KDEOS | Kernel Density Estimation Outlier Score |
| SVM | Support Vector Machine |
| SVDD | Support Vector Data Description |
| DAGMM | Deep Autoencoding Gaussian Mixture Model |
| HBOS | Histogram-Based Outlier Score |
| LODA | Lightweight On-line Detector of Anomalies |
| COPOD | Copula-Based Outlier Detection |
| INFLO | Influenced Outlierness |
Appendix A
Listing A1. Monte Carlo simulation example for positively skewed data.
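As an illustration of what a simulation of this kind might look like, the following is a minimal sketch and not a reconstruction of the authors' listing. The gamma inlier model, the shifted-gamma outliers, the sample sizes, the choice of k, and the helper names are all assumptions.

```python
import numpy as np
from scipy import stats


def pairwise_auc(scores, labels):
    """AUC as the fraction of (outlier, inlier) pairs ranked correctly; ties count 1/2."""
    pos = scores[labels == 1][:, None]
    neg = scores[labels == 0][None, :]
    return float(np.mean((pos > neg) + 0.5 * (pos == neg)))


def kth_neighbor_distance(x, k):
    """Distance from each 1-D point to its k-th nearest neighbor."""
    D = np.abs(x[:, None] - x[None, :])
    np.fill_diagonal(D, np.inf)           # a point is not its own neighbor
    return np.sort(D, axis=1)[:, k - 1]


def run_trial(rng, n_in=200, n_out=10, shift=6.0, k=5):
    """One replicate: simulate skewed data, score it two ways, return both AUCs."""
    inliers = rng.gamma(shape=2.0, scale=1.0, size=n_in)
    outliers = shift + rng.gamma(shape=2.0, scale=1.0, size=n_out)
    x = np.concatenate([inliers, outliers])
    labels = np.concatenate([np.zeros(n_in, dtype=int), np.ones(n_out, dtype=int)])

    a, loc, scale = stats.gamma.fit(x, floc=0)            # parametric fit, location fixed at 0
    cdf_score = stats.gamma.cdf(x, a, loc=loc, scale=scale)
    knn_score = kth_neighbor_distance(x, k)                # non-parametric comparison score
    return pairwise_auc(cdf_score, labels), pairwise_auc(knn_score, labels)


rng = np.random.default_rng(42)
results = np.array([run_trial(rng) for _ in range(100)])
mean_auc_cdf, mean_auc_knn = results.mean(axis=0)
print(f"mean AUC, fitted-gamma CDF score: {mean_auc_cdf:.3f}")
print(f"mean AUC, k-th neighbor distance: {mean_auc_knn:.3f}")
```

Averaging over replicates separates systematic differences between the two scores from sampling noise in a single draw; the number of replicates and the contamination level can be varied to probe how sensitive the comparison is to these choices.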
| Distribution | ALOI | Glass | Ionosphere | KDDCup99 | Lymphography | PenDigits |
|---|---|---|---|---|---|---|
| Log Transform | ||||||
| norm | 98.81% | 93.36% | 98.36% | 90.25% | 92.44% | 97.88% |
| t | 98.86% | 91.38% | 98.36% | 88.38% | 96.71% | 97.66% |
| laplace | 95.66% | 90.36% | 91.60% | 88.98% | 90.75% | 93.12% |
| logistic | 98.42% | 92.25% | 96.12% | 90.08% | 94.47% | 96.69% |
| skewnorm | 99.71% | 98.72% | 98.30% | 95.60% | 97.85% | 99.86% |
| No Transform | ||||||
| expon | 89.28% | 93.64% | 97.14% | 76.67% | 96.17% | 99.36% |
| chi2 | 91.27% | 94.33% | 96.29% | 92.60% | 97.28% | 98.23% |
| gamma | 93.91% | 93.87% | 97.12% | 97.96% | 92.45% | 97.86% |
| weibull_min | 93.93% | 97.77% | 97.33% | 97.73% | 91.00% | 95.70% |
| invgauss | 99.13% | 95.99% | 93.88% | 99.18% | 94.22% | 96.39% |
| rayleigh | 69.15% | 75.77% | 90.79% | 52.58% | 79.50% | 93.74% |
| wald | 96.14% | 97.45% | 93.01% | 87.44% | 98.24% | 96.70% |
| pareto | 98.36% | 95.84% | 97.14% | 12.84% | 96.17% | 99.36% |
| nakagami | 82.07% | 84.00% | 97.45% | 76.86% | 87.64% | 94.71% |
| logistic | 61.71% | 71.96% | 80.61% | 46.46% | 79.73% | 87.25% |
| powerlaw | 73.65% | 86.12% | 96.79% | 63.82% | 76.63% | 87.89% |
| skewnorm | 75.15% | 81.96% | 95.23% | 58.90% | 88.14% | 96.78% |
| Distribution | Shuttle | Waveform | WBC | WDBC | WPBC | Average |
|---|---|---|---|---|---|---|
| Log Transform | ||||||
| norm | 99.10% | 99.14% | 87.33% | 90.96% | 92.60% | 94.57% |
| t | 99.29% | 99.12% | 83.70% | 85.60% | 89.97% | 93.55% |
| laplace | 97.02% | 95.62% | 82.65% | 87.74% | 89.02% | 91.14% |
| logistic | 99.09% | 98.39% | 83.39% | 91.31% | 90.83% | 93.73% |
| skewnorm | 99.25% | 99.96% | 93.28% | 97.87% | 98.95% | 98.12% |
| No Transform | ||||||
| expon | 73.73% | 91.97% | 91.40% | 95.57% | 99.17% | 91.28% |
| chi2 | 73.78% | 99.96% | 89.83% | 97.32% | 98.60% | 93.59% |
| gamma | 72.50% | 99.96% | 98.67% | 97.35% | 98.60% | 94.57% |
| weibull_min | 79.88% | 99.31% | 92.12% | 92.68% | 97.12% | 94.05% |
| invgauss | 73.43% | 99.97% | 89.77% | 97.18% | 99.11% | 94.39% |
| rayleigh | 62.81% | 99.71% | 64.23% | 79.35% | 92.52% | 78.19% |
| wald | 78.22% | 84.60% | 92.95% | 98.75% | 97.24% | 92.79% |
| pareto | 73.73% | 91.97% | 64.07% | 96.02% | 99.24% | 84.07% |
| nakagami | 64.77% | 99.66% | 88.02% | 86.87% | 94.98% | 87.00% |
| logistic | 59.31% | 97.13% | 58.20% | 72.55% | 83.44% | 72.58% |
| powerlaw | 56.20% | 93.50% | 72.40% | 76.42% | 89.13% | 79.32% |
| skewnorm | 65.54% | 99.90% | 71.21% | 85.54% | 95.16% | 83.04% |
| Distribution | Annthyroid | Arrhythmia | Cardiotocography | HeartDisease | Hepatitis | InternetAds |
|---|---|---|---|---|---|---|
| Log Transform | ||||||
| norm | 97.32% | 92.53% | 98.33% | 98.37% | 96.42% | 97.22% |
| t | 99.12% | 91.14% | 98.23% | 98.37% | 96.42% | 97.22% |
| laplace | 98.76% | 90.17% | 94.26% | 93.94% | 89.60% | 93.41% |
| logistic | 98.97% | 92.01% | 97.23% | 97.15% | 93.92% | 96.18% |
| skewnorm | 97.48% | 99.49% | 99.44% | 99.31% | 96.45% | 98.11% |
| No Transform | ||||||
| expon | 76.81% | 99.08% | 97.85% | 94.44% | 87.87% | 97.61% |
| chi2 | 85.52% | 99.12% | 97.06% | 99.39% | 89.40% | 98.09% |
| gamma | 85.93% | 97.45% | 99.23% | 99.39% | 93.61% | 98.09% |
| weibull_min | 77.03% | 96.76% | 98.14% | 99.40% | 96.15% | 97.32% |
| invgauss | 83.37% | 99.20% | 99.46% | 99.34% | 93.76% | 98.64% |
| rayleigh | 56.43% | 89.74% | 96.29% | 99.22% | 97.84% | 95.73% |
| wald | 85.90% | 98.33% | 93.95% | 87.79% | 80.24% | 93.98% |
| pareto | 84.19% | 99.08% | 97.85% | 94.44% | 87.87% | 97.62% |
| nakagami | 64.82% | 93.25% | 97.09% | 99.31% | 97.33% | 90.16% |
| logistic | 51.34% | 81.56% | 90.29% | 94.63% | 93.75% | 87.67% |
| powerlaw | 53.06% | 84.56% | 88.02% | 94.70% | 98.76% | 85.66% |
| skewnorm | 61.61% | 93.83% | 97.90% | 99.51% | 96.93% | 95.73% |
| Distribution | PageBlocks | Parkinson | Pima | SpamBase | Stamps | Wilt | Average |
|---|---|---|---|---|---|---|---|
| Log Transform | |||||||
| norm | 92.65% | 94.42% | 97.87% | 74.47% | 97.44% | 96.23% | 94.44% |
| t | 91.75% | 97.80% | 97.87% | 80.92% | 97.26% | 91.67% | 94.82% |
| laplace | 89.74% | 96.46% | 92.28% | 83.68% | 93.34% | 97.20% | 92.74% |
| logistic | 92.16% | 96.06% | 96.27% | 78.35% | 96.42% | 97.54% | 94.36% |
| skewnorm | 99.62% | 96.92% | 99.06% | 83.94% | 99.27% | 97.95% | 97.25% |
| No Transform | |||||||
| expon | 79.86% | 91.84% | 97.62% | 92.80% | 97.74% | 64.13% | 89.80% |
| chi2 | 79.83% | 85.41% | 94.45% | 92.24% | 97.88% | 67.36% | 90.48% |
| gamma | 93.53% | 85.41% | 99.70% | 92.22% | 97.88% | 72.96% | 92.95% |
| weibull_min | 82.47% | 94.74% | 99.47% | 89.26% | 97.08% | 86.30% | 92.84% |
| invgauss | 92.85% | 88.89% | 99.36% | 92.53% | 98.47% | 57.91% | 91.98% |
| rayleigh | 58.19% | 78.23% | 97.66% | 87.46% | 92.07% | 46.62% | 82.79% |
| wald | 89.31% | 95.89% | 92.66% | 93.98% | 96.88% | 64.81% | 89.59% |
| pareto | 95.46% | 91.84% | 97.62% | 93.47% | 97.74% | 67.46% | 92.05% |
| nakagami | 69.50% | 84.59% | 98.74% | 87.67% | 94.98% | 52.37% | 86.26% |
| logistic | 52.31% | 72.85% | 91.07% | 83.92% | 85.41% | 38.07% | 77.01% |
| powerlaw | 59.33% | 68.75% | 92.67% | 77.26% | 85.86% | 40.43% | 77.42% |
| skewnorm | 63.91% | 81.93% | 99.20% | 89.06% | 94.75% | 45.16% | 84.96% |
| Method | ALOI ROC AUC | ALOI k | Glass ROC AUC | Glass k | Ionosphere ROC AUC | Ionosphere k |
|---|---|---|---|---|---|---|
| Log Transform | ||||||
| norm | 74.50% | 3 | 87.20% | 10 | 90.90% | 2 |
| t | 74.60% | 3 | 87.60% | 2 | 90.90% | 2 |
| laplace | 74.50% | 2 | 88.00% | 2 | 90.70% | 2 |
| logistic | 74.50% | 3 | 87.60% | 2 | 90.70% | 2 |
| skewnorm | 74.50% | 3 | 87.60% | 10 | 90.80% | 2 |
| No Transform | ||||||
| expon | 74.30% | 3 | 87.10% | 2 | 90.10% | 2 |
| chi2 | 74.40% | 3 | 87.80% | 2 | 90.10% | 2 |
| gamma | 74.50% | 2 | 88.00% | 2 | 90.10% | 2 |
| weibull_min | 74.40% | 2 | 87.50% | 10 | 90.10% | 2 |
| invgauss | 74.60% | 3 | 88.50% | 2 | 90.10% | 2 |
| rayleigh | 74.30% | 3 | 87.50% | 2 | 90.10% | 2 |
| wald | 74.40% | 3 | 87.80% | 2 | 90.10% | 2 |
| pareto | 74.50% | 3 | 87.90% | 2 | 90.10% | 2 |
| nakagami | 74.50% | 3 | 88.00% | 2 | 90.10% | 2 |
| logistic | 73.80% | 3 | 87.20% | 10 | 90.20% | 2 |
| powerlaw | 74.50% | 3 | 87.40% | 10 | 89.90% | 2 |
| skewnorm | 74.30% | 3 | 87.70% | 2 | 90.10% | 2 |
| Baseline Manhattan | ||||||
| KNN | 74.60% | 2 | 87.40% | 10 | 89.60% | 4 |
| LOF | 81.40% | 7 | 86.70% | 13 | 87.10% | 10 |
| SimplifiedLOF | 74.86% | 3 | 87.99% | 2 | 90.04% | 2 |
| LoOP | 83.45% | 10 | 85.09% | 20 | 86.38% | 16 |
| LDOF | 75.24% | 9 | 78.10% | 26 | 83.22% | 50 |
| ODIN | 74.62% | 3 | 87.99% | 2 | 90.04% | 2 |
| FastABOD | 76.66% | 14 | 50.00% | 2 | 92.07% | 69 |
| KDEOS | 52.26% | 62 | 83.96% | 19 | 86.25% | 70 |
| LDF | 74.86% | 3 | 87.99% | 2 | 90.04% | 2 |
| INFLO | 83.60% | 10 | 83.79% | 18 | 86.06% | 16 |
| COF | 76.84% | 30 | 89.86% | 62 | 88.02% | 13 |
| Baseline Eucli. | ||||||
| KNN | 74.06% | 1 | 87.48% | 8 | 92.74% | 1 |
| LOF | 78.23% | 9 | 86.67% | 11 | 90.43% | 83 |
| SimplifiedLOF | 79.57% | 16 | 86.50% | 16 | 90.50% | 10 |
| LoOP | 80.08% | 12 | 83.96% | 18 | 90.21% | 11 |
| LDOF | 77.89% | 27 | 89.61% | 14 | ||
| ODIN | 80.50% | 11 | 72.93% | 18 | 85.22% | 13 |
| FastABOD | 85.80% | 98 | 91.33% | 3 | ||
| KDEOS | 77.26% | 99 | 74.20% | 28 | 83.40% | 71 |
| LDF | 74.62% | 9 | 90.35% | 9 | 91.67% | 50 |
| INFLO | 79.87% | 9 | 80.38% | 18 | 90.38% | 10 |
| COF | 80.17% | 13 | 89.54% | 76 | 96.03% | 100 |
| Method | KDDCup99 ROC AUC | KDDCup99 k | Lymphography ROC AUC | Lymphography k | PenDigits ROC AUC | PenDigits k |
|---|---|---|---|---|---|---|
| Log Transform | ||||||
| norm | 96.80% | 69 | 100.00% | 19 | 98.20% | 9 |
| t | 96.80% | 68 | 99.90% | 6 | 98.30% | 10 |
| laplace | 96.70% | 69 | 99.80% | 31 | 98.40% | 14 |
| logistic | 96.70% | 69 | 100.00% | 15 | 98.40% | 11 |
| skewnorm | 96.70% | 69 | 100.00% | 8 | 98.70% | 15 |
| No Transform | ||||||
| expon | 95.00% | 69 | 99.30% | 13 | 99.10% | 12 |
| chi2 | 96.90% | 69 | 100.00% | 38 | 98.40% | 6 |
| gamma | 96.70% | 69 | 100.00% | 8 | 98.30% | 12 |
| weibull_min | 96.80% | 69 | 100.00% | 8 | 98.20% | 9 |
| invgauss | 96.50% | 69 | 100.00% | 8 | 99.10% | 12 |
| rayleigh | 95.70% | 69 | 99.90% | 4 | 97.60% | 6 |
| wald | 94.90% | 69 | 100.00% | 26 | 99.10% | 9 |
| pareto | 96.40% | 69 | 100.00% | 13 | 99.10% | 12 |
| nakagami | 96.50% | 69 | 100.00% | 8 | 97.90% | 10 |
| logistic | 94.30% | 69 | 100.00% | 8 | 96.80% | 8 |
| powerlaw | 96.90% | 69 | 100.00% | 8 | 99.10% | 9 |
| skewnorm | 95.90% | 69 | 100.00% | 8 | 98.30% | 9 |
| Baseline Manhattan | ||||||
| KNN | 97.00% | 69 | 100.00% | 7 | 99.10% | 11 |
| LOF | 67.90% | 45 | 100.00% | 47 | 97.10% | 55 |
| SimplifiedLOF | 95.40% | 70 | 100.00% | 3 | 99.13% | 21 |
| LoOP | 66.52% | 61 | 99.88% | 59 | 96.24% | 70 |
| LDOF | 77.09% | 70 | 99.65% | 44 | 72.92% | 70 |
| ODIN | 97.01% | 70 | 100.00% | 8 | 99.12% | 12 |
| FastABOD | 58.97% | 70 | 99.18% | 60 | 50.00% | 2 |
| KDEOS | 50.00% | 2 | 82.75% | 33 | 86.69% | 59 |
| LDF | 95.40% | 70 | 100.00% | 3 | 99.13% | 21 |
| INFLO | 66.46% | 54 | 99.88% | 59 | 96.95% | 70 |
| COF | 60.57% | 69 | 96.48% | 14 | 98.29% | 69 |
| Baseline Eucli. | ||||||
| KNN | 98.97% | 89 | 100.00% | 14 | 99.21% | 12 |
| LOF | 84.89% | 100 | 100.00% | 62 | 96.58% | 73 |
| SimplifiedLOF | 66.80% | 62 | 100.00% | 98 | 96.68% | 67 |
| LoOP | 70.31% | 65 | 99.77% | 47 | 96.23% | 98 |
| LDOF | 99.77% | 86 | 75.03% | 91 | ||
| ODIN | 80.77% | 100 | 99.88% | 55 | 96.43% | 100 |
| FastABOD | 99.77% | 25 | 97.98% | 100 | ||
| KDEOS | 60.51% | 68 | 98.12% | 99 | 82.21% | 98 |
| LDF | 87.70% | 90 | 100.00% | 13 | 97.79% | 12 |
| INFLO | 70.33% | 56 | 99.88% | 62 | 95.71% | 98 |
| COF | 67.01% | 67 | 100.00% | 40 | 96.70% | 95 |
| Method | Shuttle ROC AUC | Shuttle k | Waveform ROC AUC | Waveform k | WBC ROC AUC | WBC k |
|---|---|---|---|---|---|---|
| Log Transform | ||||||
| norm | 84.60% | 5 | 78.30% | 68 | 99.40% | 9 |
| t | 84.66% | 5 | 78.50% | 64 | 99.70% | 23 |
| laplace | 84.60% | 5 | 78.50% | 61 | 99.70% | 32 |
| logistic | 84.50% | 5 | 78.60% | 68 | 99.30% | 10 |
| skewnorm | 84.60% | 5 | 78.60% | 69 | 99.60% | 60 |
| No Transform | ||||||
| expon | 84.20% | 5 | 78.50% | 62 | 98.80% | 10 |
| chi2 | 84.50% | 5 | 78.60% | 66 | 99.80% | 30 |
| gamma | 82.00% | 5 | 78.60% | 66 | 99.90% | 40 |
| weibull_min | 83.90% | 5 | 78.50% | 67 | 99.80% | 26 |
| invgauss | 84.50% | 5 | 78.60% | 66 | 99.50% | 4 |
| rayleigh | 84.80% | 5 | 78.80% | 66 | 99.00% | 33 |
| wald | 84.50% | 5 | 78.60% | 69 | 99.80% | 69 |
| pareto | 84.30% | 5 | 78.60% | 62 | 99.20% | 4 |
| nakagami | 84.60% | 5 | 78.40% | 68 | 99.80% | 17 |
| logistic | 84.20% | 5 | 78.40% | 58 | 96.70% | 23 |
| powerlaw | 82.80% | 5 | 78.50% | 59 | 99.80% | 28 |
| skewnorm | 84.60% | 5 | 78.50% | 67 | 99.20% | 61 |
| Baseline Manhattan | ||||||
| KNN | 84.68% | 4 | 78.60% | 65 | 99.70% | 24 |
| LOF | 84.10% | 7 | 76.50% | 69 | 99.70% | 65 |
| SimplifiedLOF | 78.00% | 14 | 77.77% | 70 | 99.72% | 22 |
| LoOP | 82.08% | 11 | 72.59% | 70 | 97.28% | 70 |
| LDOF | 77.98% | 22 | 69.59% | 67 | 94.37% | 70 |
| ODIN | 84.68% | 5 | 78.57% | 66 | 99.74% | 25 |
| FastABOD | 50.00% | 2 | 52.31% | 5 | 76.29% | 13 |
| KDEOS | 77.30% | 48 | 65.14% | 70 | 97.54% | 11 |
| LDF | 78.00% | 14 | 77.77% | 70 | 99.72% | 22 |
| INFLO | 77.84% | 10 | 71.50% | 70 | 99.48% | 67 |
| COF | 63.02% | 64 | 76.25% | 58 | 98.97% | 58 |
| Baseline Eucli. | ||||||
| KNN | 81.76% | 3 | 77.55% | 77 | 99.72% | 19 |
| LOF | 78.21% | 6 | 75.60% | 96 | 99.67% | 98 |
| SimplifiedLOF | 76.61% | 99 | 72.95% | 100 | 99.39% | 99 |
| LoOP | 76.40% | 99 | 72.37% | 100 | 98.03% | 99 |
| LDOF | 84.75% | 15 | 68.82% | 100 | 96.53% | 99 |
| ODIN | 78.90% | 8 | 69.68% | 100 | 96.74% | 100 |
| FastABOD | 95.46% | 6 | 67.31% | 40 | 99.48% | 49 |
| KDEOS | 66.55% | 94 | 59.24% | 99 | 64.79% | 5 |
| LDF | 71.59% | 4 | 78.89% | 16 | 99.72% | 71 |
| INFLO | 79.89% | 98 | 70.92% | 94 | 99.39% | 99 |
| COF | 63.97% | 71 | 77.59% | 99 | 99.44% | 74 |
| Method | WDBC ROC AUC | WDBC k | WPBC ROC AUC | WPBC k |
|---|---|---|---|---|
| Log Transform | ||||
| norm | 97.70% | 9 | 53.10% | 12 |
| t | 98.40% | 46 | 53.40% | 20 |
| laplace | 98.30% | 42 | 53.10% | 26 |
| logistic | 97.30% | 9 | 53.20% | 20 |
| skewnorm | 98.70% | 43 | 53.20% | 26 |
| No Transform | ||||
| expon | 98.50% | 42 | 53.20% | 14 |
| chi2 | 99.00% | 56 | 53.20% | 26 |
| gamma | 98.90% | 68 | 53.20% | 26 |
| weibull_min | 98.90% | 64 | 53.30% | 12 |
| invgauss | 98.70% | 25 | 53.10% | 19 |
| rayleigh | 97.50% | 20 | 53.20% | 12 |
| wald | 98.70% | 53 | 53.20% | 12 |
| pareto | 98.80% | 42 | 53.30% | 19 |
| nakagami | 99.00% | 63 | 53.20% | 12 |
| logistic | 96.90% | 39 | 53.20% | 19 |
| powerlaw | 98.90% | 64 | 53.10% | 20 |
| skewnorm | 98.30% | 41 | 52.90% | 21 |
| Baseline Manhattan | ||||
| KNN | 99.00% | 69 | 53.10% | 18 |
| LOF | 99.10% | 69 | 52.70% | 34 |
| SimplifiedLOF | 98.71% | 57 | 52.70% | 29 |
| LoOP | 98.38% | 69 | 49.61% | 61 |
| LDOF | 97.96% | 70 | 50.18% | 61 |
| ODIN | 98.96% | 70 | 53.10% | 19 |
| FastABOD | 50.00% | 2 | 54.52% | 4 |
| KDEOS | 90.08% | 69 | 57.13% | 34 |
| LDF | 98.71% | 57 | 52.70% | 29 |
| INFLO | 98.91% | 70 | 49.27% | 57 |
| COF | 97.70% | 64 | 50.64% | 47 |
| Baseline Eucli. | ||||
| KNN | 98.63% | 90 | 54.09% | 12 |
| LOF | 98.91% | 89 | 52.54% | 24 |
| SimplifiedLOF | 98.68% | 90 | 50.18% | 1 |
| LoOP | 98.40% | 100 | 50.18% | 1 |
| LDOF | 98.18% | 99 | 56.56% | 7 |
| ODIN | 97.23% | 93 | 50.73% | 1 |
| FastABOD | 98.26% | 97 | 53.42% | 40 |
| KDEOS | 86.11% | 80 | 51.85% | 2 |
| LDF | 98.54% | 33 | 58.29% | 8 |
| INFLO | 98.49% | 95 | 49.57% | 20 |
| COF | 98.07% | 55 | 55.69% | 97 |
| Method | Annthyroid ROC AUC | Annthyroid k | Arrhythmia ROC AUC | Arrhythmia k | Cardiotocography ROC AUC | Cardiotocography k |
|---|---|---|---|---|---|---|
| Log Transform | ||||||
| norm | 67.60% | 2 | 76.20% | 44 | 55.70% | 69 |
| t | 67.70% | 2 | 76.10% | 47 | 55.60% | 69 |
| laplace | 67.60% | 2 | 76.20% | 34 | 55.70% | 68 |
| logistic | 67.70% | 2 | 76.10% | 47 | 55.60% | 69 |
| skewnorm | 67.70% | 2 | 76.00% | 35 | 55.70% | 69 |
| No Transform | ||||||
| expon | 67.61% | 2 | 75.66% | 35 | 55.65% | 69 |
| chi2 | 67.64% | 2 | 76.11% | 46 | 55.69% | 69 |
| gamma | 67.65% | 2 | 76.11% | 41 | 55.72% | 69 |
| weibull_min | 67.61% | 2 | 75.97% | 44 | 55.79% | 68 |
| invgauss | 67.69% | 2 | 76.03% | 45 | 55.64% | 67 |
| rayleigh | 67.59% | 2 | 75.99% | 45 | 55.76% | 69 |
| wald | 67.64% | 2 | 76.05% | 44 | 55.76% | 69 |
| pareto | 67.60% | 2 | 76.07% | 35 | 55.65% | 69 |
| nakagami | 67.62% | 2 | 76.02% | 38 | 55.70% | 69 |
| logistic | 67.36% | 2 | 76.15% | 29 | 55.69% | 68 |
| powerlaw | 67.46% | 2 | 76.09% | 45 | 55.77% | 69 |
| skewnorm | 67.52% | 2 | 76.20% | 45 | 55.71% | 69 |
| Baseline Manhattan | ||||||
| KNN | 67.28% | 2 | 76.10% | 43 | 55.80% | 69 |
| LOF | 70.20% | 11 | 75.50% | 48 | 60.20% | 69 |
| SimplifiedLOF | 67.74% | 3 | 75.81% | 51 | 53.78% | 70 |
| LoOP | 72.09% | 38 | 75.76% | 70 | 56.84% | 21 |
| LDOF | 78.92% | 28 | 75.18% | 6 | 56.17% | 50 |
| ODIN | 67.67% | 2 | 76.06% | 44 | 55.76% | 70 |
| FastABOD | 71.34% | 46 | 67.53% | 70 | 50.00% | 2 |
| KDEOS | 50.00% | 2 | 50.00% | 2 | 50.32% | 36 |
| LDF | 67.74% | 3 | 75.81% | 51 | 53.78% | 70 |
| INFLO | 71.31% | 31 | 75.30% | 70 | 57.98% | 69 |
| COF | 62.62% | 55 | 75.52% | 41 | 56.92% | 70 |
| Baseline Eucli. | ||||||
| KNN | 64.90% | 1 | 75.21% | 60 | 66.67% | 100 |
| LOF | 66.76% | 9 | 74.42% | 94 | 64.70% | 100 |
| SimplifiedLOF | 66.53% | 21 | 73.81% | 65 | 59.79% | 100 |
| LoOP | 67.72% | 23 | 73.84% | 77 | 59.50% | 100 |
| LDOF | 69.21% | 30 | 73.45% | 100 | 57.69% | 100 |
| ODIN | 69.33% | 5 | 72.67% | 98 | 62.12% | 100 |
| FastABOD | 62.39% | 4 | 74.18% | 98 | 55.74% | 100 |
| KDEOS | 67.81% | 39 | 66.10% | 21 | 54.74% | 22 |
| LDF | 65.93% | 8 | 72.29% | 67 | 67.71% | 100 |
| INFLO | 66.46% | 47 | 73.15% | 91 | 59.84% | 100 |
| COF | 69.21% | 30 | 73.39% | 39 | 56.83% | 20 |
| Method | HeartDisease ROC AUC | HeartDisease k | Hepatitis ROC AUC | Hepatitis k | InternetAds ROC AUC | InternetAds k |
|---|---|---|---|---|---|---|
| Log Transform | ||||||
| norm | 70.10% | 69 | 78.80% | 26 | 72.20% | 14 |
| t | 70.10% | 69 | 78.80% | 26 | 72.20% | 14 |
| laplace | 69.70% | 69 | 79.00% | 25 | 72.20% | 14 |
| logistic | 69.80% | 68 | 79.00% | 26 | 72.20% | 14 |
| skewnorm | 70.10% | 68 | 78.80% | 26 | 72.20% | 14 |
| No Transform | ||||||
| expon | 69.51% | 69 | 77.04% | 25 | 72.12% | 14 |
| chi2 | 70.02% | 66 | 78.87% | 40 | 72.23% | 14 |
| gamma | 70.02% | 66 | 78.47% | 26 | 72.23% | 14 |
| weibull_min | 70.16% | 68 | 78.99% | 26 | 70.36% | 6 |
| invgauss | 69.91% | 69 | 79.22% | 26 | 72.20% | 14 |
| rayleigh | 69.82% | 68 | 79.05% | 26 | 72.20% | 14 |
| wald | 69.91% | 68 | 78.59% | 26 | 72.16% | 14 |
| pareto | 70.18% | 69 | 78.53% | 25 | 72.12% | 14 |
| nakagami | 69.99% | 68 | 78.70% | 26 | 72.23% | 14 |
| logistic | 69.78% | 69 | 78.53% | 25 | 72.16% | 14 |
| powerlaw | 69.63% | 69 | 78.76% | 26 | 72.18% | 14 |
| skewnorm | 69.89% | 66 | 78.53% | 26 | 72.21% | 14 |
| Baseline Manhattan | ||||||
| KNN | 70.00% | 68 | 79.00% | 25 | 72.20% | 13 |
| LOF | 64.00% | 69 | 80.40% | 50 | 70.30% | 69 |
| SimplifiedLOF | 66.97% | 70 | 75.89% | 51 | 74.21% | 18 |
| LoOP | 55.55% | 70 | 74.17% | 65 | 65.28% | 70 |
| LDOF | 54.32% | 5 | 72.90% | 69 | 64.68% | 41 |
| ODIN | 69.99% | 69 | 78.99% | 26 | 72.21% | 14 |
| FastABOD | 60.11% | 66 | 68.08% | 28 | 54.84% | 14 |
| KDEOS | 65.43% | 53 | 70.75% | 36 | 50.00% | 2 |
| LDF | 66.97% | 70 | 75.89% | 51 | 74.21% | 18 |
| INFLO | 56.32% | 68 | 74.63% | 64 | 68.03% | 70 |
| COF | 56.47% | 70 | 73.02% | 51 | 68.49% | 32 |
| Baseline Eucli. | ||||||
| KNN | 68.38% | 81 | 78.59% | 21 | 72.23% | 12 |
| LOF | 65.58% | 100 | 80.37% | 48 | 74.09% | 98 |
| SimplifiedLOF | 56.93% | 100 | 73.82% | 78 | 74.31% | 98 |
| LoOP | 56.14% | 60 | 72.27% | 78 | 70.07% | 100 |
| LDOF | 56.91% | 14 | 73.82% | 79 | 69.36% | 98 |
| ODIN | 60.59% | 82 | 74.97% | 58 | 60.54% | 7 |
| FastABOD | 75.57% | 100 | 70.95% | 59 | 73.39% | 24 |
| KDEOS | 55.69% | 100 | 71.18% | 79 | 57.78% | 35 |
| LDF | 72.06% | 83 | 82.89% | 46 | 68.50% | 100 |
| INFLO | 55.97% | 15 | 60.28% | 55 | 72.96% | 98 |
| COF | 71.68% | 100 | 82.72% | 78 | 59.88% | 10 |
| Method | PageBlocks ROC AUC | PageBlocks k | Parkinson ROC AUC | Parkinson k | Pima ROC AUC | Pima k |
|---|---|---|---|---|---|---|
| Log Transform | ||||||
| norm | 87.10% | 69 | 73.70% | 6 | 73.70% | 68 |
| t | 87.10% | 69 | 73.90% | 4 | 73.70% | 68 |
| laplace | 87.20% | 69 | 73.90% | 6 | 73.60% | 69 |
| logistic | 87.00% | 69 | 73.80% | 6 | 73.70% | 67 |
| skewnorm | 87.10% | 69 | 73.90% | 6 | 73.70% | 67 |
| No Transform | ||||||
| expon | 87.04% | 69 | 71.69% | 6 | 73.38% | 64 |
| chi2 | 87.06% | 69 | 73.82% | 6 | 73.65% | 66 |
| gamma | 86.97% | 68 | 73.82% | 6 | 73.59% | 69 |
| weibull_min | 87.08% | 68 | 74.15% | 4 | 73.57% | 68 |
| invgauss | 87.07% | 68 | 73.75% | 6 | 73.56% | 63 |
| rayleigh | 86.94% | 69 | 73.76% | 6 | 73.57% | 69 |
| wald | 87.06% | 69 | 73.94% | 6 | 73.62% | 68 |
| pareto | 87.19% | 69 | 73.77% | 6 | 73.58% | 64 |
| nakagami | 87.18% | 69 | 73.63% | 4 | 73.67% | 68 |
| logistic | 86.72% | 69 | 73.65% | 6 | 73.69% | 67 |
| powerlaw | 86.95% | 69 | 73.60% | 6 | 73.70% | 69 |
| skewnorm | 87.04% | 69 | 73.97% | 6 | 73.53% | 68 |
| Baseline Manhattan | ||||||
| KNN | 87.30% | 69 | 73.70% | 5 | 73.60% | 67 |
| LOF | 81.10% | 69 | 63.90% | 5 | 67.20% | 69 |
| SimplifiedLOF | 84.75% | 70 | 72.11% | 6 | 73.05% | 70 |
| LoOP | 77.43% | 70 | 57.56% | 19 | 61.53% | 69 |
| LDOF | 80.21% | 70 | 52.98% | 23 | 58.50% | 65 |
| ODIN | 87.28% | 70 | 73.74% | 6 | 73.60% | 68 |
| FastABOD | 50.78% | 70 | 58.19% | 8 | 51.13% | 23 |
| KDEOS | 64.99% | 70 | 76.94% | 57 | 66.79% | 70 |
| LDF | 84.75% | 70 | 72.11% | 6 | 73.05% | 70 |
| INFLO | 75.35% | 70 | 52.51% | 11 | 61.62% | 70 |
| COF | 69.91% | 70 | 70.95% | 70 | 66.15% | 70 |
| Baseline Eucli. | ||||||
| KNN | 84.08% | 100 | 65.24% | 4 | 73.22% | 85 |
| LOF | 81.87% | 60 | 61.20% | 6 | 68.96% | 100 |
| SimplifiedLOF | 80.47% | 98 | 60.73% | 14 | 62.13% | 100 |
| LoOP | 79.38% | 86 | 58.31% | 13 | 60.92% | 99 |
| LDOF | 82.98% | 82 | 55.32% | 16 | 57.00% | 98 |
| ODIN | 73.06% | 100 | 52.61% | 3 | 63.64% | 100 |
| FastABOD | 73.39% | 24 | 66.99% | 15 | 76.08% | 99 |
| KDEOS | 69.51% | 91 | 58.67% | 28 | 55.62% | 2 |
| LDF | 83.02% | 42 | 60.22% | 6 | 72.89% | 100 |
| INFLO | 76.80% | 80 | 58.39% | 10 | 61.73% | 92 |
| COF | 77.02% | 71 | 64.97% | 98 | 70.12% | 100 |
| Method | SpamBase ROC AUC | SpamBase k | Stamps ROC AUC | Stamps k | Wilt ROC AUC | Wilt k |
|---|---|---|---|---|---|---|
| Log Transform | ||||||
| norm | 65.10% | 41 | 91.70% | 61 | 56.20% | 2 |
| t | 65.00% | 40 | 91.70% | 68 | 56.10% | 3 |
| laplace | 65.00% | 46 | 91.90% | 62 | 56.20% | 2 |
| logistic | 65.00% | 41 | 91.80% | 68 | 56.20% | 2 |
| skewnorm | 65.00% | 46 | 92.20% | 65 | 56.10% | 3 |
| No Transform | ||||||
| expon | 64.94% | 51 | 91.67% | 63 | 56.03% | 3 |
| chi2 | 65.03% | 49 | 91.93% | 67 | 55.14% | 2 |
| gamma | 65.00% | 49 | 91.93% | 67 | 56.07% | 3 |
| weibull_min | 65.04% | 41 | 91.86% | 68 | 55.59% | 3 |
| invgauss | 65.08% | 40 | 92.19% | 64 | 56.17% | 2 |
| rayleigh | 64.98% | 40 | 91.72% | 67 | 56.09% | 3 |
| wald | 65.04% | 40 | 91.91% | 57 | 56.21% | 2 |
| pareto | 64.98% | 40 | 91.99% | 63 | 56.15% | 3 |
| nakagami | 65.05% | 44 | 91.83% | 66 | 56.24% | 3 |
| logistic | 65.01% | 40 | 91.86% | 63 | 56.21% | 2 |
| powerlaw | 64.99% | 39 | 91.83% | 61 | 56.33% | 2 |
| skewnorm | 65.07% | 41 | 91.77% | 66 | 56.23% | 2 |
| Baseline Manhattan | ||||||
| KNN | 65.00% | 39 | 91.90% | 63 | 56.10% | 2 |
| LOF | 47.80% | 2 | 82.30% | 69 | 67.00% | 5 |
| SimplifiedLOF | 64.03% | 70 | 91.04% | 70 | 56.68% | 3 |
| LoOP | 47.21% | 3 | 77.32% | 70 | 68.35% | 14 |
| LDOF | 50.00% | 2 | 70.70% | 69 | 69.84% | 16 |
| ODIN | 65.05% | 40 | 91.91% | 64 | 56.18% | 2 |
| FastABOD | 54.71% | 70 | 76.48% | 69 | 85.03% | 29 |
| KDEOS | 50.00% | 2 | 78.51% | 70 | 70.95% | 62 |
| LDF | 64.03% | 70 | 91.04% | 70 | 56.68% | 3 |
| INFLO | 50.69% | 2 | 73.68% | 70 | 70.21% | 6 |
| COF | 48.71% | 3 | 63.50% | 70 | 59.82% | 2 |
| Baseline Eucli. | ||||||
| KNN | 57.35% | 63 | 90.11% | 15 | 55.20% | 1 |
| LOF | 47.38% | 2 | 83.32% | 100 | 63.09% | 6 |
| SimplifiedLOF | 50.12% | 2 | 74.35% | 100 | 67.68% | 7 |
| LoOP | 49.66% | 2 | 75.28% | 100 | 67.92% | 10 |
| LDOF | 47.96% | 5 | 75.26% | 100 | 71.22% | 13 |
| ODIN | 51.91% | 47 | 75.34% | 100 | 67.46% | 10 |
| FastABOD | 43.72% | 3 | 76.22% | 97 | 55.43% | 6 |
| KDEOS | 47.67% | 100 | 69.13% | 99 | 71.32% | 33 |
| LDF | 53.64% | 100 | 89.55% | 100 | 61.27% | 4 |
| INFLO | 47.38% | 3 | 78.92% | 100 | 63.21% | 7 |
| COF | 49.95% | 2 | 81.87% | 100 | 64.83% | 9 |
- Liu, F.T.; Ting, K.M.; Zhou, Z.H. Isolation Forest. In Proceedings of the ICDM, Pisa, Italy, 15–19 December 2008; pp. 413–422. [Google Scholar]
- Schölkopf, B.; Platt, J.; Shawe-Taylor, J.; Smola, A.J.; Williamson, R.C. Estimating the Support of a High-Dimensional Distribution. Neural Comput. 2001, 13, 1443–1471. [Google Scholar] [CrossRef]
- Ruff, L.; Vandermeulen, R.; Goernitz, N.; Deecke, L.; Siddiqui, S.A.; Binder, A.; Müller, E.; Kloft, M. Deep One-Class Classification. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), Stockholm, Sweden, 10–15 July 2018; 80, pp. 4393–4402. [Google Scholar]
- Zong, B.; Song, Q.; Min, M.R.; Cheng, W.; Lumezanu, C.; Cho, D.; Chen, H. Deep Autoencoding Gaussian Mixture Model for Unsupervised Anomaly Detection. In Proceedings of the ICLR, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
- Liu, J.; Ma, Z.; Wang, Z.; Liu, Y.; Wang, Z.; Sun, P.; Song, L.; Hu, B.; Boukerche, A.; Leung, V.C.M. A Survey on Diffusion Models for Anomaly Detection. arXiv 2025, arXiv:2501.11430. [Google Scholar] [CrossRef]
- Radovanović, M.; Nanopoulos, A.; Ivanović, M. Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data. J. Mach. Learn. Res. 2010, 11, 2487–2531. [Google Scholar]
- Kotz, S.; Kozubowski, T.; Podgórski, K. The Laplace Distribution and Generalizations; Birkhäuser: Boston, MA, USA, 2001. [Google Scholar]
- Johnson, N.L.; Kotz, S.; Balakrishnan, N. Continuous Univariate Distributions, Volume I & II; Wiley: New York, NY, USA, 1994. [Google Scholar]
- David, H.A.; Nagaraja, H.N. Order Statistics; Wiley: Hoboken, NJ, USA, 2003. [Google Scholar]
- Biau, G.; Devroye, L. Lectures on the Nearest Neighbor Method; Springer: Berlin/Heidelberg, Germany, 2015. [Google Scholar]
- Titterington, D.; Smith, A.F.M.; Makov, U. Statistical Analysis of Finite Mixture Distributions; Wiley: New York, NY, USA, 1985. [Google Scholar]
- Rosenblatt, M. Remarks on a Multivariate Transformation. Ann. Math. Stat. 1952, 23, 470–472. [Google Scholar] [CrossRef]
- Swets, J.A. Measuring the accuracy of diagnostic systems. Science 1988, 240, 1285–1293. [Google Scholar] [CrossRef]
- Hajian-Tilaki, K. Receiver Operating Characteristic (ROC) curve analysis for medical diagnostic test evaluation. Casp. J. Intern. Med. 2013, 4, 627–635. [Google Scholar]
- Bagdonavičius, V.; Petkevičius, L. Multiple Outlier Detection Tests for Parametric Models. Mathematics 2020, 8, 2156. [Google Scholar] [CrossRef]
- Azzalini, A. A Class of Distributions Which Includes the Normal Ones. Scand. J. Stat. 1985, 12, 171–178. [Google Scholar]
- DAMI: Outlier Evaluation Benchmark. Available online: https://www.dbs.ifi.lmu.de/research/outlier-evaluation/DAMI/ (accessed on 16 February 2025).
- Hodge-py Project. Outlier Detection: Literature Resources. Available online: https://github.com/hodge-py/Outlier-Detection/tree/Final/literature (accessed on 16 February 2025).
- Hodge-py Project. Outlier Detection: Semantic Resources. Available online: https://github.com/hodge-py/Outlier-Detection/tree/Final/semantic (accessed on 16 February 2025).




| Parameter | Train | Test Inlier | Test Outlier |
|---|---|---|---|
| (a) | | | |
| Shape | 2.0 | 2.0 | 5.0 |
| Scale | 2.0 | 2.0 | 2.0 |
| (b) | | | |
| Shape | 2.0 | 2.0 | 5.0 |
| Scale | 2.0 | 2.0 | 2.0 |
| (c) | | | |
| Shape (a) | 4.0 | 4.0 | −4.0 |
| Location | 0.0 | 0.0 | 2.0 |
| Scale | 1.0 | 1.0 | 1.0 |

| Method | Mean | Std | Min | 25% | 50% | 75% | Max |
|---|---|---|---|---|---|---|---|
| (a) | | | | | | | |
| KNN | 84.81% | 2.22% | 76.58% | 83.42% | 84.90% | 86.39% | 91.38% |
| LOF | 62.67% | 6.00% | 46.90% | 58.73% | 62.45% | 67.06% | 77.70% |
| ABOD | 80.66% | 2.48% | 72.47% | 79.06% | 80.66% | 82.48% | 87.09% |
| COF | 50.39% | 2.59% | 43.55% | 48.66% | 50.43% | 52.01% | 57.70% |
| CDF | 89.07% | 1.59% | 83.60% | 87.95% | 89.17% | 90.22% | 94.03% |
| (b) | | | | | | | |
| KNN | 59.06% | 2.89% | 50.46% | 57.09% | 59.09% | 61.12% | 66.87% |
| LOF | 54.50% | 3.05% | 46.04% | 52.50% | 54.54% | 56.49% | 62.83% |
| ABOD | 58.61% | 2.95% | 50.48% | 56.72% | 58.61% | 60.58% | 66.79% |
| COF | 50.86% | 2.83% | 42.65% | 48.97% | 50.97% | 52.77% | 58.46% |
| CDF | 59.30% | 2.90% | 51.18% | 57.49% | 59.34% | 61.21% | 68.10% |
| (c) | | | | | | | |
| KNN | 65.07% | 3.35% | 53.76% | 63.01% | 65.26% | 67.33% | 74.90% |
| LOF | 50.84% | 4.35% | 38.28% | 47.64% | 50.94% | 53.73% | 63.98% |
| ABOD | 61.01% | 3.26% | 48.70% | 58.91% | 61.15% | 63.35% | 71.20% |
| COF | 50.17% | 2.74% | 40.60% | 48.24% | 50.24% | 51.99% | 56.75% |
| CDF | 71.81% | 2.67% | 62.95% | 70.02% | 71.99% | 73.59% | 79.12% |

| Name | Type | Instances | Outliers | Attributes |
|---|---|---|---|---|
| ALOI | Literature | 50,000 | 1508 | 27 |
| Glass | Literature | 214 | 9 | 7 |
| Ionosphere | Literature | 351 | 126 | 32 |
| KDDCup99 | Literature | 60,632 | 246 | 38 + 3 |
| Lymphography | Literature | 148 | 6 | 3 + 16 |
| PenDigits | Literature | 9868 | 20 | 16 |
| Shuttle | Literature | 1013 | 13 | 9 |
| Waveform | Literature | 3443 | 100 | 21 |
| WBC | Literature | 454 | 10 | 9 |
| WDBC | Literature | 367 | 10 | 30 |
| WPBC | Literature | 198 | 47 | 33 |
| Annthyroid | Semantic | 7200 | 534 | 21 |
| Arrhythmia | Semantic | 450 | 206 | 259 |
| Cardiotocography | Semantic | 2126 | 471 | 21 |
| HeartDisease | Semantic | 270 | 120 | 13 |
| Hepatitis | Semantic | 80 | 13 | 19 |
| InternetAds | Semantic | 3264 | 454 | 1555 |
| PageBlocks | Semantic | 5473 | 560 | 10 |
| Parkinson | Semantic | 195 | 147 | 22 |
| Pima | Semantic | 768 | 268 | 8 |
| SpamBase | Semantic | 4601 | 1813 | 57 |
| Stamps | Semantic | 340 | 31 | 9 |
| Wilt | Semantic | 4839 | 261 | 5 |

| Method | Literature Avg. | Semantic Avg. |
|---|---|---|
| Log Transform Models | ||
| norm | 87.34% | 72.34% |
| t | 87.52% | 72.33% |
| laplace | 87.48% | 72.35% |
| logistic | 87.35% | 72.33% |
| skewnorm | 87.55% | 72.38% |
| No Transform Models | ||
| expon | 87.10% | 71.86% |
| chi2 | 87.52% | 72.26% |
| gamma | 87.29% | 72.30% |
| weibull_min | 87.40% | 72.18% |
| invgauss | 87.56% | 72.38% |
| rayleigh | 87.13% | 72.29% |
| wald | 87.37% | 72.32% |
| pareto | 87.47% | 72.32% |
| nakagami | 87.45% | 72.32% |
| logistic | 86.52% | 72.23% |
| powerlaw | 87.35% | 72.27% |
| skewnorm | 87.25% | 72.30% |
| Baseline—Manhattan Distance | ||
| KNN | 87.53% | 72.33% |
| LOF | 84.75% | 69.16% |
| SimplifiedLOF | 86.76% | 71.34% |
| LoOP | 83.41% | 65.76% |
| LDOF | 79.66% | 65.37% |
| ODIN | 87.62% | 72.37% |
| FastABOD | 64.55% | 62.35% |
| KDEOS | 75.37% | 62.06% |
| LDF | 86.76% | 71.34% |
| INFLO | 83.07% | 65.64% |
| COF | 81.51% | 64.34% |
| Baseline—Euclidean Distance | ||
| KNN | 87.66% | 70.93% |
| LOF | 85.61% | 69.31% |
| SimplifiedLOF | 83.44% | 66.72% |
| LoOP | 83.27% | 65.92% |
| LDOF | 83.02% | 65.85% |
| ODIN | 82.64% | 65.35% |
| FastABOD | 87.65% | 67.00% |
| KDEOS | 73.11% | 62.10% |
| LDF | 86.29% | 70.83% |
| INFLO | 83.16% | 64.59% |
| COF | 84.02% | 68.54% |
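
The model rows in the table above carry the names of common parametric families (norm, laplace, gamma, and so on). As a minimal sketch of how one such row could be scored, assuming the "norm" entry under Log Transform Models means a normal distribution fitted to log-transformed k-nearest-neighbor distances, the following uses only the Python standard library. The function names, the synthetic data, and the choice k = 5 are illustrative assumptions, not the authors' implementation.

```python
import math
import random
from statistics import NormalDist, fmean, stdev


def kth_neighbor_distances(points, k):
    """Euclidean distance from each point to its k-th nearest neighbor (self excluded).

    Brute force, quadratic in the number of points; for illustration only.
    """
    out = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        out.append(dists[k - 1])
    return out


def log_normal_cdf_scores(distances):
    """Fit a normal to the log-distances and return the fitted CDF at each one.

    Assumes every distance is strictly positive (no duplicate points).
    """
    logs = [math.log(d) for d in distances]
    model = NormalDist(mu=fmean(logs), sigma=stdev(logs))
    return [model.cdf(v) for v in logs]


# Hypothetical data: 200 Gaussian inliers plus two far-away points.
random.seed(0)
inliers = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(200)]
outliers = [(8.0, 8.0), (-9.0, 7.0)]
data = inliers + outliers

distances = kth_neighbor_distances(data, k=5)
scores = log_normal_cdf_scores(distances)

# Scores lie in [0, 1]; values near 1 mark points in the upper tail of the fit.
for point, score in zip(outliers, scores[-2:]):
    print(point, round(score, 6))
```

Since each score is a fitted CDF value, a cutoff such as 0.99 can be read as an approximate tail level under the fitted model. Swapping in another family from the table would only change the fitting step.
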

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Zhou, J.; Hodge, K.; Dong, W.; Tamakloe, E. Distribution-Aware Outlier Detection in High Dimensions: A Scalable Parametric Approach. Mathematics 2026, 14, 77. https://doi.org/10.3390/math14010077


