Article

Less Is More: Principled Diversity in Heterogeneous Anomaly Detection Ensembles

Faculty of Electrical Engineering, Computer Science and Information Technology Osijek, Josip Juraj Strossmayer University of Osijek, 31000 Osijek, Croatia
*
Author to whom correspondence should be addressed.
AI 2026, 7(6), 214; https://doi.org/10.3390/ai7060214
Submission received: 13 May 2026 / Revised: 8 June 2026 / Accepted: 9 June 2026 / Published: 11 June 2026

Abstract

Heterogeneous anomaly detection ensembles improve robustness by combining complementary detectors, yet existing approaches often rely on heuristic detector selection, fixed contamination assumptions, and equal weighting. We investigate whether compact ensembles of complementary detectors can outperform substantially larger heterogeneous configurations through diversity-aware weighting and adaptive contamination estimation. Experiments on 22 benchmark datasets show that a compact ensemble of four complementary classical detectors outperforms an eleven-detector ensemble containing deep learning components, while requiring only 13.8% of the computational cost. Across the benchmark, the proposed ensemble variants achieve strong rankings while remaining competitive with the strongest individual detectors (Friedman $\chi^2 = 71.58$, $p < 0.001$). These findings suggest that detector diversity, rather than ensemble size or architectural complexity, is the primary driver of robust unsupervised anomaly detection performance in resource-constrained environments.

1. Introduction

Anomaly detection in tabular data is a fundamental problem in domains including fraud detection [1], industrial monitoring [2], and medical screening [3]. In these settings, anomalies are rare, labels are unavailable, and the characteristics of anomalous samples vary substantially across datasets, making unsupervised detection the only practical option.
Ensemble methods that combine multiple detectors have consistently demonstrated superior robustness over individual algorithms [4,5], as no single detector achieves universal superiority across the heterogeneous anomaly types encountered in practice [6]. However, existing heterogeneous ensembles face three unresolved challenges. First, detector selection is often heuristic and does not explicitly account for redundancy between detectors [4]. Second, many methods rely on fixed contamination assumptions despite large variation in anomaly rates across datasets [7]. Third, unsupervised weighting schemes either treat detectors equally or adapt weights through mechanisms that reward conformity to the majority, progressively homogenising the ensemble [8].
We investigate how explicit diversity-aware weighting and adaptive contamination estimation affect heterogeneous anomaly detection ensembles. The framework combines eleven detectors spanning multiple anomaly detection paradigms, estimates contamination adaptively using a Gaussian Mixture Model (GMM) [9], and applies a Negative Correlation Learning (NCL)-inspired [10] weighting scheme that penalises correlated detectors. We evaluate the full framework across 22 ODDS benchmark datasets [11] with Friedman and Nemenyi statistical testing [12]. The experiments show that compact ensembles can outperform substantially larger configurations at a fraction of the computational cost, challenging the assumption that larger heterogeneous ensembles are necessarily better.
The scientific contributions of this paper are as follows:
  • An NCL-inspired unsupervised weighting scheme that explicitly penalises correlated detectors based on inter-detector score correlation, extending the diversity-penalisation principle of Negative Correlation Learning to the unsupervised anomaly detection setting.
  • A GMM-based adaptive contamination estimator enabling automatic threshold calibration across datasets with anomaly rates ranging from 1.2% to 35.9%, replacing the fixed assumptions used by existing ensemble methods.
  • A combinatorial search across all 2036 detector combinations ($\sum_{n=2}^{11} \binom{11}{n} = 2036$) identifying that compact ensembles of three to five complementary detectors match or exceed the full eleven-detector ensemble at as little as 8.7% of computational cost.
  • Rigorous statistical evaluation across 22 benchmark datasets using Friedman and Nemenyi testing, confirming significant improvements over six individual detectors ($\chi^2 = 71.58$, $p < 0.001$).
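The size of the search space quoted in the third contribution can be checked with a few lines of Python. This is only an arithmetic sketch; the constant name `POOL_SIZE` is introduced here for illustration.

```python
from math import comb

# Number of detectors in the full pool (eleven, as stated in the text).
POOL_SIZE = 11

# Count every subset that contains at least two detectors.
subset_counts = {n: comb(POOL_SIZE, n) for n in range(2, POOL_SIZE + 1)}
total_subsets = sum(subset_counts.values())

# Cross-check: all 2**11 subsets minus the empty set and the 11 singletons.
assert total_subsets == 2**POOL_SIZE - 1 - POOL_SIZE
print(total_subsets)  # 2036
```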

2. Related Work

Classical anomaly detection includes a broad range of statistical, distance-based, density-based, and isolation-based approaches. Distance-based methods such as k-Nearest Neighbour (KNN) [13] and Local Outlier Factor (LOF) [14] identify anomalies through neighbour distance and local density deviation, respectively. Isolation Forest [15] instead exploits the relative ease of isolating anomalies through random recursive partitioning at $O(n \log n)$ complexity. Statistical approaches include Histogram-Based Outlier Score (HBOS) [16], Minimum Covariance Determinant (MCD) [17], and Copula-Based Outlier Detector (COPOD) [18], which model anomaly structure through density estimation, robust covariance estimation, and empirical copulas. Isolation-Based Nearest-Neighbour Ensembles (INNEs) [19] combine neighbourhood structure with isolation principles to measure local anomalousness. Deep learning approaches such as Variational Autoencoders (VAEs) [20], Deep Support Vector Data Description (Deep SVDD) [21], and Locally Unified Neighbourhood Anomaly Ranking (LUNAR) [22] learn latent representations or graph-aware neighbourhood structure for anomaly scoring. These methods differ substantially in their inductive biases, motivating heterogeneous ensembles that combine complementary anomaly assumptions. Comprehensive benchmarks [6] further demonstrate that no single detector consistently dominates across heterogeneous tabular datasets.
Ensemble methods improve robustness by aggregating complementary detectors. Existing approaches include Feature Bagging [23], which constructs diverse detectors from random feature subspaces, LSCP [24], which dynamically selects 3–5 locally competent detectors per test instance from a larger candidate pool, and SUOD [25], which accelerates large-scale ensemble training through approximation and parallelisation but reports diminishing returns beyond 6–8 detectors on most tabular datasets. However, unsupervised ensembles remain challenging because detector outputs are heterogeneous and diversity is difficult to enforce explicitly [4]. More recent work reflects the same tension between ensemble size and practical cost. SEAD-PL [26] draws from a large candidate pool but applies clustering-based pruning to retain only the most accurate and diverse subset, and RHAD [27] constructs a six-detector heterogeneous ensemble for industrial control systems yet finds that its RL-based scheduler concentrates influence on 3–4 detectors per detection window. ADBench [6] benchmarks 14 standalone detectors across 57 datasets but does not directly examine ensemble size effects. A consistent pattern across this body of work is that ensembles of 3–8 paradigmatically distinct detectors capture the bulk of achievable performance gains, with larger configurations offering marginal improvement at substantially increased computational cost. Diversity-aware ensemble principles have also found application in related domains such as fault diagnosis [28]. More recently, agentic and LLM-based frameworks have been explored for tabular anomaly detection [29], though their computational requirements remain prohibitive for resource-constrained deployments.
Adaptive weighting has received comparatively less attention in unsupervised anomaly detection. Cabrera-Bean et al. [8] proposed an Expectation–Maximisation (EM) framework that jointly infers pseudo-labels and detector weights from agreement patterns among correlated detectors. Negative Correlation Learning (NCL) [10] instead promotes ensemble diversity by penalising correlated learners, establishing that effective ensembles should balance individual detector quality against inter-detector redundancy. Although originally developed for supervised learning, the same principle is relevant in unsupervised anomaly detection, where detector diversity is often assumed rather than explicitly enforced.
Most anomaly detectors additionally require a contamination parameter specifying the expected anomaly fraction. Because anomaly prevalence varies substantially across datasets, ranging from 1.2% to 35.9% in the ODDS benchmark [7], fixed contamination assumptions are frequently unreliable. Prior work has therefore explored score distribution-based approaches [30] that estimate the boundary between normal and anomalous score populations directly from detector outputs, including probabilistic mixture-based formulations such as Gaussian Mixture Models (GMMs) [9].

3. Materials and Methods

3.1. Ensemble Framework

Let $X = \{x_i\}_{i=1}^{n} \subset \mathbb{R}^d$ denote an unlabelled dataset. The goal of unsupervised anomaly detection is to assign each sample $x_i$ an anomaly score $s_i$ and a binary prediction $\hat{y}_i \in \{0, 1\}$ without access to ground-truth labels.
The proposed heterogeneous ensemble comprises eleven detectors spanning seven anomaly detection paradigms: isolation-based (Isolation Forest, INNE), density/distance-based (LOF, KNN), statistical (HBOS, MCD), probabilistic (COPOD), subspace (PCA), deep learning (VAE, DeepSVDD), and graph-based (LUNAR). Detector scores are min-max normalised to $\tilde{s}_{mi} \in [0, 1]$.
To estimate the contamination rate adaptively, we fit a two-component Gaussian Mixture Model (GMM) to the mean ensemble score
$$\bar{s}_i = \frac{1}{M} \sum_{m=1}^{M} \tilde{s}_{mi}, \tag{1}$$
using
$$p(\bar{s}) = \pi_0 \, \mathcal{N}(\mu_0, \sigma_0^2) + \pi_1 \, \mathcal{N}(\mu_1, \sigma_1^2), \tag{2}$$
where $\mu_1 > \mu_0$ identifies the anomaly component. The contamination estimate is
$$\hat{\rho} = \operatorname{clip}(\pi_1, 0.01, 0.40). \tag{3}$$
Each detector thresholds its scores at the $(1 - \hat{\rho})$ quantile.
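The contamination pipeline described above (min-max normalisation, mean ensemble score, two-component mixture, clipping, quantile threshold) can be sketched in plain Python. This is a minimal illustration rather than the authors' implementation: the EM routine, its initialisation at the score extremes, the simple empirical quantile, and the synthetic three-detector data are all assumptions made here for the example.

```python
import math
import random


def min_max(scores):
    """Rescale one detector's raw scores to [0, 1]."""
    lo, hi = min(scores), max(scores)
    span = hi - lo
    return [(s - lo) / span if span > 0 else 0.0 for s in scores]


def normal_pdf(x, mu, var):
    """Density of a univariate Gaussian with mean mu and variance var."""
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_two_component_gmm(values, n_iter=100, var_floor=1e-6):
    """Plain EM for a 1-D two-component Gaussian mixture.

    Returns (pi, mu, var); index 1 is the component with the larger mean.
    """
    n = len(values)
    mean = sum(values) / n
    spread = max(sum((v - mean) ** 2 for v in values) / n, var_floor)
    # Assumed initialisation: means at the extremes, shared variance.
    pi, mu, var = [0.5, 0.5], [min(values), max(values)], [spread, spread]
    for _ in range(n_iter):
        # E-step: responsibility of component 1 for every value.
        resp = []
        for v in values:
            p0 = pi[0] * normal_pdf(v, mu[0], var[0])
            p1 = pi[1] * normal_pdf(v, mu[1], var[1])
            resp.append(p1 / (p0 + p1) if p0 + p1 > 0 else 0.5)
        # M-step: update mixing weights, means and variances.
        for k in (0, 1):
            r = resp if k == 1 else [1.0 - x for x in resp]
            mass = max(sum(r), 1e-12)
            pi[k] = mass / n
            mu[k] = sum(ri * v for ri, v in zip(r, values)) / mass
            var[k] = max(sum(ri * (v - mu[k]) ** 2 for ri, v in zip(r, values)) / mass, var_floor)
    if mu[0] > mu[1]:  # enforce mu_1 > mu_0 so index 1 is the anomaly component
        pi, mu, var = pi[::-1], mu[::-1], var[::-1]
    return pi, mu, var


def estimate_contamination(mean_scores, lower=0.01, upper=0.40):
    """rho_hat = clip(pi_1, 0.01, 0.40)."""
    pi, _, _ = fit_two_component_gmm(mean_scores)
    return min(max(pi[1], lower), upper)


def threshold_at_quantile(scores, rho):
    """Flag scores at or above a simple empirical (1 - rho) quantile."""
    ordered = sorted(scores)
    cut = ordered[min(len(ordered) - 1, int((1.0 - rho) * len(ordered)))]
    return [1 if s >= cut else 0 for s in scores]


# Synthetic demo: three hypothetical detectors, 950 inliers and 50 outliers.
random.seed(0)
labels = [0] * 950 + [1] * 50
raw = [[random.gauss(8.0 if y else 0.0, 1.0) for y in labels] for _ in range(3)]
normalised = [min_max(row) for row in raw]
mean_scores = [sum(col) / len(col) for col in zip(*normalised)]
rho_hat = estimate_contamination(mean_scores)
flags = threshold_at_quantile(normalised[0], rho_hat)
print(round(rho_hat, 3), sum(flags))
```

On well-separated synthetic scores like these, the estimate lands close to the true 5% anomaly rate. When the two score populations overlap heavily, the mixing weight of the upper component can drift upward until it reaches the clip ceiling.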
Negative Correlation Learning (NCL) [10] decomposes the ensemble generalisation error as:
$$E_{\mathrm{ens}} = \sum_{m=1}^{M} E_m - \lambda \sum_{m=1}^{M} \sum_{j \neq m} \operatorname{cov}(e_m, e_j), \tag{4}$$
where $E_m$ is the individual error of detector $m$ and $\operatorname{cov}(e_m, e_j)$ is the covariance between the errors of detectors $m$ and $j$.
In the supervised NCL framework, minimising this objective jointly promotes individual detector competence and mutual complementarity by penalising pairwise error covariance. In the unsupervised setting, ground-truth labels are unavailable, so we substitute two observable proxies. Score variance $\sigma_m^2 = \operatorname{Var}(\tilde{s}_{m\cdot})$ approximates discriminative strength: a detector that spreads normal and anomalous scores apart produces high variance across the dataset. The Pearson correlation $C_{mj}$ between detector score vectors approximates error covariance: two detectors that assign similar scores to every sample share error structure and provide redundant rather than complementary evidence, consistent with the diversity literature [31]. Let $S \in \mathbb{R}^{M \times n}$ denote the normalised score matrix and $C$ the corresponding Pearson correlation matrix. Using these unsupervised proxies, we define the following NCL-inspired weighting formulation:
$$w_m \propto \frac{\sigma_m^2}{1 + \lambda \sum_{j \neq m} w_j C_{mj}}, \tag{5}$$
where $\lambda > 0$ controls the strength of the diversity penalty. Equation (5) is solved iteratively, initialising from variance weights $w_m^{(0)} = \sigma_m^2 / \sum_j \sigma_j^2$ and running for a fixed five iterations with no tolerance threshold. Convergence within five iterations was confirmed empirically across all 22 benchmark datasets. Weights are clipped to a minimum of $10^{-3}$ before renormalisation to prevent numerical degeneracy. Performance metrics remain stable across $\lambda \in [0.1, 5.0]$ (Friedman $\chi^2 = 1.245$, $p = 0.871$ for F1; see Section 4.2), confirming that the ensemble is robust to this hyperparameter. At $\lambda \to 0$, Equation (5) reduces to pure variance weighting; at very large $\lambda$, the penalty term dominates and the formula converges to near-equal weighting among low-correlation detectors. The stable performance across this range reflects the fact that the recommended compact ensemble contains near-orthogonal detectors by construction (IsolationForest, INNE, HBOS, and KNN share no mathematical assumptions), so the correlation term remains small regardless of $\lambda$.
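The fixed-point iteration for the NCL-inspired weights can be sketched as follows. This is an illustrative reading of the description above, not the reference implementation. Several details are assumptions made here: weights are updated simultaneously, negative correlations contribute no penalty (which keeps the denominator positive), and the floor is applied after a first normalisation. The toy score matrix is constructed so that detectors `a` and `b` are near-duplicates while `c` is uncorrelated with `a`.

```python
import math
import statistics


def pearson(x, y):
    """Pearson correlation of two equal-length score vectors (0.0 if constant)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def normalise(weights, floor=0.0):
    """Clip weights from below, then rescale them to sum to one."""
    clipped = [max(w, floor) for w in weights]
    total = sum(clipped)  # assumes at least one weight is positive
    return [w / total for w in clipped]


def ncl_weights(score_matrix, lam=1.0, n_iter=5, floor=1e-3):
    """Iterate w_m proportional to var_m / (1 + lam * sum_{j != m} w_j * C_mj)."""
    m = len(score_matrix)
    variances = [statistics.pvariance(row) for row in score_matrix]
    corr = [[pearson(a, b) for b in score_matrix] for a in score_matrix]
    weights = normalise(variances)  # w^(0): pure variance weights
    for _ in range(n_iter):  # fixed five iterations, no tolerance check
        raw = []
        for i in range(m):
            # Assumption: only positive correlation is penalised.
            penalty = sum(weights[j] * max(corr[i][j], 0.0) for j in range(m) if j != i)
            raw.append(variances[i] / (1.0 + lam * penalty))
        # Assumption: normalise, clip at the floor, then renormalise.
        weights = normalise(normalise(raw), floor)
    return weights


# Toy normalised scores on a 10 x 10 grid: a and c are exactly uncorrelated.
a = [(i % 10) / 9 for i in range(100)]
c = [(i // 10) / 9 for i in range(100)]
b = [(x + 0.01 * y) / 1.01 for x, y in zip(a, c)]  # near-duplicate of a
scores = [a, b, c]

variance_only = normalise([statistics.pvariance(row) for row in scores])
w_no_penalty = ncl_weights(scores, lam=0.0)
w_ncl = ncl_weights(scores, lam=1.0)
print([round(w, 3) for w in w_no_penalty])
print([round(w, 3) for w in w_ncl])
```

With `lam=0.0` the routine returns the variance weights unchanged, and with `lam=1.0` the uncorrelated detector `c` receives the largest weight. This is the qualitative behaviour the text attributes to Equation (5).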
Final predictions are obtained through weighted soft voting:
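$$\hat{s}_i = \sum_{m=1}^{M} w_m \tilde{s}_{mi}, \qquad \hat{y}_i = \mathbb{1}\!\left[\sum_{m=1}^{M} w_m b_{mi} \geq 0.5\right], \tag{6}$$
where $b_{mi} \in \{0, 1\}$ is the binary prediction of detector $m$ for sample $i$.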
Four weighting schemes are evaluated to assess the contribution of diversity-aware weighting: Equal ($w_m = 1/M$); Variance ($w_m \propto \sigma_m^2$); NCL defined in Equation (5); and EM, an Expectation–Maximisation agreement-lift scheme [8] that upweights detectors whose predictions align above chance with the ensemble consensus.
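A short sketch of weighted soft voting shows how the choice of weights changes the fused decision. The function name, the toy scores and votes, and the inclusive comparison against 0.5 are assumptions for illustration only.

```python
def weighted_soft_vote(norm_scores, binary_votes, weights, cut=0.5):
    """Fuse per-detector scores and votes with one weight per detector.

    norm_scores[m][i] is the normalised score of detector m on sample i;
    binary_votes[m][i] is that detector's 0/1 prediction.
    """
    n = len(norm_scores[0])
    fused = [sum(w * row[i] for w, row in zip(weights, norm_scores)) for i in range(n)]
    vote_mass = [sum(w * row[i] for w, row in zip(weights, binary_votes)) for i in range(n)]
    # Assumption: a sample is flagged when the weighted vote mass reaches the cut.
    labels = [1 if v >= cut else 0 for v in vote_mass]
    return fused, labels


# Three hypothetical detectors scored on four samples.
norm_scores = [
    [0.9, 0.8, 0.1, 0.2],
    [0.7, 0.3, 0.2, 0.1],
    [0.4, 0.2, 0.1, 0.3],
]
binary_votes = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

equal = [1 / 3] * 3        # Equal scheme: w_m = 1 / M
skewed = [0.6, 0.2, 0.2]   # e.g. one detector receiving most of the weight

fused_equal, labels_equal = weighted_soft_vote(norm_scores, binary_votes, equal)
fused_skewed, labels_skewed = weighted_soft_vote(norm_scores, binary_votes, skewed)
print(labels_equal)   # [1, 0, 0, 0]
print(labels_skewed)  # [1, 1, 0, 0]
```

In this toy case the second sample is flagged only when the single detector that votes for it carries more than half of the total weight.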

3.2. Experimental Setup

Experiments were conducted on 22 ODDS benchmark datasets on a server running Ubuntu 24.04 (Canonical, London, UK) with 1 TB RAM and two NVIDIA H100 NVL GPUs (96 GB VRAM each; NVIDIA, Santa Clara, CA, USA). Deep detectors were trained with fixed hyperparameters following established practice in the anomaly detection literature [20,21,22]. The VAE uses an encoder–decoder architecture with hidden layer sizes $[\max(16, 2d), \max(16, d)]$ and their mirror as decoder, where $d$ is the input dimensionality, ensuring a minimum layer width of 16 to prevent degenerate networks on low-dimensional data. DeepSVDD uses a three-layer architecture with hidden sizes $[\max(32, d), \max(16, d/2)]$. LUNAR uses its default graph-based architecture. All three deep detectors were trained for 50 epochs using PyTorch 2.10 (PyTorch Foundation, San Francisco, CA, USA) with CUDA 13.0 (NVIDIA, Santa Clara, CA, USA). This fixed-hyperparameter protocol prioritises reproducibility over per-dataset tuning, consistent with the evaluation approach adopted in ADBench [6]. It should be noted that deep detectors are generally more sensitive to hyperparameter choices than classical methods such as IsolationForest or HBOS; the weaker results observed for VAE, DeepSVDD, and LUNAR may therefore partly reflect the absence of per-dataset tuning or early stopping rather than a fundamental limitation of deep anomaly detection on tabular data. Classical detectors ran in parallel on CPU; distance-based detectors (LOF, KNN) used ball-tree indexing and were subsampled to $n \leq 10{,}000$ for scalability [25], with all other detectors capped at $n \leq 50{,}000$. Performance was evaluated using F1, AUROC, and AUPRC, and statistical significance was assessed using the Friedman test with Nemenyi post-hoc analysis [12].
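The layer-width rules and the subsampling caps can be written as small helpers. The function names are hypothetical, the use of integer division for $d/2$ is an assumption, and uniform random subsampling is only one plausible reading of "subsampled".

```python
import random


def vae_encoder_sizes(d):
    """Hidden widths [max(16, 2d), max(16, d)] for input dimensionality d."""
    return [max(16, 2 * d), max(16, d)]


def vae_decoder_sizes(d):
    """The decoder mirrors the encoder."""
    return vae_encoder_sizes(d)[::-1]


def deep_svdd_hidden_sizes(d):
    """Hidden widths [max(32, d), max(16, d / 2)]; integer division is assumed."""
    return [max(32, d), max(16, d // 2)]


def subsample_indices(n, cap, seed=0):
    """Return all indices when n <= cap, else a uniform random subset of size cap."""
    if n <= cap:
        return list(range(n))
    return sorted(random.Random(seed).sample(range(n), cap))


# Smallest and largest dimensionalities in the benchmark (d = 3 and d = 400).
for d in (3, 400):
    print(d, vae_encoder_sizes(d), vae_decoder_sizes(d), deep_svdd_hidden_sizes(d))

# Largest benchmark dataset (n = 49,097) under the two caps from the text.
knn_rows = subsample_indices(49_097, 10_000)    # LOF / KNN cap
other_rows = subsample_indices(49_097, 50_000)  # cap for all other detectors
print(len(knn_rows), len(other_rows))
```

For $d = 3$ the floors dominate and every hidden layer has the minimum width, which is the degenerate case the floors are meant to prevent.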

4. Results

4.1. Benchmark Datasets

Experiments were conducted on 22 benchmark datasets from the ODDS repository [11], covering sample sizes $n = 148$ to $49{,}097$, dimensionalities $d = 3$ to 400, and anomaly rates from 1.2% to 35.9%, as shown in Table 1. The diversity of anomaly rates motivates adaptive contamination estimation.

4.2. Hyperparameter Sensitivity

We evaluated the sensitivity of the NCL weighting mechanism to the diversity penalty coefficient $\lambda$. As shown in Table 2, the performance metrics remain remarkably stable across a wide range of values ($\lambda \in [0.1, 5.0]$). A Friedman test confirmed that the differences across these settings are not statistically significant ($\chi^2 = 1.245$, $p = 0.871$ for F1), indicating that the ensemble is robust to this hyperparameter.

4.3. Overall Comparison

Table 3 reports mean performance across all 22 datasets for 16 methods. The Friedman test confirms significant differences between methods for both F1 ($\chi^2 = 71.58$, $p < 0.001$) and AUPRC ($\chi^2 = 41.36$, $p < 0.001$). All four ensemble variants rank in the top four for F1 and AUROC. On AUPRC, the ensemble with Expectation–Maximisation (EM) weighting achieves 0.4371, closely matching IsolationForest (0.4391), with no significant difference under Nemenyi testing ($p = 1.000$). Deep detectors underperform classical methods across all metrics, with AUPRC values of 0.300–0.326 compared to 0.420–0.439 for top classical detectors, consistent with prior findings that deep anomaly detection is not consistently superior on tabular data [6].
The Critical Difference (CD) diagram in Figure 1 further confirms these findings. At $\alpha = 0.05$ (CD $= 5.11$), all four ensemble variants are connected by a non-significance bar, indicating no statistically significant differences among them. The ensemble group as a whole is not connected to the INNE–MCD–IsolationForest cluster, confirming a consistent ranking advantage over the median individual detector.
Nemenyi post-hoc tests in Table 4 show that ensemble variants significantly outperform weaker baselines such as LOF, DeepSVDD, VAE, PCA, and LUNAR ($p < 0.05$), while no significant difference is observed against the strongest individual detectors (IsolationForest, INNE, MCD).
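For readers unfamiliar with the Friedman test used throughout this section, the statistic can be computed from a datasets-by-methods score table as sketched below. The code follows the rank-based form given by Demšar [12] without a tie correction. The toy table is invented for illustration and does not reproduce any value from Table 3.

```python
def average_ranks(values):
    """Rank values so that the largest gets rank 1; ties share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def friedman_statistic(score_table):
    """Return (chi2, mean_ranks) for score_table[dataset][method].

    chi2 = 12N / (k(k+1)) * (sum_j R_j^2 - k(k+1)^2 / 4), with R_j the mean
    rank of method j over N datasets and k methods (no tie correction).
    """
    n_datasets, k = len(score_table), len(score_table[0])
    rank_rows = [average_ranks(row) for row in score_table]
    mean_ranks = [sum(r[j] for r in rank_rows) / n_datasets for j in range(k)]
    spread = sum(r * r for r in mean_ranks) - k * (k + 1) ** 2 / 4.0
    return 12.0 * n_datasets / (k * (k + 1)) * spread, mean_ranks


# Invented scores: 4 datasets (rows) by 3 methods (columns), higher is better.
toy_table = [
    [0.90, 0.80, 0.70],
    [0.60, 0.50, 0.40],
    [0.80, 0.70, 0.30],
    [0.95, 0.90, 0.85],
]
chi2, mean_ranks = friedman_statistic(toy_table)
print(chi2, mean_ranks)  # 8.0 [1.0, 2.0, 3.0]
```

The statistic is compared against a chi-square distribution with $k - 1$ degrees of freedom. In practice a library routine such as `scipy.stats.friedmanchisquare` returns both the statistic and the p-value.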

4.4. Ablation Studies

Table 5 compares the four weighting strategies. Differences between schemes are small across all metrics (F1 range: 0.0042), suggesting that the ensemble’s performance is driven primarily by detector diversity rather than the choice of weighting. EM achieves the highest AUROC (0.8176), while variance weighting achieves the best F1 (0.3154).
The voting threshold ablation in Figure 2 shows that performance is similarly stable across values of k. Ensemble size analysis in Figure 3 shows that performance saturates after three detectors, with subsequent additions providing only marginal gains, suggesting that complementary coverage is achieved quickly. The diversity ordering in Figure 3 follows a predefined paradigm-coverage sequence, adding one detector per anomaly detection paradigm in order of estimated standalone contribution, consistent with prior work on heterogeneous anomaly detection ensembles [24,25].
Contamination sensitivity analysis shown in Figure 4 confirms that performance is stable under perturbations of $\pm 0.05$ in $\hat{\rho}$ (F1 variation: 0.015), indicating that the GMM estimate does not need to be exact to be effective. A sensitivity analysis on the number of GMM components confirms that two components is the appropriate choice. Increasing to three components provides no improvement in AUROC or AUPRC, while a single-component model, equivalent to a fixed contamination assumption, yields lower F1 due to poor threshold calibration on high-contamination datasets. Two components is therefore the minimum parameterisation that captures the bimodal structure of normal and anomalous score populations.
To further assess GMM behaviour under challenging conditions, Figure 5 shows score distributions and fitted GMM components for the two lowest-contamination datasets in the benchmark. On Satimage-2 (1.2% anomalies), the anomalous score component remains well-separated at high scores despite the GMM overestimating contamination at $\hat{\rho} = 0.220$. On Speech (1.7%), heavy component overlap causes the estimate to reach the clip ceiling ($\hat{\rho} = 0.400$). In both cases, the contamination perturbation analysis in Figure 4 confirms that detection quality remains stable, indicating that threshold miscalibration in overlap-heavy settings does not propagate to catastrophic detection failure.

4.5. Compact Ensemble Analysis

Evaluation of compact detector subsets reveals that ensembles of three to five detectors can match or exceed the full 11-detector system at substantially lower computational cost, as shown in Table 6. The four-detector configuration (IsolationForest, INNE, KNN, HBOS) achieves AUPRC $= 0.450$, exceeding the full ensemble (0.434, +3.7%) at only 13.8% of its training time. These four detectors cover complementary anomaly signals: axis-aligned tree partitioning (IsolationForest), nearest-neighbour isolation (INNE), marginal histogram density (HBOS), and global distance to the k-th neighbour (KNN), with no two sharing a mathematical assumption, minimising redundant error covariance. Adding LUNAR as a fifth detector further improves AUROC to 0.823 at 23.4% of training cost, while the three-detector configuration (IF, KNN, MCD) achieves the highest F1 (0.347) at 17.7% cost, confirming that optimal configuration depends on the target metric. A Friedman test across the five configurations in Table 6 confirms statistically significant differences in F1 ($\chi^2 = 11.31$, $p = 0.023$), driven primarily by the binary threshold interaction between detector combination and the GMM contamination estimate. Differences in AUPRC are not statistically significant ($\chi^2 = 2.98$, $p = 0.562$), indicating that all compact configurations produce equivalent score quality to the full ensemble. Compact ensemble selection should therefore be guided by the target metric: IF+INNE+KNN+HBOS for score-based ranking (AUPRC), and IF+KNN+MCD for binary classification (F1).
Table 7 reports the computational complexity of each individual detector averaged across benchmark datasets. VAE dominates training time at $23.99 \pm 24.56$ s, while COPOD and PCA are the fastest classical detectors (<0.06 s fit time). Inference time is negligible for all detectors (<0.35 ms per 1000 samples), confirming that deployment cost is dominated by training rather than prediction.
The performance–cost trade-off is illustrated in Figure 6.
Table 8 reports the best-performing combination per ensemble size across all $\sum_{n=2}^{11} \binom{11}{n} = 2036$ evaluated configurations. AUPRC peaks at $n = 2$ (IF+PCA, 0.456) but this configuration achieves the lowest F1 (0.297) and AUROC (0.798) across all size classes, reflecting limited robustness of two-detector ensembles for binary classification. Performance stabilises from $n = 3$ onward across all three metrics, with $n = 3$–5 configurations achieving the strongest F1 and AUROC while remaining within 0.01 AUPRC of the two-detector peak. At 8.7% of full ensemble training time for $n = 3$ and 13.8% for $n = 4$, compact configurations offer substantial computational savings with no meaningful loss in detection quality. Friedman testing confirms statistically significant differences across sizes for both AUPRC ($\chi^2 = 19.46$, $p = 0.022$) and F1 ($\chi^2 = 26.35$, $p = 0.002$). IsolationForest appears in the best combination at every size class, confirming its role as the strongest individual detector and a reliable ensemble anchor.
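The exhaustive search over detector subsets can be sketched with `itertools.combinations`. The paradigm labels follow Section 3.1, but `toy_objective` is a placeholder invented for this example (paradigms covered minus a small size penalty) and stands in for whatever benchmark metric is actually being optimised.

```python
from itertools import combinations

# Paradigm of each detector, following the grouping given in Section 3.1.
PARADIGM = {
    "IsolationForest": "isolation", "INNE": "isolation",
    "LOF": "density/distance", "KNN": "density/distance",
    "HBOS": "statistical", "MCD": "statistical",
    "COPOD": "probabilistic", "PCA": "subspace",
    "VAE": "deep", "DeepSVDD": "deep", "LUNAR": "graph",
}


def best_subset_per_size(detectors, evaluate, min_size=2):
    """Exhaustively score every subset and keep the best one for each size."""
    best, n_evaluated = {}, 0
    for size in range(min_size, len(detectors) + 1):
        for subset in combinations(detectors, size):
            value = evaluate(subset)
            n_evaluated += 1
            if size not in best or value > best[size][1]:
                best[size] = (subset, value)
    return best, n_evaluated


def toy_objective(subset):
    """Placeholder score: paradigms covered minus a size penalty (not a real metric)."""
    return len({PARADIGM[name] for name in subset}) - 0.1 * len(subset)


best, n_evaluated = best_subset_per_size(sorted(PARADIGM), toy_objective)
print(n_evaluated)  # 2036
for size, (subset, value) in best.items():
    print(size, round(value, 2), subset)
```

Enumeration is cheap for eleven detectors, but the number of subsets doubles with every detector added to the pool, which is why exhaustive search is tractable only for small pools.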

5. Discussion

The main result of this study is that ensemble performance is primarily determined by detector diversity rather than ensemble size. Compact ensembles of three to five paradigmatically complementary detectors consistently outperform the full eleven-detector ensemble, consistent with Negative Correlation Learning theory [10], which links ensemble quality to error covariance rather than component count. The strongest compact configurations share a common principle: each detector captures a distinct anomaly signal with no shared mathematical assumptions, resulting in low inter-detector correlation by construction. This suggests that, on the tabular benchmarks evaluated, heterogeneity is most beneficial when it introduces genuinely complementary inductive biases rather than additional architectural complexity, though this finding may not generalise beyond the ODDS benchmark collection.
Deep learning detectors (VAE, DeepSVDD, LUNAR) consistently underperform classical methods on tabular data, aligning with recent benchmark findings [6]. Reconstruction- and hypersphere-based objectives are less effective on structured tabular distributions, and in the ensemble these detectors are automatically downweighted due to weak and correlated signals without meaningfully improving overall performance.
The GMM-based contamination estimator proves robust across the benchmark collection, with performance remaining stable under perturbations of $\pm 0.05$ in $\hat{\rho}$. Its bimodal assumption is sufficient for the ODDS benchmarks, though heavily overlapping score distributions may warrant a more expressive model. Across weighting schemes, differences in binary performance are small. EM-based weighting yields marginally better calibrated continuous scores, while NCL weighting reduces to variance-like behaviour in low-correlation settings and is expected to be more discriminative in higher-redundancy ensembles.
The negligible differences among weighting schemes in Table 5 (F1 range: 0.004) suggest that ensemble performance on the ODDS benchmark is driven primarily by detector diversity rather than the choice of weighting strategy. This outcome is explained by the moderate and relatively uniform inter-detector correlations observed across the benchmark, which limit the diversity penalty in Equation (5) and cause NCL to converge toward variance weighting. In this regime, equal weighting is a reasonable default and is recommended for practitioners deploying compact ensembles of paradigmatically orthogonal detectors. NCL weighting is expected to provide meaningful differentiation when the ensemble contains detectors with high intra-paradigm redundancy, for example, when multiple reconstruction-based or density-based detectors are included. In such settings, equal weighting assigns equal influence to redundant detectors, diluting the ensemble signal, while NCL explicitly penalises this redundancy. The value of the NCL formulation therefore lies not in average-case performance on diverse benchmarks, but in its principled behaviour under redundancy, a condition practitioners are likely to encounter when assembling detectors from a larger heterogeneous candidate pool.
The main limitations of this work are the restriction to tabular datasets, the tractability of an exhaustive subset search only for small detector pools, and potential GMM instability under heavy inlier-outlier score overlap. Scalable subset selection and high-dimensional covariance estimation remain open problems.

6. Conclusions

We proposed a heterogeneous ensemble framework for unsupervised anomaly detection combining adaptive GMM-based contamination estimation, diversity-aware weighting, and weighted soft voting. Across 22 benchmark datasets, the proposed ensemble variants achieve consistently strong rankings, significantly outperform several weaker baselines, and remain competitive with the strongest individual detectors ($\chi^2 = 71.58$, $p < 0.001$). The central finding is that, across the evaluated benchmark collection, detector diversity is a stronger driver of ensemble performance than ensemble size: compact ensembles of three to five detectors match or exceed the full eleven-detector configuration at as little as 8.7% of computational cost. This conclusion is specific to unsupervised tabular anomaly detection on the ODDS benchmarks and should be interpreted with appropriate caution in other settings. Future work will extend the framework to time-series and graph-structured data and investigate scalable detector selection strategies and improved contamination estimation under multimodal score distributions.

Author Contributions

Conceptualization, T.K. and D.Š.; methodology, T.K.; software, T.K.; validation, T.K., D.Š., and M.K.; formal analysis, T.K.; investigation, T.K.; writing—original draft preparation, T.K.; writing—review and editing, D.Š., M.K., and I.L.; supervision, I.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded and is a part of research for the activities to be carried out for the project “Researching advanced algorithms and innovative business intelligence solutions in the cloud—NPOO.C3.2.R3-I1.04.0128”.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The source code and experimental framework supporting the findings of this study are openly available at https://github.com/redline-tk/ensemble-anomaly-detection (accessed on 12 May 2025). Benchmark datasets are automatically downloaded from the Outlier Detection DataSets (ODDS) repository [11].

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AD: Anomaly Detection
AUPRC: Area Under the Precision-Recall Curve
AUROC: Area Under the Receiver Operating Characteristic Curve
CD: Critical Difference
COPOD: Copula-Based Outlier Detector
EM: Expectation–Maximisation
GMM: Gaussian Mixture Model
HBOS: Histogram-Based Outlier Score
INNE: Isolation-Based Nearest-Neighbour Ensemble
KNN: k-Nearest Neighbour
LOF: Local Outlier Factor
LSCP: Locally Selective Combination in Parallel Outlier Ensembles
LUNAR: Locally Unified Neighbourhood Anomaly Ranking
MCD: Minimum Covariance Determinant
NCL: Negative Correlation Learning
ODDS: Outlier Detection DataSets
PCA: Principal Component Analysis
SUOD: Scalable Unsupervised Outlier Detection
SVD: Singular Value Decomposition
SVDD: Support Vector Data Description
VAE: Variational Autoencoder

References

  1. Pang, G.; Shen, C.; Cao, L.; van den Hengel, A. Deep Learning for Anomaly Detection: A Review. ACM Comput. Surv. 2021, 54, 1–38. [Google Scholar] [CrossRef]
  2. Chandola, V.; Banerjee, A.; Kumar, V. Anomaly Detection: A Survey. ACM Comput. Surv. 2009, 41, 1–58. [Google Scholar] [CrossRef]
  3. Fernando, T.; Gammulle, H.; Denman, S.; Sridharan, S.; Fookes, C. Deep Learning for Medical Anomaly Detection—A Survey. ACM Comput. Surv. 2021, 54, 1–37. [Google Scholar] [CrossRef]
  4. Zimek, A.; Campello, R.J.G.B.; Sander, J. Ensembles for Unsupervised Outlier Detection: Challenges and Research Questions. ACM SIGKDD Explor. Newsl. 2014, 15, 11–22. [Google Scholar] [CrossRef]
  5. Aggarwal, C.C. Outlier Analysis, 2nd ed.; Springer: Cham, Switzerland, 2017. [Google Scholar] [CrossRef]
  6. Han, S.; Hu, X.; Huang, H.; Jiang, M.; Zhao, Y. ADBench: Anomaly Detection Benchmark. In Proceedings of the 36th International Conference on Neural Information Processing Systems; Oh, A., Agarwal, S., Belmont, D., Eisenstein, J., Eds.; Curran Associates, Inc.: New Orleans, LA, USA, 2022; Volume 35, pp. 32142–32159. Available online: https://proceedings.neurips.cc/paper_files/paper/2022/file/cf93972b116ca5268827d575f2cc226b-Paper-Datasets_and_Benchmarks.pdf (accessed on 10 May 2026).
  7. Campos, G.O.; Zimek, A.; Sander, J.; Campello, R.J.G.B.; Micenková, B.; Schubert, E.; Assent, I.; Houle, M.E. On the Evaluation of Unsupervised Outlier Detection: Measures, Datasets, and an Empirical Study. Data Min. Knowl. Discov. 2016, 30, 891–927. [Google Scholar] [CrossRef]
  8. Cabrera-Bean, M.; Lázaro-Gredilla, M.; Van Vaerenbergh, S. Unsupervised Ensemble Classification with Correlated Agents. In Proceedings of the 2018 IEEE Statistical Signal Processing Workshop (SSP); IEEE: New York, NY, USA, 2018; pp. 528–532. [Google Scholar] [CrossRef]
  9. Reynolds, D.A. Gaussian Mixture Models. In Encyclopedia of Biometrics; Li, S.Z., Jain, A., Eds.; Springer: Boston, MA, USA, 2009; pp. 659–663. [Google Scholar] [CrossRef]
  10. Liu, Y.; Yao, X. Ensemble Learning via Negative Correlation. Neural Netw. 1999, 12, 1399–1404. [Google Scholar] [CrossRef] [PubMed]
  11. Rayana, S. ODDS Library; Stony Brook University, Department of Computer Science: Stony Brook, NY, USA, 2016; Available online: https://shebuti.com/outlier-detection-datasets-odds/ (accessed on 10 May 2026).
  12. Demšar, J. Statistical Comparisons of Classifiers over Multiple Data Sets. J. Mach. Learn. Res. 2006, 7, 1–30. Available online: https://jmlr.org/papers/v7/demsar06a.html (accessed on 5 May 2026).
  13. Ramaswamy, S.; Rastogi, R.; Shim, K. Efficient Algorithms for Mining Outliers from Large Data Sets. In Proceedings of the ACM SIGMOD International Conference on Management of Data; ACM: New York, NY, USA, 2000; pp. 427–438.
  14. Breunig, M.M.; Kriegel, H.P.; Ng, R.T.; Sander, J. LOF: Identifying Density-Based Local Outliers. SIGMOD Rec. 2000, 29, 93–104.
  15. Liu, F.T.; Ting, K.M.; Zhou, Z.H. Isolation Forest. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining; IEEE: New York, NY, USA, 2008; pp. 413–422.
  16. Goldstein, M.; Dengel, A. Histogram-Based Outlier Score (HBOS): A Fast Unsupervised Anomaly Detection Algorithm. In KI-2012: Poster and Demo Track; German Research Center for Artificial Intelligence (DFKI): Kaiserslautern, Germany, 2012; Available online: https://www.goldiges.de/publications/HBOS-KI-2012.pdf (accessed on 8 June 2026).
  17. Rousseeuw, P.J.; Van Driessen, K. A Fast Algorithm for the Minimum Covariance Determinant Estimator. Technometrics 1999, 41, 212–223.
  18. Li, Z.; Zhao, Y.; Botta, N.; Ionescu, C.; Hu, X. COPOD: Copula-Based Outlier Detection. In Proceedings of the 2020 IEEE International Conference on Data Mining (ICDM); IEEE: New York, NY, USA, 2020; pp. 1118–1123.
  19. Bandaragoda, T.R.; Ting, K.M.; Albrecht, D.; Liu, F.T.; Zhu, Y.; Wells, J.R. Isolation-Based Anomaly Detection Using Nearest-Neighbour Ensembles. Comput. Intell. 2018, 34, 968–998.
  20. An, J.; Cho, S. Variational Autoencoder based Anomaly Detection using Reconstruction Probability. In SNU Data Mining Center Technical Report; SNUDM-TR-2015-03; Seoul National University, Department of Industrial Engineering: Seoul, Republic of Korea, 2015; Available online: http://dm.snu.ac.kr/static/docs/TR/SNUDM-TR-2015-03.pdf (accessed on 27 April 2026).
  21. Ruff, L.; Vandermeulen, R.; Goernitz, N.; Deecke, L.; Siddiqui, S.A.; Binder, A.; Müller, E.; Kloft, M. Deep One-Class Classification. In Proceedings of the 35th International Conference on Machine Learning; Dy, J., Krause, A., Eds.; PMLR: Stockholmsmässan, Stockholm, Sweden, 2018; Volume 80, pp. 4393–4402. Available online: https://proceedings.mlr.press/v80/ruff18a.html (accessed on 26 April 2026).
  22. Goodge, A.; Hooi, B.; Ng, S.K.; Ng, W.S. LUNAR: Unifying Local Outlier Methods via Graph Neural Networks. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI Press: Menlo Park, CA, USA, 2022; Volume 36, pp. 6737–6745.
  23. Lazarevic, A.; Kumar, V. Feature Bagging for Outlier Detection. In Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; ACM: New York, NY, USA, 2005; pp. 157–166.
  24. Zhao, Y.; Nasrullah, Z.; Hryniewicki, M.K.; Li, Z. LSCP: Locally Selective Combination in Parallel Outlier Ensembles. In Proceedings of the 2019 SIAM International Conference on Data Mining (SDM); Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 2019; pp. 585–593.
  25. Zhao, Y.; Hu, X.; Cheng, C.; Wang, C.; Wan, C.; Wang, W.; Yang, J.; Bai, H.; Li, Z.; Xiao, C.; et al. SUOD: Accelerating Large-Scale Unsupervised Heterogeneous Outlier Detection. In Proceedings of Machine Learning and Systems; Smola, A., Dimakis, A., Stoica, I., Eds.; MLSys: Virtual Conference, 2021; Volume 3, pp. 463–478. Available online: https://proceedings.mlsys.org/paper/2021/hash/37385144cac01dff38247ab11c119e3c-Abstract.html (accessed on 28 April 2026).
  26. Liu, Y.; Zhu, L.; Ding, L.; Huang, Z.; Sui, H.; Wang, S.; Song, Y. Selective Ensemble Method for Anomaly Detection Based on Parallel Learning. Sci. Rep. 2024, 14, 1420.
  27. Zhou, D.; Liu, B. RHAD: A Reinforced Heterogeneous Anomaly Detector for Robust Industrial Control System Security. Electronics 2025, 14, 2440.
  28. Gao, T.; Yang, J.; Wang, W.; Fan, X. A Domain Feature Decoupling Network for Rotating Machinery Fault Diagnosis Under Unseen Operating Conditions. Reliab. Eng. Syst. Saf. 2024, 252, 110449.
  29. Wang, P.; Li, S. Multi-Agent Debate: A Unified Agentic Framework for Tabular Anomaly Detection. arXiv 2026, arXiv:2602.14251.
  30. Janssens, J.H.M.; Huszár, F.; Postma, E.O.; van den Herik, H.J. Stochastic Outlier Selection. In Tilburg Centre for Creative Computing Technical Report; TiCC TR 2012-001; Tilburg Centre for Creative Computing, Tilburg University: Tilburg, The Netherlands, 2012; Available online: https://github.com/jeroenjanssens/scikit-sos/blob/main/doc/sos-ticc-tr-2012-001.pdf (accessed on 12 May 2026).
  31. Kuncheva, L.I.; Whitaker, C.J.; Shipp, C.A.; Duin, R.P.W. Limits on the Majority Vote Accuracy in Classifier Fusion. Pattern Anal. Appl. 2003, 6, 22–31.
Figure 1. Critical difference diagram for F1 scores across 22 datasets and 16 methods (α = 0.05, CD = 5.11). Lower rank is better. The red arrow marks the critical difference threshold; green bars connect statistically equivalent methods; red and blue markers denote ensemble variants and individual detectors, respectively.
Figure 2. Precision, recall, and F1 score as a function of the hard voting threshold k ∈ {4, 5, 6, 7, 8}, averaged across 22 datasets. Shaded regions show ±1 standard deviation.
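The role of the hard voting threshold k can be illustrated with a minimal sketch. It assumes that each detector emits a binary vote per sample and that a sample is flagged when at least k detectors vote "anomalous"; the function name `hard_vote` and the toy vote matrix are hypothetical and are not taken from the article.

```python
from typing import List, Sequence


def hard_vote(votes: Sequence[Sequence[int]], k: int) -> List[int]:
    """Flag a sample as anomalous (1) when at least k detectors vote 1.

    votes[i][j] is the binary vote of detector j on sample i.
    Assumption: the threshold k is a minimum vote count.
    """
    return [1 if sum(row) >= k else 0 for row in votes]


# Toy example: 3 samples, 11 detectors (hypothetical votes).
votes = [
    [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],  # 8 votes
    [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],  # 5 votes
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # 1 vote
]

for k in (4, 5, 6, 7, 8):
    print(k, hard_vote(votes, k))
```

Under this assumption, raising k can only shrink the set of flagged samples, so recall cannot increase as k grows, while precision typically improves.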
Figure 3. Mean F1 score as a function of ensemble size (predefined paradigm-coverage order). The sharpest gain occurs between two and three detectors; subsequent additions provide marginal improvement.
Figure 4. Effect of contamination perturbation δ on precision, recall, and F1, averaged across 22 datasets. The dashed vertical line marks the nominal GMM estimate (δ = 0). Performance is stable across the full perturbation range.
Figure 5. GMM score distributions and fitted components for Satimage-2 (1.2% anomalies) and Speech (1.7% anomalies). The orange dashed line marks the estimated threshold; the red dotted line marks the median anomaly score.
Figure 6. AUPRC, AUROC, and F1 versus computational cost for compact ensemble configurations. The dashed horizontal line marks the full 11-detector ensemble baseline. Highlighted points represent the recommended configurations from Table 6.
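Table 1. Benchmark dataset characteristics, ordered by sample size.

| Dataset | n | d | Anomalies | Anom. Rate (%) |
|---|---|---|---|---|
| Wine | 129 | 13 | 10 | 7.7 |
| Lympho | 148 | 18 | 6 | 4.1 |
| Glass | 214 | 9 | 9 | 4.2 |
| Vertebral | 240 | 6 | 30 | 12.5 |
| Ionosphere | 351 | 33 | 126 | 35.9 |
| WBC | 378 | 30 | 21 | 5.6 |
| Arrhythmia | 452 | 274 | 66 | 14.6 |
| BreastW | 683 | 9 | 239 | 35.0 |
| Pima | 768 | 8 | 268 | 34.9 |
| Vowels | 1456 | 12 | 50 | 3.4 |
| Letter | 1600 | 32 | 100 | 6.3 |
| Cardio | 1831 | 21 | 176 | 9.6 |
| Musk | 3062 | 166 | 97 | 3.2 |
| Speech | 3686 | 400 | 61 | 1.7 |
| Thyroid | 3772 | 6 | 93 | 2.5 |
| Optdigits | 5216 | 64 | 150 | 2.9 |
| Satimage-2 | 5803 | 36 | 71 | 1.2 |
| Satellite | 6435 | 36 | 2036 | 31.6 |
| Pendigits | 6870 | 16 | 156 | 2.3 |
| Annthyroid | 7200 | 6 | 534 | 7.4 |
| MNIST | 7603 | 100 | 700 | 9.2 |
| Mammography | 11,183 | 6 | 260 | 2.3 |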
Table 2. Sensitivity of NCL weighting to the diversity penalty coefficient λ defined by Equation (5), averaged across 22 datasets. Mean ± standard deviation reported.

| λ | F1 | AUROC | AUPRC |
|---|---|---|---|
| 0.1 | 0.3099 ± 0.2422 | 0.8141 ± 0.1714 | 0.4495 ± 0.3297 |
| 0.5 | 0.3099 ± 0.2422 | 0.8137 ± 0.1717 | 0.4494 ± 0.3294 |
| 1.0 | 0.3062 ± 0.2398 | 0.8132 ± 0.1719 | 0.4491 ± 0.3293 |
| 2.0 | 0.3057 ± 0.2390 | 0.8125 ± 0.1724 | 0.4484 ± 0.3291 |
| 5.0 | 0.3052 ± 0.2389 | 0.8114 ± 0.1731 | 0.4473 ± 0.3298 |
Table 3. Mean ± std performance across 22 benchmark datasets. Best value per metric in bold. † denotes an ensemble method. Methods are sorted by AUPRC.

| Method | Type | F1 | AUROC | AUPRC |
|---|---|---|---|---|
| Ensemble-EM † | Proposed | 0.311 ± 0.243 | **0.818 ± 0.170** | 0.437 ± 0.322 |
| Ensemble-Equal † | Proposed | 0.312 ± 0.239 | 0.817 ± 0.170 | 0.432 ± 0.318 |
| Ensemble-NCL † | Proposed | 0.319 ± 0.240 | 0.814 ± 0.174 | 0.434 ± 0.314 |
| Ensemble-Var † | Proposed | **0.315 ± 0.240** | 0.814 ± 0.173 | 0.435 ± 0.315 |
| IsolationForest | Standalone | 0.298 ± 0.236 | 0.806 ± 0.168 | **0.439 ± 0.332** |
| MCD | Standalone | 0.294 ± 0.255 | 0.809 ± 0.186 | 0.420 ± 0.334 |
| LSCP † | Baseline | 0.290 ± 0.233 | 0.793 ± 0.174 | 0.415 ± 0.323 |
| PCA | Standalone | 0.261 ± 0.231 | 0.748 ± 0.203 | 0.397 ± 0.327 |
| HBOS | Standalone | 0.255 ± 0.227 | 0.761 ± 0.207 | 0.395 ± 0.322 |
| INNE | Standalone | 0.304 ± 0.202 | 0.793 ± 0.167 | 0.373 ± 0.252 |
| KNN | Standalone | 0.283 ± 0.243 | 0.784 ± 0.182 | 0.369 ± 0.275 |
| COPOD | Standalone | 0.277 ± 0.225 | 0.775 ± 0.189 | 0.382 ± 0.301 |
| VAE | Standalone | 0.282 ± 0.230 | 0.765 ± 0.180 | 0.326 ± 0.259 |
| LUNAR | Standalone | 0.263 ± 0.231 | 0.724 ± 0.201 | 0.314 ± 0.267 |
| DeepSVDD | Standalone | 0.254 ± 0.199 | 0.718 ± 0.173 | 0.300 ± 0.253 |
| LOF | Standalone | 0.207 ± 0.166 | 0.656 ± 0.182 | 0.247 ± 0.237 |
Table 4. Nemenyi post-hoc p-values. Significant results (p < 0.05) in bold.

| Comparison | Ensemble-Equal | Ensemble-NCL |
|---|---|---|
| vs. DeepSVDD | **0.001** | **0.002** |
| vs. LOF | **0.008** | **0.016** |
| vs. PCA | **0.008** | **0.014** |
| vs. VAE | **0.019** | **0.024** |
| vs. LUNAR | **0.044** | 0.076 |
| vs. COPOD | 0.055 | 0.069 |
| vs. HBOS | 0.197 | 0.240 |
| vs. KNN | 0.477 | 0.598 |
| vs. LSCP | 0.503 | 0.549 |
| vs. INNE | 1.000 | 1.000 |
| vs. IsolationForest | 0.954 | 0.971 |
| vs. MCD | 0.992 | 0.998 |
Table 5. Weighting scheme comparison averaged across 22 datasets. Best value per metric in bold.

| Scheme | F1 | AUROC | AUPRC |
|---|---|---|---|
| Equal | 0.312 ± 0.239 | 0.817 ± 0.170 | 0.432 ± 0.318 |
| Variance | **0.315 ± 0.240** | 0.814 ± 0.173 | 0.435 ± 0.315 |
| NCL | 0.319 ± 0.240 | 0.814 ± 0.174 | 0.434 ± 0.314 |
| EM | 0.311 ± 0.243 | **0.818 ± 0.170** | **0.437 ± 0.322** |
Table 6. Compact ensemble performance compared to the full 11-detector ensemble, averaged across 22 datasets (mean ± std). Cost is expressed as a percentage of the full ensemble mean training time (40.87 s). Best value per metric in bold.

| Ensemble | n | F1 | AUROC | AUPRC | Cost (%) |
|---|---|---|---|---|---|
| Full-11-EM | 11 | 0.311 ± 0.237 | 0.818 ± 0.167 | 0.437 ± 0.315 | 100 |
| IF+INNE+KNN+HBOS | 4 | 0.293 ± 0.235 | 0.819 ± 0.165 | **0.450 ± 0.327** | 13.8 |
| IF+INNE+KNN+HBOS+LUNAR | 5 | 0.307 ± 0.242 | **0.823 ± 0.165** | 0.447 ± 0.325 | 23.4 |
| IF+KNN+HBOS | 3 | 0.319 ± 0.234 | 0.815 ± 0.168 | **0.446 ± 0.330** | 8.7 |
| IF+KNN+MCD | 3 | **0.347 ± 0.272** | 0.827 ± 0.169 | 0.432 ± 0.321 | 17.7 |
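The relative costs in Table 6 can be converted back into approximate training times. The following is a minimal sketch that assumes only what the caption states, namely that each percentage is taken relative to the full ensemble mean training time of 40.87 s; the helper name `cost_seconds` is hypothetical, and the resulting seconds are derived values, not figures reported in the article.

```python
FULL_ENSEMBLE_SECONDS = 40.87  # full ensemble mean training time (Table 6 caption)

# Relative cost in percent, as listed in Table 6.
cost_percent = {
    "Full-11-EM": 100.0,
    "IF+INNE+KNN+HBOS": 13.8,
    "IF+INNE+KNN+HBOS+LUNAR": 23.4,
    "IF+KNN+HBOS": 8.7,
    "IF+KNN+MCD": 17.7,
}


def cost_seconds(percent: float, full_seconds: float = FULL_ENSEMBLE_SECONDS) -> float:
    """Convert a relative cost (percent of the full ensemble) to seconds."""
    return full_seconds * percent / 100.0


for name, pct in cost_percent.items():
    secs = cost_seconds(pct)
    ratio = 100.0 / pct  # how many times cheaper than the full ensemble
    print(f"{name}: about {secs:.2f} s, cost ratio 1/{ratio:.1f}")
```

For the four-detector configuration, 13.8% of 40.87 s is roughly 5.6 s, about one seventh of the full ensemble cost.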
Table 7. Computational complexity of individual detectors averaged across 22 benchmark datasets. Mean ± std reported.

| Detector | Fit Time (s) | Inference (ms/1k) | Peak Memory (MB) |
|---|---|---|---|
| IsolationForest | 0.70 ± 0.19 | 0.02 ± 0.01 | 1.4 ± 1.5 |
| INNE | 0.78 ± 0.26 | 0.22 ± 0.15 | 6.4 ± 5.8 |
| LOF | 0.55 ± 0.42 | 0.33 ± 0.20 | 6.0 ± 4.4 |
| KNN | 0.55 ± 0.41 | 0.33 ± 0.25 | 4.1 ± 3.5 |
| HBOS | 0.46 ± 1.97 | 0.06 ± 0.28 | 2.5 ± 5.3 |
| MCD | 8.46 ± 25.91 | 0.01 ± 0.02 | 32.8 ± 102.5 |
| COPOD | 0.05 ± 0.09 | 0.05 ± 0.10 | 8.5 ± 15.8 |
| PCA | 0.05 ± 0.13 | 0.01 ± 0.02 | 5.2 ± 11.2 |
| VAE | 23.99 ± 24.56 | 0.04 ± 0.01 | 3.9 ± 11.2 |
| DeepSVDD | 4.06 ± 3.80 | 0.02 ± 0.05 | 1.7 ± 3.0 |
| LUNAR | 1.60 ± 0.87 | 0.26 ± 0.38 | 5.8 ± 10.0 |
Table 8. Best detector combination per ensemble size across all C(11, n) combinations, ranked by mean AUPRC across 22 datasets (mean ± std). Best value per metric in bold.

| n | Best Combination | F1 | AUROC | AUPRC |
|---|---|---|---|---|
| 2 | IF+PCA | 0.297 ± 0.231 | 0.798 ± 0.184 | **0.456 ± 0.333** |
| 3 | IF+KNN+HBOS | **0.319 ± 0.234** | 0.815 ± 0.168 | 0.446 ± 0.330 |
| 4 | IF+INNE+KNN+HBOS | 0.293 ± 0.235 | 0.819 ± 0.165 | 0.450 ± 0.327 |
| 5 | IF+INNE+KNN+HBOS+LUNAR | 0.307 ± 0.242 | **0.823 ± 0.165** | 0.447 ± 0.325 |
| 6 | IF+INNE+KNN+HBOS+PCA+LUNAR | 0.303 ± 0.239 | 0.819 ± 0.167 | 0.447 ± 0.329 |
| 7 | IF+INNE+KNN+HBOS+MCD+COPOD+LUNAR | 0.311 ± 0.245 | 0.823 ± 0.167 | 0.444 ± 0.326 |
| 8 | IF+INNE+KNN+HBOS+MCD+COPOD+PCA+LUNAR | 0.309 ± 0.244 | 0.818 ± 0.168 | 0.440 ± 0.326 |
| 9 | IF+INNE+LOF+KNN+HBOS+MCD+COPOD+PCA+LUNAR | 0.305 ± 0.238 | 0.820 ± 0.168 | 0.440 ± 0.323 |
| 10 | IF+INNE+KNN+HBOS+MCD+COPOD+PCA+VAE+DSVDD+LUNAR | 0.311 ± 0.245 | 0.815 ± 0.170 | 0.437 ± 0.323 |
| 11 | IF+INNE+LOF+KNN+HBOS+MCD+COPOD+PCA+VAE+DSVDD+LUNAR | 0.311 ± 0.239 | 0.817 ± 0.170 | 0.434 ± 0.318 |
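Table 8 reports the best combination per ensemble size n drawn from a pool of 11 detectors. The size of that search space can be sketched as follows, assuming an exhaustive enumeration of all subsets of each size; the variable names are illustrative only and the counts are not stated in the article.

```python
from math import comb

N_DETECTORS = 11  # size of the detector pool in Table 8

# Number of distinct detector subsets for each ensemble size n.
counts = {n: comb(N_DETECTORS, n) for n in range(2, N_DETECTORS + 1)}

for n, c in counts.items():
    print(f"n={n:2d}: {c} combinations")

# Total number of candidate ensembles with at least two detectors.
total = sum(counts.values())
print("total subsets with n >= 2:", total)
```

Under this assumption the search space peaks at n = 5 and n = 6 with 462 subsets each, and covers 2036 candidate ensembles in total.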
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Krčmar, T.; Šabanović, D.; Köhler, M.; Lukić, I. Less Is More: Principled Diversity in Heterogeneous Anomaly Detection Ensembles. AI 2026, 7, 214. https://doi.org/10.3390/ai7060214