Proceeding Paper

Regression Tree Based Explanation for Anomaly Detection Algorithm †

by Iñigo López-Riobóo Botana 1,*, Carlos Eiras-Franco 2 and Amparo Alonso-Betanzos 2

1 Research group LIDIA, Universidade da Coruña, Campus Elviña, 15071 A Coruña, Spain
2 CITIC Research Center, Universidade da Coruña, 15071 A Coruña, Spain
* Author to whom correspondence should be addressed.
† Presented at the 3rd XoveTIC Conference, A Coruña, Spain, 8–9 October 2020.
Proceedings 2020, 54(1), 7; https://doi.org/10.3390/proceedings2020054007
Published: 18 August 2020
(This article belongs to the Proceedings of 3rd XoveTIC Conference)

Abstract

This work presents EADMNC (Explainable Anomaly Detection on Mixed Numerical and Categorical spaces), a novel approach that adds explainability to ADMNC, an anomaly detection algorithm that provides accurate detections on mixed numerical and categorical input spaces. Our improved algorithm leverages the formulation of the ADMNC model to offer pre-hoc explainability based on CART (Classification and Regression Trees). The explanation is presented as a segmentation of the input data into homogeneous groups that can be described with a few variables, giving supervisors novel information with which to justify detections. To demonstrate scalability and interpretability, we report experimental results on large real-world datasets from the network intrusion detection domain.

1. Introduction

Anomaly Detection is a long-established discipline that has become increasingly relevant in situations where datasets are huge and contain unexpected events carrying important information. These methods have found applications in fields such as network intrusion detection and surveillance, among others. Several machine learning models are available [1,2], but despite offering very effective detection, most of these algorithms are unable to provide justifications for their outputs. This lack of explanation is one of the most important shortcomings of Machine Learning at present [3]; the European Union cites XAI (Explainable Artificial Intelligence) in its Ethics Guidelines for Trustworthy AI [4].
This work extends the ADMNC algorithm [5], an anomaly detection algorithm developed by our research group, with a new layer that opens the ADMNC black box by offering pre-hoc explainability. Regression decision trees are used to segment the input data into homogeneous groups that can be described with a few variables. The objective is to provide a helpful and intuitive description of anomalous data, thus giving supervisors the information needed to make informed decisions.

2. Methodology

The original ADMNC algorithm [5] is a large-scale offline learning method that obtains a model of normal data, which is then used to detect anomalies. The model used to obtain the pre-hoc explanation consists of a grouping of the input patterns according to their numerical variables. Clusters are defined as the leaf nodes of a shallow decision tree [6]. Each pattern is assigned its ADMNC estimator [5], and this estimator is then approximated with a simple regression model, learned using the Apache Spark MLlib implementation of CART. The variance of the estimators within a tree node indicates how homogeneous that node is, and successive divisions turn nodes into more specific groups containing similar elements. The balance between cluster homogeneity and explanation complexity, given by the depth of each path, allows the supervisor to choose the level of detail of the explanations.
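The leaf-clustering step above can be sketched as follows. This is a minimal illustration only: the paper uses the Apache Spark MLlib CART implementation, whereas here scikit-learn's DecisionTreeRegressor stands in for it, and the anomaly estimators are synthetic placeholders rather than ADMNC outputs.

```python
# Sketch: fit a shallow regression tree to anomaly estimators so that
# each leaf node becomes a cluster of input patterns.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))         # numerical input variables
scores = rng.exponential(size=1000)    # placeholder for ADMNC estimators

# A shallow tree keeps explanations short: each leaf is characterized by
# the few variables tested along its root-to-leaf path.
tree = DecisionTreeRegressor(max_depth=3, random_state=0)
tree.fit(X, scores)

leaf_ids = tree.apply(X)               # cluster assignment for each pattern
n_clusters = len(np.unique(leaf_ids))  # at most 2**3 = 8 leaf clusters
```

A depth-3 tree yields at most eight clusters, each describable with at most three variables, which is the kind of compact segmentation the method aims for.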
We define the clustering $Cl(D)$ over dataset $D$ as a set of $m$ clusters $\{Cl_i\}_{i \in [1, m]}$ that contains every element in $D$. The weighted variance (WV) of a clustering $Cl(D)$ is defined as:

$$WV(Cl(D)) = \sum_{i \in 1..m} \sigma^{2}_{Cl_i} \frac{|Cl_i|}{|D|}.$$
The weighted variance of a clustering measures how homogeneous its components are. This measure is complemented with another that indicates the number of input variables employed to characterize each cluster $Cl_i$. As a result, the quality $Q$ of a clustering is defined as:

$$Q(Cl(D)) = -WV(Cl(D)) - \lambda \sum_{Cl_i \in Cl(D)} NV(Cl_i),$$
where $NV(Cl_i)$ represents the number of variables needed to describe cluster $Cl_i$ and $\lambda$ is a hyperparameter that allows the supervisor to balance accuracy against interpretability [6] for the whole clustering. This quality measure is always negative, and the goal of the algorithm is to maximize it, bringing it as close to 0 as possible. Maximizing this measure ensures that the groups obtained are as homogeneous as possible and that they are explained using as few of the input variables as possible.
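The two definitions above translate directly into code. The following is a minimal sketch under stated assumptions: the function names (`weighted_variance`, `quality`) and the toy cluster data are illustrative, not taken from the paper's implementation.

```python
# Sketch of the clustering quality measure Q: negated weighted variance of
# the leaf clusters, minus a lambda-weighted penalty on the number of
# variables needed to describe each cluster.
import numpy as np

def weighted_variance(clusters):
    """clusters: list of 1-D arrays of estimator values, one per cluster."""
    n = sum(len(c) for c in clusters)
    return sum(np.var(c) * len(c) / n for c in clusters)

def quality(clusters, n_vars_per_cluster, lam):
    """Q = -WV - lambda * sum of descriptive variables; always <= 0."""
    return -weighted_variance(clusters) - lam * sum(n_vars_per_cluster)

# Toy example: two clusters of estimator values, described by 1 and 2
# variables respectively.
clusters = [np.array([0.1, 0.2, 0.15]), np.array([0.9, 1.1])]
q = quality(clusters, n_vars_per_cluster=[1, 2], lam=1e-3)
```

Raising `lam` penalizes clusterings that need many variables, pushing the pruning step toward simpler, more interpretable trees.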
The method is carried out in two steps: (1) a full N-level tree is built using the well-known CART algorithm; (2) this full tree is pruned to optimize the quality measure. Node splits that decrease variance but also decrease quality are discarded, yielding a simpler tree that maximizes quality. The main features that make data anomalous can then be read off the path leading to the anomalous clusters.
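The pruning step can be sketched on a toy hand-built tree: a split is kept only when its children yield a higher quality Q than the unsplit parent. The `Node` class and `prune` helper below are illustrative assumptions, not the paper's code, and the sketch prunes only splits whose children are both leaves (a simplified bottom-up pass).

```python
# Sketch of step (2): bottom-up pruning that discards splits which
# decrease the quality measure Q.
import numpy as np

class Node:
    def __init__(self, values, n_vars, left=None, right=None):
        self.values = np.asarray(values, dtype=float)  # estimators at node
        self.n_vars = n_vars          # variables on the path to this node
        self.left, self.right = left, right

def q_leaves(leaves, n_total, lam):
    """Quality contribution of a set of leaf clusters."""
    wv = sum(np.var(l.values) * len(l.values) / n_total for l in leaves)
    return -wv - lam * sum(l.n_vars for l in leaves)

def prune(node, n_total, lam):
    """Discard a leaf-pair split if it lowers Q compared to the parent."""
    if node.left is None:
        return node
    node.left = prune(node.left, n_total, lam)
    node.right = prune(node.right, n_total, lam)
    if node.left.left is None and node.right.left is None:
        if (q_leaves([node.left, node.right], n_total, lam)
                < q_leaves([node], n_total, lam)):
            node.left = node.right = None   # split hurts quality: remove it
    return node
```

With a large `lam` the variable-count penalty dominates and homogeneity-improving splits get pruned away; with a small `lam` they survive.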

3. Experimental Results

To assess the validity of our approach, we considered two large datasets from the network intrusion detection domain, KDDCup99 [5] and ISCXIDS 2012. For each resulting clustering, we measured its quality Q and weighted variance, and recorded the number of clusters and the number of variables employed for both the full and the pruned tree. These results are listed in Table 1. We set the hyperparameter λ according to the desired pruning effort; the supervisor can modify this value to assign more or less importance to interpretability relative to predictive power. The area under the ROC (Receiver Operating Characteristic) curve is provided as a fitness measure for anomaly detection, averaged over five repetitions of each experiment. An example of an explanatory tree is shown in Figure 1.
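The fitness measure reported in Table 1 can be illustrated as follows. This is a hedged sketch with synthetic labels and scores standing in for the real datasets and ADMNC estimators; only the evaluation protocol (AUC averaged over five repetitions) mirrors the paper.

```python
# Sketch of the evaluation metric: area under the ROC curve for anomaly
# scores against ground-truth labels, averaged over five repetitions.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
aucs = []
for _ in range(5):                      # five repetitions, as in the paper
    y = rng.integers(0, 2, size=200)    # 1 = anomaly, 0 = normal (synthetic)
    scores = y + rng.normal(scale=0.5, size=200)  # noisy anomaly estimator
    aucs.append(roc_auc_score(y, scores))

mean_auc, std_auc = np.mean(aucs), np.std(aucs)
```

Reporting mean and standard deviation over repetitions, as in Table 1, guards against a single lucky or unlucky run.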

4. Discussion and Conclusions

XAI is necessary to provide transparency to model predictions. It is a growing field of study that supports compliance with new European Union regulations. The proposed method allows us to examine differences between normal and anomalous data, potentially enabling the assessment of generalization power, the identification of biases, and the formulation of hypotheses about the context of abnormal data.
In the future, we plan to add the categorical variables to the tree-based pre-hoc explanation, which will paint a more accurate picture of the input dataset. Another possible research line is to improve explanations by introducing a prior dimensionality-reduction step, since high-dimensional data often contain redundant and irrelevant variables that introduce bias and generalization errors.

Supplementary Materials

Pre-hoc regression trees are available online at https://www.dropbox.com/sh/m6lyn8zpss75sru/AADO_OFwzNwUTHD24vgJXhwma?dl=0

Funding

This research was partially funded by European Union ERDF funds, Ministerio de Ciencia e Innovación grant number PID2019-109238GB-C22, and Xunta de Galicia through the accreditation of Centro Singular de Investigación 2016-2020, Ref. ED431G/01, and Grupos de Referencia Competitiva, Ref. GRC2014/035.

Acknowledgments

We would like to thank CESGA for the use of their computing resources. Special recognition is given to the Spanish Ministerio de Educación for the predoctoral FPU funds, grant number FPU19/01457.

References

1. Liu, F.T.; Ting, K.; Zhou, Z. Isolation-Based Anomaly Detection. ACM TKDD 2012, 6, 1–39.
2. Lu, Y.C.; Chen, F.; Wang, Y.; Lu, C.T. Discovering anomalies on mixed-type data using a generalized student-t based approach. IEEE Trans. Knowl. Data Eng. 2016, 10, 2582–2595.
3. Adadi, A.; Berrada, M. Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI). IEEE Access 2018, 6, 52138–52160.
4. High Level Expert Group on Artificial Intelligence. Ethics Guidelines on Trustworthy Artificial Intelligence. 2019. Available online: https://ec.europa.eu/digital-single-market/en/news/ethics-guidelines-trustworthy-ai (accessed on 1 July 2020).
5. Eiras-Franco, C.; Martínez-Rego, D.; Guijarro-Berdiñas, B.; Alonso-Betanzos, A.; Bahamonde, A. Large Scale Anomaly Detection in Mixed Numerical and Categorical Input Spaces. Inf. Sci. 2019, 487, 115–127.
6. Eiras-Franco, C.; Guijarro-Berdiñas, B.; Alonso-Betanzos, A.; Bahamonde, A. A scalable decision-tree-based method to explain interactions in dyadic data. Decis. Support Syst. 2019, 127, 113141.
Figure 1. Explanatory tree after pruning ($\lambda = 10^{-3}$) using the KDDCup99-SMTP dataset. Named sequentially, reading from left to right, each node shows: the proportion of elements it represents with respect to the full dataset (shown in blue), the overall variance (shown in blue), the weighted variance w.r.t. its children nodes (shown in dark blue), and the mean and standard deviation of the subset of estimators. Further experimental results are given in the Supplementary Materials.
Table 1. Area under the ROC curve (AUC) and explanatory tree metrics, before pruning (Full, F) and after pruning (Pruned, P), considering hyperparameter λ, OV (overall variance), Q (quality), WV (weighted variance), #Cl (number of clusters) and NV (number of variables to reach all clusters).
| Dataset | OV | λ | AUC (μ ± σ) | Tree | Q | WV | #Cl | NV |
|---|---|---|---|---|---|---|---|---|
| ISCXIDS 2012 | 0.105 | 10⁻⁴ | 0.919 ± 0.02 | F | −0.062 | 0.048 | 29 | 142 |
| | | | | P | −0.051 | 0.049 | 7 | 25 |
| KDDCup99 - FULL | 0.049 | 10⁻³ | 0.758 ± 0.05 | F | −0.147 | 0.011 | 28 | 136 |
| | | | | P | −0.032 | 0.012 | 6 | 20 |
| KDDCup99 - SMTP | 2.846 | 10⁻³ | 0.980 ± 0.01 | F | −0.105 | 3.630 × 10⁻⁹ | 22 | 105 |
| | | | | P | −0.005 | 6.632 × 10⁻⁶ | 3 | 5 |
| KDDCup99 - HTTP | 0.843 | 10⁻³ | 0.992 ± 0.01 | F | −0.898 | 0.831 | 15 | 67 |
| | | | | P | −0.842 | 0.837 | 3 | 5 |
| KDDCup99 - 10 | 2.454 | 10⁻³ | 0.966 ± 0.02 | F | −1.320 | 1.227 | 20 | 93 |
| | | | | P | −1.247 | 1.228 | 6 | 20 |

