Regression Tree Based Explanation for an Anomaly Detection Algorithm

Abstract: This work presents EADMNC (Explainable Anomaly Detection on Mixed Numerical and Categorical spaces), a novel approach that adds explainability to ADMNC, an anomaly detection algorithm that provides accurate detections on mixed numerical and categorical input spaces. Our improved algorithm leverages the formulation of the ADMNC model to offer pre-hoc explainability based on CART (Classification and Regression Trees). The explanation is presented as a segmentation of the input data into homogeneous groups that can be described with a few variables, offering supervisors novel information with which to justify detections. To demonstrate scalability and interpretability, we report experimental results on large real-world datasets from the network intrusion detection domain.


Introduction
Anomaly Detection is a long-standing discipline that has become especially relevant now that datasets are huge and contain unexpected events carrying important information. These methods have found applications in fields such as network intrusion detection and surveillance, among others. Several machine learning models are available [1,2], but despite offering very effective detection, most of these algorithms are unable to provide justifications for their outputs. This lack of explanation is one of the most important shortcomings of Machine Learning at present [3]; the European Union highlights XAI (Explainable Artificial Intelligence) in its Ethics Guidelines for Trustworthy AI [4]. This work extends ADMNC [5], an anomaly detection algorithm developed by our research group, with a new layer that opens the ADMNC black box by offering pre-hoc explainability. Regression decision trees are used to segment the input data into homogeneous groups that can be described with a few variables. The objective is to provide a helpful and intuitive description of anomalous data, thus offering the information needed to make informed decisions.

Methodology
The original ADMNC algorithm [5] is a large-scale offline learning method that builds a model of normal data, which is then used to detect anomalies. The model used to obtain the pre-hoc explanation consists of a grouping of the input patterns according to their numerical variables. Clusters are defined as the leaf nodes of a shallow decision tree [6]. Each pattern is assigned its ADMNC estimator [5], which is then approximated with a simple regression model learned using the Apache Spark MLlib implementation of CART. Variance indicates how homogeneous the estimators of the elements in a tree node are; successive splits turn nodes into more specific groups that contain similar elements. The balance between cluster homogeneity and explanation quality, given by the depth of each path, allows us to choose the level of detail of the explanations.
We define a clustering Cl(D) over a dataset D as a set of m clusters Cl_i, i ∈ [1, m], that together contain every element of D. The weighted variance (WV) of Cl(D) is defined as

WV(Cl(D)) = Σ_{i=1}^{m} (|Cl_i| / |D|) · Var(Cl_i),

where Var(Cl_i) is the variance of the ADMNC estimators of the elements assigned to Cl_i. The weighted variance of a clustering measures how homogeneous its components are. This measure is complemented with another that counts the input variables employed to characterize each cluster Cl_i. As a result, the quality Q of a clustering is defined as

Q(Cl(D)) = −WV(Cl(D)) − λ Σ_{i=1}^{m} NV(Cl_i),

where NV(Cl_i) represents the number of variables needed to describe cluster Cl_i and λ is a hyperparameter that allows the supervisor to balance the accuracy and interpretability [6] of the whole clustering. This quality measure is always negative, and the goal of the algorithm is to maximize it, bringing it as close to 0 as possible. Maximizing Q ensures that the groups obtained are as homogeneous as possible and that they are explained using as few of the input variables as possible.
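As a concrete reading of these definitions, the following sketch computes WV and Q for a clustering of estimator values. This is a minimal NumPy illustration; the function and variable names are ours, and the exact form of Q follows our reading of the definitions above.

```python
import numpy as np

def weighted_variance(clusters):
    # WV(Cl(D)): per-cluster variance of the estimator values,
    # weighted by the fraction of the dataset in each cluster.
    n = sum(len(c) for c in clusters)
    return sum(len(c) / n * np.var(c) for c in clusters)

def quality(clusters, num_vars, lam):
    # Q = -(WV + lambda * sum_i NV(Cl_i)); always <= 0, and closer
    # to 0 means more homogeneous clusters with shorter descriptions.
    return -(weighted_variance(clusters) + lam * sum(num_vars))

# Toy example: estimator values grouped into two clusters,
# described with 1 and 2 input variables respectively.
clusters = [np.array([0.10, 0.12, 0.11]), np.array([0.90, 0.95])]
q = quality(clusters, num_vars=[1, 2], lam=0.01)
```

Raising λ penalizes clusterings whose clusters need many variables to describe, which is what drives the pruning step below.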
The method proceeds in two steps: (1) a full N-level tree is built using the well-known CART algorithm; (2) this full tree is pruned to optimize the quality measure. Node splits that decrease variance but also decrease quality are discarded, yielding a simpler tree that maximizes quality. The main features that make data anomalous can then be read off the paths leading to anomalous clusters.
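The build-and-prune procedure can be sketched as follows. This is a minimal illustration, not the paper's implementation: scikit-learn's DecisionTreeRegressor stands in for the Spark MLlib CART, NV(Cl_i) is approximated by the depth of the path to each cluster, and all names are ours.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def prune_by_quality(reg, X, y, lam):
    # Bottom-up pruning sketch: keep a split only if the variance it
    # removes outweighs the lambda-weighted cost of a longer explanation.
    t, n = reg.tree_, len(y)
    paths = reg.decision_path(X)  # sample-by-node indicator matrix

    def cost(node, depth):
        mask = np.asarray(paths[:, node].todense()).ravel().astype(bool)
        # Cost of stopping here: weighted variance plus lambda * depth
        # (depth approximates NV for this cluster).
        leaf_cost = mask.sum() / n * np.var(y[mask]) + lam * depth
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:  # leaf of the full tree
            return leaf_cost, [node]
        lc, ll = cost(left, depth + 1)
        rc, rl = cost(right, depth + 1)
        if leaf_cost <= lc + rc:  # split does not pay off: collapse it
            return leaf_cost, [node]
        return lc + rc, ll + rl

    total, leaves = cost(0, 0)
    return -total, leaves  # (quality Q, node ids of surviving clusters)

# Toy usage: approximate a stand-in "estimator" signal with a depth-3 tree.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(float)  # synthetic stand-in for ADMNC estimators
reg = DecisionTreeRegressor(max_depth=3).fit(X, y)
q, leaves = prune_by_quality(reg, X, y, lam=0.01)
```

With a large λ the pruning collapses the whole tree to its root (one cluster, no explaining variables); with λ = 0 the full tree survives.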

Experimental Results
To assess the validity of our approach, we considered two large datasets from the network intrusion detection domain, KDDCup99 [5] and ISCXIDS 2012. For each resulting clustering we measured its quality Q and weighted variance, along with the number of clusters and the number of variables employed, for both the full and the pruned tree. These results are listed in Table 1. We set the hyperparameter λ according to the desired pruning effort; this value can be modified by the supervisor, assigning more or less importance to interpretability with respect to predictive power. The area under the ROC (Receiver Operating Characteristic) curve is reported as a fitness measure for anomaly detection, averaging five repetitions of each experiment. An example of an explanatory tree is shown in Figure 1.

Table 1. Area under the ROC curve (AUC) and explanatory tree metrics, before pruning (Full, F) and after pruning (Pruned, P), considering the hyperparameter λ, OV (overall variance), Q (quality), WV (weighted variance), #Cl (number of clusters) and NV (number of variables to reach all clusters).
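The evaluation protocol described above (AUC averaged over five repetitions) can be sketched as follows; the synthetic labels and scores are a hypothetical stand-in for the ADMNC estimator on a labelled test split, and the helper name is ours.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def mean_auc(runs):
    # runs: list of (y_true, scores) pairs, one per repetition;
    # returns the AUC averaged across repetitions.
    return float(np.mean([roc_auc_score(y, s) for y, s in runs]))

rng = np.random.default_rng(1)
runs = []
for _ in range(5):
    y = rng.integers(0, 2, size=500)           # 1 = anomaly (synthetic)
    s = y + rng.normal(scale=0.5, size=500)    # synthetic detector scores
    runs.append((y, s))
auc = mean_auc(runs)
```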

Discussion and Conclusions
XAI is necessary to provide transparency to model predictions. It is a growing field of study that supports compliance with new European Union regulations. The proposed method allows us to examine the differences between normal and anomalous data, potentially enabling the assessment of generalization power, the identification of biases, and the formulation of hypotheses about the context of abnormal data.
In the future, we plan to add the categorical variables to the tree-based pre-hoc explanation, which will paint a more accurate picture of the input dataset. Another possible research line is to improve the explanations by introducing a prior dimensionality-reduction step, since high-dimensional data often contain redundant and irrelevant variables that introduce bias and generalization errors.