Interpretable Single-dimension Outlier Detection (ISOD): An Unsupervised Outlier Detection Method Based on Quantiles and Skewness Coefficients

Abstract: Outlier detection is a crucial area of study in data mining, with applications in network security, credit card fraud detection, industrial flaw detection, etc. Existing outlier detection algorithms, which can be divided into supervised, semi-supervised, and unsupervised methods, suffer from missing labeled data, the curse of dimensionality, low interpretability, etc. To address these issues, we present in this paper an unsupervised outlier detection method based on quantiles and skewness coefficients called ISOD (Interpretable Single-dimension Outlier Detection). ISOD first constructs the empirical cumulative distribution function of each dimension, then computes the quantile and skewness coefficient of each dimension, and finally outputs the outlier score. This paper's contributions are as follows: (1) we propose an unsupervised outlier detection algorithm called ISOD, which has high interpretability and scalability; (2) extensive experiments on benchmark datasets demonstrated the superior performance of the ISOD algorithm compared with state-of-the-art baselines in terms of ROC and AP.


Introduction
Outlier detection, sometimes referred to as novelty detection, is the process of finding out what is different from normal data. According to Aggarwal, "outliers are also referred to as abnormalities, discordants, deviants, or anomalies in the data mining and statistics literature" [1].
Over the past few decades, many outlier detection algorithms have been proposed [20,22–25]; depending on whether labeled data are utilized, they can be divided into three main categories: (1) supervised methods, (2) semi-supervised methods, and (3) unsupervised methods. We will provide more details on these methods in Section 2.
While these algorithms were shown to be effective in earlier applications, as the concept of big data has become more prevalent and data have become more multidimensional, these algorithms have become increasingly problematic.
(1) Missing labeled data. Supervised algorithms require a large amount of labeled data that, in many cases, are difficult or costly to obtain. This can lead to unsatisfactory performance from these supervised algorithms.
(2) Curse of dimensionality. In the era of big data, the dimensionality of data is increasing. The performance of outlier detection algorithms, especially proximity-based ones, decreases rapidly with increasing data dimensionality.
(3) Interpretability. In practical applications of anomaly detection, such as credit card fraud detection and medical imaging inspection, we not only need to detect anomalous data but also need to provide a reasonable explanation as to why these data are anomalous. Due to the disparity between the distributions of outliers and normal instances, the peculiarities of the various detection algorithms, and the complexity of data structures in particular applications, it can be challenging to explain why outliers are abnormal.
To avoid the above shortcomings, this paper proposes a new outlier detection algorithm based on quantiles and skewness coefficients: Interpretable Single-dimension Outlier Detection (abbreviated as ISOD). In this method, the empirical cumulative distribution function of each dimension is first constructed from the data; the skewness coefficient and quantiles of the empirical cumulative distribution function are then computed. Finally, the skewness coefficient is used as a weight to aggregate the per-dimension anomaly degrees into each data point's anomaly score, from which the anomaly detection results are obtained.
The rest of this paper is organized as follows: In Section 2, an overview of current anomaly detection algorithms is provided. In Section 3, the focus is on the proposed algorithm (ISOD) and its analysis. In Section 4, the experiments we conducted and their results are analyzed, and Section 5 concludes the article.

Related Works
In data mining, training data are used to train models, and test data are used to measure performance. Based on the availability of labels, anomaly detection methods can be classified into supervised, semi-supervised, and unsupervised methods.

Supervised Methods
The availability of a training dataset with labeled cases for both the normal and anomalous classes is assumed by supervised techniques. The training and testing datasets must then be chosen to perform cross-validation. The training dataset is modeled using a supervised learning technique. Creating a prediction model for the normal vs. anomalous classes is a common strategy in these situations. The model is then assessed using the testing dataset.
The representative supervised method is the classification-based anomaly detection algorithm. A classifier is trained using the labels in the training dataset so that it can distinguish between normal and abnormal data. Once trained, the classifier can accurately classify new data as normal or abnormal.
A detailed description of these methods can be found in [22]. The merits of supervised methods include the fact that they are (1) easy to use once labels are available and (2) robust to different data types. However, their shortcomings are obvious: (1) labeled data are difficult or costly to obtain, especially in industrial and commercial applications; and (2) the result produced by a supervised algorithm is binary, so degrees of abnormality cannot be further compared.

Semi-Supervised Methods
The main difference between semi-supervised and supervised anomaly detection algorithms is that not all data are labeled. In other words, a portion of the data has labels indicating whether it is normal or abnormal, while the rest is unlabeled. The typical technique used in semi-supervised methods is to build a model for the class corresponding to normal behavior and use that model to identify anomalies in the test data.
A detailed description of these methods can be found in [22]. In semi-supervised anomaly detection, one-class support vector machines and support vector data description are widely used [31,32].
Such techniques are not always applicable because it is difficult to obtain a dataset that covers all possible anomalies. Even if such a dataset exists, the data will change over time, and abnormal data that have not appeared before may appear. Therefore, when there are no historical anomaly data, unsupervised measures can be used as a preliminary strategy for anomaly detection.

Unsupervised Methods
Unsupervised methods are the most extensively used methods since they do not need labeled training data. Modeling the training dataset is carried out using an unsupervised learning approach.
The underlying premise used by the strategies in this category is that regular cases in the test data are significantly more common than abnormalities.If this presumption is incorrect, the effectiveness of these methods will drop dramatically.
By employing a portion of the unlabeled dataset as training data, many semi-supervised approaches may be modified to function in an unsupervised manner.This kind of adaptation is predicated on the test data having relatively few abnormalities and the model to be trained being resilient to those anomalies.
A detailed description of these methods can be found in [22]. Isolation Forest is a representative method of this type [33].
Some researchers have applied unsupervised anomaly detection for health testing [34] and time-series anomaly detection [35].A novel method based on mutual information and reduced spectral clustering was developed in [36].
The advantage of this type of method is that it can perform anomaly detection without labeled data, making it suitable in most situations. Its main disadvantage is relatively poor interpretability: the decision process is less direct and more obscure than in supervised anomaly detection, especially when artificial intelligence technologies such as neural networks and deep learning are used.

Self-Supervised Methods
Self-supervised learning (SSL) is an AI-based method of training algorithmic models on raw, unlabeled data. Using various methods and learning techniques, self-supervised models create labels and annotations during the pre-training stage, aiming to iteratively achieve an accurate ground truth so a model can go into production. Some self-supervised methods have been developed for outlier detection [37–39]. A detailed description of these methods can be found in [40].
Automating Feature Subspace Exploration, a preprocessing step in machine learning for improving outlier detection, was developed in [41].

Quantiles
A quantile defines a particular part of a dataset; i.e., a quantile determines how many values in a distribution are above or below a certain limit. Special quantiles include quartiles (quarters), deciles (tenths), and percentiles (hundredths).
Although the term "quantile" lacks a uniform meaning, it is widely used to describe the proportion of values in a dataset that are less than a particular number. A quantile shows how a given value compares to the others. For example, if a value is in the k-th percentile, it is greater than k percent of the total values.
In Formula (1), n_x represents the number of values below x, n represents the total number of values, and P_x represents the quantile of the data point x.
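From these definitions (the typeset equation itself is not reproduced in this version, so this is a reconstruction consistent with the surrounding prose), Formula (1) can be written as:

```latex
P_x = \frac{n_x}{n} \qquad \text{(1)}
```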

Skewness Coefficient
The skewness coefficient is one way to measure the skewness of a distribution, i.e., a measure of a probability distribution's asymmetry. A distribution is said to be skewed if its curve is twisted either toward the left or the right. Karl Pearson's coefficient of skewness is the most significant measure of skewness. It is sometimes referred to as Pearson's skewness coefficient.
A distribution with zero skewness typically takes the form of a bell curve; the skewness of normal distributions is zero, and such distributions are symmetric about the mean. Still, there are situations in which a distribution is not symmetric; in these circumstances, the skewness can be either positive or negative.
When a distribution's tail is more prominent on the right than on the left, it is said to be positively skewed, and the skewness coefficient is positive. The majority of the values thus lie to the left of the mean, which indicates that the most extreme values are on the right side.
Negative skewness, on the other hand, occurs when the tail is more pronounced on the left rather than the right side. Contrary to positive skewness, most of the values are found on the right side of the mean in negative skewness. As such, the most extreme values are found further to the left. Formula (2) describes how to calculate the skewness coefficient.
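As a concrete illustration, the moment-based sample skewness can be computed as follows. This is one common form of the skewness coefficient; since Formula (2) is not reproduced here, the exact variant used by the paper may differ, and the function name `skewness` is ours.

```python
import math

def skewness(values):
    """Moment-based sample skewness: E[(x - mean)^3] / std^3."""
    n = len(values)
    mean = sum(values) / n
    # population variance and standard deviation
    var = sum((v - mean) ** 2 for v in values) / n
    std = math.sqrt(var)
    if std == 0:
        return 0.0
    return sum((v - mean) ** 3 for v in values) / (n * std ** 3)

print(skewness([1, 2, 3, 4, 5]))        # symmetric data -> 0.0
print(skewness([1, 1, 1, 2, 10]) > 0)   # long right tail -> True (positive skew)
```

A symmetric sample yields zero skewness, while a long right tail yields a positive coefficient, matching the description above.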

Definition of Outlier Detection
Outlier detection, without supervision, employs criteria to find outlier candidates that deviate from the majority of normal points. We have n data points X_1, X_2, ..., X_n ∈ R^d, sampled independently and identically distributed. We use the matrix X ∈ R^{n×d} to denote the entire dataset, formed by stacking the data points' vectors as rows. Given X, an outlier detector assigns an outlier score o_i ∈ R to each data point X_i, 1 ≤ i ≤ n. Data points with higher outlier scores are more likely to be outliers.

Construct the Empirical Cumulative Distribution Function
Anomaly detection aims to find data points located in low-probability regions of the data distribution. In the univariate normal distribution model, the degree of anomaly can be determined by the ratio of a point's distance from the mean to the standard deviation. Starting from this idea, we can calculate the degree of anomaly in each dimension of a multivariate probability distribution and finally determine an overall anomaly score.
In each dimension, the data can be arranged from small to large to construct an empirical cumulative distribution function.
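A minimal sketch of this step: sort each dimension once, then read the empirical cumulative distribution function off the sorted order. The function name `ecdf` and the strictly-below convention are illustrative assumptions.

```python
from bisect import bisect_left

def ecdf(sorted_values, x):
    """Fraction of observations strictly below x (an empirical CDF variant).

    bisect_left on a sorted list returns the number of elements < x,
    so the lookup is O(log n) after a one-time O(n log n) sort.
    """
    return bisect_left(sorted_values, x) / len(sorted_values)

data = [5.0, 1.0, 3.0, 2.0, 4.0]
s = sorted(data)          # arrange from small to large, as described above
print(ecdf(s, 3.0))       # 2 of 5 values lie below 3.0 -> 0.4
```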

Compute the Quantiles
In the dataset X ∈ R^{n×d}, X_i (1 ≤ i ≤ n) denotes a data sample, and X^j (1 ≤ j ≤ d) denotes the j-th dimension of X. We therefore use X_i^j for the j-th entry of X_i. According to Formula (1), we compute the quantile of X_i^j through Formula (3),
where I{·} is an indicator function that is 1 when its argument is true and 0 otherwise.
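A direct translation of this indicator-based quantile into Python (the helper name `quantile` is ours; ties are handled with the strict `<` of the indicator function):

```python
def quantile(column, x):
    """(1/n) * sum of I{v < x}: the fraction of entries in one
    dimension that are strictly smaller than x, in the spirit of
    Formula (3)."""
    n = len(column)
    return sum(1 for v in column if v < x) / n

col = [7, 2, 9, 4, 4]            # one dimension of the dataset
print(quantile(col, 9))          # 4 of 5 entries below 9 -> 0.8
print(quantile(col, 2))          # no entry below 2 -> 0.0
```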

Compute the Skewness Coefficient
According to Formula (2), we compute the skewness coefficient of X^j (1 ≤ j ≤ d) through Formula (4), where μ_j is the mean of the j-th feature.

Obtain the Outlier Scores
Finally, we obtain an outlier score for each X_i through Formula (5).
We use γ_j as the weighting factor for the j-th dimension when calculating the anomaly score of each data point, and o_ij to represent the abnormality degree of X_i in the j-th dimension.
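From this description (γ_j as a per-dimension weight, o_ij as the per-dimension abnormality degree), one consistent reading of Formula (5) is a skewness-weighted sum; this is a hedged reconstruction, since the typeset equation is not reproduced here:

```latex
O_i = \sum_{j=1}^{d} \gamma_j \, o_{ij} \qquad \text{(5)}
```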

Pseudocode of ISOD
Based on the above steps, the pseudocode of the ISOD algorithm is given in Algorithm 1.

Algorithm 1: ISOD
Input: X = (x_ij)_{n×d} with n samples and d features
Output: Outlier scores {O_1, O_2, ..., O_n}
1. for each dimension j, 1 ≤ j ≤ d:
2.     calculate the quantile of each data point in this dimension
3.     calculate the skewness coefficient for this dimension
4. end for
5. for each data point X_i, 1 ≤ i ≤ n:
6.     calculate the anomaly score for each dimension
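The pipeline described above can be sketched in Python as follows. This is a minimal illustration, not the paper's reference implementation: the mid-rank tie handling of the quantile, the use of |γ_j| as the weight, and the per-dimension score o_ij = |p − 0.5| are all assumptions made for the sketch.

```python
import math

def isod_scores(X):
    """Hedged sketch of the ISOD pipeline.

    Per dimension j: compute each sample's quantile from the empirical
    CDF (mid-rank tie handling is an assumption), compute a skewness
    coefficient gamma_j, and use its magnitude to weight a per-dimension
    abnormality o_ij (here |p - 0.5|, the quantile's distance from the
    median; an illustrative stand-in for the paper's exact formulas).
    """
    n, d = len(X), len(X[0])
    scores = [0.0] * n
    for j in range(d):
        col = [row[j] for row in X]
        mean = sum(col) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
        skew = sum((v - mean) ** 3 for v in col) / (n * std ** 3) if std else 0.0
        gamma = abs(skew)                      # assumption: |skewness| as weight
        for i in range(n):
            less = sum(1 for v in col if v < col[i])
            equal = col.count(col[i])
            p = (less + 0.5 * equal) / n       # empirical quantile of X_i^j
            scores[i] += gamma * abs(p - 0.5)  # weighted per-dimension score
    return scores

# toy data: the last point is far from the tight cluster in both dimensions
X = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0], [1.0, 2.0], [8.0, 9.0]]
scores = isod_scores(X)
print(scores.index(max(scores)))  # -> 4 (the outlier receives the highest score)
```

Note that the two loops make this sketch O(n²d); with a one-time sort per dimension, the quantile lookups can be made cheaper, in line with the complexity discussion in Section 3.4.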

Time Complexity Analysis
According to Formulas (3) and (4), calculating the quantiles and skewness coefficients for all d dimensions using n samples leads to O(nd) time complexity. Similarly, according to Formula (5), calculating the anomaly score over d dimensions and n samples also leads to O(nd) time complexity. Therefore, the overall time complexity of ISOD is O(nd).

Interpretability
Interpretability is an important aspect of the practical application of anomaly detection. In network attack detection, for example, finding an anomaly is as important as identifying its cause. An algorithm with high interpretability has greater reliability: it provides not only a result but also the reason(s) behind that result, which helps improve system performance and assists in decision making. Therefore, interpretability is very important in the application of anomaly detection.
As can be seen from Formula (5), the ISOD algorithm aggregates the anomaly degrees in each dimension to determine the final anomaly score. Where necessary, we can report the per-dimension anomaly degrees of an anomalous data point, which helps an expert further identify the dimension(s) in which the anomaly occurs. This takes anomaly detection from a "black box" to a "white box".

Sensitivity Analysis
As can be seen from the description of the algorithmic process in Section 3.3 above, the ISOD algorithm independently calculates the skewness coefficient of each dimension as a weight to be combined with the quantiles in that dimension. It therefore imposes no special requirements on the distribution of the data and tolerates slight data noise and varying percentages of outliers. We can thus say that the ISOD algorithm is a robust anomaly detection algorithm that is insensitive to data noise, a property that has a positive impact on its practical application.

Hyperparameter-Free and Unsupervised
The ISOD algorithm is an easy-to-understand unsupervised anomaly detection algorithm with the following advantages: (1) The ISOD algorithm is a statistics-based algorithm that calculates the anomaly degree in each dimension and aggregates them to obtain a final anomaly score for each sample. Therefore, the algorithm has no hyperparameters, and no parameter tuning is required. (2) The algorithm is unsupervised and does not need a large amount of labeled data for training, which gives it high interpretability and, at the same time, lays a better foundation for its practical application.

ROC (Receiver Operating Characteristic)
The receiver operating characteristic (ROC) curve is frequently used for evaluating the performance of binary classification algorithms. It provides a graphical representation of a classifier's performance rather than a single value like most other metrics. The closer the area under the ROC curve is to 1, the more effective the detection model is; a value equal to or lower than 0.5 means that the detection model has no practical value.

AP (Average Precision)
Another way to evaluate outlier detection models is to use the average precision (AP). The AP measures the average precision across all possible thresholds, with a higher value indicating a better model. The AP is more suitable for outlier detection problems with rare anomalies or imbalanced data, as it focuses more on the positive class (anomalies) than the negative class (normal instances). However, it may not reflect the overall accuracy or specificity of the model, as it does not account for the true negatives. Evaluating outlier detection models can be challenging, especially when there are no labels or ground truth data to compare with. One possible way to evaluate outlier detection models is external validation, which involves comparing the results with other sources of information, such as domain experts, feedback, or historical data.
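Both metrics can be computed from scratch. The sketch below uses the rank-based probabilistic interpretation of ROC AUC and the standard running-precision form of AP; function names and the toy labels/scores are illustrative.

```python
def roc_auc(labels, scores):
    """ROC AUC via its probabilistic interpretation: the chance that a
    randomly chosen anomaly (label 1) is scored above a randomly chosen
    normal point (label 0), counting ties as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """AP: precision averaged at the rank of each true anomaly,
    scanning detections from highest to lowest score."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

y = [0, 0, 1, 0, 1]              # 1 = anomaly
s = [0.1, 0.4, 0.9, 0.2, 0.8]    # outlier scores
print(roc_auc(y, s))             # perfect ranking -> 1.0
print(average_precision(y, s))   # -> 1.0
```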
Overall, 30% of the data in the experiments was reserved for testing, while the remaining 70% was used for training. The area under the receiver operating characteristic (ROC) curve and the average precision (AP), averaged over ten separate trials, are used to assess performance.

Experimental Environment and Baselines
In the subsequent experiments, a Windows personal computer with an AMD Ryzen 7 5800H CPU and 16 GB of memory was used.

Dataset
To validate the effectiveness of the proposed method, we conducted a series of comparative experiments on ten real-world datasets of different types and sizes. They were collected from several domains and are available on the ODDS website (https://odds.cs.stonybrook.edu/, accessed on 20 October 2023). These 10 datasets have been frequently used by researchers to evaluate the performance of anomaly detection methods.
Table 1 shows the 10 datasets from the ODDS website with the highest dimensions, which were selected for our study.

Experimental Results
In this section, we give the experimental results of ISOD on the benchmark datasets in Tables 2 and 3. The highest ROC or AP score is marked in bold, which means that the algorithm achieved the best performance on that dataset.

Analysis of Experimental Results
The proposed ISOD algorithm achieved the best performance, with an average ROC of 0.813 and an average precision of 0.75. As shown in Table 2, the ISOD algorithm achieved the highest ROC on 6 of the 10 datasets. Additionally, as shown in Table 3, the ISOD algorithm achieved the highest AP (average precision) on 6 of the 10 datasets.
It is worth noting that, by analyzing the data in Tables 2 and 3, it can be found that the higher the data dimensionality, the better the results the ISOD algorithm achieves, as exemplified by the results for the Speech, Satellite, and Arrhythmia datasets. This confirms that the ISOD algorithm has low time complexity and good performance when working with high-dimensional data, as noted in Section 3.4.1.

Additional Experimental Results and Analysis of Running Time
To further test the scalability of the ISOD algorithm, its running time on the 10 datasets mentioned above was measured, and the results are represented as a scatter plot in Figure 1. In this figure, the horizontal axis represents the size of the dataset, the vertical axis represents the dimensionality of the data, and the dot size represents the running time of the ISOD algorithm on that dataset. The larger the dot, the longer the running time.

Conclusions
In this article, we proposed an effective unsupervised outlier detection method based on quantiles and skewness coefficients called ISOD. ISOD can be divided into three main stages: (1) constructing the empirical cumulative distribution function; (2) computing the quantiles and skewness coefficients of each dimension; (3) summarizing the degree of anomaly in each dimension to ultimately obtain the outlier score for each data point.
The experimental results derived from applying the ISOD algorithm to 10 benchmark datasets show that the ISOD method delivers highly competitive and promising performance compared with state-of-the-art baseline anomaly detection algorithms. In addition to achieving better experimental results, the ISOD algorithm also has high interpretability and scalability, as explained in Section 4.
Based on Sections 3.4 and 4.3.1, it is clear that the ISOD algorithm does not require labeled data and that it is an unsupervised anomaly detection algorithm. At the same time, it has good scalability and can obtain good performance with ultra-high-dimensional datasets. Finally, this algorithm is theoretically guaranteed to have high interpretability. Although Figure 1 does not provide specific running times, by comparing the sizes of the dots, we can see that datasets with more samples or higher dimensionality lead to longer running times for the ISOD algorithm, which confirms the complexity analysis presented earlier.



Figure 1. The running times of the ISOD algorithm on the 10 benchmark datasets (larger dots mean longer running times).


Table 2. ROC scores in terms of outlier detector performance (the highest ROC scores are marked in bold).

Table 3. Average precision (AP) scores in terms of outlier detector performance (the highest AP scores are marked in bold).