Online Multivariate Anomaly Detection and Localization for High-dimensional Settings

This paper considers the real-time detection of anomalies in high-dimensional systems. The goal is to detect anomalies quickly and accurately so that the appropriate countermeasures can be taken in time, before the system is possibly harmed. We propose a sequential and multivariate anomaly detection method that scales well to high-dimensional datasets. The proposed method follows a nonparametric, i.e., data-driven, and semi-supervised approach, i.e., it trains only on nominal data. Thus, it is applicable to a wide range of applications and data types. Thanks to its multivariate nature, it can quickly and accurately detect challenging anomalies, such as changes in the correlation structure and stealth low-rate cyberattacks. Its asymptotic optimality and computational complexity are comprehensively analyzed. In conjunction with the detection method, an effective technique for localizing the anomalous data dimensions is also proposed. We further extend the proposed detection and localization methods to a supervised setup where an additional anomaly dataset is available, and combine the proposed semi-supervised and supervised algorithms to obtain an online learning algorithm under the semi-supervised framework. The practical use of the proposed algorithms is demonstrated in DDoS attack mitigation, and their performance is evaluated using a real IoT-botnet dataset and simulations.


I. INTRODUCTION
Anomaly detection is an important problem dealing with the detection of abnormal data patterns [1]. It has applications in a variety of domains, such as cybersecurity [2], healthcare [3], quality control, etc. The importance of anomaly detection lies in the fact that an anomaly in the observation data may be a sign of an unwanted event, such as a failure or malicious activity, in the underlying system. Therefore, accurate detection of such data patterns allows proper countermeasures to be taken by the domain specialist to counteract any possible harm. To name a few examples, an anomaly in an MRI image could be due to the presence of a malignant tumor in the brain, and anomalous observations in network traffic data could mean that the network is under a cyber-attack.
Advances in various technologies, such as Internet-of-Things (IoT) devices and sensors, and wireless communications, have enabled the real-time monitoring of systems for detecting events of interest. In many modern and complex systems, such as IoT networks, network-wide traffic monitoring systems, environmental monitoring systems, etc., massive amounts of heterogeneous data are generated, which require real-time processing for the timely detection of anomalous events. As an example, automated vehicles and advanced driver-assistance systems today are equipped with modules comprising a large number of sensors and actuators for control and safety purposes. Due to the catastrophic consequences of any fault in perceiving the environment or failure in a component of the system, as well as of being compromised by hackers, it is crucial to preserve the robustness of the vehicle. To this end, the high-dimensional measurements from sensors need to be monitored and analyzed in real time to detect anomalies such as a sudden increase in speed, abnormal petrol consumption, anomalies in radar sensors and camera sensing, etc. [4]. Accurate and lightweight anomaly detection methods that scale well to large systems are needed to address such big data challenges in real time.
Anomaly detection methods for univariate data streams have been studied thoroughly in the literature. However, little work has been done on multivariate anomaly detection, which has the potential to achieve quicker and more accurate detection than univariate anomaly detection by capturing more anomaly evidence in the interactions between system dimensions. Statistical approaches to anomaly detection assume an anomaly to be a change in the probability distribution of the observations, such as a change in the mean, variance, or correlation structure between the data streams. One important application for detecting changes in the correlation structure is finance, where the correlation structure between high-dimensional processes modeling exchange rates and market indexes is important for the right choice of asset allocation in a portfolio [5]. Furthermore, in social networks, it is important to detect abrupt changes in interactions between the nodes; and in communication networks, it is of interest to detect highly correlated traffic in a network [6]. Distributed Denial of Service (DDoS) attacks on the power grid through the synchronous switching on/off of high-wattage IoT devices is another example where the anomaly is manifested in correlations [7]. Detecting a change in the correlation structure requires the joint monitoring and multivariate analysis of the data streams, which in turn leads to the high-dimensionality challenge. To overcome this challenge, an anomaly detection technique needs to scale to high-dimensional data in real time.
While in some systems, such as fraud detection, detecting the anomaly may be the ultimate goal, in many scenarios, such as diagnosis systems (e.g., spacecraft monitoring systems [8]) and cybersecurity, it is highly important to provide a degree of interpretation about the detected issue in the system and how to mitigate it. Considering the potential damage caused by failure to mitigate unexpected behaviors, such as cyberattacks, detecting anomalies without providing any further information explaining where the anomaly has happened is of limited value to the engineers.
Motivated by the aforementioned challenges, we investigate an online multivariate anomaly detection and localization technique which is simple enough to handle high-dimensional and heterogeneous data in real-time.

A. Related Works
The problem of anomaly detection has been an important subject of study in several research communities, such as statistics, signal processing, machine learning, information theory, and data mining, either specifically for an application domain or as a generic method. To name a few, an SVM classification approach for anomaly detection was proposed in [9]; several information theoretic measures were proposed in [10] for the intrusion detection problem; and two new information metrics for DDoS attack detection were introduced in [2]. Due to the challenging nature of the problem, and considering the challenges posed by today's technological advances such as big data problems, there is still a need for reconsidering the anomaly detection problem.
Sequential anomaly detection techniques, compared to outlier detection techniques [1], also take the history of observations into account rather than only the most recent observations. Sequential techniques are more suitable for real-time systems where timely and accurate detection of anomalies is important. The Cumulative Sum (CUSUM) detector [11] is a well-known sequential change detection technique that assumes probabilistic models for nominal and anomalous data points, computes the cumulative log-likelihood ratio (LLR) over time, and declares an anomaly if the statistic exceeds a predefined threshold. The accuracy of the assumed models, as well as of the estimated parameters, is the key factor in the performance of CUSUM and, more generally, of parametric methods. CUSUM is minimax optimal under the condition that the probability distributions before and after the change are completely known [12]. However, in many real-world applications, having a priori knowledge about the underlying distributions is not possible. Estimating the probability distributions quickly becomes intractable for high-dimensional data, which involves many unknowns, such as the anomaly onset time and the subset of anomalous dimensions, in addition to the parameters of the nominal and anomalous models. To tackle this complexity, [13] proposed a relaxed version of CUSUM in which each data stream is assumed to be independent of the others. However, this univariate method is not suitable for detecting changes in the correlation between data streams. A sequential test for detecting changes in the correlation between variables in high-dimensional data streams, as well as localizing the highly correlated variables, has been proposed in [14]. This is a parametric method based on the assumption that the observed vectors are multivariate Gaussian distributed. It is proposed solely for the detection of correlation changes between data streams and does not generalize to other changes in the distribution. In this paper, we are interested in detecting general changes in unknown distributions, including changes in the correlation structure.
k-nearest-neighbor (kNN) distance-based methods are geometric methods built on the assumption that anomalous data instances occur far from the nominal instances. For instance, [15] and [16] proposed nonparametric outlier detection techniques based on the minimum volume set (MVS) of the nominal data. The MVS corresponds to the region of greatest probability density with minimum data volume and is known to be useful for anomaly detection [17] under the assumption that anomalies occur in the less concentrated regions of the nominal dataset. These nonparametric outlier detection methods estimate the MVS of the nominal training samples using kNN graphs, and declare a data point anomalous if it lies outside the MVS. Despite being scalable to high-dimensional and heterogeneous data, they do not consider the temporal anomaly information, and thus are prone to higher false alarm rates compared to sequential anomaly detection methods. Similarly, [18] proposed a kNN graph-based method that computes an anomaly score for each observation and declares an anomaly by thresholding the score value. In this paper, as opposed to outlier detection methods which treat a single outlier as an anomaly, we consider an anomaly to consist of persistent outliers and investigate the sequential and nonparametric detection of such anomalies using the temporal information in data streams. Recently, [19] proposed a nonparametric kNN-based sequential anomaly detection method for multivariate observations. This method computes its test statistic based on the number of kNN edges at different splitting points within a window and stops the test whenever the test statistic exceeds a threshold. Due to its window-based nature, this method has inherent limitations in achieving small detection delays. It also recomputes the kNN graphs at every time instance and for every splitting point; therefore, its computational complexity is not suitable for real-time applications. In another recent work, [20] proposed a distance-based and CUSUM-like change detection method for attributed graphs. Attributed graphs are first mapped into numeric vectors, and then the distance between the mean response of an observation window and the mean response of the training data is computed via a CUSUM-like sequential algorithm. In addition to the limitations arising from the window-based nature of the method, the local relations between samples are disregarded by considering only the mean response of the training set. As a result, in cases where the training data has a multimodal distribution, this method will not be effective. In contrast to [20], we take into account the local relations between the data instances.

B. Contributions
In this paper, aiming at the timely and accurate detection of anomalies in high-dimensional systems, we propose two variations of a kNN-based sequential anomaly detection method, as well as a unified framework that combines the advantages of both methods. In summary, our contributions in this paper are as follows:
• A framework for multivariate, data-driven, and sequential detection of anomalies in high-dimensional systems is proposed for both semi-supervised and supervised settings, depending on the availability of labeled data. Combining the advantages of the supervised and semi-supervised settings, we further introduce an online learning scheme which can effectively detect both known and unknown anomaly types by incorporating the newly detected anomalies into the training set.
• Asymptotic optimality of the proposed detection methods in the minimax sense is shown, and a comprehensive analysis of the computational complexity is provided.
• An anomaly localization technique to identify the problematic data dimensions is also proposed based on the proposed detection methods.
• The practicality of the proposed anomaly detection and localization methods is demonstrated on mitigating DDoS attacks through simulations and a real dataset.

C. Organization and Notations
The rest of the paper is organized as follows. In Section II, the mathematical formulation of the considered anomaly detection problem and the relevant background information are provided. We present the proposed anomaly detection and localization methods in Sections III and IV. Section V presents the application of our proposed methods in DDoS attack mitigation. Finally, we conclude the paper in Section VI.
Vectors and matrices are represented by boldface lowercase and uppercase letters, respectively. Script letters denote sets, e.g., $\mathcal{X}$. Vectors are organized in a column unless otherwise stated. Probability and expectation are denoted with $\mathsf{P}$ and $\mathsf{E}$, respectively.

II. PROBLEM FORMULATION
Suppose that a system is observed through d-dimensional observations $\mathcal{X}_t = \{x_1, x_2, \ldots, x_t\}$ in time. The objective is to detect an anomaly occurring at an unknown time τ as soon as possible while satisfying a false alarm constraint. This problem can be formulated as a change detection problem as follows:
$$x_t \sim \begin{cases} f_0, & t < \tau, \\ f_1, & t \geq \tau, \end{cases} \qquad (1)$$
where f is the true probability distribution of the observations, and $f_0$ and $f_1$ are the nominal and anomaly probability distributions, respectively. The objective of the problem is to find the stopping time T that minimizes the average detection delay while satisfying a false alarm constraint, i.e.,
$$\min_T \; \mathsf{E}_\tau\big[(T-\tau)^+\big] \quad \text{subject to} \quad \mathsf{E}_\infty[T] \geq \beta,$$
where $\mathsf{E}_\tau$ represents the expectation given that the change occurs at τ, $(\cdot)^+ = \max(\cdot, 0)$, and $\mathsf{E}_\infty$ denotes the expectation given that no change occurs, i.e., the expectation of the false alarm period.
Lorden's minimax problem is a commonly used version of the above problem [21], in which the goal is to minimize the worst-case average detection delay subject to a false alarm constraint:
$$\min_T \; \sup_{\tau} \; \operatorname*{ess\,sup} \; \mathsf{E}_\tau\!\left[(T-\tau)^+ \,\middle|\, \mathcal{X}_\tau\right] \quad \text{subject to} \quad \mathsf{E}_\infty[T] \geq \beta, \qquad (2)$$
where "ess sup" denotes the essential supremum, which is equivalent to the supremum in practice. In simple words, the minimax criterion minimizes the average detection delay for the least favorable change-point and the least favorable history of measurements up to the change-point, while the average false alarm period is lower bounded by β.
The CUSUM test provides the optimum solution to the minimax problem [12], given by
$$T_c = \inf\{t : S_t \geq h_c\}, \qquad (3)$$
$$S_t = \max\{0, S_{t-1} + \ell_t\}, \quad S_0 = 0, \qquad (4)$$
where $T_c$ is the stopping time, $\ell_t = \log\frac{f_1(x_t)}{f_0(x_t)}$ is the log-likelihood ratio at time t, and $h_c$ is a decision threshold, selected in a way to satisfy a given false alarm constraint. Considering $\ell_t$ as statistical evidence for an anomaly, the CUSUM algorithm keeps accumulating it, and declares an anomaly the first time the accumulated evidence $S_t$ exceeds the threshold $h_c$, which is chosen sufficiently large for reliable detection. CUSUM requires complete knowledge of the probability distributions $f_0$ and $f_1$. However, in real-world applications, the true probability distributions are typically unknown. Even when $f_0$ and $f_1$ are known up to their parameters, and the parameters are estimated using the maximum likelihood approach, the resulting procedure, known as Generalized CUSUM (G-CUSUM), achieves only asymptotic optimality. Moreover, CUSUM, and parametric methods in general, are limited to the detection of certain anomaly types whose true probability distribution matches the assumed $f_1$ well.
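To make the recursion concrete, the following is a minimal Python sketch of (3)-(4) under the idealized assumption that $f_0$ and $f_1$ are completely known univariate Gaussians; the densities, threshold, and change scenario are illustrative, not part of the proposed method.

```python
import numpy as np
from scipy.stats import norm

def cusum(x, f0, f1, h_c):
    """Run CUSUM over x; return (stopping index or None, statistic path)."""
    S, path = 0.0, []
    for t, xt in enumerate(x):
        ll = np.log(f1.pdf(xt)) - np.log(f0.pdf(xt))  # log-likelihood ratio l_t
        S = max(0.0, S + ll)                          # CUSUM recursion (4)
        path.append(S)
        if S >= h_c:                                  # stopping rule (3)
            return t, path
    return None, path

# illustrative example: mean shift from 0 to 1 at t = 100
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(1, 1, 100)])
T, _ = cusum(x, norm(0, 1), norm(1, 1), h_c=10.0)
```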
In high-dimensional problems that require multivariate analysis, estimating the nominal probability distribution is typically not tractable, especially when the data dimensions are heterogeneous, e.g., environmental sensor data consisting of wind speed, direction, air temperature, pressure, humidity, weather condition (whether it is rainy, sunny, or cloudy), etc. Considering the wide range of possible anomalies, it is even more intractable to estimate the anomaly probability distribution. In such problems, knowing the probability distributions and their parameters is highly complicated, if not impossible, limiting the applicability of CUSUM and of parametric methods in general.

III. PROPOSED DETECTION METHODS
We recently proposed a kNN-based sequential anomaly detection method called Online Discrepancy Test (ODIT) [22], and applied it to cyber-attack detection in smart grid [23] and in intelligent transportation systems [24]. In this section, we (i) first elaborate on the motivation behind ODIT, (ii) then present a modification for ODIT to prove its asymptotic optimality in the minimax sense under certain conditions, (iii) extensively analyze its computational complexity, (iv) propose an extension of ODIT for the cases where training data is available for some anomaly settings, (v) introduce a unified framework for the proposed ODIT detectors, (vi) and finally provide a simulation study to exemplify the timely and accurate detection by the proposed detectors under a challenging scenario in which univariate methods fail.
The rationale behind using the kNN distance for anomaly detection is the similarity between the inverse kNN distance and the likelihood. Specifically, for $f(x_i) \geq f(x_j)$, $x_i, x_j \in \mathcal{X}$, it is expected that the distance $g_k(x_i)$ of $x_i$ to its kth nearest neighbor in $\mathcal{X}$ is smaller than that of $x_j$. This probability increases with the size of $\mathcal{X}$, i.e., $\lim_{|\mathcal{X}| \to \infty} \mathsf{P}(g_k(x_i) \leq g_k(x_j)) = 1$. This in turn provides grounds for using the difference of kNN distances in ODIT to approximate the log-likelihood ratio $\ell_t$.
The similarity between the likelihood of data points and the inverse kNN distance is shown in Fig. 1 for several distributions. We consider Gaussian, Poisson, and multinomial distributions to illustrate the similarity of $1/g_k(x)$ and $f(x)$ for three disparate data types: real-valued numeric, integer-valued numeric, and categorical, respectively. The inverse kNN distance graphs are scaled down to match the likelihood figure for the purpose of visualization. As shown in Fig. 1(a) with $|\mathcal{X}| = 10^6$, the inverse of the kNN distance approximates the likelihood very well for the standard Gaussian random variable. Despite some discrepancy for the Poisson and multinomial cases due to the discreteness of these random variables, it may still serve the purpose of approximating the log-likelihood ratio. For these discrete cases, to avoid zero kNN distances we consider a much smaller number of data points, 10 and 50 for Poisson and multinomial, respectively. Fig. 1(b) and (c) are obtained by averaging over $5 \times 10^5$ and $10^4$ trials, respectively. In order to show the similarity for a more complex distribution, in Fig. 1(d) we consider a two-dimensional vector of a categorical random variable and a real-valued random variable with arbitrary distribution and $10^4$ data points.
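This relation is easy to reproduce numerically; the following minimal sketch (our illustration, with an assumed standard Gaussian sample and arbitrary k) shows how the inverse kNN distance tracks the likelihood:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal(100_000)          # nominal sample, f0 = N(0, 1)

def knn_dist(x, data, k=5):
    """Distance from x to its kth nearest neighbor in data (1-D case)."""
    return np.sort(np.abs(data - x))[k - 1]

# 1/g_k(x) tracks the likelihood: large near the mode, small in the tails
for x in [0.0, 1.0, 3.0]:
    print(x, 1.0 / knn_dist(x, X))
```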

A. Online Discrepancy Test (ODIT)
The overview of the ODIT detector is given in Fig. 2. In the training phase, given a training set $\mathcal{X}_N$ consisting of N nominal data instances, ODIT firstly partitions $\mathcal{X}_N$ into two sets $\mathcal{X}_{N_1}$ and $\mathcal{X}_{N_2}$, where $N_1 + N_2 = N$, for computational efficiency, as in the bipartite GEM algorithm [16]. Then, using the kNN distances $\{g_k(x_m)\}$ between each point $x_m \in \mathcal{X}_{N_1}$ and its k nearest neighbors in $\mathcal{X}_{N_2}$, ODIT finds an estimate $\hat{\Omega}_\alpha$ for the minimum volume set (MVS) $\Omega_\alpha$, given by
$$\Omega_\alpha = \arg\min_{\Omega} \left\{ \text{Vol}(\Omega) : \mathsf{P}(x \in \Omega \mid f_0) \geq 1 - \alpha \right\}, \qquad (5)$$
where $\alpha \in (0, 1)$ is a significance level, e.g., 0.05. $\Omega_\alpha$ represents the most compact set of observations under nominal operation, while its complement $\bar{\Omega}_\alpha$ corresponds to the tail events (i.e., outliers) under nominal operation at significance level α. Then, in the test phase, ODIT compares the kNN distance $g_k(x)$ between a test data instance x and its k nearest neighbors in $\mathcal{X}_{N_2}$ with $\hat{\Omega}_\alpha$ to compute a negative/positive anomaly evidence for x, and accumulates this evidence over time for reliable detection. Roughly, the greater $g_k(x)$ is, the less likely it is that x comes from the same distribution $f_0$ as the nominal points.
The estimate $\hat{\Omega}_\alpha$ provides a reference to evaluate $g_k(x)$ and compute the negative/positive anomaly evidence for x. Specifically, in the training phase, to estimate $\Omega_\alpha$, ODIT computes for each point $x_m \in \mathcal{X}_{N_1}$ the total kNN distance
$$L_m = \sum_{n=k-s+1}^{k} g_n(x_m)^\gamma, \qquad (6)$$
where $g_n(x_m)$ is the Euclidean distance between the point $x_m \in \mathcal{X}_{N_1}$ and its nth nearest neighbor in $\mathcal{X}_{N_2}$, $s \in [1, k]$ is a fixed number introduced for convenience, and $\gamma > 0$ is a weight. It then ranks the points in $\mathcal{X}_{N_1}$ in ascending order of $L_m$ and picks the first K points, whose kNN distances define $\hat{\Omega}_\alpha$. Hence, to match the significance level α, K is chosen as $K = \lceil N_1 (1-\alpha) \rceil$. In the test phase, for each data instance $x_t$, ODIT firstly computes the total distance $L_t$ with respect to the second training set $\mathcal{X}_{N_2}$ as in (6). Then, it computes the anomaly evidence, which can be either positive or negative, by comparing $L_t$ with the MVS model found in the training phase through the borderline total distance $L^{(K)}$ (the Kth smallest $L_m$):
$$D_t = d\left(\log L_t - \log L^{(K)}\right), \qquad (7)$$
where d is the number of data dimensions. Finally, it updates a detection statistic $\Delta_t$ which accumulates the anomaly evidence $D_t$ over time, and raises an anomaly alarm the first time $\Delta_t$ crosses a predefined threshold h, which is a CUSUM-like procedure (cf. (4)):
$$\Delta_t = \max\{0, \Delta_{t-1} + D_t\}, \quad \Delta_0 = 0, \qquad T = \inf\{t : \Delta_t \geq h\}. \qquad (8)$$
The ODIT procedure is summarized in Algorithm 1.

Algorithm 1
The proposed ODIT procedure
1: Input: $\mathcal{X}_N$, k, s, α, h
2: Initialize: ∆ ← 0, t ← 1
3: Training phase:
4: Partition $\mathcal{X}_N$ into two sets $\mathcal{X}_{N_1}$ and $\mathcal{X}_{N_2}$
5: For each $x_m \in \mathcal{X}_{N_1}$ compute $L_m$ as in (6)
6: Find $L^{(K)}$ by selecting the Kth smallest $L_m$
7: Test phase:
8: while ∆ < h do
9:   Get new data $x_t$ and compute $D_t$ as in (7)
10:  ∆ ← max{0, ∆ + $D_t$}, t ← t + 1
11: end while
12: Declare Anomaly

The computation of the anomaly evidence $D_t$ for each test instance $x_t$ has the simpler form $D_t = L_t - L^{(K)}$ in [22], where we proposed ODIT for the first time. Although this simpler form of $D_t$ and the form proposed in (7) have similar structures, and they perform quite similarly in practice, the new form given in (7) naturally appears while proving the asymptotic optimality of ODIT in the minimax sense, as shown next.
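For illustration, a simplified Python sketch of Algorithm 1 is given below. It is a minimal rendering, not a reference implementation: it assumes s = k in the total distance (6), uses the evidence form (7), and the partitioning ratio and default parameter values are illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree

class ODIT:
    def __init__(self, X_train, k=4, alpha=0.05, gamma=1.0, h=20.0):
        self.k, self.gamma, self.h = k, gamma, h
        N = len(X_train)
        N1 = int(0.38 * N)                          # partitioning ratio from the experiments
        X1, self.X2 = X_train[:N1], X_train[N1:]
        self.tree = cKDTree(self.X2)                # second training set, used for kNN queries
        self.d = X_train.shape[1]
        L = self._total_dist(X1)                    # L_m for each x_m in X_N1, eq. (6)
        K = int(np.ceil(N1 * (1 - alpha)))          # MVS size at significance level alpha
        self.LK = np.sort(L)[K - 1]                 # borderline total distance L^(K)
        self.delta = 0.0
    def _total_dist(self, X):
        g, _ = self.tree.query(np.atleast_2d(X), k=self.k)   # kNN distances (s = k)
        return (g ** self.gamma).sum(axis=1)
    def update(self, x_t):
        """Process one observation; return True when an anomaly is declared."""
        D_t = self.d * (np.log(self._total_dist(x_t)[0]) - np.log(self.LK))  # eq. (7)
        self.delta = max(0.0, self.delta + D_t)                              # eq. (8)
        return self.delta >= self.h
```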
Theorem 1. When the nominal distribution $f_0(x_t)$ is finite and continuous, and the attack distribution $f_1(x_t)$ is a uniform distribution, then as the training set grows, the ODIT statistic $D_t$ converges in probability to the log-likelihood ratio, i.e., ODIT converges to CUSUM, which is minimax optimal in minimizing the expected detection delay while satisfying a false alarm constraint.
Proof: Consider a hypersphere $S_t \subset \mathbb{R}^d$ centered at $x_t$ with radius $g_k(x_t)$, the kNN distance of $x_t$ with respect to the training set $\mathcal{X}_{N_2}$. The maximum likelihood estimate for the probability of a point being inside $S_t$ under $f_0$ is given by $k/N_2$. It is known that, as the total number of points grows, this binomial probability estimate converges to the true probability mass in $S_t$ in the mean square sense [25], i.e.,
$$\frac{k}{N_2} \to \int_{S_t} f_0(x)\,dx \approx f_0(x_t)\, V_d\, g_k(x_t)^d \quad \text{as } N_2 \to \infty,$$
where $V_d$ is the volume of the unit hypersphere in $\mathbb{R}^d$, and the approximation is due to $f_0$ being continuous and $g_k(x_t)$ shrinking as $N_2$ grows. Hence, for γ = 1 and s = 1, $L_t = g_k(x_t)$ and $d \log g_k(x_t) \to \log\frac{k}{N_2 V_d f_0(x_t)}$ in probability. Applying the same argument to the borderline point defining $L^{(K)}$, whose likelihood level corresponds to the uniform anomaly density $f_1$, the ODIT statistic $D_t = d(\log L_t - \log L^{(K)})$ converges in probability to $\log\frac{f_1(x_t)}{f_0(x_t)}$, the log-likelihood ratio. For γ values different than 1, $D_t$ converges to the log-likelihood ratio scaled by γ.
Note that ODIT does not train on any anomalous data, i.e., it does not use any knowledge of the anomaly to be detected. While this generality is an attractive trait, as it allows the detection of any statistical anomaly, it also inevitably limits the performance for known anomaly types on which detectors can train. We will extend ODIT to the case of available anomaly information in Section III-C. In Theorem 1, we show that in the absence of knowledge about anomalies, ODIT reasonably assumes an uninformative uniform likelihood for the anomaly case, and achieves asymptotic optimality under this assumption in the CUSUM sense for certain parameter choices.
Remark 1 (Parameter Selection): Due to its sequential nature, the parameters of ODIT directly or indirectly control the fundamental trade-off between minimizing the average detection delay and the false alarm rate. The parameters k and s determine how many nearest neighbors are taken into account in computing the total distance $L_m$, given by (6). A smaller k makes the detector more sensitive to anomalies, hence supporting earlier detection, but at the same time makes it more prone to false alarms due to nominal outliers; a larger k has the opposite effect. s is an auxiliary parameter chosen for further flexibility in this trade-off: s = 1 considers only the kth nearest neighbor, while s = k sums over all of the first k nearest neighbors. Similar to k, a smaller s makes the algorithm more sensitive to anomalies, but also more prone to nominal outliers; however, the effect of s is secondary to that of k. k and s should be chosen together to strike a balance between sensitivity to anomalies and robustness to nominal outliers. The weight $0 < \gamma < d$ determines the emphasis on the difference between distances: large distance values are emphasized by large γ values and suppressed by small γ values. The alarm threshold h in (8) directly controls the trade-off between minimizing detection delay and false alarm rate. Decreasing h yields smaller detection delays, i.e., earlier detection, but also more frequent false alarms; it is typically selected to satisfy a false alarm constraint. The significance level α plays a secondary role supporting h. For fixed h, a larger α results in a smaller estimated MVS $\hat{\Omega}_\alpha$, which in turn results in smaller detection delays, but also more frequent false alarms, since more nominal data points will lie outside the selected MVS. Note that h is the final decision threshold, whereas α is more of an intermediate parameter. Hence, one can always set α to a reasonable significance value, such as 0.05, and then adjust h to satisfy a desired false alarm rate. Regarding the sizes of the training sets, $N_2$ plays a more important role than $N_1$, as shown in Theorem 1. Specifically, $N_2$ determines the accuracy of the likelihood estimates by the kNN distances, whereas $N_1$ determines how well the significance level α, an intermediate parameter as discussed before, is satisfied. Hence, $N_2$ should typically be chosen larger than $N_1$, where $N_1 + N_2 = N$. It should be noted that the ODIT procedure, given by Algorithm 1, can also work without partitioning the training set. Partitioning is proposed for computational efficiency when dealing with large high-dimensional datasets. However, it does not decrease the order of magnitude of the computational complexity (see Section III-B), since even without partitioning the online testing procedure already scales linearly with the number of training instances, as opposed to the bipartite GEM algorithm [16], which uses partitioning to reduce the complexity from exponential to linear. As a result, Algorithm 1 can be used without partitioning the training set, especially for small datasets.
Remark 2 (Graph Interpretation): The training phase of ODIT effectively constructs a bipartite kNN graph $G = (V, E)$, where $V = \mathcal{X}^K_{N_1} \cup \mathcal{X}_{N_2}$ is the set of vertices and E is the set of edges connecting the first K points $\mathcal{X}^K_{N_1}$ to their neighbors in $\mathcal{X}_{N_2}$. The constructed graph G minimizes the total edge length $\sum_{m=1}^{K} L_m$ among all possible K-point kNN graphs between $\mathcal{X}_{N_1}$ and $\mathcal{X}_{N_2}$. The computation of the anomaly evidence $D_t$ in (7) can then be interpreted as the increase/decrease in the log of the total edge length if the K-point kNN graph were to include the test point $x_t$.

Remark 3 (Comparisons):
ODIT learns $\hat{\Omega}_\alpha$ using kNN distances similarly to the outlier detection method called Geometric Entropy Minimization (GEM) [15], [16]. However, in the test phase, unlike GEM, which declares an anomaly even when a single test point falls outside the MVS, ODIT sequentially updates a test statistic $\Delta_t$ using the closeness/remoteness of the test point to the MVS, and declares an anomaly only when $\Delta_t$ is large enough, i.e., when there is enough anomaly evidence with respect to a false alarm constraint. In doing so, ODIT is able to detect persistent anomalies in a timely and accurate manner, as shown theoretically in Theorem 1 and through numerical results in Section III-E and Section V, whereas one-shot outlier detectors like GEM are prone to high false alarm rates due to the limitations of significance tests [26], [27]. The sequential detection structure of ODIT resembles that of CUSUM, albeit with fundamental differences. In fact, the test statistic of ODIT implements a discrepancy function motivated by discrepancy theory [28] and the discrepancy norm [29], hence the name Online Discrepancy Test (ODIT). The nonparametric nature of ODIT does not require any knowledge of the nominal and anomaly probability distributions, as opposed to CUSUM. Moreover, the practical relaxations of CUSUM, such as G-CUSUM and independent CUSUM [13], cannot be applied to challenging scenarios such as high-dimensional systems which require multivariate anomaly detection with little or no knowledge of anomaly types. On the other hand, ODIT scales well to high-dimensional systems for multivariate detection, as discussed next.

B. Computational Complexity
Next, we analyze the computational complexity of the proposed method. The training phase of ODIT requires the kNN distances between each pair of data points in the two training sets. Therefore, the time complexity of the training phase is $O(N_1 N_2 d)$, where d is the data dimensionality. The space complexity of training is $O(N_2 d)$, since the $N_2$ points are stored for testing. Note that training is performed once offline; thus, the complexity of online testing is usually what is critical for scalability. In the test phase, computing the kNN distance of a test point among all points in the second training set takes $O(N_2 d)$ time. The space complexity of testing is not significant, as the test statistic is updated recursively. Consequently, the proposed ODIT algorithm scales linearly with the data dimensionality d in both training and testing. In the online testing phase, it also scales linearly with the number of training points. For high-dimensional systems with abundant training data, the online testing time could be the bottleneck in implementing ODIT.
kNN Approximation: Computing the nearest neighbors of a query point is the most computationally expensive part of the algorithm, as the distance to every other point in the second training set needs to be computed to select the k smallest ones. As the dimensionality increases and the training size grows, the algorithm becomes less efficient in terms of running time. To this end, we propose to approximate the kNN distance rather than computing its exact value. It is natural to expect that ODIT's performance will drop due to the inaccuracy induced by the approximated kNN distances compared to that based on the exact kNN distances. However, depending on the system specifications, e.g., how frequently the data arrives and how critical timely detection is, the reduction in running time through kNN approximation may compensate for the performance loss, as we analyze next through an experiment. [30] proposes a kNN distance approximation algorithm that scales well to high-dimensional data. This algorithm performs hierarchical clustering by constructing a k-means tree, and approximates the kNN distance by performing a priority search in the k-means tree, i.e., by searching for the k nearest neighbors only among a limited number of data points. The computational complexity of constructing the tree is $O(N_2 d C I_{\max} \frac{\log N_2}{\log C})$, where $I_{\max}$ is the maximum number of iterations in k-means clustering, C is the number of clusters (a.k.a. the branching factor), and $\frac{\log N_2}{\log C}$ is the average height of the tree. Using the priority search k-means tree algorithm, the computational complexity of the kNN search reduces to $O(B d \frac{\log N_2}{\log C})$, where $B \ll N_2$ is the maximum number of data points to examine. Hence, the training complexity reduces to $O\big((N_1 B + N_2 C I_{\max})\, d\, \frac{\log N_2}{\log C}\big)$. Note that $B \ll N_2$ and the number of iterations required for convergence is small [30]. More importantly, in online testing, the computational complexity per instance decreases from $O(N_2 d)$ to $O(B d \frac{\log N_2}{\log C})$. Experiment: We experimented with this approximation in our algorithm. The experiment is done in Matlab on an Intel 3.60 GHz processor with 16 GB RAM. In the experiment, the dimensionality of the data is d = 50, the training data size is $N = 5 \times 10^5$, partitioned into $N_1 = 0.38N$ and $N_2 = 0.62N$ (the same partitioning ratio is used in the experiments throughout the paper), and the anomaly is defined as a shift in the mean of Gaussian observations by 3 standard deviations in 10% of the dimensions. We set the branching factor for building the priority search k-means tree as C = 100, and the maximum number of points to examine during the search for the k nearest neighbors as B = 1000. The average computation time for both ODITs, based on the exact and the approximate kNN distance, is summarized in Table I, which presents the time spent for the computation of (7) and (8) per observation. It is seen that the approximation method drops the average running time per observation to about 1/14 of that of the exact method.
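To illustrate the idea of searching only a limited number of candidates, the following is a simplified, flat stand-in (our sketch, not the algorithm of [30]) for the hierarchical priority search k-means tree: training points are clustered offline, and a query examines only roughly B points from its nearest clusters.

```python
import numpy as np
from sklearn.cluster import KMeans

class ApproxKNN:
    def __init__(self, data, n_clusters=100, B=1000):
        self.data, self.B = data, B
        self.km = KMeans(n_clusters=n_clusters, n_init=3, random_state=0).fit(data)
        # indices of the training points belonging to each cluster
        self.members = [np.flatnonzero(self.km.labels_ == c) for c in range(n_clusters)]
    def query(self, x, k):
        """Approximate kNN distances of x, examining ~B closest-cluster points."""
        order = np.argsort(np.linalg.norm(self.km.cluster_centers_ - x, axis=1))
        cand = []
        for c in order:                       # visit clusters nearest-first
            cand.extend(self.members[c])
            if len(cand) >= self.B:
                break
        dists = np.linalg.norm(self.data[cand] - x, axis=1)
        return np.sort(dists)[:k]
```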
To compare the original and efficient ODITs in systems with different sampling rates, we consider the actual detection delay, which accounts for both the number of new samples observed after the anomaly onset and the processing time per sample (Fig. 3). When the sampling period is small compared to the per-sample processing time of the exact method, the computation time dominates the actual detection delay. Therefore, in such a case, approximate kNN computations are preferred over exact kNN computations in terms of the actual detection delay (see the bottom figure in Fig. 4). Whereas for a sufficiently large sampling period, the delay is mainly due to the extra samples, and thus exact kNN computations yield better results in this case, as shown in the top figure in Fig. 4.

Summary of ODIT: Here we highlight the prominent features of the proposed ODIT anomaly detector:
• The sequential nature of ODIT makes it suitable for real-time systems, and especially for systems in which quick and accurate detection is critical. Additionally, as the nominal training set grows, it asymptotically achieves minimax optimality in terms of quick and accurate detection when the anomaly comes from a uniform distribution.
• It is capable of performing multivariate detection in high-dimensional systems, as illustrated in Section III-E, thanks to its nonparametric and scalable nature.
• ODIT can detect unknown anomaly types since it does not depend on any assumption about anomalies. Moreover, it is suitable for online learning such that its detection performance can be improved over time for previously encountered anomaly types (see Section III-D).

C. An Extension: ODIT-2
In this section, we consider the case where an additional anomaly training dataset is available along with the previously discussed nominal dataset, and we extend the ODIT method to take advantage of the anomaly dataset in order to improve its performance. With the inclusion of an anomaly training set, the ODIT-2 procedure is akin to classification methods based on the kNN distance [31], [32]; however, those methods are not sequential. Consider an anomaly training set $\mathcal{X}_M = \{\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_M\}$ in addition to the nominal set $\mathcal{X}_N = \{x_1, x_2, \ldots, x_N\}$. In this case, the anomaly evidence for each instance can be computed by comparing the total distance $L_t$ with respect to the nominal dataset with the total distance $\bar{L}_t$ with respect to the anomalous dataset. Thus, there is no need to learn the borderline total distance $L^{(K)}$ in training to be used as a baseline for $L_t$ in testing (cf. (7)); that is, no training is needed for ODIT-2. However, before testing, a preprocessing step might be required to remove the anomaly training points that are similar to the nominal training set. The reason for cleaning the anomaly dataset rather than the nominal dataset is that the anomaly dataset is usually obtained by collecting observations from a known anomalous event, which may typically include nominal observations too. For instance, in a network intrusion detection system (IDS), after the occurrence of an attack, several observations could still be of nominal nature. The cleaning step is performed by finding and removing the points of the anomaly training set which lie in the estimated MVS of the nominal training set, i.e.,
$$\mathcal{X}_M^{clean} = \left\{ \bar{x}_m \in \mathcal{X}_M : L_{\bar{x}_m} > L^{(K)} \right\},$$
where $L_{\bar{x}_m}$ is the total distance of $\bar{x}_m$ with respect to the nominal points in $\mathcal{X}_{N_2}$. Hence, the training procedure of ODIT, which finds $L^{(K)}$, can be used for preprocessing the anomalous training data. While testing, for each test data instance $x_t$, the anomaly evidence is calculated by
$$D_t = d\left(\log L_t - \log \bar{L}_t\right) + \log\frac{N}{M}, \qquad (11)$$
where $L_t$ and $\bar{L}_t$ are the total distances of $x_t$ computed using (6) with respect to the points in $\mathcal{X}_{N_2}$ and $\mathcal{X}_{M_2}^{clean}$, respectively; and N and M are the numbers of points in the nominal and (cleaned) anomalous training sets. The statistic update and decision rule of ODIT-2 are the same as in ODIT, given by (8). In the ODIT-2 procedure, different than Algorithm 1, (11) is used in line 9 to compute the anomaly evidence $D_t$.
In practice, there is a typical imbalance between the sizes of the nominal and anomaly training sets due to the inherent difficulty of obtaining anomaly samples. Since the total kNN distances in a dense nominal set $\mathcal{X}_N$ are expected to be smaller than those in a sparse anomaly dataset, for an anomalous data point, $L_t$ can be smaller than $\bar{L}_t$, resulting in a negative anomaly evidence, which can lead to poor detection. In order to deal with this imbalance of the datasets, the term $\log(N/M)$ in (11) acts as a correction factor. Specifically, for N > M, $\log(N/M) > 0$ compensates for $L_t$ being unfairly small compared to $\bar{L}_t$. This correction factor naturally appears in the asymptotic optimality proof, as shown next.

Corollary 1.
When the nominal distribution $f_0(x_t)$ and the anomalous distribution $f_1(x_t)$ are finite and continuous, then as the training sets grow, the ODIT-2 statistic $D_t$, given by (11), converges in probability to the log-likelihood ratio, i.e., ODIT-2 converges to CUSUM, which is minimax optimal in minimizing the expected detection delay while satisfying a false alarm constraint.
Proof: From the proof of Theorem 1, we know that $k/N_2 \to f_0(x_t)\, V_d\, g_k(x_t)^d$ in the mean square sense as $N_2$ grows, and, by the same argument applied to the cleaned anomalous training set, $k/M_2 \to f_1(x_t)\, V_d\, \bar{g}_k(x_t)^d$ as $M_2$ grows. Taking logarithms and subtracting, for γ = 1 and s = 1, the ODIT-2 statistic $D_t = d(\log L_t - \log \bar{L}_t) + \log(N/M)$ converges in probability to $\log\frac{f_1(x_t)}{f_0(x_t)}$, the log-likelihood ratio, where the common partitioning ratio gives $\log(N/M) = \log(N_2/M_2)$. The rest of the proof follows as in Theorem 1.
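A short sketch of the ODIT-2 anomaly evidence (11), under the same simplifications as the earlier ODIT sketch (s = k; the cleaning step is assumed to have already been applied to the anomalous set):

```python
import numpy as np
from scipy.spatial import cKDTree

def total_dist(tree, x, k=4, gamma=1.0):
    g, _ = tree.query(np.atleast_2d(x), k=k)           # kNN distances, s = k
    return float((g ** gamma).sum())

def odit2_evidence(x_t, nominal, anomalous, d, k=4):
    """Anomaly evidence (11): positive when x_t looks anomalous."""
    tree_n, tree_a = cKDTree(nominal), cKDTree(anomalous)
    L_t = total_dist(tree_n, x_t, k)                    # distance to nominal set
    Lbar_t = total_dist(tree_a, x_t, k)                 # distance to (cleaned) anomaly set
    N, M = len(nominal), len(anomalous)
    # log(N/M) corrects for the imbalance between the training set sizes
    return d * (np.log(L_t) - np.log(Lbar_t)) + np.log(N / M)
```

In practice the two trees would be built once offline rather than per test instance; they are constructed inside the function here only to keep the sketch self-contained.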

D. Unified Framework for Online Learning
The availability of labeled training data is a major limiting factor for improving the performance of anomaly detection techniques. In several applications, obtaining a comprehensive and accurately labeled training dataset for the anomaly class is very difficult [1]. In contrast, in most applications a sufficient amount of comprehensive nominal training data is typically available. Semi-supervised techniques, including ODIT, constitute a popular class of anomaly detection methods that require labeled training data only for the nominal class. These techniques build a model of nominal operation/behavior, and perform anomaly detection by detecting data that significantly deviates from the constructed nominal model. Supervised techniques, on the other hand, assume the availability of both nominal and anomalous datasets, and build models for classifying unseen data into the nominal vs. anomaly classes. ODIT-2, as an example supervised technique, outperforms the semi-supervised ODIT technique for known anomaly types, as shown in Section III-E and Section V. However, ODIT-2, and supervised anomaly detectors in general, fall short of detecting unknown anomaly types, while ODIT, and semi-supervised anomaly detectors in general, can easily handle new anomaly patterns as they do not depend on assumptions about the anomalies.
Combining the strengths of ODIT and ODIT-2, we propose an online learning scheme called ODIT-uni, which is capable of detecting new anomaly types and at the same time improving its performance for detecting previously seen anomaly types. Particularly, in the unified ODIT method, both ODIT and ODIT-2 run in parallel to detect anomalies, and the anomalous data instances first detected by ODIT are included in the anomalous training set of ODIT-2 in order to empower the detection of similar anomaly types. Since the ODIT-2 procedure involves all the necessary elements of ODIT, there is no further computational overhead induced by the unified approach. Keeping track of the cumulative decision statistics of ODIT and ODIT-2, the unified ODIT scheme, ODIT-uni, stops the first time either ODIT or ODIT-2 stops:
$$T_{uni} = \min\left\{ \inf\{t : \Delta^{(1)}_t \geq h_1\},\; \inf\{t : \Delta^{(2)}_t \geq h_2\} \right\}, \quad \Delta^{(i)}_t = \max\{0, \Delta^{(i)}_{t-1} + D^{(i)}_t\},$$
where $D^{(1)}_t$ and $D^{(2)}_t$ are the anomaly evidences given by (7) and (11), respectively, and $h_1$ and $h_2$ are the decision thresholds for ODIT and ODIT-2. For known anomaly patterns on which ODIT-2 is trained, it is expected that $\Delta^{(2)}_t \geq h_2$ is satisfied first, whereas ODIT ($\Delta^{(1)}_t \geq h_1$) is supposed to detect new anomaly types. If the alarm is raised by ODIT, then the anomaly onset time is estimated as the last time instance the ODIT statistic was zero, i.e., $\hat{\tau} = \max\{t < T : \Delta^{(1)}_t = 0\}$, and the data instances $\{x_{\hat{\tau}+1}, \ldots, x_T\}$ between $\hat{\tau}$ and T are added to the ODIT-2 anomaly training set. For reliable enhancement of the ODIT-2 anomaly training set with the newly detected instances, the ODIT threshold $h_1$ needs to be selected sufficiently high to prevent false alarms by ODIT, and thus false inclusions into the ODIT-2 training set. Obviously, a large $h_1$ will increase the detection delays for previously unseen anomaly types; however, avoiding false training instances is the more crucial objective.
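The control flow of the unified scheme can be sketched as follows (a schematic illustration; odit_evidence and odit2_evidence stand for the evidence computations (7) and (11) and are assumed to be provided):

```python
def odit_uni(stream, odit_evidence, odit2_evidence, anomaly_set, h1, h2):
    """Run ODIT and ODIT-2 in parallel; grow the anomaly set on ODIT alarms."""
    d1 = d2 = 0.0
    history, tau_hat = [], 0
    for t, x_t in enumerate(stream):
        history.append(x_t)
        d1 = max(0.0, d1 + odit_evidence(x_t))        # semi-supervised statistic
        d2 = max(0.0, d2 + odit2_evidence(x_t))       # supervised statistic
        if d1 == 0.0:
            tau_hat = t                               # last time the ODIT statistic was zero
        if d2 >= h2:                                  # known anomaly type detected first
            return t, anomaly_set
        if d1 >= h1:                                  # new anomaly type detected by ODIT
            anomaly_set.extend(history[tau_hat + 1:]) # enrich the ODIT-2 training set
            return t, anomaly_set
    return None, anomaly_set
```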

E. Example: Detecting Change in the Covariance of a High-Dimensional System
The nonparametric nature of the proposed ODIT detectors makes them suitable for multivariate detection in high-dimensional and heterogeneous systems. Through an experiment, we next show the advantage of ODIT and ODIT-2 over the parametric G-CUSUM detector in a challenging setting where the anomaly is manifested as a change in the correlation between the individual data streams. This type of anomaly is well exemplified by the MadIoT attacks, recently introduced in [7], in which high-wattage IoT devices, such as air conditioners and water heaters, are synchronously turned on/off to cause instability and, as a result, a blackout in the power grid.
Since it is not tractable to estimate the joint distribution of high-dimensional observations, especially the set of anomalous dimensions in the anomaly case, we implement the G-CUSUM detector proposed in [13], which performs univariate analysis by assuming that the data streams are independent. As expected, this univariate approach fails to detect the change in the covariance of the observations. In the experiment, we simulate a 100-dimensional system that generates data following a multivariate Gaussian distribution with µ = 20 and σ = 10 for the individual data streams, which initially have no correlation. At time t = 100, the covariance matrix of the observations is changed by randomly adding ρ = 0.6 correlation between 50% of the data streams, without any change in the mean and variance (i.e., the diagonal terms of the covariance matrix). Fig. 5 demonstrates the change in the distribution of two data dimensions; for better visualization, some of the anomaly instances that overlap with the nominal instances are not shown. We used $N = 2 \times 10^4$ nominal training instances and $M = 2 \times 10^4$ anomalous training instances, which decreased to M = 4836 after cleaning, for a scenario in which 50% of the data dimensions become correlated with ρ = 0.6.
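The simulated change can be sketched as follows (illustrative numpy code; the random seed and the block-correlation construction are our assumptions):

```python
import numpy as np

d, rho, mu, sigma = 100, 0.6, 20.0, 10.0
rng = np.random.default_rng(2)

cov0 = sigma**2 * np.eye(d)                        # nominal: uncorrelated streams
cov1 = cov0.copy()
idx = rng.choice(d, size=d // 2, replace=False)    # dimensions that become correlated
for i in idx:
    for j in idx:
        if i != j:
            cov1[i, j] = rho * sigma**2            # off-diagonal terms only change

nominal = rng.multivariate_normal(mu * np.ones(d), cov0, size=100)
anomalous = rng.multivariate_normal(mu * np.ones(d), cov1, size=100)
data = np.vstack([nominal, anomalous])             # change occurs at t = 100
```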
In the experiment, we compare the performance of the ODIT algorithms with G-CUSUM and the Oracle CUSUM, which exactly knows the nominal and anomalous probability distributions. This is a challenging problem due to the fact that the mean and variance of the individual data streams do not change; in particular, some data instances after the anomaly onset are still very similar to the nominal instances. To cope with the similarity of the anomaly instances to the nominal ones, the parameters of the ODIT algorithms are set to k = s = γ = 1, $\alpha_1$ = 0.2, $\alpha_2$ = 0.005, and the cleaning step is performed on the anomaly training set for ODIT-2. As depicted by its decision statistic in Fig. 6, G-CUSUM fails to detect the anomaly since it is not able to monitor correlations, whereas the ODIT algorithms successfully detect the change in the covariance structure of the observations by performing multivariate analysis, as shown in Fig. 7. Since ODIT does not use any anomaly data in training, it detects the anomaly with larger delays compared to ODIT-2. This example provides a scenario where the availability of a set of previously encountered anomalous instances greatly helps ODIT-2 perform significantly better. ODIT-2 achieves a performance close to that of the impractical Oracle CUSUM algorithm in the ideal case in which the anomalous dimensions in the test match those in the training. To demonstrate that ODIT-2 is still able to operate reasonably well under non-ideal conditions, we also tested it for the case where there is a mismatch between the test and training data in terms of the set of data streams getting correlated. In this case, 27 out of the 50 dimensions getting correlated are not seen in the anomaly training data. Fig. 7 shows that, despite the mismatch, ODIT-2 still performs better than ODIT.

IV. ANOMALY LOCALIZATION USING ODIT
In this section, we propose a localization strategy to identify the data dimensions in which the detected anomaly occurs, so that the necessary steps can be taken to mitigate the anomaly. Specifically, after an anomaly is detected by ODIT, our objective is to identify the dimensions that caused the detection statistic $\Delta_t$ to increase considerably and ultimately resulted in the detection. Our approach is to examine the contribution of each dimension individually to the decision statistic. In the case of detection by ODIT, an increase in the total distance $L_t$, given by (6), leads to an increase in the anomaly evidence $D_t$, given by (7), finally leading to an increase in the detection statistic $\Delta_t$, given by (8), and consequently to the anomaly alarm. Let us assume $x_t$ is the test data instance, and $\{y_1, \ldots, y_k\}$ are its k nearest neighbors in the training set. The total kNN distance $L_t = \sum_{n=k-s+1}^{k} \|x_t - y_n\|^\gamma$, for γ = 2, can be written in terms of the d data dimensions as
$$L_t = \sum_{n=k-s+1}^{k} \sum_{i=1}^{d} (x^i_t - y^i_n)^2 = \sum_{i=1}^{d} \delta^i_t, \qquad \delta^i_t = \sum_{n=k-s+1}^{k} (x^i_t - y^i_n)^2, \qquad (14)$$
where $x^i_t$ and $y^i_n$ are the ith dimensions of the observation $x_t$ and its nth nearest neighbor $y_n$, and $\delta^i_t$ is the contribution of the ith dimension of the observation $x_t$ at time t to the detection statistic. Therefore, by analyzing $\delta^i_t$ for each dimension i during the final increase period of $\Delta_t$, which causes the anomaly alarm, we can identify the dimensions in which the anomaly has been observed. To this end, we propose to use a recent history $Q_i = \{\delta^i_q : q = \hat{\tau}+1, \ldots, \hat{\tau}+S\}$ for each dimension i, starting from the last time $\Delta_q = 0$. This time $\hat{\tau}$, the most recent time instance when the detection statistic was zero, can be seen as an estimate of the anomaly onset time. Finally, we apply a t-test to the S samples in $Q_i$ to decide whether each dimension i is anomalous.
In particular, we propose the following anomaly localization procedure after the alarm is raised at time T:
1) Find $\hat{\tau} = \max\{t < T : \Delta_t = 0\}$.
2) Compute the sample mean and sample standard deviation of $Q_i$ for each dimension i:
$$\bar{\delta}^i = \frac{1}{S} \sum_{q=\hat{\tau}+1}^{\hat{\tau}+S} \delta^i_q, \qquad \eta^i = \sqrt{\frac{1}{S-1} \sum_{q=\hat{\tau}+1}^{\hat{\tau}+S} \left(\delta^i_q - \bar{\delta}^i\right)^2}. \qquad (15)$$
3) Identify the anomalous dimensions by applying a t-test:
$$\left\{ i : \frac{\bar{\delta}^i - \mu^i}{\eta^i / \sqrt{S}} > \theta \right\}, \qquad (16)$$
where $\mu^i$ is the sample mean of the nominal training values $\{\delta^i_1, \ldots, \delta^i_{N_1}\}$, and θ is the (1 − β)th percentile, for significance level β, of Student's t-distribution with S − 1 degrees of freedom. The significance level β, for which a typical value is 0.05, controls the balance between sensitivity to anomalies and robustness to nominal outliers. For given β and S values, the threshold θ can easily be found from a lookup table for Student's t-distribution (e.g., θ = 6.314 for β = 0.05 and S = 2). The number of samples S needs to be at least 2 to have at least 1 degree of freedom. In practice, the t-test is commonly used for small sample sizes, so S does not need to be large. Indeed, a larger S would cause a longer reaction time since the localization analysis is performed at time $\hat{\tau} + S$, which could be greater than the detection time T, incurring extra delay for localization and reaction after detection.
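A sketch of this localization step (assuming delta_history collects the per-dimension contributions $\delta^i_q$ from (14) for $q = \hat{\tau}+1, \ldots, \hat{\tau}+S$, and mu holds the nominal training means):

```python
import numpy as np
from scipy.stats import t as student_t

def localize(delta_history, mu, beta=0.05):
    """Return indices of dimensions flagged anomalous by the t-test (16).

    delta_history: (S, d) array of per-dimension contributions after tau_hat
    mu:            (d,) nominal training means of the contributions
    """
    S = delta_history.shape[0]
    delta_bar = delta_history.mean(axis=0)            # sample mean, eq. (15)
    eta = delta_history.std(axis=0, ddof=1)           # sample std, eq. (15)
    theta = student_t.ppf(1 - beta, df=S - 1)         # e.g., 6.314 for beta=0.05, S=2
    t_stat = (delta_bar - mu) / (eta / np.sqrt(S))
    return np.flatnonzero(t_stat > theta)
```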
Localization by ODIT-2 is slightly different. Since $\log L^{(K_N)}$ and $\log L^{(K_M)}$ are constant, the increase that causes the alarm takes place in $\log L_t - \log \bar{L}_t$. Writing $L_t$ and $\bar{L}_t$ in terms of the contributions from the d dimensions, $\delta^i_t$ and $\bar{\delta}^i_t$, respectively, as in (14), an increase in the difference $(\delta^i_t - \bar{\delta}^i_t)$ for some i leads to the increase in the decision statistic. Similar to ODIT, $\hat{\tau}$ is firstly found after a detection. Then, in the second step, the sample mean and standard deviation are computed by (15) using the differences $(\delta^i_q - \bar{\delta}^i_q)$. Finally, in the third step, $\mu^i$ corresponds to the sample mean of the nominal training values of $(\delta^i - \bar{\delta}^i)$.

V. DDOS ATTACK MITIGATION
The Distributed Denial-of-Service (DDoS) attack is a major security problem in today's widely networked systems and requires effective solution approaches [33], [34]. A DDoS attack is traditionally known as a type of cyber-attack targeting an Internet service, with the intention of making it unavailable to legitimate users. Nevertheless, it has also recently been investigated in the cyber-physical systems domain, such as the smart grid [7]. A DDoS attack is typically performed by overwhelming the target with malicious requests from multiple geographically distributed sources. The attacker first builds a network of malicious devices, known as a "botnet", by infecting them with malware, and then remotely controls these devices to synchronously send some form of service requests to the target, which initiates a DDoS attack. The size of the botnet, both in the number of compromised devices and in geographical distribution, determines the threat level of a DDoS attack. It is extremely difficult to successfully mitigate a large-scale DDoS attack centrally at the attacked site without disrupting the regular service to legitimate users, as recently demonstrated by the massive DDoS attacks empowered by Internet-of-Things (IoT) devices [34].
Low-Rate DDoS: The proliferation of IoT devices exacerbates the DDoS attack problem, as many IoT devices, such as Internet-connected sensors, have low security measures, making them vulnerable to malware infections [35]. The abundance of low-security IoT devices worldwide enables an even more challenging new type of DDoS attack, called low-rate DDoS [2], which is considered a stealth attack since the amount of anomalous service requests from each compromised device can be quite low. Such a low-rate change in the device behavior can easily bypass local intrusion detection systems (IDSs) that rely on observing raw data, such as data filters and firewalls. Yet, a synchronous low-rate DDoS attack from a huge number of compromised devices, e.g., millions of IoT devices, can easily cause an overwhelming aggregated service request, and thus the failure of the target. Successful DDoS attack mitigation requires quick detection of the attack and accurate identification of the sources of malicious requests, so that appropriate countermeasures can be taken against the attack. The timely detection of low-rate DDoS attacks is quite challenging at the local level, e.g., at the routers close to the IoT devices. Although detection is trivial at the target due to the overwhelming aggregated service requests, accurate identification of the attacking nodes, and as a result mitigation of the DDoS attack in a centralized fashion, is not tractable.
Challenges: There are several challenges for mitigating low-rate DDoS attacks. (i) High dimensionality: DDoS attacks inherently relate to large-scale systems. Therefore, the proposed methods need to scale well to large systems. Particularly for low-rate DDoS attacks, timely and accurate detection at the local level is challenging due to the similarity of the attack behavior to the nominal behavior. Multivariate anomaly detection techniques can greatly facilitate timely and accurate detection; however, even in a local IoT network, the dimensionality, i.e., the number of devices, makes joint probability density estimation intractable for parametric methods. (ii) Heterogeneity: The heterogeneous nature of IoT results in complex probability distributions even under nominal settings. Each device type in the network has different usage characteristics. For instance, a computer, a phone, a smartwatch, and a temperature sensor in a network are expected to have different operational baselines. Furthermore, even the nominal probability distribution of a single device is usually complicated due to its different operation modes, such as active use, passive use in the background, and hibernation. (iii) Unknown attack types: Due to the myriad of vulnerabilities in a network of low-security IoT devices, it is not possible to know future attack patterns. Conventional signature-based IDSs are not effective since they can only detect a predefined set of attack patterns. For the same reason, parametric detection techniques which assume probabilistic models for anomalies are not feasible either. To be able to detect unknown anomaly types, a nonparametric detection method is needed.
Application of ODIT: Considering the challenges mentioned above, ODIT provides an effective local DDoS attack mitigation approach that can handle high dimensionality, heterogeneity, and unknown attack types for quick and accurate detection. Utilizing the hierarchical structure of large-scale systems, such as the Internet and the power distribution network, multiple ODITs running at the local level, such as at routers and data aggregators, can provide a complete IDS for DDoS attack mitigation. Since ODIT is a generic anomaly detection method, we do not specify the observed data type, i.e., the form of service request, in the following simulations for DDoS mitigation. For instance, following the commonly used DDoS concept in computer networks (e.g., flooding-based DDoS [33]), the observed data vector could be the number of packets in unit time, such as packets per second, from a number of devices in the network; or, considering a power delivery network as in [7], the observed data dimensions could be the power demands from houses.

A. Compared Methods
We compare the performance of the proposed methods with two state-of-the-art detection methods for DDoS attacks: the information metric-based method [2] and the deep autoencoder method [35], used for comparison in the simulation (Section V-B) and real dataset (Section V-C) experiments, respectively. The latter was proposed in the paper that presented the N-BaIoT dataset [35]; thus, we use it to evaluate the performance of the proposed ODIT detectors on this dataset in Section V-C. The former is a window-based method that assumes a Gaussian distribution for the nominal data and a Poisson distribution for the attack data. Specifically, in the training phase, it fits a Gaussian distribution to a nominal dataset; then, in the test phase, it fits a Poisson distribution to a window of samples. By sliding the window and updating the Poisson distribution at each time, it computes as its detection statistic the information distance
$$D_\alpha(P \| Q) = \frac{1}{\alpha - 1} \log \sum_i p_i^\alpha q_i^{1-\alpha},$$
where P and Q are the estimated Gaussian and Poisson distributions, and $D_\alpha(P \| Q)$ is the Rényi divergence between P and Q with parameter $\alpha \in (0, \infty)$. Since this is a window-based method, its performance is highly dependent on the choice of the window size. For a small window size, the accuracy of the estimated probability distributions would be poor, resulting in poor performance, while a large window size would increase the detection delay, as attacks can only be detected at the end of the initial window. In the worst case, assuming that the window size is W and the anomaly starts at the beginning of the window, the detection delay would be at least W. Moreover, for a large window size, it would take more time to see the effect of an attack in the estimated Poisson distribution, and thus detection delays would be longer. This method is designed to capture an increase in the average data rate with respect to the average in the training dataset. We also compare the proposed methods with the conventional data filtering method, which filters out the service requests, in particular data packets, from nodes whose request rate in a certain period (e.g., packet rate) exceeds a predefined threshold.
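For reference, a sketch of this window-based statistic (the discretization of the fitted Gaussian onto a common support is our simplification):

```python
import numpy as np
from scipy.stats import norm, poisson

def renyi_divergence(p, q, alpha=2.0, eps=1e-12):
    """D_alpha(P||Q) over a common discrete support."""
    p, q = p + eps, q + eps
    return np.log(np.sum(p**alpha * q**(1 - alpha))) / (alpha - 1)

def window_statistic(train, window, alpha=2.0):
    """Fit a Gaussian to the training data and a Poisson to the sliding window."""
    support = np.arange(0, int(max(train.max(), window.max())) + 1)
    p = norm(train.mean(), train.std()).pdf(support)   # nominal model
    p /= p.sum()                                       # discretize onto the support
    q = poisson(window.mean()).pmf(support)            # window model
    return renyi_divergence(p, q, alpha)
```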

B. Experiment on Simulated Data
In the first experiment, as part of a low-rate DDoS attack scenario, we simulated an IoT network with d = 50 devices of different types, each having different nominal data transmission rates. Although the N-BaIoT dataset used in Section V-C is also collected from a similar IoT network, its attack magnitudes (i.e., increases in the data rates) are significantly higher than what we consider as low-rate DDoS here. We perform this simulation study to investigate a low-rate DDoS attack scenario in larger IoT networks. For example, the nominal data rate of a temperature sensor is considerably lower than that of a surveillance camera or a computer. In this simulation setup, 30% of the devices have two modes of operation, active and inactive states, with higher data rates in the former. The rest of the devices have a single baseline representing the background traffic in practical networks. The data rates of the devices are generated independently of each other from Gaussian distributions. For a device, data rates over time are independent and identically distributed. The mean data rates are chosen randomly in [10, 50] for inactive states, in [50, 90] for active states, and in [10, 100] for the devices with single states. The same variance $\sigma^2 = 5$ is used for all devices. Note that the data rates of the bimodal devices with active and inactive states follow a mixture of two Gaussian distributions. The frequencies of the active and inactive states are set to be equal. We assume that an attacker initiates a DDoS attack at time τ = 101 through several compromised devices present in the network. When an attack starts, the compromised devices start sending data at a higher rate, with an increase of 5 standard deviations.
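A sketch of this traffic model (illustrative numpy generation; the seed and the choice of compromised devices are assumptions):

```python
import numpy as np

d, sigma, tau = 50, np.sqrt(5.0), 100
rng = np.random.default_rng(3)

bimodal = rng.choice(d, size=int(0.3 * d), replace=False)   # devices with two modes
mu_inactive = rng.uniform(10, 50, d)
mu_active = rng.uniform(50, 90, d)
mu_single = rng.uniform(10, 100, d)

def rates(T, compromised=()):
    X = np.empty((T, d))
    comp = np.isin(np.arange(d), list(compromised))
    for t in range(T):
        active = rng.random(d) < 0.5                        # equal state frequencies
        mu = np.where(active, mu_active, mu_inactive)       # bimodal devices
        mu = np.where(np.isin(np.arange(d), bimodal), mu, mu_single)
        if t >= tau:
            mu = mu + 5 * sigma * comp                      # low-rate attack shift
        X[t] = rng.normal(mu, sigma)
    return X

X = rates(200, compromised=[0, 1, 2])   # attack starts at t = tau + 1 = 101
```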
In the ODIT algorithms, we set the parameters as $k = 1$, $s = 1$, $\alpha_1 = 0.05$, $\alpha_2 = 0.05$, $\gamma = 1$, $S = 2$. The results are obtained using $N = 2 \times 10^5$ nominal instances and $M = 10^5$ anomalous instances. Fig. 8 shows the decision statistics of ODIT, ODIT-2, and the information metric-based algorithm proposed in [2]. As depicted in the figure, the ODIT statistics exhibit an abrupt increase. The best window size for the information metric-based method is found to be $W = 5$; its information distance starts increasing only when the window contains a sufficient number of anomalous data instances. This result is consistent with the average performance results (average detection delay vs. false alarm rate) shown in Fig. 9. The ODIT-2 detector achieves zero detection delay with no false alarms. Similarly, ODIT achieves a very small average detection delay while satisfying a very low false alarm rate of $10^{-3}$. Although the information metric-based method also achieves reasonable detection delays, compared to the ODIT detectors it suffers from its window-based nature: smaller window sizes do not give better results due to insufficient accuracy in the probability distribution estimates, and larger window sizes due to reduced sensitivity to anomalies.
The receiver operating characteristic (ROC) curves for the localization of the malicious devices are shown in Fig. 10. ODIT and ODIT-2 successfully identify the malicious devices with probabilities 0.95 and 1, respectively, while satisfying a false positive rate of 0.05. In comparison, the conventional data filtering approach, which identifies a device as anomalous if its data rate exceeds a predefined threshold, fails to achieve a high identification probability at small false positive rates due to the small attack magnitudes in the simulated low-rate DDoS attack.

C. Experiment on a Real Dataset: N-BaIoT
In the second experiment, we evaluated the proposed ODIT algorithms using the N-BaIoT dataset, which consists of real IoT traffic observations including botnet attacks. The data is collected from 9 IoT devices (doorbell, thermostat, baby monitor, etc.) infected by the Mirai and BASHLITE malware [35], [36]. Here we only consider the Mirai attack dataset. The benign and attack datasets for each device are composed of 115 features summarizing traffic statistics over different temporal windows. The dataset is collected for each device separately and lacks timestamps, and the number of instances varies across devices and attack types. Therefore, we formed the training and test sets by randomly choosing data instances from each device. To form a network-wide instance for multivariate detection, we stack the chosen instances from the 9 devices into a single vector of 1035 dimensions. This way, we obtain a nominal training set with $N = 10{,}000$ instances. We also build an anomalous training set with $M = 5{,}000$ instances for the Ecobee thermostat device (device 3). To test for both known and unknown attack types, we let ODIT-2 train only on attack data from device 3, and test under two scenarios: (i) device 3 (Ecobee thermostat) is compromised (known anomaly type), and (ii) device 6 (Provision PT-838 security camera) is compromised (unknown anomaly type). We form the test data similarly to the training data, assuming that the respective device gets compromised and starts sending malicious traffic at $t = 101$. In the ODIT algorithms, we set the parameters as $k = s = \gamma = 1$, $\alpha_1 = 0.05$, $\alpha_2 = 0.1$, $S = 2$.
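A small sketch of how such network-wide instances can be formed is given below, assuming device_data is a list of 9 arrays, one per device, each of shape (n_i, 115); all variable names are ours.

```python
import numpy as np

def network_instance(device_data, rng):
    """Stack one randomly chosen 115-feature instance per device
    into a single 9 x 115 = 1035-dimensional vector."""
    rows = [dev[rng.integers(len(dev))] for dev in device_data]
    return np.concatenate(rows)

# e.g., a nominal training set of N = 10,000 stacked instances:
# rng = np.random.default_rng(0)
# train = np.stack([network_instance(benign_data, rng) for _ in range(10_000)])
```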
An example of the decision statistics of ODIT and ODIT-2 under the two scenarios is shown in Fig. 11. ODIT detects the attack with zero detection delay and zero false alarms in all trials, in both the known and unknown attack scenarios (Fig. 12). ODIT-2, which also trains on attack data from device 3, likewise achieves zero detection delay with zero false alarms in all trials in the known attack scenario. Fig. 11 shows that the ODIT-2 decision statistic rises steadily even for the unknown attack, when device 6 is attacking, yet with a smaller slope than that of ODIT, as expected. However, such a rise is not guaranteed in general for unknown anomaly types. When an unknown anomaly occurs in the test observations, ODIT-2 may or may not detect it, depending on whether the anomalous observations are relatively closer to the nominal dataset or to the anomalous dataset. If the anomalous data instances are more similar to the nominal set than to the anomaly set, the ODIT-2 statistic will remain at zero and it will fail to detect the anomaly. In this experiment, however, the unknown anomaly type (the attack data from device 6) is relatively more similar to the anomaly training set (the attack data from device 3). Therefore, ODIT-2 is able to detect it, as shown in Fig. 12, where the average detection performances of ODIT and ODIT-2 are given for scenario 2 (unknown anomaly).
Next, the identification of the malicious device is investigated in Fig. 13 in terms of the ROC curve (true positive rate vs. false positive rate) under the known anomaly scenario. Both variants of our proposed method identify the malicious device with very high probability while achieving false alarm rates as small as 0.01. We compute the contribution of each device to the decision statistic in (15) as the sum of the contributions of the 115 dimensions corresponding to that device.
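In vectorized form, this per-device aggregation can be sketched as follows, assuming contrib holds the 1035 per-dimension contributions to the statistic in (15); the function name and layout are ours.

```python
import numpy as np

def device_scores(contrib, n_devices=9, n_feats=115):
    """Sum the per-dimension contributions over each device's block of
    115 features; the device with the largest score is the likely culprit."""
    return contrib.reshape(n_devices, n_feats).sum(axis=1)
```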
We also compare the performance of ODIT to the deep autoencoder-based detection method [35], as both train only on nominal data. The autoencoder method marks each observed instance as nominal or anomalous and employs majority voting on a moving window of size $ws^*$ (to control the false positive rate), raising an alarm only if the majority of the instances within the window are marked as anomalous. Due to this window-based majority rule, its sample detection delay (i.e., the number of anomalous instances observed before detection) is at least $ws^*/2 + 1$. In contrast, the sequential nature of ODIT enables immediate detection together with zero false alarms, as demonstrated in Fig. 14 and Fig. 15. Following the analysis in [35] for each device, the sample detection delay and the false positive rate of both methods are compared in Fig. 14 and Fig. 15, respectively. The optimal window sizes reported in [35] for each device are used for the autoencoder method.
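A minimal sketch of this majority-vote rule (our reconstruction from the description in [35], not their code) illustrates the delay bound:

```python
from collections import deque

def majority_vote_alarm(marks, ws):
    """Raise an alarm at the first time the majority of the last ws
    per-instance marks (1 = anomalous, 0 = nominal) are anomalous;
    returns the alarm time index or None. After a purely nominal
    stretch, at least ws // 2 + 1 anomalous instances must be
    observed before an alarm can fire, which is the stated delay."""
    window = deque(maxlen=ws)
    for t, m in enumerate(marks):
        window.append(m)
        if len(window) == ws and sum(window) > ws // 2:
            return t
    return None
```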

D. Online Learning Scheme: ODIT-uni
In this section, we present experimental results to demonstrate the practical advantage of the unified framework ODIT-uni, proposed in Section III-D. Following the simulated and real-data experiments of Sections V-B and V-C, we train the algorithms on the nominal data and on anomaly data for a specific attack type. For the N-BaIoT dataset, we repeat the scenario 2 test, in which device 6 (Provision PT-838 security camera) starts sending malicious traffic while only attack data from device 3 is used to train ODIT-2. We extend the simulation experiment of Section V-B by testing the trained algorithms on a new anomaly type: at time $t = 101$, a different set of devices starts acting maliciously. Fig. 16 shows, for both the simulated and N-BaIoT datasets, the average detection delay of ODIT-2 at a constant false alarm rate of 0.01, versus the number of data points from the new anomaly type added to the anomaly training set. In both cases, as the number of confirmed instances added to the anomaly training set grows, the ODIT-2 detection delay decreases. The confirmation can come either from a human expert or from a sufficiently high ODIT decision threshold that avoids false alarms, as explained in Section III-D. In the simulated data, ODIT-2 is not able to detect the new anomaly type at the beginning, before seeing any representative instance. However, even after seeing only a single instance of the new anomaly type, it is able to detect it with a reasonable delay of around 10. In the N-BaIoT dataset, ODIT-2 is able to detect the unknown anomaly at the first encounter with an average delay of 0.79, and the average delay converges to zero as the training set is enhanced with instances from the new anomaly type. In this way, ODIT-uni detects unknown anomaly types through ODIT, and over time learns the geometry of the new anomalies and improves its detection performance through ODIT-2.
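A hedged sketch of one time step of this update loop is given below. The detector interface (update, refit, threshold) is hypothetical and merely stands in for the ODIT and ODIT-2 statistics; only the confirm-then-augment logic reflects the scheme described above.

```python
def odit_uni_step(x, odit, odit2, anomaly_set, confirm_threshold):
    """One ODIT-uni time step (sketch): detect with both detectors,
    and grow the anomaly training set with confirmed instances so
    that ODIT-2 gradually learns the geometry of new anomaly types.
    The .update/.refit/.threshold interface is hypothetical."""
    s1 = odit.update(x)        # semi-supervised statistic
    s2 = odit2.update(x)       # supervised statistic
    alarm = (s1 >= odit.threshold) or (s2 >= odit2.threshold)
    # Confirmation via a conservative ODIT threshold (or a human expert),
    # set high enough to practically avoid false alarms (Section III-D).
    if s1 >= confirm_threshold:
        anomaly_set.append(x)
        odit2.refit(anomaly_set)
    return alarm
```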

VI. CONCLUSION
In this paper, we proposed an algorithm, called ODIT, that is suitable for quick and accurate anomaly detection and localization in high-dimensional systems requiring multivariate (i.e., joint) monitoring of system components. The proposed anomaly detection method is generic and applicable to various contexts, as it does not assume specific data types, probability distributions, or anomaly types. It only requires a nominal training set, and it achieves asymptotic optimality in terms of minimizing the average detection delay under a false alarm constraint. We also showed how to benefit from available anomalous data (ODIT-2), and presented an online learning scheme (ODIT-uni) that detects unknown anomaly types and improves its performance over time by learning from detected anomalies. We evaluated the performance of our methods in the context of DDoS attack and botnet detection using a simulated dataset and a real dataset. The experiments verified the advantage of the proposed online learning scheme, and also showed that the proposed ODIT methods significantly outperform state-of-the-art anomaly/change detection methods in terms of average detection delay and false alarm rate.
The proposed algorithms assume static nominal behavior and a static set of data dimensions. For instance, the proposed online learning scheme updates its anomaly knowledge in real time, but it does not update its nominal data repository. Extending it to dynamic settings, such as an IoT network with a dynamic topology and changing nominal behavior, remains an important direction for future research.