Application of Sliding Nest Window Control Chart in Data Stream Anomaly Detection

Abstract: Because data stream anomaly detection algorithms based on sliding windows are sensitive to abnormal deviations in individual interference data, this paper presents a sliding nest window chart anomaly detection algorithm for data streams (SNWCAD-DS) that combines the sliding window with the control chart. By nesting a small sliding window inside a large one and analyzing the deviation distance between the two windows, the algorithm increases the out-of-bounds detection ratio and classifies a data stream with concept drift online. The algorithm is simulated on an industrial data stream from drilling engineering and compared with Automatic Outlier Detection for Data Streams (A-ODDS) and Distance-Based Outlier Detection for Data Streams (DBOD-DS). The experimental results show that the new algorithm achieves higher detection accuracy than the compared algorithms. Furthermore, it shields the influence of individual interference data and satisfies practical engineering needs.


Introduction
The data stream is ubiquitous in real-life and industrial data [1], for example the industrial sensor data stream [2,3], the Internet data stream [4][5][6][7][8], cloud computing [9], the Internet of Things [10], the stock data stream [11], and the traffic data stream [12]. A data stream changes over time, is unbounded, must be processed in real time, is often high-dimensional, and arrives irregularly. There are many data stream models for anomaly detection, such as the landmark model [13], the sliding window model [14], and the snapshot model [15]. The sliding window model considers only the latest M data points in the data stream, where M is the sliding window size. As data continue to arrive, the contents of the window are updated over time. Because the sliding window model is simple and effective in real applications, it is widely studied. This paper discusses data stream anomalies in practical engineering applications as well as the degree of each anomaly. The degree of anomaly is then fed into a model for prediction or decision-making. One such practical application is the industrial warning system.
Data stream anomaly detection includes concept drift [16] and drift level [17] detection, and it is an online classification problem. Judging by recent papers in leading machine learning and data mining journals and conferences, data stream anomaly detection [18] is becoming a popular research topic. The characteristics of the data stream make anomaly detection a challenging task, because most machine learning algorithms assume a static data environment and are therefore not suitable for data streams. There are two difficult problems: one is the high computational complexity of the model, and the other is concept drift, which changes the model. Researchers have therefore tried to solve data stream anomaly detection using the sliding window model and the control chart [19], evolutionary computation [20], transfer learning [21], clustering [22], feature selection [23], integrated learning [24], and combinations of various model classification theories [25]. Several combined models have been proposed, such as the combination of the sliding window and the control chart for Internet data stream anomaly detection [26], the combination of the sliding window and a partial differential equation sorting (PDEs) algorithm to detect abnormal data streams [27], and data stream anomaly detection combining the sliding window and integrated learning [28]. Several control chart methods have also been applied to data stream anomaly detection, such as non-parametric cumulative control charts for electrocardiographic (ECG) anomaly detection [29], a study of the weighted cumulative sum (CUSUM) [30], CUSUM-based anomaly detection for satellite power supply systems [31], and an online consumption forecast based on CUSUM [32]. The contributions of this paper include adopting nested sliding windows to increase trend prediction accuracy, increasing the out-of-bounds detection rate, weakening the influence of the current point, and using the CUSUM algorithm to combine these points and then test the real-time data stream.
As shown in Figure 1, because of the sensor's position, its working environment, and its quality, the data stream usually contains small clusters of points (the rectangle points in Figure 1) that deviate from the overall trend. These rectangle points are not truly abnormal data but a kind of interference data. If such data are also treated as abnormal, the prediction model is triggered frequently and the false alarm rate of anomaly detection increases. The later deviations from the normal trend (the trapezoidal points in Figure 1) reflect real underlying causes that make the data rise. Such data are often caused by changes in the internal environment and are the focus of our attention. These internal environmental changes and breakdowns are what we want to analyze, which is consistent with our prediction goal, so these data need to be labeled correctly. Therefore, one research direction for data stream anomaly detection is to reduce the influence of the randomness of individual points, reduce false positives, strengthen trend judgment, and improve prediction accuracy. The long window is a fixed-length segment extracted from the infinite data stream to analyze the trend of the stream. The short window is built to suppress the randomness of the current data and to display its immediate changes. By comparing the data in the long and short windows, we can analyze the changes in the current data while shielding some of the misclassification caused by its randomness.
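The long and short window idea can be illustrated with a minimal Python sketch. The class name `NestedWindow` and its parameter names are hypothetical, and the deviation measure used here (short-window mean minus long-window mean) is an assumption chosen for illustration; it is not necessarily the distance used by SNWCAD-DS.

```python
from collections import deque
from statistics import fmean


class NestedWindow:
    """Short window of length m nested in a long window of length M (m <= M)."""

    def __init__(self, long_size: int, short_size: int) -> None:
        if not 0 < short_size <= long_size:
            raise ValueError("require 0 < short_size <= long_size")
        # deque(maxlen=...) drops the oldest sample automatically on append.
        self.long = deque(maxlen=long_size)
        self.short = deque(maxlen=short_size)

    def push(self, x: float) -> float:
        """Add one sample; return short-window mean minus long-window mean.

        Assumed deviation measure, used only for illustration.
        """
        self.long.append(x)
        self.short.append(x)
        return fmean(self.short) - fmean(self.long)


if __name__ == "__main__":
    w = NestedWindow(long_size=10, short_size=2)
    stream = [1.0] * 9 + [11.0]  # a single spike at the end
    print([round(w.push(v), 2) for v in stream])
```

In this toy stream the single spike of 11.0 sits 9.0 above the long-window mean, but the short-window average deviates by only 4.0, which shows how averaging over the short window damps the randomness of the current point.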
As shown in Figure 2, data stream anomaly detection is a necessary part of many real-world decisions, such as network attack defense, early warning of industrial security accidents, traffic congestion warning, and stock decisions. First, the data stream anomaly and its degree must be detected, and the abnormal data streams of several parameters are combined to train the decision-making model. Once a model trained on enough data monitors the incoming data streams, it is triggered when several data stream anomalies occur together, and automatic decisions are then made. The difficulty of online data stream classification lies in the unknown distribution of the data stream. It is hard to judge whether arriving data are real anomalies or interference, and the two are not easy to distinguish from only a few adjacent incoming data points. If concept drift occurs in adjacent data and the follow-up data continue the trend of the drift, the data can be considered truly anomalous. Such data are caused by some inherent mechanism and require special attention. However, if a small amount of data returns to the overall trend within a short period, the deviation may be due to erratic sensor operation or changes in the working environment. A traditional machine learning classification model is trained on historical data. Because of the real-time nature of the data stream, distinguishing interference data from abnormal data is difficult in data stream anomaly detection. To address this kind of problem, Piecewise Linear Approximation (PLA) [33] was proposed, which fits each segment by linear regression so as to minimize the distance error. The discrete wavelet transform (DWT) method and the complex discrete wavelet transform (CDWT) method exploit the fragment similarity of the time data stream [34]. Although these methods perform data stream anomaly detection, they do not analyze whether the current point is abnormal interference or a normal point. This paper proposes a sliding nest window method. It weakens the influence of individual sampled data under concept drift, enhances the trend analysis ability, and suppresses the interference of noise.
This paper first introduces the background of the algorithm. Then the nested sliding window anomaly detection algorithm is described in detail. The proposed algorithm is compared with other relevant algorithms. Lastly, the paper is summarized.

A Nested Sliding Window Data Stream Anomaly Detection Algorithm
The object of machine learning and data mining is the common properties of the data. The data stream is affected by concept drift, so features need to be extracted from the original values to explore the common attributes of the stream. Although a data stream drifts in concept as the environment changes, the degree of drift is generally constant or varies in proportion to some attribute value. Therefore, extracting this constant attribute value from the original values is the key to machine learning on data streams.
In this paper, we use the box plot control chart to detect anomalies in the data stream and their magnitude. The data stream fluctuates around the median, between the lower and upper parts of the box. The area from the lower limit to the upper limit is normal. Data above the upper limit or below the lower limit are the key monitoring objects. When a certain percentage of data points exceed the limits continuously, the data stream can be regarded as anomalous, following a certain trend. The distance between the current point and the upper or lower limit is calculated as the magnitude of the anomaly at that point. Since the data stream drifts over time, the median of the box plot changes constantly and adaptively tracks the trend of the stream. However, the limits relative to the median are fairly stable: the upper and lower limit values are either fixed or adjusted by certain fixed parameters.
The mean value in Figure 3 is obtained by truncating the data stream with the sliding window. The lower and upper hinges are the boundary lines of normal data. Data between the two hinges are normal fluctuations and need no monitoring. Data between the lower hinge and the lower limit, and between the upper hinge and the upper limit, are suspicious and need to be watched, but no abnormality calculation is carried out on them. Data beyond the lower limit or the upper limit are identified as anomalies and need to be processed.
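The three bands of the box plot control chart (normal, suspicious, anomalous) can be sketched in Python as follows. This is an illustration under stated assumptions: the hinges are taken to be the first and third quartiles, and the limits are placed at `k = 1.5` interquartile ranges beyond the hinges (Tukey's usual convention). The text does not state the multiplier it uses, and the function names are hypothetical.

```python
from statistics import quantiles


def box_limits(window, k=1.5):
    """Return (lower_limit, lower_hinge, median, upper_hinge, upper_limit).

    Assumption: hinges are quartiles and limits lie k * IQR beyond them.
    """
    q1, med, q3 = quantiles(window, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q1, med, q3, q3 + k * iqr


def zone(x, limits):
    """Label x as 'normal', 'suspicious', or 'anomaly' per the three bands."""
    lo_lim, lo_hinge, _, hi_hinge, hi_lim = limits
    if lo_hinge <= x <= hi_hinge:
        return "normal"
    if lo_lim <= x <= hi_lim:
        return "suspicious"
    return "anomaly"


def magnitude(x, limits):
    """Distance from x to the limit it violates (0.0 if within the limits)."""
    lo_lim, _, _, _, hi_lim = limits
    return max(lo_lim - x, x - hi_lim, 0.0)


if __name__ == "__main__":
    limits = box_limits(range(1, 10))
    for value in (5, 10, 14):
        print(value, zone(value, limits), magnitude(value, limits))
```

Recomputing `box_limits` on each new window lets the median and limits follow the drifting stream, while `k` stays fixed.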
This paper analyzes data stream anomaly detection. The concepts and definitions are presented as follows.
Definition 1. Data stream DS. The data stream is a sequence of data generated continuously in chronological order, represented mathematically as $DS = \{(x_0, t_0), (x_1, t_1), \ldots, (x_i, t_i), \ldots\}$, where $x_i$ is the data arriving at time $t_i$.
The nested sliding window data stream anomaly detection algorithm is outlined below.
(4) Use the CUSUM principle to calculate the cumulative sum.
(6) Calculate the cumulative sum and variance.
(7) Calculate the upper critical line (UCL).
(8) Calculate the lower critical line (DCL).
(9) Analyze whether the current value is abnormal. If cusum > UCL, the current point is considered abnormally increased, the point flag is set to 1, and the abnormal accumulated number is mcusum = 1. If cusum < DCL, the current point is considered abnormally decreased, the point flag is set to 2, and mcusum = 1. If no abnormality is detected, the flag is set to 0 and mcusum = 0.
(10) Calculate the current point flag. If the flag of the current point is the same as the flag at the previous moment, the abnormal accumulated number is incremented by 1. Otherwise, the abnormal accumulated number is set to 1.
(11) Calculate the abnormal cumulative number. If the abnormal cumulative number > h, the current point is an abnormal point.
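One possible reading of steps (9) to (11) is sketched below in Python. The names `flag_point` and `RunCounter` are hypothetical, and the values `cusum`, `ucl`, and `dcl` are taken as given inputs because the formulas that produce them are not reproduced in this section. The reset-on-change behaviour of the counter is an interpretation of step (10), not a confirmed detail of SNWCAD-DS.

```python
def flag_point(cusum, ucl, dcl):
    """Step (9): 1 = abnormal rise, 2 = abnormal decline, 0 = no anomaly."""
    if cusum > ucl:
        return 1
    if cusum < dcl:
        return 2
    return 0


class RunCounter:
    """Steps (10)-(11): count consecutive identical nonzero flags against h."""

    def __init__(self, h):
        self.h = h            # out-of-bounds count that confirms an anomaly
        self.prev_flag = 0    # flag at the previous moment
        self.mcusum = 0       # abnormal accumulated number

    def update(self, flag):
        """Return the confirmed label (0, 1, or 2) for the current point."""
        if flag == 0:
            self.mcusum = 0                  # no anomaly: clear the count
        elif flag == self.prev_flag:
            self.mcusum += 1                 # same flag as before: accumulate
        else:
            self.mcusum = 1                  # new kind of anomaly: restart
        self.prev_flag = flag
        return flag if self.mcusum > self.h else 0


if __name__ == "__main__":
    counter = RunCounter(h=2)
    flags = [1, 1, 0, 1, 1, 1, 2]
    print([counter.update(f) for f in flags])
```

With `h = 2`, two isolated out-of-bounds points are not reported, and only the third consecutive point with the same flag is labeled abnormal. This is the mechanism by which short bursts of interference are shielded.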
The traditional Shewhart control chart uses the 3σ principle to set UCL and DCL. The 3σ principle assumes that the data are normally or approximately normally distributed and that the sample size is large enough. The sliding window method, however, truncates the sampled data, and concept drift is common in dynamic data streams, so the data are neither normally nor approximately normally distributed and the traditional 3σ rule is not appropriate. As shown in Figure 4, this paper treats 3σ as a parameter to be determined, which is obtained through data training. SNWCAD-DS adopts nested windows to cut the data stream. It analyzes only the data in the long and short windows and classifies them by comparing the data deviations between the two windows. As time goes on, new data constantly enter the window and old data are constantly removed, so the data in the window are updated at every step. At each update, the SNWCAD-DS algorithm carries out the eleven steps of the nested sliding window data stream anomaly detection algorithm, which realizes real-time classification of the data stream values. The algorithm not only copes with the classification difficulty brought by concept drift but also shields the randomness of the current data and improves the accuracy of data stream anomaly detection.
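Replacing the fixed 3σ rule with a trained threshold can be sketched as follows. The sketch assumes that the limits keep the Shewhart form (center ± µσ) with the multiplier `mu` left as a tunable parameter, and that the center and σ are estimated from the long window as the mean and population standard deviation. These choices and the function name `control_limits` are illustrative assumptions, not the paper's stated formulas.

```python
from statistics import fmean, pstdev


def control_limits(long_window, mu):
    """Return (DCL, UCL) as mean -/+ mu * sigma of the long window.

    mu replaces the fixed multiplier 3 of the classical 3-sigma rule and
    would be chosen by training on labeled data (assumption).
    """
    center = fmean(long_window)
    sigma = pstdev(long_window)
    return center - mu * sigma, center + mu * sigma


if __name__ == "__main__":
    window = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
    print(control_limits(window, 3.0))   # classical 3-sigma limits
    print(control_limits(window, 0.3))   # a much tighter, tuned threshold
```

A smaller `mu` narrows the band and raises the out-of-bounds detection rate, at the cost of more candidate points that a run-length rule must then filter.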

Simulation and Comparison
There are two key points in data stream anomaly detection: the accuracy of data stream classification and the computational complexity of the algorithm. If the computational complexity is high, the algorithm is not suitable for online data stream detection even if its accuracy is high. Conversely, if the classification accuracy is too low, the algorithm is of little use even if its computational complexity is low. Therefore, the principle of data stream anomaly detection is to improve the online classification accuracy as much as possible. The proposed algorithm is compared with DBOD-DS [35] and A-ODDS [36]. The area under the receiver operating characteristic (ROC) curve (AUC) and the Jaccard coefficient are employed as performance indicators. A satisfactory outlier detection technique maximizes the true positive (TP) count and minimizes the false negative (FN) and false positive (FP) counts. The detection accuracy is calculated according to the equations below.
True positive rate:
$$\mathrm{TPR} = \frac{TP}{TP + FN}$$
False alarm rate:
$$\mathrm{FPR} = \frac{FP}{FP + TN}$$
Jaccard coefficient:
$$\mathrm{JC} = \frac{TP}{TP + FP + FN}$$
where TP (true positive) is the number of anomalies correctly detected, FN (false negative) is the number of missed anomalies, FP (false positive) is the number of false alarms, and TN (true negative) is the number of normal points correctly rejected. The traditional way to calculate AUC is to vary the threshold, but this algorithm adopts online learning and the coefficient of variance, and it treats the whole µ × slc as one threshold. In addition, ROC analysis is binary, whereas this paper addresses a three-class problem. Therefore, for the ROC analysis this paper regards abnormal rise and abnormal decline as one category and no abnormality as another category. Within the anomalous category, abnormal rise and abnormal decline are still distinguished, so the classification itself remains a three-class task. In this paper, the area under the ROC curve (AUC) is used as the measure of classification accuracy and the test time is used as the measure of computational complexity.
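The evaluation measures can be computed from labeled data as in the following Python sketch. It uses the standard definitions TPR = TP/(TP + FN), FPR = FP/(FP + TN), and JC = TP/(TP + FP + FN), and it merges the rise (1) and decline (2) labels into a single anomaly class before counting, as described in the text. The function names and the convention of returning 0.0 for an empty denominator are choices made here for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Count (TP, FP, FN, TN), treating labels 1 and 2 as one anomaly class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        truth_anom, pred_anom = truth != 0, pred != 0
        if truth_anom and pred_anom:
            tp += 1
        elif pred_anom:
            fp += 1
        elif truth_anom:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def _ratio(num, den):
    """Safe division: return 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def detection_metrics(y_true, y_pred):
    """Return (TPR, FPR, Jaccard coefficient) for the merged binary problem."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return _ratio(tp, tp + fn), _ratio(fp, fp + tn), _ratio(tp, tp + fp + fn)


if __name__ == "__main__":
    truth = [0, 0, 1, 1, 2, 2, 0, 0]
    predicted = [0, 1, 1, 0, 2, 1, 0, 0]
    print(detection_metrics(truth, predicted))
```

Note that a decline predicted as a rise still counts as a true positive in the merged binary problem. Assessing the rise and decline distinction would require a separate per-class comparison.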
The data come from a well in the Tarim Oilfield and were labeled according to expert experience. Machine learning was performed on slowly changing data, using data stream anomaly detection based on the segment distance. The resulting graph is shown in Figure 5. As can be seen from Figure 5, the points near 230 clearly deviate from the normal trend range. Given the characteristics of the data stream, when only a few data points have arrived it cannot be determined whether they are truly abnormal or caused by interference, so the subsequent data must also be analyzed. The purpose of online machine learning on the data stream is to classify the data stream correctly.
The parameters of each algorithm are set according to Table 1; parameters that an algorithm does not use are marked with a slash. The algorithms are tested on the data shown in Figure 5. Normal data are labeled 0, abnormal increases are labeled 1, and abnormal declines are labeled 2. The simulation results are shown in Figure 6. As can be clearly seen from Figure 6, SNWCAD-DS masks the interference in the range 200-400 and identifies the actual anomalous drop in the range 600-800. The original algorithms, DBOD-DS and A-ODDS, cannot shield the interference.
Figure 7 shows the distribution of the Jaccard coefficient for each algorithm. As can be seen from Figure 7, the Jaccard coefficient of the proposed algorithm is higher than those of DBOD-DS and A-ODDS for different long window lengths. As shown in Table 2, the running time of the proposed algorithm does not increase much and fully meets the requirements of online real-time operation. The above analysis is based on online machine learning for slowly changing parameters. The following analysis simulates a three-category data stream with dramatic changes and a rising trend to verify the effectiveness of the proposed algorithm. As shown in Figure 8, the (+) points are the normal data, the (o) points are the rising data, and the (trapezoid) points are the falling data. The non-abnormal data flag is "-", the abnormal decline flag is "o", and the abnormal rise flag is "+". The purpose of the anomaly detection algorithm is to mark every position of the curve correctly. To compare the performance of the algorithms, detection is performed under the same settings of the four parameters: a long window of 500, a short window of 30, a threshold of 0.3, and an abnormal number of 8.
Figure 10 maps the original values to the SNWCAD-DS space so that the Shewhart control chart can determine whether the parameters are abnormal. 'A' is the original data graph, 'B' is the converted data graph, and 'C' is the data graph mapped to the SNWCAD-DS space. As the bottom graph in Figure 10 shows, the original data have clear abnormalities in two places, and the Shewhart control chart can quickly determine the abnormal points and their locations. As shown in Figure 11, the proposed SNWCAD-DS algorithm has clear advantages over DBOD-DS and A-ODDS: it significantly improves TPR and reduces FPR. The running times of the algorithms are also compared. This experiment uses 500 labeled data points. The operating environment is a dual-core 2.1 GHz CPU, Windows 7 SP1 x86, and 2 GB of memory. The running times are shown in Table 3. Table 3 shows that the complexity of SNWCAD-DS is slightly higher, but its accuracy is higher than those of DBOD-DS and A-ODDS. In practical applications, SNWCAD-DS therefore improves accuracy without affecting online operation.
DBOD-DS uses an adaptive probability density function to detect anomalous data. This method runs in a single window and does not analyze the influence of the randomness of the current data on the classification error rate. A-ODDS detects anomalous data by calculating the deviations between global and local data changes. In effect, the method adds one window, because the global part is computed over all the data. As data accumulate, more and more historical data are included, which amounts to a steadily increasing weight of historical data in the judgment. As with DBOD-DS, the algorithm does not shield the randomness of the current data.

Summary
In this paper, SNWCAD-DS is proposed to detect data stream anomalies. It is based on the segmented distance of the data stream and is simulated on data-driven drilling fault diagnosis. Compared with the DBOD-DS and A-ODDS algorithms, the proposed algorithm not only meets the complexity requirements of online operation but also significantly improves the accuracy of online data classification and reduces the false alarm rate. It provides a new distance-based algorithm for online machine classification and contributes to the improvement of data stream anomaly detection technology.

Definition 2. Short window distance (number of data) SWM. The number of data points between the current point and the closer fixed-interval data. The length of the short window is m.
Definition 3. Long window distance (number of data) LWM. The number of data points between the current point and the farther fixed-interval data. The length of the long window is M.
Definition 4. Sliding window. Given a data stream sequence T of length n and a user-defined subsequence length M, the length of the sliding window is M. All subsequences taken from the data stream sequence T by the sliding window constitute the set s.
Definition 5. Threshold µσ. µ is a constant. If the average value exceeds µ, it is considered abnormal.
Definition 6. The number of out-of-bounds data is h. The out-of-bounds rate is h/M.

Figure 8. Raw data and abnormal signs.

Figure 9. Schematic diagram of the algorithm's abnormality judgment. The top curve is the upper threshold line and the bottom curve is the lower threshold line. The middle curve represents normal data. Data that exceed the upper or lower threshold line are false outliers. If the number of false outliers in the long window is larger than h, the current sampling point is an abnormal point. The whole figure implies the window sliding operation.

Conclusion 1: The SNWCAD-DS algorithm can effectively improve the detection accuracy.
Conclusion 2: The SNWCAD-DS algorithm can effectively reduce the false alarm rate.
Conclusion 3: The complexity of the SNWCAD-DS algorithm meets the needs of practical application.

Figure 6. Comparison of classification results.

Table 2. Running time for different lengths of the long window.

Table 3. Comparison of algorithm running times (s).