Online Forecasting and Anomaly Detection Based on the ARIMA Model

Abstract: Real-time diagnostics of complex technical systems such as power plants are critical to keeping the system in a working state. An ideal diagnostic system must detect any fault in advance and predict the future state of the technical system, so predictive algorithms are used in diagnostics. This paper proposes a novel, computationally simple algorithm based on the Autoregressive Integrated Moving Average (ARIMA) model to solve anomaly detection and forecasting problems. The good performance of the proposed algorithm was confirmed in numerous numerical experiments for both anomaly detection and forecasting problems. Moreover, a description of the Autoregressive Integrated Moving Average Fault Detection (ARIMAFD) library, which includes the proposed algorithms, is provided in this paper. The developed algorithm proves to be efficient and can be applied to problems related to anomaly detection and technological parameter forecasting in real diagnostic systems.


Introduction
The number of real-time or online applications that generate time series data is increasing rapidly in various industries. The importance of anomaly detection in such applications is also increasing [1]. Generally, anomaly detection algorithms are intended to find a point of change in the system's state or behavior. The traditional classification of anomaly detection algorithms includes "online" and "offline" types. The main difference between them is that the offline problem assumes that the full set of data is available, transforming the problem into finding all existing changepoints accurately. The online problem, in turn, expects the data to become available point by point in real time, transforming the problem into finding anomalies as soon as they occur [2]. Online or streaming applications are mainly used in critical systems where real-time decisions with acceptable quality are required. The trade-off between the detector's quality and its computational time is one of the challenges for online algorithms. The problems addressed by these algorithms are important and varied, ranging from computer network intrusion to heart attack recognition.
The Autoregressive Moving Average (ARMA) model is widely used for time series analysis and forecasting [3][4][5]. The ARMA model is a combination of the Autoregressive (AR) model, which describes the analytical component of the signal, and the Moving Average (MA) model, which describes the noise component of the signal. The advantage of the ARMA model is its rigorous mathematical and statistical basis. However, it has serious drawbacks, such as time-consuming model tuning and hyperparameter selection. Another disadvantage of the model is its applicability only to stationary time series. The latter problem is resolved with the Autoregressive Integrated Moving Average (ARIMA) model, which operates on the differenced time series. A number of studies have been carried out on the automatic selection of hyperparameters; however, the need for manual selection of hyperparameters remains [6]. ARMA and ARIMA models and their modifications, such as the Seasonal Autoregressive Integrated Moving Average (SARIMA), are described in detail in [6][7][8].
Since the state of systems can gradually change over time, causing data drifts or statistical changes in the data, continuous updating of the model is required to achieve the correct results and to make the model more robust and adaptive [9]. As such, continuous model learning or online refitting is used in many applications and can be achieved in a wide variety of practical tasks, such as predicting the values of technological parameters and detecting equipment faults [10]. Unfortunately, training the classical ARIMA models with a large amount of data is time consuming, making it inefficient to use and refit the model in the online mode. Many works are dedicated to creating online ARIMA algorithms for forecasting problems (see [11][12][13][14][15][16][17]). Most of them are based on removing the terms of the moving average and increasing the order of the auto-regressive part.
In this paper, a comparison of several online algorithms based on the ARIMA model is provided. Moreover, this paper represents a new algorithm that can be used both for anomaly detection and forecasting purposes in real-time streaming applications. Numerical experiments were conducted on the Numenta Anomaly Benchmark (NAB) [9], which contains anomalous data from real-world streaming applications and a real-world proprietary turbogenerator-bearing vibration dataset. All proposed algorithms are implemented in the library in the Python3 programming language. A brief description of the library and a link to the source code can be found in the Materials and Methods section.

Classical ARIMA Model
The ARMA model is described by the following equation from [6]:

X_t = c + ε_t + ∑_{i=1}^{p} α_i X_{t−i} + ∑_{j=1}^{q} β_j ε_{t−j}, (1)

where X_t is the value of the time series X at moment t; α ∈ R^p and β ∈ R^q are the vectors of the weight coefficients; ε_t is a noise term; c is a constant; ∑_{i=1}^{p} α_i X_{t−i} + ε_t + c is an autoregressive model of order p, which is a linear combination of the previous p values of the series, a constant, and a noise term; ∑_{j=1}^{q} β_j ε_{t−j} + ε_t is the moving average model of order q, which is a linear combination of the previous q noise terms and the current noise term.
Time series representing real-world processes are often non-stationary. Deterministic components, such as trends, seasonality, slow and fast variation, and cyclical irregularity are usually represented in this manner. An effective way to remove most of these components is to compute their differences. For the time series X, at time moment t, the first order differences (lag = 1) are equal to ∇X t = X t − X t−1 , while the second order differences are equal to ∇ 2 X t = ∇X t − ∇X t−1 . To remove the seasonality, seasonal differences can be calculated, which are pairwise differences in values in neighboring seasons.
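As an illustration, these differences are straightforward to compute; the sketch below (using NumPy on an arbitrary toy series) shows first-order, second-order, and seasonal differences.

```python
import numpy as np

x = np.array([1.0, 3.0, 6.0, 10.0, 15.0, 21.0])  # toy series with a trend

d1 = np.diff(x)        # first-order differences: ∇X_t = X_t − X_{t−1}
d2 = np.diff(x, n=2)   # second-order differences: ∇²X_t = ∇X_t − ∇X_{t−1}

# seasonal differencing with period s: X_t − X_{t−s}
s = 2
ds = x[s:] - x[:-s]
```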
If ∇^d X_t is described by the ARMA(p, q) model, it can be said that X_t is described by the ARIMA(p, d, q) model:

∇^d X_t = c + ε_t + ∑_{i=1}^{p} α_i ∇^d X_{t−i} + ∑_{j=1}^{q} β_j ε_{t−j}, (2)

where p, d, q are the hyperparameters (orders) of the model, and α ∈ R^p and β ∈ R^q are the vectors of the weight coefficients.
The predicted value X̂_t can be calculated by setting the current noise term to zero:

∇^d X̂_t = c + ∑_{i=1}^{p} α_i ∇^d X_{t−i} + ∑_{j=1}^{q} β_j ε_{t−j}, (3)

with the differencing subsequently inverted to recover X̂_t. As stated earlier, hyperparameter selection is a separate, non-trivial task. The Box-Jenkins method for solving this problem is described in [6]. Generally, the maximum likelihood (ML) and prediction error (PE) methods with a non-convex error criterion function are used to find the optimal ARMA weights. Other algorithms (interested readers are directed to [18]) are also used to solve this problem, but they all have drawbacks; for example, they may fail to deliver the global optimum.
The use of classical online algorithms with convex optimization methods is impossible in this case, since the moving average model is nonlinear in its parameters and, therefore, nonlinear estimation procedures are required to estimate them [6]. In turn, the noise components {ε_t}_{t=1}^{T} are required at each iteration when updating the model weights (α, β) and at each iteration of the prediction procedure.
A solution to this problem is the ARMA Online Gradient Descent (ARMA-OGD) algorithm from [12] (see Algorithm 1, where TLM max q is the maximum value of the Lagrange multipliers [12]). In this case, all terms of the MA model are replaced by additional autoregressive terms: the ARIMA(p, d, q) model is converted to an ARIMA(p + m, d, 0) model, where m ∈ N is a constant. The algorithm with the coefficient vector γ ∈ R^{p+m} attains a sublinear regret bound against the best ARMA(p, d, q) prediction in hindsight, under weak assumptions on the noise terms. Sublinear regret combined with computational simplicity makes this approach acceptable. However, the algorithm requires the following assumptions:
1. The noise components are randomly and independently distributed with zero expectation and satisfy E[|ε_t|] < M_max < ∞ and E[ℓ_t(X_t, X_t − ε_t)] < ∞ for every t.
2. The loss function ℓ_t must satisfy a Lipschitz condition [19].
3. The autoregressive weight coefficients α_i satisfy |α_i| < 1. This assumption is necessary to restrict the decision set [12], but it is neither necessary nor sufficient for stability; in fact, the absolute values of the autoregressive weight coefficients can be bounded by any other real number.
4. The moving average weight coefficients β_i satisfy ∑_{i=1}^{q} |β_i| < 1 − ε for some ε > 0.
5. The signal is bounded by a constant; without loss of generality, we assume that |X_t| < 1 for every t.
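The core idea of replacing the MA terms with extra autoregressive terms and updating the weights online can be sketched as follows. This is a minimal illustration, not the full ARMA-OGD of [12]: the function name, the squared loss, the fixed learning rate, and the crude element-wise clipping used in place of a proper projection onto the decision set are all our assumptions.

```python
import numpy as np

def arma_ogd_forecast(x, m=10, lr=0.01, clip=0.9):
    """Sketch of the ARMA-OGD idea: the MA part is absorbed into a
    higher-order AR(m) model whose weights gamma are updated by online
    gradient descent on the squared one-step prediction error."""
    gamma = np.zeros(m)                    # AR(m) weight vector
    preds = np.full(len(x), np.nan)        # no prediction for the first m points
    for t in range(m, len(x)):
        window = x[t - m:t][::-1]          # most recent value first
        preds[t] = gamma @ window          # one-step-ahead prediction
        err = preds[t] - x[t]
        gamma -= lr * 2 * err * window     # gradient step on the squared loss
        gamma = np.clip(gamma, -clip, clip)  # crude stand-in for the projection
    return preds, gamma
```

With bounded signals, the clipping keeps the weight vector inside a bounded decision set, mirroring assumption 3 above.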

Full Refitting
The next online algorithm considered, based on the ARIMA model, is a full refitting algorithm. In this algorithm, the model is refitted on the whole dataset, including new data, each time the new point(s) have been received (Algorithm 2). The "warm start" technology was used, according to which the weight coefficients of the ARIMA model are not initialized randomly, but are remembered from the last loop and are given as initials for a new fitting loop. This allows the optimization algorithm to converge much faster.
Algorithm 2 Full refitting.
Input: hyperparameters p, d, q; learning rate η
For each newly received point (or batch) X_t:
  Append X_t to the dataset
  (α, β) ← fit*(entire dataset), initialized with the weights from the previous loop
  Predict the next value with the refitted model
End for
* fit() is the notation of any suitable optimization algorithm using "warm start" technology
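A minimal sketch of the full refitting loop is given below for an AR(p) model fitted by gradient descent; `fit_ar` and `full_refitting` are hypothetical helpers standing in for any optimizer that accepts initial weights ("warm start"), and the few-epoch refit per new point reflects the faster convergence the warm start enables.

```python
import numpy as np

def fit_ar(x, p, w0=None, lr=0.001, epochs=50):
    """Gradient-descent fit of an AR(p) model; w0 enables a 'warm start'."""
    w = np.zeros(p) if w0 is None else w0.copy()
    for _ in range(epochs):
        for t in range(p, len(x)):
            window = x[t - p:t][::-1]
            w -= lr * 2 * (w @ window - x[t]) * window
    return w

def full_refitting(stream, p=3, initial=100):
    """Refit on the whole history each time a new point arrives,
    passing the previous weights as the starting point."""
    history = list(stream[:initial])
    w = fit_ar(np.array(history), p)
    preds = []
    for x_new in stream[initial:]:
        preds.append(w @ np.array(history[-p:])[::-1])   # predict before seeing x_new
        history.append(x_new)
        w = fit_ar(np.array(history), p, w0=w, epochs=5)  # warm start: few epochs
    return np.array(preds), w
```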

Window Refitting
In the window refitting algorithm, the model is refitted not on the whole dataset, but only on its last section relative to the current time moment. The size of each new training set is determined by the width of the window, which is a hyperparameter requiring individual tuning. The wider the window, the more accurate but more computationally complex the model, and vice versa. This algorithm also uses "warm start" technology (see Algorithm 3).

Algorithm 3 Window refitting.
Input: hyperparameters p, d, q; learning rate η; window width W
For each newly received point (or batch) X_t:
  Append X_t to the dataset
  (α, β) ← fit*(last W points), initialized with the weights from the previous loop
  Predict the next value with the refitted model
End for
* fit() is the notation of any suitable optimization algorithm using "warm start" technology
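A corresponding sketch of the window refitting loop, again with a hypothetical gradient-descent AR(p) fitter and warm-started weights carried between loops:

```python
import numpy as np

def fit_ar(x, p, w0, lr=0.001, epochs=5):
    """Short gradient-descent refit of an AR(p) model from warm-start weights w0."""
    w = w0.copy()
    for _ in range(epochs):
        for t in range(p, len(x)):
            window = x[t - p:t][::-1]
            w -= lr * 2 * (w @ window - x[t]) * window
    return w

def window_refitting(stream, p=3, W=50):
    """Refit only on the last W points each time a new one arrives;
    the weights are remembered between loops (warm start)."""
    w = np.zeros(p)
    preds = []
    for t in range(W, len(stream)):
        w = fit_ar(stream[t - W:t], p, w0=w)       # refit on the trailing window
        preds.append(w @ stream[t - p:t][::-1])    # one-step-ahead prediction
    return np.array(preds), w
```

The window width W directly trades accuracy against per-step cost, as described above.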

Performance Metric for the Forecasting Problem
To evaluate the forecasting algorithms, the Mean Absolute Percentage Error was used (Equation (4)). The main motivation for choosing this metric is the easy interpretability of the results and the ability to compare forecast errors between different physical measures.
MAPE = (100%/T) ∑_{t=1}^{T} |X_t − X̂_t| / (|X_t| + ε), (4)

where X̂_t is the predicted value at time moment t, X_t is the true value at time moment t, T is the number of points in the time series X, and ε is an arbitrarily small yet strictly positive number used to avoid undefined results when X_t is zero; in our case, ε = 2.22 · 10^{−16}.
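The metric translates directly into code; the sketch below uses the same machine-epsilon guard against zero true values.

```python
import numpy as np

def mape(actual, predicted, eps=2.22e-16):
    """Mean Absolute Percentage Error with an epsilon guard so that
    zero true values do not produce a division by zero."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return 100.0 * np.mean(np.abs(actual - predicted) / (np.abs(actual) + eps))
```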

Novel Anomaly Detection Algorithm Based on OGD Algorithm
According to the definition in [20], the technical state is the state of the system at a certain point in time, which, under certain environmental conditions, is characterized by the values of the parameters established by the technical documentation for the system. In our case, this is X t . This state may implicitly include information about the presence of a defect or other anomaly in a system. Applying the ARIMA models to these parameters can make the presence of an anomaly in a technical system more apparent. Weight coefficients can provide more transparent and precise information about changes in a system state than the parameters themselves.
The Online Gradient Descent algorithm allows us to track weight coefficients in online mode. This makes it possible to use the base of the Online Gradient Descent algorithm not only for forecasting problems, but also for anomaly detection problems. For this, it is necessary to transform the OGD algorithm by essentially creating a novel algorithm based on it.
Let us suppose that we have the entire history of the weight coefficients' changes in the form of vectors γ_1, …, γ_{T−1}.
From Algorithm 1, the weight update is γ_{t+1} = Π(γ_t − η ∇ℓ_t(γ_t)), where Π is the projection onto the decision set. If the coefficient vector satisfies all the requirements, there is no need to project the vector of the weight coefficients, and the formula is converted into the following form:

γ_{t+1} = γ_t − η ∇ℓ_t(γ_t).

However, tracking the values of the weight coefficients themselves is impractical for the following reasons:
1. The optimal values of the weight coefficients are unknown initially.
2. The state of the considered system may gradually change over time, shifting the optimal values of the weight coefficients; this, in turn, can lead to a high false-positive rate in an anomaly detection algorithm.
To avoid these shortcomings, we plot the discrete differences in the weight coefficients along the time axis. This makes the state of the technical system easier to interpret since, in a steady state, this value should tend toward 0 and, in the case of a change in the operating mode or an anomaly, deviate from 0. Moreover, the shift in each specific weight coefficient over a time interval is the sum of its discrete differences throughout that interval. The vector of discrete differences in the weight coefficients is:

∇γ_t = γ_t − γ_{t−1}.

Thus, the discrete difference vector must tend toward 0 when close to the optimum value. At the same time, a strong and sudden deviation from 0 may indicate the occurrence of an anomaly. In order to monitor these changes, a threshold can be introduced, which is a hyperparameter and requires separate tuning. Based on the above, an anomaly detection and forecasting algorithm was obtained (the pseudo-code is presented in Algorithm 4).
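The discrete-difference tracking described above can be sketched as follows; the function name and the simple per-coefficient threshold check are our assumptions.

```python
import numpy as np

def weight_diff_alerts(gamma_history, th=0.5):
    """Compute the discrete differences of the weight trajectory and
    raise an alert whenever any coefficient jumps by more than th."""
    gamma_history = np.asarray(gamma_history)        # shape (T, n_weights)
    diffs = np.diff(gamma_history, axis=0)           # ∇γ_t = γ_t − γ_{t−1}
    alerts = np.where(np.any(np.abs(diffs) > th, axis=1))[0] + 1
    return diffs, alerts
```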
As a rule, the state of a technical system is characterized by a large number of parameters that do not have the same scale and cannot be directly compared, which makes it difficult to describe the state of a technical system through the weights of the modernized ARIMA model. To solve this problem, the following normalization of the initial parameters was carried out:

X̂_i = (X_i − μ) / σ,

where X̂_i is the scaled time series, X_i is the initial time series, μ is the expectation of X_i, and σ is the standard deviation of X_i.
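This standardization is a one-liner; a minimal NumPy sketch:

```python
import numpy as np

def standardize(x):
    """Zero-mean, unit-variance scaling so the weight trajectories of
    differently scaled parameters become directly comparable."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()
```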

Algorithm 4 ARIMA Anomaly detection.
Input: hyperparameters p, d, q; learning rate η; threshold th; empty list r
For each time moment t:
  Update the weight vector γ_t by the OGD step and compute ∇γ_t = γ_t − γ_{t−1}
  For each weight coefficient w:
    If |∇γ_tw| > th then add value t to r and ALERT
  End for
End for

In practice, when verification is performed using benchmarks, the sample size is not big enough (unlike in real technical systems) to reach a global optimum by stochastic gradient descent. In this regard, four heuristic methods, which are more sensitive to anomalies in this case, have been developed. The first three are similar in meaning and can be used in batch processing; they are distinguished by the metrics that summarize the entire set of weight coefficients with one point for each available time moment. The fourth method can only be used for offline analysis (this method is not overfitted to a particular dataset, because similar results have been obtained for other datasets). All methods are presented below (executed instead of the threshold condition in Algorithm 4):

Metric based on the Euclidean norm:

Metric_t = ( ∑_{s=1}^{S} (∇γ_ts)^2 )^{1/2},

where S is the number of all weight coefficients of the ARIMA models for the considered system, t is the time moment, and ∇γ_ts is the discrete difference in one specific weight coefficient. An alert is raised when Metric_t deviates from µ(Metric, win) by more than th_1 · σ(Metric, win), where win is the width of the window (requiring separate tuning; it can be set equal to the duration of a physical process), µ is a function computing the mean of the metric in the window win, σ is a function computing the standard deviation of the metric in the window win, and th_1 is the threshold of the metric. If Metric_t is a local maximum, then ALERT.
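The Euclidean-norm heuristic can be sketched as below; the adaptive mean-plus-k-sigma thresholding over a trailing window is our illustrative reading of the windowed µ and σ functions, not the exact implementation from the ARIMAFD library.

```python
import numpy as np

def euclidean_metric(diffs):
    """Collapse the S per-weight discrete differences at each time step
    into one point: Metric_t = sqrt(sum_s (∇γ_ts)^2)."""
    return np.sqrt(np.sum(np.asarray(diffs) ** 2, axis=1))

def adaptive_threshold_alerts(metric, win=20, k=3.0):
    """Flag the points whose metric exceeds mean + k*std computed
    over the trailing window of width win."""
    alerts = []
    for t in range(win, len(metric)):
        mu = metric[t - win:t].mean()
        sigma = metric[t - win:t].std()
        if metric[t] > mu + k * sigma:
            alerts.append(t)
    return alerts
```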

One Point Prediction
The idea of using the ARIMA model for anomaly detection purposes is not new. However, when using the ARIMA model in an anomaly detection algorithm, the approach is almost always the same [10,21,22]: the ARIMA model makes a prediction, the difference between this prediction and the actual value is calculated and, if this difference is significant, it is treated as an anomaly. The pseudo-code of this approach is presented in Algorithm 5. This algorithm was chosen for comparison with the developed algorithm (Algorithm 4).

Algorithm 5 One-point prediction.
Input: hyperparameters p, d, q; learning rate η; threshold ˜th; empty list r
For each time moment t:
  fit* the model and predict X̂_t
  If |X_t − X̂_t| > ˜th then add value t to r and ALERT
End for
* fit() is the notation of any suitable optimization algorithm
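The one-point-prediction approach of Algorithm 5 can be sketched as follows; for simplicity, the sketch uses an online-updated AR(p) model in place of a fully refitted ARIMA, and the fixed residual threshold is an assumption.

```python
import numpy as np

def residual_alerts(x, p=3, lr=0.01, th=3.0):
    """One-point-prediction detector: an AR(p) model (updated online here
    for simplicity) predicts each next value; a residual whose absolute
    value exceeds th is flagged as an anomaly."""
    w = np.zeros(p)
    alerts = []
    for t in range(p, len(x)):
        window = x[t - p:t][::-1]
        resid = x[t] - w @ window          # prediction error for point t
        if abs(resid) > th:
            alerts.append(t)
        w -= lr * 2 * (w @ window - x[t]) * window  # online weight update
    return alerts
```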

Performance Metric for Anomaly Detection Problem
Standard metrics, such as precision and recall, are not suitable for evaluating anomaly detection algorithms in systems where time series are used since they do not take into account the specifics of time. Because of this, some measurable characteristics of the algorithm performance, such as detection delay, are not accounted for either. For this reason, the Numenta Anomaly Benchmark performance scoring algorithm (or the NAB evaluating algorithm) was chosen from [9]. The purpose of this algorithm is to encourage early detection and to penalize false positives and false negatives.
The NAB evaluating algorithm contains three components, which must be tuned:
• the anomaly window;
• the application profiles;
• the scoring function.
The window in which a detected anomaly is counted as a true positive is called an anomaly window (vertical blue lines in Figure 1). It is recommended by [9] to take a window equal to 10% of the sample length divided by the number of anomalies, centered around a ground truth anomaly label. For technical systems, a more reasonable approach is to take the moment of occurrence of a hidden defect as the left boundary, and the window width as a value characterizing the time taken by the physical processes during which the defect degrades the system.
In order to better understand the properties of each specific anomaly detection algorithm, the concept of an application profile was proposed. The application profile is a set of coefficients for true positives, false positives, and false negatives: A TP , A FP , A FN , with 0 ≤ A TP and −1 ≤ A FP , A FN ≤ 0. Three application profiles are usually used: standard (standard weights of coefficients), reward low FPs (heavily penalizes FPs), and reward low FNs (heavily penalizes FNs). The application profiles (see Table 1), given by [23], are used in our work. Table 1. Application profile for NAB scoring algorithm used in this work.

Columns: A_TP, A_FP, A_FN (the values for each profile are given in [23]).
The scoring function is used to weigh all true positives, false positives, and false negatives (true negatives are not used for evaluation, on principle), taking into account the coefficients and the window width. Inside the window, only the first detection is used to weigh the true positives, while the rest are not taken into account. The sigmoidal scoring function gives high scores to early detections and gently penalizes detections outside of the window (see Figure 1). The scoring function is:

σ^A(y) = (A_TP − A_FP) · (1 / (1 + e^{5y})) − 1,

where y is the relative position of the detection within the anomaly window. The score for each particular time series d is:

S^A_d = ∑_{y ∈ Y_d} σ^A(y) + A_FN · f_d,

where Y_d is the set of data instances detected as anomalies by the detection algorithm and f_d is the number of windows in which there are no detected anomalies.
To evaluate the anomaly detection algorithm on the entire set D of datasets, the scores are added together:

S^A = ∑_{d ∈ D} S^A_d.

The final score is:

S_NAB = 100 · (S^A − S^A_null) / (S^A_perfect − S^A_null),

where S^A_perfect is the perfect score, in which all anomalies are detected correctly and there are no false positives, and S^A_null is the score in which no anomalies are detected. The higher the final score, the better the anomaly detection algorithm; the best possible result is 100.
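The final normalization is a simple affine rescaling of the raw score between the "null" and "perfect" extremes; a minimal sketch (the function name is ours):

```python
def nab_final_score(s_raw, s_null, s_perfect):
    """Rescale a raw NAB score to the 0-100 scale used on the scoreboard:
    the null detector maps to 0 and the perfect detector maps to 100."""
    return 100.0 * (s_raw - s_null) / (s_perfect - s_null)
```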

Python Library
One of the results of this work is the Anomaly Detection Library, named "ARIMAFD". The library is implemented in the Python3 programming language. It includes an algorithm based on the ARIMA model for both forecasting and anomaly detection problems, as well as the other algorithms described in this paper. The advantages of the algorithms from this library include the fact that a training set is not necessary, as well as the ability to work in online mode (batch processing). The disadvantages include the laborious process of making the time series stationary and the need to use heuristics.
The Anomaly Detection Library can be installed via the Python Package Index (PyPI) or via a link on GitHub https://github.com/waico/arimafd (accessed on 1 April 2021). For the purpose of unification, the commands and API design follow those of the scikit-learn library. The library is distributed under the MIT License.

Forecasting by Online Algorithms
We used one specific time series, received from the sensors of the turbogenerator in a single operational mode without transients, totaling 25 thousand samples collected over about half a month. The choice to use only one time series was due to our desire to compare the forecast results for one physical quantity. The time series was split into training and testing sets of 5 and 20 thousand samples, respectively (5 thousand samples are enough for the first training period). At each point in the testing set, the results were evaluated. Table 2 presents the results of time series forecasting by the various online algorithms presented in Section 2.2. The results in Table 2 are shown in more detail in Figure 2. Although the MAPE curves of the non-refitted ARIMA, full refitting, and window refitting algorithms visually merge into one (their similar numerical values are shown in Table 2), the figure shows the general behavior of the algorithms' MAPE as the forecast horizon increases. We discuss the results more broadly in Section 4. Table 2. Mean absolute percentage error of the algorithms and the ratio of learning times between the algorithms. Non-refitted Autoregressive Integrated Moving Average (ARIMA) is an algorithm whose weight coefficients remain unchanged on the testing set after fitting on the training set. The training time ratio is the ratio of the mean time taken for one model retraining procedure in the algorithm under study to the mean time taken for one model retraining procedure in the Gradient Descent algorithm (0.09 s per retraining procedure on an Intel Core i9-9900K processor).

Anomaly Detection
The developed anomaly detection algorithm based on the ARIMA model was tested using the Numenta Anomaly Benchmark [9]. Table 3 shows the scoreboard with the current state of anomaly detection algorithm performance on the Numenta Anomaly Benchmark, taken from the official NAB page on GitHub, together with the results obtained using Algorithms 4 and 5. It can be seen from the table that the anomaly detection algorithm based on the ARIMA model shows strong results, and the algorithm based on Online Gradient Descent and the complex metric took third place on this scoreboard. This algorithm was also tested for its ability to detect hidden defects in real technical systems. Figure 3 shows the discrete differences in the weight coefficients for the turbogenerator-bearing vibration. From this figure, it can be seen that the obtained values increase in absolute value at the time moment at which, according to the commission, a defect arose, leading to the failure of the turbogenerator. Most of the other spikes are related to changes in the operation mode. For the case of the turbogenerator, one out of five hidden defects was detected.

Discussion
The results obtained during the forecasting experiment with various online algorithms were rather interesting and sometimes unexpected. The main results are shown in Table 2 and Figure 2. We expected non-refitted ARIMA to show the worst result in terms of the MAPE, but this does not always happen when using real-world data. One possible reason is that, over half a month, the technical system's state does not change enough for the trained model to become incorrect. On the other hand, the models have to be retrained sooner or later due to the turbogenerator's continuous state changes and data drift. In this case, questions arise regarding the model retraining frequency and the window selection for retraining. The OGD algorithm dispels these questions since the model is updated constantly in real time. Additionally, the results clearly show that all the algorithms are suitable for short-term forecasting. However, for long-term forecasting (180 points ahead), as expected [16], the OGD algorithm is less suitable, although its MAPE value is still quite small (less than 1 percent) and comparable with the other algorithms' results. The comparison of computational complexity also favors the OGD-based algorithm among the retrainable model-based algorithms (Table 2).
We also obtained interesting results for the anomaly detection problems at the turbogenerator, where we detected one out of five hidden defects. At first glance, this seems to be a weak result, since just one hidden defect out of five was correctly detected. However, we were in fact able to detect the emergence of an anomaly that had not been detected by the existing diagnostic systems and that ultimately led to serious financial losses. Moreover, this algorithm is an unsupervised change point detection algorithm, which means that it can contribute toward safety even in the case of previously unknown hidden defects. As for the remaining four hidden defects, there is currently no understanding of whether they affected the signals from the sensors and, consequently, whether they could be detected at all. Further anomaly detection results were obtained using the Numenta Anomaly Benchmark. This benchmark was initially created to evaluate unsupervised real-time anomaly detection algorithms and includes time series data from various sources. The fact that the developed algorithm took third place on the corresponding scoreboard allows us to assert that the developed algorithm is broadly applicable.

Conclusions
Many researchers are working on high-performance anomaly detection and forecasting algorithms with the ability to learn continuously and adapt to changing states. In this paper, we proposed a novel algorithm that can solve all of these challenging tasks simultaneously. It is based on the well-known ARIMA model, which is widely used for time-series modeling. The proposed algorithm works in online mode and uses unlabelled data, making it possible to solve tasks in an unsupervised manner. At the same time, it is much more computationally efficient and almost identically accurate when compared to classical ARIMA-based algorithms. All of these characteristics are essential for online process monitoring, including technical system diagnostics. A library in the Python3 programming language, using the "ARIMAFD" algorithm, was also proposed. As such, it is applicable both for anomaly detection and forecasting purposes.
For the forecasting problem, the proposed algorithm was compared with two pseudo-online algorithms and the non-learning ARIMA model. The results clearly show that all the algorithms turned out to be quite accurate, while their computational complexity and robustness to changes in a system differed significantly, in favor of the developed online ARIMA-based algorithm. The strong ability of the proposed algorithm to find anomalies was confirmed by the fact that it took third place in the Numenta Anomaly Benchmark competition and that it was able to detect a fault in the real-world data from the turbogenerator. The proposed algorithm not only beats the classical ARIMA-based 1-point-ahead prediction (Algorithm 5), but also outperforms most of the well-known online anomaly detection algorithms proposed by respected researchers. While the results on the NAB prove the applicability of the proposed algorithm in various applications, the successful anomaly detection in the turbogenerator data shows the clear benefits that the algorithm can bring in improving the safety and economic efficiency of power plants. Taken together, these results suggest that the combined algorithm we developed may contribute to the development of technical diagnostics and online process monitoring.
The next steps in this direction should be the extension of a heuristics list that makes the overall algorithm more robust and invariant to the data, as well as the use of more advanced optimization algorithms.