Outlier Detection in Dynamic Systems with Multiple Operating Points and Application to Improve Industrial Flare Monitoring

: In chemical industries, process operations are usually comprised of several discrete operating regions with distributions that drift over time. These complexities complicate outlier detection in the presence of intrinsic process dynamics. In this article, we consider the problem of detecting univariate outliers in dynamic systems with multiple operating points. A novel method combining the time series Kalman ﬁlter (TSKF) with the pruned exact linear time (PELT) approach to detect outliers is proposed. The proposed method outperformed benchmark methods in outlier removal performance using simulated data sets of dynamic systems with mean shifts. The method was also able to maintain the integrity of the original data set after performing outlier removal. In addition, the methodology was tested on industrial ﬂaring data to pre-process the ﬂare data for discriminant analysis. The industrial test case shows that performing outlier removal dramatically improves ﬂare monitoring results through Partial Least Squares Discriminant Analysis (PLS-DA), which further conﬁrms the importance of data cleaning in process data analytics.


Introduction
Modern process industries rely on dependable measurements from instrumentation in order to achieve efficient, reliable and safe operation.To this end, the concept of the Internet of Things is receiving wider acceptance in the industry.This is resulting in facilities with massively instrumented intelligent sensors, actuators, and other smart devices gathering real-time process knowledge at a high frequency.However, in order to take take full advantage of such abundant data streams, we need to extract useful information from them as outlined in the vision for future smart manufacturing platforms [1].A typical application of processing large data streams is flare monitoring.Industrial flares are safety devices designed to burn waste gas in a safe and controlled manner [2], and they are commonly used in petroleum refining.The flares can be especially useful during non-routine situations-such as power outages, emergency conditions or plant maintenance activities.Conventionally, flare monitoring work relies on thermal imaging cameras to recognize the difference in the heat signature of a flare stack flame and the surrounding background [3].In this paper, we will demonstrate that plant-wide historical data can also be used to monitor flares effectively.
Raw data from field instrumentation stored in historians is difficult to use directly for modeling until it is cleaned and processed.One of the most important aspects of data cleaning and pre-processing is to remove erroneous values (i.e., outliers in this paper) that are inconsistent with the expected behaviors of measured signals during that timeframe.In chemical processes, outliers could be generated through malfunction of sensors, erroneous signal processing by the control system, or human-related errors such as inappropriate specification of data collection strategies.Outliers can negatively affect main statistical analyses such as the t-test and ANOVA by violating corresponding distribution assumptions, masking the existence of anomalies and swamping effects.Outliers can further negatively impact downstream data mining and processing procedures such as system identification and parameter estimation [4].
Numerous methods have been proposed for outlier detection, and decent reviews can be found in the work done by experts from different fields such as computer science [5] and chemometrics [6,7].Generally, based on whether we have knowledge of a process model a priori, we can categorize those methods into model-based and data-driven, and the latter can be further divided into four subcategories.First, from the perspective of regression theory, removing outliers from data sets is equivalent to estimating the underlying model of "meaningful" values.Several robust regression-based estimators in the presence of outliers are proposed, including L-estimator [8], R-estimator [9], M-estimators [10], S-estimators [11], etc.Although the estimators are simple to implement and take advantage of relations between variables, they do not work well when variables are independent, and the iterative procedures of deleting and refitting will significantly increase the computational cost.
Second, if we focus on estimating data location and scale robustly, we can apply several proximity-based methods including the general extreme studentized deviate (GESD) method [12,13], Hampel identifier [14], quartile-based identifier and boxplots [15] or minimum covariance determinant estimator [16].It is important to point out that a critical assumption of proximity-based methods is that the data follow a well-behaved distribution.However, in the majority of cases, such an assumption does not hold in chemical processes data due to transient dynamics of measured signals.As a result, being able to discriminate outliers from normal process dynamics poses a challenge for end-users of process data.In current literature, a moving window technique is often used to account for process dynamics by performing statistical abnormality detection within a smaller local window [4].However, such an approach does not always give satisfactory results when the time scales of variations in the datasets are non-uniform (fast and slow dynamics occurring in the same dataset).
Recently, machine learning methods have become increasingly popular in outlier detection.Typical examples include k-means clustering [17], k-nearest neighbor (kNN) [18], support vector machine (SVM) [19], principal component analysis (PCA) [20,21], and isolation forest [22].The general advantage of machine learning algorithms lies in their capability to explore interactions between variables and computational efficiency.In this paper, we use isolation forest as the reference method, which isolates abnormal observations through randomly selecting a value between the maximum and minimum values, partitioning the data and constructing a series of trees.The average path length from the root node to the leaf node of a tree over the forest is used to measure the normality of an observation and outliers usually have shorter paths.
The final class of outlier detection methods is time series based including the time series Kalman filter (TSKF) [23].The TSKF approximates normal process variations in the signal by univariate time-series models such as an autoregressive model (AR) and then identifies observations that are inconsistent with the expected time-series model behavior.The advantages of the TSKF method include robustness against ill-conditioned auto-covariance matrices (by using Burg's model estimation method [24]) while maintaining the original dataset integrity.Based on case studies in [23], the TSKF method obtains superior performance with stationary time series; however, it does not perform well on non-stationary process data with distribution shifts like product grade transitions.In addition, considering the fact that continuous operations of many chemical plants, especially petrochemical ones, are scattered along different operating regions with frequent mean shifts for process variables [25], detecting outliers in such situations becomes quite challenging because data no longer follow a single well-behaved distribution.As a result, we have to find ways to help us quickly pinpoint the mean shift locations of each variable to improve the overall performance of univariate outlier detection within each operating region bracketed by changepoints.
Generally speaking, changepoint analysis focuses on identifying break points within a dataset where the statistical properties such as mean or variance show dramatic differences from previous periods [26].Assuming we have a sequence of time series data, y 1:n = (y 1 , ..., y n ), and the data set includes m changepoints at τ 1:m = (τ 1 , ..., τ m ), where 1 ≤ τ 1 < τ i < τ m ≤ n − 1.We define τ 0 = 0 and τ m+1 = n, and assume that τ i are ordered, i.e., τ i < τ j if and only if i < j.As a result, the data y 1:n are divided into m + 1 segments with the ith segment containing y (τ i−1 +1):τ i .
Changepoints can be identified by minimizing: where represents the cost function for a segment and β is a penalty to reduce over-fitting.The cost function is usually the negative log-likelihood given the density function f (y|θ), as shown in Equation (2): where θ can be estimated by maximum likelihood given data within a single stage.The penalty is commonly chosen to be proportional to the number of changepoints, for example, β = m, and it grows with an increasing number of change points.For data y 1:s , if we use F (s) to represent the optimal solution of Equation ( 1), and define to be the set of possible changepoints, the following recursion from optimal partitioning (OP) [27] can be used to give the minimal cost for data y 1:t with t < s: where F (0) = −β.The recursion can be solved in turn for s = 1, 2, ..., n with a linear cost in s; as a result, the overall computational cost of finding F (n) using an optimal partitioning (OP) approach is O n 2 .
Other widely used changepoint search methods include binary segmentation (BS) [28] and segment neighborhood (SN) method [29] with O (nlogn) and O Qn 2 computational costs (Q is the maximum number of change points), respectively.Although the BS method is computationally efficient, it does not necessarily find the global minimum of Equation ( 1) [26].As for the SN method, since the number of change points Q increases linearly with n, the overall computational cost will become high with O n 3 .Bayesian approaches have also been reported [30,31] for changepoint detection, but the associated heavy computational cost cannot be overlooked.In this paper, because we deal with an enormous amount of high-frequency industrial data, a pruned exact linear (PELT) algorithm [26] with O (n) computational efficiency is adopted.
The article is organized as follows.In Sections 2.1 to 2.3, after an introduction to both methods, we propose a strategy to integrate TSKF with PELT to improve TSKF's performance in handling outlier detection in a dynamic data set with multiple operating points for each variable.In Section 2.4, the partial least squares discriminant analysis (PLS-DA) is briefly described to facilitate the understanding of how such a data-driven approach is applied in practice.In Section 3.1, the new outlier detection methodology framework is tested using simulated data set and compared with the conventional general extreme studentized deviate (GESD) and the latest isolation forest methods.Furthermore, The PELT-TSKF is applied to pre-processing both sediment and industrial flare data sets, and its efficacy in improving PLS-DA results are demonstrated in Section 3.2.Although this paper is mainly focused on building data-driven models for industrial flare monitoring, the sediment toxicity detection work is shown as an additional case study to demonstrate the proposed methodology.Discussion on the results of case studies is provided in Section 4. Finally, conclusions and future directions are given in Section 5.

Pruned Exact Linear Time (PELT) Method
To improve the computational efficiency of the optimal partitioning approach, a PELT method [26] is proposed to prune τs that can never be optimal at each iteration based on the assumption that there exists a constant K such that for all t < s < T, As a result, at a future time T > s, if the following condition holds, there will never be an optimal last changepoint at t prior to T: In above formulations, if the cost function is represented by negative log-likelihood, K = 0.The overview of the PELT algorithm is shown in Figure 1.We first initialize the objective function F (0) = −β and define an empty changepoint set and a searching pool.Second, we iteratively find the minimal value of objective function F (τ * ) and associated time point τ 1 , extend the change point set, and prune τs that cannot reach optima by shrinking the searching pool R τ * .The pruning step shown in Figure 1 can significantly reduce the computational cost of the optimal partitioning approach especially when n is large.

Time Series Kalman Filter (TSKF)
Given a univariate data set {y t } N t=1 , steps of an offline implementation of TSKF are as follows [23]: 1. Data partition: partition the data set into M subsets {y t } N i t=1 , i = 1, 2, ..., M. 2. Pre-whitening: for each subset {y t } N i t=1 , i = 1, 2, ..., M, pre-whiten the data using the reweighed minimum covariance determinant estimator [16], and centralize the data with robust center µ i . where

Predict:
4.3.Update: Usually, the autoregressive model order p = 2 will suffice, and, in this study, we pick the threshold γ = 95%, which means that we assume that the maximum amount of outliers in datasets would not exceed 5% of the total number of observations.

An Integrated Method for Outlier Detection in a Dynamic Data Set with Multiple Operating Points
To overcome the challenge of outlier detection in dynamic data sets with multiple operating points, we propose to combine the PELT with TSKF, and an integrated methodology overview is shown in Figure 2.For each variable {y t } n t=1 , we first use the PELT algorithm to find mean change points that segment the data and then apply TSKF to outlier detection within each subset bracketed by change points.Finally, we replace the outliers with neighboring normal points.

Partial Least Squares Discriminant Analysis (PLS-DA)
Partial least squares discriminant analysis (PLS-DA) is a linear classification method that derives from partial least squares (PLS) [33,34] by building a regression model between the predictor matrix X and the predicted label matrix Y [35].The standard mathematical formula for PLS-DA model is shown in Equation ( 20): where X is an n × m matrix of predictors; Y is an n × p one hot encoded matrix of class labels, and y ij represents the membership of the i-th sample to the j-th class expressed with binary code (1 or 0); T = [t 1 , t 2 , ..., t n ] T and U are n × a score matrices; P and Q are m × a and p × a orthogonal loading matrices, respectively; E and F are error matrices assumed to be independently and identically are the regression coefficients.
A distinctive feature of PLS-DA is that it maximizes the covariance between X and Y for each principal component spanning the reduced space, i.e., the covariance between T and U.In addition, advantages of PLS-DA include dimension reduction by using latent variables that are linear combinations of the original variables to model data variability and providing intuitive graphical visualization of data patterns and relations based on scores and loadings [36].The PLS1 algorithm is commonly used to build PLS-DA models [37] in the two-class scenarios.PLS-DA returns estimated class values within (0,1).The classification of samples is conducted by choosing a label with the highest probability, and usually a threshold is used to determine which class a sample belongs to.In two-class scenarios with labels "0" and "1", we assume that such a threshold equals 0.5 for simplicity, although the Bayes theorem can be used to give a more rigorous result [38].

Case Studies
In this section, PELT-TSKF's capability of detecting univariate outliers in the dynamic data sets with multiple operating points is tested in simulated cases, and is compared with the general ESD (GESD) identifier [13] that detects outliers based on mean and standard deviation, and the isolation forest [22] that detects outliers based on finding points associated with the shortest average path lengths from the root node to the leaf node of a tree over the forest.(Note: the isolation forest is implemented in publicly available python package scikit-lean [39].).Moreover, we use an open-source sediment toxicity detection challenge as a toy problem to demonstrate the efficacy of PELT-TSKF in improving the PLS-DA model performance before applying the same methodology to industrial flare monitoring.

Simulated Case Study
In our simulated case studies, to illustrate the capability of outlier detection in dynamic data sets with mean shifts, performance of the new PELT-TSKF algorithm is compared with that of a GESD identifier combined with the PELT algorithm (the performance of GESD identifier will become much worse if without PELT).The intrinsic dynamic variation is approximated by a stationary autoregressive moving average (ARMA) (1,1) model: where φ, θ are coefficients, ε is the white noise, ε t ∼ N (0, 1) and z −1 is a shift operator.A total number of 1000 samples are simulated with two mean shifts at the 300th time point (µ = 10) and the 600th time point (µ = 20), respectively.
The contamination rate κ is defined as the percentage of outliers in the data: In our simulation case, κ = 5%, and outliers with equal probabilities being +Amp or −Amp (Amp is the amplitude of outliers) are added every 20th sample point except for the mean shift at the 300th and 600th time points.
The detection rate χ is defined as the percentage of outliers being successfully identified, and mis-identification rate β is defined as the percentage of normal data falsely tagged as outliers (type I error): A series of data sets with different values of (φ, θ, Amp) are simulated repetitively 100 times each, and the average detection rate χ and mis-identification β are given in Table 1.Comparing the values of χ in Table 1, we can see that the PELT-TSKF works better than PELT-ESD and isolation forest algorithms when process autocorrelation increases with larger values of (φ, θ).The result suggests that system dynamics negatively affect the GESD identifier and isolation forest significantly.
Moreover, a larger outlier size Amp will surely help the outlier detection by increasing the outlier detection rate χ, but it will not necessarily decrease the mis-identification rate β for PELT-GESD because the relative conservativeness of the GESD identifier leading to minor changes of the false alarms with different outlier sizes.Although the PELT-TSKF is generally more aggressive than the PELT-GESD with a higher mis-identification rate β, it has limited influence when abundant data are available coming from high-frequency sampling.
We provide an illustrative data set with φ = 0.5, θ = −0.5,Amp = 4. Figures 3 and 4 show that the PELT algorithm can easily find two mean shifts at the 300th and 600th sample points marked by black dashed lines in both figures.In addition, additive outliers detected are highlighted by red squares, and we can see that the PELT-TSKF method is able to detect more additive outliers than the PELT-GESD identifier because there are more red squares in Figure 3 than in Figure 4.However, as indicated by the red circles shown in Figure 3, the PELT-TSKF cannot shield normal observations from swamping effects caused by outliers and leads to a few mis-identified instances-for example, the 21st point after an outlier at the 20th, and the 741st point after an outlier at the 740th time point are both mistaken for outliers because of previous anomalies.

Sediment Toxicity Detection
In this section, a sediment dataset is used.The X data include 1884 sediment samples and nine chemical variables that are the log-transformed concentrations (ppm) of metals (As, Cd, Cr, Cu, Pb, Hg, Ni, Ag, Zn) [40].Sediment samples are divided into two classes on the basis of their toxicity, as shown in Table 2.While class 1 contains 1624 non-toxic samples, class 2 includes 260 toxic samples.Figure 5 displays raw training X of the sediment data set.There are obvious mean shifts in variables like Cd, Cu, Ni, Ag, and Zn.Figures 6 and 7 show cleaning results by PELT-TSKF on Cd and Ni data set.The new method is able to detect abnormal observations highlighted in red squares and find the mean shift points marked by black dash lines.In Figure 6, there is still an outlier at around the 1300th point that the PELT-TSKF fails to detect, mainly because there are not enough normal data on the second stage to provide a good estimation of autoregressive models.As a result, the outlier detection performance is compromised.
Both raw training and testing X data sets are cleaned by PELT-TSKF.Then, raw and clean data sets are both weight centered by subtracting the average means of two groups to remove biases caused by unequal class sizes [37] and scaled using associated standard deviations.PLS-DA models are built and tested based on resulting standardized raw and clean data sets.Three principal components are selected based on cross-validation and are able to capture 80% of total variance of X.Three metrics are used to evaluate the model performance: non-error rate (NER, percentage of correctly predicted samples), toxicity sensitivity (Sn, ability to correctly recognize toxic samples as toxic in percentage), and non-toxicity specificity (Sp, ability to correctly recognize non-toxic samples as non-toxic, in percentage).The model testing results are shown in Table 3: Results in Table 3 show that cleaning the outliers by PELT-TSKF improved the data quality and consequently improved the PLS-DA classification results as reflected through three metrics: NER increased from 0.792 to 0.847, S n from 0.846 to 0.877 and S p from 0.783 to 0.842.

Industrial Flare Monitoring
In this section, the application of the proposed PELT-TSKF method and PLS-DA is tested with flare data sets taken from the petrochemical industry (for intellectual property reasons, the variables are named V1, V2, ...).Data are sampled at every 1 min.Flares associated with non-routine situations are tagged with "1" as opposed to those corresponding to normal operations with "0".We randomly select two testing data sets and use observations that precede them as training data.Details on the sizes of training and testing sets are listed in Table 4.In this study, out of 132 variables, we focus on critical ones including feed/outlet flows, temperature, water level, etc. Critical variables in both training and testing data are cleaned using PELT-TSKF methods, and part of the results are shown in Figures 8 and 9.As highlighted by black dash lines and red squares, the new algorithm is able to find out mean shift points while detecting outliers masked by process variations.
Table 5 summarizes PLS-DA results based on raw and clean flare data sets, and we can see that cleaning outliers improves flare monitoring results in the Flare 1 case, increasing the detection rate α from 62.86% to 88.47%.In addition, although removing outliers does not affect the false alarm rate in the Flare 1 case, it helps reduce false alarm rates in the Flare 2 case, decreasing the false alarm rate β from 0.18% to 0.16% or reducing six false alarms in total.
Furthermore, Figures 10 and 11 illustrate flare monitoring results obtained from PLS-DA models built on raw and clean data sets.While Y = 1 indicates the occurrence of non-routine flares, Y = 0 represents normal operation.Regarding Flare 1 results, comparing the middle plot with the bottom one shown in Figure 10, we can see that cleaning outliers will slightly increase the width of rectangular pulse, leading to a flare detection result close to the true measured flare events given by the plot on top.
As for Flare 2 results in Figure 11, we can easily find out that there are more spikes shown in the middle plot associated with raw data than the bottom one corresponding to a clean data set, indicating that the plant will suffer from a larger false alarm rate if no data cleaning is applied.

Discussion
From the results of simulated ARMA (1,1) data with mean shifts, we can see that PELT-TSKF is able to correctly identify the two stage changes and outperformed the general ESD and isolation forest approaches in outlier identification accuracy.This is especially apparent when the process dynamics become significant (shown by larger autocorrelation coefficients).However, it is worth pointing out that the isolation forest can also be used to identify multivariate outliers and is more efficient than the extension of TSKF in multivariate cases.The toy problem of sediment toxicity detection illustrates a successful application of PELT-TSKF in practice: cleaning outliers from concentration measurements and increasing the non-error rate, toxicity sensitivity and non-toxicity specificity calculated from the PLS-DA model as a result.In the flare monitoring case study, on the one hand, we demonstrate that applying data-driven approaches like PLS-DA can effectively assist process engineers; on the other hand, the efficacy of new PELT-TSKF has been proved in ameliorating the quality of industrial flaring data which contain multiple operating points, and improving the resulting flare monitoring performance by increasing the anomaly detection rate and reducing possibilities of false alarms.

Conclusions
Taking advantage of abundant raw plant data gathered by intelligent devices plays an important role in transforming modern plant operations from reactive to predictive.Data cleaning lays the foundation of knowledge discovery from data.In this paper, we focus on overcoming the challenges of outlier detection in dynamic systems with multiple operating points, and propose a novel method combining the time series Kalman filter (TSKF) with the pruned exact linear time (PELT) approach.The PELT-TSKF approach is able to pinpoint mean shifts and differentiate outliers from process variations at the same time.In addition, we demonstrate that applying data driven PLS-DA can assist process engineers in flare monitoring, and cleaning outliers using TSKF-PELT can further improve the performance of such a challenge task.
Although our method is able to deal with univariate outlier detection within dynamic system mean shifts, in practice, process operating points may be characterized by other factors other than mean, such as variance or frequency.In addition, operating point changes may affect individual variables in different ways: while a few of them show mean shifts, others exhibit variance changes.In addition,we only consider situations where operating point changes are significant enough that the impact of outliers can be ignored.When the impact of outliers on change points detection becomes significant, we can apply cross-validation to determine the optimal number of stage changes associated with the lowest cross validation error.Furthermore, multivariate outliers are also usually faced in process data, and how to efficiently detect multivariate outliers in the dynamic systems with multiple operating points is worth exploring in the future.

Figure 11 .
Figure 11.Flare 2: PLS-DA analysis based on raw and clean data.
3. Model fitting: based on the preliminary clean data {y c t } If ξ ≥ γ, stop; otherwise, increase ∆ by d∆.4.4.5.The outliers correspond to observations with Mahalanobis distance d t ≥ ∆ f inal .4.5.Replace: replace the outliers with neighboring normal values.

Table 2 .
Sediment dataset: sample partition in toxic and non-toxic classes.

Table 3 .
Sediment test results summary.

Table 4 .
Sizes of training and testing sets.

Table 5 .
Partial least squares-discriminant analysis (PLS-DA) results on raw and clean data sets.