Missing Data Probability Estimation-Based Bayesian Outlier Detection for Plant-Wide Processes with Multisampling Rates

: Traditional outlier detection methods assume that the sampling time and interval are the same. However, for plant-wide processes, since the signal change rate of different devices may vary by several orders of magnitude, the measured data in real-world systems usually have different sampling rates, resulting in missing data. To achieve reliable outlier detection, a missing data probability estimation-based Bayesian outlier detection method is adopted. In this strategy, the expectation–maximization (EM) algorithm is ﬁrst used to estimate the likelihood probability of different evidence under different process statuses by using the history dataset which contains complete and incomplete samplings. Secondly, the realization of unavailable parts in the monitoring point is estimated as a probability through historical data and online moving horizon data. Bayesian theory and likelihood probability are then used to calculate the outlier posterior probability of different realization. Finally, the outlier probability of the monitoring sampling is calculated by the probability of different realizations and the corresponding outlier probability. Using the Tennessee Eastman (TE) dataset, a simulation indicates that the proposed method exhibits a signiﬁcant improvement over the complete data method.


Introduction
Outlier detection, an important research topic in data mining, has attracted wide attention in academic and applied fields. Outliers appear in the data, usually as a result of process disturbance or instrument drift, which may dramatically undermine subsequent analysis based on the data. Therefore, with the development of science and technology, many outlier detection methods have been proposed; examples include a distance-based outlier detection method for multidimensional datasets [1]; a fuzzy rough semi-supervised outlier detection approach with the help of labeled samples [2]; neural networks-based abnormal detection [3,4]; fuzzy discrimination extended and applied to outlier detection [5,6]; rough set-based attribute selection to handle outlier detection for high dimension data [7]; rough entropy used to calculate the degree of outliers [8]; and, ellipsoidal support vector machine-based outlier detection [9].
Traditional outlier detection methods usually assume that the sampling time and interval are the same. However, for complex multivariable control systems, the signal change rate of different devices is different. Under this situation, if a short sampling period is adopted, the cost of the control system is improved, while if a long period is used, then important information contained in the fast-changing variables will be ignored. In other words, it is impossible to use a single sampling interval in a complex system. In fact, most real-world systems feature multisampling rates. Since a subset of the variables is sampled at a higher rate and another subset of the variables is sampled at a lower rate, the dataset collected from the system will be affected by a problem of incomplete data. For the incomplete dataset, the deleting method is first proposed, in which the records with missing data are deleted directly to obtain a complete dataset. Although this method is easy to implement, a significant amount of effective information will be deleted if the amount of missing data is large, and the real-time accuracy of outlier detection will be affected if only the complete sampling is used for modeling and detection. Therefore, imputation methods are used to handle incomplete data, such as mean completer, combinatorial completer, regression, hot deck, and multiple imputation. These methods achieve dataset integrity by estimating the value of missing data [10][11][12][13]. In addition, to make the best use of existing data without adding or deleting data, methods of data mining based on the missing dataset have also been proposed, for example, the artificial neural network method [14,15], or the rough set reasoning method [16,17].
Despite great achievements having been made for handling missing data and detection of outliers, there are still some shortcomings. For example, because all of these methods assign a definite value to missing data and classify a sample as being normal or an outlier directly, a wrong imputation value or wrong classification will significantly affect subsequent analysis and processing. To avoid such errors, probabilistic estimation is a good alternative. As an efficient tool in probabilistic inference, the Bayesian method has been developed, for example, a Bayesian probabilistic approach has been proposed to find the internal relations and to identify the faults possibly present in the system given the current observations [18], and a novel Bayesian probabilistic diagnostic framework for control loop monitoring has been established and demonstrated [19]. However, the traditional Bayesian method still requires that all monitor readings be available at the same time. To handle the missing data problem, a method based on marginalization over an underlying complete evidence matrix has been proposed to circumvent missing data problems and to realize probability estimation for unmeasured data [20]. While this method focuses only on the case where the missing data follow one specific pattern, in order to deal with the problem of multiple missing data patterns, [21] proposed an expectation-maximization (EM) approach to estimate the likelihood probability of different realizations. Based on the EM approach, considering the plant-wide process problem as well as the asynchronous measurements question, [22] proposed a Bayesian marginalization method within a moving horizon for online incomplete measurements and established a Bayesian diagnosis system revealing both the underlying fault status of the whole plant and the unavailable statuses of the corresponding local units.
Considering the missing data problem of outlier detection, based on the Bayesian probability framework and the marginalization method, as well as the EM algorithm, a missing data estimation and outlier detection strategy is proposed. The contribution contains four aspects: (1) Probability estimation for the realization of unavailable parts in the monitoring sampling is executed through a marginalization method over all historical data and online moving horizon data, including complete data and incomplete data.
(2) The EM algorithm for missing data estimation is used to calculate the likelihood probability of different realizations under different process statuses.
(3) The outlier probability of different realizations is obtained through Bayesian theory.
(4) The total outlier probability of the current incomplete sampling is calculated using the full probability theory with the probability of different realizations and its corresponding outlier probability.
The innovation is that the method proposed in this study is more applicable to real-world processes. It is well known that most real-world complex multivariable processes feature multisampling rates, resulting in a monitoring dataset with missing data. The question of how this missing data is handled is important. The proposed strategy can make full use of the information contained in historical incomplete data and online incomplete data, to determine whether current data, especially current incomplete data, are outliers or not. Incomplete data used in the modeling stage makes detection more accurate, and the use of online incomplete data enables more timely detection.
The remainder of this paper is structured as follows: first, the problem statement and motivation analysis are briefly reviewed. Then, the Bayesian outlier detection method for processes with multisampling rates is described in detail. The efficiency of the proposed approach is illustrated by the TE process. Finally, conclusions and perspectives are provided.

Problem Statement and Motivation Analysis
A data acquisition system (DAS) is a complex system composed of computer control technology, a computer network, and intelligent instruments. The widely used DAS in a plant-wide process is a distributed structure with an upper system, several data collection sites, and communication lines (shown in Figure 1). Usually, in this system, the sampling time and interval are assumed to be the same. In a complex multivariable plant-wide process, however, given the complexity of the controlled and monitored object, the signal change rate of different devices differs, so the sampling period of the related detection device is different, for example, the change rate of a temperature signal and an electrical signal may vary by several orders of magnitude. Although a short sampling period can be used to achieve better outlier detection, this will require an increased cost of the computer control system. Thus, it is impossible to use a single sampling period in all parts of the system. To this end, the best method is to adopt different sampling periods for different change rate signals. When multisampling rates are adopted, the dataset collected from the system will be affected by the problem of missing data. For instance, for a plant-wide monitoring process, suppose that the input is π 1 π 2 π 3 , π i ∈ {0, 1}, and the sampling interval for π 1 is T, for π 2 is 2T, and for π 3 is 3T. If the symbol " * " is used to represent unavailable values, then the monitoring dataset is illustrated in Table 1. This shows that there is only one complete data point in every six samplings. Below we discuss how to realize outlier detection for this kind of dataset.  The conventional outlier detection method requires the sampling interval and time to be consistent. Therefore, to realize monitoring for a multisampling rate system, the missing data should be handled first. The traditional methods mainly include deletion and imputation. The deleting method directly removes incomplete data; for the above example, only one complete sampling data point can be used in every six monitoring samples, which leads to the loss of a significant amount of effective information and to monitoring delay. To use as much information as possible, the imputation method, which gives a reasonable substitute value for missing data, is much more suitable. Through imputation, a complete dataset can be constructed. For the above example, the possible realizations for each incomplete data point are shown in Table 1; however, we need to determine which realization is most likely. In other words, the key question is how to derive imputation values that are as close to the missing original data as possible to reduce the estimated error. That is, the objective of our work is to provide an estimation method for missing data and an outlier detection method for multisampling rates in a plant-wide process. There are several problems to be considered: (1) Since most methods provide definite estimated values for missing data, wrong imputation will significantly affect subsequent analysis and processing. Can probabilistic estimation be considered for missing data realization estimation in order to avoid such errors?
(2) If probabilistic estimation is adopted for the realization of current incomplete monitoring samples, then a follow-up issue is how to calculate outlier probability for each realization and how to classify an incomplete sample through the realization probability and the outlier probability of each realization.
(3) Given the complexities of probabilistic estimation, and the large number of variables in a plant-wide process system, if all variables with continuous values are used in the monitoring index, then the computing complexity will exceed the ability of a general computer. Thus, determining how to form an effective monitoring index is also an urgent problem.
All of these problems will be effectively solved in the next section.

Bayesian Outlier Detection for Plant-Wide Processes with Multisampling Rates
Considering the shortcomings of traditional methods, based on the multisampling rates of plant-wide processes, missing data probability estimation-based Bayesian outlier detection is adopted here. In this strategy, considering the computing complexity of plant-wide processes, given that both the historical data and online horizon data are multisampling rates with incomplete data, the research includes four aspects: (1) to reduce complexity, variables with the same sampling period are placed in a sub-block, and PCA is performed for each sub-block to form monitoring evidence; (2) marginalization-based probability estimation for realization of current incomplete evidence is executed through historical multisampling rate samples and online moving horizon data; (3) the EM algorithm is used to estimate the likelihood of different evidence using multisampling rate historical data; and (4) the posterior probability of different process statuses for current incomplete data is calculated according to Bayesian theory and full probability theory.

Marginalization-Based Realization Estimation
For the multisampling rates system, in order to ensure the total amount of modeling and monitoring data, this study intends to probabilistically estimate the realization of missing data by using complete and incomplete historical data, as well as online moving windows. Several variables need to be defined before the estimation is made: (1) Evidence, E, which is usually the monitoring variable of a process. For a system with B monitors, its evidence can be expressed as E = {π 1 , π 2 , · · · , π B }, where π i is the ith source with q i discrete values. Therefore, the collection of all possible evidence is ε = {e 1 , e 2 , · · · , e K }, where K = ∏ B i=1 q i . However, for a plant-wide system, the variable's measured values are continuous, and the number of process variables is large; thus, taking each process variable as a source of evidence may be beyond the capability of a normal computer. Therefore, the suitable evidence should be designed first in this study. To reduce its size, the multi-block method is adopted to handle data and obtain suitable evidence. First, variables with the same sampling rate are placed into a sub-block, a PCA model is established for each sub-block, and the T 2 and SPE statistics and control limits of each sub-block can be obtained. For PCA details, please refer to [23].
According to the statistic and control limit, the evidence can be generated as where i = 1, 2, · · · , m denotes the block number; π i is the ith source in evidence; T 2 i and T 2 i,lim are the T 2 statistic and T 2 control limit for the ith sub-block; SPE i and SPE i,lim are the SPE statistic and SPE control limit for the ith sub-block; and π i = 1 indicates that the data of the ith sub-block is an outlier.
Note that the number of principal components is selected by the cumulative percent variance (CPV). For sub-block data, the covariance matrix of the data is calculated and related eigenvalues are sorted. The ratio between the first k eigenvalues and the sum of all eigenvalues is defined as the CPV, which is used to represent the proportion of data explained by the first k principal components in all data. A reasonable k is very important for PCA modeling. In this research, we choose the first k principal components whose CPV is greater than 85% as the modeling principal components, as shown in the equation (2) Process status, S, which is the internal state of the system. A system with G possible internal states is denoted S = {S 1 , S 2 , · · · , S G }. For the outlier detection problem, G = 2.
(3) History dataset, D, which is labeled data. A historical dataset with N historical training samples can be expressed as D = d 1 , d 2 , · · · , d N , where d t contains evidence e t and process internal status S t at time t: d t = e t , S t .
(4) Online horizon, H, which is defined as where h l refers to the lthforward sample from the current sample, and r is the length of the horizon. Data h l consists of an evidence vector e l and posteriori probability of F j under evidence e l .
Next, for a current evidence sample, h, with missing data, a marginalization-based solution is used for its realization probability estimation through its observed part, y h , the history training data, and online moving horizon data, which can be expressed as where Ψ = ψ 1|y h , ψ 2|y h , · · · , ψ K h |y h , ψ i|y h = p(e i |y h , H, D), K h is the number of the possible realizations of y h and Ω is the space of all possible parameters in Ψ. For instance, for a three-dimensional system with two possible discrete values, assuming the current incomplete evidence where the numerator can be calculated by integration over the likelihood space Ω.
The Dirichlet distribution with Dirichlet parameters is commonly used to estimate the probability of Ψ where Γ(·) is the Gamma function, α(e 1 |y h , D), α(e 2 |y h , D), · · · , α e K h y h , D and α(e i |y h , D) are the number of prior samples for the possible realization, e i , given the observed part, y h , which is calculated through the historical training data, D.
The likelihood of the samples that are possible realizations of h in the horizon can be written as where N H y h is the expected number of samples that are possible realizations of h in the horizon, and n(e i |y h , H) is the number of e i in the moving horizon.
Taking Equations (3)-(6) into Equation (2) p(e i |y h , H, D) = Using the same derivation procedures as in [20], the realization probabilities can be achieved as [22] p(e i |y h , H, D) = n(e i |y h , H) + α(e i |y h , D) where N D y h is the total number of samples that are possible realizations of h in historical data. Generally, to reflect the prior knowledge, a number of prior samples should be added to the history data; that is α(e i |y h , D) = n(e i |y h , D) + α(e i |y h ), where α(e i |y h ) is the number of prior hypothetical samples for the e i that are possible realizations of h, and n(e i |y h , D) is the number of e i that are possible realizations of h in historical data.
Then, the realization probability is re-expressed as p(e i |y h , H, D) = n(e i |y h , H) + n(e i |y h , D) + α(e i |y h ) where A y h are the total prior samples that are possible realizations of h.
To calculate the realization probability in Equation (9), the below should be performed n(e i |y h , D) = ∑ G j=1 N S j · p e i S j , D where N S j is the number of samples with process status S j in historical dataset D, p e i S j , D is the likelihood which will be introduced in the next section, and e i ∈ R h . For the priori information, the uniform prior is employed, therefore α(e i |y h ) = 1 (12) where e i ∈ R h . Then n e i h k , where h k is the sample in the online moving horizon, r is the length of the horizon, p * e i h k , y h , H is the recorded realization probability of sample h k , and e i ∈ R h . In summary, the realization probability is calculated through historical data, online horizon data, and prior knowledge.

Expectation-Maximization-Based Likelihood Probability Estimation
Here, we introduce the estimation method for the likelihood probability of different evidence under different process statuses. Considering that the historical data is multisampling rates with missing data, the expectation-maximization (EM) method for multiple missing data patterns proposed in [21] is adopted for likelihood estimation here.
First of all, the EM algorithm with missing data is introduced, which iteratively switches between the expectation step (E-step) and maximization step (M-step) to find the maximum likelihood estimate of parameters of interest.
In the E-step, the expected value of the log-likelihood function (Q-function) is built by using the previously estimated parameter, Θ old where Θ is the parameter set to be estimated; Θ old is the estimation result in the previous step; C miss is the unobserved data; and C obs is the observed dataset.
In the M-step, the new estimation of the parameter set is obtained by maximizing the Q-function obtained from the E-step This iteration continues until some stop criterion is satisfied. Next, we describe how to use the EM algorithm to solve the outlier identification problem. The likelihood probability is denoted θ k = p(e k S j , D) , which is interpreted as the probability of e k under process status S j , and Θ = [θ 1 , θ 2 , · · · , θ K ] is the likelihood probability set for process status S j . As for the outlier detection problem, there are two process statuses: normal and outlier. The optimized parameter set of all process statuses is The likelihood probability set Θ = [θ 1 , θ 2 , · · · , θ K ] for S j is estimated first. Since the process involves multisampling rates, the monitoring data subset D S j of S j contains the complete part D c and the incomplete part D ic ; i.e., D S j = {D c , D ic }, and (20) where d i c is the complete evidence, N c is the total number of complete evidence, d i ic is the incomplete evidence, and N ic is the number of incomplete evidence in D S j .
Given that the data are independent, the probabilities of the complete part and the incomplete part under current likelihood probability parameter set Θ old can be expressed, respectively The total likelihood function for D S j can be calculated as Moreover, since the incomplete data entries of D ic can be further partitioned into the monitoring part, y = y 1 , y 2 , · · · , y N ic , and the missing part, z = z 1 , z 2 , · · · , z N ic ,we have Taking Equation (24) into Equation (17), the Q-function is denoted as where Z is the space for all possible values of the realization z.
Following the derivations of [21], the Q-function can be expressed as where ε = [e 1 , e 2 , · · · , e K ] T , and n(e i |D c ) is the number of evidence e = e i in the complete dataset, and n e i D ic , Θ old is the estimated amount of evidence e = e i in the incomplete dataset under the likelihood parameter set Θ old .
Considering that θ k = 1 − θ k = 1 − ∑ j =k θ j , Equation (26) can be expanded as By taking the first derivative with respect to θ k and setting it to zero to obtain the maximum value of the Q-function, the estimation of θ k is achieved as Moreover, the initial conditions are set through the available complete evidence, θ 0 k = n(e k |D c ) N c . Finally, the process is repeated until the parameters converge.

Bayesian and Full Probability-Based Outlier Detection
Based on the likelihood probability of each evidence under different process statuses, given current evidence e c and historical evidence data D, the Bayesian strategy is adopted to infer the posterior probability of each possible process status, S j p S j e c , D = p(e c S j , D)p(S j D) where p(e c S j , D) is the likelihood probability, p(S j D) is the prior probability of process status S j , and p S j e c , D is the posteriori probability of S j under current evidence e c and historical database D. The process status with a large posterior probability is considered to be the probable internal process status. Then, according to the realization probability of the unavailable monitor's reading and posterior probability of each possible process status S j under each realization, the outlier probability of incomplete evidence can be calculated by the full probability method as Overall, the outlier probability estimation for a missing data point in multisampling rates of plant-wide processes can be executed in four phases.
Phase 1: Likelihood probability estimation, which is performed offline: (1) Calculate n(e k |D c ) using the complete dataset D c of certain process statuses.
(2) Use the complete historical data D c of each process status to calculate the initial value , which is set as Θ old = θ 0 1 θ 0 2 · · · θ 0 K . (3) According to Θ old , obtain n e k D ic , Θ old through the incomplete dataset based on Equation (28). The details are also illustrated in Figure 2.

Simulation and Application
In this section, the proposed monitoring scheme is applied to a plant-wide TE benchmark process, which is the general test platform of the process monitoring and diagnosis method [24]. First, obtain the history and online test datasets, both of which contain 960 samplings. The outliers in the history dataset are the 120th, 140th, 240th, 250th, 350th, 360th, 460th, 480th, 600th, 720th, and 840th samples; Table 2 shows the outlier reason for these samplings. The outliers in the online test dataset are also the 120th, 140th, 240th, 250th, 350th, 360th, 460th, 480th, 600th, 720th, and 840th samples, with different fault reasons, listed in Table 3.  Next, suppose that this process is a multisampling rates system, in which some variables are measured with period T, some with 2T, and the others with 3T. Then, put the variables with the same sampling period in the same sub-block so that three sub-blocks in total are established. Perform PCA for each sub-block; thus, related three-dimensional evidence with missing data are generated. Among the history dataset, there are five incomplete samples in every six samples, and the 140th, 250th, 350th, and 460th samples are incomplete data.
After obtaining the history dataset and online test dataset, perform the EM-based likelihood probability estimation, and marginalization-based realization estimation, as well as the Bayesian and full probability-based outlier detection. Figure 3 is the iteration process for the normal status and the outlier status, which shows that convergence is fast. The value of the likelihood for each generation and the final convergence value are shown in Tables 4 and 5 for normal status and outlier status, respectively. For normal status, only e 1 , e 3 , e 4 , and e 6 appear, and the likelihood probability of e 1 is approximately 100%. For outlier status, it is possible for all evidence, except e 1 , to appear with different probabilities.  The method proposed in this paper is compared with the traditional Bayesian method which uses the complete data only. The comparison contains two aspects: (1) The first aspect is the detection result for incomplete samples. The Bayesian detection method, which is based only on complete data, cannot achieve detection, whereas the method adopted in this study can realize the detection for the 140th, 250th, and 350th points, but still deems the 460th point normal instead of classifying it as an outlier, as it is shown in Table 6. (2) The second aspect is that, for complete outlier points, both the traditional Bayesian method and the method adopted in this study can achieve detection, but the detection result is slightly different. For the 120th and 480th points, both methods fail to detect the abnormality, while for the 240th and 600th points, the method used in this study finds the fault with higher probability. For the other outlier points, the two methods achieve the same result,which is illustrated in Figure 4.   Figure 5 is the result for online outlier detection using the adopted method. It reveals that, for all normal points, the correct classifications are given. For outliers in the test dataset, there are two complete outlier points (out of seven outliers), and one incomplete outlier point (out of four points) are wrongly considered a normal point, respectively, so the detection rates are 71.43% and 75% for complete outliers and incomplete outliers, respectively, giving an overall detection rate of 72.8%.

Conclusions
For effective process control and analysis, it is necessary to eliminate outliers. However, due to the complexity of plant-wide processes, systems usually feature multiple sampling rates, while the traditional outlier detection method typically assumes that the sampling time and interval are the same. Thus, dealing with the problem of multisampling rates is a key issue. In this study, a missing data probability estimation-based Bayesian outlier detection method is proposed for outlier detection in a process with multisampling rates.
The research includes four aspects: (1) If all variables are used as a source of evidence for the Bayesian outlier detection system, the amount of evidence is huge. In order to simplify the calculation while achieving desired rates of detection, variables with the same sampling period are placed in the same sub-block, PCA is performed for each sub-block, and the related T 2 and SPE are used to form monitoring evidence. (2) Due to the different sampling periods, the evidence formed through PCA features missing values, so marginalization-based probability estimation is adopted to calculate the realization of incomplete evidence through evidence from historical multisampling rates and an online moving horizon. (3) The possible internal status for each realization is estimated through the EM algorithm with missing data. (4) Via Bayesian and full probability theories, based on the results of (2) and (3), it is determined whether the current incomplete evidence is an outlier.
The difficulty of this work lies in the means of estimating the most likely realization of incomplete evidence based on historical incomplete evidence and online moving incomplete evidence, for which the EM method for multiple missing data patterns is needed. The main work of this research is thus the detection of outliers for incomplete evidence, using Bayesian and full probability theories, based on the possible realization of incomplete evidence in a multisampling rates system. Nonetheless, some disadvantages of this method remain. The first is that considering the accuracy of the estimation using the EM algorithm, there is a limit to the ratio of the maximum incomplete data that is allowed. Second, since the dimension of evidence is the same as the number of sub-blocks of the system, the sub-block is divided according to the sampling cycles. Thus, too many different sampling cycles leads to higher dimensions of evidence, resulting in the number of possible realizations increasing exponentially, thereby challenging computer capabilities. As a result, there is also a limit to the amount of sampling cycles.