Missing Data Probability Estimation-Based Bayesian Outlier Detection for Plant-Wide Processes with Multisampling Rates

Tian, Ying; Yin, Zhong; Huang, Miao

doi:10.3390/sym10100475


by Ying Tian ¹,*, Zhong Yin ¹ and Miao Huang ²

¹ School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China

² Ningbo Institute of Technology, Zhejiang University, Ningbo 315100, China

* Author to whom correspondence should be addressed.

Symmetry 2018, 10(10), 475; https://doi.org/10.3390/sym10100475

Submission received: 16 September 2018 / Revised: 30 September 2018 / Accepted: 1 October 2018 / Published: 10 October 2018


Abstract

Traditional outlier detection methods assume that the sampling time and interval are the same. However, for plant-wide processes, since the signal change rate of different devices may vary by several orders of magnitude, the measured data in real-world systems usually have different sampling rates, resulting in missing data. To achieve reliable outlier detection, a missing data probability estimation-based Bayesian outlier detection method is adopted. In this strategy, the expectation–maximization (EM) algorithm is first used to estimate the likelihood probability of different evidence under different process statuses by using the history dataset which contains complete and incomplete samplings. Secondly, the realization of unavailable parts in the monitoring point is estimated as a probability through historical data and online moving horizon data. Bayesian theory and likelihood probability are then used to calculate the outlier posterior probability of different realization. Finally, the outlier probability of the monitoring sampling is calculated by the probability of different realizations and the corresponding outlier probability. Using the Tennessee Eastman (TE) dataset, a simulation indicates that the proposed method exhibits a significant improvement over the complete data method.

Keywords: outlier detection; missing data estimation; Bayesian; expectation–maximization (EM); multisampling rates; plant-wide process

1. Introduction

Outlier detection, an important research topic in data mining, has attracted wide attention in academic and applied fields. Outliers usually appear in the data as a result of process disturbance or instrument drift, and they may dramatically undermine subsequent analysis based on the data. Therefore, many outlier detection methods have been proposed; examples include a distance-based outlier detection method for multidimensional datasets [1]; a fuzzy rough semi-supervised outlier detection approach with the help of labeled samples [2]; neural network-based abnormality detection [3,4]; fuzzy discrimination extended and applied to outlier detection [5,6]; rough set-based attribute selection to handle outlier detection for high-dimensional data [7]; rough entropy used to calculate the degree of outliers [8]; and ellipsoidal support vector machine-based outlier detection [9].

Traditional outlier detection methods usually assume that the sampling time and interval are the same. However, for complex multivariable control systems, the signal change rate differs between devices. In this situation, if a short sampling period is adopted, the cost of the control system increases, while if a long period is used, important information contained in the fast-changing variables will be lost. In other words, it is impractical to use a single sampling interval in a complex system. In fact, most real-world systems feature multisampling rates. Since a subset of the variables is sampled at a higher rate and another subset is sampled at a lower rate, the dataset collected from the system will be incomplete. For an incomplete dataset, the deletion method was proposed first, in which the records with missing data are deleted directly to obtain a complete dataset. Although this method is easy to implement, a significant amount of effective information will be deleted if the amount of missing data is large, and the real-time accuracy of outlier detection will be affected if only the complete samplings are used for modeling and detection. Therefore, imputation methods are used to handle incomplete data, such as mean completer, combinatorial completer, regression, hot deck, and multiple imputation. These methods achieve dataset integrity by estimating the value of missing data [10,11,12,13]. In addition, to make the best use of existing data without adding or deleting data, data mining methods that work directly on the incomplete dataset have also been proposed, for example, the artificial neural network method [14,15] and the rough set reasoning method [16,17].

Despite great achievements having been made for handling missing data and detection of outliers, there are still some shortcomings. For example, because all of these methods assign a definite value to missing data and classify a sample as being normal or an outlier directly, a wrong imputation value or wrong classification will significantly affect subsequent analysis and processing. To avoid such errors, probabilistic estimation is a good alternative. As an efficient tool in probabilistic inference, the Bayesian method has been developed, for example, a Bayesian probabilistic approach has been proposed to find the internal relations and to identify the faults possibly present in the system given the current observations [18], and a novel Bayesian probabilistic diagnostic framework for control loop monitoring has been established and demonstrated [19]. However, the traditional Bayesian method still requires that all monitor readings be available at the same time. To handle the missing data problem, a method based on marginalization over an underlying complete evidence matrix has been proposed to circumvent missing data problems and to realize probability estimation for unmeasured data [20]. While this method focuses only on the case where the missing data follow one specific pattern, in order to deal with the problem of multiple missing data patterns, [21] proposed an expectation–maximization (EM) approach to estimate the likelihood probability of different realizations. Based on the EM approach, considering the plant-wide process problem as well as the asynchronous measurements question, [22] proposed a Bayesian marginalization method within a moving horizon for online incomplete measurements and established a Bayesian diagnosis system revealing both the underlying fault status of the whole plant and the unavailable statuses of the corresponding local units.

Considering the missing data problem of outlier detection, based on the Bayesian probability framework and the marginalization method, as well as the EM algorithm, a missing data estimation and outlier detection strategy is proposed. The contribution contains four aspects:

(1) Probability estimation for the realization of unavailable parts in the monitoring sampling is executed through a marginalization method over all historical data and online moving horizon data, including complete data and incomplete data.

(2) The EM algorithm for missing data estimation is used to calculate the likelihood probability of different realizations under different process statuses.

(3) The outlier probability of different realizations is obtained through Bayesian theory.

(4) The total outlier probability of the current incomplete sampling is calculated using the full probability theory with the probability of different realizations and its corresponding outlier probability.

The innovation is that the method proposed in this study is more applicable to real-world processes. It is well known that most real-world complex multivariable processes feature multisampling rates, resulting in a monitoring dataset with missing data. The question of how this missing data is handled is important. The proposed strategy can make full use of the information contained in historical incomplete data and online incomplete data, to determine whether current data, especially current incomplete data, are outliers or not. Incomplete data used in the modeling stage makes detection more accurate, and the use of online incomplete data enables more timely detection.

The remainder of this paper is structured as follows: first, the problem statement and motivation analysis are briefly reviewed. Then, the Bayesian outlier detection method for processes with multisampling rates is described in detail. The efficiency of the proposed approach is illustrated by the TE process. Finally, conclusions and perspectives are provided.

2. Problem Statement and Motivation Analysis

A data acquisition system (DAS) is a complex system composed of computer control technology, a computer network, and intelligent instruments. The widely used DAS in a plant-wide process is a distributed structure with an upper system, several data collection sites, and communication lines (shown in Figure 1). Usually, in this system, the sampling time and interval are assumed to be the same. In a complex multivariable plant-wide process, however, given the complexity of the controlled and monitored object, the signal change rate of different devices differs, so the sampling period of the related detection device is different, for example, the change rate of a temperature signal and an electrical signal may vary by several orders of magnitude. Although a short sampling period can be used to achieve better outlier detection, this will require an increased cost of the computer control system. Thus, it is impossible to use a single sampling period in all parts of the system. To this end, the best method is to adopt different sampling periods for different change rate signals.

When multisampling rates are adopted, the dataset collected from the system will be affected by the problem of missing data. For instance, for a plant-wide monitoring process, suppose that the input is $[\pi_1 \ \pi_2 \ \pi_3]$, $\pi_i \in \{0, 1\}$, and the sampling interval for $\pi_1$ is $T$, for $\pi_2$ is $2T$, and for $\pi_3$ is $3T$. If the symbol “∗” is used to represent unavailable values, then the monitoring dataset is illustrated in Table 1. This shows that there is only one complete data point in every six samplings. Below we discuss how to realize outlier detection for this kind of dataset.
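The availability pattern described above can be reproduced with a short sketch. It assumes that all sources are aligned at time zero, so that the sample at time $kT$ contains source $i$ only when $k$ is a multiple of its period; the function name `multirate_pattern` and the marker `"x"` for an available reading are hypothetical and used only for illustration.

```python
def multirate_pattern(periods, n_samples, missing="*"):
    """Build availability rows for sources sampled every periods[i] * T.

    Assumption: sample k (1-based, taken at time k*T) contains source i
    only when k is a multiple of periods[i].
    """
    rows = []
    for k in range(1, n_samples + 1):
        rows.append(["x" if k % p == 0 else missing for p in periods])
    return rows


# Three sources with sampling intervals T, 2T and 3T, as in the example.
rows = multirate_pattern([1, 2, 3], 12)

# Indices of samples in which every source is available.
complete = [k for k, row in enumerate(rows, start=1) if "*" not in row]
```

Under this alignment assumption, a complete sample occurs once per least common multiple of the periods, which is six for the intervals $T$, $2T$, and $3T$.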

The conventional outlier detection method requires the sampling interval and time to be consistent. Therefore, to realize monitoring for a multisampling rate system, the missing data should be handled first. The traditional methods mainly include deletion and imputation. The deleting method directly removes incomplete data; for the above example, only one complete sampling data point can be used in every six monitoring samples, which leads to the loss of a significant amount of effective information and to monitoring delay. To use as much information as possible, the imputation method, which gives a reasonable substitute value for missing data, is much more suitable. Through imputation, a complete dataset can be constructed. For the above example, the possible realizations for each incomplete data point are shown in Table 1; however, we need to determine which realization is most likely. In other words, the key question is how to derive imputation values that are as close to the missing original data as possible to reduce the estimated error. That is, the objective of our work is to provide an estimation method for missing data and an outlier detection method for multisampling rates in a plant-wide process. There are several problems to be considered:

(1) Since most methods provide definite estimated values for missing data, wrong imputation will significantly affect subsequent analysis and processing. Can probabilistic estimation be considered for missing data realization estimation in order to avoid such errors?

(2) If probabilistic estimation is adopted for the realization of current incomplete monitoring samples, then a follow-up issue is how to calculate outlier probability for each realization and how to classify an incomplete sample through the realization probability and the outlier probability of each realization.

(3) Given the complexities of probabilistic estimation, and the large number of variables in a plant-wide process system, if all variables with continuous values are used in the monitoring index, then the computing complexity will exceed the ability of a general computer. Thus, determining how to form an effective monitoring index is also an urgent problem.

All of these problems will be effectively solved in the next section.

3. Bayesian Outlier Detection for Plant-Wide Processes with Multisampling Rates

Considering the shortcomings of traditional methods, based on the multisampling rates of plant-wide processes, missing data probability estimation-based Bayesian outlier detection is adopted here. In this strategy, considering the computing complexity of plant-wide processes, given that both the historical data and online horizon data are multisampling rates with incomplete data, the research includes four aspects: (1) to reduce complexity, variables with the same sampling period are placed in a sub-block, and PCA is performed for each sub-block to form monitoring evidence; (2) marginalization-based probability estimation for realization of current incomplete evidence is executed through historical multisampling rate samples and online moving horizon data; (3) the EM algorithm is used to estimate the likelihood of different evidence using multisampling rate historical data; and (4) the posterior probability of different process statuses for current incomplete data is calculated according to Bayesian theory and full probability theory.

3.1. Marginalization-Based Realization Estimation

For the multisampling rates system, in order to ensure the total amount of modeling and monitoring data, this study intends to probabilistically estimate the realization of missing data by using complete and incomplete historical data, as well as online moving windows. Several variables need to be defined before the estimation is made:

(1) Evidence, $E$, which is usually the monitoring variable of a process. For a system with $B$ monitors, its evidence can be expressed as $E = \{\pi_1, \pi_2, \dots, \pi_B\}$, where $\pi_i$ is the $i$th source with $q_i$ discrete values. Therefore, the collection of all possible evidence is $\varepsilon = \{e_1, e_2, \dots, e_K\}$, where $K = \prod_{i=1}^{B} q_i$. However, for a plant-wide system, the variable’s measured values are continuous, and the number of process variables is large; thus, taking each process variable as a source of evidence may be beyond the capability of a normal computer. Therefore, the suitable evidence should be designed first in this study. To reduce its size, the multi-block method is adopted to handle data and obtain suitable evidence. First, variables with the same sampling rate are placed into a sub-block, a PCA model is established for each sub-block, and the $T^2$ and $SPE$ statistics and control limits of each sub-block can be obtained. For PCA details, please refer to [23].

According to the statistic and control limit, the evidence can be generated as

$$\pi_i = \begin{cases} 0, & \text{if } T_i^2 \leq T_{i,\lim}^2 \text{ and } SPE_i \leq SPE_{i,\lim} \\ 1, & \text{if } T_i^2 > T_{i,\lim}^2 \text{ or } SPE_i > SPE_{i,\lim} \end{cases} \tag{1}$$

where $i = 1, 2, \dots, m$ denotes the block number; $\pi_i$ is the $i$th source in evidence; $T_i^2$ and $T_{i,\lim}^2$ are the $T^2$ statistic and $T^2$ control limit for the $i$th sub-block; $SPE_i$ and $SPE_{i,\lim}$ are the $SPE$ statistic and $SPE$ control limit for the $i$th sub-block; and $\pi_i = 1$ indicates that the data of the $i$th sub-block is an outlier.
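As a minimal sketch of Equation (1), the discretization of one sub-block can be written as follows. The function name `block_evidence`, the use of `None` for an unavailable sub-block, and the numerical values are assumptions made for illustration; they are not taken from the original study.

```python
def block_evidence(t2, spe, t2_lim, spe_lim):
    """Discretize one sub-block following Equation (1).

    Returns 0 when both statistics are within their control limits and
    1 otherwise. Assumption: a sub-block that was not sampled is passed
    as None and stays None (unavailable).
    """
    if t2 is None or spe is None:
        return None
    return 0 if (t2 <= t2_lim and spe <= spe_lim) else 1


# Hypothetical statistics for three sub-blocks; the third is not sampled.
evidence = [
    block_evidence(3.1, 0.8, t2_lim=9.2, spe_lim=1.5),
    block_evidence(12.4, 0.9, t2_lim=9.2, spe_lim=1.5),
    block_evidence(None, None, t2_lim=9.2, spe_lim=1.5),
]
```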

Note that the number of principal components is selected by the cumulative percent variance (CPV). For sub-block data, the covariance matrix of the data is calculated and its eigenvalues are sorted in descending order. The ratio between the sum of the first $k$ eigenvalues and the sum of all eigenvalues is defined as the CPV, which represents the proportion of the total variance explained by the first $k$ principal components. A reasonable $k$ is very important for PCA modeling. In this research, we choose the first $k$ principal components whose CPV is greater than 85% as the modeling principal components, that is, $\sum_{i=1}^{k} \lambda_i / \sum_{i=1}^{m} \lambda_i > 85\%$.
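The CPV rule can be illustrated with a small sketch that returns the smallest $k$ whose CPV exceeds the threshold. The function name `select_k_by_cpv` and the example eigenvalues are hypothetical.

```python
def select_k_by_cpv(eigenvalues, threshold=0.85):
    """Return the smallest k whose cumulative percent variance exceeds threshold."""
    lam = sorted(eigenvalues, reverse=True)  # eigenvalues in descending order
    total = sum(lam)
    running = 0.0
    for k, value in enumerate(lam, start=1):
        running += value
        if running / total > threshold:
            return k
    return len(lam)  # fallback: keep all components


# Hypothetical eigenvalues of a sub-block covariance matrix.
# Cumulative ratios are 0.40, 0.65, 0.80, 0.90, 1.00.
k = select_k_by_cpv([4.0, 2.5, 1.5, 1.0, 1.0])
```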

(2) Process status, $S$, which is the internal state of the system. A system with $G$ possible internal states is denoted $S = \{S_1, S_2, \dots, S_G\}$. For the outlier detection problem, $G = 2$.

(3) History dataset, $D$, which is labeled data. A historical dataset with $N$ historical training samples can be expressed as $D = \{d^1, d^2, \dots, d^N\}$, where $d^t$ contains evidence $e^t$ and process internal status $S^t$ at time $t$: $d^t = \{e^t, S^t\}$.

(4) Online horizon, $H$, which is defined as $H = \{h^0, h^1, h^2, \dots, h^{r-1}\}$, where $h^l$ refers to the $l$th forward sample from the current sample, and $r$ is the length of the horizon. Data $h^l$ consists of an evidence vector $e^l$ and the posterior probability of each process status $S_j$ under evidence $e^l$.

Next, for a current evidence sample, $h$, with missing data, a marginalization-based solution is used for its realization probability estimation through its observed part, $y_h$, the history training data, and online moving horizon data, which can be expressed as

$$p(e_i | y_h, H, D) = \int_{\Omega} p(e_i | y_h, \Psi, H, D) \, f(\Psi | y_h, H, D) \, d\Psi \tag{2}$$

where $\Psi = \{\psi_{1|y_h}, \psi_{2|y_h}, \dots, \psi_{K_h|y_h}\}$, $\psi_{i|y_h} = p(e_i | y_h, H, D)$, $K_h$ is the number of the possible realizations of $y_h$, and $\Omega$ is the space of all possible parameters in $\Psi$. For instance, for a three-dimensional system with two possible discrete values, assuming the current incomplete evidence is $[1 \ * \ *]$, then its possible evidence set is

$$R_h = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \\ 1 & 1 & 1 \end{bmatrix}$$

and the number of the possible realizations, $K_h$, is 4.
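The set of possible realizations of an incomplete evidence vector can be enumerated mechanically. The sketch below is illustrative only; the function name `possible_realizations` and the use of `None` for an unavailable source are assumptions.

```python
from itertools import product


def possible_realizations(observed, levels=(0, 1)):
    """Enumerate all complete evidence vectors compatible with `observed`.

    Assumption: unavailable sources are marked with None, and every
    source takes values from the same set `levels`.
    """
    choices = [levels if value is None else (value,) for value in observed]
    return [tuple(candidate) for candidate in product(*choices)]


# Incomplete evidence [1 * *] from the example above.
R_h = possible_realizations((1, None, None))
K_h = len(R_h)
```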

In Equation (2), $f(\Psi | y_h, H, D)$ can be calculated by Bayes’ rule as

$$f(\Psi | y_h, H, D) = \frac{p(H | \Psi, y_h, D) \, f(\Psi | y_h, D)}{p(H | y_h, D)} \tag{3}$$

where the denominator can be calculated by integration over the space $\Omega$:

$$p(H | y_h, D) = \int_{\Omega} p(H | \Psi, y_h, D) \, f(\Psi | y_h, D) \, d\Psi \tag{4}$$

The Dirichlet distribution is commonly used to describe the probability of $\Psi$:

$$f(\Psi | y_h, D) = \frac{\Gamma\left(\sum_{i=1}^{K_h} \alpha(e_i | y_h, D)\right)}{\prod_{i=1}^{K_h} \Gamma\left(\alpha(e_i | y_h, D)\right)} \prod_{i=1}^{K_h} (\psi_{i|y_h})^{\alpha(e_i | y_h, D) - 1} \tag{5}$$

where $\Gamma(\cdot)$ is the Gamma function; $\alpha(e_1 | y_h, D), \alpha(e_2 | y_h, D), \dots, \alpha(e_{K_h} | y_h, D)$ are the Dirichlet parameters; and $\alpha(e_i | y_h, D)$ is the number of prior samples for the possible realization, $e_i$, given the observed part, $y_h$, which is calculated through the historical training data, $D$.

The likelihood of the samples that are possible realizations of $h$ in the horizon can be written as

$$p(H | \Psi, y_h, D) = \prod_{k=1}^{N_{H_{y_h}}} p(h_{y_h}^k | \Psi, y_h, D) = \prod_{i=1}^{K_h} (\psi_{i|y_h})^{n(e_i | y_h, H)} \tag{6}$$

where $N_{H_{y_h}}$ is the expected number of samples that are possible realizations of $h$ in the horizon, and $n(e_i | y_h, H)$ is the number of $e_i$ in the moving horizon.

Substituting Equations (3)–(6) into Equation (2) gives

$$p(e_i | y_h, H, D) = \int_{\Omega} \psi_{i|y_h} \frac{\prod_{j=1}^{K_h} (\psi_{j|y_h})^{n(e_j | y_h, H)} \frac{\Gamma\left(\sum_{j=1}^{K_h} \alpha(e_j | y_h, D)\right)}{\prod_{j=1}^{K_h} \Gamma\left(\alpha(e_j | y_h, D)\right)} \prod_{j=1}^{K_h} (\psi_{j|y_h})^{\alpha(e_j | y_h, D) - 1}}{\int_{\Omega} \prod_{j=1}^{K_h} (\psi_{j|y_h})^{n(e_j | y_h, H)} \frac{\Gamma\left(\sum_{j=1}^{K_h} \alpha(e_j | y_h, D)\right)}{\prod_{j=1}^{K_h} \Gamma\left(\alpha(e_j | y_h, D)\right)} \prod_{j=1}^{K_h} (\psi_{j|y_h})^{\alpha(e_j | y_h, D) - 1} \, d\Psi} \, d\Psi \tag{7}$$

Using the same derivation procedures as in [20], the realization probabilities can be obtained as [22]

$$p(e_i | y_h, H, D) = \frac{n(e_i | y_h, H) + \alpha(e_i | y_h, D)}{N_{H_{y_h}} + N_{D_{y_h}}} \tag{8}$$

where $N_{D_{y_h}}$ is the total number of samples that are possible realizations of $h$ in historical data.
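The step from Equation (7) to Equation (8) follows from the standard conjugacy between the Dirichlet prior and the multinomial likelihood. A brief sketch is given here, using the shorthand $n_i = n(e_i | y_h, H)$ and $\alpha_i = \alpha(e_i | y_h, D)$, which is introduced only for this illustration, and assuming that $N_{H_{y_h}} = \sum_j n_j$ (Equation (16)) and that $N_{D_{y_h}} = \sum_j \alpha_j$:

$$f(\Psi | y_h, H, D) \propto \prod_{j=1}^{K_h} (\psi_{j|y_h})^{n_j + \alpha_j - 1} \quad \Longrightarrow \quad \Psi | y_h, H, D \sim \mathrm{Dirichlet}(n_1 + \alpha_1, \dots, n_{K_h} + \alpha_{K_h})$$

$$p(e_i | y_h, H, D) = E[\psi_{i|y_h}] = \frac{n_i + \alpha_i}{\sum_{j=1}^{K_h} (n_j + \alpha_j)} = \frac{n_i + \alpha_i}{N_{H_{y_h}} + N_{D_{y_h}}}$$

The normalizing Gamma-function ratio appears in both the numerator and the denominator of Equation (7) and cancels, so only the mean of the posterior Dirichlet distribution remains.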

Generally, to reflect the prior knowledge, a number of prior samples should be added to the history data; that is, $\alpha(e_i | y_h, D) = n(e_i | y_h, D) + \alpha(e_i | y_h)$, where $\alpha(e_i | y_h)$ is the number of prior hypothetical samples for the $e_i$ that are possible realizations of $h$, and $n(e_i | y_h, D)$ is the number of $e_i$ that are possible realizations of $h$ in historical data.

Then, the realization probability is re-expressed as

$$p(e_i | y_h, H, D) = \frac{n(e_i | y_h, H) + n(e_i | y_h, D) + \alpha(e_i | y_h)}{N_{H_{y_h}} + N_{D_{y_h}} + A_{y_h}} \tag{9}$$

where $A_{y_h}$ is the total number of prior samples that are possible realizations of $h$.

To calculate the realization probability in Equation (9), the following quantities are needed:

$$n(e_i | y_h, D) = \sum_{j=1}^{G} N_{S_j} \cdot p(e_i | S_j, D) \tag{10}$$

$$N_{D_{y_h}} = \sum_{i=1}^{K_h} n(e_i | y_h, D) \tag{11}$$

where $N_{S_j}$ is the number of samples with process status $S_j$ in historical dataset $D$, $p(e_i | S_j, D)$ is the likelihood, which is introduced in the next section, and $e_i \in R_h$.

For the prior information, a uniform prior is employed; therefore

$$\alpha(e_i | y_h) = 1 \tag{12}$$

$$A_{y_h} = K_h \tag{13}$$

where $e_i \in R_h$.

Then

$$n(e_i | y_h, H) = \sum_{k=1}^{r} n(e_i | h^k, y_h, H) \tag{14}$$

$$n(e_i | h^k, y_h, H) = \begin{cases} p^*(e_i | h^k, y_h, H), & \text{if } e_i \in R_{h^k} \\ 0, & \text{if } e_i \notin R_{h^k} \end{cases} \tag{15}$$

$$N_{H_{y_h}} = \sum_{i=1}^{K_h} n(e_i | y_h, H) \tag{16}$$

where $h^k$ is a sample in the online moving horizon, $r$ is the length of the horizon, $p^*(e_i | h^k, y_h, H)$ is the recorded realization probability of sample $h^k$, and $e_i \in R_h$.

In summary, the realization probability is calculated through historical data, online horizon data, and prior knowledge.
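A minimal sketch of Equations (9)–(16) is given below. The data layout is an assumption made for illustration: likelihoods are stored as dictionaries keyed by evidence tuples, and each horizon sample is represented by a dictionary of its recorded realization probabilities. The function name `realization_probabilities` and all numbers are hypothetical.

```python
def realization_probabilities(R_h, status_counts, likelihood, horizon_records):
    """Estimate p(e_i | y_h, H, D) for every candidate realization in R_h.

    R_h             : list of candidate evidence tuples
    status_counts   : {status: N_Sj}, number of historical samples per status
    likelihood      : {status: {evidence: p(e | S_j, D)}}
    horizon_records : list of {evidence: p*(e | h^k)}, one dict per horizon sample
    """
    # Equation (10): expected historical count of each candidate.
    n_D = {e: sum(status_counts[s] * likelihood[s].get(e, 0.0)
                  for s in status_counts) for e in R_h}
    # Equations (14)-(15): candidates absent from a record contribute 0.
    n_H = {e: sum(rec.get(e, 0.0) for rec in horizon_records) for e in R_h}
    # Equation (12): uniform prior, one hypothetical sample per candidate.
    alpha = {e: 1.0 for e in R_h}
    # Equations (11), (13), (16): the three totals in the denominator.
    denom = sum(n_H.values()) + sum(n_D.values()) + sum(alpha.values())
    # Equation (9).
    return {e: (n_H[e] + n_D[e] + alpha[e]) / denom for e in R_h}


# Hypothetical two-source example with observed part y_h = [1 *].
R_h = [(1, 0), (1, 1)]
probs = realization_probabilities(
    R_h,
    status_counts={"normal": 90, "outlier": 10},
    likelihood={"normal": {(0, 0): 0.9, (1, 0): 0.1},
                "outlier": {(1, 0): 0.4, (1, 1): 0.6}},
    horizon_records=[{(1, 0): 1.0}, {(1, 0): 0.5, (1, 1): 0.5}, {(0, 0): 1.0}],
)
```

In this example the historical counts are 13 and 6, the horizon counts are 1.5 and 0.5, and the prior adds one sample to each candidate, so the two probabilities are 15.5/23 and 7.5/23.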

3.2. Expectation–Maximization-Based Likelihood Probability Estimation

Here, we introduce the estimation method for the likelihood probability of different evidence under different process statuses. Considering that the historical data is multisampling rates with missing data, the expectation–maximization (EM) method for multiple missing data patterns proposed in [21] is adopted for likelihood estimation here.

First of all, the EM algorithm with missing data is introduced, which iteratively switches between the expectation step (E-step) and maximization step (M-step) to find the maximum likelihood estimate of parameters of interest.

In the E-step, the expected value of the log-likelihood function (Q-function) is built by using the previously estimated parameter, $\Theta^{old}$:

$$Q(\Theta | \Theta^{old}) = E_{C_{miss} | C_{obs}, \Theta^{old}} \left[ \log p(C_{obs}, C_{miss} | \Theta) \right] \tag{17}$$

where $\Theta$ is the parameter set to be estimated; $\Theta^{old}$ is the estimation result in the previous step; $C_{miss}$ is the unobserved data; and $C_{obs}$ is the observed dataset.

In the M-step, the new estimation of the parameter set is obtained by maximizing the Q-function obtained from the E-step:

$$\Theta^{new} = \underset{\Theta}{\arg\max} \, Q(\Theta | \Theta^{old}) \tag{18}$$

This iteration continues until a stopping criterion is satisfied.

Next, we describe how to use the EM algorithm to solve the outlier identification problem. The likelihood probability is denoted $\theta_k = p(e_k | S_j, D)$, which is interpreted as the probability of $e_k$ under process status $S_j$, and $\Theta = [\theta_1, \theta_2, \dots, \theta_K]$ is the likelihood probability set for process status $S_j$. As for the outlier detection problem, there are two process statuses: normal and outlier. The optimized parameter set of all process statuses is $\Xi = [\Theta_1^{opt}, \Theta_2^{opt}]$.

The likelihood probability set $\Theta = [\theta_1, \theta_2, \dots, \theta_K]$ for $S_j$ is estimated first. Since the process involves multisampling rates, the monitoring data subset $D_{S_j}$ of $S_j$ contains the complete part $D_c$ and the incomplete part $D_{ic}$; i.e., $D_{S_j} = \{D_c, D_{ic}\}$, and

$$D_c = \{d_c^1, d_c^2, \dots, d_c^{N_c}\} \tag{19}$$

$$D_{ic} = \{d_{ic}^1, d_{ic}^2, \dots, d_{ic}^{N_{ic}}\} \tag{20}$$

where $d_c^i$ is the complete evidence, $N_c$ is the total number of complete evidence, $d_{ic}^i$ is the incomplete evidence, and $N_{ic}$ is the number of incomplete evidence in $D_{S_j}$.

Given that the data are independent, the probabilities of the incomplete part and the complete part under the current likelihood probability parameter set $\Theta^{old}$ can be expressed, respectively, as

$$p(D_{ic} | \Theta^{old}) = \prod_{t=1}^{N_{ic}} p(d_{ic}^t | \Theta^{old}) \tag{21}$$

$$p(D_c | \Theta^{old}) = \prod_{t=1}^{N_c} p(d_c^t | \Theta^{old}) = \prod_{k=1}^{K} \theta_k^{n(e_k | D_c)} \tag{22}$$

The total log-likelihood function for $D_{S_j}$ can be calculated as

$$\begin{aligned} L(D_{S_j} | \Theta^{old}) &= \log p(D_c, D_{ic} | \Theta^{old}) \\ &= \log \left[ p(D_c | \Theta^{old}) \, p(D_{ic} | \Theta^{old}) \right] \\ &= \log \left[ \prod_{k=1}^{K} \theta_k^{n(e_k | D_c)} \prod_{t=1}^{N_{ic}} p(d_{ic}^t | \Theta^{old}) \right] \end{aligned} \tag{23}$$

Moreover, since the incomplete data entries of $D_{ic}$ can be further partitioned into the monitoring part, $y = \{y_1, y_2, \dots, y_{N_{ic}}\}$, and the missing part, $z = \{z_1, z_2, \dots, z_{N_{ic}}\}$, we have

$$L(D_{S_j} | \Theta^{old}) = \sum_{k=1}^{K} n(e_k | D_c) \log \theta_k + \sum_{t=1}^{N_{ic}} \log p(z_t, y_t | \Theta^{old}) \tag{24}$$

Substituting Equation (24) into Equation (17), the Q-function is denoted as

$$\begin{aligned} Q(\Theta | \Theta^{old}) &= \sum_{Z} p(z | y, D_c, \Theta^{old}) \log p(D_{ic}, D_c | \Theta) \\ &= \sum_{Z} p(z | y, D_c, \Theta^{old}) \left( \sum_{k=1}^{K} n(e_k | D_c) \log \theta_k + \sum_{t=1}^{N_{ic}} \log p(z_t, y_t | \Theta) \right) \end{aligned} \tag{25}$$

where $Z$ is the space for all possible values of the realization $z$.

Following the derivations of [21], the Q-function can be expressed as

$$Q(\Theta | \Theta^{old}) = \left[ n(\varepsilon | D_c) + n(\varepsilon | D_{ic}, \Theta^{old}) \right]^T \log \Theta \tag{26}$$

where $\varepsilon = [e_1, e_2, \dots, e_K]^T$, $n(e_i | D_c)$ is the number of evidence $e = e_i$ in the complete dataset, and $n(e_i | D_{ic}, \Theta^{old})$ is the estimated amount of evidence $e = e_i$ in the incomplete dataset under the likelihood parameter set $\Theta^{old}$.

Considering that $\theta_k = 1 - \bar{\theta}_k = 1 - \sum_{j \neq k} \theta_j$, Equation (26) can be expanded as

$$\begin{aligned} Q(\theta_k | \Theta^{old}) = {} & \left[ n(e_k | D_c) + n(e_k | D_{ic}, \Theta^{old}) \right] \cdot \log \theta_k \\ & + \left[ n(\bar{e}_k | D_c) + n(\bar{e}_k | D_{ic}, \Theta^{old}) \right] \cdot \log (1 - \theta_k) \end{aligned} \tag{27}$$

where

$$\begin{aligned} n(e_k | D_{ic}, \Theta^{old}) &= \sum_{t=1}^{N_{ic}} p(e_k | d_{ic}^t, \Theta^{old}) \\ n(\bar{e}_k | D_{ic}, \Theta^{old}) &= \sum_{j \neq k} n(e_j | D_{ic}, \Theta^{old}) \\ p(e_k | d_{ic}^t, \Theta^{old}) &= \begin{cases} \dfrac{p(e_k | \Theta^{old})}{\sum_{e_j \in d_{ic}^t} p(e_j | \Theta^{old})}, & e_k \in d_{ic}^t \\ 0, & e_k \notin d_{ic}^t \end{cases} \end{aligned} \tag{28}$$

By taking the first derivative with respect to $\theta_k$ and setting it to zero to obtain the maximum value of the Q-function, the estimation of $\theta_k$ is achieved as

$$\theta_k^{new} = \frac{n(e_k | D_c) + n(e_k | D_{ic}, \Theta^{old})}{N_c + N_{ic}} \tag{29}$$

Moreover, the initial conditions are set through the available complete evidence, $\theta_k^0 = \frac{n(e_k | D_c)}{N_c}$. Finally, the process is repeated until the parameters converge.
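The iteration defined by Equations (28) and (29) can be sketched for the samples of a single process status as follows. This is an illustrative sketch rather than the implementation used in the study: the function names, the `None` marker for missing entries, the convergence tolerance, and the uniform fallback used when all candidates of an incomplete sample currently have zero probability are assumptions. At least one complete sample is required for the initial value.

```python
from itertools import product


def compatible(evidence, sample):
    """True if a complete evidence tuple agrees with the observed entries of sample."""
    return all(s is None or s == e for e, s in zip(evidence, sample))


def em_likelihood(samples, levels=(0, 1), tol=1e-8, max_iter=200):
    """Estimate theta_k = p(e_k | S_j, D) from samples of ONE process status."""
    space = list(product(levels, repeat=len(samples[0])))
    complete = [tuple(s) for s in samples if None not in s]
    incomplete = [tuple(s) for s in samples if None in s]

    # n(e_k | D_c): counts of each evidence in the complete part.
    n_c = {e: 0.0 for e in space}
    for s in complete:
        n_c[s] += 1.0

    # Initial value from the complete evidence only.
    theta = {e: n_c[e] / len(complete) for e in space}
    total = len(complete) + len(incomplete)

    for _ in range(max_iter):
        # E-step, Equation (28): expected counts from the incomplete part.
        n_ic = {e: 0.0 for e in space}
        for s in incomplete:
            cands = [e for e in space if compatible(e, s)]
            z = sum(theta[e] for e in cands)
            for e in cands:
                # Assumption: uniform split when every candidate has zero probability.
                n_ic[e] += theta[e] / z if z > 0.0 else 1.0 / len(cands)
        # M-step, Equation (29).
        new_theta = {e: (n_c[e] + n_ic[e]) / total for e in space}
        delta = max(abs(new_theta[e] - theta[e]) for e in space)
        theta = new_theta
        if delta < tol:
            break
    return theta


# Hypothetical two-source samples of one status; None marks a missing entry.
samples = [(0, 0), (0, 0), (0, 1), (0, None), (None, 0), (1, None)]
theta = em_likelihood(samples)
```

Because each incomplete sample distributes exactly one unit of count over its candidates, the estimated likelihoods sum to one after every iteration.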

3.3. Bayesian and Full Probability-Based Outlier Detection

Based on the likelihood probability of each evidence under different process statuses, given current evidence $e^c$ and historical evidence data $D$, the Bayesian strategy is adopted to infer the posterior probability of each possible process status, $S_j$:

$$p(S_j | e^c, D) = \frac{p(e^c | S_j, D) \, p(S_j | D)}{\sum_{k=1}^{G} p(e^c | S_k, D) \, p(S_k | D)} \tag{30}$$

where $p(e^c | S_j, D)$ is the likelihood probability, $p(S_j | D)$ is the prior probability of process status $S_j$, and $p(S_j | e^c, D)$ is the posterior probability of $S_j$ under current evidence $e^c$ and historical database $D$. The process status with the largest posterior probability is considered to be the most probable internal process status.

Then, according to the realization probability of the unavailable monitor’s reading and the posterior probability of each possible process status $S_j$ under each realization, the outlier probability of incomplete evidence can be calculated by the full probability method as

$$p(S | y_h, H, D) = \sum_{i=1}^{K_h} p(S | e_i, D) \, p(e_i | y_h, H, D) \tag{31}$$
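Equations (30) and (31) can be combined in a short sketch. The dictionary layout, the function names, the fallback to the prior when an evidence has zero likelihood under every status, and the numerical values are assumptions made for illustration.

```python
def posterior(evidence, likelihood, prior):
    """Equation (30): posterior probability of each status given complete evidence."""
    joint = {s: likelihood[s].get(evidence, 0.0) * prior[s] for s in prior}
    z = sum(joint.values())
    if z == 0.0:
        # Assumption: evidence with zero likelihood everywhere falls back to the prior.
        return dict(prior)
    return {s: value / z for s, value in joint.items()}


def total_status_probability(realization_probs, likelihood, prior):
    """Equation (31): weight each realization's posterior by its realization probability."""
    result = {s: 0.0 for s in prior}
    for evidence, p_e in realization_probs.items():
        post = posterior(evidence, likelihood, prior)
        for s in prior:
            result[s] += post[s] * p_e
    return result


# Hypothetical likelihoods, priors and realization probabilities.
likelihood = {"normal": {(0, 0): 0.9, (1, 0): 0.1},
              "outlier": {(1, 0): 0.4, (1, 1): 0.6}}
prior = {"normal": 0.9, "outlier": 0.1}
realization_probs = {(1, 0): 0.5, (1, 1): 0.5}

status_probs = total_status_probability(realization_probs, likelihood, prior)
```

In this example the realization (1, 0) gives an outlier posterior of 4/13 and the realization (1, 1) gives 1, so the outlier probability of the incomplete sample is 0.5 · 4/13 + 0.5.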

Overall, the outlier probability estimation for a missing data point in multisampling rates of plant-wide processes can be executed in four phases.

Phase 1: Likelihood probability estimation, which is performed offline.

(1) Calculate $n(e_k | D_c)$ using the complete dataset $D_c$ of each process status.

(2) Use the complete historical data $D_c$ of each process status to calculate the initial value $\theta_k^0 = \frac{n(e_k | D_c)}{N_c}$, which is set as $\Theta^{old} = [\theta_1^0 \ \theta_2^0 \ \dots \ \theta_K^0]$.

(3) According to $\Theta^{old}$, obtain $n(e_k | D_{ic}, \Theta^{old})$ through the incomplete dataset based on Equation (28).

(4) Calculate the new likelihood $\Theta^{new}$ by using $n(e_k | D_c)$ and $n(e_k | D_{ic}, \Theta^{old})$ based on Equation (29).

(5) Check whether the terminating conditions are satisfied; if so, record the final likelihood probability. Otherwise, set $\Theta^{new}$ as $\Theta^{old}$ and repeat steps (3)–(5).

Phase 2: Posterior probability estimation, which is performed offline.

(1) Based on the offline likelihood probability, the posterior probability of each possible process status $S_j$ under each evidence is obtained by Equation (30).

Phase 3: Realization probability estimation, which is performed online.

(1) For the current observed part $y_h$ of the incomplete sample, calculate $n(e_i | y_h, D)$, $N_{D_{y_h}}$, $\alpha(e_i | y_h)$, and $A_{y_h}$ according to Equations (10)–(13).

(2) Obtain $n(e_i | y_h, H)$ and $N_{H_{y_h}}$ according to the online moving horizon and Equations (14)–(16).

(3) Obtain the realization probability $p(e_i | y_h, H, D)$ based on Equation (9).

Phase 4: Full probability estimation, which is performed online.

Using the realization probability of Equation (9) and the posterior probability of each possible process status $S_j$ of Equation (30), the outlier probability of an incomplete evidence can be calculated through Equation (31).

The details are also illustrated in Figure 2.

4. Simulation and Application

In this section, the proposed monitoring scheme is applied to a plant-wide TE benchmark process, which is the general test platform of the process monitoring and diagnosis method [24]. First, obtain the history and online test datasets, both of which contain 960 samplings. The outliers in the history dataset are the 120th, 140th, 240th, 250th, 350th, 360th, 460th, 480th, 600th, 720th, and 840th samples; Table 2 shows the outlier reason for these samplings. The outliers in the online test dataset are also the 120th, 140th, 240th, 250th, 350th, 360th, 460th, 480th, 600th, 720th, and 840th samples, with different fault reasons, listed in Table 3.

Next, suppose that this process is a multisampling rates system, in which some variables are measured with period T, some with 2T, and the others with 3T. Variables with the same sampling period are put in the same sub-block, so that three sub-blocks are established in total. PCA is performed for each sub-block, which generates three-dimensional evidence with missing data. In the history dataset, five of every six samples are incomplete, and the outliers at the 140th, 250th, 350th, and 460th samples are incomplete data.
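The missing-data pattern implied by the three sampling periods can be checked with a short Python sketch. It assumes 1-based sample indices and that a sub-block with period p is measured exactly when the index is divisible by p; this alignment is an assumption, although it agrees with the incomplete outliers marked in Table 2. The names `PERIODS`, `observed_blocks`, and `is_complete` are hypothetical.

```python
PERIODS = (1, 2, 3)  # sub-block sampling periods in units of T


def observed_blocks(k, periods=PERIODS):
    """For 1-based sample index k, tell which sub-blocks are measured."""
    return tuple(k % p == 0 for p in periods)


def is_complete(k, periods=PERIODS):
    """A sample is complete only when every sub-block is measured."""
    return all(observed_blocks(k, periods))


# Number of incomplete samples in one cycle of six samples.
incomplete_per_cycle = sum(not is_complete(k) for k in range(1, 7))

# Outlier indices of the history dataset.
outliers = [120, 140, 240, 250, 350, 360, 460, 480, 600, 720, 840]
incomplete_outliers = [k for k in outliers if not is_complete(k)]
```

Under this assumption `incomplete_per_cycle` is 5 and `incomplete_outliers` is `[140, 250, 350, 460]`, so a sample is complete only at multiples of 6T.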

After obtaining the history dataset and online test dataset, perform the EM-based likelihood probability estimation, the marginalization-based realization estimation, and the Bayesian and full probability-based outlier detection. Figure 3 shows the EM iteration process for the normal status and the outlier status; convergence is fast. The likelihood values at each iteration and the final converged values are shown in Table 4 and Table 5 for the normal status and the outlier status, respectively. For the normal status, only $e_{1}$, $e_{3}$, $e_{4}$, and $e_{6}$ appear, and the likelihood probability of $e_{1}$ is approximately 100%. For the outlier status, all evidence except $e_{1}$ can appear, with different probabilities.

The method proposed in this paper is compared with the traditional Bayesian method which uses the complete data only. The comparison contains two aspects:

(1) The first aspect is the detection result for incomplete samples. The Bayesian detection method based only on complete data cannot perform detection on these samples, whereas the method adopted in this study detects the 140th, 250th, and 350th points but still deems the 460th point normal rather than an outlier, as shown in Table 6.
(2) The second aspect is that, for complete outlier points, both the traditional Bayesian method and the method adopted in this study can achieve detection, but the detection results differ slightly. For the 120th and 480th points, both methods fail to detect the abnormality, while for the 240th and 600th points, the method used in this study finds the fault with higher probability. For the other outlier points, the two methods achieve the same result, as illustrated in Figure 4.

Figure 5 shows the online outlier detection result of the adopted method. All normal points are classified correctly. Among the outliers in the test dataset, two of the seven complete outlier points and one of the four incomplete outlier points are wrongly classified as normal, so the detection rates are 71.43% (5/7) for complete outliers and 75% (3/4) for incomplete outliers, giving an overall detection rate of 72.7% (8/11).

5. Conclusions

For effective process control and analysis, it is necessary to eliminate outliers. However, due to the complexity of plant-wide processes, systems usually feature multiple sampling rates, while the traditional outlier detection method typically assumes that the sampling time and interval are the same. Thus, dealing with the problem of multisampling rates is a key issue. In this study, a missing data probability estimation-based Bayesian outlier detection method is proposed for outlier detection in a process with multisampling rates.

The research includes four aspects: (1) If all variables are used as a source of evidence for the Bayesian outlier detection system, the amount of evidence is huge. In order to simplify the calculation while achieving desired rates of detection, variables with the same sampling period are placed in the same sub-block, PCA is performed for each sub-block, and the related $T^{2}$ and $SPE$ are used to form monitoring evidence. (2) Due to the different sampling periods, the evidence formed through PCA features missing values, so marginalization-based probability estimation is adopted to calculate the realization of incomplete evidence through evidence from historical multisampling rates and an online moving horizon. (3) The possible internal status for each realization is estimated through the EM algorithm with missing data. (4) Via Bayesian and full probability theories, based on the results of (2) and (3), it is determined whether the current incomplete evidence is an outlier.

The difficulty of this work lies in estimating the most likely realization of incomplete evidence based on historical incomplete evidence and online moving incomplete evidence, for which the EM method for multiple missing data patterns is needed. The main contribution of this research is thus the detection of outliers for incomplete evidence, using Bayesian and full probability theories, based on the possible realizations of incomplete evidence in a multisampling rates system. Nonetheless, some disadvantages of this method remain. First, to keep the EM estimate accurate, there is a limit on the ratio of incomplete data that can be allowed. Second, since the sub-blocks are divided according to the sampling cycles, the dimension of the evidence equals the number of sub-blocks. Thus, too many different sampling cycles lead to higher-dimensional evidence, so the number of possible realizations increases exponentially, thereby challenging computing capabilities. As a result, there is also a limit on the number of sampling cycles.

Supplementary Materials

Supplementary File 1

Author Contributions

Conceptualization, Y.T.; Methodology, Y.T.; Software, Y.T., Z.Y., and M.H.; Validation, Y.T., Z.Y., and M.H.; Data curation, Y.T. and Z.Y.; Writing—original draft preparation, Y.T.

Funding

This work was sponsored by the Shanghai Sailing Program (No. 17YF1428300, 17YF1413100) and the Shanghai University Youth Teacher Training Program (ZZslg16009).

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

References

1. Knorr, E.M.; Ng, R.T. Algorithms for Mining Distance-based Outliers in Large Datasets. In Proceedings of the International Conference on Very Large Data Bases, New York, NY, USA, 24–27 August 1998; pp. 392–403.
2. Xue, Z.; Shang, Y.; Feng, A. Semi-supervised outlier detection based on fuzzy rough C-means clustering. Math. Comput. Simul. 2010, 80, 1911–1921.
3. Englund, C.; Verikas, A. A hybrid approach to outlier detection in the offset lithographic printing process. Eng. Appl. Artif. Intell. 2005, 18, 759–768.
4. Han, S.J.; Cho, S.B. Evolutionary neural networks for anomaly detection based on the behavior of a program. IEEE Trans. Syst. Man Cybern. Part B Cybern. 2006, 36, 559–570.
5. Hung, W.L.; Yang, M.S. An omission approach for detecting outliers in fuzzy regression models. Fuzzy Sets Syst. 2006, 157, 3109–3122.
6. Lin, C.C.; Chen, A.P. Fuzzy discriminant analysis with outlier detection by genetic algorithm. Comput. Oper. Res. 2004, 31, 877–888.
7. Xu, N.; Zhang, Y. An Efficient Reduction Algorithm of High-dimensional Decision Tables Based on Rough Sets Theory. In Proceedings of the Intelligent Control and Automation (WCICA 2004), Hangzhou, China, 15–19 June 2004; Volume 4305, pp. 4304–4308.
8. Li, X.; Rao, F. An rough entropy based approach to outlier detection. J. Comput. Inf. Syst. 2012, 8, 10501–10508.
9. Zhang, Y.; Meratnia, N.; Havinga, P.J.M. Distributed Online Outlier Detection in Wireless Sensor Networks Using Ellipsoidal Support Vector Machine; Elsevier Science Publishers B.V.: Amsterdam, The Netherlands, 2013; pp. 1062–1074.
10. Little, R.J.A.; Rubin, D.B. Statistical Analysis with Missing Data, Second Edition; John Wiley & Sons: Hoboken, NJ, USA, 2002; pp. 200–220.
11. Grzymalabusse, J.W.; Hu, M. A Comparison of Several Approaches to Missing Attribute Values in Data Mining; Springer: Berlin/Heidelberg, Germany, 2000; pp. 378–385.
12. Kumar, N.; Hoque, M.A.; Shahjaman, M.; Islam, S.M.S.; Mollah, M.N.H. A new approach of outlier-robust missing value imputation for metabolomics data analysis. Curr. Bioinform. 2017, 12.
13. Kim, I.S.; Jung, W. Method of processing the outliers and missing values of field data to improve RAM analysis accuracy. J. Appl. Reliab. 2017, 17, 264–271.
14. Xiao, H.; Huang, D.; Pan, Y.; Liu, Y.; Song, K. Fault diagnosis and prognosis of wastewater processes with incomplete data by the auto-associative neural networks and ARMA Model. Chemometr. Intell. Lab. Syst. 2016, 161, 96–107.
15. Yan, Y.T.; Zhang, Y.P.; Zhang, Y.W.; Du, X.Q. A selective neural network ensemble classification for incomplete data. Int. J. Mach. Learn. Cybern. 2016, 8, 1–12.
16. Nowicki, R. On Combining Neuro-Fuzzy architectures with the rough set theory to solve classification problems with incomplete data. IEEE Trans. Knowl. Data Eng. 2008, 20, 1239–1253.
17. Luo, C.; Li, T.; Yao, Y. Dynamic probabilistic rough sets with incomplete data. Inf. Sci. 2017, 417, 39–54.
18. Pernestaal, A. Probabilistic Fault Diagnosis: With Automotive Applications. Ph.D. Thesis, Linköping University, Linköping, Sweden, 2009; pp. 38–49.
19. Huang, B. Bayesian methods for control loop monitoring and diagnosis. J. Process Control 2008, 18, 829–838.
20. Qi, F.; Huang, B.; Tamayo, E.C. A Bayesian approach for control loop diagnosis with missing data. AIChE J. 2010, 56, 179–195.
21. Zhang, K.; Gonzalez, R.; Huang, B.; Ji, G. An expectation maximization approach to fault diagnosis with missing data. IEEE Trans. Ind. Electron. 2015, 62, 1231–1240.
22. Jiang, Q.; Huang, B.; Ding, S.X.; Yan, X. Bayesian fault diagnosis with asynchronous measurements and its application in networked distributed monitoring. IEEE Trans. Ind. Electron. 2016, 63, 6316–6324.
23. Ge, Z.; Song, Z. Distributed PCA model for plant-wide process monitoring. Ind. Eng. Chem. Res. 2013, 52, 1947–1957.
24. Downs, J.J.; Vogel, E.F. A plant-wide industrial process control problem. Comput. Chem. Eng. 1993, 17, 245–255.

Figure 1. Data acquisition system.

Figure 2. Flowchart of the proposed outlier detection method.

Figure 3. EM iteration results for normal and outlier status: (a) the normal status; (b) the outlier status.

Figure 4. Outlier detection result for complete outlier sampling.

Figure 5. Online detection result for outlier sampling.

Table 1. Monitoring dataset with missing value.

Sample	$π_{2} (2 T)$	$π_{3} (3 T)$	Possible Realization
$d^{1}$	*	*	$[\begin{matrix} 0 & 0 & 0 \end{matrix}], [\begin{matrix} 0 & 0 & 1 \end{matrix}], [\begin{matrix} 0 & 1 & 0 \end{matrix}], [\begin{matrix} 0 & 1 & 1 \end{matrix}]$
$d^{2}$	1	*	$[\begin{matrix} 0 & 1 & 0 \end{matrix}], [\begin{matrix} 0 & 1 & 1 \end{matrix}]$
$d^{3}$	*	0	$[\begin{matrix} 0 & 1 & 0 \end{matrix}], [\begin{matrix} 0 & 0 & 0 \end{matrix}]$
$d^{4}$	1	*	$[\begin{matrix} 0 & 1 & 0 \end{matrix}], [\begin{matrix} 0 & 1 & 1 \end{matrix}]$
$d^{5}$	*	*	$[\begin{matrix} 0 & 0 & 0 \end{matrix}], [\begin{matrix} 0 & 0 & 1 \end{matrix}], [\begin{matrix} 0 & 1 & 0 \end{matrix}], [\begin{matrix} 0 & 1 & 1 \end{matrix}]$
$d^{6}$	1	0	$[\begin{matrix} 0 & 1 & 0 \end{matrix}]$

Table 2. Outliers in the history dataset.

Outlier	Reason	Type	Is Incomplete Data
120	A feed (stream 1)	Pulse change	No
140	Reactor level	Pulse change	Yes
240	D feed (stream 2)	Pulse change	No
250	Reactor temperature	Pulse change	Yes
350	Purge rate (stream 9)	Pulse change	Yes
360	E feed (stream 3)	Pulse change	No
460	Product separator temperature	Pulse change	Yes
480	A and C feed (stream 4)	Pulse change	No
600	Recycle flow (stream 8)	Pulse change	No
720	Reactor feed rate (stream 6)	Pulse change	No
840	Reactor pressure	Pulse change	No

Table 3. Outliers in the online test dataset.

Outlier	Reason	Type	Is Incomplete Data
120	Product separator level	Pulse change	No
140	Stripper steam flow	Pulse change	Yes
240	Product separator pressure	Pulse change	No
250	Compress work	Pulse change	Yes
350	Reactor cooling water outlet temp	Pulse change	Yes
360	Product separator underflow	Pulse change	No
460	Separator cooling water outlet temp	Pulse change	Yes
480	Stripper level	Pulse change	No
600	Stripper pressure	Pulse change	No
720	Stripper underflow (stream 11)	Pulse change	No
840	Stripper temperature	Pulse change	No

Table 4. EM iteration results for normal status.

Iteration	E1	E3	E4	E6
1	0.993464	0	0	0.006536
2	0.998419	0.000263	0.000263	0.001054
3	0.998309	0.00033	0.000308	0.001054
4	0.998284	0.000356	0.000306	0.001054
5	0.998278	0.000373	0.000295	0.001054
6	0.998275	0.000389	0.000282	0.001054
7	0.998274	0.000403	0.000269	0.001054

Table 5. EM iteration results for outlier status.

Iteration	E2	E3	E4	E5	E6	E7	E8
1	0.7142	0	0.1429	0	0.1429	0	0
2	0.3896	0.0714	0.1169	0.0714	0.2078	0.0714	0.0714
3	0.3896	0.0714	0.1169	0.0949	0.1845	0.0714	0.0714
4	0.3896	0.0714	0.1169	0.1026	0.1770	0.0714	0.0714
5	0.3896	0.0714	0.1169	0.1047	0.1745	0.0714	0.0714
6	0.3896	0.0714	0.1169	0.1055	0.1737	0.0714	0.0714
7	0.3896	0.0714	0.1169	0.1058	0.1734	0.0714	0.0714

Table 6. Outlier detection result for incomplete sampling.

Sampling Point	140	250	350	460
Probability of Outlier	1	1	0.933	0.006

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

