A Modified Expectation Maximization Approach for Process Data Rectification

Abstract: Process measurements are contaminated by random and/or gross measuring errors, which degrade the performance of data-based strategies for enhancing process performance, such as online optimization and advanced control. Many approaches have been proposed to reduce the influence of measuring errors, among which expectation maximization (EM) is a novel and parameter-free one proposed recently. In this study, we examine the EM approach in detail and argue that the original EM approach cannot rectify measurements contaminated by persistent biases. We therefore propose a modified EM approach that circumvents this pitfall by fixing the standard deviation of the random error mode. The modified EM approach was evaluated on several benchmark cases of process data rectification from the literature. The results show the advantages of the proposed approach over the original EM in both solving efficiency and data rectification performance.


Introduction
With the advancement of smart manufacturing, process measurements play an increasingly important role in modern chemical manufacturing plants [1][2][3]. The measurements are unavoidably contaminated by random errors and often by large gross errors, too, which degrade the performance of process monitoring, control and optimization strategies based on measurements [1]. To recover the true values of process variables from the contaminated measurements, many approaches to data rectification, i.e., reducing the random and gross errors in the measurements simultaneously, have been proposed since the 1960s [2].
The first way identifies gross errors with a statistical test, assuming that random errors follow a normal distribution [10]. A data reconciliation procedure is then carried out, i.e., a constrained least squares problem is solved whose objective is to minimize the difference between the measured values and the reconciled values satisfying the process models. This estimates the true values of the measurements not contaminated by gross errors, while the true values of the measurements contaminated by gross errors are treated as unknown parameters to be estimated. Although the algorithmic parameters, such as the critical value of a statistical test, can be chosen with clear statistical meaning, only one gross error can be identified at a time because of the smearing effect of a large gross error. Statistical test approaches must therefore identify gross errors one by one, and carefully designed frameworks are needed to guarantee the performance of data rectification [11].
The second way is based on robust estimators [6], which can simultaneously reduce the influence of random and gross errors by solving a constrained nonlinear least squares problem once. Unlike the statistical test approaches described above, robust estimators assume that measurements contaminated by random and/or gross errors can be described by a heavy-tailed statistical distribution, such as contaminated normal [12], Cauchy [6], redescending [13], quasi weighted least squares (QWLS) [14] and correntropy [15], which can effectively reduce the smearing effect of gross errors. The advantages of robust estimators for data rectification are: (1) gross errors can be identified simultaneously with data reconciliation; (2) the parameters of the robust estimators can be determined with clear statistical meaning via Monte Carlo methods [16] or via online line search methods based on the Akaike information criterion (AIC) [13]. Currently, robust estimators may be the most popular approach for process data rectification.
The third way is based on mixed integer programming (MIP) techniques [8][9][10], whose objective is to minimize both the number of identified gross errors and the difference between the measured and rectified values, where the trade-off can be realized by the AIC [10,13] or a predetermined weighting factor in the objective function [9,17]. The MIP techniques show performance competitive with or comparable to that of robust estimators for process data rectification, and the MIP technique based on the AIC requires no algorithmic parameters to balance the fit and complexity of the model. Although a critical value for identifying gross errors still needs to be determined, this value can be easily obtained from the daily operating experience of instrumentation engineers [10,17].
Recently, a novel way of process data rectification based on statistical inference was proposed, such as the Bayesian inference approaches [18]. As a widely used method of statistical inference, expectation maximization (EM) [19][20][21][22][23][24][25] has also been applied to process data rectification [26,27]. The statistical inference approaches are based on the Bayes rule [18], which infers the unknown parameters by combining the information from the collected data (measurements) with the prior probability distribution of the inferred parameters. Although current works assume a prior distribution before process data rectification, some reasonable prior information on the random and gross errors of measurements, such as the standard deviation of random errors and the occurrence of gross errors for a specified sensor, can be collected and modeled from the experience of plant operators and historical process data [18,27,28]. The authors therefore believe that the statistical inference approach to process data rectification deserves to be studied.
The established EM approach [26] is an interesting statistical inference approach because it has no algorithmic parameter to be determined before data rectification; it only assumes that measurement errors follow a finite Gaussian mixture distribution. However, the large number of parameters to be estimated with the EM algorithm [29] leads to an inefficient solution procedure. Moreover, in the authors' experience, the original EM approach cannot be applied to rectify process measurements contaminated by persistent biases, because the estimated standard deviation of the random error mode is close to that of the gross error mode, which makes bias identification difficult.
In this work, we argue that, for the original EM approach, the estimated standard deviation of the random error mode is unavoidably enlarged by a persistent bias, which makes bias detection difficult. To circumvent this problem, we present a modified EM approach, in which the standard deviation of the random error mode is estimated before the EM iterations with a robust method [30]. The standard deviation of the random error mode is then not enlarged by a persistent bias, and it becomes possible to detect a bias from the EM calculation result. Compared to the original EM approach, the modified one also reduces the number of parameters to be estimated, so the time consumed by the EM iterations can be significantly reduced.
The remainder of this paper is organized as follows. Section 2 introduces the principles of the established EM approach for data rectification. The proposed modified EM approach and detailed calculation steps are presented in Section 3. Section 4 describes the performance analysis procedure used herein. The performance of the modified EM approach is evaluated and compared to that of the original one in Section 5. Finally, Section 6 concludes the paper.

Data Rectification Problem
In addition to random errors following a normal distribution, three types of gross errors, namely, drift, outlier and persistent bias, also commonly contaminate measurements. Figure 1 shows how the different types of gross errors contaminate a process measurement whose true value is 1 in steady state. Any one of these systematic errors significantly reduces the reliability of a process measurement, which is the basis of online decision-making for enhancing process performance. A systematic error cannot be eliminated with data reconciliation methods, because these methods assume random errors with a zero mean. Essentially, outliers in a measurement horizon are also random errors with a larger variance than random noise, and the original EM algorithm can identify and estimate outliers well. A persistent bias, as shown in Figure 1, is not random and enlarges the variance estimated by the original EM algorithm, as shown in Section 3.1. A sensor drift is also a non-random error, with increasing size, and it likewise leads to an enlarged estimated variance, as a persistent bias does, supposing that the average of a measurement horizon is taken as representative of the horizon. In the following, we show how a data rectification problem is set up as a statistical inference problem.
Suppose a measurement horizon $\{y_{j,h}\}_{h=t-H+1}^{t}$ is collected at time $t$, which involves $H$ data points measured at different time points $h$ for the $j$th process variable, and a data matrix $Y \in \mathbb{R}^{J \times H}$ whose rows represent the measurement horizons of all the $J$ measured process variables, where $y_{j,h}$ is the element at the $j$th row and $h$th column of $Y$. A steady-state process data rectification can be formulated as a maximum likelihood estimation problem described as Equation (1).
In Equation (1), the objective function is the log-likelihood of the sampled measurements $y_{j,h}$ under the condition that the true value of the $j$th measurement is $x_j$, where $x_j$ is the $j$th element of the vector $x$; $f(x)$ represents the process model and $g(x)$ denotes the inequality constraints on the process variables, considering operational specifications and bounds known from experience.
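Equation (1) itself is not reproduced in this text. Based only on the description above, a plausible form is sketched here. It assumes independent measurements, and the conventions $f(x) = 0$ and $g(x) \le 0$ for the constraints are assumptions rather than the authors' stated notation.

```latex
\max_{x} \; \sum_{j=1}^{J} \sum_{h=t-H+1}^{t} \ln p\left(y_{j,h} \mid x_j\right)
\quad \text{s.t.} \quad f(x) = 0, \quad g(x) \le 0
```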
The formulation of the objective function of Equation (1) varies with the distribution assumed for the measurement errors [6]. For the data reconciliation problem considering random errors only, random errors are assumed to follow a normal distribution, and the log-likelihood objective is quadratic. If a heavy-tailed distribution is assumed for the measurement errors, as in data rectification using robust estimators, the log-likelihood objective takes a more complex form, and sometimes the function is nonconvex or even discontinuous [6].
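For the normal-error case with linear balances, the quadratic objective has a well-known closed-form solution. The sketch below is only an illustration: the splitter constraint matrix `A`, the measured values and the standard deviations are hypothetical and are not taken from any of the test cases in this paper.

```python
import numpy as np

def reconcile_linear(y, sigma, A):
    """Weighted least squares reconciliation subject to A @ x = 0.

    Minimizes sum(((y - x) / sigma) ** 2) under the linear balances; the
    closed form follows from the Lagrange conditions of that problem.
    """
    y = np.asarray(y, dtype=float)
    S = np.diag(np.asarray(sigma, dtype=float) ** 2)  # error covariance
    # x_hat = y - S A^T (A S A^T)^{-1} A y
    correction = S @ A.T @ np.linalg.solve(A @ S @ A.T, A @ y)
    return y - correction

# Hypothetical splitter: stream 1 splits into streams 2 and 3.
A = np.array([[1.0, -1.0, -1.0]])
y = np.array([10.3, 6.1, 3.8])      # measured flowrates (made-up numbers)
sigma = np.array([0.2, 0.1, 0.1])   # assumed noise standard deviations
x_hat = reconcile_linear(y, sigma, A)
print(x_hat, A @ x_hat)
```

The reconciled values satisfy the balance exactly, and the least precise measurement (stream 1 here) absorbs most of the adjustment.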
In both situations described above, the distribution parameters of the measurement errors are fixed. For data rectification using the EM approach, the parameters of the measurement error distribution are inferred with the Bayes rule, as described in the following section.

Expectation Maximization Approach
For the EM approach [26], the difference between the $h$th measurement and the true value of the $j$th process variable, namely, $\varepsilon_{j,h} = y_{j,h} - x_j$, is described with the finite Gaussian mixture model of Equation (2) [26].
In Equation (2), $w_{j,1}$ represents the probability of a random error mode with a zero mean and standard deviation $\sigma_{j,1}$, and $w_{j,2}$ represents the probability of a gross error mode with a zero mean and standard deviation $\sigma_{j,2}$, which is larger than $\sigma_{j,1}$ when a gross error occurs. Supposing $\theta_j = \{\sigma_{j,k}, w_{j,k}\}_{k=1,2}$, the likelihood of $\varepsilon_{j,h}$ for the $k$th error mode can be described with Equation (3) [26].
In Equation (3), $z_{j,h}$ is a latent variable to be estimated, and $z_{j,h} = k$ means that the error mode of $y_{j,h}$ is the $k$th one, where $k = 1$ represents the random error mode and $k = 2$ the gross error mode. Considering both error modes, the whole likelihood of $\varepsilon_{j,h}$ is represented by Equation (4).
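Equations (2) to (4) are not reproduced in this text. Under the description above (two zero-mean Gaussian modes weighted by $w_{j,1}$ and $w_{j,2}$), they would take the following standard mixture form. The exact notation of [26] may differ, and the normalization $w_{j,1} + w_{j,2} = 1$ is an assumption.

```latex
% Per-mode likelihood, assumed form of Equation (3)
p\left(\varepsilon_{j,h} \mid z_{j,h} = k, \theta_j\right)
  = \frac{1}{\sqrt{2\pi}\,\sigma_{j,k}}
    \exp\left(-\frac{\varepsilon_{j,h}^{2}}{2\sigma_{j,k}^{2}}\right),
  \qquad k = 1, 2

% Mixture over both modes, assumed form of Equations (2) and (4)
p\left(\varepsilon_{j,h} \mid \theta_j\right)
  = \sum_{k=1}^{2} w_{j,k}\,
    p\left(\varepsilon_{j,h} \mid z_{j,h} = k, \theta_j\right),
  \qquad w_{j,1} + w_{j,2} = 1
```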
Based on the above descriptions, the previously mentioned Equation (1) can be written as the following Equation (5).
It is difficult to solve Equation (5) directly, because $w_{j,k}$ cannot be obtained explicitly. Hence, an EM approach was applied to solve Equation (5), by replacing Equation (5) with Equation (6) [26].
In Equation (6), $\Theta = \{x_j, \theta_j\}_{j=1,\ldots,J}$ and $\Theta^{(t)}$ represents the estimate of $\Theta$ at the $t$th iteration. The probability of each error mode for a measurement, namely, $p(z_{j,h} = k \mid y_{j,h}, \Theta^{(t)})$, is estimated from $y_{j,h}$ and $\Theta^{(t)}$ using the Bayes rule, as described by Equation (7).
In Equation (7), $p_{j,h,k}$ is calculated with Equation (3) and $P(z_{j,h} = k \mid \Theta^{(t)}) = w_{j,k}$, because $\Theta = \{x_j, \theta_j\}_{j=1,\ldots,J}$ and $\theta_j = \{\sigma_{j,k}, w_{j,k}\}_{k=1,2}$. The calculation of the probability $P(z_{j,h} = k \mid y_{j,h}, \Theta^{(t)})$ is called the expectation step (E-step).
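The E-step described above can be sketched for a single measured variable as follows. This is a minimal illustration assuming zero-mean Gaussian modes; the function names `gaussian_pdf` and `e_step`, the array layout and the sample numbers are hypothetical and not taken from [26].

```python
import numpy as np

def gaussian_pdf(eps, sigma):
    """Zero-mean normal density, used here as the per-mode likelihood."""
    return np.exp(-0.5 * (eps / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)

def e_step(y, x, w, sigma):
    """Posterior probability of each error mode for each measurement.

    y: (H,) horizon of one variable; x: current estimate of its true value;
    w, sigma: (2,) mode probabilities and standard deviations
    (index 0 = random error mode, index 1 = gross error mode).
    Returns an (H, 2) array whose rows sum to one.
    """
    eps = np.asarray(y, dtype=float)[:, None] - x            # (H, 1) residuals
    joint = w[None, :] * gaussian_pdf(eps, sigma[None, :])   # w_k * p_k
    return joint / joint.sum(axis=1, keepdims=True)          # Bayes rule

y = np.array([1.01, 0.98, 1.00, 1.60])   # last point mimics a gross error
post = e_step(y, x=1.0, w=np.array([0.9, 0.1]), sigma=np.array([0.02, 0.5]))
print(post.round(3))
```

In this made-up horizon, the first three points are assigned almost entirely to the random error mode and the last point to the gross error mode.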
After the E-step, we estimate $\Theta$ in the maximization step (M-step), namely, by solving Equation (6) with the $P(z_{j,h} = k \mid y_{j,h}, \Theta^{(t)})$ calculated at the E-step held fixed, which yields a new estimate of $\Theta$, i.e., $\Theta^{(t+1)}$. Note that $\ln p(z_{j,h} = k, y_{j,h} \mid \Theta)$ is calculated with Equation (3) and the Bayes rule, as described by Equation (8).
To solve Equation (6), a coordinate search method is applied to estimate $w_{j,k}$ and $\sigma_{j,k}$ separately, as Equations (9) and (10) show [26].
With the new estimate of $\Theta$, i.e., $\Theta^{(t+1)}$, obtained, we return to the E-step and check the difference between $Q(\Theta, \Theta^{(t)})$ and $Q(\Theta, \Theta^{(t+1)})$ of Equation (6). If the difference is negligible, the iteration stops; otherwise, it continues [26].
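The alternation of E-step and M-step can be sketched for a single variable without process constraints. This is a simplified illustration, not the procedure of [26]. The closed-form updates are the standard ones for a zero-mean two-mode Gaussian mixture, the true value is updated by a precision-weighted mean instead of a constrained nonlinear program, and the stopping test uses the change in log-likelihood rather than in $Q$. The function name, initial values and numerical floors are assumptions.

```python
import numpy as np

def em_single_variable(y, n_iter=200, tol=1e-8):
    """EM for y_h = x + eps_h with a two-mode zero-mean Gaussian mixture."""
    y = np.asarray(y, dtype=float)
    x = np.median(y)                      # initial guess (assumption)
    spread = np.std(y) + 1e-12
    w = np.array([0.9, 0.1])              # initial mode probabilities (assumption)
    sigma = np.array([0.5 * spread, 2.0 * spread])
    prev = -np.inf
    for _ in range(n_iter):
        eps = y[:, None] - x              # residuals at the current x
        dens = np.exp(-0.5 * (eps / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)
        joint = w * dens
        loglik = np.log(joint.sum(axis=1)).sum()
        if abs(loglik - prev) < tol:      # stop when the change is negligible
            break
        prev = loglik
        r = joint / joint.sum(axis=1, keepdims=True)   # E-step posteriors
        nk = np.maximum(r.sum(axis=0), 1e-12)          # floor avoids division by zero
        w = nk / nk.sum()                              # mode probabilities
        sigma = np.maximum(np.sqrt((r * eps ** 2).sum(axis=0) / nk), 1e-9)
        prec = r / sigma ** 2
        x = (prec * y[:, None]).sum() / prec.sum()     # precision-weighted mean
    return x, w, sigma

rng = np.random.default_rng(0)
y = 5.0 + 0.1 * rng.standard_normal(200)
y[:20] = 5.0 + 2.0 * rng.standard_normal(20)   # 20 outlier-like points
x, w, sigma = em_single_variable(y)
print(x, w, sigma)
```

Because the outlier-like points in this synthetic horizon are scattered around the true value, they behave like a wide random mode, which is the situation the original EM approach is reported to handle well.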

Standard Deviation of the Original EM under Persistent Bias
Although the EM approach was successfully applied to several situations, such as non-persistent gross errors and concurrent errors of different types [26,27], there is still room for improvement when measurements are contaminated by persistent biases. In that case, $(y_{j,h} - x_j^{(t)})^2$ in Equation (9) is relatively large and unavoidably leads to a large $\sigma_{j,k}^{(t+1)}$ even for the random error mode, whose standard deviation should be relatively small, as argued below.
Denoting the weights in Equation (9) by $\alpha_{j,h,k}$, Equation (9) can be rewritten as Equation (11). With $h' = \arg\min_h (y_{j,h} - x_j^{(t)})^2$ and $\alpha_{j,h',k} = 1$, it is easy to infer that Equation (11) arrives at its minimum. This minimum fluctuates around the magnitude of the bias contaminating the $j$th measurement and leads to a large standard deviation for the random error mode of Equation (2), namely, $\sigma_{j,1}$. In this situation, it is impossible to set $\pm 3\sigma_{j,1}$ as the critical value for bias detection, and the original intention of Equation (2) is violated, too.
To verify the above argument, the simple linear data rectification case of Ripps [31] is used here to show the influence of a persistent bias on the standard deviation of the random error mode. The Ripps case involves four streams with measured flowrates and three linear mass balance equality constraints, shown as Equation (12).
The true values of all the flowrates are $x_1 = 0.1739$, $x_2 = 5.0435$, $x_3 = 1.2175$ and $x_4 = 4$, with corresponding standard deviations $\sigma_1 = 2.89 \times 10^{-4}$, $\sigma_2 = 2.5 \times 10^{-3}$, $\sigma_3 = 5.76 \times 10^{-4}$ and $\sigma_4 = 4 \times 10^{-2}$ for random noise. Here we assume that $x_3$ is contaminated by a bias of size $5\sigma_3$, $10\sigma_3$ or $15\sigma_3$, and 50 Monte Carlo simulations are carried out for each bias size, with random noise of zero mean and the corresponding standard deviation added to all the variables. For each Monte Carlo simulation, the sign of the bias is assigned randomly with equal probability. The minimum ratio of the estimated $\sigma_{3,1}$ to $\sigma_3$, namely, $(\sigma_{3,1}/\sigma_3)_{\min}$, is shown in Figure 2. As Figure 2 shows, and in line with the above argument, the minimum estimated standard deviation of the random noise mode is several times the true value, so it is impossible to detect a bias with the traditional $3\sigma$ rule, as the original EM approach did [26].
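The lower-bound argument of this section can be checked numerically with a small sketch. It assumes that the estimate of the true value stays pinned at the unbiased value (for example by the process constraints) and that the variance update is a weighted average of squared residuals with nonnegative weights summing to one. Both are simplifying assumptions for illustration; the numbers are synthetic and unrelated to the Ripps case.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_true, H = 1.0, 50
x_fixed = 0.0   # value pinned by the process model (assumption)
y = x_fixed + 10.0 * sigma_true + rng.normal(0.0, sigma_true, H)  # 10-sigma bias

sq_res = (y - x_fixed) ** 2
floor = np.sqrt(sq_res.min())   # smallest absolute residual in the horizon

# Any nonnegative weights that sum to one give a weighted variance that is
# at least the smallest squared residual, so sigma cannot fall below `floor`.
for _ in range(1000):
    alpha = rng.dirichlet(np.ones(H))
    sigma_est = np.sqrt(alpha @ sq_res)
    assert sigma_est >= floor - 1e-12

print(floor / sigma_true)
```

Whatever weights the E-step produces, the estimated standard deviation in this setting stays several times larger than the true noise level, so a $3\sigma$ threshold built on it would exceed the bias itself.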

Modification to the Original EM Approach
To apply the EM approach to measurements contaminated by persistent biases, a simple modification of the original EM approach is presented herein: the variance of the random error mode in Equation (2), i.e., $\sigma_{j,1}^2$, is estimated directly rather than via the EM iterations. It has been shown that $\sigma_{j,1}^2$ can be estimated efficiently and robustly from process measurements even when the measurements are contaminated with gross errors [30]. All the other parameters of $\Theta$ in Equation (6) are still estimated using the original EM procedure described above. With this modification, the influence of a bias on the estimated standard deviation of the random error is avoided.
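The details of the robust estimator of [30] are not given in this text. As a stand-in, the sketch below uses the scaled median absolute deviation (MAD), a different but common robust scale estimator, only to illustrate why an estimate computed directly from the horizon is insensitive to a persistent bias. The function name `robust_sigma` and the synthetic data are hypothetical.

```python
import numpy as np

def robust_sigma(y):
    """Estimate the noise standard deviation with the scaled MAD.

    The factor 1.4826 makes the MAD consistent with the standard deviation
    for normally distributed data.
    """
    y = np.asarray(y, dtype=float)
    med = np.median(y)
    return 1.4826 * np.median(np.abs(y - med))

rng = np.random.default_rng(1)
clean = rng.normal(1.0, 0.05, 500)
biased = clean + 0.5   # persistent bias of 10 sigma on every point
print(robust_sigma(clean), robust_sigma(biased))
```

Because a persistent bias shifts every point in the horizon by the same amount, a scale estimate centered on the horizon's own median is unchanged by it, unlike the EM update that measures residuals from the reconciled value.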
After $\Theta$ in Equation (6) has been estimated, a criterion must be set up to detect biases in the measurements. There are two established ways of detecting a bias. The first one is shown by Equation (13), namely, a measurement is contaminated by a bias if the probability of the gross error mode is larger than that of the random error mode [12]. The second one uses the deviation of the reconciled value from the corresponding measured value [26], namely, a bias is detected for the $j$th measured variable if Equation (14) holds, where $\bar{y}_j$ is the average of the measurement horizon of the $j$th process variable.
The original EM approach can only use the first way of bias detection, because its estimated value of $\sigma_{j,1}$ is enlarged by a persistent bias. In contrast, three criteria can be applied to the modified EM approach: a bias is detected when Equation (13) holds, which is called the probability criterion (PC); when Equation (14) holds, which is called the deviation criterion (DC); or when both Equations (13) and (14) hold simultaneously, which is called the probability and deviation criterion (PDC).
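The three criteria can be sketched as follows. Equations (13) and (14) are not reproduced in this text, so the forms used here are assumptions inferred from the description: PC compares the two mode probabilities, and DC compares the deviation between the horizon average and the reconciled value with a multiple of $\sigma_{j,1}$ (three by default, following the $3\sigma$ rule mentioned in the text). The function name and arguments are hypothetical.

```python
def detect_bias(w_random, w_gross, y_mean, x_hat, sigma_random, n_sigma=3.0):
    """Return the PC, DC and PDC decisions for one measured variable."""
    pc = w_gross > w_random                              # probability criterion
    dc = abs(y_mean - x_hat) > n_sigma * sigma_random    # deviation criterion
    return {"PC": pc, "DC": dc, "PDC": pc and dc}

# Made-up numbers: same mode probabilities, different deviations.
print(detect_bias(0.4, 0.6, y_mean=1.25, x_hat=1.0, sigma_random=0.05))
print(detect_bias(0.4, 0.6, y_mean=1.05, x_hat=1.0, sigma_random=0.05))
```

In the second call the gross error mode is more probable but the deviation is within the threshold, so PC flags a bias while DC and PDC do not.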
Based on the above description, the proposed modified EM approach is summarized in Table 1.
Table 1. Steps of the modified EM approach.
3. Initialize parameters: $w$ [...] $\sum_h y_{j,h}/H$ and set $t = 1$.
[...] and $w_{j,k}^{(t+1)}$ using Equations (9) and (10), respectively, calculate $x$ [...]
The modified EM with PC, DC or PDC for bias detection is denoted as MEM-PC, MEM-DC and MEM-PDC, respectively.
The advantages of the modified EM algorithm over the original EM algorithm are: (1) the standard deviation of random error is not affected by a persistent bias, because a direct and robust variance estimation method [30] is used; (2) fewer variables need to be estimated by the modified EM algorithm, which means that the modified EM algorithm converges faster than the original EM algorithm.

Performance Analysis
To evaluate the performance of the proposed modified EM algorithm, three performance metrics, namely, overall performance (OP), average number of Type-I errors (AVTI) and relative error reduction (RER), defined by Equations (15)-(19), are used here [9].
$$\mathrm{OP} = \frac{\text{number of correctly identified biases}}{\text{number of gross errors simulated}}, \qquad \mathrm{AVTI} = \frac{\text{number of wrongly identified biases}}{\text{number of simulation trials}}$$
The following Monte Carlo simulation procedure [6] is carried out here to evaluate the performance of data rectification.
(1) For all the measured variables, add random noise with zero mean and the corresponding standard deviation.
(2) Add a bias to each measurement with a predefined probability $p_b$; the bias size is randomly distributed between 5 and 25 times the standard deviation of the random noise, and the sign of the bias, namely, '+' or '−', is randomly assigned with equal probability.
(3) Calculate the performance of data rectification with Equations (15)-(19) for each evaluated method.
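Steps (1) and (2) and the OP and AVTI metrics can be sketched as follows. This is a simplified illustration with one measurement per variable per trial and a uniform distribution of the bias size between 5 and 25 standard deviations; the uniform choice, the function names and the example flags are assumptions. RER is omitted because its definition is not reproduced in this text.

```python
import numpy as np

def simulate_trial(x_true, sigma, p_b, rng):
    """One trial: noisy measurements plus randomly placed biases."""
    x_true = np.asarray(x_true, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    y = x_true + rng.normal(0.0, sigma)                  # step (1): random noise
    biased = rng.random(x_true.size) < p_b               # step (2): which are biased
    size = rng.uniform(5.0, 25.0, x_true.size) * sigma   # 5 to 25 sigma
    sign = rng.choice([-1.0, 1.0], x_true.size)          # '+' or '-' equally likely
    return y + biased * sign * size, biased

def op_avti(true_flags, detected_flags):
    """OP and AVTI from boolean arrays of shape (trials, variables)."""
    true_flags = np.asarray(true_flags, dtype=bool)
    detected_flags = np.asarray(detected_flags, dtype=bool)
    correct = np.sum(true_flags & detected_flags)
    wrong = np.sum(~true_flags & detected_flags)
    n_gross = np.sum(true_flags)
    op = correct / n_gross if n_gross else float("nan")
    avti = wrong / true_flags.shape[0]
    return op, avti

# Two made-up trials with three variables each.
truth = np.array([[True, False, False], [False, True, False]])
found = np.array([[True, False, True], [False, False, False]])
op, avti = op_avti(truth, found)
print(op, avti)
```

In this toy example one of the two simulated biases is found and one unbiased variable is wrongly flagged over two trials, so both OP and AVTI equal 0.5.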
Four well-known test cases of process data rectification were used here to evaluate and compare the performance of the proposed MEM approach and the original EM approach, which are described as follows.
The first case is the well-known steam metering network (SMN) [32], which involves 11 units interconnected by 28 streams with measured flowrates. Its flowsheet is shown in Figure 3, with the true values of the flowrates of all the streams given in parentheses. For each measured variable, the standard deviation of the added random noise is set as 2.5% of its true value.
The second case is a bilinear process of metallurgical grinding (MG) [33], which involves four units interconnected by nine streams with measured mass flowrates and 15 measured mass fractions. The flowsheet of the metallurgical grinding is shown in Figure 4 with the true values of all the measured variables, where the true values of the flowrates are given in parentheses and the compositions to the right of the parentheses. For all the measured variables, the corresponding $\sigma$ of random noise is set as 2.5% of the true value.
The third case, i.e., Pai-Fisher (PF) [34], is a typical nonlinear instance of data rectification, whose model is shown as Equation (20).
The fourth case, namely, the Swartz case (Sw) [35], is a heat exchanger network, where streams Ai (i = 1, 2, ..., 8) are heated by streams Bi (i = 1, 2, 3), Ci (i = 1, 2) and Di (i = 1, 2) via different heat exchangers, as Figure 5 shows. The true values of the flowrate and temperature of each stream [12] are shown in Figure 5, too. The standard deviation of random noise is set as 2.5% of the corresponding true value for each flowrate and as 0.75 for the temperature of each stream. For the Swartz case, both linear material balance equalities and nonlinear energy balance equalities for each heat exchanger/junction are used as constraints of data rectification. The enthalpy per unit mass of each stream is correlated with its temperature using a quadratic polynomial, as Equation (21) shows, whose coefficients are given in Table 2 [1].
For all the tested cases, the Monte Carlo simulations were carried out in a MATLAB 2018 (MathWorks, Boston, MA, USA) environment using a personal computer with Intel Core Processor (TM) i3 CPU 3120M @ 2.50 GHz, 8GB RAM (Intel, Santa Clara, CA, USA). Random measuring noise was generated with the "normrnd" command, and the "rand" command was used to assign the size and sign of a bias. The nonlinear programs of the EM were solved with the "fmincon" command.

Results and Discussion
To evaluate the performance of the different criteria of the modified EM approach, the OP and AVTI of DC, PC and PDC for all four tested cases are compared in Figure 6. As Figure 6 shows, PC had a higher OP but a clearly higher AVTI than DC and PDC, which shows that the probabilities of the random and gross error modes are not a reliable basis for detecting a bias, because some variables not contaminated by a bias also have a higher probability of the gross error mode. DC and PDC had the same OP and AVTI except for the bilinear MG case, where PDC detected slightly fewer biases than DC. Whether this is a special case needs to be investigated in the future, since this work focuses on modifying the original EM approach for rectifying measurements contaminated by persistent biases.
Because MEM-DC and MEM-PDC had almost the same data rectification performance, MEM-DC was selected for comparison with the original EM approach, as shown in Table 3. As stated in Section 3.2, PC was used to detect bias for the original EM, because DC does not work when a persistent bias contaminates a measurement: the standard deviation of the random error mode, i.e., $\sigma_{j,1}$, is enlarged by the bias contaminating the $j$th measurement, as shown in Figure 1, and in the authors' experience DC based on the $3\sigma$ rule cannot detect any bias. As Table 3 shows, the original EM had a much lower OP and a much higher AVTI than MEM-DC, which shows that a persistent bias influences not only the standard deviation of the random error mode, but also the probabilities of the random and gross error modes. Interestingly, the original EM approach had only a slightly worse RER than MEM-DC, which shows that the rectified values of both approaches are close to each other. Finally, MEM-DC consumed much less time than the original EM, because fewer parameters needed to be estimated.

Conclusions
In this work, we analyzed the influence of a persistent bias on the estimated standard deviation of the random error mode for the EM approach and argued that the $3\sigma$ rule cannot be used to detect a bias when a persistent bias occurs. A modified EM approach was devised by estimating the standard deviation of the random error mode from process measurements before the EM iterations. The performance of the modified and original EM approaches was evaluated and compared on four widely used linear and nonlinear examples of data rectification. The results show that the original EM approach cannot be used to detect persistent biases, while the modified EM can, and that the modified EM consumes much less time than the original EM because fewer parameters are estimated.
The convergence of the modified EM algorithm has not been proved. We will study it in the future to improve our understanding of the EM approach and to increase the reliability of the proposed approach.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.