A KPI-Based Probabilistic Soft Sensor Development Approach that Maximizes the Coefficient of Determination

Advanced technology for process monitoring and fault diagnosis is widely used in complex industrial processes. An important issue that needs to be considered is the ability to monitor key performance indicators (KPIs), which often cannot be measured sufficiently quickly or accurately. This paper proposes a data-driven approach based on maximizing the coefficient of determination for probabilistic soft sensor development when data are missing. Firstly, the problem of missing data in the training sample set is solved using the expectation maximization (EM) algorithm. Then, by maximizing the coefficient of determination, a probability model between secondary variables and the KPIs is developed. Finally, a Gaussian mixture model (GMM) is used to estimate the joint probability distribution in the probabilistic soft sensor model, whose parameters are estimated using the EM algorithm. An experimental case study on the alumina concentration in the aluminum electrolysis industry is investigated to demonstrate the advantages and the performance of the proposed approach.


Introduction
With the increasing demands placed on industry, requiring a decrease in the defective rate of products, better economic efficiency, and improved safety, there has been a growing demand to develop and implement approaches that can improve the overall control strategy [1]. The first issue that needs to be solved is achieving accurate and real-time estimation of key performance indicators (KPIs) [2]. The difficulty is that these KPIs are usually not easy to measure, or their measurement involves a significant time delay. Even if some KPIs are measurable, the complexity and nonlinearity of modern industrial systems and their complex working conditions can make the measurements extremely unreliable [3]. One way to solve these problems is to develop a soft sensor, which selects a group of easier-to-measure secondary variables that are correlated with the required primary variables (i.e., the KPIs in this paper), so that the system is capable of providing process information as often as necessary for control [4,5]. The development of a successful soft sensor requires a good process model. Process models can be divided into two major categories: first principles models and data-driven models [6,7]. Although it is desirable to apply mass and energy balances to build a complete first principles model, lack of process knowledge, plant-model mismatch, and nonlinear characteristics limit the applicability of such an approach to only the simplest processes. As an alternative, data-driven models are widely used. Among previous applications of the coefficient of determination, Boyaci [25] used it to evaluate the adulteration rate of coffee beans, thus ensuring coffee quality. However, these applications only treat the coefficient of determination as an evaluation index, without applying it in the modeling process.
In general, the coefficient of determination is a criterion that can evaluate the quality of a model and has a concise structure, so it is appropriate to apply it to the soft sensor development process to establish a simpler and more accurate model for complex industrial processes.
Therefore, this paper develops a KPI-based soft sensor model with a simple structure and high accuracy, using the coefficient of determination method, which also solves the missing data issue using the EM algorithm.

The Gaussian Mixture Model
As a flexible and efficient tool for probabilistic data modeling, a Gaussian mixture model (GMM) can be used to approximate any complex probability distribution function and is therefore widely used in many statistical data modeling applications. In this paper, a GMM is used to approximate the joint probability distribution in the soft sensor probability model. The reason for introducing the GMM is that, theoretically, any probability distribution can be approximated using a weighted combination of Gaussian distributions [26].
If x represents a multidimensional random variable, then the joint probability distribution of the GMM is expressed as

$$p(x) = \sum_{l=1}^{M} \alpha_l\, p_l(x \mid \theta_l),$$

where $\alpha_l$ is the mixing coefficient, which represents the prior probability of each mixture component; $M$ is the number of mixture components; and $\sum_{l=1}^{M} \alpha_l = 1$. $\Theta = (\theta_1, \theta_2, \cdots, \theta_M)$ is the parameter vector of the mixture components, and each Gaussian probability density function $p_l(x)$ is determined by the parameter $\theta_l = (\mu_l, \Sigma_l)$, where $\mu_l$ is the mean vector and $\Sigma_l$ is the covariance matrix. The GMM parameters $\alpha_l$, $\mu_l$, and $\Sigma_l$ $(l = 1, 2, \ldots, M)$ are estimated using the EM algorithm.
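As a concrete illustration, the mixture density above can be evaluated directly from its parameters. The following minimal Python sketch computes $p(x) = \sum_l \alpha_l\, p_l(x)$ for a two-component mixture; the weights, means, and covariances are illustrative values, not parameters taken from the paper.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_pdf(x, alphas, means, covs):
    """Evaluate the GMM density p(x) = sum_l alpha_l * N(x; mu_l, Sigma_l)."""
    return sum(a * multivariate_normal.pdf(x, mean=m, cov=c)
               for a, m, c in zip(alphas, means, covs))

# Two-component mixture in 2-D (illustrative parameters only)
alphas = [0.4, 0.6]                          # mixing coefficients, sum to 1
means = [np.zeros(2), np.array([3.0, 3.0])]  # component means
covs = [np.eye(2), 2.0 * np.eye(2)]          # component covariances

print(gmm_pdf(np.array([0.0, 0.0]), alphas, means, covs))
```

Because the weights sum to one and each component integrates to one, the mixture is itself a valid probability density, however many components are used.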

The Expectation Maximization Algorithm
The EM algorithm is a maximum likelihood estimation method for solving model distribution parameters from "incomplete data" and was first introduced in [27]. Each iteration of the algorithm involves two steps, called the expectation step (E-step) and the maximization step (M-step).

E-Step
Given the observation data set X and the current parameters $\Gamma^{(i)}$, the expectation of the complete-data log-likelihood is called the Q-function, which can be written as

$$Q\left(\Gamma, \Gamma^{(i)}\right) = \mathrm{E}_{\gamma}\left[\log p(X, \gamma \mid \Gamma) \,\middle|\, X, \Gamma^{(i)}\right],$$

where $\gamma$ can represent missing data due to observational conditions and other reasons, and can also refer to hidden variables. Since the direct optimization of the likelihood function is usually very difficult, introducing the additional variable $\gamma$ establishes the relationship between X, $\Gamma$, and $\gamma$, thereby simplifying the likelihood function.

M-Step
A new parameter estimate $\Gamma^{(i+1)}$ is calculated by maximizing $Q(\Gamma, \Gamma^{(i)})$ obtained from the E-step; that is,

$$\Gamma^{(i+1)} = \arg\max_{\Gamma} Q\left(\Gamma, \Gamma^{(i)}\right).$$

The iteration between the E- and M-steps continues until the change in the elements of $\Gamma$ is less than a given tolerance.
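The E/M alternation and the stopping rule can be sketched on a toy problem. The example below fits the two means of a one-dimensional, two-component Gaussian mixture with known unit variances and equal weights; these simplifications are illustrative assumptions, not the model used in the paper.

```python
import numpy as np

# Toy data: two well-separated 1-D clusters
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(2.0, 1.0, 300)])

mu = np.array([-0.5, 0.5])   # initial guesses for the two component means
for _ in range(200):
    # E-step: responsibility of each component for each sample
    dens = np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2)  # unnormalized N(x; mu_l, 1)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: responsibility-weighted mean update
    mu_new = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
    # Stop when the change in the parameters is below a tolerance
    if np.max(np.abs(mu_new - mu)) < 1e-8:
        mu = mu_new
        break
    mu = mu_new

print(mu)   # close to the true means (-2, 2)
```

Each iteration is guaranteed not to decrease the likelihood, which is why the simple stopping rule on the parameter change works in practice.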

The Coefficient of Determination
Analysis of variance is an approach for determining the significance and validity of a regression model using variances obtained from the data and model. The coefficient of determination is an analysis of variance approach that seeks to decompose the total variability in the data into various orthogonal components that can then be independently analyzed [23]. For the purposes of analyzing the regression, let the total sum of squares, denoted by TSS, be defined as

$$\mathrm{TSS} = \sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2,$$

where the real data set is represented as $y = (y_1, y_2, \ldots, y_n)$ and $\bar{y}$ refers to the average of the $y_i$. Let the sum of squares due to regression, SSR, be defined as

$$\mathrm{SSR} = \sum_{i=1}^{n}\left(\hat{y}_i - \bar{y}\right)^2,$$

where $\hat{y}_i$ denotes the predicted value of the regression model for $y_i$. The coefficient of determination $R^2$ represents the ratio of SSR to TSS; that is,

$$R^2 = \frac{\mathrm{SSR}}{\mathrm{TSS}}.$$

Let the sum of squares due to the error, SSE, be defined as

$$\mathrm{SSE} = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2.$$

It can be proved that $\mathrm{TSS} = \mathrm{SSR} + \mathrm{SSE}$ [23,28], so $R^2$ can also be expressed as

$$R^2 = 1 - \frac{\mathrm{SSE}}{\mathrm{TSS}}.$$
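These quantities are straightforward to compute. The sketch below (with made-up data) evaluates $R^2 = 1 - \mathrm{SSE}/\mathrm{TSS}$ and checks the two boundary cases discussed in the text.

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination: R^2 = 1 - SSE/TSS."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    tss = np.sum((y - y.mean()) ** 2)   # total sum of squares
    sse = np.sum((y - y_hat) ** 2)      # sum of squares due to error
    return 1.0 - sse / tss

y = [1.0, 2.0, 3.0, 4.0]
print(r_squared(y, y))          # perfect fit  -> 1.0
print(r_squared(y, [2.5] * 4))  # predicting the mean -> 0.0
```

Note that $\mathrm{TSS} = \mathrm{SSR} + \mathrm{SSE}$, and hence the equivalence of the two expressions for $R^2$, holds for least-squares regression with an intercept, which is the setting of [23,28].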

Development of the Probabilistic Soft Sensor Model
In this section, in order to obtain more accurate KPI estimates, a soft sensor development approach based on maximizing the coefficient of determination is proposed. In addition, the problem of missing data in the training sample set is also considered. In order to more clearly describe the soft sensor development process, Figure 1 shows the modeling flow chart.


EM Algorithm Handling Missing Data
Let $X_1, X_2, \ldots, X_n$ be a random sample from a p-variate normal population, where $X_j = (x_{j1}, x_{j2}, \ldots, x_{jp})$, $1 \leq j \leq n$, so the training sample set X can be written as

$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix}.$$

The basic steps for processing missing data using the EM algorithm are given in [29].

E-Step: Prediction
For each sample $X_j$ containing missing values, write $X_j = (m_j, a_j)$, where $m_j$ denotes the missing values and $a_j$ the available values. Given the population mean and covariance, $\tilde{\mu}^{(i)}$ and $\tilde{\Sigma}^{(i)}$, from the ith iteration, together with $a_j$, we use the expectation of the conditional normal distribution of $m_j$ given $a_j$ as the estimate of the missing values. The (i + 1)th iteration is

$$m_j^{(i+1)} = \tilde{\mu}_m^{(i)} + \tilde{\Sigma}_{ma}^{(i)} \left(\tilde{\Sigma}_{aa}^{(i)}\right)^{-1}\left(a_j - \tilde{\mu}_a^{(i)}\right),$$

where $\tilde{\mu}^{(i)}$ is a $p \times 1$ vector partitioned as $\tilde{\mu}^{(i)} = \left(\tilde{\mu}_m^{(i)}, \tilde{\mu}_a^{(i)}\right)$, with $\tilde{\mu}_m^{(i)}$ the mean of the missing part and $\tilde{\mu}_a^{(i)}$ the mean of the available part. In addition, $\tilde{\Sigma}^{(i)}$ can be written in the partitioned form

$$\tilde{\Sigma}^{(i)} = \begin{bmatrix} \tilde{\Sigma}_{mm}^{(i)} & \tilde{\Sigma}_{ma}^{(i)} \\ \tilde{\Sigma}_{am}^{(i)} & \tilde{\Sigma}_{aa}^{(i)} \end{bmatrix}.$$

M-Step: Estimation
We compute the maximum likelihood estimates as follows:

$$\bar{X}^{(i+1)} = \frac{1}{n}\sum_{j=1}^{n} X_j, \qquad S^{(i+1)} = \frac{1}{n}\sum_{j=1}^{n}\left(X_j - \bar{X}^{(i+1)}\right)\left(X_j - \bar{X}^{(i+1)}\right)^{T},$$

where $\bar{X}^{(i+1)}$ is the sample mean and $S^{(i+1)}$ is the sample covariance matrix, and they are both sufficient statistics. For a normal population, the importance of sufficient statistics is that the total information about $\mu$ and $\Sigma$ in the data matrix X is contained in $\bar{X}$ and S, regardless of the sample size n. By transforming $\bar{X}$ and S, two new sufficient statistics, $T_1$ and $T_2$ [29], given by

$$T_1 = \sum_{j=1}^{n} X_j = n\bar{X}, \qquad T_2 = \sum_{j=1}^{n} X_j X_j^{T} = nS + n\bar{X}\bar{X}^{T},$$

are obtained. Combining these with the estimates above gives

$$\tilde{\mu}^{(i+1)} = \frac{T_1}{n}, \qquad \tilde{\Sigma}^{(i+1)} = \frac{T_2}{n} - \tilde{\mu}^{(i+1)}\left(\tilde{\mu}^{(i+1)}\right)^{T}.$$

The iteration between the E- and M-steps continues until the change in the elements of $\tilde{\mu}$ and $\tilde{\Sigma}$ is less than a given tolerance. The resulting iterates $\tilde{m}$ are the optimal substitutes for the missing values, yielding a complete training sample set X.
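A minimal sketch of this imputation loop on synthetic bivariate-normal data is given below. For simplicity it assumes that only the second variable has missing entries and checks convergence on the imputed values rather than on every element of $\tilde{\mu}$ and $\tilde{\Sigma}$; both are simplifying assumptions, not the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic bivariate-normal training set with correlation 0.9
n = 500
cov_true = np.array([[1.0, 0.9], [0.9, 1.0]])
X = rng.multivariate_normal([0.0, 0.0], cov_true, size=n)
miss = rng.random(n) < 0.2          # 20% of the second variable is missing
X_obs = X.copy()
X_obs[miss, 1] = np.nan

# Initialize missing entries with the column mean of the observed values
X_fill = X_obs.copy()
X_fill[miss, 1] = np.nanmean(X_obs[:, 1])

for _ in range(100):
    mu = X_fill.mean(axis=0)                       # M-step: sufficient statistics
    S = np.cov(X_fill, rowvar=False, bias=True)
    # E-step: conditional mean of the missing part given the available part,
    # m = mu_m + S_ma * S_aa^{-1} * (a - mu_a)
    cond = mu[1] + S[1, 0] / S[0, 0] * (X_fill[miss, 0] - mu[0])
    if np.max(np.abs(cond - X_fill[miss, 1])) < 1e-10:
        break
    X_fill[miss, 1] = cond

rmse = np.sqrt(np.mean((X_fill[miss, 1] - X[miss, 1]) ** 2))
print(rmse)   # well below the marginal std of 1.0, thanks to the correlation
```

Because the two variables are strongly correlated, the conditional-mean imputation recovers the missing values much more accurately than the marginal mean would.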

Soft Sensor Development Approach Based on the Coefficient of Determination Maximization Strategy
For the complete training sample set X obtained from Section 3.1, let $x_1, x_2, \cdots, x_{p-1}$ denote the secondary variables, and $x_p$ denote the KPI. Our objective is to estimate $x_p$ from $x_1, x_2, \cdots, x_{p-1}$. $R^2$ measures the fraction of the total variance explained by the regression with the given variables [23]. The range of $R^2$ is [0, 1]. Let $x_p$ be the y mentioned in Section 2.3. Then, the coefficient of determination is

$$R^2 = \frac{\mathrm{SSR}}{\mathrm{TSS}} = \frac{\sum_{i=1}^{n}\left(\hat{x}_{ip} - \bar{x}_p\right)^2}{\sum_{i=1}^{n}\left(x_{ip} - \bar{x}_p\right)^2}.$$

If the secondary variables in the soft sensor model do not account for any of the variance of $x_p$, the estimate of $x_{ip}$, denoted $\hat{x}_{ip}$, is exactly equal to the sample mean $\bar{x}_p$. In this case, SSR is 0 and SSE equals TSS, so $R^2 = 0$. On the other hand, if $x_{i1}, x_{i2}, \cdots, x_{i(p-1)}$ fully explain the variance of $x_{ip}$ for $i = 1, 2, \ldots, n$, then $\hat{x}_{ip} = x_{ip}$; i.e., each error is zero and $\mathrm{SSR} = \mathrm{TSS}$, so $R^2 = 1$. In general, $R^2$ does not take the extreme values 0 or 1, but instead takes a certain value between the two [28]. For the case where the number of variables, p, is much smaller than the sample size n, the closer $R^2$ is to 1, the better the model. Therefore, when the model for the KPI maximizes $R^2$, it becomes the best estimate of the KPI; that is,

$$\tilde{x}_{ip} = \arg\max_{K_i} R^2,$$

where $\tilde{x}_{ip}$ is the best estimate of $x_{ip}$, and $K_i$ represents all possible estimates of $x_{ip}$. Since TSS is fixed by the data, maximizing $R^2 = 1 - \mathrm{SSE}/\mathrm{TSS}$ is equivalent to minimizing SSE:

$$\left(\tilde{x}_{1p}, \ldots, \tilde{x}_{np}\right) = \arg\min_{K_1, \ldots, K_n} \sum_{i=1}^{n}\left(x_{ip} - K_i\right)^2.$$

Multiplying both sides of the criterion by $n^{-1}$ gives

$$\left(\tilde{x}_{1p}, \ldots, \tilde{x}_{np}\right) = \arg\min_{K_1, \ldots, K_n} \frac{1}{n}\sum_{i=1}^{n}\left(x_{ip} - K_i\right)^2.$$

Considering that the mathematical expectation of a discrete random variable is

$$\mathrm{E}(x) = \sum_{i} x_i p_i,$$

where $x_i$ represents the ith value of the random variable x and $p_i$ represents its probability, the criterion can be expressed as

$$\tilde{x}_p = \arg\min_{K} \mathrm{E}\left[\left(x_p - K\right)^2\right],$$

where K denotes all possible estimates of the KPI $x_p$, and $\tilde{x}_p$ represents the best estimate of the KPI when the coefficient of determination $R^2$ is maximized.
Since the estimate is derived from the soft sensor model and the secondary variables, the above equation can be written as

$$\tilde{x}_p = \arg\min_{K} \mathrm{E}\left[\left(x_p - K\right)^2 \,\middle|\, x_1, x_2, \cdots, x_{p-1}\right].$$

In order to establish a more direct connection between $\tilde{x}_p$ and $(x_1, x_2, \ldots, x_{p-1})$, the conditional expectation is simplified further. Since K is a constant given the secondary variables, expanding the square gives

$$\mathrm{E}\left[\left(x_p - K\right)^2 \,\middle|\, x_1, \cdots, x_{p-1}\right] = \mathrm{E}\left[x_p^2 \,\middle|\, x_1, \cdots, x_{p-1}\right] - 2K\,\mathrm{E}\left[x_p \,\middle|\, x_1, \cdots, x_{p-1}\right] + K^2.$$

In order to minimize this expression, the derivative with respect to K must vanish, which gives

$$\tilde{x}_p = \mathrm{E}\left[x_p \,\middle|\, x_1, x_2, \cdots, x_{p-1}\right].$$

Furthermore, $\mathrm{E}\left[x_p \mid x_1, x_2, \cdots, x_{p-1}\right]$ can be expanded according to the definition of expectation, giving

$$\tilde{x}_p = \int x_p\, p\left(x_p \mid x_1, \cdots, x_{p-1}\right) \mathrm{d}x_p = \frac{\int x_p\, p\left(x_1, \cdots, x_{p-1}, x_p\right) \mathrm{d}x_p}{p\left(x_1, \cdots, x_{p-1}\right)}.$$

Thus, this establishes the basic framework of the probabilistic soft sensor model with optimal KPI estimation. The next step is to obtain the joint probability distribution in the model; in this paper, a GMM is used to approximate it. Let $x_e = \left(x_1, x_2, \cdots, x_{p-1}\right)$ and $p(x_e) = p\left(x_1, x_2, \cdots, x_{p-1}\right)$; that is,

$$\tilde{x}_p = \frac{\int x_p\, p\left(x_e, x_p\right) \mathrm{d}x_p}{p\left(x_e\right)}.$$

In order to deduce the specific representation of the optimal KPI estimate $\tilde{x}_p$ under the proposed probabilistic soft sensor model, we first introduce Lemma 1.

Lemma 1. [30]
Let $G(x; \mu, \Sigma)$ be a multidimensional normal density function with mean $\mu$ and covariance $\Sigma$, and partition $x = (x_e, x_p)$ with

$$\mu = \begin{bmatrix}\mu_e \\ \mu_p\end{bmatrix}, \qquad \Sigma = \begin{bmatrix}\Sigma_{ee} & \Sigma_{ep} \\ \Sigma_{pe} & \Sigma_{pp}\end{bmatrix}.$$

Then the joint probability density factorizes as

$$G(x; \mu, \Sigma) = G\left(x_e; \mu_e, \Sigma_{ee}\right) G\!\left(x_p;\ \mu_p + \Sigma_{pe}\Sigma_{ee}^{-1}\left(x_e - \mu_e\right),\ \Sigma_{pp} - \Sigma_{pe}\Sigma_{ee}^{-1}\Sigma_{ep}\right).$$

Proof. The details of the proof can be found in [30].
Using Lemma 1, it follows that

$$p\left(x_e, x_p\right) = \sum_{l=1}^{M} \alpha_l\, G\left(x_e; \mu_{le}, \Sigma_{lee}\right) G\!\left(x_p;\ \mu_{lp} + \Sigma_{lpe}\Sigma_{lee}^{-1}\left(x_e - \mu_{le}\right),\ \Sigma_{lpp} - \Sigma_{lpe}\Sigma_{lee}^{-1}\Sigma_{lep}\right),$$

where

$$\mu_l = \begin{bmatrix}\mu_{le} \\ \mu_{lp}\end{bmatrix}, \qquad \Sigma_l = \begin{bmatrix}\Sigma_{lee} & \Sigma_{lep} \\ \Sigma_{lpe} & \Sigma_{lpp}\end{bmatrix}.$$

Therefore, the marginal density of the secondary variables is

$$p\left(x_e\right) = \sum_{l=1}^{M} \alpha_l\, G\left(x_e; \mu_{le}, \Sigma_{lee}\right).$$

Substituting these into the expression for $\tilde{x}_p$, extracting the sum in the numerator outside the integral, and noting that the remaining integral is the conditional expectation $\mu_{lp} + \Sigma_{lpe}\Sigma_{lee}^{-1}\left(x_e - \mu_{le}\right)$, the detailed soft sensor model expression for the optimal KPI estimate is obtained:

$$\tilde{x}_p = \frac{\sum_{l=1}^{M} \alpha_l\, G\left(x_e; \mu_{le}, \Sigma_{lee}\right)\left[\mu_{lp} + \Sigma_{lpe}\Sigma_{lee}^{-1}\left(x_e - \mu_{le}\right)\right]}{\sum_{l=1}^{M} \alpha_l\, G\left(x_e; \mu_{le}, \Sigma_{lee}\right)}.$$

In this paper, the unknown parameters in the model are estimated using the EM algorithm. The iterative equations of the EM algorithm for estimating the GMM parameters are [31]

$$\alpha_l^{(i+1)} = \frac{1}{n}\sum_{j=1}^{n}\gamma_{jl}, \qquad \mu_l^{(i+1)} = \frac{\sum_{j=1}^{n}\gamma_{jl} X_j}{\sum_{j=1}^{n}\gamma_{jl}}, \qquad \Sigma_l^{(i+1)} = \frac{\sum_{j=1}^{n}\gamma_{jl}\left(X_j - \mu_l^{(i+1)}\right)\left(X_j - \mu_l^{(i+1)}\right)^{T}}{\sum_{j=1}^{n}\gamma_{jl}},$$

where $\gamma_{jl}$ represents the responsibility of mixture component l for the training sample $X_j$. It can be written as

$$\gamma_{jl} = \frac{\alpha_l\, G\left(X_j; \mu_l, \Sigma_l\right)}{\sum_{k=1}^{M}\alpha_k\, G\left(X_j; \mu_k, \Sigma_k\right)}.$$

Consequently, the above steps give the GMM parameters, and the optimal KPI estimate $\tilde{x}_p$ follows.
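The final estimator can be implemented directly from fitted GMM parameters. The sketch below (the function name and the single-component check are illustrative, not from the paper) computes $\tilde{x}_p$ as the responsibility-weighted combination of the component-wise conditional means.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_conditional_mean(x_e, alphas, means, covs, p_idx):
    """KPI estimate x~_p = E[x_p | x_e] under a fitted GMM.

    means[l] and covs[l] hold the joint (secondary + KPI) parameters of
    component l; p_idx is the index of the KPI dimension."""
    e_idx = [i for i in range(len(means[0])) if i != p_idx]
    num, den = 0.0, 0.0
    for a, mu, S in zip(alphas, means, covs):
        mu_e, mu_p = mu[e_idx], mu[p_idx]
        S_ee = S[np.ix_(e_idx, e_idx)]       # covariance of secondary variables
        S_pe = S[p_idx, e_idx]               # cross-covariance with the KPI
        w = a * multivariate_normal.pdf(x_e, mean=mu_e, cov=S_ee)
        # Component-wise conditional mean: mu_p + S_pe * S_ee^{-1} (x_e - mu_e)
        cond_mean = mu_p + S_pe @ np.linalg.solve(S_ee, x_e - mu_e)
        num += w * cond_mean
        den += w
    return num / den

# Single-component check: reduces to the usual conditional-normal mean
mu = np.array([0.0, 0.0, 1.0])
S = np.array([[1.0, 0.2, 0.5],
              [0.2, 1.0, 0.3],
              [0.5, 0.3, 1.0]])
print(gmm_conditional_mean(np.array([1.0, -1.0]), [1.0], [mu], [S], p_idx=2))
```

With M = 1, the weights cancel and the estimator collapses to the familiar conditional mean of a partitioned Gaussian, which is a useful sanity check for any implementation.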

Case Study
In this section, the effectiveness and feasibility of the proposed soft sensor model approach based on maximizing the coefficient of determination are evaluated through an industrial aluminum electrolytic production process. To show the advantages of the probabilistic soft sensor framework, the estimations are compared with the real values. For performance evaluation, the root-mean-squared error (RMSE) index is used.
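For reference, the RMSE index used throughout the case study can be computed as follows (a minimal sketch with made-up values):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-squared error used for performance evaluation."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # perfect estimates -> 0.0
print(rmse([0.0, 0.0], [3.0, 4.0]))
```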

Soft Sensor Development for Industrial Aluminum Electrolytic Process
Aluminum is widely used in the construction and electrical industries [32]. The main method currently used in aluminum smelting plants is the cryolite-alumina molten salt electrolysis process, in which the electrochemical reaction takes place in an electrolytic cell. Figure 2 shows the internal structure of the electrolytic cell.
Molten cryolite is a solvent in which aluminum oxide is dissolved as a solute, forming a melt with good electrical conductivity. Carbon materials are used as cathodes and anodes, and a direct current is passed through them. The thermal energy of the direct current melts the cryolite and maintains a constant electrolysis temperature. The electrochemical reaction occurs between the two electrodes, where the product at the cathode is liquid aluminum, and carbon dioxide and other gases are generated at the anode. The overall chemical reaction of the electrolytic process is

$$2\,\mathrm{Al_2O_3} + 3\,\mathrm{C} \rightarrow 4\,\mathrm{Al} + 3\,\mathrm{CO_2}\uparrow.$$

Besides carbon dioxide and carbon monoxide, the reaction can also produce fluorocarbon gases. The gas purifying device uses alumina to react with the fluorine contained in the mixed gas, producing fluorinated alumina, which is then recycled to the electrolytic cell. Figure 3 shows the process flow diagram of the aluminum electrolysis process.
The main control goal of the aluminum electrolysis process is to keep the alumina concentration in the electrolysis cell stable within a certain range, preferably between 1.5% and 3.5% [33]. The control of the alumina concentration affects the energy consumption and economic benefits of the aluminum electrolytic production process. On the one hand, when the alumina concentration is too low, an additional chemical reaction occurs at the anode, which can easily cause a sudden rise in the cell voltage and destroy the energy balance of the cell. On the other hand, when the concentration reaches saturation, if the feeder continues to add alumina, the raw material is deposited at the bottom of the cell, so that the resistance increases and the current efficiency drops. Therefore, it is necessary to keep the alumina concentration in the proper range.
In soft sensor development for the aluminum electrolytic process, the following measurable variables were selected as the secondary variables: the voltage $x_1$ between the two electrodes obtained by the first voltage measuring instrument; the anode conductor current $x_2$; the voltage $x_3$ between the two electrodes obtained by the second voltage measuring instrument; and the alumina concentration $x_4$ provided by an electrochemical analyzer. The interelectrode voltage refers to the voltage between the anode guide and the corresponding cathode steel bar. The alumina concentration y provided by the laboratory is the primary variable for the model. Figure 4 shows a diagram of the process measurement system.
The variables $x_1(k)$, $x_2(k)$, $x_3(k)$, $x_4(k)$, and $y(k)$ form the joint probability distribution

$$p(x(k)) = p\left(x_1(k), x_2(k), x_3(k), x_4(k), y(k)\right).$$

The soft sensor was then developed according to the process described in Section 3 of this paper. It is assumed that M = 2.


EM Algorithm and Missing Values
We took 600 complete data groups from the training sample set, and deleted 10%, 20%, or 30% of the alumina concentration variable data. Then, the mean substitution method, the regression interpolation method, and the EM algorithm were used to process the sample set with missing values. Tables 1-3 show the mean and RMSE of the alumina concentration sample set for the three methods at missing ratios of 10%, 20%, and 30%. First, comparing the mean values, we can see from these tables that the means of the regression interpolation method and the EM data interpolation method are closer to the mean of the real value set, while the mean substitution method is less effective. Moreover, the RMSE of the EM data interpolation method is much smaller than that of the regression interpolation method. Therefore, the accuracy and effectiveness of the EM data interpolation method in processing missing values is verified. Further, if missing values arise in a practical industrial process, the EM algorithm can be selected for data interpolation.
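The qualitative ranking reported in Tables 1-3 can be reproduced in spirit on synthetic data: when the variables are strongly correlated, an imputation method that exploits the correlation beats plain mean substitution. The sketch below (illustrative data, not the plant data used in the paper) compares mean substitution with a simple regression-based imputation:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 600
x = rng.normal(0.0, 1.0, n)                # secondary variable
y = 2.0 * x + rng.normal(0.0, 0.5, n)      # strongly correlated "concentration"
miss = rng.random(n) < 0.2                 # 20% missing ratio

# (i) mean substitution from the observed values
y_mean = np.where(miss, y[~miss].mean(), y)
# (ii) regression imputation from the correlated secondary variable
b, a = np.polyfit(x[~miss], y[~miss], 1)   # slope, intercept
y_reg = np.where(miss, b * x + a, y)

rmse_mean = np.sqrt(np.mean((y_mean[miss] - y[miss]) ** 2))
rmse_reg = np.sqrt(np.mean((y_reg[miss] - y[miss]) ** 2))
print(rmse_mean, rmse_reg)   # regression-based imputation has the lower RMSE
```

The EM imputation of Section 3.1 goes one step further than the regression shown here, iteratively refining the mean and covariance estimates on the completed data, which is why it performs best in Tables 1-3.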

Experimental Results of the Soft Sensor Model Based on Maximizing the Coefficient of Determination
In order to verify the feasibility of the proposed approach, a test sample set was used to validate the designed soft sensor model. The test sample set was divided into four subsets of 100 samples. The actual alumina concentration measurement obtained from the laboratory was compared with the output of the soft sensor model to evaluate the estimation performance of the model. The results are shown in Figure 5. Figure 5a-d show the estimated alumina concentrations based on the first, second, third, and fourth test subsets, respectively. Table 4 shows the root-mean-squared errors (RMSEs) for the four test subsets. It can be seen that, overall, the soft sensor model based on maximizing the coefficient of determination can accurately track the overall trends in the process. The alumina concentration output by the model is approximately the same as the actual laboratory measurement. The backpropagation (BP) neural network and the least-squares support vector machine (LSSVM) model were also applied to the test sample set, and the first test subset was used for performance comparison. The parameters of the comparison algorithms were determined as follows: the number of hidden layer nodes in the BP neural network model was 100, and the activation function of the hidden layer was a sigmoid [34]; the kernel function of the LSSVM model was the radial basis function (RBF), and the kernel parameter and regularization parameter were 1 and 20, respectively [34]. For each model, the number of secondary variables was 4, and the number of primary variables was 1. Note that the two comparison models require parameter tuning to achieve accurate estimation performance, while this is not necessary for the soft sensor model based on maximizing the coefficient of determination. The estimated results are shown in Figures 6 and 7.
Figure 6 shows the estimated values of the soft sensor based on the BP neural network for the first test subset, and Figure 7 shows the estimated values of the soft sensor based on the LSSVM for the first test subset. It can be seen from Figure 6 that the soft sensor based on the BP neural network can roughly follow the trend of the laboratory measurements, but the error is still large at many points. It can be seen from Figure 7 that the overall performance of the soft sensor based on the LSSVM is better than that based on the BP neural network, but compared with Figure 5a, the estimation of some extreme points is not as accurate as that given by the soft sensor based on maximizing the coefficient of determination.
In practice, deviations from this behavior can provide information about the accuracy of the models. The BP neural network soft sensor exhibits a consistent bias, since its values lie consistently above the y = x line. The bias of the LSSVM soft sensor model is smaller, but there also seems to be a calibration issue, since the data do not lie parallel to the y = x line. The proposed model has the smallest deviations and the best performance.
To better illustrate the performance of the proposed soft sensor model, Table 5 shows the RMSE values for the different methods. As can be seen from Table 5, the RMSE of the proposed method is the smallest, which means that the estimation performance of the proposed model is better than those of the BP neural network model and the LSSVM model.

Conclusions
In this paper, a new KPI estimation method for probabilistic soft sensor development is proposed based on maximizing the coefficient of determination. The joint probability distribution in the probability model is approximated using a GMM, while the EM algorithm is used to estimate the GMM parameters. In addition to providing accurate, real-time estimates of the KPIs, this paper also addresses the missing values that training sample sets often contain, using the EM algorithm to process them. The resulting soft sensor design method was tested on a case study of the aluminum electrolysis process, which shows that the proposed method can provide alumina concentration estimates that are consistent with the actual measurements obtained from laboratory tests. Future work will focus on applying the proposed soft sensor development approach to various problems, such as dealing with dynamic, non-Gaussian, or batch processes.