A Regularized Mixture of Linear Experts for Quality Prediction in Multimode and Multiphase Industrial Processes

: This paper proposes the use of a regularized mixture of linear experts (MoLE) for predictive modeling in multimode-multiphase industrial processes. For this purpose, different regularized MoLE were evaluated, namely, through the elastic net (EN), Lasso, and ridge regression (RR) penalties. Their performances were compared when trained with different numbers of samples, and in comparison to other nonlinear predictive models. The models were evaluated on real multiphase polymerization process data. The Lasso penalty provided the best performance among all regularizers for MoLE, even when trained with a small number of samples.


Introduction
Soft sensors are inferential models used for online prediction of quality variables [1][2][3]. Often, the online measurement of such variables is difficult due to high costs or lack of a physical sensor. In such cases, the quality variables are measured by means of laboratory analysis. An operator collects a sample during production and sends it away for laboratory analysis. Meanwhile, the operator cannot act on the process in response to the most recent measurement, whose response is not yet known, leading to possible production loss, or at least quality loss. After the analysis result returns back, the operator can then make a decision, but it might be late. The goal of soft sensor technology is to solve this issue. It builds a data-driven model that relates the operational process variables to the quality variable. By doing so, it allows the online inference of the quality variables, allowing operators to get earlier information about the quality of the process and take corrective actions on time. The common steps to deploy a soft sensor are: data cleaning/synchronization [4], feature selection [5], model learning/validation [2], and model maintenance [6]. During the learning phase, it is beneficial to take, as much as possible, the process properties into consideration. For example, in case of multimode or multiphase processes, such information can be used in the modeling efforts. Multimode process operational characteristics exist due to external factors, such as changes in feedstock, production, operation conditions, or the external environment. Multiphase process characteristics are commonly found in batch processes. A series of phases comprise a batch cycle of production, with its own characteristics [7]. Several authors appeal to using these multiple operating properties into the modeling phase. From now on, we will also refer to each mode/phase of multimode-multiphase processes as the operating mode.
Two main steps follow the modeling of multimode-multiphase processes for quality prediction. The first step is the characterization of the operating modes, which can be done manually [8] or be inferred from data [9]. The second step is the learning of the models for each mode. In [10], the authors derived and proposed the use of a mixture of probabilistic principal component regression (MPPCR) for multimode data. In [11], the use of a two phase partial least squares (PLS) modeling approach was proposed. In the first phase, a separated intra-phase-PLS model is learned for each phase, and in the second phase a series of inter-phase-PLS are learned to model the relationships among different phases. In [12], in a case study for Mooney viscosity prediction in a rubbermixing process, the authors employed a fuzzy C-means clustering algorithm to cluster the samples in different subclasses, and then taught single Gaussian process regression (GPR) models for each subclass. In [7], the main idea was to learn individual partial least squares (PLS) models for each phase and each mode. The distinction between modes and phases was made manually by the experts. Following this, in [13] the authors successfully implemented a Gaussian mixture regression (GMR) model to handle the multimode data in a penicillin process. In [14], the authors incorporated the PLS model into the GMM framework for quality prediction and fault detection in an industrial hot strip mill process. In [15], the authors expanded the Gaussian mixture models (GMM) framework to its semi-supervised form for dealing with incomplete data and multimode processes. Other approaches model multimode processes using just in time learning (JITL) [16,17]. In [18], the authors proposed a robust GMR based on the Student's-t distribution for dealing with noise and outliers in data.
This paper proposes a mixture of experts (MoE) methodology with the following two characteristics: (1) the characterization of different modes, and (2) the learning of models in a single unified manner. MoE consists of a set of experts and gates, applied for modeling heterogeneous or multi-mode data. The experts are assigned for each mode, while the gate defines the boundaries for the experts. Figure 1 illustrates the MoE. MoE was introduced in [19,20], where the experts were neural network models and the gates were modeled by a softmax function. Since then, several extensions and variants of the MoE model have been proposed (see the review paper [21]). A variant of MoE is the mixture of linear experts (MoLE), wherein the experts are multivariate linear models. MoLE has the property of universal approximation, works for nonlinear modeling, is interpretable, and can automatically infer the different modes. All these characteristics make MoLE suitable for modeling multimode industrial data. However, the estimation of MoLE is unfeasible in the presence of collinearity or a small number of samples. Moreover, MoLE cannot handle irrelevant variables or perform feature selection. However, all these, are usual requirements to deal with industrial data. To solve the collinearity issue, in [9], the authors have proposed the use partial least squares (PLS) for modeling the gate and experts, defining the Mixture of Partial Least Squares Experts (Mix-PLS). It has been successfully applied to two industrial multimode-multiphase processes. In a short paper, Ref. [22] modeled the results of MoLE with elastic net penalty for modeling a polymerization multiphase process.
Beyond the industrial context, authors have appealed to regularization to MoLE models [23][24][25][26] to perform of feature selection and allow the learning with small number of samples. The regularized MoLE methods reported in the literature use the l 1 penalty, also known as least absolute shrinkage and selection operator (Lasso) [27]; the l 2 penalty, also known as ridge regression (RR) or Tikonohov regularization [28]; a compromise between l 1 and l 2 norm [29], also known as elastic net (EN); and the smooth clipped absolute deviation (Scad) [30]. In [23], the authors have used a RR penalty for the gates and a Lasso and Scad penalties to experts in MoLE. They reported successful results, in both, performance and in finding parsimonious models. They compared the results with a RR regularized linear model. Similarly, Ref. [24] applied Lasso penalty on gates and experts for classification problems. Their regularized MoLE performed better than ordinary MoLE, and state of art classifiers. In [26], the authors employed Lasso penalty for experts, and EN penalty for the gates, and reported better results than ordinary MoLE. In [31] the authors have employed a proximal-Newton expectation maximization (EM) to avoid instability while learning the gates. In [32], the authors discuss the theoretical aspects of the use of Lasso in MoLE. All results report regularization as a viable approach to improve the performance of MoLE when dealing with irrelevant variables and small number of samples.
The goal of this paper is to check the performance of regularized MoLE models for quality prediction in multimode-multiphase processes. For this purpose, a regularized version of MoLE based on EN penalty has been derived, which has a flexible regularization form. Thereafter, this paper derives three regularized MoLE models, defined as MoLE-Lasso, MoLE-RR, and MoLE-EN. Besides, a set of experiments was run and analyzed, with all regularized MoLEs, and the Mix-PLS [9]. The experiments were run on real multiphase polymerization data, for predicting two quality variables, with a total of 31 batches. The performances of MoLE models with respect to the number of batches for training were checked. The results show that MoLE-Lasso gives the most stable results, even in learning with only a few batches. On the other hand, the Mix-PLS has a tendency to perform better when the number of batches increases. Finally, all regularized MoLE's were also compared to different state-of-the-art models. The results show that the regularized MoLE is a valid choice for modeling multimode processes data.
The paper is divided up as follows. Section 2 presents the proposed regularized MoLE. Section 3 presents the experimental results. Finally, Section 4 gives concluding remarks.

Regularized MoLE
In this section, the regularized MoLE is derived. First, an introduction of MoLE is given. Afterwards, the derivation of regularized MoLE and its learning procedure are presented.

Notation
In this paper, finite random variables are represented by capital letters and their values by the corresponding lowercase letters, e.g., random variable A, and corresponding value a. Matrices and vectors are represented by boldface capital letters, e.g., A = [a ij ] n×d and boldface lowercase letters, e.g., a = [a 1 , . . . , a d ] T , respectively. The input and output/target variables are defined as X = {X 1 , . . . , X d } and Y, respectively. The variables X 1 , . . . , X d can take n different values as {x ij ∈ X j : j = 1, . . . , d and i = 1, . . . , n}, and similarly for Y as {y i ∈ Y : i = 1, . . . , n}.

Definition
The MoLE follows the divide and conquer principle. In the learning phase, the input space is divided into soft regions and a linear "expert" is learned for each region, while the gates assign soft boundaries to the experts' domains. The output for a new data sample is given by the weighted combination of the experts' outputs, where the new data is assigned to a specific or overlap of regions. The MoLE output is given bŷ where x i is the vector of input variables, P is the total number of experts,ŷ p (x i ) is the output of expert p, and g p (x i ) is the gate output for expert p. The gates assign mixture proportions to the experts, and have the following constraints ∑ P p=1 g p ( , are denoted by their shortened versionsŷ i ,ŷ pi , and g pi , respectively. Expert p has the linear formŷ p (x i ) = x T i θ p , where θ p is the vector of regression coefficients of linear expert p; the bias has been omitted to simplify the derivation. For the gate, the following softmax function will be employed: where v p is the gate parameter of expert p. This softmax function follows the required gate constraints. The MoLE formulation fits perfectly for multimode-multiphase processes, where each different mode can be modeled by an expert p, while the gates define the boundaries of the different modes. MoLE can infer the number of modes, or can use the expert information to define the number of modes. It has a very flexible format, and established learning algorithms. However, MoLE models cannot deal with collinearity or irrelevant variables. The next section will discuss how to apply regularization to MoLE models.

Formulation
The MoLE approximates the true probability distribution function (PDF), defined as p(y i |x i ), by a superposition of PDF's: where p(y i |x i , Θ p ) = N (y i |x T i θ p , σ 2 p ) is the conditional PDF of expert p, governed by the parameters Θ p = {θ p , σ 2 p }, where σ 2 p is the error variance of expert p, assumed to be zero mean. The collection of gates parameters is represented by G = {v 1 , . . . , v p }. The collection of all parameters is defined as Ω = {θ 1 , . . . , θ p , σ 2 1 , . . . , σ 2 p , v 1 , . . . , v p }. The estimation of Ω by maximum likelihood is unfeasible. Instead, the expectation maximization (EM) is commonly employed [33]. The EM uses a two-steps iterative procedure to perform the likelihood maximization of MoLE. In the first, called the expectation step (E-Step), the expectation of the log-likelihood (ELL) is evaluated, while in the second step, called maximization step (M-Step), new parameters are determined by maximum likelihood estimation from the ELL. Therefore, the maximization of likelihood in EM is achieved trough successive improvements of ELL; see [33] for further details on the application of the EM algorithm.
Since the EM is an iterative procedure, the superscript t is used to indicate the t-th iteration of the EM algorithm, e.g., Ω t is the vector of the estimated parameters of MoLE at iteration t. The ELL at the t-th E-Step, Q t (Ω), is given by where Q t e ( · ) and Q t g ( · ) account for the expert and gate contributions to the ELL, respectively, and γ t p (x i ) is the responsibility, which accounts for the probability of sample x i belonging to the region covered by expert p; from now on γ t p (x i ) will also be referred as γ t pi . The convergence of the ELL, can be measured by computing the ELL difference |Q t (Ω t ) − Q t−1 (Ω t−1 )| at each iteration. The complete derivation of ELL, and its relation to the log-likelihood of (3) can be found in ( [9], Section 4).
In the M-Step, the objective is to estimate the new parameters Ω t+1 = {Θ t+1 , G t+1 }, that maximize Equation (4). This is stated as the following optimization problem: The EM guarantees that the ELL, and consequently the log-likelihood of Equation (3), increases monotonically. The EM algorithm runs until the convergence of the ELL, where the convergence detection condition is defined The algorithmic solution is shown in Algorithm 1. Initialize Ω 0 , t ← 0, Done ← FALSE 3: while not Done do 4:

Regularization on Experts
The next step is to solve the maximization (6) for the experts. The experts' contribution to the ELL, Q t e (Θ), can be further decomposed to account for the contribution of each expert separately, as follows where Q t ep ( · ) is the individual contribution of expert p to the ELL. The new vector of expert coefficients θ t+1 p is the one that maximizes Q t ep (Θ), which is the solution of the following weighted least squares problem where Γ p = diag γ t p1 , . . . , γ t pn is the matrix of responsibilities of expert p. However, in presence of collinearity, the inverse (X T Γ p X) −1 becomes ill conditioned. To overcome this situation, a EN regularization is added as follows to penalize the loss function where λ e p is the regularization parameter that controls the sparsity of the solution, and α is the EN penalty. The EN regularization allows the use of the Lasso penalty when α = 1, the RR penalty when α = 0, or the EN penalty when α = 0.5 (a trade-off between Lasso and RR). The Lasso penalty is known to promote sparse solutions by shrinking the regression coefficient towards zero, being adequate when dealing with a large number of inputs. However, the Lasso penalty does not consider the group effect, i.e. for correlated features, it will tend to select one input while shrinking the coefficient of others to zero. The RR penalty alleviates the ill posed problem of (X T Γ p X + λ e p I) −1 by adding a regularization factor to it. In RR, all coefficients will have a contribution to prediction. On the other hand, the EN penalty integrates the benefits of both, it provides sparse solutions (Lasso penalty) while considering the effect group problem (RR penalty).
The error variance update becomes the solution of the following maximization problem which is equal to where the updated θ t+1 p is used to compute the updated variance term. The maximization problem in Equation (11) can be achieved using the coordinate gradient optimization descent algorithm, described in [34]. The coordinate gradient descent algorithm minimizes the loss function for each coordinate at a time. It converges to the optimal value if the loss function is convex and differentiable, conditions that hold for the loss function (11). The updated coefficient of variable j and expert p equals to whereỹ j pi = ∑ d l =j x il θ pl is the fitted value of local expert p, without the contribution of variable j. S(z, η) is the soft threshold operator, given by Further details on the derivation of EN learning by coordinate gradient descent can be found in [35]. For the experiments, the glmnet package [34] has been used. It is a computationally efficient procedure that uses cyclical coordinate descent, computed along a regularization path for solving the EN problem.
Another issue is to select a proper value of λ e p , which controls the overfitting and the sparsity of the solution. For such purpose, here it is adopted the Bayesian information criterion (BIC), which measures the trade off between accuracy and complexity, and has the following format where n p e = ∑ n i=1 γ t pi is the number of effective samples of expert p, and ψ e p is the number of the degree of freedom of expert p, and is the number of non zero elements in θ. The BIC has a tendency to penalize complex models, due to the log(n e p ) multiplicative factor, giving preference to simpler models. Thus, the selected value of λ e p is the one that minimizes the value of BIC(λ e p , α).

Regularization on Gates
On the experts' update, the EN regularization was easily added to penalize the experts' parameters (Section 2.4). On the other hand, during the gates learning, the application of the regularization term is not explicit. The contribution of gates to the ELL, Q t g (G), is given by The solution for the new parameters G t+1 by direct maximization of Equation (17) is not straightforward. Instead, the iterative re-weighted least squares (IRLS) method will be employed. The IRLS algorithm works in the following way. First, define the following auxiliary variablev k p as the auxiliary gate parameter in the k-th iteration of the IRLS algorithm. It is updated as follows: = (X T R p X)X T u p = (X T R p X)X T R pẑ k p , where R p = diag g p1 1 − g p1 , . . . , g pn 1 − g pn , The IRLS algorithm runs until convergence. The employed convergence detection condition is ||v k+1 p −v k p || 2 < ξ I . At the end of the algorithm, the new gate parameter is updated as v t+1 p =v K p , where K is the last iteration of the IRLS algorithm. In practice, few K iterations are necessary in the IRLS algorithm. The IRLS solution can become unstable due to the ill-conditioned inverse R −1 p . Many authors have proposed different alternatives to IRLS, to overcome this problem, such as the proximal-Newton EM in [31] that avoids the matrix inversion in the gates update. Here, a regularization term is added, which works well in practice. Specifically, the regularization term ζI is added such that the inverse becomes (R p + ζI) −1 . In experiments, ζ = 10 −3 is used.
However, similarly to the experts, the solution for the gates at each iteration of IRLS becomes ill conditioned in the presence of collinearity. Thus, through the closed form solution of the inner loops of the IRLS algorithm in Equation (18), the results derived for the experts are mimicked to the gates. By modifying so, the value ofv k p to be found at the k-th IRLS algorithm iteration with the EN regularization added, becomeŝ v k+1 Then, at each iteration of IRLS, an EN penalty is added to the loss function. In total, Kp EN maximization problems are computed. In a way similar to case of the experts, the solution of (19) can be achieved by using the coordinate gradient descent algorithm, described in [34]. In that case wherez j pi = ∑ d l =j x il v pl is the fitted value of gate p, without the contribution of variable j. The major issue here is to find the most appropriate λ g p for each gate p and at each iteration of the IRLS algorithm, where the value of λ g p controls the sparsity of the solution. For such purpose, it is adopted the BIC, which measures the trade off between accuracy and complexity, and for the gates, it has the following format where n g p = ∑ i r pi is the number of effective samples of gate p, and ψ g p is the degrees of freedom of expert p which is equal to the number of non zero elements in v. Thus, the selected value of λ g p is the one that minimizes the value of BIC g (λ g p , α). The IRLS algorithm with EN penalty is described in Algorithm 2.
while not Done do k ← k + 1 10: end while 11: return v t+1 p =v k p t: EM index; k: IRLS index. 12: end procedure

Model Selection and Stop Condition
In ordinary MoLE learning, the ELL is employed as the measure of convergence. Here, in addition to ELL, the BIC criterion will be employed to select the best MoLE architecture along the EM iterations. In that case, the parameters Ω to be selected is the ones were the BIC is minimal, instead of ELL. For this purpose, at each EM iteration, the BIC criterion is computed: Smaller values of BIC means better models. The BIC will increase as the complexity of the MoLE architecture increases. That selection allows the selection of less complex architectures, and overcome the problem of overfitting in the prediction phase. The ELL measures the convergence of the MoLE, while the BIC is considered as the criteria to select the best model. Here, the number of experts selected is not considered a concern, it is assumed that this information is known a priori, and comes from the process to be modeled. However, this criterion can also be employed for model selection. In [36], there is a short discussion on different modeling selections for MoE.

Different Regularized MoLE
The regularized MoLE learning, presented in previous sections will be used to derive three main regularized MoLE models for the experimental part. First, α = 1 is considered to derive the MoLE-Lasso; then α = 0.5 is used to derive the MoLE-EN; and α = 0 is used to derive the MoLE-RR. The selection for regularized MoLE regularization parameters will follow the BIC procedure, as previously discussed. The source code will be made available at the author's github page (www.github.com/faasouza (accessed on 19 February 2021)).

Experimental Results
This section presents the experimental results of using a real industrial polymerization dataset from [8]. The goal in this case study was the estimation of two quality variables related to the resin production in a batch polymerization process. These variables are two chemical properties: the resin acidity number (N A ) and the resin viscosity (µ). The datasets comprise a total of 34 variables measured over the course of the process, such as temperatures, pressures, valve openings, and controller set-points (manually adjusted). The provided dataset contains data from 33 batches (16 months of operating effort). The data were divided into two subsets: 26 batches constitute the training set, and the remaining 6 represent the test set. The properties of dataset are detailed in Table 1. The average number of samples per batch for training data is 18 samples, while for the test is 20. The total samples for the test dataset are 127 samples.

Experimental Settings
The results are divided into two steps. In the first step, the behavior of different regularized MoLEs is evaluated in different scenarios (different numbers of batches of data for training). The objective in this step is to understand the effects of samples in the stability/performance of each MoLE model with different penalties. In the second step, the mixture models that were trained on the full training data are compared to other predictive models by assessing the performance on the test dataset. The predictive performance is measured by the normalized root mean square error (NRMSE) and the coefficient of determination (R 2 ). The definitions of these metrics are as follows: whereŷ i is the predicted output and y is the sample mean of output y: y = 1 n ∑ n i y i . Lower values of NRMSE mean better models, while higher values of R 2 mean better models. R 2 can be interpreted as the rate of explained variance of the output to be predicted by the model, with bounds 0 ≤ R 2 ≤ 1, where 1 means perfect fit.

Comparison of a Regularized Mixture of Expert Models
In this section, the performances of different regularized mixture of experts, MoLE-Lasso, MoLE-EN, MoLE-RR, and Mix-PLS, are compared when using different numbers of samples for training. For this purpose, from the total of the 27 batches, 1, 5, 10, 15, and 20 batches were chosen randomly to compose the training dataset for training the MoLE models, while the original test set of six batches was kept fixed for evaluating the performance. The same procedure was repeated 10 times for each experiment, and then the R 2 metric was computed for each repetition. The average results for each of the 10 trials are shown in Table 2, for each model and for the different number of batches in the trainning dataset. To complement the analysis, the boxplots of R 2 between predictions and test data under a different number of batches for acidity and viscosity are presented in Figure 2.   Table 2, the results of training with 1 batch for resin acidity number prediction, show that MoLE-Lasso performs better with an average R 2 = 0.44. On the other hand, the performance of resin viscosity prediction with 1 batch has better results for MoLE-EN, with a value of R 2 = 0.76. This shows that the resin viscosity can be predicted even with few data for training. From the boxplot, Figure 2, the MoLE-Lasso and MoLE-EN have the lowest variance on resin viscosity prediction, while for resin acidity number the variance of prediction is higher. Figure 2, exhibits that the performance of all models increases with the number of batches used for training, as well the variance decreases. An interesting observation is related to the Mix-PLS performance, which shows a high variance of performance for small batches, which decreases as the number of training batches increases. For 20 batches it performs quite similar to, or even better than, Lasso regularization. This suggests that Mix-PLS can be a valid option when dealing with a large number of samples (in our experiment, more than 400 samples). Overall, RR regularization performs poorly among the experiments. This is reinforced by results in Table 2, where MoLE-RR provides the worst results in the majority of experiments.

MoLE-Lasso
Overall, all regularized MoLE performed well in the experiments. The MoLE-Lasso has shown to be more stable along the experiments, mainly when dealing with small number of batches, as concluded from the prediction performance and from inspecting the boxplot's results.

Comparison with Other Predictive Models
To compare the predictive performance of MoLE-Lasso, MoLE-EN, and MoLE-RR with other predictive algorithms, all models were run in the training dataset of 25 batches and tested on the test dataset. For comparison purposes, the following models were implemented: a multivariate linear regression model with Lasso, EN, and RR penalties, an artificial neural network (ANN), a support vector regression, with radial basis kernel (SVR), with hyper-parameters γ (kernel width) and C (Gain), a decision tree (DT), and partial least squares (PLS). The parameters of Lasso, EN, RR, and PLS were selected so as to minimize the BIC criterion. The number of hidden nodes N h of the ANN and the regularization parameter γ LS-SVR and the Gaussian kernel parameter σ LS-SVR of the SVR were determined using a 10-fold cross validation. Table 3 shows the parameters obtained for the PLS, ANN and SVR models. Table 3. Parameters selected for PLS, ANN, and SVR models for each data set.

PLS ANN LS-SVR
Viscosity LV = 10 N h = 3 γ LS-SVR = 50, σ LS-SVR = 10 Acidity LV = 17 N h = 3 γ LS-SVR = 50, σ LS-SVR = 25 Table 4 shows the prediction performance on the resin acidity number and resin viscosity, for all models, on the test dataset. Table 4, shows that the regularized MoLE models perform better among all datasets. On the other hand, nonlinear models such as ANN, SVM and DT have the worst results. Indeed, according to the results, the performance results have shown that for viscosity this is essentially a linear modeling problem, while for the acidity, there is a clear benefit of using the regularize MoLE methods. As the MoLE models can capture the different phases of the process, it could increase the performance over the linear and nonlinear models.

Conclusions
This paper derived different regularized MoLE models for multiphase-multimode modeling. For this purpose, a MoLE algorithm that integrates Lasso, EN, or RR penalties, was derived and used in the experiments. In the presented case study, the MoLE-Lasso has shown to provide the most stable performance, even when learning with few samples. The other regularized MoLE was shown to have less stability when learning with few samples-for instance, take the MoLE-RR and the Mix-PLS-but they have provided consistent results when increasing the number of training samples. The performanceregularized MoLE is problem dependent, and they all must be a valid option when dealing with multimode data.
Future work will check the performance of MoLE on feature selection on industrial data and new ways to improve the stability of MoLE learning. Future works will also address the comparison of regularized MoLE on different domain problems, with variety with respect to the sample size and the number of input variables and compare its performance against state-of-art-machine learning models.