An Improved Normalized Mutual Information Variable Selection Algorithm for Neural Network-Based Soft Sensors

In this paper, normalized mutual information feature selection (NMIFS) and tabu search (TS) are integrated to develop a new variable selection algorithm for soft sensors. NMIFS is applied to select influential variables contributing to the output variable while avoiding the selection of redundant variables, by calculating mutual information (MI). A TS-based strategy is designed to prevent NMIFS from falling into a locally optimal solution. The proposed algorithm performs variable selection by combining entropy information, MI, and the validation error of artificial neural networks (ANNs); therefore, it has advantages over previous MI-based variable selection algorithms. Several simulation datasets with different scales, correlations and noise parameters are used to demonstrate the performance of the proposed algorithm. A set of actual production data from a power plant is also used to check the performance of these algorithms. The experiments show that the developed variable selection algorithm achieves better model accuracy with fewer selected variables than other state-of-the-art methods. The application of this algorithm to soft sensors can achieve reliable results.


Introduction
In real industrial processes, many variables cannot be measured at all, or cannot be measured frequently enough, with traditional instrumentation, owing to technological or economic constraints such as insufficient space, poor environmental conditions and extreme operating conditions. Soft sensors provide valuable solutions for estimating these process variables through inference modeling with other easily measured variables. Paulsson, et al. [1] applied a soft sensor to a microbial recombinant protein production process in a bioreactor by exploiting bio-calorimetric methodology and obtained better results than previous work. Chen, et al. [2] proposed a local soft-sensing method based on real-time correlation to predict silicon content, and the better prediction performance of the method was validated on online silicon content prediction. Kaneko, et al. [3] applied a soft sensor to the fault detection of chemical plants in order to accurately detect faults in the production process. Ahmed, et al. [4] successfully predicted the change of melt index during the production of high-density polyethylene using soft sensor technology. Wang, et al. [5] developed and applied a soft sensor to a refining process for quality prediction and achieved a high prediction accuracy. Xing, et al. [6] proposed a soft sensor for silicon content in industrial blast furnaces based on bagging local semi-supervised models (BLSM); the application results showed that BLSM had better prediction performance than other supervised soft sensors. Yuan, et al. [7] developed a novel variable-wise weighted stacked autoencoder (VW-SAE) for a soft sensor, and industrial data-based experiments showed that the proposed VW-SAE could present better prediction performance than traditional algorithms. Mambou, et al. [8,9] have also done valuable research in this field.
In fact, most industrial processes are highly non-linear. Artificial neural networks (ANNs) are widely used in soft sensor technology because of their strong nonlinear fitting ability. Feil, et al. [10] presented a soft sensor for the melt index using a semi-mechanistic modeling approach; the ANN-based modeling approach can be efficiently used for on-line state estimation. Ko, et al. [11] used a soft sensor based on ANNs to predict the grain size of ore, and the resulting model can accurately measure the grain size distribution in industrial production in real time. Shakil, et al. [12] developed a soft sensor for measuring gas emissions during industrial boiler combustion using a dynamic ANN, and measurement results with a high accuracy were obtained. Bo, et al. [13] constructed an adaptive soft sensor instrument based on ANN technology and applied it to an advanced control system; the control system realized quality closed-loop control very well. Jiesheng, et al. [14] used an ANN soft sensor model to predict the concentrate grade, and simulations showed that the model has better generalization and prediction accuracy.
The rapid development of distributed control systems (DCS) provides us with a huge amount of industrial information [15,16]. This big data also presents another problem for non-linear soft sensing: there are many redundant input variables in the process. If too many input variables are introduced in the training process of an ANN, the data dimensionality increases and more computational power is required. At the same time, introducing variables that are unrelated to the prediction target increases the noise in the dataset and reduces the prediction accuracy of the ANN. Therefore, the study of effective variable selection algorithms for soft sensors has attracted interest from many researchers. Variable selection algorithms can be broadly divided into two categories: wrapper and filter algorithms. The wrapper algorithms essentially treat the selection of inputs as an optimization of the model structure. They compare and evaluate either all or a subset of the possible input variable sets and select the set that yields optimal performance [17,18]. Wrapper algorithms use the performance of a machine learning model to evaluate the goodness of a feature subset and therefore have a good chance of achieving strong prediction performance. Due to this advantage, wrapper algorithms are widely studied and applied [19][20][21][22][23][24]. Despite this performance, wrappers can be very computationally expensive, since many predictive models with different feature subsets have to be built. Moreover, the results of the wrapper strategy lack generality, as their use is limited to a specific regression model.
The filter algorithms utilize a statistical measure of the degree of dependence between the input and output variables as the criterion for variable selection. Compared to the model-based wrapper algorithms, filter algorithms are model free and have lower computational complexity. Among the available measures, mutual information (MI) is a very suitable measure of dependence for variable selection during ANN development, since it captures arbitrary dependencies and makes no assumption regarding the structure of the dependence between variables. For a modeling problem, the ultimate objective is to reduce an error criterion as much as possible; the most frequently used is the mean squared error (MSE). Frénay et al. [25] demonstrated that MI is an adequate criterion for feature selection with respect to the MSE when the estimation error is identically distributed for any input variables with a uniform, Laplacian or Gaussian distribution. MI has also been found to be robust due to its insensitivity to noise and data transformations [26,27]. Therefore, many MI-based feature/variable selection algorithms have been proposed. Hanchuan, et al. [28] developed a minimal redundancy maximal relevance (mRMR) criterion to minimize redundancy, and made use of a series of intuitive measures of relevance and redundancy to select meaningful features. Estévez, et al. [29] proposed an enhanced version of mutual information feature selection (MIFS) and mRMR that introduced normalized mutual information (NMI) as a measure of redundancy. The proposed normalized mutual information feature selection (NMIFS) algorithm showed better performance than MIFS and mRMR by compensating for the MI bias toward multivalued features and restricting its values to the range [0, 1]. Bennasar, et al. [30] proposed normalized joint mutual information maximisation (NJMIM) for feature selection, which is based on the "maximum of the minimum" criterion.
The results showed that the NJMIM method outperforms the other methods on several tested public datasets. Novovicová, et al. [31] designed a sequential forward feature selection method based on conditional mutual information, called CMIFS, to find a subset of features that are most relevant to the classification task; the algorithm achieved better classification accuracy on high-dimensional Reuters textual data.
However, the NMIFS also has some disadvantages. One of them is that the execution of the variable selection only depends on the data characteristics and has nothing to do with the structure and model accuracy of the ANN. Another disadvantage is that the algorithm is essentially a greedy search; it only deletes one variable at every iteration. This mechanism will bring about an inherent drawback, in that the solution tends to fall into a local optimum.
The tabu search (TS) algorithm can avoid local optima by extending local neighborhood searches. The research of Zhang, et al. [32] showed that TS is a promising tool for variable selection due to the quality of its obtained variable subsets and its computational efficiency. Pacheco, J. [33] proposed a TS method that could find a smaller subset from a set of variables to effectively classify cases. Osman, I. H. [34] used TS for the vehicle routing problem; the new method provided a set of lowest-cost delivery routes and achieved satisfactory results. Lin, et al. [35] applied TS to an economic dispatch problem, and results showed that the proposed algorithm can provide accurate solutions with reasonable performance.
In this paper, an improved NMIFS for soft sensors is proposed. To the best of our knowledge, this is the first work to develop a neural network-based soft sensor by combining NMIFS with TS. The contributions of this paper are as follows: (1) The proposed algorithm performs variable selection by combining the NMI and the training accuracy of neural networks. (2) TS is used as an auxiliary optimization method, and it effectively prevents the proposed algorithm from falling into local optima. (3) The developed soft sensor is applied to predict the sulfur dioxide (SO2) emissions of flue gas in a practical power plant, and it has better performance than other algorithms.
The paper is organized as follows. Section 2 reviews the theories of NMIFS and ANNs. Section 3 presents the design and analysis of the proposed algorithm. Section 4 gives the numerical simulation results of two artificial dataset examples. Section 5 shows that the soft sensor based on NMI and TS (NMI-TS) achieved an accurate measurement of flue gas SO2 concentration. Finally, some concluding remarks are given in Section 6.

Input Variable Selection Techniques
In the training processes of ANNs, redundant candidate variables tend to complicate the model, reduce the modeling accuracy, and can even lead to over-fitting. The purpose of the input variable selection (IVS) technique is to correctly select the relevant variables from the candidate variable set C that contains all the potential input variables of the process. In the variable selection process, variables that contribute little to, or even negatively affect, the model accuracy are excluded.
In general, IVS can be implemented in three ways: (1) sequential forward selection, starting from a null set and selecting a variable into the target set at every iteration until there is no improvement in the prediction accuracy of the model; (2) sequential backward selection, starting from set C and deleting one variable at a time until the termination rule is met; (3) global optimization, finding the best variable set among all possible schemes with optimization algorithms.
Mathematically, if there are m input variables in C, there are (2^m − 1) possible subsets in the selection result set [36]. Therefore, it is very difficult to find the best solution if the number of input variables is large. Based on this consideration, it is necessary to find a statistical metric by which to scale the degree of dependency between the input and output variables; we can then preliminarily select the input variables before modeling with ANNs. This statistical pre-processing of input variables simplifies the complexity of modeling and can provide solutions with wider adaptability to different ANN structures. However, the performance of an IVS algorithm is highly dependent on the statistical metric used.
During the development of IVS algorithms for ANNs, MI has been found to be a valuable evaluation criterion, as it captures arbitrary dependencies and makes no assumption regarding the structure of the inter-variable dependencies.

Mutual Information (MI)
Suppose that Y is a random output variable; there will be some uncertainty in the observation value y ∈ Y. The uncertainty is defined on the basis of Shannon entropy, H. Suppose X is a random input variable and Y is an output variable dependent on X; the mutual observation of (X, Y) lowers this uncertainty, as knowledge of X allows inference about Y. For continuous variables, the MI is defined as follows [34]:

I(X;Y) = ∬ p(x, y) log( p(x, y) / (p(x) p(y)) ) dx dy (1)

where p(x, y) is the joint probability density function (PDF), and p(x) and p(y) are the marginal PDFs of X and Y, respectively. The entropy of X and Y and its relationship to their MI is shown in Figure 1.

MI has three primary properties:

1. Symmetry: I(X;Y) = I(Y;X). The amount of information extracted from Y about X is the same as that extracted from X about Y; only the observer's viewpoint differs.
2. Positivity: I(X;Y) ≥ 0. In the worst case, no information is extracted, that is, I(X;Y) = 0.
3. Extremum: I(X;Y) ≤ H(X) and I(X;Y) ≤ H(Y). The amount of information extracted from one event about another is at most the entropy of the other event, and does not exceed the amount of information contained in that event itself.

I(X;Y) measures the dependence of X and Y and provides useful information for variable selection; how to compute I(X;Y) is therefore pivotal to MI-based variable selection algorithms. The difficulty in calculating I(X;Y) is that the true functional form of the PDFs in (1) is unknown in the real world. Many approximate MI estimators have therefore been widely studied. For instance, kernel density estimation (KDE) places a basis function at every point of the feature data and has good computational quality [28]; the PDF approximation is then obtained as the envelope of all the basis functions. K-nearest neighbor (KNN) estimators [37,38] are also effective. Despite superior approximation results, these kinds of algorithms have a significantly high computational load, especially for large-scale problems. Histogram methods [39] present an alternative solution with acceptable accuracy and conspicuously higher efficiency than KDE methods.
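The histogram route described above can be sketched as a "plug-in" estimator. The paper's experiments use MATLAB, so the Python below is an illustrative stand-in, and the bin count is an assumed tuning parameter:

```python
import numpy as np

def mutual_information(x, y, bins=16):
    # Empirical joint distribution from a 2-D histogram.
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()
    px = pxy.sum(axis=1)   # marginal of X
    py = pxy.sum(axis=0)   # marginal of Y
    nz = pxy > 0           # skip empty bins (0 * log 0 = 0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz])))
```

Because this is a KL divergence between the empirical joint and the product of its marginals, the estimate is symmetric and non-negative, matching properties 1 and 2 above.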
The computed MI values fluctuate considerably when MI is applied to actual production problems. For this reason, raw MI can hardly be taken as an indicator for comparing the similarity between different variables. A normalized MI is therefore introduced, in which the entropy is applied as the denominator to adjust the value of MI to within [0, 1]:

NMI(X;Y) = I(X;Y) / min{H(X), H(Y)} (3)

Following that, NMI can be effectively applied to assess the similarity between the input variables and the target variable.
Inspired by Battiti and Hanchuan [26,28], the evaluation criterion for whether a variable is selected is designed as follows:

B_j = I(Y; p_j) − ξ Σ_{a_i ∈ A} NMI(p_j; a_i) (4)

where Y is the output variable, p_j is the pending variable, a_i is a variable already selected into the set A, and ξ is a constant that controls the ratio of the redundancy term in B_j. The parameter ξ acts as a factor penalizing redundancy among single features and has a great influence on the selection result. Suitable values of ξ differ with the sample size, complexity and number of variables, and are usually determined experimentally for the specific problem.

Tabu Search
TS is an intelligent optimization algorithm that extends local neighborhood searches. The most important idea of TS is to mark the local optimal solution that has been reached and try to avoid these objects in the further iterative search, to ensure the exploration of different effective search approaches.
The solution quality of a greedy search algorithm depends on the starting point and the neighborhood structure, and such a search very easily falls into a local optimum. To obtain better solutions, researchers usually expand or change the neighborhood structure, or start searching from multiple starting points. However, these strategies still cannot guarantee a globally optimal solution, due to the nature of their "greed".

Unlike a greedy search, TS has the advantage of being able to escape from a local optimum by accepting inferior solutions. The implementation of TS involves the design of the neighborhood, tabu list, tabu length, amnesty rule and termination rule, which together constitute a loop in the search process.

At the beginning of TS, an initial solution is generated randomly or by some heuristic algorithm. Then, the quality of all solutions in the neighborhood is evaluated according to the fitness function. For a minimization problem, the solution with the smallest objective value has the largest fitness and is selected as the solution of the next iteration. The tabu list is a short-term memory of the specific changes made by recent moves within the neighborhood, used to prevent future moves from undoing those changes. The tabu length is the length of the tabu list. The amnesty rule is applied to avoid losing superior solutions: if a solution in the tabu list is obviously better than the other candidates, it can be taken as the solution of the current iteration, ignoring the tabu rules. The termination rule stops the search when the number of iterations reaches a limit or the obtained solution satisfies a preset accuracy.
Define the current solution as x, the neighborhood as N(x), the fitness function as C(x), and the tabu list as T, and let x* denote the optimal solution in the neighborhood that is not written into the tabu list. The algorithm flow of TS is as follows:

Step 1 Generate an initial solution x ∈ X; set x* = x, T = ∅; define the amnesty rule as A(x) = C(x*); set the iteration counter k = 0.
Step 2 If the termination rule is satisfied, stop and output the best solution found; otherwise, set k = k + 1.
Step 3 Generate the neighborhood N(x) and evaluate the fitness of every candidate solution in it.
Step 4 Take the best candidate that is not in T as x*; if a tabu candidate satisfies the amnesty rule, take it as x* instead.
Step 5 Set x = x*; if x improves on the best solution found so far, record it.
Step 6 Add x* to the tabu list, release the solution that reaches the tabu length and return to Step 2.

TS is widely used for combinatorial optimization problems and has superiority in global search [40,41]. In this paper, we adopted the search mechanism of TS and incorporated it with NMIFS to develop a new variable selection method for soft sensors.
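The loop above can be sketched on a toy bit-string minimization problem (a Python stand-in; the neighborhood, here single-bit flips, and all parameter values are illustrative):

```python
import random
from collections import deque

def tabu_search(cost, n_bits, tabu_len=5, n_iter=200, seed=0):
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n_bits)]   # Step 1: initial solution
    best, best_cost = x[:], cost(x)
    tabu = deque(maxlen=tabu_len)                    # short-term memory of moves
    for _ in range(n_iter):                          # Step 2: termination rule
        moves = []
        for i in range(n_bits):                      # neighborhood: 1-bit flips
            y = x[:]
            y[i] ^= 1
            c = cost(y)
            # Amnesty rule: a tabu move is allowed if it beats the best so far.
            if i not in tabu or c < best_cost:
                moves.append((c, i))
        if not moves:
            continue
        c, i = min(moves)                            # best admissible neighbor
        x[i] ^= 1                                    # accept it, even if worse
        tabu.append(i)                               # oldest entry is released
        if c < best_cost:
            best, best_cost = x[:], c
    return best, best_cost
```

Note that the accepted neighbor may be worse than the current solution; this is exactly how TS escapes local optima, while `best` preserves the incumbent.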

Proposed Methods for Feature Selection
Equation (4) not only contains the MI between p_j and Y, but also takes the MI between p_j and the already selected variables as a penalty. This means that p_j should have a high degree of dependency on the output variable while having a low dependency on the variables in A. In this way, variables with little impact on the prediction accuracy, as well as redundant input variables, are prevented from being added to A. This avoids inflating the model size, reduces the data processing time required in training the ANN, and lowers the model input noise, improving the accuracy of the model.
If the input variables are selected as described above, a termination criterion needs to be set for the algorithm. First, for each variable p_j ∈ P, compute B_j; then find the p_j that maximizes B_j, set P ← P\{p_j} and A ← A ∪ {p_j}; the algorithm stops when all B_j fall below a threshold. The threshold is a parameter that must be set manually, and different values are required for different IVS problems. Because this is a greedy search algorithm, it can produce a locally optimal solution. To ensure the exploration of different effective searches, we used a TS algorithm.
In programming the algorithm, the tabu list can be pictured as a "tube": at each iteration, the element to be put into the tabu list is packed into a "ball", and the number of balls the tube can accommodate is the tabu length. A ball is pressed into the inlet of the tube; if a ball emerges from the outlet, the elements have reached the tabu length, and if no ball emerges, they have not. Let the size of the neighborhood be N and the tabu list be S. The algorithm flow is as follows:

Step 1 Evaluate I(Y; p_j) for each variable in P and select the variable with the largest I(Y; p_j) as the starting point.
Step 2 If P = ∅ and S = ∅, stop the search.
Step 3 Take the variables that have satisfied the tabu length out of S and add them back to P.
Step 4 Evaluate B_j for each variable in P.
Step 5 Sort the variables in P by B_j and put the first N variables into the neighborhood.
Step 6 Evaluate λ_n (the fitness value) for each variable in the neighborhood.
Step 7 Select the variable with the smallest λ_n as the optimal solution of this iteration.
Step 8 Put the other variables into S and return to Step 2.
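One possible reading of Steps 1-8 as code is sketched below (a Python stand-in; the `relevance`, `redundancy_penalty` and `fitness` inputs are placeholders for I(Y; p_j), the NMI penalty and λ_n, which in the real algorithm come from MI estimation and cross-validated ANNs):

```python
def nmi_ts_select(candidates, relevance, redundancy_penalty, fitness,
                  neighborhood_size=3, tabu_len=2):
    P = set(candidates)
    A = []   # selected variables, in selection order (the "path")
    S = []   # tabu list (the "tube")
    # Step 1: start from the variable with the largest relevance I(Y; p_j).
    start = max(P, key=lambda p: relevance[p])
    A.append(start)
    P.discard(start)
    while P or S:                 # Step 2: stop only when both are exhausted
        # Step 3: balls leaving the tube (or a stalled search) return to P.
        while len(S) > tabu_len or (not P and S):
            P.add(S.pop(0))
        # Steps 4-5: rank P by B_j and keep the top N as the neighborhood.
        B = {p: relevance[p] - redundancy_penalty(p, A) for p in P}
        nbhd = sorted(P, key=lambda p: B[p], reverse=True)[:neighborhood_size]
        # Steps 6-7: the neighborhood member with the smallest lambda_n wins.
        chosen = min(nbhd, key=lambda p: fitness[p])
        A.append(chosen)
        P.discard(chosen)
        # Step 8: the remaining neighborhood members enter the tabu tube.
        for p in nbhd:
            if p != chosen:
                P.discard(p)
                S.append(p)
    return A
```

The loop terminates because every iteration moves exactly one variable permanently into A, while the others only shuttle between P and S.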
The pseudo-code of the algorithm is shown in Algorithm 1. The algorithm can be regarded as a combination of a local search (LS) algorithm and TS. The LS algorithm is a sequential forward selection approach that consists of two loops. The outer loop performs variable selection by choosing the variables with the highest fitness at every step, where the fitness of a variable is defined by its NMI. The algorithm then takes the variables selected by the outer loop and enters them into the inner loop, which performs TS by selecting the variable with the lowest fitness at every step and putting the other variables into the tabu list.
The algorithm proceeds until all variables have been selected into A, whereby the algorithm obtains a "path" according to the order in which variables entered A. Along this known path, the first l (l = 1, 2, 3, . . . , i, where i is the length of A) variables are used to train the ANN at each iteration, and the cross-validation (CV) MSE is used to evaluate the modeling accuracy and select the optimal solution.

Subroutine of Cross-Validation (CV)
CV is a common method in machine learning for building models and validating model parameters. To achieve k-fold CV, we reserve a single sub-dataset (validation set) and use the remaining k − 1 sub-datasets (training set) to train a new ANN, then obtain the MSE by validating the reserved sub-dataset with the new ANN. The k-fold CV process is repeated k times, with each of the k sub-datasets used exactly once as the validation set, and then produces a single estimate λ_n by averaging the results.
Algorithm 2 presents a detailed subroutine of CV, in which the MSE is used as the evaluation index in selecting variables. The k-fold CV procedure is implemented by the loop: (1) reserves a single sub-dataset, (2) uses the remaining k − 1 sub-datasets to retrain a new ANN and (3) obtains the MSE by validating the reserved sub-dataset with the retrained ANN.
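The loop in Algorithm 2 can be sketched as follows (a Python stand-in for the paper's MATLAB; `fit` and `predict` are placeholder callables standing in for ANN retraining and prediction):

```python
import numpy as np

def kfold_cv_mse(X, y, fit, predict, k=5, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)          # k disjoint sub-datasets
    errors = []
    for i in range(k):
        val = folds[i]                      # (1) reserve one sub-dataset
        trn = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = fit(X[trn], y[trn])         # (2) retrain on the other k-1 folds
        # (3) MSE on the reserved sub-dataset
        errors.append(float(np.mean((predict(model, X[val]) - y[val]) ** 2)))
    return float(np.mean(errors))           # averaged estimate (lambda_n)
```

With an exactly linear toy problem and a least-squares `fit`, the CV MSE is essentially zero, which is a convenient sanity check for the loop itself.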

Experimental Setting
All the algorithms in this paper are tested in the same experimental setting. The programs used for the numerical experiments were coded in MATLAB 2016 and run on a Windows 8.1 operating system. They use the same ANN structure: one hidden layer with the hyperbolic tangent activation function and an output layer with the linear activation function. The ANNs are trained using the standard back-propagation algorithm [42]. The number of hidden nodes is determined by trial experiments.
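The forward pass of that structure, a tanh hidden layer followed by a linear output layer, can be sketched as follows (a Python stand-in; the weight shapes are illustrative):

```python
import numpy as np

def ann_forward(X, W1, b1, W2, b2):
    # One hidden layer with tanh activation, then a linear output layer.
    H = np.tanh(X @ W1 + b1)
    return H @ W2 + b2
```

With all weights zero, tanh(0) = 0 and the network simply outputs the output bias, which makes the wiring easy to verify.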
In the experiments of Section 4, 80% of the dataset is used for modeling, and 20% is used to test the accuracy of the model. The simulation results are obtained with the following definitions: (1) Model size (M.S): The number of input variables selected by the algorithm.
(2) MSE: During the whole modeling process, part of the data is used for modeling, and the other part is used to verify the accuracy of the model. We recorded MSE between the predicted and desired output to evaluate the prediction accuracy of the model.

Case 1
In this example, a nonlinear dataset containing a pool of candidate variables was generated to validate the performance of the methods. The true non-linear model [24] was used to produce the output, with the coefficient vector β = [3.0, 1.5, 2.0, 4.0, 0.5, 1.3, −2.6, −3.5, −5.1, 2.0]^T taken from [43] and ε white Gaussian noise. The input dataset X was generated from a multivariate normal distribution; the coefficient ρ of the covariance between two different variables (columns) was set to 0.2, so that multi-collinearity was introduced. The dataset X_1 ∈ R^{2000×10} was the set of relevant variables and X_2 ∈ R^{2000×10} the set of irrelevant variables; the input dataset was X = {X_1, X_2}. To compare the performance of the proposed algorithm, state-of-the-art MI-based variable selection algorithms such as NMIFS and CMIFS [31] were applied to the numerical dataset.
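Generating correlated inputs of this kind can be sketched as follows; since the paper's exact covariance matrix is not reproduced here, this sketch assumes unit variances with a constant pairwise correlation ρ:

```python
import numpy as np

def case1_inputs(n=2000, m=20, rho=0.2, seed=0):
    # Assumed structure: unit variances, constant pairwise correlation rho
    # (an illustrative stand-in for the paper's covariance matrix).
    cov = np.full((m, m), rho)
    np.fill_diagonal(cov, 1.0)
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(np.zeros(m), cov, size=n)
```

This matrix is positive definite for 0 ≤ ρ < 1 (eigenvalues 1 − ρ and 1 + (m − 1)ρ), so the multivariate normal sampler is well defined.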
Moreover, an advanced wrapper algorithm that combines nonnegative garrote with ANN, called NNG-ANN [24], is also applied in the experiments. Table 1 presents the performance of the simulation results with different algorithms. It can be seen from the table that NMI-TS has the best MSE and R 2 with the fewest variable selected. The WS+ is zero, meaning that the algorithm selects all the relative input variables. The WS-is one, meaning that only one irrelative variable is selected in the final model. The results show that the NMI-TS is superior in variable selection accuracy to other algorithms. The curves of MSE to the number of features of this dataset with MI-based algorithms are shown in Figure 2. Note that the characteristic of NNG-ANN is to give the best solution at a time instead of presenting a solution path. Therefore, the curves of NNG-ANN are not presented in this type of figures. It can be seen from the figure that the MSE reaches the best result at 11 features. At those points fewer than 11 features, the performance deteriorates drastically because of the lack of necessary input information. At the right of the best point, MSE slowly deteriorates due to the introduction of redundant information as the number of features increases. Moreover, the other algorithms converge more slowly to reach a better point than NMI-TS. The curves of MSE to the number of features of this dataset with MI-based algorithms are shown in Figure 2. Note that the characteristic of NNG-ANN is to give the best solution at a time instead of presenting a solution path. Therefore, the curves of NNG-ANN are not presented in this type of figures. It can be seen from the figure that the MSE reaches the best result at 11 features. At those points fewer than 11 features, the performance deteriorates drastically because of the lack of necessary input information. 
At the right of the best point, MSE slowly deteriorates due to the introduction of redundant information as the number of features increases. Moreover, the other algorithms converge more slowly to reach a better point than NMI-TS.  To further verify the performance of the algorithm, we conducted the following experiments on the above dataset  (2) Correlation The correlation between the input variables ρ increased from 0.2 to 0.8, and we then conducted experiments with MSE as the evaluation criterion. Results are shown in Figure 4.
(3) Noise intensity ε in Equation (5) was increased and we designed the experiments with MSE as the evaluation criterion. The results are shown in Figure 5.
Sample size, correlation, and noise intensity all affect the prediction accuracy of the algorithm, especially the modeling accuracy of the ANN. Because MI is used as the evaluation criterion, CMIFS NMIFS and NMI-TS are less affected. Although the MSE of CMIFS NMIFS and NMI-TS are essentially the same in some experiments, NMI-TS usually presents a better model accuracy with fewer variables selected. This algorithm can solve the problem of strong coupling between input variables. It is worth noting that the accuracy of the proposed algorithm has certain dependence the amount of training samples. If the training sample is too small, the sparse distribution of experimental data will reduce the reliability of histogram method. Then, the PDFs between variables cannot be effectively calculated, which will lead to the deterioration of the algorithm performance. Although the utilization of KDE or KNN can improve this situation, the cost of operation is also greatly increased. (2) Correlation The correlation between the input variables ρ increased from 0.2 to 0.8, and we then conducted experiments with MSE as the evaluation criterion. Results are shown in Figure 4. (2) Correlation The correlation between the input variables ρ increased from 0.2 to 0.8, and we then conducted experiments with MSE as the evaluation criterion. Results are shown in Figure 4.
(3) Noise intensity ε in Equation (5) was increased and we designed the experiments with MSE as the evaluation criterion. The results are shown in Figure 5.
Sample size, correlation, and noise intensity all affect the prediction accuracy of the algorithm, especially the modeling accuracy of the ANN. Because MI is used as the evaluation criterion, CMIFS NMIFS and NMI-TS are less affected. Although the MSE of CMIFS NMIFS and NMI-TS are essentially the same in some experiments, NMI-TS usually presents a better model accuracy with fewer variables selected. This algorithm can solve the problem of strong coupling between input variables. It is worth noting that the accuracy of the proposed algorithm has certain dependence the amount of training samples. If the training sample is too small, the sparse distribution of experimental data will reduce the reliability of histogram method. Then, the PDFs between variables cannot be effectively calculated, which will lead to the deterioration of the algorithm performance. Although the utilization of KDE or KNN can improve this situation, the cost of operation is also greatly increased. (3) Noise intensity ε in Equation (5) was increased and we designed the experiments with MSE as the evaluation criterion. The results are shown in Figure 5.
Sample size, correlation, and noise intensity all affect the prediction accuracy of the algorithms, especially the modeling accuracy of the ANN. Because MI is used as the evaluation criterion, CMIFS, NMIFS, and NMI-TS are less affected. Although the MSEs of CMIFS, NMIFS, and NMI-TS are essentially the same in some experiments, NMI-TS usually achieves better model accuracy with fewer selected variables, and it can handle strong coupling between input variables. It is worth noting that the accuracy of the proposed algorithm depends to a certain extent on the number of training samples. If the training set is too small, the sparse distribution of the experimental data reduces the reliability of the histogram method; the PDFs of the variables then cannot be estimated effectively, which degrades the algorithm's performance. Although using KDE or KNN estimators can improve this situation, the computational cost also increases greatly.
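The sensitivity to sample size can be illustrated with a minimal histogram-based MI estimator (a sketch in NumPy; `hist_mi` and the bin count are our own illustrative choices, not the paper's implementation):

```python
import numpy as np

def hist_mi(x, y, bins=16):
    """Histogram (2-D) plug-in estimate of mutual information I(X;Y), in nats.

    With few samples the joint histogram becomes sparse and the estimate
    unreliable -- the small-sample effect discussed in the text.
    """
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()                  # joint PDF estimate
    px = pxy.sum(axis=1, keepdims=True)    # marginal of X
    py = pxy.sum(axis=0, keepdims=True)    # marginal of Y
    nz = pxy > 0                           # avoid log(0)
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(0)
x = rng.normal(size=5000)
dep = hist_mi(x, x + 0.1 * rng.normal(size=5000))  # strongly dependent pair
ind = hist_mi(x, rng.normal(size=5000))            # independent pair
```

A KNN-based estimator (e.g., scikit-learn's `mutual_info_regression`) handles sparse samples better, at the higher computational cost noted above.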

Case 2
In order to further validate the proposed algorithm, a benchmark dataset first proposed by Friedman [44] is applied in our paper. The dataset has five input variables relevant to the output variable, which is calculated by y = 10 sin(πx1x2) + 20(x3 − 0.5)² + 10x4 + 5x5 + ε, where {x1, x2, x3, x4, x5} are the relevant input variables and ε is white Gaussian noise. In addition, the dataset contains 45 input variables that are irrelevant to the output variable. The task of the variable selection algorithms is to select {x1, x2, x3, x4, x5} from the total of 50 variables. However, the irrelevant variables have different degrees of correlation to {x1, x2, x3, x4, x5}, ranging from 0.3 to 0.9, which increases the difficulty of variable selection. Table 2 presents the simulation results on this dataset with the different algorithms. As can be seen, NMI-TS has a smaller MS and a better MSE than the other approaches. The curves of MSE against the number of features for this dataset with the MI-based algorithms are shown in Figure 6. The figure shows that NMI-TS converges quickly to the best solution, while the other methods require more variables to converge.
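This benchmark can be regenerated for experimentation; scikit-learn ships the Friedman #1 function as `make_friedman1` (the correlation-mixing step below is a rough illustrative approximation of the correlated irrelevant inputs, not the paper's exact construction):

```python
import numpy as np
from sklearn.datasets import make_friedman1

# 5 relevant + 45 irrelevant inputs; only the first five enter y.
X, y = make_friedman1(n_samples=1000, n_features=50, noise=1.0, random_state=0)

# make_friedman1 generates independent inputs; the paper additionally
# correlates the irrelevant inputs with the relevant ones (rho in 0.3-0.9).
# One crude way to mimic that is to mix each irrelevant column with a
# randomly chosen relevant one:
rng = np.random.default_rng(0)
rho = rng.uniform(0.3, 0.9, size=45)
X[:, 5:] = rho * X[:, rng.integers(0, 5, size=45)] + (1 - rho) * X[:, 5:]
```

After mixing, every irrelevant column carries a noticeable correlation with a relevant one, which is what makes the selection task harder.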

Desulfurization Technology of the Power Plant
The power plant providing the data used in this article operates limestone-gypsum wet flue-gas desulfurization technology, which uses lime or limestone to react with SO2 and absorb it. The chemical reaction principle is as follows: the flue gas containing SO2 moves up from the bottom of the absorption tower, where it encounters the alkaline spray layer and reacts with it to form liquid drops. Tiny liquid drops fall into the slurry pool, where the oxidation of CaSO3 and the crystallization of CaSO3·1/2H2O are completed; finally, gypsum is formed at the bottom of the absorption tower.
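The absorption, oxidation, and crystallization steps summarized above follow the standard wet limestone-gypsum reaction scheme; a sketch of the main reactions (standard FGD chemistry, given here for reference rather than reproduced from the original equations) is:

```latex
\begin{align}
\mathrm{CaCO_3} + \mathrm{SO_2} + \tfrac{1}{2}\mathrm{H_2O}
  &\rightarrow \mathrm{CaSO_3}\cdot\tfrac{1}{2}\mathrm{H_2O} + \mathrm{CO_2} \\
\mathrm{CaO} + \mathrm{SO_2} + \tfrac{1}{2}\mathrm{H_2O}
  &\rightarrow \mathrm{CaSO_3}\cdot\tfrac{1}{2}\mathrm{H_2O} \\
\mathrm{CaSO_3}\cdot\tfrac{1}{2}\mathrm{H_2O} + \tfrac{1}{2}\mathrm{O_2} + \tfrac{3}{2}\mathrm{H_2O}
  &\rightarrow \mathrm{CaSO_4}\cdot 2\mathrm{H_2O}
\end{align}
```

The first two reactions absorb SO2 in the spray zone; the third is the oxidation and crystallization step in the slurry pool that yields gypsum.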
The desulfurization process and related parameters are mainly taken from unit 9 of the power plant. The main feature of this unit is that it adopts the structure of a double absorption tower. Compared with the design of the single absorption tower, this structure can be used for secondary treatment of the discharged flue gas, and can remove SO2 in the flue gas more effectively. Figure 7 shows the flow direction of flue gas in the flue-gas system.

In the entire desulfurization process, the untreated flue gas is first introduced into the two-stage absorption tower, and once desulfurized it is introduced into the chimney through a high-pressure dust catcher.
A fan pressurizes the flue gas and sends it to the bottom of the No. 9-1 absorption tower, but before this happens it needs to be cooled down, because, within a certain range, lower temperatures are more conducive to the absorption of SO2 from the flue gas. The flue gas rises from the bottom of the first absorption tower, reaches the top after being treated, and is then sent to the bottom of the secondary absorption tower for the second treatment. Wet flue-gas desulfurization readily produces a "fog" with particle sizes of 10-60 microns in the absorption tower. Because this "fog" contains SO2, the gas inside the tower passes through a mist eliminator before leaving the absorber. To prevent the liquid in the slurry tank of the absorber from pouring back into the flue-gas inlet, an overflow port is installed at the upper part of the slurry pool of the first-stage absorber. The overflow port of the secondary absorber overflows slurry to the first-stage absorber when a high liquid level is reached. Owing to the large volume of the slurry pool, solids will precipitate if the liquid remains stationary for a long time; to prevent such deposits, pulsed suspension pumps are used to stir the slurry. Given that oxidation and crystallization are required in the slurry tank, an oxidation air blower feeds air to the tower through an aeration device arranged in the slurry tank. Because the slurry pool of the secondary absorber produces less gypsum, its slurry is fed directly into the slurry pool of the first-stage absorber by the gypsum swirl pump, where the oxidation and crystallization process is carried out. Apart from the different arrangement of overflow ports and oxidation air blowers, the two absorption towers are largely identical in structure, as shown in Figure 8.

Experiment Design and Results
In the desulfurization process, the untreated flue gas is first introduced into the two-stage absorption tower; then, after desulfurization, it is introduced into the chimney through the high-pressure dust catcher. The flue-gas desulfurization system consists of the SO2 absorption system, flue-gas system, absorbent preparation system, gypsum treatment system, processed water system, and accident slurry system. The whole system is very complex and involves many candidate variables whose relevance to the output variable is unknown. There are 52 input variables provided by the factory's DCS; Table A1 presents the name, description, and unit of these variables. In fact, there are so many variables in the candidate pool that modeling inaccuracies may result. Meanwhile, it is very difficult to precisely select variables and model such a complex process through domain knowledge alone. Therefore, a variable selection algorithm based on statistical theory is very meaningful for accurate modeling of the process.
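The NMIFS-style forward selection underlying the proposed method can be sketched as follows (an illustrative simplification using scikit-learn's KNN-based MI estimator; the entropy normalization of NMIFS and the TS escape strategy are omitted, and the toy data are not the plant data):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def nmifs_select(X, y, k):
    """Greedy forward selection: pick k inputs maximizing relevance to y
    while penalizing average redundancy with already-selected inputs."""
    n = X.shape[1]
    rel = mutual_info_regression(X, y, random_state=0)  # I(x_i; y)
    selected = [int(np.argmax(rel))]                    # most relevant first
    while len(selected) < k:
        best, best_score = None, -np.inf
        for i in range(n):
            if i in selected:
                continue
            # average MI between candidate i and each selected input
            red = np.mean([
                mutual_info_regression(X[:, [j]], X[:, i], random_state=0)[0]
                for j in selected
            ])
            score = rel[i] - red    # relevance minus redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return selected

# Toy check: x1 nearly duplicates x0, x2 is pure noise.
rng = np.random.default_rng(0)
x0 = rng.normal(size=800)
X = np.column_stack([x0, x0 + 0.01 * rng.normal(size=800),
                     rng.normal(size=800)])
y = x0 + 0.1 * rng.normal(size=800)
sel = nmifs_select(X, y, k=2)
```

On this toy data the redundancy penalty keeps the near-duplicate column out of the selection, which is exactly the behavior the algorithm is designed for.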
The dataset from the power plant was entered into the three algorithms used in Section 4; 50% of the dataset was used for modeling and 50% to test the accuracy of the model. Table 3 shows the modeling accuracy. The soft sensor presented in this paper achieved the highest accuracy in predicting the SO2 concentration. Some data were input into the NMI-TS-based soft sensor, and the predicted values were compared with the real values, as shown in Figure 9. Figure 10 presents the curves of MSE against the number of features for the desulfurization process with the MI-based algorithms. The best solution appears at four input variables. To the left of this point, the MSE deteriorates drastically as the number of variables decreases. Similar to Figures 2 and 6, NMI-TS has the best convergence rate among these algorithms; but unlike Figures 2 and 6, the MSE tends to increase, with fluctuations, to the right of the point. The reason for this phenomenon is that the practical dataset has large cross-correlations between different input variables, as well as uncertainty between redundant and useful information. According to the statistical results, four input variables are selected by NMI-TS in the optimal solution, as shown in Table 4. The first, variable 3, is the current of the third circulating slurry pump. There are four pumps in tower 9-2, and each pump has its own motor to adjust the spraying speed of the slurry and thereby control the rate of absorption.
Usually, the operator takes the default operation mode, under which two pumps are normally open in tower 9-2, and manually adjusts the switch and frequency of the third motor according to the production situation. Therefore, variable 3 is highly related to the target variable and is included in the optimal solution. Secondly, variable 33 is the flue-gas SO2 concentration at the outlet of the first tower and the inlet of the second tower, which is obviously correlated with the final SO2 concentration. The third is variable 38, the limestone slurry fed to the absorption towers: it can be seen from reaction formulas (8) and (9) that the degree of the reaction is closely related to the amount of CaO and CaCO3, i.e., the limestone slurry. The fourth is variable 50, the pH value of the slurry in tower 9-2. This result is quite consistent with the basic chemical reaction mechanism and with operational experience in the field: if the SO2 concentration of the flue gas is relatively high, more SO2 is absorbed by the slurry, which lowers the slurry's pH value; conversely, a relatively low SO2 concentration results in a higher pH value.
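The 50/50 modeling/validation procedure used for Table 3 can be sketched as follows (with scikit-learn's `MLPRegressor` as a stand-in ANN on synthetic data; the function name, network size, and data are illustrative assumptions, not the plant setup):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

def evaluate_subset(X, y, subset, seed=0):
    """Fit an ANN on half the data with the chosen inputs and
    report validation MSE on the other half."""
    Xtr, Xte, ytr, yte = train_test_split(
        X[:, subset], y, test_size=0.5, random_state=seed)
    ann = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000,
                       random_state=seed).fit(Xtr, ytr)
    return mean_squared_error(yte, ann.predict(Xte))

# Synthetic stand-in for the plant data: inputs 0 and 1 drive y.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 6))
y = X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=600)
mse_good = evaluate_subset(X, y, [0, 1])   # informative subset
mse_bad = evaluate_subset(X, y, [4, 5])    # uninformative subset
```

Comparing the validation MSE of competing variable subsets in this way is what lets the selection algorithm be judged by model accuracy rather than by MI alone.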

Conclusions
The proposed algorithm for soft sensors is an innovative combination of NMI and TS. NMI-TS uses model accuracy to guide the variable selection process of the NMI algorithm and avoids local optima by using TS. We used simulation examples with different scales, correlations, and noise intensities to verify the advantages of our algorithm. Furthermore, the NMI-TS soft sensor was applied to a power plant desulfurization process. The simulation results show that the NMI-TS soft sensor can achieve high prediction accuracy with a simpler model of SO2 concentration. The proposed soft sensor is easily implemented and adapted, since the program can be stored as a subroutine in the host computer of the DCS. With the large amount of new production data provided by the DCS, the soft sensor model can easily be retrained and updated by calling the subroutine at regular intervals.
In future research, it would be of interest to develop a distributed parallel framework [15,16] for the proposed algorithm, to address the increased computational complexity created by big data. The idea of a distributed parallel framework could be combined with the proposed algorithm to obtain better and faster feature selection methods applicable to large databases.

Conflicts of Interest:
The authors declare no conflict of interest.

Table A1. All input variables in the distributed control system (DCS).

Number  Input Variable                                          Units
…       AT gypsum DP 2's current                                A
8       AT gypsum DP 1's current                                A
9       Gypsum slurry DP A's current                            A
10      Water ring vacuum pump's current                        A
11      #9AT's inlet flue-gas temperature                       K
12      #9AT's outlet flue-gas temperature                      K
…       Intermediate value of desulfurization efficiency of Unit 9

AT = absorption tower; DP = discharge pump; SP = slurry pump.