## 1. Introduction

When dealing with industrial processes, many variables are monitored through online sensors. Some of these variables, however, can be very hard to measure, or can be measured only sporadically due to costly sensors or the lack thereof. In some cases, variables are measured with long delays because of slow hardware sensors or laboratory analysis, making real-time monitoring of the process impossible. Inferential models can then be created to estimate these hard-to-measure variables on the basis of online measured ones. Such models are referred to as Soft Sensors (SSs), or Virtual Sensors [1,2]. Their working principle is summarized in Figure 1.

SSs were originally exploited in the science of chemometrics, a discipline that studies how to extract information from data sets of chemical systems by using multivariate statistics, mathematics and computer science [3,4]. Chemometrics solves prediction problems by learning models from data and exploiting machine learning, system identification, artificial intelligence and statistical learning theory [1,2,5].

SSs’ industrial use ranges over a number of different types of processes, such as refineries [6,7,8,9,10,11], chemical plants [12], cement kilns [13,14], power plants [15,16], pulp and paper mills [17,18], food processing [2,19], nuclear plants [20,21], pollution monitoring [22], polymerization processes [23,24,25] or wastewater treatment systems [26,27,28,29], just to mention a few. In Figure 2, a distillation column process from a refinery is shown through its control software: displayed online measurements are collected in a historical database for future use.

SSs’ real-time estimation capability allows the development of tight control policies, and since they represent a low-cost alternative to expensive hardware devices, SSs allow the realization of more comprehensive monitoring networks. The real-time estimate produced by the SS can be used by a controller, while the corresponding delayed measurements allow the model to be improved, avoiding an error propagation effect. Besides their use for plant control, SSs are also used to approach a number of other problems, such as measuring system back-up, what-if analysis, sensor validation and fault diagnosis [7,30,31,32,33,34,35,36,37,38,39,40,41].

The back-up of measuring instrumentation consists of substituting unavailable measuring equipment, to avoid degradation of plant performance and rises in costs. This can become necessary since measuring devices, and their data transmission systems, generally face harsh working environments that impose periodic maintenance procedures or lead to faults. Therefore, when a maintenance intervention is performed, measuring hardware needs to be turned off and suitably substituted throughout the entire process.

What-if analysis consists of using the model to simulate the system dynamics in response to inputs of interest, over a given time span, with the aim of either achieving a deeper understanding of the system behavior or designing suitable control policies.

SSs also reduce the need for measuring devices, improving system reliability and decreasing sensor acquisition and maintenance costs. They can work in parallel with hardware sensors, thereby providing useful information for fault detection tasks, and they can be easily implemented on existing hardware and retuned when system parameters change. The main goals of fault detection are to detect faults early while providing as much information as possible about them, to support scheduled maintenance interventions, and to provide a basis for the development of fault-tolerant systems.

SSs can be built with different approaches. Since they are dynamic I/O models, simpler ones, like linear models, are usually preferred because of their lower time and computing demands, and if a priori physical knowledge of the process to model is available, a white-box design approach is possible as well. However, in an industrial environment, because of the complexity of the processes and the amount of available data, nonlinear models are needed and data-driven black-box approaches lead to satisfactory results. Available data must be representative of the dynamics of the process, and the choice of the right inputs is, for this reason, a crucial step in the design process.

Input selection is a widely addressed subject, and several surveys on the topic are reported in the literature. In [42], a taxonomy of the most used methods for Artificial Neural Network models is given, as well as in [43], where a review of the approaches used on microarray data for Support Vector Machine (SVM) models is performed. In [5], a review of SS design in its entirety is given. In [44], filter selection methods exploiting Mutual Information are reviewed, while [45] provides a survey of wrapper-class feature selection methods. The work in [46] focuses on feature selection methods of the semi-supervised class. This paper focuses on the input selection step of SS design independently of the specific model adopted by the designer, with the aim of helping in the choice of the most suitable technique by exploring all the classes of state-of-the-art methods.

With the introduction of Industry 4.0 technologies, one underrepresented topic is the role of humans in manufacturing and how technology can enhance the integration between human and machine. Many manufacturing systems are people-oriented, meaning human operators interact with intelligent devices around them. In such environments, people bear the responsibility for actions and decisions. In this case, automation aims to supply devices able to collect and aggregate data, so as to provide them in a user-friendly way to the person in charge of making the right decision based on the available data. All design processes require human mediation, and this generally applies to industrial automation and machine learning applications as well. This awareness gave birth to the human-in-the-loop approach, which puts human knowledge and experience at the pivot of machine learning processes [47,48]. In the SS field, the knowledge of the industrial processes provided by technicians represents a vital resource for the design process. This was shown in [49], where technician experience was crucial in optimizing the number of inputs to build the best performing SS of a unit distillation process.

As the design steps are strictly correlated to each other, each step is first explained in Section 2. The input variable selection problem is addressed in Section 3. The two main classes of approaches, Feature Extraction (FE) and Feature Selection (FS), are discussed in Section 4 and Section 5, respectively. In the final Section 6, conclusions are drawn and a table summarizing the techniques cited in this work is given. It sorts the methods by class along with their advantages and disadvantages, with the intent of providing the designer some guidance toward the one that best accommodates a specific case. In the same section, references are arranged in two tables: one for the methods and one for real-case applications.

## 2. SS Design Stages

SS design stages follow the typical steps of pattern recognition [50] as well as system identification theory [51]. In an industrial environment, ad hoc experiments to collect suitable identification datasets are, in general, not possible as in standard system identification practice, so data have to be retrieved from industrial historical databases.

SS design steps can be summarized as follows [1]:

- Data collection and preprocessing
- Selection of the input variables
- Choice of the model structure
- Model identification
- Model validation

Each of these steps is crucial for the success of the next one.

Data are stored by industries in historical databases, generally fed by a supervisory control and data acquisition (SCADA) system [52] or a distributed control system (DCS) [53]. Collected data must be capable of representing the whole dynamics of the system, since the model cannot provide more information than that stored in the data itself. After data are collected, the first stage of the design concerns their filtering and preprocessing [1,2]. This is because unprocessed data from databases come with well-known problems, such as oversampling, outliers and missing data, and accuracy problems, such as offsets, seasonal effects and high-frequency noise. The designer should therefore carefully deal with them and prepare the data for the next design steps [54,55,56]. Common preprocessing stages consist of resampling, noise filtering, outlier detection and removal, and normalization.

Data collected in plant databases usually come with different sample rates. Easy-to-measure variables are measured automatically with online sensors, while hard-to-measure ones cannot be measured automatically, but more commonly only sporadically, at high cost and with long delays, as in the case of laboratory analysis [1,57]. For this reason, the former usually present high sample rates, even higher than required by the sampling theorem, while the latter tend to be downsampled [58,59]. A high sample rate can lead to huge datasets that can suffer from data collinearity. Resampling therefore becomes necessary to synchronize the variables back together [60] and to avoid dealing with huge datasets.
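As a minimal sketch of this synchronization step, the fast-sampled input can be brought onto the sparse time grid of the lab-analyzed output by interpolation. The signals, time stamps and sample rates below are hypothetical, chosen only for illustration.

```python
import numpy as np

# Hypothetical example: an easy-to-measure input sampled every 1 s
# and a lab-analyzed output available only every 60 s.
t_fast = np.arange(0, 600, 1.0)    # input time stamps [s]
x_fast = np.sin(0.01 * t_fast)     # fast-sampled input signal
t_slow = np.arange(0, 600, 60.0)   # lab-analysis time stamps [s]

# Resample the input onto the output's time grid by linear interpolation,
# so both variables share the same sampling instants.
x_sync = np.interp(t_slow, t_fast, x_fast)

print(x_sync.shape)  # one input sample per lab measurement
```

In practice, more careful schemes (anti-aliasing filtering before decimation, hold interpolation) may be preferred depending on the signal dynamics.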

Missing data and outliers are quite common problems in databases collected from industries. The former occur when values are missing in the observations of a variable; the latter are data inconsistent with the majority of the recorded ones, greatly deviating from the typical range of values, such as peaks, saturations, flat trends and discontinuities. Both can be caused by sensor or process faults and measurement noise. They are usually handled by removing the samples that contain them or by filling the missing observations with some imputation method. Outlier detection is, however, a tough task, achieved through statistical techniques such as the 3$\sigma $-rule, and the final validation has to be performed manually by a plant expert to avoid outlier masking (a false negative) and outlier swamping (a false positive) [61].
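The 3$\sigma $-rule mentioned above can be sketched in a few lines; the data and the injected spike are synthetic, and in a real design the flagged samples would still require expert validation.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=1.0, size=1000)
x[100] = 25.0   # inject an artificial outlier (e.g., a sensor spike)

# 3-sigma rule: flag samples deviating more than 3 standard deviations
# from the sample mean. Note that strong outliers inflate the mean and
# the standard deviation themselves, which is one cause of masking.
mu, sigma = x.mean(), x.std()
outliers = np.abs(x - mu) > 3 * sigma

print(outliers.sum(), np.where(outliers)[0])
```

Robust variants (e.g., replacing mean and standard deviation with median-based statistics) mitigate the masking effect when several outliers are present.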

High frequency noise is generally induced by measurement instruments and can be filtered out with low-pass filters. The appropriate bandwidth is chosen by using spectral analysis.
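As an illustrative sketch of low-pass noise filtering, a moving-average FIR filter is applied below to a synthetic noisy signal; the window length stands in for the cut-off bandwidth that would be chosen via spectral analysis. Signal, noise level and window are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0, 1, 500)
clean = np.sin(2 * np.pi * 2 * t)                  # slow process dynamics
noisy = clean + 0.3 * rng.standard_normal(t.size)  # added measurement noise

# A moving average is a basic low-pass FIR filter; the window length
# (here 20 samples) plays the role of the chosen cut-off bandwidth.
window = 20
kernel = np.ones(window) / window
filtered = np.convolve(noisy, kernel, mode="same")

# The filtered signal should track the clean one better than the noisy signal.
print(np.mean((filtered - clean) ** 2) < np.mean((noisy - clean) ** 2))
```

In practice a filter with a sharper frequency response (e.g., a Butterworth design) would typically be used instead of a plain moving average.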

Once data have been preprocessed, the selection of the input variables is a critical step in the SS design process. The importance of this stage is addressed in the next section, along with the main techniques adopted.

The next step consists of the choice of the model structure, which is based on the a priori knowledge of the system. Mechanistic (or white-box) models are obtained on the basis of first-principles analysis; such an approach requires deep physical knowledge of the process. However, due to the complex processes occurring in industrial plants and the great amount of collected data, gray- or black-box data-driven identification approaches can be a good choice, since satisfactory results can be achieved with reasonable computational and time efforts. Such data-driven models are based only on empirical observations of the process and generally require little knowledge of the system, although a great deal of data processing is still needed. Since it is hard to find a general solution equally satisfactory for every case, plant experts’ knowledge must still be taken into account. It can concern the importance of the input variables, the system order, the operating range, time delays, the degree of nonlinearity or the sampling times. Linear models are usually preferable as they are computationally easier, since the numerical procedures and the design of a controller are simpler. Such an approximation is possible when certain conditions are met, like small variations around the nominal working point, a small degree of nonlinearity of the process, or a coarse required degree of approximation for the model. If a linear model does not show good results, a nonlinear identification approach is needed. In the industrial field, parametric structures such as FIR, ARX or ARMAX are widely used in both the linear and nonlinear (NFIR, NARX, NARMAX, respectively) cases [1,62,63], as well as static models.
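As a minimal sketch of one of the parametric structures above, a first-order ARX model can be identified by ordinary least squares on delayed inputs and outputs. The system, its parameters and the noise level are synthetic, used only to illustrate the regression setup.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulate a first-order ARX system: y[k] = a*y[k-1] + b*u[k-1] + noise
a_true, b_true = 0.8, 0.5
N = 500
u = rng.standard_normal(N)
y = np.zeros(N)
for k in range(1, N):
    y[k] = a_true * y[k - 1] + b_true * u[k - 1] + 0.01 * rng.standard_normal()

# Build the regressor matrix from delayed outputs and inputs and solve
# the least-squares problem for the ARX parameters.
Phi = np.column_stack([y[:-1], u[:-1]])
theta, *_ = np.linalg.lstsq(Phi, y[1:], rcond=None)

print(theta)  # estimates of (a, b), close to (0.8, 0.5)
```

Higher-order ARX models only add more delayed columns to the regressor matrix; nonlinear variants (NARX) replace the linear regression with a nonlinear approximator.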

Industrial SSs are generally designed with Artificial Neural Networks (ANN), mainly with the Multi-Layer Perceptron (MLP) structure [7,11,63,64], Convolutional Neural Networks (CNN) [65,66], Generative Adversarial Networks (GAN) [67,68], Radial Basis Networks [69], Wavelet Networks [70], Hinging Hyperplanes [71], Deep Belief Networks [72,73], Stacked Autoencoders [74], Long Short-Term Memory Networks [75], Support Vector Regression [76], Gaussian Process Regression [77], Extreme Learning Machines [78], Fuzzy Systems and Neuro-Fuzzy Systems [79,80], just to mention a few [5,81,82,83,84,85,86,87,88]. In some cases, the designer can choose to create several linear or nonlinear models for the same system, each one for a different working point, instead of a single one covering all of the system dynamics. In such cases, the models are aggregated by a suitable algorithm such as fuzzy logic or neural stacking. Such approaches are called ensemble methods [88].

In some cases, the designer may deal with systems showing finite time delays between the input variables and the process output, sometimes caused by the measurement process. Several approaches are proposed in the literature. In [57], a FIR model along with an Expectation–Maximization (EM)-based algorithm is used to estimate the model parameters and the time delays. In [79], a Takagi–Sugeno fuzzy model is tuned using a genetic-algorithm-based approach to identify delays. Genetic Algorithms have also been used to estimate time delays as in [89,90], where they are used to minimize the Joint Conditional Entropy between the input and output variables.

In [91], the problem of selecting input time lags is treated as a variable selection problem with a multidimensional mutual information estimator. Mutual Information is also applied to delay selection in the design of an SS in [92,93], where the delay is estimated by using the Normalized Mutual Information and delayed replicas of the inputs. In [10], delays are estimated through the learning phase of a Deep Belief Network.

Once the structure has been chosen, the whole preprocessed and selected data set should be partitioned into subsets for the last two design steps:

- Identification data
- Validation data

The first allows the designer to identify the candidate models and empirically estimate their unknown parameters.

Finally, the validation step exploits validation data to verify whether the model is able to adequately represent the system and generalize to new samples. In SS design, as in pattern recognition and system identification, it is important to perform the validation on data different from those used for model identification. This is done in particular to investigate overfitting phenomena. Validation techniques analyze the characteristics of the model residuals by looking for any undesired correlation between them and present and/or delayed samples of the model inputs and outputs. This can be performed immediately through graphical techniques, such as visual comparison of the time-plotted output of the system and the one estimated by the model, lag plots, correlation graphs or histograms. Other performance metrics usually adopted are the Mean Squared Error (MSE), the Normalized Root Mean Square Error (NRMSE), the Mean Absolute Error (MAE), Akaike’s Final Prediction Error (FPE), Akaike’s Information Criterion (AIC) [94,95], Rissanen’s Minimum Description Length (MDL) [96], the Bayesian Information Criterion [97] and ${C}_{P}$ statistics [98]. When dealing with small datasets, cross-validation techniques such as K-fold cross-validation or leave-one-out cross-validation (LOOCV) are employed. They consist of splitting the given data samples into K groups (or folds). At each iteration, one of the K groups is used for validation, while the other K-1 groups are used for training. This is done for all the K groups, and the final performance is given as the average of the performance measured at each iteration. LOOCV is the special case in which K equals the number of samples.
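The K-fold procedure just described can be sketched as follows; the linear least-squares model and the synthetic data are hypothetical stand-ins for whatever model class the designer has chosen.

```python
import numpy as np

def k_fold_mse(X, y, K, fit, predict):
    """Average validation MSE over K folds: each fold is held out once
    for validation while the remaining K-1 folds are used for training."""
    idx = np.arange(len(y))
    folds = np.array_split(idx, K)
    errors = []
    for i in range(K):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(K) if j != i])
        model = fit(X[train], y[train])
        errors.append(np.mean((predict(model, X[val]) - y[val]) ** 2))
    return float(np.mean(errors))

# Hypothetical usage with a linear least-squares model on synthetic data.
rng = np.random.default_rng(3)
X = rng.standard_normal((100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(100)

fit = lambda Xt, yt: np.linalg.lstsq(Xt, yt, rcond=None)[0]
predict = lambda w, Xv: Xv @ w

print(k_fold_mse(X, y, K=5, fit=fit, predict=predict))
```

Setting K equal to the number of samples turns this same routine into LOOCV; for time-series data, folds should respect the temporal ordering.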

A scheme of the design process of an SS is given in Figure 3.

## 3. The Input Selection Problem in SS Design

The model design takes for granted that one or more of the candidate inputs are able to describe the output of the system chosen by the designer. If that is not the case, the model development is an impossible task and the available data should be reconsidered: ad hoc experiments should be performed to cover the dynamics of the system, or a different variable can be chosen as the output. Generally, given the initial set of candidate inputs, it is common to have irrelevant ones or correlations between some of the input variables, making them redundant. Irrelevant inputs are those that have little or no predictive power with respect to the output, so they can be discarded without losing information. The concept of redundancy is instead associated with the level of dependency among two or more variables.

The optimal subset of input variables is then unknown. What the designer wants is to discard redundant inputs, reducing in this way the degree of redundancy, and to remove non-informative variables, with the aim of detecting the relevant, highly informative ones to build an optimal set.

The number of inputs is reduced because the dimensionality and the representability of the input space are among the factors that may limit the successful design of an SS. In the case of industrial processes, candidate inputs can be very numerous [91,99]. Moreover, if a non-steady-state type of model (such as the ones mentioned) is preferred in the model structure choice, the number of candidate variables is multiplied by the model order, making the number of variables even larger, especially for systems with strong persistence.

When this occurs, the large number of inputs dramatically increases the computational cost of the model identification step [100] and leads to a large number of model parameters to be estimated, generally causing poor generalization and a high probability of overfitting [101]. High-dimensional datasets that suffer from the so-called “large p, small n” problem (where p is the dimension of the input space and n is the number of samples) indeed tend to be affected by overfitting. A model suffering from overfitting mistakes small fluctuations for important variance, leading to errors on test data. This unavoidably worsens in the presence of noisy measurements. The reason behind this phenomenon is called the curse of dimensionality [102]: as the input dimensionality increases, the volume of the space increases so fast that the available data become sparse, meaning that the amount of input samples needed to support the result grows exponentially with the dimensionality [103].

On the other hand, dimensionality reduction shortens the model development time, improves the predictor performance, facilitates data visualization and data understanding. Also, a reduction of the number of variables implies a lower number of required hardware sensors, decreasing costs associated with them, as well as fewer missing data and outliers to deal with.

The objective is therefore to find the input subset of minimum cardinality that preserves the information contained in the whole initial set with respect to the output; or put in other words, the subset containing the fewest inputs required to properly describe the behavior of the output.

To deal with the problem, approaches can be classified in [104]:

- Feature Extraction (FE, unsupervised)
- Feature Selection (FS, supervised)

These two classes of methods are addressed in the next sections. A full taxonomy of the approaches is depicted in Figure 4.

## 4. Feature Extraction

FE is a class of unsupervised methods that create new features based on transformations or combinations of the original variable set. The most well-known FE algorithm is Principal Component Analysis (PCA) [105]. It uses an orthogonal transformation to express a set of p variables as d vectors called principal components, with d < p. The model identification is then performed on the components found. PCA finds the first principal component as the direction of largest variance in a latent space onto which the original input space is projected, using the covariance matrix and its eigenvalues and eigenvectors. Each successive component is the one with the highest variance that is orthogonal to the previous ones. However, the relationships between the variables are assumed to be linear, as is the one between the principal components and the output; the procedure will therefore fail at identifying any nonlinear relationship in the data. Moreover, the transformations of the input variables are done without taking the output variable into account, the method being unsupervised.

The first problem is overcome by some nonlinear versions of the algorithm, such as Nonlinear PCA (NLPCA) [106] and Kernel PCA (KPCA) [107]. The first one uses Autoassociative Neural Networks to perform the identity mapping: the network inputs are reproduced at the output layer, with an internal bottleneck layer and two additional hidden layers. The second one generalizes linear PCA to the nonlinear case using the kernel method: the original input vectors are mapped into a higher-dimensional feature space in which linear PCA is then calculated. In both cases, the transformations of the data can be highly complex and interpretation of the PCs is a harder task.

The unsupervised limitation led to the introduction of a supervised version of PCA, Supervised PCA (SPCA) [108], where PCA is applied to a subset of the inputs selected on the basis of their association with the output.

Other PCA variations are Independent Component Analysis (ICA) [109] and Probabilistic PCA (PPCA) [110]. ICA was originally developed to blindly separate multivariate signals, with the goal of recovering mutually independent but unknown source signals from their linear mixtures without knowing the mixing coefficients. It is used to linearly transform the original inputs into features that are mutually statistically independent. In PPCA, a Gaussian latent factor model is considered and the PCs are obtained as the solution of a maximum marginal likelihood problem, where the latent factors are marginalized out.

PCA, SPCA, ICA, KPCA and PPCA have been applied as ways of reducing the dimensionality of the data in many works [28,36,56,89,111,112,113,114,115]. A comparison between PCA, KPCA and ICA as dimensionality reduction methods is performed in [116], where KPCA showed the best performance among the three. In [117], an original feature selection method that combines ICA and false nearest neighbors (FNN) is proposed as ICAFNN.

Another nonlinear dimensionality reduction FE approach is multidimensional scaling (MDS) [118,119], along with its variations such as Principal Coordinates Analysis (PCoA) [120], metric MDS, non-metric MDS and generalized MDS. MDS is a set of related ordination techniques that display the information of a dataset through a distance matrix containing the distances between each pair of points of the dataset. The algorithm places each point into a space of a chosen dimension N such that the distances are preserved as well as possible. Other nonlinear FE methods are Isomap and its variations [121,122,123], Locally Linear Embedding (LLE) [124,125] and Laplacian Eigenmaps [126].

Isomap is a combination of the Floyd–Warshall (F-W) algorithm with classic MDS, where the pair-wise distances are assumed to be known only between neighboring points and the others are computed with the F-W algorithm. Then, classic MDS is performed to compute the reduced-dimensional positions of all the points. LLE has faster optimization and better results than Isomap. LLE also finds a set of nearest neighbors of each point, so as to describe it as a linear combination of them after computing a set of weights for each neighbor. It then finds the lower-dimensional embedding of the points such that each point is still described by the computed linear combination. Laplacian Eigenmaps adopt spectral techniques to perform dimensionality reduction, based on the assumption that the data lie on a low-dimensional manifold embedded in a higher-dimensional space.

Another unsupervised procedure to produce a low-dimensional representation of an input space is given by Self-Organizing Maps (SOM, or Kohonen maps) [127], a type of artificial neural network used to perform dimensionality reduction. The task is achieved through Vector Quantization (VQ), a classical quantization technique from signal processing that describes a larger set of n vectors by c codebook or prototype vectors. The candidate set of inputs is considered as the prototype vectors of the SOM. Similar candidate variables are identified by the formation of groups that have the closest proximity to the same prototype vector. The distance measures generally used to evaluate this proximity are linear correlation or covariance in the linear case, and Mutual Information or Entropy in the nonlinear case.

The drawback of FE techniques is that the variables in the new space do not have any physical meaning and become difficult to interpret. In addition, in the case of industrial processes, since the original inputs are still needed to obtain the projections, the number of hardware sensors required for the estimation is not reduced, thus losing one of the important advantages brought by the use of fewer inputs.

## 5. Feature Selection

FS refers to a class of supervised methods that select the best subset from the original feature set, retaining in this way their physical meaning [128]. The selection of the input variables takes the relationships between input and output variables into account, whether related to the accuracy of the corresponding model or not. Different strategies are used to search among the possible sets of candidate variables, and they can be classified into the following groups of methods [45,129]:

- Filters
- Wrappers
- Embedded (model-based)
- Hybrid approaches

Any of these procedures for input selection defines a criterion or cost function to quantify the quality of a subset, as well as a search strategy to determine the candidate subsets [50]. This is necessary since exhaustive search is not recommended, or even not feasible, in most cases, due to its extremely high computational cost given the number of inputs: in an exhaustive search, all the possible combinations of inputs are considered, and therefore, given n candidate input variables, ${2}^{n}$ possible subsets exist.

Search strategies thus provide an efficient method to search through the many possible combinations of inputs. They can be classified as local, starting the search from a point and then moving incrementally, or global, considering many combinations at once.

Forward selection and backward elimination are two linear incremental local strategies [130]. Forward selection methods start with an empty input subset; inputs from the candidate set are then included one at a time. The chosen input should be the one that contributes most to the output, according to the criterion the specific method uses. The approach is computationally efficient and results in relatively small input sets but, because of its nature, it may encounter a local optimum, terminating prematurely, or may ignore informative combinations of variables that are not very relevant individually [129]. An extension of this strategy is step-wise selection, in which past input variables may be removed at each iteration to better handle redundancies in the subset.

Backward elimination, as opposed to the previous strategy, starts by considering all the candidate inputs. Subsets with one less input are then built and examined to evaluate whether the deleted input is more or less significant. The procedure goes on until no more inputs can be deleted, according to the adopted criterion. This approach is generally more computationally demanding.

In their floating variants (Sequential Forward Floating Selection, SFFS, and Sequential Backward Floating Selection, SBFS, respectively) [131], there is an additional step to exclude variables once they have been included (or to re-include them once excluded), so that a larger number of subsets can be sampled.
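A minimal sketch of the forward selection strategy is given below, using the residual sum of squares of a linear least-squares model as the selection criterion; the criterion, the data and the number of retained inputs are hypothetical choices for illustration.

```python
import numpy as np

def forward_selection(X, y, max_inputs):
    """Greedy forward selection: at each step, add the input that most
    reduces the residual sum of squares of a linear least-squares fit."""
    n, p = X.shape
    selected, remaining = [], list(range(p))
    for _ in range(max_inputs):
        best_j, best_rss = None, np.inf
        for j in remaining:
            cols = selected + [j]
            w, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            rss = np.sum((y - X[:, cols] @ w) ** 2)
            if rss < best_rss:
                best_j, best_rss = j, rss
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

# Hypothetical example: only inputs 0 and 2 actually drive the output.
rng = np.random.default_rng(5)
X = rng.standard_normal((200, 6))
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + 0.1 * rng.standard_normal(200)

print(forward_selection(X, y, max_inputs=2))  # expected to pick inputs 0 and 2
```

Backward elimination and the floating variants follow the same skeleton, differing only in whether variables are removed, added, or both at each iteration.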

Heuristic search involves global strategies that explore random solutions in the search space and increase the focus on regions that lead to good solutions. The nature of the approach allows finding global or near-global optimal solutions. They are usually implemented with evolutionary algorithms such as Genetic Algorithms (GA) and Ant Colony Optimization (ACO) [132,133,134], or with Simulated Annealing (SA). The approach requires the tuning of search parameters that trade off the amount of the search space that is explored against the rate at which the final solution is reached.

The same taxonomy of methods is used for semi-supervised feature selection methods as well, as stated in [46]. Semi-supervised approaches for evaluating input relevance are exploited when both labeled and unlabeled data are available. This often happens since unlabeled data are more easily accessible than labeled ones, for which hard-to-measure variables must be measured and recorded, as well as easy-to-measure ones, in order to provide enough data to build the predictive model. This paper takes into account only methods for which the data available to the designer are labeled. The cited work, however, gives a comprehensive, detailed survey of input selection methods in a semi-supervised environment.

#### 5.1. Filter Methods

These are methods that select subsets of inputs as a preprocessing step, exploiting statistical measures to quantify the quality of a subset (multivariate methods) or providing a ranking of each single variable based on a relevance index, eventually rejecting those with a value that falls below an established threshold (univariate methods). The relevance measure is usually a bivariate statistical analysis that evaluates each candidate-output relationship, so filter methods are usually characterized by incremental search strategies. In filter methods that operate on each input variable individually, dependencies and interactions between the inputs are disregarded, which does not accommodate the multicollinearity problem [135]. This is the reason why they are often used as a first screening step, before more sophisticated methods are applied in hybrid approaches.

Filter methods do not require building any prediction model first, since they are independent of the chosen model structure: these approaches separate the input selection task from the model identification step. This makes them simple and fast, as they are the least computationally demanding ones. They allow for good empirical results even in cases in which the number of samples is smaller than the number of inputs.

The most common filter method consists of the analysis of the correlation coefficient (CC). The most common coefficient is Pearson’s correlation coefficient

$\rho $, which is a measure of the linear correlation between two variables, in this case the candidate input and the output [

136]. The linear correlation between each input and the output is computed and then a ranking list of the inputs is provided, according to the scores. Practical examples of this approach are given by [

1,

23,

137].
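As a minimal sketch of this univariate filter, the following function (the function name and data are our own, not taken from the cited works) ranks the candidate inputs by the absolute value of Pearson’s coefficient with the output:

```python
import numpy as np

def pearson_ranking(X, y):
    """Rank candidate inputs by |Pearson correlation| with the output.

    X: (n_samples, n_inputs) matrix of candidate inputs.
    y: (n_samples,) vector of the hard-to-measure output.
    Returns input indices sorted from most to least correlated, and rho.
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    rho = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc**2).sum(axis=0)) * np.sqrt((yc**2).sum())
    )
    return np.argsort(-np.abs(rho)), rho
```

A designer would then keep the top-ranked inputs, or discard those whose |ρ| falls below a chosen threshold.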

In [

138], different coefficients such as Distance Correlation (DC) [

139], Maximal Correlation (MC) [

140] and Maximal Information Coefficient (MIC) [

141] are combined with Pearson’s coefficient to introduce a more robust factor that can be used when the relationship between the variables is not necessarily linear. Since the dependencies between variables are neglected, if the candidate inputs are correlated with one another, such an approach would select too many variables, causing redundancy problems. To address this, partial correlation can be used instead: it measures the strength of the relationship between two variables while discounting the effect of one or more other, controlled-for variables.

In nonlinear settings,

$\rho $ is generally replaced by Mutual Information (MI) [

142], a measure of dependence based on information theory and Shannon’s notion of entropy that quantifies the information about one variable provided by a second variable. MI is adopted in nonlinear settings because it is based on the probability distributions within the data and makes no assumption on the structure of the dependence between the variables. It is also insensitive to noise and data transformations, making it a robust measure. In a univariate approach, it provides a ranking as in the linear case [

129,

143]. In multivariate approaches, when the number of candidate inputs is large, it is not feasible in practice to evaluate the MI between every possible subset and the output, so incremental greedy procedures are frequently used. These approaches can detect subsets of features that are jointly relevant or redundant. In such a context, the probability density functions are unknown in real-world problems, and MI has to be estimated. The most widely adopted methods are nearest-neighbors-based algorithms, which show good results [

77,

144,

145,

146] and are shown to outperform other common estimators such as the histogram one, the kernel estimator and the b-spline estimator [

147], as well as the CC approach [

91]. The basic histogram method is, however, preferred when dealing with a small number of variables because of its simplicity [

90]. When the number of variables to work with increases, multivariate MI methods become complex due to the estimation of the probability density function [

148].
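As an illustration of the simplest of these estimators, the following is a minimal histogram (plug-in) estimate of the MI between a candidate input and the output; the bin count and function name are our own choices, and, as noted above, nearest-neighbors estimators are usually more accurate:

```python
import numpy as np

def histogram_mi(x, y, bins=16):
    """Estimate mutual information I(x; y) in nats with a 2-D histogram.

    A simple plug-in estimator: discretize the joint distribution, then
    compute the KL divergence between the joint and the product of marginals.
    """
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y
    mask = pxy > 0                        # avoid log(0) on empty bins
    return float((pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])).sum())
```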

The multivariate problem is approximated with a univariate approach in [

149], where the Mutual Information Feature Selector (MIFS) is introduced: a heuristic criterion is adopted to find the subset that maximizes MI. MIFS’s performance can, however, be degraded by large errors in estimating the mutual information. Another common drawback is the selection of redundant variables when an input is closely related to an already selected one. For this reason, a new greedy selection method was introduced as MIFS-U (MIFS-Under Uniform Information Distribution) [

150]. It is shown that the two algorithms are equivalent to the maximization of multivariate MI [

151]. However, both could select irrelevant variables earlier than relevant ones if the cardinality of the input subset becomes large. This is partly solved by mRMR (minimum redundancy-maximum relevance) [

152]. The criterion of maximum-relevance ensures that the selected inputs are highly informative by evaluating their high degree of correlation with the output. The criterion of minimum-redundancy looks for inputs that are maximally dissimilar from one another, in order to build the most useful set of relevant variables. In [

153], a novel mutual information feature selection method based on the normalization of the maximum relevance and minimum common redundancy (N-MRMCR-MI) is proposed, where the normalization method is applied to the Max-Relevance and Min-Common-Redundancy (MRMCR) criterion and returns a correlation measure that takes values between 0 and 1. NMIFS is another algorithm that proposes the average normalized MI as a measure of redundancy among inputs [

154]. In [

155], a variable selection method based on Dynamic Mutual Information is proposed and called DMIFS. In [

156], a selection based on Partial Mutual Information (PMI) is introduced and successfully applied in other works as well [

157,

158]. When datasets become extremely large, however, the greedy optimization tends to be infeasible. This can be overcome by the use of parallel computing to speed the procedure up. In [

159], the greedy optimization procedure is revisited to propose a semi-parallel optimization paradigm that performs like the other state-of-the-art algorithms, but in a fraction of the time. The algorithm is tested even on a dataset of more than a million candidate inputs. Another method proposed after MIFS is Information Theoretic Subset Selection (ITSS) [

160], described as a multivariate MI approach that, unlike the MIFS algorithm, indicates when the growth of the subset should be stopped. The method exploits a parameter based on MI called the Asymmetric Dependency Coefficient (ADC) to estimate the knowledge of the output carried by the selected subset. When the ADC reaches its maximum value of 1, full knowledge of the output is reached. A review of variable selection methods based on MI is given in [

44].
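The greedy mRMR criterion discussed above can be sketched as follows, assuming the relevance and redundancy scores (e.g. pairwise MI estimates) have already been computed; this is an illustrative reading of the criterion, not the reference implementation of [152]:

```python
import numpy as np

def mrmr(relevance, redundancy, k):
    """Greedy minimum-redundancy-maximum-relevance selection.

    relevance:  (p,) relevance score of each input w.r.t. the output
                (e.g. mutual information I(x_i; y)).
    redundancy: (p, p) pairwise redundancy between inputs (e.g. I(x_i; x_j)).
    Returns the indices of the k selected inputs, in selection order.
    """
    p = len(relevance)
    selected = [int(np.argmax(relevance))]   # start with the most relevant
    while len(selected) < k:
        remaining = [i for i in range(p) if i not in selected]
        # maximize relevance minus average redundancy w.r.t. selected inputs
        scores = [relevance[i] - np.mean(redundancy[i, selected])
                  for i in remaining]
        selected.append(remaining[int(np.argmax(scores))])
    return selected
```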

Lipschitz’s quotients can be used for input selection by computing the Euclidean distances in the input space and in the output at different time instants [

161]. This approach is based on the continuity property of the nonlinear function representing the input–output model and depends only on the input–output data collected through experiments. To evaluate each subset of variables (or the importance of the variable or variables excluded), the Lipschitz quotient computed for that subset is compared with the one computed for the whole candidate set. However, this approach requires the computation of the quotient for all the possible combinations of the input variables, resulting in a high computational demand [

162].
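A minimal sketch of the Lipschitz-quotient computation for one candidate subset follows; aggregating the largest quotients by geometric mean is one common form of the criterion, and the function name is our own:

```python
import numpy as np

def lipschitz_index(X, y, n_largest=10):
    """Lipschitz-quotient index for one candidate input subset.

    Computes q_ij = |y_i - y_j| / ||x_i - x_j|| over all sample pairs and
    returns the geometric mean of the n_largest quotients. A subset missing
    a relevant input yields a larger index, because y can change while the
    reduced input vector barely moves.
    """
    n = len(y)
    quotients = []
    for i in range(n):
        for j in range(i + 1, n):
            d = np.linalg.norm(X[i] - X[j])
            if d > 1e-12:                     # skip coincident input points
                quotients.append(abs(y[i] - y[j]) / d)
    top = np.sort(quotients)[-n_largest:]
    return float(np.exp(np.mean(np.log(np.maximum(top, 1e-12)))))
```

Comparing the index of a subset against that of the full candidate set then reveals whether an excluded variable was relevant.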

In [

163], several linear filter variable selection methods are compared to nonlinear ones using two large databases, a synthetic one and a real-world one. The results showed nonlinear methods to be generally preferable and more robust.

#### 5.2. Wrapper Methods

This class of methods performs input variable selection by evaluating the performance of the final model via cross-validation, where each model corresponds to a unique combination of inputs [

45]. The assessment is done by using the same criteria that are used to evaluate the predictive performance in the model validation design step, for example, the MSE [

164,

165,

166,

167,

168]. When the MSE is used as the optimality criterion, the drawback is that the best-scoring model may not be the optimal one, since models with a large number of inputs tend to suffer from overfitting. Therefore, other criteria like Akaike’s Information Criterion (AIC) [

94,

165,

169,

170], or the

${C}_{P}$ (Mallows’ Coefficient) statistics [

1] are adopted since these measures penalize overfitting by determining the optimal number of input variables as a trade-off between the model size and the accuracy.

Compared with filter methods, these approaches are slower and more expensive in computation and time, since a new model is created every time a new subset is picked. Since the evaluation is done on the final model, however, they generally give better results.

As explained in

Section 5, the ideal approach would be evaluating all possible subsets, but as it is infeasible, the use of a search strategy is needed. On the basis of the adopted search algorithm shown in the same section, wrapper methods can be classified as deterministic or randomized [

171,

172]. Deterministic wrappers use Sequential Feature Selection greedy algorithms like Forward Selection, Backward Elimination and their variants. They generally present a lower overfitting risk [

173,

174,

175]. Randomized wrapper methods adopt heuristic search and exploit a randomized criterion in selecting the subset. As already stated, they have more parameters to tune [

176,

177].
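As an illustration of a deterministic wrapper, the sketch below performs Forward Selection on a linear model, using AIC as the optimality criterion so that model size is penalized; the helper names are our own, and a real SS design would wrap the actual model class and cross-validation scheme instead of plain OLS:

```python
import numpy as np

def aic_linear(X, y):
    """AIC of an OLS model (with intercept) fit on the inputs X."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = float(((y - A @ beta) ** 2).sum())
    k = A.shape[1]                       # number of estimated parameters
    return n * np.log(rss / n) + 2 * k   # penalizes model size

def forward_selection(X, y, max_inputs=None):
    """Deterministic wrapper: greedily add the input that most improves AIC."""
    p = X.shape[1]
    max_inputs = max_inputs or p
    selected, best_aic = [], np.inf
    while len(selected) < max_inputs:
        candidates = [i for i in range(p) if i not in selected]
        trials = [(aic_linear(X[:, selected + [i]], y), i) for i in candidates]
        aic, best = min(trials)
        if aic >= best_aic:              # stop when AIC no longer improves
            break
        best_aic, selected = aic, selected + [best]
    return selected
```

Backward Elimination is the mirror image: start from the full set and greedily remove the input whose deletion most improves the criterion.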

#### 5.3. Embedded Methods

In this case, variable selection depends on the structure and type of the model used: a specific characteristic of the model or of its learning process defines the selection criterion. Compared with filter methods, these methods are slower and give poor generalization performance (overfitting) when not enough data are available; conversely, when enough data are available, they generally outperform filter methods [

5,

42].

Recursive feature elimination (RFE) [

129,

178,

179] is a backward-elimination embedded input selection strategy. RFE consists of an iterative process of training a model, initially using all the candidate inputs. At each iteration, RFE seeks to improve generalization performance by removing the least important variable, i.e., the one whose deletion has the least effect on the training error. This method works well for problems with small training samples and high input dimensionality, but it tends to remove redundant and weak variables, keeping only independent ones. As already stated in this paper, weak input variables that are useless by themselves can noticeably improve performance when combined, so simply removing them can degrade the classification performance.

For this reason, variations of the algorithm have been proposed such as Enhanced-RFE (EnRFE) [

180] or RFE-by-sensitivity-testing (RFEST) [

181]. Unlike the original RFE, which considers only the current feature set at each iteration, EnRFE retains redundant or weak features that become useful when combined with other features; it is shown to perform better than the original version. In RFEST, RFE is used together with sensitivity analysis to rank inputs and overcome the same limitations.
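The basic RFE loop (not the EnRFE or RFEST variants) can be sketched for a linear model as follows; ranking by standardized coefficient magnitude is one common importance measure, chosen here for brevity:

```python
import numpy as np

def rfe(X, y, n_keep):
    """Recursive feature elimination on a linear model (a minimal sketch).

    Repeatedly fits OLS on standardized inputs and drops the input with
    the smallest absolute coefficient until n_keep inputs remain.
    """
    remaining = list(range(X.shape[1]))
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize for fair ranking
    while len(remaining) > n_keep:
        beta, *_ = np.linalg.lstsq(Xs[:, remaining], y - y.mean(), rcond=None)
        remaining.pop(int(np.argmin(np.abs(beta))))   # drop least important
    return remaining
```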

Sensitivity analysis [

182] is an input selection method in which the model is first trained with all the candidate inputs, then one input is analyzed by measuring the variation of the output when it is perturbed [

183,

184,

185,

186]. Inputs deemed irrelevant by the sensitivity analysis are then removed.
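A minimal perturbation-based sensitivity analysis might look as follows; the perturbation size and the mean-absolute-variation score are illustrative choices, not prescribed by the cited works:

```python
import numpy as np

def sensitivity_scores(model, X, delta=0.1):
    """Perturb each input in turn and measure the output variation.

    model: any callable mapping an (n, p) input matrix to (n,) predictions
    (the model trained on all candidate inputs). Inputs whose perturbation
    barely moves the output are candidates for removal.
    """
    base = model(X)
    scores = []
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] += delta * X[:, j].std()   # shift one input by 10% of its std
        scores.append(float(np.mean(np.abs(model(Xp) - base))))
    return np.array(scores)
```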

Evolutionary ANNs (EANNs) [

187] are population-based algorithms for neural network models that simulate the natural evolution of biological systems to optimize the NN and to determine the optimal set of input weights. When the optimization procedure sets an input connection weight close or equal to zero, then that input variable is excluded, making the input selection embedded within the EANN approach.

Least Absolute Shrinkage and Selection Operator (LASSO) [

188,

189] is a regularization method that provides input selection in an embedded way. Regularization reduces the variance of a linear regression model at the cost of introducing some bias; this is done by adding a penalty term to the model error function. Ridge Regression (RR) [

190] penalizes the sum of squared coefficients, the so-called

${L}_{2}$ penalty. When the penalized function is forced to be less than a fixed value, the penalty term shrinks the model coefficients, leading to a lower variance and a lower error value. This decreases the complexity of the model but does not reduce the number of variables; it merely shrinks their effect. LASSO instead penalizes the sum of the absolute values of the parameters, the so-called

${L}_{1}$ penalty. This makes some of the parameters shrink exactly to zero, which is never the case in ridge regression, eliminating some variables entirely and thus performing variable selection by yielding a subset of predictors that helps mitigate multicollinearity and model complexity. Elastic Net (EN) [

191] linearly combines the

${L}_{1}$ and

${L}_{2}$ penalties from LASSO and RR and can be optimized to effectively perform coefficient shrinkage as well as setting some coefficients to 0 for sparse variable selection. LASSO regularization for input selection is also extended to the nonlinear case under the name LASSO-MLP [

162,

192]. In this case, the

${L}_{1}$ penalty term is added to the error function of a single-layer MLP and then variable selection is performed as in linear LASSO.
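To make the mechanism concrete, the sketch below minimizes the L1-penalized least-squares objective with plain proximal gradient descent (ISTA); the soft-thresholding step is what drives some coefficients exactly to zero, which is how LASSO discards inputs. The solver choice and parameter values are our own, not from the cited works:

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=5000):
    """Minimize (1/2n)||y - X b||^2 + lam * ||b||_1 via ISTA.

    Coefficients driven exactly to zero mark the discarded inputs, which
    is how LASSO performs embedded variable selection.
    """
    n, p = X.shape
    b = np.zeros(p)
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 / n)   # 1 / Lipschitz constant
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y) / n               # gradient of smooth part
        b = b - step * grad
        b = np.sign(b) * np.maximum(np.abs(b) - step * lam, 0.0)  # soft-threshold
    return b
```

Increasing `lam` zeroes out more coefficients, i.e., selects fewer inputs.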

#### 5.4. Hybrid Methods

Merging different methods often brings better results at a lower computational demand. Different combinations of input selection methods and classes can be performed to further reduce the number of inputs. Filter methods can be used for pre-screening, as in the pre-filtering method of [

1], where correlation coefficients and scatter plots are used as a preselection and then partial correlation and Mallows’

${C}_{p}$ statistics are used for input selection. In [

193], a combination of wrapper and embedded methods is proposed and called SBS-MLP. It presents a low computational cost and tends to match or outperform the other state-of-the-art methods it was compared with.

In [

194], a selection method is proposed in which Nearest Correlation Spectral Clustering Variable Selection (NCSCVS), a method that clusters inputs into groups based on the correlation between variables by nearest correlation spectral clustering, is used as a filter step and then integrated with group LASSO. This method is called Nearest Correlation Spectral Clustering Group LASSO (NCSC-GL). In [

195], the NC-based method is used to search for inputs correlated with the output, and then the correlation similarity between the inputs and the output is used to weight the respective input in the model. The method is called Nearest Correlation-Based Input Variable Weighting (NCVW).

In [

29], a self-organizing map (SOM) is used to reduce the dimensionality of the input space and obtain independent inputs. Then, to determine which inputs have a significant relationship with the output, a hybrid approach exploiting GA with a General Regression Neural Network (GRNN) is proposed and called GAGRNN.

In [

83], variable selection is performed by first ranking the candidate inputs exploiting correlation coefficient analysis, then the optimal subset is chosen with a wrapper approach by evaluating the prediction performance of different models.
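A hybrid scheme of this kind (correlation pre-filter followed by a wrapper) can be sketched as below; the linear model, the holdout split and all names are illustrative assumptions, not the specific procedure of [83]:

```python
import numpy as np

def hybrid_select(X, y, n_prefilter, n_final):
    """Hybrid input selection: correlation-coefficient filter, then wrapper.

    The filter step keeps the n_prefilter top-ranked inputs; the wrapper
    step greedily picks n_final of them by validation MSE of a linear model.
    """
    # --- filter step: rank by |Pearson correlation| with the output
    Xc, yc = X - X.mean(0), y - y.mean()
    rho = (Xc * yc[:, None]).sum(0) / np.sqrt((Xc**2).sum(0) * (yc**2).sum())
    shortlist = list(np.argsort(-np.abs(rho))[:n_prefilter])

    # --- wrapper step: greedy forward selection on a holdout split
    half = len(y) // 2
    def val_mse(cols):
        A = np.column_stack([np.ones(half), X[:half][:, cols]])
        beta, *_ = np.linalg.lstsq(A, y[:half], rcond=None)
        B = np.column_stack([np.ones(len(y) - half), X[half:][:, cols]])
        return float(((y[half:] - B @ beta) ** 2).mean())

    selected = []
    while len(selected) < n_final:
        cand = [c for c in shortlist if c not in selected]
        selected.append(min(cand, key=lambda c: val_mse(selected + [c])))
    return selected
```

The pre-filter keeps the wrapper's search space small, which is exactly the computational benefit hybrid approaches aim for.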

The gradient-based leave-one-out gene selection (GLGS) algorithm [

196] combines a variant of the Leave-One-Out Cross-Validation (LOOCV) with the Gradient Descent Optimization algorithm to PCA, to perform input dimensionality reduction.

In [

197], an ensemble input set that maintains informative inputs from the original set is formed as a combination of the output feature set of a population of LASSO models. The regularizing factors of these selectors are estimated via cross-validation procedures.

In [

49], filter methods such as CC analysis, ITSS and Lipschitz quotient analysis are combined with LASSO and plant experts’ knowledge, halving the original input set of an SS of a refinery process and outperforming the model trained with all the candidate inputs.

Other cases of hybrid approaches are reported in [

5].

## 6. Summary and Conclusions

Given the number of classes of input selection strategies, some key factors must be taken into account when creating the model. First of all, the designer needs to understand whether the chosen algorithm is able to detect nonlinear relationships, which are a common trait of industrial processes. The number of available samples relative to the number of inputs to be chosen also hints at whether a filter, wrapper or embedded approach is preferable to avoid overfitting or poor generalization. The expected computational demand is another decisive factor the designer must consider. Taking these considerations into account,

Table 1 presents a classification of the methods cited so far, divided by class and showing the benefits and drawbacks of each, with the aim of guiding the designer toward the most suitable choice for the case at hand.

Wrappers and embedded algorithms are typically preferred when the number of candidate inputs is relatively smaller than the number of samples. Under this circumstance they both tend to give the best results, even if they are time-consuming and computationally demanding; otherwise, the final model will suffer from overfitting, since both methods are model-dependent. As opposed to such model-dependent methods, filter approaches offer a faster and model-independent alternative. They estimate the importance of each input variable, thereby avoiding the risk of overfitting, and the resulting pruning of input variables reduces the computational burden of the model identification and validation steps. In some cases, however, such a ranking can be too inaccurate, and an importance-wise greedy selection of the candidate inputs tends to ignore redundancies. For this reason, filter methods work best as the first step of hybrid approaches. They represent the best choice if the number of candidate inputs is relatively greater than the number of samples.

Moreover, the references are summarized in the next two tables: in

Table 2, references explaining theory and procedures are classified for each input selection method;

Table 3 collects references with real case studies application of each method.

Soft sensors involve several fields of study and research, from machine learning to mathematics and statistics. The choice of input variables has a profound impact on model development and on the final performance. In the case of empirical data-driven models, the difficulty of the task can be eased by a priori knowledge from plant experts, if available. Most of the time, however, this is not possible, and a black-box investigation among variables is needed. Despite the variety of existing methods, none provides a general solution equally satisfactory for every case, since in real applications each approach tends to give different, inconsistent results. This means that the designer must perform different tests in an effort to find the most suitable subset for the application, making the input selection step time-consuming and computationally demanding. For this reason, variable selection remains an open problem and a topic that still needs to be researched.