Robust Data-Driven Soft Sensors for Online Monitoring of Volatile Fatty Acids in Anaerobic Digestion Processes

Abstract: The concentration of volatile fatty acids (VFAs) is one of the most important measurements for evaluating the performance of anaerobic digestion (AD) processes. In real-time applications, VFAs can be measured by dedicated sensors, which are still expensive and very sensitive to harsh environmental conditions. Moreover, sensors usually have a measurement delay that is undesirable for real-time monitoring. Because of these problems, data-driven soft sensors are very attractive alternatives. This study proposes different data-driven methods for estimating reliable VFA values. We evaluated random forest (RF), artificial neural network (ANN), extreme learning machine (ELM), support vector machine (SVM) and genetic programming (GP) models based on synthetic data obtained from the International Water Association (IWA) Benchmark Simulation Model No. 2 (BSM2). The organic load to the AD unit in BSM2 was modified to simulate the behavior of an anaerobic co-digestion process. The prediction and generalization performances of the different models were also compared. This comparison showed that the GP soft sensor is more precise than the other soft sensors. In addition, model robustness was assessed to determine the performance of each model under different process states. It is also shown that, in addition to their robustness, GP soft sensors are easy to implement and provide useful insights into the process by producing explicit equations.


Introduction
Anaerobic digestion (AD) is a well-established process for stabilizing municipal sewage sludge and treating organic waste products and wastewaters from different industries, households and farms. In this process, organic matter biodegrades in an oxygen-free environment, leading to its decomposition and bioconversion into biogas, which mainly consists of methane (CH4, 50-60%) and carbon dioxide (CO2, 30-40%), together with trace gases such as hydrogen sulfide and water vapor [1]. The methane produced can be used as an energy source for generating electricity and heating the AD reactor. The benefits of the AD process are high organic load treatment capacity, low sludge production, energy recovery from the biogas produced and reduced operating costs due to the oxygen-free operation [2]. Many operational parameters affect the performance and effluent quality of the AD process; therefore, frequent monitoring of these parameters is crucial to ensure stable performance. Among these parameters, pH, partial alkalinity and volatile fatty acids (VFAs) are the most important measurements for monitoring the stability and performance of digesters with low buffering capacity. In one previous study, a neural network model was trained using the back-propagation algorithm to predict different process responses, including effluent COD, biogas production, VFA, alkalinity and effluent pH, with the OLR and influent pH considered as model inputs. Briefly, most past studies focused on predicting VFA for a laboratory-scale anaerobic digester using available input parameters, without considering the difficulty of measuring them. Furthermore, most of the developed models were trained on very limited operational conditions, so the generalization ability and performance of the models in other situations remain unclear [14,15].
In this paper, we study the ability of different data-driven modelling techniques, namely artificial neural networks (ANNs), the extreme learning machine (ELM), the support vector machine (SVM), the random forest (RF) and genetic programming (GP), to develop robust VFA-monitoring soft sensors from exclusively online, easy-to-measure variables. A wrapper feature ranking method combining different models was used to select the most influential process variables for the data-driven soft sensors, to increase accuracy and reduce computation time. The procedure was applied to the wastewater treatment Benchmark Simulation Model No. 2 (BSM2) running in the Matlab Simulink environment. The developed soft sensors were compared in terms of accuracy, robustness and transparency. Transparency is important because transparent models are very beneficial for controlling and gaining insight into the modelling of AD processes. Although the genetic algorithm, which is closely related to GP, has been applied to modelling AD processes, no soft sensor has yet been designed with the GP technique in this context [16,17]. Therefore, in this study, we discuss using the GP model for designing more transparent and interpretable soft sensors.

Benchmark Simulation Model No. 2
The proposed soft sensors were designed and evaluated with BSM2 [18]. The BSM2 simulation environment combines a plant layout, a simulation model, influent loads, simulation procedures and evaluation criteria to analyze and evaluate the performance of a wastewater treatment plant. The complete layout of BSM2, described in detail by Jeppsson et al. [18], consists of a primary clarifier, an activated sludge biological reactor, a secondary clarifier, a thickener, an anaerobic digester, a dewatering unit and a storage tank. The activated sludge process consists of five reactors: two anoxic reactors with a total volume of 3000 m3 and three aerobic reactors with a total volume of 9000 m3, which are used for pre-denitrification and nitrification, respectively. The plant has been designed for an average dry-weather influent flow rate of 20,648 m3 d−1 with an average biodegradable COD of 592 mg L−1 in the influent. The Activated Sludge Model No. 1 (ASM1) and the Anaerobic Digestion Model No. 1 (ADM1) are used to describe the biological phenomena that take place in the activated sludge and AD reactors, respectively. The influent characteristics consist of a 609-day dynamic influent data file (with a sampling frequency of one data point every 15 min) that includes rainfall and seasonal temperature variations over the year [18,19]. The first 245 days of influent data are used for plant stabilization under dynamic conditions and the last 364 days are used for the plant performance assessment.

Data Collection
The first step in designing a data-driven soft sensor is obtaining the process data. Therefore, we used synthetic data produced by BSM2. It should be noted that this paper describes a preliminary study for designing soft sensors for AD processes by applying different techniques to synthetic data; therefore, for applying the approaches discussed in this paper to real systems, real process data is necessary. Thirteen process variables obtained from the simulation are listed in Table 1. These variables are measured from the influent, effluent and gas line of AD every 15 min.

Pre-Processing of the Data
To achieve successful soft sensor development, the data set should be pre-processed to eliminate missing and redundant values, outliers and signal noise. In this paper, as the data set is obtained synthetically from a simulator, there is no need to perform any further steps to remove outliers or process missing values. However, the signal noise that is incorporated in BSM2 to obtain more comparable and realistic benchmark simulation results needs to be considered, and it should be handled appropriately so that the soft sensors estimate VFA values accurately under different conditions. In this study, before model construction and prediction, the weighted moving average (WMA) method was adopted to reduce the signal noise, as it is fast to compute and easy to use compared to other methods. In the WMA method, each data point in the sample window is multiplied by a different weight based on its position [20], as given in Equation (1):

Ft = (W1·At−1 + W2·At−2 + … + Wn·At−n) / (W1 + W2 + … + Wn) (1)

where Ft is the smoothed signal at time t, Wi is the weight given to the actual occurrence at time t − i, At−i is the actual occurrence at time t − i and n is the window length. In this work, a window length of 100 sampling times is chosen for all model constructions.
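As an illustrative sketch (not part of the original implementation), the WMA filter of Equation (1) can be written as follows; at the start of the series the window is simply truncated:

```python
def wma_smooth(signal, weights):
    """Weighted moving average: each point in the window is multiplied by a
    position-dependent weight (last weight applies to the most recent sample),
    and the weighted sum is normalized by the sum of the weights used."""
    n = len(weights)
    smoothed = []
    for t in range(len(signal)):
        window = signal[max(0, t - n + 1): t + 1]
        w = weights[-len(window):]  # truncate weights near the series start
        smoothed.append(sum(wi * ai for wi, ai in zip(w, window)) / sum(w))
    return smoothed
```

In the study itself a window of 100 samples (25 h of 15-min data) is used, which trades responsiveness for noise suppression.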
As the measured variables have different units, they need to be normalized before the different models are developed. The variable values were normalized according to their mean and standard deviation. To assess the model performances, the normalized root-mean-squared error (NRMSE) and the coefficient of determination (R2) were calculated according to Equations (2) and (3), respectively,

where y_act and y_prd are the actual and predicted values, respectively, i is the data record index, y_m is the average of the actual values, and n is the total number of records.
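A sketch of the two criteria follows; since the normalization used in the NRMSE is not spelled out here, the range of the actual values is assumed for illustration:

```python
import math

def nrmse(y_act, y_prd):
    """RMSE divided by the range of the actual values (the normalization
    is an assumption; other conventions divide by the mean or std)."""
    n = len(y_act)
    rmse = math.sqrt(sum((a - p) ** 2 for a, p in zip(y_act, y_prd)) / n)
    spread = max(y_act) - min(y_act)
    return rmse / spread if spread else rmse

def r_squared(y_act, y_prd):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_m = sum(y_act) / len(y_act)
    ss_res = sum((a - p) ** 2 for a, p in zip(y_act, y_prd))
    ss_tot = sum((a - y_m) ** 2 for a in y_act)
    return 1.0 - ss_res / ss_tot
```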
In addition, to examine the generalization ability of the models, the overall data set was split into a training set, used to fit the models, and a validation set, used to calculate the error. In this work, the simulated data from day 245 to 450 and from day 451 to 609 were used as the training and validation sets, respectively. It should be noted that, due to the high number of data points (58,465), the training set was subsampled and 1252 data points spread uniformly over the training period were finally selected to reduce the computation time during model training.
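The chronological split and the uniform subsampling above can be sketched as follows (the 96 samples-per-day constant follows from the 15-min sampling; the helper names are ours):

```python
SAMPLES_PER_DAY = 96  # 24 h at one sample every 15 min

def split_by_day(records, train_days=(245, 450), valid_days=(451, 609), start_day=245):
    """Chronological split: earlier days train the model, later days validate it."""
    def day_slice(first, last):
        i0 = (first - start_day) * SAMPLES_PER_DAY
        i1 = (last - start_day + 1) * SAMPLES_PER_DAY
        return records[i0:i1]
    return day_slice(*train_days), day_slice(*valid_days)

def uniform_subsample(records, k=1252):
    """Pick k records spread evenly over the set to cut training time."""
    step = len(records) / k
    return [records[int(i * step)] for i in range(k)]
```

Splitting by time rather than at random is what makes the validation error a genuine test of generalization to future operating conditions.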

Artificial Neural Network (ANN)
An ANN is a non-linear model that has at least three main layers, called the input, hidden and output layers. The layers are connected by components called neurons (Figure 1), in which three main operations are carried out. First, the n elements of the input vector (z1, z2, ..., zn) are multiplied by weights (w1,1, w1,2, ..., w1,n). Second, the weighted inputs are added together with a bias signal b to obtain a value [21]:

a = z1·w1,1 + z2·w1,2 + ... + zn·w1,n + b (4)

Finally, the output signal is a function of a, the weighted sum of the inputs. The purpose of the activation function is to map the input space to a different space in the output. Linear, sigmoid and hyperbolic tangent transfer functions are generally used in ANNs [21,22].

The structure of the neurons in each layer, and how they are grouped and connected, is called the topology of the network. There are many different topologies; however, the multi-layer feedforward perceptron (MLP) is the most commonly used. In the MLP topology, the numbers of network inputs and outputs equal the numbers of input and output variables of the system to be modelled [21]. There are no specific rules for finding the best topology of the network, so different methods, such as trial and error or evolutionary algorithms (e.g., the genetic algorithm), can be used for this purpose [23]. Once the topology is chosen, the network is trained. We used back-propagation with stochastic gradient descent (SGD) and an adaptive learning rate [21]. The most important ANN parameters to tune, namely the number of layers, the number of neurons in each layer, the type of transfer function and the regularization terms (L1 and L2), which prevent overfitting and improve generalization, were considered in a grid search. The number of epochs was set high (2000) because early stopping was used. The ANN models were trained in R [24] using the H2O package [25]. To determine the best ANN structure, the number of neurons in the hidden layer was varied from 5 to 160.
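The single-neuron computation of Equation (4) followed by a transfer function can be sketched as:

```python
import math

def neuron_output(z, w, b, activation=math.tanh):
    """One neuron: weighted sum of the inputs plus bias (Equation (4)),
    passed through a transfer function (hyperbolic tangent here)."""
    a = sum(zi * wi for zi, wi in zip(z, w)) + b
    return activation(a)
```

An MLP stacks layers of such neurons, with each layer's outputs forming the next layer's input vector.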


Extreme Learning Machine (ELM)
The ELM is a single-hidden-layer feedforward neural network (SLFN) that was first introduced by Huang et al. [26]. In a traditional ANN, all the parameters (number of neurons and hidden layers) have to be tuned, which leads to dependency between the parameters (weights and biases) of each layer. However, in ELM, the hidden layer does not need to be tuned [27]. In ELM, the input layer weights and biases are randomly initialized and then fixed without any tuning iteration, and the output layer weights are calculated analytically. As there is no iterative tuning procedure, ELM has a faster learning speed than a traditional ANN and better generalization performance. Figure 2 shows the structure of the proposed ELM model.

If xi = [xi1, xi2, ..., xin]T ∈ Rn is the input vector and yi = [yi1, yi2, ..., yim]T ∈ Rm the output vector, then the output of the SLFN with N hidden neurons can be calculated as:

yi = Σ (j = 1 to N) βj·f(aj·xi + bj), i = 1, ..., M (5)

where βj = [βj1, βj2, ..., βjm] are the weights of the output layer, which need to be estimated; f(.) is the transfer function; and aj and bj are the input weights and biases, respectively. Equation (5) can be written in matrix form [28]:

y = Hβ (6)

where y = (y1, y2, ..., yM)T, β = (β1, β2, ..., βN)T and H is the hidden-layer output matrix with entries Hij = f(aj·xi + bj). The β is determined via the Moore-Penrose generalized inverse H† of H:

β = H†y (7)

The values of the input weights (aj) and biases (bj) can be randomly initialized; however, the weights of the output layer (βj) need to be determined from experimental data. Usually, the number of neurons in the hidden layer is higher than in the input layer (N > n). For the grid search, the number of neurons in the hidden layer was varied from 20 to 130.
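A minimal sketch of this training scheme (sizes and the tanh transfer function are illustrative assumptions; NumPy's `pinv` provides the Moore-Penrose inverse):

```python
import numpy as np

def elm_train(X, y, n_hidden=50, seed=0):
    """ELM training: random, fixed input weights/biases; output weights
    solved analytically with the Moore-Penrose pseudo-inverse."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((X.shape[1], n_hidden))  # input weights a_j
    b = rng.standard_normal(n_hidden)                # biases b_j
    H = np.tanh(X @ A + b)                           # hidden-layer output matrix
    beta = np.linalg.pinv(H) @ y                     # beta = H† y
    return A, b, beta

def elm_predict(X, A, b, beta):
    return np.tanh(X @ A + b) @ beta
```

Because the only fitted quantity is a linear least-squares solve, training is orders of magnitude faster than iterative back-propagation.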


Random Forest (RF)
The RF model was first developed by Breiman [29]. In this technique, a model is built from a set of unpruned single regression trees, which amounts to combining different nonlinear relationships to form a more accurate non-linear model. The trees are generated based on bootstrap sampling from the original training data set. In bootstrapping, random subsets of data are chosen with replacement from the original training data set to ensure diversity among the ensemble of trees and enhance the prediction ability. The best splitting feature for each node is selected from a set of m features that are randomly chosen from the total M features (m < M). By choosing m random features for node splitting, the correlation between the trees is reduced, and thus the averaged response of the multiple regression trees is expected to have a lower variance than a single regression tree [22,29]. RF has three main tuning parameters: the number of trees in the forest (ntrees), the number of features randomly sampled as candidates at each node split (mtry), and the maximum number of nodes in the trees (maxnode). The parameters of the RF model for the grid search were set at 2 to 7 (with step size 1) for mtry, 1000 to 2000 (with step size 100) for ntrees and 5 to 30 (with step size 5) for maxnode.
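The two sources of randomness described above can be sketched separately (illustrative only; real RF implementations apply them inside the tree-induction loop):

```python
import random

def bootstrap_sample(data, rng=None):
    """Bootstrap: len(data) records drawn with replacement, so each tree
    in the forest sees a different view of the training set."""
    rng = rng or random.Random()
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]

def random_feature_subset(n_features, m, rng=None):
    """Pick m of the M available features as split candidates at a node (m < M)."""
    rng = rng or random.Random()
    return sorted(rng.sample(range(n_features), m))
```

Averaging many such decorrelated trees is what lowers the variance of the ensemble relative to a single tree.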

Support Vector Machine (SVM)
SVM is a machine learning method that was introduced by Cortes and Vapnik for regression and classification problems [30,31]. The objective of SVM is to map the input vectors X onto a very high-dimensional feature space via a kernel function and then to perform a linear regression in this space. The regression function can be obtained as follows [32]:

y(x) = Σi wi·K(x, xi) + b

where y(x) represents the predicted value, K(x, xi) is a kernel function of the input features, and wi and b are coefficients. The best-known kernel functions are the polynomial kernel, the radial basis function, the exponential radial basis function and the multilayer perceptron kernel [31,33]. In this work, the radial basis kernel function was used due to its high prediction ability. The commonly used radial basis kernel has the form:

K(x, xi) = exp(−||x − xi||2 / (2σ2))

where x and xi are the input vector and a support vector, respectively, and σ is the width of the Gaussian kernel function. More information on SVM can be found in references [31,32].
To find the most precise model, the tuning parameters of SVM, mainly the regularization parameter C and the kernel width σ of the radial basis kernel function, should be determined. For the grid search, σ was set to 0.001, 0.01, 0.02, 0.1, 0.2, 0.3, 0.4, 0.5 and 0.6, and C was varied from 200 to 800 (with step size 100).
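The Gaussian kernel and the resulting regression function can be sketched as follows (the coefficient values in the usage example are placeholders, not fitted values):

```python
import math

def rbf_kernel(x, x_i, sigma=0.1):
    """Gaussian radial basis kernel; sigma is the kernel width."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_i))
    return math.exp(-sq_dist / (2 * sigma ** 2))

def svm_predict(x, support_vectors, w, b, sigma=0.1):
    """SVM regression output: weighted kernel evaluations plus bias."""
    return sum(wi * rbf_kernel(x, xi, sigma)
               for wi, xi in zip(w, support_vectors)) + b
```

In practice the coefficients w and b come from the SVM optimization; only the kernel evaluation itself is shown here.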

Genetic Programming (GP)
GP is a powerful method for developing mathematical expressions. It was first introduced by Koza [34]. In this method, mathematical expressions are generated from input-output data by a biologically inspired algorithm, in which a population of candidate models continuously evolves towards the best-fitting model [35]. GP and the genetic algorithm (GA) share some similarities; the most important difference is the output format: the output of a GA is a value, whereas the output of GP is a computer program. The complicated structures of computer programs, mathematical expressions and process system models in GP are represented with trees [35,36].
The GP tree structure consists of different nodes, classified as internal or external based on their position. The internal nodes can be chosen from the operators {+, −, ×, /, sin, cos, log, abs}, mathematical functions, conditional statements or even user-defined operators. The external nodes contain the constants and the model variables. After the data are introduced into the algorithm, an initial population is randomly generated; this stage is crucial for ensuring diversity among the models. Next, each model is evaluated by a fitness function to determine how well it fits the observed data. The new generation is built from the models having a lower fitness error, and is reproduced using genetic operators such as reproduction, crossover and mutation [34,35]. In the crossover operator, two models exchange sub-trees to generate two new models, while in the mutation operator, a new model is formed from a previously generated model by substituting a randomly selected sub-tree with a newly generated one. The mutation operator is used to increase the genetic diversity of the population. The reproduction operator copies models to the next generation without any change. The fitness evaluation step is repeated for the newly generated population, and the whole procedure continues until an acceptable fitness value is achieved or the algorithm reaches its generation limit. Through this iterative procedure, the accuracy of the models improves with each iteration, and finally the best model is taken as the output of the GP algorithm [34,35]. Figure 3 shows a flow chart of the GP computation procedure.
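A heavily simplified, mutation-only sketch of this evolutionary loop follows; the operator set, population size and hypothetical target function are our assumptions and do not reflect the GP configuration used in this study:

```python
import random

# Expression trees as nested tuples: ('+', left, right); leaves are 'x' or constants.
OPS = {'+': lambda a, b: a + b, '-': lambda a, b: a - b, '*': lambda a, b: a * b}

def random_tree(rng, depth=3):
    if depth == 0 or rng.random() < 0.3:
        return 'x' if rng.random() < 0.7 else rng.uniform(-1, 1)
    op = rng.choice(list(OPS))
    return (op, random_tree(rng, depth - 1), random_tree(rng, depth - 1))

def evaluate(tree, x):
    if tree == 'x':
        return x
    if isinstance(tree, tuple):
        op, left, right = tree
        return OPS[op](evaluate(left, x), evaluate(right, x))
    return tree  # constant leaf

def fitness(tree, data):
    """Sum of squared errors over (x, y) pairs; lower is fitter."""
    return sum((evaluate(tree, x) - y) ** 2 for x, y in data)

def mutate(tree, rng, depth=2):
    """Replace a random sub-tree with a freshly generated one."""
    if rng.random() < 0.2 or not isinstance(tree, tuple):
        return random_tree(rng, depth)
    return (tree[0], mutate(tree[1], rng), mutate(tree[2], rng))

def evolve(data, generations=30, pop_size=60, seed=1):
    rng = random.Random(seed)
    pop = [random_tree(rng) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda t: fitness(t, data))
        survivors = pop[:pop_size // 4]  # selection of the fittest quarter
        pop = survivors + [mutate(rng.choice(survivors), rng)
                           for _ in range(pop_size - len(survivors))]
    return min(pop, key=lambda t: fitness(t, data))
```

A full GP also applies crossover (sub-tree exchange) and reproduction; the winning tree can be read off directly as an explicit equation, which is the transparency advantage highlighted in this paper.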

Feature Ranking
Using many features for model building can cause problems and directly affect the model accuracy. Hence, redundant and unnecessary features should be eliminated, as they may add noise and have no impact on the dependent variable. Feature ranking helps to determine the importance of the features and reduce the dimension of the data set [37]. Other benefits of feature ranking are shorter training times, easier model interpretation, reduced overfitting and lower data collection costs. A wrapper method yields different variable importance results depending on the model it is combined with; we used the fscaret package of the R environment to avoid this problem [24,38]. Briefly, variable ranking is performed in three steps: model training, variable ranking extraction and variable ranking scaling according to the generalization error. The final variable ranking is obtained by multiplying the raw variable importance by the ratio of the minimal error over all models to each model's actual error, according to the equation below [39]:

weightedImp_i = rawImp_i × (minError / Error_i)

where weightedImp_i is the weighted importance contributed by model i; rawImp_i is the raw importance of the variable obtained from model i; minError is the minimal error (RMSE or MSE) obtained over all models; and Error_i is the error of model i.
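The weighting rule can be sketched as follows (one raw importance value and one error per model, for a single variable):

```python
def weighted_importance(raw_imp, errors):
    """Scale each model's raw variable importance by minError / Error_i,
    so models that generalize better contribute more to the final ranking."""
    min_error = min(errors)
    return [imp * (min_error / err) for imp, err in zip(raw_imp, errors)]
```

Summing (or averaging) the weighted values across models then gives the final ranking of each variable.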



Studying the Relationship between Input and Output Data
Before fitting non-linear models to the calibration data set, it is necessary to check whether a linear model can describe the relationships between the parameters; if it can, there is no need to use more complicated models. Therefore, a simple linear regression (LR) was performed to predict VFA. We ran the BSM2 simulation with the default influent file and recorded the data signals listed in Table 1. As mentioned earlier, in most previously published studies the authors used hard-to-measure parameters, such as VFA, COD and alkalinity, to develop soft sensors for wastewater treatment processes. Therefore, in this work, COD, alkalinity and BOD were eliminated from the model's input candidate list from the outset. Due to the simplicity of the LR model, it is not necessary to perform feature ranking prior to modelling, so the remaining parameters of Table 1 were used as inputs to the LR model. Next, to determine the effect of other hard-to-measure parameters, such as TSS and ammonia, on the outcome, they were removed one by one from the LR model's input data and the error was calculated. The results of the LR models with different input vectors are shown in Table 2. The results clearly show that the relationship between the parameters is highly linear: the LR model can still predict VFA with high accuracy even when TSS and ammonia are eliminated from the input vector. This is contrary to the real behavior of the AD process, which is a very non-linear system. The most likely reasons for this linearity are the default influent data file and the design of the AD reactor itself. The default influent file is designed in such a way that the variation in the organic load to the AD reactor is not very intense; thus, the inhibition phenomena that are the main source of non-linearity in the AD process do not occur.
In addition, the AD reactor is quite over-designed compared to its feed load, and small disturbances from the sludge recovery unit do not have a great impact on its operation. Therefore, it is possible that the process is pushed towards very narrow, linear operational ranges. Thus, to increase the non-linearity and make the simulated AD process more realistic and challenging, we manipulated the feed load to the reactor. Furthermore, with the manipulated feed load, the AD process behaves like a co-digestion process, which is more interesting than mono-digestion. We randomly changed the concentrations of inorganic nitrogen (Sin), composite material (Xc) and carbohydrate (Xch), as well as the feed flow rate of the reactor. The default and modified values of these parameters are summarized in Table 3.

Table 3. Summary of the default and modified values of inorganic nitrogen (Sin), composite (Xc) and carbohydrate (Xch).

After manipulating the AD feed signal, a new LR model was fitted to the new data set with the same approach, to determine the impact of the changes on the VFA prediction. The prediction results are shown in Table 4. It can be seen that by manipulating the AD feed carbon, inorganic nitrogen load and feed flow, the process is pushed towards non-linear operational ranges. The best Test_NRMSE is obtained by using all inputs except the ammonia concentration. Although the accuracy of this model is better than that of the other LR models, its performance is far from an acceptable range. Thus, there is a demand for accurate prediction of VFA by non-linear models. In the next sections of this paper, non-linear approaches are applied to the newly obtained data set to develop more accurate soft sensors.


Choosing the Most Influential Variables Using the Feature Ranking Method
As mentioned earlier, using redundant and unnecessary features may add noise to the model and increase the risk of over-fitting. Therefore, a combination of influential sensor measurements should be chosen as model inputs. The measurements must be the most influential and, at the same time, they have to be easy to measure to satisfy the soft sensor definition. The best combination of parameters is generally chosen with feature ranking techniques. In the present study, we used the fscaret feature ranking technique. Before carrying out the feature ranking method, hard to measure parameters, including COD, alkalinity and BOD, were removed from the data set. The gas flow and CH4 mole fraction were also removed due to their direct correlation with pressure and CO2 mole fraction respectively. It should be noted that the same results could be obtained by using gas flow and CH4 mole fraction instead of using pressure and CO2 mole fraction; therefore, the decision to eliminate correlated parameters can be made based on the simplicity and availability of measurements. The remaining parameters, listed in Table 1, were used as an input vector for the fscaret method. Figure 4 shows the importance of the variables on a scale from 0 to 100 obtained with the fscaret method for VFA prediction. In Figure 4, pH, Ammonia concentration and pressure have the most influence on VFA. To determine the best subset of inputs, we used the SVM method as a non-linear technique. SVM was chosen because it is a fast and accurate method for modelling a non-linear system. Therefore, the most important variables (pH, Ammonia concentration and pressure) were chosen as the core subset and the other variables were added one by one to the model based on their importance values. The validation error is estimated for each subset after the models are trained. Table 5 shows the NRMSE In Figure 4, pH, Ammonia concentration and pressure have the most influence on VFA.
To determine the best subset of inputs, we used the SVM method as a non-linear technique. SVM was chosen because it is a fast and accurate method for modelling a non-linear system. Therefore, the most important variables (pH, Ammonia concentration and pressure) were chosen as the core subset and the other variables were added one by one to the model based on their importance values. The validation error is estimated for each subset after the models are trained. Table 5 shows the NRMSE of each subset based on the trained models. It can be seen that the second subset has the best NRMSE; therefore, this subset was selected for further development of VFA soft sensors based on ANN, ELM, SVM and RF. It should be noted that, as the GP model selects influential features inherently, there is no need to perform feature selection before its training. It is interesting that, based on the obtained data set, the flow does not have a significant effect on the VFA compared to the other variables and it is not included in the best subset. Figure 5 shows the trend of four input and output variables over the total experiment period. The input and output trends show that the behavior of the AD process is very complex, which results in absence of direct correlations between parameters. no need to perform feature selection before its training. It is interesting that, based on the obtained data set, the flow does not have a significant effect on the VFA compared to the other variables and it is not included in the best subset. Figure 5 shows the trend of four input and output variables over the total experiment period. The input and output trends show that the behavior of the AD process is very complex, which results in absence of direct correlations between parameters.
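The incremental subset-selection procedure described above can be sketched as follows. The feature names mirror the ranking reported for Figure 4, while the scoring function is a toy stand-in for "train an SVM on this subset and return its validation NRMSE".

```python
# Forward selection over a ranked feature list: start from the top-ranked
# core subset and add the remaining features in order of importance,
# keeping the subset with the lowest validation error.

def best_subset(ranked_features, core_size, validation_error):
    core = ranked_features[:core_size]
    candidates = [core + ranked_features[core_size:core_size + k]
                  for k in range(len(ranked_features) - core_size + 1)]
    scored = [(validation_error(subset), subset) for subset in candidates]
    return min(scored, key=lambda s: s[0])

# Importance ranking (cf. the fscaret output in Figure 4); names are
# shorthand labels, not the exact identifiers used in the study.
ranked = ["pH", "NH3", "pressure", "CO2_fraction", "temperature", "flow"]

# Toy validation scores keyed by subset size (illustrative values only).
toy_error = {3: 0.08, 4: 0.05, 5: 0.06, 6: 0.09}
err, subset = best_subset(ranked, 3, lambda s: toy_error.get(len(s), 0.1))
print(subset)  # the four-feature subset wins in this toy example
```

In the study, the analogous search selected the second subset (core variables plus CO2 mole fraction), which this sketch reproduces structurally.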

Soft Sensor Design
Before the soft sensors were designed, the data set was split into training and validation partitions. The uniformly sampled data from day 245 to 450 were used for training and the rest for validating the final models. All calculations were performed on Amazon Elastic Compute Cloud instances with 36 cores and 60 GB of random-access memory (RAM) running the openSUSE operating system (SUSE, Nürnberg, Germany). All modelling procedures were carried out in the R environment, except GP, which was developed with the Eureqa Formulize software package [40,41]. This package is optimized to find simple models, which are expected to generalize well. The tuning parameters of each model were estimated using an extensive grid search: a tuning grid is specified manually, with each tuning parameter given as a vector of possible values, and these vectors are combined to define all the combinations to try. As previously mentioned, different models, namely RF, ELM, ANN and SVM, were used to design the VFA soft sensors, and each has its own parameters that must be tuned to obtain an accurate model. Table 6 shows the final tuning parameters obtained with the grid search; no major parameters need to be tuned for the GP model implemented in Eureqa Formulize. The NRMSE and R2 were estimated on the training and validation sets for each model, and the results obtained with the different techniques are shown in Table 7. The lowest validation error was achieved by the model developed with the GP algorithm; thus, it should be considered the final, ready-to-use model. The ANN, ELM and SVM algorithms gave slightly higher errors, but were still better than the RF model, whose predictions failed with an NRMSE of 34%.
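The manual grid search described above can be sketched as below. The parameter names and candidate values are assumptions for an SVM-like model, not the actual grids of Table 6, and the scoring function stands in for training the model and returning its validation NRMSE.

```python
from itertools import product

# Illustrative tuning grid: each parameter is a vector of candidate
# values, and their Cartesian product defines all combinations to try.
grid = {
    "C": [0.1, 1.0, 10.0],
    "gamma": [0.01, 0.1],
}

def validation_nrmse(params):
    # Toy stand-in for "train with these parameters, score on validation
    # data"; a fixed score table keeps the sketch self-contained.
    toy = {(1.0, 0.1): 0.04, (10.0, 0.01): 0.06}
    return toy.get((params["C"], params["gamma"]), 0.10)

# Expand the grid into all combinations and keep the best-scoring one.
names = list(grid)
combos = [dict(zip(names, values)) for values in product(*grid.values())]
best = min(combos, key=validation_nrmse)
print(best)
```

The same expand-and-score pattern underlies the grid search facilities of the R caret package used in this kind of workflow.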
The high error of RF arises because it is a rule-based method in which the data are categorized into different classes; applied to temporal data, this produces weak results due to the high number of classes. Although ANN, ELM and SVM are powerful non-linear function approximation methods with broad applications, they remain "black box" models whose structure and parameters do not provide any insight into the phenomena underlying the modelled process. In contrast, the GP model is transparent and generates explicit equations that are very convenient for direct online implementation in existing process information and control systems. The optimal GP soft sensor model generated by the Eureqa package is given, with its coefficients, in Table 8. The GP was trained on the full input vector; nevertheless, Table 8 shows that the final model contains only four variables, in accordance with the result obtained by the fscaret method. In addition, to check the prediction ability of each model, the final models were tested on the whole data set without sampling. Figure 6 shows the prediction result of each model; the performance of all models is satisfactory except for the RF model.
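To illustrate the transparency argument: a GP soft sensor is just a closed-form expression, so deployment reduces to evaluating a function. The equation below is entirely hypothetical, invented only to show the shape of such a model; the actual structure and coefficients are those reported in Table 8 and are not reproduced here.

```python
import math

# Hypothetical GP-style soft sensor: an explicit algebraic function of
# the four selected inputs. Structure and coefficients are invented for
# illustration; the real model is given in Table 8 of the paper.
def vfa_soft_sensor(ph, nh3, pressure, co2_frac):
    return 0.9 - 0.08 * ph + 1.3 * nh3 * pressure + 0.8 * math.exp(-co2_frac)

# Because the model is a closed-form expression, it can be evaluated at
# each sampling instant with negligible cost and no ML runtime, which is
# what makes it easy to embed in a process control system.
print(vfa_soft_sensor(7.0, 0.05, 1.05, 0.35))
```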

Evaluation of the Robustness of Soft Sensors
The prediction accuracy of soft sensors tends to drop after a period of online operation due to changing process states. This loss of accuracy may cause problems in process operation, including increased maintenance costs and decreased quality of the final products. It is therefore very useful to examine how such accuracy degradation affects the final prediction. To do this, we selected three important biochemical parameters of anaerobic digestion: the hydrolysis rate of carbohydrates (khyd,ch), the maximum uptake rate of acetate (km,ac) and the ammonia inhibition constant (kI,NH3), and varied them by ±50% around their default values of 10 d−1, 8 d−1 and 0.0018 kmol·m−3, respectively. To evaluate the robustness of each model, new data sets were generated with the BSM2 simulation using the modified biochemical parameters, and each trained model was then tested on the newly obtained data. Figure 7 shows the results of the robustness evaluation for each model; as the RF soft sensor failed, it is not considered in this evaluation.

From the variation of km,ac, it can be concluded that GP is the most robust approach, although its NRMSE increases at −50%. For khyd,ch, the NRMSE is fairly constant, showing that all techniques are insensitive to variations of this parameter; nonetheless, GP still has a lower error than the other models. For kI,NH3, GP is again the most robust, as its error does not change significantly. The least robust model is SVM, which has the highest error of all the models. To ease the comparison, the average NRMSE (avNRMSE) over the variation of each parameter is also shown in Figure 7.
Comparing the avNRMSE of all the soft sensors shows that GP has the lowest prediction error under changes in the AD state parameters. Overall, the performance of the other soft sensor models is also acceptable, given their low avNRMSE values.
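The robustness procedure above can be sketched as the following loop: perturb each biochemical parameter around its default, score the trained soft sensor on data simulated under each perturbation, and average the NRMSE per parameter. The scoring function is a toy stand-in for "rerun BSM2 with the modified parameter and evaluate the model"; the perturbation steps are an assumed discretization of the ±50% range.

```python
# Default biochemical parameters from the paper: hydrolysis rate of
# carbohydrates (d^-1), max acetate uptake rate (d^-1), ammonia
# inhibition constant (kmol/m^3).
DEFAULTS = {"khyd_ch": 10.0, "km_ac": 8.0, "kI_NH3": 0.0018}

def toy_nrmse(param, value):
    # Stand-in scorer: error grows mildly with the relative perturbation.
    return 0.05 + 0.02 * abs(value / DEFAULTS[param] - 1.0)

def av_nrmse(param, steps=(-0.5, -0.25, 0.0, 0.25, 0.5)):
    # Average the NRMSE over perturbed values of one parameter.
    errors = [toy_nrmse(param, DEFAULTS[param] * (1 + s)) for s in steps]
    return sum(errors) / len(errors)

for name in DEFAULTS:
    print(name, round(av_nrmse(name), 4))
```

A model is deemed robust when its avNRMSE stays close to its nominal validation error across all three parameters, which is the criterion under which GP comes out ahead in Figure 7.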


Conclusions
In the present paper, different data-driven soft sensors are proposed for predicting the effluent VFA of the AD process. Their performance has been demonstrated with a case study based on synthetic data obtained from BSM2. Analysis of the data simulated with the default BSM2 influent file showed, through the LR models, that the relationship between the input and output data is highly linear. The BSM2 model was therefore modified to introduce non-linearity into the simulated data; this modification also made the behavior of the AD similar to an anaerobic co-digestion process. The best subset of input variables, comprising pH, ammonia concentration, pressure and CO2 mole fraction, was obtained with the fscaret method combined with SVM. After training, the prediction and generalization performance of each model was evaluated on a separate validation data set. The results show that all models except RF predict the effluent VFA precisely, with GP performing slightly better than the others. The RF model failed completely to predict the VFA, which suggests that tree-based models are not an appropriate choice for developing models that, like soft sensors, require some extrapolation capability. The robustness assessment shows that the GP model is the most robust and the least sensitive to state changes of the AD process. Last but not least, a further benefit of the GP soft sensor, apart from its accuracy and robustness, is its transparency, which makes it easy to integrate into process control systems without further modification.