Support Vector Regression Modelling of Aerobic Granular Sludge in a Sequential Batch Reactor

Support vector regression (SVR) models have been designed to predict the concentration of chemical oxygen demand (COD) in sequential batch reactors under high temperatures. The complex internal interactions between the sludge characteristics and the influent were used to develop the models. Prediction becomes harder when dealing with a limited dataset arising from the constraints of the experimental work. A radial basis function kernel with selected cost and gamma parameters was used to develop the SVR models. The kernel parameters were selected using a grid search method and were further optimized using particle swarm optimization and a genetic algorithm. The SVR models were then compared with an artificial neural network. The coefficient of determination (R2) exceeded 90% for all predicted COD concentrations. The results show the potential of SVR for simulating the complex aerobic granulation process and provide an excellent tool for predicting the behaviour of aerobic granular reactors in wastewater treatment.


Background
One of the well-known wastewater treatment technologies is aerobic granular sludge (AGS), which is able to overcome the limitations of conventional activated sludge (CAS) [1]. AGS has a compact and dense biomass and relies on the growth of granules. When seed sludge is inoculated in a bubble column operated as a sequential batch reactor (SBR), AGS has been observed to form with a short settling time [2][3][4]. Since then, AGS technology has attracted the interest of plant builders for its advantages, especially in reducing the footprint and increasing the efficiency of wastewater treatment plants. Moreover, the SBR is the preferred configuration for advancing AGS technology, because it offers operational flexibility with reliable and rapid cultivation of AGS.
To facilitate and encourage the practical application of AGS in wastewater treatment plants, numerous studies have broadly explored the fundamentals of AGS in SBRs [5][6][7][8][9]. To further enhance the system, modelling and optimization through simulation of a data-driven mathematical model would provide a valuable evaluation tool. Furthermore, mathematical models have been proven to be useful tools to study the complex operations of the AGS system [10,11].
The modelling of CAS in wastewater treatment plants is a well-established practice supported by numerous robust models. Other modelling methods, such as SVR, are increasingly attracting researchers' interest in the field of wastewater treatment due to their appealing characteristics and promising generalization accuracy. ANN and SVR have similar architectures, but what distinguishes the two is that ANN uses multi-layer connections with weights and activation functions to predict non-linear problems, while SVR employs a linear model to implement non-linear class boundaries by mapping input vectors into a high-dimensional feature space for regression. ANN follows the principle of empirical risk minimization (ERM), which targets minimizing the error on the training data; its gradient descent learning algorithm may converge to local minima, so even a well-trained ANN can suffer from problems such as overfitting. SVR, on the other hand, follows structural risk minimization (SRM), which avoids being trapped in local minima and yields a unique global solution [24]. Much research in the field of wastewater has focused on prediction using ANN and SVR, but most datasets are larger and have not been designed specifically for comparative analysis. Hence, appropriate steps in developing the model structures for training ANN and SVR are needed to avoid bias, since the performance of the prediction model is directly affected by the performance of the training model.
One of the well-known characteristics of SVR is its ability to deal with non-linear relationships. This makes SVR attractive compared with neural networks, as it can easily handle non-linear datasets by using the kernel trick. Despite these characteristics, some disadvantages restrict the use of SVR for academic purposes as well as on industrial platforms. Several parameters need to be defined by the user, namely SVR's kernel function and hyperparameters. A proper setting of these parameters is needed, as they affect the generalization performance of the prediction model; this has been the main issue for practitioners using SVR. Several recommendations have been given on how to appropriately set the SVR kernel parameters, but there is no consensus. The method of V. Cherkassky and Y. Ma [25] presents a practical way to set the SVR parameters of cost and gamma; however, it relies on user expertise, prior understanding, or an experimental trial setup. Hence, there is no assurance that the parameter values acquired are truly optimal. In addition, the selection of truly optimal parameters is further complicated by the fact that the SVR generalization performance depends on both kernel parameters and hyperparameters together. This makes it hard to fine-tune the hyperparameters manually, making it essential to use an optimization algorithm to determine their best values [26].
Hence, the aim of this paper is to investigate the predictive performance of ANN and SVR, based on the limited dataset of aerobic granulation in the SBR system under three different temperatures, i.e., 30 °C, 40 °C, and 50 °C, in predicting the effluent quality in terms of chemical oxygen demand (COD). The available AGS dataset is limited because of the constraints of experimental work collected under the different high temperatures of the SBRs, which makes the data hard to obtain. The final results of the ANN and SVR prediction models show that the SVR model still requires optimization; hence, further optimization is also implemented in this work.
The major contributions of this work can be summarized as:
1. To the best of the authors' knowledge, this is the first modelling work applied to a limited dataset of 21 samples for aerobic sludge granulation using a sequential batch reactor at high temperatures of 30 °C, 40 °C, and 50 °C.
2. As a comparison, SVR and ANN models were developed to identify the best predictive model on the limited dataset, which contains only 21 samples.
3. Instead of a trial-and-error method, an improvement was made by using a grid search algorithm to determine suitable pairs of C and γ values.
4. The conventional grid search algorithm was further improved by applying the optimization methods of particle swarm optimization (PSO) and genetic algorithm (GA) to give a better estimate of the best pair of C and γ values. The proposed method shows an improvement in the accuracy of the predicted COD concentration with a limited dataset.

The models for both SVR and ANN are carefully developed, validated, and evaluated in this study to ensure the best models are built.

Optimization Methods
In this work, a proper setting is needed to simulate the SVR model. Of the several methods used in other applications for tuning the parameters, most use trial-and-error or a grid search method. When applying the grid search method, the parameter range has to be stepped down to improve the accuracy of the optimal solution; however, this makes the search process time-consuming. Parameter selection can thus be regarded as a constraint of SVR. Some researchers have proposed solutions to this problem: the authors of [27] tuned the SVR parameters using mutative scale chaos optimization, while [28] used a novel optimization based on the quantum-inspired evolutionary algorithm.
Several optimization methods have been developed as fast alternatives to the grid search method, among them Particle Swarm Optimization (PSO) and the Genetic Algorithm (GA). Many works in the literature show that SVR models optimized by these methods achieve a significant improvement in accuracy [29,30]. The motivation for selecting PSO as an optimization method is its speed and the simple adjustability of the algorithm [31]. GA was chosen as a second optimization method for determining the SVR parameters so that the effectiveness of both methods could be compared. Both PSO and GA have been thoroughly qualified and tested under the same conditions. Furthermore, both can be easily implemented in MATLAB and are easily parallelized [31]. In addition, both PSO and GA are well known for their efficiency in locating the global optimum, even when the objective fitness function is discontinuous [32]. Hence, it is of interest to compare the PSO, GA, and grid search methods for selecting the SVR parameters and to evaluate their performance.

Study Area
The wastewater treatment plant data were collected in Madinah, Saudi Arabia. During summer, temperatures in Madinah can reach close to 50 °C. The influent wastewater is mainly treated in three stages: mechanical, biological, and chemical. During mechanical treatment, the incoming flow passes through grit removal and fine screening. Small solids remain suspended in the water, while larger materials such as plastics are screened and ground before being conveyed to the next stage. This prevents blockage of valve nozzles and pipelines and protects the sludge removal equipment from damage.
The biological treatment in this study is based on aerobic granular sludge (AGS) instead of a conventional activated sludge system. The collected sludge was used to cultivate aerobic granules in a sequencing batch reactor (SBR). The experiments were conducted in a double-walled cylindrical glass column bioreactor with a working volume of 3 L, an internal radius of 3.25 cm, and a total height of 100 cm [33]. The system pumps, including the inlet, outlet, and aerator, were all controlled by a programmable logic controller (PLC). The bioreactor was operated as an SBR with a cycle of 180 min: 60 min of influent feeding without mixing, 110 min of aeration, and 5 min each for settling and effluent discharge. Diffusers placed at the bottom of the column supplied air at a volumetric flow rate of 4 L min−1 during aeration. The effluent was discharged through an outlet port located at the middle of the glass column, giving a volumetric exchange ratio of 50%. Heating only half of the column was sufficient to achieve the required temperature for the experiment (to develop the granules and treat the wastewater). Heat is transferred and distributed evenly inside the bioreactor through conduction and convection: heat produced by the heating mantle is transferred to the glass column by conduction, while convection is driven by the movement of the fluid during aeration. Three sequencing batch reactors at different temperatures of 30, 40, and 50 ± 1 °C, controlled using water bath sleeves and thermostats and denoted SBR30°C, SBR40°C, and SBR50°C, were used for data collection. Figure 1 shows the schematic diagram of the AGS experimental laboratory pilot-plant system.


Characteristic and Statistics Analysis of Wastewater Dataset
In this case study, the dataset available for modelling and optimization is limited due to the constraints of the experimental work in [33]. The composition fed into the reactors is the same influent composition used by [34]. The synthetic wastewater consisted of two solutions: (A) CH3COONa 65 mM, MgSO4·7H2O 3.7 mM, and KCl 4.8 mM, and (B) NH4Cl 35.2 mM, K2HPO4 4.4 mM, KH2PO4 2.2 mM, and trace element solution 10 mL L−1. These solutions were mixed and prepared with distilled water prior to feeding. The parameters considered for this study were total nitrogen (TN), total phosphorus (TP), ammonia nitrogen (AN), total organic carbon (TOC), mixed liquor suspended solids (MLSS), and chemical oxygen demand (COD), as shown in Supplementary Table S1. Only effluent COD is considered as the output for model prediction. The overall data, collected over 60 days at an interval of 3 days, comprise 21 samples. The experimental work in [33] provides the influent and effluent parameters used for model development. The experiments were conducted at three operating conditions (30, 40, and 50 °C) for verification. Table 1 presents the statistical analysis of the influent and effluent parameters based on the dataset obtained from the experimental work in [33].

Data Pre-Processing
An important step in modelling is data pre-processing, which includes data normalization and data division. In this study, the data were scaled to an appropriate range for network learning in the model development. The following steps were applied to the collected AGS dataset.
Data normalization is performed in the development of both models. This is a crucial step in modelling, as it prevents features on a large numeric scale from dominating those on a smaller scale; it also simplifies the calculations and minimizes the chances of numerical underflow and overflow. It is recommended to scale the training and testing data to the range (0, 1). Equation (1) is used to normalize the dataset:

x_norm = (x_i − x_imin) / (x_imax − x_imin), (1)

where x_i is the input/output data, x_imax is the maximum of the input/output data, and x_imin is the minimum of the input/output data. In data division, the experimental data are divided into training and testing datasets. As stated by P. Samui and B. Dixon, there is no specific rule for determining the division of training and testing datasets [35]. However, several methods can be applied depending on the size of the experimental dataset, for instance K-fold cross-validation (CV) [36] and the hold-out method [37]. In this study, the K-fold CV method was chosen to split the dataset, as the CV resampling method is particularly suitable for evaluating a limited dataset. The parameter K refers to the number of groups into which the experimental data are split. In this work, 10-fold CV is chosen, as it is commonly used and suggested [38]. In each iteration, K−1 folds are used to train the model, while the remaining fold is used as the test dataset. From the iterations over the folds, an average model is calculated and finalized; the averaged training model is then tested again on the testing dataset. Figure 2 illustrates the K-fold CV method.
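The normalization of Equation (1) and the K-fold split described above can be sketched in Python (the authors worked in MATLAB, so this is an illustrative sketch only; the function names are ours, and the 21-sample size follows the dataset used in this study):

```python
import numpy as np

def minmax_normalize(x):
    """Scale each column of x into the (0, 1) range, as in Equation (1)."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(axis=0), x.max(axis=0)
    return (x - x_min) / (x_max - x_min), x_min, x_max

def kfold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) index pairs for K-fold cross-validation:
    in each of the k iterations, k-1 folds train the model and one fold tests it."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, folds[i]
```

For 21 samples and k = 10, each sample appears in exactly one test fold, so every observation is used for both training and testing across the iterations.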

Artificial Neural Network (ANN)
Artificial neural networks were first established to mimic the workings of the human brain but rapidly progressed to a broader range of applications [18]. They are quite popular, especially for simulating complex processes such as wastewater treatment. The neural network structure consists of three main perceptron layers: the input layer, the hidden layer, and the output layer. Each layer contains neurons and is connected to the next layer of neurons via synapse connections that carry weights. Each neuron performs mathematical operations on its inputs, and the weights represent the relative significance of the node output. The output layer collects all the data inside the network and transforms it into the desired output, where every output has its own node [39].



ANN Structure
The determination of the ANN structure is important and requires additional effort to accomplish. The network structure of ANN is selected by considering the relationship proposed by L. Rogers and U. Dowla [40]. Meanwhile, the range of hidden neurons for the simulation started from 1 hidden neuron up to 16, with an increment of 2, as proposed by R. Hecht-Nielsen [41]. The upper bound is set as the number of hidden neurons to guarantee the good accuracy performance of the model.
There are two types of processing functions inside the neuron: the propagation function and the activation function. For a neuron, Equations (2)-(4) are given as follows [42], where x_i denotes the input, w_ij the weight, θ_i the bias, and t the time step. Figure 3 depicts the single-neuron architecture of the ANN models, which contain an input layer of 6 influent variables, a hidden layer (of up to 16 neurons), and 1 predicted output. The most widely used architecture is the feed-forward neural network (FFNN), which was adopted in this work [18]. The simplest way to describe a single-layer FFNN is in matrix form: the inputs x_i are multiplied by the weights w_ij to compute the output of the propagation function, the activation potential v_i. Equations (5) and (6) show how the activation potential passes through the activation function σ to obtain the output y_i.
The key to achieving the highest precision is finding the right combination of weights, which is accomplished through the learning algorithms of neural networks. This is known as training the model, where the learning process starts from pairs of input and output data and continues to improve the weights until the output error saturates [43].
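A minimal sketch of the neuron computation described above (propagation, then activation) and of an FFNN forward pass follows. The logistic sigmoid is an assumed activation function, since the text does not specify one; the 6-input, up-to-16-hidden-neuron shape follows the structure described here:

```python
import numpy as np

def sigmoid(v):
    """Logistic activation function (one common choice for sigma)."""
    return 1.0 / (1.0 + np.exp(-v))

def neuron_output(x, w, theta):
    """Propagation (weighted sum plus bias) followed by activation."""
    v = np.dot(w, x) + theta   # activation potential v_i
    return sigmoid(v)          # node output y_i

def ffnn_predict(x, W1, b1, W2, b2):
    """Single-hidden-layer FFNN forward pass: inputs -> hidden layer -> 1 output."""
    h = sigmoid(W1 @ x + b1)   # hidden-layer activations
    return W2 @ h + b2         # linear output layer
```

Training would then iteratively adjust W1, b1, W2, b2 to reduce the output error, as described in the paragraph above.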


Support Vector Regression (SVR)
The roots of Support Vector Regression (SVR) lie in the work of V. Vapnik et al., based on the concept of structural risk minimization [44]. Generally, the support vector technique builds a predictive model capable of classifying unknown patterns into their groups; its applications later introduced regression and function estimation [45,46]. Theoretically, SVR depends on decision planes that separate two classes by maximizing the margin. It creates a hyperplane using a linear model and implements non-linear class boundaries through a non-linear mapping of the input vectors into a high-dimensional feature space. The dependency in the mapping function y = f(x) between the high-dimensional input vector x and the scalar output y is non-linear and unknown. Since there is no information on the underlying joint probability functions, distribution-free learning must be employed. Hence, the main goal of the SVR model is to make the margin between the two categories as large as possible.
To build a strong SVR predictive model, different mapping kernel functions can be selected in the algorithm. The input dataset is denoted as {(x_i, y_i), i = 1, 2, 3, . . . , l}, where x_i is the input and l is the number of data points, equal to the size of the input dataset; y_i is the corresponding output. Besides the efficient utilization of a high-dimensional feature space, further merits of SVR are a distinctly solvable optimization problem, amenability to theoretical analysis using computational learning theory, and a highly effective approach to building a mathematical model from limited training data [47,48].
The basic theory underlying the SVR algorithm is expressed in Equation (7) [49]:

f(x) = ω·φ(x) + b, (7)

where ω denotes the weight vector, φ(x) is a non-linear mapping function from the low- to the high-dimensional feature space, and b is a bias. The values of b and ω can be derived by substituting the input dataset (x_i, y_i) into Equation (7) and minimizing the regularized risk of Equations (8) and (9):

R_reg[f] = R_emp[f] + λ‖ω‖², (8)

R_emp[f] = (1/l) Σ_{i=1…l} C(e_i), (9)

where C is a cost parameter that acts as a regularization parameter by controlling the loss on the input dataset, and λ controls the compromise with model complexity. R_reg[f] represents the regularized (structural) risk, and R_emp[f] the empirical risk. A poor choice of C leads to poor prediction by the SVR model. The margin of a classifier is defined as the minimum distance between any input data point and the hyperplane. To achieve the goal of minimal structural risk, i.e., the optimal hyperplane, larger margins should be obtained for better generalization. The confidence risk is denoted by ‖ω‖² and reflects the complexity of the model; l and C(·) are the size of the training dataset and the loss function, respectively, and e_i is the difference between the predicted data ŷ_i and the experimental data y_i. Following the principle of structural risk minimization, the SVR model seeks to minimize the sum of the confidence risk and the empirical risk.
The problem of finding the function f for a given loss function can be solved using quadratic programming, as in Equation (10):

min (1/2)‖ω‖² + C Σ_{i=1…l} (ξ_i + ξ_i*), subject to y_i − ω·φ(x_i) − b ≤ ε + ξ_i, ω·φ(x_i) + b − y_i ≤ ε + ξ_i*, ξ_i, ξ_i* ≥ 0, (10)

Equation (11) can be obtained by substituting any support vector into Equation (10):

f(x) = Σ_{i=1…l} (α_i − α_i*)(φ(x_i)·φ(x)) + b, (11)

Several kernel functions are available in the literature. The kernel function is defined as the inner product in the high-dimensional feature space of SVR, as in Equation (12):

K(x_i, x_j) = φ(x_i)·φ(x_j), (12)

By computing the kernel function in the low-dimensional space, the inner product in the high-dimensional space can be obtained. Finally, Lagrange multipliers and optimality constraints are introduced, and the decision function takes the explicit form of Equation (13):

f(x) = Σ_{i=1…l} (α_i − α_i*) K(x_i, x) + b. (13)
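Once the dual coefficients (α_i − α_i*) and the bias b have been obtained from the quadratic program, evaluating the decision function of Equation (13) is straightforward. A minimal sketch, assuming an RBF kernel and already-solved dual coefficients (the values below are illustrative, not fitted):

```python
import numpy as np

def rbf_kernel(xi, xj, gamma):
    """RBF kernel K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)."""
    return np.exp(-gamma * np.sum((np.asarray(xi) - np.asarray(xj)) ** 2))

def svr_decision(x, support_vectors, dual_coefs, b, gamma):
    """Evaluate f(x) = sum_i (alpha_i - alpha_i*) K(x_i, x) + b, Equation (13)."""
    return sum(c * rbf_kernel(sv, x, gamma)
               for sv, c in zip(support_vectors, dual_coefs)) + b
```

Only the support vectors (points with non-zero dual coefficients) contribute to the sum, which is what makes the SVR model sparse.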

Kernel Functions
Kernels, or kernel functions, are mathematical functions that play an important role in SVR and its performance. A wrong selection of the kernel function will degrade the accuracy of the model's predictions. Essentially, the kernel function takes the input data and transforms it into the required form; kernel functions enable the operations to be performed in the input data space rather than in the high-dimensional feature space. Different kernels give different SVR algorithms. In this study, the radial basis function (RBF), Equation (14), is used for developing the SVR model:

K(x_i, x_j) = exp(−γ‖x_i − x_j‖²), (14)

where γ regulates the effect of prediction variations on the kernel variation.
In general, the RBF is the kernel most used by researchers. RBF kernels non-linearly map sample data into a higher-dimensional feature space, unlike linear kernel functions, and are therefore suitable for cases where the relation between input and output is non-linear. Sigmoid kernels behave like the RBF, but only for certain parameters [50]. The polynomial kernel has more parameters than the RBF, and a larger number of hyperparameters increases the complexity of model selection. Compared to the polynomial kernel, the RBF kernel is localized and has a finite response along the entire axis. However, the choice of kernel differs depending on the application, the scaling method, and the parameters. Figure 4 depicts the SVR model structure.
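The locality of the RBF kernel, in contrast to the unbounded response of linear and polynomial kernels, can be seen numerically in a small comparison (the γ value and sample points below are illustrative):

```python
import numpy as np

def linear_k(a, b):
    return float(np.dot(a, b))

def poly_k(a, b, d=3, c0=1.0):
    return float((np.dot(a, b) + c0) ** d)

def rbf_k(a, b, gamma=1.0):
    return float(np.exp(-gamma * np.sum((a - b) ** 2)))

x = np.array([0.0, 0.0])
near = np.array([0.3, 0.0])
far = np.array([3.0, 0.0])
# The RBF response is localized: identical inputs give 1,
# and the value decays towards 0 as the distance grows.
```

Here rbf_k(x, x) is exactly 1, rbf_k(x, near) is close to 1, and rbf_k(x, far) is nearly 0, illustrating the finite, distance-based response described above.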


Cost and Gamma Parameters
Two parameters are considered for the RBF kernel function: the cost of penalty C and the gamma γ value. The parameter C controls the trade-off between the slack variable size and the margin, and the parameter γ influences the partitioning outcome in the feature space of the RBF kernel. The selection of suitable C and γ values is important, as it affects the prediction performance of the model. The parameters can be determined in several ways: trial-and-error, grid search, or an optimization method such as a genetic algorithm (GA). Grid search is the conventional way, but it is time-consuming [51]. Alternative ways to find the parameter values are particle swarm optimization (PSO) and the genetic algorithm (GA).
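To make the grid search concrete, a sketch over logarithmic grids of (C, γ) pairs follows. The cv_mse function is a hypothetical stand-in: in a real run it would train an SVR with each candidate pair and return its cross-validated MSE, which is what makes exhaustive search time-consuming.

```python
import itertools
import numpy as np

def cv_mse(C, gamma):
    """Hypothetical stand-in for the cross-validated MSE of an SVR
    trained with parameters (C, gamma); a real run would fit the model
    on the training folds and score it on the held-out fold."""
    return (np.log10(C) - 1.0) ** 2 + (np.log10(gamma) + 2.0) ** 2

# Logarithmic grids, as commonly recommended for C and gamma.
C_grid = [10.0 ** k for k in range(-2, 5)]
gamma_grid = [10.0 ** k for k in range(-5, 2)]

best_C, best_gamma = min(itertools.product(C_grid, gamma_grid),
                         key=lambda p: cv_mse(*p))
# For this toy objective, the minimum lies near C = 10 and gamma = 0.01.
```

Narrowing the grid around the best pair and repeating the search improves precision, at the cost of evaluating many more (C, γ) combinations.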

Data De-Normalization
Once the prediction output is obtained, the final step of the ANN and SVR models is data de-normalization, which reverses the normalization process. It is essential to rescale the data to its original range for performance evaluation. Figure 5 shows the modelling flow chart of ANN and SVR.
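De-normalization simply inverts Equation (1). A minimal sketch (the COD values below are illustrative, not measurements from this study):

```python
import numpy as np

def denormalize(x_scaled, x_min, x_max):
    """Invert Equation (1): map (0, 1)-scaled values back to original units."""
    return x_scaled * (x_max - x_min) + x_min
```

Applying the forward normalization and then denormalize recovers the original values, which is the round-trip property the evaluation step relies on.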

SVR-PSO Prediction Model
In SVR-PSO, the kernel and regularization parameters are obtained through an optimization framework derived from the particle swarm optimization method. PSO is a population-based search method that exploits the idea of information shared by the community. The particles in the PSO are flown through the hyperdimensional search region, and their positions change based on a socio-psychological tendency to follow the achievements of other particles. The movement of a particle within the search space is therefore influenced by the experience and information acquired by its neighbours, and particles tend to return towards previously successful regions of the hyperdimensional space.
As mentioned earlier in Section 1, implementing the SVR-PSO approach requires the kernel parameters (C and γ) to be optimized. The position of each particle within the swarm is denoted as a vector that encodes the SVR parameters. The objective (fitness) function is set up according to regression accuracy, measured with MSE and R 2 . Particles with high regression accuracy will thus produce MSE values close to 0 and R 2 values close to 1. Figure 6 shows the flow of the PSO optimization.
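The particle update described above can be sketched in a few lines. This is a generic PSO loop, not the authors' code; the fitness used here is a hypothetical stand-in (a smooth surface with a known minimum) for the cross-validated error of an SVR trained with the encoded (C, γ) pair:

```python
import random
random.seed(0)

def pso_minimize(fitness, bounds, n_particles=20, iters=60, w=0.7, c1=1.5, c2=1.5):
    """Minimal PSO: each particle's position is a vector encoding e.g. (log2 C, log2 gamma)."""
    dim = len(bounds)
    pos = [[random.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]            # each particle's own best position
    pbest_f = [fitness(p) for p in pos]
    gbest = min(pbest, key=fitness)[:]     # best position shared by the whole swarm
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])   # own memory
                             + c2 * r2 * (gbest[d] - pos[i][d]))     # neighbours' memory
                # clamp the new position to the search bounds
                pos[i][d] = min(max(pos[i][d] + vel[i][d], bounds[d][0]), bounds[d][1])
            f = fitness(pos[i])
            if f < pbest_f[i]:
                pbest[i], pbest_f[i] = pos[i][:], f
        gbest = min(pbest, key=fitness)[:]
    return gbest

# hypothetical stand-in fitness (e.g. CV mean-squared error), minimised at (3, -5)
toy_fitness = lambda p: (p[0] - 3) ** 2 + (p[1] + 5) ** 2
best = pso_minimize(toy_fitness, [(-15, 15), (-15, 15)])
print(best)   # close to [3, -5]
```

In the actual model the fitness would train an SVR with the decoded (C, γ) and return its CV error, so the swarm converges on the best-performing parameter pair.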

SVR-GA Prediction Model
The concept of GA is intended to mimic the processes of natural systems, especially evolution. The GA framework is based on the principle of survival of the fittest within a population that tries to preserve genetic information from one generation to the next. One of GA's features is its inherent parallelism, in contrast to serial algorithms: if one search path turns out to be a dead end, it is automatically eliminated and other promising paths are pursued. Moreover, even for complex problems, GA is still able to find the global optimum among many local optima, which is one of its remarkable capabilities.
Initially, SVR-GA generates a random initial population. Fitness values for each chromosome are then evaluated, and a new population is generated using selection and crossover. In selection, the fittest members are nominated to produce the next generation, while the least fit are eliminated. Analogous to the crossover of DNA strands in biological reproduction, crossover creates the new generation that replaces the current population. Finally, when the stopping condition is achieved, the best values are returned; otherwise, the process restarts with selection. The stopping condition is satisfied when the lowest MSE value is achieved. Figure 7 shows the flow of the GA optimization.
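The selection-crossover-mutation cycle can be sketched as below. This is a generic real-coded GA, not the authors' implementation; the fitness is again a hypothetical stand-in for the model error of a candidate (C, γ) pair:

```python
import random
random.seed(1)

def ga_minimize(fitness, bounds, pop_size=30, generations=60,
                crossover_rate=0.8, mutation_rate=0.1):
    """Minimal real-coded GA: elitism, tournament selection, blend crossover, Gaussian mutation."""
    dim = len(bounds)
    pop = [[random.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=fitness)
        next_pop = [scored[0][:], scored[1][:]]          # elitism: keep the two fittest
        while len(next_pop) < pop_size:
            p1 = min(random.sample(pop, 3), key=fitness)  # tournament selection
            p2 = min(random.sample(pop, 3), key=fitness)
            child = p1[:]
            if random.random() < crossover_rate:          # blend crossover of two parents
                a = random.random()
                child = [a * x + (1 - a) * y for x, y in zip(p1, p2)]
            for d in range(dim):                          # Gaussian mutation, clamped to bounds
                if random.random() < mutation_rate:
                    child[d] += random.gauss(0, 1)
                child[d] = min(max(child[d], bounds[d][0]), bounds[d][1])
            next_pop.append(child)
        pop = next_pop
    return min(pop, key=fitness)

# hypothetical stand-in fitness (e.g. CV mean-squared error), minimised at (3, -5)
toy_fitness = lambda p: (p[0] - 3) ** 2 + (p[1] + 5) ** 2
best = ga_minimize(toy_fitness, [(-15, 15), (-15, 15)])
print(best)   # close to [3, -5]
```

Elitism makes the best fitness found so far monotonically non-increasing across generations, which is what gives the "dead ends are eliminated, promising paths resumed" behaviour described above.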

Model Validation
The validation and evaluation of a model involves analysing the regression's best fit. The plotted graph is checked against several criteria, which reveal whether the residual plots are random and whether the performance of the prediction models deteriorates substantially when unknown data (testing data) are applied. The results are compared based on the residual sum of squares due to error, RSS (Equation (15)); the correlation coefficient, R 2 (Equation (18)); the root mean square error, RMSE (Equation (19)); and the mean square error, MSE (Equation (20)), where the number of predictions is defined as N, the experimental data as y i , y as the mean of the predicted data, and ŷ i as the predicted data. The RSS statistic measures the total deviation of the forecast values from the fit. A value closer to zero indicates that the model has a smaller random error component and that the fit is good for prediction. The correlation coefficient R 2 is defined as the ratio of the sum of squares of the regression (SSR) to the total sum of squares (SST). An R 2 value closer to 1 shows that a greater percentage of the variance is accounted for by the developed model.

RSS = Σ_{i=1}^{N} (y_i − ŷ_i)² (15)
where SST = SSR + RSS, so that R 2 is expressed as:
R 2 = SSR/SST = 1 − RSS/SST (18)
The root mean square error, RMSE, also known as the fit standard error or the standard error of the regression, is:
RMSE = √MSE (19)
where MSE is the residual mean square error:
MSE = RSS/N (20)
An RMSE value closer to zero indicates the best fit, which is useful for prediction.
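These four measures can be computed directly from their definitions. The small helper below is a sketch (variable names are ours, not from the paper) that makes the relationships between RSS, MSE, RMSE, and R 2 explicit:

```python
import math

def regression_metrics(y_true, y_pred):
    """RSS, MSE, RMSE, and R^2 as defined in Equations (15) and (18)-(20)."""
    n = len(y_true)
    rss = sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred))  # Eq. (15)
    mse = rss / n                                              # Eq. (20)
    rmse = math.sqrt(mse)                                      # Eq. (19)
    y_mean = sum(y_true) / n
    sst = sum((y - y_mean) ** 2 for y in y_true)
    r2 = 1.0 - rss / sst                                       # Eq. (18)
    return {"RSS": rss, "MSE": mse, "RMSE": rmse, "R2": r2}

m = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
print(m["R2"])   # ≈ 0.98: close to 1, so this toy fit is good
```

Note that SST = SSR + RSS only holds for least-squares fits with an intercept; the 1 − RSS/SST form used here is the general definition.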

SVR Training
The accuracy performance of the SVR model is shown in Table 2. To obtain these results, the hyperparameters C and γ needed to be optimized. The hyperparameters in this study were tuned using the grid search method, which yields the best pair of kernel function cost C and gamma γ with the highest performance, measured here in terms of RMSE. Tuning the hyperparameters with the grid search method controls the penalty due to large prediction residual errors. The values of C and γ were varied over 2 −15 , 2 −14 , . . . , 2 14 , 2 15 on a log scale. Tuning with grid search carries a high risk of overfitting; hence, K-fold cross-validation (CV), in which K − 1 folds are used for training and the remaining fold for testing on unseen data, was essential to ensure the models generalize.
The combinations of C and γ are evaluated on the training dataset using the grid search method with 10-fold CV. Nine folds of the 53% training data undergo the parameter search, leaving the last fold as a test dataset. The final averaged model from the training dataset is then calculated before being tested again on the remaining 47% testing dataset to verify and validate the results. The best pair of C and γ provides the best CV accuracy. The epsilon ε is fixed to 0.01. The performance of the CV is evaluated based on the residual sum of squares (RSS) and the RMSE value. The RSS assesses the mean of the dependent variables; a smaller RSS represents better model accuracy. With C and γ each drawn from 2 −15 , 2 −14 , . . . , 2 14 , 2 15 , the grid search produces 31 × 31 = 961 combinations.
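The 961-combination search is simply a double loop over the exponents. In the sketch below, the scoring function is a hypothetical stand-in (a smooth surface with a known minimum) used just to show the mechanics; in practice it would be the mean 10-fold-CV RMSE of an SVR trained with that (C, γ) pair:

```python
import math

def grid_search(cv_rmse, exponents=range(-15, 16)):
    """Exhaustively score C = 2^a, gamma = 2^b over 31 x 31 = 961 combinations."""
    best = None
    for a in exponents:
        for b in exponents:
            C, gamma = 2.0 ** a, 2.0 ** b
            score = cv_rmse(C, gamma)          # e.g. mean RMSE over 10-fold CV
            if best is None or score < best[0]:
                best = (score, C, gamma)
    return best

# hypothetical stand-in for the CV score, minimised at C = 2^5, gamma = 2^-3
toy_score = lambda C, gamma: (math.log2(C) - 5) ** 2 + (math.log2(gamma) + 3) ** 2
best_score, best_C, best_gamma = grid_search(toy_score)
print(best_C, best_gamma)   # 32.0 0.125
```

Each of the 961 evaluations retrains the model 10 times (once per fold), which is why grid search is the slow baseline that PSO and GA improve on later in the paper.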
The kernel scale hyperparameter controls the kernel function's response to variation in the predictors. A low kernel scale implies a milder kernel drop-off, and a large kernel scale implies that the kernel drops off rapidly. Hence, the kernel scale is related to the spread of the data, and smaller kernel scale values produce more flexible models.

ANN Training
The choice of network structure greatly affects the performance of ANN models. The ANN consists of three layers of neurons, which are used to predict the relationship between the six influent inputs of aerobic granular sludge and the output COD concentration. In this ANN training, the data were normalized before being separated into 53% training and 47% testing datasets. The number of hidden neurons was varied from 1 up to 16 in increments of 2. Bayesian regularization and feed-forward neural network (FFNN) algorithms were applied to the training model. The developed FFNN models of COD concentration were analysed by computing the R 2 , MSE, and RMSE accuracy performance. Once the prediction output was obtained, the data were de-normalized back to their original scale. The hidden neuron counts were selected based on the highest correlation achieved in the testing results, as shown in Table 3.
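For reference, the forward pass of such a single-hidden-layer FFNN (six influent inputs, one COD output) can be written out as below. This is an illustrative sketch with hypothetical random weights, a hypothetical hidden-layer size, and tanh activations, not the trained Bayesian-regularization network from this study:

```python
import math, random
random.seed(0)

def ffnn_predict(x, W1, b1, W2, b2):
    """Forward pass of a single-hidden-layer FFNN: inputs -> tanh hidden units -> linear output."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return sum(w * h for w, h in zip(W2, hidden)) + b2

n_in, n_hidden = 6, 3   # hypothetical size; the study varied hidden neurons from 1 to 16
W1 = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
b1 = [random.uniform(-1, 1) for _ in range(n_hidden)]
W2 = [random.uniform(-1, 1) for _ in range(n_hidden)]
b2 = 0.0
y = ffnn_predict([0.2, 0.5, 0.1, 0.9, 0.4, 0.3], W1, b1, W2, b2)
print(y)
```

Because the weights here are random rather than trained, the output is meaningless; the point is only the architecture, and why changing n_hidden (and re-randomizing the initial weights) changes the network every run, which is the instability discussed below.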

Model Evaluation between ANN and SVM
The SVR and ANN models were evaluated using a very limited dataset of 21 samples under different high temperatures. The testing dataset was isolated from the beginning of the development of the training models to ensure adequate evaluation. Figure 8 compares the plotted graphs of the SVR and ANN models during training and testing at three different temperatures. As shown in Figure 8, the trained SVR model was able to predict the COD concentration for SBR30 °C, SBR40 °C, and SBR50 °C with R 2 values of 98.41%, 96.69%, and 96.17%, respectively. Testing of the SVR model resulted in 89.29%, 88.43%, and 93.43%, respectively. Meanwhile, training of the ANN model achieved 72.42%, 96.06%, and 92%, and testing achieved 82.62%, 85.39%, and 89.10%, respectively. The main aim of using three different high temperatures in this study was to investigate the granulation process, density, and performance stability of AGS in treating the wastewater.
From the results, the SVR model showed a clear advantage over the FFNN model in terms of R 2 and MSE. This is because the amount of AGS data available in this study is limited, which can cause the training of the neural network to be disrupted by noise, especially in wastewater treatment research. Noise can arise from the inputs used and from measurement errors [52,53]. Hence, SVR proved to be the better option for modelling a limited and complex dataset such as AGS.
The prediction performance of the ANN is strongly influenced by the number of neurons in the hidden layer, which was varied from 1 to 16 and trained and tested with random combinations of weights and threshold values. This makes the ANN model unstable, as the network architecture changed every time it was trained and tested. It is also noted that ANNs with fewer neurons are more unstable, especially networks with one or two neurons in the hidden layer.
The SVR model also depends less on the training dataset than the FFNN model, which requires a larger dataset to obtain the desired accuracy. In fact, the SVR model is based on statistical learning theory and has a rigorous mathematical and theoretical foundation, while the FFNN model relies on the designer's knowledge and experience [54]. Hence, it can be concluded that the SVR approach gives higher accuracy than the FFNN with a limited dataset when predicting COD concentration. Table 4 shows the performance comparison between SVR and FFNN in terms of R 2 and MSE for the different reactor temperatures.
The studies of modelling aerobic granulation are still in their infancy. Three previous studies of aerobic granulation were performed by [21,22,55]. Table 4 tabulates the outcomes of these models, together with this work. An ANN model was developed by H. Gong et al. [20] to investigate the effluent quality of total nitrogen (TN) and COD. The collected datasets comprised 205 and 136 samples. The model used a single hidden layer; unfortunately, the number of neurons was not reported. The final performance is usually based on prediction of an unseen dataset, but in that work, the results were evaluated based on the combination of training, validation, and testing.
The ANN model is known for predicting larger datasets, yielding higher accuracy with good training performance, but this can lead to overfitting [21]. The study by M.S. Zaghloul et al. [21] used an ANN model to predict complex aerobic granulation in the SBR process. The model was developed using a modular method, in which the first sub-model predicts the biomass characteristics, and these predicted characteristics are then used by the second sub-model to predict the effluent quality. The network used in that study was a feed-forward neural network (FFNN) with a single hidden layer, trained on a large dataset of 2886 samples. The performance shows a promising result of 99.98% for effluent COD.
Another study from the same authors, M.S. Zaghloul et al. [22], reported a comparison between adaptive neuro-fuzzy inference systems (ANFIS) and SVR in predicting AGS. The ANFIS model was hybridized with the nodes of an FFNN in order to handle the fuzzy parameters. The model presented in that work is a continuation of [21]. From the results, the SVR model showed a clear advantage over the ANFIS model. The ANFIS model struggled with the high level of noise in the data, and its large fuzzy rule bases required a reduction in the number of inputs. That work proved that SVR also shows a great ability to predict with a larger dataset, but it requires a proper setting of the kernel hyperparameters.

Prediction Model of Optimized SVR
The selection of C and γ is crucial in SVR, as they are provided as inputs and influence how the hyperplane divides the training data. Optimal parameters are needed to produce the best regression model. In practice, the parameters are usually decided by the grid search method: they are varied with a fixed step size over a wide range of values, and the performance of every combination is assessed. The parameter search can take a very long time, depending on the size of the dataset. Due to this computational complexity, the grid search method is only suitable for adjusting very few parameters.
Hence, we propose particle swarm optimization (PSO) and the genetic algorithm (GA) for hyperparameter selection. The development of SVR is further improved by these optimization methods: since the conventional grid search for hyperparameter selection has several disadvantages, an improved algorithm is needed to obtain optimal kernel parameters, and PSO and GA are therefore adopted in this SVR model.
To implement the SVR-PSO and SVR-GA approaches, a fitness function is designed. PSO and GA contain multiple parameters that need to be set, such as C1, C2, population size, crossover rate, generation size, and maximum iterations. To compare the optimizations fairly, the same settings were implemented in both GA and PSO. The values of the inertia weights w min and w max , as well as C1 and C2, were taken from successful PSO/GA implementations in [56][57][58]. The parameter settings for both PSO and GA are shown in Table 5. The position of each particle within the PSO swarm is viewed as a vector encoding the values of the SVR regularization and kernel parameters (C and γ). Meanwhile, in GA, the survival-of-the-fittest principle lets the population retain genetic information from one generation to the next, and crossover creates a new generation from the current population. The fitness evaluation in SVR-PSO and SVR-GA stops when the stopping condition on R 2 is achieved. Thus, particles with high regression accuracy produce the best fitness values. Figure 9 shows the comparison between the conventional grid search method and the optimized PSO and GA.
From the plotted graph, the results show a significant improvement in performance for the optimized PSO and GA methods. For SBR30 °C, SVR-PSO and SVR-GA obtained 98.04% and 98.03%, respectively, an increment of about 10% in predicting the effluent COD concentration at 30 °C. Improvement was also shown for SBR40 °C and SBR50 °C, where SVR-PSO achieved 92% and 94%, respectively, about 5% above the conventional grid search method. Hence, optimization methods such as PSO and GA are useful tools for determining optimal solutions when predicting the complex behaviour of AGS, and each helps to improve the performance obtained from the limited data of this study. The overall performance of the optimized models is tabulated in Table 6.

Model Validation and Evaluation
The SVR-PSO, SVR-GA, and SVR-Grid Search models are validated on both the training and testing datasets. The evaluation and validation are presented by plotting cross plots of the developed models; the residual plots help to verify the stochastic nature of the intrinsic errors in each model. The results are compared based on three important criteria: the lowest RSS, the highest R 2 , and the lowest RMSE.
As shown in Figure 10, the SVR-PSO and SVR-GA models for SBR30 °C follow the regression line of the actual COD concentration data, with correlation values of 98.04% and 98.02%, respectively. For SVR-Grid Search, the correlation obtained is slightly lower, at 92.24%. This is because PSO and GA allow more freedom in searching the hyperdimensional parameter space, which results in finding the global optimum among many local optima. Corresponding to the R 2 results, the RMSE values obtained are 0.0486, 0.0487, and 0.1021, respectively, all close to zero. The models are good and acceptable, as the RSS yields 0.0189, 0.0189, and 0.0834, respectively; an RSS value closer to zero indicates that the model has a smaller random error component and will be useful for prediction. The performance can also be seen in the other reactors: SBR40 °C with correlation values of 92.54%, 91.82%, and 89.84%, and SBR50 °C with correlation values of 94.59%, 94.11%, and 90.05%. The overall validation results for SBR30 °C, SBR40 °C, and SBR50 °C are presented in Table 7. The p-values (P1 and P2) in Table 7 indicate the probability of the coefficients (with 95% confidence bounds), and the adjusted R 2 is the best indicator of fit when its value is close to 1.

Conclusions
Two models, SVR and FFNN, were developed in this work to predict the COD concentration in aerobic granular sludge at three different temperatures (30 °C, 40 °C, and 50 °C). The models' prediction performance was compared using conventional and optimization methods, particularly because a complex system such as aerobic granular sludge was involved with a limited dataset. The conclusions that can be drawn from this investigation are as follows. Based on the limited training dataset, the testing evaluation shows that the accuracy of the FFNN is unstable. This is due to the changing number of neurons in the hidden layers combined with random combinations of weights and threshold values. By contrast, SVR shows a stable prediction accuracy with every combination of C and γ.
As a comparison, on the limited experimental dataset used for both FFNN and SVR, SVR was better able to find the global solution during training and obtained excellent testing results with good generalization capability. Meanwhile, the FFNN easily converged to local minima and suffered from overfitting in the testing evaluation.
For the proposed optimized methods, SVR-PSO and SVR-GA both required much lower computational times than SVR-Grid Search. As presented in this work, the CPU time consumed by SVR-Grid Search ranged from 0.75 to 3.25 h, depending on the parameter combinations iterated: with C and γ each drawn from 2 −15 , 2 −14 , . . . , 2 14 , 2 15 , the grid search produced 31 × 31 = 961 combinations. Meanwhile, the CPU time consumed by SVR-PSO ranged from 0.21 to 1.30 h. For the population/generation size used in this work (100), SVR-PSO is superior to SVR-GA in convergence speed; the CPU time consumed by SVR-GA ranged from 0.32 to 1.58 h. Analysis of the SVR model shows that SVR is an excellent tool for limited datasets compared to the FFNN model. The conventional parameter tuning of SVR was then improved using the PSO and GA optimization methods. With a dataset of only 21 samples, models were successfully developed to predict the effluent COD concentration, and the optimized SVR shows an accuracy improvement of about 10% over the grid search method.
In many applications, SVR can guarantee higher accuracy for long-term prediction cases than other computational approaches. Fundamentally, three major advantages of SVR are highlighted throughout this work. First, only a few parameters need to be determined: the kernel function and the cost of penalty C and gamma γ values. Second, the SVR model is unique in that it provides a globally optimal solution by solving a linearly constrained quadratic problem. Third, SVR provides good generalization performance due to its implementation of the structural risk minimization (SRM) principle. However, this study has some limitations. Better AGS results could be achieved with more than 21 samples, allowing a better division between training and testing datasets; an insufficient sample size can also weaken the statistical measures of performance. To further improve the objective function evaluation, the Gravitational Search Algorithm (GSA) could be considered, as it has been reported to give better convergence, especially for WWTP problems. The computational time of each optimization should also be recorded to observe the time taken to reach the highest performance.