Water Quality Indicator Interval Prediction in Wastewater Treatment Process Based on the Improved BES-LSSVM Algorithm

This paper proposes a novel interval prediction method for effluent water quality indicators (including biochemical oxygen demand (BOD) and ammonia nitrogen (NH3-N)), which are key performance indices in the water quality monitoring and control of a wastewater treatment plant. Firstly, the effluent data regarding BOD/NH3-N and their necessary auxiliary variables are collected. After some basic data pre-processing techniques, the key indicators with high correlation degrees of BOD and NH3-N are analyzed and selected based on a gray correlation analysis algorithm. Next, an improved IBES-LSSVM algorithm is designed to predict the BOD/NH3-N effluent data of a wastewater treatment plant. This algorithm relies on an improved bald eagle search (IBES) optimization algorithm that is used to find the optimal parameters of least squares support vector machine (LSSVM). Then, an interval estimation method is used to analyze the uncertainty of the optimized LSSVM model. Finally, the experimental results demonstrate that the proposed approach can obtain high prediction accuracy, with reduced computational time and an easy calculation process, in predicting effluent water quality parameters compared with other existing algorithms.


Introduction
Nowadays, freshwater is considered one of the most critical resources for humans, since it can ensure the availability of an acceptable quantity of water for livelihoods, health, ecosystems and production. Hence, freshwater plays a key role in poverty and disease burden reduction, economic growth and environmental sustainability [1,2]. This fact has long been acknowledged all over the world. However, due to industrial pollution, rapid population growth and farmland sewage caused by the extensive use of chemical fertilizers, pesticides and herbicides, the shortage of freshwater sources is a serious and challenging issue [3,4].
Wastewater treatment is one key technology to potentially provide additional water supplies, and it is very important for the functioning of the economy and society. Wastewater treatment has been attracting a lot of attention, since it can not only remove organic wastes to reduce the environmental burden, but also offer the advantage of producing a renewable source of water [5,6]. Wastewater treatment is a very complex process with a variety of physical and biochemical reactions since it presents nonlinear dynamic behavior, time delay and uncertainty [7]. In wastewater treatment plant processes, effluent water quality monitoring is an important task that involves measuring the evolution of the quality parameters in time.
Note that most traditional methods of measuring these quality indicators for wastewater treatment processes are based on manual lab-based monitoring approaches, with manual sample collection, long-time transportation and biological/microbial testing in a prediction ability of the single method [24]. Han et al. proposed a data-based predictive control strategy and proved its superiority through several simulations [25]. In [26], the total solid content of a wastewater treatment plant was predicted by an SVM model, which can enhance performance and durability.
Although SVM is a small-sample learning method and has been widely used to solve the wastewater prediction problem, the calculation process is multifarious, which is difficult to implement for large-scale training samples [27]. To overcome these disadvantages, the least-squares support vector machine (LSSVM) has been proposed. LSSVM improves the performance of the SVM algorithm by solving linear programming rather than quadratic programming. In this way, the calculation process can be reduced and the computation speed greatly improved [28]. Zhang et al. proposed an improved LSSVM model based on SVM to predict river flow [29]. Fei Luo et al. integrated the Gustafson-Kessel algorithm and least-squares support vector machine for line prediction of [30]. D. S. Manu et al. combined SVM and an adaptive neuro-fuzzy reasoning system model to predict the effluent nitrogen content of wastewater treatment plants [31]. Liu et al. investigated the online prediction of effluent COD in an anaerobic wastewater treatment system based on principal component analysis and the LSSVM algorithm [32].
Note that there are some unknown parameters in the kernel functions of LSSVM that need to be selected in advance. Generally, these parameters are determined according to experience, which may be time-consuming, and it is difficult to find the optimal parameters. Nowadays, swarm intelligence optimization algorithms are researched extensively, since the optimal solution can be found by swarm intelligence to perform a collaborative search mechanism. The results of the combination of swarm intelligence optimization algorithms and machine learning methods can be found in a large number of references. In [33], a hybrid model of particle swarm optimization (PSO) and support vector machine is proposed to predict the turbidity and pH value of sand filtered water in irrigation systems. Han et al. use an adaptive PSO algorithm to design self-organizing radial basis function neural networks to improve the accuracy and save time [34]. Chen et al. study the artificial bee colony optimization back-propagation network to predict the water quality of a water diversion project [35]. Fan et al. use the LSSVM model to improve the performance of predicting the safety factor of a circular slope [36]. Mahdi Shariati et al. use the gray wolf algorithm to optimize ELM model parameters to predict the compressive strength of partially replaced cement concrete [37]. However, to the best of the authors' knowledge, these swarm intelligence methods may fall into local optima and do not find the global optimal solutions. Most of the above-mentioned methods only focus on point prediction, without providing information regarding accuracy. The prediction results have strong uncertainty that affects the decision-making process, increasing the risk of not making good decisions. Prediction interval (PI) is a standard tool for quantifying prediction uncertainty. PI not only provides the range where the target value is most likely to exist, but also indicates its accuracy. Yao et al. combined the mean variance estimation (MVE) method with a recurrent neural network to measure the uncertainty in prediction [38]. Yuan et al. combined beta distribution with the PSO-LSTM model to obtain the wind power prediction interval with high reliability and a narrow interval width, so as to provide decision support for the safe and stable operation of power systems [39]. Liao et al. combined the bootstrap method with the long and short memory network to realize the uncertain prediction of the remaining service life of the machine [40]. Marin et al. obtained the prediction interval of power consumption by combining the delta method with a fuzzy prediction model [41]. Sun et al. constructed a high-quality prediction interval based on the two-step method of dual ELM and applied it to the scheduling of a gas system [42]. In recent years, a direct interval prediction method called upper and lower bound estimation (LUBE) has been proposed. The main idea of this method is to directly construct the upper and lower bounds of PI by optimizing the coefficients of the neural network according to the interval quality evaluation index. This approach can provide good performance and does not consider strict data distribution assumptions, such that it can provide more information about the prediction results, which motivates the work of this paper.
The main objective of this paper is to obtain a soft-sensor-based interval prediction method with high prediction accuracy and less computational time to predict the effluent water quality parameters, which is significant for water quality monitoring and control. Aiming at the online prediction of BOD/NH3-N effluent in a wastewater treatment plant within a smart data-driven framework, the main contributions of this paper are the following: • Data pre-processing methods, i.e., abnormal data elimination and normalization, are taken into consideration after the data and their related auxiliary variables are collected. Then, some key factors of the wasterwater quality indicators are selected based on the gray correlation analysis algorithm. • In order to improve the prediction accuracy of BOD/NH3-N effluent, a novel IBES-LSSVM algorithm is proposed, in which an improved bald eagle search (IBES) optimization algorithm is used to find the optimal parameters of the least-squares support vector machine (LSSVM). The superiority of the proposed method is verified by comparing it with the existing soft-sensing models (such as GWO, WOA, PSO, SSA) using some benchmark functions and providing higher prediction accuracy. • In order to estimate the uncertainty of the model prediction results and make better decisions, after obtaining the point prediction results, the interval prediction bounds of effluent quality are also generated. Compared with some existing soft-sensing models, the proposed interval prediction method can obtain a more accurate prediction range.
The structure of this paper is as follows: In Section 2, the problem description is given, including the real data collection, data pre-processing and gray-correlation-analysis-based data selection. Section 3 describes the model uncertainty analysis by using the proposed IBES-LSSVM algorithm and LUBE algorithm. In Section 4, the simulation examples are depicted, demonstrating the effectiveness of the proposed method based on the BOD and NH3-N data. Section 5 draws the main conclusions of this paper.

Problem Description
In this paper, a soft-sensing-based method is investigated to analyze and predict the water quality indicators, including three main aspects: data collection, data pre-processing and data interval prediction. The main steps of the approach presented in this paper are shown in Figure 1.
Under a smart data-driven framework, in order to predict water quality tendencies and analyze the mechanisms behind the considered data sources, enough relevant experimental data in real time must be collected based on the prediction quality indicators. Most collected data may present several issues, such as data sparsity and data synchronization, among others. After the data are collected, they must be pre-processed in advance by applying several procedures, such as data cleaning, abnormal data elimination or normalization. Then, correlation analysis from different dimensions of water quality indicators should be considered to extract the relations between these auxiliary variables and find the key factors.

Data Collection
Due to the complexity of the wastewater treatment process and the large number of parameters that need to be set, it is necessary to determine the characteristic variables related to the water quality to be determined as auxiliary variables. The data that can evaluate the quality or impact of water quality in wastewater treatment plants are mainly divided into the following four categories [43]: • Physical data: Physical properties are the ones that must be monitored throughout the treatment process, including total suspended solids, temperature, conductivity, transparency, total dissolved solids, etc.
• Chemical data: Chemical water quality indices of the national comprehensive discharge standard for water pollutants, including: pH, biochemical oxygen demand, biochemical oxygen consumption, heavy metals, nitrates, etc. • Biological data: Biomarkers include a variety of microorganisms in the water, such as mayflies, E. coli, etc. • Environmental data: Environmental data cover the whole process of water supply, including indexes of weather, hydrology, soil or ecology.

Data Collection
Data Pre-processing

Grey Correlation Analysis
Physical data：temperature , etc This paper focuses on a real wastewater treatment plant in Beijing, China, from August 2014 to September 2014 [7,44]. Two data sets are collected first, which are used to predict the BOD/NH3-N effluent, separately. (1) BOD data set: containing 360 batches of data with 23 variables (including the BOD effluent parameters)-the detailed information is shown in Table 1; (2) NH3-N data set: including 10 characteristic variables related to NH3-N effluent parameters, as shown in Table 2.

Elimination of Abnormal Data
Data collected from wastewater treatment plants can contain erroneous values because of improper instrument operation, human or environmental interference and other factors. As a result, we need to analyze the collected data first, and eliminate some abnormal or meaningless data.
In this paper, we use the 3σ criterion to handle the abnormal data of the two collected data sets. The sample data are denoted as x 1 , x 2 , · · · , x n . η i is used to represent the data residual error. Then, the standard deviation is calculated as follows: where n represents the number of elements in the data set, andx is the data average. If the residual error of particular data sample x i satisfies this means that it corresponds to an abnormal sample and needs to be eliminated. Otherwise, x i is accepted. Influent COD (ICOD) (mg/L) 07 Effluent COD (ECOD) (mg/L) 08 Sludge settling ratio of biochemical tank (mg/L) 09 MLSS in biochemical tank (MLSS) (mg/L) 10 Biochemical pool Do (mg/L) 11 Influent oil (IOil) (mg/L) 12 Effluent oil (EOil) Influent TN (IT) (mg/L) 18 Effluent TN (mg/L) 19 Influent phosphate concentration (IPC) (mg/L) 20 Effluent phosphate concentration (mg/L) 21 Inlet water temperature ( • C) 22 Outlet water temperature ( • C) 23 Effluent BOD (EBOD) (mg/L)

Data Normalization
Different variables often have different dimensions and dimensional units. In order to eliminate the dimensional influence between indicators, it is necessary to normalize the data to achieve uniformity among the different data indicators. There are four classes of normalization methods, i.e., rescaling, mean normalization, standardization and scaling to unit length. In this paper, the rescaling method is selected. The normalization formula is as follows: where x i is any value of a variable; x i min and x i max are, respectively, the minimum and maximum value of the variable. After this kind of normalization, all the values of the data are set in the range of [0, 1].

Correlation Degree Analysis
Since different characteristic variables will have different influences on the predicted variables, to obtain a soft-sensing model with a simpler structure, it is necessary to choose the quality indicators with high correlations. Selectingḿ auxiliary variables from m variable, it hasḿ < m. In practice, the larger m is, the smallerḿ is compared to m.
In this paper, the gray relational degree analysis method is investigated to select the characteristic variables of BOD and NH3-N effluents. Gray relational degree analysis is a multi-factor statistical method, which describes the strength of the relationship between various factors according to the gray relational degree. This method looks for the inconsistency between quantitative results and quantitative analysis in the traditional mathematical statistics method and reduces the amount of calculation.
The gray correlation coefficient is formulated as follows: where j means the j-th variable, k is the k-th iteration, x 0 (k) is the output variable, x j (k) is the input variable, µ j is the gray correlation coefficient and ρ is the resolution coefficient. If ρ is smaller, the difference between correlation coefficients is larger, and the distinguishing ability is stronger. Then, the gray correlation degree can be calculated as follows: where n is the number of variables. If the gray correlation degree is larger, this means that the corresponding variable has a higher correlation with the effluent quality indicators. Then, according to the gray correlation degree, the characteristic variables are sorted from front to back. Usually, a threshold is determined in advance ash, and then the key indicators can be selected as the input of the soft-sensing model if is satisfied.

Methodology
In this section, a novel IBES-LSSVM method is proposed to find the optimal kernel function parameters of the LSSVM in

LSSVM Algorithm
The theory of LSSVM was first proposed by Suykens in 1994. LSSVM is a kernel learning machine following the principle of structural risk minimization and is suitable for analyzing the issue of sample classification and regression estimation [45].
In LSSVM theory, firstly, the sample data are mapped to higher dimensions through nonlinear changes, and linear functions are used for fitting in this high-dimensional feature space: where y(x) is the output variable, x is the input variables, and w and b are weight and bias terms, respectively. The optimization objectives of the LSSVM regression algorithm can be formulated as where C is the regularization coefficient, ξ i is the relaxation variable, and By means of Lagrange multipliers α i , (10) can be expressed as: According to Karush-Kuhn-Tucker (KKT) optimization conditions: By defining kernel functions, the optimization problem (11) can be transformed into a linear solution issue: where K(x, x i ) is the kernel function. The Lagrange multiplier and its parameters can be obtained from (13). Therefore, the output of LSSVM can be obtained: For LSSVM, there are many different types of kernel functions, such as linear function, polynomial kernel function, radial basis function (RBF), sigmoid kernel function, etc. Different kernel functions will produce difference types of LSSVM. In this paper, we select RBF as the kernel function of the model: where σ is the variance of RBF. Through the aforementioned analysis, LSSVM has two tunable parameters (regularization coefficient C and variance of radial basis kernel function σ with RBF), which are important and need to be determined. To obtain the optimal two parameters, the next step is to use an improved PSO algorithm to optimize them.

IBES-LSSVM Algorithm
The BES algorithm is an optimization algorithm that simulates the hunting strategy of vultures when looking for fish. It can obtain a single optimal solution through multiple iterations and finally obtain the overall optimal solution, such that the position of the optimal solution corresponds to the optimal parameter value.
BES hunting is divided into three stages. In the first stage (selection space), the eagle selects the space with the largest prey number. In the second stage (spatial search), the eagle moves in the selected space to find the prey. In the third stage (dive), the eagle swings from the best position determined in the second stage and determines the best hunting.
In the selection stage, firstly, this paper optimizes the initial prey position and adopts the tent chaos strategy, which has the advantages of simple structure and strong ergodicity. Then, the linear decreasing method is used to improve the control parameters of the vulture iterative update position. The optimal model parameters of the model can be found that improve the quality of the fitting. The tent chaotic mapping function is described as: where λ is [0, 1]. Then, the vultures hunt for food. The formula is: where R 1 is a parameter controlling the position change, and C 1 is a random number between (0, 1). P best is the current optimal location. P mean is the average distribution location of vultures after the previous search. P i is the location of the i-th vulture.
In the search phase, vultures search for prey in the selected search space and move in different directions in the spiral space to speed up the search. The best position for subduction is: where: br(i) = r(i) · cos[(θ(i))] where θ(i) and r(i) are the polar angle and polar diameter of the spiral equation, respectively. ω and R 2 are the parameters controlling the spiral trajectory. C 2 and C 3 are a random number within (0, 1). The a(i) and b(i) represent the position of the vulture in polar coordinates, and the values are (−1, 1). During the dive phase, vultures swing from the best position in the search space to their target prey. All points also move towards the best point according to where: br(i) = r(i) · cosh[(θ(i))] where R 3 and R 4 represent the moving speed of the vulture to the optimal point. C 4 and C 5 are random numbers within (0, 1).

Interval Prediction
The traditional point prediction cannot deal with the uncertainty in the operation of the system. In order to obtain the numerical estimation and its reliability, the practical application requires the calculation of the prediction interval. Interval prediction indicates the estimation interval of the range of predicted values in a certain confidence interval. Therefore, the prediction interval is composed of the upper and lower line of prediction, which provides its accuracy within a certain confidence level. Assuming that the confidence level is (1 − µ)%, l and u are the lower and upper limits, respectively, when P(l < y < u) = 1 − µ%, and PI can be expressed as [l, u]. For a given confidence interval, the smaller the range of prediction interval, the smaller the uncertainty of prediction and the higher the accuracy.
The evaluation indexes of interval prediction are as follows [46]. PICP: The ratio of the real value to the upper and lower bounds of the prediction interval If the predicted value is within the [l i , u i ] range, c i is 1. Otherwise, c i is 0. If all predicted values are included in the prediction interval, PICP = 100%. n is the number of prediction points. In theory, PICP (1 − µ)%; otherwise, PI is invalid or unreliable. When comparing the PIs by the model, the other indexes should be as small as possible under the condition that the PICP is as close to the confidence level as possible.
PI N AW: The narrow PI has more information and practical value than the wide PI according to where R is the range of predicted values, respectively. PI NRW: Represents the standard square root width of the predicted interval. The expression is: CWC: In practical application, it is often hoped that a narrow prediction interval width can still be obtained under the condition of high prediction probability, i.e., the prediction interval range probability and interval width will conflict. Therefore, the comprehensive index CWC is proposed: where τ and µ are constants. When working with training data, the set (PICP) is 1. In addition, in data verification, (PICP) is a step function: LUBE is a method based on neural networks to directly calculate the lower and upper bound of the prediction interval. Assuming that the two node values of the output layer of the neural network are the upper and lower limits of the interval, respectively, all the predicted values are included in this range at the confidence level (1 − µ)%. The training purpose of a neural network is to minimize the objective function CWC. In this way, the probability and width of the prediction interval are considered at the same time, and the advantages and disadvantages of the prediction interval PI can be comprehensively evaluated.
The flow-chart of the proposed IBES-LSSVM algorithm is shown in Figure 2, which mainly includes the procedure presented in Algorithm 1.

Algorithm 1 LUBE interval prediction based on IBES-LSSVM model.
Input: Measured data of wasterwater treatment plant. Output: Prediction interval of BOD/NH3-N effluent.
Step 2: Analyzing and selecting the key indicators with high correlation degree by Equations (5)-(8).
Step 3: The bald eagle population is initialized by tent chaos strategy based on Equation (16).
Step 6: Return the global optimal prediction interval.

Simulation Results
In this section, the data sets of BOD/NH3-N effluents are collected from a wastewater treatment plant in Beijing and are used to verify the effectiveness of the proposed approach.
The following evaluation indices of several certainty point predictions are evaluated as follows:

Experiment of Benchmark Functions
The proposed approach is based on the six functions listed in Table 3 with the corresponding ranges and parameters. The range is the boundary of the function search space.
In order to verify the superiority of the proposed approach, it is compared with the WOA, GWO, PSO and SSA algorithms. Statistical results are presented in Table 4. Moreover, the iteration process is depicted in Figures 3-8. From the results, we can see that the convergence rate of IBES is better than that of the other algorithms and the proposed IBES method is able to provide competitive results on the benchmark functions.

Function
Range Parameters

Experiment of BOD Data
BOD is one of the most important effluent quality indexes and can reflect the water pollution situation [7]. First, the key auxiliary variables are selected for the BOD effluent data set by calculating the gray correlation degree based on (7). The threshold of the gray correlation degree is chosen as 0.8. Hence, 14 auxiliary variables (as shown in Table 5) are selected as the soft measurement model inputs. Including the output effluent BOD, there are 15 key indicators; the detailed information is shown in Figure 9. Moreover, the description of each datum is given in Figure 10.
In this paper, the BOD effluent data set has 365 sets of data; among them, 335 sets of data are randomly selected as training samples, and the remaining 30 sets of data are treated as the prediction samples. In order to demonstrate the superiority of the proposed IBES-LSSVM method, it is compared with some existing results, i.e., CNN, LSTM, ELMAN, WOA-LSSVM, GWO-LSSVM, PSO-LSSVM and SSA-LSSVM. In the experiments, the initialization conditions are set as: iter is 50, n = 30, ω max = 10, ω min = 0, R 1 = 1.8, R 2 = 1, R 3 = 1.5, R 4 = 1.5.

Experiment of NH3-N Data
In this experiment, the NH3-N effluent data set is considered, which has been described in [44]. First, the gray correlation degree is calculated from (7), and the results are presented in Figure 14. In addition, each selected auxiliary datum of the NH3-N data set is shown in Figure 15.  In this example, the threshold of the gray correlation degree is also chosen as 0.8; hence, 7 auxiliary variables (as shown in Table 8) are selected as the soft measurement model input. The experimental data of effluent NH3-N used in this paper are from a sewage treatment plant in Beijing. In total, 237 sets of data were obtained, including 200 sets of data that were randomly selected as training samples, and the remaining 37 sets of data were treated as the prediction samples. In order to demonstrate the superiority of the proposed BES-LSSVM method, it is compared with some existing approaches, i.e., CNN, LSTM, ELMAN, WOA-LSSVM, GWO-LSSVM, PSO-LSSVM and SSA-LSSVM. In the experiments, the parameters are set as follows: iter is 50, n = 30, ω max = 10, ω min = 0, R 1 = 1.8, R 2 = 1.2, R 3 = 1.8, R 4 = 1.8.
From Tables 9 and 10

Conclusions
This paper investigates an improved IBES-LSSVM algorithm to predict the effluent water quality indicators of a wastewater treatment plant, in which an improved BES method is proposed to find the optimal LSSVM parameters. To deal with the uncertainties of the data, the prediction interval is generated within a certain confidence level, which could provide the upper and lower bounds of the prediction results. Compared with other existing methods, the proposed approach demonstrates high prediction accuracy, with reduced computational time and an easy calculation process, in predicting effluent water quality parameters. Note that the proposed results can only predict the water quality indicators, but this is not the end work for a wastewater treatment plant process. The application of this work to reliable decision-making and the generation of a suitable control strategy will be our future work.