Prediction of BOD Concentration in Wastewater Treatment Process Using a Modular Neural Network in Combination with the Weather Condition

Abstract: Since weather has a huge impact on the wastewater treatment process (WWTP), the prediction accuracy for the Biochemical Oxygen Demand (BOD) concentration in WWTP would degrade if only one single artificial neural network is used as the model for the soft measurement method. Aiming to solve this problem, the present study proposes a novel hybrid scheme using a modular neural network (MNN) combined with the factor of weather condition. First, discriminative features among different weather groups are selected to ensure a high accuracy for sample clustering based on weather conditions. Second, the samples are clustered by a density-based clustering algorithm using the discriminative features. Third, the clustered samples are input to each module in the MNN, with the auxiliary variables correlated with BOD prediction input to the corresponding model. Finally, a constructive radial basis function neural network with the error-correction algorithm is used as the model for each subnetwork to predict the BOD concentration. The proposed scheme is evaluated on a standard wastewater treatment platform, Benchmark Simulation Model 1 (BSM1). Experimental results demonstrate the improvement of the proposed scheme in the prediction accuracy for BOD concentration in WWTP. In addition, the training time is shortened and the network structure is compact.


Introduction
Biochemical Oxygen Demand (BOD) is defined as the amount of dissolved oxygen demanded in the process of microbial decomposition of organic materials in water at a specific temperature (usually 20 °C) for a fixed period (usually five days) [1], and it is an important indicator for evaluating the total quantity of biodegradable organic pollutants in water. An accurate measurement of BOD concentration is therefore essential for monitoring the water quality in the wastewater treatment process (WWTP). The most traditional way of measuring BOD concentration is based on biochemical methods, such as the standard dilution and manometric methods. However, these methods are time consuming and BOD cannot be measured in real time [2,3], which directly affects the effective control of WWTP. To realize real-time measurement, methods using BOD microbial sensors have been applied [4,5], but these have the shortcomings of high cost and short lifespan [6].
To avoid the above problems, the soft measurement method has provided a new way to predict BOD concentration in recent years; it uses easy-to-measure variables to predict hard-to-measure variables and has been applied widely in WWTP [7][8][9]. With the development of artificial intelligence technology,

Data Acquisition and Preprocessing
In the present study, Benchmark Simulation Model 1 (BSM1) [24], which is treated as a standard wastewater treatment platform and widely used for the evaluation of the performance of different methods [25][26][27], was used to test the effectiveness of the proposed hybrid scheme.
The input samples to the BSM1 model were downloaded from the website [28] and cover different weather conditions, including dry, rainy, and stormy weather. These samples were imported into the BSM1 model to obtain the effluent samples in WWTP. We acquired 900 samples in total, including 500 samples from dry weather, 300 samples from rainy weather, and 100 samples from stormy weather, and we randomly selected 80% of the samples from each weather condition for training.
To eliminate the dimensional effect, it is necessary to normalize the variables to the same scale. Each variable was normalized to [0,1], with the pth sample of a variable normalized according to Equation (1), where x represents the vector of the variable before normalization, x_p represents its pth sample, and X_p is the value of the pth sample after normalization; mean(x) is the mean value of the variable over all samples, and max(x) and min(x) are its maximum and minimum values.
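The exact normalization equation is not reproduced in the extracted text (the prose also references mean(x)), so the sketch below shows the common min-max scaling to [0,1] as an illustration only:

```python
import numpy as np

def min_max_normalize(x):
    """Scale a 1-D variable vector to [0, 1].

    Illustrative sketch: the paper's exact formula is not reproduced
    in the text (it also mentions mean(x)); this is plain min-max
    scaling, X_p = (x_p - min(x)) / (max(x) - min(x)).
    """
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    return (x - x_min) / (x_max - x_min)

# each variable (column) would be normalized independently
X = min_max_normalize([2.0, 4.0, 6.0, 10.0])
print(X)  # [0.   0.25 0.5  1.  ]
```

In practice the min/max are computed on the training samples and reused for testing samples, so that both share the same scale.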


Architecture of MNN
An MNN was used for BOD prediction in the present study, which consists of four layers: the input layer, the task decomposition layer, the subnetwork layer, and the output layer (Figure 1). The detailed function of each layer is described as follows. (1) The input layer is mainly responsible for importing the samples from the BSM1 model into the network. (2) The task decomposition layer decomposes the input samples into several clusters and allocates them to the respective modules. In this layer, we use a density-based clustering algorithm to decompose the input samples based on weather conditions using the most discriminative feature variables. The proposed scheme is introduced in detail as follows.

Feature Selection Method for Sample Clustering
To improve the results of sample clustering based on weather conditions, a filter-based feature selection method is used to select the most discriminative variables. The task decomposition layer then uses these variables to divide the training samples into different clusters.
A discriminative index (DI) is defined to select the most discriminative features among groups from the input variables, measured by the ratio of the between-group sum of squares (BSS) to the within-group sum of squares (WSS). The DI value for each input variable is calculated according to Equation (2), where X_p represents the normalized value of the pth sample (p = 1, . . . , P, with P the number of samples) and X_k represents the vector of the variable in group k (k = 1, . . . , K, with K the number of groups). In this experiment, the groups were defined by weather conditions, and K was set to 3 accordingly. mean(·) represents the average value, and V is an indicator function: when the pth sample belongs to group k, V equals 1; otherwise, V equals 0.
Note that discriminative variables should have a higher BSS value and a lower WSS value. Therefore, the variables with DI values higher than a predefined threshold α are selected for task decomposition in the MNN.
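The BSS/WSS ratio described above can be sketched as follows (the exact form of Equation (2) is not reproduced in the text, so this follows the standard between-/within-group sum-of-squares definition; the toy data, group labels, and threshold are illustrative):

```python
import numpy as np

def discriminative_index(x, labels):
    """DI for one normalized variable: ratio of between-group to
    within-group sum of squares (BSS/WSS). Higher DI means the
    variable separates the groups better."""
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)
    overall_mean = x.mean()
    bss = wss = 0.0
    for k in np.unique(labels):
        xk = x[labels == k]
        bss += len(xk) * (xk.mean() - overall_mean) ** 2   # between-group
        wss += ((xk - xk.mean()) ** 2).sum()               # within-group
    return bss / wss

# toy example with three "weather" groups
g = [0, 0, 1, 1, 2, 2]
x_separated = [0.10, 0.12, 0.50, 0.52, 0.90, 0.92]  # well separated
x_mixed = [0.1, 0.9, 0.1, 0.9, 0.1, 0.9]            # identical group means

alpha = 3.0  # threshold used in the paper's experiments
print(discriminative_index(x_separated, g) > alpha)  # True
print(discriminative_index(x_mixed, g) > alpha)      # False
```

Variables whose DI exceeds α would then be passed to the task decomposition layer for clustering.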

The Density-Based Clustering Algorithm
In the present study, the task decomposition layer adopted a density-based clustering algorithm [29] to divide the samples into several clusters using the selected discriminative features. It is based on the idea that cluster centers are characterized by a higher density than their neighbors and by a relatively large distance from other samples with higher densities. For each sample, its local density is computed according to Equation (3), where ρ_p is the local density of the pth sample and d_p,i represents the distance between the pth and ith samples. d_c is a cutoff distance and F is an indicator function: when d_p,i − d_c < 0 (i.e., the ith sample lies within the cutoff distance), F equals 1; otherwise, F equals 0. Note that only the relative magnitude of ρ_p is used, and the choice of d_c is thus robust for selecting the samples with higher local densities.

The relative distance is defined as the minimum distance between a sample and any other sample with a higher density. The relative distance of the pth sample is calculated by Equation (4). For the sample with the highest density, its relative distance is set to its maximum distance from any other sample. The samples with both a higher local density and a higher relative distance can be recognized as cluster centers. Therefore, an index γ, the product of the local density and the relative distance, is calculated for the pth sample according to Equation (5). The samples with significantly higher γ values are identified as the cluster centers, and each remaining sample is assigned to the same cluster as its nearest neighbor with a higher local density. Subsequently, the samples from different clusters are input to the respective subnetworks.
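The steps above (local density ρ, relative distance δ, and the center-selection index γ = ρ · δ) can be sketched as follows; this is a minimal illustration of the density-peaks idea from [29], with the cutoff distance, toy data, and the tie-handling guard being assumptions of the sketch rather than details from the paper:

```python
import numpy as np

def density_peaks(X, d_c, n_centers):
    """Density-peaks clustering sketch: rho (cutoff kernel),
    delta (distance to nearest higher-density sample), and
    gamma = rho * delta for picking cluster centers."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    rho = (d < d_c).sum(axis=1) - 1            # exclude the sample itself
    delta = np.empty(n)
    for p in range(n):
        higher = np.where(rho > rho[p])[0]
        delta[p] = d[p].max() if higher.size == 0 else d[p, higher].min()
    gamma = rho * delta
    centers = np.argsort(gamma)[-n_centers:]   # highest-gamma samples
    labels = np.full(n, -1)
    labels[centers] = np.arange(n_centers)
    # assign remaining points, in decreasing-density order, to the
    # cluster of their nearest neighbour of higher density
    for p in np.argsort(-rho):
        if labels[p] >= 0:
            continue
        higher = np.where(rho > rho[p])[0]
        if higher.size == 0:                    # density tie at the top
            labels[p] = labels[centers[np.argmin(d[p, centers])]]
        else:
            labels[p] = labels[higher[np.argmin(d[p, higher])]]
    return labels, gamma

# two well-separated toy blobs, each with one dense peak
A = [(0, 0), (0.1, 0), (-0.1, 0), (0, 0.1), (0, -0.1)]
B = [(3, 3), (3.1, 3), (2.9, 3), (3, 3.1), (3, 2.9)]
labels, gamma = density_peaks(np.array(A + B), d_c=0.15, n_centers=2)
print(labels)  # two clusters matching the two blobs
```

A testing sample can then be assigned by the same rule, using its distances to the labeled training samples.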

Determination of Auxiliary Variables
In this paper, we adopt the mutual information (MI) to select the auxiliary variables that are correlated with the effluent BOD. The MI between the mth input variable and the output variable is defined by Equation (6) as MI(X_m, O) = H(X_m) + H(O) − H(X_m, O), where X_m denotes the sample vector of the mth input variable and O represents the output vector. H(·) represents the entropy and is defined by Equation (7), taking X_m as an example,
where f(X_p,m) is the probability density function of the mth input variable, estimated using the equal-distance histogram method. The joint entropy H(X_m, O) is defined by Equation (8), where f(X_p,m, O_p) denotes the joint probability density function of the mth input variable and the output variable, which is estimated using the equal-distance histogram method as well.
The MI between each input variable and the output variable is calculated, and the variables with MI values higher than a predefined threshold β are selected as auxiliary variables for each subnetwork. Note that the parameter β is set based on the MI values of the input variables, with a moderate value chosen to ensure the selection of auxiliary variables without discarding too many features.
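A histogram-based MI estimate along these lines can be sketched as below; the identity MI(X; O) = H(X) + H(O) − H(X, O) follows the definitions above, while the bin count and the random toy data are assumptions of the sketch (the paper does not state the number of histogram bins):

```python
import numpy as np

def entropy_hist(values, bins=10):
    """Entropy (in nats) from an equal-distance histogram estimate."""
    counts, _ = np.histogram(values, bins=bins)
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log(p))

def mutual_information(x, o, bins=10):
    """MI(X; O) = H(X) + H(O) - H(X, O), with densities estimated by
    equal-distance histograms."""
    joint, _, _ = np.histogram2d(x, o, bins=bins)
    p = joint[joint > 0] / joint.sum()
    h_joint = -np.sum(p * np.log(p))
    return entropy_hist(x, bins) + entropy_hist(o, bins) - h_joint

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 2000)
o_related = x + rng.normal(0, 0.05, 2000)   # strongly related output
o_unrelated = rng.uniform(0, 1, 2000)       # independent output
print(mutual_information(x, o_related) > mutual_information(x, o_unrelated))  # True
```

Variables whose MI with the effluent BOD exceeds β would be kept as the subnetwork inputs; note the numeric scale of the MI values (and hence of β) depends on the binning used.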

Construction of Models in Subnetworks
In the present study, the number of subnetworks in the MNN equals the number of identified clusters. In each subnetwork, we adopt an RBF neural network as the model due to its strong capability for nonlinear system modeling, and the error correction algorithm [30] is utilized to construct a compact structure automatically. We use the auxiliary variables as the input variables to the model. Training samples from different clusters are input to the respective subnetworks. Let X_p = [X_p,1, X_p,2, . . . , X_p,M] denote the pth training sample in one subnetwork, where M is the number of auxiliary variables. The output of the hth hidden neuron in the RBF neural network is calculated by Equation (9), where c_h and σ_h are the center and width of the hth hidden neuron, respectively, and ||·|| represents the Euclidean norm. The network output for the pth sample is computed by Equation (10), where ω_h represents the connection weight between the hth hidden neuron (h = 1, 2, . . . , H, with H the number of hidden neurons) and the network output, and ω_0 is the bias weight.
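The forward pass described above can be sketched as follows; the Gaussian kernel form exp(−||X_p − c_h||² / σ_h²) is the common RBF choice and is assumed here, since Equations (9) and (10) are not reproduced in the text:

```python
import numpy as np

def rbf_forward(X, centers, widths, weights, bias):
    """RBF network forward pass: Gaussian hidden units followed by a
    linear output layer (common form; assumed, not taken verbatim
    from the paper)."""
    # theta[p, h] = exp(-||X_p - c_h||^2 / sigma_h^2)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    theta = np.exp(-d2 / widths**2)
    # y_p = sum_h w_h * theta[p, h] + w_0
    return theta @ weights + bias

# toy network with two hidden neurons and two auxiliary variables
X = np.array([[0.2, 0.4], [0.8, 0.1]])
centers = np.array([[0.2, 0.4], [0.5, 0.5]])
widths = np.array([0.3, 0.3])
weights = np.array([1.0, -0.5])
y = rbf_forward(X, centers, widths, weights, bias=0.1)
```

A hidden neuron's activation equals 1 exactly when the input coincides with its center, which is what the error correction algorithm exploits when placing new centers.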
The procedure for the error correction algorithm in each subnetwork is described as follows:
Step 1: Initialization. The number of hidden neurons is set to zero (H = 0). Set the parameters, including the maximal number of hidden neurons (H_m), the maximal number of iterations, and the desired training RMSE.
Step 2: Calculate the error between the desired and actual outputs for all training samples. The error for the pth sample is defined by Equation (11), where O_p and y_p are the desired and actual outputs for the pth sample. Find the maximal absolute error over all samples; the sample with the maximal value is marked as p_max. Create a new hidden neuron by Equations (12)-(14), with its center set as c_{H+1} = X_{p_max} (Equation (12)), and set H = H + 1.
Step 3: For the given RBF neural network, calculate the Jacobian vector j_p. The quasi-Hessian matrix Q and the gradient vector g are then obtained using Equations (19) and (20),
where η_p is the sub-gradient vector. Then, update the subnetwork parameters using Equation (21), ∆_{t+1} = ∆_t − (Q + µI)^{−1}g, where I is the identity matrix, µ is the combination coefficient, and t is the iteration step; ∆ contains all the parameters to be updated in the subnetwork (Equation (22)). The training process is repeated until the desired training RMSE or the maximal number of iterations is reached.
Step 4: Calculate the training RMSE. If the training RMSE is no more than the desired value or the number of hidden neurons reaches H_m, the construction procedure ends. Otherwise, return to Step 2.

The Proposed Hybrid Scheme Based on MNN
The flowchart of the hybrid scheme proposed in the present study is illustrated in Figure 2. In brief, the DI value is first calculated for each input variable of the training samples to select the variables discriminative of weather conditions. Then, the training samples are clustered using the density-based clustering algorithm, generating n clusters. Subsequently, an MNN consisting of n subnetworks is built, with the auxiliary variables selected based on MI as the inputs, and each subnetwork is self-constructed and trained using the error correction algorithm with the training samples from one cluster. When a testing sample comes in, it is allocated to one cluster, and the auxiliary variables of this testing sample are input to the corresponding subnetwork, with the output of this subnetwork as the predicted BOD concentration.
Appl. Sci. 2020, 10, x FOR PEER REVIEW

Experimental Setup
In the present study, 900 samples (500 samples from dry weather, 300 samples from rainy weather, and 100 samples from stormy weather) were collected from the BSM1 model. We randomly selected 80% of the samples from each weather condition as training samples, and the remaining samples were used for testing. Each sample has 10 variables, with the effluent BOD concentration as the output variable and the remaining variables treated as input variables. All the input and output variables were normalized to [0,1]. Twenty independent trials were conducted to evaluate the performance of the proposed scheme. The experiments were carried out in the MATLAB 9.0 environment running on an Intel(R) Core(TM) i7 CPU (2.00 GHz) with 8.00 GB RAM.

Discriminative Features for Sample Clustering
We set the threshold α to 3, and the variables with DI values higher than the threshold were selected as discriminative features. The average and standard deviation of the DI value for each input variable over 20 independent trials were calculated, as listed in Table 1. We can see that three variables, Q_Inf, S_NH,E, and X_TSS,E, have significantly higher DI values than the other variables, indicating that these variables are the most discriminative among groups. To test the effectiveness of the feature selection method for sample clustering, the distributions of these three features are plotted in Figure 3. We can see that the distributions of these features are discriminative between different weather conditions, especially between the dry weather and the other weather conditions.


Sample Clustering
The training samples were then clustered by the density-based clustering algorithm using the most discriminative features. The cutoff distance d_c was set to 0.1 in the present study. The index γ was calculated for each training sample and sorted in descending order; we plot it from one trial in Figure 4 as an example. Two samples exhibit significantly higher γ values than the others; they are marked in red in the figure and thus considered the cluster centers. Each remaining training sample was assigned to the same cluster as its nearest neighbor with a higher local density. For a testing sample, we calculated the distances between the testing sample and all training samples and assigned it to the same cluster as its nearest neighbor with the highest local density.
To assess the effectiveness of the sample clustering, the average clustering results for the training and testing samples over 20 independent trials are listed in Table 2. For the training samples, the first cluster consists of all samples from the dry weather and only a few samples from the rainy and stormy weather, while the second cluster consists of samples only from the rainy and stormy weather. The clustering accuracy is high for both the training (99.92%) and testing samples (95.25%). These results demonstrate the effectiveness of the proposed scheme for sample clustering based on weather conditions. Since two clusters were obtained, the number of subnetworks in the MNN was determined as 2. The samples after clustering were then input to the corresponding modules in the MNN.


Auxiliary Variables for BOD Prediction
The training samples were then used to select the auxiliary variables by calculating the MI between each input variable and the output variable. The average MI values for all input variables over 20 independent trials are listed in Table 3. In the present study, the threshold β was set to 6.5, and the variables with MI values higher than the threshold were selected. Accordingly, six variables, Q_Inf, S_NO,E, S_NH,E, S_COD,Inf, S_COD,E, and X_TSS,E, were most often selected as the auxiliary variables for BOD prediction and were used as the input variables for each subnetwork to train the model.

Results for BOD Prediction
For the training process of each subnetwork, the learning rate was set to 0.01, and the maximal number of iterations was set to 50. The training process ended when either the number of hidden neurons reached 10 or the training RMSE was no larger than 0.01. The root mean square error, RMSE = sqrt((1/P) Σ e_p²), where e_p is the output error of the pth sample and P is the number of training samples, was used to evaluate the accuracy of BOD concentration prediction. Taking one trial as an example, we plot the training RMSE and the number of hidden neurons during training for each subnetwork in the MNN in Figure 5. We can see that the training RMSE decreases overall, with a sudden jump at the moments when the number of hidden neurons changes. As hidden neurons are added, the training RMSE decreases and finally reaches the desired value, yielding a compact network structure.

After training, each testing sample was presented to the corresponding subnetwork, with its auxiliary variables input to the model, and the output was obtained accordingly. Taking one trial as an example, Figure 6 shows the testing results and testing errors after denormalization, demonstrating the effectiveness of the proposed scheme.
To further evaluate the performance of the proposed scheme, the results were compared with those of other methods, including an MNN without the selection of discriminative features for sample clustering (AllClu-MNN), an MNN without the selection of auxiliary variables, which uses all variables as inputs to the subnetworks (AllIn-MNN), an MNN with the standard RBF neural network as the subnetworks (RBF-MNN), and a single RBF neural network with the error correction algorithm (ErrCor-RBF). The training and testing RMSE, the training time, and the number of hidden neurons in each subnetwork were used as the measurements for performance evaluation, and the average results over 20 independent runs are listed in Table 4. We can see that the proposed scheme outperforms the other models in prediction accuracy, training time, and the compactness of the model structure.
The comparison between the proposed MNN and AllClu-MNN demonstrates the performance improvement brought by the selection of discriminative features. In the proposed hybrid scheme, the discriminative features are used for sample clustering to improve the clustering accuracy, while AllClu-MNN uses all features for sample clustering, which may decrease the clustering accuracy and directly degrade the model performance. Besides, comparing the proposed scheme with AllIn-MNN, the results show that the selection of auxiliary variables can improve the generalization performance and shorten the training time, and the network structure is smaller as well. When the subnetworks in the proposed MNN are replaced with standard RBF neural networks (RBF-MNN), a significant performance deterioration occurs, demonstrating the superiority of using the ErrCor-RBF for the subnetworks. Finally, the single ErrCor-RBF was tested, and the experimental results indicate that the modular structure can improve the generalization performance and significantly shorten the training time.

Performance Evaluation for Different Weather Conditions
To evaluate the performance of the proposed hybrid scheme under different weather conditions, we grouped the testing samples according to their weather-condition labels and calculated the testing RMSE for each weather group (Table 5). We can see that the proposed MNN performs best among the models for each weather condition. For the dry weather, the prediction accuracy is higher than for the other two weather conditions across all compared models; the reason is that the key water parameters exhibit smaller fluctuations in dry weather than in the other two conditions. Better prediction accuracy for the rainy and stormy weather conditions is obtained by the proposed MNN and AllIn-MNN, indicating that using an individual module to model the data from rainy and stormy weather (all the training samples input to the second module are from the rainy and stormy weather, as shown in Table 2) has a positive effect on the prediction accuracy for the effluent BOD concentration. When the single ErrCor-RBF is used to obtain a general model for data from different weather conditions, the prediction accuracy for the effluent BOD concentration decreases for all three weather conditions.

Conclusions
We propose a novel hybrid scheme based on an MNN combined with the factor of weather condition to improve BOD prediction. The effectiveness and superiority of the proposed scheme were evaluated on BSM1. Experimental results demonstrate that the proposed scheme outperforms the other models in the following aspects.
An MNN is used as the model to predict the effluent BOD, with each module modeled using samples clustered by weather condition. Considering that the key parameters in WWTP exhibit different nonlinearities under different weather conditions, the input samples are clustered based on weather conditions and modeled by respective subnetworks. Experimental results demonstrate the superior performance of the proposed scheme for BOD prediction.
Discriminative features are selected to obtain a better clustering accuracy for samples from different weather conditions. These selected features are demonstrated to be discriminative among groups, and the resulting high clustering accuracy ensures a higher prediction accuracy for the effluent BOD.
Auxiliary variables which exhibit high correlations with the effluent BOD are selected as inputs to the model, which can improve the generalization performance, shorten the training time, and make the structure smaller.
The constructive RBF neural network with the error correction algorithm is used for each subnetwork in MNN. On the one hand, the structure of each subnetwork is determined automatically. On the other hand, the superior generalization performance and compact structure are provided by this model compared with the standard RBF neural network.
In summary, the proposed hybrid scheme based on an MNN, in combination with weather conditions, improves the performance of the model, including improved generalization ability, short training time, and a compact structure. BSM1 was used as a standard platform to evaluate the effectiveness of the proposed scheme, which provides guidance for soft measurement in actual wastewater treatment processes in the future.