Predicting the Degree of Dissolved Oxygen Using Three Types of Multi-Layer Perceptron-Based Artiﬁcial Neural Networks

Abstract: Predicting the level of dissolved oxygen (DO) is an important issue for ensuring the sustainability of the inhabitants of a river. A prediction model can estimate the DO level from a historical dataset of water temperature, pH, and specific conductance for a given river. Such a model can be built using sophisticated computational procedures such as multi-layer perceptron (MLP)-based artificial neural networks. Different types of networks can be constructed for this purpose. In this study, the authors constructed three networks, trained by the multi-verse optimizer (MVO), black hole algorithm (BHA), and shuffled complex evolution (SCE), respectively. The networks were trained using datasets collected at the Klamath River Station, Oregon, USA, for the period 2015–2018. We found that the trained networks could predict the DO level of 2019. We also found that both the BHA- and SCE-based networks could predict the DO level using a relatively simple configuration compared to that of the MVO. From the viewpoints of absolute errors and Pearson's correlation coefficient, the MVO- and SCE-based networks performed better than the BHA-based network. In summary, the authors recommend the MVO-trained MLP-based artificial neural network for predicting the DO level of a river.


Problem Statement and Background
As is known, acquiring an appropriate forecast for water quality parameters such as dissolved oxygen (DO) is an important task due to their effects on aquatic health maintenance and reservoir management [1]. Constraints such as the influence of various environmental factors on the DO concentration [2] have driven many scholars to replace conventional models with sophisticated artificial intelligence techniques [3][4][5][6].

Similar Works
Through applying support vector regression (SVR), Liu et al. [46] showed the efficiency of the maximal information coefficient technique used for feature selection in the estimation of DO concentration. The optimized input configuration was considerably more reliable (a 28.65% improvement in terms of root mean square error, RMSE) than the original one. Csábrági et al. [47] showed the appropriate efficiency of three conventional notions of artificial neural networks (ANNs), namely, multi-layer perceptron (MLP), radial basis function (RBF), and general regression neural network (GRNN), for this purpose. Similar efforts can be found in [48,49]. Heddam [50] introduced a new ANN-based model, namely, the evolving fuzzy neural network, as a capable approach for DO simulation in a river ecosystem. The suitability of fuzzy-based models has been investigated in many studies [51]. The adaptive neuro-fuzzy inference system (ANFIS) is another potent data mining technique that has been discussed in many studies [52][53][54]. More attempts regarding the employment of machine learning tools can be found in [55][56][57][58].
Ouma et al. [59] compared the performance of a feed-forward ANN with multiple linear regression (MLR) in simulating DO in the Nyando River, Kenya. It was shown that the correlation of the ANN is considerably greater than that of the MLR (i.e., 0.8546 vs. 0.6199). Zhang et al. [60] combined a recurrent neural network (RNN) with kernel principal component analysis to predict hourly DO concentration. Their suggested model was found to be more accurate than regular data mining techniques, including the feed-forward ANN, SVR, and GRNN, by around 8%, 17%, and 12%, respectively. Additionally, the highest accuracy (coefficient of determination R² = 0.908) was obtained for DO in the upcoming hour. Ali et al. [61] combined a so-called denoising method, namely, "complete ensemble empirical mode decomposition with adaptive noise", with two popular machine learning models, namely, random forest (RF) and extreme gradient boosting, to analyze various water quality parameters. It was shown that the RF-based ensemble is a more accurate approach for the simulation of DO, temperature, and specific conductance. They also proved the viability of the proposed approaches by comparing them with some benchmark tools. Likewise, Ahmed [62] showed the superiority of RF over MLR for DO modeling. He also revealed that water temperature and pH play the most significant roles in this process. Ay and Kişi [63] conducted a comparison among MLP, RBF, ANFIS (sub-clustering), and ANFIS (grid partitioning). The respective R² values of 0.98, 0.96, 0.95, and 0.86 for one station (Number: 02156500) revealed that the outcomes of the MLP are better correlated with the observed DOs.
Synthesizing conventional approaches with auxiliary techniques has led to novel hybrid tools for various hydrological parameters [64][65][66]. Ravansalar et al. [67] showed that linking the ANN with discrete wavelet transform results in an improvement of accuracy (i.e., Nash-Sutcliffe coefficient) from 0.740 to 0.998. A similar improvement was achieved for the SVR applied to estimate biochemical oxygen demand in Karun River, Western Iran. Antanasijević et al. [68] presented a combination of Ward neural networks and a local similarity index for predicting DO in the Danube River. They noted the better performance of the proposed model compared to the multi-site DO evaluative approaches presented in the literature.

Novelty and Objective
Metaheuristic search methods such as teaching-learning-based optimization [69] have provided suitable approaches for intricate problems. Ahmed and Shah [52] suggested three optimized versions of ANFIS using differential evolution, the genetic algorithm (GA), and ant colony optimization for predicting water quality parameters, including electrical conductivity, sodium absorption ratio, and total hardness. In similar research, Mahmoudi et al. [70] coupled SVR with the shuffled frog leaping algorithm (SFLA) for the same objective. Zhu et al. [71] compared the efficiency of the fruit fly optimization algorithm (FOA) with the GA and particle swarm optimization (PSO) for optimizing a least-squares SVR for forecasting the trend of DO. Referring to the obtained mean absolute percentage errors of 0.35%, 1.3%, 2.03%, and 1.33%, the proposed model (i.e., FOA-LSSVR) surpassed the benchmark techniques. In this work, three stochastic search techniques, the multi-verse optimizer (MVO), black hole algorithm (BHA), and shuffled complex evolution (SCE), are used to optimize an MLP neural network for predicting DO using recent data collected from the Klamath River Station. According to Sullivan et al. [72], the reach of interest is classified as having very poor water quality based on the Oregon Water Quality Index. Additionally, the reach of the Keno dam (downstream of the river) is labeled as "water quality limited" for ammonia and dissolved oxygen year-round, as well as pH and chlorophyll a in summer. This clearly highlights the importance of water quality assessments in this area. To the best of the authors' knowledge, up to now, few metaheuristic algorithms have been used for training the ANN in the field of DO modeling (e.g., the firefly algorithm [73] and PSO [74]). Therefore, the models suggested in this study are deemed innovative hybrids for this purpose.

Methodology
The steps of this research are shown in Figure 1. After providing the appropriate dataset, the MLP is submitted to MVO, BHA, and SCE algorithms to adjust its parameters through metaheuristic schemes. During an iterative process, the MLP is optimized to present the best possible prediction of the DO.

The MVO
As is implied by its name, the MVO is obtained from multi-verse theory in physics [75]. According to this theory, there is more than one big bang event, each of which has initiated a separate universe. The algorithm was introduced by Mirjalili et al. [76]. The main components of the MVO are wormholes, black holes, and white holes. The concepts of black and white holes run the exploration phase, while the wormhole concept is dedicated to the exploitation procedure. The pseudo code of the MVO is presented as Algorithm 1.

Algorithm 1. Pseudo code of the MVO [77]
Initialize the parameters (population size, iterations, wormhole existence probability (WEP), travelling distance rate (TDR))
While maximum iteration not reached
    Compute the fitness of each universe
    AU = Sort the population
    BI = Normalize the fitness values
    for i = 2 : N
        Black hole = i
        for j = 1 : size(i)
            Generate r1 as a random value
            if r1 < BI(U_i)
                White hole = RouletteWheelSelection(−BI)
                U(black hole, j) = AU(white hole, j)
            end if
            Generate r2 as a random value
            if r2 < WEP
                Generate r3 and r4 as random values
                if r3 < 0.5
                    Update the position of the universe (Equation (4), "+" case)
                else
                    Update the position of the universe (Equation (4), "−" case)
                end if
            end if
        end for
    end for
end while
Response = Best solution
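The loop above can be sketched in Python as follows. This is a simplified illustration under our own naming conventions (e.g., `mvo_minimize`, `wep_min`), not the authors' implementation; the roulette-wheel step is approximated with fitness-proportional sampling, and the WEP/TDR schedules follow the standard formulations.

```python
import random

def mvo_minimize(f, lb, ub, n_universes=20, max_iter=100,
                 wep_min=0.2, wep_max=1.0, q=6.0):
    """Minimal Multi-Verse Optimizer sketch for minimizing f over box bounds."""
    dim = len(lb)
    U = [[lb[j] + random.random() * (ub[j] - lb[j]) for j in range(dim)]
         for _ in range(n_universes)]
    best, best_fit = None, float("inf")
    for it in range(1, max_iter + 1):
        fits = [f(u) for u in U]
        for u, fit in zip(U, fits):
            if fit < best_fit:
                best, best_fit = u[:], fit
        total = sum(fits)
        norm = [fit / total for fit in fits]  # normalized fitness (lower = better universe)
        # WEP grows and TDR shrinks over the iterations (Equations (5) and (6))
        wep = wep_min + it * (wep_max - wep_min) / max_iter
        tdr = 1.0 - it ** (1.0 / q) / max_iter ** (1.0 / q)
        order = sorted(range(n_universes), key=lambda i: fits[i])
        for i in range(n_universes):
            for j in range(dim):
                if random.random() < norm[i]:
                    # white hole chosen by roulette wheel, biased toward better universes
                    k = random.choices(order, weights=[1 - norm[o] for o in order])[0]
                    U[i][j] = U[k][j]
                if random.random() < wep:
                    # wormhole: travel around the best universe found so far (Equation (4))
                    step = tdr * ((ub[j] - lb[j]) * random.random() + lb[j])
                    U[i][j] = best[j] + step if random.random() < 0.5 else best[j] - step
                    U[i][j] = min(max(U[i][j], lb[j]), ub[j])
    return best, best_fit
```

Running this on a simple sphere function shows the expected behavior: the best fitness decreases as TDR tightens the wormhole radius around the incumbent solution.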
In the MVO, the so-called parameter "rate of inflation" (ROI) is defined for each universe. The objects are transferred from universes with larger ROIs to those with lower values in order to improve the average ROI of the whole cosmos. During an iteration, the universes are sorted with respect to their ROIs, and after a roulette wheel selection (RWS), one of them is deemed the white hole. In this relation, a set of universes can be defined as:

U = [x^1_1 x^1_2 . . . x^1_g ; x^2_1 x^2_2 . . . x^2_g ; . . . ; x^k_1 x^k_2 . . . x^k_g] (1)

where g symbolizes the number of objects and k stands for the number of universes. The jth object in the ith solution is generated according to the equation below:

x^i_j = lb_j + rand() × (ub_j − lb_j) (2)

where ub_j and lb_j denote the upper and lower bounds, and the function rand() produces a uniformly distributed random number. In each repetition, there are two options for x^i_j: (i) it is selected from earlier solutions using RWS, or (ii) it keeps its current value. This can be written as follows:

x^i_j = { x^h_j if rand_1 < Norm(U_i) ; x^i_j otherwise } (3)

In the above equation, U_i stands for the ith universe, Norm(U_i) gives the corresponding normalized ROI, rand_1 is a random value in [0, 1], and x^h_j belongs to the universe chosen by the RWS. Equation (4) expresses the measures considered to deliver the variations of the whole universe. In this sense, the wormholes are supposed to enhance the ROI:

x^i_j = { x_j + TDR × ((ub_j − lb_j) × r_4 + lb_j) if r_2 < WEP and r_3 < 0.5 ; x_j − TDR × ((ub_j − lb_j) × r_4 + lb_j) if r_2 < WEP and r_3 ≥ 0.5 ; x^i_j otherwise } (4)

where x_j signifies the jth component of the best-fitted universe obtained so far, and r_2, r_3, and r_4 are random values in [0, 1]. Moreover, the two parameters WEP and TDR stand for the wormhole existence probability and the travelling distance rate, respectively. Given Iter as the running iteration and Iter_max as the maximum number of iterations, these parameters can be calculated as follows:

WEP = a + Iter × ((b − a) / Iter_max) (5)

TDR = 1 − Iter^(1/q) / Iter_max^(1/q) (6)

where q is the accuracy of exploitation, and a and b are constant pre-defined values [78,79].

The BHA
Inspired by black hole incidents in space, Hatamlou [80] proposed the BHA in 2013. Emerging after the collapse of massive stars, a black hole is distinguished by a huge gravitational power. The stars move toward this mass, and it explains the pivotal strategy of the BHA for achieving an optimum response. A randomly generated constellation of stars represents the initial population. Based on the fitness of these stars, the most powerful one is deemed as the black hole that will absorb the surrounding ones. The pseudo code of the BHA is presented as Algorithm 2.

Algorithm 2. Pseudo code of the BHA [81]
Initialize the stars x_i
Initialize the parameters (fitness function and iterations (Iter))
Select the best-fitted star (x_b) as the black hole (BH)
While maximum Iter not reached
    Move the stars toward the BH (Equation (7))
    If a star reaches a better position than the BH, exchange their positions
    Replace any star crossing the event horizon (Equation (8)) with a new random star
end while
Response = BH

In this procedure, the positions change according to the relationship below:

x_i(Iter + 1) = x_i(Iter) + rand × (x_BH − x_i(Iter)), i = 1, 2, . . . , Z (7)

where rand is a random number in [0, 1], x_BH is the black hole's position, Z is the total number of stars, and Iter symbolizes the iteration number.

Once the fitness of a star surpasses that of the black hole, they exchange their positions. In this regard, Equation (8) calculates the radius of the event horizon for the black hole:

R = F_BH / Σ_{i=1}^{Z} F_i (8)

where F_i is the fitness of the ith star, and F_BH is the value for the black hole [82].
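The star-movement and event-horizon rules above can be sketched as a short Python routine. This is an illustrative minimization sketch under our own naming (e.g., `bha_minimize`), not the authors' code; the swap between a better star and the black hole is handled implicitly by re-selecting the black hole at the top of each iteration.

```python
import random

def bha_minimize(f, lb, ub, n_stars=25, max_iter=200):
    """Minimal Black Hole Algorithm sketch for minimizing f over box bounds."""
    dim = len(lb)
    stars = [[lb[j] + random.random() * (ub[j] - lb[j]) for j in range(dim)]
             for _ in range(n_stars)]
    for _ in range(max_iter):
        fits = [f(s) for s in stars]
        bh = min(range(n_stars), key=lambda i: fits[i])  # best star acts as the black hole
        radius = fits[bh] / sum(fits)                    # event horizon, Equation (8)
        for i in range(n_stars):
            if i == bh:
                continue
            # Equation (7): pull each star toward the black hole by a random fraction
            stars[i] = [stars[i][j] + random.random() * (stars[bh][j] - stars[i][j])
                        for j in range(dim)]
            # a star crossing the event horizon is swallowed and re-born at random
            dist = sum((stars[i][j] - stars[bh][j]) ** 2 for j in range(dim)) ** 0.5
            if dist < radius:
                stars[i] = [lb[j] + random.random() * (ub[j] - lb[j]) for j in range(dim)]
    fits = [f(s) for s in stars]
    bh = min(range(n_stars), key=lambda i: fits[i])
    return stars[bh], fits[bh]
```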

The SCE
Originally proposed by Duan et al. [82], the SCE has been efficiently used for dealing with high-dimensional optimization problems. The SCE can be defined as a hybrid of the complex shuffling and competitive evolution concepts with the strengths of the controlled random search strategy. This algorithm benefits from a deterministic strategy to guide the search; additionally, utilizing random elements results in a flexible and robust algorithm. The pseudo code of the SCE is presented as Algorithm 3. The SCE is implemented in seven steps. Assuming N_C as the number of complexes and N_P as the number of points in one complex, the sample size of the algorithm is S = N_C × N_P, where N_C ≥ 1 and N_P ≥ 1 + the number of design variables. Next, the samples x_1, x_2, . . . , x_S are created in the feasible space (i.e., within the bounds), and their fitness values are calculated. In the third step, these samples are sorted with reference to their fitness, and an array D = {x_i, f_i, i = 1, 2, . . . , S} is considered for storing them. This array is then divided into N_C complexes (C^1, C^2, . . . , C^{N_C}), each of which contains N_P samples:

C^k = { (x^k_j, f^k_j) | x^k_j = x_{k+N_C(j−1)}, f^k_j = f_{k+N_C(j−1)}, j = 1, 2, . . . , N_P } (9)

In the fifth step, each complex is evolved by the competitive complex evolution algorithm. Later, in a process named the shuffling of the complexes, all complexes are replaced in array D, which is then re-sorted based on the fitness values. Lastly, the algorithm checks the stopping criteria that terminate the process [84].
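The partition of Equation (9) deals the sorted samples out to the complexes like cards, so each complex receives points from across the fitness spectrum. A small Python sketch (our own function name, `sce_partition`) illustrates this step:

```python
def sce_partition(points, fitnesses, n_complexes):
    """Equation (9)-style partition: after sorting by fitness (ascending for
    minimization), the sample of rank r goes to complex r mod n_complexes."""
    order = sorted(range(len(points)), key=lambda i: fitnesses[i])
    complexes = [[] for _ in range(n_complexes)]
    for rank, idx in enumerate(order):
        complexes[rank % n_complexes].append(points[idx])
    return complexes
```

With six samples and two complexes, each complex receives three points, and the best-fitted sample always lands in the first complex.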

Accuracy Criteria
The quality of the results is lastly evaluated using Pearson's correlation coefficient (R_P) along with the mean absolute error (MAE) and RMSE. They analyze the agreement and difference between the observed and predicted values of a target parameter. In the present work, given DO_i^predicted and DO_i^observed as the predicted and observed DOs, the R_P, MAE, and RMSE are expressed by the following equations:

R_P = Σ_{i=1}^{K} (DO_i^observed − mean(DO^observed)) × (DO_i^predicted − mean(DO^predicted)) / [ sqrt(Σ_{i=1}^{K} (DO_i^observed − mean(DO^observed))^2) × sqrt(Σ_{i=1}^{K} (DO_i^predicted − mean(DO^predicted))^2) ] (10)

MAE = (1/K) Σ_{i=1}^{K} |DO_i^observed − DO_i^predicted| (11)

RMSE = sqrt( (1/K) Σ_{i=1}^{K} (DO_i^observed − DO_i^predicted)^2 ) (12)

where K signifies the number of compared pairs.
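The three criteria above are standard and can be computed directly; the following helper (our own function name, `rp_mae_rmse`) mirrors their definitions:

```python
import math

def rp_mae_rmse(observed, predicted):
    """Pearson's R_P, MAE, and RMSE for K observed/predicted pairs.
    Assumes both series are non-constant (otherwise R_P is undefined)."""
    K = len(observed)
    mo = sum(observed) / K
    mp = sum(predicted) / K
    cov = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    so = math.sqrt(sum((o - mo) ** 2 for o in observed))
    sp = math.sqrt(sum((p - mp) ** 2 for p in predicted))
    rp = cov / (so * sp)
    mae = sum(abs(o - p) for o, p in zip(observed, predicted)) / K
    rmse = math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / K)
    return rp, mae, rmse
```

Note that a prediction offset by a constant still yields R_P = 1 while MAE and RMSE equal the offset, which is why the paper reports agreement and error measures side by side.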

Data
As a matter of fact, intelligent models should first learn the pattern of the intended parameter in order to predict it. This learning process is carried out by analyzing the dependence of the target parameter on some independent factors. In this work, DO is the target parameter, and water temperature (WT), pH, and specific conductance (SC) are the independent factors. This study uses data belonging to a US Geological Survey (USGS) station on the Klamath River (station number: 11509370). As Figure 2 illustrates, this station is located in Klamath County, Oregon. Pattern recognition is carried out by means of data obtained between 1 October 2014 and 30 September 2018. After training the models, the DO for the subsequent year (i.e., from 1 October 2018 to 30 September 2019) is predicted. Since the models have not seen these data, the accuracy of this process reflects their capability to predict DO under unseen conditions. Hereafter, these two groups are referred to as the training data and testing data, respectively. Figure 3 depicts DO vs. WT, pH, and SC for the (a, c, and e) training and (b, d, and f) testing data. Based on the available data for the mentioned periods, the training and testing groups contain 1430 and 352 records, respectively. The statistical description of these datasets is presented in Table 1.

Moreover, the effect of the inputs on DO is investigated using a tree-based ensemble method. To this end, a bagged ensemble of 200 regression trees is implemented, and the outcome is reported as an importance value. Figure 4 shows the results. As can be seen, the importance of SC is smaller than that of WT, and the importance of WT is smaller than that of pH. In other words, pH has the greatest impact on DO concentration in this case.
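As a simple stand-in for the paper's bagged-tree importance measure, input relevance can also be gauged by permutation importance: shuffle one input column and record how much the model's error grows. The sketch below (our own function name, `permutation_importance`, with a hypothetical toy model) illustrates the idea rather than reproducing the authors' procedure:

```python
import random

def permutation_importance(model, X, y, metric, n_repeats=10):
    """Importance of each input column as the average increase in error
    (per `metric`, lower-is-better) when that column is shuffled."""
    base = metric(y, [model(row) for row in X])
    n_features = len(X[0])
    importances = []
    for j in range(n_features):
        drops = []
        for _ in range(n_repeats):
            col = [row[j] for row in X]
            random.shuffle(col)
            Xp = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
            drops.append(metric(y, [model(row) for row in Xp]) - base)
        importances.append(sum(drops) / n_repeats)
    return importances
```

An input the model ignores scores exactly zero, while an input the model relies on scores strictly positive, which matches the ranking logic behind Figure 4.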

Optimization and Weight Adjustment
As explained, the proposed hybrid models are designed so that the MVO, BHA, and SCE algorithms are responsible for adjusting the weights and biases of the MLP. To do this, a raw MLP structure (intended to predict DO from WT, pH, and SC) is given as the optimization problem to the mentioned algorithms. In this work, a three-layered MLP with three neurons in the input layer (one per input), six neurons in the middle layer, and one neuron (for DO) in the last layer was considered; the value of 6 was determined after a trial-and-error process. This is schematically shown in Figure 5. The overall formulation of a neuron can be expressed as follows:

R_N = f( Σ (W × I_N) + b )

where f(x) is the activation function used by the neurons in a layer, W and b are the connecting weights and the bias, and R_N and I_N denote the response and inputs of neuron N, respectively. With a large number of these equations, each algorithm first suggests a stochastic response for the W and b values. In the subsequent iterations, the algorithms improve this response in order to build a more accurate MLP. The formulation of a trained MLP network is presented at the end of the study to better illustrate this concept.

The created hybrids are implemented with different population sizes (N_Pop) of the trainer algorithm to achieve the best results. Figure 6 shows the values of the objective function obtained for N_Pop values of 10, 25, 50, 75, 100, 200, 300, 400, and 500. In the case of this study, the objective function is the RMSE criterion. Figure 6 shows that, unlike the SCE, which gives higher-quality training with small N_Pop values, the MVO performs better with the three largest N_Pop values. The BHA, however, did not show any specific trend. Overall, the MVO, BHA, and SCE with N_Pop values of 300, 50, and 10, respectively, could adjust the MLP parameters with the lowest error. As stated, metaheuristic algorithms minimize the error in an iterative process.
Figure 7 shows the convergence curves plotted for the selected configurations of the MLP-MVO, MLP-BHA, and MLP-SCE. To this end, the training RMSE is calculated over a total of 1000 iterations. According to Figure 7, the optimum values of the objective function are 1.314816444, 1.442582978, and 1.33041779 for the MLP-MVO, MLP-BHA, and MLP-SCE, respectively. These configurations are applied in the next section to predict DO, and their results are then evaluated for accuracy assessment.

As stated previously, the quality of the testing results shows how successful a trained model can be in confronting new conditions. The data of the fifth year were considered as these conditions in this study. Figure 10 depicts the histograms of the testing errors. In these charts, µ stands for the mean error, and σ represents the standard deviation of the errors. In this phase, the RMSEs of 1.3187, 1.4647, and 1.3085, along with the MAEs of 1.0161, 1.1997, and 1.0122, imply the power of the used models in dealing with unseen data. This means that the weights (and biases) determined in the previous section have successfully mapped the relationship between DO and WT, pH, and SC for the second phase. From the comparison point of view, unlike in the training phase, the SCE-based hybrid outperformed the MLP-MVO. The MLP-BHA, however, again presented the poorest prediction of DO.

The third accuracy indicator (i.e., R_P) shows the agreement between the predicted and observed DO rates. This index ranges within [−1, +1], where −1 (+1) indicates a totally negative (positive) correlation, and 0 means no correlation. Figure 11 shows a scatterplot for each model containing both training and testing results. As can be seen, all outputs are positively aggregated around the best-fit line (i.e., the black line).
For the training results, the R_P values of 0.8808, 0.8545, and 0.8778 indicate the higher consistency of the MLP-MVO results, while the values of 0.8741, 0.8453, and 0.8775 demonstrate the superiority of the MLP-SCE in the testing phase.

Cross-Validation
In order to further verify the potential of the implemented models, the trained models (i.e., the MVO, BHA, and SCE, with N_Pop values of 300, 50, and 10, respectively) are applied to another station called Fanno Creek (station number: 14206950). Similar to the Klamath River station, it is located in Oregon, but in the northern part of the state (longitude: 122°45′13″ W, latitude: 45°24′13″ N). The models predicted the DO for the water year 2019 using the daily records of WT, SC, and pH. Notably, the measured DO ranged from 5.40 to 13.20 mg/L at this site.
The results are shown in Table 2. As can be seen, the performance of all three models is satisfactory, and the predicted values are well correlated with the observed DO rates. This indicates that the models can be used for new prediction cases. However, the same combination of inputs (i.e., WT, pH, and SC) should be retained for all study cases.

Further Discussion
Due to the water quality situation in the Klamath River, this study was dedicated to suggesting novel predictive tools for analyzing the DO concentration in this reach. Notably, this river is important for irrigation, flows, and hydropower generation [72]. It was discussed that DO should be evaluated in response to different key parameters. Another difficulty is that the DO does not follow a certain pattern over time. Hence, it is necessary to tackle these complexities by selecting appropriate models.
Many scholars have favored hybrid methods for similar issues (e.g., sediment concentration [85] and salinity [86] predictions). The appeal of such models lies in the use of an optimization technique in the role of a trainer algorithm. In the case of DO modeling in this study, each of the MVO, BHA, and SCE algorithms drove an MLP neural network. In other words, their specific solutions were employed to analyze the relationship between DO and the influential parameters through a neural framework.
The high level of accuracy achieved shows that DO can be promisingly predicted. However, some suggestions could yield even more efficient solutions. The first is related to methodology. Optimization using metaheuristic techniques needs to be carefully monitored to select appropriate parameters. For example, the number of iterations and the population size should also be optimized against runtime, since time can be as important as accuracy; the focus of this study, however, was mostly on the accuracy of prediction (Figure 6). Another idea in this regard is to test different metaheuristic techniques to find the quickest [87]. A further potential direction would be conducting comparisons with benchmark machine learning solutions such as the ANFIS and SVR, as well as their hybridized versions.
The dataset is also a key item. Relevant factors include optimizing the number of input parameters, selecting an appropriate time period, and pre-processing to remove misleading samples, among others. In this work, a valid 5-year dataset with three inputs was used, and the results showed that it provided sufficient samples for the algorithms to analyze. Since the models could predict the DO of the testing period without prior knowledge of it, they can be used for further unseen events. Additionally, these results were achieved with only three influential parameters. Such simplicity helps avoid complicated simulations and reduces the cost of computation.

SCE-Based Formula
In this section, the formulation of the MLP-SCE model is presented owing to its superiority in the prediction phase as well as its simpler configuration compared to the MVO and BHA. Moreover, considering time efficiency, the SCE could optimize the ANN in a meaningfully shorter time: the elapsed times were approximately 5980, 675, and 531 s for the selected configurations of the MLP-MVO, MLP-BHA, and MLP-SCE, respectively (on a Core i7 CPU (1.8 GHz) with 16 GB of RAM).
This formula can serve as a predictive equation that estimates DO for given values of WT, pH, and SC. The model predicts DO based on Equation (14), in which Input stands for the three inputs, and IW and b1 are the corresponding weight matrix and bias vector of the hidden layer, respectively. LW and b2 symbolize the same quantities for the output layer (Figure 5). These numbers are optimally tuned by the SCE for the MLP so that the lowest training error is achieved (Figure 7). Additionally, f(x) (i.e., the activation function) is presented in Equation (20). It is worth noting that, due to the neural network mechanism, this formula must be fed with normalized data.
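Since the formula must be fed with normalized data, a pre- and post-processing step is needed around Equation (14). The helpers below sketch a min-max mapping to [−1, 1]; the choice of this particular range is our assumption (the paper only states that normalized data are required), and the vmin/vmax values would come from the training-data ranges in Table 1:

```python
def normalize(v, vmin, vmax):
    # Map a raw measurement (e.g., WT in degrees C) to [-1, 1]
    # using the training-data range.
    return 2.0 * (v - vmin) / (vmax - vmin) - 1.0

def denormalize(v, vmin, vmax):
    # Invert the mapping to recover DO in mg/L from the network output.
    return (v + 1.0) * (vmax - vmin) / 2.0 + vmin
```

In practice, each of WT, pH, and SC is normalized before being multiplied by IW, and the output of Equation (14) is denormalized with the DO range to obtain mg/L.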

Conclusions
This research points out the suitability of metaheuristic strategies for analyzing the relationship between DO and three influential factors (WT, pH, and SC) through the principles of a multi-layer perceptron network. The algorithms used were the multi-verse optimizer, black hole algorithm, and shuffled complex evolution, which showed high applicability for optimization objectives. A finding of this study was that while the MVO needs N_Pop = 300 to give proper training to the MLP, the two other algorithms can do this with smaller populations (N_Pop values of 50 and 10). According to the findings of the training phase, the MVO achieved a more profound understanding of the mentioned relationship. The RMSE of this model was 1.3148, which was smaller than those of the MLP-BHA (1.4426) and MLP-SCE (1.3304). However, different results were observed in the testing phase, where the SCE-based model achieved the highest accuracy (the R_P values were 0.8741, 0.8453, and 0.8775 for the MLP-MVO, MLP-BHA, and MLP-SCE, respectively). All in all, the authors believe that the tested models can serve as promising tools for predicting DO. However, assessing other metaheuristic techniques and other hybridization strategies is recommended for future studies.