Water Transparency Prediction of Plain Urban River Network: A Case Study of Yangtze River Delta in China

Abstract: Water transparency is commonly used to indicate the combined effect of hydrodynamics and the aquatic environment on water quality throughout a river network. However, how water transparency responds to these factors, and in particular their complicated nonlinear relationship, still needs to be explored; this study therefore analyzes the Suzhou urban river network. Using an artificial neural network (ANN) hydrological model and a multiple linear regression (MLR) model with in-situ data from 2013 to 2019, we investigated the sensitivity of water transparency, measured as Secchi depth (SD), to six factors: flow velocity and five categories of water-quality monitoring data, namely total suspended solids (TSS), water temperature (TE), dissolved oxygen (DO), chlorophyll (Chl) and chemical oxygen demand (COD). The results suggest that the ANN model performs better than the MLR model. Furthermore, the results show a well-established correlation between enhanced hydrodynamics and improved water transparency when the flow velocity ranged from 0.22 to 0.45 m/s. Overall, COD is a vital factor for SD prediction because including it notably improves the ANN model (correlation coefficient of 0.918). This study demonstrates that the ANN model with hydrodynamic and water quality parameters predicts water transparency better than the other models discussed for a coastal plain urban river network.


Introduction
Urbanization has accelerated the degradation of urban aquatic ecosystems, and the associated ecological issues have been recognized in China [1][2][3]. To mitigate these effects and improve urban river health, large-scale water-clearing regulations have been instituted for plain urban river networks, such as the one that empties into the Yangtze River Delta [4,5]. Water transparency is a commonly used indicator of water quality [6] because it incorporates physical, chemical, and biological processes [7,8] (e.g., flow rate, nutrient cycling, and phytoplankton photosynthesis, respectively [9][10][11]). Moreover, it is a key indicator for measuring the effect of ecological restoration [12,13] and is widely valued in China when constructing urban water environments [14][15][16][17][18]. Additionally, studies have shown that the three abovementioned processes also relate to geographical location [19][20][21]. For instance, water transparency is dominated by dynamic conditions associated with wind, waves, and human activities in shallow areas such as the Yangtze Delta. Other environmental water quality indicators include chlorophyll-a (Chl-a) [22,23]; nutritional status [24]; total phosphorus [25]; and sediment resuspension.
Table 1. Selected studies for different Secchi depth (SD) prediction models.
Sustainability 2021, 13, x FOR PEER REVIEW 3 of 15
environmental parameters from 2013 to 2019 to assess the response of SD to the large-scale water clearing regulation in the Yangtze Delta. Selected input candidates for the machine learning model include a hydrodynamic condition index and water environmental factors: surface velocity (V), total suspended solids (TSS) concentration, dissolved oxygen (DO) concentration, near-surface chlorophyll (Chl) concentration, chemical oxygen demand (COD) concentration, and water temperature (TE).
The objectives of this study are:
• To evaluate the big data analysis and self-learning ability of the developed machine learning model in SD prediction for a plain urban river network with long-term field observations;
• To compare the SD prediction performance of a machine learning model and a regression model, to provide a better prediction model and highlight suitable parameters.

Overview of the Study Area
The study area includes a plain urban river network throughout Suzhou city in the Yangtze River Delta, located in the southeastern part of Jiangsu province (city centre: 31.19° N, 120.37° E, altitude from −998.09 m to 611.9 m) (Figure 1). The river network across the city is 34.72 km long, with a river catchment area of 14.2 km². Its water depth varies between 2.8 and 3.2 m all year round [41]. Complicated water systems and flat terrain result in poor hydrodynamics of the river network [42,43]. Furthermore, the river network is the primary water resource, as water is conveyed from Lake Taihu through this river.

Field Observation
Fourteen monitoring sites were set up in the study area to monitor hydrodynamic and water quality conditions, and the corresponding indicators were collected across the river network. Observations were recorded at 10 a.m. between March 2013 and December 2019. Hydrodynamic data include the flow rate (Q) and water depth (H), measured with installed instruments: a fixed acoustic Doppler flowmeter and an electronic water gauge (MXT04, China). The on-site data were then standardized; detailed explanations are given in the following subsections. Gridded Secchi-disk transparency (SD) and other water quality parameters were collected once a week from the Suzhou Ecological Environment Bureau (http://sthjj.suzhou.gov.cn/, accessed on 31 December 2019).

Flow Data Processing Method
Following previous studies on predicting the concentrations of chemical and biological water quality indicators in river networks, a method that accounts for the cumulative effect of previous flow on the indicators was adopted [44]. In this method, $d$ is the discount coefficient, which ranges from 0 to 1: a value of 1 indicates that the current water quality parameters in the river network are determined entirely by previous flow, and a value of 0 indicates the opposite. Here, $i$ indexes the time step, $j$ is the total number of observations, and $Q_i$ is the flow at time step $i$. The value of $d$ is generally set to 0.95, because the stepped pattern of flow rate during water transfer gives the most weight to cumulative changes in water quality in the river network. The weekly mean, maximum, and minimum SD were calculated from the observations and are referred to as SD_WA, SD_MX, and SD_MI, respectively. The mean, maximum, and minimum of the other parameters over the monitoring period are denoted X_mean, X_max, and X_min. Considering the practicability and accuracy of the model, appropriate parameters were chosen for SD prediction based on their correlation coefficients.
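The idea of discounting past flow can be sketched as follows. This is a minimal illustration that assumes a recursive, exponentially weighted form in which each step keeps a fraction d of the accumulated past flow and adds (1 - d) of the newest observation. The recursion itself, the function name `discounted_flow`, and the sample flows are assumptions made for illustration; the exact expression used in [44] may differ.

```python
def discounted_flow(flows, d=0.95):
    """Exponentially discounted flow series (assumed form, not taken from [44]).

    flows: flow observations Q_1..Q_j in time order.
    d: discount coefficient in [0, 1]; larger d gives more weight to past flow.
    """
    if not 0.0 <= d <= 1.0:
        raise ValueError("d must lie in [0, 1]")
    out = []
    acc = None
    for q in flows:
        # The first step has no history, so it starts from the observation itself.
        acc = q if acc is None else d * acc + (1.0 - d) * q
        out.append(acc)
    return out


if __name__ == "__main__":
    weekly_q = [10.0, 12.0, 30.0, 11.0]  # hypothetical flow observations
    print(discounted_flow(weekly_q, d=0.95))
```

With d = 1 the series never leaves its initial value (only previous flow counts), and with d = 0 it reproduces the raw observations, which matches the interpretation of the two limits given above.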

Multiple Linear Regression Method
The multiple linear regression (MLR) model predicts or estimates the dependent variable from the optimal combination of multiple candidate independent variables with a set of coefficients. In this study, there are several potential predictors of SD in the study area, and the model can be described as:

$$y = \beta_0 + \sum_{i=1}^{n} \beta_i x_i$$

where $y$ is the dependent variable (SD), $x_i$ denotes the independent parameters (hydrodynamic and water quality parameters), $\beta_i$ denotes the coefficients obtained from multiple linear regression, $\beta_0$ is the intercept, and $n$ is the number of predictors. The equation was converted to a logarithmic scale for the MLR model.
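A minimal sketch of an MLR fit on a logarithmic scale is given below. It assumes that both the response and the predictors are log-transformed (a log-log model), which the text does not specify, and it solves the least-squares normal equations directly. The function names `solve_linear_system`, `fit_log_mlr` and `predict_log_mlr` and the sample values are hypothetical.

```python
import math


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_log_mlr(xs, ys):
    """Least-squares fit of ln(y) = b0 + sum_i b_i * ln(x_i).

    xs: list of rows, each a list of strictly positive predictors.
    ys: list of strictly positive responses (e.g. SD).
    Returns [b0, b1, ..., bn].
    """
    design = [[1.0] + [math.log(v) for v in row] for row in xs]
    target = [math.log(y) for y in ys]
    p = len(design[0])
    # Normal equations: (X^T X) beta = X^T y.
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(design, target)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def predict_log_mlr(beta, row):
    """Back-transform the log-scale prediction to the original scale."""
    log_y = beta[0] + sum(b * math.log(v) for b, v in zip(beta[1:], row))
    return math.exp(log_y)


if __name__ == "__main__":
    # Hypothetical rows of (velocity, TSS) and SD values, for illustration only.
    xs = [[0.20, 30.0], [0.30, 25.0], [0.40, 18.0], [0.25, 40.0], [0.45, 15.0]]
    ys = [35.0, 42.0, 55.0, 30.0, 60.0]
    beta = fit_log_mlr(xs, ys)
    print(beta, predict_log_mlr(beta, [0.35, 20.0]))
```

Solving the normal equations is adequate for a handful of predictors; for strongly correlated inputs a QR- or SVD-based solver would be numerically safer.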

Artificial Neural Networks
Artificial neural networks (ANN) are a machine learning technique generally employed to uncover unknown relationships in large volumes of information; they handle complex nonlinear features in big datasets and perform classification and regression [45]. Inspired by the principle of neurons in the human brain, artificial neurons are arranged in layers, each containing numerous neurons. The layers are grouped into three categories: the input layer, the hidden layer, and the output layer. The ANN models employed in this study have one hidden layer with a sigmoid activation function, which is often used in biology because it is smooth and easy to differentiate. Applications of ANN models in environmental research include estimating the reference evapotranspiration (ET0) in a river ecosystem [46], predicting the total phosphorus (TP) concentration in the overlying water of the Huai River based on hydrological and hydrodynamic parameters [47], predicting the total dissolved gas (TDG) downstream of dam spillways [48], and predicting the algae distribution in a large shallow lake based on a wind speed index [49]. However, there are few studies on the application of ANN to SD prediction.
The number of neurons in the input layer corresponds to the number of input parameters. The hidden layer is the most important part of the ANN model for predicting SD; each of its neurons computes the weighted sum of the inputs and adds a deviation value (threshold). The running process of the ANN model (Figure 2) can be presented as:

$$A_i = \sum_{j=1}^{h} w_{ij} x_j + B_1$$

Here, $A_i$ is the weighted sum of the $i$th hidden neuron, $x_j$ is the $j$th input, $h$ is the total number of inputs, $w_{ij}$ is the weight of the connection from the $j$th input to the $i$th hidden neuron, and $B_1$ is the deviation term of each neuron in the hidden layer. The output of the $i$th hidden neuron is given by:

$$H_i = f(A_i)$$

The adopted activation function is the sigmoid function:

$$f(A) = \frac{1}{1 + e^{-A}}$$

The ANN output is given by:

$$O = \sum_{i=1}^{m} w_{ih} H_i + B_2$$

where $O$ is the ANN output (the predicted SD), $w_{ih}$ is the weight of the connection from the $i$th hidden neuron to the output neuron, $m$ is the total number of hidden neurons, and $B_2$ is the deviation term.
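The forward pass of a network with one sigmoid hidden layer, as described for the ANN model, can be sketched as follows. This is an illustrative sketch only: the function names, the linear output neuron, and the random (untrained) weights are assumptions, and no training step is shown.

```python
import math
import random


def sigmoid(a):
    """Logistic sigmoid f(a) = 1 / (1 + exp(-a)), written to avoid overflow."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def ann_forward(x, w_hidden, b_hidden, w_out, b_out):
    """Forward pass of a one-hidden-layer network with a linear output.

    x: list of h inputs.
    w_hidden: m rows of h weights (input -> hidden).
    b_hidden: m hidden-layer deviation (bias) terms.
    w_out: m weights (hidden -> output).
    b_out: output deviation (bias) term.
    """
    hidden = []
    for weights, bias in zip(w_hidden, b_hidden):
        a = sum(w * xi for w, xi in zip(weights, x)) + bias  # weighted sum
        hidden.append(sigmoid(a))                            # hidden output
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out


if __name__ == "__main__":
    rng = random.Random(0)
    n_inputs, n_hidden = 4, 15  # e.g. four standardized inputs, 15 hidden neurons
    w_hidden = [[rng.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_hidden)]
    b_hidden = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    w_out = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    b_out = 0.0
    x = [0.3, -1.2, 0.5, 0.8]  # hypothetical standardized inputs
    print(ann_forward(x, w_hidden, b_hidden, w_out, b_out))
```

In practice the weights and deviation terms would be fitted to the training data (for example by backpropagation); the sketch only shows how a prediction is computed once they are known.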

Random Variables Regression Model
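In this study, we investigate the linear relationship between flow velocity and SD, which is given by:

$$y = a + bx + \varepsilon$$

where $x$ is the flow velocity, $y$ is the SD, $a$ and $b$ are fitting parameters, and $\varepsilon$ is an error term. In most cases $\varepsilon$ follows a zero-mean normal distribution, $\varepsilon \sim N(0, \sigma^2)$, and thereby $y \sim N(a + bx, \sigma^2)$. The maximum likelihood method (MLM) was used to estimate the parameters $a$ and $b$ [50]. For a given sample set $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, the joint density $L$ was calculated by:

$$L = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(y_i - a - b x_i)^2}{2\sigma^2}\right]$$

The goal of the MLM is to obtain the maximum of $L$, i.e., the minimum of:

$$Q(a, b) = \sum_{i=1}^{n} (y_i - a - b x_i)^2$$

To obtain the minimum of $Q(a, b)$, the following equations must be satisfied:

$$\frac{\partial Q}{\partial a} = -2\sum_{i=1}^{n} (y_i - a - b x_i) = 0, \qquad \frac{\partial Q}{\partial b} = -2\sum_{i=1}^{n} (y_i - a - b x_i)\,x_i = 0$$

Solving these gives the estimators $\hat{a}$ and $\hat{b}$:

$$\hat{b} = \frac{S_{xy}}{S_{xx}}, \qquad \hat{a} = \bar{y} - \hat{b}\,\bar{x}$$

where:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \quad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i, \quad S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2, \quad S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

This solution equals that of the least squares method when $y_i$ follows the normal distribution. The unbiased estimator $\hat{\sigma}^2$ is calculated as follows:

$$\hat{\sigma}^2 = \frac{Q_e}{n - 2}$$

where:

$$Q_e = \sum_{i=1}^{n} (y_i - \hat{a} - \hat{b} x_i)^2$$

Determining SD with this function differs from common regression. For a given $x_0$, an interval of SD $y_0$ with a confidence level of $1 - \alpha$, instead of a unique value, is predicted by:

$$\hat{a} + \hat{b} x_0 \pm t_{\alpha/2}(n - 2)\,\hat{\sigma}\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}$$

where $t_{\alpha/2}(n - 2)$ is the upper $\alpha/2$ quantile of the t-distribution with $n - 2$ degrees of freedom. The length of the predicted interval is therefore also a function of flow velocity. Following this idea of random variable regression, the support vector machine (SVM) is introduced to explore the relationship between SD and flow velocity in the urban river network, aiming to verify the rationality of using flow velocity as a predictive model parameter in this study.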
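The estimation and interval prediction for a straight-line model can be sketched as below, using the standard least-squares estimators (which coincide with the maximum likelihood estimators under normal errors). The function names and sample data are hypothetical, and the t critical value is supplied by the caller (for instance from a statistical table) rather than computed, since the Python standard library has no t-distribution.

```python
import math


def fit_line(xs, ys):
    """Least-squares (Gaussian maximum-likelihood) fit of y = a + b x."""
    n = len(xs)
    if n < 3:
        raise ValueError("need at least three points to estimate the variance")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    # Unbiased residual variance uses n - 2 degrees of freedom.
    q_e = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))
    sigma2 = q_e / (n - 2)
    return {"a": a, "b": b, "sigma2": sigma2, "n": n, "x_bar": x_bar, "s_xx": s_xx}


def prediction_interval(fit, x0, t_crit):
    """Interval for a new observation at x0.

    t_crit: upper alpha/2 quantile of Student's t with n - 2 degrees of
    freedom, supplied by the caller.
    """
    centre = fit["a"] + fit["b"] * x0
    half = t_crit * math.sqrt(fit["sigma2"]) * math.sqrt(
        1.0 + 1.0 / fit["n"] + (x0 - fit["x_bar"]) ** 2 / fit["s_xx"]
    )
    return centre - half, centre + half


if __name__ == "__main__":
    velocity = [0.22, 0.28, 0.33, 0.39, 0.45]  # hypothetical velocities, m/s
    sd = [38.0, 44.0, 47.0, 55.0, 58.0]        # hypothetical SD values
    fit = fit_line(velocity, sd)
    # 3.182 is the two-sided 95% t value for 3 degrees of freedom.
    print(prediction_interval(fit, 0.30, t_crit=3.182))
```

Because the half-width contains the term in (x0 - x_bar), the interval widens as the velocity moves away from the sample mean, which is the dependence on flow velocity noted above.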

Model Performance Assessment Methods
The performance of the models used in this study was evaluated using the coefficient of determination (R²), the root mean squared error (RMSE) and the mean absolute error (MAE) [51]. R² estimates the goodness of fit between predicted and observed values. The RMSE is the square root of the mean squared difference between predicted and observed values. The MAE is the average absolute error between predicted and observed values.
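The three measures can be computed as in the following sketch. It assumes R² is defined as one minus the ratio of the residual to the total sum of squares; some studies instead report the squared correlation coefficient, so this definition is an assumption here. The function names and sample values are illustrative.

```python
import math


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def rmse(observed, predicted):
    """Root mean squared error."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def mae(observed, predicted):
    """Mean absolute error."""
    n = len(observed)
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / n


if __name__ == "__main__":
    obs = [40.0, 45.0, 52.0, 60.0]   # hypothetical observed SD
    pred = [42.0, 44.0, 50.0, 63.0]  # hypothetical predicted SD
    print(r_squared(obs, pred), rmse(obs, pred), mae(obs, pred))
```

RMSE penalizes large errors more heavily than MAE, so reporting both gives a sense of whether the error is dominated by a few poorly predicted samples.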

Support Vector Machine Methods
The support vector machine (SVM) is a class of generalized classifiers that performs binary classification of data by supervised learning; its decision boundary is the maximum-margin hyperplane solved for the learning samples. SVM is usually used to analyze the complex nonlinear relationship between two variables A and B. Given X = {X_1, ..., X_n} and Y = {y_1, ..., y_n}, each sample of the input data contains multiple features and thus constitutes a feature space, and the learning objective is a binary variable. If a hyperplane exists as a decision boundary in the feature space that divides the learning targets into positive and negative classes, such that the distance from any sample point to the plane is greater than or equal to 1, the classification problem is said to be linearly separable. The decision boundary is given by:

$$w^{T} X + b = 0$$

where $w$ and $b$ are the normal vector and intercept of the hyperplane, respectively.
Statistical results (Table 2) showed that observed SD exceeding 0.4 m accounted for 75% of the total dataset. This proportion was higher than in most other plain river network cities in the Yangtze River Delta and is believed to benefit from long-term water transfer projects. Figure 3 shows that SD tended to improve gradually under long-term hydrodynamic control measures, indicating that several years of hydrodynamic regulation have gradually improved the water environment of the Suzhou urban river network. Figure 3a also illustrates the seasonality of SD: generally, the mean value is high in the warm season and low in the cool season. This finding is in line with previous studies [52,53]. Conversely, the DO concentration in the study area is low in the warm season and high in the cool season, affected by temperature [54].
Interestingly, these two parameters (DO and TE) have always been among the most important parameters used in environmental prediction models. They are believed to help handle the complex relationships between various parameters and to promote the effectiveness of prediction models. In contrast, the other five parameters do not show noticeable seasonal changes.

Input and Output of SD Prediction Model
Prior to model training, the datasets were preprocessed to reduce the impact of outliers on model performance [55]. Data below 1.5 times the 25th percentile value or higher than 1.5 times the 75th percentile value were therefore not considered; these accounted for 5% of the total data. Based on previous studies (Table 1), six parameters (flow velocity, DO, TSS, COD, Chl, and TE) were chosen as potential input parameters for model development and were selected according to their correlation coefficients with SD. Thereafter, four different models were developed and evaluated based on the selected inputs, and the statistics of the weekly SD are summarized in Table 3. Table 4 shows the descriptive statistics of the input dataset related to SD in previous research. Here X_mean, X_max, X_min, S_x, C_v and CC represent the mean, the maximum, the minimum, the standard deviation, the coefficient of variation and the correlation coefficient with the Secchi depth, respectively. The three parameters most strongly related to transparency are TSS, Chl and COD, and velocity also has a significant correlation. Normalization has been shown to be an important step that can significantly increase model performance [56]. Therefore, all the input data were normalized to zero mean and unit variance.
Further inputs were added to the M2 model on the basis of M1, and the M3 model was developed using all input parameters (V, TSS, DO, COD, Chl and TE). Finally, the M4 model was developed with V, TSS, Chl and COD. For the four models based on the ANN technique, the number of input and output neurons is closely related to the structure of the model. Trial and error was used to find the optimal hidden layer, and a hidden layer with 15 neurons gave the best result in this study.
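The preprocessing described above can be sketched for a single parameter as follows. The sketch assumes that the outlier screening corresponds to the usual interquartile-range fences (values below Q1 - 1.5 IQR or above Q3 + 1.5 IQR are dropped); this is an interpretation of the stated rule, not something the text confirms. The function names and sample values are hypothetical.

```python
import statistics


def iqr_filter(values, k=1.5):
    """Drop values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed screening rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


def standardize(values):
    """Rescale values to zero mean and unit (population) variance."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        raise ValueError("cannot standardize a constant series")
    return [(v - mean) / sd for v in values]


if __name__ == "__main__":
    tss = [22.0, 25.0, 24.0, 30.0, 27.0, 26.0, 23.0, 28.0, 140.0]  # hypothetical
    kept = iqr_filter(tss)
    print(kept)
    print(standardize(kept))
```

In a multi-parameter dataset the same two steps would be applied column by column, with the mean and standard deviation taken from the training data so that the verification and testing data are scaled consistently.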
For the four models based on the MLR technique, the discounted flow rate was obtained from the original flowmeter data with a discount coefficient of 0.95, and the corresponding flow velocity at each observation point was then obtained from the discounted flow according to the river topography. Note that 60% of the data points in each model were randomly selected as the training dataset, and the remaining data were divided equally between the verification dataset (20%) and the testing dataset (20%). Each model was tested three times.
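A random 60/20/20 partition of the samples can be sketched as below. The function name `split_60_20_20` and the use of a fixed seed per run are illustrative assumptions; the original partitioning procedure is not described in more detail than the proportions.

```python
import random


def split_60_20_20(samples, seed=0):
    """Randomly split samples into training, verification and testing sets."""
    rng = random.Random(seed)
    idx = list(range(len(samples)))
    rng.shuffle(idx)
    # Integer arithmetic avoids floating-point rounding of the set sizes.
    n_train = (6 * len(idx)) // 10
    n_val = (2 * len(idx)) // 10
    train = [samples[i] for i in idx[:n_train]]
    val = [samples[i] for i in idx[n_train:n_train + n_val]]
    test = [samples[i] for i in idx[n_train + n_val:]]
    return train, val, test


if __name__ == "__main__":
    data = list(range(100))  # stand-in for weekly samples
    for run_seed in (0, 1, 2):  # three runs, as each model was tested three times
        train, val, test = split_60_20_20(data, seed=run_seed)
        print(len(train), len(val), len(test))
```

Using a different seed for each of the three runs gives three independent partitions whose scores can then be averaged.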
The MLR model and ANN model were applied to the training, verification, and testing datasets. The testing dataset served as unseen data for evaluating the performance of the fitted models through RMSE; each model was run three times, and the average of the results is given in Table 5. The ANN-based models performed remarkably better than the MLR models in all phases (Table 5). Among the ANN models, M4 has the best performance in all phases. The ANN results clearly show a significant improvement in performance from M1 (CC = 0.859) to M3 (CC = 0.882). The CC increases from 0.875 to 0.897, a 2.5% improvement, and the MAE decreases from 0.859 to 0.834, a 3.0% improvement. The MLR models fitted the training dataset well but performed poorly in the other phases. Among the MLR-based models, M3 has the best performance: the CC declined slightly from M1 (CC = 0.594) to M2 (CC = 0.573) and increased slightly from M2 to M3 (CC = 0.607). Regarding the RMSE and MAE, the improvements are less than 10.5% and 6.2%, respectively, which is negligible. This is not the case for the ANN models, whose performance on the training data was notably better than that of the MLR models. Additionally, increasing the number of input parameters from three (M1) to five (M2) does not remarkably improve model performance; only a slight improvement is obtained with the ANN model: 2.5% and 6.7% in favour of the M1 model based on the CC and RMSE, respectively. However, a 3.1% improvement in the MAE was obtained in favour of the M3 model: the MAE drops from 0.902 to 0.643 (decreased by 40.1%).
As Table 5 shows, the results are intriguing and encouraging when COD is added to the SD prediction model. Scatter plots of the predictions against observations for the ANN and MLR models of SD are shown in Figure 4.

Analysis of the Correlation between SD and Velocity by Using Machine Learning Methods
The plain urban river network has a slow flow rate, and its hydrodynamics are completely manually controlled. Velocity is therefore considered an important input parameter of the SD model in this study, and the correlation between flow velocity and SD needs to be explored after the model is established.
In view of the complex relationship between flow velocity and SD, two models in the scikit-learn library, the Linear Regression model (LR model) and the Support Vector Machine Regression model (SVR model), were selected to fit the two preprocessed datasets. Regression results were plotted in Figure 5 and the RMSE of training and test datasets were summarized in Table 6.

Table 6. Description of regression equations by the MLR models.
A comparison indicated that the SD predicted using the LR and SVR models exhibited patterns of a straight line, a curve and scattered points, respectively. The LR model gave an RMSE below 25.00 and an R² over 0.41, which agrees with Figure 5, where the majority of the SD values predicted from the training dataset overlapped with the observed values. However, the RMSE increased significantly to more than 27.36, and R² decreased to less than 0.33, when the same model was used to predict SD in the test dataset. This is a typical result of overfitting. The LR and SVR models, in contrast, gave RMSE values on the training and test datasets that were close to each other. To avoid overfitting, the SVR model was optimized by adjusting its parameters. The RMSE and R² of the optimized SVR model were added to Table 6. Since the effect of the relationship was estimated by RMSE, the R² of the SVR model on the testing and training datasets was still significantly different.
Results in Table 7 show that the SVR model has the higher RMSE when predicting SD in the test dataset, while its RMSE values differ by less than 2.00. Compared with the LR model, the SVR model showed a higher RMSE of 26.82, but its R² was better than that of the LR model in both datasets, indicating better performance in clarifying the relationship between SD and velocity.

Discussion
The ANN model and MLR model are compared based on their performance in the (i) training, (ii) verification, and (iii) testing phases, with results summarized in Table 4. The ANN model appears more accurate and consistent across the subsets: the RMSE and MAE values are similar and the correlation coefficients are all close to unity, so the performance of this model is well demonstrated by the RMSE. The ANN model also yields a much higher CC than the MLR model. The CC improved by approximately 38.1% during the verification phase and by approximately 36.9% during the test phase. In some previous studies, the reported SD prediction was not tested on the training dataset because of insufficient data [57]. In this study, as Table 4 shows, the results for SD modelling are very encouraging, and the models were fitted in all phases.
According to Table 4, the ANN model gives a reasonable estimation of SD during the verification phase. An acceptable level is also reached by models M1 and M3, and the comparison of the statistical indices (CC, RMSE and MAE) shows that the ANN models perform better than the MLR models, which demonstrates the advantage of the ANN method in predicting the SD of the plain urban river network. In the verification phase, the best ANN results are achieved by the M4 model, whose prediction performance is slightly better than that of M1 and M3. In the testing phase, as shown in Table 3, M4 is always the best ANN model, while M1 is the best MLR model. For good predictive ability, RMSE and MAE should be as low as possible, whereas CC should be as high as possible.
Consequently, the inclusion of the two parameters DO and TE may not improve the performance of the model. Interestingly, besides TSS and Chl, COD assumed major importance when included simultaneously as a model input. Although DO and TE are water quality parameters that affect the SD of a water body, they did not contribute significantly to predicting the SD of the urban river network once COD was included. As the most important environmental factors in water bodies, DO and TE mainly affect the degradation rate of pollutants in urban river networks. Because the river network of plain cities in the Yangtze River Delta has undergone years of diversion and flow control, its water quality has gradually improved and entered a steady state. Transparency is therefore less sensitive to DO and TE, which affect chemical processes. In contrast, COD is extremely difficult to degrade in urban river network water bodies and is closely related to TSS, which makes SD more sensitive to COD in the plain urban river network. The final model selected to predict the SD of the urban river network in this study contained velocity, TSS, Chl and COD (M4). Including DO and TE may not improve model performance and sometimes even increases the error indices. Additionally, fewer and more suitable inputs simplify the implementation and calculation procedure, which improves the practicality of the model.
Finally, for a given regression model, the LR model exhibited a slightly lower RMSE for the exponential correlation than for the power correlation, whereas the SVR model gave the opposite result. The results indicate that, within a certain flow rate threshold, transparency is positively correlated with flow rate, which reveals that the hydrodynamic factors of the plain river network have a significant impact on water transparency and can be used as an effective parameter of the prediction model. Within the flow velocity range of 0.22-0.45 m/s, an increased flow rate has a positive effect on SD. On the one hand, the improvement in hydrodynamics brought by water resources regulation does have a positive impact on the water environment of urban river networks. On the other hand, improving the water environment through hydrodynamic regulation is subject to a flow rate threshold, which means hydrodynamic control is not a once-and-for-all method.
Additionally, in the analysis of long-term SD changes and ANN model results, it is essential to consider the influence of flow velocity changes caused by water regulation. The comparison of the correlation coefficients in Table 3 reveals that flow velocity has a larger impact weight on the water transparency of the urban river network: the absolute value of its correlation coefficient ranks above those of dissolved oxygen and temperature.

Conclusions
In this study, an artificial neural network model is proposed for estimating Secchi depth in a plain urban river network using long-term observed data. The comparison of results between the ANN and MLR models reveals that hydrodynamic parameters can be used as effective parameters for SD prediction models of the urban river network. Additionally, COD concentration has a crucial impact on transparency in the river network, as shown by the notable improvement when COD is included as a model input. The more accurate and practical SD model is the one with flow velocity, TSS, COD, and Chl as input parameters, with sensitivity ranking from high to low as TSS, Chl, COD and flow velocity.
In addition, the ANN models perform better than the MLR models, which demonstrates the existence of a complex nonlinear relationship between SD and the various parameters. The support vector machine was used to deduce the relationship between SD and the hydrodynamic parameter, and a strong positive correlation was found when velocity ranged from 0.22 m/s to 0.45 m/s. Over 90% of the data fall within the predicted intervals of SD for the method, reflecting the flow rate threshold of hydrodynamic regulation for improving water transparency in the urban river network.