1. Introduction
Urbanization has accelerated the degradation of urban aquatic ecosystems, and the associated ecological issues have been recognized in China [1,2,3]. To mitigate these effects and improve urban river health, large-scale water-clearing regulations have been instituted for plain urban river networks, such as the one in the Yangtze River Delta [4,5]. Water transparency is a commonly used indicator of water quality [6] because it integrates physical, chemical, and biological processes [7,8] (e.g., flow rate, nutrient cycling, and phytoplankton photosynthesis, respectively [9,10,11]). Moreover, it is a key indicator for measuring the effect of ecological restoration [12,13] and is widely valued in China when constructing urban water environments [14,15,16,17,18].
Additionally, studies have shown that the three abovementioned processes also relate to geographical location [19,20,21]. For instance, in shallow areas such as the Yangtze Delta, water transparency is dominated by dynamic conditions associated with wind, waves, and human activities. Other environmental water quality indicators include chlorophyll-a (Chl-a) [22,23]; nutritional status [24]; total phosphorus [25]; sediment resuspension, transportation and settlement; and total suspended substance [26]. For a long time, water was diverted from the Yangtze River to the cities on the coastal plain, but this method was once considered to be a waste of money and labour. Therefore, an investigation into how hydrodynamic and hydro-environmental factors affect water transparency was begun. An improved understanding of the large-scale water-clearing regulation can provide a theoretical basis for a better evaluation of water diversion projects.
The Secchi disk depth, or Secchi depth (SD), is a simple, traditional measure of water transparency [27,28]. A black-and-white disk is lowered vertically into the water, and the depth at which it is no longer visible is called the Secchi depth [29], indicating the extent of the water transparency. Although the Secchi disk is a powerful tool, its main disadvantage for large areas is that its observations are discrete in space and time and asynchronous [30,31]. Additionally, errors may occur due to the limited visual acuity of the observer. Therefore, the traditional method might not be adequate to evaluate water quality for large river networks. The Secchi method also consumes considerable labour and resources because of the complex branch system of a plain urban river network. By contrast, satellite sensors like China’s CBERS-1 can provide high-quality water transparency measurements with high spatial-temporal resolution [32]. Colour sensors combined with remote sensors have been successfully applied to estuarine and thalassic water [21], but there are limited studies on their application to China’s inland shallow areas.
Previous studies have shown that SD is a function of hydrodynamic factors like velocity and water level [33,34]. Further, hydrodynamic variations may lead to variations in hydro-environmental indicators [35], which might in turn cause variations in SD. As such, hydrodynamic indicators alone might be inadequate to predict SD. For this reason, previous studies have explored SD as a function of various environmental parameters [32,36,37,38,39] (a partial list is given in Table 1). Over the past 50 years, several regression models have been developed, in which the SD (or its natural or decimal logarithm) was expressed as a function of one or two parameters. Compared with regression models, studies that employ an artificial neural network (ANN) for SD prediction remain limited to date [40].
In particular, few studies have used an ANN for SD prediction while considering the impact of the three processes (physical, chemical, and biological) on a plain urban river network. Inspired by big-data analysis and machine learning techniques, we attempted to develop a machine learning model based on remotely sensed SD and other hydrodynamic and environmental parameters from 2013 to 2019 to assess the response of SD to the large-scale water-clearing regulation in the Yangtze Delta. The candidate inputs for the machine learning model include a hydrodynamic condition index and water environmental factors: surface velocity (V), total suspended solids (TSS) concentration, dissolved oxygen (DO) concentration, near-surface chlorophyll (Chl) concentration, chemical oxygen demand (COD) concentration, and water temperature (TE).
The objectives of this study are:
To evaluate the big data analysis and self-learning ability of the developed machine learning model in SD prediction for a plain urban river network with long-term field observations;
To compare the SD prediction performance of a machine learning model and a regression model, in order to provide a better prediction model and highlight suitable parameters.
3. Results
3.1. Characteristics of the SD Measured
SD_WA, SD_MX, and SD_MI (the weekly average, maximum, and minimum SD) during the observation period from 5 January 2013 to 12 December 2019 in the study area are shown in Figure 3, and their descriptive statistics are summarized in Table 2.
Figure 3a shows that the weekly mean SD during the long observation period ranged from 0.26 m to 0.89 m. The weekly maximum SD varied from 0.544 m to 1.103 m with an average value of 0.803 m, while the weekly minimum SD ranged from 0.179 m to 0.566 m with a mean of 0.321 m (Figure 3b and Table 2). Observed SD values exceeding 0.4 m accounted for 75% of the total dataset. This proportion was higher than in most other plain river network cities in the Yangtze River Delta and was believed to benefit from long-term water transfer projects. Figure 3 shows a tendency for SD to improve gradually under long-term hydrodynamic control measures, which indicates that, after several years of hydrodynamic regulation, the water environment of the Suzhou urban river network has gradually improved.
Figure 3a also illustrates the SD seasonality. Generally, the mean value is high in the warm season and low in the cool season. This finding is in line with previous studies [52,53]. Conversely, the DO concentration in the study area is low in the warm season and high in the cool season, affected by temperature [54]. Interestingly, these two parameters (DO and TE) have long been among the most important parameters used in environmental prediction models, as they are believed to help handle the complex relationships between various parameters and improve the effectiveness of prediction models. In contrast, the other five parameters do not show noticeable seasonal changes.
3.2. Input and Output of SD Prediction Model
Prior to the model training, datasets were preprocessed to reduce the impact of outliers on the model performance [55]. Data below 1.5 times the 25th percentile value and above 1.5 times the 75th percentile value were not considered; these accounted for 5% of the total data. Based on previous studies (Table 1), six parameters, namely flow velocity, DO, TSS, COD, Chl, and TE, were chosen as potential input parameters for the model development and screened by their correlation coefficient with SD. Thereafter, four different models were developed and evaluated based on the selected inputs, and statistics of the weekly SD are summarized in Table 3.
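The outlier screening described above can be sketched as follows. The wording of the rule in the text is ambiguous, so this sketch assumes the conventional Tukey fences (values outside Q1 - 1.5 IQR and Q3 + 1.5 IQR are dropped); that reading, the function name `tukey_filter`, and the sample values are illustrative assumptions, not the authors' implementation.

```python
import statistics


def tukey_filter(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR] (assumed reading of the rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lower <= v <= upper]


# Hypothetical weekly SD values in metres; 2.50 is an obvious outlier.
sd_weekly = [0.31, 0.42, 0.45, 0.48, 0.52, 0.55, 0.61, 0.66, 2.50]
filtered = tukey_filter(sd_weekly)
print(filtered)
```

Under this assumption only the extreme value is discarded, which is in keeping with a small share of the data being removed.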
Table 4 shows the descriptive statistics of the input parameters that were related to SD in previous research. Here X_mean, X_max, X_min, S_x, CC and C_v represent the mean, the maximum, the minimum, the standard deviation, the correlation coefficient with the Secchi depth, and the coefficient of variation, respectively. The three parameters most strongly related to transparency are TSS, Chl and COD, and the velocity also has a significant correlation. Normalization has been shown to be an important step that can significantly increase the performance of the models [56]. Therefore, all the input data were normalized to zero mean and unit variance.
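As a minimal sketch of the normalization step, the snippet below rescales one input series to zero mean and unit variance. The function name `zscore` and the TSS values are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
import statistics


def zscore(values):
    """Rescale a series to zero mean and unit (population) variance."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Hypothetical TSS concentrations (mg/L) used only for illustration.
tss = [12.0, 18.5, 25.0, 31.5, 40.0]
tss_z = zscore(tss)
print([round(z, 3) for z in tss_z])
```

In a train/verification/test workflow it is common to compute the mean and standard deviation on the training subset only and reuse them for the other subsets, so that no information from unseen data leaks into the model.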
Based on the selected parameters, four models were developed (M1, M2, M3 and M4). The M1 model was developed with only velocity, TSS and Chl; DO and TE were added to the inputs of M1 to form the M2 model; and the M3 model was developed using all input parameters (V, TSS, DO, COD, Chl and TE). Finally, the M4 model was developed with V, TSS, Chl and COD. For the four models based on the ANN technique, the numbers of input and output neurons are closely related to the structure of the model. In this study, the optimal hidden layer was found by trial and error, and a hidden layer with 15 neurons gave the best result. For the four models based on the MLR technique, the discounted flow rate was obtained by applying a discount coefficient of 0.95 to the original flowmeter data, and the corresponding flow velocity at each observation point was then calculated according to the river topography. Note that 60% of the data points in each model were randomly selected as the training dataset, and the remaining 40% were split equally into the verification dataset (20%) and the testing dataset (20%). Each model was run three times.
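The 60/20/20 split and a single-hidden-layer network with 15 neurons can be sketched with scikit-learn as below. This is an illustration under stated assumptions, not the authors' code: the data are synthetic stand-ins for the M4 inputs (V, TSS, Chl, COD), and the solver, iteration limit, and random seeds are hypothetical choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the four M4 inputs (V, TSS, Chl, COD) and for SD.
rng = np.random.default_rng(0)
n = 300
X = rng.normal(size=(n, 4))
y = (0.6 + 0.03 * X[:, 0] - 0.15 * X[:, 1] - 0.08 * X[:, 2]
     - 0.05 * X[:, 3] + rng.normal(scale=0.02, size=n))

# 60% training, 20% verification, 20% testing, using integer counts.
n_val = n_test = n // 5
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=n_val + n_test, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=n_test, random_state=0)

# Fit the scaler on the training subset only, then reuse it elsewhere.
scaler = StandardScaler().fit(X_train)

# One hidden layer with 15 neurons, as reported in the text.
ann = MLPRegressor(hidden_layer_sizes=(15,), solver="lbfgs",
                   max_iter=2000, random_state=0)
ann.fit(scaler.transform(X_train), y_train)

pred_val = ann.predict(scaler.transform(X_val))
pred_test = ann.predict(scaler.transform(X_test))
print(X_train.shape, X_val.shape, X_test.shape)
```

Repeating the fit with different `random_state` values and averaging the scores would mirror the three runs per model mentioned above.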
The MLR and ANN models were evaluated on the training, verification, and testing datasets. The testing dataset served as unseen data to evaluate the performance of the fitted models through the RMSE; each model was run three times, and the average of the results is given in Table 5. The ANN-based models performed remarkably better than the MLR models in all phases (Table 5). Among the ANN models, the M4 model has the best performance in all phases. The ANN results show an improvement in performance from M1 (CC = 0.859) to M3 (CC = 0.882). The CC increases from 0.875 to 0.897, a 2.5% improvement, and the MAE decreases from 0.859 to 0.834, a 3.0% improvement. The MLR models fitted the training dataset well but performed poorly in the other phases. Among them, the MLR-based M3 model has the best performance: the CC declined slightly from model M1 (CC = 0.594) to model M2 (CC = 0.573) and increased slightly from model M2 to model M3 (CC = 0.607). Regarding the RMSE and MAE, the improvements are less than 10.5% and 6.2%, respectively, which is negligible. This is not the case for the ANN models, whose performance on the training data was notably better than that of the MLR models.
Additionally, increasing the number of input parameters from three (M1) to five (M2) does not remarkably improve model performance; only a slight improvement is obtained using the ANN model: 2.5% and 6.7% in favour of the M1 model based on the CC and RMSE, respectively. However, a 3.1% improvement was obtained regarding the MAE in favour of the M3 model: the MAE drops from 0.902 to 0.643 (decreased by 40.1%). As shown in Table 5, the results obtained with COD added to the SD prediction model are encouraging. The scatter plots of the predictions against observations of SD for the ANN and MLR models are shown in Figure 4.
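For reference, the three indices used in this comparison (CC, RMSE and MAE) and a relative improvement rate can be computed as in the sketch below. The function names and the sample observations are hypothetical; the standard definitions of the Pearson correlation coefficient, root mean square error, and mean absolute error are assumed.

```python
import math


def rmse(obs, pred):
    """Root mean square error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def mae(obs, pred):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)


def cc(obs, pred):
    """Pearson correlation coefficient between observations and predictions."""
    mo, mp = sum(obs) / len(obs), sum(pred) / len(pred)
    cov = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    so = math.sqrt(sum((o - mo) ** 2 for o in obs))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    return cov / (so * sp)


def improvement_pct(before, after):
    """Relative change between two index values, in percent of the first."""
    return abs(after - before) / before * 100.0


# Hypothetical observed and predicted SD values (m).
obs = [0.30, 0.45, 0.60, 0.75]
pred = [0.35, 0.40, 0.65, 0.70]
print(rmse(obs, pred), mae(obs, pred), cc(obs, pred))

# A CC rising from 0.875 to 0.897 is a relative gain of about 2.5%.
print(round(improvement_pct(0.875, 0.897), 1))
```

Lower RMSE and MAE and a higher CC indicate better predictive ability, so improvement rates for the error indices are computed on decreases and those for CC on increases.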
3.3. Analysis of the Correlation between SD and Velocity by Using Machine Learning Methods
The plain urban river network has a slow flow rate, and its hydrodynamics are completely manually controlled. Therefore, velocity is considered an important input parameter of the SD model in this study, and the correlation between flow velocity and SD needs to be explored once the model is established.
In view of the complex relationship between flow velocity and SD, two models from the scikit-learn library, the linear regression model (LR model) and the support vector machine regression model (SVR model), were selected to fit the two preprocessed datasets. Regression results are plotted in Figure 5, and the RMSE values for the training and test datasets are summarized in Table 6.
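A comparison of this kind can be sketched with scikit-learn's `LinearRegression` and `SVR` as follows. The velocity and SD values are synthetic, and the kernel, `C`, `epsilon`, and split size are hypothetical choices made only for illustration; they are not the settings used in the study.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic velocity (m/s) and SD (cm) data with a mildly nonlinear relation.
rng = np.random.default_rng(1)
velocity = rng.uniform(0.05, 0.60, size=(200, 1))
sd_cm = 35.0 + 80.0 * np.sqrt(velocity.ravel()) + rng.normal(scale=5.0, size=200)

v_train, v_test, sd_train, sd_test = train_test_split(
    velocity, sd_cm, test_size=50, random_state=1)

models = {
    "LR": LinearRegression(),
    "SVR": SVR(kernel="rbf", C=100.0, epsilon=1.0),  # assumed hyperparameters
}

# Collect RMSE and R^2 on both subsets to check for overfitting.
scores = {}
for name, model in models.items():
    model.fit(v_train, sd_train)
    scores[name] = {}
    for split, (v, sd) in {"train": (v_train, sd_train),
                           "test": (v_test, sd_test)}.items():
        pred = model.predict(v)
        scores[name][split] = {
            "rmse": float(np.sqrt(mean_squared_error(sd, pred))),
            "r2": float(r2_score(sd, pred)),
        }

for name, result in scores.items():
    print(name, result)
```

A large gap between the training and test RMSE for one model is the usual sign of overfitting; tuning `C` and `epsilon` is one way to narrow that gap for the SVR.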
A comparison indicated that the SD predicted using the LR and SVR models exhibited patterns of a straight line, a curve, and scattered points, respectively. The LR model showed an RMSE below 25.00 and an R² over 0.41, which agrees with the results in Figure 5 that the majority of predicted SD values based on the training dataset overlapped with the observed values. However, the RMSE increased significantly to above 27.36, and R² decreased to less than 0.33, when predicting the SD in the test dataset using the same model. This is a typical result of overfitting. The LR and SVR models, in contrast, yielded RMSE values on the training and test datasets that were close to each other. In order to avoid overfitting, the SVR model was optimized by adjusting its parameters. The RMSE and R² of the optimized SVR model were added to Table 6. Since the effect of the relationship was estimated by RMSE, the R² values of the SVR model on the testing and training datasets were still significantly different.
Results in Table 7 show that the SVR model has the higher RMSE when predicting the SD in the test dataset, while its RMSE differences are within 2.00. Compared with the LR model, the SVR model showed a higher RMSE at 26.82, but its R² was better than that of the LR model in the two datasets, indicating better performance in clarifying the relationship between SD and velocity.
4. Discussion
The ANN model and MLR model are compared based on their performances in the (i) training, (ii) verification, and (iii) testing phases, with results summarized in Table 4. The ANN model appears more accurate and consistent across the subsets, since the values of RMSE and MAE are similar and the correlation coefficients are all close to unity; its performance is well demonstrated by the RMSE. The ANN model also yields a much higher CC than the MLR model: the CC improved by approximately 38.1% in the verification phase and by approximately 36.9% in the testing phase. In some previous studies, the reported prediction of SD was not tested on the training dataset because of insufficient data size [57]. In this study, as shown in Table 4, the results regarding the modelling of SD are very encouraging, and the model fitted well in all phases.
According to Table 4, the ANN model gives a reasonable estimation of SD during the verification phase. Furthermore, an acceptable level can be observed using models M1 and M3, and the comparison of various statistical indices (CC, RMSE and MAE) shows that the ANN models perform better than the MLR models, which demonstrates the advantage of the ANN method in predicting the SD of the plain urban river network. In the verification phase, the best ANN results are achieved using the M4 model, whose prediction performance is slightly better than that of M1 and M3. In the testing phase, as shown in Table 3, model M4 is always the best ANN model, while M1 is the best MLR model. For good predictive ability, RMSE and MAE should be as low as possible, whereas CC should be as high as possible.
Consequently, the inclusion of the two parameters DO and TE may not improve the performance of the model. Interestingly, besides TSS and Chl, COD assumed major importance when included simultaneously as an input to the model. Although DO and TE are water quality parameters that affect the SD of a water body, they did not contribute significantly to predicting the SD of the urban river network when COD was included. As the most important environmental factors in water bodies, DO and TE mainly affect the degradation rate of pollutants in urban river networks. As the river networks of plain cities in the Yangtze River Delta have undergone years of diversion and flow control, their water quality has gradually improved and entered a steady state. Therefore, transparency is less sensitive to DO and TE, which affect the chemical processes. In contrast, COD is extremely difficult to degrade in urban river network water bodies and is closely related to TSS, which causes the SD to be more sensitive to COD in the plain urban river network. The final model selected to predict the SD of the urban river network in this study contained velocity, TSS, Chl and COD (M4). The inclusion of DO and TE may not improve the model performance and sometimes even increases the values of the error indices. Additionally, fewer and more suitable inputs help to simplify the implementation and the calculation procedure, which improves the practicality of the model.
Finally, for a given regression model, the LR model exhibited a slightly lower RMSE for the exponential correlation than for the power correlation, whereas the SVR model showed the opposite. The result indicates that, within a certain flow rate threshold, there is a positive correlation between transparency and flow rate, which reveals that the hydrodynamic factors of the plain river network have a significant impact on water transparency and can be used as an effective parameter of the prediction model. Within the flow velocity range of 0.22–0.45 m/s, an increased flow rate has a positive effect on SD. On the one hand, the improvement in hydrodynamics brought by water resources regulation does have a positive impact on the water environment of urban river networks. On the other hand, improving the water environment through hydrodynamic regulation is subject to a flow rate threshold, which means hydrodynamic control is not a once-and-for-all method.
Additionally, in the analysis of long-term SD changes and ANN model results, it is essential to consider the influence of flow velocity changes caused by water regulation. The comparison of the correlation coefficients shown in Table 3 reveals that flow velocity has a larger weight in the water transparency of the urban river network: the absolute value of its correlation coefficient ranks ahead of those of dissolved oxygen and temperature.
5. Conclusions
In this study, an artificial neural network (ANN) model is proposed for estimating Secchi depth in a plain urban river network using long-term observed data. The comparison of results between the ANN model and the MLR model reveals that hydrodynamic parameters can be used as effective parameters for SD prediction models of the urban river network. Additionally, the impact of COD concentration on transparency is crucial in the river network, as shown by the notable improvement when COD is included as a model input. The more accurate and more practical SD model is the one with flow velocity, TSS, COD, and Chl as input parameters, whose sensitivity ranks from high to low as TSS, Chl, COD and flow velocity.
In addition, the ANN models perform better than the MLR models, which demonstrates the existence of a complex nonlinear relationship between SD and various parameters. Support vector machine regression was used to deduce the relationship between SD and the hydrodynamic parameter, and a strong positive correlation was found when velocity ranged from 0.22 m/s to 0.45 m/s. Over 90% of the data fall within the predicted intervals of SD for the method, reflecting the flow rate threshold of hydrodynamic regulation for improving water transparency in the urban river network.