Water Quality Prediction Based on Multi-Task Learning

Water pollution seriously endangers people's lives and restricts the sustainable development of the economy. Water quality prediction is essential for the early warning and prevention of water pollution. However, the nonlinear characteristics of water quality data make it challenging for traditional methods to predict accurately. Recently, deep learning-based methods have handled these nonlinear characteristics better, improving prediction performance. Still, they rarely consider the relationship between multiple water quality prediction indicators, which is crucial because the indicators can provide additional associated auxiliary information. To this end, this paper proposes a prediction method based on exploring the correlation of water quality multi-indicator prediction tasks. We explore four sharing structures for multi-indicator prediction to train deep neural network models that capture the highly complex nonlinear characteristics of water quality data. Experiments on datasets from more than 120 water quality monitoring sites in China show that the proposed models outperform the state-of-the-art baselines.


Introduction
The excessive exploitation and utilization of water resources have caused a series of problems, such as deterioration of water quality, damage to water functional areas, and degradation of river ecosystem structures, which seriously endanger social and economic development and people's safety. Water quality prediction is essential for water pollution prevention and treatment: it helps us fully understand the dynamic trend of the surface water ecological environment and warns of possible pollution incidents.
However, it is difficult to predict water quality because of the nonlinear characteristics of water-related data [1]. Traditional statistical analysis methods lack nonlinear approximation and self-learning abilities and cannot fully consider the complex impact of various environmental factors. With the rapid development of machine learning, scholars have begun to explore water quality prediction based on machine learning. They have achieved better prediction performance by establishing nonlinear learning cognitive models from historical data, summarizing and discovering knowledge, and predicting system behavior. Olyaie et al. (2017) applied linear genetic programming and a support vector machine (SVM) to predict dissolved oxygen (DO) in the Delaware River in Trenton, USA [2]. Li et al. (2017) proposed a method that combines ensemble empirical mode decomposition (EEMD) [3][4][5] and least-squares SVR (support vector regression) to predict DO concentration [6]. Leong et al. (2021) applied SVM [7][8][9] and least-squares support vector models to the Perak River in Malaysia [10]. The performance of these methods depends not only on the models but also on the features selected for training.
More and more researchers have recently applied deep learning methods to water quality prediction because deep learning (DL) [11,12] can efficiently train on and abstract multi-level features of multi-dimensional training data. Banejad et al. (2011) applied a basic neural network to predict biochemical oxygen demand (BOD) and DO of the Morad River in Iran [13]. They verified that deep learning technology can reliably, efficiently, and accurately extract the nonlinear characteristics of water quality data. Subsequently, Heddam et al. (2014, 2016) successively proposed models based on GRNN (generalized regression neural network) and MLP (multilayer perceptron), which were applied to different rivers in the United States and lakes in China [14][15][16]. Zhou et al. (2019) proposed a deep cascade forest (DCF) that uses several random forests based on ensemble learning, performing well on large and even small-scale data [17]. Wang et al. (2019) proposed a hybrid CNN-LSTM (convolutional neural network-long short-term memory) deep learning algorithm for a dynamic chemical oxygen demand (COD) prediction model of urban sewage [18]. Zou et al. (2020) proposed a water quality prediction method based on the Bi-LSTM (bidirectional LSTM) model with multiple time scales [19]. Niu et al. (2021) developed pixel-based and patch-based deep neural network regression models to estimate seven optically inactive water quality parameters [20]. Yang et al. (2021) proposed a mixed model named CNN-LSTM with Attention (CLA), combining CNN, LSTM, and attention mechanisms to predict water quality [21]. Guo et al. (2022) used a progressively decreasing deep neural network and multimodal deep learning (MDL) models without well-handled input features to estimate long-term water quality indicators and quantitatively explored the contribution of each feature [22].
However, these models are constructed to optimize a single prediction indicator, such as the potential of hydrogen (pH), dissolved oxygen (DO), chemical oxygen demand-Mn (COD Mn ), or ammonia nitrogen (NH 3 -N, NHN for short), which cannot guarantee the efficiency and accuracy of the models when predicting other water quality indicators. The correlation between multiple prediction indicators can provide additional associated auxiliary information, which helps improve prediction performance. To this end, we propose a water quality prediction model based on multi-task learning that learns the highly complex nonlinear characteristics of time series data and explores the correlation of multi-indicator prediction.
The main contributions of this paper are as follows: (1) We propose a multi-indicator prediction model of surface water quality based on deep learning, which excavates the highly complex nonlinear characteristics of surface water ecological environment water quality data and explores the correlation of multiple water quality prediction indicators.
(2) We propose four water quality prediction frameworks, named hard parameter sharing structure (Multi-Task-Hard), soft parameter sharing structure (Multi-Task-Soft), gated parameter sharing structure (Multi-Task-Gate), and gated hidden parameters sharing structure (Multi-Task-GH), based on different multi-task learning structures and combine the frameworks with various mainstream deep learning models to form different water quality prediction models.
(3) We conducted experiments to predict four water quality indicators, including pH, DO, COD Mn , and NH 3 -N, on real data from more than 120 water quality monitoring sites in seven river systems and lakes in China. The experimental results demonstrate that the proposed water quality multi-task learning prediction framework outperforms the state-of-the-art single-indicator prediction models.

Methodology
The existing deep learning-based water quality prediction models rarely consider the relationship between multiple water quality indicators. This relationship is crucial for prediction because the indicators can provide additional associated auxiliary information. To this end, we propose a prediction method based on exploring the correlation of water quality multi-indicator prediction tasks in this section. We first define the water quality prediction problem and then explore four sharing structures for multi-indicator prediction to train deep neural network models that capture the highly complex nonlinear characteristics of water quality data.

Definition of Water Quality Prediction
Following previous work [19,23,24], we choose four water quality indicators, pH, DO, COD Mn , and NH 3 -N, as our prediction targets. Compared with other indicators, they perform significantly better in predicting the six water quality levels and thus better reflect the overall water quality [23].
The water quality prediction is a time series prediction. We give the mathematical definitions for single-task prediction and multi-task prediction.
Single-task water quality prediction: X → Y. Given a water quality prediction indicator at known past times (x 1 , . . . , x i ) ∈ X, analyze the change patterns and predict the water quality indicator at the future time interval [11], denoted as y i+1 ∈ Y.
For the four common water quality prediction indicators, pH, DO, COD Mn , and NH 3 -N, the multi-task water quality prediction task can be defined as follows: given the numerical changes of pH, DO, COD Mn , and NH 3 -N at the past i time intervals (x pH 1 , . . . , x pH i ), (x DO 1 , . . . , x DO i ), (x COD 1 , . . . , x COD i ), (x NHN 1 , . . . , x NHN i ) ∈ X, analyze the change patterns and predict the corresponding water quality indicators at the future time interval, denoted as y pH , y DO , y COD , y NHN ∈ Y pH , Y DO , Y COD , Y NHN .

Architecture of Water Quality Prediction Model Based on Multi-Task Learning
Frameworks for multi-task learning are often based on sharing the same bottom structure [25][26][27]. The model of multiple tasks can be transformed into a basic bottom model and multiple separate models. For single-task learning, the input and output of each task correspond to a separate model, and new models need to be built for new tasks, although the structure of the models is sometimes the same. For multi-task learning, the common structure of the model is unified into a basic model. Then, several separate models are introduced to realize the learning of multiple different tasks. Figure 1 shows a basic framework for multi-task learning, in which the blue part represents the shared parameter layer, and the orange and yellow parts represent the models for different tasks forming the tower layer. This framework saves the parameter space of multiple water quality prediction models and reduces the risk of over-fitting. We propose a multi-task learning framework for water quality prediction based on different sharing structures [28,29]. The framework can be developed into four forms: hard parameter sharing structure (Multi-Task-Hard), soft parameter sharing structure (Multi-Task-Soft), gated parameter sharing structure (Multi-Task-Gate), and gated hidden parameters sharing structure (Multi-Task-GH). The differences between the four structures are described in detail in Section 2.3.
Figure 1. The basic framework of multi-task learning. The blue part represents the shared parameter layer, and the orange and yellow parts represent the models for different tasks forming the tower layer.
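To make the shared-bottom framework of Figure 1 concrete, the following is a minimal NumPy sketch of a shared parameter layer feeding four task-specific towers. All layer sizes, weight initializations, and the flattened input layout are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

# Illustrative sizes (assumed): 10 past time steps per indicator, 4 indicators
# flattened into one input vector, a 32-unit shared layer, small towers.
IN_DIM, SHARED_DIM, TOWER_DIM = 40, 32, 16
TASKS = ["pH", "DO", "CODMn", "NH3N"]

# Shared parameter layer (the blue part in Figure 1).
W_shared = rng.normal(scale=0.1, size=(IN_DIM, SHARED_DIM))
b_shared = np.zeros(SHARED_DIM)

# One tower per task (the orange and yellow parts in Figure 1).
towers = {
    t: (rng.normal(scale=0.1, size=(SHARED_DIM, TOWER_DIM)),
        np.zeros(TOWER_DIM),
        rng.normal(scale=0.1, size=(TOWER_DIM, 1)),
        np.zeros(1))
    for t in TASKS
}

def predict(x):
    """Forward pass: shared bottom first, then a separate tower per indicator."""
    out_shared = relu(x @ W_shared + b_shared)
    preds = {}
    for t, (W1, b1, W2, b2) in towers.items():
        preds[t] = (relu(out_shared @ W1 + b1) @ W2 + b2).item()
    return preds

x = rng.normal(size=IN_DIM)   # one window of historical indicator values
preds = predict(x)
```

Because all towers read the same shared representation, the parameter count grows only by one small tower per new indicator, which is the saving the paragraph above describes.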

Hard Parameter Sharing Structure of Multi-Indicator Water Quality Prediction (Multi-Task-Hard)
The hard parameter sharing structure is the basic structure of the shared bottom structure in multi-task learning. As shown in Figure 2, it is mainly divided into four parts.
Figure 2. Hard parameter sharing structure of multi-indicator water quality prediction. The dark blue part represents the shared parameter layer, and the orange and yellow parts represent the models for different tasks forming the tower layer.
The first part is the input layer (X 1 , . . . , X N ), which contains the time sequence information of each water quality indicator at the past time intervals.
The second part is the shared parameter layer, which is designed as a fully connected layer. This part takes the information transmitted by the input layer and extracts a shared implicit vector out shared .
The third part is the tower layer, which is carefully designed for a task and will output the prediction results required by the corresponding task, which reflects the flexibility of the multi-task learning framework. The output of the second layer will be transmitted to the tower layer for different tasks simultaneously. Because of the differences between the indicators, it is necessary to design specific models for different water quality indicators in this layer. To put it simply, one task corresponds to one tower.
The fourth part is the output layer, which contains the outputs: Y pH , Y DO , Y COD , Y NHN as the prediction.
The algorithm is shown in Algorithm 1, and the MLP is selected for processing in the shared layer.
Algorithm 1: Multi-indicator water quality prediction based on hard parameter sharing multi-task learning
Input: water quality prediction indicators at the past time intervals (X 1 , . . . , X N )
1: out shared ← MLP([X 1 , . . . , X N ])
2: (Y 1 , . . . , Y N ) ← Tower 1,...,N (out shared )
Output: water quality indicators at the future time intervals (Y 1 , . . . , Y N )

We introduce the specific structure of the hard parameter sharing structure with pH, DO, COD Mn , and NH 3 -N as the prediction targets. As shown in Figure 2, all indicators from the input layer to the shared parameter layer have the same structure. Take the pH part as an example. We input data (pH 1 , . . . , pH t−1 ) in the input layer, which will be transmitted to the shared parameter layer and converted into the output vector out pH by the fully connected neural network. Similarly, the inputs of DO, COD Mn , and NH 3 -N will be converted into output vectors out DO , out COD , out NHN accordingly. All input data are processed by an MLP and ReLU (rectified linear unit), as shown in Equation (1):

out pH = ReLU(MLP(pH 1 , . . . , pH t−1 )),
out DO = ReLU(MLP(DO 1 , . . . , DO t−1 )),
out COD = ReLU(MLP(CODMn 1 , . . . , CODMn t−1 )),
out NHN = ReLU(MLP(NHN 1 , . . . , NHN t−1 )), (1)

where ReLU is a nonlinear function used to add nonlinearity to the model. Compared with sigmoid, ReLU can effectively alleviate the problems of gradient disappearance and gradient explosion in deep neural networks. The formula of ReLU is shown in Equation (2):

ReLU(x) = max(0, x). (2)

We use the multilayer perceptron (MLP) to extract the deeply hidden features of the water quality time series. MLP can simply and efficiently represent the global features of time series, which helps the subsequent tower layer extract the deep local features for different water quality indicators. The formulation of MLP is shown in Equations (3) and (4), where x denotes the input, W denotes the weight matrix composed of w i , b denotes the bias term, f denotes the activation function, and y denotes the final output:

z = Wx + b, (3)
y = f(z). (4)
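As an illustration of Equation (1) and the concatenation of Equation (11), the sketch below passes each indicator's history through its own small fully connected layer with ReLU and joins the four branch outputs into out_shared. The window length T and hidden size H are assumed values chosen only for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
relu = lambda z: np.maximum(0.0, z)

T = 10   # past time intervals per indicator (assumed)
H = 8    # hidden size of each indicator branch (assumed)
series = {k: rng.normal(size=T) for k in ["pH", "DO", "CODMn", "NH3N"]}

# Equation (1): each indicator's history through its own MLP layer + ReLU.
W = {k: rng.normal(scale=0.1, size=(T, H)) for k in series}
b = {k: np.zeros(H) for k in series}
outs = {k: relu(series[k] @ W[k] + b[k]) for k in series}

# Equation (11): concatenate the four branch outputs into out_shared.
out_shared = np.concatenate([outs[k] for k in ["pH", "DO", "CODMn", "NH3N"]])
```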
The MLP model consists of three parts: input layer, hidden layer, and output layer. The number of hidden layers in the MLP can be adjusted as a hyperparameter. The number of neurons in the output layer is the number of the water quality prediction indicators. We train the MLP model with the BP (Back Propagation) algorithm, whose loss propagates back from the top layer to the bottom layer.
The last layer of the network is the output layer, and the loss function is defined as Equation (5), where L n represents all neurons of the layer, y n (j) represents the output of the j-th neuron, t denotes the predicted value corresponding to (pH t , DO t , CODMn t , NHN t ), and y denotes the real value corresponding to (pH t , DO t , CODMn t , NHN t ):

L = (1/2) Σ j∈L n (t (j) − y (j) ) 2 . (5)
The variables w and b are obtained by gradient descent to minimize the loss function. Equations (6)-(8) show the calculation of the gradients of w and b:

δ = (t − y) ⊙ f′(z), (6)
∂L/∂w = δ x T , (7)
∂L/∂b = δ. (8)

The parameter update formulas of each layer are expressed in matrix form, as shown in Equations (9) and (10), where η is the learning rate:

W ← W − η ∂L/∂W, (9)
b ← b − η ∂L/∂b. (10)

The well-trained model consists of the updated w and b, and the outputs of Equation (1) can then be obtained. Concatenating the four output vectors of Equation (1), as shown in Equation (11), we obtain out shared :

out shared = concat(out pH , out DO , out COD , out NHN ). (11)
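The gradient-descent updates of Equations (9) and (10) can be demonstrated on a simple linear model; the data, learning rate, and iteration count below are arbitrary choices for illustration, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 3))            # 50 samples, 3 features (made up)
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.3                    # targets from a known linear rule

w, b, lr = np.zeros(3), 0.0, 0.1        # parameters and learning rate (eta)
for _ in range(300):
    err = X @ w + b - y                 # prediction error
    grad_w = X.T @ err / len(y)         # gradient dL/dw
    grad_b = err.mean()                 # gradient dL/db
    w -= lr * grad_w                    # Equation (9): W <- W - eta * dL/dW
    b -= lr * grad_b                    # Equation (10): b <- b - eta * dL/db
```

Because the toy data are noiseless, the updates drive w and b toward the generating values, which is exactly the minimization of the loss that the paragraph describes.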
out shared is the input of the different tower layers (pH, DO, COD Mn , and NH 3 -N correspond to different towers), which generate the corresponding predictions. Due to the different prediction targets, the tower layer structures can be different. Although the input out shared is the same for all towers, the output of each tower layer is different, as shown in Equation (12):

ŷ pH = Tower pH (out shared ),
ŷ DO = Tower DO (out shared ),
ŷ COD = Tower COD (out shared ),
ŷ NHN = Tower NHN (out shared ). (12)

Finally, the Root Mean Square Error (RMSE) between the predicted values (ŷ pH t , ŷ DO t , ŷ CODMn t , ŷ NHN t ) and the real values (pH t , DO t , CODMn t , NHN t ) is calculated as the loss, as shown in Equation (13), where N is the number of samples:

Loss = √( (1/N) Σ i=1..N (ŷ i − y i ) 2 ). (13)

The loss is backpropagated to update the model parameters until the model converges.
All tasks share one shared parameter layer in the hard parameter sharing structure, and different tower layers are built for different tasks. Such a structure reduces the complexity of the model structure and the number of parameters. It also ensures the model's flexibility, since the model is required to learn a general implicit embedding in the sharing layer that makes each task perform better, thus reducing the risk of overfitting.

Soft Parameter Sharing Structure of Multi-Indicator Water Quality Prediction (Multi-Task-Soft)
The shared parameter layer of Multi-Task-Hard cannot reflect the relationship between different tasks well and cannot guarantee stable model performance. Therefore, we propose a soft parameter sharing structure-based multi-indicator water quality prediction model (Multi-Task-Soft), built on Multi-Task-Hard. In Multi-Task-Soft, data are input to the modules of different tasks to extract different features, and the tasks jointly maintain an implicit vector to learn the correlation between different indicators.
The architecture of Multi-Task-Soft is similar to that of Multi-Task-Hard, as shown in Figure 3, and is also composed of four parts. Their main difference is the design of the shared parameter layer. Different from the single parameter sharing layer of Multi-Task-Hard, Multi-Task-Soft inputs the data to modules of different tasks to obtain different outputs. The structure also maintains an implicit vector to learn the correlation between different indicators. The implicit vector is merged with the outputs corresponding to the underlying structures of each task, and the merged results are input to the tower layer. Finally, each tower model outputs the prediction results required by the corresponding task. The model process is shown in Algorithm 2.
Meanwhile, (X pH , X DO , X COD , X NHN ) is also used as the input of another fully connected neural network to obtain the output vector hidden shared , as shown in Equation (14):

hidden shared = ReLU(MLP(X pH , X DO , X COD , X NHN )). (14)

We then concatenate the task-specific output vectors with hidden shared to obtain the corresponding vectors, as shown in Equation (15):

v pH = concat(out pH , hidden shared ),
v DO = concat(out DO , hidden shared ),
v COD = concat(out COD , hidden shared ),
v NHN = concat(out NHN , hidden shared ). (15)

Then, we input the vectors to the corresponding tower layers. Similar to Multi-Task-Hard, pH, DO, COD Mn , and NH 3 -N correspond to different towers, and the tower layer can be any neural network structure. For different prediction indicators, the tower layer structure is different, which makes the corresponding output different. Taking MLP as an example, the predictions are shown in Equation (16):

ŷ pH = Tower pH (v pH ), ŷ DO = Tower DO (v DO ), ŷ COD = Tower COD (v COD ), ŷ NHN = Tower NHN (v NHN ). (16)

Finally, the RMSE between the predicted and real values is calculated as the loss, and the model parameters are updated by backpropagation until the model converges.
In this structure, the association between different indicators is obtained by learning an implicit public vector, and each task has its unique learning module. Finally, the individual learning and joint learning results are merged to achieve better prediction results.
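The soft sharing idea above (task-specific bottom modules plus one jointly maintained implicit vector merged by concatenation, as in Equation (14)) can be sketched as follows; the window length, hidden size, and weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
relu = lambda z: np.maximum(0.0, z)

T, H = 10, 8                            # assumed window length and hidden size
tasks = ["pH", "DO", "CODMn", "NH3N"]
x = {k: rng.normal(size=T) for k in tasks}

# Task-specific bottom modules: each indicator has its own layer.
W_task = {k: rng.normal(scale=0.1, size=(T, H)) for k in tasks}
outs = {k: relu(x[k] @ W_task[k]) for k in tasks}

# Jointly maintained implicit vector (Equation (14)): all inputs together.
x_all = np.concatenate([x[k] for k in tasks])
W_hidden = rng.normal(scale=0.1, size=(4 * T, H))
hidden_shared = relu(x_all @ W_hidden)

# Merge each task's own output with the shared vector before its tower.
v = {k: np.concatenate([outs[k], hidden_shared]) for k in tasks}
```

The towers then consume these merged vectors, so each task sees both its individual features and the jointly learned correlation signal.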

Gating Parameter Sharing Structure of Multi-Indicator Water Quality Prediction (Multi-Task-Gate)
To better learn the relative weight of different indicators for the task, we further add the gating module in the parameter sharing layer. As shown in Figure 4, the input is processed by different modules to obtain different implicit features. The implicit features obtain the weight of the current task through SoftMax. According to the weight, different implicit vectors are weighted and summed to obtain the tower layer input of each task. Finally, each tower model outputs the prediction results of the corresponding task. The model process is shown as Algorithm 3.
The design of the shared parameter layer is similar to Multi-Task-Hard: the data (pH 1 , . . . , pH t−1 ) ∈ X pH , (DO 1 , . . . , DO t−1 ) ∈ X DO , (CODMn 1 , . . . , CODMn t−1 ) ∈ X COD , (NHN 1 , . . . , NHN t−1 ) ∈ X NHN are input separately to the fully connected neural networks of the shared parameter layer to obtain the output vectors out pH , out DO , out COD , out NHN , respectively, as shown in Equation (1).
Unlike Multi-Task-Hard, the module calculates the importance of the different output vectors for predicting pH instead of concatenating them and feeding them to the tower layer. Taking the prediction of pH as an example, we obtain the relative weights of the different indicators through softmax, which maps the relative weights (w pH , w DO , w COD , w NHN ) to values between 0 and 1, as shown in Equation (17):

(w pH , w DO , w COD , w NHN ) = softmax(MLP pH (out pH , out DO , out COD , out NHN )). (17)
Meanwhile, the output vectors (out pH , out DO , out COD , out NHN ) are mapped through an MLP to (hidden pH , hidden DO , hidden COD , hidden NHN ), as shown in Equation (18):

(hidden pH , hidden DO , hidden COD , hidden NHN ) = MLP hidden (out pH , out DO , out COD , out NHN ). (18)
The vector input of the tower layer is obtained by weighted fusion, as shown in Equation (19):

v pH = w pH · hidden pH + w DO · hidden DO + w COD · hidden COD + w NHN · hidden NHN . (19)

Similarly, the tower layers of DO, COD Mn , and NH 3 -N obtain the corresponding inputs, and the tower layers are designed as MLPs. For different prediction indicators, the tower layer structure and output can be different. Taking MLP as an example, the formula is shown as Equation (16) in Section 2.3.2.
Finally, the RMSE between the predicted value and the real value is calculated as the loss, and the model parameters are updated by the backpropagation method until the model converges.
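The gating computation of Equations (17)-(19) for the pH task can be sketched as follows; the hidden size and the single-layer gate network are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

def softmax(z):
    e = np.exp(z - z.max())   # subtract max for numerical stability
    return e / e.sum()

H = 8
tasks = ["pH", "DO", "CODMn", "NH3N"]
hidden = {k: rng.normal(size=H) for k in tasks}   # mapped vectors, Equation (18)

# Equation (17): a small gate network produces one weight per indicator.
concat = np.concatenate([hidden[k] for k in tasks])
W_gate = rng.normal(scale=0.1, size=(4 * H, 4))
w = softmax(concat @ W_gate)                      # (w_pH, w_DO, w_COD, w_NHN)

# Equation (19): weighted fusion forms the pH tower's input.
v_pH = sum(w[i] * hidden[k] for i, k in enumerate(tasks))
```

Because the weights come from a softmax, they are non-negative and sum to one, so the fusion is a convex combination of the indicator representations.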
The gating parameter sharing structure does not learn implicit vectors to extract the connections between tasks; instead, it learns the importance and connections of different indicators relative to a single task through the gating mechanism, which improves prediction performance.

Gated Hidden Parameters Sharing Structure of Multi-Indicator Water Quality Prediction (Multi-Task-GH)
This section proposes a multi-task learning structure that combines the advantages of the soft parameter sharing structure and the gated parameter sharing structure. As shown in Figure 5, the structure of the gated hidden parameters sharing structure (Multi-Task-GH) is similar to Multi-Task-Gate, except that there is a model for learning an intermediate hidden vector in the parameter sharing layer. This intermediate implicit vector is similar to the Multi-Task-Soft design and is combined with all the other implicit vectors. The output results are input to the tower layer through the gating mechanism. Finally, each tower model outputs the prediction results of the corresponding task. The model algorithm process is shown in Algorithm 4. Figure 6 shows an example of the kind of time series for each indicator.
We then input (X pH , X DO , X COD , X NHN ) to another implicit-vector MLP to obtain the output vector hidden shared , as shown in Equation (20):

hidden shared = ReLU(MLP(X pH , X DO , X COD , X NHN )). (20)

Unlike Multi-Task-Soft, the module calculates the importance for the prediction target of each of the vectors out pH , out DO , out COD , and out NHN together with hidden shared . The relative weights for the prediction target are obtained through Softmax, as shown in Equation (21):

(w pH , w hidden ) = Softmax(MLP pH (out pH , hidden shared )),
(w DO , w hidden ) = Softmax(MLP DO (out DO , hidden shared )),
(w COD , w hidden ) = Softmax(MLP COD (out COD , hidden shared )),
(w NHN , w hidden ) = Softmax(MLP NHN (out NHN , hidden shared )). (21)

Meanwhile, the output vectors are mapped through a fully connected neural network to (hidden pH , hidden DO , hidden COD , hidden NHN ), as shown in Equation (22):

hidden pH = ReLU(MLP hidden (out pH , hidden shared )),
hidden DO = ReLU(MLP hidden (out DO , hidden shared )),
hidden COD = ReLU(MLP hidden (out COD , hidden shared )),
hidden NHN = ReLU(MLP hidden (out NHN , hidden shared )). (22)

The vector input of each tower layer is obtained by weighted fusion, as shown in Equation (23):

v pH = w pH · hidden pH + w hidden · hidden shared , (23)

and similarly for v DO , v COD , and v NHN . We then input v pH , v DO , v COD , and v NHN into the corresponding tower layers. For different prediction indicators, the tower layer structure and output are different. Taking MLP as an example, the formula is shown as Equation (16) in Section 2.3.2. Finally, the RMSE between the predicted and real values is calculated as the loss, and the model parameters are updated by backpropagation until the model converges.
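The two-way gate of Equation (21), the mapping of Equation (22), and the weighted fusion for one task can be sketched as follows; all sizes and weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
relu = lambda z: np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

H = 8
out_pH = rng.normal(size=H)            # pH branch output (as in Equation (1))
hidden_shared = rng.normal(size=H)     # shared implicit vector, Equation (20)

# Equation (21): a two-way gate between the task branch and the shared vector.
W_gate = rng.normal(scale=0.1, size=(2 * H, 2))
w_pH, w_hidden = softmax(np.concatenate([out_pH, hidden_shared]) @ W_gate)

# Equation (22): map the pair to the task's hidden vector.
W_hid = rng.normal(scale=0.1, size=(2 * H, H))
hidden_pH = relu(np.concatenate([out_pH, hidden_shared]) @ W_hid)

# Weighted fusion forms the pH tower's input.
v_pH = w_pH * hidden_pH + w_hidden * hidden_shared
```

The same pattern with MLP_DO, MLP_COD, and MLP_NHN yields the tower inputs for the other three indicators.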

Summary of Four Water Quality Prediction Models
The structures of the proposed four water quality prediction models are summarized in Table 1. The input layer and output layer are not listed in the table due to their similarity and simplicity. For more details, please refer to Appendix A.

Table 1. The structure of the proposed four water quality prediction models.


Experiment Setup
This section introduces the datasets, evaluation metrics, baseline models, and model settings for the evaluation.

Datasets
The experiment datasets come from 147 water quality monitoring stations set up by the China National Environmental Monitoring Station in China's seven river systems and lakes. The water quality indicators monitored at each station include pH, DO, COD Mn , and NH 3 -N. We have two datasets: D-s (Dataset-short) from 2013 to 2015 and D-l (Dataset-long) from 2012 to 2018. We select 120 stations with relatively complete data as the experiment dataset. Among them, there are 7 monitoring stations in the Pearl River, 22 in the Yangtze River, 11 in the Songhua River, 7 in the Liaohe River, 12 in the Yellow River, 26 in the Huaihe River, 6 in the Haihe River, 6 in Taihu Lake, 4 in Poyang Lake, and 18 in other large lakes and rivers. Detailed statistics of the dataset are shown in Table 2.

Evaluation Metrics
We select Root Mean Square Error (RMSE), Mean Absolute Percentage Error (MAPE), and Mean Absolute Error (MAE) as the evaluation metrics, which are widely used for time series prediction models [29]. Note that the lower the values of RMSE, MAPE, and MAE, the better the performance. The RMSE, MAE, and MAPE are calculated as follows:

RMSE = √( (1/N) Σ i=1..N (y i − ŷ i ) 2 ),
MAE = (1/N) Σ i=1..N |y i − ŷ i |,
MAPE = (100%/N) Σ i=1..N |(y i − ŷ i )/y i |,

where y i indicates the i-th real value, ŷ i indicates the i-th predicted value, and N is the number of data samples. These three metrics measure the error between the predicted and real values. MAPE reflects the relative error between the predicted and real values, while MAE is a simple superposition of the absolute errors; therefore, MAPE can more accurately reflect the deviation degree of the predicted values. RMSE first squares the error values, so if the dispersion of the errors is high, the RMSE is magnified. Therefore, RMSE is more affected by outliers than MAE and MAPE, but they are at the same data level [30].
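The three metrics can be computed directly; the small arrays below are made-up values used only to exercise the definitions.

```python
import numpy as np

def rmse(y, yhat):
    return float(np.sqrt(np.mean((y - yhat) ** 2)))

def mae(y, yhat):
    return float(np.mean(np.abs(y - yhat)))

def mape(y, yhat):
    # relative error, expressed as a percentage of the real values
    return float(np.mean(np.abs((y - yhat) / y)) * 100)

y    = np.array([8.0, 7.5, 6.0, 9.0])   # made-up real values
yhat = np.array([7.5, 7.5, 6.5, 8.0])   # made-up predictions
```

On these values the single error of 1.0 inflates RMSE relative to MAE, illustrating the outlier sensitivity discussed above.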

Model Setting
For model learning, the input space node number is 120, the sequence length is 10, and the dimension of each time point is 4, representing four water quality indicators (pH, DO, COD Mn , and NH 3 -N). For prediction, the output space node number is also 120, the sequence length is set to 1, and the water quality indicators at each time point are also pH, DO, COD Mn , and NH 3 -N. The first 60% of the data is used for training, 20% is used for validation, and the last 20% is used for testing. The prediction time step is set to 1. In other words, the historical water quality values of 120 monitoring stations in the previous ten weeks are used to predict their values in the next week. We compare the proposed models with other models to verify the effectiveness of the proposed models.
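The chronological 60/20/20 split and the ten-week sliding windows described above can be sketched as follows; the placeholder series stands in for the real station data.

```python
import numpy as np

data = np.arange(100, dtype=float)        # placeholder series of 100 weeks
n = len(data)
train = data[: int(0.6 * n)]              # first 60% for training
val   = data[int(0.6 * n): int(0.8 * n)]  # next 20% for validation
test  = data[int(0.8 * n):]               # last 20% for testing

def windows(series, lookback=10, horizon=1):
    """Ten past weeks predict the value one step ahead."""
    X, y = [], []
    for i in range(len(series) - lookback - horizon + 1):
        X.append(series[i: i + lookback])
        y.append(series[i + lookback + horizon - 1])
    return np.array(X), np.array(y)

X_train, y_train = windows(train)
```

Splitting chronologically (rather than randomly) keeps the test period strictly after the training period, which matches how the model would be used for forecasting.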
For all deep learning models, Adam is used as the optimizer. Adam combines the advantages of AdaGrad (adaptive gradient) and RMSProp (root mean square propagation), updating the step size by jointly considering the first-moment estimate (the mean of the gradient) and the second-moment estimate (the variance of the gradient). The learning rate is adjusted automatically, and the fluctuation range of the adjustment is limited [29]. Its hyperparameters are highly interpretable and usually need only light tuning or none at all, which makes it suitable for large-scale data and parameter scenarios. We choose RMSE as the loss function. The learning rate is set to 0.001, and the number of epochs and the batch size are set to 100 and 5, respectively.
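The moment estimates described above can be written out for a single parameter; this toy example minimizes f(θ) = θ² and uses Adam's standard defaults β1 = 0.9 and β2 = 0.999 with the paper's learning rate of 0.001.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter."""
    m = b1 * m + (1 - b1) * grad           # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad ** 2      # second-moment (variance) estimate
    m_hat = m / (1 - b1 ** t)              # bias-corrected moments
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 101):                    # 100 steps on f(theta) = theta^2
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t)
```

Note how the update divides by the square root of the second-moment estimate, which is what keeps the effective step size near the learning rate regardless of the raw gradient scale.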

Results and Discussion
In this section, we compare the proposed method with baselines on the prediction performance of the single-indicator and multi-indicator. We then compare the influence of different tower layers on the model to verify the proposed methods' robustness and analyze the models' predictive performance for different rivers and lakes. We also show the training loss and validation loss of the best multi-task water quality prediction model.

Comparison of Prediction Performance for Single-Indicator
In this section, we compare the overall prediction performance of the four multi-task learning models with seven baselines for single-indicator prediction. The experimental results are shown in Table 3, where the bold numbers are the best and the numbers with an asterisk are the second best. (1) For pH, the hard parameter sharing structure (Multi-Task-Hard), soft parameter sharing structure (Multi-Task-Soft), gated parameter sharing structure (Multi-Task-Gate), and gated hidden parameter sharing structure (Multi-Task-GH) achieve better performance on all the metrics. Multi-Task-GH achieves the best performance, which means the pH values predicted by Multi-Task-GH are closest to the real values.
(2) For DO, the four multi-task learning models also achieve better performance. Among them, Multi-Task-Hard performs worse than the other multi-task models (Multi-Task-Soft, Multi-Task-Gate, and Multi-Task-GH) and even worse than some traditional deep learning models (MLP and GRU). Multi-Task-GH still achieves the best performance.
(3) For CODMn, the MLP model achieves the best RMSE and MAE, and among the four multi-task learning models only Multi-Task-Gate achieves the best MAPE. That is, although the CODMn values predicted by MLP are closest to the observed values, the prediction performance of the three soft parameter sharing models (Multi-Task-Soft, Multi-Task-Gate, and Multi-Task-GH) is almost the same as that of MLP, so these multi-task models can still achieve close results.
(4) For NH3-N, MLP achieves the best performance only in MAPE, while Multi-Task-GH achieves the best results in both RMSE and MAE. This means that the NH3-N values predicted by the Multi-Task-GH model are closer to the real values in most cases.
As shown in Table 4, we further validate the proposed models on the dataset D-l; in the table, the bold numbers are the best. Similar to the results on D-s, the multi-task learning models achieve better performance than the other models in most cases, and Multi-Task-GH achieves the best results in most of them.
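The tables report RMSE, MAE, and MAPE. For reference, these metrics follow the standard definitions (the helper functions and toy values below are illustrative, not data from the paper):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))

def mape(y_true, y_pred):
    """Mean absolute percentage error; assumes the observations are
    non-zero, which holds for indicators such as pH and DO."""
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)

y_true = np.array([8.0, 10.0, 4.0])   # toy observed values
y_pred = np.array([7.5, 10.5, 4.0])   # toy predictions
```

RMSE penalizes large errors more heavily than MAE, while MAPE is scale-free, which is why the three metrics can rank models differently.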

Comparison of Four-Indicator and Three-Indicator Multi-Task Learning Models
It is worth mentioning that we have also conducted experiments on three-indicator multi-task learning models. The experimental results lead to a similar conclusion, but the overall performance is weaker than that of the four-indicator models. The results are shown in Table 5, where 4-task denotes the four-indicator multi-task learning model and 3-task denotes the three-indicator multi-task learning models built on (pH, DO, CODMn), (DO, CODMn, NH3-N), (pH, DO, NH3-N), and (pH, CODMn, NH3-N). In the table, the bold numbers are the best. The results on D-s show similar trends.

Table 6 shows the average prediction performance of the seven baselines and the four multi-task learning models for multi-indicator prediction on D-s. In the table, the bold numbers are the best, and the numbers with an asterisk are the second best. Due to space limitations, we only report the results on D-s in the following sections, because the results on D-s and D-l show similar trends. Among the models, Multi-Task-GH achieves the best performance on all the metrics. Although a single-task learning model may achieve the best effect in predicting one target water quality indicator in some cases, its accuracy decreases when predicting the other indicators. Therefore, when the same model structure is used to predict multiple target indicators (pH, DO, CODMn, and NH3-N) simultaneously, the Multi-Task-GH model accomplishes this task well and achieves the best performance on most indicators. This means that Multi-Task-GH can accurately predict multiple indicators.

Tower Layer Analysis
This paper also analyzes the impact of different tower types on the prediction performance of the Multi-Task-GH model, as shown in Table 7, where the bold numbers are the best. Five deep learning structures (LSTM, GRU, CNN, ATTENTION, and MLP) are used as the tower layer of the Multi-Task-GH model to train the model and predict the water quality indicators. The results show that the Multi-Task-GH model with an MLP tower layer achieves the best performance in most cases. Its robustness is also the best: there is no sharp drop in prediction accuracy across the different water quality indicators. For example, when the ATTENTION-based structure is used as the tower layer, the prediction results for DO, CODMn, and NH3-N are good, while the results for pH degrade greatly and are the worst among the five structures. This shows that the ATTENTION-based structure is not well suited to predicting the four water quality indicators simultaneously.
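The idea of swapping tower architectures can be sketched as interchangeable callables applied on top of a shared representation (a toy NumPy sketch with randomly initialized, untrained weights; the names and shapes are ours, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_tower(x, hidden=32, out=1):
    """An MLP tower: one ReLU hidden layer, then a linear output.
    Weights are drawn at random here purely for illustration."""
    w1 = rng.standard_normal((x.shape[-1], hidden)) * 0.1
    w2 = rng.standard_normal((hidden, out)) * 0.1
    return np.maximum(x @ w1, 0.0) @ w2

def build_towers(shared_features, tower_fn, n_tasks=4):
    """Apply the same tower architecture independently per task, mirroring
    how LSTM/GRU/CNN/attention/MLP towers are swapped in the experiments."""
    return [tower_fn(shared_features) for _ in range(n_tasks)]

shared = rng.standard_normal((5, 16))   # toy shared-bottom output
outputs = build_towers(shared, mlp_tower)
```

Because only the tower function changes, the shared bottom and training loop stay identical across the five compared structures.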

The Difference between Predictions and Real Data
To show the ability of the Multi-Task-GH model, we compare its predictions with the real measured data, as shown in Figure 7. The two curves present similar trends, which shows that the Multi-Task-GH model can predict the changes in the indicators.

Model Training Loss and Validation Loss
We train the baselines and the proposed four multi-task learning models with fixed hyperparameters. As shown in Figure 8, the two curves are the training loss and validation loss of the Multi-Task-GH model. Both loss curves converge after about ten epochs of training.
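A convergence check of this kind can also be expressed programmatically, e.g. by testing whether the loss has plateaued over the most recent epochs (a generic helper of our own, not the paper's code):

```python
def has_converged(losses, window=5, tol=1e-3):
    """Declare convergence when the loss varies by less than `tol`
    over the last `window` epochs."""
    if len(losses) < window:
        return False
    recent = losses[-window:]
    return max(recent) - min(recent) < tol
```

Such a check can drive early stopping so that training halts once the validation loss flattens out.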

Related Work Analysis
With the rapid development of machine learning, scholars have begun to explore water quality prediction methods based on machine learning, such as the support vector machine, genetic algorithms, and clustering algorithms. Recently, some researchers have proposed deep learning methods for water quality prediction, mainly aimed at predicting a single indicator. These models are based on single-task learning; the representative models are as follows. Avila et al. [16] adopted ridge regression to predict water quality. Lu et al. [31] used PCA to assess water quality. Chen et al. [32] used a machine learning algorithm with an integrated boosting method. Ahmed et al. [33] stacked multiple fully connected layers to predict water quality. Barzegar et al. [34] employed one convolution layer and LSTM layers for water quality parameter prediction. Yang et al. [21] incorporated one LSTM and two fully connected layers for prediction, which can extract short-term and long-term correlations of water quality and avoid gradient disappearance. Shrestha et al. [35] incorporated one GRU and two fully connected layers for water quality prediction.
Vaswani et al. and Jaderberg et al. [36,37] stacked three self-attention layers and two fully connected layers to mine the sequential relationship of water quality data. The input sequence is first converted to embedding through the first fully connected layer. The converted embedding then completes the information aggregation on the time step through the three-layer self-attention mechanism. Finally, it generates the water quality prediction through a fully connected layer.
Prediction methods based on deep learning can extract the complex nonlinear characteristics and time-dependent relationships of water quality data well, which yields good prediction performance, but they still have some problems.
(1) They cannot predict multiple indicators with one model. A trained model often performs well only on one prediction indicator. If the model is used to predict other indicators without changing its structure and parameters, the performance will be greatly reduced. Therefore, different models must be trained to predict multiple indicators, which leads to extended training and prediction time and large model storage space.
(2) They cannot consider the correlations between multiple indicators. When the water quality of the same water area improves or deteriorates, different indicators may be correlated. A single-indicator prediction model has difficulty exploiting such correlations.

This paper proposes water quality prediction models based on multi-task learning to solve the above problems. A model based on multi-task learning learns the relevance between different tasks and thus improves the prediction performance of each task. Multi-task learning also saves parameter space and prediction time by sharing part of the model [38,39].
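The shared-bottom idea behind this approach can be sketched as one shared representation feeding several per-task towers, in the style of hard parameter sharing (a toy NumPy forward pass with untrained random weights; all shapes and layer sizes are illustrative choices of ours):

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(x):
    return np.maximum(x, 0.0)

class HardSharedModel:
    """Hard parameter sharing: one shared bottom, one small tower per task."""

    def __init__(self, in_dim, shared_dim=32, tower_dim=16, n_tasks=4):
        self.w_shared = rng.standard_normal((in_dim, shared_dim)) * 0.1
        self.towers = [
            (rng.standard_normal((shared_dim, tower_dim)) * 0.1,
             rng.standard_normal((tower_dim, 1)) * 0.1)
            for _ in range(n_tasks)
        ]

    def forward(self, x):
        h = relu(x @ self.w_shared)               # shared representation
        return [relu(h @ w1) @ w2 for w1, w2 in self.towers]

model = HardSharedModel(in_dim=40)                # e.g. 10 weeks x 4 indicators
preds = model.forward(rng.standard_normal((8, 40)))
```

Because the bottom weights are shared across all four indicator tasks, the parameter count grows only with the small towers rather than with full per-indicator models.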

Conclusions
This paper proposed a multi-task-learning-based prediction method to address the shortcomings of single-task learning models for water quality prediction. Four multi-task learning structures are proposed based on the idea of sharing bottom structures: the hard parameter sharing structure, the soft parameter sharing structure, the gated parameter sharing structure, and the gated hidden parameter sharing structure. Sufficient experiments are designed and implemented to demonstrate the effectiveness of the proposed method.
However, there is still room for improving the proposed method. The training gradients of the different tasks show a magnitude gap during backpropagation, leading to unstable training. It is also hard to train with a large number of water quality indicators in multi-task learning, because even the balance among the four indicators can get out of control.
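One common way to address such a magnitude gap is to weight the per-task losses, e.g. normalizing each loss by its own magnitude so that no single task dominates the gradient (a generic sketch of the idea, not the method used in this paper):

```python
import numpy as np

def weighted_multitask_loss(task_losses, weights=None):
    """Combine per-task losses into one scalar training objective.

    If no explicit weights are given, each loss is weighted inversely to
    its magnitude, which evens out tasks whose losses differ in scale."""
    task_losses = np.asarray(task_losses, dtype=float)
    if weights is None:
        weights = 1.0 / np.maximum(task_losses, 1e-8)
        weights /= weights.sum()
    return float(np.sum(weights * task_losses))
```

More elaborate schemes learn the weights during training, but even static reweighting can reduce the instability described above.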
In this paper, we did not explicitly take the data distributions and the importance of different tasks into consideration, because Multi-Task-GH can implicitly learn the unique and shared joint weights of each subtask through the gate network, which also implicitly reflects the different effects of the data distributions. However, if the data distributions and task importance were explicitly considered, e.g., as constraints in the loss function or as regularization, the models could perform better. In the future, we will conduct more data analysis and design more reasonable losses for the tasks.

Data Availability Statement: Publicly available datasets were analyzed in this study. The data can be found here: http://envi.ckcest.cn/environment/special/special_list.jsp?specialId=108, accessed on 10 October 2021.

Conflicts of Interest:
The authors declare no conflict of interest. The company "T.Y.Lin International Engineering Consulting (China) Co., Ltd." had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Appendix A. Hyper-Parameters
In our proposed framework, there are many hyper-parameters. For the MLP, we tune the number of hidden layers in the range {1, 2, 3} and the number of hidden units in the range {16, 32, 64}. Unless otherwise mentioned, we set the hidden units of the first MLP layer to 64, the second to 32, and the third to 16. We further tune the number of training epochs in the range {100, 200, 400}; the batch size is 8, and the learning rate is 0.001. The input space node number is 120, the sequence length is 10, and the dimension of each time point is 4, representing the four water quality indicators: pH, DO, CODMn, and NH3-N. All the information is summarized in Table A1. Table A2 shows the details of the input-output parameters.
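The search space above can be enumerated as a simple grid (the helper below is illustrative; the value ranges are those listed in this appendix):

```python
from itertools import product

# Ranges stated in the appendix: hidden layers, hidden units, epochs.
hidden_layers = [1, 2, 3]
hidden_units = [16, 32, 64]
epochs = [100, 200, 400]

# Each configuration also carries the fixed batch size and learning rate.
grid = [
    {"layers": l, "units": u, "epochs": e, "batch_size": 8, "lr": 0.001}
    for l, u, e in product(hidden_layers, hidden_units, epochs)
]
```

Enumerating 3 x 3 x 3 = 27 configurations keeps the search tractable while covering every combination of the stated ranges.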