Water Quality Prediction Based on LSTM and Attention Mechanism: A Case Study of the Burnett River, Australia

Abstract: Prediction of water quality is a critical aspect of water pollution control and prevention. Water quality trends can be predicted from historical data collected through water quality monitoring and water environment management. The present study develops a long short-term memory (LSTM) network and an attention-based variant (AT-LSTM) to predict water quality in the Burnett River, Australia. The AT-LSTM model applies an attention mechanism after feature extraction from the water quality data of the Burnett River section, weighting the contribution of the sequence at different moments so that key features have a greater influence on the prediction results. The study performs one-step-ahead and multistep-ahead forecasting of dissolved oxygen (DO) in the Burnett River with the LSTM and AT-LSTM models and compares the results. The results demonstrate that including the attention mechanism improves the prediction performance of the LSTM model. The AT-LSTM-based forecasting model developed in this study therefore showed a stronger capability than the LSTM model to predict water quality in the Burnett River accurately and to inform the Water Quality Improvement Plan of Queensland, Australia.


Introduction
Changes in water quality greatly affect ecosystem and human health. Prediction of water quality is the use of a long-term collection of water quality data to forecast possible water quality trend over a period of time for the future. It provides a scientific decision-making basis for assessing water environment in advance and preventing the large-scale occurrence of water pollution problems. Accurate water quality prediction plays an essential role in improving water management and pollution control. The goal of the Burnett Water Quality Improvement Plan of Queensland, Australia, is to manage the pollutant loads into the Burnett waterways and to help protect the Great Barrier Reef (GBR) region. Establishment of an effective and accurate water quality parameter prediction model is of great significance for improving the water quality of the Burnett River [1].
As the cost of hardware for water quality monitoring has been decreasing, it has become possible to deploy water quality monitoring sensors at a large scale in rivers and lakes [2]. These sensors automatically monitor parameters such as DO, pH, turbidity, and many other indicators. All indicators are recorded in order of monitoring time, forming time series. In recent years, with the large-scale operation of automatic water quality monitoring stations, substantial data are being produced, and big data-driven water quality prediction models are receiving more and more attention in the field of water research. However, due to the variety of water quality indicators, the long period of data collection, the correlation among the indicators, the nonlinearity between water quality characteristics, and the volatility of the data, accurate and effective prediction of water quality remains a challenging issue. Current research mainly focuses on how to improve the applicability and reliability of water quality prediction models [3].
River water quality exhibits characteristics such as seasonality and periodicity on macroscopic timescales, with nonlinearity and uncertainty [4]. River water quality parameters are not only affected by external factors, but also by the historical values of independent variables and random perturbations, and the selection of lagged terms and correlation variables is one of the factors affecting the prediction accuracy. Organic contaminants account for a large proportion of river water pollution. Organic pollution of rivers, due to the oxygen consumption from decomposition of contaminants, causes dissolved oxygen in water to decrease rapidly, consequently leading to deterioration of water quality. The self-purification ability of river water bodies decreases, resulting in serious damage to the ecosystem. Dissolved oxygen is a critical parameter for the pollution of river water, as well as an important indicator for whether the river water has the ability of self-purification. In the water environment, the growth of aquatic plants and animals cannot be separated from the appropriate amount of dissolved oxygen. In this study, the key parameter of DO for water quality evaluation is used as the target for model construction and prediction evaluation [5].
There are many traditional statistical techniques for water quality prediction [6], such as the autoregressive integrated moving average (ARIMA) method [7] and the multiple linear regression (MLR) model [8,9]. However, water environment indicators are affected by many complex physical, chemical, and biological factors, and they have strong nonlinear characteristics. The above linear models often fail to consider the influence of these factors in an integrated manner. To address this limitation, water quality prediction has drawn on traditional machine learning methods, including support vector regression (SVR) [10][11][12], artificial neural networks (ANN) [13][14][15], and other nonlinear methods. Singh et al. [13] proposed the use of ANN to predict water environment time series indicators. Barzegar et al. [16] used a wavelet neural network (GWNN) to predict the salt concentration of the Aji Chay River in Northwest Iran; calibration and verification of the model showed the superiority of GWNN in water quality prediction. ANN is also a powerful data-driven prediction model that can fit the nonlinear relationships in time series well [17]. However, these methods still do not sufficiently learn the hidden relevant features in the time series, which significantly affects prediction accuracy.
Currently, most water quality data are long-range correlated series [18], and the corresponding time series may contain important events separated by long delays and intervals. It is difficult for traditional machine learning methods to fully use the information available in long historical observations. The recurrent neural network (RNN) [19] is a deep learning method capable of preserving and utilizing memory from previous network states [20]. RNN is very flexible in dealing with time series and capturing nonlinear relationships [21]. Nevertheless, it is difficult for the traditional RNN model to retain long-term dependencies among the variables because of vanishing gradients [22]. Long short-term memory (LSTM) is a variant of RNN that can effectively alleviate the time delay and vanishing gradient problems of RNNs [23] by implementing gating [24]. The interaction among these gates gives LSTM sufficient ability to address the long-term dependencies that general RNNs cannot learn, and it can balance the temporal and nonlinear relationships in the data [25]. LSTM has been widely used in the prediction of water quality [26,27]. Ye et al. [28] combined all the water quality monitoring data of the rivers in Shanghai, used the LSTM model to predict and verify the main pollutant index in the rivers, the potassium permanganate index (COD), and showed that the prediction accuracy and generalization ability of the model exceeded those of the traditional RNN model. Barzegar et al. [29] proposed a hybrid model combining LSTM and a convolutional neural network (CNN) that outperformed single machine learning models, including SVR, CNN, LSTM, and decision tree (DT), in predicting short-term water quality variables for the Small Prespa Lake in Greece.
This shows that there is still much room for research based on LSTM to achieve high-precision water quality prediction, which can be explored in the direction of optimizing the internal structure of the network or combining with other methods [30,31]. LSTM lacks the ability to pay different degrees of attention to sub-window features, which may lead to some relevant information being ignored, and the important characteristics of time series cannot be valued.
In recent years, attention mechanisms [32,33] have been deployed in various tasks of natural language processing [34][35][36][37], including machine translation [38,39], syntactic analysis [40,41], and speech recognition [42,43]. We have applied the attention mechanism to effectively capture the more distant critical information and enhance the influence of the important characteristics on the prediction model by weighting the hidden layer elements at each timestep. Attention mechanisms have also been widely adopted in the field of time series analysis and forecasting [44,45]. Zhou et al. [46] proposed a short-term photovoltaic power forecasting based on LSTM neural network and attention mechanism to forecast short-term photovoltaic power generation in a time series manner. On this basis, we introduced the attention mechanism and developed an AT-LSTM model based on the LSTM model, focusing on better capturing the water quality variables. The DO concentration in the section of the Burnett River, Australia, was predicted using water quality monitoring raw data. Lastly, the prediction results were compared with the LSTM model. We aimed to achieve adaptive learning of long-term dependencies and hidden correlation features of multivariate temporal data to make river water quality predictions more accurate. The Burnett River was considered a case study to illustrate the applicability of the proposed AT-LSTM model.

Study Area and the Data
The Burnett River is located in southeastern Queensland, Australia, and originates on the western slopes of the Burnett Range, east of the Eastern Highlands, in a subtropical climate. The river flows southwest to Eidsvold, turns east at Mundubbera, and finally flows northeast through Gayndah and Bundaberg before entering the Pacific Ocean at Burnett Heads after a course of 270 miles (435 km). It has a catchment area of 12,440 square miles (32,220 square kilometers). The major tributaries are the Auburn and Boyne rivers and the Baramba River. The Burnett Basin has a population of approximately 94,100 residents. The primary land use is grazing (77%, 2,500,000 hectares), followed by forestry (12%, 405,000 hectares). The Burnett watershed also contains dryland cultivation, the largest cultivated area (approximately 81,000 hectares), irrigated cultivation (approximately 41,000 hectares), sugar cane (approximately 10,100 hectares), and horticulture (approximately 10,000 hectares). It also contains several impoundments, including Paradise Dam. The catchment has undergone extensive modifications over the past 40 years, including industrial and port development in the estuary. The data used for this study are water quality data from the Burnett River automatic monitoring sites, the locations and catchment boundaries of which are shown in Figure 1. To ensure the reliability and applicability of the model, we used the water quality monitoring data collected from January 2015 to January 2020 in the Burnett River. The data are collected every half-hour and include six characteristics: water temperature (Temp), pH, dissolved oxygen (DO), conductivity (EC), chlorophyll-a (Chl-a), and turbidity (NTU). In this paper, hourly water quality data comprising 39,752 records are used, with dissolved oxygen as the output variable. Table 1 shows the descriptive statistics of the data.
In this water quality dataset, the indicators of the water quality data are mainly used to assess the water quality of the river. The DO used in this experiment is a key indicator of water organic pollution, which can reflect the degree of water pollution. In addition to DO water quality indicator, there are some indicators that affect water quality such as pH, Temp, EC, Chl-a, and turbidity. The variation of DO in the study period is shown in Figure 2.

Missing Value Processing
Missing values of data are handled in two ways: (1) if only one indicator is missing in one monitoring, the data are filled in by linear interpolation; (2) if a monitoring value is missing continuously, the data of the monitoring moments are deleted to avoid large errors caused by artificial filling.
Linear interpolation [47] is a widely used interpolation algorithm in the field of mathematics and graphics. Linear interpolation of water quality data can effectively compensate for the missing data problem of time series data and improve the model effect.
Assuming that a missing value (x, y) lies between the coordinates (x_0, y_0) and (x_1, y_1), we obtain Equation (1):
$$\frac{y - y_0}{x - x_0} = \frac{y_1 - y_0}{x_1 - x_0} \tag{1}$$
where x is known, and the value of y is obtained as in Equation (2):
$$y = y_0 + (x - x_0)\,\frac{y_1 - y_0}{x_1 - x_0} \tag{2}$$
After interpolation, the dataset becomes a continuous time series with equal time intervals.
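As an illustration of the interpolation step, the following is a minimal sketch in Python using NumPy. The function name `fill_single_gaps` and the sample DO readings are hypothetical and are not taken from the study. The sketch fills every missing entry it finds, so consecutive gaps, which the procedure above deletes instead, would need to be removed beforehand.

```python
import numpy as np

def fill_single_gaps(values):
    """Fill NaN entries by linear interpolation between the nearest known neighbours."""
    y = np.asarray(values, dtype=float).copy()
    x = np.arange(len(y))          # equally spaced time index
    known = ~np.isnan(y)           # positions with a measured value
    # np.interp evaluates the straight line through the surrounding known points
    y[~known] = np.interp(x[~known], x[known], y[known])
    return y

# Hypothetical hourly DO readings (mg/L) with one missing value
do_series = [7.9, np.nan, 8.3, 8.1]
filled = fill_single_gaps(do_series)
```

Here the missing reading lies halfway between 7.9 and 8.3, so the interpolated value is their midpoint. Note that `np.interp` holds the nearest known value constant at the edges of the series rather than extrapolating.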

Water Quality Correlation Analysis
Multivariate time series methods build predictive models by analyzing historical time series data and the correlations between individual factors [48]. For multi-element water quality time series data, different features have different effects on water quality prediction. Feature selection is therefore needed; it can reduce model training time, improve model efficiency, and strengthen generalization ability.
The Pearson correlation test [49] is used to determine the relevance of different features to the time-series feature that needs to be predicted. The Pearson correlation describes the degree of linear correlation between variables. The Pearson correlation coefficient is calculated as the covariance of variable X and variable Y, E(XY) − E(X)E(Y), divided by the product of the standard deviations of the two variables, as shown in Equation (3):
$$\rho_{X,Y} = \frac{E(XY) - E(X)E(Y)}{\sqrt{E(X^2) - E^2(X)}\,\sqrt{E(Y^2) - E^2(Y)}} \tag{3}$$
where ρ is the correlation coefficient of variable X and variable Y, and its absolute value is less than or equal to 1. A correlation coefficient of 1 indicates a perfect positive linear correlation, where Y increases as X increases; a coefficient of −1 indicates a perfect negative linear correlation, where Y decreases as X increases; a coefficient of 0 indicates that there is no linear correlation between the two variables.
In this paper, we used the Pearson correlation test for relevant multifactor water quality characteristics, and final test results are shown in Table 2. According to Table 2, the main element characteristics related to the water quality prediction index DO are pH, Chl-a, and Temp. The water quality prediction of multiple elements mainly considers these factors as the input characteristics.
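The coefficient can be computed directly from its definition as covariance divided by the product of standard deviations. The sketch below assumes NumPy; the function name `pearson` and the temperature and DO values are hypothetical illustrations, not data from Table 2.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation from the definition: cov(X, Y) / (std(X) * std(Y))."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cov = np.mean(x * y) - np.mean(x) * np.mean(y)   # E(XY) - E(X)E(Y)
    return cov / (np.std(x) * np.std(y))             # population standard deviations

# Hypothetical readings: DO tends to fall as water temperature rises
temp = [20.0, 22.0, 24.0, 26.0]
do = [8.4, 8.0, 7.7, 7.2]
r = pearson(temp, do)
```

In practice, `np.corrcoef` returns the same value and can be applied to the full feature matrix at once to rank candidate inputs against DO.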

Outlier Detection
Water quality monitoring stations are often affected by environmental changes and instrument failures in the process of data collection, resulting in missing data and data anomalies, which can have a serious impact on the subsequent model predictions. In this experiment, the detected anomalous values are treated as missing values, and linear interpolation is used to complete the data. Usually, outliers can be identified with the help of graphical methods (box plots and normal distribution plots) and modeling methods (linear regression, clustering algorithms, and K-nearest neighbor algorithms). This experiment used the box plot method to identify outliers.
The box plot technique [50] uses the quartiles of the data to identify outliers. The box plot is a typical statistical graph that is widely used in both academia and industry. Its shape is shown in Figure 3. The lower quartile in Figure 3 is the value at the 25th percentile of the data (Q1), the median is the value at the 50th percentile (Q2), and the upper quartile is the value at the 75th percentile (Q3). The upper whisker is Q3 + 1.5(Q3 − Q1), and the lower whisker is Q1 − 1.5(Q3 − Q1), where Q3 − Q1 denotes the interquartile range. When a box plot is used to identify outliers, a data point is considered an outlier if its value is greater than the upper whisker or less than the lower whisker.
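The whisker rule above translates into a few lines of code. The following is a minimal sketch assuming NumPy; the function name `boxplot_outlier_mask` and the sample readings are hypothetical. Flagged values are set to NaN so that they can be treated as missing values afterwards.

```python
import numpy as np

def boxplot_outlier_mask(values):
    """Return True where a value lies outside the box plot whiskers."""
    v = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(v, [25, 75])
    iqr = q3 - q1                      # interquartile range Q3 - Q1
    lower = q1 - 1.5 * iqr             # lower whisker
    upper = q3 + 1.5 * iqr             # upper whisker
    return (v < lower) | (v > upper)

# Hypothetical DO readings with one implausible spike
readings = np.array([7.8, 8.0, 8.1, 7.9, 8.2, 15.0, 8.0])
mask = boxplot_outlier_mask(readings)
cleaned = np.where(mask, np.nan, readings)   # outliers become missing values
```

`np.percentile` uses linear interpolation between order statistics by default; other quartile conventions can shift the whiskers slightly on small samples.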

Data Normalization
For joint multifactor water quality time series prediction, different water quality indicators often have different magnitudes. In the subsequent prediction process, features with different magnitudes seriously affect the accuracy of model predictions. In addition, during model training, input data that are too large or too small can lead to problems such as nonconvergence. To solve this problem, this paper used min-max normalization [51] to normalize the data. Min-max normalization scales the data by the difference between the maximum and minimum values such that the water quality data lie between 0 and 1. Normalization can alleviate the impact of different scales on model training. The normalization formula is shown in Equation (4):
$$X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}} \tag{4}$$
where X is the original data, X_norm is the normalized data, X_max is the maximum value in the original data, and X_min is the minimum value in the original data.
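A minimal sketch of min-max scaling with NumPy follows. The function names and sample values are hypothetical. Computing the minimum and maximum on the training portion only and reusing them elsewhere is an assumption made here to avoid leaking test information; the text does not state which portion of the data was used.

```python
import numpy as np

def min_max_fit(train):
    """Column-wise minimum and maximum of the training data."""
    train = np.asarray(train, dtype=float)
    return train.min(axis=0), train.max(axis=0)

def min_max_transform(x, x_min, x_max):
    """Scale each column to [0, 1] using the fitted minimum and maximum."""
    return (np.asarray(x, dtype=float) - x_min) / (x_max - x_min)

def min_max_inverse(x_norm, x_min, x_max):
    """Map normalized values back to the original units."""
    return np.asarray(x_norm, dtype=float) * (x_max - x_min) + x_min

# Hypothetical rows of (Temp in degrees C, pH)
features = np.array([[20.0, 7.0], [25.0, 8.0], [30.0, 7.5]])
x_min, x_max = min_max_fit(features)
scaled = min_max_transform(features, x_min, x_max)
restored = min_max_inverse(scaled, x_min, x_max)
```

The inverse transform is needed when predicted DO values are reported in their original units.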

Time Series Conversion to Supervised Data
Before a time series prediction model can be used for forecasting, the time series must be converted from unsupervised to supervised data, so that the model can be trained by comparing the predicted values with the true values. The conversion relies on a sliding window that cuts out feature input values and the corresponding target values to construct supervised samples [52]. Figure 4 shows the specific process.
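The sliding-window conversion can be sketched as follows, assuming NumPy. The function name `make_supervised` and its `horizon` parameter are hypothetical: each sample pairs `window` consecutive rows of features with the target value `horizon` steps after the window ends, so `horizon=1` corresponds to one-step-ahead forecasting.

```python
import numpy as np

def make_supervised(features, target, window=100, horizon=1):
    """Build (X, y) pairs: X holds `window` past rows, y the target `horizon` steps ahead."""
    features = np.asarray(features, dtype=float)
    target = np.asarray(target, dtype=float)
    X, y = [], []
    n_samples = len(features) - window - horizon + 1
    for start in range(n_samples):
        end = start + window                  # first index after the window
        X.append(features[start:end])         # shape (window, n_features)
        y.append(target[end + horizon - 1])   # value to be predicted
    return np.array(X), np.array(y)

# Tiny hypothetical series: 10 time steps, 2 features
demo_features = np.arange(20).reshape(10, 2)
demo_target = np.arange(10)
X1, y1 = make_supervised(demo_features, demo_target, window=3, horizon=1)
X4, y4 = make_supervised(demo_features, demo_target, window=3, horizon=4)
```

The resulting `X` has shape (samples, window, features), which is the three-dimensional layout that recurrent layers usually expect.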

Long Short-Term Memory Neural Network
The LSTM neural network is a special recurrent neural network (RNN) [53] that is capable of learning long-term patterns. It was first proposed by Hochreiter and Schmidhuber [54] and has since been applied successfully to a wide variety of problems. The LSTM network is suitable for processing and predicting time series whose relevant features are separated by very long intervals and delays. It is also effective in mitigating the vanishing gradient and exploding gradient problems that tend to occur in traditional recurrent neural networks.
The LSTM model has an input gate, an output gate, and a forget gate, which are used to modify the memory. The input gate and output gate mainly control the input features and the output contents, while the forget gate decides which memories in the memory unit should be retained and which can be forgotten. The gates are described by Equations (5)-(9). The structure of LSTM is shown in Figure 5.
Input gate:
$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i) \tag{5}$$
Forget gate:
$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f) \tag{6}$$
Output gate:
$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o) \tag{7}$$
Long memory:
$$C_t = f_t \odot C_{t-1} + i_t \odot \tanh(W_C [h_{t-1}, x_t] + b_C) \tag{8}$$
Short memory:
$$h_t = o_t \odot \tanh(C_t) \tag{9}$$
Here W denotes the parameter matrices of the gates and the memory cell, b the corresponding bias vectors, x the input values, h the hidden state variables, which store and update the historical information, C the cell state (long memory), ⊙ element-wise multiplication, and σ and tanh the sigmoid and hyperbolic tangent activation functions, respectively. Once trained sufficiently, the LSTM model can extract features from complex time series information. On the basis of these features (the hidden layer information from the LSTM model), the final fully connected layer decodes them into predicted values with reasonable accuracy.
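To make the gate interactions concrete, the sketch below implements a single LSTM time step in NumPy using the standard formulation with bias terms. It is an illustration only and is not the Keras layer used in the experiments; the parameter names (`W_i`, `b_i`, and so on), the layer sizes, and the random inputs are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step: returns the new hidden state and cell state."""
    z = np.concatenate([h_prev, x_t])                      # [h_{t-1}, x_t]
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])       # input gate
    f_t = sigmoid(params["W_f"] @ z + params["b_f"])       # forget gate
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])       # output gate
    c_tilde = np.tanh(params["W_c"] @ z + params["b_c"])   # candidate memory
    c_t = f_t * c_prev + i_t * c_tilde                     # long memory (cell state)
    h_t = o_t * np.tanh(c_t)                               # short memory (hidden state)
    return h_t, c_t

# Hypothetical sizes: 3 input features, 4 hidden units, 5 time steps
rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4
params = {}
for gate in "ifoc":
    params[f"W_{gate}"] = rng.normal(scale=0.1, size=(n_hidden, n_hidden + n_in))
    params[f"b_{gate}"] = np.zeros(n_hidden)

h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x_t in rng.normal(size=(5, n_in)):
    h, c = lstm_step(x_t, h, c, params)
```

Because the hidden state is the product of a sigmoid output and a tanh output, each of its components always lies strictly between −1 and 1.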

Attention Mechanism
The essence of the attention mechanism is that, for a given target, a weight coefficient is generated and multiplied with the input to identify which features in the input are important for the target and which are not. To implement the attention mechanism, we treat the raw input data as <Key, Value> pairs and calculate the similarity coefficient between Key and Query. Given the Query of the target task, we obtain the weight coefficient corresponding to each Value, and then multiply the weight coefficient with Value to get the output. We use Q, K, and V to denote Query, Key, and Value; the formula for calculating the weight coefficient W is shown in Equation (10):
$$W = \mathrm{softmax}(Q K^{T}) \tag{10}$$
The attention weight coefficient W is multiplied by Value using Equation (11) to obtain the output a containing the attention:
$$a = \mathrm{attention}(Q, K, V) = W \cdot V = \mathrm{softmax}(Q K^{T}) \cdot V \tag{11}$$
The detailed structure of the attention model is shown in Figure 6. As we can see, the attention mechanism forms an attention weight vector by computing the similarity between Query and Key, and this vector is then multiplied by Value to get a new output incorporating attention. The attention mechanism has many applications in various fields of deep learning. However, it should be noted that attention is not a unified model, but only a mechanism in which Query, Key, and Value have different sources in different application domains. This means that different domains have different implementation methods.
The calculation of attention has three steps: first, the similarity between the Query and each Key is calculated to obtain the weights, where common similarity functions are the dot product, concatenation, and a perceptron; second, a softmax [55] function normalizes these weights; third, the weights are multiplied with the corresponding Values and summed to obtain the final attention.
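The three steps can be sketched with dot-product similarity in NumPy. This is an illustration under stated assumptions rather than the attention layer of the AT-LSTM model: here the Keys and Values are both taken to be a hypothetical matrix of hidden states (one row per time step) and the Query is taken to be the last hidden state.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over the last axis."""
    shifted = scores - np.max(scores, axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def dot_product_attention(query, keys, values):
    """Step 1: similarity; step 2: softmax normalization; step 3: weighted sum of values."""
    scores = keys @ query          # one similarity score per time step
    weights = softmax(scores)      # weights are positive and sum to 1
    context = weights @ values     # attention output
    return context, weights

# Hypothetical hidden states: 6 time steps, 4 hidden units
rng = np.random.default_rng(1)
hidden_states = rng.normal(size=(6, 4))
query = hidden_states[-1]          # assumption: last hidden state acts as the Query
context, weights = dot_product_attention(query, hidden_states, hidden_states)
```

Inspecting `weights` shows which time steps contribute most to the output, which is the sense in which attention highlights key moments in the sequence.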

Model Establishment
In this research, we introduce an attention mechanism to the LSTM network and propose the AT-LSTM network model to process multivariate time series data. The main idea of the model is to reduce the effect of irrelevant factors on the results and highlight the impact of related factors by adaptively weighting the hidden layer elements of the neural network, thus improving prediction accuracy. The model framework is shown in Figure 7, and the main components are the LSTM layer and the attention layer (see also Table 3). The fully connected layer obtains the normalized similarity weights via the softmax activation function. The weights are multiplied with the input layer to calculate the final attention. The flatten layer is used to "flatten" the input, that is, to turn a multidimensional input into a one-dimensional one. The hyperparameter settings of the model affect its performance on water quality prediction to some extent. This paper sets the time window to 100 through a trial-and-error method, uses Bayesian optimization [56] for model hyperparameter optimization, and identifies relatively better hyperparameters and activation functions. The only difference between the AT-LSTM model proposed in this study and the traditional LSTM model is the addition of the attention layer; the other main structures are the same. In addition, the models were trained under the same hyperparameters, which allows a fair comparison. According to the above principle, the model learns on the basis of past fitting results, optimizes the water quality prediction by using the memory property of LSTM, and finally produces its output after activation through the fully connected layer. The specific process of the water quality prediction algorithm proposed in this paper is as follows:
Step 1: Data cleaning. Before water quality prediction, the box plot technique in Section 2.4 is used to detect the abnormal values of the water quality data, and the abnormal values are set to empty values. Then, the linear interpolation method in Section 2.2 is used to fill in the missing values.
Step 2: Data enhancement. Firstly, the Pearson correlation test in Section 2.3 is used for feature selection: the correlation between different water quality parameters is analyzed, and the key characteristics related to the characteristic to be predicted are used as inputs to the model. Secondly, the sliding window technique in Section 2.6 with a window size of 100 is used to capture the trend of the water quality variables. Thirdly, the min-max normalization in Section 2.5 is used to alleviate the impact of different characteristic scales on model training.
Step 3: Training model. The water quality data are divided into training, validation, and test sets at a ratio of 8:1:1. In this study, the training set contained 31,802 hourly entries (from 1 January 2015 to 4 February 2019), the validation set contained 3975 hourly entries (from 4 February 2019 to 20 July 2019), and the test set contained 3975 hourly entries (from 20 July 2019 to 1 January 2020). We used the training set to fit the model, the validation set to tune hyperparameters, and the test set to evaluate the predictive performance and generalization ability of the model. The algorithm flow chart is shown in Figure 8.
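The 8:1:1 division in Step 3 is chronological, which keeps later observations out of the training set. A minimal sketch of such a split in plain Python follows. The function name `chronological_split` is hypothetical, and rounding the subset sizes to the nearest integer is an assumption chosen because it reproduces the reported counts.

```python
def chronological_split(n_samples, ratios=(8, 1, 1)):
    """Return slices for training, validation, and test sets in time order."""
    total = sum(ratios)
    n_train = round(n_samples * ratios[0] / total)
    n_val = round(n_samples * ratios[1] / total)
    train = slice(0, n_train)
    val = slice(n_train, n_train + n_val)
    test = slice(n_train + n_val, n_samples)   # remainder goes to the test set
    return train, val, test

# 39,752 hourly records, as reported for the Burnett River dataset
train_idx, val_idx, test_idx = chronological_split(39752)
```

The slices can be applied directly to the windowed arrays, for example `X[train_idx]`, without shuffling the samples.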

Performance Criteria
Water quality prediction is essentially a regression problem. In the present study, the mean absolute error (MAE), root-mean-square error (RMSE), and coefficient of determination (R²) were used to quantitatively evaluate the model predictions, as shown in Equations (12)-(14). To reduce the randomness of the algorithm, a random seed was set during the experiment to ensure that the results were reproducible.
$$MAE = \frac{1}{m} \sum_{i=1}^{m} \left| y_i - \hat{y}_i \right| \tag{12}$$
$$RMSE = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^2} \tag{13}$$
$$R^2 = 1 - \frac{\sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{m} \left( y_i - y_{mean} \right)^2} \tag{14}$$
where y_i is the measured value, ŷ_i is the predicted value, y_mean is the mean value of y_i, and m is the number of test samples. The RMSE is the square root of the MSE and is expressed in the same units as the data, which makes it more intuitive. For example, an RMSE of 10 means that the predictions typically differ from the actual values by about 10. Its value ranges from zero to positive infinity; a value of 0 indicates a perfect model, and a higher value denotes a larger error. R² represents the fitting ability of the model; a value closer to 1 denotes a stronger fitting ability.
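The three criteria can be computed as in the following NumPy sketch. The function names and the four sample values are hypothetical and serve only to illustrate the definitions.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    """Root-mean-square error: square root of the mean squared error."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

# Hypothetical measured and predicted DO values (mg/L)
y_true = np.array([8.0, 7.5, 7.0, 8.5])
y_pred = np.array([7.8, 7.6, 7.2, 8.4])
scores = {"MAE": mae(y_true, y_pred), "RMSE": rmse(y_true, y_pred), "R2": r2(y_true, y_pred)}
```

For this toy example the absolute errors are 0.2, 0.1, 0.2, and 0.1, so the MAE is 0.15; the RMSE is slightly larger because squaring gives more weight to the larger errors.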

Experimental Environment
In this study, we used the Keras and TensorFlow frameworks for water quality prediction in the following environment: Intel i5-1140 CPU, 2.7 GHz frequency; Nvidia GTX 3050 GPU; 16 GB PC memory; Windows 10 64-bit operating system; Python version 3.9 development language; PyCharm Professional Edition 2021.3. To predict water quality in the Burnett River using the AT-LSTM and LSTM models, we used the MSE as the loss function of the model, calculated using the following standard equation:
$$MSE = \frac{1}{m} \sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^2 \tag{15}$$
where y_i is the measured value, ŷ_i is the predicted value, and m is the number of samples. Both models were trained on the training set using the Adam optimizer [57] with a batch size of 64. To accelerate the convergence of the error, the backpropagation learning method was used. The validation set was used for early stopping to ensure that the model did not overfit.
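The early stopping rule can be illustrated independently of any framework. The sketch below is plain Python and only mimics the decision logic: training stops once the validation loss has failed to improve for `patience` consecutive epochs. The function name, the `patience` and `min_delta` values, and the loss sequence are hypothetical; the text does not report the settings actually used.

```python
def early_stopping_epoch(val_losses, patience=10, min_delta=0.0):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    wait = 0                                   # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return best_epoch, epoch       # stop here, keep weights of best_epoch
    return best_epoch, len(val_losses) - 1     # ran to the end without stopping

# Hypothetical validation MSE per epoch
val_losses = [0.050, 0.040, 0.035, 0.036, 0.037, 0.038, 0.034]
best_epoch, stop_epoch = early_stopping_epoch(val_losses, patience=3)
```

With a patience of 3, training stops at epoch 5 and the weights from epoch 2 are kept, so the later improvement at epoch 6 is never reached; this illustrates the trade-off in choosing the patience. In Keras, the `EarlyStopping` callback provides comparable behavior through its `monitor`, `patience`, and `restore_best_weights` arguments.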

Comparisons of One-Step-Ahead Forecast Using LSTM and AT-LSTM Models
This study aimed to analyze the differences between the AT-LSTM and LSTM models in multivariate time-series forecasting. Both models first used the past values of the multivariate time series for one-step-ahead forecasting, before multistep-ahead forecasting was performed. The models used the same input data to predict the DO for the next hour. Figure 9a compares the predictions of the LSTM and AT-LSTM models on the test set. From Figure 9a, we can see that the AT-LSTM model outperformed the LSTM model on the Burnett River test set. The standard LSTM method performed less well, with an RMSE of 0.171 and R² of 0.918. After the introduction of the attention mechanism, the RMSE decreased and R² increased. This is because the attention layer weights the hidden layer elements of the neural network at different moments, suppresses redundant information and noise in the time-series data, and highlights the influence of the relevant features on the prediction, thus improving prediction accuracy. Detailed comparisons of predicted and actual values from the two models on the test set at the corresponding times, shown in Figure 9b, make the differences between the two clearer. In Figure 9b, the blue curve represents the actual values, and the orange curve indicates the predicted values. Although LSTM can predict water quality changes, the predictions of AT-LSTM are closer to the actual values, indicating that the generalization ability of AT-LSTM is stronger than that of LSTM. Table 4 summarizes the performance of the models for the one-step-ahead DO prediction task in the monitoring sections. From the table, we can see that the method proposed in this paper shows significant improvements in MAE, RMSE, and R² in comparison with LSTM.
The R² of the AT-LSTM model increased from 0.918 to 0.953, and the RMSE and MAE of the AT-LSTM water quality prediction model were reduced by 23.9% and 27.7%, respectively. Figure 10 shows a box plot of the relative error percentage of each model in predicting DO. Clearly, the relative error distribution interval of the AT-LSTM model was smaller than that of the LSTM model. The attention mechanism allocated weights to the hidden layer elements of the neural network according to the importance of the hidden relevant features. Thus, with the same parameters as the LSTM model, the AT-LSTM model could better fit the true value of DO, reduce the prediction error, and improve the accuracy and robustness of the model. The relative error percentage was calculated using Equation (16):
$$RE_t = \frac{\left| \hat{x}_t - x_t \right|}{x_t} \times 100\% \tag{16}$$
where x_t is the actual value at moment t, and x̂_t is the predicted value at moment t.

Comparisons of Multistep Forecasting Using the LSTM and AT-LSTM Models
To verify the prediction performance and generalization ability of the AT-LSTM model, this study conducted DO prediction experiments with different step lengths. The sliding window width was still set at 100. The prediction horizons ranged from 4 to 48 steps, i.e., the past 100 h of data were used to predict water quality 4-48 h into the future. A comparison of the MAE, RMSE, and R² of the LSTM and AT-LSTM models on the test set is presented in Table 5. Clearly, the MAE and RMSE values of the AT-LSTM model were smaller than those of the LSTM model at every step, and the R² of the AT-LSTM model was higher than that of the LSTM model at every step. Averaged over all steps, the MAE and RMSE of the AT-LSTM model decreased by 14.6% and 12.2% in comparison with LSTM, respectively, while the average R² increased by 10.8%. In general, the prediction error of both models grew as the prediction step increased, and multistep predictions were less accurate than the 1 h ahead prediction. Comparisons of the real and model-predicted DO values, error plots, and residual histograms produced by the LSTM and AT-LSTM models for the next 48 h are presented in Figures 11 and 12, respectively, which compare the precision of the 48 h ahead DO forecasts of the two models. As shown in Figures 11 and 12, although both the LSTM and AT-LSTM models captured the trend of the DO content, the AT-LSTM model exhibited better prediction and stronger generalization than the LSTM model. The values of the evaluation criteria produced by the LSTM model on the test set were as follows: R² = 0.501, MAE = 0.333, and RMSE = 0.422. The values produced by the AT-LSTM model on the same test set were as follows: R² = 0.541, MAE = 0.312, and RMSE = 0.405.
This again indicates that the attention mechanism improves the effectiveness and accuracy of prediction in multivariate time series.
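As a worked example of how such metric values translate into relative improvements, the sketch below computes the percentage change between the LSTM and AT-LSTM values quoted above (R² = 0.501 versus 0.541, MAE = 0.333 versus 0.312, RMSE = 0.422 versus 0.405). The function name `percent_change` is hypothetical. These figures describe this single set of values only and are not the averages over all prediction steps.

```python
def percent_change(baseline, new):
    """Relative change of `new` with respect to `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0

# Values quoted in the text for the two models on the same test set
lstm = {"R2": 0.501, "MAE": 0.333, "RMSE": 0.422}
at_lstm = {"R2": 0.541, "MAE": 0.312, "RMSE": 0.405}

changes = {name: percent_change(lstm[name], at_lstm[name]) for name in lstm}
for name, value in changes.items():
    print(f"{name}: {value:+.1f}%")
```

For these values, the MAE falls by about 6.3%, the RMSE falls by about 4.0%, and R² rises by about 8.0%.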

Model Verification
A new independent dataset (Gregory River data) was used to verify the advantages of AT-LSTM over LSTM in the prediction of multivariate time series. The RMSE of the LSTM and AT-LSTM models for 1-12 h ahead prediction on the new dataset is compared in Figure 13. Figure 13 shows that the RMSE of AT-LSTM was lower than that of LSTM at every prediction step from 1 to 12 h ahead. This supports the conclusion that the AT-LSTM model proposed in this paper has advantages over the traditional LSTM model in the prediction of multivariate time series. The above experimental results demonstrate that the developed AT-LSTM model outperformed the LSTM model in prediction performance and could generalize its prediction capability beyond the station used for training.

Conclusions and Future Work
With the rapid development of technology, the gathering of water quality data is shifting from manual collection to automatic monitoring, which makes the processing of large-volume, high-frequency water quality data a reality. However, traditional prediction methods struggle to fully extract the true characteristics of water quality information; thus, their results cannot meet the needs of practical applications. A water quality prediction method with higher accuracy is urgently needed. The AT-LSTM model proposed in this study combines the nonlinear mapping capability of the LSTM neural network with the feature weighting function of the attention mechanism, extracts the characteristic information of water quality data efficiently, and predicts the dissolved oxygen content of the Burnett River with significantly better accuracy than the LSTM model. In comparison with the standard LSTM model, the AT-LSTM water quality prediction model reduced the RMSE and MAE by 23.9% and 27.7%, respectively, and achieved a higher R² of 95.3% with better generalization performance. In practical application, the AT-LSTM model can be used to establish a water quality prediction and early warning platform for the Burnett River, to sense the potential pollution risk of the river water in advance, send early warning reports, and conduct pollution retrieval. The model can significantly improve the ability of the relevant departments to predict water environment risk, upgrade passive emergency treatment of water environment risk to automatic prediction, and provide early warning and active prevention, thus protecting the aquatic environment of the Burnett River and the Great Barrier Reef. Furthermore, the model proposed in this study can provide a reference for the construction of water quality prediction models for surface water bodies in other regions. The AT-LSTM model therefore has important application value and practical significance.
The AT-LSTM prediction model proposed in this paper still has room for optimization, and subsequent research can proceed in the following directions. Firstly, this paper used the water quality data of only one monitoring point; monitoring data from other sites in the study area can be added, and the correlation between their geographical locations can be taken into account. This would increase both the dimensionality and the amount of data, and thereby improve the predictive performance of the model. Secondly, the feature screening method used in this study is the relatively simple Pearson correlation test, which is a linear feature screening algorithm. In future studies, we will try other data preprocessing and feature engineering methods, such as nonlinear feature screening, to identify more effective influencing factors for the predicted indicators and further improve the model's prediction accuracy.

Data Availability Statement: Burnett River water quality monitoring data-Historical: https://www.data.qld.gov.au/dataset/burnett-river-monitoring-data-historical (accessed on 31 August 2022); Gregory River water quality monitoring data-Historical: https://www.data.qld.gov.au/dataset/gregory-water-quality-monitoring-data-historical (accessed on 31 August 2022).

Conflicts of Interest:
The authors declare no conflict of interest.