Predicting the Trend of Dissolved Oxygen Based on the kPCA-RNN Model

Abstract: Water quality forecasting is increasingly significant for agricultural management and environmental protection. Enormous amounts of water quality data are collected by advanced sensors, which has led to an interest in using data-driven models for predicting trends in water quality. However, the unpredictable background noise introduced during water quality monitoring seriously degrades the performance of those models. Meanwhile, artificial neural networks (ANN) with feed-forward architecture lack the capability of maintaining and utilizing accumulated temporal information, which leads to biased predictions when processing time series data. Hence, we propose a water quality predictive model based on a combination of Kernel Principal Component Analysis (kPCA) and Recurrent Neural Network (RNN) to forecast the trend of dissolved oxygen. Water quality variables are reconstructed with the kPCA method, which aims to reduce the noise in the raw sensory data and preserve actionable information. With the RNN's recurrent connections, our model can make use of previous information in predicting the future trend. Data collected from the Burnett River, Australia, were used to evaluate our kPCA-RNN model. The kPCA-RNN model achieved R² scores up to 0.908, 0.823, and 0.671 for predicting the concentration of dissolved oxygen in the upcoming 1, 2 and 3 hours, respectively. Compared to current data-driven methods like feed-forward neural network (FFNN), support vector regression (SVR) and general regression neural network (GRNN), the predictive accuracy of the kPCA-RNN model was at least 8%, 17% and 12% better than the comparative models in these three cases. The study demonstrates the effectiveness of the kPCA-RNN modeling technique in predicting water quality variables with noisy sensory data.


Introduction
Surface water quality has a strong dependence on the nature and extent of agricultural, industrial and other anthropogenic activities within a region's catchments [1]. The reliable prediction of water quality is crucial for decision-makers to improve water quality management and protection activities [2]. However, forecasting the temporal variation of water quality parameters for surface river systems can be a significantly challenging task owing to rapidly changing environmental conditions and insufficient historical data records [3].
Dissolved oxygen (DO) content is one of the most vital water quality variables as it directly indicates the status of the aquatic ecosystem and its ability to sustain aquatic life [4]. Rapid decomposition of organic materials, including manure or wastewater sources, can deplete the DO in water within a few hours, resulting in deficient DO levels that can lead to stress and death of aquatic fauna [5]. For example, DO levels that remain below 1-2 mg/L for a few hours can result in large fish kills. In pond management, an aeration system can quickly increase dissolved oxygen levels if a decrease in dissolved oxygen can be predicted. Hence, short-term predictions of DO are critical in delivering good water quality management [6].
Various mechanism models have been applied for predicting the concentration of DO [7]. A mechanism model considers many physical, chemical, and biological factors affecting the change of water quality. Common mechanism models include the BASINS model system [8], the MIKE model system [9], and the QUAL2K model system [10]. However, it is often challenging to simulate the target water quality systems when adequate monitoring data or background information are lacking [11]. Consequently, those models are unlikely to generalize without significant parameter adjustment [12].
Data-driven models have received increasing attention in predicting the concentration of DO based on sensory data. For example, in the study proposed by Zhang [13], a multi-layer feed-forward neural network (FFNN) is designed for predicting the trend of dissolved oxygen in Baffle Creek, Australia. In their approach, a mutual information-based feature selection strategy is introduced to select the relevant water quality variables for DO forecasting. Antanasijević et al. [14] tested the effectiveness of applying general regression neural network (GRNN) models for the forecasting of DO in the Danube River, Europe. In their experiments, 19 water quality parameters, five different data normalization methods, and three input selection techniques were tested to find the best combination. In addition, Li et al. [15] evaluated the performance of support vector regression (SVR) for the prediction of DO concentration based on multiple water quality parameters. The SVR was optimized by the particle swarm optimization algorithm and outperformed linear regression models. Though various data-driven models have been tested in predicting the trend of DO, most existing models lack mechanisms for processing temporal data. Under these circumstances, seasonal or diurnal patterns within the water quality data are hard to capture [16].
Apart from model architectures, the quality of input data also has an enormous influence on a data-driven model's performance [17]. High-frequency data collected by sensors are prevalent in building water quality forecasting models. However, random errors generated by the environment, instruments or network transmission are unavoidable when monitoring water quality variables [18,19]. Though techniques such as z-score and min-max are used in preprocessing input data for data-driven models [14], those techniques aim to rescale the numeric range of water quality variables instead of reducing sensor noise. Accordingly, the unwanted noise is passed on to the data-driven models, which makes it harder to generate accurate predictions for water quality variables.
In this paper, we propose a water quality predictive model based on Kernel Principal Component Analysis (kPCA) and Recurrent Neural Network (RNN) to solve the above issues. Our work differs from other comparative approaches in the following two aspects:
• Kernel Principal Component Analysis (kPCA) is implemented to reconstruct the input water quality data. Instead of feeding the water quality sensor data into the data-driven models directly, we select the top-ranked principal components as the new inputs. The dropped principal components are expected to contain background noise, so the reconstructed inputs retain mainly the useful information.
• A recurrent neural network (RNN) is designed to capture the temporal variations within water quality variables and utilize the historical changing patterns as a guide for predicting water quality in the future.
This study aims to evaluate the predictive accuracy of the kPCA-RNN model by comparing it with three data-driven methods discussed above. The evaluation is undertaken on a case study of DO concentrations in Burnett River, Australia.

Overview
The Burnett River is located on the southern Queensland coast and flows into the Coral Sea of the South Pacific Ocean. Cultivation of sugar cane and small crops is an important land use in this region. The total area of the catchment is about 33,000 km². Figure 1 illustrates the location and extent of the catchment. The time series of physicochemical water quality variables analysed in this study were obtained by a YSI 6 Series sonde sensor near the Bundaberg Co-op Wharf (Figure 1) [20]. Water quality variables such as temperature, electrical conductivity (EC), pH, dissolved oxygen (DO), turbidity, and chlorophyll-a (Chl-a) were recorded at a 1 h interval for 5 months in 2015 (Table 1).
As demonstrated in Table 1, Chl-a and turbidity have larger variability than other water quality variables (CV > 50%). In the case of turbidity, this is due to extreme weather events [22]. The variability of Chl-a concentration can be affected by river discharge, temperature, and salinity variation. The high variability in turbidity and Chl-a is caused by a small number of observations with high values (Figure 2). Additionally, outliers of EC tend to have lower measurement values. These outliers can be caused by variations in river flow or other characteristics of the catchment. Ignoring those variations may cause serious information loss.
Figure 3 illustrates the changing patterns of DO both within a day and over a consecutive number of days. The concentration of DO follows a similar daily pattern, which makes it possible to predict changes in DO. However, when tracking the concentration of DO over a larger time scale, the mean DO concentration increases gradually in the first half of the month and reaches its peak value around 21 September. After remaining at a high level for a few days, the DO level decreases gradually until the end of the month.
This situation happens when unexpected events occur, such as heavy rainfall or excessive algae and phytoplankton growth. In these circumstances, the predictive models should capture the daily temporal pattern when forecasting future DO concentration. Moreover, the model should be robust so that it has stable prediction performance at different time steps.
Hence, considering both the partial autocorrelation results and the trends of DO under different time scales, we choose to use the data from 24 historical time steps as the input for our predictive model. In this way, the input covers the information from the previous 24 h, which represents the complete daily pattern of the DO concentration.
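The daily cycle that motivates a 24-step input can be checked with a simple lag diagnostic. The sketch below is illustrative only: it uses the plain sample autocorrelation, which is a simpler relative of the partial autocorrelation analysis referred to in the text, and it runs on a synthetic sine signal with a 24 h period rather than on the Burnett River data. The function name `autocorrelation` and the series `do_like` are assumptions made for this example.

```python
import math

def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at a non-negative integer `lag`."""
    n = len(series)
    mean = sum(series) / n
    centred = [v - mean for v in series]
    denom = sum(c * c for c in centred)
    num = sum(centred[t] * centred[t + lag] for t in range(n - lag))
    return num / denom

# Synthetic hourly signal with an exact 24 h cycle (10 days), standing in for DO.
do_like = [8.0 + math.sin(2 * math.pi * t / 24) for t in range(240)]

acf_12 = autocorrelation(do_like, 12)  # half a day apart: out of phase
acf_24 = autocorrelation(do_like, 24)  # one day apart: in phase
```

For a signal with a clean daily cycle, the autocorrelation is strongly positive at lag 24 and strongly negative at lag 12, which is the kind of evidence that supports a one-day input window.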

Kernel PCA Based Input Abstraction
Principal component analysis (PCA) is routinely applied for linear dimensionality reduction and feature abstraction [23]. Diagonalizing the correlation matrix transforms the original correlated variables into uncorrelated (orthogonal) variables called principal components (PCs), which are weighted linear combinations of the original variables. The eigenvalues of the PCs measure their associated variances, and the sum of the eigenvalues equals the total number of variables.
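As a minimal sketch of the linear case, the following NumPy example diagonalizes the correlation matrix of a synthetic five-variable dataset. The data, the random seed, and all variable names are assumptions made for illustration; they are not the sensor records used in this study.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for five correlated sensor variables (200 hourly records).
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 5))

# Correlation matrix of the variables (columns), then its eigendecomposition.
R = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)      # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Principal components: standardized data projected onto the eigenvectors.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
pcs = Z @ eigvecs

eigen_sum = eigvals.sum()                 # equals the number of variables (trace of R)
pc_cov = np.cov(pcs, rowvar=False)        # diagonal matrix holding the eigenvalues
```

The covariance of the projected data is diagonal, which confirms that the components are uncorrelated, and the eigenvalues add up to five because the correlation matrix has ones on its diagonal.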
The standard PCA only allows linear dimensionality reduction. However, the multivariate water quality data have a more complicated structure which cannot be easily represented in a linear subspace. In this paper, kernel PCA (kPCA) [24] is chosen as a nonlinear extension of PCA to implement nonlinear dimensionality reduction for water quality variables. The kernel represents an implicit mapping of the data to a higher dimensional space where linear PCA is performed.
The PCA problem in feature space F can be formulated as the diagonalization of an l-sample estimate of the covariance matrix [25], which can be defined as Equation (1):

$$C = \frac{1}{l} \sum_{i=1}^{l} \Phi(x_i)\,\Phi(x_i)^{T} \qquad (1)$$

where $\Phi(x_i)$ are centred nonlinear mappings of the input variables $x_i \in \mathbb{R}^n$. Then, we need to solve the following eigenvalue problem:

$$\lambda V = C V \qquad (2)$$

Note that all the solutions V with λ ≥ 0 lie in the span of $\Phi(x_1), \Phi(x_2), \ldots, \Phi(x_l)$. An equivalent problem is defined below:

$$l \lambda \alpha = K \alpha \qquad (3)$$

where α denotes the column vector such that $V = \sum_{i=1}^{l} \alpha_i \Phi(x_i)$, and K is a kernel matrix which satisfies the following condition:

$$K_{ij} = K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j) \qquad (4)$$

Then, we can compute the kth nonlinear principal component of x as the projection of $\Phi(x)$ onto the eigenvector $V^k$:

$$V^{k} \cdot \Phi(x) = \sum_{i=1}^{l} \alpha_i^{k} K(x_i, x) \qquad (5)$$

Then, the first p < l nonlinear components that capture the desired percentage of data variance are chosen. By doing this, the complexity of the original data series can be greatly reduced.
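The kernel eigenproblem described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the implementation used in the study: the text does not name the kernel, so the RBF kernel, the value of `gamma`, the function names, and the synthetic data are all assumptions.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """K[i, j] = exp(-gamma * ||A[i] - B[j]||^2) (assumed kernel choice)."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def kernel_pca(X, n_components, gamma):
    """Project the training samples onto the leading nonlinear principal
    components; also return the matching eigenvalues of the centred kernel matrix."""
    l = X.shape[0]
    K = rbf_kernel(X, X, gamma)
    # Centre the mapped data in feature space: Kc = K - 1K - K1 + 1K1.
    one = np.full((l, l), 1.0 / l)
    Kc = K - one @ K - K @ one + one @ K @ one
    # Eigenproblem of the kernel matrix (its eigenvalues correspond to l * lambda).
    mu, alpha = np.linalg.eigh(Kc)
    mu, alpha = mu[::-1][:n_components], alpha[:, ::-1][:, :n_components]
    # Rescale alpha so that each feature-space eigenvector V^k has unit norm.
    alpha = alpha / np.sqrt(mu)
    # Projection of every training sample: sum_i alpha_i^k Kc(x_i, x).
    return Kc @ alpha, mu

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))          # synthetic stand-in for sensor records
scores, mu = kernel_pca(X, n_components=4, gamma=0.2)
```

The returned `scores` have one column per retained component. The columns are mutually orthogonal and centred, so they can replace the raw variables as model inputs.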

Recurrent Neural Network
Recurrent Neural Networks (RNN) have gained tremendous popularity over the last few years because of their capability in handling unstructured sequential data. In contrast to a feed-forward neural network, an RNN lets information travel in both directions. Computations derived from earlier inputs are fed back into the network, which is critical in learning the nonlinear relationships between multiple water quality variables.
The general input to an RNN model is a variable-length sequence $x = \{x_1, x_2, \ldots, x_T\}$, where $x_i \in \mathbb{R}^d$ and d represents the dimension of $x_i$. At each time step, the RNN maintains its internal hidden state h, which results in a hidden sequence $\{h_1, h_2, \ldots, h_T\}$. The operation of an RNN at time step t can be formulated as:

$$h_t = f(w_{xh} x_t + w_{hh} h_{t-1}) \qquad (6)$$

where f() is an activation function, $w_{xh}$ is the matrix of conventional weights between an input layer x and a hidden layer h, and $w_{hh}$ is the matrix of weights between a hidden layer h and itself at adjacent time steps. The output of the RNN is computed by:

$$y_t = w_{hy} h_t \qquad (7)$$

where $w_{hy}$ is the matrix of weights between the hidden layer h and output y. As exhibited in Figure 5, the structure of the RNN model across time can be expressed as a deep neural network with one layer per time step. Because this feedback loop occurs at every time step in the series, each hidden state contains traces not only of the previous hidden state, but also of all those that preceded $h_{t-1}$ for as long as memory can persist.
Compared to the traditional feed-forward neural network, the recurrent structure in an RNN can preserve the sequential information in its hidden state. In this approach, the input information can span many time steps as it cascades forward to affect the processing of each new example. RNNs are especially suitable for processing time series water quality data for the following reasons. Firstly, water quality data are periodically collected from different sensors, and previous values strongly influence subsequent changes. Secondly, the pattern of many water quality variables can only be recognized when enough historical data are involved and analysed.
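The recurrence described above can be sketched as a short forward pass. This is a minimal illustration of a plain RNN without bias terms, not the LSTM-based network used in the experiments. The zero initial state, the tanh activation, the layer sizes, and the random weights are assumptions made for the example.

```python
import numpy as np

def rnn_forward(x_seq, w_xh, w_hh, w_hy, f=np.tanh):
    """Run a simple RNN over x_seq (shape T x d); return hidden states and outputs."""
    h = np.zeros(w_hh.shape[0])            # initial hidden state set to zero (assumption)
    hidden, outputs = [], []
    for x_t in x_seq:
        h = f(w_xh @ x_t + w_hh @ h)       # h_t = f(w_xh x_t + w_hh h_{t-1})
        hidden.append(h)
        outputs.append(w_hy @ h)           # y_t = w_hy h_t
    return np.array(hidden), np.array(outputs)

rng = np.random.default_rng(0)
T, d, n_hidden = 24, 4, 8                  # 24 steps, 4 inputs; 8 hidden units is arbitrary
x_seq = rng.normal(size=(T, d))
w_xh = rng.normal(scale=0.5, size=(n_hidden, d))
w_hh = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
w_hy = rng.normal(scale=0.5, size=(1, n_hidden))

hidden, outputs = rnn_forward(x_seq, w_xh, w_hh, w_hy)

# Changing only the first input also changes later hidden states, because
# each state is computed from the previous one through w_hh.
x_perturbed = x_seq.copy()
x_perturbed[0] += 1.0
hidden_perturbed, _ = rnn_forward(x_perturbed, w_xh, w_hh, w_hy)
```

Comparing `hidden` with `hidden_perturbed` shows how an early input leaves a trace in subsequent states, which is the property a feed-forward network lacks.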
In the proposed water quality predictive model, we apply the RNN structure with the LSTM cell [26]. To predict the concentration of DO at time step t + 1, the input time series includes data from the previous m time steps. Additionally, each time step has n water quality variables. Consequently, each input of the RNN model can be interpreted as an m × n matrix. The explicit hyperparameters of our RNN model will be outlined in the following Section 3.2.
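One way to build such m × n inputs from an hourly record is a sliding window. The sketch below is an assumption about how the windows could be formed, not the authors' preprocessing code. The helper name `make_windows`, the `horizon` parameter, and the toy data are hypothetical.

```python
def make_windows(records, target, m=24, horizon=1):
    """Pair each m-step window of input records with the target value
    `horizon` steps after the window's last record.

    records: list of per-hour feature lists (length n each)
    target:  list of per-hour target values (e.g., DO), same length as records
    """
    inputs, outputs = [], []
    last_start = len(records) - m - horizon
    for start in range(last_start + 1):
        end = start + m                       # window covers indices [start, end)
        inputs.append(records[start:end])     # an m x n matrix
        outputs.append(target[end - 1 + horizon])
    return inputs, outputs

# Toy data: 30 hourly records with n = 4 features; the target is the hour index.
records = [[float(t), t * 0.1, t * 0.01, 1.0] for t in range(30)]
target = list(range(30))

X1, y1 = make_windows(records, target, m=24, horizon=1)
X3, y3 = make_windows(records, target, m=24, horizon=3)
```

With `horizon=1` the target is the value one hour after the window ends. Larger horizons shift the target further ahead and leave fewer usable windows.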

Model Evaluation
We compared the kPCA-RNN model with the following three machine learning methods:
1. Feed-forward neural network (FFNN). FFNN has been broadly adopted for water quality analysis due to its capability in capturing nonlinear relationships within the short-term period [13].
2. General regression neural network (GRNN). GRNN [27] is a type of radial basis function neural network that has good nonlinear approximation ability and fast convergence speed. It has been widely applied in short-term water quality forecasting [14,28].
3. Support vector regression (SVR). SVR is a classic machine learning technique which can map inputs into higher dimensional space and interpret the problem as a linear regression [29].
The following performance indicators were applied to evaluate the predictive results: the mean absolute error (MAE), the coefficient of determination (R²), the root mean square error (RMSE), and the percent of predictions within a factor of 1.1 (FA1.1) [30]. In these indicators, $f_i$, $\hat{f}_i$, n, and m represent the observed value, the predicted value, the number of observations, and the number of predictions within a factor of 1.1 of the observed values, respectively.
Additionally, Figure 6 depicts the workflow of predicting the concentration of DO by using the kPCA-RNN model. There are two key steps in this workflow: applying the kPCA to denoise and reconstruct input data, and implementing the RNN model to forecast the trend of dissolved oxygen in future time steps.
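The four indicators can be computed as in the sketch below. The formulas are the common textbook definitions and are assumptions here. In particular, R² is taken as one minus the ratio of the residual to the total sum of squares, and FA1.1 counts predictions whose ratio to the observation lies between 1/1.1 and 1.1. The function name `evaluate` and the sample values are hypothetical.

```python
import math

def evaluate(observed, predicted, factor=1.1):
    """Return MAE, RMSE, R2 and the percentage of predictions within `factor`."""
    n = len(observed)
    errors = [o - p for o, p in zip(observed, predicted)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mean_obs = sum(observed) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    r2 = 1.0 - ss_res / ss_tot               # assumed definition of R2
    # m counts predictions whose ratio to the observation lies in [1/factor, factor].
    m = sum(1 for o, p in zip(observed, predicted) if 1.0 / factor <= p / o <= factor)
    return {"MAE": mae, "RMSE": rmse, "R2": r2, "FA": 100.0 * m / n}

observed = [6.0, 7.0, 8.0, 7.0]     # hypothetical DO readings in mg/L
predicted = [6.5, 7.0, 7.5, 8.0]    # hypothetical model outputs
scores = evaluate(observed, predicted)
```

In this toy case three of the four predictions fall within the factor of 1.1, so the FA value is 75%, while the single larger miss dominates the RMSE.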

Workflow of Predicting DO
Firstly, the kPCA method is implemented on the tabulated water quality data (Table 1) to create corresponding principal components. The principal components with less importance are dropped to reduce the background noise in the original water quality dataset. Consequently, the remaining principal components are selected as new inputs for the predictive model.
Next, the input data are arranged into an m × n matrix as explained in Section 2.2.2. After training and testing the RNN model, the concentration of DO in the upcoming time steps can be estimated. The kPCA-RNN model described in Figure 6 differs from most existing DO forecasting models in the following aspects:
1. Instead of using the sensor data directly, the kPCA method is applied to the water quality sensor data to construct new inputs based on principal components. This step can help reduce the background noise and keep the most useful information for DO forecasting tasks.
2. The recurrent neural network is applied to process the time series water quality data. The recurrent structure offers a powerful way of capturing the temporal patterns across a period of time, which is critical in forecasting changes in DO concentration in the future.

Applying kPCA on the Water Quality Data
We applied the kPCA method to the water quality dataset (Table 1) and obtained five principal components (Table 2), ordered by their corresponding eigenvalues. The first principal component is the linear combination of all the variables that has maximum variance, so it accounts for as much variation in the data as possible. After that, each succeeding component, in turn, has the highest variance possible under the constraint that it is orthogonal to the preceding components. The cumulative variance proportion of the first four principal components is 94.8%; that is, retaining only the first four principal components explains 94.8% of the full variance.
As has been pointed out, the first principal component (PC1) has the highest correlation (dotted box, Figure 7) with variables like temperature, pH and turbidity. The three dotted boxes in line 5 (PC1) highlight the highest correlation scores obtained for the corresponding water quality variables (listed on the bottom axis). Furthermore, the second principal component (PC2) has the highest correlation (dotted box in line 6) with the remaining variables, EC and Chl-a. This indicates that, by utilizing only principal components PC1 and PC2, most of the information in those five water quality variables can be represented. Furthermore, PC3 and PC4 also have a strong correlation with EC, pH, and turbidity (solid box). In contrast, PC5 has a low correlation coefficient with all water quality variables, which suggests that it mainly carries noise [31]. Accordingly, we accept the first four principal components as new inputs. The kPCA method thus reduces the input size by 20% while still keeping the most valuable information.
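The cumulative variance figure can be reproduced from a list of eigenvalues, as sketched below. The eigenvalues and the 90% threshold are hypothetical: they are not the values in Table 2, and in the text the components were selected using the correlation analysis rather than a fixed threshold. The helper name `components_for_variance` is also an assumption.

```python
def components_for_variance(eigenvalues, threshold):
    """Smallest number of leading components whose cumulative share of the
    total eigenvalue sum reaches `threshold` (a fraction between 0 and 1)."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for count, value in enumerate(ordered, start=1):
        cumulative += value / total
        if cumulative >= threshold:
            return count, cumulative
    return len(ordered), cumulative

# Hypothetical eigenvalues for five components (not the values in Table 2).
eigenvalues = [2.0, 1.5, 0.8, 0.5, 0.2]
kept, share = components_for_variance(eigenvalues, threshold=0.90)
reduction = 1.0 - kept / len(eigenvalues)   # fraction of inputs dropped
```

With these illustrative numbers, four of five components are kept, which corresponds to the same 20% reduction in input size described above.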

RNN Hyperparameters Settings
One challenge of building a neural network model is optimizing the hyperparameters for predictive accuracy [32]. Generally, different neural network settings are required to achieve good results for different forecasting tasks. Hence, we need to choose proper neural network parameters for forecasting DO concentration over three different predictive horizons.
Three RNN models were designed to predict the next one, two, and three hours of DO concentration independently. Each RNN model has its own parameters, and all three use four months of data (2928 samples) for training and one month of data (744 samples) for testing. Based on the partial autocorrelation analysis in Section 2.1.2, data from the previous 24 time steps were used as the model's input when predicting the concentration of DO in each future step.
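The sample counts above are consistent with hourly records split in time order. The sketch below illustrates such a split under the assumption that the four training months total 122 days and the test month has 31 days. These day counts are inferred from the sample sizes and are not stated in the text, and the helper name `chronological_split` is hypothetical.

```python
def chronological_split(samples, n_train):
    """Split time-ordered samples without shuffling: earliest n_train for training."""
    return samples[:n_train], samples[n_train:]

HOURS_PER_DAY = 24
train_days, test_days = 122, 31            # assumption: 122 training days, 31 test days
hourly_index = list(range((train_days + test_days) * HOURS_PER_DAY))

train, test = chronological_split(hourly_index, train_days * HOURS_PER_DAY)
```

Keeping the split chronological means the model is always evaluated on observations that come after everything it was trained on.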
The hyperparameters of the three RNN models are defined in Table 3.
Similarly, our proposed model achieves an R² value of 0.823 for 2 h ahead forecasting (Figure 8b). When the predictive horizon increases, the model no longer predicts the value immediately after the last true measurement. Instead, it needs to look a further step ahead to generate the prediction. This usually happens when the model acts as an early warning system, so that there is enough time to deliver management activities based on the forecasting results. In this circumstance, the model can only utilize what has already been measured to make the prediction. Hence, the prediction accuracy decreases slightly in this case.

Results and Discussion
In Figure 8c, the model obtains an R² value of 0.671 for 3 h ahead forecasting. As discussed above, prediction becomes more challenging as the predictive horizon increases. Even so, around 93% of the predictions in this case are still within the ±10% range of the original observations (FA1.1). This gives us confidence that the proposed kPCA-RNN model can still yield promising estimations. As discussed in Section 1, rapid changes in DO concentration within a few hours can put aquatic life under high stress. Hence, reliable predictions a few hours ahead are significant for early warning and for adjusting management activities.
In addition, we listed the RMSE value at each time step for all three experimental cases (lower part of each subfigure in Figure 8). This offers a detailed insight into the prediction performance of the proposed kPCA-RNN model. The RMSE figures clearly indicate that our model has stable accuracy at most of the time steps. This is critical in applying the model to real-world water quality sensor data.
As can be seen, the most biased estimations happened between 21 October 2015 and 29 October 2015, when there was a strong fluctuation of DO concentration. According to the water monitoring reports published by the Queensland Government, there was a large amount of discharge of total nutrients and of dissolved and particulate nutrients during that period. In contrast, the discharge in the previous months was low. This indicates that the concentration of DO was changing more frequently and more heavily in October. However, the kPCA-RNN model was trained on DO concentrations obtained from earlier months with regular DO change. Consequently, some predictions fall below the high points of the observations; for example, the predictions around 22 October 2015. Hence, it is necessary to involve extra water quality data to cover a longer time period.
We additionally compared the performance of the kPCA-RNN model with three models stated in Section 2.3. The same data set described in Section 2.1.1 was applied in all cases. For FFNN, we set the same neural network size as in the kPCA-RNN model. For GRNN, the standard deviation is set to 10 for the high dimensional inputs. For SVR, the Radial Basis Function kernel (RBF) is taken as the nonlinear kernel. The corresponding results are listed in Table 4. The kPCA-RNN models offer the best performance in all three of the prediction cases (Table 4). For example, in the 1 h ahead prediction, 99.5% of the predictions are within the FA1.1 range, which demonstrates that the model has a stable accuracy for most predictions.
Specifically, the kPCA-RNN model improves on the RMSE of the FFNN by 8%, 17% and 40% in the three cases, respectively. Similarly, the kPCA-RNN model achieves 43%, 52% and 21% improvements in RMSE over the SVR. Compared to GRNN, our proposed model gains 41%, 29% and 12% improvements in RMSE. The FFNN, SVR and GRNN are ineffective in predicting changes in DO concentration with a 2 or 3 h predictive horizon because their model structures are not designed to handle time series data, so the temporal pattern cannot be efficiently captured.
Hence, the kPCA-RNN model can serve as an early warning predictor for DO in application areas such as aquaculture ponds. When the model raises an alarm for a significant predicted change in DO, farmers can consider appropriate actions to maintain the DO at a suitable level for the health of the aquatic ecosystem.
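The percentage improvements in RMSE reported above can be read as relative reductions against each baseline. The text does not state the exact formula, so the definition below is an assumption, and the RMSE values are hypothetical rather than taken from Table 4.

```python
def relative_improvement(rmse_baseline, rmse_model):
    """Percentage reduction in RMSE relative to the baseline model."""
    return 100.0 * (rmse_baseline - rmse_model) / rmse_baseline

# Hypothetical RMSE values, not taken from Table 4.
improvement = relative_improvement(rmse_baseline=0.50, rmse_model=0.46)
```

Under this definition, lowering a baseline RMSE of 0.50 to 0.46 counts as an improvement of about 8%.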

Conclusions
To summarize, the kPCA-RNN model was able to successfully predict the trend of DO in the following 1 to 3 h. We evaluated our model based on water quality data from the Burnett River, Australia, and compared it with the FFNN, SVR and GRNN methods. The results demonstrate that our method is more accurate and stable than the alternative methods, especially as the predictive horizon increases. Furthermore, as a data-driven modeling method, the kPCA-RNN model is not limited to a specific hydrological area and can be extended to predict various water quality variables.
For future work, inputs can be improved to include extra information such as rainfall and to cover more extended periods of time. In addition, the water quality predictive model can be extended to support predicting multiple water quality variables simultaneously.
Author Contributions: Methodology, writing-original draft preparation, Y.-F.Z.; writing-review and editing, P.F.; project administration, writing-review and editing, P.J.T. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.