Article

Prediction of Physico-Chemical Parameters of Surface Waters Using Autoregressive Moving Average Models: A Case Study of Kis-Balaton Water Protection System, Hungary

by Zsófia Kovács 1,2,†, Bálint Levente Tarcsay 2,3,†, Piroska Tóth 1,2, Csenge Judit Juhász 2,3, Sándor Németh 3 and Amin Shahrokhi 4,*

1 Sustainability Solutions Research Laboratory, Research Centre for Biochemical, Environmental and Chemical Engineering, University of Pannonia, 8200 Veszprém, Hungary
2 National Laboratory for Water Science and Water Security, University of Pannonia, 8200 Veszprém, Hungary
3 Department of Process Engineering, Research Centre for Biochemical, Environmental and Chemical Engineering, University of Pannonia, 8200 Veszprém, Hungary
4 Department of Radiochemistry and Radioecology, Research Centre for Biochemical, Environmental and Chemical Engineering, University of Pannonia, 8200 Veszprém, Hungary
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Water 2024, 16(16), 2314; https://doi.org/10.3390/w16162314
Submission received: 12 July 2024 / Revised: 8 August 2024 / Accepted: 12 August 2024 / Published: 16 August 2024

Abstract

In this work, the authors provide a case study of time series regression techniques for water quality forecasting. With the constant striving to achieve the Sustainable Development Goals (SDG), the need for sensitive and reliable water management tools has become critical. Continuous online surface water quality monitoring systems that record time series data about surface water parameters are essential for the supervision of water conditions and proper water management practices. The time series data obtained from these systems can be used to develop mathematical models for the prediction of the temporal evolution of water quality parameters. Using these mathematical models, predictions can be made about future trends in water quality to pinpoint irregular behaviours in measured data and identify the presence of anomalous events. We compared the performance of regression models with different structures for the forecasting of water parameters by utilizing a data set collected from the Kis-Balaton Water Protection System (KBWPS) wetland region of Hungary over an observation period of eleven months as a case study. In our study, autoregressive integrated moving average (ARIMA) regression models with different structures have been compared based on forecasting performance. Using the resulting models, trends of the oxygen saturation, pH level, electrical conductivity, and redox potential of the water could be accurately forecast (validation data residual standard deviation between 0.09 and 20.8) while in the case of turbidity, only averages of future values could be predicted (validation data residual standard deviation of 56.3).

1. Introduction

In this work, our aim is the development of time series regression models for water quality estimation in the Kis-Balaton Water Protection System (KBWPS) wetland region in Hungary.
Increasing environmental awareness and the drive to attain the Sustainable Development Goals (SDG) have strongly propelled the development of water management tools [1]. Online water quality monitoring systems have become standard for the real-time supervision of surface water sources [2]. The main goal of these tools is to pinpoint the presence of water pollution and adverse environmental circumstances, and to provide opportunities to mitigate damage to the water body through timely intervention. Parallel to the advancement of online water quality monitoring systems, the development of various models for water quality prediction has become a focal point of academic research [3]. The abundance of time series data collected by the online monitoring systems could be used as a basis for developing time series regression models for the forecasting of water quality. These models could subsequently be used for the supervision of water sources by pinpointing anomalies such as environmental changes in the water or sensor defects (e.g., measurement bias or sensor drift) [4].
Water quality models can be divided into two groups, the first being mechanistic models, which mainly rely on computational fluid dynamics (CFD) techniques for describing the temporal and spatial evolution of surface waters through laws of nature [4]. These models provide in-depth knowledge and are generally robust and relatively precise. However, as a trade-off, they require explicit knowledge about the studied body of water, including information about the influx of water into the observed water body, terrain conditions, flow rates, etc. to establish accurate models. Contrary to this, data-based models only require observed time series data to make forecasts about the temporal evolution of water quality parameters. While not as robust as mechanistic models, their simple applicability propelled these techniques to the forefront of water quality modelling [4]. For the estimation of water quality temporal evolution, mostly artificial neural network (NN) structures [5], support vector regression techniques [6], and autoregressive methods [7] have been utilized among the data-based approaches. The trends show that while NN models have become increasingly popular, with a multitude of novel research approaches such as the use of radial basis function NNs [8] and various integrated methods that enhance the networks with filters [9], their overall practical application still suffers from a multitude of issues [10]. The determination of network topology is a complex process, and it requires problem-specific knowledge, since the relationships between predictor and predicted variables have no clear physical meanings and are thus difficult to understand.
On the other hand, autoregressive methods are still strongly represented in academic research due to their simpler structure and relatively high accuracy. The most commonly utilized techniques are the autoregressive (AR), AR with exogenous inputs (ARX), autoregressive integrated moving average (ARIMA), ARIMA with exogenous inputs (ARIMAX), seasonal ARIMA (SARIMA), and SARIMA with exogenous inputs (SARIMAX) time series models [11]. These techniques all have the common core element of utilizing previously observed values of a predicted variable to forecast the future values of said variable, which is provided by the autoregressive component of the model. While these techniques usually have good performance and may even compete with more complex model structures in terms of accuracy depending on the modelled time series [12], they are at their core linear techniques. Due to this, in the case of more complex, nonlinear systems, their forecasting performance may be lacking, which is why hybrid approaches using purely autoregressive techniques and other modelling methods have garnered increased research interest [11].
Examples of such hybrid techniques include the combination of autoregressive techniques with Markov-switching [4], wavelet methods [13], and long short-term memory methods [14]. However, the most extensive and common hybrid approaches utilize the combination of NN techniques and autoregressive approaches [15]. The research focus in this case lies on increasing the general applicability [8], accuracy [16], and the techniques’ ability to deal with data irregularities (missing data, outliers) [17].
It should be noted, however, that the models must be selected for each application on a case-by-case basis. While hybrid methods and NN models can be very reliable in many cases, autoregressive models perform well for water quality prediction with simpler model structures if complex environmental and hydrological effects are not present. Therefore, their applicability has been studied and utilized for water quality supervision in many surface water systems worldwide [18]. Case studies and applications have been published all over the world, for example from India [19], Saudi Arabia [20], and Malaysia [21], which show the widespread use of autoregressive techniques for water quality time series forecasting.
In our work, we follow the previously established trends and use autoregressive models for estimating the temporal evolution of variables pertaining to water quality in the Kis-Balaton wetland region of Hungary. To the best of the authors’ knowledge, no studies have been published on the water quality evolution in the Kis-Balaton wetland region, despite its importance to international nature conservation efforts. This importance is underlined by its place on the list of “Wetlands of International Importance as Waterfowl Habitat” established by the Ramsar Convention.
The authors put forward the following points to summarize the significance of the study:
  • Water quality observation in the wetland region of Kis-Balaton that holds international significance from a waterfowl habitat preservation point of view;
  • The development and comparison of a multitude of autoregressive models for water quality estimation in the region.
Following the literature review in the introduction, the authors introduce the studied area and its ecological importance. The online monitoring system as well as the measured water quality parameters and measurement procedures are introduced in the Materials and Methods Section, followed by the mathematical formalization of the employed time series regression techniques, their evaluation standards, and the fine-tuning procedure. The measurement results and the working data are introduced subsequently in the Preprocessing of Data Measured by a Continuous Water Quality Monitoring System Subsection of the Results Section, where the authors introduce the data structure and the data preprocessing. The results of the modelling techniques on the time series data are showcased in the Results Section, where the choice of the optimal model structure in terms of forecasting ability and accuracy is shown. Finally, in the Discussion Section, we evaluate the obtained models and propose additional uses and further goals for the provided work.

2. Materials and Methods

2.1. Studied Area

The Kis-Balaton Water Protection System (KBWPS) area (14,6 ha), designated as a Ramsar site, is of significant ecological importance, particularly within the context of the Zala River’s watershed. This designation underscores the area’s global significance as a wetland, highlighting its crucial role in biodiversity conservation, water management, and ecological services. The Ramsar Convention is an international treaty for the conservation and sustainable use of wetlands, and sites designated under this convention are recognized for their international importance, especially as habitats for waterbirds [22]. KBWPS is a biodiversity hotspot within the Pannonic biogeographic region, supporting a wide range of species, including several that are threatened or endangered. The ecological processes within the Kis-Balaton Water Protection System, particularly the filtration of nutrients by aquatic vegetation and sediment trapping, play an essential role in maintaining water quality in the Zala River and, by extension, Lake Balaton. The marshes and wetlands of KBWPS act as a natural water purification system, reducing the pollution load entering Lake Balaton, Hungary’s largest lake.
The inflow from the Zala River supports diverse habitats in and around Keszthely Bay, contributing to the lake’s biodiversity. The mixing of riverine and lacustrine waters creates unique ecological niches, supporting a variety of fish species, birds, and other wildlife. The river’s inflow also influences the distribution of aquatic vegetation, affecting the spawning and feeding grounds of fish and other aquatic organisms [23].
Given the importance of the Zala River’s inflow to Lake Balaton’s ecological health, conservation and management efforts are essential. These may include measures to control pollution sources upstream, manage land use to reduce sediment and nutrient runoff [24], and monitor water quality to detect and address emerging issues [25]. Sustainable management of the Zala River watershed is crucial for the long-term health and resilience of Lake Balaton’s ecosystems [26]. The Zala River’s inflow into Keszthely Bay at the western end of Lake Balaton is a key factor in the lake’s hydrological and ecological dynamics. Understanding and managing this inflow is essential for preserving the lake’s water quality [27], supporting its biodiversity, and ensuring its continued value as a natural resource and recreational destination.

2.2. Introduction of the Sampling Sites and the Continuous Online Water Quality Monitoring Systems

The Zala River, as the largest inflow to Lake Balaton, plays a crucial role in the hydrology and ecology of the lake, entering it at the western end and flowing into Keszthely Bay. This entry point is significant for several reasons, as it affects the water quality, sediment transport [28], and ecosystem dynamics of Lake Balaton, particularly in the Keszthely Bay area. The water source in question is surface water and wetlands designated as a nature preserve area, feeding and protecting Hungary’s biggest lake, and the main parameters to control are connected to the nutrient load going into Lake Balaton, as well as the preservation of the precious wildlife and waterfowl habitat.
The overview of the investigated area, featuring the geographic locations and photographs of the water quality monitoring stations, is shown in Figure 1.
Two sampling sites were designated [estuary of Zala River into Lake Balaton (b) 21 T and between Hídvégi Pond and Fenéki Pond (a) 4 T] in the catchment area of Lake Balaton, where Western Transdanubia Water Directorate (Hungary) employees regularly perform surface water sampling.
  • Hídvégi Pond, with a surface area of 18 km², has a mean depth of 1.1 m.
  • Fenéki Pond is a typical wetland with a surface area of 16 km².
To detect changes in water quality, we placed two self-developed modular continuous online water quality monitoring stations at the 21 T and 4 T water sampling sites. Calibrated Ponsel digital electrodes [Neotek-Ponsel: digital sensor PPHRB: pH, redox, temperature; digital sensor C4E: conductivity; digital sensor PODOA (luminescent optical technology): dissolved oxygen, saturation; digital sensor PNEPB-IR (nephelometric): turbidity; AQUALABO smart water solution, Champigny-sur-Marne, France] were placed in the two installed continuous online water quality monitoring stations (2 × 4 probes), and six parameters were measured.
  • Parameters: temperature (°C), pH, redox potential (mV), electrical conductivity (EC), dissolved oxygen (DO), and turbidity (NTU).
  • Routine maintenance and cleaning of the sensors was performed every two weeks, while calibration was done every two months. The “smart” sensors store calibration data and measurement history within their internal memory. Signal drift was not observed in the data set.
  • Measurement frequency: 15 min.
  • Energy supply: batteries and solar panel.
Categorization of the surface water quality was performed according to the Water Framework Directive (WFD) and Decree No. 10/2010 (VIII.18) of the Ministry of Rural Development [29]. Based on the WFD and the National Water Management Plan (VGT3), the surface water quality categories of the measured parameters are determined by typological type. For the water quality measured in the Zala River, the ranges corresponding to excellent and good water quality are as follows:
  • The pH range is 7.5–8.5;
  • The electrical conductivity range is <700 µS/cm;
  • The dissolved oxygen range is 7.5–10.5 mg/L and the oxygen saturation range is 70–120%;
  • Turbidity does not have a limit value; based on regular measurements carried out by the Western Transdanubia Water Directorate (Hungary), its average value is <45 NTU.

2.3. Autoregressive Time Series Modelling Techniques

In this study, ARIMA models were utilized to describe the time series containing the water quality data and provide forecasting tools. The ARIMA model utilizes three components for the prediction of the value of an observed variable. The first is the set of previous, lagged values of said variable, which provides the autoregressive (AR) part of the model. The integrated (I) part refers to the degree of differencing applied to the observations to obtain a stationary time series. Finally, the moving average (MA) part incorporates the dependency between an observation and the residual errors of previous observations into the model. A general ARIMA model is thus defined using three parameters, p, d, and q, where p is the order of the AR component (the number of lag terms used for estimation), d is the degree of differencing, and q is the order of the MA component (the number of lagged error terms, i.e., the size of the moving average window) [30]. The general notation to describe an ARIMA model structure is ARIMA (p, d, q).
Let $y \in \mathbb{R}^{n \times 1}$ be a univariate time series with n observed values. The general ARIMA model for the prediction of the t-th time stamp entry, where t = p + 1, p + 2, …, n, can be formulated according to Equation (1). Here, $\Delta$ is the difference operator, c is the offset, $\phi_i$ refers to the weighting parameter of the i-th autoregressive predictor, $\epsilon$ is a white noise sequence, and $\theta_i$ is the weighting parameter of the i-th white noise entry.
$$\Delta^d y_t = c + \phi_1 \Delta^d y_{t-1} + \dots + \phi_p \Delta^d y_{t-p} + \epsilon_t + \theta_1 \epsilon_{t-1} + \dots + \theta_q \epsilon_{t-q} \tag{1}$$
Introducing the lag operator L, the formulation can be condensed into the form displayed in Equation (2), where $L^i y_t = y_{t-i}$, $\phi(L) = 1 - \phi_1 L - \dots - \phi_p L^p$, and $\theta(L) = 1 + \theta_1 L + \dots + \theta_q L^q$.
$$\phi(L)(1 - L)^d y_t = c + \theta(L)\epsilon_t \tag{2}$$
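As an illustration of Equation (1), the following minimal Python sketch applies the difference operator d times and evaluates a one-step-ahead prediction on the differenced series. The function names and all numerical values are hypothetical and serve only to show the arithmetic; the future error term is replaced by its expected value of zero, and the coefficients are assumed to be already known rather than estimated.

```python
def difference(series, d=1):
    """Apply the difference operator (1 - L) to a sequence d times."""
    values = list(series)
    for _ in range(d):
        values = [values[i] - values[i - 1] for i in range(1, len(values))]
    return values


def arma_one_step(history, past_errors, c, phi, theta):
    """One-step-ahead prediction following Equation (1) on a differenced series.

    history     -- past (differenced) values, most recent last
    past_errors -- past residuals, most recent last
    phi, theta  -- AR and MA weights; index 0 multiplies the most recent entry
    """
    ar_part = sum(w * history[-(i + 1)] for i, w in enumerate(phi))
    ma_part = sum(w * past_errors[-(i + 1)] for i, w in enumerate(theta))
    return c + ar_part + ma_part


# Hypothetical numbers, chosen only to illustrate the calculation.
y = [8.0, 8.5, 9.5, 9.0, 10.0]
dy = difference(y, d=1)  # first differences of y
prediction = arma_one_step(dy, past_errors=[0.1, -0.2], c=0.0,
                           phi=[0.5, 0.25], theta=[0.5])
```

Since the prediction is made on the differenced series, it would have to be added back to the last observed value (once for each order of differencing) to obtain a forecast on the original scale.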
The definition of the ARIMA models can be performed in a multitude of ways. In this case study, the authors utilize the original algorithm proposed by Box and Jenkins to fit the model [31]. In the case of the Box–Jenkins method, three steps are performed in an iterative fashion to obtain the ARIMA model:
  • Model structure selection.
  • Model parameter estimation.
  • Statistical model validation.
The first step is the determination of the general model structure, that is, the choice of p, d, and q. The parameter d is chosen to ensure that the time series is stationary; the order of differencing can be determined by inspecting autocorrelation function plots of the observed time series or by using statistical tests such as the augmented Dickey–Fuller test [32] or the Kwiatkowski–Phillips–Schmidt–Shin test, which are the most common statistical tests used to evaluate stationarity in time series regression [33]. For p and q, autocorrelation function and partial autocorrelation function charts can be used to roughly estimate their values. For a more rigorous approach, the Akaike information criterion (AIC) or its modified version for small sample sizes (AICc) may be utilized [34]. For ARIMA models, the AIC and AICc may be calculated according to Equations (3) and (4), where k = 1 if c ≠ 0 and k = 0 otherwise, and $\hat{L}$ is the maximum of the likelihood function of the prediction. Variables p and q should be chosen in a fashion that minimizes the AICc.
$$\mathrm{AIC} = -2\log(\hat{L}) + 2(p + q + k + 1) \tag{3}$$
$$\mathrm{AICc} = \mathrm{AIC} + \frac{2(p + q + k + 1)(p + q + k + 2)}{n - p - q - k - 2} \tag{4}$$
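The two criteria can be evaluated directly once the maximized log-likelihood of a fitted model is known. The sketch below assumes the standard form of the AIC (with a factor of −2 on the log-likelihood) and its small-sample correction; the function names and the log-likelihood values of the candidate models are hypothetical and are not taken from the study.

```python
def aic(log_likelihood, p, q, k):
    """AIC = -2 log(L) + 2 (p + q + k + 1)."""
    return -2.0 * log_likelihood + 2.0 * (p + q + k + 1)


def aicc(log_likelihood, p, q, k, n):
    """Small-sample correction of the AIC for n observations."""
    m = p + q + k + 1  # number of estimated parameters
    return aic(log_likelihood, p, q, k) + 2.0 * m * (m + 1) / (n - m - 1)


# Hypothetical maximized log-likelihoods for three candidate (p, q) structures.
candidates = {(1, 1): -120.0, (2, 1): -112.0, (3, 2): -111.5}

# Select the structure with the smallest AICc (offset included, so k = 1).
best = min(candidates,
           key=lambda pq: aicc(candidates[pq], pq[0], pq[1], k=1, n=500))
```

In this made-up example the (3, 2) structure has the highest likelihood, but the penalty on its additional parameters outweighs the gain, so the (2, 1) structure is selected.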
Once the general model structure has been established, the subsequent steps involve estimating the model parameters, which are generally performed using nonlinear least squares or maximum likelihood estimation methods [31]. It is expected that the distribution of the error terms of the model fit to the training data should approach a white noise sequence with no apparent correlations between error terms and time stamps. This assumption can be tested using the Ljung–Box test, which evaluates the hypothesis of whether the error terms are independently distributed. Using the test, many authors have proposed ways to estimate optimal lag numbers should the model not perform efficiently [35].
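For reference, the Ljung–Box statistic itself is simple to compute from the sample autocorrelations of the residuals. The following is a minimal sketch with hypothetical function names; it returns only the test statistic Q, which would then be compared against a chi-square quantile (for residuals of an ARMA(p, q) fit, the degrees of freedom are commonly taken as the number of tested lags minus p + q).

```python
def autocorrelation(x, lag):
    """Sample autocorrelation of sequence x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    num = sum((x[t] - mean) * (x[t - lag] - mean) for t in range(lag, n))
    return num / denom


def ljung_box_q(residuals, max_lag):
    """Ljung-Box statistic Q = n (n + 2) * sum_k rho_k^2 / (n - k)."""
    n = len(residuals)
    return n * (n + 2) * sum(
        autocorrelation(residuals, k) ** 2 / (n - k)
        for k in range(1, max_lag + 1)
    )


# Hypothetical, strongly autocorrelated residuals (alternating sign).
residuals = [1.0, -1.0] * 10
q_stat = ljung_box_q(residuals, max_lag=1)
```

For this artificial sequence, Q is far above 3.841, the 0.05 critical value of the chi-square distribution with one degree of freedom, so the hypothesis of independent residuals would be rejected.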
Finally, model validation is generally performed by observing the error terms between the measured and forecast behaviour of the time series. In this work, the authors utilize the mean squared error (MSE) for quantifying the forecasting prowess of the models [36].
The MSE score is calculated as per Equation (5) for a validation data set with n entries, measured values $y_i$, and forecast values $\hat{y}_i$, with $\hat{y}_i - y_i$ being the forecasting error terms.
$$\mathrm{MSE} = \frac{\sum_{i=1}^{n} (\hat{y}_i - y_i)^2}{n} \tag{5}$$
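A direct implementation of the MSE is shown below as a short sketch; the function name and the sample values are hypothetical.

```python
def mean_squared_error(measured, forecast):
    """Average squared difference between forecast and measured values."""
    if len(measured) != len(forecast):
        raise ValueError("measured and forecast must have the same length")
    return sum((f - y) ** 2 for y, f in zip(measured, forecast)) / len(measured)


# Hypothetical validation values and a constant forecast.
measured = [7.9, 8.1, 8.0, 8.2]
forecast = [8.0, 8.0, 8.0, 8.0]
mse = mean_squared_error(measured, forecast)
```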

3. Results

3.1. Preprocessing of Data Measured by a Continuous Water Quality Monitoring System

The observed data set contained measurement results for six variables, namely water temperature, dissolved oxygen concentration and saturation, pH levels, redox potential, electrical conductivity, and turbidity within the water. The data set was obtained over the course of approximately 307 days, with a sampling interval of 15 min for all sensors.
The data set spanned an observation period of eleven months, from January to November, containing a total of 29,489 data entries. The time series data were preprocessed before being used for modelling, because outlier points as well as missing values were present in the original data set. Outlier detection was performed using interquartile range and z-score tests, depending on the distribution of the measured values [37]. In the case of normally distributed variables (electrical conductivity and pH), entries with an absolute z-score of 3 or above were considered outliers and eliminated. Interquartile range testing was applied for variables (oxygen concentration and saturation, turbidity, redox potential) whose distributions did not follow the normal distribution. The number of identified outliers was negligible (124 data entries, or 0.42% of all samples). The samples considered as outliers or those with missing values were removed from the data set. Finally, the data were resampled to two observed data points per day to provide time series models that have better forecasting capabilities on a larger timescale. The temporal evolution of the different water quality parameters after the outlier removal can be seen in Figure 2, Figure 3, Figure 4, Figure 5, Figure 6 and Figure 7.
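The two outlier rules can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the function names are hypothetical, the use of the population standard deviation for the z-score is an assumption, and the 1.5 × IQR fence is the conventional choice, which the text does not specify.

```python
import statistics


def zscore_mask(values, threshold=3.0):
    """True for entries whose absolute z-score is below the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumption: population standard deviation
    if std == 0:
        return [True] * len(values)
    return [abs(v - mean) / std < threshold for v in values]


def iqr_mask(values, k=1.5):
    """True for entries inside [Q1 - k*IQR, Q3 + k*IQR] (k = 1.5 is assumed)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [low <= v <= high for v in values]


# Hypothetical readings with one obviously faulty entry at the end.
readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 100]
kept = [v for v, ok in zip(readings, iqr_mask(readings)) if ok]
```

Note that the z-score rule needs a reasonably large sample to be able to flag anything: with a population standard deviation, no entry in a sample of ten points can reach an absolute z-score above 3, which is one reason the interquartile range rule is used in the small example above.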
The change in temperature naturally follows the annual temperature cycle of the region, increasing as winter turns into summer, showing clear signs of a trend in Figure 2. As expected, related variables such as the dissolved oxygen concentration and dissolved oxygen saturation also correlate with the temperature trend; as the water temperature increases, the total dissolved oxygen and the oxygen saturation show decreasing tendencies, as seen in Figure 3. While not clear-cut, weak signs of seasonality are present in the three data blocks, even within one year, with repeating patterns.
In the case of water pH, shown in Figure 4, a clear-cut trend is hard to distinguish, even though it may be noted that the values somewhat tend to decrease with the onset of summer, and generally higher pH values can be observed in the winter months. As a general observation, it can be seen that the pH of the water tends towards the more basic values.
In the case of redox potential, a clearly increasing trend can be observed in Figure 5 over the investigated period. Weak signs of seasonality within the period are also present, with periodic increases and decrements in redox potential.
In the case of electrical conductivity, while seasonality is not clearly present, a decreasing trend within the values parallel to the onset of the summer months can be observed in Figure 6.
Turbidity values are stochastic and hard to predict; the values show little to no trend, and weak signs of seasonality can be inferred (Figure 7). However, turbidity is mostly linked to variables such as the amount of rainfall over a certain period or the water flow rate, which can all cause the sediment on the water floor to mix up and heavily increase turbidity values.
The mean and standard deviation of the observed variables are displayed using box plots; since the observed variables are on different scales, two plots have been created to properly represent the data distribution. The results can be seen in Figure 8, which correspond to the conclusions drawn from the time plot analysis. In terms of standard deviation, pH shows the smallest variance among the observed variables, followed by dissolved oxygen saturation. The turbidity variable shows the greatest deviation among all the observed variables, which could be attributed to its increased sensitivity to hydrological phenomena, such as precipitation, compared to the other parameters.

3.2. General Scheme for the Development of ARIMA Models: Example of Dissolved Oxygen and Saturation

The resampled time series data were used to develop the time series regression models for forecasting purposes, as was described previously. The process is showcased by using the data set of dissolved oxygen concentration change as an example. In the case of the other variables, the time series regression was performed similarly; the final model fit and prediction accuracy in each case were compared to evaluate the model performance. In order to properly estimate the forecasting capabilities of the model, the observed time series data were partitioned into training and validation data sets. Generally, during validation or cross-validation of regression models, data are split into 70–30%, 80–20%, or 90–10% training and validation batches. The split was done in a 90–10% manner, with the last 10% of entries used as validation data. In this case, for the purposes of later application, it was noted that an approximately one-month forecast for the water quality trend was sufficient for the users; thus, the 90–10 split was chosen. It should be noted that due to the range of the available data (less than a year’s worth), a larger forecast that includes annually repeating seasonal changes in the physico-chemical water parameters cannot be reliably estimated.
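Because the observations are ordered in time, the split is chronological rather than random: the final portion of the series is held out for validation. A minimal sketch with a hypothetical function name is given below; integer arithmetic is used to avoid rounding surprises.

```python
def chronological_split(series, train_percent=90):
    """Split an ordered series into leading training and trailing validation parts."""
    cut = len(series) * train_percent // 100
    return series[:cut], series[cut:]


# Hypothetical series of 100 ordered samples.
samples = list(range(100))
train, validation = chronological_split(samples)
```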
In this subsection, the identification and performance of ARIMA models are shown for variable estimation using the dissolved oxygen saturation example data set. First, the order of differencing had to be established to ensure that the time series was stationary. The test data and their corresponding autocorrelation function (ACF) plot are shown in Figure 9. The confidence bounds on the autocorrelation values can be seen in blue. It can be seen that the data show no clear trend, with possible hints of seasonality. To evaluate the stationarity of the time series, the augmented Dickey–Fuller test was utilized at a significance level of 0.05, which indicated that the time series was stationary [32]. On the ACF plot, it can be observed that there is significant autocorrelation between the values of the time series; the ACF is periodic in nature, hinting at possible seasonality as well.
Since differencing was not necessary based on the results of the augmented Dickey–Fuller test, the ARIMA model, in this case, was simplified into an ARMA model. Subsequently, various model structures were investigated to describe the process. More complex model structures were also considered based on the autocorrelation and higher lag terms in the data, and their fitness was evaluated using the AICc index, as shown in Equation (4).
The evaluation metrics, namely the AICc values and the MSE for the validation data set, can be seen in Figure 10. The performance of various model structures with increasing numbers of AR and MA lags, between 1 and 15, was tested to find the optimal model for forecasting. It can be seen that as the number of AR and MA lags increases, the AICc index increases. This is expected from Equation (4), since the superfluous lag terms provide no additional information gain. An optimized model structure in each instance was chosen manually based on three criteria:
  • The resulting model should have minimal MSE with regard to the validation data;
  • The model should not be biased and, as such, have a minimal value of the Akaike information criterion;
  • The residuals of the model prediction to the training data should pass the Ljung–Box test to ensure that the trend of the time series has been captured.
The model structure that minimizes the AICc is the ARIMA (9,0,3) model, or in this case, the ARMA (9,3) model. This model also provides one of the best MSE scores for the validation data set.
The distribution of the model residuals with the optimized ARMA (9,3) model structure for the training data as well as the ACF of model residuals are shown in Figure 11. The residual of the model fit is normally distributed with no clear trend, which indicates that the linear model offers a good predictive approach and that only random noise and disturbances are not accounted for.
It can be seen that model residuals are normally distributed and the ACF is rapidly approaching zero in most cases, and no significant autocorrelation is present between the residuals. The Ljung–Box test was also used to test the independence of the residuals, and the statistics showed that the residuals were uncorrelated at a significance level of 0.05.
Finally, the predictive capabilities of the ARMA (9,3) model were tested. The dissolved oxygen saturation of the water was forecast for the last 30 days of the observation period. The measured validation data were compared to the forecast data to assess the model’s ability to forecast future trends in dissolved oxygen saturation.
To do this, the last nine measured data points of the test data were utilized in the ARMA (9,3) model. The forecast results, as well as the upper and lower bounds on the prediction can be seen in Figure 12.
The figure also displays the training data and the model’s fit to said training data for the first 273 days. It can be seen that the model accurately forecasts the generic trend in the dissolved oxygen concentration over the validation period. Later on, the trend becomes less accurate due to the presence of random spikes, probably caused by hydrological conditions not accounted for in the model, such as rainfall or changes in water volumetric flow. Regardless, the results show that the time series model accurately forecasts the general trend of the data even over a longer period of time (30 days), and the upper and lower bounds can be used to pinpoint potential outlier values or reveal future trends in water quality parameters.
In the case of the other variables, models have been fitted using a similar approach. The optimized model structures, AICc scores, and their performance on the Ljung–Box test are displayed in Table 1. For example, in the case of the dissolved oxygen saturation model, the 30-day forecast was generated using the last nine samples of the data, at around 270 days, as the model was identified as an ARMA (9,3) type. This means that the nine most recent samples, which form the autoregressive terms, are required to make a one-step-ahead prediction. For the first nine forecast data points, measured data were used as inputs; after that, the previously generated forecast data were exclusively utilized to predict the evolution of water oxygen saturation further into the future. This is why the accuracy of the model on the validation data decreases over time.
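The recursive use of earlier forecasts as model inputs can be sketched as follows. This is a simplified illustration rather than the fitted model: the function name and coefficients are hypothetical, and only the autoregressive part is iterated, since the expected value of future error terms is zero and the moving average contribution vanishes after q steps.

```python
def recursive_ar_forecast(history, c, phi, steps):
    """Iterate an AR(p) model forward, feeding each forecast back in as an input.

    history -- measured values, most recent last (at least len(phi) entries)
    phi     -- AR weights; index 0 multiplies the most recent value
    """
    p = len(phi)
    if len(history) < p:
        raise ValueError("need at least p measured values to start the forecast")
    window = list(history[-p:])  # starts with measured data only
    forecasts = []
    for _ in range(steps):
        nxt = c + sum(w * window[-(i + 1)] for i, w in enumerate(phi))
        forecasts.append(nxt)
        window = window[1:] + [nxt]  # the oldest measured value is displaced
    return forecasts


# Hypothetical AR(1) model: y_t = 1 + 0.5 * y_{t-1}, starting from y = 4.
path = recursive_ar_forecast([4.0], c=1.0, phi=[0.5], steps=3)
```

In this toy example the forecasts move from 3.0 towards the process mean of c / (1 − ϕ₁) = 2, which illustrates why long-horizon forecasts of a stationary autoregressive model flatten out towards the average of the series.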
The O2 concentration variable was not fitted, as it is strongly linearly correlated with the O2 saturation value, and the values can be interchangeably calculated from one another. The same type of strong correlation can be observed in the temperature values, which is why their calculation was omitted. The concentration of O2 is not only related to temperature, but is also influenced by factors such as COD and BOD levels, meaning that the presence of decomposing organic matter can significantly impact this variable.

3.3. Time Series Regression Models for Other Measured Variables

The forecasts for the remaining variables are shown in Figure 13, Figure 14, Figure 15 and Figure 16. While the models do not follow unexpected changes in the water quality variables, they capture the general trend of each observed parameter. For both pH and turbidity, the forecasts converge to the average of the previous values, because in these cases the measured values fluctuated randomly around a mean. To forecast these variables more accurately, exogenous inputs such as hydrological conditions (water flux, precipitation) may be required.
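The convergence of a forecast towards the series average can be seen in the simplest stationary case. The sketch below assumes a first-order autoregressive process rather than the higher-order structures fitted in the study, and the numerical values are purely illustrative.

```python
def ar1_forecast(last_value, mean, phi, steps):
    """h-step-ahead forecasts of a stationary AR(1) process (|phi| < 1).

    The forecast at horizon h is mean + phi**h * (last_value - mean),
    so the influence of the last observation decays geometrically.
    """
    if abs(phi) >= 1:
        raise ValueError("phi must satisfy |phi| < 1 for stationarity")
    return [mean + phi ** h * (last_value - mean) for h in range(1, steps + 1)]


# Illustrative values only: a pH-like series with mean 8.3 and last value 8.8.
forecast = ar1_forecast(last_value=8.8, mean=8.3, phi=0.5, steps=30)
print(forecast[0], forecast[-1])
```

When a series carries little structure beyond random fluctuation around its mean, the informative part of the forecast vanishes within a few steps, which matches the behaviour reported for pH and turbidity.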
Although some nonlinear trends can be found in certain variables within the study, the suitability of linear models such as ARMA and ARIMA was verified by inspecting the model residuals after fitting, both visually (through their distribution) and through the Ljung–Box test. The dissolved oxygen data series serves as an example: the residuals of the model fit are normally distributed with no clear trend, which indicates that the linear model offers a good predictive approach and that only random noise and disturbances are left unaccounted for. The same is true for the pH and turbidity data. In the case of electrical conductivity and redox potential, there was a noticeable trend: the model errors tended slightly towards the positive or the negative, and the distribution was slightly skewed. This indicates that a nonlinear model might have been a better fit; however, these models still passed the Ljung–Box test, indicating that no significant autocorrelation remained in the residuals.
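For reference, the Ljung–Box statistic is Q = n(n + 2) Σ ρ_k² / (n − k), summed over lags k = 1 to h, where ρ_k is the sample autocorrelation of the residuals at lag k. The following is a minimal sketch of that formula in plain Python; the function names and the test series are hypothetical, and the comparison against a chi-square critical value is left to the reader (for ARMA(p,q) residuals the degrees of freedom are commonly taken as h − p − q).

```python
def autocorrelation(x, lag):
    """Sample autocorrelation of x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    if denom == 0:
        raise ValueError("series has zero variance")
    num = sum((x[t] - mean) * (x[t - lag] - mean) for t in range(lag, n))
    return num / denom


def ljung_box_q(residuals, max_lag):
    """Ljung-Box Q statistic over lags 1..max_lag."""
    n = len(residuals)
    return n * (n + 2) * sum(
        autocorrelation(residuals, k) ** 2 / (n - k)
        for k in range(1, max_lag + 1)
    )


# Hypothetical residuals that alternate in sign, i.e. strongly autocorrelated.
q = ljung_box_q([1.0, -1.0] * 10, max_lag=1)
# 3.841 is the 0.95 quantile of the chi-square distribution with 1 degree of freedom.
print(q, q > 3.841)
```

A model passes the test at the 0.05 level when Q stays below the critical value, meaning the residual autocorrelations are statistically indistinguishable from those of white noise.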
Finally, to quantify the prediction error over the thirty-day forecast period, the mean values and standard deviations of the model residuals are reported in Table 2.
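A sketch of how such residual statistics could be computed is given below. The sign convention (measured minus forecast) and the use of the sample standard deviation are assumptions, since the text does not state which were used, and the input values are made up.

```python
import statistics


def residual_summary(measured, forecast):
    """Return (mean, sample standard deviation) of forecast residuals.

    Assumption: residual = measured - forecast.
    """
    if len(measured) != len(forecast):
        raise ValueError("series must have equal length")
    residuals = [m - f for m, f in zip(measured, forecast)]
    return statistics.mean(residuals), statistics.stdev(residuals)


# Made-up values for illustration only.
mean_res, std_res = residual_summary([1.0, 2.0, 3.0], [0.0, 2.0, 4.0])
print(mean_res, std_res)
```

A mean residual far from zero points to a systematic bias in the forecast, while the standard deviation measures its scatter around that bias.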

4. Discussion

In this work, time series regression models were developed for forecasting the water quality parameters of the Kis-Balaton Water Protection System (KBWPS) region of Hungary. The region is an important ecological system, as reflected by its inclusion in the list of “Wetlands of International Importance as Waterfowl Habitat” established by the Ramsar Convention.
ARIMA models have been developed to forecast the trends in the temporal evolution of dissolved oxygen saturation, pH, redox potential, electrical conductivity, and water turbidity. The models were established using data from online physico-chemical sensors, and various model structures were tested and optimized using the corrected Akaike information criterion (AICc). Forecasting performance was assessed over a time interval of 30 days and evaluated using the mean squared error metric.
The procedure was showcased for the observed variables, and the developed models provided good predictive capabilities for forecasting future trends in water quality, even over the full 30-day forecast period.
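The structure selection step can be sketched as follows, using the standard small-sample correction AICc = AIC + 2k(k + 1)/(n − k − 1) with AIC = 2k − 2 ln L. The candidate log-likelihoods and parameter counts below are hypothetical placeholders, not values from the study, and the helper names are illustrative.

```python
def aicc(log_likelihood, n_params, n_obs):
    """Corrected Akaike information criterion for small samples."""
    if n_obs - n_params - 1 <= 0:
        raise ValueError("too few observations for this many parameters")
    aic = 2 * n_params - 2 * log_likelihood
    return aic + (2 * n_params * (n_params + 1)) / (n_obs - n_params - 1)


def select_structure(candidates, n_obs):
    """Pick the (p, q) structure with the lowest AICc.

    candidates maps (p, q) -> (log_likelihood, n_params).
    """
    return min(candidates, key=lambda s: aicc(*candidates[s], n_obs))


# Hypothetical fits; in practice the log-likelihoods come from model estimation.
fits = {
    (1, 0): (-105.0, 3),
    (2, 1): (-100.0, 5),
    (9, 3): (-99.5, 14),
}
print(select_structure(fits, n_obs=273))
```

The penalty term grows with the number of parameters, so a larger structure is preferred only when it improves the likelihood enough to offset its added complexity.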
The measured pH values lie between 7.8 and 8.8, which aligns well with the 30-day forecast of the expected pH level. This range is consistent with the geological background of the catchment area and is considered acceptable. In the case of the redox potential, the deviation is larger but still within acceptable limits. Compared with the measured values of dissolved oxygen, oxygen saturation (65–70%), and conductivity (650–700 µS/cm), the forecast results provide a good indication of the water body’s quality, which can be a valuable result in water management. The comparison between the measured turbidity values (20–800 NTU) and the forecast range shows that turbidity is significantly influenced by hydrological variables, such as precipitation and wind speed. Using the resulting models, trends of the oxygen saturation, pH level, electrical conductivity, and redox potential of the water could be accurately forecast (validation data residual standard deviation between 0.09 and 20.8), while in the case of turbidity, only averages of future values could be predicted (validation data residual standard deviation of 56.3).
The authors also propose the application of the models for anomaly detection in the water source or for the identification of sensor errors, which is crucial for the preservation of the Kis-Balaton wetland.
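One simple way such anomaly flagging could work is to compare each new measurement with a band around the forecast. The sketch below is an assumption-laden illustration, not the authors' procedure: it uses a constant band of z residual standard deviations (z = 1.96, roughly a 95% band if residuals are normally distributed), and the sample values are invented.

```python
def flag_anomalies(measured, forecast, residual_std, z=1.96):
    """Flag measurements that fall outside forecast +/- z * residual_std.

    Assumptions: constant band width over the horizon and roughly normal
    residuals; a fitted model would normally supply widening intervals.
    """
    if len(measured) != len(forecast):
        raise ValueError("series must have equal length")
    limit = z * residual_std
    return [abs(m - f) > limit for m, f in zip(measured, forecast)]


# Invented pH-like readings checked against a flat forecast of 8.3.
print(flag_anomalies([8.3, 9.0, 8.2], [8.3, 8.3, 8.3], residual_std=0.09))
```

A flagged point may reflect a genuine change in the water body or a sensor fault, so in practice each flag would still require expert review.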
Additionally, as a future prospect, the authors propose the use of seasonal ARIMA or seasonal ARIMAX models once more online measurement data become available, so that both seasonal trends and external hydrological variables (precipitation, water flux, etc.) can be included in the models for more accurate forecasting.

5. Conclusions

While the AR(I)MA models provide useful tools for the forecasting of general trends in the physico-chemical parameters of water, their accuracy and forecasting window may yet be improved. The acquisition of new data to account for annually repeating seasonal patterns and the inclusion of hydrological variables as exogenous inputs may greatly increase both the forecasting window and the accuracy of the models.
In general, forecasting limits should be set by experts in the field who wish to use the model to predict future water quality and who can judge what level of prediction inaccuracy is still acceptable.
It should also be noted that certain nonlinear trends cannot be captured by the linear AR(I)MA models and their derivative techniques. To account for these phenomena, nonlinear regression techniques such as nonlinear support vector machines or neural networks should be used.

Author Contributions

Conceptualization, Z.K. and B.L.T.; methodology, Z.K. and B.L.T.; statistical analysis, C.J.J. and B.L.T.; investigation, Z.K., P.T., B.L.T., C.J.J. and S.N.; resources, Z.K. and B.L.T.; writing—original draft preparation, P.T., B.L.T. and C.J.J.; writing—review and editing, Z.K., P.T., B.L.T. and S.N.; visualization, P.T. and B.L.T.; supervision, Z.K., S.N. and A.S.; funding acquisition, Z.K. All authors have read and agreed to the published version of the manuscript.

Funding

The research presented in the article was carried out within the framework of the Széchenyi Plan Plus program with the support of the RRF-2.3.1-21-2022-00008 project.

Data Availability Statement

The data presented in this study are available on request from the first author, Zsófia Kovács (kovacs.zsofia@mk.uni-pannon.hu), due to legal reasons.

Acknowledgments

We thank the Western Transdanubia Water Directorate (Hungary) for providing the sampling sites, and special thanks to all colleagues from Széchenyi Plan Plus that made this research possible.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Vanham, D.; Hoekstra, A.Y.; Wada, Y.; Bouraoui, F.; De Roo, A.; Mekonnen, M.M.; Van De Bund, W.; Batelaan, O.; Pavelic, P.; Bastiaanssen, W.G.; et al. Physical water scarcity metrics for monitoring progress towards SDG target 6.4: An evaluation of indicator 6.4. 2 “Level of water stress”. Sci. Total Environ. 2018, 613, 218–232. [Google Scholar] [CrossRef] [PubMed]
  2. Garaba, S.P.; Zielinski, O. An assessment of water quality monitoring tools in an estuarine system. Remote Sens. Appl. Soc. Environ. 2015, 2, 1–10. [Google Scholar] [CrossRef]
  3. Ejigu, M.T. Overview of water quality modeling. Cogent Eng. 2021, 8, 1891711. [Google Scholar] [CrossRef]
  4. Liu, J.; Wang, P.; Jiang, D.; Nan, J.; Zhu, W. An integrated data-driven framework for surface water quality anomaly detection and early warning. J. Clean. Prod. 2020, 251, 119145. [Google Scholar] [CrossRef]
  5. Jaddi, N.S.; Abdullah, S. A cooperative-competitive master-slave global-best harmony search for ANN optimization and water-quality prediction. Appl. Soft Comput. 2017, 51, 209–224. [Google Scholar] [CrossRef]
  6. Liu, S.; Tai, H.; Ding, Q.; Li, D.; Xu, L.; Wei, Y. A hybrid approach of support vector regression with genetic algorithm optimization for aquaculture water quality prediction. Math. Comput. Model. 2013, 58, 458–465. [Google Scholar] [CrossRef]
  7. Ghaemi, E.; Tabesh, M.; Nazif, S. Improving the ARIMA Model Prediction for Water Quality Parameters of Urban Water Distribution Networks (Case Study: CANARY Dataset). Int. J. Environ. Res. 2022, 16, 98. [Google Scholar] [CrossRef]
  8. Deng, W.; Wang, G.; Zhang, X.; Guo, Y.; Li, G. Water quality prediction based on a novel hybrid model of ARIMA and RBF neural network. In Proceedings of the 2014 IEEE 3rd International Conference on Cloud Computing and Intelligence Systems, Shenzhen & Hong Kong, China, 27–29 November 2014; IEEE: New York, NY, USA, 2014; pp. 33–40. [Google Scholar]
  9. Bi, J.; Lin, Y.; Dong, Q.; Yuan, H.; Zhou, M. Large-scale water quality prediction with integrated deep neural network. Inf. Sci. 2021, 571, 191–205. [Google Scholar] [CrossRef]
  10. Chen, Y.; Song, L.; Liu, Y.; Yang, L.; Li, D. A review of the artificial neural network models for water quality prediction. Appl. Sci. 2020, 10, 5776. [Google Scholar] [CrossRef]
  11. Kaur, J.; Parmar, K.S.; Singh, S. Autoregressive models in environmental forecasting time series: A theoretical and application review. Environ. Sci. Pollut. Res. 2023, 30, 19617–19641. [Google Scholar] [CrossRef] [PubMed]
  12. Wang, X.; Tian, W.; Liao, Z. Statistical comparison between SARIMA and ANN’s performance for surface water quality time series prediction. Environ. Sci. Pollut. Res. 2021, 28, 33531–33544. [Google Scholar] [CrossRef]
  13. Zhou, S.; Song, C.; Zhang, J.; Chang, W.; Hou, W.; Yang, L. A hybrid prediction framework for water quality with integrated W-ARIMA-GRU and LightGBM methods. Water 2022, 14, 1322. [Google Scholar] [CrossRef]
  14. Xu, R.; Xiong, Q.; Yi, H.; Wu, C.; Ye, J. Research on water quality prediction based on SARIMA-LSTM: A case study of Beilun Estuary. In Proceedings of the 2019 IEEE 21st International Conference on High Performance Computing and Communications; IEEE 17th International Conference on Smart City; IEEE 5th International Conference on Data Science and Systems (HPCC/SmartCity/DSS), Zhangjiajie, China, 10–12 August 2019; IEEE: New York, NY, USA, 2019; pp. 2183–2188. [Google Scholar]
  15. Faruk, D.Ö. A hybrid neural network and ARIMA model for water quality time series prediction. Eng. Appl. Artif. Intell. 2010, 23, 586–594. [Google Scholar] [CrossRef]
  16. Lola, M.S.; Zainuddin, N.H.; Abdullah, M.; Ponniah, V.; Ramlee, M.; Zakariya, R.; Idris, M.; Khalili, I. Improving the performance of ann-arima models for predicting water quality in the offshore area of kuala terengganu, terengganu, malaysia. J. Sustain. Sci. Manag. 2018, 13, 27–37. [Google Scholar]
  17. Qie, J.; Yuan, J.; Wang, G.; Zhang, X.; Zhou, B.; Deng, W. Water Quality Prediction Based on an Improved ARIMA-RBF Model Facilitated by Remote Sensing Applications. In Proceedings of the Rough Sets and Knowledge Technology: 10th International Conference, RSKT 2015, Held as Part of the International Joint Conference on Rough Sets, IJCRS 2015, Tianjin, China, 20–23 November 2015; Proceedings 10. Springer: Berlin/Heidelberg, Germany, 2015; pp. 470–481. [Google Scholar]
  18. Dastorani, M.; Mirzavand, M.; Dastorani, M.T.; Khosravi, H. Simulation and prediction of surface water quality using stochastic models. Sustain. Water Resour. Manag. 2020, 6, 74. [Google Scholar] [CrossRef]
  19. Parmar, K.S.; Bhardwaj, R. Water quality management using statistical analysis and time-series prediction model. Appl. Water Sci. 2014, 4, 425–434. [Google Scholar] [CrossRef]
  20. Elhag, M.; Gitas, I.; Othman, A.; Bahrawi, J.; Psilovikos, A.; Al-Amri, N. Time series analysis of remotely sensed water quality parameters in arid environments, Saudi Arabia. Environ. Dev. Sustain. 2021, 23, 1392–1410. [Google Scholar] [CrossRef]
  21. Katimon, A.; Shahid, S.; Mohsenipour, M. Modeling water quality and hydrological variables using ARIMA: A case study of Johor River, Malaysia. Sustain. Water Resour. Manag. 2018, 4, 991–998. [Google Scholar] [CrossRef]
  22. Stroud, D.A.; Davidson, N.C. Fifty years of criteria development for selecting wetlands of international importance. Mar. Freshw. Res. 2021, 73, 1134–1148. [Google Scholar] [CrossRef]
  23. Farkas, M.; Kaszab, E.; Radó, J.; Háhn, J.; Tóth, G.; Harkai, P.; Ferincz, Á.; Lovász, Z.; Táncsics, A.; Vörös, L.; et al. Planktonic and benthic bacterial communities of the largest central European shallow lake, Lake Balaton and its main inflow Zala River. Curr. Microbiol. 2020, 77, 4016–4028. [Google Scholar] [CrossRef] [PubMed]
  24. Kovács, J.; Korponai, J.; Kovács, I.S.; Hatvani, I.G. Introducing sampling frequency estimation using variograms in water research with the example of nutrient loads in the Kis-Balaton Water Protection System (W Hungary). Ecol. Eng. 2012, 42, 237–243. [Google Scholar] [CrossRef]
  25. Rédey, Á.; Husvéth, F.; Kovács, Z.; Utasi, A.; Domokos, E. Relation Between Global Environmental Issues and Surface Water Quality. Egypt. J. Phycol. 2010, 11, 121–129. [Google Scholar] [CrossRef]
  26. Rizk, R.; Juzsakova, T.; Cretescu, I.; Rawash, M.; Sebestyén, V.; Le Phuoc, C.; Kovács, Z.; Domokos, E.; Rédey, Á.; Shafik, H. Environmental assessment of physical-chemical features of Lake Nasser, Egypt. Environ. Sci. Pollut. Res. 2020, 27, 20136–20148. [Google Scholar] [CrossRef]
  27. Honti, M.; Gao, C.; Istvánovics, V.; Clement, A. Lessons learnt from the long-term management of a large (re) constructed wetland, the Kis-Balaton protection system (Hungary). Water 2020, 12, 659. [Google Scholar] [CrossRef]
  28. Rostási, Á.; Rácz, K.; Fodor, M.A.; Topa, B.; Molnár, Z.; Weiszburg, T.G.; Pósfai, M. Pathways of carbonate sediment accumulation in a large, shallow lake. Front. Earth Sci. 2022, 10, 1067105. [Google Scholar] [CrossRef]
  29. No, G.D. 10/2010. (VIII. 18.) of Ministry of Rural Development (VM) on Defining the Rules for Establishment and Use of Water Pollution Limits of Surface Water. Available online: https://net.jogtar.hu/jogszabaly?docid=A1000010.VM&searchUrl=/gyorskereso?keyword%3D10/2010 (accessed on 8 August 2024).
  30. Shumway, R.H.; Stoffer, D.S.; Shumway, R.H.; Stoffer, D.S. ARIMA models. In Time Series Analysis and Its Applications: With R Examples; Springer: Cham, Switzerland, 2017; pp. 75–163. [Google Scholar]
  31. Stellwagen, E.; Tashman, L. ARIMA: The Models of Box and Jenkins. Foresight Int. J. Appl. Forecast. 2013, 30, 28–33. [Google Scholar]
  32. Cheung, Y.W.; Lai, K.S. Lag order and critical values of the augmented Dickey–Fuller test. J. Bus. Econ. Stat. 1995, 13, 277–280. [Google Scholar]
  33. KosicKA, E.; KozłowsKi, E.; MAzurKiEwicz, D. The use of stationary tests for analysis of monitored residual processes. Eksploat. Niezawodn. 2015, 17, 604–609. [Google Scholar] [CrossRef]
  34. Sakamoto, Y.; Ishiguro, M.; Kitagawa, G. Akaike Information Criterion Statistics; D. Reidel: Dordrecht, The Netherlands, 1986; Volume 81, p. 26853. [Google Scholar]
  35. Hassani, H.; Yeganegi, M.R. Selecting optimal lag order in Ljung–Box test. Phys. A Stat. Mech. Its Appl. 2020, 541, 123700. [Google Scholar] [CrossRef]
  36. Cerqueira, V.; Torgo, L.; Mozetič, I. Evaluating time series forecasting models: An empirical study on performance estimation methods. Mach. Learn. 2020, 109, 1997–2028. [Google Scholar] [CrossRef]
  37. Singh, K.; Upadhyaya, S. Outlier detection: Applications and techniques. Int. J. Comput. Sci. Issues (IJCSI) 2012, 9, 307. [Google Scholar]
Figure 1. The modular continuous monitoring stations on the Kis-Balaton Water Protection System.
Figure 2. Evolution of water temperature.
Figure 3. Evolution of dissolved oxygen concentration and saturation within the water.
Figure 4. Evolution of water pH.
Figure 5. Evolution of water redox potential.
Figure 6. Evolution of water electrical conductivity.
Figure 7. Evolution of water turbidity.
Figure 8. Box plot of mean and standard deviation of water quality parameters.
Figure 9. Original test data set (left) and its ACF (right).
Figure 10. AICc scores (left) and MSE (right) of validation data of different model structures.
Figure 11. ACF (left) and distribution (right) of the ARMA (9,3) model residuals of the training data.
Figure 12. ARMA (9,3) model forecasting capabilities tested on the validation data set for dissolved oxygen saturation values.
Figure 13. ARMA (5,5) model forecasting capabilities tested on the validation data set for electrical conductivity values.
Figure 14. ARMA (9,8) model forecasting capabilities tested on the validation data set for pH values.
Figure 15. ARIMA (6,1,7) model forecasting capabilities tested on the validation data set for redox potential values.
Figure 16. ARMA (5,7) model forecasting capabilities tested on the validation data set for turbidity values.
Table 1. Optimized model structures for the prediction of water quality parameters.
| Water Quality Parameter | Model Structure | AICc | Ljung–Box Q Test (0.05 Significance Level) |
| --- | --- | --- | --- |
| Dissolved oxygen saturation | ARMA (9,3) | 3718 | Approved |
| pH | ARMA (9,8) | −774 | Approved |
| Redox potential | ARIMA (6,1,7) | 4087 | Approved |
| Electrical conductivity | ARMA (5,5) | 5733 | Approved |
| Turbidity | ARMA (5,7) | 6940 | Approved |
Table 2. Mean and standard deviation of model residuals for the fitted variables.
| Water Quality Parameter | Mean Residual | Standard Deviation |
| --- | --- | --- |
| Dissolved oxygen saturation [%] | 2.3 | 17.6 |
| pH [-] | −0.17 | 0.09 |
| Redox potential [mV] | −10.8 | 8.2 |
| Electrical conductivity [µS/cm] | 15.3 | 20.8 |
| Turbidity [NTU] | 12.8 | 56.3 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Kovács, Z.; Tarcsay, B.L.; Tóth, P.; Juhász, C.J.; Németh, S.; Shahrokhi, A. Prediction of Physico-Chemical Parameters of Surface Waters Using Autoregressive Moving Average Models: A Case Study of Kis-Balaton Water Protection System, Hungary. Water 2024, 16, 2314. https://doi.org/10.3390/w16162314