Machine Learning-Based Water Level Prediction in Lake Erie

Abstract: Predicting the water level of Lake Erie is important for water resource management and navigation, since water level significantly impacts cargo transport options as well as choices of recreational activities. In this paper, machine learning (ML) algorithms, including Gaussian process (GP), multiple linear regression (MLR), multilayer perceptron (MLP), M5P model tree, random forest (RF), and k-nearest neighbor (KNN), are applied to predict the water level of Lake Erie. From 2002 to 2014, meteorological data and the one-day-ahead observed water level are the independent variables, and the daily water level is the dependent variable. The predictive results show that MLR and M5P have the highest accuracy in terms of root mean square error (RMSE) and mean absolute error (MAE). The performance of the ML models is also compared against that of the process-based advanced hydrologic prediction system (AHPS), and the results indicate that the ML models are superior to AHPS in predictive accuracy. Together with their time-saving advantage, this study shows that ML models, especially MLR and M5P, can be used for forecasting Lake Erie water levels and informing future water resources management.


Introduction
Water level plays an important part in communities' well-being and economic livelihoods. For example, water level changes can impact physical processes in lakes, such as circulation, resulting in changes in water mixing and bottom sediment resuspension, which can further affect water quality and aquatic ecosystems [1,2]. Hence, water level prediction has attracted increasing attention [3,4]. For example, the International Joint Commission (IJC) suggests that more effort should be devoted to improving methods of monitoring and predicting water level [5].
Water-level change is a complex hydrological phenomenon because of its various controlling factors, including meteorological conditions as well as water exchange between the lake and its watersheds [6,7]. Thus, many tools that forecast water levels while considering these influencing factors have been developed, such as process-based models [8]. For example, Gronewold et al. showed that the advanced hydrologic prediction system (AHPS) can capture seasonal and inter-annual patterns of the Lake Erie water level from 1997 to 2009 [9]. However, the effectiveness of process-based models mainly depends on how accurately they represent the aquatic conditions and how well they describe the variability in the observations [10,11]. In addition, process-based models are often time-consuming [12], so numerous studies have proposed statistical models to predict water level, e.g., the autoregressive integrated moving average model [13], artificial neural networks [14], genetic programming [15], and support vector machines [16]. These studies mainly focused on leveraging historical water levels without considering the physical processes driven by meteorological conditions [6,17,18].
In this paper, six machine learning (ML) methodologies [19], with fast learning speed, are applied to predict the daily Lake Erie water level, and their predictive performance is then compared to that of a process-based model (i.e., AHPS). The ML models are established by considering not only the impacts of historical water-level changes but also meteorological factors. Overall, this work makes three novel contributions. First, it is the first work to take account of both historical water level and meteorological conditions in water level prediction for Lake Erie. Second, it is the first study to apply various ML models to forecast the Lake Erie water level and compare their performance. Third, it is the first work to compare the predictive performance of ML models against the process-based model AHPS.
The remainder of this paper is divided into four sections. Section 1 introduces the experimental area of this study. Section 2 describes the various ML algorithms and the model performance assessment metrics. Section 3 presents the independent variable selection and the performance comparison results for the ML models. Section 4 discusses the forecasting results of the various ML models and the performance comparison between the ML models and the process-based prediction system AHPS (Figure 1).

Study Area
Lake Erie is the southernmost, shallowest, and smallest part of the Great Lakes System; its mean depth, surface area, and volume are 18.9 m, 25,657 km², and 484 km³ [20]. It has three distinct basins, the western, central, and eastern basins, with average depths of 7.4 m, 19 m, and 28.5 m, respectively [21,22]. The western basin is separated from the central basin by the islands extending from Point Pelee in Ontario to Marblehead in Ohio, and the eastern basin is separated from the central basin by the Pennsylvania Ridge extending from Long Point to Erie, Pennsylvania [21]. There are five major inflows (the Detroit, Maumee, Sandusky, Cuyahoga, and Grand Rivers) and one outflow (the Niagara River) for Lake Erie. The water retention time of Lake Erie is 2.76 years [20,23]. Figure 2 is generated based on GLERL (Great Lakes Environmental Research Laboratory) data; the identifiers represent the meteorological stations (red stars) and water level measuring stations (black circles).


Data Source
Following existing studies, e.g., Bucak et al. (2018), who used various meteorological variables to simulate the water level of Lake Beysehir in Turkey, this study also considers meteorological variables including precipitation, air temperature, shortwave radiation, longwave radiation, wind speed, and relative humidity. The average of the daily measured water level data at four stations, Toledo, Port Stanley, Buffalo, and Cleveland, is regarded as the observed daily water level of Lake Erie (Table 1). The independent variables are selected according to the degree of relevance between the candidate meteorological variables (daily, one-day-, two-day-, and three-day-ahead observed average, minimum, and maximum values of air temperature, shortwave radiation, longwave radiation, wind speed, and relative humidity; daily, one-day-, two-day-, and three-day-ahead observed average precipitation values) (Table 2) and the water level. The correlation function, which measures the linear dependence between two variables, has typically been applied in previous studies [17,24]; however, when the relationship is not linear, this method is not appropriate. In this study, we therefore use mutual information (MI) instead of the correlation function, since MI can measure the general dependence between two variables [25].
MI, i.e., Equation (1), measures the degree of relevance between two variables X and Y according to their joint probability distribution function p(x, y) [26]:

I(X; Y) = Σx Σy p(x, y) log( p(x, y) / (p(x) p(y)) ).    (1)


Machine Learning Algorithms
This study adopts six widely used ML methods, including Gaussian process, multiple linear regression, multilayer perceptron, M5P model tree, random forest, and k-nearest neighbors.

Gaussian Process
Gaussian process (GP) [27] can be applied to two categories of problems: (1) regression, where the data are continuous and optimization can provide a close prediction most of the time; (2) classification, where the datasets are discrete and the predictions end up in a discrete set of classes [28]. In this work, we adopt the same GP algorithm from existing studies [27-29]. Taking account of the noise in the observation y, the Gaussian process model is

y = f(x) + ε,

where x and y represent the input and output, respectively; f follows a multi-dimensional Gaussian distribution determined by m(x) (mean function) and k(x, x') (covariance function); x and x' are the input values at two points; and ε ~ N(0, σ²) is independent white Gaussian noise. To make choosing an appropriate noise level easier, this study applies normalization to the target attribute as well as the other attributes; if the data are scaled properly, the mean can be taken as zero. A function producing a positive semi-definite covariance matrix with elements [K]i,j = k(xi, xj) is applied as the covariance function, so f is expressed as p(f|x) = N(0, K). Based on the previous noise assumption, p(y|f) = N(f, σ²I) is obtained, where I is the identity matrix.

The prediction y* is the output corresponding to a new input x*:

μ* = k*ᵀ (K + σ²I)⁻¹ y,
σ²y* = k** − k*ᵀ (K + σ²I)⁻¹ k* + σ²,

where k* is obtained from x and x* through [k*]i = k(xi, x*), and k** is obtained from x* through k** = k(x*, x*); μ* and σ²y* represent the predictive expectation and variance. The Gaussian covariance function is

k(xi, xj) = υ exp( −(1/2) Σd ωd (xi,d − xj,d)² ),

where [υ, ω1, ω2, . . . , ωm] is a parameter vector, called the hyperparameters. In the simplest form, a single width parameter ω is shared by all input dimensions:

k(xi, xj) = υ exp( −(ω/2) ||xi − xj||² ).
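The GP prediction step above can be sketched in code. This is an illustrative example only: the paper does not state its implementation, so scikit-learn is assumed here, with the RBF kernel playing the role of the Gaussian covariance function, `alpha` the noise variance σ², and all data synthetic.

```python
# Minimal GP regression sketch on synthetic data (illustrative only).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

rng = np.random.default_rng(0)
X = rng.uniform(-5, 5, size=(80, 1))             # stand-in inputs
y = np.sin(X).ravel() + rng.normal(0, 0.1, 80)   # stand-in noisy target

# ConstantKernel * RBF corresponds to upsilon * exp(-||x - x'||^2 / (2 l^2));
# alpha adds the observation-noise variance sigma^2 on the diagonal of K.
kernel = ConstantKernel(1.0) * RBF(length_scale=1.0)
gp = GaussianProcessRegressor(kernel=kernel, alpha=0.1 ** 2, normalize_y=True)
gp.fit(X, y)

X_star = np.array([[0.5]])
mu_star, sigma_star = gp.predict(X_star, return_std=True)  # mean and std
```

The hyperparameters (length scale and signal variance) are optimized by maximizing the marginal likelihood during `fit`, mirroring the hyperparameter vector [υ, ω1, . . . , ωm] above.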

Multiple Linear Regression
The multiple linear regression (MLR) method describes the linear relationship between the dependent and independent variables. In this work, we adopt the same MLR algorithm as existing studies [30-32]:

Yi = β0 + β1 X1,i + β2 X2,i + . . . + βk Xk,i + εi,

where X1,i, X2,i, . . . , Xk,i and Yi are the ith observations of the independent and dependent variables, respectively; β0, β1, . . . , βk are regression coefficients; and εi is the residual for the ith observation.
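As a minimal illustration of the regression model above, the coefficients β0, . . . , βk can be estimated by ordinary least squares; the sketch below uses scikit-learn (an assumption, not the study's stated implementation) on synthetic data with made-up coefficient values.

```python
# Minimal MLR sketch: fit Y = beta0 + beta1*X1 + ... + betak*Xk on toy data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))                     # three stand-in predictors
beta = np.array([0.5, -1.2, 2.0])                 # hypothetical coefficients
y = 174.0 + X @ beta + rng.normal(0, 0.01, 200)   # stand-in water level (m)

mlr = LinearRegression().fit(X, y)
print(mlr.intercept_)   # estimate of beta0, close to 174.0
print(mlr.coef_)        # estimates of beta1..beta3
```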

Multilayer Perceptron
An artificial neural network (ANN) mimics the functioning of the human brain, in which information is processed through interconnected neurons. In this work, we adopt the same MLP algorithm as existing studies [33-36]. The multilayer perceptron (MLP), a class of feedforward ANN, has three types of layers: input, hidden, and output. The hidden layers receive signals from the nodes of the input layer and transform them into signals that are sent to the output nodes, which produce the final outputs. The weights between adjacent layers are adjusted automatically by the backpropagation algorithm while minimizing the errors between the simulations and the observations.
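A minimal sketch of such a feedforward network trained by backpropagation, assuming scikit-learn (the study does not specify its implementation) and a toy nonlinear target; the hidden-layer size and iteration count are illustrative only.

```python
# Minimal MLP sketch: one hidden layer, backpropagation, toy nonlinear data.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(300, 2))
y = X[:, 0] ** 2 + 0.5 * X[:, 1]   # simple nonlinear input-output relationship

mlp = MLPRegressor(hidden_layer_sizes=(32,), max_iter=5000, random_state=0)
mlp.fit(X, y)                      # weights adjusted by backpropagation
score = mlp.score(X, y)            # training R^2
```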

M5P Model Tree
There is a collection of training samples (T); each sample is characterized by a fixed set of attributes (the input values) and a corresponding target (the output value). In this work, we adopt the same M5P algorithm from existing studies [37-39]. Initially, T is either associated with a leaf storing a linear regression function or split into subsets according to the outputs, and the same process is then applied recursively to the subsets.
In this process, choosing appropriate attributes to split T so as to minimize the expected error at a particular node requires a splitting criterion; the standard deviation reduction (SDR) is used as the expected error reduction, calculated by Equation (11):

SDR = sd(T) − Σi (|Ti| / |T|) × sd(Ti),    (11)

where T is the set of samples that reaches the node and Ti are the subsets resulting from splitting the node on the chosen attribute. The model finally chooses the split that maximizes the expected error reduction.
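The standard deviation reduction criterion can be sketched directly from its definition; the `sdr` helper and the toy targets below are illustrative, not part of the original study.

```python
# Minimal sketch of the SDR splitting criterion:
# SDR = sd(T) - sum_i |T_i|/|T| * sd(T_i) for one candidate split.
import numpy as np

def sdr(T, subsets):
    """Expected error reduction for splitting targets T into the given subsets."""
    T = np.asarray(T, dtype=float)
    return T.std() - sum(len(s) / len(T) * np.asarray(s, dtype=float).std()
                         for s in subsets)

T = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8]
# A split separating the two clusters removes almost all spread:
good = sdr(T, [T[:3], T[3:]])
# An interleaved split leaves both subsets spread out:
bad = sdr(T, [[1.0, 5.0, 0.9], [1.1, 5.2, 4.8]])
```

The tree builder evaluates SDR for each candidate attribute and threshold and keeps the split with the largest reduction.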

Random Forest
In this work, we adopt the same random forest (RF) algorithm from existing studies [40,41]. Unlike a single random tree (RT), RF does not depend on one choice of features; after training, it can identify the most important features among all of them. It combines many binary decision trees, each of which learns from a random sample of the observations without pruning. In RF, bootstrap samples are drawn to build the trees, which means that some samples may be used multiple times within a single tree, and each tree grows with a randomized subset of the total set of features. Finally, the output is the average of the individual trees' predictions. Because RF contains many trees and can handle large amounts of data, its generalization error is limited, which helps avoid overfitting.
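A minimal sketch of these ingredients, assuming scikit-learn (not the study's stated implementation): bootstrap-sampled trees, randomized feature subsets, averaged predictions, and the feature importances mentioned above, on synthetic data where only one feature carries signal.

```python
# Minimal random forest sketch on synthetic data (illustrative only).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 5))
y = 3.0 * X[:, 0] + rng.normal(0, 0.1, 400)   # only feature 0 matters

# Each tree sees a bootstrap sample and sqrt(n_features) features per split.
rf = RandomForestRegressor(n_estimators=200, max_features="sqrt",
                           random_state=0)
rf.fit(X, y)
print(rf.feature_importances_.argmax())       # -> 0, the informative feature
```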

K-Nearest Neighbor
Generally, the above machine learning methods can be regarded as eager methods: they start learning as soon as they receive the data and create a global approximation. In contrast, k-nearest neighbor (KNN) belongs to lazy learning, which creates a local approximation. In this work, we adopt the same KNN algorithm from existing studies [42-44]. It simply stores the training data without learning until it is given a test set, and computes its predictions from the current dataset instead of building a model from historical data in advance. KNN can be used for either classification or regression; for regression, the predicted value of a target is the average of the values of its k nearest neighbors.
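The averaging rule can be sketched on toy data, assuming scikit-learn (not the study's stated implementation); k = 3 is used here purely because the toy set is tiny, whereas this study uses k = 69 on the full dataset.

```python
# Minimal KNN regression sketch: the prediction is the average target
# of the k nearest training neighbors.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.array([0.0, 1.0, 2.0, 3.0, 4.0])

knn = KNeighborsRegressor(n_neighbors=3)
knn.fit(X, y)                 # "lazy": fit only stores the data
pred = knn.predict([[2.1]])   # neighbors are x = 2, 3, 1 -> mean target = 2.0
```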

Model Performance Evaluation
The outputs of each ML model are daily water levels of Lake Erie. These predictions are compared against the observations, i.e., the average measured water levels at the four stations mentioned above. The selected model performance evaluation criteria include root mean square error (RMSE), mean absolute error (MAE), correlation coefficient (r), and the mutual information (MI) mentioned in Section 2.2.
RMSE measures the residuals between the predictions and the observations; a smaller RMSE value represents a better fit between observations and predictions:

RMSE = sqrt( (1/n) Σi (Oi − Pi)² ).

The mean absolute error (MAE) can also be applied to assess the accuracy of fitting time-series data; similar to RMSE, a smaller MAE value represents a better fit:

MAE = (1/n) Σi |Oi − Pi|.

The correlation coefficient (r) describes the strength of the linear relationship between observations and predictions and ranges from −1 to 1; the closer to 0, the weaker the linear relationship. A value of zero represents no linear relationship, −1 a strong negative linear relationship, and 1 a strong positive linear relationship:

r = Σi (Oi − Ō)(Pi − P̄) / sqrt( Σi (Oi − Ō)² × Σi (Pi − P̄)² ),

where Oi and Pi represent the observations and predictions at the ith time step, Ō and P̄ are the means of the observations and predictions, and n is the total number of time steps. RMSE and MAE describe the overall accuracy of the models, while r and MI describe the agreement between the observations and the predictions [19].
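These metrics can be sketched directly from their definitions; the observation/prediction series below are toy values, not data from this study.

```python
# Minimal sketch of the evaluation metrics RMSE, MAE, and r.
import numpy as np

def rmse(o, p):
    o, p = np.asarray(o), np.asarray(p)
    return float(np.sqrt(np.mean((o - p) ** 2)))

def mae(o, p):
    o, p = np.asarray(o), np.asarray(p)
    return float(np.mean(np.abs(o - p)))

def corr(o, p):
    return float(np.corrcoef(o, p)[0, 1])   # Pearson r

obs = [174.0, 174.2, 174.5, 174.3]           # toy observed levels (m)
pred = [174.1, 174.2, 174.4, 174.3]          # toy predicted levels (m)
```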

Input Selection
An MI value of zero represents no relatedness between two variables, and a larger value represents stronger relatedness [45]. All meteorological variables except the one-day-, two-day-, and three-day-ahead observed minimum shortwave radiation (15×, 31×, and 47×) show strong relevance to the water level (Figure 3), so this study uses the daily, one-day-, two-day-, and three-day-ahead observed average, minimum, and maximum values of air temperature, longwave radiation, relative humidity, and wind speed; the daily, one-day-, two-day-, and three-day-ahead observed average and maximum values of shortwave radiation; and the daily, one-day-, two-day-, and three-day-ahead observed average values of precipitation as independent variables. The one-day-ahead observed water level is also included as an independent variable since it can also impact the water level change of the next day [46].
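MI-based screening of this kind can be sketched as follows. This is an assumption-laden illustration: scikit-learn's nearest-neighbor MI estimator, synthetic stand-in variables, and the 0.1 threshold are all hypothetical, not taken from the study.

```python
# Minimal sketch of MI-based input screening: candidate predictors with
# near-zero MI against the water level are dropped.
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(4)
n = 500
relevant = rng.normal(size=n)              # stand-in for, e.g., air temperature
irrelevant = rng.normal(size=n)            # stand-in for an unrelated variable
level = relevant + rng.normal(0, 0.1, n)   # stand-in water level

X = np.column_stack([relevant, irrelevant])
mi = mutual_info_regression(X, level, random_state=0)
keep = [i for i in range(X.shape[1]) if mi[i] > 0.1]   # hypothetical cutoff
```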
Following a previous study [47], we split the whole dataset from 2002 to 2014 into two sections: the training set (approximately 84% of the observations, from 2002 to 2012), used to establish the models and adjust their weights, and the testing set (approximately 16% of the observations, from 2013 to 2014), used to assess model performance. Table 3 shows the performance of the ML models in predicting water level in terms of RMSE, MAE, r, and MI.
MLR and M5P provide the most reliable predictions of water level during the testing period from 2013 to 2014 with RMSE and MAE values of 0.02 and 0.01, respectively.
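As a minimal illustration of the chronological split (at year granularity only; the actual split is over daily observations), the partition can be sketched as:

```python
# Chronological split sketch: earlier years train, later years test,
# with no shuffling, since the one-day-ahead water level is a predictor.
years = list(range(2002, 2015))
train_years = [y for y in years if y <= 2012]
test_years = [y for y in years if y >= 2013]
train_fraction = len(train_years) / len(years)   # ~0.85 by year count
```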

Model Performance Comparison
The RMSE values of the ML models in this study are comparable to those in previous studies on water-level prediction. For example, Khan et al. (2006) reported RMSE values over forecast horizons of 1 to 12 months of 0.057-0.239, 0.085-0.246, and 0.068-0.304 when applying SVM, MLP, and SAR models to Lake Erie [16].
Another study predicted groundwater levels in the coastal aquifer in Donghae City, Korea, at two wells with average RMSE values of 0.13 and 0.136 by applying ANN and SVM (support vector machine) models, respectively [49].
The predicted water levels agree with the observations qualitatively (Figure 4). Because of the long testing period (2013-2014), we divide each year into three periods to discuss the predictions in detail. In Figure 5a,b, from January to March in 2013 and 2014, owing to low precipitation frequency, the water levels are relatively stable and at their lowest for the whole year; beginning in April, the water level rises with frequent rainfall, and both observations and predictions show this increasing trend. From May to August (Figure 5c,d), the water level continues increasing to reach a peak in July in both 2013 and 2014, showing the significant effect of precipitation on water-level changes. From September to December, as precipitation decreases, the water level declines. Figure 6 shows that the predicted water levels also agree with the observations quantitatively.
The slopes are close to 1, indicating that the predictions from the ML models are strongly correlated with the observed water levels, even though small differences remain. The dots in the KNN comparison figure appear scattered, showing that its predictions fall both below and above the observations. Among the six models, MLR and M5P perform better than the others, with only a 0.003% difference in capturing the peak values. The KNN model underestimates the peaks in 2013 and 2014 (Figure 4) and overestimates the lowest level in winter 2014, which may be due to the choice of k. Small values of k can be noisy and degrade the predictions, while large values lead to smooth decision boundaries, resulting in low variance but increased bias and expensive computation. In this study, k is set to 69, the square root of the total number of samples, following a previous study [50].


Comparison between ML Models and AHPS
An advanced hydrologic prediction system (AHPS) is a web-based suite of forecast products that has been widely used to predict the water levels and hydrometeorological variables of the Laurentian Great Lakes [51]. In AHPS, spatial meteorological forcing data obtained from the National Climatic Data Center (NCDC) for sub-basins and over-lake areas are averaged by Thiessen Polygon Software; the results serve as inputs to the lake thermodynamics model, which calculates heat storage, and to the large basin runoff model, which calculates moisture storage, and the net basin supply is then calculated from the water balance. Finally, the net basin supplies are translated into water levels and joint flows by the lake routing and regulation model [52]. Gronewold et al. predicted the water level of Lake Erie using AHPS: they compared 3-month and 6-month mean predicted water levels to the monthly averaged water level observations over 13 years (1997-2009) and assessed AHPS predictive performance by forecasting bias (the average difference between the median value of the predictions and the monthly averaged observations) [9]. In this part, we also compare the predictive performance of the ML models and AHPS in terms of forecasting bias.
All absolute values of forecasting bias for the ML models during the testing period (2013-2014) are smaller than those of AHPS from 1997 to 2009 (Table 4), showing that the predicted median water levels of the ML models are much closer to the monthly averaged observations than those of AHPS, even though both AHPS and five of the ML models (MLR, MLP, M5P, RF, and KNN) underestimate the observations with negative forecasting bias. Table 5 lists the time consumed by the ML models during the training and testing periods; all take less than one minute except the GP model. This is much faster than AHPS. Thus, ML models are not only more accurate than the process-based AHPS in forecasting water level but also save computational time and expense.

Impact of Training and Testing Data Selection
In this study, following existing studies [18,53], we used 84% of the total dataset (years 2002-2012) as the training data and the remaining 16% (years 2013-2014) as the testing data. The performance of ML models can vary with the size of the training data. To explore the impact of training and testing data selection, we further examined these ML models with different sizes of training data. Specifically, we built ML-based prediction models with different training data and tested their performance on the same testing data (years 2013-2014). The detailed results are shown in Tables 6-11 (training data from years 2008-2012, 2007-2012, 2006-2012, 2005-2012, 2004-2012, and 2003-2012, respectively). We observed consistent performance of these ML models across the different training data, i.e., MLR and M5P are the best ML models for predicting the water level of Lake Erie.
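The experimental loop described above can be sketched as follows; `fit_and_score` is a hypothetical callback standing in for model fitting and RMSE evaluation, and `data` is an assumed list of per-day records, neither taken from the study's actual code.

```python
# Illustrative sketch of the training-window experiment: refit on
# progressively longer windows (2008-2012 down to 2003-2012) and evaluate
# each model on the fixed 2013-2014 test set.
def run_window_experiment(data, models, fit_and_score):
    results = {}
    for start_year in range(2008, 2002, -1):   # 2008, 2007, ..., 2003
        train = [row for row in data if start_year <= row["year"] <= 2012]
        test = [row for row in data if row["year"] >= 2013]
        results[start_year] = {name: fit_and_score(model, train, test)
                               for name, model in models.items()}
    return results

# Toy usage: count training rows per window instead of fitting real models.
toy_data = [{"year": y} for y in range(2002, 2015)]
res = run_window_experiment(toy_data, {"dummy": None},
                            lambda model, train, test: len(train))
```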

Conclusions
Considering multiple independent variables, including meteorological data and the one-day-ahead observed water level, this study developed multiple ML models to forecast the Lake Erie water level and compared their performance by RMSE and MAE. Among all ML models, MLR and M5P best captured the variations of the water level, with the smallest RMSE and MAE of 0.02 and 0.01, respectively. Furthermore, this paper also compared the performance of the process-based AHPS and the ML models, showing that the ML models predict water level more accurately than AHPS. Given their high accuracy and low computational cost, ML methods could be used to inform future water resources management in Lake Erie.