A Predictive Model of Chlorophyll a in Western Lake Erie Based on Artiﬁcial Neural Network

: The reoccurrence of algal blooms in western Lake Erie (WLE) since the mid-1990s, under increased system stress from climate change and excessive nutrients, has shown the need for developing management tools to predict water quality. In this study, process-based model GLM-AED (General Lake Model-Aquatic Ecosystem Dynamics) and statistical model ANN (artiﬁcial neural network) were developed with meteorological forcing derived from surface buoys, airports, and land-based stations and historical monitoring nutrients, to predict water quality in WLE from 2002 to 2015. GLM-AED was calibrated with observed water temperature and chlorophyll a (Chl-a) from 2002 to 2015. For ANN, during the training period (2002–2010), the inputs included meteorological forcing and nutrient concentrations, and the target was Chl-a simulated by calibrated GLM-AED due to the lack of continuously daily measured Chl-a concentrations. During the testing period (2011–2015), the predicted Chl-a concentrations were compared with the observations. The results showed that the ANN model has higher accuracy with lower Chl-a RMSE and MAE values than GLM-AED during 2011 and 2015. Lastly, we applied the established ANN model to predict the future 10-year water quality of WLE, which showed that the probability of adverse health effects would be moderate, so more intense water resources management should be implemented.


Introduction
Eutrophication has been a serious global environmental problem in large lakes [1][2][3], including Lake Victoria in Africa, Lake Loosdrecht in Europe, and Lake Taihu in Asia. To investigate this water quality problem in these regions, different approaches have been applied including sampling field data, lab methods, and remote sensing [4][5][6]. Western Lake Erie (WLE) also has been suffering from eutrophication [7][8][9][10], which was driven by climate change and excess nutrient loads, particularly phosphorus from agriculture [11]. With the implementation of phosphorus (P) abatement programs based on the Great Lakes Water Quality Agreement (GLWQA) in 1972 and the Lake Erie total phosphorus (TP) target load of 11,000 MTA in the 1978 Amendment to the GLWQA, P loads entering into WLE reduced, resulting in a decrease in algal blooms in WLE [12,13]. However, since the mid-1990s, algal blooms have reoccurred resulted in increased spring runoff flushing more P from the Maumee River watershed into WLE [14][15][16].
In consideration of the long-term variations in algal blooms, to manage them efficiently, there is a need to develop a tool to reproduce the long-term development of algal blooms to implement better water resource management. To date, there are two commonly applied approaches to predict water quality, namely, process-based [17] and data-driven [18] models. Process-based models are based on known principles and theories of physics, biochemistry, and ecology with solving a series of mathematical equations, such as onedimensional GLM-AED [19], two-dimensional CE-QUAL-W2 [20], and three-dimensional AEM3D [21]. Particularly, requiring less computational time than multi-dimensional models, one-dimensional models are suitable for long-term predictions [22]. Unlike processbased models, data-driven models are based on data mining algorithms and statistical techniques that analyze patterns amongst the dataset about a specific system to explore the relationships between the input and output data without taking account of explicit knowledge of physical, biochemical, and ecological behavior of the system [23]. Various types of data-driven models have already been applied to water research, such as unit hydrograph method, statistical models, and machine learning models [24,25]. Particularly, the artificial neural network method, one of the statistical methods, has been widely used in previous prediction cases [26,27], showing the capability of this method in analyzing biogeochemical monitoring data [28,29] and providing improved results compared to other modeling methods [30,31].
Thus far, most previous studies associated with water quality predictions in WLE used process-based models and there are no current studies that apply ANN to predict WLE water quality. This study aims to establish an ANN model to predict water quality in WLE and compare its predictive performance with process-based GLM-AED. Chl-a is a US EPA-approved water quality evaluating criterion, which is regarded as a response variable used to measure biotic productivity indicating water impairment level by nutrients [32]. Hence, in this study, we evaluated water quality in WLE by using Chl-a concentrations as criteria. Firstly, based on historical meteorological forcing and monitoring nutrient concentrations, we developed a one-dimensional model GLM-AED and then calibrated it with the observed water temperature and Chl-a concentrations from 2002 to 2015. Secondly, based on the inputs of meteorological forcing and monitored nutrient concentrations, as well as the target of Chl-a concentrations generated by calibrated GLM-AED, we trained the ANN model from 2002 to 2010. Thirdly, we tested the ANN model with the inputs of meteorological forcing and monitoring nutrient concentrations as well as the target of the observed Chl-a concentrations from 2011 to 2015, and then computed the RMSE and MAE values between the predicted and observed Chl-a. We also computed the RMSE and MAE values between modeled Chl-a from GLM-AED and the observations from 2011 to 2015, to compare the predictive performance between ANN and GLM-AED models. Finally, we applied the established ANN model to predict the future 10-year water quality of WLE.

Study Area and Available Data
Lake Erie is the southernmost of the Great Lakes and has been divided into three distinct basins (western, central, and eastern basins). The study area is the western basin ( Figure 1), which has a surface area of 4837 km 2 and an average depth of 7.4 m. It has two major tributaries: one is the Detroit River ( Figure 1 point A), accounting for approximately 80% of the total annual inflow into Lake Erie [33], and the other is the Maumee River ( Figure 1 point B), accounting for approximately 47% of the TP loads into Lake Erie during 2011-2013 [34]. The average TP concentration is about 25 times larger in the Maumee River than in the Detroit River [34]; therefore, the Maumee River is the main P source for western Lake Erie. The theoretical hydraulic residence time is approximately 51 days [35], with outflow to the central basin through a rocky chain of islands from Point Pelee, Ontario, to Marblehead, Ohio.
The western basin map was compiled from the Great Lakes Environmental Research Laboratory (GLERL) (Figure 1). The meteorological indicators, including shortwave radiation, cloud cover, air temperature, relative humidity, wind speed, and precipitation, were selected in this study. Solar radiation data were provided by Cleveland Hopkins International Airport, on the south shore of Lake Erie. Precipitation data were obtained from Port Stanley, which is located on the north shore of the central basin of Lake Erie. Cloud cover data were obtained from the Cleveland Burke Lakefront Airport, located on the south shore of the lake. The data were in the form of verbal observations, such as overcast, broken, clear, etc. Each observation was replaced with a corresponding value for the fraction of the sky covered with cloud. Relative humidity data were collected The western basin map was compiled from the Great Lakes Environmental Research Laboratory (GLERL) (Figure 1). The meteorological indicators, including shortwave radiation, cloud cover, air temperature, relative humidity, wind speed, and precipitation, were selected in this study. Solar radiation data were provided by Cleveland Hopkins International Airport, on the south shore of Lake Erie. Precipitation data were obtained from Port Stanley, which is located on the north shore of the central basin of Lake Erie. Cloud cover data were obtained from the Cleveland Burke Lakefront Airport, located on the south shore of the lake. The data were in the form of verbal observations, such as overcast, broken, clear, etc. Each observation was replaced with a corresponding value for the fraction of the sky covered with cloud. Relative humidity data were collected from Erieau and Windsor stations, located on the north shore of the central basin. Wind speed and air temperature data were collected from two sources. Most data were obtained from the National Ocean and Atmospheric Administration's (NOAA) National Data Buoy Center's buoy located in the central basin (Station ID: 45005) of Lake Erie and these data were only available during months without ice cover. Data from the Cleveland Burke airport were used for the months with ice cover. For information on the data resources of the Detroit and Maumee Rivers, see Appendixes A and B.
Observed surface water temperature data from 2002 to 2015 were obtained from NOAA. The observed Chl-a concentrations from 2002 to 2015 were retrieved from the Great Lakes Environmental database system (GLENDA), and they were sampled through the water column during spring and in the epilimnion (from 2 to 4 m depths) during summer. The observed bio-volumes (um L ) were converted to Chl-a biomass (ug L ) using the conversion formula: log ( ) = + log ( ℎ ℎ ) with speciesspecific values of and [36].

Modeling Methodology
The one-dimensional model GLM-AED (General Lake Model-Aquatic Ecosystem Dynamics module library) is open-source [37], and it has been applied to various water bodies, such as temperate lakes [19] and reservoirs [38]. The advantage of computational requirements makes it a common tool for long-term simulations. The hydrodynamic Observed surface water temperature data from 2002 to 2015 were obtained from NOAA. The observed Chl-a concentrations from 2002 to 2015 were retrieved from the Great Lakes Environmental database system (GLENDA), and they were sampled through the water column during spring and in the epilimnion (from 2 to 4 m depths) during summer. The observed bio-volumes (µm 3 L −1 ) were converted to Chl-a biomass (µg L −1 ) using the conversion formula: log 10 (biovolume) = a + b log 10 (chlorophyll a) with species-specific values of a and b [36].

GLM-AED Description
The one-dimensional model GLM-AED (General Lake Model-Aquatic Ecosystem Dynamics module library) is open-source [37], and it has been applied to various water bodies, such as temperate lakes [19] and reservoirs [38]. The advantage of computational requirements makes it a common tool for long-term simulations. The hydrodynamic model GLM assumes there is no horizontal variability and adopts a flexible Lagrangian structure, making it possible to adjust the vertical layer thicknesses dynamically to resolve the water column structure. The module solves the turbulent kinetic energy balance to model the surface heat and momentum budgets. The biogeochemical module AED can simulate nutrients, phytoplankton, and oxygen. In this module, 15 state variables were applied to model Chl-a, including dissolved oxygen (DO), four dissolved inorganic groups (dissolved reactive phosphorus (PO 4 ), nitrate (NO 3 ), ammonium (NH 4 ), and reactive silica [39]), three dissolved organic groups (dissolved organic nitrogen (DON), dissolved organic phosphorus (DOP), and dissolved organic carbon (DOC)), and three particulate detrital organic groups (particulate organic nitrogen (PON), particulate organic phosphorus (POP), and particulate organic carbon (POC)), diatoms (DIAT), chlorophytes (CHLOR), cryptophytes (CRYPT), and cyanobacteria (CYANO). Given the difficulties in simulating the changes in the associated population dynamics, we neglected to model dreissenid mussels and zooplankton, subsuming the related mortality and nutrient cycling into the respiration parameter [40].
GLM-AED was initialized on 1 May 2002. The mean daily meteorological forcing data included air temperature, wind speed, relative humidity, precipitation, shortwave solar radiation, and cloud cover. Mean daily boundary conditions included flow rates and water quality state variables for the Detroit River and Maumee River (flow, temperature, and the water quality state variables). The boundary of outflow was specified as the central basin, and the flow rate was calculated based on water balance from the flow rates of the Detroit River and Maumee River, evaporation, and the observed water levels [41,42]. The model was calibrated to the observed surface water temperature and Chl-a concentrations from 2002 to 2015.

ANN Description
The architecture of an artificial neural network (ANN) is typically composed of the input layer, the hidden layer(s), and the output layer [43]. The nodes in the input layer transmit the information to the nodes in the hidden layer by applying an activation function, each value of each node in the input layer is multiplied by its weight to obtain a new value, and then the new value of each node in the hidden layer is multiplied by its weight and pass to the node in the output layer by applying an activation function to get the final output [44]. The weights are assigned during training the ANN to minimize errors between the outputs and the targets [45]. The architecture of the ANN is shown in Figure 2.
GLM-AED was initialized on 1 May 2002. The mean daily meteorological forcing data included air temperature, wind speed, relative humidity, precipitation, shortwave solar radiation, and cloud cover. Mean daily boundary conditions included flow rates and water quality state variables for the Detroit River and Maumee River (flow, temperature, and the water quality state variables). The boundary of outflow was specified as the central basin, and the flow rate was calculated based on water balance from the flow rates of the Detroit River and Maumee River, evaporation, and the observed water levels [41,42]. The model was calibrated to the observed surface water temperature and Chl-a concentrations from 2002 to 2015.

ANN Description
The architecture of an artificial neural network (ANN) is typically composed of the input layer, the hidden layer(s), and the output layer [43]. The nodes in the input layer transmit the information to the nodes in the hidden layer by applying an activation function, each value of each node in the input layer is multiplied by its weight to obtain a new value, and then the new value of each node in the hidden layer is multiplied by its weight and pass to the node in the output layer by applying an activation function to get the final output [44]. The weights are assigned during training the ANN to minimize errors between the outputs and the targets [45]. The architecture of the ANN is shown in Figure 2. In this study, a multilayer perceptron (MLP) with a backpropagation algorithm was used since it was the most commonly used training methodology [46] and this kind of network has already been widely used for describing non-linear relationships in previous studies [28,29,47]. The inputs in the ANN model were the same as GLM-AED, including meteorological forcing, as well as flow rates, temperature, and 15 state variables of the Detroit and Maumee Rivers, so the number of nodes in the input layer is 40. Particularly, because the salinity of Lake Erie is constant with a value of 0 in GLM-AED, the salinity was not regarded as input in the ANN model. Due to the lack of continuous data for Chl- In this study, a multilayer perceptron (MLP) with a backpropagation algorithm was used since it was the most commonly used training methodology [46] and this kind of network has already been widely used for describing non-linear relationships in previous studies [28,29,47]. The inputs in the ANN model were the same as GLM-AED, including meteorological forcing, as well as flow rates, temperature, and 15 state variables of the Detroit and Maumee Rivers, so the number of nodes in the input layer is 40. Particularly, because the salinity of Lake Erie is constant with a value of 0 in GLM-AED, the salinity was not regarded as input in the ANN model. Due to the lack of continuous data for Chl-a concentrations, the simulated Chl-a concentrations from GLM-AED were regarded as the targets during the training process in the ANN model. The number of nodes in the output layer is 1 since there was only one output of Chl-a. The number of nodes in the hidden layers typically satisfies Equation (1) according to the previous study [48]: where n i , n h , and n o represent the number of nodes in the input, hidden, and output layers, respectively. In this study, the number of nodes in the hidden layers was in the range of 14-81. Between the input layer and the hidden layers, as well as the hidden layers and the output layer, the sigmoid activation function was used [49]. The training ranges of the RMSE and MAE are commonly applied in model performance evaluation for hydrodynamic and water quality models and statistical models [50,51]. These two performance measures are in the same unit as the model output, and hence, making them easily interpretable for the reader, and they also work well for continuous long-term simulations [50]. In this study, the root mean square error (RMSE) and mean absolute error based on the previous study were used for evaluating model predictive performance [52]: where P i and O i represent the prediction and observations at the step of i, respectively, and n represents the number of observations.

Performance of Predictive Models
The simulations provided by GLM-AED were compared with the observed surface water temperature and Chl-a from 2002 to 2015. The time-series of surface water temperature showed that the model reproduced the annual variations of water temperature, varying between −5 • C in winter and 25 • C in summer (Figure 3). The RMSE and MAE were 2.95 and 2.75 • C, respectively, which were comparable to previous studies with the applications of the one-dimensional model with an RMSE value of 1-6 • C [53] and three-dimensional model with an RMSE value of 1-3 • C [54] to Lake Erie. In Figure 4, both predicted and observed Chl-a peaks occurred in 2011, with high cyanobacteria in summer (optimum temperature of 25 • C) and diatoms in early spring (optimum temperature of 9.8 • C) [55]. The Chl-a RMSE and MAE were 19.61 and 18.23 µg L −1 , respectively. These errors are comparable to previous studies; for example, Trolle et al.  [57]. Disparities still occurred between predictions and observations, which may be due to the simplification of the complicated ecological processes; for example, our model GLM-AED assumes there is no horizontal variability (horizontally averaged) [37]. The parameters used in this model during calibration were listed in Appendices C and D.
Appl. Sci. 2021, 11, 6529 6 of 13 be due to the simplification of the complicated ecological processes; for example, our model GLM-AED assumes there is no horizontal variability (horizontally averaged) [37]. The parameters used in this model during calibration were listed in Appendixes C and D.  In the training process of the ANN model, to obtain the smallest errors, the number of the hidden layer was selected as 15. The sigmoid function was used as the activation function between layers. The learning rate, momentum, and batch size were 0.3, 0.2, and 100. With trying different percentages of the database for training, 66% of the total data (2002-2010) and 34% of the total data (2011-2015) were used for training and testing, respectively. Figure 5 showed the scatter density plots for Chl-a concentrations between the predictions from the ANN model (y-axis) and the targets from GLM-AED (x-axis) from 2002 to 2010. The value of R 2 was 0.83; thus, compared to the value of approximately 0.73 in the previous study [46], our ANN model can reproduce the target Chl-a concentrations. The RMSE and MAE values during the training process were 3.20 and 2.53 ug L , respectively. be due to the simplification of the complicated ecological processes; for example, our model GLM-AED assumes there is no horizontal variability (horizontally averaged) [37]. The parameters used in this model during calibration were listed in Appendixes C and D.  In the training process of the ANN model, to obtain the smallest errors, the number of the hidden layer was selected as 15. The sigmoid function was used as the activation function between layers. The learning rate, momentum, and batch size were 0.3, 0.2, and 100. With trying different percentages of the database for training, 66% of the total data (2002-2010) and 34% of the total data (2011-2015) were used for training and testing, respectively. Figure 5 showed the scatter density plots for Chl-a concentrations between the predictions from the ANN model (y-axis) and the targets from GLM-AED (x-axis) from 2002 to 2010. The value of R 2 was 0.83; thus, compared to the value of approximately 0.73 in the previous study [46], our ANN model can reproduce the target Chl-a concentrations. The RMSE and MAE values during the training process were 3.20 and 2.53 ug L , respectively. In the training process of the ANN model, to obtain the smallest errors, the number of the hidden layer was selected as 15. The sigmoid function was used as the activation function between layers. The learning rate, momentum, and batch size were 0.3, 0.2, and 100. With trying different percentages of the database for training, 66% of the total data (2002-2010) and 34% of the total data (2011-2015) were used for training and testing, respectively. Figure 5 showed the scatter density plots for Chl-a concentrations between the predictions from the ANN model (y-axis) and the targets from GLM-AED (x-axis) from 2002 to 2010. The value of R 2 was 0.83; thus, compared to the value of approximately 0.73 in the previous study [46], our ANN model can reproduce the target Chl-a concentrations. The RMSE and MAE values during the training process were 3.20 and 2.53 µg L −1 , respectively.  , which is because stations ER 58 and ER 59 are along the southwest border in the western basin, resulting in the observations at these two stations during summer being larger than the simulations near the Maumee River plume [58], especially in the blooming year 2011 and 2015 [55,59], since one-dimensional GLM-AED represents horizontal homogeneity and the nonlinear relationships in the ANN network prioritizes larger weights for the values with a higher frequency of input database to obtain the minimum statistical errors. Additionally, the computational time of the ANN model was less than two seconds, which was 30 times faster than GLM-AED.

The Application of ANN to Predict Water Quality under Future Climate Change and Nutrient Management
Making long-term predictions can avoid the effects of short-term variations, such as thermal stratification, heat content, etc. on phytoplankton abundance [60,61], so here, we used the established ANN model to predict future 10-year Chl-a concentrations in WLE from 2021 to 2031.

Future Climate Change and Nutrient Management
For future meteorological data, we downloaded the outputs of CMCC-CM2, one of the CMCC (Euro-Mediterranean Centre on Climate Change)-coupled climate models, which is mainly based on the Community Earth System Model project operated at the National Centre for Atmospheric Research (NCAR) in the USA from CMIP6 (Coupled Model Intercomparison Project) [62]. CMCC-CM2 is the evolution of CMCC-CM [63] applied in CMIP5 [64], and it is composed of the Community Atmosphere Model, the NEMO ocean engine, the Community sea-Ice CodE, and the Community Land Model [65]. The downloaded daily meteorological data included cloud cover, relative humidity, precipitation, shortwave radiation, wind speed, and air temperature, consistent with the inputs in the established ANN and GLM-AED models.
For the Maumee River, the future flow rates (Q) were predicted by rational method, we first calculated base flow (QB) based on the historical flow rates from 2002 to 2015, and then obtained the future flow rates based on the relationship between peak runoff (Q-QB) and precipitation. To achieve a 'Mild' bloom for western Lake Erie, reducing the Maumee River's annual and spring TP loads by 39% and 37% compared to the average loads for 2007-2012 of 2630 and 1275 metric tons [66] was recommended. For dissolved reactive phosphorus (DRP), the Task Force recommended an annual and spring loading reduction of 46% and 41% in the average loads from 2007-2012 of 593 and 256 MT [67]. This corresponds with the understanding that DRP comprised approximately 20% of TP in WLE tributaries with annual and spring DRP targets of 320 and 150 MT (https://legacyfiles. ijc.org/tinymce/uploaded/Draft%20LEEP-Aug29Final.pdf (accessed on 20 August 2013). The POP, DOP, CHLOR, DIAT, CYANO, and CRYPT concentrations were calculated by applying the same formulas described in Appendix B. For the Detroit River, since it is typically under control, the annual P loads are approximately constant with a value of 1080 MT, and the fluctuations in flow rates can be neglected, we used the base flow from 2002 to 2015 as the future flow rates. Additionally, the POP, DOP, CHLOR, DIAT, CYANO, and CRYPT concentrations were calculated by using the same formulas in Appendix A. Since nitrogen and carbon are not critical for CYANO growth, we used the data in 2015 as the future data in these two rivers.

Prediction of Future Water Quality
Under future climate change and nutrient management scenarios, the predicted average annual Chl-a concentrations were in the range of 15-16 µg L −1 . CYANO is the major component of total Chl-a [68], with the optimal temperature for growth above 25 • C [69]. In the future climate scenario, the temperature is 0.27-7.84 • C, which is much lower than the 25 • C that occurred in the summer during 2002 and 2015, so the predicted Chl-a concentrations are lower than the peak values during 2002 and 2015 in Figure 4. Phosphorus is also important for Chl-a growth [70]. In the future nutrient management scenario, phosphorus loads from Detroit and Maumee River are decreased, resulting in lower Chl-a concentrations than those during 2002 and 2015.
Based on the guideline values for recreational utility proposed by the World Health Organization in 2003 [71], shown in Table 1, the moderate-risk category is identified as related >50 µg L −1 of Chl-a. The probability of adverse health effects of future 10-year water quality in WLE would be moderate. To eliminate the adverse health effects, more intense water resource management would be necessary for the future; for example, making more intense phosphorus load reduction strategies for Maumee River than the targets. Additionally, applying agricultural best management practices (BMP) on the Maumee watershed, such as cover crops and filter strips. In our future work, we will predict the water quality of Lake Erie with the application of BMP on the Maumee watershed with the ANN approach. Additionally, we could predict physical features of Lake Erie, such as water temperature, by applying the ANN approach. There are also some limitations in future work, such as the lack of continuous data required for the establishment of ANN.

Conclusions
In this study, the process-based model GLM-AED and the statistical model ANN were established to predict water quality conditions in western Lake Erie. Based on the same inputs as in GLM-AED and the simulated Chl-a concentrations from calibrated GLM-AED during 2002 and 2010, we trained the ANN model, and then we compared the predicted Chl-a concentrations from GLM-AED and ANN with GLENDA observations during 2011 and 2015. The smaller RMSE and MAE values between ANN predictions and observations showed ANN has higher accuracy than GLM-AED. Lastly, the established ANN model was used to predict the future 10-year water quality of WLE, indicating that moderate adverse health effects would occur in the next decade. Particularly, with the lack of inputs from 2016 to 2020, we can still apply trained ANN to predict future water quality conditions, while it is impossible to apply GLM-AED with a lack of continuous input data. Furthermore, the specialized knowledge requirements in the ANN model application are less than the process-based GLM-AED model. Therefore, the ANN model in our study could be a potential tool replacing process-based models for informing long-term future water quality changes in WLE.