Improved Prediction of Harmful Algal Blooms in Four Major South Korea’s Rivers Using Deep Learning Models

Harmful algal blooms are an annual phenomenon that causes environmental damage, economic losses, and disease outbreaks. A fundamental solution to this problem is still lacking; thus, the best option for counteracting the effects of algal blooms is to improve advance warnings (predictions). However, existing physical prediction models have difficulty setting a clear coefficient for the relationship between each factor when predicting algal blooms, and they require many variable data sources for the analysis. These limitations are accompanied by high time and economic costs. Meanwhile, artificial intelligence and deep learning methods have become increasingly common in scientific research; attempts to apply the long short-term memory (LSTM) model to environmental research problems are increasing because the LSTM model exhibits good performance for time-series data prediction. However, few studies have applied deep learning models or LSTM to algal bloom prediction, especially in South Korea, where algal blooms occur annually. Therefore, we employed the LSTM model for algal bloom prediction in four major rivers of South Korea. We conducted short-term (one week) predictions by employing regression analysis and deep learning techniques on a newly constructed water quality and quantity dataset drawn from 16 dammed pools on the rivers. Three deep learning models (multilayer perceptron, MLP; recurrent neural network, RNN; and long short-term memory, LSTM) were used to predict chlorophyll-a, a recognized proxy for algal activity. The results were compared to those from ordinary least squares (OLS) regression analysis and to actual data based on the root mean square error (RMSE). The LSTM model showed the best prediction performance for harmful algal blooms, and all deep learning models outperformed the OLS regression analysis. Our results reveal the potential for predicting algal blooms using LSTM and deep learning.


Overview
Harmful algal blooms are a phenomenon in which the water in rivers and lakes turns dark green because of excessive algal growth [1]. They can affect areas used as water sources, potentially causing harm to humans and animals, e.g., acute or chronic liver damage when the contaminated water is ingested [2,3]. Moreover, water contaminated by harmful algal blooms looks unappealing and contains a water-soluble neurotoxic component [4]. The effects of harmful algal blooms on rivers and lakes have been experienced and reported worldwide, including the large-scale death of fish [5]. For example, the 11 US health and environment departments funded by the National Center for Environmental Health (NCEH) received a total of 4534 reports on animal disease outbreaks, deaths, and human diseases related to the occurrence of harmful algal blooms from 2007 to 2011 [6], with damages in the US alone estimated at more than 2.2 billion dollars per annum [7]. Therefore, harmful algal blooms are of worldwide concern due to their potential for human and environmental harm along with economic losses [8].
Although environmental authorities worldwide are taking precautions to eliminate such blooms, finding a fundamental solution to the recurring problem is difficult. In South Korea, for instance, the Ministry of Environment sends out alerts via local algae warning and water quality forecasting systems. The algae warning system is based on field survey data for 28 major water sources, lakes, and rivers. The water quality forecasting system is based on actual data such as water temperature and weather observations for 16 dammed pools on four major rivers. Such systems measure the concentrations of cyanobacteria (actual) and chlorophyll-a (predicted) in the water.
Models traditionally used for predicting water quality include QUAL2E (United States Environmental Protection Agency, Washington, DC, USA), CE-QUAL-W2 (2-D Hydrodynamic Water Quality Model, U.S. Army Corps of Engineers, Washington, DC, USA), and others. As such models are based on actual measurements, variations and changes can be checked through mathematical calculations of each element [9]. However, it is expensive and time-consuming to build and operate these models. It is also difficult to set a clear coefficient indicating the relationship between each factor when using such physical models.
Although QUAL2E has 15 water quality parameters that can be simulated, there are some limitations. For example, the measured value for an increase in biochemical oxygen demand (BOD) is limited due to the production and death of algae, so that it is not possible to simulate a large-scale river [10]. CE-QUAL-W2 considers water level, flow, water temperature, and many other water quality factors while also considering the total amount of sediment, ammonia-nitrogen, and phosphate-phosphorus. In other words, this model needs to calculate approximately 20 derived components, including pH and carbonate species [11]. Thus, for these models to be effective, many variable data sources are required for analysis, which can limit their application. Recent research has therefore focused on machine learning techniques that can overcome these limitations [12].
Machine learning (and its sub-method, deep learning) can analyze and learn a vast amount of untapped big data, extracting important patterns from the datasets and providing insight into specific research questions or problems [13]. In addition, deep learning's capacity to determine the most important features can efficiently provide data scientists with concise and reliable analysis results. Consequently, deep learning has improved research techniques dramatically in fields such as speech recognition and genetics [14], and its use in various fields is increasing. For example, Song et al. [15] used deep learning to predict gastrointestinal infection rates using an environmental context because it is difficult to deal with such complex predictive problems due to the many influential indicators and unknown probabilistic relationships between indicators and disease. Li et al. [16] used the long short-term memory (LSTM) model to predict the spatio-temporal patterns of fine particulate (PM2.5) pollution in China. Previous studies have revealed an increase in the use of deep learning models in environmental studies. In the study of rainfall-runoff modeling, the multilayer perceptron (MLP) model was compared with a traditional statistical model, and showed better performance [17]. Recently, the LSTM model has been applied to time-series prediction, for example, wind power prediction [18] and PM2.5 pollution risk prediction [16]. For PM2.5 prediction, the model showed higher prediction precision than other time-series prediction models (support vector machine, SVM; and autoregressive moving average, ARMA) and a traditional neural network for feature representation (time delay neural network, TDNN) [16].
In South Korea, although algal blooms occur consistently [19][20][21], the deep learning model has not often been used in water quality research, despite its proven performance. Therefore, in this study, we used machine learning (specifically the deep learning method) to build a water quality prediction model for harmful algal blooms. Existing deep learning models include the multi-layer perceptron (MLP) network and the Elman neural network, a type of recurrent neural network (RNN). We also employed the LSTM model, which is optimized to handle time series data better than other models [22]. For example, when the amount of data increases in an RNN model, past data values of the algorithm are lost at a high rate through the calculation process. However, the LSTM model solves this problem and minimizes such data loss [23], potentially allowing it to more accurately estimate the time of algal bloom occurrence and help prevent them, reducing possible damage. Applying this deep learning model to water quality prediction can therefore reduce the loss of time and money related to harmful algal blooms, aiding research development in this field and contributing to a fundamental solution to this environmental problem.
In this study, we aimed to build a more precise prediction model that would facilitate pre-emptive action to prevent or mitigate the effects of harmful algal blooms in South Korea. First, we selected variables affecting the occurrence of harmful algal blooms through a regression analysis of water quality and quantity data provided by the South Korean Ministry of Environment and the Ministry of Land, Infrastructure and Transport. Second, we constructed and compared the MLP, RNN, and LSTM deep learning models. Data collected by the Ministry of Environment and local units were used for the analysis to ascertain both dependent variables and independent variables. Through this process, we found that deep learning models generated more accurate predictions than traditional evaluation models.

Literature Review
Various factors can cause harmful algal blooms, such as an increase in nutrients from an influx of anthropogenic contaminants produced by households, factories, farmland, or other sources [24]. These nutrient increases create a favorable environment for algal growth [25]; other contributing factors include water temperature and insolation [26]. Cyanobacteria, which are the main cause of harmful algal blooms, are known to reproduce optimally at a water temperature of 25 °C [27]. The flow and circulation of water can also contribute to algal blooms [28]; when water circulation is inadequate, algae remain in the upper layer of the water after bloom occurrence, leading to abnormal residence times and the reoccurrence and rapid spread of algal blooms [29]. However, sufficient flow and circulation create an environment in which algae cannot breed in one place, reducing the occurrence of blooms [30].
In other words, algal blooms occur because of eutrophication [31]. For example, Florida Bay in the southeastern United States is disturbed frequently by large and dense algal blooms resulting from the sediments and nutrients in the water. Although eutrophication here is most likely caused by the algal biomass, it is difficult to measure this biomass directly, so chlorophyll-a is used as a measure of eutrophication. Chlorophyll-a is an indicator of phytoplankton, is sensitive to excessive nutrients, and can be monitored continuously. In addition, an appropriate limit for chlorophyll-a has been set, which can be utilized as an indicator of water pollution. Similarly, a factor analysis on the Taihu River (China) determined the cause of algal occurrence by using chlorophyll-a as an indicator [32], showing that temperature, pH, total nitrogen (TN), total phosphorus (TP), and other environmental factors (as well as anthropogenic pollution) affected the growth of algae.
In addition, changes in the chlorophyll-a concentration in relation to water quantity and flow rate are also factors in the formation of algal blooms, as shown by an analysis of water level fluctuations in Xiangxi Bay (China) [28]. These results suggested that raising the water level could modulate the occurrence of harmful algal blooms, as the eutrophication stratum became vertically blended when more water was introduced, reducing the time the algae remained on the surface. In this way, algal propagation can be reduced because of dilution and dispersion of the nutrients.
In summary, increases in chlorophyll-a can be caused by several significant water quantity and quality factors. In this study, we used chlorophyll-a as an indicator for water quality (and thus algal blooms) and considered factors such as temperature, pH, biochemical oxygen demand (BOD), chemical oxygen demand (COD), dissolved oxygen (DO), cyanobacteria, water level, and pondage in combination with water quality and quantity data in our analysis of the causes of harmful algal blooms. Various studies have demonstrated the use of deep learning models to predict water quality, such as by predicting DO, TN, and TP concentrations using an Elman neural network [33]. However, as described above, we used an LSTM model that can better analyze these data and potentially reduce or prevent the damage caused by harmful algal blooms by achieving more accurate predictions of their occurrence. This paper is structured as follows: Section 2 presents the scope of the research and the necessary background information on the MLP, RNN, and LSTM models, as well as a description of the variables used in the experiments. Section 3 explains the setup of the experiments, and presents and discusses the results. Section 4 provides the conclusions of this study.

Scope and Composition of Research
In this study, we aimed to identify the causes of harmful algal blooms and construct a suitable prediction model to facilitate preemptive action. We used ordinary least squares (OLS) regression together with the MLP, RNN, and LSTM models to build our prediction model. The study's three-stage framework is shown in Figure 1. First, we investigated the measurement criteria and causative factors for harmful algal blooms and assessed previous attempts to use machine learning models for water quality prediction. Second, we constructed our dataset by combining water quality and quantity data, then performed OLS regression analysis, which is typically used in empirical analysis. The analysis determined the influence of each independent variable on the dependent variable and whether it was positive or negative through correlation coefficients [34]. The results identified the variables with the potential to have a significant effect on chlorophyll-a prediction. In addition, we used the combined water quality and quantity dataset to compare the deep learning model with the predicted values. Finally, we constructed a combined model that could offer one-week predictions, using data from 16 dammed pools on four major river basins and the MLP, RNN, and LSTM models.
For the effective prevention of algal blooms, we built a model that could predict results one week in advance. The one-week prediction period was chosen with reference to past research indicating that harmful algal blooms are characterized by rapid breeding when the environmental requirements are met [35]. As a comparative indicator, the prediction accuracy of the models was examined by comparing their RMSE values, which are commonly used as a measure of model performance [36]. The RMSE is determined by the difference between the values predicted by the model and the actual measured values, and is a good indicator of the average error [17,37,38]. The final goal was a deep learning model that could predict harmful algal blooms in the four major rivers of South Korea.

Analytical Model
We used OLS linear regression analysis to identify the factors that could contribute to harmful algal blooms by analyzing their effects on chlorophyll-a; this approach determines the most basic linear relationship between dependent and independent variables. The variables for the deep learning analysis and the multiple regression analysis were chosen by backward elimination to retain meaningful variables that satisfied the assumptions of multiple regression analysis. In the multicollinearity check, the variance inflation factor (VIF) of every retained variable was below 10, and the same variables were used in the MLP, RNN, and LSTM models. The regression analysis followed the general multiple linear form

$$y = \beta_0 + \sum_{k=1}^{p} \beta_k x_k + \varepsilon,$$

where $y$ is the chlorophyll-a concentration, $x_k$ are the $p$ selected independent variables, $\beta_0$ is the intercept, $\beta_k$ are the regression coefficients, and $\varepsilon$ is the error term. The analysis showed that the cyanobacteria variable had a larger standard deviation than the other variables. If these data were used as-is, variables with large values could have a more significant effect on the result than the other variables; therefore, maximum-minimum normalization was performed on all selected variables. The maximum-minimum normalization rescaled each variable to the range from 0 to 1, using the maximum and minimum values of the data.
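The maximum-minimum normalization described above can be illustrated with a minimal sketch. The function name `min_max_normalize` and the sample cell counts are hypothetical and are not taken from the study's data; the handling of a constant column is likewise an assumption made for this sketch.

```python
def min_max_normalize(values):
    """Rescale a sequence of numbers linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; mapping it to zeros is a
        # choice made for this sketch, not something stated in the paper.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical cyanobacteria cell counts with a very wide range.
cells = [0, 500, 20000, 100000]
scaled = min_max_normalize(cells)  # smallest value maps to 0.0, largest to 1.0
```

After this rescaling, a variable with a large numerical range, such as cyanobacteria cell counts, no longer dominates variables such as pH that vary over only a few units.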
We compared and analyzed the deep learning model based on the RNN and LSTM models (that mostly use sequential data) and the MLP model. The nonlinear activation function was used in the MLP, except for the input layer.
In the MLP model, a feed-forward calculation first produces a prediction from the current weights of each node. The back-propagation algorithm then learns the optimal weights and biases by sending the error between the predicted value and the actual value backward through the network [39]. Through back-propagation, the neuron weights are updated by a gradient descent method to determine the weights that minimize this error. This has the advantage of adjusting parts that are difficult to adjust using human intuition in a complex system [40]. The LSTM model was developed from the existing RNN model. Comparing the RNN, LSTM, and MLP models, the MLP model ignores the time sequence and judges only the current data because the input data pass once through all the nodes. In contrast, the RNN and LSTM models are used widely in time series analysis, as they consider both present and past input data. One disadvantage of the RNN model is that the influence of early input data decreases as the distance between the input data and the nodes utilizing the data increases. The LSTM model improves on this by continuously updating the weights of the important parts of the input values by using four steps in the model. Compared with the conventional RNN model used in water quality research, the LSTM model used in our study had superior prediction accuracy.
The LSTM model derives its values through four steps: forget, input, update, and output. The forget step chooses which information to discard from the old and new incoming data. This decision is determined by the sigmoid layer in the LSTM cell. At this stage, a value between 0 and 1 is sent, which is the criterion for how much information to pass: if the value is 0, no information is transmitted; if the value is 1, all information is transmitted. In the input step, the model chooses whether to save the new information. As in the previous step, the sigmoid layer determines which values to update. Then, the tanh layer creates new results and adds information from the two layers to update the information. In the update step, the previous cell state is updated using the values determined in the previous step. Finally, the output stage determines the output value (which of the filtered values should be exported through the above process). This structure minimizes the loss of previously learned information and yields results [41].
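The four steps can be made concrete with a minimal sketch of one time step of a standard LSTM cell, reduced to scalar inputs and states for readability. This is the textbook formulation, not the study's implementation; the parameter layout `p` and the gate names are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function; maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, p):
    """One time step of a scalar LSTM cell.

    p maps each gate name to a (w_x, w_h, b) triple. The names are
    hypothetical and chosen only for this sketch.
    """
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))       # forget: share of the old cell state to keep
    i = sigmoid(pre("input"))        # input: share of the candidate to store
    g = math.tanh(pre("candidate"))  # tanh layer: candidate new information
    c = f * c_prev + i * g           # update: new cell state
    o = sigmoid(pre("output"))       # output: share of the cell state to expose
    h = o * math.tanh(c)             # new hidden state
    return h, c
```

A forget gate near 1 combined with an input gate near 0 carries the previous cell state forward almost unchanged, which is how the cell limits the loss of earlier information.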
We conducted a comparative analysis using the above models. For the activation function, we used the commonly used rectified linear unit (ReLU). We employed the Adam optimizer, which has the following advantages: it is straightforward to implement, computationally efficient, has minimal memory requirements, is invariant to diagonal rescaling of the gradients, and is suited to problems that have large amounts of data and/or parameters. This method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients [42]. The model has three hidden layers, a batch size of 100, and 100 epochs. With regard to the number of epochs, values of 100, 300, 500, and 700 were used in the pilot analysis. The LSTM model was trained for 100 epochs, as this was optimal in terms of time and results. The MLP model was trained for 500 epochs, which gave a lower RMSE value than the other epoch settings. In addition, a time lag of 1 was specified to construct a model that could predict the results a week in advance. Figure 2 shows the structure of the deep learning model, consisting of nine input variables, three hidden layers, and one output layer (all layers were fully connected). The first hidden layer contained 32 nodes, while the second and third hidden layers contained 64 nodes each. The same structure was used for the LSTM, RNN, and MLP for proper comparison between the three models.
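As an illustration of the time lag of 1, the following minimal sketch pairs the variables observed in one week with the chlorophyll-a value of the following week. The function name `make_lagged_pairs` and the row layout are assumptions made for this sketch; the study does not describe its data structures.

```python
def make_lagged_pairs(rows, target_index, lag=1):
    """Pair the features observed at week t - lag with the target at week t.

    rows is a chronologically ordered list of weekly observations, each a
    list of numbers; target_index is the position of chlorophyll-a in a row.
    """
    inputs, targets = [], []
    for t in range(lag, len(rows)):
        inputs.append(rows[t - lag])            # all variables one lag earlier
        targets.append(rows[t][target_index])   # chlorophyll-a at week t
    return inputs, targets


# Hypothetical weekly rows: [temperature, chlorophyll-a]
weeks = [[18.0, 10.0], [20.0, 14.0], [23.0, 30.0]]
X, y = make_lagged_pairs(weeks, target_index=1)
```

With a lag of 1, the first week contributes only as an input and the last week only as a target, so a series of n weeks yields n - 1 training pairs.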
To compare the prediction values between models, 60% of the data were used as training data, 20% as validation data, and 20% as test data. Figure 3 shows the prediction process, in which chlorophyll-a at time t was predicted from the nine variables observed one week earlier (time t-1). One week in advance was chosen as the prediction timescale for algal blooms because it showed the best balance between prediction accuracy and an effective prediction period for preventing algal bloom damage. The test data were used to predict chlorophyll-a, and the predictions were compared with the existing data. The RMSE was used as an evaluation index by comparing the predicted value with the actual value:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2},$$

where $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, and $n$ is the number of observations. The lower the RMSE, the closer the predicted data are to the actual values.
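The data split and the evaluation index can be sketched as follows. The RMSE function uses the standard definition. The split assumes the samples are kept in chronological order, which the study does not state, and both function names are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def split_60_20_20(samples):
    """Split ordered samples into training, validation, and test parts.

    Assumption for this sketch: order is preserved (no shuffling), so the
    test part holds the most recent observations.
    """
    n = len(samples)
    train_end = n * 6 // 10  # integer arithmetic avoids float rounding
    val_end = n * 8 // 10
    return samples[:train_end], samples[train_end:val_end], samples[val_end:]


train, val, test = split_60_20_20(list(range(10)))
```

Because the errors are squared before averaging, a few large misses around bloom peaks raise the RMSE more than many small misses, which makes it sensitive to how well a model tracks sharp increases in chlorophyll-a.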

Data preprocessing
We obtained data from 16 dammed pools on four major rivers in South Korea. Before using these data, the missing values in the water quality data were replaced with values obtained through linear interpolation. As the date and time of the data differed according to the week and day, we set a standard interval of six days. In addition, the water quality data were measured weekly while the water quantity data were measured daily, so we recalculated the water quantity data to a weekly unit. In this way, weekly algal research data were generated from 27 August 2012 to 25 December 2017.
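The two preprocessing operations can be sketched as follows. Two points are assumptions made for this sketch, since the study does not specify them: daily values are converted to a weekly unit by averaging blocks of seven days, and gaps at the start or end of a series are left unfilled. The function names are hypothetical.

```python
def interpolate_missing(series):
    """Fill None gaps by linear interpolation between known neighbours.

    Leading or trailing gaps have only one neighbour and are left as None.
    """
    filled = list(series)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled


def daily_to_weekly(daily, days_per_week=7):
    """Average consecutive blocks of daily values; drop an incomplete tail."""
    usable = len(daily) - len(daily) % days_per_week
    return [
        sum(daily[i:i + days_per_week]) / days_per_week
        for i in range(0, usable, days_per_week)
    ]


# Hypothetical weekly water quality readings with two missing weeks.
quality = interpolate_missing([1.0, None, None, 4.0])
```

Other aggregation rules, such as taking the value on the sampling day or the weekly maximum, would fit the description equally well; averaging is used here only because it is the simplest choice.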

Dependent Variable
After data preprocessing, chlorophyll-a was used as a dependent variable in water quality data from the Ministry of Environment. As a harmful algal bloom progresses, an increase in the number of cyanobacteria cells on the water surface causes the release of harmful toxins. An increase in chlorophyll-a indicates eutrophication of the water as a result of the algal bloom. Along with measuring chlorophyll-a, we intended to measure the number of cyanobacteria cells producing harmful toxins, but this was difficult to analyze owing to a large amount of missing data. Therefore, only chlorophyll-a was used as a dependent variable in this study.

Control and Independent Variables
The data from the branch units of the Ministry of Environment and the Ministry of Land, Infrastructure and Transport were collected and used for analysis as control variables and independent variables. The Ministry of Environment data included temperature, pH, conductivity, DO, BOD, COD, T-P, and cyanobacteria. The Ministry of Land, Infrastructure and Transport data included water level, pondage, and amount of precipitation. We then used backward elimination to select the most meaningful variables: non-significant variables were eliminated sequentially according to the p-value of their t-statistic. Finally, we selected the following variables: temperature, pH, BOD, COD, DO, cyanobacteria, water level, and pondage (Table 1).
We used nine selection variables and 4464 weekly data points over a period of six years. The amount of DO was used as an indicator of water quality, representing the amount of oxygen pollution. The BOD and COD levels indicated whether the microorganisms needed oxygen to decompose organic matter: the higher these two indices, the more organic matter was present. The water level indicates the surface level of each dammed pool, while pondage indicates the total volume of water.

Results and Discussion
As shown in Table 2, five variables (temperature, pH, DO, BOD, and COD) were positively correlated with the change in chlorophyll-a, i.e., when chlorophyll-a increased, these parameters also increased. This matches the results of past research showing that an increase in water temperature to 25 °C results in algae growth [27]. The positive regression coefficients for DO, BOD, and COD indicate that, when x increased 1 point, y increased by the coefficient value. With respect to DO, as the algae are photosynthetic and produce oxygen, increasing algae produce more DO [12]. BOD is an important parameter for assessing water pollution; the higher the BOD concentration, the higher the increase in organic matter and chlorophyll-a, as shown by a previous study finding that the correlation between BOD and chlorophyll-a in Ham Nghi Lake (Vietnam) in 2013 produced an R² value of 0.97 [43]. Finally, COD has a strong correlation with BOD and is used when determining the pollution level of a water body because measurements of COD are more accurate than those of BOD, which is significantly affected by carbon assimilation in the presence of algae [44]. Our regression analysis indicated a positive correlation between DO, BOD, and COD, in accordance with previous research. In contrast, the water level, pondage, and cyanobacteria variables were negatively correlated with changes in chlorophyll-a, i.e., when x increased 1 point, y decreased by the coefficient value. However, changes in water level could lead to a decrease as well as an increase in chlorophyll-a, as a previous study indicated that the prevention of algal blooms was augmented when the rising period of the water level increased [28]. These results confirmed that temperature, cyanobacteria, pH, DO, BOD, and COD were variables affecting chlorophyll-a, in addition to both water quantity and quality.
The results of the MLP and LSTM models are shown in Figure 4, in which the chlorophyll-a data were predicted one week in advance. The RMSE value was lower in the LSTM model than in the MLP model, demonstrating the superior accuracy of the LSTM model. Although the MLP's predictions were superior in some cases, in most instances it did not follow the trend well. For example, at the Sejongbo site (a dammed pool used for irrigation) on the Geum River, the RMSE value of the MLP model was 33.48, slightly higher than that of the LSTM model. At Sejongbo, the values predicted by the MLP model did not increase sharply with a rise in the actual value, whereas the LSTM predictions closely followed the peak points of the actual values. A similar result occurred at the nearby Gongjubo site (a dammed pool used for irrigation). In contrast, for Gangjeong goryeoungbo (a dammed pool for irrigation), the MLP model peak point was closer to the actual values than that of the LSTM model. However, in other intervals, the MLP predictions deviated significantly from the actual values, as indicated by RMSE values higher than those of the LSTM model. At this site, the maximum value of the Y-axis was 40, lower than the peaks at the other two sites, and the MLP model prediction line followed the actual line well. At the other two sites, however, the maximum value of the Y-axis was more than 100; there, the MLP model prediction line remained between 60 and 70 and did not follow the trend, while the LSTM model followed the trend well regardless of the size of the value. This limitation of the MLP model appears to explain why its RMSE value was higher than that of the LSTM model.
Table 3 shows the results of RMSE comparisons after executing 100, 300, 500, and 700 epochs. We performed these analyses to determine the appropriate number of epochs for each model. For the MLP model, the sum of RMSE values tended to decrease gradually as the number of epochs increased, but increased again beyond 500 epochs. For the LSTM model, the RMSE values increased continuously after the 100th epoch. Consequently, a fair comparison between the two models required comparing the RMSE values at each model's optimal number of epochs. Therefore, we selected 100 and 500 epochs for the LSTM and MLP models, respectively. These results suggest that parameter adjustment is required to increase the accuracy of the two models and that increasing the amount of data would also help improve the accuracy. Table 4 shows the results of the RMSE comparison between OLS regression analysis, MLP, RNN, and LSTM. Our results show that the deep learning models were superior to the OLS linear regression model at most dammed pool sites. We compared the OLS model with each deep learning model.
The RMSE values of the MLP were lower than those of OLS at 11 dammed pools, those of the RNN were lower than OLS at 12 dammed pools, and those of the LSTM were lower than OLS at 12 dammed pools. The RMSE values of the MLP and RNN were lowest at four and three dammed pools, respectively, while those of the LSTM and OLS were lowest at five dammed pools each. The average RMSE of every deep learning model was lower than the OLS average; the difference in average RMSE between the OLS and LSTM models was 1.66. The LSTM model showed the best prediction performance overall, with the lowest average RMSE and the lowest individual values at five of the 16 dammed pools, although it performed less well in certain cases. These results demonstrate that the prediction accuracy for algal blooms can be improved through the use of deep learning models, particularly when compared to the commonly used OLS model.
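The three summaries used in this comparison (the number of sites at which a model beats OLS, the number of sites at which each model has the lowest RMSE, and the average RMSE per model) can be sketched as follows. The function names, the site labels, and the RMSE values are hypothetical placeholders; they are not the values in Table 4.

```python
def sites_beating_baseline(rmse_by_site, model, baseline="OLS"):
    """Count sites where `model` has a strictly lower RMSE than `baseline`."""
    return sum(
        1 for scores in rmse_by_site.values() if scores[model] < scores[baseline]
    )


def best_model_counts(rmse_by_site):
    """Count, per model, the sites at which it has the lowest RMSE."""
    counts = {}
    for scores in rmse_by_site.values():
        best = min(scores, key=scores.get)
        counts[best] = counts.get(best, 0) + 1
    return counts


def mean_rmse(rmse_by_site, model):
    """Average RMSE of `model` over all sites."""
    return sum(scores[model] for scores in rmse_by_site.values()) / len(rmse_by_site)


# Hypothetical RMSE table (site -> model -> RMSE), NOT the values in Table 4.
rmse_by_site = {
    "site_A": {"OLS": 12.0, "MLP": 11.0, "RNN": 10.5, "LSTM": 10.0},
    "site_B": {"OLS": 8.0, "MLP": 9.0, "RNN": 8.5, "LSTM": 8.2},
    "site_C": {"OLS": 20.0, "MLP": 18.0, "RNN": 17.0, "LSTM": 17.5},
    "site_D": {"OLS": 15.0, "MLP": 14.0, "RNN": 14.5, "LSTM": 14.8},
}

lstm_wins = sites_beating_baseline(rmse_by_site, "LSTM")
lowest_counts = best_model_counts(rmse_by_site)
average_gap = mean_rmse(rmse_by_site, "OLS") - mean_rmse(rmse_by_site, "LSTM")
```

In this sketch a tie for the lowest RMSE at a site is credited to whichever model comes first in the dictionary; a real analysis would need an explicit rule for ties.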
The results of this study support the practicality of using deep learning models to supplement existing models in comparable research contexts. As this is the first attempt, to our knowledge, at predicting algal blooms in Korean rivers using the LSTM model of deep learning, we expect that this research will prompt further attempts to apply and refine these methods.
However, this study has some limitations. We considered using cyanobacteria data in the prediction of algal blooms, but missing values prevented their use, which limited the accuracy of the predictions. In addition, the model parameters need to be adjusted in future work to obtain higher accuracy; better results should be obtainable by tuning the parameters to determine optimal values for each region. Subsequent studies and/or improved data would be very helpful in overcoming these limitations. Finally, as the predicted results were not always accurate, further research would be useful to refine the methods employed here.
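One simple form of the per-region parameter adjustment suggested here would be to repeat the epoch comparison of Table 3 separately for each site and keep the epoch count with the lowest RMSE. The sketch below assumes that such per-site RMSE values are already available; the function names, site labels, and numbers are hypothetical and are not results from this study.

```python
def best_epoch(rmse_by_epoch):
    """Return the epoch count with the lowest RMSE."""
    if not rmse_by_epoch:
        raise ValueError("no epochs to compare")
    return min(rmse_by_epoch, key=rmse_by_epoch.get)


def best_epoch_per_site(rmse_by_site_and_epoch):
    """Map each site to its own best epoch count."""
    return {
        site: best_epoch(by_epoch)
        for site, by_epoch in rmse_by_site_and_epoch.items()
    }


# Hypothetical RMSE values (site -> epochs -> RMSE), NOT results from the study.
lstm_rmse = {
    "site_A": {100: 10.0, 300: 10.6, 500: 11.1, 700: 11.9},
    "site_B": {100: 8.4, 300: 8.1, 500: 8.3, 700: 8.9},
}

selected = best_epoch_per_site(lstm_rmse)
```

Selecting on the same data that is later used for reporting accuracy would be optimistic, so a held-out validation period would normally be used for this choice.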
Furthermore, our results showed the potential for using a deep learning model when it is difficult to apply existing physical prediction models (such as QUAL2E and CE-QUAL-W2) due to a lack of data on the relationships between factors. Applying deep learning methodology to water quality and environmental management studies can improve prediction accuracy through the construction of short-term prediction models for algal blooms. By providing better predictions one week in advance, the LSTM model could enable the implementation of specific and appropriate measures for the prevention or mitigation of algal blooms.

Conclusions
In this study, we analyzed factors influencing the occurrence and prediction of harmful algal blooms using weekly water quality and quantity data of 16 dammed pools on four major rivers in South Korea. Based on the selected variables, we employed chlorophyll-a as a predictive factor.
Next, we constructed a model using the deep learning method and compared its results with existing analysis methods such as OLS regression analysis to analyze their relative performance. OLS regression and the MLP, RNN, and LSTM deep learning models were investigated by analyzing predictions of chlorophyll-a based on RMSE. The OLS regression model achieved the lowest RMSE value at five of the 16 dammed pools, while the LSTM model was the most accurate overall.
Moreover, the performance of the LSTM model was superior to that of the MLP and RNN models. In addition, the LSTM model predictions were closer to the actual data than those of the MLP model when variations in chlorophyll-a were large. This implies that the MLP model tended to fail to learn properly when the value of chlorophyll-a increased. The LSTM model, on the other hand, followed the trend line regardless of the range of values. In the comparison between deep learning models, we found that the feedforward method (MLP) performed worse than the recurrent methods (RNN and LSTM). The algorithms used in the LSTM model were designed to solve the problem of long-term information loss in existing RNN models, and are known to improve predictions by carrying forward information on previous data as the amount of data grows. Therefore, for water quality data collected daily with continuous data management, predictions could be made more accurate using such "big data."
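The way an LSTM retains information that a plain recurrent unit loses can be illustrated with a scalar sketch of the standard LSTM cell equations (forget, input, and output gates plus a candidate value). This is not the implementation used in this study; the function names, weights, and inputs are hypothetical, and the gate biases are deliberately set to extreme values so that the forget gate stays open and the input gate stays closed.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM cell; params maps gate name -> (w_x, w_h, bias)."""
    def pre_activation(gate):
        w_x, w_h, bias = params[gate]
        return w_x * x + w_h * h_prev + bias

    f = sigmoid(pre_activation("forget"))       # how much old cell state to keep
    i = sigmoid(pre_activation("input"))        # how much new candidate to admit
    o = sigmoid(pre_activation("output"))       # how much cell state to expose
    g = math.tanh(pre_activation("candidate"))  # candidate cell update
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c


def rnn_step(x, h_prev, w_x, w_h):
    """One step of a plain scalar recurrent unit."""
    return math.tanh(w_x * x + w_h * h_prev)


# Hypothetical weights: forget gate ~1 and input gate ~0 via large biases.
params = {
    "forget": (0.0, 0.0, 20.0),
    "input": (0.0, 0.0, -20.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 0.0, 0.0),
}

inputs = [0.1 * (k % 7) for k in range(52)]  # 52 hypothetical weekly inputs

h, c = 0.0, 0.8   # the LSTM starts with information stored in its cell state
h_rnn = 0.8       # the plain unit starts with the same value in its hidden state
for x in inputs:
    h, c = lstm_step(x, h, c, params)
    # Input weight 0 isolates how the initial state decays in the plain unit.
    h_rnn = rnn_step(x, h_rnn, w_x=0.0, w_h=0.5)
```

After 52 steps the cell state `c` is still close to its initial value, while `h_rnn` has decayed toward zero. In a trained network the gates are learned from data rather than fixed as they are here.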