Application of the Machine Learning LightGBM Model to the Prediction of the Water Levels of the Lower Columbia River

: Due to the strong nonlinear interaction with river discharge, tides in estuaries are char-acterised as nonstationary and their mechanisms are yet to be fully understood. It remains highly challenging to accurately predict estuarine water levels. Machine learning methods, which offer a unique ability to simulate the unknown relationships between variables, have been increasingly used in a large number of research areas. This study applies the LightGBM model to predicting the water levels along the lower reach of the Columbia River. The model inputs consist of the discharges from two upstream rivers (Columbia and Willamette Rivers) and the tide characteristics, including the tide range at the estuary mouth (Astoria) and tide constituents. The model is optimized with the selected parameters. The results show that the LightGBM model can achieve high prediction accuracy, with the root-mean-square-error values of water level being reduced to 0.14 m and the correlation coefﬁcient and skill score being in the ranges of 0.975–0.987 and 0.941–0.972, respectively, which are statistically better than those obtained from physics-based models such as the nonstationary tidal harmonic analysis model (NS_TIDE). The importance of subtide constituents in interacting with the river discharge in the estuary is clearly revealed from the model results.


Introduction
In recent decades, estuaries have been intensively developed and used for a wide range of economic activities [1]. In estuaries, water temperature, salinity, turbidity and sediment transport change daily in response to the tides [2]. Accurately predicting the water levels in estuaries is important, in particular under extreme conditions, to allow sufficient time for the authorities to issue a forewarning of an impending flood and to implement early evacuation measures [3].
Ocean and coastal tides are usually predictable as they have periodic signals. Their tide properties (amplitudes and phase) can be easily obtained from tide levels by using classical harmonic analysis methods such as the T_TIDE model [4], or extracted from tide prediction models such as the Oregon State University Tidal Inversion Software [5]. However, when tides propagate toward the nearshore zone, shallow water effects can significantly affect tide properties, as shallow water tide constituents can be generated during the on-shore propagation of ocean tides [6]. Moreover, tide properties can be further altered when tides enter into an estuary due to the effects caused by the river discharge from upstream. In addition, changes of estuarine width can cause energy variation of tides, while the estuarine water depth can change the propagation speed of tide waves. Due to the complex estuarine environment, estuarine water level variations contain strong nontide components, making the analysis of estuarine tides more challenging [7].
In the past decades, substantial research has been focused on the nonlinear interaction between the river discharge and tides, such as the works of Jay [8] and Godin [9]. Especially in recent years, advanced tools aimed at improving the predictions of estuarine tides have been developed. Matte et al. [7] developed the nonstationary tidal harmonic analysis model (NS_TIDE) to consider the time-dependent tide properties (amplitudes and phases) caused by the nonlinear interaction between tides and river discharge. Similarly, Pan et al. [10] proposed the enhanced harmonic analysis model (S_TIDE), which calculates time-dependent tide properties by using the interpolation method without the input of river discharge. These tools achieved a significant higher analytic accuracy than the classical harmonic analysis model. Therefore, they have been widely used in the analysis of estuarine tides, or even modified to make them more suitable for local estuary environments [11][12][13][14][15][16]. These studies provided a good generalization about the nonstationary nature of estuarine tides from the view of physical processes.
At present, mathematical descriptions of the nonlinear interaction between tides and river discharge in estuaries commonly adopt a linear simplification in predicting the tides, or predict them numerically using hydrodynamic models. The linear simplification or numerical solutions, which rely on accurate topography data and boundary conditions, can limit the prediction accuracy. As an alternative, data-driven models provide a good supplement to physical process-based models. Data-driven models require no specific mathematical assumptions about tide levels and are more suitable to simulate the nonlinear or unknown relationships between the inputs and outputs.
A number of previous studies have achieved the success of accurately predicting tide levels both in coastal and estuary zones through data-driven methods [17][18][19][20]. Some studies also showed that data-driven models, such as the artificial neural network (ANN) method, could achieve better performance than classical harmonic analysis when water level variation contains nontide components [21][22][23]. However, to the best of our knowledge, previous application of data-driven models to tide level predictions mainly focused on coastal tides, whilst their applications to estuarine tides were mainly used for short-term prediction.
Supharatid [20] constructed a prediction model at the River Chao, Phraya in the Gulf of Thailand by using the ANN method to predict the mean monthly estuarine tide level. Chang and Chen [24] used the radial basis function neural network to predict an-hourahead water levels in the Tanshui River during the typhoon periods. Similar works were done by Tsai et al. [25] in predicting the water levels in 1-3 hourly intervals in the Tanshui River. Chen et al. [26] compared the model performance of the ANN model with 2-D and 3-D hydrodynamic models in predicting the water levels of the Danshui River estuary in northern Taiwan. Their results showed that the ANN model achieved similar accuracy as 2-D and 3-D hydrodynamic models, or even better performance during the extremely high flow periods. More recently, Yoo et al. [27] applied the Long Short-Term Memory method to predict the water level of Hangang River, South Korea. Their results showed that the model prediction accuracy becomes unacceptable when the prediction length is longer than 3 h. Chen et al. [28] integrated the NS_TIDE model [7] with an autoregressive model to improve the short-term (within 48 h) prediction of the water levels in the Yangtze estuary.
With the growing research interests in artificial intelligence and the concept of big data, more data-driven methods, especially machine learning and deep learning methods, have been developed [29]. These methods have been widely used in a large number of areas such as economics [30], geoscience [23,31,32], and medicine [33]. Riazi [23] also successfully applied the deep learning approach to accurately predict the coastal tide level, which clearly indicated the applicability of machine learning methods in analyzing the estuarine tides.
In 2017, Microsoft ® proposed a fast, distributed, high performance gradient boosting framework: LightGBM [34]. Consequently, the LightGBM framework, which belongs to decision tree algorithms, has been frequently used in solving the problems of classification [35] and regression [33,36]. Considering the effective performance of the LightGBM framework in regression, this study aimed to develop the LightGBM model for predicting estuarine water levels. Appropriate selection of input parameters and the optimization of the hyperparameters are fully described and carried out. The lower reach of the Columbia River was selected as the research area because the long-term hydrographic field data are publicly available. The results from the LightGBM model were further compared with those from the commonly-used NS_TIDE model [7]. The spatially varied contribution of the parameters in the input layer of the LightGBM model in the lower Columbia River was also investigated in this study.
The remaining part of this paper is organized as follows. The algorithm and the modelling process of the LightGBM model are described in Section 2, while the information of the research area and the measurements are given in Section 3. Model training and testing results are presented in Sections 4 and 5, respectively. A detailed discussion is given in Section 6, while the main conclusions are presented in Section 7.

The Algorithm of the LightGBM Model
The LightGBM model is an open-source framework originally proposed by Microsoft ® [34]. It is a decision-tree-based algorithm, which divides the parameters in the input layer into different parts and thereby constructs the mapping relationship between the inputs and outputs. Figure 1A shows the main feature of the LightGBM model, which uses leaf-wisetree growth instead of the widely used level-wise-tree growth to speed up the training. In terms of the level-wise-tree growth strategy, the tree structure grows level by level. As shown in the figure, assuming that the tree depth is D, the number of the nodes at the final level is 2 D . Compared with the level-wise-tree growth strategy, the leaf-wise-tree growth strategy grows tree based on the node where the largest error reduction can be obtained. In other words, the leaf-wise-tree growth strategy is greedier than the level-wise-tree growth strategy. The tree nodes of the leaf-wise-tree algorithm are usually less than the tree nodes of the level-wise-tree algorithm with the same tree depth. A typical example is shown in Figure 1B, illustrating the leaf-wise-tree growth for breast cancer patients through the input parameters of gender and age groups. In comparison with the level-wise-tree growth algorithm, the leaf-wise tree growth algorithm can considerably reduce the number of tree nodes, so that the training process can be significantly speeded up when the dataset is large. However, when the dataset is small, the leaf-wise tree growth algorithm tends to over-fit because of its greedier algorithm, while the level-wise tree growth algorithm is relatively more stable.
More specifically, the LightGBM model uses the Gradient Boosting Decision Tree (GBDT) algorithm to reach a more accurate prediction. As conceptualized in Figure 1C, assuming X and Y are input and output (targets) for regression, the target of the 1st tree (Tree 1 in Figure 1C) is Y. If the estimation of the 1st tree is T 1 , the input and output (targets) of the 2nd tree (Tree 2 in Figure 1C) become X and Y − T 1 . More specifically, the target (output) of each tree comes from the tree model's errors of the previous tree, while the input is kept the same. The GBDT algorithm generates the final predictions by using the ensemble predictions of a series of trees. With the construction of more and more trees, the ensemble estimation can be gradually close to the output Y. During the construction of each decision tree, the parameters that determine the tree structure and influence the iteration process are determined through cross-validation. Another commonly-used GBDT-based model is the XGBoost model [37]. It has gained much popularity and attention in recent years, such as the studies of Shi et al. [38] and Dong et al. [39]. The XGBoost model uses the level-wise-tree growth algorithm to generate the tree nodes. Compared with other GBDT-based models such as the XGBoost model, the LightGBM model also provides another three algorithms to accelerate its training, namely histogram-based, gradient-based one-side sampling (GOSS) and exclusive feature bundling (EFB) algorithms. The histogram-based algorithms [34] convert the sorted dataset of the parameters in the input layer into a histogram with a specified number of data intervals or bins [36] (Fan et al., 2019), as shown in Figure 2A. After this transformation, each data interval (bin) shares the same index. This algorithm allows the LightGBM model to require a much lower memory consumption while considerably accelerating the training speed. However, compared with the XGBoost model, the split point of the dataset found by the LightGBM model may be less accurate than that of the XGBoost model because the LightGBM traverses the data bins ( Figure 2A) while the XGBoost model directly traverses the dataset. As shown in Figure 2B, the GOSS algorithm [34] approximates the sum of sample data by using the top a% descending-sorted data points and part randomly selected data (b%) from the remaining data (100% − a%). The GOSS algorithm reduces the number of data instances, while the accuracy of the learned decision trees can be still kept [34]. The EFB algorithm reduces the parameter size by merging some parameters in the input layer, which also improves the training speed. Taking Figure 2C as an example, the repeated zero-value points are in the same bin ( Figure 2A). However, the location of the nonzero values of Parameter 1 and Parameter 2 in Figure 2C are mutually exclusive. Merging them does not change the distribution of the output (target) dataset but can reduce the parameters' size in the input layer. Through these algorithms, the LightGBM model has the advantage of a faster training speed and is suitable for handling large-scale datasets. Another commonly-used GBDT-based model is the XGBoost model [37]. It has gained much popularity and attention in recent years, such as the studies of Shi et al. [38] and Dong et al. [39]. The XGBoost model uses the level-wise-tree growth algorithm to generate the tree nodes. Compared with other GBDT-based models such as the XGBoost model, the LightGBM model also provides another three algorithms to accelerate its training, namely histogram-based, gradient-based one-side sampling (GOSS) and exclusive feature bundling (EFB) algorithms. The histogram-based algorithms [34] convert the sorted dataset of the parameters in the input layer into a histogram with a specified number of data intervals or bins [36] (Fan et al., 2019), as shown in Figure 2A. After this transformation, each data interval (bin) shares the same index. This algorithm allows the LightGBM model to require a much lower memory consumption while considerably accelerating the training speed. However, compared with the XGBoost model, the split point of the dataset found by the LightGBM model may be less accurate than that of the XGBoost model because the LightGBM traverses the data bins ( Figure 2A) while the XGBoost model directly traverses the dataset. As shown in Figure 2B, the GOSS algorithm [34] approximates the sum of sample data by using the top a% descending-sorted data points and part randomly selected data (b%) from the remaining data (100% − a%). The GOSS algorithm reduces the number of data instances, while the accuracy of the learned decision trees can be still kept [34]. The EFB algorithm reduces the parameter size by merging some parameters in the input layer, which also improves the training speed. Taking Figure 2C as an example, the repeated zero-value points are in the same bin ( Figure 2A). However, the location of the nonzero values of Parameter 1 and Parameter 2 in Figure 2C are mutually exclusive. Merging them does not change the distribution of the output (target) dataset but can reduce the parameters' size in the input layer. Through these algorithms, the LightGBM model has the advantage of a faster training speed and is suitable for handling large-scale datasets. Taking the estuarine water levels as an example in this study, they can be estimated by the LightGBM model as: where (unit: m) is the estimated estuarine water level by the LightGBM model; (unit: m) is the estimation of the water level from the th tree, in which is the input parameter dataset and is the learned parameter set of the th tree, and is the total number of decision trees.
During the learning stage, can be determined by minimizing a loss (error) function, , against the observation data, . For the 1st tree, 1 can be estimated with the following equation: where is the condition of the function achieving the minimum value between the observed and an initial estimation, 1 . For the th ( >1) tree, the square error (SE) is adopted as the loss function, ( ) = ( − ) 2 , to evaluate the model errors, with the introduction of a target function, , as shown below: where the target function of the th tree, (unit: m 2 ), contains two parts, one represents the loss function for evaluating model accuracy, ( ) and the other, Ω( ) represents model robustness, which is a regular function expressed as: Taking the estuarine water levels as an example in this study, they can be estimated by the LightGBM model as: where η I (unit: m) is the estimated estuarine water level by the LightGBM model; T i (unit: m) is the estimation of the water level from the ith tree, in which X is the input parameter dataset and θ i is the learned parameter set of the ith tree, and I is the total number of decision trees. During the learning stage, T i can be determined by minimizing a loss (error) function, L, against the observation data, η obs . For the 1st tree, T 1 can be estimated with the following equation: where argmin is the condition of the function achieving the minimum value between the observed and an initial estimation, η 1 . For the ith (i > 1) tree, the square error (SE) is adopted as the loss function, L(η i ) = (η obs − η i ) 2 , to evaluate the model errors, with the introduction of a target function, Obj i , as shown below: where the target function of the ith tree, Obj i (unit: m 2 ), contains two parts, one represents the loss function for evaluating model accuracy, L(η i ) and the other, Ω(T i ) represents model robustness, which is a regular function expressed as: where α and β are hyperparameters, which are used to control the learning process, to be tuned at the hyperparameters optimization stage, j is the index of the leaf node, while J is the total number of leaf nodes of the ith decision tree and ω j (unit: m) is the model estimation at the jth leaf node. The first and second terms on the right-hand side of Equation (4) are respectively the Lasso Regression (L1 regularization) and Ridge Regression (L2 regularization) [40]. Both are used to improve the model stability (avoid over-fitting) by adding different penalty terms. If α = 0 and β = 0, Equation (3) becomes the ordinary square error function. The ith tree is determined by finding the smallest Obj i following the least square error approach. In other words, the determination of the decision tree structure not only considers the model accuracy but also takes into account the model's stability. The LightGBM model pertains to the concept of GBDT, which uses the second-order Taylor expansion to estimate the target function. More specifically, the ith targe function is estimated by using the learning results from the (i − 1)th process: where g and h are the first and second derivations of L(η i−1 ) with respect to η i−1 . Since L(η i−1 ) is known from the (i − 1)th learning process, the target function can be written as: where ω j at each leaf node on the ith tree is the unknown variable to be solved, the superscript t is the sample index and N is the total number of sample points. In Equation (8), except for ω j , all other parameters are specified or can be obtained from the (i − 1)th learning process. In order to obtain the smallest Obj i , we have: which yields the following: where sgn is the sign function. When ω j in Equation (12) is determined, the objective function in Equation (8) can be re-written as: To statistically evaluate the tree performance at each leaf node, a score function for the jth leaf node is introduced as: It is clear that Equation (15) is a subterm of Equation (14), indicating the model's error reduction contributed from the jth leaf node. Equation (15) can assist the LightGBM model in selecting the most suitable node to split further. If two new leaf nodes are generated from the jth leaf node, the difference of V j after the split, V d , can be calculated: where G L,α,j (H L,j ) and G R,α,j (H R,j ) are the values of G (H) on the left and right nodes newly-generated from the jth node. Equation (16) statistically compares the model's error reduction when new leaf nodes are generated from different nodes. After examining all the parameters in the input layer and all the leaf nodes, the LightGBM model chooses the most suitable parameter to split from the leaf node where the largest V d can be obtained.

Input and Output Setting
The input dataset of the LightGBM model should contain the factors that influence the variation of estuarine water levels. These input factors should also be predicted when they are used to forecast estuarine water levels. The effective input parameters include the variables related to the tide potential or the time series of tides constituents from ocean tides or coastal tides. However, due to the strong nonlinear interaction between tides and river discharge, these factors may not be enough. As shown in the works of Kukulka and Jay [41,42], the upstream river discharge and tide range near the estuary mouth are important factors in analyzing estuarine tides. Supharatid [20] also included them as variables of the input layer of the ANN model. Therefore, the LightGBM model also includes the river discharges and tide range as the parameters in the input layer. Figure 3 shows the schematic map of the input and output layers of the LightGBM model in this study. In order to both consider the riverine force and marine force factors, similar to the works of Jay et al. [43] and Matte et al. [7], upstream river discharge (Q) and tide range (R) near the estuary mouth were selected in the input layer of the LightGBM model. Moreover, the upstream river discharge was low-passed to smooth the data. In the semi-diurnal tide regime, tide range R is the greater tide range within one lunar day. Referred from the works of Lee [18] and Lee et al. [44], the time series of tide constituents were also included. The tide constituents whose signal-noise-ratio values are larger than 2 in the results of the T_TIDE model [4] were selected as the input of the LightGBM model. Harmonic analysis was used to preliminarily filter the insignificant tide constituents and can reduce the computation time of the LightGBM model. In the LightGBM model, it selects the important parameters from the input layer based on finding the parameter that contributes the largest error reduction (Equation (16)).

Hyperparameters Setting
In the LightGBM model, a number of hyperparameters were used, but many of them can be adjusted to improve the model performance for different applications as the hyperparameters may significantly influence the model performance. Therefore, hyperparameters tuning is an essential process in the construction of machine learning models. The hyperparameters tuned in this study and their related meanings are listed in Table 1. Only the hyperparameters listed in Table 1 were tuned. The other hyperparameters that can further slightly influence the training process, such as setting the minimal H j on the leaf node and the minimal number of data inside one bin, were kept as default. Generally, the core hyperparameters for regression are listed in Table 1. More details of the hyperparameters of the LightGBM model can be found in the LightGBM documentation [45].

Hyperparameters Setting
In the LightGBM model, a number of hyperparameters were used, but many of them can be adjusted to improve the model performance for different applications as the hyperparameters may significantly influence the model performance. Therefore, hyperparameters tuning is an essential process in the construction of machine learning models. The hyperparameters tuned in this study and their related meanings are listed in Table 1.
Only the hyperparameters listed in Table 1 were tuned. The other hyperparameters that can further slightly influence the training process, such as setting the minimal on the leaf node and the minimal number of data inside one bin, were kept as default. Generally, the core hyperparameters for regression are listed in Table 1. More details of the hyperparameters of the LightGBM model can be found in the LightGBM documentation [45]. Figure 4 shows a diagram of constructing the optimal LightGBM model in this study. The main process was to optimize the training process of the LightGBM model by adjusting the hyperparameter values as listed in Table 1. The numerical range of the values of the hyperparameters in Figure 4 is specified according to preliminary tests, and they can be varied when this model is applied to other research areas. The hyperparameters in Table 1 are divided into four steps to be sequentially optimized ( Figure 4) when suitable learning_rate and num_iterations are specified. The optimization sequence of these hyperparameters was determined according to their mutual relationship and the importance of their influence on the LightGBM model. Control the max number of bins (data intervals) when the dataset of a parameter in the input layer is transformed to a histogram ( Figure 2A).   Control the max number of bins (data intervals) when the dataset of a parameter in the input layer is transformed to a histogram ( Figure 2A). The value of α in Equation (4).

Optimal Hyperparameters Searching
The value of β in Equation (4) Figure 4 shows a diagram of constructing the optimal LightGBM model in this study. The main process was to optimize the training process of the LightGBM model by adjusting the hyperparameter values as listed in Table 1. The numerical range of the values of the hyperparameters in Figure 4 is specified according to preliminary tests, and they can be varied when this model is applied to other research areas. The hyperparameters in Table 1 are divided into four steps to be sequentially optimized ( Figure 4) when suitable learning_rate and num_iterations are specified. The optimization sequence of these hyperparameters was determined according to their mutual relationship and the importance of their influence on the LightGBM model. fore, five cross-validation tests were conducted in each combination of hyperparameters. The highest NMSE score in the cross-validation tests represents the smallest model error, indicating the best combination of the hyperparameters in the given range.
The LightGBM model was constructed from the opensource framework of Microsoft ® [34] and the related codes were run on the Python 3.7 environment. In determining the optimal combination of hyperparameters, the GridSearchCV function in the Scikit-learn Python module [46] was used. The GridSearchCV function uses the concept of the grid search method [47] to determine the optimal hyperparameters. On the basis of the specified range of hyperparameters, the GridSearchCV function forms all possible combinations and then performs a cross-validation process ( Figure 4) to search the optimal hyperparameter combinations.
In the specification of the GridSearchCV function, a negative mean squared error (NMSE) was selected as the scoring criterion function to evaluate the model performance. In addition, five-fold cross-validation was specified to optimize the hyperparameters. More specifically, the initial training data in Figure 4 were further averagely divided to five groups, four of which were sequentially selected as the training data in the hyperparameters optimization period and the remaining group was used to test the model. Therefore, five cross-validation tests were conducted in each combination of hyperparameters. The highest NMSE score in the cross-validation tests represents the smallest model error, indicating the best combination of the hyperparameters in the given range.
The LightGBM model was constructed from the opensource framework of Microsoft ® [34] and the related codes were run on the Python 3.7 environment.

The Lower Columbia River
The Columbia River is located in a region of North America ( Figure 5) and flows into the North Pacific Ocean. The Columbia River is the third largest river in the U.S. state [14,15] and the largest river in the Pacific northwest region of North America. Its annual average flow is about 7500 m 3 /s [48]. Three variation patterns can be found in the river discharge records of the Columbia River [48]: First, it has a significant seasonal variation pattern; second, some high-flow events can happen in the winter and early spring seasons; third, due to the actions of hydraulic electrogenerating, it has a periodical variation pattern during low-flow periods. The research area of this study was the lower Columbia River whose area ranges from the estuary mouth to 235 km landward along the river course [7]. It can be seen from Figure 5 that there is a tributary river (Willamette River) entering into the lower Columbia River. It is the largest tributary into the lower Columbia River, and its average flow is about 950 m 3 /s [7]. The tides in the Columbia River are classified as mixed tides. The amplitude ratio of semidiurnal tides to diurnal tides at the estuary mouth is around 1.5 [48]. Both the river discharge from the Columbia River and the Willamette River can affect the tides in the lower Columbia River. Under the influence of river discharge, the tides in the lower Columbia River present significant nonstationary characteristics. Former studies [10,14,[41][42][43] clearly showed that the lower Columbia River is a suitable natural laboratory for the analysis of estuarine tides.
the North Pacific Ocean. The Columbia River is the third largest river in the U.S. state [14,15] and the largest river in the Pacific northwest region of North America. Its annual average flow is about 7500 m 3 /s [48]. Three variation patterns can be found in the river discharge records of the Columbia River [48]: First, it has a significant seasonal variation pattern; second, some high-flow events can happen in the winter and early spring seasons; third, due to the actions of hydraulic electrogenerating, it has a periodical variation pattern during low-flow periods. The research area of this study was the lower Columbia River whose area ranges from the estuary mouth to 235 km landward along the river course [7]. It can be seen from Figure 5 that there is a tributary river (Willamette River) entering into the lower Columbia River. It is the largest tributary into the lower Columbia River, and its average flow is about 950 m 3 /s [7]. The tides in the Columbia River are classified as mixed tides. The amplitude ratio of semidiurnal tides to diurnal tides at the estuary mouth is around 1.5 [48]. Both the river discharge from the Columbia River and the Willamette River can affect the tides in the lower Columbia River. Under the influence of river discharge, the tides in the lower Columbia River present significant nonstationary characteristics. Former studies [10,14,[41][42][43] clearly showed that the lower Columbia River is a suitable natural laboratory for the analysis of estuarine tides.

Data
The hydrometric stations used in this study are also shown in Figure 5. Hourly water level measurements at Astoria, Wauna, Longview, St.Helens, and Vancouver stations were collected from the National Oceanic and Atmospheric Administration (NOAA) website [50]. The duration of the water level measurements at all stations was from 2003 to 2020, except for Wauna station, where data between 2003 and 2006 was unavailable. In addition, the synchronized hourly river discharge data of the Columbia River at the Bonneville Dam (denoted as Qc hereafter) and the Willamette River at Portland (denoted

Data
The hydrometric stations used in this study are also shown in Figure 5. Hourly water level measurements at Astoria, Wauna, Longview, St.Helens, and Vancouver stations were collected from the National Oceanic and Atmospheric Administration (NOAA) website [50]. The duration of the water level measurements at all stations was from 2003 to 2020, except for Wauna station, where data between 2003 and 2006 was unavailable. In addition, the synchronized hourly river discharge data of the Columbia River at the Bonneville Dam (denoted as Qc hereafter) and the Willamette River at Portland (denoted as Qw hereafter) over the same period (2003-2020) were downloaded from the website of the U.S. Geological Survey (USGS) [51].
The date and time of all the measurements were adjusted to the Greenwich Mean Time (GMT), while the mean sea level datum was specified as the uniform datum when downloading the water levels data from the NOAA. The collected measurements of the water levels are shown in Figure 6A, while the measured river discharges are shown in Figure 6B. For the discharges of the Willamette River at Portland, it can be seen that the river discharge is under the influence of tides from the downstream, as evidenced by the negative values. From the collected data of this study, in about 70% of the whole period, the values of the Qw were no more than 20% of the values of the Qc. However, there were also a few periods that the values of Qw exceeded Qc, as shown in Figure 6B.
The Astoria station is the most seaward station in the study site and is assumed under little or no influence from the upstream river discharge. Therefore, it was selected as the reference station to provide the tide range information, while Bonneville Dam and Portland stations were used as the reference stations to provide the upstream river discharge information. The measured water levels at Wauna, Longview, St.Helens and Vancouver stations were used for training and testing the LightGBM model.

Data Preparation
The measured water levels at all stations were divided into two parts: 80% of all the data from January 2003 to June 2017, together with the tide range at Astoria and the river discharges at Bonneville Dam and Portland in the same period, was used for training, whilst the remaining data measured during June 2017-December 2020 (20% of the data) was used for model predictions. As shown in Figure 6A, the measured water level at Wauna station for training and testing was slightly less due to the unavailability. Wauna station used the same data division ratio for training (80%) and testing (20%). The short data length of Wauna station may have had only a minor impact on the results because the tides at Wauna station were less affected by upstream river discharge than Vancouver, St.Helens, and Longview stations ( Figure 6A). The tides at Wauna station mainly present The Astoria station is the most seaward station in the study site and is assumed under little or no influence from the upstream river discharge. Therefore, it was selected as the reference station to provide the tide range information, while Bonneville Dam and Portland stations were used as the reference stations to provide the upstream river discharge information. The measured water levels at Wauna, Longview, St.Helens and Vancouver stations were used for training and testing the LightGBM model.

Data Preparation
The measured water levels at all stations were divided into two parts: 80% of all the data from January 2003 to June 2017, together with the tide range at Astoria and the river discharges at Bonneville Dam and Portland in the same period, was used for training, whilst the remaining data measured during June 2017-December 2020 (20% of the data) was used for model predictions. As shown in Figure 6A, the measured water level at Wauna station for training and testing was slightly less due to the unavailability. Wauna station used the same data division ratio for training (80%) and testing (20%). The short data length of Wauna station may have had only a minor impact on the results because the tides at Wauna station were less affected by upstream river discharge than Vancouver, St.Helens, and Longview stations ( Figure 6A). The tides at Wauna station mainly present the characteristics of astronomic tides, while the river discharge primarily influences the tides in flood season.

Model Training
The training process of the LightGBM model was conducted in two stages. First, to specify the appropriate values of learning_rate and num_iterations, their influence on the model performance and the computational time of the LightGBM model was tested. With the other hyperparameters being defaulted, the RMSE values of the LightGBM model in the training period with learning_rate varying from its default value (0.1) to 0.05 and 0.01 were tested. The value of num_iterations of the LightGBM model is the iteration value when the model errors stop to decrease with the specified learning_rate. Figure 7. It can be seen from Figure 7 that a smaller learning_rate led to better model performance (as indicated by the smaller RMSE values) at St.Helens, Longview and Wauna stations. However, at Vancouver station, the RMSE values when learning_rate = 0.05 were slightly smaller than that when learning_rate=0.01. Overall, the influence of learning_rate did not significantly influence the model performance. The difference between the RMSE values of each station with different learning_rate was within 0.01 m.

The comparison of the RMSE values between the water levels predicted by the Light-GBM model and measurements during the training period at each station is shown in
The training process of the LightGBM model was conducted in two stages. First, to specify the appropriate values of learning_rate and num_iterations, their influence on the model performance and the computational time of the LightGBM model was tested. With the other hyperparameters being defaulted, the RMSE values of the LightGBM model in the training period with learning_rate varying from its default value (0.1) to 0.05 and 0.01 were tested. The value of num_iterations of the LightGBM model is the iteration value when the model errors stop to decrease with the specified learning_rate.
The comparison of the RMSE values between the water levels predicted by the LightGBM model and measurements during the training period at each station is shown in Figure 7. It can be seen from Figure 7 that a smaller learning_rate led to better model performance (as indicated by the smaller RMSE values) at St.Helens, Longview and Wauna stations. However, at Vancouver station, the RMSE values when learning_rate = 0.05 were slightly smaller than that when learning_rate=0.01. Overall, the influence of learning_rate did not significantly influence the model performance. The difference between the RMSE values of each station with different learning_rate was within 0.01 m. Taking the results at Vancouver station as an example, Figure 8 plots the variation of the RMSE values of the LightGBM model during each iteration with different learn-ing_rate values. When learning_rate = 0.10, the LightGBM model used 184 iterations to reach the convergence state of the RMSE values. When learning_rate was reduced to 0.05 and 0.01, the iterations, respectively, increased to 380 and 1829. The results in Figure 7 and Figure 8 show that a smaller learning_rate did not significantly improve the model accuracy but considerably increased the model iterations (computation time). According to the preliminary test results in Figure 7 and Taking the results at Vancouver station as an example, Figure 8 plots the variation of the RMSE values of the LightGBM model during each iteration with different learning_rate values. When learning_rate = 0.10, the LightGBM model used 184 iterations to reach the convergence state of the RMSE values. When learning_rate was reduced to 0.05 and 0.01, the iterations, respectively, increased to 380 and 1829. The results in Figures 7 and 8 show that a smaller learning_rate did not significantly improve the model accuracy but considerably increased the model iterations (computation time). According to the preliminary test results in Figures 7 and 8, learning_rate at each station was specified as 0.05 during the subsequent training and testing. Meanwhile, the corresponding num_iterations values at each station were, respectively, 380 (Vancouver), 291 (St.Helens), 339 (Longview) and 374 (Wauna).
The optimization of max_depth and num_leaves at Vancouver station was used as an instance to show the optimization process of the remaining hyperparameters. Parameters max_depth and number_leaves controlled the complexity of a decision tree. In the methods using the level-wise-tree growth strategy as shown in Figure 1A, such as the XGBoost model [37], num_leaves was 2 D (assuming D = max_depth). However, as the LightGBM model uses the leaf-wise-tree growth strategy [34], the depth of the growth tree generated could be deeper than the tree generated by the level-wise-tree growth strategy with the same value of num_leaves. In this study, max_depth was set in a range of from 4 to 10, while num_leaves was specified in a range of from 5 to 40 and also smaller than 2 D to avoid over-fitting.
All combinations of max_depth and number_leaves are shown as a grid graph in Figure 9 with their related NMSE values at the cross-validation period at Vancouver station. Indicated by the size (larger size represents better performance) and the color of the data points, it can be seen that the grid point, where max_depth = 5 and num_leaves = 17, had the highest score (NMSE), indicating the best model performance during the hyperparameters optimization period. Compared with the default value of num_leaves (31), the results shown in Figure 9 suggest that a simpler (fewer leaf nodes) tree structure could achieve better model performance. However, it should be noted that the mutually restrictive relationship of max_depth or num_leaves prevented the NMSE values from presenting apparent monotonous relationships with max_depth or num_leaves. max_depth restricted the up limit number of tree leaf nodes, while num_leaves restricted max_depth to no more than num_leaves-1. Figure 9 mainly shows the process of searching the optimal hyperparameters based on all the hyperparameters' combinations in their specified range. During the subsequent optimization of the other hyperparameters (Step 2-Step 4 in Figure 4), the values of max_depth and num_leaves at Vancouver station were set as 5 and 17, respectively. The optimization of max_depth and num_leaves at Vancouver station was used as an instance to show the optimization process of the remaining hyperparameters. Parameters max_depth and number_leaves controlled the complexity of a decision tree. In the methods using the level-wise-tree growth strategy as shown in Figure 1A, such as the XGBoost model [37], num_leaves was 2 D (assuming D = max_depth). However, as the LightGBM model uses the leaf-wise-tree growth strategy [34], the depth of the growth tree generated could be deeper than the tree generated by the level-wise-tree growth strategy with the same value of num_leaves. In this study, max_depth was set in a range of from 4 to 10, while num_leaves was specified in a range of from 5 to 40 and also smaller than 2 D to avoid over-fitting. All combinations of max_depth and number_leaves are shown as a grid graph in Figure 9 with their related NMSE values at the cross-validation period at Vancouver station. Indicated by the size (larger size represents better performance) and the color of the data points, it can be seen that the grid point, where max_depth = 5 and num_leaves = 17, had the highest score (NMSE), indicating the best model performance during the hyperparameters optimization period. Compared with the default value of num_leaves (31), the results shown in Figure 9 suggest that a simpler (fewer leaf nodes) tree structure could achieve better model performance. However, it should be noted that the mutually restrictive relationship of max_depth or num_leaves prevented the NMSE values from presenting apparent monotonous relationships with max_depth or num_leaves. max_depth restricted the up limit number of tree leaf nodes, while num_leaves restricted max_depth to no more than num_leaves-1. Figure 9 mainly shows the process of searching the optimal hyperparameters based on all the hyperparameters' combinations in their specified range. During the subsequent optimization of the other hyperparameters (Step 2-Step 4 in Figure  4), the values of max_depth and num_leaves at Vancouver station were set as 5 and 17, respectively. The remaining hyperparameters were essential in improving model performance and accelerating the model iterations. Judged by the NMSE values, all the optimized hyperparameters are listed in Table 2. With the hyperparameters in Table 2, the final LightGBM model could be trained.  The remaining hyperparameters were essential in improving model performance and accelerating the model iterations. Judged by the NMSE values, all the optimized hyperparameters are listed in Table 2. With the hyperparameters in Table 2, the final LightGBM model could be trained.

Model Testing
The testing data (20% of measurements) at each station were used to examine the model accuracy. Figure 10 shows the comparison of the water levels predicted by the LightGBM model with measurements at four stations. In general, the model results of the LightGBM model show a reasonable accuracy relative to the measurements. However, it is clear that the discrepancy seems to become larger when the water levels were high at Vancouver and St.Helens stations. This illustrates that the errors of the LightGBM model

Model Testing
The testing data (20% of measurements) at each station were used to examine the model accuracy. Figure 10 shows the comparison of the water levels predicted by the LightGBM model with measurements at four stations. In general, the model results of the LightGBM model show a reasonable accuracy relative to the measurements. However, it is clear that the discrepancy seems to become larger when the water levels were high at Vancouver and St.Helens stations. This illustrates that the errors of the LightGBM model were larger in flood seasons at the upstream tide reach stations (Vancouver and St.Helens). However, the differences between the results of the LightGBM model and the measurements at Longview and Wauna stations were more randomly (or evenly) distributed.
In the time domain, Figure 11A shows the time series of the predicted water levels against the measurements at the Vancouver station during 2019 and 2020, while Figure 11B plots the model errors and the synchronous river discharge of the Qc (Columbia River). Again, the predicted water level from the LightGBM model generally agrees well with the measurements, as shown in Figure 11A. It is clear from Figure 11B that the average error limits between the predicted and measured water levels were mostly in the range of ±0.5 m. However, larger errors up to approximately ±1.0 m appeared when river discharge was strong in flood season, such as the periods between April to May of 2019 and May to June of 2020. The larger errors of the LightGBM model in these two periods both appeared during the rising flood period. In the rising flood period of 2019, the river discharge met a neap tide ( Figure 11B), and the model errors mainly came from the phase difference between the model results and the measurements. However, in the rising flood period of 2020, the river discharge met a spring tide ( Figure 11B). The model errors mainly sourced from the over-estimation of model results. This means the model errors of the LightGBM in the rising flood period tended to present a different pattern when the tidal condition was different.
were larger in flood seasons at the upstream tide reach stations (Vancouver and St.Helens). However, the differences between the results of the LightGBM model and the measurements at Longview and Wauna stations were more randomly (or evenly) distributed. In the time domain, Figure 11A shows the time series of the predicted water levels against the measurements at the Vancouver station during 2019 and 2020, while Figure  11B plots the model errors and the synchronous river discharge of the Qc (Columbia River). Again, the predicted water level from the LightGBM model generally agrees well with the measurements, as shown in Figure 11A. It is clear from Figure 11B that the average error limits between the predicted and measured water levels were mostly in the range of ±0.5 m. However, larger errors up to approximately ±1.0 m appeared when river discharge was strong in flood season, such as the periods between April to May of 2019 and May to June of 2020. The larger errors of the LightGBM model in these two periods both appeared during the rising flood period. In the rising flood period of 2019, the river discharge met a neap tide ( Figure 11B), and the model errors mainly came from the phase difference between the model results and the measurements. However, in the rising flood period of 2020, the river discharge met a spring tide ( Figure 11B). The model errors mainly sourced from the over-estimation of model results. This means the model errors of the LightGBM in the rising flood period tended to present a different pattern when the tidal condition was different. To compare the performance of the LightGBM model at different tide reaches when river discharge was significant in 2019, Figure 12 compares the predicted and measured water levels particularly for the period between 1 April and 22 April in 2019, together with the river discharges from the Columbia River (Qc) at all stations. Generally, the errors To compare the performance of the LightGBM model at different tide reaches when river discharge was significant in 2019, Figure 12 compares the predicted and measured water levels particularly for the period between 1 April and 22 April in 2019, together with the river discharges from the Columbia River (Qc) at all stations. Generally, the errors mainly arose when the phase of the water levels predicted by the LightGBM model was earlier than the measurements. Meanwhile, the phase difference was more significant at Vancouver and St.Helens stations than at Longview and Wauna stations. In the LightGBM model, the synchronized upstream river discharge and the water levels at each station were used, respectively, as the input and output of the LightGBM model. There was actually a propagation time for the upstream river discharge flowing to each station. In other words, there was a time lag in terms of the effect of the upstream river discharge on tides. Without considering the time lag, the signals of the water levels predicted by the LightGBM models could be earlier than the measurements. The time lag between the model results and the measurements was up to about one day at Vancouver and St.Helens stations and became smaller at Longview station ( Figure 12C), but insignificant at Wauna station ( Figure 12D) because the influence of river discharge on tides gradually became weaker in the landward direction.  Although the LightGBM model had relatively larger errors in the flood season, its accuracy was still comparable with the physics-based models. For statistical comparison, the performance of the LightGBM model was further compared with the NS_TIDE model [7,13] because of its typical application in the analysis of estuarine tides. The details of the NS_TIDE can be found in Matte et al. [7]. The specification of the NS_TIDE model at the lower Columbia River is referred to the works of Matte et al. [7] and Pan et al. [14]. The same data division as the LightGBM model was used to examine the model performance of the NS_TIDE model. Table 3 compares their models' performance by using the performance indicators of the maximum absolute error (MAE), RMSE, correlation coefficient (CC), and the nondimensional skill score (SS) based on the method of Murphy [52]. An SS value close to 1 represents a better model performance. It can be seen that the RMSE values of the LightGBM model were all smaller than the NS_TIDE model, which indicates that the LightGBM model statistically performed better than the NS_TIDE model. The MAE and the values of CC and SS of the LightGBM model were, respectively, smaller and larger than the NS_TIDE model. The results in Table 3 illustrate that the LightGBM model could For further comparison of the LightGBM model results and the measurements in other periods, Figure 13 compares the monthly maximum water levels from the LightGBM models and the measurements at each station. It is clear from Figure 13 that the error limits of the LightGBM model results relative to the measurements were all in the range of about ±0.5 m. This clearly shows that the LightGBM model can give an accurate prediction of the monthly maximum water levels. However, due to the inability to consider the propagation time of the upstream river discharge or even tides, the errors of the LightGBM can be larger during the periods of flood season, such as the model errors at Vancouver station ( Figure 11B).

Parameter Contribution to the LightGBM Model
During the propagation of tide waves toward the upstream direction in an estuary, nonlinear interaction between tides and river discharge intensifies, as well as the tide constituents themselves, so that the overall influence of the river discharge on estuarine tides can vary spatially. These characteristics make the parameters in the input layer ( Figure 3) contribute differently in constructing the tree nodes of the LightGBM model at different tide reaches. Figure 14 shows the contribution of the top 15 parameters in the input layer at each station, including two river discharges (Qc and Qw), tide range at Astoria (R) and 12 tide constituents from statistical analysis. The parameter contribution represents the ratio of the times of the parameter used in tree splitting to the total times of all the parameters used in tree splitting. A larger value of the parameter contribution means that it has been used more times to split the nodes, indicating more significant parameter importance. For each tide constituent, the cosine and sine parts are used as the input parameters, but the sum of the parameter contribution of the cosine and sine parts of a tide constituent is shown in Figure 14. It is clear from Figure 14 that the river discharges Qc and Qw consistently rank the top three at each station. The tide range at Astoria (R) ranks between 7th-13th at those stations for its contribution. In respect of tide constituents, their ranks are basically related to their tide amplitudes. For example, the M2 tide constituent,  [14]. The same data division as the LightGBM model was used to examine the model performance of the NS_TIDE model. Table 3 compares their models' performance by using the performance indicators of the maximum absolute error (MAE), RMSE, correlation coefficient (CC), and the nondimensional skill score (SS) based on the method of Murphy [52]. An SS value close to 1 represents a better model performance. It can be seen that the RMSE values of the LightGBM model were all smaller than the NS_TIDE model, which indicates that the LightGBM model statistically performed better than the NS_TIDE model. The MAE and the values of CC and SS of the LightGBM model were, respectively, smaller and larger than the NS_TIDE model. The results in Table 3 illustrate that the LightGBM model could achieve better performance than the NS_TIDE model. During the propagation of tide waves toward the upstream direction in an estuary, nonlinear interaction between tides and river discharge intensifies, as well as the tide constituents themselves, so that the overall influence of the river discharge on estuarine tides can vary spatially. These characteristics make the parameters in the input layer ( Figure 3) contribute differently in constructing the tree nodes of the LightGBM model at different tide reaches. Figure 14 shows the contribution of the top 15 parameters in the input layer at each station, including two river discharges (Qc and Qw), tide range at Astoria (R) and 12 tide constituents from statistical analysis. The parameter contribution represents the ratio of the times of the parameter used in tree splitting to the total times of all the parameters used in tree splitting. A larger value of the parameter contribution means that it has been used more times to split the nodes, indicating more significant parameter importance. For each tide constituent, the cosine and sine parts are used as the input parameters, but the sum of the parameter contribution of the cosine and sine parts of a tide constituent is shown in Figure 14. It is clear from Figure 14 that the river discharges Qc and Qw consistently rank the top three at each station. The tide range at Astoria (R) ranks between 7th-13th at those stations for its contribution. In respect of tide constituents, their ranks are basically related to their tide amplitudes. For example, the M2 tide constituent, which is the most significant astronomic tide constituent, ranks 3rd-5th. Although the LightGBM model plays the role of a black box, its selections of the parameters to the construction of the decision trees is still related to the physic background of the lower Columbia River. which is the most significant astronomic tide constituent, ranks 3rd-5th. Although the LightGBM model plays the role of a black box, its selections of the parameters to the construction of the decision trees is still related to the physic background of the lower Columbia River.

The Subtide Constituents
It can be seen from Figure 14 that the SA, SSA, MSF, and MM tide constituents are in the top 15 of the most important parameters. To further analyze these subtide constituents, their amplitudes obtained from the measurements are plotted in Figure 15 and compared with the M2 tide constituent. These subtide constituents generally become more significant in the landward direction. Near the estuary mouth where the reference station (Astoria) is located, the amplitudes of these subtide constituents are insignificant except the SA tide constituent (0.10 m). The strength of subtides increasing in the landward direction was also reported in previous studies [

The Subtide Constituents
It can be seen from Figure 14 that the SA, SSA, MSF, and MM tide constituents are in the top 15 of the most important parameters. To further analyze these subtide constituents, their amplitudes obtained from the measurements are plotted in Figure 15 and compared with the M2 tide constituent. These subtide constituents generally become more significant in the landward direction. Near the estuary mouth where the reference station (Astoria) is located, the amplitudes of these subtide constituents are insignificant except the SA tide constituent (0.10 m). The strength of subtides increasing in the landward direction was also reported in previous studies [12,28,53,54]. For example, MSF (MM) can become more energetic from the nonlinear interaction of the M2 and S2 (M2 and N2) tide constituents during the landward propagation of tide waves. Moreover, the lower frequency of subtides relative to the tides such as diurnal and semidiurnal tides favors their propagation because subtides are affected by smaller friction [8].
J. Mar. Sci. Eng. 2021, 9, x FOR PEER REVIEW 20 of 23 coastal or ocean zone. Figure 15 illustrates that the strong inland propagation of subtide waves is also exited in the Lower Columbia River and responds to the high ranks of the sub-tide constituents in Figure 14.

Conclusions
In this study, a novel prediction model was built based on the machine learning LightGBM framework to accurately predict the water levels in a tide-affected estuary and gain a better understanding of the nonlinear interaction between tides and river discharge in estuarine waters. The model was fully trained with the data obtained from the lower Columbia River to establish the optimal parameters required. The model results were also compared with those from the NS_TIDE model.
In the input layer of the LightGBM model, the river discharge from the lower Columbia River at the Bonneville Dam and the Willamette River at Portland, the tide ranges near the estuary mouth (Astoria), and the tide constituents are used. 80% of the field data were used for training and optimizing the model, and the remaining 20% of measurements were used for testing the model performance.
The accuracy of the LightGBM model was statistically evaluated by RMSE, MAE, CC, and SS. The RMSE value of the LightGBM model was 0.14 m, which shows higher accuracy than those from the NS_TIDE model The results clearly demonstrate that the LightGBM model is a powerful machine learning tool for predicting the water levels in the fluvial flow affected estuaries, such as the Lower Columbia River and maybe any other mega-estuaries worldwide, with satisfactory accuracy. The LightGBM model can reveal a new perspective to the prediction of estuarine water levels. The upstream river discharge can also directly affect the water level variation. The upstream river discharge is annually and seasonally varied, making the water level fluctuation also annual and seasonal. The stronger effect of upstream river discharge on water levels along the upstream direction can be reflected in the increase of the SA and SSA tide amplitudes as their periods are annual and semiannual, respectively. The amplitudes of the SA and SSA tide constituents increase by about 7 and 13 times from Astoria station to Vancouver station, which is comparable with the variation ratio of the M2 tide constituent. It should be noted that the amplification of the SA and SSA tide constituents mainly comes from the effect of the variation pattern of the influence of river discharge on water levels. This makes their physical meaning different from the SA and SSA tide constituents in coastal or ocean zone. Figure 15 illustrates that the strong inland propagation of subtide waves is also exited in the Lower Columbia River and responds to the high ranks of the sub-tide constituents in Figure 14.

Conclusions
In this study, a novel prediction model was built based on the machine learning LightGBM framework to accurately predict the water levels in a tide-affected estuary and gain a better understanding of the nonlinear interaction between tides and river discharge in estuarine waters. The model was fully trained with the data obtained from the lower Columbia River to establish the optimal parameters required. The model results were also compared with those from the NS_TIDE model.
In the input layer of the LightGBM model, the river discharge from the lower Columbia River at the Bonneville Dam and the Willamette River at Portland, the tide ranges near the estuary mouth (Astoria), and the tide constituents are used. 80% of the field data were used for training and optimizing the model, and the remaining 20% of measurements were used for testing the model performance.
The accuracy of the LightGBM model was statistically evaluated by RMSE, MAE, CC, and SS. The RMSE value of the LightGBM model was 0.14 m, which shows higher accuracy than those from the NS_TIDE model (0. 15 The results clearly demonstrate that the LightGBM model is a powerful machine learning tool for predicting the water levels in the fluvial flow affected estuaries, such as the Lower Columbia River and maybe any other mega-estuaries worldwide, with satisfactory accuracy. The LightGBM model can reveal a new perspective to the prediction of estuarine water levels.