Article

Using Entity-Aware LSTM to Enhance Streamflow Predictions in Transboundary and Large Lake Basins

by Yunsu Park 1, Xiaofeng Liu 2, Yuyue Zhu 1 and Yi Hong 3,*

1 College of Literature, Sciences, and the Arts, University of Michigan, Ann Arbor, MI 48109, USA
2 Michigan Institute for Data and AI in Society, University of Michigan, Ann Arbor, MI 48109, USA
3 Cooperative Institute for Great Lakes Research, University of Michigan, Ann Arbor, MI 48109, USA
* Author to whom correspondence should be addressed.
Hydrology 2025, 12(10), 261; https://doi.org/10.3390/hydrology12100261
Submission received: 19 August 2025 / Revised: 27 September 2025 / Accepted: 29 September 2025 / Published: 2 October 2025

Abstract

Hydrological simulation of large, transboundary water systems like the Laurentian Great Lakes remains challenging. Although deep learning has advanced hydrologic forecasting, prior efforts are fragmented, lacking a unified basin-wide model for daily streamflow. We address this gap by developing a single Entity-Aware Long Short-Term Memory (EA-LSTM) model, an architecture that distinctly processes static catchment attributes and dynamic meteorological forcings, trained without basin-specific calibration. We compile a cross-border dataset integrating daily meteorological forcings, static catchment attributes, and observed streamflow for 975 sub-basins across the United States and Canada (1980–2023). With a temporal training/testing split, the unified EA-LSTM attains a median Nash–Sutcliffe Efficiency (NSE) of 0.685 and a median Kling–Gupta Efficiency (KGE) of 0.678 in validation, substantially exceeding a standard LSTM (median NSE 0.567, KGE 0.555) and the operational NOAA National Water Model (median NSE 0.209, KGE 0.440). Although skill is reduced in the smallest basins (median NSE 0.554) and during high-flow events (median PBIAS −29.6%), the performance is robust across diverse hydroclimatic settings. These results demonstrate that a single, calibration-free deep learning model can provide accurate, scalable streamflow prediction across an international basin, offering a practical path toward unified forecasting for the Great Lakes and a transferable framework for other large, data-sparse watersheds.

1. Introduction

The Laurentian Great Lakes basin is the largest surface freshwater system on Earth, covering a drainage area of about 965,000 km² across the U.S.–Canada border [1]. This region contains roughly 20% of the planet’s freshwater and is home to Lake Superior, Lake Michigan, Lake Huron, Lake Erie, and Lake Ontario. These five lakes serve a critical role in supplying drinking water to more than 15 million people throughout North America [2,3]. Effective simulation of this large transboundary freshwater system is crucial for water resource management [4]. Despite the significant importance of this transboundary freshwater system, developing a comprehensive and reliable model for the Great Lakes region remains challenging due to data heterogeneity and scarcity [5].
The Great Lakes region has a long tradition of process-based hydrological modeling. Early efforts include GLERL’s Large Basin Runoff Model (LBRM) for simulating runoff from watersheds draining to the lakes [6,7]. More recently, community systems such as MESH-SVS-Raven and GEM-Hydro-Watroute have been applied across the basin [1,8]. These models represent snow, soil moisture, and lake–atmosphere exchanges, and they typically require calibration to observed flows. While local calibration yields good fits at tuned gauges, performance often degrades when applied beyond those basins. Furthermore, large-area calibration is computationally intensive, and evaluation can be confounded by input-data dependencies [8,9]. Case studies echo these challenges; for example, SWAT struggled with snowmelt/baseflow trade-offs under contrasting conditions, and MESH faced subgrid and coupling limitations in complex terrain [10,11]. In parallel, hybrid and physics-guided ML approaches have emerged to blend mechanistic insight with data-driven skill, including LSTM postprocessing of the National Water Model [12], DL-hydrodynamic coupling for Great Lakes forecasts [13], and broader physics-guided ML frameworks [14]. Our work complements the literature in these areas by evaluating a single, cross-border EA-LSTM for basin-wide daily streamflow without basin-specific calibration.
In recent years, machine learning (ML), particularly deep learning models like Long Short-Term Memory (LSTM) networks, has emerged as a powerful alternative for hydrological simulation. LSTM models are well-suited to capture the non-linear, temporal patterns in rainfall–runoff relationships, and they excel at learning from large datasets [15,16]. Numerous studies have shown that LSTM networks can outperform traditional hydrologic models in predicting river discharge. Kratzert et al. [17] trained an LSTM network on a dataset of 531 basins (CAMELS dataset) and achieved unprecedented accuracy in ungauged basins. The single LSTM, trained on 30 years of data per basin, obtained a superior median Nash–Sutcliffe Efficiency (NSE) of 0.69 on test basins not used in training, compared to 0.64 for a calibrated conceptual model (SAC-SMA) and 0.58 for the NOAA National Water Model [17]. Sabzipour et al. [18] directly compared an LSTM neural network with a physically based hydrological model and demonstrated that the LSTM model delivered significantly lower forecast errors with a median MAE of 25 m³/s on day 1 (versus 115 m³/s for the conceptual model) and higher KGE values for up to 7–9 days, without the need for explicit data assimilation. Kratzert et al. [19] also found that an LSTM network, by efficiently learning long-term hydrological dependencies, including crucial storage effects like snow accumulation and melt, significantly outperformed the physically based SAC-SMA+Snow-17 model in rainfall–runoff simulations across a large regional sample. Gauch et al. [20] emphasized that a single LSTM can learn to predict runoff in hundreds of catchments and outperform state-of-the-art conceptual models in benchmarks.
Complementary to LSTM-based rainfall–runoff studies, several ANN pipelines have emphasized input preprocessing and structure-aware signals, including decomposition with Fisher’s ordered clustering and MESA [21], flow-pattern recognition for daily forecasting [22], and long-term forecasting with preprocessing-only inputs [23]. These works underscore the value of careful data conditioning and sequence structure, while our contribution focuses on a single, basin-wide deep learning model for daily streamflow across the bi-national Great Lakes.
The Great Lakes region has begun to see applications of these data-driven techniques. For example, Xue et al. [13] integrated LSTM with a hydrodynamic model for forecasting the spatiotemporal distribution of lake surface temperature (LST) across all five Great Lakes, and Kurt [16] trained an LSTM model to simulate the mean water level of the five lakes. Despite the success of this data-driven approach, a model for simulating streamflows of the Great Lakes basins remains elusive due to data heterogeneity and scarcity. Previous works found that LSTM networks perform best when trained on large, diverse datasets [17,18,19,20]. Since the Great Lakes basin is a transboundary system spanning the U.S. and Canada, collecting sufficient data, which is managed by multiple agencies, is the main challenge for developing an LSTM model for streamflow prediction in the Great Lakes region. The scarcity of properly measured data from such a large area makes gathering data even more challenging. Consequently, previous works trained a model with a dataset like CAMELS, which comprises U.S. basins only [17], or developed a model targeting Canadian catchments [18]. We address these issues by compiling data from 975 basins across the entire Great Lakes region to build a unified LSTM-based hydrological model. To our knowledge, while initiatives such as GRIP-GL [8] provided valuable basin-wide comparisons and other studies have applied deep learning to lake variables [13,16,24], our work is the first to train and evaluate a single deep learning model for daily river discharge across the entire bi-national basin. A summary comparison of these different modeling paradigms is provided in Table 1.
In this paper, we first describe the study area in Section 2, providing a detailed overview of the Laurentian Great Lakes region where the hydrological processes are examined. Following this, Section 3 outlines the data collection, processing, and LSTM modeling setup. A detailed analysis of the model results is presented in Section 4 and discussed further in Section 5. Section 6 concludes this paper and suggests future research directions derived from this work.

2. Study Area

We studied the Laurentian Great Lakes basin, the most extensive system of surface freshwater on Earth, situated across the transboundary region of the United States and Canada. This investigation focuses on the hydrological responses of 975 gauged river sub-basins that contribute runoff to the five Great Lakes: Superior, Michigan, Huron, Erie, and Ontario. These sub-basins were selected to represent the diverse physiographic, climatic, and land-use characteristics inherent to this complex region. The overall geographical extent of these combined sub-basins defines the study domain boundary, as illustrated by the red dashed line in Figure 1a. Land cover across the basins is diverse, as depicted by the dominant land cover type for each basin in Figure 1a. Agriculture is the dominant land cover in approximately 37% of these basins, primarily concentrated in the southern portions. Forest is dominant in 35% of the basins, largely found in the northern regions. Open water (representing lakes within the sub-basins) is dominant in 20%, urban land cover in 6%, and wetlands in approximately 1% of the basins. This mosaic of land cover significantly influences hydrological processes, including evapotranspiration, infiltration, and runoff generation throughout the region [25,26].
The physiography of the Great Lakes basin is predominantly a legacy of Pleistocene glaciation, which shaped its topography of generally low to moderate relief, influencing drainage patterns and soil development [27,28].
The climate is predominantly humid continental [29]. Based on the studied basins, mean annual air temperatures range from approximately −18.4 °C to 38.2 °C, with a median of 19.0 °C and a basin-set average of 17.7 °C. Mean annual precipitation totals for these sub-basins vary from 718 mm/year to 1311 mm/year, with a median of 873 mm/year and a basin-set average of 879 mm/year. A notable spatial gradient in precipitation is observed, generally increasing from west/northwest to east/southeast across the basin. Snowpack accumulation and subsequent melt are critical components of the hydrological cycle in many of these catchments [30].
The drainage areas of the basins vary significantly, ranging from 4.1 km² to 16,388 km², with a median area of 304 km² (Figure 1b). This scale heterogeneity is mirrored in their streamflow regimes. Mean annual discharge, calculated from the average of daily discharge time series (m³/s) for each basin, ranges from as low as 0.038 m³/s in small headwater catchments to over 204 m³/s in larger river systems. The median of these mean annual discharges across the basins is 3.60 m³/s, while the average of these means is 14.15 m³/s (Figure 1c).

3. Method

3.1. Data Collection

To support our analysis of hydrologic behavior across the Great Lakes basin—a region notable for its bi-national extent spanning both the United States and Canada—we compiled a comprehensive dataset comprising streamflow observations, meteorological forcings, and static catchment attributes. Streamflow data were collected from a total of 975 gauge stations: U.S. gauges were obtained from the United States Geological Survey (USGS), while Canadian gauges were sourced from Environment and Climate Change Canada (ECCC). This dual-sourced dataset provides consistent and coordinated coverage across the international boundary and spans the period from 1 January 1980 to 31 December 2023. For each gauge station, a corresponding sub-basin was delineated, resulting in a total of 975 sub-basins simulated in this study.
Meteorological forcing data were derived from the Daymet v4 dataset [31], which offers daily gridded meteorological variables at 1 km spatial resolution. The variables include daily precipitation, minimum and maximum temperature, solar radiation, vapor pressure, day length, and snow water equivalent (SWE). Areal averages of these variables were extracted for each sub-basin and temporally aligned with the streamflow records, resulting in a coherent set of dynamic input–output time series for each basin. Table 2 summarizes the dynamic variables used in this study for model training and testing.
To capture spatial variability in basin characteristics, we incorporated static catchment attributes from the HydroATLAS database [32], which integrates information from BasinATLAS, RiverATLAS, and LakeATLAS. A diverse suite of hydro-environmental variables was selected to represent key aspects of hydrology, climate, topography, land cover, geology, soils, and anthropogenic influence. These attributes were spatially aggregated at the sub-basin level to construct a standardized static feature set. Together, these data support a robust and interpretable modeling framework that reflects both the physical and human-induced heterogeneity across this large, bi-national watershed.

3.2. Data Preprocessing

Streamflow gauge observations across a large and complex region like the Great Lakes basin often suffer from missing or inconsistent records. For our streamflow dataset, any dates with missing discharge values were excluded from model training, and basins lacking discharge measurements altogether were omitted from the analysis. To further improve data quality, we implemented anomaly detection and removal procedures at both the point-wise and basin-wise levels.
At the point level, we flagged and removed periods with prolonged constant discharge values, which are typically indicative of sensor errors or data reporting issues. To detect instrument flat-lining, we treated runs of identical discharge values longer than a threshold of k days as anomalous and removed them (main experiments used k = 14). This follows standard quality-control practice for attenuated/flat signals in hydrometric time series [33,34]. We assessed sensitivity by varying k ∈ {7, 14, 21} and computing the fraction of non-missing rows excluded; the exclusion rate changed only marginally, from 17.50% (k = 7) to 16.75% (k = 21), with k = 14 at 16.92%, a spread of 0.74 percentage points (Appendix Table A1). Given this <1 percentage point variation, k = 14 balances removing flat-line artifacts and retaining data. Alternative detectors (e.g., seasonal low-variance tests; change-point procedures) could be substituted with minimal changes [35,36,37,38,39]; because they also target prolonged quasi-constant segments, similar exclusions are expected in practice.
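The run-length rule described above can be sketched in a few lines of pandas. This is an illustrative implementation, not the authors' code; the function name `remove_flatlines` is ours.

```python
import pandas as pd

def remove_flatlines(q: pd.Series, k: int = 14) -> pd.Series:
    """Mask runs of identical discharge values longer than k days.

    `q` is a daily discharge series. Dates inside any run of equal,
    consecutive readings longer than `k` days are set to NaN so they
    are excluded from training, mirroring the paper's flat-line rule.
    """
    # Label each run of consecutive identical values with an integer id.
    run_id = (q != q.shift()).cumsum()
    # Length of the run each sample belongs to.
    run_len = q.groupby(run_id).transform("size")
    return q.mask(run_len > k)
```

With k = 21 fewer rows are excluded than with k = 7, which is the sensitivity behavior reported in Appendix Table A1.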
To flag implausible magnitudes at the basin scale, we computed the mean discharge-to-surface-area ratio for each basin (10⁻⁶ m/s) and applied an interquartile-range (IQR) rule: a basin was flagged if its ratio was below Q1 − 1.5·IQR or above Q3 + 1.5·IQR, where IQR = Q3 − Q1, and Q1, Q3 represent the first and third quartiles, respectively. The empirical distribution and exact thresholds are reported in Appendix Table A2. Using the resulting cutoffs (lower 0.000276, upper 0.022612), we discarded 49 extreme-ratio basins during training.
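The IQR rule is a standard outlier screen; a minimal NumPy sketch (function name `iqr_flags` is ours, not the authors'):

```python
import numpy as np

def iqr_flags(ratios: np.ndarray) -> np.ndarray:
    """Boolean mask of basins whose mean discharge-to-area ratio falls
    outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = np.percentile(ratios, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return (ratios < lower) | (ratios > upper)
```

Applied to the 975-basin ratio distribution with the paper's cutoffs, this screen flags the 49 extreme-ratio basins removed from training.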
After anomaly filtering, discharge values were normalized by basin area to reduce scale differences and then log-transformed to address skewness. Similar log transformation was applied to precipitation and snow water equivalent (SWE). All remaining dynamic climate variables were standardized using the global mean and variance computed from the training set only. This approach ensures generalizability to ungauged basins, where local statistics are unavailable. Model predictions were made in the transformed space and then back-transformed to obtain physical units. The preprocessing steps for each dynamic variable are summarized in Table 3.
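The transform pipeline above (area normalization, log transform, global standardization) can be sketched as follows. The epsilon guard and function names are our assumptions; the paper does not state its exact offset for zero flows.

```python
import numpy as np

def transform_discharge(q_m3s, area_km2, eps=1e-6):
    """Normalize discharge by basin area, then log-transform.

    `eps` guards against log(0) on zero-flow days; its value here is
    an illustrative assumption.
    """
    q_spec = np.asarray(q_m3s, float) / area_km2
    return np.log(q_spec + eps)

def inverse_transform(y, area_km2, eps=1e-6):
    """Back-transform predictions from log space to physical units (m^3/s)."""
    return (np.exp(y) - eps) * area_km2

def standardize(x, train_mean, train_std):
    """Standardize a dynamic variable using global statistics computed on
    the training period only, so the same transform applies unchanged to
    ungauged basins where local statistics are unavailable."""
    return (x - train_mean) / train_std
```

The round trip `inverse_transform(transform_discharge(q, A), A)` recovers the original discharge, which is how predictions made in transformed space are returned to physical units.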

3.3. LSTM Architecture and Implementation

Recurrent Neural Networks (RNNs) are powerful for modeling sequential data due to their ability to retain information from previous inputs [40,41]. However, traditional RNNs face challenges in capturing long-term dependencies because of the vanishing gradient problem [19,41]. To overcome this, Long Short-Term Memory (LSTM) networks are employed. LSTM replaces the conventional RNN hidden layer with a memory cell that uses three gates—forget, input, and output—to regulate information flow [42].
In this study, we employed both a standard LSTM and the Entity-Aware LSTM (EA-LSTM) of Kratzert et al. [43] to predict streamflow in the Great Lakes basin. The EA-LSTM architecture retains the essential gating mechanisms of the traditional LSTM model (Figure 2). EA-LSTM processes inputs using the same fundamental gating equations as the standard LSTM. However, it distinguishes between static inputs x_s and dynamic inputs x_d[t] at time t, enabling a more efficient handling of these different input types.
i_t = σ(W_i x_s + b_i),  (Input Gate)
f_t = σ(W_f x_d[t] + U_f h_{t−1} + b_f),  (Forget Gate)
g_t = tanh(W_g x_d[t] + U_g h_{t−1} + b_g),  (Candidate Cell State)
o_t = σ(W_o x_d[t] + U_o h_{t−1} + b_o),  (Output Gate)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ g_t,  (Updated Cell State)
h_t = o_t ⊙ tanh(c_t),  (Updated Hidden State)
Here, i_t, f_t, and o_t represent the input, forget, and output gates, respectively. The vector x_s refers to static inputs (invariant with respect to time t), and the vector x_d[t] refers to dynamic inputs that change over time t. Notice that in EA-LSTM the input gate exclusively receives the static inputs x_s, while the remaining gates operate on x_d[t]. The forget gate f_t and the output gate o_t depend on both x_d[t] and h_{t−1}; consequently, even though they are computed primarily from dynamic inputs at time t, they are implicitly influenced by the static inputs through h_{t−1}. The input gate, on the other hand, is completely independent of the dynamic inputs. All models used in experiments in this study were implemented in Python 3.12.3 and trained and evaluated with the open-source library NeuralHydrology (v1.12.0) developed by Kratzert et al. [44].
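To make the gating equations concrete, here is a minimal, unbatched NumPy sketch of one EA-LSTM forward pass. This is an illustration of the cell mathematics, not the NeuralHydrology implementation; the function name and the `params` dictionary layout are our assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ea_lstm_forward(x_s, x_d_seq, params):
    """Minimal EA-LSTM forward pass for a single basin.

    x_s: static attribute vector; x_d_seq: array of shape (T, n_dyn)
    holding dynamic forcings; params: dict of weights W_*, U_*, b_*.
    Returns the hidden-state sequence of shape (T, n_hidden).
    """
    n_hid = params["Wf"].shape[0]
    h = np.zeros(n_hid)
    c = np.zeros(n_hid)
    # Key EA-LSTM feature: the input gate depends only on static
    # attributes, so it is computed once and reused at every step.
    i = sigmoid(params["Wi"] @ x_s + params["bi"])
    hs = []
    for x_d in x_d_seq:
        f = sigmoid(params["Wf"] @ x_d + params["Uf"] @ h + params["bf"])
        g = np.tanh(params["Wg"] @ x_d + params["Ug"] @ h + params["bg"])
        o = sigmoid(params["Wo"] @ x_d + params["Uo"] @ h + params["bo"])
        c = f * c + i * g        # updated cell state
        h = o * np.tanh(c)       # updated hidden state
        hs.append(h)
    return np.stack(hs)
```

A linear head mapping the final hidden state to discharge (omitted here) completes the model.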

3.4. Model Configuration and Hyperparameters

In this study, we evaluate model performance using two data-splitting strategies for training and testing setup: a time-based (temporal) split and a basin-based (spatial) split.
For the time-based split, the dataset is divided chronologically into a training period (1 January 1980–31 December 2012) and a test period (1 January 2013–31 December 2023). Rather than reserving a fixed validation window, validation is performed at each training epoch by randomly sampling a subset of basins and evaluating their performance over the test period. This strategy allows the model to continually learn from the full range of training years while still assessing performance on a diverse set of catchments. For the basin-based split, the full observation record (1 January 1980–31 December 2023) is used, but basins are partitioned into training and test sets. To prevent large basins from dominating the evaluation, all basins are first grouped into five equipopulated bins based on drainage area. From each bin, 20% of basins are randomly selected to form the test set, and the remaining basins are used for training. This stratified sampling ensures that all basin size classes are proportionally represented in both the training and test sets.
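The stratified basin-based split can be sketched with pandas quantile binning. Function name and the fixed seed are illustrative choices, not from the paper.

```python
import numpy as np
import pandas as pd

def stratified_basin_split(areas: pd.Series, n_bins: int = 5,
                           test_frac: float = 0.2, seed: int = 42):
    """Split basin IDs into train/test sets stratified by drainage area.

    Basins are grouped into `n_bins` equipopulated area bins (quantile
    cut); `test_frac` of each bin is drawn at random, so every size
    class is proportionally represented in both sets.
    """
    rng = np.random.default_rng(seed)
    bins = pd.qcut(areas, q=n_bins, labels=False)
    test_ids = []
    for b in range(n_bins):
        ids = areas.index[(bins == b).to_numpy()].to_numpy()
        n_test = max(1, int(round(test_frac * len(ids))))
        test_ids.extend(rng.choice(ids, size=n_test, replace=False))
    test = set(test_ids)
    train = [i for i in areas.index if i not in test]
    return train, sorted(test)
```

For 975 basins with n_bins = 5 and test_frac = 0.2, this reserves roughly 20% of each area class for testing.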
Hyperparameters were selected through a targeted random search, with choices guided by common practices in hydrological deep learning. The search space for key hyperparameters included the following: hidden size ∈ {128, 256, 512}, output dropout ∈ {0.2, 0.3, 0.4}, and learning rate ∈ {10⁻⁴, 5 × 10⁻⁴, 10⁻³}. The Adam optimizer and Mean Squared Error (MSE) loss function were used for all experiments. The final model configuration, listed in Table 4, was chosen based on the set of hyperparameters that yielded the highest median Nash–Sutcliffe Efficiency (NSE) on the validation set. During training, we employed an early stopping protocol with a patience of 5. Training was halted if the validation loss did not improve for five consecutive epochs, and the model with the best validation performance was retained.
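The random search and patience-based early stopping can be sketched as below. The helper names are ours; the search space matches Section 3.4.

```python
import itertools
import random

def sample_configs(n: int, seed: int = 0):
    """Draw `n` distinct configurations from the Section 3.4 search space."""
    space = {
        "hidden_size": [128, 256, 512],
        "output_dropout": [0.2, 0.3, 0.4],
        "learning_rate": [1e-4, 5e-4, 1e-3],
    }
    rng = random.Random(seed)
    grid = list(itertools.product(*space.values()))
    picks = rng.sample(grid, min(n, len(grid)))
    return [dict(zip(space, p)) for p in picks]

class EarlyStopping:
    """Signal a stop when validation loss fails to improve for
    `patience` consecutive epochs; the best model is kept elsewhere."""
    def __init__(self, patience: int = 5):
        self.patience, self.best, self.bad = patience, float("inf"), 0

    def step(self, val_loss: float) -> bool:
        if val_loss < self.best:
            self.best, self.bad = val_loss, 0
            return False
        self.bad += 1
        return self.bad >= self.patience
```

Each sampled configuration is trained with `EarlyStopping(patience=5)`, and the configuration with the highest median validation NSE is retained.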
We set the input sequence length to 365 days to span one full hydrological year, so the network can resolve slow storage types (snowpack, soil moisture, groundwater) and seasonal hysteresis that drive delayed runoff. This choice is common and effective for daily rainfall–runoff LSTM networks, including studies that use a 365 d window explicitly [20,45,46]. Longer daily windows increase computational cost and are typically handled via multi-timescale architectures rather than by extending a single daily sequence indefinitely [20]. Conversely, shorter input windows may truncate hydrologic memory contained in snowpack and soil moisture, which can persist for months to years and thereby reduce skill, particularly in snow-affected basins [19].
To assess model stability, the final temporal-split experiment was repeated with three different random seeds (2004, 2025, 142589). The median NSE across the 632 test basins remained consistent, yielding a mean performance of 0.662 ± 0.030 , confirming that the model’s high performance is robust to initialization. The main experiment seed (142589) produced the best median NSE and is used for all subsequent analyses in this paper.

4. Results

4.1. Model Performance

To assess predictive performance, results from both the temporal-split and spatial-split experiments are compared against two benchmarks: (1) a standard LSTM architecture trained under identical conditions (i.e., the same hyperparameters, data splits, and input features), and (2) the NOAA National Water Model (NWM) version 3.0 [47]. The NWM comparisons were evaluated over the testing period specific to each run: 2013–2023 for the time-based split and 1980–2023 for the spatial-based split (20% holdout), on a common subset of 587 gauge stations. We do not employ postprocessed NWM baselines, since our goal is to benchmark against the raw operational outputs of NWM 3.0 without additional statistical corrections that could confound direct model-to-model comparisons.
Table 5 summarizes the key performance metrics for EA-LSTM, the standard LSTM, and the NWM. For a fair comparison, NWM outputs were evaluated separately for both the time-based and basin-based test sets, using only the subset of basins for which NWM predictions were available.
As also shown by the cumulative distribution functions in Figure 3, the LSTM-based models consistently outperform the NWM across both splitting strategies. Under the time-based split, EA-LSTM achieved the highest overall accuracy, with a median NSE of 0.690 and a median Kling–Gupta Efficiency (KGE) of 0.685. The standard LSTM also performed well in this setting, with a median NSE of 0.572 and a median KGE of 0.560. By contrast, the NWM yielded substantially lower skill on the same evaluation set, with a median NSE of 0.210 and a median KGE of 0.443.
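For reference, the two headline metrics are computed per basin as follows; these are the standard definitions (NSE, and the Gupta et al. 2009 KGE), in an illustrative sketch with our own function names.

```python
import numpy as np

def nse(obs, sim):
    """Nash-Sutcliffe Efficiency: 1 minus the ratio of model error
    variance to the variance of observations (1 is perfect)."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)

def kge(obs, sim):
    """Kling-Gupta Efficiency: combines correlation (r), variability
    ratio (alpha), and bias ratio (beta); 1 is perfect."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    r = np.corrcoef(obs, sim)[0, 1]
    alpha = sim.std() / obs.std()
    beta = sim.mean() / obs.mean()
    return 1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)
```

The medians reported in Table 5 are the medians of these per-basin scores across the test set.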
For the basin-based split, performance declined for all models—reflecting the greater challenge of predicting in entirely ungauged regions. Nevertheless, EA-LSTM still outperformed the standard LSTM and demonstrated accuracy competitive with its temporal-split counterpart, underscoring the benefit of its entity-aware design in handling spatial generalization. Overall, these results confirm that both LSTM-based approaches substantially outperform the operational NWM, with EA-LSTM providing the strongest and most consistent gains.
Following the performance classification criteria of Eryani et al. [48] (Table 6), model results for each basin are categorized as high, moderate, low, or negative. Figure 4 shows the spatial distribution of EA-LSTM performance for the time-split experiment across the 632 gauged basins in the Great Lakes region. The model achieved moderate or high accuracy in most basins, indicating strong predictive skill over a large portion of the domain. However, certain basins exhibited low or negative scores, suggesting regions where hydrologic processes are less well-captured by the current model configuration.

4.2. Variation in Model Performance Across Basin Attributes

We also examine how predictive performance of EA-LSTM varies with major basin attributes, including regulation status, drainage area, and land cover (Figure 5). Table 7 summarizes the median, mean, and quartile NSE values for all groups, combining results previously shown separately.
As described by Grenier et al. [49], the dor_pc_pva attribute was used to classify basins as regulated (dor_pc_pva > 100) or unregulated (dor_pc_pva < 100). Under this criterion, the model achieved a median NSE of 0.676 for regulated basins and 0.691 for unregulated basins. The difference in performance between the two groups was minimal, and a Mann–Whitney U test confirmed that it was not statistically significant (p > 0.05; see Appendix A, Table A3 for full details). This suggests that our EA-LSTM model has adequately learned the regulation-related patterns from the data.
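The regulated-versus-unregulated comparison is a two-sample rank test; a minimal sketch using SciPy (assuming SciPy is available; the wrapper name `compare_groups` is ours):

```python
import numpy as np
from scipy.stats import mannwhitneyu

def compare_groups(nse_group_a, nse_group_b, alpha=0.05):
    """Two-sided Mann-Whitney U test on per-basin NSE values.

    Returns the U statistic, the p-value, and whether the difference
    between the two groups is significant at level `alpha`.
    """
    stat, p = mannwhitneyu(nse_group_a, nse_group_b,
                           alternative="two-sided")
    return stat, p, p < alpha
```

Applied to the regulated and unregulated NSE samples, a p-value above 0.05 (as reported in Appendix Table A3) indicates no statistically significant performance gap.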
Basin size showed a clearer relationship with performance. To analyze this, basins were categorized into five equipopulated bins by drainage area: Bin 1 (4.8–71.9 km2), Bin 2 (72.9–194.7 km2), Bin 3 (195.7–472.4 km2), Bin 4 (475.2–1336.6 km2), and Bin 5 (1341.5–16,387.6 km2). The smallest area group, Bin 1, had a substantially lower median NSE than the overall median of 0.690 (Figure 5c). Performance generally improved with increasing basin size, peaking in the largest group (Bin 5). Post hoc comparisons (see Appendix A) confirmed that the smallest basins performed significantly worse than all larger groups.
Regarding land cover, the model’s performance varies significantly with the degree of human impact on the landscape. The highest accuracy was achieved in undeveloped (UD) basins, which are largely natural, posting a strong median NSE of 0.721. In contrast, performance was lower in human-modified catchments. Agricultural (AG) and urban (UR) basins recorded the lowest median NSE values at 0.665 and 0.666, respectively. Post hoc tests following an ANOVA confirm this observation, showing that the performance in undeveloped basins is statistically significantly higher than in urban basins (the full statistical results are available in Appendix A, Table A7). Notably, the performance difference between urban and agricultural basins was not statistically significant, suggesting they present a similar level of challenge to the model. Two factors may explain this: (1) the dataset is imbalanced, with relatively few urban and agricultural basins, possibly biasing the model toward more common land cover types; and (2) as noted by Sabzipour et al. [18], urban and agricultural hydrology is heavily influenced by anthropogenic controls, which may not be fully captured by models trained predominantly on natural catchments.
To further investigate these patterns quantitatively, we conducted detailed case studies on representative basins (Table 8). A large, undeveloped basin (ID: 04228500) serves as a high-performing baseline, achieving excellent metrics (NSE 0.770, KGE 0.844) but still exhibiting a tendency to underestimate the highest flow events. In the small headwater basin (ID: 02HJ005), the moderate performance (NSE 0.358) is explained by a persistent positive bias (overestimation) in low to mid flows and a significant underestimation of high flows, suggesting the model smooths out the characteristically rapid response of small catchments. The urban basin (ID: 04087159) presents a more pronounced challenge (NSE 0.509). The model systematically overestimates low and mid-range flows and severely underestimates the highest peak flows. This strong, systematic bias confirms that the standard meteorological inputs are insufficient and that the urban hydrological signal is under-represented by the model. Detailed hydrographs and quantile error plots for each case study are provided in Appendix A.3.

4.3. Performance During Extreme Hydrological Events

While metrics like NSE and KGE assess overall performance, they can mask deficiencies during critical periods like floods and droughts. To stress-test the EA-LSTM model, we conducted a focused analysis of its performance during high-flow (top 5% of observed flows) and low-flow (bottom 10% of observed flows) events (Table 9).
During high-flow events, the model shows a strong systematic bias. With a median high-flow PBIAS of −29.6%, EA-LSTM consistently underestimates the volume of flood peaks. This behavior is common for models trained with a Mean Squared Error (MSE) loss function, which heavily penalizes large errors and encourages the model to smooth out extreme peaks. Despite this underestimation, the median Fraction of High-Flow Volume (FHV) of 0.172 shows that the model still correctly attributes about 17% of the total runoff to these peak events, indicating it captures their overall importance in the water budget, just not their full magnitude.
Conversely, during low-flow periods, the model exhibited a median low-flow PBIAS of +22.8%, revealing a clear tendency to overestimate streamflow during dry conditions. This suggests the model predicts a higher baseflow than what is observed, another common trait for models that smooth predictions, as they struggle to simulate near-zero flows. This analysis provides a more nuanced view: while EA-LSTM performs well overall, it is challenged by the extremes, under-predicting floods and over-predicting droughts. These insights highlight specific areas for future improvement, such as exploring loss functions designed to better capture the full range of hydrological variability.
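The event-conditioned percent bias used in this section can be sketched as below, with events defined by quantiles of the observed series (top 5% for high flows, bottom 10% for low flows); the function name is ours.

```python
import numpy as np

def event_pbias(obs, sim, high_q=0.95, low_q=0.10):
    """Percent bias over high-flow and low-flow days.

    PBIAS = 100 * sum(sim - obs) / sum(obs); negative values indicate
    underestimation, positive values indicate overestimation.
    """
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    hi = obs >= np.quantile(obs, high_q)   # top 5% of observed flows
    lo = obs <= np.quantile(obs, low_q)    # bottom 10% of observed flows

    def pbias(o, s):
        return 100.0 * (s - o).sum() / o.sum()

    return pbias(obs[hi], sim[hi]), pbias(obs[lo], sim[lo])
```

Under this sign convention, the reported median high-flow PBIAS of −29.6% corresponds to peak underestimation and the low-flow PBIAS of +22.8% to baseflow overestimation.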

4.4. Physical Plausibility of Learned Processes

Since EA-LSTM is an unconstrained, data-driven model, we conducted two plausibility checks to ensure it learned physically realistic hydrological processes, as is common practice for verifying deep learning models in hydrology [19,50].
First, to assess the model’s understanding of the fundamental water balance, we evaluated the correlation between a proxy for water availability (daily precipitation minus a temperature-based proxy for evapotranspiration) and the simulated daily discharge (Q). The distribution of Pearson correlation coefficients across all 632 test basins is shown in Figure 6a. The distribution is centered around a physically plausible positive value of approximately +0.2, confirming that the model has learned the correct directional relationship: increased water availability leads to higher streamflow, but with the expected delay and damping characteristic of natural catchments.
Second, to verify that the model captures snow dynamics, a critical process in many Great Lakes catchments, we analyzed the lagged correlation between snow water equivalent (SWE) and simulated Q for snow-dominated basins. For each basin, we identified the time lag that maximized the SWE–Q correlation. Figure 6b shows that for nearly all snow-dominated basins, the strongest correlation is positive, confirming the model learned the correct physical link between snowpack and runoff. The results also show a range of optimal lags, from rapid responses (lag near 0) to more delayed systems (lags of 5–30 days), demonstrating that the model captures a spectrum of snowmelt processes reflecting the diversity of basin characteristics.
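The lag-scan used in the SWE plausibility check can be sketched as follows (an illustrative implementation; the function name and the 30-day cap are ours, consistent with the 5–30 day lags discussed above).

```python
import numpy as np

def best_lag_correlation(swe, q, max_lag=30):
    """Find the lag (in days) at which SWE leads discharge with the
    strongest Pearson correlation, scanning lags 0..max_lag."""
    swe, q = np.asarray(swe, float), np.asarray(q, float)
    best_lag, best_r = 0, -np.inf
    for lag in range(max_lag + 1):
        s = swe[: len(swe) - lag] if lag else swe
        d = q[lag:]
        r = np.corrcoef(s, d)[0, 1]
        if r > best_r:
            best_lag, best_r = lag, r
    return best_lag, best_r
```

Running this per snow-dominated basin on (SWE, simulated Q) pairs yields the distribution of optimal lags summarized in Figure 6b.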

5. Discussion

This study is the first to compile a comprehensive, cross-border hydrometeorological dataset for the Laurentian Great Lakes basin and to use it to develop a deep learning model for regional streamflow prediction. Leveraging data from 975 U.S. and Canadian catchments, we trained an Entity-Aware Long Short-Term Memory (EA-LSTM) network that consistently outperformed both the operational NOAA National Water Model (NWM) and a standard LSTM architecture (Section 3). Notably, this performance was achieved without basin-specific calibration, demonstrating strong generalization across the large, heterogeneous, and politically divided Great Lakes basin.
EA-LSTM’s ability to integrate heterogeneous datasets from two countries and produce consistent predictions across the entire basin addresses a long-standing challenge in transboundary water management. These results align with a growing body of literature showing that deep learning models can outperform traditional calibrated hydrological models in large-sample settings [20,43] but extend this evidence to one of the world’s most complex freshwater systems.

5.1. Model Robustness and Generalization

To ensure the model performs well beyond a single training dataset and to address potential overfitting, we evaluated the model’s generalization capabilities using multiple spatial cross-validation experiments. In addition to our primary spatial split, which was stratified by drainage area, we conducted two further experiments where the test basins were selected by stratifying across key climatic regimes: mean annual precipitation and mean annual temperature.
The model demonstrated consistent performance across these varied splits, as detailed in Table 10. The median NSE for the primary area-based split was 0.569, while the splits stratified by precipitation and temperature yielded comparable median NSE values of 0.527 and 0.524, respectively. This stability, with median NSE scores consistently in the 0.52–0.57 range, confirms that the model generalizes well across catchments with different hydroclimatic characteristics and that the reported performance is not an artifact of a single, arbitrary train–test configuration. Furthermore, the architecture incorporates an output dropout rate of 0.3, a standard regularization technique that mitigates overfitting by preventing the co-adaptation of neurons during training, further enhancing robustness.
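The output-dropout regularization mentioned here can be sketched in PyTorch. This is a minimal illustration with hypothetical dimensions and layer sizes, not the study's exact configuration: dropout (rate 0.3) is applied to the final hidden state before the regression head, so roughly 30% of units are zeroed at each training step.

```python
import torch
import torch.nn as nn

class LSTMWithOutputDropout(nn.Module):
    """Sketch: LSTM regressor with dropout (p=0.3) on the final hidden
    state, as a generic illustration of output dropout regularization."""
    def __init__(self, n_in, n_hidden=64, p_drop=0.3):
        super().__init__()
        self.lstm = nn.LSTM(n_in, n_hidden, batch_first=True)
        self.dropout = nn.Dropout(p_drop)
        self.head = nn.Linear(n_hidden, 1)

    def forward(self, x):
        out, _ = self.lstm(x)                  # (batch, seq_len, n_hidden)
        h_last = self.dropout(out[:, -1, :])   # dropout on last time step
        return self.head(h_last)               # (batch, 1) daily discharge

model = LSTMWithOutputDropout(n_in=5)          # 5 hypothetical forcing inputs
y = model(torch.randn(8, 365, 5))              # one-year sequences, batch of 8
```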

5.2. Comparison with Process-Based Approaches

The outperformance of the NWM by EA-LSTM in this study is noteworthy, suggesting that data-driven approaches could substantially improve predictive accuracy in operational forecasting [12]. This aligns with findings from other regional intercomparisons in which LSTM networks matched or exceeded the performance of suites of traditional models [8]. Within our study domain, the Great Lakes Runoff Intercomparison Project Phase 4 (GRIP-GL) evaluated 13 models, including physically based and conceptual models. Many process-based models in GRIP-GL performed well in calibration but degraded when applied regionally, particularly under the strong lake–atmosphere feedback and across national borders [8]. For instance, in the most challenging spatiotemporal validation, the best locally calibrated process-based models (Blended-lumped and Blended-Raven) achieved a median KGE of 0.59, and the best regionally calibrated model (WATFLOOD-Raven) achieved a median KGE of 0.53 [8]. In contrast, our EA-LSTM maintained high accuracy across all basins, with a median KGE of 0.685, outperforming the top GRIP-GL models by ΔKGE ≈ 0.095–0.155. This indicates superior transferability without costly, expert-led recalibration of each sub-basin, a critical factor in vast, complex systems like the Great Lakes.
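For reference, the KGE metric compared throughout this section can be computed as below. This sketch follows the original Gupta et al. (2009) formulation; whether the study uses this variant or the 2012 modification (coefficient-of-variation ratio) is not specified here.

```python
import numpy as np

def kge(obs, sim):
    """Kling-Gupta Efficiency (Gupta et al., 2009):
    KGE = 1 - sqrt((r-1)^2 + (alpha-1)^2 + (beta-1)^2), where r is the
    linear correlation, alpha the variability ratio (std_sim/std_obs),
    and beta the bias ratio (mean_sim/mean_obs)."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    r = np.corrcoef(obs, sim)[0, 1]
    alpha = np.std(sim) / np.std(obs)
    beta = np.mean(sim) / np.mean(obs)
    return 1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)
```

A perfect simulation gives KGE = 1; a doubled hydrograph keeps r = 1 but is penalized through both alpha and beta.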
A common concern with deep learning is that, unlike process-based models, it does not explicitly enforce physical laws. Consequently, hybrid approaches that integrate the strengths of data-driven and process-based models have been proposed as a path forward [14]. However, Klotz et al. [50] found that LSTM internal cell states can correlate with physical hydrological stores such as soil moisture and snowpack, suggesting that a well-trained LSTM can learn meaningful hydrological dynamics. In our case, EA-LSTM's forecasts were often more physically plausible than those of the NWM when compared with observed streamflows, indicating that, despite the absence of explicit process constraints, the model captured realistic hydrological behavior.

5.3. Why LSTM for This Application?

As time-series forecasting is an active topic in the deep learning community, numerous architectures have been proposed to capture the complex patterns of time-series data [51]. While LSTM remains the traditional choice for tasks such as time-series forecasting and rainfall–runoff modeling, other recurrent models, including the vanilla RNN and the Gated Recurrent Unit (GRU), have also been applied in hydrology [52]. In addition, recent transformer-based time-series architectures have reported superior performance over RNN-based methods [51,53]. Zhang et al. [52] simulated reservoir operation with vanilla RNN, GRU, and LSTM algorithms and, through direct comparison on the same task, showed that LSTM outperformed the other RNN variants. Waqas and Wannasingha Humphries [54] likewise compared LSTM networks with different RNN and GRU architectures and demonstrated the advantage of LSTM on hydrological time-series data. The picture is similar for the more advanced transformer architecture. Originally proposed for Natural Language Processing (NLP), the transformer's attention-based design effectively captures sequential patterns, making it a promising candidate for hydrological forecasting. However, without sufficient data to fit such a complex architecture, the simpler LSTM design generalizes more easily and thus often produces preferable predictions [51]. Consistent with this limitation, Liu et al. [55] compared transformers with LSTM networks for streamflow prediction on the CAMELS dataset and found that the vanilla transformer failed to match the predictive skill of an LSTM, particularly on high-flow metrics. Nevertheless, with careful redesign of the attention mechanism and internal layers, transformer-based models can outperform LSTM-based models [24,55].
Given our goal of creating the first basin-wide Great Lakes dataset and the need for a robust, generalizable baseline model, the LSTM architecture offered the best balance between complexity and predictive performance. Future research could investigate whether transformer-based models, potentially adapted with hydrology-specific attention mechanisms, can further improve predictions for this transboundary system.

5.4. Is the Entity Awareness Advantageous?

Our results indicate the benefit of employing entity awareness explicitly in the LSTM by handling static attributes separately. However, Heudorfer et al. [56] reported that adding static features gave no out-of-sample benefit in experiments with 108 German groundwater wells. Similarly, Heudorfer et al. [57] showed that deep learning hydrological models are inherently entity-aware, with the main driver of their entity awareness being the dynamic variables rather than the static attributes. The source of this inconsistency with our findings requires further investigation, but we suggest two possible explanations. First, 108 samples is small compared with the 975 basins in our work, and the benefit of static attributes may only become evident with sufficient training data to learn the complex patterns of basins with diverse characteristics. Although the CAMELS dataset contains more training samples, its diversity still falls short of our dataset's. Second, the entity awareness gained from static features may confer little advantage when applied to out-of-sample basins. Heudorfer et al. [57] argued that when entity-aware models were tested on catchments unseen during training, their superior performance was driven primarily by meteorological data, while the contribution of static features was limited. This observation is consistent with our finding that the spatial-split EA-LSTM underperforms the temporal-split EA-LSTM. However, the fact that our spatial EA-LSTM performed comparably to the temporal LSTM still supports the utility of EA-LSTM's dedicated input gate for static attributes [43]. Moreover, learning the relationship between static inputs and dynamic response from a large sample of gauged basins is practically beneficial, since the model can then generate more skillful predictions for locations without historical streamflow data [20,45].
While these comparisons highlight the difficulty of isolating the contribution of static information, our additional ablation experiments make clear that static attributes are significant for achieving high predictive performance. When we removed all static inputs, the standard LSTM degraded sharply, with mean NSE dropping to negative values (Table 11), confirming that static descriptors provide essential contextual information about basin heterogeneity. EA-LSTM cannot function with zero static attributes, because its dedicated static-input gate requires them by design; longitude and latitude were therefore retained as minimal static descriptors in the no-static setting. Under this constrained setup, EA-LSTM degraded less severely than the standard LSTM, suggesting that its architecture can systematically leverage even weak static signals to improve prediction. Importantly, the strongest overall performance was observed when static attributes were fully available, underscoring that static descriptors are indispensable for reaching the high skill scores reported in our main experiments. These findings support our broader claim that entity-aware architectures provide a principled way to disentangle and exploit static versus dynamic information, enabling more transferable predictions across heterogeneous basins. Future work should investigate whether certain categories of static features (e.g., topographic vs. land cover vs. climatic indices) drive disproportionate gains, which would help clarify the mechanisms of entity awareness.

5.5. Strengths, Limitations, and Future Directions

Both our spatial- and temporal-split EA-LSTM models performed well across diverse hydrological regimes, from snowmelt-dominated northern catchments to rainfall-driven southern systems. The performance of the spatial-split model is especially noteworthy, as the spatially split training strategy makes EA-LSTM directly relevant to Prediction in Ungauged Basins (PUB). However, consistent with other studies, we observed reduced predictive skill in very small catchments and for extreme flow events [46,58]. Small basins often exhibit rapid, localized responses not fully captured by daily inputs, while extreme events, being rare, are challenging for models optimized on overall performance metrics. These limitations call for careful model application and potentially specialized approaches for such conditions. Challenges also remain in heavily managed basins, where anthropogenic influences not captured by standard inputs dominate the flow regime [18]. Investigating the factors driving this performance gap is a clear direction for future research.
While many works, including ours, demonstrate the advantage of deep learning for hydrologic modeling, deep learning is viable only when sufficient data are available to power the model. In practice, obtaining high-quality training data is often the biggest challenge, and the creation of this first integrated, cross-border Great Lakes hydrometeorological dataset was a non-trivial effort. Fortunately, the proliferation of large-sample datasets [8,59] and techniques such as transfer learning and self-supervised learning offer pathways to improved predictions even with limited local data [60].

6. Conclusions

This study developed and evaluated the first comprehensive Entity-Aware Long Short-Term Memory (EA-LSTM) model for streamflow prediction across the Laurentian Great Lakes basin, using an integrated cross-border dataset of 975 U.S. and Canadian catchments. The unified EA-LSTM framework consistently outperformed both the operational NOAA National Water Model and a standard LSTM baseline, demonstrating the potential of entity-aware architectures to serve as a basin-wide data-driven forecasting tool.
Despite these advances, several limitations remain. Model skill was reduced in small and highly urbanized basins, as well as during extreme events, underscoring challenges in representing rapid local hydrological responses. In addition, benchmark comparisons with process-based models highlight trade-offs between physical interpretability and predictive accuracy that warrant further investigation.
Looking forward, this work provides a foundation for the next generation of hydrological forecasting research:
  • Developing and sharing unified cross-border datasets and using EA-LSTM as a strong, generalizable baseline model for large-scale streamflow prediction;
  • Addressing limitations through enhanced treatment of small and urban catchments, improved extreme-event prediction, and continued benchmarking against conceptual and physically based models;
  • Advancing model design by integrating physics-guided deep learning, transfer learning for data-sparse regions, hydrology-aware transformer architectures, and explicit inclusion of anthropogenic and urban covariates.
Together, these directions outline a pathway toward more adaptive, interpretable, and resilient data-driven hydrological prediction for transboundary freshwater systems.

Author Contributions

Conceptualization, Y.H. and X.L.; methodology, Y.H., X.L. and Y.P.; investigation, Y.P.; data curation, Y.P., X.L., Y.Z. and Y.H.; writing—original draft preparation, Y.P.; writing—review and editing, Y.P., Y.Z., X.L. and Y.H.; supervision, Y.H.; funding acquisition, Y.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the University of Michigan Undergraduate Research Opportunity Program (UROP), Schmidt Sciences, and the Cooperative Institute for Great Lakes Research (CIGLR) under the U.S. National Oceanic and Atmospheric Administration (NOAA) Cooperative Agreement with the University of Michigan.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The source code for model training and the datasets required to reproduce the findings of this study are openly available on GitHub at https://github.com/yunsupark1120/GL-EALSTM (accessed on 25 January 2025).

Acknowledgments

The authors would like to thank Matthew Parent and Lauren Fry for their valuable discussions and guidance throughout this study. The authors would also like to thank the University of Michigan Undergraduate Research Opportunity Program (UROP) and Schmidt Sciences for their support. This work was additionally supported by funding awarded to the Cooperative Institute for Great Lakes Research (CIGLR) under the U.S. National Oceanic and Atmospheric Administration (NOAA) Cooperative Agreement with the University of Michigan. The CIGLR contribution number is 1270.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Appendix A.1. Data Quality Controls and Sensitivity Analyses

Table A1. Sensitivity of flat-line anomaly removal to threshold k (identical-flow runs of ≥k days). Values are the fraction of non-missing rows removed.

| Threshold k (days) | Rows removed (%) | Δ vs. k = 14 (pp) |
|---|---|---|
| k = 7 | 17.495 | +0.574 |
| k = 14 | 16.922 | 0 |
| k = 21 | 16.752 | −0.170 |
This section documents two anomaly-screening steps referenced in the main text: (1) a point-level flat-line filter for prolonged constant readings and (2) a basin-level plausibility screen using area-normalized mean discharge. We include exact thresholds, summary statistics, and concise guidance on interpretation for transparency and reproducibility.
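Step (1), the point-level flat-line filter, can be sketched with pandas. This is a minimal illustration, not the study's implementation: the example series is hypothetical, and a short threshold (k = 4) is used here only to keep the toy data small; Table A1 uses k = 14.

```python
import pandas as pd

def flag_flat_lines(q: pd.Series, k: int = 14) -> pd.Series:
    """Flag values belonging to runs of >= k consecutive identical,
    non-missing discharge readings (suspected flat-line anomalies)."""
    # A new run starts whenever the value changes; NaN != NaN also
    # breaks runs, so missing data never extend an identical-flow run.
    run_id = (q != q.shift()).cumsum()
    run_len = q.groupby(run_id).transform("size")
    return q.notna() & (run_len >= k)

# Toy series: four identical readings, then varying flow
q = pd.Series([3.1, 3.1, 3.1, 3.1, 2.8, 2.9, 2.9])
mask = flag_flat_lines(q, k=4)   # only the four-reading run is flagged
```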

Appendix A.1.1. Point-Level Flat-Line Sensitivity

Values indicate that exclusion varies by <1 percentage point across k ∈ {7, 14, 21}, supporting the robustness of the chosen k = 14 rule for removing attenuated/flat signals while preserving data volume.
Beyond reducing training quality, prolonged flat-line anomalies also degrade the fairness of evaluation itself. Since efficiency scores (e.g., NSE, KGE) are computed against erroneous observations, they may substantially understate the true predictive power of the model. Figure A1 illustrates such a case.
Figure A1. Example of an anomalous basin (04073473) where long periods of constant discharge values led to spurious hydrograph behavior. Despite realistic observed flows (blue), the flat-line artifacts caused the simulated values (orange) to diverge strongly, yielding highly negative evaluation metrics (NSE = −19.6, KGE = −2.79). This demonstrates that anomalies harm both model training and evaluation: the model cannot learn meaningful dynamics, and computed skill scores do not reflect its true predictive ability.

Appendix A.1.2. Basin-Level Plausibility Screening (Area-Normalized Flow)

Table A2. Mean discharge-to-area ratio (10⁻⁶ m/s): summary statistics and IQR thresholds (computed on non-missing basins prior to outlier removal; N = 858).

| Statistic | Value |
|---|---|
| count | 858 |
| mean | 0.017668 |
| std | 0.082627 |
| min | 0.000537 |
| Q1 (25%) | 0.008652 |
| median (50%) | 0.011153 |
| Q3 (75%) | 0.014236 |
| max | 1.910970 |
| IQR = Q3 − Q1 | 0.005584 |
| lower bound (Q1 − 1.5·IQR) | 0.000276 |
| upper bound (Q3 + 1.5·IQR) | 0.022612 |
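The screen in Table A2 applies standard Tukey fences to the area-normalized mean flows. A minimal sketch (the example array is hypothetical, not the study's 858 basin values):

```python
import numpy as np

def iqr_bounds(x, factor=1.5):
    """Tukey fences for outlier screening: (Q1 - f*IQR, Q3 + f*IQR)."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr

# Hypothetical area-normalized mean flows (units: 1e-6 m/s), including
# one implausibly high and one implausibly low basin
ratios = np.array([0.009, 0.011, 0.012, 0.014, 0.010, 1.9, 0.0004])
lo, hi = iqr_bounds(ratios)
keep = (ratios >= lo) & (ratios <= hi)   # basins passing the screen
```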

Appendix A.2. Expanded Statistical Analysis

For all ANOVA and Tukey’s HSD tests in this section, the lowest 1% of NSE values (six basins) were excluded prior to analysis. These extreme values disproportionately inflated within-group variance and masked otherwise evident group-level differences. The exclusion criterion was defined a priori and applied consistently, and the full untrimmed results are available upon request.
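The trimmed one-way ANOVA workflow described above can be sketched as follows. All data here are synthetic (group means, spreads, and sizes are hypothetical); the point is the order of operations: trim the lowest 1% of NSE values first, then test across groups.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(42)
# Hypothetical per-basin NSE values for three basin-size bins
nse = np.concatenate([rng.normal(0.50, 0.15, 200),
                      rng.normal(0.60, 0.15, 200),
                      rng.normal(0.65, 0.15, 200)])
bins = np.repeat(np.array(["Bin 1", "Bin 2", "Bin 3"]), 200)

# A priori trimming: exclude the lowest 1% of NSE values before testing
cut = np.quantile(nse, 0.01)
keep = nse > cut

samples = [nse[keep][bins[keep] == b] for b in np.unique(bins)]
f_stat, p_value = f_oneway(*samples)   # one-way ANOVA on trimmed groups
```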

Appendix A.2.1. Regulation Group Comparison

Table A3. Mann–Whitney U test results comparing NSE between regulated and unregulated basins.

| Statistic | Value |
|---|---|
| Mann–Whitney U statistic | 18,854 |
| p-value | 0.9588 |
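A test of this form can be run with SciPy as sketched below. The samples are synthetic (drawn from the same distribution to mimic the "no detectable difference" outcome in Table A3); the statistic and p-value here are illustrative, not the study's.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(7)
# Hypothetical NSE samples for regulated vs. unregulated basins,
# drawn from the same distribution
nse_regulated = rng.normal(0.55, 0.15, 150)
nse_unregulated = rng.normal(0.55, 0.15, 250)

# Two-sided rank-based test; a large p-value indicates no detectable
# difference between the groups
u_stat, p_value = mannwhitneyu(nse_regulated, nse_unregulated,
                               alternative="two-sided")
```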

Appendix A.2.2. Basin Size Effects

Table A4. ANOVA results for NSE across basin size bins (lowest 1% of NSE values excluded).

| Source | Sum Sq | DF | F | p-value |
|---|---|---|---|---|
| C(area_bin) | 2.5953 | 4 | 18.7176 | 0.0000 |
| Residual | 21.4918 | 635 | | |
Table A5. Tukey's HSD post hoc comparisons for NSE across basin size bins (lowest 1% of NSE values excluded).

| Group 1 | Group 2 | Mean Diff. | p-adj | Lower | Upper | Reject |
|---|---|---|---|---|---|---|
| Bin 1 | Bin 2 | 0.0927 | 0.0009 | 0.0282 | 0.1571 | True |
| Bin 1 | Bin 3 | 0.1163 | 0.0 | 0.0517 | 0.1808 | True |
| Bin 1 | Bin 4 | 0.1465 | 0.0 | 0.0820 | 0.2109 | True |
| Bin 1 | Bin 5 | 0.1945 | 0.0 | 0.1298 | 0.2592 | True |
| Bin 2 | Bin 3 | 0.0236 | 0.8533 | −0.0407 | 0.0879 | False |
| Bin 2 | Bin 4 | 0.0538 | 0.1482 | −0.0104 | 0.1180 | False |
| Bin 2 | Bin 5 | 0.1018 | 0.0002 | 0.0374 | 0.1662 | True |
| Bin 3 | Bin 4 | 0.0302 | 0.7009 | −0.0341 | 0.0945 | False |
| Bin 3 | Bin 5 | 0.0782 | 0.0086 | 0.0136 | 0.1428 | True |
| Bin 4 | Bin 5 | 0.0480 | 0.2490 | −0.0164 | 0.1124 | False |

Appendix A.2.3. Land Cover Effects

Table A6. ANOVA results for NSE across land cover categories (lowest 1% of NSE values excluded).

| Source | Sum Sq | DF | F | p-value |
|---|---|---|---|---|
| C(area_bin) | 0.4023 | 3 | 3.5773 | 0.0138 |
| Residual | 23.6780 | 636 | | |
Table A7. Tukey's HSD post hoc comparisons for NSE across land cover categories (lowest 1% of NSE values excluded).

| Group 1 | Group 2 | Mean Diff. | p-adj | Lower | Upper | Reject |
|---|---|---|---|---|---|---|
| AG | MX | 0.0278 | 0.5655 | −0.0274 | 0.0830 | False |
| AG | UD | 0.0454 | 0.1498 | −0.0099 | 0.1007 | False |
| AG | UR | −0.0400 | 0.5667 | −0.1195 | 0.0396 | False |
| MX | UD | 0.0176 | 0.7841 | −0.0307 | 0.0659 | False |
| MX | UR | −0.0678 | 0.0918 | −0.1426 | 0.0071 | False |
| UD | UR | −0.0854 | 0.0181 | −0.1603 | −0.0105 | True |

Appendix A.3. Case Study Visuals for Representative Basins

To complement the quantitative summary in Table 8, this section provides detailed hydrographs and quantile error plots for the three representative case study basins. These visualizations illustrate the model’s performance characteristics under different hydrological regimes. Figure A2 displays the observed versus simulated time series, while Figure A3 details the systematic bias across the flow distribution for each basin.
Figure A2. Observed vs. simulated hydrographs for the three case study basins, illustrating overall model performance on (a) a small headwater, (b) an urban, and (c) a large, undeveloped catchment.
Figure A3. Mean error (bias) by observed flow quantile for the three case study basins. Positive values indicate overestimation. The plots reveal systematic biases, particularly the underestimation of high flows across all basin types.

References

  1. Gaborit, É.; Mai, J.; Princz, D.G.; Arsenault, R.; Fortin, V.; Tolson, B.A. Hydrologic outputs generated over the Great Lakes with a calibrated version of the GEM-Hydro model. Sci. Data 2025, 12, 127. [Google Scholar] [CrossRef]
  2. Petering, D.H.; Klump, V. Importance of the Great Lakes; Source of North American Drinking Water, Technical Report; Great Lakes Commission: Ann Arbor, MI, USA, 2003. [Google Scholar]
  3. U.S. Environmental Protection Agency. Climate Change Connections: Michigan (The Great Lakes). Available online: https://www.epa.gov/climateimpacts/climate-change-connections-michigan-great-lakes (accessed on 25 January 2025).
  4. Hong, Y.; Kessler, J.; Titze, D.; Yang, Q.; Shen, X.; Anderson, E.J. Towards efficient coastal flood modeling: A comparative assessment of bathtub, extended hydrodynamic, and total water level approaches. Ocean Dyn. 2024, 74, 391–405. [Google Scholar] [CrossRef]
  5. Hong, Y.; Do, H.X.; Kessler, J.; Fry, L.; Read, L.; Rafieei Nasab, A.; Gronewold, A.D.; Mason, L.; Anderson, E.J. Evaluation of gridded precipitation datasets over international basins and large lakes. J. Hydrol. 2022, 607, 127507. [Google Scholar] [CrossRef]
  6. Croley, T.E.I. Modified Great Lakes Hydrology Modeling System for Considering Simple Extreme Climates; NOAA Technical Memorandum GLERL-137; NOAA Great Lakes Environmental Research Laboratory: Ann Arbor, MI, USA, 2006; GLERL Contribution No. 1386.
  7. Lofgren, B.M.; Rouhana, J. Physically plausible methods for projecting changes in Great Lakes water levels under climate change scenarios. J. Hydrometeorol. 2016, 17, 2209–2223. [Google Scholar] [CrossRef]
  8. Mai, J.; Shen, H.; Tolson, B.A.; Gaborit, É.; Arsenault, R.; Craig, J.R.; Fortin, V.; Fry, L.M.; Gauch, M.; Klotz, D.; et al. The Great Lakes Runoff Intercomparison Project Phase 4: The Great Lakes (GRIP-GL). Hydrol. Earth Syst. Sci. 2022, 26, 3537–3572. [Google Scholar] [CrossRef]
  9. Enemark, T.; Peeters, L.J.M.; Mallants, D.; Batelaan, O. Hydrogeological conceptual model building and testing: A review. J. Hydrol. 2019, 569, 310–329. [Google Scholar] [CrossRef]
  10. Wu, K.; Johnston, C.A. Hydrologic response to climatic variability in a Great Lakes watershed: A case study with the SWAT model. J. Hydrol. 2007, 337, 187–199. [Google Scholar] [CrossRef]
  11. Pietroniro, A.; Fortin, V.; Kouwen, N.; Neal, C.; Turcotte, R.; Davison, B.; Verseghy, D.; Soulis, E.D.; Caldwell, R.; Evora, N.; et al. Development of the MESH modelling system for hydrological ensemble forecasting of the Laurentian Great Lakes at the regional scale. Hydrol. Earth Syst. Sci. 2007, 11, 1279–1294. [Google Scholar] [CrossRef]
  12. Frame, J.M.; Kratzert, F.; Raney, A.; Rahman, M.; Salas, F.; Nearing, G.S. Post-processing the national water model with long short-term memory networks for streamflow predictions and model diagnostics. JAWRA J. Am. Water Resour. Assoc. 2021, 57, 959–977. [Google Scholar] [CrossRef]
  13. Xue, P.; Wagh, A.; Ma, G.; Wang, Y.; Yang, Y.; Liu, T.; Huang, C. Integrating deep learning and hydrodynamic modeling to improve the Great Lakes forecast. Remote Sens. 2022, 14, 2640. [Google Scholar] [CrossRef]
  14. Khandelwal, A.; Xu, S.; Li, X.; Jia, X.; Stisen, S.; Duffy, C.; Nieber, J.; Kumar, V. Physics guided machine learning methods for hydrology. arXiv 2020. [Google Scholar] [CrossRef]
  15. Sahoo, B.B.; Jha, R.; Singh, A.; Kumar, D. Long short-term memory (LSTM) recurrent neural network for low-flow hydrological time series forecasting. Acta Geophys. 2019, 67, 1471–1481. [Google Scholar] [CrossRef]
  16. Kurt, O. Model-based prediction of water levels for the Great Lakes: A comparative analysis. Earth Sci. Inform. 2024, 17, 3333–3349. [Google Scholar] [CrossRef]
  17. Kratzert, F.; Klotz, D.; Shalev, G.; Klambauer, G.; Hochreiter, S.; Nearing, G. Toward improved predictions in ungauged basins: Exploiting the power of machine learning. Water Resour. Res. 2019, 55, 11344–11354. [Google Scholar] [CrossRef]
  18. Sabzipour, B.; Arsenault, R.; Troin, M.; Martel, J.L.; Brissette, F.; Brunet, F.; Mai, J. Comparing a long short-term memory (LSTM) neural network with a physically based hydrological model for streamflow forecasting over a Canadian catchment. J. Hydrol. 2023, 627, 130380. [Google Scholar] [CrossRef]
  19. Kratzert, F.; Klotz, D.; Brenner, C.; Schulz, K.; Herrnegger, M. Rainfall–runoff modelling using long short-term memory (LSTM) networks. Hydrol. Earth Syst. Sci. 2018, 22, 6005–6022. [Google Scholar] [CrossRef]
  20. Gauch, M.; Kratzert, F.; Klotz, D.; Nearing, G.; Lin, J.; Hochreiter, S. Rainfall–runoff prediction at multiple timescales with a single long short-term memory network. Hydrol. Earth Syst. Sci. 2021, 25, 2045–2062. [Google Scholar] [CrossRef]
  21. Li, F.F.; Wang, Z.Y.; Zhao, X.; Xie, E.; Qiu, J. Decomposition-ANN Methods for Long-Term Discharge Prediction Based on Fisher’s Ordered Clustering with MESA. Water Resour. Manag. 2019, 33, 3095–3110. [Google Scholar] [CrossRef]
  22. Li, F.F.; Wang, Z.Y.; Zhao, X.; Xie, E.; Qiu, J. Daily Streamflow Forecasting Based on Flow Pattern Recognition. Water Resour. Manag. 2021, 35, 4521–4540. [Google Scholar] [CrossRef]
  23. Li, F.F.; Wang, Z.Y.; Zhao, X.; Xie, E.; Qiu, J. Long-term Streamflow Forecasting Using Artificial Neural Network Based on Preprocessing Technique. J. Forecast. 2019, 38, 192–206. [Google Scholar] [CrossRef]
  24. Chen, Y.; Xue, P. Dual-transformer deep learning framework for seasonal forecasting of Great Lakes water levels. J. Geophys. Res. Hydrol. 2025, 2, e2024JH000519. [Google Scholar] [CrossRef]
  25. Allan, J.D. Landscapes and riverscapes: The influence of land use on stream ecosystems. Annu. Rev. Ecol. Evol. Syst. 2004, 35, 257–284. [Google Scholar] [CrossRef]
  26. DeFries, R.S.; Townshend, J.R.G. NDVI-derived land cover classifications at a global scale. Int. J. Remote Sens. 1994, 15, 3567–3586. [Google Scholar] [CrossRef]
  27. Flint, R.F. Glacial and Quaternary Geology; John Wiley & Sons: New York, NY, USA, 1971. [Google Scholar]
  28. Hough, J.L. Geology of the Great Lakes; University of Illinois Press: Urbana, IL, USA, 1958. [Google Scholar]
  29. Kottek, M.; Grieser, J.; Beck, C.; Rudolf, B.; Rubel, F. World map of the Köppen-Geiger climate classification updated. Meteorol. Z. 2006, 15, 259–263. [Google Scholar] [CrossRef] [PubMed]
  30. Prowse, T.D.; Beltaos, S. Climatic control of river-ice hydrology: A review. Hydrol. Process. 2002, 16, 805–822. [Google Scholar] [CrossRef]
  31. Thornton, P.E.; Thornton, M.M.; Mayer, B.W.; Wei, Y.; Devarakonda, R.; Vose, R.S.; Cook, R.B. Daymet: Daily Surface Weather Data on a 1-km Grid for North America, Version 4 R1; ORNL Distributed Active Archive Center: Oak Ridge, TN, USA, 2025. [CrossRef]
  32. Lehner, B.; Grill, G. Available online: https://data.hydrosheds.org/file/technical-documentation/HydroSHEDS_TechDoc_v1_4.pdf (accessed on 25 January 2025).
  33. Bushnell, M. Manual for Real-Time Quality Control of Water Level Data; National Oceanic and Atmospheric Administration: Silver Spring, MD, USA, 2021.
34. World Meteorological Organization. Guide to Hydrological Practices, Volume I: Hydrology—From Measurement to Hydrological Information, 6th ed.; WMO-No. 168; WMO: Geneva, Switzerland, 2008.
35. Conte, L.C.; Bayer, D.M.; Bayer, F.M. Bootstrap Pettitt test for detecting change points in hydro-climatological time series. Hydrol. Sci. J. 2019, 64, 1499–1513.
36. Buishand, T.A. Some methods for testing the homogeneity of rainfall records. J. Hydrol. 1982, 58, 11–27.
37. Killick, R.; Fearnhead, P.; Eckley, I.A. Optimal detection of changepoints with a linear computational cost. J. Am. Stat. Assoc. 2012, 107, 1590–1598.
38. Adams, R.P.; MacKay, D.J.C. Bayesian Online Changepoint Detection. arXiv 2007.
39. Reeves, J.; Chen, J.; Wang, X.L.; Lund, R.; Lu, Q. A Review and Comparison of Changepoint Detection Techniques for Climate Data. J. Appl. Meteorol. Climatol. 2007, 46, 900–915.
40. Lipton, Z.C.; Berkowitz, J.; Elkan, C. A critical review of recurrent neural networks for sequence learning. arXiv 2015.
41. Schmidt, R.M. Recurrent neural networks (RNNs): A gentle introduction and overview. arXiv 2019.
42. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
43. Kratzert, F.; Klotz, D.; Shalev, G.; Klambauer, G.; Hochreiter, S.; Nearing, G. Towards learning universal, regional, and local hydrological behaviors via machine learning applied to large-sample datasets. Hydrol. Earth Syst. Sci. 2019, 23, 5089–5110.
44. Kratzert, F.; Gauch, M.; Nearing, G.; Klotz, D. NeuralHydrology—A Python library for deep learning research in hydrology. J. Open Source Softw. 2022, 7, 4050.
45. Frame, J.M.; Kratzert, F.; Klotz, D.; Gauch, M.; Shalev, G.; Gilon, O.; Qualls, L.M.; Gupta, H.V.; Nearing, G.S. Deep learning rainfall–runoff predictions of extreme events. Hydrol. Earth Syst. Sci. 2022, 26, 3377–3392.
46. Arsenault, R.; Martel, J.L.; Brunet, F.; Brissette, F.; Mai, J. Continuous streamflow prediction in ungauged basins: Long short-term memory neural networks clearly outperform traditional hydrological models. Hydrol. Earth Syst. Sci. 2023, 27, 139–157.
47. Cosgrove, B.; Gochis, D.; Flowers, T.; Dugger, A.; Ogden, F.; Graziano, T.; Clark, E.; Cabell, R.; Casiday, N.; Cui, Z.; et al. NOAA's National Water Model: Advancing operational hydrology through continental-scale modeling. JAWRA J. Am. Water Resour. Assoc. 2024, 60, 247–272.
48. Eryani, I.G.A.P.; Jayantari, M.W.; Wijaya, I.K. Sensitivity analysis in parameter calibration of the WEAP model for integrated water resources management in Unda watershed. Civ. Eng. Archit. 2022, 10, 455–469.
49. Grenier, M.; Boudreault, M.; Carozza, D.A.; Boudreault, J.; Raymond, S. Flood occurrence and impact models for socioeconomic applications over Canada and the United States. Nat. Hazards Earth Syst. Sci. 2024, 24, 2577–2595.
50. Klotz, D.; Kratzert, F.; Gauch, M.; Sampson, A.K.; Klambauer, G.; Hochreiter, S.; Nearing, G.S. Uncertainty estimation with deep learning for rainfall–runoff modelling. Hydrol. Earth Syst. Sci. 2022, 26, 1673–1693.
51. Lim, B.; Zohren, S. Time-series forecasting with deep learning: A survey. Philos. Trans. R. Soc. A Math. Phys. Eng. Sci. 2021, 379, 20200209.
52. Zhang, D.; Peng, Q.; Lin, J.; Wang, D.; Liu, X.; Zhuang, J. Simulating reservoir operation using a recurrent neural network algorithm. Water 2019, 11, 865.
53. Wen, Q.; Zhou, T.; Zhang, C.; Chen, W.; Ma, Z.; Yan, J.; Sun, L. Transformers in time series: A survey. arXiv 2023.
54. Waqas, M.; Wannasingha Humphries, U. A critical review of RNN and LSTM variants in hydrological time series predictions. MethodsX 2024, 13, 102946.
55. Liu, J.; Bian, Y.; Shen, C. Probing the limit of hydrologic predictability with the transformer network. arXiv 2023.
56. Heudorfer, B.; Liesch, T.; Broda, S. On the challenges of global entity-aware deep learning models for groundwater level prediction. Hydrol. Earth Syst. Sci. 2024, 28, 525–545.
57. Heudorfer, B.; Gupta, H.V.; Loritz, R. Are deep learning models in hydrology entity aware? Geophys. Res. Lett. 2024, 51, e2024GL113036.
58. Nearing, G.S.; Kratzert, F.; Sampson, A.K.; Pelissier, C.S.; Klotz, D.; Frame, J.M.; Raj, C.; Hochreiter, S. What role does hydrological science play in the age of machine learning? Water Resour. Res. 2021, 57, e2020WR028091.
59. Do, H.X.; Gudmundsson, L.; Leonard, M.; Westra, S. The global streamflow indices and metadata archive (GSIM)—Part 1: The production of a daily streamflow archive and metadata. Earth Syst. Sci. Data 2018, 10, 765–785.
60. Oruche, R.; Egede, L.; Baker, T.; O'Donncha, F. Transfer learning to improve streamflow forecasts in data sparse regions. arXiv 2021.
Figure 1. Study domain and key characteristics of the 975 sub-basins in the Laurentian Great Lakes Basin. (a) Spatial distribution and dominant land cover of the studied basins. Colors represent the primary land cover type: forest (35%), agriculture (37%), open water (20%), urban (6%), and wetland (2%). The study domain boundary is indicated by the red dashed line. (b) Box plots showing the distribution of drainage areas (in km2) on a log 10 scale, grouped by dominant land cover. The overall median drainage area for all basins is 304 km2, with values ranging from 4.1 to 16,388 km2. (c) Box plots showing the distribution of mean annual discharge (in m3/s) on a log 10 scale, grouped by dominant land cover. The overall median discharge is 3.60 m3/s, with values ranging from 0.038 to 204 m3/s. In the box plots, the boxes represent the interquartile range (IQR), the central line is the median, and whiskers extend to 1.5 times the IQR; points beyond the whiskers are statistical outliers.
Figure 2. Illustration of the EA-LSTM relative to the standard LSTM [43]. (a) shows the standard LSTM cell, where the input vector x_t is passed into the three main gates (forget, input, and output) as described in Section 3.3. (b) visualizes the EA-LSTM cell, which separates static inputs x_s (blue circle) from dynamic inputs x_d[t] and processes them accordingly. Adapted from Kratzert et al. [43].
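To make the gating difference in Figure 2 concrete, the following NumPy sketch runs a single EA-LSTM forward pass in the spirit of Kratzert et al. [43]. It is an illustration only, not the study's code: the weight names (W_i, U_f, etc.), the `init_params` helper, and all dimensions are our own assumptions. The key idea is that the input gate i is computed once from the static attributes x_s and held fixed over the sequence, while the forget, cell, and output gates respond to the dynamic forcings.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ea_lstm_forward(x_s, x_d_seq, n_hidden, params):
    """Run an illustrative EA-LSTM over one input sequence.

    x_s     -- static catchment attributes, shape (n_static,)
    x_d_seq -- dynamic forcings, shape (T, n_dynamic)

    The input gate i depends only on x_s, so it is constant over the
    sequence; the forget/cell/output gates see the dynamic input and
    the previous hidden state, as in a standard LSTM.
    """
    h = np.zeros(n_hidden)
    c = np.zeros(n_hidden)
    i = sigmoid(params["W_i"] @ x_s + params["b_i"])  # static ("entity-aware") input gate
    for x_d in x_d_seq:
        f = sigmoid(params["W_f"] @ x_d + params["U_f"] @ h + params["b_f"])
        g = np.tanh(params["W_g"] @ x_d + params["U_g"] @ h + params["b_g"])
        o = sigmoid(params["W_o"] @ x_d + params["U_o"] @ h + params["b_o"])
        c = f * c + i * g        # static gate modulates how new information enters
        h = o * np.tanh(c)
    return h

def init_params(n_static, n_dynamic, n_hidden, seed=0):
    """Random illustrative weights (hypothetical initialization scheme)."""
    rng = np.random.default_rng(seed)
    p = {"W_i": rng.normal(0, 0.1, (n_hidden, n_static)), "b_i": np.zeros(n_hidden)}
    for gate in "fgo":
        p[f"W_{gate}"] = rng.normal(0, 0.1, (n_hidden, n_dynamic))
        p[f"U_{gate}"] = rng.normal(0, 0.1, (n_hidden, n_hidden))
        p[f"b_{gate}"] = np.zeros(n_hidden)
    p["b_f"] += 3.0  # positive initial forget bias, as in Table 4
    return p
```

In the trained model the hidden state would feed a linear head that predicts discharge; here we stop at the hidden state to keep the sketch focused on the gating.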
Figure 3. Comparison of model performance using empirical cumulative distribution functions (CDFs). The figure shows the distribution of Nash–Sutcliffe Efficiency (NSE; left column) and Kling–Gupta Efficiency (KGE; right column) for the EA-LSTM, LSTM, and NWM benchmark models. The top row displays results from the temporal-split evaluation, while the bottom row shows results from the spatial split. Solid lines represent the median CDF calculated across all test basins, and the shaded regions indicate the 95% confidence interval derived from bootstrapping. For both metrics, a curve shifted further to the right represents higher model performance.
Figure 4. The spatial distribution of performance across the basins evaluated with the temporal EA-LSTM model. The performance categories are distinguished by color.
Figure 5. Model performance across different configurations and basin characteristics.
Figure 6. Physical plausibility checks for the temporal EA-LSTM model, demonstrating that it has learned realistic (a) water balance and (b) snowmelt dynamics. (a) Distribution of correlation coefficients between a P-ET proxy and simulated discharge (Q) across all test basins. (b) Best SWE–Q lag correlation for snow-dominated basins. The color indicates the strength of the correlation coefficient.
Table 1. Comparison of selected hydrological modeling approaches relevant to the Great Lakes basin. The table highlights differences in spatial coverage and calibration requirements, emphasizing the novelty of the proposed EA-LSTM framework.
| Approach | Specific Model(s)/Study | Primary Inputs | Spatial Coverage | Calibration Requirement |
|---|---|---|---|---|
| Process-based | LBRM, MESH-SVS-Raven, GRIP-GL models [6,8] | Meteorological data, physical basin parameters | Great Lakes basin | Basin-specific calibration required |
| Hybrid ML | LSTM postprocessing of NWM [12] | NWM model outputs | Continental U.S. | Training of LSTM postprocessor |
| Data-driven | Kratzert et al. (2019) [17] | Meteorological forcings, static attributes | Regional (U.S. CAMELS dataset) | No basin-specific calibration |
| Data-driven | This study (EA-LSTM) | Meteorological forcings, static attributes | International (full Great Lakes Basin) | No basin-specific calibration |
Table 2. Dynamic meteorological inputs and the target variable (discharge).
| Variable | Description |
|---|---|
| dayl | Day length (seconds) |
| prcp | Precipitation (mm) |
| srad | Solar radiation (W/m2) |
| swe | Snow water equivalent (mm) |
| tmax | Maximum temperature (°C) |
| tmin | Minimum temperature (°C) |
| vp | Vapor pressure (Pa) |
| discharge | Streamflow or river discharge (m3/s) |
Table 3. Preprocessing steps applied to each variable.
| Variable | Preprocessing |
|---|---|
| Discharge (raw) | (1) Remove missing rows and basins with no data. (2) Filter point-wise anomalies (≥14 days constant). (3) Filter basin-wise outliers (IQR rule). (4) Normalize by area. (5) Log-transform. |
| Precipitation, SWE | Log-transform. |
| Dynamic climate variables | Standardize (global mean and variance from training set). |
| Static catchment attributes | No preprocessing (used as raw features). |
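The discharge pipeline in Table 3 can be sketched as follows. The exact thresholds, the IQR multiplier of 1.5, the area-normalization units (m3/s converted to mm/day via the factor 86.4/area), and the use of log1p are our assumptions where the table leaves details unstated.

```python
import numpy as np

def constant_run_mask(q, min_len=14):
    """True where q is NOT part of a constant run of >= min_len samples
    (step 2 in Table 3: point-wise anomaly filter)."""
    q = np.asarray(q, dtype=float)
    keep = np.ones(q.size, dtype=bool)
    start = 0
    for i in range(1, q.size + 1):
        if i == q.size or q[i] != q[start]:
            if i - start >= min_len:
                keep[start:i] = False  # flag the whole constant run
            start = i
    return keep

def preprocess_discharge(q_m3s, area_km2):
    """Illustrative version of the Table 3 discharge pipeline."""
    q = np.asarray(q_m3s, dtype=float)
    q = q[~np.isnan(q)]                          # (1) drop missing values
    q = q[constant_run_mask(q)]                  # (2) constant-run anomalies
    q1, q3 = np.percentile(q, [25, 75])          # (3) IQR outlier rule
    iqr = q3 - q1
    q = q[(q >= q1 - 1.5 * iqr) & (q <= q3 + 1.5 * iqr)]
    q_mm_day = q * 86.4 / area_km2               # (4) area normalization (assumed units)
    return np.log1p(q_mm_day)                    # (5) log transform
```

Applying the same series of steps per basin before training keeps the target distribution comparable across catchments of very different sizes.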
Table 4. Hyperparameters for EA-LSTM model training.
| Hyperparameter | Value | Description |
|---|---|---|
| Hidden size | 256 | Number of cell states in the LSTM. |
| Initial forget bias | 3 | Initial bias value for the forget gate. |
| Output dropout | 0.3 | Dropout rate applied to the LSTM output. |
| Optimizer | Adam | Optimization algorithm used. |
| Loss function | MSE | Mean Squared Error loss used for training. |
| Learning rate | 0.0001 | Learning rate applied for updating model parameters. |
| Batch size | 128 | Number of samples per training batch. |
| Epochs | 50 | Total number of training epochs. |
| Sequence length | 365 | Length of the input sequence for the model. |
| Seed | 142589 | Random seed to reproduce the results. |
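For readers reproducing this setup, the Table 4 settings might be written as a training configuration in the style of the NeuralHydrology library [44]. The key names below follow that library's general conventions but are illustrative; they are not taken from the study's actual configuration file.

```yaml
# Illustrative EA-LSTM training configuration (values from Table 4);
# key names are NeuralHydrology-style and should be treated as a sketch.
model: ealstm
hidden_size: 256
initial_forget_bias: 3
output_dropout: 0.3
optimizer: Adam
loss: MSE
learning_rate: 0.0001
batch_size: 128
epochs: 50
seq_length: 365
seed: 142589
```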
Table 5. Summary of model performance. All of our models outperform the baseline NWM in both temporal- and spatial-split settings, and the temporal EA-LSTM produces the most accurate predictions. Note that the EA-LSTM variants also outperform the vanilla LSTM variants. Within each category, the highest NSE and KGE values are highlighted in bold. The terms 25% Q, 50% Q, and 75% Q denote the first quartile (25th percentile), median (50th percentile), and third quartile (75th percentile) of the metric scores across all test basins, respectively.
Metric scores (NSE and KGE both range over (−∞, 1]):

| Model | NSE 25% Q | NSE 50% Q | NSE 75% Q | NSE Max | KGE 25% Q | KGE 50% Q | KGE 75% Q | KGE Max |
|---|---|---|---|---|---|---|---|---|
| LSTM (Temporal) | 0.397 | 0.572 | 0.683 | 0.836 | 0.335 | 0.560 | 0.718 | 0.903 |
| LSTM (Spatial) | 0.187 | 0.480 | 0.632 | 0.804 | 0.286 | 0.502 | 0.645 | 0.883 |
| EA-LSTM (Temporal) | **0.547** | **0.690** | **0.769** | **0.898** | **0.534** | **0.685** | **0.794** | **0.934** |
| EA-LSTM (Spatial) | 0.310 | 0.569 | 0.714 | 0.854 | 0.354 | 0.567 | 0.702 | 0.896 |
| NWM (Temporal) | −0.066 | 0.209 | 0.453 | 0.793 | 0.266 | 0.443 | 0.631 | 0.881 |
| NWM (Spatial) | −0.142 | 0.115 | 0.363 | 0.727 | 0.196 | 0.435 | 0.606 | 0.820 |

Number of gauges per performance category:

| Model | Negative NSE | Low NSE | Moderate NSE | High NSE | Total |
|---|---|---|---|---|---|
| LSTM (Temporal) | 53 | 86 | 436 | 57 | 632 |
| LSTM (Spatial) | 34 | 34 | 89 | 11 | 168 |
| EA-LSTM (Temporal) | 15 | 51 | 328 | 204 | 632 |
| EA-LSTM (Spatial) | 21 | 28 | 91 | 28 | 168 |
| NWM (Temporal) | 170 | 221 | 187 | 5 | 587 |
| NWM (Spatial) | 43 | 50 | 32 | 0 | 125 |
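The two skill scores reported in Table 5 have standard definitions, sketched below: NSE compares squared errors to the variance of the observations, and KGE (in the common Gupta et al., 2009 formulation, which we assume here) combines correlation, a variability ratio, and a bias ratio.

```python
import numpy as np

def nse(obs, sim):
    """Nash-Sutcliffe Efficiency: 1 is perfect; <= 0 means no better
    than predicting the observed mean."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)

def kge(obs, sim):
    """Kling-Gupta Efficiency (assumed Gupta et al., 2009 form):
    1 - Euclidean distance from the ideal point (r, alpha, beta) = (1, 1, 1)."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    r = np.corrcoef(obs, sim)[0, 1]      # linear correlation
    alpha = sim.std() / obs.std()        # variability ratio
    beta = sim.mean() / obs.mean()       # bias ratio
    return 1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)
```

Because KGE decomposes into correlation, variability, and bias terms, a model can score well on NSE while KGE reveals a systematic variability or bias problem, which is why both are reported.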
Table 6. Performance category criteria based on NSE metric [48].
| Category | NSE Range |
|---|---|
| High accuracy | NSE > 0.75 |
| Moderate accuracy | 0.36 < NSE ≤ 0.75 |
| Low accuracy | 0 < NSE ≤ 0.36 |
| Negative NSE | NSE ≤ 0 |
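The Table 6 thresholds map directly to a small classification helper; the function name is ours, but the boundaries are exactly those of the table.

```python
def nse_category(nse_value: float) -> str:
    """Map an NSE score to the performance class of Table 6 [48].
    Boundaries are half-open, matching 0.36 < NSE <= 0.75 etc."""
    if nse_value > 0.75:
        return "High accuracy"
    if nse_value > 0.36:
        return "Moderate accuracy"
    if nse_value > 0.0:
        return "Low accuracy"
    return "Negative NSE"
```

Note that the boundary values themselves (0.75, 0.36, 0) fall into the lower class, consistent with the ≤ signs in the table.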
Table 7. Summary statistics of NSE by regulation status, basin size category, and land cover class. Mean values are vulnerable to extreme values, and the negative mean values reflect the effect of such outliers. Detailed ANOVA and post hoc results are provided in the Appendix A.
| Group | N | Mean NSE | Median NSE | 25% Quartile | 75% Quartile |
|---|---|---|---|---|---|
| Regulated | 67 | 0.604 | 0.676 | 0.541 | 0.773 |
| Unregulated | 565 | 0.520 | 0.691 | 0.547 | 0.769 |
| Bin 1 (smallest) | 127 | 0.493 | 0.554 | 0.441 | 0.645 |
| Bin 2 | 126 | 0.617 | 0.661 | 0.547 | 0.737 |
| Bin 3 | 126 | 0.550 | 0.697 | 0.615 | 0.753 |
| Bin 4 | 126 | 0.670 | 0.731 | 0.617 | 0.786 |
| Bin 5 (largest) | 127 | 0.317 | 0.776 | 0.660 | 0.821 |
| Undeveloped (UD) | 219 | 0.385 | 0.721 | 0.594 | 0.790 |
| Urban (UR) | 57 | 0.572 | 0.666 | 0.492 | 0.751 |
| Mixed-use (MX) | 221 | 0.617 | 0.675 | 0.547 | 0.763 |
| Agricultural (AG) | 135 | 0.600 | 0.665 | 0.532 | 0.752 |
Table 8. Case study analysis of mean error (bias) by observed flow quantile for three representative basin types. Bias is calculated as (Simulated − Observed). Positive values indicate overestimation.
| Metric | Small Headwater (02HJ005) | Urban (04087159) | Large and Undeveloped (04228500) |
|---|---|---|---|
| Overall NSE | 0.358 | 0.509 | 0.770 |
| Overall KGE | 0.296 | 0.406 | 0.844 |
| Bias in low flows (0–30%) | +0.0059 | +0.0882 | +1.7692 |
| Bias in mid flows (30–80%) | +0.0063 | +0.0953 | +1.3361 |
| Bias in high flows (80–100%) | −0.0864 | −1.1192 | −22.4264 |
Table 9. Summary statistics for extreme-event metrics across all test basins. PBIAShigh and FHV are calculated for flows exceeding the 95th percentile, while PBIASlow and FLV are for flows below the 10th percentile.
| Metric | Mean | Median (50%) | 25% Quartile | 75% Quartile |
|---|---|---|---|---|
| PBIAShigh (%) | −30.68 | −29.56 | −42.20 | −18.74 |
| FHV | 0.18 | 0.17 | 0.12 | 0.22 |
| PBIASlow (%) | 91.73 | 22.84 | 6.54 | 62.11 |
| FLV | 0.03 | 0.02 | 0.01 | 0.04 |
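The conditional percent-bias metrics in Table 9 can be computed as below. This is a sketch: we show only the PBIAS part (FHV and FLV have several definitions in the literature and are omitted), and the strictness of the quantile thresholds is our assumption.

```python
import numpy as np

def pbias(obs, sim):
    """Percent bias: positive values indicate overestimation."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return 100.0 * np.sum(sim - obs) / np.sum(obs)

def extreme_flow_pbias(obs, sim, hi_q=0.95, lo_q=0.10):
    """PBIAS restricted to observed high flows (above the 95th
    percentile) and low flows (below the 10th percentile)."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    hi = obs > np.quantile(obs, hi_q)
    lo = obs < np.quantile(obs, lo_q)
    return pbias(obs[hi], sim[hi]), pbias(obs[lo], sim[lo])
```

A median PBIAShigh near −30% therefore means the model's flows during the largest observed events are, in aggregate, about 30% too low, matching the peak-underestimation pattern in Table 8.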
Table 10. Performance distribution of the spatial-split EA-LSTM under different cross-validation stratification strategies. The stability of the median values across all splits demonstrates model robustness.
| Stratification Strategy | NSE 25% Q | NSE Median | NSE 75% Q | NSE Max | KGE 25% Q | KGE Median | KGE 75% Q | KGE Max |
|---|---|---|---|---|---|---|---|---|
| Area-based (primary) | 0.310 | 0.569 | 0.714 | 0.854 | 0.354 | 0.567 | 0.702 | 0.896 |
| Precipitation-based | 0.272 | 0.527 | 0.663 | 0.841 | 0.287 | 0.505 | 0.687 | 0.915 |
| Temperature-based | 0.271 | 0.524 | 0.687 | 0.895 | 0.325 | 0.541 | 0.689 | 0.912 |
Table 11. Summary of ablation experiments on static attributes. Median values across all basins are reported. EA-LSTM without static features retains longitude/latitude only.
| Model | Split | Median NSE | Median KGE |
|---|---|---|---|
| LSTM (full statics) | Temporal | 0.572 | 0.560 |
| EA-LSTM (full statics) | Temporal | 0.690 | 0.685 |
| EA-LSTM (full statics) | Spatial | 0.569 | 0.567 |
| LSTM (no statics) | Temporal | 0.119 | 0.157 |
| EA-LSTM (lon/lat only) | Temporal | 0.242 | 0.299 |
| EA-LSTM (lon/lat only) | Spatial | 0.284 | 0.330 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

Park, Y.; Liu, X.; Zhu, Y.; Hong, Y. Using Entity-Aware LSTM to Enhance Streamflow Predictions in Transboundary and Large Lake Basins. Hydrology 2025, 12, 261. https://doi.org/10.3390/hydrology12100261