1. Introduction
One of the major achievements of modern Earth sciences is the rapid advancement in climate observation, monitoring of anthropogenic impacts, and the accumulation of environmental data. This progress is exemplified by the development of prominent platforms such as Google Earth Engine, Earthdata/OpenET, Sentinel Hub, and Microsoft Planetary Computer, among others. These systems integrate satellite observations, meteorological station records, and climate model outputs, covering the majority of the Earth’s surface. The accessibility of such data enables forecasting of hazardous events and natural disasters even in regions without established meteorological monitoring networks [1,2,3,4]. Among these events, floods and inundations are becoming increasingly frequent, yet under global change, new uncertainties make these events even harder to predict. According to the World Meteorological Organization [5], the reported number of flood events has risen by 134% over the two decades since 2000.
Alongside the growing volume of environmental data available for analysis, there has also been significant progress in data analysis and artificial intelligence methods and models. A review [6] cites 228 studies specifically focused on flood prediction using machine learning (ML) approaches. In total, the authors analyzed over 6596 articles and identified 180 original and influential works, including several pioneering studies [7,8,9,10,11,12,13,14,15]. Forecasting approaches encompass both physically based hydrological models and data-driven ML models trained on historical data. As noted in [6], “The continuous advancement of ML methods over the last two decades demonstrated their suitability for flood forecasting with an acceptable rate of outperforming conventional approaches.”
Early applications of ML for flood forecasting date back to 1995. For instance, one seminal work [7] explores the use of neural networks for rainfall–runoff prediction, article [8] addresses river stage prediction, and [9] focuses on river flood forecasting using neural networks. One study [10] is among the first to apply the support vector machine (SVM) method [16] to flood stage forecasting. The abovementioned review [6] covers the evolution of ML and deep-learning (DL) methods in flood prediction from 1995 to 2017, including models such as artificial neural networks (ANN), SVM, adaptive neuro-fuzzy inference systems (ANFIS), wavelet neural networks (WNN), and decision trees (DTs). Notably, long short-term memory (LSTM) networks [17], which have become prominent in time-series forecasting, began to appear in the flood forecasting literature around 2018, with their adoption increasing significantly thereafter.
To the best of our knowledge, studies [18,19] represent some of the earliest applications of the LSTM model in flood forecasting, published nearly concurrently with the comprehensive review of [6]. In study [18], both ANN and LSTM models are trained using data from 1971 to 2013 in the Fen River basin, China, encompassing 14 rainfall stations and one hydrological station. The findings indicate that both models are suitable for rainfall–runoff simulations and outperform traditional conceptual and physically based models. Ref. [19] utilizes an LSTM model on data from 241 watersheds, employing the open CAMELS dataset [20] for streamflow prediction. The authors demonstrate that an LSTM model trained across multiple watersheds performs comparably to the established Sacramento soil moisture accounting model (SAC-SMA) combined with the Snow-17 snow accumulation and ablation model, while offering advantages in computational efficiency. Another study [21], published in 2018, explores the use of LSTM as an alternative to computationally intensive physical models in hydrology. It successfully predicts water table depth based on 14 years of monthly data, including variables such as water diversion, evaporation, precipitation, temperature, and time. The proposed model achieves higher R² scores (0.789–0.952) than traditional feedforward neural networks.
Building upon their previous work, the authors of [19] performed another study [22] in which they trained an LSTM model on 531 basins from the CAMELS dataset, incorporating meteorological time-series data and static catchment attributes. This approach significantly improves performance relative to various hydrological benchmark models. In [23], the interdisciplinary potential of emerging DL models is highlighted. The review emphasizes the gradual adoption of DL in hydrology and aims to provide hydrologists and water resource scientists with a technical overview of DL’s relevance. Subsequently, DL models based on LSTM have been effectively applied to predict river levels in basins characterized by rapid changes in water discharge [24]. Currently, LSTM networks are widely utilized for forecasting river levels, rainfall, water discharge, and even drought conditions [25], owing to their capability to process time-series data and capture long-term dependencies.
Given the diverse natural conditions and the varying presence of meteorological and hydrological monitoring networks, flood prediction studies are tailored to specific regions, considering factors that depend on local observation conditions. In certain studies, predictions rely on data from meteorological stations and hydrological posts, as demonstrated in [17,18,19,20,21,22,24,25,26,27,28,29,30,31,32,33,34]. In some instances, such data suffice to achieve high predictive accuracy. For example, study [26] reports Nash–Sutcliffe efficiency (NSE) values of 99%, 95%, and 87% for one-day, two-day, and three-day forecasts, respectively. In [27], hourly water level measurements from upstream hydrological posts enable high forecast accuracy within a 1 to 24 h interval. However, Ref. [28] highlights the method’s limited robustness for long-term forecasts and its reduced accuracy in predicting peak flow values. Consequently, a series of methodological refinements are implemented, including flow vectorization with respect to minimum, maximum, and average values. Study [29] compares the performance of random forest (RF), gradient boosting regression (GBR), and LSTM methods. The analysis considers both static runoff parameters and dynamic factors, such as precipitation, that influence flow behavior. Overall, the model accurately classifies floods in over 80% of cases and exhibits a relative error in peak flood estimates of less than 30% in most scenarios.
LSTM models have also been employed for flood index forecasting. In studies [30,31], real-time data processing demonstrated high predictive accuracy, with a low root mean square error (RMSE) of approximately 0.1. Study [31] introduces a DL model based on LSTM networks combined with particle swarm optimization (PSO), where PSO is used for automatic hyperparameter tuning. The model focuses on the Jinghe watershed of the Fenhe River and the Lushi watershed of the Luohe River, using precipitation and runoff data from local observation stations. The authors show that the PSO–LSTM model improves forecasting accuracy, especially for lead times greater than six hours, and outperforms both ANN and standard LSTM models in terms of precision and robustness. The same modeling framework is applied in [32] for flash flood prediction in mountainous watersheds, using precipitation and inflow discharge data at the watershed entry points. Forecasts are made for both short-term (1–5 h) and longer-term (6–10 h) intervals. Precipitation is identified as the key predictor driving model performance.
Despite these advancements, study [33] demonstrates a major limitation of LSTM models: prediction errors increase exponentially with longer lead times, diminishing their utility as early-warning tools with sufficient advance notice. To mitigate this issue, the authors incorporate data from nearby monitoring stations. The validation dataset includes hourly discharge records from 2012 to 2017 from six stations along the Humber River in Toronto, Canada. The study tests forecasts with 6 and 12 h lead times using the previous 24 h of discharge data as input. The results indicate that a modified spatiotemporal attention LSTM (STA-LSTM) outperforms CNN-LSTM, ConvLSTM, and standard LSTM models when the forecast horizon exceeds six hours. These findings suggest that integrating spatially distributed input data can substantially improve prediction accuracy.
Several recent studies have also demonstrated that incorporating gridded and spatially distributed data into LSTM-based architectures significantly improves flood forecasting accuracy [35,36,37]. One prominent example is a study [35] focused on flash flood prediction in Ellicott City, Maryland. The authors develop a hybrid ConvLSTM model integrating multiple spatiotemporal data sources, including GPM IMERG satellite precipitation, NEXRAD radar mosaics, and soil moisture fields from the Noah land surface model. They preprocess inputs into 1 km-resolution gridded tensors over a 36 × 48 km domain and feed them into a multi-headed architecture combining ConvLSTM and traditional LSTM layers. The model is trained to predict stream levels at hourly intervals with up to 8 h lead times. Compared to standard LSTM, the hybrid architecture reduces RMSE by approximately 26% during peak events.
In [36], the authors propose ConvLSTM for flood index forecasting in Fijian catchments using spatially distributed daily rainfall from nine stations. Their model outperforms conventional LSTM and feedforward networks, showing that even coarse-resolution spatial inputs enhance flash flood prediction. In [37], a CNN-LSTM model is applied across 226 Canadian basins. Daily reanalysis maps of precipitation and temperature serve as inputs, forming a temporal sequence of spatial climate snapshots. The CNN extracts spatial features, while the LSTM handles temporal dependencies. The model achieves a median NSE of 0.68 and exceeds 0.9 in several ungauged basins. In [38], a spatiotemporal attention LSTM (STA-LSTM) is proposed for sub-hourly flood forecasting in three Chinese basins. Using hourly rainfall from multiple stations and discharge data, the model applies spatial and temporal attention to identify key inputs. It reaches R² values up to 0.96 and significantly improves RMSE and MAPE compared to a baseline LSTM. In [39], the authors forecast daily water levels in Bangladesh using multiple hydrological stations. Their STA-LSTM model integrates upstream and neighboring station data via attention mechanisms, improving forecasts for locations such as Dhaka and Sylhet. The model achieves NSE up to 0.96 and reduces RMSE by over 20% relative to traditional LSTM. These studies confirm that integrating spatially distributed inputs such as precipitation maps, soil moisture fields, and multi-station hydrological data enhances LSTM-based flood forecasting. ConvLSTM, CNN-LSTM, and attention-based models better capture spatiotemporal dependencies, improving peak prediction and enabling longer lead-time warnings.
This transition from point-based to spatially aware models is well supported by a range of recent methodologies for incorporating gridded data. The authors of [40] systematically evaluate LSTMs with catchment-mean rainfall input against models with spatially distributed rainfall. In their work, Daymet [41] rainfall is aggregated from its native 1 km grid resolution to sub-catchment units, and these spatial rainfall vectors are fed into a model that predicts daily river discharge at 1-, 7-, and 15-day lead times; they conclude that the inclusion of spatial information consistently improves model performance. Other studies have also experimented with different approaches to handling gridded data. For example, [42] demonstrates that flood forecasts improve when thirteen GLDAS grid-level meteorological variables at ~25 km resolution are fed into an LSTM to predict next-day streamflow at the Fuping catchment outlet. Input data are first screened with the gamma test, which identifies the most informative cells and establishes a clear, data-driven link between spatial inputs and discharge response. Study [43] builds a grid-based LSTM driven by CMIP6 climate forcing data and shows that adding static grid attributes such as elevation and vegetation cover boosts runoff prediction compared to models using meteorological forcing alone. More specifically, monthly precipitation, temperature, previous monthly runoff, and static DEM + NDVI layers on a ~25 km grid are utilized to project monthly runoff for 2016–2045 in the Yellow River source region under CMIP6 SSP scenarios.
Methodological advances now span the full pipeline, from refined gridded inputs to the deep-learning architectures that process them. To exploit the spatial structure of the data, Ref. [44] embeds gridded rainfall and discharge fields in a ConvLSTM to predict discharge 20 h ahead, enabling the model to learn spatiotemporal patterns that govern flood formation and routing. Taking this a step further, the authors of [45] propose a two-stage pipeline: first, a neural network predicts local runoff on a fine-grained, regular grid; second, another network routes these distributed runoff quantities through the river network. The model’s inputs combine eight daily ERA5-Land meteorological grids at ~8 km resolution with 46 static physiographic layers, driving an LSTM that produces one-day streamflow forecasts. In short, research has evolved from demonstrating that spatially explicit inputs enhance model skill to engineering ever more sophisticated, grid-centric deep-learning architectures that harness those inputs for sharper, more reliable flood forecasts across diverse basins.
In this study, we addressed the problem of predicting water discharge for the Uba River in the Republic of Kazakhstan. A defining characteristic of this river is that most of its basin lies in remote mountainous terrain under the challenging conditions of a sharply continental climate. However, since the basin is located on the edge of a large mountain range that traps moisture-laden westerly air masses, it receives substantial precipitation and experiences pronounced floods. Owing to this remoteness, the basin’s area of approximately 9900 square kilometers contains only two meteorological stations and two hydrological posts, creating data scarcity due to practical constraints. In the upper reaches of the river, snow accumulates from November to March, with the snowpack reaching depths of up to 1.5 to 2 m in some years. In spring, rapid snowmelt leads to a sharp rise in river levels, posing a threat to nearby settlements. Moreover, there are no flow-regulating structures on the river.
To address the challenges posed by data scarcity and complex environmental conditions in the Uba River basin, we propose an approach that leverages deep learning to improve discharge forecasting. We also applied this method to the Middle Fork Flathead River basin based on the availability of input data for the selected model. The Uba and Flathead (Middle Fork) rivers are representative of mountain river basins in a temperate continental climate, with predominantly snow- and glacier-fed regimes and catchment areas ranging from 3000 to 10,000 km². An important feature of both basins is their location within protected natural reserves, which virtually eliminates any anthropogenic influence. Although the methodology proposed in this study is designed to be broadly applicable, it was first validated on hydrologically similar river basins to facilitate result comparison and minimize random influences.
Unlike studies [40,41,42,43,44,45] that downsample reanalysis data or introduce increasingly complex network designs, our study deliberately keeps the workflow simple while leveraging the latest gains in data resolution. Only six expert-selected predictors are used: precipitation, mean and maximum air temperature, snow-water equivalent, soil moisture, and soil temperature. We retain ERA5-Land variables covering the basin at their native ~8 km grid and feed them directly into an LSTM. This grid-level framework offers a transparent and data-efficient strategy for flood forecasting in data-scarce mountainous regions.
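As an illustration of this grid-level framework, the sketch below shows one plausible way to arrange per-cell ERA5-Land variables into LSTM training windows. The shapes (365 days, 166 cells as in the Uba basin grid, six predictors) are taken from this study, while the 30-day lookback and the flattening scheme are illustrative assumptions rather than the exact preprocessing used here.

```python
import numpy as np

# Illustrative shapes: T daily steps, N grid cells, V = 6 predictors
# (precipitation, mean and maximum air temperature, snow-water
# equivalent, soil moisture, soil temperature).
T, N, V = 365, 166, 6
rng = np.random.default_rng(0)
era5_cells = rng.random((T, N, V))  # stand-in for extracted ERA5-Land data

# Flatten the spatial dimension so each day becomes one feature vector
# of length N * V -- the per-step input an LSTM layer consumes.
lstm_input = era5_cells.reshape(T, N * V)

# Build overlapping lookback windows (here 30 days) as training samples.
lookback = 30
windows = np.stack([lstm_input[i:i + lookback]
                    for i in range(T - lookback)])
print(windows.shape)  # (335, 30, 996)
```

Each window would then be paired with the discharge observed at the gauging station on the following day as the training target.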
Thus, this paper formulates three interrelated research questions, as follows.
RQ-1. Can the developed LSTM-grid provide higher accuracy in forecasting water discharge based on ERA5-Land data than the existing point model variants, namely:
- (a) an LSTM using only meteorological data from the 8 × 8 km cell containing the gauging station;
- (b) an LSTM treating the entire basin as a single aggregated cell;
- (c) an LSTM trained on the Caravan dataset?
RQ-2. Which spatial resolution of cells (1 × 1, 2 × 2, or 3 × 3 ERA5-Land cells) provides the best compromise between forecast accuracy and computational cost in the Uba (East Kazakhstan) and Middle Fork Flathead (Montana, USA) river basins?
RQ-3. How critical is full basin coverage? Does the model maintain accuracy and stability when part of the basin is excluded, or does incomplete coverage inevitably worsen the performance?
These research questions were addressed through experiments on forecasting water discharge in the Uba and Middle Fork Flathead River basins.
3. Results and Discussion
3.1. Performance Across All Model Configurations
The boxplot in Figure 7, comparing annual NSE across all nine scenarios for the Uba River basin, clearly demonstrates that the full-grid model LSTM-Uba-grid provides the best performance: it achieves the highest median NSE of 0.905 and the smallest year-to-year variability, indicating both predictive skill and stability.
The 2 × 2 aggregation model LSTM-Uba-2by2 delivers almost the same level of performance, but with a slightly larger spread of values. Random-subset experiments achieve the same median of 0.888 for all values of K, yet their overall variability remains high, undermining reliability (each random-subset scenario aggregates NSE from 100 independent random samplings of K cells). Although it still outperforms the random subsets, the 3 × 3 aggregation model LSTM-Uba-3by3 exhibits a wider spread and an even greater number of outliers, likely due to excessive coarsening and loss of critical spatial information.
Among the reduced-input approaches, the LSTM-Caravan model trained on the Caravan dataset achieved higher central performance and moderate variability, outperforming both single-point LSTM-Uba-point and basin-mean LSTM-L-mean models, but underperforming compared to the spatial input cases.
The boxplot in Figure 8, which shows the annual distribution of NSE values across seven model configurations for the Middle Fork Flathead River basin (2005–2014), supports the findings from the Uba River basin. The full-grid model (LSTM-Flathead-grid) provides the highest forecast accuracy, reaching a median NSE of about 0.93 with minimal interannual variability. As in the Uba River basin, the performance of models with a random selection of cells increases as the total number of cells increases: the K = 50 cell model (LSTM_Flathead_rand_50) is almost comparable to the full grid in most years. Models with aggregated 2 × 2 and 3 × 3 grids also show high NSE values.
In contrast, the single-point model (LSTM-Flathead-point) shows significantly lower performance (median NSE ≈ 0.84) and high instability, including the lowest outlier among all models. It is noteworthy that even models with a random subset of only 30 cells are superior in quality to the point model, underscoring the importance of spatial input, even if partial. In general, although the full-grid model remains the most reliable and stable, models with aggregation (2 × 2, 3 × 3) or random sampling (from 40 cells) show good efficiency in the Middle Fork Flathead basin.
3.2. Comparison of Baseline Models (Simplified Input Models) with LSTM-Grid
From 2012 to 2020, the LSTM-Uba-grid model shows the highest annual NSE values; however, interesting patterns are revealed across the years. As shown in Figure 9, the full-grid model leads in 2012 (0.9052) and 2013 (0.8739), while LSTM-Caravan, pretrained on different basins, lags only slightly, indicating the ability of pretraining to compensate for some of the missing spatial information. In 2014, there is a sharp drop in both models (LSTM-Uba-grid = 0.4920, LSTM-L-mean = 0.4622) due to distortion of the actual discharge data. Despite this, LSTM-Caravan (0.5327) reacts to this anomaly less sharply. In 2015, LSTM-Caravan reached a maximum of 0.9314, surpassing LSTM-Uba-grid by only 0.017, demonstrating its advantage in a high-flow year. In 2017, both LSTM-Uba-grid and LSTM-Caravan achieved their highest NSE values, 0.9485 and 0.9524, respectively, demonstrating strong predictive performance under favorable hydrological conditions, with the LSTM-Caravan model slightly outperforming the full-grid model that year. In 2018, the LSTM-Uba-point model outperformed the others, reaching 0.7972, which may be explained by better capture of local hydrometeorological features that year. Finally, in 2019 and 2020, the full-grid model once again leads, confirming that maintaining full spatial coverage remains the most reliable approach, although in certain years either large-scale pretraining or local observations may also show advantages.
Regarding the results for the Middle Fork Flathead River basin, the comparison focused on two model configurations: the full-grid model (LSTM-Flathead-grid) and the single-point model (LSTM-Flathead-point). From 2005 to 2014, the LSTM-Flathead-grid model consistently outperformed the point-based model, except in 2008 and 2009, when the point model reached values almost identical to those of the grid model. As shown in Figure 10, the LSTM-Flathead-point model exhibits lower performance and greater variability, including a sharp drop to 0.5752 in 2005, the lowest among all years and configurations. Although the gap narrows slightly in 2009 and 2012, the LSTM-Flathead-grid model remains ahead, with annual differences often exceeding 0.05–0.10 NSE units. This gap is especially noticeable in 2010 and 2011, when the point model barely reflects the hydrological dynamics at the basin scale. These results confirm the earlier conclusions from the Uba River basin: although point models can offer a simplified alternative, the quality of their forecasts is less reliable, and full spatial coverage of the input data remains important for consistently accurate forecasts.
3.3. Effect of Spatial Aggregation on Grid Performance
Based on experimental work 2, two additional spatial aggregation scenarios, 2 × 2 and 3 × 3 cell groupings, were evaluated alongside the base full-grid model to assess the impact of input coarsening on model performance.
In preparation for assessing how input coarsening influences model performance, all six hydrometeorological variables were normalized to the [0–1] interval. For every evaluation year, we calculated parameter-wise variances on the native 1 × 1-cell grid (8 km × 8 km) and on its 2 × 2-cell (16 km × 16 km) and 3 × 3-cell (24 km × 24 km) groupings. Information loss was then expressed for each grid as the mean Frobenius norm (averaged across all parameters) of the difference between the variance matrix of the native fine-grained grid and that of each coarser grid.
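The information-loss measure described above can be sketched as follows on synthetic data. The exact construction of the paper's variance matrices is not fully specified, so this sketch makes an assumption: per-cell temporal variances are computed for each parameter, the coarse-grid variance map is obtained by block averaging, and the two maps are compared after upsampling the coarse one back to cell resolution.

```python
import numpy as np

def coarsen(field, b):
    """Average b-by-b cell blocks of a 2-D field (edges truncated)."""
    h, w = (field.shape[0] // b) * b, (field.shape[1] // b) * b
    return field[:h, :w].reshape(h // b, b, w // b, b).mean(axis=(1, 3))

def frobenius_loss(var_fine, b):
    """Frobenius norm between the fine variance matrix and its b-by-b
    coarsening, upsampled back to cell resolution for comparison."""
    var_coarse = coarsen(var_fine, b)
    upsampled = np.kron(var_coarse, np.ones((b, b)))
    h, w = upsampled.shape
    return np.linalg.norm(var_fine[:h, :w] - upsampled)

# Six normalized parameters, 365 daily values on a 12 x 18 cell grid
# (synthetic stand-ins for the hydrometeorological variables).
rng = np.random.default_rng(1)
data = rng.random((6, 365, 12, 18))

# Per-parameter temporal variance at each cell, then the mean
# Frobenius-norm loss across parameters for each grouping.
var_maps = data.var(axis=1)          # shape (6, 12, 18)
loss_2x2 = np.mean([frobenius_loss(v, 2) for v in var_maps])
loss_3x3 = np.mean([frobenius_loss(v, 3) for v in var_maps])
print(loss_2x2, loss_3x3)
```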
For the Uba River basin, the results show that the full-grid model provides the highest modeling accuracy: the median NSE value for the period 2012–2020 was 0.905. In six out of nine years (2013, 2014, 2016, 2017, 2018, 2019), it outperformed the coarser grids. Aggregated grids showed the best results only in a low-flow year (2012, NSE = 0.9198), a high-flow year (2015, NSE = 0.9506), and 2020 (0.9154), a year in which coarsening the 1 × 1-cell grid to 2 × 2 and 3 × 3 cells changed NSE by less than 0.01, indicating that the removal of fine-scale spatial variability has a negligible impact on model performance (Table 4).
The greatest reductions in NSE under cell aggregation occur in problematic years. For example, in 2018, the NSE for LSTM-Uba-2by2 is lower by 0.062 units, and for LSTM-Uba-3by3 by 0.173 units, compared to LSTM-Uba-grid. This shows that at coarse resolution, the model becomes significantly more sensitive to non-standard hydrological conditions and to distorted data (as in 2014).
Figure 11a quantifies the information lost when the grid is coarsened, displaying boxplots of the mean Frobenius-norm loss. A median loss of 0.011 with a narrow interquartile range is observed for the 2 × 2 aggregation, whereas the 3 × 3 grouping provides a larger median loss (0.026) and substantially greater spread.
Figure 11b illustrates the direct effect of this loss on model performance: as grid resolution decreases, the median NSE declines from 0.905 to 0.863 and subsequently to 0.846, while the spread of NSE widens. In other words, small information losses at 2 × 2 correspond to only a modest decline in forecast accuracy, whereas the larger, more variable losses at 3 × 3 map onto a noticeably lower and less stable NSE distribution.
Thus, the optimal cell resolution for modeling river flow in the Uba River basin is the full grid (8 × 8 km), which ensures maximum predictive performance and stability of the results. Since the base grid has a spacing of ≈8 × 8 km, practically the same as the nominal resolution of ERA5-Land (0.1° ≈ 9 km), further enlarging the cells to 16 × 16 km and 24 × 24 km effectively combines several ERA5-Land pixels into one model cell. This smooths out orography, precipitation, and snow reserves, causing the model to lose local variability and, as our experiments have shown, leading to average NSE reductions of approximately 2% for 2 × 2 and 4.6% for 3 × 3 grids across all years. Aggregation into a 2 × 2 grid can be used as a compromise option when computational costs must be reduced at the price of a small loss in performance, whereas a coarse 3 × 3 grid is recommended only for rough water-balance calculations with extremely limited computing resources and low requirements for forecast detail.
When it comes to testing on the Middle Fork Flathead River basin, the performance differences between these models are generally smaller than those observed in the Uba River basin (Table 5).
In 5 out of 10 years, the base grid model outperforms the LSTM-Flathead-2by2 and LSTM-Flathead-3by3 models. However, the 2 × 2 and 3 × 3 aggregations show competitive results in the remaining years: in 2006, 2008, 2011, 2012, and 2013, one of the aggregated models matches or slightly exceeds the LSTM-Flathead-grid model’s NSE. In both 2008 and 2013, aggregated models (especially 3 × 3) outperform the full grid. These may be years with increased data uncertainty, sparse meteorological signals, or reduced hydrological variability, in which smoothing the inputs helps the model generalize better. Overall, unlike in the Uba River basin, the Flathead model’s accuracy is less sensitive to reduced resolution.
Figure 12a shows that aggregating the Middle Fork Flathead grid from 1 × 1 to 2 × 2 cells results in a median loss of 0.014, with occasional years reaching 0.019. Further coarsening to 3 × 3 cells lowers the median loss slightly to 0.013 and shortens the upper whisker, indicating that very large losses occur less frequently while the IQR remains virtually unchanged.
Figure 12b shows the corresponding effect on predictive skill: the fine grid delivers the strongest and most consistent NSE values (median NSE = 0.928), moving to 2 × 2 cells lowers overall skill and introduces greater year-to-year variability (median NSE = 0.917), while the 3 × 3 grid recovers some of the lost median accuracy, yet remains prone to occasional sharp performance drops (median NSE = 0.921).
In summary, the effect of grid aggregation varies by basin and year. In the Uba basin, the native 1 × 1 grid resolution produces the highest and most stable NSE values. Similarly, in the Middle Fork Flathead basin, this fine resolution also results in the highest median NSE, although with greater variability. Moderate coarsening, however, has only a minor impact on the overall distribution of performance.
3.4. Forecast Reliability Under Random Spatial Subsets
As for experimental work 3, the study examined how varying the spatial coverage of input data affects the predictive skills and stability of LSTM forecasts. For each K, 100 independent model runs were performed, each time randomly selecting a new combination of cells. To assess the impact, we compared the performance of each subset-based model to that of the full-grid model by calculating the difference in NSE (ΔNSE) for each year between 2012 and 2020. A negative ΔNSE indicates that the subset model performed worse than the full-grid model, while a positive ΔNSE value suggests improved performance.
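This evaluation can be sketched as follows; the observed series, model outputs, and noise levels below are synthetic stand-ins used only to illustrate how NSE and ΔNSE are computed across repeated random-subset runs, not the study's data.

```python
import numpy as np

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 - SSE / variance of the observations."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

# Synthetic stand-ins for one evaluation year.
rng = np.random.default_rng(2)
observed = np.sin(np.linspace(0, 3, 200)) * 50 + 60   # discharge-like curve
full_grid_pred = observed + rng.normal(0, 2, 200)     # full-grid model run
subset_preds = [observed + rng.normal(0, 4, 200)      # 100 random-subset runs
                for _ in range(100)]

# Delta-NSE of each subset run relative to the full-grid model;
# negative values mean the subset model performed worse.
nse_full = nse(observed, full_grid_pred)
delta_nse = np.array([nse(observed, p) for p in subset_preds]) - nse_full
print(np.median(delta_nse), delta_nse.std())
```

The median and standard deviation of `delta_nse` correspond to the solid lines and ±1σ bands reported for each year and each K.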
To present the results for the Uba River basin under random-subset conditions, Figure 13 displays the year-by-year median ΔNSE for the three values of K. ΔNSE is plotted on the Y axis; solid lines show the median of 100 independent runs in each year, and shaded bands denote ±1σ (standard deviation).
In all years except 2015, 2017, and 2020, the median ΔNSE is negative, indicating an inevitable decline in model performance with any reduction in basin coverage. The dotted verticals mark 2014 and 2018, the worst-performing years. In these years, the deficit reaches its greatest values (down to roughly −0.06 to −0.09), and the variation across runs is widest.
Table 6 lists exact NSE statistics for each year.
When switching from 100 to 110 cells, the median σ spread decreases from 0.012 to 0.009, and the median curves shift closer to zero. Further expansion to 120 cells results in only a slight additional narrowing of the bands (σ ≤ 0.025) and leaves the median ΔNSE practically unchanged, illustrating diminishing returns.
When the subset increases from 100 to 110 cells, the typical annual standard deviation over the 100 runs becomes smaller in most years. For example, σ falls by 0.004 in 2013, 2014, and 2019, by 0.003 in 2012 and 2020, and by 0.001 in 2017, while the lower bound improves from 0.004 to 0.003. The only exception is the dry year 2018, when σ rose sharply to 0.036. Further expansion of the subset to 120 cells reduces the upper bound to 0.025, so that in every year except 2018 the value falls below 0.014; yet the central ΔNSE curve remains almost unchanged, again indicating diminishing returns.
In conclusion, the experiments show that even with a moderately reduced number of cells (K = 110 or 120 of 166), the model systematically loses ≈1% of NSE and becomes less stable, and when the subset shrinks to 100 cells, the losses increase to ≈1.3% and the variation among runs becomes noticeably wider. Adding the first ten cells (K = 100 to 110) does increase stability and partially reduce the deficit; however, even at K = 120, the average performance remains below that of LSTM-Uba-grid, and the negative deviation reaches its maximum in 2014 and 2018. Consequently, neglecting any part of the basin impairs both the average accuracy and the reliability of the model. The largest negative deviations occur in two specific years: 2014, when the dataset is distorted, and 2018, which differs markedly from the other seasons, showing that partial coverage makes the model especially vulnerable to problematic or atypical years. Full spatial coverage therefore remains essential for consistently high performance.
When it comes to the Middle Fork Flathead River basin, Figure 14 shows that the basic full-grid model retains a small but stable advantage in predictive performance. In most years, the values of ΔNSE are within ±0.02, except for 2005, 2008, and 2014, where there is a decrease to −0.06 at K = 30. Generally, increasing from K = 30 to K = 40 reduces the spread in 8 of 10 years, whereas expanding further to K = 50 provides only minimal and sometimes mixed changes:
Switching from K = 30 to K = 40 increases accuracy in 2005, 2009, and 2014 (see
Table 7, median values: 2005, from 0.772 to 0.804; 2009, from 0.887 to 0.893; 2014, from 0.905 to 0.916).
An increase to K = 50 yields only marginal improvements, which is especially noticeable in the stable years 2010–2013, where the medians almost coincide with those of the full-grid model (for example, in 2010 the full-grid median is 0.9519 versus 0.948 for LSTM-Flathead-rand 50).
Overall, compared to the Uba River basin, the Middle Fork Flathead River basin model demonstrates greater robustness to partial spatial coverage. Even when only 30 cells are used (about 42% of the full grid), the accuracy loss does not exceed 2–3% and the standard deviation remains moderate (see
Table 7: in 2014, the standard deviation is 0.017 at K = 30 and 0.011 at K = 50).
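The random-subset experiments above rest on drawing K grid cells without replacement from the full basin grid. The sketch below illustrates that sampling step; the grid size of 71 cells is inferred from "30 cells ≈ 42% of the full grid" and is illustrative only.

```python
import numpy as np

N_CELLS = 71   # approximate full Middle Fork Flathead grid size (illustrative)
K = 30         # subset size; 30 / 71 is roughly 42% coverage

rng = np.random.default_rng(42)
# Each experiment draws a fresh random subset of K cell indices without
# replacement; the LSTM is then trained only on forcing data from those cells.
subset = rng.choice(N_CELLS, size=K, replace=False)
mask = np.zeros(N_CELLS, dtype=bool)
mask[subset] = True
print(f"selected {mask.sum()} of {N_CELLS} cells ({mask.mean():.0%} coverage)")
```

Repeating this draw with a different seed for each of the 100 runs produces the ensemble over which the medians and standard deviations are computed.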
3.5. Key Observations on LSTM-Grid Model Performance
This section analyzes the performance of the LSTM-grid models in predicting water discharge across both studied basins, focusing on agreement between predicted and observed hydrographs during the hydrological years.
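Agreement between predicted and observed hydrographs is quantified throughout with the Nash–Sutcliffe efficiency (NSE). A minimal sketch of the metric, with synthetic discharge values for illustration:

```python
import numpy as np

def nse(observed, simulated):
    """Nash–Sutcliffe efficiency: 1 - SSE / variance of observations.
    1.0 is a perfect fit; 0 means no better than the observed mean."""
    observed = np.asarray(observed, dtype=float)
    simulated = np.asarray(simulated, dtype=float)
    sse = np.sum((observed - simulated) ** 2)
    var = np.sum((observed - observed.mean()) ** 2)
    return 1.0 - sse / var

# Synthetic daily discharges for part of a hydrological year (values illustrative).
obs = np.array([3.0, 5.0, 12.0, 20.0, 14.0, 7.0, 4.0])
sim = np.array([3.2, 5.5, 11.0, 19.0, 15.0, 6.5, 4.1])
print(f"NSE = {nse(obs, sim):.3f}")
```

Because the denominator is the variance of the observations, NSE penalizes errors most heavily around high-flow peaks, which is why peak-timing and peak-magnitude errors dominate the year-by-year scores discussed below.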
The Uba River basin presents a clear example of how the model performs under varying flow conditions, including years with data inconsistencies and predictive biases.
Figure 15 presents a series of hydrographs comparing the actual observed data with the model’s predictions for each year from November to May. As previously mentioned, two years, 2014 and 2018, stand out due to anomalies in the model output or the observed data.
The model’s low NSE score for 2014 (0.492) can be largely attributed to probable inconsistencies in the observed discharge data. The 2014 hydrograph (subplot titled 2014) exhibits sharp, blocky fluctuations resembling geometric steps, suggesting that the original data may have been recorded not at daily resolution but as monthly or otherwise aggregated averages. This distortion in the recorded data likely misled the model and reduced its prediction accuracy.
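Such step-like, aggregated records can be screened for automatically: a genuinely daily series almost never repeats the same value for weeks, whereas a series reconstructed from monthly means does. A sketch of one simple check (function name and synthetic data are illustrative, not part of the study's pipeline):

```python
import numpy as np

def longest_constant_run(series):
    """Length of the longest run of identical consecutive values."""
    series = np.asarray(series)
    change = np.flatnonzero(np.diff(series) != 0)   # positions where the value changes
    bounds = np.concatenate(([0], change + 1, [len(series)]))
    return int(np.max(np.diff(bounds)))

# Synthetic examples: a noisy daily signal vs. six "months" of constant means.
daily = np.sin(np.linspace(0, 6, 180)) + np.random.default_rng(3).normal(0, 0.05, 180)
monthly_steps = np.repeat([1.0, 2.5, 4.0, 3.2, 1.8, 0.9], 30)

print(longest_constant_run(daily))          # small: values change every day
print(longest_constant_run(monthly_steps))  # 30: flags likely aggregation
```

Flagging years whose discharge series contain long constant runs would separate cases like 2014 from genuinely daily records before training or evaluation.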
Unlike other years in the test period, 2018 is not characterized by high flows; nevertheless, the model overestimates discharge in several months, although the NSE for this year remains moderate (0.7074).
The hydrographs in
Figure 16 illustrate the LSTM-Flathead-grid model’s predictions compared to observed discharge data for the Middle Fork Flathead River basin during the 2005–2014 hydrological years. The model shows consistently strong performance across nearly all years, with NSE values above 0.89 in 8 out of 10 years, indicating high accuracy in both timing and magnitude of discharge peaks, and demonstrates strong generalization even during years with complex hydrograph shapes.
In particular, years such as 2006 (NSE = 0.9538), 2010 (0.9519), 2011 (0.9621), and 2014 (0.941) demonstrate excellent alignment between predicted and observed hydrographs, including the correct representation of peak flow timing and volume.
Some minor underestimations of peak flow magnitude can be observed in years like 2005 and 2008, where the observed discharge shows sharper spikes compared to the smoother LSTM-Flathead-grid model predictions. However, these differences are relatively small and do not significantly reduce model performance (NSE values remain above 0.80 in both years).
Compared to the Uba River basin, the Middle Fork Flathead basin results suggest that the model benefits from more stable and consistent observational data and may be better tuned to this basin’s hydrological behavior. This better performance might also be explained by the fact that ERA5-Land input data are likely more accurate and better calibrated over the United States, where the observational network is denser and model validation is stronger, than over relatively under-monitored regions such as Kazakhstan.