Explainable Machine Learning for Streamflow Forecasting: Application to the Bosna River Basin

Gnjato, Slobodan; Leščešen, Igor; Zhou, Qiuwen; Ðukanović, Marko

doi:10.3390/w18101226

¹ Faculty of Natural Sciences and Mathematics, University of Banja Luka, Mladena Stojanovića 2, 78000 Banja Luka, Bosnia and Herzegovina

² Institute of Hydrology SAS, Dúbravská cesta 9, 84104 Bratislava, Slovakia

³ School of Geography and Environmental Sciences, Guizhou Normal University, Guiyang 550003, China

⁴ University of Nova Gorica, Vipavska cesta 13, 5000 Nova Gorica, Slovenia

⁵ Institute of Information Sciences (IZUM), Prešernova 17, 2000 Maribor, Slovenia

* Author to whom correspondence should be addressed.

Water 2026, 18(10), 1226; https://doi.org/10.3390/w18101226

Submission received: 13 March 2026 / Revised: 2 May 2026 / Accepted: 14 May 2026 / Published: 19 May 2026

(This article belongs to the Section Hydrology)

Abstract

As one of the key water systems in Bosnia and Herzegovina, the Bosna River Basin plays a vital role in sustaining agricultural production, industrial development, and water supply for municipalities. Accurate streamflow forecasting is fundamental to optimising water resource planning. This study explores streamflow forecasting using long-term data (1961–2020) from five meteorological stations and one hydrological station distributed across various sections of the basin. For precise streamflow forecasting, the study employs several machine-learning models: Random Forest, LSTM, and XGBoost. Model performance is evaluated using widely used metrics, including mean absolute error, root mean square error, Nash–Sutcliffe efficiency (NSE), and Kling–Gupta efficiency (KGE). Among the tested models, Random Forest proved to be the most accurate for streamflow forecasting, confirming its effectiveness in capturing the complex dynamics of hydrological processes. During the testing phase, the Random Forest model achieved an NSE of 0.591 and a KGE of 0.591, demonstrating good generalisation and reliable predictions. The results demonstrate the strength of Random Forest in capturing nonlinear hydrological patterns and supporting reliable streamflow forecasting for national water management. Moreover, as a novel approach, explainable AI in the form of SHAP analysis was applied to go beyond the raw model predictions, providing a deeper understanding of model behaviour in terms of the magnitude and direction of each feature’s influence.

Keywords:

LSTM; RF; XGBoost; streamflow forecast; Bosna River; river basin management

1. Introduction

Reliable streamflow predictions are fundamental to strengthening resilience and sustainability in water resource management [1]. Management practices commonly involve assessments that incorporate hydrological extremes analysis, reservoir operation, public water use, hydraulic infrastructure planning at the national scale, and agricultural systems [2,3]. At the same time, pronounced shifts in the hydrological cycle, driven by anthropogenic climate change forcing, continue to raise concerns, as projected changes in hydrological regimes suggest an overall decline in annual precipitation, coupled with an elevated likelihood of droughts and amplification of rapid extreme events [4]. Accurate streamflow prediction is a demanding task owing to the complex, nonlinear behavior of hydrological systems, shaped by precipitation patterns, temperature variations, land use, human interventions, and a lack of empirical data [5,6].

Significant efforts have been devoted to developing and implementing approaches to enhance the reliability and forecast precision of streamflow predictions, which can generally be grouped into two categories. The first category comprises models intended to simulate the rainfall–runoff response grounded in physical principles. Although they rely on simplified premises, they require substantial data inputs [7]. The second category comprises data-driven models, including statistical and machine learning (ML) methodologies, which rely on historical observations rather than physical process information, enabling empirical use and straightforward implementation [7,8]. Over the past decade, ML models have established a significant role in hydrological modelling, especially in streamflow forecasting, where they tend to outperform traditional physically based approaches in forecast precision [9].

The use of ML techniques for streamflow forecasting has become increasingly common among scientists over the past decade, primarily because of their capability to capture nonlinear dynamics and temporal variability in hydrological processes and interacting factors without requiring explicit representation of underlying physical processes [9,10]. A variety of models have been utilised in streamflow prediction research, with Artificial Neural Networks (ANNs) [11,12,13], Long short-term memory (LSTM) [14,15,16], Deep Neural Networks (DNNs) [17,18], Support Vector Regression (SVR) [19,20], Random Forest (RF) [21,22], Light Gradient Boosting Machine (LGBM) [23,24], and eXtreme Gradient Boosting (XGBoost) [25,26,27] being among the most commonly employed techniques. On the other hand, several studies have successfully applied alternative HBV- and SWAT-type models based on physical laws of hydrology, simulating water movement step-by-step using explicit equations for monthly streamflow prediction. These models have been shown to be particularly useful for simulating streamflow in various basins [24,28,29,30,31,32].

This research examines the use of RF, XGBoost, and LSTM networks for streamflow forecasting at the Doboj hydrological station in the Bosna River Basin (BRB) in Bosnia and Herzegovina (BH). These techniques have proven to be remarkably scalable, robust, and highly efficient in solving various problems [33], not only in hydrological forecasting, as already mentioned, but also in complementary fields such as finance [34], remote sensing [35], sentiment analysis [36], and even solar energy forecasting [37], making them an attractive choice for predictive modelling.

To the best of our knowledge, this study represents the first application of machine learning (ML) methods for streamflow forecasting in Bosnia and Herzegovina. The primary objective was to address a critical gap in the existing literature by demonstrating the potential of ML techniques in the water sciences through a case study of the Bosna River Basin (BRB), thereby moving beyond the predominantly descriptive analyses of streamflow previously conducted in the country. An additional objective was to provide a foundation for understanding basin-specific hydrological processes and water management challenges within the BRB, with particular emphasis on accurate and interpretable forecasting. To this end, explainable artificial intelligence (XAI) tools were employed [38] to relate model predictions to fundamental hydrological drivers and their temporal lags. This approach offers new insights into the relative influence of individual features, facilitating high-quality predictions while enhancing model transparency and hydrological interpretability.

A comprehensive evaluation of multiple ML frameworks across the BRB is expected to inform future research directions and facilitate the effective integration of machine learning into water resources management practices worldwide.

2. Materials and Methods

2.1. Study Area

The Bosna River Basin (BRB) is the largest and one of the principal basins of Bosnia and Herzegovina (BH), with its entire area situated within the national territory and extending across the central, eastern and northern parts of the country. The BRB encompasses major urban and industrial centers, such as Sarajevo, Zenica, Tuzla, and Doboj (Figure 1), which together host more than 50% of the national population, thereby establishing the basin as an economic hotspot. The basin occupies an area of 10,457 km², covering ~20% of the overall area of BH, whereas the length of the river and mean elevation of the basin are 275 km and 640 m, respectively [39]. The river has its source at the foot of Mt Igman at 500 m a.s.l., emerging as a substantial karst spring. The upper section of the BRB features a well-developed drainage network, comprising several major and numerous smaller torrential streams. It extends from the river source to the city of Zenica. Extending from Zenica to Doboj, the middle section of the BRB is supplied by several major tributaries, including the Usora, Krivaja, and Spreča Rivers. The lower northern part of the Bosna River Basin extends from the town of Doboj to the confluence of the Bosna with the Sava River and lacks major tributaries. The Bosna River, along with its main affluents, is characterised by the Posavina subtype of the pluvial–nival water regime and exhibits peak streamflow in March and April and minimum streamflow in August and September [38,40]. Climatic patterns in the Bosna River basin range from mountainous (Dfc) conditions in the south to continental (Dfb) and moderately continental (Cfb) types in the central and northern segments of the basin [41]. Mean annual air temperatures range from 9.5 to 12 °C in the northern, lower section of the basin [41]. In the middle section, mean annual air temperatures range between 7 and 10 °C, whereas in the highest, upper section they decrease to as low as 1.2 °C [41].
Mean annual precipitation generally decreases towards the north. The highest precipitation amounts are recorded in the upper section, where some areas receive more than 2000 mm annually, while lower valley areas receive around 900 mm. In the middle section of the basin, precipitation ranges from 900 to 1200 mm, whereas the northern areas experience the lowest values, between 800 and 900 mm [41].

2.2. Dataset Description

The dataset consists of monthly aggregated observations spanning the period from 1961 to 2020. Each record corresponds to a single month. Meteorological predictors include the mean monthly precipitation (P) and mean monthly air temperature (T), observed at five meteorological stations located within the Bosna River basin: Bjelašnica, Sarajevo, Zenica, Tuzla, and Doboj. The target variable for prediction is the mean monthly streamflow (Q), measured at the most downstream hydrological station on the Bosna River (Doboj station). All variables were aggregated to a monthly resolution to ensure temporal consistency and long-term coverage. A few data points were missing for the Zenica station over a six-month period in 2017. These gaps were filled with climatological monthly means, i.e., the mean of the corresponding calendar month computed across all available years for that station. All hydroclimatic data were provided by the Republic Hydrometeorological Institute of the Republic of Srpska.

2.3. Compared Models

In this study, three data-driven modelling approaches were evaluated for predicting mean monthly streamflow in the Bosna River: Random Forest (RF), Extreme Gradient Boosting (XGBoost), and Long Short-Term Memory (LSTM) neural networks. Random Forest is an ensemble learning method based on bootstrap aggregation of decision trees [42]. Owing to its robustness to noise, ability to capture nonlinear relationships, and reduced sensitivity to multicollinearity, RF has been widely adopted in hydrological modelling and is commonly used as a strong baseline [43,44]. XGBoost is a gradient boosting framework that sequentially constructs decision trees, with each new tree aiming to correct the errors of its predecessors [45,46]. Compared with RF, XGBoost provides enhanced control over model complexity through regularisation and learning-rate parameters and has demonstrated strong performance in hydro-climatic prediction tasks [33,47]. LSTM is a class of recurrent neural networks designed to capture long-term temporal dependencies in sequential data [15,48]. It is included in this study to assess whether deep learning models can extract additional predictive skill from long-term hydro-climatic time series compared with tree-based ensemble methods, as demonstrated in recent rainfall–runoff and streamflow forecasting studies [49,50].

Together, these models represent complementary modelling paradigms: bagging-based ensembles (RF), boosting-based ensembles (XGBoost), and sequence-based deep learning (LSTM).

All experiments were conducted using the Python programming language v3.11. The implementation relies on widely used open-source libraries to ensure reproducibility and transparency. The XGBoost model was implemented using the XGBoost library (XGBRegressor). Random Forest models were implemented using the scikit-learn library (RandomForestRegressor). The LSTM model was developed using the TensorFlow framework with the Keras high-level API. Hyperparameter tuning and model selection were performed using GridSearchCV from scikit-learn combined with TimeSeriesSplit to ensure temporally consistent cross-validation.

2.4. Feature Engineering and Lagged Variables

To account for hydrological memory and delayed catchment responses, lagged predictor variables were incorporated into the modelling framework and generated independently for each station and time step t. Let P_{t}, T_{t}, and Q_{t} denote precipitation, air temperature, and streamflow at time step t, respectively. The predictor set comprised lagged precipitation terms at one-, two-, and three-month intervals (P_{t-1}, P_{t-2}, P_{t-3}), lagged temperature terms at one- and two-month intervals (T_{t-1}, T_{t-2}), and an autoregressive streamflow term representing previous-month discharge (Q_{t-1}). The autoregressive streamflow term captures short-term persistence in river discharge and is widely used in hydrological forecasting to enhance predictive skill. Rows with missing values introduced by lagged feature construction were removed using a complete-case approach, excluding them from the training dataset prior to model calibration.
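The lag construction and complete-case filtering described above can be sketched in a few lines. This is a minimal, dependency-free illustration for a single station, not the authors' implementation; the function name `build_lagged_rows` and the six-month series are hypothetical, and in practice the same result is usually obtained with a dataframe shift followed by dropping incomplete rows.

```python
def build_lagged_rows(precip, temp, flow, p_lags=(1, 2, 3), t_lags=(1, 2), q_lags=(1,)):
    """Return (features, targets) for one station, keeping complete cases only.

    precip, temp, flow: equal-length monthly series (lists of floats).
    Each feature row is [P(t-1), P(t-2), P(t-3), T(t-1), T(t-2), Q(t-1)].
    """
    max_lag = max(*p_lags, *t_lags, *q_lags)
    features, targets = [], []
    # Start at max_lag so that every lagged value exists (complete-case approach).
    for t in range(max_lag, len(flow)):
        row = [precip[t - k] for k in p_lags]
        row += [temp[t - k] for k in t_lags]
        row += [flow[t - k] for k in q_lags]
        features.append(row)
        targets.append(flow[t])
    return features, targets


# Hypothetical six-month series, for illustration only.
P = [80.0, 95.0, 110.0, 70.0, 60.0, 50.0]
T = [1.0, 3.0, 7.0, 12.0, 16.0, 19.0]
Q = [150.0, 170.0, 220.0, 260.0, 180.0, 120.0]

X, y = build_lagged_rows(P, T, Q)
```

With a maximum lag of three months, the first three months of the record cannot form a complete row, so six months of data yield three usable samples.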

Prior to model training, all datasets were merged into a single time-indexed table, where each row corresponds to a monthly observation and columns represent hydro-meteorological variables from all stations. Missing values in temperature data for the Zenica station (July–December 2017) were filled using climatological monthly means computed across all available years. No additional missing values were present after merging.
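The climatological gap filling mentioned above can be illustrated as follows. This is a sketch under simple assumptions: missing values are represented as `None`, the function name `fill_with_monthly_climatology` is hypothetical, and the temperatures are invented for the example.

```python
from collections import defaultdict


def fill_with_monthly_climatology(months, values):
    """Replace None entries with the mean of all observed values for the same calendar month."""
    sums, counts = defaultdict(float), defaultdict(int)
    for month, value in zip(months, values):
        if value is not None:
            sums[month] += value
            counts[month] += 1
    filled = []
    for month, value in zip(months, values):
        if value is None:
            if counts[month] == 0:
                raise ValueError(f"no observations available for month {month}")
            value = sums[month] / counts[month]
        filled.append(value)
    return filled


# Hypothetical July (7) and August (8) temperatures over three years, with one July gap.
months = [7, 8, 7, 8, 7, 8]
temps = [20.0, 19.0, 22.0, 21.0, None, 20.0]
filled = fill_with_monthly_climatology(months, temps)
```

The missing July value is replaced by the mean of the two observed July values, while all observed values are left unchanged.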

For the LSTM, input data are structured as three-dimensional tensors of shape (N, 1, F), where N is the number of samples and F is the number of input features. A single timestep was used, meaning that temporal dependencies were not modeled through sequential input windows but instead explicitly encoded via lagged features: lagged precipitation (P_{t-1}, P_{t-2}, P_{t-3}), lagged temperature (T_{t-1}, T_{t-2}), and lagged discharge (Q_{t-1}) were included for each station. This approach ensures consistency with the feature representation used in tree-based models while still allowing nonlinear modeling through the LSTM architecture. Additional lagged precipitation and temperature features were tested during experimentation; however, they did not provide any significant advantage over the results presented in the paper.

No explicit normalization or scaling of input variables was applied either in the training or the test phase. While normalization is often recommended for neural networks such as LSTM, the model was trained directly on raw values to maintain consistency with tree-based models and preserve the physical interpretability of hydrological variables.

2.5. Performance Metrics

Model performance was evaluated using hydrologically relevant efficiency metrics. Besides the Mean Absolute Error (MAE) and the Root Mean Square Error (RMSE), Nash–Sutcliffe and Kling–Gupta measures were determined. The Nash–Sutcliffe Efficiency (NSE) [15,51,52] measures the predictive skill of a model relative to the mean of observed streamflow values and is defined as:

NSE = 1 - \frac{\sum_{i=1}^{n} (Q_{i} - \hat{Q}_{i})^{2}}{\sum_{i=1}^{n} (Q_{i} - \bar{Q})^{2}}

(1)

where Q_{i} and \hat{Q}_{i} denote observed and predicted streamflow values, respectively, and \bar{Q} is the mean of observed streamflows.

The Kling–Gupta Efficiency (KGE) [15,51,52] decomposes model performance into correlation, bias, and variability components:

KGE = 1 - \sqrt{(r - 1)^{2} + (\beta - 1)^{2} + (\gamma - 1)^{2}}

(2)

where r is the Pearson correlation coefficient, β is the bias ratio, and γ is the variability ratio between predicted and observed streamflows.

NSE was used as the primary optimization criterion during model tuning, while KGE was employed for complementary evaluation.

For the sake of completeness, the formula for MAE is defined as:

MAE = \frac{1}{n} \sum_{i=1}^{n} |Q_{i} - \hat{Q}_{i}|

(3)

where n is the total number of observations (i.e., data points), and ∣⋅∣ is the absolute-value operator. Lower MAE values indicate better model performance.

Root Mean Square Error (RMSE): The RMSE also quantifies prediction error but places greater emphasis on larger errors because the differences are squared before averaging. It is defined as

RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (Q_{i} - \hat{Q}_{i})^{2}}

(4)

where all symbols retain the same meanings as in the MAE equation, and the square-root operation returns the metric to the original units of the data. As with MAE, smaller RMSE values reflect superior predictive accuracy [53,54,55].
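The four metrics of Equations (1)–(4) can be written compactly as below. This is an illustrative sketch rather than the evaluation code used in the study. One assumption should be noted: the text does not specify how the variability ratio γ is computed, so it is taken here as the ratio of standard deviations (predicted over observed); the modified KGE instead uses the ratio of coefficients of variation. The sample values are hypothetical.

```python
import math


def _mean(values):
    return sum(values) / len(values)


def mae(obs, pred):
    """Mean absolute error, Equation (3)."""
    return _mean([abs(o - p) for o, p in zip(obs, pred)])


def rmse(obs, pred):
    """Root mean square error, Equation (4)."""
    return math.sqrt(_mean([(o - p) ** 2 for o, p in zip(obs, pred)]))


def nse(obs, pred):
    """Nash-Sutcliffe efficiency, Equation (1)."""
    mean_obs = _mean(obs)
    numerator = sum((o - p) ** 2 for o, p in zip(obs, pred))
    denominator = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - numerator / denominator


def kge(obs, pred):
    """Kling-Gupta efficiency, Equation (2)."""
    mean_obs, mean_pred = _mean(obs), _mean(pred)
    std_obs = math.sqrt(_mean([(o - mean_obs) ** 2 for o in obs]))
    std_pred = math.sqrt(_mean([(p - mean_pred) ** 2 for p in pred]))
    cov = _mean([(o - mean_obs) * (p - mean_pred) for o, p in zip(obs, pred)])
    r = cov / (std_obs * std_pred)   # Pearson correlation
    beta = mean_pred / mean_obs      # bias ratio
    gamma = std_pred / std_obs       # variability ratio (assumed: ratio of standard deviations)
    return 1.0 - math.sqrt((r - 1) ** 2 + (beta - 1) ** 2 + (gamma - 1) ** 2)


# Hypothetical observed and predicted monthly flows.
obs = [10.0, 20.0, 30.0, 40.0]
pred = [12.0, 18.0, 33.0, 39.0]
scores = {"MAE": mae(obs, pred), "RMSE": rmse(obs, pred),
          "NSE": nse(obs, pred), "KGE": kge(obs, pred)}
```

NSE and KGE both equal 1 for a perfect prediction, and an NSE of 0 corresponds to predicting the observed mean at every step.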

2.6. Training, Validation, and Temporal Splitting

To respect the temporal structure of the data and avoid information leakage, a time-based splitting strategy was adopted. The dataset was divided into training (1961–2010) and testing (2011–2020) periods.

During the training period, rolling-origin cross-validation was performed using a time-series split with 5 folds. In this setup, validation sets always occur chronologically after the corresponding training subsets.
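A minimal sketch of this splitting scheme is given below. It assumes an expanding training window with equally sized validation blocks, in the spirit of scikit-learn's `TimeSeriesSplit`; the exact fold boundaries used in the study are not reported, so the sizes shown are illustrative only. The sketch also ignores the few initial months dropped during lag construction.

```python
def rolling_origin_splits(n_samples, n_splits=5):
    """Yield (train_indices, val_indices) pairs with an expanding training window."""
    fold = n_samples // (n_splits + 1)
    if fold == 0:
        raise ValueError("too few samples for the requested number of folds")
    first_val_start = n_samples - n_splits * fold
    for i in range(n_splits):
        val_start = first_val_start + i * fold
        # Training data always precede the validation block in time.
        yield range(0, val_start), range(val_start, val_start + fold)


# Monthly index for 1961-2020, split by calendar year as described above.
months = [(year, month) for year in range(1961, 2021) for month in range(1, 13)]
train = [ym for ym in months if ym[0] <= 2010]
test = [ym for ym in months if ym[0] >= 2011]

folds = list(rolling_origin_splits(len(train), n_splits=5))
```

Because every validation block lies strictly after its training window, no future information reaches the model during tuning, and the 2011–2020 test period is never touched.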

2.7. Hyperparameter Tuning of XGBoost, RF, and LSTM

The hyperparameters of all three models were optimized using a grid search strategy coupled with time-series cross-validation applied to the training dataset, ensuring that temporal dependencies were preserved during model calibration. For the XGBoost model, the tuning process explored combinations of the number of trees, maximum tree depth, learning rate, subsampling ratio, and column subsampling ratio. For each predefined hyperparameter configuration, model performance was evaluated using the NSE across five cross-validation folds, and the configuration yielding the highest mean NSE was selected as optimal. An analogous procedure was employed for the Random Forest and Long Short-Term Memory models, with hyperparameter spaces tailored to their respective architectures. The optimal hyperparameters identified through this procedure were as follows: for RF, a maximum tree depth of 30, square-root feature sampling, minimum samples per leaf of 1, minimum samples for splitting of 2, and 2000 trees; for XGBoost, a column subsampling ratio of 0.8, learning rate of 0.05, maximum tree depth of 4, 600 estimators, and a subsampling ratio of 0.7; and for LSTM, 64 hidden units, a learning rate of 0.001, a batch size of 16, and 200 training epochs.
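For reference, the reported optimal settings can be collected as keyword dictionaries. The values are those stated above; the mapping to parameter names follows scikit-learn, XGBoost, and Keras conventions and is an assumption, in particular the reading of "column subsampling ratio" as `colsample_bytree`.

```python
# Reported optimal hyperparameters; the keyword names are assumed, not taken from the paper.
RF_PARAMS = {
    "n_estimators": 2000,
    "max_depth": 30,
    "max_features": "sqrt",      # square-root feature sampling
    "min_samples_leaf": 1,
    "min_samples_split": 2,
}

XGB_PARAMS = {
    "n_estimators": 600,
    "max_depth": 4,
    "learning_rate": 0.05,
    "subsample": 0.7,
    "colsample_bytree": 0.8,     # assumed meaning of "column subsampling ratio"
}

LSTM_PARAMS = {
    "units": 64,                 # hidden units
    "learning_rate": 0.001,
    "batch_size": 16,
    "epochs": 200,
}
```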

2.8. One-Step-Ahead Forecasting and Bias Correction

Despite achieving reasonable predictive performance, all evaluated models exhibited systematic bias in the magnitude of predicted discharge values, particularly in the form of consistent overestimation or underestimation. This behavior is commonly observed in data-driven hydrological models, which may capture temporal variability well but fail to reproduce absolute magnitudes accurately. To address this issue, a linear scaling bias correction was applied as a post-processing step.

Using the optimized model, one-step-ahead forecasts were generated for the testing period. Systematic bias was reduced by applying a linear scaling correction. A linear regression model was fitted between predicted and observed streamflow values on the training set:

Q_{obs} = a \cdot Q_{pred} + b

(5)

where a and b are regression coefficients estimated from the training data.

The derived correction was subsequently applied to test-period predictions, preserving temporal dynamics while improving distributional consistency. The complete modelling workflow is presented in Figure 2.
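The correction of Equation (5) amounts to an ordinary least-squares fit on the training period followed by a linear transformation of the test-period predictions. The sketch below illustrates this with hypothetical numbers; the function names are not from the original study.

```python
def fit_linear_correction(pred, obs):
    """Least-squares fit of obs = a * pred + b; returns (a, b)."""
    n = len(pred)
    mean_pred, mean_obs = sum(pred) / n, sum(obs) / n
    sxx = sum((p - mean_pred) ** 2 for p in pred)
    sxy = sum((p - mean_pred) * (o - mean_obs) for p, o in zip(pred, obs))
    a = sxy / sxx
    b = mean_obs - a * mean_pred
    return a, b


def apply_correction(pred, a, b):
    """Apply the fitted correction to new predictions."""
    return [a * p + b for p in pred]


# Hypothetical training-period predictions that systematically underestimate observations.
train_pred = [100.0, 200.0, 300.0, 400.0]
train_obs = [135.0, 260.0, 385.0, 510.0]
a, b = fit_linear_correction(train_pred, train_obs)

# Coefficients come from training data only and are then applied to test-period forecasts.
test_corrected = apply_correction([160.0, 240.0], a, b)
```

Since the transformation is linear with a positive slope, the ordering and timing of predicted peaks and troughs are preserved; only the level and spread of the predictions change.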

3. Results

The performance metrics in Table 1 reveal notable differences in both fitting capacity and generalisation ability across the evaluated models. XGBoost achieves near-perfect accuracy on the training dataset (NSE = 0.999, KGE = 0.993), indicating a strong ability to capture nonlinear relationships between hydro-meteorological predictors and streamflows. However, this excellent training performance does not fully carry over to the testing period, where NSE and KGE drop to approximately 0.59 (Table 1). The substantial gap between training and testing accuracy suggests that XGBoost may partially overfit the historical training data, despite the use of time-series cross-validation and regularisation parameters.

Random Forest shows slightly lower training accuracy (NSE = 0.964, KGE = 0.884) than XGBoost but achieves nearly identical performance on the testing dataset (NSE = 0.591, KGE = 0.591) (Table 1). This suggests a more balanced bias–variance trade-off and indicates that Random Forest generalises marginally better to unseen data. The similarity in test performance between Random Forest and XGBoost implies that, for the given feature set and temporal resolution, increased model complexity does not provide substantial gains in predictive skill.

LSTM performs worst among the evaluated approaches. Training accuracy is moderate (NSE = 0.443), and testing performance remains limited (NSE = 0.445), suggesting the model struggles to learn streamflow dynamics (Table 1). Several factors may contribute to this outcome. First, the dataset consists of monthly-aggregated observations, which may not provide sufficient temporal resolution for LSTM models to exploit long-range dependencies. Second, the relatively small number of training samples compared to the high parameterisation of LSTM networks may hinder stable learning. Finally, the use of a single-step input structure with engineered lag features may reduce the advantage of sequence-based deep learning compared to tree-based methods. In particular, its relatively limited performance could be attributed to the chosen input representation, where temporal dependencies are captured through lagged features rather than sequential input windows. As a result, the model does not fully exploit the temporal memory capabilities typically associated with recurrent neural networks. Future work will investigate multi-timestep input sequences to better capture temporal dependencies and fully leverage the capabilities of recurrent neural networks.

Overall, the results show that tree-based ensemble models outperform deep learning approaches for monthly streamflow prediction in the Bosna River, given the available data and feature configuration. While XGBoost achieves the highest fitting accuracy, Random Forest offers comparable test performance with greater robustness. The observed performance gap between training and testing across all models highlights the challenge of non-stationarity in long-term hydro-climatic time series, emphasising the importance of careful validation strategies and model selection. These findings suggest that, for long-term monthly streamflow prediction in data-limited settings, well-regularised ensemble methods combined with physically motivated lag features can provide reliable predictive performance without the complexity of deep learning architectures (see Figure 3).

3.1. Comparison to a Rainfall–Runoff Model

A simplified conceptual rainfall–runoff model inspired by the HBV framework (HBV-lite) was implemented as a physically interpretable baseline against which the machine learning approaches (RF, XGBoost, and LSTM) were benchmarked.

Unlike fully distributed hydrological models, the proposed version uses only precipitation and temperature as driving variables, aggregated from multiple stations into basin-average time series. The model consists of a single soil moisture storage reservoir, where incoming precipitation increases soil water content while temperature-based evapotranspiration reduces it. A nonlinear transformation is then applied to represent runoff generation as a function of relative soil moisture, controlled by a small set of interpretable parameters governing storage capacity, nonlinearity, and runoff responsiveness. The resulting runoff is taken as the simulated discharge. This simplified structure retains the core hydrological concept of storage–release dynamics while avoiding the need for detailed spatial information such as soil type, land use, or topography.
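To make the storage–release idea concrete, a single-bucket model of this kind might look as follows. This is a hypothetical sketch and not the HBV-lite code used in the study: the parameter names and values (`capacity`, `beta`, `k_runoff`, `et_per_degree`, `initial_storage`), the temperature-index evapotranspiration, and the assumption that generated runoff leaves the store are all illustrative choices. Runoff is returned in mm per month, and conversion to discharge is omitted.

```python
def hbv_lite(precip, temp, capacity=300.0, beta=2.0, k_runoff=0.3,
             et_per_degree=4.0, initial_storage=150.0):
    """Single-bucket monthly rainfall-runoff sketch (all parameters hypothetical).

    precip: basin-average precipitation per month (mm)
    temp:   basin-average air temperature per month (deg C)
    Returns simulated runoff per month (mm).
    """
    storage = initial_storage
    runoff = []
    for p, t in zip(precip, temp):
        storage += p                                  # precipitation fills the soil store
        pet = et_per_degree * max(t, 0.0)             # temperature-based evapotranspiration
        storage -= min(storage, pet)                  # cannot evaporate more than is stored
        wetness = min(storage / capacity, 1.0)        # relative soil moisture in [0, 1]
        q = k_runoff * storage * wetness ** beta      # nonlinear runoff generation
        storage -= q                                  # generated runoff leaves the store
        runoff.append(q)
    return runoff


# Hypothetical dry and wet four-month sequences with identical temperatures.
temps = [2.0, 8.0, 15.0, 20.0]
dry = hbv_lite([20.0, 30.0, 10.0, 5.0], temps)
wet = hbv_lite([120.0, 150.0, 90.0, 60.0], temps)
```

Here `capacity` plays the role of storage capacity, `beta` controls nonlinearity, and `k_runoff` sets runoff responsiveness, matching the three parameter groups named in the text.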

The results indicate significantly lower predictive performance of the HBV-lite model compared with the data-driven approaches, with a training NSE of −0.539 and a testing NSE of 0.154.

The negative training NSE indicates that, over the training period, the model performs worse than a simple mean-flow predictor, suggesting that the simplified conceptual structure is insufficient to capture the complexity of the rainfall–runoff relationship in the Bosna River basin using the available input data.

3.2. Feature Importance and Explainable Artificial Intelligence

While predictive accuracy is essential, understanding the contribution of individual hydro-meteorological predictors is equally important for decision support and scientific insight. To this end, an Explainable Artificial Intelligence (XAI) analysis was conducted to assess the relative importance of input features and their influence on streamflow prediction. Given the heterogeneous nature of the evaluated models, different yet conceptually consistent feature importance approaches were employed. For the tree-based models (XGBoost and RF), feature importance was derived from model-based importance measures, while for the LSTM model a permutation-based approach was adopted to ensure comparability across model classes.

3.3. Permutation-Based Feature Importance

Permutation importance quantifies the contribution of a feature by measuring the decrease in model performance when the feature’s values are randomly permuted. Let f denote a trained model and M(\cdot) a performance metric, here the NSE. The importance of feature x_j is computed as:

I_{j} = M(f(X)) - M(f(X_{\pi(j)}))

(6)

where X_{\pi(j)} denotes the input matrix with feature x_j permuted across samples.

A larger decrease in NSE indicates a more influential feature. This model-agnostic approach enables consistent interpretation across both tree-based and neural network models and directly reflects the impact of each feature on predictive skill. For more about explainable AI and its applications, see [56].
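Equation (6) can be illustrated with a short, dependency-free sketch. The stand-in model `toy_model`, the random data, and the averaging over several shuffles (`n_repeats`) are assumptions made for the example; Equation (6) itself describes a single permutation.

```python
import random


def nse(obs, pred):
    """Nash-Sutcliffe efficiency used as the performance metric M."""
    mean_obs = sum(obs) / len(obs)
    numerator = sum((o - p) ** 2 for o, p in zip(obs, pred))
    denominator = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - numerator / denominator


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean drop in NSE when each feature column is shuffled across samples."""
    rng = random.Random(seed)
    baseline = nse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the input with only column j permuted.
            X_perm = [row[:j] + [column[i]] + row[j + 1:] for i, row in enumerate(X)]
            drops.append(baseline - nse(y, [predict(row) for row in X_perm]))
        importances.append(sum(drops) / n_repeats)
    return importances


# Hypothetical stand-in for a trained model: uses features 0 and 1, ignores feature 2.
def toy_model(row):
    return 0.8 * row[0] + 0.2 * row[1]


data_rng = random.Random(42)
X = [[data_rng.uniform(0, 100) for _ in range(3)] for _ in range(200)]
y = [toy_model(row) for row in X]

importances = permutation_importance(toy_model, X, y)
```

A feature the model never uses receives zero importance, because shuffling it leaves every prediction unchanged, while the feature with the larger weight shows the larger drop in NSE.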

3.4. SHAP-Based Explainability

In addition to permutation-based feature importance, SHAP (Shapley Additive exPlanations) analysis was employed to provide a more detailed and theoretically grounded interpretation of model predictions. While permutation importance quantifies the impact of features on model performance, it does not capture the direction or context-dependent influence of individual predictors. SHAP is based on cooperative game theory and decomposes the prediction of a model into contributions from individual features. For a given instance x, the model output f(x) can be expressed as:

f(x) = \varphi_{0} + \sum_{j=1}^{p} \varphi_{j}

(7)

where φ₀ represents the average model output and φ_j denotes the contribution of feature j = 1, …, p to the prediction. Each contribution is computed as a weighted average over all subsets S of the feature set F (with |F| = p) that exclude feature j, where f(S) denotes the model output when only the features in S are known:

\varphi_{j} = \sum_{S \subseteq F \setminus \{j\}} \frac{|S|! \, (p - |S| - 1)!}{p!} [f(S \cup \{j\}) - f(S)]

(8)

This formulation ensures desirable properties such as consistency and additivity.
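For a small number of features, Equations (7) and (8) can be evaluated by brute force, which makes the additivity property easy to check. The sketch below is illustrative only: the model, the instance `x`, and the `background` values are hypothetical, and features outside a subset S are replaced by fixed background values, which is one common simplification. Enumerating all subsets is feasible only for small p.

```python
from itertools import combinations
from math import factorial


def shapley_values(value, p):
    """Exact Shapley values for a set function value(S) over features 0..p-1."""
    phi = []
    for j in range(p):
        others = [i for i in range(p) if i != j]
        total = 0.0
        for size in range(p):
            # Weight |S|! (p - |S| - 1)! / p! from Equation (8).
            weight = factorial(size) * factorial(p - size - 1) / factorial(p)
            for subset in combinations(others, size):
                S = frozenset(subset)
                total += weight * (value(S | {j}) - value(S))
        phi.append(total)
    return phi


# Hypothetical model with one additive term and one interaction term.
def model(x):
    return 2.0 * x[0] + 0.5 * x[1] * x[2]


x = [3.0, 4.0, 2.0]            # instance to explain
background = [1.0, 2.0, 1.0]   # reference values for "unknown" features (assumption)


def value(S):
    """Model output with features in S taken from x and all others from background."""
    mixed = [x[i] if i in S else background[i] for i in range(len(x))]
    return model(mixed)


phi0 = value(frozenset())       # baseline output, the phi_0 of Equation (7)
phi = shapley_values(value, len(x))
```

The baseline plus the three contributions reproduces the model output for `x`, and the two interacting features share their joint effect equally.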

In this study, SHAP analysis was applied to the XGBoost and Random Forest models using TreeSHAP, which allows efficient and exact computation of Shapley values for tree-based models. The analysis provides both:

Global interpretability, through ranking features based on mean absolute SHAP values.
Local interpretability, by explaining individual predictions and identifying how each feature increases or decreases predicted discharge.

Unlike permutation importance, which only measures performance degradation, SHAP values provide both the magnitude and direction of feature influence, as well as insight into nonlinear interactions between predictors.

3.5. Dominant Predictors of Streamflows

The analysis focused on the five most influential features for each model, ranked by their contribution to NSE reduction. Results indicate that lagged precipitation variables consistently dominate model predictions, particularly precipitation from the preceding one to three months. This finding is hydrologically meaningful, reflecting basin-scale storage effects and delayed runoff generation. The autoregressive streamflow term (Q_{t-1}) also emerges as a key predictor, highlighting the strong persistence of monthly river flow and confirming the importance of short-term memory in streamflow dynamics. Temperature-related features show a similarly important influence, especially T_Bjelašnica, which likely captures seasonal snowmelt and evapotranspiration effects in the upstream part of the basin. Notably, feature importance rankings remain broadly consistent across XGBoost and Random Forest, especially for the two most influential features (T_Bjelašnica and the autoregressive Q_lag1, i.e., Q_{t-1}), suggesting stable hydro-climatic controls on streamflow. In contrast, the LSTM model displays weaker and less structured importance patterns, consistent with its overall lower predictive performance. The agreement between data-driven feature importance and established hydrological understanding provides confidence in the physical plausibility of the models. The dominance of lagged precipitation and streamflow terms in the two better-performing models aligns with conceptual rainfall–runoff processes, while the secondary role of temperature at Bjelašnica reflects its indirect influence at monthly time scales. From an operational perspective, these results demonstrate that explainable machine learning techniques can support transparent and trustworthy streamflow forecasting. By identifying the most influential predictors, the proposed framework enables improved model diagnostics and enhanced interpretability for water resource management and decision-making.
Overall, the XAI analysis confirms that the tree-based ensemble models not only achieve higher predictive skill but also produce feature importance patterns consistent with hydrological theory, reinforcing their suitability for long-term monthly streamflow prediction in the BRB. Figure 4 illustrates the relative importance and directional influence of the dominant predictors for each model, highlighting the consistent dominance of lagged precipitation and autoregressive streamflow terms across tree-based ensemble models.

4. Discussion

The comparative assessment of XGBoost, Random Forest, and LSTM shows that tree-based ensemble methods provide the most accurate and robust one-step-ahead monthly streamflow forecasts for the Bosna River at Doboj, given the available data and model configuration. Both XGBoost and Random Forest achieve similar generalisation skill during the testing period (NSE ≈ 0.59; KGE ≈ 0.59; Table 1), while the LSTM clearly underperforms (NSE ≈ 0.45). In hydrological terms, these NSE values indicate moderate predictive capability at the monthly scale and are comparable to those reported in other data-driven streamflow forecasting studies using limited predictor sets and single-site applications [50,57]. The close agreement between observed and predicted flows for the two ensemble models in Figure 3a–b, particularly in reproducing intra-annual variability and the seasonal regime, further confirms that the dominant hydro-climatic signal is well captured, although residual errors remain evident during high-flow and low-flow extremes.

The near-perfect fit of XGBoost on the training data (NSE = 0.999; KGE = 0.993), compared with its markedly lower performance during the testing period, reveals a pronounced tendency towards overfitting, even with regularisation and time-series cross-validation. In contrast, Random Forest shows lower training skill (NSE = 0.964; KGE = 0.884) but almost identical test performance to XGBoost, suggesting a more favourable bias–variance trade-off. This pattern aligns with broader experience in hydro-climatic prediction, where highly flexible boosting algorithms can over-specialise in historical idiosyncrasies when trained on relatively short or non-stationary records [24,31,32,58,59]. The long modelling horizon increases the likelihood that climatic and anthropogenic changes have altered runoff generation processes, thereby limiting the transferability of complex models tuned on earlier decades to more recent conditions [60]. The similar out-of-sample performance of XGBoost and Random Forest indicates that, for this monthly forecasting problem and given the chosen feature set, increasing model complexity beyond bagging-based ensembles yields negligible gains in predictive skill.

In contrast, the LSTM model struggles to learn streamflow dynamics effectively (training NSE = 0.443; testing NSE = 0.445), and its predictions show greater dispersion and weaker reproduction of peaks and troughs in Figure 3c. This appears to contradict recent studies in which LSTM networks substantially outperform traditional models in daily rainfall–runoff and streamflow forecasting when large, information-rich datasets are available [10,49,61]. Several factors likely contribute to this discrepancy. First, using monthly aggregated data strongly smooths short-term dynamics and reduces the need to model long-range temporal dependencies, which are the main strength of recurrent architectures [62]. Second, the number of effective training samples (~600 months) is small relative to the parameterisation of even a moderately sized LSTM, increasing the risk of under-training or convergence to sub-optimal solutions [50]. Third, the input representation already encodes temporal information via lagged predictors and an autoregressive Q term, which partly diminishes the comparative advantage of sequence-based learning. Together, these constraints imply that, in data-limited, monthly-resolution settings, tree-based ensembles remain a more reliable choice than deep learning, a conclusion aligned with recent critical assessments of ML model selection in hydrology [57,63,64].
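The second point can be made concrete with a rough parameter count. The sketch below uses the standard count for a single LSTM layer with one bias vector per gate (the convention used in Keras; some frameworks, such as PyTorch, store two bias vectors per gate). The feature count and hidden size are hypothetical values chosen for illustration and are not the configuration used in this study.

```python
def lstm_param_count(n_features: int, n_hidden: int) -> int:
    """Trainable parameters of one LSTM layer: four gates, each with
    input weights, recurrent weights, and one bias vector."""
    return 4 * (n_hidden * (n_features + n_hidden) + n_hidden)


n_features = 16   # hypothetical number of input predictors
n_hidden = 64     # hypothetical hidden-state size
n_samples = 600   # approximate number of training months quoted in the text

layer_params = lstm_param_count(n_features, n_hidden)
head_params = n_hidden + 1               # single-output dense layer
total_params = layer_params + head_params
params_per_sample = total_params / n_samples
print(layer_params, total_params, round(params_per_sample, 1))
```

Under these assumed sizes the network has roughly 35 trainable parameters per training month, which illustrates why a moderately sized recurrent model is hard to fit on a monthly record of this length.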

The explainable AI analysis provides an important consistency check between data-driven learning and hydrological process understanding. Permutation-based feature importance (Figure 5) shows that lagged precipitation is the dominant predictor class across the better-performing models, particularly rainfall in the preceding one to three months [65,66]. This is hydrologically plausible for a medium-sized basin with significant storage in soils, shallow groundwater, and the river network, where monthly flows reflect the integration of recent precipitation rather than only contemporaneous inputs [10,62]. The strong role of the autoregressive streamflow term (Q_lag1) corroborates the pronounced persistence of monthly discharge, representing baseflow and slow-release components of catchment storage. The importance of temperature, especially at the high-elevation Bjelašnica station, indicates that snow accumulation, melt processes and temperature-dependent evapotranspiration modulate the seasonal hydrograph, consistent with established conceptual models and with ML-based analyses in other snow-affected basins [10,50]. The similarity in feature rankings between XGBoost and RF suggests that the inferred hydro-climatic controls are robust across ensemble algorithms.

In contrast, the LSTM shows weaker and less structured importance patterns, reflecting its poorer predictive performance and suggesting that it did not converge on a clear hydrological representation of the system. This finding highlights the value of XAI tools not only for post hoc interpretation but also for model diagnostics and selection: models that do not produce physically interpretable feature importance patterns, even when achieving modest skill, should be treated with caution in operational contexts [63]. In this study, the alignment between hydrological expectations and XAI-based rankings for the ensemble models supports their use as transparent decision-support tools.

From an operational perspective, an NSE of around 0.6 at monthly resolution is sufficient for various water management applications, including seasonal allocation planning, reservoir rule-curve support, and drought early warning, particularly in data-scarce contexts. For watershed-scale models, monthly streamflow performance is often judged satisfactory when NSE ≥ 0.50 [67,68,69,70]. However, inspection of Figure 3 shows that both XGBoost and Random Forest tend to underestimate extremely high flows and occasionally misrepresent sustained low-flow periods, a behaviour frequently documented for data-driven models trained on error metrics dominated by moderate flows [61]. Improving the representation of hydrological extremes in the Bosna River Basin may require multi-objective training focused on high- and low-flow performance, the incorporation of additional predictors such as snow indices, soil moisture, or large-scale climate indices, or hybrid approaches that blend data-driven models with process-based constraints [50,63].
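One simple way to quantify the behaviour at the extremes is to compute the mean error separately for low, moderate, and high observed flows. The sketch below is an illustration under stated assumptions: the quantile thresholds (10% and 90%), the function name `regime_bias`, and the synthetic series, in which the simulation deliberately compresses the observed range, are all introduced here and do not come from the study.

```python
import numpy as np


def regime_bias(obs, sim, low_q=0.10, high_q=0.90):
    """Mean error (sim - obs) within low, mid, and high observed-flow classes."""
    obs = np.asarray(obs, dtype=float)
    sim = np.asarray(sim, dtype=float)
    low_thr, high_thr = np.quantile(obs, [low_q, high_q])
    err = sim - obs
    return {
        "low": err[obs <= low_thr].mean(),
        "mid": err[(obs > low_thr) & (obs < high_thr)].mean(),
        "high": err[obs >= high_thr].mean(),
    }


# Synthetic monthly flows (not observed data) and a range-compressing simulation.
rng = np.random.default_rng(0)
obs = rng.gamma(shape=2.0, scale=60.0, size=120)
sim = obs.mean() + 0.7 * (obs - obs.mean())

bias = regime_bias(obs, sim)
print(bias)
```

A negative value in the high class together with a positive value in the low class is the signature of a model that regresses towards the mean, which is the pattern described above for the ensemble models.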

Several structural limitations of this study warrant emphasis. The analysis is limited to a single downstream gauging station, which prevents explicit consideration of spatial heterogeneity within the basin. The predictor set includes only precipitation and air temperature from five stations, along with antecedent discharge, without remote sensing or land-use information that could better capture the non-stationary impacts of human activities. The use of monthly aggregation, although motivated by data availability and long-term coverage, obscures sub-monthly flood processes and may limit applicability to short-lead flood forecasting. Additionally, the results indicate that the models are primarily suited for capturing general discharge dynamics and seasonal variability, rather than precise prediction of extreme events.

Addressing these limitations with higher-temporal-resolution data, multi-site modelling, expanded predictor sets, and experimentation with hybrid or physics-informed machine learning architectures is a promising direction for future work and will be essential for fully leveraging machine learning to support climate-resilient water management in the BRB and similar regions. This includes direct comparisons with more powerful conceptual rainfall–runoff models to further assess the strengths and limitations of data-driven methods.

5. Conclusions

This study presents the first application of machine learning techniques for streamflow forecasting in Bosnia and Herzegovina, specifically evaluating Random Forest (RF), XGBoost, and Long Short-Term Memory (LSTM) models for mean monthly predictions at the Doboj station in the Bosna River Basin (BRB). The findings highlight the effectiveness of tree-based ensemble methods over deep learning approaches in data-limited, monthly-resolution settings, with XGBoost and RF achieving NSE ≈ 0.59 on the test data, outperforming LSTM’s NSE of 0.445. This performance gap demonstrates the robustness of tree-based models to non-stationarity and multicollinearity, whereas LSTMs struggle with aggregated sequences that lack fine-grained temporal details. Bias correction further improved predictions, aligning them with observed hydrographs and reducing systematic underestimation of peaks, as shown in Figure 3. XAI permutation analysis (Figure 5) identified lagged precipitation and autoregressive streamflow as the main drivers, with upstream temperature influencing seasonal dynamics, providing interpretable insights into the BRB’s pluvial–nival regime and karstic hydrology.
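The linear bias correction mentioned above can be illustrated with a short sketch. It assumes that the correction is a straight-line mapping from raw predictions to observations, fitted on the training period only and then applied to later predictions; the exact procedure used in the study is not reproduced here, and the function names and numbers are hypothetical.

```python
import numpy as np


def fit_linear_bias_correction(pred_train, obs_train):
    """Least-squares line mapping raw predictions to observations."""
    slope, intercept = np.polyfit(pred_train, obs_train, deg=1)
    return slope, intercept


def apply_linear_bias_correction(pred, slope, intercept):
    """Apply the fitted line to new raw predictions."""
    return slope * np.asarray(pred, dtype=float) + intercept


# Illustrative training data: raw predictions that damp and shift the flows.
obs_train = np.array([40.0, 80.0, 120.0, 200.0, 310.0])
pred_train = 0.5 * obs_train + 10.0

slope, intercept = fit_linear_bias_correction(pred_train, obs_train)

# Raw test-period predictions are corrected with the training-period line.
pred_test = np.array([35.0, 110.0])
corrected = apply_linear_bias_correction(pred_test, slope, intercept)
print(slope, intercept, corrected)
```

Fitting the line on training data alone keeps information from the test period out of the correction, so the reported test skill remains an out-of-sample estimate.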

These outcomes address a critical gap in the literature by moving from descriptive streamflow studies in the region to advanced and transparent forecasting. The models’ ability to generalise, despite training–testing discrepancies, informs resilient water management amid climate-driven changes such as declining precipitation and increased extremes. By identifying key hydrological controls, this framework supports targeted interventions, such as reservoir optimisation in urban–industrial centres (e.g., Sarajevo, Zenica). Broader implications extend to global water resources, demonstrating the potential of machine learning in ungauged or sparsely monitored basins, where traditional physical models require extensive data. Ultimately, this work establishes a foundation for interpretable AI in hydrology, promoting equitable decision-making in vulnerable regions such as the BRB.

The approach also has drawbacks. The absence of explicit seasonality indicators (e.g., month-of-year encoding) and snow-related variables may limit the ability of the models to accurately reproduce extreme events and regime transitions. Future research should extend the feature space with such hydrologically meaningful predictors, integrate additional data sources (e.g., satellite-derived snow cover), and explore hybrid physics–machine learning architectures to improve accuracy and reduce uncertainty, while assessing the impact of each addition on predictive performance.
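As an example of an explicit seasonality indicator, the month of the year can be encoded cyclically so that December and January are treated as neighbours. The sketch below shows the common sine/cosine encoding; it is offered as an illustration of the proposed extension, not as part of the models evaluated in this study, and the function name `encode_month` is hypothetical.

```python
import numpy as np


def encode_month(month):
    """Map month numbers 1..12 to (sin, cos) coordinates on the unit circle."""
    month = np.atleast_1d(np.asarray(month, dtype=float))
    angle = 2.0 * np.pi * (month - 1.0) / 12.0
    return np.column_stack([np.sin(angle), np.cos(angle)])


months = np.arange(1, 13)
features = encode_month(months)  # shape (12, 2)

# Distances in the encoded space reflect position in the seasonal cycle.
dist_dec_jan = np.linalg.norm(features[11] - features[0])
dist_jan_jul = np.linalg.norm(features[0] - features[6])
print(features.shape, dist_dec_jan, dist_jan_jul)
```

With a plain integer month, December (12) and January (1) would appear to be maximally far apart. The two-column encoding removes that artificial discontinuity at the cost of one extra feature.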

Author Contributions

Conceptualization, S.G. and M.Ð.; methodology, I.L.; software, M.Ð.; validation, S.G. and I.L.; formal analysis, S.G.; investigation, S.G. and Q.Z.; resources, S.G.; data curation, M.Ð.; writing—original draft preparation, S.G.; writing—review and editing, S.G. and Q.Z.; visualization, S.G.; supervision, I.L.; project administration, S.G.; funding acquisition, I.L. All authors have read and agreed to the published version of the manuscript.

Funding

This publication is co-funded by the European Union’s Horizon Europe research and innovation program under the MSCA COFUND Postdoctoral Programme (grant agreement No. 101081355—SMASH) and by the Republic of Slovenia and the European Union through the European Regional Development Fund. Views and opinions expressed are those of the authors only and do not necessarily reflect those of the European Union or the European Research Executive Agency (REA). Neither the European Union nor the REA can be held responsible for them.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

BH	Bosnia and Herzegovina
BRB	Bosna River Basin
ML	Machine Learning
XGBoost	eXtreme Gradient Boosting
RF	Random Forest
LSTM	Long Short-Term Memory
XAI	Explainable Artificial Intelligence

References

1. Kenyi, M.G.S.; Yamamoto, K. A hybrid SARIMA-Prophet model for predicting historical streamflow time-series of the Sobat River in South Sudan. Discov. Appl. Sci. 2024, 6, 457.
2. Kaur, S.; Chavan, S.R. Comparative analysis of deep learning and machine learning models for one-day-ahead streamflow forecasting in the Krishna River basin. J. Hydrol. Reg. Stud. 2025, 60, 102549.
3. Zhao, Y.; Chadha, M.; Barthlow, D.; Yeates, E.; Mcknight, C.J.; Memarsadeghi, N.P.; Gugaratshan, G.; Todd, M.D.; Hu, Z. Physics-enhanced machine learning models for streamflow discharge forecasting. J. Hydroinform. 2024, 26, 2506–2537.
4. Luppichini, M.; Vailati, G.; Fontana, L.; Bini, M. Machine learning models for river flow forecasting in small catchments. Sci. Rep. 2024, 14, 26740.
5. Hamzeh, H.A.; Zahra, A.; Mohammad, N.T. A comparative study between time series and soft computing models for river discharge forecasting. Appl. Water Sci. 2025, 15, 273.
6. Suwal, N.; Khatakho, R.; Jha, A.N.; Ankon, S.B.; Lamichhane, M.; Kuriqi, A. Streamflow forecasting using machine learning and remote sensing data in the Himalayan region. Water Sci. Technol. 2025, 20, 2472–2491.
7. Ni, L.; Wang, D.; Wu, J.; Wang, Y.; Tao, Y.; Zhang, J.; Liu, J. Streamflow forecasting using extreme gradient boosting model coupled with Gaussian mixture model. J. Hydrol. 2020, 586, 124901.
8. Kedam, N.; Tiwari, D.K.; Kumar, V.; Khedher, K.M.; Salem, M.A. River stream flow prediction through advanced machine learning models for enhanced accuracy. Results Eng. 2024, 22, 102215.
9. López-Chacón, S.R.; Salazar, F.; Bladé, E. Interpretation of a Machine Learning Model for Short-Term High Streamflow Prediction. Earth 2025, 6, 64.
10. Feng, D.; Fang, K.; Shen, C. Enhancing streamflow forecast and extracting insights using long–short term memory networks with data integration at continental scales. Water Resour. Res. 2020, 56, e2019WR026793.
11. Zealand, C.M.; Burn, D.H.; Simonovic, S.P. Short term streamflow forecasting using artificial neural networks. J. Hydrol. 1999, 214, 32–48.
12. Zhou, J.; Peng, T.; Zhang, C.; Sun, N. Data Pre-Analysis and Ensemble of Various Artificial Neural Networks for Monthly Streamflow Forecasting. Water 2018, 10, 628.
13. Zemzami, M.; Benaabidate, L. Improvement of artificial neural networks to predict daily streamflow in a semi-arid area. Hydrol. Sci. J. 2016, 61, 1801–1812.
14. Hunt, K.M.R.; Matthews, G.R.; Pappenberger, F.; Prudhomme, C. Using a long short-term memory (LSTM) neural network to boost river streamflow forecasts over the western United States. Hydrol. Earth Syst. Sci. 2022, 26, 5449–5472.
15. Leščešen, I.; Tanhapour, M.; Pekárová, P.; Miklánek, P.; Bajtek, Z. Long Short-Term Memory (LSTM) Networks for Accurate River Flow Forecasting: A Case Study on the Morava River Basin (Serbia). Water 2025, 17, 907.
16. Nguyen, N.Y.; Kha, D.D.; Van Ninh, L.; Anh, V.T.; Anh, T.N. Streamflow prediction using Long Short-Term Memory networks: A case study at the Kratie Hydrological Station, Mekong River Basin. J. Hydroinform. 2025, 27, 275–298.
17. Liu, Y.; Hou, G.; Huang, F.; Qin, H.; Wang, B.; Yi, L. Directed graph deep neural network for multi-step daily streamflow forecasting. J. Hydrol. 2022, 607, 127515.
18. Huang, J.; Chen, J.; Huang, H.; Cai, X. Deep Learning-Based Daily Streamflow Prediction Model for the Hanjiang River Basin. Hydrology 2025, 12, 168.
19. Callegari, M.; Mazzoli, P.; De Gregorio, L.; Notarnicola, C.; Pasolli, L.; Petitta, M.; Pistocchi, A. Seasonal River Discharge Forecasting Using Support Vector Regression: A Case Study in the Italian Alps. Water 2015, 7, 2494–2515.
20. Sharma, B.; Goel, N.K. Streamflow prediction using support vector regression machine learning model for Tehri Dam. Appl. Water Sci. 2024, 14, 99.
21. Pham, L.T.; Luo, L.; Finley, A. Evaluation of random forests for short-term daily streamflow forecasting in rainfall- and snowmelt-driven watersheds. Hydrol. Earth Syst. Sci. 2021, 25, 2997–3015.
22. Puri, D.; Sihag, P.; Thakur, M.S.; Jameel, M.; Chadee, A.A.; Hazi, M.A. Analysis of data splitting on streamflow prediction using random forest. AIMS Environ. Sci. 2024, 11, 593–609.
23. Xu, K.; Han, Z.; Xu, H.; Bin, L. Rapid Prediction Model for Urban Floods Based on a Light Gradient Boosting Machine Approach and Hydrological–Hydraulic Model. Int. J. Disaster Risk Sci. 2023, 14, 79–97.
24. Kumar, V.; Kedam, N.; Sharma, K.V.; Mehta, D.J.; Caloiero, T. Advanced Machine Learning Techniques to Improve Hydrological Prediction: A Comparative Analysis of Streamflow Prediction Models. Water 2023, 15, 2572.
25. Alipour, M.H. Streamflow prediction in ungauged basins located within data-scarce areas using XGBoost: Role of feature engineering and explainability. Int. J. River Basin Manag. 2025, 23, 71–92.
26. R, Y.P.; R, M. Enhanced streamflow prediction using SWAT’s influential parameters: A comparative analysis of PCA-MLR and XGBoost models. Earth Sci. Inform. 2023, 16, 4053–4076.
27. Guo, J.; Zhang, F.; Li, W.; Yang, A.; Fan, Y.; Li, J. Runoff Prediction in the Xiangxi River Basin Under Climate Change: The Application of the HBV-XGBoost Coupled Model. Water 2025, 17, 2420.
28. Krysanova, V.; Arnold, J.G. Advances in ecohydrological modelling with SWAT—A review. Hydrol. Sci. J. 2008, 53, 939–947.
29. Bizuneh, B.B.; Moges, M.A.; Sinshaw, B.G.; Kerebih, M.S. SWAT and HBV models’ response to streamflow estimation in the upper Blue Nile Basin, Ethiopia. Water-Energy Nexus 2021, 4, 41–53.
30. Simonov, Y.A.; Semenova, N.K.; Khristoforov, A.V. Short-range streamflow forecasting of the Kama River Based on the HBV model application. Russ. Meteorol. Hydrol. 2021, 46, 388–395.
31. Schoppa, L.; Disse, M.; Bachmair, S. Evaluating the performance of random forest for large-scale flood discharge simulation. J. Hydrol. 2020, 590, 125531.
32. Szczepanek, R. Daily Streamflow Forecasting in Mountainous Catchment Using XGBoost, LightGBM and CatBoost. Hydrology 2022, 9, 226.
33. Le, X.-H.; Ho, H.V.; Lee, G.; Jung, S. Application of long short-term memory (LSTM) neural network for flood forecasting. Water 2019, 11, 1387.
34. Mienye, E.; Jere, N.; Obaido, G.; Mienye, I.D.; Aruleba, K. Deep Learning in Finance: A Survey of Applications and Techniques. AI 2024, 5, 2066–2091.
35. Nikoo, M.R.; Aamri, A.A.; Etri, T.; Al-Rawas, G. A review of machine learning, remote sensing, and statistical methods for reservoir water quality assessment. J. Hydrol. 2025, 659, 133323.
36. Hama Aziz, R.H.; Dimililer, N. SentiXGboost: Enhanced sentiment analysis in social media posts with ensemble XGBoost classifier. J. Chin. Inst. Eng. 2021, 44, 562–572.
37. Jailani, N.L.M.; Dhanasegaran, J.K.; Alkawsi, G.; Alkahtani, A.A.; Phing, C.C.; Baashar, Y.; Capretz, L.F.; Al-Shetwi, A.Q.; Tiong, S.K. Investigating the Power of LSTM-Based Models in Solar Energy Forecasting. Processes 2023, 11, 1382.
38. Dwivedi, R.; Dave, D.; Naik, H.; Singhal, S.; Omer, R.; Patel, P.; Qian, B.; Wen, Z.; Shah, T.; Morgan, G.; et al. Explainable AI (XAI): Core Ideas, Techniques, and Solutions. ACM Comput. Surv. 2023, 55, 194.
39. Gnjato, S.; Popov, T.; Gnjato, R. Climate change impact on streamflows in Bosnia and Herzegovina: A case study of the lower Bosna river basin. Cent. Asian J. Geogr. Res. 2025, 1–2, 95–101.
40. Gnjato, S.; Popov, T.; Adžić, D.; Ivanišević, M.; Trbić, G.; Bajić, D. Influence of climate change on river discharges over the Sava river watershed in Bosnia and Herzegovina. Idöjárás 2021, 125, 449–462.
41. Gnjato, S.; Popov, T.; Ivanišević, M.; Trbić, G. Long-term streamflow trends in Bosnia and Herzegovina (BH). Environ. Earth Sci. 2023, 82, 356.
42. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
43. Tyralis, H.; Papacharalampous, G.; Langousis, A. A brief review of random forests for water scientists and practitioners and their recent history in water resources. Water 2019, 11, 910.
44. Yaseen, Z.M.; El-Shafie, A.; Jaafar, O.; Afan, H.A.; Sayl, K.N. Artificial intelligence based models for streamflow forecasting: 2000–2015. J. Hydrol. 2015, 530, 829–844.
45. Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016.
46. Leščešen, I.; Pekárová, P.; Miklánek, P.; Bajtek, Z. Streamflow Variability and Predictive Modeling in the Carpathian Basin: Assessing the Performance of Machine Learning Algorithms. In Proceedings of the EGU General Assembly (EGU25-8061), Vienna, Austria, 27 April–2 May 2025.
47. Yu, T.-K.; Chang, I.-C.; Chen, S.-D.; Chen, H.-L.; Yu, T.-Y. Predicting potential soil and groundwater contamination risks from gas stations using three machine learning models (XGBoost, LightGBM, and Random Forest). Process Saf. Environ. Prot. 2025, 199, 107249.
48. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
49. Kratzert, F.; Klotz, D.; Brenner, C.; Schulz, K.; Herrnegger, M. Rainfall–runoff modelling using Long Short-Term Memory (LSTM) networks. Hydrol. Earth Syst. Sci. 2018, 22, 6005–6022.
50. Shen, C. A transdisciplinary review of deep learning research and its relevance for water resources scientists. Water Resour. Res. 2018, 54, 8558–8593.
51. Wu, H.; Chen, B. Evaluating Uncertainty Estimates in Distributed Hydrological Modeling for the Wenjing River Watershed in China by GLUE, SUFI-2, and ParaSol Methods. Ecol. Eng. 2015, 76, 110–121.
52. Arpit, D.; Wang, H.; Zhou, Y.; Xiong, C. Ensemble of Averages: Improving Model Selection and Boosting Performance in Domain Generalization. arXiv 2021, arXiv:2110.10832.
53. Kartal, V. Machine learning-based streamflow forecasting using CMIP6 scenarios: Assessing performance and improving hydrological projections and climate change. Hydrol. Process. 2024, 38, e15204.
54. Thota, S.; Nassar, A.; Filali Boubrahimi, S.; Hamdi, S.M.; Hosseinzadeh, P. Enhancing Monthly Streamflow Prediction Using Meteorological Factors and Machine Learning Models in the Upper Colorado River Basin. Hydrology 2024, 11, 66.
55. Pala, A.; Mutlu, S.S.; Guven, A. Evaluation of machine learning models for streamflow projections under different greenhouse gas emission scenarios. J. Eng. Res. 2026, 14, 54–68.
56. Xu, F.; Uszkoreit, H.; Du, Y.; Fan, W.; Zhao, D.; Zhu, J. Explainable AI: A brief survey on history, research areas, approaches and challenges. In CCF International Conference on Natural Language Processing and Chinese Computing; Springer International Publishing: Cham, Switzerland, 2019; pp. 563–574.
57. Mosavi, A.; Ozturk, P.; Chau, K.-W. Flood prediction using machine learning models: A systematic review. Water 2018, 10, 1536.
58. Ding, B.; Yu, X.; Jia, G. Exploring the controlling factors of watershed streamflow variability using hydrological and machine learning models. Water Resour. Res. 2025, 61, e2024WR039734.
59. Hameed, M.M.; Masood, A.; Hamid, A.; Elbeltagi, A.; Razali, S.F.M.; Salem, A. Forecasting monthly runoff in a glacierized catchment: A comparison of extreme gradient boosting (XGBoost) and deep learning models. PLoS ONE 2025, 20, e0321008.
60. Blöschl, G.; Bierkens, M.F.P.; Chambel, A.; Cudennec, C.; Destouni, G.; Fiori, A.; Kirchner, J.W.; McDonnell, J.J.; Savenije, H.H.G.; Sivapalan, M.; et al. Twenty-three unsolved problems in hydrology (UPH)—A community perspective. Hydrol. Sci. J. 2019, 64, 1141–1158.
61. Kumshe, U.M.M.; Abdulhamid, Z.M.; Mala, B.A.; Muazu, T.; Muhammad, A.U.; Sangary, O.; Ba, A.F.; Tijjani, S.; Adam, J.B.; Ali, M.A.H.; et al. Improving Short-term Daily Streamflow Forecasting Using an Autoencoder Based CNN-LSTM Model. Water Resour. Manag. 2024, 38, 5973–5989.
62. Cheng, M.; Fang, F.; Kinouchi, T.; Navon, I.M.; Pain, C.C. Long lead-time daily and monthly streamflow forecasting using machine learning methods. J. Hydrol. 2020, 590, 125376.
63. Nearing, G.S.; Kratzert, F.; Sampson, A.K.; Pelissier, C.S.; Klotz, D.; Frame, J.M.; Prieto, C.; Gupta, H.V. What role does hydrological science play in the age of machine learning? Water Resour. Res. 2021, 57, e2020WR028091.
64. Rose, M.A.J.; Chithra, N.R. Tree-based ensemble model prediction for hydrological drought in a tropical river basin of India. Int. J. Environ. Sci. Technol. 2023, 20, 4973–4990.
65. Schmidt, L.; Heße, F.; Attinger, S.; Kumar, R. Challenges in applying machine learning models for hydrological inference: A case study for flooding events across Germany. Water Resour. Res. 2020, 56, e2019WR025924.
66. Li, K.; Huang, G.; Baetz, B. Development of a Wilks feature importance method with improved variable rankings for supporting hydrological inference and modelling. Hydrol. Earth Syst. Sci. 2021, 25, 4947–4966.
67. Chen, S.; Huang, J.; Huang, J.C. Improving daily streamflow simulations for data-scarce watersheds using the coupled SWAT-LSTM approach. J. Hydrol. 2023, 622, 129734.
68. Ougahi, J.H.; Rowan, J.S. Investigating deep learning knowledge transfer in streamflow prediction from global to local catchment. Water Resour. Res. 2026, 62, e2025WR041194.
69. Lu, D.; Konapala, G.; Painter, S.L.; Kao, S.-C.; Gangrade, S. Streamflow Simulation in Data-Scarce Basins Using Bayesian and Physics-Informed Machine Learning Models. J. Hydrometeor. 2021, 22, 1421–1438.
70. Gao, Y.; Mandania, R.; Ma, J.; Chen, J.; Zhuang, W. Enhancing Streamflow Modeling in Data-Scarce Catchments with Similarity-Guided Source Selection and Transfer Learning. Water 2025, 17, 2762.

Figure 1. Map of the Bosna River Basin.

Figure 2. Summary of the computational workflow.

Figure 3. Observed versus predicted monthly streamflows during the testing period (2011–2020) for the evaluated models: (a) XGBoost, (b) RF, and (c) LSTM. All predictions correspond to one-step-ahead forecasts after linear bias correction.

Figure 4. Scatter plots of observed versus predicted values with the regression line: (a) XGBoost, (b) RF, and (c) LSTM.

Figure 5. Direction and magnitude of the five most influential features identified through explainable artificial intelligence analysis: (a) XGBoost, (b) RF, and (c) LSTM. Bars indicate the decrease in NSE after feature permutation, while colour denotes the direction of each feature’s contribution to predicted streamflows (green: positive; red: negative).

Table 1. Performance comparison of the evaluated models on training and testing data. RMSE and MAE are expressed in streamflow units, while NSE and KGE are dimensionless.

Model	Dataset	RMSE	MAE	NSE	KGE
XGBoost	Train	2.920	2.208	0.999	0.993
XGBoost	Test	64.680	48.029	0.589	0.589
RF	Train	20.769	14.888	0.964	0.884
RF	Test	64.500	47.169	0.591	0.591
LSTM	Train	81.581	57.432	0.443	0.382
LSTM	Test	75.086	54.763	0.445	0.489
