1. Introduction
Prediction in hydrology plays an essential role in the management of water resources. In a context marked by climate uncertainties and increasingly frequent extreme events, the ability to predict the behaviour of hydrological systems is critical. Classical methods, based on statistical and physical models, continue to provide a solid foundation, while methods based on artificial intelligence offer new tools capable of learning from recorded time series and improving forecast accuracy.
Traditional time series forecasting methods, such as autoregressive ones, are successfully used in short-term prediction provided that the time series is stationary [1,2,3]. One model that incorporates a transformation of non-stationary time series into stationary ones by differencing is the ARIMA (Autoregressive Integrated Moving Average) model. This model is successfully applied to forecasting in various fields, especially in the short term; as the forecast horizon increases, however, its accuracy deteriorates. Hybrid and deep learning models, including LSTM and Bidirectional LSTM, have demonstrated improved capabilities compared to classical ARIMA approaches in time series forecasting [4].
Recently, AI-based methods have seen a significant expansion in the field of forecasting, demonstrating high potential for improving prediction accuracy through their ability to model nonlinear relationships between time series variables. Neural networks have proven applicable to different renewable energy sources: wind [5,6,7], solar [8,9], and hydraulic [10,11,12]. A comprehensive analysis of deep sequential models used in time series forecasting is performed in [13], highlighting the limitations of classical methods (ARIMA, Support Vector Machine) in capturing nonlinear relationships and long-term dependencies in complex time series compared to neural network-based models. Comparisons between artificial neural networks and traditional statistical models are presented in many papers [14,15,16], which report the superiority of the AI-based models. A comprehensive review of the scientific literature comparing ARIMA models with machine learning techniques for time series forecasting, including hybrid combinations of the two approaches, is carried out in [17]. Among AI-based methods, LSTM networks have proven suitable for time series with seasonal patterns, cyclicality, and trends hidden in long data series [18,19]. In [20], a dynamic classification-based LSTM model is developed for forecasting daily flows in different climatic regions, highlighting the importance of adapting the model to dynamic flow patterns. Despite their flexibility and ability to model nonlinear patterns, LSTM networks often require large datasets and careful hyperparameter tuning, which can limit their practicality in data-scarce environments. To improve predictions, various approaches to combining and optimizing existing models have been developed over time. The integration of evolutionary algorithms allowed the optimization of model parameters, and the combination of different types of models led to the creation of hybrid models. These hybrid models provided superior accuracy compared to classical autoregressive models [21,22,23].
Multi-step-ahead forecasting is essential for decision-making in time-dependent systems. However, multi-step prediction poses significant challenges compared to short-term prediction (one-step-ahead), mainly due to the propagation and accumulation of prediction errors from one step to the next as well as due to the stochastic nature of the forecasted variable.
In multi-step prediction, each future prediction is conditioned on previous values, either actual (open loop) or estimated (closed loop) [24]. Thus, an error introduced at the current time step (t + 1) is carried over to the next step (t + 2), where it is combined with a new local error, and this process continues iteratively, potentially producing a significant deviation from the trajectory of the real variable. This phenomenon leads to a rapid degradation of the model’s performance over longer prediction intervals (5–7 days or more), so that the shape or trend of the forecast can become completely unrealistic or even divergent.
This paper starts from the hypothesis that the accuracy of the prediction in the first timestep plays a critical role in the success of a multi-step prediction. The closer the first predicted value is to the real value, the lower the error transmitted to the subsequent steps, which contributes to increasing the accuracy of the forecast. Consequently, a well-calibrated model for one-step-ahead prediction can form the foundation of a robust short-term closed-loop prediction system.
This study proposes a hybrid approach combining LSTM and ARIMA models, applied to real-world hydrological data, namely discharge values. LSTM is used as the primary forecasting model, while ARIMA is applied to correct the residual errors. The method shows improved performance over standalone LSTM in both one-day and short-term multi-step forecasts, highlighting the robustness and effectiveness of the hybrid structure in inflow prediction. Notably, most existing studies adopt the reverse configuration, using ARIMA as the main predictor and LSTM for residual correction. Moreover, multi-step forecasts are generally prone to instability; however, the proposed hybrid model demonstrates consistent and reliable performance across multiple forecasting horizons.
2. Materials and Methods
A long short-term memory (LSTM) network is a special type of recurrent neural network that can process data sequences and retain information from both the near and the distant past of the time series, making it suitable for prediction in hydrology. The network can retain patterns associated with variables, such as reservoir inflows, that exhibit complex behaviour characterized by nonlinearity, seasonality, and stochastic components. This advantage of LSTM networks, together with the availability in MATLAB of routines for network training with various learning methods (Adam, Stochastic Gradient Descent with Momentum, Root Mean Square Propagation), facilitates their application in forecasting practice. Setting the hyperparameters of the model, such as the number of epochs, the learning rate, and the mini-batch size, typically involves repeated experimentation, and even slight adjustments to these hyperparameters can lead to significantly different results. However, once properly tuned, the model is able to learn complex patterns and generate highly accurate predictions.
Each memory cell in the LSTM is updated by adding or removing information while the information is being transmitted through three gates: forget gate, input gate, and output gate.
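For reference, the gate mechanism is commonly written as shown below. The notation is the standard textbook one and is an assumption here, not taken from this study: x_t is the input at time t, h_t the hidden state, c_t the cell state, W, U and b are learned weights and biases, σ is the logistic sigmoid, and ⊙ denotes the element-wise product.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) && \text{(forget gate)}\\
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) && \text{(input gate)}\\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right) && \text{(output gate)}\\
\tilde{c}_t &= \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right) && \text{(candidate cell state)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state update)}\\
h_t &= o_t \odot \tanh\left(c_t\right) && \text{(hidden state)}
\end{aligned}
```

The forget gate removes information from the previous cell state, the input gate adds new information, and the output gate controls what is passed on to the next time step.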
In its simplest form, the day index can be encoded as a function composed of sine and cosine functions to reflect the annual periodicity of inflows [25], where i is the day index; i = 1 corresponds to 1 January and i = 365 corresponds to 31 December.
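A minimal sketch of one common cyclical encoding of the day index is given below. The exact expression used in [25] is not reproduced here; the two-feature form sin(2πi/365), cos(2πi/365) and the function name `day_index_features` are assumptions made for illustration.

```python
import math

def day_index_features(i, period=365):
    """Return an assumed (sin, cos) encoding of the day index i (1..period)."""
    angle = 2.0 * math.pi * i / period
    return math.sin(angle), math.cos(angle)

# 1 January and 31 December map to neighbouring points on the unit circle,
# so the encoding has no artificial jump at the turn of the year.
s1, c1 = day_index_features(1)
s365, c365 = day_index_features(365)
gap = math.hypot(s1 - s365, c1 - c365)
```

Using both components, rather than the sine alone, keeps every day of the year distinguishable, since a single sine value is shared by two different days.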
In the proposed hybrid model, the LSTM network produces the baseline prediction for the time series. However well the model is trained, differences remain between the predictions and the target values. To correct this component, an ARIMA model is fitted to the residual series, obtained as the difference between the predicted values and the actual values over the training period.
The structure of the hybrid LSTM-ARIMA model used in this study is shown in Figure 1, illustrating the combination of the main forecast provided by LSTM and the error correction performed by ARIMA, applied for both open-loop (365 days) and closed-loop (7 days) predictions.
The residuals between the LSTM predictions and the observed values constituted a new time series, which generally shows autocorrelation. Based on this series of errors, an ARIMA model was calibrated to make additional predictions.
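The residual-correction idea can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the study: a first-order autoregressive model fitted by least squares stands in for the ARIMA model, the numbers are hypothetical, and the names `fit_ar1` and `hybrid_forecast` are introduced only for this example.

```python
def fit_ar1(residuals):
    """Least-squares AR(1) coefficient phi for e[t] ~ phi * e[t-1] (stand-in for ARIMA)."""
    num = sum(a * b for a, b in zip(residuals[1:], residuals[:-1]))
    den = sum(b * b for b in residuals[:-1])
    return num / den if den else 0.0

def hybrid_forecast(baseline_next, residuals, phi):
    """Correct the next baseline forecast with the predicted residual.

    Residuals are defined as predicted minus observed, so the predicted
    residual is subtracted from the baseline forecast.
    """
    predicted_residual = phi * residuals[-1]
    return baseline_next - predicted_residual

# Hypothetical values, for illustration only.
observed = [10.0, 12.0, 15.0, 14.0, 13.0]
baseline = [11.0, 13.5, 17.0, 15.5, 14.0]  # stand-in for the LSTM output
residuals = [p - o for p, o in zip(baseline, observed)]
phi = fit_ar1(residuals)
corrected = hybrid_forecast(13.0, residuals, phi)
```

Because the baseline overestimates throughout this toy series, the residuals are positively autocorrelated and the correction lowers the next forecast.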
The time series model is estimated, and each forecast is recursively computed using the real values in an open loop (Figure 2a) or using previous forecasts in a closed loop (Figure 2b).
Figure 2 distinguishes the observed (measured) values from the forecasted values. In Figure 2a, the value forecasted at each time step is not used to forecast future values. In contrast, in Figure 2b each forecasted value is appended to the input sequence and used to forecast subsequent values.
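The difference between the two modes can be sketched with a toy one-step model. The function `one_step` is a hypothetical placeholder for the trained network, and the window length of 3 is chosen only to keep the example short.

```python
def one_step(window):
    """Hypothetical one-step model: damped persistence of the last value."""
    return 0.9 * window[-1] + 1.0

def open_loop(series, start, horizon, lag=3):
    """Each forecast uses only observed values up to the previous step."""
    return [one_step(series[t - lag:t]) for t in range(start, start + horizon)]

def closed_loop(series, start, horizon, lag=3):
    """Each forecast is appended to the input window and reused."""
    window = list(series[start - lag:start])
    forecasts = []
    for _ in range(horizon):
        y_hat = one_step(window)
        forecasts.append(y_hat)
        window = window[1:] + [y_hat]  # slide the window over the forecast
    return forecasts

series = [10.0, 12.0, 11.0, 13.0, 15.0, 14.0, 16.0]
ol = open_loop(series, start=3, horizon=3)
cl = closed_loop(series, start=3, horizon=3)
```

Both modes give the same first forecast; from the second step onward the closed-loop forecast depends on its own earlier output, which is where errors begin to accumulate.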
Performance evaluation criteria are an important aspect of validating the prediction model. The most commonly used indicators are the coefficient of determination (R²), the root mean square error (RMSE), and the Nash–Sutcliffe efficiency (NSE).
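The coefficient of determination, R², describes the proportion of the variance in the measured data explained by the model. It is determined as follows:

$$R^2 = \left[ \frac{\sum_{i=1}^{N} \left(O_i - \bar{O}\right)\left(S_i - \bar{S}\right)}{\sqrt{\sum_{i=1}^{N} \left(O_i - \bar{O}\right)^2}\,\sqrt{\sum_{i=1}^{N} \left(S_i - \bar{S}\right)^2}} \right]^2$$

where $O_i$ is the i-th observed value; $S_i$ is the i-th simulated value; $\bar{O}$ is the average of observed values; $\bar{S}$ is the average of simulated values; and N represents the total number of observations.
The root mean square error, RMSE, is determined as follows:

$$RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \left(O_i - S_i\right)^2}$$

The Nash–Sutcliffe efficiency, NSE, determines the relative magnitude of the residual variance compared to the measured data variance. This coefficient is calculated as follows:

$$NSE = 1 - \frac{\sum_{i=1}^{N} \left(O_i - S_i\right)^2}{\sum_{i=1}^{N} \left(O_i - \bar{O}\right)^2}$$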
In principle, for a day-ahead prediction, a model is considered good when R² is higher than 0.85 [26] and RMSE is less than half of the standard deviation of the observed time series [27]. An NSE higher than 0.85 is considered very good, and negative values indicate unacceptable performance [28].
3. Application of the Proposed Method on a Case Study
To validate the proposed methodology, the hybrid LSTM-ARIMA model was applied to a case study. The time series consists of daily inflows into the Izvorul Muntelui-Bicaz reservoir, located in a mountainous area in the northeastern part of Romania (Figure 3). Because it is the first in a series of reservoirs on the Bistrița River, the flows entering the reservoir are natural inflows, unaffected by human intervention, and follow a natural pattern that is easier to forecast.
The average daily inflows into the reservoir over the period 2012–2021 were used. The data were divided into a training set (2012–2019), a validation set (2020), and a test set (2021). To improve the precision of the model, precipitation values downloaded from https://open-meteo.com/ (accessed on 15 June 2025) were used along with the day index.
An analysis of the inflow data series shows that the inflows fall within the range 2.54–343.70 m³/s. The average inflow is 39.73 m³/s and the standard deviation is 35.98 m³/s.
Figure 4 presents the chronologically recorded daily inflows into the reservoir in the 2012–2020 period used for model calibration, providing an overview of the seasonal and interannual variability within the dataset.
From Figure 4 it can be seen that the flows are strongly seasonal: in spring (March–May) flows increase due to heavy rainfall and snowmelt, and in summer (June–August) there may be sudden increases in flows following torrential rains and rapid runoff on the slopes. The main descriptive characteristics, namely the minimum (Min), maximum (Max), average (Mean), and multiannual standard deviation (Std) values, are listed in Table 1.
In the LSTM model, the input sequence has a length of 30 values and is composed of inflows, day indices corresponding to the position within the year, and precipitation. The model architecture includes an LSTM layer with 150 hidden units. Adam was chosen as the learning algorithm, as it is known to be robust and efficient for large data sets, to work well with small and medium mini-batches, and to accelerate training without degrading stability. During training, a mini-batch size of 32 was employed to balance computational efficiency and model convergence. The initial learning rate was set to 0.001, with a reduction factor of 0.9 applied every 10 epochs, which helps to refine the parameter adjustments throughout training. The maximum number of training epochs was set to 200. The data were kept in chronological order throughout the training process and were not shuffled, in order to preserve the temporal dependencies inherent in the time series.
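The input windowing and the learning-rate schedule described above can be sketched as follows. The layout of each row (inflow, two day-index features, precipitation), the 0-based epoch counting at the drop boundary, and the function names are assumptions made for illustration; the study itself used MATLAB routines.

```python
def learning_rate(epoch, initial=0.001, factor=0.9, period=10):
    """Piecewise-constant schedule: multiply by `factor` every `period` epochs."""
    return initial * factor ** (epoch // period)

def make_windows(rows, length=30):
    """Build (input window, next-day inflow) pairs in chronological order.

    Each row is assumed to be (inflow, sin_day, cos_day, precipitation).
    No shuffling is applied, so temporal order is preserved.
    """
    pairs = []
    for end in range(length, len(rows)):
        pairs.append((rows[end - length:end], rows[end][0]))
    return pairs

# Hypothetical data: 35 days of dummy rows give 5 training pairs.
rows = [(float(k), 0.0, 1.0, 0.0) for k in range(35)]
pairs = make_windows(rows)
final_rate = learning_rate(199)  # rate during the last of 200 epochs
```

With these settings the rate is reduced 19 times over 200 epochs, so the last epochs run at roughly 13.5% of the initial rate.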
The input data can be scaled using the Z-score (standardization) method [29]; however, for LSTM applications, the min–max normalization method [30] is often more suitable. For example, for daily inflows into the reservoir, Q, the normalization relation is as follows:

$$Q_{n} = \frac{Q - Q_{\min}}{Q_{\max} - Q_{\min}}$$

where $Q_{n}$ is the normalized inflow and $Q_{\min}$ and $Q_{\max}$ are the minimum and the maximum values of the inflow time series, respectively.
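A short sketch of min–max scaling and its inverse is given below, using the minimum, maximum and mean inflows reported for the data series. The function names are illustrative, and whether the bounds were taken from the whole series or only from the training period is not stated in the text.

```python
def minmax_scale(q, q_min, q_max):
    """Map an inflow q to the [0, 1] range."""
    return (q - q_min) / (q_max - q_min)

def minmax_inverse(q_norm, q_min, q_max):
    """Map a normalized value back to physical units (m3/s)."""
    return q_norm * (q_max - q_min) + q_min

Q_MIN, Q_MAX = 2.54, 343.70  # range of the inflow series, m3/s
mean_scaled = minmax_scale(39.73, Q_MIN, Q_MAX)
mean_restored = minmax_inverse(mean_scaled, Q_MIN, Q_MAX)
```

The inverse transform is needed before computing RMSE in m³/s, because the network outputs are in normalized units.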
To obtain a sufficiently long time series of residuals, both the residuals of the neural network predictions on the training set and those on the validation set were used; the differences between the predicted and measured values are illustrated in Figure 5.
The errors in Figure 5 were obtained as differences between the values predicted by the LSTM and the actual values. They are larger in the spring-summer periods, when flows rise and fall rapidly and are generally more difficult to forecast. To obtain an ARIMA forecast model, the residuals were differenced and analyzed to determine the autoregressive order p and the moving average order q. The number of differences required (d) to achieve stationarity was found to be one. For this purpose, the autocorrelation function (ACF) and partial autocorrelation function (PACF) were plotted (Figure 6).
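First-order differencing and the sample autocorrelation function can be sketched as below. This is a plain illustration of the two operations with hypothetical inputs, not the diagnostic code used in the study; the PACF is omitted for brevity.

```python
def difference(series):
    """First-order differencing (d = 1): e[t] - e[t-1]."""
    return [b - a for a, b in zip(series[:-1], series[1:])]

def acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    c0 = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
            for k in range(max_lag + 1)]

# Hypothetical inputs, for illustration only.
differenced = difference([1.0, 4.0, 9.0, 16.0])
alternating = [1.0, -1.0] * 3
rho = acf(alternating, 1)
```

In practice, lags whose autocorrelation lies outside the approximate ±1.96/√N band are usually regarded as significant when choosing candidate orders.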
From Figure 6 one can see that the PACF suggests correlation between residuals up to 7 lags, so that p = 2, 3, 4, …, 7, and the ACF suggests the choice of order for the MA component, q = 2, 3, or 4. After testing different AR and MA orders, the variant leading to the minimum RMSE and maximum R² was chosen: p = 7, q = 2. Residual diagnostics, including the Ljung–Box test (p-value = 0.089, above the conventional 0.05 significance level), indicated no significant autocorrelation in the residuals, suggesting that the selected ARIMA parameters adequately captured the temporal structure of the data.
Figure 7 presents the values predicted using the LSTM network, the predictions obtained with the LSTM-ARIMA hybrid model, and the data measured for the analyzed period.
In Figure 7, the differences between the measured and predicted values are highlighted. These values were plotted for different time intervals (January–March, April–June, July–September, October–December). The values predicted with the LSTM model follow the general trends well but fail to accurately capture the daily oscillations, especially the extreme ones. However, these daily oscillations are better predicted by the ARIMA model.
Table 2 shows the performance indicators of the one-day-ahead prediction models, obtained in a repeated open loop over the 365 days of 2021 with LSTM and the hybrid LSTM-ARIMA model, respectively.
From Table 2 it can be seen that, for the one-day prediction, the coefficient of determination calculated for the 365 values of the year 2020 (in open loop) is higher for the hybrid model than for LSTM. The same holds for the test period, 2021, when R² improves from 0.93 to 0.96, a significant increase. The RMSE is also reduced when the hybrid model is used, for both the validation year and the test year.
The hybrid model applied in a closed loop for 7 days was validated on 2020. In each of five independent simulations, a starting point was randomly chosen in each month and forecasts were produced for 7 consecutive days, giving 84 values over the 12 months of the year; the resulting performance indicators are presented in Table 3 (simulations 1 to 5). Simulation 6 contains the model’s performance indicators for 7-day forecasts started every 7 days from January 1, so as to cover the entire year 2020 (365 values), and the last row of the table gives the average of the results of simulations 1 to 6.
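The random selection of one 7-day window per month can be sketched as follows. Keeping each window entirely within its month, the fixed seed, and the function name `monthly_windows` are assumptions made for illustration; the text does not specify how the starting points were drawn.

```python
import calendar
import random
from datetime import date, timedelta

def monthly_windows(year, length=7, seed=0):
    """Pick one random window of `length` consecutive days in each month."""
    rng = random.Random(seed)  # fixed seed so a simulation can be repeated
    windows = []
    for month in range(1, 13):
        days_in_month = calendar.monthrange(year, month)[1]
        last_start = days_in_month - length + 1  # window must end within the month
        start = date(year, month, rng.randint(1, last_start))
        windows.append([start + timedelta(days=k) for k in range(length)])
    return windows

windows = monthly_windows(2020, seed=1)
n_values = sum(len(w) for w in windows)
```

Running the function with five different seeds would give five independent simulations of 84 forecast days each.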
The accurate forecasts obtained for the 2020 validation year demonstrated the ability of the hybrid model to replicate the evolution of the time series. Consequently, the model was subsequently used to generate the forecast for 2021. The model was updated by including the year 2020 in the training set, as well as in the calibration of the ARIMA model for residuals, thus expanding the history available for calibration. Predictions were made on windows of 7 consecutive days, randomly chosen from each month of 2021, with both the LSTM model and the hybrid LSTM + ARIMA model. The values obtained are plotted together with the real values in Figure 8.
From Figure 8 it can be seen that the LSTM model generalizes well, managing to follow the general trend of the measured values over the periods of 7 consecutive days. However, daily fluctuations are not always accurately captured, which can lead to local differences between predictions and actual data. By adding the fluctuations predicted with the ARIMA model, the hybrid model’s predictions get closer to the curve of the measured values, improving short-term accuracy and better capturing rapid data variations.
The graphical analysis of Figure 8 shows that the values forecasted with the LSTM + ARIMA hybrid model are generally closer to the measured values than the predictions obtained with the simple LSTM network.
The closed-loop 7-day validation procedure used for 2020 was also applied to 2021. The starting points for each month were randomly chosen. In each of five independent simulations, forecasts were produced for 7 consecutive days of each month over the 12 months of the year (84 values); the performance indicators are presented in Table 4 (simulations 1 to 5). Simulation 6 contains the model’s performance indicators for 7-day forecasts started every 7 days from January 1, so as to cover the entire year 2021 (365 values), and the last row of the table gives the average of the results of simulations 1 to 6.
From Table 4 it can be seen that the values forecasted with the hybrid model are much closer to the measured values, which leads to a higher R² than in the case of LSTM. Considering simulation 6, the one that covers most of 2021, as the most significant, the RMSE obtained with the hybrid model is 14.37 m³/s, compared to 20.15 m³/s obtained with the simple LSTM. The average values of these indicators were then computed to provide an aggregated assessment of the model’s accuracy and robustness. The standard deviation of the analyzed series is 35.98 m³/s, so the RMSE of the hybrid model lies below half of the standard deviation (17.99 m³/s). It can be concluded that the hybrid model provides a satisfactory prediction.
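The comparison with the half-standard-deviation criterion can be checked with a few lines of arithmetic, using only the values quoted above for simulation 6; the variable names are illustrative.

```python
std_inflow = 35.98                 # standard deviation of the series, m3/s
threshold = std_inflow / 2         # criterion: RMSE below half the standard deviation
rmse_values = {"LSTM": 20.15, "LSTM + ARIMA": 14.37}

# True where the model meets the criterion.
satisfactory = {name: value < threshold for name, value in rmse_values.items()}

# Relative RMSE reduction obtained by adding the ARIMA correction.
reduction_pct = 100 * (rmse_values["LSTM"] - rmse_values["LSTM + ARIMA"]) / rmse_values["LSTM"]
```

With a threshold of 17.99 m³/s, the hybrid model meets the criterion while the standalone LSTM does not, and the RMSE reduction is about 28.7%.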
To facilitate a clear and intuitive comparison between the two models, a boxplot was employed to visualize the distribution of relative prediction errors (%) (Figure 9). This graphical representation complements the numerical metrics by providing insights into the dispersion, central tendency, and presence of outliers, thereby contributing to a deeper understanding of model behaviour and robustness.
Figure 9a reveals that the standalone LSTM model produces a higher number of extreme negative outliers, with some errors exceeding −100%, indicating instability in certain predictions. In contrast, the LSTM + ARIMA hybrid shows fewer and milder outliers, suggesting enhanced robustness. The median error (red line) is also closer to zero, implying improved average accuracy. In the multi-step prediction scenario (Figure 9b), the LSTM + ARIMA model displays a narrower range of variation, reflecting more consistent performance across selected days. While LSTM continues to generate significantly negative outliers, those from the hybrid model remain closer to zero. The median error again indicates better performance for LSTM + ARIMA, highlighting its ability to reduce systematic bias and enhance predictive reliability.
To further assess model performance under extreme hydrological conditions, we analyzed prediction errors in the tails of the inflow distribution. Specifically, we focused on the right tail (high-flow events) and the left tail (low-flow events), which are operationally critical yet challenging to forecast. Figure 10 presents the error patterns for both models, LSTM and LSTM-ARIMA, across these extremes. The horizontal axis represents the observed inflow, while the vertical axis shows the prediction error. This visualization allows for a direct comparison of model behaviour in rare but impactful scenarios, with errors computed as the difference between measured and predicted flows for observations at or above the 90th percentile and at or below the 10th percentile.
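A tail-based error analysis of this kind can be sketched as below. The nearest-rank percentile convention, the definition of the peak error as the signed error of largest magnitude, and the sample data are assumptions made for illustration; the study does not state which conventions it used.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile (one of several common conventions)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def tail_errors(observed, predicted, p, upper=True):
    """RMSE and peak error (measured minus predicted) in one tail of the observations."""
    threshold = percentile(observed, p)
    in_tail = (lambda o: o >= threshold) if upper else (lambda o: o <= threshold)
    errors = [o - f for o, f in zip(observed, predicted) if in_tail(o)]
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    peak = max(errors, key=abs)  # signed error with the largest magnitude
    return rmse, peak

# Hypothetical data: a model that underestimates every flow by 10%.
observed = [float(k) for k in range(1, 11)]
predicted = [0.9 * o for o in observed]
high_rmse, high_peak = tail_errors(observed, predicted, 90, upper=True)
low_rmse, low_peak = tail_errors(observed, predicted, 10, upper=False)
```

With errors defined as measured minus predicted, a positive peak error in the right tail corresponds to underestimation of high flows.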
Table 5 presents the error metrics for the right tail, highlighting model behaviour during extreme inflow events.
As shown in Table 5, the analysis of prediction errors at the extremes of the inflow distribution highlights the differing behaviour of the two tested models. In the case of high-flow conditions (≥90th percentile), the standalone LSTM model recorded significant errors, with a peak error of 99.67 m³/s and an RMSE of 29.23 m³/s, indicating a clear tendency to underestimate peak values. In contrast, the hybrid LSTM + ARIMA model substantially reduced these values (RMSE = 17.11 m³/s, Peak Error = 53.66 m³/s), demonstrating a better ability to adapt to extreme conditions. In the low-flow range (≤10th percentile), both models tend to overestimate the actual values; however, the LSTM + ARIMA model again shows superior performance, with a lower RMSE (2.38 vs. 3.81 m³/s) and a slightly reduced peak error. These results suggest that integrating the ARIMA component into the LSTM structure significantly enhances prediction accuracy under extreme hydrological conditions.
4. Discussion
In multi-day-ahead prediction using a closed loop, the forecasted values tend to gradually move away from the actual ones, and the calculated RMSE values increase compared to the prediction over a single time step, as also observed in other works [31,32]. For example, in [5], the coefficient of determination R² decreases from 0.83 for the one-step-ahead prediction (one hour) to values between 0.48 and 0.58 when the prediction horizon increases to 24 h.
For each randomly chosen 7-day forecast from each month, represented in Figure 8, the RMSE for both the LSTM model and the hybrid LSTM + ARIMA model was calculated and displayed, together with the standard deviation of the measured multiannual monthly inflows. The aim was to compare the RMSE value with half of the standard deviation, and the results indicate that in most cases the RMSE remains below this threshold, suggesting a satisfactory accuracy of the predictions. Both models exceeded this threshold in only two months of the analyzed period, February and June 2021, which could be associated with sudden variations in the data. At the same time, Table 3 shows that the RMSE calculated on a series of 7 selected values from each month (for all 12 months) is less than half of the standard deviation of the entire data series in all five simulations performed except for simulation 5. The year 2020, selected as the validation year, included the period in which the maximum flow of the entire analyzed series was recorded. This peak flow could not be accurately predicted by either the LSTM network or the LSTM + ARIMA hybrid model; both models underestimated the recorded extreme value, which resulted in a higher calculated RMSE value.
In general, the RMSE values obtained with the hybrid model are lower than those resulting from the predictions made only with the LSTM network, confirming the benefits of the hybrid approach. Similarly, [33] presents a model in which an artificial neural network is trained on the residuals generated by the ARIMA model, making one-step-ahead predictions for three time series with distinct characteristics; the results demonstrated an improvement in the calculated Mean Absolute Error values.
In [34], the performance of different predictive models for real time series is compared, analyzing how errors evolve as the prediction horizon increases to 1, 2, …, 10 steps ahead. The authors found that, for the Sunspot series, the best performance in the 5-, 8- and 10-step-ahead prediction is achieved by the bidirectional LSTM model, but the standard LSTM model achieves similar performance, outperforming the other methods tested in the study.
In a recent paper [35], the authors investigated the performance of multi-step forecasting of reservoir inflows using a time-variant encoder–decoder approach, applied to 12 different case studies. The results indicate that while the accuracy of the short-term forecast (one day ahead) is high, with NSE values in the range of 0.6–0.97, this performance decreases as the forecast horizon expands. Performance already decreases for the 2-day forecast, and for the 7-day forecast, even using the best tested method (time-variant encoder–decoder), the indicator values can drop to 0.75 in some cases, and in others even below 0.2. Only the NSE was calculated. From the graphs presented, inflows were the dominant input in eight of the cases (percentage of influence 80%), and precipitation was the dominant factor in four of the cases (40–70%).
The boxplot analysis confirms the overall robustness of the hybrid LSTM + ARIMA model, showing lower errors, fewer outliers, and a median closer to zero compared to the standalone LSTM. These results are consistent with the model’s design, which aims to correct residual patterns in LSTM predictions. However, when focusing on extreme inflow conditions, particularly relevant for dam safety and operational decisions, the tail-based error analysis reveals limitations. The hybrid model, while improved, still struggles to accurately capture rare high-flow events. This suggests that dedicated modeling strategies, such as those using quantile regression to separate and target extreme cases [36], may offer more reliable solutions in such scenarios.
To strengthen the evaluation of the proposed LSTM + ARIMA model, three additional forecasting methods, namely Random Forest (RF), Gradient Boosting (GB), and Multilayer Perceptron (MLP), were implemented and tested under identical conditions. These models were selected for their methodological diversity, representing tree-based, ensemble, and neural network approaches, respectively. The optimal parameters for each model were determined through grid search, by minimizing the RMSE on the validation set. Specifically, for RF, the best configuration included a minimum leaf size of 5 and 100 decision trees; for GB, 50 learning cycles and a learning rate of 0.10; and for MLP, 20 neurons in the hidden layer.
All models were trained and evaluated using identical 7-day forecasting windows randomly selected from each month to ensure consistency and fairness. A summary of the performance indicators obtained is provided in Table 6. Since the ARIMA component was trained exclusively on the residuals derived from the training phase, the final hybrid model was not applied to the training set. Instead, its performance was assessed on the validation and test sets, where both components were combined to generate multi-step forecasts; performance metrics for the hybrid model are therefore reported only for these two sets.
The results, summarized in Table 6, indicate that although all models performed well on the training and validation sets (with R² values exceeding 0.93 in most cases), their performance on the test set was considerably lower. The reduced performance highlights the inherent challenge of achieving generalization in multi-step time series forecasting.
In contrast, the hybrid LSTM + ARIMA model yielded superior results on the test set, with an RMSE of 12.74 m³/s, an R² of 0.89, and an NSE of 0.87, outperforming all other models. These outcomes further demonstrate the robustness and predictive superiority of the hybrid approach, particularly under conditions characterized by high temporal variability and nonlinear dependencies. Such performance highlights the advantage of integrating statistical and deep learning components for capturing complex hydrological dynamics.
5. Conclusions
In the case of multi-step prediction (for example, on 5 to 7 consecutive days), prediction errors can accumulate progressively at each time step. This compounding effect often leads to a significant degradation in the accuracy of the prediction in the medium term, even if the model performs well in the short term. In this context, it is essential that the short-term prediction is as accurate as possible, as it serves as the basis for all subsequent predictions. A high-quality prediction at the initial time step significantly reduces the uncertainty propagated in the next steps, thus increasing the chances that the model will maintain a realistic and stable trajectory of the prediction in the multi-step forecasting horizon.
While other AI-based models, such as feedforward neural networks or similar architectures, may provide higher accuracy than LSTM for day-ahead prediction, they have limited generalizability for forecasts over a seven-day horizon. Consequently, LSTM networks, which can learn long-term dependencies from the time series, were chosen for the short-term forecast (seven days ahead). Moreover, it was observed that integrating the ARIMA model to adjust the LSTM errors leads to a high-performance combination for the seven-day-ahead forecast in closed-loop mode, without daily updating of the real values of the time series during the prediction horizon.
This study introduces a novel hybrid modeling approach that combines LSTM and ARIMA techniques, applied to real-world hydrological inflow data. Unlike most existing studies that use ARIMA as the primary predictor and LSTM for residual correction, our method reverses this configuration: LSTM serves as the main forecasting engine, capturing nonlinear patterns and seasonality (via sine and cosine encoding), while ARIMA is employed to refine predictions by modeling residual errors. This structure enhances both stability and accuracy, particularly in short-term multi-step forecasts such as seven-day predictions. The results demonstrate that the hybrid model consistently outperforms standalone LSTM, offering a robust and reliable solution for inflow forecasting across multiple horizons.
These findings highlight the potential of integrating deep learning with statistical methods to improve predictive performance in hydrological applications. Future research may explore specialized models for extreme events or real-time updating strategies to further enhance operational reliability.