1. Introduction
Dam reservoirs are elevated, open-air storage areas created by constructing dams that retain and regulate water flow [1]. Dams and dam reservoirs play a critical role in the infrastructure of modern societies due to their multifaceted functions, such as water resource management, hydroelectric power generation, irrigation, flood control, and industrial and urban water supply [2,3]. These structures are strategic water management systems that serve the public good. Increasing populations and environmental pressures, such as climate change and drought, necessitate more efficient and predictable operation of these structures [4]. The global scarcity of freshwater requires developing solutions to balance water supply and demand, especially locally [5]. In this context, accurate and timely estimation of dam storage or reservoir water levels is vital for sustainable and effective management of water resources [6,7,8].
The main problem addressed in this study is the difficulty of reliably forecasting dam water levels under increasing climate variability and anthropogenic pressures. Ensuring the sustainable management of water resources requires models that can capture nonlinear relationships and provide accurate predictions using data commonly available at dam sites.
Urbanization, industrialization, land use changes resulting from agricultural production activities, and population growth are the main anthropogenic factors that significantly increase water scarcity [9,10]. These factors, together with the impacts of climate change, cause unpredictable fluctuations in surface and groundwater regimes and threaten the sustainable use of water resources [11,12]. Multi-purpose uses of dams, such as hydropower and water supply, and operational activities carried out outside of seasonal cycles, can significantly affect water quality and aquatic ecosystems through fluctuations in water levels [13]. Particularly in semi-arid regions, dam water level data are considered one of the fundamental inputs in flood forecasting studies [14,15]. Flood protection is one of the most important functions performed by dams [1]. Numerous studies have shown that the frequency of floods caused by heavy rainfall has increased in recent years [16,17]. Moreover, previous studies and scenario-based projections [18,19] suggest that the trend of increasing flood frequency and severity will continue, with climate change identified as its primary driver [20,21,22].
Despite the importance of forecasting reservoir levels, traditional statistical and rule-based methods struggle to capture nonlinear and multidimensional dynamics. For years, dam water levels have been estimated with limited tools such as operator-based rule curves, historical flow analyses, and linear mathematical approaches, which fail to adequately capture the complex and dynamic nature of the system [23]. The interdependencies among many components, including meteorological variables, streamflows, and evaporation rates, further constrain the performance of traditional models.
Machine learning (ML) and deep learning (DL) approaches have emerged as promising alternatives in recent decades. Recent studies have shown that ML methods, particularly ensemble models such as random forest (RF) and extreme gradient boosting (XGBoost), achieve significantly higher accuracy in forecasting reservoir water levels compared with traditional approaches and can effectively learn high-dimensional, nonlinear patterns [24,25]. Extensive reviews in the field of hydrology reveal that various ML techniques, ranging from artificial neural networks (ANN) and support vector machines (SVM) to deep learning architectures such as long short-term memory (LSTM) and convolutional neural networks (CNN), have been widely applied to water level prediction in dams, rivers, and lakes [26]. These methods consistently demonstrate higher accuracy and greater flexibility than traditional statistical models (e.g., autoregressive integrated moving average (ARIMA), regression) and are also superior in quantifying and communicating uncertainties.
However, existing studies still reveal several research gaps. First, there is a lack of comprehensive comparison between classical regression and advanced ML techniques using long-term daily datasets. Second, many previous works have not incorporated bias-oriented or hydrological efficiency measures (e.g., Nash–Sutcliffe Efficiency (NSE), Kling–Gupta Efficiency (KGE)) in evaluating performance. Third, limited research focuses specifically on dams in Türkiye, despite their critical role in regional water management.
This study focuses explicitly on Karaçomak Dam in the Western Black Sea Region of Türkiye, which provides irrigation, drinking water, and flood control for Kastamonu Province. The dam’s importance for regional socio-economic activities and vulnerability to climatic fluctuations justify its selection as the study area. While some regional studies on water resources and hydrological planning—such as the Karaçomak Dam Reservoir Protection Plan [27], artificial neural network–based water quality modeling for Karaçomak as a drinking water source in Kastamonu [28], and the Kastamonu Province Drought Action Plan [29]—have been conducted, comprehensive machine learning–based forecasting applications for reservoir water levels remain scarce in this region.
Data-driven modeling uses ML techniques to create models for a specific system from existing data [1]. In recent years, these models have been integrated as complements to, or in some cases replacements for, physics-based models (e.g., hydrodynamic models) [30]. ML techniques have demonstrated significant advantages in modeling complex and multivariable systems, yielding more accurate and reliable results [10,31].
Therefore, this study aims to evaluate and compare the predictive performance of linear regression and multiple ML algorithms (decision tree, random forest, and XGBoost) for reservoir water level forecasting at Karaçomak Dam. Specifically, it seeks to identify models that can achieve high accuracy with a limited but routinely available dataset (precipitation; maximum, minimum, and average temperature; reservoir level; and volume) while also assessing bias and hydrological efficiency metrics (mean bias error (MBE), NSE, KGE). The novelty of this work lies in its integration of cost-effective, data-efficient modeling with advanced ML algorithms, providing a practical and scalable decision support tool for sustainable dam management in Türkiye.
3. Results
Within the study’s scope, the dataset’s descriptive statistical parameters were initially analyzed. The number of observations, mean, minimum and maximum values, 25%, 50% (median), and 75% percentiles, and standard deviation were calculated for each variable. The results are summarized in Table 2.
When Table 2 is analyzed, the elevation variable ranges from 804.02 m to 889.99 m, with an average of 883.42 m. The volume variable ranges from 3.096 hm³ to 25.143 hm³, with a mean of 15.42 hm³. Regarding temperature variables, the maximum temperature varies between −9.2 °C and 41.6 °C, while the minimum temperature ranges from −20.2 °C to 22.9 °C. The average temperature ranges from −13.9 °C to 29.8 °C, with a mean of 10.43 °C. The total precipitation variable ranges from 0 to 82.6 mm.
Prior to model training, we conducted multicollinearity and assumption diagnostics. The correlation heatmap (Figure 2) shows strong pairwise correlations among temperature-related predictors (|r| ≳ 0.85–0.95). Consistently, VIF analysis (Table 3) indicates severe multicollinearity for daily_avg_temp (VIF ≈ 95.27), daily_max_temp (≈31.79), and daily_min_temp (≈27.82), while season (≈1.57), daily_precipitation (≈1.12), and elevation (≈1.12) remain acceptable. Guided by these findings, an AIC-based reduction retained only the most informative predictors for the OLS models.
In addition, residual diagnostics (Figure 3) revealed deviations from normality (Shapiro–Wilk p < 0.001), heteroskedasticity (Breusch–Pagan p < 0.001), and potential autocorrelation (Durbin–Watson = 1.47), which together motivated the comparative use of tree-based ensemble methods.
3.1. Model Implementation and Optimal Model Selection
Models designed to estimate dam reservoir storage levels were tested using a dataset of six variables recorded daily over 17 years (5964 days), totaling 35,784 observations.
Table 4 presents the MAE, MSE, RMSE, and R² values obtained for the four models before hyperparameter tuning and cross-validation.
As seen in Table 4, the R² values of all tree-based models (decision tree, random forest, and XGBoost) in the training dataset are 1.000, and the error values are close to zero, indicating that these models perfectly fit the training data. In the test dataset, the R² values remain in the range of 0.982–0.983, demonstrating that the models preserve their high predictive accuracy. However, the decision tree model exhibits relatively higher test errors.
Although the linear regression model achieved an R² of 0.983 on the training dataset, this value dropped to 0.574 on the test dataset. Consequently, the test RMSE value (2.898) was significantly higher than that of the other models. This reduction in generalization performance is consistent with the OLS assumption violations identified in Figure 3 and the multicollinearity evidenced by Figure 2 and Table 3.
When comparing tree-based models, random forest and XGBoost yielded the lowest error values on the test data (MAE ≈ 0.046–0.076, RMSE ≈ 0.584–0.585), and their generalization performances were quite similar. While the test MAE (0.046) of the decision tree model was comparable, its RMSE (0.590) and MSE (0.348) were slightly higher than those of the other tree-based models.
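The error metrics used throughout (MAE, MSE, RMSE, R²) follow their standard definitions; a minimal scikit-learn sketch on hypothetical level values (not the study’s data):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical observed and predicted reservoir levels (m)
y_true = np.array([883.1, 884.0, 885.2, 886.5, 887.0])
y_pred = np.array([883.0, 884.2, 885.0, 886.7, 887.1])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)            # RMSE is the square root of MSE
r2 = r2_score(y_true, y_pred)  # fraction of variance explained
print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")
```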
Hyperparameter optimization and cross-validation were applied to improve model performance. The models were then retrained using the optimal parameters obtained, and their performance was re-evaluated. Table 5 presents the results after hyperparameter optimization and cross-validation.
An examination of Table 5 demonstrates that the tree-based models (decision tree, random forest, XGBoost) achieved R² values ranging from 0.999 to 1.000 on the training data, accompanied by minimal error levels, thereby confirming their strong fitting capacity. On the test data, these models maintained an R² value of 0.983, indicating high predictive accuracy and enhanced robustness and generalization ability as a result of the applied hyperparameter optimization. These gains are consistent with the expectation that tree-based ensembles are less sensitive to multicollinearity and non-Gaussian, heteroskedastic residual structures than OLS.
To strengthen model validation, we additionally computed the symmetric mean absolute percentage error (SMAPE), explained variance score (EVS), and median absolute error (MedAE) (Table 6). The extended evaluation corroborated the ensemble models’ superiority: random forest achieved the lowest test SMAPE (0.371%) and MedAE (0.003) while maintaining a high EVS (0.983), closely followed by decision tree (test SMAPE ≈ 0.409%; MedAE ≈ 0.008; EVS ≈ 0.983) and XGBoost (test SMAPE ≈ 0.601%; MedAE ≈ 0.031; EVS ≈ 0.983). In contrast, linear regression yielded substantially higher errors (test SMAPE ≈ 4.40%; MedAE ≈ 0.525) and a markedly lower explained variance, confirming its inferior generalization.
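SMAPE is not built into scikit-learn and has several variants; the sketch below uses one common definition, 100 · mean(2|ŷ − y| / (|y| + |ŷ|)), alongside the library’s MedAE and EVS (values hypothetical):

```python
import numpy as np
from sklearn.metrics import median_absolute_error, explained_variance_score

def smape(y_true, y_pred):
    """Symmetric MAPE in percent: 100 * mean(2|e| / (|y| + |yhat|))."""
    y_true = np.asarray(y_true, float)
    y_pred = np.asarray(y_pred, float)
    denom = np.abs(y_true) + np.abs(y_pred)
    return 100.0 * np.mean(2.0 * np.abs(y_pred - y_true) / denom)

# Hypothetical observed and predicted values
y_true = np.array([100.0, 102.0, 98.0, 101.0])
y_pred = np.array([101.0, 101.5, 98.5, 100.0])
print(f"SMAPE = {smape(y_true, y_pred):.3f}%")
print(f"MedAE = {median_absolute_error(y_true, y_pred):.3f}")
print(f"EVS   = {explained_variance_score(y_true, y_pred):.3f}")
```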
In the decision tree model, the test MAE changed only marginally after optimization (from 0.049 to 0.046), while the RMSE (0.587) and MSE (0.345) values remained essentially stable. The random forest model performed consistently, maintaining the lowest MSE (0.342) and RMSE (0.585) values on the test set. The XGBoost model, in contrast, achieved one of the most favorable outcomes in terms of error metrics, further reducing the test RMSE to 0.580.
However, the linear regression model’s performance remained unchanged after optimization. The test R² value stayed at 0.574, while the RMSE (2.898) continued to be significantly higher than that of the tree-based models, confirming the inadequacy of the linear approach in capturing the complex data relationships. Together with the diagnostics in Figure 3, this result indicates that OLS predictions are adversely affected by assumption violations in this dataset.
The optimal parameter combinations obtained for each model following hyperparameter optimization are presented in Table 7.
Examining Table 7, the decision tree model achieved the best performance with a maximum depth of 20, a minimum of one sample per leaf, and a minimum of 10 samples required for splitting. The optimal parameters for the random forest model were a maximum depth of 10, a minimum of one sample per leaf, a minimum of two samples for splitting, and 100 trees. In the XGBoost model, the optimal configuration included a column subsampling ratio of 1.0, a learning rate of 0.1, a maximum depth of 3, 100 trees, and a subsampling ratio of 0.8. For the linear regression model, the only selected setting was fit_intercept = True.
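As an illustration of the tuning procedure, a scikit-learn GridSearchCV sketch over a reduced random forest grid; the grid values echo Table 7, but the data are synthetic, so the selected combination will generally differ from the study’s:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the hydro-meteorological features
rng = np.random.default_rng(7)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(0, 0.1, 200)

# Reduced grid echoing the tuned parameters in Table 7 (illustrative only)
param_grid = {
    "n_estimators": [100],
    "max_depth": [5, 10],
    "min_samples_leaf": [1, 2],
    "min_samples_split": [2, 10],
}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=5,                                    # 5-fold cross-validation
    scoring="neg_root_mean_squared_error",   # select by lowest CV RMSE
)
search.fit(X, y)
print("best params:", search.best_params_)
print("best CV RMSE:", -search.best_score_)
```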
The relative ranking method proposed by Poudel and Cao [52] was then applied to identify the best-performing model based on test results from the evaluation metrics. The outcomes of this ranking are presented in Table 8.
Based on this evaluation, the random forest model achieved the highest relative score, indicating its superior predictive performance compared to the other models. Consistent with these additional criteria (SMAPE, EVS, MedAE), random forest was identified as the optimal model and is, therefore, used as the reference model in subsequent analyses and interpretation.
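The exact scoring of the relative ranking method [52] is not reproduced here; a generic rank-sum aggregation across metrics conveys the idea (metric values hypothetical, patterned loosely after the reported results):

```python
import pandas as pd

# Hypothetical test-set metrics for four models (lower is better, except R2)
metrics = pd.DataFrame({
    "MAE":  [0.525, 0.046, 0.043, 0.052],
    "RMSE": [2.898, 0.587, 0.585, 0.580],
    "R2":   [0.574, 0.983, 0.983, 0.983],
}, index=["linear_regression", "decision_tree", "random_forest", "xgboost"])

# Rank each metric (1 = best); for R2, higher is better, so rank descending
ranks = metrics.copy()
for col in ["MAE", "RMSE"]:
    ranks[col] = metrics[col].rank(method="min")
ranks["R2"] = metrics["R2"].rank(method="min", ascending=False)

total = ranks.sum(axis=1).sort_values()  # lowest total rank ~ best overall
print(total)
```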
In addition to conventional error metrics, bias and hydrological efficiency measures were also evaluated [45,46,47,48,53]. As presented in Table 9, linear regression exhibited relatively poor performance, while the decision tree achieved high accuracy. Nevertheless, the ensemble models (random forest and XGBoost) demonstrated the most robust and reliable performance, with very low MBE values indicating the absence of systematic bias and NSE/KGE values close to 1 confirming their accuracy and generalization ability.
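These measures follow their standard hydrological definitions: MBE = mean(sim − obs), the Nash–Sutcliffe efficiency, and the 2009 form of the Kling–Gupta efficiency. A self-contained sketch (data hypothetical):

```python
import numpy as np

def mbe(obs, sim):
    """Mean bias error: positive values indicate systematic overestimation."""
    return float(np.mean(np.asarray(sim, float) - np.asarray(obs, float)))

def nse(obs, sim):
    """Nash–Sutcliffe efficiency: 1 is perfect; 0 matches the mean of obs."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

def kge(obs, sim):
    """Kling–Gupta efficiency (2009): correlation, variability, and bias terms."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    r = np.corrcoef(obs, sim)[0, 1]
    alpha = sim.std() / obs.std()    # variability ratio
    beta = sim.mean() / obs.mean()   # bias ratio
    return 1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)

# Hypothetical observed and simulated reservoir levels (m)
obs = np.array([883.1, 884.0, 885.2, 886.5, 887.0])
sim = np.array([883.0, 884.2, 885.0, 886.7, 887.1])
print(f"MBE={mbe(obs, sim):+.3f}  NSE={nse(obs, sim):.3f}  KGE={kge(obs, sim):.3f}")
```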
3.2. Evaluating the Models’ Prediction Capabilities
Figure 4, Figure 5, Figure 6 and Figure 7 present the prediction performances of the linear regression, decision tree, random forest, and XGBoost models, allowing readers to compare the models’ differences visually. In these graphs, the black line represents the actual values, while the bubbles denote the model predictions; the size of each bubble indicates the magnitude of the prediction error, and the color scale reflects the error level (blue: low error, red: high error). To complement these visualizations, Table 10 provides the exact numerical values of the measurements taken on the 1st, 7th, 15th, and 21st of each month from 1 May 2024 to 21 July 2025, alongside the predicted values of the four models and their corresponding errors. This combined presentation ensures that both visual interpretation and detailed quantitative verification are possible. The results of the paired-samples t-test conducted between the predicted and actual values are provided in Table 11.
The t-test results indicate that there is no statistically significant difference (p > 0.05) between the predicted values of any model and the actual values (Table 11). This finding suggests that the model predictions do not significantly deviate from the actual values.
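A paired-samples t-test of predictions against observations can be run with scipy; a sketch on hypothetical paired series (not the study’s measurements):

```python
import numpy as np
from scipy.stats import ttest_rel

# Hypothetical paired series: observed levels and near-unbiased predictions
rng = np.random.default_rng(1)
actual = rng.normal(885.0, 1.5, 30)
predicted = actual + rng.normal(0.0, 0.05, 30)  # small, unbiased errors

t_stat, p_value = ttest_rel(predicted, actual)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
# A p-value above 0.05 would indicate no significant mean difference
```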
According to the comparison results obtained from the graphs and Table 10, the difference values in the linear regression model are generally negative and have high absolute values (e.g., around −0.9). This pattern indicates a systematic tendency toward overestimation. Furthermore, the long-term error is considerably higher, demonstrating poorer performance than the other models. These outcomes are consistent with the model’s weak statistical performance metrics and coherent with the OLS residual diagnostics, which showed non-normal and heteroskedastic residuals.
The differences in the decision tree model results are generally within ±0.02–0.03, indicating a low error rate; the differences reach 0.5 or greater only on a few dates. Overall, the predictions are very close to the observed values, although this closeness partly reflects the model’s tendency to overfit the training data.
In the random forest model results, the differences are similar to those of the decision tree model, but generally smaller. Although larger deviations, such as −0.19, occasionally occur, the overall error rate remains low and consistent. This finding supports the model’s low error rates and high R² values observed in the test dataset.
The differences in the XGBoost model results are predominantly within the range of 0.00 to ±0.05. Although slightly larger deviations, such as 0.09 or −0.14, appear on specific dates, these cases are infrequent. The deviations are balanced in direction, and the mean difference remains minimal. These results demonstrate that the XGBoost model predictions are highly consistent with the actual values. This outcome aligns with the model’s low error rates and high R² scores on the test data, confirming its strong generalization capacity and high predictive accuracy.
4. Discussion
The study results show that tree-based ensemble models, particularly the XGBoost and random forest algorithms, provide the most accurate and balanced results in predicting reservoir water levels in Türkiye, whereas the linear regression model could not adequately capture the complex nonlinear relationships between hydrological and meteorological variables. R² values above 0.98 and low error metrics (MAE ≈ 0.046–0.071, RMSE ≈ 0.58) indicate that these models demonstrate strong robustness in dealing with multidimensional and nonlinear datasets. The results obtained from this study are consistent with previous studies, confirming that machine learning-based ensemble methods outperform traditional regression and time series approaches [54,55,56]. This superiority is consistent with our diagnostic findings: residual analyses indicated deviations from normality and heteroskedasticity, and the Durbin–Watson statistic suggested autocorrelation, while strong collinearity among temperature predictors was confirmed by the heatmap and VIF results, together undermining OLS generalization performance.
Consistent with the present study, Khai et al. [55] demonstrated that random forest and gradient boosting methods provide significantly higher accuracy than multiple linear regression in dam water level prediction. Asare et al. [57] highlighted the limitations of statistical methods such as principal component regression (PCR) and ARIMA in capturing sudden fluctuations, whereas machine learning methods can model nonlinear relationships more successfully. Comparative studies on hydrology in the literature [58,59,60] also indicate that ensemble methods excel in both accuracy and generalization capacity on long-term and multivariate datasets. For example, Özdoğan et al. [61] used hybrid models, including random forest and ridge regression, on the Loskop Dam in South Africa to accurately estimate dam volume (R² ≈ 0.99; RMSE = 4.88 MCM). Similarly, the authors of [1,3] obtained successful results in outlet discharge and water level predictions for dams in Spain using advanced time series models, such as artificial neural networks (ANN) and NARX. The authors of [6] evaluated monthly flow forecasts for the Hirakud Dam in India by comparing machine learning (relevance vector machine) and statistical downscaling. Similarly, Ref. [62] used artificial neural network (ANN) models to predict the water level of the Yalova Gökçe Dam, which performed better than traditional regression models. In another study on Keban Dam, water level changes were successfully modeled using ANFIS and support vector machines (SVM) [37]. In our analyses, we also applied an AIC-based variable reduction to remove redundant temperature predictors; nevertheless, key OLS assumptions remained violated, further justifying the use of ensemble models for operational forecasting.
The overfitting tendency in the decision tree model observed in our study represents a well-documented limitation of single-tree-based algorithms. While these models successfully represent complex structures, they often suffer from limited generalization ability on unseen data [41]. In contrast, random forest reduces variance by combining multiple trees, and XGBoost achieves a more effective balance between bias and variance through boosting and regularization [63]. The slightly better performance of XGBoost in our study aligns with the widely reported effectiveness of gradient boosting methods in water resource management [1,64]. Notably, both random forest and XGBoost preserved high accuracy on the test set despite multicollinearity and non-Gaussian, heteroskedastic residuals, highlighting their robustness to the violations that degraded OLS.
A significant contribution of this research is the application of advanced machine learning models using a long-term (17 years, 5964 daily observations) dataset on dam reservoirs in Türkiye. Recent reviews by Azad et al. [26] reveal that previous regional studies have relied primarily on regression-based or time-series approaches such as ARIMA [65], which are insufficient to capture the nonlinear dependencies between climatic variables and reservoir storage dynamics and provide only limited insight into the interactions among hydro-meteorological variables. This study addresses a methodological gap by systematically comparing four different models and demonstrates that modern ensemble approaches deserve prioritization in operational water management in Türkiye. It is among the first regional applications to report a full suite of regression diagnostics (Q–Q, Shapiro–Wilk, Breusch–Pagan, and Durbin–Watson tests) alongside ML benchmarking for daily reservoir levels in Türkiye.
This study also revealed a systematic overestimation tendency in the linear regression model, with errors surpassing 0.9 hm³ in some periods. This finding demonstrates that models relying solely on linear assumptions prove inadequate in representing complex hydrological processes, such as precipitation, evaporation, and temperature changes. Previous studies [36,56] have highlighted that linear models have been remarkably ineffective in capturing extreme events and seasonal variability, while ensemble learning methods demonstrate consistent performance across different conditions. This systematic bias is visible in the error-bubble visuals and is consistent with the positive MBE observed for linear regression.
Although random forest was determined to be the best model based on metrics, XGBoost showed the highest consistency between predictions and actual values. These results demonstrate that in such studies, relying solely on conventional evaluation metrics is insufficient and that rigorous validation of results is essential. Accordingly, we complemented conventional errors with hydrological skill (NSE, KGE), bias (MBE), and statistical significance tests to ensure that the conclusions are not an artifact of a single metric.
In particular, this study provided an additional validation layer, including bias and hydrological efficiency measures (MBE, NSE, and KGE). Very low MBE values confirmed the absence of systematic bias, while NSE values close to 1 [45] indicated strong predictive skill. Similarly, KGE values near unity [46,53] demonstrated high agreement between observed and simulated values, reflecting accuracy and reliability in representing hydrological dynamics. The combined use of these indices, which has been widely recommended in hydrological model evaluation [47,48], reinforces the superiority of ensemble methods in terms of both conventional error metrics and hydrological performance indicators. These outcomes, the multicollinearity evidence, and the residual diagnostics provide a coherent rationale for preferring ensemble models over OLS in this application.
Beyond methodological contributions, the findings of this study are directly linked to the United Nations Sustainable Development Goals (SDGs). Accurate reservoir water level forecasting supports SDG 6 (Clean Water and Sanitation), particularly Indicator 6.4.2 (Level of water stress) and Indicator 6.5.1 (Integrated water resource management implementation) by enhancing efficient and sustainable allocation of freshwater resources. Moreover, by strengthening the capacity to anticipate hydrological variability, the study contributes to SDG 13 (Climate Action), especially Target 13.1 (Resilience and adaptive capacity to climate-related hazards). Finally, the potential application of ML-based forecasts to optimize irrigation and reduce ecological pressures aligns with SDG 15 (Life on Land), underscoring the ecological co-benefits of sustainable reservoir operation.
Recent studies suggest that SDG indicators can be operationalized through quantitative metrics. For example, Marinelli et al. [66] assessed SDG 6.4.2 by calculating the ratio of freshwater withdrawals to renewable resources, reporting that this value varied across regions from below 0.1 (very low stress) to above 1 (extreme stress) and that over 90% of non-renewable groundwater withdrawals were concentrated in just seven countries. Similarly, Tinoco et al. [67] assessed SDG 6.5.1 using the UNEP-DHI IWRM Data Portal, reporting implementation scores on a scale of 0–100, with Mexico scoring 49 (medium-low), Brazil 51 (medium-high), and Chile 22.6 (low). This study is not only conceptually aligned with SDGs 6, 13, and 15 but also contributes to their advancement. Future research could follow these approaches to calculate dam-specific SDG 6.4.2 water stress ratios and incorporate SDG 6.5.1 implementation indices, thus providing stronger comparability of reservoir management in Türkiye with international standards.
Although water level measurements are relatively simple and cost-effective, mechanistic models that rely solely on reservoir morphometry (e.g., elevation–area–volume relationships) have inherent limitations. High-resolution bathymetric data are not always available, and such models cannot adequately represent nonlinear dependencies between hydro-meteorological variables such as precipitation, evaporation, and temperature. Machine learning approaches, on the other hand, can capture these complex relationships by learning directly from long-term observational datasets, thereby providing more accurate and reliable forecasts. For this reason, the present study adopts machine learning as a complementary alternative to traditional mechanistic modeling.
From a practical perspective, the high accuracy provided by XGBoost and random forest highlights the potential for integrating these models into dam reservoir management decision support systems in Türkiye. Reliable short- and medium-term water level forecasts can enable more effective water allocation among agricultural, urban, and ecological demands, and support adaptation to climate-related uncertainties. Future research should address residual autocorrelation by incorporating lagged predictors and/or hybridizing with GLS or ARIMAX-type models, consider penalized linear methods (ridge, lasso, elastic net) to retain interpretability under multicollinearity, and enrich features with seasonality and hydrological memory (e.g., rolling precipitation/level aggregates) and, where available, operational variables such as inflows/outflows and release schedules. Future studies could contribute to improving both resilience and interpretability through the integration of satellite-based precipitation and evaporation data or by hybridizing physical hydrological models with machine learning approaches [68].
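The lagged and rolling "hydrological-memory" features suggested above can be constructed with pandas shift/rolling operations; a minimal sketch on synthetic daily data (column names hypothetical):

```python
import numpy as np
import pandas as pd

# Synthetic daily series standing in for precipitation and reservoir level
rng = np.random.default_rng(3)
df = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=60, freq="D"),
    "precipitation": rng.gamma(1.0, 2.0, 60),
    "level": 883 + np.cumsum(rng.normal(0, 0.05, 60)),
}).set_index("date")

# Lagged levels and a rolling precipitation sum as memory features
df["level_lag1"] = df["level"].shift(1)
df["level_lag7"] = df["level"].shift(7)
df["precip_7d_sum"] = df["precipitation"].rolling(7).sum()
df = df.dropna()  # drop the initial rows without a full lag/rolling history
print(df.head(3))
```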
This study used specific hyperparameter ranges (e.g., learning rate = 0.01–0.1, max_depth = 3–10) for model optimization. Experimenting with wider ranges and different parameter combinations could provide a more comprehensive evaluation of the models’ generalizability and performance; however, this lies beyond the scope of the current study and is recommended as an important direction for future research.
In conclusion, machine learning-enabled models successfully address the limitations of classical hydrological modeling and offer practical tools for operational support to decision-makers. This study confirms that ensemble machine learning methods constitute a substantial methodological advancement in hydrological prediction compared to classical regression and single-model approaches. In Türkiye’s water resource management context, these results provide theoretical validation and a practical roadmap for sustainable dam reservoir management under climate uncertainty. Overall, the convergence of assumption diagnostics, multicollinearity assessments, and superior ensemble performance forms a consistent evidence base supporting the adoption of random forest/XGBoost for daily reservoir level forecasting at Karaçomak Dam.