Article
Peer-Review Record

Hybrid Deep Learning Versus Empirical Methods for Daily Potential Evapotranspiration Estimation in the Nakdong River Basin, South Korea

Water 2026, 18(1), 32; https://doi.org/10.3390/w18010032
by Muhammad Waqas and Sang Min Kim *
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 19 November 2025 / Revised: 15 December 2025 / Accepted: 18 December 2025 / Published: 22 December 2025
(This article belongs to the Special Issue Risks of Hydrometeorological Extremes)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The paper estimates daily evapotranspiration in the Nakdong River Basin, South Korea, using two types of models: empirical approaches and deep-learning models. After reviewing the paper, I found that the authors face significant issues related to model validation, presentation of the large dataset, and limited discussion of the study’s novelty, all of which hinder the applicability of the work.

Introduction

  1. The Introduction and Related Work sections should be combined to clearly present the research gap and articulate the study’s objectives.

  2. The paper does not provide a clear rationale for selecting the applied models.

  3. The study lacks a clear statement of novelty.

Methodology

  • The choice of models is not sufficiently justified. You used a CNN–BiLSTM hybrid model, but for fair comparison, individual models such as CNN and BiLSTM should also be included.

  • The study relies solely on deep-learning models, which may not be appropriate given the limited dataset size and number of input features. Classical machine-learning models such as SVR or XGBoost should be added for comparison.

  • The training conditions and hyperparameters for the models are not provided.

  • The formulas for NSE and R² are very similar; therefore, including both may be redundant.

Results and Discussion

  1. The authors should provide a concise summary comparing the performance of all models.

  2. The models show major problems in predicting extreme values, but this issue is not addressed in the discussion.

  3. The paper lacks visual comparative analyses.

  4. Uncertainty and reliability analyses should be conducted.

Author Response

Response to Reviewer 1

The paper estimates daily evapotranspiration in the Nakdong River Basin, South Korea, using two types of models: empirical approaches and deep-learning models. After reviewing the paper, I found that the authors face significant issues related to model validation, presentation of the large dataset, and limited discussion of the study's novelty, all of which hinder the applicability of the work.

We sincerely thank the reviewer for the constructive feedback. Your comments significantly improved the clarity and justification of the manuscript, and we appreciate the time and effort you invested in strengthening this work.

Introduction

Comment: The Introduction and Related Work sections should be combined to clearly present the research gap and articulate the study's objectives.

Response: We thank the reviewer for this suggestion. In the revised manuscript, based on both reviewers' comments and suggestions, and keeping the journal's format in mind, we have streamlined the Introduction and Related Work sections. This restructuring now clearly presents:

  • existing empirical PET approaches,
  • machine learning and deep-learning advancements,
  • limitations of past studies, and
  • the specific research gap in the Nakdong River Basin.

Comment: The paper does not provide a clear rationale for selecting the applied models.

Response: We addressed this by adding a detailed paragraph at Lines 261–268 that explicitly justifies the choice of the standalone LSTM and the hybrid CNN–BiLSTM–Attention. The justification highlights:

  • LSTM's superiority in capturing long-term dependencies in hydrometeorological data;
  • CNN's usefulness for extracting short-term temporal features;
  • the attention mechanism, added for interpretability;
  • literature evidence supporting LSTM-based models for PET prediction.

This provides a logical, data-driven rationale for the selected models.

Comment: The study lacks a clear statement of novelty.

Response: We added a clear statement of novelty in the Introduction (Lines ~105–119). The revised manuscript now highlights that:

  • This is the first study to compare empirical PET equations with hybrid DL architectures (CNN-BiLSTM-Attention) in the NRB;
  • The study incorporates extensive feature engineering and optimized DL hyperparameters;
  • Analysis across 13 stations is performed to quantify generalization across heterogeneous climates.

These additions clarify the study's methodological and regional novelty.

Methodology

Comment: The choice of models is not sufficiently justified. You used a CNN–BiLSTM hybrid model, but for fair comparison, individual models such as CNN and BiLSTM should also be included.

Response: We clarified our methodological choices in Lines 261–268. The manuscript now explains that:

  • CNN-only and BiLSTM-only models were omitted because previous literature shows that hybrid structures outperform isolated components for multivariate climatic data.
  • A standalone LSTM serves as the appropriate baseline because PET is primarily a time-dependent sequence problem.
  • Adding CNNs and attention improves the representational capacity beyond the LSTM baseline.

This provides a scientifically grounded reason for not adding redundant standalone CNN or BiLSTM models.

Comment: The study relies solely on deep-learning models, which may not be appropriate given the limited dataset size and number of input features. Classical machine-learning models such as SVR or XGBoost should be added for comparison.

Response: Thanks for this suggestion. In the revised manuscript, we added justification for focusing on DL rather than other ML models. Specifically, in the Introduction and Methods, we justify that:

  • ML models (SVR, RF, XGBoost) have already been extensively evaluated in previous PET studies (Tables 1–2 show relevant ML studies);
  • The purpose of this study is to evaluate empirical vs. DL vs. hybrid DL, not to repeat already well-documented ML comparisons;
  • DL models handle long sequences and multivariable temporal dependencies better than ML models, especially across 13 stations and 50 years of data.

Thus, the revised manuscript clarifies that including additional ML baselines would not meaningfully contribute to the research objectives.

Comment: The training conditions and hyperparameters for the models are not provided.

Response: This concern has been fully addressed.

We added a detailed hyperparameter table (Table 4) and text describing:

  • sequence length,
  • CNN and BiLSTM layer configurations,
  • attention dimensions,
  • batch size and epochs,
  • optimizer (AdamW),
  • learning-rate scheduler,
  • early stopping,
  • loss function (Huber loss),
  • gradient clipping,
  • normalization strategy,
  • GPU environment.

These additions fully document all training conditions for reproducibility.
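For readers unfamiliar with two of the listed training conditions, the sketch below illustrates the Huber loss and a patience-based early-stopping rule in plain Python. It is a generic illustration under stated assumptions: the threshold `delta`, the `patience` value, and the names `huber_loss` and `EarlyStopping` are hypothetical and are not taken from the manuscript or its Table 4.

```python
def huber_loss(y_true, y_pred, delta=1.0):
    """Mean Huber loss: quadratic for |residual| <= delta, linear beyond it.

    delta is an assumed threshold; the review record does not state the value used.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    total = 0.0
    for obs, sim in zip(y_true, y_pred):
        r = abs(obs - sim)
        if r <= delta:
            total += 0.5 * r * r            # small errors: squared penalty
        else:
            total += delta * (r - 0.5 * delta)  # large errors: linear penalty
    return total / len(y_true)


class EarlyStopping:
    """Signal a stop once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience      # assumed value, for illustration only
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

The linear branch is what makes the Huber loss less sensitive than mean squared error to occasional large residuals, which is one plausible reason to prefer it for daily PET series with outliers.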

Comment: The formulas for NSE and R² are very similar; therefore, including both may be redundant.

Response: We respectfully disagree and justified this in the revision. NSE and R², although mathematically related, evaluate different aspects:

  • R² measures linear correlation.
  • NSE evaluates predictive skill relative to the 1-to-1 line.
  • KGE is also used for hydrological robustness.

We retained both metrics because hydrological modeling standards (e.g., Moriasi 2015) recommend reporting R², NSE, and KGE together for reliability assessment.
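The distinction drawn in this response can be checked numerically. The sketch below assumes R² is computed as the squared Pearson correlation and KGE in its original 2009 form (correlation, variability ratio, bias ratio); the record does not reproduce the manuscript's formulas, so these definitions and the sample values are assumptions for illustration.

```python
import math


def _mean(xs):
    return sum(xs) / len(xs)


def _std(xs):
    m = _mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))


def pearson_r(obs, sim):
    mo, ms = _mean(obs), _mean(sim)
    cov = sum((o - mo) * (s - ms) for o, s in zip(obs, sim))
    var_o = sum((o - mo) ** 2 for o in obs)
    var_s = sum((s - ms) ** 2 for s in sim)
    return cov / math.sqrt(var_o * var_s)


def r_squared(obs, sim):
    """Squared Pearson correlation: insensitive to constant bias and scaling."""
    return pearson_r(obs, sim) ** 2


def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 minus squared error relative to the observed variance."""
    mo = _mean(obs)
    sse = sum((o - s) ** 2 for o, s in zip(obs, sim))
    sst = sum((o - mo) ** 2 for o in obs)
    return 1.0 - sse / sst


def kge(obs, sim):
    """Kling-Gupta efficiency from correlation, variability ratio, and bias ratio."""
    r = pearson_r(obs, sim)
    alpha = _std(sim) / _std(obs)
    beta = _mean(sim) / _mean(obs)
    return 1.0 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)


# Hypothetical daily PET values (mm/day); the simulation has a constant +1.5 bias.
obs = [2.0, 3.0, 4.0, 5.0, 6.0]
sim = [o + 1.5 for o in obs]
print(r_squared(obs, sim))  # 1.0: perfectly correlated
print(nse(obs, sim))        # -0.125: worse than predicting the observed mean
print(kge(obs, sim))        # 0.625: penalized only through the bias ratio
```

A perfectly correlated but biased simulation therefore scores R² = 1 while NSE is negative, which is the practical argument for reporting both.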

Results and Discussion

Comment: The authors should provide a concise summary comparing the performance of all models.

Response: We added:

  • a basin-wide comparative performance summary in the Results section (Table 7);
  • a concise narrative comparison describing hybrid DL superiority over empirical and standalone DL models;
  • boxplots of overall metrics (Figure 5b) that summarize differences across all stations.

A detailed comparative analysis is also provided in the Supplementary File. Together, these results provide a clear, direct, quantitative comparison.

Comment: The models show major problems in predicting extreme values, but this issue is not addressed in the discussion.

Response: This issue is addressed through:

  • the residual analysis in Figure 5a, which shows an improved distribution around the extremes for the hybrid model;
  • added text in the Results discussing that LSTM tends to underperform at extremes, while the CNN-BiLSTM-Attention significantly reduces bias;
  • a discussion of how convolution layers help capture abrupt meteorological changes.

Thus, model performance at extremes is now explicitly analyzed.

Comment: The paper lacks visual comparative analyses.

Response: Multiple visual analyses are provided in the main manuscript and in the Supplementary File, including:

  • Station-level comparisons (Figures 3 and 4),
  • Basin-wide residual distributions (Figure 5a),
  • Boxplots of metrics (Figure 5b),
  • Monthly PET comparisons (Figure 6b),
  • Station 136 example comparisons (Figure 6a).

These fulfill the reviewer's request for visual comparative analysis.

Comment: Uncertainty and reliability analyses should be conducted.

Response: We expanded the reliability assessment by including:

  • NSE, KGE, RSR, and PBIAS for station-level reliability;
  • Residual distribution analysis;
  • Spatial robustness evaluation across 13 stations;
  • Monthly PET variability reproduction to assess temporal reliability.

Although full uncertainty quantification (e.g., Monte Carlo dropout) is beyond the study's scope, the reliability assessment has been significantly strengthened.
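As a reference for two of the station-level metrics named above, the sketch below computes RSR (RMSE divided by the standard deviation of the observations) and PBIAS in plain Python. The sign convention for PBIAS (observed minus simulated, so negative values indicate overestimation) is an assumption; the record does not state which convention the manuscript follows, and the sample values are hypothetical.

```python
import math


def rsr(obs, sim):
    """RMSE-observations standard deviation ratio; 0 is a perfect fit."""
    mean_obs = sum(obs) / len(obs)
    sse = sum((o - s) ** 2 for o, s in zip(obs, sim))
    sst = sum((o - mean_obs) ** 2 for o in obs)
    return math.sqrt(sse) / math.sqrt(sst)


def pbias(obs, sim):
    """Percent bias, assuming the (observed - simulated) convention.

    Under this convention a negative value means the model overestimates on average.
    """
    return 100.0 * sum(o - s for o, s in zip(obs, sim)) / sum(obs)


# Hypothetical daily PET values (mm/day) with a constant +0.5 overestimate.
obs = [2.0, 3.0, 4.0, 5.0, 6.0]
sim = [2.5, 3.5, 4.5, 5.5, 6.5]
print(round(rsr(obs, sim), 4))  # 0.3536
print(pbias(obs, sim))          # -12.5
```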

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

Dear authors,

In the attached file are some observations made in order to improve your work.

Thank you.

Comments for author File: Comments.pdf

Author Response

Response to Reviewer 2 Comments

We sincerely thank the reviewer for the thoughtful and constructive feedback. Your comments significantly improved the clarity, justification, and scientific rigor of the manuscript, and we appreciate the time and effort you invested in strengthening this work.

  1. Introduction section

    1. At line 54, please replace Ba-sin with Basin;

Response: Thank you for pointing this out; "Ba-sin" has been replaced with "Basin".

  1. Lines 58-60: It is mentioned that "In non-linear and multivariate climatic modeling, support vector machines (SVMs), random forests (RFs), and extreme learning machines (ELMs) are the ML algorithms [6, 18, 19]."

Are these the ML algorithms? Please rephrase.

Response: Lines 58-60 have been rephrased as "In modeling complex, non-linear, and multivariate climatic processes, algorithms such as support vector machines (SVMs), random forests (RFs), and extreme learning machines (ELMs) have demonstrated strong predictive performance."

  1. Lines 115-118: It is mentioned: "Therefore, this study has two primary objectives: (1) to compare the performance of empirical and develop hybrid DL models to estimate the daily PET of the NRB, (2) to examine the spatial and temporal variability of PET within the basin based on the most effective "

To compare the performance of the empirical models and to develop… Please rephrase for clarity.

Response: Lines 115-118 have been rephrased as "Therefore, this study has two main objectives: (1) to develop hybrid DL models for estimating daily PET in the NRB and compare their prediction accuracy with empirical methods, and (2) to analyze the spatial and temporal variability of PET across the basin using the most effective modeling approaches" (revised Lines 116-119).

2.       Related Work section

  1. Line 130: Please, split Table 1 into two tables.

Response: Table 1 has been split into two tables, one for empirical/ML models and one for DL/hybrid models.
Table 1: Summary of empirical and ML approaches in previous PET research.

| Study | Study area/dataset | Methods (Empirical / ML) | Inputs | Key findings |
|---|---|---|---|---|
| [37] | Northwest China | ANN vs MLR & empirical formulas | Tmax, Tmin, RH, U2, N | ANN outperformed MLR and empirical methods; Tmax, Tmin, and RH were the most important |
| [38] | Central Florida | M5P, Bagging, RF, SVR | Radiation, heat flux, soil moisture, wind, RH, T | Strong performance; input quality strongly influenced accuracy |
| [40] | India | ANN vs empirical equations | Temperature, RH, radiation, wind | ANNs often outperform empirical equations |
| [41] | India | RBF neural networks | Limited climatic data | RBF is effective under sparse data |
| [44] | Iraq | ELM vs standalone ML | Temperature-based & multivariable inputs | ELM competitive; lightweight ML effective |
| [45] | Sichuan Basin, China | ELM, GRNN, RF + empirical | Temp-only & multivariable | Intelligent temp-only models are competitive; RF robust |
| [47, 48] | China | SVM, ELM, LightGBM, CatBoost | Limited meteorological datasets | Tree boosting is competitive; it emphasizes transferability |
| [49, 50] | Iran, Brazil, global | GRNN, MARS, GEP, ANFIS | Temp-only & multivariate | No single best algorithm; performance data-dependent |
| [42, 43] | Semiarid sites | Sequential RBF + empirical hourly formulas | Hourly meteorology | Highlighted the value of hourly modelling |

Table 2: Summary of DL and hybrid modeling approaches in previous PET studies.

| Study | Study area/dataset | Methods (DL / Hybrid) | Inputs | Key findings |
|---|---|---|---|---|
| [16] | Minas Gerais, Brazil | ANN, RF, XGBoost, 1-D CNN (DL) | Daily/hourly Temperature, RH, Ra | Hourly CNN improved RMSE by ~28%; sequence-aware DL advantageous |
| [39] | Prince Edward Island, Canada | LSTM, bi-LSTM | Tmax, RH | High R² (>0.90); DL effective with few inputs |
| [46] | India | ANN interpretability (DL-related review) | Multiple parameters | Explained the physical interpretability of ANN hydrological models |

3.       Materials and Methods section

  1. Line 190: Please, replace reference (Allen et al., 1998) with the corresponding number from reference list.

Response: Thanks for highlighting this mistake; we have updated the citations to [3] (Allen et al., 1998) (line 193).

  1. Line 202: Add to the reference list the complete reference for (Duffie & Beckman, 2013) and replace it into the paper text, with the corresponding

Response: Thanks for highlighting this mistake; we have updated the citations to [51] (Duffie & Beckman, 2013) (line 205).

  1. Line 307: Add to the reference list the complete reference for (Vaswani et al., 2017) and replace it into the paper text, with the corresponding

Response: Thanks for highlighting this mistake; we have updated the citations to [56] (Vaswani et al., 2017) (line 310).

  1. Line 320: The hyperparameters from Table 3 were tuned manually or through a formal search strategy (grid or cross-validation)?

Response: Optimal hyperparameters selected via grid search are now mentioned in revised Lines 317-318, and the table caption has been revised to "Table 4 summarizes the optimal hyperparameters obtained via grid search for training the hybrid model."
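For context, a grid search simply evaluates every combination of candidate hyperparameter values and keeps the one with the lowest validation loss. The sketch below shows the idea in plain Python. The search space, the function names, and the toy validation loss are all hypothetical stand-ins; the record does not list the grid the authors searched, and the values here are not those of Table 4.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Evaluate every combination in param_grid; return (best_params, best_score).

    `evaluate` takes a dict of hyperparameters and returns a loss (lower is better).
    """
    names = list(param_grid)
    best_params, best_score = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical search space, for illustration only.
grid = {
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "batch_size": [32, 64],
    "seq_len": [7, 14, 30],
}


def toy_validation_loss(p):
    # Stand-in for "train the model with p and return its validation loss".
    return (abs(p["learning_rate"] - 1e-3) * 100
            + abs(p["batch_size"] - 64) / 64
            + abs(p["seq_len"] - 14) / 14)


best_params, best_score = grid_search(grid, toy_validation_loss)
print(best_params)
```

The cost grows multiplicatively with each added hyperparameter (here 3 × 2 × 3 = 18 trainings), which is why the grid is usually kept coarse.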

  1. Please mention the source of the meteorological data set used (from 1972 to 2024). Was it available online (and was it free), or was the data collected over time?

Response: The data source is mentioned in Lines 154-155 "

  1. Please, justify why only two DL models (standalone LSTM and hybrid Convolutional Neural Network Bidirectional LSTM- CNN-BiLSTM) were used? Because, DL includes a multitude of models, such as: Feedforward Neural Networks (FNNs), Recurrent Neural Networks (RNNs), General Regression Neural Networks (GRNNs), Time Delay Neural Networks (TDNNs). Deep Belief Networks (DBNs), etc.

Response: The selection of these two DL models is justified and clarified in Lines 261-268: "Two DL models were developed in this research: a standalone LSTM and a hybrid CNN-Bidirectional LSTM with an attention mechanism. These models were selected for their success in capturing time-dependent features in time-series data and their ability to handle the multivariate meteorological characteristics essential for day-to-day PET forecasting. Although DL offers a wide range of models, the literature indicates that LSTM-based models outperform ML/DL models at capturing long-term dependencies and non-linear interactions in PET data. Thus, we focused on LSTM and hybrid models to balance model complexity, interpretability, and predictive performance."

 

  1. Please, explain why the hybrid DL model was developed, since the model already exists, and CNN–BiLSTM implementations are used in research for a while. Which are the new elements brought to the existing model?

Response: The development of the hybrid DL model is justified in Lines 272-278: "The CNN-BiLSTM architectures are not new, but the peculiarities of this study include the introduction of a self-attention mechanism for PET prediction, the consideration of specific feature engineering (derived meteorological features and cyclical encoding of day-of-year and month), and the customization of hyperparameter optimization for PET prediction. This combination increases interpretability, highlights important temporal patterns, and improves predictive accuracy compared to current implementations."
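The cyclical encoding mentioned in this response is commonly implemented by mapping a periodic quantity onto the unit circle with a sine and cosine pair, so that 31 December and 1 January end up close together. The sketch below is a minimal illustration of that common approach; the period of 365.25 days and the names `cyclical_encode` and `encode_date` are assumptions, not details taken from the manuscript.

```python
import math
from datetime import date


def cyclical_encode(value, period):
    """Map a periodic value onto the unit circle as a (sin, cos) pair."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def encode_date(d):
    """Return cyclical features for day-of-year and month of a datetime.date."""
    day_of_year = d.timetuple().tm_yday
    doy_sin, doy_cos = cyclical_encode(day_of_year, 365.25)  # assumed period
    month_sin, month_cos = cyclical_encode(d.month, 12)
    return {
        "doy_sin": doy_sin,
        "doy_cos": doy_cos,
        "month_sin": month_sin,
        "month_cos": month_cos,
    }


# Consecutive days across the year boundary stay close in the encoded space.
dec31 = encode_date(date(2023, 12, 31))
jan01 = encode_date(date(2024, 1, 1))
gap = math.hypot(dec31["doy_sin"] - jan01["doy_sin"],
                 dec31["doy_cos"] - jan01["doy_cos"])
print(gap < 0.05)  # True
```

Feeding the raw day number instead would place these two days 364 units apart, which misrepresents the seasonal cycle to the model.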

4.       Results section

  1. At the end of this section, based on point g. (mentioned at Materials and Methods section), please present, which are the authors main contributions, regarding the hybrid DL model?

Response: The main contributions of the hybrid DL model have been added at the end of the Results section in Lines 588-595: "The hybrid CNN-BiLSTM model with an attention mechanism developed in this study has three main contributions to daily PET prediction: (i) the combination of convolutional feature extraction and bidirectional memory enhances learning of multivariable and time-dependent patterns that are difficult to predict; (ii) the attention mechanism makes the model much easier to interpret as it adapts the weight of the influential meteorological signals; and (iii) the proposed feature engineering and informed hyperparameter configuration is highly promising to improve predictive accuracy and spatial robustness of diverse climatic stations."
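To make the interpretability claim in point (ii) concrete, the sketch below shows a generic scaled dot-product attention pooling over time steps: each hidden state receives a softmax weight, and those weights can be inspected to see which days influenced the prediction. This is an illustrative stand-in under stated assumptions, not the attention layer of the manuscript; the `query` vector, the sample hidden states, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, query):
    """Weight T hidden vectors by their scaled dot product with `query`.

    Returns (weights, context): one weight per time step (summing to 1)
    and the weighted sum of the hidden vectors.
    """
    dim = len(query)
    scale = math.sqrt(dim)
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) / scale
              for h in hidden_states]
    weights = softmax(scores)
    context = [sum(w * h[j] for w, h in zip(weights, hidden_states))
               for j in range(dim)]
    return weights, context


# Hypothetical hidden states for three time steps (e.g., BiLSTM outputs).
hidden = [[0.1, 0.0], [2.0, 1.0], [0.3, 0.2]]
query = [1.0, 1.0]  # in a trained model this vector would be learned
weights, context = attention_pool(hidden, query)
print(weights.index(max(weights)))  # 1: the second time step dominates
```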

  1. Please improve the quality of Figure 3, Figure 4, Figure 5 and Figure 7 (all figures should have clear axis labels and units).

Response: The figures have been revised at a higher resolution so that the axis labels and units are clearly legible.

  1. Line 546: It is mentioned that: "LSTM models alone also performed significantly better than the empirical methods, but not as much as the hybrid models." Wasn't that expected? Please justify this statement.

Response: Line 546 has been revised and justified per the reviewer's comment (revised Lines 550-553): "While the superior performance of standalone LSTM models over empirical methods is anticipated because of their ability to represent complex temporal dynamics, the hybrid models yielded further gains by incorporating physical knowledge from empirical formulations, thereby improving robustness and reducing systematic error."

  1. Please renumber the figures, because Figure 6 does not exist.

Response: All figure and table captions and numbers have been checked and revised accordingly.

 

5.       Discussion section

  1. Lines 603-605: It is mentioned that: "In this context, this research is an essential contribution to the comparative analysis of empirical PET models with sophisticated hybrid DL models, such as CNN-BiLSTM, for estimating daily PET at high spatial and temporal resolution." Please mention what those contributions actually are.

Response: The contributions are now stated and justified in Lines 609-615: "In this context, the study is substantively valuable as it presents the empirical comparison of PET equations (FAO-56 PM, P-T, H-S) with advanced hybrid DL models, such as the CNN-BiLSTM model, for daily PET prediction at high spatial and temporal resolution in the NRB. In particular, it measures differences in performance between methods, shows that hybrid DL models have higher predictive power, and presents a basin-wide assessment framework that captures geographical and seasonal heterogeneity in PET processes."

  1. Lines 615-617: It is mentioned that: "This gap is addressed by the current study, which demonstrates that standalone LSTM and, especially, hybrid CNN-BiLSTM models outperform empirical methods at all stations, reducing the RMSE by over 5070%." Something is not in order 5070%? Please correct and justify this affirmation, because it is known the fact that, usually CNN-BiLSTM, model surpasses traditional methods (it is not something new).

Response: The statement has been corrected and justified in the revised Lines 625-628: "The given gap is filled by the present study, which shows that standalone LSTM and, particularly, hybrid CNN-BiLSTM models achieve superior performance across all stations, reducing RMSE by 50-70%. The R² and NSE values are also positive, even under highly variable seasonal conditions."
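A figure such as "reducing RMSE by 50-70%" is a relative reduction against a baseline. The sketch below shows the arithmetic, assuming the empirical method serves as the baseline; the series are hypothetical and chosen only to make the calculation easy to follow, and they are not results from the manuscript.

```python
import math


def rmse(obs, sim):
    """Root-mean-square error between observed and simulated series."""
    return math.sqrt(sum((o - s) ** 2 for o, s in zip(obs, sim)) / len(obs))


def percent_reduction(baseline_rmse, model_rmse):
    """Relative RMSE reduction of a model against a baseline, in percent."""
    return 100.0 * (baseline_rmse - model_rmse) / baseline_rmse


# Hypothetical daily PET values (mm/day), for illustration only.
obs = [3.0, 4.0, 5.0, 6.0]
empirical = [4.0, 5.0, 6.0, 7.0]   # constant error of 1.0 mm/day
hybrid_dl = [3.4, 4.4, 5.4, 6.4]   # constant error of 0.4 mm/day

reduction = percent_reduction(rmse(obs, empirical), rmse(obs, hybrid_dl))
print(round(reduction, 1))  # 60.0
```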

  1. Lines 629-631: It is mentioned that: "This research is the first of its kind in the NRB context to make a systematic comparison of empirical and hybrid DL models of PET, offering new methodological clarity to local water managers, hydrologists, and climate adaptation planners." Please justify why it is the first of its kind. Strengthen your statement.

Response: The statement is revised in Lines 639-647.
"To the best of the authors' knowledge, the study is the first to be conducted in the NRB to directly compare popular empirical PET formulations (FAO-56 PM, P-T, H-S) against advanced hybrid deep-learning architectures, i.e., CNN-BiLSTM. Past NRB research has either been based on large-scale PET climatology or on empirical or remote-sensing-only methods, without assessing or comparing predictive model performance in space and time. The combination of empirical physics-based modeling with current state-of-the-art DL approaches, and a rigorous comparison between them on an everyday time scale, makes this study a methodological contribution to water managers and hydrologists in the basin."

6.       Conclusion and recommendations section

  1. Please, include in this section a synthesis of the authors main

Response: The main contributions of this study are included in the revised conclusion as "This study has three main contributions: (i) a comprehensive benchmarking of empirical, standalone DL, and hybrid DL models for PET estimation in the NRB; (ii) evidence that hybrid CNN-BiLSTM attention architectures substantially improve PET prediction accuracy across seasons and elevations; and (iii) a demonstration of the operational potential of hybrid DL models for supporting hydrological modeling and irrigation management in data-limited regions."

7.       References section:

  1. Please replace references number 3, 7, 11, 51 and 52, with newer ones and verify references formatting (for instance reference number 50).

Response: References 3, 7, 11, 51, and 52 are the original sources of the empirical equations for ETo. We respectfully ask the reviewer to allow us to retain them so that these scientists receive proper credit for their contributions to ETo prediction. All other references have been updated.

 

General conclusion: My main concern is why the performance of the mentioned empirical methods was compared with just two DL models, a standalone Long Short-Term Memory (LSTM) network and a hybrid Convolutional Neural Network Bidirectional LSTM with an attention mechanism, when many other DL methods are available. It is also not clear what the authors' main contributions are. Please resolve this problem or give a proper justification (possibly a comparative study of the DL methods used for the same goal, in order to justify the use of only these two DL methods).

Response: We are very grateful to the reviewer for making this interesting and valuable remark. The choice of two DL models, the standalone LSTM and the hybrid CNN-BiLSTM with an attention mechanism, was determined not only by methodological applicability but also by the results of previous PET-related studies. The architecture built on LSTMs has consistently been shown to develop a stronger ability to capture long-term temporal dependencies, non-linear meteorological interactions, and multivariate sequence patterns, which are pertinent attributes of daily PET dynamics.

 

The hybrid CNN-BiLSTM model was chosen for its combination of convolutional feature extraction, bidirectional memory processing, and a self-attention mechanism, all of which have been demonstrated to improve temporal modelling and interpretability. To keep a clear scientific focus and to avoid unnecessary model proliferation and added complexity without empirical value, the study focused on the two most theoretically and empirically adequate architectures for predicting PET.

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

Dear Authors,

Thank you for making all the necessary corrections and congratulations on the work you have done.

Sincerely, 

The Reviewer.

Author Response

We are thankful to Reviewer 2 for accepting this revised version of the article.
