1. Introduction
In recent years, deep learning has become a cornerstone of hydrological modeling, with applications extending from streamflow prediction to water quality assessment and sustainable basin management. Hao et al. [1] demonstrated that causal recurrent neural networks can improve spring discharge forecasts while supporting sustainable groundwater management. Yin et al. [2] introduced a Temporal–Periodic Transformer for monthly streamflow forecasting, emphasizing the role of the Nash–Sutcliffe efficiency (NSE) as a central performance metric in water-scarce basins. Wi et al. [3] applied long short-term memory (LSTM)-based approaches to reconstruct streamflow in ungauged basins of the Great Lakes, directly linking NSE-based evaluation to regional water sustainability. Similarly, Yin et al. [4] proposed a Transformer–XAJ model, combining deep learning with a process-based hydrological model to enhance runoff prediction and flood risk reduction. Beyond runoff forecasting, Lu et al. [5] explored turbidity prediction using multiple deep learning models, highlighting implications for rural water ecosystem management. Tang et al. [6] advanced cross-regional LSTM transfer to ungauged basins in Brazil and the Lancang–Mekong River, using NSE as a benchmark for transboundary sustainability. Xu et al. [7] developed a hybrid convolutional neural network (CNN)–BiLSTM–attention framework with probabilistic forecasting in the Yangtze River Basin, linking improved NSE performance to sustainable flood control and water allocation. Ampas et al. [8] combined a physical rainfall–runoff model with a temporal fusion Transformer in Greece, demonstrating that bias correction and NSE-based evaluation can enhance forecasting in irrigation-driven watersheds. Collectively, these studies show that recent advances in deep learning consistently rely on metrics such as NSE for model evaluation, and increasingly frame their contributions in terms of sustainability.
Reliable knowledge of river flow velocity underpins a wide range of hydrological and engineering applications, from assessing flood hazards and sediment transport to ensuring the sustainable management of reservoirs and related infrastructure [9,10]. Nowhere is this need more acute than in Taiwan, where torrential streams are shaped by steep terrain, intense rainfall, and frequent typhoons. Yet direct field measurements in such environments remain difficult, as unstable flow conditions and hazardous settings often limit the deployment of conventional monitoring systems.
Traditional in situ instruments, including Doppler radar and current meters, can deliver accurate point measurements [11,12]. However, the expense, limited coverage, and maintenance demands of these devices have curtailed their use in many mountainous catchments. To overcome these constraints, researchers have increasingly turned to image-based, non-intrusive approaches. Particle Image Velocimetry (PIV) [13], Large-Scale PIV (LSPIV) [14,15,16,17], and Space–Time Image Velocimetry (STIV) [18,19,20,21] have all been adapted for river applications, while optical flow algorithms that track pixel motion have become a versatile alternative [22,23,24,25]. Despite their promise, these methods continue to face obstacles related to image quality, flow variability, and seeding requirements.
Building on these advances, recent progress in artificial intelligence (AI) and deep learning has opened new possibilities for video-based velocity monitoring. Deep neural networks have been coupled with PIV frameworks [26,27,28,29,30], while state-of-the-art optical flow models are increasingly capable of capturing the complex structures of turbulent surface currents [31,32,33]. In our own work, we have demonstrated that combining deep learning with optical flow provides a powerful means to estimate river velocities under field conditions in Taiwan, offering a pathway toward more resilient and sustainable hydrological observation systems [34].
While these developments highlight the promise of deep learning for river monitoring, a crucial challenge remains largely overlooked: the choice of performance metrics for evaluating model accuracy. In hydrology, root mean square error (RMSE) and mean absolute percentage error (MAPE) are widely used to quantify error magnitudes, while the NSE and Willmott’s index of agreement (d) are frequently applied as variance-based efficiency measures. However, these metrics do not always lead to consistent conclusions. For example, a model may yield lower RMSE but also lower NSE compared with an alternative, reflecting differences in how each statistic accounts for variance in the observed data. Such discrepancies complicate model interpretation and risk misguiding both academic reporting and practical decision-making.
Few studies have systematically examined the implications of relying on different performance metrics in the evaluation of hydrological deep learning models. This gap is particularly important in sustainability-focused research, where reliable flow predictions are vital for disaster preparedness, sediment management, and climate adaptation strategies. Misinterpretation of model performance may translate into flawed risk assessments or inadequate design measures, undermining the goal of resilient water management.
In this study, we compare two deep learning architectures applied to video-based velocity estimation in a torrential creek in Taiwan: a three-dimensional convolutional neural network (3D CNN, hereafter referred to as the first model) and a hybrid convolutional neural network with long short-term memory (CNN+LSTM, hereafter referred to as the second model). Both models were trained on optical flow inputs and validated against Doppler radar measurements. Their performance was evaluated using four widely cited metrics: RMSE, NSE, Willmott’s d, and MAPE. The results reveal a paradox. The first model achieved an NSE below 0.65, a threshold often cited in academic publishing as grounds for rejection, whereas the second model produced an NSE above 0.65 and would likely be considered acceptable. However, the first model also yielded lower errors according to RMSE and MAPE, meaning its predictions were actually closer to the observations. Why does such a contradiction arise? By analyzing this discrepancy, we aim to clarify the interpretation of evaluation metrics in hydrological deep learning and discuss the implications for sustainable flood risk management and river monitoring. This work contributes methodological insights that are essential for balancing academic rigor with practical reliability in sustainability-oriented hydrological research.
2. Materials and Methods
This study establishes an experimental framework to evaluate the performance of deep learning models for surface flow velocity estimation in a torrential creek in Taiwan. The framework combines optical-flow-derived inputs with supervised training against Doppler radar measurements, which provide the reference velocities. In contrast to prior work that emphasized architectural design and input configurations, the present study focuses on comparing four evaluation metrics—RMSE, NSE, Willmott's d, and MAPE—and analyzing their implications for interpreting model quality. An overview of the workflow is given in Figure 1.
Figure 1 summarizes the workflow. CCD videos recorded at the Yufeng No. 2 torrential stream were paired with Doppler radar velocities serving as reference data. From each video, 200 frames were extracted and processed with dense optical flow to generate motion fields. These optical flow inputs were supplied to two alternative model architectures: 3D CNN and CNN+LSTM. The first model used data collected from May to June, whereas the second model extended the dataset to cover May through August. For both models, training and testing followed a 70/30 random split. Performance was evaluated against Doppler radar observations using RMSE, NSE, Willmott's d, and MAPE. The subsequent subsections provide detailed descriptions of the study area, model structures, and evaluation procedures.
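The optical flow step can be illustrated with a short sketch. The exact algorithm and parameters of the original pipeline are documented in [34]; the snippet below is a minimal, assumption-laden example that applies OpenCV's Farnebäck dense optical flow to consecutive grayscale frames, keeping only the 200-frame clip length from the workflow above, with the frame size, parameter values, and function names chosen for illustration.

```python
import cv2
import numpy as np

def video_to_flow_stack(video_path, n_frames=200, size=(224, 224)):
    """Extract up to n_frames from a video and compute dense optical flow
    between consecutive frames (Farneback method, assumed parameters)."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while len(frames) < n_frames:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        frames.append(cv2.resize(gray, size))
    cap.release()

    flows = []
    for prev, curr in zip(frames[:-1], frames[1:]):
        # Each flow field has two channels: horizontal (u) and vertical (v) motion.
        flow = cv2.calcOpticalFlowFarneback(
            prev, curr, None,
            pyr_scale=0.5, levels=3, winsize=15,
            iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
        flows.append(flow)
    # Shape: (n_frames - 1, H, W, 2); these motion fields serve as model inputs.
    return np.stack(flows)
```

The stacked (u, v) motion fields then serve as the input tensors for the two models described in Section 2.2.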
2.1. Study Area and Data
The case study was conducted at the Yufeng No. 2 torrent in northern Taiwan, a steep mountain stream prone to rapid hydrological responses during typhoon rainfall events. Sediment check dams and grade-control structures (bed stabilization works) have been constructed along the stream to mitigate sediment-related hazards and prevent excessive bed material from entering the main channel. The basin area is approximately 1.52 km² and is characterized by shallow soils, rapidly weathering sedimentary formations, and predominantly forested land cover. These features contribute to flash flood hazards, frequent sediment transport, and high spatial variability of surface flows. The catchment context is shown in Figure 2.
Continuous monitoring was implemented at the outlet of the torrent using a CCD (Charge-Coupled Device) camera system paired with a co-located Doppler radar, which provided reference flow velocities at 10 min intervals. Details of the equipment setup and site photographs are available in our previous study [34]. During May and June 2025, a total of 3263 videos were recorded, each lasting 10 min at 4K resolution (3840 × 2160 pixels). A second batch of videos was collected in July and August 2025, comprising approximately 3432 recordings of similar duration and resolution. Together, these datasets capture both daytime and nighttime conditions, reflecting diverse illumination and flow states.
For this study, two model-specific datasets were derived. The first model (3D CNN) was trained and evaluated exclusively on the May–June dataset, while the second model (CNN+LSTM) was trained and evaluated on the combined May–August dataset. Both models relied only on daytime videos for training and testing. Although the data originated from the same monitoring campaign, the definition of daytime differed between the two approaches: Model 1 applied an average brightness threshold to separate daytime and nighttime videos, exploiting the clearly bimodal brightness distribution of all recordings, whereas Model 2 defined daytime as the period from 6:00 a.m. to 6:00 p.m. Consequently, the videos used by the two models for May–June are not identical. The two models were also developed by different co-authors, so their preprocessing steps (e.g., the definition of daytime conditions) and training datasets differ, which contributes to the contrasting evaluation outcomes. In both cases, the videos were paired with Doppler radar velocity measurements, providing robust ground truth for supervised learning. The availability of real-world, high-frequency velocity data makes this case study directly relevant to sustainability-oriented research on flood risk reduction, sediment management, and resilient water resource monitoring.
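The two daytime definitions can be expressed compactly as follows. Only the use of a mean-brightness cutoff (Model 1) and a fixed 6:00 a.m.–6:00 p.m. window (Model 2) are taken from the text; the threshold value, sampling rate, and helper names below are illustrative assumptions.

```python
from datetime import datetime
import cv2
import numpy as np

def mean_brightness(video_path, sample_every=100):
    """Average grayscale intensity over sampled frames (Model 1 criterion)."""
    cap = cv2.VideoCapture(video_path)
    values, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % sample_every == 0:
            values.append(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).mean())
        idx += 1
    cap.release()
    return float(np.mean(values))

def is_daytime_model1(video_path, threshold=60.0):
    # Threshold is illustrative; in practice it is chosen from the
    # bimodal brightness distribution of all recordings.
    return mean_brightness(video_path) >= threshold

def is_daytime_model2(timestamp: datetime):
    # Fixed clock-time window: 06:00-18:00.
    return 6 <= timestamp.hour < 18
```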
2.2. Model Architectures
Two deep learning architectures were applied to estimate flow velocity from video sequences: 3D CNN and CNN+LSTM. Both models were implemented in PyTorch 2.7.1+cu118 and trained to predict a single continuous value of surface velocity (m/s) from 200-frame video clips paired with Doppler radar observations. The 3D CNN captures spatiotemporal features directly through three-dimensional convolutions, whereas the CNN+LSTM first extracts spatial features and then models their temporal evolution. A detailed comparison of the two models is provided in [34].
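For orientation, the two architectures can be sketched in PyTorch as follows. These are simplified stand-ins rather than the exact configurations reported in [34]: layer counts, channel widths, and hidden sizes are assumptions, and only the optical-flow input, the 3D-convolution versus CNN-then-LSTM structure, and the single scalar velocity output follow the description above.

```python
import torch
import torch.nn as nn

class Flow3DCNN(nn.Module):
    """Model 1 (sketch): 3D convolutions over (time, height, width)."""
    def __init__(self, in_channels=2):            # u/v optical flow channels
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(in_channels, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1))
        self.head = nn.Linear(32, 1)               # surface velocity (m/s)

    def forward(self, x):                          # x: (B, 2, T, H, W)
        return self.head(self.features(x).flatten(1))

class FlowCNNLSTM(nn.Module):
    """Model 2 (sketch): per-frame 2D CNN features fed to an LSTM."""
    def __init__(self, in_channels=2, feat_dim=64, hidden=32):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(in_channels, 16, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, feat_dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                          # x: (B, T, 2, H, W)
        b, t = x.shape[:2]
        f = self.cnn(x.flatten(0, 1)).flatten(1).view(b, t, -1)
        out, _ = self.lstm(f)
        return self.head(out[:, -1])               # last time step -> velocity
```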
It is important to emphasize that the subsequent analysis does not depend on the specific details of these architectures. Once predictions are obtained, the comparison with observed velocities is carried out solely using four performance metrics—RMSE, NSE, d, and MAPE. Thus, the interpretation of model performance in this study is based entirely on the relationship between predicted and observed values, independent of how the models internally process the data.
2.3. Performance Metrics
Model performance was evaluated using four widely applied statistical measures: RMSE [35,36], NSE [37], Willmott's d [38], and MAPE [39]. These metrics capture different aspects of predictive performance and must be interpreted with care, as they differ in their treatment of scale and in whether they quantify error magnitude or variance-based efficiency.
In standard form, the four metrics are defined as

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(P_i - O_i\right)^2} \quad (1)$$

$$\mathrm{NSE} = 1 - \frac{\sum_{i=1}^{n}\left(O_i - P_i\right)^2}{\sum_{i=1}^{n}\left(O_i - \bar{O}\right)^2} \quad (2)$$

$$d = 1 - \frac{\sum_{i=1}^{n}\left(O_i - P_i\right)^2}{\sum_{i=1}^{n}\left(\left|P_i - \bar{O}\right| + \left|O_i - \bar{O}\right|\right)^2} \quad (3)$$

$$\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{O_i - P_i}{O_i}\right| \quad (4)$$

where $O_i$ and $P_i$ denote the observed and predicted surface velocities, $\bar{O}$ is the mean of the observations, and $n$ is the number of samples.
Conceptually, RMSE and MAPE serve as error magnitude metrics, with RMSE being scale-dependent (expressed in m/s) and MAPE providing error percentages relative to observed values. In contrast, NSE and d are variance-based efficiency metrics that assess predictive performance relative to the variability of the observed data. NSE ranges from −∞ to 1, where a value of 1 indicates perfect agreement, a value of 0 corresponds to performance equal to the mean of the observations, and negative values indicate that the mean outperforms the model. NSE is sensitive to variance and penalizes deviations from the mean more strongly, while d provides a more robust measure of agreement by bounding values strictly between 0 and 1. Taken together, these four metrics offer complementary perspectives on model performance, underscoring the importance of multi-metric evaluation.
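For reference, the four metrics can be computed directly from paired arrays of observed and predicted velocities; the NumPy sketch below implements Equations (1)–(4), with function names chosen for illustration.

```python
import numpy as np

def rmse(obs, pred):
    return float(np.sqrt(np.mean((pred - obs) ** 2)))

def nse(obs, pred):
    # 1 - (sum of squared errors) / (total variability of obs around their mean)
    return float(1 - np.sum((obs - pred) ** 2) / np.sum((obs - obs.mean()) ** 2))

def willmott_d(obs, pred):
    num = np.sum((obs - pred) ** 2)
    den = np.sum((np.abs(pred - obs.mean()) + np.abs(obs - obs.mean())) ** 2)
    return float(1 - num / den)

def mape(obs, pred):
    # Percentage error; unstable when observed values approach zero
    return float(100 * np.mean(np.abs((obs - pred) / obs)))
```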
3. Results
The performance of the two models was first examined through scatter plots of predicted versus observed velocities. The scatter plots in Figure 3 illustrate these contrasting tendencies. Predictions from the first model, trained on May–June data, are concentrated in the upper right-hand corner and align closely with the 1:1 line. By contrast, the second model, trained on the extended May–August dataset, shows an additional cluster of points in the lower left-hand corner corresponding to the July–August period. It should be noted that the data points representing the May–June period are not exactly the same between the two models, as explained earlier, due to the different methods used to separate daytime and nighttime videos. As will be shown later, this broader distribution substantially alters the four metrics used to evaluate model performance. Visually, the second model exhibits a wider spread around the 1:1 line, indicating inferior performance compared to the first model. However, the NSE suggests the opposite. This provides the first indication that reliance on a single indicator may lead to incomplete or even misleading conclusions. For completeness, panel (c) of Figure 3 further illustrates the case when the two datasets are combined, an issue discussed later in this section.
Table 1 summarizes the performance of the two models across four evaluation metrics. The first model (3D CNN) achieved the lowest RMSE (0.0471 m/s) and the lowest MAPE (7.78%), indicating superior error-magnitude accuracy. In contrast, the second model (CNN+LSTM) produced a slightly higher RMSE (0.0572 m/s) and MAPE (11.56%), reflecting larger error magnitudes. When the two datasets were combined (Models 1 + 2), the RMSE (0.0547 m/s) and MAPE (10.53%) fell between those of the individual models, which is consistent with the expectation that combining datasets would yield intermediate performance in error-magnitude metrics.
When evaluated using variance-based efficiency indices, however, the second model outperformed the first. Specifically, the CNN+LSTM achieved higher values of NSE (0.678) and Willmott's d (0.895) than the 3D CNN (NSE = 0.519). This suggests that although less accurate in terms of error magnitudes, the second model better reproduced the variance structure of the observed velocities and exhibited stronger overall concordance with the ground-truth data. Even more striking, the combined dataset (Models 1 + 2) further improved these variance-based efficiency metrics, reaching NSE = 0.685 and d = 0.900, despite no change in the underlying models. This counterintuitive result contrasts sharply with the expected intermediate values observed for RMSE and MAPE.
This situation presents a dilemma: how should one determine which model is superior? On the one hand, the first model achieves smaller error magnitudes, but on the other hand, the second model attains higher variance-based efficiency scores. A second issue further complicates interpretation: why do variance-based metrics such as NSE and Willmott's d improve simply by combining datasets, even though the model architectures themselves remain unchanged? This counterintuitive behavior, first noted by [40], raises fundamental questions about the reliability of these indices. The NSE is among the most commonly used metrics in hydrological and hydraulic model evaluation, particularly for calibration, model comparison, and verification purposes [40,41,42,43,44,45,46]. Surveys and recent studies even describe NSE as "perhaps the most used metric in hydrology" [40], underscoring its widespread acceptance in the field. More recently, NSE has also been adopted in the evaluation of machine learning and deep learning models for other applications, owing to its convenient, dimensionless form that enables straightforward comparison across sites and contexts.
Nevertheless, this popularity has also contributed to the emergence of widely cited threshold values. For example, one reviewer of our previous work emphasized that "a generally accepted threshold of 0.65" is often used as a benchmark for satisfactory performance in environmental and spatial models. Such cutoff values are consistent with guidelines that classify performance as "satisfactory" or "good" once NSE exceeds prescribed thresholds [43]. Applying this criterion, the first model would be judged as inadequate and therefore rejected, whereas the second model would be considered acceptable. However, such a conclusion is paradoxical: despite its lower NSE, the first model actually produces smaller absolute and relative errors, and therefore should arguably be regarded as the more accurate model. Together with the artificial enhancement of NSE observed when datasets are combined, these findings highlight the risk of placing undue trust in a single metric and motivate a deeper examination of its limitations, a theme we pursue in the discussion section.
4. Discussion
The results reveal that the two deep learning models demonstrate contrasting strengths depending on the evaluation metric applied. While the first model achieved lower error magnitudes, the second model attained higher variance-based efficiency and agreement scores. These discrepancies highlight a broader issue in hydrological model assessment: different metrics emphasize different aspects of performance, and reliance on a single indicator may lead to misleading conclusions. The following subsections first use R² as a cautionary example, then turn to the central focus of this paper—the interpretation paradox between NSE and RMSE—and conclude with implications for sustainability.
4.1. The Pitfalls of R² and the Analogy to NSE
The coefficient of determination (R²) is frequently reported in hydrological and environmental modeling studies, as well as in machine learning and deep learning, largely because of its apparent simplicity as a measure of explained variance. However, R² is not a reliable indicator of predictive accuracy. As shown in [47], R² evaluates the fit of predictions to a regression line, which is not necessarily the 1:1 line that represents perfect agreement between observations and predictions. Consequently, a model can achieve a very high R², even approaching unity, while still producing biased predictions or systematic deviations from the 1:1 line. In such cases, R² reflects correlation rather than true accuracy, and its use as a primary evaluation metric can be misleading. This problem persists because many readers and practitioners who are less familiar with statistical nuances continue to interpret R² as a gold standard for model performance.
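A minimal synthetic example (with illustrative values, not the study data) makes this pitfall concrete: predictions that are perfectly correlated with the observations but systematically biased yield R² ≈ 1 while NSE is strongly negative.

```python
import numpy as np

obs = np.linspace(0.1, 0.6, 20)            # synthetic observed velocities (m/s)
pred = 2.0 * obs + 0.3                     # perfectly correlated but biased predictions

r2 = np.corrcoef(obs, pred)[0, 1] ** 2     # squared correlation: exactly 1.0
nse = 1 - np.sum((obs - pred) ** 2) / np.sum((obs - obs.mean()) ** 2)

print(f"R^2 = {r2:.3f}, NSE = {nse:.3f}")  # R^2 = 1.000, NSE is large and negative
```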
A similar problem arises with the NSE. Although NSE and R² share the same mathematical formula, their ranges differ because of how they are applied. Under ordinary least squares (OLS) regression, the residual sum of squares is always less than or equal to the total sum of squares, ensuring 0 ≤ R² ≤ 1. In contrast, NSE is used more generally to evaluate predictive models, which can yield residual errors larger than the variance of the observations; therefore, its range extends from −∞ to 1. As discussed by Melsen et al. [46], this historical conflation of NSE with R² has led to persistent confusion in the hydrological literature, with some studies treating them as equivalent and others clearly distinguishing between the two. Like R², NSE has gained widespread popularity in hydrology and hydraulic modeling, and it is often treated as a decisive indicator of model adequacy. Yet, as demonstrated in this study, NSE can present a distorted view of performance by rewarding models that reproduce variance even if their error magnitudes are larger. In this sense, NSE has become the "new R²": a metric that is widely used and frequently reported, but whose limitations are not always recognized. Drawing this analogy situates our findings within a broader critique of evaluation practices in hydrology, highlighting the need for careful selection and interpretation of performance metrics.
4.2. The RMSE–NSE Paradox
Table 2 illustrates the paradoxical relationship between RMSE and NSE. Mathematically, the RMSE expression is similar to the numerator of the second term in the NSE formula. This similarity may give the false intuition that a smaller RMSE should always correspond to a larger NSE. However, the results here demonstrate the opposite: the first model yielded the lowest RMSE (0.0471) and, thus, more accurate predictions in terms of error magnitude, yet it produced a lower NSE (0.519). By contrast, the second model had a higher RMSE (0.0572), indicating larger errors, but attained a much higher NSE (0.678).
NSE has long been valued in hydrology as a convenient tool for assessing how well models reproduce temporal trends and variability. From this perspective, higher NSE values are often interpreted as evidence of better model performance. However, the present study shows that such reliance can be misleading: the second model achieved a higher NSE despite exhibiting larger error magnitudes than the first model. In this case, NSE rewarded variance reproduction while misrepresenting true predictive accuracy.
This contradiction arises because NSE depends not only on the numerator (prediction errors) but also on the denominator, which represents the total variability of the observed data around its mean and, therefore, reflects both the variance of the observations and the sample size. As shown in Table 2, the denominator for the second model (8.65) was substantially larger than that of the first model (1.48), inflating the NSE value despite greater errors. Importantly, the second model achieved this higher NSE not by changing the architecture, but simply by incorporating additional July–August data. This broader dataset increased the total variability of the observed data around its mean and, in turn, raised the NSE. When the two datasets were combined, the corresponding denominator (11.09) became larger than both. This produced an NSE of 0.685 that exceeded the values of either individual model, even though the RMSE (0.0547) and MAPE (10.53%) merely fell between them. The comparison underscores how NSE can shift dramatically with changes in data variability, even when model performance in terms of error magnitude remains inferior. This sensitivity to the choice of reference average has long been recognized, with early work showing that alternative definitions of the mean discharge can yield substantially different efficiency values and sometimes mask poor performance [48].
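This dependence can be stated compactly: because the sum of squared errors equals $n\,\mathrm{RMSE}^2$, Equation (2) may be rewritten as

$$\mathrm{NSE} = 1 - \frac{n\,\mathrm{RMSE}^2}{\sum_{i=1}^{n}\left(O_i - \bar{O}\right)^2},$$

so that, for a comparable RMSE, any increase in the spread of the observations (here, the addition of the low-velocity July–August data) enlarges the denominator and raises NSE without any improvement in pointwise accuracy.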
An additional perspective is offered by the well-known classification of MAPE thresholds proposed by Lewis [39]. According to this interpretation, MAPE values below 10% indicate "highly accurate forecasting," values between 10% and 20% represent "good forecasting," values between 20% and 50% correspond to "reasonable forecasting," and values above 50% are considered "inaccurate forecasting." Within this framework, the first model (MAPE = 7.78%) qualifies as highly accurate, whereas the second model (MAPE = 11.56%) is merely good. Despite this, the second model achieved a substantially higher NSE, which would often be cited in the literature as evidence of superior performance. This discrepancy underscores the paradox emphasized in this study: while MAPE highlights the practical accuracy of predictions in intuitive terms, NSE can elevate models that reproduce variance at the expense of error magnitude. The contrast between these two interpretations reinforces the need to evaluate models using multiple complementary metrics rather than relying solely on NSE.
4.3. Implications of the RMSE–NSE Paradox
Building on the analogy to R², a central finding of this study is the divergence between RMSE and NSE when assessing deep learning models of flow velocity. Although both metrics are derived from squared residuals, they are normalized differently. RMSE provides an error-magnitude measure expressed in physical units (m/s), whereas NSE is a variance-based index that compares squared prediction errors to the total variability of the observed data around its mean. Mathematically, if the observed velocities exhibit high variability, the denominator in the NSE formulation (Equation (2)) becomes large. This can yield relatively high NSE values even when prediction errors remain non-negligible. Conversely, when variability is small, the same magnitude of error has a stronger impact on NSE, potentially driving it to negative values even when error magnitudes are acceptable.
This paradox implies that two models can produce nearly identical prediction errors, but their NSE scores may differ substantially depending on the variability of the observed dataset. In the present study, the second model achieved higher NSE despite producing larger error magnitudes, reflecting its better reproduction of variability rather than superior pointwise accuracy. Such behavior demonstrates that RMSE and NSE, though mathematically related, emphasize different aspects of model performance. As shown by [49], NSE can be further decomposed into contributions from correlation, bias, and relative variability, which clarifies why the metric may appear favorable even when error magnitudes are large. Researchers and practitioners must, therefore, interpret them together rather than treating one as a universal indicator of model quality.
4.4. The Dataset Combination Effect and Divide-And-Measure Nonconformity
Building on the RMSE–NSE paradox discussed above, an even more striking issue arises when datasets are combined. As shown in Table 1, the RMSE (0.0547) and MAPE (10.53%) of the combined dataset fall between those of the individual models, which is consistent with expectations for error-magnitude metrics. Yet, counterintuitively, both NSE (0.685) and Willmott's d (0.900) exceed the values obtained by either model alone. This outcome reflects a more general property of the NSE recently identified by [40], who demonstrated that the combined NSE is always greater than or equal to the worst of the individual NSEs and, in many cases, greater than both.
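A small synthetic example (values chosen for illustration, not the study data) reproduces this behavior: two subsets with identical pointwise errors each attain a modest NSE, yet pooling them inflates the NSE because the pooled observations span two clusters.

```python
import numpy as np

def nse(obs, pred):
    return 1 - np.sum((obs - pred) ** 2) / np.sum((obs - obs.mean()) ** 2)

# Subset A: higher velocities (e.g., early-season flows), small errors
obs_a = np.array([0.50, 0.55, 0.60, 0.52, 0.58])
pred_a = obs_a + np.array([0.03, -0.03, 0.03, -0.03, 0.03])

# Subset B: lower velocities (e.g., late-season flows), identical errors
obs_b = obs_a - 0.30
pred_b = obs_b + np.array([0.03, -0.03, 0.03, -0.03, 0.03])

print(f"NSE A:        {nse(obs_a, pred_a):.2f}")   # ~0.34
print(f"NSE B:        {nse(obs_b, pred_b):.2f}")   # ~0.34
print(f"NSE combined: "
      f"{nse(np.concatenate([obs_a, obs_b]), np.concatenate([pred_a, pred_b])):.2f}")  # ~0.96
```

The per-point errors are identical in all three evaluations; only the variance of the pooled observations changes.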
This divide-and-measure nonconformity (DAMN) highlights a statistical paradox: the act of aggregating evaluation datasets can artificially inflate variance-based indices, even though no improvement has been made to the underlying model architectures or learning process. In our study, the combined NSE improved solely because the aggregation altered the variance structure of the denominator in the NSE formula, not because of enhanced predictive accuracy.
The implications of this behavior extend beyond our case study. In principle, one could train separate models for different seasons or months of the year and then combine their outputs to report a deceptively higher overall NSE. More generally, other types of data partitioning or selective aggregation strategies could be devised to exploit the same weakness, artificially boosting variance-based indices without improving model skill. Such practices risk presenting inflated evaluation scores that do not reflect genuine advances in modeling. For this reason, DAMN-susceptible metrics such as NSE should be interpreted with caution, particularly when model performance is reported across aggregated datasets.
4.5. Academic Perspective: Strengths and Pitfalls of NSE
NSE has become a standard performance indicator in hydrological research because it offers a dimensionless scale, with 1 representing perfect prediction, 0 equivalent to the mean-observed benchmark, and negative values indicating performance worse than the mean. This straightforward interpretation makes it attractive for cross-comparison across sites and methods. Moreover, because it is a variance-based index that emphasizes variability reproduction, NSE can reveal whether models capture fluctuations in flow rather than merely matching mean values. In academic contexts, this is valuable for studies seeking to evaluate the realism of hydrological simulations.
Nevertheless, the sensitivity of NSE to the variability of the observed dataset can also exaggerate model skill under certain conditions. For catchments with high variability in flow velocities, models may achieve seemingly strong NSE values even if error magnitudes remain large. This can lead to overly optimistic conclusions about predictive skill. On the other hand, in basins with stable flows, even small discrepancies may sharply reduce NSE, giving the impression of poor model performance. For this reason, academic reporting should treat NSE as a complementary measure, interpreted in conjunction with error-magnitude metrics such as RMSE.
4.6. Practical Perspective: Intuitive Accuracy from RMSE and MAPE
For practitioners tasked with operational decisions, RMSE and MAPE provide error-magnitude metrics that are both interpretable and actionable. RMSE quantifies prediction error in the same units as the observed variable, directly informing thresholds used in river engineering, sediment management, and flood preparedness. For example, an RMSE of 0.05 m/s can be evaluated against acceptable tolerances for hydraulic design or early warning thresholds. In this sense, RMSE provides a direct link between statistical evaluation and real-world decision-making.
MAPE further enhances interpretability by expressing errors as percentages relative to observed values. This makes it accessible to decision makers and stakeholders without technical backgrounds, who may find percentage-based indicators easier to relate to performance expectations. MAPE also facilitates comparisons across sites or studies with different velocity magnitudes. However, it should be interpreted with caution when observed velocities are very small, as division by small values can inflate percentage errors. Together, RMSE and MAPE provide intuitive measures of predictive reliability that are better suited to guiding practical interventions than variance-based indices alone.
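For instance, an absolute error of 0.02 m/s corresponds to a 4% error at an observed velocity of 0.5 m/s but a 40% error at 0.05 m/s, even though the physical discrepancy is identical; this illustrates why MAPE should be read alongside the observed velocity range.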
4.7. Implications for Sustainability and Risk Reduction
The choice of evaluation metric has direct consequences for sustainability-oriented water management. Although NSE is sometimes cited as useful for assessing whether models reproduce flow variability, relying on it in isolation risks overstating predictive skill. Our findings confirm the necessity of balancing NSE with error-magnitude metrics. Overreliance on NSE alone may foster overconfidence in model skill, particularly in basins with high flow variability where large errors can be masked by variance scaling. This could result in under-preparedness for extreme flood events, inappropriate design of hydraulic infrastructure, or misguided sediment transport modeling. Such misinterpretations can undermine disaster risk reduction efforts and expose communities to greater vulnerability. A balanced view that integrates error-magnitude metrics such as RMSE and MAPE is, therefore, critical for ensuring robust flood prediction and management strategies. Beyond reporting NSE, robust model evaluation requires comparison against benchmark models to ensure meaningful performance assessment and cross-study comparability. This perspective, highlighted by Schaefli and Gupta [50], underscores the necessity of interpreting NSE as an improvement over a baseline rather than in isolation.
From a broader sustainability perspective, transparent reporting of multiple performance indicators supports accountable and evidence-based decision-making. Water resource planning increasingly requires interdisciplinary collaboration, where stakeholders from engineering, environmental management, and policy must rely on understandable and credible model assessments. Presenting both error-magnitude metrics (e.g., RMSE and MAPE) and variance-based indices (e.g., NSE and d) avoids misleading conclusions and enhances trust in predictive tools. Ultimately, sustainability depends not only on developing powerful models but also on evaluating them with metrics that are meaningful, transparent, and aligned with the goals of disaster resilience, river engineering, and long-term water resource management.
5. Conclusions
This study compared two deep learning models for flow velocity estimation in a torrential creek in Taiwan and revealed a paradox in performance evaluation. The first model (3D CNN) produced lower error magnitudes, as reflected by RMSE and MAPE, while the second model (CNN+LSTM) achieved higher variance-based efficiency and agreement scores, as indicated by NSE and Willmott’s d. These contrasting outcomes demonstrate that metric choice can fundamentally alter the interpretation of model quality, with NSE and RMSE often telling different stories.
The findings emphasize the need for careful selection and interpretation of evaluation metrics in hydrological applications. NSE, although widely reported, should be interpreted with caution, particularly when observed variance and dataset definitions differ, as it may overstate model performance. RMSE and MAPE, by contrast, provide intuitive measures of error magnitude that are directly actionable for engineering and risk management purposes. For example, the first model attained RMSE = 0.0471 m/s and MAPE = 7.78%, results that directly convey predictive accuracy relevant to disaster risk reduction. For sustainability-related water resource management, a balanced evaluation framework should combine both error-magnitude metrics (RMSE, MAPE) and variance-based indices (NSE, d) to ensure transparent and meaningful assessment. The contributions clarify the RMSE–NSE paradox, highlight the practical value of multi-metric evaluation, and underline the risks of overreliance on a single index.
This work has some limitations, including the use of a single case study, differences in dataset coverage, and the exclusion of alternative indices such as the Kling–Gupta efficiency. Future research should extend this analysis to other rivers and hydrological settings to test the generality of the observed paradox. Replication across diverse hydrological contexts and the integration of additional evaluation frameworks will help establish more robust best practices. Ultimately, sustainable water management requires not only advanced predictive models but also the consistent use of evaluation metrics that align with the dual goals of scientific rigor and practical decision-making.